Data Compression and Huffman Tree
In One Sentence
The Huffman tree is a classic application of binary trees: it is an optimal prefix-code tree and is often used for data compression.
This article will introduce the idea behind Huffman coding and compare it with some common data compression methods.
The actual code will be shown in the data structure design section. You will learn how to use Huffman coding to build a compression program.
A Brief Talk About Data Compression
We can divide data compression algorithms into two types: lossless compression and lossy compression.
Lossless compression means the compressed data can be fully restored, with no information lost.
For example, when we pack some files into a zip file, the zip file takes less disk space, and when we unzip it, we can get back the original files without any loss. This is lossless compression.
Lossy compression means some information is lost during compression, but the compression ratio is higher (the compressed data is smaller).
For example, when we compress images, some tools can greatly reduce image file size without obvious loss of quality. This is lossy compression.
Now, let's think about some questions:
- How can lossy compression keep the image quality while losing information?
- Lossy compression loses information to save space, which is easy to understand. But how does lossless compression reduce data size while not losing any information?
First, lossy compression does reduce image quality, but the loss is kept within an acceptable range.
Take image compression for example. The human eye is more sensitive to "brightness" than "color". So, we can use lower-precision data to store "color". Even if we lose some "color" information, we can hardly tell the difference.
But lossless compression cannot do this, because it must allow the data to be fully restored. So, the essence of lossless compression is encoding and decoding.
For example, the string hahahahahahaha can be encoded as ha*7. When decoding, we just repeat ha 7 times to get the original string.
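To make this concrete, here is a minimal sketch (in Java, with made-up names like RunLengthDemo) of that run-length idea. It only illustrates "find the redundancy and encode it", and is not the Huffman algorithm discussed later:

public class RunLengthDemo {
    // Count how many times the unit repeats and encode it as "unit*count".
    // This sketch assumes the whole string is just repeats of the unit.
    public static String encode(String s, String unit) {
        int count = 0;
        for (int i = 0; i + unit.length() <= s.length(); i += unit.length()) {
            if (s.startsWith(unit, i)) count++;
        }
        return unit + "*" + count;
    }

    // Decoding simply repeats the unit the recorded number of times.
    public static String decode(String unit, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) sb.append(unit);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("hahahahahahaha", "ha")); // ha*7
        System.out.println(decode("ha", 7));                // restores the original string
    }
}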
The effect of lossless compression depends on how well the algorithm can find and use redundant information in the original data.
General-purpose compression algorithms can find less redundancy, so the compression rate is lower. Special-purpose algorithms can find more redundancy and achieve better compression.
For example, audio files have sound wave data. A special audio compression can find redundancy in sound waves and compress it well. But a general zip compressor just sees audio files as byte streams and cannot find sound wave redundancy, so the compression is not as good.
So, there is no perfect compression algorithm. We must make trade-offs among generality, compression rate, and performance.
Huffman coding, which we are going to discuss, is a general-purpose lossless compression algorithm. When you give your data to the Huffman algorithm, you will get compressed data and a code table. You need the code table to decode and restore the original data.
Fixed-Length Encoding vs Variable-Length Encoding
Now, let's talk about fixed-length encoding and variable-length encoding.
ASCII is a fixed-length encoding. It uses 8 bits (1 byte) for each character.
UTF-8 is a variable-length encoding. It uses 1 to 4 bytes for each character.
The biggest advantage of fixed-length encoding is random access. Because each character uses the same number of bits, you can easily find a character's position by its index.
The advantage of variable-length encoding is high storage efficiency. For example, UTF-8 uses 1 byte for English letters and 3 bytes for Chinese characters. It is more flexible and saves space compared to ASCII. But because the length varies, you can't use the index for random access.
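As a quick illustration (a hypothetical Java snippet, not part of the article's later code), you can check the byte lengths yourself:

import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    public static void main(String[] args) {
        // An English letter takes 1 byte in UTF-8.
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);  // 1
        // A common Chinese character takes 3 bytes in UTF-8.
        System.out.println("汉".getBytes(StandardCharsets.UTF_8).length); // 3
        // With variable lengths, you cannot jump to the i-th character by
        // multiplying i by a fixed width; you have to scan from the start.
    }
}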
Think about this: modern text editors use UTF-8, and random access is a basic feature. If every access requires scanning from the start, it would be very slow. How do editors solve this?
Back to the compression example. Suppose we have the string aaabacc, which has 7 lowercase letters. Using ASCII, we need 7 × 8 = 56 bits. If we want to compress it further, what should we do?
Since we only have a, b, and c, we do not need 8 bits for each character. 2 bits are enough.
For example, we can encode as follows:
- a as 00
- b as 01
- c as 10
So aaabacc becomes 00000001001010 in binary, which is 14 bits.
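As a sanity check, here is a small Java sketch (the class FixedLengthDemo is made up for this illustration) that applies the 2-bit table to aaabacc:

import java.util.Map;

public class FixedLengthDemo {
    public static void main(String[] args) {
        // The 2-bit fixed-length code table from above.
        Map<Character, String> table = Map.of('a', "00", 'b', "01", 'c', "10");
        StringBuilder bits = new StringBuilder();
        for (char ch : "aaabacc".toCharArray()) {
            bits.append(table.get(ch));
        }
        // Prints: 00000001001014 characters of 0/1, i.e. 00000001001010 (14 bits)
        System.out.println(bits + " (" + bits.length() + " bits)");
    }
}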
This is called fixed-length encoding because each character uses 2 bits.
Fixed-length encoding is simple. As long as you know all possible characters, you can assign codes. But the compression is not great, because it does not use the frequency of each character.
For example, in aaabacc, a appears most often, while b and c are rare. Can we use shorter codes for a, and longer codes for b and c?
Yes, this is called variable-length encoding. For example:
- a as 0
- b as 10
- c as 11
So aaabacc becomes 0001001111 in binary, which is only 10 bits. This is better than fixed-length encoding.
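The same kind of sketch with the variable-length table (again, the class name and table literal are just for illustration) confirms the saving:

import java.util.Map;

public class VariableLengthDemo {
    public static void main(String[] args) {
        // The frequent character a gets the shortest code.
        Map<Character, String> table = Map.of('a', "0", 'b', "10", 'c', "11");
        StringBuilder bits = new StringBuilder();
        for (char ch : "aaabacc".toCharArray()) {
            bits.append(table.get(ch));
        }
        // Prints: 0001001111 (10 bits), down from 14 bits with the fixed-length table.
        System.out.println(bits + " (" + bits.length() + " bits)");
    }
}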
The Difficulties of Variable-Length Encoding
The Main Difficulties
- How to design the encoding so that decoding is always unique (no ambiguity)?
- How to keep the compression rate high (make the encoded data as short as possible)?
- How to make decoding efficient?
Let's look closely at the example aaabacc above.
This encoding scheme has no ambiguity:
- a is encoded as 0
- b is encoded as 10
- c is encoded as 11
But if you encode a as 1, then there is ambiguity between the codes for a and c:
- a is encoded as 1
- b is encoded as 10
- c is encoded as 11
The string aaabacc would be encoded as the binary 1111011111, but now the codes for a and c are confusing: 11 could be decoded as c or as aa. So you cannot decode the original data correctly.
By comparing these two examples, you can see a rule: No code should be a prefix of another code.
For example, in the second scheme, a is 1, but both b and c start with 1, so the code has ambiguity.
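This rule describes a prefix code (also called a prefix-free code), and it is easy to check mechanically. Below is a minimal Java sketch of such a check; the method name isPrefixFree is made up here:

import java.util.List;

public class PrefixCheckDemo {
    // Returns true if no code in the list is a prefix of another code.
    public static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefixFree(List.of("0", "10", "11"))); // true: decoding is unambiguous
        System.out.println(isPrefixFree(List.of("1", "10", "11"))); // false: "1" is a prefix of "10" and "11"
    }
}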
Some readers might say, what about this encoding scheme:
- a is encoded as 1
- b is encoded as 10
- c is encoded as 100
Although the codes have the same prefix, you could add extra logic when decoding:
When you read a 1, you look ahead two more bits to see if you can match 10 or 100. Then you decide how to decode.
This can make decoding unique, but the compression rate is low, and decoding is slow. The look-ahead logic acts like a nested for loop:
// The outer loop scans the N bits of encoded data.
for (int i = 0; i < N; i++) {
    // The inner loop looks ahead up to K bits to decide which code matches.
    for (int j = 1; j <= K; j++) {
        ...
    }
}
Suppose the longest code length is K and the total length of the encoded data is N; then the decoding time complexity is O(NK).
If you make sure that no code is a prefix of another, you don't need to look ahead. Then, decoding time drops to O(N).
In real encoding and decoding, N is usually large. Even if K is small, decoding several times slower is still a big problem. So we want our algorithm's time complexity to be O(N), and the compression rate to be as high as possible.
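To see why prefix-free codes allow this, here is a sketch of a single-pass decoder. A real Huffman decoder would walk the Huffman tree bit by bit; the reverse-lookup map used here is just a simple stand-in for illustration. Because no code is a prefix of another, the first match is always correct, so each bit is consumed once and no look-ahead is needed:

import java.util.HashMap;
import java.util.Map;

public class PrefixDecodeDemo {
    // Decode a bit string with a prefix-free code table in one left-to-right pass.
    public static String decode(String bits, Map<String, Character> reverseTable) {
        StringBuilder result = new StringBuilder();
        StringBuilder buffer = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            buffer.append(bit);
            Character ch = reverseTable.get(buffer.toString());
            if (ch != null) {
                // Since the code is prefix-free, this match cannot be the start
                // of some longer code, so we can emit it immediately.
                result.append(ch);
                buffer.setLength(0);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, Character> reverseTable = new HashMap<>();
        reverseTable.put("0", 'a');
        reverseTable.put("10", 'b');
        reverseTable.put("11", 'c');
        // Prints: aaabacc
        System.out.println(decode("0001001111", reverseTable));
    }
}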