Data Compression and Huffman Tree
In One Sentence
The Huffman tree is a classic application of the binary tree: it is an optimal prefix code tree, often used in data compression.
This article will explain the idea behind Huffman coding and compare it with other common data compression methods.
The detailed code will be shown in the data structure design section. You will learn how to use Huffman coding to build a compression program.
A Brief Introduction to Data Compression
We can divide data compression algorithms into two main types: lossless compression and lossy compression.
Lossless compression means the data can be fully restored after compression, with no loss of information.
For example, when we put several files into a zip file, the zip file takes up less disk space, and after unzipping, all files are restored exactly. This is lossless compression.
Lossy compression means some information is lost after compression, but the compression ratio is higher (the data size becomes much smaller).
For example, when you want to reduce the size of images, some image tools can make the file much smaller without making the image look much worse. This is lossy compression.
Let's think about a few questions.
How does lossy compression keep images looking good, even though it loses some information?
It's easy to understand that lossy compression trades information for smaller size. But how does lossless compression reduce size without losing any information?
First, lossy compression does make images worse, but only to a degree that we hardly notice.
Take image compression as an example. The human eye is sensitive to "brightness" but not very sensitive to "color". So, we can use low-precision data for "color". Even if some "color" data is lost, we hardly notice.
But lossless compression cannot do this. It must make sure the original data can be restored perfectly. So the core of lossless compression is encoding and decoding.
For example, the string hahahahahahaha can be encoded as ha*7. When decoding, we just repeat ha 7 times to get the original string.
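To make this encode/decode idea concrete, here is a toy sketch (my own illustration, not a real compression algorithm) that only handles strings made of the pattern ha repeated several times:

class RunLengthToy {
    // encode a string consisting of the pattern "ha" repeated n times as "ha*n"
    static String encode(String s) {
        return "ha*" + (s.length() / 2);
    }

    // decode "ha*n" back to the original string by repeating "ha" n times
    static String decode(String code) {
        int n = Integer.parseInt(code.substring(3)); // the count after "ha*"
        return "ha".repeat(n);
    }

    public static void main(String[] args) {
        String original = "hahahahahahaha";
        String compressed = encode(original);                    // "ha*7"
        System.out.println(compressed);
        System.out.println(decode(compressed).equals(original)); // true
    }
}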
The effect of lossless compression depends on how well the algorithm finds redundant information in the original data.
The more general the compression algorithm is, the less redundancy it can find, and the lower the compression ratio. The more specialized the algorithm is, the more redundancy it can use, and the higher the compression ratio.
For example, an audio file contains sound wave data. A specialized audio compression algorithm can find and remove redundancy in the sound wave, so it compresses well. But a general zip algorithm treats audio files as just bytes, so it can't find this redundancy, and the compression is not so good.
So, there is no "perfect" compression algorithm. You have to balance generality, compression ratio, and speed.
Huffman coding, which we explain in this article, is a general lossless compression method. When you use the Huffman algorithm, you get compressed data and a code table. You need the code table to decode and restore the original data.
Fixed-Length vs Variable-Length Encoding
Now let's talk about fixed-length encoding and variable-length encoding.
ASCII is a fixed-length encoding. Each character is 8 bits, that is, 1 byte.
UTF-8 is a variable-length encoding. Each character takes 1 to 4 bytes.
The biggest advantage of fixed-length encoding is random access. Because all characters are the same length, you can easily find the position of any character by index.
The advantage of variable-length encoding is better storage efficiency. For example, UTF-8 uses 1 byte for English letters and 3 bytes for most Chinese characters; it is more flexible than a fixed-length encoding and does not waste space on short characters. But since the length of each character is not fixed, you cannot do random access by index.
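You can verify these byte lengths directly; a small sketch on the JVM (the class name is just for illustration):

import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    public static void main(String[] args) {
        // an English letter takes 1 byte in UTF-8
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);      // 1
        // a common Chinese character takes 3 bytes in UTF-8 ("\u4e2d")
        System.out.println("\u4e2d".getBytes(StandardCharsets.UTF_8).length); // 3
    }
}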
Think about this: Most modern editors use UTF-8, but random access is a basic editing feature. If every access required scanning from the start, it would be slow. How do editors solve this problem?
Back to data compression. For example, the string aaabacc has 7 lowercase letters. If we use ASCII, we need 7 x 8 = 56 bits. If we want to make it smaller, how can we do that?
Because the string contains only a, b, and c, we don't need 8 bits for each character; 2 bits are enough. For example:
a is encoded as 00
b is encoded as 01
c is encoded as 10
So aaabacc becomes 00000001001010 in binary, which is 14 bits long.
This method is called fixed-length encoding because every character uses 2 bits.
Fixed-length encoding is simple. As long as you know all possible characters, you can assign codes. But the compression effect is not always good, because it does not use the frequency of each character.
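To make this concrete, here is a minimal sketch of the fixed-length scheme above (the code table and class name are my own, just for illustration):

import java.util.Map;

class FixedLengthDemo {
    public static void main(String[] args) {
        // fixed-length code table: every character maps to exactly 2 bits
        Map<Character, String> table = Map.of('a', "00", 'b', "01", 'c', "10");

        StringBuilder bits = new StringBuilder();
        for (char c : "aaabacc".toCharArray()) {
            bits.append(table.get(c));
        }
        // 7 characters x 2 bits = 14 bits: 00000001001010
        System.out.println(bits + " (" + bits.length() + " bits)");
    }
}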
Take the same example, aaabacc. Since a appears the most while b and c appear less often, can we use a shorter code for a and longer codes for b and c?
Yes, we can. This is called variable-length encoding. For example:
a is encoded as 0
b is encoded as 10
c is encoded as 11
So aaabacc becomes 0001001111 in binary, which is 10 bits long, better than the 14 bits of fixed-length encoding.
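The encoding step is the same table lookup as in the fixed-length case; only the table changes. A small sketch under that assumption (the encode helper is my own, not the article's final implementation):

import java.util.Map;

class VariableLengthDemo {
    // concatenate each character's code according to the given code table
    static String encode(String s, Map<Character, String> table) {
        StringBuilder bits = new StringBuilder();
        for (char c : s.toCharArray()) {
            bits.append(table.get(c));
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> fixed = Map.of('a', "00", 'b', "01", 'c', "10");
        Map<Character, String> variable = Map.of('a', "0", 'b', "10", 'c', "11");
        System.out.println(encode("aaabacc", fixed).length());    // 14
        System.out.println(encode("aaabacc", variable));          // 0001001111
        System.out.println(encode("aaabacc", variable).length()); // 10
    }
}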
The Difficulties of Variable-Length Encoding
Three Main Difficulties
How to design the encoding so that decoding is always unique?
How to keep the compression rate high (make the encoded data as short as possible)?
How to make decoding efficient?
Let's look closely at the example aaabacc above.
This encoding scheme has no ambiguity:
a is encoded as 0
b is encoded as 10
c is encoded as 11
But if you encode a as 1, then there is ambiguity between the codes for a and c:
a is encoded as 1
b is encoded as 10
c is encoded as 11
The string aaabacc would then be encoded as the binary 1111011111, but now the codes for a and c are confusing: 11 could be decoded as c or as aa, so you cannot decode the original data correctly.
By comparing these two examples, you can see a rule: No code should be a prefix of another code.
For example, in the second scheme, a is encoded as 1, but the codes for b and c both start with 1, so the encoding is ambiguous.
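This rule is easy to check mechanically: compare every pair of codes and reject the scheme if one code is a prefix of another. A small sketch (the helper name isPrefixFree is my own):

import java.util.List;

class PrefixCheck {
    // return true if no code in the list is a prefix of another code
    static boolean isPrefixFree(List<String> codes) {
        for (int i = 0; i < codes.size(); i++) {
            for (int j = 0; j < codes.size(); j++) {
                if (i != j && codes.get(j).startsWith(codes.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefixFree(List.of("0", "10", "11"))); // true
        System.out.println(isPrefixFree(List.of("1", "10", "11"))); // false: "1" is a prefix of "10" and "11"
    }
}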
Some readers might say, what about this encoding scheme:
a is encoded as 1
b is encoded as 10
c is encoded as 100
Although the codes share the same prefix, you could add extra logic when decoding: when you read a 1, look ahead two more bits to see whether you can match 10 or 100, and then decide how to decode.
This can make decoding unique, but the compression rate is low, and decoding is slow. The look-ahead logic acts like a nested for loop:
// N is the total length of the encoded data, K is the length of the longest code
for (int i = 0; i < N; i++) {
    // look ahead up to K bits to decide which code starts at position i
    for (int j = 1; j <= K; j++) {
        ...
    }
}
Suppose the longest code length is K and the total length of the encoded data is N. Then the decoding time complexity is O(KN).
If you make sure that no code is a prefix of another, you don't need to look ahead, and the decoding time drops to O(N).
In real encoding and decoding, N is usually large. Even if K is small, decoding several times slower is still a big problem. So we want our algorithm's time complexity to be O(N), and the compression rate to be as high as possible.
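To see why the prefix rule removes the need for look-ahead, here is a sketch of a one-pass decoder for the scheme a = 0, b = 10, c = 11 (the reverse table and helper are my own illustration; a real Huffman decoder walks the prefix code tree bit by bit instead of hashing a growing buffer, which makes each step constant time):

import java.util.Map;

class PrefixDecodeDemo {
    // decode a bit string in one left-to-right pass:
    // because no code is a prefix of another, the first match is always correct,
    // so the decoder never looks ahead and never backtracks
    static String decode(String bits, Map<String, Character> reverseTable) {
        StringBuilder result = new StringBuilder();
        StringBuilder buffer = new StringBuilder();
        for (char bit : bits.toCharArray()) {
            buffer.append(bit);
            Character c = reverseTable.get(buffer.toString());
            if (c != null) {
                result.append(c);
                buffer.setLength(0); // reset the buffer and match the next code
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, Character> reverseTable = Map.of("0", 'a', "10", 'b', "11", 'c');
        System.out.println(decode("0001001111", reverseTable)); // aaabacc
    }
}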