Implement Huffman Coding Compression
Prerequisite Knowledge
Before reading this article, you should learn:
Basics of Huffman Tree introduces some basic concepts of fixed-length coding, variable-length coding, lossy compression, and lossless compression. It also leads to an important discovery:
Using a binary tree structure for coding can make sure the variable-length codes can be decoded without ambiguity.
Now the key question is: how to build this binary tree so that the total length of the encoded data is as short as possible?
Huffman Coding is an optimal coding algorithm. It builds a binary tree in a smart way so that the Weighted Path Length of the tree is the smallest. This means the encoded data will be as short as possible.
This article will introduce the idea behind Huffman Coding, and show how to implement a simple compression program to compress text files with this algorithm.
But before we introduce Huffman Coding, let's try to come up with an algorithm for building a coding tree based on what we already know.
Fixed-Length Encoding Again
First, you can see that fixed-length encoding is actually a complete binary tree of height logN
see details.
For example, here is a binary tree:
The codes for these four characters are:
a -> 00
b -> 01
c -> 10
d -> 11
Suppose there are N
characters. The height of this tree is logN
. Each character’s code length equals the tree height, so the total encoded data length is N * logN
.
The problem with fixed-length encoding is that it does not consider how often a character appears. We can use shorter codes for more frequent characters and longer codes for less frequent ones. This way, the total encoded length becomes shorter.
Variable-Length Encoding Again
Obviously, the characters that appear more often should be closer to the root. This way, they get a shorter code length.
The simplest way to improve is to sort all characters by their frequencies, then build a binary tree that goes deeper to the right.
For example, if a, b, c, d
appear with frequencies 4, 3, 2, 1
, we can build a tree like this:
In this example, the most frequent character a
has code length 1, the next b
has code length 2, and the lowest frequency c,d
have code length 3.
This method considers frequency: higher frequency means shorter code. It seems that the compression will be better.
But, can this coding method always get the shortest encoded result?
In fact, this simple sort-based method is not always the best.
For example, if a, b, c, d
each appear once, this method’s total encoded length is 1 + 2 + 3 + 3 = 9
, which is worse than the fixed-length encoding result of 4 * 2 = 8
.