Huffman coding is a lossless data compression algorithm that assigns shorter binary codes to more frequent characters and longer codes to less frequent characters. This minimizes the overall size of encoded data while ensuring perfect decompression.
- Analyze Character Frequencies
  - Count how often each character appears in the text.
- Build a Huffman Tree
  - Create a binary tree where characters with lower frequency are placed deeper.
- Generate Unique Binary Codes
  - Assign shorter binary codes to more common characters.
- Encode the Text
  - Replace each character with its binary code.
- Decode the Text
  - Use the Huffman tree to reconstruct the original text.
Given the text: ABRACADABRA
Step 1: Character Frequencies
A: 5, B: 2, R: 2, C: 1, D: 1
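Step 1 is a plain frequency count; in Python it is a one-liner with `collections.Counter`:

```python
from collections import Counter

# Tally how often each character occurs in the input text
freq = Counter("ABRACADABRA")

# freq now maps each character to its count: A:5, B:2, R:2, C:1, D:1
```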
Step 2: Build Huffman Tree
The algorithm repeatedly removes the two lowest-frequency nodes, merges them under a new parent whose frequency is their sum, and reinserts the parent, until a single root remains. Frequent characters therefore end up closer to the root and receive shorter codes.
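The greedy merge can be sketched with a min-heap (a minimal sketch; ties between equal frequencies can be broken either way, so the exact tree shape may differ from the one shown in this article):

```python
import heapq
import itertools

freq = {"A": 5, "B": 2, "R": 2, "C": 1, "D": 1}

# Heap entries are (frequency, tie-breaker, node); the counter prevents
# Python from ever comparing two node objects directly on a frequency tie.
tiebreak = itertools.count()
heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
heapq.heapify(heap)

# Repeatedly merge the two lowest-frequency nodes under a new parent
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

root_freq, _, root = heap[0]  # root frequency equals the total character count
```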
Step 3: Assign Huffman Codes
A → 0
B → 10
R → 110
C → 1110
D → 1111
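Codes are read off the tree by walking from the root: append 0 for a left branch and 1 for a right branch (the 0/1 orientation is a convention, not fixed by the algorithm). Using a hand-built tree that matches this article's codes:

```python
# Nested-tuple tree matching the codes above: each inner node is
# (left, right); leaves are characters. Left edge = "0", right edge = "1".
tree = ("A", ("B", ("R", ("C", "D"))))

def assign_codes(node, prefix="", table=None):
    """Walk the tree, accumulating the bit path to each leaf."""
    if table is None:
        table = {}
    if isinstance(node, str):          # leaf: record its code
        table[node] = prefix or "0"    # single-symbol edge case
    else:
        left, right = node
        assign_codes(left, prefix + "0", table)
        assign_codes(right, prefix + "1", table)
    return table

codes = assign_codes(tree)
# codes == {"A": "0", "B": "10", "R": "110", "C": "1110", "D": "1111"}
```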
Step 4: Encode the Text
Original: ABRACADABRA
Compressed: 01011001110011110101100
The 11 characters shrink from 88 bits (8 bits each in ASCII) to 23 bits.
Since no code is a prefix of any other, decoding is unambiguous: read the bit stream left to right and emit a character as soon as the accumulated bits match a code.
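A round trip over the example shows both directions, including the prefix-free decode:

```python
codes = {"A": "0", "B": "10", "R": "110", "C": "1110", "D": "1111"}
text = "ABRACADABRA"

# Encode: concatenate each character's code
encoded = "".join(codes[c] for c in text)

# Decode: because the code is prefix-free, a buffer of bits matches at
# most one code, so we can emit a character as soon as a match appears.
decode = {bits: sym for sym, bits in codes.items()}
out, buf = [], ""
for bit in encoded:
    buf += bit
    if buf in decode:
        out.append(decode[buf])
        buf = ""
decoded = "".join(out)
```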
- It significantly reduces file sizes (especially with repetitive text).
- It's lossless, meaning the original data can be fully reconstructed.
Generate a sample text file.
utf8_sample.py
Generated file: ./utf8_sample.txt, Size: 2.30 MB
Compress the text file.
huffman_compress.py utf8_sample.txt utf8_sample.compressed
Original File Size: 2358.25 KB
Compressed File Size: 1287.76 KB
Compression Ratio: 54.61%
Space Saved: 1070.49 KB
Compression successful: 'utf8_sample.txt' → 'utf8_sample.compressed'
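The reported figures are consistent with each other; a quick sanity check (the script's exact formula is an assumption — here the ratio is taken as compressed size over original size):

```python
original_kb = 2358.25
compressed_kb = 1287.76

ratio = compressed_kb / original_kb * 100   # percent of original size
saved = original_kb - compressed_kb         # kilobytes reclaimed
```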
Decompress the compressed file.
huffman_decompress.py utf8_sample.compressed utf8_sample.decompressed
Decompression successful: 'utf8_sample.compressed' → 'utf8_sample.decompressed'
Compare the original text file with the decompressed file.
diff utf8_sample.txt utf8_sample.decompressed
If there's no difference, the output will be blank.