How Lossless Data Compression Works
By Elliot Lichtman
One student’s desire to get out of a final exam led to the ubiquitous
algorithm that shrinks data without sacrificing information.
Yet Morse code is inefficient, too. Sure, some codes are short and others are
long. But because code lengths vary, messages in Morse code cannot be
understood unless they include brief periods of silence between each
character transmission. Indeed, without those costly pauses, recipients
would have no way to distinguish the Morse message dash dot-dash-dot
dot-dot dash dot (“trite”) from dash dot-dash-dot dot-dot-dash dot (“true”).
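That collision is easy to check mechanically. Here is a minimal Python sketch (the dictionary holds the standard International Morse codes for just the five letters involved) showing that, with the pauses stripped out, the two words produce the identical string of dots and dashes:

```python
# Standard International Morse codes for the letters in "trite" and "true".
MORSE = {"t": "-", "r": ".-.", "i": "..", "u": "..-", "e": "."}

def morse_without_pauses(word):
    """Concatenate each letter's code with no separating silence."""
    return "".join(MORSE[letter] for letter in word)

print(morse_without_pauses("trite"))   # -.-...-.
print(morse_without_pauses("true"))    # -.-...-.
print(morse_without_pauses("trite") == morse_without_pauses("true"))  # True
```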
Fano had solved this part of the problem. He realized that he could use
codes of varying lengths without needing costly spaces, as long as he never
used the same pattern of digits as both a complete code and the start of
another code. For instance, if the letter S was so common in a particular
message that Fano assigned it the extremely short code 01, then no other
letter in that message would be encoded with anything that started 01;
codes like 010, 011 or 0101 would all be forbidden. As a result, the coded
message could be read left to right, without any ambiguity. For example,
with the letter S assigned 01, the letter A assigned 000, the letter M
assigned 001, and the letter L assigned 1, suddenly the message
0100100011 can be immediately translated into the word “small” even
though L is represented by one digit, S by two digits, and the other letters
by three each.
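Here is a minimal Python sketch of that left-to-right reading, using the same four codes. Because no code is a prefix of another, the decoder can emit a letter the instant its code appears, with no lookahead and no guesswork (the function and variable names are just for illustration):

```python
# The example codebook from above: no code is a prefix of any other.
CODES = {"s": "01", "a": "000", "m": "001", "l": "1"}
DECODE = {code: letter for letter, code in CODES.items()}

def decode(bits):
    """Read bits left to right, emitting a letter as soon as a full code matches."""
    letters, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in DECODE:        # a complete code can't also start another code
            letters.append(DECODE[buffer])
            buffer = ""
    return "".join(letters)

print(decode("0100100011"))  # small
```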
To actually determine the codes, Fano built binary trees, placing each
necessary letter at the end of a visual branch. Each letter’s code was then
defined by the path from top to bottom. If the path branched to the left,
Fano added a 0; right branches got a 1. The tree structure made it easy for
Fano to avoid those undesirable overlaps: Once Fano placed a letter in the
tree, that branch would end, meaning no future code could begin the same
way.
A Fano tree for the message “encoded.” The letter D appears after a left then a right, so
it’s coded as 01, while C is right-right-left, 110. Crucially, the branches all end once a
letter is placed.
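One way to make the path-to-code rule concrete is to represent a leaf as a letter and each branch point as a (left, right) pair, then walk every path from the root, appending a 0 or a 1 at each fork. The tree below is not the "encoded" tree from the figure; it is the one implied by the earlier "small" codebook (S = 01, A = 000, M = 001, L = 1), used here purely as a sketch:

```python
# Leaves are letters; internal nodes are (left, right) pairs.
# This tree reproduces the earlier example: A=000, M=001, S=01, L=1.
TREE = ((("a", "m"), "s"), "l")

def codes_from_tree(node, path=""):
    """Collect the left/right path (0s and 1s) down to every leaf."""
    if isinstance(node, str):            # a leaf: the path so far is its code
        return {node: path}
    left, right = node
    table = codes_from_tree(left, path + "0")
    table.update(codes_from_tree(right, path + "1"))
    return table

print(codes_from_tree(TREE))  # {'a': '000', 'm': '001', 's': '01', 'l': '1'}
```

Because every letter sits at the end of its own branch, no code in the resulting table can be the start of another.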
To decide where each letter should go, Fano tried to assign letters to branches so that the letters on the left in any given branch
pair were used in the message roughly the same number of times as the
letters on the right. In this way, frequently used characters would end up on
shorter, less dense branches. A small number of high-frequency letters
would always balance out some larger number of lower-frequency ones.
The message “bookkeeper” has three E’s, two K’s, two O’s and one each of B, P and R.
Fano’s symmetry is apparent throughout the tree. For example, the E and K together
have a total frequency of 5, perfectly matching the combined frequency of the O, B, P
and R.
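Fano's balancing idea can be sketched in code, with the caveat that this is only an approximation of his procedure: it assumes the letters are listed from most to least common and simply chooses the split point that brings the left group's total closest to half the weight, then recurses on each side.

```python
# A rough sketch of top-down, Fano-style splitting -- an approximation of the
# balancing idea described above, not Fano's exact procedure.
def fano_codes(freqs, prefix=""):
    """freqs: list of (letter, count) pairs, sorted from most to least common."""
    if len(freqs) == 1:
        return {freqs[0][0]: prefix or "0"}
    total = sum(count for _, count in freqs)
    running, split = 0, 1
    for i, (_, count) in enumerate(freqs[:-1], start=1):
        running += count
        split = i
        if 2 * running >= total:      # the left group has reached about half the weight
            break
    codes = fano_codes(freqs[:split], prefix + "0")
    codes.update(fano_codes(freqs[split:], prefix + "1"))
    return codes

# "bookkeeper": three E's, two K's, two O's, one each of B, P and R.
print(fano_codes([("e", 3), ("k", 2), ("o", 2), ("b", 1), ("p", 1), ("r", 1)]))
# {'e': '00', 'k': '01', 'o': '100', 'b': '101', 'p': '110', 'r': '111'}
```

On "bookkeeper" this reproduces the symmetry in the caption: E and K (total frequency 5) fill one half of the tree, while O, B, P and R (also 5) fill the other.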
Fano had built his trees from the top down, maintaining as much symmetry
as possible between paired branches. His student David Huffman flipped
the process on its head, building the same types of trees but from the
bottom up. Huffman’s insight was that, whatever else happens, in an
efficient code the two least common characters should have the two longest
codes. So Huffman identified the two least common characters, grouped
them together as a branching pair, and then repeated the process, this time
looking for the two least common entries from among the remaining
characters and the pair he had just built.
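Here is a compact Python sketch of that bottom-up procedure. It uses a min-heap so the two least common entries are always at hand; each time two entries merge, every letter inside the merged group gets one more bit prepended to its code, which is exactly what growing the tree upward by one level does:

```python
import heapq
from collections import Counter

def huffman_codes(message):
    """Build a prefix-free code for each distinct character, bottom up."""
    counts = Counter(message)
    if len(counts) == 1:                       # degenerate one-symbol message
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie-breaker, {char: code_so_far}).
    heap = [(count, i, {char: ""}) for i, (char, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)    # least common entry
        w2, _, group2 = heapq.heappop(heap)    # second least common entry
        merged = {c: "0" + code for c, code in group1.items()}
        merged.update({c: "1" + code for c, code in group2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_codes("bookkeeper"))
```

Letters swept up in early merges sit deep in the finished tree and end up with long codes, while letters that are not merged until late, like the E in "bookkeeper," get some of the shortest.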
Consider a message like "schoolroom," in which the O appears four times and S, C, H, L, R and M each appear once. Fano's balancing
approach starts by assigning the O and one other letter to the left branch,
with the five total uses of those letters balancing out the five appearances of
the remaining letters. The resulting message requires 27 bits.
For the same message, Huffman starts at the bottom, pairing two of the least common letters, say R and M, into a single node with a combined weight of 2. His updated frequency chart then offers him six choices: the O that
appears four times, the new combined RM node that is functionally used
twice, and the single letters S, C, H and L. Huffman again picks the two
least common options, matching (say) H with L.
The chart updates again: O still has a weight of 4, RM and HL now each
have a weight of 2, and the letters S and C stand alone. Huffman continues
from there, in each step grouping the two least frequent options and then
updating both the tree and the frequency chart.
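A quick tally shows where this walkthrough lands. Using the frequencies above (four O's and one each of S, C, H, L, R and M), the finished message's length in bits equals the sum of the combined weights created at each merge, since every letter picks up one bit for each merge it takes part in. A short sketch of that tally:

```python
import heapq

# Letter counts from the walkthrough: O appears 4 times; S, C, H, L, R, M once each.
weights = [4, 1, 1, 1, 1, 1, 1]
heapq.heapify(weights)

total_bits = 0
while len(weights) > 1:
    combined = heapq.heappop(weights) + heapq.heappop(weights)
    total_bits += combined      # every letter under this new node gets one bit deeper
    heapq.heappush(weights, combined)

print(total_bits)  # 26 -- one bit fewer than the 27 required by Fano's balanced tree
```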
One bit may not sound like much, but even small savings grow enormously
when scaled by billions of gigabytes.
Correction: An earlier version of the story implied that the JPEG image compression standard is lossless. While the lossless Huffman algorithm is a part of the JPEG compression process, the standard as a whole discards some image data and is therefore lossy.