How Lossless Data Compression Works
By Elliot Lichtman
One student’s desire to get out of a final exam led to the ubiquitous
algorithm that shrinks data without sacrificing information.
Yet Morse code is inefficient, too. Sure, some codes are short and others are
long. But because code lengths vary, messages in Morse code cannot be
understood unless they include brief periods of silence between each
character transmission. Indeed, without those costly pauses, recipients
would have no way to distinguish the Morse message dash dot-dash-dot
dot-dot dash dot (“trite”) from dash dot-dash-dot dot-dot-dash dot (“true”).
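That collision is easy to check mechanically. Here is a minimal Python sketch (the dictionary holds the standard International Morse codes for just the five letters involved) showing that, with the pauses stripped out, the two words produce the identical string of dots and dashes:

```python
# Standard International Morse codes for the letters in "trite" and "true".
MORSE = {"t": "-", "r": ".-.", "i": "..", "u": "..-", "e": "."}

def morse_without_pauses(word):
    """Concatenate each letter's code with no separating silence."""
    return "".join(MORSE[letter] for letter in word)

print(morse_without_pauses("trite"))   # -.-...-.
print(morse_without_pauses("true"))    # -.-...-.
print(morse_without_pauses("trite") == morse_without_pauses("true"))  # True
```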
Fano had solved this part of the problem. He realized that he could use
codes of varying lengths without needing costly spaces, as long as he never
used the same pattern of digits as both a complete code and the start of
another code. For instance, if the letter S was so common in a particular
message that Fano assigned it the extremely short code 01, then no other
letter in that message would be encoded with anything that started 01;
codes like 010, 011 or 0101 would all be forbidden. As a result, the coded
message could be read left to right, without any ambiguity. For example,
with the letter S assigned 01, the letter A assigned 000, the letter M
assigned 001, and the letter L assigned 1, suddenly the message
0100100011 can be immediately translated into the word “small” even
though L is represented by one digit, S by two digits, and the other letters
by three each.
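Here is a minimal Python sketch of that left-to-right reading, using the same four codes. Because no code is a prefix of another, the decoder can emit a letter the instant its code appears, with no lookahead and no guesswork (the function and variable names are just for illustration):

```python
# The example codebook from above: no code is a prefix of any other.
CODES = {"s": "01", "a": "000", "m": "001", "l": "1"}
DECODE = {code: letter for letter, code in CODES.items()}

def decode(bits):
    """Read bits left to right, emitting a letter as soon as a full code matches."""
    letters, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in DECODE:        # a complete code can't also start another code
            letters.append(DECODE[buffer])
            buffer = ""
    return "".join(letters)

print(decode("0100100011"))  # small
```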
To actually determine the codes, Fano built binary trees, placing each
necessary letter at the end of a visual branch. Each letter’s code was then
defined by the path from top to bottom. If the path branched to the left,
Fano added a 0; right branches got a 1. The tree structure made it easy for
Fano to avoid those undesirable overlaps: Once Fano placed a letter in the
tree, that branch would end, meaning no future code could begin the same
way.
A Fano tree for the message “encoded.” The letter D appears after a left then a right, so
it’s coded as 01, while C is right-right-left, 110. Crucially, the branches all end once a
letter is placed.
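One way to make the path-to-code rule concrete is to represent a leaf as a letter and each branch point as a (left, right) pair, then walk every path from the root, appending a 0 or a 1 at each fork. The tree below is not the "encoded" tree from the figure; it is the one implied by the earlier "small" codebook (S = 01, A = 000, M = 001, L = 1), used here purely as a sketch:

```python
# Leaves are letters; internal nodes are (left, right) pairs.
# This tree reproduces the earlier example: A=000, M=001, S=01, L=1.
TREE = ((("a", "m"), "s"), "l")

def codes_from_tree(node, path=""):
    """Collect the left/right path (0s and 1s) down to every leaf."""
    if isinstance(node, str):            # a leaf: the path so far is its code
        return {node: path}
    left, right = node
    table = codes_from_tree(left, path + "0")
    table.update(codes_from_tree(right, path + "1"))
    return table

print(codes_from_tree(TREE))  # {'a': '000', 'm': '001', 's': '01', 'l': '1'}
```

Because every letter sits at the end of its own branch, no code in the resulting table can be the start of another.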
To decide where each letter should go, Fano tried to assign letters to branches so that the letters on the left in any given branch
pair were used in the message roughly the same number of times as the
letters on the right. In this way, frequently used characters would end up on
shorter, less dense branches. A small number of high-frequency letters
would always balance out some larger number of lower-frequency ones.
The message “bookkeeper” has three E’s, two K’s, two O’s and one each of B, P and R.
Fano’s symmetry is apparent throughout the tree. For example, the E and K together
have a total frequency of 5, perfectly matching the combined frequency of the O, B, P
and R.
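Fano's balancing idea can be sketched in code, with the caveat that this is only an approximation of his procedure: it assumes the letters are listed from most to least common and simply chooses the split point that brings the left group's total closest to half the weight, then recurses on each side.

```python
# A rough sketch of top-down, Fano-style splitting -- an approximation of the
# balancing idea described above, not Fano's exact procedure.
def fano_codes(freqs, prefix=""):
    """freqs: list of (letter, count) pairs, sorted from most to least common."""
    if len(freqs) == 1:
        return {freqs[0][0]: prefix or "0"}
    total = sum(count for _, count in freqs)
    running, split = 0, 1
    for i, (_, count) in enumerate(freqs[:-1], start=1):
        running += count
        split = i
        if 2 * running >= total:      # the left group has reached about half the weight
            break
    codes = fano_codes(freqs[:split], prefix + "0")
    codes.update(fano_codes(freqs[split:], prefix + "1"))
    return codes

# "bookkeeper": three E's, two K's, two O's, one each of B, P and R.
print(fano_codes([("e", 3), ("k", 2), ("o", 2), ("b", 1), ("p", 1), ("r", 1)]))
# {'e': '00', 'k': '01', 'o': '100', 'b': '101', 'p': '110', 'r': '111'}
```

On "bookkeeper" this reproduces the symmetry in the caption: E and K (total frequency 5) fill one half of the tree, while O, B, P and R (also 5) fill the other.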
Fano had built his trees from the top down, maintaining as much symmetry
as possible between paired branches. His student David Huffman flipped
the process on its head, building the same types of trees but from the
bottom up. Huffman’s insight was that, whatever else happens, in an
efficient code the two least common characters should have the two longest
codes. So Huffman identified the two least common characters, grouped
them together as a branching pair, and then repeated the process, this time
looking for the two least common entries from among the remaining
characters and the pair he had just built.
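Here is a compact Python sketch of that bottom-up procedure. It uses a min-heap so the two least common entries are always at hand; each time two entries merge, every letter inside the merged group gets one more bit prepended to its code, which is exactly what growing the tree upward by one level does:

```python
import heapq
from collections import Counter

def huffman_codes(message):
    """Build a prefix-free code for each distinct character, bottom up."""
    counts = Counter(message)
    if len(counts) == 1:                       # degenerate one-symbol message
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie-breaker, {char: code_so_far}).
    heap = [(count, i, {char: ""}) for i, (char, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, group1 = heapq.heappop(heap)    # least common entry
        w2, _, group2 = heapq.heappop(heap)    # second least common entry
        merged = {c: "0" + code for c, code in group1.items()}
        merged.update({c: "1" + code for c, code in group2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_codes("bookkeeper"))
```

Letters swept up in early merges sit deep in the finished tree and end up with long codes, while letters that are not merged until late, like the E in "bookkeeper," get some of the shortest.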
Consider a message like "schoolroom," in which the O appears four times and S, C, H, L, R and M each appear once. Fano's balancing
approach starts by assigning the O and one other letter to the left branch,
with the five total uses of those letters balancing out the five appearances of
the remaining letters. The resulting message requires 27 bits.
For the same message, Huffman starts at the bottom, pairing two of the least common letters, say R and M, into a single node with a combined weight of 2. His updated frequency chart then offers him six choices: the O that
appears four times, the new combined RM node that is functionally used
twice, and the single letters S, C, H and L. Huffman again picks the two
least common options, matching (say) H with L.
The chart updates again: O still has a weight of 4, RM and HL now each
have a weight of 2, and the letters S and C stand alone. Huffman continues
from there, in each step grouping the two least frequent options and then
updating both the tree and the frequency chart.
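A quick tally shows where this walkthrough lands. Using the frequencies above (four O's and one each of S, C, H, L, R and M), the finished message's length in bits equals the sum of the combined weights created at each merge, since every letter picks up one bit for each merge it takes part in. A short sketch of that tally:

```python
import heapq

# Letter counts from the walkthrough: O appears 4 times; S, C, H, L, R, M once each.
weights = [4, 1, 1, 1, 1, 1, 1]
heapq.heapify(weights)

total_bits = 0
while len(weights) > 1:
    combined = heapq.heappop(weights) + heapq.heappop(weights)
    total_bits += combined      # every letter under this new node gets one bit deeper
    heapq.heappush(weights, combined)

print(total_bits)  # 26 -- one bit fewer than the 27 required by Fano's balanced tree
```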
One bit may not sound like much, but even small savings grow enormously
when scaled by billions of gigabytes.
Correction: An earlier version of the story implied that the JPEG image compression standard is lossless. While the lossless Huffman algorithm is a part of the JPEG compression process, the standard as a whole discards some image data and is therefore lossy.