Data compression
Introduction
• Data compression is a subfield of Information Theory.
• Information Theory is the mother of all modern computing and transmission
systems and networks (mobile phones, TV, radio, computer networks, you name
it). It was hinted at by Harry Nyquist and Ralph Hartley in the 1920s, but
fully developed by the one known as “the father of information theory”,
Claude Shannon, in the 1940s.
• Useful links:
• Information Theory: https://en.wikipedia.org/wiki/Information_theory
• Data compression: https://en.wikipedia.org/wiki/Data_compression
• Claude Shannon: https://en.wikipedia.org/wiki/Claude_Shannon
What is (and is not) data compression
• Data compression is a method used to reduce the apparent size* of a
message
• Data compression is not a magical tool to make more bits fit into
fewer bits since that’s impossible
• Data compression can be done in two ways:
• lossless (no information is lost in the process; the compression ratio is
limited)
• lossy (some information is lost in the process, but the compression ratio can
get arbitrarily high, trading off quality)
--------------------------------------------------------
* we’ll talk about this later, don’t worry
What was that “apparent size” all about?
• The apparent size of a message is the number of bits that are used to
represent the message.
• The message itself usually contains less information than its apparent
size, which is why it can be compressed.
• The amount of information in a message is a measurable quantity which
depends on the symbols that make up the message and their rate of
occurrence, and is called entropy.
• In physics, “entropy” refers to the amount of disorder in a system, or its
total degrees of freedom. The notion is quite analogous in Information
Theory, where “entropy” means the amount of information in a message, or its
amount of “irregularity”.
Let’s explain Entropy
• In information theory we usually talk about “sources” (or “random variables”)
rather than “messages”. A source is like a stream that “emits” symbols over time,
and a message is the sequence of symbols emitted by a source thus far. A “random
variable” is a variable that takes the value of each symbol emitted by a source;
it’s called random because we don’t know what the source will emit next.
• Entropy is usually attributed to a source rather than a message, but it can be used
for both, meaning pretty much the same thing.
• The entropy of a source represents the amount of uncertainty (or randomness)
attributed to the source (we’ll see soon that uncertainty = information), or the
amount of surprise that the source holds.
Entropy of a source
• Example 1:
• suppose we have a source that can emit two symbols, A or B. We know that 95% of
the time it emits A, and only 5% of the time it emits B
• Now, looking at the source, we wait for the next symbol to be emitted. Based on
what we know about the source, we pretty much expect it to output an A, so
there’s not much uncertainty involved. We can call this source pretty “boring”:
its entropy is low.
• Example 2:
• we now move onto a different source that can output any one of the letters from A
to Z with an equal probability.
• Now we get excited because there’s quite an element of surprise here, there’s no
telling what the source might emit next – much uncertainty (or surprise), thus, this
source’s entropy is high.
So, to recap:
• Low-entropy source: predictable, few surprises, boring.
  Example: AAAAAAAAABAAAAAAAAAABAAAAAA…
  (we can assume the next symbol will probably be A)
• High-entropy source: unpredictable, a lot of surprises, exciting.
  Example: AGPEJXZSNWPEQBJSLEJFOI…
  (who knows what comes next?)
Entropy is quantifiable
• We said entropy can be measured, or rather computed, and Father Shannon
tells us how to do that: for a source with symbol probabilities p_i, the
entropy is H = −Σ p_i · log2(p_i) bits per symbol.
Apparent size = Entropy + Redundancy
  Entropy: the amount of information contained in the message
  Redundancy: the bits that don’t contain any information
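• To make this concrete, here is a small Python sketch (not from the slides, just an illustration) that applies Shannon’s formula to the two sources from the earlier examples: the “boring” 95%/5% source and the uniform A-Z source.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# "Boring" source: A 95% of the time, B 5% of the time
print(entropy([0.95, 0.05]))       # ~0.29 bits per symbol -> low entropy

# "Exciting" source: 26 equally likely letters A..Z
print(entropy([1 / 26] * 26))      # ~4.70 bits per symbol -> high entropy
```

• Multiplying the bits-per-symbol figure by the length of a message gives the total entropy of that message; that is where the 48-bit figure for “apples are green” on the following slides comes from.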
Data Compression
• It should be obvious by now that compression can be attained by removing (or
reducing) redundancy.
• Lossless compression is limited by the message’s entropy, since we can only
remove the redundancy – the amount of actual information (entropy) cannot be
reduced, otherwise we’d lose information which can’t be recreated later out of
nothing.
• Lossy compression is theoretically unlimited, since we can discard an arbitrarily
large amount of actual information to reduce the size.
• In practice, lossy compression techniques discard the least relevant information and keep
the most important part. The recreated message after decompression will not be 100%
accurate to the original, it will have some missing “details” – for example see JPEG
compression which discards fine details in the image for the sake of size.
• In the next part we’ll discuss lossless compression.
Lossless Compression
• So how do you actually remove redundancy?
• Let’s start with an obvious example:
• We have this message, “apples are green”, encoded as 8-bit-per-character
ASCII: 16 characters × 8 bits = 128 bits total.
• We’ve already established (via Shannon’s formula) that the entropy of this message is only 48 bits.
• Dividing 48 by 16 (the number of characters in the message) we get
• 48 / 16 = 3 bits per character on average.
• With 3 bits we can represent 2^3 = 8 different symbols, but we have 9 distinct symbols in the
text (including the space), so we’ll need one more bit.
• We settle on 4 bits per character, which can represent 16 different characters.
• We create a custom encoding for the message using 4 bits per character
• … continued on next page
Lossless compression example
“apples are green”
9 different symbols: [a, p, l, e, s, r, g, n, _]
4 bits per character to hold 9 possible values

4-bit encoding table:
  a 0000   e 0011   g 0110
  p 0001   s 0100   n 0111
  l 0010   r 0101   _ 1000

We can now represent our message using the following bit sequence:
  0000 0001 0001 0010 0011 0100 1000 0000 0101 0011 1000 0110 0101 0011 0011 0111
   a    p    p    l    e    s    _    a    r    e    _    g    r    e    e    n
We used a total of 64 bits, half of the message’s initial apparent size of 128 bits.
Not bad, eh?
But this is still suboptimal: the entropy of the message was only 48 bits, and more to the point, the average entropy per
symbol was 3 bits while we used 4. Unfortunately, this is the limit of constant-length encoding (we’ve used the same bit
length for every symbol, taking the smallest number of bits possible).
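• As a sanity check, here is a short Python sketch (an illustration, not part of the original slides) that builds a 4-bit constant-length code for the message and computes its entropy with Shannon’s formula. The particular 4-bit patterns differ from the table above, but the totals (64 bits used, 48 bits of entropy) come out the same.

```python
import math
from collections import Counter

message = "apples are green"        # 16 characters, 128 bits in 8-bit ASCII

# 4-bit constant-length code: assign consecutive 4-bit patterns to the 9 symbols
symbols = sorted(set(message))
fixed_code = {s: format(i, "04b") for i, s in enumerate(symbols)}
encoded = "".join(fixed_code[c] for c in message)
print(len(encoded))                 # 64 bits (4 bits x 16 characters)

# Entropy: H = -sum(p * log2(p)) per symbol, times the message length
counts = Counter(message)
probs = [n / len(message) for n in counts.values()]
h = -sum(p * math.log2(p) for p in probs)
print(h, h * len(message))          # 3.0 bits per symbol, 48.0 bits total
```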
Improving the compression ratio with variable-length encoding
• In the previous example we used a constant-length encoding table – the
same number of bits to represent each symbol. This would work well if
all symbols had the same probability.
• However, the probabilities of the symbols in our example differ
considerably, so if we could allocate fewer bits for the symbols that
appear more often (have a higher probability) and more bits for
symbols that appear less often (lower probability), then we would
achieve a better compression ratio, approaching entropy.
• Let’s see an example of variable-length encoding on the next page and
see how well it fares compared with the constant-length encoding.
Variable-length encoding example
“apples are green”
9 different symbols: [a, p, l, e, s, r, g, n, _]

The probabilities of each letter:
  e: 0.25    a: 0.125   p: 0.125   _: 0.125   r: 0.125
  g: 0.0625  l: 0.0625  n: 0.0625  s: 0.0625

Encoding table:
  e 00    a 010   p 011   _ 100   r 101
  g 1100  l 1101  n 1110  s 1111

• You can see in the variable-length encoding table that we now allocate fewer
bits for the symbols that appear more often (higher probability) and more bits
for those that appear seldom (lower probability).

• Naturally, the question arises: when we see a sequence of bits, how do we know
where one symbol ends and the next begins? The trick is that no code-word is a prefix
of another one (a code-word is the sequence of bits we assign to a symbol in the
encoding table), so once we have fully matched a valid code-word we can be sure it ends right there.
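• A quick way to convince yourself that the table above really has this prefix property is to check every pair of code-words; a minimal Python sketch (illustration only):

```python
# Variable-length encoding table from the slide
code = {"e": "00", "a": "010", "p": "011", "_": "100", "r": "101",
        "g": "1100", "l": "1101", "n": "1110", "s": "1111"}

# Prefix property: no code-word may be the beginning of a different code-word
words = list(code.values())
prefix_free = not any(w1 != w2 and w2.startswith(w1)
                      for w1 in words for w2 in words)
print(prefix_free)                  # True -> decoding is unambiguous
```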
Same example, with variable-length encoding

Message to encode: “apples are green”

Encoding table:
  e 00    a 010   p 011   _ 100   r 101
  g 1100  l 1101  n 1110  s 1111

Encoded sequence (48 bits):
  010 011 011 1101 00 1111 100 010 101 00 100 1100 101 00 00 1110
   a   p   p   l    e   s   _   a   r   e   _   g    r   e  e   n
We’ve now used a total of 48 bits, which is the lowest number of bits possible, since we’ve reached the entropy of the message
(remember, it was 48 bits). It works out exactly here because every symbol probability is a power of ½; in general, variable-length
codes can only approach the entropy.
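• For completeness, a small Python sketch (again, just an illustration) that encodes the message with the table above and confirms the 48-bit total:

```python
# '_' in the slide's table stands for the space character; we key on ' ' directly
code = {"e": "00", "a": "010", "p": "011", " ": "100", "r": "101",
        "g": "1100", "l": "1101", "n": "1110", "s": "1111"}

message = "apples are green"
encoded = "".join(code[c] for c in message)
print(encoded)       # 010011011110100111110001010100100110010100001110
print(len(encoded))  # 48 bits -- equal to the entropy of the message
```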
Huffman coding exemplified
• The Huffman tree is built by repeatedly merging the two nodes with the lowest
probabilities into a new node whose probability is their sum, until a single
root (probability 1.0) remains.
• Then, starting from the tree root and going down each branch, assign “0” to the
left branch and “1” to the right branch. Do this recursively until all
branches are covered, down to the leaves.
[Huffman tree diagram: the leaves are the symbols with their probabilities
(e: 0.25; a, p, _, r: 0.125 each; g, l, n, s: 0.0625 each); pairs of nodes are
merged bottom-up into internal nodes (0.125, 0.25, 0.5) and finally the root
(1.0). Labelling left branches with 0 and right branches with 1 gives the
code-words.]

Encoding table:
  e 00    a 010   p 011   _ 100   r 101
  g 1100  l 1101  n 1110  s 1111
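• Here is a compact Python sketch of the procedure (a generic Huffman implementation, not code from the course): repeatedly merge the two least probable nodes, then walk the finished tree, assigning 0 to left branches and 1 to right branches. With the tie-breaking used below it happens to reproduce the table above; other tie-breaks would give different but equally optimal code-words.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code from a dict {symbol: probability}."""
    tiebreak = count()   # keeps heap entries comparable when probabilities are equal
    heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # take the two least probable nodes...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))  # ...and merge them
    _, _, root = heap[0]

    codes = {}
    def assign(node, prefix):                # 0 for the left branch, 1 for the right
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix
    assign(root, "")
    return codes

probs = {"e": 0.25, "a": 0.125, "p": 0.125, "_": 0.125, "r": 0.125,
         "g": 0.0625, "l": 0.0625, "n": 0.0625, "s": 0.0625}
print(huffman_code(probs))   # 2 bits for e, 3 for a/p/_/r, 4 for g/l/n/s
```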
A few notes on decoding
• So far we saw how to compress a message by removing redundancy
using Huffman encoding.
• One neat aspect of the Huffman tree is that it can be reused when
decoding a message:
• Start from the root of the tree
• read the next bit from the encoded message
• if the bit is 0, go to the left child; if it’s 1, go to the right child
• repeat until you reach a leaf node, and that’s your symbol.
• go back to the root
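• A minimal Python sketch of this decoding loop, using a nested-tuple tree shaped like the diagram (left child first, right child second); illustration only:

```python
# Huffman tree for the "apples are green" code, written as (left, right) pairs;
# leaves are the symbols themselves ('_' is spelled out as a real space here)
tree = (("e", ("a", "p")),                        # left half:  e=00, a=010, p=011
        ((" ", "r"), (("g", "l"), ("n", "s"))))   # right half: _=100, r=101, g/l/n/s=11xx

def decode(bits, tree):
    out = []
    node = tree                                    # start from the root of the tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]  # 0 -> left child, 1 -> right child
        if isinstance(node, str):                  # reached a leaf: that's our symbol...
            out.append(node)
            node = tree                            # ...go back to the root
    return "".join(out)

encoded = "010011011110100111110001010100100110010100001110"
print(decode(encoded, tree))                       # "apples are green"
```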
Further improving the compression ratio
• Suppose we have this message:
ABCABCABCABCD
• If we compute the probabilities of each symbol, we’ll get
P(A) = P(B) = P(C) = 4/13 ≈ 0.308; P(D) = 1/13 ≈ 0.077
• Applying Huffman encoding to this we’ll get
A 00 B 01 C 10 D 11 averaging 2 bits per symbol, thus the encoded message will have a
length of 26 bits total.
• However, we can see that up until we get to D, the same “ABC” sequence repeats itself 4
times, thus not giving much information at all. We could consider the group ABC (called a
3-gram) to be a single symbol, in which case we’d get these probabilities and encoding:
P(ABC) = 0.8; P(D) = 0.2
ABC 0   D 1
Our encoded message will now have a length of 5 bits, way better than 26 bits.
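• A tiny Python sketch of the comparison (illustrative only): per-character coding spends 2 bits on each of the 13 characters, while treating “ABC” as a single symbol leaves only 5 tokens that fit in 1 bit each.

```python
message = "ABCABCABCABCD"

# Per-character Huffman code (4 symbols -> 2 bits each)
char_code = {"A": "00", "B": "01", "C": "10", "D": "11"}
per_char = "".join(char_code[c] for c in message)
print(len(per_char))                  # 26 bits

# Treat the recurring 3-gram "ABC" as one symbol
tokens = message.replace("ABC", "X")  # "XXXXD": four ABC groups, then D
gram_code = {"X": "0", "D": "1"}
per_gram = "".join(gram_code[t] for t in tokens)
print(per_gram, len(per_gram))        # "00001", 5 bits
```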
Data compression in practice
• In practice Huffman encoding is rarely used by itself since better compression
can be achieved by combining it with various other methods:
• n-gram decomposition (like we saw in the previous page) considers recurring groups
of characters / bytes as single symbols
• dictionary encoding – is based on a large, precomputed dictionary that is used to
lookup commonly occurring sequences of bytes/chars and replace them with short
codes
• “deflate” is a method that dynamically detects recurring patterns and instead of re-
encoding them, simply writes in the output a short identifier for the whole sequence;
the dictionary is built on the fly. See https://en.wikipedia.org/wiki/Deflate
• Popular tools and formats like RAR, ZIP, gzip, 7-Zip, etc. use a combination of
these methods along with algorithms of their own (tar by itself only bundles
files and leaves the compression to tools like gzip).
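• For a quick taste of “deflate” in practice, here is a short Python sketch using the standard zlib module (just an illustration of the idea, not tied to any particular tool above):

```python
import zlib

# Highly repetitive input -> lots of redundancy for deflate to remove
data = b"apples are green, apples are green, apples are green. " * 50
compressed = zlib.compress(data, 9)     # DEFLATE at maximum compression level
print(len(data), "->", len(compressed), "bytes")

# Decompression restores the original exactly: this is lossless compression
assert zlib.decompress(compressed) == data
```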
I hope you
enjoyed this,
and…