Data compression
Introduction
• Data compression is a subfield of Information Theory.
• Information Theory is the mother of all modern computing and transmission
systems and networks (mobile phones, TV, radio, computer networks, you name
it). It was hinted at by Harry Nyquist and Ralph Hartley in the 1920s, but
fully developed by the one known as “the father of information theory”,
Claude Shannon, in the 1940s.
• Useful links:
• Information Theory: https://en.wikipedia.org/wiki/Information_theory
• Data compression: https://en.wikipedia.org/wiki/Data_compression
• Claude Shannon: https://en.wikipedia.org/wiki/Claude_Shannon
What is (and is not) data compression
• Data compression is a method used to reduce the apparent size* of a
message
• Data compression is not a magical tool to make more bits fit into
fewer bits since that’s impossible
• Data compression can be done in two ways:
• lossless (no information is lost in the process; the compression ratio is
limited)
• lossy (some information is lost in the process, but the compression ratio can
get arbitrarily high, trading off quality)
--------------------------------------------------------
* we’ll talk about this later, don’t worry
What was that “apparent size” all about?
• The apparent size of a message is the number of bits that are used to
represent the message.
• The message itself usually contains less information than its apparent
size, which is why it can be compressed.
• The amount of information in a message is a measurable quantity which
depends on the symbols that make up the message and their rate of
occurrence, and is called entropy.
• In physics, “entropy” refers to the amount of disorder in a system, or its
total degrees of freedom. The notion is quite analogous in Information
Theory, where “entropy” means the amount of information in a message, or its
amount of “irregularity”.
Let’s explain Entropy
• In information theory we usually talk about “sources” (or “random variables”)
rather than “messages”. A source is like a stream that “emits” symbols over time,
and a message is the sequence of symbols emitted by a source thus far. A “random
variable” is a variable that takes the value of each symbol emitted by a source;
it’s called random because we don’t know what the source will emit next.
• Entropy is usually attributed to a source rather than a message, but it can be used
for both, meaning pretty much the same thing.
• The entropy of a source represents the amount of uncertainty (or randomness)
attributed to the source (we’ll see soon that uncertainty = information), or the
amount of surprise that the source holds.
Entropy of a source
• Example 1:
• suppose we have a source that can emit two symbols, A or B. We know that 95% of
the time it emits A, and only 5% of the time it emits B
• Now, looking at the source, we wait for the next symbol to be emitted. Based on
what we know about the source, we pretty much expect it to output an A, so
there’s not much uncertainty involved. We can call this source pretty “boring”:
its entropy is low.
• Example 2:
• we now move onto a different source that can output any one of the letters from A
to Z with an equal probability.
• Now we get excited because there’s quite an element of surprise here, there’s no
telling what the source might emit next – much uncertainty (or surprise), thus, this
source’s entropy is high.
So, to recap:
• Low-entropy source: predictable, few surprises, boring.
  Example: AAAAAAAAABAAAAAAAAAABAAAAAA…
  (we can assume the next symbol will probably be A)
• High-entropy source: unpredictable, a lot of surprises, exciting.
  Example: AGPEJXZSNWPEQBJSLEJFOI…
  (who knows what comes next?)
Entropy is quantifiable
• We said entropy can be measured, or rather computed, and Father Shannon
tells us how to do that: for a source with symbol probabilities p_i, the
entropy is H = −Σ p_i · log2(p_i) bits per symbol.
Apparent size = Entropy + Redundancy
  Entropy: the amount of information contained in the message
  Redundancy: the bits that don’t contain any information
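• To make this concrete, here is a small Python sketch (not from the slides, just an illustration) that applies Shannon’s formula to the two sources from the earlier examples: the “boring” 95%/5% source and the uniform A-Z source.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# "Boring" source: A 95% of the time, B 5% of the time
print(entropy([0.95, 0.05]))       # ~0.29 bits per symbol -> low entropy

# "Exciting" source: 26 equally likely letters A..Z
print(entropy([1 / 26] * 26))      # ~4.70 bits per symbol -> high entropy
```

• Multiplying the bits-per-symbol figure by the length of a message gives the total entropy of that message; that is where the 48-bit figure for “apples are green” on the following slides comes from.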
Data Compression
• It should be obvious by now that compression can be attained by removing (or
reducing) redundancy.
• Lossless compression is limited by the message’s entropy, since we can only
remove the redundancy – the amount of actual information (entropy) cannot be
reduced, otherwise we’d lose information which can’t be recreated later out of
nothing.
• Lossy compression is theoretically unlimited, since we can discard an arbitrarily
large amount of actual information to reduce the size.
• In practice, lossy compression techniques discard the least relevant information and keep
the most important part. The recreated message after decompression will not be 100%
accurate to the original, it will have some missing “details” – for example see JPEG
compression which discards fine details in the image for the sake of size.
• In the next part we’ll discuss lossless compression.
Lossless Compression
• So how do you actually remove redundancy?
• Let’s start with an obvious example:
• We have this message, “apples are green”, encoded as 8-bit-per-character
ASCII: 16 characters × 8 bits = 128 bits total.
• We’ve already established (via Shannon’s formula) that the entropy of this message is only 48 bits.
• Dividing 48 by 16 (the number of characters in the message) we get
• 48 / 16 = 3 bits per character on average.
• With 3 bits we can represent 2^3 = 8 different symbols, but we have 9 distinct symbols in the
text (including the space), so we’ll need one more bit.
• We settle on 4 bits per character, which can represent 16 different characters.
• We create a custom encoding for the message using 4 bits per character
• … continued on next page
Lossless compression example
“apples are green”
9 different symbols: [a, p, l, e, s, r, g, n, _]
4 bits per character to hold 9 possible values

4-bit encoding table:
  a 0000   e 0011   g 0110
  p 0001   s 0100   n 0111
  l 0010   r 0101   _ 1000

We can now represent our message using the following bit sequence:
  0000 0001 0001 0010 0011 0100 1000 0000 0101 0011 1000 0110 0101 0011 0011 0111
   a    p    p    l    e    s    _    a    r    e    _    g    r    e    e    n
We used a total of 64 bits, half of the message’s initial apparent size of 128 bits.
Not bad, eh?
But this is still suboptimal: the entropy of the message was only 48 bits, and more to the point, the average entropy per
symbol was 3 bits while we used 4. Unfortunately, this is the limit of constant-length encoding (we’ve used the same bit
length for every symbol, taking the smallest number of bits possible).
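• As a sanity check, here is a short Python sketch (an illustration, not part of the original slides) that builds a 4-bit constant-length code for the message and computes its entropy with Shannon’s formula. The particular 4-bit patterns differ from the table above, but the totals (64 bits used, 48 bits of entropy) come out the same.

```python
import math
from collections import Counter

message = "apples are green"        # 16 characters, 128 bits in 8-bit ASCII

# 4-bit constant-length code: assign consecutive 4-bit patterns to the 9 symbols
symbols = sorted(set(message))
fixed_code = {s: format(i, "04b") for i, s in enumerate(symbols)}
encoded = "".join(fixed_code[c] for c in message)
print(len(encoded))                 # 64 bits (4 bits x 16 characters)

# Entropy: H = -sum(p * log2(p)) per symbol, times the message length
counts = Counter(message)
probs = [n / len(message) for n in counts.values()]
h = -sum(p * math.log2(p) for p in probs)
print(h, h * len(message))          # 3.0 bits per symbol, 48.0 bits total
```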
Improving the compression ratio with variable-length encoding
• In the previous example we used a constant-length encoding table – the
same number of bits to represent each symbol. This would work well if
all symbols had the same probability.
• However, the probabilities of the symbols in our example differ
considerably, so if we could allocate fewer bits for the symbols that
appear more often (have a higher probability) and more bits for
symbols that appear less often (lower probability), then we would
achieve a better compression ratio, approaching entropy.
• Let’s see an example of variable-length encoding on the next page and
see how well it fares compared with the constant-length encoding.
Variable-length encoding example
“apples are green”
9 different symbols: [a, p, l, e, s, r, g, n, _]

The probabilities of each letter:
  e: 0.25    a: 0.125   p: 0.125   _: 0.125   r: 0.125
  g: 0.0625  l: 0.0625  n: 0.0625  s: 0.0625

Encoding table:
  e 00    a 010   p 011   _ 100   r 101
  g 1100  l 1101  n 1110  s 1111

• You can see in the variable-length encoding table that we now allocate fewer
bits for the symbols that appear more often (higher probability) and more bits
for those that appear seldom (lower probability).

• Naturally, the question arises: when we see a sequence of bits, how do we know
where one symbol ends and the next begins? The trick is that no code-word is a prefix
of another one (a code-word is the sequence of bits we assign to a symbol in the
encoding table), so once we have fully matched a valid code-word we can be sure it ends right there.
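• A quick way to convince yourself that the table above really has this prefix property is to check every pair of code-words; a minimal Python sketch (illustration only):

```python
# Variable-length encoding table from the slide
code = {"e": "00", "a": "010", "p": "011", "_": "100", "r": "101",
        "g": "1100", "l": "1101", "n": "1110", "s": "1111"}

# Prefix property: no code-word may be the beginning of a different code-word
words = list(code.values())
prefix_free = not any(w1 != w2 and w2.startswith(w1)
                      for w1 in words for w2 in words)
print(prefix_free)                  # True -> decoding is unambiguous
```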
Same example, with variable-length encoding

Message to encode: “apples are green”

Encoding table:
  e 00    a 010   p 011   _ 100   r 101
  g 1100  l 1101  n 1110  s 1111

Encoded sequence (48 bits):
  010 011 011 1101 00 1111 100 010 101 00 100 1100 101 00 00 1110
   a   p   p   l    e   s   _   a   r   e   _   g    r   e  e   n
We’ve now used a total of 48 bits, which is the lowest number of bits possible, since we’ve reached the entropy of the message
(remember, it was 48 bits). It works out exactly here because every symbol probability is a power of ½; in general, variable-length
codes can only approach the entropy.
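• For completeness, a small Python sketch (again, just an illustration) that encodes the message with the table above and confirms the 48-bit total:

```python
# '_' in the slide's table stands for the space character; we key on ' ' directly
code = {"e": "00", "a": "010", "p": "011", " ": "100", "r": "101",
        "g": "1100", "l": "1101", "n": "1110", "s": "1111"}

message = "apples are green"
encoded = "".join(code[c] for c in message)
print(encoded)       # 010011011110100111110001010100100110010100001110
print(len(encoded))  # 48 bits -- equal to the entropy of the message
```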
Huffman coding exemplified
• The Huffman tree is built by repeatedly merging the two nodes with the lowest
probabilities into a new node whose probability is their sum, until a single
root (probability 1.0) remains.
• Then, starting from the tree root and going down each branch, assign “0” to the
left branch and “1” to the right branch. Do this recursively until all
branches are covered, down to the leaves.
[Huffman tree diagram: the leaves are the symbols with their probabilities
(e: 0.25; a, p, _, r: 0.125 each; g, l, n, s: 0.0625 each); pairs of nodes are
merged bottom-up into internal nodes (0.125, 0.25, 0.5) and finally the root
(1.0). Labelling left branches with 0 and right branches with 1 gives the
code-words.]

Encoding table:
  e 00    a 010   p 011   _ 100   r 101
  g 1100  l 1101  n 1110  s 1111
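• Here is a compact Python sketch of the procedure (a generic Huffman implementation, not code from the course): repeatedly merge the two least probable nodes, then walk the finished tree, assigning 0 to left branches and 1 to right branches. With the tie-breaking used below it happens to reproduce the table above; other tie-breaks would give different but equally optimal code-words.

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a Huffman code from a dict {symbol: probability}."""
    tiebreak = count()   # keeps heap entries comparable when probabilities are equal
    heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)    # take the two least probable nodes...
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))  # ...and merge them
    _, _, root = heap[0]

    codes = {}
    def assign(node, prefix):                # 0 for the left branch, 1 for the right
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")
            assign(node[1], prefix + "1")
        else:
            codes[node] = prefix
    assign(root, "")
    return codes

probs = {"e": 0.25, "a": 0.125, "p": 0.125, "_": 0.125, "r": 0.125,
         "g": 0.0625, "l": 0.0625, "n": 0.0625, "s": 0.0625}
print(huffman_code(probs))   # 2 bits for e, 3 for a/p/_/r, 4 for g/l/n/s
```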
A few notes on decoding
• So far we saw how to compress a message by removing redundancy
using Huffman encoding.
• One neat aspect of the Huffman tree is that it can be reused when
decoding a message:
• Start from the root of the tree
• read the next bit from the encoded message
• if the bit is 0, go to the left child; if it’s 1, go to the right child
• repeat until you reach a leaf node, and that’s your symbol.
• go back to the root
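• A minimal Python sketch of this decoding loop, using a nested-tuple tree shaped like the diagram (left child first, right child second); illustration only:

```python
# Huffman tree for the "apples are green" code, written as (left, right) pairs;
# leaves are the symbols themselves ('_' is spelled out as a real space here)
tree = (("e", ("a", "p")),                        # left half:  e=00, a=010, p=011
        ((" ", "r"), (("g", "l"), ("n", "s"))))   # right half: _=100, r=101, g/l/n/s=11xx

def decode(bits, tree):
    out = []
    node = tree                                    # start from the root of the tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]  # 0 -> left child, 1 -> right child
        if isinstance(node, str):                  # reached a leaf: that's our symbol...
            out.append(node)
            node = tree                            # ...go back to the root
    return "".join(out)

encoded = "010011011110100111110001010100100110010100001110"
print(decode(encoded, tree))                       # "apples are green"
```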
Further improving the compression ratio
• Suppose we have this message:
ABCABCABCABCD
• If we compute the probabilities of each symbol, we’ll get
P(A) = P(B) = P(C) = 4/13 ≈ 0.308; P(D) = 1/13 ≈ 0.077
• Applying Huffman encoding to this we’ll get
A 00 B 01 C 10 D 11 averaging 2 bits per symbol, thus the encoded message will have a
length of 26 bits total.
• However, we can see that up until we get to D, the same “ABC” sequence repeats itself 4
times, thus not giving much information at all. We could consider the group ABC (called a
3-gram) to be a single symbol, in which case we’d get these probabilities and encoding:
P(ABC) = 0.8; P(D) = 0.2
ABC 0   D 1
Our encoded message will now have a length of 5 bits, way better than 26 bits.
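• A tiny Python sketch of the comparison (illustrative only): per-character coding spends 2 bits on each of the 13 characters, while treating “ABC” as a single symbol leaves only 5 tokens that fit in 1 bit each.

```python
message = "ABCABCABCABCD"

# Per-character Huffman code (4 symbols -> 2 bits each)
char_code = {"A": "00", "B": "01", "C": "10", "D": "11"}
per_char = "".join(char_code[c] for c in message)
print(len(per_char))                  # 26 bits

# Treat the recurring 3-gram "ABC" as one symbol
tokens = message.replace("ABC", "X")  # "XXXXD": four ABC groups, then D
gram_code = {"X": "0", "D": "1"}
per_gram = "".join(gram_code[t] for t in tokens)
print(per_gram, len(per_gram))        # "00001", 5 bits
```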
Data compression in practice
• In practice Huffman encoding is rarely used by itself since better compression
can be achieved by combining it with various other methods:
• n-gram decomposition (like we saw in the previous page) considers recurring groups
of characters / bytes as single symbols
• dictionary encoding – is based on a large, precomputed dictionary that is used to
lookup commonly occurring sequences of bytes/chars and replace them with short
codes
• “deflate” is a method that dynamically detects recurring patterns and instead of re-
encoding them, simply writes in the output a short identifier for the whole sequence;
the dictionary is built on the fly. See https://en.wikipedia.org/wiki/Deflate
• Popular tools and formats like RAR, ZIP, gzip, 7-Zip, etc. use a combination of
these methods along with algorithms of their own (tar by itself only bundles
files and leaves the compression to tools like gzip).
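• For a quick taste of “deflate” in practice, here is a short Python sketch using the standard zlib module (just an illustration of the idea, not tied to any particular tool above):

```python
import zlib

# Highly repetitive input -> lots of redundancy for deflate to remove
data = b"apples are green, apples are green, apples are green. " * 50
compressed = zlib.compress(data, 9)     # DEFLATE at maximum compression level
print(len(data), "->", len(compressed), "bytes")

# Decompression restores the original exactly: this is lossless compression
assert zlib.decompress(compressed) == data
```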
I hope you
enjoyed this,
and…