Data compression

Introduction
• Data compression is a subfield of Information Theory.
• Information Theory is the mother of all computer science and
transmission systems and networks (such as mobile phones, TV, radio,
computer networks, you name it). It was hinted at by Harry Nyquist
and Ralph Hartley in the 1920s, but fully developed by the one known
as “the father of information theory”, Claude Shannon, in the 1940s.
• Useful links:
• Information Theory: https://en.wikipedia.org/wiki/Information_theory
• Data compression: https://en.wikipedia.org/wiki/Data_compression
• Claude Shannon: https://en.wikipedia.org/wiki/Claude_Shannon
What is (and is not) data compression
• Data compression is a method used to reduce the apparent size* of a
message
• Data compression is not a magical tool that makes more bits fit into fewer bits, since that's impossible
• Data compression can be done in two ways:
• lossless (no information is lost in the process; the compression ratio is limited by the message's entropy)
• lossy (some information is lost in the process, but the compression ratio can get arbitrarily high, trading off quality)

--------------------------------------------------------
* we’ll talk about this later, don’t worry
What was that “apparent size” all about?
• The apparent size of a message is the number of bits that are used to
represent the message.
• The message itself usually contains less information than its apparent
size, that’s why it can be compressed.
• The amount of information in a message is a measurable quantity which
depends on the symbols that make up the message and their rate of
occurrence, and is called entropy.
• In physics, “entropy” refers to the amount of disorder in a system, roughly how many microscopic configurations the system could be in. The analogy with Information Theory is close: there, “entropy” means the amount of information in a message, or its amount of “irregularity”.
Let’s explain Entropy
• In information theory, we usually talk about “sources” (or “random variables”)
rather than “messages”. A source is like a stream that “emits” symbols in time. A
message is the sequence of symbols emitted by a source thus far. A “random
variable” is a variable that takes the value of each symbol emitted by a source.
It’s called random because we don’t know what the source will emit next.
• Entropy is usually attributed to a source rather than a message, but it can be used
for both, meaning pretty much the same thing.
• The entropy of a source represents the amount of uncertainty (or randomness)
attributed to the source (we’ll see soon that uncertainty = information), or the
amount of surprise that the source holds.
Entropy of a source
• Example 1:
• suppose we have a source that can emit two symbols, A or B. We know that 95% of
the time it emits A, and only 5% of the time it emits B
• Now looking at the source, we wait for the next symbol to be emitted. Based on
what we know about the source, we pretty much expect the source to output an A,
thus, there’s not much uncertainty involved – we can call this source pretty “boring”
- its entropy is low.
• Example 2:
• we now move onto a different source that can output any one of the letters from A
to Z with an equal probability.
• Now we get excited because there’s quite an element of surprise here, there’s no
telling what the source might emit next – much uncertainty (or surprise), thus, this
source’s entropy is high.
So, to recap:

Low-entropy source:
• Predictable
• Few surprises
• Boring
• Example: AAAAAAAAABAAAAAAAAAABAAAAAA… (we can assume the next symbol will probably be A)

High-entropy source:
• Unpredictable
• A lot of surprises
• Exciting
• Example: AGPEJXZSNWPEQBJSLEJFOI… (who knows what comes next?)
Entropy is quantifiable
• We said the entropy can be measured, or computed, and Father Shannon tells us how to do that:

  H(X) = - Σi P(xi) · log P(xi)        (the sum runs over all n possible symbols x1 … xn)

• Here H(X) represents the entropy of X, where X is a “random variable” which stands for the value of the symbols emitted by our source.
• P(xi) is the probability that X will take the value xi (there are n possible values for X: x1 … xn).
• The logarithm can use various bases depending on the application, but for our purposes we'll use base 2, in which case the unit of measure for entropy will be the “bit” (isn't that neat? :-)
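• To make the formula concrete, here's a minimal Python sketch of it (the name entropy() is just illustrative, not a standard function):

import math

def entropy(probabilities):
    # H(X) = -sum of P(xi) * log2(P(xi)) over all possible symbols, in bits
    return -sum(p * math.log2(p) for p in probabilities if p > 0)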
Let's apply the entropy formula to our examples
• First, the “boring” source: it can emit A (with a probability of 95%) or B (with a probability of 5%). Thus:
  • n = 2, x1 = A, x2 = B
  • P(x1) = 0.95, P(x2) = 0.05
  • H(X) = -[P(x1)·log2 P(x1) + P(x2)·log2 P(x2)]
         = -[0.95 · (-0.074) + 0.05 · (-4.322)]
         = 0.286 bit
• Second, the “exciting” source: it can emit any letter A..Z with equal probability (100% / 26 ≈ 3.85% per symbol):
  • n = 26, x1 = A, x2 = B, …, x26 = Z
  • P(x1) = P(x2) = … = P(x26) ≈ 0.0385
  • H(X) = -[P(x1)·log2 P(x1) + P(x2)·log2 P(x2) + … + P(x26)·log2 P(x26)]
         = -26 · (0.0385 · log2(0.0385)) = -26 · (0.0385 · (-4.698))
         = 4.70 bit
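• A quick check of both results, reusing the entropy() helper sketched on the previous slide (the outputs match the hand calculation up to rounding):

print(round(entropy([0.95, 0.05]), 3))   # "boring" source: 0.286 bit
print(round(entropy([1 / 26] * 26), 2))  # "exciting" source: 4.7 bit (= log2(26))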
So why’s entropy important?
• The entropy tells us the amount of information (on average) that a source puts out
(or the amount of surprise we can expect with each new symbol).
• It can also be thought of as “how many bits do you need to represent an average
symbol from the source”
• In the second example we got H = 4.70; rounding that up to 5 gives us 5 bits (2^5 = 32 possible values) to hold a symbol from our “exciting” source, which has 26 possible symbols.
• Considering the first example (the “boring” source), if we had allocated 8 bits per character (symbol) to hold what that source emits, we would have wasted a lot of bits, since the source's entropy is only 0.286 bits.
• Computing the entropy gives us a rough idea of how much space is wasted when storing a message, comparing its “real information content” with its “apparent size”.
More entropy
• So far we’ve discussed the entropy of a source.
• A symbol also has an associated entropy, which is computed from its probability of occurrence:
  • H(xi) = -P(xi) · log2 P(xi)
• Considering this, you can see that the entropy of the source is simply
the sum of entropies of the possible values emitted by it.
• There's also the entropy of a message, which means the amount of information contained in that message. It is computed by treating the message as if it came from a source whose symbol probabilities are the frequencies observed in the message itself.
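• As a small sketch (assuming the same base-2 logarithm as before), the per-symbol entropies of the “boring” source really do add up to its total entropy:

import math

def symbol_entropy(p):
    # H(xi) = -P(xi) * log2(P(xi))
    return -p * math.log2(p)

print(symbol_entropy(0.95) + symbol_entropy(0.05))   # ~0.286 bit, the source entropy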
Entropy of a message
• Consider this text (16 chars): “apples are green”
• If we use 8 bits per character to represent this text, its apparent size will be 128 bits.
• The probabilities of each letter are:
  a: 0.125    e: 0.25     g: 0.0625   l: 0.0625   n: 0.0625
  p: 0.125    r: 0.125    s: 0.0625   space: 0.125
• Computing the entropy of this text gives 48 bits.
• Obviously we're wasting a lot of space, and we could save up to 80 bits if only we could represent this data more efficiently.
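• The 48-bit figure can be reproduced with a few lines of Python: compute the empirical probability of each symbol in the text, then multiply the per-symbol entropy by the message length (a sketch under exactly those assumptions):

import math
from collections import Counter

msg = "apples are green"
counts = Counter(msg)                                # occurrences of each symbol
probs = [c / len(msg) for c in counts.values()]      # empirical probabilities
per_symbol = -sum(p * math.log2(p) for p in probs)   # 3.0 bits per symbol on average
print(per_symbol * len(msg))                         # 48.0 bits for the whole message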
Entropy and redundancy
• We've established that the entropy of a message can be less than its apparent size. What remains is called redundancy.

  Apparent Size = Entropy + Redundancy

  Entropy: the amount of information contained in the message.
  Redundancy: the bits that don't carry any information.
Data Compression
• It should be obvious by now that compression can be attained by removing (or
reducing) redundancy.
• Lossless compression is limited by the message’s entropy, since we can only
remove the redundancy – the amount of actual information (entropy) cannot be
reduced, otherwise we’d lose information which can’t be recreated later out of
nothing.
• Lossy compression is theoretically unlimited, since we can discard an arbitrarily
large amount of actual information to reduce the size.
• In practice, lossy compression techniques discard the least relevant information and keep
the most important part. The recreated message after decompression will not be 100%
accurate to the original, it will have some missing “details” – for example see JPEG
compression which discards fine details in the image for the sake of size.
• In the next part we’ll discuss lossless compression.
Lossless Compression
• So how do you actually remove redundancy?
• Let’s start with an obvious example:
• We have this message “apples are green”, encoded as 8 bit per character
ASCII – thus 128 bits total.
• We’ve already established that the entropy of this message is only 48 bits.
• Dividing 48 by 16 (the number of symbols in the message) we get
• 48 / 16 = 3 bits per char on average.
• With 3 bits we can represent 2^3 = 8 different symbols, but we have 9 symbols in the
text (including space), so we’ll need one more bit
• We settle on 4 bits per character, which can represent 16 different characters.
• We create a custom encoding for the message using 4 bits per character
• … continued on next page
Lossless compression example
“apples are green”
9 different symbols: [a, p, l, e, s, r, g, n, _]
4 bits per character to hold 9 possible values

4-bit encoding table:
  a 0000   e 0011   g 0110
  p 0001   s 0100   n 0111
  l 0010   r 0101   _ 1000

We can now represent our message using the following bit sequence:
0000 0001 0001 0010 0011 0100 1000 0000 0101 0011 1000 0110 0101 0011 0011 0111
 a    p    p    l    e    s    _    a    r    e    _    g    r    e    e    n

We used a total of 64 bits, half of the message's initial apparent size of 128 bits. Not bad, eh?
But this is still suboptimal: the entropy of the message was only 48 bits, and moreover the average entropy per symbol was 3 bits, while we used 4. Unfortunately, this is the limit of constant-length encoding (we used the same bit length for each symbol, taking the smallest number of bits possible).
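• A short sketch of this constant-length encoding (the table is the one above, with '_' standing for the space character):

msg = "apples are green"
table = {'a': '0000', 'p': '0001', 'l': '0010', 'e': '0011', 's': '0100',
         'r': '0101', 'g': '0110', 'n': '0111', ' ': '1000'}

bits = ''.join(table[ch] for ch in msg)   # constant-length: 4 bits per symbol
print(len(bits))                          # 64 bits, half of the 128-bit apparent size
print(bits)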
Improving the compression ratio with variable-length encoding
• In the previous example we used a constant-length encoding table – the
same number of bits to represent each symbol. This would work well if
all symbols had the same probability.
• However, the probabilities of the symbols in our example differ
considerably, so if we could allocate fewer bits for the symbols that
appear more often (have a higher probability) and more bits for
symbols that appear less often (lower probability), then we would
achieve a better compression ratio, approaching entropy.
• Let’s see an example of variable-length encoding on the next page and
see how well it fares compared with the constant-length encoding.
Variable-length encoding example
“apples are green”
9 different symbols: [a, p, l, e, s, r, g, n, _]

The probabilities of each letter:
  e: 0.25     a: 0.125    p: 0.125    _: 0.125    r: 0.125
  g: 0.0625   l: 0.0625   n: 0.0625   s: 0.0625

Encoding table:
  e 00     a 010    p 011    _ 100    r 101
  g 1100   l 1101   n 1110   s 1111

• You can see in the variable-length encoding table that we now allocate fewer bits to symbols that appear more often (higher probability) and more bits to those that appear seldom (lower probability).
• Naturally, the question arises: when we see a sequence of bits, how do we know where one symbol ends and the next begins? The trick is that no code-word is a prefix of another one (a code-word is the sequence of bits we assign to a symbol in the encoding table), so once we have fully matched a valid code-word we can be sure it ends right there.
Same example, with variable-length encoding
Message to encode: “apples are green”

Encoding table:
  e 00     a 010    p 011    _ 100    r 101
  g 1100   l 1101   n 1110   s 1111

Encoded sequence:
010 011 011 1101 00 1111 100 010 101 00 100 1100 101 00 00 1110
 a   p   p   l    e   s    _   a   r   e   _   g    r   e  e  n

We've now used a total of 48 bits which, not coincidentally, is the lowest number of bits possible: we've reached the entropy of the message (remember, it was 48 bits). This works out exactly here because every symbol's probability is a power of 1/2, so each code-word length matches its symbol's information content.

Decoding the sequence goes like this:
• read bit by bit until we've formed a known code-word (in this case, we read until we get “010”, which is the code-word for “a”)
• once we've matched a code-word, we can be positive that this is the correct symbol, since we know that this code-word cannot be a prefix of another one
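• Here is a minimal prefix-code decoder sketch that follows exactly this rule: accumulate bits until they match a known code-word, emit that symbol, then start over with an empty buffer.

codes = {'e': '00', 'a': '010', 'p': '011', '_': '100', 'r': '101',
         'g': '1100', 'l': '1101', 'n': '1110', 's': '1111'}
by_codeword = {code: sym for sym, code in codes.items()}

def decode(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in by_codeword:           # a complete code-word has been matched
            out.append(by_codeword[buf])
            buf = ''
    return ''.join(out)

print(decode("010011011110100111110001010100100110010100001110"))
# prints: apples_are_green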
But how do you construct the encoding?
• Good question! Fortunately, there's a very simple method, developed by David Huffman (https://en.wikipedia.org/wiki/Huffman_coding)
• The Huffman algorithm is based on cleverly distributing all the
symbols in a binary tree, depending on their probabilities. Each
symbol will be the sole occupant of a leaf node, and each leaf node
has exactly one symbol.
• Because the route from the tree's root to any leaf cannot be a subpath (prefix) of another route to another leaf (that would imply our initial leaf is not a leaf at all), we can derive the code-words from this tree with the prefix property preserved.
Enter Huffman
• The algorithm goes like this:
• put each symbol into a root node, along with its probability
• while there is more than one root node:
• combine the two root nodes with the smallest probabilities into a new root node which
becomes their parent and gets the sum of their probabilities
• Congrats, you have a Huffman tree with a single root node!

• Starting from the tree root, going down each branch, assign “0” to the
left branch and “1” to the right branch. Do this recursively until all
branches are covered to the leaves.
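• A compact sketch of the algorithm in Python, using a heap to repeatedly pull out the two smallest roots. The exact code-words it produces may differ from the table on the next slide (ties between equal probabilities can be merged either way), but the code-word lengths will be the same.

import heapq

def huffman_codes(prob):
    # each heap entry: (probability, tie-breaker, code table for that subtree)
    heap = [(p, i, {sym: ''}) for i, (sym, p) in enumerate(prob.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)                       # smallest root
        p2, _, right = heapq.heappop(heap)                      # second smallest root
        merged = {s: '0' + c for s, c in left.items()}          # left branch gets "0"
        merged.update({s: '1' + c for s, c in right.items()})   # right branch gets "1"
        heapq.heappush(heap, (p1 + p2, next_id, merged))        # new parent root
        next_id += 1
    return heap[0][2]

probs = {'e': 0.25, 'a': 0.125, 'p': 0.125, '_': 0.125, 'r': 0.125,
         'g': 0.0625, 'l': 0.0625, 'n': 0.0625, 's': 0.0625}
print(huffman_codes(probs))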
Huffman coding exemplified

[Huffman tree diagram: the symbols e (0.25), a, p, _, r (0.125 each) and g, l, n, s (0.0625 each) sit in the leaves; at each step the two roots with the smallest probabilities are merged into a parent carrying their sum, until a single root of probability 1.0 remains. Every left branch is labeled “0” and every right branch “1”.]

Encoding table:
  e 00     a 010    p 011    _ 100    r 101
  g 1100   l 1101   n 1110   s 1111
A few notes on decoding
• So far we saw how to compress a message by removing redundancy
using Huffman encoding.
• One neat aspect of the Huffman tree is that it can be reused when
decoding a message:
• Start from the root of the tree
• read the next bit from the encoded message
• if the bit is 0, go to the left child; if it's 1, go to the right child
• repeat until you reach a leaf node, and that’s your symbol.
• go back to the root
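• The same steps as a code sketch, assuming a simple Node class with left/right children and the symbol stored only in leaf nodes (the class is illustrative, not from the slides):

class Node:
    def __init__(self, symbol=None, left=None, right=None):
        self.symbol, self.left, self.right = symbol, left, right

def decode_with_tree(bits, root):
    out, node = [], root
    for b in bits:
        node = node.left if b == '0' else node.right   # 0 -> left child, 1 -> right child
        if node.symbol is not None:                    # reached a leaf: that's our symbol
            out.append(node.symbol)
            node = root                                # go back to the root
    return ''.join(out)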
Further improving the compression ratio
• Suppose we have this message:
ABCABCABCABCD
• If we compute the probabilities of each symbol, we’ll get
P(A) = P(B) = P(C) = 0.308; P(D) = 0.077
• Applying Huffman encoding to this we’ll get
A 00 B 01 C 10 D 11 averaging 2 bits per symbol, thus the encoded message will have a
length of 26 bits total.
• However, we can see that up until we get to D, the same “ABC” sequence repeats itself 4
times, thus not giving much information at all. We could consider the group ABC (called a
3-gram) to be a single symbol, in which case we’d get these probabilities and encoding:
P(ABC)=0.8 P(D) = 0.2
ABC 0 D 1
Our encoded message will now have a length of 5 bits, way better than 26 bits.
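• A quick sanity check of the two sizes (a sketch; the 3-gram split of the message is written out by hand):

msg = "ABCABCABCABCD"
per_char = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
print(len(''.join(per_char[c] for c in msg)))       # 26 bits, per-character encoding

tokens = ['ABC', 'ABC', 'ABC', 'ABC', 'D']          # the same message, viewed as 3-grams + D
per_gram = {'ABC': '0', 'D': '1'}
print(len(''.join(per_gram[t] for t in tokens)))    # 5 bits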
Data compression in practice
• In practice Huffman encoding is rarely used by itself since better compression
can be achieved by combining it with various other methods:
• n-gram decomposition (like we saw in the previous page) considers recurring groups
of characters / bytes as single symbols
• dictionary encoding – based on a large, precomputed dictionary that is used to look up commonly occurring sequences of bytes/chars and replace them with short codes
• “deflate” is a method that dynamically detects recurring patterns and, instead of re-encoding them, simply writes a short identifier for the whole sequence in the output; the dictionary is built on the fly. See https://en.wikipedia.org/wiki/Deflate
• Popular archivers and compressors like RAR, ZIP, gzip, 7-Zip, etc. combine these methods, sometimes along with proprietary algorithms.
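• As a practical illustration, Python's built-in zlib module implements DEFLATE, so a lossless round trip looks like this (the repeated sample text is just for demonstration):

import zlib

data = b"apples are green " * 100
packed = zlib.compress(data)                   # DEFLATE: on-the-fly dictionary + Huffman coding
print(len(data), "->", len(packed), "bytes")   # much smaller, thanks to the repetition
assert zlib.decompress(packed) == data         # lossless: the original is recovered exactly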
I hope you enjoyed this, and…

Thanks for bearing with me for this long!
