Dereje Teferi (Dereje.Teferi@aau.edu.et)
Data Compression
Data compression, or source coding, is the process of
encoding information using fewer bits (or other
information-bearing units) than an un-encoded
representation would use, through the use of specific
encoding schemes. (Wikipedia)
The task of compression consists of two components,
an encoding algorithm that takes a message and
generates a “compressed” representation (hopefully with
fewer bits), and
a decoding algorithm that reconstructs the original
message or some approximation of it from the compressed
representation.
History of data compression
Morse code, invented in 1838 by Samuel Morse for use in
telegraphy, is an early example of data compression based
on using
shorter codewords for letters that are more common in the English
language, such as "e" (.) and "a" (.-), and
longer codewords for letters that occur less frequently, such as
"q" (--.-) and "j" (.---)
In the mid-19th century, a widely used form of Braille code
used the frequency of words rather than characters to achieve
compression.
A 2x3 array of dots is used to represent text
History…
Modern work on data compression began in
the late 1940s with the development of
information theory.
In 1949 Claude Shannon and Robert Fano
devised a systematic way to assign
codewords based on probabilities of blocks.
An optimal method for doing this was then
found by David Huffman in 1951.
History …
Early implementations were typically done in
hardware, with specific choices of codewords being
made as compromises between compression and
error correction.
In the mid-1970s, the idea emerged of dynamically
updating codewords for Huffman encoding, based on
the actual data encountered.
In the late 1970s, with online storage of text files
becoming common, software compression programs
began to be developed, almost all of them based
on adaptive Huffman coding.
History …
In 1977 Abraham Lempel and Jacob Ziv suggested the
basic idea of pointer-based encoding.
In the mid-1980s, following work by Terry Welch, the
so-called LZW (Lempel-Ziv-Welch) algorithm rapidly
became the method of choice for most general-purpose
compression systems.
It was used in programs such as PKZIP (PKUNZIP), as
well as in hardware devices such as modems.
In the late 1980s, digital images became more common,
and standards for compressing them emerged. Lossy
compression methods also began to be widely used.
History…
Current image compression standards include:
FAX CCITT 3 (run-length encoding, with codewords
determined by Huffman coding from a definite distribution of
run lengths);
GIF (LZW);
In General
Compression is a process of deriving more compact (i.e., smaller) representations of data.
Motivation
Massive amounts of data are involved in the storage and
transmission of text, sound, images, and videos in many applications
A typical hospital generates close to 1 terabit of data or more per
year, depending on the number of patients (and likely more these days)
NASA generates over 1 terabyte of data per day
A 2-hour video ≈ 1.3 terabits
Trade-offs
A compression scheme for video may require fast,
expensive hardware if the video must be decompressed
quickly enough to be viewed while it is being decompressed
Trade-offs include
Degree of compression,
Amount of distortion introduced (if using a lossy
compression scheme), and,
Computational resources required to compress and
decompress the data.
Speed of compression and decompression
Probability
All compression algorithms assume that there is
some bias on the input messages so that some inputs
are more likely than others,
i.e. there is some unbalanced probability distribution
over the possible messages.
Most compression algorithms base this “bias” on the
structure of the messages
i.e., an assumption that repeated characters are more
likely than random characters, or that large white
patches occur in “typical” images.
Compression is therefore all about probability.
Lossless Vs Lossy compression
Lossless compression algorithms usually exploit statistical
redundancy in such a way as to represent the sender's data
more concisely without error.
Lossless compression is possible because most real-world
data has statistical redundancy.
For symbolic data such as spreadsheets, text, executable
programs, etc., losslessness is essential because changing
even a single bit cannot be tolerated (except in some
limited cases).
For example, in English text, the letter 'e' is much more
common than the letter 'z', and the probability that the
letter 'q' will be followed by the letter 'z' is extremely small,
whereas the probability that 'q' is followed by 'u' is very high.
Cont…
Lossy data compression, or perceptual coding, is possible if
some loss of fidelity is acceptable.
Generally, a lossy data compression will be guided by
research on how people perceive the data in question.
For example, the human eye is more sensitive to subtle
variations in luminance than it is to variations in color.
Lossy data compression provides a way to obtain the best
quality for a given amount of compression.
In some cases, transparent (unnoticeable) compression is
desired;
In other cases, quality is sacrificed to reduce the amount of data
as much as possible.
Cont…
Lossless compression schemes are reversible so that
the original data can be reconstructed, while lossy
schemes accept/tolerate some loss of data in order to
achieve higher compression.
For visual and audio data, some loss of quality can
be tolerated without losing the essential nature of the
data.
By taking advantage of the limitations of the human
sensory system, a great deal of space and bandwidth
can be saved while producing an output which is
nearly indistinguishable from the original.
Cont…
An example of lossless vs. lossy compression is the
following:
Take the number 25.888888888
A lossless encoding could be 25.[9]8 (read as: nine 8s after the
decimal point), from which the original number can be reconstructed
exactly.
A lossy encoding could be 26 (rounded), which is shorter still but
loses the original value.
Measure of performance
The performance of a compression algorithm can be
measured in a number of ways
Complexity of algorithm
Memory requirement
Speed
Amount of compression and,
Similarity of compressed and original data (for
lossy)
Performance…
Compression ratio
bitrate = (number of bits in the coded bitstream) / (number of samples)
CR = (size of the uncompressed signal) / (size of the coded bitstream)
Signal-to-Noise Ratio (SNR) is used in the case of lossy
compression.
Let I be an original signal (e.g., an image), and R be its
lossily reconstructed counterpart. SNR is defined to be:
SNR = 20 * log10(||I|| / ||I - R||)
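As a quick illustration (a minimal sketch, not part of the slides), the following Python code computes these measures, assuming the signals are stored as NumPy arrays; the function names are illustrative.

```python
import numpy as np

def bitrate(num_coded_bits: int, num_samples: int) -> float:
    """Average number of bits spent per sample in the coded bitstream."""
    return num_coded_bits / num_samples

def compression_ratio(uncompressed_bits: int, coded_bits: int) -> float:
    """CR = size of uncompressed signal / size of coded bitstream."""
    return uncompressed_bits / coded_bits

def snr_db(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """SNR = 20 * log10(||I|| / ||I - R||), in decibels."""
    noise = np.linalg.norm(original - reconstructed)
    return 20 * np.log10(np.linalg.norm(original) / noise)

# Example: an 8-bit grayscale image coded at 2 bits/pixel with small errors.
I = np.random.randint(0, 256, size=(64, 64)).astype(float)
R = I + np.random.normal(0, 1.0, size=I.shape)      # lossy reconstruction
print(bitrate(2 * I.size, I.size))                   # 2.0 bits per pixel
print(compression_ratio(8 * I.size, 2 * I.size))     # 4.0
print(round(snr_db(I, R), 1))                        # roughly 43 dB here
```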
The Model and Coder
The development of data compression algorithms
for a variety of data can be divided into two phases:
modeling and coding
The model is the compression component that somehow
captures the probability distribution of the messages by
knowing or discovering the redundancy and the structure
of the input.
The coder component then takes advantage of the
probability biases generated in the model to generate
codes.
The coder does this by effectively lengthening low
probability messages and shortening high-probability
messages.
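As an illustration of this split (a sketch under simple assumptions, not the slides' own code): the model below estimates symbol probabilities from the input, and the coder assigns each symbol an ideal code length of roughly -log2 p bits (Shannon coding), so high-probability symbols get short codes and low-probability symbols get long ones.

```python
from collections import Counter
from math import log2, ceil

def model(message: str) -> dict[str, float]:
    """Model: estimate a probability for each symbol from its frequency."""
    counts = Counter(message)
    total = sum(counts.values())
    return {sym: c / total for sym, c in counts.items()}

def coder(probs: dict[str, float]) -> dict[str, int]:
    """Coder: assign each symbol a code length of ceil(-log2 p) bits,
    so frequent symbols get shorter codes."""
    return {sym: ceil(-log2(p)) for sym, p in probs.items()}

probs = model("abracadabra")
lengths = coder(probs)
print(probs)    # 'a' is the most probable symbol
print(lengths)  # 'a' gets the shortest code length (2 bits here)
```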
Strategies for compression
Symbol-Level Representation Redundancy
Different symbols occur with different frequencies
Block-Level Representation Redundancy
Different blocks of data occur with varying frequencies
In these cases it is better to code blocks than individual
symbols
The block size can be fixed or variable
The block-code size also can be fixed or variable
Frequent blocks are better coded with short codes
Example techniques:
Block-oriented Huffman,
Inter-Pixel Temporal Redundancy
(in Video)
Often, the majority of corresponding pixels in
successive video-frames are identical/highly similar
over long spans of frames
A scene change significantly affects this assumption
Moreover, blocks of pixels often change position due to
motion, but not their values, between successive frames
Thus, block-oriented motion-compensated
redundancy reduction techniques are used for video
compression.
Visual Redundancy
The human visual system (HVS) has certain limitations
that make many image contents almost invisible.
Those contents, termed visually redundant, are the target
of removal in lossy compression.
In fact, the HVS can only see within a small range of spatial
frequencies: roughly 1-60 cycles per degree
Approach for reducing visual redundancy in lossy
compression
Transform: Convert the data to the frequency domain
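A minimal sketch of this transform step in Python, using SciPy's 2-D DCT as in JPEG-style coding; the 8x8 block and the coarse quantization step are illustrative assumptions rather than anything prescribed by the slide.

```python
import numpy as np
from scipy.fft import dctn, idctn

# An 8x8 block of image samples (values 0-255), here a smooth gradient.
block = np.tile(np.arange(0, 256, 32, dtype=float), (8, 1))

# Transform: convert the block to the frequency domain with a 2-D DCT.
coeffs = dctn(block, norm='ortho')

# Most of the energy concentrates in a few low-frequency coefficients;
# coarsely quantizing them discards visually redundant detail.
quantized = np.round(coeffs / 16) * 16

# Inverse transform: the reconstruction stays visually close to the original.
reconstructed = idctn(quantized, norm='ortho')
print(np.max(np.abs(block - reconstructed)))  # small vs. the 0-255 range
```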
Self information
Shannon defined the self-information of a message s ∈ S as
i(s) = log2(1/p(s)) = -log2 p(s) bits
The entropy of the set S of messages is then the expected self-information:
H(S) = Σ_{s ∈ S} p(s) log2(1/p(s))
Here are some examples of entropies for different
probability distributions over five messages.
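For instance, the short Python sketch below computes H(S) for a few probability distributions over five messages (the specific distributions shown are illustrative, not taken from the original table).

```python
from math import log2

def self_information(p: float) -> float:
    """i(s) = log2(1/p(s)) bits."""
    return log2(1 / p)

def entropy(probs: list[float]) -> float:
    """H(S) = sum over s of p(s) * log2(1/p(s)) bits per message."""
    return sum(p * self_information(p) for p in probs)

distributions = {
    "uniform":       [0.2, 0.2, 0.2, 0.2, 0.2],
    "mildly skewed": [0.4, 0.3, 0.1, 0.1, 0.1],
    "highly skewed": [0.75, 0.0625, 0.0625, 0.0625, 0.0625],
}
for name, probs in distributions.items():
    print(f"{name}: H = {entropy(probs):.3f} bits")
# The uniform distribution gives the maximum, log2(5) ~ 2.322 bits;
# the more skewed the distribution, the lower the entropy.
```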
Huffman Coding
Huffman codes are optimal prefix codes generated
from a set of probabilities by a particular algorithm.
David Huffman developed the algorithm as a student
at MIT in 1951.
The algorithm is now probably the most prevalently
used component of compression algorithms, used as
the back end of GZIP, JPEG and many other utilities.
The Huffman algorithm is very simple and is most
easily described in terms of how it generates the
prefix-code tree.
Huffman Coding …
Start with a forest of trees, one for each message. Each tree
contains a single vertex with weight w_i = p_i
Repeat until only a single tree remains:
Select the two trees whose roots have the lowest weights, w_1 and w_2, and
Replace them with a new tree whose root has weight w_1 + w_2 and whose
two subtrees are the selected trees.
The codeword for each message is then given by the path from the final
root to the message's leaf (e.g., 0 for a left branch, 1 for a right branch).
Huffman Coding …
The expected length of a Huffman code is within one bit of the
entropy: it is at most H(S) + 1. When H(S) is small (less than 1),
this extra bit per message is a significant overhead, which motivates
coding several messages together.
It is also possible to combine messages to get
optimal prefix codes, especially if all the messages are
from the same probability distribution
Consider a distribution of six possible messages.
We could generate probabilities for all 36 pairs by
multiplying the probabilities of each message (there will
be at most 21 unique probabilities).
A Huffman code can now be generated for this new
probability distribution and used to code two messages
at a time.
Huffman Coding Example
The following example is based on a data source using a set of five different
symbols. The symbol frequencies are:

Symbol   Frequency
A        24
B        12
C        10
D        8
E        8
----> total 186 bits (62 symbols at 3 bits per codeword)

Code tree according to Huffman:

Symbol   Frequency   Code   Code Length   Total Length
A        24          0      1             24
B        12          100    3             36
C        10          101    3             30
D        8           110    3             24
E        8           111    3             24
------------------------------------------------------------
Total: 186 bits (fixed 3-bit code)  vs.  138 bits (Huffman code)
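A minimal Python sketch of this construction, run on the frequencies above (it uses heapq; tie-breaking may yield different bit patterns than the table, but the code lengths and the 138-bit total are the same).

```python
import heapq
from itertools import count

def huffman_codes(freqs: dict[str, int]) -> dict[str, str]:
    """Build a Huffman prefix code from symbol frequencies."""
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # two lowest-weight trees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
codes = huffman_codes(freqs)
total = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes)   # A gets a 1-bit code; B, C, D, E get 3-bit codes
print(total)   # 138 bits, versus 62 * 3 = 186 bits with a fixed-length code
```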
Arithmetic coding
Arithmetic coding is a technique that allows the information
from the messages in a message sequence to be combined to
share the same bits.
The technique allows the total number of bits sent to approach
the sum of the self information of the individual messages
That is, Arithmetic Coding achieves a bit rate almost equal to the entropy
It codes the whole input sequence, rather than individual symbols, into one
codeword
The Conceptual Main Idea
For each binary input sequence of n bits, divide the unit interval into 2^n
intervals, where the length of the i-th interval I_i is the probability of the
i-th n-bit binary sequence
Code the i-th binary sequence by l1, l2, ..., lt, where 0.l1l2...lt is the binary
representation of the left end of interval I_i and
t = ⌈-log2(Prob(input sequence))⌉
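A small Python sketch of this interval idea for an i.i.d. binary source (the probability p0 = 0.8 is an illustrative assumption; a practical arithmetic coder would also need renormalization and an extra bit or two to guarantee unique decodability).

```python
from math import ceil, log2

def interval_code(bits: str, p0: float = 0.8) -> str:
    """Map a binary sequence to the sub-interval of [0, 1) whose length is
    its probability under an i.i.d. model, then emit t = ceil(-log2 P)
    bits of the interval's left endpoint."""
    low, width = 0.0, 1.0
    for b in bits:
        if b == "0":                 # the '0' branch takes the left part
            width *= p0
        else:                        # the '1' branch takes the right part
            low += width * p0
            width *= 1.0 - p0
    t = ceil(-log2(width))           # number of output bits
    code, frac = "", low
    for _ in range(t):               # binary expansion of the left endpoint
        frac *= 2
        bit, frac = divmod(frac, 1)
        code += str(int(bit))
    return code

print(interval_code("00000000"))  # high-probability input -> 3 bits
print(interval_code("01101101"))  # low-probability input  -> 13 bits
```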
Example
Consider sending a thousand messages each having
probability 0.999. Using a Huffman code, each
message has to take at least 1 bit, requiring 1000 bits
to be sent.
On the other hand, the self-information of each message is
log2(1/p(s)) ≈ 0.00144 bits, so the sum of this self-information
over 1000 messages is only about 1.4 bits.
That is, arithmetic coding can send all the messages using
only about 3 bits, a factor of hundreds fewer than the Huffman coder.