
Dereje Teferi

Dereje.Teferi@aau.edu.et

1
Data Compression
 data compression or source coding is the process of
encoding information using fewer bits (or other
information-bearing units) than an un-encoded
representation would use, through use of specific
encoding schemes. (Wikipedia)
 The task of compression consists of two components,
 an encoding algorithm that takes a message and
generates a “compressed” representation (hopefully with
fewer bits), and
 a decoding algorithm that reconstructs the original
message or some approximation of it from the compressed
representation.
2
History of data compression
 Morse code, invented in 1838 by Samuel Morse for use in
telegraphy, is an early example of data compression based on using
 shorter codewords for letters such as "e (.)" and "a (.-)" that are more
common in the English language, and
 longer codewords for letters that occur less frequently, such as "q (--.-)" and "j (.---)"
 In the mid-19th century, a widely used form of Braille code
used frequency of words rather than characters to achieve
compression.
 A 2x3 array of dots is used to represent text
 Different letters are represented by different combinations of
raised and flat dots

3
History…
 Modern work on data compression began in
the late 1940s with the development of
information theory.
 In 1949 Claude Shannon and Robert Fano
devised a systematic way to assign
codewords based on probabilities of blocks.
 An optimal method for doing this was then
found by David Huffman in 1951.

4
History …
 Early implementations were typically done in
hardware, with specific choices of codewords being
made as compromises between compression and
error correction.
 In the mid-1970s, the idea emerged of dynamically
updating codewords for Huffman encoding, based on
the actual data encountered.
 In the late 1970s, with online storage of text files
becoming common, software compression programs
began to be developed, almost all of them based
on adaptive Huffman coding.
5
History …
 In 1977 Abraham Lempel and Jacob Ziv suggested the
basic idea of pointer-based encoding.
 In the mid-1980s, following work by Terry Welch, the
so-called LZW (Lempel-Ziv-Welch) algorithm rapidly
became the method of choice for most general-purpose
compression systems.
 It was used in programs such as PKZIP (PKUNZIP), as
well as in hardware devices such as modems.
 In the late 1980s, digital images became more common,
and standards for compressing them emerged. Lossy
compression methods also began to be widely used.
6
History…
 Current image compression standards include:
 FAX CCITT 3 (run-length encoding, with codewords
determined by Huffman coding from a definite distribution of
run lengths);
 GIF (LZW);
 JPEG (lossy discrete cosine transform, then Huffman or
arithmetic coding);
 BMP (run-length encoding, etc.);
 TIFF (LZW / ZIP; both ZIP and LZW are lossless
compression methods).

7
In General
 Compression
 is a process of deriving more compact (i.e., smaller) representations of data.
 This compact representation is created by identifying and using structures
that exist in the data
 Goal of Compression
 Significant reduction in the data size to reduce the storage/bandwidth
requirements
 Constraints on Compression
 Perfect or near-perfect reconstruction (lossless/lossy)
 Strategies for Compression
 Reducing redundancies
 Exploiting the characteristics of human hearing/vision
8
History …

 Typical compression ratios currently achieved:
 For text, around 3:1 to 4:1,
 For line diagrams and text images, around 3:1 to 4:1, and
 For photographic images, around 2:1 to 3:1
lossless, and 20:1 to 30:1 lossy.

9
Motivation
 Massive Amounts of Data Involved in Storage/ Transmission of
Text, Sound, Images, and Videos in Many Applications
 A typical hospital generates close to 1 terabit of data or more per year,
depending on the number of patients (likely even more these days)
 NASA generates over 1 terabyte of data per day
 A 2-hour video = 1.3 terabits
 Average video transmission speed = 180 Mb/s
 With MPEG-1 (1.5 Mb/s), a compression ratio of 120:1 is needed
 With MPEG-2 (4-10 Mb/s), a compression ratio of 18-45:1 is needed
 Compression helps to reduce the consumption of expensive
resources, such as storage space and/or transmission bandwidth.

10
Trade-offs
 A compression scheme for video may require fast,
expensive hardware for the video to be decompressed
quickly enough to be viewed while it is being decompressed
 trade-offs include
 Degree of compression,
 Amount of distortion introduced (if using a lossy
compression scheme), and,
 Computational resources required to compress and
decompress the data.
 Speed of compression and decompression

11
Probability
 All compression algorithms assume that there is
some bias on the input messages so that some inputs
are more likely than others,
 i.e. there is some unbalanced probability distribution
over the possible messages.
 Most compression algorithms base this “bias” on the
structure of the messages
 i.e., an assumption that repeated characters are more
likely than random characters, or that large white
patches occur in “typical” images.
 Compression is therefore all about probability.
12
Lossless vs. lossy compression
 Lossless compression algorithms usually exploit statistical
redundancy in such a way as to represent the sender's data
more concisely without error.
 Lossless compression is possible because most real-world
data has statistical redundancy.
 For symbolic data such as spreadsheets, text, executable
programs, etc., losslessness is essential because changing
even a single bit cannot be tolerated (except in some
limited cases).
 For example, in English text, the letter 'e' is much more
common than the letter 'z', and the probability that the
letter 'q' will be followed by the letter 'z' is extremely small,
whereas ‘q’ being followed by ‘u’ is very high.
13
Cont…
 Lossy data compression, or perceptual coding, is possible if
some loss of fidelity is acceptable.
 Generally, a lossy data compression will be guided by
research on how people perceive the data in question.
 For example, the human eye is more sensitive to subtle
variations in luminance than it is to variations in color.
 Lossy data compression provides a way to obtain the best
quality for a given amount of compression.
 In some cases, transparent (unnoticeable) compression is
desired;
 In other cases, quality is sacrificed to reduce the amount of data
as much as possible.
14
Cont…
 Lossless compression schemes are reversible so that
the original data can be reconstructed, while lossy
schemes accept/tolerate some loss of data in order to
achieve higher compression.
 For visual and audio data, some loss of quality can
be tolerated without losing the essential nature of the
data.
 By taking advantage of the limitations of the human
sensory system, a great deal of space and bandwidth
can be saved while producing an output which is
nearly indistinguishable from the original.
15
Cont…
 An example of lossless vs. lossy compression is the
following string:
 Take the number 25.888888888
 This string can be compressed without any loss
(lossless) as:
 25.[9]8 - interpreted as "twenty-five point nine eights", the
original string is perfectly recreated, just written in a smaller
form (this is in fact an example of run-length encoding; see the
sketch below)
 In a lossy system, the data is compressed as 26 instead;
the exact original data is lost, at the benefit of a smaller
file size.
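 A minimal run-length encoding sketch in Python (the function and variable names below are our own, not from the slides) shows how the digit string above can be stored as (count, character) pairs and recovered exactly:

    # Minimal, illustrative run-length encoder/decoder for strings.
    from itertools import groupby

    def rle_encode(s):
        # group consecutive identical characters into (run length, character) pairs
        return [(len(list(group)), ch) for ch, group in groupby(s)]

    def rle_decode(runs):
        # expand each (count, character) pair back into its run
        return "".join(ch * count for count, ch in runs)

    original = "25.888888888"
    encoded = rle_encode(original)          # [(1, '2'), (1, '5'), (1, '.'), (9, '8')]
    assert rle_decode(encoded) == original  # lossless: the input is recovered exactly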
16
Cont…

17
Measure of performance
 The performance of a compression algorithm can be
measured in a number of ways
 Complexity of algorithm
 Memory requirement
 Speed
 Amount of compression and,
 Similarity of compressed and original data (for
lossy)

18
Performance…
 Compression ratio
bitrate = (number of bits in the coded bitstream) / (number of samples)
CR = (size of the uncompressed signal) / (size of the coded bitstream)
 Signal-to-Noise Ratio (SNR) is used in the case of lossy
compression.
 Let I be an original signal (e.g., an image), and R be its
lossily reconstructed counterpart. SNR is defined to be:
SNR = 20 * log10( ||I|| / ||I - R|| )
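 As a hedged illustration (the array sizes, values, and coded size below are made up, not from the slides), these measures can be computed with NumPy as follows:

    # Compression ratio and SNR on a toy signal.
    import numpy as np

    def compression_ratio(uncompressed_size, compressed_size):
        # CR = size of uncompressed signal / size of coded bitstream
        return uncompressed_size / compressed_size

    def snr_db(original, reconstructed):
        # SNR = 20 * log10(||I|| / ||I - R||)
        I = original.astype(float)
        R = reconstructed.astype(float)
        return 20.0 * np.log10(np.linalg.norm(I) / np.linalg.norm(I - R))

    I = np.random.randint(0, 256, size=(64, 64))                     # toy "image"
    R = np.clip(I + np.random.randint(-2, 3, size=I.shape), 0, 255)  # slightly distorted copy
    print(compression_ratio(I.size, 2048), snr_db(I, R))             # 2048 is an assumed coded size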
19
The Model and Coder
 The development of data compression algorithms
for a variety of data can be divided into two phases:
modeling and coding
 The model is the compression component that somehow
captures the probability distribution of the messages by
knowing or discovering the redundancy and the structure
of the input.
 The coder component then takes advantage of the
probability biases generated in the model to generate
codes.
 The coder does this by effectively lengthening low
probability messages and shortening high-probability
messages.
20
Strategies for compression
 Symbol-Level Representation Redundancy
 Different symbols occur with different frequencies

 Variable-length codes vs. fixed-length codes

 Frequent symbols are better coded with short codes

 Infrequent symbols are coded with long codes

 Example: Huffman Coding

21
Block-Level Representation Redundancy
 Different blocks of data occur with varying frequencies
 In such cases it is better to code blocks than individual
symbols
 The block size can be fixed or variable
 The block-code size also can be fixed or variable
 Frequent blocks are better coded with short codes
 Example techniques:
 Block-oriented Huffman,
 Run-Length Encoding (RLE), Arithmetic Coding, Lempel-Ziv (LZ)
22
Inter-Pixel Spatial Redundancy
 Neighboring pixels tend to have similar values
 Neighboring pixels tend to exhibit high correlations
 Techniques: Decorrelation and/or processing in the
frequency domain
 Spatial decorrelation converts correlations into
symbol- or block-redundancy
 Frequency domain processing addresses visual
redundancy

23
Inter-Pixel Temporal Redundancy
(in Video)
 Often, the majority of corresponding pixels in
successive video-frames are identical/highly similar
over long spans of frames
 A scene change significantly affects this assumption
 Moreover, blocks of pixels change in position due to
motion, but not in values between successive frames
 Thus, block-oriented motion-compensated
redundancy reduction techniques are used for video
compression.

24
Visual Redundancy
 The human visual system (HVS) has certain limitations
that make many image contents almost invisible.
 Those contents, termed visually redundant, are the target
of removal in lossy compression.
 In fact, the HVS can only see within a small range of spatial
frequencies: about 1-60 cycles per degree
 Approach for reducing visual redundancy in lossy
compression (see the sketch below)
 Transform: convert the data to the frequency domain
 Quantize: under-represent the high frequencies
 Compress the quantized data
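 A hedged sketch of this transform-quantize idea on a single 8x8 block (our own toy example, not the JPEG standard itself):

    # Transform an 8x8 block with a 2-D DCT, then coarsely quantize high frequencies.
    import numpy as np

    def dct_matrix(n=8):
        # orthonormal DCT-II basis: C[k, j] = sqrt(2/n) * cos(pi * (2j + 1) * k / (2n)),
        # with the first row scaled by 1/sqrt(2)
        j = np.arange(n)
        C = np.cos(np.pi * (2 * j[None, :] + 1) * j[:, None] / (2 * n))
        C[0, :] /= np.sqrt(2)
        return C * np.sqrt(2.0 / n)

    block = np.random.randint(0, 256, size=(8, 8)).astype(float)   # toy image block
    C = dct_matrix(8)
    coeffs = C @ block @ C.T                    # transform: spatial -> frequency domain

    step = 1 + 8 * np.add.outer(np.arange(8), np.arange(8))        # coarser steps at high frequencies
    quantized = np.round(coeffs / step)         # quantize: under-represent high frequencies
    reconstructed = C.T @ (quantized * step) @ C                   # approximate inverse transform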


25
Information Theory
Entropy is a numerical measure of the uncertainty of an
outcome
 Shannon borrowed the definition of entropy from statistical
physics to capture the notion of how much information is
contained in a set of messages given their probabilities.
 For a set S of possible messages, Shannon defined entropy as
H(S) = Σ_{s∈S} p(s) · log2(1/p(s))
 where p(s) is the probability of message s
 H(S) is the minimum average number of bits/symbol possible
26
Self information
 Shannon defined the self-information of a message s ∈ S as
i(s) = log2(1/p(s)) = -log2 p(s)
 It represents the number of bits of information
contained in a message and, roughly speaking, the
number of bits we should use to send that message.
 The equation says that messages with higher
probability will contain less information
 Ex: saying "it is going to be sunny tomorrow in Addis Ababa"
does not carry much information compared to saying "it is going
to snow tomorrow in Addis Ababa"
27
Computing entropy
 Entropy is simply a weighted average of the information
of each message, and therefore the average number of bits
of information in the set of messages.
 Larger entropies represent more information, and perhaps
counter-intuitively, the more random a set of messages
(the more even the probabilities) the more information
they contain on average.
 i.e. if the entropy is low, there is a lot of redundancy in
the message (data that does not convey much information),
so it can be represented with fewer bits, which is good for
compression

28
Here are some examples of entropies for different
probability distributions over five messages.
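The original table of examples is not reproduced here; the following small sketch (with distributions of our own choosing) illustrates the point:

    # Entropies of a few illustrative distributions over five messages.
    import math

    def entropy(probs):
        # H(S) = sum over s of p(s) * log2(1 / p(s)); terms with p(s) = 0 contribute nothing
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    print(entropy([0.2, 0.2, 0.2, 0.2, 0.2]))               # uniform: ~2.32 bits (the maximum)
    print(entropy([0.5, 0.125, 0.125, 0.125, 0.125]))       # skewed: 2.0 bits
    print(entropy([0.75, 0.0625, 0.0625, 0.0625, 0.0625]))  # very skewed: ~1.31 bits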

29
Huffman Coding
 Huffman codes are optimal prefix codes generated
from a set of probabilities by a particular algorithm.
 David Huffman developed the algorithm as a student
at MIT in 1951.
 The algorithm is now probably the most prevalently
used component of compression algorithms, used as
the back end of GZIP, JPEG and many other utilities.
 The Huffman algorithm is very simple and is most
easily described in terms of how it generates the
prefix-code tree.

30
Huffman Coding …
 Start with a forest of trees, one for each message. Each tree
contains a single vertex with weight wi=pi
 Repeat until only a single tree remains
 Select the two trees with the lowest-weight roots, w1 and w2
 Combine them into a single tree by adding a new root with
weight w1 + w2 and making the two trees its children.
 It does not matter which is the left or right child, but the
standard convention will be to put the lower-weight root on
the left if w1 ≠ w2
 For a code of size n this algorithm will require n-1 steps,
since every complete binary tree with n leaves has n-1
internal nodes, and each step creates one internal node
(see the Python sketch below).
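 A hedged Python sketch of this procedure (the symbol probabilities at the end are illustrative):

    # Build Huffman codes by repeatedly merging the two lowest-weight trees,
    # exactly as described above.
    import heapq

    def huffman_codes(probs):
        # each heap entry is (weight, tie-breaker, tree); a tree is either a symbol
        # (leaf) or a (left, right) pair (internal node)
        heap = [(w, i, sym) for i, (sym, w) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)                     # two lowest-weight roots
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, counter, (t1, t2)))  # lower weight goes left
            counter += 1
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, tuple):                         # internal node: 0 = left, 1 = right
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"                     # single-symbol edge case
        walk(heap[0][2])
        return codes

    print(huffman_codes({"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}))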

31
Huffman Coding …
 The average number of bits per message used by a Huffman
code is at most H(S)+1; in particular, when the entropy H(S) is
less than 1, up to one extra bit per message may be needed,
since each message takes at least 1 bit
 It is also possible to combine messages to get an
optimal prefix codes especially if all the messages are
from the same probability distribution
 Consider a distribution of six possible messages.
 We could generate probabilities for all 36 pairs by
multiplying the probabilities of each message (there will
be at most 21 unique probabilities).
 A Huffman code can now be generated for this new
probability distribution and used to code two messages
at a time (see the sketch below).
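 A small sketch of the pairing step (the six probabilities below are our own illustration):

    # Form all 36 ordered pairs of six messages and count the distinct pair probabilities.
    from itertools import product

    probs = {"a": 0.35, "b": 0.25, "c": 0.15, "d": 0.12, "e": 0.08, "f": 0.05}
    pair_probs = {x + y: probs[x] * probs[y] for x, y in product(probs, repeat=2)}

    print(len(pair_probs))                # 36 pairs
    print(len(set(pair_probs.values())))  # at most 21 distinct probabilities (p*q == q*p)
    # a Huffman code built for pair_probs then codes two messages at a time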

32
Huffman Coding Example
 The following example is based on a data source using a set of five different
symbols. The symbol frequencies are:

Symbol   Frequency
A        24
B        12
C        10
D         8
E         8     ----> total 186 bits (with 3 bits per codeword)

 Code tree according to Huffman:

Symbol   Frequency   Code   Code Length   Total Length
A        24          0      1             24
B        12          100    3             36
C        10          101    3             30
D         8          110    3             24
E         8          111    3             24
------------------------------------------------------------
total (3-bit code): 186 bits         total: 138 bits
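 A quick check of the totals in the table (a standalone sketch using the frequencies and codes above):

    freqs = {"A": 24, "B": 12, "C": 10, "D": 8, "E": 8}
    codes = {"A": "0", "B": "100", "C": "101", "D": "110", "E": "111"}

    fixed_total = 3 * sum(freqs.values())                         # 3 bits per symbol -> 186 bits
    huffman_total = sum(freqs[s] * len(codes[s]) for s in freqs)  # -> 138 bits
    print(fixed_total, huffman_total)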

33
Arithmetic coding
 Arithmetic coding is a technique that allows the information
from the messages in a message sequence to be combined to
share the same bits.
 The technique allows the total number of bits sent to approach
the sum of the self information of the individual messages
 That is, Arithmetic Coding achieves a bit rate almost equal to the entropy
 It codes the whole input sequence, rather than individual symbols, into one
codeword
 The Conceptual Main Idea
 For each binary input sequence of n bits, divide the unit interval into 2^n
intervals, where the length of the ith interval Ii is the probability of the ith n-bit
binary sequence
 Code the ith binary sequence by l1 l2 ... lt, where 0.l1l2...lt is the binary
representation of the left end of interval Ii and
t = ⌈-log2(Prob(input sequence))⌉ (see the sketch below)
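 A hedged sketch of this interval idea, assuming an i.i.d. bit model with an illustrative p(1) = 0.8 (a practical arithmetic coder works incrementally with finite precision; this only illustrates the interval and the code-length formula):

    import math

    def interval_for(bits, p1=0.8):
        # narrow the unit interval bit by bit; '0' keeps the lower part, '1' the upper part
        lo, width = 0.0, 1.0
        for b in bits:
            if b == "0":
                width *= (1 - p1)
            else:
                lo += width * (1 - p1)
                width *= p1
        return lo, width                      # width == Prob(bits) under this model

    def codeword(bits, p1=0.8):
        lo, width = interval_for(bits, p1)
        t = math.ceil(-math.log2(width))      # t = ceil(-log2(Prob(input sequence)))
        return f"{int(lo * 2**t):0{t}b}"      # first t binary digits of the left end

    print(codeword("11111111"))   # a likely sequence -> short code ("110", 3 bits for 8 input bits)
    print(codeword("00000000"))   # an unlikely sequence -> long code (19 bits)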
34
Example
 Consider sending a thousand messages each having
probability 0.999. Using a Huffman code, each
message has to take at least 1 bit, requiring 1000 bits
to be sent.
 On the other hand, the self-information of each message is
log2(1/p(s)) = 0.00144 bits, so the sum of this self-information
over 1000 messages is only about 1.4 bits.
 That is, arithmetic coding can send all the messages using
only about 3 bits, a factor of hundreds fewer than the Huffman coder.
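 A quick check of these numbers:

    import math

    p = 0.999
    self_info = math.log2(1 / p)          # ~0.00144 bits per message
    print(self_info, 1000 * self_info)    # total over 1000 messages: ~1.44 bits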

35
36
