Data Compression
Data compression implies sending or storing a smaller
number of bits. Although many methods are used for this
purpose, in general these methods can be divided into two
broad categories: lossless and lossy methods.
[Figure: codes/codewords pass through the decoder (decompression) to produce the output data]
Compression Techniques
- Entropy encoding: run-length coding, Huffman coding, arithmetic coding
- Prediction: DPCM, DM
- Transformation: FFT, DCT
- Sub-band coding
- Vector quantization
- JPEG
15-1 LOSSLESS COMPRESSION
Run-length encoding
Run-length encoding is probably the simplest method of
compression. It can be used to compress data made of any
combination of symbols.
It does not need to know the frequency of occurrence of
symbols and can be very efficient if data is represented as 0s
and 1s.
The general idea behind this method is to replace
consecutive repeating occurrences of a symbol by one
occurrence of the symbol followed by the number of
occurrences.
The method can be even more efficient if the data uses only
two symbols (for example 0 and 1) in its bit pattern and one
symbol is more frequent than the other.
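The idea above can be sketched in a few lines of Python (an illustrative implementation, not part of the original slides): each run of a repeated symbol is replaced by the symbol and the number of occurrences.

```python
def rle_encode(data):
    """Run-length encode: each run of a repeated symbol becomes a
    (symbol, run length) pair."""
    if not data:
        return []
    runs = []
    current, length = data[0], 1
    for symbol in data[1:]:
        if symbol == current:
            length += 1            # still inside the same run
        else:
            runs.append((current, length))
            current, length = symbol, 1
    runs.append((current, length)) # flush the final run
    return runs

def rle_decode(runs):
    """Inverse operation: expand each (symbol, run length) pair."""
    return "".join(symbol * length for symbol, length in runs)
```

For example, `rle_encode("AAAABBBAABBBBB")` yields `[("A", 4), ("B", 3), ("A", 2), ("B", 5)]`; the longer the runs, the fewer pairs are needed relative to the input length.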
Run-length encoding example
Run-length encoding for two symbols
Huffman coding
Huffman coding assigns shorter codes to symbols that occur
more frequently and longer codes to those that occur less
frequently.
For example, imagine we have a text file that uses only five
characters (A, B, C, D, E).
Before we can assign bit patterns to each character, we
assign each character a weight based on its frequency of use.
In this example, assume that the frequency of the characters
is as shown in the table below.
Steps to produce a binary code tree by reducing redundancy.
Frequency of character
Algorithm
a. Make a leaf node for each code symbol and attach the
symbol's probability of occurrence (its frequency count) to
the node; arrange the symbols in ascending order of
frequency.
b. Take the two nodes with the smallest probabilities
(frequencies) and connect them into a new node.
1. Add 0 or 1 to each of the two branches.
2. The probability of the new node is the sum of the
probabilities of the two connected nodes.
c. If there is only one node left, the code construction is
complete. If not, go back to (b).
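Steps (a)-(c) can be sketched in Python with a min-heap standing in for the sorted list of nodes. The frequencies used here are assumed for illustration only; the algorithm fixes the code lengths, while the exact bit patterns depend on which branch is labelled 0 and which 1.

```python
import heapq
from itertools import count

def build_huffman_codes(freqs):
    """Huffman's algorithm: start with one leaf per symbol, repeatedly
    merge the two lowest-frequency nodes (labelling one branch 0 and
    the other 1), and stop when a single node, the root, remains."""
    tiebreak = count()  # keeps heap comparisons away from the code tables
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}        # left branch: 0
        merged.update({s: "1" + c for s, c in right.items()}) # right branch: 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Assumed frequencies for the five characters (illustrative only):
codes = build_huffman_codes({"A": 17, "B": 12, "C": 12, "D": 27, "E": 32})
```

With these assumed frequencies the algorithm produces 2-bit codes for A, D, and E and 3-bit codes for B and C, so the more frequent characters get the shorter codes.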
Rules:
1. If you assign weight ‘0’ to the left edges, then assign weight ‘1’ to the
right edges.
2. If you assign weight ‘1’ to the left edges, then assign weight ‘0’ to the
right edges.
3. Either of the two conventions may be followed, but the convention
adopted at encoding time must also be used at decoding time.
A character’s code is found by starting at the root and
following the branches that lead to that character.
The code itself is the bit value of each branch on the path,
taken in sequence.
Huffman encoding
Decoding
The recipient's job of decoding the received data is
straightforward, because no codeword is a prefix of any
other: the decoder reads bits until they match a codeword,
emits the corresponding character, and starts again. The
figure below shows how decoding takes place.
Huffman decoding
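A minimal decoder sketch in Python (illustrative; the code table below is an assumed example of one possible assignment): because the code is prefix-free, a symbol can be emitted the moment the bits read so far match a table entry.

```python
def huffman_decode(bits, codes):
    """Decode a bit string using a prefix-free code table {symbol: code}."""
    reverse = {code: sym for sym, code in codes.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit              # walk one branch deeper in the tree
        if buffer in reverse:      # reached a leaf: emit its symbol
            decoded.append(reverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(decoded)

# Assumed code table for illustration:
codes = {"A": "00", "B": "010", "C": "011", "D": "10", "E": "11"}
# huffman_decode("0001010", codes) -> "ABD"
```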
Question:
Given a file that consists of the following set of characters
along with corresponding frequencies.
Characters Frequencies
a 10
e 15
i 12
o 3
u 4
s 13
t 1
Using the Huffman coding scheme for data compression,
determine:
1. Huffman Code for each character
2. Draw the Huffman tree
3. Length of Huffman encoded message (in bits)
4. Encode the message aeiou.
Lempel Ziv encoding
Lempel Ziv (LZ) encoding is an example of a category of
algorithms called dictionary-based encoding. The idea is to
create a dictionary (a table) of strings used during the
communication session. If both the sender and the receiver
have a copy of the dictionary, then previously-encountered
strings can be substituted by their index in the dictionary to
reduce the amount of information transmitted.
Compression
In this phase there are two concurrent events: building an
indexed dictionary and compressing a string of symbols. The
algorithm extracts the smallest substring that cannot be
found in the dictionary from the remaining uncompressed
string. It then stores a copy of this substring in the dictionary
as a new entry and assigns it an index value. Compression
occurs when the substring, except for the last character, is
replaced with the index found in the dictionary. The process
then inserts the index and the last character of the substring
into the compressed string.
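The compression phase described above can be sketched as follows (an illustrative Python version; the convention that index 0 stands for the empty prefix is an assumption of this sketch):

```python
def lz_compress(text):
    """Dictionary-based (LZ78-style) compression as described above:
    take the shortest substring not yet in the dictionary, store it as
    a new entry, and emit (index of its known prefix, last character)."""
    dictionary = {}    # substring -> index (1-based; 0 = empty prefix)
    compressed = []
    current = ""       # longest dictionary match found so far
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate        # still a known substring: extend it
        else:
            # Substring not found: emit (prefix index, last character),
            # then add the new substring to the dictionary.
            compressed.append((dictionary.get(current, 0), ch))
            dictionary[candidate] = len(dictionary) + 1
            current = ""
    if current:  # input ended while still matching a dictionary entry
        compressed.append((dictionary[current], ""))
    return compressed
```

For example, `lz_compress("ABAABA")` yields `[(0, "A"), (0, "B"), (1, "A"), (2, "A")]`: each pair names an earlier dictionary entry instead of repeating its characters.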
An example of Lempel Ziv encoding
Decompression
Decompression is the inverse of the compression process.
The process extracts the substrings from the compressed
string and tries to replace the indexes with the corresponding
entry in the dictionary, which is empty at first and built up
gradually. The idea is that when an index is received, there is
already an entry in the dictionary corresponding to that
index.
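An illustrative Python sketch of this phase, assuming each compressed token is a (prefix index, character) pair with index 0 meaning the empty string:

```python
def lz_decompress(pairs):
    """Rebuild the original string while reconstructing, entry by
    entry, the same dictionary the compressor built: every received
    pair both yields output and becomes the next dictionary entry."""
    dictionary = {0: ""}   # index -> substring; 0 is the empty prefix
    pieces = []
    for index, ch in pairs:
        entry = dictionary[index] + ch       # known prefix + new character
        pieces.append(entry)
        dictionary[len(dictionary)] = entry  # store under the next index
    return "".join(pieces)
```

For example, `lz_decompress([(0, "A"), (0, "B"), (1, "A"), (2, "A")])` rebuilds `"ABAABA"`: by the time an index arrives, the corresponding entry is already in the dictionary.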
An example of Lempel Ziv decoding
File Formats
TIFF files
Scanning as TIFF
LZW compression
EPS files (vector)
Encapsulated PostScript
DCS files
PICT files (Macintosh)
BMP files (Windows)
WMF files (Windows)
GIF file format
PNG file format
JPEG files
PDF files