
Data Compression:

With the increased emphasis on full-text databases, the problem of handling
the quantity of data becomes significant. Since the time required to search a
database depends heavily on the amount of data, for efficient operation of
an information system it is necessary both to organize the data well and to find
as efficient a representation for the data as possible. Thus there is growing
interest in data compression. Why needed?
1. The size of applications keeps growing: MP3, MPEG, TIFF, etc.
2. A fax page has about 4 million dots, so transmitting one page takes more
than 1 minute over 56 Kbps. If the data is compressed by a factor of 10, the
transmission time is reduced to about 6 seconds per page.
3. TV / motion pictures use 30 pictures (frames) per second and about 200,000
pixels per frame; color pictures require 3 bytes (24 bits) per pixel (RGB). Each
frame therefore needs 200,000 * 24 = 4.8 Mbits, and a 2-hour movie requires
216,000 pictures, so the total for such a movie is 216,000 * 4.8 Mbits ≈ 1.0368 x 10^12 bits.
This is much higher than the capacity of DVDs.
Without compression, these applications would not be feasible.
A codec is called LOSSY if data is lost during compression, and LOSSLESS if no
data is lost during compression.

There are two basic approaches:
1. Redundancy reduction (usually lossless):
   remove redundancy from the message.
2. Reduce information content (usually lossy):
   reduce the total amount of information in the message; this leads to a sacrifice of quality.
Two classes of text compression methods:
1. Symbol-wise (or statistical) methods:
   estimate probabilities of symbols (the modeling step);
   usually based on either arithmetic or Huffman coding.
2. Dictionary methods:
   replace fragments of text with a single code word (typically an index to an
   entry in the dictionary), e.g. Ziv-Lempel coding, which replaces strings of
   characters with a pointer to a previous occurrence of the string;
   no probability estimates needed.
Text Compression
[Diagram: text -> (model + encoder) -> compressed text -> (model + decoder) -> text]
Information Theory
Entropy: Shannon borrowed the definition of entropy from statistical physics
to capture the notion of how much information is contained in the whole
alphabet. For a set of possible messages S, Shannon defined entropy as,
H(S) = Σ_{s in S} p(s) log2(1/p(s)) = Σ_{s in S} p(s) i(s)
where p(s) is the probability of message s and the self-information
i(s) = log2(1/p(s)) represents the number of bits of information contained in it,
and roughly speaking the number of bits we should use to send that message.
Compression ratio: C = (average original symbol length) / (average compressed symbol length)
Example: for P(S) = {0.25, 0.25, 0.25, 0.125, 0.125},
H(S) = 3 x 0.25 x log2(1/0.25) + 2 x 0.125 x log2(1/0.125) = 1.5 + 0.75 = 2.25 bits/symbol.
Redundancy: the average codeword length minus the entropy.
Compression ratio: the ratio between the average number of bits per symbol in
the original message and the same quantity for the coded message.
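To make these definitions concrete, here is a minimal Python sketch (the names entropy and compression_ratio are illustrative, not from any particular library) that reproduces the worked example above:

import math

def entropy(probs):
    # H(S) = sum over s of p(s) * log2(1 / p(s)), in bits per symbol
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def compression_ratio(avg_original_bits, avg_compressed_bits):
    # C = average original symbol length / average compressed symbol length
    return avg_original_bits / avg_compressed_bits

probs = [0.25, 0.25, 0.25, 0.125, 0.125]   # the worked example above
print(entropy(probs))                      # 2.25 bits per symbol
print(compression_ratio(3, 2.25))          # e.g. fixed 3-bit codes vs. a code meeting the entropy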
1: Run Length Encoding (RLE)
RLE is based on the assumption that a file has a great deal of redundancy. Data
is considered just a string of symbols. RLE is good for fax and voice.
Example: 22 characters are reduced to 14 characters:
ABBCCDDDDDDDDDEEFGGGGG => ABBCCD#9EEFG#5
(22 - 14)/22 ≈ 36% reduction
Disadvantages:
1. We are unable to distinguish compressed text in the file from
uncompressed text.
2. Any numeric value will be interpreted as the beginning of a
compressed sequence.
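The following is a minimal Python sketch of the run-length scheme used in the example above. The convention that only runs of four or more characters are written as CHAR#COUNT is an assumption chosen to reproduce the ABBCCD#9EEFG#5 output; note that the '#' marker suffers from exactly the ambiguity listed in the disadvantages if '#' or digits can occur in the data.

def rle_encode(text, threshold=4):
    # Runs of at least `threshold` repeats become CHAR + '#' + count;
    # shorter runs are copied through unchanged.
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        run = j - i
        if run >= threshold:
            out.append(f"{text[i]}#{run}")
        else:
            out.append(text[i] * run)
        i = j
    return "".join(out)

original = "ABBCCDDDDDDDDDEEFGGGGG"
encoded = rle_encode(original)
print(encoded)                                         # ABBCCD#9EEFG#5
print((len(original) - len(encoded)) / len(original))  # about 0.36, i.e. 36% reduction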
2: Huffman coding
There is another algorithm whose performance is slightly better than Run Length
Encoding: the famous Huffman coding. A Huffman code is built from the frequency
distribution of the symbols to be encoded; a binary tree is then constructed as follows:
1. Initially, each symbol is considered as a separate binary tree.
2. The two trees with the lowest frequencies are chosen and combined into a
single tree whose assigned frequency is the sum of the two given frequencies.
The chosen trees form the two branches of the new tree.
3. The process is repeated until only a single tree remains. Then the two
branches of every node are labeled 0 and 1 (0 on the left branch, but the
order is not important).
4. The code for each symbol can be read by following the branches from the
root to the symbol.
Huffman coding - Example
[Figure: Huffman tree built for the symbols a, b, c, d, e, f, g with probabilities
0.05, 0.05, 0.1, 0.2, 0.3, 0.2, 0.1; the merged node weights are 0.1, 0.2, 0.3,
0.4, 0.6 and 1.0, and each pair of branches is labeled 0 and 1.]
Symbol   Prob.   Codeword
a        0.05    0000
b        0.05    0001
c        0.1     001
d        0.2     01
e        0.3     10
f        0.2     110
g        0.1     111

Huffman coding - Exercise
Code the sequence (aeebcddegfced) and evaluate entropy and compression ratio.
Sol: 0000 10 10 0001 001 01 01 10 111 110 001 10 01
Aver. orig. symb. length = 3 bits
Aver. compr. symb. length = 34/13 ≈ 2.62 bits
Compression ratio C = 3 / (34/13) ≈ 1.15
H(X) = 2.5464 bits
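A minimal Python sketch of the tree-building procedure described above, using a heap to repeatedly merge the two lowest-weight trees. The exact codewords depend on how ties between equal weights are broken, so they may differ from the table, but any valid Huffman tree for these probabilities has the same average codeword length.

import heapq

def huffman_codes(probs):
    # Build a Huffman code from a {symbol: probability} map.
    # Each heap entry is (weight, tie_breaker, {symbol: codeword so far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)      # two lowest-weight trees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}          # left branch labeled 0
        merged.update({s: "1" + c for s, c in right.items()})   # right branch labeled 1
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

probs = {"a": 0.05, "b": 0.05, "c": 0.1, "d": 0.2, "e": 0.3, "f": 0.2, "g": 0.1}
codes = huffman_codes(probs)
print(codes)   # codewords may differ from the table on ties; the lengths are equivalent
print(sum(probs[s] * len(c) for s, c in codes.items()))   # 2.6 bits/symbol, vs H(X) = 2.5464
encoded = "".join(codes[s] for s in "aeebcddegfced")
print(len(encoded))   # the table's tree uses 34 bits; other tie choices give a nearby count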
Huffman coding - Notes
1. In Huffman coding, if at any time there is more than one way to choose a
smallest pair of probabilities, any such pair may be chosen.
2. Huffman code is a variable-length code, with the more frequent symbols
being assigned shorter codes.
3. Huffman codes are good for data messages.
Lempel-Ziv Compression (LZ77):
LZ77 keeps track of the last n bytes of data seen, and when a phrase is encountered
that has already been seen, it outputs a pair of values corresponding to the
position of the phrase in the previously seen buffer of data and the length of
the phrase. The code consists of a set of triples < a, b, c >, where:
a = relative position of the longest match in the dictionary
b = length of the longest match
c = next character in the buffer beyond the longest match
Triples beginning with 0 identify new characters not previously seen.
Example: encoding the string Peter_Piper_pick (the matches below treat 'P' and 'p' as the same symbol).

No.   Output code   Decoded text
1     (0,0,P)       P
2     (0,0,e)       Pe
3     (0,0,t)       Pet
4     (2,1,r)       Peter
5     (0,0,_)       Peter_
6     (6,1,i)       Peter_Pi
7     (8,2,r)       Peter_Piper
8     (6,3,c)       Peter_Piper_pic
9     (0,0,k)       Peter_Piper_pick
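A minimal Python sketch of this triple-based scheme, using a greedy longest-match search over all previously seen text (no window-size limit, which is a simplification). The example string is written in lower case here so that the matches are exact character matches.

def lz77_encode(text):
    # Emit (offset, length, next_char) triples; offset is the distance back to the
    # longest earlier match, and offset = length = 0 marks a new character.
    triples = []
    pos = 0
    while pos < len(text):
        best_off, best_len = 0, 0
        for off in range(1, pos + 1):
            length = 0
            while (pos + length < len(text) - 1 and
                   text[pos - off + length] == text[pos + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        triples.append((best_off, best_len, text[pos + best_len]))
        pos += best_len + 1
    return triples

def lz77_decode(triples):
    out = []
    for off, length, ch in triples:
        for _ in range(length):
            out.append(out[-off])    # copy from `off` positions back (may overlap)
        out.append(ch)
    return "".join(out)

text = "peter_piper_pick"
codes = lz77_encode(text)
print(codes)   # (0,0,p) (0,0,e) (0,0,t) (2,1,r) (0,0,_) (6,1,i) (8,2,r) (6,3,c) (0,0,k)
print(lz77_decode(codes) == text)   # True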
Arithmetic Coding:
Arithmetic coding is based on the concept of interval subdivision. In arithmetic
coding a source ensemble is represented by an interval between 0 and 1 on the
real number line. Each symbol of the ensemble narrows this interval: the coder
uses the probabilities of the source messages to successively narrow the
interval used to represent the ensemble.
Arithmetic Coding: Description
In the following discussion, we will use M as the size of the alphabet of the
data source, N[x] as symbol x's probability, and Q[x] as symbol x's cumulative
probability (Q[i] = N[0] + N[1] + ... + N[i]).
Assuming we know the probabilities of each symbol, we can allocate to each
symbol an interval with width proportional to its probability, such that the
intervals do not overlap. This can be done by using the cumulative probabilities
as the two ends of each interval: the interval for symbol x runs from Q[x-1] to
Q[x], and symbol x is said to own the range [Q[x-1], Q[x]).
Arithmetic Coding: Encoder example
Symbol x   Probability N[x]   [Q[x-1], Q[x])
A          0.4                [0.0, 0.4)
B          0.3                [0.4, 0.7)
C          0.2                [0.7, 0.9)
D          0.1                [0.9, 1.0)

String to encode: BCAB
Start with the interval [0, 1) and narrow it once per symbol:
B: [0.4, 0.7)
C: [0.61, 0.67)
A: [0.61, 0.634)
B: [0.6196, 0.6268)
Code sent: 0.6196 (any value inside the final interval identifies the string BCAB)
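A minimal Python sketch of the interval-narrowing step shown above (encoder only; a practical coder would also have to deal with finite precision and with signalling the end of the message).

def arithmetic_encode(message, intervals):
    # Narrow [low, high) once per symbol, using the symbol's range [Q[x-1], Q[x]).
    low, high = 0.0, 1.0
    for sym in message:
        q_lo, q_hi = intervals[sym]
        width = high - low
        high = low + width * q_hi
        low = low + width * q_lo
    return low, high               # any number in [low, high) encodes the message

intervals = {"A": (0.0, 0.4), "B": (0.4, 0.7), "C": (0.7, 0.9), "D": (0.9, 1.0)}
low, high = arithmetic_encode("BCAB", intervals)
print(low, high)   # about 0.6196 and 0.6268, as in the example (up to rounding error)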
