Data Compression and Source Coding

The document discusses data rate limits in communications, focusing on factors affecting data transmission speed, such as bandwidth, signal levels, and channel quality. It explains Nyquist's theorem for noiseless channels and Shannon's theorem for noisy channels, providing examples of calculating maximum bit rates and capacities. Additionally, it covers source coding and compression techniques, including Huffman coding and Shannon-Fano encoding, emphasizing their importance in efficient data transmission.

DATA RATE LIMITS

A very important consideration in data communications is how fast we can send data, in bits per second, over a channel. The data rate depends on three factors:
1. The bandwidth available
2. The level of the signals we use
3. The quality of the channel (the level of noise)

• Noiseless Channel: Nyquist Bit Rate
• Noisy Channel: Shannon Capacity
• Using Both Limits



Note

Increasing the levels of a signal increases the probability of an error occurring; in other words, it reduces the reliability of the system. Why?



Capacity of a System
• The bit rate of a system increases with an increase
in the number of signal levels we use to denote a
symbol.
• A symbol can consist of a single bit or “n” bits.
• The number of signal levels = 2^n.
• As the number of levels goes up, the spacing between levels decreases, increasing the probability of an error occurring in the presence of transmission impairments.



Nyquist Theorem
• Nyquist gives the upper bound for the bit rate of a
transmission system by calculating the bit rate
directly from the number of bits in a symbol (or
signal levels) and the bandwidth of the system
(assuming 2 symbols per cycle and the first harmonic).
• The Nyquist theorem states that for a noiseless channel:

C = 2B log2(2^n) = 2nB

where C = capacity in bps, B = bandwidth in Hz, and n = the number of bits per symbol.
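As a quick sanity check, the Nyquist limit is easy to compute directly. The following is a minimal Python sketch (the function name and variables are illustrative, not part of the slides):

import math

def nyquist_bit_rate(bandwidth_hz: float, levels: int) -> float:
    """Maximum bit rate of a noiseless channel: C = 2 * B * log2(L)."""
    return 2 * bandwidth_hz * math.log2(levels)

# Two signal levels over 3000 Hz -> 6000 bps; four levels -> 12,000 bps.
print(nyquist_bit_rate(3000, 2))   # 6000.0
print(nyquist_bit_rate(3000, 4))   # 12000.0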
Example 1

Consider a noiseless channel with a bandwidth of 3000 Hz transmitting a signal with two signal levels. The maximum bit rate can be calculated as

BitRate = 2 × 3000 × log2(2) = 6000 bps



Example 2

Consider the same noiseless channel transmitting a signal with four signal levels (for each level, we send 2 bits). The maximum bit rate can be calculated as

BitRate = 2 × 3000 × log2(4) = 12,000 bps



Example 2

We need to send 265 kbps over a noiseless channel with a bandwidth of 20 kHz. How many signal levels do we need?

Solution
We can use the Nyquist formula as shown:

265,000 = 2 × 20,000 × log2(L)
log2(L) = 6.625, so L = 2^6.625 ≈ 98.7 levels

Since this result is not a power of 2, we need to either increase the number of levels or reduce the bit rate. If we have 128 levels, the bit rate is 280 kbps. If we have 64 levels, the bit rate is 240 kbps.
Shannon’s Theorem

• Shannon's theorem gives the capacity of a system in the presence of noise.

C = B log2(1 + SNR)
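A companion sketch for the Shannon limit, including the conversion from an SNR given in decibels (the names are illustrative):

import math

def shannon_capacity(bandwidth_hz: float, snr: float) -> float:
    """Capacity of a noisy channel: C = B * log2(1 + SNR)."""
    return bandwidth_hz * math.log2(1 + snr)

def snr_from_db(snr_db: float) -> float:
    """Convert an SNR in decibels to a linear ratio."""
    return 10 ** (snr_db / 10)

# Telephone line (Example 4): B = 3000 Hz, SNR = 3162.
print(shannon_capacity(3000, 3162))            # roughly 34.9 kbps
# Example 5: SNRdB = 36, B = 2 MHz.
print(shannon_capacity(2e6, snr_from_db(36)))  # roughly 24 Mbps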



Example 3

Consider an extremely noisy channel in which the value of the signal-to-noise ratio is almost zero. In other words, the noise is so strong that the signal is faint. For this channel the capacity C is calculated as

C = B log2(1 + SNR) = B log2(1 + 0) = B × 0 = 0

This means that the capacity of this channel is zero regardless of the bandwidth. In other words, we cannot receive any data through this channel.



Example 4

We can calculate the theoretical highest bit rate of a regular telephone line. A telephone line normally has a bandwidth of 3000 Hz. The signal-to-noise ratio is usually 3162. For this channel the capacity is calculated as

C = B log2(1 + SNR) = 3000 × log2(1 + 3162) = 3000 × 11.62 = 34,860 bps

This means that the highest bit rate for a telephone line is 34.86 kbps. If we want to send data faster than this, we can either increase the bandwidth of the line or improve the signal-to-noise ratio.
Example 5

The signal-to-noise ratio is often given in decibels. Assume that SNRdB = 36 and the channel bandwidth is 2 MHz. The theoretical channel capacity can be calculated as

SNR = 10^(SNRdB/10) = 10^3.6 ≈ 3981
C = B log2(1 + SNR) = 2 × 10^6 × log2(3982) ≈ 24 Mbps



Example 5 (continued)

For practical purposes, when the SNR is very high, we can assume that SNR + 1 is almost the same as SNR. In these cases, the theoretical channel capacity can be simplified to

C ≈ B × (SNRdB / 3)

For example, we can calculate the theoretical capacity of the previous example as

C ≈ 2 MHz × (36 / 3) = 24 Mbps



Example 6

We have a channel with a 1-MHz bandwidth. The SNR for this channel is 63. What are the appropriate bit rate and signal level?

Solution
First, we use the Shannon formula to find the upper limit.

C = B log2(1 + SNR) = 10^6 × log2(1 + 63) = 10^6 × 6 = 6 Mbps


Example 6 (continued)

The Shannon formula gives us 6 Mbps, the upper limit. For better performance we choose something lower, 4 Mbps, for example. Then we use the Nyquist formula to find the number of signal levels.

4 Mbps = 2 × 1 MHz × log2(L), so L = 4



Note

The Shannon capacity gives us the upper limit; the Nyquist formula tells us how many signal levels we need.
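Example 6 can be read as a two-step recipe: Shannon bounds the achievable rate, and Nyquist then sizes the signal levels. A minimal sketch under that reading (the helper name is my own):

import math

def design_link(bandwidth_hz: float, snr: float, target_bps: float) -> int:
    """Check the target rate against the Shannon limit, then use the
    Nyquist formula to find the number of signal levels needed."""
    shannon_limit = bandwidth_hz * math.log2(1 + snr)
    if target_bps > shannon_limit:
        raise ValueError(f"target exceeds the Shannon limit of {shannon_limit:.0f} bps")
    # target = 2 * B * log2(L)  =>  L = 2 ** (target / (2 * B))
    return math.ceil(2 ** (target_bps / (2 * bandwidth_hz)))

# B = 1 MHz, SNR = 63, chosen rate 4 Mbps -> 4 signal levels (Example 6).
print(design_link(1e6, 63, 4e6))  # 4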



Source Coding-
Compression



Introduction
• Noiseless Coding
  Compression without distortion
• Basic Concept
  Symbols with lower probabilities are represented by binary codewords of longer length
• Methods
  Huffman codes, Lempel-Ziv codes, arithmetic codes, and Golomb codes



Fundamental Limits on
Performance
• Given an information source and a noisy channel:
  1) Limit on the minimum number of bits per symbol
  2) Limit on the maximum rate for reliable communication
→ Shannon's theorems



Shannon’s source coding theorem:

The lowest rate for encoding a message without distortion is the entropy of the symbols in the message.
Information Theory
• Let the source alphabet S = {s0, s1, ..., s_(K-1)} with probabilities of occurrence
  P(S = s_k) = p_k,  k = 0, 1, ..., K-1,  where Σ_(k=0..K-1) p_k = 1
• Assume the discrete memory-less source (DMS)

What is the measure of information?



Properties of Information

1) I(s_k) = 0 for p_k = 1
2) I(s_k) ≥ 0 for 0 ≤ p_k ≤ 1
3) I(s_k) > I(s_i) for p_k < p_i
4) I(s_k s_i) = I(s_k) + I(s_i), if s_k and s_i are statistically independent

* The convention is to use logarithms of base 2, i.e., I(s_k) = log2(1/p_k)



Entropy
• Consider a set of symbols S={S1,...,SN}.
• The entropy of the symbols is defined as

  H(S) = Σ_(i=1..N) P(Si) log2(1 / P(Si))

  where P(Si) is the probability of Si.


Example:

Consider a set of symbols {a, b, c} with P(a) = 1/4, P(b) = 1/4 and P(c) = 1/2.

The entropy of the symbols is then given by

H(S) = P(a) log2(1/P(a)) + P(b) log2(1/P(b)) + P(c) log2(1/P(c))
     = (1/4)(2) + (1/4)(2) + (1/2)(1) = 1.5 bits per symbol
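The entropy of any finite symbol distribution can be computed directly from the definition. A minimal sketch (the function name is my own):

import math

def entropy(probs: dict[str, float]) -> float:
    """H(S) = sum over all symbols of P(s) * log2(1 / P(s))."""
    return sum(p * math.log2(1 / p) for p in probs.values() if p > 0)

# The example above: P(a) = P(b) = 1/4, P(c) = 1/2 -> 1.5 bits per symbol.
print(entropy({"a": 0.25, "b": 0.25, "c": 0.5}))  # 1.5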



Consider a message containing symbols in S.

Define the rate of a source coding technique as the average number of bits representing each symbol after compression.



Example:

Suppose the following message is to be compressed:
a a a b c a

Suppose an encoding technique uses 7 bits to represent the message.

The rate of the encoding technique is therefore 7/6 (since there are 6 symbols).



Average Length
For a code C with associated probabilities p(c), the average length is defined as

la(C) = Σ_(c ∈ C) p(c) l(c)

We say that a prefix code C is optimal if, for all prefix codes C', la(C) ≤ la(C').



Relationship to Entropy
Theorem (lower bound): For any probability distribution p(S) with associated uniquely decodable code C,

H(S) ≤ la(C)

Theorem (upper bound): For any probability distribution p(S) with associated optimal prefix code C,

la(C) ≤ H(S) + 1
Coding Efficiency
• Coding efficiency: η = Lmin / La
• where La is the average code-word length
• From Shannon's theorem: La ≥ H(S)
• Thus Lmin = H(S)
• Thus η = H(S) / La



Kraft McMillan Inequality
Theorem (Kraft-McMillan): For any uniquely decodable code C,

Σ_(c ∈ C) 2^(-l(c)) ≤ 1

Also, for any set of lengths L such that

Σ_(l ∈ L) 2^(-l) ≤ 1

there is a prefix code C such that l(ci) = li (i = 1, ..., |L|).

NOTE: The Kraft-McMillan inequality does not tell us whether the code is prefix-free or not.
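Checking the Kraft-McMillan sum for a set of codeword lengths is a one-liner; a small illustrative sketch:

def kraft_sum(lengths: list[int]) -> float:
    """Sum of 2^(-l) over all codeword lengths."""
    return sum(2 ** -l for l in lengths)

# The prefix code a=0, b=110, c=111, d=10 has lengths 1, 3, 3, 2 -> sum = 1.0 <= 1.
print(kraft_sum([1, 3, 3, 2]))  # 1.0
# Lengths 1, 1, 2 violate the inequality: 0.5 + 0.5 + 0.25 = 1.25 > 1.
print(kraft_sum([1, 1, 2]))     # 1.25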
Uniquely Decodable Codes
A variable-length code assigns a bit string (codeword) of variable length to every message value,
e.g. a = 1, b = 01, c = 101, d = 011.
What if you get the sequence of bits 1011? Is it aba, ca, or ad?
A uniquely decodable code is a variable-length code in which bit strings can always be uniquely decomposed into its codewords.
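The ambiguity above is easy to demonstrate by brute force. A small sketch that enumerates every parse of a bit string under a given code (the function name is my own):

def parses(bits: str, code: dict[str, str]) -> list[str]:
    """Return every way of decomposing `bits` into codewords of `code`."""
    if not bits:
        return [""]
    results = []
    for symbol, word in code.items():
        if bits.startswith(word):
            results += [symbol + rest for rest in parses(bits[len(word):], code)]
    return results

# The non-prefix code from this slide: "1011" has three valid decodings,
# so bit strings cannot always be uniquely decomposed.
print(parses("1011", {"a": "1", "b": "01", "c": "101", "d": "011"}))  # ['aba', 'ad', 'ca']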
Prefix Codes
A prefix code is a variable-length code in which no codeword is a prefix of another codeword,
e.g. a = 0, b = 110, c = 111, d = 10.
It can be viewed as a binary tree with message values at the leaves and 0s or 1s on the edges.

[Tree diagram: a at depth 1 (code 0), d at depth 2 (code 10), b and c at depth 3 (codes 110 and 111)]
Some Prefix Codes for Integers
n Binary Unary Split
1 ..001 0 1|
2 ..010 10 10|0
3 ..011 110 10|1
4 ..100 1110 110|00
5 ..101 11110 110|01
6 ..110 111110 110|10
Many other fixed prefix codes:
Golomb, phased-binary, subexponential, ...
Data compression implies sending or storing a smaller number of bits. Although many methods are used for this purpose, in general these methods can be divided into two broad categories: lossless and lossy methods.

[Figure: Data compression methods, divided into lossless and lossy]
Shannon – Fano Encoding:
First the messages are ranked in a table in descending order of probability. The table is then progressively divided into subsections of probability as near equal as possible. Binary 0 is assigned to the upper subsection and binary 1 to the lower subsection. This process continues until it is impossible to divide any further. The following steps give the algorithmic procedure of Shannon-Fano encoding (a code sketch follows the steps):

1-List the symbols in descending order of the probabilities.

2-Divide the table into two sections whose total probabilities are as nearly equal as possible.

3-Allocate binary 0 to the upper section and binary 1 to the lower section.

4-Divide both the upper section and the lower section into two.

5-Allocate binary 0 to the top half of each section and binary 1 to the lower half.

6-Repeat steps (4) and (5) until it is not possible to go any further.
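The sketch below follows these steps literally: sort by probability, split each group into two halves of nearly equal total probability, and assign 0/1. It is a minimal illustration, not an optimized implementation:

def shannon_fano(symbols: list[tuple[str, float]]) -> dict[str, str]:
    """Shannon-Fano coding: sort by descending probability, split into two
    groups of (nearly) equal total probability, assign 0 to the upper group
    and 1 to the lower group, and recurse."""
    symbols = sorted(symbols, key=lambda sp: sp[1], reverse=True)

    def build(group, prefix=""):
        if len(group) == 1:
            return {group[0][0]: prefix or "0"}
        total, running = sum(p for _, p in group), 0.0
        split, best_diff = 1, float("inf")
        for i in range(1, len(group)):
            running += group[i - 1][1]
            if abs(2 * running - total) < best_diff:
                best_diff, split = abs(2 * running - total), i
        codes = build(group[:split], prefix + "0")
        codes.update(build(group[split:], prefix + "1"))
        return codes

    return build(symbols)

# Example: {'A': '0', 'C': '10', 'B': '110', 'D': '1110', 'E': '1111'}
print(shannon_fano([("A", 0.4), ("B", 0.1), ("C", 0.3), ("D", 0.1), ("E", 0.1)]))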



For a worked example, the efficiency of the Shannon-Fano code can be evaluated as:
L = 2.72 digits / symbol
H = 2.67 bits / symbol
η = (H / L) × 100% = (2.67 / 2.72) × 100% = 98.2%
Huffman Coding



Huffman Codes
• Invented by Huffman as a class assignment in 1951.
• Used in many, if not most, compression algorithms such as gzip, bzip, jpeg (as an option), fax compression, ...
• Properties:
  • Generates optimal prefix codes
  • Cheap to generate codes
  • Cheap to encode and decode
  • la = H if the probabilities are (negative) powers of 2



Huffman Codes
Huffman Algorithm
• Start with a forest of trees, each consisting of a single vertex corresponding to a message s and with weight p(s)
• Repeat:
  • Select the two trees with minimum-weight roots p1 and p2
  • Join them into a single tree by adding a root with weight p1 + p2



Huffman Coding
• David A. Huffman (1951)
• Huffman coding uses frequencies of symbols in a string to build a variable rate prefix code
• Each symbol is mapped to a binary string
• More frequent symbols have shorter codes
• No code is a prefix of another
• Example:
  A = 0, B = 100, C = 101, D = 11
  [Tree diagram: A at depth 1, D at depth 2, B and C at depth 3]



Huffman Codes
• We start with a set of symbols, where each symbol is associated with a probability.
• Merge the two symbols having the lowest probabilities into a new symbol.
• Repeat the merging process until all the symbols are merged into a single symbol.
• Following the merging path, we can form the Huffman codes.
Example

• Consider the following three symbols:
  a (with prob. 0.5)
  b (with prob. 0.3)
  c (with prob. 0.2)



Merging Process

[Merging diagram: c (bit 0) and b (bit 1) merge first; the combined node (bit 0) then merges with a (bit 1)]

Huffman Codes:
a = 1
b = 01
c = 00



Example

• Suppose the following message is to be compressed:
  a a a b c a
• The result of the Huffman coding is:
  1 1 1 01 00 1
• Total number of bits used to represent the message: 8 bits (Rate = 8/6 = 4/3)



• If the message is not compressed by the Huffman codes, each symbol would be represented by 2 bits, so the total number of bits used to represent the message is 12 bits.
• We have saved 4 bits using the Huffman codes.



Example
p(a) = .1, p(b) = .2, p(c) = .2, p(d) = .5

[Tree construction: Step 1 merges a(.1) and b(.2) into (.3); Step 2 merges (.3) and c(.2) into (.5); Step 3 merges (.5) and d(.5) into (1.0)]

Resulting codes: a = 000, b = 001, c = 01, d = 1



Encoding and Decoding
Encoding: Start at the leaf of the Huffman tree and follow the path to the root. Reverse the order of the bits and send.
Decoding: Start at the root of the Huffman tree and take a branch for each bit received. When at a leaf, output the message and return to the root.

There are even faster methods that can process 8 or 32 bits at a time.

[Tree diagram: root (1.0) splits into (.5) and d(.5); (.5) splits into (.3) and c(.2); (.3) splits into a(.1) and b(.2)]
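In practice, encoding typically uses a symbol-to-codeword table, and decoding either walks the tree bit by bit or, equivalently, matches prefixes. A minimal table-based sketch, assuming the code a=000, b=001, c=01, d=1 from the earlier example:

CODES = {"a": "000", "b": "001", "c": "01", "d": "1"}

def encode(message: str, codes: dict[str, str]) -> str:
    """Concatenate the codeword of each symbol."""
    return "".join(codes[s] for s in message)

def decode(bits: str, codes: dict[str, str]) -> str:
    """Scan the bit string, emitting a symbol whenever the buffered bits
    match a codeword; this works because the code is prefix-free."""
    reverse = {word: sym for sym, word in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse:
            out.append(reverse[buf])
            buf = ""
    return "".join(out)

bits = encode("dcba", CODES)       # '101001000'
print(bits, decode(bits, CODES))   # 101001000 dcba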



Discussions

• It does not matter how the symbols are arranged.
• It does not matter how the final code tree is labeled (with 0s and 1s).
• The Huffman code is not unique.



Cost of Huffman Trees
• Let A = {a1, a2, ..., am} be the alphabet in which each symbol ai has probability pi
• We can define the cost of the Huffman tree HT as

  C(HT) = Σ_(i=1..m) pi · ri

  where ri is the length of the path from the root to ai
• The cost C(HT) is the expected length (in bits) of a code word represented by the tree HT. The value of C(HT) is called the bit rate of the code.



Cost of Huffman Trees - example
• Example:
• Let a1=A, p1=1/2; a2=B, p2=1/8; a3=C, p3=1/8; a4=D, p4=1/4
where r1=1, r2=3, r3=3, and r4=2

[Tree HT: A at depth 1, D at depth 2, B and C at depth 3]

C(HT) = 1·1/2 + 3·1/8 + 3·1/8 + 2·1/4 = 1.75



Huffman Tree Property
• Input: Given probabilities p1, p2, .., pm for symbols a1, a2, .., am
from alphabet A
• Output: A tree that minimizes the average number of bits
(bit rate) to code a symbol from A
• I.e., the goal is to minimize the function: C(HT) = Σ_(i=1..m) pi · ri,
where ri is the length of the path from the root to leaf ai.
This is called a Huffman tree or Huffman code for alphabet A



Construction of Huffman Trees
• Form a (tree) node for each symbol ai with weight pi
• Insert all nodes into a priority queue PQ (e.g., a heap) ordered by node probabilities
• while (the priority queue has more than one node)
  • min1 ← remove-min(PQ); min2 ← remove-min(PQ);
  • create a new (tree) node T;
  • T.weight ← min1.weight + min2.weight;
  • T.left ← min1; T.right ← min2;
  • insert(PQ, T)
• return (the last node in PQ)
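A runnable version of this construction, using Python's heapq module as the priority queue (a minimal sketch; the helper names are my own):

import heapq
from itertools import count

def huffman_codes(probs: dict[str, float]) -> dict[str, str]:
    """Build a Huffman code with a priority queue, as in the pseudocode above.
    Heap entries are (weight, tie-breaker, tree); a tree is either a symbol
    string (leaf) or a (left, right) pair (internal node)."""
    tie = count()  # tie-breaker avoids comparing trees of equal weight
    pq = [(p, next(tie), sym) for sym, p in probs.items()]
    heapq.heapify(pq)
    while len(pq) > 1:
        w1, _, t1 = heapq.heappop(pq)
        w2, _, t2 = heapq.heappop(pq)
        heapq.heappush(pq, (w1 + w2, next(tie), (t1, t2)))

    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, str):      # leaf: record its codeword
            codes[tree] = prefix or "0"
        else:                          # internal node: 0 to the left, 1 to the right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
    walk(pq[0][2])
    return codes

# The example that follows: A gets a 1-bit code, C a 2-bit code, and the three
# 0.1-probability symbols get 3- and 4-bit codes (average length 2.1 bits).
print(huffman_codes({"A": 0.4, "B": 0.1, "C": 0.3, "D": 0.1, "E": 0.1}))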
Construction of Huffman Trees (example)

P(A) = 0.4, P(B) = 0.1, P(C) = 0.3, P(D) = 0.1, P(E) = 0.1

[Step 1: merge D(0.1) and E(0.1) into a node of weight 0.2; remaining weights: 0.2, 0.1 (B), 0.3 (C), 0.4 (A)]

[Step 2: merge B(0.1) with the (D,E) node (0.2) into a node of weight 0.3; remaining weights: 0.3, 0.3 (C), 0.4 (A)]

[Step 3: merge C(0.3) with the (B,(D,E)) node (0.3) into a node of weight 0.6; remaining weights: 0.6, 0.4 (A)]

[Step 4: merge A(0.4) with the 0.6 node into the root of weight 1.0]

Reading the codes off the final tree:
A = 0
B = 100
C = 11
D = 1010
E = 1011
Huffman Codes
• Theorem: For any source S the Huffman code can be computed efficiently in time O(n log n), where n is the size of the source S.
  Proof: The time complexity of the Huffman coding algorithm is dominated by the use of priority queues.
• One can also prove that Huffman coding creates the most efficient set of prefix codes for a given text.
• It is also one of the most efficient entropy coders.



Huffman Code vs. Entropy
P(A)= 0.4, P(B)= 0.1, P(C)= 0.3, P(D)= 0.1, P(E)= 0.1
• Entropy:
• 0.4 · log2(10/4) + 0.1 · log2(10) + 0.3 · log2(10/3) + 0.1 · log2(10) +
0.1 · log2(10) = 2.05 bits per symbol
• Huffman Code:
• 0.4 · 1 + 0.1 · 3 + 0.3 · 2 + 0.1 · 4 + 0.1 · 4 = 2.10
• Not bad, not bad at all.



Run Length Coding



Introduction – What is RLE?
• Compression technique
• Represents data using value and run length
• Run length defined as number of consecutive equal values
e.g. RLE of 1110011111 gives 1 3 0 2 1 5, i.e., the value/run-length pairs (1,3), (0,2), (1,5).
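A minimal run-length encoder/decoder sketch (the names are illustrative); the encoder simply emits (value, run length) pairs:

from itertools import groupby

def rle_encode(data):
    """Emit a (value, run length) pair for each run of consecutive equal values."""
    return [(value, len(list(run))) for value, run in groupby(data)]

def rle_decode(pairs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [value for value, length in pairs for _ in range(length)]

pairs = rle_encode("1110011111")
print(pairs)                       # [('1', 3), ('0', 2), ('1', 5)]
print("".join(rle_decode(pairs)))  # 1110011111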



Introduction
• Compression effectiveness depends on input
• Must have consecutive runs of values in order to maximize
compression
• Best case: all values same
• Can represent any length using two values
• Worst case: no repeating values
• Compressed data twice the length of original!!
• Should only be used in situations where we know for sure that we have repeating values



[Figure: Run-length encoding example]
[Figure: Run-length encoding for two symbols]
Encoder – Results
Input: 4,5,5,2,7,3,6,9,9,10,10,10,10,10,10,0,0
Output: 4,1,5,2,2,1,7,1,3,1,6,1,9,2,10,6,0,2,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1…
(The valid output ends after the 0,2 pair; the trailing -1 entries are unused padding in the fixed-size output buffer.)

Best Case:
Input: 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Output: 0,16,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1…

Worst Case:
Input: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
Output: 0,1,1,1,2,1,3,1,4,1,5,1,6,1,7,1,8,1,9,1,10,1,11,1,12,1,13,1,14,1,15,1
