NUST School of Electrical Engineering & Computer Sciences (SEECS)
Department of Communication Systems Engineering
CSE-434: Systems Engineering
Source Coding and Data Compression
Information Theory
• The mathematical theory of communication
involving the quantification of information
• Deals with the transmission of information over
a noisy channel
– Source coding theorem
– Noisy channel coding theorem
• Is not concerned with the meaning or importance of a
message
Shannon’s Generic Communication System
Generic Communication System, from Chapter 2 of K.V. Prasad.
Components of a Communication System
• Information source produces the symbols
• Source encoder converts the symbols into a data
stream
– Source encoding reduces the redundancy
– Can be divided into lossless encoding techniques and
lossy encoding techniques
• Channel encoder introduces redundancy for error
detection or error correction at the receiver
• Modulator transforms the signal so that it can be
transmitted through the medium
Entropy
• Shannon’s formula for measuring the information
content of a source, known as entropy
H = log2(N) bits/symbol
for N equally likely symbols
H = -Σ P(i) log2 P(i) bits/symbol
if the i-th symbol has probability P(i)
Example 1
• Entropy of a source producing English alphabet
with each symbol being equally likely:
H = log2(26) ≈ 4.7 bits/symbol
• Entropy of a source producing 4 symbols with
probabilities {0.5, 0.25, 0.125, 0.125} respectively:
H = -Σ P(i) log2 P(i)
H = 1.75 bits/symbol
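A small Python sketch (function names are my own) of the two entropy formulas above; it reproduces the log2(26) ≈ 4.7 and 1.75 bits/symbol values from Example 1.

import math

def entropy_equiprobable(n):
    """Entropy of a source with n equally likely symbols: H = log2(n)."""
    return math.log2(n)

def entropy(probabilities):
    """Entropy of a source with the given symbol probabilities:
    H = -sum(P(i) * log2(P(i)))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy_equiprobable(26))            # ~4.70 bits/symbol (English alphabet)
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits/symbol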
Exercise 1
• Calculate the entropy of a source that produces
4 symbols with probability 1/8 and 2 symbols
with probability 1/4
• Answer: 2.5 bits/symbol
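• Worked check: H = 4 × (1/8) × log2(8) + 2 × (1/4) × log2(4) = 1.5 + 1.0 = 2.5 bits/symbol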
Data Compression
• The process of encoding information using
fewer bits than the original message
• Lossless data compression
– e.g. RLE, Huffman, Shannon-Fano, LZW, LZ77
• Lossy data compression
– e.g. JPEG, MPEG, AMR, AC3
Run Length Encoding
• Sequences of repeating characters are replaced
by a count and the character
• Useful when input text has long repeating
sequences
• A special character is inserted to mark each
compressed run
Example 2:
Input stream: WHOOOOOODUNNNNNIT!!!
Special char: \
Output stream: WH\6ODU\5NIT\3!
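A minimal Python sketch of this run-length scheme; the function name and the choice to compress only runs of three or more characters are my own assumptions, chosen so the output matches Example 2.

def rle_encode(text, escape="\\"):
    """Run-length encode `text`: runs of 3 or more identical characters
    are replaced by <escape><count><char>; shorter runs are copied as-is."""
    out = []
    i = 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                       # extend the current run
        run = j - i
        if run >= 3:
            out.append(f"{escape}{run}{text[i]}")
        else:
            out.append(text[i] * run)
        i = j
    return "".join(out)

print(rle_encode("WHOOOOOODUNNNNNIT!!!"))  # WH\6ODU\5NIT\3!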
Huffman Compression
• Variable length lossless data compression
technique
• More frequently occurring characters are given
shorter codes
Example 3:
• Input stream: WHOOOOOODUNNNNNIT!!!
Character W H O D U N I T !
Frequency 1 1 6 1 1 5 1 1 3
Sorted
Character W H D U I T ! N O
Frequency 1 1 1 1 1 1 3 5 6
Huffman Compression
Character D U I T (W+H) ! N O
Frequency 1 1 1 1 2 3 5 6
Character I T (D+U) (W+H) ! N O
Frequency 1 1 2 2 3 5 6
Character (I+T) (D+U) (W+H) ! N O
Frequency 2 2 2 3 5 6
Character (W+H) ! (I+T)+(D+U) N O
Frequency 2 3 4 5 6
Character ((I+T)+(D+U)) ((W+H)+!) N O
Frequency 4 5 5 6
Huffman Compression
Character (((I+T)+(D+U))+((W+H)+!)) N+O
Frequency 9 11
Huffman Codes
[Figure: Huffman tree]
Char Code
O 11
N 10
! 011
H 0101
W 0100
U 0011
D 0010
T 0001
I 0000
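A compact Python sketch of Huffman code construction using a priority queue (a standard formulation, not necessarily the exact procedure used for the table above); because ties can be broken differently, the bit patterns may differ from the table, but the average code length is the same.

import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code for the characters in `text`.
    Returns a dict mapping each character to its bit string."""
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, subtree); a subtree is either
    # a single character or a (left, right) pair of subtrees.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"      # single-symbol source edge case
        return codes
    _, _, root = heap[0]
    return walk(root, "")

print(huffman_codes("WHOOOOOODUNNNNNIT!!!"))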
Exercise 2
• Find the entropy of the source in the previous
example
• Find the average number of bits/symbol for the
Huffman code derived in the previous example
– Hint: use L = Σ L(i) P(i),
where L(i) is the length of the code assigned to symbol i
(a small code sketch follows this exercise)
• Discuss why the Huffman code will be better than a
fixed-length code
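A short Python sketch of the hint formula, L = Σ L(i)·P(i), applied to the Huffman code table and the frequencies from Example 3 (the helper name is my own).

def average_code_length(codes, freq):
    """L = sum over symbols of L(i) * P(i), where L(i) = len(codes[i])
    and P(i) = freq[i] / total."""
    total = sum(freq.values())
    return sum(len(codes[s]) * freq[s] / total for s in freq)

# Code table and frequencies from the Huffman example above.
codes = {"O": "11", "N": "10", "!": "011", "H": "0101", "W": "0100",
         "U": "0011", "D": "0010", "T": "0001", "I": "0000"}
freq = {"W": 1, "H": 1, "O": 6, "D": 1, "U": 1, "N": 5, "I": 1, "T": 1, "!": 3}

print(average_code_length(codes, freq))  # average bits/symbol for this code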
Shannon-Fano
Algorithm:
1. Determine the probability of each symbol in the source
text
2. Sort the symbols in decreasing probability order
3. Divide the set of symbols into two parts such that each
part has an approximately equal probability
4. The symbols in the first part are coded with the bit zero
and the symbols in the second part with the bit one
5. Repeat steps 3 and 4 until each sub-division contains
exactly one symbol
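A recursive Python sketch of the five steps above; the split rule in step 3 is interpreted here as choosing the cut point that minimizes the difference between the two parts' total weights. Raw frequencies are used in place of probabilities, since only the ratios matter; with this tie-breaking it should reproduce the code table that follows.

def shannon_fano(symbols):
    """`symbols` is a list of (symbol, weight) pairs, where a weight may be
    a probability or a raw count. Returns a dict of symbol -> bit string."""
    symbols = sorted(symbols, key=lambda sw: sw[1], reverse=True)  # step 2
    codes = {s: "" for s, _ in symbols}

    def split(group):
        if len(group) <= 1:                  # step 5: stop at one symbol
            return
        total = sum(w for _, w in group)
        # Step 3: find the cut making the two parts' weights most nearly equal.
        best_cut, best_diff, running = 1, float("inf"), 0
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(2 * running - total)
            if diff < best_diff:
                best_cut, best_diff = i, diff
        # Step 4: first part gets a 0, second part gets a 1.
        for s, _ in group[:best_cut]:
            codes[s] += "0"
        for s, _ in group[best_cut:]:
            codes[s] += "1"
        split(group[:best_cut])
        split(group[best_cut:])

    split(symbols)
    return codes

counts = {"O": 6, "N": 5, "!": 3, "W": 1, "H": 1, "D": 1, "U": 1, "I": 1, "T": 1}
print(shannon_fano(list(counts.items())))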
Shannon-Fano Example
Example 4:
• Input stream: WHOOOOOODUNNNNNIT!!!
Character W H O D U N I T !
Probability 1/20 1/20 6/20 1/20 1/20 5/20 1/20 1/20 3/20
Shannon-Fano Codes
[Figure: Shannon-Fano tree]
Char Code
O 00
N 01
! 100
W 101
H 1100
D 1101
U 1110
I 11110
T 11111
Exercise 3: Find the average number of bits/symbol for this code
LZW Coding Example
• Lossless data compression algorithm
• Created by Abraham Lempel, Jacob Ziv and Terry
Welch
• Does not require knowing the probabilities of symbol
occurrence
Example 5:
• Input Stream: TOBEORNOTTOBEORTOBEORNOT#
• Symbols: A-Z, #
• 5 bits per symbol required for a fixed-length code
• Length of message: 25 x 5 = 125 bits
LZW Coding Example
Compressed Message = 97 bits
Ref: http://en.wikipedia.org/wiki/LZW
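A minimal Python sketch of LZW encoding with the A–Z, # alphabet from Example 5; it emits only the sequence of dictionary indices, so the 97-bit figure (which depends on the variable-width code packing used in the reference) is not computed here. The index assignment is my own choice.

def lzw_encode(text, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ#"):
    """LZW-encode `text`, starting from a dictionary containing `alphabet`.
    Returns the list of dictionary indices emitted by the encoder."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    w, output = "", []
    for ch in text:
        if w + ch in dictionary:
            w += ch                               # extend the current match
        else:
            output.append(dictionary[w])          # emit code for longest match
            dictionary[w + ch] = len(dictionary)  # add new string to dictionary
            w = ch
    if w:
        output.append(dictionary[w])
    return output

codes = lzw_encode("TOBEORNOTTOBEORTOBEORNOT#")
print(len(codes), codes)  # fewer output codes than the 25 input symbols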
JPEG
• "Joint Photographic Expert Group" – an
international standard in 1992
• JPEG is a commonly used method of
compression for photographic images
• Works with both color and grey-scale images
• A JPEG file can be encoded in several formats, e.g.
JFIF (JPEG File Interchange Format)
JPEG
[Figure: loss of information under JPEG compression]
Coding Techniques
• Text
– ASCII, Extended ASCII, Morse, RLE, Huffman,
Adaptive Huffman, Shannon-Fano, LZ77, LZ78, LZW,
CTW, BWT, DMC
• Audio
– A-law, μ-law, G.7xx (ITU-T suite of standards)
Error Detection and Correction
• Ability to detect transmission errors in the
received data and to reconstruct the original
data
• Error detection techniques
– e.g. Parity, Checksum, CRC, Hamming codes, Hash
functions
• Error correction techniques
– ARQ (Stop-and-Wait, Go-back-N, Selective Repeat)
– FEC (Hamming, Reed-Solomon, Golay)
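As a concrete illustration of the simplest detection technique in the list, a Python sketch of a single even-parity bit (function names and data are my own minimal example, not a full ARQ or FEC scheme).

def add_even_parity(bits):
    """Append one parity bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(codeword):
    """A single-bit error flips the overall parity and is detected."""
    return sum(codeword) % 2 == 0

word = add_even_parity([1, 0, 1, 1, 0, 1, 0])    # data word + parity bit
corrupted = word.copy()
corrupted[3] ^= 1                                # flip one bit in transit
print(parity_ok(word), parity_ok(corrupted))     # True False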
Cryptography and Steganography
• Cryptography is the study of hiding the content of
information
– Substitution ciphers
– Transposition ciphers
– One-time pads
– Symmetric and public key algorithms
• Steganography is the study of hiding the
existence of information
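A toy Python sketch of a substitution cipher (a Caesar shift over A–Z), illustrating the first item above; this is my own minimal example and is not a secure algorithm.

def caesar(text, shift):
    """Substitute each letter with the letter `shift` positions later (mod 26)."""
    result = []
    for ch in text.upper():
        if "A" <= ch <= "Z":
            result.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
        else:
            result.append(ch)  # leave non-letters unchanged
    return "".join(result)

cipher = caesar("ATTACK AT DAWN", 3)     # "DWWDFN DW GDZQ"
print(cipher, caesar(cipher, -3))        # decrypt by shifting back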
Modulation
• The addition of information to a signal carrier
– Digital data, digital signal (data encoding)
– Digital data, analog signal (ASK, FSK, PSK)
– Analog data, analog signal (AM, FM, PM)
– Analog data, digital signal (PCM, DM)
• Reasons
– Compatibility of signal with transmission medium
– Frequency division multiplexing