Information Theory and Source Coding
Contents
Information Theory:
➢ Discrete messages
➢ Concept of amount of information and its properties
➢ Average information
➢ Entropy and its properties
➢ Information rate
➢ Mutual information and its properties
➢ Illustrative Problems
Source Coding:
➢ Introduction
➢ Advantages
➢ Shannon–Hartley theorem
➢ Bandwidth–S/N trade-off
➢ Shannon–Fano coding
➢ Huffman coding
➢ Illustrative Problems
Information Theory
There are two fundamentally different ways to transmit messages: via discrete signals
and via continuous signals. For example, the letters of the English alphabet are commonly
thought of as discrete signals.
Information sources
Definition:
The set of source symbols is called the source alphabet, and the elements of the set are
called the symbols or letters.
A measure of information should satisfy two requirements. The number of possible answers r
should be linked to "information", and "information" should be additive in some sense.
We define the following measure of information:
$$I(U) = \log_b r$$
where r is the number of possible outcomes of the message U. The base b of the logarithm
only changes the units, without changing the amount of information described.
A discrete memoryless source (DMS) can be characterized by "the list of the symbols, the
probability assignment to these symbols, and the specification of the rate of generating these
symbols by the source".
1. Information should be proportional to the uncertainty of an outcome.
2. Information contained in independent outcomes should add.
Scope of Information Theory
The basic setup in Information Theory has:
– a source,
– a channel and
– destination.
The output from source is conveyed through the channel and received at the destination.
The source is a random variable S which takes symbols from a finite alphabet, each symbol
occurring with an associated probability.
Properties of Information
Entropy:
The entropy H(S) of a source is defined as the average information generated by a
discrete memoryless source.
Information content of a symbol:
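Let us consider a discrete memoryless source (DMS) denoted by X and having the
alphabet {x1, x2, x3, …, xm}. The information content of the symbol xi, denoted by I(xi), is
defined as
$$I(x_i) = \log_b \frac{1}{P(x_i)} = -\log_b P(x_i)$$
where P(xi) is the probability of occurrence of the symbol xi.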
Units of I(xi):
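For two important and one unimportant special cases of b it has been agreed to use the
following names for these units:
b = 2 (log2): bit,
b = e (ln): nat,
b = 10 (log10): Hartley.
Logarithms to base 2 can be computed from natural or common logarithms:
$$\log_2 a = \frac{\ln a}{\ln 2} = \frac{\log_{10} a}{\log_{10} 2}$$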
Definition:
The information content of individual symbols can fluctuate widely because of the
randomness involved in the selection of symbols. We therefore use the average information
content per symbol, the entropy:
$$H(U) = E[I(U)] = -\sum_{u \in \operatorname{supp}(P_U)} P_U(u) \log_b P_U(u)$$
where P_U(·) denotes the probability mass function (PMF) of the RV U, and where
the support of P_U is defined as
$$\operatorname{supp}(P_U) = \{u \in \mathcal{U} : P_U(u) > 0\}$$
We will usually neglect to mention the "support" when we sum over P_U(u) · log_b P_U(u), i.e., we
implicitly assume that we exclude all u with zero probability P_U(u) = 0.
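It may be noted that for a binary source U which generates independent symbols 0 and 1
with equal probability, the source entropy H(U) is
$$H(U) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit/symbol}$$
Bounds on H(U)
$$0 \le H(U) \le \log_b r$$
where r is the number of symbols in the alphabet of U.
To derive the upper bound we use a trick that is quite common in information theory:
we take the difference H(U) − log_b r and try to show that it must be nonpositive.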
Equality in the upper bound can only be achieved if P_U(u) = 1/r for all u, i.e., if U is
uniformly distributed.
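The entropy definition and its bounds can be checked numerically. The following is a minimal sketch, assuming the PMF is given as a plain list of probabilities; the function name `entropy` is chosen here for illustration only.

```python
import math

def entropy(pmf, base=2.0):
    """Entropy of a PMF given as a list of probabilities.

    Symbols with zero probability are skipped, which corresponds to
    summing only over the support of the PMF.
    """
    return -sum(p * math.log(p) / math.log(base) for p in pmf if p > 0)

# Binary source with equiprobable symbols 0 and 1.
h_binary = entropy([0.5, 0.5])

# A skewed source over r = 4 symbols stays below the upper bound log2(r).
skewed = [0.7, 0.1, 0.1, 0.1]
h_skewed = entropy(skewed)
upper_bound = math.log2(len(skewed))

print(h_binary, h_skewed, upper_bound)
```

The uniform PMF over r symbols attains the upper bound, while a deterministic source (one symbol with probability 1) attains the lower bound of zero.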
Conditional Entropy
As with the probability of random vectors, there is nothing really new about conditional
probabilities given that a particular event Y = y has occurred.
The conditional entropy or conditional uncertainty of the RV X given the event Y = y is
defined as
$$H(X \mid Y = y) = -\sum_{x} P_{X|Y}(x \mid y) \log_b P_{X|Y}(x \mid y)$$
Note that this definition is identical to the one before, except that everything is conditioned
on the event Y = y.
Note that the conditional entropy given the event Y = y is a function of y. Since Y is
also a RV, we can now average over all possible events Y = y according to the probabilities
of each event. This leads to the averaged conditional entropy
$$H(X \mid Y) = \sum_{y} P_Y(y)\, H(X \mid Y = y)$$
Mutual Information
Although conditional entropy can tell us when two variables are completely
independent, it is not an adequate measure of dependence. A small value for H(Y|X) may
imply that X tells us a great deal about Y or that H(Y) is small to begin with. Thus, we
measure dependence using mutual information:
I(X,Y) = H(Y) – H(Y|X)
I(X,Y) = H(X) – H(X|Y)
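Mutual information can also be written as a Kullback–Leibler (KL) divergence between the
joint distribution and the product of the marginal distributions:
$$I(X,Y) = D\big(p(x,y)\,\|\,p(x)p(y)\big), \qquad D(p\,\|\,q) = \sum_{z} p(z) \log \frac{p(z)}{q(z)}$$
KL divergence measures the difference between two distributions. It is sometimes called the
relative entropy. It is always non-negative and zero only when p = q; however, it is not a
distance because it is not symmetric.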
In other words, mutual information is a measure of the difference between the joint
probability and product of the individual probabilities. These two distributions are equivalent
only when X and Y are independent, and diverge as X and Y become more dependent.
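The relation I(X,Y) = H(Y) – H(Y|X) can be illustrated with a small numerical sketch. It assumes the joint PMF is supplied as a nested list `joint[x][y]`; the helper names are hypothetical and the joint distributions used are made-up examples.

```python
import math

def entropy_bits(pmf):
    """Entropy in bits of a list of probabilities (zero entries skipped)."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def mutual_information(joint):
    """I(X,Y) = H(Y) - H(Y|X) for a joint PMF given as joint[x][y]."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    # H(Y|X): average over x of the entropy of the conditional PMF P(y|x).
    h_y_given_x = sum(
        px[i] * entropy_bits([p / px[i] for p in joint[i]])
        for i in range(len(joint))
        if px[i] > 0
    )
    return entropy_bits(py) - h_y_given_x

independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent
identical = [[0.5, 0.0], [0.0, 0.5]]         # Y is a copy of X

print(mutual_information(independent), mutual_information(identical))
```

For the independent case the joint PMF equals the product of the marginals, so the mutual information vanishes; when Y is a copy of X it equals H(Y).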
Source coding
Coding theory is the study of the properties of codes and their respective fitness for
specific applications. Codes are used for data compression, cryptography, error-
correction, and networking. Codes are studied by various scientific disciplines—such as
information theory, electrical engineering, mathematics, linguistics, and computer
science—for the purpose of designing efficient and reliable data transmission methods.
This typically involves the removal of redundancy and the correction or detection of
errors in the transmitted data.
The aim of source coding is to take the source data and make it smaller.
All source models in information theory may be viewed as random process or random
sequence models. Let us consider the example of a discrete memoryless source
(DMS), which is a simple random sequence model.
A DMS is a source whose output is a sequence of letters such that each letter is
independently selected from a fixed alphabet consisting of letters, say a1, a2, …, ak.
The letters in the source output sequence are assumed to be random
and statistically independent.
Let us consider a source with four letters a1, a2, a3 and a4 with P(a1)=0.5,
P(a2)=0.25, P(a3)=0.13, P(a4)=0.12. Let us decide to go for binary coding of these
four source letters. While this can be done in multiple ways, two encoded representations
are shown below:
Code Representation#1:
Code Representation#2:
It is easy to see that in method #1 the probability assignment of a source letter has not
been considered and all letters have been represented by two bits each. However, in
the second method only a1 has been encoded in one bit, a2 in two bits and the
remaining two in three bits. It is easy to see that the average number of bits to be used
per source letter for the two methods is not the same: the average length for method #1 is
2 bits per letter, while for method #2 it is 0.5(1) + 0.25(2) + 0.13(3) + 0.12(3) = 1.75 bits
per letter. So, if we consider the issue of encoding a long sequence of letters, we have to
transmit fewer bits following the second method. This is an important aspect of the source
coding operation in general. At this point, let us note:
a) We observe that assigning a small number of bits to more probable letters and
a larger number of bits to less probable letters (or symbols) may lead to an
efficient source encoding scheme.
b) However, one has to take additional care while transmitting the encoded letters. A
careful inspection of the binary representation of the symbols in method #2 reveals
that it may lead to confusion (at the decoder end) in deciding the end of the binary
representation of a letter and the beginning of the subsequent letter.
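The comparison of the two methods can be reproduced with a few lines of code. This is a minimal sketch that uses only the codeword lengths stated above (two bits each for method #1; one, two, three and three bits for method #2); the dictionary and function names are illustrative.

```python
# Letter probabilities of the four-letter source.
probs = {"a1": 0.5, "a2": 0.25, "a3": 0.13, "a4": 0.12}

# Codeword lengths in bits for the two representations.
lengths_method_1 = {"a1": 2, "a2": 2, "a3": 2, "a4": 2}
lengths_method_2 = {"a1": 1, "a2": 2, "a3": 3, "a4": 3}

def average_length(probs, lengths):
    """Average number of code bits per source letter."""
    return sum(probs[s] * lengths[s] for s in probs)

avg_1 = average_length(probs, lengths_method_1)
avg_2 = average_length(probs, lengths_method_2)
print(avg_1, avg_2)
```

Only the lengths enter the average, so the saving of method #2 holds for any assignment of codewords with these lengths; whether the decoder can separate the codewords is the separate concern raised in point b).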
Shannon-Fano Code
Shannon–Fano coding, named after Claude Elwood Shannon and Robert Fano, is a technique
for constructing a prefix code based on a set of symbols and their probabilities. It is
suboptimal in the sense that it does not achieve the lowest possible expected codeword length
like Huffman coding; however unlike Huffman coding, it does guarantee that all codeword
lengths are within one bit of their theoretical ideal I(x) =−log P(x).
In Shannon–Fano coding, the symbols are arranged in order from most probable to least
probable, and then divided into two sets whose total probabilities are as close as possible to
being equal. All symbols then have the first digits of their codes assigned; symbols in the first
set receive "0" and symbols in the second set receive "1". As long as any sets with more than
one member remain, the same process is repeated on those sets, to determine successive
digits of their codes. When a set has been reduced to one symbol, of course, this means the
symbol's code is complete and will not form the prefix of any other symbol's code.
The algorithm works, and it produces fairly efficient variable-length encodings; when the two
smaller sets produced by a partitioning are in fact of equal probability, the one bit of
information used to distinguish them is used most efficiently. Unfortunately, Shannon–Fano
does not always produce optimal prefix codes.
For this reason, Shannon–Fano is almost never used; Huffman coding is almost as
computationally simple and produces prefix codes that always achieve the lowest expected
code word length. Shannon–Fano coding is used in the IMPLODE compression method,
which is part of the ZIP file format, where it is desired to apply a simple algorithm with high
performance and minimum requirements for programming.
Shannon-Fano Algorithm:
A Shannon–Fano tree is built according to a specification designed to define an
effective code table. The actual algorithm is simple:
1. For a given list of symbols, develop a corresponding list of probabilities or frequency
counts so that each symbol's relative frequency of occurrence is known.
2. Sort the list of symbols according to frequency, with the most frequently occurring
symbols at the left and the least common at the right.
3. Divide the list into two parts, with the total frequency count of the left part being as
close to the total of the right as possible.
4. The left part of the list is assigned the binary digit 0, and the right part is assigned
the digit 1. This means that the codes for the symbols in the first part will all start
with 0, and the codes in the second part will all start with 1.
5. Recursively apply steps 3 and 4 to each of the two halves, subdividing groups
and adding bits to the codes until each symbol has become a corresponding code leaf
on the tree.
Example:
The source of information A generates the symbols {A0, A1, A2, A3, A4} with the
corresponding probabilities {0.4, 0.3, 0.15, 0.1, 0.05}. Encoding the source symbols
using a binary encoder and a Shannon–Fano encoder gives
Shannon–Fano coding is a top-down approach. Constructing the code tree, we get
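The top-down splitting procedure can be sketched in code and applied to the five-symbol source of this example. This is an illustrative implementation under stated assumptions, not a reproduction of the original code table: the function name `shannon_fano` is hypothetical, and ties in the split position are resolved by taking the first best split.

```python
def shannon_fano(symbols):
    """symbols: list of (name, probability). Returns {name: code string}."""
    # Most probable symbols first.
    items = sorted(symbols, key=lambda sp: sp[1], reverse=True)
    codes = {name: "" for name, _ in items}

    def split(group):
        if len(group) < 2:
            return  # a single symbol: its code is complete
        total = sum(p for _, p in group)
        best_k, best_diff, running = 1, float("inf"), 0.0
        # Find the split point that makes the two parts as equal as possible.
        for k in range(1, len(group)):
            running += group[k - 1][1]
            diff = abs(running - (total - running))
            if diff < best_diff:
                best_k, best_diff = k, diff
        left, right = group[:best_k], group[best_k:]
        for name, _ in left:
            codes[name] += "0"
        for name, _ in right:
            codes[name] += "1"
        split(left)
        split(right)

    split(items)
    return codes

source = [("A0", 0.4), ("A1", 0.3), ("A2", 0.15), ("A3", 0.1), ("A4", 0.05)]
codes = shannon_fano(source)
avg_len = sum(p * len(codes[name]) for name, p in source)
print(codes, avg_len)
```

With these probabilities the first split separates A0 from the remaining symbols, so A0 receives a one-bit code and the least probable symbols receive the longest codes. A fixed-length binary encoder would need three bits per symbol for five symbols.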
Binary Huffman Coding (an optimum variable-length source coding scheme)
In Binary Huffman Coding each source letter is converted into a binary code
word. It is a prefix condition code ensuring minimum average length per source letter in
bits.
Let the source letters a1, a2, …, aK have probabilities P(a1), P(a2), …, P(aK)
and let us assume that P(a1) ≥ P(a2) ≥ P(a3) ≥ … ≥ P(aK).
We now consider a simple example to illustrate the steps for Huffman coding.
Example: Let us consider a discrete memoryless source with six letters having
P(a1)=0.3, P(a2)=0.2, P(a3)=0.15, P(a4)=0.15, P(a5)=0.12 and P(a6)=0.08.
(1) Arrange the letters in descending order of their probability (here they are
already arranged).
(2) Consider the last two probabilities. Tie up the last two probabilities. Assign, say, 0
to the last digit of the representation for the least probable letter (a6) and 1 to the last
digit of the representation for the second least probable letter (a5). That is, assign '1'
to the upper arm of the tree and '0' to the lower arm.
(3) Now, add the two probabilities and imagine a new letter, say b1, substituting for a6
and a5. So P(b1)=0.2. Check whether a4 and b1 are the least likely letters. If not,
reorder the letters as per Step (1) and add the probabilities of the two least likely letters.
For our example, it leads to:
P(a1)=0.3, P(a2)=0.2, P(b1)=0.2, P(a3)=0.15 and P(a4)=0.15
(4) Now go to Step (2) and start with the reduced ensemble consisting of a1, a2, b1, a3
and a4. Continue till the first digits of the most reduced ensemble of two letters are
assigned a '1' and a '0'.
(5) Going back to Step (2) again, a3 and a4 are merged into a new letter b2 with P(b2)=0.3,
which gives: P(a1)=0.3, P(b2)=0.3, P(a2)=0.2 and P(b1)=0.2.
Now we consider the last two probabilities, those of a2 and b1, and repeat the procedure
until only two letters remain.
(6) Now, read the code tree inward, starting from the root, and construct the
code words. The first digit of a codeword appears first while reading the code tree
inward.
Hence, the final representation is: a1=11, a2=01, a3=101, a4=100, a5=001, a6=000.
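The merging procedure can be sketched with a priority queue. The code below is an illustration under stated assumptions: the probabilities P(a5)=0.12 and P(a6)=0.08 are assumed values chosen to be consistent with P(b1)=0.2 and with the source entropy of 2.465 bits/symbol, and the function name `huffman` is hypothetical. The exact 0/1 labels depend on how ties and arms are labelled, so only the codeword lengths are expected to agree with the representation above.

```python
import heapq
import math
from itertools import count

def huffman(probs):
    """probs: {symbol: probability}. Returns {symbol: binary code string}."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Tie up the two least probable (possibly merged) letters.
        p0, _, low = heapq.heappop(heap)
        p1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

# P(a5) and P(a6) are assumed values (see the note above).
probs = {"a1": 0.3, "a2": 0.2, "a3": 0.15, "a4": 0.15, "a5": 0.12, "a6": 0.08}
codes = huffman(probs)
avg_len = sum(p * len(codes[s]) for s, p in probs.items())
source_entropy = -sum(p * math.log2(p) for p in probs.values())
print(codes, avg_len, source_entropy)
```

Prepending a digit at every merge builds each codeword from its last digit to its first, which mirrors reading the tree from the root inward once the merging is complete.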
A few observations on the preceding example
4. Note that the entropy of the source is H(X)=2.465 bits/symbol. The average length
per source letter after Huffman coding is 2.5 bits, which is a little more than, but close to,
the source entropy. In fact, the following celebrated theorem due to C. E. Shannon sets the
limiting value of the average length of code words from a DMS.
Shannon–Hartley theorem
In information theory, the Shannon–Hartley theorem tells the maximum rate at which
information can be transmitted over a communications channel of a specified bandwidth in
the presence of noise. It is an application of the noisy-channel coding theorem to the
archetypal case of a continuous-time analog communications channel subject to Gaussian
noise. The theorem establishes Shannon's channel capacity for such a communication link, a
bound on the maximum amount of error-free information per time unit that can be transmitted
with a specified bandwidth in the presence of the noise interference, assuming that the signal
power is bounded, and that the Gaussian noise process is characterized by a known power or
power spectral density.
The law is named after Claude Shannon and Ralph Hartley.
The theory behind designing and analyzing channel codes rests on Shannon's noisy-channel
coding theorem. It puts an upper limit on the amount of information you can
send over a noisy channel using a perfect channel code. This limit is given by the following
equation:
$$C = B \log_2(1 + \mathrm{SNR})$$
where C is the channel capacity, i.e., the upper bound on the information rate (bit/s), B is the
bandwidth of the channel (Hz) and SNR is the signal-to-noise ratio (a unitless power ratio, not in dB).
Bandwidth-S/N Tradeoff
The expression for the channel capacity of the Gaussian channel makes intuitive
sense:
As S/N increases, one can increase the information rate while still preventing errors
due to noise.
Thus we may trade off bandwidth for SNR. For example, if S/N = 7 and B = 4 kHz,
then the channel capacity is C = 12 × 10^3 bits/s. If the SNR increases to S/N = 15 and B
is decreased to 3 kHz, the channel capacity remains the same. However, as B tends to
infinity, the channel capacity does not become infinite since, with an increase in bandwidth,
the noise power also increases. If the noise power spectral density is η/2, then the total
noise power is N = ηB, so the Shannon–Hartley law becomes
$$C = B \log_2\left(1 + \frac{S}{\eta B}\right)$$
In the limit B → ∞ this approaches (S/η) log2 e ≈ 1.44 S/η, which is finite.
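The trade-off can be verified numerically. The sketch below evaluates the Shannon–Hartley formula for the two operating points quoted above and then for a fixed signal power with growing bandwidth; the signal power and noise density values are arbitrary illustrative assumptions, as are the function names.

```python
import math

def capacity(bandwidth_hz, snr):
    """Shannon-Hartley capacity in bit/s for a linear (not dB) S/N ratio."""
    return bandwidth_hz * math.log2(1.0 + snr)

def capacity_fixed_power(bandwidth_hz, signal_power, eta):
    """Capacity when the noise power grows with bandwidth as N = eta * B."""
    return bandwidth_hz * math.log2(1.0 + signal_power / (eta * bandwidth_hz))

c_a = capacity(4000.0, 7.0)    # S/N = 7,  B = 4 kHz
c_b = capacity(3000.0, 15.0)   # S/N = 15, B = 3 kHz

# Assumed values for illustration: S = 1 W, eta = 1e-3 W/Hz.
S, ETA = 1.0, 1e-3
limit = (S / ETA) * math.log2(math.e)   # capacity as B tends to infinity
for b in (1e3, 1e4, 1e6, 1e9):
    print(b, capacity_fixed_power(b, S, ETA), limit)
```

Both operating points give the same capacity, and for fixed signal power the capacity grows with bandwidth but saturates at the finite limit instead of increasing without bound.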
Linear Block Codes
Introduction
Coding theory is concerned with the transmission of data
across noisy channels and the recovery of corrupted messages. It has found
widespread applications in electrical engineering, digital communication,
mathematics and computer science. The transmission of data over the channel depends
upon two parameters: the transmitted power and the channel bandwidth. These two parameters,
together with the power spectral density of the channel noise, determine the signal-to-noise
power ratio.
The signal-to-noise power ratio determines the probability of error of the modulation
scheme. Errors are introduced in the data when it passes through the channel, because the
channel noise interferes with the signal and the signal power is reduced. For a given
signal-to-noise ratio, the error probability can be reduced further by using coding techniques.
Coding techniques also reduce the signal-to-noise power ratio required for a fixed
probability of error.
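For each block of k message bits, (n-k) parity bits or check bits are added. Hence the
total number of bits at the output of the channel encoder is n. Such codes are called (n, k) block
codes. The figure illustrates this concept.
Figure: (n, k) block code. A block of k message bits enters the channel encoder; the n-bit
output block consists of the k message bits and the (n-k) check bits.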
The types of block codes are:
Systematic codes:
In a systematic block code, the message bits appear at the beginning of the code
word. The message bits are transmitted first, followed by the check bits, within each block.
Nonsystematic codes:
In a nonsystematic block code it is not possible to identify the message bits and
check bits separately. They are mixed within the block.
Consider binary codes, in which all the transmitted digits are binary.
A code is linear if the sum of any two code vectors produces another code vector.
This shows that any code vector can be expressed as a linear combination of other code
vectors. Consider a particular code vector consisting of message bits m1, m2, m3, …, mk and
check bits c1, c2, c3, …, cq. Then this code vector can be written as
X = (m1, m2, m3, …, mk, c1, c2, c3, …, cq)
Here q = n-k. In partitioned form,
X = (M | C)
where M is the k-bit message vector and C is the q-bit check vector.
The main aim of a linear block code is to generate check bits, which are
mainly used for error detection and correction.
Example :
The (7, 4) linear code has the following matrix as a generator matrix
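Let u = (u0, u1, …, uk-1) be the message to be encoded. The corresponding code word
is
$$v = u \cdot G$$
For a systematic code with G = [P  I_k], where P = [p_{ij}] is a k × (n – k) matrix, the
code word components are
$$v_{n-k+i} = u_i, \quad 0 \le i < k$$
$$v_j = u_0 p_{0j} + u_1 p_{1j} + \cdots + u_{k-1} p_{k-1,j}, \quad 0 \le j < n-k$$
The n – k equations given by the last expression are called the parity-check equations of the
code.
Let u = (u0, u1, u2, u3) be the message to be encoded and v = (v0, v1, v2, v3, v4, v5, v6) be
the corresponding code word.
Solution: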
If the generator matrix of an (n, k) linear code is in the systematic form G = [P  I_k], the
parity-check matrix may take the following form
$$H = [\,I_{n-k} \;\; P^T\,]$$
Figure: Encoding Circuit
For the block of k=4 message bits, (n-k) parity bits or check bits are added. Hence
the total bits at the output of channel encoder are n=7. The encoding circuit for (7, 4)
systematic code is shown below.
Let v = (v0, v1, …, vn-1) be a code word that was transmitted over a noisy channel. Let
r = (r0, r1, …, rn-1) be the received vector at the output of the channel. Then
r = v + e
where e = r + v = (e0, e1, …, en-1) is an n-tuple called the
error vector (or error pattern), with
ei = 1 for ri ≠ vi
ei = 0 for ri = vi
Upon receiving r, the decoder must first determine whether r contains transmission
errors. If the presence of errors is detected, the decoder will either take actions to locate
and correct the errors (FEC) or request a retransmission of v (ARQ).
When r is received, the decoder computes the following (n – k)-tuple, called the syndrome of r:
s = r • H^T
s = (s0, s1, …, sn-k-1)
The syndrome is not a function of the transmitted code word but only of the error
pattern. We can therefore construct a table of all possible error patterns with their
corresponding syndromes.
s = 0 if and only if r is a code word, in which case the receiver accepts r as the
transmitted code word. s ≠ 0 if and only if r is not a code word, in which case the presence
of errors has been detected. When the error pattern e is identical to a nonzero code word,
r contains errors but s = r • H^T = 0; error patterns of this kind are called undetectable error
patterns. Since there are 2^k – 1 nonzero code words, there are 2^k – 1 undetectable error
patterns. The syndrome digits are as follows:
$$s_0 = r_0 + r_{n-k}\,p_{00} + r_{n-k+1}\,p_{10} + \cdots + r_{n-1}\,p_{k-1,0}$$
$$s_1 = r_1 + r_{n-k}\,p_{01} + r_{n-k+1}\,p_{11} + \cdots + r_{n-1}\,p_{k-1,1}$$
$$\vdots$$
$$s_{n-k-1} = r_{n-k-1} + r_{n-k}\,p_{0,n-k-1} + r_{n-k+1}\,p_{1,n-k-1} + \cdots + r_{n-1}\,p_{k-1,n-k-1}$$
The syndrome s is the vector sum of the received parity digits (r0, r1, …, rn-k-1) and the parity-
check digits recomputed from the received information digits (rn-k, rn-k+1, …, rn-1).
The figure below shows the syndrome circuit for a linear systematic (n, k) code.
Figure: Syndrome Circuit
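Systematic encoding and syndrome computation can be sketched for a (7, 4) code as follows. The parity submatrix P used here is an assumption (the generator matrix of the example is not reproduced in this text); it is a commonly used textbook choice. Code words are laid out with the parity digits first and the message digits last, matching the syndrome equations above.

```python
# Assumed parity submatrix P (k x (n-k)) of a (7, 4) systematic code, G = [P | I4].
P = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
]
K, Q = 4, 3  # message bits and parity bits, n = K + Q = 7

def encode(u):
    """Return v = (parity digits, message digits) over GF(2)."""
    parity = [sum(u[i] * P[i][j] for i in range(K)) % 2 for j in range(Q)]
    return parity + list(u)

def syndrome(r):
    """s_j = r_j + r_{n-k} p_{0j} + ... + r_{n-1} p_{k-1,j} (mod 2)."""
    return [
        (r[j] + sum(r[Q + i] * P[i][j] for i in range(K))) % 2
        for j in range(Q)
    ]

v = encode([1, 0, 1, 1])
r = list(v)
r[6] ^= 1  # a single transmission error in the last digit

print(v, syndrome(v), syndrome(r))
```

The syndrome of an error-free code word is zero, and a single error in an information digit yields the corresponding row of P, which is what allows the error position to be identified.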
If the minimum distance of a block code C is dmin, any two distinct code vectors of C
differ in at least dmin places. A block code with minimum distance dmin is capable of detecting
all error patterns of dmin – 1 or fewer errors.
However, it cannot detect all error patterns of dmin errors, because there exists at least
one pair of code vectors that differ in dmin places, and there is an error pattern of dmin errors
that will carry one into the other. The random-error-detecting capability of a block code with
minimum distance dmin is therefore dmin – 1.
If an error pattern is not identical to a nonzero code word, the received vector r will
not be a code word and the syndrome will not be zero.
Hamming Codes:
These codes and their variations have been widely used for error control
in digital communication and data storage systems.
For any positive integer m ≥ 3, there exists a Hamming code with the following parameters:
Code length: n = 2^m – 1
Number of information symbols: k = 2^m – m – 1
Number of parity-check symbols: n – k = m
Error-correcting capability: t = 1 (dmin = 3)
The parity-check matrix H of this code consists of all the 2^m – 1 nonzero m-tuples as its
columns. In systematic form, the columns of H are arranged as
H = [I_m  Q]
where I_m is an m × m identity matrix and Q consists of the 2^m – m – 1 columns of weight 2
or more. The generator matrix is then
G = [Q^T  I_{2^m–m–1}]
where Q^T is the transpose of Q and I_{2^m–m–1} is a (2^m – m – 1) × (2^m – m – 1)
identity matrix.
Since the columns of H are nonzero and distinct, no two columns add to zero. Since H
consists of all the nonzero m-tuples as its columns, the vector sum of any two columns, say hi
and hj, must also be a column in H, say hl, so that hi + hj + hl = 0. The minimum distance of a
Hamming code is therefore exactly 3.
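The column argument above can be illustrated by constructing the parity-check matrix for m = 3 and using it to correct a single error. This is a minimal sketch: the columns are ordered as identity columns followed by the columns of weight 2 or more, and all function names are hypothetical.

```python
from itertools import combinations, product

def hamming_columns(m):
    """Columns of H = [I_m Q]: all nonzero m-tuples, identity columns first."""
    cols = [c for c in product((0, 1), repeat=m) if any(c)]
    identity = sorted((c for c in cols if sum(c) == 1), key=lambda c: c.index(1))
    q = [c for c in cols if sum(c) >= 2]
    return identity + q

def add(a, b):
    """Component-wise modulo-2 sum of two tuples."""
    return tuple((x + y) % 2 for x, y in zip(a, b))

def syndrome(r, cols):
    """Sum of the columns of H at the positions where r has a 1."""
    s = (0,) * len(cols[0])
    for bit, col in zip(r, cols):
        if bit:
            s = add(s, col)
    return s

def correct_single_error(r, cols):
    """Flip the position whose column of H equals the syndrome."""
    s = syndrome(r, cols)
    fixed = list(r)
    if any(s):
        fixed[cols.index(s)] ^= 1
    return fixed

cols = hamming_columns(3)                 # (7, 4) Hamming code
# Three columns that add to zero give a code word of weight 3.
l = cols.index(add(cols[0], cols[1]))
codeword = [0] * len(cols)
codeword[0] = codeword[1] = codeword[l] = 1
print(cols, codeword)
```

Because every nonzero syndrome equals exactly one column of H, each single-error pattern maps to a distinct syndrome, which is why a lookup by column suffices for correction.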
A shortened Hamming code can be obtained by deleting l columns from the parity-check
matrix H. Using the resulting matrix H' as a parity-check matrix, the shortened code has
the following parameters:
Code length: n = 2^m – l – 1
Number of information symbols: k = 2^m – m – l – 1
Number of parity-check symbols: n – k = m
Minimum distance: dmin ≥ 3
If the deleted columns are chosen so that every remaining column of H' has odd weight,
then when a single error occurs during the transmission of a code vector, the resultant
syndrome is nonzero and contains an odd number of 1's (e • H'^T corresponds to a column
of H'). When a double error occurs, the syndrome is nonzero but contains an even number of
1's.
Decoding can be accomplished in the following manner:
i) If the syndrome s is zero, we assume that no error occurred.
ii) If s is nonzero and contains an odd number of 1's, we assume that a single error
occurred. The error pattern of a single error that corresponds to s is added to the received
vector for error correction.
iii) If s is nonzero and contains an even number of 1's, an uncorrectable error
pattern has been detected.
Problems:
1.
Binary Cyclic codes:
Cyclic codes are a subclass of linear block codes.
Definition:
A linear code is called a cyclic code if every cyclic shift of a code vector produces
another code vector.
Linearity: The sum of any two code words is also a valid code word.
X1 + X2 = X3
Cyclic: Every cyclic shift of a valid code vector produces another valid code vector.
X = {xn-1, xn-2, …, x1, x0}
Here xn-1, xn-2, …, x1, x0 represent the individual bits of the code vector X.
If the above code vector is cyclically shifted to the left, one cyclic shift of X gives
X' = {xn-2, xn-3, …, x1, x0, xn-1}
The code words can be represented by polynomials. For example, consider the n-bit code
word X = {xn-1, xn-2, …, x1, x0}.
This code word can be represented by a polynomial of degree less than or equal to (n-1),
i.e.,
$$X(p) = x_{n-1}p^{n-1} + x_{n-2}p^{n-2} + \cdots + x_1 p + x_0$$
The coefficient of p^{n-1} is the MSB and the coefficient of p^0 is the LSB.
Let M = {mk-1, mk-2, …, m1, m0} be the k bits of the message vector. Then it can be
represented by the polynomial
$$M(p) = m_{k-1}p^{k-1} + m_{k-2}p^{k-2} + \cdots + m_1 p + m_0$$
The code word polynomial is then
X(p) = M(p)G(p)
where G(p) is the generator polynomial of the code. For (n, k) cyclic codes, q = n-k represents
the number of parity bits, and G(p) has degree q.
If M1, M2, M3, … are the other message vectors, then the corresponding
code vectors can be calculated as
X1(p) = M1(p)G(p)
X2(p) = M2(p)G(p)
X3(p) = M3(p)G(p)
Since cyclic codes are a subclass of linear block codes, generator and parity-check matrices
can also be defined for cyclic codes.
G = [I_k : P_{k×q}]_{k×n}
The t-th row of this matrix is represented in polynomial form as
$$G_t(p) = p^{n-t} + R_t(p), \quad t = 1, 2, 3, \ldots, k$$
where R_t(p) is the remainder obtained when p^{n-t} is divided by the generator polynomial G(p).
Let us divide p^{n-t} by the generator polynomial G(p). We express the result of this division in
terms of quotient and remainder, i.e.,
$$\frac{p^{n-t}}{G(p)} = \text{Quotient} + \frac{\text{Remainder}}{G(p)}$$
Here the remainder is a polynomial of degree less than q, since the degree of G(p) is q.
Denoting the quotient by Q_t(p) and the remainder by R_t(p),
$$p^{n-t} = Q_t(p)\,G(p) + R_t(p)$$
The feedback switch is first closed and the output switch is connected to the message input.
All the shift registers are initialized to the zero state. The k message bits are shifted to the
transmitter as well as into the registers.
After the k message bits have been shifted in, the registers contain the q check bits. The
feedback switch is now opened and the output switch is connected to the check-bit position.
With every subsequent shift, the check bits are shifted to the transmitter.
The encoder circuit thus performs the division operation and generates the remainder.
The remainder is stored in the shift register after all message bits have been shifted out.
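The division performed by the encoder can be sketched in software. The example below assumes a (7, 4) cyclic code with generator polynomial G(p) = p^3 + p + 1, which is an illustrative choice and not taken from this text. Polynomials over GF(2) are stored as Python integers, with bit i holding the coefficient of p^i; the function names are hypothetical.

```python
def poly_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division (polynomials as bit masks)."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        # Subtract (XOR) the divisor aligned with the leading term.
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

def encode_systematic(message, g, n, k):
    """Message bits followed by the q = n - k check bits (the remainder)."""
    q = n - k
    shifted = message << q                 # p^q * M(p)
    return shifted | poly_mod(shifted, g)  # append remainder as check bits

def cyclic_shift_left(word, n):
    """One left cyclic shift of an n-bit word."""
    return ((word << 1) & ((1 << n) - 1)) | (word >> (n - 1))

G, N, K = 0b1011, 7, 4   # assumed generator polynomial p^3 + p + 1
codeword = encode_systematic(0b1001, G, N, K)
print(format(codeword, "07b"), format(cyclic_shift_left(codeword, N), "07b"))
```

Every code word produced this way is divisible by G(p), and so is each of its cyclic shifts, which can be confirmed by checking that `poly_mod` returns zero.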
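In cyclic codes also, some errors may occur during transmission. Syndrome decoding can
be used to correct those errors.
Let X be the transmitted code vector and Y the received vector. If E represents the error
vector, then the correct code vector can be obtained as
X = Y + E or Y = X + E
In polynomial form,
Y(p) = X(p) + E(p)
X(p) = M(p)G(p)
If Y(p) = X(p), i.e., no error has occurred, then dividing by G(p) gives
$$\frac{X(p)}{G(p)} = \text{Quotient} + \frac{\text{Remainder}}{G(p)}$$
with zero remainder, since X(p) is a multiple of G(p). In general,
$$\frac{Y(p)}{G(p)} = Q(p) + \frac{R(p)}{G(p)}$$
Y(p) = Q(p)G(p) + R(p)
Clearly R(p) is a polynomial of degree less than or equal to q-1. Substituting
Y(p) = M(p)G(p) + E(p) and using the fact that modulo-2 subtraction is the same as addition,
M(p)G(p) + E(p) = Q(p)G(p) + R(p)
E(p) = M(p)G(p) + Q(p)G(p) + R(p)
E(p) = [M(p) + Q(p)]G(p) + R(p)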
This equation shows that for a fixed message vector and generator polynomial, the
error pattern or error vector E depends on the remainder R.
For every remainder R there is a specific error vector. Therefore we can call the
remainder vector R the syndrome vector S, i.e., R(p) = S(p). Therefore
$$\frac{Y(p)}{G(p)} = Q(p) + \frac{S(p)}{G(p)}$$
Thus the syndrome vector is obtained by dividing the received vector Y(p) by G(p), i.e.,
$$S(p) = \operatorname{rem}\left[\frac{Y(p)}{G(p)}\right]$$
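Syndrome decoding for single errors can be sketched by tabulating the remainder of each single-bit error polynomial. As an assumption for illustration, the sketch uses a (7, 4) cyclic code with G(p) = p^3 + p + 1 and the code word 1001110, which is a multiple of that G(p); polynomials are stored as integers with bit i holding the coefficient of p^i.

```python
def poly_mod(dividend, divisor):
    """Remainder of GF(2) polynomial division (polynomials as bit masks)."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

G, N = 0b1011, 7   # assumed generator polynomial p^3 + p + 1, code length 7

# Syndrome S(p) = rem[E(p) / G(p)] for every single-bit error pattern E(p) = p^i.
syndrome_table = {poly_mod(1 << i, G): 1 << i for i in range(N)}

def correct(received):
    """Add the error pattern associated with the syndrome to the received word."""
    s = poly_mod(received, G)
    if s == 0:
        return received          # zero syndrome: accept as a code word
    return received ^ syndrome_table[s]

sent = 0b1001110                 # a code word (divisible by G)
received = sent ^ (1 << 4)       # single error in position 4
print(format(poly_mod(received, G), "03b"), format(correct(received), "07b"))
```

The table plays the role of the error pattern detector: since the syndrome depends only on E(p), one lookup per received word is enough. For this particular code the seven single-error syndromes are distinct and cover all nonzero syndromes.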
A q-stage shift register is used to generate the q-bit syndrome vector. Initially all the
shift register contents are zero and the switch is closed in position 1.
The received vector Y is shifted bit by bit into the shift register. The contents of the flip-
flops keep changing according to the input bits of Y and the values of g1, g2, etc.
After all the bits of Y have been shifted in, the q flip-flops of the shift register contain the q-bit
syndrome vector. The switch is then closed to position 2 and clocks are applied to the shift register.
The output is the syndrome vector S = (Sq-1, Sq-2, …, S1, S0).
Once the syndrome is calculated, the error pattern corresponding to that particular
syndrome is determined. When this error vector is added to the received code vector Y, it gives
the corrected code vector at the output.
The switch named Sout is opened and Sin is closed. The bits of the received vector Y
are shifted into the buffer register as well as into the syndrome calculator.
When all n bits of the received vector Y have been shifted into the buffer register and the
syndrome calculator, the syndrome register holds the syndrome vector.
The syndrome vector is given to the error pattern detector. Each syndrome corresponds
to a specific error pattern.
Sin is then opened and Sout is closed. Shifts are applied to the flip-flops of the buffer
register, error register and syndrome register.
The error pattern is added bit by bit to the received vector. The output is the
corrected, error-free vector.
Convolution codes
Decoding methods of Convolution code:
1. Viterbi decoding
2. Sequential decoding
3. Feedback decoding
Metric: It is the discrepancy between the received signal y and the decoded signal at a
particular node. This metric can be added over the nodes along a particular path.
Surviving path: This is the path of the decoded signal with minimum metric.
The metric of a particular path is obtained by adding the individual metrics at the nodes along that
path.
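The metric and surviving-path definitions can be illustrated with a short sketch. It is not a full Viterbi decoder: the candidate paths and the received sequence below are made-up values rather than the output of a specific encoder trellis, and the Hamming distance is assumed as the measure of discrepancy.

```python
def branch_metric(received, expected):
    """Number of positions in which the received block differs from the
    encoder output expected on this branch (Hamming distance)."""
    return sum(r != e for r, e in zip(received, expected))

def path_metric(received_blocks, path_blocks):
    """Sum of the individual metrics along one path."""
    return sum(branch_metric(r, e) for r, e in zip(received_blocks, path_blocks))

def surviving_path(received_blocks, candidate_paths):
    """Name of the candidate path with the minimum metric."""
    return min(
        candidate_paths,
        key=lambda name: path_metric(received_blocks, candidate_paths[name]),
    )

# Hypothetical received sequence (two bits per node) and candidate paths.
received = ["11", "01", "11"]
candidates = {
    "A": ["11", "10", "11"],
    "B": ["11", "01", "01"],
    "C": ["00", "01", "11"],
}
metrics = {name: path_metric(received, blocks) for name, blocks in candidates.items()}
print(metrics, surviving_path(received, candidates))
```

In Viterbi decoding this comparison is carried out at every node of the trellis, and only the minimum-metric path entering each node is retained for further extension.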
Example:
Exercise: