
ECE F344

Information Theory and Coding

Instructor-in-Charge: Dr. Amit Ranjan Azad


Email: [email protected]

Birla Institute of Technology and Science Pilani, Hyderabad Campus


Department of Electrical and Electronics Engineering

Lecture - 3
Module - 1
Information and Source Coding

Outline
• Average Self Information (or Entropy)
• Conditional Entropy
• Joint Entropy
• Self Entropy
• Differential Entropy
• Relative Entropy
• Jensen Shannon Distance

Average Self Information (Entropy)
• Consider a discrete random variable X with possible outcomes xi, i = 1, 2, …, n.
• The Average Self Information of X, i.e., the expected value of I(xi) over all outcomes, is defined as
$$H(X) = \sum_{i=1}^{n} P(x_i)\, I(x_i) = -\sum_{i=1}^{n} P(x_i)\log P(x_i)$$

• H(X) is called the Entropy.


• The entropy of X can be interpreted as the expected value of $\log\dfrac{1}{P(X)}$.
• The term entropy has been borrowed from statistical mechanics, where it is used to denote the level
of disorder in a system.
• Since $0 \leq P(x_i) \leq 1$, we observe that $\log\dfrac{1}{P(x_i)} \geq 0$.

• Hence, H(X) ≥ 0.
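• A minimal Python sketch of this definition (the probability values below are illustrative):

```python
import math

def entropy(pmf, base=2):
    """Average self-information H(X) = -sum p_i log p_i (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair four-sided die: H(X) = log2(4) = 2 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# A skewed source carries less information on average
print(entropy([0.7, 0.1, 0.1, 0.1]))       # about 1.357 bits
```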

Entropy (Example)
• Consider a discrete binary source that emits a sequence of statistically independent symbols.
• The output is either a 0 with a probability p or 1 with a probability 1 – p.
• The Entropy of this binary source is
$$H(X) = -\sum_{i=0}^{1} P(x_i)\log P(x_i) = -p\log_2 p - (1-p)\log_2(1-p)$$
• This H(X) is called the binary entropy function.
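• A short Python sketch of the binary entropy function; it peaks at 1 bit for p = 0.5 and falls to 0 as p approaches 0 or 1:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.3, 0.5, 0.9):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.4f} bits")
# Maximum uncertainty (1 bit) occurs at p = 0.5
```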

Entropy of English
• Consider the English language with alphabet {A, B, ..., Z}.
• If every letter occurred with the same probability and was independent of the other letters, then the entropy per letter would be
$$H(X) = -\sum_{i=1}^{n} P(x_i)\log P(x_i) = -\sum_{i=1}^{26} \frac{1}{26}\log_2\left(\frac{1}{26}\right) = -\log_2\left(\frac{1}{26}\right) \approx 4.70 \text{ bits}$$
• This is the absolute upper bound.
• However, we know that all letters do not appear with equal probability.
• A, E, S, T are more frequent, whereas J, Q, Z are less frequent.
• If we take into consideration the probabilities of occurrence of the different letters (normalized letter frequencies), the entropy per letter $H_L$ would be
$$H_L = H(X) \approx 4.14 \text{ bits}$$

Entropy of English
• If $X^2$ denotes the random variable of bigrams in the English language, the upper bound on $H_L$ can be refined as
$$H_L \leq \frac{H(X^2)}{2} \approx 3.56 \text{ bits}$$
• Here we consider the probabilities of all letter pairs (bigrams).
• This logic can be extended to n-grams. Thus, the entropy of the language can be defined as
$$H_L = \lim_{n \to \infty} \frac{H(X^n)}{n}$$
• Even though the exact value of $H_L$ may be difficult to determine, statistical investigations show that for the English language
$$1 \leq H_L \leq 1.5 \text{ bits}$$
• So each letter in the English text gives at most 1.5 bits of information.
• Let us assume the value of $H_L \approx 1.25$ bits. Thus, the redundancy of the English language is
$$R_{Eng} = 1 - \frac{H_L}{\log_2 26} = 1 - \frac{1.25}{4.70} \approx 0.73$$
• It is interesting that most languages in the world have similar redundancies.
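• The numbers above can be reproduced in a few lines of Python (the value $H_L \approx 1.25$ bits is the assumed estimate from this slide):

```python
import math

H_max = math.log2(26)          # equiprobable, independent letters
H_L = 1.25                     # assumed long-range entropy per letter (bits)

redundancy = 1 - H_L / H_max
print(f"H_max = {H_max:.2f} bits/letter")            # about 4.70
print(f"Redundancy of English = {redundancy:.2f}")   # about 0.73
```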
Entropy of Spoken English
• Let us now consider the redundancy in spoken English.
• Suppose an average speaker speaks 60 words per minute and the average number of letters per
word is 6.
• The average number of letters spoken per second in this case is 6 letters/sec.
• Assuming each letter carries 1.25 bits of information, the information rate of an average speaker is
7.5 bits/sec.
• If each letter is represented by 5 bits, the bit rate of an average speaker is 30 bits/sec.
• However, the typical data rate requirement for speech is 32 kilobits/sec (kbps).
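• A small sketch of the rate arithmetic above (the 1.25 bits/letter and 5 bits/letter figures are the assumptions stated above):

```python
words_per_min = 60
letters_per_word = 6
info_bits_per_letter = 1.25   # assumed information content per letter
code_bits_per_letter = 5      # fixed-length code: 2**5 = 32 >= 26 letters

letters_per_sec = words_per_min * letters_per_word / 60   # 6.0 letters/sec
info_rate = letters_per_sec * info_bits_per_letter        # 7.5 bits/sec
bit_rate = letters_per_sec * code_bits_per_letter         # 30.0 bits/sec
print(letters_per_sec, info_rate, bit_rate)
```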

Conditional Entropy
• The Average Conditional Self Information, called the Conditional Entropy, is defined as
$$H(X \mid Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} P(x_i, y_j)\, I(x_i \mid y_j) = \sum_{i=1}^{n}\sum_{j=1}^{m} P(x_i, y_j)\log\frac{1}{P(x_i \mid y_j)} = -\sum_{i=1}^{n}\sum_{j=1}^{m} P(x_i, y_j)\log P(x_i \mid y_j)$$

• The physical interpretation of this definition is as follows: The conditional entropy H(X | Y) is the
information (or uncertainty) in X after Y is observed.
• Based on the definitions of H(X), H(Y) and H(X | Y), we can write
I  X ;Y   H  X   H  X | Y 
 H Y   H Y | X 

Average Mutual Information and Conditional Entropy
I  X ;Y   H  X   H  X | Y 
 H Y   H Y | X 
• Since I(X; Y) ≥ 0, it implies that H(X) ≥ H(X | Y).
• The case I(X; Y) = 0 implies that H(X) = H(X | Y), which is possible if and only if X and Y are
statistically independent.
• Since H(X | Y) is the average amount of uncertainty (information) in X after we observe Y, and H(X) is the average amount of uncertainty (self information) of X, I(X; Y) is the average amount of information (mutual information) about X provided by the observation of Y.
• Since H(X) ≥ H(X | Y), the observation of Y does not increase the entropy (uncertainty); it can only decrease it.
• That is, observing Y cannot reduce our information about X; it can only add to it. Thus, conditioning does not increase entropy.

Joint Entropy
• The Joint Entropy of a pair of discrete random variables (X, Y) with a joint distribution P(x, y) is defined as
$$H(X,Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} P(x_i, y_j)\log\frac{1}{P(x_i, y_j)} = -\sum_{i=1}^{n}\sum_{j=1}^{m} P(x_i, y_j)\log P(x_i, y_j)$$

• By using the mathematical definitions of H(X), H(Y), H(X | Y) and H(X, Y), we obtain the following
chain rule.
H  X , Y   H  X   H  Y | X   H Y   H  X | Y 

I  X ; Y   H  X   H Y   H  X , Y 

Self Entropy (Revisited)
• How much does X convey about itself?
I  X ;Y   H  X   H  X | Y 

I X; X   H X  H X | X   H X 

• The Average Self Information is also called the Entropy.

Conditional Entropy (Example)

• Consider a Binary Symmetric Channel (BSC).
• Let the input symbols be ‘0’ with probability q and ‘1’ with probability 1 – q.
• The source symbol probabilities (input probabilities to the channel) are: {q, 1 – q}
• The BSC crossover probabilities (channel transition probabilities) are: {p, 1 – p}
• These two sets of probabilities are independent.

Conditional Entropy (Example)
• The entropy of this binary source is
$$H(X) = -\sum_{i=0}^{1} P(x_i)\log P(x_i) = -q\log_2 q - (1-q)\log_2(1-q)$$

• The conditional entropy is given by
$$H(X \mid Y) = \sum_{i=1}^{n}\sum_{j=1}^{m} P(x_i, y_j)\log\frac{1}{P(x_i \mid y_j)}$$
• In order to calculate the values of H(X | Y), we can make use of the following equalities:
$$P(x_i, y_j) = P(x_i \mid y_j)\,P(y_j) = P(y_j \mid x_i)\,P(x_i)$$
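• A sketch of this calculation (assuming numpy): it builds the joint pmf from the source probabilities {q, 1 – q} and the transition probabilities {p, 1 – p}, evaluates H(X | Y) and I(X; Y), and sweeps p to show the trend discussed on the next slide:

```python
import numpy as np

def bsc_measures(q, p):
    """Return H(X), H(X|Y), I(X;Y) in bits for a BSC with P(X=0)=q and crossover p."""
    P_x = np.array([q, 1 - q])
    # Channel matrix P(y|x): rows indexed by x, columns by y
    P_y_given_x = np.array([[1 - p, p],
                            [p, 1 - p]])
    P_xy = P_x[:, None] * P_y_given_x          # P(x,y) = P(x) P(y|x)
    P_y = P_xy.sum(axis=0)
    P_x_given_y = P_xy / P_y                   # P(x|y) = P(x,y) / P(y)

    nz = P_xy > 0                              # skip zero-probability terms (0 log 0 = 0)
    H_X = -np.sum(P_x[P_x > 0] * np.log2(P_x[P_x > 0]))
    H_X_given_Y = -np.sum(P_xy[nz] * np.log2(P_x_given_y[nz]))
    return H_X, H_X_given_Y, H_X - H_X_given_Y

for p in (0.0, 0.1, 0.3, 0.5):
    H_X, H_XgY, I_XY = bsc_measures(q=0.5, p=p)
    print(f"p = {p:.1f}  H(X|Y) = {H_XgY:.3f}  I(X;Y) = {I_XY:.3f}")
# For q = 0.5, I(X;Y) decreases from 1 bit (p = 0) to 0 bits (p = 0.5)
```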

Conditional Entropy (Example)

• The conditional entropy H(X | Y) versus q with p as the parameter is plotted in the first figure.
• The average mutual information I(X; Y) versus q with p as the parameter is plotted in the second
figure.
• It can be seen that as we increase the parameter p from 0 to 0.5, I(X; Y) decreases.
• Physically, this implies that as we make the channel less reliable (increase p towards 0.5), the mutual information between the random variable X (at the transmitter end) and the random variable Y (at the receiver end) decreases.

Information Measures for Continuous Random Variables
• The definitions of mutual information for discrete random variables can be directly extended to
continuous random variables.
• Let X and Y be random variables with joint probability density function (pdf) p(x, y) and marginal
pdfs p(x) and p(y).
• The Average Mutual Information between two continuous random variables X and Y is defined as
$$I(X;Y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(x,y)\, I(x;y)\, dx\, dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(x,y)\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right) dx\, dy$$
$$= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(x)\,p(y \mid x)\log\left(\frac{p(y \mid x)}{p(y)}\right) dx\, dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} p(y)\,p(x \mid y)\log\left(\frac{p(x \mid y)}{p(x)}\right) dx\, dy$$

Information Measures for Continuous Random Variables
• It should be pointed out that the definition of average mutual information can be carried over from discrete to continuous random variables, but the concept and physical interpretation cannot.
• The reason is that the information content of a continuous random variable is actually infinite, and we would require an infinite number of bits to represent a continuous random variable precisely.
• The self-information, and hence the entropy, is infinite.
• To get around the problem, we define a quantity called the Differential Entropy.

Differential Entropy
• The Differential Entropy of a continuous random variable X is defined as
$$h(X) = -\int_{-\infty}^{\infty} p(x)\log p(x)\, dx$$
• Again, it should be understood that there is no physical meaning attached to the above quantity.
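• As a numerical illustration, the differential entropy of a Gaussian with standard deviation σ can be computed from this definition and compared with the known closed form $\frac{1}{2}\log_2(2\pi e \sigma^2)$ bits (a sketch, assuming numpy):

```python
import numpy as np

sigma = 2.0
x = np.linspace(-10 * sigma, 10 * sigma, 20001)             # fine grid over the support
dx = x[1] - x[0]
p = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

h_numeric = -np.sum(p * np.log2(p)) * dx                    # h(X) = -∫ p(x) log2 p(x) dx
h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)       # known closed form, in bits
print(h_numeric, h_closed)                                   # both about 3.05 bits
```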

Some Properties of Differential Entropy
1. Chain rule for differential entropy
$$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i \mid X_1, X_2, \ldots, X_{i-1})$$

2. Translation does not alter the differential entropy
$$h(X + c) = h(X)$$

3. $h(aX) = h(X) + \log|a|$ (see the sketch after this list)

4. If X and Y are independent, then $h(X + Y) \geq h(X)$.
This is because $h(X + Y) \geq h(X + Y \mid Y) = h(X \mid Y) = h(X)$.
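• A short sketch checking properties 2 and 3 for a Gaussian random variable, using the closed form $h(X) = \frac{1}{2}\log_2(2\pi e \sigma^2)$ bits (translation leaves σ unchanged; scaling by a multiplies σ by |a|):

```python
import math

def gaussian_h(sigma):
    """Differential entropy (bits) of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log2(2 * math.pi * math.e * sigma**2)

sigma, a, c = 1.5, -3.0, 7.0
h_X = gaussian_h(sigma)
h_X_plus_c = gaussian_h(sigma)              # X + c is Gaussian with the same sigma
h_aX = gaussian_h(abs(a) * sigma)           # aX is Gaussian with std |a| * sigma

print(h_X_plus_c == h_X)                                # property 2: h(X + c) = h(X)
print(math.isclose(h_aX, h_X + math.log2(abs(a))))      # property 3: h(aX) = h(X) + log|a|
```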

Relative Entropy
• An interesting question to ask is how similar (or different) are two probability distributions?
• Relative entropy is used as a measure of distance between two distributions.
• The Relative Entropy or Kullback Leibler (KL) Distance between two probability mass functions p(x) and q(x) is defined as
$$D(p\,||\,q) = \sum_{x\in\mathcal{X}} p(x)\log\left(\frac{p(x)}{q(x)}\right)$$

 p  x 
• It can be interpreted as the expected value of log 
 q  x  
 

Note:
$$D(p\,||\,q) = \sum_{x\in\mathcal{X}} p(x)\log\left(\frac{p(x)}{q(x)}\right) = \sum_{x\in\mathcal{X}} p(x)\log p(x) - \sum_{x\in\mathcal{X}} p(x)\log q(x)$$

Relative Entropy
• Does Kullback Leibler (KL) Distance follow Symmetry Property?
• Is D(p || q) = D(q || p)?
• Let us check whether
$$\sum_{x\in\mathcal{X}} p(x)\log\left(\frac{p(x)}{q(x)}\right) \stackrel{?}{=} \sum_{x\in\mathcal{X}} q(x)\log\left(\frac{q(x)}{p(x)}\right)$$
• We find that, in general,
$$\sum_{x\in\mathcal{X}} p(x)\log p(x) - \sum_{x\in\mathcal{X}} p(x)\log q(x) \neq \sum_{x\in\mathcal{X}} q(x)\log q(x) - \sum_{x\in\mathcal{X}} q(x)\log p(x)$$
• Hence, the KL Distance does not follow the Symmetry Property.
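• A quick numerical check of the asymmetry with a pair of illustrative binary distributions:

```python
import math

def kl_divergence(p, q):
    # Same definition as in the earlier sketch
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p, q = [0.9, 0.1], [0.5, 0.5]
print(kl_divergence(p, q))   # about 0.531 bits
print(kl_divergence(q, p))   # about 0.737 bits, so D(p||q) != D(q||p)
```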

Relative Entropy
• Does Kullback Leibler (KL) Distance follow Triangle Inequality?
• Is D(p || q) + D(q || r) ≥ D(p || r)?
• Let us check whether
$$\sum_{x\in\mathcal{X}} p(x)\log\left(\frac{p(x)}{q(x)}\right) + \sum_{x\in\mathcal{X}} q(x)\log\left(\frac{q(x)}{r(x)}\right) \stackrel{?}{\geq} \sum_{x\in\mathcal{X}} p(x)\log\left(\frac{p(x)}{r(x)}\right)$$
• Rearranging the terms, the triangle inequality would require
$$\sum_{x\in\mathcal{X}} \big(q(x) - p(x)\big)\log\left(\frac{q(x)}{r(x)}\right) \geq 0$$
• This relation does not hold in general (see the numerical counterexample below).
• Hence, the KL Distance does not follow the Triangle Inequality.
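• A concrete numerical counterexample with three illustrative binary distributions, for which D(p || q) + D(q || r) < D(p || r):

```python
import math

def kl(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
r = [0.1, 0.9]

lhs = kl(p, q) + kl(q, r)    # about 0.531 + 0.737 = 1.268 bits
rhs = kl(p, r)               # about 2.536 bits
print(lhs, rhs, lhs >= rhs)  # False -> triangle inequality violated
```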

Relative Entropy (Example)
• Consider a Gaussian distribution p(x) with mean and variance given by $(\mu_1, \sigma_1^2)$.
• Consider another Gaussian distribution q(x) with mean and variance given by $(\mu_2, \sigma_2^2)$.
• We find the KL distance between two Gaussian distributions as

1  12  2  1   12  
2
D  p || q    2     1  log 2  2  
2  2   2    2  

• The distance becomes zero when the two distributions are identical, i.e., $\mu_1 = \mu_2$ and $\sigma_1^2 = \sigma_2^2$.
• It is interesting to note that when $\mu_1 \neq \mu_2$, the distance is minimum for $\sigma_1^2 = \sigma_2^2$.

• The minimum distance is given by
$$D_{\min}(p\,||\,q) = \frac{1}{2}\left(\frac{\mu_2 - \mu_1}{\sigma_2}\right)^2$$
• The KL distance becomes infinite if either $\sigma_1^2 \to 0$ or $\sigma_2^2 \to 0$, i.e., if either of the distributions tends to a Dirac delta.
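• A sketch (assuming numpy and illustrative parameter values) that evaluates the closed-form expression with the natural logarithm (nats) and cross-checks it by numerically integrating p(x) ln(p(x)/q(x)):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

mu1, var1 = 0.0, 1.0       # parameters of p(x)
mu2, var2 = 1.0, 2.0       # parameters of q(x)

# Closed form (in nats): 0.5 * [ var1/var2 + (mu2 - mu1)^2/var2 - 1 - ln(var1/var2) ]
D_closed = 0.5 * (var1 / var2 + (mu2 - mu1)**2 / var2 - 1 - np.log(var1 / var2))

# Numerical integration of p(x) * ln( p(x) / q(x) ) over a wide grid
x = np.linspace(-20, 20, 200001)
dx = x[1] - x[0]
p = gaussian_pdf(x, mu1, var1)
q = gaussian_pdf(x, mu2, var2)
D_numeric = np.sum(p * np.log(p / q)) * dx

print(D_closed, D_numeric)   # both about 0.3466 nats
```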
Average Mutual Information
• The average mutual information can be seen as the relative entropy between the joint distribution p(x, y) and the product of the marginal distributions p(x)p(y), i.e.,
$$I(X;Y) = D\big(p(x,y)\,||\,p(x)\,p(y)\big)$$
• We note that, in general, D(p || q) ≠ D(q || p).
• Thus, even though the relative entropy is used as a distance measure, it does not satisfy the symmetry property of distances.
• To overcome this, another measure, called the Jensen Shannon Distance, is sometimes used to
define the similarity between two distributions.

Jensen Shannon Distance
• The Jensen Shannon Distance between two probability mass functions p(x) and q(x) is defined as
$$JSD(p\,||\,q) = \frac{1}{2}D(p\,||\,m) + \frac{1}{2}D(q\,||\,m), \quad \text{where } m = \frac{1}{2}(p+q)$$
• If the base of the logarithm is 2, then 0 ≤ JSD(p || q) ≤ 1.
• The Jensen Shannon Distance is sometimes referred to as Jensen Shannon Divergence or
Information Radius in literature.
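• A minimal sketch of the Jensen Shannon Distance built on the KL distance (log base 2, so the value lies in [0, 1]):

```python
import math

def kl(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def jsd(p, q):
    """JSD(p || q) = 0.5 D(p || m) + 0.5 D(q || m), with m = 0.5 (p + q)."""
    m = [0.5 * (px + qx) for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.9, 0.1]
q = [0.1, 0.9]
print(jsd(p, q))                            # about 0.531, always within [0, 1] for base-2 logs
print(math.isclose(jsd(p, q), jsd(q, p)))   # True: unlike the KL distance, JSD is symmetric
```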
