ITC 2020-21 Lecture 3
ECE F344 Information Theory and Coding | Dr. Amit Ranjan Azad | BITS Pilani, Hyderabad Campus
Lecture - 3
Module - 1
Information and Source Coding
Outline
• Average Self Information (or Entropy)
• Conditional Entropy
• Joint Entropy
• Self Entropy
• Differential Entropy
• Relative Entropy
• Jensen Shannon Distance
Average Self Information (Entropy)
• Consider a discrete random variable X with possible outcomes xi, i = 1, 2, …, n.
• The Average Self Information of X, i.e., the average of the self information of the events X = x_i, is called the Entropy and is defined as
H(X) = \sum_{i=1}^{n} P(x_i) I(x_i) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)
• Since 0 ≤ P(x_i) ≤ 1, every term -P(x_i) log P(x_i) is non-negative. Hence, H(X) ≥ 0.
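• As a quick illustration (a minimal sketch of my own, not from the slides; the entropy() helper and the example distributions are hypothetical), the formula above can be computed directly:

```python
import math

def entropy(pmf, base=2):
    """Shannon entropy H(X) = -sum_i P(x_i) * log(P(x_i)) of a discrete pmf."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair six-sided die: H(X) = log2(6) ≈ 2.585 bits
print(entropy([1/6] * 6))

# A deterministic source (one outcome is certain): H(X) = 0, consistent with H(X) >= 0
print(entropy([1.0]))
```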
Entropy (Example)
• Consider a discrete binary source that emits a sequence of statistically independent symbols.
• The output is either a 0 with a probability p or 1 with a probability 1 – p.
• The Entropy of this binary source is
H(X) = -\sum_{i=0}^{1} P(x_i) \log P(x_i) = -p \log_2 p - (1 - p) \log_2 (1 - p)
• This H(X) is called the binary entropy function.
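• As a small numerical sketch (my own; binary_entropy() is a hypothetical helper), the binary entropy function peaks at 1 bit when p = 0.5 and vanishes when the output is certain:

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.9):
    print(f"p = {p:.2f}  H(X) = {binary_entropy(p):.4f} bits")
```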
Entropy of English
• Consider the English language with the alphabet {A, B, ..., Z}.
• If every letter occurred with the same probability and was independent of the other letters, then the entropy per letter would be
H(X) = \sum_{i=1}^{n} P(x_i) I(x_i) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)
     = -\sum_{i=1}^{26} \frac{1}{26} \log_2 \frac{1}{26} = \log_2 26 \approx 4.70 \text{ bits}
• This is the absolute upper bound on the entropy per letter.
• However, we know that all letters do not appear with equal probability.
• A, E, S, T are more frequent whereas J, Q, Z are less frequent.
• If we take into consideration the probabilities of occurrence of the different letters (normalized letter frequencies), the entropy per letter H_L would be
H_L = H(X) \approx 4.14 \text{ bits}
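• As a rough illustration (my own sketch, not from the slides), the per-letter entropy can be estimated by counting letter frequencies in a sample of text; the estimate depends on the corpus, and the 4.14 bits quoted above corresponds to standard English letter-frequency tables:

```python
import math
from collections import Counter

def letter_entropy(text):
    """Estimate the per-letter entropy (in bits) from the letter frequencies of `text`."""
    letters = [c for c in text.upper() if 'A' <= c <= 'Z']
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A short sample gives only a crude estimate; a large corpus approaches the ~4.14 bits quoted above.
sample = "information theory studies the quantification storage and communication of information"
print(f"estimated entropy per letter: {letter_entropy(sample):.2f} bits")
```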
Entropy of English
• If X² denotes the random variable of bigrams (pairs of letters) in the English language, the upper bound on H_L can be refined as
H_L \le \frac{H(X^2)}{2} \approx 3.56 \text{ bits}
• Here we consider all possible pairs of letters and their probabilities.
• This logic can be extended to n-grams. Thus, the entropy of the language can be defined as
H_L = \lim_{n \to \infty} \frac{H(X^n)}{n}
• Even though the exact value of HL may be difficult to determine, statistical investigations show that
for the English language
1 \le H_L \le 1.5 \text{ bits}
• So each letter in the English text gives at most 1.5 bits of information.
• Let us assume the value H_L ≈ 1.25 bits. Then the redundancy of the English language is
R_{Eng} = 1 - \frac{H_L}{\log_2 26} = 1 - \frac{1.25}{4.70} \approx 0.73
• That is, English text is roughly 75% redundant.
• It is interesting that most languages in the world have similar redundancies.
Entropy of Spoken English
• Let us now consider the redundancy in spoken English.
• Suppose an average speaker speaks 60 words per minute and the average number of letters per
word is 6.
• The average number of letters spoken per second in this case is 6 letters/sec.
• Assuming each letter carries 1.25 bits of information, the information rate of an average speaker is
7.5 bits/sec.
• If each letter is represented by 5 bits, the bit rate of an average speaker is 30 bits/sec.
• However, the typical data rate requirement for speech is 32 kilobits/sec (kbps).
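• A tiny sketch (my own, simply restating the arithmetic above) comparing the information rate of speech with the raw bit rates:

```python
words_per_min = 60           # average speaking rate
letters_per_word = 6         # average word length
info_bits_per_letter = 1.25  # assumed information per letter (H_L)
code_bits_per_letter = 5     # a fixed-length code for 26 letters needs 5 bits

letters_per_sec = words_per_min * letters_per_word / 60   # 6 letters/sec
info_rate = letters_per_sec * info_bits_per_letter        # 7.5 bits/sec
bit_rate = letters_per_sec * code_bits_per_letter         # 30 bits/sec

print(letters_per_sec, info_rate, bit_rate)  # compare with ~32,000 bits/sec for coded speech
```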
Conditional Entropy
• The Average Conditional Self Information, called the Conditional Entropy, is defined as
H(X \mid Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) I(x_i \mid y_j)
            = \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log \frac{1}{P(x_i \mid y_j)}
            = -\sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log P(x_i \mid y_j)
• The physical interpretation of this definition is as follows: The conditional entropy H(X | Y) is the
information (or uncertainty) in X after Y is observed.
• Based on the definitions of H(X), H(Y) and H(X | Y), we can write
I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
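• To make these quantities concrete, here is a small sketch (my own; the joint distribution is an arbitrary example) that computes H(X), H(X | Y) and I(X; Y) from a joint pmf:

```python
import numpy as np

# Arbitrary example joint pmf P(x, y): rows index x, columns index y.
P = np.array([[0.3, 0.1],
              [0.1, 0.5]])

Px = P.sum(axis=1)   # marginal P(x)
Py = P.sum(axis=0)   # marginal P(y)

H_X = -np.sum(Px * np.log2(Px))
# H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y), with P(x|y) = P(x,y) / P(y)
H_XgY = -np.sum(P * np.log2(P / Py))   # broadcasting divides each column by P(y)
I_XY = H_X - H_XgY

print(f"H(X) = {H_X:.4f}, H(X|Y) = {H_XgY:.4f}, I(X;Y) = {I_XY:.4f} bits")
```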
Average Mutual Information and Conditional Entropy
I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)
• Since I(X; Y) ≥ 0, it implies that H(X) ≥ H(X | Y).
• The case I(X; Y) = 0 implies that H(X) = H(X | Y), which is possible if and only if X and Y are
statistically independent.
• Since H(X | Y) is the average amount of uncertainty (information) remaining in X after we observe Y, and H(X) is the average amount of uncertainty (self information) of X, I(X; Y) is the average amount of information about X gained (i.e., uncertainty removed) by observing Y.
• Since H(X) ≥ H(X | Y), observing Y does not increase the entropy (uncertainty) of X on average; it can only decrease it.
• That is, observing Y cannot reduce the information we have about X; it can only add to it. Thus, conditioning does not increase entropy.
Joint Entropy
• The Joint Entropy of a pair of discrete random variables (X, Y) with a joint distribution P(x, y) is
defined as
H(X, Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log \frac{1}{P(x_i, y_j)} = -\sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log P(x_i, y_j)
• By using the mathematical definitions of H(X), H(Y), H(X | Y) and H(X, Y), we obtain the following
chain rule.
H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)
I(X; Y) = H(X) + H(Y) - H(X, Y)
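• A short numerical check (my own sketch, using an arbitrary joint pmf) that the chain rule and the mutual information identity above hold:

```python
import numpy as np

P = np.array([[0.3, 0.1],   # arbitrary example joint pmf P(x, y)
              [0.1, 0.5]])
Px, Py = P.sum(axis=1), P.sum(axis=0)

H = lambda p: -np.sum(p * np.log2(p))        # entropy of a pmf (no zero entries here)
H_XY = H(P)                                  # joint entropy H(X, Y)
H_YgX = -np.sum(P * np.log2((P.T / Px).T))   # H(Y|X): divide each row by P(x)
H_XgY = -np.sum(P * np.log2(P / Py))         # H(X|Y): divide each column by P(y)

print(np.isclose(H_XY, H(Px) + H_YgX))       # H(X,Y) = H(X) + H(Y|X)
print(np.isclose(H_XY, H(Py) + H_XgY))       # H(X,Y) = H(Y) + H(X|Y)
print(np.isclose(H(Px) + H(Py) - H_XY, H(Px) - H_XgY))   # both equal I(X;Y)
```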
Self Entropy (Revisited)
• How much does X convey about itself?
I(X; Y) = H(X) - H(X \mid Y)
I(X; X) = H(X) - H(X \mid X) = H(X), \quad \text{since } H(X \mid X) = 0
Conditional Entropy (Example)
• Consider a Binary Symmetric Channel (BSC).
• Let the input symbols be ‘0’ with probability q and ‘1’ with probability 1 – q.
• The source symbol probabilities (input probabilities to the channel) are: {q, 1 – q}
• The BSC crossover probabilities (channel transition probabilities) are: {p, 1 – p}
• These two sets of probabilities are independent.
Conditional Entropy (Example)
• The entropy of this binary source is
H(X) = -\sum_{i=0}^{1} P(x_i) \log P(x_i) = -q \log_2 q - (1 - q) \log_2 (1 - q)
H(X \mid Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log \frac{1}{P(x_i \mid y_j)}
• In order to calculate the values of H(X | Y), we can make use of the following equalities:
P(x_i, y_j) = P(x_i \mid y_j) P(y_j) = P(y_j \mid x_i) P(x_i)
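• A minimal sketch (my own; the values of q and p are arbitrary) that carries out this calculation for the BSC, computing H(X), H(X | Y) and I(X; Y) from the source and crossover probabilities:

```python
import numpy as np

def bsc_information(q, p):
    """Return H(X), H(X|Y), I(X;Y) in bits for a BSC with P(X=0) = q and crossover probability p."""
    Px = np.array([q, 1 - q])
    PygX = np.array([[1 - p, p],      # channel matrix P(y|x): row = input x, column = output y
                     [p, 1 - p]])
    Pxy = Px[:, None] * PygX          # joint pmf P(x, y) = P(x) P(y|x)
    Py = Pxy.sum(axis=0)              # output marginal P(y)
    H = lambda t: -np.sum(t[t > 0] * np.log2(t[t > 0]))
    H_X = H(Px)
    PxgY = Pxy / Py                   # P(x|y) = P(x, y) / P(y)
    H_XgY = -np.sum(Pxy[Pxy > 0] * np.log2(PxgY[Pxy > 0]))
    return H_X, H_XgY, H_X - H_XgY

print(bsc_information(q=0.3, p=0.1))
```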
Conditional Entropy (Example)
• The conditional entropy H(X | Y) versus q with p as the parameter is plotted in the first figure.
• The average mutual information I(X; Y) versus q with p as the parameter is plotted in the second
figure.
• It can be seen that as we increase the parameter p from 0 to 0.5, I(X; Y) decreases.
• Physically, it implies that as we make the channel less reliable (increase p toward 0.5), the mutual information between the random variable X (at the transmitter end) and the random variable Y (at the receiver end) decreases.
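• As a quick check (my own sketch; it uses the fact that for a uniform input, q = 0.5, the mutual information of a BSC reduces to 1 − H_b(p)), the values fall from 1 bit at p = 0 to 0 at p = 0.5:

```python
import math

def Hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# For a BSC with uniform input (q = 0.5): I(X;Y) = H(Y) - H(Y|X) = 1 - Hb(p).
for p in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p = {p:.1f}  I(X;Y) = {1 - Hb(p):.4f} bits")
```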
Information Measures for Continuous Random Variables
• The definitions of mutual information for discrete random variables can be directly extended to
continuous random variables.
• Let X and Y be random variables with joint probability density function (pdf) p(x, y) and marginal
pdfs p(x) and p(y).
• The Average Mutual Information between two continuous random variables X and Y is defined as
I(X; Y) = \iint p(x, y) I(x; y) \, dx \, dy
        = \iint p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy
        = \iint p(x) p(y \mid x) \log \frac{p(y \mid x)}{p(y)} \, dx \, dy
        = \iint p(y) p(x \mid y) \log \frac{p(x \mid y)}{p(x)} \, dx \, dy
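• As a standard worked example (not from the slides): for jointly Gaussian X and Y with correlation coefficient ρ, the double integral evaluates in closed form to
I(X; Y) = -\frac{1}{2} \log \left( 1 - \rho^2 \right)
which is zero when ρ = 0 (independent Gaussians) and grows without bound as |ρ| → 1.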
Information Measures for Continuous Random Variables
• It should be pointed out that the definition of average mutual information can be carried over from discrete random variables to continuous random variables, but the concept and physical interpretation cannot.
• The reason is that the information content of a continuous random variable is actually infinite, and we would require an infinite number of bits to represent a continuous random variable precisely.
• The self information, and hence the entropy, is infinite.
• To get around the problem, we define a quantity called the Differential Entropy.
Differential Entropy
• The Differential Entropy of a continuous random variable X is defined as
h(X) = -\int_{-\infty}^{\infty} p(x) \log p(x) \, dx
• Again, it should be understood that there is no physical meaning attached to the above quantity.
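• Two standard examples (not from the slides) make this concrete: for X uniform on [0, a] and for X Gaussian with variance σ²,
h(X) = \int_0^a \frac{1}{a} \log a \, dx = \log a \qquad \text{and} \qquad h(X) = \frac{1}{2} \log \left( 2 \pi e \sigma^2 \right)
• Note that h(X) can be negative (e.g., log a < 0 for a < 1), which underlines that, unlike discrete entropy, it is not an absolute measure of information.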
Some Properties of Differential Entropy
1. Chain rule for differential entropy:
h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i \mid X_1, X_2, \ldots, X_{i-1})
2. Translation invariance: h(X + c) = h(X)
3. Scaling: h(aX) = h(X) + \log |a|
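• A small numerical sketch (my own; it approximates the integral on a grid for a Gaussian density) checking the scaling property h(aX) = h(X) + log|a|:

```python
import numpy as np

def diff_entropy(p, dx):
    """Approximate h(X) = -∫ p(x) ln p(x) dx (in nats) from pdf samples p on a uniform grid."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask])) * dx

gauss = lambda x, s: np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

a = 3.0
x, dx = np.linspace(-40, 40, 400001, retstep=True)
h_X = diff_entropy(gauss(x, 1.0), dx)   # X  ~ N(0, 1)
h_aX = diff_entropy(gauss(x, a), dx)    # aX ~ N(0, a^2)

print(h_aX - h_X, np.log(a))   # the difference matches ln|a|
```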
Relative Entropy
• An interesting question to ask is how similar (or different) are two probability distributions?
• Relative entropy is used as a measure of distance between two distributions.
• The Relative Entropy or Kullback Leibler (KL) Distance between two probability mass functions
p(x) and q(x) is defined as
D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}
• It can be interpreted as the expected value of \log \frac{p(x)}{q(x)} with respect to p(x).
Note:
D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} = \sum_{x \in \mathcal{X}} p(x) \log p(x) - \sum_{x \in \mathcal{X}} p(x) \log q(x)
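• A minimal sketch (my own; the two pmfs are arbitrary examples) of this definition for discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)), assuming q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl_divergence(p, q))   # > 0 for two different distributions
print(kl_divergence(p, p))   # = 0 when the two distributions are identical
```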
Relative Entropy
• Does Kullback Leibler (KL) Distance follow Symmetry Property?
• Is D(p || q) = D(q || p)?
• Let us check whether
\sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \stackrel{?}{=} \sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{p(x)}
• We find that the two sums are not equal in general, so the KL Distance is not symmetric: D(p || q) ≠ D(q || p).
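• A quick numerical counterexample (my own sketch; the pmfs are arbitrary):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))   # assumes all entries are positive

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q))   # ≈ 0.737 bits
print(kl(q, p))   # ≈ 0.531 bits, so D(p||q) != D(q||p)
```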
Relative Entropy
• Does Kullback Leibler (KL) Distance follow Triangle Inequality?
• Is D(p || q) + D(q || r) ≥ D(p || r)?
• Let us check whether
\sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} + \sum_{x \in \mathcal{X}} q(x) \log \frac{q(x)}{r(x)} \ge \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{r(x)}
• We find that this is equivalent to requiring
\sum_{x \in \mathcal{X}} \bigl( q(x) - p(x) \bigr) \log \frac{q(x)}{r(x)} \ge 0
• This relation does not hold in general; for example, a term with p(x) > q(x) and q(x) > r(x) contributes negatively.
• Hence, the KL Distance does not follow the Triangle Inequality.
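• A concrete counterexample (my own sketch; the three pmfs are arbitrary choices that happen to violate the inequality):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log2(p / q))   # assumes all entries are positive

p, q, r = [0.5, 0.5], [0.1, 0.9], [0.05, 0.95]
lhs = kl(p, q) + kl(q, r)     # D(p||q) + D(q||r)
rhs = kl(p, r)                # D(p||r)
print(lhs, rhs, lhs >= rhs)   # lhs ≈ 0.77 < rhs ≈ 1.20: the triangle inequality fails
```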
Relative Entropy (Example)
• Consider a Gaussian distribution p(x) with mean \mu_1 and variance \sigma_1^2.
• Consider another Gaussian distribution q(x) with mean \mu_2 and variance \sigma_2^2.
• We find the KL distance between two Gaussian distributions as
D(p \| q) = \ln \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2} \quad \text{(in nats)}
• The distance becomes zero when the two distributions are identical, i.e., \mu_1 = \mu_2 and \sigma_1^2 = \sigma_2^2.
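• A short sketch (my own; the parameter values are arbitrary) evaluating this closed form:

```python
import numpy as np

def kl_gauss(mu1, var1, mu2, var2):
    """D(p||q) in nats between the Gaussians N(mu1, var1) and N(mu2, var2)."""
    return 0.5 * np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / (2 * var2) - 0.5

print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # > 0 for two different Gaussians
print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0 when the two distributions are identical
```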
Jensen Shannon Distance
• The Jensen Shannon Distance between two probability mass functions p(x) and q(x) is defined as
JSD(p \| q) = \frac{1}{2} D(p \| m) + \frac{1}{2} D(q \| m), \quad \text{where } m = \frac{1}{2} (p + q)
• If the base of the logarithm is 2, then 0 ≤ JSD(p || q) ≤ 1.
• The Jensen Shannon Distance is sometimes referred to as Jensen Shannon Divergence or
Information Radius in literature.
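• A final sketch (my own; the pmfs are arbitrary examples) implementing this definition directly with base-2 logarithms:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon distance as defined above (base-2 logs, so the value lies in [0, 1])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd(np.array([0.5, 0.5]), np.array([0.9, 0.1])))   # strictly between 0 and 1
print(jsd(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # = 1 for non-overlapping distributions
```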