Lecture 5 - Information theory
Jan Bouda
FI MU
May 18, 2012
Part I
Uncertainty
Given a random experiment, it is natural to ask how uncertain we are about its outcome.

Compare two experiments: tossing an unbiased coin and throwing a fair six-sided die. The first experiment has two possible outcomes, the second has six, and both have the uniform probability distribution. Our intuition says that we are more uncertain about the outcome of the second experiment.

Let us compare the toss of an ideal coin with a binary message source emitting 0 and 1, each with probability 1/2. Intuitively, we should expect the uncertainty about the outcome of each of these experiments to be the same. Therefore the uncertainty should depend only on the probability distribution and not on the concrete sample space.

Therefore, the uncertainty about a particular random experiment can be specified as a function of the probability distribution {p_1, p_2, ..., p_n}, and we will denote it H(p_1, p_2, ..., p_n).
Uncertainty - requirements
One of the requirements is that the uncertainty of the uniform distribution should not decrease with the number of outcomes:
$$H(\overbrace{1/n, \ldots, 1/n}^{n\times}) \le H(\overbrace{1/(n+1), \ldots, 1/(n+1)}^{(n+1)\times}).$$
Entropy and uncertainty
8. Let us consider a random choice of one of n + m balls, m being red and n being blue. Let $p = \sum_{i=1}^{m} p_i$ be the probability that a red ball is chosen and $q = \sum_{i=m+1}^{m+n} p_i$ be the probability that a blue one is chosen. Then the uncertainty about which ball is chosen equals the uncertainty whether a red or a blue ball is chosen, plus the weighted uncertainty about which particular ball is chosen given that a red/blue ball was chosen. Formally,
$$H(p_1, \ldots, p_m, p_{m+1}, \ldots, p_{m+n}) = H(p, q) + p\,H\!\left(\frac{p_1}{p}, \ldots, \frac{p_m}{p}\right) + q\,H\!\left(\frac{p_{m+1}}{q}, \ldots, \frac{p_{m+n}}{q}\right). \tag{1}$$
It can be shown that any function satisfying Axioms 1–8 is of the form
$$H(p_1, \ldots, p_m) = -(\log_a 2) \sum_{i=1}^{m} p_i \log_2 p_i, \tag{2}$$
showing that the function is defined uniquely up to multiplication by a constant, which effectively changes only the base of the logarithm.
Entropy
The function H(p_1, ..., p_n) we informally introduced is called the (Shannon) entropy and, as justified above, it measures our uncertainty about the outcome of an experiment.

Definition
Let X be a random variable with probability distribution p(x). Then the (Shannon) entropy of the random variable X is defined as
$$H(X) = -\sum_{x \in Im(X)} p(x) \log p(x).$$

Lemma
H(X) ≥ 0.

Proof.
0 < p(x) ≤ 1 implies log(1/p(x)) ≥ 0.
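To connect the definition with the opening coin/die comparison, here is a minimal Python sketch (not part of the original lecture; the helper name `entropy` is ours) that evaluates H for a few distributions, using base-2 logarithms so the results are in bits.

```python
from math import log2

def entropy(probs):
    """H = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))   # unbiased coin: 1.0 bit
print(entropy([1/6] * 6))    # fair die: ~2.585 bits (more uncertainty)
print(entropy([1.0]))        # a sure outcome: 0.0 bits
```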
Part II
Joint entropy
In order to examine the entropy of more complex random experiments described by correlated random variables, we have to introduce the entropy of a pair (or n-tuple) of random variables.
Definition
Let X and Y be random variables distributed according to the probability distribution p(x, y) = P(X = x, Y = y). We define the joint (Shannon) entropy of random variables X and Y as
$$H(X, Y) = -\sum_{x \in Im(X)} \sum_{y \in Im(Y)} p(x, y) \log p(x, y),$$
or, alternatively,
$$H(X, Y) = -E[\log p(X, Y)] = E\left[\log \frac{1}{p(X, Y)}\right].$$
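A small numerical illustration (the joint distribution is our own example, not from the slides): the joint entropy is just the entropy of the distribution over pairs.

```python
from math import log2

# joint probabilities p(x, y) for two binary random variables
p_xy = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

H_XY = -sum(p * log2(p) for p in p_xy.values() if p > 0)
print(H_XY)  # 1.75 bits
```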
Conditional Entropy
Definition
Let X and Y be random variables distributed according to the probability distribution p(x, y) = P(X = x, Y = y). Let us denote p(x|y) = P(X = x|Y = y). The conditional entropy of X given Y is
$$\begin{aligned}
H(X|Y) &= \sum_{y \in Im(Y)} p(y) H(X|Y = y) \\
&= -\sum_{y \in Im(Y)} p(y) \sum_{x \in Im(X)} p(x|y) \log p(x|y) \\
&= -\sum_{x \in Im(X)} \sum_{y \in Im(Y)} p(x, y) \log p(x|y).
\end{aligned} \tag{4}$$
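The two expressions in (4) can be checked numerically; the sketch below (our own toy distribution, not from the lecture) computes H(X|Y) once as a weighted average of the entropies H(X|Y = y) and once directly from p(x, y) and p(x|y).

```python
from math import log2

p_xy = {(0, 0): 1/2, (0, 1): 1/4, (1, 0): 1/8, (1, 1): 1/8}

# marginal p(y)
p_y = {}
for (x, y), p in p_xy.items():
    p_y[y] = p_y.get(y, 0.0) + p

# H(X|Y) = sum_y p(y) H(X | Y = y)
h1 = 0.0
for y, py in p_y.items():
    cond = [p / py for (x, yy), p in p_xy.items() if yy == y]   # p(x|y)
    h1 += py * -sum(q * log2(q) for q in cond if q > 0)

# H(X|Y) = -sum_{x,y} p(x, y) log2 p(x|y)
h2 = -sum(p * log2(p / p_y[y]) for (x, y), p in p_xy.items() if p > 0)

print(h1, h2)  # both ~0.7956 bits
```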
Conditional Entropy
Using the previous definition, we may ask how much information we learn on average about X when given an outcome of Y. Naturally, we may interpret it as the decrease of our uncertainty about X when we learn the outcome of Y, i.e. H(X) − H(X|Y). Analogously, the amount of information we obtain when we learn the outcome of X itself is H(X).
Chain rule of conditional entropy
Theorem
H(X, Y) = H(Y) + H(X|Y).

Proof.
$$\begin{aligned}
H(X, Y) &= -\sum_{x \in Im(X)} \sum_{y \in Im(Y)} p(x, y) \log p(x, y) \\
&= -\sum_{x \in Im(X)} \sum_{y \in Im(Y)} p(x, y) \log\big[p(y) p(x|y)\big] \\
&= -\sum_{x \in Im(X)} \sum_{y \in Im(Y)} p(x, y) \log p(y) - \sum_{x \in Im(X)} \sum_{y \in Im(Y)} p(x, y) \log p(x|y) \\
&= -\sum_{y \in Im(Y)} p(y) \log p(y) - \sum_{x \in Im(X)} \sum_{y \in Im(Y)} p(x, y) \log p(x|y) \\
&= H(Y) + H(X|Y).
\end{aligned} \tag{5}$$
Chain rule of conditional entropy
Proof.
Alternatively, we may use log p(X, Y) = log p(Y) + log p(X|Y) and take the expectation of both sides to get the desired result.
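As a sanity check of the chain rule (our own test, not from the slides), the sketch below draws an arbitrary joint distribution and compares H(X, Y) with H(Y) + H(X|Y).

```python
import random
from math import log2

random.seed(0)
w = [[random.random() for _ in range(3)] for _ in range(4)]  # 4 x-values, 3 y-values
total = sum(sum(row) for row in w)
p = [[v / total for v in row] for row in w]                  # joint p(x, y)

p_y = [sum(p[x][y] for x in range(4)) for y in range(3)]
H_XY = -sum(p[x][y] * log2(p[x][y]) for x in range(4) for y in range(3))
H_Y = -sum(q * log2(q) for q in p_y)
H_X_given_Y = -sum(p[x][y] * log2(p[x][y] / p_y[y]) for x in range(4) for y in range(3))

print(abs(H_XY - (H_Y + H_X_given_Y)) < 1e-9)  # True
```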
Part III
Relative entropy
Let us start with the definition of the relative entropy, which measures the inefficiency of assuming that the distribution is q(x) when the true distribution is p(x).
Definition
The relative entropy or Kullback-Leibler distance between two
probability distributions p(x) and q(x) is defined as
$$D(p\|q) = \sum_{x \in Im(X)} p(x) \log \frac{p(x)}{q(x)} = E\left[\log \frac{p(X)}{q(X)}\right].$$
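A quick numerical sketch (our own distributions p and q, assuming q(x) > 0 wherever p(x) > 0) computing D(p‖q) in bits; note that it is not symmetric, so it is not a metric.

```python
from math import log2

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x)), skipping terms with p(x) = 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1/2, 1/4, 1/4]
q = [1/3, 1/3, 1/3]
print(kl(p, q))  # ~0.0850 bits
print(kl(q, p))  # ~0.0817 bits -- D(p||q) != D(q||p) in general
```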
Mutual Information and Entropy
Theorem
I(X;Y) = H(X) − H(X|Y).
Proof.
$$\begin{aligned}
I(X;Y) &= \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = \sum_{x,y} p(x, y) \log \frac{p(x|y)}{p(x)} \\
&= -\sum_{x,y} p(x, y) \log p(x) + \sum_{x,y} p(x, y) \log p(x|y) \\
&= -\sum_{x} p(x) \log p(x) - \left(-\sum_{x,y} p(x, y) \log p(x|y)\right) \\
&= H(X) - H(X|Y).
\end{aligned} \tag{7}$$
Mutual information
Theorem
I(X;Y) = H(X) + H(Y) − H(X, Y).
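The identities above, together with I(X;Y) = D(p(x, y)‖p(x)p(y)), can be verified on a concrete joint distribution. The sketch below is our own example (not from the lecture).

```python
from math import log2

p_xy = {(0, 0): 3/8, (0, 1): 1/8, (1, 0): 1/8, (1, 1): 3/8}
xs = {x for x, _ in p_xy}
ys = {y for _, y in p_xy}
p_x = {x: sum(p for (a, b), p in p_xy.items() if a == x) for x in xs}
p_y = {y: sum(p for (a, b), p in p_xy.items() if b == y) for y in ys}

H = lambda d: -sum(p * log2(p) for p in d.values() if p > 0)
H_X, H_Y, H_XY = H(p_x), H(p_y), H(p_xy)

I1 = H_X - (H_XY - H_Y)                    # H(X) - H(X|Y), using the chain rule
I2 = H_X + H_Y - H_XY                      # H(X) + H(Y) - H(X,Y)
I3 = sum(p * log2(p / (p_x[x] * p_y[y]))   # D(p(x,y) || p(x)p(y))
         for (x, y), p in p_xy.items())
print(I1, I2, I3)  # all ~0.1887 bits
```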
Part IV
General Chain Rule for Entropy
Theorem
Let X_1, X_2, ..., X_n be random variables. Then
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1).$$
Proof.
We use repeated application of the chain rule for a pair of random variables:
General Chain Rule for Entropy
Proof.
$$\begin{aligned}
H(X_1, X_2, \ldots, X_n) &= H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_{n-1}, \ldots, X_1) \\
&= \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1).
\end{aligned}$$
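A numerical check of the general chain rule for three binary variables (our own randomly generated joint distribution; the conditional entropies are computed directly from the conditional probabilities).

```python
import itertools
import random
from math import log2

random.seed(1)
keys = list(itertools.product([0, 1], repeat=3))
w = [random.random() for _ in keys]
p = {k: v / sum(w) for k, v in zip(keys, w)}          # joint p(x1, x2, x3)

def H(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def marginal(dist, idx):
    out = {}
    for k, q in dist.items():
        kk = tuple(k[i] for i in idx)
        out[kk] = out.get(kk, 0.0) + q
    return out

p1 = marginal(p, (0,))
p12 = marginal(p, (0, 1))
H_X2_given_X1 = -sum(q * log2(q / p1[(a,)]) for (a, b), q in p12.items())
H_X3_given_X12 = -sum(q * log2(q / p12[(a, b)]) for (a, b, c), q in p.items())

print(abs(H(p) - (H(p1) + H_X2_given_X1 + H_X3_given_X12)) < 1e-9)  # True
```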
Conditional Mutual Information
Definition
The conditional mutual information between random variables X and Y
given Z is defined as
$$I(X;Y|Z) = H(X|Z) - H(X|Y, Z) = E\left[\log \frac{p(X, Y|Z)}{p(X|Z)\, p(Y|Z)}\right].$$
Conditional Relative Entropy
Definition
The conditional relative entropy is the average, over the probability distribution p(x), of the relative entropies between the conditional probability distributions p(y|x) and q(y|x). Formally,
$$D\big(p(y|x) \,\big\|\, q(y|x)\big) = \sum_{x} p(x) \sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)} = E\left[\log \frac{p(Y|X)}{q(Y|X)}\right].$$
Chain Rule for Relative Entropy
Theorem
D(p(x, y)‖q(x, y)) = D(p(x)‖q(x)) + D(p(y|x)‖q(y|x)).

Proof.
$$\begin{aligned}
D(p(x, y)\|q(x, y)) &= \sum_{x}\sum_{y} p(x, y) \log \frac{p(x, y)}{q(x, y)} \\
&= \sum_{x}\sum_{y} p(x, y) \log \frac{p(x)\, p(y|x)}{q(x)\, q(y|x)} \\
&= \sum_{x,y} p(x, y) \log \frac{p(x)}{q(x)} + \sum_{x,y} p(x, y) \log \frac{p(y|x)}{q(y|x)} \\
&= D(p(x)\|q(x)) + D(p(y|x)\|q(y|x)).
\end{aligned} \tag{9}$$
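The chain rule for relative entropy can also be checked numerically; the two joint distributions below are our own example (all probabilities strictly positive, so every term is defined).

```python
from math import log2

p = {(0, 0): 0.30, (0, 1): 0.20, (1, 0): 0.10, (1, 1): 0.40}
q = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

px = {x: p[(x, 0)] + p[(x, 1)] for x in (0, 1)}
qx = {x: q[(x, 0)] + q[(x, 1)] for x in (0, 1)}

D_joint = sum(p[k] * log2(p[k] / q[k]) for k in p)
D_marg = sum(px[x] * log2(px[x] / qx[x]) for x in (0, 1))
# conditional relative entropy D(p(y|x) || q(y|x)), averaged over p(x)
D_cond = sum(p[(x, y)] * log2((p[(x, y)] / px[x]) / (q[(x, y)] / qx[x]))
             for x in (0, 1) for y in (0, 1))

print(abs(D_joint - (D_marg + D_cond)) < 1e-9)  # True
```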
Part V
Information inequality
Information Inequality
Theorem
D(p‖q) ≥ 0 with equality if and only if p(x) = q(x) for all x.
Information Inequality
Proof.
Let A = {x | p(x) > 0} be the support set of p(x). Then
$$\begin{aligned}
-D(p\|q) &= -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)} = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \\
&\stackrel{(*)}{\le} \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} = \log \sum_{x \in A} q(x) \\
&\le \log \sum_{x} q(x) = \log 1 = 0,
\end{aligned} \tag{10}$$
where the step (∗) follows from Jensen's inequality applied to the concave function log.
Theorem
I(X;Y) ≥ 0 with equality if and only if X and Y are independent.

Proof.
I(X;Y) = D(p(x, y)‖p(x)p(y)) ≥ 0 with equality if and only if p(x, y) = p(x)p(y), i.e. X and Y are independent.
Consequences of Information Inequality
Corollary
D(p(y|x)‖q(y|x)) ≥ 0
with equality if and only if p(y|x) = q(y|x) for all y and x with p(x) > 0.
Corollary
I(X;Y|Z) ≥ 0
with equality if and only if X and Y are conditionally independent given Z.
Theorem
H(X) ≤ log |Im(X)| with equality if and only if X has a uniform distribution over Im(X).
Consequences of Information Inequality
Proof.
Let u(x) = 1/|Im(X)| be the uniform probability distribution over Im(X) and let p(x) be the probability distribution of X. Then
$$D(p\|u) = \sum_{x} p(x) \log \frac{p(x)}{u(x)} = -\sum_{x} p(x) \log u(x) - \left(-\sum_{x} p(x) \log p(x)\right) = \log |Im(X)| - H(X).$$
Since D(p‖u) ≥ 0 by the information inequality, H(X) ≤ log |Im(X)| follows.
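The identity D(p‖u) = log |Im(X)| − H(X) is easy to observe numerically (our own example distribution over four outcomes).

```python
from math import log2

p = [0.5, 0.25, 0.15, 0.10]
n = len(p)

H = -sum(pi * log2(pi) for pi in p)
D = sum(pi * log2(pi / (1 / n)) for pi in p)   # D(p || uniform)

print(H, log2(n) - D)   # both ~1.7427 bits, and H <= log2(4) = 2
```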
Consequences of Information Inequality
Theorem
H(X|Y) ≤ H(X) with equality if and only if X and Y are independent, i.e. conditioning does not increase entropy.

Proof.
0 ≤ I(X;Y) = H(X) − H(X|Y).
Consequences of Information Inequality
Theorem
$$H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i).$$

Proof.
We use the chain rule for entropy:
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1) \le \sum_{i=1}^{n} H(X_i), \tag{11}$$
where the inequality follows directly from the previous theorem. We have equality if and only if each $X_i$ is independent of $X_{i-1}, \ldots, X_1$, i.e. if and only if $X_1, \ldots, X_n$ are mutually independent.
Part VI
Log Sum Inequality

Theorem
For non-negative numbers $a_1, \ldots, a_n$ and $b_1, \ldots, b_n$,
$$\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}.$$
Log Sum Inequality
Proof.
Assume WLOG that $a_i > 0$ and $b_i > 0$. The function $f(t) = t \log t$ is strictly convex, since $f''(t) = \frac{1}{t} \log e > 0$ for all positive $t$. We use Jensen's inequality to get
$$\sum_i \alpha_i f(t_i) \ge f\left(\sum_i \alpha_i t_i\right)$$
for $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$. Setting $\alpha_i = b_i / \sum_{j=1}^{n} b_j$ and $t_i = a_i / b_i$ we obtain
$$\sum_i \frac{a_i}{\sum_j b_j} \log \frac{a_i}{b_i} \ge \left(\sum_i \frac{a_i}{\sum_j b_j}\right) \log \left(\sum_i \frac{a_i}{\sum_j b_j}\right),$$
which is the log sum inequality after multiplying both sides by $\sum_j b_j$.
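A numerical spot-check of the log sum inequality with arbitrary positive numbers (our own values).

```python
from math import log2

a = [0.2, 0.5, 1.3]
b = [0.4, 0.3, 0.9]

lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * log2(sum(a) / sum(b))

print(lhs, rhs, lhs >= rhs)  # ~0.858, ~0.644, True
```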
Consequences of Log Sum Inequality
Theorem
D(p‖q) is convex in the pair (p, q), i.e. if (p_1, q_1) and (p_2, q_2) are two pairs of probability distributions, then
$$D\big(\lambda p_1 + (1-\lambda) p_2 \,\big\|\, \lambda q_1 + (1-\lambda) q_2\big) \le \lambda D(p_1\|q_1) + (1-\lambda) D(p_2\|q_2)$$
for all 0 ≤ λ ≤ 1.
Theorem
Let (X, Y) ∼ p(x, y) = p(x)p(y|x). The mutual information I(X;Y) is a concave function of p(x) for fixed p(y|x) and a convex function of p(y|x) for fixed p(x).
Part VII
Data Processing Inequality
Theorem
X → Y → Z is a Markov chain if and only if X and Z are conditionally independent given Y, i.e.
p(x, z|y) = p(x|y)p(z|y).
Data Processing Inequality
Theorem
If X → Y → Z is a Markov chain, then I(X;Y) ≥ I(X;Z).

Proof.
We expand the mutual information using the chain rule in two different ways as
$$\begin{aligned}
I(X; Y, Z) &= I(X;Z) + I(X;Y|Z) \\
&= I(X;Y) + I(X;Z|Y).
\end{aligned} \tag{12}$$
Since X → Y → Z is a Markov chain, I(X;Z|Y) = 0, and since I(X;Y|Z) ≥ 0, we conclude that
I(X;Y) ≥ I(X;Z).
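To see the inequality in action, the sketch below builds a toy Markov chain X → Y → Z (our own construction: Y and Z are progressively noisier copies of a uniform bit X) and compares I(X;Y) with I(X;Z).

```python
from math import log2

def mutual_information(p_joint, p_a, p_b):
    """I(A;B) = sum_{a,b} p(a,b) log2( p(a,b) / (p(a)p(b)) )."""
    return sum(p * log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_joint.items() if p > 0)

p_x = {0: 0.5, 1: 0.5}
# Y is X flipped with probability 0.1
p_xy = {(x, y): p_x[x] * (0.9 if y == x else 0.1) for x in (0, 1) for y in (0, 1)}
# Z is Y flipped with probability 0.2, independently of X given Y (Markov property)
p_xz = {}
for (x, y), p in p_xy.items():
    for z in (0, 1):
        p_xz[(x, z)] = p_xz.get((x, z), 0.0) + p * (0.8 if z == y else 0.2)

p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}
p_z = {z: sum(p for (_, zz), p in p_xz.items() if zz == z) for z in (0, 1)}

print(mutual_information(p_xy, p_x, p_y))  # I(X;Y) ~ 0.531 bits
print(mutual_information(p_xz, p_x, p_z))  # I(X;Z) ~ 0.173 bits, never exceeds I(X;Y)
```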