Lecture 2: Entropy and Mutual Information
1 Introduction
Imagine two people, Alice and Bob, living in Toronto and Boston respectively. Alice (Toronto) goes jogging whenever it is not snowing heavily. Bob (Boston) doesn't ever go jogging.
Notice that Alice's actions give information about the weather in Toronto, while Bob's actions give no information. This is because Alice's actions are random and correlated with the weather in Toronto, whereas Bob's actions are deterministic.
How can we quantify the notion of information?
2 Entropy
Definition The entropy of a discrete random variable X distributed according to the p.m.f. p(x) is
H(X) = -\sum_x p(x) \log p(x) = -E[\log p(x)].    (1)
The entropy measures the expected uncertainty in X. We also say that H(X) is approximately equal to how much information we learn on average from one instance of the random variable X.
Note that the base of the logarithm is not important, since changing the base only changes the value of the entropy by a multiplicative constant:
H_b(X) = -\sum_x p(x) \log_b p(x) = \log_b(a) \left[ -\sum_x p(x) \log_a p(x) \right] = \log_b(a) H_a(X).
Customarily, we use base 2 for the calculation of entropy.
2.1 Example
Consider the binary random variable
X = \begin{cases} 0 & \text{with prob. } p \\ 1 & \text{with prob. } 1 - p, \end{cases}    (2)
whose entropy is
H(X) = -p \log p - (1 - p) \log(1 - p).    (3)
Note that the entropy does not depend on the values that the random variable takes (0 and 1
in this case), but only depends on the probability distribution p(x).
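As a quick numerical sketch (not part of the original notes), the binary entropy of equation (3) can be evaluated in Python; the helper name binary_entropy is our own choice:

import math

def binary_entropy(p: float) -> float:
    # Entropy in bits of a binary variable with P(X = 0) = p, as in equation (3).
    if p == 0.0 or p == 1.0:
        return 0.0  # a deterministic outcome carries no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit: maximal uncertainty
print(binary_entropy(0.9))   # roughly 0.47 bits
print(binary_entropy(1.0))   # 0.0 bits: a deterministic variable, like Bob's jogging

The value is maximal at p = 1/2 and drops to zero when the outcome is certain, matching the Alice/Bob discussion above.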
2.2 Two variables
Consider now two random variables X, Y jointly distributed according to the p.m.f p(x, y). We now
define the following two quantities.
Definition The joint entropy is given by
H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y).    (4)
The joint entropy measures how much uncertainty there is in the two random variables X and Y
taken together.
Definition The conditional entropy of X given Y is
H(X|Y) = -\sum_{x,y} p(x, y) \log p(x|y) = -E[\log p(x|y)].    (5)
The conditional entropy is a measure of how much uncertainty remains about the random variable
X when we know the value of Y .
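The following minimal sketch (our own, not from the notes) computes the joint and conditional entropies of equations (4) and (5) for a small, arbitrarily chosen joint p.m.f.:

import math

# A small joint p.m.f. p(x, y), chosen only for illustration.
p_xy = {(0, 0): 0.4, (0, 1): 0.1,
        (1, 0): 0.2, (1, 1): 0.3}

def joint_entropy(p):
    # H(X, Y) in bits, equation (4).
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def conditional_entropy(p):
    # H(X|Y) in bits, equation (5): average of -log p(x|y) under p(x, y).
    p_y = {}
    for (x, y), v in p.items():
        p_y[y] = p_y.get(y, 0.0) + v
    return -sum(v * math.log2(v / p_y[y]) for (x, y), v in p.items() if v > 0)

print(joint_entropy(p_xy))        # total uncertainty in the pair (X, Y)
print(conditional_entropy(p_xy))  # uncertainty left in X once Y is known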
2.3 Properties
Non-negativity: H(X) \ge 0.
Chain rule: the joint entropy of a collection of random variables decomposes as
H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X^{i-1}),    (6)
where X^{i-1} denotes (X_1, \ldots, X_{i-1}); in particular, for two variables,
H(X, Y) = H(X|Y) + H(Y)    (7)
= H(Y|X) + H(X).    (8)
Monotonicity: conditioning reduces entropy,
H(X|Y) \le H(X).    (9)
Maximum entropy: let \mathcal{X} be the set from which the random variable X takes its values (sometimes called the alphabet); then
H(X) \le \log |\mathcal{X}|.    (10)
Entropy of a function of a random variable: for any deterministic function g we have H(g(X)) \le H(X). To see this, expand the joint entropy H(X, g(X)) in two ways using the chain rule:
H(X, g(X)) = H(X) + H(g(X)|X)    (11)
= H(X),    (12)
since H(g(X)|X) = 0 (g(X) is fully determined by X), while
H(X, g(X)) = H(g(X)) + H(X|g(X))    (13)
\ge H(g(X)),    (14)
since H(X|g(X)) \ge 0,
so we have
H(g(X)) \le H(X),
with equality if and only if we can deterministically guess X given g(X), which is only the
case if g is invertible.
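A minimal numerical check of this property (our own sketch; the p.m.f. and the non-invertible g below are arbitrary choices):

import math
from collections import defaultdict

p_x = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}   # illustrative p.m.f. of X
g = lambda x: x % 2                       # non-invertible: merges symbols, so information is lost

def entropy(p):
    # Shannon entropy in bits of a p.m.f. given as {value: probability}.
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

# Push the distribution of X through g to get the p.m.f. of g(X).
p_gx = defaultdict(float)
for x, v in p_x.items():
    p_gx[g(x)] += v

print(entropy(p_x), entropy(p_gx))   # H(g(X)) < H(X) here, since g is not invertible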
3 Differential entropy
Similarly to the discrete case, we can define entropic quantities for continuous random variables.
Definition The differential entropy of a continuous random variable X with p.d.f. f(x) is
h(X) = -\int f(x) \log f(x) \, dx = -E[\log f(x)].    (15)
Definition Consider a pair of continuous random variables (X, Y) distributed according to the joint p.d.f. f(x, y). The joint entropy is given by
h(X, Y) = -\iint f(x, y) \log f(x, y) \, dx \, dy,    (16)
while the conditional entropy is
h(X|Y) = -\iint f(x, y) \log f(x|y) \, dx \, dy.    (17)
3.1 Properties
Some of the properties of the discrete random variables carry over to the continuous case, but some
do not. Let us go through the list again.
Non-negativity doesn't hold: h(X) can be negative.
Example: consider the R.V. X uniformly distributed on the interval [a, b]. The entropy is given by
h(X) = -\int_a^b \frac{1}{b-a} \log \frac{1}{b-a} \, dx = \log(b - a),    (18)
which can be a negative quantity if b - a is less than 1.
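A small sketch (ours) of equation (18) for a few interval widths, showing that the differential entropy of a uniform variable turns negative as soon as b - a < 1:

import math

def h_uniform(a: float, b: float) -> float:
    # Differential entropy, in bits, of a uniform R.V. on [a, b], from equation (18).
    return math.log2(b - a)

print(h_uniform(0.0, 2.0))   #  1.0 bit
print(h_uniform(0.0, 1.0))   #  0.0 bits
print(h_uniform(0.0, 0.5))   # -1.0 bit: negative, unlike discrete entropy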
Chain rule holds for continuous variables:
h(X, Y) = h(X|Y) + h(Y)    (19)
= h(Y|X) + h(X).    (20)
Monotonicity:
h(X|Y) \le h(X).    (21)
The proof follows from the non-negativity of mutual information (later).
Maximum entropy: we do not have a bound for general p.d.f.s f(x), but we do have a formula for power-limited densities. Consider a R.V. X \sim f(x) such that
E[x^2] = \int x^2 f(x) \, dx \le P;    (22)
then
\max h(X) = \frac{1}{2} \log(2\pi e P),    (23)
and the maximum is achieved by X \sim N(0, P).
To verify this claim one can use standard Lagrange multiplier techniques from calculus to solve the problem \max h(f) = -\int f \log f \, dx, subject to \int x^2 f \, dx \le P.
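The following is our own sketch of that Lagrange-multiplier computation (the multipliers \lambda_0, \lambda_1 are our notation). Form the functional
J[f] = -\int f \log f \, dx + \lambda_0 \left( \int f \, dx - 1 \right) + \lambda_1 \left( \int x^2 f \, dx - P \right),
and set its pointwise derivative with respect to f to zero:
-\log f - 1 + \lambda_0 + \lambda_1 x^2 = 0 \quad \Rightarrow \quad f(x) \propto e^{\lambda_1 x^2},
i.e. a Gaussian density once \lambda_0 and \lambda_1 < 0 are chosen to satisfy the two constraints. Substituting f = N(0, P) into h(f) then gives \frac{1}{2} \log(2\pi e P), which is equation (23).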
4 Mutual information
Definition The mutual information between two discrete random variables X, Y jointly distributed according to p(x, y) is given by
I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}    (24)
= H(X) - H(X|Y)
= H(Y) - H(Y|X)
= H(X) + H(Y) - H(X, Y).    (25)
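To make the identities concrete, here is a minimal sketch (ours) that evaluates equation (24) directly and checks it against the last identity in (25), for an arbitrary small joint p.m.f.:

import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}                    # illustrative joint p.m.f.
p_x = {x: sum(v for (xx, _), v in p_xy.items() if xx == x) for x in (0, 1)}    # marginal of X
p_y = {y: sum(v for (_, yy), v in p_xy.items() if yy == y) for y in (0, 1)}    # marginal of Y

# Equation (24): I(X;Y) as an expectation of log p(x,y) / (p(x) p(y)).
I_direct = sum(v * math.log2(v / (p_x[x] * p_y[y])) for (x, y), v in p_xy.items() if v > 0)

def H(p):
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

# Last line of (25): I(X;Y) = H(X) + H(Y) - H(X,Y); the two numbers agree.
print(I_direct, H(p_x) + H(p_y) - H(p_xy))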
Figure 1: Graphical representation of the conditional entropy and the mutual information (regions labelled H(X), H(X|Y), I(X;Y), and H(Y|X)).
4.1 Non-negativity of mutual information
The mutual information is a non-negative quantity,
I(X; Y) \ge 0,    (27)
and this is true both for the discrete and the continuous case.
Before we get to the proof, we have to introduce some preliminary concepts like Jensen's inequality and the relative entropy.
Jensen's inequality tells us something about the expected value of a random variable after applying a convex function to it.
We say a function f is convex on the interval [a, b] if, for all x_1, x_2 \in [a, b] and all \lambda \in [0, 1], we have
f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).    (28)
Another way of stating the above is to say that the function always lies below the imaginary line joining the points (x_1, f(x_1)) and (x_2, f(x_2)). For a twice-differentiable function f(x), convexity is equivalent to the condition f''(x) \ge 0 for all x \in [a, b].
Definition Jensen's inequality states that for any convex function f(x), we have
E[f(x)] \ge f(E[x]).    (29)
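A minimal numerical check (ours) of equation (29), using the convex function f(x) = x^2 and a sample from an arbitrarily chosen distribution:

import random

f = lambda x: x * x                                    # a convex function
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]  # samples of X, standard normal here

E_f = sum(f(x) for x in xs) / len(xs)                  # estimate of E[f(X)]
f_E = f(sum(xs) / len(xs))                             # f(E[X])

print(E_f, f_E)   # E[f(X)] >= f(E[X]): roughly 1 versus roughly 0 for this choice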
Definition The relative entropy between two probability distributions p(x) and q(x) is
D(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.    (30)
We are interested in the relative entropy in this section because it is related to the mutual information in the following way:
I(X; Y) = D(p(x, y) || p(x)p(y)).    (31)
Thus, if we can show that the relative entropy is a non-negative quantity, we will have shown that
the mutual information is also non-negative.
Proof of non-negativity of relative entropy: let p(x) and q(x) be two arbitrary probability distributions. We calculate the relative entropy as follows:
D(p(x)||q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}    (32)
= -\sum_x p(x) \log \frac{q(x)}{p(x)}    (33)
= E\left[ -\log \frac{q(x)}{p(x)} \right]    (34)
\ge -\log E\left[ \frac{q(x)}{p(x)} \right]    (by Jensen's inequality for the concave function log)    (35)
= -\log \left( \sum_x p(x) \frac{q(x)}{p(x)} \right)    (36)
= -\log \left( \sum_x q(x) \right)    (37)
= 0.
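As a numerical companion to this proof, the following sketch (ours, with arbitrarily chosen p and q) evaluates equation (30) and shows the non-negativity directly:

import math

p = {0: 0.5, 1: 0.3, 2: 0.2}   # two arbitrary p.m.f.s on the same alphabet
q = {0: 0.2, 1: 0.4, 2: 0.4}

def D(p, q):
    # Relative entropy D(p||q) in bits, equation (30).
    return sum(v * math.log2(v / q[x]) for x, v in p.items() if v > 0)

print(D(p, q), D(q, p))   # both non-negative, and not equal to each other in general
print(D(p, p))            # 0.0: the relative entropy vanishes when the two distributions coincide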
4.2 Conditional mutual information
Definition Let X, Y, Z be jointly distributed according to some p.m.f. p(x, y, z). The conditional mutual information between X, Y given Z is
I(X; Y|Z) = \sum_{x,y,z} p(x, y, z) \log \frac{p(x, y|z)}{p(x|z) p(y|z)}    (38)
= H(X|Z) - H(X|Y Z)
= H(XZ) + H(Y Z) - H(XY Z) - H(Z).
The conditional mutual information is a measure of how much uncertainty is shared by X and
Y , but not by Z.
4.3 Properties
Chain rule: the mutual information obeys the chain rule
I(X; Y_1, Y_2, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i | Y^{i-1}),    (39)
and, unlike entropy, it is not monotone under conditioning: in general
I(X; Y|Z) \gtrless I(X; Y).    (40)
To illustrate the last point, consider the following two examples where conditioning has different effects. In both cases we will make use of the following equation, obtained by expanding I(X; Y Z) with the chain rule in two different orders:
I(X; Y) + I(X; Z|Y) = I(X; Z) + I(X; Y|Z).    (41)
Increasing example: if we have some X, Y, Z such that I(X; Z) = 0 (which means X and Z are independent variables), then equation (41) becomes
I(X; Y) + I(X; Z|Y) = I(X; Y|Z),    (42)
and since I(X; Z|Y) \ge 0 this gives
I(X; Y|Z) \ge I(X; Y),    (43)
so conditioning on Z has increased the mutual information.
Decreasing example: if instead I(X; Z|Y) = 0, then equation (41) becomes
I(X; Y) = I(X; Z) + I(X; Y|Z) \ge I(X; Y|Z),    (44)
so conditioning on Z has decreased the mutual information.
For three variables X, Y, Z, one situation which is of particular interest is when they form a Markov chain: X → Y → Z. This relation implies that the probability distribution factorizes as p(x, z|y) = p(x|y) p(z|y), which in turn implies that I(X; Z|Y) = 0, as in the example above.
This situation often occurs when we have some input message X that gets transformed by a channel to give Y, and we then want to apply some processing to obtain the message Z, as illustrated below.
X → Channel → Y → Processing → Z
In this case we have the data processing inequality:
I(X; Z) \le I(X; Y).    (45)
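A minimal simulation sketch (ours) of the data processing inequality: X passes through a binary symmetric channel to give Y, and Y through a second one to give Z; the crossover probabilities are arbitrary choices, and I(X;Z) never exceeds I(X;Y).

import math
from itertools import product

def mutual_information(p_joint):
    # Mutual information in bits from a joint p.m.f. given as {(a, b): probability}, as in equation (24).
    p_a, p_b = {}, {}
    for (a, b), v in p_joint.items():
        p_a[a] = p_a.get(a, 0.0) + v
        p_b[b] = p_b.get(b, 0.0) + v
    return sum(v * math.log2(v / (p_a[a] * p_b[b])) for (a, b), v in p_joint.items() if v > 0)

# Markov chain X -> Y -> Z built from two binary symmetric channels.
bsc = lambda eps: {(a, b): (1 - eps if a == b else eps) for a, b in product((0, 1), repeat=2)}
p_x = {0: 0.5, 1: 0.5}
channel, processing = bsc(0.1), bsc(0.2)   # illustrative crossover probabilities

p_xy = {(x, y): p_x[x] * channel[(x, y)] for x, y in product((0, 1), repeat=2)}
p_xz = {}
for x, y, z in product((0, 1), repeat=3):
    p_xz[(x, z)] = p_xz.get((x, z), 0.0) + p_x[x] * channel[(x, y)] * processing[(y, z)]

print(mutual_information(p_xy), mutual_information(p_xz))   # I(X;Y) >= I(X;Z), as in (45)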