
Chapter 8

Differential Entropy

Peng-Hua Wang

Graduate Inst. of Comm. Engineering

National Taipei University


Chapter Outline

Chap. 8 Differential Entropy


8.1 Definitions
8.2 AEP for Continuous Random Variables
8.3 Relation of Differential Entropy to Discrete Entropy
8.4 Joint and Conditional Differential Entropy
8.5 Relative Entropy and Mutual Information
8.6 Properties of Differential Entropy and Related Quantities



8.1 Definitions



Definitions

Definition 1 (Differential entropy) The differential entropy h(X) of a
continuous random variable X with pdf f(x) is defined as

h(X) = -\int_S f(x) \log f(x) \, dx,

where S is the support region of the random variable.

Example. If X \sim U(0, a), then

h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a} \, dx = \log a.

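Numerical check (an addition, not part of the original slides; NumPy and SciPy assumed). The sketch below integrates -f(x) log f(x) over (0, a) and compares the result with log a, working in nats:

import numpy as np
from scipy.integrate import quad

a = 3.0

def f(x):
    return 1.0 / a          # pdf of U(0, a) on its support (0, a)

# h(X) = -integral of f(x) log f(x) over (0, a), evaluated numerically
h, _ = quad(lambda x: -f(x) * np.log(f(x)), 0.0, a)
print(h, np.log(a))         # both approximately 1.0986 nats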


Differential Entropy of Gaussian

Example. If X \sim N(0, \sigma^2) with pdf \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/(2\sigma^2)}, then

h_a(X) = -\int \phi(x) \log_a \phi(x) \, dx
       = -\int \phi(x) \left( \log_a \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2} \log_a e \right) dx
       = \frac{1}{2} \log_a (2\pi\sigma^2) + \frac{\log_a e}{2\sigma^2} E_\phi[X^2]
       = \frac{1}{2} \log_a (2\pi e \sigma^2).

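Added sanity check (not in the slides): SciPy's norm.entropy() returns the differential entropy in nats, which should agree with (1/2) log(2*pi*e*sigma^2):

import numpy as np
from scipy.stats import norm

sigma = 2.0
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # (1/2) log(2*pi*e*sigma^2)
print(closed_form, norm(scale=sigma).entropy())           # both approximately 2.112 nats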


Differential Entropy of Gaussian

Remark. If a random variable with pdf f(x) has zero mean and
variance \sigma^2, then

-\int f(x) \log_a \phi(x) \, dx
       = -\int f(x) \left( \log_a \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2} \log_a e \right) dx
       = \frac{1}{2} \log_a (2\pi\sigma^2) + \frac{\log_a e}{2\sigma^2} E_f[X^2]
       = \frac{1}{2} \log_a (2\pi e \sigma^2).



Gaussian has Maximal Differential Entropy

Suppose that a random variable X with pdf f(x) has zero mean and
variance \sigma^2. What is its maximal differential entropy?
Let \phi(x) be the pdf of N(0, \sigma^2). Then

h(X) + \int f(x) \log \phi(x) \, dx = \int f(x) \log \frac{\phi(x)}{f(x)} \, dx
       \le \log \int f(x) \frac{\phi(x)}{f(x)} \, dx   (Jensen's inequality; the logarithm is concave)
       = \log \int \phi(x) \, dx = 0.

That is,

h(X) \le -\int f(x) \log \phi(x) \, dx = \frac{1}{2} \log(2\pi e \sigma^2),

where the last equality follows from the preceding remark, and equality holds iff f(x) = \phi(x).

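To illustrate the theorem numerically (an added sketch, not from the slides), compare the Gaussian with a Laplace density of the same variance; both entropies are available in closed form through SciPy, in nats:

import numpy as np
from scipy.stats import laplace, norm

sigma = 1.0
b = sigma / np.sqrt(2)                   # Laplace(scale=b) has variance 2*b^2 = sigma^2

h_gauss = norm(scale=sigma).entropy()    # 0.5*log(2*pi*e*sigma^2), approximately 1.419 nats
h_laplace = laplace(scale=b).entropy()   # 1 + log(2*b), approximately 1.347 nats
print(h_laplace, "<=", h_gauss)          # the Gaussian attains the maximum for this variance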


8.2 AEP for Continuous Random Variables



AEP

Theorem 1 (AEP) Let X_1, X_2, \ldots, X_n be a sequence of i.i.d. random
variables with common pdf f(x). Then

-\frac{1}{n} \log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X)

in probability.

Definition 2 (Typical set) For \epsilon > 0, the typical set A_\epsilon^{(n)} with respect
to f(x) is defined as

A_\epsilon^{(n)} = \left\{ (x_1, x_2, \ldots, x_n) \in S^n :
       \left| -\frac{1}{n} \log f(x_1, x_2, \ldots, x_n) - h(X) \right| \le \epsilon \right\}.

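A quick empirical illustration of the AEP (added, not in the slides): for i.i.d. Gaussian samples, -(1/n) log f(X_1, ..., X_n) is a sample average of i.i.d. terms and concentrates around h(X).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma, n = 1.0, 2000
h = norm(scale=sigma).entropy()          # h(X) = 0.5*log(2*pi*e*sigma^2)

for _ in range(5):
    x = rng.normal(0.0, sigma, size=n)
    # -(1/n) log f(x_1, ..., x_n) equals the sample mean of -log f(x_i) for i.i.d. samples
    print(-np.mean(norm(scale=sigma).logpdf(x)), "vs h(X) =", h)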


AEP

Definition 3 (Volume) The volume Vol(A) of a set A \subset \mathbb{R}^n is defined as

Vol(A) = \int_A dx_1 \, dx_2 \cdots dx_n.

Theorem 2 (Properties of the typical set)
1. Pr(A_\epsilon^{(n)}) > 1 - \epsilon for n sufficiently large.
2. Vol(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)} for all n.
3. Vol(A_\epsilon^{(n)}) \ge (1 - \epsilon) 2^{n(h(X)-\epsilon)} for n sufficiently large.



8.4 Joint and Conditional Differential Entropy



Definitions

Definition 4 (Joint differential entropy) The differential entropy of jointly
distributed random variables X_1, X_2, \ldots, X_n is defined as

h(X_1, X_2, \ldots, X_n) = -\int f(x^n) \log f(x^n) \, dx^n,

where f(x^n) = f(x_1, x_2, \ldots, x_n) is the joint pdf.

Definition 5 (Conditional differential entropy) The conditional
differential entropy of jointly distributed random variables X, Y with joint
pdf f(x, y) is defined as, if it exists,

h(X|Y) = -\int f(x, y) \log f(x|y) \, dx \, dy = h(X, Y) - h(Y).

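Added numerical check of h(X|Y) = h(X, Y) - h(Y) for a bivariate Gaussian (a sketch, not from the slides). The comparison value 0.5*log(2*pi*e*sigma^2*(1 - rho^2)) uses the fact that the conditional variance of X given Y is sigma^2*(1 - rho^2):

import numpy as np
from scipy.stats import multivariate_normal, norm

sigma, rho = 1.0, 0.6
K = np.array([[sigma**2, rho * sigma**2],
              [rho * sigma**2, sigma**2]])

h_xy = multivariate_normal(mean=[0.0, 0.0], cov=K).entropy()   # h(X, Y) in nats
h_y = norm(scale=sigma).entropy()                              # h(Y)
h_x_given_y = h_xy - h_y                                       # h(X|Y) = h(X, Y) - h(Y)
print(h_x_given_y, 0.5 * np.log(2 * np.pi * np.e * sigma**2 * (1 - rho**2)))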


Multivariate Normal Distribution

Theorem 3 (Entropy of a multivariate normal) Let X_1, X_2, \ldots, X_n
have a multivariate normal distribution with mean vector \mu and
covariance matrix K. Then

h(X_1, X_2, \ldots, X_n) = \frac{1}{2} \log \left( (2\pi e)^n |K| \right).

Proof. The joint pdf of a multivariate normal distribution is

\phi(x) = \frac{1}{(2\pi)^{n/2} |K|^{1/2}} e^{-\frac{1}{2} (x-\mu)^t K^{-1} (x-\mu)}.



Multivariate Normal Distribution

Therefore,

h(X_1, X_2, \ldots, X_n) = -\int \phi(x) \log_a \phi(x) \, dx
       = \int \phi(x) \left[ \frac{1}{2} \log_a \left( (2\pi)^n |K| \right)
              + \frac{1}{2} (x-\mu)^t K^{-1} (x-\mu) \log_a e \right] dx
       = \frac{1}{2} \log_a \left( (2\pi)^n |K| \right)
              + \frac{1}{2} (\log_a e) \, \underbrace{E\left[ (X-\mu)^t K^{-1} (X-\mu) \right]}_{=\,n}
       = \frac{1}{2} \log_a \left( (2\pi)^n |K| \right) + \frac{n}{2} \log_a e
       = \frac{1}{2} \log_a \left( (2\pi e)^n |K| \right).

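Added verification of Theorem 3 (not in the slides): SciPy's multivariate_normal.entropy() works in nats and can be compared with (1/2) log((2*pi*e)^n |K|):

import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.5]])   # a positive-definite covariance matrix
n = K.shape[0]

closed_form = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))
print(closed_form, multivariate_normal(mean=np.zeros(n), cov=K).entropy())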


Multivariate Normal Distribution

Lemma. Let Y = (Y_1, Y_2, \ldots, Y_n)^t be a random vector. If K = E[YY^t],
then E[Y^t K^{-1} Y] = n.

Proof. Denote the columns of K and the rows of K^{-1} by

K = E[YY^t] = \begin{bmatrix} k_1 & k_2 & \cdots & k_n \end{bmatrix},
\qquad
K^{-1} = \begin{bmatrix} a_1^t \\ a_2^t \\ \vdots \\ a_n^t \end{bmatrix}.

We have k_i = E[Y_i Y] and, since K^{-1} K = I, a_j^t k_i = \delta_{ij}.



Multivariate Normal Distribution

Now,

Y^t K^{-1} Y = Y^t \begin{bmatrix} a_1^t \\ a_2^t \\ \vdots \\ a_n^t \end{bmatrix} Y
       = (Y_1, Y_2, \ldots, Y_n) \begin{bmatrix} a_1^t Y \\ a_2^t Y \\ \vdots \\ a_n^t Y \end{bmatrix}
       = Y_1 a_1^t Y + Y_2 a_2^t Y + \cdots + Y_n a_n^t Y,

and

E[Y^t K^{-1} Y] = a_1^t E[Y_1 Y] + a_2^t E[Y_2 Y] + \cdots + a_n^t E[Y_n Y]
       = a_1^t k_1 + a_2^t k_2 + \cdots + a_n^t k_n = n.

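Monte Carlo check of the identity E[Y^t K^{-1} Y] = n (an added sketch, NumPy assumed); equivalently, the expectation equals trace(K^{-1} E[YY^t]) = trace(I) = n:

import numpy as np

rng = np.random.default_rng(1)
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
n = K.shape[0]

Y = rng.multivariate_normal(np.zeros(n), K, size=200_000)   # zero-mean samples with E[YY^t] = K
quad_forms = np.einsum('ij,jk,ik->i', Y, np.linalg.inv(K), Y)
print(quad_forms.mean(), "vs", n)                           # sample average of Y^t K^{-1} Y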


8.5 Relative Entropy and Mutual Information



Definitions

Definition 6 (Relative entropy) The relative entropy (or
Kullback-Leibler distance) D(f||g) between two densities f(x) and
g(x) is defined as

D(f||g) = \int f(x) \log \frac{f(x)}{g(x)} \, dx.

Definition 7 (Mutual information) The mutual information I(X;Y)
between two random variables with joint density f(x, y) is defined as

I(X;Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)} \, dx \, dy.

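As an added illustration, D(f||g) for two univariate Gaussian densities can be evaluated directly from Definition 6 by numerical integration; the value is nonnegative, consistent with Theorem 4 in Section 8.6:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f = norm(loc=0.0, scale=1.0)    # f = N(0, 1)
g = norm(loc=1.0, scale=2.0)    # g = N(1, 4)

# D(f||g) = integral of f(x) log(f(x)/g(x)) dx
kl, _ = quad(lambda x: f.pdf(x) * (f.logpdf(x) - g.logpdf(x)), -np.inf, np.inf)
print(kl)                       # strictly positive since f != g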


Example

Let (X, Y) \sim N(0, K) where

K = \begin{bmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{bmatrix}.

Then h(X) = h(Y) = \frac{1}{2} \log(2\pi e \sigma^2) and

h(X, Y) = \frac{1}{2} \log \left( (2\pi e)^2 |K| \right) = \frac{1}{2} \log \left( (2\pi e)^2 \sigma^4 (1 - \rho^2) \right).

Therefore,

I(X;Y) = h(X) + h(Y) - h(X, Y) = -\frac{1}{2} \log(1 - \rho^2).

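The mutual information formula can be checked numerically from the entropies (an added sketch; entropies in nats via SciPy):

import numpy as np
from scipy.stats import multivariate_normal, norm

sigma, rho = 1.5, 0.8
K = np.array([[sigma**2, rho * sigma**2],
              [rho * sigma**2, sigma**2]])

h_x = h_y = norm(scale=sigma).entropy()
h_xy = multivariate_normal(mean=[0.0, 0.0], cov=K).entropy()
mi = h_x + h_y - h_xy                    # I(X;Y) = h(X) + h(Y) - h(X, Y)
print(mi, -0.5 * np.log(1 - rho**2))     # both approximately 0.511 nats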


8.6 Properties of Differential Entropy and Related Quantities



Properties

Theorem 4 (Relative entropy) D(f||g) \ge 0, with equality iff f = g
almost everywhere.

Corollary 1
1. I(X;Y) \ge 0, with equality iff X and Y are independent.
2. h(X|Y) \le h(X), with equality iff X and Y are independent.



Properties

Theorem 5 (Chain rule for differential entropy)

h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i | X_1, X_2, \ldots, X_{i-1}).

Corollary 2

h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} h(X_i).

Corollary 3 (Hadamard's inequality) If K is the covariance matrix of a
multivariate normal distribution, then

|K| \le \prod_{i=1}^{n} K_{ii}.

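Hadamard's inequality is easy to check numerically on a random positive-definite matrix (an added sketch, NumPy assumed):

import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
K = A @ A.T + 1e-3 * np.eye(4)    # random symmetric positive-definite "covariance" matrix

print(np.linalg.det(K), "<=", np.prod(np.diag(K)))   # |K| <= product of diagonal entries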


Properties

Theorem 6
1. h(X + c) = h(X).
2. h(aX) = h(X) + log |a|.
3. h(AX) = h(X) + log |det(A)|.

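For a Gaussian, property 2 can be checked directly from the closed-form entropy (an added check, not from the slides):

import numpy as np
from scipy.stats import norm

sigma, a = 1.0, 3.0
h_x = norm(scale=sigma).entropy()              # h(X) for X ~ N(0, sigma^2)
h_ax = norm(scale=abs(a) * sigma).entropy()    # aX ~ N(0, a^2 sigma^2)
print(h_ax, h_x + np.log(abs(a)))              # h(aX) = h(X) + log|a|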


Gaussian has Maximal Entropy

Theorem 7 Let the random vector X \in \mathbb{R}^n have zero mean and
covariance K = E[XX^t]. Then h(X) \le \frac{1}{2} \log \left( (2\pi e)^n |K| \right), with
equality iff X \sim N(0, K).

Proof. Let g(x) be any density satisfying \int x_i x_j g(x) \, dx = K_{ij}, and let
\phi(x) be the density of N(0, K). Then

0 \le D(g||\phi) = \int g \log (g/\phi) = -h(g) - \int g \log \phi
       = -h(g) - \int \phi \log \phi = -h(g) + h(\phi),

where \int g \log \phi = \int \phi \log \phi because \log \phi(x) is a quadratic form in x
and g and \phi have the same second moments. That is, h(g) \le h(\phi), and
equality holds iff g = \phi.

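An added multivariate illustration of Theorem 7: a vector of independent, unit-variance Laplace coordinates has covariance I_n but strictly smaller differential entropy than N(0, I_n):

import numpy as np
from scipy.stats import laplace, multivariate_normal

n = 3
b = 1.0 / np.sqrt(2)                             # Laplace(scale=b) has variance 2*b^2 = 1

h_laplace_vec = n * laplace(scale=b).entropy()   # entropy adds over independent coordinates
h_gauss_vec = multivariate_normal(mean=np.zeros(n), cov=np.eye(n)).entropy()
print(h_laplace_vec, "<=", h_gauss_vec)          # the Gaussian maximizes h(X) for covariance I_n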
