Dabel Info Theory
I made these notes while taking APMA 1710 at Brown during Fall 2016 (taught by Prof. Govind
Menon1 ), which followed the 2nd edition of the Cover & Thomas Information Theory Textbook [1].
If you find typos, please let me know at the email above. The images are of course based on the
textbook, but are of my own creation.
Contents
1 Chapter Two: Entropy
  1.1 Definitions
  1.2 Identities
  1.3 Convexity
Others: I(X; Y | Z), I(X1 , . . . , Xn ; Y | Z), H(X1 , . . . , Xn | Z), H(X, Y | Z), D(p(y | x) || q(y | x))
1.2 Identities
[Venn diagram: H(X, Y) is the whole region; H(X) and H(Y) overlap in I(X; Y), with H(X | Y) and H(Y | X) as the non-overlapping parts.]
Bounds:
• 0 ≤ H(X) ≤_U log |X |, where ≤_U holds with equality iff p(x) is the uniform distribution.
• 0 ≤ H(X, Y ) ≤_I H(X) + H(Y ), where ≤_I holds with equality iff X and Y are independent.
• I(X; X) = H(X)
• 0 ≤ I(X; Y ) ≤ H(X)
A function f is convex if for all x1, x2 and λ ∈ [0, 1], $f(\lambda x_1 + (1-\lambda)x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$. Here ① (the left side) is any point on the function between x1 and x2, and ② (the right side) is the corresponding point on the line connecting f(x1) and f(x2). A twice-differentiable function is convex when its second derivative is non-negative.
Weak Law of Large Numbers: if $\{X_i\}_{i=1}^{n}$ are iid random variables with mean µ and variance σ² < ∞, then the sample mean approaches the true mean as you get more samples:
$$\frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\;\Pr\;} \mu = E[X] \qquad (6)$$
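A quick numerical sanity check of this (my own sketch; the Bernoulli(0.3) source and sample sizes are arbitrary choices):

```python
# Sketch: the sample mean of iid Bernoulli(0.3) draws drifts toward mu = 0.3.
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # true mean of a Bernoulli(p) random variable

for n in [10, 100, 10_000, 1_000_000]:
    samples = rng.binomial(1, p, size=n)
    print(n, samples.mean())  # sample mean gets closer to 0.3 as n grows
```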
2.2 AEP
AEP: Consider a sequence $\{X_i\}_{i=1}^{\infty}$ where each $X_i$ is iid with pmf p(x) and entropy H(X). Then the sample entropy approaches the true entropy as you get more samples. Or:
$$-\frac{1}{n}\log p(X_1, \dots, X_n) \xrightarrow{\;\Pr\;} H(X)$$
Equivalently:
$$\lim_{n\to\infty} \Pr\!\left( \left| -\frac{1}{n}\log p(X_1, \dots, X_n) - H(X) \right| > \varepsilon \right) = 0$$
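A minimal sketch of the same convergence (my own example, again an iid Bernoulli(0.3) source): the sample entropy settles down to H(X).

```python
# Sketch: sample entropy of an iid Bernoulli(p) sequence approaches H(X).
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # true entropy H(X) in bits

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.binomial(1, p, size=n)
    # log2 p(x^n) = sum_i log2 p(x_i), since the X_i are iid
    log_p = np.sum(np.where(x == 1, np.log2(p), np.log2(1 - p)))
    print(n, -log_p / n, "vs H(X) =", H)
```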
(1) The first is that the “sample entropy” is close to the true entropy. That is, for each $x^n \in A_\varepsilon^{(n)}$:
$$H(X) - \varepsilon \le -\frac{1}{n}\log p(x^n) \le H(X) + \varepsilon \qquad (8)$$
Which follows from the AEP, basically. That is, since the sample entropy converges to H(X) in probability, for any δ > 0 there must exist an $n_\varepsilon$ such that for all $n \ge n_\varepsilon$:
$$\Pr\!\left( \left| -\frac{1}{n}\log p(X^n) - H(X) \right| > \varepsilon \right) < \delta \qquad (10)$$
But note that the complement of the event in Equation 10, namely $\left| -\frac{1}{n}\log p(x^n) - H(X) \right| \le \varepsilon$, can be rewritten as $x^n \in A_\varepsilon^{(n)}$:
$$-\varepsilon \le -\frac{1}{n}\log p(x^n) - H(X) \le \varepsilon$$
$$H(X) - \varepsilon \le -\frac{1}{n}\log p(x^n) \le \varepsilon + H(X)$$
$$-n\big(H(X) - \varepsilon\big) \ge \log p(x^n) \ge -n\big(\varepsilon + H(X)\big)$$
$$2^{-n(H(X)-\varepsilon)} \ge p(x^n) \ge 2^{-n(\varepsilon + H(X))}$$
Which is exactly the condition for being in the typical set. Therefore, for $n \ge n_\varepsilon$, Equation 10 says the complementary event has probability less than δ, which means $\Pr(x^n \in A_\varepsilon^{(n)}) > 1 - \delta$. Therefore, the probability of being in the typical set goes to 1 for n sufficiently large.
(3,4) The last two properties give bounds on the size of the typical set:
$$(1-\varepsilon)\,2^{n(H(X)-\varepsilon)} \le |A_\varepsilon^{(n)}| \le 2^{n(H(X)+\varepsilon)}$$
Where the left hand side (the lower bound) holds for n sufficiently large.
We know the total number of length-n sequences is $|\mathcal{X}^n|$, but surely the typical set doesn’t contain every length-n sequence. Since each $x^n \in A_\varepsilon^{(n)}$ has bounds on its probability, we can leverage this to bound the size of $A_\varepsilon^{(n)}$:
$$1 = \sum_{x^n \in \mathcal{X}^n} p(x^n) \;\ge\; \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n) \;\ge\; \sum_{x^n \in A_\varepsilon^{(n)}} 2^{-n(H(X)+\varepsilon)} = |A_\varepsilon^{(n)}|\, 2^{-n(H(X)+\varepsilon)}$$
So $|A_\varepsilon^{(n)}| \le 2^{n(H(X)+\varepsilon)}$.
Recall that $\Pr(x^n \in A_\varepsilon^{(n)}) > 1 - \varepsilon$ for n large. We bound this probability by:
$$1 - \varepsilon < \Pr(x^n \in A_\varepsilon^{(n)}) = \sum_{x^n \in A_\varepsilon^{(n)}} p(x^n) \;\le_M\; \sum_{x^n \in A_\varepsilon^{(n)}} 2^{-n(H(X)-\varepsilon)} = |A_\varepsilon^{(n)}|\, 2^{-n(H(X)-\varepsilon)} \qquad (13)$$
Where $\le_M$ follows since the right hand side gives the maximal probability of the set: every element is replaced by the maximal possible probability of a typical sequence. Rearranging gives $|A_\varepsilon^{(n)}| \ge (1-\varepsilon)\,2^{n(H(X)-\varepsilon)}$.
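A small check of these bounds (my own sketch, assuming a Bernoulli(0.3) source and n small enough to enumerate every sequence):

```python
# Sketch: enumerate all 2^n binary sequences, collect the typical ones, and
# compare |A_eps^(n)| and P(A_eps^(n)) with the bounds above.
import itertools
import numpy as np

p, n, eps = 0.3, 12, 0.1
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

size, prob_mass = 0, 0.0
for seq in itertools.product([0, 1], repeat=n):
    k = sum(seq)                                      # number of ones
    log_p = k * np.log2(p) + (n - k) * np.log2(1 - p)
    if abs(-log_p / n - H) <= eps:                    # typical set condition
        size += 1
        prob_mass += 2.0 ** log_p

print("|A| =", size, "<= 2^(n(H+eps)) =", 2 ** (n * (H + eps)))
print("P(A) =", prob_mass)   # grows toward 1 as n increases (still small at n = 12)
```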
2.4 Codes
Code: Assigns a unique binary sequence to every sequence in X n .
Need:
• n(H(X) + ε) + 2 bits to make a code for each item in the typical set.
• n log |X | + 2 bits to make a code for each item not in the typical set.
Since $|A_\varepsilon^{(n)}| \le 2^{n(H(X)+\varepsilon)}$, if we just enumerate all items in the typical set, we need at most n(H(X) + ε) bits to index each item. We then add 1 in case that’s not an integer (we could just take the ceiling), and add 1 so we can prefix all typical sequences with a 0. Therefore we have a code that encodes all sequences in the typical set with at most n(H(X) + ε) + 2 bits. The same argument for the non-typical sequences (indexed out of all $|\mathcal{X}|^n$ sequences and prefixed with a 1) gives a code length of at most n log |X | + 2 bits.
If n is sufficiently large so that $\Pr(x^n \in A_\varepsilon^{(n)}) > 1 - \varepsilon$, the expected length of a codeword satisfies:
$$E\!\left[\frac{1}{n}\,\ell(X^n)\right] \le H(X) + \varepsilon \qquad (14)$$
On average, each element of the sequence takes about the entropy of the r.v. to encode. Thus we
can represent sequences X n using around nH(X) bits on average.
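A sketch of the resulting average length (my own example: Bernoulli(0.3) source, n = 200, ε = 0.1; sequences are grouped by their number of ones so nothing needs to be enumerated):

```python
# Sketch: expected per-symbol length of the two-part typical-set code.
import math
import numpy as np

p, n, eps = 0.3, 200, 0.1
A = 2                                        # alphabet size |X| for a binary source
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

len_typ = math.ceil(n * (H + eps)) + 1       # index within the typical set + flag bit
len_atyp = math.ceil(n * math.log2(A)) + 1   # index among all |X|^n sequences + flag bit

prob_typical = 0.0
for k in range(n + 1):                       # all sequences with k ones share one probability
    log_p = k * np.log2(p) + (n - k) * np.log2(1 - p)
    if abs(-log_p / n - H) <= eps:
        prob_typical += math.comb(n, k) * p**k * (1 - p) ** (n - k)

expected_len = prob_typical * len_typ + (1 - prob_typical) * len_atyp
# Per-symbol length lands close to H(X) + eps, up to the extra flag/ceiling bits.
print("E[l(X^n)]/n =", expected_len / n, "vs H(X) =", H)
```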
$$\frac{1}{n}\log |B_\delta^{(n)}| > H(X) - \delta' \qquad (16)$$
Thus, $B_\delta^{(n)}$ must have at least about $2^{nH(X)}$ elements, which is about the same size as $A_\varepsilon^{(n)}$.
Stationary: A process S is stationary if the statistics don’t change as you shift in time:
$$\Pr(X_1 = x_1, \dots, X_n = x_n) = \Pr(X_{1+\ell} = x_1, \dots, X_{n+\ell} = x_n) \quad \text{for every shift } \ell$$
And a start state. Typically we’ll ask for a start distribution. If the distribution after one transition is identical to the start distribution, then we say it’s the stationary distribution:
$$[\mu_1, \mu_2, \dots, \mu_n] = [\mu_1, \mu_2, \dots, \mu_n]\,P \qquad (20)$$
We solve for the stationary distribution using an eigenvalue decomposition with eigenvalue one, or by just solving the system of equations.
That is, to solve for eigenvalues we use $Av = \lambda v$, so $\det(A - \lambda I) = 0$, which gives the characteristic polynomial; its roots are the eigenvalues, and solving $(A - \lambda I)v = 0$ for each root gives the eigenvectors. For the stationary distribution we want the left eigenvector of P (equivalently, an eigenvector of $P^{\top}$) with eigenvalue one, normalized to sum to one.
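A minimal sketch of that computation (the two-state transition matrix is my own toy example):

```python
# Sketch: stationary distribution mu with mu P = mu, found as the left
# eigenvector of P for eigenvalue 1 (i.e. an eigenvector of P^T).
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])             # rows sum to 1: P[i, j] = Pr(next = j | now = i)

eigvals, eigvecs = np.linalg.eig(P.T)  # left eigenvectors of P = eigenvectors of P^T
k = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
mu = np.real(eigvecs[:, k])
mu = mu / mu.sum()                     # normalize to a probability vector

print(mu)        # [0.8, 0.2] for this P
print(mu @ P)    # equals mu, so it really is stationary
```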
Entropy Rate: The per-symbol entropy of the n random variables, when the limit exists:
$$H(S) = \lim_{n\to\infty} \frac{1}{n} H(X_1, \dots, X_n) \qquad (21)$$
And a related quantity, the conditional entropy of the last random variable given the past:
$$H'(S) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \dots, X_1) \qquad (22)$$
Entropy Rates. We have two definitions, H(S) and H′(S). Moreover, they’re equivalent, which is convenient for computing the entropy rate of a stationary Markov chain:
$$H(S) = H'(S) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \dots, X_1) \;=_M\; \lim_{n\to\infty} H(X_n \mid X_{n-1}) \;=_T\; H(X_2 \mid X_1)$$
Where $=_M$ follows from the Markov property and $=_T$ follows by time invariance.
So the entropy rate of a stationary Markov chain is H(X2 | X1 ). Let µ be the stationary distribution
and P be the transition matrix. Then:
$$H(X_2 \mid X_1) = -\sum_{x_1 \in \mathcal{X}} \sum_{x_2 \in \mathcal{X}} p(x_1, x_2)\log p(x_2 \mid x_1) \qquad (23)$$
Where by the chain rule of probability, p(x1 , x2 ) = p(x2 | x1 )p(x1 ). Note that the transition matrix
P denotes the probability of going to state x2 given state x1 , and µ denotes the probability of being
in state x1. So:
$$H(X_2 \mid X_1) = -\sum_{i}\sum_{j} \mu_i P_{ij} \log P_{ij} \qquad (24)$$
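A minimal sketch of Equation 24 (reusing the toy two-state chain from the sketch above, with its stationary distribution):

```python
# Sketch: entropy rate of a stationary Markov chain, -sum_ij mu_i P_ij log2 P_ij.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu = np.array([0.8, 0.2])          # stationary distribution of P

rate = -np.sum(mu[:, None] * P * np.log2(P))
print(rate, "bits per symbol")     # a mu-weighted average of the row entropies
```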
3.3 Thermodynamics
Relative entropy between two distributions on the states decreases with time:
$$D(\mu_n \,\|\, \mu'_n) \ge D(\mu_{n+1} \,\|\, \mu'_{n+1})$$
The argument follows from the chain rule for relative entropy (or expanding and using the law of total probability).
From this we also see that the relative entropy between any distribution and the stationary distribution decreases with time. Let $\mu'_n = \kappa$ be stationary, then $\mu'_{n+1} = \mu'_n$, so, applying the previous result:
$$D(\mu_n \,\|\, \kappa) \ge D(\mu_{n+1} \,\|\, \kappa) \qquad (26)$$
• Cesaro Means
• Random Walks
• A Source Code for a r.v. is a mapping from X to D∗ , the set of finite-length strings from
a size D alphabet. C(x) is the codeword of x and `(x) is the length of C(x).
• A code’s expected codeword length is: $L(C) = \sum_{x\in\mathcal{X}} p(x)\,\ell(x)$
2. Uniquely Decodable: Every sequence of coded strings decodes to exactly one message.
[Diagram: nested classes of codes — all codes ⊃ nonsingular codes ⊃ uniquely decodable codes ⊃ instantaneous (prefix) codes.]
Proof.
We can think of a prefix code as a binary tree, where each branch represents choosing one
of the D symbols for the next symbol of the code. Then a prefix code guarantees that each
codeword has no children in the tree.
Consider the length of the longest codeword `max . Now consider all codewords at this level
of the tree.
In the complete tree (so no children are pruned, they’re just listed as a “descendant”), we have:
$$\sum_{i} D^{\ell_{\max} - \ell_i} \le D^{\ell_{\max}} \qquad (30)$$
Converse Proof
Proof.
Given lengths $\ell_1, \dots, \ell_k$ that satisfy the Kraft inequality, we can always come up with a prefix tree.
We care about finding the prefix code with minimum expected length: that is, due to Kraft, we want to find lengths that satisfy the Kraft inequality and minimize the expected codeword length. So:
$$\min_{\ell} L = \min_{\ell} \sum_i p_i \ell_i \quad \text{subject to} \quad \sum_i D^{-\ell_i} \le 1 \qquad (31)$$
Thus, the optimal code lengths are $\ell_i^{*} = \log_D \frac{1}{p_i}$. Later we’ll force this to an integer with the ceiling operator.
Theorem: The expected length L of any instantaneous D-ary code for a r.v. X is bounded below by the entropy $H_D(X)$: $L \ge H_D(X)$.
Proof of lower bound idea: Write out the difference $L - H_D(X)$ and turn the result into a relative entropy quantity plus a non-negative constant; by the information inequality we conclude $L - H_D(X) \ge 0$.
Proof of upper bound idea: Let each length $\ell_i = \lceil \log_D \frac{1}{p_i} \rceil$, so it’s guaranteed to lie between $\log_D \frac{1}{p_i}$ and $\log_D \frac{1}{p_i} + 1$. Then we multiply both sides by $p_i$ and sum over i to get $H_D(X) \le L < H_D(X) + 1$.
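A small check of both bounds (my own example distribution; any pmf works):

```python
# Sketch: Shannon code lengths l_i = ceil(log2(1/p_i)) satisfy the Kraft
# inequality and give H(X) <= L < H(X) + 1.
import math

p = [0.5, 0.25, 0.125, 0.125]
lengths = [math.ceil(math.log2(1 / pi)) for pi in p]

H = -sum(pi * math.log2(pi) for pi in p)         # entropy in bits (D = 2)
L = sum(pi * li for pi, li in zip(p, lengths))   # expected code length
kraft = sum(2 ** -li for li in lengths)

print(lengths, "Kraft sum:", kraft)              # <= 1, so a prefix code exists
print("H =", H, "<= L =", L, "< H + 1 =", H + 1)
```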
(1) Cluster
(2) Rerank
The cluster step takes a probability vector of m elements in order of mass, ⟨p1, p2, …, p_m⟩, and computes a length-(m − 1) vector, also in order, where the two smallest elements are merged: ⟨p1, p2, …, p_{m−1} + p_m⟩.
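A minimal sketch of the procedure for D = 2 (my own implementation; a heap replaces the explicit rerank step, since it keeps the masses in order after each merge):

```python
# Sketch: Huffman coding as repeated "cluster the two smallest masses" steps.
import heapq

def huffman_lengths(probs):
    """Codeword length assigned to each symbol by binary Huffman coding."""
    heap = [(p, [i]) for i, p in enumerate(probs)]  # (mass, symbols under this node)
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, s1 = heapq.heappop(heap)                # two smallest masses...
        p2, s2 = heapq.heappop(heap)
        for s in s1 + s2:                           # ...each pick up one more bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, s1 + s2))    # merged mass goes back in order
    return lengths

print(huffman_lengths([0.4, 0.3, 0.2, 0.1]))  # [1, 2, 3, 3] for this distribution
```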
• Examples
• They’re optimal
[Diagram: Message W → Encoder → Xⁿ → Channel p(y | x) → Yⁿ → Decoder → Ŵ (estimate of message).]
Intuition: if we could control exactly how bits are sent, what’s the most info we can send over the channel? How much shared info is there between the output and the input?
I(X; Y ) = 1 − α (39)
Definition 5 ((M, n) Code): An (M, n) code for the channel (X , p(y | x), Y) consists of the
following:
1. First: $\lambda_i$, which is the probability of error in sending message i over the channel: $\lambda_i = \Pr(\hat{W} \ne i \mid W = i)$.
2. The maximal probability of error is just the maximal error term over all $\lambda_i$: $\lambda^{(n)} = \max_i \lambda_i$.
3. The average probability of error $P_e^{(n)}$ for an (M, n) code is:
$$P_e^{(n)} = \frac{1}{M}\sum_i \lambda_i \qquad (43)$$
The rate R of an (M, n) code is:
$$R = \frac{\log M}{n} \text{ bits per transmission} \qquad (44)$$
So the rate measures how many bits of the actual message are conveyed per use of the channel.
A rate R is achievable if there exists a sequence of $(\lceil 2^{nR}\rceil, n)$ codes such that the maximal probability of error $\lambda^{(n)}$ tends to 0 as $n \to \infty$.
Definition 7 (Jointly Typical Set): The jointly typical set for two r.v.’s is:
$$A_\varepsilon^{(n)} \triangleq \{(x^n, y^n) : f_\varepsilon(x^n, y^n)\} \qquad (45)$$
Where:
$$f_\varepsilon(x^n, y^n) = \left| -\frac{1}{n}\log p(\,\cdot\,) - H(\,\cdot\,) \right| < \varepsilon \qquad (46)$$
Where $\cdot$ can be $x^n$, or can be $y^n$, or both at the same time $(x^n, y^n)$, and all three conditions must hold. Where:
$$p(x^n, y^n) = \prod_{i=1}^{n} p(x_i, y_i) \qquad (47)$$
3. Consider $(\tilde{X}^n, \tilde{Y}^n) \sim p(x^n)p(y^n)$: that is, the tilde vars are sampled independently but with the same marginals as $X^n$ and $Y^n$. Then:
$$\Pr\big((\tilde{X}^n, \tilde{Y}^n) \in A_\varepsilon^{(n)}\big) \le 2^{-n(I(X;Y) - 3\varepsilon)}$$
Takeaway from 3: the probability that the independently sampled pair lands in the jointly typical set is exponentially small, controlled by the mutual information.
Channel Coding Intuition: All rates below capacity C are achievable, and all rates above capacity
are not.
Proof idea: generate a codebook of $2^{nR}$ codewords at random according to p(x) and decode by joint typicality. The received $Y^n$ is jointly typical with the transmitted codeword with high probability, while each other (independent) codeword is jointly typical with $Y^n$ with probability about $2^{-nI(X;Y)}$, so for R < C the error probability vanishes.
Place the 4 information bits into the 4 central intersecting regions. To code, place 1s in each of
the remaining regions so that each circle has an even number of bits. Then when you receive the
message, reconstruct the venn diagrams and you can identify where bits may have been flipped.
The Hamming code is the null space of the parity-check matrix H, whose columns are the possible nonzero binary vectors. That is, the codewords are exactly the set of vectors v such that $Hv = 0 \pmod 2$.
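A minimal sketch of the (7, 4) case (my own construction; the specific codeword is just one assumed element of the null space): column j of H is the binary representation of j, so a single-bit error produces a syndrome that spells out the error position.

```python
# Sketch: (7,4) Hamming code via the parity-check matrix H and syndrome decoding.
import numpy as np

# Column j (1-indexed) of H is the 3-bit binary representation of j.
H = np.array([[(j >> 2) & 1 for j in range(1, 8)],
              [(j >> 1) & 1 for j in range(1, 8)],
              [(j >> 0) & 1 for j in range(1, 8)]])

codeword = np.array([1, 1, 0, 1, 0, 0, 1])  # assumed codeword: H @ codeword = 0 (mod 2)
assert not np.any(H @ codeword % 2)

received = codeword.copy()
received[4] ^= 1                            # channel flips one bit

syndrome = H @ received % 2                 # nonzero syndrome = binary index of the error
error_pos = int("".join(map(str, syndrome)), 2) - 1
received[error_pos] ^= 1                    # flip it back
print(np.array_equal(received, codeword))   # True
```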
Conversely, for any stationary stochastic process, if H(V) > C, it’s not possible to send the
process over the channel with arbitrarily low probability of error.
Takeaway: The separation theorem says that the separate encoder can achieve the same rates as
the joint encoder. That is, the following two are the same:
[Diagram (joint coding): Message Vⁿ → Encoder → Xⁿ(Vⁿ) → Channel p(y | x) → Yⁿ → Decoder → V̂ⁿ (estimate of message).]
[Diagram (separate coding): Message Vⁿ → Source Encoder → Channel Encoder → Channel p(y | x) → Channel Decoder → Source Decoder → V̂ⁿ (estimate of message).]
Proof idea:
• Since the stochastic process satisfies the AEP, it implies there exists a typical set.
• Index all sequences in the typical set.
• There are at most $2^{n(H(\mathcal{V})+\varepsilon)}$ elements in the typical set, so we need at most $n(H(\mathcal{V}) + \varepsilon)$ bits to encode them.
• If H(V) + ε = R < C, we can transmit the sequence with low probability of error:
Fano’s Inequality: For any estimator $\hat{X}$ such that $X \to Y \to \hat{X}$ forms a Markov chain, with $P_e = \Pr(\hat{X} \ne X)$, we have:
$$H(P_e) + P_e \log |\mathcal{X}| \ge H(X \mid Y)$$
Or, in weaker form:
$$1 + P_e \log |\mathcal{X}| \ge H(X \mid Y) \qquad (51)$$
3. Conditional Entropy: $h(X \mid Y) = -\int f(x, y)\log f(x \mid y)\,dx\,dy$.
We can translate between Differential Entropy and Discrete Entropy. Consider quantizing f (x)
according to some fixed step size, ∆. That is, approximate the curve with blocks of width ∆.
Let $H(X_\Delta) = -\sum_{i=-\infty}^{\infty} p_i \log p_i$, where $p_i$ is the mass of those rectangles. Consider a point $x_k$. Then along the interval $[x_k, x_k + \Delta]$, let $p(x_k) = \int_{x_k}^{x_k+\Delta} f(x)\,dx$, so that $H(X_\Delta) = -\sum_{k=-\infty}^{\infty} p(x_k)\log p(x_k)$.
Idea: We’re taking rectangles and putting them over each interval so that the area of the rectangle
is identical to the area of the curved piece of the function.
But p(xk ) = pk = f (xk ) · ∆, since it just describes a box that approximates the pdf for that interval
(width ∆ and height $f(x_k)$). So:
$$H(X_\Delta) \approx h(X) - \log \Delta \qquad (52)$$
Therefore, we add the right term to get $H(X_\Delta) + \log \Delta = -\sum_{i=-\infty}^{\infty} f(x_i)\,\Delta\,\log f(x_i)$, which, as $\Delta \to 0$, becomes h(X).
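A numerical check of Equation 52 (my own sketch, using a standard Gaussian so that h(X) is known in closed form):

```python
# Sketch: H(X_Delta) + log2(Delta) approaches h(X) = (1/2) log2(2 pi e sigma^2)
# as the bin width Delta shrinks.
import numpy as np
from math import erf, sqrt

sigma = 1.0
gauss_cdf = lambda x: 0.5 * (1 + erf(x / (sigma * sqrt(2))))
h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)    # about 2.05 bits

for delta in [1.0, 0.1, 0.01]:
    edges = np.arange(-10, 10 + delta, delta)          # covers essentially all the mass
    p = np.diff([gauss_cdf(e) for e in edges])         # p_k = mass of the k-th bin
    p = p[p > 0]
    H_delta = -np.sum(p * np.log2(p))
    print(delta, H_delta + np.log2(delta), "vs h(X) =", h_true)
```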
6.1 Examples
Now, some example continuous channels.
6.1.1 Uniform
Let X be a r.v. with uniform probability on the interval [0, a]. Then:
$$h(X) = -\int_0^a f(x)\log f(x)\,dx = -\int_0^a \frac{1}{a}\log\frac{1}{a}\,dx = \log a \qquad (53)$$
6.1.2 Gaussian
Let X be a r.v. with a Gaussian density function:
$$f(x) \sim \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{x^2}{2\sigma^2}\right) \qquad (54)$$
Then its entropy is:
$$h(X) = -\int_{-\infty}^{\infty} \phi \ln \phi = \frac{1}{2}\ln\!\left(2\pi e \sigma^2\right) \qquad (55)$$
we can compute h(X, Y) as the entropy of a (bivariate) multivariate normal, $\frac{1}{2}\log\left[(2\pi e)^2\det(K)\right]$, and we’re done.
6.4 Identities
Key: In general, h(X) + n is the number of bits on the average required to describe X to n-bit
accuracy.
• Venn Diagram: I(X; Y) = h(X) − h(X | Y) = h(Y) − h(Y | X) = h(X) + h(Y) − h(X, Y)
• Information Inequality: D(f || g) ≥ 0. Consequently:
– I(X; Y ) ≥ 0, equality iff independent.
– h(X) ≥ h(X | Y ), equality iff independent.
– $h(X_1, \dots, X_n) \le \sum_{i=1}^{n} h(X_i)$, equality iff independent.
• Hadamard’s Inequality:
$$\det(K) \le \prod_{i=1}^{n} K_{i,i} \qquad (58)$$
• h(X + c) = h(X)
• h(aX) = h(X) + log |a|
If there’s no constraint on the input, then the capacity could be infinite since X can take any
real value, so we can just spread the input values arbitrarily far apart subject to whatever noise
is present in the channel. To avoid this (which is clearly unrealistic) we impose an input power
constraint.
Power Constraint:
$$\frac{1}{n}\sum_{i=1}^{n} x_i^2 \le P \qquad (62)$$
Capacity is the same, but now subject to a power constraint:
$$C = \max_{p(x)\,:\,E[X^2]\le P} I(X; Y) = \frac{1}{2}\log\!\left(1 + \frac{P}{N}\right) \qquad (63)$$
Proof idea is just to write it out: the noise is Z, so I(X; Y) = h(Y) − h(Y | X) = h(Y) − h(Z), where $h(Z) = \frac{1}{2}\log\left[2\pi e N\right]$. Plug and chug.
Note that we know E[Z] = 0 and Z is independent of X, so $E[Y^2] = E[X^2] + E[Z^2] \le P + N$, which bounds $h(Y) \le \frac{1}{2}\log\left[2\pi e (P + N)\right]$.
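A trivial numerical sketch of Equation 63 (the P and N values are arbitrary):

```python
# Sketch: Gaussian channel capacity C = (1/2) log2(1 + P/N) bits per channel use.
import numpy as np

def gaussian_capacity(P, N):
    return 0.5 * np.log2(1 + P / N)

print(gaussian_capacity(P=1.0, N=1.0))    # 0.5 bits per use at 0 dB SNR
print(gaussian_capacity(P=100.0, N=1.0))  # about 3.33 bits per use at 20 dB SNR
```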
7.1 Codes
We can make codes in the same way, only now the encoding function produces codewords such that $\sum_{i=1}^{n} x_i(w)^2 \le nP$.
A rate is achievable if there exists a sequence of codes that satisfy the power constraint and for which the usual notion of achievability (maximal probability of error going to 0) holds.
Channel Coding Theorem: We get the channel coding theorem again for Gaussian channels.
That is, any Rate R < C is achievable, and the converse, that any rate R ≥ C is not achievable.
So a signal is band-limited to W if there is some value, 2πW, for which, outside that interval, there are no frequencies: F(ω) = 0 for |ω| > 2πW. Evaluating the inverse transform at the sample times n/2W gives:
$$f\!\left(\frac{n}{2W}\right) = \frac{1}{2\pi}\int_{-2\pi W}^{2\pi W} F(\omega)\, e^{\,i\omega n/(2W)}\, d\omega \qquad (65)$$
References
[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.