Lecture 1
In the first lecture, we will briefly repeat some basic results in discrete probability theory
and introduce two key problems in classical information theory:
• Source coding: The compression of a discrete memoryless information source.
• Channel coding: The reliable transmission of information over a noisy communication
channel at rates as high as possible.
We will end with a perspective on what we will study in the rest of the course.
1 Discrete probability theory
Throughout, Σ denotes an alphabet, i.e., a set of symbols such as {0, 1}, the Latin alphabet,
or some other set. Let us define the two basic concepts of probability theory:
• A probability distribution on Σ is a function p : Σ → [0, 1] such that ∑_{x∈Σ} p(x) = 1.
We denote the set of probability distributions on Σ by P(Σ).
• A (discrete) random variable X is given by a pair (Σ, p) of an alphabet Σ, not neces-
sarily finite, and a probability distribution p ∈ P (Σ). We say that X is Σ-valued and
distributed according to p, and we write X ∼ p.
If a random variable X is Σ-valued and distributed according to p ∈ P(Σ), then we interpret
p(x) for x ∈ Σ as the probability that X takes the value x, and we will write
P (X = x) = p(x).
We will sometimes simplify our language slightly by not specifying the smallest alphabet Σ
of values that a random variable X can take. For example, we will call a random variable
R-valued even if it only takes values in a discrete subset of R and not all values in R.
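For concreteness, here is a minimal Python sketch (the alphabet and probabilities are arbitrary example choices) that represents a distribution p ∈ P(Σ) as a dictionary and samples a Σ-valued random variable X ∼ p:

```python
import random

# A probability distribution p on the alphabet Sigma = {"a", "b", "c"},
# represented as a dictionary mapping symbols to probabilities.
p = {"a": 0.5, "b": 0.3, "c": 0.2}

# Check the normalization condition sum_x p(x) = 1 (up to floating-point error).
assert abs(sum(p.values()) - 1.0) < 1e-12

# Sample a Sigma-valued random variable X ~ p.
def sample(p, rng=random):
    return rng.choices(list(p.keys()), weights=list(p.values()), k=1)[0]

print(sample(p))  # prints a random symbol, e.g. "a"
```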
Definition 1.1 (Joint and marginal probability distributions). Consider alphabets ΣA and
ΣB and the product alphabet ΣA × ΣB . For every distribution pAB ∈ P (ΣA × ΣB ), we can
define the marginal distributions pA ∈ P(ΣA ) and pB ∈ P(ΣB ) by
pA(x) = ∑_{y'∈ΣB} pAB(x, y')   and   pB(y) = ∑_{x'∈ΣA} pAB(x', y),
for any x ∈ ΣA and any y ∈ ΣB . The distributions in P (ΣA × ΣB ) are also called joint
distributions. All of these definitions generalize to more than two alphabets.
It is common to write (X, Y ) for a random variable with values in ΣA × ΣB distributed
according to some probability distribution pAB ∈ P(ΣA × ΣB). We will refer to (X, Y ) as a
pair of random variables with values in ΣA × ΣB and joint distribution pAB . This notation
suggests that X ∼ pA and Y ∼ pB can somehow be considered as independent entities,
but in general this is not so. In general, the two marginals pA and pB do not describe the
entire probability distribution pAB , and we refer to aspects not captured by the marginal
distributions as correlations. There is one case in which the two marginals do describe the
entire joint distribution:
Definition 1.2 (Independence). We say that a pair of random variables (X, Y ) with val-
ues in ΣA × ΣB distributed according to some probability distribution pAB ∈ P(ΣA × ΣB) is
independent if the probability distribution pAB factorizes as
pAB(x, y) = pA(x) pB(y),   for every x ∈ ΣA and y ∈ ΣB.
In this case, the marginals pA and pB determine the entire joint distribution pAB and we
write pAB = pA × pB .
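The following small Python sketch (with an arbitrarily chosen joint distribution) computes the marginals of Definition 1.1 and checks the factorization condition of Definition 1.2:

```python
from itertools import product

# A joint distribution p_AB on Sigma_A x Sigma_B, stored as a dict over pairs.
# The numbers are an arbitrary example (this one is *not* a product distribution).
p_AB = {("0", "x"): 0.4, ("0", "y"): 0.1, ("1", "x"): 0.1, ("1", "y"): 0.4}

sigma_A = sorted({x for (x, _) in p_AB})
sigma_B = sorted({y for (_, y) in p_AB})

# Marginals as in Definition 1.1: p_A(x) = sum_y p_AB(x, y), p_B(y) = sum_x p_AB(x, y).
p_A = {x: sum(p_AB[(x, y)] for y in sigma_B) for x in sigma_A}
p_B = {y: sum(p_AB[(x, y)] for x in sigma_A) for y in sigma_B}

# Independence as in Definition 1.2: does p_AB(x, y) = p_A(x) * p_B(y) hold everywhere?
independent = all(
    abs(p_AB[(x, y)] - p_A[x] * p_B[y]) < 1e-12 for x, y in product(sigma_A, sigma_B)
)
print(p_A, p_B, independent)  # here both marginals are uniform, but p_AB is not their product
```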
The definition of independence generalizes to general N-tuples of random variables. We
will often consider sequences (Xn)n∈N of random variables with values in Σ that are
independently and identically distributed according to some distribution p ∈ P(Σ). By this
we mean that for each N ∈ N the tuple (X1, . . . , XN) consists of independent random
variables with joint distribution p×N, i.e., such that
P(X1 = x1, . . . , XN = xN) = p(x1) · · · p(xN),   for all x1, . . . , xN ∈ Σ.
Probability theorists developed their own notation to deal with non-independent pairs
(or tuples) of random variables. Given a pair (X, Y ) of random variables with values in
ΣA × ΣB distributed according to some probability distribution pAB ∈ P(ΣA × ΣB), we write
P((X, Y ) ∈ S) = ∑_{(x,y)∈S} pAB(x, y)
for some set S ⊆ ΣA × ΣB, and if Σ = ΣA = ΣB has some additional structure we may even
do some arithmetic such as
P (X + Y = 3) = P (X + Y ∈ {(x, y) ∈ Σ × Σ : x + y = 3}) .
If we just mention one of the random variables, then we will always mean the marginal
distributions, i.e., we have
P(X ∈ SA) = ∑_{x∈SA} pA(x)   and   P(Y ∈ SB) = ∑_{y∈SB} pB(y),
for subsets SA ⊆ ΣA and SB ⊆ ΣB. Similarly, we define the conditional probability
P(X ∈ SA | Y ∈ SB) = ( ∑_{(x,y)∈SA×SB} pAB(x, y) ) / ( ∑_{y'∈SB} pB(y') ),
if ∑_{y'∈SB} pB(y') ≠ 0. As before, we may abbreviate the
corresponding probability distribution as p(·|Y ∈ SB) ∈ P(ΣA), which is called the proba-
bility distribution of X conditioned on Y ∈ SB. These probabilities satisfy a few properties
which can be derived directly from the definitions:
Theorem 1.3 (Properties of joint and conditional probabilities). Consider a pair of random
variables (X, Y ) with values in ΣA ×ΣB distributed according to some probability distribution
pAB ∈ P(ΣA × ΣB). We have:
• Product rule:
P((X, Y ) ∈ SA × SB) = P(Y ∈ SB) P(X ∈ SA | Y ∈ SB).
• Sum rule:
P(X ∈ SA) = ∑_{y∈ΣB} P(Y = y) P(X ∈ SA | Y = y).
• Bayes’ theorem:
P(Y ∈ SB | X ∈ SA) = P(Y ∈ SB) P(X ∈ SA | Y ∈ SB) / P(X ∈ SA).
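These rules are easy to check numerically. The following sketch (with hard-coded example values for pAB and its marginals) verifies the sum rule and Bayes’ theorem on a small joint distribution:

```python
# Numerically check the sum rule and Bayes' theorem for a small joint distribution.
# The joint distribution and its marginals below are arbitrary illustrative choices.
p_AB = {("0", "x"): 0.4, ("0", "y"): 0.1, ("1", "x"): 0.2, ("1", "y"): 0.3}
p_A = {"0": 0.5, "1": 0.5}
p_B = {"x": 0.6, "y": 0.4}

def cond_A_given_B(x, y):  # P(X = x | Y = y) = p_AB(x, y) / p_B(y)
    return p_AB[(x, y)] / p_B[y]

def cond_B_given_A(y, x):  # P(Y = y | X = x) = p_AB(x, y) / p_A(x)
    return p_AB[(x, y)] / p_A[x]

# Sum rule: P(X = "0") = sum_y P(Y = y) P(X = "0" | Y = y).
sum_rule = sum(p_B[y] * cond_A_given_B("0", y) for y in p_B)
assert abs(sum_rule - p_A["0"]) < 1e-12

# Bayes' theorem: P(Y = "x" | X = "0") = P(Y = "x") P(X = "0" | Y = "x") / P(X = "0").
bayes_rhs = p_B["x"] * cond_A_given_B("0", "x") / p_A["0"]
assert abs(bayes_rhs - cond_B_given_A("x", "0")) < 1e-12
print("sum rule and Bayes' theorem hold on this example")
```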
When talking about random variables attaining values in some countable subset Σ ⊂ R,
it will be useful to define the following two functions:
• Expected value: E[X] = ∑_{x∈Σ} p(x) x.
• Variance: Var[X] = E[(X − E[X])²] = ∑_{x∈Σ} p(x) (x − E[X])².
In general, neither the expected value nor the variance has to be finite. However, we will
often restrict to the case where the probability distributions have finite support, i.e., p(x) ≠ 0
only for a finite number of x ∈ Σ, and in this case the expected value and the variance are
finite.
The following lemma can be verified easily:
Lemma 1.4. Let ΣA , ΣB ⊂ R be countable subsets and (X, Y ) a pair of random variables
with values in ΣA × ΣB . Then, we have
E [X + Y ] = E [X] + E [Y ] .
If the random variables (X, Y ) are independent, then
E [XY ] = E [X] E [Y ] .
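The following Python sketch (with arbitrarily chosen example distributions) computes expected values and variances and checks the two identities of Lemma 1.4 for an independent pair:

```python
from itertools import product

# Expected value and variance of a finitely supported distribution on a subset of R.
def expectation(p):
    return sum(prob * x for x, prob in p.items())

def variance(p):
    mu = expectation(p)
    return sum(prob * (x - mu) ** 2 for x, prob in p.items())

# Two independent random variables X ~ p and Y ~ q (arbitrary example values).
p = {0: 0.5, 1: 0.5}          # a fair coin
q = {1: 0.2, 2: 0.3, 6: 0.5}  # some other distribution

# Joint distribution of the independent pair (X, Y): the product p x q.
joint = {(x, y): p[x] * q[y] for x, y in product(p, q)}

# Check E[X + Y] = E[X] + E[Y] and, for independent X, Y, E[XY] = E[X]E[Y].
E_sum = sum(prob * (x + y) for (x, y), prob in joint.items())
E_prod = sum(prob * (x * y) for (x, y), prob in joint.items())
assert abs(E_sum - (expectation(p) + expectation(q))) < 1e-12
assert abs(E_prod - expectation(p) * expectation(q)) < 1e-12
print(expectation(p), variance(p), E_sum, E_prod)
```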
We will sometimes need the following elementary inequalities of probability theory:
Theorem 1.5 (Markov’s and Chebychev’s inequalities). Let Z denote a random variable
with values in a countable subset Σ ⊂ R and distributed according to p ∈ P (Σ).
1. (Markov’s inequality) For every ε > 0, we have
P(|Z| > ε) ≤ E[|Z|] / ε.
2. (Chebychev’s inequality) For every ε > 0, we have
P((Z − E[Z])² > ε) ≤ Var[Z] / ε.
Proof. Both inequalities are trivially satisfied if E[|Z|] = ∞ or Var[Z] = ∞. For the first
inequality note that
P(|Z| > ε) = ∑_{z : |z|>ε} p(z) ≤ ∑_{z : |z|>ε} p(z) |z|/ε ≤ E[|Z|] / ε,
where we multiplied each summand by a number |z|/ε ≥ 1, and in the last inequality we
added more non-negative terms. For the second inequality, we just insert the positive-valued
random variable (Z − E [Z])2 into the first inequality.
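As a quick sanity check, the following sketch (with an arbitrarily chosen non-negative random variable) verifies Markov’s inequality for several values of ε:

```python
# Numerically check Markov's inequality P(|Z| > eps) <= E[|Z|] / eps
# for a small finitely supported distribution (arbitrary example values).
p = {0: 0.6, 1: 0.25, 5: 0.1, 20: 0.05}  # distribution of a non-negative Z

E_abs = sum(prob * abs(z) for z, prob in p.items())

for eps in [0.5, 1.0, 3.0, 10.0]:
    lhs = sum(prob for z, prob in p.items() if abs(z) > eps)
    rhs = E_abs / eps
    print(f"eps={eps:5.1f}:  P(|Z| > eps) = {lhs:.3f}  <=  E[|Z|]/eps = {rhs:.3f}")
    assert lhs <= rhs + 1e-12
```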
Theorem 1.6 (Weak law of large numbers). Let Σ ⊂ R be an alphabet. Consider a random
variable Y with values in Σ and distributed according to p ∈ P (Σ) with expected value
µ = E [Y ] and Var [Y ] < ∞. If (Yn )n∈N are independent random variables identically
distributed to Y , then
lim_{n→∞} P( |(Y1 + Y2 + · · · + Yn)/n − µ| > ε ) = 0,
for every ε > 0.
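The weak law of large numbers is easy to observe in simulation. The following Python sketch (with an arbitrarily chosen distribution, ε and trial count) estimates the probability that the sample mean deviates from µ by more than ε for increasing n:

```python
import random

# Empirical illustration of the weak law of large numbers: for i.i.d. samples from a
# distribution p (arbitrary example values), the probability that the sample mean
# deviates from mu = E[Y] by more than eps shrinks as n grows.
p = {0: 0.2, 1: 0.5, 4: 0.3}
mu = sum(prob * y for y, prob in p.items())  # = 1.7
eps = 0.2
values, weights = list(p.keys()), list(p.values())

rng = random.Random(0)
for n in [10, 100, 1000, 10000]:
    trials = 500
    deviations = 0
    for _ in range(trials):
        sample_mean = sum(rng.choices(values, weights=weights, k=n)) / n
        if abs(sample_mean - mu) > eps:
            deviations += 1
    print(f"n={n:6d}:  estimated P(|mean - mu| > {eps}) = {deviations / trials:.3f}")
```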
2 What is information?
Consider the following strings. Which of these contain a high amount of information (intu-
itively speaking)? Can you rationalize your intuition?
1. 0000000000000000000000000000000000000000000000000000000000000000000
2. 0101010101010101010101010101010101010101010101010101010101010101010
3. 3141592653589793238462643383279502884197169399375105820974944592307
4. 1621683129157654671616347956219518784030306919262080790346992725831
5. “By virtue of its innermost intention, and like all questions about language, struc-
turalism escapes the classical history of ideas which already supposes structuralism’s
possibility, for the latter naively belongs to the province of language and propounds
itself within it. Nevertheless, by virtue of an irreducible region of irreflection and spon-
taneity within it, by virtue of the essential shadow of the undeclared, the structuralist
phenomenon will deserve examination by the historian of ideas. For better or for
worse. Everything within this phenomenon that does not in itself transparently belong
to the question of the sign will merit this scrutiny; as will everything within it that is
methodologically effective, thereby possessing the kind of infallibility now ascribed to
sleepwalkers and formerly attributed to instinct, which was said to be as certain as it
was blind.”1
6. “Preheat oven to 220 degrees C. Melt the butter in a saucepan. Stir in flour to form
a paste. Add water, white sugar and brown sugar, and bring to a boil. Reduce
temperature and let simmer. Place the bottom crust in your pan. Fill with apples,
mounded slightly. Cover with a lattice work crust. Gently pour the sugar and butter
liquid over the crust. Pour slowly so that it does not run off. Bake 15 minutes in the
preheated oven. Reduce the temperature to 175 degrees C. Continue baking for 35 to
45 minutes, until apples are soft.”2
7. “The identity 27^5 + 84^5 + 110^5 + 133^5 = 144^5 was found by Lander and Parkin
as the smallest instance in which four fifth powers sum to a fifth power. This is a
counterexample to a conjecture by Euler that at least n nth powers are required to
sum to an nth power, n > 2.”3
As you may have noticed, there are different possible notions of “information”.
Maybe, you had the idea of defining “information” in terms of compressibility such that a
string of symbols has a large amount of “information” if it cannot be compressed too much,
and it has a low amount of “information” if it can be compressed a lot. This is indeed the
intuition behind the notion of “information” that we are going to define. However, to make
it precise, we need to think about what it means to compress a bit string. One possible
notion based on a model of computation would be as follows: the Kolmogorov complexity of
a string is the length of the shortest program that outputs the string when run on a fixed
universal computer (e.g., a universal Turing machine).
While this notion of complexity is very elegant and has a very general scope, it has a
serious disadvantage: The Kolmogorov complexity is uncomputable, i.e., there cannot exist
an algorithm to compute it on any kind of computer. We will use a different definition first
introduced by the mathematician Claude Shannon. To introduce this definition, we have to
(as often in applied mathematics) first properly define the problem.
1 Jacques Derrida, Writing and Difference, “Force and Signification”.
2 https://www.allrecipes.com/recipe/12682/apple-pie-by-grandma-ople/
3 Lander, L. J., Parkin, T. R. (1966). Bulletin of the American Mathematical Society, 72(6), 1079.
3 Classical source coding
The first idea is to not focus on the actual content of the string, but rather consider the
statistical properties of the symbols appearing in it. To make this precise, we define what
we mean by an information source:
Definition 3.1 (Discrete memoryless information sources). Let Σ denote a finite alphabet.
A discrete memoryless source (DMS) on Σ is a sequence of random variables (Xn)n∈N that
are independently and identically distributed and take values in Σ.
Definition 3.2 (Compression scheme). For any δ > 0 and any n, m ∈ N an (n, m, δ)-
compression scheme for a discrete memoryless source (Xn )n∈N with distribution p ∈ P(Σ)
on the alphabet Σ is a pair of functions
E : Σn → {0, 1}^m   and   D : {0, 1}^m → Σn,
such that
P((X1, . . . , Xn) ∈ S) ≥ 1 − δ,
where
S = {(x1, . . . , xn) ∈ Σn : (D ◦ E)(x1, . . . , xn) = (x1, . . . , xn)}
denotes the set where the compression succeeds.
A good compression scheme will have two properties: The success probability will be
high, and it compresses a string into few bits, i.e., the ratio m/n is low. It is intuitively
clear that there should be a tradeoff between the success probability and the compression
rate: If we do not compress at all, then the success probability can be 1, but if we want
to compress n > 1 symbols into m = 1 bit, then the success probability will (usually) be
small. Shannon’s next insight was to consider compression schemes in the asymptotic limit
n → ∞. To make this precise, we will define asymptotically achievable rates:
Definition 3.3 (Achievable rates for compression). A number R ∈ R+ is called an achievable
rate for compression of the discrete memoryless source (Xn)n∈N with distribution p ∈ P(Σ),
if for every n ∈ N there exists an (n, mn, δn)-compression scheme such that
R = lim_{n→∞} mn/n   and   lim_{n→∞} δn = 0.
It would be cool if we could find the optimal achievable rate. This is what Shannon did:
Theorem 3.4 (Shannon’s source coding theorem). Let (Xn)n∈N denote a discrete memo-
ryless source on the alphabet Σ with distribution p ∈ P(Σ). The Shannon entropy is given by
H(p) = −∑_{x∈Σ} p(x) log(p(x)).   (1)
1. Any rate R > H(p) is achievable for compression of the discrete memoryless source
(Xn)n∈N.
2. No rate R < H(p) is achievable for compression of the discrete memoryless source
(Xn)n∈N.
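The Shannon entropy (1) is straightforward to evaluate numerically. The following minimal Python sketch (with arbitrarily chosen example distributions) computes H(p) in bits; note how a heavily biased bit has entropy well below one bit, which is exactly what makes compression below one bit per symbol possible:

```python
from math import log2

# Shannon entropy H(p) = -sum_x p(x) log2 p(x) of a distribution, as in (1).
def shannon_entropy(p):
    return -sum(prob * log2(prob) for prob in p.values() if prob > 0)

# Example distributions (arbitrary illustrative choices):
uniform_bit = {"0": 0.5, "1": 0.5}                            # H = 1 bit
biased_bit = {"0": 0.9, "1": 0.1}                             # H ~ 0.469 bits
four_symbols = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # H = 2 bits

for name, p in [("uniform bit", uniform_bit),
                ("biased bit", biased_bit),
                ("uniform on 4 symbols", four_symbols)]:
    print(f"{name}: H(p) = {shannon_entropy(p):.3f} bits")
```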
Before proving this theorem, let us build some intuition. For a string (x1, . . . , xn) emitted
by the source, the law of large numbers suggests that
−(1/n) log(p(x1) · · · p(xn)) = (1/n) ∑_{i=1}^n −log(p(xi)) ≈ E[−log(p(X1))] = H(p),
so that p(x1) · · · p(xn) ≈ 2^{−nH(p)}, and magically the Shannon entropy appears in the
exponent. Motivated by this intuition, we state the following definition:
Definition 3.5 (Typical strings). Let Σ be an alphabet and p ∈ P(Σ) a probability distribu-
tion. For n ∈ N and ε > 0 a string (x1, . . . , xn) ∈ Σn is called ε-typical for the distribution
p if
2^{−n(H(p)+ε)} < p(x1) · · · p(xn) < 2^{−n(H(p)−ε)}.
We denote the set of these strings by Tn,ε(p).
How many typical strings are there, and how likely is it that a string obtained from a
discrete memoryless source is typical? To answer these questions we will use the weak law
of large numbers from probability theory:
Lemma 3.6 (Properties of typical strings). Let Σ be an alphabet and p ∈ P (Σ) a probability
distribution. We have:
1. For any n ∈ N and ε > 0 we have
|Tn,ε(p)| ≤ 2^{n(H(p)+ε)}.
2. For any ε > 0 we have
lim_{n→∞} P((X1, . . . , Xn) ∈ Tn,ε(p)) = 1,
where (Xn)n∈N denotes a discrete memoryless source with distribution p.
Proof. For the first statement, note that
1 ≥ ∑_{(x1,...,xn)∈Tn,ε(p)} p(x1) · · · p(xn) > |Tn,ε(p)| · 2^{−n(H(p)+ε)},
since every ε-typical string has probability larger than 2^{−n(H(p)+ε)}. Rearranging gives the
bound from the first statement.
For the second statement, consider the random variables Zi = f(Xi) with f(x) = −log(p(x)),
which are independent and identically distributed with expected value
µ = E[Z1] = ∑_{x∈Σ} p(x) f(x) = H(p),
and with finite variance since Σ is finite. By the weak law of large numbers (Theorem 1.6), we have
lim_{n→∞} P( |(1/n) ∑_{i=1}^n −log(p(Xi)) − H(p)| < ε ) = 1.   (2)
The event appearing in (2) says precisely that
n(H(p) − ε) < −log(p(x1) · · · p(xn)) < n(H(p) + ε).
After applying the exponential function, this is equivalent to (x1, . . . , xn) ∈ Tn,ε(p),
and we can rewrite (2) into
lim_{n→∞} P((X1, . . . , Xn) ∈ Tn,ε(p)) = 1.
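To get a feeling for typical sets, the following brute-force Python sketch (with arbitrarily chosen p, n and ε; it is only feasible for tiny alphabets and small n) enumerates Tn,ε(p) and compares its size and probability with the bounds of Lemma 3.6. For small n the probability of the typical set is still far from 1; the lemma is an asymptotic statement.

```python
from itertools import product
from math import log2

# Enumerate the epsilon-typical set T_{n,eps}(p) by brute force for small n
# (a sketch with arbitrary example parameters).
p = {"0": 0.9, "1": 0.1}
H = -sum(q * log2(q) for q in p.values())
n, eps = 10, 0.1

typical = []
total_prob = 0.0
for s in product(p.keys(), repeat=n):
    prob = 1.0
    for x in s:
        prob *= p[x]
    if 2 ** (-n * (H + eps)) < prob < 2 ** (-n * (H - eps)):
        typical.append(s)
        total_prob += prob

print(f"H(p) = {H:.3f}")
print(f"|T_(n,eps)(p)| = {len(typical)} out of {len(p) ** n} strings")
print(f"bound 2^(n(H+eps)) = {2 ** (n * (H + eps)):.1f}")
print(f"P(typical) = {total_prob:.3f}")
```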
Let us emphasize again the intuition behind Lemma 3.6: There are 2^{n log(|Σ|)} many strings
of length n in Σn, but at most 2^{n(H(p)+ε)} of them are ε-typical for the distribution p. For
large n and if H(p) < log(|Σ|), these are very few strings compared to the total number. Still,
when receiving strings of length n from the information source, we will essentially only
get typical strings for large n. Let us exploit this fact to construct a compression scheme:
Proof of Theorem 3.4.
Direct part. For ε > 0 and any n ∈ N we will construct an (n, ⌈n(H(p)+ε)⌉, δn)-compression
scheme for the discrete memoryless source (Xn)n∈N over the alphabet Σ distributed according
to p ∈ P(Σ) such that δn → 0 as n → ∞. Since
H(p) + ε = lim_{n→∞} ⌈n(H(p) + ε)⌉ / n,
this shows that H(p) + ε is an achievable rate.
For n ∈ N and ε > 0 we will construct a compression scheme which succeeds on all
typical strings, i.e., we have S = Tn,ε(p) in the terminology of Definition 3.2. First, we set
m = ⌈n(H(p) + ε)⌉ and we choose a distinct bit string b(x1, x2, . . . , xn) ∈ {0, 1}^m for every typical
sequence (x1, . . . , xn) ∈ Tn,ε(p). The first case of Lemma 3.6 shows that there are enough bit
strings of length m to do this. Now, we define an encoding function En : Σn → {0, 1}^m by
En(x1, . . . , xn) = b(x1, . . . , xn)   if (x1, . . . , xn) ∈ Tn,ε(p),
En(x1, . . . , xn) = (0, 0, . . . , 0)   if (x1, . . . , xn) ∉ Tn,ε(p),
and a decoding function Dn : {0, 1}^m → Σn by
Dn(b1, . . . , bm) = (x1, . . . , xn)   if (b1, . . . , bm) = b(x1, x2, . . . , xn) for some (x1, . . . , xn) ∈ Tn,ε(p),
Dn(b1, . . . , bm) = (f, f, . . . , f)   if (b1, . . . , bm) ≠ b(x1, x2, . . . , xn) for every (x1, . . . , xn) ∈ Tn,ε(p),
for some symbol f ∈ Σ corresponding to a failure. From this construction it follows that
(Dn ◦ En)(x1, . . . , xn) = (x1, . . . , xn),
whenever (x1, . . . , xn) ∈ Tn,ε(p) (and maybe in the additional case where xi = f for all
i ∈ {1, . . . , n}). Therefore, we conclude that the success probability satisfies
P((Dn ◦ En)(X1, . . . , Xn) = (X1, . . . , Xn)) ≥ P((X1, . . . , Xn) ∈ Tn,ε(p)) =: 1 − δn.
By the second case of Lemma 3.6, we see that δn → 0 as n → ∞. This finishes the proof.
Converse part. Consider a sequence of (nk , mk , δk )-compression schemes for the discrete
memoryless source (Xn )n∈N such that limk→∞ nk = ∞ and
lim_{k→∞} mk/nk = R < H(p).
For each k ∈ N let Sk denote the set of strings on which the (nk , mk , δk )-compression scheme
in the sequence succeeds (see Definition 3.2). We have
|Sk| ≤ 2^{mk},
since encoding more than 2^{mk} strings into a set with 2^{mk} elements necessarily leads to a
collision. Furthermore, note that for each k ∈ N we have
Sk ⊆ (Sk ∩ Tnk,ε(p)) ∪ (Σ^{nk} \ Tnk,ε(p)),
for any ε > 0, which implies that
1 − δk ≤ ∑_{(x1,...,xnk)∈Sk} p(x1) · · · p(xnk)
     ≤ ∑_{(x1,...,xnk)∈Sk∩Tnk,ε(p)} p(x1) · · · p(xnk) + P((X1, . . . , Xnk) ∉ Tnk,ε(p))
     ≤ |Sk| · 2^{−nk(H(p)−ε)} + P((X1, . . . , Xnk) ∉ Tnk,ε(p)),
where in the last step we used that every ε-typical string has probability smaller than 2^{−nk(H(p)−ε)}.
Finally, we fix 0 < ε < H(p) − R and note that as k → ∞ we have
|Sk| · 2^{−nk(H(p)−ε)} ≤ 2^{−nk(H(p) − mk/nk − ε)} → 0,
since mk/nk → R and hence H(p) − mk/nk − ε → H(p) − R − ε > 0, and
P((X1, . . . , Xnk) ∉ Tnk,ε(p)) → 0,
by Lemma 3.6. Therefore, the failure probability of the compression scheme satisfies δk → 1
as k → ∞, and no rate R < H(p) is achievable.
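The construction in the direct part of the proof can be carried out literally for very small parameters. The following Python sketch (with arbitrarily chosen p, n and ε, and brute-force enumeration of Σn) builds the labels b(x1, . . . , xn), the encoder En and the decoder Dn, and computes the success probability; as above, the asymptotic guarantees only kick in for large n.

```python
from itertools import product
from math import ceil, log2

# A toy implementation of the typical-set compression scheme from the direct part of
# the proof of Theorem 3.4 (arbitrary example parameters; only feasible for tiny n).
p = {"0": 0.9, "1": 0.1}
n, eps = 12, 0.15
H = -sum(q * log2(q) for q in p.values())
m = ceil(n * (H + eps))  # number of bits used by the encoder

def prob(s):
    out = 1.0
    for x in s:
        out *= p[x]
    return out

# Build the typical set T_{n,eps}(p) and assign each typical string a distinct m-bit label.
typical = [s for s in product(p.keys(), repeat=n)
           if 2 ** (-n * (H + eps)) < prob(s) < 2 ** (-n * (H - eps))]
assert len(typical) <= 2 ** m  # guaranteed by the first part of Lemma 3.6

encode_table = {s: format(i, f"0{m}b") for i, s in enumerate(typical)}
decode_table = {b: s for s, b in encode_table.items()}

def E(s):
    return encode_table.get(s, "0" * m)     # non-typical strings are mapped to 0...0

def D(b):
    return decode_table.get(b, ("f",) * n)  # "f" plays the role of the failure symbol

# Success probability P((D o E)(X_1,...,X_n) = (X_1,...,X_n)).
success = sum(prob(s) for s in product(p.keys(), repeat=n) if D(E(s)) == s)
print(f"m = {m} bits for n = {n} symbols, rate m/n = {m / n:.2f} vs H(p) = {H:.2f}")
print(f"success probability = {success:.3f}")
```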
4 Classical channel coding
The second key problem of information theory is the reliable transmission of information
over a noisy communication channel.
Definition 4.1 (Communication channel). For alphabets ΣA and ΣB, a (discrete memoryless)
communication channel is a map N : ΣA → P(ΣB) assigning to every input symbol x ∈ ΣA a
probability distribution N(x) ∈ P(ΣB) over output symbols.
Definition 4.2 (Coding scheme). For M, n ∈ N and δ > 0, an (n, M, δ)-coding scheme for the
communication channel N : ΣA → P(ΣB) is a pair of functions
E : {1, 2, . . . , M} → ΣA^n   and   D : ΣB^n → {1, 2, . . . , M},
such that
min_{i∈{1,2,...,M}} P(N^{×n} ◦ E(i) ∈ D^{−1}(i)) ≥ 1 − δ,   (3)
where N^{×n} denotes the n-fold product channel, which maps a string (x1, . . . , xn) ∈ ΣA^n
to the product distribution N(x1) × · · · × N(xn) ∈ P(ΣB^n).
The previous definition might be a bit difficult to parse. It should be read as follows:
There are two functions, the encoder E and the decoder D. The encoder E encodes a
message (labelled by 1, . . . , M ) into a string in ΣnA of length n. The symbols E(i)1 , E(i)2 , . . .
making up the string corresponding to message i are then sent successively through the
communication channel, leading to a product of probability distributions
N^{×n} ◦ E(i) = (N(E(i)1), N(E(i)2), . . . , N(E(i)n)),
which we may interpret as a probability distribution on ΣnB . Receiving one possible string in
ΣnB , the receiver applies the decoding map D thereby obtaining a guess for what the message
could be. In (3) the success probability is the probability that the received string, which is
distributed according to N^{×n} ◦ E(i), lies in the preimage D^{−1}(i). Finally, we consider
the minimal probability of success over
all messages to be our figure of merit. Note that we made the implicit assumption that
consecutive applications of the communication channel are independent of each other
leading to the product distribution in (3). This is an idealization, and there are many
information theorists studying non-i.i.d. scenarios for channel coding. However, here we
focus on the simplest case. As in the case for compression, we can define asymptotically
achievable rates:
Definition 4.3 (Achievable rates for channel coding). A number R ∈ R+ is called an
achievable rate for transmitting information over the communication channel N : ΣA →
P(ΣB ) on the alphabets ΣA and ΣB , if for every n ∈ N there exists an (n, Mn , δn ) coding
scheme such that
R = lim_{n→∞} log(Mn)/n   and   lim_{n→∞} δn = 0.
The following definition is central for information theory and goes back to Shannon:
Definition 4.4 (Capacity of a channel). The capacity C(N ) of a communication channel
N is the supremum of the achievable rates for transmitting information over it.
Is it possible to compute the capacity, and does it fully characterize the achievable rates
for communication? Yes, both questions were again answered by Claude Shannon. Shan-
non’s channel coding theorem gives a formula for the capacity of a communication channel in
terms of the joint probability distributions obtained from “sending” a probability distribution
through the channel. We need to introduce another entropic quantity:
Definition 4.5 (Mutual information). The mutual information of a joint probability distri-
bution pAB ∈ P(ΣA × ΣB) with marginals pA ∈ P(ΣA) and pB ∈ P(ΣB) is given by
I(A : B)pAB = H(pA) + H(pB) − H(pAB).
The mutual information is never negative (Homework), and it quantifies how close the
joint distribution is to the product distribution of its marginals. You can check that I(A :
B)pAB = 0 if pAB = pA × pB. Consider a communication channel N : ΣA → P(ΣB), and
write N(y|x) for the probability of obtaining the symbol y ∈ ΣB at the output of the channel
after the symbol x ∈ ΣA has been sent. Note that ∑_{y∈ΣB} N(y|x) = 1 for any x ∈ ΣA. Given
a probability distribution pA ∈ P (ΣA ) we can now define a joint probability distribution
pAB ∈ P (ΣA × ΣB ) by setting
p^N_AB(x, y) = pA(x) N(y|x).   (4)
This joint probability distribution describes the joint probability of inputs and outputs for
the communication channel N, and it is easy to verify that pA is a marginal of p^N_AB. Finally,
we can state the following:
Theorem 4.6 (Shannon’s channel coding theorem). For alphabets ΣA and ΣB let N : ΣA →
P(ΣB) denote a communication channel. The capacity of N is given by
C(N) = max_{pA∈P(ΣA)} I(A : B)_{p^N_AB},
where p^N_AB denotes the joint distribution defined in (4). In particular, a rate R is achievable
for transmitting information over N whenever
R < C(N).
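To illustrate the formula, the following Python sketch estimates the capacity of a binary symmetric channel with flip probability q (an example channel chosen here) by a grid search of the mutual information of p^N_AB over input distributions pA; for this channel the result can be compared with the well-known value 1 − h(q), where h denotes the binary entropy.

```python
from math import log2

def H(p):  # Shannon entropy of a distribution given as a dict
    return -sum(q * log2(q) for q in p.values() if q > 0)

def mutual_information(p_A, N):
    """I(A:B) = H(p_A) + H(p_B) - H(p_AB) for p_AB(x, y) = p_A(x) N(y|x), as in (4)."""
    p_AB = {(x, y): p_A[x] * N[x][y] for x in p_A for y in N[x]}
    p_B = {}
    for (x, y), q in p_AB.items():
        p_B[y] = p_B.get(y, 0.0) + q
    return H(p_A) + H(p_B) - H(p_AB)

# Binary symmetric channel with flip probability q: N(y|x) stored as nested dicts.
q = 0.1
N = {"0": {"0": 1 - q, "1": q}, "1": {"0": q, "1": 1 - q}}

# Estimate the capacity by a grid search over input distributions p_A.
best = max(
    mutual_information({"0": t, "1": 1 - t}, N) for t in [i / 1000 for i in range(1, 1000)]
)
binary_entropy = -(q * log2(q) + (1 - q) * log2(1 - q))
print(f"grid-search capacity estimate = {best:.4f}")
print(f"known BSC capacity 1 - h(q)   = {1 - binary_entropy:.4f}")
```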
We will postpone the proof of this theorem until later, but we still want to point out
one remarkable feature about its proof: It is non-constructive! Shannon’s proof shows that
generating coding schemes at random will almost always achieve rates very close to the
capacity. It turned out to be very difficult to construct specific codes with rates close to
the capacity for general communication channels. The channel coding theorem was proved
in 1948, and it took until 1992 when a family of codes (called turbo codes) was invented
achieving rates close to capacity. Then, it took until 2006 when a family of codes (called
polar codes) was invented that provably achieved rates arbitrarily close to capacity.
In the rest of the course, we will study quantum versions of these questions, for example:
• What is the maximum amount of information that can be stored in a quantum system?
• How can classical information and quantum information be transmitted over quantum
channels?
• How is it possible to transmit information through two channels each having zero
capacity?