
Quantum information theory (MAT4430) Spring 2021

Lecture 1: What is information theory?


Lecturer: Alexander Müller-Hermes

In the first lecture, we will briefly review some basic results in discrete probability theory
and introduce two key problems in classical information theory:
• Source coding: The compression of a discrete memoryless information source.
• Channel coding: The reliable transmission of information over a noisy communication channel at rates as high as possible.
We will end with an outlook on what we will study in the rest of the course.

1 Discrete probability theory


To talk about classical information theory, we will need some very basic terminology from
discrete probability theory. In the following, Σ will denote an alphabet, i.e., a countable set
of symbols (a priori without any further structure). Often, our alphabets will be finite, and
occasionally we will choose the particular alphabet

[d] := {1, . . . , d},

or some other set. Let us define the two basic concepts of probability theory:
• A probability distribution on Σ is a function p : Σ → [0, 1] such that ∑_{x∈Σ} p(x) = 1.
We denote the set of probability distributions on Σ by P (Σ).
• A (discrete) random variable X is given by a pair (Σ, p) of an alphabet Σ, not neces-
sarily finite, and a probability distribution p ∈ P (Σ). We say that X is Σ-valued and
distributed according to p, and we write X ∼ p.
If a random variable X is Σ-valued and distributed according to p ∈ P(Σ), then we interpret
p(x) for x ∈ Σ as the probability that X takes the value x, and we will write

P (X = x) = p(x).

Given a subset S ⊂ Σ we will write

P(X ∈ S) = ∑_{x∈S} p(x).

We will sometimes simplify our language slightly by not specifying the smallest alphabet Σ
of values that a random variable X can take. For example, we will call a random variable
R-valued if it takes values in a discrete subset of R, but not all values in R.
Definition 1.1 (Joint and marginal probability distributions). Consider alphabets ΣA and
ΣB and the product alphabet ΣA × ΣB . For every distribution pAB ∈ P (ΣA × ΣB ), we can
define the marginal distributions pA ∈ P(ΣA ) and pB ∈ P(ΣB ) by
pA(x) = ∑_{y′∈ΣB} pAB(x, y′) and pB(y) = ∑_{x′∈ΣA} pAB(x′, y),

for any x ∈ ΣA and any y ∈ ΣB . The distributions in P (ΣA × ΣB ) are also called joint
distributions. All of these definitions generalize to more than two alphabets.

It is common to write (X, Y) for a random variable with values in ΣA × ΣB distributed
according to some probability distribution pAB ∈ P(ΣA × ΣB). We will refer to (X, Y) as a
pair of random variables with values in ΣA × ΣB and joint distribution pAB . This notation
suggests that X ∼ pA and Y ∼ pB can somehow be considered as independent entities,
but in general this is not so. In general, the two marginals pA and pB do not describe the
entire probability distribution pAB , and we refer to aspects not captured by the marginal
distributions as correlations. There is a case where the two marginals are describing the
entire joint distribution:
Definition 1.2 (Independence). We say that a pair of random variables (X, Y) with values
in ΣA × ΣB distributed according to some probability distribution pAB ∈ P(ΣA × ΣB) is
independent if the probability distribution pAB factorizes as

pAB (x, y) = pA (x)pB (y).

In this case, the marginals pA and pB determine the entire joint distribution pAB and we
write pAB = pA × pB .
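To make Definitions 1.1 and 1.2 concrete, here is a small Python sketch (not part of the lecture; the joint distribution and the numerical tolerance are illustrative choices) that computes both marginals of a joint distribution and tests whether it factorizes:

from itertools import product

# A joint distribution p_AB on Sigma_A x Sigma_B, stored as a dictionary.
# The numbers are an arbitrary illustrative choice.
p_AB = {
    ("a", 0): 0.1, ("a", 1): 0.3,
    ("b", 0): 0.2, ("b", 1): 0.4,
}

Sigma_A = sorted({x for x, _ in p_AB})
Sigma_B = sorted({y for _, y in p_AB})

# Marginal distributions as in Definition 1.1.
p_A = {x: sum(p_AB[(x, y)] for y in Sigma_B) for x in Sigma_A}
p_B = {y: sum(p_AB[(x, y)] for x in Sigma_A) for y in Sigma_B}

# Independence check as in Definition 1.2: does p_AB factorize as p_A x p_B?
independent = all(
    abs(p_AB[(x, y)] - p_A[x] * p_B[y]) < 1e-12
    for x, y in product(Sigma_A, Sigma_B)
)

print(p_A)          # {'a': 0.4, 'b': 0.6} up to rounding
print(p_B)          # {0: 0.3, 1: 0.7} up to rounding
print(independent)  # False: here the marginals do not determine p_AB
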
The definition of independence generalizes to general N -tuples of random variables. We
will often consider sequences (Xn)n∈N of random variables with values in Σ that are
independently and identically distributed according to some distribution p ∈ P(Σ). By this
we mean that for each N ∈ N the tuple (X1, . . . , XN) consists of independent random
variables with joint distribution p^{×N}, i.e., such that

p^{×N}(x1, . . . , xN) = p(x1) · · · p(xN).

Probability theorists developed their own notation to deal with non-independent pairs
(or tuples) of random variables. Given a pair (X, Y ) of random variables with values in
ΣA × ΣB distributed according to some probability distribution pAB ∈ P(ΣA × ΣB), we write

P (X = x, Y = y) = pAB (x, y).

It is common to use this notation quite liberally, and we will write


P((X, Y) ∈ S) = ∑_{(x,y)∈S} pAB(x, y),

for some set S ⊆ ΣA × ΣB , and if Σ = ΣA = ΣB has some additional structure we may even
do some arithmetic such as

P(X + Y = 3) = P((X, Y) ∈ {(x, y) ∈ Σ × Σ : x + y = 3}).

If we just mention one of the random variables, then we will always mean the marginal
distributions, i.e., we have

P (X = x) = pA (x) and P (Y = y) = pB (y).

If pB(y) ≠ 0, then we write

P(X = x | Y = y) = pAB(x, y) / pB(y),
which we may abbreviate as p(x|y). We call the probability distribution p(·|y) ∈ P(ΣA)
the conditional distribution, or the probability distribution of X conditioned on Y = y. This
notation can be generalized by setting

P(X ∈ SA | Y ∈ SB) = (∑_{x∈SA} ∑_{y∈SB} pAB(x, y)) / (∑_{y′∈SB} pB(y′)),

for subsets SA ⊆ ΣA and SB ⊆ ΣB, provided that ∑_{y′∈SB} pB(y′) ≠ 0. As before, we may abbreviate the
corresponding probability distribution as p(·|Y ∈ SB) ∈ P(ΣA), which is called the probability
distribution of X conditioned on Y ∈ SB. These probabilities satisfy a few properties
which can be derived directly from the definitions:
which can be derived directly from the definitions:
Theorem 1.3 (Properties of joint and conditional probabilities). Consider a pair of random
variables (X, Y ) with values in ΣA ×ΣB distributed according to some probability distribution
pAB ∈ P(ΣA × ΣB). We have:
• Product rule:

P(X ∈ SA , Y ∈ SB ) = P(Y ∈ SB )P(X ∈ SA |Y ∈ SB ).

• Sum rule:

P(X ∈ SA) = ∑_{y∈ΣB} P(Y = y) P(X ∈ SA | Y = y).

• Bayes’ theorem:

P(Y ∈ SB | X ∈ SA) = P(Y ∈ SB) P(X ∈ SA | Y ∈ SB) / P(X ∈ SA).
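As a quick numerical sanity check of Theorem 1.3 (an illustrative sketch, not part of the lecture), the following Python snippet verifies the sum rule and Bayes’ theorem for a small, arbitrarily chosen joint distribution:

# Joint distribution p_AB on {"x1","x2"} x {"y1","y2"}; the values are illustrative.
p_AB = {
    ("x1", "y1"): 0.10, ("x1", "y2"): 0.25,
    ("x2", "y1"): 0.30, ("x2", "y2"): 0.35,
}
Sigma_A = ["x1", "x2"]
Sigma_B = ["y1", "y2"]

p_A = {x: sum(p_AB[(x, y)] for y in Sigma_B) for x in Sigma_A}
p_B = {y: sum(p_AB[(x, y)] for x in Sigma_A) for y in Sigma_B}

def cond_A_given_B(x, y):
    """P(X = x | Y = y) = p_AB(x, y) / p_B(y)."""
    return p_AB[(x, y)] / p_B[y]

# Sum rule: P(X = x) = sum_y P(Y = y) P(X = x | Y = y).
for x in Sigma_A:
    total = sum(p_B[y] * cond_A_given_B(x, y) for y in Sigma_B)
    assert abs(total - p_A[x]) < 1e-12

# Bayes' theorem: P(Y = y | X = x) = P(Y = y) P(X = x | Y = y) / P(X = x).
for x in Sigma_A:
    for y in Sigma_B:
        bayes = p_B[y] * cond_A_given_B(x, y) / p_A[x]
        direct = p_AB[(x, y)] / p_A[x]      # P(Y = y | X = x) by definition
        assert abs(bayes - direct) < 1e-12

print("sum rule and Bayes' theorem verified on this example")
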

When talking about random variables attaining values in some countable subset Σ ⊂ R,
it will be useful to define the following two functions:
• Expected value: E[X] = ∑_{x∈Σ} p(x) x.

• Variance: Var[X] = E[(X − E[X])²] = E[X²] − E[X]².
In general, neither the expected value nor the variance has to be finite. However, we will
often restrict to the case where the probability distribution has finite support, i.e., p(x) ≠ 0
only for a finite number of x ∈ Σ, and in this case the expected value and the variance are
finite.
The following lemma can be verified easily:
Lemma 1.4. Let ΣA , ΣB ⊂ R be countable subsets and (X, Y ) a pair of random variables
with values in ΣA × ΣB . Then, we have

E [X + Y ] = E [X] + E [Y ] .
If the random variables (X, Y ) are independent, then

E [XY ] = E [X] E [Y ] .
We will sometimes need the following elementary inequalities of probability theory:
Theorem 1.5 (Markov’s and Chebychev’s inequalities). Let Z denote a random variable
with values in a countable subset Σ ⊂ R and distributed according to p ∈ P (Σ).
1. (Markov’s inequality) For every ε > 0, we have

P[|Z| ≥ ε] ≤ E[|Z|] / ε.

2. (Chebychev’s inequality) For every ε > 0, we have

P[(Z − E[Z])² ≥ ε] ≤ Var[Z] / ε.
Proof. Both inequalities are trivially satisfied if E[|Z|] = ∞ or Var[Z] = ∞. For the first
inequality note that

P[|Z| ≥ ε] = ∑_{z∈Σ: |z|≥ε} p(z) ≤ ∑_{z∈Σ: |z|≥ε} p(z) |z|/ε ≤ E[|Z|] / ε,

where we multiplied each summand by a number |z|/ε ≥ 1, and in the last inequality we
added more non-negative terms. For the second inequality, we just insert the non-negative
random variable (Z − E[Z])² into the first inequality.
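The bounds of Theorem 1.5 can be compared with exact tail probabilities for a concrete distribution. The following sketch uses a fair six-sided die; the distribution and the value of ε are illustrative choices:

# Z uniform on {1,...,6}: compare exact tail probabilities with the
# Markov and Chebychev bounds from Theorem 1.5.
support = range(1, 7)
p = {z: 1 / 6 for z in support}

E_Z = sum(p[z] * z for z in support)                 # 3.5
Var_Z = sum(p[z] * (z - E_Z) ** 2 for z in support)  # 35/12, about 2.917

eps = 5.0

# Markov: P(|Z| >= eps) <= E[|Z|] / eps
markov_exact = sum(p[z] for z in support if abs(z) >= eps)
markov_bound = E_Z / eps
print(markov_exact, "<=", markov_bound)   # about 0.333 <= 0.7

# Chebychev: P((Z - E[Z])^2 >= eps) <= Var[Z] / eps
cheb_exact = sum(p[z] for z in support if (z - E_Z) ** 2 >= eps)
cheb_bound = Var_Z / eps
print(cheb_exact, "<=", cheb_bound)       # about 0.333 <= 0.583
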

Using Chebychev’s inequality we can prove the following theorem:

Theorem 1.6 (Weak law of large numbers). Let Σ ⊂ R be an alphabet. Consider a random
variable Y with values in Σ and distributed according to p ∈ P (Σ) with expected value
µ = E [Y ] and Var [Y ] < ∞. If (Yn )n∈N are independent random variables identically
distributed to Y , then
lim_{n→∞} P( |(Y1 + Y2 + · · · + Yn)/n − µ| ≥ ε ) = 0,

for every ε > 0.

Proof. For any n ∈ N consider the random variable

Zn = (Y1 + Y2 + · · · + Yn)/n,

and note that

E[Zn] = µ.
Using Chebychev’s inequality, we have

P(|Zn − µ| ≥ ε) = P((Zn − µ)² ≥ ε²) ≤ Var(Zn) / ε².

Finally, note that

Var(Zn) = (1/n²) ∑_{i,j=1}^{n} E[Yi Yj] − µ²
        = (1/n²) ∑_{i≠j} E[Yi] E[Yj] + (1/n²) ∑_{i=1}^{n} E[Yi²] − µ²
        = (1/n)(E[Y²] − µ²) = (1/n) Var(Y)

(the n(n − 1) cross terms with i ≠ j each equal µ² by independence),
and we conclude that

P(|(Y1 + Y2 + · · · + Yn)/n − µ| ≥ ε) ≤ Var(Y) / (n ε²) → 0 as n → ∞.
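A short simulation illustrates Theorem 1.6; the Bernoulli parameter, the block lengths, the threshold ε, and the number of trials below are illustrative choices:

import random

random.seed(0)

# Y is a Bernoulli(0.3) variable, so mu = E[Y] = 0.3 (illustrative choice).
p_success, mu, eps = 0.3, 0.3, 0.05
trials = 2000

def deviation_probability(n):
    """Estimate P(|(Y_1 + ... + Y_n)/n - mu| >= eps) from repeated experiments."""
    bad = 0
    for _ in range(trials):
        mean = sum(random.random() < p_success for _ in range(n)) / n
        if abs(mean - mu) >= eps:
            bad += 1
    return bad / trials

for n in (10, 100, 1000):
    print(n, deviation_probability(n))
# The estimated probabilities shrink as n grows, as the weak law predicts.
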

2 What is information?
Consider the following strings. Which of these contain a high amount of information (intu-
itively speaking)? Can you rationalize your intuition?

1. 0000000000000000000000000000000000000000000000000000000000000000000

2. 0101010101010101010101010101010101010101010101010101010101010101010

3. 3141592653589793238462643383279502884197169399375105820974944592307

4. 1621683129157654671616347956219518784030306919262080790346992725831

5. “By virtue of its innermost intention, and like all questions about language, struc-
turalism escapes the classical history of ideas which already supposes structuralism’s
possibility, for the latter naively belongs to the province of language and propounds
itself within it. Nevertheless, by virtue of an irreducible region of irreflection and spon-
taneity within it, by virtue of the essential shadow of the undeclared, the structuralist
phenomenon will deserve examination by the historian of ideas. For better or for
worse. Everything within this phenomenon that does not in itself transparently belong
to the question of the sign will merit this scrutiny; as will everything within it that is
methodologically effective, thereby possessing the kind of infallibility now ascribed to
sleepwalkers and formerly attributed to instinct, which was said to be as certain as it
was blind.”¹

6. “Preheat oven to 220 degrees C. Melt the butter in a saucepan. Stir in flour to form
a paste. Add water, white sugar and brown sugar, and bring to a boil. Reduce
temperature and let simmer. Place the bottom crust in your pan. Fill with apples,
mounded slightly. Cover with a lattice work crust. Gently pour the sugar and butter
liquid over the crust. Pour slowly so that it does not run off. Bake 15 minutes in the
preheated oven. Reduce the temperature to 175 degrees C. Continue baking for 35 to
45 minutes, until apples are soft.”²

7. “A direct search on the CDC 6600 yielded

27⁵ + 84⁵ + 110⁵ + 133⁵ = 144⁵

as the smallest instance in which four fifth powers sum to a fifth power. This is a
counterexample to a conjecture by Euler that at least n nth powers are required to
sum to an nth power, n > 2.”³

As you may have noticed, there might be different possible notions of “information”.
Maybe you had the idea of defining “information” in terms of compressibility such that a
string of symbols has a large amount of “information” if it cannot be compressed too much,
and it has a low amount of “information” if it can be compressed a lot. This is indeed the
intuition behind the notion of “information” that we are going to define. However, to make
it precise, we need to think about what it means to compress a bit string. One possible
notion based on a model of computation would be as follows:

Definition 2.1 (Kolmogorov complexity). The Kolmogorov complexity of a string is the
length of the shortest program of a Turing machine producing the string when initialized on
an empty tape.

While this notion of complexity is very elegant and has a very general scope, it has a
serious disadvantage: The Kolmogorov complexity is uncomputable, i.e., there cannot exist
an algorithm to compute it on any kind of computer. We will use a different definition first
introduced by the mathematician Claude Shannon. To introduce this definition, we have to
(as often in applied mathematics) first properly define the problem.
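Kolmogorov complexity itself cannot be computed, but a general-purpose compressor gives a rough, computable feeling for the compressibility intuition above. The sketch below compares strings in the spirit of examples 1 and 2 with a pseudorandom digit string of the same length; the use of zlib and the chosen lengths are purely illustrative:

import random
import zlib

random.seed(1)

strings = {
    "all zeros":     "0" * 67,
    "alternating":   "01" * 33 + "0",
    "random digits": "".join(random.choice("0123456789") for _ in range(67)),
}

# Compressed length is a crude, computable stand-in for "information content":
# highly regular strings compress well, while random-looking ones barely do.
for name, s in strings.items():
    compressed = zlib.compress(s.encode("ascii"), level=9)
    print(f"{name:14s} raw: {len(s):3d} bytes  compressed: {len(compressed):3d} bytes")
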
¹ Jacques Derrida, Writing and Difference, “Force and Signification”.
² https://www.allrecipes.com/recipe/12682/apple-pie-by-grandma-ople/
³ Lander, L. J., Parkin, T. R. (1966). Bulletin of the American Mathematical Society, 72(6), 1079.

3 Classical source coding
The first idea is to not focus on the actual content of the string, but rather consider the
statistical properties of the symbols appearing in it. To make this precise, we define what
we mean by an information source:

Definition 3.1 (Discrete memoryless information sources). Let Σ denote a finite alphabet.
A discrete memoryless source on Σ (DMS) is a sequence of random variables (Xn )n∈N that
are independently and identically distributed and take values in Σ .

An example of a discrete memoryless source is the text written by a monkey on a type-
writer. The random variable Xi is then the letter the monkey hits at time i. Of course,
our definition of an information source is a strong idealization, and in English text (not writ-
ten by a monkey) consecutive symbols are correlated. For instance, the combination “ed”
will occur more often than the combination “xz”. We could even envision cases where
the probability of later symbols depends on all previous symbols using a kind of memory.
Such non-i.i.d. information sources are studied extensively in information theory, but we
will restrict ourselves to the simple setting stated above.
Informally, a compression scheme is a pair of an encoding and a decoding function. The
encoding function maps blocks of symbols to bit strings that are as short as possible, and
the decoding function should reverse this process. The main insight is to not require the
decoding to work perfectly, but rather to measure its probability of success.

Definition 3.2 (Compression scheme). For any δ > 0 and any n, m ∈ N an (n, m, δ)-
compression scheme for a discrete memoryless source (Xn)n∈N with distribution p ∈ P(Σ)
on the alphabet Σ is a pair of functions

E : Σ^n → {0, 1}^m and D : {0, 1}^m → Σ^n,

such that the success probability satisfies

P[(D ◦ E)(X1, . . . , Xn) = (X1, . . . , Xn)] = ∑_{(x1,...,xn)∈S} p(x1) · · · p(xn) ≥ 1 − δ,

where

S = {(x1, . . . , xn) ∈ Σ^n : (D ◦ E)(x1, . . . , xn) = (x1, . . . , xn)}

denotes the set where the compression succeeds.

A good compression scheme will have two properties: The success probability will be
high, and it compresses a string into few bits, i.e., the ratio m/n is low. It is intuitively
clear that there should be a tradeoff between the success probability and the compression
rate: If we do not compress at all, then the success probability can be 1, but if we want
to compress n > 1 symbols into m = 1 bit, then the success probability will (usually) be
small. Shannon’s next insight was to consider compression schemes in the asymptotic limit
n → ∞. To make this precise, we will define asymptotically achievable rates:

Definition 3.3 (Achievable rates). A number R ∈ R+ is called an achievable rate for
compression of a discrete memoryless source (Xn)n∈N, if for every n ∈ N there exists an
(n, mn, δn) compression scheme for (Xn)n∈N such that

R = lim_{n→∞} mn/n and lim_{n→∞} δn = 0.

It would be cool if we could find the optimal achievable rate. This is what Shannon did:

Theorem 3.4 (Shannon’s source coding⁴ theorem). Let (Xn)n∈N denote a discrete memo-
ryless source on the alphabet Σ with distribution p ∈ P(Σ). The Shannon entropy is given by

H(p) = − ∑_{x∈Σ} p(x) log(p(x)).     (1)

1. Any rate R > H(p) is achievable for compression of the discrete memoryless source
(Xn )n∈N .

2. If there is a sequence of (nk, mk, δk)-compression schemes for the discrete memoryless
source (Xn)n∈N satisfying

lim_{k→∞} nk = ∞ and lim_{k→∞} mk/nk = R < H(p),

then we have lim_{k→∞} δk = 1, i.e., the success probability converges to zero.

⁴ The term source coding is synonymous with compression, and it is more common in the literature.


Shannon’s source coding theorem shows that rates close to H(p) are achievable for com-
pression, and that rates lower than H(p) cannot be achieved with success probability con-
verging to 1. In the following, we will construct a compression scheme that achieves rates
arbitrarily close to H(p). This will prove one direction of Theorem 3.4, and for the other
direction we have to argue that such a sequence of schemes does not exist.
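Equation (1) is straightforward to evaluate numerically. The small helper below computes H(p) in bits, assuming the logarithm in (1) is taken base 2 (which matches the powers of two in the typicality discussion below); it is an illustrative utility, not part of the lecture:

from math import log2

def shannon_entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), with the convention 0*log(0) = 0."""
    return -sum(px * log2(px) for px in p.values() if px > 0)

# A fair coin carries one bit per symbol, a biased coin strictly less.
print(shannon_entropy({"H": 0.5, "T": 0.5}))   # 1.0
print(shannon_entropy({"H": 0.1, "T": 0.9}))   # about 0.469
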
Let us first get some intuition about how the compression scheme will work: For a discrete
memoryless source with some non-trivial and non-uniform distribution p ∈ P (Σ) not all
strings (x1 , x2 , . . . , xn ) ∈ Σn of length n will have the same probability. For example, when
you toss a biased coin with the probability for “tails” much larger than the probability for
“heads”, then it is very unlikely to observe a string of 100 “heads” in a row. Typically, we will
observe strings of length n in which each symbol x ∈ Σ occurs approximately p(x)n times.
To construct an efficient coding scheme with high success probability it might therefore
be enough to focus on such typical strings and ignore the untypical ones. How can we
characterize the typical strings? To get some intuition, let us consider a string (x1 , . . . , xn )
such that each symbol x ∈ Σ occurs approximately p(x)n times. What is the probability of
observing such a string? We can compute it as
p(x1) · · · p(xn) = ∏_{x∈Σ} p(x)^{#{i : xi = x}} ≈ ∏_{x∈Σ} p(x)^{p(x)n} = 2^{n ∑_{x∈Σ} p(x) log(p(x))} = 2^{−nH(p)},

and magically the Shannon entropy appears in the exponent. Motivated by this intuition,
we state the following definition:
Definition 3.5 (Typical strings). Let Σ be an alphabet and p ∈ P (Σ) a probability distribu-
tion. For n ∈ N and ε > 0 a string (x1, . . . , xn) ∈ Σ^n is called ε-typical for the distribution
p if

2^{−n(H(p)+ε)} < p(x1) · · · p(xn) < 2^{−n(H(p)−ε)}.

We denote the set of these strings by Tn,ε(p).
How many typical strings are there, and how likely is it that a string obtained from a
discrete memoryless source is typical? To answer these questions we will use the weak law
of large numbers from probability theory:
Lemma 3.6 (Properties of typical strings). Let Σ be an alphabet and p ∈ P (Σ) a probability
distribution. We have:
1. For any n ∈ N and ε > 0 we have

|Tn,ε(p)| < 2^{n(H(p)+ε)}.

2. For any ε > 0 we have

lim_{n→∞} P[(X1, . . . , Xn) ∈ Tn,ε(p)] = 1,

where (Xn)n∈N is a discrete memoryless source distributed according to p.

Proof.

Ad 1.: By Definition 3.5 and the normalization of probability distributions, we have

2^{−n(H(p)+ε)} |Tn,ε(p)| < ∑_{(x1,...,xn)∈Tn,ε(p)} p(x1) · · · p(xn) ≤ 1.

Ad 2.: Consider the function f : Σ → [0, ∞) given by

f(x) = −log(p(x)) if p(x) > 0, and f(x) = 0 if p(x) = 0,

and the random variable Z = f(X), where X is distributed according to p. Observe that
the expectation value of Z is given by

µ = E(Z) = ∑_{x∈Σ} p(x) f(x) = H(p).

We conclude from the weak law of large numbers that

lim_{n→∞} P( |(f(X1) + f(X2) + · · · + f(Xn))/n − H(p)| < ε ) = 1,     (2)

whenever (Xn)n∈N is a discrete memoryless source distributed according to p. Note that

|(f(x1) + f(x2) + · · · + f(xn))/n − H(p)| < ε

holds for a string (x1, . . . , xn) ∈ Σ^n if and only if

−n(H(p) + ε) < log(p(x1) · · · p(xn)) < −n(H(p) − ε).

After applying the exponential function, this is equivalent to (x1, . . . , xn) ∈ Tn,ε(p),
and we can rewrite (2) into

lim_{n→∞} P[(X1, . . . , Xn) ∈ Tn,ε(p)] = 1.

This finishes the proof.
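For a small alphabet and short block lengths, both parts of Lemma 3.6 can be checked by brute-force enumeration. The following sketch does this for a biased coin; the bias, ε, and block lengths are illustrative choices:

from itertools import product
from math import log2

# Biased coin; the bias, epsilon and block lengths are illustrative choices.
p = {"h": 0.1, "t": 0.9}
H = -sum(q * log2(q) for q in p.values())   # about 0.469 bits per symbol
eps = 0.3

def log2_prob(string):
    """log2 of p(x_1) * ... * p(x_n)."""
    return sum(log2(p[x]) for x in string)

def is_typical(string, n):
    """2^{-n(H+eps)} < p(x_1)...p(x_n) < 2^{-n(H-eps)}, checked via logarithms."""
    return -n * (H + eps) < log2_prob(string) < -n * (H - eps)

for n in (8, 12, 16):
    typical = [s for s in product(p, repeat=n) if is_typical(s, n)]
    prob_typical = sum(2 ** log2_prob(s) for s in typical)
    print(f"n={n:2d}  |T| = {len(typical):4d} < {2 ** (n * (H + eps)):7.1f} = 2^(n(H+eps)),"
          f"  P[typical] = {prob_typical:.3f}")
# |T| stays below 2^{n(H+eps)} (part 1 of Lemma 3.6) and P[typical] grows
# with n (part 2), although at these tiny block lengths it is still well below 1.
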

Let us emphasize again the intuition behind Lemma 3.6: There are 2^{n log(|Σ|)} many strings
of length n in Σ^n, but at most 2^{n(H(p)+ε)} of them are ε-typical for the distribution p. For
large n and if H(p) < log(|Σ|) these are very few strings compared to the total number. Still,
when receiving strings of length n from the information source, we will essentially only
get typical strings for large n. Let us exploit this fact to construct a compression scheme:

Proof of Theorem 3.4.
Direct part. For ε > 0 and any n ∈ N we will construct an (n, ⌈n(H(p)+ε)⌉, δn) compression
scheme for the discrete memoryless source (Xn)n∈N over the alphabet Σ distributed according
to p ∈ P(Σ) such that δn → 0 as n → ∞. Since

H(p) + ε = lim_{n→∞} ⌈n(H(p) + ε)⌉/n,

this shows that H(p) + ε is an achievable rate.
For n ∈ N and ε > 0 we will construct a compression scheme which succeeds on all
typical strings, i.e., we have S = Tn,ε(p) in the terminology of Definition 3.2. First, we set
m = ⌈n(H(p) + ε)⌉ and we choose a distinct bit string b(x1, x2, . . . , xn) ∈ {0, 1}^m for every typical
sequence (x1, . . . , xn) ∈ Tn,ε(p). The first case of Lemma 3.6 shows that there are enough bit
strings of length m to do this. Now, we define an encoding function En : Σ^n → {0, 1}^m by

En(x1, . . . , xn) = b(x1, . . . , xn), if (x1, . . . , xn) ∈ Tn,ε(p),
En(x1, . . . , xn) = (0, 0, . . . , 0), if (x1, . . . , xn) ∉ Tn,ε(p),

and a decoding function Dn : {0, 1}^m → Σ^n by

Dn(b1, . . . , bm) = (x1, . . . , xn), if (b1, . . . , bm) = b(x1, x2, . . . , xn) for some (x1, . . . , xn) ∈ Tn,ε(p),
Dn(b1, . . . , bm) = (f, f, . . . , f), if (b1, . . . , bm) ≠ b(x1, x2, . . . , xn) for any (x1, . . . , xn) ∈ Tn,ε(p),

for some symbol f ∈ Σ corresponding to a failure. From this construction it follows that

(Dn ◦ En)(x1, . . . , xn) = (x1, . . . , xn),

whenever (x1, . . . , xn) ∈ Tn,ε(p) (and maybe in the additional case where xi = f for all
i ∈ {1, . . . , n}). Therefore, we conclude that the success probability satisfies

P[(Dn ◦ En)(X1, . . . , Xn) = (X1, . . . , Xn)] ≥ P[(X1, . . . , Xn) ∈ Tn,ε(p)] =: 1 − δn.

By the second case of Lemma 3.6, we see that δn → 0 as n → ∞. This finishes the proof of the direct part.

Converse part. Consider a sequence of (nk , mk , δk )-compression schemes for the discrete
memoryless source (Xn)n∈N such that lim_{k→∞} nk = ∞ and

lim_{k→∞} mk/nk = R < H(p).
For each k ∈ N let Sk denote the set of strings on which the (nk , mk , δk )-compression scheme
in the sequence succeeds (see Definition 3.2). We have
|Sk| ≤ 2^{mk},

since encoding more than 2^{mk} strings into a set with 2^{mk} elements necessarily leads to a
collision. Furthermore, note that for each k ∈ N we have

Sk ⊆ (Sk ∩ Tnk,ε(p)) ∪ (Σ^{nk} \ Tnk,ε(p)),
for any ε > 0. Using Definition 3.2 and this inclusion, we find

1 − δk ≤ ∑_{(x1,...,xnk)∈Sk} p(x1) · · · p(xnk)
       ≤ ∑_{(x1,...,xnk)∈Sk∩Tnk,ε(p)} p(x1) · · · p(xnk) + P[(X1, . . . , Xnk) ∉ Tnk,ε(p)]
       ≤ 2^{−nk(H(p)−ε)} |Sk| + P[(X1, . . . , Xnk) ∉ Tnk,ε(p)].

Finally, we note that as k → ∞ we have

2^{−nk(H(p)−ε)} |Sk| ≤ 2^{−nk(H(p) − mk/nk − ε)} → 0,

when we choose ε < H(p) − R, and

P[(X1, . . . , Xnk) ∉ Tnk,ε(p)] → 0,

by Lemma 3.6. Therefore, 1 − δk → 0 as k → ∞, i.e., δk → 1, which proves the converse.
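To see the direct part of the proof in action, here is a toy implementation of the typical-set compression scheme for a biased coin. The alphabet, bias, block length, ε, and the failure marker (playing the role of f, chosen outside the alphabet here for simplicity) are illustrative choices:

from itertools import product
from math import ceil, log2

# Illustrative source and parameters.
p = {"h": 0.1, "t": 0.9}
H = -sum(q * log2(q) for q in p.values())
eps, n = 0.3, 12
m = ceil(n * (H + eps))                      # number of bits per block

def is_typical(s):
    lp = sum(log2(p[x]) for x in s)
    return -n * (H + eps) < lp < -n * (H - eps)

typical = [s for s in product(p, repeat=n) if is_typical(s)]
assert len(typical) < 2 ** m                 # guaranteed by part 1 of Lemma 3.6

# Assign a distinct m-bit string b(x_1,...,x_n) to every typical string.
encode_table = {s: format(i, f"0{m}b") for i, s in enumerate(typical)}
decode_table = {b: s for s, b in encode_table.items()}
FAIL = ("f",) * n                            # failure output of the decoder

def E_n(s):
    """Encoder: typical strings get their codeword, everything else all-zeros."""
    return encode_table.get(s, "0" * m)

def D_n(b):
    """Decoder: invert the codeword table, or report failure."""
    return decode_table.get(b, FAIL)

# Here the scheme succeeds exactly on the typical set, so the success
# probability equals the probability of the typical set.
success = sum(2 ** sum(log2(p[x]) for x in s)
              for s in product(p, repeat=n) if D_n(E_n(s)) == s)
print(f"rate m/n = {m}/{n} = {m/n:.2f} bits/symbol, success probability = {success:.3f}")
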

4 Information transmission over noisy channels


Another basic problem of information theory is to determine the maximal rates at which
information can be sent reliably over noisy channels. Again, this problem was solved by
Claude Shannon. In this introduction, we will only state Shannon’s channel coding theorem,
and we will postpone the proof until later. We start with a definition:
Definition 4.1 (Classical communication channel). Let ΣA , ΣB denote two alphabets. A
communication channel is given by a function N : ΣA → P(ΣB ) mapping each symbol in ΣA
to a probability distribution over ΣB .
To get a concrete picture, envision a noisy telegraph line (not very up-to-date of course),
where trying to send the letter “a” might result in the letters “a”, “b”, or “c” appearing at
the other end of the line with different probabilities, depending on the quirks of the system.
Note that in this case ΣA = ΣB . How would you send your messages over such a telegraph
line? One idea might be to encode your message by adding some redundancy. To stay in the
example we might just repeat every symbol five times: If we want to send the symbol “a”,
then we would input the string “aaaaa” into the telegraph line. Even if some error happens,
and the string “caaaba” comes out the other end, the receiver could still guess that probably
the symbol “a” was the intended message. This seems to work fine, but is it the best we can
do? To quantify what we mean by best, we can again define the information transmission
problem similarly to the compression problem from before:
Definition 4.2 (Coding schemes). For alphabets ΣA , ΣB let N : ΣA → P(ΣB ) denote a
communication channel. An (n, M, δ)-coding scheme for information transmission over the
channel N is a pair of functions

E : {1, 2, . . . , M} → ΣA^n and D : ΣB^n → {1, 2, . . . , M},

such that

min_{i∈{1,2,...,M}} P[(N^{×n} ◦ E)(i) ∈ D^{−1}(i)] ≥ 1 − δ.     (3)

Here, N^{×n} is the n-fold direct product of N with itself acting as

N^{×n}(x1, . . . , xn) = (N(x1), . . . , N(xn)),

on (x1, . . . , xn) ∈ ΣA^n.
The previous definition might be a bit difficult to parse. It should be read as follows:
There are two functions, the encoder E and the decoder D. The encoder E encodes a
message (labelled by 1, . . . , M) into a string in ΣA^n of length n. The symbols E(i)1, E(i)2, . . .
making up the string corresponding to message i are then sent successively through the
communication channel, leading to a product of probability distributions

(N^{×n} ◦ E)(i) = (N(E(i)1), N(E(i)2), . . . , N(E(i)n)),

which we may interpret as a probability distribution on ΣB^n. Receiving one possible string in
ΣB^n, the receiver applies the decoding map D, thereby obtaining a guess for what the message
could be. In (3) the success probability is given by the probability that the string (N^{×n} ◦ E)(i)
lies in the preimage D^{−1}(i). Finally, we consider the minimal probability of success over
all messages to be our figure of merit. Note that we made the implicit assumption that
consecutive applications of the communication channel are independent from each other
leading to the product distribution in (3). This is an idealization, and there are many
information theorists studying non-i.i.d. scenarios for channel coding. However, here we
focus on the simplest case. As in the case for compression, we can define asymptotically
achievable rates:
Definition 4.3 (Achievable rates for channel coding). A number R ∈ R+ is called an
achievable rate for transmitting information over the communication channel N : ΣA →
P(ΣB ) on the alphabets ΣA and ΣB , if for every n ∈ N there exists an (n, Mn , δn ) coding
scheme such that
R = lim_{n→∞} log(Mn)/n and lim_{n→∞} δn = 0.
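To connect Definitions 4.1–4.3 with the telegraph example, here is a sketch of a simple repetition-code coding scheme over a binary symmetric channel with two messages; the flip probability, the repetition factor, and the Monte Carlo estimate of the success probability are illustrative choices, and this is certainly not an optimal scheme:

import random
from collections import Counter

random.seed(0)

# Binary symmetric channel N: each input bit is flipped with probability q.
q = 0.1

def channel(bit):
    """One use of N: returns a sample from the output distribution N(.|bit)."""
    return bit ^ (random.random() < q)

# (n, M, delta)-scheme with M = 2 messages and n = 5 channel uses:
# encode by repetition, decode by majority vote.
n = 5

def E(message):                       # E : {0, 1} -> Sigma_A^n
    return [message] * n

def D(received):                      # D : Sigma_B^n -> {0, 1}
    return Counter(received).most_common(1)[0][0]

# Estimate the success probability min_i P[N^{x n}(E(i)) in D^{-1}(i)].
trials = 20000
success = {m: 0 for m in (0, 1)}
for m in (0, 1):
    for _ in range(trials):
        if D([channel(b) for b in E(m)]) == m:
            success[m] += 1
print({m: s / trials for m, s in success.items()})
# Roughly 0.99 per message, but the rate is only log2(2)/5 = 0.2 bits per channel use.
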

The following definition is central for information theory and goes back to Shannon:
Definition 4.4 (Capacity of a channel). The capacity C(N ) of a communication channel
N is the supremum of the achievable rates for transmitting information over it.
Is it possible to compute the capacity, and does it fully characterize the achievable rates
for communication? Yes, both questions were again answered by Claude Shannon. Shan-
non’s channel coding theorem gives a formula for the capacity of a communication channel in
terms of the joint probability distributions obtained from “sending” a probability distribution
through the channel. We need to introduce another entropic quantity:
Definition 4.5 (Mutual information). The mutual information of a joint probability distri-
bution pAB ∈ P (ΣA × ΣB ) is given by

I(A : B)pAB = H(pA ) + H(pB ) − H(pAB ).

The mutual information is never negative (Homework), and it quantifies how close the
joint distribution is to the product distribution of its marginals. You can check that I(A :
B)pAB = 0 if pAB = pA × pB. Consider a communication channel N : ΣA → P(ΣB), and write
N(y|x) for the probability of obtaining the symbol y ∈ ΣB at the output of the channel
after the symbol x ∈ ΣA has been sent. Note that ∑_{y∈ΣB} N(y|x) = 1 for any x ∈ ΣA. Given
a probability distribution pA ∈ P(ΣA) we can now define a joint probability distribution
p^N_AB ∈ P(ΣA × ΣB) by setting

p^N_AB(x, y) = pA(x) N(y|x).     (4)

This joint probability distribution describes the joint probability of inputs and outputs for
the communication channel N, and it is easy to verify that pA is a marginal of p^N_AB. Finally,
we can state the following:
Theorem 4.6 (Shannon’s channel coding theorem). For alphabets ΣA and ΣB let N : ΣA →
P(ΣB ) denote a communication channel. The capacity of N is given by

C(N ) = sup I(A : B)pN ,


AB
pA ∈P(ΣA )

and a rate R is achievable if and only if

R < C(N ).
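As a numerical illustration of Theorem 4.6 (not a proof), the sketch below evaluates I(A : B) for the joint distribution (4) of a binary symmetric channel over a grid of input distributions; the flip probability and the grid are illustrative choices. The maximum is attained at the uniform input and reproduces the well-known value 1 − H(q):

from math import log2

def H(dist):
    """Shannon entropy (in bits) of a probability distribution given as an iterable."""
    return -sum(t * log2(t) for t in dist if t > 0)

# Binary symmetric channel: N(y|x) flips the bit with probability q (illustrative).
q = 0.1
N = {0: {0: 1 - q, 1: q}, 1: {0: q, 1: 1 - q}}

def mutual_information(pA0):
    """I(A:B) = H(p_A) + H(p_B) - H(p_AB) for the joint distribution (4)."""
    pA = {0: pA0, 1: 1 - pA0}
    pAB = {(x, y): pA[x] * N[x][y] for x in (0, 1) for y in (0, 1)}
    pB = {y: pAB[(0, y)] + pAB[(1, y)] for y in (0, 1)}
    return H(pA.values()) + H(pB.values()) - H(pAB.values())

# Coarse grid search over input distributions p_A.
best = max(mutual_information(t / 1000) for t in range(1001))
print(best)                    # about 0.531
print(1 - H([q, 1 - q]))       # capacity of the BSC: 1 - H(q), also about 0.531
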

We will postpone the proof of this theorem until later, but we still want to point out
one remarkable feature about its proof: It is non-constructive! Shannon’s proof shows that
generating coding schemes at random will almost always achieve rates very close to the
capacity. It turned out to be very difficult to construct specific codes with rates close to
the capacity for general communication channels. The channel coding theorem was proved
in 1948, and it took until 1992 before a family of codes (called turbo codes) was invented
achieving rates close to capacity. Then, it took until 2006 before a family of codes (called
polar codes) was invented that provably achieves rates arbitrarily close to capacity.

5 What will be the topic of the course?


Information theory really took off after Shannon’s paper “A mathematical theory of com-
munication” in 1948 containing all the results we have seen so far. The general theory was
intended to describe how any physical system processes information. However, in the 1920s
another fundamental theory took off: Quantum mechanics. From experiments with atomic
and subatomic particles it became clear that classical mechanics and statistical physics do
not describe nature on its smallest scales. It took until the 1960s and 1970s before physicists
and some mathematicians realised that classical information theory itself does not describe
how quantum mechanical systems process information. They started an effort to general-
ize information theory into what we call “quantum information theory” today. This course
starts from the fundamentals of quantum mechanics and will give a thorough introduction to
modern quantum information theory. In particular, we will answer the following questions:

• What are the fundamental limitations of quantum communication? Why can’t we


clone quantum states? Why can’t we communicate faster than light by exploiting
entanglement?

• What is the maximum amount of information that can be stored in a quantum system?

• How can quantum states be compressed?

• How can classical information and quantum information be transmitted over quantum
channels?

• How is it possible to transmit information through two channels each having zero
capacity?
