Lecture 1
In the first lecture, we will briefly repeat some basic results in discrete probability theory
and introduce two key problems in classical information theory:
• Source coding: The compression of a discrete memoryless information source.
• Channel coding: The reliable transmission of information over a noisy communication
channel at rates as high as possible.
We will end with a perspective on what we will study in the rest of the course.
1 Discrete probability theory
Throughout, Σ denotes an alphabet, i.e., a set of symbols such as {0, 1}, the Latin alphabet,
or some other set. Let us define the two basic concepts of probability theory:
• A probability distribution on Σ is a function p : Σ → [0, 1] such that ∑_{x∈Σ} p(x) = 1.
We denote the set of probability distributions on Σ by P(Σ).
• A (discrete) random variable X is given by a pair (Σ, p) of an alphabet Σ, not neces-
sarily finite, and a probability distribution p ∈ P (Σ). We say that X is Σ-valued and
distributed according to p, and we write X ∼ p.
If a random variable X is Σ-valued and distributed according to p ∈ P(Σ), then we interpret
p(x) for x ∈ Σ as the probability that X takes the value x, and we will write
P (X = x) = p(x).
We will sometimes simplify our language slightly by not specifying the smallest alphabet Σ
of values that a random variable X can take. For example, we will call a random variable
R-valued even if it only takes values in a discrete subset of R and not all values in R.
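For concreteness, here is a minimal Python sketch (the alphabet and probabilities are arbitrary example choices) that represents a distribution p ∈ P(Σ) as a dictionary and samples a Σ-valued random variable X ∼ p:

```python
import random

# A probability distribution p on the alphabet Sigma = {"a", "b", "c"},
# represented as a dictionary mapping symbols to probabilities.
p = {"a": 0.5, "b": 0.3, "c": 0.2}

# Check the normalization condition sum_x p(x) = 1 (up to floating-point error).
assert abs(sum(p.values()) - 1.0) < 1e-12

# Sample a Sigma-valued random variable X ~ p.
def sample(p, rng=random):
    return rng.choices(list(p.keys()), weights=list(p.values()), k=1)[0]

print(sample(p))  # prints a random symbol, e.g. "a"
```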
Definition 1.1 (Joint and marginal probability distributions). Consider alphabets ΣA and
ΣB and the product alphabet ΣA × ΣB . For every distribution pAB ∈ P (ΣA × ΣB ), we can
define the marginal distributions pA ∈ P(ΣA ) and pB ∈ P(ΣB ) by
pA(x) = ∑_{y'∈ΣB} pAB(x, y')   and   pB(y) = ∑_{x'∈ΣA} pAB(x', y),
for any x ∈ ΣA and any y ∈ ΣB . The distributions in P (ΣA × ΣB ) are also called joint
distributions. All of these definitions generalize to more than two alphabets.
It is common to write (X, Y ) for a random variable with values in ΣA × ΣB distributed
according to some probability distribution pAB ∈ P(ΣA × ΣB). We will refer to (X, Y ) as a
pair of random variables with values in ΣA × ΣB and joint distribution pAB . This notation
suggests that X ∼ pA and Y ∼ pB can somehow be considered as independent entities,
but in general this is not so. In general, the two marginals pA and pB do not describe the
entire probability distribution pAB , and we refer to aspects not captured by the marginal
distributions as correlations. There is one case in which the two marginals do describe the
entire joint distribution:
Definition 1.2 (Independence). We say that a pair of random variables (X, Y ) with val-
ues in ΣA × ΣB distributed according to some probability distribution pAB ∈ P(ΣA × ΣB) is
independent if the probability distribution pAB factorizes as
pAB(x, y) = pA(x) pB(y),   for every x ∈ ΣA and y ∈ ΣB.
In this case, the marginals pA and pB determine the entire joint distribution pAB and we
write pAB = pA × pB .
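The following small Python sketch (with an arbitrarily chosen joint distribution) computes the marginals of Definition 1.1 and checks the factorization condition of Definition 1.2:

```python
from itertools import product

# A joint distribution p_AB on Sigma_A x Sigma_B, stored as a dict over pairs.
# The numbers are an arbitrary example (this one is *not* a product distribution).
p_AB = {("0", "x"): 0.4, ("0", "y"): 0.1, ("1", "x"): 0.1, ("1", "y"): 0.4}

sigma_A = sorted({x for (x, _) in p_AB})
sigma_B = sorted({y for (_, y) in p_AB})

# Marginals as in Definition 1.1: p_A(x) = sum_y p_AB(x, y), p_B(y) = sum_x p_AB(x, y).
p_A = {x: sum(p_AB[(x, y)] for y in sigma_B) for x in sigma_A}
p_B = {y: sum(p_AB[(x, y)] for x in sigma_A) for y in sigma_B}

# Independence as in Definition 1.2: does p_AB(x, y) = p_A(x) * p_B(y) hold everywhere?
independent = all(
    abs(p_AB[(x, y)] - p_A[x] * p_B[y]) < 1e-12 for x, y in product(sigma_A, sigma_B)
)
print(p_A, p_B, independent)  # here both marginals are uniform, but p_AB is not their product
```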
The definition of independence generalizes to general N-tuples of random variables. We
will often consider sequences (Xn)n∈N of random variables with values in Σ that are
independently and identically distributed according to some distribution p ∈ P(Σ). By this
we mean that for each N ∈ N the tuple (X1, . . . , XN) consists of independent random
variables with joint distribution p×N, i.e., such that
P(X1 = x1, . . . , XN = xN) = p(x1) · · · p(xN),   for all x1, . . . , xN ∈ Σ.
Probability theorists developed their own notation to deal with non-independent pairs
(or tuples) of random variables. Given a pair (X, Y ) of random variables with values in
ΣA × ΣB distributed according to some probability distribution pAB ∈ P(ΣA × ΣB), we write
P((X, Y ) ∈ S) = ∑_{(x,y)∈S} pAB(x, y)
for some set S ⊆ ΣA × ΣB, and if Σ = ΣA = ΣB has some additional structure we may even
do some arithmetic such as
P (X + Y = 3) = P (X + Y ∈ {(x, y) ∈ Σ × Σ : x + y = 3}) .
If we just mention one of the random variables, then we will always mean the marginal
distributions, i.e., we have
P(X ∈ SA) = ∑_{x∈SA} pA(x)   and   P(Y ∈ SB) = ∑_{y∈SB} pB(y),
for subsets SA ⊆ ΣA and SB ⊆ ΣB. Similarly, we define the conditional probability
P(X ∈ SA | Y ∈ SB) = ( ∑_{(x,y)∈SA×SB} pAB(x, y) ) / ( ∑_{y'∈SB} pB(y') ),
if ∑_{y'∈SB} pB(y') ≠ 0. As before, we may abbreviate the
corresponding probability distribution as p(·|Y ∈ SB) ∈ P(ΣA), which is called the proba-
bility distribution of X conditioned on Y ∈ SB. These probabilities satisfy a few properties
which can be derived directly from the definitions:
Theorem 1.3 (Properties of joint and conditional probabilities). Consider a pair of random
variables (X, Y ) with values in ΣA ×ΣB distributed according to some probability distribution
pAB ∈ P(ΣA × ΣB). We have:
• Product rule:
P((X, Y ) ∈ SA × SB) = P(Y ∈ SB) P(X ∈ SA | Y ∈ SB).
• Sum rule:
P(X ∈ SA) = ∑_{y∈ΣB} P(Y = y) P(X ∈ SA | Y = y).
• Bayes’ theorem:
P(Y ∈ SB | X ∈ SA) = P(Y ∈ SB) P(X ∈ SA | Y ∈ SB) / P(X ∈ SA).
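These rules are easy to check numerically. The following sketch (with hard-coded example values for pAB and its marginals) verifies the sum rule and Bayes’ theorem on a small joint distribution:

```python
# Numerically check the sum rule and Bayes' theorem for a small joint distribution.
# The joint distribution and its marginals below are arbitrary illustrative choices.
p_AB = {("0", "x"): 0.4, ("0", "y"): 0.1, ("1", "x"): 0.2, ("1", "y"): 0.3}
p_A = {"0": 0.5, "1": 0.5}
p_B = {"x": 0.6, "y": 0.4}

def cond_A_given_B(x, y):  # P(X = x | Y = y) = p_AB(x, y) / p_B(y)
    return p_AB[(x, y)] / p_B[y]

def cond_B_given_A(y, x):  # P(Y = y | X = x) = p_AB(x, y) / p_A(x)
    return p_AB[(x, y)] / p_A[x]

# Sum rule: P(X = "0") = sum_y P(Y = y) P(X = "0" | Y = y).
sum_rule = sum(p_B[y] * cond_A_given_B("0", y) for y in p_B)
assert abs(sum_rule - p_A["0"]) < 1e-12

# Bayes' theorem: P(Y = "x" | X = "0") = P(Y = "x") P(X = "0" | Y = "x") / P(X = "0").
bayes_rhs = p_B["x"] * cond_A_given_B("0", "x") / p_A["0"]
assert abs(bayes_rhs - cond_B_given_A("x", "0")) < 1e-12
print("sum rule and Bayes' theorem hold on this example")
```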
When talking about random variables attaining values in some countable subset Σ ⊂ R,
it will be useful to define the following two functions:
• Expected value: E[X] = ∑_{x∈Σ} p(x) x.
• Variance: Var[X] = E[(X − E[X])²] = ∑_{x∈Σ} p(x) (x − E[X])².
In general, neither the expected value nor the variance has to be finite. However, we will
often restrict to the case where the probability distributions have finite support, i.e., p(x) ≠ 0
only for a finite number of x ∈ Σ, and in this case the expected value and the variance are
finite.
The following lemma can be verified easily:
Lemma 1.4. Let ΣA , ΣB ⊂ R be countable subsets and (X, Y ) a pair of random variables
with values in ΣA × ΣB . Then, we have
E [X + Y ] = E [X] + E [Y ] .
If the random variables (X, Y ) are independent, then
E [XY ] = E [X] E [Y ] .
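The following Python sketch (with arbitrarily chosen example distributions) computes expected values and variances and checks the two identities of Lemma 1.4 for an independent pair:

```python
from itertools import product

# Expected value and variance of a finitely supported distribution on a subset of R.
def expectation(p):
    return sum(prob * x for x, prob in p.items())

def variance(p):
    mu = expectation(p)
    return sum(prob * (x - mu) ** 2 for x, prob in p.items())

# Two independent random variables X ~ p and Y ~ q (arbitrary example values).
p = {0: 0.5, 1: 0.5}          # a fair coin
q = {1: 0.2, 2: 0.3, 6: 0.5}  # some other distribution

# Joint distribution of the independent pair (X, Y): the product p x q.
joint = {(x, y): p[x] * q[y] for x, y in product(p, q)}

# Check E[X + Y] = E[X] + E[Y] and, for independent X, Y, E[XY] = E[X]E[Y].
E_sum = sum(prob * (x + y) for (x, y), prob in joint.items())
E_prod = sum(prob * (x * y) for (x, y), prob in joint.items())
assert abs(E_sum - (expectation(p) + expectation(q))) < 1e-12
assert abs(E_prod - expectation(p) * expectation(q)) < 1e-12
print(expectation(p), variance(p), E_sum, E_prod)
```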
We will sometimes need the following elementary inequalities of probability theory:
Theorem 1.5 (Markov’s and Chebychev’s inequalities). Let Z denote a random variable
with values in a countable subset Σ ⊂ R and distributed according to p ∈ P (Σ).
1. (Markov’s inequality) For every ε > 0, we have
P(|Z| > ε) ≤ E[|Z|] / ε.
2. (Chebychev’s inequality) For every ε > 0, we have
P((Z − E[Z])² > ε) ≤ Var[Z] / ε.
Proof. Both inequalities are trivially satisfied if E[|Z|] = ∞ or Var[Z] = ∞. For the first
inequality note that
P(|Z| > ε) = ∑_{z : |z|>ε} p(z) ≤ ∑_{z : |z|>ε} p(z) |z|/ε ≤ E[|Z|] / ε,
where we multiplied each summand by a number |z|/ε ≥ 1, and in the last inequality we
added more non-negative terms. For the second inequality, we just insert the positive-valued
random variable (Z − E [Z])2 into the first inequality.
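As a quick sanity check, the following sketch (with an arbitrarily chosen non-negative random variable) verifies Markov’s inequality for several values of ε:

```python
# Numerically check Markov's inequality P(|Z| > eps) <= E[|Z|] / eps
# for a small finitely supported distribution (arbitrary example values).
p = {0: 0.6, 1: 0.25, 5: 0.1, 20: 0.05}  # distribution of a non-negative Z

E_abs = sum(prob * abs(z) for z, prob in p.items())

for eps in [0.5, 1.0, 3.0, 10.0]:
    lhs = sum(prob for z, prob in p.items() if abs(z) > eps)
    rhs = E_abs / eps
    print(f"eps={eps:5.1f}:  P(|Z| > eps) = {lhs:.3f}  <=  E[|Z|]/eps = {rhs:.3f}")
    assert lhs <= rhs + 1e-12
```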
Theorem 1.6 (Weak law of large numbers). Let Σ ⊂ R be an alphabet. Consider a random
variable Y with values in Σ and distributed according to p ∈ P (Σ) with expected value
µ = E [Y ] and Var [Y ] < ∞. If (Yn )n∈N are independent random variables identically
distributed to Y , then
lim_{n→∞} P( |(Y1 + Y2 + · · · + Yn)/n − µ| > ε ) = 0,
for every ε > 0.
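The weak law of large numbers is easy to observe in simulation. The following Python sketch (with an arbitrarily chosen distribution, ε and trial count) estimates the probability that the sample mean deviates from µ by more than ε for increasing n:

```python
import random

# Empirical illustration of the weak law of large numbers: for i.i.d. samples from a
# distribution p (arbitrary example values), the probability that the sample mean
# deviates from mu = E[Y] by more than eps shrinks as n grows.
p = {0: 0.2, 1: 0.5, 4: 0.3}
mu = sum(prob * y for y, prob in p.items())  # = 1.7
eps = 0.2
values, weights = list(p.keys()), list(p.values())

rng = random.Random(0)
for n in [10, 100, 1000, 10000]:
    trials = 500
    deviations = 0
    for _ in range(trials):
        sample_mean = sum(rng.choices(values, weights=weights, k=n)) / n
        if abs(sample_mean - mu) > eps:
            deviations += 1
    print(f"n={n:6d}:  estimated P(|mean - mu| > {eps}) = {deviations / trials:.3f}")
```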
2 What is information?
Consider the following strings. Which of these contain a high amount of information (intu-
itively speaking)? Can you rationalize your intuition?
1. 0000000000000000000000000000000000000000000000000000000000000000000
2. 0101010101010101010101010101010101010101010101010101010101010101010
3. 3141592653589793238462643383279502884197169399375105820974944592307
4. 1621683129157654671616347956219518784030306919262080790346992725831
5. “By virtue of its innermost intention, and like all questions about language, struc-
turalism escapes the classical history of ideas which already supposes structuralism’s
possibility, for the latter naively belongs to the province of language and propounds
itself within it. Nevertheless, by virtue of an irreducible region of irreflection and spon-
taneity within it, by virtue of the essential shadow of the undeclared, the structuralist
phenomenon will deserve examination by the historian of ideas. For better or for
worse. Everything within this phenomenon that does not in itself transparently belong
to the question of the sign will merit this scrutiny; as will everything within it that is
methodologically effective, thereby possessing the kind of infallibility now ascribed to
sleepwalkers and formerly attributed to instinct, which was said to be as certain as it
was blind.”1
6. “Preheat oven to 220 degrees C. Melt the butter in a saucepan. Stir in flour to form
a paste. Add water, white sugar and brown sugar, and bring to a boil. Reduce
temperature and let simmer. Place the bottom crust in your pan. Fill with apples,
mounded slightly. Cover with a lattice work crust. Gently pour the sugar and butter
liquid over the crust. Pour slowly so that it does not run off. Bake 15 minutes in the
preheated oven. Reduce the temperature to 175 degrees C. Continue baking for 35 to
45 minutes, until apples are soft.”2
7. “The identity 27^5 + 84^5 + 110^5 + 133^5 = 144^5 was found by Lander and Parkin
as the smallest instance in which four fifth powers sum to a fifth power. This is a
counterexample to a conjecture by Euler that at least n nth powers are required to
sum to an nth power, n > 2.”3
As you may have noticed, there are different possible notions of “information”.
Maybe, you had the idea of defining “information” in terms of compressibility such that a
string of symbols has a large amount of “information” if it cannot be compressed too much,
and it has a low amount of “information” if it can be compressed a lot. This is indeed the
intuition behind the notion of “information” that we are going to define. However, to make
it precise, we need to think about what it means to compress a bit string. One possible
notion based on a model of computation would be as follows: the Kolmogorov complexity of
a string is the length of the shortest program that outputs the string when run on a fixed
universal computer (e.g., a universal Turing machine).
While this notion of complexity is very elegant and has a very general scope, it has a
serious disadvantage: The Kolmogorov complexity is uncomputable, i.e., there cannot exist
an algorithm to compute it on any kind of computer. We will use a different definition first
introduced by the mathematician Claude Shannon. To introduce this definition, we have to
(as often in applied mathematics) first properly define the problem.
1 Jacques Derrida, Writing and Difference, “Force and Signification”.
2 https://www.allrecipes.com/recipe/12682/apple-pie-by-grandma-ople/
3 Lander, L. J., Parkin, T. R. (1966). Bulletin of the American Mathematical Society, 72(6), 1079.
3 Classical source coding
The first idea is to not focus on the actual content of the string, but rather consider the
statistical properties of the symbols appearing in it. To make this precise, we define what
we mean by an information source:
Definition 3.1 (Discrete memoryless information sources). Let Σ denote a finite alphabet.
A discrete memoryless source (DMS) on Σ is a sequence of random variables (Xn)n∈N that
are independently and identically distributed and take values in Σ.
Definition 3.2 (Compression scheme). For any δ > 0 and any n, m ∈ N an (n, m, δ)-
compression scheme for a discrete memoryless source (Xn )n∈N with distribution p ∈ P(Σ)
on the alphabet Σ is a pair of functions
E : Σn → {0, 1}^m   and   D : {0, 1}^m → Σn,
such that
P((X1, . . . , Xn) ∈ S) ≥ 1 − δ,
where
S = {(x1, . . . , xn) ∈ Σn : (D ◦ E)(x1, . . . , xn) = (x1, . . . , xn)}
denotes the set where the compression succeeds.
A good compression scheme will have two properties: The success probability will be
high, and it compresses a string into few bits, i.e., the ratio m/n is low. It is intuitively
clear that there should be a tradeoff between the success probability and the compression
rate: If we do not compress at all, then the success probability can be 1, but if we want
to compress n > 1 symbols into m = 1 bit, then the success probability will (usually) be
small. Shannon’s next insight was to consider compression schemes in the asymptotic limit
n → ∞. To make this precise, we will define asymptotically achievable rates:
Definition 3.3 (Achievable rates for compression). A number R ∈ R+ is called an achievable
rate for compression of the discrete memoryless source (Xn)n∈N with distribution p ∈ P(Σ),
if for every n ∈ N there exists an (n, mn, δn)-compression scheme such that
R = lim_{n→∞} mn/n   and   lim_{n→∞} δn = 0.
It would be cool if we could find the optimal achievable rate. This is what Shannon did:
Theorem 3.4 (Shannon’s source coding theorem). Let (Xn)n∈N denote a discrete memo-
ryless source on the alphabet Σ with distribution p ∈ P(Σ). The Shannon entropy is given by
H(p) = −∑_{x∈Σ} p(x) log(p(x)).   (1)
1. Any rate R > H(p) is achievable for compression of the discrete memoryless source
(Xn)n∈N.
2. No rate R < H(p) is achievable for compression of the discrete memoryless source
(Xn)n∈N.
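The Shannon entropy (1) is straightforward to evaluate numerically. The following minimal Python sketch (with arbitrarily chosen example distributions) computes H(p) in bits; note how a heavily biased bit has entropy well below one bit, which is exactly what makes compression below one bit per symbol possible:

```python
from math import log2

# Shannon entropy H(p) = -sum_x p(x) log2 p(x) of a distribution, as in (1).
def shannon_entropy(p):
    return -sum(prob * log2(prob) for prob in p.values() if prob > 0)

# Example distributions (arbitrary illustrative choices):
uniform_bit = {"0": 0.5, "1": 0.5}                            # H = 1 bit
biased_bit = {"0": 0.9, "1": 0.1}                             # H ~ 0.469 bits
four_symbols = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}   # H = 2 bits

for name, p in [("uniform bit", uniform_bit),
                ("biased bit", biased_bit),
                ("uniform on 4 symbols", four_symbols)]:
    print(f"{name}: H(p) = {shannon_entropy(p):.3f} bits")
```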
Before proving this theorem, let us build some intuition. For a string (x1, . . . , xn) emitted
by the source, the law of large numbers suggests that
−(1/n) log(p(x1) · · · p(xn)) = (1/n) ∑_{i=1}^n −log(p(xi)) ≈ E[−log(p(X1))] = H(p),
so that p(x1) · · · p(xn) ≈ 2^{−nH(p)}, and magically the Shannon entropy appears in the
exponent. Motivated by this intuition, we state the following definition:
Definition 3.5 (Typical strings). Let Σ be an alphabet and p ∈ P(Σ) a probability distribu-
tion. For n ∈ N and ε > 0 a string (x1, . . . , xn) ∈ Σn is called ε-typical for the distribution
p if
2^{−n(H(p)+ε)} < p(x1) · · · p(xn) < 2^{−n(H(p)−ε)}.
We denote the set of these strings by Tn,ε(p).
How many typical strings are there, and how likely is it that a string obtained from a
discrete memoryless source is typical? To answer these questions we will use the weak law
of large numbers from probability theory:
Lemma 3.6 (Properties of typical strings). Let Σ be an alphabet and p ∈ P (Σ) a probability
distribution. We have:
1. For any n ∈ N and ε > 0 we have
|Tn,ε(p)| ≤ 2^{n(H(p)+ε)}.
2. For any ε > 0 we have
lim_{n→∞} P((X1, . . . , Xn) ∈ Tn,ε(p)) = 1,
where (Xn)n∈N denotes a discrete memoryless source with distribution p.
Proof. For the first statement, note that
1 ≥ ∑_{(x1,...,xn)∈Tn,ε(p)} p(x1) · · · p(xn) > |Tn,ε(p)| · 2^{−n(H(p)+ε)},
since every ε-typical string has probability larger than 2^{−n(H(p)+ε)}. Rearranging gives the
bound from the first statement.
For the second statement, consider the random variables Zi = f(Xi) with f(x) = −log(p(x)),
which are independent and identically distributed with expected value
µ = E[Z1] = ∑_{x∈Σ} p(x) f(x) = H(p),
and with finite variance since Σ is finite. By the weak law of large numbers (Theorem 1.6), we have
lim_{n→∞} P( |(1/n) ∑_{i=1}^n −log(p(Xi)) − H(p)| < ε ) = 1.   (2)
The event appearing in (2) says precisely that
n(H(p) − ε) < −log(p(x1) · · · p(xn)) < n(H(p) + ε).
After applying the exponential function, this is equivalent to (x1, . . . , xn) ∈ Tn,ε(p),
and we can rewrite (2) into
lim_{n→∞} P((X1, . . . , Xn) ∈ Tn,ε(p)) = 1.
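To get a feeling for typical sets, the following brute-force Python sketch (with arbitrarily chosen p, n and ε; it is only feasible for tiny alphabets and small n) enumerates Tn,ε(p) and compares its size and probability with the bounds of Lemma 3.6. For small n the probability of the typical set is still far from 1; the lemma is an asymptotic statement.

```python
from itertools import product
from math import log2

# Enumerate the epsilon-typical set T_{n,eps}(p) by brute force for small n
# (a sketch with arbitrary example parameters).
p = {"0": 0.9, "1": 0.1}
H = -sum(q * log2(q) for q in p.values())
n, eps = 10, 0.1

typical = []
total_prob = 0.0
for s in product(p.keys(), repeat=n):
    prob = 1.0
    for x in s:
        prob *= p[x]
    if 2 ** (-n * (H + eps)) < prob < 2 ** (-n * (H - eps)):
        typical.append(s)
        total_prob += prob

print(f"H(p) = {H:.3f}")
print(f"|T_(n,eps)(p)| = {len(typical)} out of {len(p) ** n} strings")
print(f"bound 2^(n(H+eps)) = {2 ** (n * (H + eps)):.1f}")
print(f"P(typical) = {total_prob:.3f}")
```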
Let us emphasize again the intuition behind Lemma 3.6: There are 2^{n log(|Σ|)} many strings
of length n in Σn, but at most 2^{n(H(p)+ε)} of them are ε-typical for the distribution p. For
large n and if H(p) < log(|Σ|), these are very few strings compared to the total number. Still,
when receiving strings of length n from the information source, we will essentially only
get typical strings for large n. Let us exploit this fact to construct a compression scheme:
Proof of Theorem 3.4.
Direct part. For ε > 0 and any n ∈ N we will construct an (n, ⌈n(H(p)+ε)⌉, δn)-compression
scheme for the discrete memoryless source (Xn)n∈N over the alphabet Σ distributed according
to p ∈ P(Σ) such that δn → 0 as n → ∞. Since
H(p) + ε = lim_{n→∞} ⌈n(H(p) + ε)⌉ / n,
this shows that H(p) + ε is an achievable rate.
For n ∈ N and ε > 0 we will construct a compression scheme which succeeds on all
typical strings, i.e., we have S = Tn,ε(p) in the terminology of Definition 3.2. First, we set
m = ⌈n(H(p) + ε)⌉ and we choose a distinct bit string b(x1, x2, . . . , xn) ∈ {0, 1}^m for every typical
sequence (x1, . . . , xn) ∈ Tn,ε(p). The first case of Lemma 3.6 shows that there are enough bit
strings of length m to do this. Now, we define an encoding function En : Σn → {0, 1}^m by
En(x1, . . . , xn) = b(x1, . . . , xn)   if (x1, . . . , xn) ∈ Tn,ε(p),
En(x1, . . . , xn) = (0, 0, . . . , 0)   if (x1, . . . , xn) ∉ Tn,ε(p),
and a decoding function Dn : {0, 1}^m → Σn by
Dn(b1, . . . , bm) = (x1, . . . , xn)   if (b1, . . . , bm) = b(x1, x2, . . . , xn) for some (x1, . . . , xn) ∈ Tn,ε(p),
Dn(b1, . . . , bm) = (f, f, . . . , f)   if (b1, . . . , bm) ≠ b(x1, x2, . . . , xn) for every (x1, . . . , xn) ∈ Tn,ε(p),
for some symbol f ∈ Σ corresponding to a failure. From this construction it follows that
(Dn ◦ En)(x1, . . . , xn) = (x1, . . . , xn),
whenever (x1, . . . , xn) ∈ Tn,ε(p) (and maybe in the additional case where xi = f for all
i ∈ {1, . . . , n}). Therefore, we conclude that the success probability satisfies
P((Dn ◦ En)(X1, . . . , Xn) = (X1, . . . , Xn)) ≥ P((X1, . . . , Xn) ∈ Tn,ε(p)) =: 1 − δn.
By the second case of Lemma 3.6, we see that δn → 0 as n → ∞. This finishes the proof.
Converse part. Consider a sequence of (nk , mk , δk )-compression schemes for the discrete
memoryless source (Xn )n∈N such that limk→∞ nk = ∞ and
lim_{k→∞} mk/nk = R < H(p).
For each k ∈ N let Sk denote the set of strings on which the (nk , mk , δk )-compression scheme
in the sequence succeeds (see Definition 3.2). We have
|Sk| ≤ 2^{mk},
since encoding more than 2^{mk} strings into a set with 2^{mk} elements necessarily leads to a
collision. Furthermore, note that for each k ∈ N we have
Sk ⊆ (Sk ∩ Tnk,ε(p)) ∪ (Σ^{nk} \ Tnk,ε(p)),
for any ε > 0, which implies that
1 − δk ≤ ∑_{(x1,...,xnk)∈Sk} p(x1) · · · p(xnk)
     ≤ ∑_{(x1,...,xnk)∈Sk∩Tnk,ε(p)} p(x1) · · · p(xnk) + P((X1, . . . , Xnk) ∉ Tnk,ε(p))
     ≤ |Sk| · 2^{−nk(H(p)−ε)} + P((X1, . . . , Xnk) ∉ Tnk,ε(p)),
where in the last step we used that every ε-typical string has probability smaller than 2^{−nk(H(p)−ε)}.
Finally, we fix 0 < ε < H(p) − R and note that as k → ∞ we have
|Sk| · 2^{−nk(H(p)−ε)} ≤ 2^{−nk(H(p) − mk/nk − ε)} → 0,
since mk/nk → R and hence H(p) − mk/nk − ε → H(p) − R − ε > 0, and
P((X1, . . . , Xnk) ∉ Tnk,ε(p)) → 0,
by Lemma 3.6. Therefore, the failure probability of the compression scheme satisfies δk → 1
as k → ∞, and no rate R < H(p) is achievable.
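The construction in the direct part of the proof can be carried out literally for very small parameters. The following Python sketch (with arbitrarily chosen p, n and ε, and brute-force enumeration of Σn) builds the labels b(x1, . . . , xn), the encoder En and the decoder Dn, and computes the success probability; as above, the asymptotic guarantees only kick in for large n.

```python
from itertools import product
from math import ceil, log2

# A toy implementation of the typical-set compression scheme from the direct part of
# the proof of Theorem 3.4 (arbitrary example parameters; only feasible for tiny n).
p = {"0": 0.9, "1": 0.1}
n, eps = 12, 0.15
H = -sum(q * log2(q) for q in p.values())
m = ceil(n * (H + eps))  # number of bits used by the encoder

def prob(s):
    out = 1.0
    for x in s:
        out *= p[x]
    return out

# Build the typical set T_{n,eps}(p) and assign each typical string a distinct m-bit label.
typical = [s for s in product(p.keys(), repeat=n)
           if 2 ** (-n * (H + eps)) < prob(s) < 2 ** (-n * (H - eps))]
assert len(typical) <= 2 ** m  # guaranteed by the first part of Lemma 3.6

encode_table = {s: format(i, f"0{m}b") for i, s in enumerate(typical)}
decode_table = {b: s for s, b in encode_table.items()}

def E(s):
    return encode_table.get(s, "0" * m)     # non-typical strings are mapped to 0...0

def D(b):
    return decode_table.get(b, ("f",) * n)  # "f" plays the role of the failure symbol

# Success probability P((D o E)(X_1,...,X_n) = (X_1,...,X_n)).
success = sum(prob(s) for s in product(p.keys(), repeat=n) if D(E(s)) == s)
print(f"m = {m} bits for n = {n} symbols, rate m/n = {m / n:.2f} vs H(p) = {H:.2f}")
print(f"success probability = {success:.3f}")
```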
4 Classical channel coding
The second key problem of information theory is the reliable transmission of information
over a noisy communication channel.
Definition 4.1 (Communication channel). For alphabets ΣA and ΣB, a (discrete memoryless)
communication channel is a map N : ΣA → P(ΣB) assigning to every input symbol x ∈ ΣA a
probability distribution N(x) ∈ P(ΣB) over output symbols.
Definition 4.2 (Coding scheme). For M, n ∈ N and δ > 0, an (n, M, δ)-coding scheme for the
communication channel N : ΣA → P(ΣB) is a pair of functions
E : {1, 2, . . . , M} → ΣA^n   and   D : ΣB^n → {1, 2, . . . , M},
such that
min_{i∈{1,2,...,M}} P(N^{×n} ◦ E(i) ∈ D^{−1}(i)) ≥ 1 − δ,   (3)
where N^{×n} denotes the n-fold product channel, which maps a string (x1, . . . , xn) ∈ ΣA^n
to the product distribution N(x1) × · · · × N(xn) ∈ P(ΣB^n).
The previous definition might be a bit difficult to parse. It should be read as follows:
There are two functions, the encoder E and the decoder D. The encoder E encodes a
message (labelled by 1, . . . , M ) into a string in ΣnA of length n. The symbols E(i)1 , E(i)2 , . . .
making up the string corresponding to message i are then sent successively through the
communication channel, leading to a product of probability distributions
N^{×n} ◦ E(i) = (N(E(i)1), N(E(i)2), . . . , N(E(i)n)),
which we may interpret as a probability distribution on ΣnB . Receiving one possible string in
ΣnB , the receiver applies the decoding map D thereby obtaining a guess for what the message
could be. In (3) the success probability is the probability that the received string, which is
distributed according to N^{×n} ◦ E(i), lies in the preimage D^{−1}(i). Finally, we consider
the minimal probability of success over
all messages to be our figure of merit. Note that we made the implicit assumption that
consecutive applications of the communication channel are independent of each other
leading to the product distribution in (3). This is an idealization, and there are many
information theorists studying non-i.i.d. scenarios for channel coding. However, here we
focus on the simplest case. As in the case for compression, we can define asymptotically
achievable rates:
Definition 4.3 (Achievable rates for channel coding). A number R ∈ R+ is called an
achievable rate for transmitting information over the communication channel N : ΣA →
P(ΣB ) on the alphabets ΣA and ΣB , if for every n ∈ N there exists an (n, Mn , δn ) coding
scheme such that
R = lim_{n→∞} log(Mn)/n   and   lim_{n→∞} δn = 0.
The following definition is central for information theory and goes back to Shannon:
Definition 4.4 (Capacity of a channel). The capacity C(N ) of a communication channel
N is the supremum of the achievable rates for transmitting information over it.
Is it possible to compute the capacity, and does it fully characterize the achievable rates
for communication? Yes, both questions were again answered by Claude Shannon. Shan-
non’s channel coding theorem gives a formula for the capacity of a communication channel in
terms of the joint probability distributions obtained from “sending” a probability distribution
through the channel. We need to introduce another entropic quantity:
Definition 4.5 (Mutual information). The mutual information of a joint probability distri-
bution pAB ∈ P(ΣA × ΣB) with marginals pA ∈ P(ΣA) and pB ∈ P(ΣB) is given by
I(A : B)pAB = H(pA) + H(pB) − H(pAB).
The mutual information is never negative (Homework), and it quantifies how close the
joint distribution is to the product distribution of its marginals. You can check that I(A :
B)pAB = 0 if pAB = pA × pB. Consider a communication channel N : ΣA → P(ΣB), and
write N(y|x) for the probability of obtaining the symbol y ∈ ΣB at the output of the channel
after the symbol x ∈ ΣA has been sent. Note that ∑_{y∈ΣB} N(y|x) = 1 for any x ∈ ΣA. Given
a probability distribution pA ∈ P (ΣA ) we can now define a joint probability distribution
pAB ∈ P (ΣA × ΣB ) by setting
p^N_AB(x, y) = pA(x) N(y|x).   (4)
This joint probability distribution describes the joint probability of inputs and outputs for
the communication channel N, and it is easy to verify that pA is a marginal of p^N_AB. Finally,
we can state the following:
Theorem 4.6 (Shannon’s channel coding theorem). For alphabets ΣA and ΣB let N : ΣA →
P(ΣB) denote a communication channel. The capacity of N is given by
C(N) = max_{pA∈P(ΣA)} I(A : B)_{p^N_AB},
where p^N_AB denotes the joint distribution defined in (4). In particular, a rate R is achievable
for transmitting information over N whenever
R < C(N).
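To illustrate the formula, the following Python sketch estimates the capacity of a binary symmetric channel with flip probability q (an example channel chosen here) by a grid search of the mutual information of p^N_AB over input distributions pA; for this channel the result can be compared with the well-known value 1 − h(q), where h denotes the binary entropy.

```python
from math import log2

def H(p):  # Shannon entropy of a distribution given as a dict
    return -sum(q * log2(q) for q in p.values() if q > 0)

def mutual_information(p_A, N):
    """I(A:B) = H(p_A) + H(p_B) - H(p_AB) for p_AB(x, y) = p_A(x) N(y|x), as in (4)."""
    p_AB = {(x, y): p_A[x] * N[x][y] for x in p_A for y in N[x]}
    p_B = {}
    for (x, y), q in p_AB.items():
        p_B[y] = p_B.get(y, 0.0) + q
    return H(p_A) + H(p_B) - H(p_AB)

# Binary symmetric channel with flip probability q: N(y|x) stored as nested dicts.
q = 0.1
N = {"0": {"0": 1 - q, "1": q}, "1": {"0": q, "1": 1 - q}}

# Estimate the capacity by a grid search over input distributions p_A.
best = max(
    mutual_information({"0": t, "1": 1 - t}, N) for t in [i / 1000 for i in range(1, 1000)]
)
binary_entropy = -(q * log2(q) + (1 - q) * log2(1 - q))
print(f"grid-search capacity estimate = {best:.4f}")
print(f"known BSC capacity 1 - h(q)   = {1 - binary_entropy:.4f}")
```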
We will postpone the proof of this theorem until later, but we still want to point out
one remarkable feature about its proof: It is non-constructive! Shannon’s proof shows that
generating coding schemes at random will almost always achieve rates very close to the
capacity. It turned out to be very difficult to construct specific codes with rates close to
the capacity for general communication channels. The channel coding theorem was proved
in 1948, and it took until 1992 when a family of codes (called turbo codes) was invented
achieving rates close to capacity. Then, it took until 2006 when a family of codes (called
polar codes) was invented that provably achieved rates arbitrarily close to capacity.
In the rest of the course, we will study quantum versions of these questions, for example:
• What is the maximum amount of information that can be stored in a quantum system?
• How can classical information and quantum information be transmitted over quantum
channels?
• How is it possible to transmit information through two channels each having zero
capacity?