Information Theory and Coding by Example PDF
Information Theory and Coding by Example PDF
This fundamental monograph introduces both the probabilistic and the algebraic
aspects of information theory and coding. It has evolved from the authors’ years
of experience teaching at the undergraduate level, including several Cambridge
Mathematical Tripos courses. The book provides relevant background material, a
wide range of worked examples and clear solutions to problems from real exam
papers. It is a valuable teaching aid for undergraduate and graduate students, or for
researchers and engineers who want to grasp the basic principles.
M A R K K E L B E RT
Swansea University, and Universidade de São Paulo
Y U R I S U H OV
University of Cambridge, and Universidade de São Paulo
University Printing House, Cambridge CB2 8BS, United Kingdom
Published in the United States of America by Cambridge University Press, New York
www.cambridge.org
Information on this title: www.cambridge.org/9780521769358
c Cambridge University Press 2013
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2013
Printed in the United Kingdom by CPI Group Ltd. Croydon cr0 4yy
A catalogue record for this publication is available from the British Library
ISBN 978-0-521-76935-8 Hardback
ISBN 978-0-521-13988-5 Paperback
Cambridge University Press has no responsibility for the persistence or
accuracy of URLs for external or third-party internet websites referred to
in this publication, and does not guarantee that any content on such
websites is, or will remain, accurate or appropriate.
Contents
v
vi Contents
Bibliography 501
Index 509
Preface
This book is partially based on the material covered in several Cambridge Math-
ematical Tripos courses: the third-year undergraduate courses Information The-
ory (which existed and evolved over the last four decades under slightly varied
titles) and Coding and Cryptography (a much younger and simplified course avoid-
ing cumbersome technicalities), and a number of more advanced Part III courses
(Part III is a Cambridge equivalent to an MSc in Mathematics). The presentation
revolves, essentially, around the following core concepts: (a) the entropy of a prob-
ability distribution as a measure of ‘uncertainty’ (and the entropy rate of a random
process as a measure of ‘variability’ of its sample trajectories), and (b) coding as a
means to measure and use redundancy in information generated by the process.
Thus, the contents of this book includes a more or less standard package of
information-theoretical material which can be found nowadays in courses taught
across the world, mainly at Computer Science and Electrical Engineering Depart-
ments and sometimes at Probability and/or Statistics Departments. What makes this
book different is, first of all, a wide range of examples (a pattern that we followed
from the onset of the series of textbooks Probability and Statistics by Example
by the present authors, published by Cambridge University Press). Most of these
examples are of a particular level adopted in Cambridge Mathematical Tripos ex-
ams. Therefore, our readers can make their own judgement about what level they
have reached or want to reach.
The second difference between this book and the majority of other books
on information theory or coding theory is that it covers both possible direc-
tions: probabilistic and algebraic. Typically, these lines of inquiry are presented
in different monographs, textbooks and courses, often by people who work in
different departments. It helped that the present authors had a long-time associ-
ation with the Institute for Information Transmission Problems, a section of the
Russian Academy of Sciences, Moscow, where the tradition of embracing a broad
spectrum of problems was strongly encouraged. It suffices to list, among others,
vii
viii Preface
was active in this area more than 40 years ago. [Although on several advanced
topics Shannon, probably, could have thought, re-phrasing Einstein’s words: “Since
mathematicians have invaded the theory of communication, I do not understand it
myself anymore.”]
During the years that passed after Shannon’s inceptions and inventions, math-
ematics changed drastically, and so did electrical engineering, let alone computer
science. Who could have foreseen such a development back in the 1940s and 1950s,
as the great rivalry between Shannon’s information-theoretical and Wiener’s cyber-
netical approaches was emerging? In fact, the latter promised huge (even fantastic)
benefits for the whole of humanity while the former only asserted that a mod-
est goal of correcting transmission errors could be achieved within certain limits.
Wiener’s book [171] captivated the minds of 1950s and 1960s thinkers in practi-
cally all domains of intellectual activity. In particular, cybernetics became a serious
political issue in the Soviet Union and its satellite countries: first it was declared
“a bourgeois anti-scientific theory”, then it was over-enthusiastically embraced. [A
quotation from a 1953 critical review of cybernetics in a leading Soviet ideology
journal Problems of Philosophy reads: “Imperialists are unable to resolve the con-
troversies destroying the capitalist society. They can’t prevent the imminent eco-
nomical crisis. And so they try to find a solution not only in the frenzied arms race
but also in ideological warfare. In their profound despair they resort to the help of
pseudo-sciences that give them some glimmer of hope to prolong their survival.”
The 1954 edition of the Soviet Concise Dictionary of Philosophy printed in hun-
dreds of thousands of copies defined cybernetics as a “reactionary pseudo-science
which appeared in the USA after World War II and later spread across other cap-
italist countries: a kind of modern mechanicism.” However, under pressure from
top Soviet physicists who gained authority after successes of the Soviet nuclear
programme, the same journal, Problems of Philosophy, had to print in 1955 an ar-
ticle proclaiming positive views on cybernetics. The authors of this article included
Alexei Lyapunov and Sergei Sobolev, prominent Soviet mathematicians.]
Curiously, as was discovered in a recent biography on Wiener [35], there exist
“secret [US] government documents that show how the FBI and the CIA pursued
Wiener at the height of the Cold War to thwart his social activism and the growing
influence of cybernetics at home and abroad.” Interesting comparisons can be found
in [65].
However, history went its own way. As Freeman Dyson put it in his review [41]
of [35]: “[Shannon’s theory] was mathematically elegant, clear, and easy to apply
to practical problems of communication. It was far more user-friendly than cyber-
netics. It became the basis of a new discipline called ‘information theory’ . . . [In
modern times] electronic engineers learned information theory, the gospel accord-
ing to Shannon, as part of their basic training, and cybernetics was forgotten.”
x Preface
Not quite forgotten, however: in the former Soviet Union there still exist at
least seven functioning institutes or departments named after cybernetics: two in
Moscow and two in Minsk, and one in each of Tallinn, Tbilisi, Tashkent and Kiev
(the latter being a renowned centre of computer science in the whole of the for-
mer USSR). In the UK there are at least four departments, at the Universities of
Bolton, Bradford, Hull and Reading, not counting various associations and soci-
eties. Across the world, cybernetics-related societies seem to flourish, displaying
an assortment of names, from concise ones such as the Institute of the Method
(Switzerland) or the Cybernetics Academy (Italy) to the Argentinian Associa-
tion of the General Theory of Systems and Cybernetics, Buenos Aires. And we
were delighted to discover the existence of the Cambridge Cybernetics Society
(Belmont, CA, USA). By contrast, information theory figures only in a handful of
institutions’ names. Apparently, the old Shannon vs. Wiener dispute may not be
over yet.
In any case, Wiener’s personal reputation in mathematics remains rock solid:
it suffices to name a few gems such as the Paley–Wiener theorem (created on
Wiener’s numerous visits to Cambridge), the Wiener–Hopf method and, of course,
the Wiener process, particularly close to our hearts, to understand his true role in
scientific research and applications. However, existing recollections of this giant of
science depict an image of a complex and often troubled personality. (The title of
the biography [35] is quite revealing but such views are disputed, e.g., in the review
[107]. In this book we attempt to adopt a more tempered tone from the chapter on
Wiener in [75], pp. 386–391.) On the other hand, available accounts of Shannon’s
life (as well as other fathers of information and coding theory, notably, Richard
Hamming) give a consistent picture of a quiet, intelligent and humorous person.
It is our hope that this fact will not present a hindrance for writing Shannon’s
biographies and that in future we will see as many books on Shannon as we see on
Wiener.
As was said before, the purpose of this book is twofold: to provide a synthetic
introduction both to probabilistic and algebraic aspects of the theory supported by
a significant number of problems and examples, and to discuss a number of topics
rarely presented in most mainstream books. Chapters 1–3 give an introduction into
the basics of information theory and coding with some discussion spilling over to
more modern topics. We concentrate on typical problems and examples [many of
them originated in Cambridge courses] more than on providing a detailed presen-
tation of the theory behind them. Chapter 4 gives a brief introduction into a variety
of topics from information theory. Here the presentation is more concise and some
important results are given without proofs.
Because the large part of the text stemmed from lecture notes and various solu-
tions to class and exam problems, there are inevitable repetitions, multitudes of
Preface xi
1
2 Essentials of Information Theory
where λ (u) = P(U1 = u), u ∈ I, are the initial probabilities and P(u, u ) = P(U j+1 =
u |U j = u), u, u ∈ I, are transition probabilities. A Markov source is called sta-
tionary if P(U j = u) = λ (u), j ≥ 1, i.e. λ = {λ (u), u = 1, . . . , m} is an invariant
row-vector for matrix P = {P(u, v)}: ∑ λ (u)P(u, v) = λ (v), v ∈ I, or, shortly,
u∈I
λP = λ.
(c) A ‘degenerated’ example of a Markov source is where a source emits repeated
symbols. Here,
P(U1 = U2 = · · · = Uk = u) = p(u), u ∈ I,
(1.1.3c)
P(Uk = Uk ) = 0, 1 ≤ k < k ,
is called a (source) sample n-string, or n-word (in short, a string or a word), with
digits from I, and is treated as a ‘message’. Correspondingly, one considers a ran-
dom n-string (a random message)
Strings f (u) that are images, under f , of symbols u ∈ I are called codewords
(in code f ). A code has (constant) length N if the value s (the length of a code-
word) equals N for all codewords. A message u(n) = u1 u2 . . . un is represented as a
concatenation of codewords
Example 1.1.4 A code with three source letters 1, 2, 3 and the binary encoder
alphabet J = {0, 1} given by
Furthermore, under condition (1.1.4) there exists a prefix-free code with codewords
of lengths s1 , . . . , sm .
Proof (I) Sufficiency. Let (1.1.4) hold. Our goal is to construct a prefix-free code
with codewords of lengths s1 , . . . , sm . Rewrite (1.1.4) as
s
∑ nl q−l ≤ 1, (1.1.5)
l=1
1.1 Basic concepts. The Kraft inequality. Huffman’s encoding 5
or
s−1
ns q−s ≤ 1 − ∑ nl q−l ,
l=1
or
ns−1 ≤ qs−1 − n1 qs−2 − · · · − ns−2 q. (1.1.6b)
n1 ≤ q. (1.1.6.s)
Observe that actually either ni+1 = 0 or ni is less than the RHS of the inequality,
for all i = 1, . . . , s − 1 (by definition, ns ≥ 1 so that for i = s − 1 the second possi-
bility occurs). We can perform the following construction. First choose n1 words
of length 1, using distinct symbols from J: this is possible in view of (1.1.6.s).
It leaves (q − n1 ) symbols unused; we can form (q − n1 )q words of length 2 by
appending a symbol to each. Choose n2 codewords from these: we can do so in
view of (1.1.6.s−1). We still have q2 − n1 q − n2 words unused: form n3 codewords,
etc. In the course of the construction, no new word contains a previous codeword
as a prefix. Hence, the code constructed is prefix-free.
where bl is the number of ways r codewords can be put together to form a string of
length l.
6 Essentials of Information Theory
One of the principal aims of the theory is to find the ‘best’ (that is, the shortest)
decipherable (or prefix-free) code. We now adopt a probabilistic point of view and
assume that symbol u ∈ I is emitted by a source with probability p(u):
P(Uk = u) = p(u).
[At this point, there is no need to specify a joint probability of more than one
subsequently emitted symbol.]
We are looking for a decipherable code that minimises the expected word-length:
m
ES = ∑ sP(S = s) = ∑ s(i)p(i).
s≥1 i=1
Theorem 1.1.7 The optimal value for problem (1.1.8) is lower-bounded as fol-
lows:
min ES ≥ hq (p(1), . . . , p(m)), (1.1.9)
where
hq (p(1), . . . , p(m)) = − ∑ p(i) logq p(i). (1.1.10)
i
Proof The algorithm (1.1.8) is an integer-valued optimisation problem. If we drop
the condition that s(1), . . . , s(m) ∈ {1, 2, . . .}, replacing it with a ‘relaxed’ con-
straint s(i) > 0, 1 ≤ i ≤ m, the Lagrange sufficiency theorem could be used. The
Lagrangian reads
L (s(1), . . . , s(m), z; λ ) = ∑ s(i)p(i) + λ (1 − ∑ q−s(i) − z)
i i
∑ p(i)/(−λ ln q) = 1, i.e. − λ ln q = 1.
i
Hence,
s(i) = − logq p(i), 1 ≤ i ≤ m,
is the (unique) optimiser for the relaxed problem, giving the value hq from (1.1.10).
The relaxed problem is solved on a larger set of variables s(i); hence, its minimal
value does not exceed that in the original one.
Remark 1.1.8 The quantity hq defined in (1.1.10) plays a central role in the
whole of information theory. It is called the q-ary entropy of the probability distri-
bution (p(x), x ∈ I) and will emerge in a great number of situations. Here we note
that the dependence on q is captured in the formula
1
hq (p(1), . . . , p(m)) = h2 (p(1), . . . , p(m))
log q
where h2 stands for the binary entropy:
h2 (p(1), . . . , p(m)) = − ∑ p(i) log p(i). (1.1.11)
i
8 Essentials of Information Theory
Worked Example 1.1.9 (a) Give an example of a lossless code with alphabet
Jq which does not satisfy the Kraft inequality. Give an example of a lossless code
with the expected code-length strictly less than hq (X).
(b) Show that the ‘Kraft sum’ ∑ q−s(i) associated with a lossless code may be
i
arbitrarily large (for sufficiently large source alphabet).
Solution (a) Consider the alphabet I = {0, 1, 2} and a lossless code f with f (0) =
0, f (1) = 1, f (2) = 00 and codeword-lengths s(0) = s(1) = 1, s(2) = 2. Obviously,
∑ 2−s(x) = 5/4, violating the Kraft inequality. For a random variable X with p(0) =
x∈I
p(1) = p(2) = 1/3 the expected codeword-length Es(X) = 4/3 < h(X) = log 3 =
1.585.
(b) Assume that the alphabet size m = I = 2(2L − 1) for some positive
integer L. Consider the lossless code assigning to the letters x ∈ I the codewords
0, 1, 00, 01, 10, 11, 000, . . ., with the maximum codeword-length L. The Kraft sum is
where hq = − ∑ p(i) logq p(i) is the q-ary entropy of the source; see (1.1.10).
i
Proof The LHS inequality is established in (1.1.9). For the RHS inequality, let
s(i) be a positive integer such that
The non-strict bound here implies ∑ q−s(i) ≤ ∑ p(i) = 1, i.e. the Kraft inequality.
i i
Hence, there exists a decipherable code with codeword-lengths s(1), . . . , s(m). The
strict bound implies
log p(i)
s(i) < − + 1,
log q
1.1 Basic concepts. The Kraft inequality. Huffman’s encoding 9
and thus
∑ p(i) log p(i)
h
ES < − i
+ ∑ p(i) = + 1.
log q i log q
Then construct a prefix-free code, from the shortest s(i) upwards, ensuring that
the previous codewords are not prefixes. The Kraft inequality guarantees enough
room. The obtained code may not be optimal but has the mean codeword-length
satisfying the same inequalities (1.1.13) as an optimal code.
Repeat steps (i) and (ii) with the reduced alphabet, etc. We obtain a binary tree. For
an example of Huffman’s encoding for m = 7 see Figure 1.1.
The number of branches we must pass through in order to reach a root i of the
tree equals s(i). The tree structure, together with the identification of the roots
as source letters, guarantees that encoding is prefix-free. The optimality of binary
Huffman encoding follows from the following two simple lemmas.
10 Essentials of Information Theory
m= 7
1.0
i pi f(i) si
1 .5 0 1
2 .15 100 3
3 .15 101 3
4 .1 110 3
5 .05 1110 4
6 .025 11110 5
7 .025 11111 5
.5 .15 .15 .1 .05 .025 .025
Figure 1.1
Lemma 1.1.12 Any optimal prefix-free binary code has the codeword-lengths
reverse-ordered versus probabilities:
Proof If not, we can form a new code, by swapping the codewords for i and i .
This shortens the expected codeword-length and preserves the prefix-free property.
Lemma 1.1.13 In any optimal prefix-free binary code there exist, among the
codewords of maximum length, precisely two agreeing in all but the last digit.
Proof If not, then either (i) there exists a single codeword of maximum length,
or (ii) there exist two or more codewords of maximum length, and they all differ
before the last digit. In both cases we can drop the last digit from some word of
maximum length, without affecting the prefix-free property.
Proof The proof proceeds with induction in m. For m = 2, the Huffman code f2H
has f2H (1) = 0, f2H (2) = 1, or vice versa, and is optimal. Assume the Huffman code
H is optimal for I
fm−1 m−1 , whatever the probability distribution. Suppose further that
1.1 Basic concepts. The Kraft inequality. Huffman’s encoding 11
the Huffman code fmH is not optimal for Im for some probability distribution. That
is, there is another prefix-free code, fm∗ , for Im with a shorter expected word-length:
∗
ESm < ESm
H
. (1.1.15)
The probability distribution under consideration may be assumed to obey
p(1) ≥ · · · ≥ p(m).
By Lemmas 1.1.12 and 1.1.13, in both codes we can shuffle codewords so that
the words corresponding to m − 1 and m have maximum length and differ only in
the last digit. This allows us to reduce both codes to Im−1 . Namely, in the Huffman
code fmH we remove the final digit from fmH (m) and fmH (m − 1), ‘glueing’ these
codewords. This leads to Huffman encoding fm−1 H . In f ∗ we do the same, and obtain
m
∗
a new prefix-free code fm−1 .
Observe that in Huffman code fmH the contribution to ESm H from f H (m − 1)
m
and fmH (m) is sH (m)(p(m − 1) + p(m)); after reduction it becomes (sH (m) − 1)
(p(m − 1) + p(m)). That is, ES is reduced by p(m − 1) + p(m). In code fm∗ the sim-
ilar contribution is reduced from s∗ (m)(p(m − 1) + p(m)) to (s∗ (m) − 1)(p(m − 1)
+ p(m)); the difference is again p(m − 1) + p(m). All other contributions to ESm−1
H
and ESm−1∗ are the same as the corresponding contributions to ESm H and ES∗ ,
m
∗ ∗
respectively. Therefore, fm−1 is better than fm−1 : ESm−1 < ESm−1 , which contra-
H H
(a) (b)
Figure 1.2
Solution (a) Two cases are possible: the letter 1 either was, or was not merged with
other letters before two last steps in constructing a Huffman code. In the first case,
s(1) ≥ 2. Otherwise we have symbols 1, b and b , with
p(1) < 1/3, p(1) + p(b) + p(b ) = 1 and hence max[p(b), p(b )] > 1/3.
Then letter 1 is to be merged, at the last but one step, with one of b, b , and hence
s(1) ≥ 2. Indeed, suppose that at least one codeword has length 1, and this code-
word is assigned to letter 1 with p(1) < 1/3. Hence, the top of the Huffman tree is
as in Figure 1.2(a) with 0 ≤ p(b), p(b ) ≤ 1 − p(1) and p(b) + p(b ) = 1 − p(1).
are binary Huffman codes, e.g. for a probability distribution 1/3, 1/3, 1/4, 1/12.
(b) Now let p(1) > 2/5 and assume that letter 1 has a codeword-length s(1) ≥ 2 in
a Huffman code. Thus, letter 1 was merged with other letters before the last step.
That is, at a certain stage, we had symbols 1, b and b say, with
(A) p(b ) ≥ p(1) > 2/5,
(B) p(b ) ≥ p(b),
(C) p(1) + p(b) + p(b ) ≤ 1
(D) p(1), p(b) ≥ 1/2 p(b ).
1.1 Basic concepts. The Kraft inequality. Huffman’s encoding 13
Indeed, if, say, p(b) < 1/2p(b ) then b should be selected instead of p(3) or p(4)
on the previous step when p(b ) was formed. By virtue of (D), p(b) ≥ 1/5 which
makes (A)+(C) impossible.
A piece of the Huffman tree over p(1) is then as in Figure 1.2(b), with p(3) +
p(4) = p(b ) and p(1) + p(b ) + p(b) ≤ 1. Write
p(1) = 2/5 + ε , p(b ) = 2/5 + ε + δ , p(b) = 2/5 + ε + δ − η ,
with ε > 0, δ , η ≥ 0. Then
p(1) + p(b ) + p(b) = 6/5 + 3ε + 2δ − η ≤ 1, and η ≥ 1/5 + 3ε + 2δ .
This yields
p(b) ≤ 1/5 − 2ε − δ < 1/5.
However, since
probability p(b) should be merged with min p(3), p(4) , i.e. diagram (b) is
impossible. Hence, the letter 1 has codeword-length s(1) = 1.
Worked Example 1.1.17 Suppose that letters i1 , . . . , i5 are emitted with probabil-
ities 0.45, 0.25, 0.2, 0.05, 0.05. Compute the expected word-length for Shannon–
Fano and Huffman coding. Illustrate both methods by finding decipherable binary
codings in each case.
Solution Write
P(S∗ ≤ SSF − r) = ∑ p(i).
i∈I : s∗ (i)≤sSF (i)−r
∑ p(i) ≤ ∑ p(i)
i∈I : s∗ (i)≤sSF (i)−r i∈I : s∗ (i)≤− logq p(i)+1−r
= ∑ p(i)
i∈I : s∗ (i)−1+r≤− logq p(i)
= ∑ ∗ (i)+1−r
p(i)
i∈I : p(i)≤q−s
∗ (i)+1−r
≤ ∑ q−s
i∈I
∗
=q 1−r
∑ q−s (i)
i∈I
≤q 1−r
;
the last inequality is due to Kraft.
A common modern practice is not to encode each letter u ∈ I separately, but
to divide a source message into ‘segments’ or ‘blocks’, of a fixed length n, and
encode these as ‘letters’. It obviously increases the nominal number of letters in
the alphabet: the blocks are from the Cartesian product I ×n = I × · · · (n times) × I.
But what matters is the entropy
∑
(n)
hq = − P(U1 = i1 , . . . ,Un = in ) logq P(U1 = i1 , . . . ,Un = in ) (1.1.16)
i1 ,...,in
Example 1.1.19 For a Bernoulli source emitting letter i with probability p(i) (cf.
Example 1.1.2), equation (1.1.16) yields
hq = − ∑ p(i1 ) · · · p(in ) logq p(i1 ) · · · p(in )
(n)
i1 ,...,in
n
=−∑ ∑ p(i1 ) · · · p(in ) logq p(i j ) = nhq , (1.1.18)
j=1 i1 ,...,in
where hq = − ∑ p(i) logq p(i). Here, en ∼ hq . Thus, for n large, the minimal
expected codeword-length per source letter, in a segmented code, eventually at-
tains the lower bound in (1.1.13), and hence does not exceed min ES, the minimal
expected codeword-length for letter-by-letter encodings. This phenomenon is much
more striking in the situation where the subsequent source letters are dependent. In
(n)
many cases hq n hq , i.e. en hq . This is the gist of data compression.
Definition 1.1.20 A source is said to be (reliably) encodable at rate R > 0 if, for
any n, we can find a set An ⊂ I ×n such that
In other words, we can encode messages at rate R with a negligible error for long
source strings.
Definition 1.1.21 The information rate H of a given source is the infimum of the
reliable encoding rates:
0 ≤ H ≤ log m, (1.1.21)
(b) A similar idea was applied to the decimal and binary decomposition of a
given number. For example, take number π . If the information rate for its binary
1.1 Basic concepts. The Kraft inequality. Huffman’s encoding 17
x 1 ln x
1
x
Figure 1.3
We conclude this section with the following simple but fundamental fact.
Theorem 1.1.24 (The Gibbs inequality: cf. PSE II, p. 421) Let {p(i)} and {p (i)}
be two probability distributions (on a finite or countable set I ). Then, for any b > 1,
p (i)
∑ p(i) logb p(i)
≤ 0, i.e. − ∑ p(i) logb p(i) ≤ − ∑ p(i) logb p (i), (1.1.22)
i i i
holds for each x > 0, with equality iff x = 1. Setting I = {i : p(i) > 0}, we have
p (i) p (i) 1 p (i)
∑ p(i) logb = ∑ p(i) logb ≤ ∑ p(i) −1
i p(i) i∈I p(i) ln bi∈I p(i)
1 1
= ∑ p (i) − ∑ p(i) = ∑ p (i) − 1 ≤ 0.
ln b i∈I i∈I ln b i∈I
For equality we need: (a) ∑ p (i) = 1, i.e. p (i) = 0 when p(i) = 0; and (b)
i∈I
p (i)/p(i) = 1 for i ∈ I .
[In view of the adopted equality 0 · log 0 = 0, the sum may be reduced to those xi
for which pX (xi ) > 0.]
Sometimes an alternative view is useful: i(A) represents the amount of informa-
tion needed to specify event A and h(X) gives the expected amount of information
required to specify a random variable X.
Clearly, the entropy h(X) depends on the probability distribution, but not on
the values x1 , . . . , xm : h(X) = h(p1 , . . . , pm ). For m = 2 (a two-point probability
distribution), it is convenient to consider the function η (p)(= η2 (p)) of a single
variable p ∈ [0, 1]:
η (p) = −p log p − (1 − p) log(1 − p). (1.2.2a)
1.2 Entropy: an introduction 19
Entropy for p in [0,1]
1.0
0.8
0.6
h(p,1−p)
0.4
0.2
0.0
Figure 1.4
1.5
1.0
h( p,q,1−p−q)
0.5
0.8
0.6
0.4
q
0.2
0.0
0.0
0.0 0.2 0.4 0.6 0.8
Figure 1.5
The graph of η (p) is plotted in Figure 1.4. Observe that the graph is concave as
d2
2
η (p) = − log e [p(1 − p)] < 0. See Figure 1.4.
dp
The graph of the entropy of a three-point distribution
Here, pX,Y (i, j) is the joint probability P(X = xi ,Y = y j ) and pX|Y (xi |y j ) the con-
ditional probability P(X = xi |Y = y j ). Clearly, (1.2.3) and (1.2.4) imply
h(X|Y ) = h(X,Y ) − h(Y ). (1.2.5)
Note that in general h(X|Y ) = h(Y |X).
For random variables X and Y taking values in the same set I, and such that
pY (x) > 0 for all x ∈ I, the relative entropy h(X||Y ) (also known as the entropy of
X relative to Y or Kullback–Leibler distance D(pX ||pY )) is defined by
pX (x) pY (X)
h(X||Y ) = ∑ pX (x) log = EX − log , (1.2.6)
x pY (x) pX (X)
with pX (x) = P(X = x) and pY (x) = P(Y = x), x ∈ I.
Straightforward properties of entropy are given below.
Theorem 1.2.3
(a) If a random variable X takes at most m values, then
0 ≤ h(X) ≤ log m; (1.2.7)
1.2 Entropy: an introduction 21
the LHS equality occurring iff X takes a single value, and the RHS equality
occurring iff X takes m values with equal probabilities.
(b) The joint entropy obeys
with equality iff X and Y are independent, i.e. P(X = x,Y = y) = P(X = x)
P(Y = y) for all x, y ∈ I .
(c) The relative entropy is always non-negative:
h(X||Y ) ≥ 0, (1.2.9)
Proof Assertion (c) is equivalent to Gibbs’ inequality from Theorem 1.1.24. Next,
(a) follows from (c), with {p(i)} being the distribution of X and p (i) ≡ 1/m,
1 ≤ i ≤ m. Similarly, (b) follows from (c), with i being a pair (i1 , i2 ) of values
of X and Y , p(i) = pX,Y (i1 , i2 ) being the joint distribution of X and Y and p (i) =
pX (i1 )pY (i2 ) representing the product of their marginal distributions. Formally:
We used here the identities ∑ pX,Y (i1 , i2 ) = pX (i1 ), ∑ pX,Y (i1 , i2 ) = pY (i2 ).
i2 i1
(a) Show that the geometric random variable Y with p j = P(Y = j) = (1 − p)p j ,
j = 0, 1, 2, . . ., yields maximum entropy amongst all distributions on Z+ =
{0, 1, 2, . . .} with the same mean.
(b) Let Z be a random variable with values
from a finite
set ∗K and f be a given real
E f (Z) ≤ α . (1.2.10)
22 Essentials of Information Theory
Show that:
(bi) when f ∗ ≥ α ≥ E( f ) then the maximising probability distribution is uni-
form on K , with P(Z = k) = 1/( K), k ∈ K ;
(bii) when f∗ ≤ α < E( f ) and f is not constant then the maximising probabil-
ity distribution has
P(Z = k) = pk = eλ f (k) ∑ eλ f (i) , k ∈ K, (1.2.11)
i
∑ pk f (k) = α . (1.2.12)
k
Moreover, suppose that Z takes countably many values, but f ≥ 0 and for a
given α there exists a λ < 0 such that ∑ eλ f (i) < ∞ and ∑ pk f (k) = α where pk
i k
has form (1.2.11). Then:
(biii) the probability distribution in (1.2.11) still maximises h(Z) under
(1.2.10).
Deduce assertion (a) from (biii).
(c) Prove that hY (X) ≥ 0, with equality iff P(X = x) = P(Y = x) for all x. By
considering Y , a geometric random variable on Z+ with parameter chosen
appropriately, show that if the mean EX = μ < ∞, then
h(X) ≤ (μ + 1) log(μ + 1) − μ log μ , (1.2.13)
with equality iff X is geometric.
Solution (a) By the Gibbs inequality, for all probability distribution (q0 , q1 , . . .)
with mean ∑ iqi ≤ μ ,
i≥0
Eq f = ∑ qk f (k) ≤ α . Next, observe that the mean value (1.2.12) calculated for the
k
probability distribution from (1.2.11) is a non-decreasing function of λ . In fact, the
derivative
2
2 λ f (k) λ (k)
∑ [ f (k)] e ∑ f (k)e f
dα k
= − k 2 = E[ f (Z)] − [E f (Z)]
2 2
dλ ∑e λ f (i)
i ∑ eλ f (i)
i
is positive (it yields the variance of the random variable f (Z)); for a non-constant
f the RHS is actually non-negative. Therefore, for non-constant f (i.e. with
f∗ < E( f ) < f ∗ ), for all α from the interval [ f∗ , f ∗ ] there exists exactly one prob-
ability distribution of form (1.2.11) satisfying (1.2.12), and for f∗ ≤ α < E( f ) the
corresponding λ (α ) is < 0.
Next, we use the fact that the Kullback–Leibler distance D(q||p∗ ) (cf. (1.2.6))
satisfies D(q||p∗ ) = ∑ qk log (qk /p∗k ) ≥ 0 (Gibbs’ inequality) and that ∑ qk f (k) ≤ α
k k
and λ < 0 to obtain that
h(q) = − ∑ qk log qk = −D(q||p∗ ) − ∑ qk log p∗k
k k
≤ −∑ qk log p∗k = − ∑ qk − log ∑ eλ f (i) + λ f (k)
k k i
≤ − ∑ qk − log ∑ eλ f (i) − λ α
k i
= −∑ p∗k − log ∑ eλ f (i) + λ f (k)
k i
= − ∑ p∗k log p∗k = h(p∗ ).
k
For part (biii): the above argument still works for an infinite countable set K
provided that the value λ (α ) determined from (1.2.12) is < 0.
(c) By the Gibbs inequality hY (X) ≥ 0. Next, we use part (b) by taking f (k) = k,
α = μ and λ = ln q. The maximum-entropy distribution can be written as p∗j =
(1 − p)p j , j = 0, 1, 2, . . ., with ∑ kp∗k = μ , or μ = p/(1 − p). The entropy of this
k
distribution equals
h(p∗ ) = − ∑ (1 − p)p j log (1 − p)p j
j
p
=− log p − log(1 − p) = (μ + 1) log(μ + 1) − μ log μ ,
1− p
where μ = p/(1 − p).
24 Essentials of Information Theory
Alternatively:
p(i)
0 ≤ hY (X) = ∑ p(i) log
i (1 − p)pi
= −h(X) − log(1 − p) ∑ p(i) − (log p) ∑ ip(i)
i i
= −h(X) − log(1 − p) − μ log p.
The optimal choice of p is p = μ /(μ + 1). Then
1 μ
h(X) ≤ − log − μ log = (μ + 1) log(μ + 1) − μ log μ .
μ +1 μ +1
The RHS is the entropy h(Y ) of the geometric random variable Y . Equality holds
iff X ∼ Y , i.e. X is geometric.
A simple but instructive corollary of the Gibbs inequality is
Lemma 1.2.5 (The pooling inequalities) For any q1 , q2 ≥ 0, with q1 + q2 > 0,
− (q1 + q2 ) log(q1 + q2 ) ≤ −q1 log q1 − q2 log q2
q1 + q2
≤ −(q1 + q2 ) log ; (1.2.14)
2
the first equality occurs iff q1 q2 = 0 (i.e. either q1 or q2 vanishes), and the second
equality iff q1 = q2 .
Proof Indeed, (1.2.14) is equivalent to
q1 q2
0≤h , ≤ log 2 (= 1).
q1 + q2 q1 + q2
log x
1/ 1
2
Figure 1.6
Solution Part (i) follows from the pooling inequality, and (ii) holds as
h ≥ − ∑ pi log p∗ = − log p∗ .
i
Theorem 1.2.8 (The Fano inequality) Suppose a random variable X takes m > 1
values, and one of them has probability (1 − ε ). Then
h(X) ≤ η (ε ) + ε log(m − 1) (1.2.17)
where η is the function from (1.2.2a).
26 Essentials of Information Theory
Definition 1.2.9 Given random variables X, Y , Z, we say that X and Y are con-
ditionally independent given Z if, for all x and y and for all z with P(Z = z) > 0,
P(X = x,Y = y|Z = z) = P(X = x|Z = z)P(Y = y|Z = z). (1.2.18)
For the conditional entropy we immediately obtain
Theorem 1.2.10 (a) For all random variables X , Y ,
0 ≤ h(X|Y ) ≤ h(X), (1.2.19)
the first equality occurring iff X is a function of Y and the second equality holding
iff X and Y are independent.
(b) For all random variables X , Y , Z ,
h(X|Y, Z) ≤ h(X|Y ) ≤ h(X|φ (Y )), (1.2.20)
the first equality occurring iff X and Z are conditionally independent given Y and
the second equality holding iff X and Z are conditionally independent given φ (Y ).
Proof (a) The LHS bound in (1.2.19) follows from definition (1.2.4) (since
h(X|Y ) is a sum of non-negative terms). The RHS bound follows from repre-
sentation (1.2.5) and bound (1.2.8). The LHS quality in (1.2.19) is equivalent
to the equation h(X,Y ) = h(Y ) or h(X,Y ) = h(φ (X,Y )) with φ (X,Y ) = Y . In
view of Theorem 1.2.6, this occurs iff, with probability 1, the map (X,Y ) → Y
is invertible, i.e. X is a function of Y . The RHS equality in (1.2.19) occurs iff
h(X,Y ) = h(X) + h(Y ), i.e. X and Y are independent.
(b) For the lower bound, use a formula analogous to (1.2.5):
h(X|Y, Z) = h(X, Z|Y ) − h(Z|Y ) (1.2.21)
and an inequality analogous to (1.2.10):
h(X, Z|Y ) ≤ h(X|Y ) + h(Z|Y ), (1.2.22)
1.2 Entropy: an introduction 27
with equality iff X and Z are conditionally independent given Y . For the RHS
bound, use:
(i) a formula that is a particular case of (1.2.21): h(X|Y, φ (Y )) = h(X,Y |φ (Y )) −
h(Y |φ (Y )), together with the remark that h(X|Y, φ (Y )) = h(X|Y );
(ii) an inequality which is a particular case of (1.2.22): h(X,Y |φ (Y )) ≤
h(X|φ (Y )) + h(Y |φ (Y )), with equality iff X and Y are conditionally independent
given φ (Y ).
Theorems 1.2.8 above and 1.2.11 below show how the entropy h(X) and con-
ditional entropy h(X|Y ) are controlled when X is ‘nearly’ a constant (respectively,
‘nearly’ a function of Y ).
Theorem 1.2.11 (The generalised Fano inequality) For a pair of random vari-
ables, X and Y taking values x1 , . . . , xm and y1 , . . . , ym , if
m
∑ P(X = x j ,Y = y j ) = 1 − ε , (1.2.23)
j=1
then
h(X|Y ) ≤ η (ε ) + ε log(m − 1), (1.2.24)
where η (ε ) is defined in (1.2.3).
Proof Denoting ε j = P(X = x j |Y = y j ), we write
∑ pY (y j )ε j = ∑ P(X = x j ,Y = y j ) = ε . (1.2.25)
j j
If the random variable X takes countably many values {x1 , x2 , . . .}, the above
definitions may be repeated, as well as most of the statements; notable exceptions
are the RHS bound in (1.2.7) and inequalities (1.2.17) and (1.2.24).
Many properties of entropy listed so far are extended to the case of random
strings.
Theorem 1.2.12 For a pair of random strings, X(n) = (X1 , . . . , Xn ) and Y(n) =
(Y1 , . . . ,Yn ),
28 Essentials of Information Theory
obeys
n n
h(X(n) ) = ∑ h(Xi |X(i−1) ) ≤ ∑ h(Xi ), (1.2.26)
i=1 i=1
h(X(n) |Y(n) )
=− ∑ P(X(n) = x(n) , Y(n) = y(n) ) log P(X(n) = x(n) |Y(n) = y(n) ),
x(n) ,y(n)
satisfies
n n
h(X(n) |Y(n) ) ≤ ∑ h(Xi |Y(n) ) ≤ ∑ h(Xi |Yi ), (1.2.27)
i=1 i=1
with the LHS equality holding iff X1 , . . . , Xn are conditionally independent, given
Y(n) , and the RHS equality holding iff, for each i = 1, . . . , n, Xi and {Yr : 1 ≤ r ≤
n, r = i} are conditionally independent, given Yi .
Proof The proof repeats the arguments used previously in the scalar case.
the first equality occurring iff X and φ (Y ) are independent, and the second iff X
and Y are conditionally independent, given φ (Y ).
1.2 Entropy: an introduction 29
Observe that
n n
∑ I(Xi : Y(n) ) ≥ ∑ I(Xi : Yi ). (1.2.34)
i=1 i=1
Worked Example 1.2.17 Let X , Z be random variables and Y(n) = (Y1 , . . . ,Yn )
be a random string.
first under the assumption that Y1 , . . . ,Yn are independent random variables,
and then under the assumption that Y1 , . . . ,Yn are conditionally independent
given X .
(c) Prove or disprove by producing a counter-example the inequality
n
I(X : Y(n) ) ≥ ∑ I(X : Y j ), (1.2.36)
j=1
first under the assumption that Y1 , . . . ,Yn are independent random variables,
and then under the assumption that Y1 , . . . ,Yn are conditionally independent
given X .
1.2 Entropy: an introduction 31
Recall that a real function f (y) defined on a convex set V ⊆ Rm is called con-
cave if
f (λ0 y(0) + λ1 y(1) ) ≥ λ0 f (y(0) ) + λ1 f (y(1) )
for any y(0) , y(1) ∈ V and λ0 , λ1 ∈ [0, 1] with λ0 + λ1 = 1. It is called strictly concave
if the equality is attained only when either y(0) = y(1) or λ0 λ1 = 0. We treat h(X) as
a function of variables p = (p1 , . . . , pm ); set V in this case is {y = (y1 , . . . , ym ) ∈ Rm :
yi ≥ 0, 1 ≤ i ≤ m, y1 + · · · + ym = 1}.
Theorem 1.2.18 Entropy is a strictly concave function of the probability distri-
bution.
Proof Let the random variables X (i) have probability distributions p(i) , i = 0, 1,
and assume that the random variable Λ takes values 0 and 1 with probabilities
λ0 and λ1 , respectively, and is independent of X (0) , X (1) . Set X = X (Λ) ; then the
inequality h(λ0 p(0) + λ1 p(1) ) ≥ λ0 h(p(0) ) + λ1 h(p(1) ) is equivalent to
h(X) ≥ h(X|Λ) (1.2.38)
which follows from (1.2.19). If we assume equality in (1.2.38), X and Λ must be
independent. Assume in addition that λ0 > 0 and write, by using independence,
P(X = i, Λ = 0) = P(X = i)P(Λ = 0) = λ0 P(X = i).
(0) (0)
The LHS equals λ0 P(X = i|Λ = 0) = λ0 pi and the RHS equals λ0 λ0 pi +
(1)
λ1 pi . We may cancel λ0 obtaining
(0) (1)
(1 − λ0 )pi = λ1 pi ,
i.e. the probability distributions p(0) and p(1) are proportional. Then either they are
equal or λ1 = 0, λ0 = 1. The assumption λ1 > 0 leads to a similar conclusion.
Worked Example 1.2.19 Show that the quantity
ρ (X,Y ) = h(X|Y ) + h(Y |X)
obeys
ρ (X,Y ) = h(X) + h(Y ) − 2I(X : Y )
= h(X,Y ) − I(X : Y ) = 2h(X,Y ) − h(X) − h(Y ).
Prove that ρ is symmetric, i.e. ρ (X,Y ) = ρ (Y, X) ≥ 0, and satisfies the triangle
inequality, i.e. ρ (X,Y ) + ρ (Y, Z) ≥ ρ (X, Z). Show that ρ (X,Y ) = 0 iff X and Y
are functions of each other. Also show that if X and X are functions of each other
then ρ (X,Y ) = ρ (X ,Y ). Hence, ρ may be considered as a metric on the set of
the random variables X , considered up to equivalence: X ∼ X iff X and X are
functions of each other.
1.2 Entropy: an introduction 33
or
h(X, Z) ≤ h(X,Y ) + h(Y, Z) − h(Y ).
To this end, write h(X, Z) ≤ h(X,Y, Z) and note that h(X,Y, Z) equals
Equality holds iff (i) Y = φ (X, Z) and (ii) X, Z are conditionally independent
given Y .
Remark 1.2.20 The property that ρ (X, Z) = ρ (X,Y )+ ρ (Y, Z) means that ‘point’
Y lies on a ‘line’ through X and Z; in other words, that all three points X, Y , Z lie
on a straight line. Conditional independence of X and Z given Y can be stated
in an alternative (and elegant) way: the triple X → Y → Z satisfies the Markov
property (in short: is Markov). Then suppose we have four random variables X1 ,
X2 , X3 , X4 such that, for all 1 ≤ i1 < i2 < i3 ≤ 4, the random variables Xi1 and
Xi3 are conditionally independent given Xi2 ; this property means that the quadruple
X1 → X2 → X3 → X4 is Markov, or, geometrically, that all four points lie on a
line. The following fact holds: if X1 → X2 → X3 → X4 is Markov then the mutual
entropies satisfy
In fact, for all triples Xi1 , Xi2 , Xi3 as above, in the metric ρ we have that
As we show below, bound (1.2.41) holds for any X and Z (without referring to a
Markov property). Indeed, (1.2.41) is equivalent to
nh(X) − ∑ h(X, Zi ) + h Z ≤ h(X) + h Z − h X, Z
1≤i≤n
or
h X, Z − h(X) ≤ ∑ h(X, Zi ) − nh(X)
1≤i≤n
which in turn is nothing but the inequality h Z|X ≤ ∑ h(Zi |X).
1≤i≤n
m
Worked Example 1.2.22 Write h(p) := − ∑ p j log p j for a probability ‘vector’
⎛ ⎞ 1
p1
⎜ ⎟
p = ⎝ ... ⎠, with entries p j ≥ 0 and p1 + · · · + pm = 1.
pm
(a) Show that h(Pp) ≥ h(p) if P = (Pi j ) is a doubly stochastic matrix (i.e. a square
matrix with elements Pi j ≥ 0 for which all row and column sums are unity).
Moreover, h(Pp) ≡ h(p) iff P is a permutation matrix.
m m
(b) Show that h(p) ≥ − ∑ ∑ p j Pjk log Pjk if P is a stochastic matrix and p is an
j=1 k=1
invariant vector of P: Pp = p.
(b) The LHS equals h(Un ) for the stationary Markov source (U1 ,U2 , . . .) with equi-
librium distribution p, whereas the RHS is h(Un |Un−1 ). The general inequality
h(Un |Un−1 ) ≤ h(Un ) gives the result.
(a) Quoting standard properties of conditional entropy, show that h(X j |X j−1 ) ≤
h(X j |X j−2 ) and, in the case of a stationary DTMC, h(X j |X j−2 ) ≤ 2h(X j |X j−1 ).
(b) Show that the mutual information I(Xm : Xn ) is non-decreasing in m and non-
increasing in n, 1 ≤ m ≤ n.
(b) Write
Similarly,
Solution By the Markov property, Xn−1 and Xn+1 are conditionally independent,
given Xn . Hence,
the final equality holding because of the conditional independence and the last
inequality following from (1.2.21).
h(p1 , . . . , pm ) = − ∑ p j log p j .
j
Solution (a) Using (1.2.42), we obtain for the function F(m) = h 1/m, . . . , 1/m
the following identity:
1 1 1 1 1 1
2
F(m ) = h × ,..., × , 2 ,..., 2
m m m m m m
1 1 1 1
=h , , . . . , 2 + F(m)
m m2 m m
..
.
1 1 m
=h ,..., + F(m) = 2F(m).
m m m
Now, for given positive integers b > 2 and m, we can find a positive integer n such
that 2n ≤ bm ≤ 2n+1 , i.e.
n n 1
≤ log2 b ≤ + .
m m m
By monotonicity of F(m), we obtain nF(2) ≤ mF(b) ≤ (n + 1)F(2), or
n F(b) n 1
≤ ≤ + .
m F(2) m m
F(b) 1
We conclude that log2 b − ≤ , and letting m → ∞, F(b) = c log b with
F(2) m
c = F(2).
38 Essentials of Information Theory
r1 rm
Now take rational numbers p1 = , . . . , pm = and obtain
r r
r
1 rm
r1 1 r1 1 r2 rm r1
h ,..., =h × ,..., × , ,..., − F(r1 )
r r r r1 r r1 r r r
..
.
1 1 ri
=h ,..., −c ∑ log ri
r r 1≤i≤m r
ri ri ri
= c log r − c ∑ log ri = −c ∑ log .
1≤i≤m r 1≤i≤m r r
for any positive integers m, n. Hence, for a canonical prime number decomposition
m = qα1 1 . . . qαs s we obtain
or, equivalently,
m
F(m) m + 1 2 k−1
m
= ∑ k F(k) − k F(k − 1) .
2m m(m + 1) k=1
The quantity in the square brackets is the arithmetic mean of m(m + 1)/2 terms of
a sequence
2 2
F(1), F(2) − F(1), F(2) − F(1), F(3) − F(2), F(3) − F(2),
3 3
2 k−1
F(3) − F(2), . . . , F(k) − F(k − 1), . . . ,
3 k
k−1
F(k) − F(k − 1), . . .
k
that tends to 0. Hence, it goes to 0 and F(m)/m → 0. Furthermore,
m−1 1
F(m) − F(m − 1) = F(m) − F(m − 1) − F(m − 1) → 0,
m m
and (1.2.46) holds. Now define
F(m)
c(m) = ,
log m
and prove that c(m) = const. It suffices to prove that c(p) = const for any prime
number p. First, let us prove that a sequence (c(p)) is bounded. Indeed, suppose the
numbers c(p) are not bounded from above. Then, we can find an infinite sequence
of primes p1 , p2 , . . . , pn , . . . such that pn is the minimal prime such that pn > pn−1
and c(pn ) > c(pn−1 ). By construction, if a prime q < pn then c(q) < c(pn ).
40 Essentials of Information Theory
p p
Moreover, as lim log = 0, equations (1.2.46) and (1.2.47) imply that
p→∞ log p p−1
c(pn ) − c(2) ≤ 0 which contradicts with the construction of c(p). Hence, c(p) is
bounded from above. Similarly, we check that c(p) is bounded from below. More-
over, the above proof yields that sup p c(p) and inf p c(p) are both attained.
Now assume that c( p) = sup p c(p) > c(2). Given a positive integer m, decom-
pose into prime factors pm − 1 = qα1 1 . . . qαs s with q1 = 2. Arguing as before, we
write the difference F( pm ) − F( pm − 1) as
F( pm )
F( pm ) − log( pm − 1) + c( p) log( pm − 1) − F( pm − 1)
log pm
F( pm ) pm pm s
=
pm log pm
log +
pm − 1 j=1 ∑ α j (c( p) − c(q j )) log q j
one has
k k
∑ pi ≤ ∑ qi , for all k = 1, . . . , n.
i=1 i=1
Then
h(p) ≥ h(q) whenever p q.
Condition p q means that if p = q then there exist i1 and i2 such that (a) 1 ≤ i1 ≤
i2 ≤ n, (b) q(i1 ) > p(i1 ) ≥ p(i2 ) > q(i2 ) and (c) q(i) ≥ p(i) for 1 ≤ i ≤ i1 , q(i) ≤ p(i) for
i ≥ i2 .
Now apply induction in s, the number of values i = 1, . . . , n for which q(i) = p(i) .
If s = 0 we have p = q and the entropies coincide. Make the induction hypothesis
and then increase s by 1. Take a pair i1 , i2 as above. Increase q(i2 ) and decrease q(i1 )
so that the sum q(i1 ) +q(i2 ) is preserved, until either q(i1 ) reaches p(i1 ) or q(i2 ) reaches
p(i2 ) (see Figure 1.7). Property (c) guarantees that the modified distributions p q.
As the function x → η (x) = −x log x − (1 − x) log(1 − x) strictly increases on
[0, 1/2]. Hence, the entropy of the modified distribution strictly increases. At the
end of this process we diminish s. Then we use our induction hypothesis.
Q P
1. n
i1 i2
Figure 1.7
Proof For brevity, omit the upper index (n) in the notation u(n) and U(n) . Set
Bn := {u ∈ I ×n : pn (u) ≥ 2−nR }
= {u ∈ I ×n : − log pn (u) ≤ nR}
= {u ∈ I ×n : ξn (u) ≤ R}.
Then
1 ≥ P(U ∈ Bn ) = ∑ pn (u) ≥ 2−nR Bn , whence Bn ≤ 2nR .
u∈Bn
Thus,
Definition 1.3.3 (See PSE II, p. 367.) A sequence of random variables {ηn }
converges in probability to a constant r if, for all ε > 0,
lim P |ηn − r| ≥ ε = 0. (1.3.5)
n→∞
In fact,
P 2−n(H+ε ) ≤ pn (U(n) ) ≤ 2−n(H−ε )
1
= P H − ε ≤ − log pn (U ) ≤ H + ε
(n)
n
= P |ξn − H| ≤ ε = 1 − P |ξn − H| > ε .
In other words, for all ε > 0 there exists n0 = n0 (ε ) such that, for any n > n0 , the
set I ×n decomposes into disjoint subsets, Πn and Tn , with
(i) P U(n) ∈ Πn < ε ,
(ii) 2−n(H+ε ) ≤ P U(n) = u(n) ≤ 2−n(H−ε ) for all u(n) ∈ Tn .
Pictorially speaking, Tn is a set of ‘typical’ strings and Πn is the residual set.
We conclude that, for a source with the asymptotic equipartition property, it is
worthwhile to encode the typical strings with codewords of the same length, and
the rest anyhow. Then we have the effective encoding rate H + o(1) bits/source-
letter, though the source emits log m bits/source-letter.
(b) Observe that
1 1
E ξn = − ∑
n u(n) ∈I ×n
pn (u(n) ) log pn (u(n) ) = h(n) .
n
(1.3.7)
The simplest example of an information source (and one among the most
instructive) is a Bernoulli source.
Theorem 1.3.7 For a Bernoulli source U1 ,U2 , . . ., with P(Ui = x) = p(x),
H = − ∑ p(x) log p(x).
x
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 45
the final equality being in agreement with (1.3.7), since, for the Bernoulli source,
P
h(n) = nh (see (1.1.18)), and hence Eξn = h. We immediately see that ξn −→ h by
the law of large numbers. So H = h by Theorem 1.3.5 (FCT).
Theorem 1.3.8 (The law of large numbers for IID random variables) For any
sequence of IID random variables η1 , η2 , . . . with finite variance and mean Eηi = r,
and for any ε > 0,
1 n
lim P | ∑ ηi − r| ≥ ε = 0. (1.3.8)
n→∞ n i=1
Proof The proof of Theorem 1.3.8 is based on the famous Chebyshev inequality;
see PSE II, p. 368.
This condition means that the DTMC is irreducible and aperiodic. Then (see
PSE II, p. 71), the DTMC has a unique invariant (equilibrium) distribution
π (1), . . . , π (m):
m m
0 ≤ π (u) ≤ 1, ∑ π (u) = 1, π (v) = ∑ π (u)P(u, v), (1.3.10)
u=1 u=1
for all initial distribution {λ (u), u ∈ I}. Moreover, the convergence in (1.3.11) is
exponentially (geometrically) fast.
Theorem 1.3.10 Assume that condition (1.3.9) holds with r = 1. Then the DTMC
U1 ,U2 , . . . possesses a unique invariant distribution (1.3.10), and for any u, v ∈ I and
any initial distribution λ on I ,
|P(n) (u, v) − π (v)| ≤ (1 − ρ )n and |P(Un = v) − π (v)| ≤ (1 − ρ )n−1 . (1.3.12)
In the case of a general r ≥ 1, we replace, in the RHS of (1.3.12), (1 − ρ )n by
(1 − ρ )[n/r] and (1 − ρ )n−1 by (1 − ρ )[(n−1)/r] .
Proof See Worked Example 1.3.13.
Now we introduce an information rate H of a Markov source.
Theorem 1.3.11 For a Markov source, under condition (1.3.9),
H =− ∑ π (u)P(u, v) log P(u, v) = lim h(Un+1 |Un );
n→∞
(1.3.13)
1≤u,v≤m
and write
1 n−1
ξn = σ1 + ∑ σi+1 . (1.3.17)
n i=1
1 n
lim Eξn = lim
n→∞ n→∞ n
∑ Eσi = H,
i=1
P
and the convergence ξn −→ H is again a law of large numbers, for the sequence
(σi ):
1 n
lim P ∑ σi − H ≥ ε = 0. (1.3.19)
n→∞ n i=1
However, the situation here is not as simple as in the case of a Bernoulli source.
There are two difficulties to overcome: (i) Eσi equals H only in the limit i → ∞;
(ii) σ1 , σ2 , . . . are no longer independent. Even worse, they do not form a DTMC,
or even a Markov chain of a higher order. [A sequence U1 ,U2 , . . . is said to form a
DTMC of order k, if, for all n ≥ 1,
Lemma 1.3.12 The expectation value in the RHS of (1.3.20) satisfies the bound
2
n
E ∑ (σi − H) ≤ C n, (1.3.21)
i=1
Solution (Compare with PSE II, p. 72.) First, observe that (1.3.12) implies the
second bound in Theorem 1.3.10 as well as (1.3.10). Indeed, π (v) is identified as
the limit
lim P(n) (u, v) = lim ∑ P(n−1) (u, u)P(u, v) = ∑ π (u)P(u, v), (1.3.23)
n→∞ n→∞
u u
then π (v) = ∑ π (u)P(n) (u, v) for all n ≥ 1. The limit n → ∞ gives then
u
mn (v) = min P(n) (u, v), Mn (v) = max P(n) (u, v). (1.3.24)
u u
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 49
Then
mn+1 (v) = min P(n+1) (u, v) = min
u u
∑ P(u, u)P(n) (u, v)
u
≥ min P (u, v) ∑ P(u, u) = mn (v).
(n)
u
u
Similarly,
Mn+1 (v) = max P(n+1) (u, v) = max
u u
∑ P(u, u)P(n) (u, v)
u
Since 0 ≤ mn (v) ≤ Mn (v) ≤ 1, both mn (v) and Mn (v) have the limits
m(v) = lim mn (v) ≤ lim Mn (v) = M(v).
n→∞ n→∞
then M(v) = m(v) for each v. Furthermore, denoting the common value M(v) =
m(v) by π (v), we obtain (1.3.22)
|P(n) (u, v) − π (v)| ≤ Mn (v) − mn (v) ≤ (1 − ρ )n .
whereas if u1 = u2 then
∑ P (u1 , u2 ), (v1 , v2 ) = ∑ P(u1 , v1 ) ∑ P(u2 , v2 ) = 1
v1 ,v2 v1 v2
50 Essentials of Information Theory
(the inequalities 0 ≤ P (u1 , u2 ), (v1 , v2 ) ≤ 1 follow directly from the definition
(1.3.26)).
(because each component of (Vn ,Wn ) moves with the same transition probabilities)
Worked Example 1.3.14 Under condition (1.3.9) with r = 1 prove the following
bound:
2
|E (σi − H)(σi+k − H) | ≤ H + | log ρ | (1 − ρ )k−1 . (1.3.29)
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 51
Solution For brevity, we assume i > 1; the case i = 1 requires minor changes.
Returning to the definition of random variables σi , i > 1, write
E (σi − H)(σi+k − H)
= ∑ ∑ P(Ui = u,Ui+1 = u ;Ui+k = v,Ui+k+1 = v )
u,u v,v
× − log P(u, u ) − H − log P(v, v ) − H . (1.3.30)
∑∑ λ P i−1
(u)P(u, u ) − log P(u, u ) − H
u,u v,v
Observe that (1.3.31) in fact vanishes because the sum ∑ vanishes due to the defi-
v,v
nition (1.3.13) of H.
The difference between sums (1.3.30) and (1.3.31) comes from the fact that the
probabilities
and
λ Pi−1 (u)P(u, u )π (v)P(v, v )
Proofof Theorem 1.3.11. This is now easy to complete. To prove (1.3.21), expand
the square and use the additivity of the expectation:
2
n
The first sum in (1.3.32) is OK: it contains n terms E(σi − H)2 each bounded by a
2
constant (say, C may be taken to be H + | log ρ | ). Thus this sum is at most C n.
52 Essentials of Information Theory
It is the second sum that causes problems: it contains n(n − 1) 2 terms. We bound
it as follows:
n ∞
∑ E (σi − H)(σ j − H) ≤ ∑ ∑ |E (σi − H)(σi+k − H) | , (1.3.33)
1≤i< j≤n i=1 k=1
Here, ∼ means that the ratio of the left- and right-hand sides tends to 1 as n → ∞,
kn
p∗ (= p∗n ) denotes the ratio , and D(p||p∗ ) stands for the relative entropy h(X||Y )
n
where X is distributed as ζi (i.e. it takes values 0 and 1 with probabilities 1 − p
and p), while Y takes the same values with probabilities 1 − p∗ and p∗ .
Proof Use Stirling’s formula (see PSE I, p.72):
√
n! ∼ 2π nnn e−n . (1.3.35)
√
[In fact, this formula admits a more precise form: n! = 2π nnn e−n+θ (n) , where
1 1
< θ (n) < , but for our purposes (1.3.35) is enough.] Then the proba-
12n + 1 12n
bility in the LHS of (1.3.34) is (for brevity, the subscript n in kn is omitted)
1/2
n k n nn
p (1 − p) n−k
∼ pk (1 − p)n−k
k 2π k(n − k) kk (n − k)n−k
−1/2
= 2π np∗ (1 − p∗ )
× exp [−k ln k/n − (n − k) ln (n − k)/n + k ln p + (n − k) ln(1 − p)] .
But the RHS of the last formula coincides with the RHS of (1.3.34).
If p∗ is close to p, we can write
∗ 1 1 1
D(p||p ) = + (p∗ − p)2 + O(|p∗ − p|3 ), (1.3.36)
2 p 1− p
d
as D(p||p∗ )| p∗ =p = D(p||p∗) |
p∗ =p = 0, and immediately obtain
dp∗
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 53
Worked Example 1.3.17 At each time unit a device reads the current version
of a string of N characters each of which may be either 0 or 1. It then transmits
the number of characters which are equal to 1. Between each reading the string is
perturbed by changing one of the characters at random (from 0 to 1 or vice versa,
with each character being equally likely to be changed). Determine an expression
for the information rate of this source.
Solution The source is Markov, with the state space {0, 1, . . . , N} and the transition
probability matrix
⎛ ⎞
0 1 0 0 ... 0 0
⎜1/N − ⎟
⎜ 0 (N 1)/N 0 ... 0 ⎟
0
⎜ ⎟
⎜ 0 2/N 0 (N − 2)/N ... 0 ⎟
0
⎜ ⎟.
⎜ ... ... ⎟
⎜ ⎟
⎝ 0 0 0 0 ... 0 1/N ⎠
0 0 0 0 ... 1 0
The DTMC is irreducible and periodic. It possesses a unique invariant distribution
N
πi = 2−N , 0 ≤ i ≤ N.
i
By Theorem 1.3.11,
1 N−1 N N
H = − ∑ πi P(i, j) log P(i, j) = 21−N ∑ j j log j .
i, j N j=1
How does the answer change if m is odd? How can you use, for m odd, Shannon’s
FCT to derive the information rate of the above source?
Solution For m even, the DTMC is reducible: there are two communicating classes,
I1 = {0, 2, . . . , m} with m/2 + 1 states, and I2 = {1, 3, . . . , m − 1} with m/2 states.
Correspondingly, for any set An of n-strings,
As follows from (1.3.38), the information rate of the whole DTMC equals
% (1)
H = max [H (1) , H (2) ], if 0 < q ≤ 1,
Hodd = (1.3.41)
H (2) , if q = 0.
DTMCs P1 and P2 are irreducible and aperiodic and their invariant distributions
are uniform:
(1) 2 (2) 2
πi = , i ∈ I1 , πi = , i ∈ I2 .
m+1 m+1
Their common information rate equals
8
Hodd = log 3 − , (1.3.42)
3(m + 1)
which also gives the information rate of the whole DTMC. It agrees with Shannon’s
FCT, because now
1 P
ξn = − log pn (U(n) ) −→ Hodd . (1.3.43)
n
Worked Example 1.3.19 Let a be the size of A and b the size of the alphabet B.
Consider a source with letters chosen from an alphabet A + B, with the constraint
that no two letters of A should ever occur consecutively.
(a) Suppose the message follows a DTMC, all characters which are permitted at a
given place being equally likely. Show that this source has information rate
a log b + (a + b) log(a + b)
H= . (1.3.44)
2a + b
(b) By solving a recurrence relation, or otherwise, find how many strings of length
n satisfy the constraint that no two letters of A occur consecutively. Suppose
these strings are equally likely and let n → ∞. Show that the limiting informa-
tion rate becomes
√
b + b2 + 4ab
H = log .
2
the detailed balance equations (DBEs) π (x)P(x, y) = π (y)P(y, x) (cf. PSE II, p. 82),
which yields
'
1 (2a + b), x ∈ {1, . . . , a},
π (x) =
(a + b) [b(2a + b)], x ∈ {a + 1, . . . , a + b}.
The DBEs imply that π is invariant: π (y) = ∑ π (x)P(x, y), but not vice versa. Thus,
x
we obtain (1.3.44).
(b) Let Mn denote the number of allowed n-strings, An the number of allowed
n-strings ending with a letter from A, and Bn the number of allowed n-strings
ending with a letter from B. Then
Mn = An + Bn , An+1 = aBn , and Bn+1 = b(An + Bn ),
which yields
Bn+1 = bBn + abBn−1 .
The last recursion is solved by
Bn = c+ λ+n + c− λ−n ,
where λ± are the eigenvalues of the matrix
0 ab
,
1 b
i.e.
√
b ± b2 + 4ab
λ± = ,
2
and c± are constants, c+ > 0. Hence,
+ λ+ + c− λ− + c+ λ+n + c− λ−n
n−1 n−1
Mn = a c
λ n−1
λ n
1
= λ+n c− a −n + −n + c+ a +1 ,
λ+ λ+ λ+
1
and log Mn is represented as the sum
n
1 λ−n−1 λ−n 1
log λ+ + log c− a n + n + c+ a +1 .
n λ+ λ+ λ+
λ−
Note that < 1. Thus, the limiting information rate equals
λ+
1
lim log Mn = log λ+ .
n→∞ n
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 57
The answers are different since the conditional equidistribution results in a strong
dependence between subsequent letters: they do not form a DTMC.
(a) in the case where the rows of the transition probability matrix P are all equal
(i.e. {U j } is a Bernoulli sequence),
(b) in the case where the rows of P are permutations of each other, and in a general
case. Comment on the significance of this result for coding theory.
Solution (a) Let P stand for the probability distribution of the IID sequence (Un )
m
and set H = − ∑ p j log p j (the binary entropy of the source). Fix ε > 0 and parti-
j=1
tion the set I ×n of all n-strings into three disjoint subsets:
and
K = {u(n) : 2−n(H+ε ) < p(u(n) ) < 2−n(H−ε ) }.
1
By the law of large numbers (or asymptotic equipartition property), − log P(U(n) )
n
converges to H(= h), i.e. lim P(K+ ∪ K− ) = 0, and lim P(K) = 1. Thus, to obtain
n→∞ n→∞
probability ≥ α , for n large enough, you (i) cannot restrict yourself to K+ and have
to borrow strings from K , (ii) don’t need strings from K− , i.e. will have the last
selected string from K . Denote by Mn (α ) the set of selected strings, and Mn (α )
by Mn . You have two two-side bounds
α ≤ P Mn (α ) ≤ α + 2−n(H−ε )
and
2−n(H+ε ) Mn (α ) ≤ P Mn (α ) ≤ P(K+ ) + 2−n(H−ε ) Mn (α ).
Excluding P Mn (α ) yields
(b) The argument may be repeated without any change in the case of permutations
because the ordered probabilities form the same set as in case (a) , and in a general
case by applying the law of large numbers to (1/n)ξn ; cf. (1.3.3b) and (1.3.19).
Finally, the significance for coding theory: if we are prepared to deal with the error-
probability ≤ α , we do not need to encode all mn string u(n) but only ∼ 2nH most
frequent ones. As H ≤ log m (and in many cases log m), it yields a significant
economy in storage space (data-compression).
Worked Example 1.3.21 A binary source emits digits 0 or 1 according to the
rule
P(Xn = k|Xn−1 = j, Xn−2 = i) = qr ,
where k, j, i and r take values 0 or 1, r = k − j −i mod 2, and q0 +q1 = 1. Determine
the information rate of the source.
Also derive the information rate of a binary Bernoulli source, emitting digits
0 and 1 with probabilities q0 and q1 . Explain the relationship between these two
results.
Solution The source is a DTMC of the second order. That is, the pairs (Xn , Xn+1 )
form a four-state DTMC, with
P(00, 00) = q0 , P(00, 01) = q1 , P(01, 10) = q0 , P(01, 11) = q1 ,
P(10, 00) = q0 , P(10, 01) = q1 , P(11, 10) = q0 , P(11, 11) = q1 ;
the remaining eight entries of the transition probability matrix vanish. This gives
H = −q0 log q0 − q1 log q1 .
For a Bernoulli source the answer is the same.
Worked Example 1.3.22 Find an entropy rate of a DTMC associated with a
random walk on the 3 × 3 chessboard:
⎛ ⎞
1 2 3
⎝4 5 6⎠ . (1.3.45)
7 8 9
Find the entropy rate for a rook, bishop (both kinds), queen and king.
1.4 Channels of information transmission 59
Solution We consider the king’s DTMC only; other cases are similar. The transition
probability matrix is
⎛ ⎞
0 1/3 0 1/3 1/3 0 0 0 0
⎜1/5 0 1/5 1/5 1/5 1/5 0 0 0 ⎟
⎜ ⎟
⎜ 0 1/3 0 0 ⎟
⎜ 0 1/3 1/3 0 0 ⎟
⎜1/5 1/5 0 0 1/5 0 1/5 1/5 0 ⎟
⎜ ⎟
⎜ ⎟
⎜ 1/8 1/8 1/8 1/8 0 1/8 1/8 1/8 1/8 ⎟
⎜ ⎟
⎜ 0 1/5 1/5 0 1/5 0 0 1/5 1/5⎟
⎜ ⎟
⎜ 0 0 0 1/3 1/3 0 0 0 1/3 ⎟
⎜ ⎟
⎝ 0 0 0 1/5 1/5 1/5 1/5 0 1/5⎠
0 0 0 0 1/3 1/3 0 1/3 0
By symmetry the invariant distribution is π1 = π3 = π9 = π7 = λ , π4 = π2 = π6 =
π8 = μ , π5 = ν , and by the DBEs
λ /3 = μ /5, λ /3 = ν /8, 4λ + 4μ + ν = 1
implies λ = 3
40 , μ = 18 , ν = 15 . Now
1 1 1 1 1 1 1 3
H = −4λ log − 4μ log − ν log = log 15 + .
3 3 5 5 8 8 10 40
we again suppose
that it is knownto both sender and receiver.
(We use a distinct
symbol Pch · |codeword x sent or, briefly, Pch · |x
(N) (N) , to stress that this prob-
ability distribution is generated by the channel, conditional on the event that code-
word x(N) has been sent.) Speaking below of a channel, we refer to a conditional
probability (1.4.1) (or rather a family of conditional probabilities, depending on
N). Consequently, we use the symbol Y(N) for a random string representing the
output of the channel; given that a word x(N) was sent,
In other words, strings of length N sent to the channel will be codewords repre-
senting source messages of a shorter length n. The maximal ratio n/N which still
allows the receiver to recover the original message is an important characteristic of
the channel, called the capacity. As we will see, passing from a = ±1 to a = ±1
changes the capacity from 1 (no encoding needed) to 1/2 (where the codeword-
length is twice as long as the length of the source message).
So, we need to introduce a decoding rule fN : J ×N → I ×n such that the overall
probability of error ε (= ε ( fn , fN , P)) defined by
ε = ∑ P fN (Y(N) ) = u(n) , u(n) emitted
u(n)
= ∑ P U(n) = u(n) Pch fN (Y(N) ) = u(n) | fn (u(n) ) sent (1.4.3)
u(n)
is small. We will try (and under certain conditions succeed) to have the error-
probability (1.4.3) tending to zero as n → ∞.
The idea which is behind the construction is based on the following facts:
(1) For a source with the asymptotic equipartition property the number of dis-
tinct n-strings emitted is 2n(H+o(1)) where H ≤ log m is the information rate of
the source. Therefore, we have to encode not mn = 2n log m messages, but only
2n(H+o(1)) which may be considerably less. That is, the code fn may be defined
on a subset of I ×n only, with the codeword-length N = nH.
−1
(2) We may try even a larger N: N = R nH, where R is a constant with 0 < R <
1. In other words, the increasing length of the codewords used from nH to
−1
R nH will allow us to introduce a redundancy in the code fn , and we may
hope to be able to use this redundancy to diminish the overall error-probability
(1.4.3) (provided that in addition a decoding rule is ‘good’). It is of course
−1
desirable to minimise R , i.e. maximise R: it will give the codes with optimal
parameters. The question of how large R is allowed to be depends of course on
the channel.
1
tending to zero as N → ∞. That is, for each sequence UN with lim log UN = R,
N→∞ N
there exist a sequence of encoding rules fN : UN → XN , XN ⊆ J ×N , and a sequence
of decoding rules fN : J ×N → UN such that
1
lim ∑ ∑ 1 fN (Y(N) ) = u Pch Y(N) | fN (u) = 0. (1.4.5)
N→∞ UN
u∈UN Y(N)
(b) The reason for the equiprobable distribution on UN is that it yields the worst-
case scenario. See Theorem 1.4.6 below.
(c) If encoding rule fN used is one-to-one (lossless) then it suffices to treat the
decoding rules as maps J ×N → XN rather than J ×N → UN : if we guess correctly
what codeword x(N) has been sent, we simply set u = fN−1 (x(N) ). If, in addition,
the source distribution is equiprobable over U then the error-probability ε can be
written as an average over the set of codewords XN :
1
ε= ∑ 1 − Pch fN (Y(N) ) = x|x sent .
X x∈XN
Accordingly, it makes sense to write ε = ε ave and speak about the average proba-
bility of error. Another form is the maximum error-probability
ε max = max 1 − Pch fN (Y(N) ) = x|x sent : x ∈ XN ;
obviously, ε ave ≤ ε max . In this section we work with ε ave → 0 leaving the question
of whether ε max → 0. However, in Section 2.2 we reduce the problem of assessing
ε max to that with ε ave , and as a result, the formulas for the channel capacity deduced
in this section will remain valid if ε ave is replaced by ε max .
1.4 Channels of information transmission 63
Remark 1.4.5 (a) By Theorem 1.4.17 below, the channel capacity of an MBC is
given by
C = sup I(Xk : Yk ). (1.4.7)
pXk
Here, I(Xk : Yk ) is the mutual information between a single pair of input and output
letters Xk and Yk (the index k may be omitted), with the joint distribution
where pX (x) = P(X = x). The supremum in (1.4.7) is over all possible distributions
pX = (pX (0), pX (1)). A useful formula is I(X : Y ) = h(Y ) − h(Y |X) (see (1.3.12)).
In fact, for the MBSC
But sup h(Y ) is equal to log 2 = 1: it is attained at pX (0) = pX (1) = 1/2, and
pX
pY (0) = pY (1) = 1/2(p + 1 − p) = 1/2. Therefore, for an MBSC, with the row
error-probability p,
C = 1 − η (p). (1.4.11)
In fact, Shannon’s ideas have not been easily accepted by leading contemporary
mathematicians. It would be interesting to see the opinions of the leading scientists
who could be considered as ‘creators’ of information theory.
Theorem 1.4.6 Fix a channel (i.e. conditional probabilities Pch in (1.4.1)) and a
set U of the source strings and denote by ε (P) the overall error-probability (1.4.3)
for U(n) having a probability distribution P over U , minimised over all encoding
and decoding rules. Then
ε (P) ≤ ε (P0 ), (1.4.12)
Proof Fix encoding and decoding rules f and f, and let a string u ∈ U have
probability P(u). Define the error-probability when u is emitted as
1
ε=
U ∑ β (u) = ε (P0 , f , f).
u∈U
Hence, given any f and f, we can find new encoding and decoding rules with
overall error-probability ≤ ε (P0 , f , f). Minimising over f and f leads to (1.4.12).
1.4 Channels of information transmission 65
Worked Example 1.4.7 Let the random variables X and Y , with values from
finite ‘alphabets’ I and J , represent, respectively, the input and output of a trans-
mission channel, with the conditional probability P(x | y) = P(X = x | Y = y). Let
h(P(· | y)) denote the entropy of the conditional distribution P(· | y), y ∈ J :
h(P(· | y)) = − ∑ P(X | y) log P(x | y).
x
Let h(X | Y ) denote the conditional entropy of X given Y Define the ideal observer
decoding rule as a map f IO : J → I such that P( f (y) | y) = maxx∈I P(x | y) for all
y ∈ J . Show that
(a) under this rule the error-probability
πerIO (y) = ∑ 1(x = f (y))P(x | y)
x∈I
1
satisfies πerIO (y) ≤ h(P(· | y));
2
1
(b) the expected value of the error-probability obeys EπerIO (Y ) ≤ h(X | Y ).
2
Solution Indeed, (a) follows from (iii) in Worked Example 1.2.7, as
πerr
IO
= 1 − P f (y) | y = 1 − Pmax ( · |y),
1
which is less than or equal to h(P( · |y)). Finally, (b) follows from (a) by taking
2
expectations, as h(X|Y ) = Eh(P( · |Y )).
As was noted before, a general decoding rule (or a decoder) is a map fN : J ×N →
UN ; in the case of a lossless encoding rule fN , fN is a map J ×N → XN . Here X
is a set of codewords. Sometimes it is convenient to identify the decoding rule by
fixing, for each codeword x(N) , a set A(x(N) ) ⊂ J ×N , so that A(x1 ) and A(x2 ) are
(N) (N)
disjoint for x1 = x2 , and the union ∪x(N) ∈XN A(x(N) ) gives the whole J ×N . Given
(N) (N)
Although in the definition of the channel capacity we assume that the source
messages are equidistributed (as was mentioned, it gives the worst case in the
sense of Theorem 1.4.6), in reality of course the source does not always follow
this assumption. To this end, we need to distinguish between two situations: (i) the
receiver knows the probabilities
p(u) = P(U = u) (1.4.13)
of the source strings (and hence the probability distribution pN (x(N) ) of the code-
words x(N) ∈ XN ), and (ii) he does not know pN (x(N) ). Two natural decoding rules
are, respectively,
66 Essentials of Information Theory
(i) the ideal observer (IO) rule decodes a received word y(N) by a codeword x(N)
that maximises the posterior probability
p (x(N) )P (y(N) |x(N) )
N ch
P x(N) sent |y(N) received = , (1.4.14)
pY(N) (y(N) )
where
pY(N) (y(N) ) = ∑ pN (x(N) )Pch (y(N) |x(N) ),
x(N) ∈XN
and
(ii) the maximum likelihood (ML) rule decodes a received word y(N) by a codeword
(N)
x that maximises the prior probability
Pch (y(N) |x(N) ). (1.4.15)
Theorem 1.4.8 Suppose that an encoding rule f is defined for all messages that
occur with positive probability and is one-to-one. Then:
(a) For any such encoding rule, the IO decoder minimises the overall error-
probability among all decoders.
(b) If the source message U is equiprobable on a set U , then for any encoding rule
f : U → XN as above, the random codeword X(N) = f (U) is equiprobable on
XN , and the IO and ML decoders coincide.
Proof Again, for simplicity let us omit the upper index (N).
(a) Note that, given a received word y, the IO obviously maximises the joint
probability p(x)Pch (y|x) (the denominator in (1.4.14) is fixed when word y is
fixed). If we use an encoding rule f and decoding rule f, the overall error-
probability (see (1.4.3)) is
∑ P(U = u)Pch f(y) = u| f (u) sent
u
= ∑ p(x) ∑ 1 f(y) = x Pch (y|x)
x y
= ∑ ∑ 1 x = f(y) p(x)Pch (y|x)
y x
= ∑ ∑ p(x)Pch (y|x) − ∑ p f(y) Pch y| f(y)
y x
y
= 1 − ∑ p f (y) Pch y| f(y) .
y
It remains to note that each term in the sum ∑ p f(y) Pch y| f(y) is maximised
y
when f coincides with the IO rule. Hence, the whole sum is maximised, and the
overall error-probability minimised.
Assuming in the definition of the channel capacity that the source messages are
equidistributed, it is natural to explore further the ML decoder. While using the ML
decoder, an error can occur because either the decoder chooses a wrong codeword
x or an encoding rule f used is not one-to-one. The probability of this is assessed
in Theorem 1.4.8. For further simplification, we write P instead of Pch ; symbol P
is used mainly for the joint input/output distribution.
Lemma 1.4.9 If the source messages are equidistributed over a set U then, while
using the ML decoder and an encoding rule f , the overall error-probability satisfies
1
ε( f ) ≤
U ∑ ∑ P P Y| f (u ) ≥ P (Y| f (u)) |U = u . (1.4.16)
u∈U u ∈U : u =u
1
Multiplying by and summing up over u yields the result.
U
Remark 1.4.10 Bound (1.4.16) of course holds for any probability distribution
1
p(u) = P(U = u), provided is replaced by p(u).
U
As was already noted, a random coding is a useful tool alongside with deter-
×N
ministic encoding rules. A deterministic encoding rule ( is a map f : U)→ J ; if
U = r then f is given as a collection of codewords f (u1 ), . . . , f (ur ) or, equiv-
alently, as a concatenated ‘megastring’ (or codebook)
×r
f (u1 ) . . . f (ur ) ∈ J ×N = {0, 1}×Nr .
Here, u1 , . . . , ur are the source strings (not letters!) constituting set U . If f is loss-
less then f (ui ) = r f (u j ) whenever i = j. A random encoding
×N r rule is a random ele-
ment F of J ×N , with probabilities P(F = f ), f ∈ J . Equivalently, F may
68 Essentials of Information Theory
Theorem 1.4.11
Proof
Part (i) is obvious.
For (ii), use the Chebyshev inequality (see PSE I, p. 75):
E 1−ρ
P ε (F) ≥ ≤ E = 1 − ρ.
1−ρ E
1.4 Channels of information transmission 69
Recall that I X(N) : Y(N) is the mutual entropy given by
h X(N) − h X(N) |Y(N) = h Y(N) − h Y(N) |X(N) .
all of them equally likely. We will not be able to detect which sequence X was sent
unless no two X(N) sequences produce the same Y(N) output sequence. The total
(N)
number of typical Y(N) sequences is 2h(Y ) . This set has to be divided into subsets
of size 2h(Y |X ) corresponding to the different input X(N) sequences. The total
(N) (N)
Hence, the total number of distinguishable signals of the length N could not be
(N) (N)
bigger than 2I (X :Y ) . Putting the same argument slightly differently, the number
(N) (N) (N)
of typical sequences X(N) is 2Nh(X ) . However, there are only 2Nh(X ,Y ) jointly
typical sequences (X(N) , Y(N) ). So, the probability that any randomly chosen pair
is jointly typical is about 2−I (X :Y ) . So, the number of distinguished signals is
(N) (N)
Theorem 1.4.14 (Shannon’s SCT: converse part) The channel capacity C obeys
The assertion of the theorem immediately follows from (1.4.20) and the definition
of the channel capacity because
1
lim inf ε ( f ) ≥ 1 − lim sup CN
N→∞ R N→∞
Here and below r = U . The last bound follows by the generalised Fano inequality
(1.2.25). Indeed, observe that the (random) codeword X(N) = f (U) takes r values
(N) (N)
x1 , . . . , xr from the codeword set X (= XN , and the error-probability is
r
ε ( f ) = ∑ P(X(N) = xi , f(Y(N) ) = xi ).
(N) (N)
i=1
i.e.
N(R + o(1)) − NCN C + o(1)
ε( f ) ≥
= 1− N .
log 2N(R+o(1)) − 1 R + o(1)
Let p(X(N) , Y(N) ) be the random variable that assigns, to random words X(N) and
Y(N) , the joint probability of having these words at the input and output of a chan-
nel, respectively. Similarly, pX (X(N) ) and pY (Y(N) ) denote the random variables
that give the marginal probabilities of words X(N) and Y(N) , respectively.
1.4 Channels of information transmission 71
Theorem 1.4.15 (Shannon’s SCT: direct part) Suppose we can find a constant
c ∈ (0, 1) such that for any R ∈ (0, c) and N ≥ 1 there exists a random coding
F(u1 ), . . . , F(ur ), where r = 2N(R+o(1)) , with IID codewords F(ui ) ∈ J ×N , such
that the (random) input/output mutual information
1 p(X(N) , Y(N) )
ΘN := log (1.4.21)
N pX (X(N) )pY (Y(N) )
converges in probability to c as N → ∞. Then the channel capacity C ≥ c.
The proof of Theorem 1.4.15 is given after Worked Examples 1.4.24 and 1.4.25
(the latter is technically rather involved). To start with, we explain the strategy of
the proof outline by Shannon in his original 1948 paper. (It took about 10 years
before this idea was transformed into a formal argument.)
First, one generates a random codebook X consisting of r = 2NR words,
X(N) (1), . . . , X(N) (r). The codewords X(N) (1), . . . , X(N) (r) are assumed to be
known to both the sender and the receiver, as well as the channel transition
matrix Pch (y|x). Next, the message is chosen according to a uniform distribution,
and the corresponding codeword is sent over a channel. The receiver uses the max-
imum likelihood (ML) decoding, i.e. choose the a posteriori most likely message.
But this procedure is difficult to analyse. Instead, a suboptimal but straightforward
typical set decoding is used. The receiver declares that the message w is sent if there
is only one input such that the codeword for w and the output of the channel are
jointly typical. If no such word exists or it is non-unique then an error is declared.
Surprisingly, this procedure is asymptotically optimal. Finally, the existence of a
good random codebook implies the existence of a good non-random coding.
In other words, channel capacity C is no less than the supremum of the values
c for which the convergence in probability in (1.4.21) holds for an appropriate
random coding.
So, if the LHS and RHS sides of (1.4.22) coincide, then their common value gives
the channel capacity.
Next, we use Shannon’s SCT for calculating the capacity of an MBC. Recall (cf.
(1.4.2)), for an MBC,
N
P y(N) |x(N) = ∏ P(yi |xi ). (1.4.23)
i=1
72 Essentials of Information Theory
I X(N) : Y(N) = h Y(N) − h Y(N) |X(N)
= h Y(N) − ∑ h(Y j |X j )
1≤ j≤N
≤ ∑ h(Y j ) − h(Y j |X j ) = ∑ I(X j : Y j ).
j j
The equality holds iff Y1 , . . . ,YN are independent. But Y1 , . . . ,YN are independent if
X1 , . . . , XN are.
Remark 1.4.18 Compare with inequalities (1.4.24) and (1.2.27). Note the oppo-
site inequalities in the bounds.
On the other hand, take a random coding F, with codewords F(ul ) = Vl1 . . . VlN ,
1 ≤ l ≤ r, containing IID symbols Vl j that are distributed according to p∗ , a prob-
ability distribution that maximises I(X1 : Y1 ). [Such random coding is defined for
1.4 Channels of information transmission 73
any r, i.e. for any R (even R > 1!).] For this random coding, the (random) mutual
entropy ΘN equals
1 p X(N) , Y(N)
log (N) (N)
N pX X pY Y
N
1 p(X j ,Y j ) 1 N
= ∑ log ∗ = ∑ ζ j,
N j=1 p (X j )pY (Y j ) N j=1
p(X j ,Y j )
where ζ j := log .
p∗ (X
j )pY (Y j )
The random variables ζ j are IID, and
p(X j ,Y j )
Eζ j = E log = Ip∗ (X1 : Y1 ).
p∗ (X
j )pY (Y j )
By the law of large numbers for IID random variables (see Theorem 1.3.5), for the
random coding as suggested,
P
ΘN −→ Ip∗ (X1 : Y1 ) = sup I(X1 : Y1 ).
pX1
Remark 1.4.20 (a) The pair (X1 ,Y1 ) may be replaced by any (X j ,Y j ), j ≥ 1.
(b) Recall that the joint
distribution
of X1 and Y1 is defined by P(X1 = x,Y1 = y) =
pX1 (x)P(y|x) where P(y|x) is the channel matrix.
(c) Although, as was noted, the construction holds for each r (that is, for each
R ≥ 0) only R ≤ C are reliable.
Example 1.4.21 A helpful statistician preprocesses the output of a memory-
less channel (MBC) with transition probabilities P(y|x) and channel capacity C =
max pX I(X : Y ) by forming Y = g(Y ): he claims that this will strictly improve the
capacity. Is he right? Surely not, as preprocessing (or doctoring) does not increase
the capacity. Indeed,
I(X : Y ) = h(X) − h(X|Y ) ≥ h(X) − h(X|g(Y )) = I(X : g(Y )). (1.4.26)
Under what condition does he not strictly decrease the capacity? Equality in
(1.4.26) holds iff, under the distribution pX that maximises I(X : Y ), the ran-
dom variables X and Y are conditionally independent given g(Y ). [For example,
g(y1 ) = g(y2 ) iff for any x, PX|Y (x|y1 ) = PX|Y (x|y2 ); that is, g glues together only
those values of y for which the conditional probability PX|Y ( · |y) is the same.] For
an MBC, equality holds iff g is one-to-one, or p = P(1|0) = P(0|1) = 1/2.
74 Essentials of Information Theory
Then
1+α 1+α 1−α 1−α
h(Y ) = − log − log
2 2 2 2
and
1+α 1+α 1−α 1−α
I(X : Y ) = − log − log −1+α
2 2 2 2
which is maximised at α = 3/5, with the capacity given by
log 5 − 2 = 0.321928.
Our next goal is to prove the direct part of Shannon’s SCT (Theorem 1.4.15). As
was demonstrated earlier, the proof is based on two consecutive Worked Examples
below.
Worked Example 1.4.24 Let F be a random coding, independent of the source
string U, such that the codewords F(u1 ), . . . , F(ur ) are IID, with a probability dis-
tribution pF :
pF (v) = P(F(u) = v), v (= v(N) ) ∈ J ×N .
Here, u j , j = 1, . . . , r, are source strings, and r = 2N(R+o(1)) . Define random code-
words V1 , . . . , Vr−1 by
if U = u j then Vi := F(ui ) for i < j (if any),
and Vi := F(ui+1 ) for i ≥ j (if any), (1.4.30)
1 ≤ j ≤ r, 1 ≤ i ≤ r − 1.
Then U (the message string), X = F(U) (the random codeword) and V1 , . . . , Vr−1
are independent words, and each of X, V1 , . . . , Vr−1 has distribution pF .
Solution This is straightforward and follows from the formula for the joint proba-
bility,
P(U = u j , X = x, V1 = v1 , . . . , Vr−1 = vr−1 )
= P(U = u j ) pF (x) pF (v1 ) . . . pF (vr−1 ). (1.4.31)
Worked Example 1.4.25 Check that for the random coding as in Worked Ex-
ample 1.4.24, for any κ > 0,
E = Eε (F) ≤ P(ΘN ≤ κ ) + r2−N κ . (1.4.32)
Here, the random variable ΘN is defined in (1.4.21), with EΘN =
1 (N) (N)
I X :Y .
N
76 Essentials of Information Theory
Solution For given words x(= x(N) ) and y(= y(N) ) ∈ J ×N , denote
* +
Sy (x) := x ∈ J ×N : P(y | x ) ≥ P(y | x) . (1.4.33)
That is, Sy (x) includes all words the ML decoder may produce in the situation
where x was sent and y received. Set, for a given non-random encoding rule f
and a source string u, δ ( f , u, y) = 1 if f (u ) ∈ Sy ( f (u)) for some u = u, and
δ ( f , u, y) = 0 otherwise. Clearly, δ ( f , u, y) equals
1 − ∏ 1 f (u ) ∈ Sy ( f (u))
u : u =u
= 1 − ∏ 1 − 1 f (u ) ∈ Sy ( f (u)) .
u :u =u
It is plain that, for all non-random encoding f , ε ( f ) ≤ Eδ ( f , U, Y), and for all
random encoding F, E = Eε (F) ≤ Eδ (F, U, Y). Furthermore, for the random
encoding as in Worked Example 1.4.24, the expected value Eδ (F, U, Y) does not
exceed
r−1
E 1 − ∏ 1 − 1 Vi ∈ SY (X) = ∑ pX (x) ∑ P(y|x)
i=1 x y
r−1
× E 1 − ∏ 1 − 1 Vi ∈ SY (X) |X = x, Y = y ,
i=1
Furthermore, due to the IID property (as explained in Worked Example 1.4.24),
r−1
∏ E 1 − 1{V ∈ S (x)} = (1 − Qy (x)) ,
i y
r−1
i=1
where
Qy (x) := ∑ 1 x ∈ Sy (x) pX (x ).
x
this yields
E ≤ P (X,Y ) ∈ T + (r − 1) ∑ pX (x)P(y|x)Qy (x). (1.4.36)
(x,y)∈T
Theorems 1.4.17 and 1.4.19 may be extended to the case of a memoryless chan-
nel with an arbitrary (finite) output alphabet, Jq = {0, . . . , q − 1}. That is, at the
input of the channel we now have a word Y(N) = Y1 . . . YN where each Y j takes a
(random) value from Jq . The memoryless property means, as before, that
N
Pch y(N) |x(N) = ∏ P(yi | xi ), (1.4.39)
i=1
If, in addition, the columns of the channel matrix are permutations of each other,
then h(Y1 ) attains log q. Indeed, take a random coding as suggested. Then P(Y = y)
q−1 1
= ∑ P(X1 = x)P(y|x) = ∑ P(y|x). The sum ∑ P(y|x) is along a column of the
x=0 q x x
channel matrix, and it does not depend on y. Hence, P(Y = y) does not depend on
y ∈ Iq , which means equidistribution.
Remark 1.4.27 (a) In the random coding F used in Worked Examples 1.4.24 and
1.4.25 and Theorems 1.4.6, 1.4.15 and 1.4.17, the expected error-probability E → 0
with N → ∞. This guarantees not only the existence of a ‘good’ non-random coding
for which the error-probability E vanishes as N → ∞ (see Theorem 1.4.11(i)), but
also that ‘almost’ all codes are asymptotically good. In fact, by Theorem 1.4.11(ii),
1.4 Channels of information transmission 79
C ( p)
1
p
1 1
2
Figure 1.8
√ √
√
with ρ = 1 − E, P ε (F) < E ≥ 1 − E → 1, as N → ∞. However, this does
not help to find a good code: constructing good codes remains a challenging task
in information theory, and we will return to this problem later.
the rows are permutations of each other, and hence have equal entropies. Therefore,
the conditional entropy h(Y |X) equals
Solution (a) Given Y , the random variables X and Z are conditionally independent.
Hence,
h(X | Y ) = h(X | Y, Z) ≤ h(X | Z),
and
I(X : Y ) = h(X) − h(X|Y ) ≥ h(X) − h(X | Z) = I(X : Z).
1.4 Channels of information transmission 81
. . . . . . . . .
Figure 1.9
The equality holds iff X and Y are conditionally independent given Z, e.g. if the
second channel is error-free (Y, Z) → Z is one-to-one, or the first channel is fully
noisy, i.e. X and Y are independent.
(b) The rows of the channel matrix are permutations of each other. Hence h(Y |X) =
h(p0 , . . . , pr−1 ) does not depend on pX . The quantity h(Y ) is maximised when
pX (i) = 1/r, which gives
1 − h(1/2, 1/2) = 1 − 1 = 0.
then C is given by 2C = 2C1 + 2C2 . To what mode of operation does this corre-
spond?
Solution (a)
X Y Z
−→ channel 1 −→ channel 2 −→
As in Worked Example 1.4.29a,
I(X : Z) ≤ I(X : Y ), I(X : Z) ≤ I(Y : Z).
Hence,
C = sup I(X : Z) ≤ sup I(X : Y ) = C1
pX pX
and similarly
C ≤ sup I(Y : Z) = C2 ,
pY
i.e. C ≤ min[C1 ,C2 ]. A strict inequality may occur: take δ ∈ (0, 1/2) and the
matrices
1−δ δ 1−δ δ
ch 1 ∼ , ch 2 ∼ ,
δ 1−δ δ 1−δ
and
1 (1 − δ )2 + δ 2 2δ (1 − δ )
ch [1 + 2] ∼ .
2 2δ (1 − δ ) (1 − δ )2 + δ 2
1.4 Channels of information transmission 83
C1 = C2 = 1 − h(δ , 1 − δ ),
and
C = 1 − h 2δ (1 − δ ), 1 − 2δ (1 − δ ) < Ci
X2 −→ channel 2 −→ Y2
But
I (X1 , X2 ) : (Y1 ,Y2 ) = h(Y1 ,Y2 ) − h Y1 ,Y2 |X1 , X2
≤ h(Y1 ) + h(Y2 ) − h(Y1 |X1 ) − h(Y2 |X2 )
= I(X1 : Y1 ) + I(X2 : Y2 );
equality applies iff X1 and X2 are independent. Thus, C = C1 + C2 and the max-
imising p(X1 ,X2 ) is pX1 × pX2 where pX1 and pX2 are maximisers for I(X1 : Y1 ) and
I(X2 : Y2 ).
(c)
channel 1 −→ Y1
X
channel 2 −→ Y2
Here,
C = sup I X : (Y1 : Y2 )
pX
and
I (Y1 : Y2 ) : X = h(X) − h X|Y1 ,Y2
≥ h(X) − min h(X|Y j ) = min I(X : Y j ).
j=1,2 j=1,2
84 Essentials of Information Theory
Thus, C ≥ max[C1 ,C2 ]. A strict inequality may occur: take an example from part
(a). Here, Ci = 1 − h(δ , 1 − δ ). Also,
I (Y1 ,Y2 ) : X = h(Y1 ,Y2 ) − h Y1 ,Y2 |X
= h(Y1 ,Y2 ) − h(Y1 |X) − h(Y2 |X)
= h(Y1 ,Y2 ) − 2h(δ , 1 − δ ).
with
h(Y1 ,Y2 ) = 1 + h 2δ (1 − δ ), 1 − 2δ (1 − δ ) ,
and
I (Y1 ,Y2 ) : X = 1 + h 2δ (1 − δ ), 1 − 2δ (1 − δ ) − 2h(δ , 1 − δ )
> 1 − h(δ , 1 − δ ) = Ci .
Hence, C > Ci , i = 1, 2.
(d)
X1 channel 1 −→ Y1
→ X : X1 or X2 →
X2 channel 2 −→ Y2
The difference with part (c) is that every second only one symbol is sent, either
to channel 1 or 2. If we fix probabilities α and 1 − α that a given symbol is sent
through a particular channel then
h(Y ) = − ∑ α pY1 (y) log α pY1 (y) − ∑(1 − α )pY2 (y) log(1 − α )pY2 (y)
y y
= −α log α − (1 − α ) log(1 − α ) + α h(Y1 ) + (1 − α )h(Y2 )
1.4 Channels of information transmission 85
and
h(Y |X) = − ∑ α pX1 ,Y1 (x, y) log pY1 |X1 (y|x)
x,y
− ∑ (1 − α )pX2 ,Y2 (y|x) log pY2 |X2 (y|x)
x,y
= α h(Y1 |X1 ) + (1 − α )h(Y2 |X2 )
Worked Example 1.4.32 A spy sends messages to his contact as follows. Each
hour either he does not telephone, or he telephones and allows the telephone to ring
a certain number of times – not more than N , for fear of detection. His contact does
not answer, but merely notes whether or not the telephone rings, and, if so, how
many times. Because of deficiencies in the telephone system, calls may fail to be
properly connected; the correct connection has probability p, where 0 < p < 1, and
is independent for distinct calls, but the spy has no means of knowing which calls
reach his contact. If connection is made, then the number of rings is transmitted
correctly. The probability of a false connection from another subscriber at a time
when no call is made may be neglected. Write down the channel matrix for this
channel and calculate the capacity explicitly. Determine a condition on N in terms
of p which will imply, with optimal coding, that the spy will always telephone.
Solution The channel alphabet is {0, 1, . . . , N}: 0 ∼ non-call (in a given hour), and
j ≥ 1 ∼ j rings. The channel matrix is P(0|0) = 1, P(0| j) = 1 − p and P( j| j) = p,
1 ≤ j ≤ N, and h(Y |X) = −q(p log p + (1 − p) log(1 − p)), where q = pX (X ≥ 1).
Furthermore, given q, h(Y ) attains its maximum when
pq
pY (0) = 1 − pq, pY (k) = , 1 ≤ k ≤ N.
N
Maximising I(X : Y ) = h(Y ) − h(Y |X) in q yields p(1 − p)(1−p)/p × (1 − pq) =
pq/N or
⎡ ⎤
(1−p)/p −1
1 1 1
q = min ⎣ 1+ , 1⎦.
p Np 1− p
86 Essentials of Information Theory
1
The condition q = 1 is equivalent to log N ≥ − log(1 − p), i.e.
p
1
N≥ .
(1 − p)1/p
under the assumption that the integral is absolutely convergent. As in the discrete
case, hdiff (X) may be considered as a functional of the density p : x ∈ Rn → R+ =
[0, ∞). The difference is however that hdiff (X) may be negative, e.g. for a uniform
1
distribution on [0, a], hdiff (X) = − 0a dx(1/a) log(1/a) = log a < 0 for a < 1. [We
write x instead of x when x ∈ R.] The relative, joint and conditional differential
entropy are defined similarly to the discrete case:
0
p (x)
hdiff (X||Y ) = Ddiff (p||p ) = − p(x) log dx, (1.5.2)
p(x)
0
hdiff (X,Y ) = − pX,Y (x, y) log pX,Y (x, y)dxdy, (1.5.3)
0
hdiff (X|Y ) = − pX,Y (x, y) log pX|Y (x|y)dxdy
(1.5.4)
= hdiff (X,Y ) − hdiff (Y ),
again under the assumption that the integrals are absolutely convergent. Here, pX,Y
is the joint probability density and pX|Y the conditional density (the PDF of the
conditional distribution). Henceforth we will omit the subscript diff when it is clear
what entropy is being addressed. The assertions of Theorems 1.2.3(b),(c), 1.2.12,
and 1.2.18 are carried through for the differential entropies: the proofs are com-
pletely similar and will not be repeated.
Remark 1.5.2 Let 0 ≤ x ≤ 1. Then x can be written as a sum ∑ αn 2−n where
n≥1
αn (= αn (x)) equals 0 or 1. For ‘most’ of the numbers x the series is not reduced to a
finite sum (that is, there are infinitely many n such that αn = 1; the formal statement
1.5 Differential entropy and its properties 87
is that the (Lebesgue) measure of the set of numbers x ∈ (0, 1) with infinitely many
αn (x) = 1 equals one). Thus, if we want to ‘encode’ x by means of binary digits we
would need, typically, a codeword of an infinite length. In other words, a typical
value for a uniform random variable X with 0 ≤ X ≤ 1 requires infinitely many bits
for its ‘exact’ description. It is easy to make a similar conclusion in a general case
when X has a PDF fX (x).
However, if we wish to represent the outcome of the random variable X with
an accuracy of first n binary digits then we need, on average, n + h(X) bits where
h(X) is the differential entropy of X. Differential entropies can be both positive
and negative, and can even be −∞. Since h(X) can be of either sign, n + h(X) can
be greater or less than n. In the discrete case the entropy is both shift and scale
invariant since it depends only on probabilities p1 , . . . , pm , not on the values of the
random variable. However, the differential entropy is shift but not scale invariant
as is evident from the identity (cf. Theorem 1.5.7)
1 1 −1
p(x) =
1/2 exp − x − μ ,C (x − μ ) , x ∈ Rd .
2
(2π )d detC
0
1 log e −1
− p(x) − log (2π ) detC − d
x − μ ,C (x − μ ) dx
Rd 2 2
log e 1
= E ∑(xi − μi )(x j − μ j ) C−1 i j + log (2π )d detC
2 i, j 2
log e −1 1
= ∑
2 i, j
C i j E(xi − μi )(x j − μ j ) + log (2π )d detC
2
log e −1 1
= ∑
2 i, j
C i j C ji + log (2π )d detC
2
d log e 1 1
= + log (2π )d detC = log (2π e)d detC .
2 2 2
Theorem 1.5.5 For a random vector X = (X1 , . . . , Xd ) with mean μ and covari-
ance matrix C = (Ci j ) (i.e. Ci j = E (Xi − μi )(X j − μ j )] = C ji ),
1
h(X) ≤ log (2π e)d detC , (1.5.6)
2
Proof Let p(x) be the PDF of X and p0 (x) the normal density with mean μ
and covariance matrix C. Without loss of generality assume μ = 0. Observe that
log p0 (x) is, up to an additive constant term, a quadratic form in xk . Furthermore,
1.5 Differential entropy and its properties 89
1 0
1
for each monomial xi x j , dxp (x)xi x j = dxp(x)xi x j = Ci j = C ji , and the moment
of quadratic form log p0 (x) are equal. We have
0
p(x)
0 ≤ D(p||p0 ) (by Gibbs) = p(x) log dx
1 p0 (x)
= −h(p) − p(x) log p0 (x)dx
1
= −h(p) − p0 (x) log p0 (x)dx
(by the above remark) = −h(p) + h(p0 ).
(a) Show that the exponential density maximises the differential entropy among
the PDFs on [0, ∞) with given mean, and the normal density maximises the
differential entropy among the PDFs on R with a given variance.
Moreover, let X = (X1 , . . . , Xd )T be a random vector with
EX = 0 and
EXi X j = Ci j , 1 ≤ i, j ≤ d . Then hdiff (X) ≤ 2 log (2π e)d det(Ci j ) , with equal-
1
Solution (a) For the Gaussian case, see Theorem 1.5.5. In the exponential
case, by the Gibbs inequality, for any random variable Y with PDF f (y),
1
f (y) log f (y)eλ y /λ dy ≥ 0 or
The value of EX is not essential for h(X) as the following theorem shows.
Theorem 1.5.7
(a) The differential entropy is not changed under the shift: for all y ∈ Rd ,
h(X + y) = h(X).
(b) The differential entropy changes additively under multiplication:
h(aX) = h(X) + log |a|, for all a ∈ R.
Furthermore, if A = (Ai j ) is a d × d non-degenerate matrix, consider the affine
transformation x ∈ Rd → Ax + y ∈ Rd .
(c) Then
h(AX + y) = h(X) + log | det A|. (1.5.7)
Proof The proof is straightforward and left as an exercise
Worked Example 1.5.8 (The data-processing inequality for the relative entropy)
Let S be a finite set, and Π = (Π(x, y), x, y ∈ S) be a stochastic kernel (that is, for
all x, y ∈ S, Π(x, y) ≥ 0 and ∑y∈S Π(x, y) = 1; in other words, Π(x, y) is a transi-
tion probability in a Markov chain). Prove that D(p1 Π||p2 Π) ≤ D(p1 ||p2 ) where
pi Π(y) = ∑x∈S pi (x)Π(x, y), y ∈ S (that is, applying a Markov operator to both
probability distributions cannot increase the relative entropy).
Extend this fact to the case of the differential entropy.
Solution In the discrete case Π is defined by a stochastic matrix (Π(x, y)). By the
log-sum inequality (cf. PSE II, p. 426), for all y
∑ p1 (w)Π(w, y)
∑ p1 (x)Π(x, y) log w∑ p2 (z)Π(z, y)
x
z
p1 (x)Π(x, y)
≤ ∑ p1 (x)Π(x, y) log
x p2 (x)Π(x, y)
p1 (x)
= ∑ p1 (x)Π(x, y) log .
x p2 (x)
Taking summation over y we obtain
∑ p1 (w)Π(w, y)
D(p1 Π||p2 Π) = ∑∑ p1 (x)Π(x, y) log
w
x y ∑ p2 (z)Π(z, y)
z
p1 (x)
≤ ∑ ∑ p1 (x)Π(x, y) log = D(p1 ||p2 ).
x y p2 (x)
Conversely, for any positive definite matrix C there exists a PDF for which C is a
covariance matrix, e.g. a multivariate normal distribution (if C is not strictly posi-
tive definite, the distribution is degenerate).
Solution Take two positive definite matrices C(0) and C(1) and λ ∈ [0, 1]. Let X(0)
and X(1) be two multivariate normal vectors, X(i) ∼ N(0,C(i) ). Set, as in the proof
of Theorem 1.2.18, X = X(Λ) , where the random variable Λ takes two values, 0 and
1, with probabilities λ and 1 − λ , respectively, and is independent of X(0) and X(1) .
Then the random variable X has covariance C = λ C(0) + (1 − λ )C(1) , although X
need not be normal. Thus,
1 1
log 2π e)d + log det λ C(0) + (1 − λ )C(1)
2 2
1
= log (2π e)d detC ≥ h(X) (by Theorem 1.5.5)
2
≥ h(X|Λ) (by Theorem 1.2.11)
λ 1−λ
= log (2π e)d detC(0) + log (2π e)d detC(1)
2 2
1
= log 2π e) + λ log detC(0) + (1 − λ ) log detC(1) .
d
2
This property is often called the Ky Fan inequality and was proved initially in
1950 by using much more involved methods. Another famous inequality is due to
Hadamard:
It is easy to see that for d = 1 (1.5.9) and (1.5.10) are equivalent. In general,
inequality (1.5.9) implies (1.5.10) via (1.5.13) below which can be established
independently. Note that inequality (1.5.10) may be true or false for discrete ran-
dom variables. Consider the following example: let X ∼ Y be independent with
PX (0) = 1/6, PX (1) = 2/3, PX (2) = 1/6. Then
2 16 18
h(X) = h(Y ) = ln 6 − ln 4, h(X +Y ) = ln 36 − ln 8 − ln 18.
3 36 36
By inspection, e2h(X+Y ) = e2h(X) + e2h(Y ) . If X and Y are non-random constants
then h(X) = h(Y ) = h(X +Y ) = 0, and the EPI is obviously violated. We conclude
1.5 Differential entropy and its properties 93
The entropy–power inequality plays a very important role not only in informa-
tion theory and probability but in geometry and analysis as well. For illustration
we present below the famous Brunn–Minkowski theorem that is a particular case
of the EPI. Define the set sum of two sets as
A1 + A2 = {x1 + x2 : x1 ∈ A1 , x2 ∈ A2 }.
By definition A + 0/ = A.
Theorem 1.5.12 (Brunn–Minkowski)
(a) Let A1 and A2 be measurable sets. Then the volume
V (A1 + A2 )1/d ≥ V (A1 )1/d +V (A2 )1/d . (1.5.11)
(b) The volume of the set sum of two sets A1 and A2 is greater than the volume
of the set sum of two balls B1 and B2 with the same volume as A1 and A2 ,
respectively:
V (A1 + A2 ) ≥ V (B1 + B2 ), (1.5.12)
where B1 and B2 are spheres with V (A1 ) = V (B1 ) and V (A2 ) = V (B2 ).
Worked Example 1.5.13 Let C1 ,C2 be positive-definite d × d matrices. Then
[det(C1 +C2 )]1/d ≥ [detC1 ]1/d + [detC2 ]1/d . (1.5.13)
But, more interestingly, the following intermediate inequality holds true. Let
X1 , X2 , . . . , Xn+1 be IID square-integrable random variables. Then
1 n+1 2h i∑ Xi /d
e2h X1 +···+Xn /d
≥ ∑ e =j . (1.5.16)
n j=1
In the example, the three extra codewords must be 00, 01, 10 (we cannot take
11, as then a sequence of ten 1s is not decodable). Reversing the order in every
codeword gives a prefix-free code. But prefix-free codes are decipherable. Hence,
the code is decipherable.
In conclusion, we present an alternative proof of necessity of Kraft’s inequal-
ity. Denote s = max si ; let us agree to extend any word in X to the length s,
say by adding some fixed symbol. If x = x1 x2 . . . xsi ∈ X , then any word of the
form x1 x2 . . . xsi ysi +1 . . . ys ∈ X because x is a prefix. But there are at most qs−si of
such words. Summing up on i, we obtain that the total number of excluded words
is ∑mi=1 q
s−si . But it cannot exceed the total number of words qs . Hence, (1.6.1)
follows:
m
qs ∑ q−si ≤ qs .
i=1
Problem 1.2 Consider an alphabet with m letters each of which appears with
probability 1/m. A binary Huffman code is used to encode the letters, in order to
minimise the expected codeword-length (s1 + · · · + sm )/m where si is the length of
a codeword assigned to letter i. Set s = max[si : 1 ≤ i ≤ m], and let n be the number
of codewords of length .
(a) Show that 2 ≤ ns ≤ m.
(b) For what values of m is ns = m?
(c) Determine s in terms of m.
(d) Prove that ns−1 + ns = m, i.e. any two codeword-lengths differ by at most 1.
(e) Determine ns−1 and ns .
(f) Describe the codeword-lengths for an idealised model of English (with m =
27) where all the symbols are equiprobable.
(g) Let now a binary Huffman code be used for encoding symbols 1, . . . , m occur-
ring with probabilities p1 ≥ · · · ≥ pm > 0 where ∑ p j = 1. Let s1 be the length
1≤ j≤m
of a shortest codeword and sm of a longest codeword. Determine the maximal and
minimal values of sm and s1 , and find binary trees for which they are attained.
Solution (a) Bound ns ≥ 2 follows from the tree-like structure of Huffman codes.
More precisely, suppose ns = 1, i.e. a maximum-length codeword is unique and
corresponds to say letter i. Then the branch of length s leading to i can be pruned at
the end, without violating the prefix-free condition. But this contradicts minimality.
1.6 Additional problems for Chapter 1 97
4 1
m c i m
a b 2
m
Figure 1.10
Bound ns ≤ m is obvious. (From what is said below it will follow that ns is always
even.)
(b) ns = m means all codewords are of equal length. This, obviously, happens iff
m = 2k , in which case s = k (a perfect binary tree Tk with 2k leaves).
(c) In general,
'
log m, if m = 2k ,
s=
log m, if m = 2k .
The case m = 2k was discussed in (b), so let us assume that m = 2k . Then 2k < m <
2k+1 where k = log m. This is clear from the observation that the binary tree for
probabilities 1/m (we will call it a binary m-tree Bm ) contains the perfect binary
tree Tk but is contained in Tk+1 . Hence, s is as above.
(d) Indeed, in the case of an equidistribution 1/m, . . ., 1/m it is impossible to have
a branch of the tree whose length differs from the maximal value s by two or more.
In fact, suppose there is such a branch, Bi , of the binary tree leading to some letter i
and choose a branch M j of maximal length s leading to a letter j. In a conventional
terminology, letter j was engaged in s merges and i in t ≤ s − 2 merges. Ultimately,
the branches Bi and M j must merge, and this creates a contradiction. For example,
the ‘least controversial’ picture is still ‘illegal’; see Figure 1.10. Here, vertex i
carrying probability 1/m should have been joined with vertex a or b carrying each
probability 2/m, instead of joining a and b (as in the figure), as it creates vertex c
carrying probability 4/m.
(e) We conclude that (i) for m = 2k , the m-tree Bm coincides with Tk , (ii) for m = 2k
we obtain Bm in the following way. First, take a binary tree Tk where k = [log m],
with 1 ≤ m − 2k < 2k . Then m − 2k leaves of Tk are allowed to branch one step
98 Essentials of Information Theory
k+1 _
2 m
_ k
2 (m 2 )
Figure 1.11
further: this generates 2(m − 2k ) = 2m − 2k+1 leaves of tree Tk+1 . The remaining
2k − (m − 2k ) = 2k+1 − m leaves of Tk are left intact. See Figure 1.11. So,
ns−1 = 2k+1 − m, ns = 2m − 2k+1 , where k = [log m].
(f) In the example of English, with equidistribution among m = 27 = 16 + 11 sym-
bols, we have 5 codewords of length 4 and 22 codewords of length 5. The average
codeword-length is
5 × 4 + 22 × 5 130
= ≈ 4.8.
27 27
3 4
(g) The minimal value for s1 is 1 (obviously). The maximal value is log2 m ,
i.e. the positive integer l with 2l < m ≤ 2l+1 . The maximal value for sm is m −
1 (obviously). The minimal value is log2 m, i.e. the natural l such that 2l−1 <
m ≤ 2l .
The tree that yields s1 = 1 and sm = m − 1 is given in Figure 1.12.
It is characterised by
i f (i) si
1 0 1
2 10 2
.. .. ..
. . .
m−1 11. . . 10 m−1
m 11. . . 11 m−1
and is generated when
p1 > p2 + · · · + pm > 2(p3 + · · · + pm ) > · · · > 2m−1 pm .
1.6 Additional problems for Chapter 1 99
1 2 m–1 m
Figure 1.12
m = 16
Figure 1.13
m = 18
Figure 1.14
Here the capacity is understood as the supremum over all reliable information rates
while the RHS is defined as
max I(X : Y )
X
where the random variables X and Y represent an input and the corresponding
output.
The binary erasure channel keeps an input letter 0 or 1 intact with probability
1 − p and turns it to a splodge ∗ with probability p. An input random variable X is
0 with probability α and 1 with probability 1 − α . Then the output random variable
Y takes three values:
P(Y = 0) = (1 − p)α ,
P(Y = 1) = (1 − p)(1 − α ),
P(Y = ∗) = p.
Therefore,
capacity = maxα I(X : Y )
= maxα [h(X) − h(X|Y )]
= maxα [h(α ) − ph(α )]
= (1 − p) maxα h(α ) = 1 − p,
1.6 Additional problems for Chapter 1 101
Problem 1.4 Let X and Y be two discrete random variables with corresponding
cumulative distribution functions (CDF) PX and PY .
(a) Define the conditional entropy h(X|Y ), and show that it satisfies
h(X|Y ) ≤ h(X),
(c) Let hPo (λ ) be the entropy of a Poisson random variable Po(λ ). Show that
hPo (λ ) is a non-decreasing function of λ > 0.
Then
P(X = x,Y = y)
0 ≤ ∑ P(X = x,Y = y) log
x,y P(X = x)P(Y = y)
= ∑ P(X = x,Y = y) log P(X = x,Y = y)
x,y
(b) Define a random variable T equal to 0 with probability α and 1 with probability
1 − α . Then the random variable Z has the distribution W (α ) where
'
X, if T = 0,
Z=
Y, if T = 1.
By part (a),
h(Z|T ) ≤ h(Z),
with the LHS = α h(X) + (1 − α )h(Y ), and the RHS = h(W (α )).
(c) Observe that for independent random variables X and Y , h(X + Y |X) =
h(Y |X) = h(Y ). Hence, again by part (a),
Using this fact, for all λ1 < λ2 , take X ∼ Po(λ1 ), Y ∼ Po(λ2 − λ1 ), independently.
Then
Problem 1.5 What does it mean to transmit reliably at rate R through a binary
symmetric channel (MBSC) with error-probability p? Assuming Shannon’s sec-
ond coding theorem (SCT), compute the supremum of all possible reliable trans-
mission rates of an MBSC. What happens if: (i) p is very small; (ii) p = 1/2; or
(iii) p > 1/2?
By the SCT, the so-called operational channel capacity is sup R = maxα I(X : Y ),
the maximum information transmitted per input symbol. Here X is a Bernoulli
random variable taking values 0 and 1 with probabilities α ∈ [0, 1] and 1 − α , and
Y is the output random variable for the given input X. Next, I(X : Y ) is the mutual
entropy (information):
Observe that the binary entropy function h(x) ≤ 1 with equality for x = 1/2.
Selecting α = 1/2 conclude that the MBSC with error probability p has the
capacity
x∈A x∈A
x∈A
Solution (i) Denote x = ln π , and taking the logarithm twice obtain the inequality
x − 1 > ln x. This is true as x > 1, hence eπ > π e .
(ii) Assume without loss of generality that ai > 0 and bi > 0. The function g(x) =
x log x is strictly convex. Hence, by the Jensen inequality for any coefficients ∑ ci =
1, ci ≥ 0,
∑ ci g(xi ) ≥ g ∑ ci xi .
−1
Selecting ci = bi ∑ b j and xi = ai /bi , we obtain
j
⎛ ⎞
∑ ai
ai ai ai
∑ ∑ b j log bi ≥ ∑ ∑ b j log ⎝ i ⎠
i i ∑bj
i
≥ c ∑ p(x) 1 − = c 1 − q(B) ≥ 0.
x∈B p(x)
Equality holds iff q(x) ≡ p(x). Next, write
f (x)
f (A) = ∑ f (x), p(x) = 1(x ∈ A),
x∈A f (A)
g(x)
g(A) = ∑ g(x), q(x) = 1(x ∈ A).
x∈A g(A)
Then
f (x) f (A)p(x)
∑ f (x) log g(x) = f (A) ∑ p(x) log g(A)q(x)
x∈A x∈A
p(x) f (A)
= f (A) ∑ p(x) log q(x) + f (A) log
g(A)
x∈A
: ;< =
≥ by the previous part
f (A)
≥ f (A) log .
g(A)
1.6 Additional problems for Chapter 1 105
then
p(x) p(x)
D(p||q) = ∑ p(x) log + ∑ p(x) log
x∈A q(x) x∈Ac q(x)
p(A) p(A c)
≥ p(A) log + p(Ac ) log
q(A) q(Ac ) 2
2 log2 e
≥ 2 log2 e p(A) − q(A) =
2 ∑ |p(x) − q(x)| .
x
Problem 1.7 (a) Define the conditional entropy, and show that for random vari-
ables U and V the joint entropy satisfies
h(U,V ) = h(V |U) + h(U).
Given random variables X1 , . . . , Xn , by induction or otherwise prove the chain rule
n
h(X1 , . . . Xn ) = ∑ h(Xi |X1 , . . . , Xi−1 ). (1.6.7)
i=1
where h(XS ) = h(Xs1 , . . . , Xsk ) for S = {s1 , . . . , sk }. Assume that, for any i,
h(Xi |XS ) ≤ h(Xi |XT ) when T ⊆ S, and i ∈/ S.
By considering terms of the form
h(X1 , . . . , Xn ) − h(X1 , . . . Xi−1 , Xi+1 , . . . , Xn )
(n) (n)
show that hn ≤ hn−1 .
(k) (k) (n) (n)
Using the fact that hk ≤ hk−1 , show that hk ≤ hk−1 , for k = 2, . . . , n.
(c) Let β > 0, and define
> n
tk = ∑ e β h(XS )/k
(n)
.
S:|S|=k
k
Prove that
(n) (n) (n)
t1 ≥ t2 ≥ · · · ≥ tn .
106 Essentials of Information Theory
and, in general,
h(X1 , . . . , Xn )
= h(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) + h(Xi |X1 , . . . , Xi−1 , Xi+1 , . . . , Xn )
≤ h(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) + h(Xi |X1 , . . . , Xi−1 ), (1.6.9)
because
h(Xi |X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) ≤ h(Xi |X1 , . . . , Xi−1 ).
The second sum in the RHS equals h(X1 , . . . , Xn ) by the chain rule (1.6.7). So,
n
(n − 1)h(X1 , . . . , Xn ) ≤ ∑ h(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ).
i=1
(n) (n)
This implies that hn ≤ hn−1 , since
In general, fix a subset S of size k in {1, . . . , n}. Writing S(i) for S \{i}, we obtain
1 1 h(X[S(i)])
h[X(S)] ≤ ∑ ,
k k i∈S k − 1
1.6 Additional problems for Chapter 1 107
Finally, each subset of size k − 1, S(i), appears [n − (k − 1)] times in the sum
(n)
(1.6.11). So, we can write hk as
h[X(T )] n − (k − 1) n
∑
T ⊂{1,...,n}: |T |=k−1 k − 1 k k
h[X(T )] n
= ∑ = hnk−1 .
T ⊂{1,...,n}: |T |=k−1 k − 1 k − 1
(c) Starting from (1.6.11), exponentiate and then apply the arithmetic
mean/geometric mean inequality, to obtain for S0 = {1, 2, . . . , n}
1 n β h(S0 (i))/(n−1)
eβ h(X(S0 ))/n ≤ eβ [h(S0 (1))+···+h(S0 (n))]/(n(n−1)) ≤ ∑e
n i=1
(n) (n)
which is equivalent to tn ≤ tn−1 . Now we use the same argument as in (b), taking
(n) (n)
the average over all subsets to prove that for all k ≤ n,tk ≤ tk−1 .
Problem 1.8 Let p1 , . . . , pn be a probability distribution, with p∗ = maxi [pi ].
Prove that
(i) −∑ pi log2 pi ≥ −p∗ log2 p∗ − (1 − p∗ ) log2 (1 − p∗ );
i
(ii) −∑ pi log2 pi ≥ log2 (1/p∗ );
i
(iii) −∑ pi log2 pi ≥ 2(1 − p∗ ).
i
The random variables X and Y with values x and y from finite ‘alphabets’ I and
J represent the input and output of a transmission channel, with the conditional
probability P(x | y) = P(X = x | Y = y). Let h(P(· | y)) denote the entropy of the
conditional distribution P(· | y), y ∈ J , and h(X | Y ) denote the conditional entropy
of X given Y . Define the ideal observer decoding rule as a map f : J → I such
that P( f (y) | y) = maxx∈I P(x | y) for all y ∈ J . Show that under this rule the error-
probability
πer (y) = ∑ P(x | y)
x∈I: x= f (y)
1
satisfies πer (y) h(P(· | y)), and the expected error satisfies
2
1
Eπer (Y ) ≤ h(X | Y ).
2
108 Essentials of Information Theory
Solution Bound (i) follows from the pooling inequality. Bound (ii) holds as
1 1
− ∑ pi log pi ≥ ∑ pi log ∗
= log ∗ .
i i p p
To check (iii), it is convenient to use (i) for p∗ ≥ 1/2 and (ii) for p∗ ≤ 1/2. Assume
first that p∗ ≥ 1/2. Then, by (i),
h(p1 , . . . , pn ) ≥ h (p∗ , 1 − p∗ ) .
The function x ∈ (0, 1) → h(x, 1 − x) is concave, and its graph on (1/2, 1) lies
strictly above the line x → 2(1 − x). Hence,
h(p1 , . . . , pn ) ≥ 2 (1 − p∗ ) .
Problem 1.9 Define the information rate H and the asymptotic equipartition
property of a source. Calculate the information rate of a Bernoulli source. Given a
memoryless binary channel, define the channel capacity C. Assuming the statement
of Shannon’s second coding theorem (SCT), deduce that C = sup pX I(X : Y ).
An erasure channel keeps a symbol intact with probability 1 − p and turns it into
an unreadable splodge with probability p. Find the capacity of the erasure channel.
where Psource stands for the source probability distribution, and one uses a code fN
and a decoding rule fN . A value R ∈ (0, 1) is said to be a reliable transmission rate
if, given that Psource is an equidistribution over a set UN of source strings u with
UN = 2N[R+o(1)] , there exist fN and fN such that
1
lim ∑ chP
f N (Y(N)
)
= u | f N (u) sent = 0.
N→∞ UN
u∈UN
The conditional entropy h(Y |X) = h(p, 1 − p) does not depend on pX . Thus,
Problem 1.10 Define Huffman’s encoding rule and prove its optimality among
decipherable codes. Calculate the codeword lengths for the symbol-probabilities
1 1 1 1 1 1 1 1
5 , 5 , 6 , 10 , 10 , 10 , 10 , 30 .
Prove, or provide a counter-example to, the assertion that if the length of a code-
word from a Huffman code equals l then, in the same code, there exists another
codeword of length l such that | l − l | ≤ 1.
110 Essentials of Information Theory
Problem 1.11 A memoryless channel with the input alphabet {0, 1} repro-
duces the symbol correctly with probability (n − 1)/n2 and reverses it with prob-
ability 1/n2 . [Thus, for n = 1 the channel is binary and noiseless.] For n ≥ 2 it
also produces 2(n − 1) sorts of ‘splodges’, conventionally denoted by αi and βi ,
i = 1, . . . , n − 1, with similar probabilities: P(αi |0) = (n − 1)/n2 , P(βi |0) = 1/n2 ,
P(βi |1) = (n − 1)/n2 , P(αi |1) = 1/n2 . Prove that the capacity Cn of the channel
increases monotonically with n, and limn→∞ Cn = ∞. How is the capacity affected
if we simply treat splodges αi as 0 and βi as 1?
Solution Write
n n!
P(Sn = k) = (1 − p)n−k pk = (1 − p)n−k pk
? k k!(n − k)!
n nn
= (1 − p)n−k pk
2π k(n − k) kk (n
− k) n−k
=$ exp − nh p (y)
2π ny(1 − y)
(b) Consider two cases, (i) p∗ ≥ 1/2 and (ii) p∗ ≤ 1/2. In case (i), by pooling
inequality,
1 1
h(X) ≥ h(p∗ , 1 − p∗ ) ≥ (1 − p∗ ) log ≥ (1 − p∗ ) log = 2(1 − p∗ )
p∗ (1 − p∗ ) 4
114 Essentials of Information Theory
Solution Write
= ∑ pi sign (i − i )
i
= E sgn( − ) ≤ E 2− − 1 ,
i i i
≤ 1 − ∑ 2−i = 1 − 1 = 0
i
X(N) , the random word of length N sent through the channel, and Y(N) , the received
word, and where the supremum is over the probability distribution of X(N) . Prove
that C ≤ lim supN→∞ CN .
The last bound here follows from the generalised Fano inequality
h X(N) | f Y(N) ≤ −e(N) log e(N) − 1 − e(N) log 1 − e(N)
+e(N) log U (N) − 1
≤ 1 + e(N) log U (N) − 1 .
Now, from (1.6.22),
NCN ≥ N R + o(1) − 1 − e(N) log 2N[R+o(1)] − 1 ,
i.e.
Also,
h(Y |X) = − ∑ pX (x) ∑ P(y|x) log P(y|x)
x=0,1 y
= −pX (1) log 1/3 = (1 − p) log 3.
Thus,
1 + 2p 1 + 2p 2(1 − p) 1− p
I(X : Y ) = − log − log − (1 − p) log 3.
3 3 3 3
Differentiating yields
d
I(X : Y ) = −2/3(log (1/3 + 2p/3) + 2/3 log (1/3 − p/3) + log 3.
dp
Hence, the maximum max I(X : Y ) is found from relation
2 1− p
log + log 3 = 0.
3 1 + 2p
This yields
1− p 3
log = − log 3 := b,
1 + 2p 2
and
1− p
= 2b , i.e. 1 − 2b = p 1 + 2b+1 .
1 + 2p
The answer is
1 − 2b
p= .
1 + 2b+1
For the last part, write
for any Y that is a function of Y ; the equality holds iff Y and X are conditionally
independent, given Y . It is the case of our channel, hence the suggestion leaves the
capacity the same.
Problem 1.17 (a) Given a pair of discrete random variables X , Y , define the
joint and conditional entropies h(X,Y ) and h(X|Y ).
(b) Prove that h(X,Y ) ≥ h(X|Y ) and explain when equality holds.
(c) Let 0 < δ < 1, and prove that
h(X|Y ) ≥ log(δ −1 ) P(q(X,Y ) ≤ δ ),
where q(x, y) = P(X = x|Y = y). For which δ and for which X , Y does equality
hold here?
1.6 Additional problems for Chapter 1 119
where
q(x, y) = P(X = x|Y = y).
The joint entropy is given by
h(X,Y ) = − ∑ P(X = x,Y = y) log P(X = x,Y = y).
x,y
Solution The asymptotic equipartition property for a Bernoulli source states that
the number of distinct strings (words) of length n emitted by the source is ‘typi-
cally’ 2nH+o(n) , and they have ‘nearly equal’ probabilities 2−nH+o(n) :
lim P 2−n(H+ε ) ≤ Pn (U(n) ) ≤ 2−n(H−ε ) = 1.
n→∞
Here, H = h(p1 , . . . , pn ).
Denote
( )
Tn (= Tn (ε )) = u(n) : 2−n(H+ε ) ≤ Pn (u(n) ) ≤ 2−n(H−ε )
By the definition of the channel capacity, the words u(n) ∈ Tn (ε ) may be encoded
−1
by binary codewords of length R (H + ε ) and sent reliably through a memoryless
symmetric channel with matrix
1 − p∗ p∗
p∗ 1 − p∗
of the input binary symbol X; the conditional distribution of the output symbol Y
is given by
'
1 − p∗ , y = x,
P(Y = y|X = x) =
p∗ , y = x.
We see that
independently of pX . Hence,
Therefore, if
All other elements are zero. Determine the information rate of the source.
Denote the transition matrix thus specified by Pm . Consider
a source in an
Pm 0
alphabet of m + n characters whose transition matrix is , where the zeros
0 Pn
indicate zero matrices of appropriate size. The initial character is supposed uni-
formly distributed over the alphabet. What is the information rate of the source?
122 Essentials of Information Theory
max Hm , Hn = Hm∨n .
Solution For the second part, the Markov property implies that
P(U1 = u1 |U2 = u2 ,U3 = u3 ) = P(U1 = u1 |U2 = u2 ).
Hence,
P(U1 = u1 |U2 = u2 ,U3 = u3 )
= E − log = I(U1 : U2 ).
P(U1 = u1 )
Since
I(U1 : (U2 ,U3 )) ≥ I(U1 : U3 ),
the result follows.
1.6 Additional problems for Chapter 1 123
Problem 1.21 Construct a Huffman code for a set of 5 messages with probabil-
ities as indicated below
Message 1 2 3 4 5
Probability 0.1 0.15 0.2 0.26 0.29
Solution
Message 1 2 3 4 5
Probability 0.1 0.15 0.2 0.026 0.029
Codeword 101 100 11 01 00
Problem 1.22 State the first coding theorem (FCT), which evaluates the infor-
mation rate for a source with suitable long-run properties. Give an interpretation of
the FCT as an asymptotic equipartition property. What is the information rate for a
Bernoulli source?
Consider a Bernoulli source that emits symbols 0, 1 with probabilities 1 − p and
p respectively, where 0 < p < 1. Let η (p) = −p log p − (1 − p) log(1 − p) and let
ε > 0 be fixed. Let U(n) be the string consisting of the first n symbols emitted by
the source. Prove that there is a set Sn of possible values of U(n) such that
2
(n) p p(1 − p)
P U ∈ Sn ≥ 1 − log ,
1− p nε 2
and so that for each u(n) ∈ Sn the probability that P U(n) = u(n) lies between
2−n(h+ε ) and 2−n(h−ε ) .
Here '
1 − p, if U j = 0,
P(U j ) =
p, if U j = 1,
Pn (U(n) ) = ∏ P(U j ),
1≤ j≤n
and
Var ∑ log P(U j ) = ∑ Var log P(U j )
1≤ j≤n 1≤ j≤n
where
2
2
Var log P(U j ) = E log P(U j ) − E log P(U j )
2
= p(log p)2 + (1 − p)(log(1 − p))2 − p log p + (1 − p) log(1 − p)
2
p
= p(1 − p) log .
1− p
Hence, the bound (1.6.23) yields
P 2−n(h+ε ) ≤ Pn (U(n) ) ≤ 2−n(h−ε )
2
1 p
≥ 1 − 2 p(1 − p) log .
nε 1− p
It now suffices to set
Sn = {u(n) = u1 . . . un : 2−n(h+ε ) ≤ P(U(n) = u(n) ) ≤ 2−n(h−ε ) },
and the result follows.
Problem 1.23 The alphabet {1, 2, . . . , m} is to be encoded by codewords with
letters taken from an alphabet of q < m letters. State Kraft’s inequality for the word-
lengths s1 , . . . , sm of a decipherable code. Suppose that a source emits letters from
the alphabet {1, 2, . . . , m}, each letter occurring with known probability pi > 0. Let
S be the random codeword-length resulting from the letter-by-letter encoding of the
source output. It is desired to find a decipherable code that minimises the expected
2
S √
value of q . Establish the lower bound E q ≥
S
∑ pi , and characterise
1≤i≤m
when equality occurs.
Prove also that an optimal code for the above criterion must satisfy E qS <
2
√
q ∑ pi .
1≤i≤m
Hint: Use the Cauchy–Schwarz inequality: for all positive xi , yi ,
1/2 1/2
∑ xi yi ≤ ∑ xi2 ∑ y2i ,
1≤i≤m 1≤i≤m 1≤i≤m
Solution By Cauchy–Schwarz,
∑ pi = ∑ pi qsi /2 q−si /2
1/2 1/2
1≤i≤m 1≤i≤m
1/2 1/2 1/2
≤ −s ≤
∑ pi qsi
∑ q i
∑ pi qsi ,
1≤i≤m 1≤i≤m 1≤i≤m
pi = (cq−xi )2 , xi > 0,
∑ q−xi = 1 (so,
1/2
where ∑ pi = c). Take si to be the smallest integer ≥ xi .
1≤i≤m 1≤i≤m
Then ∑ q−si ≤ 1 and, again by Kraft, there exists a decipherable coding with
1≤i≤m
1/2
the codeword-length si . For this code, qsi −1 < qxi = c pi , and hence
matrix
dead 1−α α
live β 1−β
and the equilibrium probabilities
β α
1 − πL (dead) = , πL (live) =
α +β α +β
(assuming that α + β > 0). The received signal sequence follows a DTMC with
states 0 (dead), 1, 2, . . . and transition probabilities
q00 = 1 − α , q0 j = α p j ,
j, k ≥ 1.
q j0 = β , q jk = (1 − β )pk
This chain has a unique equilibrium distribution
β α
πRS (0) = , πRS ( j) = p j , j ≥ 1.
α +β α +β
Then the information rate of the received signal equals
HRS = − ∑ πRS ( j)q jk log q jk
j,k≥0
β
=− (1 − α ) log(1 − α ) + ∑ α p j log(α p j )
α +β
j≥1
α
−
α + β j≥1 ∑ p j β log β + (1 − β ) ∑ pk log (1 − β )pk
k≥1
α
= HL + HS .
α +β
Here HL is the entropy rate of the line state DTMC:
β
HL = − (1 − α ) log(1 − α ) + α log α
α +β
α
− (1 − β ) log(1 − β ) + β log β ,
α +β
and π = α /(α + β ).
Problem 1.25 Consider a Bernoulli source in which the individual character
can take value i with probability pi (i = 1, . . . , m). Let ni be the number of times the
character value i appears in the sequence u(n) = u1 u2 . . . un of given length n. Let
An be the smallest set of sequences u(n) which has total probability at least 1 − ε.
Show that each sequence in A_n satisfies the inequality
− ∑_i n_i log p_i ≤ nh + (nk/ε)^{1/2},
P(An ) ≥ 1 − ε .
Now, for the random string U^(n) = U_1 . . . U_n, let N_i be the number of appearances of value i, and set θ_j := −log p_{U_j}. Then
E θ_j = − ∑_{1≤i≤m} p_i log p_i := h
and
Var θ_j = E(θ_j)² − (E θ_j)² = ∑_{1≤i≤m} p_i (log p_i)² − ( ∑_{1≤i≤m} p_i log p_i )² := v.
Then
E ∑_{1≤j≤n} θ_j = nh  and  Var ∑_{1≤j≤n} θ_j = nv.
For an irreducible and aperiodic Markov source the assertion is similar, with
H = − ∑_{1≤i,j≤m} π_i p_{ij} log p_{ij},
and v ≥ 0 a constant given by v = lim sup_{n→∞} (1/n) Var( ∑_{1≤j≤n} θ_j ).
Solution If we disregard the condition that s_1, . . . , s_n are positive integers, the minimisation problem becomes
minimise ∑_i s_i p_i subject to s_i ≥ 0 and ∑_i a^{−s_i} ≤ 1 (Kraft).   (1.6.24)
Its solution (by the Lagrange method) is
s_i = − log_a p_i,  1 ≤ i ≤ n,   (1.6.25)
with optimal value
v_rel = − ∑_{1≤i≤n} p_i log_a p_i := h,
so that for any admissible integer word-lengths s*_1, . . . , s*_n,
h ≤ ∑_i s*_i p_i.
Now suppose the lengths are also required to satisfy the linear constraint
∑_{1≤i≤n} q_i s_i ≤ b.   (1.6.26)
The relaxed problem (1.6.24) complemented with (1.6.26) again can be solved by the Lagrange method. Here, if
− ∑_i q_i log_a p_i ≤ b
then adding the new constraint does not affect the minimiser of (1.6.24), i.e. the optimal positive s_1, . . . , s_n are again given by (1.6.25), and the optimal value is h.
Otherwise, i.e. when − ∑_i q_i log_a p_i > b, the new minimiser s_1, . . . , s_n is still unique (since the problem is still strong Lagrangian) and fulfils both constraints
∑_i a^{−s_i} = 1,   ∑_i q_i s_i = b.
In both cases, the optimal value vrel for the new relaxed problem satisfies h ≤ vrel .
Finally, the solution s*_1, . . . , s*_n to the integer-valued word-length problem
minimise ∑_i s_i p_i subject to s_i ≥ 1 integer, ∑_i a^{−s_i} ≤ 1 and ∑_i q_i s_i ≤ b   (1.6.27)
will satisfy
h ≤ v_rel ≤ ∑_i s*_i p_i,   ∑_i s*_i q_i ≤ b.
Problem 1.27 Suppose a discrete Markov source {X_t} has transition probabilities
p_{jk} = P(X_{t+1} = k | X_t = j).
Show that the information rate of the corrupted source equals
−α log α − β log β − α² ∑_{j,k} π_j ∑_{s≥1} β^{s−1} p^{(s)}_{jk} log p^{(s)}_{jk},
where p^{(s)}_{jk} is the s-step transition probability of the original DTMC.
Solution Denote the corrupted source sequence by {X̃_t}, with X̃_t = ∗ (a splodge) every time there was an erasure. Correspondingly, a string x̃_1^n from the corrupted source is produced from a string x_1^n of the original Markov source by replacing the obliterated digits with splodges. The probability p_n(x̃) = P( X̃_1^n = x̃_1^n ) of such a string is represented as a product of factors (1.6.28), the first of which depends on where the initial non-obliterated digit occurred in x̃_1^n (if at all). The subsequent factors contributing to (1.6.28) have a similar structure:
p_{x_{t−1} x_t} β   or   p^{(s)}_{x_{t−s} x_t} β^{s−1} α   or   1.
Here
N(α) = number of non-obliterated digits in X̃_1^n,
N(β) = number of obliterated digits in X̃_1^n,
M(i, j; s) = number of series of digits i ∗ · · · ∗ j in X̃_1^n of length s + 1.
As n → ∞, we have the convergence of the limiting frequencies (the law of large numbers applies):
N(α)/n → α,   N(β)/n → β,   M(i, j; s)/n → α β^{s−1} π_i p^{(s)}_{ij} α.
This yields
− (1/n) log p_n( X̃_1^n ) → −α log α − β log β − α² ∑_{i,j} π_i ∑_{s≥1} β^{s−1} p^{(s)}_{ij} log p^{(s)}_{ij}.
The source is a second-order Markov chain on {0, 1}, i.e. a DTMC with four states {00, 01, 10, 11}. The 4 × 4 transition matrix is
         00   01   10   11
  00  ( q_0  q_1   0    0  )
  01  (  0    0   q_1  q_0 )
  10  ( q_1  q_0   0    0  )
  11  (  0    0   q_0  q_1 )
and equals
(1/4) ∑_{α,β=0,1} h(q_0, q_1) = −q_0 log q_0 − q_1 log q_1.
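A quick numerical check (ours, not the book's): the uniform distribution on the four states is invariant, and the entropy rate of the chain reduces to the row entropy −q_0 log q_0 − q_1 log q_1. The value q_0 = 0.7 below is purely illustrative.

import math

q0, q1 = 0.7, 0.3
# Transition matrix on the states 00, 01, 10, 11 (in this order), as above.
P = [[q0, q1, 0, 0],
     [0, 0, q1, q0],
     [q1, q0, 0, 0],
     [0, 0, q0, q1]]
pi = [0.25] * 4
# Invariance: pi P = pi.
print([sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)])
# Entropy rate: sum over states of pi_i times the row entropy.
H = -sum(pi[i] * P[i][j] * math.log2(P[i][j]) for i in range(4) for j in range(4) if P[i][j] > 0)
print(H, -q0 * math.log2(q0) - q1 * math.log2(q1))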
where
P_Y(0) = P_X(1) p,
P_Y(1) = P_X(1)(1 − 2p) + P_X(2) p,
P_Y(2) = P_X(1) p + P_X(2)(1 − 2p) + P_X(3) p,      (1.6.29)
P_Y(3) = P_X(3)(1 − 2p) + P_X(2) p,
P_Y(4) = P_X(3) p.
The symmetry in (1.6.29) suggests that h(Y) is maximised when P_X(1) = P_X(3) = q and P_X(2) = 1 − 2q. So:
Denote by S^(n) the random codeword-length while encoding in blocks. The minimal expected word-length per source letter is e_n := (1/n) min E S^(n). By Shannon's NC theorem,
h^(n)/(n log q) ≤ e_n ≤ h^(n)/(n log q) + 1/n,
where q is the size of the original alphabet A. We see that, for large n, e_n ∼ h^(n)/(n log q).
In the question, q = 10 and
Problem 1.32 Let {U_t} be a discrete-time process with values u_t and let P(u^(n)) be the probability that a string u^(n) = u_1 . . . u_n is produced. Show that if
Relate this to the information rate of a two-state source with transition probabilities
p and 1 − p.
h = − ∑_{i,j} π_i p_{ij} log p_{ij}.
If matrix P is irreducible (i.e. has a unique communicating class) then this state-
ment holds for the chain with any initial distribution λ (in this case the equilibrium
distribution is unique).
The rows are permutations of each other, and each of them has entropy −p log p − (1 − p) log(1 − p). The uniform distribution is an equilibrium distribution:
∑_{1≤i≤m} (1/m) p_{ij} = (1/m)(p + 1 − p) = 1/m,
and it is unique, as the chain has a unique communicating class. Therefore, the information rate equals
h = ∑_{1≤i≤m} (1/m) [ −p log p − (1 − p) log(1 − p) ] = −p log p − (1 − p) log(1 − p).
For m = 2 we obtain precisely the matrix
(   p    1 − p
  1 − p    p   ),
so – with the equilibrium distribution π = (1/2, 1/2) – the information rate is again h = η(p).
Here, the maximum is taken over PX = (PX (i), i ∈ I ), the input-letter probabil-
ity distribution, and I(X : Y ) is the mutual entropy between the input and output
random letters X and Y tied through the channel matrix:
I(X : Y ) = h(Y ) − h(Y |X) = h(X) − h(X|Y ).
For the symmetric channel, the conditional entropy
h(Y|X) = − ∑_{i,j} P_X(i) p_{ij} log p_{ij} ≡ h,
and the maximisation needs only to be performed for the output symbol entropy
h(Y) = − ∑_j P_Y(j) log P_Y(j),  where  P_Y(j) = ∑_i P_X(i) p_{ij}.
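To make this concrete (an aside not in the original), the capacity of a symmetric channel can be found numerically by maximising I(X : Y) over input distributions. The sketch below takes the binary symmetric channel with p = 0.1 as an instance and recovers C = 1 − η(p); the function name mutual_information and the crude grid search are our own choices.

import math

def mutual_information(PX, channel):
    # channel[i][j] = P(Y=j | X=i); I(X:Y) = h(Y) - h(Y|X), logs base 2.
    PY = [sum(PX[i] * channel[i][j] for i in range(len(PX))) for j in range(len(channel[0]))]
    hY = -sum(p * math.log2(p) for p in PY if p > 0)
    hY_given_X = -sum(PX[i] * channel[i][j] * math.log2(channel[i][j])
                      for i in range(len(PX)) for j in range(len(channel[0]))
                      if channel[i][j] > 0)
    return hY - hY_given_X

p = 0.1
bsc = [[1 - p, p], [p, 1 - p]]
# Grid search over P_X(0); the maximum sits at the uniform input distribution.
C = max(mutual_information([t, 1 - t], bsc) for t in [k / 1000 for k in range(1, 1000)])
eta = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(C, 1 - eta)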
with equality iff X and Y are Gaussian with proportional covariance matrices.
Let X be a real-valued random variable with a PDF f_X and finite differential entropy h(X), and let the function g : R → R have a strictly positive derivative g′ everywhere. Prove that the random variable g(X) has differential entropy satisfying
h(g(X)) = h(X) + E log₂ g′(X),
dFg(X) (y)
i.e. the PDF fg(X) (y) = takes the form
dy
−1 −1 fX g−1 (y)
fg(X) (y) = fX g (y) g (y) = −1 .
g g (y)
Then
h(g(X)) = − ∫ f_{g(X)}(y) log₂ f_{g(X)}(y) dy
= − ∫ [ f_X(g^{−1}(y)) / g′(g^{−1}(y)) ] log₂ [ f_X(g^{−1}(y)) / g′(g^{−1}(y)) ] dy
= − ∫ [ f_X(x) / g′(x) ] [ log₂ f_X(x) − log₂ g′(x) ] g′(x) dx
= h(X) + E log₂ g′(X),
with X_3 = X_1 + X_2. Then
h(Y_1 Y_2) = h( e^{X_1+X_2} ) = h(X_1 + X_2) + (E X_1 + E X_2) log₂ e.
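A Monte Carlo illustration (not part of the text) of the rule h(g(X)) = h(X) + E log₂ g′(X) with g(x) = e^x: for a Gaussian X the left-hand side is the closed-form lognormal entropy, and E log₂ g′(X) = E[X] log₂ e can be estimated by sampling. The values μ = σ = 1 are ours.

import math, random

mu, sigma = 1.0, 1.0
# Closed forms in bits: h(N(mu, sigma^2)) and h(lognormal) = h(X) + mu*log2(e).
h_X = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)
h_gX = h_X + mu * math.log2(math.e)

# Monte Carlo estimate of E log2 g'(X) = E[X] log2 e for g(x) = exp(x).
samples = [random.gauss(mu, sigma) for _ in range(200000)]
E_log_gprime = sum(x * math.log2(math.e) for x in samples) / len(samples)

print(h_gX, h_X + E_log_gprime)   # the two agree up to Monte Carlo error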
Problem 1.35 In this problem we work with the following functions defined for 0 < a < b:
G(a, b) = √(ab),   L(a, b) = (b − a)/log(b/a),   I(a, b) = (1/e) (b^b / a^a)^{1/(b−a)}.
Check that
0 < a < G(a, b) < L(a, b) < I(a, b) < A(a, b) = (a + b)/2 < b.   (1.6.33)
Next, for 0 < a < b define
Let m = min[qi /pi ], M = max[qi /pi ], μ = min[pi ], ν = max[pi ]. Prove the following
bounds for the entropy h(X) and Kullback–Leibler divergence D(p||q) (cf. PSE II,
p. 419):
Applying (1.6.37) to the convex function f(x) = −log x we obtain after some calculations that the maximum in (1.6.37) is achieved at p_0 = (b − L(a, b))/(b − a), with p_0 a + (1 − p_0) b = L(a, b), and
0 ≤ log [ A(p, x)/G(q, x) ] ≤ log [ (b − a)/log(b/a) ] − log(ab) + [ log(b^b/a^a)/(b − a) − 1 ],
which is equivalent to (1.6.36). Finally, we establish (1.6.37). Write xi = λi a +
(1 − λi )b for some λi ∈ [0, 1]. Then by convexity
0 ≤ ∑ pi f (xi ) − f (∑ pi xi )
≤ ∑ pi (λi f (a) + (1 − λi ) f (b)) − f (a ∑ pi λi + b ∑ pi (1 − λi )) .
Denoting ∑ pi λi = p and 1 − ∑ pi λi = q and maximising over p we obtain
(1.6.37).
Problem 1.36 Let f be a strictly positive probability density function (PDF) on the line R, define the Kullback–Leibler divergence D(g|| f) and prove that D(g|| f) ≥ 0.
Next, assume that ∫ e^x f(x) dx < ∞ and ∫ |x| e^x f(x) dx < ∞. Prove that the minimum of the expression
− ∫ x g(x) dx + D(g|| f)   (1.6.38)
over the PDFs g with ∫ |x| g(x) dx < ∞ is attained at the unique PDF g* ∝ e^x f(x) and calculate this minimum.
Writing g(x) = q(x) f(x) and g*(x) = e^x f(x)/Z with Z = ∫ e^x f(x) dx, we have
D(g||g*) = ∫ g(x) ln [ g(x)/g*(x) ] dx = ∫ q(x) ln [ q(x) e^{−x} Z ] f(x) dx
= − ∫ x f(x) q(x) dx + ∫ f(x) q(x) ln q(x) dx + ln Z
= − ∫ x g(x) dx + D(g|| f) + ln Z,
implying that
− ∫ x g(x) dx + D(g|| f) = − ∫ x g*(x) dx + D(g*|| f) + D(g||g*).
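A small illustration of the minimiser (ours, not the book's): take f to be the standard normal density. Then g* ∝ e^x f(x) is the N(1,1) density, Z = E e^X = e^{1/2}, and the minimum of (1.6.38) is −ln Z = −1/2. Restricting attention to the Gaussian shifts g = N(m, 1), for which D(g|| f) = m²/2 (a standard closed form we assume here), the objective becomes m²/2 − m.

import math

Z = math.exp(0.5)                       # Z = E exp(X) for X ~ N(0,1)
minimum = -math.log(Z)                  # claimed minimal value of (1.6.38): -1/2

def objective(m):
    # g = N(m, 1): -int x g(x) dx = -m and D(g||f) = m^2/2 (in nats).
    return -m + m ** 2 / 2

values = [(m / 10, objective(m / 10)) for m in range(-20, 41)]
best = min(values, key=lambda t: t[1])
print(best, minimum)                    # attained at m = 1 with value -0.5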
= (1 − p)^N ( p/(1 − p) )^{δ(x^(N), y^(N))}
Figure 2.1
An important part is played by the distance δ(x^(N), 0^(N)) between a word x^(N) = x_1 . . . x_N and 0^(N) = 0 . . . 0; it is called the weight of the word x^(N) and denoted by w(x^(N)):
w(x^(N)) = the number of digits i with x_i ≠ 0.   (2.1.1b)
Lemma 2.1.1 The quantity δ (x(N) , y(N) ) defines a distance on HN,q . That is:
(i) 0 ≤ δ (x(N) , y(N) ) ≤ N and δ (x(N) , y(N) ) = 0 iff x(N) = y(N) .
(ii) δ (x(N) , y(N) ) = δ (y(N) , x(N) ).
(iii) δ (x(N) , z(N) ) ≤ δ (x(N) , y(N) ) + δ (y(N) , z(N) ) (the triangle inequality).
Proof The proof of (i) and (ii) is obvious. To check (iii), observe that any digit i with z_i ≠ x_i has either y_i ≠ x_i, and is then counted in δ(x^(N), y^(N)), or z_i ≠ y_i, and is then counted in δ(y^(N), z^(N)).
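As a tiny aside (not from the book), the weight (2.1.1b), the Hamming distance, its translation invariance and the triangle inequality are immediate to test; the function names below are ours.

def hamming_distance(x, y):
    # x, y: equal-length tuples over {0, ..., q-1}
    return sum(1 for a, b in zip(x, y) if a != b)

def weight(x):
    return sum(1 for a in x if a != 0)

x, y, z = (1, 0, 1, 1, 0), (0, 0, 1, 0, 1), (1, 1, 1, 0, 1)
assert hamming_distance(x, y) == weight(tuple((a + b) % 2 for a, b in zip(x, y)))
assert hamming_distance(x, z) <= hamming_distance(x, y) + hamming_distance(y, z)
print(hamming_distance(x, y), weight(x))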
Geometrically, the binary Hamming space HN,2 may be identified with the col-
lection of the vertices of a unit cube in N dimensions. The Hamming distance
equals the lowest number of edges we have to pass from one vertex to another. It is
a good practice to plot pictures for relatively low values of N: see Figure 2.1.
An important role is played below by geometric and algebraic properties of the
Hamming space. Namely, as in any metric space, we can consider a ball of a given
radius R around a given word x(N) :
BN,q (x(N) , R) = {y(N) ∈ HN,q : δ (x(N) , y(N) ) ≤ R}. (2.1.2)
An important (and hard) problem is to calculate the maximal number of disjoint
balls of a given radius which can be packed in a given Hamming space.
Observe that words admit an operation of addition mod q:
x(N) + y(N) = (x1 + y1 ) mod q . . . (xN + yN ) mod q . (2.1.3a)
This makes the Hamming space HN,q a commutative group, with the zero code-
word 0(N) = 0 . . . 0 playing the role of the zero of the group. (Words also may be
multiplied which generates a powerful apparatus; see below.)
For q = 2, we have a two-point code alphabet {0, 1} that is actually a two-
point field, F2 , with the following arithmetic: 0 + 0 = 1 + 1 = 0 · 1 = 1 · 0 = 0,
0 + 1 = 1 + 0 = 1 · 1 = 1. (Recall, a field is a set equipped with two commutative
operations: addition and multiplication, satisfying standard axioms of associativity
and distributivity.) Thus, each point in the binary Hamming space HN,2 is opposite
to itself: x^(N) + x′^(N) = 0^(N) iff x^(N) = x′^(N). In fact, H_{N,2} is a linear space over the coefficient field F_2, with 1 · x^(N) = x^(N), 0 · x^(N) = 0^(N).
Henceforth, all additions of q-ary words are understood digit-wise and mod q.
Lemma 2.1.2 The Hamming distance on HN,q is invariant under group transla-
tions:
δ (x(N) + z(N) , y(N) + z(N) ) = δ (x(N) , y(N) ). (2.1.3b)
A code is identified with a set of codewords XN ⊂ HN,q ; this means that we dis-
regard any particular allocation of codewords (which fits the assumption that the
source messages are equidistributed). An assumption is that the code is known to
both the sender and the receiver. Shannon’s coding theorems guarantee that, under
certain conditions, there exist asymptotically good codes attaining the limits im-
posed by the information rate of a source and the capacity of a channel. Moreover,
Shannon’s SCT shows that almost all codes are asymptotically good. However, in
a practical situation, these facts are of a limited use: one wants to have a good code
in an explicit form. Besides, it is desirable to have a code that leads to fast encoding
and decoding and maximises the rate of the information transmission.
So, assume that the source emits binary strings u(n) = u1 . . . un , ui = 0, 1. To
obtain the overall error-probability vanishing as n → ∞, we have to encode words
u(n) by longer codewords x(N) ∈ HN,2 where N ∼ R−1 n and 0 < R < 1. Word x(N)
is then sent to the channel and is transformed into another word, y(N) ∈ HN,2 . It
is convenient to represent the error occurred by the difference of the two words:
e(N) = y(N) − x(N) = x(N) + y(N) , or equivalently, write y(N) = x(N) + e(N) , in the
sense of (2.1.3a). Thus, the more digits 1 the error word e(N) has, the more sym-
bols are distorted by the channel. The ML decoder then produces a ‘guessed’ codeword x*^(N) that may or may not coincide with x^(N), and then reconstructs a string u*^(n). In the case of a one-to-one encoding rule, the last procedure is (theoretically)
straightforward: we simply invert the map u(n) → x(N) . Intuitively, a code is ‘good’
if it allows the receiver to ‘correct’ the error string e(N) , at least when word e(N)
does not contain ‘too many’ non-zero digits.
Going back to an MBSC with the row probability of the error p < 1/2: the ML decoder selects a codeword x*^(N) that leads to a word e^(N) with a minimal number of the unit digits. In geometric terms:
x*^(N) ∈ X_N is the codeword closest to y^(N) in the Hamming distance δ.   (2.1.4)
The same rule can be applied in the q-ary case: we look for the codeword closest
to the received string. A drawback of this rule is that if several codewords have the
same minimal distance from a received word we are ‘stuck’. In this case we either
choose one of these codewords arbitrarily (possibly randomly or in connection with
the message’s content; this is related to the so-called list decoding), or, when a high
quality of transmission is required, refuse to decode the received word and demand
a re-transmission.
Definition 2.1.3 We call N the length of a binary code X_N, M := ♯X_N the size and ρ := (log₂ M)/N the information rate. A code X_N is said to be D-error detecting
if making up to D changes in any codeword does not produce another codeword,
and E-error correcting if making up to E changes in any codeword x(N) produces
a word which is still (strictly) closer to x(N) than to any other codeword (that is,
x(N) is correctly guessed from a distorted word under the rule (2.1.4)). A code has
minimal distance (or briefly distance) d if
d = min { δ(x^(N), x′^(N)) : x^(N), x′^(N) ∈ X_N, x^(N) ≠ x′^(N) }.   (2.1.5)
The minimal distance and the information rate of a code XN will be sometimes
denoted by d(XN ) and ρ (XN ), respectively.
This definition can be repeated almost verbatim for the general case of a q-ary code X_N ⊂ H_{N,q}, with information rate ρ = (log_q M)/N. Namely, a code X_N is called E-error correcting if, for all r = 1, . . . , E, x^(N) ∈ X_N and y^(N) ∈ H_{N,q} with δ(x^(N), y^(N)) = r, the distance δ(y^(N), x′^(N)) > r for all x′^(N) ∈ X_N such that x′^(N) ≠ x^(N). In words, it means that making up to E errors in a codeword pro-
duces a word that is still closer to it than to any other codeword. Geometrically,
this property means that the balls of radius E about the codewords do not intersect:
B_{N,q}(x^(N), E) ∩ B_{N,q}(x′^(N), E) = ∅ for all distinct x^(N), x′^(N) ∈ X_N.
Next, a code XN is called D-error detecting if the ball of radius D about a codeword
does not contain another codeword. Equivalently, the intersection BN,q (x(N) , D) ∩
XN is reduced to a single point x(N) .
and γ · x(N) ∈ XN for all γ ∈ Fq . For a linear code X , the size M is given by
M = qk where k may take values 1, . . . , N and gives the dimension of the code, i.e.
the maximal number of linearly independent codewords. Accordingly, one writes
k = dim X . As in the usual geometry, if k = dim X then in X there exists a basis
of size k, i.e. a linearly independent collection of codewords x(1) , . . . , x(k) such that
any codeword x ∈ X can be (uniquely) written as a linear combination ∑ a j x( j) ,
1≤ j≤k
where a j ∈ Fq . [In fact, if k = dim X then any linearly independent collection of k
codewords is a basis in X .] In the linear case, we speak of [N, k, d] or [N, k] codes.
As follows from the definition, a linear [N, k, d] code XN always contains the
zero string 0^(N) = 0 . . . 0. Furthermore, owing to property (2.1.3b), the minimal distance d(X_N) in a linear code X_N equals the minimal weight w(x^(N)) of a non-0 codeword x^(N) ∈ X_N. See (2.1.1b).
Finally, we define the so-called wedge-product of codewords x and y as a word
w = x ∧ y with components
wi = min[xi , yi ], i = 1, . . . , N. (2.1.6b)
A number of properties of linear codes can be mentioned already in this section,
although some details of proofs will be postponed.
A simple example of a linear code is a repetition code R_N ⊂ H_{N,q}, of the form
R_N = { x^(N) = x . . . x : x = 0, 1, . . . , q − 1 };
it detects N − 1 errors and corrects ⌊(N − 1)/2⌋. A linear parity-check code
P_N = { x^(N) = x_1 . . . x_N : x_1 + · · · + x_N = 0 }
Observe that the ‘volume’ of the ball in the Hamming space H_{N,q} centred at z^(N) is
v_{N,q}(R) = ♯B_{N,q}(z^(N), R) = ∑_{0≤k≤R} \binom{N}{k} (q − 1)^k;   (2.1.7)
the detecting and correcting abilities). From this point of view, it is important to
understand basic bounds for codes.
Upper bounds are usually written for Mq∗ (N, d), the largest size of a q-ary code
of length N and distance d. We begin with elementary facts: Mq∗ (N, 1) = qN ,
Mq∗ (N, N) = q, Mq∗ (N, d) ≤ qMq∗ (N − 1, d) and – in the binary case – M2∗ (N, 2s) =
M2∗ (N − 1, 2s − 1) (easy exercises).
Indeed, the number of the codewords cannot be too high if we want to keep a good error-detecting and error-correcting ability. There are various bounds for
parameters of codes; the simplest bound was discovered by Hamming in the late
1940s.
Theorem 2.1.6 (The Hamming bound)
(i) If a q-ary code X_N of size M corrects E errors then M v_{N,q}(E) ≤ q^N.
(ii) Consequently, M_q^*(N, d) ≤ q^N / v_{N,q}(⌊(d − 1)/2⌋).
Proof (i) The E-balls about the codewords x(N) ∈ XN must be disjoint. Hence,
the total number of points covered equals the product vN,q (E)M which should not
exceed qN , the cardinality of the Hamming space HN,q .
(ii) Likewise, if X_N is an [N, M, d] code then, as was noted above, for E = ⌊(d − 1)/2⌋, the balls B_{N,q}(x^(N), E), x^(N) ∈ X_N, do not intersect. The volume ♯B_{N,q}(x^(N), E) is given by
v_{N,q}(E) = ∑_{0≤k≤E} \binom{N}{k} (q − 1)^k,
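As a computational aside (not in the original), the volume (2.1.7) and the resulting Hamming bound are straightforward to evaluate; the function names below are ours.

from math import comb

def ball_volume(N, q, R):
    # v_{N,q}(R) = sum_{k<=R} C(N,k) (q-1)^k, cf. (2.1.7)
    return sum(comb(N, k) * (q - 1) ** k for k in range(R + 1))

def hamming_bound(N, q, d):
    E = (d - 1) // 2
    return q ** N // ball_volume(N, q, E)

print(ball_volume(7, 2, 1))      # 8: for the binary Hamming [7,4] code, 16 * 8 = 2^7
print(hamming_bound(7, 2, 3))    # 16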
We see that the problem of finding good codes becomes a geometric problem,
because a ‘good’ code XN correcting E errors must give a ‘close-packing’ of the
Hamming space by balls of radius E. A code XN that gives a ‘true’ close-packing
partition has an additional advantage: the code not only corrects errors, but never
leads to a refusal of decoding. More precisely:
(a) E = 1: here N = 2^l − 1, M = 2^{2^l −1−l}, and these codes correspond to the so-called Hamming codes;
(b) E = 3: here N = 23, M = 2^{12}; they correspond to the so-called (binary) Golay code.
Both the Hamming and Golay codes are discussed below. The Golay code is
used (together with some modifications) in the US space programme: already in
the 1970s the quality of photographs encoded by this code and transmitted from
Mars and Venus was so excellent that it did not require any improving procedure.
In the former Soviet Union space vessels (and early American ones) other codes
were also used (and we also discuss them later): they generally produced lower-
quality photographs, and further manipulations were required, based on statistics
of the pictorial images.
If we consider non-binary codes then there exists one more perfect code, for
three symbols (also named after Golay).
(i) Extension: You add a digit x_{N+1} to each codeword x^(N) = x_1 . . . x_N from code X_N, following an agreed rule. Viz., the so-called parity-check extension requires that x_{N+1} + ∑_{1≤j≤N} x_j = 0 in the alphabet field F_q. Clearly, the extended code, X⁺_{N+1}, has the same size as the original code X_N, and the distance d(X⁺_{N+1}) is equal to either d(X_N) or d(X_N) + 1.
(v) Shortening: Take all codewords x^(N) ∈ X_N with the ith digit 0, say, and delete this digit (shortening on x_i = 0). In this way the original binary linear [N, M, d] code X_N is reduced to a binary linear code X^{sh,0}_{N−1}(i) of length N − 1, whose size can be M/2 or M and distance ≥ d or, in a trivial case, 0.
(vi) Repetition: Repeat each codeword x (= x^(N)) ∈ X_N a fixed number of times, say m, producing a concatenated (Nm)-word x x . . . x. The result is a code X^{re}_{Nm}, of length Nm and distance d(X^{re}_{Nm}) = m d(X_N).
(ix) The dual code. The concept of duality is based on the inner dot-product in the space H_{N,q} (with q = p^s): for x = x_1 . . . x_N and y = y_1 . . . y_N,
⟨x^(N) · y^(N)⟩ = x_1 · y_1 + · · · + x_N · y_N,
which yields a value from the field F_q. For a linear [N, k] code X_N its dual, X_N^⊥, is a linear [N, N − k] code defined by
X_N^⊥ = { y^(N) ∈ H_{N,q} : ⟨x^(N) · y^(N)⟩ = 0 for all x^(N) ∈ X_N }.   (2.1.9)
However, such a code does not exist. Assume that it exists and that (after a translation, without loss of generality) the zero word 0 = 0 . . . 0 is a codeword. The code must have d = 5. Consider the 88 words with
three non-zero digits, with 1 in the first two places:
Each of these words should be at distance ≤ 2 from a unique codeword. Say, the
codeword for 1110 . . . 00 must contain 5 non-zero digits. Assume that it is
111110 . . . 00.
Continuing with this construction, we see that any word from list (2.1.10) is ‘at-
tracted’ to a codeword with 5 non-zero digits, along with two other words from
(2.1.10). But 88 is not divisible by 3.
Proof Consider a code of maximal size among the codes of minimal distance
d and length N. Then any word y(N) ∈ HN,q must be distant ≤ d − 1 from some
codeword: otherwise we can add y(N) to the code without changing the minimal
distance. Hence, the balls of radius d − 1 about the codewords cover the whole
Hamming space H_{N,q}. That is, for the code of maximal size, X_N^max,
♯X_N^max · v_{N,q}(d − 1) ≥ q^N.
As was listed before, there are ways of producing one code from another (or from
a collection of codes). Let us apply truncation and drop the last digit xN in each
codeword x^(N) from an original code X_N. If the code X_N had the minimal distance d > 1 then the new code, X^−_{N−1}, has the minimal distance ≥ d − 1 and the same size as X_N. The truncation procedure leads to the following bound.
Theorem 2.1.12 (The Singleton bound) Any q-ary code X_N with minimal distance d has
M = ♯X_N ≤ M_q^*(N, d) ≤ q^{N−d+1}.   (2.1.12)
As with the Hamming bound, the case of equality in the Singleton bound at-
tracted a special interest:
Definition 2.1.13 A q-ary linear [N, k, d] code is called maximum distance sepa-
rating (MDS) if it gives equality in the Singleton bound:
d = N − k + 1. (2.1.13)
We will see below that, similarly to perfect codes, the family of the MDS codes
is rather ‘thin’.
Corollary 2.1.14 If M_q^*(N, d) is the maximal size of a code X_N with minimal distance d then
q^N / v_{N,q}(d − 1) ≤ M_q^*(N, d) ≤ min [ q^N / v_{N,q}(⌊(d − 1)/2⌋), q^{N−d+1} ].   (2.1.14)
From now on we will omit indices N and (N) whenever it does not lead to
confusion. The upper bound in (2.1.14) becomes too rough when d ∼ N/2. Say, in
the case of binary [N, M, d]-code with N = 10 and d = 5, expression (2.1.14) gives
the upper bound M2∗ (10, 5) ≤ 18, whereas in fact there is no code with M ≥ 13, but
there exists a code with M = 12. The codewords of the latter are as follows:
The lower bound gives in this case the value 2 (as 2^{10}/v_{10,2}(4) ≈ 2.65) and is also far from being satisfactory. (Some better bounds will be obtained below.)
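The numbers quoted for N = 10, d = 5 are easy to reproduce (a sketch of ours, not the book's):

from math import comb

def v(N, R):
    # volume of the binary Hamming ball of radius R
    return sum(comb(N, k) for k in range(R + 1))

N, d = 10, 5
print(2 ** N / v(N, (d - 1) // 2))   # Hamming:   1024/56  = 18.28..., so M <= 18
print(2 ** (N - d + 1))              # Singleton: 2^6 = 64
print(2 ** N / v(N, d - 1))          # GV ratio:  1024/386 = 2.65...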
Theorem 2.1.15 (The Plotkin bound) For a binary code X of length N and
distance d with N < 2d , the size M obeys
M = ♯X ≤ 2 ⌊ d/(2d − N) ⌋.   (2.1.15)
Proof The minimal distance cannot exceed the average distance, i.e.
M(M − 1) d ≤ ∑_{x∈X} ∑_{x′∈X} δ(x, x′).
On the other hand, if s_i denotes the number of codewords with 0 in the ith digit then
∑_{x∈X} ∑_{x′∈X} δ(x, x′) ≤ 2 ∑_{1≤i≤N} s_i (M − s_i).   (2.1.16)
Turning to the proof of (2.1.18), given an [N, d] code, divide the codewords into
two classes: those ending with 0 and those ending with 1. One class must contain
at least half of the codewords. Hence the result.
Corollary 2.1.17 If d is even and such that 2d > N,
M_2^*(N, d) ≤ 2 ⌊ d/(2d − N) ⌋   (2.1.19)
and
M_2^*(2d, d) ≤ 4d.   (2.1.20)
If d is odd and 2d + 1 > N then
M_2^*(N, d) ≤ 2 ⌊ (d + 1)/(2d + 1 − N) ⌋   (2.1.21)
and
M_2^*(2d + 1, d) ≤ 4d + 4.   (2.1.22)
Proof Inequality (2.1.19) follows from (2.1.17), and (2.1.20) follows from (2.1.18) and (2.1.19): if d = 2d′ then
M_2^*(4d′, 2d′) = 2 M_2^*(4d′ − 1, 2d′) ≤ 8d′ = 4d.
Furthermore, (2.1.21) follows from (2.1.17):
M_2^*(N, d) = M_2^*(N + 1, d + 1) ≤ 2 ⌊ (d + 1)/(2d + 1 − N) ⌋.
Finally, (2.1.22) follows from (2.1.17) and (2.1.20).
Finally, (2.1.22) follows from (2.1.17) and (2.1.20).
Worked Example 2.1.18 Prove the Plotkin bound for a q-ary code:
M_q^*(N, d) ≤ ⌊ d ( d − ((q − 1)/q) N )^{−1} ⌋,   if d > ((q − 1)/q) N.   (2.1.23)
Solution Given a q-ary [N, M, d] code X_N, observe that the minimal distance d is bounded by the average distance:
d ≤ S/(M(M − 1)),  where  S = ∑_{x∈X} ∑_{x′∈X} δ(x, x′).
As before, let k_{ij} denote the number of letters j ∈ {0, . . . , q − 1} in the ith position over all codewords from X, i = 1, . . . , N. Then, clearly, ∑_{0≤j≤q−1} k_{ij} = M and the contribution of the ith position to S is
∑_{0≤j≤q−1} k_{ij} (M − k_{ij}) = M² − ∑_{0≤j≤q−1} k_{ij}² ≤ M² − M²/q,
as the quadratic function (u_1, . . . , u_q) ↦ ∑_{1≤j≤q} u_j² achieves its minimum on the set {u = u_1 . . . u_q : u_j ≥ 0, ∑_j u_j = M} at u_1 = · · · = u_q = M/q. Summing over all N digits, we obtain, with θ = (q − 1)/q,
M(M − 1) d ≤ θ M² N,
which yields the bound M ≤ d (d − θN)^{−1}. The proof is completed as in the binary case.
There exists a substantial theory related to the equality in the Plotkin bound
(Hadamard codes) but it will not be discussed in this book. We would also like
to point out the fact that all bounds established so far (Hamming, Singleton, GV
and Plotkin) hold for codes that are not necessarily linear. As far as the GV bound
is concerned, one can prove that it can be achieved by linear codes: see Theorem
2.3.26.
Worked Example 2.1.19 Prove that a 2-error correcting binary code of length
10 can have at most 12 codewords.
Solution The distance of the code must be ≥ 5. Suppose that it contains M code-
words and extend it to an [11, M] code of distance 6. The Plotkin bound works as
follows. List all codewords of the extended code as rows of an M × 11 matrix. If
column i in this matrix contains si zeros and M − si ones then
6(M − 1)M ≤ ∑_{x∈X⁺} ∑_{x′∈X⁺} δ(x, x′) ≤ 2 ∑_{i=1}^{11} s_i(M − s_i) ≤ 2 · 11 · M²/4,
whence 6(M − 1) ≤ 11M/2, i.e. M ≤ 12.
By using more elaborate bounds (also due to Plotkin), we’ll show in Problem
2.10 that
a(τ ) ≤ 1 − 2τ , 0 ≤ τ ≤ 1/2. (2.1.32)
Figure 2.2 (schematic): the asymptotic Plotkin, Singleton, Hamming and Gilbert–Varshamov bounds, plotted against τ ∈ [0, 1/2].
Figure 2.2 shows the behaviour of the bounds established. ‘Good’ sequences of
codes are those for which the pair (τ , α (N, τ N)) is asymptotically confined to
the domain between the curves indicating the asymptotic bounds. In particular, a
‘good’ code should ‘lie’ above the curve emerging from the GV bound. Construct-
ing such sequences is a difficult problem: the first examples achieving the asymp-
totic GV bound appeared in 1973 (the Goppa codes, based on ideas from algebraic
geometry). All families of codes discussed in this book produce values below the
GV curve (in fact, they yield α (τ ) = 0), although these codes demonstrate quite
impressive properties for particular values of N, M and d.
As to the upper bounds, the Hamming and Plotkin compete against each other,
while the Singleton bound turns out to be asymptotically insignificant (although
it is quite important for specific values of N, M and d). There are about a dozen
various other upper bounds, some of which will be discussed in this and subsequent
sections of the book.
The Gilbert–Varshamov bound itself is not necessarily optimal. Until 1982 there
was no better lower bound known (and in the case of binary coding there is still no
better lower bound known). However, if the alphabet used contains q ≥ 49 symbols where q = p^{2m} and p ≥ 7 is a prime number, there exists a construction, again based on algebraic geometry, which produces a different lower bound and gives examples of (linear) codes that asymptotically exceed, as N → ∞, the GV curve [159]. Moreover, the TVZ construction carries a polynomial complexity. Subsequently, two more lower bounds were proposed: (a) Elkies’ bound, for q = p^{2m} + 1; and
(b) Xing’s bound, for q = p^m [43, 175]. See N. Elkies, ‘Excellent codes from modular curves’. Manipulating with different coding constructions, the GV bound can also be improved for other alphabets.
Worked Example 2.1.22 Prove bounds (2.1.28) and (2.1.30) (that is, those parts
of Theorem 2.1.21 related to the asymptotical Hamming and GV bounds).
For the upper bound, observe that, with d/N ≤ τ < 1/2,
v_{N,2}(d − 1) = ∑_{0≤i≤d−1} \binom{N}{i} ≤ ∑_{0≤i≤d−1} ( (d − 1)/(N − d + 1) )^{d−1−i} \binom{N}{d − 1}
≤ ∑_{0≤i≤d−1} ( τ/(1 − τ) )^{d−1−i} \binom{N}{d − 1} ≤ ( (1 − τ)/(1 − 2τ) ) \binom{N}{d − 1}.
Then, for the information rate (log M_2^*(N, d))/N,
1 − (1/N) log [ ( (1 − τ)/(1 − 2τ) ) \binom{N}{d − 1} ] ≤ (1/N) log M_2^*(N, d) ≤ 1 − (1/N) log \binom{N}{⌊(d − 1)/2⌋}.
By Stirling’s formula, as N → ∞ the logs in the previous inequalities obey
(1/N) log \binom{N}{⌊(d − 1)/2⌋} → η(τ/2),   (1/N) log \binom{N}{d − 1} → η(τ).
The bounds (2.1.28) and (2.1.30) then readily follow.
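The asymptotic curves of Figure 2.2 can be tabulated directly; the following sketch (ours) evaluates the GV, Hamming, Singleton and Plotkin expressions at a few values of τ.

import math

def eta(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log2(x) - (1 - x) * math.log2(1 - x)

for tau in [0.1, 0.2, 0.3, 0.4]:
    gv = 1 - eta(tau)                 # asymptotic Gilbert-Varshamov lower bound
    hamming = 1 - eta(tau / 2)        # asymptotic Hamming upper bound
    singleton = 1 - tau               # asymptotic Singleton upper bound
    plotkin = max(0.0, 1 - 2 * tau)   # asymptotic Plotkin upper bound (2.1.32)
    print(tau, round(gv, 3), round(hamming, 3), round(singleton, 3), round(plotkin, 3))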
where
η^{(q)}(τ) := −τ log_q τ − (1 − τ) log_q (1 − τ),   κ := log_q (q − 1).   (2.1.35)
Next, similarly to (2.1.26), introduce
α^{(q)}(N, τ) = (1/N) log M_q^*(N, τN)   (2.1.36)
and the limits
a^{(q)}(τ) := lim inf_{N→∞} α^{(q)}(N, τ) ≤ lim sup_{N→∞} α^{(q)}(N, τ) =: ā^{(q)}(τ).   (2.1.37)
Here C is given by (1.4.11) and (1.4.27). For convenience, we reproduce the ex-
pression for C again:
That is, we assume that the channel transmits a letter correctly with probability
1 − p and reverses with probability p, independently for different letters.
In Theorem 2.2.1, it is asserted that there exists a sequence of one-to-one cod-
ing maps fn , for which the task of decoding is reduced to guessing the codewords
fn (u) ∈ HN . In other words, the theorem guarantees that for all R < C there exists a
sequence of subsets XN ⊂ HN with XN ∼ 2NR for which the probability of incor-
rect guessing tends to 0, and the exact nature of the coding map fn is not important.
Nevertheless, it is convenient to keep the map fn firmly in sight, as the existence
will follow from a probabilistic construction (random coding) where sample cod-
ing maps are not necessarily one-to-one. Also, the decoding rule is geometric: upon
receiving a word a(N) ∈ HN , we look for the nearest codeword fn (u) ∈ XN . Conse-
quently, an error is declared every time such a codeword is not unique or is a result
of multiple encodings or simply yields a wrong message. As we saw earlier, the
geometric decoding rule corresponds with the ML decoder when the probability
p ∈ (0, 1/2). Such a decoder enables us to use geometric arguments constituting
the core of the proof.
Again as in Section 1.4, the new proof of the direct part of the SCT/NCT only
guarantees the existence of ‘good’ codes (and even their ‘proliferation’) but gives
no clue on how to construct such codes [apart from running again a random coding
scheme and picking its ‘typical’ realisation].
In the statement of the SCT/NCT given below, we deal with the maximum error-
probability (2.2.4) rather than the averaged one over possible messages. However,
a large part of the proof is still based on a direct analysis of the error-probabilities
averaged over the codewords.
Theorem 2.2.1 (The SCT/NCT, the direct part) Consider an MBSC with channel
matrix Π as in (2.2.2), with 0 ≤ p < 1/2, and let C be as in (2.2.1). Then for any
(i) n = ⌊NR⌋, and ♯U_n = 2^n;   (2.2.3)
This is a random variable which has a binomial distribution Bin(N, p), with the mean value
E [ ∑_{j=1}^N 1(digit j in Y^(N) ≠ digit j in f_n(u)) | f_n(u) ]
= ∑_{j=1}^N E [ 1(digit j in Y^(N) ≠ digit j in f_n(u)) | f_n(u) ] = Np,
Then, by Chebyshev’s inequality, for all given ε ∈ (0, 1 − p) and positive integer N > 1/ε, the probability that at least N(p + ε) digits have been distorted, given that the codeword f_n(u) has been sent, is
≤ P( ≥ N(p + ε) − 1 digits distorted | f_n(u) ) ≤ p(1 − p) / (N(ε − 1/N)²).   (2.2.5)
Proof of Theorem 2.2.1 Throughout the proof, we follow the set-up from (2.2.3). Subscripts n and N will often be omitted; viz., we set 2^n = M.
We will assume the ML/geometric decoder without any further mention. Similarly
to Section 1.4, we identify the set of source messages Un with Hamming space Hn .
As proposed by Shannon, we use again a random encoding. More precisely, a mes-
sage u ∈ Hn is mapped to a random codeword Fn (u) ∈ HN , with IID digits taking
values 0 and 1 with probability 1/2 and independently of each other. In addition,
we make codewords Fn (u) independent for different messages u ∈ Hn ; labelling
the strings from Hn by u(1), . . . , u(M) (in no particular order) we obtain a fam-
ily of IID random strings Fn (u(1)), . . . , Fn (u(M)) from HN . Finally, we make the
codewords independent of the channel. Again, in analogy with Section 1.4, we can
think of the random code under consideration as a random megastring/codebook
from HNM = {0, 1}NM with IID digits 0, 1 of equal probability. Every given sample
f (= fn ) of this random codebook (i.e. any given megastring from HNM ) specifies
Relation (2.2.9) implies (again in a manner similar to Section 1.4) that there
exists a sequence of deterministic codes fn such that the average error-probability
eave ( fn ) = eave ( fn (u(1)), . . . , fn (u(2n ))) obeys
lim eave ( fn ) = 0. (2.2.10)
n→∞
Figure 2.4: decoding via the Hamming ball B_N(a, m) around the received word a ∈ H_N; the codeword f_n(u(i)) was sent and a was received.
Lemma 2.2.4 Consider the channel matrix Π (cf. (2.2.2)) with 0 ≤ p < 1/2. Suppose that the transmission rate R < C = 1 − η(p). Then for any ε ∈ (0, 1/2 − p) and N > 1/ε, the expected average error-probability E_n e^ave(F_n) defined in (2.2.8), (2.2.9) obeys
E_n e^ave(F_n) ≤ p(1 − p)/(N(ε − 1/N)²) + ((M − 1)/2^N) v_N(⌊N(p + ε)⌋),   (2.2.12)
where v_N(b) stands for the number of points in the ball of radius b in the binary Hamming space H_N.
Proof Set m (= m_N(p, ε)) := ⌊N(p + ε)⌋. The ML decoder definitely returns the codeword f_n(u(i)) sent through the channel when f_n(u(i)) is the only codeword in the Hamming ball B_N(y, m) around the received word y = y^(N) ∈ H_N (see Figure 2.4). In any other situation (when f_n(u(i)) ∉ B_N(y, m) or f_n(u(k)) ∈ B_N(y, m) for some k ≠ i) there is a possibility of error.
Hence,
P( error while using codebook f | f_n(u(i)) )
≤ ∑_{y∈H_N} P( y | f_n(u(i)) ) 1( f_n(u(i)) ∉ B_N(y, m) )   (2.2.13)
+ ∑_{z∈H_N} P( z | f_n(u(i)) ) ∑_{k≠i} 1( f_n(u(k)) ∈ B_N(z, m) ).
and
E_n 1( F_n(u(k)) ∈ B_N(z, m) ) = v_N(m)/2^N.   (2.2.16b)
(i) For all R ∈ (0, C), there exist codes f_n with lim_{n→∞} e^max(f_n) = 0.
(ii) For all R ∈ (0, C), there exist codes f_n such that lim_{n→∞} e^ave(f_n) = 0.
Proof of Lemma 2.2.6 It is clear that assertion (i) implies (ii). To deduce (i) from (ii), take R < C and set, for N big enough,
R′ = R + 1/N < C,   n′ = ⌊NR′⌋,   M′ = 2^{n′}.   (2.2.22)
We know that there exists a sequence f_{n′} of codes H_{n′} → H_N with e^ave(f_{n′}) → 0. Recall that
e^ave(f_{n′}) = (1/M′) ∑_{1≤i≤M′} P( error while using f_{n′} | f_{n′}(u(i)) ).   (2.2.23)
Here and below, M′ = 2^{⌊NR′⌋} and f_{n′}(u(1)), . . . , f_{n′}(u(M′)) are the codewords for source messages u(1), . . . , u(M′) ∈ H_{n′}.
Instead of P( error while using f_{n′} | f_{n′}(u(i)) ), we write P( f_{n′}-error | f_{n′}(u(i)) ), for brevity. Now, at least half of the summands P( f_{n′}-error | f_{n′}(u(i)) ) in the RHS of (2.2.23) must be < 2 e^ave(f_{n′}). Observe that, in view of (2.2.22),
M′/2 ≥ 2^{NR−1}.   (2.2.24)
List these codewords as a new binary code, of length N and information rate log(M′/2)/N. Denoting this new code by f_n, we have
e^max(f_n) ≤ 2 e^ave(f_{n′}).
Hence, e^max(f_n) → 0 as n → ∞ whereas log(M′/2)/N → R. This gives statement (i) and completes the proof of Lemma 2.2.6.
Therefore, the proof of Theorem 2.2.1 is now complete (provided that we prove
Lemma 2.2.5).
Worked Example 2.2.7 (cf. Worked Example 2.1.20.) Prove that for positive integers N and m, with m < N/2 and β = m/N,
2^{Nη(β)}/(N + 1) < v_N(m) < 2^{Nη(β)}.   (2.2.25)
Solution Write
v_N(m) = ♯{ points at distance ≤ m from 0 in H_N } = ∑_{0≤k≤m} \binom{N}{k}.
implying that v_N(m) < 2^{Nη(β)}. To obtain the left-hand bound in (2.2.25), write
v_N(m) > \binom{N}{m};
then we aim to check that the RHS is ≥ 2^{Nη(β)}/(N + 1). Consider a binomial random variable Y ∼ Bin(N, β) with
p_k = P(Y = k) = \binom{N}{k} β^k (1 − β)^{N−k},   k = 0, . . . , N.
It suffices to prove that p_k achieves its maximal value when k = m, since then
p_m = \binom{N}{m} β^m (1 − β)^{N−m} ≥ 1/(N + 1),  with  β^m (1 − β)^{N−m} = 2^{−Nη(β)}.
To this end, suppose first that k ≤ m and write
p_k/p_m = m!(N − m)!(N − m)^{m−k} / ( k!(N − k)! m^{m−k} )
= [ (k + 1) · · · m / m^{m−k} ] · [ (N − m)^{m−k} / ( (N − m + 1) · · · (N − k) ) ];
each factor in the two products is at most 1, so p_k ≤ p_m.
Proof of Lemma 2.2.5 First, p + ε < 1/2 implies that m = ⌊N(p + ε)⌋ < N/2 and
β := m/N = ⌊N(p + ε)⌋/N ≤ p + ε,
which, in turn, implies that η(β) ≤ η(p + ε), as x ↦ η(x) is a strictly increasing function for x in the interval (0, 1/2). This yields the assertion of Lemma 2.2.5.
The geometric proof of the direct part of the SCT/NCT clarifies the meaning of the concept of capacity (of an MBSC at least). Physically speaking, in the expressions (1.4.11), (1.4.27) and (2.2.1) for the capacity C = 1 − η(p) of an MBSC, the positive term 1 points at the rate at which a random code produces an ‘empty’ volume between codewords whereas the negative term −η(p) indicates the rate at which the codewords progressively fill this space. We continue with a working example of an essay type:
Worked Example 2.2.8 Quoting general theorems on the evaluation of the chan-
nel capacity, deduce an expression for the capacity of a memoryless binary sym-
metric channel. Evaluate, in particular, the capacities of (i) a symmetric memory-
less channel and (ii) a perfect channel with an input alphabet {0, 1} whose inputs
are subject to the restriction that 0 should never occur in succession.
In other words, the noise acts on each symbol xi of the input string x independently,
and P(y|x) is the probability of having an output symbol y given that the input
symbol is x.
Symbol x runs over Ain , an input alphabet of a given size q, and y belongs to
Aout , an output alphabet of size r. Then probabilities P(y|x) form a q × r stochastic
matrix (the channel matrix). A memoryless channel is called symmetric if the rows
of this matrix are permutations of each other, i.e. contain the same collection of
probabilities, say p1 , . . . , pr . A memoryless symmetric channel is said to be double-
symmetric if the columns of the channel matrix are also permutations of each other.
If q = r = 2 (typically, A_in = A_out = {0, 1}) a memoryless channel is called binary.
For a memoryless binary symmetric channel, the channel matrix entries P(y|x)
are P(0|0) = P(1|1) = 1 − p, P(1|0) = P(0|1) = p, p ∈ (0, 1) being the flipping
probability and 1 − p the probability of flawless transmission of a single binary
symbol.
A channel is characterised by its capacity: the value C ≥ 0 such that:
(a) for all R < C, R is a reliable transmission rate; and
(b) for all R > C, R is an unreliable transmission rate.
Here R is called a reliable transmission rate if there exists a sequence of codes
fn : Hn → HN and decoding rules fN : HN → Hn such that n ∼ NR and the (suit-
ably defined) probability of error
e( fn , fN ) → 0, as N → ∞.
In other words,
C = lim_{N→∞} (1/N) log M_N,
where MN is the maximal number of codewords x ∈ HN for which the probability
of erroneous decoding tends to 0.
The SCT asserts that, for a memoryless channel,
C = max I(X : Y )
pX
where I(X : Y ) is the mutual information between a (random) input symbol X and
the corresponding output symbol Y , and the maximum is over all possible proba-
bility distributions pX of X.
Now in the case of a memoryless symmetric channel (MSC), the above maximi-
sation procedure applies to the output symbols only:
C = max_{p_X} [ h(Y) + ∑_{1≤i≤r} p_i log p_i ];
the sum − ∑_i p_i log p_i being the entropy of a row of the channel matrix (P(y|x)). For a double-symmetric channel, the expression for C simplifies further:
C = log r − h(p_1, . . . , p_r)
C = 1 − η (p).
Write it as a recursion
( n(1, t), n(1, t − 1) )^T = A ( n(1, t − 1), n(1, t − 2) )^T,
with the recursion matrix
A = ( 1 1
      1 0 ).
The general solution is
n(1, t) = c_1 λ_1^t + c_2 λ_2^t,
where λ_1, λ_2 are the eigenvalues of A, i.e. the roots of the characteristic equation
det(A − λI) = (1 − λ)(−λ) − 1 = λ² − λ − 1 = 0.
So, λ = (1 ± √5)/2, and
lim_{t→∞} (1/t) log n(1, t) = log ( (√5 + 1)/2 ).
Next, we present the strong converse part of Shannon’s SCT for an MBSC (cf.
Theorem 1.4.14); again we are going to prove it by using geometry of Hamming’s
spaces. The term ‘strong’ indicates that for every transmission rate R > C, the
channel capacity, the maximum probability of error actually gets arbitrarily close
to 1. Again for simplicity, we prove the assertion for an MBSC.
Theorem 2.2.10 (The SCT/NCT, the strong converse part) Let C be the capacity of an MBSC with the channel matrix
( 1 − p    p
    p    1 − p ),
where 0 < p < 1/2, and take R > C. Then, with n = ⌊NR⌋, for all codes f_n : H_n → H_N and decoding rules f_N : H_N → H_n, the maximum error-probability
ε^max(f_n, f_N) := max { P( error under f_N | f_n(u) ) : u ∈ H_n }   (2.2.26a)
obeys
lim sup_{N→∞} ε^max(f_n, f_N) = 1.   (2.2.26b)
Proof As in Section 1.4, we can assume that codes fn are one-to-one and obey
fN ( fn (u)) = u, for all u ∈ Hn (otherwise, the chances of erroneous decoding will
be even larger). Assume the opposite of (2.2.26b):
ε max ( fn , fN ) ≤ c for some c < 1 and all N large enough. (2.2.27)
Our aim is to deduce from (2.2.27) that R ≤ C. As before, set n = ⌊NR⌋ and let f_n(u(i)) be the codeword for string u(i) ∈ H_n, i = 1, . . . , 2^n. Let D_i ⊂ H_N be the set of binary strings where f_N returns f_n(u(i)): f_N(a) = f_n(u(i)) if and only if a ∈ D_i. Then D_i ∋ f_n(u(i)), the sets D_i are pairwise disjoint, and if the union ∪_i D_i ≠ H_N then on the complement H_N \ ∪_i D_i the channel declares an error. Set s_i = ♯D_i, the size of the set D_i.
Our first step is to ‘improve’ the decoding rule, by making it ‘closer’ to the ML rule. In other words, we want to replace each D_i with a new set, C_i ⊂ H_N, of the same cardinality ♯C_i = s_i, but of a more ‘rounded’ shape (i.e. closer to a Hamming ball B_N(f_n(u(i)), b_i)). That is, we look for pairwise disjoint sets C_i, of cardinalities ♯C_i = s_i, satisfying
Denote the new decoding rule by gN . As the flipping probability p < 1/2, the
relation (2.2.29) implies that
P(error when using gN | fn (u(i))) ≤ P(error when using fN | fn (u(i))). (2.2.30)
Then, clearly,
ε max ( fn , gN ) ≤ ε max ( fn , fN ) ≤ c. (2.2.31)
Next, suppose that there exists p′ < p such that, for any N large enough,
Then, by virtue of (2.2.28) and (2.2.31), with C_i^c standing for the complement H_N \ C_i,
P( at least Np′ digits distorted | f_n(u(i)) )
≤ P( at least b_i + 1 digits distorted | f_n(u(i)) )
≤ P( C_i^c | f_n(u(i)) ) ≤ ε^max(f_n, g_N) ≤ c.
This would lead to a contradiction, since, by the law of large numbers, as N → ∞,
the probability
uniformly in the choice of the input word x ∈ HN . (In fact, this probability does
not depend on x ∈ HN .)
Thus, we cannot have p′ ∈ (0, p) such that, for N large enough, (2.2.32) holds true. That is, the opposite is true: for any given p′ ∈ (0, p), we can find an arbitrarily large N such that
b_i > Np′, for all i = 1, . . . , 2^n.   (2.2.33)
(As we claim (2.2.33) for all p′ ∈ (0, p), it does not matter if in the LHS of (2.2.33) we put b_i or b_i + 1.)
At this stage we again use the explicit expression for the volume of the Hamming ball:
s_i = ♯D_i = ♯C_i ≥ v_N(b_i) = ∑_{0≤j≤b_i} \binom{N}{j} ≥ \binom{N}{b_i}
≥ \binom{N}{⌈Np′⌉},  provided that b_i > Np′.   (2.2.34)
A useful bound has been provided in Worked Example 2.2.7 (see (2.2.25)):
v_N(R) ≥ 2^{Nη(R/N)} / (N + 1).   (2.2.35)
We are now in a position to finish the proof of Theorem 2.2.10. In view of (2.2.35), we have that, for all p′ ∈ (0, p), we can find an arbitrarily large N such that
s_i ≥ 2^{N(η(p′)−ε_N)}, for all 1 ≤ i ≤ 2^n,
with lim_{N→∞} ε_N = 0. As the original sets D_1, . . . , D_{2^n} are disjoint, we have that
s_1 + · · · + s_{2^n} ≤ 2^N,  implying that  2^{N(η(p′)−ε_N)} × 2^{NR} ≤ 2^N,
or
η(p′) − ε_N + ⌊NR⌋/N ≤ 1,  implying that  R ≤ 1 − η(p′) + ε_N + 1/N.
As N → ∞, the RHS tends to 1 − η(p′). So, given any p′ ∈ (0, p), R ≤ 1 − η(p′). This is true for all p′ < p, hence R ≤ 1 − η(p) = C. This completes the proof of Theorem 2.2.10.
We have seen that the analysis of intersections of a given set X in a Hamming
space HN (and more generally, in HN,q ) with various balls BN (y, s) reveals a lot
about the set X itself. In the remaining part of this section such an approach will
be used for producing some advanced bounds on q-ary codes: the Elias bound and
the Johnson bound. These bounds are among the best-known general bounds for
codes, and they are competing.
The Elias bound is proved in a fashion similar to Plotkin’s: cf. Theorem 2.1.15
and Worked Example 2.1.18. We count codewords from a q-ary [N, M, d] code X
in balls BN,q (y, s) of radius s about words y ∈ HN,q . More precisely, we count pairs
(x, BN,q (y, s)) where x ∈ X ∩ BN,q (y, s). If ball BN,q (y, s) contains Ky codewords
then
∑ Ky = MvN,q (s) (2.2.36)
y∈HN
as each word x falls in vN,q (s) of the balls BN,q (y, s).
A ball BN,q (y, s) with property (2.2.37) is called critical (for code X ).
Theorem 2.2.12 (The Elias bound) Set θ = (q − 1)/q. Then for all integers s ≥ 1 such that s < θN and s² − 2θNs + θNd > 0, the maximum size M_q^*(N, d) of a q-ary code of length N and distance d satisfies
M_q^*(N, d) ≤ [ θNd / (s² − 2θNs + θNd) ] · [ q^N / v_{N,q}(s) ].   (2.2.38)
Proof Fix a critical ball B_{N,q}(y, s) and consider the code X′ obtained by subtracting the word y from the codewords of X: X′ = {x − y : x ∈ X}. Then X′ is again an [N, M, d] code. So, we can assume that y = 0 and B_{N,q}(0, s) is a critical ball.
Then take X_1 = X ∩ B_{N,q}(0, s) = {x ∈ X : w(x) ≤ s}. The code X_1 is [N, K, e] where e ≥ d and K (= K_0) ≥ M v_{N,q}(s)/q^N. As in the proof of the Plotkin bound, consider the sum of the distances between the codewords in X_1:
S_1 = ∑_{x∈X_1} ∑_{x′∈X_1} δ(x, x′).
Again, we have that S_1 ≥ K(K − 1)e. On the other hand, if k_{ij} is the number of letters j ∈ J_q = {0, . . . , q − 1} in the ith position over all codewords x ∈ X_1 then
S_1 = ∑_{1≤i≤N} ∑_{0≤j≤q−1} k_{ij} (K − k_{ij}).
Then, writing L := ∑_{1≤i≤N} k_{i0} and using ∑_{1≤j≤q−1} k_{ij}² ≥ (K − k_{i0})²/(q − 1),
S_1 ≤ NK² − ∑_{1≤i≤N} [ k_{i0}² + (K − k_{i0})²/(q − 1) ]
= NK² − (1/(q − 1)) ∑_{1≤i≤N} [ (q − 1)k_{i0}² + K² − 2Kk_{i0} + k_{i0}² ]
= NK² − (1/(q − 1)) ∑_{1≤i≤N} ( q k_{i0}² + K² − 2Kk_{i0} )
= NK² − (N/(q − 1)) K² − (q/(q − 1)) ∑_{1≤i≤N} k_{i0}² + (2K/(q − 1)) ∑_{1≤i≤N} k_{i0}
= ((q − 2)/(q − 1)) NK² − (q/(q − 1)) ∑_{1≤i≤N} k_{i0}² + (2/(q − 1)) KL.
By Cauchy–Schwarz,
∑_{1≤i≤N} k_{i0}² ≥ (1/N) ( ∑_{1≤i≤N} k_{i0} )² = L²/N.
1≤i≤N 1≤i≤N
Then
q−2 q 1 2 2
S≤ NK 2 − L + KL
q−1 q−1 N q−1
1 q
= (q − 2)NK 2 − L2 + 2KL .
q−1 N
The maximum of the quadratic expression in the square brackets occurs at L = NK/q. Recall that L ≥ K(N − s). So, choosing K(N − s) ≥ NK/q, i.e. s ≤ N(q − 1)/q, we can estimate
S_1 ≤ (1/(q − 1)) [ (q − 2)NK² − (q/N) K²(N − s)² + 2K²(N − s) ]
= (1/(q − 1)) K² s [ 2(q − 1) − qs/N ].
This yields the inequality K(K − 1)e ≤ (1/(q − 1)) K² s [ 2(q − 1) − qs/N ], which can be solved for K:
K ≤ θNe / (s² − 2θNs + θNe),
provided that s < Nθ and s² − 2θNs + θNe > 0. Finally, recall that X_1 arose from an [N, M, d] code X, with K ≥ M v_{N,q}(s)/q^N and e ≥ d. As a result, we obtain that
M v_{N,q}(s)/q^N ≤ θNd / (s² − 2θNs + θNd).
This leads to the Elias bound (2.2.38).
The ideas used in the proof of the Elias bound (and earlier in the proof of the Plotkin bounds) are also helpful in obtaining bounds for W_2^*(N, d, ℓ), the maximal size of a binary (non-linear) code X ⊂ H_{N,2} of length N, distance d(X) ≥ d and with the property that the weight w(x) ≡ ℓ, x ∈ X. First, three obvious statements:
(i) W_2^*(N, 2k, k) = ⌊N/k⌋,
(ii) W_2^*(N, 2k, ℓ) = W_2^*(N, 2k, N − ℓ),
(iii) W_2^*(N, 2k − 1, ℓ) = W_2^*(N, 2k, ℓ), ℓ/2 ≤ k ≤ ℓ.
Solution Take an [N, M, 2k] code X such that w(x) ≡ ℓ, x ∈ X. As before, let k_{i1} be the number of 1s in position i over all codewords. Consider the sum of the dot-products D = ∑_{x,x′∈X} 1(x ≠ x′) ⟨x · x′⟩. We have
⟨x · x′⟩ = w(x ∧ x′) = (1/2) [ w(x) + w(x′) − δ(x, x′) ] ≤ (1/2)(2ℓ − 2k) = ℓ − k
and hence
D ≤ (ℓ − k) M(M − 1).
On the other hand, the contribution to D from position i equals k_{i1}(k_{i1} − 1), i.e.
D = ∑_{1≤i≤N} k_{i1}(k_{i1} − 1) = ∑_{1≤i≤N} (k_{i1}² − k_{i1}) = ∑_{1≤i≤N} k_{i1}² − Mℓ.
Since, by Cauchy–Schwarz, ∑_{1≤i≤N} k_{i1}² ≥ (Mℓ)²/N, this gives
ℓ²M²/N − Mℓ ≤ D ≤ (ℓ − k)M(M − 1).
This immediately leads to (2.2.39).
Solution Again take an [N, M, 2k] code X such that w(x) ≡ ℓ for all x ∈ X. Consider the shortening of the code X on x_1 = 1 (cf. Example 2.1.8(v)): it gives a code of length (N − 1), distance ≥ 2k and constant weight (ℓ − 1). Hence, the size of the cross-section is ≤ W_2^*(N − 1, 2k, ℓ − 1). Therefore, the number of 1s at position 1 in the codewords of X does not exceed W_2^*(N − 1, 2k, ℓ − 1). Repeating this argument, we obtain that the total number of 1s in all positions is ≤ N W_2^*(N − 1, 2k, ℓ − 1). But this number equals Mℓ, i.e. Mℓ ≤ N W_2^*(N − 1, 2k, ℓ − 1). The bound (2.2.40) then follows.
Corollary 2.2.15 For all positive integers N ≥ 1, k ≤ N/2 and 2k ≤ ℓ ≤ 4k − 2,
W_2^*(N, 2k − 1, ℓ) = W_2^*(N, 2k, ℓ) ≤ ⌊ (N/ℓ) ⌊ ((N − 1)/(ℓ − 1)) ⌊ · · · ⌊ (N − ℓ + k)/k ⌋ · · · ⌋ ⌋ ⌋.   (2.2.41)
The remaining part of Section 2.2 focuses on the Johnson bound. This bound aims at improving the binary Hamming bound (cf. (2.1.8b) with q = 2):
M_2^*(N, 2E + 1) ≤ 2^N / v_N(E),  or  v_N(E) M_2^*(N, 2E + 1) ≤ 2^N.   (2.2.42)
The Johnson bound states that
M_2^*(N, 2E + 1) ≤ 2^N / v*_N(E),   (2.2.43)
where
v*_N(E) = v_N(E) + (1/⌊N/(E + 1)⌋) [ \binom{N}{E + 1} − \binom{2E + 1}{E} W_2^*(N, 2E + 1, 2E + 1) ].   (2.2.44)
Recall that v_N(E) = ∑_{0≤s≤E} \binom{N}{s} stands for the volume of the binary Hamming ball of radius E. We begin our derivation of bound (2.2.43) with the following result.
Lemma 2.2.16 If x, y are binary words with δ(x, y) = 2ℓ + 1, then there exist \binom{2ℓ + 1}{ℓ} binary words z such that δ(x, z) = ℓ + 1 and δ(y, z) = ℓ.
Proof Left as an exercise.
as none of the words z ∈ T falls in any of the balls of radius E about the codewords
y ∈ X . The bound (2.2.43) will follow when we solve the next worked example.
Worked Example 2.2.17 Prove that the cardinality ♯T is greater than or equal to the second term from the RHS of (2.2.44):
( M / ⌊N/(E + 1)⌋ ) [ \binom{N}{E + 1} − \binom{2E + 1}{E} W_2^*(N, 2E + 1, 2E + 1) ].   (2.2.47)
Corollary 2.2.18 In view of Corollary 2.2.15 the following bound holds true:
M^*(N, 2E + 1) ≤ 2^N [ v_N(E) + (1/⌊N/(E + 1)⌋) \binom{N}{E} ( (N − E)/(E + 1) − ⌊(N − E)/(E + 1)⌋ ) ]^{−1}.   (2.2.55)
Example 2.2.19 Let N = 13 and E = 2, i.e. d = 5. Inequality (2.2.41) implies
W^*(13, 5, 5) ≤ ⌊ (13/5) ⌊ (12/4) ⌊ 11/3 ⌋ ⌋ ⌋ = 23,
and the Johnson bound in (2.2.43) yields
M^*(13, 5) ≤ ⌊ 2^{13} / ( 1 + 13 + 78 + (286 − 10 × 23)/4 ) ⌋ = 77.
This bound is much better than Hamming’s which gives M ∗ (13, 5) ≤ 89. In fact, it
is known that M ∗ (13, 5) = 64. Compare Section 3.4.
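The numbers in Example 2.2.19 can be reproduced directly (a sketch of ours): the iterated-floor bound (2.2.41), the Johnson denominator from (2.2.44), and the Hamming bound for comparison.

from math import comb, floor

def w_bound(N, k, l):
    # iterated floor bound (2.2.41) on W_2^*(N, 2k-1, l) = W_2^*(N, 2k, l)
    val = (N - l + k) // k
    for j in range(k + 1, l + 1):          # work outwards: factor (N-l+j)/j for j = k+1..l
        val = (N - l + j) * val // j
    return val

N, E = 13, 2
W = w_bound(N, E + 1, 2 * E + 1)            # W_2^*(13, 5, 5) <= 23
vN = sum(comb(N, s) for s in range(E + 1))  # 1 + 13 + 78 = 92
v_star = vN + (comb(N, E + 1) - comb(2 * E + 1, E) * W) / (N // (E + 1))
print(W, floor(2 ** N / v_star), floor(2 ** N / vN))   # 23, 77 (Johnson), 89 (Hamming)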
Proof A basis of the code contains k linearly independent vectors. The code is
generated by the basis; hence it consists of the sums of basic vectors. There are
precisely 2k sums (the number of subsets of {1, . . . , k} indicating the summands),
and they all give different vectors.
Consequently, a binary linear code X of rank k may be used for encoding all
possible source strings of length k; the information rate of a binary linear [N, k]
code is k/N. Thus, indicating k ≤ N linearly independent words x ∈ HN identifies
a (unique) linear code X ⊂ HN of rank k. In other words, a linear binary code of
rank k is characterised by a k × N matrix of 0s and 1s with linearly independent
rows:
G = ( g_{11} . . . g_{1N}
      g_{21} . . . g_{2N}
        . . .
      g_{k1} . . . g_{kN} ).
Namely, we take the rows g^(i) = g_{i1} . . . g_{iN}, 1 ≤ i ≤ k, as the basic vectors of a linear code.
Here, for x, y ∈ H_N,
⟨x · y⟩ = ⟨y · x⟩ = ∑_{i=1}^N x_i y_i,  where x = x_1 . . . x_N, y = y_1 . . . y_N;   (2.3.2)
The inner product (2.3.2) possesses all properties of the Euclidean scalar product
in RN , but one: it is not positive definite (and therefore does not define a norm). That
is, there are non-zero vectors x ∈ HN with "x · x# = 0. Luckily, we do not need the
positive definiteness.
However, the key rank–nullity property holds true for the dot-product: if L is
a linear subspace in HN of rank k then its orthogonal complement L ⊥ (i.e. the
collection of vectors z ∈ HN such that "x · z# = 0 for all x ∈ L ) is a linear subspace
of rank N − k. Thus, the (N − k) rows of H can be considered as a basis in X ⊥ ,
the orthogonal complement to X .
The matrix H (or sometimes its transpose H T ) with the property X = ker H or
"x · h( j)# ≡ 0 (cf. (2.3.1)) is called a parity-check (or, simply, check) matrix of code
X . In many cases, the description of a code by a check matrix is more convenient
than by a generating one.
The parity-check matrix is again not unique as the basis in X ⊥ can be chosen
non-uniquely. In addition, in some situations where a family of codes is consid-
ered, of varying length N, it is more natural to identify a check matrix where the
number of rows can be greater than N − k (but some of these rows will be linearly
dependent); such examples appear in Chapter 3. However, for the time being we
will think of H as an (N − k) × N matrix with linearly independent rows.
Here, I is a unit N × N matrix and the zeros mean the zero matrices (of size (N −
k) × N and N × N, accordingly). The number of the unit matrices in the first column
equals m − 1. (This is not a unique form of H re (m).) The size of H re (m) is (Nm −
k) × Nm.
The rank is unchanged, the minimal distance in X re (m) is md and the informa-
tion rate ρ /m.
Worked Example 2.3.5 A dual code of a linear binary [N, k] code X is defined
as the set X ⊥ of the words y = y1 . . . yN such that the dot-product
Show that the subset of a linear binary code consisting of all words of even
weight is a linear code. Prove that, for d even, if there exists a linear [N, k, d] code
then there exists a linear [N, k, d] code with codewords of even weight.
Solution The size is 2^k and the number of different bases is (1/k!) ∏_{i=0}^{k−1} (2^k − 2^i). Indeed, if the first l basis vectors are selected, all their 2^l linear combinations should be excluded on the next step. This gives 840 for k = 4, and 28 for k = 3.
Finally, for d even, we can truncate the original code and then use the parity-check extension.
Finally, for d even, we can truncate the original code and then use the parity-
check extension.
Example 2.3.7 The binary Hamming [7, 4] code is determined by a 3 × 7 parity-check matrix. The columns of the check matrix are all non-zero words of length 3. Using the lexicographical order of these words we obtain
H^Ham_lex = ( 1 0 1 0 1 0 1
              0 1 1 0 0 1 1
              0 0 0 1 1 1 1 ).
The corresponding generating matrix may be written as
G^Ham_lex = ( 0 0 1 1 0 0 1
              0 1 0 0 1 0 1
              0 0 1 0 1 1 0
              1 1 1 0 0 0 0 ).   (2.3.3)
In many cases it is convenient to write the check matrix of a linear [N, k] code in a canonical (or standard) form:
H^can = ( I_{N−k} | H′ ).   (2.3.4a)
In the case of the Hamming [7, 4] code it gives
H^Ham_can = ( 1 0 0 1 1 0 1
              0 1 0 1 0 1 1
              0 0 1 0 1 1 1 ),
with a generating matrix also in a canonical form:
G^can = ( G′ | I_k );   (2.3.4b)
namely,
G^Ham_can = ( 1 1 0 1 0 0 0
              1 0 1 0 1 0 0
              0 1 1 0 0 1 0
              1 1 1 0 0 0 1 ).
Formally, Glex and Gcan determine different codes. However, these codes are
equivalent:
Definition 2.3.8 Two codes are called equivalent if they differ only in a permutation of digits. For linear codes, equivalence means that their generating matrices can be transformed into each other by permutation of columns and by row-operations, including addition of rows multiplied by scalars. It is plain that equivalent codes have the same parameters (length, rank, distance).
Theorem 2.3.11
(i) The distance of a linear binary code equals the minimal weight of its non-zero
codewords.
(ii) The distance of a linear binary code equals the minimal number of linearly
dependent columns in the check matrix.
Proof (i) As the code X is linear, the sum x + y ∈ X for each pair of codewords
x, y ∈ X . Owing to the shift invariance of the Hamming distance (see Lemma
2.1.1), δ (x, y) = δ (0, x + y) = w(x + y) for any pair of codewords. Hence, the
minimal distance of X equals the minimal distance between 0 and the rest of the
code, i.e. the minimal weight of a non-zero codeword from X .
(ii) Let X be a linear code with a parity-check matrix H and minimal distance d.
Then there exists a codeword x ∈ X with exactly d non-zero digits. Since xH T = 0,
we conclude that there are d columns of H which are linearly dependent (they
correspond to non-zero digits in x). On the other hand, if there existed d − 1 columns of H which were linearly dependent then some non-empty subset of them would sum to zero. But that means that there exists a word y, of weight w(y) ≤ d − 1, such that yH^T = 0. Then y must belong to X, which is impossible, since min[w(x) : x ∈ X, x ≠ 0] = d.
Theorem 2.3.12 The Hamming [7, 4] code has minimal distance 3, i.e. it detects
2 errors and corrects 1. Moreover, it is a perfect code correcting a single error.
Proof For any pair of columns, the parity-check matrix H^Ham_lex contains their sum as another column, which yields a linearly dependent triple (viz. look at columns 1, 6, 7). No two columns are linearly dependent because they are distinct and non-zero (x + y = 0 means that x = y). Also,
the volume v7 (1) equals 1 + 7 = 23 , and the code is perfect as its size is 24 and
24 × 23 = 27 .
The rows of H Ham are linearly independent, and hence H Ham may be considered
as a check matrix of a linear code of length N = 2l − 1 and rank N − l = 2l −
1 − l. Any two columns of H Ham are linearly independent but there exist linearly
dependent triples of columns, e.g. x, y and x + y. Hence, the code X Ham with the
check matrix H Ham has a minimal distance 3, i.e. it detects 2 errors and corrects 1.
codes.
192 Introduction to Coding Theory
Worked Example 2.3.17 Show that word x∗ is always a codeword that min-
imises the distance between y and the words from X .
and contains at most one non-zero digit. If the syndrome yH T = s occupies position
i among the columns of the parity-check matrix then, for word ei = 0 . . . 1 0 . . . 0
with the only non-zero digit i,
(y + ei )H T = s + s = 0.
That is, (y + ei ) ∈ X and ei ∈ y + X . Obviously, ei is the leader.
⊥
The duals X Ham of binary Hamming codes form a particular class, called sim-
plex codes. If X Ham is [2 − 1, 2 − 1 − ], its dual (X Ham )⊥ is [2 − 1, ], and the
⊥
original parity-check matrix H Ham serves as a generating matrix for X Ham .
Worked Example 2.3.20 Prove that each non-zero codeword in a binary simplex
⊥
code X Ham has weight 2−1 and the distance between any two codewords equals
2−1 . Hence justify the term ‘simplex’.
We now discuss the decoding procedure for a general linear code X of rank
k. As was noted before, it may be used for encoding source messages (strings)
u = u1 . . . uk of length k. The source encoding u ∈ Fkq → X becomes particularly
simple when the generating and parity-check matrices are used in the canonical (or
standard) form.
Theorem 2.3.23 For any linear code X there exists an equivalent code X with
the generating matrix Gcan and the check matrix H can in standard form (2.3.4a),
(2.3.4b) and G = −(H )T .
Proof Assume that code X is non-trivial (i.e. not reduced to the zero word 0).
Write a basis for X and take the corresponding generating matrix G. By perform-
ing row-operations (where a pair of rows i and j are exchanged or row i is replaced
by row i plus row j) we can change the basis, but do not change the code. Our
matrix G contains a non-zero column, say l1 : perform row operations to make g1l1
the only non-zero entry in this column. By permuting digits (columns), place col-
umn l1 at position N − k. Drop row 1 and column N − k (i.e. the old column l1 )
and perform a similar procedure with the rest, ending up with the only non-zero
entry g2l2 in a column l2 . Place column l2 at position N − k + 1. Continue until an
upper triangular k × k submatrix emerges. Further operations may be reduced to
this matrix only. If this matrix is a unit matrix, stop. If not, pick the first column
with more than one non-zero entry. Add the corresponding rows from the bottom
to ‘kill’ redundant non-zero entries. Repeat until a unit submatrix emerges. Now a
generating matrix is in a standard form, and new code is equivalent to the original
one.
To complete the proof, observe that matrices Gcan and H can figuring in (2.3.4a),
(2.3.4b) with G = −(H )T , have k independent rows and N − k independent
columns, correspondingly. Besides, the k × (N − k) matrix Gcan (H can )T vanishes.
In fact,
Returning to source encoding, select the generating matrix in the canonical form
k
Gcan . Then, given a string u = u1 . . . uk , we set x = ∑ ui gcan (i), where gcan (i) rep-
i=1
resents row i of Gcan . The last k digits in x give string u; they are called the infor-
mation digits. The first N − k digits are used to ensure that x ∈ X ; they are called
the parity-check digits.
detection and correction of errors), and the final k string yields the message from
F×k
q . As in the binary case, the parity-check matrix H satisfies Theorem 2.3.11. In
particular, the minimal distance of a code equals the minimal number of linearly
dependent columns in its parity-check matrix H .
Definition 2.3.24 Given an [N, k] linear q-ary code X with parity-check matrix
H, the syndrome of an N vector y ∈ F×N ×k
q is the k vector yH ∈ Fq , and the syn-
T
In the case of linear codes, some of the bounds can be improved (or rather new
bounds can be produced).
Worked Example 2.3.25 Let X be a binary linear [N, k, d] code.
(a) Fix a codeword x ∈ X with exactly d non-zero digits. Prove that truncating
X on the non-zero digits of x produces a code XN−d of length N − d , rank
k − 1 and distance d for some d ≥ d/2.
(b) Deduce the Griesmer bound improving the Singleton bound (2.1.12):
E F
d
N ≥d+ ∑
. (2.3.9)
1≤≤k−1 2
Solution (a) Without loss of generality, assume that the non-zero digits in x are
x1 = · · · = xd = 1. Truncating on digits 1, . . ., d will produce the code XN−d with
the rank reduced by 1. Indeed, suppose that a linear combination of k − 1 vectors
vanishes on positions d + 1, . . . , N. Then on the positions 1, . . . , d all the values
equal either 0s or 1s because d is the minimal distance. But the first case is im-
possible, unless the vectors are linearly dependent. The second case also leads to
contradiction by adding the string x and obtaining k linearly
E F dependent vectors in
d
the code X . Next, suppose that X has distance d < and take y ∈ X with
2
N
w(y ) = ∑ y j = d .
j=d+1
198 Introduction to Coding Theory
d
x
y
y⬘
x^y
y + (x^y)
x + (x^y)
Figure 2.5
Proof Let X be a linear code of maximal rank with distance at least d of maximal
size. If inequality (2.3.10) is violated the union of all Hamming spheres of radius
d − 1 centred on codewords cannot cover the whole Hamming space. So, there
must be a point y that is not in any Hamming sphere around a codeword. Then for
any codeword x and any scalar b ∈ Fq the vectors y and y + b · x are in the same
coset by X . Also y + b · x cannot be in any Hamming sphere of radius d − 1. The
same is true for x + b · y because if it were, then y would be in a Hamming sphere
around another codeword. Here we use the fact that Fq is a field. Then the vector
subspace spanned by X and y is a linear code larger than X and with a minimal
distance at least d. That is a contradiction, which completes the proof.
For example, let q = 2 and N = 10. Then 25 < v10,2 (2) = 56 < 26 . Upon taking
d = 3, the Gilbert bound guarantees the existence of a binary [10, 5] code with
d ≥ 3.
to exclude columns that are multiples of each other. To this end, we can choose
as columns all non-zero -words that have 1 in their top-most non-0 component.
q − 1
Such columns are linearly independent, and their total equals . Next, as in
q−1
the binary case, one can arrange words with digits from Fq in the lexicographic
order. By construction, any two columns of H H are linearly
independent, but there
exist triples of linearly dependent columns. Hence, d X H = 3, and X H detects
two errors and corrects one. Furthermore, X H is a perfect code correcting a single
error, as
q − 1
M(1 + (q − 1)N) = q 1 + (q − 1)
k
= qk+ = qN .
q−1
As in the binary case, the general Hamming codes admit an efficient (and el-
egant) decoding procedure. Suppose a parity-check matrix H = H H has been
constructed as above. Upon receiving a word y ∈ F×N q we calculate the syn-
T ×
drome yH ∈ Fq . If yH = 0 then y is a codeword. Otherwise, the column-
T
Until the late 1950s, the Hamming codes were a unique family of codes exist-
ing in dimensions N → ∞, with ‘regular’ properties. It was then discovered that
these codes have a deep algebraic background. The development of the algebraic
methods based on these observations is still a dominant theme in modern coding
theory.
Another important example is the four Golay codes (two binary and two ternary).
Marcel Golay (1902–1989) was a Swiss electrical engineer who lived and worked
in the USA for a long time. He had an extraordinary ability to ‘see’ the discrete
geometry of the Hamming spaces and ‘guess’ the construction of various codes
without bothering about proofs.
The binary Golay code X24Gol is a [24, 12] code withthe generating
matrix G =
(I12 |G ) where I12 is a 12 × 12 identity matrix, and G = G (2) has the following
form:
⎛ ⎞
0 1 1 1 1 1 1 1 1 1 1 1
⎜ 1 1 1 0 1 1 1 0 0 0 1 0 ⎟
⎜ ⎟
⎜ 1 1 0 1 1 1 0 0 0 1 0 1 ⎟
⎜ ⎟
⎜ 1 0 1 1 1 0 0 0 1 0 1 1 ⎟
⎜ ⎟
⎜ 1 1 1 1 0 0 0 1 0 1 1 0 ⎟
⎜ ⎟
⎜ 1 1 1 0 0 0 1 0 1 1 0 1 ⎟
G =⎜
⎜ 1 1 0 0 0 1 0 1 1
⎟.
⎟ (2.4.1)
⎜ 0 1 1 ⎟
⎜ 1 0 0 0 1 0 1 1 0 1 1 1 ⎟
⎜ ⎟
⎜ 1 0 0 1 0 1 1 0 1 1 1 0 ⎟
⎜ ⎟
⎜ 1 0 1 0 1 1 0 1 1 ⎟
⎜ 1 0 0 ⎟
⎝ 1 1 0 1 1 0 1 1 1 0 0 0 ⎠
1 0 1 1 0 1 1 1 0 0 0 1
The rule of forming matrix G is ad hoc (and this is how it was determined by M.
Golay in 1949). There will be further ad hoc arguments in the analysis of Golay
codes.
Remark 2.4.3 Interestingly, there is a systematic way of constructing all code-
words of X24Gol (or its equivalent) by fitting together two versions of Hamming [7, 4]
code X7H . First, observe that reversing the order of all the digits of a Hamming
code X7H yields an equivalent code which we denote by X7K . Then add a parity-
check to both X7H and X7K , producing codes X8H,+ and X8K,+ . Finally, select
two different words a, b ∈ X8H,+ and a word x ∈ X8K,+ . Then all 212 codewords
of X24Gol of length 24 could be written as concatenation (a + x)(b + x)(a + b + x).
This can be checked by inspection of generating matrices.
⊥
Lemma 2.4.4 The binary Golay code X24Gol is self-dual, with X24Gol = X24Gol .
The code X24Gol is also generated by the matrix G = (G |I12 ).
202 Introduction to Coding Theory
Proof A direct calculation shows that any two rows of matrix G are dot-
⊥ ⊥
orthogonal. Thus X24Gol ⊂ X24Gol . But the dimensions of X24Gol and X24Gol
⊥
coincide. Hence, X24Gol = X24Gol . The last assertion of the lemma now follows
from the property (G )T = G .
Solution First, we check that for all x ∈ X24Gol the weight w(x) is divisible by 4 .
This is true for every row of G = (I12 |G ): the number of 1s is either 12 or 8. Next,
for all binary N-words x, x ,
When we truncate X24Gol at any digit, we get X23Gol , a [23, 12, 7] code. This code
is perfect 3 error correcting. We recover X24Gol from X23Gol by adding a parity-
check.
The Hamming [2 −1, 2 −1−, 3] and the Golay [23, 12, 7] are the only possible
perfect binary linear codes.
The ternary Golay code X12,3 Gol of length 12 has the generating matrix I |G
6 (3)
where
⎛ ⎞
0 1 1 1 1 1
⎜ 1 0 1 2 2 1 ⎟
⎜ ⎟
⎜ ⎟
⎜ 1 1 0 1 2 2 ⎟
G (3) = ⎜ ⎟ , with (G (3) )T = G (3) . (2.4.2)
⎜ 1 2 1 0 1 2 ⎟
⎜ ⎟
⎝ 1 2 2 1 0 1 ⎠
1 1 2 2 1 0
Gol is a truncation of X Gol at the last digit.
The ternary Golay code X11,3 12,3
2.4 The Hamming, Golay and Reed–Muller codes 203
(Gol)⊥ (Gol)
Theorem 2.4.6 The ternary Golay code X12,3 = X12,3 is [12, 6, 6]. The code
(Gol)
X11,3 is [11, 6, 5], hence perfect.
11 × 10
Proof The code [11, 6, 5] is perfect since v11,3 (2) = 1 + 11 × 2 + × 22 =
2
35 . The rest of the assertions of the theorem are left as an exercise.
3 −1
The Hamming , 3 − 1 − , 3 and the Golay [11, 6, 5] codes are the only
2
possible perfect ternary linear codes. Moreover, the Hamming and Golay are the
only perfect linear codes, occurring in any alphabet Fq where q = ps is a prime
power. Hence, these codes are the only possible perfect linear codes. And even
non-linear perfect codes do not bring anything essentially new: they all have the
same parameters (length, size and distance) as the Hamming and Golay codes. The
Golay codes were used in the 1980s in the American Voyager spacecraft program,
to transmit close-up photographs of Jupiter and Saturn.
The next popular examples are the Reed–Muller codes. For N = 2m consider
binary Hamming spaces Hm,2 and HN,2 . Let M(= Mm ) be an m × N matrix where
the columns are the binary representations of the integers j = 0, 1, . . . , N − 1, with
the least significant bit in the first place:
j = j1 · 20 + j2 · 21 + · · · + jm 2m−1 . (2.4.3)
So,
0 1 2 . . . 2m − 1
⎛ ⎞
0 1 0 ... 1 v(1)
⎜ 0 0 1 ... 1 ⎟ v(2)
⎜ ⎟
⎜ .. .. .. .. .. ⎟ ..
M=⎜ . . . . . ⎟ . . (2.4.4)
⎜ ⎟
⎝ 0 0 0 ... 1 ⎠ v(m−1)
0 0 0 ... 1 v(m)
The columns of M list all vectors from Hm,2 and the rows are vectors from HN,2
denoted by v(1) , . . . , v(m) . In particular, v(m) has the first 2m−1 entries 0, the last
2m−1 entries 1. To pass from Mm to Mm−1 , one drops the last row and takes one of
the two identical halves of the remaining (m − 1) × N matrix. Conversely, to pass
from Mm−1 to Mm , one concatenates two copies of Mm−1 and adds row v(m) :
Mm−1 Mm−1
Mm = . (2.4.5)
0...0 1...1
204 Introduction to Coding Theory
1 0 ... 0
0 1 ... 0
.... . . .. .
. . . .
0 0 ... 1
In terms of the wedge-product (cf. (2.1.6b)) v(i1 ) ∧ v(i2 ) ∧ · · · ∧ v(ik ) is the indicator
function of the intersection Ai(1) ∩ · · · ∩ Ai(k) . If all i1 , . . . , ik are distinct, the cardi-
nality (∩1≤ j≤k Ai( j) ) = 2m−k . In other words, we have the following.
An important fact is
Theorem 2.4.8 The vectors v(0) = 11 . . . 1 and ∧1≤ j≤k v(i j ) , 1 ≤ i1 < · · · < ik ≤ m,
k = 1, . . . , m, form a basis in HN,2 .
(i)
e( j) = ∧1≤i≤m (v(i) + (1 + v j )v(0) ), 0 ≤ j ≤ N − 1. (2.4.7)
[All factors in position j are equal to 1 and at least one factor in any position l = j
is equal to 0.]
2.4 The Hamming, Golay and Reed–Muller codes 205
v(0) = 1111111111111111
v(1) = 0101010101010101
v(2) = 0011001100110011
v(3) = 0000111100001111
v(4) = 0000000011111111
v(1) ∧ v(2) = 0001000100010001
v(1) ∧ v(3) = 0000010100000101
v(1) ∧ v(4) = 0000000001010101
v(2) ∧ v(3) = 0000001100000011
v(2) ∧ v(4) = 0000000000110011
v(3) ∧ v(4) = 0000000000001111
v(1) ∧ v(2) ∧ v(3) = 0000000100000001
v(1) ∧ v(2) ∧ v(4) = 0000000000010001
v(1) ∧ v(3) ∧ v(4) = 0000000000000101
v(2) ∧ v(3) ∧ v(4) = 0000000000000011
v(1) ∧ v(2) ∧ v(3) ∧ v(4) = 0000000000000001
Summarising,
Theorem 2.4.11 TheRMcode X RM (r, m), 0 ≤ r ≤ m, is a binary code of length
N
N = 2m , rank k = ∑ and distance d = 2m−r . Furthermore,
0≤l≤r l
Indeed, let (u|u + v) ∈ X2⊥ |X1⊥ and (x|x + y) ∈ (X1 |X2 ). The dot-product
I J
(u|u + v) · (x|x + y) = u · x + (u + v) · (x + y)
= u · y + v · (x + y) = 0,
We see that the ‘information space’ Hk,2 is embedded into HN,2 , by identifying
entries a j ∼ ai1 ,...,il where j = j0 20 + j1 21 + · · · + jm−1 2m−1 and i1 , . . . , il are the
successive positions of the 1s among j1 , . . . , jm , 1 ≤ l ≤ r. With such an identifica-
tion we obtain:
Lemma 2.4.14 For all 0 ≤ l ≤ m and 1 ≤ i1 < · · · < il ≤ m,
∑ x j = ai1 ,...,il , if l ≤ r,
j∈C(i1 ,...,il ) (2.4.22)
= 0, if l > r.
Proof The result follows from (2.4.20).
Lemma 2.4.15 For all 1 ≤ i1 < · · · < ir ≤ m and for any 1 ≤ t ≤ m such that
t∈
/ {i1 , . . . , ir },
ai1 ,...,ir = ∑ x j. (2.4.23)
j∈C(i1 ,...,ir )+2t−1
210 Introduction to Coding Theory
Proof The proof follows from the fact that C(i1 , . . . , ir ,t) is the disjoint union
C(i1 , . . . , ir )∪(C(i1 , . . . , ir )+2t−1 ) and the equation ∑ x j = 0 (cf. (2.4.19)).
j∈C(i1 ,...,ir ,t)
Moreover:
Theorem 2.4.16 For any information symbol ai1 ,...,ir corresponding to v(i1 ,...,ir ) ,
we can split the set {0, . . . , N − 1} into 2m−r disjoint subsets S, each containing 2r
elements, such that, for all such S, ai1 ,...,ir = ∑ x j .
j∈S
Proof The list of sets S begins with C(i1 , . . . , ir ) and continues with (m − r) dis-
joint sets C(i1 , . . . , ir ) + 2t−1 where 1 ≤ t ≤ m, t ∈ {i1 , . . . , ir }. Next, we take any
pair 1 ≤ t1 < t2 ≤ m such that {t1 ,t2 } ∩ {i1 , . . . , ir } = 0. / Then C(i1 , . . . , ir ,t1 ,t2 ) con-
tains disjoint sets C(i1 , . . . , ir ), C(i1 , . . . , ir ) + 2t1 −1 and C(i1 , . . . , ir ) + 2t2 −1 , and for
each of them, ai1 ,...,ir = ∑ x j , k = 1, 2. Then the same is true for the
j∈C(i1 ,...,ir )+2tk −1
remaining sets
1 , . . . , ir ,t1 , . . . ,ts )
C(i
2
−1 −1 (2.4.25)
\ C(i1 , . . . , ir ) + 2 1 + · · · + 2 s
t t .
{t1 ,...,ts }⊂{t1 ,...ts }
Here each such set is labelled by a collection {t1 , . . . ,ts } where 0 ≤ s ≤ m − r, t1 <
· · · < ts and {t1 , . . . ,ts } ∩ {i1 , . . . , ir } = 0.
/ [The union ∪{t1 ,...,t }⊂{t1 ,...ts } in (2.4.25)
s
is over all (‘strict’) subsets {t1 , . . . ,ts } of {t1 , . . . ,ts }, with t1 < · · · < ts and s =
with
det Gk = ± det Gu det Ik−u = ± det Gu = 0, by (b).
G = [IN−d+1 |E(N−d+1)×(d−1) ]
Worked Example 2.4.18 The MDS codes [N, N, 1], [N, 1, N] and [N, N − 1, 2]
always exist and are called trivial. Any [N, k] MDS code with 2 ≤ k ≤ N −2 is called
non-trivial. Show that there is no non-trivial MDS code over Fq with q ≤ k ≤ N − q.
In particular, there is no non-trivial binary MDS code (which causes a discernible
lack of enthusiasm about binary MDS codes).
Solution Indeed, the [N, N, 1], [N, N − 1, 2] and [N, 1, N] codes are MDS. Take
q ≤ k ≤ N − q and assume X is a q-ary MDS. Take its generating matrix G in the
standard form (Ik |G ) where G is k × (N − k), N − k ≥ q.
If some entries in a column of G are zero then this column is a linear combina-
tion of k − 1 columns of Ik−1 . This is impossible by (b) in the previous example;
hence G has no 0 entry. Next, assume that the first row of G is 1 . . . 1: otherwise
we can perform scalar multiplication of columns maintaining codes’ equivalence.
Now take the second row of G : it is of length N − k ≥ q and has no 0 entry. Then
these must be repeated entries. That is,
⎛ ⎞
1 ... 1 ... 1 ... 1
G = ⎝Ik . . . . . . a . . . a . . . . . . ⎠ , a = 0.
... ...
Then take the codeword
x = row 1 − a−1 (row 2);
it has w(x) ≤ N − k − 2 + 2 = N − k and X cannot be MDS.
By using the dual code, obtain that there exists no non-trivial q-ary MDS code
with k ≥ q. Hence, non-trivial MDS code can only have
N − q + 1 ≤ k or k ≤ q − 1.
That is, there exists no non-trivial binary MDS code, but there exists a non-trivial
[3, 2, 2] ternary MDS code.
Remark 2.4.19 It is interesting to find, given k and q, the largest value of N
for which there exists a q-ary MDS [N, k] code. We demonstrated that N must be
≤ k + q − 1, but computational evidence suggests this value is q + 1.
together with some other sharp observations made at the end of the 1950s, partic-
ularly the invention of BCH codes, opened a connection from the theory of linear
codes (which was then at its initial stage) to algebra, particularly to the theory of fi-
nite fields. This created algebraic coding theory, a thriving direction in the modern
theory of linear codes.
We begin with binary cyclic codes. The coding and decoding procedures for
binary cyclic codes of length N are based on the related algebra of polynomials
with binary coefficients:
Such polynomials can be added and multiplied in the usual fashion, except that
X k + X k = 0. This defines a binary polynomial algebra F2 [X]; the operations over
binary polynomials refer to this algebra. The degree deg a(X) of polynomial a(X)
equals the maximal label of its non-zero coefficient. The degree of the zero poly-
nomial is set to be 0. Thus, the representation (2.5.1) covers polynomials of degree
< N.
l l
Theorem 2.5.1 (a) (1 + X)2 = 1 + X 2 (A freshman’s dream).
(b) (The division algorithm) Let f (X) and h(X) be two binary polynomials with
h(X) ≡ 0. Then there exist unique polynomials g(X) and r(X) such that
f (X) = g(X)h(X) + r(X) with deg r(X) < deg h(X). (2.5.2)
The polynomial g(X) is called the ratio, or quotient, and r(X) the remainder.
Proof (a) The statement follows from the binomial decomposition where all
intermediate terms vanish.
(b) If deg h(X) > deg f (X) we simply set
If deg h(X) ≤ deg f (X), we can perform the ‘standard’ procedure of long divi-
sion, with the rules of the binary addition and multiplication.
Definition 2.5.3 Two polynomials, f1 (X) and f2 (X), are called equivalent
mod h(X), or f1 (X) = f2 (X) mod h(X), if their remainders, after division by h(X),
coincide. That is,
fi (X) = gi (X)h(X) + r(X), i = 1, 2,
and deg r(X) < deg h(X).
Theorem 2.5.4 Addition and multiplication of polynomials respect the equiva-
lence. That is, if
f1 (X) = f2 (X) mod h(X) and p1 (X) = p2 (X) mod h(X), (2.5.3)
then
'
f1 (X) + p1 (X) = f2 (X) + p2 (X) mod h(X),
(2.5.4)
f1 (X)p1 (X) = f2 (X)p2 (X) mod h(X).
Proof We have, for i = 1, 2,
fi (X) = gi (X)h(X) + r(X), pi (X) = qi (X)h(X) + s(X),
with
deg r(X), deg s(X) < deg h(X).
Hence
fi (X) + pi (X) = (gi (X) + qi (X))h(X) + (r(X) + s(X))
with
deg(r(X) + s(X)) ≤ max[r(X), s(X)] < deg h(X).
Thus
f1 (X) + p1 (X) = f2 (X) + p2 (X) mod h(X).
Furthermore, for i = 1, 2, the product fi (X)pi (X) is represented as
gi (X)qi (X)h(X) + r(X)qi (X) + s(X)gi (X) h(X) + r(X)s(X).
Hence, the remainder for both polynomials f1 (X)p1 (X) and f2 (X)p2 (X) may come
only from r(X)s(X). Thus it is the same for both of them.
Note that every linear binary code XN corresponds to a set of polynomials, with
coefficients 0, 1, of degree N − 1 which is closed under addition mod 2:
a(X) = a0 + a1 X + · · · + aN−1 X N−1 ↔ a(N) = a0 . . . aN−1 ,
b(X) = b0 + b1 X + · · · + bN−1 X N−1 ↔ b(N) = b0 . . . bN−1 , (2.5.5)
a(X) + b(X) ↔ a(N) + b(N) = (a0 + b0 ) . . . (aN−1 + bN−1 ).
216 Introduction to Coding Theory
Theorem 2.5.9 A binary cyclic code contains, with each pair of polynomials
a(X) and b(X), the sum a(X) + b(X) and any polynomial v(X)a(X) mod (1 + X N ).
N−k
Theorem 2.5.10 Let g(X) = ∑ gi X i be a non-zero polynomial of minimum
i=0
degree in a binary cyclic code X . Then:
(iii) Assume that property (iv) holds. Then each polynomial a(X) ∈ X has the
form
r
g(X)v(X) = ∑ vi X i g(X), r < k.
i=1
218 Introduction to Coding Theory
(iv) We know that each polynomial a(X) ∈ X has degree > deg g(X). By the
division algorithm,
a(X) = v(X)g(X) + r(X).
But then v(X)g(X) belongs to X owing to Theorem 2.5.9 (as v(X)g(X) has degree
≤ N − 1, it coincides with v(X)g(X) mod (1 + X N )). Hence,
Corollary 2.5.11 Every binary cyclic code is obtained from the codeword cor-
responding to a polynomial of minimum degree, by cyclic shifts and linear combi-
nations.
Definition 2.5.12 A polynomial g(X) of minimal degree in X is called a mini-
mal degree generator of a (cyclic) binary code X , or briefly a generator of X .
Remark 2.5.13 There may be other polynomials that generate X in the sense
of Corollary 2.5.11. But the minimum degree polynomial is unique.
Theorem 2.5.14 A polynomial g(X) of degree ≤ N − 1 is the generator of a
binary cyclic code of length N iff g(X) divides 1 + X N . That is,
1 + X N = h(X)g(X) (2.5.7)
That is,
By Theorem 2.5.10, r(X) belongs to the cyclic code X generated by g(X). But
g(X) must be the unique polynomial of minimum degree in X . Hence, r(X) = 0
and 1 + X N = h(X)g(X).
(The if part.) Suppose that 1 + X N = h(X)g(X), deg h(X) = N − deg g(X).
Consider the set {a(X) : a(X) = u(X)g(X) mod (1 + X N )}, i.e. the principal
ideal in the -multiplication polynomial ring corresponding to g(X). This set
forms a linear code; it contains g(X), Xg(X), . . . , X k−1 g(X) where k = deg h(X).
It suffices to prove that X k g(X) also belongs to the set. But X k g(X) = 1 +
k−1
X N + ∑ h j X j g(X), that is, X k g(X) is equivalent to a linear combination of
j=0
g(X), Xg(X), . . . , X k−1 g(X).
Corollary 2.5.15 All cyclic binary codes of length N are in a one-to-one corre-
spondence with the divisors of polynomial 1 + X N .
Hence, the cyclic codes are described through the factorisation of the polynomial
1 + X N . More precisely, we are interested in decomposing 1 + X N into irreducible
factors; combining these factors into products yields all possible cyclic codes of
length N.
Definition 2.5.16 A polynomial a(X) = a0 + a1 X + · · · + aN−1 X N−1 is called
irreducible if a(X) cannot be written as a product of two polynomials, b(X) and
b (X), with min[deg b(X), deg b (X)] ≥ 1.
The importance (and convenience) of irreducible polynomials for describing
cyclic codes is obvious: every generator polynomial of a cyclic code of length N is
a product of irreducible factors of (1 + X N ).
Example 2.5.17 (a) The polynomial 1 + X N has two ‘standard’ divisors:
1 + X N = (1 + X)(1 + X + · · · + X N−1 ).
The first factor 1 + X Kgenerates the binary parity-check code PN =
%
x = x0 . . . xN−1 : ∑ xi = 0 , whereas polynomial 1 + X + · · · + X N−1 (it may be
i
reducible) generates the repetition code RN = { 00 . . . 0, 11 . . . 1 }.
(b) Select the generating and check matrices of the Hamming [7, 4] code in the
lexicographic form. If we re-order the digits x4 x7 x5 x3 x2 x6 x1 (which leads to an
equivalent code) then the rows of the generating matrix become subsequent cyclic
shifts of each other:
⎛ ⎞
1101000
⎜ 0110100 ⎟
H
Gcycl =⎜ ⎝ 0011010 ⎠
⎟
0001101
220 Introduction to Coding Theory
and the cyclic shift of the last row is again in the code:
π (0 0 0 1 1 0 1) = (1 0 0 0 1 1 0)
= (1 1 0 1 0 0 0) + (0 1 1 0 1 0 0) + (0 0 1 1 0 1 0).
By Lemma 2.5.6, the code is cyclic. By Theorem 2.5.10(iii), the generating poly-
H :
nomial g(X) corresponds to the framed part in matrix Gcycl
But a similar argument can be used to show that an equivalent cyclic code is ob-
tained from the word 1011 ∼ 1 + X 2 + X 3 . There is no contradiction: it was not
claimed that the polynomial ideal of a cyclic code is the principal ideal of a unique
element.
If we choose a different order of the columns in the parity-check matrix, the
code will be equivalent to the original code; that is, the code with the generator
polynomial 1 + X 2 + X 3 is again a Hamming [7, 4] code.
In Problem 2.3 we will check that the Golay [23, 7] code is generated by the
polynomial g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 .
in F2 [X], find all cyclic binary codes of length 7. Identify those which are Hamming
codes and their duals.
{0, 1}7 1 1 + X7
parity-check 1+X ∑ Xi
0≤i≤6
Hamming 1+X + X3 1 + X2 + X3 + X4
Hamming 1 + X2 + X3 1 + X + X2 + X4
dual Hamming 1 + X2 + X3 + X4 1 + X + X3
dual Hamming 1 + X + X2 + X4 1 + X2 + X3
repetition ∑ Xi 1+X
0≤i≤6
zero 1 + X7 1
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 221
It is easy to check that all factors in (2.5.8) are irreducible. Any irreducible factor
could be included or not included in decomposition of the generator polynomial.
This argument proves that there exist exactly 8 binary codes in H7,2 as demon-
strated in the table.
Example 2.5.19 (a) Polynomials of the first degree, 1 + X and X, are irreducible
(but X does not appear in the decomposition for 1 + X N ). There is one irreducible
binary polynomial of degree 2: 1 + X + X 2 , two of degree 3: 1 + X + X 3 and 1 +
X 2 + X 3 , and three of degree 4:
1 + X + X 4, 1 + X3 + X4 and 1 + X + X 2 + X 3 + X 4 , (2.5.9)
1 + X 8, 1 + X4 + X6 + X7 + X8 and 1 + X 2 + X 6 + X 8 (2.5.10)
1 + X N = (1 + X)(1 + X + · · · + X N−1 ).
1 + X, 1 + X 3, 1 + X 5, 1 + X 11 , 1 + X 13 .
× (1 + X + X 2 + X 4 + X 6 )(1 + X 2 + X 4 + X 5 + X 6 ),
1 + X : (1 + X + X 5 + X 6 + X 7 + X 9 + X 11 )
23
× (1 + X 2 + X 4 + X 5 + X 6 + X 10 + X 11 ),
222 Introduction to Coding Theory
and
1 + X 25 : (1 + X + X 2 + X 3 + X 4 )(1 + X 5 + X 10 + X 15 + X 20 ).
For N even, 1 + X N can have multiple roots (see Example 2.5.35(c)).
Example 2.5.20 Irreducible polynomials of degree 2 and 3 over the field F3
(that is, from F3 [X]) are as follows. There exist three irreducible polynomials of
degree 2 over F3 : X 2 + 1, X 2 + X + 2 and X 2 + 2X + 2. There exist eight irreducible
polynomials of degree 3 over F3 : X 3 + 2X + 2, X 3 + X 2 + 2, X 3 + X 2 + X + 2, X 3 +
2X 2 + 2X + 2, X 3 + 2X + 1, X 3 + X 2 + 2X + 1, X 3 + 2X 2 + 1 and X 3 + 2X 2 + X + 1.
Cyclic codes admit encoding and decoding procedures in terms of the polyno-
mials. It is convenient to have a generating matrix of a cyclic code X in a form
similar to Gcycl for the Hamming [7, 4] code (see above). That is, we want to find
the basis in X which gives the following picture in the corresponding generating
matrix:
⎛ ⎞
⎜ 0 ⎟
⎜ ⎟
⎜ ⎟
Gcycl = ⎜ ⎟ (2.5.11)
⎜ .. ⎟
⎝ 0 . ⎠
Hence the cosets are labelled by the polynomials u(X) of deg u(X) < deg g(X) =
N − k: there are exactly 2N−k such polynomials. To determine the coset y + X it
is enough to compute the remainder u(X) = y(X) mod g(X). Unfortunately, there
is still a task to find a leader in each case: there is no simple algorithm for finding
leaders, for a general cyclic code. However, there are known particular classes of
cyclic codes which admit a relatively simple decoding: the first such class was
discovered in 1959 and is formed by BCH codes (see Section 2.6).
As was observed, a cyclic code may be generated not only by its polynomial of
minimum degree: for some purposes other polynomials with this property may be
useful. However, they all are divisors of 1 + X N :
Theorem 2.5.22 Let X be a binary cyclic code of length N . Then any polyno-
mial g(X) such that X is the principal ideal of g(X) is a divisor of 1 + X N .
We see that the cyclic codes are naturally labelled by their generator poly-
nomials.
We will use the standard notation gcd( f (X), g(X)) for the greatest common di-
visor of polynomials f (X) and g(X) and lcm( f (X), g(X)) for their least common
multiple. Denote by X1 + X2 the direct sum of two linear codes X1 , X2 ⊂ HN,2 .
That is, X1 + X2 consists of the linear combinations α1 a(1) + α2 a(2) where
α1 , α2 = 0, 1 and a(i) ∈ Xi , i = 1, 2. Compare Example 2.1.8(vii).
224 Introduction to Coding Theory
Worked Example 2.5.24 Let X1 and X2 be two binary cyclic codes of length
N , with generators g1 (X) and g2 (X). Prove that:
(a) X1 ⊂ X2 iff g2 (X) divides g1 (X);
the intersection
(b)
X1 ∩ X2 yields a cyclic code generated by
lcm g1 (X), g2 (X) ;
the direct
sum X1 + X2 is a cyclic code with the generator
(c)
gcd g1 (X), g2 (X) .
Solution (a) We know that a(X) ∈ Xi iff, in the ring F2 [X] (1 + X N ), polyno-
mial a(X) = fi gi (X), i = 1, 2. Suppose g2 (X) divides g1 (X) and write g1 (X) =
r(X)g2 (X). Then every polynomial a(X) of the form f1 g1 (X) is of the form
f1 r g2 (X). That is, if a(X) ∈ X1 then a(X) ∈ X2 , so X1 ⊂ X2 .
Conversely, suppose that X1 ⊂ X2 . Let di be the degree of gi (X), 1 ≤ di < N,
i = 1, 2, and write
and conclude that X i r(X) = 0. This implies that r(X) = 0 and hence g2 (X) divides
g1 (X).
The remaining case is that all coefficients αi ≡ 1. Then we compare
and
Xg1 (X) = h(1) (X)g2 (X) + 1 + X N
lcm(g1 (X), g2 (X)), the code generated by lcm(g1 (X), g2 (X)) must be strictly larger
than X1 ∩ X2 . This contradicts the definition of X1 ∩ X2 .
(c) Similarly, X1 + X2 is the minimal linear code containing both X1 and X2 .
Hence, its generator divides both g1 (X) and g2 (X), i.e. is their common divisor.
And if it is not equal to the gcd(g1 (X), g2 (X)) then it contradicts the above mini-
mality property.
Worked Example 2.5.25 Let X be a binary cyclic code of length N with the
generator g(X) and the check polynomial h(X). Prove that a(X) ∈ X iff the poly-
nomial (1 + X N ) divides a(X)h(X), i.e. a h(X) = 0 in F2 [X]/(1 + X N ).
Worked Example 2.5.26 Prove that the dual of a cyclic code is again cyclic and
find its generating matrix.
Solution If y ∈ X ⊥ , the dual code, then the dot-product "π x · y# = 0 for all x ∈ X .
But "π x · y# = "x · π y#, i.e. π y ∈ X ⊥ , which means that X ⊥ is cyclic.
Let g(X) = g0 + g1 X + · · · + gN−k X N−k be the generating polynomial for X ,
where N − k = d is the degree of g(X) and k gives the rank of X . We know that
the generating matrix G of X may be written as
⎛ ⎞
g(X) ⎛ ⎞
⎜ Xg(X) ⎟
⎜ ⎟ ⎜ 0 ⎟
⎜ ⎟ ⎜ ⎟
⎜ · ⎟ ⎜ ⎟
G ∼ ⎜ ⎟ ∼ ⎜ ⎟. (2.5.13)
⎜ · ⎟ ⎜ . ⎟
⎜ ⎟ ⎝ 0 . . ⎠
⎝ · ⎠
X k−1 g(X)
226 Introduction to Coding Theory
k
Take h(X) = (1 + X N )/g(X) and write h(X) = ∑ h j X j and h = h0 . . . hN−1 . Then
j=0
'
i = 1, i = 0, N,
∑ g j hi− j = 0, 1 ≤ i < N.
j=0
where h⊥ = hk hk−1 . . . h0 . It is then easy to see that h⊥ gives rise to the generator
h⊥ (X) of X ⊥ .
Worked Example 2.5.27 Let X be a binary cyclic code of length N with gen-
erator g(X).
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 227
(a) Show that the set of codewords a ∈ X of even weight is a cyclic code and find
its generator.
(b) Show that X contains a codeword of odd weight iff g(1) = 0 or, equivalently,
word 1 ∈ X .
Solution (a) If code X is even (i.e. contains only words of even weight) then
every polynomial a(X) ∈ X has a(1) = ∑ ai = 0. Hence, a(X) contains a
0≤i<N−1
factor (X + 1). Therefore, the generator g(X) has a factor (X + 1). The converse is
also true: if (X + 1) divides g(X), or, equivalently, g(1) = 0, then every codeword
a ∈ X is of even weight.
Now assume that X contains a word with an odd weight, i.e. g(1) = 1; that
is, (1 + X) does not divide g(X). Let X ev be the subcode in X formed by the
even codewords. A cyclic shift does not change the weight, so X ev is a cyclic
code. For the corresponding polynomials a(X) we have, as before, that (1 + X)
divides a(X). Thus, the generator gev (X) of X ev is divisible by (1 + X), hence
gev (X) = g(X)(X + 1).
(b) It remains to show that g(1) = 1 iff the word 1 ∈ X . The corresponding poly-
nomial is 1+ · · ·+ X N−1 , the complementary factor to (1+ X) in the decomposition
1 + X N = (1 + X)(1 + · · · + X N−1 ). So, if g(1) = 1, i.e. g(X) does not contain the
factor (1 + X), then g(X) must be a divisor of 1 + · · · + X N−1 . This implies that
1 ∈ X . The inverse statement is established in a similar manner.
Worked Example 2.5.28 Let X be a binary cyclic code of length N with gen-
erator g(X) and check polynomial h(X).
(a) Prove that X is self-orthogonal iff h⊥ (X) divides g(X) and self-dual iff
h⊥ (X) = g(X) where h⊥ (X) = hk + hk−1 X + · · · + h0 X k−1 and h(X) = h0 + · · · +
hk−1 X k−1 + hk X k is the check polynomial, with g(X)h(X) = 1 + X N .
(b) Let r be a divisor of N : r|N . A binary code X is called r-degenerate if every
codeword a ∈ X is a concatenation c . . . c where c is a string of length r. Prove that
X is r-degenerate iff h(X) divides (1 + X r ).
1 + X N = (1 + X r )(1 + X r + · · · + X r(s−1) ).
228 Introduction to Coding Theory
Now assume cyclic code X of length N with generator g(X) is r-degenerate. Then
the word g is of the form 1c1c . . . 1c for some string c of length r − 1 (with c = 1c).
Let c(X) be the polynomial corresponding to c (of degree ≤ r − 2). Then g(X) is
given by
1 + X c(X) + X r + X r+1 c(X) + · · · + X r(s−1) + X r(s−1)+1 c(X)
= (1 + X r + · · · + X r(s−1) )[1 + X c(X)].
For the check polynomial h(X) we obtain
>
h(X) = 1 + X N 1 + X r + · · · + X r(s−1) 1 + X c(X)
= 1 + Xr 1 + X c(X) ,
i.e. h(X) is a divisor of (1 + X r ).
Conversely, let h(X)|(1 + X r ), with h(X)g(X) = 1 + X r where g(X) =
∑ c j X j , with c0 = 1. Take c = c0 . . . cr−1 ; repeating the above argument in
0≤ j≤r−1
the reverse order, we conclude that the word g is the concatenation c . . . c. Then the
cyclic shift π g is the concatenation c(1) . . . c(1) where c(1) = cr−1 c0 . . . cr−2 (= π c,
the cyclic shift of c in {0, 1}r ). Similarly, for subsequent cyclic shift iterations
π 2 g, . . .. Hence, the basis vectors in X are r-degenerate, and so is the whole of X .
has a consistent meaning (which is provided within the framework of finite fields).
Even without knowing the formal theory, we are able to make a couple of helpful
observations.
The first observation is that the αi are Nth roots of unity, as they should be among
the zeros of polynomial 1 + X N . Hence, they could be multiplied and inverted, i.e.
would form an Abelian multiplicative group of size N, perhaps cyclic. Second, in
the binary arithmetic, if α is a zero of g(X) then so is α 2 , as g(X)2 = g(X 2 ). Then
α 2 is also a zero, as well as α 4 , and so on. We conclude that the sequence α , α 2 , . . .
begins cycling: α 2 = α (or α 2 −1 = 1) where d is the degree of g(X). That is, all
d d
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 229
( c−1
)
Nth roots of unity split into disjoint classes, of the form C = α , α 2 , . . . , α 2 ,
of size c where c = c(C ) is a positive integer (with 2 − 1 dividing N). The notation
c
C (α ) is instructive, with c = c(α ). The members of the same class are said to be
conjugate to each other. If we want a generating polynomial with root α then all
conjugate roots of unity α ∈ C (α ) will also be among the roots of g(X).
Thus, to form a generator g(X) we have to ‘borrow’ roots from classes C and
enlist, with each borrowed root of unity, all members of their classes. Then, since
any polynomial a(X) from the cyclic code generated by g(X) is a multiple of g(X)
(see Theorem 2.5.10(iv)), the roots of g(X) will be among the roots of a(X). Con-
versely, if a(X) has roots αi of g(X) among its roots then a(X) is in the code. We
see that cyclic codes are conveniently described in terms of roots of unity.
Example 2.5.29 (The Hamming [7, 4] code) Recall that the parity-check matrix
H for the binary Hamming [7, 4] code X H is 3 × 7; it enlists as its columns all
non-zero binary words of length 3: different orderings of these rows define equiv-
alent codes. Later in this section we explain that the sequence of non-zero binary
words of any given length 2 − 1 written in some particular order (or orders) can be
interpreted as a sequence of powers of a single element ω : ω 0 , ω , ω 2 , . . . , ω 2 −2 .
Then, with this interpretation, the equation aH T = 0, determining that the word
a = a0 . . . a6 (or its polynomial a(X) = ∑ ai X i ) lies in X H , can be rewritten as
0≤i<7
∑ ai ω ∗i = 0, or a(∗ω ) = 0.
0≤i<7
In other words, a(X) ∈ X H iff ω is a root of a(X) under the multiplication rule ∗
(which in this case is multiplication of binary polynomials of degree ≤ 2 modulo
the polynomial 1 + X + X 3 ).
The last statement can be rephrased in this way: the Hamming [7, 4] code is
equivalent to the cyclic code with the generator g(X) that has ω among its roots;
in this case the generator g(X) = 1 + X + X 3 , with g(∗ω ) = ω ∗0 + ω + ω ∗3 = 0.
The alternative ordering of the rows of H H is related in the same fashion to the
polynomial 1 + X 2 + X 3 .
230 Introduction to Coding Theory
We see that the Hamming [7, 4] code is defined by a single root ω , provided
that we establish proper terms of operation with its powers. For that reason we can
call ω the defining root (or defining zero) for this code. There are reasons to call
element ω ‘primitive’; cf. Sections 3.1–3.3.
Worked Example 2.5.30 A code X is called reversible if a = a0 a1 . . . aN−1 ∈ X
implies that a← = aN−1 . . . a1 a0 ∈ X . Prove that a cyclic code with generator g(X)
is reversible iff g(α ) = 0 implies g(α −1 ) = 0.
Solution For the generator polynomial g(X) = ∑ gi X i , with deg g(X) = d < N
0≤i≤d
and g0 = gd = 1, the reversed polynomial is grev (X) = X N−1 g(X −1 ), so if the cyclic
code X is reversible and α is a root of g(X) then α is also a root of grev (X). This
is possible only when g(α −1 ) = 0.
Conversely, let g(X) satisfy the property that g(α ) = 0 implies g(α −1 ) = 0.
The above formula holds for all polynomial a(X) of degree < N: arev (X) =
X N−1 a(X −1 ). If a(X) ∈ X then a(α ) = a(α −1 ) = 0 for all root α of g(X). Then
arev (α ) = arev (α −1 ) = 0 for all roots α of g(X). Thus, arev (X) is a multiple of
g(X), and arev (X) ∈ X .
Proof The only non-trivial property to check is the existence of the inverse ele-
ment. Take a non-zero polynomial f (X), with deg f (X) ≤ d − 1, and consider all
polynomials of the form f (X)h(X) (the usual multiplication) where h(X) runs over
the whole set of the polynomials of degree ≤ d − 1. These products must be distinct
mod g(X). Indeed, if
This implies that either g(X)| f (X) or g(X)|h1 (X) − h2 (X). We conclude that if
polynomial g(X) is irreducible, (2.5.15) is impossible, unless h1 (X) = h2 (X) and
v(X) = 0. For one and only one polynomial h(X), we have
h(X) represents the inverse for f (X) in multiplication mod g(X). We write h(X) =
f (X)∗−1 .
On the other hand, if g(X) is reducible, then g(X) = b(X)b (X) where both b(X)
and b (X) are non-zero and have degree < d. That is, b(X)b (X) = 0 mod g(X). If
the multiplication mod q led to a field both b(X) and b (X) would have inverses,
b(X)−∗1 and b (X)−∗1 . But then
A field obtained via the above construction is called a polynomial field and is
often denoted by F2 [X]/"g(X)#. It contains 2d elements where d = deg g(X) (rep-
resenting polynomials of degree < d). We will call g(X) the core polynomial of the
field. For the rest of this section we denote the multiplication in a given polynomial
field by ∗. The zero polynomial and the unit polynomial are denoted correspond-
ingly, by 0 and 1: they are obviously the zero and the unity of the polynomial field.
A key role is played by the following result.
Proof We will only prove here assertion (a); assertion (b) will be established in
Section 3.1. Take any element from the field, a(X) ∈ F2 [X]/"g(X)#, and observe
that
a∗i (X) := a
: ∗ .;<
. . ∗ a=(X)
i times
(the multiplication in the field) takes at most 2d − 1 values (the number of elements
in the field less one, as the zero 0 is excluded). Hence there exists a positive integer
r such that a∗r (X) = 1; the smallest value of r is called the order of a(X).
232 Introduction to Coding Theory
product a∗p ∗ b∗l (X) has order l pc . Hence, c ≤ c or else r would not be maximal.
b
the field.
In the wake of Theorem 2.5.33, we can use the notation F2d for any polynomial
field F2 [X]/"g(X)# where g(X) is an irreducible binary polynomial of degree d.
Further, the multiplicative group of non-zero elements in F2d is denoted by F∗2d ;
it is cyclic ( Z2d −1 , according to Theorem 2.5.33). Any generator of group F∗2d
(whose ∗-powers exhaust F∗2d ) is called a primitive element of field F2d .
Example 2.5.34 We can see the importance of writing down the full list of ir-
reducible polynomials. There are six irreducible binary polynomials of degree 5
(each of which is primitive):
1 + X 2 + X 5, 1 + X 3 + X 5, 1 + X + X 2 + X 3 + X 5,
1 + X + X 2 + X 4 + X 5, 1 + X + X 3 + X 4 + X 5, (2.5.16)
1 + X2 + X3 + X4 + X5
1 + X + X 6, 1 + X + X 3 + X 4 + X 6, 1 + X 5 + X 6,
1 + X + X 2 + X 5 + X 6, 1 + X 2 + X 3 + X 5 + X 6,
1 + X + X 4 + X 5 + X 6, (2.5.17)
1 + X + X + X 4 + X 6, 1 + X 2 + X 4 + X 5 + X 6,
2
1 + X 3 + X 6.
The number of irreducible polynomials grows significantly with the degree: there
are 18 of degree 7, 30 of degree 8, and so on. However, there exist and are available
quite extensive tables of irreducible polynomials over various finite fields.
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 233
1 + X + X3 1 + X2 + X3
X ∗i polynomial word X ∗i polynomial word
−− 0 000 −− 0 000
X ∗0 1 100 X ∗0 1 100
X X 010 X X 010
X ∗2 X2 001 X ∗2 X2 001
X ∗3 1+X 110 X ∗3 1 + X2 101
X ∗4 X + X2 011 X ∗4 1 + X + X2 111
X ∗5 1 + X + X2 111 X ∗5 1+X 110
X ∗6 1 + X2 101 X ∗6 X + X2 011
Figure 2.6
coefficient
powers X ∗i polynomials
strings
−− 0 0000
X ∗0 1 1000
X X 0100
X ∗2 X2 0010
X ∗3 X3 0001
X ∗4 1+X 1100
X ∗5 X + X2 0110
X ∗6 X2 + X3 0011
X ∗7 1 + X + X3 1101
X ∗8 1 + X2 1010
X ∗9 X + X3 0101
X ∗10 1 + X + X2 1110
X ∗11 X + X2 + X3 0111
X ∗12 1 + X + X2 + X3 1111
X ∗13 1 + X2 + X3 1011
X ∗14 1 + X3 1001
Figure 2.7
(c) The field F2 [X]/"1 + X + X 4 # contains 16 elements. The field table is given in
Figure 2.7. In this case, the multiplicative group is Z15 , and the field can be denoted
by F16 . As above, element X ∈ F2 [X]/"1 + X + X 4 # yields a root of polynomial
1 + X + X 4 ; other roots are X ∗2 , X ∗4 and X ∗8 .
This example can be used to identify the Hamming [15, 11] code as (an equiva-
lent to) the cyclic code with generator g(X) = 1 + X + X 4 . We can now say that the
Hamming [15, 11] code is (modulo equivalence) the cyclic code of length 15 with
the defining root ω (= X) in field F2 [X]/"1 + X + X 4 #. As X is a generator of the
multiplicative group of the field, we again could say that the defining root ω is a
primitive element in F16 . 2
in general: it only happens when g(X) is a ‘primitive’ binary polynomial; for the
detailed discussion of this property see Sections 3.1–3.3. For a primitive core poly-
nomial g(X) we have, in addition, that the powers X i for i < d = deg g(X) coincide
with X ∗i , while further powers X ∗i , m ≤ i ≤ 2d − 1, are relatively easy to calculate.
With this in mind, we can pass to a general binary Hamming code.
i.e. the conjugates ω , ω 2 and ω 4 are the roots of the core polynomial 1 + X + X 3 :
1 + X + X 3 = (X − ω ) X − ω 2 X − ω 4 .
Hence, the binary BCH code of length 7 with designed distance 3 is formed by
binary polynomials a(X) of degree ≤ 6 such that
This code is equivalent to the Hamming [4, 7] code; in particular its ‘true’ distance
equals 3.
Next, the binary BCH code of length 7 with designed distance 4 is formed by
binary polynomials a(X) of degree ≤ 6 such that
a(ω ) = a(ω 2 ) = a(ω 3 ) = 0, that is, a(X) is a multiple of
(1 + X + X 3 )(1 + X 2 + X 3 ) = 1 + X + X 2 + X 3 + X 4 + X 5 + X 6 .
This is simply the repetition code R7 .
The staple of the theory of the BCH codes is
Theorem 2.5.39 (The BCH bound) The minimal distance of a binary BCH code
with designed distance δ is ≥ δ .
The proof of Theorem 2.5.39 (sometimes referred to as the BCH theorem) is
based on the following result.
Lemma 2.5.40 Consider the m × m Vandermonde determinant with entries from
a commutative ring:
⎛ ⎞ ⎛ ⎞
α1 α2 . . . αm α1 α12 . . . α1m
⎜α2 α2 . . . α2 ⎟ ⎜ α2 α 2 . . . α m ⎟
⎜ 1 2 m⎟ ⎜ 2 2⎟
det ⎜ . . .. . ⎟ = det ⎜ .. . .. . ⎟. (2.5.21)
⎝ . . .
. . . ⎠
. ⎝ . .
. . . ⎠
.
α1m α2m . . . αmm αm αm2 . . . αmm
The value of this determinant is
∏ αl × ∏ (αi − α j ). (2.5.22)
1≤l≤m 1≤i< j≤m
Proofof Theorem 2.5.39 Let the polynomial a(X) ∈ X . Then a(∗ω ∗ j ) = 0 for all
j = 1, . . . , δ − 1. That is,
⎛ ⎞⎛ ⎞
1 ω ω ∗2 ... ω ∗(N−1) a0
⎜1 ω ∗2 ω ∗4 ω ∗2(N−1) ⎟ ⎜ ⎟
⎜ ... ⎟ ⎜ a1 ⎟
⎜ .. .. .. .. ⎟ ⎜ . ⎟ = 0.
⎝. . . . ⎠ ⎝ .. ⎠
1 ω ∗(δ −1) ω ∗2(δ −1) . . . ω ∗(N−1)(δ −1) aN−1
Due to Lemma 2.5.40, any (δ −1) columns of this ((δ −1)×N) matrix are linearly
independent. Hence, there must be at least δ non-zero coefficients in a(X). Thus,
the distance of X is ≥ δ .
g(x) = (X 4 + X + 1)(X 4 + X 3 + X 2 + X + 1) = X 8 + X 7 + X 6 + X 4 + 1.
By definition, the generator of the BCH code of length 31 with a designed dis-
tance δ = 8 is g(X) = lcm(Mω (X), Mω 3 (X), Mω 5 (X), Mω 7 (X)). In fact, the mini-
mal distance of the BCH code (which is, obviously, at least 9) is in fact at least 11.
This follows from Theorem 2.5.39 because all the powers ω , ω 2 , . . . , ω 10 are listed
among the roots of g(X).
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 239
There exists a decoding procedure for a BCH code which is simple to imple-
ment: it generalises the Hamming code decoding procedure. In view Aof Theorem B
δ −1
2.5.39, the BCH code with designed distance δ corrects at least t = er-
2
rors. Suppose a codeword c = c0 . . . cN−1 has been sent and corrupted to r = c + e
where e = e0 . . . eN−1 . Assume that e has at most t non-zero entries. Introduce the
corresponding polynomials c(X), r(X) and e(X), all of degrees < N. For c(X) we
have that c(ω ) = c(ω 2 ) = · · · = c(ω (δ −1) ) = 0. Then, clearly,
r(ω ) = e(ω ), r ω 2 = e ω 2 , . . . , r ω (δ −1) = e ω (δ −1) . (2.5.23)
So, we calculate r(ω i ) for i = 1, . . . , δ − 1. If these are all 0, r(X) ∈ X (no error
or at least t + 1 errors). Otherwise, let E = {i : ei = 1} indicate the erroneous digits
and assume that 0 < E ≤ t. Introduce the error locator polynomial
σ (X) = ∏(1 − ω i X), (2.5.24)
i∈E
with binary coefficients, of degree E and with the lowest coefficient 1. If we know
σ (X), we can find which powers ω −i are its roots and hence find the erroneous
digits i ∈ E. We then simply change these digits and correct the errors.
In order to calculate σ (X), consider the formal power series
ζ (X) = ∑ e ω j X j .
j≥1
(Observe that, as ω N = 1, the coefficients of this power series recur.) For the initial
(δ − 1) coefficients, we have equalities, by virtue of (2.5.23):
e ω j = r ω j , j = 1, . . . , δ − 1;
these are the only ones needed for our purpose, and they are calculated in terms of
the received word r.
Now set
ω (X) = ∑ ω i X ∏ (1 − ω j X). (2.5.25)
i∈E j∈E: j=i
namely,
σ0 + σ1 X + · · · + σt X t
× r(ω )X + + · · · + r ω 2t X 2t + e ω (2t+1) X 2t+1 + · · · (2.5.27)
= ω0 + ω1 X + · · · + ωt X t .
Xg(X), . . . , X N−k−1 g(X) form a basis for X . In particular, the rank of X equals
N − k. In this example, N = 7, k = 3 and rank(X ) = 4.
If h(X) = b0 + b1 X + · · · + bN−k X N−k then the parity-check matrix H for X has
the form
⎛ ⎞
bN−k bN−k−1 ... b1 b0 0 ... 0 0
⎜ 0 bN−k bN−k−1 . . . b1 b0 ... 0 0 ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ . . . . . . ⎟.
⎜ 0 .. .. .. .. .. .. ⎟
⎜ ⎟
⎝ ⎠
0 0 ... 0 bN−k bN−k−1 . . . b1 b0
: ;< =
N
The minimum distance d(X ) of a linear code X is the minimum non-zero weight
of a codeword. In the example, d(X ) = 3. [In fact, X is equivalent to the Ham-
ming [7, 4] code.]
>
Since g(X) ∈ F2 [X] is irreducible, the code X ∈ F2 [X] "X 7 − 1# is the cyclic
code defined by ω . The multiplicative cyclic group Z×
7 of non-zero elements of
field F8 is
ω 0 = 1, ω , ω 2 , ω 3 = ω + 1, ω 4 = ω 2 + ω ,
ω 5 = ω 3 + ω 2 = ω 2 + ω + 1, ω 6 = ω 3 + ω 2 + ω = ω 2 + 1,
ω 7 = ω 3 + ω = 1.
r(ω ) = ω + ω 3 + ω 5
= ω + (ω + 1) + (ω 2 + ω + 1)
= ω 2 + ω = ω 4,
242 Introduction to Coding Theory
as required. Let c(X) = r(X) + X 4 mod (X 7 − 1). Then c(ω ) = 0, i.e. c(X) is a
codeword. Since d(X ) = 3 the code is 1-error correcting. We just found a code-
word c(X) at distance 1 from r(X). Then r(X) = X +X 3 +X 5 should be decoded by
c(X) = X + X 3 + X 4 + X 5 mod (X 7 − 1)
r−1 (X) = q1 (X)r0 (X) + r1 (X) where deg r1 (X) < deg r0 (X),
r0 (X) = q2 (X)r1 (X) + r2 (X) where deg r1 (X) < deg r1 (X),
.
..
(k) divide rk−1 (X) by rk−1 (X):
rk−2 (X) = qk (X)rk−1 (X) + rk (X) where deg rk (X) < deg Rk−1 (X),
...
Then
gcd f (X), g(X) = rs−1 (X). (2.5.29)
At each stage, the equation for the current remainder rk (X) involves two previous
remainders. Hence, all remainders, including gcd( f (X), g(X)), can be written in
terms of f (X) and g(X). In fact,
Lemma 2.5.44 The remainders rk (X) in the Euclid algorithm satisfy
where
a−1 (X) = b−1 (X) = 0,
a0 (X) = 0, b0 (X) = 1,
ak (X) = −qk (X)ak−1 (X) + ak−2 (X), k ≥ 1,
bk (X) = −qk (X)bk−1 (X) + bk−2 (X), k ≥ 1.
In particular, there exist polynomials a(X), b(X) such that
gcd ( f (X), g(X)) = a(X) f (X) + b(X)g(X).
Furthermore:
(1) deg ak (X) = ∑ deg qi (X), deg bk (X) = ∑ deg qk (X).
2≤i≤k 1≤i≤k
(2) deg rk (X) = deg f (X) − ∑ deg qk (X).
1≤i≤k+1
(3) deg bk (X) = deg f (X) − deg rk−1 (X).
(4) ak (X)bk+1 (X) − ak+ (X)bk (X) = (−1)k+1 .
(5) ak (X) and bk (X) are co-prime.
(6) rk (X)bk+1 (X) − rk+1 (X)bk (X) = (−1)k+1 f (X).
(7) rk+1 (X)ak (X) − rk (X)ak+1 (X) = (−1)k+1 g(X).
Proof The proof is left as an exercise.
Solution All cyclic codes of length 16 are divisors of 1 + X 16 = (1 + X)16 , i.e. are
generated by g(X) = (1 + X)k where k = 0, 1, . . . , 16. Here k = 0 gives the whole
{0, 1}16 , k = 1 the parity-check code, k = 15 the repetition code {00 . . . 0, 11 . . . 1}
and k = 16 the zero code. For N = 15, the decomposition into irreducible polyno-
mials looks as follows:
1 + X 15 = (1 + X)(1 + X + X 2 )(1 + X + X 4 )(1 + X 3 + X 4 )
×(1 + X + X 2 + X 3 + X 4 ).
Any product of the listed irreducible polynomials generates a cyclic code.
244 Introduction to Coding Theory
y(i) (X) = vi (X)g(X) + u(i) (X), i = 1, 2, where u(1) (X) = u(2) (X).
In fact, y(1) and y(2) belong to the same coset iff y(1) + y(2) ∈ X . This is equivalent
to u(1) (X) + u(2) (X) = 0, i.e. u(1) (X) = u(2) (X).
k
If we write h(X) = ∑ h j X j , then the dot-product
j=0
'
i 1, i = 0, N,
∑ g j hi− j = 0, 1 ≤ i < N.
j=0
Problem 2.2 (a) Prove the Hamming and Gilbert–Varshamov bounds on the
size of a binary [N, d] code in terms of vN (d), the volume of an N -dimensional
Hamming ball of radius d .
Suppose that the minimum distance is λ N for some fixed λ ∈ (0, 1/4). Let
α (N, λ N) be the largest information rate of any binary code correcting λ N
errors. Show that
1 − η (λ ) ≤ lim inf α (N, λ N) ≤ lim sup α (N, λ N) ≤ 1 − η (λ /2). (2.6.1)
N→∞ N→∞
(b) Fix R ∈ (0, 1) and suppose we want to send one of a collection UN of messages
of length N , where the size UN = 2NR . The message is transmitted through an
2.6 Additional problems for Chapter 2 245
MBSC with error-probability p < 1/2, so that we expect about pN errors. Accord-
ing to the asymptotic bound of part (a), for which values of p can we correct pN
errors, for large N ?
or we could add such a word to X ∗ , increasing the size but preserving the error-
correcting property. Since every word y ∈ FN2 is less than d − 1 from a codeword,
we can add y to the code. Hence, balls of radius d − 1 cover the whole of FN2 , i.e.
M × vN (d − 1) ≥ 2N , or
Here I(X : Y ) = h(Y ) − h(Y |X) is the mutual entropy between the single-letter
random input and output of the channel, maximised over all distributions of the
input letter X. For an MBSC with the error-probability p, the conditional entropy
h(Y |X) equals η (p). Then
But h(Y ) attains its maximum 1, by using the equidistributed input X (then Y is also
equidistributed). Hence, for the MBSC, C = 1− η (p). So, a reliable transmission is
possible via MBSC with R ≤ 1 − η (p), i.e. p ≤ η −1 (1 − R). These two arguments
lead to the same answer.
Problem 2.3 Prove that the binary code of length 23 generated by the polynomial
g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 has minimum distance 7, and is perfect.
Hint: Observe that by the BCH bound (see Theorem 2.5.39) if a generator polyno-
mial of a cyclic code has roots {ω , ω 2 , . . . , ω δ −1 } then the code has distance ≥ δ ,
and check that X 23 + 1 ≡ (X + 1)g(X)grev (X) mod 2, where grev (X) = X 11 g(1/X)
is the reversal of g(X).
Solution First, show that the code is BCH, of designed distance 5. Recall that if
ω is a root of a polynomial p(X) ∈ F2 [X] then so is ω 2 . Thus, if ω is a root of
g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 then so are ω 2 , ω 4 , ω 8 , ω 16 , ω 9 , ω 18 ,
ω 13 , ω 3 , ω 6 , ω 12 . This yields the design sequence {ω , ω 2 , ω 3 , ω 4 }. By the BCH
theorem, the code X = "g(X)# has distance ≥ 5.
Next, the parity-check extension, X + , is self-orthogonal. To check this, we need
only to show that any two rows of the generating matrix of X + are orthogonal.
These are represented by
The cyclic code with generator g(X) = X 3 + X + 1 has check polynomial h(X) =
X 4 + X 2 + X + 1. The parity-check matrix of the code is
⎛ ⎞
1 0 1 1 1 0 0
⎝ 0 1 0 1 1 1 0⎠ . (2.6.4)
0 0 1 0 1 1 1
The columns of this matrix are the non-zero elements of F32 . So, this is equivalent
to Hamming’s original [7, 4] code.
The dual of Hamming’s [7, 4] code has the generator polynomial X 4 + X 3 + X 2 + 1
(the reverse of h(X)). Since X 4 + X 3 + X 2 + 1 = (X + 1)g(X), it is a subcode of
Hamming’s [7, 4] code.
Finally, any irreducible polynomial M j (X) could be included in a generator of a
cyclic code in any power 0, . . . , k j . So, the number of possibilities to construct this
generator equals ∏lj=1 (k j + 1).
Π j = {p ∈ Fm
2 : p j = 0}.
A0 = {h0 },
A1 = {h j ; j = 1, 2, . . . , m},
A2 = {hi · h j ; i, j = 1, 2, . . . , m, i < j},
..
.
Ak+1 = {a · h j ; a ∈ Ak , j = 1, 2, . . . , m, h j |a},
..
.
Am = {h1 · · · hm }.
The union of these sets has cardinality N = 2m (there are 2m functions altogether).
Therefore, functions from ∪m i=0 Ai can be taken as a basis in F2 .
N
d RM(r, m) ≥ 2m−r .
On the other hand, the vector h1 · h2 · · · · · hm is at distance 2m−r from RM(m, r).
Hence,
d RM(r, m) = 2m−r .
Problem 2.6 (a) Define a parity-check code of length N over the field F2 . Show
that a code is linear iff it is a parity-check code. Define the original Hamming code
in terms of parity-checks and then find a generating matrix for it.
(b) Let X be a cyclic code. Define the dual code
N
X ⊥ = {y = y1 . . . yN : ∑ xi yi = 0 for all x = x1 . . . xN ∈ X }.
i=1
Prove that X ⊥ is cyclic and establish how the generators of X and X ⊥ are re-
lated to each other. Show that the repetition and parity-check codes are cyclic, and
determine their generators.
From the definition it is clear that X PC is also the parity-check code for X , the
PC
linear code spanned by X : X PC = X . Indeed, if y · x = 0 and y · x = 0 then
y · (x + x ) = 0. Hence, the parity-check code X PC is always linear, and it forms
a subspace dot-orthogonal to X . Thus, a given code X is linear iff it is a parity-
check code. A pair of linear codes X and X PC form a dual pair: X PC is the dual
of X and vice versa. The generating matrix H for X PC serves as a parity-check
matrix for X and vice versa.
250 Introduction to Coding Theory
(b) The generator of dual code g⊥ (X) = X N−1 g(X −1 ). The repetition code has
g(X) = 1 + X + · · · + X N−1 and the rank 1. The parity-check code has g(X) = 1 + X
and the rank N − 1.
Problem 2.7 (a) How does coding theory apply when the error rate p > 1/2?
(b) Give an example of a code which is not a linear code.
(c) Give an example of a linear code which is not a cyclic code.
(d) Define the binary Hamming code and its dual. Prove that the Hamming code is
perfect. Explain why the Hamming code cannot always correct two errors.
(e) Prove that in the dual code:
(i) the weight of any non-zero codeword equals 2^{ℓ−1};
(ii) the distance between any pair of words equals 2^{ℓ−1}.
columns of this matrix are linearly independent, but there are triples of columns that are linearly dependent (a pair of columns complemented by their sum).
Every non-zero dual codeword x is a sum of rows of the above generating matrix. Suppose these summands are rows i_1, ..., i_s where 1 ≤ i_1 < ⋯ < i_s ≤ ℓ. Then, as above, the number of digits 1 in the sum equals the number of columns of this matrix for which the sum of digits i_1, ..., i_s is 1. We have no restriction on the remaining ℓ − s digits, so for them there are 2^{ℓ−s} possibilities. For digits i_1, ..., i_s we have 2^{s−1} possibilities (a half of the total of 2^s). Thus, again 2^{ℓ−s} × 2^{s−1} = 2^{ℓ−1}.
We proved that the weight of every non-zero dual codeword equals 2^{ℓ−1}. That is, the distance from the zero vector to any dual codeword is 2^{ℓ−1}. Because the dual code is linear, the distance between any pair of distinct dual codewords x, x′ equals 2^{ℓ−1}:
δ(x, x′) = δ(0, x′ − x) = w(x − x′) = 2^{ℓ−1}.
x = ∑_{i∈J} g^{(i)},
which yields 2^{ℓ−|J|} · 2^{|J|−1} = 2^{ℓ−1}. In other words, to get a contribution from a digit x_j = ∑_{i∈J} g_j^{(i)} = 1, we must fix (i) a configuration of 0s and 1s over {1, ..., ℓ} \ J (as it is a part of the description of a non-zero vector of length N), and (ii) a configuration of 0s and 1s over J, with an odd number of 1s.
To check that d(X_{H,ℓ}^⊥) = 2^{ℓ−1}, it suffices to establish that the distance between the zero word and any other word x ∈ X_{H,ℓ}^⊥ equals 2^{ℓ−1}.
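A quick brute-force confirmation for ℓ = 3, assuming (as above) that the dual of the Hamming [7, 4] code is generated by the rows of the parity-check matrix (2.6.4): every non-zero dual codeword has weight 2^{ℓ−1} = 4.

```python
from itertools import product

# Rows of (2.6.4) generate the dual (simplex) code of the Hamming [7,4] code.
G_dual = [
    [1, 0, 1, 1, 1, 0, 0],
    [0, 1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 1, 1],
]

weights = set()
for coeffs in product([0, 1], repeat=3):
    if coeffs == (0, 0, 0):
        continue
    word = [sum(c * g for c, g in zip(coeffs, col)) % 2 for col in zip(*G_dual)]
    weights.add(sum(word))

print(weights)   # {4}: every non-zero dual codeword has weight 2^(l-1) = 4
```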
Problem 2.8 (a) What is a necessary and sufficient condition for a polynomial
g(X) to be the generator of a cyclic code of length N ? What is the BCH code?
Show that the BCH code associated with {ω , ω 2 }, where ω is a root of X 3 + X + 1
in an appropriate field, is Hamming’s original code.
(b) Define and evaluate the Vandermonde determinant. Define the BCH code and
obtain a good estimate for its minimum distance.
Solution (a) The necessary and sufficient condition for g(X) being the generator of a cyclic code of length N is g(X) | (X^N − 1). The generator g(X) may be irreducible or not; in the latter case it is represented as a product g(X) = M_1(X) ⋯ M_k(X) of its irreducible factors, with k ≤ d = deg g. Let s be the minimal number such that N | 2^s − 1. Then g(X) is factorised into the product of first-degree monomials in a field K = F_{2^s} ⊇ F_2: g(X) = ∏_{j=1}^{d} (X − ω_j) with ω_1, ..., ω_d ∈ K. [Usually one refers to the minimal field – the splitting field for g, but this is not necessary.] Each element ω_i is a root of g(X) and also a root of at least one of its irreducible factors M_1(X), ..., M_k(X). [More precisely, each M_i(X) is a sub-product of the above first-degree monomials.]
We want to select a defining set D of roots among ω_1, ..., ω_d ∈ K: it is a collection comprising at least one root ω_{j_i} for each factor M_i(X). One is naturally tempted to take a minimal defining set where each irreducible factor is represented by one root, but this set may not be easy to describe exactly. Obviously, the cardinality |D| of the defining set D is between k and d. The roots forming D are all from the field K, but in fact there may be some from its subfield K′ ⊂ K containing all the ω_{j_i}. [Of course, F_2 ⊂ K′.] We then can identify the cyclic code X generated by g(X) with the set of polynomials
{ f(X) ∈ F_2[X]/⟨X^N − 1⟩ : f(ω) = 0 for all ω ∈ D }.
It is said that X is a cyclic code with defining set of roots (or zeros) D.
(b) A binary BCH code of length N (for N odd) and designed distance δ is a cyclic code with defining set {ω, ω^2, ..., ω^{δ−1}} where δ ≤ N and ω is a primitive Nth root of unity, with ω^N = 1. It is helpful to note that if ω is a root of a polynomial p(X) then so are ω^2, ω^4, ..., ω^{2^{s−1}}. By considering a defining set of the form {ω, ω^2, ..., ω^{δ−1}} we 'fill the gaps' in the above dyadic sequence and produce an ideal of polynomials whose properties can be studied analytically.
The simplest example is where N = 7 and D = {ω, ω^2} where ω is a root of X^3 + X + 1. Here, ω^7 = (ω^3)^2 ω = (ω + 1)^2 ω = ω^3 + ω = 1, so ω is a 7th root of unity. [We used the fact that the characteristic is 2.] In fact, it is a primitive root. Also, as was said, ω^2 is a root of X^3 + X + 1: (ω^2)^3 + ω^2 + 1 = (ω^3 + ω + 1)^2 = 0, and so is ω^4. Then the cyclic code with defining set {ω, ω^2} has generator X^3 + X + 1 since all roots of this polynomial are engaged. We know that it coincides with the Hamming [7, 4] code.
The Vandermonde determinant is
$$
\Delta = \det\begin{pmatrix}
1 & 1 & 1 & \dots & 1\\
x_1 & x_2 & x_3 & \dots & x_n\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
x_1^{\,n-1} & x_2^{\,n-1} & x_3^{\,n-1} & \dots & x_n^{\,n-1}
\end{pmatrix}.
$$
Observe that if x_i = x_j (i ≠ j) the determinant vanishes (two columns are the same). Thus x_i − x_j is a factor of Δ,
Δ = P(x) ∏_{i<j} (x_i − x_j),
δ (x, y) < d . Conclude that if d or more changes are made in a codeword then the
new word is closer to some other codeword than to the original one.
Suppose that a maximal [N, M, d] code is used for transmitting information via a binary memoryless channel with error-probability p, and the receiver uses the maximum likelihood decoder. Prove that the probability of erroneous decoding, π^ML_err, obeys the bounds
1 − b(N, d − 1) ≤ π^ML_err ≤ 1 − b(N, (d − 1)/2),
Solution If a code is maximal then adding one more word will reduce the distance.
Hence, for all y there exists x ∈ X such that δ (x, y) < d. Conversely, if this prop-
erty holds then the code cannot be enlarged without reducing d. Then making d or
more changes in a codeword gives a word that is closer to a different codeword.
This will certainly not give the correct guess under the ML decoder as it chooses
the closest codeword.
Therefore,
π^ML_err ≥ ∑_{d≤k≤N} \binom{N}{k} p^k (1 − p)^{N−k} = 1 − b(N, d − 1),
and
π^ML_err ≤ 1 − b(N, (d − 1)/2).
Problem 2.10 The Plotkin bound for an [N, M, d] binary code states that M ≤ d/(d − N/2) if d > N/2. Let M_2^*(N, d) be the maximum size of a code of length N and distance d, and let
α(λ) = lim_{N→∞} (1/N) log_2 M_2^*(N, λN).
Solution If d > N/2 apply the Plotkin bound and conclude that α(λ) = 0. If d ≤ N/2 consider the partition of a code X of length N and distance d ≤ N/2 according to the last N − (2d − 1) digits, i.e. divide X into disjoint subsets, with fixed N − (2d − 1) last digits. One of these subsets, X′, must have size M′ such that M′ · 2^{N−(2d−1)} ≥ M.
Hence, X′ is a code of length N′ = 2d − 1 and distance d′ = d, with d′ > N′/2. Applying Plotkin's bound to X′ gives
M′ ≤ d′/(d′ − N′/2) = d/(d − (2d − 1)/2) = 2d.
Therefore,
M ≤ 2^{N−(2d−1)} · 2d.
Taking d = λ N with N → ∞ yields α (λ ) ≤ 1 − 2λ , 0 ≤ λ ≤ 1/2.
Problem 2.11 State and prove the Hamming, Singleton and Gilbert–Varshamov
bounds. Give (a) examples of codes for which the Hamming bound is attained, (b)
examples of codes for which the Singleton bound is attained.
Solution The Hamming bound states that the size M of an E-error correcting code X of length N satisfies
M ≤ 2^N / v_N(E),
where v_N(E) = ∑_{0≤i≤E} \binom{N}{i} is the volume of an E-ball in the Hamming space {0, 1}^N. It follows from the fact that the E-balls about the codewords x ∈ X must be disjoint:
M × v_N(E) = number of points covered by the M disjoint E-balls
≤ 2^N = number of points in {0, 1}^N.
The Singleton bound is that the size M of a code X of length N and distance d
obeys
M ≤ 2N−d+1 .
It follows by observing that truncating X (i.e. omitting a digit from the codewords
x ∈ X ) d − 1 times still does not merge codewords (i.e. preserves M) while the
resulting code fits in {0, 1}N−d+1 .
The Gilbert–Varshamov bound is that the maximal size M* = M_2^*(N, d) of a binary [N, d] code satisfies
M* ≥ 2^N / v_N(d − 1).
This bound follows from the observation that any word y ∈ {0, 1}^N must be within distance ≤ d − 1 from a maximum-size code X*. So, the (d − 1)-balls centred at the codewords of X* cover {0, 1}^N, whence M* × v_N(d − 1) ≥ 2^N.
Codes attaining the Hamming bound are called perfect codes, e.g. the Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ, 3] codes. Here, E = 1, v_N(1) = 1 + 2^ℓ − 1 = 2^ℓ and M = 2^{2^ℓ−ℓ−1}. Apart from these codes, there is only one example of a (binary) perfect code: the Golay [23, 12, 7] code.
Codes attaining the Singleton bound are called maximum distance separable (MDS): any N − k columns of their check matrices are linearly independent. Examples of such codes are (i) the whole of {0, 1}^N, (ii) the repetition code {0 ... 0, 1 ... 1} and (iii) the collection of all words x ∈ {0, 1}^N of even weight. In fact, these are all examples of binary MDS codes. More interesting examples are provided by Reed–Solomon codes, which are non-binary; see Section 3.2. Binary codes attaining the Gilbert–Varshamov bound for general N and d have not been constructed so far (though they have been constructed for non-binary alphabets).
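A small numerical sketch comparing the three bounds just stated; the parameter pairs are chosen only for illustration.

```python
from math import comb

def v(N, r):
    """Volume of a Hamming ball of radius r in {0,1}^N."""
    return sum(comb(N, i) for i in range(r + 1))

def bounds(N, d):
    E = (d - 1) // 2
    hamming   = 2**N // v(N, E)          # upper bound (E-error-correcting)
    singleton = 2**(N - d + 1)           # upper bound
    gv        = -(-2**N // v(N, d - 1))  # lower bound, rounded up
    return hamming, singleton, gv

for (N, d) in [(7, 3), (23, 7), (15, 5)]:
    h, s, g = bounds(N, d)
    print(f"N={N:2d} d={d}:  GV >= {g:6d}   Hamming <= {h:7d}   Singleton <= {s}")
# For (7,3) the Hamming bound 16 is attained by the [7,4] Hamming code;
# for (23,7) the bound 4096 is attained by the Golay [23,12] code.
```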
Problem 2.12 (a) Explain the existence and importance of error correcting codes
to a computer engineer using Hamming’s original code as your example.
(b) How many codewords in a Hamming code are of weight 1? 2? 3? 4? 5?
Solution (a) Consider the linear map F_2^7 → F_2^3 given by the matrix H of the form (2.6.4). The Hamming code X is the kernel ker H, i.e. the collection of words x = x_1 x_2 x_3 x_4 x_5 x_6 x_7 ∈ {0, 1}^7 such that xH^T = 0. Here, we can choose four digits, say x_4, x_5, x_6, x_7, arbitrarily from {0, 1}; then x_1, x_2, x_3 will be determined:
x_1 = x_4 + x_5 + x_7,
x_2 = x_4 + x_6 + x_7,
x_3 = x_5 + x_6 + x_7.
It means that code X can be used for encoding 16 binary ‘messages’ of length 4.
If y = y1 y2 y3 y4 y5 y6 y7 differs from a codeword x ∈ X in one place, say y = x + ek
then the equation yH T = ek H T gives the binary decomposition of number k, which
leads to decoding x. Consequently, code X allows a single error to be corrected.
Suppose that the probability of error in any digit is p ≪ 1, independently of what occurred to other digits. Then the probability of an error in transmitting a non-encoded (4N)-digit message is
1 − (1 − p)^{4N} ≈ 4N p.
But using the Hamming code we need to transmit 7N digits. An erroneous transmission requires at least two wrong digits, which occurs with probability
≈ 1 − (1 − \binom{7}{2} p^2)^N ≈ 21N p^2 ≪ 4N p.
So, the extra effort of using 3 check digits in the Hamming code is justified.
(b) A Hamming code X_{H,ℓ} of length N = 2^ℓ − 1 (ℓ ≥ 3) consists of binary words x = x_1 ... x_N such that xH^T = 0, where H is an ℓ × N matrix whose columns h^{(1)}, ..., h^{(N)} are all non-zero binary vectors of length ℓ. Hence, the number of codewords of weight w(x) = ∑_{j=1}^{N} x_j = s equals the number of (non-ordered) collections of s binary, non-zero, pairwise distinct ℓ-vectors of total sum 0. In fact, if xH^T = 0, w(x) = s and x_{j_1} = x_{j_2} = ⋯ = x_{j_s} = 1, then the sum of ℓ-vectors h^{(j_1)} + ⋯ + h^{(j_s)} = 0.
Thus, one codeword has weight 0, no codeword has weight 1 or 2, and N(N − 1)/3! codewords have weight 3 (i.e. 7 and 35 words of weight 3 for ℓ = 3 and ℓ = 4). Further, we have [N(N − 1)(N − 2) − N(N − 1)]/4! = N(N − 1)(N − 3)/4! words of weight 4 (i.e. 7 and 105 words of weight 4 for ℓ = 3 and ℓ = 4). Finally, we have N(N − 1)(N − 3)(N − 7)/5! words of weight 5 (i.e. 0 and 168 words of weight 5 for ℓ = 3 and ℓ = 4). Each time we add a factor, we should avoid ℓ-vectors equal to a linear combination of previously selected vectors. In Problem 3.9 we will compute the enumerator polynomial for N = 15:
1 + 35X^3 + 105X^4 + 168X^5 + 280X^6 + 435X^7 + 435X^8 + ⋯
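These counts can be confirmed by brute force; a short sketch that enumerates the Hamming code of length 2^ℓ − 1 directly (columns of H taken as the integers 1, ..., N) and tallies weights. For ℓ = 4 the output reproduces the coefficients 35, 105, 168, 280, 435, 435, ... quoted above.

```python
from itertools import product
from collections import Counter

def hamming_weight_distribution(l):
    """Brute-force weight distribution of the [2^l - 1, 2^l - 1 - l] Hamming code."""
    N = 2**l - 1
    cols = list(range(1, N + 1))      # columns of H: all non-zero l-bit vectors
    dist = Counter()
    # x is a codeword iff the XOR of the columns at its support is 0.
    for x in product([0, 1], repeat=N):
        s = 0
        for xi, c in zip(x, cols):
            if xi:
                s ^= c
        if s == 0:
            dist[sum(x)] += 1
    return dict(sorted(dist.items()))

print(hamming_weight_distribution(3))   # {0: 1, 3: 7, 4: 7, 7: 1}
print(hamming_weight_distribution(4))   # {0: 1, 3: 35, 4: 105, 5: 168, ...}
```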
Problem 2.13 (a) The dot-product of vectors x, y from a binary Hamming space
HN is defined as x · y = ∑Ni=1 xi yi (mod 2), and x and y are said to be orthogonal
if x · y = 0. What does it mean to say that X ⊆ HN is a linear [N, k] code with
generating matrix G and parity-check matrix H ? Show that
X ⊥ = {x ∈ HN : x · y = 0 for all y ∈ X }
is a linear [N, N − k] code and find its generator and parity-check matrices.
(b) A linear code X is called self-orthogonal if X ⊆ X ⊥ . Prove that X is self-
orthogonal if the rows of G are self and pairwise orthogonal. A linear code is called
self-dual if X = X ⊥ . Prove that a self-dual code has to be an [N, N/2] code (and
hence N must be even). Conversely, prove that a self-orthogonal [N, N/2] code, for
N even, is self-dual. Give an example of such a code for any even N and prove that
a self-dual code always contains the word 1 . . . 1.
(c) Consider now a Hamming [2^ℓ − 1, 2^ℓ − ℓ − 1] code X_{H,ℓ}. Describe the generating matrix of X_{H,ℓ}^⊥. Prove that the distance between any two codewords in X_{H,ℓ}^⊥ equals 2^{ℓ−1}.
The dual X_{H,ℓ}^⊥ of a Hamming code X_{H,ℓ} is called a simplex code. By the above, it has length 2^ℓ − 1 and rank ℓ, and its generating matrix G_{H,ℓ}^⊥ is ℓ × (2^ℓ − 1), with columns listing all non-zero vectors of length ℓ. To check that dist(X_{H,ℓ}^⊥) = 2^{ℓ−1}, it suffices to establish that the weight of any non-zero word x ∈ X_{H,ℓ}^⊥ equals 2^{ℓ−1}. But a non-zero word x ∈ X_{H,ℓ}^⊥ is a non-zero linear combination of rows of G_{H,ℓ}^⊥. Let J ⊂ {1, ..., ℓ} be the set of contributing rows:
x = ∑_{i∈J} g^{(i)}.
Clearly, w(g^{(i)}) = 2^{ℓ−1}, as exactly half of all 2^ℓ vectors have 1 at any given position. The proof is finished by induction on |J|.
A simple and elegant way is to use the MacWilliams identity (cf. Lemma 3.4.4), which immediately gives
W_{X^⊥}(s) = 1 + (2^ℓ − 1) s^{2^{ℓ−1}}.   (2.6.8)
have no means to determine if the original codeword was corrupted by the channel or not.)
If yH^T ≠ 0 then yH^T coincides with a column of H. Suppose yH^T gives column j of H; then we decode y by y → y + e_j. In other words, we change digit j in y and decide that it was the word sent through the channel. This works well when errors in the channel are rare.
If ℓ = 3, a Hamming [7, 4] code contains 2^4 = 16 codewords. These codewords are fixed when H is fixed: in the example they are used for encoding 15 letters from A to O and the space character ∗. Upon receiving a message we divide it into words of length 7: in the example there are 15 words altogether. Performing the decoding procedure leads to
JOHNNIE∗BE∗GOOD
Solution The parity-check matrix H for the Hamming code is ℓ × (2^ℓ − 1) and formed by all non-zero columns of length ℓ; in particular, it includes all ℓ columns of weight 1. The latter are linearly independent; hence H has rank ℓ. Since X_Ham = ker H, we have dim X = 2^ℓ − 1 − ℓ = rank X. The number of codewords then equals 2^{2^ℓ−ℓ−1}.
Since all columns of H are distinct, any pair of columns is linearly independent. So, the minimal distance of X is > 2. But H contains three columns that are linearly dependent, e.g.
(1 0 0 ... 0)^T, (0 1 0 ... 0)^T and (1 1 0 ... 0)^T.
Hence, the minimal distance equals 3. Therefore, if a single error occurs, i.e. the
received word is at distance 1 from a codeword, then this codeword is uniquely
determined. Hence, the Hamming code is single-error correcting.
the volume of a one-ball = \binom{N}{0} + \binom{N}{1} = 1 + N,
the total number of words = 2^N,
and
(1 + N) · 2^{N−ℓ} = 2^ℓ · 2^{N−ℓ} = 2^N.
The information rate of the code equals
rank/length = (2^ℓ − ℓ − 1)/(2^ℓ − 1).
The code with ℓ = 3 has the 3 × 7 parity-check matrix of the form (2.6.4); any permutation of rows leads to an equivalent code. The generating matrix is 4 × 7:
$$
\begin{pmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1\\
0 & 1 & 0 & 0 & 0 & 1 & 1\\
0 & 0 & 1 & 0 & 1 & 0 & 1\\
0 & 0 & 0 & 1 & 0 & 1 & 1
\end{pmatrix}
$$
and the information rate is 4/7. The Hamming code with ℓ = 2 is trivial: it contains a single non-zero codeword 1 1 1.
Problem 2.16 Define a BCH code of length N over the field Fq with designed
distance δ . Show that the minimum weight of such a code is at least δ .
Consider a BCH code of length 31 over the field F2 with designed distance 8.
Show that the minimum distance is at least 11.
Solution A BCH code of length N over the field Fq is defined as a cyclic code X
whose minimum degree generator polynomial g(X) ∈ Fq [X], with g(X)|(X N − 1)
(and hence deg g(X) ≤ N), contains among its roots the subsequent powers ω ,
ω 2 , . . . , ω δ −1 where ω ∈ Fqs is a primitive Nth root of unity. (This root ω lies
in an extension field Fqs – the splitting field for X N − 1 over Fq , i.e. N|qs − 1.) Then
δ is called the designed distance for X ; the actual distance (which may be difficult
to calculate in a general situation) is ≥ δ .
If we consider the binary BCH code X of length 31, ω should be a primitive
root of unity of degree 31, with ω 31 = 1 (the root ω lies in an extension field F32 ).
Solution Let x be the codeword in X represented by the first row of G and pick a pair of other rows, say y and z. After the first deletion they become y′ and z′, correspondingly. Both weights w(y′) and w(z′) must be ≥ d/2: otherwise at least one of the original words y and z, say y, would have had at least d/2 digits 1 among the deleted d digits (as w(y) ≥ d by condition). But then
w(x + y) ≤ w(y′) + d − d/2 < d,
which contradicts the condition that the distance of X is d.
We want to check that the weight w(y′ + z′) ≥ d/2. Assume the opposite:
w(y′ + z′) = m′ < d/2.
Then m = w(y_0 + z_0) must be ≥ d − m′ > d/2, where y_0 is the deleted part of y, of length d, and z_0 is the deleted part of z, also of length d. In fact, as before, if m < d − m′ then w(y + z) < d, which is impossible. But if m ≥ d − m′ then
w(x + y + z) = d − m + m′ < d,
again impossible. Hence, the sum of any two rows of G_1 has weight ≥ d/2.
This argument can be repeated for the sum of any number of rows of G_1 (not exceeding k − 1). In fact, in the case of such a sum x + y + ⋯ + z, we can pass to new matrices, G′ and G′_1, with this sum among the rows. We conclude that X_1 has minimum distance d_1 ≥ d/2. The rank of X_1 is k − 1, for any k − 1 rows of G_1 are linearly independent. (The above sum cannot be 0.)
Now, the process of deletion can be applied to X_1 (you delete the d_1 columns of G_1 yielding digits 1 in a row of G_1 with exactly d_1 digits 1). And so on, until you exhaust the initial rank k by diminishing it by 1. This leads to the required bound
N ≥ d + ⌈d/2⌉ + ⌈d/2^2⌉ + ⋯ + ⌈d/2^{k−1}⌉.
Problem 2.18 Define a cyclic linear code X and show that it has a codeword of
minimal length which is unique, under normalisation to be stated. The polynomial
g(X) whose coefficients are the symbols of this codeword is the (minimum degree)
generator polynomial of this code: prove that all words of the code are related to
g(X) in a particular way.
Show further that g(X) can be the generator polynomial of a cyclic code with
words of length N iff it satisfies a certain condition, to be stated.
There are at least three ways of determining the parity-check matrix of the code
from a knowledge of the generator polynomial. Explain one of them.
Solution Let X be the cyclic code of length N with generator polynomial g(X) = ∑_{0≤i≤d} g_i X^i of degree d. Without loss of generality, assume the code is non-trivial, with 1 < d < N − 1. Let g denote the corresponding codeword g_0 ... g_d 0 ... 0 (there are d + 1 coefficients g_i completed with N − d − 1 zeros). Then:
Alternatively, let h(X) be the check polynomial for the cyclic code X of length N with generator polynomial g(X), so that g(X)h(X) = X^N − 1. Then:
(a) X = { f(X) : f(X)h(X) = 0 mod (X^N − e) };
(b) if h(X) = h_0 + h_1 X + ⋯ + h_{N−r} X^{N−r} then the parity-check matrix H of X has the form (2.6.9);
(c) the dual code X^⊥ is a cyclic code of dim X^⊥ = r, and X^⊥ = ⟨h^⊥(X)⟩, where
h^⊥(X) = h_0^{−1} X^{N−r} h(X^{−1}) = h_0^{−1}(h_0 X^{N−r} + h_1 X^{N−r−1} + ⋯ + h_{N−r}).
Solution The code in question is [2^ℓ, ℓ + 1, 2^{ℓ−1}]; with ℓ = 5, the information rate equals 6/32 ≈ 1/5. Let us check that all codewords except 0 and 1 have weight 2^{ℓ−1}. For ℓ ≥ 1 the code R(ℓ) is defined by recursion
which is small when p is small. (As an estimate of an acceptable p, we can take the solution to 1 + p log p + (1 − p) log(1 − p) = 26/32.) If the block length is fixed (and rather small), with a low value of p we can't get near the capacity.
Indeed, for = 5, the code is [32, 6, 16], detecting 15 and correcting 7 errors. That
is, the code can correct a fraction > 1/5 of the total of 32 digits. Its information
rate is 6/32 and if the capacity of the (memoryless) channel is C = 1− η (p) (where
p stands for the symbol-probability of error), we need the bound C > 6/32; that
is, η (p) + 6/32 < 1, for a reliable transmission. This yields |p − 1/2| > |p∗ − 1/2|
where p∗ ∈ (0, 1) solves 26/32 = η (p∗ ). Definitely 0 ≤ p < 1/5 and 4/5 < p ≤ 1
would do. In reality the error-probability was much less.
Problem 2.20 Prove that any binary [5, M, 3] code must have M ≤ 4. Verify that
there exists, up to equivalence, exactly one [5, 4, 3] code.
M_2^*(N, d) ≤ 2(d + 1)/(2d + 1 − N).
In fact,
M_2^*(5, 3) ≤ 2 · 4/(6 + 1 − 5) = 2 · 2 = 4.
All [5, 4, 3] codes are equivalent to {00000, 00111, 11001, 11110}.
Problem 2.21 Let X be a binary [N, k, d] linear code with generating matrix
G. Verify that we may assume that the first row of G is 1 . . . 1 0 . . . 0 with d ones.
Write:
$$
G = \begin{pmatrix} 1 \dots 1 & 0 \dots 0 \\ G_1 & G_2 \end{pmatrix}.
$$
Show that if d2 is the distance of the code with generating matrix G2 then d2 ≥ d/2.
Solution Let X be [N, k, d]. We can always form a generating matrix G of X where the first row is a codeword x with w(x) = d; by permuting columns of G we can bring the first row to the form $\underbrace{1\dots1}_{d}\,\underbrace{0\dots0}_{N-d}$. So, up to equivalence,
$$
G = \begin{pmatrix} 1 \dots 1 & 0 \dots 0 \\ G_1 & G_2 \end{pmatrix}.
$$
Suppose d(G2 ) < d/2 then, without loss of generality, we may assume that there
exists a row of (G1 G2 ) where the number of ones among digits d + 1, . . . , N is
< d/2. Then the number of ones among digits 1, . . . d in this row is > d/2, as its
total weight is ≥ d. Then adding this row and 1 . . . 1 0 . . . 0 gives a codeword with
weight < d. So, d(G2 ) ≥ d/2.
Problem 2.22 (Gilbert–Varshamov bound) Prove that there exists a p-ary linear [N, k, d] code if p^k < 2^N / v_{N−1}(d − 2). Thus, if p^k is the largest power of p satisfying this inequality, we have M_p^*(N, d) ≥ p^k.
So, the parity-check matrix may be constructed iff S_N + 1 < p^{N−k}. Finally, observe that S_N + 1 = v_{N−1}(d − 2). Say, there exists a [5, 2^k, 3] code if 2^k < 32/5, so k = 2 and M_2^*(5, 3) ≥ 4, which is, in fact, sharp.
Problem 2.23 An element b ∈ F_q^* is called primitive if its order (i.e. the minimal k such that b^k = 1 mod q) is q − 1. It is not difficult to find a primitive element of the multiplicative group F_q^* explicitly. Consider the prime factorisation
q − 1 = ∏_{j=1}^{s} p_j^{ν_j}.
For any j = 1, ..., s select a_j ∈ F_q such that a_j^{(q−1)/p_j} ≠ e. Set b_j = a_j^{(q−1)/p_j^{ν_j}} and check that b = ∏_{j=1}^{s} b_j has order q − 1.
Solution Indeed, the order of b_j is p_j^{ν_j}. Next, if b^n = 1 for some n then n ≡ 0 mod p_j^{ν_j}, because b^{n ∏_{i≠j} p_i^{ν_i}} = 1 implies b_j^{n ∏_{i≠j} p_i^{ν_i}} = 1, i.e. n ∏_{i≠j} p_i^{ν_i} ≡ 0 mod p_j^{ν_j}. Because the p_j are distinct primes, it follows that n ≡ 0 mod p_j^{ν_j} for every j. Hence, n is a multiple of ∏_{j=1}^{s} p_j^{ν_j} = q − 1, i.e. b has order q − 1.
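A short sketch of this construction for a prime q (here q = 31, chosen only as an illustration); the factorisation is done by simple trial division.

```python
def prime_factors(n):
    """Trial-division factorisation: returns {prime: exponent}."""
    f, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            f[p] = f.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        f[n] = f.get(n, 0) + 1
    return f

def primitive_element(q):
    """Generator of F_q^* for a prime q, via the construction of Problem 2.23."""
    factors = prime_factors(q - 1)             # q - 1 = prod p_j^{nu_j}
    b = 1
    for p, nu in factors.items():
        # choose a_j with a_j^((q-1)/p_j) != 1
        a = next(a for a in range(2, q) if pow(a, (q - 1) // p, q) != 1)
        b_j = pow(a, (q - 1) // p**nu, q)      # b_j has order p_j^{nu_j}
        b = b * b_j % q
    return b

q = 31
b = primitive_element(q)
assert sorted(pow(b, i, q) for i in range(1, q)) == list(range(1, q))
print(q, b)   # b generates the whole multiplicative group F_31^*
```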
Problem 2.24 The minimal polynomial with a primitive root is called a primitive
polynomial. Check that among irreducible binary polynomials of degree 4 (see
(2.5.9)), 1 + X + X 4 and 1 + X 3 + X 4 are primitive and 1 + X + X 2 + X 3 + X 4 is
not. Check that all six irreducible binary polynomials of degree 5 (see (2.5.15))
are primitive; in practice, one prefers to work with 1 + X 2 + X 5 as the calculations
modulo this polynomial are slightly shorter. Check that among the nine irreducible
polynomials of degree 6 in (2.5.16), there are six primitive: they are listed in the
upper three lines. Prove that a primitive polynomial exists for every given degree.
Solution For the solution to the last part, see Section 3.1.
Problem 2.25 A cyclic code X of length N with the generator polynomial g(X)
of degree d = N − k can be described in terms of the roots of g(X), i.e. the elements
α1 , . . . αN−k such that g(α j ) = 0. These elements are called zeros of code X and
belong to a Galois field F2d . As g(X)|(1+X N ), they are also among roots of 1+X N .
That is, α Nj = 1, 1 ≤ j ≤ N − k, i.e. the α j are N th roots of unity. The remaining k
roots of unity α1 , . . . , αk are called non-zeros of X . A polynomial a(X) ∈ X iff,
in Galois field F2d , a(α j ) = 0, 1 ≤ j ≤ N − k.
(a) Show that if X ⊥ is the dual code then the zeros of X ⊥ are α1 −1 , . . . , αk −1 , i.e.
the inverses of the non-zeros of X .
(b) A cyclic code X with generator g(X) is called reversible if, for all x =
x0 . . . xN−1 ∈ X , the word xN−1 . . . x0 ∈ X . Show that X is reversible iff g(α ) = 0
implies that g(α −1 ) = 0.
(c) Prove that a q-ary cyclic code X of length N with (q, N) = 1 is invariant under
the permutation of digits such that πq (i) = qi mod N (i.e. x → xq ). If s = ordN (q)
then the two permutations i → i + 1 and πq (i) generate a subgroup of order Ns in
the group Aut(X ) of the code automorphisms.
Solution Indeed, since a(xq ) = a(x)q is proportional to the same generator polyno-
mial it belongs to the same cyclic code as a(x).
Problem 2.26 Prove that there are 129 non-equivalent cyclic binary codes of
length 128 (including the trivial codes, {0 . . . 0} and {0, 1}128 ). Find all cyclic bi-
nary codes of length 7.
Solution The equivalence classes of the cyclic codes of length 2^k are in a one-to-one correspondence with the divisors of 1 + X^{2^k} = (1 + X)^{2^k}; the number of those equals 2^k + 1.
Furthermore, there are eight codes listed by their generators, which are the divisors of X^7 − 1, as
X^7 − 1 = (1 + X)(1 + X + X^3)(1 + X^2 + X^3).
3 Further Topics from Coding Theory
{e, a, ..., a^{d−1}}. Observe that the cyclic group Z_d has exactly φ(d) elements of order d. So, the whole of F^* has exactly φ(d) elements of order d; in other words, if ψ(d) is the number of elements in F of order d then either ψ(d) = 0 or ψ(d) = φ(d), and
q − 1 = ∑_{d: d|n} ψ(d) ≤ ∑_{d: d|n} φ(d) = q − 1,
Corollary 3.1.12 For any prime p and natural s ≥ 1, there exists precisely one
field with ps elements.
Proof of Corollary 3.1.12 Take again the polynomial X^q − X with coefficients from Z_p and q = p^s. By Theorem 3.1.11, there exists the splitting field Spl(X^q − X) where X^q − X = X(X^{q−1} − e) is factorised into linear polynomials. So, Spl(X^q − X) contains the roots of X^q − X and has characteristic p (as it contains Z_p).
However, the roots of X^q − X form a subfield: if a^q = a and b^q = b then (a ± b)^q = a^q + (±b)^q (Lemma 3.1.5), which coincides with a ± b. Also, (ab^{−1})^q = a^q (b^q)^{−1} = ab^{−1}. This field cannot be strictly contained in Spl(X^q − X), thus it coincides with Spl(X^q − X).
It remains to check that all roots of X^q − X are distinct: then the cardinality of Spl(X^q − X) will be equal to q. In fact, if X^q − X had a multiple root then it would have had a common factor with its 'derivative' ∂_X(X^q − X) = qX^{q−1} − e. However, qX^{q−1} = 0 in Spl(X^q − X), so the derivative is the non-zero constant −e and X^q − X cannot have such factors.
Summarising, we have the two characterisation theorems for finite fields.
Theorem 3.1.13 All finite fields have size ps where p is prime and s ≥ 1 integer.
For all such p, s, there exists a unique field of this size.
power of X                 polynomial              vector (string)
(mod 1 + X^3 + X^4)
--                         0                       0000
X^0                        1                       1000
X                          X                       0100
X^2                        X^2                     0010
X^3                        X^3                     0001
X^4                        1 + X^3                 1001
X^5                        1 + X + X^3             1101        (3.1.2)
X^6                        1 + X + X^2 + X^3       1111
X^7                        1 + X + X^2             1110
X^8                        X + X^2 + X^3           0111
X^9                        1 + X^2                 1010
X^10                       X + X^3                 0101
X^11                       1 + X^2 + X^3           1011
X^12                       1 + X                   1100
X^13                       X + X^2                 0110
X^14                       X^2 + X^3               0011
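Table (3.1.2) is easy to regenerate mechanically; a minimal sketch, with field elements stored as bit masks (bit i = coefficient of X^i):

```python
# Powers of X in F_2[X]/<1 + X^3 + X^4>, reproducing table (3.1.2).
MOD = 0b11001          # 1 + X^3 + X^4

def times_X(a):
    """Multiply a polynomial (bit mask) by X and reduce mod 1 + X^3 + X^4."""
    a <<= 1
    if a & 0b10000:    # a degree-4 term appeared: subtract (XOR) the modulus
        a ^= MOD
    return a

def as_string(a):
    """Coefficient string c0 c1 c2 c3, as in the 'vector (string)' column."""
    return ''.join(str((a >> i) & 1) for i in range(4))

a = 1                   # X^0
for i in range(15):
    print(f"X^{i:<2}  {as_string(a)}")
    a = times_X(a)
# The 15 printed strings are exactly the non-zero rows of table (3.1.2).
```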
Worked Example 3.1.16 (a) How many elements are in the smallest extension
of F5 which contains all roots of polynomials X 2 + X + 1 and X 3 + X + 1?
(b) Determine the number of subfields of F1024 , F729 . Find all primitive elements
of F7 , F9 , F16 . Compute (ω 10 + ω 5 )(ω 4 + ω 2 ) where ω is a primitive element of
F16 .
(ω 10 + ω 5 )(ω 4 + ω 2 ) = ω 14 + ω 9 + ω 12 + ω 7
= 1001 + 0101 + 1111 + 1101 = 1110
= ω 10 .
Definition 3.1.17 The set of all polynomials with coefficients from F_q is a commutative ring denoted by F_q[X]. A quotient ring F_q[X]/⟨g(X)⟩ is where the operation is modulo a fixed polynomial g(X) ∈ F_q[X].
Theorem 3.1.19 Let g(X) ∈ F_q[X] have degree deg g(X) = d. Then F_q[X]/⟨g(X)⟩ is a field F_{q^d} iff g(X) is irreducible.
has a root. We can divide g(X) by X − α in Fqd and use the same construction
to prove that g1 (X) = g(X)/(X − α ) has a root in some extension of Fqt ,t < d.
Finally, we obtain a field containing all d roots of g(X), i.e. construct the splitting
field Spl(g(X)).
where d is the smallest positive integer such that α^{q^d} = α (such a d exists, as will be proved in Lemma 3.1.24).
A monic polynomial is one whose leading coefficient equals e. The minimal polynomial for α ∈ K over F is the unique monic polynomial M_α(X) (= M_{α,F}(X)) ∈ F[X] such that M_α(α) = 0 and M_α(X) | g(X) for each g(X) ∈ F[X] with g(α) = 0. When ω is a primitive element of K (generating K^*), M_ω(X) is called a primitive polynomial (over F). The order of a polynomial p(X) ∈ F[X] is the smallest n such that p(X) | (X^n − e).
Lemma 3.1.24 Let F_q ⊂ F_{q^d} and α ∈ F_{q^d}. Let M_α(X) ∈ F[X] be the minimal polynomial for α, of degree deg M_α(X) = d. Then:
Proof Assertions (a), (b) follow from the definition. To prove (c), assume γ ∈ K is a root of a polynomial f(X) = a_0 + a_1 X + ⋯ + a_d X^d from F[X], i.e. ∑_{0≤i≤d} a_i γ^i = 0. As a_i^q = a_i (which is true for all a ∈ F) and by virtue of Lemma 3.1.5,
f(γ^q) = ∑_{0≤i≤d} a_i γ^{qi} = ∑_{0≤i≤d} a_i^q γ^{qi} = ( ∑_{0≤i≤d} a_i γ^i )^q = 0,
so γ^q is a root. Similarly, γ^{q^2} = (γ^q)^q is a root, and so on.
For M_α(X) this yields that α, α^q, α^{q^2}, ... are roots. This will end when α^{q^s} = α for the first time (which proves the existence of such an s). Finally, s = d as all α, α^q, ..., α^{q^{d−1}} are distinct: if not, then α^{q^i} = α^{q^j} where, say, i < j. Taking the q^{d−j} power of both sides, we get α^{q^{d+i−j}} = α^{q^d} = α. So, α is a root of the polynomial P(X) = X^{q^{d+i−j}} − X, and Spl(P(X)) = F_{q^{d+i−j}}. On the other hand, α is a root of an irreducible polynomial of degree d, and Spl(M_α(X)) = F_{q^d}. Hence, d | (d + i − j), or d | (i − j), which is impossible. This means that all the roots α^{q^i}, i < d, are distinct.
Theorem 3.1.25 For any field Fq and integer d ≥ 1, there exists an irreducible
polynomial f (X) ∈ Fq [X] of degree d .
Proof First, we establish the additive Möbius inversion formula. Let ψ and Ψ be two functions from Z_+ to an Abelian group G with an additive group operation. Then the following equations are equivalent:
Ψ(n) = ∑_{d|n} ψ(d)   (3.1.6)
and
ψ(n) = ∑_{d|n} μ(d) Ψ(n/d).   (3.1.7)
This equivalence follows when we observe that (a) the sum ∑_{d|n} μ(d) is equal to 0 if n > 1 and to 1 if n = 1, and (b) for all n,
∑_{d: d|n} μ(d) Ψ(n/d) = ∑_{d: d|n} μ(d) ∑_{c: c|n/d} ψ(c)
= ∑_{c: c|n} ψ(c) ∑_{d: d|n/c} μ(d) = ψ(n).
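The standard count that this inversion yields is N_q(d) = (1/d) ∑_{e|d} μ(e) q^{d/e} monic irreducible polynomials of degree d over F_q; a small sketch computing it (the values for F_2 match the lists of degree-4, -5 and -6 irreducible polynomials referred to in Problem 2.24).

```python
def mobius(n):
    """Mobius function mu(n) by trial factorisation."""
    if n == 1:
        return 1
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:       # square factor => mu = 0
                return 0
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result

def num_irreducible(q, d):
    """N_q(d) = (1/d) * sum_{e | d} mu(e) * q^(d/e)."""
    total = sum(mobius(e) * q**(d // e) for e in range(1, d + 1) if d % e == 0)
    return total // d

print([num_irreducible(2, d) for d in range(1, 7)])
# [2, 1, 2, 3, 6, 9]: in particular 3, 6 and 9 irreducible binary polynomials
# of degrees 4, 5 and 6, and N_q(d) >= 1 for every d (Theorem 3.1.25).
```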
Worked Example 3.1.28 Find all irreducible polynomials of degree 2 and 3 over
F3 and determine their orders.
positive integer with this property. In this case Mω (X) = ∏b∈Fqn (X − b).
For a general irreducible polynomial, the notion of conjugacy is helpful: see Def-
inition 3.1.34 below. This concept was introduced (and used) informally in Section
2.5 for fields F2s .
power      1 + X + X^4       1 + X^3 + X^4
of ω       vector (word)     vector (word)
--         0000              0000
0          1000              1000
1          0100              0100
2          0010              0010
3          0001              0001
4          1100              1001
5          0110              1101              (3.1.8)
6          0011              1111
7          1101              1110
8          1010              0111
9          0101              1010
10         1110              0101
11         0111              1011
12         1111              1100
13         1011              0110
14         1001              0011
Under the left table addition rule, the minimal polynomial M_{ω^i}(X) for the power ω^i is 1 + X + X^4 for i = 1, 2, 4, 8 and 1 + X^3 + X^4 for i = 7, 14, 13, 11, while for i = 3, 6, 12, 9 it is 1 + X + X^2 + X^3 + X^4 and for i = 5, 10 it is 1 + X + X^2. Under the right table addition rule, we have to swap the polynomials 1 + X + X^4 and 1 + X^3 + X^4. Polynomials 1 + X + X^4 and 1 + X^3 + X^4 are of order 15, polynomial 1 + X + X^2 + X^3 + X^4 is of order 5 and 1 + X + X^2 of order 3.
A short way to produce these answers is to find the expression for (ω^i)^4 as a linear combination of 1, ω^i, (ω^i)^2 and (ω^i)^3. For example, from the left table we have for ω^7:
(ω^7)^4 = ω^28 = ω^3 + ω^2 + 1,
(ω^7)^3 = ω^21 = ω^3 + ω^2,
and readily see that (ω^7)^4 = 1 + (ω^7)^3, which yields 1 + X^3 + X^4. For completeness, write down the unused expression for (ω^7)^2:
(ω^7)^2 = ω^14 = ω^12 · ω^2 = (1 + ω)^3 ω^2 = (1 + ω + ω^2 + ω^3) ω^2
= ω^2 + ω^3 + ω^4 + ω^5 = ω^2 + ω^3 + 1 + ω + (1 + ω)ω = 1 + ω^3.
M_{ω^5}(X) = (X − ω^5)(X − ω^10) = X^2 + (ω^5 + ω^10)X + ω^15 = X^2 + X + 1.
M_{ω^0}(X) = 1 + X,  M_ω(X) = 1 + X + X^4,
M_{ω^3}(X) = 1 + X + X^2 + X^3 + X^4,
M_{ω^5}(X) = 1 + X + X^2,  M_{ω^7}(X) = 1 + X^3 + X^4.
Example 3.1.37 For the field F_32 ≅ F_2[X]/⟨1 + X^2 + X^5⟩, the addition table is calculated below. The minimal polynomials are
has roots ω, ω^q, ω^{q^2}, ..., ω^{q^{n−1}}. An (F_{q^n}; F_q)-automorphism τ fixes the coefficients of M_ω(X), thus it permutes the roots, and τ(ω) = ω^{q^j} for some j, 0 ≤ j ≤ n − 1. But as ω is primitive, τ is completely determined by τ(ω). Then as σ_{q^j}(ω) = ω^{q^j} = τ(ω), we have that τ = σ_{q^j}.
The rest of this section is devoted to a study of roots of unity, i.e. the roots of the polynomial X^n − e over the field F_q where q = p^s and p = char(F_q). Without loss of generality, we suppose from now on that
gcd(n, q) = 1, i.e. n and q are co-prime.   (3.1.10)
Indeed, if n and q are not co-prime, we can write n = m p^k with gcd(m, q) = 1. Then, by Lemma 3.1.5,
X^n − e = X^{m p^k} − e = (X^m − e)^{p^k},
and our analysis is reduced to the polynomial X^m − e.
Theorem 3.1.45 Let P^(n) be the set of the primitive (n, F_q)-roots of unity and T^(n) the set of primitive elements in F_{q^s} = Spl(X^n − e). Then either (i) P^(n) ∩ T^(n) = ∅ or (ii) P^(n) = T^(n); case (ii) occurs iff n = q^s − 1.
If we begin with a primitive element ω ∈ F_{q^s} where s = ord_n(q), then β = ω^{(q^s−1)/n} is a primitive (n, F_q)-root of unity.
Definition 3.1.46 The set of exponents i, iq, ..., iq^{d−1}, where d (= d(i)) is the minimal positive integer such that iq^d ≡ i mod n, is called a cyclotomic coset (for i) and denoted by C_i (= C_i(n, q)) (alternatively, C_{ω^i} is defined as the set of non-zero field elements ω^i, ω^{iq}, ..., ω^{iq^{d−1}}).
ω^2 ∼ 2X + 1, ω^3 ∼ 2X + 2, ω^4 ∼ 2,
ω^5 ∼ 2X, ω^6 ∼ X + 2, ω^7 ∼ X + 1, ω^8 ∼ 1.
M_ω(X) = (X − ω)(X − ω^3) = X^2 − (ω + ω^3)X + ω^4 = X^2 − 2X + 2 = X^2 + X + 2.
Hence, X^2 + X + 2 is primitive.
ω^2 ∼ X^2, ω^3 ∼ X^2 + 2, ω^4 ∼ X^2 + 2X + 2, ω^5 ∼ 2X + 2,
ω^6 ∼ 2X^2 + 2X, ω^7 ∼ X^2 + 1, ω^8 ∼ X^2 + X + 2,
ω^9 ∼ 2X^2 + 2X + 2, ω^10 ∼ X^2 + 2X + 1, ω^11 ∼ X + 2,
ω^12 ∼ X^2 + 2X, ω^13 ∼ 2, ω^14 ∼ 2X, ω^15 ∼ 2X^2, ω^16 ∼ 2X^2 + 1,
ω^17 ∼ 2X^2 + X + 1, ω^18 ∼ X + 1, ω^19 ∼ X^2 + X,
ω^20 ∼ 2X^2 + 2, ω^21 ∼ 2X^2 + 2X + 1, ω^22 ∼ X^2 + X + 1,
ω^23 ∼ 2X^2 + X + 2, ω^24 ∼ 2X + 1, ω^25 ∼ 2X^2 + X, ω^26 ∼ 1.
as required.
X^15 − 1 = (1 + X)(1 + X + X^4)(1 + X + X^2 + X^3 + X^4)(1 + X + X^2)(1 + X^3 + X^4).
(b) Knowing the cyclotomic cosets we can show that a particular factorisation of X^n − e contains irreducible factors. Explicitly, take the polynomial X^9 − 1 over F_2 (with n = 9, q = 2). There are three cyclotomic cosets, C_0 = {0}, C_1 = {1, 2, 4, 8, 7, 5} and C_3 = {3, 6}, with corresponding minimal polynomials
1 + X, 1 + X^3 + X^6 and 1 + X + X^2.
This yields
X^9 − 1 = (1 + X)(1 + X + X^2)(1 + X^3 + X^6).
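A minimal sketch that lists cyclotomic cosets mod n for q = 2; the coset sizes give the degrees of the irreducible factors just used (1, 2, 6 for n = 9, and 1, 4, 2, 4, 4 for n = 15).

```python
def cyclotomic_cosets(n, q=2):
    """Cyclotomic cosets C_i = {i, iq, iq^2, ...} mod n, for gcd(n, q) = 1."""
    seen, cosets = set(), []
    for i in range(n):
        if i in seen:
            continue
        coset, j = [], i
        while j not in coset:
            coset.append(j)
            j = j * q % n
        cosets.append(coset)
        seen.update(coset)
    return cosets

print(cyclotomic_cosets(9))    # [[0], [1, 2, 4, 8, 7, 5], [3, 6]]
print(cyclotomic_cosets(15))   # coset sizes 1, 4, 4, 2, 4, matching the factors of X^15 - 1
```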
f(X) = 1 + X + X^6,
g(X) = 1 + X + X^2 + X^4 + X^6.
(b) Each root α of f(X) in Spl(f(X)) has ord(α) = ℓ and hence is a root of X^ℓ − e. So, f(X) | (X^ℓ − e).
(c) If f(X) | (X^n − e) then each root α of f(X) is a root of X^n − e, i.e. ord(α) | n. So, ℓ | n. Conversely, if n = kℓ then (X^ℓ − e) | (X^{kℓ} − e) and f(X) | (X^n − e) by (b).
(d) Follows from (c).
Worked Example 3.1.51 Use the Frobenius map σ: a → a^q to prove that every element a ∈ F_{q^n} has a unique q^j th root, for j = 1, ..., n − 1.
Suppose that q = p^s is odd. Show that exactly a half of the non-zero elements of F_q have square roots.
Solution The Frobenius map σ: a → a^q is a bijection F_{q^n} → F_{q^n}. So, for all b ∈ F_{q^n} there exists a unique a with a^q = b (the qth root). The jth power iteration σ^j: a → a^{q^j} is also a bijection, so again for all b ∈ F_{q^n} there exists a unique a with a^{q^j} = b. Observe that for all c ∈ F_q, c^{1/q^j} = c.
Now take τ: a → a^2, a multiplicative homomorphism F_q^* → F_q^*. If q is odd then F_q^* ≅ Z_{q−1} has an even number of elements q − 1. We want to show that if τ(a) = b then τ^{−1}(b) consists of two elements, a and −a. In fact, τ(−a) = b. Also, if τ(a′) = b then τ(a′a^{−1}) = e.
So, we want to analyse τ^{−1}(e). Clearly, ±e ∈ τ^{−1}(e). On the other hand, if ω is a primitive element then τ(ω^{(q−1)/2}) = ω^{q−1} = e, and τ^{−1}(e) consists of e = ω^0 and ω^{(q−1)/2}. So, ω^{(q−1)/2} = −e.
Now if τ(a′a^{−1}) = e then a′a^{−1} = ±e and a′ = ±a. Hence, τ sends precisely two elements, a and −a, into the same image, and its range τ(F_q^*) is a half of F_q^*.
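A quick numerical check of this conclusion for a few odd primes q: the squaring map is two-to-one on F_q^*, so exactly (q − 1)/2 non-zero elements are squares.

```python
# Verify Worked Example 3.1.51 for some odd primes q.
for q in (7, 11, 31):
    squares = {a * a % q for a in range(1, q)}
    assert len(squares) == (q - 1) // 2
    print(q, len(squares))
```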
Theorem 3.1.52 (cf. [92], Theorem 3.46.) Let polynomial p(X) ∈ Fq [X] be irre-
ducible, of degree n. Set m = gcd(d, n). Then m|n and p(X) factorises over Fqd into
m irreducible polynomials of degree n/m each. Hence, p(X) is irreducible over Fqd
iff m = 1.
Theorem 3.1.53 (cf. [92], Theorem 3.5.) Let gcd(d, q) = 1. The number of monic irreducible polynomials of order ℓ and degree d equals φ(ℓ)/d if ℓ ≥ 2, and the
Concluding this section, we give short summaries of the facts of the theory of
finite fields discussed above.
Summary 1.55. A field is a ring such that its non-zero elements form a commuta-
tive group under multiplication. (i) Any finite field F has the number of elements
q = ps where p is prime, and the characteristic char(F) = p. (ii) Any two finite
fields with the same number of elements are isomorphic. Thus, for a given q = ps ,
there exists, up to isomorphism, a unique field of cardinality q; such a field is de-
noted by Fq (it is often called a Galois field of size q). When q is prime, the field
Fq is isomorphic to the additive cyclic group Z p of p elements, equipped with mul-
tiplication mod p. (iii) The multiplicative group F∗q of non-zero elements from Fq
is isomorphic to the additive cyclic group Zq−1 of q − 1 elements. (iv) Field Fq
contains Fr as a subfield iff r|q; in this case Fq is isomorphic to a linear space over
(i.e. with coefficients from) Fr , of dimension log p (q/r). So, each prime number
p gives rise to an increasing sequence of finite fields F ps , s = 1, 2, . . . An element
ω ∈ Fq generating the multiplicative group F∗q is called a primitive element of Fq .
Summary 1.56. The polynomial ring over F_q is denoted by F_q[X]; if the polynomials are considered mod a fixed polynomial g(X) from F_q[X], the corresponding ring is denoted by F_q[X]/⟨g(X)⟩. (i) Ring F_q[X]/⟨g(X)⟩ is a field iff g(X) is irreducible over F_q (i.e. does not admit a decomposition g(X) = g_1(X)g_2(X) where deg(g_1(X)), deg(g_2(X)) < deg(g(X))). (ii) For any q and a positive integer d there exists an irreducible polynomial g(X) over F_q of degree d. (iii) If g(X) is irreducible and deg g(X) = d then the cardinality of the field F_q[X]/⟨g(X)⟩ is q^d, i.e. F_q[X]/⟨g(X)⟩ is isomorphic to F_{q^d} and belongs to the same series of fields as F_q (that is, char(F_{q^d}) = char(F_q)).
The smallest field with this property (i.e. the field F_q(α_1, ..., α_u)) is called a splitting field for p(X); we also say that p(X) splits over F_q(α_1, ..., α_u). The splitting field for p(X) is denoted by Spl(p(X)); an element α ∈ Spl(p(X)) takes part in decomposition (3.1.13) iff p(α) = 0. Field Spl(p(X)) is described as the set {g(α_j)} where j = 1, ..., u, and g(X) ∈ F_q[X] are polynomials of degree < deg(p(X)). (ii) Field F_q is splitting for the polynomial X^q − X. (iii) If a polynomial p(X) of degree d is irreducible over F_q and α is a root of p(X) in the field Spl(p(X)) then F_{q^d} ≅ F_q[X]/⟨p(X)⟩ is isomorphic to F_q(α) and all the roots of p(X) in Spl(p(X)) are given by the conjugate elements α, α^q, α^{q^2}, ..., α^{q^{d−1}}. Thus, d is the smallest positive integer for which α^{q^d} = α. (iv) Suppose that, for a given field F_q, a monic polynomial p(X) ∈ F_q[X] and an element α from a larger field, we have p(α) = 0. Then there exists a unique minimal polynomial M_α(X) with the property that M_α(α) = 0 (i.e. such that any other polynomial p(X) with p(α) = 0 is divided by M_α(X)). Polynomial M_α(X) is the unique irreducible polynomial over F_q vanishing at α. It is also the unique polynomial of minimum degree vanishing at α. We call M_α(X) the minimal polynomial of α over F_q. If ω is a primitive element of F_{q^d} then M_ω(X) is called a primitive polynomial for F_{q^d} over F_q. We say that elements α, β ∈ F_{q^d} are conjugate over F_q if they have the same minimal polynomial over F_q. Then (v) the conjugates of α ∈ F_{q^d} over F_q are α, α^q, ..., α^{q^{d−1}}, where d is the smallest positive integer with α^{q^d} = α. When α = ω^i where ω is a primitive element, the conjugacy class is associated with a cyclotomic coset C_{ω^i} = {ω^i, ω^{iq}, ..., ω^{iq^{d−1}}}.
Summary 1.58. Now assume that n and q = p^s are co-prime and take the polynomial X^n − e. The roots of X^n − e in the splitting field Spl(X^n − e) are called nth roots of unity over F_q. The set of all nth roots of unity is denoted by E_n. (i) Set E_n is a cyclic subgroup of order n in the multiplicative group of the field Spl(X^n − e). An nth root of unity generating E_n is called a primitive nth root of unity. (ii) If F_{q^s} is Spl(X^n − e) then s is the smallest positive integer with n | (q^s − 1). (iii) Let Π_n be the set of primitive nth roots of unity over the field F_q and Φ_n the set of primitive elements of the splitting field F_{q^s} = Spl(X^n − e). Then either Π_n ∩ Φ_n = ∅ or Π_n = Φ_n, the latter happening iff n = q^s − 1.
(as the splitting field Spl(X^q − X) is F_q). Furthermore, owing to the fact that ω is a primitive (q − 1, F_q) root of unity (or, equivalently, a primitive element of F_q), the minimal polynomial M_i(X) is just X − ω^i, for all i = 0, ..., N − 1.
An important property is that the RS codes are MDS. Indeed, the generator g(X) of X^{RS}_{q,δ,ω,b} has deg g(X) = δ − 1. Hence, the rank k is given by
k = dim(X^{RS}_{q,δ,ω,b}) = N − deg g(X) = N − δ + 1.   (3.2.2)
By the generalised BCH bound (see Theorem 3.2.9 below), the minimal distance d(X^{RS}_{q,δ,ω,b}) ≥ δ = N − k + 1; by the Singleton bound it cannot exceed N − k + 1. Hence,
d(X^{RS}_{q,δ,ω,b}) = N − k + 1 = δ.   (3.2.3)
Thus the RS codes have the largest possible minimal distance among all q-ary codes of length q − 1 and dimension k = q − δ. Summarising, we obtain
RS codes admit specific (and elegant) encoding and decoding procedures. Let X^RS be an [N, k, δ] RS code, with N = q − 1. For a message string a_0 ... a_{k−1} set a(X) = ∑_{0≤i≤k−1} a_i X^i and encode a(X) as c(X) = ∑_{0≤j≤N−1} a(ω^j) X^j. To show that c(X) ∈ X^RS, we have to check that c(ω) = ⋯ = c(ω^{δ−1}) = 0. Think of a(X) as a polynomial ∑_{0≤i≤N−1} a_i X^i with a_i = 0 for i ≥ k, and use
not known. Such an algorithm solution for the latter was found in 1969 by El-
wyn Berlekamp and James Massey, and is known since as the Berlekamp–Massey
decoding algorithm (cf. [20]); see Section 3.3. Later on, other algorithms were
proposed: continued fraction algorithm and Euclidean algorithm (see [112]).
Reed–Solomon codes played an important role in transmitting digital pictures
from American spacecraft throughout the 1970s and 1980s, often in combination
with other code constructions. These codes still figure prominently in modern space
missions although the advent of turbo-codes provides a much wider choice of cod-
ing and decoding procedures.
Reed–Solomon codes are also a key component in compact disc and digital game
production. The encoding and decoding schemes employed here are capable of cor-
recting bursts of up to 4000 errors (which makes about 2.5mm on the disc surface).
q-ary cyclic code X_N = ⟨g(X)⟩ with length N, designed distance δ, such that its generating polynomial is
g(X) = lcm( M_{ω^b}(X), M_{ω^{b+1}}(X), ..., M_{ω^{b+δ−2}}(X) ),   (3.2.5)
i.e.
X^{BCH}_{q,N,δ,ω,b} = { f(X) ∈ F_q[X] mod (X^N − 1) : f(ω^{b+i}) = 0, 0 ≤ i ≤ δ − 2 }.
If b = 1, this is a narrow sense BCH code. If ω is a primitive Nth root of unity, i.e. a primitive root of the polynomial X^N − 1, the BCH code is called primitive. (Recall that under the condition gcd(q, N) = 1 these roots form a commutative multiplicative group which is cyclic, of order N, and ω is a generator of this group.)
Proof Without loss of generality consider a narrow sense code. Set the parity-check (δ − 1) × N matrix
$$
H = \begin{pmatrix}
1 & \omega & \omega^2 & \dots & \omega^{N-1}\\
1 & \omega^2 & \omega^4 & \dots & \omega^{2(N-1)}\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & \omega^{\delta-1} & \omega^{2(\delta-1)} & \dots & \omega^{(\delta-1)(N-1)}
\end{pmatrix}.
$$
Take any δ − 1 columns of H, labelled k_1, ..., k_{δ−1}, and let D be the square matrix they form; D differs from the Vandermonde matrix by the factor ω^{k_s} in front of the sth column. Then the determinant of D is the product
$$
\det D = \prod_{s=1}^{\delta-1}\omega^{k_s}\,
\begin{vmatrix}
1 & 1 & \dots & 1\\
\omega^{k_1} & \omega^{k_2} & \dots & \omega^{k_{\delta-1}}\\
\vdots & \vdots & \ddots & \vdots\\
\omega^{k_1(\delta-2)} & \omega^{k_2(\delta-2)} & \dots & \omega^{k_{\delta-1}(\delta-2)}
\end{vmatrix}
= \prod_{s=1}^{\delta-1}\omega^{k_s} \times \prod_{i > j}\bigl(\omega^{k_i} - \omega^{k_j}\bigr) \neq 0,
$$
and any δ − 1 columns of H are indeed linearly independent. In turn, this means that any non-zero codeword in X has weight at least δ. Thus, X has minimum distance ≥ δ.
Let q = 2 and a(X) ∈ F_2[X]/⟨X^n − 1⟩. Prove that the Mattson–Solomon polynomial a_MS(X) is idempotent, i.e. a_MS(X)^2 = a_MS(X) in F_2[X]/⟨X^n − 1⟩.
Then
a^{(2)}_MS(X)|_{X=ω^i} = ( a_MS(X)|_{X=ω^i} )^2 = a_MS(X)|_{X=ω^i} = a_MS(X)^2|_{X=ω^i},
i.e. the polynomials a_MS(X) and a_MS(X)^2 agree at ω^0 = e, ω, ..., ω^{n−1}. Write this in matrix form, with a_MS(X) = a_{0,MS} + a_{1,MS} X + ⋯ + a_{n−1,MS} X^{n−1} and a^{(2)}_MS(X) = a^{(2)}_{0,MS} + a^{(2)}_{1,MS} X + ⋯ + a^{(2)}_{n−1,MS} X^{n−1}:
$$
\bigl(a_{MS} - a^{(2)}_{MS}\bigr)
\begin{pmatrix}
e & e & \dots & e\\
e & \omega & \dots & \omega^{n-1}\\
\vdots & \vdots & \ddots & \vdots\\
e & \omega^{n-1} & \dots & \omega^{(n-1)^2}
\end{pmatrix} = 0.
$$
As the matrix is Vandermonde, its determinant is
∏_{0≤i<j≤n−1} (ω^j − ω^i) ≠ 0,
and a_MS = a^{(2)}_MS. So, a_MS(X) = a_MS(X)^2.
Definition 3.2.11 Let v = v_0 v_1 ... v_{N−1} be a vector over F_q, and let ω be a primitive (N, F_q) root of unity over F_q. The Fourier transform of the vector v is the vector V = V_0 V_1 ... V_{N−1} with components given by
V_j = ∑_{i=0}^{N−1} ω^{ij} v_i,  j = 0, ..., N − 1.   (3.2.7)
Lemma 3.2.12 (The inversion formula) The vector v is recovered from its Fourier transform V by the formula
v_i = (1/N) ∑_{j=0}^{N−1} ω^{−ij} V_j.   (3.2.8)

N^{−1} ∑_{0≤j≤N−1} a(ω^j) ω^{−ij} = N^{−1} ∑_{0≤j≤N−1} ∑_{0≤k≤N−1} a_k ω^{jk} ω^{−ij}
= N^{−1} ∑_{0≤k≤N−1} a_k ∑_{0≤j≤N−1} ω^{j(k−i)} = N^{−1} ∑_{0≤k≤N−1} a_k N δ_{ki} = a_i.
Here, for k ≠ i,
∑_{0≤j≤N−1} ω^{j(k−i)} = ∑_{0≤j≤N−1} (ω^{k−i})^j = (e − (ω^{k−i})^N)(e − ω^{k−i})^{−1} = 0.
Hence
a_i = (1/N) ∑_{0≤j≤N−1} a(ω^j) ω^{−ij}.   (3.2.9)
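A numerical check of the inversion formula (3.2.8)/(3.2.9) over a small prime field; F_31 with N = 5 is an arbitrary choice satisfying N | q − 1.

```python
# Fourier transform over F_q = F_31, N = 5, omega a primitive 5th root of unity.
q, N = 31, 5
omega = next(w for w in range(2, q)
             if pow(w, N, q) == 1 and all(pow(w, j, q) != 1 for j in range(1, N)))

def dft(v):
    """V_j = sum_i omega^{ij} v_i (mod q), as in (3.2.7)."""
    return [sum(pow(omega, i * j, q) * vi for i, vi in enumerate(v)) % q
            for j in range(N)]

def idft(V):
    """v_i = N^{-1} sum_j omega^{-ij} V_j (mod q), as in (3.2.8)."""
    Ninv, winv = pow(N, -1, q), pow(omega, -1, q)
    return [Ninv * sum(pow(winv, i * j, q) * Vj for j, Vj in enumerate(V)) % q
            for i in range(N)]

v = [3, 0, 7, 30, 12]
assert idft(dft(v)) == v
print(omega, dft(v))
```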
Worked Example 3.2.13 Give an alternative proof of the BCH bound: Let ω be a
primitive (N, Fq ) root of unity and b ≥ 1 and δ ≥ 2 integers. Let XN = "g(X)# be a
cyclic code where g(X) ∈ Fq [X]/"X N −e# is a monic polynomial of smallest degree
having ω b , ω b+1 , . . . , ω b+δ −2 among its roots. Then XN has minimum distance at
least δ .
done it in their joint paper). For brevity, we take the value b = 1 (but will be able
to extend the definition to values of N > q − 1).
Given N ≤ q, let S = {x_1, ..., x_N} ⊂ F_q be a set of N distinct points in F_q (a supporting set). Let Ev denote the evaluation map
Ev : f ∈ F_q[X] → Ev(f) = (f(x_1), ..., f(x_N)) ∈ F_q^N   (3.2.12)
and take
L = { f ∈ F_q[X] : deg f < k }.   (3.2.13)
Then the q-ary Reed–Solomon code of length N and dimension k can be defined as
X = Ev(L);   (3.2.14)
it has minimum distance d = d(X) = N − k + 1 and corrects up to ⌊(d − 1)/2⌋ errors. The encoding of a source message u = u_0 ... u_{k−1} ∈ F_q^k consists in calculating the values of the polynomial f(X) = u_0 + u_1 X + ⋯ + u_{k−1} X^{k−1} at the points x_i ∈ S.
Definition 3.2.1 (where X was defined as the set of polynomials c(X) = ∑_{0≤l<q−1} c_l X^l ∈ F_q[X] with c(ω) = c(ω^2) = ⋯ = c(ω^{δ−1}) = 0) emerges when N = q − 1, k = N − δ + 1 = q − δ, the supporting set is S = {e, ω, ..., ω^{N−1}} and the coefficients c_0, c_1, ..., c_{N−1} are related to the polynomial f(X) by
c_i = f(ω^i),  0 ≤ i ≤ N − 1.
This determines uniquely the coefficients f_l in the representation f(X) = ∑_{0≤l<N} f_l X^l, via the discrete inverse Fourier transform relation
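A sketch of the evaluation-map construction (3.2.12)–(3.2.14) over a small prime field; the choice q = 7, k = 3 is only for illustration, and the minimum distance N − k + 1 is confirmed by exhaustive comparison.

```python
from itertools import product

# Reed-Solomon code via evaluation: q = 7, N = 6, k = 3, so d = N - k + 1 = 4.
q, k = 7, 3
S = list(range(1, q))            # supporting set: N = q - 1 distinct non-zero points
N = len(S)

def encode(u):
    """Evaluate f(X) = u_0 + u_1 X + ... + u_{k-1} X^{k-1} at the points of S."""
    return [sum(u[i] * pow(x, i, q) for i in range(k)) % q for x in S]

codewords = [encode(u) for u in product(range(q), repeat=k)]

def dist(a, b):
    return sum(x != y for x, y in zip(a, b))

dmin = min(dist(codewords[i], codewords[j])
           for i in range(len(codewords)) for j in range(i + 1, len(codewords)))
print(len(codewords), dmin)      # 343 codewords, minimum distance N - k + 1 = 4
```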
This idea goes back to Shannon’s bounded distance decoding: upon receiving
a word y, you inspect the Hamming balls around y until you encounter a closest
codeword (or a collection of closest codewords) to y. Of course, we want two
things: that (i) when we take s ‘moderately’ larger than t, the chance of finding
two or more codewords within distance s is small, and (ii) the algorithm has a
reasonable computational complexity.
Example 3.2.14 The [32, 8] RS code over F32 has d = 25 and t = 12. If we take
s = 13, the Hamming ball about the received word y may contain two codewords.
However, assuming that all error vectors e of weight 13 are equally likely, the
probability of this event is 2.08437 × 10−12 .
The Guruswami–Sudan list decoding algorithm (see [59]) performs the task of finding the codewords within distance s for t ≤ s ≤ t_GS in polynomial time. Here
t_GS = n − 1 − ⌊√((k − 1)n)⌋,
the cyclic codes become ideals in the polynomial ring F2 [X]/"(X N − 1)#. They are
in a one-to-one correspondence with the ideals in F2 [X] containing polynomial
X N − 1. Because F2 [X] is a Euclidean domain, all ideals in F2 [X] are principal, i.e.
of the form { f (X)g(X) : f (X) ∈ F2 [X]}. In fact, all ideals in F2 [X]/"(X N − 1)# are
also principal ideals.
Definition 3.3.4 The polynomial g(X) is called the minimal degree generator
(or simply the generator) of the cyclic code X . The ratio h(X) = (X N − e)/g(X),
of degree N − deg g(X), is called the check polynomial for the cyclic code X =
"g(X)#.
Worked Example 3.3.7 Show that Hamming’s [7, 4] code is a cyclic code with
check polynomial X 4 + X 2 + X + 1. What is its generator polynomial? Does Ham-
ming’s original code contain a subcode equivalent to its dual?
Theorem 3.3.9 If X_1 = ⟨g_1(X)⟩ and X_2 = ⟨g_2(X)⟩ are cyclic codes with generators g_1(X) and g_2(X) then
(c) the dual code X^⊥ is a cyclic code of dim X^⊥ = r, and X^⊥ = ⟨g^⊥(X)⟩, where
g^⊥(X) = h_0^{−1} X^{N−r} h(X^{−1}) = h_0^{−1}(h_0 X^{N−r} + h_1 X^{N−r−1} + ⋯ + h_{N−r}).
Definition 3.3.12 The roots of generator g(X) are called the zeros of the cyclic
code "g(X)#. Other roots of unity are often called non-zeros of the code.
can be considered as a parity-check matrix for the code with zeros ω1 , . . . , ωu (with
the proviso that its rows may not be linearly independent).
consist of all non-zero binary vectors of length l. Hence, the Hamming [2l − 1, 2l −
l − 1, 3] code is (equivalent to) the cyclic code "Mω (X)# whose zeros consist of a
primitive (2l − 1; F2 ) root of unity ω and (necessarily) all the other roots of the
minimal polynomial for ω .
Theorem 3.3.14 If gcd(l, q − 1) = 1 then the q-ary Hamming [(q^l − 1)/(q − 1), (q^l − 1)/(q − 1) − l, 3] code is equivalent to a cyclic code.
Proof Write Spl(X^N − e) = F_{q^l} where l = ord_N(q) and N = (q^l − 1)/(q − 1). To justify the selection of l observe that (q^l − 1)/N = q − 1, and l is the least positive integer with this property, as N = (q^l − 1)/(q − 1) > q^{l−1} − 1.
Therefore, Spl(X^N − e) = F_{q^l}. Take a primitive β ∈ F_{q^l}. Then ω = β^{(q^l−1)/N} = β^{q−1} is a primitive (N, F_q) root of unity. As before, take the minimal polynomial M_ω(X) = (X − ω)(X − ω^q) ⋯ (X − ω^{q^{l−1}}) and consider the cyclic code ⟨M_ω(X)⟩ with the zero ω (and necessarily ω^q, ..., ω^{q^{l−1}}). Consider again the l × N matrix (3.3.6). We want to check that any two distinct columns of H are linearly independent. If not, there exist i < j such that ω^i and ω^j are scalar multiples of each other, i.e. ω^{j−i} ∈ F_q. But then (ω^{j−i})^{q−1} = ω^{(j−i)(q−1)} = e in F_q; as ω is a primitive Nth root of unity, this holds iff (j − i)(q − 1) ≡ 0 mod N. Write
N = (q^l − 1)/(q − 1) = 1 + q + ⋯ + q^{l−1}.
As (q − 1) | (q^r − 1) for all r ≥ 1, we have q^r = (q − 1)v_r + 1 for some natural v_r. Summing over 0 ≤ r ≤ l − 1 yields
N = (q − 1) ∑_r v_r + l.   (3.3.7)

As the volume of the ball v_{N,q}(E) ≥ q^l, this implies that in fact k = N − l, E = 1 and d = 3. So, this code is equivalent to a Hamming code.
Next, we look in more detail at BCH codes correcting several errors. Recall that if ω_1, ..., ω_u ∈ E_{(N,q)} are (N, F_q) roots of unity then
X_N = { f(X) ∈ F_q[X]/⟨X^N − e⟩ : f(ω_1) = ⋯ = f(ω_u) = 0 }
is a cyclic code ⟨g(X)⟩ where
g(X) = lcm( M_{ω_1,F_q}(X), ..., M_{ω_u,F_q}(X) )   (3.3.8)

⟨M_ω(X)⟩ and is equivalent to the Hamming code. We could try other possibilities for zeros of X to see if it leads to interesting examples. This is the way to discover the BCH codes [25], [70].
Recall the factorisation into minimal polynomials M_i(X) (= M_{ω^i,F_q}(X)),
X^N − 1 = lcm( M_i(X) : i = 0, ..., t ),   (3.3.9)
where ω is a primitive (N, F_q) root of unity. The roots of M_i(X) are conjugate, i.e. have the form ω^i, ω^{iq}, ..., ω^{iq^{d−1}}, where d (= d(i)) is the least integer ≥ 1 such that iq^d ≡ i mod N. The set C_i = {i, iq, ..., iq^{d−1}} is the ith cyclotomic coset of q mod N. So,
M_i(X) = ∏_{j∈C_i} (X − ω^j).   (3.3.10)
sense BCH codes with odd designed distance δ = 2E + 1, and obtain an improvement of Theorem 3.3.16:
Theorem 3.3.17 The rank of a binary BCH code X^{BCH}_{2,N,2E+1} is ≥ N − E ord_N(2).
The problem of determining exactly the minimum distance of a BCH code has
been solved only partially (although a number of results exist in the literature). We
present the following theorem without proof.
Theorem 3.3.18 The minimum distance of a binary primitive narrow sense BCH
code is an odd number.
The previous results can be sharpened in a number of particular cases.
Worked Example 3.3.19 Prove that log_2(N + 1) > 1 + log_2((E + 1)!) implies
(N + 1)^E < ∑_{0≤i≤E+1} \binom{N}{i}.   (3.3.13)
Suppose the distance is ≥ 2E + 3. Observe that the rank of X^{BCH}_{2,2^s−1,2E+1} is ≥ N − sE,
2^{5E} < ∑_{0≤i≤E+1} \binom{31}{i}
X^N − 1 = X^{δm} − 1 = (X^m − 1)(1 + X^m + ⋯ + X^{(δ−1)m}).
Two more results on the minimal distance of a BCH code are presented in The-
orems 3.3.23 and 3.3.25. The full proofs are beyond the scope of this book and
omitted.
Theorem 3.3.24 The minimal distance of a primitive q-ary narrow sense BCH code X^{BCH} = X^{BCH}_{q,q^s−1,δ,ω,1} of designed distance δ is at most qδ − 1.

the same length N = q^s − 1 and designed distance δ′. The roots of the generator of X are among those of X′, so X′ ⊆ X. But according to Theorem 3.3.22, d(X′) = δ′, which is ≤ δq − 1.
The following result shows that BCH codes are not ‘asymptotically good’. How-
ever, for small N (a few thousand or less), the BCH are among the best codes
known.
Then find i, j such that y_1 = ω^i, y_2 = ω^j (y_1, y_2 are called error locators). If such i, j (or equivalently, error locators y_1, y_2) are found, we know that errors occurred at positions i and j.
It is convenient to introduce an error-locator polynomial σ(X) whose roots are y_1^{−1}, y_2^{−1}:
If N is not large, the roots of σ(X) can be found by trying all 2^s − 1 non-zero elements of F^*_{2^s}. (The standard formula for the roots of a quadratic polynomial does not apply over F_2.) Thus, the following assertion arises:
for the received word r(X) = c(X) + e(X). Suppose that errors occurred at places i_1, ..., i_t. Then
e(X) = ∑_{1≤j≤t} X^{i_j}.
∑_{1≤j≤t} ω^{i_j} = r_1,  ∑_{1≤j≤t} ω^{3 i_j} = r_3,  ...,  ∑_{1≤j≤t} ω^{(δ−2) i_j} = r_{δ−2},

∑_{1≤j≤t} y_j = r_1,  ∑_{1≤j≤t} y_j^3 = r_3,  ...,  ∑_{1≤j≤t} y_j^{δ−2} = r_{δ−2}.

σ(X) = ∏_{1≤j≤t} (1 − y_j X)

= ( r_1, r_3, r_5, ..., r_{2t−3}, r_{2t−1} )^T
Example 3.3.27 Consider the two-error-correcting BCH code X^{BCH}_{2,15,ω,5}, where ω is a primitive element of F_16. We find
r_1 = a(ω) = ω^12 + ω^8 + ω^7 + ω^6 + 1 = ω^6,
r_3 = a(ω^3) = ω^36 + ω^24 + ω^21 + ω^18 + 1 = ω^9 + ω^3 + 1 = ω^4.
σ(X) = 1 + ω^6 X + (ω^13 + ω^12) X^2.
The roots of σ(X) are ω^3 and ω^11 by direct check. Hence we discover the errors at the 4th and 12th positions.
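A runnable sketch of this two-error computation. It assumes the field representation of the left column of table (3.1.8), i.e. F_16 = F_2[X]/⟨1 + X + X^4⟩, and that the received word is a(X) = 1 + X^6 + X^7 + X^8 + X^12, as inferred from the displayed syndrome evaluation.

```python
# Two-error BCH decoding of Example 3.3.27 over F_16 = F_2[X]/<1 + X + X^4>,
# field elements stored as bit masks (bit i = coefficient of X^i).
MOD = 0b10011                      # 1 + X + X^4

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0b10000:
            a ^= MOD
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

omega = 0b00010                    # the class of X, a primitive element
received = {0, 6, 7, 8, 12}        # exponents of the received word a(X)

def evaluate(exponents, point):
    """Evaluate sum_{i in exponents} X^i at X = point."""
    v = 0
    for i in exponents:
        v ^= gf_pow(point, i)
    return v

r1 = evaluate(received, omega)                 # = omega^6
r3 = evaluate(received, gf_pow(omega, 3))      # = omega^4
assert r1 == gf_pow(omega, 6) and r3 == gf_pow(omega, 4)

# sigma(X) = 1 + r1*X + (r1^2 + r3/r1) X^2; here 1/r1 = r1^14.
s2 = gf_mul(r1, r1) ^ gf_mul(r3, gf_pow(r1, 14))
roots = [i for i in range(15)
         if (1 ^ gf_mul(r1, gf_pow(omega, i)) ^ gf_mul(s2, gf_pow(omega, 2 * i))) == 0]
# Error locators are the inverses of the roots: positions (15 - i) mod 15.
print(roots, [(15 - i) % 15 for i in roots])   # roots omega^3, omega^11 -> errors at 12 and 4
```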
Definition 3.4.3 The discrete Fourier transform (in short, DFT) of a function f on F_q^N is defined by
f̂ = ∑_{v∈F_q^N} f(v) χ(v).   (3.4.9)

(if one sets x = z, y = 1, (3.4.10) coincides with (3.4.1)). So, we want to apply the DFT to the function (no harm to say that x, y ∈ S)
g : F_q^N → C[x, y] : v → x^{w(v)} y^{N−w(v)}.   (3.4.11)

Lemma 3.4.4 (The abstract MacWilliams identity) For v ∈ F_q^N let

If u_i = 0 then

X → F_q : x → ⟨v, x⟩.
Since v ∈ F_q^N \ X^⊥, this linear form is surjective, whence its kernel has dimension k − 1, i.e. for any g ∈ F_q there exist q^{k−1} vectors x ∈ X such that ⟨v, x⟩ = g. This implies

= q^k ∑_{y∈X^⊥} f(y)
Example 3.4.7 (i) For all codes X, W_X(0) = A_0 = 1 and W_X(1) = ♯X. When X = F_q^{×N}, W_X(z) = [1 + z(q − 1)]^N.
(iii) Let X be the Hamming [7, 4] code. The dual code X^⊥ has 8 codewords; all except 0 are of weight 4. Hence, W_{X^⊥}(x, y) = x^7 + 7x^3y^4, and, by the MacWilliams identity,
W_X = (1/2^3) W_{X^⊥}(x + y, x − y) = (1/2^3) [ (x + y)^7 + 7(x + y)^3(x − y)^4 ]
= x^7 + 7x^4y^3 + 7x^3y^4 + y^7.
Hence, X has 7 words of weight 3 and 4 each. Together with the 0 and 1 words, this accounts for all 16 words of the Hamming [7, 4] code.
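A brute-force numerical check of this computation, using the weight-enumerator convention ∑_c x^{N−w(c)} y^{w(c)} and the parity-check matrix (2.6.4); the identity is tested at a few sample points (x, y).

```python
from itertools import product

H = [[1, 0, 1, 1, 1, 0, 0],
     [0, 1, 0, 1, 1, 1, 0],
     [0, 0, 1, 0, 1, 1, 1]]

code = [x for x in product([0, 1], repeat=7)
        if all(sum(h * xi for h, xi in zip(row, x)) % 2 == 0 for row in H)]
dual = [tuple(sum(c * row[j] for c, row in zip(cs, H)) % 2 for j in range(7))
        for cs in product([0, 1], repeat=3)]        # spanned by the rows of H

def W(codewords, x, y):
    """Homogeneous weight enumerator sum_c x^(N - w(c)) y^w(c)."""
    return sum(x**(7 - sum(c)) * y**sum(c) for c in codewords)

# MacWilliams identity: W_X(x, y) = (1/|X_dual|) * W_dual(x + y, x - y).
for (x, y) in [(1, 1), (2, 1), (3, -1), (5, 2)]:
    assert W(code, x, y) == W(dual, x + y, x - y) // len(dual)
print(len(code), len(dual))   # 16 codewords, 8 dual codewords
```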
Another way to derive the identity (3.4.1) is to use an abstract result related
to group algebras and character transforms for Hamming spaces F×N q (which are
linear spaces over field Fq of dimension N). For brevity, the subscript q and super-
script (N) will be often omitted.
Definition 3.4.8 The (complex) group algebra CF×N for space F×N is defined as
the linear space of complex functions G : x ∈ F×N → G(x) ∈ C equipped by a com-
plex involution (conjugation) and multiplication. Thus, we have four operations for
functions G(x); addition and scalar (complex) multiplication are standard (point-
wise), with (G + G )(x) = G(x) + G (x) and (aG)(x) = aG(x), G, G ∈ CF×N ,
a ∈ C, x ∈ F×N . The involution is just the (point-wise) complex conjugation:
G∗ (x) = G(x)∗ ; it is an idempotent operation, with G∗ ∗ = G. However, the mul-
tiplication (denoted by ) is a convolution:
This makes CF×N a commutative ring and at the same time a (complex) linear
space, of dimension dim CF×N = qN , with involution. (A set that is a commutative
318 Further Topics from Coding Theory
ring and a linear space is called an algebra.) The natural basis in CF×N is formed
by Dirac’s (or Kronecker’s) delta-functions δ y , with δ y (x) = 1(x = y), x, y ∈ H .
gX (t) = ∑ t x; (3.4.18)
x∈X
Xx (g) = ∑ γy χ (x · y) (3.4.19b)
y∈Hn
Here
Ak = ∑ γx . (3.4.21)
x∈H :w(x)=k
For a linear code X , with generating function gX (t) (see 3.4.18)), Ak gives the
number of codewords of weight k:
where
Âk = ∑ Xx (g). (3.4.24)
x∈H : w(x)=k
and expand:
n
(1 − s)k (1 + (q − 1)s)n−k = ∑ Ki (k)si . (3.4.27)
i=0
= ∑ ∑ Ai Kk (i)s , k
0≤k≤n 0≤i≤n
i.e.
Âk = ∑ Ai Kk (i). (3.4.29)
0≤i≤n
Hence,
WĝX (s) = #X WgX ⊥ (s), (3.4.32)
and we obtain the MacWilliams identity for linear codes:
Theorem 3.4.12 Let X ⊂ Hn be a linear code, X ⊥ its dual, and
n n
WX (s) = ∑ Ak sk , WX ⊥ (s) = ∑ A⊥k sk (3.4.33)
k=0 k=0
and hence
1
WhX (s) = ∑ ∑
M 0≤k≤N x,y∈X :
1 w(x − y) = k sk = ∑ Bk sk
0≤k≤N
= BX (s).
Now by the MacWilliams identity, for a given non-trivial character χ and the
corresponding transform ζ → ζ, we obtain
Theorem 3.4.14 For hX (s) as above, if
hX (s) is the character transform and
Wh (s) its w-enumerator, with
X
Wh (s) = ∑ Bk s = ∑
k
∑ χx (hX ) sk ,
X
0≤k≤N 0≤k≤N w(x)=k
3.4 The MacWilliams identity and the linear programming bound 323
then
Bk = ∑ Bi Kk (i),
0≤i≤N
Thus:
Theorem 3.4.16 For all [N, M] codes X and k = 0, . . . , N ,
∑ Bi Kk (i) ≥ 0. (3.4.41)
0≤i≤N
∑ Bi = M 2
0≤i≤N
or
1
∑ Ei = M, with Ei =
M
Bi (3.4.42)
0≤i≤N
Hence,
N N
M ∑ Bi Kk (i) = ∑ ∑ ∑ ω "x−y,z#
i=0 i=0 x,y∈X :δ (x,y)=i z∈FN
q :w(z)=k
= ∑ | ∑ ω "x,z# |2 ≥ 0.
z∈FN
q :w(z)=k x∈X
This leads us to the so-called linear programming (LP) bound stated in Theorem
3.4.17 below.
Theorem 3.4.17 (The LP bound) The following inequality holds:
Mq (N, d) ≤ max ∑ Ei : Ei ≥ 0, E0 = 1, Ei = 0 for 1 ≤ i < d
∗
0≤i≤N
and ∑ Ei Kk (i) ≥ 0 for 0 ≤ k ≤ N . (3.4.44)
0≤i≤N
Hence, for d even, as we can assume that E2i+1 = 0, the constraint in (3.4.44)
need only be considered for k = 0, . . . , [N/2].
(c) K0 (i) = 1 for all i, and thus the bound ∑ Ei K0 (i) ≥ 0 follows from Ei ≥ 0.
0≤i≤N
M2∗ (N, d) ≤ max ∑ Ei : Ei ≥ 0, E0 = 1, Ei = 0 for 1 < i < d,
0≤i≤N
N
Ei = 0 for i odd, and + ∑ Ei Kk (i) ≥ 0 (3.4.45)
k d≤i≤N
A B
N
for k = 1, . . . , .
2
N
f (x) = 1 + ∑ f j K j (x)
j=1
Solution Let M = Mq∗ (N, d) and X be a q-ary [N, M] code with the distance
distribution Bi (X ), i = 0, . . . , N. The condition f (i) ≤ 0 for d ≤ i ≤ N implies
326 Further Topics from Coding Theory
N
∑ B j (X ) f ( j) ≤ 0. Using the LP bound (3.4.45) for k = 0 obtain Ki (0) ≥
j=d
N
− ∑ B j (X )Ki ( j). Hence,
j=d
N
f (0) = 1 + ∑ f j K j (0)
j=1
N N
≥ 1 − ∑ fk ∑ Bi (X )Kk (i)
k=1 i=d
N N
= 1 − ∑ Bi (X ) ∑ fk Kk (i)
i=d k=1
N
= 1 − ∑ Bi (X )( f (i) − 1)
i=d
N
≥ 1 + ∑ Bi (X )
i=d
= M = Mq∗ (N, d).
To obtain the Singleton bound select
N
x
f (x) = q N−d+1
∏ 1−
j
.
j=d
Worked Example 3.4.21 Using the linear programming bound, prove that
M2∗ (13, 5) = M2∗ (14, 6) ≤ 64. Compare it with the Elias bound. [Hint: E6 = 42,
E8 = 7, E10 = 14, E12 = E14 = 0. You may need a computer to get the solution.]
E0 = 1, E1 = E2 = E3 = E4 = E5 = E7 = E9 = E11 = E13 = 0,
E6 , E8 , E10 , E12 , E14 ≥ 0,
14 + 2E6 − 2E8 − 6E10 − 10E12 − 14E14 ≥ 0,
91 − 5E6 − 5E8 + 11E10 + 43E12 + 91E14 ≥ 0,
364 − 12E6 + 12E8 + 4E10 − 100E12 − 364E14 ≥ 0,
1001 + 9E6 + 9E8 − 39E10 + 121E12 + 1001E14 ≥ 0,
2002 + 30E6 − 30E8 + 38E10 − 22E12 − 2002E14 ≥ 0,
3003 − 5E6 − 5E8 + 27E10 − 165E12 + 3003E14 ≥ 0,
3432 − 40E6 + 40E8 − 72E10 + 264E12 − 3432E14 ≥ 0,
Note that the bound is sharp as a [13, 64, 5] binary code actually exists. Compare
the LP bound with the Hamming bound:
M2∗ (13, 5) ≤ 213 (1 + 13 + 13 · 6) = 213 92 = 211 /23,
i.e.
M2∗ (13, 5) ≤ 91.
Xα = {(a, α a) : a ∈ Fm
2 }. (3.5.1)
Then Xα is a [2m, m] linear code and has information rate 1/2. We can recover α
from any non-zero codeword (a, b) ∈ Xα , as α = ba−1 (division in F2m ). Hence,
if α = α then Xα ∩ Xα = {0}.
Now, given λ = λm ∈ (0, 1/2], we want to find α = αm such that code Xα has
minimum weight ≥ 2mλ . Since a non-zero binary (2m)-word can enter at most
one of the Xα ’s, we can find such α if the number of the non-zero (2m)-words
3.5 Asymptotically good codes 329
of weight < 2mλ is < 2m − 1, the number of distinct codes Xα . That is, we can
manage if
2m
∑ i
< 2m − 1
1≤i≤2mλ −1
2m
or even better, ∑ < 2m − 1. Now use the following:
1≤i≤2mλ i
Lemma 3.5.2 For 0 ≤ λ ≤ 1/2,
N
∑ k
≤ 2N η (λ ) , (3.5.2)
0≤k≤λ N
We fix an [N1 , k1 , d1 ] code X1 over F2k2 called an outer code: X1 ⊂ FN2k12 . Then
string a is encoded into a codeword c = c0 c1 . . . cN1 −1 ∈ X1 . Next, each ci ∈ F2k2
is encoded by a codeword bi from an [N2 , k2 , d2 ] code X2 over F2 , called an inner
code. The result is a string b = b(0) . . . b(N1 −1) ∈ FNq 1 N2 of length N1 N2 :
N2 N2
←→ ←→
b = ... b(i) ∈ F2N2 , 0 ≤ i ≤ N1 − 1.
b(0) b(N1 −1)
The encoding is represented by the diagram:
input: a (k1 k2 ) string a, output: an (N1 N2 ) codeword b.
Observe that different symbols ci can be encoded by means of different inner
codes. Let the outer code X1 be a [2m − 1, k, d] RS code X RS over F2m . Write a
binary (k2m )-word a as a concatenation a(0) . . . a(k−1) , with a(i) ∈ F2m . Encoding a
using X RS gives a codeword c = c0 . . . cN−1 , with N = 2m − 1 and ci ∈ F2m . Let β
be a primitive element in F2m . Then for all j = 0, . . . , N − 1 = 2m − 2, consider the
inner code
* +
X ( j) = (c, β j c) : c ∈ F2m . (3.5.5)
3.5 Asymptotically good codes 331
X ( j) (see (3.5.6)) as the inner codes, where 0 ≤ j ≤ 2m − 2. Code Xm,k Ju has length
k
2m(2m − 1), rank mk and hence rate < 1/2.
2(2 − 1)
m
1)) and d/(2m(2m − 1)) bounded away from 0. Fix R0 ∈ (0, 1/2) and choose a
sequence of outer RS codes XNRS of length N, with N = 2m − 1 and k = [2NR0 ].
Ju is k/(2N) ≥ R .
Then the rate of Xm,k 0
Now consider the minimum weight
Ju
Ju
w Xm,k = min w(x) : x ∈ Xm,k
Ju
, x = 0 = d Xm,k . (3.5.7)
For any fixed m, if the outer RS code XNRS , N = 2m − 1, has minimum weight
d then any super-codeword b = (c0 , c0 )(c1 , β c1 ) . . . (cN−1 , β N−1 cN−1 ) ∈ Xm,k
Ju has
So the total weight of the N(1 − 2R0 ) distinct (2m)-strings is bounded below by
2mN(1 − 2R0 ) η −1 (1/2) − o(1) (1 − o(1)) = 2mN(1 − 2R0 ) η −1 (1/2) − o(1) .
Thus the result follows.
Lemma 3.5.4 demonstrates that Xm,k
Ju has
Ju −1 1
w Xm,k ≥ 2mN(1 − 2R0 ) η − o(1) . (3.5.9)
2
Then
w Xm,kJu
≥ (1 − 2R0 )(η −1 (1/2) − o(1)) → (1 − 2R0 )η −1 (1/2)
length Xm,k Ju
In the construction, R0 ∈ (0, 1/2). However, by truncating one can achieve any
given rate R0 ∈ (0, 1); see [110].
Fq . That is, we can think of M as an (mr × n) matrix over Fq (denoted again by M).
Given elements a1 , . . . , an ∈ Fqm , we have
⎛ ⎞
⎛ ⎞ ⎛ ⎞⎛ ⎞ ∑ a j ci j
a1 c11 . . . c1n a1 ⎜ 1≤ j≤n ⎟
⎜ .. ⎟ ⎜ .. . . ⎟ ⎜ .. ⎟ ⎜ .. ⎟
M⎝ . ⎠ = ⎝ . . . . ⎠⎝ . ⎠ = ⎜
. . ⎟.
⎝ ⎠
an cr1 . . . crn an ∑ a j cr j
1≤ j≤n
So, if the columns of M are linearly independent as r-vectors over Fqm , they are
also linearly independent as (rm)-vectors over Fq . That is, the columns of M are
linearly independent over Fq .
Recall that if ω is a primitive (n, Fqm ) root of unity and δ ≥ 2 then the n × (mδ )
Vandermonde matrix over Fq
⎛ → −e →
−e →
−e ⎞
...
⎜ → −ω →
−
ω2 ... →
−
ω δ −1 ⎟
⎜ → − →
− →
− ⎟
H =⎜ ω
T ⎜ 2 ω 4 ... ω 2( δ −1) ⎟
⎟
⎝ ... ⎠
→
− →
− →
−
ω n−1 ω 2(n−1) . . . ω (δ −1)(n−1)
checks a narrow-sense BCH code Xq,n, BCH (a proper parity-check matrix emerges
ω ,δ
after column purging). Generalise it by taking an n × r matrix over Fqm
⎛ ⎞
h1 h1 α1 . . . h1 α1r−2 h1 α1r−1
⎜ h2 h2 α2 . . . h2 α r−2 h2 α r−1 ⎟
⎜ 2 2 ⎟
A=⎜ . .. . .. ⎟, (3.5.11)
⎝ . . . . . . ⎠
hn hn αn . . . hn αnr−2 hn αnr−1
334 Further Topics from Coding Theory
and
(X − α )−1 = −(G(X) − G(α ))(X − α )−1 G(α )−1 mod G(X). (3.5.14b)
Clearly, XαGo
,G is a linear code. The polynomial G(X) is called the Goppa polyno-
mial; if G(X) is irreducible, we say that X Go is irreducible.
Write G(X) = ∑ gi X i where deg G(X) = r, gr = 1 and r < n. Then in Fqm [X]
0≤i≤r
and so
∑ bi (G(X) − G(αi ))(X − αi )−1 G(αi )−1
1≤i≤n
for all u = 0, . . . , r − 1.
Equation (3.5.18) leads to the parity-check matrix for X Go . First, we see that
the matrix
⎛ ⎞
G(α1 )−1 G(α2 )−1 ... G(αn )−1
⎜ α1 G(α1 )−1 α2 G(α2 )−1 . . . αn G(αn )−1 ⎟
⎜ ⎟
⎜ α 2 G(α1 )−1 α22 G(α2 )−1 . . . αn2 G(αn )−1 ⎟
⎜ 1 ⎟, (3.5.19)
⎜ . . . . ⎟
⎝ .
. .
. . . .
. ⎠
α1 G(α1 )
r−1 −1 α2 G(α2 )
r−1 −1 . . . αn G(αn )
r−1 −1
iff G(X) divides Rb (X) which is the same as G(X) divides ∂X fb (X). For q = 2,
∂X fb (X) has only even powers of X (as its monomials are of the form X −1 times
a product of some αi j ’s: this vanishes when is even). In other words, ∂X fb =
h(X 2 ) = (h(X))2 for some polynomial h(X). Hence if g(X) is the polynomial of
lowest degree which is a square and divisible by G(X) then G(X) divides ∂X fb (X)
iff g(X) divides ∂X fb (X). So,
A binary Goppa code XαGo ,G where polynomial G(X) has no multiple roots is
called separable.
It is interesting to discuss a particular decoding procedure applicable for alter-
nant codes and based on the Euclid algorithm; cf. Section 2.5.
The initial setup for decoding an alternant code XαAlt ,h over Fq is as follows. As
− −−→i−1
→
in (3.5.12), we take the n × (mr) matrix A = h j α j over Fq obtained from the
n × r matrix A = h j α i−1 j over Fqm by replacing the entries with rows of length m.
→
−
Then purge linearly dependent columns from A . Recall that h1 , . . . , hn are non-zero
and α1 , . . . , αn are distinct elements of Fqm . Suppose a word u = c + e is received,
where c is the right codeword and e an error vector. We assume that r is even and
that t ≤ r/2 errors have occurred, at digits 1 ≤ i1 < · · · < it ≤ n. Let the i j th entry
of e be ei j = 0. It is convenient to identify the error locators with elements αi j : as
αi = αi for i = i (the αi are distinct), we will know the erroneous positions if we
determine αi1 , . . . , αit . Moreover, if we introduce the error locator polynomial
t
(X) = ∏ (1 − αi j X) = ∑ i X i , (3.5.22)
j=1 0≤i≤t
Proof Straightforward.
The crucial fact is that (X), ε (X) and s(X) are related by
Lemma 3.5.13 The following formula holds true:
ε (X) = (X)s(X) mod X r . (3.5.25)
Proof Write the following sequence:
ε (X) − (X)s(X) = ∑ hik eik ∏ (1 − αi j X) − (X) ∑ sl X l
1≤k≤t 1≤ j≤t: j=k 0≤l≤r−1
Lemma 3.5.13 shows the way of decoding alternant codes. We know that there
exists a polynomial q(X) such that
ε (X) = q(X)X r + (X)s(X). (3.5.26)
We also have deg ε (X) ≤ t − 1 < r/2, deg (X) = t ≤ r/2 and that ε (X) and (X)
are co-prime as they have no common roots in any extension. Suppose we apply the
3.5 Asymptotically good codes 339
Euclid algorithm to the known polynomials f (X) = X r and g(X) = s(X) with the
aim to find ε (X) and (X). By Lemma 2.5.44, a typical step produces a remainder
rk (X) = ak (X)X r + bk (X)s(X). (3.5.27)
If we want rk (X) and bk (X) to give ε (X) and (X), their degrees must match:
at least we must have deg rk (X) < r/2 and deg bk (X) ≤ r/2. So, the algorithm is
repeated until deg rk−1 (X) ≥ r/2 and deg rk (X) < r/2. Then, according to Lemma
2.5.44, statement (3), deg bk (X) = deg X r − deg rk−1 (X) ≤ r − r/2 = r/2. This is
possible as the algorithm can be iterated until rk (X) = gcd(X r , s(X)). But then
rk (X)|ε (X) and hence deg rk (X) ≤ deg ε (X) < r/2. So we can assume deg rk (X) ≤
r/2, deg bk (X) ≤ r/2.
The relevant equations are
ε (X) = q(X)X r + (X)s(X),
deg ε (X) < r/2, deg (X) ≤ r/2,
gcd (ε (X), (X)) = 1,
and also
rk (X) = ak (X)X r + bk (X)s(X), deg rk (X) < r/2, deg bk (X) ≤ r/2.
We want to show that polynomials rk (X) and bk (X) are scalar multiples of ε (X)
and (X). Exclude s(X) to get
bk (X)ε (X) − rk (X)(X) = (bk (X)q(X) − ak (X)(X))X r .
As
deg bk (X)ε (X) = deg bk (X) + deg ε (X) < r/2 + r/2 = r
and
deg rk (X)(X) = deg rk (X) + deg (X) < r/2 + r/2 = r,
deg(b(X)ε (X) − rk (X)(X)) < r. Hence, bk (X)ε (X) − rk (X)(X) must be 0, i.e.
(X)rk (X) = ε (X)bk (X), bk (X)q(X) = ak (X)(X).
So, (X)|ε (X)bk (X) and bk (X)|ak (X)(X). But (X) and ε (X) are co-primes as
well as ak (X) and bk (X) (by statement (5) of Lemma 2.5.44). Therefore, (X) =
λ bk (X) and hence ε (X) = λ rk (X). As l(0) = 1, λ = bk (0)−1 .
To summarise:
Theorem 3.5.14 (The decoding algorithm for alternant codes) Suppose XαAlt ,h is
an alternant code, with even r, and that t ≤ r/2 errors occurred in a received word
u. Then, upon receiving word u:
340 Further Topics from Coding Theory
Then (X) is the error locator polynomial whose roots are the inverses of
αi1 , . . . , yt = αit , and i1 , . . . , it are the error digits. The values ei j are given by
ε (αi−1
j
)
ei j = . (3.5.28)
hi j ∏l= j (1 − αil αi−1
j
)
i 0 1 2 3 4 5 6 7 8 9
ω i 1 2 4 8 5 10 9 7 3 6
3.6 Additional problems for Chapter 3 341
i 0 1 2 3 4 5 6 7 8
ω i 0001 0010 0100 1000 0011 0110 1100 1011 0101
i 9 10 11 12 13 14
ω 1010 0111 1110 1111 1101 1001
i
(i) X = X ev or
(ii) X ev is an [N, k − 1] linear subcode of X .
Prove that if the generating matrix G of X has no zero column then the total weight
∑ w(x) equals N2k−1 .
x∈X
[Hint: Consider the contribution from each column of G.]
Denote by XH, the binary Hamming code of length N = 2 − 1 and by XH, ⊥
the dual simplex code, = 3, 4, . . .. Is it always true that the N -vector 1 . . . 1 (with
all digits one) is a codeword in XH, ? Let As and A⊥ s denote the number of words
of weight s in XH, and XH, , respectively, with A0 = A⊥
⊥
0 = 1 and A1 = A2 = 0.
Check that
A3 = N(N − 1) 3!, A4 = N(N − 1)(N − 3) 4!,
and
A5 = N(N − 1)(N − 3)(N − 7) 5!.
using the last fact and the MacWilliams identity for binary codes, give a formula
for As in terms of Ks (2−1 ), the value of the Kravchuk polynomial:
s∧2−1 −1
2 2 − 1 − 2−1
Ks (2−1 ) = ∑−1 j s − j
(−1) j .
j=0∨s+2 −2 +1
this note that the generating matrix of XH, ⊥ is H. So, write x as a sum of rows of
H, and let W be the set of rows of H contributing into this sum, with W = w ≤ .
Then w(x) equals the number of j among 1, 2, . . . , 2 − 1 such that in the binary
decomposition j = 20 j0 + 21 j1 + · · · + 2−1 jl−1 the sum ∑t∈W jt mod 2 equals one.
As before, this is equal to 2w−1 (the number of subsets of W of odd cardinality).
So, w(x) = 2−w+w−1 = 2−1 . Note that the rank of XH,l ⊥ is 2 − 1 − (2 − 1 − l) = l
where
s∧i i
N − i
Ks (i) = ∑ j s − j
(−1) j , (3.6.2)
j=0∨s+i−N
1 − 1)K (2−1 )
As = 1 + (2 s
2
1 s∧2−1 2−1
2 − 1 − 2−1
= 1 + (2 − 1)
∑ j
(−1) .
2 j=0∨s+2−1 −2 +1 j 2 − 1 − j
(X 5 + X 2 + 1)(X 5 + X 4 + X 3 + X 2 + 1)
= X 10 + X 9 + X 8 + X 6 + X 5 + X 3 + 1.
c(X) = X 12 + X 11 + X 9 + X 7 + X 6 + X 3 + X 2 + 1
i 8 9 10 11 12 13 14 15
ω 01101 11010 10001 00111 01110 11100 11101 11111
i
i 16 17 18 19 20 21 22 23
ω i 11011 10011 00011 00110 01100 11000 10101 01111
i 24 25 26 27 28 29 30
ω i 11110 11001 10111 01011 10110 01001 10010
The list of irreducible polynomials of degree 5 over F2 :
X 5 + X 2 + 1, X 5 + X 3 + 1, X 5 + X 3 + X 2 + X + 1,
X 5 + X 4 + X 3 + X + 1, X 5 + X 4 + X 3 + X 2 + 1;
So, ω and ω 3 suffice as zeros, and the generating polynomial g(X) equals
(X 5 + X 2 + 1)(X 5 + X 4 + X 3 + X 2 + 1)
= X 10 + X 9 + X 8 + X 6 + X 5 + X 3 + 1,
as required. In other words:
X = {c(X) ∈ F2 [X]/(X 31 + 1) : c(ω ) = c(ω 3 ) = 0}
0 0 0 0 0
Then X ⊥ is generated by
⎛ ⎞
0 0 1 0 0
⎝ 0 0 0 1 0⎠ .
0 0 0 0 1
None of the vectors from X belongs to X ⊥ , so the claim is false.
Now take a self-dual code X = X ⊥ . If the word 1 = 1 . . . 1 ∈ X then there
exists x ∈ X such that x · 1 = 0. But x · 1 = ∑ xi = w(x) mod 2. On the other hand,
∑ xi = x · x, so x · x = 0. But then x ∈ X ⊥ . Hence 1 ∈ X . But then 1 · 1 = 0 which
implies that N is even.
Now let N = 2k. Divide digits 1, . . . , N into k disjoint pairs (α1 , β1 ), . . . , (αk , βk ),
with αi < βi . Then consider k binary words x(1) , . . . , x(k) of length N and weight 2,
with the non-zero digits in the word x(i) in positions (αi , βi ). Then form the [N, k]
code generated by x(1) , . . . , x(k) .
This code X is self-dual. In fact, x(i) · x(i ) = 0 for all i, i , hence X ⊂ X ⊥ .
Conversely, let y ∈ X ⊥ . Then y · x(i) = 0 for all i. This means that for all i, y
has either both 0 or both non-zero digits at positions (αi , βi ). Then y ∈ X . So,
X = X ⊥.
Now assume X = X ⊥ . Then N is even. But the dimension must be k by the
rank-nullity theorem.
The non-binary linear self-dual code is the ternary Golay [12, 6] with a generat-
ing matrix ⎛ ⎞
1 0 0 0 0 0 0 1 1 1 1 1
⎜0 1 0 0 0 0 1 0 1 2 2 1⎟
⎜ ⎟
⎜ ⎟
⎜0 0 1 0 0 0 1 1 0 1 2 2⎟
G=⎜ ⎟
⎜0 0 0 1 0 0 1 2 1 0 1 2⎟
⎜ ⎟
⎝0 0 0 0 1 0 1 2 2 1 0 1⎠
0 0 0 0 0 1 1 1 2 2 1 0
Here rows of G are orthogonal (including self-orthogonal). Hence, X ⊂ X ⊥ .
But dim(X ) = dim(X ⊥ ) = 6, so X = X ⊥ .
Problem 3.5 Define a finite field Fq with q elements and prove that q must have
the form q = ps where p is a prime integer and s 1 a positive integer. Check that
p is the characteristic of Fq .
348 Further Topics from Coding Theory
Prove that for any p and s as above there exists a finite field Fsp with ps elements,
and this field is unique up to isomorphism.
Prove that the set F∗ps of the non-zero elements of F ps is a cyclic group Z ps −1 .
Write the field table for F9 , identifying the powers ω i of a primitive element
ω ∈ F9 as vectors over F3 . Indicate all vectors α in this table such that α 4 = e.
if s > s, we obtain an element of order > r0 . Hence, s ≥ s which holds for any
prime factor of r(b), and r(b)|r(a).
Then br(a) = e, for all b ∈ F∗q , i.e. the polynomial X r0 − e is divisible by (X − b).
It must then be the product ∏b∈F∗q (X − b). Then r0 = F∗q = q − 1. Then F∗q is a
cyclic group with generator a.
For each prime p and positive integer s there exists at most one field Fq with
q = ps , up to isomorphism. Indeed, if Fq and F q are two such fields then they both
are isomorphic to Spl(X q − X), the splitting field of X q − X (over F p , the basic
field).
The elements α of F9 = F3 × F3 with α 4 = e are e = 01, ω 2 = 1 + 2ω = 21,
ω 4 = 02, ω 6 = 2 + ω = 12 where ω = 10.
Problem 3.6 Give the definition of a cyclic code of length N with alphabet Fq .
What are the defining zeros of a cyclic code
and why are they always
(N, Fq )-roots
3s − 1 3s − 1
of unity? Prove that the ternary Hamming , − s, 3 code is equivalent
2 2
to a cyclic code and identify the defining zeros of this cyclic code.
A sender uses the ternary [13, 10, 3] Hamming code, with field alphabet F3 =
{0, 1, 2} and the parity-check matrix H of the form
⎛ ⎞
1 0 1 2 0 1 2 0 1 2 0 1 2
⎝ 0 1 1 1 0 0 0 1 1 1 2 2 2⎠ .
0 0 0 0 1 1 1 1 1 1 1 1 1
The receiver receives the word x = 2 1 2 0 1 1 0 0 2 1 1 2 0. How should he
decode it?
3.6 Additional problems for Chapter 3 349
ql −1
If β is a primitive element is Fql then ω = β N = β q−1 is a primitive Nth root
of unity in Fql . Write ω 0 = e, ω , ω 2 , . . . , ω N−1 as column vectors in Fq × . . . × Fq
and form an l × N check matrix H. We want to check that any two distinct columns
of H are linearly independent. This is done exactly as in Theorem 3.3.14.
Then the code with parity-check matrix H has distance ≥ 3, rank k ≥ N − l. The
Hamming bound with N = (ql − 1)/(q − 1)
−1 A B
N
d −1
q ≤q
k N
∑ m
(q − 1) m
, with E =
2
, (3.6.3)
0≤m≤E
shows that d = 3 and k = N − l. So, the cyclic code with the parity-check matrix H
is equivalent to Hamming’s.
Problem 3.7 Compute the rank and minimum distance of the cyclic code with
generator polynomial g(X) = X 3 +X +1 and parity-check polynomial h(X) = X 4 +
X 2 + X + 1. Now let ω be a root of g(X) in the field F8 . We receive the word
r(X) = X 5 + X 3 + X(mod X 7 − 1). Verify that r(ω ) = ω 4 , and hence decode r(X)
using minimum-distance decoding.
as required. Let c(X) = r(X)+X 4 mod(X 7 − 1). Then c(ω ) = 0, i.e. c(X) is a code-
word. Since d(X ) = 3 the code is 1-error correcting. We just found a codeword
c(X) at distance 1 from r(X). Then r(X) is written as
c(X) = X + X 3 + X 4 + X 5 mod (X 7 − 1),
and should be decoded by c(X) under minimum-distance decoding.
Problem 3.8 If X is a linear [N, k] code, define its weight enumeration polyno-
mial WX (s,t). Show that:
(a) WX (1, 1) = 2k ,
(b) WX (0, 1) = 1,
(c) WX (1, 0) has value 0 or 1,
(d) WX (s,t) = WX (t, s) if and only if WX (1, 0) = 1.
[Hint: Consider g(u) = ∑ (−1)u·v zw(v) where w(v) denotes the weight of the
v∈FN
2
vector v and average over X .]
Hence or otherwise show that if X corrects at least one error then the words of
X ⊥ have average weight N/2.
Apply (3.6.9) to the enumeration polynomial of Hamming code,
1 N
W (XHam , z) = (1 + z)N + (1 + z)(N−1)/2 (1 − z)(N+1)/2 , (3.6.10)
N +1 N +1
to obtain the enumeration polynomial of the simplex code:
W (Xsimp , z) = 2−k 2N /2l + 2−k (2l − 1)/2l × 2N z2
l−1 l−1
= 1 + (2l − 1)z2 .
3.6 Additional problems for Chapter 3 353
Solution The dual code X ⊥ , of a linear code X with the generating matrix G and
the parity-check matrix H, is defined as a linear code with the generating matrix
H. If X is an [N, k] code, X ⊥ is an [N, N − k] code, and the parity-check matrix
for X ⊥ is G.
Equivalently, X ⊥ is the code which is formed by the linear subspace in FN2
orthogonal to X in the dot-product
"x, y# = ∑ xi yi , x = x1 . . . xN , y = y1 . . . yN .
1≤i≤N
By definition,
W (X , z) = ∑ zw(u) , W X ⊥ , z = ∑ zw(v) .
u∈X v∈X ⊥
Note that when v ∈ X ⊥ , the sum ∑ (−1)"u,v# = X . On the other hand, when
u∈X
v ∈ X ⊥ then there exists u0 ∈ X such that "u0 , v# = 0 (i.e. "u0 , v# = 1). Hence, if
v ∈ X ⊥ , then, with the change of variables u → u + u0 , we obtain
u∈X u∈X
= (−1)"u0 ,v# ∑ (−1)"u,v# = − ∑ (−1)"u,v#,
u∈X u∈X
which yields that in this case ∑ (−1)"u,v# = 0. We conclude that the sum in
u∈X
(3.6.11) equals
1
X ∑ ⊥
zw(v) X = W X ⊥ , z . (3.6.13)
v∈X
= ∏ ∑ zw(a) (−1)aui
1≤i≤N a=0,1
= ∏ 1 + z(−1)ui . (3.6.14)
1≤i≤N
354 Further Topics from Coding Theory
Here w(a) = 0 for a = 0 and w(a) = 1 for a = 1. The RHS of (3.6.14) equals
(1 − z)w(u) (1 + z)N−w(u) .
( X ) × ( X ⊥ ) = 2k × 2N−k = 2N .
The equality
N
the average weight in X ⊥ =
2
follows. The enumeration polynomial of the simplex code is obtained by substitu-
tion. In this case the average length is (2l − 1)/2.
Problem 3.11 Describe the binary narrow-sense BCH code X of length 15 and
the designed distance 5 and find the generator polynomial. Decode the message
100000111000100.
3.6 Additional problems for Chapter 3 355
Solution Take the binary narrow-sense BCH code X of length 15 and the designed
distance 5. We have Spl(X 15 − 1) = F24 = F16 . We know that X 4 + X + 1 is a
primitive polynomial over F16 . Let ω be a root of X 4 + X + 1. Then
M1 (X) = X 4 + X + 1, M3 (X) = X 4 + X 3 + X 2 + X + 1,
and the generator g(X) for X is
g(X) = M1 (X)M3 (X) = X 8 + X 7 + X 6 + X 4 + 1.
Take g(X) as example of a codeword. Introduce 2 errors – at positions 4 and 12
– by taking
u(X) = X 12 + X 8 + X 7 + X 6 + 1.
Using the field table for F16 , obtain
u1 = u(ω ) = ω 12 + ω 8 + ω 7 + ω 6 + 1 = ω 6
and
u3 = u(ω 3 ) = ω 36 + ω 24 + ω 18 + 1 = ω 9 + ω 3 + 1 = ω 4 .
As u1 = 0 and u31 = ω 18 = ω 3 = u3 , deduce that ≥ 2 errors occurred. Calculate the
locator polynomial
l(X) = 1 + ω 6 X + (ω 13 + ω 12 )X 2 .
Substituting 1, ω , . . . , ω 14 into l(X), check that ω 3 and ω 11 are roots. This confirms
that, if exactly 2 errors occurred their positions are 4 and 12 then the codeword sent
was 100010111000000.
Problem 3.12 For a word x = x1 . . . xN ∈ FN2 the weight w(x) is the number
of non-zero digits: w(x) = {i : xi = 0}. For a linear [N, k] code X let Ai be the
number of words in X of weight i (0 ≤ i ≤ N). Define the weight enumerator
N
polynomial W (X , z) = ∑ Ai zi . Show that if we use X on a binary symmetric
i=0
channel with error-probability
p, the
probability of failing to detect an incorrect
p
word is (1 − p) W X , 1−p − 1 .
N
Solution Suppose we have sent the zero codeword 0. Then the error-probability
E = ∑ P x |0 sent = ∑ Ai pi (1 − p)N−i =
x∈X \0 i≥1
i
p p
(1 − p) N
∑ Ai − 1 = (1 − p) N W X, −1 .
i≥0 1− p 1− p
356 Further Topics from Coding Theory
Problem 3.13 Let X be a binary linear [N, k, d] code, with the weight enumer-
ator WX (s). Find expressions, in terms of WX (s), for the weight enumerators of:
Solution (i) All words with even weights from X belong to subcode X ev . Hence
ev 1
WX (s) = [WX (s) +WX (−s)] .
2
(ii) Clearly, all non-zero coefficients of weight enumeration polynomial for X +
corresponds to even powers of z, and A2i (X + ) = A2i (X )+A2i−1 (X ), i = 1, 2, . . ..
Hence,
pc 1
WX (s) = [(1 + s)WX (s) + (1 − s)WX (−s)] .
2
If X is binary [N, k, d] then you first truncate X to X − then take the parity-
check extension (X − ) . This preserves k and d (if d is even) and makes all code-
+
Problem 3.16 Prove that the binary code of length 23 generated by the poly-
nomial g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 has minimal distance 7, and is
perfect.
[Hint: If grev (X) = X 11 g(1/X) is the reversal of g(X) then
X 23 + 1 ≡ (X + 1)g(X)grev (X) mod 2.]
Solution First, show that the code is BCH, of designed distance 5. By the fresher’s
dream Lemma 3.1.5, if ω is a root of a polynomial f (X) ∈ F2 [X] then so is ω 2 .
Thus, if ω is a root of g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 then so are ω ,
ω 2 , ω 4 , ω 8 , ω 16 , ω 9 , ω 18 , ω 13 , ω 3 , ω 6 , ω 12 . This yields the design sequence
358 Further Topics from Coding Theory
Problem 3.17 Use the MacWilliams identity to prove that the weight distribution
of a q-ary MDS code of distance d is
N j i
i 0≤ ∑
Ai = (−1) qi−d+1− j
− 1
j≤i−d j
N j i−1
= (q − 1) ∑ (−1) qi−d− j , d ≤ i ≤ N.
i 0≤ j≤i−d j
Use the fact that d(X ) = N −k +1 and d(X ⊥ ) = k +1 and obtain simplified equa-
tions involving AN−k+1 , . . . , AN−r only. Subsequently, determine AN−k+1 , . . . , AN−r .
Varying r, continue up to AN .]
(the Leibniz rule (3.6.18) is used here). Formula (3.6.19) is the starting point. For
an MDS code, A0 = A⊥ 0 = 1, and
Ai = 0, 1 ≤ i ≤ N − k (= d − 1), A⊥ ⊥
i = 0, 1 ≤ i ≤ k (= d − 1).
Then
N 1 1 N−r N −i 1 N 1 N
+ ∑
r qk qk i=N−k+1 r
Ai = r
q N −r
= r
q r
,
360 Further Topics from Coding Theory
i.e.
N−r
N −i N
∑ r
Ai =
r
(qk−r − 1).
i=N−k+1
as required.
In fact, (3.6.20) can be obtained without calculations: in an MDS code of rank k
and distance d any k = N − d + 1 digits determine the codeword uniquely. Further,
for any choice of N − d positions there are exactly q codewords with digits 0 in
these positions. One of them is the zero codeword, and the remaining q − 1 are of
weight d. Hence,
N
AN−k+1 = Ad = (q − 1).
d
Solution Write
i N −i
Kk (i) = ∑ j k − j
(−1) j (q − 1)k− j .
0∨(i+k−N)≤ j≤k∧i
Next:
(a) The following straightforward equation holds true:
i N k N
(q − 1) Kk (i) = (q − 1) Ki (k)
i k
(as all summands become insensitive to swapping i ↔ k).
N N
For q = 2 this yields Kk (i) = Ki (k); in particular,
i k
N N
K0 (i) = Ki (0) = Ki (0).
i 0
(b) Also, for q = 2: Kk (i) = (−1)k Kk (N − i) (again straightforward, after swapping
i ↔ i − j).
N N
(c) Thus, still for q = 2: K (2i) = K2i (k) which equals
2i k k
N N
(−1) 2i K2i (N − k) = K (2i). That is,
N −k 2i N−k
Problem 3.19 What is an (n, Fq )-root of unity? Show that the set E(n,q) of the
(n, Fq )-roots of unity form a cyclic group. Check that the order of E(n,q) equals n if
n and q are co-prime. Find the minimal s such that E(n,q) ⊂ Fqs .
Define a primitive (n, Fq )-root of unity. Determine the number of primitive
(n, Fq )-roots of unity when n and q are co-prime. If ω is a primitive (n, Fq )-root of
unity, find the minimal such that ω ∈ Fq .
Find representation of all elements of F9 as vectors over F3 . Find all (4, F9 )-roots
of unity as vectors over F3 .
Solution We know that any root of an irreducible polynomial of degree 2 over field
F3 = {0, 1, 2} belongs to F9 . Take the polynomial f (X) = X 2 + 1 and denote its
root by α (any of the two). Then all elements of F9 may be represented as a0 + a1 α
where a0 , a1 ∈ F3 . In fact,
F9 = {0, 1, α , 1 + α , 2 + α , 2α , 1 + 2α , 2 + 2α }.
362 Further Topics from Coding Theory
Problem 3.21 Write an essay comparing the decoding procedures for Hamming
and two-error correcting BCH codes.
Solution To clarify the ideas behind the BCH construction, we first return to the
Hamming codes. The Hamming [2l − 1, 2l − 1 − l] code is a perfect one-error cor-
recting code of length N = 2l − 1. The procedure of decoding the Hamming code is
as follows. Having a word y = y1 . . . yN , N = 2l − 1, form the syndrome s = yH T .
If s = 0, decode y by y. If s = 0 then s is among the columns of H = HHam . If this is
column i, decode y by x∗ = y + ei , where ei = 0 . . . 010 . . . 0 (1 in the ith position,
0 otherwise).
We can try the following idea to be able to correct more than one error (two to
start with). Select 2l of the rows of the parity-check matrix in the form
H
H= . (3.6.21)
ΠH
yH T = (si + s j , sΠi + sΠ j )
si + s j = z, sΠi + sΠ j = z (3.6.22)
for any pair (z, z ) that may eventually occur as a syndrome under two errors.
A natural guess is to try a permutation Π that has some algebraic significance,
e.g. sΠi = si si = (si )2 (a bad choice) or sΠi = si si si = (si )3 (a good choice)
364 Further Topics from Coding Theory
or, generally, sΠi = si si · · · si (k times). Say, one can try the multiplication
mod 1 + X N ; unfortunately, the multiplication does not lead to a field. The reason
is that polynomial 1 + X N is always reducible. So, suppose we organise the check
matrix as
⎛ ⎞
(1 . . . 00) (1 . . . 00)k
⎜ .. ⎟
HT = ⎝ . ⎠.
(1 . . . 11) (1 . . . 11)k
Then we have to deal with equations of the type
For solving (3.6.23), we need the field structure of the Hamming space, i.e. not
only multiplication but also division. Any field structure on the Hamming space
N is isomorphic to F2N , and a concrete realisation of such a structure is
of length
F2 [X] "c(X)#, a polynomial field modulo an irreducible polynomial c(X) of degree
N. Such a polynomial always exists: it is one of the primitive polynomials of degree
N. In fact, the simplest consistent system of the form (3.6.23) is
s + s = z, s3 + s = z ;
3
s + s = z, s3 + s = z ,
3
(3.6.24)
where s and s are words of length 4 (or equivalently their polynomials), and the
multiplication is mod 1 + X + X 4 . In the case of two errors it is guaranteed that
there is exactly one pair of solutions to (3.6.24), one vector occupying position i
and another position j, among the columns of the upper (Hamming) half of matrix
H. Moreover, (3.6.24) cannot have more than one pair of solutions because
z = s3 + s = (s + s )(s2 + ss + s ) = z(z2 + ss )
3 2
implies that
ss = z z−1 + z2 . (3.6.25)
3.6 Additional problems for Chapter 3 365
Now (3.6.25) and the first equation in (3.6.24) give that s, s are precisely the roots
of a quadratic equation
X 2 + zX + z z−1 + z2 = 0 (3.6.26)
(with z z−1 + z2 = 0). But the polynomial in the LHS of (3.6.26) cannot have more
than two distinct roots (it could have no root or two coinciding roots, but it is
excluded by the assumption that there are precisely two errors). In the case of a
single error, we have z = z3 ; in this case s = z is the only root and we just find the
word z among the columns of the upper half of matrix H.
Summarising, the decoding scheme, in the case of the above [15, 7] code, is as
follows: Upon receiving word y, form a syndrome yH T = (z, z )T . Then
(i) If both z and z are zero words, conclude that no error occurred and decode y
by y itself.
(ii) If z = 0 and z3 = z , conclude that a single error occurred and find the location
of the error digit by identifying word z among the columns of the Hamming
check matrix.
(iii) If z = 0 and z3 = z , form the quadric (3.6.24), and if it has two distinct roots
s and s , conclude that two errors occurred and locate the error digits by iden-
tifying words s and s among the columns of the Hamming check matrix.
(iv) If z = 0 and z3 = z and quadric (3.6.26) has no roots, or if z is zero but z is
not, conclude that there are at least three errors.
Note that the case where z = 0, z3 = z and quadric (3.6.26) has a single root is
impossible: if (3.6.26) has a root, s say, then either another root s = s or z = 0 and
a single error occurs.
The decoding procedure allows us to detect, in some cases, that more than three
errors occurred. However, this procedure may lead to a wrong codeword when
three or more errors occur.
4
Further Topics from Information Theory
366
4.1 Gaussian channels and beyond 367
MAGC is particularly attractive because it allows one to do some handy and far-
reaching calculations with elegant answers.
However, Gaussian (and other continuously distributed) channels present a chal-
lenge that was absent in the case of finite alphabets considered in Chapter 1.
Namely, because codewords (or, using a slightly more appropriate term, codevec-
tors) can a priori take values from a Euclidean space (as well as noise vectors),
the definition of the channel capacity has to be modified, by introducing a power
constraint. More generally, the value of capacity for a channel will depend upon
the so-called regional constraints which can generate analytic difficulties. In the
case of MAGC, the way was shown by Shannon, but it took some years to make
his analysis rigorous.
An input word of length N (designed to use the channel over N slots in succes-
sion) is identified with an input N-vector
⎛ ⎞
x1
⎜ ⎟
x(= x(N) ) = ⎝ ... ⎠ .
xN
We assume that xi ∈ R and hence x(N) ∈ RN (to make the notation shorter, the upper
index (N) will be often omitted).
In an⎛additive
⎞ channels an input vector x is transformed to a random vector
Y1
⎜ ⎟
Y(N) = ⎝ ... ⎠ where Y = x + Z, or, component-wise,
YN
Y j = x j + Z j , 1 ≤ j ≤ N. (4.1.1)
Moreover, under the IID assumption, with Σ j j ≡ σ 2 > 0, all random variables Z j ∼
N(0, σ 2 ), and the noise distribution for an MGC is completely specified by a single
parameter σ > 0. More precisely, the joint PDF from (4.1.3) is rewritten as
N
1 1
√ exp − 2 ∑ z2j .
2πσ 2σ 1≤ j≤N
The codebook is, of course, presumed to be known to both the sender and the
receiver. The transmission rate R is given by
log2 M
R= . (4.1.5)
N
Now suppose that⎛a codevector⎞x(i) had been sent. Then the received random
x1 (i) + Z1
⎜ .. ⎟
vector Y(= Y(i)) = ⎝ . ⎠ is decoded by using a chosen decoder d : y →
xN (i) + ZN
d(y) ∈ XM,N . Geometrically, the decoder looks for the nearest codeword x(k),
relative to a certain distance (adapted to the decoder); for instance, if we choose to
use the Euclidean distance then vector Y is decoded by the codeword minimising
the sum of squares:
d(Y) = arg min ∑ (Y j (i) − x j (l))2 : x(l) ∈ XM,N ; (4.1.6)
1≤ j≤N
when d(y) = x(i) we have an error. Luckily, the choice of a decoder is conveniently
resolved on the basis of the maximum-likelihood principle; see below.
370 Further Topics from Information Theory
There is an additional subtlety here: one assumes that, for an input word x to get a
chance of successful decoding, it should belong to a certain ‘transmittable’ domain
in RN . For example, working with an MAGC, one imposes the power constraint
1
N 1≤∑
x2j ≤ α (4.1.7)
j≤N
where α > 0 is a given constant. In the context of wireless transmission this means
that the amplitude square power per signal in an N-long input vector should be
bounded by α , otherwise the result of transmission is treated as ‘undecodable’.
Geometrically, in order to perform decoding, the input√codeword x(i) constituting
√
the codebook must lie inside the Euclidean ball BN2 ( α N) of radius r = α N
centred at 0 ∈ RN :
⎧ ⎛ ⎞ ⎫
⎪ x1 1/2 ⎪
⎨ ⎬
⎜ .. ⎟
∑ j
(N)
B2 (r) = x = ⎝ . ⎠ : x 2
≤ r .
⎪
⎩ 1≤ j≤N
⎪
⎭
xN
The subscript 2 stresses that RN with the standard Euclidean distance is viewed as
a Hilbert 2 -space.
In fact, it is not required that the whole codebook XM,N lies in a decodable
domain; the agreement is only that if a codeword x(i) falls outside then it is decoded
wrongly with probability 1. Pictorially, the requirement is that ‘most’ of codewords
lie within BN2 ((N α )1/2 ) but not necessarily all of them. See Figure 4.1.
A reason for the ‘regional’ constraint (4.1.7) is that otherwise the codewords
can be positioned in space at an arbitrarily large distance from each other, and,
eventually, every transmission rate would become reliable. (This would mean that
the capacity of the channel is infinite; although such channels should not be dis-
missed outright, in the context of an AGC the case of an infinite capacity seems
impractical.)
Typically, the decodable region D(N) ⊂ RN is represented by a ball in RN , centred
at the origin, and specified relative to a particular distance in RN . Say, in the case
of exponentially distributed noise it is natural to select
⎧ ⎛ ⎞ ⎫
⎪
⎨ x1 ⎪
⎬
⎜ .. ⎟
D = B1 (N α ) = x = ⎝ . ⎠ : ∑ |x j | ≤ N α
(N) (N)
⎪
⎩ 1≤ j≤N
⎪
⎭
xN
the ball in the 1 -metric. When an output-signal vector falling within distance r
from a codeword is decoded by this codeword, we have a correct decoding if (i)
the output signal falls in exactly one sphere around a codeword, (ii) the codeword
in question lies within D(N) , and (iii) this specific codeword was sent. We have
possibly an error when more than one codeword falls into the sphere.
4.1 Gaussian channels and beyond 371
Figure 4.1
∏
(N)
fch (y(N) | x(N) ) = fch (y j |x j ). (4.1.10)
1≤ j≤N
Here fch (y|x) is the symbol-to-symbol channel PMF describing the impact of a
single use of the channel. For an MGC, fch (y|x) is a normal N(x, σ 2 ). In other
words, fch (y|x) gives the PDF of a random variable Y = x + Z where Z ∼ N(0, σ 2 )
represents the ‘white noise’ affecting an individual input value x.
Next, we turn to a codebook XM,N , the image of a one-to-one map M → RN
where M is a finite collection of messages (originally written in a message alpha-
bet); cf. (4.1.4). As in the discrete case, the ML decoder dML decodes the received
(N)
word Y = y(N) by maximising fch (y| x) in the argument x = x(N) ∈ XM,N :
(N)
dML (y) = arg max fch (y| x) : x ∈ XM,N . (4.1.11)
The case when maximiser is not unique will be treated as an error.
(N),ε
Another useful example is the joint typicality (JT) decoder dJT = dJT (see
below); it looks for the codeword x such that x and y lie in the ε -typical set TεN :
dJT (y) = x if x ∈ XM,N and (x, y) ∈ TεN . (4.1.12)
The JT decoder is designed – via a specific form of set TεN – for codes generated as
samples of a random code X M,N . Consequently, for given output vector yN and a
code XM,N , the decoded word dJT (y) ∈ XM,N may be not uniquely defined (or not
defined at all), again leading to an error. A general decoder should be understood
as a one-to-one map defined on a set K(N) ⊆ RN taking points yN ∈ KN to points
x ∈ XM,N ; outside set K(N) it may be not defined correctly. The decodable region
K(N) is a part of the specification of decoder d (N) . In any case, we want to achieve
(N) (N)
Pch d (N) (Y) = x|x sent = Pch Y ∈ K(N) |x sent
(N)
+ Pch Y ∈ K(N) , d(Y) = x|x sent → 0
as N → ∞. In the case of an MGC, for any code XM,N , the ML decoder from
(4.1.6) is defined uniquely almost everywhere in RN (but does not necessarily give
the right answer).
We also require that the input vector x(N) ∈ D(N) ⊂ RN and when x(N) ∈ D(N) ,
the result of transmission is rendered undecodable (regardless of the qualities of the
decoder used). Then the average probability of error, while using codebook XM,N
and decoder d (N) , is defined by
1
eav (XM,N , d (N) , D(N) ) = ∑ e(x, d (N), D(N) ),
M x∈X
(4.1.13a)
M,N
4.1 Gaussian channels and beyond 373
emax (XM,N , d (N) , D(N) ) = max e(x, d (N) , D(N) ) : x ∈ XM,N . (4.1.13b)
Here e(x, d (N) , D(N) ) is the probability of error when codeword x had been trans-
mitted:
⎧
⎨ 1,
x ∈ D(N) ,
e(x, d (N) , D(N) ) = (4.1.14)
⎩P(N)
ch d
(N) (Y) = x|x , x ∈ D(N) .
In (4.1.14) the order of the codewords in the codebook XM,N does not matter;
thus XM,N may be regarded simply as a set of M points in the Euclidean space RN .
Geometrically, we want the points of XM,N to be positioned so as to maximise the
chance of correct ML-decoding and lying, as a rule, within domain D(N) (which
again leads us to a sphere-packing problem).
4 suppose that a number R > 0 is fixed, the size of the codebook XM,N :
To 3this end,
M = 2NR . We want to define a reliable transmission rate as N → ∞ in a fashion
similar to how it was done in Section 1.4.
Definition 4.1.2 Value R >30 is 4called a reliable transmission rate with regional
constraint D(N) if, with M = 2NR , there exist a sequence {XM,N } of codebooks
XM,N ⊂ RN and a sequence {d (N) } of decoders d (N) : RN → RN such that
lim eav (XM,N , d (N) , D(N) ) = 0. (4.1.15)
N→∞
Remark 4.1.3 It is easy to verify that a transmission rate R reliable in the sense of
average error-probability eav (XM,N , d (N) , D(N) ) is reliable for the maximum error-
probability emax (XM,N , d (N) , D(N) ). In fact, assume that R is reliable in the sense of
Definition 4.1.2, i.e. in the sense of the average error-probability. Take a sequence
{XM,N } of the corresponding codebooks with M = 2RN and a sequence {dN } of
(0)
the corresponding decoding rules. Divide each code XN into two halves, XN and
(1)
XN , by ordering the codewords in the non-decreasing order of their probabilities
(0)
of erroneous decoding and listing the first M (0) = M/2 codewords in XN and
(1) (0)
the rest, M (1) = M − M (0) , in XN . Then, for the sequence of codes {XM,N }:
(i) the information rate approaches the value R as N → ∞ as
1
log M (0) ≥ R + O(N −1 );
N
(ii) the maximum error-probability, while using the decoding rule dN ,
1 M
Pemax XN , dN ≤ (1) ∑ Pe (x(N) , dN ) ≤ (1) Peav (XN , dN ) .
(0)
M (1) M
(N)x ∈XN
374 Further Topics from Information Theory
Next, the capacity of the channel is the supremum of reliable transmission rates:
C = sup R > 0 : R is reliable ; (4.1.16)
it varies from channel to channel and with the shape of constraining domains.
It turns out (cf. Theorem 4.1.9 below) that for the MGC, under the average power
constraint threshold α (see (4.1.7)), the channel capacity C(α , σ 2 ) is given by the
following elegant expression:
1 α
C(α , σ 2 ) = log2 1 + 2 . (4.1.17)
2 σ
Example 4.1.4 Next, we discuss an AGC with coloured Gaussian noise. Let a
codevector x = (x1 , . . . , xN ) have multi-dimensional entries
⎛ ⎞
x j1
⎜ .. ⎟
x j = ⎝ . ⎠ ∈ Rk , 1 ≤ j ≤ N,
x jk
The formula for the capacity of an AGC with coloured noise is, not surprisingly,
more complicated. As ΣQ = QΣ, matrices Σ and Q may be simultaneously diag-
onalised. Let λi and γi , i = 1, . . . , k, be the eigenvalues of Σ and Q, respectively
(corresponding to the same eigenvectors). Then
1 (νγl−1 − λl )+
C(α , Q, Σ) = ∑ log2 1 +
2 1≤l≤k λl
, (4.1.19)
−1
where (νγl−1 − λl )+ = max −1 νγ l − λl , 0 . In other words, (νγl−1 − λl )+ are the
eigenvalues of the matrix ν Q −Σ + representing the positive-definite part of the
Hermitian matrix ν Q−1 − Σ. Next, ν = ν (α ) > 0 is determined from the condition
tr ν I − QΣ + = α . (4.1.20)
The positive-definite part ν I − QΣ + is in turn defined by
ν I − QΣ + = Π+ ν I − QΣ Π+
fX,Y (X,Y )
I(X : Y ) = E log
fX (X) fY (Y )
0
fX,Y (x, y)
= fX,Y (x, y) log μ (dx)ν (dy).
fX (x) fY (y)
(N )
fX(N) ,Y(N ) (X(N) , Y(N ) )
I(X (N)
:Y ) = E log . (4.1.21a)
fX(N) (X(N) ) fY(N ) (Y(N ) )
Here fX(N) (x(N) ) and fY(N ) (y(N ) ) are the marginal PMFs for X(N) and Y(N ) (i.e.
joint PMFs for components of these vectors).
4.1 Gaussian channels and beyond 377
Next, given ε > 0, we can define the supremum of the mutual information per
signal (i.e. per a single use of the channel), over all input probability distributions
PX(N) with E (PX(N) , D(N) ) ≤ ε :
1
Cε ,N = sup I(X(N) : Y(N) ) : E (PX(N) , D(N) ) ≤ ε , (4.1.22)
N
Cε = lim sup Cε ,N C = lim inf Cε . (4.1.23)
N→∞ ε →0
We want to stress that the supremum in (4.1.22) should be taken over all proba-
bility distributions PX(N) of the input word X(N) with the property that the expected
error-probability is ≤ ε , regardless of whether these distributions are discrete or
continuous or mixed (contain both parts). This makes the correct evaluation of
CN,ε quite difficult. However, the limiting value C is more amenable, at least in
some important examples.
We are now in a position to prove the converse part of the Shannon second
coding theorem:
words Y(N) and decodable domains D(N) . Then quantity C from (4.1.22), (4.1.23)
gives an upper bound for the capacity:
C ≤ C. (4.1.24)
jointly over XM,N , i.e. have a discrete-type joint distribution. Then, by the gener-
alised Fano inequality (1.2.23),
hdiscr (X|d(Y)) ≤ 1 + log(M − 1) ∑ P(x = x, dML (Y) = x)
X∈XM,N
NR
≤ 1+ ∑ Pch (dML (Y) = x|x sent)
M x∈XM,N
Example 4.1.6 Here we estimate the capacity C(α , σ 2 ) of an MAGC with addi-
tive white Gaussian noise of variance σ 2 , under the average power constraint (with
D(N) = B(N) ((N α )1/2 ) (cf. Example 4.1.1.), i.e. bound from above the right-hand
side of (4.1.25b).
4.1 Gaussian channels and beyond 379
EY j2 = E(X j + Z j )2 = EX j2 + 2EX j Z j + EZ 2j = α 2j + σ 2 ,
≤ log2 2π e(α 2j + σ 2 ) ,
2
and consequently,
The Jensen inequality, applied to the concave function x → log2 (1 + x), implies
1 α 2j 1 1 α 2j
2N 1≤∑ N 1≤∑
log2 1 + 2 ≤ log2 1 +
j≤N σ 2 j≤N σ
2
1 α
≤ log2 1 + 2 .
2 σ
Therefore, in this example, the information capacity C, taken as the RHS of
(4.1.25b), obeys
1 α
C ≤ log2 1 + 2 . (4.1.27)
2 σ
After establishing Theorem 4.1.8, we will be able to deduce that the capacity
C(α , σ 2 ) equals the RHS, confirming the answer in (4.1.17).
Example 4.1.7 For the coloured Gaussian noise the bound from (4.1.26) can be
repeated:
I(X(N) : Y(N) ) ≤ ∑ [h(Y j ) − h(Z j )].
1≤ j≤N
Here we work with the mixed second-order moments for the random vectors of
input and output signals X j and Y j = X j + Z j :
1
N 1≤∑
α 2j = E"X j , QX j #, E"Y j , QY j # = α 2j + tr (QΣ), α 2j ≤ α .
j≤N
In this calculation we again made use of the fact that X j and Z j are independent
and the expected value EZ j = 0.
1
Next, as in the scalar case, I(X(N) : Y(N) ) does not exceed the difference
N
h(Y ) − h(Z) where Z ∼ N(0, Σ) is the coloured noise vector and Y = X + Z is
a multivariate normal distribution maximising the differential entropy under the
trace restriction. Formally:
1
I(X(N) : Y(N) ) ≤ h(α , Q, Σ) − h(Z)
N
4.1 Gaussian channels and beyond 381
Write Σ in the diagonal form Σ = CΛCT where C is an orthogonal and Λ the diag-
onal k × k matrix formed by the eigenvalues of Σ:
⎛ ⎞
λ1 0 . . . 0
⎜ 0 λ2 . . . 0 ⎟
⎜ ⎟
Λ=⎜ . ⎟.
⎝0 0 . . 0⎠
0 0 . . . λk
is maximised at
1 1
= κγi , i.e. βi = − λi , i = 1, . . . , k.
βi + λi κγi
To satisfy the regional constraint, we take
1
βi = − λi , i = 1, . . . , k,
κγi +
where the RHS comes from (4.1.28) with ν = 1/κ . Again, we will show that the
capacity C(α , Q, Σ) equals the last expression, confirming the answer in (4.1.19).
We now pass to the direct part of the second Shannon coding theorem for general
channels with regional restrictions. Although the statement of this theorem differs
from that of Theorems 1.4.15 and 2.2.1 only in the assumption of constraints upon
the codewords (and the proof below is a mere repetition of that of Theorem 1.4.15),
it is useful to put it in the formal context.
where
(random) word YN = YN ( j) received, with the joint PMF fX(N) ,Y(N) as in (4.1.30b).
We take ε > 0 and decode YN by using joint typicality:
dJT (YN ) = xN (i) when xN (i) is the only vector among
xN (1), . . . , xN (M) such that (xN (i), YN ) ∈ TεN .
Here set TεN is specified in (4.1.30a).
Suppose a random vector xN ( j) has been sent. It is assumed that an error occurs
every time when
(i) xN ( j) ∈ D(N) , or
(ii) the pair (xN ( j), YN ) ∈ TεN , or
(iii) (xN (i), YN ) ∈ TεN for some i = j.
These possibilities do not exclude each other but if none of them occurs then
(a) xN ( j) ∈ D(N) and
(b) x( j) is the only word among xN (1), . . . , xN (M) with (xN ( j), YN ) ∈ TεN .
Therefore, the JT decoder will return the correct result. Consider the average error-
probability
1
M 1≤∑
EM (PN ) = E( j, PN )
j≤M
where E( j, PN ) is the probability that any of the above possibilities (i)–(iii) occurs:
* + * +
E( j, PN ) = P xN ( j) ∈ D(N) ∪ (xN ( j), YN ) ∈ TεN
* +
∪ (xN (i), YN ) ∈ TεN for some i = j
= E1 xN ( j) ∈ D(N)
+ E1 xN ( j) ∈ D(N) , dJT (YN ) = xN ( j) . (4.1.31)
The symbols P and E in (4.1.31) refer to (1) a collection of IID input vectors
xN (1), . . . , xN (M), and (2) the output vector YN related to xN ( j) by the action of
the channel. Consequently, YN is independent of vectors xN (i) with i = j. It is in-
structive to represent the corresponding probability distribution P as the Cartesian
product; e.g. for j = 1 we refer in (4.1.31) to
P = PxN (1),YN (1) × PxN (2) × · · · × PxN (M)
where PxN (1),YN (1) stands for the joint distribution of the input vector xN (1) and the
output vector YN (1), determined by the joint PMF
fxN (1),YN (1) (xN , yN ) = fxN (1) (xN ) fch (yN |xN sent).
384 Further Topics from Information Theory
Thanks to the condition that lim Px(N) (D(N) ) = 1, the first summand vanishes as
N→∞
N → ∞. The second summand vanishes, again in the limit N → ∞, because of
M
(4.1.30a). It remains to estimate the sum ∑ P (xN (i), YN ) ∈ TεN .
i=2
First, note that, by symmetry, all summands are equal, so
M N
∑P (x (i), YN ) ∈ TεN = 2NR − 1 P (xN (2), YN ) ∈ TεN .
i=2
and hence
m N
∑P (x (i), YN ) ∈ TεN ≤ 2N(R−c+3ε )
i=2
We conclude that there exists a sequence of sample codebooks XM,N such that
the average error-probability
1
∑ e(x) → 0
M x∈XM,N
4.1 Gaussian channels and beyond 385
(N),ε
where e(x) = e(x, XM,N , D(N) , dJT ) is the error-probability for the input word x
in code XM,N , under the JT decoder and with regional constraint specified by D(N) :
⎧
⎨ 1, xN ∈ D(N) ,
e(x) =
(N),ε
⎩Pch dJT (YN ) = x|x sent , xN ∈ D(N) .
Hence, R is a reliable transmission rate in the sense of Definition 4.1.2. This com-
pletes the proof of Theorem 4.1.8.
Theorem 4.1.9 Assume that the conditions of Theorem 4.1.5 hold true. Then,
for all R < C, there exists a sequence of codes XM,N of length N and size M ∼ 2RN
such that the maximum probability of error tends to 0 as N → ∞.
1 α
C(α , σ 2 ) = log 1 + 2 ,
2 σ
for a vector white noise with variances σ 2 = (σ12 , . . . , σk2 ), under the constraint
∑ xTj x j ≤ N α ,
1≤ j≤N
1 (ν − σi2 )+
C(α , σ 2 ) = ∑
2 1≤i≤k
log 1 +
σi2
, where ∑ (ν − σi2 )+ = α 2 ,
1≤i≤k
and for the coloured vector noise with a covariance matrix Σ, under the constraint
∑ xTj Qx j ≤ N α ,
1≤ j≤N
1 (νγi−1 − λi )+
C(α , Q, Σ) = ∑ log 1 +
2 1≤i≤k λi
,
where ∑ (ν − γi λi )+ = α .
1≤i≤k
Explicitly, for a scalar white noise we take the random coding where the signals
X j (i), 1 ≤ j ≤ N, 1 ≤ i ≤ M = 2NR , are IID N(0, α − ε ). We have to check the
conditions of Theorem 4.1.5 in this case: as N → ∞,
√
(i) lim P(x(N) (i) ∈ B(N) ( N α ), for all i = 1, . . . , M) = 1;
N→∞
386 Further Topics from Information Theory
1 P(X,Y )
N 1≤∑
θN = log .
j≤M PX (X)PY (Y )
has two-side exponential IID components Z j ∼ (2) Exp(λ ), with the PDF
1
fZ j (z) = λ e−λ |z| , −∞ < z < ∞,
2
where Exp denotes the exponential distribution, λ > 0 and E|Z j | = 1/λ (see PSE I,
Appendix). Again we will calculate the capacity under the ML rule and with a
regional constraint x(N) ∈ Ł(N α ) where
( )
Ł(N α ) = x(N) ∈ RN : ∑ |x j | ≤ N α .
1≤ j≤N
First, observe that if the random variable X has E|X| ≤ α and the random variable
Z has E|Z| ≤ ζ then E|X + Z| ≤ α + ζ . Next, we use the fact that a random variable
Y with PDF fY and E|Y | ≤ η has the differential entropy
≤ ∑ 2 + log2 (α j + λ −1 ) − 2 + log2 (λ )
N
1
= ∑ log2 1 + α j λ
N
≤ log2 1 + αλ .
The same arguments as before establish that the RHS gives the capacity of the
channel.
388 Further Topics from Information Theory
C
inf
= ln 7 M = 13
m=6
2b
_ 0
A A
Figure 4.2
That is, the points of A partition the interval [−A, A] into 2m intervals of length
A/m; the ‘extended’ interval [−A − b, A + b] contains 2(m + 1) such intervals. The
maximising probability distribution PX can be spotted without calculations: it as-
signs equal probabilities 1/(m + 1) to m + 1 points
In other words, we ‘cross off’ every second ‘letter’ from A and use the remaining
letters with equal probabilities.
In fact, with PX (−A) = PX (−A + 2b) = · · · = PX (A), the output signal PDF fY
assigns the value [2b(m + 1)]−1 to every point y ∈ [−A − b, A + b]. In other words,
Y ∼ U(−A − b, A + b) as required. The information capacity Cinf in this case is
equal (in nats) to
ln(2A + 2b) − ln 2b = ln (1 + m) . (4.1.32)
4.1 Gaussian channels and beyond 389
Say, for M = 3 (three input signals, at −A, 0, A, and b = A), Cinf = ln 2. For
M = 5 (five input signals, at −A, −A/2, 0, A/2, A, and b = A/2), Cinf = ln 3. See
Figure 4.2 for M = 13.
Remark 4.1.14 It can be proved that (4.1.32) gives the maximum mutual infor-
mation I(X : Y ) between the input and output signals X and Y = X + Z when (i)
the noise random variable Z ∼ U(−b, b) is independent of X and (ii) X has a gen-
eral distribution supported on the interval [−A, A] with b = A/m. Here, the mutual
information I(X : Y ) is defined according to Kolmogorov:
where the supremum is taken over all finite partitions ξ and η of intervals [−A, A]
and [−A − b, A + b], and Xξ and Yη stand for the quantised versions of random
variables X and Y , respectively.
In other words, the input-signal distribution PX with
1
PX (−A) = PX (−A + 2b) = · · · = PX (A − 2b) = PX (A) = (4.1.34)
m+1
maximises I(X : Y ) under assumptions (i) and (ii). We denote this distribution by
(A,A/m) (bm,b)
PX , or, equivalently, PX .
However, if M = 2m, i.e. the number A of allowed signals is even, the cal-
culation becomes more involved. Here, clearly, the uniform distribution U(−A −
b, A + b) for the output signal Y cannot be achieved. We have to maximise h(Y ) =
h(X + Z) within the class of piece-wise constant PDFs fY on [−A − b, A + b]; see
below.
Equal spacing in [−A, A] is generated by points ±A/(2m − 1), ±3A/(2m −
1), . . . , ±A; they are described by the formula ±(2k − 1)A/(2m − 1) for k =
1, . . . , m. These points divide the interval [−A, A] into (2m − 1) intervals of length
2A/(2m − 1). With Z ∼ U(−b, b) and A = b(m − 1/2), we again have the output-
signal PDF fY (y) supported in [−A − b, A + b]:
⎧
⎪
⎪ if b(m − 1/2) ≤ y ≤ b(m + 1/2),
⎪pm /(2b),
⎪
⎪
⎪
⎪
⎪ pk + pk+1 (2b), if b(k − 1/2) ≤ y ≤ b(k + 1/2)
⎪
⎪
⎪
⎪ for k = 1, . . . , m − 1,
⎨
fY (y) = p−1 + p1 (2b), if − b/2 ≤ y ≤ b/2,
⎪
⎪
⎪
⎪ pk + pk+1 (2b), if b(k − 1/2) ≤ y ≤ b(k + 1/2)
⎪
⎪
⎪
⎪
⎪
⎪ for k = −1, . . . , −m + 1,
⎪
⎪
⎩ p /(2b), if − b(m + 1/2) ≤ y ≤ −b(m − 1/2),
−m
390 Further Topics from Information Theory
where
1 (2k − 1)A
p±k = pX ±b k − =P X =± , k = 1, . . . , m,
2 2m − 1
stand for the input-signal probabilities. The entropy h(Y ) = h(X + Z) is written as
pm pk + pk+1 p1
maximise G(p) = −pm ln − ∑ (pk + pk+1 ) ln − p1 ln
2b 1≤k<m 2b b
(4.1.35)
subject to the probabilistic constraints pk ≥ 0 and 2 ∑ pk = 1. The Lagrangian
1≤k≤m
L (PX ; λ ) reads
∂
L (PX ; λ ) = 0, k = 1, . . . , m.
∂ pk
pm (pm−1 + pm )
− ln − 2 + 2λ = 0, (implies) pm (pm−1 + pm ) = 4b2 e2λ −2 ,
4b2
(pk−1 + pk )(pk + pk+1 )
− ln − 2 + 2λ = 0,
4b2
(implies) (pk−1 + pk )(pk + pk+1 ) = 4b2 e2λ −2 , 1 < k < m,
2p1 (p1 + p2 )
− ln − 2 + 2λ = 0 (implies) 2p1 (p1 + p2 ) = 4b2 e2λ −2 .
4b2
This yields
K
pm = pm−1 + pm−2 = · · · = p3 + p2 = 2p1 ,
for m even,
pm + pm−1 = pm−2 + pm−3 = · · · = p2 + p1 ,
4.1 Gaussian channels and beyond 391
and
K
pm = pm−1 + pm−2 = · · · = p2 + p1 ,
for m odd.
pm + pm−1 = pm−2 + pm−3 = · · · = p3 + p2 = 2p1 ,
⎧
⎨1/(4b), A ≤ y ≤ 3A,
⎪
fY (y) = 1/(2b), −A ≤ y ≤ A, yielding Cinf = (ln 2)/2.
⎪
⎩
1/(4b), −3A ≤ y ≤ −A,
For M = 4 (four input signals at −A, −A/3, A/3, A, with b = 2A/3): p1 = 1/6,
p2 = 1/3, and the maximising output-signal PDF is
⎧
⎨1/(6b), A ≤ y ≤ 5A/3 and − 5A/3 ≤ y ≤ −A,
⎪
fY (y) = 1/(4b), 2A/3 ≤ y ≤ A and − A ≤ y ≤ −2A/3,
⎪
⎩
1/(6b), −2A/3 ≤ y ≤ 2A/3,
pm = 2p1 ,
pm−1 = p2 − p1 ,
pm−2 = 3p1 − p2 ,
pm−3 = 2(p2 − p1 ),
pm−4 = 4p1 − 2p2 ,
..
m .
p3 = − 1 (p2 − p1 ),
2
m+2
p2 = p1 ,
m
392 Further Topics from Information Theory
whence
m+2
p2 = p1 ,
m
m−2
p3 = p1 ,
m
m+4
p4 = p1 ,
m
m−4
p5 = p1 ,
m
.. (4.1.36)
.
2m − 2
pm−2 = ,
m
2
pm−1 = ,
m
pm = 2p1 ,
1
with p1 = .
2(m + 1)
1 1 1 1
h(Y ) = − ln inf
and CA = − ln − ln 2. (4.1.37)
2 4m(m + 1)b2 2 4m(m + 1)
On the other hand, for a general odd m, the maximising input-signal distribution
PX has
m+1
p1 = ,
2m(m + 1)
m−1
p2 = ,
2m(m + 1)
m+3
p3 = ,
2m(m + 1)
m−3 (4.1.38)
p4 = ,
2m(m + 1)
..
.
1
pm−1 = ,
2m(m + 1)
m
pm = .
m(m + 1)
4.1 Gaussian channels and beyond 393
This yields the same answer for the maximum entropy and the restricted capacity:
1 1 1 1
h(Y ) = − ln inf
and CA = − ln − ln 2. (4.1.39)
2 4m(m + 1)b2 2 4m(m + 1)
In future, we will refer to the input-signal distributions specified in (4.1.36) and
(A,2A/(2m−1))
(4.1.38) as PX .
Remark 4.1.15 It is natural to suggest that the above formulas give the maxi-
mum mutual information I(X : Y ) when (i) the noise random variable Z ∼ U(−b, b)
is independent of X and (ii) the input-signal distribution PX is confined to [−A, A]
with b = 2A/(2m − 1), but otherwise is arbitrary (with I(X : Y ) again defined as in
(4.1.33)). A further-reaching (and more speculative) conjecture is about the max-
imiser under the above assumptions (i) and (ii) but for arbitrary A > b > 0, not
necessarily with A/b being integer or half-integer. Here number M = 2A/b + 1
will not be integer either, but remains worth keeping as a value of reference.
So when b decays from A/m to A/(m + 1) (or, equivalently, A grows from bm
to b(m + 1) and, respectively, M increases from 2m + 1 to 2m + 3), the maximiser
(A,b) (bm,b) (b(m+1),b)
PX evolves from PX to PX ; at A = b(m + 1/2) (when M = 2(m + 1))
(A,b) (A,b)
distribution PX may or may not coincide with the distribution PX from
(4.1.36), (4.1.38).
To (partially) clarify the issue, consider the case where A/2 ≤ b ≤ A (i.e. 3 ≤
M ≤ 5) and assume that the input-signal distribution PX has
1
PX (−A) = PX (A) = p and PX (0) = 1 − 2p where 0 ≤ p ≤ . (4.1.40)
2
Then
1 p 1− p 1 − 2p
hy(Y ) = − Ap ln + (2b − A)(1 − p) ln + (A − b)(1 − 2p) ln ,
b 2b 2b 2b
(4.1.41)
and the equation dh(Y ) dp = 0 is equivalent to
i.e.
p3 = (1 − p)(1 − 2p)2 . (4.1.43a)
394 Further Topics from Information Theory
We are interested in the solution lying in (0, 1/2) (in fact, in (1/3, 1/2)). For b =
3A/4, the equation becomes
pA = (1 − p)A/2 (1 − 2p)A/2 ,
i.e.
p2 = (1 − p)(1 − 2p), (4.1.43b)
√
whence p = (3 − 5) 2.
Example 4.1.16 It is useful to look at the example where the noise random
variable Z has two components: discrete and continuous. To start with, one could
try the case where
fZ (z) = qδ0 + (1 − q)φ (z; σ 2 ),
i.e. Z = 0 with probability q and Z ∼ N(0, σ 2 ) with probability 1 − q ∈ (0, 1). (So,
1 − q gives the total probability of error.) Here, we consider the case
1
fZ = qδ0 + (1 − q) 1(|z| ≤ b),
2b
and study the input-signal PMF of the form
PX (−A) = p−1 , PX (0) = p0 , PX (A) = p1 , (4.1.44a)
where
p−1 , p0 , p1 ≥ 0, p−1 + p0 + p1 = 1, (4.1.44b)
with b = A and M = 3 (three signal levels in (−A, A)). The input-signal entropy is
h(X) = h(p−1 , p0 , p1 ) = −p−1 ln p−1 − p0 ln p0 − p1 ln p1 .
The output-signal PMF has the form
1
fY (y) = q p−1 δ−A + p0 δ0 + p1 δA + (1 − q)
2b
× p−1 1(−2A ≤ y ≤ 0) + p0 1(−A ≤ y ≤ A) + p1 1(0 ≤ y ≤ 2A)
and its entropy h(Y ) (calculated relative to the reference measure μ on R, whose
absolutely continuous component coincides with the Lebesgue and discrete com-
ponent assigns value 1 to points −A, 0 and A) is given by
h(Y ) = −q ln q − (1 − q) ln(1 − q) − qh(p−1 , p0 , p1 )
p−1 p−1 + p0
− (1 − q)A p−1 ln + p−1 + p0 ln
2A 2A
p0 + p1 p1
+ p0 + p1 ln + p1 ln .
2A 2A
4.1 Gaussian channels and beyond 395
× fX (x )dx fX (x + z + b) − fX (x + z − b) dz.
(x+z−b)∨(−A)
inf
C = ln 10 2b points from A (10 in total)
points from A \ A (8 in total)
an example of set B
set B
square Sb ( _x )
2b
_x
Figure 4.3
with x ∈ A partition domain B (i.e. cover B but do not intersect each other) then,
for the input PMF Px with Px (x) = 1 ( A ) (a uniform distributionover A ), the
output-vector-signal PDF fY is uniform on B (that is, fY (y) = 1 area of B ).
Consequently, the output-signal entropy h(Y ) = ln area of B is attaining the
maximum over all input-signal PMFs Px with Px (A ) = 1 (and even attaining the
maximum over all input-signal PMFs Px with Px (B ) = 1 where B ⊂ B is an
arbitrary subset with the property that ∪x ∈B S(x ) lies within B). Finally, the in-
formation capacity for the channel under consideration,
1 area of B
Cinf = ln nats/(scalar input signal).
2 4b2
See Figure 4.3.
To put it differently, any bounded set D2 ⊂ R2 that can be partitioned into disjoint
squares of length 2b yields the information capacity
1 area of D2
C2inf = ln nats/(scalar input signal),
2 4b2
of an additive channel with a uniform noise over (−b, b), when the channel is used
two times per scalar input signal and the random vector input x = (X1 , X2 ) is subject
to the regional constraint x ∈ D2 . The maximising input-vector PMF assigns equal
probabilities to the centres of squares forming the partition.
A similar conclusion holds in R3 when the channel is used three times for every
input signal, i.e. the input signal is a three-dimensional vector x = (x1 , x2 , x3 ), and
so on. In general, when we use a K-dimensional input signal x = (x1 , . . . , xk ) ∈ RK ,
and the regional constraint is x ∈ DK ⊂ RK where DK is a bounded domain that can
4.2 The asymptotic equipartition property in continuous time setting 397
then CKinf = ln(1 + m) does not depend on K (and the channel is memoryless).
This section provides a missing step in the proof of Theorem 4.1.8 and ad-
ditional Worked Examples. We begin with a series of assertions illustrating the
asymptotic equipartition property in various forms. The central facts are based on
the Shannon–McMillan–Breiman (SMB) theorem which is considered a corner-
stone of information theory. This theorem gives the information rate of a stationary
ergodic process X = (Xn ). Recall that a transformation of a probability space T is
called ergodic if every set A such that TA = A almost everywhere, satisfies P(A) = 0
or 1. For a stationary ergodic source with a finite expected value, Birkhoff’s ergodic
theorem states the law of large numbers (with probability 1):
1 n
∑ Xi → EX.
n i=1
(4.2.1)
The proof of Theorem 4.2.1 requires some auxiliary lemmas and is given at the
end of the section.
Worked Example 4.2.2 (A general asymptotic equipartition property) Given a
⎛ ⎞X1 , X2 , . . ., for all N = 1, 2, . . ., the distribution of
sequence of random variables
X1
⎜ .. ⎟
the random vector x1 = ⎝ . ⎠ is determined by a PMF fxN (xN1 ) with respect to
N
1
XN
measure μ = μ ×· · ·× μ (N factors). Suppose that the statement of the Shannon–
(N)
where h > 0 is a constant (typically, h = lim h(Xi )). Given ε > 0, consider the
i→∞
typical set
⎧ ⎞ ⎛ ⎫
⎪
⎨ x1 ⎪
⎬
⎜ .. ⎟ 1
Sε = x1 = ⎝ . ⎠ : −ε ≤ log fxN (x1 ) + h ≤ ε .
N N N
⎪
⎩ N 1 ⎪
⎭
xN
0
The volume μ (N) (SεN ) = μ (dx1 ) . . . μ (dxN ) of set SεN has the following proper-
SεN
ties:
μ (N) (SεN ) ≤ 2N(h+ε ) , for all ε and N, (4.2.4)
and, for 0 < ε < h and for all δ > 0,
μ (N) (SεN ) ≥ (1 − δ )2N(h−ε ) , for N large enough, depending on δ . (4.2.5)
0
Solution Since P(RN ) =
RN
fxN (xN1 )
1
∏ μ (dx j ) = 1, we have that
1≤ j≤N
0
1=
RN
fxN (xN1 )
1
∏ μ (dx j )
1≤ j≤N
0
≥
SεN
fxN (xN1 )
1
∏ μ (dx j )
1≤ j≤N
0
≥2 −N(h+ε )
SεN
∏ μ (dx j ) = 2−N(h+ε ) μ (N) (SεN ),
1≤ j≤N
4.2 The asymptotic equipartition property in continuous time setting 399
giving the upper bound (4.2.4). On the other hand, given δ > 0, we can take N
large so that P(SεN ) ≥ 1 − δ , in which case, for 0 < ε < h,
1 − δ ≤ P(SεN )
0
=
SεN
fxN (xN1 )
1
∏ μ (dx j )
1≤ j≤N
0
≤2 −N(h−ε )
∏
SεN 1≤ j≤N
μ (dx j ) = 2−N(h−ε ) μ (N) (SεN ).
The next step is to extend the asymptotic equipartition property to joint distri-
butions of pairs XN1 , YN1 (in applications, XN1 will play a role of an input and YN1
of an output of a channel). Formally, given two sequences of random variables,
X1 , X2 , . . . and Y1 ,Y2 , . . ⎛
., for⎞
all N = 1, 2, .⎛
. ., consider
⎞ the joint distribution of the
X1 Y1
⎜ ⎟ ⎜ ⎟
random vectors XN1 = ⎝ ... ⎠ and YN1 = ⎝ ... ⎠ which is determined by a (joint)
XN YN
PMF fxN ,YN with respect to measure μ (N) × ν (N) where μ (N) = μ × · · · × μ and
1 1
ν (N) = ν × · · · × ν (N factors in both products). Let fXN1 and fYN1 stand for the
(joint) PMFs of vectors XN1 and YN1 , respectively.
As in Worked Example 4.2.2, we suppose that the statements of the Shannon–
McMillan–Breiman theorem hold true, this time for the pair (XN1 , YN1 ) and each of
XN1 and YN1 : as N → ∞,
1 1
− log fXN (XN1 ) → h1 , − log fYN (YN1 ) → h2 ,
N 1 N 1
in probability,
1
− log fXN ,YN (X1 , Y1 ) → h,
N N
N 1 1
h1 + h2 ≥ h; (4.2.6)
and
⎛ ⎞
y1
⎜ ⎟
yN1 = ⎝ ... ⎠ .
yN
Formally,
%
1
TεN= (xN1 , yN1 ) : − ε ≤ log fxN (xN1 ) + h1 ≤ ε ,
N 1
1
− ε ≤ log fYN (yN1 ) + h2 ≤ ε ,
N 1
K
1
− ε ≤ log fxN ,YN (x1 , y1 ) + h ≤ ε ; (4.2.7)
N N
N 1 1
N
by the above assumption we have that lim P Tε = 1 for all ε > 0. Next, define
N→∞
the volume of set TεN :
N 0
μ (N)
×ν (N)
Tε = μ (N) (dxN1 )ν (N) (dyN1 ).
TεN
Finally, consider an independent pair XN1 , YN1 where component XN1 has the same
PMF as XN1 and YN1 the same PMF as YN1 . That is, the joint PMF for XN1 and YN1
has the form
fXN ,YN (xN1 , yN1 ) = fXN (xN1 ) fYN (yN1 ). (4.2.8)
1 1 1 1
Next, we assess the volume of set TεN and then the probability that xN1 , YN1 ∈
TεN .
Worked Example 4.2.3 (A general joint asymptotic equipartition property)
(I) The volume of the typical set has the following properties:
μ (N) × ν (N) TεN ≤ 2N(h+ε ) , for all ε and N, (4.2.9)
and, for all δ > 0 and 0 < ε < h, for N large enough, depending on δ ,
μ (N) × ν (N) TεN ≥ (1 − δ )2N(h−ε ) . (4.2.10)
(II) For the independent pair XN1 , YN1 ,
P XN1 , YN1 ∈ TεN ≤ 2−N(h1 +h2 −h−3ε ) , for all ε and N, (4.2.11)
Solution (I) Completely follows the proofs of (4.2.4) and (4.2.5) with integration
of fxN ,YN .
1 1
(II) For the probability P XN1 , YN1 ∈ TεN we obtain (4.2.11) as follows:
0
P XN1 , YN1 ∈ TεN = fxN ,YN μ (dxN1 )ν (dyN1 )
TεN 1 1
by definition
0
= fxN (xN1 ) fYN (yN1 )μ (dxN1 )ν (dyN1 )
1 1
TεN
substituting (4.2.8)
0
≤ 2−N(h1 −ε ) 2−N(h2 −ε ) μ (dxN1 )ν (dyN1 )
TεN
according to (4.2.7)
Finally, by reversing the inequalities in the last two lines, we can cast them as
0
≥ 2−N(h1 +ε ) 2−N(h2 +ε ) μ (dxN1 )ν (dyN1 )
TεN
according to (4.2.7)
Formally, we assumed here that 0 < ε < h (since it was assumed in (4.2.10)), but
increasing ε only makes the factor 2−N(h1 +h2 −h+3ε ) smaller. This proves bound
(4.2.12).
where c > 0 is a constant. Recall that fXN ,YN represents the joint PMF while fXN
1 1 1
and fxN individual PMFs for the random input and output vectors xN and YN , with
1
respect to reference measures μ (N) and ν (N) :
fXN ,YN (xN1 , yN1 ) = fXN (xN1 ) fch (yN1 |xN1 sent ),
1 1 0 1
fYN (YN1 ) = fXN ,YN (xN1 , yN1 )μ (N) dxN1 .
1 1 1
−N(c−ε )
≤2 fXN ,YN (xN1 , yN1 )μ (dxN1 )ν (dyN1 )
1 1
TεN
−N(c−ε )
N N
=2 P X1 , Y1 ∈ TεN
≤ 2−N(c−ε ) .
4.2 The asymptotic equipartition property in continuous time setting 403
The first equality is by definition, the second step follows by substituting (4.2.8),
the third is by direct calculation, and the fourth because of the bound (4.2.14).
Finally, by reversing the inequalities in the last two lines, we obtain the bound
(4.2.16):
0
≥ 2−N(c+ε ) fXN ,YN (xN1 , yN1 )μ (dxN1 )ν (dyN1 )
1 1
TεN
−N(c+ε )
N N
=2 P X1 , Y1 ∈ TεN ≥ 2−N(c+ε ) (1 − δ ),
the first inequality following because of (4.2.14).
Worked Example 4.2.5 Let x = {X(1), . . . , X(n)}T be a given vector/collection
of random variables. Let us write x(C) for subcollection {X(i) : i ∈ C} where C is
a non-empty subset in the index set {1, . . . , n}. Assume that the joint distribution
for any subcollection x(C) with C = k, 1 ≤ k ≤ n, is given by a joint PMF fx(C)
relative to measure μ × · · · × μ (k factors, each corresponding
⎛ ⎞ to a random variable
x(1)
⎜ ⎟
X(i) with i ∈ C). Similarly, given a vector x = ⎝ ... ⎠ of values for x, denote by
x(n)
x(C) the argument {x(i) : i ∈ C} (the sub-column in x extracted by picking the rows
with i ∈ C). By the Gibbs inequality, for all partitions {C1 , . . . ,Cs } of set {1, . . . , n}
into non-empty disjoint subsets C1 , . . . ,Cs (with 1 ≤ s ≤ n), the integral
0
fxn1 (x)
fx(C1 ) (x(C1 )) . . . fx(Cs ) (x(Cs )) 1≤∏
fx (x) log μ (dx( j)) ≥ 0. (4.2.17)
j≤n
What is the partition for which the integral in (4.2.17) attains its maximum?
Let {C1 , . . . ,Cs } be any partition of {1, . . . , n}. Multiply and divide the fraction
under the log by the product of joint PMFs ∏ fx(Cl ) (x(Cl )). Then the integral
1≤i≤s
(4.2.18) is represented as the sum
0
fxn1 (x)
fx (x) log ∏ μ (dx( j)) + terms ≥ 0.
∏ fx(Ci ) (x(Ci )) 1≤ j≤n
1≤i≤s
Here x(C) = {X(i) : i ∈ C}, x(C) = {X(i) : i ∈ C}, and E I(x(C : Y )|x(C) stands
for the expectation of I(x(C : Y ) conditional on the value of x(C). Prove that this
sum does not depend on the choice of set C.
1
PeX , X = ∑
1 2
P error in channel 1 or 2| xk1 xl2 sent .
M1 M2 1≤k≤M1 ,1≤l≤M2
direct part.
4.2 The asymptotic equipartition property in continuous time setting 405
The proof of the inverse is more involved and we present only a sketch, referring
the interested reader to [174]. The idea is to apply the so-called list decoding:
suppose we have a code Y of size M and a decoding rule d = d Y . Next, given
that a vector y has been received at the output port of a channel, a list of L possible
code-vectors from Y has to be produced, by using a decoding rule d = dlist Y , and the
decoding (based on rule d) is successful if the correct word is in the list. Then, for
the average error-probability Pe = PeY (d) over code Y , the following inequality is
satisfied:
Pe ≥ Pe ( d ) PeAV (L, d) (4.2.20)
where the error-probability Pe (d) = PeY (d) refers to list decoding and PeAV (L, d) =
PeAV (Y , L, d) stands for the error-probability under decoding rule d averaged over
all subcodes in Y of size L.
Now, going back to the product-channel with marginal capacities C1 and C2 ,
choose R > C1 +C2 , set η = (R −C1 −C2 )/2 and let the list size be L = eRL τ , with
RL = C2 + η . Suppose we use a code Y of size eRτ with a decoding rule d and a
list decoder d with the list-size L. By using (4.2.20), write
and use the facts that RL > C2 and the value PeAV (eRL τ , d) is bounded away from
zero. The assertion of the inverse part follows from the following observation dis-
cussed in Worked Example 4.2.8. Take R2 < R− RL and consider subcodes L ⊂ Y
of size L = eR2 τ . Suppose we choose subcode L at random, with equal proba-
bilities. Let M2 = eR2 τ and PeY ,M2 (d) stand for the mean error-probability averaged
over all subcodes L ⊂ Y of size L = eR2 τ . Then
where ε (τ ) → 0 as τ → ∞.
Worked Example 4.2.8 Let L = eRL τ and M = eRτ . We aim to show that if
R2 < R − RL and M2 = eR2 τ then the following holds. Given a code X of size M ,
a decoding rule d and a list decoder d with list-size L, consider the mean error-
probability PeX ,M2 (d) averaged over the equidistributed subcodes S ⊂ X of size
S = M2 . Then PeX ,M2 (d) and the list-error-probability PeX (d) satisfy
where ε (τ ) → 0 as τ → ∞.
Solution Let X , S and d be as above and suppose we use a list decoder d with
list-length L.
406 Further Topics from Information Theory
Inequality (4.2.24) is valid for any subcode S . We now select S at random from
X choosing each subcode of size M2 with equal probability. After averaging over
all such subcodes we obtain a bound for the averaged error-probability PeX ,M2 =
PeX ,M2 (d):
1 M2 C DX ,M2
PeX ,M2 ≤ PeX (d) + ∑∑∑
M2 k=1
p(L |x )E
k • (L , x |x
j k ) (4.2.25)
L j=k
I JX ,M2
where means the average over all selections of subcodes. As x j and xk
are chosen independently,
C DX ,M2 C DX ,M2 C DX ,M2
p(L |xk )E 2 (L , x j ) = p(L |xk ) E•2 (L , x j ) .
Next,
C DX ,M2 1 C DX ,M2 L
p(L |xk ) = ∑ M
p(L |x), E•2 (L , x j |xk ) = ,
M
x∈X
and we obtain
1 M2 1
L
PeX ,M2 ≤ PeX (d) + ∑∑ ∑ M
M2 k=1
p(L |x) ∑M
L x∈X j=k
4.2 The asymptotic equipartition property in continuous time setting 407
which implies
M2 L
PeX ,M2 ≤ PeX (d) + . (4.2.26)
M
We now give the proof of Theorem 4.2.1. Consider the sequence of kth-order
Markov approximations of a process X, by setting
n−1
p(k) X0n−1 = pX k−1 X0k−1 ∏ p Xi |Xi−k
i−1
. (4.2.27)
0
i=k
Set also
−1
−1
H (k) = E − log p X0 |X−k = h(X0 |X−k ) (4.2.28)
and
−1
−1
H = E − log p X0 |X−∞ = h(X0 |X−∞ ). (4.2.29)
The proof is based on the following three results: Lemma 4.2.9 (the sandwich
lemma), Lemma 4.2.10 (a Markov approximation lemma) and Lemma 4.2.11 (a
no-gap lemma).
1 p(k) (X0n−1 )
lim sup log ≤ 0 a.s., (4.2.30)
n→∞ n p(X0n−1 )
1 p(X0n−1 )
lim sup log −1
≤ 0 a.s. (4.2.31)
n→∞ n p(X0n−1 |X−∞ )
= ∑ p(k) (x0n−1 )
x0n−1 ∈An
= p(k) (A) ≤ 1.
408 Further Topics from Information Theory
−1
Similarly, if Bn = Bn (X−∞ ) is a support event for pX n−1 |X −1 (i.e. P(X0n−1 ∈
0 −∞
−1
Bn |X−∞ ) = 1), write
p(X0n−1 ) p(x0n−1 )
X−∞ ∑
n−1 −1
E −1
= E −1 p(x0 |X−∞ ) −1
p(X0n−1 |X−∞ ) xn−1 ∈B
p(x0n−1 |X−∞ )
0 n
= EX −1
−∞
∑ p(x0n−1 )
x0n−1 ∈Bn
= EX −1 P(Bn ) ≤ 1.
−∞
a.s.
1 −1
− log p(X0n−1 |X−∞ ) ⇒ H. (4.2.33)
n
−1 −1
Proof Substituting f = − log p(X0 |X−k ) and f = − log p(X0 |X−∞ ) into
Birkhoff’s ergodic theorem (see for example Theorem 9.1 from [36]) yields
a.s.
1 1 1 n−1
− log p(k) (X0n−1 ) = − log p(X0k−1 ) − ∑ log p(k) (Xi |Xi−k
i−1
) ⇒ 0 + H (k)
n n n i=k
(4.2.34)
and
a.s.
1 1 n−1
−1
− log p(X0n−1 |X−∞ ) = − ∑ log p(Xi |X−∞
i−1
) ⇒ H, (4.2.35)
n n i=0
respectively.
So, by Lemmas 4.2.9 and 4.2.10,
1 1 1 1
lim sup log n−1
≤ lim log (k) n−1 = H (k) , (4.2.36)
n→∞ n p(X0 ) n→∞ n p (X0 )
4.3 The Nyquist–Shannon formula 409
and
1 1 1 1
lim inf log ≥ lim log n−1 −1
= H, )
n→∞ n n−1
p(X0 ) n→∞ n p(X0 |X−∞ )
which we rewrite as
1 1
H ≤ lim inf − log p(X0n−1 ) ≤ lim sup − log p(X0n−1 ) ≤ H (k) . (4.2.37)
n→∞ n n→∞ n
−1
a.s. −1
p X0 = x0 |X−k ⇒ p X0 = x0 |X−∞ , k → ∞. (4.2.38)
As the set of values I is supposed to be finite, and the function p ∈ [0, 1] → −p log p
is bounded, the bounded convergence theorem gives that as k → ∞,
The setting is as follows. Fix numbers τ , α , p > 0 and assume that every τ seconds
a coder produces a real code-vector
⎛ ⎞
x1
⎜ .. ⎟
x=⎝ . ⎠
xn
where n = ατ . All vectors x generated by the coder lie in a finite set X = Xn ⊂
Rn of cardinality M ∼ 2Rb τ = eRn τ (a codebook); sometimes we write, as before,
XM,n to stress the role of M and n. It is also convenient to list the code-vectors
from X as x(1), . . . , x(M) (in an arbitrary order) where
⎛ ⎞
x1 (i)
⎜ ⎟
x(i) = ⎝ ... ⎠ , 1 ≤ i ≤ M.
xn (i)
The instantaneous signal power at time t is associated with |x(t)|2 ; then the square-
1
norm ||x||2 = 0τ |x(t)|2 dt = ∑ |xi |2 will represent the full energy of the signal in
1≤i≤n
the interval [0, τ ]. The upper bound on the total energy spent on transmission takes
the form
√
||x||2 ≤ pτ , or x ∈ Bn ( pτ ). (4.3.3)
(In the theory of waveguides, the dimension n is called the Nyquist number and the
value W = n/(2τ ) ∼ α /2 the bandwidth of the channel.)
The code-vector x(i) is sent through an additive channel, where the receiver gets
the (random) vector
⎛ ⎞
Y1
⎜ .. ⎟
Y = ⎝ . ⎠ where Yk = xk (i) + Zk , 1 ≤ k ≤ n. (4.3.4)
Yn
4.3 The Nyquist–Shannon formula 411
is a vector with IID entries Zk ∼ N(0, σ 2 ). (In applications, engineers use the rep-
1
resentation Zi = 0τ Z(t)φi (t)dt, in terms of a ‘white noise’ process Z(t).)
√
From the start we declare that if x(i) ∈ X \ Bn ( pτ ), i.e. ||x(i)||2 > pτ , the
output signal vector Y is rendered ‘non-decodable’. In other words, the probability
of correctly decoding the output vector Y = x(i) + Z with ||x(i)||2 > pτ is taken to
be zero (regardless of the fact that the noise vector Z can be small and the output
vector Y close to x(i), with a positive probability).
Otherwise, i.e. when ||x(i)||2 ≤ pτ , the receiver applies, to the output vector Y,
a decoding rule d(= dn,X ), i.e. a map y ∈ K → d(y) ∈ X where K ⊂ Rn is a
‘decodable domain’ (where map d had been defined). In other words, if Y ∈ K
then vector Y is decoded as d(Y) ∈ X . Here, an error arises either if Y ∈ K or if
d(Y) = x(i) given that x(i) was sent. This leads to the following formula for the
probability of erroneously decoding the input code-vector x(i):
⎧
⎨1, ||x(i)||2 > pτ ,
Pe (i, d) =
(4.3.5)
⎩Pch Y ∈ K or d(Y) = x(i)|x(i) sent , ||x(i)||2 ≤ pτ .
The average error-probability Pe = PeX ,av (d) for the code X is then defined by
1
Pe = ∑ Pe (i, d).
M 1≤i≤M
(4.3.6)
Furthermore, we say that Rbit (or Rnat ) is a reliable transmission rate (for given
α and p) if for all ε > 0 we can specify τ0 (ε ) > 0 such that for all τ > τ0 (ε )
there exists a codebook X of size X ∼ eRnat τ and a decoding rule d such that
Pe = PeX ,av (d) < ε . The channel capacity C is then defined as the supremum of all
reliable transmission rates, and the argument from Section 4.1 yields
α p
C= ln 1 + (in nats); (4.3.7)
2 ασ 2
cf. (4.1.17). Note that when α → ∞, the RHS in (4.3.7) tends to p/(2σ 2 ).
412 Further Topics from Information Theory
.
..
.. ...
0 ....... ... .. ....... ... .. . . .. .. . . T
. . . .. .... . .
. .. .. . .. . . .
.. .
.. ..
. .
. .. . . . . . ... ... .
.. .. .. . .....
.... .
.
Figure 4.4
Here
sin (π (2Wt − k))
sinc (2Wt − k) = (4.3.11)
π (2Wt − k)
Example 4.3.1 (The Fourier transform in Ł2 ) Recall that the Fourier transform
1
φ → Fφ of an integrable function φ (i.e. a function with |φ (x)|dx < +∞) is de-
fined by
0
Fφ (ω ) = φ (x)eiω x dx, ω ∈ R. (4.3.13)
K(t−s)
W=2
W=1
W=0.5
−4
−2
t−s
0
2
4
Figure 4.5
4.3 The Nyquist–Shannon formula 415
Furthermore, the Fourier transform can be defined for generalised functions too;
see again [127]. In particular, the equations similar to (4.3.13)–(4.3.14) for the
delta-function look like this:
0 0
1 −iω t
δ (t) = e dω , 1 = δ (t)eit ω dt, (4.3.17)
2π
implying that the Fourier transform of the Dirac delta is δ(ω ) ≡ 1. For the shifted
delta-function we obtain
0
k 1
δ t− = eikω /(2W ) e−iω t dω . (4.3.18)
2W 2π
Solution The shortest way to see this is to write the Fourier-decomposition (in
Ł2 (R)) implied by (4.3.19):
√ 0 2π W
1
2 π W sinc (2Wt − k) = √ eikω /(2W ) e−it ω dω (4.3.21)
2 πW −2π W
where
'
1, k = k ,
δkk =
0, k = k ,
and functions fi have been introduced in (4.3.10). Thus, the power constraint can
be written as
|| fi ||2 ≤ pτ /4π W = p0 . (4.3.24)
In fact, the coefficients xk (i) coincide with the values fi (k/(2W )) of function fi
calculated at time points k/(2W ), k = 1, . . . , n; these points can be referred to as
‘sampling instances’.
Thus, the input signal fi (t) develops in continuous time although it is completely
specified by its values fi (k/(2W )) = xk (i). Thus, if we think that different signals
are generated in disjoint time intervals (0, τ ), (τ , 2τ ), . . ., then, despite interference
caused by infinite tails of the function sinc(t), these signals are clearly identifiable
through their values at sampling instances.
The Nyquist–Shannon assumption is that signal fi (t) is transformed in the chan-
nel into
g(t) = fi (t) + Z (t). (4.3.25)
4.3 The Nyquist–Shannon formula 417
Here Z (t) is a stationary continuous-time Gaussian process with the zero mean
(EZ (t) ≡ 0) and the (auto-)correlation function
The average error-probability Pe = PeX ,av (d) for code X (and decoder d) equals
1
Pe = ∑ Pe (i, d).
M 1≤i≤M
(4.3.30)
Value R(= Rnat ) is called a reliable transmission rate if, for all ε > 0, there exists τ
and a code X of size M ∼ eRτ such that Pe < ε .
Now fix a value η ∈ (0, 1). The class A (τ ,W, p0 ) = A (τ ,W, p0 , η ) is defined as
the set of functions f ◦ (t) such that
(i) f ◦ = Dτ f where
Theorem 4.3.3 The capacity C = C(η ) of the above channel with constraint
domain A (τ ,W, p0 , η ) described in conditions (i)–(iii) above is given by
p0 η p0
C = W ln 1 + + . (4.3.31)
2σ02W 1 − η σ02
As η → 0,
p0
C(η ) → W ln 1 + (4.3.32)
2σ02W
Before going to (quite involved) technical detail, we will discuss some facts rele-
vant to the product, or parallel combination, of r time-discrete Gaussian channels.
(In essence, this model was discussed at the end of Section 4.2.) Here, every τ time
units, the input signal is generated, which is an ordered collection of vectors
⎛ ⎞
( j)
x1
* (1) + ⎜ .. ⎟
x , . . . , x(r) where x( j) = ⎜ ⎟
⎝ . ⎠ ∈ R , 1 ≤ j ≤ r,
nj
(4.3.33)
( j)
xn j
and n j = α j τ with α j being a given value (the speed of the digital production
from coder j). For each vector x( j) we consider a specific power constraint:
O O2
O ( j) O
Ox O ≤ p( j) τ , 1 ≤ j ≤ r. (4.3.34)
each of which has the same structure as in (4.3.33).*As before, a + decoder d is a map
acting on a given set K of sample output signals y(1) , . . . , y(r) and taking these
signals to X .
As above, for i = 1, . . . , M,we define the error-probability
Pe (i, d) for code X
(1) (r)
when sending an input signal x (i), . . . , x (i) :
2
Pe (i, d) = 1, if x( j) (i) ≥ p( j) τ for some j = 1, . . . , r,
and
* +
Pe (i, d) = Pch Y(1) , . . . , Y(r) ∈ K or
* + * +
d Y(1) , . . . , Y(r) = x(1) (i), . . .
, x(r) (i)
* +
| x(1) (i), . . . , x(r) (i) sent ,
2
if x( j) (i) < p( j) τ for all j = 1, . . . , r.
The average error-probability Pe = PeX ,av (d) for code X (while using decoder d)
is then again given by
1
Pe = ∑ Pe (i, d).
M 1≤i≤M
As usual, R is said to be a reliable transmission rate if for all ε > 0 there exists a
τ0 > 0 such that for all τ > τ0 there exists a code X of cardinality M ∼ eRτ and a
decoding rule d such that Pe < ε . The capacity of the combined channel is again de-
fined as the supremum of all reliable transmission rates. In Worked Example 4.2.7
the following fact has been established (cf. Lemma A in [173]; see also [174]).
Moreover, (4.3.37) holds when some of the2 α j equal +∞: in this case the corre-
sponding summand takes the form p 2σ j .
( j)
4.3 The Nyquist–Shannon formula 421
Case II. Here we take r = 3 and assume that σ12 = σ22 ≥ σ32 and α3 = +∞. The
requirements are now that
and
x(3) 2 ≤ β ∑ x( j) 2 . (4.3.39b)
1≤ j≤3
Case III. As in Case I, take r = 2 and assume σ12 = σ22 = σ02 . Further, let α2 = +∞.
The constraints now are
x(1) 2 < p0 τ (4.3.40a)
and
x(2) 2 < β x(1) 2 + x(2) 2 . (4.3.40b)
Worked Example 4.3.5 (cf. Theorem 1 in [173]). We want to prove that the
capacities of the above combined parallel channels of types I–III are as follows.
Case I, α1 ≤ α2 :
α1 (1 − ζ )p0 α2 ζ p0
C= ln 1 + + ln 1 + (4.3.41a)
2 α1 σ02 2 α2 σ02
where
α2
ζ = min β , . (4.3.41b)
α1 + α2
422 Further Topics from Information Theory
Case II:
α1 (1 − β )p0
C= ln 1 +
2 (α1 + α2 )σ12
α2 (1 − β )p0 βp
+ ln 1 + + 2. (4.3.42)
2 ( α 1 + α2 ) σ1
2 2σ3
Case III:
α1 p0 β p0
C= ln 1 + + . (4.3.43)
2 α 1 σ0
2 2(1 − β )σ02
Solution We present the proof for Case I only. For definiteness, assume that α1 <
α2 ≤ ∞. First, the direct part. With p1 = (1 − ζ )p0 , p2 = ζ p0 , consider the parallel
combination of two channels, with individual power constraints on the input signals
x(1) and x(2) :
O O2 O O2
O (1) O O O
Ox O ≤ p1 τ , Ox(2) O ≤ p2 τ . (4.3.44a)
are sent through their respective parts of the parallel-channel combination, which
results in output vectors
⎛ (1) ⎞ ⎛ (2) ⎞
Y1 (i) Y1 (i)
⎜ .. ⎟ ⎜ .. ⎟
Y(1) = ⎜
⎝ . ⎟ ∈ Rα1 τ (l) , Y(2) = ⎜
⎠ ⎝ . ⎟ ∈ Rα2 τ (l)
⎠
(1) (2)
Yα τ (l) (i) Yα τ (l) (i)
1 2
* +
forming the combined output signal Y = Y(1) , Y(2) . The entries of vectors Y(1)
and Y(2) are sums
(1) (1) (1) (2) (2) (2)
Yj = x j (i) + Z j , Yk = xk (i) + Zk ,
(1) (2)
where Z j and Zk are IID, N(0, σ02 ) random variables. Correspondingly, Pch
(1) (2)
refers to the joint distribution of the random variables Y j and Yk , 1 ≤ j ≤
α1 τ (l) , 1 ≤ k ≤ α2 τ (l) .
Observe that function q → C1 (q) is uniformly continuous in q on [0, p0 ]. Hence,
we can find an integer J0 large enough such that
C1 (q) −C1 q − ζ p0 < ε , for all q ∈ (0, ζ p0 ).
J0 2
424 Further Topics from Information Theory
(l)
Then we partition the code X (l) into J0 classes (subcodes) X j , j = 1, . . . , J0 : a
(l)
code-vector x(1) (i), x(2) (i) falls in class X j if
ζ p0 τ ζ p0 τ
∑
(2)
( j − 1) < (xk )2 ≤ j . (4.3.45a)
J0 1≤k≤α τ (l)
J0
2
Here and below we refer to the definition of Ci (u) given in (4.3.44b), i.e.
ε
R∗ ≤ C1 ((1 − δ )p0 ) +C2 (δ p0 ) + (4.3.46)
2
where δ = j0 ζ /J0 .
Now note that, for α2 ≥ α1 , the function
That is, functions ψn (t)0 are the eigenfunctions, with the eigenvalues λn , of the
integral operator ϕ → ϕ (s)K( · , s) ds with the integral kernel
K(t, s) = 1(|s| < τ /2)(2W ) sinc 2W (t − s)
sin(2π W (t − s))
= 1(|s| < τ /2) , −τ /2 ≤ s;t ≤ τ /2.
π (t − s)
(d) The eigenvalues λn satisfy the condition
0 τ /2
λn = ψn (t)2 dt with 1 > λ1 > λ2 > · · · > 0.
−τ /2
An equivalent formulation
0 can be given in terms involving the Fourier trans-
forms [Fψn◦ ] (ω ) = ψn◦ (t)eit ω dt:
0 2π W 0 τ /2
1
| [Fψn◦ ] (ω ) |2 dω |ψn (t)|2 dt = λn ,
2π −2π W −τ /2
which means that λn gives a ‘frequency concentration’ for the truncated func-
tion ψn◦ .
(e) It can be checked that functions ψn (t) (and hence numbers λn ) depend on W
and τ through the product W τ only. Moreover, for all θ ∈ (0, 1), as W τ → ∞,
λ2W τ (1−θ ) → 1, and λ2W τ (1+θ ) → 0. (4.3.48c)
That is, for τ large, nearly 2W τ of values λn are close to 1 and the rest are close
to 0.
where ψ1 (t), ψ2 (t), . . . are the PSWFs discussed in Worked Example 4.3.9 below
and A1 , A2 , . . . are IID random variables with An ∼ N(0, λn ) where
√ λn are the cor-
responding eigenvalues. Equivalently, one writes Z(t) = ∑n≥1 λn ξn ψn (t) where
ξn ∼ N(0, 1) IID random variables.
The proof of this fact goes beyond the scope of this book, and the interested
reader is referred to [38] or [103], p. 144.
4.3 The Nyquist–Shannon formula 427
The idea of the proof of Theorem 4.3.3 is as follows. Given W and τ , an input
signal s◦ (t) from A (τ ,W, p0 , η ) is written as a Fourier series in the PSWFs ψn .
In this series, the first 2W τ summands represent the part of the signal confined
between the frequency band-limits ±2π W and the time-limits ±τ /2. Similarly, the
noise realisation Z(t) is decomposed in a series in functions ψn . The action of the
continuous-time channel is then represented in terms of a parallel combination of
two jointly constrained discrete-time Gaussian channels. Channel 1 deals with the
first 2W τ PSWFs in the signal decomposition and has α1 = 2W . Channel 2 receives
the rest of the expansion and has α2 = +∞. The power constraint s2 ≤ p0 τ leads
to a joint constraint, as in (4.3.38a). In addition, a requirement emerges that the
energy allocated outside the frequency band-limits ±2π W or time-limits ±τ /2 is
small: this results in another power constraint, as in (4.3.38b). Applying Worked
Example 4.3.5 for Case I results in the assertion of Theorem 4.3.3.
To make these ideas precise, we first derive Theorem 4.3.7 which gives an al-
ternative approach to the Nyquist–Shannon formula (more complex in formulation
but somewhat simpler in the (still quite lengthy) proof).
Theorem 4.3.7 Consider the following modification of the model from Theorem
4.3.3. The set of allowable signals A2 (τ ,W, p0 , η ) consists of functions t ∈ R →
s(t) such that
0
(1) s2 = |s(t)|2 dt ≤ p0 τ,
0
(2) the Fourier transform [Fs](ω ) = s(t)eit ω dt vanishes when |ω | > 2π W , and
0 τ /2
(3) the ratio |s(t)|2 dt s2 > 1 − η. That is, the functions s ∈
−τ /2
A (τ ,W, p0 , η ) are ‘sharply band-limited’ in frequency and ‘nearly localised’
in time.
The noise process is Gaussian, with the spectral density vanishing when |ω | > 2πW
and equal to σ02 for |ω | ≤ 2πW .
Then the capacity of such a channel is given by
p0 η p0
C = Cη = W ln 1 + (1 − η ) 2 + 2. (4.3.50)
2σ0 W 2σ0
As η → 0,
p0
Cη → W ln 1 +
2σ02W
cf. (4.3.41a). We want to construct codes and decoding rules for the time-
continuous version of the channel,
(1) (2)yielding
asymptotically vanishing probability
of error as τ → ∞. Assume x , x is an allowable input signal for the parallel
pair of discrete-time channels with parameters given in (4.3.53). The input for the
time-continuous channel is the following series of (W, τ ) PSWFs:
∑ ∑
(1) (2)
s(t) = xk ψk (t) + xk ψk+α1 τ (t). (4.3.54)
1≤k≤α1 τ 1≤k<∞
The first fact to verify is that the signal in (4.3.54) belongs to A2 (τ ,W, p0 , η ), i.e.
satisfies conditions (1)–(3) of Theorem 4.3.7.
To check property (1), write
2
2 O O2 O O2
O O O O
∑ ∑ xk = Ox(1) O + Ox(2) O ≤ p0 τ .
(1) (2)
s2 = xk +
1≤k≤α1 τ 1≤k<∞
Next, the signal s(t) is band-limited, inheriting this property from the PSWFs
ψk (t). Thus, (2) holds true.
A more involved argument is needed to establish property (3). Because the
PSWFs ψk (t) are orthogonal in Ł2 [−τ /2, τ /2] (cf. (4.3.48a)), and using the mono-
tonicity of the values λn (cf. (4.3.48b)), we have that
0 τ /2
(1 − Dτ )s||2
1− |s(t)| dt s2 =
2
−τ /2 ||s||2
2
2
(1) (2)
(1 − λk ) xk (1 − λk+α1 τ ) xk
= ∑ OO (1) OO2 OO (2) OO2 + ∑ OO (1) OO2 OO (2) OO2
1≤k≤α1 τ x + x 1≤k<∞ x + x
O (1) O2 O (2) O2
Ox O Ox O
≤ 1 − λα1 τ O O2 O O2 + O O2 O O2 .
Ox(1) O + Ox(2) O Ox(1) O + Ox(2) O
4.3 The Nyquist–Shannon formula 429
O (1) O2
Ox O
1 − λα1 τ O O2 O O2 ≤ ξ .
Ox(1) O + Ox(2) O
O O2 O (1) O2 O (2) O2
Next, the ratio Ox(2) O Ox O + Ox O ≤ η − ξ (referring to (4.3.38b)). This
finally yields
0 τ /2
(1 − Dτ )s||2
1− |s(t)| dt s2 =
2
≤ ξ + η − ξ = η,
−τ /2 ||s||2
i.e. property (3).
Further, the noise can be expanded in accordance with Karhunen–Loève:
∑ ∑
(1) (2)
Z(t) = Zk ψk (t) + Zk ψk+α1 τ (t). (4.3.55)
1≤k≤α1 τ 1≤k<∞
( j)
Here again, ψk (t) are the PSWFs and IID random variables Zk ∼ N(0, λk ). Cor-
respondingly, the output signal is written as
∑ ∑
(1) (2)
Y (t) = Yk ψk (t) + Yk ψk+α1 τ (t) (4.3.56)
1≤k≤α1 τ 1≤k<∞
where
( j) ( j) ( j)
Yk = xk + Zk , j = 1, 2, k ≥ 1. (4.3.57)
So, the continuous-time channel is equivalent to a jointly constrained parallel com-
bination. As we checked, the capacity equals C∗ specified in (4.3.52). Thus, for
R < C∗ we can construct codes of rate R and decoding rules such that the error-
probability tends to 0.
For the converse, assume that there exists a sequence τ (l) → ∞, a sequence of
(l)
transmissible domains A2 (τ (l) ,W, p0 , η (l) ) described in (1)–(3) and a sequence
of codes X (l) of size M = eRτ where
(l)
(1 − η )p0 η p0
R > W ln 1 + + 2 .
2W σ0 2 σ0
As usual, we want to show that the error-probability PeX ,av (d (l) ) does not tend to
(l)
0.
As before, we take δ > 0 and ξ ∈ (0, 1 − η ) to ensure that R > C∗ where
∗ (1 − η − ξ ) p0 η p0
C = W (1 + δ ) ln 1 + + .
(1 − ξ ) 2W σ0 (1 + δ )
2 (1 − ξ )σ02
430 Further Topics from Information Theory
Then, as in the argument on the direct half, C∗ is the capacity of the type I jointly
constrained parallel combination of channels with
η
β= , σ 2 = σ02 , p = p0 , α1 = 2W (1 + δ ), α2 = +∞. (4.3.58)
1−ξ
(l)
Let s(t) ∈ X (l) ∩ A2 (τ (l) ,W, p0 , η (l) ) be a continuous-time code-function.
Since the PSWFs ψk (t) form an ortho-basis in Ł2 (R), we can decompose
∑ ∑
(1) (2)
s(t) = xk ψk (t) + xk ψk+α1 τ (l) (t),t ∈ R. (4.3.59)
1≤k≤α1 τ (l) 1≤k<∞
We want to show that the discrete-time signal x = x(1) , x(2) represents an al-
lowable input to the type I jointly constrained parallel combination specified in
(4.3.38a–c). By orthogonality of PSWFs ψk (t) in Ł2 (R) we can write
x2 = ||s||2 ≤ p0 τ (l)
ensuring that condition (4.3.38a) is satisfied. Further, using orthogonality of PSW
functions ψk (t) in Ł2 (−τ /2, τ /2) and the fact that the eigenvalues λk decrease
monotonically, we obtain that
0 τ (l) /2
(1 − Dτ (l) ) s2
1− |s(t)|2 dt s2 =
−τ (l) /2 ||s||2
2
2
(1) (2)
(1 − λk ) xk 1 − λk+α1 τ (l) xk
= ∑ + ∑
1≤k≤α1 τ (l) x2 1≤k<∞ x2
O O
Ox(2) O 2
≥ 1 − λα1 τ (l) .
x2
By virtue of (4.3.48c), λα1 τ (l) ≤ ξ for l large enough. Moreover, since 1 −
0 (l)
τ /2
|s(t)|2 dt s2 ≤ η , we can write
−τ (l) /2
O (2) O2
Ox O η
≤
x 2 1−ξ
and deduce property (4.3.38b).
Next, as in the direct half, we again use the Karhunen–Loève decomposition
of noise Z(t) to deduce that for each code for the continuous-time channel there
corresponds a code for the jointly constrained parallel combination of discrete-time
channels, with the same rate and error-probability. Since R is > C∗ , the capacity
of the discrete-time channel, the error-probability PeX ,av (d (l) ) remains bounded
(l)
where θ ∈ (0, 1) (cf. property (e) of PSWFs in Example 4.3.6) and ξ ∈ (0, η ) are
auxiliary values.
For the converse half we use the decomposition into two parallel channels, again
as in Case III, with
η
α1 = 2W (1 + θ ), α2 = +∞, p = p0 , σ 2 = σ02 , β = . (4.3.61)
1−ξ
Here, as before, value θ ∈ (0, 1) emerges from property (e) of PSWFs, whereas
value ξ ∈ (0, 1).
vanishes for |ω | > 2πW . Then, for all x ∈ R, function f can be uniquely recon-
structed from its values f (x + n/(2W )) calculated at points x + n/(2W ), where
n = 0, ±1, ±2. More precisely, for all t ∈ R,
n
sin [2π (Wt − n)]
f (t) = ∑ f . (4.3.62)
n∈Z1
2W 2π (Wt − n)
Solution Assume the function f ∈ Ł2 (R) and let f = F f ∈ L2 (R) be the Fourier
transform of f . (Recall that space Ł2 (R) consists of functions f on R with || f ||2 =
432 Further Topics from Information Theory
1 1
| f (t)|2 dt < +∞ and that for all f , g ∈ Ł2 (R), the inner product f (t)g(t)dt is
finite.) We shall see that if
0 t0 +τ /2 0 ∞
| f (t)| dt
2
| f (t)|2 dt = α 2 (4.3.63)
t0 −τ /2 −∞
and
0 2π W 0 ∞
|F f (ω )| dω 2
|F f (ω )|2 dω = β 2 (4.3.64)
−2π W −∞
Hence,
0
1 ||D f ||
Re f (t)g(t)dt ≤
|| f ||||g|| || f ||
which implies (4.3.70), by picking g = D f .
∞
Next, we expand f = ∑ an ψn , relative to the eigenfunctions of A. This yields
n=0
the formula
1/2
||D f || ∑n |an |2 λn
cos−1 = cos−1 . (4.3.71)
|| f || ∑n |an |2
The supremum of the RHS in f is achieved when an = 0 for n ≥ 1, and f = ψ0 .
We conclude that there exists the minimal angle between subspaces B and D, and
this angle is achieved on the pair f = ψ0 , g = Dψ0 , as required.
Next, we establish
Lemma 4.3.10 There exists a function f ∈ Ł2 such that || f || = 1, ||D f || = α and
||B f || = β if and only if α and β fall in one of the following cases (a)–(d):
(a) α = 0 and√0 ≤ β < 1;
(b) 0 < α < λ0 < 1 and 0 ≤ β ≤ 1;
√ √
−1 α + cos−1 β ≥ cos−1 λ ;
(c) λ0 ≤ α < 1 and cos
√ 0
(d) α = 1 and 0 < β ≤ λ0 .
Proof Given α ∈ [0, 1], let G (α ) be the family of functions f ∈ L2 with norms
|| f || = 1 and ||D f || = α . Next, determine β ∗ (α ) := sup f ∈G (α ) ||B f ||.
(a) If α = 0, the family G (0) can contain no function with β = B f = 1. Further-
more, if D f = 0 and B f = 1 for f ∈ B then f is analytic and f (t) = 0 for
|t| < τ /2, implying f ≡ 0. To show that G (0) contains functions with all values of
ψn − Dψn √
β ∈ [0, 1), we set fn = √ . Then the norm ||B fn || = 1 − λn . Since there
1 − λn
434 Further Topics from Information Theory
0 −p+π W
1/2
||Beipt
f || = |Fn (ω | dω
2
.
−p−π W
$ $
α 2 − λn ψ0 − λ0 − α 2 ψn
f= √ ,
λ0 − λn
f = a1 D f + a2 B f + g (4.3.72)
with g orthogonal to both D f and B f . Taking the inner product of the sum in the
RHS of (4.3.72), subsequently, with f , D f , B f and g we obtain four equations:
0
1 = a1 α 2 + a2 β 2 + g(t) f (t)dt,
0
α 2 = a1 α 2 + a2 B f (t)Dg(t)dt,
0
β 2 = a1 D f (t)B f (t)dt + a2 β 2 ,
0
f (t)g(t)dt = g2 .
0 0
α 2 + β 2 − 1 + ||g||2 = a1 D f (t)B f (t)dt + a2 B f (t)D f (t)dt.
4.3 The Nyquist–Shannon formula 435
1
By eliminating g(t) f (t)dt, a1 and a2 we find, for αβ = 0,
1 − α 2 − ||g||2
β2 = 0 β2
(β 2 − B f (t)D f (t)dt)
⎡ ⎤
0
⎢ 1 − α 2 − ||g||2 ⎥
+ ⎣1 − 0 B f (t)D f (t)dt ⎦
α 2 (β 2 − B f (t)D f (t)dt)
0
× D f (t)B f (t)dt
which is equivalent to
0
β − 2Re
2
D f (t)B f (t)dt
0 2
1
≤ −α + (1 − 2 2 D f (t)B f (t)dt
2
(4.3.73)
α β
0 2
1
− ||g|| 1 − 2 2 D f (t)B f (t)dt .
2
(4.3.74)
α β
In terms of the angle θ , we can write
0 0
αβ cos θ = Re D f (t)B f (t)dt ≤ D f (t)B f (t)dt ≤ αβ .
1.0
0.8
0.6
beta^2
0.4
0.2
W=0.5
W=1
0.0
W=2
alpha^2
Figure 4.6
Definition 4.4.1 (cf. PSE II, p. 211) Let μ be a measure on R with values μ (A)
for measurable subsets A ⊆ R. Assume that μ is (i) non-atomic and (ii) σ -finite,
i.e. (i) μ (A) = 0 for all countable sets A ⊂ R and (ii) there exists a partition R =
∪ j J j of R into pairwise disjoint intervals J1 , J2 , . . . such that μ (J j ) < ∞. We say
that a random counting measure M defines a Poisson random measure (PRM, for
short) with mean, or intensity, measure μ if for all collection of pairwise disjoint
intervals I1 , . . . , In on R, the values M(Ik ), k = 1, . . . , n, are independent, and each
M(Ik ) ∼ Po(μ (Ik )).
4.4 Spatial point processes and network information theory 437
We will state several facts, without proof, about the existence and properties of
the Poisson random measure introduced in Definition 4.4.1.
Theorem 4.4.2 For any non-atomic and σ -finite measure μ on R+ there exists
a unique PRM satisfying Definition 4.4.1. If measure μ has the form μ (dt) =
λ dt where λ > 0 is a constant (called the intensity of μ), this PRM is a Poisson
process PP(λ ). If the measure μ has the form μ (dt) = λ (t)dt where λ (t) is a given
function, this PRM gives an inhomogeneous Poisson process PP(λ (t)).
f : u ∈ R+ → μ (0, u),
and let f −1 be the inverse function of f . (It exists because f (u) = μ (0, u) is strictly
monotone in u.) Let M be the PRM(μ ). Define a random measure f ∗ M by
Worked Example 4.4.4 Let the rate function of a Poisson process Π = PP(λ (x))
on the interval S = (−1, 1) be
Show that Π has, with probability 1, infinitely many points in S, and that they
can be labelled in ascending order as
Solution Since
0 1
λ (x)dx = ∞,
−1
there are with probability 1 infinitely many points of Π in (−1, 1). On the other
hand,
0 1−δ
λ (x)dx < ∞
−1+δ
for every δ > 0, so that Π(−1+ δ , 1− δ ) is finite with probability 1. This is enough
to label uniquely in ascending order the points of Π. Let
0 x
f (x) = λ (y)dy.
0
With this choice of f , the points ( f (Xn )) form a Poisson process of unit rate on R.
The strong law of large numbers shows that, with probability 1, as n → ∞,
n−1 f (Xn ) → 1, and n−1 f (X−n ) → −1.
Worked Example 4.4.5 Show that, if Y1 < Y2 < Y3 < · · · are points of a Poisson
process on (0, ∞) with constant rate function λ , then
lim Yn /n = λ
n→∞
with probability 1. Let the rate function of a Poisson process Π = PP(λ (x)) on
(0, 1) be
λ (x) = x−2 (1 − x)−1 .
Show that the points of Π can be labelled as
1
· · · < X−2 < X−1 < < X0 < X1 < · · ·
2
and that
lim Xn = 0 , lim Xn = 1 .
n→−∞ n→∞
Prove that
lim nX−n = 1
n→∞
Solution The first part again follows from the strong law of large numbers. For the
second part we set
0 x
f (x) = λ (ξ )dξ ,
1/2
and use the fact that f maps Π into a PP of constant rate on ( f (0), f (1)): f (Π) =
PP(1). In our case, f (0) = −∞ and f (1) = ∞, and so f (Π) is a PP on R. Its points
may be labelled
· · · < Y−2 < Y−1 < 0 < Y0 < Y1 < · · ·
with
lim Yn = −∞, lim Yn = +∞.
n→−∞ n→+∞
Now, as x → 0,
0 1/2 0 1/2
−2 −1
f (x) = − ξ (1 − ξ ) dξ ∼ − ξ −2 dξ ∼ −x−1 ,
x x
440 Further Topics from Information Theory
implying that
−1
X−n
lim = 1, i.e. lim nX−n = 1, a.s.
n→∞ n n→∞
Similarly,
f (Xn )
lim = 1, a.s.,
n→+∞ n
and as x → 1,
0 x
f (x) ∼ (1 − ξ )−1 dξ ∼ − ln(1 − x).
1/2
μ (∪n An ) = ∑ μ (An ).
n
The value μ (E) can be finite or infinite. Our aim is to define a random counting
measure M = (M(A), A ∈ E ), with the following properties:
(a) The random variable M(A) takes non-negative integer values (including, possi-
bly, +∞). Furthermore,
'
∼ Po(λ μ (A)), if μ (A) < ∞,
M(A) (4.4.3)
= +∞ with probability 1, if μ (A) = ∞.
First assume that μ (E) < ∞ (if not, split E into subsets of finite measure). Fix a
random variable M(E) ∼ Po(λ μ (E)). Consider a sequence X1 , X2 , . . . of IID ran-
dom points in E, with Xi ∼ μ μ (E), independently of M(E). It means that for all
n ≥ 1 and sets A1 , . . . , An ∈ E (not necessarily disjoint)
n n
−λ μ (E) λ μ (E) μ (Ai )
P M(E) = n, X1 ∈ A1 , . . . , Xn ∈ An = e ∏ , (4.4.6)
n! i=1 μ (E)
and conditionally,
n
μ (Ai )
P X1 ∈ A1 , . . . , Xn ∈ An |M(E) = n = ∏ . (4.4.7)
i=1 μ (E)
Then set
M(E)
M(A) = ∑ 1(Xi ∈ A), A ∈ E . (4.4.8)
i=1
Solution Clearly,
and
P(R1 > r) = P(C(r) contains at most one point of M)
= (1 + λ π r2 )e−λ π r , r > 0.
2
Similarly,
1
P(R2 > r) = 1 + λ π r2 + (λ π r2 )2 e−λ π r , r > 0.
2
2
442 Further Topics from Information Theory
Then
0 ∞ 0 ∞ √
1 1
e−πλ r d
2
ER0 = P(R0 > r)dr = √ 2πλ r = √ ,
0
0 ∞
2πλ 0 2 λ
ER1 = P(R1 > r)dr
0
0 ∞
1 2
= √ + e−πλ r λ π r2 dr
2 λ 0
0 ∞
1 1 2 √
= √ + √ 2πλ r2 e−πλ r d 2πλ r
2 λ 2 2πλ 0
3
= √ ,
4 λ
0 ∞ 2
3 λ π r2 −πλ r2
ER2 = √ + e dr
4 λ 0 2
0 ∞
3 1 2 2 √
= √ + √ 2λ π r2 e−πλ r d 2λ π r
4 λ 8 2πλ 0
3 3
= √ + √
4 λ 16 λ
15
= √ .
16 λ
We shall use for the PRM M on the phase space E with intensity measure μ con-
structed in Theorem 4.4.6 the notation PRM(E, μ ). Next, we extend the definition
of the PRM to integral sums: for all functions g : E → R+ define
M(E) 0
M(g) = ∑ g(Xi ) := g(y)dM(y); (4.4.9)
i=1
summation is taken over all points Xi ∈ E, and M(E) is the total number of such
points. Next, for a general g : E → R we set
M(g) = M(g+ ) − M(−g− ),
with the standard agreement that +∞ − a = +∞ and a − ∞ = −∞ for all a ∈ (0, ∞).
[When both M(g+ ) and M(−g− ) equal ∞, the value M(g) is declared not defined.]
Then
Theorem 4.4.8 (Campbell theorem) For all θ ∈ R and for all functions g : E → R
such that eθ g(y) − 1 is μ-integrable
⎡ ⎤
0
Eeθ M(g) = exp ⎣λ eθ g(y) − 1 dμ (y)⎦ . (4.4.10)
E
4.4 Spatial point processes and network information theory 443
Proof Write
Example 4.4.10 Suppose that the wireless transmitters are located at the points
of Poisson process Π on R2 of rate λ . Let ri be the distance from transmitter i to
the central receiver at 0, and the minimal distance to a transmitter is r0 . Suppose
that the power of the received signal is Y = ∑Xi ∈Π rPα for some α > 2. Then
i
⎡ ⎤
0∞
Eeθ Y = exp ⎣2λ π eθ g(r) − 1 rdr⎦ , (4.4.11)
r0
P
where g(r) = rα where P is the transmitter power.
444 Further Topics from Information Theory
A popular model in application is the so-called marked point process with the
space of marks D. This is simply a random measure on Rd × D or on its subset. We
will need the following product property proved below in the simplest set-up.
Theorem 4.4.11 (The product theorem) Suppose that a Poisson process with the
constant rate λ is given on R, and marks Yi are IID with distribution ν. Define a
random measure M on R+ × D by
∞
M(A) = ∑I (Tn ,Yn ) ∈ A , A ⊆ R+ × D. (4.4.12)
n=1
We know that Nt ∼ Po(λ t). Further, given that Nt = k, the jump points T1 , . . . , Tk
have the conditional joint PDF fT1 ,...,Tk ( · |Nt = k) given by (4.4.7). Then, by using
further conditioning, by T1 , . . . , Tk , in view of the independence of the Yn , we have
E eθ M(A) |Nt = k
= E E eθ M(A) |Nt = k; T1 , . . . , Tk
0t 0t
= ... dxk . . . dx1 fT1 ,...,Tk (x1 , . . . , xk |N = k)
0 0
k
× E exp θ ∑ I (xi ,Yi ) ∈ A |Nt = k; T1 = x1 , . . . , Tk = xk
i=1
⎛ ⎞k
0t 0
1⎝
= eθ IA (x,y) dν (y)dx⎠ .
tk
0 D
4.4 Spatial point processes and network information theory 445
Then
⎛ ⎞k
∞ 0t 0
(λ t)k 1⎝
Eeθ M(A) = e−λ t ∑ eθ IA (x,y) dν (y)dx⎠
k=0 k! t k
0 D
⎡ ⎤
0t 0
= exp ⎣λ eθ IA (x,y) − 1 dν (y)dx⎦ .
0 D
The expression eθ IA (x,y) − 1 takes value eθ − 1 for (x, y) ∈ A and 0 for (x, y) ∈ A.
Hence,
⎡ ⎤
0
Eeθ M(A) = exp ⎣ eθ − 1 λ dν (y)dx⎦ , θ ∈ R. (4.4.13)
A
Worked Example 4.4.12 Use the product and Campbell’s theorems to solve
the following problem. Stars are scattered over three-dimensional space R3 in a
Poisson process Π with density ν (X) (X ∈ R3 ). Masses of the stars are IID random
variables; the mass mX of a star at X has PDF ρ (X, dm). The gravitational potential
at the origin is given by
GmX
F= ∑ ,
X∈Π |X|
where
0
Σ= a(y)M(dy) = ∑ a(X).
X∈Π
E
dEeθ F
The expected potential at the origin is EF = | and equals
dθ θ =0
0 0∞ 0
Gm 1
ν (x)dx ρ (x, dm) = GM dx 1(|x| ≤ R).
|x| |x|2
R3 0 R3
0 0R 0 0
1 1 2
dx 2 1(|x| ≤ R) = dr r dϑ cos ϑ dφ = 4π R
|x| r2
R3 0
which yields
EF = 4π GMR.
Finally, let D be the distance to the nearest star contributing to F at least C. Then,
by the product theorem,
P(D ≥ d) = P(no points in A) = exp − μ (A) .
Here
% K
Gm
A = (x, m) ∈ R3 × R+ : |x| ≤ d, ≥C ,
|x|
4.4 Spatial point processes and network information theory 447
0
and μ (A) = μ (dx × dm) is represented as
A
0d 0 0 0∞
1 −1
dr r2 dϑ cos ϑ dφ M dme−m/M
r
0 Cr/G
0d
= 4π drre−Cr/(GM)
0
2
GM −Cd/(GM) Cd −Cd/(GM)
= 4π 1−e − e .
C GM
This determines the distribution of D on [0, R].
Example 4.4.13 Suppose that the receiver is located at point y and the transmit-
ters are scattered on the plane R2 at the points of xi ∈ Π of Poisson process of rate
λ . Then the simplest model for the power of the received signal is
where P is the emitted signal power and the function describes the fading of the
signal. In the case of so-called Rayleigh fading (|x|) = e−β |x| , and in the case of
the power fading (|x|) = |x|−α , α > 2. By the Campbell theorem
0 ∞
φ (θ ) = E eθ Y = exp 2λ π r eθ P(r) − 1 dr . (4.4.15)
0
where sk is the kth largest singular value of the matrix L = ((xk , y j )), σ02 is the
noise power spectral density, and the bandwidth W = 1.
P2 (xi , x j )
SNR(xi → x j ) = (4.4.22)
σ02 + γ ∑k=i, j P2 (xk , x j )
where P, σ02 , k > 0 and 0 ≤ γ < 1k . We say that a transmitter located at xi can send a
message to receiver located at x j if SNR(xi → x j ) ≥ k. For any k > 0 and 0 < κ < 1,
let An (k, κ ) be an event that there exists a set Sn of at least κ n points of Π such that
for any two points s, d ∈ Sn , SNR(s, d) > k. It can be proved (see [48]) that for all
κ ∈ (0, 1) there exists k = k(κ ) such that
lim P An (k(κ ), κ ) = 1. (4.4.23)
n→∞
Then we say that the network is supercritical at interference level k(κ ); it means
that the number of other points the given transmitter (say, located at the origin 0)
could communicate to, by using re-transmission at intermediate points, is infinite
with a positive probability.
450 Further Topics from Information Theory
First, we note that any given transmitter may be directly connected to at most
1 + (γ k)−1 receivers. Indeed, suppose that nx nodes are connected to the node x.
Denote by x1 the node connected to x and such that
(|x1 − x|) ≤ (|xi − x|), i = 2, . . . , nx . (4.4.24)
Since x1 is connected to x we have
P(|x1 − x|)
∞ ≥k
σ02 + γ ∑ P(|xi − x|)
i=2
which implies
P(|x1 − x|) ≥ kσ02 + kγ ∑ P(|xi − x|)
i≥2
the ball B(N) (X j , r) of radius r centred at X j . The point X j will survive only if its
mark T j is strictly smaller than the marks of all other points from Π(N) lying in
B(N) (X j , r). The resulting point process ξ (N) is known as the Matérn process; it is
an example of a more general construction discussed in the recent paper [1].
The main parameter of a random codebook with codewords x(N) of length N is
the induced distribution of the distance between codewords. In the case of code-
books generated by stationary point processes it is convenient to introduce a func-
tion K(t) such that λ 2 K(t) gives the expected number of ordered pairs of distinct
points in a unit volume less than distance t apart. In other words, λ K(t) is the ex-
pected number of further points within t of an arbitrary point of a process. Say,
for Poisson process on R2 of rate λ , K(t) = π t 2 . In random codebooks we are in-
terested in models where K(t) is much smaller for small and moderate t. Hence,
random codewords appear on a small distance from one another much more rarely
than in a Poisson process. It is convenient to introduce the so-called product density
λ 2 dK(t)
ρ (t) = , (4.4.27)
c(t) dt
where c(t) depends on the state space of the point process. Say, c(t) = 2π t on R1 ,
c(t) = 2π t 2 on R2 , c(t) = 2π sint on the unit sphere, etc.
Some convenient models of this type have been introduced by B. Matérn. Here
we discuss two rather intuitive models of point processes on RN . The first is ob-
tained by sampling a Poisson process of rate λ and deleting any point which is
within 2R of any other whether or not this point has already been deleted. The rate
of this process for N = 2 is
λM,1 = λ e−4πλ R .
2
(4.4.28)
Here B((0, 0), 2R) is the ball with centre (0, 0) of radius 2R, and B((t, 0), 2R) is
the ball with centre (t, 0) of radius 2R. For varying λ this model has the maximum
rate of (4π eR2 )−1 and
√ so cannot model densely packed codes. This is 10% of the
theoretical bound ( 12R2 )−1 which is attained by the triangular lattice packing,
cf. [1].
The second Matérn model is an example of the so-called marked point process.
The points of a Poisson process of rate λ are independently marked by IID random
variables with distribution U([0, 1]). A point is deleted if there is another point of
452 Further Topics from Information Theory
the process within distance 2R which has a bigger mark whether or not this point
has already been deleted. The rate of this process for N = 2 is
P
Xr0 = ∑ rα (4.4.32)
Jr0 ,a i
where Jr0 ,a denotes the set of interfering transmitters such that r0 ≤ ri < a. Let λP
be the rate of Poisson process producing a Matérn process after thinning. The rate
of thinned process is
1 − exp − λP π r02
λ= .
π r02
= exp λP π (a 2
− r02 ) q(t)dt e dr − 1 . (4.4.33)
0 r0 (a2 − r02 )
4.5 Selected examples and problems from cryptography 453
P
Here g(r) = α
and q(t) = exp − λP π r02t is the retaining probability of a point
r 0
1 λ
of mark t. Since q(t)dt = , we obtain
0 λP
0 a
2r θ g(r)
φ (θ ) = exp λ π (a − r0 )
2 2
e dr − 1 . (4.4.34)
r0 (a2 − r0 )
2
Engineers say that outage happens at the central receiver, i.e. the interference pre-
vents one from reading a signal obtained from a sender at distance rs , if
P/rsα
≤ k.
σ02 + ∑Jr0 ,a P/riα
Here, σ02 is the noise power, rs is the distance to sender and k is the minimal SIR
(signal/noise ratio) required for successful reception. Different approximations of
outage probability based on the moments computed in (4.4.35) are developed. Typ-
ically, the distribution of Xr0 is close to log-normal; see, e.g., [113].
for some function f : {0, 1}d → {0, 1} (a feedback function). The initial string
(x0 , . . . , xd−1 ) is called an initial fill; it produces an output stream (xn )n≥0 satisfying
the recurrence equation
A feedback shift register is said to be linear (an LFSR, for short) if function f is
linear and c0 = 1:
d−1
f (x0 , . . . , xd−1 ) = ∑ ci xi , where ci = 0, 1, c0 = 1; (4.5.2)
i=0
xn+d n+d−1
n+1 = Vxn (4.5.4)
where
⎛ ⎞ ⎛ ⎞
0 1 0 ... 0 0 xn
⎜0 0 1 ... 0 0 ⎟ ⎜ xn+1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ .. ⎟
V = ⎜ ... ..
.
..
.
..
.
..
.
..
. ⎟ , xn+d−1
= ⎜ . ⎟. (4.5.5)
⎜ ⎟ n
⎜ ⎟
⎝0 0 0 ... 0 1 ⎠ ⎝xn+d−2 ⎠
c0 c1 c2 . . . cd−2 cd−1 xn+d−1
By the expansion of the determinant along the first column one can see that det V =
1 mod 2: the cofactor for the (n, 1) entry c0 is the matrix Id−1 . Hence,
Observe that general feedback shift registers, after an initial run, become peri-
odic:
Theorem 4.5.2 The output stream (xn ) of a general feedback shift register of
length d has the property that there exists integer r, 0 ≤ r < 2d , and integer D,
1 ≤ D < 2d − r, such that xk+D = xk for all k ≥ r.
4.5 Selected examples and problems from cryptography 455
Proof A segment xM . . . xM+d−1 determines uniquely the rest of the output stream
in (4.5.1), i.e. (xn , n ≥ M + d − 1). We see that if such a segment is reproduced in
the stream, it will be repeated. There are 2d different possibilities for a string of d
subsequent digits. Hence, by the pigeonhole principle, there exists 0 ≤ r < R < 2d
such that the two segments of length d of the output stream, from positions r and
R onwards, will be the same: xr+ j = xR+ j , 0 ≤ r < R < d. Then, as was noted,
xr+ j = xR+ j for all j ≥ 0, and the assertion holds true with D = R − r.
In the linear case (LFSR), we can repeat the above argument, with the zero string
discarded. This allows us to reduce 2d to 2d − 1. However, an LFSR is periodic in
a ‘proper sense’:
Theorem 4.5.3 An LFSR (xn ) is periodic, i.e. there exists D ≤ 2d − 1 such that
xn+D = xn for all n. The smallest D with this property is called the period of the
LFSR.
Proof Indeed, the column vectors xn+d−1 n , n ≥ 0, are related by the equation
xn+1 = Vxn = Vn+1 x0 , n ≥ 0, where matrix V was defined in (4.5.5). We noted that
det V = c0 = 0 and hence V is invertible. As was said before, we may discard the
zero initial fill. For each vector xn ∈ {0, 1}d there are only 2d − 1 non-zero possibil-
ities. Therefore, as in the proof of Theorem 4.5.2, among the initial 2d − 1 vectors
xn , 0 ≤ n ≤ 2d − 2, either there will be repeats, or there will be a zero vector. The
second possibility can be again discarded, as it leads to the zero initial fill. Thus,
suppose that the first repeat was for j and D + j: x j = x j+D , i.e. V j+D x0 = V j x0 .
If j = 0, we multiply by V−1 and arrive at an earlier repeat. So: j = 0, D ≤ 2d − 1
and VD x0 = x0 . Then, obviously, xn+D = Vn+D x0 = Vn x0 = xn .
Solution Take f : {0, 1}2 → {0, 1}2 with f (x1 , x2 ) = x2 1. The initial fill 00 yields
00111111111 . . .. Here, kn+1 = 0 = k1 for all n ≥ 1.
Worked Example 4.5.5 Let matrix V be defined by (4.5.5), for the linear re-
cursion (4.5.3). Define and compute the characteristic and minimal polynomials
for V.
456 Further Topics from Information Theory
(recall, entries 1 and ci are considered in F2 ). Expanding along the bottom row,
the polynomial hV (t) is written as a linear combination of determinants of size
(d − 1) × (d − 1) (co-factors):
⎛ ⎞ ⎛ ⎞
1 0 ... 0 0 X 0 ... 0 0
⎜X 1 . . . 0 0⎟ ⎜X 1 . . . 0 0⎟
⎜ ⎟ ⎜ ⎟
c0 det ⎜ . . . . . ⎟ + c1 det ⎜ . . . .. .. ⎟
⎝. .
. . . . . .⎠
. . ⎝. .
. . . . . .⎠
0 0 ... X 1 0 0 ... X 1
⎛ ⎞
X 1 ... 0 0
⎜0 X . . . 0 0⎟
⎜ ⎟
+ · · · + cd−2 det ⎜ . .. . . .. .. ⎟
⎝ .. . . . .⎠
0 0 ... 0 1
⎛ ⎞
X 1 ... 0 0
⎜0 X ... 0 0⎟
⎜ ⎟
+(cd−1 + X) det ⎜ . . . .. ⎟
⎝ .. .. . . ... .⎠
0 0 ... 0 X
∑
( j) ( j)
mV (X) = ai X i + X d j +1 .
0≤i≤d j
(v) Then,
(1) (d)
mV (X) = lcm mV (X), . . . , mV (X) .
∑
(1)
mV (X) = ci X i + X d = hV (X).
0≤i≤d−1
We see that the feedback polynomial C(X) of the recursion coincides with the
characteristic and the minimal polynomial for V. Observe that at X = 0 we obtain
hV (0) = C(0) = c0 = 1 = det V. (4.5.9)
Any polynomial can be identified through its roots; we saw that such a descrip-
tion may be extremely useful. In the case of an LFSR, the following example is
instructive.
Theorem 4.5.6 Consider the binary linear recurrence in (4.5.3) and the corre-
sponding auxiliary polynomial C(X) from (4.5.7).
(a) Suppose K is a field containing F2 such that polynomial C(X) has a root α of
multiplicity m in K. Then, for all k = 0, 1, . . . , m − 1,
xn = A(n, k)α n , n = 0, 1, . . . , (4.5.10)
458 Further Topics from Information Theory
Here, and below, (a)+ stands for max[a, 0]. In other words, sequence x(k) =
(xn ), where xn is given by (4.5.10), is an output of the LFSR with auxiliary
polynomial C(X).
(b) Suppose K is a field containing F2 such that C(X) factorises in K into lin-
ear factors. Let α1 , . . . , αr ∈ K be distinct roots of C(X) of multiplicities
m1 , . . . , mr , with ∑ mi = d . Then the general solution of (4.5.3) in K is
1≤i≤r
for some bu,v ∈ K. In other words, sequences x(i,k) = (xn ), where xn = A(n, k)αin
and A(n, k) is given by (4.5.11), span the set of all output streams of the LFSR
with auxiliary polynomial C(X).
Proof (a) If C(X) has a root α ∈ K of multiplicity m then C(X) = (X − α )mC(X)
where C(X) is a polynomial of degree d −m (with coefficients from a field K ⊆ K).
Then, for all k = 0, . . . , m − 1, and for all n ≥ d, the polynomial
dk
Dk,n (X) := X k k X n−d C(X)
dX
(with coefficients taken mod 2) vanishes at X = α (in field K):
Dk,n (α ) = ∑ ci A(n − d + i, k)α n−d+i + A(n, k)α n .
0≤i≤d−1
This yields
A(n, k)α n = ∑ ci A(n − d + i, k)α n−d+i .
0≤i≤d−1
1≤i<r
Upon seeing a stream of digits (xn )n≥0 , an observer may wish to determine
whether it was produced by an LFSR. This can be done by using the so-called
Berlekamp–Massey (BM) algorithm, solving a system of linear equations. If a se-
d−1
quence (xn ) comes from an LFSR with feedback polynomial C(X) = ∑ ci X i + X d
i=0
d−1
then the recurrence xn+d = ∑ ci xn+i for n = 0, . . . , d can be written in a vector-
i=0
matrix form Ad cd = 0 where
⎛ ⎞
⎛ ⎞ c0
x0 x1 x2 . . . xd ⎜ ⎟
⎜x1 c1
⎜ x2 x3 . . . xd+1 ⎟
⎟
⎜
⎜
⎟
⎟
..
Ad = ⎜ . .. .. .. .. ⎟ , cd = ⎜ ⎟.
. (4.5.13)
⎝ .. . . . . ⎠ ⎜ ⎟
⎝cd−1 ⎠
xd xd+1 xd+2 . . . x2d
1
(e.g. by Gaussian elimination) and test sequence (xn ) for the recursion xn+d =
∑ ai xn+i . If we discover a discrepancy, we choose a different vector cr ∈
0≤i≤d−1
ker Ar or – if it fails – increase r.
The BM algorithm can be stated in an elegant algebraic form. Given a sequence
∞
(xn ), consider a formal power series in X: ∑ x j X j . The fact that (xn ) is produced
j=0
4.5 Selected examples and problems from cryptography 461
by the LFSR with a feedback polynomial C(X) is equivalent to the fact that the
d
above series is obtained by dividing a polynomial A(X) = ∑ ai X i by C(X):
i=0
∞
A(X)
∑ x j X j = C(X) . (4.5.14)
j=0
∞
Indeed, as c0 = 1, A(X) = C(X) ∑ x j X j is equivalent to
j=0
n
an = ∑ ci xn−i , n = 1, . . . , (4.5.15)
i=1
or ⎧
⎪
⎪
n−1
⎨an − ∑ ci xn−i , n = 0, 1, . . . , d,
xn = i=1 (4.5.16)
⎪
⎪
n−1
⎩− ∑ ci xn−i , n > d.
i=0
In other words, A(X) takes part in specifying the initial fill, and C(X) acts as the
feedback polynomial.
Worked Example 4.5.7 What is a linear feedback shift register? Explain the
Berlekamp–Massey method for recovering the feedback polynomial of a linear
feedback shift register from its output. Illustrate in the case when we observe out-
puts
1 0 1 0 1 1 0 0 1 0 0 0 ...,
0 1 0 1 1 1 1 0 0 0 1 0 ...
and
1 1 0 0 1 0 1 1.
Solution An initial fill x0 . . . xd−1 produces an output stream (xn )n≥0 satisfying the
recurrence equation
d−1
xn+d = ∑ ci xn+i for all n ≥ 0.
i=0
but
⎛ ⎞
1 0 1
A2 = ⎝0 1 0⎠ , with det A2 = 0,
1 0 1
⎛ ⎞
c0
⎝
and A2 c1 ⎠ = 0 has the solution c0 = 1, c1 = 0. This gives the recursion
1
xn+2 = xn ,
and then to A4 :
⎛ ⎞
1 0 1 0 1
⎜0 1 0 1 1⎟
⎜ ⎟
A4 = ⎜
⎜1 0 1 1 0⎟⎟ , with det A4 = 0.
⎝0 1 1 0 0⎠
1 1 0 0 1
⎛ ⎞
1
⎜ 0⎟
The equation A4 c4 = 0 is solved by c4 = ⎜ ⎟
⎝ 0⎠. This yields
1
xn+4 = xn + xn+3 ,
which fits the rest of the string. In the second example we have:
⎛ ⎞
⎛ ⎞ 0 1 0 1
0 1 0
0 1 ⎜ 1 0 1 1⎟
det = 0, det ⎝1 0 1⎠ = 0, det ⎜ ⎝0 1 1
⎟ = 0
1 0 1⎠
0 1 1
1 1 1 1
4.5 Selected examples and problems from cryptography 463
and
⎛ ⎞⎛ ⎞
0 1 0 1 1 1
⎜1 0 1 1 1⎟ ⎜ 1⎟
⎜ ⎟⎜ ⎟
⎜0 1⎟ ⎜ ⎟
⎜ 1 1 1 ⎟ ⎜0⎟ = 0.
⎝1 1 1 1 0 ⎝ 0⎠
⎠
1 1 1 0 0 1
This yields the solution: d = 4, xn+4 = xn + xn+1 . The linear recurrence relation is
satisfied by every term of the output sequence given. The feedback polynomial is
then X 4 + X + 1.
In the third example the recursion is xn+3 = xn + xn+1 .
LFSRs are used for producing additive stream ciphers. Additive stream ciphers
were invented in 1917 by Gilbert Vernam, at the time an engineer with the AT&T
Bell Labs. Here, the sending party uses an output stream from an LFSR (kn ) to
encrypt a plain text (pn ) by (zn ) where
zn = pn + kn mod 2, n ≥ 0. (4.5.17)
pn = zn + kn mod 2, n ≥ 0, (4.5.18)
but of course he must know the initial fill k0 . . . kd−1 and the string c0 . . . cd−1 . The
main deficiency of the stream cipher is its periodicity. Indeed, if the generating
LFSR has period D then it is enough for an ‘attacker’ to have in his possession a
cipher text z0 z1 . . . z2D−1 and the corresponding plain text p0 p1 . . . p2D−1 , of length
2D. (Not an unachievable task for a modern-day Sherlock Holmes.) If by some luck
the attacker knows the value of the period D then he only needs z0 z1 . . . zD−1 and
p0 p1 . . . pD−1 . This will allow the attacker to break the cipher, i.e. to decrypt the
whole plain text, however long.
Clearly, short-period LFSRs are easier to break when they are used repeat-
edly. The history of World War II and the subsequent Cold War has a number of
spectacular examples (German code-breakers succeeding in part in reading British
Navy codes, British and American code-breakers succeeding in breaking German
codes, the American project ‘Venona’ deciphering Soviet codes) achieved because
of intensive message traffic. However, even ultra-long periods cannot guarantee
safety.
As far as this section of the book is concerned, the period of an LFSR can be
increased by combining several LFSRs.
Theorem 4.5.8 Suppose a stream (xn ) is produced by an LFSR of length d1 ,
period D1 and with an auxiliary polynomial C1 (X), and a stream (yn ) by an LFSR
464 Further Topics from Information Theory
corresponding to the generic form of the output stream for the LFSR with the aux-
iliary polynomial C(X) in statement (b).
Despite serious drawbacks, LFSRs remain in use in a variety of situations: they
allow simple enciphering and deciphering without ‘lookahead’ and display a ‘lo-
cal’ effect of an error, be it encoding, transmission or decoding. More generally,
4.5 Selected examples and problems from cryptography 465
non-linear LFSRs often offer only marginal advantages while bringing serious dis-
advantages, in particular with deciphering.
Worked Example 4.5.9 (a) Let (xn ), (yn ), (zn ) be three streams produced by
LFSRs. Set
kn = xn if yn = zn ,
kn = yn if yn = zn .
Solution (a) For three streams (xn ), (yn ), (zn ) produced by LFSRs we set
d
xd+ j = ∑ ci x j+i−1 , for j = 1, 2, . . . , d.
i=1
Worked Example 4.5.11 Describe how an additive stream cipher operates. What
is a one-time pad? Explain briefly why a one-time pad is safe if used only once but
becomes unsafe if used many times. A one-time pad is used to send the message
x1 x2 x3 x4 x5 x6 y7 which is encoded as 0101011. By mistake, it is reused to send the
message y0 x1 x2 x3 x4 x5 x6 which is encoded as 0100010. Show that x1 x2 x3 x4 x5 x6 is
one of two possible messages, and find the two possibilities.
the cipher key stream is known only to the sender and the recipient.) In the example,
we have
x1 x2 x3 x4 x5 x6 y7 → 0101011,
y0 x1 x2 x3 x4 x5 x6 → 0100010.
Suppose x1 = 0. Then
k0 = 0, k1 = 1, x2 = 0, k2 = 0, x3 = 0, k3 = 0, x4 = 1, k1 = 0,
x5 = 1, k5 = 0, x6 = 1, k6 = 1.
Thus,
k = 0100101, x = 000111.
If x1 = 1, every digit changes, so
k = 1011010, x = 111000.
Alternatively, set x0 = y0 and x7 = y7 . If the first cipher is q1 q2 . . ., the second is
p1 p2 . . . and the one-time pad is k1 , k2 , . . ., then
q j = x j+1 + k j , p j = x j + k j .
So,
x j + x j+1 = q j + p j ,
and
x1 + x2 = 0, x2 + x3 = 0,
x3 + x4 = 1, x4 + x5 = 0, x5 + x6 = 0.
This yields
x1 = x2 = x3 , x4 = x5 = x6 , x4 = x3 + 1.
The message is 000111 or 111000.
Worked Example 4.5.12 (a) Let θ : Z+ → {0, 1} be given by θ (n) = 1 if n is
odd, θ (n) = 0 if n is even. Consider the following recurrence relation over F2 :
un+3 + un+2 + un+1 + un = 0. (4.5.20)
Is it true that the general solution of (4.5.20) is un = A + Bθ (n) + Cθ (n2 )? If it is
true, prove it. If not, explain why it is false and state and prove the correct result.
(b) Solve the recurrence relation un+2 + un = 1 over F2 , subject to u0 = 1, u1 = 0,
expressing the solution in terms of θ and n.
(c) Four streams wn , xn , yn , zn are produced by linear feedback registers. If we set
%
xn + yn + zn if zn + wn = 1,
kn =
xn + wn if zn + wn = 0,
show that kn is also a stream produced by a linear feedback register.
468 Further Topics from Information Theory
such that
(f) for all key e ∈ K there is a key d ∈ K , with the property that Dd (Ee (P)) = P
for all plaintext P ∈ P.
Example 4.5.14 Suppose that two parties, Bob and Alice, intend to have a two-
side private communication. They want to exchange their keys, EA and EB , by us-
ing an insecure binary channel. An obvious protocol is as follows. Alice encrypts
a plain-text m as EA (m) and sends it to Bob. He encrypts it as EB (EA (m)) and
returns it to Alice. Now we make a crucial assumption that EA and EB commute
for any plaintext m : EA ◦ EB (m ) = EB ◦ EA (m ). In this case Alice can decrypt
this message as DA (EA (EB (m))) = EB (m) and send this to Bob, who then calcu-
lates DB (EB (m)) = m. Under this protocol, at no time during the transaction is an
unencrypted message transmitted.
However, a further thought shows that this is no solution at all. Indeed, suppose
that Alice uses a one-time pad kA and Bob uses a one-time pad kB . Then any sin-
gle interception provides no information about plaintext m. However, if all three
transmissions are intercepted, it is enough to take the sum
(m + kA ) + (m + kA + kB ) + (m + kB ) = m
Otherwise, i.e. when p|m, (4.5.27) still holds as m and (ml )d are both equal to
0 mod p. By a similar argument,
By the Chinese remainder theorem (CRT) – [28], [114] – (4.5.27) and (4.5.28)
imply (4.5.26).
Example 4.5.16 Suppose Bob has chosen p = 29, q = 31, with N = 899 and
φ (N) = 840. The smallest possible value of e with gcd(l, φ (N)) = 1 is l = 11, after
that 13 followed by 17, and so on. The (extended) Euclid algorithm yields d = 611
for l = 11, d = 517 for l = 13, and so on. In the first case, the encrypting key E899,11
is
m → m11 mod 899, that is, E899,11 (2) = 250.
with the help of the computer. [The computer is needed even after the simplification
rendered by the use of the CRT. For instance, the command in Mathematica is
PowerMod[250,611,899].]
Worked Example 4.5.17 (a) Referring to the RSA cryptosystem with public key
(N, l) and private key (φ (N), d), discuss possible advantages or disadvantages of
taking (i) l = 232 + 1 or (ii) d = 232 + 1.
(b) Let a (large) number N be given, and we know that N is a product of two distinct
prime numbers, N = pq, but we do not know the numbers p and q. Assume that
another positive integer, m, is given, which is a multiple of φ (N). Explain how to
find p and q.
(c) Describe how to solve the bit commitment problem by means of the RSA.
Solution Using l = 232 + 1 provides fast encryption (you need just 33 multiplica-
tions using repeated squaring). With d = 232 + 1 one can decrypt messages quickly
(but an attacker can easily guess it).
472 Further Topics from Information Theory
(b) Next, we show that if we know a multiple m of φ (N) then it is ‘easy’ to factor N.
Given positive integers y > 1 and M > 1, denote by ordM (y) the order of y relative
to M:
Lemma 4.5.18 (i) Let N = pq, m be as before, i.e. φ (N)|m, and 2tdefine set X
the
as in (4.5.29). If x ∈ X then there exists 0 ≤ t < a such that gcd x − 1, N > 1 is
b
has size ≤ (p − 1)/2. We will do this by exhibiting such a subset of size (p − 1)/2.
Note that
φ (N)|2a b implies that there exists γ ∈ {1, . . . , p − 1}
such that ord p (γ b )is a power of 2.
In turn, the latter statement implies that
%
δb = ord p (γ b ), δ odd,
ord p (γ )
< ord p (γ b ), δ even.
* +
Therefore, γ δ b mod p : δ odd is the required subset.
Furthermore:
Alice’s public key is N = pq; her secret key is the pair (p, q);
Alice’s plaintext and ciphertext are numbers m = 0, 1 . . . , N − 1, (4.5.31)
and her encryption rule is EN (m) = c where c = m2 mod N.
To decrypt a ciphertext c addressed to her, Alice computes
Then
±m p = c1/2 mod p and ± mq = c1/2 mod q,
i.e. ±m p and ±mq are the square roots of c mod p and mod q, respectively. In fact,
2 p−1
± m p = c(p+1)/2 = c(p−1)/2 c = ± m p c = c mod p;
at the last step the Euler–Fermat theorem has been used. The argument for ±mq is
similar. Then Alice computes, via Euclid’s algorithm, integers u(p) and v(q) such
that
u(p)p + v(q)q = 1.
474 Further Topics from Information Theory
Example 4.5.19 Alice uses prime numbers p = 11 and q = 23. Then N = 253.
Bob encrypts the message m = 164, with
c = m2 mod N = 78.
Alice calculates m p = 1, mq = 3, u(p) = −2, v(q) = 1. Then Alice computes
r = ±[u(p)pmq + v(q)qm p ] mod N = 210 and 43
Then α is called the discrete logarithm, mod p, of b to base γ ; some authors write
α = dlogγ b mod p. Computing discrete logarithms is considered a difficult prob-
lem: no efficient (polynomial) algorithm is known, although there is no proof that
it is indeed a non-polynomial problem. [In an additive cyclic group Z/(nZ), the
DLP becomes b = γα mod n and is solved by the Euclid algorithm.]
The Diffie–Hellman protocol allows Alice and Bob to establish a common secret
key using field tables for F p , for a sufficient quantity of prime numbers p. That is,
they know a primitive element γ in each of these fields. They agree to fix a large
prime number p and a primitive element γ ∈ F p . The pair (p, γ ) may be publicly
known: Alice and Bob can fix p and γ through the insecure channel.
Next, Alice chooses a ∈ {0, 1, . . . , p − 2} at random, computes
A = γ a mod p
K = γ ab = Ba = Ab mod p.
If the attacker can find discrete logarithms mod p then he can break the secret
key: this is the only known way to do so. The opposite question – solving the
discrete logarithm problem if he is able to break the protocol – remains open (it is
considered an important problem in public key cryptography).
However, like previously discussed schemes, the Diffie–Hellman protocol has a
particular weak point: it is vulnerable to the man in the middle attack. Here, the
attacker uses the fact that neither Alice nor Bob can verify that a given message re-
ally comes from the opposite party and not from a third party. Suppose the attacker
can intercept all messages between Alice and Bob. Suppose he can impersonate
Bob and exchange keys with Alice pretending to be Bob and at the same time im-
personate Alice and exchange keys with Bob pretending to be Alice. It is necessary
to use electronic signatures to distinguish this forgery.
476 Further Topics from Information Theory
and Alice’s public key is (p = 37, γ = 2, A = 26), her plaintexts are 0, 1, . . . , 36 and
private key a = 12. Assume Bob has chosen b = 32; then
B = 232 mod 37 = 4.
Suppose Bob wants to send m = 31. He encrypts m by
c = Ab m mod p = (26)32 m mod 37 = 10 × 31 mod 37 = 14.
Alice decodes this message as 232 = 7 and 724 = 26 mod 37,
14 × 232(37−12−1) mod 37 = 14 × 724 = 14 × 26 mod 37 = 31.
Worked Example 4.5.21 Suppose that Alice wants to send the message ‘today’
to Bob using the ElGamal encryption. Describe how she does this using the prime
p = 15485863, γ = 6 a primitive root mod p, and her choice of b = 69. Assume
that Bob has private key a = 5. How does Bob recover the message using the
Mathematica program?
Solution Bob has public key (15485863, 6, 7776), which Alice obtains. She con-
verts the English plaintext using the alphabet order to the numerical equivalent:
19, 14, 3, 0, 24. Since 265 < p < 266 , she can represent the plaintext message as a
single 5-digit base 26 integer:
m = 19 × 264 + 14 × 263 + 3 × 262 + 0 × 26 + 24 = 8930660.
Now she computes γ b = 669 = 13733130 mod 15485863, then
mγ ab = 8930660 × 777669 = 4578170 mod 15485863.
Alice sends c = (13733130, 4578170) to Bob. He uses his private key to compute
(γ b ) p−1−a = 1373313015485863−1−5 = 2620662 mod 15485863
and
(γ )−a mγ ab = 2620662 × 4578170 = 8930660 mod 15485863,
and converts the message back to the English plaintext.
Worked Example 4.5.22 (a) Describe the Rabin–Williams scheme for coding
a message x as x2 modulo a certain N . Show that, if N is chosen appropriately,
breaking this code is equivalent to factorising the product of two primes.
(b) Describe the RSA system associated with a public key e, a private key d and
the product N of two large primes.
Give a simple example of how the system is vulnerable to a homomorphism
attack. Explain how a signature system prevents such an attack. Explain how to
factorise N when e, d and N are known.
478 Further Topics from Information Theory
Solution (a) Fix two large primes p, q ≡ −1 mod 4 which forms a private key; the
broadcasted public key is the product N = pq. The properties used are:
(i) If p is a prime, the congruence a2 ≡ d mod p has at most two solutions.
(ii) For a prime p = −1 mod 4, i.e. p = 4k − 1, if the congruence a2 ≡ c mod p has
a solution then a ≡ c(p+1)/4 mod p is one solution and a ≡ −c(p+1)/4 mod p is
another solution. [Indeed, if c ≡ a2 mod p then, by the Euler–Fermat theorem,
c2k = a4k = a(p−1)+2 = a2 mod p, implying ck = ±a.]
The message is a number m from M = {0, 1, . . . , N −1}. The encrypter (Bob) sends
(broadcasts) m = m2 mod N. The decrypter (Alice) uses property (ii) to recover the
two possible values of m mod p and two possible values of m mod q. The CRT then
yields four possible values for m: three of them would be incorrect and one correct.
So, if one can factorise N then the code would be broken. Conversely, suppose
that we can break the code. Then we can find all four distinct square roots u1 , u2 ,
u3 , u4 mod N for a general u. (The CRT plus property (i) shows that u has zero
or four square roots unless it is a multiple of p and q.) Then u j u−1 (calculable via
Euclid’s algorithm) gives rise to the four square roots, 1, −1, ε1 and ε2 , of 1 mod N,
with
ε1 ≡ 1 mod p, ε1 ≡ −1 mod q
and
ε2 ≡ −1 mod p, ε2 ≡ 1 mod q.
xλ (N) ≡ 1 mod N,
Next, we choose e randomly. Either Euclid’s algorithm will reveal that e is not
co-prime to λ (N) or we can use Euclid’s algorithm to find d such that
de ≡ 1 mod λ (N).
With a very high probability a few trials will give appropriate d and e.
We now give out the value e of the public key and the value of N but keep secret
the private key d. Given a message m with 1 ≤ m ≤ N − 1, it is encoded as the
integer c with
1 ≤ c ≤ N − 1 and c ≡ me mod N.
and the recipient of the (falsified) message believes that m2 dollars are to be paid.
Suppose that a signature B(m) is also encoded and transmitted, where B is a
many-to-one function with no simple algebraic properties. Then the attack above
will produce a message and signature which do not correspond, and the recipient
will know that the message was tampered with.
Suppose e, d and N are known. Since
de − 1 ≡ 0 mod λ (N)
Let L = NT be the total number of coupons collected by the time the complete
set of coupon types is obtained. Show that λ ET = EL. Hence, or otherwise, deduce
that EL does not depend on λ .
Solution Part (a) directly follows from the definition of a Poisson process.
(b) Let T j be the time of the first collection of a type j coupon. Then T j ∼ Exp(p j λ ),
independently for different j. We have
T = max T1 , . . . , Tm ,
and hence
m m
P(T < t) = P max T1 , . . . , Tm < t = ∏ P(T j < t) = ∏ 1 − e−p j λ t .
j=1 j=1
4.6 Additional problems for Chapter 4 481
Next, observe that the random variable L counts the jumps in the original Poisson
process (Nt ) until the time of collecting a complete set of coupon types. That is:
L
T = ∑ Si ,
i=1
where S1 , S2 , . . . are the holding times in (Nt ), with S j ∼ Exp(λ ), independently for
different j. Then
E(T |L = n) = nES1 = nλ −1 .
Moreover, L is independent of the random variables S1 , S2 , . . .. Thus,
ET = ∑ P(L = n)E T |L = n = ES1 ∑ nP(L = n) = λ −1 EL.
n≥m n≥m
But
0 ∞
λ ET = λ P(T > t)dt
0 ∞
0
m
=λ 1−∏ 1−e −p j λ t
dt
0 j=1
0 ∞
m
= 1 − ∏ 1 − e−p j t dt,
0 j=1
Suppose now that arrivals to the first queue stop at time T . Determine the mean
number of customers at the ith queue at each time t ≥ T .
Solution We apply the product theorem to the Poisson process of arrivals with
random vectors Yn = (Sn1 , . . . , Snk ) where Sni is the service time of the nth customer
at the ith queue. Then
Vi (t) = the number of customers in the ith queue at time t
∞
= ∑ 1 the nth customer arrived in the first queue at
n=1
time Jn is in the ith queue at time t
∞
= ∑ Jn > 0, Sn1 , . . . , Snk ≥ 0,
1
n=1
Jn + Sn1 + · · · + Sni−1 < t < Jn + Sn1 + · · · + Sni
∞
= ∑ 1 Jn , (Sn1 , . . . , Snk ) ∈ Ai (t) = M(Ai (t)).
n=1
Here (Jn : n ∈ N) denote the jump times of a Poisson process of rate λ , and the
measures M and ν on (0, ∞) × Rk+ are defined by
∞
M(A) = ∑1 (Jn ,Yn ) ∈ A , A ⊂ (0, ∞) × Rk+
n=1
and
ν (0,t] × B = λ t μ (B).
The product theorem states that M is a Poisson random measure on (0, ∞) × Rk+
with intensity measure ν . Next, the set Ai (t) ⊂ (0, ∞) × Rk+ is defined by
*
Ai (t) = (τ , s1 , . . . , sk ) : 0 < τ < t, s1 , . . . , sk ≥ 0
+
and τ + s1 + · · · + si−1 ≤ t < τ + s1 + · · · + si
'
= (τ , s1 , . . . , sk ) : 0 < τ < t, s1 , . . . , sk ≥ 0
@
i−1 i
and ∑ sl ≤ t − τ < ∑ sl .
l=1 l=1
Sets Ai (t) are pairwise disjoint for i = 1, . . . , k (as t − τ can fall between subse-
i−1 i
quent partial sums ∑ sl and ∑ sl only once). So, the random variables Vi (t) are
l=1 l=1
independent Poisson.
4.6 Additional problems for Chapter 4 483
A direct verification is through the joint MGF. Namely, let Nt ∼ Po(λ t) be the
number of arrivals at the first queue by time t. Then write
In turn, given n = 1, 2, . . . and points 0 < τ1 < · · · < τn < t, the conditional expec-
tations is
k
E exp ∑ θiVi (t) Nt = n; J1 = τ1 , . . . , Jn = τn
i=1
k n
= E exp ∑ θi ∑ 1 τ j , (S j , . . . , S j ) ∈ Ai (t)
1 k
i=1 j=1
n k
= E exp ∑ ∑ θi 1 τ j , (S j , . . . , S j ) ∈ Ai (t)
1 k
j=1 i=1
n k
= ∏ E exp ∑ θi 1 τ j , (S1j , . . . , Skj ) ∈ Ai (t) .
j=1 i=1
By the uniqueness of a random variable with a given MGF, this implies that
0
t i−1 i
Vi (t) ∼ Po λ
0
P ∑ Sl < t − τ < ∑ Sl dτ , independently.
l=1 l=1
484 Further Topics from Information Theory
Finally, write Vi (t, T ) for the number of customers in queue i at time t after closing
the entrance at time T . Then
0T 0t
EVi (t, T ) = λ P(Nt−τ = i − 1)dτ = λ E 1(Ns = i − 1)ds
0 t−T
= P(Nt ≥ i) − P(Nt−T ≥ i) .
μ
Solution (i) If J1 is the arrival time of the first customer then J1 + S1 is the time he
enters the checkout till and J1 + S1 + g(S1 ) the time he leaves. Let J2 be the time of
arrival of the second customer. Then J1 , J2 − J1 ∼ Exp(λ ), independently.
Then
0∞ 0∞ 0t2
−λ t1 −λ t2
P(S1 + g(S1 ) < J2 − J1 ) = dt1 λ e dt2 λ e ds1 f (s1 ,t1 )1(s1 + g(s1 ) < t2 )
0 0 0
0∞ 0∞ 0∞
−λ t1
= dt1 λ e ds1 f (s1 ,t1 ) dt2 λ e−λ t2
0 0 s1 +g(s1 )
0∞ 0∞
= dt1 λ e−λ t1 ds1 f (s1 ,t1 )e−λ (s1 +g(s1 )) .
0 0
4.6 Additional problems for Chapter 4 485
(ii) Let NTch be the number of checkouts used at time T . By the product theorem,
4.4.11, NTch ∼ Po(Λ(T )) where
0T 0∞
Λ(T ) = λ du ds f (s, u)1(u + s < T, u + s + g(s) > T )
0 0
0T 0∞
=λ du ds f (s, u)1(T − g(s) < u + s < T ).
0 0
NTarr
NTch = ∑1 Ji + Si < T < Ji + Si + g(Si ) ,
i=1
Problem 4.4 A library is open from 9am to 5pm. No student may enter after
5pm; a student already in the library may remain after 5pm. Students arrive at the
library in the period from 9am to 5pm in the manner of a Poisson process of rate
λ . Each student spends in the library a random amount of time, H hours, where
486 Further Topics from Information Theory
(a) Find the distribution of the number of students who leave the library between
3pm and 4pm.
(b) Prove that the mean number of students who leave between 3pm and 4pm is
E[min(1, (7 − H)+ )], where w+ denotes max[w, 0].
(c) What is the number of students still in the library at closing time?
Solution The library is open from 9am to 5pm. Students arrive as a PP(λ ). The
problem is equivalent to an M/GI/∞ queue (until 5pm, when the restriction of no
more arrivals applies, but for problems involving earlier times this is unimportant).
Denote by Jn the arrival time of the nth student using the 24 hour clock.
Denote by Hn the time the nth student spends in the library.
Again use the product theorem, 4.4.11, for the random measure on (0, 8) × (0, 8)
with atoms (Jn ,Yn ), where (Jn : n ∈ N) are the arrival times and (Yn : n ∈ N) are
periods of time that students stay in the library. Define measures on (0, ∞) × R+
by μ ((0,t) × B) = λ t μ (B), N(A) = ∑ 1((Jn ,Hn )∈A) . Then N is a Poisson random
n
1y
measure with intensity ν ([0,t] × [0, y]) = λ tF(y), where F(y) = h(x)dx (the time
0
t = 0 corresponds to 9am).
(a) Now, the number of students leaving the library between 3pm and 4pm (i.e.
6 ≤ t ≤ 7) has a Poisson distribution Po(ν (A)) where A = {(r, s) : s ∈ [0, 7], r ∈
[6 − s, 7 − s] if s ≤ 6; r ∈ [0, 7 − s] if s > 6}. Here
08 0 +
(7−r) 08
So, the distribution of students leaving the library between 3pm and 4pm is Poisson
17
with rate = λ [(7 − y)+ − (6 − y)+ ]dF(r).
0
(b)
⎧
⎪
⎨0, if y ≥ 7,
(7 − y)+ − (6 − y)+ = 7 − y, if 6 ≤ y ≤ 7,
⎪
⎩
1, if y ≤ 6.
4.6 Additional problems for Chapter 4 487
The mean number of students leaving the library between 3pm and 4pm is
08
ν (A) = λ [min(1, (7 − r)+ ]dF(r) = λ E[min(1, (7 − H)+ )]
0
as required.
(c) For students still to be there at closing time we require J + H ≥ 8, as H ranges
over [0, 8], and J ranges over [8 − H, 8]. Let
So,
08 08 08 0
8−x
ν (B) = λ dt dF(x) = λ dF(x) dt
0 8−t 0 0
08 08 08
=λ (8 − x)dF(x) = 8λ dF(x) − λ xdF(x),
0 0 0
18 18
but dF(x) = 1 and xdF(x) = E[H] = 1 imply λ E[H] = λ . Hence, the expected
0 0
number of students in the library at closing time is 7λ .
Problem 4.5 (i) Prove Campbell’s theorem, i.e. show that if M is a Poisson
random measure on the state space E with intensity measure μ and a : E → R is a
bounded measurable function, then
⎡ ⎤
0
E[e θX
] = exp ⎣ (eθ a(y) − 1)μ (ddy)⎦ , (4.6.3)
E
1
where X = a(y)M(dy) (assume that λ = μ (E) < ∞).
E
(ii) Shots are heard at jump times J1 , J2 , . . . of a Poisson process with rate λ . The
initial amplitudes of the gunshots A1 , A2 , . . . ∼ Exp(2) are IID exponentially dis-
tributed with parameter 2, and the amplitutes decay linearly at rate α. Compute the
MGF of the total amplitude Xt at time t :
Xt = ∑ An (1 − α (t − Jn )+ )1(Jn ≤t) ;
n
x+ = x if x ≥ 0 and 0 otherwise.
488 Further Topics from Information Theory
Hence,
E[eθ X ] = ∑ E[eθ X | M(E) = n]P(M(E) = n)
n
⎛ ⎞n
0
e−λ λ n
= ∑⎝ eθ a(y) μ (dy)/λ ⎠
n n!
E
⎛ ⎞
0
= exp ⎝ (eθ a(y) − 1)μ (dy)⎠ .
E
(ii) Fix t and let E = [0,t] × R+ and ν and M be such that ν (ds, dx) =
2λ e−2x dsdx, M(B) = ∑ 1{(Jn ,An )∈B} . By the product theorem M is a Poisson ran-
n
dom measure with intensity measure ν . Set at (s, x) = x(1 − α (t − s))+ , then
1
Xt = at (s, x)M(ds, dx). So, by Campbell’s theorem, for θ < 2,
E
⎛ ⎞
0
E[eθ Xt ] = exp ⎝ (eθ at (s,x) − 1)ν (ds, dx)⎠
E
⎛ ⎞
0t 0∞
= e−λ t exp ⎝2λ e−x(2−θ (1−α (t−s))+ ) dxds⎠
0 0
⎛ ⎞
0t
1
= e−λ t exp ⎝2λ ds ⎠
2 − θ (1 − α (t − s))+
0
2 − θ + θ α min[t, 1/α ]
θ2λα
= e−λ min[t,1/α ]
2−θ
1t 1 t− α1 1t
by splitting integral 0 = 0 + t− α1 in the case t > α1 .
Problem 4.6 Seeds are planted in a field S ⊂ R2 . The random way they are sown
means that they form a Poisson process on S with density λ (x, y). The seeds grow
into plants that are later harvested as a crop, and the weight of the plant at (x, y) has
4.6 Additional problems for Chapter 4 489
mean m(x, y) and variance v(x, y). The weights of different plants are independent
random variables. Show that the total weight W of all the plants is a random variable
with finite mean 00
I1 = m(x, y)λ (x, y) dxdy
S
and variance 00 * +
I2 = m(x, y)2 + v(x, y) λ (x, y) dxdy ,
S
is finite. Then the number N of plants is finite and has the distribution Po(μ ).
Conditional on N, their positions may be taken as independent random variables
(Xn ,Yn ), n = 1, . . . , N, with density λ /μ on S. The weights of the plants are then
independent, with
0
EW = m(x, y)λ (x, y)μ −1 dxdy = μ −1 I1
S
and 0
and
N
Var W |N = ∑ μ −1 I2 − μ −2 I12 = N μ −1 I2 − μ −2 I12 .
n=1
Then
EW = EN μ −1 I1 = I1
and
as required.
490 Further Topics from Information Theory
1 − (1 + 2π r)e−2π r .
Is Φ a Poisson process?
If there is at least one point of Φ in D then there must be at least two lines of Π
meeting D, and this has probability
(2π r)n −2π r
∑ n!
e = 1 − (1 + 2π r)e−2π r .
n≥2
4.6 Additional problems for Chapter 4 491
The probability of a point of Φ lying in D is strictly less than this, because there
may be two lines meeting D whose intersection lies outside D.
Finally, Φ is not a Poisson process, since it has with positive probability collinear
points.
Problem 4.8 Particular cases of the Poisson–Dirichlet distribution for the ran-
dom sequence (p1 , p2 , p3 , . . .) with parameter θ appeared in PSE II the definition
is given below. Show that, for any polynomial φ with φ (0) = 0,
' @ 0
∞ 1
E ∑ φ (pn ) =θ
0
φ (x)x−1 (1 − x)θ −1 dx . (4.6.5)
n=1
Here Gam stands for the Gamma distribution; see PSE I, Appendix.
To prove (4.6.5), we can take pn = ξn /σ and use the fact that σ and p are inde-
pendent. For k ≥ 1,
0
∞
E ∑ ξnk =
0
xk θ x−1 e−x dx = θ Γ(k).
n≥1
Thus,
0 1
θ Γ(k)Γ(θ )
E ∑ pkn =
Γ(k + θ )
=θ
0
xk−1 (1 − x)θ −1 dx.
n≥1
We see that the identity (4.6.5) holds for φ (x) = xk (with k ≥ 1) and hence by
linearity for all polynomials with φ (0) = 0.
492 Further Topics from Information Theory
If a > 1/2, there can be at most one such pn , so that p1 has the PDF
But this fails on (0, 1/2), and the identity (4.6.5) does not determine the distribution
of p1 on this interval.
Problem 4.9 The positions of trees in a large forest can be modelled as a Pois-
son process Π of constant rate λ on R2 . Each tree produces a random number of
seeds having a Poisson distribution with mean μ. Each seed falls to earth at a point
uniformly distributed over the circle of radius r whose centre is the tree. The po-
sitions of the different seeds relative to their parent tree, and the numbers of seeds
produced by a given tree, are independent of each other and of Π. Prove that, con-
ditional on Π, the seeds form a Poisson process Π∗ whose mean measure depends
on Π. Is the unconditional distribution of Π∗ that of a Poisson process?
Solution By a direct calculation, the seeds from a tree at X form a Poisson process
with rate
'
π −1 r−2 , |x − X| < r,
ρX (x) =
0, otherwise.
Superposing these independent Poisson processes gives a Poisson process with rate
ΛΠ (x) = ∑ ρX (x);
X∈Π
and variance
Var N(A) = E Var N(A)|Π + Var E N(A)|Π
0
Problem 4.10 A uniform Poisson process Π in the unit ball of R3 is one whose
mean measure is Lebesgue measure (volume) on
B = {(x, y, z) ∈ R3 : r2 = x2 + y2 + z2 1}.
Show that
Π1 = {r : (x, y, z) ∈ Π}
is a Poisson process on [0, 1] and find its mean measure. Show that
Problem 4.11 The points of Π are coloured randomly either red or green, the
probability of any point being red being r, 0 < r < 1, and the colours of different
points being independent. Show that the red and the green points form independent
Poisson processes.
494 Further Topics from Information Theory
where N1 and N2 are the numbers of red and green points. Conditional on N(A) = n,
N1 (A) has the binomial distribution Bin(n, r). Thus,
P N1 (A) = k, N2 (A) = l
= P N(A) = k + l P N1 (A) = k|N(A) = k + l
μ (A)k+l e−μ (A) k + l k
= r (1 − r)l
(k + l)! k
[r μ (A)]k e−r μ (A) [(1 − r)μ (A)]l e−(1−r)μ (A)
= .
k! l!
Hence, N1 (A) and N2 (A) are independent Poisson random variables with means
r μ (A) and (1 − r)μ (A), respectively.
If A1 , A2 , . . . are disjoint sets then the pairs
(there no problem about whether or not the inequality is strict since the difference
involves events of zero probability). The number of points of Π satisfying these
two inequalities is Poisson, with mean
0
μ= λ (x,t, v)1 t < τ , ||x − ξ || < r(τ − t, v) dxdtdv.
Hence, the probability that ξ is dry is e−μ (or 0 if μ = +∞). Finally, the formula
for the expected total rainfall,
∑ V,
(X,T,V )∈Π
(i) What is the distribution of the number of lines intersecting the disk Da = {z ∈
R2 : | z |≤ a}?
(ii) What is the distribution of the distance from the origin to the nearest line?
(iii) What is the distribution of the distance from the origin to the kth nearest line?
Solution (i) A line intersects the disk Da = {z ∈ R2 : |z| ≤ a} if and only if its
representative point (x, θ ) lies in (−a, a) × [0, π ). Hence,
(ii) Let Y be the distance from the origin to the nearest line. Then
P(Y ≥ a) = P M((−a, a) × [0, π )) = 0 = exp(−2aλ π ),
i.e. Y ∼ Exp(2πλ ).
(iii) Let Y1 ,Y2 , . . . be the distances from the origin to the nearest line, the second
nearest line, and so on. Then the Yi are the atoms of the PRM N on R+ which is
obtained from M by the projection (x, θ ) → |x|. By the mapping theorem, N is the
Poisson process on R+ of rate 2πλ . Hence, Yk ∼ Gam(k, 2λ π ), as Yk = S1 +· · ·+Sk
where Si ∼ Exp(2πλ ), independently.
496 Further Topics from Information Theory
where
>
d jk = ∑ (a jt − akt )2 vt .
1≤t≤n
Suppose that M = 2 and that the transmitted waveforms are subject to the power
constraint ∑ a2jt ≤ K , j = 1, 2. Which of the two waveforms minimises the prob-
1≤t≤n
ability of error?
[Hint: You may assume validity of the bound P(Z ≥ a) ≤ exp(−a2 /2), where Z is
a standard N(0, 1) random variable.]
Solution Let f j = fch (y|X = a j ) be the PDF of receiving a vector y given that a
‘waveform’ A j = (a jt ) was transmitted. Then
1
M∑ ∑ P({y : fk (y) ≥ f j (y)|X = A j }).
P(error) ≤
j k:k= j
Let V be the diagonal matrix with the diagonal elements v j . In the present case,
1 n
f j = C exp − ∑ (yt − a jt ) /vt
2
2 t=1
1 T −1
= C exp − (Y − A j ) V (Y − A j ) .
2
Then if X = A j and Y = A j + ε we have
1 1
log fk − log f j = − (A j − Ak + ε )TV −1 (A j − Ak + ε ) + ε TV −1 ε
2 2
1 T −1
= − d jk − (A j − (Ak ) V ε )
2
1 $
= − d jk + d jk Z
2
4.6 Additional problems for Chapter 4 497
subject to
By Cauchy–Schwarz,
S S 2
−1 −1 −1
(A1 − A2 ) V
T
(A1 − A2 ) ≤ T T
A1 V A1 + A2 V A2 (4.6.7)
with equality holding when A1 = const A2 . Further, in our case V is diagonal, and
(4.6.7) is maximised when ATj A j = K, j = 1, 2, . . . We conclude that
a1t = −a2t = bt
−M log M + (M + 1) log(M + 1)
If p > 1/3 then p/q > 1/2 and the alternate probabilities become negative,
which means that there is no distribution for X giving an optimum for Y . Then
we would have to maximise
− ∑ py log py , subject to py = pπy−1 + qπy ,
y
with
∂ 1
L = (vi + pi )−1 − λ , i = 1, . . . , r
∂ pi 2
and the maximum at
1 1
pi = max 0, − vi = − vi .
2λ 2λ +
The existence and uniqueness of λ ∗ follows since the LHS monotonically de-
creases from +∞ to 0. Thus,
1 1
C = ∑ log .
2 i 2λ ∗ vi
where Ξ = Ξ(γ ) = exp − γψ (x) μ (dx) is the normalising constant and γ is cho-
sen so that
0
∗ ψ (x)
Eψ (X ) = exp − γψ (x) μ (dx) = β . (4.6.8b)
Ξ
1 ψ (x)
Assume that the value γ with the property exp − γψ (x) μ (dx) = β exists.
Ξ
Show that if, in addition, function ψ is non-negative, then, for any given β > 0,
the PMF fX ∗ from (4.6.8a), (4.6.8b) maximises the entropy h(X) under a wider
constraint Eψ (X) ≤ β .
Consequently, calculate the maximal value of h(X) subject to Eψ (X) ≤ β , in
the following cases: (i) when A is a finite set, μ is a positive measure on A (with
μi = μ ({i}) = 1/μ (A) where μ (A) = ∑ μ j ) and ψ (x) ≡ 1, x ∈ A; (ii) when A is
j∈A
an arbitrary set, μ is a positive measure on A with μ (A) < ∞ and ψ (x) ≡ 1, x ∈ A;
(iii) when A = R is a real line, μ is the Lebesgue measure and ψ (x) = |x|; (iv)
when A = Rd , μ is a d -dimensional Lebesgue measure and ψ (x) = ∑ Ki j xi x j ,
1|leq j≤d
where K = (Ki j ) is a d × d positive definite real matrix.
Solution With ln fX∗ (x) = −γψ (x) − ln Ξ, we use the Gibbs inequality:
0 0
501
502 Bibliography
[18] R.E. Blahut. Principles and Practice of Information Theory. Reading, MA:
Addison-Wesley, 1987.
[19] R.E. Blahut. Theory and Practice of Error Control Codes. Reading, MA: Addison-
Wesley, 1983. See also Algebraic Codes for Data Transmission. Cambridge:
Cambridge University Press, 2003.
[20] R.E. Blahut. Algebraic Codes on Lines, Planes, and Curves. Cambridge:
Cambridge University Press, 2008.
[21] I.F. Blake, R.C. Mullin. The Mathematical Theory of Coding. New York: Academic
Press, 1975.
[22] I.F. Blake, R.C. Mullin. An Introduction to Algebraic and Combinatorial Coding
Theory. New York: Academic Press, 1976.
[23] I.F. Blake (ed). Algebraic Coding Theory: History and Development. Stroudsburg,
PA: Dowden, Hutchinson & Ross, 1973.
[24] N. Blachman. Noise and its Effect on Communication. New York: McGraw-Hill,
1966.
[25] R.C. Bose, D.K. Ray-Chaudhuri. On a class of errors, correcting binary group
codes. Information and Control, 3(1), 68–79, 1960.
[26] W. Bradley, Y.M. Suhov. The entropy of famous reals: some empirical results.
Random and Computational Dynamics, 5, 349–359, 1997.
[27] A.A. Bruen, M.A. Forcinito. Cryptography, Information Theory, and Error-
Correction: A Handbook for the 21st Century. Hoboken, NJ: Wiley-Interscience,
2005.
[28] J.A. Buchmann. Introduction to Cryptography. New York: Springer-Verlag, 2002.
[29] P.J. Cameron, J.H. van Lint. Designs, Graphs, Codes and their Links. Cambridge:
Cambridge University Press, 1991.
[30] J. Castiñeira Moreira, P.G. Farrell. Essentials of Error-Control Coding. Chichester:
Wiley, 2006.
[31] W.G. Chambers. Basics of Communications and Coding. Oxford: Clarendon, 1985.
[32] G.J. Chaitin. The Limits of Mathematics: A Course on Information Theory and the
Limits of Formal Reasoning. Singapore: Springer, 1998.
[33] G. Chaitin. Information-Theoretic Incompleteness. Singapore: World Scientific,
1992.
[34] G. Chaitin. Algorithmic Information Theory. Cambridge: Cambridge University
Press, 1987.
[35] F. Conway, J. Siegelman. Dark Hero of the Information Age: In Search of Norbert
Wiener, the Father of Cybernetics. New York: Basic Books, 2005.
[36] T.M. Cover, J.M. Thomas. Elements of Information Theory. New York: Wiley,
2006.
[37] I. Csiszár, J. Körner. Information Theory: Coding Theorems for Discrete Memo-
ryless Systems. New York: Academic Press, 1981; Budapest: Akadémiai Kiadó,
1981.
[38] W.B. Davenport, W.L. Root. Random Signals and Noise. New York: McGraw Hill,
1958.
[39] A. Dembo, T. M. Cover, J. A. Thomas. Information theoretic inequalities. IEEE
Transactions on Information Theory, 37, (6), 1501–1518, 1991.
Bibliography 503
[40] R.L. Dobrushin. Taking the limit of the argument of entropy and information func-
tions. Teoriya Veroyatn. Primen., 5, (1), 29–37, 1960; English translation: Theory
of Probability and its Applications, 5, 25–32, 1960.
[41] F. Dyson. The Tragic Tale of a Genius. New York Review of Books, July 14, 2005.
[42] W. Ebeling. Lattices and Codes: A Course Partially Based on Lectures by F. Hirze-
bruch. Braunschweig/Wiesbaden: Vieweg, 1994.
[43] N. Elkies. Excellent codes from modular curves. STOC’01: Proceedings of the
33rd Annual Symposium on Theory of Computing (Hersonissos, Crete, Greece),
pp. 200–208, NY: ACM, 2001.
[44] S. Engelberg. Random Signals and Noise: A Mathematical Introduction. Boca Ra-
ton, FL: CRC/Taylor & Francis, 2007.
[45] R.M. Fano. Transmission of Information: A Statistical Theory of Communication.
New York: Wiley, 1961.
[46] A. Feinstein. Foundations of Information Theory. New York: McGraw-Hill, 1958.
[47] G.D. Forney. Concatenated Codes. Cambridge, MA: MIT Press, 1966.
[48] M. Franceschetti, R. Meester. Random Networks for Communication. From Sta-
tistical Physics to Information Science. Cambridge: Cambridge University Press,
2007.
[49] R. Gallager. Information Theory and Reliable Communications. New York: Wiley,
1968.
[50] A. Gofman, M. Kelbert, Un upper bound for Kullback–Leibler divergence with a
small number of outliers. Mathematical Communications, 18, (1), 75–78, 2013.
[51] S. Goldman. Information Theory. Englewood Cliffs, NJ: Prentice-Hall, 1953.
[52] C.M. Goldie, R.G.E. Pinch. Communication Theory. Cambridge: Cambridge
University Press, 1991.
[53] O. Goldreich. Foundations of Cryptography, Vols 1, 2. Cambridge: Cambridge
University Press, 2001, 2004.
[54] V.D. Goppa. Geometry and Codes. Dordrecht: Kluwer, 1988.
[55] S. Gravano. Introduction to Error Control Codes. Oxford: Oxford University Press,
2001.
[56] R.M. Gray. Source Coding Theory. Boston: Kluwer, 1990.
[57] R.M. Gray. Entropy and Information Theory. New York: Springer-Verlag, 1990.
[58] R.M. Gray, L.D. Davisson (eds). Ergodic and Information Theory. Stroudsburg,
CA: Dowden, Hutchinson & Ross, 1977 .
[59] V. Guruswami, M. Sudan. Improved decoding of Reed–Solomon codes and alge-
braic geometry codes. IEEE Trans. Inform. Theory, 45, (6), 1757–1767, 1999.
[60] R.W. Hamming. Coding and Information Theory. 2nd ed. Englewood Cliffs, NJ:
Prentice-Hall, 1986.
[61] T.S. Han. Information-Spectrum Methods in Information Theory. New York:
Springer-Verlag, 2002.
[62] D.R. Hankerson, G.A. Harris, P.D. Johnson, Jr. Introduction to Information Theory
and Data Compression. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC, 2003.
[63] D.R. Hankerson et al. Coding Theory and Cryptography: The Essentials. 2nd ed.
New York: M. Dekker, 2000. (Earlier version: D. G. Hoffman et al. Coding Theory:
The Essentials. New York: M. Dekker, 1991.)
[64] W.E. Hartnett. Foundations of Coding Theory. Dordrecht: Reidel, 1974.
504 Bibliography
[65] S.J. Heims. John von Neumann and Norbert Wiener: From Mathematics to the
Technologies of Life and Death. Cambridge, MA: MIT Press, 1980.
[66] C. Helstrom. Statistical Theory of Signal Detection. 2nd ed. Oxford: Pergamon
Press, 1968.
[67] C.W. Helstrom. Elements of Signal Detection and Estimation. Englewood Cliffs,
NJ: Prentice-Hall, 1995.
[68] R. Hill. A First Course in Coding Theory. Oxford: Oxford University Press, 1986.
[69] T. Ho, D.S. Lun. Network Coding: An Introduction. Cambridge: Cambridge Uni-
versity Press, 2008.
[70] A. Hocquenghem. Codes correcteurs d’erreurs. Chiffres, 2, 147–156, 1959.
[71] W.C. Huffman, V. Pless. Fundamentals of Error-Correcting Codes. Cambridge:
Cambridge University Press, 2003.
[72] J.F. Humphreys, M.Y. Prest. Numbers, Groups, and Codes. 2nd ed. Cambridge:
Cambridge University Press, 2004.
[73] S. Ihara. Information Theory for Continuous Systems. Singapore: World Scientific,
1993 .
[74] F.M. Ingels. Information and Coding Theory. Scranton: Intext Educational Pub-
lishers, 1971.
[75] I.M. James. Remarkable Mathematicians. From Euler to von Neumann.
Cambridge: Cambridge University Press, 2009 .
[76] E.T. Jaynes. Papers on Probability, Statistics and Statistical Physics. Dordrecht:
Reidel, 1982.
[77] F. Jelinek. Probabilistic Information Theory. New York: McGraw-Hill, 1968.
[78] G.A. Jones, J.M. Jones. Information and Coding Theory. London: Springer, 2000.
[79] D.S. Jones. Elementary Information Theory. Oxford: Clarendon Press, 1979.
[80] O. Johnson. Information Theory and the Central Limit Theorem. London: Imperial
College Press, 2004.
[81] J. Justensen. A class of constructive asymptotically good algebraic codes. IEEE
Transactions Information Theory, 18(5), 652–656, 1972.
[82] M. Kelbert, Y. Suhov. Continuity of mutual entropy in the large signal-to-noise
ratio limit. In Stochastic Analysis 2010, pp. 281–299, 2010. Berlin: Springer.
[83] N. Khalatnikov. Dau, Centaurus and Others. Moscow: Fizmatlit, 2007.
[84] A.Y. Khintchin. Mathematical Foundations of Information Theory. New York:
Dover, 1957.
[85] T. Klove. Codes for Error Detection. Singapore: World Scientific, 2007.
[86] N. Koblitz. A Course in Number Theory and Cryptography. New York: Springer,
1993 .
[87] H. Krishna. Computational Complexity of Bilinear Forms: Algebraic Coding The-
ory and Applications of Digital Communication Systems. Lecture notes in control
and information sciences, Vol. 94. Berlin: Springer-Verlag, 1987.
[88] S. Kullback. Information Theory and Statistics. New York: Wiley, 1959.
[89] S. Kullback, J.C. Keegel, J.H. Kullback. Topics in Statistical Information Theory.
Berlin: Springer, 1987.
[90] H.J. Landau, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, II. Bell System Technical Journal, 64–84, 1961.
Bibliography 505
[91] H.J. Landau, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, III. The dimension of the space of essentially time- and band-limited
signals. Bell System Technical Journal, 1295–1336, 1962.
[92] R. Lidl, H. Niederreiter. Finite Fields. Cambridge: Cambridge University Press,
1997.
[93] R. Lidl, G. Pilz. Applied Abstract Algebra. 2nd ed. New York: Wiley, 1999.
[94] E.H. Lieb. Proof of entropy conjecture of Wehrl. Commun. Math. Phys., 62, (1),
35–41, 1978.
[95] S. Lin. An Introduction to Error-Correcting Codes. Englewood Cliffs, NJ; London:
Prentice-Hall, 1970.
[96] S. Lin, D.J. Costello. Error Control Coding: Fundamentals and Applications.
Englewood Cliffs, NJ: Prentice-Hall, 1983.
[97] S. Ling, C. Xing. Coding Theory. Cambridge: Cambridge University Press, 2004.
[98] J.H. van Lint. Introduction to Coding Theory. 3rd ed. Berlin: Springer, 1999.
[99] J.H. van Lint, G. van der Geer. Introduction to Coding Theory and Algebraic
Geometry. Basel: Birkhäuser, 1988.
[100] J.C.A. van der Lubbe. Information Theory. Cambridge: Cambridge University
Press, 1997.
[101] R.E. Lewand. Cryptological Mathematics. Washington, DC: Mathematical Asso-
ciation of America, 2000.
[102] J.A. Llewellyn. Information and Coding. Bromley: Chartwell-Bratt; Lund:
Studentlitteratur, 1987.
[103] M. Loève. Probability Theory. Princeton, NJ: van Nostrand, 1955.
[104] D.G. Luenberger. Information Science. Princeton, NJ: Princeton University Press,
2006.
[105] D.J.C. Mackay. Information Theory, Inference and Learning Algorithms.
Cambridge: Cambridge University Press, 2003.
[106] H.B. Mann (ed). Error-Correcting Codes. New York: Wiley, 1969 .
[107] M. Marcus. Dark Hero of the Information Age: In Search of Norbert Wiener, the
Father of Cybernetics. Notices of the AMS 53, (5), 574–579, 2005.
[108] A. Marshall, I. Olkin. Inequalities: Theory of Majorization and its Applications.
New York: Academic Press, 1979 .
[109] V.P. Maslov, A.S. Chernyi. On the minimization and maximization of entropy in
various disciplines. Theory Probab. Appl. 48, (3), 447–464, 2004.
[110] F.J. MacWilliams, N.J.A. Sloane. The Theory of Error-Correcting Codes, Vols I,
II. Amsterdam: North-Holland, 1977.
[111] R.J. McEliece. The Theory of Information and Coding. Reading, MA: Addison-
Wesley, 1977. 2nd ed. Cambridge: Cambridge University Press, 2002.
[112] R. McEliece. The Theory of Information and Coding. Student ed. Cambridge:
Cambridge University Press, 2004.
[113] A. Menon, R.M. Buecher, J.H. Read. Impact of exclusion region and spreading in
spectrum-sharing ad hoc networks. ACM 1-59593-510-X/06/08, 2006 .
[114] R.A. Mollin. RSA and Public-Key Cryptography. New York: Chapman & Hall,
2003.
[115] R.H. Morelos-Zaragoza. The Art of Error-Correcting Coding. 2nd ed. Chichester:
Wiley, 2006.
506 Bibliography
[116] G.L. Mullen, C. Mummert. Finite Fields and Applications. Providence, RI:
American Mathematical Society, 2007.
[117] A. Myasnikov, V. Shpilrain, A. Ushakov. Group-Based Cryptography. Basel:
Birkhäuser, 2008.
[118] G. Nebe, E.M. Rains, N.J.A. Sloane. Self-Dual Codes and Invariant Theory. New
York: Springer, 2006.
[119] H. Niederreiter, C. Xing. Rational Points on Curves over Finite Fields: Theory and
Applications. Cambridge: Cambridge University Press, 2001.
[120] W.W. Peterson, E.J. Weldon. Error-Correcting Codes. 2nd ed. Cambridge,
MA: MIT Press, 1972. (Previous ed. W.W. Peterson. Error-Correcting Codes.
Cambridge, MA: MIT Press, 1961.)
[121] M.S. Pinsker. Information and Information Stability of Random Variables and Pro-
cesses. San Francisco: Holden-Day, 1964.
[122] V. Pless. Introduction to the Theory of Error-Correcting Codes. 2nd ed. New York:
Wiley, 1989.
[123] V.S. Pless, W.C. Huffman (eds). Handbook of Coding Theory, Vols 1, 2. Amster-
dam: Elsevier, 1998.
[124] P. Piret. Convolutional Codes: An Algebraic Approach. Cambridge, MA: MIT
Press, 1988.
[125] O. Pretzel. Error-Correcting Codes and Finite Fields. Oxford: Clarendon Press,
1992; Student ed. 1996.
[126] T.R.N. Rao. Error Coding for Arithmetic Processors. New York: Academic Press,
1974.
[127] M. Reed, B. Simon. Methods of Modern Mathematical Physics, Vol. II. Fourier
analysis, self-adjointness. New York: Academic Press, 1975.
[128] A. Rényi. A Diary on Information Theory. Chichester: Wiley, 1987; initially pub-
lished Budapest: Akad’emiai Kiadó, 1984.
[129] F.M. Reza. An Introduction to Information Theory. New York: Constable, 1994.
[130] S. Roman. Coding and Information Theory. New York: Springer, 1992.
[131] S. Roman. Field Theory. 2nd ed. New York: Springer, 2006.
[132] T. Richardson, R. Urbanke. Modern Coding Theory. Cambridge: Cambridge Uni-
versity Press, 2008.
[133] R.M. Roth. Introduction to Coding Theory. Cambridge: Cambridge University
Press, 2006.
[134] B. Ryabko, A. Fionov. Basics of Contemporary Cryptography for IT Practitioners.
Singapore: World Scientific, 2005.
[135] W.E. Ryan, S. Lin. Channel Codes: Classical and Modern. Cambridge: Cambridge
University Press, 2009.
[136] T. Schürmann, P. Grassberger. Entropy estimation of symbol sequences. Chaos, 6,
(3), 414–427, 1996.
[137] P. Seibt. Algorithmic Information Theory: Mathematics of Digital Information Pro-
cessing. Berlin: Springer, 2006.
[138] C.E. Shannon. A mathematical theory of cryptography. Bell Lab. Tech. Memo.,
1945.
[139] C.E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27, July, October, 379–423, 623–658, 1948.
Bibliography 507
[140] C.E. Shannon: Collected Papers. N.J.A. Sloane, A.D. Wyner (eds). New York:
IEEE Press, 1993.
[141] C.E. Shannon, W. Weaver. The Mathematical Theory of Communication. Urbana,
IL: University of Illinois Press, 1949.
[142] P.C. Shields. The Ergodic Theory of Discrete Sample Paths. Providence, RI:
American Mathematical Society, 1996.
[143] M.S. Shrikhande, S.S. Sane. Quasi-Symmetric Designs. Cambridge: Cambridge
University Press, 1991.
[144] S. Simic. Best possible global bounds for Jensen functionals. Proc. AMS, 138, (7),
2457–2462, 2010.
[145] A. Sinkov. Elementary Cryptanalysis: A Mathematical Approach. 2nd ed. revised
and updated by T. Feil. Washington, DC: Mathematical Association of America,
2009.
[146] D. Slepian, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, Vol. I. Bell System Technical Journal, 43–64, 1961 .
[147] W. Stallings. Cryptography and Network Security: Principles and Practice. 5th ed.
Boston, MA: Prentice Hall; London: Pearson Education, 2011.
[148] H. Stichtenoth. Algebraic Function Fields and Codes. Berlin: Springer, 1993.
[149] D.R. Stinson. Cryptography: Theory and Practice. 2nd ed. Boca Raton, FL;
London: Chapman & Hall/CRC, 2002.
[150] D. Stoyan, W.S. Kendall. J. Mecke. Stochastic Geometry and its Applications.
Berlin: Academie-Verlag, 1987 .
[151] C. Schlegel, L. Perez. Trellis and Turbo Coding. New York: Wiley, 2004.
[152] Š. Šujan. Ergodic Theory, Entropy and Coding Problems of Information Theory.
Praha: Academia, 1983.
[153] P. Sweeney. Error Control Coding: An Introduction. New York: Prentice Hall,
1991.
[154] Te Sun Han, K. Kobayashi. Mathematics of Information and Coding. Providence,
RI: American Mathematical Society, 2002.
[155] T.M. Thompson. From Error-Correcting Codes through Sphere Packings to Simple
Groups. Washington, DC: Mathematical Association of America, 1983.
[156] R. Togneri, C.J.S. deSilva. Fundamentals of Information Theory and Coding
Design. Boca Raton, FL: Chapman & Hall/CRC, 2002.
[157] W. Trappe, L.C. Washington. Introduction to Cryptography: With Coding Theory.
2nd ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2006.
[158] M.A. Tsfasman, S.G. Vlǎdut. Algebraic-Geometric Codes. Dordrecht: Kluwer
Academic, 1991.
[159] M. Tsfasman, S. Vlǎdut, T. Zink. Modular curves, Shimura curves and Goppa
codes, better than Varshamov–Gilbert bound. Mathematics Nachrichten, 109,
21–28, 1982.
[160] M. Tsfasman, S. Vlǎdut, D. Nogin. Algebraic Geometric Codes: Basic Notions.
Providence, RI: American Mathematical Society, 2007.
[161] M.J. Usher. Information Theory for Information Technologists. London: Macmil-
lan, 1984.
[162] M.J. Usher, C.G. Guy. Information and Communication for Engineers. Bas-
ingstoke: Macmillan, 1997
508 Bibliography
[163] I. Vajda. Theory of Statistical Inference and Information. Dordrecht: Kluwer, 1989.
[164] S. Verdú. Multiuser Detection. New York: Cambridge University Press, 1998.
[165] S. Verdú, D. Guo. A simple proof of the entropy–power inequality. IEEE Trans.
Inform. Theory, 52, (5), 2165–2166, 2006.
[166] L.R. Vermani. Elements of Algebraic Coding Theory. London: Chapman & Hall,
1996.
[167] B. Vucetic, J. Yuan. Turbo Codes: Principles and Applications. Norwell, MA:
Kluwer, 2000.
[168] G. Wade. Coding Techniques: An Introduction to Compression and Error Control.
Basingstoke: Palgrave, 2000.
[169] J.L. Walker. Codes and Curves. Providence, RI: American Mathematical Society,
2000.
[170] D. Welsh. Codes and Cryptography. Oxford, Oxford University Press, 1988.
[171] N. Wiener. Cybernetics or Control and Communication in Animal and Machine.
Cambridge, MA: MIT Press, 1948; 2nd ed: 1961, 1962.
[172] J. Wolfowitz. Coding Theorems of Information Theory. Berlin: Springer, 1961; 3rd
ed: 1978.
[173] A.D. Wyner. The capacity of the band-limited Gaussian channel. Bell System Tech-
nical Journal, 359–395, 1996 .
[174] A.D. Wyner. The capacity of the product of channels. Information and Control,
423–433, 1966.
[175] C. Xing. Nonlinear codes from algebraic curves beating the Tsfasman–Vlǎdut–
Zink bound. IEEE Transactions Information Theory, 49, 1653–1657, 2003.
[176] A.M. Yaglom, I.M. Yaglom. Probability and Information. Dordrecht, Holland:
Reidel, 1983.
[177] R. Yeung. A First Course in Information Theory. Boston: Kluwer Academic, 1992;
2nd ed. New York: Kluwer, 2002.
Index
509
510 Index
key (as a part of a cipher) (cont.) measure (as a countably additive function of a set),
private key, 470 366
public key, 469 intensity (or mean) measure, 436
secret key, 473 non-atomic measure, 436
Karhunen–Loéve decomposition, 426 Poisson random measure, 436
product-measure, 371
law of large numbers, 34 random measure, 436
strong law of large numbers, 438 reference measure, 372
leader of a coset, 192 σ -finite, 436
least common multiple (lcm), 223 Möbius function, 277
lemma Möbius inversion formula, 278
Borel–Cantelli lemma, 418 moment generating function, 442
Nyquist–Shannon–Kotelnikov–Whittaker lemma,
431 network: see distributed system
letter, 2 supercritical network, 449
linear code, 148 network information theory, 436
linear representation of a group, 314 noise (in a channel), 2, 70
space of a linear representation, 314 Gaussian coloured noise, 374
dimension of a linear representation, 314 Gaussian white noise, 368
linear space, 146 noiseless channel, 103
linear subspace, 148 noisy (or fully noisy) channel, 81
linear feedback shift register (LFSR), 454
auxiliary, or feedback, polynomial of an LFSR, 454 one-time pad cipher, 466
operational channel capacity, 102
Markov chain, 1, 3 order of an element, 267
discrete-time Markov chain (DTMC), 1, 3 order of a polynomial, 231
coupled Markov chain, 50 orthogonal, 185
irreducible and aperiodic Markov chain, 128 ortho-basis, 430
kth-order Markov chain approximation, 407 orthogonal complement, 185
second-order Markov chain, 131 orthoprojection, 375
transition matrix of a Markov chain, 3 self-orthogonal, 227
Markov inequality, 408 output stream of a register, 454
Markov property, 33
strong Markov property, 50 parity-check code, 149
Markov source, 3 parity-check extension, 151
stationary Markov source, 3 parity-check matrix, 186
Markov triple, 33 plaintext, 468
Matérn process (with a hard core), 451 Poisson process, 436
first model of the Matérn process, 451 Poisson random measure, 436
second model of the Matérn process, 451 polynomial, 206
matrix, 13 algebra, polynomial, 214
covariance matrix, 88 degree of a polynomial, 206, 214
generating matrix, 185 distance enumerator polynomial, 322
generating check matrix, canonical, or standard, error locator polynomial, 239
form of, 189 Goppa polynomial, 335
parity-check matrix, 186 irreducible polynomial, 219
parity-check matrix, canonical, or standard, form Mattson–Solomon polynomial, 296
of, 189 minimal polynomial, 236
parity-check matrix of a Hamming code, 191 order of a polynomial, 231
positive definite matrix, 91 reducible polynomial, 221
recursion matrix, 174 primitive polynomial, 230, 267
Töplitz matrix, 93 Kravchuk polynomial, 320
transition matrix of a Markov chain, 3 weight enumerator polynomial, 319, 351
transition matrix, doubly stochastic, 34 probability distribution, vii, 1
Vandermonde matrix, 295 conditional probability, 1
maximum likelihood (ML) decoding rule, 66 probability density function (PDF), 86
Index 513