INFORMATION THEORY AND CODING BY EXAMPLE

This fundamental monograph introduces both the probabilistic and the algebraic
aspects of information theory and coding. It has evolved from the authors’ years
of experience teaching at the undergraduate level, including several Cambridge
Mathematical Tripos courses. The book provides relevant background material, a
wide range of worked examples and clear solutions to problems from real exam
papers. It is a valuable teaching aid for undergraduate and graduate students, or for
researchers and engineers who want to grasp the basic principles.

Mark Kelbert is a Reader in Statistics in the Department of Mathematics at Swansea University. For many years he has also been associated with the Moscow Institute of Information Transmission Problems and the International Institute of Earthquake Prediction Theory and Mathematical Geophysics (Moscow).

Yuri Suhov is a Professor of Applied Probability in the Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge (Emeritus). He is also affiliated to the University of São Paulo in Brazil and to the Moscow Institute of Information Transmission Problems.
INFORMATION THEORY AND CODING BY EXAMPLE

MARK KELBERT
Swansea University, and Universidade de São Paulo

YURI SUHOV
University of Cambridge, and Universidade de São Paulo
University Printing House, Cambridge CB2 8BS, United Kingdom

Published in the United States of America by Cambridge University Press, New York

Cambridge University Press is part of the University of Cambridge.


It furthers the University’s mission by disseminating knowledge in the pursuit of
education, learning and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9780521769358

© Cambridge University Press 2013
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2013
Printed in the United Kingdom by CPI Group Ltd, Croydon CR0 4YY
A catalogue record for this publication is available from the British Library
ISBN 978-0-521-76935-8 Hardback
ISBN 978-0-521-13988-5 Paperback
Cambridge University Press has no responsibility for the persistence or
accuracy of URLs for external or third-party internet websites referred to
in this publication, and does not guarantee that any content on such
websites is, or will remain, accurate or appropriate.
Contents

Preface page vii


1 Essentials of Information Theory 1
1.1 Basic concepts. The Kraft inequality. Huffman’s encoding 1
1.2 Entropy: an introduction 18
1.3 Shannon’s first coding theorem. The entropy rate of a
Markov source 41
1.4 Channels of information transmission. Decoding rules.
Shannon’s second coding theorem 59
1.5 Differential entropy and its properties 86
1.6 Additional problems for Chapter 1 95
2 Introduction to Coding Theory 144
2.1 Hamming spaces. Geometry of codes. Basic bounds on the
code size 144
2.2 A geometric proof of Shannon’s second coding theorem.
Advanced bounds on the code size 162
2.3 Linear codes: basic constructions 184
2.4 The Hamming, Golay and Reed–Muller codes 199
2.5 Cyclic codes and polynomial algebra. Introduction to BCH
codes 213
2.6 Additional problems for Chapter 2 243
3 Further Topics from Coding Theory 269
3.1 A primer on finite fields 269
3.2 Reed–Solomon codes. The BCH codes revisited 291
3.3 Cyclic codes revisited. Decoding the BCH codes 300
3.4 The MacWilliams identity and the linear programming bound 313
3.5 Asymptotically good codes 328
3.6 Additional problems for Chapter 3 340


4 Further Topics from Information Theory 366


4.1 Gaussian channels and beyond 366
4.2 The asymptotic equipartition property in the continuous time
setting 397
4.3 The Nyquist–Shannon formula 409
4.4 Spatial point processes and network information theory 436
4.5 Selected examples and problems from cryptography 453
4.6 Additional problems for Chapter 4 480

Bibliography 501
Index 509
Preface

This book is partially based on the material covered in several Cambridge Math-
ematical Tripos courses: the third-year undergraduate courses Information The-
ory (which existed and evolved over the last four decades under slightly varied
titles) and Coding and Cryptography (a much younger and simplified course avoid-
ing cumbersome technicalities), and a number of more advanced Part III courses
(Part III is a Cambridge equivalent to an MSc in Mathematics). The presentation
revolves, essentially, around the following core concepts: (a) the entropy of a prob-
ability distribution as a measure of ‘uncertainty’ (and the entropy rate of a random
process as a measure of ‘variability’ of its sample trajectories), and (b) coding as a
means to measure and use redundancy in information generated by the process.
Thus, the contents of this book include a more or less standard package of
information-theoretical material which can be found nowadays in courses taught
across the world, mainly at Computer Science and Electrical Engineering Depart-
ments and sometimes at Probability and/or Statistics Departments. What makes this
book different is, first of all, a wide range of examples (a pattern that we followed
from the onset of the series of textbooks Probability and Statistics by Example
by the present authors, published by Cambridge University Press). Most of these
examples are of a particular level adopted in Cambridge Mathematical Tripos ex-
ams. Therefore, our readers can make their own judgement about what level they
have reached or want to reach.
The second difference between this book and the majority of other books
on information theory or coding theory is that it covers both possible direc-
tions: probabilistic and algebraic. Typically, these lines of inquiry are presented
in different monographs, textbooks and courses, often by people who work in
different departments. It helped that the present authors had a long-time associ-
ation with the Institute for Information Transmission Problems, a section of the
Russian Academy of Sciences, Moscow, where the tradition of embracing a broad
spectrum of problems was strongly encouraged. It suffices to list, among others,


the names of Roland Dobrushin, Raphail Khas’minsky, Mark Pinsker, Vladimir


Blinovsky, Vyacheslav Prelov, Boris Tsybakov, Kamil Zigangirov (probability and
statistics), Valentin Afanasiev, Leonid Bassalygo, Serguei Gelfand, Valery Goppa,
Inna Grushko, Grigorii Kabatyansky, Grigorii Margulis, Yuri Sagalovich, Alexei
Skorobogatov, Mikhail Tsfasman, Victor Zinov’yev, Victor Zyablov (algebra, com-
binatorics, geometry, number theory), who worked or continue to work there (at
one time, all these were placed in a five-room floor of a converted building in the
centre of Moscow). Importantly, the Cambridge mathematical tradition of teaching
information-theoretical and coding-theoretical topics was developed along simi-
lar lines, initially by Peter Whittle (Probability and Optimisation) and later on by
Charles Goldie (Probability), Richard Pinch (Algebra and Geometry), Tom Körner
and Keith Carne (Analysis) and Tom Fisher (Number Theory).
We also would like to add that this book has been written by authors trained as
mathematicians (and who remain still mathematicians to their bones), who never-
theless have a strong background in applications, with all the frustration that comes
with such work: vagueness, imprecision, disputability (involving, inevitably, per-
sonal factors) and last – but by no means least – the costs of putting any math-
ematical idea – however beautiful – into practice. Still, they firmly believe that
mathematisation is the mainstream road to survival and perfection in the modern
competitive world, and therefore that Mathematics should be taken and studied
seriously (but perhaps not beyond reason).
Both aforementioned concepts (entropy and codes) forming the base of
the information-theoretical approach to random processes were introduced by
Shannon in the 1940s, in a rather accomplished form, in his publications [139],
[141]. Of course, entropy already existed in thermodynamics and was understood
pretty well by Boltzmann and Gibbs more than a century ago, and codes have
been in practical (and efficient) use for a very long time. But it was Shannon who
fully recognised the role of these concepts and put them into a modern mathemati-
cal framework, although, not having the training of a professional mathematician,
he did not always provide complete proofs of his constructions. [Maybe he did
not bother.] In relevant sections we comment on some rather bizarre moments in
the development of Shannon’s relations with the mathematical community. Fortu-
nately, it seems that this did not bother him much. [Unlike Boltzmann, who was
particularly sensitive to outside comments and took them perhaps too close to his
heart.] Shannon definitely understood the full value of his discoveries; in our view
it puts him on equal footing with such towering figures in mathematics as Wiener
and von Neumann.
It is fair to say that Shannon’s name still dominates both the probabilistic and the
algebraic direction in contemporary information and coding theory. This is quite
extraordinary, given that we are talking of the contribution made by a person who

was active in this area more than 40 years ago. [Although on several advanced
topics Shannon, probably, could have thought, re-phrasing Einstein’s words: “Since
mathematicians have invaded the theory of communication, I do not understand it
myself anymore.”]
During the years that passed after Shannon’s inceptions and inventions, math-
ematics changed drastically, and so did electrical engineering, let alone computer
science. Who could have foreseen such a development back in the 1940s and 1950s,
as the great rivalry between Shannon’s information-theoretical and Wiener’s cyber-
netical approaches was emerging? In fact, the latter promised huge (even fantastic)
benefits for the whole of humanity while the former only asserted that a mod-
est goal of correcting transmission errors could be achieved within certain limits.
Wiener’s book [171] captivated the minds of 1950s and 1960s thinkers in practi-
cally all domains of intellectual activity. In particular, cybernetics became a serious
political issue in the Soviet Union and its satellite countries: first it was declared
“a bourgeois anti-scientific theory”, then it was over-enthusiastically embraced. [A
quotation from a 1953 critical review of cybernetics in a leading Soviet ideology
journal Problems of Philosophy reads: “Imperialists are unable to resolve the con-
troversies destroying the capitalist society. They can’t prevent the imminent eco-
nomical crisis. And so they try to find a solution not only in the frenzied arms race
but also in ideological warfare. In their profound despair they resort to the help of
pseudo-sciences that give them some glimmer of hope to prolong their survival.”
The 1954 edition of the Soviet Concise Dictionary of Philosophy printed in hun-
dreds of thousands of copies defined cybernetics as a “reactionary pseudo-science
which appeared in the USA after World War II and later spread across other cap-
italist countries: a kind of modern mechanicism.” However, under pressure from
top Soviet physicists who gained authority after successes of the Soviet nuclear
programme, the same journal, Problems of Philosophy, had to print in 1955 an ar-
ticle proclaiming positive views on cybernetics. The authors of this article included
Alexei Lyapunov and Sergei Sobolev, prominent Soviet mathematicians.]
Curiously, as was discovered in a recent biography on Wiener [35], there exist
“secret [US] government documents that show how the FBI and the CIA pursued
Wiener at the height of the Cold War to thwart his social activism and the growing
influence of cybernetics at home and abroad.” Interesting comparisons can be found
in [65].
However, history went its own way. As Freeman Dyson put it in his review [41]
of [35]: “[Shannon’s theory] was mathematically elegant, clear, and easy to apply
to practical problems of communication. It was far more user-friendly than cyber-
netics. It became the basis of a new discipline called ‘information theory’ . . . [In
modern times] electronic engineers learned information theory, the gospel accord-
ing to Shannon, as part of their basic training, and cybernetics was forgotten.”

Not quite forgotten, however: in the former Soviet Union there still exist at
least seven functioning institutes or departments named after cybernetics: two in
Moscow and two in Minsk, and one in each of Tallinn, Tbilisi, Tashkent and Kiev
(the latter being a renowned centre of computer science in the whole of the for-
mer USSR). In the UK there are at least four departments, at the Universities of
Bolton, Bradford, Hull and Reading, not counting various associations and soci-
eties. Across the world, cybernetics-related societies seem to flourish, displaying
an assortment of names, from concise ones such as the Institute of the Method
(Switzerland) or the Cybernetics Academy (Italy) to the Argentinian Associa-
tion of the General Theory of Systems and Cybernetics, Buenos Aires. And we
were delighted to discover the existence of the Cambridge Cybernetics Society
(Belmont, CA, USA). By contrast, information theory figures only in a handful of
institutions’ names. Apparently, the old Shannon vs. Wiener dispute may not be
over yet.
In any case, Wiener’s personal reputation in mathematics remains rock solid:
it suffices to name a few gems such as the Paley–Wiener theorem (created on
Wiener’s numerous visits to Cambridge), the Wiener–Hopf method and, of course,
the Wiener process, particularly close to our hearts, to understand his true role in
scientific research and applications. However, existing recollections of this giant of
science depict an image of a complex and often troubled personality. (The title of
the biography [35] is quite revealing but such views are disputed, e.g., in the review
[107]. In this book we attempt to adopt a more tempered tone from the chapter on
Wiener in [75], pp. 386–391.) On the other hand, available accounts of Shannon’s
life (as well as other fathers of information and coding theory, notably, Richard
Hamming) give a consistent picture of a quiet, intelligent and humorous person.
It is our hope that this fact will not present a hindrance for writing Shannon’s
biographies and that in future we will see as many books on Shannon as we see on
Wiener.
As was said before, the purpose of this book is twofold: to provide a synthetic
introduction both to probabilistic and algebraic aspects of the theory supported by
a significant number of problems and examples, and to discuss a number of topics
rarely presented in most mainstream books. Chapters 1–3 give an introduction into
the basics of information theory and coding with some discussion spilling over to
more modern topics. We concentrate on typical problems and examples [many of
them originated in Cambridge courses] more than on providing a detailed presen-
tation of the theory behind them. Chapter 4 gives a brief introduction into a variety
of topics from information theory. Here the presentation is more concise and some
important results are given without proofs.
Because a large part of the text stemmed from lecture notes and various solutions to class and exam problems, there are inevitable repetitions, multitudes of notation and examples of pidgin English. We left many of them deliberately,


feeling that they convey a live atmosphere during the teaching and examination
process.
Two excellent books [52] and [36] had a particularly strong impact on our pre-
sentation. We feel that our long-term friendship with Charles Goldie played a role
here, as well as YS’s amicable acquaintance with Tom Cover. We also benefited
from reading (and borrowing from) the books [18], [110], [130] and [98]. The
warm hospitality at a number of programmes at the Isaac Newton Institute, Univer-
sity of Cambridge, in 2002–2010 should be acknowledged, particularly Stochas-
tic Processes in Communication Sciences (January–July 2010). Various parts of
the material have been discussed with colleagues in various institutions, first and
foremost, the Institute for Information Transmission Problems and the Institute of
Mathematical Geophysics and Earthquake Predictions, Moscow (where the authors
have been loyal staff members for a long time). We would like to thank James
Lawrence, from Statslab, University of Cambridge, for his kind help with figures.
References to PSE I and PSE II mean the books by the present authors Prob-
ability and Statistics by Example, Cambridge University Press, Volumes I and II.
We adopted the style used in PSE II, presenting a large portion of the material
through ‘Worked Examples’. Most of these Worked Examples are stated as prob-
lems (and many of them originated from Cambridge Tripos Exam papers and keep
their specific style and spirit).
1
Essentials of Information Theory

Throughout the book, the symbol P denotes various probability distributions. In


particular, in Chapter 1, P refers to the probabilities for sequences of random
variables characterising sources of information. As a rule, these are sequences of
independent and identically distributed random variables or discrete-time Markov
chains; namely, P(U1 = u1 , . . . ,Un = un ) is the joint probability that random
variables U1 , . . . ,Un take values u1 , . . . , un , and P(V = v |U = u,W = w) is the
conditional probability that a random variable V takes value v, given that ran-
dom variables U and W take values u and w, respectively. Likewise, E denotes the
expectation with respect to P.
The symbols p and P are used to denote various probabilities (and probability-related objects) loosely. The symbol ♯A denotes the cardinality of a finite set A. The symbol 1 stands for an indicator function. We adopt the following notation and formal rules for logarithms: ln = log_e, log = log_2, and for all b > 1: 0 · log_b 0 = 0 · log_b ∞ = 0. Next, given x > 0, ⌊x⌋ and ⌈x⌉ denote the maximal integer that is no larger than x and the minimal integer that is no less than x, respectively. Thus, ⌊x⌋ ≤ x ≤ ⌈x⌉; equalities hold here when x is a positive integer (⌊x⌋ is called the integer part of x).
The abbreviations LHS and RHS stand, respectively, for the left-hand side and
the right-hand side of an equation.

1.1 Basic concepts. The Kraft inequality. Huffman’s encoding


A typical scheme used in information transmission is as follows:

A message source → an encoder → a channel


→ a decoder → a destination


Example 1.1.1 (a) A message source: a Cambridge college choir.


(b) An encoder: a BBC recording unit. It translates the sound to a binary array and
writes it to a CD track. The CD is then produced and put on the market.
(c) A channel: a customer buying a CD in England and mailing it to Australia. The
channel is subject to ‘noise’: possible damage (mechanical, electrical, chemical,
etc.) incurred during transmission (transportation).
(d) A decoder: a CD player in Australia.
(e) A destination: an audience in Australia.
(f) The goal: to ensure a high-quality sound despite damage.
In fact, a CD can sustain damage done by a needle while making a neat hole in
it, or by a tiny drop of acid (you are not encouraged to make such an experiment!).
In technical terms, typical goals of information transmission are:
(i) fast encoding of information,
(ii) easy transmission of encoded messages,
(iii) effective use of the channel available (i.e. maximum transfer of information
per unit time),
(iv) fast decoding,
(v) correcting errors (as many as possible) introduced by noise in the channel.
As usual, these goals contradict each other, and one has to find an optimal solu-
tion. This is what the chapter is about. However, do not expect perfect solutions:
the theory that follows aims mainly at providing knowledge of the basic principles.
A final decision is always up to the individual (or group) responsible.
A large part of this section (and the whole of Chapter 1) will deal with encoding
problems. The aims of encoding are:
(1) compressing data to reduce redundant information contained in a message,
(2) protecting the text from unauthorised users,
(3) enabling errors to be corrected.
We start by studying sources and encoders. A source emits a sequence of letters
(or symbols),
u1 u2 . . . un . . . , (1.1.1)
where u j ∈ I, and I(= Im ) is an m-element set often identified as {1, . . . , m}
(a source alphabet). In the case of literary English, m = 26 + 7, 26 letters plus
7 punctuation symbols: . , : ; – ( ). (Sometimes one adds ? ! ‘ ’ and ”). Telegraph
English corresponds to m = 27.
A common approach is to consider (1.1.1) as a sample from a random source,
i.e. a sequence of random variables
U1 ,U2 , . . . ,Un , . . . (1.1.2)
and try to develop a theory for a reasonable class of such sequences.

Example 1.1.2 (a) The simplest example of a random source is a sequence of


independent and identically distributed random variables (IID random variables):
P(U_1 = u_1, U_2 = u_2, . . . , U_k = u_k) = ∏_{j=1}^{k} p(u_j),    (1.1.3a)

where p(u) = P(U_j = u), u ∈ I, is the marginal distribution of a single variable. A random source with IID symbols is often called a Bernoulli source.
A particular case where p(u) does not depend on u ∈ I (and hence equals 1/m) corresponds to the equiprobable Bernoulli source.
(b) A more general example is a Markov source where the symbols form a discrete-
time Markov chain (DTMC):
P(U_1 = u_1, U_2 = u_2, . . . , U_k = u_k) = λ(u_1) ∏_{j=1}^{k−1} P(u_j, u_{j+1}),    (1.1.3b)

where λ(u) = P(U_1 = u), u ∈ I, are the initial probabilities and P(u, u′) = P(U_{j+1} = u′ | U_j = u), u, u′ ∈ I, are transition probabilities. A Markov source is called stationary if P(U_j = u) = λ(u), j ≥ 1, i.e. λ = {λ(u), u = 1, . . . , m} is an invariant row-vector for matrix P = {P(u, v)}: ∑_{u∈I} λ(u)P(u, v) = λ(v), v ∈ I, or, shortly, λP = λ.
(c) A 'degenerate' example of a Markov source is where a source emits repeated symbols. Here,

P(U_1 = U_2 = · · · = U_k = u) = p(u), u ∈ I,
P(U_k ≠ U_{k′}) = 0, 1 ≤ k < k′,    (1.1.3c)

where 0 ≤ p(u) ≤ 1 and ∑_{u∈I} p(u) = 1.
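For a stationary Markov source as in (b), the invariant row-vector λ satisfying λP = λ can be computed numerically as a left eigenvector of the transition matrix. Below is a minimal Python sketch; the 3 × 3 matrix is an arbitrary illustration, not taken from the text.

```python
import numpy as np

# A hypothetical 3-letter Markov source: each row of P sums to 1.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# The invariant distribution is a left eigenvector of P for eigenvalue 1,
# i.e. lambda P = lambda, normalised so that its entries sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))
lam = np.real(eigvecs[:, k])
lam = lam / lam.sum()

print(lam)            # invariant probabilities lambda(u)
print(lam @ P - lam)  # approximately zero, confirming lambda P = lambda
```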

An initial piece of sequence (1.1.1)

u(n) = (u1 , u2 , . . . , un ) or, more briefly, u(n) = u1 u2 . . . un

is called a (source) sample n-string, or n-word (in short, a string or a word), with
digits from I, and is treated as a ‘message’. Correspondingly, one considers a ran-
dom n-string (a random message)

U(n) = (U1 ,U2 , . . . ,Un ) or, briefly, U(n) = U1U2 . . .Un .

An encoder (or coder) uses an alphabet J(= Jq ) which we typically write as


{0, 1, . . . , q − 1}; usually the number of encoding symbols q < m (or even q ≪ m);
in many cases q = 2 with J = {0, 1} (a binary coder). A code (also coding, or

encoding) is a map, f , that takes a symbol u ∈ I into a finite string, f (u) = x1 . . . xs ,


with digits from J. In other words, f maps I into the set J* of all possible strings:

f : I → J* = ⋃_{s≥1} J × · · · (s times) × J.

Strings f (u) that are images, under f , of symbols u ∈ I are called codewords
(in code f ). A code has (constant) length N if the value s (the length of a code-
word) equals N for all codewords. A message u(n) = u1 u2 . . . un is represented as a
concatenation of codewords

f (u(n) ) = f (u1 ) f (u2 ) . . . f (un );

it is again a string from J ∗ .

Definition 1.1.3 We say that a code is lossless if u ≠ u′ implies that f(u) ≠ f(u′).


(That is, the map f : I → J ∗ is one-to-one.) A code is called decipherable if any
string from J ∗ is the image of at most one message. A string x is a prefix in another
string y if y = xz, i.e. y may be represented as a result of a concatenation of x and z.
A code is prefix-free if no codeword is a prefix in any other codeword (e.g. a code
of constant length is prefix-free).

A prefix-free code is decipherable, but not vice versa:

Example 1.1.4 A code with three source letters 1, 2, 3 and the binary encoder
alphabet J = {0, 1} given by

f (1) = 0, f (2) = 01, f (3) = 011

is decipherable, but not prefix-free.

Theorem 1.1.5 (The Kraft inequality) Given positive integers s1 , . . . , sm , there


exists a decipherable code f : I → J ∗ , with codewords of lengths s1 , . . . , sm , iff
∑_{i=1}^{m} q^{−s_i} ≤ 1.    (1.1.4)

Furthermore, under condition (1.1.4) there exists a prefix-free code with codewords
of lengths s1 , . . . , sm .

Proof (I) Sufficiency. Let (1.1.4) hold. Our goal is to construct a prefix-free code
with codewords of lengths s1 , . . . , sm . Rewrite (1.1.4) as
∑_{l=1}^{s} n_l q^{−l} ≤ 1,    (1.1.5)

or

n_s q^{−s} ≤ 1 − ∑_{l=1}^{s−1} n_l q^{−l},

where n_l is the number of codewords of length l and s = max s_i. Equivalently,

n_s ≤ q^s − n_1 q^{s−1} − · · · − n_{s−1} q.    (1.1.6a)

Since n_s ≥ 0, deduce that

n_{s−1} q ≤ q^s − n_1 q^{s−1} − · · · − n_{s−2} q^2,

or

n_{s−1} ≤ q^{s−1} − n_1 q^{s−2} − · · · − n_{s−2} q.    (1.1.6b)

Repeating this argument yields subsequently

n_{s−2} ≤ q^{s−2} − n_1 q^{s−3} − · · · − n_{s−3} q
   ⋮
n_2 ≤ q^2 − n_1 q    (1.1.6.s−1)
n_1 ≤ q.    (1.1.6.s)

Observe that actually either ni+1 = 0 or ni is less than the RHS of the inequality,
for all i = 1, . . . , s − 1 (by definition, ns ≥ 1 so that for i = s − 1 the second possi-
bility occurs). We can perform the following construction. First choose n1 words
of length 1, using distinct symbols from J: this is possible in view of (1.1.6.s).
It leaves (q − n1 ) symbols unused; we can form (q − n1 )q words of length 2 by
appending a symbol to each. Choose n2 codewords from these: we can do so in
view of (1.1.6.s−1). We still have q2 − n1 q − n2 words unused: form n3 codewords,
etc. In the course of the construction, no new word contains a previous codeword
as a prefix. Hence, the code constructed is prefix-free.
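The construction in part (I) can be carried out mechanically: process the prescribed lengths in increasing order and give each one the next unused node at that depth of the q-ary tree. A minimal Python sketch of this idea (the function name and the sample lengths are ours; the lengths satisfy (1.1.4) with q = 2):

```python
def kraft_code(lengths, q=2):
    # Build a prefix-free code over {0, ..., q-1} with the given codeword
    # lengths, assumed to satisfy the Kraft inequality (1.1.4).
    assert sum(q ** (-s) for s in lengths) <= 1 + 1e-12, "Kraft inequality violated"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    value, prev_len = 0, None
    for i in order:
        s = lengths[i]
        if prev_len is not None:
            value = (value + 1) * q ** (s - prev_len)  # next free node at depth s
        prev_len = s
        digits = []                                    # write value as s q-ary digits
        v = value
        for _ in range(s):
            digits.append(str(v % q))
            v //= q
        codewords[i] = ''.join(reversed(digits))
    return codewords

print(kraft_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```

By construction, no new codeword contains an earlier one as a prefix, exactly as in the proof.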

(II) Necessity. Suppose there exists a decipherable code in J ∗ with codeword


lengths s_1, . . . , s_m. Set s = max s_i and observe that for any positive integer r

(q^{−s_1} + · · · + q^{−s_m})^r = ∑_{l=1}^{rs} b_l q^{−l},

where b_l is the number of ways r codewords can be put together to form a string of length l.

Because of decipherability, these strings must be distinct. Hence, we must have


b_l ≤ q^l, as q^l is the total number of l-strings. Then

(q^{−s_1} + · · · + q^{−s_m})^r ≤ rs,

and

q^{−s_1} + · · · + q^{−s_m} ≤ r^{1/r} s^{1/r} = exp[(1/r)(log r + log s)].

This is true for any r, so take r → ∞. The RHS goes to 1.
Remark 1.1.6 A given code obeying (1.1.4) is not necessarily decipherable.
Leon G. Kraft introduced inequality (1.1.4) in his MIT PhD thesis in 1949.

One of the principal aims of the theory is to find the ‘best’ (that is, the shortest)
decipherable (or prefix-free) code. We now adopt a probabilistic point of view and
assume that symbol u ∈ I is emitted by a source with probability p(u):
P(Uk = u) = p(u).
[At this point, there is no need to specify a joint probability of more than one
subsequently emitted symbol.]

Recall, given a code f : I → J ∗ , we encode a letter i ∈ I by a prescribed code-


word f (i) = x1 . . . xs(i) of length s(i). For a random symbol, the generated codeword
becomes a random string from J ∗ . When f is lossless, the probability of generating
a given string as a codeword for a symbol is precisely p(i) if the string coincides
with f (i) and 0 if there is no letter i ∈ I with this property. If f is not one-to-one,
the probability of a string equals the sum of terms p(i) for which the codeword f (i)
equals this string. Then the length of a codeword becomes a random variable, S,
with the probability distribution
P(S = s) = ∑_{1≤i≤m} 1(s(i) = s) p(i).    (1.1.7)

We are looking for a decipherable code that minimises the expected word-length:
ES = ∑_{s≥1} s P(S = s) = ∑_{i=1}^{m} s(i) p(i).

The following problem therefore arises:


minimise g(s(1), . . . , s(m)) = ES

subject to ∑_i q^{−s(i)} ≤ 1 (Kraft)    (1.1.8)

with s(i) positive integers.

Theorem 1.1.7 The optimal value for problem (1.1.8) is lower-bounded as fol-
lows:
min ES ≥ hq (p(1), . . . , p(m)), (1.1.9)
where
h_q(p(1), . . . , p(m)) = − ∑_i p(i) log_q p(i).    (1.1.10)
Proof The problem (1.1.8) is an integer-valued optimisation problem. If we drop the condition that s(1), . . . , s(m) ∈ {1, 2, . . .}, replacing it with a 'relaxed' constraint s(i) > 0, 1 ≤ i ≤ m, the Lagrange sufficiency theorem can be used. The Lagrangian reads

L(s(1), . . . , s(m), z; λ) = ∑_i s(i)p(i) + λ(1 − ∑_i q^{−s(i)} − z)

(here, z ≥ 0 is a slack variable). Minimising L in s(1), . . . , s(m) and z yields

λ < 0, z = 0, and ∂L/∂s(i) = p(i) + λ q^{−s(i)} ln q = 0,

whence

−p(i)/(λ ln q) = q^{−s(i)}, i.e. s(i) = − log_q p(i) + log_q(−λ ln q), 1 ≤ i ≤ m.

Adjusting the constraint ∑_i q^{−s(i)} = 1 (the slack variable z = 0) gives

∑_i p(i)/(−λ ln q) = 1, i.e. −λ ln q = 1.

Hence,

s(i) = − log_q p(i), 1 ≤ i ≤ m,

is the (unique) optimiser for the relaxed problem, giving the value h_q from (1.1.10). The relaxed problem is solved on a larger set of variables s(i); hence, its minimal value does not exceed that in the original one.
Remark 1.1.8 The quantity hq defined in (1.1.10) plays a central role in the
whole of information theory. It is called the q-ary entropy of the probability distri-
bution (p(x), x ∈ I) and will emerge in a great number of situations. Here we note
that the dependence on q is captured in the formula
h_q(p(1), . . . , p(m)) = (1/log q) h_2(p(1), . . . , p(m)),

where h_2 stands for the binary entropy:

h_2(p(1), . . . , p(m)) = − ∑_i p(i) log p(i).    (1.1.11)

Worked Example 1.1.9 (a) Give an example of a lossless code with alphabet
Jq which does not satisfy the Kraft inequality. Give an example of a lossless code
with the expected code-length strictly less than hq (X).
(b) Show that the 'Kraft sum' ∑_i q^{−s(i)} associated with a lossless code may be arbitrarily large (for sufficiently large source alphabet).

Solution (a) Consider the alphabet I = {0, 1, 2} and a lossless code f with f(0) = 0, f(1) = 1, f(2) = 00 and codeword-lengths s(0) = s(1) = 1, s(2) = 2. Obviously, ∑_{x∈I} 2^{−s(x)} = 5/4, violating the Kraft inequality. For a random variable X with p(0) = p(1) = p(2) = 1/3 the expected codeword-length Es(X) = 4/3 < h(X) = log 3 ≈ 1.585.

(b) Assume that the alphabet size m = ♯I = 2(2^L − 1) for some positive integer L. Consider the lossless code assigning to the letters x ∈ I the codewords 0, 1, 00, 01, 10, 11, 000, . . ., with the maximum codeword-length L. The Kraft sum is

∑_{x∈I} 2^{−s(x)} = ∑_{l≤L} ∑_{x: s(x)=l} 2^{−s(x)} = ∑_{l≤L} 2^l × 2^{−l} = L,

which can be made arbitrarily large.

The assertion of Theorem 1.1.7 is further elaborated in

Theorem 1.1.10 (Shannon’s noiseless coding theorem (NLCT)) For a ran-


dom source emitting symbols with probabilities p(i) > 0, the minimal expected
codeword-length for a decipherable encoding in alphabet Jq obeys

hq ≤ min ES < hq + 1, (1.1.12)

where h_q = − ∑_i p(i) log_q p(i) is the q-ary entropy of the source; see (1.1.10).

Proof The LHS inequality is established in (1.1.9). For the RHS inequality, let
s(i) be a positive integer such that

q^{−s(i)} ≤ p(i) < q^{−s(i)+1}.

The non-strict bound here implies ∑_i q^{−s(i)} ≤ ∑_i p(i) = 1, i.e. the Kraft inequality.
Hence, there exists a decipherable code with codeword-lengths s(1), . . . , s(m). The
strict bound implies
s(i) < − log p(i)/log q + 1,

and thus

ES < − (∑_i p(i) log p(i))/log q + ∑_i p(i) = h/log q + 1 = h_q + 1.

Example 1.1.11 An instructive application of Shannon's NLCT is as follows. Let the size m of the source alphabet equal 2^k and assume that the letters i = 1, . . . , m are emitted equiprobably: p(i) = 2^{−k}. Suppose we use the code alphabet J_2 = {0, 1} (binary encoding). With the binary entropy h_2 = − log 2^{−k} ∑_{1≤i≤2^k} 2^{−k} = k, we need, on average, at least k binary digits for decipherable encoding. Using the term bit for a unit of entropy, we say that on average the encoding requires at least k bits.
Moreover, the NLCT leads to a Shannon–Fano encoding procedure: we fix pos-
itive integer codeword-lengths s(1), . . . , s(m) such that q^{−s(i)} ≤ p(i) < q^{−s(i)+1}, or, equivalently,

− log_q p(i) ≤ s(i) < − log_q p(i) + 1; that is, s(i) = ⌈− log_q p(i)⌉.    (1.1.13)

Then construct a prefix-free code, from the shortest s(i) upwards, ensuring that
the previous codewords are not prefixes. The Kraft inequality guarantees enough
room. The obtained code may not be optimal but has the mean codeword-length
satisfying the same inequalities (1.1.13) as an optimal code.
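A sketch of the Shannon–Fano recipe (1.1.13) in Python: compute s(i) = ⌈− log_q p(i)⌉; the resulting lengths automatically satisfy the Kraft inequality and can be fed into a prefix-free construction such as the one sketched after Theorem 1.1.5. The probabilities below are those of Worked Example 1.1.17, and the printed expected length 2.5 agrees with the value obtained there.

```python
import math

def shannon_fano_lengths(probs, q=2):
    # s(i) = ceil(-log_q p(i)), formula (1.1.13).
    return [math.ceil(-math.log(p, q)) for p in probs]

probs = [0.45, 0.25, 0.2, 0.05, 0.05]
lengths = shannon_fano_lengths(probs)
print(lengths)                                     # [2, 2, 3, 5, 5]
print(sum(p * s for p, s in zip(probs, lengths)))  # expected codeword-length 2.5
```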

Optimality is achieved by Huffman's encoding f^H_m : I_m → J*_q. We first discuss it for binary encodings, when q = 2 (i.e. J = {0, 1}). The algorithm constructs a binary tree, as follows.

(i) First, order the letters i ∈ I so that p(1) ≥ p(2) ≥ · · · ≥ p(m).


(ii) Assign symbol 0 to letter m − 1 and 1 to letter m.
(iii) Construct a reduced alphabet Im−1 = {1, . . . , m − 2, (m − 1, m)}, with proba-
bilities
p(1), . . . , p(m − 2), p(m − 1) + p(m).

Repeat steps (i) and (ii) with the reduced alphabet, etc. We obtain a binary tree. For
an example of Huffman’s encoding for m = 7 see Figure 1.1.
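A minimal Python sketch of steps (i)–(iii), keeping a heap of subtrees and prepending a branch digit at every merge (the function name is ours). Applied to the probabilities of Figure 1.1 it reproduces the codeword-lengths 1, 3, 3, 3, 4, 5, 5 shown there; with different tie-breaking the codewords themselves may differ, but the expected length does not.

```python
import heapq

def huffman_code(probs):
    # Binary Huffman encoding: repeatedly merge the two least probable subtrees.
    heap = [(p, [i]) for i, p in enumerate(probs)]   # (probability, letters in subtree)
    codes = {i: '' for i in range(len(probs))}
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, left = heapq.heappop(heap)               # two smallest probabilities
        p1, right = heapq.heappop(heap)
        for i in left:
            codes[i] = '0' + codes[i]                # prepend the branch digit
        for i in right:
            codes[i] = '1' + codes[i]
        heapq.heappush(heap, (p0 + p1, left + right))
    return [codes[i] for i in range(len(probs))]

probs = [0.5, 0.15, 0.15, 0.1, 0.05, 0.025, 0.025]
code = huffman_code(probs)
print([len(w) for w in code])                        # [1, 3, 3, 3, 4, 5, 5]
print(sum(p * len(w) for p, w in zip(probs, code)))  # expected length 2.15
```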
The number of branches we must pass through in order to reach a root i of the
tree equals s(i). The tree structure, together with the identification of the roots
as source letters, guarantees that encoding is prefix-free. The optimality of binary
Huffman encoding follows from the following two simple lemmas.

i     p_i     f(i)      s_i
1     .5      0         1
2     .15     100       3
3     .15     101       3
4     .1      110       3
5     .05     1110      4
6     .025    11110     5
7     .025    11111     5

Figure 1.1  Huffman encoding for m = 7

Lemma 1.1.12 Any optimal prefix-free binary code has the codeword-lengths
reverse-ordered versus probabilities:

p(i) ≥ p(i′) implies s(i) ≤ s(i′).    (1.1.14)

Proof If not, we can form a new code, by swapping the codewords for i and i′.
This shortens the expected codeword-length and preserves the prefix-free property.

Lemma 1.1.13 In any optimal prefix-free binary code there exist, among the
codewords of maximum length, precisely two agreeing in all but the last digit.

Proof If not, then either (i) there exists a single codeword of maximum length,
or (ii) there exist two or more codewords of maximum length, and they all differ
before the last digit. In both cases we can drop the last digit from some word of
maximum length, without affecting the prefix-free property.

Theorem 1.1.14 Huffman’s encoding is optimal among the prefix-free binary


codes.

Proof The proof proceeds by induction in m. For m = 2, the Huffman code f^H_2 has f^H_2(1) = 0, f^H_2(2) = 1, or vice versa, and is optimal. Assume the Huffman code f^H_{m−1} is optimal for I_{m−1}, whatever the probability distribution. Suppose further that the Huffman code f^H_m is not optimal for I_m for some probability distribution. That is, there is another prefix-free code, f*_m, for I_m with a shorter expected word-length:

ES*_m < ES^H_m.    (1.1.15)

The probability distribution under consideration may be assumed to obey p(1) ≥ · · · ≥ p(m).

By Lemmas 1.1.12 and 1.1.13, in both codes we can shuffle codewords so that the words corresponding to m − 1 and m have maximum length and differ only in the last digit. This allows us to reduce both codes to I_{m−1}. Namely, in the Huffman code f^H_m we remove the final digit from f^H_m(m) and f^H_m(m − 1), 'glueing' these codewords. This leads to the Huffman encoding f^H_{m−1}. In f*_m we do the same, and obtain a new prefix-free code f*_{m−1}.

Observe that in the Huffman code f^H_m the contribution to ES^H_m from f^H_m(m − 1) and f^H_m(m) is s^H(m)(p(m − 1) + p(m)); after reduction it becomes (s^H(m) − 1)(p(m − 1) + p(m)). That is, ES is reduced by p(m − 1) + p(m). In code f*_m the similar contribution is reduced from s*(m)(p(m − 1) + p(m)) to (s*(m) − 1)(p(m − 1) + p(m)); the difference is again p(m − 1) + p(m). All other contributions to ES^H_{m−1} and ES*_{m−1} are the same as the corresponding contributions to ES^H_m and ES*_m, respectively. Therefore, f*_{m−1} is better than f^H_{m−1}: ES*_{m−1} < ES^H_{m−1}, which contradicts the assumption.


In view of Theorem 1.1.14, we obtain
Corollary 1.1.15 Huffman’s encoding is optimal among the decipherable binary
codes.
The generalisation of the Huffman procedure to q-ary codes (with the code
alphabet Jq = {0, 1, . . . , q − 1}) is straightforward: instead of merging two sym-
bols, m − 1, m ∈ Im , having lowest probabilities, you merge q of them (again with
the smallest probabilities), repeating the above argument. In fact, Huffman’s orig-
inal 1952 paper was written for a general encoding alphabet. There are numerous
modifications of the Huffman code covering unequal coding costs (where some of
the encoding digits j ∈ Jq are more expensive than others) and other factors; we
will not discuss them in this book.
Worked Example 1.1.16 A drawback of Huffman encoding is that the
codeword-lengths are complicated functions of the symbol probabilities p(1), . . .,
p(m). However, some bounds are available. Suppose that p(1) ≥ p(2) ≥ · · · ≥
p(m). Prove that in any binary Huffman encoding:
(a) if p(1) < 1/3 then letter 1 must be encoded by a codeword of length ≥ 2;
(b) if p(1) > 2/5 then letter 1 must be encoded by a codeword of length 1.

Figure 1.2  Fragments of Huffman trees for Worked Example 1.1.16: (a) the case p(1) < 1/3; (b) the case p(1) > 2/5, with p(b′) = p(3) + p(4)

Solution (a) Two cases are possible: the letter 1 either was, or was not merged with other letters before the last two steps in constructing a Huffman code. In the first case, s(1) ≥ 2. Otherwise we have symbols 1, b and b′, with

p(1) < 1/3, p(1) + p(b) + p(b′) = 1 and hence max[p(b), p(b′)] > 1/3.

Then letter 1 is to be merged, at the last but one step, with one of b, b′, and hence s(1) ≥ 2. Indeed, suppose that at least one codeword has length 1, and this codeword is assigned to letter 1 with p(1) < 1/3. Hence, the top of the Huffman tree is as in Figure 1.2(a) with 0 ≤ p(b), p(b′) ≤ 1 − p(1) and p(b) + p(b′) = 1 − p(1). But then max{p(b), p(b′)} > 1/3, and hence p(1) should be merged with min{p(b), p(b′)}. Hence, Figure 1.2(a) is impossible, and letter 1 has codeword-length ≥ 2.
The bound is sharp as both codes

{0, 10, 110, 111} and {00, 01, 10, 11}

are binary Huffman codes, e.g. for a probability distribution 1/3, 1/3, 1/4, 1/12.
(b) Now let p(1) > 2/5 and assume that letter 1 has a codeword-length s(1) ≥ 2 in a Huffman code. Thus, letter 1 was merged with other letters before the last step. That is, at a certain stage, we had symbols 1, b and b′ say, with

(A) p(b′) ≥ p(1) > 2/5,
(B) p(b′) ≥ p(b),
(C) p(1) + p(b) + p(b′) ≤ 1,
(D) p(1), p(b) ≥ (1/2)p(b′).

Indeed, if, say, p(b) < (1/2)p(b′) then b should be selected instead of p(3) or p(4) on the previous step when p(b′) was formed. By virtue of (D), p(b) ≥ 1/5 which makes (A)+(C) impossible.
A piece of the Huffman tree over p(1) is then as in Figure 1.2(b), with p(3) + p(4) = p(b′) and p(1) + p(b′) + p(b) ≤ 1. Write

p(1) = 2/5 + ε, p(b′) = 2/5 + ε + δ, p(b) = 2/5 + ε + δ − η,

with ε > 0, δ, η ≥ 0. Then

p(1) + p(b′) + p(b) = 6/5 + 3ε + 2δ − η ≤ 1, and η ≥ 1/5 + 3ε + 2δ.

This yields

p(b) ≤ 1/5 − 2ε − δ < 1/5.

However, since

max{p(3), p(4)} ≥ p(b′)/2 ≥ p(1)/2 > 1/5,

probability p(b) should be merged with min{p(3), p(4)}, i.e. diagram (b) is impossible. Hence, the letter 1 has codeword-length s(1) = 1.
Worked Example 1.1.17 Suppose that letters i1 , . . . , i5 are emitted with probabil-
ities 0.45, 0.25, 0.2, 0.05, 0.05. Compute the expected word-length for Shannon–
Fano and Huffman coding. Illustrate both methods by finding decipherable binary
codings in each case.

Solution In this case q = 2, and


Shannon–Fano:

p(i)    ⌈− log_2 p(i)⌉    codeword
.45     2                 00
.25     2                 01
.2      3                 100
.05     5                 11100
.05     5                 11111

with E(codeword-length) = .9 + .5 + .6 + .25 + .25 = 2.5, and

Huffman:

p(i)    codeword
.45     1
.25     01
.2      000
.05     0010
.05     0011

with E(codeword-length) = 0.45 + 0.5 + 0.6 + 0.2 + 0.2 = 1.95.
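A quick numerical check of the two tables (the lengths are read off the codewords above):

```python
probs           = [0.45, 0.25, 0.2, 0.05, 0.05]
sf_lengths      = [2, 2, 3, 5, 5]   # Shannon-Fano lengths from the first table
huffman_lengths = [1, 2, 3, 4, 4]   # Huffman lengths from the second table
print(sum(p * s for p, s in zip(probs, sf_lengths)))       # 2.5
print(sum(p * s for p, s in zip(probs, huffman_lengths)))  # 1.95
```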

Worked Example 1.1.18 A Shannon–Fano code is in general not optimal. How-


ever, it is 'not much' longer than Huffman's. Prove that, if S^SF is the Shannon–Fano codeword-length, then for any r = 1, 2, . . . and any decipherable code f* with codeword-length S*,

P(S* ≤ S^SF − r) ≤ q^{1−r}.

Solution Write

P(S* ≤ S^SF − r) = ∑_{i∈I: s*(i) ≤ s^SF(i)−r} p(i).

Note that s^SF(i) < − log_q p(i) + 1, hence

∑_{i∈I: s*(i) ≤ s^SF(i)−r} p(i) ≤ ∑_{i∈I: s*(i) ≤ − log_q p(i)+1−r} p(i)
 = ∑_{i∈I: s*(i)−1+r ≤ − log_q p(i)} p(i)
 = ∑_{i∈I: p(i) ≤ q^{−s*(i)+1−r}} p(i)
 ≤ ∑_{i∈I} q^{−s*(i)+1−r}
 = q^{1−r} ∑_{i∈I} q^{−s*(i)}
 ≤ q^{1−r};

the last inequality is due to Kraft.
A common modern practice is not to encode each letter u ∈ I separately, but
to divide a source message into ‘segments’ or ‘blocks’, of a fixed length n, and
encode these as ‘letters’. It obviously increases the nominal number of letters in
the alphabet: the blocks are from the Cartesian product I^{×n} = I × · · · (n times) × I. But what matters is the entropy

h_q^{(n)} = − ∑_{i_1,...,i_n} P(U_1 = i_1, . . . , U_n = i_n) log_q P(U_1 = i_1, . . . , U_n = i_n)    (1.1.16)

of the probability distribution for the blocks in a typical message. [Obviously,


we need to know the joint distribution of the subsequently emitted source let-
ters.] Denote by S(n) the random codeword-length in a decipherable segmented
code. The minimal expected codeword-length per source letter is defined by
1
en := min ES(n) ; by Shannon’s NLCT, it obeys
n
(n) (n)
hq hq 1
≤ en ≤ + . (1.1.17)
n n n
(n)
We see that, for large n, en ∼ hq n.

Example 1.1.19 For a Bernoulli source emitting letter i with probability p(i) (cf. Example 1.1.2), equation (1.1.16) yields

h_q^{(n)} = − ∑_{i_1,...,i_n} p(i_1) · · · p(i_n) log_q [p(i_1) · · · p(i_n)]
 = − ∑_{j=1}^{n} ∑_{i_1,...,i_n} p(i_1) · · · p(i_n) log_q p(i_j) = n h_q,    (1.1.18)

where h_q = − ∑_i p(i) log_q p(i). Here, e_n ∼ h_q. Thus, for n large, the minimal expected codeword-length per source letter, in a segmented code, eventually attains the lower bound in (1.1.13), and hence does not exceed min ES, the minimal expected codeword-length for letter-by-letter encodings. This phenomenon is much more striking in the situation where the subsequent source letters are dependent. In many cases h_q^{(n)} ≪ n h_q, i.e. e_n ≪ h_q. This is the gist of data compression.
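To illustrate the last point, the block entropy (1.1.16) of a (hypothetical) two-state stationary Markov source can be computed by direct enumeration: h^(n)/n decreases with n below the single-letter entropy, and this gap is exactly the room exploited by data compression. A Python sketch, with an arbitrary transition matrix of our choosing:

```python
import itertools, math
import numpy as np

P = np.array([[0.9, 0.1],        # hypothetical transition matrix of a 2-state source
              [0.2, 0.8]])
lam = np.array([2/3, 1/3])       # its invariant distribution: lam P = lam

def block_entropy(n):
    # h^(n) = - sum over n-blocks of P(block) * log2 P(block), formula (1.1.16).
    h = 0.0
    for block in itertools.product(range(2), repeat=n):
        p = lam[block[0]]
        for a, b in zip(block, block[1:]):
            p *= P[a, b]
        h -= p * math.log2(p)
    return h

for n in (1, 2, 4, 8):
    print(n, block_entropy(n) / n)   # decreases towards the entropy rate of the source
```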

Therefore, statistics of long strings becomes an important property of a source.


Nominally, the strings u^{(n)} = u_1 . . . u_n of length n 'fill' the Cartesian power I^{×n}; the total number of such strings is m^n, and to encode them all we need m^n = 2^{n log m} distinct codewords. If the codewords have a fixed length (which guarantees the prefix-free property), this length is between ⌊n log m⌋ and ⌈n log m⌉, and the rate of encoding, for large n, is ∼ log m bits/source letter. But if some strings are rare,
we can disregard them, reducing the number of codewords used. This leads to the
following definitions.

Definition 1.1.20 A source is said to be (reliably) encodable at rate R > 0 if, for
any n, we can find a set A_n ⊂ I^{×n} such that

♯A_n ≤ 2^{nR} and lim_{n→∞} P(U^{(n)} ∈ A_n) = 1.    (1.1.19)

In other words, we can encode messages at rate R with a negligible error for long
source strings.

Definition 1.1.21 The information rate H of a given source is the infimum of the
reliable encoding rates:

H = inf[R : R is reliable]. (1.1.20)

Theorem 1.1.22 For a source with alphabet Im ,

0 ≤ H ≤ log m, (1.1.21)

both bounds being attainable.



Proof The LHS inequality is trivial. It is attained for a degenerate source (cf. Example 1.1.2c); here A_n contains ≤ m constant strings, which is eventually beaten by 2^{nR} for any R > 0. On the other hand, ♯I^{×n} = m^n = 2^{n log m}, hence the RHS inequality. It is attained for a source with IID letters and p(u) = 1/m: in this case P(A_n) = (1/m^n) ♯A_n, which goes to zero when ♯A_n ≤ 2^{nR} and R < log m.
Example 1.1.23 (a) For telegraph English, m = 27 ≈ 2^{4.76}, i.e. H ≤ 4.76. Fortunately, H ≪ 4.76, and this makes possible: (i) data compression, (ii) error-correcting, (iii) code-breaking, (iv) crosswords. The precise value of H for
telegraph English (not to mention literary English) is not known: it is a challenging
task to assess it accurately. Nevertheless, modern theoretical tools and comput-
ing facilities make it possible to assess the information rate of a given (long) text,
assuming that it comes from a source that operates by allowing a fair amount of
‘randomness’ and ‘homogeneity’ (see Section 6.3 of [36].)
Some results of numerical analysis can be found in [136] analysing three texts:
(a) the collected works of Shakespeare; (b) a mixed text from various newspapers;
and (c) the King James Bible. The texts were stripped of punctuation and the spaces
between words were removed. Texts (a) and (b) give values 1.7 and 1.25 respec-
tively (which is rather flattering to modern journalism). In case (c) the results were
inconclusive; apparently the above assumptions are not appropriate in this case.
(For example, the genealogical enumerations of Genesis are hard to compare with
the philosophical discussions of Paul’s letters, so the homogeneity of the source is
obviously not maintained.)
Even more challenging is to compare different languages: which one is more
appropriate for intercommunication? Also, it would be interesting to repeat the
above experiment with the collected works of Tolstoy or Dostoyevsky.
For illustration, we give below the original table by Samuel Morse (1791–1872),
creator of the Morse code, providing the frequency count of different letters in
telegraph English which is dominated by a relatively small number of common
words.
E 12000   T 9000   A 8000   I 8000   N 8000   O 8000   S 8000   H 6400   R 6200
D 4400    L 4000   U 3400   C 3000   M 3000   F 2500   W 2000   Y 2000   G 1700
P 1700    B 1600   V 1200   K 800    Q 500    J 400    X 400    Z 200

(b) A similar idea was applied to the decimal and binary decomposition of a
given number. For example, take number π . If the information rate for its binary

Figure 1.3  The graphs of ln x and x − 1 (they touch at x = 1)

decomposition approaches value 1 (which is the information rate of a randomly chosen sequence), we may think that π behaves like a completely random number; otherwise we could imagine that π was a 'Specially Chosen One'. The same question may be asked about e, √2 or the Euler–Mascheroni constant γ = lim_{N→∞} ( ∑_{1≤n≤N} 1/n − ln N ). (An open part of one of Hilbert's problems is to prove
or disprove that γ is a transcendental number, and transcendental numbers form
a set of probability one under the Bernoulli source of subsequent digits.) As the
results of numerical experiments show, for the number of digits N ∼ 500, 000 all
the above-mentioned numbers display the same pattern of behaviour as a com-
pletely random number; see [26]. In Section 1.3 we will calculate the information
rates of Bernoulli and Markov sources.

We conclude this section with the following simple but fundamental fact.
Theorem 1.1.24 (The Gibbs inequality: cf. PSE II, p. 421) Let {p(i)} and {p′(i)} be two probability distributions (on a finite or countable set I). Then, for any b > 1,

∑_i p(i) log_b [p′(i)/p(i)] ≤ 0, i.e. − ∑_i p(i) log_b p(i) ≤ − ∑_i p(i) log_b p′(i),    (1.1.22)

and equality is attained iff p(i) = p′(i), i ∈ I.

Proof The bound

log_b x ≤ (x − 1)/ln b

holds for each x > 0, with equality iff x = 1. Setting I′ = {i : p(i) > 0}, we have

∑_i p(i) log_b [p′(i)/p(i)] = ∑_{i∈I′} p(i) log_b [p′(i)/p(i)] ≤ (1/ln b) ∑_{i∈I′} p(i) [p′(i)/p(i) − 1]
 = (1/ln b) [ ∑_{i∈I′} p′(i) − ∑_{i∈I′} p(i) ] = (1/ln b) [ ∑_{i∈I′} p′(i) − 1 ] ≤ 0.

For equality we need: (a) ∑_{i∈I′} p′(i) = 1, i.e. p′(i) = 0 when p(i) = 0; and (b) p′(i)/p(i) = 1 for i ∈ I′.

1.2 Entropy: an introduction


Only entropy comes easy.
Anton Chekhov (1860–1904), Russian writer and playwright

This section is entirely devoted to properties of entropy. For simplicity, we work


with the binary entropy, where the logarithms are taken at base 2. Consequently,
subscript 2 in the notation h2 is omitted. We begin with a formal repetition of the
basic definition, putting a slightly different emphasis.
Definition 1.2.1 Given an event A with probability p(A), the information gained
from the fact that A has occurred is defined as
i(A) = − log p(A).
Further, let X be a random variable taking a finite number of distinct values
{x1 , . . . , xm }, with probabilities pi = pX (xi ) = P(X = xi ). The binary entropy h(X)
is defined as the expected amount of information gained from observing X:
 
h(X) = − ∑_{x_i} p_X(x_i) log p_X(x_i) = − ∑_i p_i log p_i = E[− log p_X(X)].    (1.2.1)

[In view of the adopted equality 0 · log 0 = 0, the sum may be reduced to those xi
for which pX (xi ) > 0.]
Sometimes an alternative view is useful: i(A) represents the amount of informa-
tion needed to specify event A and h(X) gives the expected amount of information
required to specify a random variable X.
Clearly, the entropy h(X) depends on the probability distribution, but not on
the values x1 , . . . , xm : h(X) = h(p1 , . . . , pm ). For m = 2 (a two-point probability
distribution), it is convenient to consider the function η (p)(= η2 (p)) of a single
variable p ∈ [0, 1]:
η (p) = −p log p − (1 − p) log(1 − p). (1.2.2a)
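Formulas (1.2.1) and (1.2.2a) translate directly into code. A small Python sketch (the function names are ours), using the convention 0 · log 0 = 0:

```python
import math

def entropy(probs, base=2.0):
    # h(X) = -sum_i p_i log p_i, terms with p_i = 0 contributing 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def eta(p):
    # Binary entropy of a two-point distribution, formula (1.2.2a).
    return entropy([p, 1.0 - p])

print(entropy([1/3, 1/3, 1/3]))  # log 3, approximately 1.585
print(eta(0.5))                  # 1.0, the maximum of the curve in Figure 1.4
```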
Figure 1.4  The binary entropy h(p, 1 − p) for p ∈ [0, 1]

Figure 1.5  The entropy h(p, q, 1 − p − q) for p, q ∈ [0, 1], p + q ≤ 1

The graph of η(p) is plotted in Figure 1.4. Observe that the graph is concave as d²η(p)/dp² = − log e/[p(1 − p)] < 0. See Figure 1.4.
The graph of the entropy of a three-point distribution

η3 (p, q) = −p log p − q log q − (1 − p − q) log(1 − p − q)



is plotted in Figure 1.5 as a function of variables p, q ∈ [0, 1] with p + q ≤ 1: it also


shows the concavity property.
Definition 1.2.1 implies that for independent events, A1 and A2 ,
i(A1 ∩ A2 ) = i(A1 ) + i(A2 ), (1.2.2b)
and i(A) = 1 for event A with p(A) = 1/2.
A justification of Definition 1.2.1 comes from the fact that any function i∗ (A)
which (i) depends on probability p(A) (i.e. obeys i∗(A) = i∗(A′) if p(A) = p(A′)), (ii)
is continuous in p(A), and (iii) satisfies (1.2.2b), coincides with i(A) (for axiomatic
definitions of entropy, cf. Worked Example 1.2.24 below).
Definition 1.2.2 Given a pair of random variables, X, Y , with values xi and y j ,
the joint entropy h(X,Y ) is defined by
 
h(X,Y) = − ∑_{x_i, y_j} p_{X,Y}(x_i, y_j) log p_{X,Y}(x_i, y_j) = E[− log p_{X,Y}(X,Y)],    (1.2.3)

where pX,Y (xi , y j ) = P(X = xi ,Y = y j ) is the joint probability distribution. In other


words, h(X,Y ) is the entropy of the random vector (X,Y ) with values (xi , y j ).
The conditional entropy, h(X|Y ), of X given Y is defined as the expected amount
of information gained from observing X given that a value of Y is known:
 
h(X|Y) = − ∑_{x_i, y_j} p_{X,Y}(x_i, y_j) log p_{X|Y}(x_i|y_j) = E[− log p_{X|Y}(X|Y)].    (1.2.4)

Here, pX,Y (i, j) is the joint probability P(X = xi ,Y = y j ) and pX|Y (xi |y j ) the con-
ditional probability P(X = xi |Y = y j ). Clearly, (1.2.3) and (1.2.4) imply
h(X|Y ) = h(X,Y ) − h(Y ). (1.2.5)
Note that in general h(X|Y) ≠ h(Y|X).
For random variables X and Y taking values in the same set I, and such that
pY (x) > 0 for all x ∈ I, the relative entropy h(X||Y ) (also known as the entropy of
X relative to Y or Kullback–Leibler distance D(pX ||pY )) is defined by
h(X||Y) = ∑_x p_X(x) log [p_X(x)/p_Y(x)] = E_X[− log (p_Y(X)/p_X(X))],    (1.2.6)
with pX (x) = P(X = x) and pY (x) = P(Y = x), x ∈ I.
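Definitions (1.2.3)–(1.2.6) are equally direct to compute. A Python sketch with a hypothetical joint distribution (the function names and the numbers are ours):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def joint_entropy(p_xy):
    # p_xy: dict {(x, y): probability}; formula (1.2.3).
    return entropy(p_xy.values())

def conditional_entropy(p_xy):
    # h(X|Y) = h(X,Y) - h(Y), formula (1.2.5).
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p
    return joint_entropy(p_xy) - entropy(p_y.values())

def relative_entropy(p, q):
    # h(X||Y) = sum_x p(x) log(p(x)/q(x)), formula (1.2.6); requires q(x) > 0.
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# A hypothetical joint distribution of (X, Y) on {0,1} x {0,1}:
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
print(joint_entropy(p_xy), conditional_entropy(p_xy))
print(relative_entropy([0.5, 0.5], [0.9, 0.1]))   # non-negative, as in (1.2.9)
```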
Straightforward properties of entropy are given below.
Theorem 1.2.3
(a) If a random variable X takes at most m values, then
0 ≤ h(X) ≤ log m; (1.2.7)

the LHS equality occurring iff X takes a single value, and the RHS equality
occurring iff X takes m values with equal probabilities.
(b) The joint entropy obeys

h(X,Y ) ≤ h(X) + h(Y ), (1.2.8)

with equality iff X and Y are independent, i.e. P(X = x,Y = y) = P(X = x)
P(Y = y) for all x, y ∈ I .
(c) The relative entropy is always non-negative:

h(X||Y ) ≥ 0, (1.2.9)

with equality iff X and Y are identically distributed: pX (x) ≡ pY (x), x ∈ I .

Proof Assertion (c) is equivalent to Gibbs' inequality from Theorem 1.1.24. Next, (a) follows from (c), with {p(i)} being the distribution of X and p′(i) ≡ 1/m, 1 ≤ i ≤ m. Similarly, (b) follows from (c), with i being a pair (i_1, i_2) of values of X and Y, p(i) = p_{X,Y}(i_1, i_2) being the joint distribution of X and Y and p′(i) = p_X(i_1)p_Y(i_2) representing the product of their marginal distributions. Formally:

(a) h(X) = − ∑_i p(i) log p(i) ≤ ∑_i p(i) log m = log m,

(b) h(X,Y) = − ∑_{(i_1,i_2)} p_{X,Y}(i_1, i_2) log p_{X,Y}(i_1, i_2)
 ≤ − ∑_{(i_1,i_2)} p_{X,Y}(i_1, i_2) × log [p_X(i_1) p_Y(i_2)]
 = − ∑_{i_1} p_X(i_1) log p_X(i_1) − ∑_{i_2} p_Y(i_2) log p_Y(i_2)
 = h(X) + h(Y).

We used here the identities ∑_{i_2} p_{X,Y}(i_1, i_2) = p_X(i_1), ∑_{i_1} p_{X,Y}(i_1, i_2) = p_Y(i_2).

Worked Example 1.2.4

(a) Show that the geometric random variable Y with p_j = P(Y = j) = (1 − p)p^j, j = 0, 1, 2, . . ., yields maximum entropy amongst all distributions on Z_+ = {0, 1, 2, . . .} with the same mean.
(b) Let Z be a random variable with values from a finite set K and f be a given real function f : K → R, with f_∗ = min{ f(k) : k ∈ K } and f^∗ = max{ f(k) : k ∈ K }. Set E(f) = ∑_{k∈K} f(k)/(♯K) and consider the problem of maximising the entropy h(Z) of the random variable Z subject to a constraint

E f(Z) ≤ α.    (1.2.10)

Show that:
(bi) when f^∗ ≥ α ≥ E(f) then the maximising probability distribution is uniform on K, with P(Z = k) = 1/(♯K), k ∈ K;
(bii) when f_∗ ≤ α < E(f) and f is not constant then the maximising probability distribution has

P(Z = k) = p_k = e^{λ f(k)} / ∑_i e^{λ f(i)}, k ∈ K,    (1.2.11)

where λ = λ(α) < 0 is chosen so as to satisfy

∑_k p_k f(k) = α.    (1.2.12)

Moreover, suppose that Z takes countably many values, but f ≥ 0 and for a given α there exists a λ < 0 such that ∑_i e^{λ f(i)} < ∞ and ∑_k p_k f(k) = α where p_k has form (1.2.11). Then:
(biii) the probability distribution in (1.2.11) still maximises h(Z) under (1.2.10).
Deduce assertion (a) from (biii).
(c) Prove that hY (X) ≥ 0, with equality iff P(X = x) = P(Y = x) for all x. By
considering Y , a geometric random variable on Z+ with parameter chosen
appropriately, show that if the mean EX = μ < ∞, then
h(X) ≤ (μ + 1) log(μ + 1) − μ log μ , (1.2.13)
with equality iff X is geometric.

Solution (a) By the Gibbs inequality, for any probability distribution (q_0, q_1, . . .) with mean ∑_{i≥0} i q_i ≤ μ,

h(q) = − ∑_i q_i log q_i ≤ − ∑_i q_i log p_i = − ∑_i q_i (log(1 − p) + i log p)
 ≤ − log(1 − p) − μ log p = h(Y)

as μ = p/(1 − p), and equality holds iff q is geometric with mean μ.

(b) First, observe that the uniform distribution, with p_k = 1/(♯K), which renders the 'global' maximum of h(Z), is obtained for λ = 0 in (1.2.11). In part (bi), this distribution satisfies (1.2.10) and hence maximises h(Z) under this constraint.
Passing to (bii), let p*_k = e^{λ f(k)} / ∑_i e^{λ f(i)}, k ∈ K, where λ is chosen to satisfy E* f(Z) = ∑_k p*_k f(k) = α. Let q = {q_k} be any probability distribution satisfying

E_q f = ∑_k q_k f(k) ≤ α. Next, observe that the mean value (1.2.12) calculated for the probability distribution from (1.2.11) is a non-decreasing function of λ. In fact, the derivative

dα/dλ = ∑_k [f(k)]² e^{λ f(k)} / ∑_i e^{λ f(i)} − ( ∑_k f(k) e^{λ f(k)} / ∑_i e^{λ f(i)} )² = E[f(Z)²] − [E f(Z)]²

is non-negative (it yields the variance of the random variable f(Z)); for a non-constant f it is actually positive. Therefore, for non-constant f (i.e. with f_∗ < E(f) < f^∗), for all α from the interval [f_∗, f^∗] there exists exactly one probability distribution of form (1.2.11) satisfying (1.2.12), and for f_∗ ≤ α < E(f) the corresponding λ(α) is < 0.
Next, we use the fact that the Kullback–Leibler distance D(q||p∗ ) (cf. (1.2.6))
satisfies D(q||p∗ ) = ∑ qk log (qk /p∗k ) ≥ 0 (Gibbs’ inequality) and that ∑ qk f (k) ≤ α
k k
and λ < 0 to obtain that
h(q) = − ∑ qk log qk = −D(q||p∗ ) − ∑ qk log p∗k
k k

≤ −∑ qk log p∗k = − ∑ qk − log ∑ eλ f (i) + λ f (k)
k k i

≤ − ∑ qk − log ∑ eλ f (i) − λ α
k i

= −∑ p∗k − log ∑ eλ f (i) + λ f (k)
k i
= − ∑ p∗k log p∗k = h(p∗ ).
k

For part (biii): the above argument still works for an infinite countable set K
provided that the value λ (α ) determined from (1.2.12) is < 0.
(c) By the Gibbs inequality hY (X) ≥ 0. Next, we use part (b) by taking f (k) = k,
α = μ and λ = ln q. The maximum-entropy distribution can be written as p∗j =
(1 − p)p j , j = 0, 1, 2, . . ., with ∑ kp∗k = μ , or μ = p/(1 − p). The entropy of this
k
distribution equals
 
h(p∗ ) = − ∑ (1 − p)p j log (1 − p)p j
j
p
=− log p − log(1 − p) = (μ + 1) log(μ + 1) − μ log μ ,
1− p
where μ = p/(1 − p).
24 Essentials of Information Theory

Alternatively:
p(i)
0 ≤ hY (X) = ∑ p(i) log
i (1 − p)pi
 
= −h(X) − log(1 − p) ∑ p(i) − (log p) ∑ ip(i)
i i
= −h(X) − log(1 − p) − μ log p.
The optimal choice of p is p = μ /(μ + 1). Then
1 μ
h(X) ≤ − log − μ log = (μ + 1) log(μ + 1) − μ log μ .
μ +1 μ +1
The RHS is the entropy h(Y ) of the geometric random variable Y . Equality holds
iff X ∼ Y , i.e. X is geometric.
A simple but instructive corollary of the Gibbs inequality is
Lemma 1.2.5 (The pooling inequalities) For any q1 , q2 ≥ 0, with q1 + q2 > 0,
− (q1 + q2 ) log(q1 + q2 ) ≤ −q1 log q1 − q2 log q2
q1 + q2
≤ −(q1 + q2 ) log ; (1.2.14)
2
the first equality occurs iff q1 q2 = 0 (i.e. either q1 or q2 vanishes), and the second
equality iff q1 = q2 .
Proof Indeed, (1.2.14) is equivalent to
 
q1 q2
0≤h , ≤ log 2 (= 1).
q1 + q2 q1 + q2

By Lemma 1.2.5, ‘glueing’ together values of a random variable could dimin-


ish the corresponding contribution to the entropy. On the other hand, the ‘re-
distribution’ of probabilities making them equal increases the contribution. An
immediate corollary of Lemma 1.2.5 is the following.
Theorem 1.2.6 Suppose that a discrete random variable X is a function of dis-
crete random variable Y : X = φ (Y ). Then
h(X) ≤ h(Y ), (1.2.15)
with equality iff φ is invertible.
Proof Indeed, if φ is invertible then the probability distributions of X and Y differ
only in the order of probabilities, which does not change the entropy. If φ ‘glues’
some values y j then we can repeatedly use the LHS pooling inequality.
1.2 Entropy: an introduction 25

log x

1/ 1
2

Figure 1.6

Worked Example 1.2.7 Let p1 , . . . , pn be a probability distribution, with p∗ =


max[pi ]. Prove the following lower bounds for the entropy h = − ∑ pi log pi :
i

(i) h ≥ −p∗ log p∗ − (1 − p∗ ) log(1 − p∗ ) = η (p∗ );


(ii) h ≥ − log p∗ ;
(iii) h ≥ 2(1 − p∗ ).

Solution Part (i) follows from the pooling inequality, and (ii) holds as
h ≥ − ∑ pi log p∗ = − log p∗ .
i

To check (iii), assume first that p∗ ≥ 1/2.


the function p → η (p), 0 ≤ p ≤ 1,
Since
is concave (see (1.2.3)), its graph on 1/2, 1 lies above the line x → 2(1 − p).
Then, by (i),
h ≥ η (p∗ ) ≥ 2 (1 − p∗ ) . (1.2.16)
On the other hand, if p∗ ≤ 1/2, we use (ii):
h ≥ − log p∗ ,
and apply the inequality − log p ≥ 2(1 − p) for 0 ≤ p ≤ 1/2.

Theorem 1.2.8 (The Fano inequality) Suppose a random variable X takes m > 1
values, and one of them has probability (1 − ε ). Then
h(X) ≤ η (ε ) + ε log(m − 1) (1.2.17)
where η is the function from (1.2.2a).
26 Essentials of Information Theory

Proof Suppose that p1 = p(x1 ) = 1 − ε . Then


m
h(X) = h(p1 , . . . , pm ) = − ∑ pi log pi
i=1
= −p1 log p1 − (1 − p1 ) log(1 − p1 ) + (1 − p1 ) log(1 − p1 )
− ∑ pi log pi
2≤i≤m
 
p2 pm
= h(p1 , 1 − p1 ) + (1 − p1 )h ,..., ;
1 − p1 1 − p1
in the RHS the first term is η (ε ) and the second one does not exceed ε log(m − 1).

Definition 1.2.9 Given random variables X, Y , Z, we say that X and Y are con-
ditionally independent given Z if, for all x and y and for all z with P(Z = z) > 0,
P(X = x,Y = y|Z = z) = P(X = x|Z = z)P(Y = y|Z = z). (1.2.18)
For the conditional entropy we immediately obtain
Theorem 1.2.10 (a) For all random variables X , Y ,
0 ≤ h(X|Y ) ≤ h(X), (1.2.19)
the first equality occurring iff X is a function of Y and the second equality holding
iff X and Y are independent.
(b) For all random variables X , Y , Z ,
h(X|Y, Z) ≤ h(X|Y ) ≤ h(X|φ (Y )), (1.2.20)
the first equality occurring iff X and Z are conditionally independent given Y and
the second equality holding iff X and Z are conditionally independent given φ (Y ).
Proof (a) The LHS bound in (1.2.19) follows from definition (1.2.4) (since
h(X|Y ) is a sum of non-negative terms). The RHS bound follows from repre-
sentation (1.2.5) and bound (1.2.8). The LHS quality in (1.2.19) is equivalent
to the equation h(X,Y ) = h(Y ) or h(X,Y ) = h(φ (X,Y )) with φ (X,Y ) = Y . In
view of Theorem 1.2.6, this occurs iff, with probability 1, the map (X,Y ) → Y
is invertible, i.e. X is a function of Y . The RHS equality in (1.2.19) occurs iff
h(X,Y ) = h(X) + h(Y ), i.e. X and Y are independent.
(b) For the lower bound, use a formula analogous to (1.2.5):
h(X|Y, Z) = h(X, Z|Y ) − h(Z|Y ) (1.2.21)
and an inequality analogous to (1.2.10):
h(X, Z|Y ) ≤ h(X|Y ) + h(Z|Y ), (1.2.22)
1.2 Entropy: an introduction 27

with equality iff X and Z are conditionally independent given Y . For the RHS
bound, use:
(i) a formula that is a particular case of (1.2.21): h(X|Y, φ (Y )) = h(X,Y |φ (Y )) −
h(Y |φ (Y )), together with the remark that h(X|Y, φ (Y )) = h(X|Y );
(ii) an inequality which is a particular case of (1.2.22): h(X,Y |φ (Y )) ≤
h(X|φ (Y )) + h(Y |φ (Y )), with equality iff X and Y are conditionally independent
given φ (Y ).
Theorems 1.2.8 above and 1.2.11 below show how the entropy h(X) and con-
ditional entropy h(X|Y ) are controlled when X is ‘nearly’ a constant (respectively,
‘nearly’ a function of Y ).
Theorem 1.2.11 (The generalised Fano inequality) For a pair of random vari-
ables, X and Y taking values x1 , . . . , xm and y1 , . . . , ym , if
m
∑ P(X = x j ,Y = y j ) = 1 − ε , (1.2.23)
j=1

then
h(X|Y ) ≤ η (ε ) + ε log(m − 1), (1.2.24)
where η (ε ) is defined in (1.2.3).
Proof Denoting ε j = P(X = x j |Y = y j ), we write

∑ pY (y j )ε j = ∑ P(X = x j ,Y = y j ) = ε . (1.2.25)
j j

By definition of the conditional entropy, the Fano inequality and concavity of


the function η ( · ),

h(X|Y ) ≤ ∑ pY (y j ) η (ε j ) + ε j log(m − 1)
j

≤ ∑ pY (y j )η (ε j ) + ε log(m − 1) ≤ η (ε ) + ε log(m − 1).


j

If the random variable X takes countably many values {x1 , x2 , . . .}, the above
definitions may be repeated, as well as most of the statements; notable exceptions
are the RHS bound in (1.2.7) and inequalities (1.2.17) and (1.2.24).
Many properties of entropy listed so far are extended to the case of random
strings.
Theorem 1.2.12 For a pair of random strings, X(n) = (X1 , . . . , Xn ) and Y(n) =
(Y1 , . . . ,Yn ),
28 Essentials of Information Theory

(a) the joint entropy, given by

h(X(n) ) = − ∑ P(X(n) = x(n) ) log P(X(n) = x(n) ),


x(n)

obeys
n n
h(X(n) ) = ∑ h(Xi |X(i−1) ) ≤ ∑ h(Xi ), (1.2.26)
i=1 i=1

with equality iff components X1 , . . . , Xn are independent;


(b) the conditional entropy, given by

h(X(n) |Y(n) )
=− ∑ P(X(n) = x(n) , Y(n) = y(n) ) log P(X(n) = x(n) |Y(n) = y(n) ),
x(n) ,y(n)

satisfies
n n
h(X(n) |Y(n) ) ≤ ∑ h(Xi |Y(n) ) ≤ ∑ h(Xi |Yi ), (1.2.27)
i=1 i=1

with the LHS equality holding iff X1 , . . . , Xn are conditionally independent, given
Y(n) , and the RHS equality holding iff, for each i = 1, . . . , n, Xi and {Yr : 1 ≤ r ≤
n, r = i} are conditionally independent, given Yi .

Proof The proof repeats the arguments used previously in the scalar case.

Definition 1.2.13 The mutual information or mutual entropy, I(X : Y ), between


X and Y is defined as
pX,Y (x, y) pX,Y (X,Y )
I(X : Y ) := ∑ pX,Y (x, y) log = E log
x,y pX (x)pY (y) pX (X)pY (Y )
= h(X) + h(Y ) − h(X,Y ) = h(X) − h(X|Y )
= h(Y ) − h(Y |X). (1.2.28)

As can be seen from this definition, I(X : Y ) = I(Y : X).

Intuitively, I(X : Y ) measures the amount of information about X conveyed by Y


(and vice versa). Theorem 1.2.10(b) implies

Theorem 1.2.14 If a random variable φ (Y ) is a function of Y then

0 ≤ I(X : φ (Y )) ≤ I(X : Y ), (1.2.29)

the first equality occurring iff X and φ (Y ) are independent, and the second iff X
and Y are conditionally independent, given φ (Y ).
1.2 Entropy: an introduction 29

Worked Example 1.2.15 Suppose that two non-negative random variables X


and Y are related by Y = X + N , where N is a geometric random variable taking
values in Z+ and is independent of X . Determine the distribution of Y which max-
imises the mutual entropy between X and Y under the constraint that the mean
EX ≤ K and show that this distribution can be realised by assigning to X the value
zero with a certain probability and letting it follow a geometrical distribution with
a complementary probability.

Solution Because Y = X + N where X and N are independent, we have


I(X : Y ) = h(Y ) − h(Y |X) = h(Y ) − h(N).
Also E(Y ) = E(X) + E(N) ≤ K + E(N). Therefore, if we can guarantee that Y may
be taken geometrically distributed with mean K + E(N) then it gives the maximal
value of I(X : Y ). To this end, write an equation for probability-generating func-
tions:
E(zY ) = E(zX )E(zN ), z > 0,
with E(zN ) = (1 − p)/(1 − zp), 0 < z < 1/p, and
1 − p∗ 1
E(zY ) = , 0<z< ∗,
1 − zp∗ p
where p∗ is to be found from an equation
p∗ p K(1 − p) + p
μY = =K+ = .
1 − p∗ 1− p 1− p
This yields
K(1 − p) + p 1− p
p∗ = , E(zY ) = ,
1 + K(1 − p) 1 + K(1 − p) − z(p + K(1 − p))
and
1 − zp
E(zX ) = . (1.2.30)
1 + K(1 − p) − z(p + K(1 − p))
The form of the distribution of X suggested in the example leads to
1 − pX
E(zX ) = κ0 + (1 − κ0 ) , (1.2.31)
1 − zpX
where κ0 + (1 − κ0 )(1 − pX ) = P(X = 0). Selecting
p + K(1 − p) p
pX = , κ0 = ,
1 + K(1 − p) p + K(1 − p)
we see that (1.2.30) and (1.2.31) coincide.
30 Essentials of Information Theory

I only ask for information. . .


Charles Dickens (1812–1870), English writer,
from David Copperfield

In Definition 1.2.13 and Theorem 1.2.14, random variables X and Y may be


replaced by random strings. In addition, by repeating the above arguments for
strings X(n) and Y(n) , we obtain
Theorem 1.2.16 (a) The mutual entropy between random strings obeys
n n
I(X(n) : Y(n) ) ≥ h(X(n) ) − ∑ h(Xi |Y(n) ) ≥ h(X(n) ) − ∑ h(Xi |Yi ). (1.2.32)
i=1 i=1

(b) If X1 , . . . , Xn are independent then


n
I(X(n) : Y(n) ) ≥ ∑ I(Xi : Y(n) ). (1.2.33)
i=1

Observe that
n n
∑ I(Xi : Y(n) ) ≥ ∑ I(Xi : Yi ). (1.2.34)
i=1 i=1

Worked Example 1.2.17 Let X , Z be random variables and Y(n) = (Y1 , . . . ,Yn )
be a random string.

(a) Prove the inequality

0 ≤ I(X : Z) ≤ min{h(X), h(Z)}.

(b) Prove or disprove by producing a counter-example the inequality


n
I(X : Y(n) ) ≤ ∑ I(X : Y j ), (1.2.35)
j=1

first under the assumption that Y1 , . . . ,Yn are independent random variables,
and then under the assumption that Y1 , . . . ,Yn are conditionally independent
given X .
(c) Prove or disprove by producing a counter-example the inequality
n
I(X : Y(n) ) ≥ ∑ I(X : Y j ), (1.2.36)
j=1

first under the assumption that Y1 , . . . ,Yn are independent random variables,
and then under the assumption that Y1 , . . . ,Yn are conditionally independent
given X .
1.2 Entropy: an introduction 31

Solution (a) By the Gibbs inequality, I(X : Z) ≥ 0, and


P(X = x, Z = z)
I(X : Z) := − ∑ P(X = x, Z = z) log
x,z P(X = x)P(Z = z)
= h(X) − h(X|Z) = h(Z) − h(Z|X).
Here h(X|Z) ≥ 0 and h(Z|X)

≥ 0. Hence I(X : Z) ≤ h(X) and I(X : Z) ≤ h(Z), so
I(X : Z) ≤ min h(X), h(Z) .
(b) Write
I(X : Y(n) ) = h(Y(n) ) − h(Y(n) |X). (1.2.37)
Then, if Y1 , . . . ,Yn are conditionally independent given X, the RHS of (1.2.37)
equals
n n
n
h(Y(n) ) − ∑ h(Y j |X) ≤ ∑ h(Y j ) − h(Y j |X) = ∑ I(X : Y j ),
j=1 j=1 j=1

giving that of (1.2.35).


(c) Next, if Y1 , . . . ,Yn are independent, the RHS of (1.2.37) equals
n n
n
∑ h(Y j ) − h(Y(n) |X) ≥ ∑ h(Y j ) − h(Y j |X) = ∑ I(X : Y j ),
j=1 j=1 j=1

giving the RHS of (1.2.36).


On the other hand, property (b) fails under the independence condition. Indeed,
set n = 2, with Y(2) = (Y1 ,Y2 ), and let Y1 and Y2 take values 0 or 1 with probabilities
1/2, j = 1, 2, independently, and set X = (Y1 +Y2 ) mod 2. Then
h(X) = h(X|Y j ) = 1, so I(X : Y j ) ≡ 0, j = 1, 2,
but
h(X|Y(2) ) = 0, so I(X : Y(2) ) = 1.
Also, (c) fails under the conditional independence condition. Indeed, take a
±1, the initial probability distribution {1/2, 1/2}
DTMC (U1 ,U2 , . . .) with states 
0 1
and the transition matrix . Set
1 0
Y1 = U1 , X = U2 , Y2 = U3 .
Then Y1 , Y2 are conditionally independent given X: Y1 = Y2 = −X. On the other
hand,
1 = I(X : Y(2) ) = h(Y(2) ) = h(Y1 ) = h(Y2 )
< h(Y1 ) + h(Y2 ) = I(X : Y1 ) + I(X : Y2 ) = 2.
32 Essentials of Information Theory

Recall that a real function f (y) defined on a convex set V ⊆ Rm is called con-
cave if
f (λ0 y(0) + λ1 y(1) ) ≥ λ0 f (y(0) ) + λ1 f (y(1) )
for any y(0) , y(1) ∈ V and λ0 , λ1 ∈ [0, 1] with λ0 + λ1 = 1. It is called strictly concave
if the equality is attained only when either y(0) = y(1) or λ0 λ1 = 0. We treat h(X) as
a function of variables p = (p1 , . . . , pm ); set V in this case is {y = (y1 , . . . , ym ) ∈ Rm :
yi ≥ 0, 1 ≤ i ≤ m, y1 + · · · + ym = 1}.
Theorem 1.2.18 Entropy is a strictly concave function of the probability distri-
bution.
Proof Let the random variables X (i) have probability distributions p(i) , i = 0, 1,
and assume that the random variable Λ takes values 0 and 1 with probabilities
λ0 and λ1 , respectively, and is independent of X (0) , X (1) . Set X = X (Λ) ; then the
inequality h(λ0 p(0) + λ1 p(1) ) ≥ λ0 h(p(0) ) + λ1 h(p(1) ) is equivalent to
h(X) ≥ h(X|Λ) (1.2.38)
which follows from (1.2.19). If we assume equality in (1.2.38), X and Λ must be
independent. Assume in addition that λ0 > 0 and write, by using independence,
P(X = i, Λ = 0) = P(X = i)P(Λ = 0) = λ0 P(X = i).
(0)  (0)
The LHS equals λ0 P(X = i|Λ = 0) = λ0 pi and the RHS equals λ0 λ0 pi +
(1) 
λ1 pi . We may cancel λ0 obtaining
(0) (1)
(1 − λ0 )pi = λ1 pi ,
i.e. the probability distributions p(0) and p(1) are proportional. Then either they are
equal or λ1 = 0, λ0 = 1. The assumption λ1 > 0 leads to a similar conclusion.
Worked Example 1.2.19 Show that the quantity
ρ (X,Y ) = h(X|Y ) + h(Y |X)
obeys
ρ (X,Y ) = h(X) + h(Y ) − 2I(X : Y )
= h(X,Y ) − I(X : Y ) = 2h(X,Y ) − h(X) − h(Y ).
Prove that ρ is symmetric, i.e. ρ (X,Y ) = ρ (Y, X) ≥ 0, and satisfies the triangle
inequality, i.e. ρ (X,Y ) + ρ (Y, Z) ≥ ρ (X, Z). Show that ρ (X,Y ) = 0 iff X and Y
are functions of each other. Also show that if X and X are functions of each other
then ρ (X,Y ) = ρ (X ,Y ). Hence, ρ may be considered as a metric on the set of
the random variables X , considered up to equivalence: X ∼ X iff X and X are
functions of each other.
1.2 Entropy: an introduction 33

Solution Check the triangle inequality

h(X|Z) + h(Z|X) ≤ h(X|Y ) + h(Y |X) + h(Y |Z) + h(Z|Y ),

or
h(X, Z) ≤ h(X,Y ) + h(Y, Z) − h(Y ).

To this end, write h(X, Z) ≤ h(X,Y, Z) and note that h(X,Y, Z) equals

h(X, Z|Y ) + h(Y ) ≤ h(X|Y ) + h(Z|Y ) + h(Y )


= h(X,Y ) + h(Y, Z) − h(Y ).

Equality holds iff (i) Y = φ (X, Z) and (ii) X, Z are conditionally independent
given Y .

Remark 1.2.20 The property that ρ (X, Z) = ρ (X,Y )+ ρ (Y, Z) means that ‘point’
Y lies on a ‘line’ through X and Z; in other words, that all three points X, Y , Z lie
on a straight line. Conditional independence of X and Z given Y can be stated
in an alternative (and elegant) way: the triple X → Y → Z satisfies the Markov
property (in short: is Markov). Then suppose we have four random variables X1 ,
X2 , X3 , X4 such that, for all 1 ≤ i1 < i2 < i3 ≤ 4, the random variables Xi1 and
Xi3 are conditionally independent given Xi2 ; this property means that the quadruple
X1 → X2 → X3 → X4 is Markov, or, geometrically, that all four points lie on a
line. The following fact holds: if X1 → X2 → X3 → X4 is Markov then the mutual
entropies satisfy

I(X1 : X3 ) + I(X2 : X4 ) = I(X1 : X4 ) + I(X2 : X3 ). (1.2.39)

Equivalently, for the joint entropies,

h(X1 , X3 ) + h(X2 , X4 ) = h(X1 , X4 ) + h(X2 , X3 ). (1.2.40)

In fact, for all triples Xi1 , Xi2 , Xi3 as above, in the metric ρ we have that

ρ (Xi1 , Xi3 ) = ρ (Xi1 , Xi2 ) + ρ (Xi2 , Xi3 ),

which in terms of the joint and individual entropies is rewritten as

h(Xi1 , Xi3 ) = h(Xi1 , Xi2 ) + h(Xi2 , Xi3 ) − h(Xi2 ).

Then (1.2.39) takes the form

h(X1 , X2 ) + h(X2 , X3 ) − h(X2 ) + h(X2 , X3 ) + h(X3 , X4 ) − h(X3 )


= h(X1 , X2 ) + h(X2 , X3 ) − h(X2 ) + h(X3 , X4 ) + h(X2 , X3 ) − h(X3 )

which is a trivial identity.


34 Essentials of Information Theory

Worked Example 1.2.21 Consider the following inequality. Let a triple X →


Y → Z be Markov where Z is a random string (Z1 , . . . , Zn ). Then
     
∑ I(X : Zi ) ≤ I(X,Y ) + I Z where I Z := ∑ h(Zi ) − h Z .
1≤i≤n 1≤i≤n

Solution The Markov property for X → Y → Z leads to the bound


 
I X : Z ≤ I(X : Y ).
Therefore, it suffices to verify that
   
∑ I(X : Zi ) − I Z ≤ I X : Z . (1.2.41)
1≤i≤n

As we show below, bound (1.2.41) holds for any X and Z (without referring to a
Markov property). Indeed, (1.2.41) is equivalent to
     
nh(X) − ∑ h(X, Zi ) + h Z ≤ h(X) + h Z − h X, Z
1≤i≤n
or
 
h X, Z − h(X) ≤ ∑ h(X, Zi ) − nh(X)
1≤i≤n
 
which in turn is nothing but the inequality h Z|X ≤ ∑ h(Zi |X).
1≤i≤n
m
Worked Example 1.2.22 Write h(p) := − ∑ p j log p j for a probability ‘vector’
⎛ ⎞ 1
p1
⎜ ⎟
p = ⎝ ... ⎠, with entries p j ≥ 0 and p1 + · · · + pm = 1.
pm
(a) Show that h(Pp) ≥ h(p) if P = (Pi j ) is a doubly stochastic matrix (i.e. a square
matrix with elements Pi j ≥ 0 for which all row and column sums are unity).
Moreover, h(Pp) ≡ h(p) iff P is a permutation matrix.
m m
(b) Show that h(p) ≥ − ∑ ∑ p j Pjk log Pjk if P is a stochastic matrix and p is an
j=1 k=1
invariant vector of P: Pp = p.

Solution (a) By concavity of the log-function x → log x, for all λi , ci ≥ 0 such


m m
that ∑ λi = 1, we have log(λ1 c1 + · · · + λm cm ) ≥ ∑ λi log ci . Apply this to h(Pp) =
1 1  
− ∑ Pi j p j log ∑ Pik pk ≥ − ∑ p j log ∑ Pi j Pik pk = − ∑ p j log PT Pp j . By
i, j k j i,k j
the Gibbs inequality the RHS ≥ h(p). The equality holds iff PT Pp ≡ p, i.e. PT P =
I, the unit matrix. This happens iff P is a permutation matrix.
1.2 Entropy: an introduction 35

(b) The LHS equals h(Un ) for the stationary Markov source (U1 ,U2 , . . .) with equi-
librium distribution p, whereas the RHS is h(Un |Un−1 ). The general inequality
h(Un |Un−1 ) ≤ h(Un ) gives the result.

Worked Example 1.2.23 The sequence of random variables {X j : j = 1, 2, . . .}


forms a DTMC with a finite state space.

(a) Quoting standard properties of conditional entropy, show that h(X j |X j−1 ) ≤
h(X j |X j−2 ) and, in the case of a stationary DTMC, h(X j |X j−2 ) ≤ 2h(X j |X j−1 ).
(b) Show that the mutual information I(Xm : Xn ) is non-decreasing in m and non-
increasing in n, 1 ≤ m ≤ n.

Solution (a) By the Markov property and stationarity

h(X j |X j−1 ) = h(X j |X j−1 , X j−2 )


≤ h(X j |X j−2 ) ≤ h(X j , X j−1 |X j−2 )
= h(X j |X j−1 , X j−2 ) + h(X j−1 |X j−2 ) = 2h(X j |X j−1 ).

(b) Write

I(Xm : Xn ) − I(Xm : Xn+1 ) = h(Xm |Xn+1 ) − h(Xm |Xn )


= h(Xm |Xn+1 ) − h(Xm |Xn , Xn+1 ) (because Xm and
Xn+1 are conditionally independent, given Xn )

which is ≥ 0. Thus, I(Xm : Xn ) does not increase with n.

Similarly,

I(Xm−1 : Xn ) − I(Xm : Xn ) = h(Xn |Xm−1 ) − h(Xn |Xm , Xm−1 ) ≥ 0.

Thus, I(Xm : Xn ) does not decrease with m.


Here, no assumption of stationarity has been used. The DTMC may not even be
time-homogeneous (i.e. the transition probabilities may depend not only on i and j
but also on the time of transition).

Worked Example 1.2.24 Given random variables Y1 , Y2 , Y3 , define

I(Y1 : Y2 |Y3 ) = h(Y1 |Y3 ) + h(Y2 |Y3 ) − h(Y1 ,Y2 |Y3 ).

Now let the sequence Xn , n = 0, 1, . . . be a DTMC. Show that

I(Xn−1 : Xn+1 |Xn ) = 0 and hence I(Xn−1 : Xn+1 ) ≤ I(Xn : Xn+1 ).

Show also that I(Xn : Xn+m ) is non-increasing in m, for m = 0, 1, 2, . . . .


36 Essentials of Information Theory

Solution By the Markov property, Xn−1 and Xn+1 are conditionally independent,
given Xn . Hence,

h(Xn−1 , Xn+1 |Xn ) = h(Xn+1 |Xn ) + h(Xn−1 |Xn )

and I(Xn−1 : Xn+1 |Xn ) = 0. Also,

I(Xn : Xn+m ) − I(Xn : Xn+m+1 )


= h(Xn+m ) − h(Xn+m+1 ) − h(Xn , Xn+m+1 ) + h(Xn , Xn+m )
= h(Xn |Xn+m+1 ) − h(Xn |Xn+m )
= h(Xn |Xn+m+1 ) − h(Xn |Xn+m , Xn+m+1 ) ≥ 0,

the final equality holding because of the conditional independence and the last
inequality following from (1.2.21).

Worked Example 1.2.25 (An axiomatic definition of entropy)

(a) Consider a probability distribution (p1 , . . . , pm ) and an associated measure of


uncertainty (entropy) such that

h(p1 q1 , p1 q2 , . . . , p1 qn , p2 , p3 , . . . , pm ) = h(p1 , . . . , pm ) + p1 h(q1 , . . . , qn ),


(1.2.42)
if (q1 , . . . , qn ) is another distribution. That is, if one of the contingencies (of
probability p1 ) is divided into sub-contingencies of conditional probabilities
q1 , . . . , qn , then the total uncertainty breaks up additively as shown. The func-
tional h is assumed to be symmetric in its arguments, so that analogous rela-
tions holds if contingencies 2, 3, . . . , m are subdivided.
 
Suppose that F(m) := h 1/m, . . . , 1/m is monotone increasing in m. Show
that, as a consequence of (1.2.42), F(mk ) = kF(m) and hence that F(m) =
c log m for some constant c. Hence show that

h(p1 , . . . , pm ) = −c ∑ p j log p j (1.2.43)


j

if p j are rational. The validity of (1.2.43) for an arbitrary collection {p j } then


follows by a continuity assumption.
(b) An alternative axiomatic characterisation of entropy is as follows. If a symmet-
ric function h obeys for any k < m

h(p1 , . . . , pm ) = h(p1 + · · · + pk , pk+1 , . . . , pm )


 
p1 pk
+ (p1 + · · · + pk )h ,..., , (1.2.44)
p1 + · · · + pk p1 + · · · + pk
1.2 Entropy: an introduction 37

h(1/2, 1/2) = 1, and h(p, 1 − p) is a continuous function of p ∈ [0, 1], then

h(p1 , . . . , pm ) = − ∑ p j log p j .
j

 
Solution (a) Using (1.2.42), we obtain for the function F(m) = h 1/m, . . . , 1/m
the following identity:
 
1 1 1 1 1 1
2
F(m ) = h × ,..., × , 2 ,..., 2
m m m m m m
 
1 1 1 1
=h , , . . . , 2 + F(m)
m m2 m m
..
. 
1 1 m
=h ,..., + F(m) = 2F(m).
m m m

The induction hypothesis is F(mk−1 ) = (k − 1)F(m). Then


 
1 1 1 1 1 1
k
(m ) = h × k−1 , . . . , × k−1 , k , . . . , k
m m m m m m
 
1 1 1 1
=h , , . . . , k + F(m)
mk−1 mk m m
..
. 
1 1 m
=h k−1
, . . . , k−1 + F(m)
m m m
= (k − 1)F(m) + F(m) = kF(m).

Now, for given positive integers b > 2 and m, we can find a positive integer n such
that 2n ≤ bm ≤ 2n+1 , i.e.
n n 1
≤ log2 b ≤ + .
m m m
By monotonicity of F(m), we obtain nF(2) ≤ mF(b) ≤ (n + 1)F(2), or

n F(b) n 1
≤ ≤ + .
m F(2) m m
 
 F(b)  1

We conclude that log2 b − ≤ , and letting m → ∞, F(b) = c log b with
F(2)  m
c = F(2).
38 Essentials of Information Theory
r1 rm
Now take rational numbers p1 = , . . . , pm = and obtain
r r
r  
1 rm r1 1 r1 1 r2 rm r1
h ,..., =h × ,..., × , ,..., − F(r1 )
r r r r1 r r1 r r r
..
. 
1 1 ri
=h ,..., −c ∑ log ri
r r 1≤i≤m r
ri ri ri
= c log r − c ∑ log ri = −c ∑ log .
1≤i≤m r 1≤i≤m r r

(b) For the second definition


 the point is that we do not assume the monotonicity of
F(m) = h 1/m, . . . , 1/m in m. Still, using (1.2.44), it is easy to check the additivity
property
F(mn) = F(m) + F(n)

for any positive integers m, n. Hence, for a canonical prime number decomposition
m = qα1 1 . . . qαs s we obtain

F(m) = α1 F(q1 ) + · · · + αs F(qs ).

Next, we prove that


F(m)
→ 0, F(m) − F(m − 1) → 0 (1.2.45)
m
as m → ∞. Indeed,
 
1 1
F(m) = h ,...,
m m
   
1 m−1 m−1 1 1
=h , + h ,..., ,
m m m m−1 m−1
i.e.
1 m−1 m−1
h , = F(m) − F(m − 1).
m m m
By continuity and symmetry of h(p, 1 − p),
1 m−1
lim h , = h(0, 1) = h(1, 0).
m→∞ m m
But from the representations
1 1 1 1 1
h , ,0 = h , + h(1, 0)
2 2 2 2 2
1.2 Entropy: an introduction 39

and (the symmetry again)


1 1 1 1 1 1
h , , 0 = h 0, , = h(1, 0) + h ,
2 2 2 2 2 2
we obtain h(1, 0) = 0. Hence,
m−1
lim F(m) − F(m − 1) = 0. (1.2.46)
m→∞ m
Next, we write
m k−1
mF(m) = ∑k F(k) −
k
F(k − 1)
k=1

or, equivalently,
 
m  
F(m) m + 1 2 k−1
m
= ∑ k F(k) − k F(k − 1) .
2m m(m + 1) k=1

The quantity in the square brackets is the arithmetic mean of m(m + 1)/2 terms of
a sequence

2 2
F(1), F(2) − F(1), F(2) − F(1), F(3) − F(2), F(3) − F(2),
3 3
2 k−1
F(3) − F(2), . . . , F(k) − F(k − 1), . . . ,
3 k
k−1
F(k) − F(k − 1), . . .
k
that tends to 0. Hence, it goes to 0 and F(m)/m → 0. Furthermore,
 m−1  1
F(m) − F(m − 1) = F(m) − F(m − 1) − F(m − 1) → 0,
m m
and (1.2.46) holds. Now define

F(m)
c(m) = ,
log m

and prove that c(m) = const. It suffices to prove that c(p) = const for any prime
number p. First, let us prove that a sequence (c(p)) is bounded. Indeed, suppose the
numbers c(p) are not bounded from above. Then, we can find an infinite sequence
of primes p1 , p2 , . . . , pn , . . . such that pn is the minimal prime such that pn > pn−1
and c(pn ) > c(pn−1 ). By construction, if a prime q < pn then c(q) < c(pn ).
40 Essentials of Information Theory

Consider the canonical decomposition into prime factors of the number pn − 1 =


qα1 1 . . . qαs s with q1 = 2. Then we write the difference F(pn ) − F(pn − 1) as
F(pn )
F(pn ) − log(pn − 1) + c(pn ) log(pn − 1) − F(pn − 1)
log pn
s
F(pn ) pn pn
= log + ∑ α j (c(pn ) − c(q j )) log q j .
pn log pn pn − 1 j=1

The previous remark implies that


s
∑ α j (c(pn ) − c(q j )) log q j ≥ (c(pn ) − c(2)) log 2 = (c(pn ) − c(2)). (1.2.47)
j=1

p p
Moreover, as lim log = 0, equations (1.2.46) and (1.2.47) imply that
p→∞ log p p−1
c(pn ) − c(2) ≤ 0 which contradicts with the construction of c(p). Hence, c(p) is
bounded from above. Similarly, we check that c(p) is bounded from below. More-
over, the above proof yields that sup p c(p) and inf p c(p) are both attained.
Now assume that c( p) = sup p c(p) > c(2). Given a positive integer m, decom-
pose into prime factors pm − 1 = qα1 1 . . . qαs s with q1 = 2. Arguing as before, we
write the difference F( pm ) − F( pm − 1) as
F( pm )
F( pm ) − log( pm − 1) + c( p) log( pm − 1) − F( pm − 1)
log pm
F( pm ) pm pm s
=
pm log pm
log +
pm − 1 j=1 ∑ α j (c( p) − c(q j )) log q j

c( pm ) pm pm


≥ log + (c( p) − c(2)).
pm log pm pm − 1
As before, the limit m → ∞ yields c( p) − c(2) ≤ 0 which gives a contradiction.
Similarly, we can prove that inf p c(p) = c(2). Hence,
c(p) = c is a constant, and
1 1
F(m) = c log m. From the condition F(2) = h 2, 2 = 1 we get c = 1. Finally, as
in (a), we obtain
m
h(p1 , . . . , pm ) = − ∑ pi log pi (1.2.48)
i=1
m
for any rational p1 , . . . , pm ≥ 0 with ∑ pi = 1. By continuity argument (1.2.48) is
i=1
extended to the case of irrational probabilities.
Worked Example 1.2.26 Show that ‘more homogeneous’ distributions have a
greater entropy. That is, if p = (p1 , . . . , pn ) and q = (q1 , . . . , qn ) are two probabil-
ity distributions on the set {1, . . . , n}, then p is called more homogeneous than q
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 41

(p  q, cf. [108]) if, after rearranging values p1 , . . . , pn and q1 , . . . , qn in decreasing


order:
p1 ≥ · · · ≥ pn , q1 ≥ · · · ≥ qn ,

one has
k k
∑ pi ≤ ∑ qi , for all k = 1, . . . , n.
i=1 i=1

Then
h(p) ≥ h(q) whenever p  q.

Solution We write the probability distributions p and q as non-increasing functions


of a discrete argument

p ∼ p(1) ≥ · · · ≥ p(n) ≥ 0, q ∼ q(1) ≥ · · · ≥ q(n) ≥ 0,

with ∑ p(i) = ∑ q(i) = 1.


i i

Condition p  q means that if p = q then there exist i1 and i2 such that (a) 1 ≤ i1 ≤
i2 ≤ n, (b) q(i1 ) > p(i1 ) ≥ p(i2 ) > q(i2 ) and (c) q(i) ≥ p(i) for 1 ≤ i ≤ i1 , q(i) ≤ p(i) for
i ≥ i2 .
Now apply induction in s, the number of values i = 1, . . . , n for which q(i) = p(i) .
If s = 0 we have p = q and the entropies coincide. Make the induction hypothesis
and then increase s by 1. Take a pair i1 , i2 as above. Increase q(i2 ) and decrease q(i1 )
so that the sum q(i1 ) +q(i2 ) is preserved, until either q(i1 ) reaches p(i1 ) or q(i2 ) reaches
p(i2 ) (see Figure 1.7). Property (c) guarantees that the modified distributions p  q.
As the function x → η (x) = −x log x − (1 − x) log(1 − x) strictly increases on
[0, 1/2]. Hence, the entropy of the modified distribution strictly increases. At the
end of this process we diminish s. Then we use our induction hypothesis.

1.3 Shannon’s first coding theorem. The entropy rate of a


Markov source
A useful meaning of the information rate of a source is that it specifies the mini-
mal rates of growth for the set of sample strings carrying, asymptotically, the full
probability.
Lemma 1.3.1 Let H be the information rate of a source (see (1.1.20)). Define

Dn (R) := max P(U(n) ∈ A) : A ⊂ I ×n ,  A ≤ 2nR . (1.3.1)


42 Essentials of Information Theory

Q P

1. n
i1 i2

Figure 1.7

Then for any ε > 0, as n → ∞,

lim Dn (H + ε ) = 1, and, if H > 0, Dn (H − ε ) → 1. (1.3.2)

Proof By definition, R := H + ε is a reliable encoding rate. Hence, there exists a


sequence of sets An ⊂ I ×n , with  An ≤ 2nR and P(U(n) ∈ An ) → 1, as n → ∞. Since
Dn (R) ≥ P(U(n) ∈ An ), then Dn (R) → 1.
Now suppose that H > 0, and take R := H − ε ; for ε small enough, R > 0.
However, R is not a reliable rate. That is, there is no sequence An with the above
properties. Take a set Cn where the maximum in (1.3.1) is attained. Then Cn ≤ 2nR ,
but P(Cn ) → 1.

Given a string u(n) = u1 . . . un , consider its ‘log-likelihood’ value per source-


letter:
1
ξn (u(n) ) = − log+ pn (u(n) ), u(n) ∈ I ×n , (1.3.3a)
n
where pn (u(n) ) := P(U(n) = u(n) ) is the probability assigned to string u(n) . Here
and below, log+ x = log x if x > 0, and is 0 if x = 0. For a random string, U(n) =
u1 , . . . , un ,
1
ξn (U(n) ) = − log+ pn (U(n) ) (1.3.3b)
n
is a random variable.

Lemma 1.3.2 For all R, ε > 0,

P(ξn ≤ R) ≤ Dn (R) ≤ P(ξn ≤ R + ε ) + 2−nε . (1.3.4)


1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 43

Proof For brevity, omit the upper index (n) in the notation u(n) and U(n) . Set
Bn := {u ∈ I ×n : pn (u) ≥ 2−nR }
= {u ∈ I ×n : − log pn (u) ≤ nR}
= {u ∈ I ×n : ξn (u) ≤ R}.
Then
1 ≥ P(U ∈ Bn ) = ∑ pn (u) ≥ 2−nR  Bn , whence  Bn ≤ 2nR .
u∈Bn

Thus,

Dn (R) = max P(U ∈ An ) : An ⊆ I ×n ,  A ≤ 2nR


≥ P(U ∈ Bn ) = P(ξn ≤ R),
which proves the LHS in (1.3.4).
On the other hand, there exists a set Cn ⊆ I ×n where the maximum in (1.3.1) is
attained. For such a set, Dn (R) = P(U ∈ Cn ) is decomposed as follows:
Dn (R) = P(U ∈ Cn , ξn ≤ R + ε ) + P(U ∈ Cn , ξn > R + ε )
 
≤ P(ξn ≤ R + ε ) + ∑ pn (u)1 pn (u) < 2−n(R+ε )
u∈Cn

< P(ξn ≤ R + ε ) + 2−n(R+ε ) Cn


= P(ξn ≤ R + ε ) + 2−n(R+ε ) 2nR
= P(ξn ≤ R + ε ) + 2−nε .

Definition 1.3.3 (See PSE II, p. 367.) A sequence of random variables {ηn }
converges in probability to a constant r if, for all ε > 0,

lim P |ηn − r| ≥ ε = 0. (1.3.5)
n→∞

Replacing, in this definition, r by a random variable η , we obtain a more general


definition of convergence in probability to a random variable.
P
Convergence in probability is denoted henceforth as ηn −→ r (respectively,
P
ηn −→ η ).
Remark 1.3.4 It is precisely the convergence in probability (to an expected
value) that figures in the so-called law of large numbers (cf. (1.3.8) below). See
PSE I, p. 78.
Theorem 1.3.5 (Shannon’s first coding theorem (FCT)) If ξn converges in prob-
ability to a constant γ then γ = H , the information rate of a source.
44 Essentials of Information Theory
P
Proof Let ξn −→ γ . Since ξn ≥ 0, γ ≥ 0. By Lemma 1.3.2, for any ε > 0,
Dn (γ + ε ) ≥ P(ξn ≤ γ + ε ) ≥ P(γ − ε ≤ ξn ≤ γ + ε )
 
= P |ξn − γ | ≤ ε = 1 − P |ξn − γ | > ε → 1 (n → ∞).

Hence, H ≤ γ . In particular, if γ = 0 then H = 0. If γ > 0, we have, again by Lemma


1.3.2, that
 
Dn (γ − ε ) ≤ P(ξn ≤ γ − ε /2) + 2−nε /2 ≤ P |ξn − γ | ≥ ε /2 + 2−nε /2 → 0.
By Lemma 1.3.1, H ≥ γ . Hence, H = γ .
P
Remark 1.3.6 (a) Convergence ξn −→ γ = H is equivalent to the following
asymptotic equipartition property: for any ε > 0,

lim P 2−n(H+ε ) ≤ pn (U(n) ) ≤ 2−n(H−ε ) = 1. (1.3.6)
n→∞

In fact,

P 2−n(H+ε ) ≤ pn (U(n) ) ≤ 2−n(H−ε )
 
1
= P H − ε ≤ − log pn (U ) ≤ H + ε
(n)
n
 
= P |ξn − H| ≤ ε = 1 − P |ξn − H| > ε .

In other words, for all ε > 0 there exists n0 = n0 (ε ) such that, for any n > n0 , the
set I ×n decomposes into disjoint subsets, Πn and Tn , with
 
(i) P U(n) ∈ Πn < ε , 
(ii) 2−n(H+ε ) ≤ P U(n) = u(n) ≤ 2−n(H−ε ) for all u(n) ∈ Tn .
Pictorially speaking, Tn is a set of ‘typical’ strings and Πn is the residual set.
We conclude that, for a source with the asymptotic equipartition property, it is
worthwhile to encode the typical strings with codewords of the same length, and
the rest anyhow. Then we have the effective encoding rate H + o(1) bits/source-
letter, though the source emits log m bits/source-letter.
(b) Observe that
1 1
E ξn = − ∑
n u(n) ∈I ×n
pn (u(n) ) log pn (u(n) ) = h(n) .
n
(1.3.7)

The simplest example of an information source (and one among the most
instructive) is a Bernoulli source.
Theorem 1.3.7 For a Bernoulli source U1 ,U2 , . . ., with P(Ui = x) = p(x),
H = − ∑ p(x) log p(x).
x
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 45

Proof For an IID sequence U1 , U2 , . . ., the probability of a string is


n
pn (u(n) ) = ∏ p(ui ), u(n) = u1 . . . un .
i=1

Hence, − log pn (u) = ∑ − log p(ui ). Denoting σi = − log p (Ui ), i = 1, 2, . . . , we


i
see that σ1 , σ2 , . . . form a sequence of IID random variables. For a random
n
string U(n) = U1 . . .Un , − log pn (U(n) ) = ∑ σi , where the random variables σi =
i=1
− log p(Ui ) are IID.
1 n
Next, write ξn = ∑ σi . Observe that Eσi = − ∑j p( j) log p( j) = h and
n i=1
 
1 n 1 n 1 n
E ξn = E ∑ σi
n i=1
= ∑
n i=1
Eσi = ∑ h = h,
n i=1

the final equality being in agreement with (1.3.7), since, for the Bernoulli source,
P
h(n) = nh (see (1.1.18)), and hence Eξn = h. We immediately see that ξn −→ h by
the law of large numbers. So H = h by Theorem 1.3.5 (FCT).

Theorem 1.3.8 (The law of large numbers for IID random variables) For any
sequence of IID random variables η1 , η2 , . . . with finite variance and mean Eηi = r,
and for any ε > 0,
1 n
lim P | ∑ ηi − r| ≥ ε = 0. (1.3.8)
n→∞ n i=1

Proof The proof of Theorem 1.3.8 is based on the famous Chebyshev inequality;
see PSE II, p. 368.

Lemma 1.3.9 For any random variable η and any ε > 0,


1
P(η ≥ ε ) ≤ Eη 2 .
ε2
Proof See PSE I, p. 75.

Next, consider a Markov source U1 U2 . . . with  from alphabet Im =


 letters
{1, . . . , m} and assume that the transition matrix P(u, v) (or rather its power)
obeys
min P(r) (u, v) = ρ > 0 for some r ≥ 1. (1.3.9)
u,v
46 Essentials of Information Theory

This condition means that the DTMC is irreducible and aperiodic. Then (see
PSE II, p. 71), the DTMC has a unique invariant (equilibrium) distribution
π (1), . . . , π (m):
m m
0 ≤ π (u) ≤ 1, ∑ π (u) = 1, π (v) = ∑ π (u)P(u, v), (1.3.10)
u=1 u=1

and the n-step


 transition
 probabilities P(n) (u, v) converge to π (v) as well as the
probabilities λ P n−1 (v) = P(Un = v):
lim P(n) (u, v) = lim P(Un = v) = lim ∑ λ (u)P(n) (u, v) = π (v), (1.3.11)
n→∞ n→∞ n→∞
u

for all initial distribution {λ (u), u ∈ I}. Moreover, the convergence in (1.3.11) is
exponentially (geometrically) fast.
Theorem 1.3.10 Assume that condition (1.3.9) holds with r = 1. Then the DTMC
U1 ,U2 , . . . possesses a unique invariant distribution (1.3.10), and for any u, v ∈ I and
any initial distribution λ on I ,
|P(n) (u, v) − π (v)| ≤ (1 − ρ )n and |P(Un = v) − π (v)| ≤ (1 − ρ )n−1 . (1.3.12)
In the case of a general r ≥ 1, we replace, in the RHS of (1.3.12), (1 − ρ )n by
(1 − ρ )[n/r] and (1 − ρ )n−1 by (1 − ρ )[(n−1)/r] .
Proof See Worked Example 1.3.13.
Now we introduce an information rate H of a Markov source.
Theorem 1.3.11 For a Markov source, under condition (1.3.9),
H =− ∑ π (u)P(u, v) log P(u, v) = lim h(Un+1 |Un );
n→∞
(1.3.13)
1≤u,v≤m

if the source is stationary then H = h(Un+1 |Un ).


P
Proof We again use the Shannon FCT to check that ξn −→ H where H is given by
1
(1.3.13), and ξn = − log pn (U(n) ), cf. (1.3.3b). In other words, condition (1.3.9)
n
implies the asymptotic equipartition property for a Markov source.
The Markov property means that, for all string u(n) = u1 . . . un ,
pn (u(n) ) = λ (u1 )P(u1 , u2 ) · · · P(un−1 , un ), (1.3.14a)
and − log pn (u(n) ) is written as the sum
− log λ (u1 ) − log P(u1 , u2 ) − · · · − log P(un−1 , un ). (1.3.14b)
For a random string, U(n) = U1 . . .Un , the random variable − log pn (U(n) ) has a
similar form:
− log λ (U1 ) − log P(U1 ,U2 ) − · · · − log P(Un−1 ,Un ). (1.3.15)
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 47

As in the case of a Bernoulli source, we denote

σ1 (U1 ) := − log λ (U1 ), σi (Ui−1 ,Ui ) := − log P(Ui−1 ,Ui ), i ≥ 2, (1.3.16)

and write
1 n−1
ξn = σ1 + ∑ σi+1 . (1.3.17)
n i=1

The expected value of σ is

E σ1 = − ∑ λ (u) log λ (u) (1.3.18a)


u

and, as P(Ui = v) = λ Pi−1 (v) = ∑ λ (u)P(i−1) (u, v),


u

Eσi+1 = − ∑ P(Ui = u,Ui+1 = u ) log P(u, u )


u,u
 
= − ∑ λ Pi−1 (u)P(u, u ) log P(u, u ), i ≥ 1. (1.3.18b)
u,u

Theorem 1.3.10 implies that lim Eσi = H. Hence,


i→∞

1 n
lim Eξn = lim
n→∞ n→∞ n
∑ Eσi = H,
i=1

P
and the convergence ξn −→ H is again a law of large numbers, for the sequence
(σi ):
  
1 n 
 
lim P  ∑ σi − H  ≥ ε = 0. (1.3.19)
n→∞  n i=1 

However, the situation here is not as simple as in the case of a Bernoulli source.
There are two difficulties to overcome: (i) Eσi equals H only in the limit i → ∞;
(ii) σ1 , σ2 , . . . are no longer independent. Even worse, they do not form a DTMC,
or even a Markov chain of a higher order. [A sequence U1 ,U2 , . . . is said to form a
DTMC of order k, if, for all n ≥ 1,

P(Un+k+1 = u |Un+k = uk , . . . ,Un+1 = u1 , . . .)


= P(Un+k+1 = u |Un+k = uk , . . . ,Un+1 = u1 ).

An obvious remark is that, in a DTMC of order k, the vectors Ūn =


(Un ,Un+1 , . . . ,Un+k−1 ), n ≥ 1, form an ordinary DTMC.] In a sense, the ‘mem-
ory’ in a sequence σ1 , σ2 , . . . is infinitely long. However, it decays exponentially:
the precise meaning of this is provided in Worked Example 1.3.14.
48 Essentials of Information Theory

Anyway, by using the Chebyshev inequality, we obtain


    2
1 n  n
  1
P  ∑ σi − H  ≥ ε ≤ 2 2 E ∑ (σi − H) . (1.3.20)
 n i=1  n ε i=1

Theorem 1.3.11 immediately follows from Lemma 1.3.12 below.

Lemma 1.3.12 The expectation value in the RHS of (1.3.20) satisfies the bound
 2
n
E ∑ (σi − H) ≤ C n, (1.3.21)
i=1

where C > 0 is a constant that does not depend on n.

Proof See Worked Example 1.3.14.


C
By (1.3.21), the RHS of (1.3.20) becomes ≤ and goes to zero as n → ∞.
nε 2
Worked Example 1.3.13 Prove the following bound (cf. (1.3.12)):

|P(n) (u, v) − π (v)| ≤ (1 − ρ )n . (1.3.22)

Solution (Compare with PSE II, p. 72.) First, observe that (1.3.12) implies the
second bound in Theorem 1.3.10 as well as (1.3.10). Indeed, π (v) is identified as
the limit

lim P(n) (u, v) = lim ∑ P(n−1) (u, u)P(u, v) = ∑ π (u)P(u, v), (1.3.23)
n→∞ n→∞
u u

which yields (1.3.10). If π (1), π (2), . . . , π (m) is another invariant probability


vector, i.e.
m
0 ≤ π (u) ≤ 1, ∑ π (u) = 1, π (v) = ∑ π (u)P(u, v),
u=1 u

then π (v) = ∑ π (u)P(n) (u, v) for all n ≥ 1. The limit n → ∞ gives then
u

π (v) = ∑ π (u) lim P(n) (u, v) = ∑ π (u)π (v) = π (v),


n→∞
u u

i.e. the invariant probability vector is unique.

To prove (1.3.22) denote

mn (v) = min P(n) (u, v), Mn (v) = max P(n) (u, v). (1.3.24)
u u
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 49

Then
mn+1 (v) = min P(n+1) (u, v) = min
u u
∑ P(u, u)P(n) (u, v)
u
≥ min P (u, v) ∑ P(u, u) = mn (v).
(n)
u
u

Similarly,
Mn+1 (v) = max P(n+1) (u, v) = max
u u
∑ P(u, u)P(n) (u, v)
u

≤ max P(n) (u, v) ∑ P(u, u) = Mn (v).


u
u

Since 0 ≤ mn (v) ≤ Mn (v) ≤ 1, both mn (v) and Mn (v) have the limits
m(v) = lim mn (v) ≤ lim Mn (v) = M(v).
n→∞ n→∞

Furthermore, the difference M(v) − m(v) is written as the limit


lim (Mn (v) − mn (v)) = lim max (P(n) (u, v) − P(n) (u , v)).
n→∞ n→∞ u,u

So, if we manage to prove that


max

|P(n) (u, v) − P(n) (u , v)| ≤ (1 − ρ )n , (1.3.25)
u,u ,v

then M(v) = m(v) for each v. Furthermore, denoting the common value M(v) =
m(v) by π (v), we obtain (1.3.22)
|P(n) (u, v) − π (v)| ≤ Mn (v) − mn (v) ≤ (1 − ρ )n .

To prove (1.3.25), consider a DTMC on I × I, with states (u1 , u2 ), and transition


probabilities

⎨ P(u1 , v1 )P(u2 , v2 ), if u1 = u2 ,
P (u1 , u2 ), (v1 , v2 ) = P(u, v), if u1 = u2 = u; v1 = v2 = v,

0, if u1 = u2 and v1 = v2 .
(1.3.26)
It is easy to check that P (u1 , u2 ), (v1 , v2 ) is indeed a transition probability matrix
(of size m2 × m2 ): if u1 = u2 = u then

∑ P (u ,
1 2u ), (v ,
1 2v ) = ∑ P(u, v) = 1
v1 ,v2 v

whereas if u1 = u2 then

∑ P (u1 , u2 ), (v1 , v2 ) = ∑ P(u1 , v1 ) ∑ P(u2 , v2 ) = 1
v1 ,v2 v1 v2
50 Essentials of Information Theory
 
(the inequalities 0 ≤ P (u1 , u2 ), (v1 , v2 ) ≤ 1 follow directly from the definition
(1.3.26)).

This is the so-called coupled DTMC on I × I; we denote it by (Vn ,Wn ), n ≥ 1.


Observe that both components Vn and Wn are DTMCs with transition probabilities
P(u, v). More precisely, the components Vn and Wn move independently, until the
first (random) time τ when they coincide; we call it the coupling time. After time
τ the components Vn and Wn ‘stick’ together and move synchronously, again with
transition probabilities P(u, v).

Suppose we start the coupled chain from a state (u, u ). Then

|P(n) (u, v) − P(n) (u , v)|


= |P(Vn = v|V1 = u,W1 = u ) − P(Wn = v|V1 = u,W1 = u )|

(because each component of (Vn ,Wn ) moves with the same transition probabilities)

= |P(Vn = v,Wn = v|V1 = u,W1 = u )


− P(Vn = v,Wn = v|V1 = u,W1 = u )|
≤ P(Vn = Wn |V1 = u,W1 = u )
= P(τ > n|V1 = u,W1 = u ). (1.3.27)

Now, the probability obeys

P(τ = 1|V1 = u,W1 = u ) ≥ ∑ P(u, v)P(u , v) ≥ ρ ∑ P(u , v) = ρ ,


v v

i.e. the complementary probability satisfies

P(τ > 1|V1 = u,W1 = u ) ≤ 1 − ρ .

By the strong Markov property (of the coupled chain),

P(τ > n|V1 = u,W1 = u ) ≤ (1 − ρ )n . (1.3.28)

Bounds (1.3.28) and (1.3.27) together give (1.3.25).

Worked Example 1.3.14 Under condition (1.3.9) with r = 1 prove the following
bound:

 2
|E (σi − H)(σi+k − H) | ≤ H + | log ρ | (1 − ρ )k−1 . (1.3.29)
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 51

Solution For brevity, we assume i > 1; the case i = 1 requires minor changes.
Returning to the definition of random variables σi , i > 1, write

E (σi − H)(σi+k − H)
= ∑ ∑ P(Ui = u,Ui+1 = u ;Ui+k = v,Ui+k+1 = v )
u,u v,v
  
× − log P(u, u ) − H − log P(v, v ) − H . (1.3.30)

Our goal is to compare this expression with


∑∑ λ P i−1
(u)P(u, u ) − log P(u, u ) − H
u,u v,v

× π (v)P(v, v ) − log P(v, v ) − H . (1.3.31)

Observe that (1.3.31) in fact vanishes because the sum ∑ vanishes due to the defi-
v,v
nition (1.3.13) of H.
The difference between sums (1.3.30) and (1.3.31) comes from the fact that the
probabilities

P(Ui = u,Ui+1 = u ;Ui+k = v,Ui+k+1 = v )



= λ Pi−1 (u)P(u, u )P(k−1) (u , v)P(v, v )

and

λ Pi−1 (u)P(u, u )π (v)P(v, v )

do not coincide. However, the difference of these probabilities in absolute value


does not exceed
|P(k−1) (u , v) − π (v)| ≤ (1 − ρ )k−1 .

As | − log P( · , · ) − H| ≤ H + | log ρ |, we obtain (1.3.29).

Proofof Theorem 1.3.11. This is now easy to complete. To prove (1.3.21), expand
the square and use the additivity of the expectation:
 2
n

E ∑ (σi − H) = ∑ E (σi − H)2


i=1 1≤i≤n

+2 ∑ E (σi − H)(σ j − H) . (1.3.32)


1≤i< j≤n

The first sum in (1.3.32) is OK: it contains n terms E(σi − H)2 each bounded by a
 2
constant (say, C may be taken to be H + | log ρ | ). Thus this sum is at most C n.
52 Essentials of Information Theory

It is the second sum that causes problems: it contains n(n − 1) 2 terms. We bound
it as follows:
   

 n ∞


 ∑ E (σi − H)(σ j − H)  ≤ ∑ ∑ |E (σi − H)(σi+k − H) | , (1.3.33)
1≤i< j≤n  i=1 k=1

and use (1.3.29) to finish the proof.


Our next theorem shows the role of the (relative) entropy in the asymptotic anal-
ysis of probabilities; see PSE I, p. 82.
Theorem 1.3.15 Let ζ1 , ζ2 , . . . be a sequence of IID random variables taking
values 0 and 1 with probabilities 1 − p and p, respectively, 0 < p < 1. Then, for
any sequence kn of positive integers such that kn → ∞ and n − kn → ∞ as n → ∞,
 
n
P ∑ ζi = kn ∼ (2π np∗ (1 − p∗ ))−1/2 exp (−nD(p||p∗ )) . (1.3.34)
i=1

Here, ∼ means that the ratio of the left- and right-hand sides tends to 1 as n → ∞,
kn
p∗ (= p∗n ) denotes the ratio , and D(p||p∗ ) stands for the relative entropy h(X||Y )
n
where X is distributed as ζi (i.e. it takes values 0 and 1 with probabilities 1 − p
and p), while Y takes the same values with probabilities 1 − p∗ and p∗ .
Proof Use Stirling’s formula (see PSE I, p.72):

n! ∼ 2π nnn e−n . (1.3.35)

[In fact, this formula admits a more precise form: n! = 2π nnn e−n+θ (n) , where
1 1
< θ (n) < , but for our purposes (1.3.35) is enough.] Then the proba-
12n + 1 12n
bility in the LHS of (1.3.34) is (for brevity, the subscript n in kn is omitted)
   1/2
n k n nn
p (1 − p) n−k
∼ pk (1 − p)n−k
k 2π k(n − k) kk (n − k)n−k
 −1/2
= 2π np∗ (1 − p∗ )
× exp [−k ln k/n − (n − k) ln (n − k)/n + k ln p + (n − k) ln(1 − p)] .
But the RHS of the last formula coincides with the RHS of (1.3.34).
If p∗ is close to p, we can write
 
∗ 1 1 1
D(p||p ) = + (p∗ − p)2 + O(|p∗ − p|3 ), (1.3.36)
2 p 1− p
 
d
as D(p||p∗ )| p∗ =p = D(p||p∗) |
p∗ =p = 0, and immediately obtain
dp∗
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 53

Corollary 1.3.16 (The local De Moivre–Laplace theorem; cf. PSE I, p. 81) If


n(p∗ − p) = kn − np = o(n2/3 ) then
   
n
1 n
P ∑ ζi = kn ∼ $ exp − ∗
(p − p) .
2
(1.3.37)
i=1 2π np(1 − p) 2p(1 − p)

Worked Example 1.3.17 At each time unit a device reads the current version
of a string of N characters each of which may be either 0 or 1. It then transmits
the number of characters which are equal to 1. Between each reading the string is
perturbed by changing one of the characters at random (from 0 to 1 or vice versa,
with each character being equally likely to be changed). Determine an expression
for the information rate of this source.

Solution The source is Markov, with the state space {0, 1, . . . , N} and the transition
probability matrix
⎛ ⎞
0 1 0 0 ... 0 0
⎜1/N − ⎟
⎜ 0 (N 1)/N 0 ... 0 ⎟
0
⎜ ⎟
⎜ 0 2/N 0 (N − 2)/N ... 0 ⎟
0
⎜ ⎟.
⎜ ... ... ⎟
⎜ ⎟
⎝ 0 0 0 0 ... 0 1/N ⎠
0 0 0 0 ... 1 0
The DTMC is irreducible and periodic. It possesses a unique invariant distribution
 
N
πi = 2−N , 0 ≤ i ≤ N.
i
By Theorem 1.3.11,
 
1 N−1 N N
H = − ∑ πi P(i, j) log P(i, j) = 21−N ∑ j j log j .
i, j N j=1

Worked Example 1.3.18 A stationary source emits symbols 0, 1, . . . , m (m ≥ 4 is


an even number), according to a DTMC, with the following transition probabilities
p jk = P(Un+1 = k | Un = j):
p j j+2 = 1/3, 0 ≤ j ≤ m − 2, p j j−2 = 1/3, 2 ≤ j ≤ m,

p j j = 1/3, 2 ≤ j ≤ m − 2, p00 = p11 = pm−1m−1 = pmm = 2/3.


The distribution of the first symbol is equiprobable. Find the information rate of
the source. Does the result contradict Shannon’s FCT?
54 Essentials of Information Theory

How does the answer change if m is odd? How can you use, for m odd, Shannon’s
FCT to derive the information rate of the above source?

Solution For m even, the DTMC is reducible: there are two communicating classes,
I1 = {0, 2, . . . , m} with m/2 + 1 states, and I2 = {1, 3, . . . , m − 1} with m/2 states.
Correspondingly, for any set An of n-strings,

P(An ) = qP1 (An1 ) + (1 − q)P2 (An2 ), (1.3.38)

where An1 = An ∩ I1 and An1 = An ∩ I2 ; Pi refers to the DTMC on class Ii , i = 1, 2,


and q = P(U1 ∈ I1 ).
1
The random variable from (1.3.3b) is ξn = − log pn (U(n) ); according to
n
(1.3.38),
1
ξn = − log pn1 (U(n) ) with probability q,
n
1
= − log pn2 (U(n) ) with probability 1 − q. (1.3.39)
n
Both DTMCs are irreducible and aperiodic on their communicating classes and
their invariant distributions are uniform:
(1) 2 (2) 2
πi = , i ∈ I1 , πi = , i ∈ I2 .
m+2 m
Their information rates equal, respectively,
8 8
H (1) = log 3 − and H (2) = log 3 − . (1.3.40)
3(m + 2) 3m

As follows from (1.3.38), the information rate of the whole DTMC equals
% (1)
H = max [H (1) , H (2) ], if 0 < q ≤ 1,
Hodd = (1.3.41)
H (2) , if q = 0.

For 0 < q < 1 Shannon’s FCT is not applicable:


1 P1 1 P2
− log pn1 (U(n) ) −→ H (1) whereas − log pn2 (U(n) ) −→ H (2) ,
n n
i.e. ξn converges to a non-constant limit. However, if q(1 − q) = 0, then (1.3.41)
is reduced to a single line, and Shannon’s FCT is applicable: ξn converges to the
corresponding constant H (i) .
If m is odd, again there are two communicating classes, I1 = {0, 2, . . . , m − 1}
and I2 = {1, 3, . . . , m}, each of which now contains (m + 1)/2 states. As before,
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 55

DTMCs P1 and P2 are irreducible and aperiodic and their invariant distributions
are uniform:
(1) 2 (2) 2
πi = , i ∈ I1 , πi = , i ∈ I2 .
m+1 m+1
Their common information rate equals
8
Hodd = log 3 − , (1.3.42)
3(m + 1)
which also gives the information rate of the whole DTMC. It agrees with Shannon’s
FCT, because now
1 P
ξn = − log pn (U(n) ) −→ Hodd . (1.3.43)
n

Worked Example 1.3.19 Let a be the size of A and b the size of the alphabet B.
Consider a source with letters chosen from an alphabet A + B, with the constraint
that no two letters of A should ever occur consecutively.

(a) Suppose the message follows a DTMC, all characters which are permitted at a
given place being equally likely. Show that this source has information rate
a log b + (a + b) log(a + b)
H= . (1.3.44)
2a + b
(b) By solving a recurrence relation, or otherwise, find how many strings of length
n satisfy the constraint that no two letters of A occur consecutively. Suppose
these strings are equally likely and let n → ∞. Show that the limiting informa-
tion rate becomes
 √ 
b + b2 + 4ab
H = log .
2

Why are the answers different?

Solution (a) The transition probabilities of the DTMC are given by




⎨ 0, if x, y ∈ {1, . . . , a},
P(x, y) = 1/b, if x ∈ {1, . . . , a}, y ∈ {a + 1, . . . , a + b},


1/(a + b), if x ∈ {a + 1, . . . , a + b}, y ∈ {1, . . . , a + b}.
(2)
 aperiodic. Moreover, min P (x, y) > 0; hence, an
The chain is irreducible and
invariant distribution π = π (x), x ∈ {1, . . . , a + b} is unique. We can find π from
56 Essentials of Information Theory

the detailed balance equations (DBEs) π (x)P(x, y) = π (y)P(y, x) (cf. PSE II, p. 82),
which yields
'
1 (2a + b), x ∈ {1, . . . , a},
π (x) =
(a + b) [b(2a + b)], x ∈ {a + 1, . . . , a + b}.
The DBEs imply that π is invariant: π (y) = ∑ π (x)P(x, y), but not vice versa. Thus,
x
we obtain (1.3.44).
(b) Let Mn denote the number of allowed n-strings, An the number of allowed
n-strings ending with a letter from A, and Bn the number of allowed n-strings
ending with a letter from B. Then
Mn = An + Bn , An+1 = aBn , and Bn+1 = b(An + Bn ),
which yields
Bn+1 = bBn + abBn−1 .
The last recursion is solved by
Bn = c+ λ+n + c− λ−n ,
where λ± are the eigenvalues of the matrix
 
0 ab
,
1 b
i.e.

b ± b2 + 4ab
λ± = ,
2
and c± are constants, c+ > 0. Hence,
   
+ λ+ + c− λ− + c+ λ+n + c− λ−n
n−1 n−1
Mn = a c
 
λ n−1
λ n
1
= λ+n c− a −n + −n + c+ a +1 ,
λ+ λ+ λ+
1
and log Mn is represented as the sum
n
    
1 λ−n−1 λ−n 1
log λ+ + log c− a n + n + c+ a +1 .
n λ+ λ+ λ+
 
 λ− 
Note that   < 1. Thus, the limiting information rate equals
λ+
1
lim log Mn = log λ+ .
n→∞ n
1.3 Shannon’s first coding theorem. The entropy rate of a Markov source 57

The answers are different since the conditional equidistribution results in a strong
dependence between subsequent letters: they do not form a DTMC.

Worked Example 1.3.20 Let {U j : j = 1, 2, . . .} be an irreducible and aperiodic


DTMC with a finite state space. Given n ≥ 1 and α ∈ (0, 1), order the strings u(n)
(n) (n)
according to their probabilities P(U(n) = u1 ) ≥ P(U(n) = u2 ) ≥ · · · and select
them in this order until the probability of the remaining set becomes ≤ 1 − α. Let
1
Mn (α ) denote the number of the selected strings. Prove that lim log Mn (α ) = H ,
n→∞ n
the information rate of the source,

(a) in the case where the rows of the transition probability matrix P are all equal
(i.e. {U j } is a Bernoulli sequence),
(b) in the case where the rows of P are permutations of each other, and in a general
case. Comment on the significance of this result for coding theory.

Solution (a) Let P stand for the probability distribution of the IID sequence (Un )
m
and set H = − ∑ p j log p j (the binary entropy of the source). Fix ε > 0 and parti-
j=1
tion the set I ×n of all n-strings into three disjoint subsets:

K+ = {u(n) : p(u(n) ) ≥ 2−n(H−ε ) }, K− = {u(n) : p(u(n) ) ≤ 2−n(H+ε ) },

and
K = {u(n) : 2−n(H+ε ) < p(u(n) ) < 2−n(H−ε ) }.
1
By the law of large numbers (or asymptotic equipartition property), − log P(U(n) )
n
converges to H(= h), i.e. lim P(K+ ∪ K− ) = 0, and lim P(K) = 1. Thus, to obtain
n→∞ n→∞
probability ≥ α , for n large enough, you (i) cannot restrict yourself to K+ and have
to borrow strings from K , (ii) don’t need strings from K− , i.e. will have the last
selected string from K . Denote by Mn (α ) the set of selected strings, and Mn (α )
by Mn . You have two two-side bounds
 
α ≤ P Mn (α ) ≤ α + 2−n(H−ε )

and
 
2−n(H+ε ) Mn (α ) ≤ P Mn (α ) ≤ P(K+ ) + 2−n(H−ε ) Mn (α ).
 
Excluding P Mn (α ) yields

2−n(H+ε ) Mn (α ) ≤ α + 2−n(H−ε ) and 2−n(H−ε ) Mn (α ) ≥ α − P(K+ ).


58 Essentials of Information Theory

These inequalities imply, respectively,


1 1
lim sup log Mn (α ) ≤ H + ε and lim inf log Mn (α ) ≥ H − ε .
n→∞ n n→∞ n
As ε is arbitrary, the limit is H.

(b) The argument may be repeated without any change in the case of permutations
because the ordered probabilities form the same set as in case (a) , and in a general
case by applying the law of large numbers to (1/n)ξn ; cf. (1.3.3b) and (1.3.19).
Finally, the significance for coding theory: if we are prepared to deal with the error-
probability ≤ α , we do not need to encode all mn string u(n) but only ∼ 2nH most
frequent ones. As H ≤ log m (and in many cases log m), it yields a significant
economy in storage space (data-compression).
Worked Example 1.3.21 A binary source emits digits 0 or 1 according to the
rule
P(Xn = k|Xn−1 = j, Xn−2 = i) = qr ,
where k, j, i and r take values 0 or 1, r = k − j −i mod 2, and q0 +q1 = 1. Determine
the information rate of the source.

Also derive the information rate of a binary Bernoulli source, emitting digits
0 and 1 with probabilities q0 and q1 . Explain the relationship between these two
results.

Solution The source is a DTMC of the second order. That is, the pairs (Xn , Xn+1 )
form a four-state DTMC, with
P(00, 00) = q0 , P(00, 01) = q1 , P(01, 10) = q0 , P(01, 11) = q1 ,
P(10, 00) = q0 , P(10, 01) = q1 , P(11, 10) = q0 , P(11, 11) = q1 ;
the remaining eight entries of the transition probability matrix vanish. This gives
H = −q0 log q0 − q1 log q1 .
For a Bernoulli source the answer is the same.
Worked Example 1.3.22 Find an entropy rate of a DTMC associated with a
random walk on the 3 × 3 chessboard:
⎛ ⎞
1 2 3
⎝4 5 6⎠ . (1.3.45)
7 8 9
Find the entropy rate for a rook, bishop (both kinds), queen and king.
1.4 Channels of information transmission 59

Solution We consider the king’s DTMC only; other cases are similar. The transition
probability matrix is
⎛ ⎞
0 1/3 0 1/3 1/3 0 0 0 0
⎜1/5 0 1/5 1/5 1/5 1/5 0 0 0 ⎟
⎜ ⎟
⎜ 0 1/3 0 0 ⎟
⎜ 0 1/3 1/3 0 0 ⎟
⎜1/5 1/5 0 0 1/5 0 1/5 1/5 0 ⎟
⎜ ⎟
⎜ ⎟
⎜ 1/8 1/8 1/8 1/8 0 1/8 1/8 1/8 1/8 ⎟
⎜ ⎟
⎜ 0 1/5 1/5 0 1/5 0 0 1/5 1/5⎟
⎜ ⎟
⎜ 0 0 0 1/3 1/3 0 0 0 1/3 ⎟
⎜ ⎟
⎝ 0 0 0 1/5 1/5 1/5 1/5 0 1/5⎠
0 0 0 0 1/3 1/3 0 1/3 0
By symmetry the invariant distribution is π1 = π3 = π9 = π7 = λ , π4 = π2 = π6 =
π8 = μ , π5 = ν , and by the DBEs
λ /3 = μ /5, λ /3 = ν /8, 4λ + 4μ + ν = 1
implies λ = 3
40 , μ = 18 , ν = 15 . Now
1 1 1 1 1 1 1 3
H = −4λ log − 4μ log − ν log = log 15 + .
3 3 5 5 8 8 10 40

1.4 Channels of information transmission. Decoding rules. Shannon’s


second coding theorem
In this section we prove a core statement of Shannon’s theory: the second coding
theorem (SCT), also known as the noisy coding theorem (NCT). Shannon stated
its assertion and gave a sketch of its proof in his papers and books in the 1940s.
His argument was subject to (not entirely unjustified) criticism by professional
mathematicians. It took the mathematical community about a decade to produce
a rigorous and complete proof of the SCT. However, with hindsight, one cannot
stop admiring Shannon’s intuition and his firm grasp of fundamental notions such
as entropy and coding as well their relation to statistics of long random strings.
We point at various aspects of this topic, not avoiding a personal touch palpable in
writings of the main players in this area.
So far, we have considered a source emitting a random text U1 U2 . . ., and an
encoding of a message u(n) by a binary codeword x(N) using a code fn : I ×n → J ×N ,
J = {0, 1}. Now we focus upon the relation between the length of a message n and
the codeword-length N: it is determined by properties of the channel through which
the information is sent. It is important to remember that the code fn is supposed to
be known to the receiver.
60 Essentials of Information Theory

Typically, a channel is subject to ‘noise’ which distorts the messages transmitted:


a message at the output differs in general from the message at the input. Formally,
a channel is characterised by a conditional distribution
 
Pch receive word y(N) |codeword x(N) sent ; (1.4.1)

we again suppose
 that it is knownto both sender and receiver.
 (We use a distinct
symbol Pch · |codeword x sent or, briefly, Pch · |x
(N) (N) , to stress that this prob-
ability distribution is generated by the channel, conditional on the event that code-
word x(N) has been sent.) Speaking below of a channel, we refer to a conditional
probability (1.4.1) (or rather a family of conditional probabilities, depending on
N). Consequently, we use the symbol Y(N) for a random string representing the
output of the channel; given that a word x(N) was sent,

Pch (Y(N) = y(N) |x(N) ) = Pch (y(N) |x(N) ).

An important example is the so-called memoryless binary channels (MBCs)


where
N
Pch y(N) |x(N) = ∏ P(yi |xi ), (1.4.2)
i=1

if y(N) = y1 . . . yN , x(N) = x1 . . . xN . Here, P(y|x), x, y = 0, 1, is a symbol-to-symbol


channel probability (i.e. the conditional probability to have symbol y at the out-
put of the channel given that symbol x has been sent). Clearly, {P(y|x)} is a
2 × 2 stochastic matrix (often called the channel matrix). In particular, if P(1|0) =
P(0|1) = p, the channel is called symmetric (MBSC). The channel matrix then has
the form
 
1− p p
Π=
p 1− p
and p is called the row error-probability (or the symbol-error-probability).

Example 1.4.1 Consider the memoryless channel, where Y = X + Z, and an


additive noise Z takes values 0 and a with probability 1/2; a is a given real number.
The input alphabet is {0, 1} and Z is independent of X.
Properties of this channel depend on the value of a. Indeed, if a = ±1, the chan-
nel is uniquely decodable. In other words, if we have to use the channel for trans-
mitting messages (strings) of length n (there are 2n of them altogether) then any
message can be sent straightaway, and the receiver will be able to recover it. But
if a = ±1, there are errors possible, and to make sure that the receiver can recover
our message we have to encode it, which, typically, results in increasing the length
of the string sent into the channel, from n to N, say.
1.4 Channels of information transmission 61

In other words, strings of length N sent to the channel will be codewords repre-
senting source messages of a shorter length n. The maximal ratio n/N which still
allows the receiver to recover the original message is an important characteristic of
the channel, called the capacity. As we will see, passing from a = ±1 to a = ±1
changes the capacity from 1 (no encoding needed) to 1/2 (where the codeword-
length is twice as long as the length of the source message).

So, we need to introduce a decoding rule fN : J ×N → I ×n such that the overall
probability of error ε (= ε ( fn , fN , P)) defined by
 
ε = ∑ P fN (Y(N) ) = u(n) , u(n) emitted
u(n)
   
= ∑ P U(n) = u(n) Pch fN (Y(N) ) = u(n) | fn (u(n) ) sent (1.4.3)
u(n)

is small. We will try (and under certain conditions succeed) to have the error-
probability (1.4.3) tending to zero as n → ∞.
The idea which is behind the construction is based on the following facts:

(1) For a source with the asymptotic equipartition property the number of dis-
tinct n-strings emitted is 2n(H+o(1)) where H ≤ log m is the information rate of
the source. Therefore, we have to encode not mn = 2n log m messages, but only
2n(H+o(1)) which may be considerably less. That is, the code fn may be defined
on a subset of I ×n only, with the codeword-length N = nH.
−1
(2) We may try even a larger N: N = R nH, where R is a constant with 0 < R <
1. In other words, the increasing length of the codewords used from nH to
−1
R nH will allow us to introduce a redundancy in the code fn , and we may
hope to be able to use this redundancy to diminish the overall error-probability
(1.4.3) (provided that in addition a decoding rule is ‘good’). It is of course
−1
desirable to minimise R , i.e. maximise R: it will give the codes with optimal
parameters. The question of how large R is allowed to be depends of course on
the channel.

It is instrumental to introduce a notational convention. As the codeword-length is


−1
a crucial parameter, we write N instead of R Hn and RN instead of Hn: the num-
ber of distinct strings emitted by the source becomes 2N(R+o(1)) . In future, the index
NR
n∼ will be omitted wherever possible (and replaced by N otherwise). It is con-
H
venient to consider a ‘typical’ set UN of distinct strings emitted by the source, with
 UN = 2N(R+o(1)) . Formally, UN can include strings of different length; it is only
the log-asymptotics of  UN that matter. Accordingly, we will omit the superscript
(n) in the notation u(n) .
62 Essentials of Information Theory

Definition 1.4.2 A value R ∈ (0, 1) is called a reliable transmission rate (for a


given channel) if, given that the source strings take equiprobable values from a set
UN with  UN = 2N(R+o(1)) , there exist an encoding rule fN : UN → XN ⊆ J ×N and
a decoding rule fN : J ×N → UN with the error-probability
1
∑  UN Pch fN (Y(N) ) = u| fN (u) sent (1.4.4)
u∈UN

1
tending to zero as N → ∞. That is, for each sequence UN with lim log  UN = R,
N→∞ N
there exist a sequence of encoding rules fN : UN → XN , XN ⊆ J ×N , and a sequence
of decoding rules fN : J ×N → UN such that
1  
lim ∑ ∑ 1 fN (Y(N) ) = u Pch Y(N) | fN (u) = 0. (1.4.5)
N→∞  UN
u∈UN Y(N)

Definition 1.4.3 The channel capacity is defined as the supremum


C = sup R ∈ (0, 1) : R is a reliable transmission rate . (1.4.6)


Remark 1.4.4 (a) Physically speaking, the channel capacity can be thought of
1
as a limit lim log n(N) where n(N) is the maximal number of strings of length
N→∞ N
N which can be sent through the channel with a vanishing probability of erroneous
decoding.

(b) The reason for the equiprobable distribution on UN is that it yields the worst-
case scenario. See Theorem 1.4.6 below.

(c) If encoding rule fN used is one-to-one (lossless) then it suffices to treat the
decoding rules as maps J ×N → XN rather than J ×N → UN : if we guess correctly
what codeword x(N) has been sent, we simply set u = fN−1 (x(N) ). If, in addition,
the source distribution is equiprobable over U then the error-probability ε can be
written as an average over the set of codewords XN :
1  
ε= ∑ 1 − Pch fN (Y(N) ) = x|x sent .
 X x∈XN
Accordingly, it makes sense to write ε = ε ave and speak about the average proba-
bility of error. Another form is the maximum error-probability
 
ε max = max 1 − Pch fN (Y(N) ) = x|x sent : x ∈ XN ;
obviously, ε ave ≤ ε max . In this section we work with ε ave → 0 leaving the question
of whether ε max → 0. However, in Section 2.2 we reduce the problem of assessing
ε max to that with ε ave , and as a result, the formulas for the channel capacity deduced
in this section will remain valid if ε ave is replaced by ε max .
1.4 Channels of information transmission 63

Remark 1.4.5 (a) By Theorem 1.4.17 below, the channel capacity of an MBC is
given by
C = sup I(Xk : Yk ). (1.4.7)
pXk

Here, I(Xk : Yk ) is the mutual information between a single pair of input and output
letters Xk and Yk (the index k may be omitted), with the joint distribution

P(X = x,Y = y) = pX (x)P(y|x), x, y = 0, 1, (1.4.8)

where pX (x) = P(X = x). The supremum in (1.4.7) is over all possible distributions
pX = (pX (0), pX (1)). A useful formula is I(X : Y ) = h(Y ) − h(Y |X) (see (1.3.12)).
In fact, for the MBSC

h(Y |X) = − ∑ pX (x) ∑ P(y|x) log P(y|x)


x=0,1 y=0,1

− ∑ P(y|x) log P(y|x) = h2 (p, 1 − p) = η2 (p); (1.4.9)


y=0,1

the lower index 2 will be omitted for brevity.


Hence h(Y |X) = η (p) does not depend on input distribution pX , and for the
MBSC
C = sup h(Y ) − η (p). (1.4.10)
pX

But sup h(Y ) is equal to log 2 = 1: it is attained at pX (0) = pX (1) = 1/2, and
pX
pY (0) = pY (1) = 1/2(p + 1 − p) = 1/2. Therefore, for an MBSC, with the row
error-probability p,
C = 1 − η (p). (1.4.11)

(b) Suppose we have a source U1 U2 . . . with the asymptotic equipartition property


and information rate H. To send a text emitted by the source through a channel
of capacity C we need to encode messages of length n by codewords of length
n(H + ε )
in order to have the overall error-probability tending to zero as n →
C
∞. The value ε > 0 may be chosen arbitrarily small. Hence, if H/C < 1, a text
can be encoded with a higher speed than it is produced: in this case the channel
is used reliably for transmitting information from the source. On the contrary, if
H/C > 1, the text will be produced with a higher speed than we can encode it and
send reliably through a channel. In this case reliable transmission is impossible.
For a Bernoulli or stationary Markov source and an MBSC, condition H/C < 1 is
equivalent to h(U) + η (p) < 1 or h(Un+1 |Un ) + η (p) < 1 respectively.
64 Essentials of Information Theory

In fact, Shannon’s ideas have not been easily accepted by leading contemporary
mathematicians. It would be interesting to see the opinions of the leading scientists
who could be considered as ‘creators’ of information theory.

Theorem 1.4.6 Fix a channel (i.e. conditional probabilities Pch in (1.4.1)) and a
set U of the source strings and denote by ε (P) the overall error-probability (1.4.3)
for U(n) having a probability distribution P over U , minimised over all encoding
and decoding rules. Then
ε (P) ≤ ε (P0 ), (1.4.12)

where P0 is the equidistribution over U .

Proof Fix encoding and decoding rules f and f, and let a string u ∈ U have
probability P(u). Define the error-probability when u is emitted as

β (u) := ∑ Pch (y| f (u)).


y: f(y) =u

The overall error-probability equals

ε (= ε (P, f , f)) = ∑ P(u)β (u).


u∈U

If we permute the allocation of codewords (i.e. encode u by f (u ) where u = λ (u)


and λ is a permutation of degree  U ), we get the overall error-probability ε (λ )
 −1
= ∑ P(u)β (λ (u)). In the case P(u) =  U (equidistribution), ε (λ ) does not
u∈U
depend on λ and is given by

1
ε=
U ∑ β (u) = ε (P0 , f , f).
u∈U

It is claimed that for each probability distribution {P(u), u ∈ U }, there exists λ


such that ε (λ ) ≤ ε̄ . In fact, take a random permutation, Λ, equidistributed among
all U ! permutations of degree U . Then

min ε (λ ) ≤ Eε (Λ) = E ∑ P(u)β (Λu)


λ u∈U
1
= ∑ P(u)Eβ (Λu) = ∑ P(u) ∑ β (u) = ε .
u∈U u∈U  U u∈U

Hence, given any f and f, we can find new encoding and decoding rules with
overall error-probability ≤ ε (P0 , f , f). Minimising over f and f leads to (1.4.12).
1.4 Channels of information transmission 65

Worked Example 1.4.7 Let the random variables X and Y , with values from
finite ‘alphabets’ I and J , represent, respectively, the input and output of a trans-
mission channel, with the conditional probability P(x | y) = P(X = x | Y = y). Let
h(P(· | y)) denote the entropy of the conditional distribution P(· | y), y ∈ J :
h(P(· | y)) = − ∑ P(X | y) log P(x | y).
x

Let h(X | Y ) denote the conditional entropy of X given Y Define the ideal observer
decoding rule as a map f IO : J → I such that P( f (y) | y) = maxx∈I P(x | y) for all
y ∈ J . Show that
(a) under this rule the error-probability
πerIO (y) = ∑ 1(x = f (y))P(x | y)
x∈I

1
satisfies πerIO (y) ≤ h(P(· | y));
2
1
(b) the expected value of the error-probability obeys EπerIO (Y ) ≤ h(X | Y ).
2
Solution Indeed, (a) follows from (iii) in Worked Example 1.2.7, as
 
πerr
IO
= 1 − P f (y) | y = 1 − Pmax ( · |y),
1
which is less than or equal to h(P( · |y)). Finally, (b) follows from (a) by taking
2
expectations, as h(X|Y ) = Eh(P( · |Y )).
As was noted before, a general decoding rule (or a decoder) is a map fN : J ×N →
UN ; in the case of a lossless encoding rule fN , fN is a map J ×N → XN . Here X
is a set of codewords. Sometimes it is convenient to identify the decoding rule by
fixing, for each codeword x(N) , a set A(x(N) ) ⊂ J ×N , so that A(x1 ) and A(x2 ) are
(N) (N)

disjoint for x1 = x2 , and the union ∪x(N) ∈XN A(x(N) ) gives the whole J ×N . Given
(N) (N)

that y(N) ∈ A(x(N) ), we decode it as fN (y(N) ) = x(N) .

Although in the definition of the channel capacity we assume that the source
messages are equidistributed (as was mentioned, it gives the worst case in the
sense of Theorem 1.4.6), in reality of course the source does not always follow
this assumption. To this end, we need to distinguish between two situations: (i) the
receiver knows the probabilities
p(u) = P(U = u) (1.4.13)
of the source strings (and hence the probability distribution pN (x(N) ) of the code-
words x(N) ∈ XN ), and (ii) he does not know pN (x(N) ). Two natural decoding rules
are, respectively,
66 Essentials of Information Theory

(i) the ideal observer (IO) rule decodes a received word y(N) by a codeword x(N)
that maximises the posterior probability
p (x(N) )P (y(N) |x(N) )
N ch
P x(N) sent |y(N) received = , (1.4.14)
pY(N) (y(N) )
where
pY(N) (y(N) ) = ∑ pN (x(N) )Pch (y(N) |x(N) ),
x(N) ∈XN
and

(ii) the maximum likelihood (ML) rule decodes a received word y(N) by a codeword
(N)
x that maximises the prior probability
Pch (y(N) |x(N) ). (1.4.15)
Theorem 1.4.8 Suppose that an encoding rule f is defined for all messages that
occur with positive probability and is one-to-one. Then:
(a) For any such encoding rule, the IO decoder minimises the overall error-
probability among all decoders.
(b) If the source message U is equiprobable on a set U , then for any encoding rule
f : U → XN as above, the random codeword X(N) = f (U) is equiprobable on
XN , and the IO and ML decoders coincide.
Proof Again, for simplicity let us omit the upper index (N).
(a) Note that, given a received word y, the IO obviously maximises the joint
probability p(x)Pch (y|x) (the denominator in (1.4.14) is fixed when word y is
fixed). If we use an encoding rule f and decoding rule f, the overall error-
probability (see (1.4.3)) is

∑ P(U = u)Pch f(y) = u| f (u) sent
u  
= ∑ p(x) ∑ 1 f(y) = x Pch (y|x)
x y
= ∑ ∑ 1 x = f(y) p(x)Pch (y|x)
y x
= ∑ ∑ p(x)Pch (y|x) − ∑ p f(y) Pch y| f(y)
y x y

= 1 − ∑ p f (y) Pch y| f(y) .
y

It remains to note that each term in the sum ∑ p f(y) Pch y| f(y) is maximised
y
when f coincides with the IO rule. Hence, the whole sum is maximised, and the
overall error-probability minimised.

(b) The first statement is obvious, as, indeed is the second.


1.4 Channels of information transmission 67

Assuming in the definition of the channel capacity that the source messages are
equidistributed, it is natural to explore further the ML decoder. While using the ML
decoder, an error can occur because either the decoder chooses a wrong codeword
x or an encoding rule f used is not one-to-one. The probability of this is assessed
in Theorem 1.4.8. For further simplification, we write P instead of Pch ; symbol P
is used mainly for the joint input/output distribution.

Lemma 1.4.9 If the source messages are equidistributed over a set U then, while
using the ML decoder and an encoding rule f , the overall error-probability satisfies
1    
ε( f ) ≤
U ∑ ∑ P P Y| f (u ) ≥ P (Y| f (u)) |U = u . (1.4.16)
u∈U u ∈U : u =u

Proof If the source emits u and the ML decoder is used, we get


   
(a) an error when P Y | f (u ) > P Y | f (u) for some u = u,
   
(b) possibly an error when P Y| f (u ) = P Y | f (u) for some u = u (this in-
cludes the case when f (u) = f (u )), and finally
   
(c) no error when P Y | f (u ) < P Y | f (u) for any u = u.

Thus, the probability is bounded as follows:


 
P error | U = u
   
≤ P P Y | f (u ) ≥ P (Y | f (u)) for some u = u | U = u
     
≤ ∑ 1 u = u P P Y | f (u ) ≥ P (Y | f (u)) | U = u .
u ∈U

1
Multiplying by and summing up over u yields the result.
U
Remark 1.4.10 Bound (1.4.16) of course holds for any probability distribution
1
p(u) = P(U = u), provided is replaced by p(u).
U
As was already noted, a random coding is a useful tool alongside with deter-
×N
ministic encoding rules. A deterministic encoding rule ( is a map f : U)→ J ; if
 U = r then f is given as a collection of codewords f (u1 ), . . . , f (ur ) or, equiv-
alently, as a concatenated ‘megastring’ (or codebook)
 ×r
f (u1 ) . . . f (ur ) ∈ J ×N = {0, 1}×Nr .

Here, u1 , . . . , ur are the source strings (not letters!) constituting set U . If f is loss-
less then f (ui ) = r f (u j ) whenever i = j. A random encoding
 ×N r rule is a random ele-
ment F of J ×N , with probabilities P(F = f ), f ∈ J . Equivalently, F may
68 Essentials of Information Theory

be regarded as a collection of random codewords F(ui ), i = 1, . . . , r, or, equiva-


lently, as a random codebook

F(u1 )F(u2 ) . . . F(ur ) ∈ {0, 1}Nr .

A typical example is where codewords F(u1 ), F(u2 ), . . . , F(ur ) are independent,


and (random) symbols Wi1 , . . . ,WiN constituting word F(ui ) are independent too.
The reasons for considering random encoding rules are:
(1) the existence of a ‘good’ deterministic code frequently follows from the exis-
tence of a good random code;
(2) the calculations for random codes are usually simpler than for optimal deter-
ministic codes, because a discrete optimisation is replaced by an optimisation over
probability distributions.

A drawback of random coding is that it is not always one-to-one (F(u) may


coincide with F(u ) for u = u ). However, this occurs, for large N, with negligible
probability.
The idea of random coding goes back to Shannon. As often happened in the
history of mathematics, a brilliant idea solves one problem but opens a Pandora
box of other questions. In this case, a particular problem that emerged from the
aftermath of random coding was the problem of finding ‘good’ non-random codes.
A major part of modern information and coding theory revolves around this prob-
lem, and so far no general satisfactory solution has been found. However, a number
of remarkable partial results have been achieved, some of which are discussed in
this book.
Continuing with random coding, write the expected error-probability for a ran-
dom encoding rule F:

E := E ε (F) = ∑ ε ( f )P(F = f ). (1.4.17)


f

Theorem 1.4.11

(i) There exists a deterministic encoding rule f with ε ( f ) ≤ E .


 
E
(ii) P ε (F) < ≥ ρ for any ρ ∈ (0, 1).
1−ρ

Proof
 Part (i) is obvious.
 For (ii), use the Chebyshev inequality (see PSE I, p. 75):
E 1−ρ
P ε (F) ≥ ≤ E = 1 − ρ.
1−ρ E
1.4 Channels of information transmission 69

Definition 1.4.12 For random words X(N) = X1 . . . XN and Y(N) = Y1 . . . YN define



1 (N) (N)
CN := sup I X :Y , over input
N

probability distributions PX(N) . (1.4.18)


Recall that I X(N) : Y(N) is the mutual entropy given by

h X(N) − h X(N) |Y(N) = h Y(N) − h Y(N) |X(N) .

Remark 1.4.13 A simple heuristic argument (which will be made rigorous in


Section 2.2) shows that the capacity of the channel cannot exceed the mutual
information between its input and output. Indeed, for each typical input N-
sequence, there are
(N) |X(N) )
approximately 2h(Y possible Y(N) sequences,

all of them equally likely. We will not be able to detect which sequence X was sent
unless no two X(N) sequences produce the same Y(N) output sequence. The total
(N)
number of typical Y(N) sequences is 2h(Y ) . This set has to be divided into subsets
of size 2h(Y |X ) corresponding to the different input X(N) sequences. The total
(N) (N)

number of disjoint sets is


(N) )−h(Y(N) |X(N) ) (N) :Y(N)
≤ 2h(Y = 2I (X ).

Hence, the total number of distinguishable signals of the length N could not be
(N) (N)
bigger than 2I (X :Y ) . Putting the same argument slightly differently, the number
(N) (N) (N)
of typical sequences X(N) is 2Nh(X ) . However, there are only 2Nh(X ,Y ) jointly
typical sequences (X(N) , Y(N) ). So, the probability that any randomly chosen pair
is jointly typical is about 2−I (X :Y ) . So, the number of distinguished signals is
(N) (N)

bounded by 2h(X )+h(Y )−h(X |Y ) .


(N) (N) (N) (N)

Theorem 1.4.14 (Shannon’s SCT: converse part) The channel capacity C obeys

C ≤ lim sup CN . (1.4.19)


N→∞

Proof Consider a code f = fN : UN → XN ⊆ J ×N , where  UN = 2N(R+o(1)) , R ∈


(0, 1). We want to prove that for any decoding rule
CN + o(1)
ε( f ) ≥ 1 − . (1.4.20)
R + o(1)
70 Essentials of Information Theory

The assertion of the theorem immediately follows from (1.4.20) and the definition
of the channel capacity because
1
lim inf ε ( f ) ≥ 1 − lim sup CN
N→∞ R N→∞

which is > 0 when R > lim supN→∞ CN .


Let us check (1.4.20) for one-to-one f (otherwise ε ( f ) is even bigger). Then a
codeword X(N) = f (U) is equidistributed when string U is, and, if a decoding rule
is f : J ×N → X , we have, for N large enough,

(N)  (N)
NCN ≥ I X : Y
(N) (N)
≥ I X : f (Y ) (cf. Theorem 1.2.6)

= h X(N) − h X(N) | f(Y(N) )

= log r − h X(N) | f(Y(N) ) (by equidistribution)
≥ log r − ε ( f ) log(r − 1) − 1.

Here and below r = U . The last bound follows by the generalised Fano inequality
(1.2.25). Indeed, observe that the (random) codeword X(N) = f (U) takes r values
(N) (N) 
x1 , . . . , xr from the codeword set X (= XN , and the error-probability is
r
ε ( f ) = ∑ P(X(N) = xi , f(Y(N) ) = xi ).
(N) (N)

i=1

So, (1.2.25) implies



(N)  (N)
h X | f (Y ) ≤ h2 (ε ) + ε log(r − 1) ≤ 1 + ε ( f ) log(r − 1),

and we obtain NCN ≥ log r − ε ( f ) log(r − 1) − 1. Finally, r = 2N(R+o(1)) and



NCN ≥ N(R + o(1)) − ε ( f ) log 2N(R+o(1)) − 1 ,

i.e.
N(R + o(1)) − NCN C + o(1)
ε( f ) ≥ = 1− N .
log 2N(R+o(1)) − 1 R + o(1)

Let p(X(N) , Y(N) ) be the random variable that assigns, to random words X(N) and
Y(N) , the joint probability of having these words at the input and output of a chan-
nel, respectively. Similarly, pX (X(N) ) and pY (Y(N) ) denote the random variables
that give the marginal probabilities of words X(N) and Y(N) , respectively.
1.4 Channels of information transmission 71

Theorem 1.4.15 (Shannon’s SCT: direct part) Suppose we can find a constant
c ∈ (0, 1) such that for any R ∈ (0, c) and N ≥ 1 there exists a random coding
F(u1 ), . . . , F(ur ), where r = 2N(R+o(1)) , with IID codewords F(ui ) ∈ J ×N , such
that the (random) input/output mutual information

1 p(X(N) , Y(N) )
ΘN := log (1.4.21)
N pX (X(N) )pY (Y(N) )
converges in probability to c as N → ∞. Then the channel capacity C ≥ c.

The proof of Theorem 1.4.15 is given after Worked Examples 1.4.24 and 1.4.25
(the latter is technically rather involved). To start with, we explain the strategy of
the proof outline by Shannon in his original 1948 paper. (It took about 10 years
before this idea was transformed into a formal argument.)
First, one generates a random codebook X consisting of r = 2NR words,
X(N) (1), . . . , X(N) (r). The codewords X(N) (1), . . . , X(N) (r) are assumed to be
known to both the sender and the receiver, as well as the channel transition
matrix Pch (y|x). Next, the message is chosen according to a uniform distribution,
and the corresponding codeword is sent over a channel. The receiver uses the max-
imum likelihood (ML) decoding, i.e. choose the a posteriori most likely message.
But this procedure is difficult to analyse. Instead, a suboptimal but straightforward
typical set decoding is used. The receiver declares that the message w is sent if there
is only one input such that the codeword for w and the output of the channel are
jointly typical. If no such word exists or it is non-unique then an error is declared.
Surprisingly, this procedure is asymptotically optimal. Finally, the existence of a
good random codebook implies the existence of a good non-random coding.

In other words, channel capacity C is no less than the supremum of the values
c for which the convergence in probability in (1.4.21) holds for an appropriate
random coding.

Corollary 1.4.16 With c as in the assumptions of Theorem 1.4.15, we have that

sup c ≤ C ≤ lim sup CN . (1.4.22)


N→∞

So, if the LHS and RHS sides of (1.4.22) coincide, then their common value gives
the channel capacity.

Next, we use Shannon’s SCT for calculating the capacity of an MBC. Recall (cf.
(1.4.2)), for an MBC,
N
P y(N) |x(N) = ∏ P(yi |xi ). (1.4.23)
i=1
72 Essentials of Information Theory

Theorem 1.4.17 For an MBC,


N
I X(N) : Y(N) ≤ ∑ I(X j : Y j ), (1.4.24)
j=1

with equality if the input symbols X1 , . . . , XN are independent.


N  
Proof Since P y(N) |x(N) = ∏ P(y j |x j ), the conditional entropy h Y(N) |X(N)
j=1
N
equals the sum ∑ h(Y j |X j ). Then the mutual information
j

   
I X(N) : Y(N) = h Y(N) − h Y(N) |X(N)
 
= h Y(N) − ∑ h(Y j |X j )
1≤ j≤N
≤ ∑ h(Y j ) − h(Y j |X j ) = ∑ I(X j : Y j ).
j j

The equality holds iff Y1 , . . . ,YN are independent. But Y1 , . . . ,YN are independent if
X1 , . . . , XN are.

Remark 1.4.18 Compare with inequalities (1.4.24) and (1.2.27). Note the oppo-
site inequalities in the bounds.

Theorem 1.4.19 The capacity of an MBC is

C = sup I(X1 : Y1 ). (1.4.25)


pX1

The supremum is over all possible distributions pX1 of the symbol X1 .

Proof By the definition of CN , NCN does not exceed

sup I(X(N) : Y(N) ) ≤ ∑ sup I(X j : Y j ) = N sup I(X1 : Y1 ).


pX j pX j pX1

So, by Shannon’s SCT (converse part),

C ≤ lim sup CN ≤ sup I(X1 : Y1 ).


N→∞ pX1

On the other hand, take a random coding F, with codewords F(ul ) = Vl1 . . . VlN ,
1 ≤ l ≤ r, containing IID symbols Vl j that are distributed according to p∗ , a prob-
ability distribution that maximises I(X1 : Y1 ). [Such random coding is defined for
1.4 Channels of information transmission 73

any r, i.e. for any R (even R > 1!).] For this random coding, the (random) mutual
entropy ΘN equals
 
1 p X(N) , Y(N)
log  (N)   (N) 
N pX X pY Y
N
1 p(X j ,Y j ) 1 N
= ∑ log ∗ = ∑ ζ j,
N j=1 p (X j )pY (Y j ) N j=1

p(X j ,Y j )
where ζ j := log .
p∗ (X
j )pY (Y j )
The random variables ζ j are IID, and
p(X j ,Y j )
Eζ j = E log = Ip∗ (X1 : Y1 ).
p∗ (X
j )pY (Y j )

By the law of large numbers for IID random variables (see Theorem 1.3.5), for the
random coding as suggested,
P
ΘN −→ Ip∗ (X1 : Y1 ) = sup I(X1 : Y1 ).
pX1

By Shannon’s SCT (direct part),


C ≥ sup I(X1 : Y1 ).
pX1
Thus, C = sup pX I(X1 : Y1 ).
1

Remark 1.4.20 (a) The pair (X1 ,Y1 ) may be replaced by any (X j ,Y j ), j ≥ 1.
(b) Recall that the joint
 distribution
 of X1 and Y1 is defined by P(X1 = x,Y1 = y) =
pX1 (x)P(y|x) where P(y|x) is the channel matrix.
(c) Although, as was noted, the construction holds for each r (that is, for each
R ≥ 0) only R ≤ C are reliable.
Example 1.4.21 A helpful statistician preprocesses the output of a memory-
less channel (MBC) with transition probabilities P(y|x) and channel capacity C =
max pX I(X : Y ) by forming Y = g(Y ): he claims that this will strictly improve the
capacity. Is he right? Surely not, as preprocessing (or doctoring) does not increase
the capacity. Indeed,
I(X : Y ) = h(X) − h(X|Y ) ≥ h(X) − h(X|g(Y )) = I(X : g(Y )). (1.4.26)
Under what condition does he not strictly decrease the capacity? Equality in
(1.4.26) holds iff, under the distribution pX that maximises I(X : Y ), the ran-
dom variables X and Y are conditionally independent given g(Y ). [For example,
g(y1 ) = g(y2 ) iff for any x, PX|Y (x|y1 ) = PX|Y (x|y2 ); that is, g glues together only
those values of y for which the conditional probability PX|Y ( · |y) is the same.] For
an MBC, equality holds iff g is one-to-one, or p = P(1|0) = P(0|1) = 1/2.
74 Essentials of Information Theory

Formula (1.4.25) admits a further simplification when the channel is symmetric


(MBSC), i.e. P(1|0) = P(0|1) = p. More precisely, in accordance with Remark
1.4.5(a) (see (1.4.11)) we obtain
Theorem 1.4.22 For an MBSC, with the row error-probability p,
C = 1 − h(p, 1 − p) = 1 − η (p) (1.4.27)
(see (1.4.11)). The channel capacity is realised by a random coding with the IID
symbols Vl j taking values 0 and 1 with probability 1/2.
Worked Example 1.4.23
(a) Consider a memoryless channel with two input symbols A and B, and three
output symbols, A, B, ∗. Suppose each input symbol is left intact with probabil-
ity 1/2, and transformed into a ∗ with probability 1/2. Write down the channel
matrix and calculate the capacity.
(b) Now calculate the new capacity of the channel if the output is further processed
by someone who cannot distinguish A and ∗, so that the matrix becomes
 
1 0
.
1/2 1/2

Solution (a) The channel has the matrix


 
1/2 0 1/2
0 1/2 1/2
and is symmetric (the rows are permutations of each other). So, h(Y |X = x) =
1 1
−2 × log = 1 does not depend on the value of x = A, B. Then h(Y |X) = 1, and
2 2
I(X : Y ) = h(Y ) − 1. (1.4.28)
If P(X = A) = α then Y has the output distribution
 
1 1 1
α , (1 − α ),
2 2 2
and h(Y |X) is maximised at α = 1/2. Then the capacity equals
1
h(1/4, 1/4, 1/2) − 1 = . (1.4.29)
2
(b) Here, the channel is not symmetric. If P(X = A) = α then the conditional
entropy is decomposed as
h(Y |X) = α h(Y |X = A) + (1 − α )h(Y |X = B)
= α × 0 + (1 − α ) × 1 = (1 − α ).
1.4 Channels of information transmission 75

Then
1+α 1+α 1−α 1−α
h(Y ) = − log − log
2 2 2 2
and
1+α 1+α 1−α 1−α
I(X : Y ) = − log − log −1+α
2 2 2 2
which is maximised at α = 3/5, with the capacity given by
 
log 5 − 2 = 0.321928.

Our next goal is to prove the direct part of Shannon’s SCT (Theorem 1.4.15). As
was demonstrated earlier, the proof is based on two consecutive Worked Examples
below.
Worked Example 1.4.24 Let F be a random coding, independent of the source
string U, such that the codewords F(u1 ), . . . , F(ur ) are IID, with a probability dis-
tribution pF :
pF (v) = P(F(u) = v), v (= v(N) ) ∈ J ×N .
Here, u j , j = 1, . . . , r, are source strings, and r = 2N(R+o(1)) . Define random code-
words V1 , . . . , Vr−1 by
if U = u j then Vi := F(ui ) for i < j (if any),
and Vi := F(ui+1 ) for i ≥ j (if any), (1.4.30)
1 ≤ j ≤ r, 1 ≤ i ≤ r − 1.
Then U (the message string), X = F(U) (the random codeword) and V1 , . . . , Vr−1
are independent words, and each of X, V1 , . . . , Vr−1 has distribution pF .

Solution This is straightforward and follows from the formula for the joint proba-
bility,
P(U = u j , X = x, V1 = v1 , . . . , Vr−1 = vr−1 )
= P(U = u j ) pF (x) pF (v1 ) . . . pF (vr−1 ). (1.4.31)

Worked Example 1.4.25 Check that for the random coding as in Worked Ex-
ample 1.4.24, for any κ > 0,
E = Eε (F) ≤ P(ΘN ≤ κ ) + r2−N κ . (1.4.32)
Here, the random variable ΘN is defined in (1.4.21), with EΘN =
1 (N) (N)
I X :Y .
N
76 Essentials of Information Theory

Solution For given words x(= x(N) ) and y(= y(N) ) ∈ J ×N , denote
* +
Sy (x) := x ∈ J ×N : P(y | x ) ≥ P(y | x) . (1.4.33)
That is, Sy (x) includes all words the ML decoder may produce in the situation
where x was sent and y received. Set, for a given non-random encoding rule f
and a source string u, δ ( f , u, y) = 1 if f (u ) ∈ Sy ( f (u)) for some u = u, and
δ ( f , u, y) = 0 otherwise. Clearly, δ ( f , u, y) equals

1 − ∏ 1 f (u ) ∈ Sy ( f (u))
u : u =u  
= 1 − ∏ 1 − 1 f (u ) ∈ Sy ( f (u)) .
u :u =u

It is plain that, for all non-random encoding f , ε ( f ) ≤ Eδ ( f , U, Y), and for all
random encoding F, E = Eε (F) ≤ Eδ (F, U, Y). Furthermore, for the random
encoding as in Worked Example 1.4.24, the expected value Eδ (F, U, Y) does not
exceed
 
r−1 
E 1 − ∏ 1 − 1 Vi ∈ SY (X) = ∑ pX (x) ∑ P(y|x)

i=1 x y 
r−1  
× E 1 − ∏ 1 − 1 Vi ∈ SY (X) |X = x, Y = y ,
i=1

which, owing to independence, equals


 
r−1
∑ pX (x) ∑ P(y|x) 1 − ∏ E 1 − 1{Vi ∈ Sy (x)} .
x y i=1

Furthermore, due to the IID property (as explained in Worked Example 1.4.24),
r−1
∏ E 1 − 1{V ∈ S (x)} = (1 − Qy (x)) ,
i y
r−1

i=1

where
 
Qy (x) := ∑ 1 x ∈ Sy (x) pX (x ).
x

Hence, the expected error-probability E ≤ 1 − E (1 − QY (X))r−1 .


Denote by T = T(κ ) the set of pairs of words x, y for which
1 p(x, y)
ΘN = log >κ
N pX (x)pY (y)
and use the identity
r−2
1 − (1 − Qy (x))r−1 = ∑ (1 − Qy (x)) j Qy (x). (1.4.34)
j=0
1.4 Channels of information transmission 77

Next observe that


1 − (1 − Qy (x))r−1 ≤ 1, when (x, y) ∈ T. (1.4.35)
Owing to the fact that when (x, y) ∈ T,
r−1
1 − (1 − Qy (x))r = ∑ (1 − Qy (x)) j Qy (x) ≤ (r − 1)Qy (x),
j=1

this yields
 
E ≤ P (X,Y ) ∈ T + (r − 1) ∑ pX (x)P(y|x)Qy (x). (1.4.36)
(x,y)∈T

Now observe that


   
P (X,Y ) ∈ T = P ΘN ≤ κ . (1.4.37)
Finally, for (x, y) ∈ T and x ∈ Sy (x),
P(y|x ) ≥ P(y|x) ≥ pY (y)2N κ .
pX (x )  
Multiplying by gives P X = x |Y = y ≥ pX (x )2N κ . Then summing over
pY (y)
x ∈ Sy (x) gives 1 ≥ P (SY (x)|Y = y) ≥ Qy (x)2N κ , or
Qy (x) ≤ 2−N κ . (1.4.38)
Substituting (1.4.37) and (1.4.38) into (1.4.36) yields (1.4.32).
Proof of Theorem 1.4.15 The proof of Theorem 1.4.15 can now be easily com-
pleted. Take R = c − 2ε and κ = c − ε . Then, as r = 2N(R+o(1)) , we have that E
does not exceed
P(ΘN ≤ c − ε ) + 2N(c − 2ε − c + ε + o(1)) = P(ΘN ≤ c − ε ) + 2−N ε .
This quantity tends to zero as N → ∞, because P(ΘN ≤ c − ε ) → 0 owing to the
P
condition ΘN −→ c. Therefore, the random coding F gives the expected error prob-
ability that vanishes as N → ∞.
By Theorem 1.4.11(i), for any N ≥ 1 there exists a deterministic encoding f = fN
such that, for R = c − 2ε , lim ε ( f ) = 0. Hence, R is a reliable transmission rate.
N→∞
This is true for any ε > 0, thus C ≥ c.
The form of the argument used in the above proof was proposed by P. Whittle
(who used it in his lectures at the University of Cambridge) and appeared in [52],
pp. 114–117. We thank C. Goldie for this information. An alternative approach is
based on the concept of joint typicality; this approach is used in Section 2.2 where
we discuss channels with continuously distributed noise.
78 Essentials of Information Theory

Theorems 1.4.17 and 1.4.19 may be extended to the case of a memoryless chan-
nel with an arbitrary (finite) output alphabet, Jq = {0, . . . , q − 1}. That is, at the
input of the channel we now have a word Y(N) = Y1 . . . YN where each Y j takes a
(random) value from Jq . The memoryless property means, as before, that
N
Pch y(N) |x(N) = ∏ P(yi | xi ), (1.4.39)
i=1

and the symbol-to-symbol channel probabilities P(y|x) now form a 2 × q stochastic


matrix (the channel matrix). A memoryless channel is called symmetric if the rows
of the channel matrix are permutations of each other and double symmetric if in
addition the columns of the channel matrix are permutations of each other. The
definitions of the reliable transmission rate and the channel capacity are carried
through without change. The capacity of a memoryless binary channel is depicted
in Figure 1.8.
Theorem 1.4.26 The capacity of a memoryless symmetric channel with an out-
put alphabet Jq is
C ≤ log q − h(p0 , . . . , pq−1 ) (1.4.40)
where (p0 , . . . , pq−1 ) is a row of the channel matrix. The equality is realised in the
case of a double-symmetric channel, and the maximising random coding has IID
symbols Vi taking values from Jq with probability 1/q.
Proof The proof is carried out as in the binary case, by using the fact that I(X1 :
Y1 ) = h(Y1 ) − h(Y1 |X1 ) ≤ log q − h(Y1 |X1 ). But in the symmetric case
h(Y1 | X1 ) = − ∑ P(X1 = x)P(y | x) log P(y | x)
x,y

= − ∑ P(X1 = x) ∑ pk log pk = h(p0 , . . . , pq−1 ). (1.4.41)


x k

If, in addition, the columns of the channel matrix are permutations of each other,
then h(Y1 ) attains log q. Indeed, take a random coding as suggested. Then P(Y = y)
q−1 1
= ∑ P(X1 = x)P(y|x) = ∑ P(y|x). The sum ∑ P(y|x) is along a column of the
x=0 q x x
channel matrix, and it does not depend on y. Hence, P(Y = y) does not depend on
y ∈ Iq , which means equidistribution.

Remark 1.4.27 (a) In the random coding F used in Worked Examples 1.4.24 and
1.4.25 and Theorems 1.4.6, 1.4.15 and 1.4.17, the expected error-probability E → 0
with N → ∞. This guarantees not only the existence of a ‘good’ non-random coding
for which the error-probability E vanishes as N → ∞ (see Theorem 1.4.11(i)), but
also that ‘almost’ all codes are asymptotically good. In fact, by Theorem 1.4.11(ii),
1.4 Channels of information transmission 79

C ( p)
1

p
1 1
2

Figure 1.8

√ √ √
with ρ = 1 − E, P ε (F) < E ≥ 1 − E → 1, as N → ∞. However, this does
not help to find a good code: constructing good codes remains a challenging task
in information theory, and we will return to this problem later.

Worked Example 1.4.28 Bits are transmitted along a communication channel.


With probability λ a bit may be inverted and with probability μ it may be rendered
illegible. The fates of successive bits are independent. Determine the optimal cod-
ing for, and the capacity of, the channel.

Solution The channel matrix is 2 × 3:


 
1−λ −μ λ μ
Π= ;
λ 1−λ −μ μ

the rows are permutations of each other, and hence have equal entropies. Therefore,
the conditional entropy h(Y |X) equals

h(1 − λ − μ , λ , μ ) = −(1 − λ − μ ) log(1 − λ − μ ) − λ log λ − μ log μ ,

which does not depend on the distribution of the input symbol X.


Thus, I(X : Y ) is maximised when h(Y ) is. If pY (0) = p and pY (1) = q, then

h(Y ) = −μ log μ − p log p − q log q,


80 Essentials of Information Theory

which is maximised when p = q = (1 − μ )/2 (by pooling), i.e. pX (0) = pX (1) =


1/2. This gives the following expression for the capacity:
1−μ
−(1 − μ ) log + (1 − λ − μ ) log(1 − λ − μ ) + λ log λ
2   
1−λ −μ λ
= (1 − μ ) 1 − h , .
1−μ 1−μ

Worked Example 1.4.29


(a) (Data-processing inequality) Consider two independent channels in series. A
random variable X is sent through channel 1 and received as Y . Then it is sent
through channel 2 and received as Z . Prove that
I(X : Z) ≤ I(X : Y ),
so the further processing of the second channel can only reduce the mutual
information.
The independence of the channels means that given Y , the random variables
X and Z are conditionally independent. Deduce that
h(X, Z|Y ) = h(X|Y ) + h(Z|Y )
and
h(X,Y, Z) + h(Z) = h(X, Z) + h(Y, Z).
Define I(X : Z|Y ) as h(X|Y ) + h(Z|Y ) − h(X, Z|Y ) and show that
I(X : Z|Y ) = I(X : Y ) − I(X : Z).
Does the equality hold in the data processing inequality
I(X : Z) = I(X : Y )?
(b) The input and output of a discrete-time channel are both expressed in an alpha-
bet whose letters are the residue classes of integers mod r, where r is fixed. The
transmitted letter [x] is received as [ j + x] with probability p j , where x and j
are integers and [c] denotes the residue class of c mod r. Calculate the capacity
of the channel.

Solution (a) Given Y , the random variables X and Z are conditionally independent.
Hence,
h(X | Y ) = h(X | Y, Z) ≤ h(X | Z),
and
I(X : Y ) = h(X) − h(X|Y ) ≥ h(X) − h(X | Z) = I(X : Z).
1.4 Channels of information transmission 81

. . . . . . . . .

Figure 1.9

The equality holds iff X and Y are conditionally independent given Z, e.g. if the
second channel is error-free (Y, Z) → Z is one-to-one, or the first channel is fully
noisy, i.e. X and Y are independent.
(b) The rows of the channel matrix are permutations of each other. Hence h(Y |X) =
h(p0 , . . . , pr−1 ) does not depend on pX . The quantity h(Y ) is maximised when
pX (i) = 1/r, which gives

C = log r − h(p0 , . . . , pr−1 ).

Worked Example 1.4.30 Find the error-probability of a cascade of n identical


independent binary symmetric channels (MBSCs), each with the error-probability
0 < p < 1 (see Figure 1.9).
Show that the capacity of the cascade tends to zero as n → ∞.

Solution The channel matrix of a combined n-cascade channel is Πn where


 
1− p p
Π= .
p 1− p
Calculating the eigenvectors/values yields
 
1 1 + (1 − 2p)n 1 − (1 − 2p)n
Π =
n
,
2 1 − (1 − 2p)n 1 + (1 − 2p)n
which gives the error-probability 1/2 (1 − (1 − 2p)n ). If 0 < p < 1, Πn converges
to
 
1/2 1/2
,
1/2 1/2
and the capacity of the channel approaches

1 − h(1/2, 1/2) = 1 − 1 = 0.

If p = 0 or 1, the channel is error-free, and C ≡ 1.


82 Essentials of Information Theory

Worked Example 1.4.31 Consider two independent MBCs, with capacities


C1 ,C2 bits per second. Prove, or provide a counter-example to, each of the fol-
lowing claims about the capacity C of a compound channel formed as stated.
(a) If the channels are in series, with the output from one being fed into the other
with no further coding, then C = min[C1 ,C2 ].
(b) Suppose the channels are used in parallel in the sense that at every second a
symbol (from its input alphabet) is transmitted through channel 1 and the next
symbol through channel 2; each channel thus emits one symbol each second.
Then C = C1 +C2 .
(c) If the channels have the same input alphabet and at each second a symbol is
chosen and sent simultaneously down both channels, then C = max[C1 ,C2 ].
(d) If channel i = 1, 2 has matrix Πi and the compound channel has
 
Π1 0
Π= ,
0 Π2

then C is given by 2C = 2C1 + 2C2 . To what mode of operation does this corre-
spond?

Solution (a)
X Y Z
−→ channel 1 −→ channel 2 −→
As in Worked Example 1.4.29a,
I(X : Z) ≤ I(X : Y ), I(X : Z) ≤ I(Y : Z).
Hence,
C = sup I(X : Z) ≤ sup I(X : Y ) = C1
pX pX

and similarly
C ≤ sup I(Y : Z) = C2 ,
pY

i.e. C ≤ min[C1 ,C2 ]. A strict inequality may occur: take δ ∈ (0, 1/2) and the
matrices
   
1−δ δ 1−δ δ
ch 1 ∼ , ch 2 ∼ ,
δ 1−δ δ 1−δ
and
 
1 (1 − δ )2 + δ 2 2δ (1 − δ )
ch [1 + 2] ∼ .
2 2δ (1 − δ ) (1 − δ )2 + δ 2
1.4 Channels of information transmission 83

Here, 1/2 > 2δ (1 − δ ) > δ ,

C1 = C2 = 1 − h(δ , 1 − δ ),

and
 
C = 1 − h 2δ (1 − δ ), 1 − 2δ (1 − δ ) < Ci

because h(ε , 1 − ε ) strictly increases in ε ∈ [0, 1/2].


(b)
X1 −→ channel 1 −→ Y1

X2 −→ channel 2 −→ Y2

The capacity of the combined channel


 
C = sup I (X1 , X2 ) : (Y1 ,Y2 ) .
p(X1 ,X2 )

But
   
I (X1 , X2 ) : (Y1 ,Y2 ) = h(Y1 ,Y2 ) − h Y1 ,Y2 |X1 , X2
≤ h(Y1 ) + h(Y2 ) − h(Y1 |X1 ) − h(Y2 |X2 )
= I(X1 : Y1 ) + I(X2 : Y2 );

equality applies iff X1 and X2 are independent. Thus, C = C1 + C2 and the max-
imising p(X1 ,X2 ) is pX1 × pX2 where pX1 and pX2 are maximisers for I(X1 : Y1 ) and
I(X2 : Y2 ).
(c)
channel 1 −→ Y1

X

channel 2 −→ Y2

Here,
 
C = sup I X : (Y1 : Y2 )
pX

and
   
I (Y1 : Y2 ) : X = h(X) − h X|Y1 ,Y2
≥ h(X) − min h(X|Y j ) = min I(X : Y j ).
j=1,2 j=1,2
84 Essentials of Information Theory

Thus, C ≥ max[C1 ,C2 ]. A strict inequality may occur: take an example from part
(a). Here, Ci = 1 − h(δ , 1 − δ ). Also,
   
I (Y1 ,Y2 ) : X = h(Y1 ,Y2 ) − h Y1 ,Y2 |X
= h(Y1 ,Y2 ) − h(Y1 |X) − h(Y2 |X)
= h(Y1 ,Y2 ) − 2h(δ , 1 − δ ).

If we set pX (0) = pX (1) = 1/2 then




(Y1 ,Y2 ) = (0, 0) with probability (1 − δ )2 + δ 2 2,


(Y1 ,Y2 ) = (1, 1) with probability (1 − δ )2 + δ 2 2,
(Y1 ,Y2 ) = (1, 0) with probability δ (1 − δ ),
(Y1 ,Y2 ) = (0, 1) with probability δ (1 − δ ),

with
 
h(Y1 ,Y2 ) = 1 + h 2δ (1 − δ ), 1 − 2δ (1 − δ ) ,

and
   
I (Y1 ,Y2 ) : X = 1 + h 2δ (1 − δ ), 1 − 2δ (1 − δ ) − 2h(δ , 1 − δ )
> 1 − h(δ , 1 − δ ) = Ci .

Hence, C > Ci , i = 1, 2.
(d)
X1 channel 1 −→ Y1
 
→ X : X1 or X2 →
 
X2 channel 2 −→ Y2

The difference with part (c) is that every second only one symbol is sent, either
to channel 1 or 2. If we fix probabilities α and 1 − α that a given symbol is sent
through a particular channel then

I(X : Y ) = h(α , 1 − α ) + α I(X1 : Y1 ) + (1 − α )I(X2 : Y2 ). (1.4.42)

Indeed, I(X : Y ) = h(Y ) − h(Y |X), where

h(Y ) = − ∑ α pY1 (y) log α pY1 (y) − ∑(1 − α )pY2 (y) log(1 − α )pY2 (y)
y y
= −α log α − (1 − α ) log(1 − α ) + α h(Y1 ) + (1 − α )h(Y2 )
1.4 Channels of information transmission 85

and
h(Y |X) = − ∑ α pX1 ,Y1 (x, y) log pY1 |X1 (y|x)
x,y
− ∑ (1 − α )pX2 ,Y2 (y|x) log pY2 |X2 (y|x)
x,y
= α h(Y1 |X1 ) + (1 − α )h(Y2 |X2 )

proving (1.4.42). This yields


 
C = max h(α , 1 − α ) + α C1 + (1 − α )C2 ;
0≤α ≤1

the maximum is given by

α = 2C1 /(2C1 + 2C2 ), 1 − α = 2C2 /(2C1 + 2C2 ),


 
and C = log 2C1 + 2C2 .

Worked Example 1.4.32 A spy sends messages to his contact as follows. Each
hour either he does not telephone, or he telephones and allows the telephone to ring
a certain number of times – not more than N , for fear of detection. His contact does
not answer, but merely notes whether or not the telephone rings, and, if so, how
many times. Because of deficiencies in the telephone system, calls may fail to be
properly connected; the correct connection has probability p, where 0 < p < 1, and
is independent for distinct calls, but the spy has no means of knowing which calls
reach his contact. If connection is made, then the number of rings is transmitted
correctly. The probability of a false connection from another subscriber at a time
when no call is made may be neglected. Write down the channel matrix for this
channel and calculate the capacity explicitly. Determine a condition on N in terms
of p which will imply, with optimal coding, that the spy will always telephone.

Solution The channel alphabet is {0, 1, . . . , N}: 0 ∼ non-call (in a given hour), and
j ≥ 1 ∼ j rings. The channel matrix is P(0|0) = 1, P(0| j) = 1 − p and P( j| j) = p,
1 ≤ j ≤ N, and h(Y |X) = −q(p log p + (1 − p) log(1 − p)), where q = pX (X ≥ 1).
Furthermore, given q, h(Y ) attains its maximum when
pq
pY (0) = 1 − pq, pY (k) = , 1 ≤ k ≤ N.
N
Maximising I(X : Y ) = h(Y ) − h(Y |X) in q yields p(1 − p)(1−p)/p × (1 − pq) =
pq/N or
⎡  ⎤
 (1−p)/p −1
1 1 1
q = min ⎣ 1+ , 1⎦.
p Np 1− p
86 Essentials of Information Theory
1
The condition q = 1 is equivalent to log N ≥ − log(1 − p), i.e.
p
1
N≥ .
(1 − p)1/p

1.5 Differential entropy and its properties


Definition 1.5.1 Suppose that the random variable X has a probability density
(PDF) p(x), x ∈ Rn :
0
P{X ∈ A} = p(x)dx
A
1
for any (measurable) set A ⊆ Rn , where p(x) ≥ 0, x ∈ Rn , and Rn dxp(x) = 1. The
differential entropy hdiff (X) is defined as
0
hdiff (X) = − p(x) log p(x)dx, (1.5.1)

under the assumption that the integral is absolutely convergent. As in the discrete
case, hdiff (X) may be considered as a functional of the density p : x ∈ Rn → R+ =
[0, ∞). The difference is however that hdiff (X) may be negative, e.g. for a uniform
1
distribution on [0, a], hdiff (X) = − 0a dx(1/a) log(1/a) = log a < 0 for a < 1. [We
write x instead of x when x ∈ R.] The relative, joint and conditional differential
entropy are defined similarly to the discrete case:
0
p (x)
hdiff (X||Y ) = Ddiff (p||p ) = − p(x) log dx, (1.5.2)
p(x)
0
hdiff (X,Y ) = − pX,Y (x, y) log pX,Y (x, y)dxdy, (1.5.3)
0
hdiff (X|Y ) = − pX,Y (x, y) log pX|Y (x|y)dxdy
(1.5.4)
= hdiff (X,Y ) − hdiff (Y ),
again under the assumption that the integrals are absolutely convergent. Here, pX,Y
is the joint probability density and pX|Y the conditional density (the PDF of the
conditional distribution). Henceforth we will omit the subscript diff when it is clear
what entropy is being addressed. The assertions of Theorems 1.2.3(b),(c), 1.2.12,
and 1.2.18 are carried through for the differential entropies: the proofs are com-
pletely similar and will not be repeated.
Remark 1.5.2 Let 0 ≤ x ≤ 1. Then x can be written as a sum ∑ αn 2−n where
n≥1
αn (= αn (x)) equals 0 or 1. For ‘most’ of the numbers x the series is not reduced to a
finite sum (that is, there are infinitely many n such that αn = 1; the formal statement
1.5 Differential entropy and its properties 87

is that the (Lebesgue) measure of the set of numbers x ∈ (0, 1) with infinitely many
αn (x) = 1 equals one). Thus, if we want to ‘encode’ x by means of binary digits we
would need, typically, a codeword of an infinite length. In other words, a typical
value for a uniform random variable X with 0 ≤ X ≤ 1 requires infinitely many bits
for its ‘exact’ description. It is easy to make a similar conclusion in a general case
when X has a PDF fX (x).
However, if we wish to represent the outcome of the random variable X with
an accuracy of first n binary digits then we need, on average, n + h(X) bits where
h(X) is the differential entropy of X. Differential entropies can be both positive
and negative, and can even be −∞. Since h(X) can be of either sign, n + h(X) can
be greater or less than n. In the discrete case the entropy is both shift and scale
invariant since it depends only on probabilities p1 , . . . , pm , not on the values of the
random variable. However, the differential entropy is shift but not scale invariant
as is evident from the identity (cf. Theorem 1.5.7)

h(aX + b) = h(X) + log |a|.

However, the relative entropy, i.e. Kullback–Leibler distance D(p||q), is scale


invariant.
Worked Example 1.5.3 Consider a PDF on 0 ≤ x ≤ e−1 ,
1
fr (x) = Cr , 0 < r < 1.
x(− ln x)r+1
Then the differential entropy h(X) = −∞.

Solution After the substitution y = − ln x we obtain


0 e−1 0 ∞
1 1 1
dx = dy = .
0 x(− ln x)r+1 1 yr+1 r
Thus, Cr = r. Further, using z = ln(− ln x)
0 e−1 0 ∞
ln(− ln x) 1
r+1
dx = ze−rz dz = .
0 x(− ln x) 0 r2
Hence,
0
h(X) = − fr (x) ln fr (x)dx
0  
= fr (x) − ln r + ln x + (r + 1) ln(− ln x) dx
0 e−1  
r ln(− ln x)
= − ln r − − r(r + 1) dx,
0 x(− ln x)r x(− ln x)r+1
so that for 0 < r < 1, the second term is infinite, and two others are finite.
88 Essentials of Information Theory

Theorem 1.5.4 Let X = (X1 , . . . , Xd ) ∼ N(μ ,C) be a multivariate normal random


vector, of mean μ = (μ1 , . . . , μd ) and covariance matrix C = (ci j ), i.e. EXi = μi ,
E(Xi − μi )(X j − μ j ) = ci j = c ji , 1 ≤ i, j ≤ d . Then

h(X) = log (2π e)d detC . (1.5.5)


2

Proof The PDF pX (x) is

 
1 1 −1

p(x) = 1/2 exp − x − μ ,C (x − μ ) , x ∈ Rd .
2
(2π )d detC

Then h(X) takes the form

0  
1   log e  −1

− p(x) − log (2π ) detC − d
x − μ ,C (x − μ ) dx
Rd  2 2
log e   1  
= E ∑(xi − μi )(x j − μ j ) C−1 i j + log (2π )d detC
2 i, j 2
log e  −1  1  
= ∑
2 i, j
C i j E(xi − μi )(x j − μ j ) + log (2π )d detC
2
log e  −1  1  
= ∑
2 i, j
C i j C ji + log (2π )d detC
2
d log e 1   1  
= + log (2π )d detC = log (2π e)d detC .
2 2 2

Theorem 1.5.5 For a random vector X = (X1 , . . . , Xd ) with mean μ and covari-
ance matrix C = (Ci j ) (i.e. Ci j = E (Xi − μi )(X j − μ j )] = C ji ),

1  
h(X) ≤ log (2π e)d detC , (1.5.6)
2

with the equality iff X is multivariate normal.

Proof Let p(x) be the PDF of X and p0 (x) the normal density with mean μ
and covariance matrix C. Without loss of generality assume μ = 0. Observe that
log p0 (x) is, up to an additive constant term, a quadratic form in xk . Furthermore,
1.5 Differential entropy and its properties 89
1 0
1
for each monomial xi x j , dxp (x)xi x j = dxp(x)xi x j = Ci j = C ji , and the moment
of quadratic form log p0 (x) are equal. We have
0
p(x)
0 ≤ D(p||p0 ) (by Gibbs) = p(x) log dx
1 p0 (x)
= −h(p) − p(x) log p0 (x)dx
1
= −h(p) − p0 (x) log p0 (x)dx
(by the above remark) = −h(p) + h(p0 ).

The equality holds iff p = p0 .

Worked Example 1.5.6

(a) Show that the exponential density maximises the differential entropy among
the PDFs on [0, ∞) with given mean, and the normal density maximises the
differential entropy among the PDFs on R with a given variance.
Moreover, let X = (X1 , . . . , Xd )T be a random vector with EX = 0 and
EXi X j = Ci j , 1 ≤ i, j ≤ d . Then hdiff (X) ≤ 2 log (2π e)d det(Ci j ) , with equal-
1

ity iff X ∼ N(0,C).


(b) Prove that the bound h(X) ≤ log m (cf. (1.2.7)) for a random variable X tak-
ing not more than m values admits the following generalisation for a discrete
random variable with infinitely many values in Z+ :
1 1
h(X) ≤ log 2π e(Var X + ) .
2 12

Solution (a) For the Gaussian case, see Theorem 1.5.5. In the exponential
case, by the Gibbs inequality, for any random variable Y with PDF f (y),
1
f (y) log f (y)eλ y /λ dy ≥ 0 or

h(Y ) ≤ (λ EY log e − log λ ) = h(Exp(λ )),

with equality iff Y ∼ Exp(λ ), λ = (EY )−1 .

(b) Let X0 be a discrete random variable with P(X0 = i) = pi , i = 1, 2, . . ., and the


random variable U be independent of X0 and uniform on [0, 1]. Set X = X0 + U.
For a normal random variable Y with Var X = VarY ,
1 1 1
hdiff (X) ≤ hdiff (Y ) = log 2π eVar Y = log 2π e(Var X + ) .
2 2 12
90 Essentials of Information Theory

The value of EX is not essential for h(X) as the following theorem shows.
Theorem 1.5.7
(a) The differential entropy is not changed under the shift: for all y ∈ Rd ,
h(X + y) = h(X).
(b) The differential entropy changes additively under multiplication:
h(aX) = h(X) + log |a|, for all a ∈ R.
Furthermore, if A = (Ai j ) is a d × d non-degenerate matrix, consider the affine
transformation x ∈ Rd → Ax + y ∈ Rd .
(c) Then
h(AX + y) = h(X) + log | det A|. (1.5.7)
Proof The proof is straightforward and left as an exercise
Worked Example 1.5.8 (The data-processing inequality for the relative entropy)
Let S be a finite set, and Π = (Π(x, y), x, y ∈ S) be a stochastic kernel (that is, for
all x, y ∈ S, Π(x, y) ≥ 0 and ∑y∈S Π(x, y) = 1; in other words, Π(x, y) is a transi-
tion probability in a Markov chain). Prove that D(p1 Π||p2 Π) ≤ D(p1 ||p2 ) where
pi Π(y) = ∑x∈S pi (x)Π(x, y), y ∈ S (that is, applying a Markov operator to both
probability distributions cannot increase the relative entropy).
Extend this fact to the case of the differential entropy.

Solution In the discrete case Π is defined by a stochastic matrix (Π(x, y)). By the
log-sum inequality (cf. PSE II, p. 426), for all y
∑ p1 (w)Π(w, y)
∑ p1 (x)Π(x, y) log w∑ p2 (z)Π(z, y)
x
z
p1 (x)Π(x, y)
≤ ∑ p1 (x)Π(x, y) log
x p2 (x)Π(x, y)
p1 (x)
= ∑ p1 (x)Π(x, y) log .
x p2 (x)
Taking summation over y we obtain
∑ p1 (w)Π(w, y)
D(p1 Π||p2 Π) = ∑∑ p1 (x)Π(x, y) log
w

x y ∑ p2 (z)Π(z, y)
z
p1 (x)
≤ ∑ ∑ p1 (x)Π(x, y) log = D(p1 ||p2 ).
x y p2 (x)

In the continuous case a similar inequality holds if we replace summation by


integration.
1.5 Differential entropy and its properties 91

The concept of differential entropy has proved to be useful in a great vari-


ety of situations, very often quite unexpectedly. We consider here inequalities for
determinants and ratios of determinants of positive definite matrices (cf. [39], [36]).
Recall that the covariance matrix C = (Ci j ) of a random vector X = (X1 , . . . , Xd )
is positive definite, i.e. for any complex vector y = (y1 , . . . , yd ), the scalar product
(y,Cy) = ∑ Ci j yi y j is written as
i, j
 2
 
 
∑ i i j j i j ∑ i i i  ≥ 0.
E(X − μ )(X − μ )y y = E  (X − μ )y
i, j i

Conversely, for any positive definite matrix C there exists a PDF for which C is a
covariance matrix, e.g. a multivariate normal distribution (if C is not strictly posi-
tive definite, the distribution is degenerate).

Worked Example 1.5.9 If C is positive definite then log[detC] is concave in C.

Solution Take two positive definite matrices C(0) and C(1) and λ ∈ [0, 1]. Let X(0)
and X(1) be two multivariate normal vectors, X(i) ∼ N(0,C(i) ). Set, as in the proof
of Theorem 1.2.18, X = X(Λ) , where the random variable Λ takes two values, 0 and
1, with probabilities λ and 1 − λ , respectively, and is independent of X(0) and X(1) .
Then the random variable X has covariance C = λ C(0) + (1 − λ )C(1) , although X
need not be normal. Thus,
1  1  
log 2π e)d + log det λ C(0) + (1 − λ )C(1)
2 2
1  
= log (2π e)d detC ≥ h(X) (by Theorem 1.5.5)
2
≥ h(X|Λ) (by Theorem 1.2.11)
λ   1−λ  
= log (2π e)d detC(0) + log (2π e)d detC(1)
2 2
1  
= log 2π e) + λ log detC(0) + (1 − λ ) log detC(1) .
d
2

This property is often called the Ky Fan inequality and was proved initially in
1950 by using much more involved methods. Another famous inequality is due to
Hadamard:

Worked Example 1.5.10 For a positive definite matrix C = (Ci j ),

detC ≤ ∏ Cii , (1.5.8)


i

and the equality holds iff C is diagonal.


92 Essentials of Information Theory

Solution If X = (X1 , . . . , Xn ) ∼ N(0,C) then


1   1
log (2π e)d detC = h(X) ≤ ∑ h(Xi ) = ∑ log(2π eCii ),
2 i i 2

with equality iff X1 , . . . , Xn are independent, i.e. C is diagonal.


Next we discuss the so-called entropy–power inequality (EPI). The situation
with the EPI is quite intriguing: it is considered one of the ‘mysterious’ facts of
information theory, lacking a straightforward interpretation. It was proposed by
Shannon; the book [141] contains a sketch of an argument supporting this inequal-
ity. However, the first rigorous proof of the EPI only appeared nearly 20 years later,
under some rather restrictive conditions that are still the subject of painstaking im-
provement. Shannon used the EPI in order to bound the capacity of an additive
channel with continuous noise by that of a Gaussian channel; see Chapter 4. The
EPI is also related to important properties of monotonicity of entropy; an example
is Theorem 1.5.15 below.
The existing proofs of the EPI are not completely elementary; see [82] for one
of the more transparent proofs.
Theorem 1.5.11 (Entropy–power inequality). For two independent random vari-
ables X and Y with PDFs fX (x) and fY (x), x ∈ R1 ,
h(X +Y ) ≥ h(X +Y ), (1.5.9)
where X and Y are independent normal random variables with h(X) = h(X ) and
h(Y ) = h(Y ).
In the d-dimensional case the entropy–power inequality is as follows.
For two independent random variables X and Y with PDFs fX (x) and fY (x), x ∈ Rd ,
e2h(X+Y )/d ≥ e2h(X)/d + e2h(Y )/d . (1.5.10)

It is easy to see that for d = 1 (1.5.9) and (1.5.10) are equivalent. In general,
inequality (1.5.9) implies (1.5.10) via (1.5.13) below which can be established
independently. Note that inequality (1.5.10) may be true or false for discrete ran-
dom variables. Consider the following example: let X ∼ Y be independent with
PX (0) = 1/6, PX (1) = 2/3, PX (2) = 1/6. Then
2 16 18
h(X) = h(Y ) = ln 6 − ln 4, h(X +Y ) = ln 36 − ln 8 − ln 18.
3 36 36
By inspection, e2h(X+Y ) = e2h(X) + e2h(Y ) . If X and Y are non-random constants
then h(X) = h(Y ) = h(X +Y ) = 0, and the EPI is obviously violated. We conclude
1.5 Differential entropy and its properties 93

that the existence of PDFs is an essential condition that cannot be omitted. In a


different form EPI could be extended to discrete random variables, but we do not
discuss this theory here.

Sometimes the differential entropy is defined as h(X) = −E log2 p(X); then


(1.5.10) takes the form 2h(X+Y )/d ≥ 2h(X)/d + 2h(Y )/d .

The entropy–power inequality plays a very important role not only in informa-
tion theory and probability but in geometry and analysis as well. For illustration
we present below the famous Brunn–Minkowski theorem that is a particular case
of the EPI. Define the set sum of two sets as
A1 + A2 = {x1 + x2 : x1 ∈ A1 , x2 ∈ A2 }.
By definition A + 0/ = A.
Theorem 1.5.12 (Brunn–Minkowski)
(a) Let A1 and A2 be measurable sets. Then the volume
V (A1 + A2 )1/d ≥ V (A1 )1/d +V (A2 )1/d . (1.5.11)
(b) The volume of the set sum of two sets A1 and A2 is greater than the volume
of the set sum of two balls B1 and B2 with the same volume as A1 and A2 ,
respectively:
V (A1 + A2 ) ≥ V (B1 + B2 ), (1.5.12)
where B1 and B2 are spheres with V (A1 ) = V (B1 ) and V (A2 ) = V (B2 ).
Worked Example 1.5.13 Let C1 ,C2 be positive-definite d × d matrices. Then
[det(C1 +C2 )]1/d ≥ [detC1 ]1/d + [detC2 ]1/d . (1.5.13)

Solution Let X1 ∼ N(0,C1 ), X2 ∼ N(0,C2 ), then X1 + X2 ∼ N(0,C1 + C2 ). The


entropy–power inequality yields
 1/d
(2π e) det(C1 +C2 ) = e2h(X1 +X2 )/d
 1/d  1/d
≥ e2h(X1 )/d + e2h(X2 )/d = (2π e) detC1 + (2π e) detC2 .

Worked Example 1.5.14 A Töplitz n × n matrix C is characterised by the prop-


erty that Ci j = Crs if |i − j| = |r − s|. Let Ck = C(1, 2, . . . , k) denote the principal
minor of the Töplitz positive-definite matrix formed by the rows and columns
1, . . . , k. Prove that for |C| = detC,
|C1 | ≥ |C2 |1/2 ≥ · · · ≥ |Cn |1/n , (1.5.14)
94 Essentials of Information Theory

|Cn |/|Cn−1 | is decreasing in n, and


|Cn |
lim = lim |Cn |1/n . (1.5.15)
n→∞ |Cn−1 | n→∞

Solution Let (X1 , X2 , . . . , Xn ) ∼ N(0,Cn ). Then the quantities h(Xk |Xk−1 , . . . , X1 )


are decreasing in k, since
h(Xk |Xk−1 , . . . , X1 ) = h(Xk+1 |Xk , . . . , X2 ) ≥ h(Xk+1 |Xk , . . . , X1 ),
where the equality follows from the Töplitz assumption and the inequality from the
fact that the conditioning reduces the entropy. Next, we use the result of Problem
1.8b from Section 1.6 that the running averages
1 1 k
h(X1 , . . . Xk ) = ∑ h(Xi |Xi−1 , . . . X1 )
k k i=1
are decreasing in k. Then (1.5.14) follows from
1
log[(2π e)k |Ck |].
h(X1 , . . . Xk ) =
2
Since h(Xn |Xn−1 , . . . , X1 ) is a decreasing sequence, it has a limit. Hence, by the
Cesáro mean theorem
h(X1 , X2 , . . . Xn ) 1 n
lim = lim ∑ h(Xk |Xk−1 , . . . , X1 )
n→∞ n n→∞ n
i=1
= lim h(Xn |Xn−1 , . . . , X1 ).
n→∞

Translating this to determinants, we obtain (1.5.15).


The entropy–power inequality could be immediately extended to the case of
several summands
  n
e2h X1 +···+Xn /d ≥ ∑ e2h(Xi )/d .
i=1

But, more interestingly, the following intermediate inequality holds true. Let
X1 , X2 , . . . , Xn+1 be IID square-integrable random variables. Then
   
1 n+1 2h i ∑ Xi /d
e2h X1 +···+Xn /d
≥ ∑ e =j . (1.5.16)
n j=1

As was established, the differential entropy is maximised by a Gaussian distri-


bution, under the constraint that the variance of the random variable under consid-
eration is bounded from above. We will state without proof the following important
result showing that the entropy increases on every summation step in the central
limit theorem.
1.6 Additional problems for Chapter 1 95

Theorem 1.5.15 Let X1 , X2 , . . . be IID square-integrable random variables with


EXi = 0, and VarXi = 1. Then
X +···+X X +···+X
1 n 1 n+1
h √ ≤h √ . (1.5.17)
n n+1

1.6 Additional problems for Chapter 1


Problem 1.1 Let Σ1 and Σ2 be alphabets of sizes m and q. What does it mean to
say that f : Σ1 → Σ∗2 is a decipherable code? Deduce from the inequalities of Kraft
and Gibbs that if letters are drawn from Σ1 with probabilities p1 , . . . , pm then the
expected word length is at least h(p1 , . . . , pm )/ log q.

Find a decipherable binary code consisting of codewords 011, 0111, 01111,


11111, and three further codewords of length 2. How do you check that the code
you have obtained is decipherable?
2
Solution Introduce Σ∗ = n≥0 Σn , the set of all strings with digits from Σ. We send
a message x1 x2 . . . xn ∈ Σ∗1 as the concatenation f (x1 ) f (x2 ) . . . f (xn ) ∈ Σ∗2 , i.e. f
extends to a function f ∗ : Σ∗1 → Σ∗2 . We say a code is decipherable if f ∗ is injective.
Kraft’s inequality states that a prefix-free code f : Σ1 → Σ∗2 with codeword-
lengths s1 , . . . , sm exists iff
m
∑ q−s i
≤ 1. (1.6.1)
i=1

In fact, every decipherable code satisfies this inequality.


Gibbs’ inequality states that if p1 , . . . , pn and p1 , . . . , pn are two probability dis-
tributions then
n n
h(p1 , . . . , pn ) = − ∑ pi log pi ≤ − ∑ pi log pi , (1.6.2)
i=1 i=1

with equality iff pi ≡ pi .


Suppose that f is decipherable with codeword-lengths s1 , . . . , sm . Put pi = q−si /c
m
where c = ∑ q−si . Then, by Gibbs’ inequality,
i=1
n
h(p1 , . . . , pn ) ≤ − ∑ pi log pi
i=1
n
= − ∑ pi (−si log q − log c)
 i=1   
= ∑ pi si log q + ∑ pi log c.
i i
96 Essentials of Information Theory

By Kraft’s inequality, c ≤ 1, i.e. log c ≤ 0. We obtain that


expected codeword-length ∑ pi si ≥ h(p1 , . . . , pn )/ log q.
i

In the example, the three extra codewords must be 00, 01, 10 (we cannot take
11, as then a sequence of ten 1s is not decodable). Reversing the order in every
codeword gives a prefix-free code. But prefix-free codes are decipherable. Hence,
the code is decipherable.
In conclusion, we present an alternative proof of necessity of Kraft’s inequal-
ity. Denote s = max si ; let us agree to extend any word in X to the length s,
say by adding some fixed symbol. If x = x1 x2 . . . xsi ∈ X , then any word of the
form x1 x2 . . . xsi ysi +1 . . . ys ∈ X because x is a prefix. But there are at most qs−si of
such words. Summing up on i, we obtain that the total number of excluded words
is ∑mi=1 q
s−si . But it cannot exceed the total number of words qs . Hence, (1.6.1)

follows:
m
qs ∑ q−si ≤ qs .
i=1

Problem 1.2 Consider an alphabet with m letters each of which appears with
probability 1/m. A binary Huffman code is used to encode the letters, in order to
minimise the expected codeword-length (s1 + · · · + sm )/m where si is the length of
a codeword assigned to letter i. Set s = max[si : 1 ≤ i ≤ m], and let n be the number
of codewords of length .
(a) Show that 2 ≤ ns ≤ m.
(b) For what values of m is ns = m?
(c) Determine s in terms of m.
(d) Prove that ns−1 + ns = m, i.e. any two codeword-lengths differ by at most 1.
(e) Determine ns−1 and ns .
(f) Describe the codeword-lengths for an idealised model of English (with m =
27) where all the symbols are equiprobable.
(g) Let now a binary Huffman code be used for encoding symbols 1, . . . , m occur-
ring with probabilities p1 ≥ · · · ≥ pm > 0 where ∑ p j = 1. Let s1 be the length
1≤ j≤m
of a shortest codeword and sm of a longest codeword. Determine the maximal and
minimal values of sm and s1 , and find binary trees for which they are attained.

Solution (a) Bound ns ≥ 2 follows from the tree-like structure of Huffman codes.
More precisely, suppose ns = 1, i.e. a maximum-length codeword is unique and
corresponds to say letter i. Then the branch of length s leading to i can be pruned at
the end, without violating the prefix-free condition. But this contradicts minimality.
1.6 Additional problems for Chapter 1 97

4 1
m c i m
a b 2
m

Figure 1.10

Bound ns ≤ m is obvious. (From what is said below it will follow that ns is always
even.)
(b) ns = m means all codewords are of equal length. This, obviously, happens iff
m = 2k , in which case s = k (a perfect binary tree Tk with 2k leaves).
(c) In general,
'
log m, if m = 2k ,
s=
log m, if m = 2k .

The case m = 2k was discussed in (b), so let us assume that m = 2k . Then 2k < m <
2k+1 where k = log m. This is clear from the observation that the binary tree for
probabilities 1/m (we will call it a binary m-tree Bm ) contains the perfect binary
tree Tk but is contained in Tk+1 . Hence, s is as above.
(d) Indeed, in the case of an equidistribution 1/m, . . ., 1/m it is impossible to have
a branch of the tree whose length differs from the maximal value s by two or more.
In fact, suppose there is such a branch, Bi , of the binary tree leading to some letter i
and choose a branch M j of maximal length s leading to a letter j. In a conventional
terminology, letter j was engaged in s merges and i in t ≤ s − 2 merges. Ultimately,
the branches Bi and M j must merge, and this creates a contradiction. For example,
the ‘least controversial’ picture is still ‘illegal’; see Figure 1.10. Here, vertex i
carrying probability 1/m should have been joined with vertex a or b carrying each
probability 2/m, instead of joining a and b (as in the figure), as it creates vertex c
carrying probability 4/m.
(e) We conclude that (i) for m = 2k , the m-tree Bm coincides with Tk , (ii) for m = 2k
we obtain Bm in the following way. First, take a binary tree Tk where k = [log m],
with 1 ≤ m − 2k < 2k . Then m − 2k leaves of Tk are allowed to branch one step
98 Essentials of Information Theory

k+1 _
2 m
_ k
2 (m 2 )

Figure 1.11

further: this generates 2(m − 2k ) = 2m − 2k+1 leaves of tree Tk+1 . The remaining
2k − (m − 2k ) = 2k+1 − m leaves of Tk are left intact. See Figure 1.11. So,
ns−1 = 2k+1 − m, ns = 2m − 2k+1 , where k = [log m].
(f) In the example of English, with equidistribution among m = 27 = 16 + 11 sym-
bols, we have 5 codewords of length 4 and 22 codewords of length 5. The average
codeword-length is
5 × 4 + 22 × 5 130
= ≈ 4.8.
27 27
3 4
(g) The minimal value for s1 is 1 (obviously). The maximal value is log2 m ,
i.e. the positive integer l with 2l < m ≤ 2l+1 . The maximal value for sm is m −
1 (obviously). The minimal value is log2 m, i.e. the natural l such that 2l−1 <
m ≤ 2l .
The tree that yields s1 = 1 and sm = m − 1 is given in Figure 1.12.
It is characterised by
i f (i) si
1 0 1
2 10 2
.. .. ..
. . .
m−1 11. . . 10 m−1
m 11. . . 11 m−1
and is generated when
p1 > p2 + · · · + pm > 2(p3 + · · · + pm ) > · · · > 2m−1 pm .
1.6 Additional problems for Chapter 1 99

1 2 m–1 m

Figure 1.12

m = 16

Figure 1.13

A tree that maximises s1 and minimises sm corresponds to uniform probabilities


where p1 = · · · = pm = 1/m. When m = 2l , the branches of the tree have the same
length l = log2 m (a perfect binary tree); see Figure 1.13.
Otherwise, i.e. if 2l < m < 2l+1 , the tree has 2l+1 − m leaves at level l and
2(m − 2l ) leaves at level l + 1; see Figure 1.14.
Indeed, by the Huffman construction, the shortest branch cannot be larger than
log2 m and the longest shorter than log2 m, as the tree is always a subtree of a
perfect binary tree.
Problem 1.3 A binary erasure channel with erasure probability p is a discrete
memoryless binary channel (MBC) with channel matrix
 
1− p p 0
.
0 p 1− p
State Shannon’s second coding theorem (SCT) and use it to compute the capacity
of this channel.
100 Essentials of Information Theory

m = 18

Figure 1.14

Solution The SCT states that for an MBC

(capacity) = (maximum information transmitted per letter).

Here the capacity is understood as the supremum over all reliable information rates
while the RHS is defined as
max I(X : Y )
X

where the random variables X and Y represent an input and the corresponding
output.
The binary erasure channel keeps an input letter 0 or 1 intact with probability
1 − p and turns it to a splodge ∗ with probability p. An input random variable X is
0 with probability α and 1 with probability 1 − α . Then the output random variable
Y takes three values:
P(Y = 0) = (1 − p)α ,
P(Y = 1) = (1 − p)(1 − α ),
P(Y = ∗) = p.

Thus, conditional on the value of Y , we have



h(X|Y = 0) = 0, ⎬
h(X|Y = 1) = 0, implying that h(X|Y ) = ph(α ).

h(X|Y = ∗) = h(α ),

Therefore,
capacity = maxα I(X : Y )
= maxα [h(X) − h(X|Y )]
= maxα [h(α ) − ph(α )]
= (1 − p) maxα h(α ) = 1 − p,
1.6 Additional problems for Chapter 1 101

because h(α ) = −α log α − (1 − α ) log(1 − α ) attains it maximum value 1 at α =


1/2.

Problem 1.4 Let X and Y be two discrete random variables with corresponding
cumulative distribution functions (CDF) PX and PY .
(a) Define the conditional entropy h(X|Y ), and show that it satisfies

h(X|Y ) ≤ h(X),

giving necessary and sufficient conditions for equality.


(b) For each α ∈ [0, 1], the mixture random variable W (α ) has PDF of the form

PW (α ) (x) = α PX (x) + (1 − α )PY (x).

Prove that for all α the entropy of W (α ) satisfies:

h(W (α )) ≥ α h(X) + (1 − α )h(Y ).

(c) Let hPo (λ ) be the entropy of a Poisson random variable Po(λ ). Show that
hPo (λ ) is a non-decreasing function of λ > 0.

Solution (a) By definition,


h(X|Y ) = h(X,Y ) − h(Y )
= − ∑ P(X = x,Y = y) log P(X = x,Y = y)
x,y
+ ∑ P(Y = y) log P(Y = y).
y

The inequality h(X|Y ) ≤ h(X) is equivalent to

h(X,Y ) ≤ h(X) + h(Y ),


pi
and follows from the Gibbs inequality ∑ pi log ≥ 0. In fact, take i = (x, y) and
i qi
pi = P(X = x,Y = y), qi = P(X = x)P(Y = y).

Then
P(X = x,Y = y)
0 ≤ ∑ P(X = x,Y = y) log
x,y P(X = x)P(Y = y)
= ∑ P(X = x,Y = y) log P(X = x,Y = y)
x,y

− ∑ P(X = x,Y = y) log P(X = x) + log P(Y = y)


x,y
= −h(X,Y ) + h(X) + h(Y ).
Equality here occurs iff X and Y are independent.
102 Essentials of Information Theory

(b) Define a random variable T equal to 0 with probability α and 1 with probability
1 − α . Then the random variable Z has the distribution W (α ) where
'
X, if T = 0,
Z=
Y, if T = 1.

By part (a),

h(Z|T ) ≤ h(Z),

with the LHS = α h(X) + (1 − α )h(Y ), and the RHS = h(W (α )).
(c) Observe that for independent random variables X and Y , h(X + Y |X) =
h(Y |X) = h(Y ). Hence, again by part (a),

h(X +Y ) ≥ h(X +Y |X) = h(Y ).

Using this fact, for all λ1 < λ2 , take X ∼ Po(λ1 ), Y ∼ Po(λ2 − λ1 ), independently.
Then

h(X +Y ) ≥ h(X) implies hPo (λ2 ) ≥ hPo (λ1 ).

Problem 1.5 What does it mean to transmit reliably at rate R through a binary
symmetric channel (MBSC) with error-probability p? Assuming Shannon’s sec-
ond coding theorem (SCT), compute the supremum of all possible reliable trans-
mission rates of an MBSC. What happens if: (i) p is very small; (ii) p = 1/2; or
(iii) p > 1/2?

Solution An MBSC can 8 transmit


9 reliably at rate R if there is a sequence of codes
XN , N = 1, 2, . . ., with 2 NR codewords such that
 
e(XN ) = max P error|x sent → 0 as N → ∞.
x∈XN

By the SCT, the so-called operational channel capacity is sup R = maxα I(X : Y ),
the maximum information transmitted per input symbol. Here X is a Bernoulli
random variable taking values 0 and 1 with probabilities α ∈ [0, 1] and 1 − α , and
Y is the output random variable for the given input X. Next, I(X : Y ) is the mutual
entropy (information):

I(X : Y ) = h(X) − h(X|Y ) = h(Y ) − h(Y |X).


1.6 Additional problems for Chapter 1 103

Observe that the binary entropy function h(x) ≤ 1 with equality for x = 1/2.
Selecting α = 1/2 conclude that the MBSC with error probability p has the
capacity

max I(X : Y ) = max h(Y ) − h(Y |X)


α α

= max h(α p + (1 − α )(1 − p)) − η (p)


α
= 1 + p log p + (1 − p) log(1 − p).
(i) If p is small, the capacity is only slightly less than 1 (the capacity of a noiseless
channel).
(ii) If p = 1/2, the capacity is zero (the channel is useless).
(iii) If p > 1/2, we may swap the labels on the output alphabet, replacing p by
1 − p, and the channel capacity is non-zero.

Problem 1.6 (i) What is bigger π e or eπ ?


(ii) Prove the log-sum inequality: for non-negative numbers a1 , a2 , . . . , an and
b1 , b2 , . . . , bn ,
  ⎛∑a ⎞
i
ai ⎝
∑ ai log bi ≥ ∑ ai log ∑ bi ⎠
i
(1.6.3)
i i
i

with equality iff ai /bi = const .


(iii) Consider two discrete probability distributions p(x) and q(x). Define the rela-
tive entropy (or Kullback–Leibler distance) and prove the Gibbs inequality,
 
p(x)
D(pq) = ∑ p(x) log ≥ 0, (1.6.4)
x q(x)
with equality iff p(x) = q(x) for all x.
Using (1.6.4), show that for any positive functions f (x) and g(x), and for any
finite set A,
    ⎛ ∑ f (x) ⎞
f (x)
∑ f (x) log g(x) ≥ ∑ f (x) log ⎝ ∑ g(x) ⎠ .
x∈A

x∈A x∈A
x∈A

Check that that for any 0 ≤ p, q ≤ 1,


   
p 1− p
p log + (1 − p) log ≥ (2 log2 e)(q − p)2 , (1.6.5)
q 1−q
and show that for any probability distributions p = (p(x)) and q = (q(x)),
 2
log2 e
D(pq) ≥
2 ∑ |p(x) − q(x)| . (1.6.6)
x
104 Essentials of Information Theory

Solution (i) Denote x = ln π , and taking the logarithm twice obtain the inequality
x − 1 > ln x. This is true as x > 1, hence eπ > π e .

(ii) Assume without loss of generality that ai > 0 and bi > 0. The function g(x) =
x log x is strictly convex. Hence, by the Jensen inequality for any coefficients ∑ ci =
1, ci ≥ 0,
 
∑ ci g(xi ) ≥ g ∑ ci xi .
 −1
Selecting ci = bi ∑ b j and xi = ai /bi , we obtain
j
  ⎛ ⎞
∑ ai
ai ai ai
∑ ∑ b j log bi ≥ ∑ ∑ b j log ⎝ i ⎠
i i ∑bj
i

which is the log-sum inequality.

(iii) There exists a constant c > 0 such that


 
1
log y ≥ c 1 − , with equality iff y = 1.
y
Writing B = {x : p(x) > 0},
p(x)
D(pq) = ∑ p(x) log
x∈B  q(x) 
q(x)

≥ c ∑ p(x) 1 − = c 1 − q(B) ≥ 0.
x∈B p(x)
Equality holds iff q(x) ≡ p(x). Next, write
f (x)
f (A) = ∑ f (x), p(x) = 1(x ∈ A),
x∈A f (A)
g(x)
g(A) = ∑ g(x), q(x) = 1(x ∈ A).
x∈A g(A)
Then
f (x) f (A)p(x)
∑ f (x) log g(x) = f (A) ∑ p(x) log g(A)q(x)
x∈A x∈A
p(x) f (A)
= f (A) ∑ p(x) log q(x) + f (A) log
g(A)
x∈A
: ;< =
≥ by the previous part
f (A)
≥ f (A) log .
g(A)
1.6 Additional problems for Chapter 1 105

Inequality (1.6.5) could be easily established by inspection. Finally, consider


A = {x : p(x) ≤ q(x)}. Since

∑ |p(x) − q(x)| = 2 q(A) − p(A) = 2 p(Ac ) − q(Ac ) ,


x

then
p(x) p(x)
D(p||q) = ∑ p(x) log + ∑ p(x) log
x∈A q(x) x∈Ac q(x)
p(A) p(A c)
≥ p(A) log + p(Ac ) log
q(A) q(Ac )  2
 
2 log2 e
≥ 2 log2 e p(A) − q(A) =
2 ∑ |p(x) − q(x)| .
x

Problem 1.7 (a) Define the conditional entropy, and show that for random vari-
ables U and V the joint entropy satisfies
h(U,V ) = h(V |U) + h(U).
Given random variables X1 , . . . , Xn , by induction or otherwise prove the chain rule
n
h(X1 , . . . Xn ) = ∑ h(Xi |X1 , . . . , Xi−1 ). (1.6.7)
i=1

(b) Define the subset average over subsets of size k to be


 
h(XS ) n
hk = ∑
(n)
,
S:|S|=k
k k

where h(XS ) = h(Xs1 , . . . , Xsk ) for S = {s1 , . . . , sk }. Assume that, for any i,
h(Xi |XS ) ≤ h(Xi |XT ) when T ⊆ S, and i ∈/ S.
By considering terms of the form
h(X1 , . . . , Xn ) − h(X1 , . . . Xi−1 , Xi+1 , . . . , Xn )
(n) (n)
show that hn ≤ hn−1 .
(k) (k) (n) (n)
Using the fact that hk ≤ hk−1 , show that hk ≤ hk−1 , for k = 2, . . . , n.
(c) Let β > 0, and define
> n
tk = ∑ e β h(XS )/k
(n)
.
S:|S|=k
k

Prove that
(n) (n) (n)
t1 ≥ t2 ≥ · · · ≥ tn .
106 Essentials of Information Theory

Solution (a) By definition, the conditional entropy

h(V |U) = h(U,V ) − h(U)


= ∑ P(U = u)h(V |U = u),
u

where h(V |U = u) is the entropy of the conditional distribution:

h(V |U = u) = − ∑ P(V = v|U = u) log P(V = v|U = u).


v

The chain rule (1.6.7) is established by induction in n.


(b) By the chain rule

h(X1 , . . . , Xn ) = h(X1 , . . . , Xn−1 ) + h(Xn |X1 , . . . , Xn−1 ) (1.6.8)

and, in general,

h(X1 , . . . , Xn )
= h(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) + h(Xi |X1 , . . . , Xi−1 , Xi+1 , . . . , Xn )
≤ h(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) + h(Xi |X1 , . . . , Xi−1 ), (1.6.9)

because
h(Xi |X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) ≤ h(Xi |X1 , . . . , Xi−1 ).

Then adding equations (1.6.9) from i = 1 to n:


n
nh(X1 , . . . , Xn ) ≤ ∑ h(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn )
i=1
n
+ ∑ h(Xi |X1 , . . . , Xi−1 ).
i=1

The second sum in the RHS equals h(X1 , . . . , Xn ) by the chain rule (1.6.7). So,
n
(n − 1)h(X1 , . . . , Xn ) ≤ ∑ h(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ).
i=1

(n) (n)
This implies that hn ≤ hn−1 , since

1 1 n h(X1 , . . . , Xi−1 , Xi+1 , . . . , Xn )


h(X1 , . . . , Xn ) ≤ ∑ . (1.6.10)
n n i=1 n−1

In general, fix a subset S of size k in {1, . . . , n}. Writing S(i) for S \{i}, we obtain
1 1 h(X[S(i)])
h[X(S)] ≤ ∑ ,
k k i∈S k − 1
1.6 Additional problems for Chapter 1 107

by the above argument. This yields


 
n (n) h[X(S)] h(X[S(i)])
k k
h = ∑ k
≤ ∑ ∑ k(k − 1)
. (1.6.11)
S⊂{1,...,n}: |S|=k S⊂{1,...,n}: |S|=k i∈S

Finally, each subset of size k − 1, S(i), appears [n − (k − 1)] times in the sum
(n)
(1.6.11). So, we can write hk as
   
h[X(T )] n − (k − 1) n

T ⊂{1,...,n}: |T |=k−1 k − 1 k k
 
h[X(T )] n
= ∑ = hnk−1 .
T ⊂{1,...,n}: |T |=k−1 k − 1 k − 1
(c) Starting from (1.6.11), exponentiate and then apply the arithmetic
mean/geometric mean inequality, to obtain for S0 = {1, 2, . . . , n}
1 n β h(S0 (i))/(n−1)
eβ h(X(S0 ))/n ≤ eβ [h(S0 (1))+···+h(S0 (n))]/(n(n−1)) ≤ ∑e
n i=1
(n) (n)
which is equivalent to tn ≤ tn−1 . Now we use the same argument as in (b), taking
(n) (n)
the average over all subsets to prove that for all k ≤ n,tk ≤ tk−1 .
Problem 1.8 Let p1 , . . . , pn be a probability distribution, with p∗ = maxi [pi ].
Prove that
(i) −∑ pi log2 pi ≥ −p∗ log2 p∗ − (1 − p∗ ) log2 (1 − p∗ );
i
(ii) −∑ pi log2 pi ≥ log2 (1/p∗ );
i
(iii) −∑ pi log2 pi ≥ 2(1 − p∗ ).
i
The random variables X and Y with values x and y from finite ‘alphabets’ I and
J represent the input and output of a transmission channel, with the conditional
probability P(x | y) = P(X = x | Y = y). Let h(P(· | y)) denote the entropy of the
conditional distribution P(· | y), y ∈ J , and h(X | Y ) denote the conditional entropy
of X given Y . Define the ideal observer decoding rule as a map f : J → I such
that P( f (y) | y) = maxx∈I P(x | y) for all y ∈ J . Show that under this rule the error-
probability
πer (y) = ∑ P(x | y)
x∈I: x = f (y)

1
satisfies πer (y)  h(P(· | y)), and the expected error satisfies
2
1
Eπer (Y ) ≤ h(X | Y ).
2
108 Essentials of Information Theory

Solution Bound (i) follows from the pooling inequality. Bound (ii) holds as
1 1
− ∑ pi log pi ≥ ∑ pi log ∗
= log ∗ .
i i p p

To check (iii), it is convenient to use (i) for p∗ ≥ 1/2 and (ii) for p∗ ≤ 1/2. Assume
first that p∗ ≥ 1/2. Then, by (i),

h(p1 , . . . , pn ) ≥ h (p∗ , 1 − p∗ ) .

The function x ∈ (0, 1) → h(x, 1 − x) is concave, and its graph on (1/2, 1) lies
strictly above the line x → 2(1 − x). Hence,

h(p1 , . . . , pn ) ≥ 2 (1 − p∗ ) .

On the other hand, if p∗ ≤ 1/2, we use (ii):


1
h(p1 , . . . , pn ) ≥ log .
p∗
Further, for 0 ≤ x ≤ 1/2,
1 1
log ≥ 2(1 − x); equality iff x = .
x 2
For the concluding part, we use (iii). Write

πer (y) = 1 − Pch ( f (y)|y) = 1 − pmax ( · |y)


 
which is ≤ h P( · |y) /2.Finally, the mean Eπer (Y ) is bounded by taking expecta-
tions, since h(X|Y ) = Eh P( · |Y ) .

Problem 1.9 Define the information rate H and the asymptotic equipartition
property of a source. Calculate the information rate of a Bernoulli source. Given a
memoryless binary channel, define the channel capacity C. Assuming the statement
of Shannon’s second coding theorem (SCT), deduce that C = sup pX I(X : Y ).
An erasure channel keeps a symbol intact with probability 1 − p and turns it into
an unreadable splodge with probability p. Find the capacity of the erasure channel.

Solution The information rate H of a source U1 ,U2 , . . . with a finite alphabet I is


the supremum of all values R > 0 such that there exists a sequence of sets An ∈
I × · · · × I (n times) such that |An | ≤ 2nR and limn→∞ P(U1n ∈ An ) = 1.
The asymptotic equipartition property means that, as n → ∞,
1
− log pn (U1n ) → H,
n
1.6 Additional problems for Chapter 1 109

in one sense or another (here we mean convergence in probability). Here U1n =


U1 . . . Un and pn (un1 ) = P(U1n = un1 ). The SCT states that if the random variable
− log pn (U1n )/n converges to a limit then the limit equals H.
A memoryless binary channel (MBC) has the conditional probability
 
Pch Y(N) |X(N) sent = ∏ P(yi |xi )
1≤i≤N

and produces an error with probability


 
ε (N) = ∑ Psource U = u Pch fN (Y(N) ) = u | fN (u) sent ,
u

where Psource stands for the source probability distribution, and one uses a code fN
and a decoding rule fN . A value R ∈ (0, 1) is said to be a reliable transmission rate
if, given that Psource is an equidistribution over a set UN of source strings u with
 UN = 2N[R+o(1)] , there exist fN and fN such that
1
lim ∑ chP 
f N (Y(N)
)
= u | f N (u) sent = 0.
N→∞  UN
u∈UN

The channel capacity is the supremum of all reliable transmission rates.


For an erasure channel, the matrix is
⎛ ⎞
0 1− p 0 p
1 ⎝ 0 1 − p p⎠
0 1 

The conditional entropy h(Y |X) = h(p, 1 − p) does not depend on pX . Thus,

C = sup I(X : Y ) = sup h(Y ) − h(Y |X)


pX pX

is achieved at pX (0) = pX (1) = 1/2 with

h(Y ) = −(1 − p) log[(1 − p)/2] − p log p = h(p, 1 − p) + (1 − p).

Hence, the capacity C = 1 − p.

Problem 1.10 Define Huffman’s encoding rule and prove its optimality among
decipherable codes. Calculate the codeword lengths for the symbol-probabilities
1 1 1 1 1 1 1 1
5 , 5 , 6 , 10 , 10 , 10 , 10 , 30 .
Prove, or provide a counter-example to, the assertion that if the length of a code-
word from a Huffman code equals l then, in the same code, there exists another
codeword of length l such that | l − l | ≤ 1.
110 Essentials of Information Theory

Solution An answer to the first part:


probability codeword length
1/5 00 2
1/5 100 3
1/6 101 3
1/10 110 3
1/10 010 3
1/10 011 3
1/10 1110 4
1/30 1111 4
For the second part: a counter-example:
probability codeword length
1/2 0 1
1/8 100 3
1/8 101 3
1/8 110 3
1/8 111 3

Problem 1.11 A memoryless channel with the input alphabet {0, 1} repro-
duces the symbol correctly with probability (n − 1)/n2 and reverses it with prob-
ability 1/n2 . [Thus, for n = 1 the channel is binary and noiseless.] For n ≥ 2 it
also produces 2(n − 1) sorts of ‘splodges’, conventionally denoted by αi and βi ,
i = 1, . . . , n − 1, with similar probabilities: P(αi |0) = (n − 1)/n2 , P(βi |0) = 1/n2 ,
P(βi |1) = (n − 1)/n2 , P(αi |1) = 1/n2 . Prove that the capacity Cn of the channel
increases monotonically with n, and limn→∞ Cn = ∞. How is the capacity affected
if we simply treat splodges αi as 0 and βi as 1?

Solution The channel matrix is


⎛ ⎞
n−1 1 n−1 1 n−1 1
0 ⎜ n2 2
... ⎟
⎜ n n2 n2 n2 n2 ⎟
⎜ ⎟.
⎝ 1 n−1 1 n−1 1 n−1 ⎠
1 ...
n2 n2 n2 n2 n2 n2
0 1 α1 β1 ... αn−1 βn−1
The channel is double-symmetric (the rows and columns are permutations of each
other), hence the capacity-achieving input distribution is
1
pX (0) = pX (1) = ,
2
1.6 Additional problems for Chapter 1 111

and the capacity Cn is given by


1 2 n−1 n2
Cn = log(2n) + n log(n ) + n log
n2 n2 n−1
n−1
= 1 + 3 log n − log(n − 1) → +∞, as n → ∞.
n
Furthermore, extrapolating
 
1
C(x) = 1 + 3 log x − 1 − log(x − 1), x ≥ 1,
x
we find
dC(x) 3 1 1 1
= − + − log(x − 1)
dx x x − 1 x(x − 1) x2
2 1 1
= − 2 log(x − 1) = 2 [ 2x − log(x − 1)] > 0, x > 1.
x x x
Thus, Cn increases with n for n ≥ 1. When αi and βi are treated as 0 or 1, the
capacity does not change.
Problem 1.12 Let Xi , i = 1, 2, . . . , be IID random variables, taking values 1, 0
with probabilities p and (1 − p). Prove the local De Moivre–Laplace theorem with
a remainder term:
1
P(Sn = k) = $ exp[−nh p (y) + θn (k)], k = 1, . . . , n − 1; (1.6.12)
2π y(1 − y)n
here
   
y 1−y
Sn = ∑ Xi , y = k/n, h p (y) = y ln
p
+ (1 − y) ln
1− p
1≤i≤n

and the remainder θn (k) obeys


1
|θn (k)| < , y = k/n.
6ny(1 − y)
Hint: Use the Stirling formula with the remainder term

n! = 2π n, nn e−n+ϑ (n) ,
where
1 1
< ϑ (n) < .
12n + 1 12n

Find values k+ and k− , 0 ≤ k+ , k− ≤ n (depending on n), such that P(Sn = k+ )


is asymptotically maximal and P(Sn = k− ) is asymptotically minimal, as n → ∞,
and write the corresponding asymptotics.
112 Essentials of Information Theory

Solution Write
 
n n!
P(Sn = k) = (1 − p)n−k pk = (1 − p)n−k pk
? k k!(n − k)!
n nn
= (1 − p)n−k pk
2π k(n − k) kk (n
− k) n−k

× exp ϑ (n) − ϑ (k) − ϑ (n − k)


1 
=$ exp − k ln y − (n − k) ln(1 − y)
2π ny(1 − y) 

+k ln p + (n − k) ln(1 − p) exp ϑ (n) − ϑ (k) − ϑ (n − k)


1

=$ exp − nh p (y)
2π ny(1 − y)

× exp ϑ (n) − ϑ (k) − ϑ (n − k) .


Now, as
1 1 1 2n2
|ϑ (n) − ϑ (k) − ϑ (n − k)| < + + < ,
12n 12k 12(n − k) 12nk(n − k)
(1.6.12) follows, with θn (k) = ϑ (n) − ϑ (k) − ϑ (n − k). By the Gibbs inequality,
h p (y) ≥ 0 and h p (y) = 0 iff y = p. Furthermore,
dh p (y) y 1−y d2 h p (y) 1 1
= ln − ln , and = + > 0,
dy p 1− p dy 2 y 1−y
which yields

dh p (y)  dh p (y) dh p (y)
 = 0, < 0, 0 < y < p, and > 0, p < y < 1.
dy y=p dy dy
Hence,
h p = min h p (y) = 0, attained
 at y = p,
1 1
h p = max h p (y) = min ln , ln , attained at y = 0 or y = 1.
p 1− p
Thus, the maximal probability for n  1 is for y∗ = p, i.e. k+ = np:
  1  
P Sn = np  $ exp θn (np) ,
2π np(1 − p)
where
1
|θn (np)| ≤ .
6np(1 − p)
Similarly, the minimal probability is
P(Sn = 0) = pn , if 0 < p ≤ 1/2,
P(Sn = n) = (1 − p) , if 1/2 ≤ p < 1.
n
1.6 Additional problems for Chapter 1 113
n
Problem 1.13 (a) Prove that the entropy h(X) = − ∑ p(i) log p(i) of a discrete
i=1
random variable X with probability distribution p = (p(1), . . . , p(n)) is a concave
function of the vector p.
Prove that the mutual entropy I(X : Y ) = h(Y ) − h(Y | X) between random vari-
ables X and Y , with P(X = i,Y = k) = pX (i)PY |X (k | i), i, k = 1, . . . , n, is a concave
function of the vector pX = (pX (1), . . . , pX (n)) for fixed conditional probabilities
{PY |X (k | i)}.
(b) Show that
h(X) ≥ −p∗ log2 p∗ − (1 − p∗ ) log2 (1 − p∗ ),
where p∗ = maxx P(X = x), and deduce that, when p∗ ≥ 1/2,
h(X) ≥ 2(1 − p∗ ). (1.6.13)
Show also that inequality (1.6.13) remains true even when p∗ < 1/2.

Solution (a) Concavity of h(p) means that


h(λ1 p1 + λ2 p2 ) ≥ λ1 h(p1 ) + λ2 h(p2 ) (1.6.14)
for any probability vectors p j = (p j (1), . . . , p j (n)), j = 1, 2, and any λ1 , λ2 ∈ (0, 1)
with λ1 + λ2 = 1. Let X1 have distribution p1 and X2 have distribution p2 . Let also
Z = 1, with probability λ1 or 2, with probability λ2 ,
and Y = XZ . Then the distribution of Y is λ1 p1 + λ2 p2 . By Theorem 1.2.11(a),
h(Y ) ≥ h(Y |Z),
and by the definition of the conditional entropy
h(Y |Z) = λ1 h(X1 ) + λ2 h(X2 ).
This yields (1.6.14). Now
 
I(X : Y ) = h(Y ) − h(Y |X) = h(Y ) − ∑ pX (i)h PY |X (.|i) . (1.6.15)
If PY |X (.|.) are fixed, the second term is a linear function of pX , hence concave. The
first term, h(Y ), is a concave function of pY which in turn is a linear function of
pX . Thus, h(Y ) is concave in pX , and so is I(X : Y ).

(b) Consider two cases, (i) p∗ ≥ 1/2 and (ii) p∗ ≤ 1/2. In case (i), by pooling
inequality,
1 1
h(X) ≥ h(p∗ , 1 − p∗ ) ≥ (1 − p∗ ) log ≥ (1 − p∗ ) log = 2(1 − p∗ )
p∗ (1 − p∗ ) 4
114 Essentials of Information Theory

as p∗ ≥ 1/2. In case (ii) we use induction in n, the number of values taken by X.


The initial step is n = 3: without loss of generality, assume that p∗ = p1 ≥ p2 ≥ p3 .
Then 1/3 ≤ p1 < 1/2 and (1 − p1 )/2 ≤ p2 ≤ p1 . Write
p2
h(p1 , p2 , p3 ) = h(p1 , 1 − p1 ) + (1 − p1 )h(q, 1 − q), where q = .
1 − p1
As 1/2 ≤ q ≤ p1 (1 − p1 ) ≤ 1,
h(q, 1 − q) ≥ h(p1 /(1 − p1 ), (1 − 2p1 )/(1 − p1 )),
i.e.
h(p1 , p2 , p3 ) ≥ h(p1 , p1 , 1 − 2p1 ) = 2p1 + h(2p1 , 1 − 2p1 ).
The inequality 2p1 + h(2p1 , 1 − 2p1 ) ≥ 2(1 − p1 ) is equivalent to
h(2p1 , 1 − 2p1 ) > 2 − 4p1 , 1/3 ≤ p1 < 1/2,
or to
h(p, 1 − p) > 2 − 2p, 2/3 ≤ p ≤ 1,
which follows from (a). Thus, for n = 3, h(p1 , p2 , p3 ) ≥ 2(1 − p∗ ) regardless of the
value of p∗ . The initial induction step is completed.
Make the induction hypothesis h(X) ≥ 2(1 − p∗ ) for the number of values of X
which is ≤ n − 1. Then take p = (p1 , . . . , pn ) and assume without loss of generality
that p∗ = p1 ≥ · · · ≥ pn . Write q = (p2 /(1 − p1 ), . . . , pn−1 /(1 − p1 )) and
h(p) = h(p1 , 1 − p1 ) + (1 − p1 )h(q) ≥ h(p1 , 1 − p1 ) + (1 − p1 )2(1 − q1 ). (1.6.16)
The inequality h(p) ≥ 2(1 − p∗ ) will follow from
h(p1 , 1 − p1 ) + (1 − p1 )(2 − 2q1 ) ≥ (2 − 2p1 )
which is equivalent to
h(p1 , 1 − p1 ) ≥ 2(1 − p1 )(1 − 1 + q1 ) = 2(1 − p1 )q1 = 2p2 ,
for 1/n ≤ p1 < 1/2, (1 − p1 )/(n − 1) ≤ p2 < p1 . But obviously
h(p1 , 1 − p1 ) ≥ 2(1 − p1 ) ≥ 2p2
(with equality at p1 = 0, 1/2). Inequality (1.6.16) follows from the induction
hypothesis.
Problem 1.14 Let a probability distribution pi , i ∈ I = {1, 2, . . . , n}, be such that
log2 (1/pi ) is an integer for all i with pi > 0. Interpret I as an alphabet whose letters
are to be encoded by binary words. A Shannon–Fano (SF) code assigns to letter i
a word of length i = log2 (1/pi ); by the Kraft inequality it may be constructed
1.6 Additional problems for Chapter 1 115

to be uniquely decodable. Prove the competitive optimality of the SF codes: if  i ,


i ∈ I , are the codeword-lengths of any uniquely decodable binary code then
   
P i <  i ≥ P  i < i , (1.6.17)

with equality iff i ≡  i .



Hint: You may find useful the inequality sgn( −  ) ≤ 2− − 1, ,  = 1, . . . , n.

Solution Write

P( i < i ) − P( i > i ) = ∑ pi − ∑ pi


i: i <i i: i >i

= ∑ pi sign (i −  i )
i
 
= E sgn( −  ) ≤ E 2− − 1 ,

as sign x ≤ 2x − 1 for integer x. Continue the argument with


   
E 2− − 1 = ∑ pi 2i −i − 1
i
 
= ∑ 2−i 2i −i − 1 = ∑ 2− − ∑ 2−
i i

i i i
≤ 1 − ∑ 2−i = 1 − 1 = 0
i

by the Kraft inequality. This yields the inequality


   
P i <  i ≥ P  i < i .

To have equality, we must have (a) 2i −i − 1 = 0 or 1, i ∈ I (because sign x = 2x − 1

only for x = 0 or 1), and (b) ∑ 2−i = 1. As ∑ 2−i = 1, the only possibility is
i i

2i −i ≡ 1, i.e. i =  i .

Problem 1.15 Define the capacity C of a binary channel. Let CN =


(1/N) sup I(X : Y ), where I(X(N) : Y(N) ) denotes the mutual entropy between
(N) (N)

X(N) , the random word of length N sent through the channel, and Y(N) , the received
word, and where the supremum is over the probability distribution of X(N) . Prove
that C ≤ lim supN→∞ CN .

Solution A binary channel is defined as a sequence of conditional probability dis-


tributions
(N)
Pch (y(N) |x(N) ), N = 1, 2, . . . ,
116 Essentials of Information Theory

where x(N) = x1 . . . xN is a binary word (string) at the input and y(N) = y1 . . . yN a


binary word (string) at the output port. The channel capacity C is an asymptotic
* (N) +
parameter of the family Pch ( · | · ) defined by

C = sup R ∈ (0, 1) : R is a reliable transmission rate . (1.6.18)


Here, a number R ∈ (0, 1) is called a reliable transmission rate (for a given channel)
if, given that the random source string is equiprobably distributed over a set U (N)
with  U (N) = 2N[R+O(1)] , there exist an encoding rule f (N) : U (N) → XN ⊆ {0, 1}N
and a decoding rule f(N) : {0, 1}N → U (N) such that the error probability e(N) → 0
as N → ∞ is given by
1 ( )
e(N) := ∑ P
(N)
y (N)
: (N) y(N) = u | f (N) (u) ;
f (1.6.19)
u∈U (N)
 U (N) ch
note
e(N) = e(N) f (N) , f(N) .
The converse part of Shannon’s second coding theorem (SCT) states that
1  
C ≤ lim sup sup I X(N) : Y(N) , (1.6.20)
N→∞ N P (N)
X
 
where I X(N) : Y(N) is the mutual entropy between the random input and output
strings X(N) and Y(N) and PX(N) is a distribution of X(N) .
For the proof, it suffices to check that if  U (N) = 2N[R+O(1)] then, for all f (N)
and f(N) ,
CN + o(1)
e(N) ≥ 1 − (1.6.21)
R + o(1)
where
1  
CN = sup I X(N) : Y(N) .
N P (N)
X

Indeed, if R > lim supN→∞ CN then, according to (1.6.21)


lim infN→∞ inf f (N) , f(N) e(N) > 0 and R is not reliable.
that f (N) is lossless. Then
To prove (1.6.21), assume, without loss of generality, 
the input word x(N) is equidistributed, with probability 1  U (N) . For all decod-
ing rules f(N) and any N large enough,
    
clNCN ≥ I X(N) : Y(N) ≥ I X(N) : f Y(N)
    
= h X(N) − h X(N) | f Y(N)
  
= log  U (N) − h X(N) | f Y(N)
 
≥ log  U (N) − 1 − ε (N) log  U (N) − 1 . (1.6.22)
1.6 Additional problems for Chapter 1 117

The last bound here follows from the generalised Fano inequality
      
h X(N) | f Y(N) ≤ −e(N) log e(N) − 1 − e(N) log 1 − e(N)
+e(N) log U (N) − 1 
≤ 1 + e(N) log  U (N) − 1 .
Now, from (1.6.22),


NCN ≥ N R + o(1) − 1 − e(N) log 2N[R+o(1)] − 1 ,

i.e.

N R + o(1) − NCN − 1 CN + o(1)


e(N)
≥   = 1− ,
log 2N[R+o(1)] − 1 R + o(1)
as required.
Problem 1.16 A memoryless channel has input 0 and 1, and output 0, 1 and ∗
(illegible). The channel matrix is given by
P(0|0) = 1, P(0|1) = P(1|1) = P(∗|1) = 1/3.
Calculate the capacity of the channel and the input probabilities pX (0) and pX (1)
for which the capacity is achieved.
Someone suggests that, as the symbol ∗ may occur only from 1, it is to your
advantage to treat ∗ as 1: you gain more information from the output sequence, and
it improves the channel capacity. Do you agree? Justify your answer.

Solution Use the formula


C = sup I(X : Y ) = sup h(Y ) − h(Y |X) ,


pX pX

where pX is the distribution of the input symbol:


pX (0) = p, pX (1) = 1 − p, 0 ≤ p ≤ 1.

So, calculate I(X : Y ) as a function of p:


h(Y ) = −pY (0) log pY (0) − pY (1) log pY (1) − pY () log pY ().
Here
pY (0) = p + (1 − p)/3 = (1 + 2p)/3,
pY (1) = pY () = (1 − p)/3,
and
1 + 2p 1 + 2p 2(1 − p) 1− p
h(Y ) = − log − log .
3 3 3 3
118 Essentials of Information Theory

Also,
h(Y |X) = − ∑ pX (x) ∑ P(y|x) log P(y|x)
x=0,1 y
= −pX (1) log 1/3 = (1 − p) log 3.
Thus,
1 + 2p 1 + 2p 2(1 − p) 1− p
I(X : Y ) = − log − log − (1 − p) log 3.
3 3 3 3
Differentiating yields
d
I(X : Y ) = −2/3(log (1/3 + 2p/3) + 2/3 log (1/3 − p/3) + log 3.
dp
Hence, the maximum max I(X : Y ) is found from relation
2 1− p
log + log 3 = 0.
3 1 + 2p
This yields
1− p 3
log = − log 3 := b,
1 + 2p 2
and
1− p  
= 2b , i.e. 1 − 2b = p 1 + 2b+1 .
1 + 2p
The answer is
1 − 2b
p= .
1 + 2b+1
For the last part, write

I(X : Y ) = h(X) − h(X|Y ) ≤ h(X) − h(X|Y ) = I(X : Y )

for any Y that is a function of Y ; the equality holds iff Y and X are conditionally
independent, given Y . It is the case of our channel, hence the suggestion leaves the
capacity the same.

Problem 1.17 (a) Given a pair of discrete random variables X , Y , define the
joint and conditional entropies h(X,Y ) and h(X|Y ).
(b) Prove that h(X,Y ) ≥ h(X|Y ) and explain when equality holds.
(c) Let 0 < δ < 1, and prove that
 
h(X|Y ) ≥ log(δ −1 ) P(q(X,Y ) ≤ δ ),

where q(x, y) = P(X = x|Y = y). For which δ and for which X , Y does equality
hold here?
1.6 Additional problems for Chapter 1 119

Solution (a) The conditional entropy is given by


h(X|Y ) = −E log q(x, y) = − ∑ P(X = x,Y = y) log q(x, y)
x,y

where
q(x, y) = P(X = x|Y = y).
The joint entropy is given by
h(X,Y ) = − ∑ P(X = x,Y = y) log P(X = x,Y = y).
x,y

(b) From the definition,


h(X,Y ) = h(X|Y ) − ∑ P(Y = y) log P(Y = y) ≥ h(X|Y ).
y

The equality in (b) is achieved iff h(Y ) = 0, i.e. Y is constant a.s.


(c) By Chebyshev’s inequality,
 
P q(X,Y ) ≤ δ ) = P − log q(X,Y ≥ log 1/δ )
1
1
≤ E − log q(X,Y ) = h(X|Y ).
log 1/δ log 1/δ
Here equality holds iff

P q(X,Y ) = δ ) = 1.
This requires that (i) δ = 1/m where m is a positive integer and (ii) for all y ∈
support Y , there exists a set Ay of cardinality m such that
1
P(X = x|Y = y) = , for x ∈ Ay .
m

Problem 1.18 A text is produced by a Bernoulli source with alphabet 1, 2, . . . , m


and probabilities p1 , p2 , . . . , pm . It is desired to send this text reliably through a
memoryless binary symmetric channel (MBSC) with the row error-probability p∗ .
Explain what is meant by the capacity C of the channel, and show that
C = 1 − h(p∗ , 1 − p∗ ).
Explain why reliable transmission is possible if
h(p1 , p2 , . . . , pm ) + h(p∗ , 1 − p∗ ) < 1
and is impossible if
h(p1 , p2 , . . . , pm ) + h(p∗ , 1 − p∗ ) > 1,
m
where h(p1 , p2 , . . . , pm ) = − ∑ pi log2 pi .
i=1
120 Essentials of Information Theory

Solution The asymptotic equipartition property for a Bernoulli source states that
the number of distinct strings (words) of length n emitted by the source is ‘typi-
cally’ 2nH+o(n) , and they have ‘nearly equal’ probabilities 2−nH+o(n) :

lim P 2−n(H+ε ) ≤ Pn (U(n) ) ≤ 2−n(H−ε ) = 1.
n→∞

Here, H = h(p1 , . . . , pn ).
Denote
( )
Tn (= Tn (ε )) = u(n) : 2−n(H+ε ) ≤ Pn (u(n) ) ≤ 2−n(H−ε )

and observe that


1 1
lim log  Tn = H, i.e. lim sup log  Tn < H + ε .
n→∞ n n→∞ n

By the definition of the channel capacity, the words u(n) ∈ Tn (ε ) may be encoded
−1
by binary codewords of length R (H + ε ) and sent reliably through a memoryless
symmetric channel with matrix
 
1 − p∗ p∗
p∗ 1 − p∗

for any R < C where

C = sup I(X : Y ) = sup[h(Y ) − h(Y |X)].


pX pX

The supremum here is taken over all distributions

pX = (pX (0), pX (1))

of the input binary symbol X; the conditional distribution of the output symbol Y
is given by
'
1 − p∗ , y = x,
P(Y = y|X = x) =
p∗ , y = x.

We see that

h(Y |X) = −pX (0) − (1 − p∗ ) log(1 − p∗ ) − p∗ log p∗


+pX (1) − p∗ log p∗ − (1 − p∗ ) log(1 − p∗ ) = h(p∗ , 1 − p∗ ),


1.6 Additional problems for Chapter 1 121

independently of pX . Hence,

C = sup h(Y ) − h(p∗ , 1 − p∗ ) = 1 − h(p∗ , 1 − p∗ ),


pX

because h(Y ) is achieved for

pY (0) = pY (1) = 1/2 (occurring when pX (0) = pX (1) = 1/2).

Therefore, if

H <C ⇔ h(p1 , . . . , pn ) + h(p∗ , 1 − p∗ ) < 1,


−1
then R (H + ε ) can be made < 1, for ε > 0 small enough and R < C close to
C. This means that there exists a sequence of codes fn of length n such that the
error-probability, while using encoding fn and the ML decoder, is
 
≤ P u(n) ∈ Tn 
+P u(n) ∈ Tn ; an error while using fn (u(n) ) and the ML decoder
→ 0, as n → ∞,

since both probabilities go to 0.


On the other hand,

H >C ⇔ h(p1 , . . . , pn ) + h(p∗ , 1 − p∗ ) > 1,


−1
then R H > 1 for all R < C, and we cannot encode words u(n) ∈ Tn by codewords
of length n so that the error-probability tends to 0. Hence, no reliable transmission
is possible.

Problem 1.19 A Markov source with an alphabet of m characters has a transition


matrix Pm whose elements p jk are specified by

p11 = pmm = 2/3, p j j = 1/3 (1 < j < m),


p j j+1 = 1/3 (1 ≤ j < m), p j j−1 = 1/3 (1 < j ≤ m).

All other elements are zero. Determine the information rate of the source.
Denote the transition matrix thus specified by Pm . Consider
 a source in an
Pm 0
alphabet of m + n characters whose transition matrix is , where the zeros
0 Pn
indicate zero matrices of appropriate size. The initial character is supposed uni-
formly distributed over the alphabet. What is the information rate of the source?
122 Essentials of Information Theory

Solution The transition matrix


⎛ ⎞
2/3 1/3 0 0 ... 0
⎜1/3 1/3 1/3 0 . . . 0 ⎟
⎜ ⎟
⎜ ⎟
Pm = ⎜ 0 1/3 1/3 1/3 . . . 0 ⎟
⎜ . . . . . . ⎟
⎝ .. .. .. .. .. .. ⎠
0 0 0 0 . . . 2/3
is Hermitian and so has the equilibrium distribution π = (πi ) with πi = 1/m, 1 ≤
i ≤ m (equidistribution). The information rate equals
Hm = − ∑ π j p jk log p jk
j,k    
1 2 2 1 1 1 1
=− 2 log + log + 3(m − 2) log
m 3 3 3 3 3 3
4
= log 3 − .
3m
 
Pm 0
The source with transition matrix is non-ergodic, and its information
0 Pn
rate is the maximum of the two rates

max Hm , Hn = Hm∨n .

Problem 1.20 Consider a source in a finite alphabet. Define Jn = n−1 h(U(n) )


and Kn = h(Un+1 |U(n) ) for n = 1, 2, . . .. Here Un is the nth symbol in the sequence
and U(n) is the string constituted by the first n symbols, h(U(n) ) is the entropy and
h(Un+1 |U(n) ) the conditional entropy. Show that, if the source is stationary, then Jn
and Kn are non-increasing and have a common limit.
Suppose the source is Markov and not necessarily stationary. Show that the
mutual information between U1 and U2 is not smaller than that between U1 and U3 .

Solution For the second part, the Markov property implies that
P(U1 = u1 |U2 = u2 ,U3 = u3 ) = P(U1 = u1 |U2 = u2 ).
Hence,
P(U1 = u1 |U2 = u2 ,U3 = u3 )

I(U1 : (U2 ,U3 )) = E − log


P(U1 = u1 )
P(U1 = u1 |U2 = u2 )

= E − log = I(U1 : U2 ).
P(U1 = u1 )
Since
I(U1 : (U2 ,U3 )) ≥ I(U1 : U3 ),
the result follows.
1.6 Additional problems for Chapter 1 123

Problem 1.21 Construct a Huffman code for a set of 5 messages with probabil-
ities as indicated below

Message 1 2 3 4 5
Probability 0.1 0.15 0.2 0.26 0.29

Solution
Message 1 2 3 4 5
Probability 0.1 0.15 0.2 0.026 0.029
Codeword 101 100 11 01 00

The expected codeword-length equals 2.4.

Problem 1.22 State the first coding theorem (FCT), which evaluates the infor-
mation rate for a source with suitable long-run properties. Give an interpretation of
the FCT as an asymptotic equipartition property. What is the information rate for a
Bernoulli source?
Consider a Bernoulli source that emits symbols 0, 1 with probabilities 1 − p and
p respectively, where 0 < p < 1. Let η (p) = −p log p − (1 − p) log(1 − p) and let
ε > 0 be fixed. Let U(n) be the string consisting of the first n symbols emitted by
the source. Prove that there is a set Sn of possible values of U(n) such that
 2
 (n)  p p(1 − p)
P U ∈ Sn ≥ 1 − log ,
1− p nε 2
 
and so that for each u(n) ∈ Sn the probability that P U(n) = u(n) lies between
2−n(h+ε ) and 2−n(h−ε ) .

Solution For the Bernoulli source


1 1
− log Pn (U(n) ) = − ∑ log P(U j ) → η (p),
n n 1≤ j≤n

in the sense that for all ε > 0, by Chebyshev,


  
 1 

P − log Pn (U ) − h > ε
(n) 
n
 
1
≤ 2 2 Var
ε n ∑ log P(U j )
1≤ j≤n
1
= Var [log P(U1 )] . (1.6.23)
ε 2n
124 Essentials of Information Theory

Here '
1 − p, if U j = 0,
P(U j ) =
p, if U j = 1,
Pn (U(n) ) = ∏ P(U j ),
1≤ j≤n

and  
 
Var ∑ log P(U j ) = ∑ Var log P(U j )
1≤ j≤n 1≤ j≤n

where
 
2
2
Var log P(U j ) = E log P(U j ) − E log P(U j )
 2
= p(log p)2 + (1 − p)(log(1 − p))2 − p log p + (1 − p) log(1 − p)
 2
p
= p(1 − p) log .
1− p
Hence, the bound (1.6.23) yields
 
P 2−n(h+ε ) ≤ Pn (U(n) ) ≤ 2−n(h−ε )
 2
1 p
≥ 1 − 2 p(1 − p) log .
nε 1− p
It now suffices to set
Sn = {u(n) = u1 . . . un : 2−n(h+ε ) ≤ P(U(n) = u(n) ) ≤ 2−n(h−ε ) },
and the result follows.
Problem 1.23 The alphabet {1, 2, . . . , m} is to be encoded by codewords with
letters taken from an alphabet of q < m letters. State Kraft’s inequality for the word-
lengths s1 , . . . , sm of a decipherable code. Suppose that a source emits letters from
the alphabet {1, 2, . . . , m}, each letter occurring with known probability pi > 0. Let
S be the random codeword-length resulting from the letter-by-letter encoding of the
source output. It is desired to find a decipherable code that minimises the expected
 2
 S √
value of q . Establish the lower bound E q ≥
S
∑ pi , and characterise
1≤i≤m
when equality occurs.  
Prove also that an optimal code for the above criterion must satisfy E qS <
 2

q ∑ pi .
1≤i≤m
Hint: Use the Cauchy–Schwarz inequality: for all positive xi , yi ,
 1/2  1/2
∑ xi yi ≤ ∑ xi2 ∑ y2i ,
1≤i≤m 1≤i≤m 1≤i≤m

with equality iff xi = cyi for all i.


1.6 Additional problems for Chapter 1 125

Solution By Cauchy–Schwarz,

∑ pi = ∑ pi qsi /2 q−si /2
1/2 1/2
1≤i≤m 1≤i≤m
 1/2  1/2  1/2
≤ −s ≤
∑ pi qsi
∑ q i
∑ pi qsi ,
1≤i≤m 1≤i≤m 1≤i≤m

since, by Kraft, ∑ q−si ≤ 1. Hence,


1≤i≤m
 2
∑ ∑
1/2
EqS = pi qsi ≥ pi .
1≤i≤m 1≤i≤m

Now take the probabilities pi to be

pi = (cq−xi )2 , xi > 0,

∑ q−xi = 1 (so,
1/2
where ∑ pi = c). Take si to be the smallest integer ≥ xi .
1≤i≤m 1≤i≤m
Then ∑ q−si ≤ 1 and, again by Kraft, there exists a decipherable coding with
1≤i≤m
1/2
the codeword-length si . For this code, qsi −1 < qxi = c pi , and hence

EqS = ∑ pi qsi = q ∑ pi qsi −1


1≤i≤m 1≤i≤m
 2
∑ ∑ ∑
1/2 1/2
<q pi qxi = qc pi = q pi .
1≤i≤m 1≤i≤m 1≤i≤m

Problem 1.24 A Bernoulli source of information of rate H is fed character-


by-character into a transmission line which may be live or dead. If the line is live
when a character is transmitted then that character is received faithfully; if the line
is dead then the receiver learnt only that it is indeed dead. In shifting between
its two states the line follows a Markov chain (DTMC) with constant transition
probabilities, independent of the text being transmitted.
Show that the information rate of the source constituted by the received signal
is HL + πL HS where HS is the signal, HL is the information rate of the DTMC
governing the functioning of the line and πL is the equilibrium probability that the
line is alive.

Solution The rate of a Bernoulli source emitting letter j = 1, 2, . . . with probability


p j is H = − ∑ p j log p j . The state of the line is a DTMC with a 2 × 2 transition
j
126 Essentials of Information Theory

matrix
 
dead 1−α α
live β 1−β
and the equilibrium probabilities
β α
1 − πL (dead) = , πL (live) =
α +β α +β
(assuming that α + β > 0). The received signal sequence follows a DTMC with
states 0 (dead), 1, 2, . . . and transition probabilities
q00 = 1 − α , q0 j = α p j ,
j, k ≥ 1.
q j0 = β , q jk = (1 − β )pk
This chain has a unique equilibrium distribution
β α
πRS (0) = , πRS ( j) = p j , j ≥ 1.
α +β α +β
Then the information rate of the received signal equals
HRS = − ∑ πRS ( j)q jk log q jk
j,k≥0  
β
=− (1 − α ) log(1 − α ) + ∑ α p j log(α p j )
α +β
  j≥1 
α  

α + β j≥1 ∑ p j β log β + (1 − β ) ∑ pk log (1 − β )pk
k≥1
α
= HL + HS .
α +β
Here HL is the entropy rate of the line state DTMC:
β

HL = − (1 − α ) log(1 − α ) + α log α
α +β
α

− (1 − β ) log(1 − β ) + β log β ,
α +β
and π = α /(α + β ).
Problem 1.25 Consider a Bernoulli source in which the individual character
can take value i with probability pi (i = 1, . . . , m). Let ni be the number of times the
character value i appears in the sequence u(n) = u1 u2 . . . un of given length n. Let
An be the smallest set of sequences u(n) which has total probability at least 1 − ε.
Show that each sequence in An satisfies the inequality

− ∑ ni log pi ≤ nh + nk/ε )1/2 ,
1.6 Additional problems for Chapter 1 127

where k is a constant independent of n or ε. State (without proof) the analogous


assertion for a Markov source.

Solution For a Bernoulli source with letters 1, . . . , m, the probability of a given


string u(n) = u1 u2 . . . un is

P(U(n) = u(n) ) = ∏ pni i .


1≤i≤m

Set An consists of strings of maximal probabilities (selected in the decreasing


order), i.e. of maximal value of log P(U(n) = u(n) ) = ∑ ni log pi . Hence,
1≤i≤m
' @
An = u(n) : − ∑ ni log pi ≤ c ,
i

for some (real) c, to be determined. To determine c, we use that

P(An ) ≥ 1 − ε .

Hence, c is the value for which


 
P u (n)
: − ∑ ni log pi ≥ c < ε.
1≤i≤m

Now, for the random string U(n) = U1 . . . Un , let Ni is the number of appearances
of value i. Then

− ∑ Ni log pi = ∑ θ j , where θ j = − log pi when U j = i.


1≤i≤m 1≤ j≤n

Since entries U j are IID, so are random variables θ j . Next,

Eθ j = − ∑ pi log pi := h
1≤i≤m

and
 2
 2
Var θ j = E(θ j )2 − Eθ j = ∑ pi (log pi )2 − ∑ pi log pi := v.
1≤i≤m 1≤i≤m

Then
   
E ∑ θ j = nh and Var ∑ θ j = nv.
1≤ j≤n 1≤ j≤n

Recall that h = H is the information rate of the source.


128 Essentials of Information Theory

By Chebyshev’s inequality, for all b > 0,


  
 
  nv
P − ∑ Ni log pi − nh > b ≤ 2 ,
 1≤i≤m  b
$
and with b = nk/ε , we obtain
  ? 
 
  nk
P − ∑ Ni log pi − nh > ≤ ε.
 1≤i≤m  ε

Therefore, for all u(n) ∈ An ,


?
nk
− ∑ ni log pi ≤ nh +
ε
:= c.
1≤i≤m

For an irreducible and aperiodic Markov source the assertion is similar, with
H =− ∑ πi pi j log pi j ,
1≤i, j≤m
 
1
and v ≥ 0 a constant given by v = lim sup Var
n→∞ n
∑ θj .
1≤ j≤n

Problem 1.26 Demonstrate that an efficient and decipherable noiseless coding


procedure leads to an entropy as a measure of attainable performance.
Words of length si (i = 1, . . . , n) in an alphabet Fa = {0, 1, . . . , a − 1} are to be
n
chosen to minimise expected word-length ∑ pi si subject not only to decipherabil-
i=1
n
ity but also to the condition that ∑ qi si should not exceed a prescribed bound,
i=1
where qi is a feasible alternative to the postulated probability distribution {pi }
of characters in the original alphabet. Determine bounds on the minimal value of
n
∑ pi si .
i=1

Solution If we disregard the condition that s1 , . . . , sn are positive integers, the min-
imisation problem becomes
minimise ∑ si pi
i (1.6.24)
subject to si ≥ 0 and ∑ a−si ≤ 1 (Kraft).
i

This can be solved by the Lagrange method, with the Lagrangian


 
L (s1 , . . . , sn , λ ) = ∑ si pi − λ 1− ∑ a−si .
1≤i≤n 1≤i≤n
1.6 Additional problems for Chapter 1 129

The solution of the relaxed problem is unique and given by

si = − loga pi , 1 ≤ i ≤ n. (1.6.25)

The relaxed optimal value vrel ,

vrel = − ∑ pi loga pi := h,
1≤i≤n

provides a lower bound for the optimal expected word-length ∑ s∗i pi :


i

h ≤ ∑ s∗i pi .
i

Now consider the additional constraint

∑ qi si ≤ b. (1.6.26)
1≤i≤n

The relaxed problem (1.6.24) complemented with (1.6.26) again can be solved by
the Lagrange method. Here, if

− ∑ qi loga pi ≤ b
i

then adding the new constraint does not affect the minimiser (1.6.24), i.e. the
optimal positive s1 , . . . , sn are again given by (1.6.25), and the optimal value is h.
Otherwise, i.e. when − ∑ qi loga pi > b, the new minimiser s1 , . . . , sn is still unique
i
(since the problem is still strong Lagrangian) and fulfils both constraints

∑ a−s i
= 1, ∑ qi si = b.
i i

In both cases, the optimal value vrel for the new relaxed problem satisfies h ≤ vrel .
Finally, the solution s∗1 , . . . , s∗n to the integer-valued word-length problem

minimise ∑ si pi
i (1.6.27)
subject to si ≥ 1 integer and ∑ a−si ≤ 1, ∑ qi si ≤ b
i i

will satisfy
h ≤ vrel ≤ ∑ s∗i pi , ∑ s∗i qi ≤ b.
i i

Problem 1.27 Suppose a discrete Markov source {Xt } has transition probability
p jk = P(Xt+1 = k|Xt = j)
130 Essentials of Information Theory

with equilibrium distribution (π j ). Suppose the letter can be obliterated by noise


(in which case one observes only the event ‘erasure’) with probability β = 1 − α,
independent of current or previous letter values or previous noise. Show that the
noise-corrupted source has information rate

−α log α − β log β − α 2 ∑ ∑ ∑ π j β s−1 p jk log p jk ,


(s) (s)

j k s≥1

(s)
where p jk is the s-step transition probability of the original DTMC.

Solution Denote the corrupted source sequence {Xt }, with Xt = ∗ (a splodge) every
time there was an erasure. Correspondingly, a string x1n from the corrupted source
is produced from a string x1n of the original Markov source
by replacing
the oblit-
erated digits with splodges. The probability pn (x) = P X1n = x1n of such a string
is represented as

∑ P(X1n = x1n )P(X1n |X1n = x1n ) (1.6.28)


x1n consistent with x1n

and is calculated as the product where the initial factor is

∑ λy pyx β s−1 α , where 1 < s ≤ n, or 1,


(s)
λx1 α or s
y

depending on where the initial non-obliterated digit occurred in x1n (if at all). The
subsequent factors contributing to (1.6.28) have a similar structure:
(s)
pxt−1 xt β or pxt−s xt β s−1 α or 1.

Consequently, the information − log pn (x1n ) carried by string x1n is calculated as

− log P(Xs1 = xs1 ) − (s1 − 1) log β − log α


s −s )
− log px2s1 xs21 − (s2 − s1 − 1) log β − log α − · · ·
(sN −sN−1 )
− log pxsN−1 xsN − (sN − sN−1 − 1) log β − log α

where 1 ≤ s1 < · · · < sN ≤ n are the consecutive times of appearance of non-


obliterated symbols in x1n .
1
Now take − log pn X1n , the information rate provided by the random string
n
X1n . Ignoring the initial bit, we can write

1 N(β ) N(α ) M(i, j; s)


log α − ∑
(s)
− log pn X1n = − log β − log pi j .
n n n i n
1.6 Additional problems for Chapter 1 131

Here
N(α ) = number of non-obliterated digits in X1n ,
N(β ) = number of obliterated digits in X1n ,
M(i, j; s) = number of series of digits i ∗ · · · ∗ j in X1n of length s + 1
As n → ∞, we have the convergence of the limiting frequencies (the law of large
numbers applies):
N(α ) N(β ) M(i, j; s) (s)
→ α, → β, → αβ s−1 πi pi j α .
n n n
This yields
1
− log pn X1n
n
(s) (s)
→ −α log α − β log β − α 2 ∑ πi ∑ β s−1 pi j log pi j ,
i, j s≥1

as required. [The convergence holds almost surely (a.s.) and in probability.]


According to the SCT, the limiting value gives the information rate of the corrupted
source.
Problem 1.28 A binary source emits digits 0 or 1 according to the rule
P(Xt = k|Xt−1 = j, Xt−2 = i) = qr ,

where k, j, i and r take values 0 or 1, r = (k − j − i) mod 2, and q_0 + q_1 = 1. Determine


the information rate of this source.
Also derive the information rate of a Bernoulli source emitting digits 0 and 1
with probabilities q0 and q1 . Explain the relationship between these two results.

Solution Re-write the conditional probabilities in a detailed form:

   P(X_t = 0 | X_{t−1} = j, X_{t−2} = i) = q_0 if i = j,  and q_1 if i ≠ j;
   P(X_t = 1 | X_{t−1} = j, X_{t−2} = i) = q_1 if i = j,  and q_0 if i ≠ j.

The source is a second-order Markov chain on {0, 1}, i.e. a DTMC with four states {00, 01, 10, 11}. The 4 × 4 transition matrix (rows and columns listed in the order 00, 01, 10, 11) is

   ( q_0  q_1   0    0  )
   (  0    0   q_1  q_0 )
   ( q_1  q_0   0    0  )
   (  0    0   q_0  q_1 )

The equilibrium probabilities are uniform:


1
π00 = π01 = π10 = π11 = .
4
The information rate is calculated in a standard way:

   H = − ∑_{α,β=0,1} π_{αβ} ∑_{γ=0,1} p_{αβ,βγ} log p_{αβ,βγ} = (1/4) ∑_{α,β=0,1} h(q_0, q_1) = −q_0 log q_0 − q_1 log q_1.

A Bernoulli source emitting 0 and 1 with probabilities q_0 and q_1 has the same information rate h(q_0, q_1) = −q_0 log q_0 − q_1 log q_1: conditioning on the two previous digits merely permutes the probabilities q_0, q_1 of the next digit, so the entropy of each new digit, and hence the rate, is unchanged.
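A short numerical check (not in the original) builds the four-state chain for a given q_0 and confirms that its rate equals h(q_0, q_1).

import numpy as np

q0 = 0.3; q1 = 1 - q0
# states ordered 00, 01, 10, 11
P = np.array([[q0, q1, 0, 0],
              [0, 0, q1, q0],
              [q1, q0, 0, 0],
              [0, 0, q0, q1]])
pi = np.full(4, 0.25)                                  # uniform equilibrium distribution
plogp = np.where(P > 0, P * np.log2(np.where(P > 0, P, 1.0)), 0.0)
H = -np.sum(pi[:, None] * plogp)
print(H, -q0*np.log2(q0) - q1*np.log2(q1))             # both ~0.8813 for q0 = 0.3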

Problem 1.29 An input to a discrete memoryless channel has three letters 1, 2


and 3. The letter j is received as ( j − 1) with probability p, as ( j +1) with probabil-
ity p and as j with probability 1 − 2p, the letters from the output alphabet ranging
from 0 to 4. Determine the form of the optimal input distribution, for general p,
as explicitly as possible. Compute the channel capacity in the three cases p = 0,
p = 1/3 and p = 1/2.

Solution The channel matrix is 3 × 5 (rows indexed by the input letters 1, 2, 3, columns by the output letters 0, 1, 2, 3, 4):

   ( p   1−2p    p     0     0 )
   ( 0     p   1−2p    p     0 )
   ( 0     0     p   1−2p    p )

The rows are permutations of each other, so the capacity equals

   C = max_{P_X} [h(Y) − h(Y|X)] = (max_{P_X} h(Y)) + 2p log p + (1 − 2p) log(1 − 2p),

with the maximisation over the input-letter distribution P_X applied only to h(Y), the entropy of the output symbol.
Next,

   h(Y) = − ∑_{y=0,1,2,3,4} P_Y(y) log P_Y(y),

where

   P_Y(0) = P_X(1) p,
   P_Y(1) = P_X(1)(1 − 2p) + P_X(2) p,
   P_Y(2) = P_X(1) p + P_X(2)(1 − 2p) + P_X(3) p,      (1.6.29)
   P_Y(3) = P_X(3)(1 − 2p) + P_X(2) p,
   P_Y(4) = P_X(3) p.
The symmetry in (1.6.29) suggests that h(Y) is maximised when P_X(1) = P_X(3) = q and P_X(2) = 1 − 2q. So:

   max h(Y) = max_q [ −2qp log(qp)
        − 2 (q(1 − 2p) + (1 − 2q)p) log (q(1 − 2p) + (1 − 2q)p)
        − (2qp + (1 − 2q)(1 − 2p)) log (2qp + (1 − 2q)(1 − 2p)) ].


To find the maximum, differentiate in q and set the derivative to zero (the terms not involving logarithms cancel, since the derivatives of P_Y(0) + · · · + P_Y(4) = 1 sum to zero):

   d h(Y)/dq = −2p log(qp) − 2(1 − 4p) log (q(1 − 2p) + (1 − 2q)p)
               − (6p − 2) log (2qp + (1 − 2q)(1 − 2p)) = 0.


For p = 0 we have a perfect error-free channel, of capacity log 3, which is achieved when P_X(1) = P_X(2) = P_X(3) = 1/3 (i.e. q = 1/3), and P_Y(1) = P_Y(2) = P_Y(3) = 1/3, P_Y(0) = P_Y(4) = 0.
For p = 1/3, the output probabilities are

   P_Y(0) = P_Y(4) = q/3,   P_Y(1) = P_Y(3) = (1 − q)/3,   P_Y(2) = 1/3,

and h(Y) simplifies to

   h(Y) = −2 (q/3) log (q/3) − 2 ((1 − q)/3) log ((1 − q)/3) − (1/3) log (1/3).

The equation d h(Y)/dq = 0 becomes

   −(2/3) log (q/3) + (2/3) log ((1 − q)/3) = 0

and is satisfied when q = 1/2, i.e.

   P_X(1) = P_X(3) = 1/2,  P_X(2) = 0,
   P_Y(0) = P_Y(1) = P_Y(3) = P_Y(4) = 1/6,  P_Y(2) = 1/3.
Next, the conditional entropy h(Y|X) = log 3. For the capacity this yields

   C = −(2/3) log (1/6) − (1/3) log (1/3) − log 3 = 2/3.
Finally, for p = 1/2, we have h(Y|X) = 1 and

   P_Y(0) = P_Y(4) = q/2,   P_Y(1) = P_Y(3) = (1 − 2q)/2,   P_Y(2) = q.

The output entropy is

   h(Y) = −q log (q/2) − (1 − 2q) log ((1 − 2q)/2) − q log q = 1 − q − 2q log q − (1 − 2q) log (1 − 2q),

with derivative d h(Y)/dq = −1 − 2 log q + 2 log (1 − 2q), which vanishes when (1 − 2q)/q = 2^{1/2}, i.e. q = 1/(2 + √2) = (2 − √2)/2 ≈ 0.293. Hence

   P_X(1) = P_X(3) = (2 − √2)/2,  P_X(2) = √2 − 1,
   P_Y(0) = P_Y(4) = (2 − √2)/4,  P_Y(1) = P_Y(3) = (√2 − 1)/2,  P_Y(2) = (2 − √2)/2.

At this q one finds h(Y) = 1/2 + log (2 + √2), so the capacity in this case equals

   C = log (2 + √2) − 1/2 ≈ 1.272.
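A quick numerical cross-check (not in the original; base-2 logarithms assumed) maximises I(X : Y) over the symmetric inputs P_X = (q, 1 − 2q, q) by grid search for the three values of p.

import numpy as np

def capacity(p, num=100001):
    rows = np.array([[p, 1-2*p, p, 0, 0],
                     [0, p, 1-2*p, p, 0],
                     [0, 0, p, 1-2*p, p]])
    r = rows[0][rows[0] > 0]
    hYX = -np.sum(r * np.log2(r))          # common row entropy h(Y|X)
    best = 0.0
    for q in np.linspace(1e-9, 0.5, num):  # symmetric inputs (q, 1-2q, q)
        PY = np.array([q, 1 - 2*q, q]) @ rows
        PY = PY[PY > 0]
        best = max(best, -np.sum(PY * np.log2(PY)) - hYX)
    return best

for p in (0.0, 1/3, 1/2):
    print(p, capacity(p))
# expected (up to grid resolution): log2(3) ~ 1.585, 2/3, log2(2 + sqrt(2)) - 1/2 ~ 1.2716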

Problem 1.30 A memoryless discrete-time channel produces outputs Y from


non-negative integer-valued inputs X by
Y = ε X,
where ε is independent of X , P(ε = 1) = p, P(ε = 0) = 1 − p, and inputs are
restricted by the condition that EX ≤ 1.
By considering input distributions {a_i, i = 0, 1, . . .} of the form a_i = c q^i,
i = 1, 2, . . ., or otherwise, derive the optimal input distribution and determine an
expression for the capacity of the channel.

Solution The channel matrix (rows indexed by the inputs 0, 1, 2, . . ., columns by the outputs 0, 1, 2, . . .) is

   (  1     0    0   . . . )
   ( 1−p    p    0   . . . )
   ( 1−p    0    p   . . . )
   (  ⋮     ⋮    ⋮    ⋱   )

For the input distribution with q_i = P(X = i), we have that

   P(Y = 0) = q_0 + (1 − p)(1 − q_0) = 1 − p + p q_0,
   P(Y = i) = p q_i,  i ≥ 1,

whence

   h(Y) = −(1 − p + p q_0) log (1 − p + p q_0) − ∑_{i≥1} p q_i log (p q_i).

With the conditional entropy being

   h(Y|X) = −(1 − q_0) [(1 − p) log (1 − p) + p log p],

the mutual entropy equals

   I(Y : X) = −(1 − p + p q_0) log (1 − p + p q_0) − ∑_{i≥1} p q_i log (p q_i) + (1 − q_0)[(1 − p) log (1 − p) + p log p].

We have to maximise I(Y : X) in q_0, q_1, . . ., subject to q_i ≥ 0, ∑_i q_i = 1, ∑_i i q_i ≤ 1.
First, we fix q_0 and maximise the sum −∑_{i≥1} p q_i log (p q_i) in q_i with i ≥ 1. By Gibbs, for all non-negative a_1, a_2, . . . with ∑_{i≥1} a_i = 1 − q_0,

   −∑_{i≥1} q_i log q_i ≤ −∑_{i≥1} q_i log a_i,  with equality iff q_i ≡ a_i.

For a_i = c d^i with ∑_{i≥1} i a_i = 1, the RHS becomes

   −(1 − q_0) log c − log d ∑_{i≥1} i a_i = −(1 − q_0) log c − log d.

From ∑_{i≥1} i c d^i = c d/(1 − d)² = 1 and ∑_{i≥1} c d^i = c d/(1 − d) = 1 − a_0 we find d = a_0 and c = (1 − a_0)²/a_0.
Next, we maximise, in a_0 ∈ [0, 1], the function

   f(a_0) = −(1 − p + p a_0) log (1 − p + p a_0) − p(1 − a_0) log ((1 − a_0)²/a_0) − p log a_0 + (1 − a_0)(1 − p) log (1 − p).

Requiring that

   f′(a_0) = 0      (1.6.30a)

and

   f″(a_0) = −p²/(1 − p + p a_0) − 2p/(1 − a_0) − p/a_0 ≤ 0,      (1.6.30b)

one can solve equation (1.6.30a) numerically. Denote its root where (1.6.30b) holds by a_0^−. Then we obtain the following answer for the optimal input distribution:

   a_0 = a_0^−,   a_i = (1 − a_0^−)² (a_0^−)^{i−1},  i ≥ 1,

with the capacity C = f(a_0^−).
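A minimal numerical sketch (not in the original) locates the maximiser of f by grid search rather than by solving (1.6.30a) analytically; the value p = 0.9 is an arbitrary illustration, and base-2 logarithms are assumed.

import numpy as np

def f(a0, p):
    t = 1 - p + p * a0
    return (-t * np.log2(t)
            - p * (1 - a0) * np.log2((1 - a0)**2 / a0)
            - p * np.log2(a0)
            + (1 - a0) * (1 - p) * np.log2(1 - p))

p = 0.9                                    # hypothetical value of P(eps = 1)
a = np.linspace(1e-6, 1 - 1e-6, 200001)
vals = f(a, p)
i = int(np.nanargmax(vals))
print("a0^- ~", a[i], "  C ~", vals[i])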

Problem 1.31 The representation known as binary-coded decimal encodes 0 as


0000, 1 as 0001 and so on up to 9, coded as 1001, with other 4-digit binary strings
being discarded. Show that by encoding in blocks, one can get arbitrarily near the
lower bound on word-length per decimal digit.
Hint: Assume all integers to be equally probable.

Solution The code in question is obviously decipherable (and even prefix-free, as is any decipherable code with a fixed codeword-length). The standard block-coding procedure treats a string of n letters from the original source (U_1, . . . , U_n), operating with an alphabet A, as a single letter from A^n. Given the joint probabilities p_n(i_1 . . . i_n) = P(U_1 = i_1, . . . , U_n = i_n) of the blocks in a typical message, we look at the binary entropy

   h^(n) = − ∑_{i_1,...,i_n} P(U_1 = i_1, . . . , U_n = i_n) log P(U_1 = i_1, . . . , U_n = i_n).

Denote by S^(n) the random codeword-length when the blocks are encoded by binary strings. The minimal expected word-length per source letter is e_n := (1/n) min E S^(n). By Shannon's noiseless coding theorem,

   h^(n)/n ≤ e_n ≤ h^(n)/n + 1/n,

so, for large n, e_n ∼ h^(n)/n. In the question the alphabet A consists of the q = 10 decimal digits and

   h^(n) = hn,  where h = log 10 (equiprobability).

Hence the minimal expected word-length per decimal digit can be made arbitrarily close to its lower bound log 10 ≈ 3.32 binary digits; equivalently, the ratio e_n/log q can be made arbitrarily close to 1, whereas plain binary-coded decimal spends 4 binary digits per decimal digit.
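A toy computation (not in the original) illustrates the claim: with equiprobable decimal blocks of length n, a fixed-length binary block code already spends ⌈n log 10⌉/n ≤ log 10 + 1/n bits per decimal digit.

import math

for n in (1, 3, 30, 300, 3000):
    bits_per_digit = math.ceil(n * math.log2(10)) / n
    print(n, bits_per_digit)
print("lower bound:", math.log2(10))    # ~3.3219; n = 1 corresponds to plain BCD (4 bits)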

Problem 1.32 Let {U_t} be a discrete-time process with values u_t and let P(u^(n)) be the probability that a string u^(n) = u_1 . . . u_n is produced. Show that if −(1/n) log P(U^(n)) converges in probability to a constant γ then γ is the information rate of the process.
Write down the formula for the information rate of an m-state DTMC and find the rate when the transition matrix has elements p_{jk} where

   p_{jj} = p,  p_{j j+1} = 1 − p  ( j = 1, . . . , m − 1),   p_{mm} = p,  p_{m1} = 1 − p.

Relate this to the information rate of a two-state source with transition probabilities
p and 1 − p.

Solution The information rate of an m-state stationary DTMC with transition


matrix P = (pi j ) and an equilibrium (invariant) distribution π = (πi ) equals

h = − ∑ πi pi j log pi j .
i, j

If matrix P is irreducible (i.e. has a unique communicating class) then this state-
ment holds for the chain with any initial distribution λ (in this case the equilibrium
distribution is unique).

In the example, the transition matrix is

   (  p   1−p    0   . . .   0 )
   (  0    p    1−p  . . .   0 )
   (  ⋮    ⋮     ⋮    ⋱     ⋮ )
   ( 1−p   0     0   . . .   p )

The rows are permutations of each other, and each of them has entropy

−p log p − (1 − p) log(1 − p).

The equilibrium distribution is π = (1/m, . . . , 1/m):

   ∑_{1≤i≤m} (1/m) p_{ij} = (1/m)(p + 1 − p) = 1/m,

and it is unique, as the chain has a unique communicating class. Therefore, the information rate equals

   h = (1/m) ∑_{1≤i≤m} [−p log p − (1 − p) log (1 − p)] = −p log p − (1 − p) log (1 − p).

 
For m = 2 we obtain precisely the matrix

   (  p   1−p )
   ( 1−p   p  ),

so – with the equilibrium distribution π = (1/2, 1/2) – the information rate is again h = η(p).
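A small Python routine (not in the original) computes the rate of an irreducible DTMC directly from its transition matrix and confirms the answer for the cyclic example above; p = 0.3 and m = 5 are illustrative.

import numpy as np

def dtmc_rate(P):
    # h = -sum_i pi_i sum_j P_ij log2 P_ij, with pi the stationary distribution
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1))]); pi = pi / pi.sum()
    plogp = np.where(P > 0, P * np.log2(np.where(P > 0, P, 1.0)), 0.0)
    return -np.sum(pi[:, None] * plogp)

p, m = 0.3, 5
P = np.zeros((m, m))
for j in range(m):
    P[j, j], P[j, (j + 1) % m] = p, 1 - p
print(dtmc_rate(P))                              # equals eta(p)
print(-p*np.log2(p) - (1-p)*np.log2(1-p))        # ~0.8813 for p = 0.3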

Problem 1.33 Define a symmetric channel and find its capacity.


A native American warrior sends smoke signals. The signal is coded in puffs of
smoke of different lengths: short, medium and long. One puff is sent per unit time.
Assume a puff is observed correctly with probability p, and with probability 1 − p
(a) a short signal appears to be medium to the recipient, (b) a medium puff appears
to be long, and (c) a long puff appears to be short. What is the maximum rate at
which the warrior can transmit reliably, assuming the recipient knows the encoding
system he uses?
It would be more reasonable to assume that a short puff may disperse completely
rather than appear medium. In what way would this affect your derivation of a
formula for channel capacity?

Solution Suppose we use an input alphabet I, of m letters, to feed a memoryless channel that produces symbols from an output alphabet J of size n (including illegibles). The channel is described by its m × n matrix where entry p_{ij} gives the probability that letter i ∈ I is transformed to symbol j ∈ J. The rows of the channel matrix form stochastic n-vectors (probability distributions over J):

   ( p_{11}  . . .  p_{1j}  . . .  p_{1n} )
   (   ⋮              ⋮              ⋮   )
   ( p_{i1}  . . .  p_{ij}  . . .  p_{in} )
   (   ⋮              ⋮              ⋮   )
   ( p_{m1}  . . .  p_{mj}  . . .  p_{mn} )

The channel is called symmetric if its rows are permutations of each other (or, more generally, have the same entropy E = h(p_{i1}, . . . , p_{in}), for all i ∈ I). The channel is said to be double-symmetric if in addition its columns are permutations of each other (or, more generally, have the same column sum Σ = ∑_{1≤i≤m} p_{ij}, for all j ∈ J).
For a memoryless channel, the capacity (the supremum of reliable transmission rates) is given by

   C = max_{P_X} I(X : Y).

Here, the maximum is taken over P_X = (P_X(i), i ∈ I), the input-letter probability distribution, and I(X : Y) is the mutual entropy between the input and output random letters X and Y tied through the channel matrix:

   I(X : Y) = h(Y) − h(Y|X) = h(X) − h(X|Y).
For the symmetric channel, the conditional entropy

   h(Y|X) = −∑_{i,j} P_X(i) p_{ij} log p_{ij} ≡ h,

regardless of the input probabilities P_X(i). Hence,

   C = max_{P_X} h(Y) − h(Y|X),

and the maximisation needs only to be performed for the output-symbol entropy

   h(Y) = −∑_j P_Y(j) log P_Y(j),  where P_Y(j) = ∑_i P_X(i) p_{ij}.

For a double-symmetric channel, the latter problem becomes straightforward: h(Y) is maximised by the uniform input distribution P_X^eq(i) = 1/m, as in this case P_Y is also uniform:

   P_Y(j) = (1/m) ∑_i p_{ij} = Σ/m = 1/n,  as it does not depend on j ∈ J.

Thus, for the double-symmetric channel:

   C = log n − h(Y|X).

In the example, the channel matrix is 3 × 3 (1 ∼ short, 2 ∼ medium, 3 ∼ long),

   (  p   1−p   0  )
   (  0    p   1−p )
   ( 1−p   0    p  ),

and double-symmetric. This yields

   C = log 3 + p log p + (1 − p) log (1 − p).

In the modified example, the matrix becomes 3 × 4:

   (  p    0    0   1−p )
   (  0    p   1−p   0  )
   ( 1−p   0    p    0  );

column 4 corresponds to a 'no-signal' output state (a 'splodge'). The maximisation problem loses its symmetry:

   maximise  − ∑_{j=1,2,3,4} (∑_{i=1,2,3} P_X(i) p_{ij}) log (∑_{i=1,2,3} P_X(i) p_{ij}) + ∑_{i=1,2,3} P_X(i) ∑_{j=1,2,3,4} p_{ij} log p_{ij}

   subject to P_X(1), P_X(2), P_X(3) ≥ 0 and ∑_{i=1,2,3} P_X(i) = 1,

and requires a full-scale analysis.
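Such an analysis is easily carried out numerically. The sketch below (not in the original) uses the Blahut–Arimoto iteration, a standard algorithm for the capacity of a discrete memoryless channel; p = 0.8 is an arbitrary illustrative value of the probability that a puff is observed correctly.

import numpy as np

def blahut_arimoto(W, iters=2000):
    # Capacity (bits) of a DMC with matrix W[i, j] = P(output j | input i).
    m, n = W.shape
    r = np.full(m, 1.0 / m)
    for _ in range(iters):
        q = r[:, None] * W
        q = q / q.sum(axis=0, keepdims=True)            # posterior P(input | output)
        logq = np.log2(np.where(q > 0, q, 1.0))
        r = np.exp2((W * logq).sum(axis=1))
        r = r / r.sum()
    PY = r @ W
    ratio = np.where(W > 0, W / np.where(PY > 0, PY, 1.0), 1.0)
    return float((r[:, None] * W * np.log2(ratio)).sum()), r

p = 0.8
W = np.array([[p, 0, 0, 1 - p],
              [0, p, 1 - p, 0],
              [1 - p, 0, p, 0]])
C, r = blahut_arimoto(W)
print("capacity ~", C, "  optimal input ~", r)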


Problem 1.34 The entropy power inequality (EPI, see (1.5.10)) states: for X and
Y independent d -dimensional random vectors,

22h(X+Y)/d ≥ 22h(X)/d + 22h(Y)/d , (1.6.31)

with equality iff X and Y are Gaussian with proportional covariance matrices.
Let X be a real-valued random variable with a PDF f_X and finite differential entropy h(X), and let the function g : R → R have a strictly positive derivative g′ everywhere. Prove that the random variable g(X) has differential entropy satisfying

   h(g(X)) = h(X) + E log₂ g′(X),

assuming that E log₂ g′(X) is finite.


Let Y1 and Y2 be independent, strictly positive random variables with densities.
Show that the differential entropy of the product Y1Y2 satisfies
22h(Y1Y2 ) ≥ α1 22h(Y1 ) + α2 22h(Y2 ) ,

where log2 (α1 ) = 2E log2 Y2 and log2 (α2 ) = 2E log2 Y1 .



Solution The CDF of the random variable g(X) satisfies

   F_{g(X)}(y) = P(g(X) ≤ y) = P(X ≤ g^{−1}(y)) = F_X(g^{−1}(y)),

i.e. the PDF f_{g(X)}(y) = dF_{g(X)}(y)/dy takes the form

   f_{g(X)}(y) = f_X(g^{−1}(y)) (g^{−1})′(y) = f_X(g^{−1}(y)) / g′(g^{−1}(y)).

Then

   h(g(X)) = −∫ f_{g(X)}(y) log₂ f_{g(X)}(y) dy
           = −∫ [f_X(g^{−1}(y))/g′(g^{−1}(y))] log₂ [f_X(g^{−1}(y))/g′(g^{−1}(y))] dy
           = −∫ f_X(x) [log₂ f_X(x) − log₂ g′(x)] dx      (substituting y = g(x), dy = g′(x) dx)
           = h(X) + E log₂ g′(X).      (1.6.32)

When g(t) = e^t then

   log₂ g′(t) = log₂ e^t = t log₂ e.

So, with X_i = ln Y_i, we have Y_i = e^{X_i} = g(X_i) and (1.6.32) implies

   h(e^{X_i}) = h(g(X_i)) = h(X_i) + E X_i log₂ e,  i = 1, 2, 3,

with X₃ = X₁ + X₂. Then

   h(Y₁Y₂) = h(e^{X₁+X₂}) = h(X₁ + X₂) + (E X₁ + E X₂) log₂ e.

Hence, in the entropy-power inequality,

   2^{2h(Y₁Y₂)} = 2^{2h(X₁+X₂) + 2(E X₁ + E X₂) log₂ e}
              ≥ (2^{2h(X₁)} + 2^{2h(X₂)}) 2^{2(E X₁ + E X₂) log₂ e}
              = 2^{2E X₂ log₂ e} 2^{2[h(X₁) + E X₁ log₂ e]} + 2^{2E X₁ log₂ e} 2^{2[h(X₂) + E X₂ log₂ e]}
              = α₁ 2^{2h(Y₁)} + α₂ 2^{2h(Y₂)}.

Here α₁ = 2^{2E X₂ log₂ e}, i.e.

   log₂ α₁ = 2E X₂ log₂ e = 2E (ln Y₂) log₂ e = 2E log₂ Y₂,

and similarly, log₂ α₂ = 2E log₂ Y₁.
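A quick consistency check of (1.6.32) (not in the original): take X ~ N(μ, 1) and g(x) = e^x, so that g(X) is lognormal and E ln g′(X) = μ; scipy returns differential entropies in nats, and the two should differ exactly by μ.

from scipy.stats import norm, lognorm
import numpy as np

mu = 0.7                                        # arbitrary illustrative mean
hX = norm(loc=mu).entropy()                     # differential entropy of X (nats)
hY = lognorm(s=1, scale=np.exp(mu)).entropy()   # Y = exp(X), X ~ N(mu, 1)
print(hY - hX, mu)                              # both ~0.7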



Problem 1.35 In this problem we work with the following functions defined for
0 < a < b:
   G(a, b) = √(ab),   L(a, b) = (b − a)/log(b/a),   I(a, b) = (1/e) (b^b/a^a)^{1/(b−a)}.
Check that
   0 < a < G(a, b) < L(a, b) < I(a, b) < A(a, b) = (a + b)/2 < b.      (1.6.33)
Next, for 0 < a < b define

Λ(a, b) = L(a, b)I(a, b)/G2 (a, b).

Let p = (pi ) and q = (qi ) be the probability distributions of random variables X


and Y :

P(X = i) = pi > 0, P(Y = i) = qi , i = 1, . . . , r, ∑ pi = ∑ qi = 1.

Let m = min[qi /pi ], M = max[qi /pi ], μ = min[pi ], ν = max[pi ]. Prove the following
bounds for the entropy h(X) and Kullback–Leibler divergence D(p||q) (cf. PSE II,
p. 419):

0 ≤ log r − h(X) ≤ log Λ(μ , ν ). (1.6.34)


0 ≤ D(p||q) ≤ log Λ(m, M). (1.6.35)

Solution The inequality (1.6.33) is straightforward and left as an exercise. For a ≤ x_i ≤ b, set A(p, x) = ∑ p_i x_i, G(p, x) = ∏ x_i^{p_i}. The following general inequality holds:

   1 ≤ A(p, x)/G(p, x) ≤ Λ(a, b).      (1.6.36)

It implies that

   0 ≤ log (∑ p_i x_i) − ∑ p_i log x_i ≤ log Λ(a, b).

Selecting x_i = q_i/p_i we immediately obtain (1.6.35). Taking q to be uniform, we obtain (1.6.34) from (1.6.35) since

   Λ(1/(rν), 1/(rμ)) = Λ(1/ν, 1/μ) = Λ(μ, ν).

Next, we sketch the proof of (1.6.36); see details in [144], [50]. Let f be a convex function, p, q ≥ 0, p + q = 1. Then for x_i ∈ [a, b], we have

   0 ≤ ∑ p_i f(x_i) − f(∑ p_i x_i) ≤ max_p [p f(a) + q f(b) − f(pa + qb)].      (1.6.37)

Applying (1.6.37) to the convex function f(x) = −log x we obtain after some calculations that the maximum in (1.6.37) is achieved at p₀ = (b − L(a, b))/(b − a), with p₀ a + (1 − p₀) b = L(a, b), and

   0 ≤ log [A(p, x)/G(p, x)] ≤ log [(b − a)/log(b/a)] − log(ab) + [log(b^b/a^a)]/(b − a) − 1,

which is equivalent to (1.6.36). Finally, we establish (1.6.37). Write x_i = λ_i a + (1 − λ_i) b for some λ_i ∈ [0, 1]. Then by convexity

   0 ≤ ∑ p_i f(x_i) − f(∑ p_i x_i)
     ≤ ∑ p_i (λ_i f(a) + (1 − λ_i) f(b)) − f(a ∑ p_i λ_i + b ∑ p_i (1 − λ_i)).

Denoting ∑ p_i λ_i = p and 1 − ∑ p_i λ_i = q and maximising over p we obtain (1.6.37).
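The bound (1.6.34) is easy to test numerically. The sketch below (not in the original; natural logarithms throughout) draws random distributions and checks that 0 ≤ log r − h(X) ≤ log Λ(μ, ν).

import numpy as np

def Lam(a, b):
    L = (b - a) / np.log(b / a)
    I = (b**b / a**a) ** (1.0 / (b - a)) / np.e
    return L * I / (a * b)

rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.dirichlet(np.ones(6))
    gap = np.log(len(p)) + np.sum(p * np.log(p))      # log r - h(X), in nats
    print(0 <= gap <= np.log(Lam(p.min(), p.max())))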
Problem 1.36 Let f be a strictly positive probability density function (PDF) on the line R, define the Kullback–Leibler divergence D(g||f) and prove that D(g||f) ≥ 0.
Next, assume that ∫ e^x f(x) dx < ∞ and ∫ |x| e^x f(x) dx < ∞. Prove that the minimum of the expression

   −∫ x g(x) dx + D(g||f)      (1.6.38)

over the PDFs g with ∫ |x| g(x) dx < ∞ is attained at the unique PDF g* ∝ e^x f(x), and calculate this minimum.

Solution The Kullback–Leibler divergence D(g||f) is defined by

   D(g||f) = ∫ g(x) ln [g(x)/f(x)] dx,   if ∫ g(x) |ln [g(x)/f(x)]| dx < ∞,

and D(g||f) = ∞ otherwise. The bound D(g||f) ≥ 0 is the Gibbs inequality.
Now take the PDF g*(x) = e^x f(x)/Z where Z = ∫ e^z f(z) dz. Set W = ∫ x e^x f(x) dx; then W/Z = ∫ x g*(x) dx. Further, write:

   D(g*||f) = (1/Z) ∫ e^x f(x) ln (e^x/Z) dx = (1/Z) ∫ e^x f(x) (x − ln Z) dx = W/Z − ln Z

and obtain that

   −∫ x g*(x) dx + D(g*||f) = −ln Z.

This is the claimed minimum in the last part of the question.
Indeed, for any PDF g such that ∫ |x| g(x) dx < ∞, set q(x) = g(x)/f(x) and write

   D(g||g*) = ∫ g(x) ln [g(x)/g*(x)] dx = ∫ q(x) ln [q(x) e^{−x} Z] f(x) dx
            = −∫ x f(x) q(x) dx + ∫ f(x) q(x) ln q(x) dx + ln Z
            = −∫ x g(x) dx + D(g||f) + ln Z,

implying that

   −∫ x g(x) dx + D(g||f) = −∫ x g*(x) dx + D(g*||f) + D(g||g*).

Since D(g||g*) > 0 unless g = g*, the claim follows.


Remark 1.6.1 The property of minimisation of (1.6.38) is far reaching and im-
portant in a number of disciplines, including statistical physics, ergodic theory and
financial mathematics. We refer the reader to the paper [109] for further details.
2
Introduction to Coding Theory

2.1 Hamming spaces. Geometry of codes. Basic bounds on the code size
For presentational purposes, it is advisable to concentrate at the first reading of this section on the binary case where the symbols sent through a channel are 0 and 1. As we saw earlier, in the case of an MBSC with the row error-probability p ∈ (0, 1/2), the ML decoder looks for a codeword x^(N) that has the maximum number of digits coinciding with the received binary word y^(N). In fact, if y^(N) is received, the ML decoder compares the probabilities

   P(y^(N)|x^(N)) = p^{δ(x^(N),y^(N))} (1 − p)^{N−δ(x^(N),y^(N))} = (1 − p)^N [p/(1 − p)]^{δ(x^(N),y^(N))}

for different binary codewords x^(N). Here

   δ(x^(N), y^(N)) = the number of digits i with x_i ≠ y_i      (2.1.1a)

is the so-called Hamming distance between the words x^(N) = x_1 . . . x_N and y^(N) = y_1 . . . y_N. Since the first factor (1 − p)^N does not depend on x^(N), the decoder seeks to maximise the second factor, that is to minimise δ(x^(N), y^(N)) (as 0 < p/(1 − p) < 1 for p ∈ (0, 1/2)).
The definition (2.1.1a) of Hamming distance can be extended to q-ary strings.
The space of q-ary words HN,q = {0, 1, . . . , q − 1}×N (the Nth Cartesian power of
set Jq = {0, 1, . . . , q − 1}) with distance (2.1.1a) is called the q-ary Hamming space
of length N. It contains qN elements. In the binary case, HN,2 = {0, 1}×N .


Figure 2.1  The binary Hamming spaces (unit cubes) for N = 1, 2, 3, 4.

An important part is played by the distance δ(x^(N), 0^(N)) between the words x^(N) = x_1 . . . x_N and 0^(N) = 0 . . . 0; it is called the weight of the word x^(N) and denoted by w(x^(N)):

   w(x^(N)) = the number of digits i with x_i ≠ 0.      (2.1.1b)

Lemma 2.1.1 The quantity δ(x^(N), y^(N)) defines a distance on H_{N,q}. That is:
(i) 0 ≤ δ(x^(N), y^(N)) ≤ N and δ(x^(N), y^(N)) = 0 iff x^(N) = y^(N).
(ii) δ(x^(N), y^(N)) = δ(y^(N), x^(N)).
(iii) δ(x^(N), z^(N)) ≤ δ(x^(N), y^(N)) + δ(y^(N), z^(N)) (the triangle inequality).

Proof The proof of (i) and (ii) is obvious. To check (iii), observe that any digit i with z_i ≠ x_i has either y_i ≠ x_i, and then it is counted in δ(x^(N), y^(N)), or z_i ≠ y_i, and then it is counted in δ(y^(N), z^(N)).
Geometrically, the binary Hamming space HN,2 may be identified with the col-
lection of the vertices of a unit cube in N dimensions. The Hamming distance
equals the lowest number of edges we have to pass from one vertex to another. It is
a good practice to plot pictures for relatively low values of N: see Figure 2.1.
An important role is played below by geometric and algebraic properties of the
Hamming space. Namely, as in any metric space, we can consider a ball of a given
radius R around a given word x(N) :
BN,q (x(N) , R) = {y(N) ∈ HN,q : δ (x(N) , y(N) ) ≤ R}. (2.1.2)
An important (and hard) problem is to calculate the maximal number of disjoint
balls of a given radius which can be packed in a given Hamming space.
Observe that words admit an operation of addition mod q:
x(N) + y(N) = (x1 + y1 ) mod q . . . (xN + yN ) mod q . (2.1.3a)

This makes the Hamming space HN,q a commutative group, with the zero code-
word 0(N) = 0 . . . 0 playing the role of the zero of the group. (Words also may be
multiplied which generates a powerful apparatus; see below.)
For q = 2, we have a two-point code alphabet {0, 1} that is actually a two-
point field, F2 , with the following arithmetic: 0 + 0 = 1 + 1 = 0 · 1 = 1 · 0 = 0,
0 + 1 = 1 + 0 = 1 · 1 = 1. (Recall, a field is a set equipped with two commutative
operations: addition and multiplication, satisfying standard axioms of associativity
and distributivity.) Thus, each point in the binary Hamming space HN,2 is opposite
to itself: x(N) + x (N) = 0(N) iff x(N) = x (N) . In fact, HN,2 is a linear space over the
coefficient field F2 , with 1 · x(N) = x(N) , 0 · x(N) = 0(N) .
Henceforth, all additions of q-ary words are understood digit-wise and mod q.

Lemma 2.1.2 The Hamming distance on H_{N,q} is invariant under group translations:

   δ(x^(N) + z^(N), y^(N) + z^(N)) = δ(x^(N), y^(N)).      (2.1.3b)

Proof For all i = 1, . . . , N and x_i, y_i, z_i ∈ {0, 1, . . . , q − 1}, the digits x_i + z_i mod q and y_i + z_i mod q are in the same relation (= or ≠) as the digits x_i and y_i.

A code is identified with a set of codewords XN ⊂ HN,q ; this means that we dis-
regard any particular allocation of codewords (which fits the assumption that the
source messages are equidistributed). An assumption is that the code is known to
both the sender and the receiver. Shannon’s coding theorems guarantee that, under
certain conditions, there exist asymptotically good codes attaining the limits im-
posed by the information rate of a source and the capacity of a channel. Moreover,
Shannon’s SCT shows that almost all codes are asymptotically good. However, in
a practical situation, these facts are of a limited use: one wants to have a good code
in an explicit form. Besides, it is desirable to have a code that leads to fast encoding
and decoding and maximises the rate of the information transmission.
So, assume that the source emits binary strings u(n) = u1 . . . un , ui = 0, 1. To
obtain the overall error-probability vanishing as n → ∞, we have to encode words
u(n) by longer codewords x(N) ∈ HN,2 where N ∼ R−1 n and 0 < R < 1. Word x(N)
is then sent to the channel and is transformed into another word, y(N) ∈ HN,2 . It
is convenient to represent the error occurred by the difference of the two words:
e(N) = y(N) − x(N) = x(N) + y(N) , or equivalently, write y(N) = x(N) + e(N) , in the
sense of (2.1.3a). Thus, the more digits 1 the error word e(N) has, the more sym-
bols are distorted by the channel. The ML decoder then produces a 'guessed' codeword x̂^(N) that may or may not coincide with x^(N), and then reconstructs a string û^(n). In the case of a one-to-one encoding rule, the last procedure is (theoretically) straightforward: we simply invert the map u^(n) → x^(N). Intuitively, a code is 'good'

if it allows the receiver to ‘correct’ the error string e(N) , at least when word e(N)
does not contain ‘too many’ non-zero digits.
Going back to an MBSC with the row probability of the error p < 1/2: the ML
(N)
decoder selects a codeword x that leads to a word e(N) with a minimal number
of the unit digits. In geometric terms:
   x̂^(N) ∈ X_N is the codeword closest to y^(N) in the Hamming distance δ.      (2.1.4)
The same rule can be applied in the q-ary case: we look for the codeword closest
to the received string. A drawback of this rule is that if several codewords have the
same minimal distance from a received word we are ‘stuck’. In this case we either
choose one of these codewords arbitrarily (possibly randomly or in connection with
the message’s content; this is related to the so-called list decoding), or, when a high
quality of transmission is required, refuse to decode the received word and demand
a re-transmission.
Definition 2.1.3 We call N the length of a binary code X_N, M := ♯X_N the size and ρ := (log₂ M)/N the information rate. A code X_N is said to be D-error detecting if making up to D changes in any codeword does not produce another codeword, and E-error correcting if making up to E changes in any codeword x^(N) produces a word which is still (strictly) closer to x^(N) than to any other codeword (that is, x^(N) is correctly guessed from a distorted word under the rule (2.1.4)). A code has minimal distance (or briefly distance) d if

   d = min {δ(x^(N), x′^(N)) : x^(N), x′^(N) ∈ X_N, x^(N) ≠ x′^(N)}.      (2.1.5)

The minimal distance and the information rate of a code X_N will sometimes be denoted by d(X_N) and ρ(X_N), respectively.
This definition can be repeated almost verbatim for the general case of a q-ary code X_N ⊂ H_{N,q}, with information rate ρ = (log_q M)/N. Namely, a code X_N is called E-error correcting if, for all r = 1, . . . , E, x^(N) ∈ X_N and y^(N) ∈ H_{N,q} with δ(x^(N), y^(N)) = r, the distance δ(y^(N), x′^(N)) > r for all x′^(N) ∈ X_N such that x′^(N) ≠ x^(N). In words, it means that making up to E errors in a codeword produces a word that is still closer to it than to any other codeword. Geometrically, this property means that the balls of radius E about the codewords do not intersect:

   B_{N,q}(x^(N), E) ∩ B_{N,q}(x′^(N), E) = ∅  for all distinct x^(N), x′^(N) ∈ X_N.

Next, a code X_N is called D-error detecting if the ball of radius D about a codeword does not contain another codeword. Equivalently, the intersection B_{N,q}(x^(N), D) ∩ X_N is reduced to the single point x^(N).

A code of length N, size M and minimal distance d is called an [N, M, d] code.


Speaking of an [N, M] or [N, d] code, we mean any code of length N and size M or
minimal distance d.
To make sure we understand this definition, let us prove the aforementioned
equivalence in the definition of an E-error correcting code. First, assume that the
balls of radius E are disjoint. Then, making up to E changes in a codeword pro-
duces a word that is still in the corresponding ball, and hence is further apart from
any other codeword. Conversely, let our code have the property that changing up
to E digits in a codeword does not produce a word which lies at the same distance
from or closer to another codeword. Then any word obtained by changing precisely
E digits in a codeword cannot fall in any ball of radius E but in the one about the
original codeword. If we make fewer changes we again do not fall in any other ball,
for if we do, then moving towards the second centre will produce, sooner or later,
a word that is at distance E from the original codeword and at distance < E from
the second one, which is impossible.

For a D-error detecting code, the distance d ≥ D + 1. Furthermore, a code of distance d detects d − 1 errors, and it corrects ⌊(d − 1)/2⌋ errors.
Remark 2.1.4 Formally, Definition 2.1.3 means that a code detects at least D
and corrects at least E errors, and some authors make a point of this fact, specify-
ing D and E as maximal values with the above properties. We followed an original
tradition where the detection and correction abilities of codes are defined in terms
of inequalities rather than equalities, although in a number of forthcoming con-
structions and examples the claim that a code detects D and/or corrects E errors
means that D and/or E and no more. See, for instance, Definition 2.1.7.
Definition 2.1.5 In Section 2.3 we systematically study so-called linear codes.
The linear structure is established in space HN,q when the alphabet size q is of
the form ps where p is a prime and s a positive integer; in this case the alphabet
{0, 1, . . . , q − 1} can be made a field, Fq , by introducing two suitable operations:
addition and multiplication. See Section 3.1. When s = 1, i.e. q is a prime number,
then both operations can be understood as standard ones, modulo q. When Fq is a
field with addition + and multiplication · , set HN,q = F×N q becomes a linear space
over Fq , with component-wise addition and multiplication by ‘scalars’ generated
by the corresponding operations in Fq . Namely, for x(N) = x1 . . . xN , y(N) = y1 . . . yN
and γ ∈ Fq ,
x(N) + y(N) = (x1 + y1 ) . . . (xN + yN ), γ · x(N) = (γ · x1 ) . . . (γ · xN ). (2.1.6a)
With q = ps , a q-ary [N, M, d] code XN is called linear if it is a linear subspace of
HN,q . That is, XN has the property that if x(N) , y(N) ∈ XN then x(N) + y(N) ∈ XN

and γ · x(N) ∈ XN for all γ ∈ Fq . For a linear code X , the size M is given by
M = qk where k may take values 1, . . . , N and gives the dimension of the code, i.e.
the maximal number of linearly independent codewords. Accordingly, one writes
k = dim X . As in the usual geometry, if k = dim X then in X there exists a basis
of size k, i.e. a linearly independent collection of codewords x(1) , . . . , x(k) such that
any codeword x ∈ X can be (uniquely) written as a linear combination ∑_{1≤j≤k} a_j x^(j), where a_j ∈ F_q. [In fact, if k = dim X then any linearly independent collection of k
codewords is a basis in X .] In the linear case, we speak of [N, k, d] or [N, k] codes.
As follows from the definition, a linear [N, k, d] code X_N always contains the zero string 0^(N) = 0 . . . 0. Furthermore, owing to property (2.1.3b), the minimal distance d(X_N) in a linear code X equals the minimal weight w(x^(N)) of a non-zero codeword x^(N) ∈ X_N. See (2.1.1b).
Finally, we define the so-called wedge-product of codewords x and y as a word
w = x ∧ y with components
wi = min[xi , yi ], i = 1, . . . , N. (2.1.6b)
A number of properties of linear codes can be mentioned already in this section,
although some details of proofs will be postponed.
A simple example of a linear code is a repetition code R_N ⊂ H_{N,q}, of the form

   R_N = { x^(N) = x . . . x : x = 0, 1, . . . , q − 1 };

it detects N − 1 and corrects ⌊(N − 1)/2⌋ errors. A linear parity-check code

   P_N = { x^(N) = x_1 . . . x_N : x_1 + · · · + x_N = 0 }

detects a single error only, but does not correct it.

Observe that the 'volume' of the ball in the Hamming space H_{N,q} centred at z^(N) is

   v_{N,q}(R) = ♯B_{N,q}(z^(N), R) = ∑_{0≤k≤R} \binom{N}{k} (q − 1)^k;      (2.1.7)

it does not depend on the choice of the centre z^(N) ∈ H_{N,q}.

It is interesting to consider large values of N (theoretically, N → ∞), and analyse parameters of the code X_N such as the information rate ρ(X) = (log ♯X)/N and the distance per digit d̄(X) = d(X)/N. Our aim is to focus on 'good' codes, with many codewords (to increase the information rate) and large distances (to increase

the detecting and correcting abilities). From this point of view, it is important to
understand basic bounds for codes.

Upper bounds are usually written for Mq∗ (N, d), the largest size of a q-ary code
of length N and distance d. We begin with elementary facts: Mq∗ (N, 1) = qN ,
Mq∗ (N, N) = q, Mq∗ (N, d) ≤ qMq∗ (N − 1, d) and – in the binary case – M2∗ (N, 2s) =
M2∗ (N − 1, 2s − 1) (easy exercises).
Indeed, the number of the codewords cannot be too high if we want to keep
good an error-detecting and error-correcting ability. There are various bounds for
parameters of codes; the simplest bound was discovered by Hamming in the late
1940s.
Theorem 2.1.6 (The Hamming bound)

(i) If a q-ary code X_N corrects E errors then its size M = ♯X_N obeys

   M ≤ q^N / v_{N,q}(E).      (2.1.8a)

For a linear [N, k] code this can be written in the form

   N − k ≥ log_q v_{N,q}(E).

(ii) Accordingly, with E = ⌊(d − 1)/2⌋,

   M_q*(N, d) ≤ q^N / v_{N,q}(E).      (2.1.8b)

Proof (i) The E-balls about the codewords x^(N) ∈ X_N must be disjoint. Hence, the total number of points covered equals the product v_{N,q}(E) M, which should not exceed q^N, the cardinality of the Hamming space H_{N,q}.
(ii) Likewise, if X_N is an [N, M, d] code then, as was noted above, for E = ⌊(d − 1)/2⌋ the balls B_{N,q}(x^(N), E), x^(N) ∈ X_N, do not intersect. The volume ♯B_{N,q}(x^(N), E) is given by

   v_{N,q}(E) = ∑_{0≤k≤E} \binom{N}{k} (q − 1)^k,

and the union of balls ⋃_{x^(N)∈X_N} B_{N,q}(x^(N), E) must lie in H_{N,q}, again with cardinality ♯H_{N,q} = q^N.

We see that the problem of finding good codes becomes a geometric problem,
because a ‘good’ code XN correcting E errors must give a ‘close-packing’ of the
Hamming space by balls of radius E. A code XN that gives a ‘true’ close-packing

partition has an additional advantage: the code not only corrects errors, but never
leads to a refusal of decoding. More precisely:

Definition 2.1.7 An E-error correcting code XN of size  XN = M is called


perfect when the equality is achieved in the Hamming bound:

M = qN vN,q (E) .

If a code XN is perfect, every word y(N) ∈ HN,q belongs to a (unique) ball


BE (x(N) ). That is, we are always able to decode y(N) by a codeword: this leads
to the correct answer if the number of errors is ≤ E, and to a wrong answer if it is
> E. But we never get ‘stuck’ in the case of decoding.
The problem of finding perfect binary codes was solved about 20 years ago.
These codes exist only for

(a) E = 1: here N = 2^ℓ − 1, M = 2^{2^ℓ−1−ℓ}, and these codes correspond to the so-called Hamming codes;
(b) E = 3: here N = 23, M = 2^{12}; they correspond to the so-called (binary) Golay code.
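These parameters are easy to check against Definition 2.1.7. The following sketch (not in the original) verifies that equality holds in the Hamming bound for the stated Hamming and Golay parameters.

from math import comb

def v(N, E, q=2):
    # volume of a Hamming ball of radius E in H_{N,q}
    return sum(comb(N, k) * (q - 1)**k for k in range(E + 1))

# Hamming codes: N = 2**l - 1, M = 2**(N - l), E = 1
for l in (3, 4, 5):
    N = 2**l - 1
    print(N, 2**(N - l) * v(N, 1) == 2**N)
# binary Golay code: N = 23, M = 2**12, E = 3
print(23, 2**12 * v(23, 3) == 2**23)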

Both the Hamming and Golay codes are discussed below. The Golay code is
used (together with some modifications) in the US space programme: already in
the 1970s the quality of photographs encoded by this code and transmitted from
Mars and Venus was so excellent that it did not require any improving procedure.
In the former Soviet Union space vessels (and early American ones) other codes
were also used (and we also discuss them later): they generally produced lower-
quality photographs, and further manipulations were required, based on statistics
of the pictorial images.
If we consider non-binary codes then there exists one more perfect code, for
three symbols (also named after Golay).

We will now describe a number of straightforward constructions producing new


codes from existing ones.

Example 2.1.8 Constructions of new codes include:

(i) Extension: You add a digit xN+1 to each codeword x(N) = x1 . . . xN from
code XN , following an agreed rule. Viz., the so-called parity-check exten-
sion requires that x_{N+1} + ∑_{1≤j≤N} x_j = 0 in the alphabet field F_q. Clearly, the extended code, X⁺_{N+1}, has the same size as the original code X_N, and the distance d(X⁺_{N+1}) is equal to either d(X_N) or d(X_N) + 1.

(ii) Truncation: Remove a digit from the codewords x ∈ X (= X_N). The resulting code, X_{N−1}, has length N − 1 and, if the distance d(X_N) ≥ 2, the same size as X_N, while d(X_{N−1}) ≥ d(X_N) − 1.
size as XN , while d(XN−1 ) ≥ d(XN ) − 1, provided that d(XN ) ≥ 2.
(iii) Purge: Simply delete some codewords x ∈ XN . For example, in the binary
case removing all codewords with an odd number of non-zero digits from a
linear code leads to a linear subcode; in this case if the distance of the original
code was odd then the purged code will have a strictly larger distance.
(iv) Expansion: Opposite to purging. Say, let us add the complement of each codeword to a binary code X_N, i.e. the N-word where the 1s are replaced by the 0s and vice versa. Denoting the expanded code by X̄_N, one can check that d(X̄_N) = min[d(X_N), N − d_max(X_N)] where

   d_max(X_N) = max[δ(x^(N), x′^(N)) : x^(N), x′^(N) ∈ X_N].

(v) Shortening: Take all codewords x^(N) ∈ X_N with the ith digit 0, say, and delete this digit (shortening on x_i = 0). In this way the original binary linear [N, M, d] code X_N is reduced to a binary linear code X^{sh,0}_{N−1}(i) of length N − 1, whose size can be M/2 or M and distance ≥ d or, in a trivial case, 0.
(vi) Repetition: Repeat each codeword x (= x^(N)) ∈ X_N a fixed number of times, say m, producing a concatenated (Nm)-word xx . . . x. The result is a code X^re_{Nm}, of length Nm and distance d(X^re_{Nm}) = m d(X_N).

(vii) Direct sum: Given two codes X and X′, form a code X + X′ = {xx′ : x ∈ X, x′ ∈ X′}. Both the repetition and direct-sum constructions are not very effective and neither is particularly popular in coding (though we will return to these constructions in examples and problems). A more effective construction is
(viii) The bar-product (x|x + x′): For the [N, M, d] and [N, M′, d′] codes X_N and X′_N define a code X_N | X′_N of length 2N as the collection

   { x(x + x′) : x (= x^(N)) ∈ X_N, x′ (= x′^(N)) ∈ X′_N }.

That is, each codeword in X | X′ is a concatenation of a codeword from X_N and its sum with a codeword from X′_N (formally, neither of X_N, X′_N in this construction is supposed to be linear). The resulting code is denoted by X_N | X′_N; it has size

   ♯(X_N | X′_N) = ♯X_N · ♯X′_N.

A useful exercise is to check that the distance

   d(X_N | X′_N) = min[2 d(X_N), d(X′_N)].


(ix) The dual code: The concept of duality is based on the inner dot-product in the space H_{N,q} (with q = p^s): for x = x_1 . . . x_N and y = y_1 . . . y_N,

   ⟨x^(N) · y^(N)⟩ = x_1 · y_1 + · · · + x_N · y_N,

which yields a value from the field F_q. For a linear [N, k] code X_N its dual, X_N^⊥, is a linear [N, N − k] code defined by

   X_N^⊥ = { y^(N) ∈ H_{N,q} : ⟨x^(N) · y^(N)⟩ = 0 for all x ∈ X_N }.      (2.1.9)

Clearly, (X_N^⊥)^⊥ = X_N. Also dim X_N + dim X_N^⊥ = N. A code is called self-dual if X_N = X_N^⊥.
Worked Example 2.1.9
(a) Prove that if the distance d of an [N, M, d] code XN is an odd number then the
code may be extended to an [N + 1, M] code X + with distance d + 1.
(b) Show an E -error correcting code XN can be extended to a code X + that de-
tects 2E + 1 errors.
(c) Show that the distance of a perfect binary code is an odd number.

Solution (a) By adding the digit x_{N+1} to the codewords x = x_1 . . . x_N of an [N, M] code X_N so that x_{N+1} = ∑_{1≤j≤N} x_j, we obtain an [N + 1, M] code X^+. If the distance d of X_N was odd, the distance of X^+ is d + 1. In fact, if a pair of codewords, x, x′ ∈ X, had δ(x, x′) > d, then the extended codewords, x^+ and x′^+, have δ(x^+, x′^+) ≥ δ(x, x′) > d. Otherwise, i.e. if δ(x, x′) = d, the distance increases: δ(x^+, x′^+) = d + 1.
(b) The distance d of an E-error correcting code is strictly greater than 2E. Hence, the above extension gives a code with distance strictly greater than 2E + 1.
(c) For a perfect E-error correcting code the distance is at most 2E + 1 and hence equals 2E + 1.
Worked Example 2.1.10 Show that there is no perfect 2-error correcting code of length 90 and size 2^78 over F₂.

Solution We might be interested in the existence of a perfect 2-error correcting binary code of length N = 90 and size M = 2^78 because

   v_{90,2}(2) = 1 + 90 + (90 · 89)/2 = 4096 = 2^12

and

   M × v_{90,2}(2) = 2^78 · 2^12 = 2^90 = 2^N.

However, such a code does not exist. Assume that it exists, and, the zero word
0 = 0 . . . 0 is a codeword. The code must have d = 5. Consider the 88 words with
three non-zero digits, with 1 in the first two places:

1110 . . . 00 , 1101 . . . 00 , ... , 110 . . . 01 . (2.1.10)

Each of these words should be at distance ≤ 2 from a unique codeword. Say, the
codeword for 1110 . . . 00 must contain 5 non-zero digits. Assume that it is

111110 . . . 00.

This codeword is at distance 2 from two other subsequent words,

11010 . . . 00 and 11001 . . . 00 .

Continuing with this construction, we see that any word from list (2.1.10) is ‘at-
tracted’ to a codeword with 5 non-zero digits, along with two other words from
(2.1.10). But 88 is not divisible by 3.

Let us continue with bounds on codes.

Theorem 2.1.11 (The Gilbert–Varshamov (GV) bound) For any q ≥ 2 and d ≥ 2, there exists a q-ary [N, M, d] code X_N such that

   M = ♯X_N ≥ q^N / v_{N,q}(d − 1).      (2.1.11)

Proof Consider a code of maximal size among the codes of minimal distance
d and length N. Then any word y(N) ∈ HN,q must be distant ≤ d − 1 from some
codeword: otherwise we can add y(N) to the code without changing the minimal
distance. Hence, the balls of radius d − 1 about the codewords cover the whole
Hamming space HN,q . That is, for the code of maximal size, XNmax ,
 
   ♯X_N^max · v_{N,q}(d − 1) ≥ q^N.

As was listed before, there are ways of producing one code from another (or from
a collection of codes). Let us apply truncation and drop the last digit xN in each
codeword x(N) from an original code XN . If code XN had the minimal distance

d > 1 then the new code, XN−1 , has the minimal distance ≥ d − 1 and the same
size as XN . The truncation procedure leads to the following bound.

Theorem 2.1.12 (The Singleton bound) Any q-ary code XN with minimal dis-
tance d has
   M = ♯X_N ≤ M_q*(N, d) ≤ q^{N−d+1}.      (2.1.12)

Proof As before, perform a truncation on an [N, M, d] code X_N: drop the last digit from each codeword x ∈ X_N. The new code is [N, M, d^−] where d^− ≥ d − 1. Repeating this procedure d − 1 times gives a code of length N − d + 1, of the same size M and distance ≥ 1. This code must fit in the Hamming space H_{N−d+1,q} with ♯H_{N−d+1,q} = q^{N−d+1}; hence the result.

As with the Hamming bound, the case of equality in the Singleton bound at-
tracted a special interest:

Definition 2.1.13 A q-ary linear [N, k, d] code is called maximum distance sepa-
rating (MDS) if it gives equality in the Singleton bound:

d = N − k + 1. (2.1.13)

We will see below that, similarly to perfect codes, the family of the MDS codes
is rather ‘thin’.

Corollary 2.1.14 If M_q*(N, d) is the maximal size of a code X_N with minimal distance d then

   q^N / v_{N,q}(d − 1) ≤ M_q*(N, d) ≤ min [ q^N / v_{N,q}(⌊(d − 1)/2⌋), q^{N−d+1} ].      (2.1.14)

From now on we will omit indices N and (N) whenever it does not lead to
confusion. The upper bound in (2.1.14) becomes too rough when d ∼ N/2. Say, in
the case of binary [N, M, d]-code with N = 10 and d = 5, expression (2.1.14) gives
the upper bound M2∗ (10, 5) ≤ 18, whereas in fact there is no code with M ≥ 13, but
there exists a code with M = 12. The codewords of the latter are as follows:

0000000000, 1111100000, 1001011010, 0100110110,

1100001101, 0011010101, 0010011011, 1110010011,

1001100111, 1010111100, 0111001110, 0101111001.

The lower bound gives in this case the value 2 (as 2^10/v_{10,2}(4) = 2.6528) and is also far from being satisfactory. (Some better bounds will be obtained below.)
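The listed code can be checked directly. The sketch below (not in the original) computes the pairwise Hamming distances of the twelve codewords.

from itertools import combinations

codewords = """0000000000 1111100000 1001011010 0100110110
1100001101 0011010101 0010011011 1110010011
1001100111 1010111100 0111001110 0101111001""".split()

dist = lambda a, b: sum(x != y for x, y in zip(a, b))
# minimal pairwise distance; it should equal 5 if the list is indeed a [10, 12, 5] code
print(min(dist(a, b) for a, b in combinations(codewords, 2)))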

Theorem 2.1.15 (The Plotkin bound) For a binary code X of length N and distance d with N < 2d, the size M obeys

   M = ♯X ≤ 2 ⌊d/(2d − N)⌋.      (2.1.15)

Proof The minimal distance cannot exceed the average distance, i.e.

   M(M − 1) d ≤ ∑_{x∈X} ∑_{x′∈X} δ(x, x′).

On the other hand, write the code X as an M × N matrix with rows as codewords. Suppose that column i of the matrix contains s_i zeros and M − s_i ones. Then

   ∑_{x∈X} ∑_{x′∈X} δ(x, x′) ≤ 2 ∑_{1≤i≤N} s_i (M − s_i).      (2.1.16)

If M is even, the RHS of (2.1.16) is maximised when s_i = M/2, which yields

   M(M − 1) d ≤ (1/2) N M²,  or  M ≤ 2d/(2d − N).

As M is even, this implies

   M ≤ 2 ⌊d/(2d − N)⌋.

If M is odd, the RHS of (2.1.16) is ≤ N(M² − 1)/2, which yields

   M ≤ N/(2d − N) = 2d/(2d − N) − 1.

This implies in turn that

   M ≤ ⌊2d/(2d − N)⌋ − 1 ≤ 2 ⌊d/(2d − N)⌋,

because, for all x > 0, ⌊2x⌋ ≤ 2⌊x⌋ + 1.
Theorem 2.1.16 Let M₂*(N, d) be the maximal size of a binary [N, d] code. Then, for any N and d,

   M₂*(N, 2d − 1) = M₂*(N + 1, 2d),      (2.1.17)

and

   M₂*(N, d) ≤ 2 M₂*(N − 1, d).      (2.1.18)

Proof To prove (2.1.17) let X be a code of length N, distance 2d − 1 and size M₂*(N, 2d − 1). Take its parity-check extension X^+. That is, add the digit x_{N+1} to every codeword x = x_1 . . . x_N so that ∑_{i=1}^{N+1} x_i = 0. Then X^+ is a code of length N + 1, the same size M₂*(N, 2d − 1) and distance 2d. Therefore,

   M₂*(N, 2d − 1) ≤ M₂*(N + 1, 2d).

Similarly, deleting the last digit leads to the inverse:

   M₂*(N, 2d − 1) ≥ M₂*(N + 1, 2d).

Turning to the proof of (2.1.18), given an [N, d] code, divide the codewords into
two classes: those ending with 0 and those ending with 1. One class must contain
at least half of the codewords. Hence the result.
Corollary 2.1.17 If d is even and such that 2d > N,

   M₂*(N, d) ≤ 2 ⌊d/(2d − N)⌋      (2.1.19)

and

   M₂*(2d, d) ≤ 4d.      (2.1.20)

If d is odd and 2d + 1 > N then

   M₂*(N, d) ≤ 2 ⌊(d + 1)/(2d + 1 − N)⌋      (2.1.21)

and

   M₂*(2d + 1, d) ≤ 4d + 4.      (2.1.22)

Proof Inequality (2.1.19) follows from (2.1.15), and (2.1.20) follows from (2.1.18) and (2.1.19): if d = 2d′ then

   M₂*(4d′, 2d′) ≤ 2 M₂*(4d′ − 1, 2d′) ≤ 8d′ = 4d.

Furthermore, (2.1.21) follows from (2.1.17):

   M₂*(N, d) = M₂*(N + 1, d + 1) ≤ 2 ⌊(d + 1)/(2d + 1 − N)⌋.

Finally, (2.1.22) follows from (2.1.17) and (2.1.20).
Worked Example 2.1.18 Prove the Plotkin bound for a q-ary code:

   M_q*(N, d) ≤ ⌊ d (d − N(q − 1)/q)^{−1} ⌋,   if d > N(q − 1)/q.      (2.1.23)

Solution Given a q-ary [N, M, d] code X_N, observe that the minimal distance d is bounded by the average distance:

   d ≤ S/(M(M − 1)),  where S = ∑_{x∈X} ∑_{x′∈X} δ(x, x′).

As before, let k_{ij} denote the number of letters j ∈ {0, . . . , q − 1} in the ith position in all codewords from X, i = 1, . . . , N. Then, clearly, ∑_{0≤j≤q−1} k_{ij} = M and the contribution of the ith position into S is

   ∑_{0≤j≤q−1} k_{ij} (M − k_{ij}) = M² − ∑_{0≤j≤q−1} k_{ij}² ≤ M² − M²/q,

as the quadratic function (u_1, . . . , u_q) ↦ ∑_{1≤j≤q} u_j² achieves its minimum on the set {u = u_1 . . . u_q : u_j ≥ 0, ∑_j u_j = M} at u_1 = · · · = u_q = M/q. Summing over all N digits, we obtain with θ = (q − 1)/q

   M(M − 1) d ≤ θ M² N,

which yields the bound M ≤ d(d − θN)^{−1}. The proof is completed as in the binary case.
There exists a substantial theory related to the equality in the Plotkin bound
(Hadamard codes) but it will not be discussed in this book. We would also like
to point out the fact that all bounds established so far (Hamming, Singleton, GV
and Plotkin) hold for codes that are not necessarily linear. As far as the GV bound
is concerned, one can prove that it can be achieved by linear codes: see Theorem
2.3.26.
Worked Example 2.1.19 Prove that a 2-error correcting binary code of length
10 can have at most 12 codewords.

Solution The distance of the code must be ≥ 5. Suppose that it contains M codewords and extend it to an [11, M] code of distance 6. The Plotkin bound works as follows. List all codewords of the extended code as rows of an M × 11 matrix. If column i in this matrix contains s_i zeros and M − s_i ones then

   6(M − 1)M ≤ ∑_{x∈X⁺} ∑_{x′∈X⁺} δ(x, x′) ≤ 2 ∑_{i=1}^{11} s_i (M − s_i).

The RHS is ≤ (1/2) · 11 M² if M is even and ≤ (1/2) · 11 (M² − 1) if M is odd. Hence, M ≤ 12.
Worked Example 2.1.20 (Asymptotics of the size of a binary ball) Let q = 2 and τ ∈ (0, 1/2). Then, with η(τ) = −τ log₂ τ − (1 − τ) log₂(1 − τ) (cf. (1.2.2a)),

   lim_{N→∞} (1/N) log v_{N,2}(⌊τN⌋) = lim_{N→∞} (1/N) log v_{N,2}(⌈τN⌉) = η(τ).      (2.1.24)

Solution Observe that with R = ⌊τN⌋ the last term in the sum

   v_{N,2}(R) = ∑_{i=0}^{R} \binom{N}{i},   R = ⌊τN⌋,

is the largest. Indeed, the ratio of two successive terms is

   \binom{N}{i+1} / \binom{N}{i} = (N − i)/(i + 1),

which remains ≥ 1 for 0 ≤ i ≤ R. Hence,

   \binom{N}{R} ≤ v_{N,2}(R) ≤ (R + 1) \binom{N}{R}.

Now use Stirling's formula: N! ∼ N^{N+1/2} e^{−N} √(2π). Then

   log \binom{N}{R} = −(N − R) log((N − R)/N) − R log(R/N) + O(log N)      (2.1.25)

and

   −(1 − R/N) log(1 − R/N) − (R/N) log(R/N) + O(log N)/N
      ≤ (1/N) log v_{N,2}(R) ≤ (1/N) log(R + 1) + the LHS.

The limit R/N → τ yields the result. The case where R = ⌈τN⌉ is considered in a similar manner.
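The convergence in (2.1.24) is easy to observe numerically. The short sketch below (not in the original) compares (1/N) log₂ v_{N,2}(⌊τN⌋) with η(τ) for a few values of N; τ = 0.3 is an arbitrary choice.

import math

def log2_vol(N, R):
    return math.log2(sum(math.comb(N, i) for i in range(R + 1)))

def eta(t):
    return -t*math.log2(t) - (1 - t)*math.log2(1 - t)

tau = 0.3
for N in (50, 200, 1000, 2000):
    print(N, log2_vol(N, int(tau * N)) / N, eta(tau))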
Worked Example 2.1.20 is useful in the study of the asymptotics of

   α(N, τ) = (1/N) log M₂*(N, ⌊τN⌋),      (2.1.26)

the information rate for the maximum size of a code correcting ∼ τN and detecting ∼ 2τN errors (i.e. a linear portion of the total number of digits N). Set

   a(τ) := lim inf_{N→∞} α(N, τ) ≤ lim sup_{N→∞} α(N, τ) =: ā(τ).      (2.1.27)

For these limits we have

Theorem 2.1.21 With η(τ) = −τ log₂ τ − (1 − τ) log₂(1 − τ), the following asymptotic bounds hold for binary codes:

   ā(τ) ≤ 1 − η(τ/2),  0 ≤ τ ≤ 1/2   (Hamming),      (2.1.28)
   ā(τ) ≤ 1 − τ,  0 ≤ τ ≤ 1/2   (Singleton),      (2.1.29)
   a(τ) ≥ 1 − η(τ),  0 ≤ τ ≤ 1/2   (GV),      (2.1.30)
   ā(τ) = 0,  1/2 ≤ τ ≤ 1   (Plotkin).      (2.1.31)

By using more elaborate bounds (also due to Plotkin), we'll show in Problem 2.10 that

   ā(τ) ≤ 1 − 2τ,  0 ≤ τ ≤ 1/2.      (2.1.32)

The proof of Theorem 2.1.21 is based on a direct inspection of the above-


mentioned bounds; for Hamming and GV bounds it is carried in Worked Example
2.1.22 later.

Figure 2.2  The asymptotic Plotkin, Singleton, Hamming and Gilbert–Varshamov bounds.

Figure 2.2 shows the behaviour of the bounds established. ‘Good’ sequences of
codes are those for which the pair (τ , α (N, τ N)) is asymptotically confined to
the domain between the curves indicating the asymptotic bounds. In particular, a
‘good’ code should ‘lie’ above the curve emerging from the GV bound. Construct-
ing such sequences is a difficult problem: the first examples achieving the asymp-
totic GV bound appeared in 1973 (the Goppa codes, based on ideas from algebraic
geometry). All families of codes discussed in this book produce values below the
GV curve (in fact, they yield α (τ ) = 0), although these codes demonstrate quite
impressive properties for particular values of N, M and d.
As to the upper bounds, the Hamming and Plotkin compete against each other,
while the Singleton bound turns out to be asymptotically insignificant (although
it is quite important for specific values of N, M and d). There are about a dozen
various other upper bounds, some of which will be discussed in this and subsequent
sections of the book.
The Gilbert–Varshamov bound itself is not necessarily optimal. Until 1982 there
was no better lower bound known (and in the case of binary coding there is still no
better lower bound known). However, if the alphabet used contains q ≥ 49 symbols
where q = p2m and p ≥ 7 is a prime number, there exists a construction, again based
on algebraic geometry, which produces a different lower bound and gives exam-
ples of (linear) codes that asymptotically exceed, as N → ∞, the GV curve [159].
Moreover, the TVZ construction carries a polynomial complexity. Subsequently,
two more lower bounds were proposed: (a) Elkies' bound, for q = p^{2m} + 1; and (b) Xing's bound, for q = p^m [43, 175] (see also N. Elkies, 'Excellent codes from modular curves'). By manipulating different coding constructions, the GV bound can also be improved for other alphabets.
also be improved for other alphabets.

Worked Example 2.1.22 Prove bounds (2.1.28) and (2.1.30) (that is, those parts
of Theorem 2.1.21 related to the asymptotical Hamming and GV bounds).

Solution Picking up the Hamming and the GV parts in (2.1.14), we have

   2^N / v_{N,2}(d − 1) ≤ M₂*(N, d) ≤ 2^N / v_{N,2}(⌊(d − 1)/2⌋).      (2.1.33)

The lower bound for the Hamming volume is trivial:

   v_{N,2}(⌊(d − 1)/2⌋) ≥ \binom{N}{⌊(d−1)/2⌋}.

For the upper bound, observe that with d/N ≤ τ < 1/2,

   v_{N,2}(d − 1) ≤ ∑_{0≤i≤d−1} ((d − 1)/(N − d + 1))^{d−1−i} \binom{N}{i}
                 ≤ ∑_{0≤i≤d−1} (τ/(1 − τ))^{d−1−i} \binom{N}{d−1} ≤ ((1 − τ)/(1 − 2τ)) \binom{N}{d−1}.

Then, for the information rate (log M₂*(N, d))/N,

   1 − (1/N) log [((1 − τ)/(1 − 2τ)) \binom{N}{d−1}]
      ≤ (1/N) log M₂*(N, d) ≤ 1 − (1/N) log \binom{N}{⌊(d−1)/2⌋}.

By Stirling's formula, as N → ∞ the logarithms in the previous inequalities obey

   (1/N) log \binom{N}{⌊(d−1)/2⌋} → η(τ/2),   (1/N) log \binom{N}{d−1} → η(τ).

The bounds (2.1.28) and (2.1.30) then readily follow.

Consider now the case of a general q-ary alphabet.

Example 2.1.23 Set θ := (q − 1)/q. By modifying the argument in Worked Example 2.1.22, prove that for any q ≥ 2 and τ ∈ (0, θ), the volume of the q-ary Hamming ball has the following logarithmic asymptote:

   lim_{N→∞} (1/N) log_q v_{N,q}(⌊τN⌋) = lim_{N→∞} (1/N) log_q v_{N,q}(⌈τN⌉) = η^{(q)}(τ) + τκ,      (2.1.34)

where

   η^{(q)}(τ) := −τ log_q τ − (1 − τ) log_q(1 − τ),   κ := log_q(q − 1).      (2.1.35)
Next, similarly to (2.1.26), introduce

   α^{(q)}(N, τ) = (1/N) log M_q*(N, ⌊τN⌋)      (2.1.36)

and the limits

   a^{(q)}(τ) := lim inf_{N→∞} α^{(q)}(N, τ) ≤ lim sup_{N→∞} α^{(q)}(N, τ) =: ā^{(q)}(τ).      (2.1.37)

Theorem 2.1.24 For all 0 < τ < θ,

   ā^{(q)}(τ) ≤ 1 − η^{(q)}(τ/2) − κτ/2   (Hamming),      (2.1.38)
   ā^{(q)}(τ) ≤ 1 − τ   (Singleton),      (2.1.39)
   a^{(q)}(τ) ≥ 1 − η^{(q)}(τ) − κτ   (GV),      (2.1.40)
   ā^{(q)}(τ) ≤ max[1 − τ/θ, 0]   (Plotkin).      (2.1.41)
Of course, the minimum of the right-hand sides of (2.1.38), (2.1.39) and (2.1.41)
provides the better of the three upper bounds. We omit the proof of Theorem 2.1.24,
leaving it as an exercise that is a repetition of the argument from Worked Example
2.1.22.
Example 2.1.25 Prove bounds (2.1.38) and (2.1.40), by modifying the solution
to Worked Example 2.1.22.

2.2 A geometric proof of Shannon's second coding theorem. Advanced bounds on the code size
In this section we give alternative proofs of both parts of Shannon’s second cod-
ing theorem (or Shannon’s noisy coding theorem, SCT/NCT; cf. Theorems 1.4.14
and 1.4.15) by using the geometry of the Hamming space. We then apply the tech-
niques that are developed in the course of the proof for obtaining some ‘advanced’
bounds on codes. The advanced bounds strengthen the Hamming bound established
in Theorem 2.1.6 and its asymptotic counterparts in Theorems 2.1.21 and 2.1.24.
The direct part of the SCT/NCT is given in Theorem 2.2.1 below, in a somewhat
modified form compared with Theorems 1.4.14 and 1.4.15. For simplicity, we only
consider here memoryless binary symmetric channels (MBSC), working in space
HN,2 = {0, 1}N (the subscript 2 will be omitted for brevity). As we learned in
Section 1.4, the direct part of the SCT states that for any transmission rate R < C
there exist

(i) a sequence of codes f_n : U_n → H_N, encoding a total of ♯U_n = 2^n messages; and
(ii) a sequence of decoding rules f̂_N : H_N → U_n, such that n ∼ NR and the probability of erroneous decoding vanishes as n → ∞.

Here C is given by (1.4.11) and (1.4.27). For convenience, we reproduce the expression for C again:

   C = 1 − η(p),  where η(p) = −p log p − (1 − p) log(1 − p),      (2.2.1)

and the channel matrix is

   Π = ( 1−p    p  )
       (  p    1−p ).      (2.2.2)

That is, we assume that the channel transmits a letter correctly with probability
1 − p and reverses with probability p, independently for different letters.
In Theorem 2.2.1, it is asserted that there exists a sequence of one-to-one cod-
ing maps fn , for which the task of decoding is reduced to guessing the codewords
fn (u) ∈ HN . In other words, the theorem guarantees that for all R < C there exists a
sequence of subsets XN ⊂ HN with  XN ∼ 2NR for which the probability of incor-
rect guessing tends to 0, and the exact nature of the coding map fn is not important.
Nevertheless, it is convenient to keep the map fn firmly in sight, as the existence
will follow from a probabilistic construction (random coding) where sample cod-
ing maps are not necessarily one-to-one. Also, the decoding rule is geometric: upon
receiving a word a(N) ∈ HN , we look for the nearest codeword fn (u) ∈ XN . Conse-
quently, an error is declared every time such a codeword is not unique or is a result
of multiple encodings or simply yields a wrong message. As we saw earlier, the
geometric decoding rule corresponds with the ML decoder when the probability
p ∈ (0, 1/2). Such a decoder enables us to use geometric arguments constituting
the core of the proof.
Again as in Section 1.4, the new proof of the direct part of the SCT/NCT only
guarantees the existence of ‘good’ codes (and even their ‘proliferation’) but gives
no clue on how to construct such codes [apart from running again a random coding
scheme and picking its ‘typical’ realisation].
In the statement of the SCT/NCT given below, we deal with the maximum error-
probability (2.2.4) rather than the averaged one over possible messages. However,
a large part of the proof is still based on a direct analysis of the error-probabilities
averaged over the codewords.

Theorem 2.2.1 (The SCT/NCT, the direct part) Consider an MBSC with channel
matrix Π as in (2.2.2), with 0 ≤ p < 1/2, and let C be as in (2.2.1). Then for any
164 Introduction to Coding Theory

R ∈ (0,C) there exists a sequence of one-to-one encoding maps fn : Un → HN


such that

(i)
n = NR , and  Un = 2n ; (2.2.3)

(ii) as n → ∞, the maximum error-probability under the geometric decoding rule


vanishes:

emax ( fn ) = max Pch error under geometric decoding
 (2.2.4)
| fn (u) sent : u ∈ Un → 0.
 
Here Pch · | fn (u)sent stands for the probability distribution of the received
word in HN generated by the channel, conditional on codeword fn (u) being
sent, with u ∈ Un being the original message emitted by the source.
As an illustration of this result, consider the following.
Example 2.2.2 We wish to send a message u ∈ A  n , where the size of alphabet

0.85 0.15
A equals K, through an MBSC with channel matrix . What rate of
0.15 0.85
transmission can be achieved with an arbitrarily small probability of error?
Here the value C = 1 − η (0.15) = 0.577291. Hence, by Theorem 2.2.1 any rate
of transmission < 0.577291 can be achieved for n large enough, with an arbitrarily
small probability of error. For example, if we want a rate of transmission 0.5 < R <
0.577291 and emax < 0.015 then there exist codes fn : A n → {0, 1}n/R achieving
this goal provided that n is sufficiently large: n > n0 .
Suppose that we know such a code fn . How do we encode the message m? First
divide m into blocks of length L where
E F
0.577291N
L= so that |A L | = K L ≤ 20.577291N .
log K
Then we can embed the blocks from A L in the alphabet A n and so encode the
blocks. The transmission rate is log |A L | n/R ∼ 0.577291. As was mentioned,
the SCT tells us that there are such codes but gives no idea of how to find (or
construct) them, which is difficult to do.
Before we embark on the proof of Theorem 2.2.1, we would like to explore
connections between the geometry of Hamming space HN and the  randomness

generated by the channel. As in Section 1.4, we use the symbol P · | fn (u) as a
shorthand for Pch · | fn (u) sent . The
 expectation and  variance under this distribu-
tion will be denoted by E( · | fn (u) and Var( · | fn (u) .
2.2 A geometric proof of Shannon’s second coding theorem 165
 
Observe that, under distribution P · | fn (u) , the number of distorted digits in
the (random) received word Y(N) can be written as
N
∑ 1(digit j in Y(N) = digit j in fn (u)).
j=1

This is a random variable which has a binomial distribution Bin(N, p), with the
mean value
  

N 
E ∑ 1(digit j in Y = digit j in fn (u)) fn (u)
(N)
j=1 
N  
= ∑ E 1(digit j in Y(N) = digit j in fn (u))| fn (u) = N p,
j=1

and the variance


  

N 
Var ∑ 1(digit j in Y(N) digit j in fn (u)) fn (u)
=
j=1 
N  
= ∑ Var 1(digit j in Y(N) = digit j in fn (u))| fn (u) = N p(1 − p).
j=1

Then, by Chebyshev’s inequality, for all given ε ∈ (0, 1 − p) and positive integer
N > 1/ε , the probability that at least N(p + ε ) digits have been distorted given
that the codeword fn (u) has been sent, is
p(1 − p)
≤ P ≥ N(p + ε ) − 1 distorted | fn (u) ≤ . (2.2.5)
N(ε − 1/N)2
Proofof Theorem 2.2.1. Throughout the proof, we follow the set-up from (2.2.3).
Subscripts n and N will be often omitted; viz., we set
2n = M.
We will assume the ML/geometric decoder without any further mention. Similarly
to Section 1.4, we identify the set of source messages Un with Hamming space Hn .
As proposed by Shannon, we use again a random encoding. More precisely, a mes-
sage u ∈ Hn is mapped to a random codeword Fn (u) ∈ HN , with IID digits taking
values 0 and 1 with probability 1/2 and independently of each other. In addition,
we make codewords Fn (u) independent for different messages u ∈ Hn ; labelling
the strings from Hn by u(1), . . . , u(M) (in no particular order) we obtain a fam-
ily of IID random strings Fn (u(1)), . . . , Fn (u(M)) from HN . Finally, we make the
codewords independent of the channel. Again, in analogy with Section 1.4, we can
think of the random code under consideration as a random megastring/codebook
from HNM = {0, 1}NM with IID digits 0, 1 of equal probability. Every given sample
f (= fn ) of this random codebook (i.e. any given megastring from HNM ) specifies
166 Introduction to Coding Theory

f(u(1)) f( u(2)) f (u(r))

Figure 2.3

a deterministic encoding f (u(1)), . . . , f (u(M)) of messages u(1), . . . , u(M), i.e. a


code f ; see Figure 2.3.
As in Section 1.4, we denote by P n the probability distribution of the random
code, with
1
P n (Fn = f ) = , for all sample megastrings f , (2.2.6)
2NM
and by E n the expectation relative to P n .
The plan of the rest of the proof is as follows. First, we will prove (by repeating
in part arguments from Section 1.4) that, for the transmission rate R ∈ (0,C), the
expected average probability for the above random coding goes to zero as n → ∞:
 
lim E n eave (Fn ) = 0. (2.2.7)
n→∞

Here eave (Fn ),


which is shorthand for eave (Fn (u(1)), . . . , Fn (u(M))), is a random
variable taking values in (0, 1) and representing the aforementioned average error-
probability for the random coding in question. More precisely, as in Section 1.4,
for all given sample collections of codewords f (u(1)), . . . , f (u(M)) ∈ HN (i.e. for
all given megastrings f from HNM ), we define
  1
eave fn = ∑
M 1≤i≤M
P error while using codebook f | f (u(i)) . (2.2.8)

Then the expected average error-probability is given by


  1  
E n eave (Fn ) = NM
2 ∑ eave f . (2.2.9)
f (u(1)),..., f (u(M))∈H N

Relation (2.2.9) implies (again in a manner similar to Section 1.4) that there
exists a sequence of deterministic codes fn such that the average error-probability
eave ( fn ) = eave ( fn (u(1)), . . . , fn (u(2n ))) obeys
lim eave ( fn ) = 0. (2.2.10)
n→∞

Finally, we will deduce (2.2.4) from (2.2.10): see Lemma 2.2.6.


2.2 A geometric proof of Shannon’s second coding theorem 167

fn (u(i))

decoding ...
a

fn (u(i)) sent
.. a received

N
(a, m)

Figure 2.4

Remark 2.2.3 As the codewords f (u(1)), . . . , f (u(M)) are thought to come


from a sample of the random codebook, we must allow that they may coincide
( f (u(i)) = f (u( j)) for i = j), in which case, by default, the ML decoder is report-
ing an error. This must be included when we consider probabilities in the RHS of
(2.2.8). Therefore, for i = 1, . . . , M we define

P error while using codebook f | f (u(i))


⎪ 1, if f (u(i)) = f (u(i )) for some i = i,

⎪    

⎨P δ Y(N) , f (u( j)) ≤ δ Y(N) , f (u(i)) (2.2.11)
=

⎪ for some j = i | f (u(i)) ,



⎩ if f (u(i)) = f (u(i )) for all i = i.

Let us now go through the detailed argument. The first step is

Lemma 2.2.4 Consider the channel matrix Π (cf. (2.2.2)) with 0 ≤ p < 1/2.
Suppose that the transmission rate R < C = 1 − η (p). Let N be > 1/ε. Then for
any ε ∈ (0, 1/2 − p), the expected average error-probability E n eave (Fn ) defined
in (2.2.8), (2.2.9) obeys
  p(1 − p) M − 1 3 4
E n eave (Fn ) ≤ + vN N(p + ε ) , (2.2.12)
N(ε − 1/N)2 2N
where vN (b) stands for the number of points in the ball of radius b in the binary
Hamming space HN .

Proof Set m(= mN (p, ε )) := N(p + ε ). The ML decoder definitely returns the
codeword fn (u(i)) sent through the channel when fn (u(i)) is the only codeword in
168 Introduction to Coding Theory
 
the Hamming ball BN (y, m) around the received word y = y(N) ∈ HN (see
Figure 2.4). In any other situation (when fn (u(i)) ∈ BN (y, m) or fn (u(k)) ∈
BN (y, m) for some k = i) there is a possibility of error.
Hence,

P error while using codebook f | fn (u(i))
 
≤ ∑ P y| fn (u(i)) 1 fn (u(i)) ∈ BN (y, m) (2.2.13)
y∈HN
+ ∑ P(z| fn (u(i))) ∑ 1 fn (u(k)) ∈ BN (z, m) .
z∈HN k =i

The first sum in the RHS is simple to estimate:


   
∑ P y| fn (u(i)) 1 fn (u(i) ∈ BN (y, m)
y∈HN      
= ∑ P y| fn (u(i)) 1 distance δ y, fn (u(i)) ≥ m
y∈HN (2.2.14)
p(1 − p)
= P ≥ m digits distorted| f (u(i)) ≤ ,
N(ε − 1/N)2
by virtue of (2.2.5). Observe that since the RHS in (2.2.14) does not depend on the
choice of the sample code f , the bound (2.2.14) will hold when we take first the
1
average ∑ and then expectation E n .
M 1≤i≤M
The second sum in the RHS of (2.2.13) is more tricky: it requires averaging and
taking expectation. Here we have
 
 
En ∑ ∑ P(z|Fn (u(i))) ∑ 1 Fn (u(k)) ∈ BN (z, m)
1≤i≤M z∈HN
 k =i
 
= ∑ ∑ ∑ En P(z|Fn (u(i)))1 Fn (u(k)) ∈ BN (z, m)
1≤i≤M k =i z∈HN   (2.2.15)
= ∑ ∑ ∑ En P(z|Fn (u(i)))
1≤i≤M k =i z∈HN   
×En 1 Fn (u(k)) ∈ BN (z, m) ,

since random codewords Fn (u(1)), . . . , Fn (u(M)) are independent. Next, as


each
of these codewords


over HN , the expectations
is uniformly distributed
En P(z|Fn (u(i))) and En 1 Fn (u(k)) ∈ BN (z, m) can be calculated as

1
En P(z|Fn (u(i))) = N
2 ∑ P(z|x) (2.2.16a)
x∈HN

and
 
vN (m)
En 1 Fn (u(k)) ∈ BN (z, m) = . (2.2.16b)
2N
2.2 A geometric proof of Shannon’s second coding theorem 169

Further, summing over z yields

∑ ∑ P(z|x) = ∑ ∑ P(z|x) = 2N . (2.2.17)


z∈HN x∈HN x∈HN z∈HN

Finally, after summation over k = i we obtain


1 vN (m)
the RHS of (2.2.15) = ∑ ∑
M 1≤i≤M k =i 2N
(2.2.18)
vN (m)M(M − 1) (M − 1)vN (m)
= = .
2N M 2N
Collecting (2.2.12)–(2.2.18) we have that EN [eave (Fn )] does not exceed the RHS
of (2.2.12).
At the next stage we estimate the volume vN (m) in terms of entropy h(p + ε )
where, recall, m = N(p + ε ). The argument here is close to that from Section 1.4
and based on the following result.
Lemma 2.2.5 Suppose that 0 < p < 1/2, ε > 0 and positive integer N satisfy
p + ε + 1/N < 1/2. Then the following bound holds true:
vN (N(p + ε )) ≤ 2N η (p+ε ) . (2.2.19)
The proof of Lemma 2.2.5 will be given later, after Worked Example 2.2.7. For
the moment we proceed with the proof of Theorem 2.2.1. Recall, we want to es-
tablish (2.2.7). In fact, if p < 1/2 and R < C = 1 − η (p) then we set ζ = C − R > 0
and take ε > 0 so small that (i) p + ε < 1/2 and (ii) R + ζ /2 < 1 − η (p + ε ). Then
we take N so large that (iii) N > 2/ε . With this choice of ε and N, we have
1 ε ζ
ε− > and R − 1 + η (p + ε ) < − . (2.2.20)
N 2 2
Then, starting with (2.2.12), we can write

4p(1 − p) 2NR N η (p+ε )
EN e(Fn ) ≤ + N 2
Nε 2 2 (2.2.21)
4 −N ζ /2
< p(1 − p) + 2 .
Nε 2
This implies (2.2.7) and hence the existence of a sequence of codes fn : Hn → HN
obeying (2.2.10).
To finish the proof of Theorem 2.2.1, we deduce (2.2.4) from (2.2.7), in the form
of Lemma 2.2.6:
Lemma 2.2.6 Consider a binary channel (not necessarily memoryless), and
let C > 0 be a given constant. With 0 < R < C and n = NR, define quantities
emax ( fn ) and eave ( fn ) as in (2.2.4), (2.2.8) and (2.2.11), for codes fn : Hn → HN
and fn : Hn → HN . Then the following statements are equivalent:
170 Introduction to Coding Theory

(i) For all R ∈ (0,C), there exist codes fn with lim emax ( fn ) = 0.
n→∞
(ii) For all R ∈ (0,C), there exist codes fn such that lim eave ( fn ) = 0.
n→∞

Proof of Lemma 2.2.6. It is clear that assertion (i) implies (ii). To deduce (i) from
(ii), take R < C and set for N big enough
1
R = R + < C, n = NR , M = 2n . (2.2.22)
N
We know that there exists a sequence fn of codes Hn → HN with eave ( fn ) → 0.
Recall that
1
eave ( fn ) = ∑ P error while using fn | fn (u(i)) . (2.2.23)
M 1≤i≤M

Here and below, M = 2NR  and fn (u(1)), . . . , fn (u(M )) are the codewords for
source messages u(1), . . . , u(M ) ∈ Hn .
Instead of P error while using fn | fn (u(i)) , we write P fn -error| fn (u(i)) ,

for brevity. Now, at least half of summands P fn -error| fn (u(i)) in the RHS of
(2.2.23) must be < 2eave ( fn ). Observe that, in view of (2.2.22),

M /2 ≥ 2NR−1 . (2.2.24)

Hence we have at our disposal at least 2NR−1 codewords f (u(i)) with



P error| fn (u(i)) < 2eave ( fn ).

List
 these 
codewords as a new binary code, of length N and information rate
log M /2 N. Denoting this new code by fn , we have

emax ( fn ) ≤ 2eave ( fn ).
 
Hence, emax ( fn ) → 0 as n → ∞ whereas log M /2 N → R. This gives statement
(i) and completes the proof of Lemma 2.2.6.

Therefore, the proof of Theorem 2.2.1 is now complete (provided that we prove
Lemma 2.2.5).

Worked Example 2.2.7 (cf. Worked Example 2.1.20.) Prove that for positive
integers N and m, with m < N/2 and β = m/N ,

2N η (β ) (N + 1) < vN (m) < 2N η (β ) . (2.2.25)
2.2 A geometric proof of Shannon’s second coding theorem 171

Solution Write
 
* + N
vN (m) =  points at distance ≤ m from 0 in HN = ∑ k .
0≤k≤m

With β = m/N < 1/2, we have that β /(1 − β ) < 1, and so


 m  k
β β
< , for 0 ≤ k < m.
1−β 1−β
Then, for 0 ≤ k < m, the product
 k
β
β k (1 − β )N−k
= (1 − β )N
 m 1 − β
β
> (1 − β )N = β m (1 − β )N−m .
1−β
Hence,
   
N N
1 = ∑ β k (1 − β )N−k >
∑ β k (1 − β )N−k
k k
0≤k≤N   0≤k≤m
N
> β (1 − β )
m N−m
∑ = vN (m)β m (1 − β )N−m
0≤k≤m k

= vN (m)2N[ (m/N) log β +(1−m/N) log(1−β ) ] ,

implying that vN (m) < 2N η (β ) . To obtain the left-hand bound in (2.2.25), write
 
N
vN (m) > ;
m

then we aim to check that the RHS is ≥ 2N η (β ) /(N + 1). Consider a binomial
random variable Y ∼ Bin(N, β ) with
 
N
pk = P(Y = k) = β k (1 − β )N−k , k = 0, . . . , N.
k
It suffices to prove that pk achieves its maximal value when k = m, since then
 
N 1
pm = β m (1 − β )N−m ≥ , with β m (1 − β )N−m = 2−N η (β ) .
m N +1
To this end, suppose first that k ≤ m and write
pk m!(N − m)!(N − m)m−k
=
pm k!(N − k)!mm−k
(k + 1) · · · m (N − m)m−k
= · .
mm−k (N − m + 1) · · · (N − k)
172 Introduction to Coding Theory

Here, the RHS is ≤ 1, as it is the product of 2(m − k) factors each of which is ≤ 1.


Similarly, if k ≥ m, we arrive at the product
mk−m (N − k + 1) · · · (N − m)
·
(m + 1) · · · k (N − m)k−m
which is again ≤ 1 as the product of 2(k − m) factors ≤ 1. Thus, the ratio pk /pm ≤
1, and the desired bound follows.

We are now in position to prove Lemma 2.2.5.

Proof of Lemma 2.2.5 First, p + ε < 1/2 implies that m = N(p + ε ) < N/2 and
m N(p + ε )
β := = < p + ε,
N N
which, in turn, implies that η (β ) < η (p + ε ) as x → η (x) is a strictly increasing
function for x from the interval (0, 1/2). This yields the assertion of Lemma 2.2.5.

The geometric proof of the direct part of SCT/NCT clarifies the meaning of the
concept of capacity (of an MBSC at least). Physically speaking, in the expressions
(1.4.11), (1.4.27) and (2.2.1) for capacity C = η (p) of an MBSC, the positive term
1 points at the rate at which a random code produces an ‘empty’ volume between
codewords whereas the negative term −η (p) indicates the rate at which the code-
words progressively fill this space. We continue with a working example of an
essay type:
Worked Example 2.2.8 Quoting general theorems on the evaluation of the chan-
nel capacity, deduce an expression for the capacity of a memoryless binary sym-
metric channel. Evaluate, in particular, the capacities of (i) a symmetric memory-
less channel and (ii) a perfect channel with an input alphabet {0, 1} whose inputs
are subject to the restriction that 0 should never occur in succession.

Solution The channel capacity is defined as a supremum of transmission rates R for


which the received message can be decoded correctly, with probability approaching
1 as the length of the message increases to infinity. A popular class is formed by
memoryless channels where, for a given input word x(N) = x1 . . . xN , the probability

P (N)
y(N)
received|x(N)
sent = ∏ P(yi |xi ).
1≤i≤N

In other words, the noise acts on each symbol xi of the input string x independently,
and P(y|x) is the probability of having an output symbol y given that the input
symbol is x.
2.2 A geometric proof of Shannon’s second coding theorem 173

Symbol x runs over Ain , an input alphabet of a given size q, and y belongs to
Aout , an output alphabet of size r. Then probabilities P(y|x) form a q × r stochastic
matrix (the channel matrix). A memoryless channel is called symmetric if the rows
of this matrix are permutations of each other, i.e. contain the same collection of
probabilities, say p1 , . . . , pr . A memoryless symmetric channel is said to be double-
symmetric if the columns of the channel matrix are also permutations of each other.
If m = n = 2 (typically, Ain = Aout = {0, 1}) a memoryless channel is called binary.
For a memoryless binary symmetric channel, the channel matrix entries P(y|x)
are P(0|0) = P(1|1) = 1 − p, P(1|0) = P(0|1) = p, p ∈ (0, 1) being the flipping
probability and 1 − p the probability of flawless transmission of a single binary
symbol.
A channel is characterised by its capacity: the value C ≥ 0 such that:
(a) for all R < C, R is a reliable transmission rate; and
(b) for all R > C, R is an unreliable transmission rate.
Here R is called a reliable transmission rate if there exists a sequence of codes
fn : Hn → HN and decoding rules fN : HN → Hn such that n ∼ NR and the (suit-
ably defined) probability of error
e( fn , fN ) → 0, as N → ∞.
In other words,
1
C = lim log MN
N→∞ N
where MN is the maximal number of codewords x ∈ HN for which the probability
of erroneous decoding tends to 0.
The SCT asserts that, for a memoryless channel,
C = max I(X : Y )
pX

where I(X : Y ) is the mutual information between a (random) input symbol X and
the corresponding output symbol Y , and the maximum is over all possible proba-
bility distributions pX of X.
Now in the case of a memoryless symmetric channel (MSC), the above maximi-
sation procedure applies to the output symbols only:
 
C = max h(Y ) + ∑ pi log pi ;
pX
1≤i≤r

the sum − ∑ pi log pi being the entropy of the row of channel matrix (P(y|x)). For
i
a double-symmetric channel, the expression for C simplifies further:
C = log M − h(p1 , . . . , pr )
174 Introduction to Coding Theory

as h(Y ) is achieved at equidistribution pX , with pX (x) ≡ 1/q (and pY (y) ≡ 1/r). In


the case of an MBSC we have

C = 1 − η (p).

This completes the solution to part (i).


Next, the channel in part (ii) is not memoryless. Still, the general definitions are
applicable, together with some arguments developed so far. Let n( j,t) denote the
number of allowed strings of length t ending with letter j, j = 0, 1. Then

n(0,t) = n(1,t − 1),


n(1,t) = n(0,t − 1) + n(1,t − 1),
whence
n(1,t) = n(1,t − 1) + n(1,t − 2).

Write it as a recursion
   
n(1,t) n(1,t − 1)
=A ,
n(1,t − 1) n(1,t − 2)
with the recursion matrix
 
1 1
A= .
1 0
The general solution is
n(1,t) = c1 λ1t + c2 λ2t ,

where λ1 , λ2 are the eigenvalues of A, i.e. the roots of the characteristic equation

det (A − λ I) = (1 − λ )(−λ ) − 1 = λ 2 − λ − 1 = 0.
 √ 
So, λ = 1 ± 5 2, and
√ 
1 5+1
log n(1,t) = log .
t 2

The capacity of the channel is given by


1
C = lim log  of allowed input strings of length t
t→∞ t √ 
1
5+1
= lim log n(1,t) + n(0,t) = log .
t→∞ t 2
2.2 A geometric proof of Shannon’s second coding theorem 175

Remark 2.2.9 We canmodify the last


 question, by considering an MBC with
1− p p
the channel matrix Π = whose input is under a restriction that 0
p 1− p
should never occur in succession. Such a channel may be treated as a composi-
tion of two consecutive channels (cf. Worked Example 1.4.29(a)), which yields the
following answer for the capacity:
 √  
5+1
C = min log , 1 − η (p) .
2

Next, we present the strong converse part of Shannon’s SCT for an MBSC (cf.
Theorem 1.4.14); again we are going to prove it by using geometry of Hamming’s
spaces. The term ‘strong’ indicates that for every transmission rate R > C, the
channel capacity, the maximum probability of error actually gets arbitrarily close
to 1. Again for simplicity, we prove the assertion for an MBSC.

Theorem 2.2.10 (The SCT/NCT, thestrong converse  part) Let C be the capacity
1− p p
of an MBSC with the channel matrix , where 0 < p < 1/2, and
p 1− p
take R > C. Then, with n = NR, for all codes fn : Hn → HN and decoding rules
fN : HN → Hn , the maximum error-probability
 
ε max ( fn , fN ) := max P error under fN | fn (u) : u ∈ Hn (2.2.26a)

obeys
lim sup ε max ( fn , fN ) = 1. (2.2.26b)
N→∞

Proof As in Section 1.4, we can assume that codes fn are one-to-one and obey
fN ( fn (u)) = u, for all u ∈ Hn (otherwise, the chances of erroneous decoding will
be even larger). Assume the opposite of (2.2.26b):

ε max ( fn , fN ) ≤ c for some c < 1 and all N large enough. (2.2.27)

Our aim is to deduce from (2.2.27) that R ≤ C. As before, set n = NR and let
fn (u(i)) be the codeword for string u(i) ∈ Hn , i = 1, . . . , 2n . Let Di ⊂ HN be the set
of binary strings where fN returns fn (u(i)): fN (a) = fn (u(i)) if and only if a ∈ Di .
Then Di ! f (u(i)), sets Di are pairwise disjoint, and if the union ∪i Di = HN then
on the complement HN \ ∪i Di the channel declares an error. Set si =  Di , the size
of set Di .
Our first step is to ‘improve’ the decoding rule, by making it ‘closer’ to the ML
rule. In other words, we want to replace each Di with a new set, Ci ∈ HN , of the
same cardinality  Ci = si , but of a more ‘rounded’ shape (i.e. closer to a Hamming
176 Introduction to Coding Theory

ball B( f (u(i)), bi )). That is, we look for pairwise disjoint sets Ci , of cardinalities
 Ci = si , satisfying

BN ( f (u(i)), bi ) ⊆ Ci ⊂ BN ( f (u(i)), bi + 1), 1 ≤ i ≤ 2n , (2.2.28)

for some values of radius bi ≥ 0, to be specified later. We can think that Ci is


obtained from Di by applying a number of ‘disjoint swaps’ where we remove a
string a and add another string, b, with the Hamming distance
   
δ b, fn (u(i)) ≤ δ a, fn (u(i)) . (2.2.29)

Denote the new decoding rule by gN . As the flipping probability p < 1/2, the
relation (2.2.29) implies that

P( fN returns fn (u(i))| fn (u(i))) = P(Di | fn (u(i)))


≤ P(Ci | fn (u(i))) = P( gN returns fn (u(i))| fn (u(i))),
which in turn is equivalent to

P(error when using gN | fn (u(i))) ≤ P(error when using fN | fn (u(i))). (2.2.30)

Then, clearly,
ε max ( fn , gN ) ≤ ε max ( fn , fN ) ≤ c. (2.2.31)

Next, suppose that there exists p < p such that, for any N large enough,

bi + 1 ≤ N p  for some 1 ≤ i ≤ 2n . (2.2.32)

Then, by virtue of (2.2.28) and (2.2.31), with Cic standing for the complement
HN \ Ci ,
P(at least N p digits distorted| fn (u(i)))
≤ P(at least bi + 1 digits distorted| fn (u(i)))
≤ P(Cic | fn (u(i))) ≤ ε max ( fn , gN ) ≤ c.
This would lead to a contradiction, since, by the law of large numbers, as N → ∞,
the probability

P(at least N p digits distorted | x sent) → 1

uniformly in the choice of the input word x ∈ HN . (In fact, this probability does
not depend on x ∈ HN .)
Thus, we cannot have p ∈ (0, p) such that, for N large enough, (2.2.32) holds
true. That is, the opposite is true: for any given p ∈ (0, p), we can find an arbitrarily
large N such that
bi > N p , for all i = 1, . . . , 2n . (2.2.33)
2.2 A geometric proof of Shannon’s second coding theorem 177

(As we claim (2.2.33) for all p ∈ (0, p), it does not matter if in the LHS of (2.2.33)
we put bi or bi + 1.)
At this stage we again use the explicit expression for the volume of the Hamming
ball:
   
N N
si =  Di =  Ci ≥ vN (bi ) = ∑ ≥
0≤ j≤bi j bi
 
N
≥ , provided that bi > N p . (2.2.34)
N p 
A useful bound has been provided in Worked Example 2.2.7 (see (2.2.25)):
 
1
2N η N .
R
vN (R) ≥ (2.2.35)
N +1
We are now in a position to finish the proof of Theorem 2.2.10. In view of
(2.2.35), we have that, for all p ∈ (0, p), we can find an arbitrarily large N such
that

si ≥ 2N(η (p )−εN ) , for all 1 ≤ i ≤ 2n ,
with lim εN = 0. As the original sets D1 , . . . , D2n are disjoint, we have that
N→∞

s1 + · · · + s2n ≤ 2N , implying that 2N(η (p )−εN ) × 2NR ≤ 2N ,
or
NR 1
η (p ) − εN + ≤ 1, implying that R ≤ 1 − η (p ) + εN + .
N N
As N → ∞, the RHS tends to 1 − η (p). So, given any p ∈ (0, p), R ≤ 1 − η (p ).

This is true for all p < p, hence R ≤ 1 − η (p) = C. This completes the proof of
Theorem 2.2.10.
We have seen that the analysis of intersections of a given set X in a Hamming
space HN (and more generally, in HN,q ) with various balls BN (y, s) reveals a lot
about the set X itself. In the remaining part of this section such an approach will
be used for producing some advanced bounds on q-ary codes: the Elias bound and
the Johnson bound. These bounds are among the best-known general bounds for
codes, and they are competing.
The Elias bound is proved in a fashion similar to Plotkin’s: cf. Theorem 2.1.15
and Worked Example 2.1.18. We count codewords from a q-ary [N, M, d] code X
in balls BN,q (y, s) of radius s about words y ∈ HN,q . More precisely, we count pairs
(x, BN,q (y, s)) where x ∈ X ∩ BN,q (y, s). If ball BN,q (y, s) contains Ky codewords
then
∑ Ky = MvN,q (s) (2.2.36)
y∈HN
178 Introduction to Coding Theory

as each word x falls in vN,q (s) of the balls BN,q (y, s).

Lemma 2.2.11 If X is a q-ary [N, M] code then for all s = 1, . . . , N there


 ex-
ists a ball BN,q (y, s) about an N -word y ∈ HN,q with the number Ky =  X ∩
BN,q (y, s) of codewords in BN,q (y, s) obeying

Ky ≥ MvN,q (s)/qN . (2.2.37)


1
Proof Divide both sides of (2.2.36) by qN . Then ∑ Ky gives the average num-
qN y
ber of codewords in ball BN,q (y, s). But there must be a ball containing at least as
many as the average number of codewords.

A ball BN,q (y, s) with property (2.2.37) is called critical (for code X ).

Theorem 2.2.12 (The Elias bound) Set θ = (q − 1)/q. Then for all integers s ≥ 1
such that s < θ N and s2 − 2θ Ns + θ Nd > 0, the maximum size Mq∗ (N, d) of a q-ary
code of length N and distance d satisfies
θ Nd qN
Mq∗ (N, d) ≤ · . (2.2.38)
s2 − 2θ Ns + θ Nd vN,q (s)
Proof Fix a critical ball BN,q (y, s) and consider code X obtained by subtracting
word y from the codewords of X : X = {x − y : x ∈ X }. Then X is again an
[N, M, d] code. So, we can assume that y = 0 and BN,q (0, s) is a critical ball.
Then take X1 = X ∩ BN,q (0, s) = {x ∈ X : w(x) ≤ s}. The code X1 is [N, K, e]
where e ≥ d and K (= K0 ) ≥ MvN,q (s)/qN . As in the proof of the Plotkin bound,
consider the sum of the distances between the codewords in X1 :

S1 = ∑ ∑ δ (x, x ).
x∈X1 x ∈X1

Again, we have that S1 ≥ K(K − 1)e. On the other hand, if ki j is the number of
letters j ∈ Jq = {0, . . . , q − 1} in the ith position in all codewords x ∈ X1 then

S1 = ∑ ∑ ki j (K − ki j ).
1≤i≤N 0≤ j≤q−1

Note that the sum ∑ ki j = K. Besides, as w(x) ≤ s, the number of 0s in every


0≤ j≤q−1
word x ∈ X1 is ≥ N − s. Then the total number of 0s in all codewords equals
∑ ki0 ≥ K(N − s). Now write
1≤i≤N
 
S = NK − 2
∑ 2
ki0 + ∑ ki2j ,
1≤i≤N 1≤ j≤q−1
2.2 A geometric proof of Shannon’s second coding theorem 179

and use the Cauchy–Schwarz inequality to estimate


 2
1 1
∑ ki2j ≥ q − 1 ∑ ki j =
q−1
(K − ki0 )2 .
1≤ j≤q−1 1≤ j≤q−1

Then
 
1
S≤ NK 2 − ∑ 2 +
ki0 (K − ki0 ) 2
1≤i≤N q−1
1

= NK 2 − 2 + K 2 − 2Kk + k2
∑ (q − 1)ki0 i0
q − 1 1≤i≤N i0
1
= NK 2 − ∑ (qk2 + K 2 − 2Kki0 )
q − 1 1≤i≤N i0
N q 2 + 2 K
= NK 2 − K2 − ∑ ki0 ∑ ki0
q−1 q − 1 1≤i≤N q − 1 1≤i≤N
q−2 q 2 + 2 KL,
= NK 2 − ∑ ki0
q−1 q − 1 1≤i≤N q−1

where L = ∑ ki0 . Use Cauchy–Schwarz once again:


1≤i≤N

 2
1 1 2
∑ 2
ki0 ≥
N ∑ ki0 =
N
L .
1≤i≤N 1≤i≤N

Then
q−2 q 1 2 2
S≤ NK 2 − L + KL
q−1  q−1 N q−1 
1 q
= (q − 2)NK 2 − L2 + 2KL .
q−1 N
The maximum of the quadratic expression in the square brackets occurs at L =
NK/q. Recall that L ≥ K(N − s). So, choosing K(N − s) ≥ NK/q, i.e. s ≤ N(q −
1)/q, we can estimate
1  q 2 
S≤ (q − 2)NK − K (N − s) + 2K (N − s)
2 2 2
q−1  N 
1 qs
= K 2 s 2(q − 1) − .
q−1 N
1  qs 
This yields the inequality K(K − 1)e ≤ K 2 s 2(q − 1) − which can be
q−1 N
solved for K:
θ Ne
K≤ ,
s2 − 2θ Ns + θ Ne
180 Introduction to Coding Theory

provided that s < N θ and s2 − 2θ Ns + θ Ne > 0. Finally, recall that X (1) arose
from an [N, M, d] code X , with K ≥ Mv(s)/qN and e ≥ d. As a result, we obtain
that
MvN,q (s) θ Nd
≤ 2 .
qN s − 2θ Ns + θ Nd
This leads to the Elias bound (2.2.38).

The ideas used in the proof of the Elias bound (and earlier in the proof of the
Plotkin bounds) are also helpful in obtaining bounds for W2∗ (N, d, ), the maximal
size of a binary (non-linear) code X ∈ HN,2 of length N, distance d(X ) ≥ d and
with the property that the weight w(x) ≡ , x ∈ X . First, three obvious statements:
A B
∗ N
(i) W2 (N, 2k, k) = ,
k
(ii) W2∗ (N, 2k, ) = W2∗ (N, 2k, N − ),
(iii) W2∗ (N, 2k − 1, ) = W2∗ (N, 2k, ), /2 ≤ k ≤ .

[The reader is advised to prove these as an exercise.]


N
Worked Example 2.2.13 Prove that for all positive integers N ≥ 1, k ≤ and
? 2
N N2
< − − kN ,
2 4
A B
kN
W2∗ (N, 2k, ) ≤ . (2.2.39)
2 − N + kN

Solution Take an [N, M, 2k] code X such that w(x) ≡ , x ∈ X . As before, let
ki1 be the number of 1s in
position
i in all codewords. Consider the sum of the
dot-products D = ∑ 1 x = x "x · x #. We have

x,x ∈X

1  
"x · x # = w(x ∧ x ) = w(x) + w(x ) − δ x, x )
2
1
≤ (2 − 2k) =  − k
2
and hence
D ≤ ( − k)M(M − 1).

On the other hand, the contribution to D from position i equals ki1 (ki1 − 1), i.e.

D= ∑ ki1 (ki1 − 1) = ∑ 2
(ki1 − ki1 ) = ∑ 2
ki1 − M.
1≤i≤N 1≤i≤N 1≤i≤N
2.2 A geometric proof of Shannon’s second coding theorem 181

Again, the last sum is minimised at ki1 = M/N, i.e.

2 M 2
− M ≤ D ≤ ( − k)M(M − 1).
N
This immediately leads to (2.2.39).

Another useful bound is given now.


N
Worked Example 2.2.14 Prove that for all positive integers N ≥ 1, k ≤ and
2
2k ≤  ≤ 4k,
A B
N ∗
W2∗ (N, 2k, ) ≤ W (N − 1, 2k,  − 1) . (2.2.40)
 2

Solution Again take an [N, M, 2k] code X such that w(x) ≡  for all x ∈ X .
Consider the shortening code X on x1 = 1 (cf. Example 2.1.8(v)): it gives a code
of length (N − 1), distance ≥ 2k and constant weight ( − 1). Hence, the size of
the cross-section is ≤ W2∗ (N − 1, 2k,  − 1). Therefore, the number of 1s at position
1 in the codewords of X does not exceed W2∗ (N − 1, 2k,  − 1). Repeating this
argument, we obtain that the total number of 1s in all positions is ≤ NW2∗ (N −
1, 2k,  − 1). But this number equals M, i.e. M ≤ NW2∗ (N − 1, 2k,  − 1). The
bound (2.2.40) then follows.
N
Corollary 2.2.15 For all positive integers N ≥ 1, k ≤ and 2k ≤  ≤ 4k − 2,
2
W2∗ (N, 2k − 1, ) = W2∗ (N, 2k, )
A A A A B BBB
N N −1 N −+k
≤ ··· ··· . (2.2.41)
 −1 k
The remaining part of Section 2.2 focuses on the Johnson bound. This bound
aims at improving the binary Hamming bound (cf. (2.1.8b) with q − 2):

M2∗ (N, 2E + 1) ≤ 2N vN (E) or vN (E) M2∗ (N, 2E + 1) ≤ 2N . (2.2.42)

Namely, the Johnson bound asserts that

M2∗ (N, 2E + 1) ≤ 2N /v∗N (E) or v∗N (E) M2∗ (N, 2E + 1) ≤ 2N , (2.2.43)

where
 
1 N
v∗N (E) = vN (E) +
N/(E + 1) E +1
 
2E + 1
−W2∗ (N, 2E + 1, 2E + 1) . (2.2.44)
E
182 Introduction to Coding Theory
 
N
Recall that vN (E) = ∑ stands for the volume of the binary Hamming ball
0≤s≤E s
of radius E. We begin our derivation of bound (2.2.43) with the following result.
Lemma
 2.2.16 If x, y are binary words, with δ (x, y) = 2 + 1, then there exists
2 + 1
binary words z such that δ (x, z) =  + 1 and δ (y, z) = .

Proof Left as an exercise.

Consider the set T (= TN,E+1 ) of all binary N-words at distance exactly E + 1


from the codewords from X :
(
T = z ∈ HN : δ (z, x) = E + 1 for some x ∈ X
)
and δ (z, y) ≥ E + 1 for all y ∈ X . (2.2.45)

Then we can write that


MvN (E) +  T ≤ 2N , (2.2.46)

as none of the words z ∈ T falls in any of the balls of radius E about the codewords
y ∈ X . The bound (2.2.43) will follow when we solve the next worked example.
Worked Example 2.2.17 Prove that the cardinality  T is greater than or equal
to the second term from the RHS of (2.2.44):
   
M N ∗ 2E + 1
−W2 (N, 2E + 1, 2E + 1) . (2.2.47)
N/(E + 1) E + 1 E

Solution We want to find a lower bound on  T . Consider the set W (= WN,E+1 ))


of ‘matched’ pairs of N-words defined by
( )
W = (x, z) : x ∈ X , z ∈ TE+1 , δ (x, z) = E + 1
(
= (x, z) : x ∈ X , z ∈ HN : δ (x, z) = E + 1, (2.2.48)
)
and δ (y, z) ≥ E + 1 for all y ∈ X .

Given x ∈ X , the x-section W x is defined as


W x = {z ∈ HN : (x, z) ∈ W }
(2.2.49)
= {z : δ (x, z) = E + 1, δ (y, z) ≥ E + 1 for all y ∈ X }.
Observe that if δ (x, z) = E + 1 then δ (y, z) ≥ E for all y ∈ X \ {x}, as otherwise
δ (x, y) < 2E + 1. Hence:

W x = {z : δ (x, z) = E + 1, δ (y, z) = E for all y ∈ X }. (2.2.50)


2.2 A geometric proof of Shannon’s second coding theorem 183

We see that, to evaluate  W x , we must detract,from the


 number of binary N-
N
words lying at distance E + 1 from x, i.e. from , the number of those
E +1
lying also at distance E from some other codeword y ∈ X . But if δ (x, z) = E + 1
and δ (y, z) = E then δ (x, y) = 2E + 1. Also, no two distinct codewords can have
distance E from a single N-word z. Hence, by the previous remark,
   
N 2E + 1 * +
W =x
− ×  y ∈ X : δ (x, y) = 2E + 1 .
E +1 E
Moreover, if we subtract x from every y ∈ X with δ (x, y) = 2E + 1, the result
is a code of length N whose codewords z have weight w(z) ≡ 2E + 1. Hence,
there are at most W ∗ (N, 2E + 1, 2E + 1) codewords y ∈ X with δ (x, y) = 2E + 1.
Consequently,
   
N ∗ 2E + 1
W ≥x
−W (N, 2E + 1, 2E + 1) (2.2.51)
E +1 E
and
 W ≥ M × the RHS of (2.2.51). (2.2.52)
Now fix v ∈ T and consider the v-section
W v = {y ∈ X : (y, v) ∈ W } = {y ∈ X : δ (y, v) = E + 1}. (2.2.53)
If y, z ∈ W v then δ (y, u) = δ (z, u) = E + 1. Thus,
w(y − u) = w(z − u) = E + 1
and
2E + 1 ≤ δ (y, z) = δ (y − v, z − v)
= w(y − v) + w(z − v) − 2w((y − v) ∧ (z − v))
= 2E + 2 − 2w((y − v) ∧ (z − v)).
This implies that
w((y − v) ∧ (z − v)) = 0 and δ (y, z) = 2E + 2.
E F
N
So, y − v and z − v have no digit 1 in common. Hence, there exist at most
E F E +1
N
words of the form y − v where y ∈ W v , i.e. at most words in W v . There-
E +1
fore,
E F
N
W ≤ T . (2.2.54)
E +1
Collecting (2.2.51), (2.2.52) and (2.2.54) yields inequality (2.2.47).
184 Introduction to Coding Theory

Corollary 2.2.18 In view of Corollary 2.2.15 the following bound holds true:


M (N, 2E + 1) ≤ 2 vN (E)
N

  A B −1 (2.2.55)
1 N N −E N −E
− − .
N/(E + 1) E E +1 E +1
Example 2.2.19A Let A NA= 13BBBand E = 2, i.e. d = 5. Inequality (2.2.41) implies
13 12 11
W ∗ (13, 5, 5) ≤ = 23, and the Johnson bound in (2.2.43) yields
5 4 3
A B
∗ 213
M (13, 5) ≤ = 77.
1 + 13 + 78 + (286 − 10 × 23)/4
This bound is much better than Hamming’s which gives M ∗ (13, 5) ≤ 89. In fact, it
is known that M ∗ (13, 5) = 64. Compare Section 3.4.

2.3 Linear codes: basic constructions


In this section we explore further the class of linear codes. To start with, we con-
sider binary codes, with digits 0 and 1. Accordingly, HN will denote the binary
Hamming space of length N; words x(N) = x1 . . . xN from HN will be also called
(row) vectors. All operations over binary digits are performed in the binary arith-
metic (that is, mod 2). When it does not lead to a confusion, we will omit sub-
scripts N and superscripts (N). Let us repeat the definition of a linear code (cf.
Definition 2.1.5).
Definition 2.3.1 A binary code X ⊆ HN is called linear if, together with a pair
of vectors, x = x1 . . . xN and x = x1 . . . xN , code X contains the sum x + x , with
digits xi + xi . In other words, a linear code is a linear subspace in HN , over field
F2 = {0, 1}. Consequently, a linear code always contains a zero row-vector 0 =
0 . . . 0. A basis of a linear code X is a maximal linearly independent set of words
from X ; the linear code is generated by its basis in the sense that every vector
x ∈ X is (uniquely) represented as a sum of (some) vectors from the basis. All
bases of a given linear code X contain the same number of vectors; the number of
vectors in the basis is called the dimension or the rank of X . A linear code of length
N and rank k is also called an [N, k] code, or an [N, k, d] code if its distance is d.
Practically all codes used in modern practice are linear. They are popular because
they are easy to work with. For example, to identify a linear code it is enough to
fix its basis, which yields a substantial economy as the subsequent material shows.
Lemma 2.3.2 Any binary linear code of rank k contains 2k vectors, i.e. has size
M = 2k .
2.3 Linear codes: basic constructions 185

Proof A basis of the code contains k linearly independent vectors. The code is
generated by the basis; hence it consists of the sums of basic vectors. There are
precisely 2k sums (the number of subsets of {1, . . . , k} indicating the summands),
and they all give different vectors.

Consequently, a binary linear code X of rank k may be used for encoding all
possible source strings of length k; the information rate of a binary linear [N, k]
code is k/N. Thus, indicating k ≤ N linearly independent words x ∈ HN identifies
a (unique) linear code X ⊂ HN of rank k. In other words, a linear binary code of
rank k is characterised by a k × N matrix of 0s and 1s with linearly independent
rows:
⎛ ⎞
g11 . . . . . . . . . g1N
⎜ g21 . . . . . . . . . g2N ⎟
⎜ ⎟
G=⎜ .. .. ⎟
⎝ . . ⎠
gk1 ... ... ... gkN

Namely, we take the rows g(i) = gi1 . . . giN , 1 ≤ i ≤ k, as the basic vectors of a
linear code.

Definition 2.3.3 A matrix G is called a generating matrix of a linear code. It is


clear that the generating matrix is not unique.
Equivalently, a linear [N, k] code X may be described as the kernel of a certain
(N − k) × N matrix H, again with the entries 0 and 1: X = ker H where
⎛ ⎞
h11 h12 ... ... h1N
⎜ h21 h22 ... ... h2N ⎟
⎜ ⎟
H =⎜ . .. .. .. .. ⎟
⎝ .. . . . . ⎠
h(N−k)1 h(N−k)2 ... ... h(N−k)N
and
( )
ker H = x = x1 . . . xN : xH T = 0(N−k) . (2.3.1)

It is plain that the rows h( j), 1 ≤ j ≤ N − k, of matrix H are vectors orthogonal to


X , in the sense of the inner dot-product:

"x · h( j)# = 0, for all x ∈ X and 1 ≤ j ≤ N − k.

Here, for x, y ∈ HN ,
N
"x · y# = "y · x# = ∑ xi yi , where x = x1 . . . xN , y = y1 . . . yN ; (2.3.2)
i=1

cf. Example 2.1.8(ix).


186 Introduction to Coding Theory

The inner product (2.3.2) possesses all properties of the Euclidean scalar product
in RN , but one: it is not positive definite (and therefore does not define a norm). That
is, there are non-zero vectors x ∈ HN with "x · x# = 0. Luckily, we do not need the
positive definiteness.
However, the key rank–nullity property holds true for the dot-product: if L is
a linear subspace in HN of rank k then its orthogonal complement L ⊥ (i.e. the
collection of vectors z ∈ HN such that "x · z# = 0 for all x ∈ L ) is a linear subspace
of rank N − k. Thus, the (N − k) rows of H can be considered as a basis in X ⊥ ,
the orthogonal complement to X .
The matrix H (or sometimes its transpose H T ) with the property X = ker H or
"x · h( j)# ≡ 0 (cf. (2.3.1)) is called a parity-check (or, simply, check) matrix of code
X . In many cases, the description of a code by a check matrix is more convenient
than by a generating one.

The parity-check matrix is again not unique as the basis in X ⊥ can be chosen
non-uniquely. In addition, in some situations where a family of codes is consid-
ered, of varying length N, it is more natural to identify a check matrix where the
number of rows can be greater than N − k (but some of these rows will be linearly
dependent); such examples appear in Chapter 3. However, for the time being we
will think of H as an (N − k) × N matrix with linearly independent rows.

Worked Example 2.3.4 Let X be a binary linear [N, k, d] code of information


rate ρ = k/N . Let G and H be, respectively, the generating and parity-check matri-
ces of X . In this example we refer to constructions introduced in Example 2.1.8.

(a) The parity-check extension of X is a binary code X + of length N +1 obtained


by adding, to each codeword x ∈ X , the symbol xN+1 = ∑ xi so that the
1≤i≤N
sum ∑ xi is zero. Prove that X + is a linear code and find its rank and
1≤i≤N+1
minimal distance. How are the information rates and generating and parity-
check matrices of X and X + related?
(b) The truncation X − of X is defined as a linear code of length N − 1 obtained
by omitting the last symbol of each codeword x ∈ X . Suppose that code X
has distance d ≥ 2. Prove that X − is linear and find its rank and generating
and parity-check matrices. Show that the minimal distance of X − is at least
d − 1.
(c) The m-repetition of X is a code X re (m) of length Nm obtained by repeat-
ing each codeword x ∈ X a total of m times. Prove that X re (m) is a linear
code and find its rank and minimal distance. How are the information rates and
generating and parity-check matrices of X re (m) related to ρ, G and H ?
2.3 Linear codes: basic constructions 187

Solution (a) The generating and parity-check matrices are


⎛ ⎞
⎛ ⎞ | 1
| ∑ g1i ⎜ | ⎟
⎜ 1≤i≤N ⎟ ⎜ 1 ⎟
⎜ ⎟ ⎜ | · ⎟
⎟ + ⎜ ⎟
.. H
⎜ | . ⎜ ⎟
+ ⎜ ⎟ | ·
G =⎜ G .. ⎟,H = ⎜ ⎜
⎟.

⎜ | . ⎟ ⎜ | 1 ⎟
⎝ ⎠ ⎜ ⎟
| ∑ gki ⎝ − − − | −− ⎠
1≤i≤N
0 ... 0 | 1
The rank of X + equals the rank of X = k. If the minimal distance of X was even
it is not changed; if odd it increases by 1. The information rate ρ + = (N − 1)ρ N.
(b) The generating matrix
⎛ ⎞
g11 . . . g1N−1
⎜ .. ⎟
⎜ . ⎟
⎜ ⎟
− ⎜ . ⎟
G =⎜ .. ⎟.
⎜ ⎟
⎜ .. ⎟
⎝ . ⎠
gk1 . . . gkN−1
The parity-check matrix H of X , after suitable column operations, may be written
as ⎛ ⎞

⎜ |· ⎟
⎜ ⎟
⎜ − | · ⎟
⎜ H ⎟
⎜ ⎟
H =⎜ |· ⎟.
⎜ ⎟
⎜ |· ⎟
⎜ ⎟
⎝ − − − − |−− ⎠
0 ... 0 |
The parity-check matrix of X − is then identified with H − . The rank is unchanged;

the distance may decrease maximum by 1. The information rate ρ − = N ρ (N − 1).
(c) The generating and parity-check matrices are
Gre (m) = (G . . . G) (m times),
and
⎛ ⎞
H 0 0 ... 0
⎜ I I 0 ... 0 ⎟
⎜ ⎟
re ⎜ I 0 I ... 0 ⎟
H (m) = ⎜ ⎟.
⎜ .. .. .. .. .. ⎟
⎝ . . . . . ⎠
I 0 0 ... I
188 Introduction to Coding Theory

Here, I is a unit N × N matrix and the zeros mean the zero matrices (of size (N −
k) × N and N × N, accordingly). The number of the unit matrices in the first column
equals m − 1. (This is not a unique form of H re (m).) The size of H re (m) is (Nm −
k) × Nm.
The rank is unchanged, the minimal distance in X re (m) is md and the informa-
tion rate ρ /m.

Worked Example 2.3.5 A dual code of a linear binary [N, k] code X is defined
as the set X ⊥ of the words y = y1 . . . yN such that the dot-product

"y · x# = ∑ yi · xi = 0 for every x = x1 . . . xN from X .


1≤i≤N

Compare Example 2.1.8(ix). Prove that an (N − k) × N matrix H is a parity-check


matrix of code X iff H is a generating matrix for the dual code. Hence, derive that
G and H are generating and parity-check matrices, respectively, for a linear code
iff:

(i) the rows of G are linearly independent;


(ii) the columns of H are linearly independent;
(iii) the number of rows of G plus the number of rows of H equals the number of
columns of G which equals the number of columns of H ;
(iv) GH T = 0.

Solution The rows h( j), j = 1, . . . , N − k, of the matrix H obey "x · h( j)# ≡ 0,


x ∈ X . Furthermore, if a vector y obeys "x · y# ≡ 0, x ∈ X , then y is a linear
combination of the y( j). Hence, H is a generating matrix of X ⊥ . On the other
hand, any generating matrix of X ⊥ is a parity-check matrix for X .
Therefore, for any pair G, H representing generating and parity-check matrices
of a linear code, (i), (ii) and (iv) hold by definition, and (iii) comes from the rank–
nullity formula

N = dim(Row – Range G) + dim(Row – Range H)

that follows from (iv) and the maximality of G and H.


On the other hand, any pair G, H of matrices obeying (i)–(iv) possesses the max-
imality property (by virtue of (i)–(iii)) and the orthogonality property (iv). Thus,
they are generating and parity-check matrices for X = Row – Range G.

Worked Example 2.3.6 What is the number of codewords in a linear binary


[N, k] code? What is the number of different bases in it? Calculate the last number
for k = 4. List all bases for k = 2 and k = 3.
2.3 Linear codes: basic constructions 189

Show that the subset of a linear binary code consisting of all words of even
weight is a linear code. Prove that, for d even, if there exists a linear [N, k, d] code
then there exists a linear [N, k, d] code with codewords of even weight.

1 k−1 k
Solution The size is 2k and the number of different bases ∏
k! i=0
2 − 2i . Indeed,

if the l first basis vectors are selected, all their 2l linear combinations should be
excluded on the next step. This gives 840 for k = 4, and 28 for k = 3.

Finally, for d even, we can truncate the original code and then use the parity-
check extension.
Example 2.3.7 The binary Hamming [7, 4] code is determined by a 3 × 7 parity-
check matrix. The columns of the check matrix are all non-zero words of length 3.
Using lexicographical order of these words we obtain
⎛ ⎞
1 0 1 0 1 0 1
Ham
Hlex = ⎝0 1 1 0 0 1 1⎠ .
0 0 0 1 1 1 1
The corresponding generating matrix may be written as
⎛ ⎞
0 0 1 1 0 0 1
⎜ 0 1 0 0 1 0 1 ⎟
GHam ⎜ ⎟.
lex = ⎝ (2.3.3)
0 0 1 0 1 1 0 ⎠
1 1 1 0 0 0 0
In many cases it is convenient to write the check matrix of a linear [N, k] code in
a canonical (or standard) form:
 
Hcan = IN−k H . (2.3.4a)
In the case of the Hamming [7, 4] code it gives
⎛ ⎞
1 0 0 1 1 0 1
Ham
Hcan = ⎝0 1 0 1 0 1 1⎠ ,
0 0 1 0 1 1 1
with a generating matrix also in a canonical form:

Gcan = G Ik ; (2.3.4b)
namely,
⎛ ⎞
1 1 0 1 0 0 0
⎜ 0 ⎟
GHam ⎜ 1
can = ⎝
0 1 0 1 0 ⎟.
1 1 1 0 0 1 0 ⎠
1 1 1 0 0 0 1
190 Introduction to Coding Theory

Formally, Glex and Gcan determine different codes. However, these codes are
equivalent:

Definition 2.3.8 Two codes are called equivalent if they differ only in permuta-
tion of digits. For linear codes, equivalence means that their generating matrices
can be transformed into each other by permutation of columns and by row-
operations including addition of columns multiplied by scalars. It is plain that
equivalent codes have the same parameters (length, rank, distance).

In what follows, unless otherwise stated, we do not distinguish between equiva-


lent linear codes.

Remark 2.3.9 An advantage of writing G in a canonical form is that a source


string u(k) ∈ Hk is encoded as an N-vector u(k) Gcan ; according to (2.3.4b), the last k
digits in u(k) Gcan form word u(k) (they are called information digits), whereas the
first N −k are used for the parity-check (and called parity-check digits). Pictorially,
the parity-check digits carry the redundancy that allows the decoder to detect and
correct errors.

Like following life thro’ creatures you dissect


You lose it at the moment you detect.
Alexander Pope (1668–1744), English poet
Definition 2.3.10 The weight w(x) of a binary word x = x1 . . . xN is the number
of the non-zero digits in x:

w(x) =  {i : 1 ≤ i ≤ N, xi = 0}. (2.3.5)

Theorem 2.3.11

(i) The distance of a linear binary code equals the minimal weight of its non-zero
codewords.
(ii) The distance of a linear binary code equals the minimal number of linearly
dependent columns in the check matrix.

Proof (i) As the code X is linear, the sum x + y ∈ X for each pair of codewords
x, y ∈ X . Owing to the shift invariance of the Hamming distance (see Lemma
2.1.1), δ (x, y) = δ (0, x + y) = w(x + y) for any pair of codewords. Hence, the
minimal distance of X equals the minimal distance between 0 and the rest of the
code, i.e. the minimal weight of a non-zero codeword from X .

(ii) Let X be a linear code with a parity-check matrix H and minimal distance d.
Then there exists a codeword x ∈ X with exactly d non-zero digits. Since xH T = 0,
2.3 Linear codes: basic constructions 191

we conclude that there are d columns of H which are linearly dependent (they
correspond to non-zero digits in x). On the other hand, if there exist (d −1) columns
of H which are linearly dependent then their sum is zero. But that means that there
exists a word y, of weight w(y) = d − 1, such that yH T = 0. Then y must belong to
X which is impossible, since min[w(x) : x ∈ X , x = 0] = d.

Theorem 2.3.12 The Hamming [7, 4] code has minimal distance 3, i.e. it detects
2 errors and corrects 1. Moreover, it is a perfect code correcting a single error.

Proof For any pair of columns the parity-check matrix H lex contains their sum to
obtain a linearly dependent triple (viz. look at columns 1, 6, 7). No two columns
are linearly dependent because they are distinct (x + y = 0 means that x = y). Also,
the volume v7 (1) equals 1 + 7 = 23 , and the code is perfect as its size is 24 and
24 × 23 = 27 .

The construction of the Hamming [7, 4] code admits a straightforward generali-


sation to any length N = 2l − 1; namely, consider a (2l − 1) × l matrix H Ham with
columns representing all possible non-zero binary vectors of length l:
⎛ ⎞
1 0 ... 0 1 ... 1
⎜0 1 . . . 0 1 ... 1⎟
⎜ ⎟
⎜ 1⎟
H Ham = ⎜0 0 . . . 0 0 ... ⎟. (2.3.6)
⎜. . . .. .. .. ⎟
⎝ .. .. . . . . . 1⎠
0 0 ... 1 0 ... 1

The rows of H Ham are linearly independent, and hence H Ham may be considered
as a check matrix of a linear code of length N = 2l − 1 and rank N − l = 2l −
1 − l. Any two columns of H Ham are linearly independent but there exist linearly
dependent triples of columns, e.g. x, y and x + y. Hence, the code X Ham with the
check matrix H Ham has a minimal distance 3, i.e. it detects 2 errors and corrects 1.

This code is called the Hamming [2l − 1, 2l − 1 − l] code. It is a perfect one-error


correcting code: the volume of the 1-ball v2l −1 (1) equals 1 + 2l − 1 = 2l , and size
2l − l − 1
× volume = 22 −1−l × 2l = 22 −1 = 2N . The information rate is
l l
→ 1 as
2l − 1
l → ∞. This proves

Theorem 2.3.13 The above construction defines a family of [2l − 1, 2l − 1 −


l, 3] linear binary codes X2Ham
l −1 , l = 1, 2, . . ., which are perfect one-error correcting

codes.
192 Introduction to Coding Theory

Example 2.3.14 Suppose that the probability of error in any digit is p 1,


independently of what occurred to other digits. Then the probability of an error in
transmitting a non-encoded (4N)-digit message is
1 − (1 − p)4N  4N p.
But if we use the [7, 4] code, we need to transmit 7N digits. An erroneous trans-
mission requires at least two wrong digits, which occurs with probability
   N
7 2
≈ 1− 1− p  21N p2 4N p.
2
We see that the extra effort of using 3 check digits in the Hamming code is justified.
A standard decoding procedure for linear codes is based on the concepts of coset
and syndrome. Recall that the ML rule decodes a vector y = y1 . . . yN by the closest
codeword x ∈ X .
Definition 2.3.15 Let X be a binary linear code of length N and w = w1 . . . wN
be a word from HN . A coset of X determined by y is the collection of binary
vectors of the form w + x where x ∈ X . We denote it by w + X .
An easy (and useful) exercise in linear algebra and counting is
Example 2.3.16 Let X be a linear code and w, v be words of length N. Then:
(1) If w is in the coset v + X , then v is in the coset w + X ; in other words, each
word in a coset determines this coset.
(2) w ∈ w + X .
(3) w and v are in the same coset iff w + v ∈ X .
(4) Every word of length N belongs to one and only one coset. That is, the cosets
form a partition of the whole Hamming space HN .
(5) All cosets contain the same number of words which equals  X . If the rank
of X is k then there are 2N−k different cosets, each containing 2k words. The
code X is itself a coset of any of the codewords.
(6) The coset determined by w + v coincides with the set of elements of the form
x + y, where x ∈ w + X , y ∈ X + v.
Now the decoding rule for a linear code: you know the code X beforehand, hence
you can calculate all cosets. Upon receiving a word y, you find its coset y + X and
find a word w ∈ y + X of least weight. Such a word is called a leader of the coset
y + X . A leader may not be unique: in that case you have to make a choice among
the list of leaders (list decoding) or refuse to decode and demand a re-transmission.
Suppose you have chosen a leader w. You then decode y by the word
x∗ = y + w. (2.3.7)
2.3 Linear codes: basic constructions 193

Worked Example 2.3.17 Show that word x∗ is always a codeword that min-
imises the distance between y and the words from X .

Solution As y and w are in the same coset, y + w ∈ X (see Example 2.3.16(3)).


All other words from X are obtained as the sums y + v where v runs over coset
y + X . Hence, for any x ∈ X ,
δ (y, x) = w(y + x) ≥ min w(v) = w(w) = d(y, x∗ ).
v∈y+X

The parity-check matrix provides a convenient description of the cosets y + X .


Theorem 2.3.18 Cosets w + X are in one-to-one correspondence with vectors
of the form yH T : two vectors, y and y are in the same coset iff yH T = y H T . In
other words, cosets are identified with the rank (or range) space of the parity-check
matrix.
Proof The vectors y and y are in the same coset iff y + y ∈ X , i.e.
(y + y )H T = yH T + y H T = 0, i.e. yH T = y H T .

In practice, the decoding rule is implemented as follows. Vectors of the form


yH T are called syndromes: for a linear (N, k) code there are 2N−k syndromes. They
are all listed in the syndrome ‘table’, and for each syndrome a leader of the corre-
sponding coset is calculated. Upon receiving a word y, you calculate the syndrome
yH T and find, in the syndrome table, the corresponding leader w. Then follow
(2.3.7): decode y by x∗ = y + w.

The procedure described is called syndrome decoding; although it is relatively


simple, one has to write a rather long table of the leaders. Moreover, it is desirable
to make the whole procedure of decoding algorithmically independent on a con-
crete choice of the code, i.e. of its generating matrix. This goal is achieved in the
case of the Hamming codes:
Theorem 2.3.19 For the Hamming code, for each syndrome the leader w is
unique and
 contains
T not more than one non-zero digit. More precisely, if the syn-
drome y H Ham = s gives column i of the check matrix H Ham then the leader of
the corresponding coset has the only non-zero digit i.
Proof The leader minimises the distance between the received word and the code.
The Hamming code is perfect 1-error correcting. Hence, every word is either a
codeword or within distance 1 of a unique codeword. Hence, the leader is unique
194 Introduction to Coding Theory

and contains at most one non-zero digit. If the syndrome yH T = s occupies position
i among the columns of the parity-check matrix then, for word ei = 0 . . . 1 0 . . . 0
with the only non-zero digit i,
(y + ei )H T = s + s = 0.
That is, (y + ei ) ∈ X and ei ∈ y + X . Obviously, ei is the leader.

The duals X Ham of binary Hamming codes form a particular class, called sim-
plex codes. If X Ham is [2 − 1, 2 − 1 − ], its dual (X Ham )⊥ is [2 − 1, ], and the

original parity-check matrix H Ham serves as a generating matrix for X Ham .
Worked Example 2.3.20 Prove that each non-zero codeword in a binary simplex

code X Ham has weight 2−1 and the distance between any two codewords equals
2−1 . Hence justify the term ‘simplex’.

Solution If X = X Ham is the binary Hamming [2l − 1, 2l − l − 1] code then the


dual X ⊥ is [2l − 1, l], and its l × (2l − 1) generating matrix is H Ham . The weight of
any row of H Ham equals 2l−1 (and so d(X ⊥ ) = 2l−1 ). Indeed, the weight of row
j of H Ham equals the number of non-zero vectors of length l with 1 at position j.
This gives 2l−1 as the weight, as half of all 2l vectors from Hl have 1 at any given
position.

Consider now a general codeword from X Ham . It is represented by the sum of
rows j1 , . . . , js of H Ham where s ≤ l and 1 ≤ j1 < · · · < js ≤ l. This word again has
weight 2l−1 ; this gives the number of non-zero words v = v1 . . . vl ∈ Hl,2 such that
the sum v j1 + · · · + v js = 1. Moreover, 2l−1 gives the weight of half of all vectors in
Hl,2 . Indeed, we require that v j1 + · · · + v js = 1, which results in 2s−1 possibilities
for the s involved digits. Next, we impose no restriction on the remaining l −s digits
which gives 2l−s possibilities. Then 2s−1 ×2l−s = 2l−1 , as required. So, w(x) = 2l−1
for all non-zero x ∈ X ⊥ . Finally, for any x, x ∈ X , x = x , the distance δ (x, x ) =
δ (0, x + x ) = w(x + x ) which is always equal to 2l−1 . So, the codewords x ∈ X ⊥
form a geometric pattern of a ‘simplex’ with 2l ‘vertices’.
Next, we briefly summarise basic facts about linear codes over a finite-field al-
phabet Fq = {0, 1, . . . , q − 1} of size q = ps . We now switch to the notation F×N
q
for the Hamming space HN,q .
Definition 2.3.21 A q-ary code X ⊆ F×N is called linear if, together with a
pair of vectors, x = x1 . . . xN and x = x1 . . . xN , X contains the linear combinations
γ · x + γ · x , with digits γ · xi + γ · xi , for all coefficients γ , γ ∈ Fq . That is, X is a
linear subspace in F×N . Consequently, as in the binary case, a linear code always
contains the vector 0 = 0 . . . 0. A basis of a linear code is again defined as a maximal
linearly independent set of its words; the linear code is generated by its basis in the
2.3 Linear codes: basic constructions 195

sense that every codevector is (uniquely) represented as a linear combination of


the basis codevectors. The number of vectors in the basis is called, as before, the
dimension or the rank of the code; because all bases of a given linear code contain
the same number of vectors, this object is correctly defined. As in the binary case,
the linear code of length N and rank k is referred to as an [N, k] code, or an [N, k, d]
code when its distance equals d.
As in the binary case, the minimal distance of a linear code X equals the minimal non-zero weight:

d(X) = min [w(x) : x ∈ X, x ≠ 0], where w(x) = ♯{j : 1 ≤ j ≤ N, x_j ≠ 0 in F_q}, x = x_1 . . . x_N ∈ F_q^{×N}.   (2.3.8)
A linear code X is defined by a generating matrix G or a parity-check matrix
H. The generating matrix of a linear [N, k] code is a k × N matrix G, with entries
from Fq , whose rows g(i) = gi1 . . . giN , 1 ≤ i ≤ k, form a basis of X . A parity-
check matrix is an (N − k) × N matrix H, with entries from Fq , whose rows h( j) =
h j1 . . . h jN , 1 ≤ j ≤ N − k, are linearly independent and dot-orthogonal to X : for
all j = 1, . . . , N − k and codeword x = x_1 . . . x_N from X,

⟨x · h(j)⟩ = ∑_{1≤l≤N} x_l h_{jl} = 0.

In other words, all qk codewords of X are obtained as linear combinations of


rows of G. That is, subspace X can be viewed as a result of acting on Hamming
space F×k ×k
q (of length k) by matrix G: symbolically, X = Fq G. This shows how
code X can be used for encoding q ‘messages’ of length k (and justifies the term
k

information rate for ρ (X ) = k/N). On the other hand, X is determined as the


kernel (the null-space) of H T : X H T = 0. A useful exercise is to check that for
the dual code, X ⊥ , the situation is opposite: H is a generating matrix and G the
parity-check. Compare with Worked Example 2.3.5.
Of course, both the generating and parity-check matrices of a given code are not
unique, e.g. we can permute rows g( j) of G or perform row operations, replacing
a row by a linear combination of rows in which the original row enters with a non-
zero coefficient. Permuting columns of G gives a different but equivalent code,
whose basic geometric parameters are identical to those of X .
Lemma 2.3.22 For any [N, k] code, there exists an equivalent code whose generating matrix G has a 'canonical' form: G = (G′ | I_k) where I_k is the identity k × k matrix and G′ is a k × (N − k) matrix. Similarly, the parity-check matrix H may have a standard form which is (I_{N−k} | H′).
We now discuss the decoding procedure for a general linear code X of rank k. As was noted before, it may be used for encoding source messages (strings) u = u_1 . . . u_k of length k. The source encoding u ∈ F_q^k → X becomes particularly simple when the generating and parity-check matrices are used in the canonical (or standard) form.

Theorem 2.3.23 For any linear code X there exists an equivalent code with the generating matrix Gcan and the check matrix Hcan in standard form (2.3.4a), (2.3.4b) and G′ = −(H′)^T.
Proof Assume that code X is non-trivial (i.e. not reduced to the zero word 0).
Write a basis for X and take the corresponding generating matrix G. By perform-
ing row-operations (where a pair of rows i and j are exchanged or row i is replaced
by row i plus row j) we can change the basis, but do not change the code. Our
matrix G contains a non-zero column, say l1 : perform row operations to make g1l1
the only non-zero entry in this column. By permuting digits (columns), place col-
umn l1 at position N − k. Drop row 1 and column N − k (i.e. the old column l1 )
and perform a similar procedure with the rest, ending up with the only non-zero
entry g2l2 in a column l2 . Place column l2 at position N − k + 1. Continue until an
upper triangular k × k submatrix emerges. Further operations may be reduced to
this matrix only. If this matrix is a unit matrix, stop. If not, pick the first column
with more than one non-zero entry. Add the corresponding rows from the bottom
to 'kill' redundant non-zero entries. Repeat until a unit submatrix emerges. Now the generating matrix is in standard form, and the new code is equivalent to the original one.

To complete the proof, observe that the matrices Gcan and Hcan figuring in (2.3.4a), (2.3.4b) with G′ = −(H′)^T have k independent rows and N − k independent columns, correspondingly. Besides, the k × (N − k) matrix Gcan(Hcan)^T vanishes. In fact,

(Gcan(Hcan)^T)_{ij} = ⟨row i of Gcan · column j of (Hcan)^T⟩ = g′_{ij} − g′_{ij} = 0.

Hence, Hcan is a check matrix for Gcan.

Returning to source encoding, select the generating matrix in the canonical form Gcan. Then, given a string u = u_1 . . . u_k, we set x = ∑_{i=1}^{k} u_i g^can(i), where g^can(i) represents row i of Gcan. The last k digits in x give string u; they are called the information digits. The first N − k digits are used to ensure that x ∈ X; they are called the parity-check digits.

The standard form is convenient because in the above representation X = F_q^{×k} G, the initial (N − k) string of each codeword is used for checking (enabling the detection and correction of errors), and the final k string yields the message from F_q^{×k}. As in the binary case, the parity-check matrix H satisfies Theorem 2.3.11. In particular, the minimal distance of a code equals the minimal number of linearly dependent columns in its parity-check matrix H.
Definition 2.3.24 Given an [N, k] linear q-ary code X with parity-check matrix H, the syndrome of a vector y ∈ F_q^{×N} is the (N − k)-vector yH^T ∈ F_q^{×(N−k)}, and the syndrome subspace is the image F_q^{×N} H^T. A coset of X by a vector w ∈ F_q^{×N} is denoted by w + X and formed by the words w + x where x ∈ X. All cosets have the same number of elements, equal to q^k, and partition the whole Hamming space F_q^{×N} into q^{N−k} disjoint subsets; code X is one of them. The cosets are in one-to-one correspondence with the syndromes yH^T. The syndrome decoding procedure is carried out as in the binary case: a received vector y is decoded by x* = y + w where w is the leader of the coset y + X (i.e. the word from y + X with minimum weight).

All drawbacks we had in the case of binary syndrome decoding persist in the general q-ary case, too (and in fact are more pronounced): the coset tables are bulky, and the leader of a coset may not be unique. However, for q-ary Hamming codes the syndrome decoding procedure works well, as we will see in Section 2.4.
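The mechanics of syndrome decoding are easy to see in a small computation. The following Python sketch (not from the text; the [6, 3] code, its matrices and all names are ad hoc choices for illustration) tabulates the coset leaders by syndrome and decodes a received word.

# Syndrome (coset-leader) decoding for a small binary linear code over F_2.
from itertools import product

# An ad hoc [6, 3] code with generating matrix G = (I_3 | P)
P = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]
G = [tuple(int(i == r) for i in range(3)) + P[r] for r in range(3)]
# Parity-check matrix in standard form, rows of H = (P^T | I_3)
H = [tuple(P[r][c] for r in range(3)) + tuple(int(i == c) for i in range(3))
     for c in range(3)]

def syndrome(y):
    # yH^T, computed mod 2
    return tuple(sum(a * b for a, b in zip(y, h)) % 2 for h in H)

# Coset leaders: for each syndrome keep a word of minimum weight
leaders = {}
for y in product((0, 1), repeat=6):
    s = syndrome(y)
    if s not in leaders or sum(y) < sum(leaders[s]):
        leaders[s] = y

def decode(y):
    # x* = y + w, where w is the leader of the coset y + X
    w = leaders[syndrome(y)]
    return tuple((a + b) % 2 for a, b in zip(y, w))

x = (1, 0, 1, 0, 1, 1)          # a codeword: row 1 + row 3 of G
y = (1, 0, 1, 0, 0, 1)          # one bit flipped in transmission
print(decode(y) == x)            # True: the single error is corrected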

In the case of linear codes, some of the bounds can be improved (or rather new
bounds can be produced).
Worked Example 2.3.25 Let X be a binary linear [N, k, d] code.

(a) Fix a codeword x ∈ X with exactly d non-zero digits. Prove that truncating X on the non-zero digits of x produces a code X′_{N−d} of length N − d, rank k − 1 and distance d′ for some d′ ≥ ⌈d/2⌉.
(b) Deduce the Griesmer bound improving the Singleton bound (2.1.12):

N ≥ d + ∑_{1≤ℓ≤k−1} ⌈d/2^ℓ⌉.   (2.3.9)

Solution (a) Without loss of generality, assume that the non-zero digits in x are x_1 = · · · = x_d = 1. Truncating on digits 1, . . . , d will produce the code X′_{N−d} with the rank reduced by 1. Indeed, suppose that a linear combination of k − 1 vectors vanishes on positions d + 1, . . . , N. Then on the positions 1, . . . , d all the values equal either 0s or 1s because d is the minimal distance. But the first case is impossible, unless the vectors are linearly dependent. The second case also leads to a contradiction, by adding the string x and obtaining k linearly dependent vectors in the code X. Next, suppose that X′_{N−d} has distance d′ < ⌈d/2⌉ and take y′ ∈ X′_{N−d} with

w(y′) = ∑_{j=d+1}^{N} y′_j = d′.
[Figure 2.5: a schematic comparison of the words x, y, y′, x ∧ y, y + (x ∧ y) and x + (x ∧ y), with the first d digits marked.]
Let y ∈ X be an inverse image of y′ under truncation. Referring to (2.1.6b), we write the following property of the binary wedge-product:

w(y) = w(x ∧ y) + w(y + (x ∧ y)) ≥ d.

Consequently, we must have that w(x ∧ y) ≥ d − d′ > d − ⌈d/2⌉. See Figure 2.5. Then

w(x) = w(x ∧ y) + w(x + (x ∧ y)) = d

implies that w(x + (x ∧ y)) < ⌈d/2⌉. But this is a contradiction, because then

w(x + y) = w(x + (x ∧ y)) + w(y + (x ∧ y)) < d.

We conclude that d′ ≥ ⌈d/2⌉.

(b) Iterating the argument in (a) yields

N ≥ d + d_1 + · · · + d_{k−1},

where d_l ≥ ⌈d_{l−1}/2⌉. With ⌈⌈d/2⌉/2⌉ ≥ ⌈d/4⌉, we obtain that

N ≥ d + ∑_{1≤ℓ≤k−1} ⌈d/2^ℓ⌉.

Concluding this section, we provide a specification of the GV bound for linear


codes.
Theorem 2.3.26 (Gilbert bound) If q = p^s is a prime power then for all integers N and d such that 2 ≤ d ≤ N/2, there exists a q-ary linear [N, k] code with minimum distance ≥ d satisfying

q^k ≥ q^N / v_{N,q}(d − 1).   (2.3.10)
Proof Let X be a linear code with distance at least d, of maximal rank (equivalently, of maximal size). If inequality (2.3.10) were violated, the union of all Hamming spheres of radius d − 1 centred on codewords could not cover the whole Hamming space. So, there
must be a point y that is not in any Hamming sphere around a codeword. Then for
any codeword x and any scalar b ∈ Fq the vectors y and y + b · x are in the same
coset by X . Also y + b · x cannot be in any Hamming sphere of radius d − 1. The
same is true for x + b · y because if it were, then y would be in a Hamming sphere
around another codeword. Here we use the fact that Fq is a field. Then the vector
subspace spanned by X and y is a linear code larger than X and with a minimal
distance at least d. That is a contradiction, which completes the proof.
For example, let q = 2 and N = 10. Then 2^5 < v_{10,2}(2) = 56 < 2^6. Upon taking d = 3, the Gilbert bound guarantees the existence of a binary [10, 5] code with d ≥ 3.
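The volume v_{N,q}(r) is easy to tabulate by machine; the short Python sketch below (illustrative only, not part of the text) reproduces the numerical check of this example.

from math import comb

def ball_volume(N, q, r):
    # v_{N,q}(r): number of words within Hamming distance r of a fixed word
    return sum(comb(N, i) * (q - 1) ** i for i in range(r + 1))

# The example following Theorem 2.3.26: q = 2, N = 10, d = 3
v = ball_volume(10, 2, 2)           # v_{10,2}(2) = 56
print(v, 2**5 < v < 2**6)           # 56 True
# Since 2**10 / v is about 18.3 and 2**4 < 18.3 <= 2**5, the guaranteed
# code size is at least 2**5: a binary [10, 5] code with d >= 3 exists.
print(2**4 < 2**10 / v <= 2**5)     # True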

2.4 The Hamming, Golay and Reed–Muller codes


In this section we systematically study codes with a general finite alphabet F_q of q elements which is assumed to be a field. Let us repeat that q has to be of the form p^s where p is a prime and s a natural number; the operations of addition (+) and multiplication (·) must also be specified. (As was said above, if q = p is prime, we can think that F_q = {0, 1, . . . , q − 1} and addition and multiplication in F_q are standard, mod q.) See Section 3.1. Correspondingly, the Hamming space H_{N,q} of length N with digits from F_q is identified, as before, with the Cartesian power F_q^{×N} and inherits the component-wise addition and multiplication by scalars.
Definition 2.4.1 Given positive integers q, ℓ ≥ 2, set N = (q^ℓ − 1)/(q − 1), k = N − ℓ, and construct the q-ary [N, k, 3] Hamming code X^Ham_{N,q} with alphabet F_q as follows. (a) Pick any non-zero q-ary ℓ-word h(1) ∈ H_{ℓ,q}. (b) Pick any non-zero q-ary ℓ-word h(2) ∈ H_{ℓ,q} that is not a scalar multiple of h(1). (c) Continue: if h(1), . . . , h(s) is the collection of q-ary ℓ-words selected so far, pick any non-zero vector h(s+1) ∈ H_{ℓ,q} which is not a scalar multiple of h(1), . . . , h(s), 1 ≤ s ≤ N − 1. (d) This process ends up with a selection of N vectors h(1), . . . , h(N); form an ℓ × N matrix H^Ham with the columns h(1)^T, . . . , h(N)^T. Code X^Ham_{N,q} ⊂ F_q^{×N} is defined by the parity-check matrix H^Ham. [In fact, we deal with the whole family of equivalent codes here, modulo choices of words h(j), 1 ≤ j ≤ N.]
For brevity, we will now write X^H and H^H (or even simply H when possible) instead of X^Ham_{N,q} and H^Ham. In the binary case (with q = 2), matrix H^H is composed of all non-zero binary column-vectors of length ℓ. For general q we have to exclude columns that are multiples of each other. To this end, we can choose as columns all non-zero ℓ-words that have 1 in their top-most non-0 component. No two such columns are proportional, and their total equals (q^ℓ − 1)/(q − 1). Next, as in the binary case, one can arrange words with digits from F_q in the lexicographic order. By construction, any two columns of H^H are linearly independent, but there exist triples of linearly dependent columns. Hence, d(X^H) = 3, and X^H detects two errors and corrects one. Furthermore, X^H is a perfect code correcting a single error, as

M(1 + (q − 1)N) = q^k (1 + (q − 1)(q^ℓ − 1)/(q − 1)) = q^{k+ℓ} = q^N.
As in the binary case, the general Hamming codes admit an efficient (and elegant) decoding procedure. Suppose a parity-check matrix H = H^H has been constructed as above. Upon receiving a word y ∈ F_q^{×N} we calculate the syndrome yH^T ∈ F_q^{×ℓ}. If yH^T = 0 then y is a codeword. Otherwise, the column-vector Hy^T is a scalar multiple of a column h(j) of H: Hy^T = a · h(j), for some j = 1, . . . , N and a ∈ F_q \ {0}. In other words, yH^T = a · e(j)H^T where word e(j) = 0 . . . 1 . . . 0 ∈ H_{N,q} (with the jth digit 1, all others 0). Then we decode y by x* = y − a · e(j), i.e. simply change digit y_j in y to y_j − a.

Summarising, we have the following

Theorem 2.4.2 The q-ary Hamming codes form a family of

[(q^ℓ − 1)/(q − 1), (q^ℓ − 1)/(q − 1) − ℓ, 3] perfect codes X^H_N, for N = (q^ℓ − 1)/(q − 1), ℓ = 1, 2, . . . ,

correcting one error, with a decoding rule that changes the digit y_j to y_j − a in a received word y = y_1 . . . y_N ∈ F_q^{×N}, where 1 ≤ j ≤ N and a ∈ F_q \ {0} are determined from the condition that Hy^T = a · h(j), the a-multiple of column j of the parity-check matrix H.
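In the binary case the decoding rule of Theorem 2.4.2 amounts to reading the position of the error directly off the syndrome. The following Python sketch (illustrative; the particular ordering of the columns of H is just one choice within the equivalence class) builds the binary [7, 4] Hamming code and corrects every single error.

from itertools import product

l = 3                                   # the binary Hamming [7, 4] code
N = 2**l - 1
# Columns of H: all non-zero binary l-words (here: the binary digits of 1..7)
cols = [tuple((j >> i) & 1 for i in range(l)) for j in range(1, N + 1)]

def syndrome(y):
    # yH^T over F_2: sum of the columns h(j) with y_j = 1
    s = [0] * l
    for yj, h in zip(y, cols):
        if yj:
            s = [(a + b) % 2 for a, b in zip(s, h)]
    return tuple(s)

def decode(y):
    s = syndrome(y)
    if s == (0,) * l:
        return tuple(y)                 # y is a codeword
    j = cols.index(s)                   # the syndrome equals column h(j)
    return tuple(b ^ (i == j) for i, b in enumerate(y))

# Check: flipping any single digit of any codeword is corrected
codewords = [y for y in product((0, 1), repeat=N) if syndrome(y) == (0,) * l]
assert len(codewords) == 2**(N - l)     # 16 codewords
x = codewords[5]
for j in range(N):
    y = tuple(b ^ (i == j) for i, b in enumerate(x))
    assert decode(y) == x
print("all single errors corrected")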
Hamming codes were discovered by R. Hamming and M. Golay in the late
1940s. At that time Hamming, an electrical engineer turned computer scientist
during the Jurassic computers era, was working at Los Alamos (“as an intellec-
tual janitor” to local nuclear physicists, in his own words). This discovery shaped
the theory of codes for more than two decades: people worked hard to extend prop-
erties of Hamming codes to wider classes of codes (with variable success). Most
of the topics on codes discussed in this book are related, in one way or another, to
Hamming codes. Richard Hamming was not only an outstanding scientist but also
an illustrious personality; his writings (and accounts of his life) are entertaining
and thought-provoking.
Until the late 1950s, the Hamming codes were a unique family of codes exist-
ing in dimensions N → ∞, with ‘regular’ properties. It was then discovered that
these codes have a deep algebraic background. The development of the algebraic
methods based on these observations is still a dominant theme in modern coding
theory.

Another important example is the four Golay codes (two binary and two ternary).
Marcel Golay (1902–1989) was a Swiss electrical engineer who lived and worked
in the USA for a long time. He had an extraordinary ability to ‘see’ the discrete
geometry of the Hamming spaces and ‘guess’ the construction of various codes
without bothering about proofs.
The binary Golay code X24Gol is a [24, 12] code with the generating matrix G = (I12 | G′) where I12 is the 12 × 12 identity matrix, and G′ = G′(2) has the following form:

       ⎛ 0 1 1 1 1 1 1 1 1 1 1 1 ⎞
       ⎜ 1 1 1 0 1 1 1 0 0 0 1 0 ⎟
       ⎜ 1 1 0 1 1 1 0 0 0 1 0 1 ⎟
       ⎜ 1 0 1 1 1 0 0 0 1 0 1 1 ⎟
       ⎜ 1 1 1 1 0 0 0 1 0 1 1 0 ⎟
       ⎜ 1 1 1 0 0 0 1 0 1 1 0 1 ⎟
G′ =   ⎜ 1 1 0 0 0 1 0 1 1 0 1 1 ⎟ .   (2.4.1)
       ⎜ 1 0 0 0 1 0 1 1 0 1 1 1 ⎟
       ⎜ 1 0 0 1 0 1 1 0 1 1 1 0 ⎟
       ⎜ 1 0 1 0 1 1 0 1 1 1 0 0 ⎟
       ⎜ 1 1 0 1 1 0 1 1 1 0 0 0 ⎟
       ⎝ 1 0 1 1 0 1 1 1 0 0 0 1 ⎠

The rule of forming matrix G′ is ad hoc (and this is how it was determined by M. Golay in 1949). There will be further ad hoc arguments in the analysis of Golay codes.
Remark 2.4.3 Interestingly, there is a systematic way of constructing all codewords of X24Gol (or its equivalent) by fitting together two versions of the Hamming [7, 4] code X7H. First, observe that reversing the order of all the digits of a Hamming code X7H yields an equivalent code which we denote by X7K. Then add a parity-check to both X7H and X7K, producing codes X8H,+ and X8K,+. Finally, select two different words a, b ∈ X8H,+ and a word x ∈ X8K,+. Then all 2^12 codewords of X24Gol of length 24 can be written as concatenations (a + x)(b + x)(a + b + x). This can be checked by inspection of generating matrices.

Lemma 2.4.4 The binary Golay code X24Gol is self-dual, with (X24Gol)^⊥ = X24Gol. The code X24Gol is also generated by the matrix (G′ | I12).
Proof A direct calculation shows that any two rows of matrix G are dot-orthogonal. Thus X24Gol ⊂ (X24Gol)^⊥. But the dimensions of X24Gol and (X24Gol)^⊥ coincide. Hence, X24Gol = (X24Gol)^⊥. The last assertion of the lemma now follows from the property (G′)^T = G′.

Worked Example 2.4.5 Show that the distance d(X24Gol ) = 8.

Solution First, we check that for all x ∈ X24Gol the weight w(x) is divisible by 4. This is true for every row of G = (I12 | G′): the number of 1s is either 12 or 8. Next, for all binary N-words x, x′,

w(x + x′) = w(x) + w(x′) − 2w(x ∧ x′)

where x ∧ y is the wedge-product, with digits (x ∧ y)_j = min(x_j, y_j), 1 ≤ j ≤ N (cf. (2.1.6b)). But for any pair g(j), g(j′) of the rows of G, w(g(j) ∧ g(j′)) = 0 mod 2. So, 4 divides w(x) for all x ∈ X24Gol.

On the other hand, X24Gol does not have codewords of weight 4. To prove this, compare the two generating matrices, (I12 | G′) and ((G′)^T | I12). If x ∈ X24Gol has w(x) = 4, write x as a concatenation x_L x_R. Any non-trivial sum of rows of (I12 | G′) has weight at least 1 in the L-half, so w(x_L) ≥ 1. Similarly, w(x_R) ≥ 1. But if w(x_L) = 1 then x must be one of the rows of (I12 | G′), none of which has weight w(x_R) = 3. Hence, w(x_L) ≥ 2. Similarly, w(x_R) ≥ 2. But then the only possibility is that w(x_L) = w(x_R) = 2, which is impossible by a direct check. Thus, w(x) ≥ 8. But (I12 | G′) has rows of weight 8. So, d(X24Gol) = 8.
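Both Lemma 2.4.4 and the distance computation above can be confirmed by brute force over the 2^12 = 4096 codewords; the Python sketch below (illustrative only, with the matrix G′ of (2.4.1) entered row by row as bit masks) does exactly this.

from itertools import product

Gp = [int(r, 2) for r in (        # the 12 rows of G' from (2.4.1)
    "011111111111", "111011100010", "110111000101", "101110001011",
    "111100010110", "111000101101", "110001011011", "100010110111",
    "100101101110", "101011011100", "110110111000", "101101110001")]
# Rows of G = (I_12 | G'): identity bit in the top 12 positions, G' in the low 12
G = [(1 << (23 - i)) | Gp[i] for i in range(12)]

def weight(x):
    return bin(x).count("1")

codewords = []
for coeffs in product((0, 1), repeat=12):
    x = 0
    for c, row in zip(coeffs, G):
        if c:
            x ^= row                   # linear combination over F_2
    codewords.append(x)

weights = sorted(set(map(weight, codewords)))
print(weights)                          # [0, 8, 12, 16, 24]: d(X24Gol) = 8
# Self-duality: every pair of rows of G is dot-orthogonal
print(all(weight(G[i] & G[j]) % 2 == 0 for i in range(12) for j in range(12)))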

When we truncate X24Gol at any digit, we get X23Gol, a [23, 12, 7] code. This code is perfect 3-error correcting. We recover X24Gol from X23Gol by adding a parity-check.

The Hamming [2^ℓ − 1, 2^ℓ − 1 − ℓ, 3] and the Golay [23, 12, 7] are the only possible perfect binary linear codes.

The ternary Golay code X^Gol_{12,3} of length 12 has the generating matrix (I6 | G′(3)) where

          ⎛ 0 1 1 1 1 1 ⎞
          ⎜ 1 0 1 2 2 1 ⎟
G′(3) =   ⎜ 1 1 0 1 2 2 ⎟ ,  with (G′(3))^T = G′(3).   (2.4.2)
          ⎜ 1 2 1 0 1 2 ⎟
          ⎜ 1 2 2 1 0 1 ⎟
          ⎝ 1 1 2 2 1 0 ⎠

The ternary Golay code X^Gol_{11,3} is a truncation of X^Gol_{12,3} at the last digit.
Theorem 2.4.6 The ternary Golay code X^Gol_{12,3} = (X^Gol_{12,3})^⊥ is [12, 6, 6]. The code X^Gol_{11,3} is [11, 6, 5], hence perfect.

Proof The code [11, 6, 5] is perfect since v_{11,3}(2) = 1 + 11 × 2 + (11 × 10 / 2) × 2^2 = 3^5. The rest of the assertions of the theorem are left as an exercise.
  
The Hamming [(3^ℓ − 1)/2, (3^ℓ − 1)/2 − ℓ, 3] and the Golay [11, 6, 5] codes are the only
possible perfect ternary linear codes. Moreover, the Hamming and Golay are the
only perfect linear codes, occurring in any alphabet Fq where q = ps is a prime
power. Hence, these codes are the only possible perfect linear codes. And even
non-linear perfect codes do not bring anything essentially new: they all have the
same parameters (length, size and distance) as the Hamming and Golay codes. The
Golay codes were used in the 1980s in the American Voyager spacecraft program,
to transmit close-up photographs of Jupiter and Saturn.

The next popular examples are the Reed–Muller codes. For N = 2^m consider the binary Hamming spaces H_{m,2} and H_{N,2}. Let M (= M_m) be an m × N matrix where the columns are the binary representations of the integers j = 0, 1, . . . , N − 1, with the least significant bit in the first place:

j = j_1 · 2^0 + j_2 · 2^1 + · · · + j_m · 2^{m−1}.   (2.4.3)

So,

          0  1  2  . . .  2^m − 1
      ⎛   0  1  0  . . .  1   ⎞   v(1)
      ⎜   0  0  1  . . .  1   ⎟   v(2)
M =   ⎜   .  .  .  . . .  .   ⎟   . . .          (2.4.4)
      ⎜   0  0  0  . . .  1   ⎟   v(m−1)
      ⎝   0  0  0  . . .  1   ⎠   v(m)

The columns of M list all vectors from H_{m,2} and the rows are vectors from H_{N,2} denoted by v(1), . . . , v(m). In particular, v(m) has the first 2^{m−1} entries 0, the last 2^{m−1} entries 1. To pass from M_m to M_{m−1}, one drops the last row and takes one of the two identical halves of the remaining (m − 1) × N matrix. Conversely, to pass from M_{m−1} to M_m, one concatenates two copies of M_{m−1} and adds row v(m):

M_m = ⎛ M_{m−1}     M_{m−1} ⎞
      ⎝ 0 . . . 0   1 . . . 1 ⎠ .   (2.4.5)
Consider the columns w(1), . . . , w(m) of M_m corresponding to the numbers 1, 2, 4, . . . , 2^{m−1}. They form the standard basis in H_{m,2}:

1 0 . . . 0
0 1 . . . 0
.  .  . . .
0 0 . . . 1

Then the column of M_m at position j = ∑_{1≤i≤m} j_i 2^{i−1} is ∑_{1≤i≤m} j_i w(i).

The vector v(i), i = 1, . . . , m, can be interpreted as the indicator function of the set A_i ⊂ H_{m,2} where the ith digit is 1:

A_i = {j ∈ H_{m,2} : j_i = 1}.   (2.4.6)

In terms of the wedge-product (cf. (2.1.6b)), v(i_1) ∧ v(i_2) ∧ · · · ∧ v(i_k) is the indicator function of the intersection A_{i_1} ∩ · · · ∩ A_{i_k}. If all i_1, . . . , i_k are distinct, the cardinality ♯(∩_{1≤j≤k} A_{i_j}) = 2^{m−k}. In other words, we have the following.

Lemma 2.4.7 The weight w(∧_{1≤j≤k} v(i_j)) = 2^{m−k}.

An important fact is

Theorem 2.4.8 The vectors v(0) = 11 . . . 1 and ∧_{1≤j≤k} v(i_j), 1 ≤ i_1 < · · · < i_k ≤ m, k = 1, . . . , m, form a basis in H_{N,2}.

Proof It suffices to check that the standard basis N-words e(j) = 0 . . . 1 . . . 0 (1 in position j, 0 elsewhere) can be written as linear combinations of the above vectors. But

e(j) = ∧_{1≤i≤m} (v(i) + (1 + v^{(i)}_j) v(0)), 0 ≤ j ≤ N − 1.   (2.4.7)

[All factors in position j are equal to 1 and at least one factor in any position l ≠ j is equal to 0.]
Example 2.4.9 For m = 4, N = 16,

v(0) = 1111111111111111
v(1) = 0101010101010101
v(2) = 0011001100110011
v(3) = 0000111100001111
v(4) = 0000000011111111
v(1) ∧ v(2) = 0001000100010001
v(1) ∧ v(3) = 0000010100000101
v(1) ∧ v(4) = 0000000001010101
v(2) ∧ v(3) = 0000001100000011
v(2) ∧ v(4) = 0000000000110011
v(3) ∧ v(4) = 0000000000001111
v(1) ∧ v(2) ∧ v(3) = 0000000100000001
v(1) ∧ v(2) ∧ v(4) = 0000000000010001
v(1) ∧ v(3) ∧ v(4) = 0000000000000101
v(2) ∧ v(3) ∧ v(4) = 0000000000000011
v(1) ∧ v(2) ∧ v(3) ∧ v(4) = 0000000000000001
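The rows of this table are mechanical to regenerate; the following Python sketch (illustrative only) constructs the vectors v(i) and a few of their wedge-products for m = 4 and checks Lemma 2.4.7.

m = 4
N = 2**m

# v(i): the ith binary digit (least significant first) of the column index j
v = {i: [(j >> (i - 1)) & 1 for j in range(N)] for i in range(1, m + 1)}
v0 = [1] * N                              # the all-ones word v(0)

def wedge(*words):
    # component-wise product = indicator of the intersection of the sets A_i
    return [min(col) for col in zip(*words)]

print("".join(map(str, v[1])))                      # 0101010101010101
print("".join(map(str, wedge(v[1], v[2]))))         # 0001000100010001
print("".join(map(str, wedge(v[2], v[3], v[4]))))   # 0000000000000011

# Lemma 2.4.7: the weight of a k-fold wedge-product equals 2^(m-k)
print(sum(wedge(v[1], v[3], v[4])) == 2**(m - 3))   # True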

Definition 2.4.10 Given 0 ≤ r ≤ m, the Reed–Muller (RM) code X RM (r, m) of order r is a binary code of length N = 2^m spanned by all wedge-products ∧_{1≤j≤k} v(i_j) and v(0) where 1 ≤ k ≤ r and 1 ≤ i_1 < · · · < i_k ≤ m. The rank of X RM (r, m) equals 1 + \binom{m}{1} + · · · + \binom{m}{r}.

So, X RM (0, m) ⊂ X RM (1, m) ⊂ · · · ⊂ X RM (m − 1, m) ⊂ X RM (m, m).


Here X RM (m, m) = HN,2 , the whole Hamming space, and X RM (0, m) =
{00 . . . 00, 11 . . . 1}, the repetition code. Next, X RM (m − 1, m) consists of all words
x ∈ HN,2 of even weight (shortly: even words). In fact, any basis vector is even, by
Lemma 2.4.7. Further, if x, x′ are even then

w(x + x′) = w(x) + w(x′) − 2w(x ∧ x′)

is again even. Hence, all codewords x ∈ X RM (m − 1, m) are even. Finally,


dim X RM (m − 1, m) = N − 1 coincides with the dimension of the subspace of even
words. This proves the claim. As X RM (r, m) ⊂ X RM (m − 1, m), r ≤ m − 1, any
RM code consists of even words.
The dual code is X RM (r, m)^⊥ = X RM (m − r − 1, m). Indeed, if a ∈ X RM (r, m), b ∈ X RM (m − r − 1, m) then the wedge-product a ∧ b is an even word, and hence the dot-product ⟨a · b⟩ = 0. But
dim(X RM (r, m)) + dim(X RM (m − r − 1, m)) = N,
hence the claim. As a corollary, code X RM (m − 2, m) is the parity-check extension
of the Hamming code.
By definition, codewords x ∈ X RM (r, m) are associated with ∧-polynomials in
idempotent ‘variables’ v(1) , . . . , v(m) , with coefficients 0, 1, of degrees ≤ r (here,
the degree of a polynomial is counted by taking the maximal number of variables
v(1) , . . . , v(m) in the summand monomials). The 0-degree monomial in such a poly-
nomial is proportional to v(0) .
Write this correspondence as
x ∈ X RM (r, m) ↔ px (v(1) , . . . , v(m) ), deg px ≤ r. (2.4.8)
Each such polynomial can be written in the form
px (v(1) , . . . , v(m) ) = v(m) ∧ q(v(1) , . . . , v(m−1) ) + l(v(1) , . . . , v(m−1) ),
with deg q ≤ r − 1, deg l ≤ r. The word v(m) ∧ q(v(1), . . . , v(m−1)) has zeros on the first 2^{m−1} positions.
By the same token, as above,
q(v(1) , . . . , v(m−1) ) ↔ b ∈ X RM (r − 1, m − 1),
(2.4.9)
l(v(1) , . . . , v(m−1) ) ↔ a ∈ X RM (r, m − 1).

Furthermore, the 2^m-word x can be written as the sum of concatenated 2^{m−1}-words:
x = (a|a) + (0|b) = (a|a + b). (2.4.10)
This means that the Reed–Muller codes are related via the bar-product construction
(cf. Example 2.1.8(viii)):
X RM (r, m) = X RM (r, m − 1)|X RM (r − 1, m − 1). (2.4.11)
Therefore, inductively,
d(X RM (r, m)) = 2^{m−r}.   (2.4.12)

In fact, for r = 0, d(X RM (0, m)) = 2^m and, for all m, d(X RM (m, m)) = 1 = 2^0. Assume d(X RM (r − 1, m)) = 2^{m−r+1} for all m ≥ r − 1 and, by induction in m, d(X RM (r, m − 1)) = 2^{m−1−r}. Then (cf. (2.4.14) below)

d(X RM (r, m)) = min[2d(X RM (r, m − 1)), d(X RM (r − 1, m − 1))] = min[2 · 2^{m−1−r}, 2^{m−r}] = 2^{m−r}.   (2.4.13)
Summarising,

Theorem 2.4.11 The RM code X RM (r, m), 0 ≤ r ≤ m, is a binary code of length N = 2^m, rank k = ∑_{0≤l≤r} \binom{m}{l} and distance d = 2^{m−r}. Furthermore,

(1) X RM (0, m) = {0 . . . 0, 1 . . . 1} ⊂ X RM (1, m) ⊂ · · · ⊂ X RM (m − 1, m) ⊂ X RM (m, m) = H_{N,2}; X RM (m − 1, m) is the set of all even N-words and X RM (m − 2, m) the parity-check extension of the Hamming [2^m − 1, 2^m − 1 − m] code.
(2) X RM (r, m) = X RM (r, m − 1)|X RM (r − 1, m − 1), 1 ≤ r ≤ m − 1.
(3) X RM (r, m)^⊥ = X RM (m − r − 1, m), 0 ≤ r ≤ m − 1.
Worked Example 2.4.12 Define the bar-product X1 |X2 of binary linear codes
X1 and X2 , where X2 is a subcode of X1 . Relate the rank and minimum distance
of X1 |X2 to those of X1 and X2 . Show that if X ⊥ denotes the dual code of X ,
then
(X1 |X2 )⊥ = X2⊥ |X1⊥ .
Using the bar-product construction, or otherwise, define the Reed–Muller code
X RM (r, m) for 0 ≤ r ≤ m. Show that if 0 ≤ r ≤ m − 1, then the dual of X RM (r, m)
is again a Reed–Muller code.

Solution The bar-product X1|X2 of two linear codes X2 ⊆ X1 ⊆ F_2^N is defined as

X1|X2 = {(x|x + y) : x ∈ X1, y ∈ X2};

it is a linear code of length 2N. If X1 has basis x1, . . . , xk and X2 has basis y1, . . . , yl then X1|X2 has basis

(x1|x1), . . . , (xk|xk), (0|y1), . . . , (0|yl),

and the rank of X1|X2 equals the sum of the ranks of X1 and X2.
Next, we are going to check that the minimum distance

d(X1|X2) = min[2d(X1), d(X2)].   (2.4.14)

Indeed, let 0 ≠ (x|x + y) ∈ X1|X2. If y ≠ 0 then the weight w(x|x + y) ≥ w(y) ≥ d(X2). If y = 0 then w(x|x + y) = 2w(x) ≥ 2d(X1). This implies that

d(X1|X2) ≥ min[2d(X1), d(X2)].   (2.4.15)

On the other hand, if x ∈ X1 has w(x) = d(X1) then d(X1|X2) ≤ w(x|x) = 2d(X1). Finally, if y ∈ X2 has w(y) = d(X2) then d(X1|X2) ≤ w(0|y) = d(X2). We conclude that

d(X1|X2) ≤ min[2d(X1), d(X2)],   (2.4.16)

proving (2.4.14).
Now, we will check that

X2^⊥|X1^⊥ ⊆ (X1|X2)^⊥.

Indeed, let (u|u + v) ∈ X2^⊥|X1^⊥ and (x|x + y) ∈ X1|X2. The dot-product

⟨(u|u + v) · (x|x + y)⟩ = u · x + (u + v) · (x + y) = u · y + v · (x + y) = 0,

since u ∈ X2^⊥, y ∈ X2, v ∈ X1^⊥ and (x + y) ∈ X1. In addition, we know that

rank(X2^⊥|X1^⊥) = N − rank(X2) + N − rank(X1) = 2N − rank(X1|X2) = rank((X1|X2)^⊥).

This implies that in fact

X2^⊥|X1^⊥ = (X1|X2)^⊥.   (2.4.17)

Turning to the RM codes, they are determined as follows:

X RM (0, m) = the repetition binary code of length N = 2^m,
X RM (m, m) = the whole space H_{N,2} of length N = 2^m,
X RM (r, m) for 0 < r < m is defined recursively by

X RM (r, m) = X RM (r, m − 1)|X RM (r − 1, m − 1).

By construction, X RM (r, m) has rank ∑_{j=0}^{r} \binom{m}{j} and the minimum distance 2^{m−r}. In particular, X RM (m − 1, m) is the parity-check code and hence dual of X RM (0, m). We will show that in general, for 0 ≤ r ≤ m − 1,

X RM (r, m)^⊥ = X RM (m − r − 1, m).

It is done by induction in m ≥ 3. By the above, we can assume that X RM (r, m − 1)^⊥ = X RM (m − r − 2, m − 1) holds for 0 ≤ r < m − 1. Then for 0 ≤ r < m:

X RM (r, m)^⊥ = (X RM (r, m − 1)|X RM (r − 1, m − 1))^⊥
             = X RM (r − 1, m − 1)^⊥|X RM (r, m − 1)^⊥
             = X RM (m − r − 1, m − 1)|X RM (m − r − 2, m − 1)
             = X RM (m − r − 1, m).
Encoding and decoding of RM codes is based on the following observation. By virtue of (2.4.5), the product v(i_1) ∧ · · · ∧ v(i_k) occurs in the expansion for e(j) ∈ H_{N,2} iff v^{(i)}_j = 0 for all i ∉ {i_1, . . . , i_k}.
Definition 2.4.13 For 1 ≤ i_1 < · · · < i_k ≤ m, define

C(i_1, . . . , i_k) := the set of all integers j = ∑_{1≤i≤m} j_i 2^{i−1} with j_i = 0 for i ∉ {i_1, . . . , i_k}.   (2.4.18)
For the empty set (k = 0), C(∅) = {0}. Furthermore, set

C(i1 , . . . , ik ) + t = { j + t : j ∈ C(i1 , . . . , ik )}. (2.4.19)

Then, again in view of (2.4.5), for all y = y_0 . . . y_{N−1} ∈ H_{N,2},

y = ∑_{0≤k≤m} ∑_{1≤i_1<···<i_k≤m} ( ∑_{j∈C(i_1,...,i_k)} y_j ) v(i_1) ∧ · · · ∧ v(i_k)   (2.4.20)

(for k = 0, take v(0)).


For encoding a sequence a = a_0 . . . a_{k−1} of information symbols from H_{k,2}, with k = 1 + \binom{m}{1} + · · · + \binom{m}{r}, with X RM (r, m), rewrite it as (a_{i_1,...,i_l}); here i_1, . . . , i_l are the successive positions of the 1s. Then construct a codeword x = (x_0, . . . , x_{N−1}) ∈ X RM (r, m) where

x = ∑_{0≤l≤r} ∑_{1≤i_1<···<i_l≤m} a_{i_1,...,i_l} v(i_1) ∧ · · · ∧ v(i_l).   (2.4.21)

We see that the 'information space' H_{k,2} is embedded into H_{N,2} by identifying entries a_j ∼ a_{i_1,...,i_l}, where j = j_1 · 2^0 + j_2 · 2^1 + · · · + j_m · 2^{m−1} and i_1, . . . , i_l are the successive positions of the 1s among j_1, . . . , j_m, 1 ≤ l ≤ r. With such an identification we obtain:
Lemma 2.4.14 For all 0 ≤ l ≤ m and 1 ≤ i_1 < · · · < i_l ≤ m,

∑_{j∈C(i_1,...,i_l)} x_j = a_{i_1,...,i_l}, if l ≤ r;  = 0, if l > r.   (2.4.22)

Proof The result follows from (2.4.20).

Lemma 2.4.15 For all 1 ≤ i_1 < · · · < i_r ≤ m and for any 1 ≤ t ≤ m such that t ∉ {i_1, . . . , i_r},

a_{i_1,...,i_r} = ∑_{j∈C(i_1,...,i_r)+2^{t−1}} x_j.   (2.4.23)
Proof The proof follows from the fact that C(i_1, . . . , i_r, t) is the disjoint union C(i_1, . . . , i_r) ∪ (C(i_1, . . . , i_r) + 2^{t−1}) and the equation ∑_{j∈C(i_1,...,i_r,t)} x_j = 0 (cf. (2.4.19)).

Moreover:

Theorem 2.4.16 For any information symbol a_{i_1,...,i_r} corresponding to v(i_1) ∧ · · · ∧ v(i_r), we can split the set {0, . . . , N − 1} into 2^{m−r} disjoint subsets S, each containing 2^r elements, such that, for all such S, a_{i_1,...,i_r} = ∑_{j∈S} x_j.

Proof The list of sets S begins with C(i_1, . . . , i_r) and continues with the (m − r) disjoint sets C(i_1, . . . , i_r) + 2^{t−1} where 1 ≤ t ≤ m, t ∉ {i_1, . . . , i_r}. Next, we take any pair 1 ≤ t_1 < t_2 ≤ m such that {t_1, t_2} ∩ {i_1, . . . , i_r} = ∅. Then C(i_1, . . . , i_r, t_1, t_2) contains the disjoint sets C(i_1, . . . , i_r), C(i_1, . . . , i_r) + 2^{t_1−1} and C(i_1, . . . , i_r) + 2^{t_2−1}, and for each of them, a_{i_1,...,i_r} = ∑_{j∈C(i_1,...,i_r)+2^{t_k−1}} x_j, k = 1, 2. Then the same is true for the remaining sets

C(i_1, . . . , i_r) + 2^{t_1−1} + 2^{t_2−1} = C(i_1, . . . , i_r, t_1, t_2) \ [C(i_1, . . . , i_r) ∪ (C(i_1, . . . , i_r) + 2^{t_1−1}) ∪ (C(i_1, . . . , i_r) + 2^{t_2−1})];   (2.4.24)

there are \binom{m−r}{2} of them and they are still disjoint with each other and with the previous ones. The sets (2.4.24) form a further bunch of sets S.

And so on: a general form of a set S is

C(i_1, . . . , i_r) + 2^{t_1−1} + · · · + 2^{t_s−1},

which is the same as the set-theoretic difference

C(i_1, . . . , i_r, t_1, . . . , t_s) \ ∪_{{t′_1,...,t′_{s′}}⊂{t_1,...,t_s}} (C(i_1, . . . , i_r) + 2^{t′_1−1} + · · · + 2^{t′_{s′}−1}).   (2.4.25)

Here each such set is labelled by a collection {t_1, . . . , t_s} where 0 ≤ s ≤ m − r, t_1 < · · · < t_s and {t_1, . . . , t_s} ∩ {i_1, . . . , i_r} = ∅. [The union ∪ in (2.4.25) is over all ('strict') subsets {t′_1, . . . , t′_{s′}} of {t_1, . . . , t_s}, with t′_1 < · · · < t′_{s′} and s′ = 0, . . . , s − 1 (s′ = 0 gives the empty subset).] The total number of sets S equals 2^{m−r} and each of them has 2^r elements by construction.
Theorem 2.4.16 provides a rationale for the so-called majority decoding for the Reed–Muller codes. Namely, upon receiving a word y = (y_0, . . . , y_{N−1}), produced from a codeword x∧ ∈ X RM (r, m), we take any 1 ≤ i_1 < · · · < i_r ≤ m and consider the sums ∑_{j∈S} y_j along the 2^{m−r} above sets S. If y ∈ X RM (r, m), all these sums coincide and give a_{i_1,...,i_r}. If the number of errors in y (i.e. the Hamming distance δ(x∧, y)) is < 2^{m−r−1} = d(X RM (r, m))/2, the majority of the sums will still give the correct a_{i_1,...,i_r} (the worst case is where each set S contains no or a single error). By varying {i_1, . . . , i_r}, we will determine a codeword x(1) ∈ X RM (r, m) containing only monomials of degree r. Note that x∧ − x(1) will be a codeword in X RM (r − 1, m).

Then y can be 'reduced' to y − x(1). Compared with x∧ − x(1), the reduced word y − x(1) will have δ(x∧ − x(1), y − x(1)) = δ(x∧, y) errors, which is < 2^{m−r} = d(X RM (r − 1, m))/2. We can repeat the above procedure and obtain the correct a_{i_1,...,i_{r−1}} for any 1 ≤ i_1 < · · · < i_{r−1} ≤ m, etc. At the end, we recover the whole sequence of information symbols a_{i_1,...,i_l}.

Therefore, any word y ∈ H_{N,2} with distance δ(y, X RM (r, m)) < d(X RM (r, m))/2 is uniquely decoded.
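For the first-order code X RM (1, 3) the above procedure is short enough to write out in full. The Python sketch below is an illustration only, not the book's algorithm verbatim: the helper names encode, majority and decode are ad hoc. It recovers each a_i by a majority vote over the 2^{m−1} pairs {j, j + 2^{i−1}} and then reads a_0 off the residual word.

# Majority (Reed) decoding for the Reed-Muller code X^RM(1, 3) of length 8.
m, N = 3, 8
v = {i: [(j >> (i - 1)) & 1 for j in range(N)] for i in range(1, m + 1)}

def encode(a0, a):
    # codeword a0*v(0) + a[1]*v(1) + ... + a[m]*v(m) over F_2
    return [(a0 + sum(a[i] * v[i][j] for i in range(1, m + 1))) % 2
            for j in range(N)]

def majority(bits):
    return int(sum(bits) * 2 > len(bits))

def decode(y):
    a = {}
    for i in range(1, m + 1):
        # 2^(m-1) disjoint pairs {j, j + 2^(i-1)}; each sum votes for a_i
        votes = [y[j] ^ y[j + (1 << (i - 1))]
                 for j in range(N) if not (j >> (i - 1)) & 1]
        a[i] = majority(votes)
    # subtract the degree-1 part and read off a_0 by a final majority vote
    residual = [(y[j] + sum(a[i] * v[i][j] for i in a)) % 2 for j in range(N)]
    return majority(residual), a

x = encode(1, {1: 1, 2: 0, 3: 1})
y = x[:]; y[6] ^= 1                            # introduce a single error
print(decode(y) == (1, {1: 1, 2: 0, 3: 1}))    # True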

. . . correct, insert, refine,


enlarge, diminish, interline.
Jonathan Swift (1667–1745), Anglo–Irish writer

Reed–Muller codes were discovered at the beginning of the 1950s by David Muller (1924–2008); Irving Reed (1923–2012) proposed the above decoding procedure. In the early 1970s, the RM codes were used to transmit pictures from space (as far as the Moon) by spacecraft. The quality of transmission was then
considered as exceptionally good. However, later on, NASA engineers decided in
favour of the Golay codes while photographing Jupiter and Saturn.
Worked Example 2.4.17 A maximum distance separable (MDS) code was de-
fined earlier as a q-ary linear [N, k, d] code with d = N − k + 1 (equality in the
Singleton bound; see Definition 2.1.13).
(a) Prove that X is MDS iff
(i) any N − k columns of its parity-check matrix H are linearly independent,
and
(ii) there exist N − k + 1 columns of H that are linearly dependent.
(b) Prove that the dual of an MDS code is MDS and deduce that X is MDS iff
any k columns of its generating matrix G are linearly independent and k is the
maximal such number.
(c) Hence prove that when G is written in the standard form (Ik |G ) then X is
MDS iff any square sub-matrix of G is non-singular.
(d) Finally, check that an [N, k, d] code X is MDS iff for any d positions 1 ≤ i_1 < · · · < i_d ≤ N, there exists a codeword of weight d with non-zero digits exactly at positions i_1, . . . , i_d.
Solution (a) An MDS [N, k, d] code has d = N − k + 1. If a linear code X has


d(X ) = d then any (d − 1) columns of its parity-check matrix H are linearly in-
dependent, and (d − 1) is the maximal number with this property, and vice versa.
So, any (N − k) columns are linearly independent and (N − k) is the maximal such
number, and vice versa. Equivalently, any (N − k) × (N − k) submatrix of H is
invertible.
(b) Let X be [N, k, d] MDS code with a parity-check matrix H. Then H is a gen-
erating matrix for X ⊥ . Any (N − k) × (N − k) submatrix of H is invertible. Then
any non-trivial combination of rows of H has ≤ N − k − 1 zero entries, i.e. weight ≥ k + 1; the minimal weight is equal to k + 1. So, d(X^⊥) = k + 1 = N − (N − k) + 1.
As X ⊥ is [N, N − k] code, it is MDS.
Then, clearly, [N, k] code X is MDS iff k is the maximal number l such that any
l columns of its generating matrix G are linearly independent. Equivalently, X is
systematic on any k positions.
(c) Again, let X be an [N, k, d] MDS code, and write G = (I_k | G′). Take a (u × u) submatrix G′_u of G′. By using row and column permutations, we may assume that G′_u occupies the top left corner in G′. Then consider the last (k − u) columns of I_k and the u columns of G containing G′_u; together they form a k × k submatrix G_k,

G_k = ⎛ 0         G′_u ⎞
      ⎝ I_{k−u}   ∗    ⎠ ,

with

det G_k = ± det G′_u det I_{k−u} = ± det G′_u ≠ 0, by (b).

So, G′_u is non-singular. The proof of the inverse statement is similar.


(d) Finally, choose d = N − k + 1 digits, say i1 , . . . , id . Consider i1 together with
the remaining digits j1 , . . . , jk−1 . Then i1 , j1 , . . . , jk−1 are information symbols. So,
there exists a codeword x with digit i1 non-zero and digits j1 , . . . , jk−1 zero. Then
x must have digits i1 , . . . , id non-zero.
The converse: consider an (N − d + 1) × N matrix

G = [IN−d+1 |E(N−d+1)×(d−1) ]

where IN−d+1 is a unit matrix and E is an (N − d + 1) × (d − 1) matrix with all


entries 1 (the unit of F2 ). The rows of G are linearly independent and have weight d,
and for any row there exists a codeword x(i) ∈ X with non-zero digits at the same
positions (and, possibly, elsewhere). Then k, the rank of the code, is ≥ N − d + 1.
Thus, k = N − d + 1.
Worked Example 2.4.18 The MDS codes [N, N, 1], [N, 1, N] and [N, N − 1, 2]
always exist and are called trivial. Any [N, k] MDS code with 2 ≤ k ≤ N −2 is called
non-trivial. Show that there is no non-trivial MDS code over Fq with q ≤ k ≤ N − q.
In particular, there is no non-trivial binary MDS code (which causes a discernible
lack of enthusiasm about binary MDS codes).

Solution Indeed, the [N, N, 1], [N, N − 1, 2] and [N, 1, N] codes are MDS. Take q ≤ k ≤ N − q and assume X is a q-ary MDS code. Take its generating matrix G in the standard form (I_k | G′) where G′ is k × (N − k), N − k ≥ q.

If some entry in a column of G′ is zero then this column is a linear combination of k − 1 columns of I_k. This is impossible by (b) in the previous example; hence G′ has no 0 entry. Next, assume that the first row of G′ is 1 . . . 1: otherwise we can perform scalar multiplication of columns, maintaining the codes' equivalence. Now take the second row of G′: it is of length N − k ≥ q and has no 0 entry. Then it must have repeated entries (there are only q − 1 possible non-zero values). That is,

G = ⎛ I_k | 1 . . . 1 . . . 1 . . . 1       ⎞
    ⎜     | . . . . . . a . . . a . . . . . ⎟ ,  a ≠ 0.
    ⎝     | . . .                           ⎠

Then take the codeword

x = row 1 − a^{−1} (row 2);

it has w(x) ≤ N − k − 2 + 2 = N − k and X cannot be MDS.
By using the dual code, obtain that there exists no non-trivial q-ary MDS code
with k ≥ q. Hence, non-trivial MDS code can only have
N − q + 1 ≤ k or k ≤ q − 1.
That is, there exists no non-trivial binary MDS code, but there exists a non-trivial
[3, 2, 2] ternary MDS code.
Remark 2.4.19 It is interesting to find, given k and q, the largest value of N
for which there exists a q-ary MDS [N, k] code. We demonstrated that N must be
≤ k + q − 1, but computational evidence suggests this value is q + 1.

2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes


A useful class of linear codes is formed by the so-called cyclic codes (in particular,
the Hamming, Golay and Reed–Muller codes are cyclic). Cyclic codes were pro-
posed by Eugene Prange in 1957; their importance was immediately recognised,
and they generated a large literature. But more importantly, the idea of cyclic codes,
together with some other sharp observations made at the end of the 1950s, partic-
ularly the invention of BCH codes, opened a connection from the theory of linear
codes (which was then at its initial stage) to algebra, particularly to the theory of fi-
nite fields. This created algebraic coding theory, a thriving direction in the modern
theory of linear codes.
We begin with binary cyclic codes. The coding and decoding procedures for
binary cyclic codes of length N are based on the related algebra of polynomials
with binary coefficients:

a(X) = a0 + a1 X + · · · + aN−1 X N−1 , where ak ∈ F2 for k = 0, . . . , N − 1. (2.5.1)

Such polynomials can be added and multiplied in the usual fashion, except that
X k + X k = 0. This defines a binary polynomial algebra F2 [X]; the operations over
binary polynomials refer to this algebra. The degree deg a(X) of polynomial a(X)
equals the maximal label of its non-zero coefficient. The degree of the zero poly-
nomial is set to be 0. Thus, the representation (2.5.1) covers polynomials of degree
< N.
Theorem 2.5.1 (a) (1 + X)^{2^l} = 1 + X^{2^l} (A freshman's dream).
(b) (The division algorithm) Let f (X) and h(X) be two binary polynomials with
h(X) ≡ 0. Then there exist unique polynomials g(X) and r(X) such that

f (X) = g(X)h(X) + r(X) with deg r(X) < deg h(X). (2.5.2)

The polynomial g(X) is called the ratio, or quotient, and r(X) the remainder.

Proof (a) The statement follows from the binomial decomposition where all
intermediate terms vanish.
(b) If deg h(X) > deg f (X) we simply set

f (X) = 0 · h(X) + f (X).

If deg h(X) ≤ deg f (X), we can perform the ‘standard’ procedure of long divi-
sion, with the rules of the binary addition and multiplication.

Example 2.5.2 For binary polynomials:

(a) (1 + X + X^3 + X^4)(X + X^2 + X^3) = X + X^7.
(b) 1 + X^N = (1 + X)(1 + X + · · · + X^{N−1}).
(c) The quotient of (X + X^2 + X^6 + X^7 + X^8) by (1 + X + X^2 + X^4) is X^3 + X^4; the remainder equals X + X^2 + X^3.
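Such computations are convenient to mechanise by storing a binary polynomial as an integer whose bit i is the coefficient of X^i. A minimal Python sketch (illustrative only, not from the text) checks parts (a) and (c) of the example.

def pmul(a, b):
    # product in F_2[X]; bit i of an integer is the coefficient of X^i
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def pdivmod(f, h):
    # division algorithm: f = g*h + r with deg r < deg h
    g, dh = 0, h.bit_length() - 1
    while f and f.bit_length() - 1 >= dh:
        shift = f.bit_length() - 1 - dh
        g |= 1 << shift
        f ^= h << shift
    return g, f

# Example 2.5.2(a): (1 + X + X^3 + X^4)(X + X^2 + X^3) = X + X^7
print(pmul(0b11011, 0b1110) == (1 << 7) | (1 << 1))          # True
# Example 2.5.2(c): X + X^2 + X^6 + X^7 + X^8 divided by 1 + X + X^2 + X^4
g, r = pdivmod(0b111000110, 0b10111)
print(g == 0b11000, r == 0b1110)             # quotient X^3+X^4, remainder X+X^2+X^3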
Definition 2.5.3 Two polynomials, f1 (X) and f2 (X), are called equivalent
mod h(X), or f1 (X) = f2 (X) mod h(X), if their remainders, after division by h(X),
coincide. That is,
fi (X) = gi (X)h(X) + r(X), i = 1, 2,
and deg r(X) < deg h(X).
Theorem 2.5.4 Addition and multiplication of polynomials respect the equiva-
lence. That is, if
f1 (X) = f2 (X) mod h(X) and p1 (X) = p2 (X) mod h(X), (2.5.3)
then
f1(X) + p1(X) = f2(X) + p2(X) mod h(X),
f1(X)p1(X) = f2(X)p2(X) mod h(X).   (2.5.4)
Proof We have, for i = 1, 2,
fi (X) = gi (X)h(X) + r(X), pi (X) = qi (X)h(X) + s(X),
with
deg r(X), deg s(X) < deg h(X).
Hence
fi (X) + pi (X) = (gi (X) + qi (X))h(X) + (r(X) + s(X))
with
deg(r(X) + s(X)) ≤ max[r(X), s(X)] < deg h(X).
Thus
f1 (X) + p1 (X) = f2 (X) + p2 (X) mod h(X).
Furthermore, for i = 1, 2, the product fi(X)pi(X) is represented as

[gi(X)qi(X)h(X) + r(X)qi(X) + s(X)gi(X)] h(X) + r(X)s(X).

Hence, the remainder for both polynomials f1 (X)p1 (X) and f2 (X)p2 (X) may come
only from r(X)s(X). Thus it is the same for both of them.
Note that every linear binary code XN corresponds to a set of polynomials, with
coefficients 0, 1, of degree N − 1 which is closed under addition mod 2:
a(X) = a0 + a1 X + · · · + aN−1 X N−1 ↔ a(N) = a0 . . . aN−1 ,
b(X) = b0 + b1 X + · · · + bN−1 X N−1 ↔ b(N) = b0 . . . bN−1 , (2.5.5)
a(X) + b(X) ↔ a(N) + b(N) = (a0 + b0 ) . . . (aN−1 + bN−1 ).
[The numeration of the digits in a word of length N using 0, . . . , N − 1 instead of


1, . . . , N is more convenient.]

We systematically write a(X) ∈ X when the word a(N) = a0 . . . aN−1 , represent-


ing polynomial a(X), belongs to code X .
Definition 2.5.5 Given a binary word a = a0 a1 . . . aN−1 , we define the cyclic shift
π a as a word aN−1 a0 . . . aN−2 . A linear binary code X is called cyclic if the cyclic
shift of each codeword is again a codeword.
A ‘straightforward’ way to form a cyclic code is as follows: take a word a,
then its subsequent cyclic shifts π a, π 2 a, etc., and finally all sums of the vectors
obtained. Such a construction allows one to build a code from a single word, and
eventually all the properties of the code may be inferred from the properties of
word a. It turns out that every cyclic code may be obtained in such a way: the
corresponding word is called a generator of a cyclic code.
Lemma 2.5.6 A binary linear code X is cyclic iff, for any vector u from a basis
of X , π u ∈ X .
Proof Each codeword in X is a sum of vectors of the basis, but π (u + v) =
π u + π v; hence the result.
A useful property of a cyclic shift is established below:
Lemma 2.5.7 If the word a corresponds to a polynomial a(X) then the word π a
corresponds to Xa(X) mod (1 + X N ).
Proof The relations
Xa(X) = a0 X + a1 X 2 + · · · + aN−2 X N−1 + aN−1 X N
= aN−1 + a0 X + a1 X 2 + · · · + aN−2 X N−1 mod (1 + X N )
mean that the polynomial
aN−1 + a0 X + · · · + aN−2 X N−1
corresponding to π a equals Xa(X) mod (1 + X N ).
A similar argument implies that the word π 2 a corresponds to X 2 a(X) mod (1 +
X N ), etc. More generally, we have the following.
Example 2.5.8 The inverse cyclic shift π^{−1} : a_0 . . . a_{N−2} a_{N−1} ∈ {0, 1}^N → a_1 a_2 . . . a_{N−1} a_0 acts on polynomials a(X) of degree at most N − 1 by

π^{−1} a(X) = (1/X)(a(X) + a_0 + a_0 X^N) = (a(X) + a_0)/X + a_0 X^{N−1}.
Theorem 2.5.9 A binary cyclic code contains, with each pair of polynomials
a(X) and b(X), the sum a(X) + b(X) and any polynomial v(X)a(X) mod (1 + X N ).

Proof By linearity the sum a(X) + b(X) ∈ X . If v(X) = v0 + v1 X + · · · vN−1 X N−1


then each polynomial X k a(X) mod (1 + X N ) corresponds to π k a and hence belongs
to X . As
N−1
v(X)a(X) mod (1 + X N ) = ∑ vi X i a(X) mod (1 + X N ),
i=0

the LHS belongs to X .

In other words, the binary polynomials of degree at most N − 1, with the ⊙-multiplication defined by

a ⊙ b(X) = a(X)b(X) mod (1 + X^N),   (2.5.6)

and the usual F2[X]-addition, form a commutative ring, denoted by F2[X]/(1 + X^N). The binary cyclic codes are precisely the ideals of this ring.

Theorem 2.5.10 Let g(X) = ∑_{i=0}^{N−k} g_i X^i be a non-zero polynomial of minimum degree in a binary cyclic code X. Then:

(i) g(X) is a unique polynomial of minimal degree;
(ii) the code X has rank k;
(iii) the codewords corresponding to g(X), Xg(X), . . . , X^{k−1}g(X) form a basis in X; they are cyclic shifts of the word g = g_0 . . . g_{N−k} 0 . . . 0;
(iv) a(X) ∈ X iff a(X) = v(X)g(X) for some polynomial v(X) of degree < k (that is, g(X) is a divisor of every polynomial from X).
Proof (i) Suppose c(X) = ∑_{i=0}^{N−k} c_i X^i is another polynomial of minimal degree N − k in X. Then g_{N−k} = c_{N−k} = 1, and hence deg(c(X) + g(X)) < N − k. But as N − k is the minimal degree, c(X) + g(X) should equal zero. This happens iff g(X) = c(X). Hence, g(X) is unique.

(ii) follows from (iii).

(iii) Assume that property (iv) holds. Then each polynomial a(X) ∈ X has the form

g(X)v(X) = ∑_{i=0}^{r} v_i X^i g(X), r < k.
Hence, each polynomial a(X) ∈ X is a linear combination of polynomi-


als g(X), Xg(X), . . . , X k−1 g(X) (all of which belong to X ). On the other
hand, polynomials g(X), Xg(X), . . . , X k−1 g(X) have distinct degrees and hence
are linearly independent. Therefore words g, π g, . . . , π k−1 g, corresponding to
g(X), Xg(X), . . . , X k−1 g(X), form a basis in X .

(iv) We know that each polynomial a(X) ∈ X has degree > deg g(X). By the
division algorithm,
a(X) = v(X)g(X) + r(X).

Here, we must have

deg v(X) < k and deg r(X) < deg g(X) = N − k.

But then v(X)g(X) belongs to X owing to Theorem 2.5.9 (as v(X)g(X) has degree
≤ N − 1, it coincides with v(X)g(X) mod (1 + X N )). Hence,

r(X) = a(X) + v(X)g(X) ∈ X

by linearity. As g(X) is a unique polynomial from X of minimum degree, r(X) =


0.

Corollary 2.5.11 Every binary cyclic code is obtained from the codeword cor-
responding to a polynomial of minimum degree, by cyclic shifts and linear combi-
nations.
Definition 2.5.12 A polynomial g(X) of minimal degree in X is called a mini-
mal degree generator of a (cyclic) binary code X , or briefly a generator of X .
Remark 2.5.13 There may be other polynomials that generate X in the sense
of Corollary 2.5.11. But the minimum degree polynomial is unique.
Theorem 2.5.14 A polynomial g(X) of degree ≤ N − 1 is the generator of a
binary cyclic code of length N iff g(X) divides 1 + X N . That is,

1 + X N = h(X)g(X) (2.5.7)

for some polynomial h(X) (of degree N − deg g(X)).

Proof (The only if part.) By the division algorithm,

1 + X N = h(X)g(X) + r(X), where deg r(X) < deg g(X).

That is,

r(X) = h(X)g(X) + 1 + X N , i.e. r(X) = h(X)g(X) mod (1 + X N ).


By Theorem 2.5.10, r(X) belongs to the cyclic code X generated by g(X). But g(X) must be the unique polynomial of minimum degree in X. Hence, r(X) = 0 and 1 + X^N = h(X)g(X).

(The if part.) Suppose that 1 + X^N = h(X)g(X), deg h(X) = N − deg g(X). Consider the set {a(X) : a(X) = u(X)g(X) mod (1 + X^N)}, i.e. the principal ideal in the ⊙-multiplication polynomial ring corresponding to g(X). This set forms a linear code; it contains g(X), Xg(X), . . . , X^{k−1}g(X) where k = deg h(X). It suffices to prove that X^k g(X) also belongs to the set. But X^k g(X) = 1 + X^N + ∑_{j=0}^{k−1} h_j X^j g(X), that is, X^k g(X) is equivalent to a linear combination of g(X), Xg(X), . . . , X^{k−1}g(X).
Corollary 2.5.15 All cyclic binary codes of length N are in a one-to-one corre-
spondence with the divisors of polynomial 1 + X N .
Hence, the cyclic codes are described through the factorisation of the polynomial
1 + X N . More precisely, we are interested in decomposing 1 + X N into irreducible
factors; combining these factors into products yields all possible cyclic codes of
length N.
Definition 2.5.16 A polynomial a(X) = a_0 + a_1 X + · · · + a_{N−1}X^{N−1} is called irreducible if a(X) cannot be written as a product of two polynomials, b(X) and b′(X), with min[deg b(X), deg b′(X)] ≥ 1.
The importance (and convenience) of irreducible polynomials for describing
cyclic codes is obvious: every generator polynomial of a cyclic code of length N is
a product of irreducible factors of (1 + X N ).
Example 2.5.17 (a) The polynomial 1 + X N has two ‘standard’ divisors:
1 + X N = (1 + X)(1 + X + · · · + X N−1 ).
The first factor 1 + X generates the binary parity-check code P_N = {x = x_0 . . . x_{N−1} : ∑_i x_i = 0}, whereas the polynomial 1 + X + · · · + X^{N−1} (it may be reducible) generates the repetition code R_N = {00 . . . 0, 11 . . . 1}.
(b) Select the generating and check matrices of the Hamming [7, 4] code in the
lexicographic form. If we re-order the digits x4 x7 x5 x3 x2 x6 x1 (which leads to an
equivalent code) then the rows of the generating matrix become subsequent cyclic
shifts of each other:
G^H_cycl = ⎛ 1 1 0 1 0 0 0 ⎞
           ⎜ 0 1 1 0 1 0 0 ⎟
           ⎜ 0 0 1 1 0 1 0 ⎟
           ⎝ 0 0 0 1 1 0 1 ⎠
and the cyclic shift of the last row is again in the code:

π (0 0 0 1 1 0 1) = (1 0 0 0 1 1 0)
= (1 1 0 1 0 0 0) + (0 1 1 0 1 0 0) + (0 0 1 1 0 1 0).

By Lemma 2.5.6, the code is cyclic. By Theorem 2.5.10(iii), the generating polynomial g(X) corresponds to the initial segment 1101 of the first row of G^H_cycl:

1101 ∼ g(X) = 1 + X + X^3 = the generator.

But a similar argument can be used to show that an equivalent cyclic code is ob-
tained from the word 1011 ∼ 1 + X 2 + X 3 . There is no contradiction: it was not
claimed that the polynomial ideal of a cyclic code is the principal ideal of a unique
element.
If we choose a different order of the columns in the parity-check matrix, the
code will be equivalent to the original code; that is, the code with the generator
polynomial 1 + X 2 + X 3 is again a Hamming [7, 4] code.

In Problem 2.3 we will check that the Golay [23, 12] code is generated by the polynomial g(X) = 1 + X + X^5 + X^6 + X^7 + X^9 + X^11.

Worked Example 2.5.18 Using the factorisation

X 7 + 1 = (X + 1)(X 3 + X + 1)(X 3 + X 2 + 1) (2.5.8)

in F2 [X], find all cyclic binary codes of length 7. Identify those which are Hamming
codes and their duals.

Solution See the table below.

code X            generator for X            generator for X^⊥
{0, 1}^7          1                          1 + X^7
parity-check      1 + X                      1 + X + · · · + X^6
Hamming           1 + X + X^3                1 + X^2 + X^3 + X^4
Hamming           1 + X^2 + X^3              1 + X + X^2 + X^4
dual Hamming      1 + X^2 + X^3 + X^4        1 + X + X^3
dual Hamming      1 + X + X^2 + X^4          1 + X^2 + X^3
repetition        1 + X + · · · + X^6        1 + X
zero              1 + X^7                    1
It is easy to check that all factors in (2.5.8) are irreducible. Any irreducible factor can be either included or not included in the decomposition of the generator polynomial. This argument proves that there exist exactly 8 cyclic binary codes in H_{7,2}, as demonstrated in the table.
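The table can be reproduced mechanically: every choice of a subset of the three irreducible factors in (2.5.8) gives a generator g(X), and the code it generates can be listed in full. The Python sketch below (illustrative only, reusing the integer encoding of binary polynomials) prints the parameters of all eight cyclic codes of length 7.

from itertools import product

def pmul(a, b):                       # multiplication in F_2[X], integers as bit masks
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

N = 7
factors = [0b11, 0b1011, 0b1101]      # 1+X, 1+X+X^3, 1+X^2+X^3, cf. (2.5.8)

for choice in product((0, 1), repeat=3):
    g = 1
    for c, f in zip(choice, factors):
        if c:
            g = pmul(g, f)
    k = N - (g.bit_length() - 1)      # rank = N - deg g
    # all codewords v(X)g(X), deg v(X) < k; no reduction mod 1+X^7 is needed
    # since deg v + deg g <= N - 1 (Theorem 2.5.10)
    code = {pmul(v, g) for v in range(1 << k)}
    d = min((bin(x).count("1") for x in code if x), default=0)
    print(f"g(X) = {g:b}   [N, k] = [7, {k}]   d = {d}")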

Example 2.5.19 (a) Polynomials of the first degree, 1 + X and X, are irreducible
(but X does not appear in the decomposition for 1 + X N ). There is one irreducible
binary polynomial of degree 2: 1 + X + X 2 , two of degree 3: 1 + X + X 3 and 1 +
X 2 + X 3 , and three of degree 4:

1 + X + X 4, 1 + X3 + X4 and 1 + X + X 2 + X 3 + X 4 , (2.5.9)

each of which appears in the decomposition of 1 + X N for various values of N (see


below). A further distinction is that polynomials 1 + X + X 3 and 1 + X 2 + X 3 are
‘primitive’ whereas 1 + X + X 2 + X 3 + X 4 is not; see Example 2.5.34 below and
Sections 3.1–3.3. On the other hand, polynomials

1 + X 8, 1 + X4 + X6 + X7 + X8 and 1 + X 2 + X 6 + X 8 (2.5.10)

are reducible. The polynomial 1 + X N is always reducible:

1 + X N = (1 + X)(1 + X + · · · + X N−1 ).

(b) Generally, the factorisation of polynomial 1 + X N into the irreducible factors


is not easy to achieve. Among the first 13 odd values of N, the list of polynomials
1 + X N which admit only the trivial decomposition into two irreducible factors is
as follows:

1 + X, 1 + X 3, 1 + X 5, 1 + X 11 , 1 + X 13 .

Further, the polynomial 1 + X 19 admits only a trivial decomposition (1 + X) (1 +


X + · · · + X 18 ), while others have the following factors (the common factor (1 + X)
is omitted):
1 + X^7 : (1 + X + X^3)(1 + X^2 + X^3),
1 + X^9 : (1 + X + X^2)(1 + X^3 + X^6),
1 + X^15 : (1 + X + X^2)(1 + X + X^4)(1 + X^3 + X^4)(1 + X + X^2 + X^3 + X^4),
1 + X^17 : (1 + X^3 + X^4 + X^5 + X^8)(1 + X + X^2 + X^4 + X^6 + X^7 + X^8),
1 + X^21 : (1 + X + X^2)(1 + X + X^3)(1 + X^2 + X^3)(1 + X + X^2 + X^4 + X^6)(1 + X^2 + X^4 + X^5 + X^6),
1 + X^23 : (1 + X + X^5 + X^6 + X^7 + X^9 + X^11)(1 + X^2 + X^4 + X^5 + X^6 + X^10 + X^11),

and

1 + X^25 : (1 + X + X^2 + X^3 + X^4)(1 + X^5 + X^10 + X^15 + X^20).
For N even, 1 + X N can have multiple roots (see Example 2.5.35(c)).
Example 2.5.20 Irreducible polynomials of degree 2 and 3 over the field F3
(that is, from F3 [X]) are as follows. There exist three irreducible polynomials of
degree 2 over F3 : X 2 + 1, X 2 + X + 2 and X 2 + 2X + 2. There exist eight irreducible
polynomials of degree 3 over F3 : X 3 + 2X + 2, X 3 + X 2 + 2, X 3 + X 2 + X + 2, X 3 +
2X 2 + 2X + 2, X 3 + 2X + 1, X 3 + X 2 + 2X + 1, X 3 + 2X 2 + 1 and X 3 + 2X 2 + X + 1.
Cyclic codes admit encoding and decoding procedures in terms of the polyno-
mials. It is convenient to have a generating matrix of a cyclic code X in a form
similar to Gcycl for the Hamming [7, 4] code (see above). That is, we want to find
the basis in X which gives the following picture in the corresponding generating
matrix:
Gcycl = ⎛ g_0 g_1 . . . g_{N−k}                         0 ⎞
        ⎜     g_0 g_1 . . . g_{N−k}                       ⎟
        ⎜               . . .                             ⎟   (2.5.11)
        ⎝ 0                 g_0 g_1 . . . g_{N−k}         ⎠
Such a basis is provided by Theorem 2.5.10(iii): take the generator polynomial


g(X) and its multiples:
g(X), Xg(X), . . . , X k−1 g(X), deg g(X) = N − k.
Symbolically,
⎛ ⎞
g(X)
⎜ Xg(X) ⎟
⎜ ⎟
Gcycl = ⎜ .. ⎟. (2.5.12)
⎝ . ⎠
X k−1 g(X)
The code has rank k and may be used for encoding words of length k as follows.
Given a word a = a_0 . . . a_{k−1}, we form the polynomial a(X) = ∑_{0≤i<k} a_i X^i and take the product a(X)g(X). It belongs to X by Theorem 2.5.9, and hence defines a
codeword. So all we have to do is to store polynomial g(X): the encoding will
correspond to polynomial multiplication. If encoding is given by multiplication,
decoding must be related to division. Recall that under the geometric decoder, we
decode the received word by the closest codeword in the Hamming distance. Such
a codeword is related to a leader of the corresponding coset: we have seen that the
cosets are in a one-to-one correspondence with the syndrome words of the form
yH^T. In the case of a cyclic code, the syndromes are calculated straightforwardly. Recall that, if g(X) is a generator polynomial of a cyclic code X and deg g(X) = N − k, then the rank of X equals k, and there must be 2^{N−k} distinct cosets (see Theorem 2.5.10).

Theorem 2.5.21 The cosets y + X are in a one-to-one correspondence with the remainders u(X) = y(X) mod g(X). In other words, two words y, y′ belong to the same coset iff, in the division algorithm representation,

y(X) = a(X)g(X) + u(X), y′(X) = a′(X)g(X) + u′(X), and u(X) = u′(X).

Proof y and y′ belong to the same coset iff y + y′ ∈ X. This is equivalent to u(X) + u′(X) = 0, i.e. u(X) = u′(X), by Theorem 2.5.14.

Hence the cosets are labelled by the polynomials u(X) with deg u(X) < deg g(X) = N − k: there are exactly 2^{N−k} such polynomials. To determine the coset y + X it is enough to compute the remainder u(X) = y(X) mod g(X). Unfortunately, there still remains the task of finding a leader in each case: there is no simple algorithm for finding leaders for a general cyclic code. However, there are known particular classes of cyclic codes which admit a relatively simple decoding: the first such class was discovered in 1959 and is formed by the BCH codes (see Section 2.6).
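Encoding by multiplication and coset labelling by remainders take only a few lines in the integer representation of binary polynomials. The sketch below (illustrative only) does this for the cyclic Hamming [7, 4] code with generator g(X) = 1 + X + X^3, and corrects a single error by matching the remainder of y(X) against the remainders of the monomials X^j.

def pmul(a, b):                       # F_2[X] multiplication, integers as bit masks
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def pmod(f, h):                       # remainder of f(X) after division by h(X)
    dh = h.bit_length() - 1
    while f and f.bit_length() - 1 >= dh:
        f ^= h << (f.bit_length() - 1 - dh)
    return f

g = 0b1011                            # g(X) = 1 + X + X^3, cyclic Hamming [7, 4] code
a = 0b0110                            # message polynomial a(X) = X + X^2
x = pmul(a, g)                        # encoding: the codeword x(X) = a(X)g(X)
print(pmod(x, g) == 0)                # a codeword has zero remainder (Theorem 2.5.21)

y = x ^ (1 << 5)                      # one transmission error, at digit 5
u = pmod(y, g)                        # the coset label u(X) = y(X) mod g(X)
# for a single error X^j, the coset label is X^j mod g(X); look it up
leaders = {pmod(1 << j, g): 1 << j for j in range(7)}
print((y ^ leaders[u]) == x)          # True: the error is located and removed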

As was observed, a cyclic code may be generated not only by its polynomial of
minimum degree: for some purposes other polynomials with this property may be
useful. However, they all are divisors of 1 + X N :

Theorem 2.5.22 Let X be a binary cyclic code of length N . Then any polyno-
mial g(X) such that X is the principal ideal of g(X) is a divisor of 1 + X N .

Proof An exercise from algebra.

We see that the cyclic codes are naturally labelled by their generator poly-
nomials.

Definition 2.5.23 Let X be the cyclic binary code of length N generated by


g(X). The check polynomial h(X) of X is defined as the ratio (1 + X N )/g(X).
That is, h(X) is a unique polynomial for which h(X)g(X) = 1 + X N .

We will use the standard notation gcd( f (X), g(X)) for the greatest common di-
visor of polynomials f (X) and g(X) and lcm( f (X), g(X)) for their least common
multiple. Denote by X1 + X2 the direct sum of two linear codes X1 , X2 ⊂ HN,2 .
That is, X1 + X2 consists of the linear combinations α1 a(1) + α2 a(2) where
α1 , α2 = 0, 1 and a(i) ∈ Xi , i = 1, 2. Compare Example 2.1.8(vii).
224 Introduction to Coding Theory

Worked Example 2.5.24 Let X1 and X2 be two binary cyclic codes of length
N , with generators g1 (X) and g2 (X). Prove that:
(a) X1 ⊂ X2 iff g2 (X) divides g1 (X);
the intersection
(b)
X1 ∩ X2 yields a cyclic code generated by
lcm g1 (X), g2 (X) ;
the direct
sum X1 + X2 is a cyclic code with the generator
(c)
gcd g1 (X), g2 (X) .

Solution (a) We know that a(X) ∈ Xi iff, in the ring F2 [X] (1 + X N ), polyno-
mial a(X) = fi  gi (X), i = 1, 2. Suppose g2 (X) divides g1 (X) and write g1 (X) =
r(X)g2 (X). Then every polynomial a(X) of the form f1  g1 (X) is of the form
f1  r  g2 (X). That is, if a(X) ∈ X1 then a(X) ∈ X2 , so X1 ⊂ X2 .
Conversely, suppose that X1 ⊂ X2 . Let di be the degree of gi (X), 1 ≤ di < N,
i = 1, 2, and write

g1 (X) = f (X)g2 (X) + r(X), where deg r(X) < d2 .



We have that every polynomial -divisible by g1 (X) in F2 [X] (1 + X N ) is also -
divisible by g2 (X). In particular, the basis polynomials X i g1 (X), 0 ≤ i ≤ N −d1 −1,
are -divisible by g2 (X), i.e. have the form

X i g1 (X) = h(i) (X)g2 (X) + αi (X N − 1) where αi = 0 or 1.

If, for some i, the coefficient αi = 0 then we compare two identities,

X i g1 (X) = X i f (X)g2 (X) + X i r(X) and X i g1 (X) = h(i) (X)g2 (X),

and conclude that X i r(X) = 0. This implies that r(X) = 0 and hence g2 (X) divides
g1 (X).
The remaining case is that all coefficients αi ≡ 1. Then we compare

Xg1 (X) = Xh(0) (X)g2 (X) + X + X N+1

and
Xg1 (X) = h(1) (X)g2 (X) + 1 + X N

and see that this case is impossible.


(b) This part becomes straightforward: the intersection X1 ∩ X2 is a subcode of
both X1 and X2 . It is obviously a cyclic code; hence, by part (a), its generator g(X)
is divisible by both g1 (X) and g2 (X). Then it is divisible by the lcm(g1 (X), g2 (X)).
We must exclude the case where g(X) produces a non-trivial ratio after this di-
vision. But the lcm(g1 (X), g2 (X)) is itself a generator of a cyclic code (of the
same original length) contained in both X1 and X2 . So, in the case g(X) =
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 225

lcm(g1 (X), g2 (X)), the code generated by lcm(g1 (X), g2 (X)) must be strictly larger
than X1 ∩ X2 . This contradicts the definition of X1 ∩ X2 .
(c) Similarly, X1 + X2 is the minimal linear code containing both X1 and X2 .
Hence, its generator divides both g1 (X) and g2 (X), i.e. is their common divisor.
And if it is not equal to the gcd(g1 (X), g2 (X)) then it contradicts the above mini-
mality property.

Worked Example 2.5.25 Let X be a binary cyclic code of length N with the
generator g(X) and the check polynomial h(X). Prove that a(X) ∈ X iff the poly-
nomial (1 + X N ) divides a(X)h(X), i.e. a  h(X) = 0 in F2 [X]/(1 + X N ).

Solution If a(X) ∈ X then a(X) = f (X)g(X) for some polynomial f (X) ∈


F2 [X]/(1 + X N ). Then

a(X)h(X) = f (X)g(X)h(X) = f (X)(1 + X N )

which equals 0 in F2 [X]/(1 + X N ). Conversely, let a(X) ∈ F2 [X]/(1 + X N ) and


assume that a(X)h(X) = 0 mod (1 + X N ). Write a(X) = f (X)g(X) + r(X) where
deg r(X) <deg g(X). Then

a(X)h(X) = f (X)(1 + X N ) + r(X)h(X) = r(X)h(X) mod (1 + X N ).

Hence, r(X)h(X) = 0 mod (1 + X N ) which is only possible when r(X) = 0 (since


deg r(X)h(X) < N). Thus, a(X) = f (X)g(X) and a(X) ∈ X .

Worked Example 2.5.26 Prove that the dual of a cyclic code is again cyclic and
find its generating matrix.

Solution If y ∈ X ⊥ , the dual code, then the dot-product "π x · y# = 0 for all x ∈ X .
But "π x · y# = "x · π y#, i.e. π y ∈ X ⊥ , which means that X ⊥ is cyclic.
Let g(X) = g0 + g1 X + · · · + gN−k X N−k be the generating polynomial for X ,
where N − k = d is the degree of g(X) and k gives the rank of X . We know that
the generating matrix G of X may be written as
⎛ ⎞
g(X) ⎛ ⎞
⎜ Xg(X) ⎟
⎜ ⎟ ⎜ 0 ⎟
⎜ ⎟ ⎜ ⎟
⎜ · ⎟ ⎜ ⎟
G ∼ ⎜ ⎟ ∼ ⎜ ⎟. (2.5.13)
⎜ · ⎟ ⎜ . ⎟
⎜ ⎟ ⎝ 0 . . ⎠
⎝ · ⎠
X k−1 g(X)
226 Introduction to Coding Theory
k
Take h(X) = (1 + X N )/g(X) and write h(X) = ∑ h j X j and h = h0 . . . hN−1 . Then
j=0
'
i = 1, i = 0, N,
∑ g j hi− j = 0, 1 ≤ i < N.
j=0

Indeed, for i = 0, N, we have h0 g0 = 1 and hk gN−k = 1. For 1 ≤ i < N we obtain


that the dot-product

"π j g · π j h⊥ # = 0 for j = 0, 1, . . . , N − k − 1, j = 0, . . . , k − 1,

where h⊥ = hk hk−1 . . . h0 . It is then easy to see that h⊥ gives rise to the generator
h⊥ (X) of X ⊥ .

An alternative solution is based on Worked Example 2.5.25. We know that


a(X) ∈ X iff a  h(X) = 0. Let k be the degree of g(X) then the degree of h(X)
equals N − k. The degree deg[a(X)h(X)] is < 2N − k, so the coefficients of X N−k ,
X N−k+1 , . . . , X N−1 in a(X)h(X) all vanish. That is:
a0 hN−k + a1 hN−k−1 + · · · + aN−k h0 = 0,
a1 hN−k + a2 hN−k−1 + · · · + aN−k+1 h0 = 0,
.. ..
. .
ak−1 hN−k + ak hN−k−1 + · · · + aN−1 h0 = 0.

In other words, aH T = 0 where a = a0 , . . . , aN−1 is the word of the binary coeffi-


cients for a(X) and H is an (N − k) × N matrix
⎛ ⎞
⎛ ⊥

h (X) ⎜
⎜ ⊥ (X) ⎟ ⎜ 0 ⎟

⎜ Xh ⎟ ⎜ ⎟
H ∼⎜ .. ⎟∼⎜ ⎟ (2.5.14)
⎝ . ⎠ ⎜ .. ⎟
⎝ 0 . ⎠
X N−k−1 h⊥ (X)

and h⊥ (X) = X N−k h(X −1 ), with the coefficient string h⊥ = hk hk−1 . . . h0 .


We conclude that matrix H generates a code X ⊆ X ⊥ . But since hN−k = 1, the
rank of X equals N − k. Hence, X = X ⊥ .
It remains to check that polynomial h⊥ (X) divides 1 + X N . To this end,
we deduce from g(X)h(X) = 1 + X N that h(X −1 )g(X −1 ) = X −N + 1. Hence
h⊥ (X)X k g(X −1 ) = 1 + X N , and as X k g(X −1 ) equals the polynomial gk + gk−1 X +
· · · + g0 X k , the required fact follows. 2

Worked Example 2.5.27 Let X be a binary cyclic code of length N with gen-
erator g(X).
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 227

(a) Show that the set of codewords a ∈ X of even weight is a cyclic code and find
its generator.
(b) Show that X contains a codeword of odd weight iff g(1) = 0 or, equivalently,
word 1 ∈ X .

Solution (a) If code X is even (i.e. contains only words of even weight) then
every polynomial a(X) ∈ X has a(1) = ∑ ai = 0. Hence, a(X) contains a
0≤i<N−1
factor (X + 1). Therefore, the generator g(X) has a factor (X + 1). The converse is
also true: if (X + 1) divides g(X), or, equivalently, g(1) = 0, then every codeword
a ∈ X is of even weight.
Now assume that X contains a word with an odd weight, i.e. g(1) = 1; that
is, (1 + X) does not divide g(X). Let X ev be the subcode in X formed by the
even codewords. A cyclic shift does not change the weight, so X ev is a cyclic
code. For the corresponding polynomials a(X) we have, as before, that (1 + X)
divides a(X). Thus, the generator gev (X) of X ev is divisible by (1 + X), hence
gev (X) = g(X)(X + 1).
(b) It remains to show that g(1) = 1 iff the word 1 ∈ X . The corresponding poly-
nomial is 1+ · · ·+ X N−1 , the complementary factor to (1+ X) in the decomposition
1 + X N = (1 + X)(1 + · · · + X N−1 ). So, if g(1) = 1, i.e. g(X) does not contain the
factor (1 + X), then g(X) must be a divisor of 1 + · · · + X N−1 . This implies that
1 ∈ X . The inverse statement is established in a similar manner.

Worked Example 2.5.28 Let X be a binary cyclic code of length N with gen-
erator g(X) and check polynomial h(X).
(a) Prove that X is self-orthogonal iff h⊥ (X) divides g(X) and self-dual iff
h⊥ (X) = g(X) where h⊥ (X) = hk + hk−1 X + · · · + h0 X k−1 and h(X) = h0 + · · · +
hk−1 X k−1 + hk X k is the check polynomial, with g(X)h(X) = 1 + X N .
(b) Let r be a divisor of N : r|N . A binary code X is called r-degenerate if every
codeword a ∈ X is a concatenation c . . . c where c is a string of length r. Prove that
X is r-degenerate iff h(X) divides (1 + X r ).

Solution (a) Self-orthogonality means that X ⊆ X ⊥ , i.e. "a · b# = 0 for all


a, b ∈ X . From Worked Example 2.5.26 we know that h⊥ (X) gives the genera-
tor polynomial of X ⊥ . Then, by virtue of Worked Example 2.5.26, X ⊆ X ⊥ iff
h⊥ (X) divides g(X).
Self-duality means that X = X ⊥ , that is h⊥ (X) = g(X).
(b) For N = rs, we have the decomposition

1 + X N = (1 + X r )(1 + X r + · · · + X r(s−1) ).
228 Introduction to Coding Theory

Now assume cyclic code X of length N with generator g(X) is r-degenerate. Then
the word g is of the form 1c1c . . . 1c for some string c of length r − 1 (with c = 1c).
Let c(X) be the polynomial corresponding to c (of degree ≤ r − 2). Then g(X) is
given by
1 + X c(X) + X r + X r+1 c(X) + · · · + X r(s−1) + X r(s−1)+1 c(X)
= (1 + X r + · · · + X r(s−1) )[1 + X c(X)].
For the check polynomial h(X) we obtain
 >  

h(X) = 1 + X N 1 + X r + · · · + X r(s−1) 1 + X c(X)
 

= 1 + Xr 1 + X c(X) ,
i.e. h(X) is a divisor of (1 + X r ).
Conversely, let h(X)|(1 + X r ), with h(X)g(X) = 1 + X r where g(X) =
∑ c j X j , with c0 = 1. Take c = c0 . . . cr−1 ; repeating the above argument in
0≤ j≤r−1
the reverse order, we conclude that the word g is the concatenation c . . . c. Then the
cyclic shift π g is the concatenation c(1) . . . c(1) where c(1) = cr−1 c0 . . . cr−2 (= π c,
the cyclic shift of c in {0, 1}r ). Similarly, for subsequent cyclic shift iterations
π 2 g, . . .. Hence, the basis vectors in X are r-degenerate, and so is the whole of X .

In the ‘standard’ arithmetic, a (real or complex) polynomial p(X) of a given de-


gree d is conveniently identified through its roots (or zeros) α1 , . . . , αd (in general,
complex), by means of the monomial decomposition: p(X) = pd ∏ (X − αi ). In
1≤i≤d
the binary arithmetic (and, more generally, the q-ary arithmetic), the roots of poly-
nomials are still an extremely useful concept. In our situation, the roots help to
construct the generator polynomial g(X) = ∑ gi X i of a binary cyclic code with
0≤i≤d
important predicted properties. Assume for the moment that the roots α1 , . . . , αd of
g(X) are a well-defined object, and the representation
g(X) = ∏ (X − αi )
1≤i≤d

has a consistent meaning (which is provided within the framework of finite fields).
Even without knowing the formal theory, we are able to make a couple of helpful
observations.
The first observation is that the αi are Nth roots of unity, as they should be among
the zeros of polynomial 1 + X N . Hence, they could be multiplied and inverted, i.e.
would form an Abelian multiplicative group of size N, perhaps cyclic. Second, in
the binary arithmetic, if α is a zero of g(X) then so is α 2 , as g(X)2 = g(X 2 ). Then
α 2 is also a zero, as well as α 4 , and so on. We conclude that the sequence α , α 2 , . . .
begins cycling: α 2 = α (or α 2 −1 = 1) where d is the degree of g(X). That is, all
d d
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 229
( c−1
)
Nth roots of unity split into disjoint classes, of the form C = α , α 2 , . . . , α 2 ,
of size c where c = c(C ) is a positive integer (with 2 − 1 dividing N). The notation
c

C (α ) is instructive, with c = c(α ). The members of the same class are said to be
conjugate to each other. If we want a generating polynomial with root α then all
conjugate roots of unity α ∈ C (α ) will also be among the roots of g(X).
Thus, to form a generator g(X) we have to ‘borrow’ roots from classes C and
enlist, with each borrowed root of unity, all members of their classes. Then, since
any polynomial a(X) from the cyclic code generated by g(X) is a multiple of g(X)
(see Theorem 2.5.10(iv)), the roots of g(X) will be among the roots of a(X). Con-
versely, if a(X) has roots αi of g(X) among its roots then a(X) is in the code. We
see that cyclic codes are conveniently described in terms of roots of unity.

Example 2.5.29 (The Hamming [7, 4] code) Recall that the parity-check matrix
H for the binary Hamming [7, 4] code X H is 3 × 7; it enlists as its columns all
non-zero binary words of length 3: different orderings of these rows define equiv-
alent codes. Later in this section we explain that the sequence of non-zero binary
words of any given length 2 − 1 written in some particular order (or orders) can be
interpreted as a sequence of powers of a single element ω : ω 0 , ω , ω 2 , . . . , ω 2 −2 .


The multiplication rule generating these powers is of a special type (multiplication


of polynomials modulo a particular irreducible polynomial of degree ). To stress
this fact, we use in this section the notation ∗ for this multiplication rule, writing
ω ∗i in place of ω i . Anyway, for l = 3, one appropriate order of the binary non-zero
3-words (out of the two possible orders) is
⎛ ⎞
0 0 1 0 1 1 1
H = ⎝0 1 0 1 1 1 0⎠ ∼ (ω ∗0 ω ω ∗2 ω ∗3 ω ∗4 ω ∗5 ω ∗6 ).
1 0 0 1 0 1 1

Then, with this interpretation, the equation aH T = 0, determining that the word
a = a0 . . . a6 (or its polynomial a(X) = ∑ ai X i ) lies in X H , can be rewritten as
0≤i<7

∑ ai ω ∗i = 0, or a(∗ω ) = 0.
0≤i<7

In other words, a(X) ∈ X H iff ω is a root of a(X) under the multiplication rule ∗
(which in this case is multiplication of binary polynomials of degree ≤ 2 modulo
the polynomial 1 + X + X 3 ).
The last statement can be rephrased in this way: the Hamming [7, 4] code is
equivalent to the cyclic code with the generator g(X) that has ω among its roots;
in this case the generator g(X) = 1 + X + X 3 , with g(∗ω ) = ω ∗0 + ω + ω ∗3 = 0.
The alternative ordering of the rows of H H is related in the same fashion to the
polynomial 1 + X 2 + X 3 .
230 Introduction to Coding Theory

We see that the Hamming [7, 4] code is defined by a single root ω , provided
that we establish proper terms of operation with its powers. For that reason we can
call ω the defining root (or defining zero) for this code. There are reasons to call
element ω ‘primitive’; cf. Sections 3.1–3.3.
Worked Example 2.5.30 A code X is called reversible if a = a0 a1 . . . aN−1 ∈ X
implies that a← = aN−1 . . . a1 a0 ∈ X . Prove that a cyclic code with generator g(X)
is reversible iff g(α ) = 0 implies g(α −1 ) = 0.

Solution For the generator polynomial g(X) = ∑ gi X i , with deg g(X) = d < N
0≤i≤d
and g0 = gd = 1, the reversed polynomial is grev (X) = X N−1 g(X −1 ), so if the cyclic
code X is reversible and α is a root of g(X) then α is also a root of grev (X). This
is possible only when g(α −1 ) = 0.
Conversely, let g(X) satisfy the property that g(α ) = 0 implies g(α −1 ) = 0.
The above formula holds for all polynomial a(X) of degree < N: arev (X) =
X N−1 a(X −1 ). If a(X) ∈ X then a(α ) = a(α −1 ) = 0 for all root α of g(X). Then
arev (α ) = arev (α −1 ) = 0 for all roots α of g(X). Thus, arev (X) is a multiple of
g(X), and arev (X) ∈ X .

The natural framework for studying roots of polynomials is provided by the


theory of finite fields or Galois theory (we have seen already how polynomial fields
can be used). In the rest of this section we give an initial (and brief) introduction
into some aspects of Galois theory to understand better some examples of codes
introduced so far. In Chapter 3 we will dive deeper into the Galois theory to gain
enough knowledge in order to proceed further with code constructions.
Remark 2.5.31 A field is a commutative ring where each non-zero element has
an inverse. In other words, a ring is a field if the multiplication generates a group.
In fact, a multiplication group of non-zero elements of a field is cyclic.
Theorem 2.5.32 Let g(X) ∈ F2 [X] be an irreducible binary polynomial of de-
gree d . Then multiplication mod g(X) makes the set of the binary polynomials of
degree ≤ d − 1 (i.e. the space F×d d
2 ) a field with 2 elements. Conversely, if the
multiplication mod g(X) leads to a field then g(X) is irreducible.

Proof The only non-trivial property to check is the existence of the inverse ele-
ment. Take a non-zero polynomial f (X), with deg f (X) ≤ d − 1, and consider all
polynomials of the form f (X)h(X) (the usual multiplication) where h(X) runs over
the whole set of the polynomials of degree ≤ d − 1. These products must be distinct
mod g(X). Indeed, if

f (X)h1 (X) = f (X)h2 (X) mod g(X),


2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 231

then, for some polynomial v(X) of degree ≤ d − 2,

f (X)(h1 (X) − h2 (X)) = v(X)g(X). (2.5.15)

This implies that either g(X)| f (X) or g(X)|h1 (X) − h2 (X). We conclude that if
polynomial g(X) is irreducible, (2.5.15) is impossible, unless h1 (X) = h2 (X) and
v(X) = 0. For one and only one polynomial h(X), we have

f (X)h(X) = 1 mod g(X);

h(X) represents the inverse for f (X) in multiplication mod g(X). We write h(X) =
f (X)∗−1 .

On the other hand, if g(X) is reducible, then g(X) = b(X)b (X) where both b(X)
and b (X) are non-zero and have degree < d. That is, b(X)b (X) = 0 mod g(X). If
the multiplication mod q led to a field both b(X) and b (X) would have inverses,
b(X)−∗1 and b (X)−∗1 . But then

b(X)−∗1 ∗ b(X) ∗ b (X) = b (X) = 0,

and similarly b(X) = 0.

A field obtained via the above construction is called a polynomial field and is
often denoted by F2 [X]/"g(X)#. It contains 2d elements where d = deg g(X) (rep-
resenting polynomials of degree < d). We will call g(X) the core polynomial of the
field. For the rest of this section we denote the multiplication in a given polynomial
field by ∗. The zero polynomial and the unit polynomial are denoted correspond-
ingly, by 0 and 1: they are obviously the zero and the unity of the polynomial field.
A key role is played by the following result.

Theorem 2.5.33 (a) The multiplicative group of non-zero elements in polyno-


mial field F2 [X]/"g(X)# is isomorphic to the cyclic group Z2d −1 of size 2d − 1.
(b) The polynomial fields obtained by picking different irreducible polynomials of
degree d are all isomorphic.

Proof We will only prove here assertion (a); assertion (b) will be established in
Section 3.1. Take any element from the field, a(X) ∈ F2 [X]/"g(X)#, and observe
that
a∗i (X) := a
: ∗ .;<
. . ∗ a=(X)
i times
(the multiplication in the field) takes at most 2d − 1 values (the number of elements
in the field less one, as the zero 0 is excluded). Hence there exists a positive integer
r such that a∗r (X) = 1; the smallest value of r is called the order of a(X).
232 Introduction to Coding Theory

Choose a polynomial a(X) ∈ F2 [X]/"g(X)# with the largest order r. Then we


claim that the order of any other element b(X) divides r. Indeed, let s be the order
of b(X). Pick a prime factor p of s and write

s = pc l , and r = pc l,

with integers c , c ≥ 0 and l, l ≥ 1, where l and l are not divisible by p. We want to



show that c ≥ c . Indeed, element a∗p (X) has order l, b∗l (X) has order pc and the
c


product a∗p ∗ b∗l (X) has order l pc . Hence, c ≤ c or else r would not be maximal.
b

This is true for any prime p, hence s divides r.


Thus, with r being the maximal order, every element b(X) in the field obeys
∗r
b (X) = 1. By using the pigeon-hole principle, we conclude that r = 2d − 1, the
number of non-zero elements of the field. Hence, with a(X) being an element of
order r, the powers 1, a(X), . . . , a∗(2 −1) (X) exhaust the multiplicative groups of
d

the field.

In the wake of Theorem 2.5.33, we can use the notation F2d for any polynomial
field F2 [X]/"g(X)# where g(X) is an irreducible binary polynomial of degree d.
Further, the multiplicative group of non-zero elements in F2d is denoted by F∗2d ;
it is cyclic ( Z2d −1 , according to Theorem 2.5.33). Any generator of group F∗2d
(whose ∗-powers exhaust F∗2d ) is called a primitive element of field F2d .

Example 2.5.34 We can see the importance of writing down the full list of ir-
reducible polynomials. There are six irreducible binary polynomials of degree 5
(each of which is primitive):

1 + X 2 + X 5, 1 + X 3 + X 5, 1 + X + X 2 + X 3 + X 5,
1 + X + X 2 + X 4 + X 5, 1 + X + X 3 + X 4 + X 5, (2.5.16)
1 + X2 + X3 + X4 + X5

and nine of degree 6 (of which six are primitive):

1 + X + X 6, 1 + X + X 3 + X 4 + X 6, 1 + X 5 + X 6,
1 + X + X 2 + X 5 + X 6, 1 + X 2 + X 3 + X 5 + X 6,
1 + X + X 4 + X 5 + X 6, (2.5.17)
1 + X + X + X 4 + X 6, 1 + X 2 + X 4 + X 5 + X 6,
2

1 + X 3 + X 6.

The number of irreducible polynomials grows significantly with the degree: there
are 18 of degree 7, 30 of degree 8, and so on. However, there exist and are available
quite extensive tables of irreducible polynomials over various finite fields.
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 233

1 + X + X3 1 + X2 + X3
X ∗i polynomial word X ∗i polynomial word

−− 0 000 −− 0 000
X ∗0 1 100 X ∗0 1 100
X X 010 X X 010
X ∗2 X2 001 X ∗2 X2 001
X ∗3 1+X 110 X ∗3 1 + X2 101
X ∗4 X + X2 011 X ∗4 1 + X + X2 111
X ∗5 1 + X + X2 111 X ∗5 1+X 110
X ∗6 1 + X2 101 X ∗6 X + X2 011

Figure 2.6

Example 2.5.35 (a) The field F2 [X]/"1 + X + X 2 # has four elements: 0, 1, X,


1 + X, with the multiplication table:
X ∗ X = 1 + X, as X 2 = 1 + X mod (1 + X + X 2 ),
X ∗ (1 + X) = X + X ∗ X = 1,
(1 + X) ∗ (1 + X) = 1 + X + X + X ∗ X = 1 + 1 + X = X.
Since X ∗3 = (1 + X) ∗ X = 1, the group is isomorphic to Z3 . An alternative notation
for this field is F4 .
(b) The fields F2 [X]/"1 + X + X 3 # and F2 [X]/"1 + X 2 + X 3 # contain eight elements
each, representing all polynomials of degree ≤ 2. Every such polynomial a0 +
a1 X + a2 X 2 is identified via the string of its coefficients a0 a1 a2 (a binary word).
The field tables are found by looking at the subsequent powers X ∗i : see Figure 2.6.
In both cases the multiplicative group of non-zero elements is Z7 . The two fields
are obviously isomorphic, as they share the common multiplicative cyclic group
formalism. The common notation for these fields is F8 . Note that the two field
tables coincide for the powers X ∗i with 0 ≤ i < 3; in fact, this is a general pattern:
see Sections 3.1–3.3.
Moreover, the element X = X ∗1 ∈ F2 [X]/"1 + X + X 3 # can be identified as a root
of the core polynomial 1+X +X 3 and element X = X ∗1 ∈ F2 [X]/"1+X 2 +X 3 # as a
root of 1 + X 2 + X 3 , as these polynomials yield zeros in their respective fields. The
remaining two roots are X ∗2 and X ∗4 (again calculated in their respective fields).
Applying this example to the Hamming [7, 4] code (cf. Example 2.5.29), the
field F2 [X]/"1 + X + X 3 # leads to the roots of the generator 1 + X + X 3 , and the
field F2 [X]/"1 + X 2 + X 3 # to those of 1 + X 2 + X 3 . That is, the Hamming [7, 4]
code is equivalent to the cyclic code of length 7 with the defining root ω = X in
either of the two isomorphic fields F2 [X]/"1 + X + X 3 # or F2 [X]/"1 + X 2 + X 3 #.
With some ambiguity (which will be removed in Section 3.1) we may say that this
code is defined by its root ω which is a primitive element of F8 .
234 Introduction to Coding Theory

coefficient
powers X ∗i polynomials
strings
−− 0 0000
X ∗0 1 1000
X X 0100
X ∗2 X2 0010
X ∗3 X3 0001
X ∗4 1+X 1100
X ∗5 X + X2 0110
X ∗6 X2 + X3 0011
X ∗7 1 + X + X3 1101
X ∗8 1 + X2 1010
X ∗9 X + X3 0101
X ∗10 1 + X + X2 1110
X ∗11 X + X2 + X3 0111
X ∗12 1 + X + X2 + X3 1111
X ∗13 1 + X2 + X3 1011
X ∗14 1 + X3 1001

Figure 2.7

(c) The field F2 [X]/"1 + X + X 4 # contains 16 elements. The field table is given in
Figure 2.7. In this case, the multiplicative group is Z15 , and the field can be denoted
by F16 . As above, element X ∈ F2 [X]/"1 + X + X 4 # yields a root of polynomial
1 + X + X 4 ; other roots are X ∗2 , X ∗4 and X ∗8 .
This example can be used to identify the Hamming [15, 11] code as (an equiva-
lent to) the cyclic code with generator g(X) = 1 + X + X 4 . We can now say that the
Hamming [15, 11] code is (modulo equivalence) the cyclic code of length 15 with
the defining root ω (= X) in field F2 [X]/"1 + X + X 4 #. As X is a generator of the
multiplicative group of the field, we again could say that the defining root ω is a
primitive element in F16 . 2

In general, take the field F2 [X]/"g(X)#, where g(X) = ∑ gi X i is an irreducible


0≤i≤d
X, X , X ∗4 , . . . , X ∗2
∗2 d−1
binary polynomial of degree m. Then the elements will sat-
isfy the equation
 ∗i
∑ gi X ∗s = 0, s = 1, 2, . . . , 2d−1 .
0≤i≤d

In other words, X, X ∗2 , . . . , X 2 are precisely the zeros, in field F2 [X]/"g(X)#, of


d−1

the irreducible core polynomial q.


Another feature emerging from Example 2.5.35 is that in all parts (a)–(c), ele-
ment X represented the root of the core polynomial g(X). However, this is not true
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 235

in general: it only happens when g(X) is a ‘primitive’ binary polynomial; for the
detailed discussion of this property see Sections 3.1–3.3. For a primitive core poly-
nomial g(X) we have, in addition, that the powers X i for i < d = deg g(X) coincide
with X ∗i , while further powers X ∗i , m ≤ i ≤ 2d − 1, are relatively easy to calculate.
With this in mind, we can pass to a general binary Hamming code.

Example 2.5.36 Let X H be the binary Hamming [2 − 1, 2 − 1 − ] code. We


know that its parity-check matrix H features all non-zero column-vectors of length
. These vectors, written in a particular order, list the consecutive powers ω ∗i , i =
0, 1, . . . , 2 − 2, in the field F2 [X]/"g(X)# where ω = X and g(X) = g0 + g1 X +
· · · + g X −1 + X  is a primitive polynomial of degree . Thus,
⎛ ⎞
1 0 ··· 0 g0 · · ·
⎜ 0 1 ··· 0 g1 · · · ⎟
⎜ ⎟

H = ⎜ ··· ··· ··· ··· ··· ··· ⎟ ⎟, (2.5.18)
⎝ 0 0 · · · 0 g−1 · · · ⎠
0 0 ··· 1 0 ···

or H ∼ (1 ω · · · ω ∗(−1) ω ∗ · · · ω ∗(2 −2) .


Hence, as before, the equation aH T = 0 for the codeword is equivalent to the


equation a(∗ω ) = 0 for the corresponding polynomial. So, we can say that a(X) ∈
X H iff ω is among the roots of a(X).
On the other hand, by construction, ω is a root of g(X): g(∗ω ) = 0. Thus, we
identify the Hamming [2 − 1, 2 − 1 − ] code as equivalent to the cyclic code of
length 2 − 1 with the generator polynomial g(X), with(the defining root ω). The
role of ω can be played by any conjugate element, from ω , ω ∗2 , . . . , ω ∗2
−1
.

The above idea leads to an immediate (and far-reaching) generalisation. Take


N = 2 − 1 and let ω be a primitive element of field F∗2  F2 [X]/"g(X)# where
g(X) is a primitive polynomial. (In all the examples and problems from this chapter,
this requirement is fulfilled.) Consider a defining set of roots, to start with, of the
form ω , ω 2 , ω 3 , but more generally, ω , ω 2 , . . . , ω (δ −1) . (Using parameter δ which
is an integer > 3 is a tradition here.) Consider the cyclic code with these roots:
what can we say about it? With the length N = 2 − 1, we can guess that it will
yield a subcode of the Hamming [2 − 1, 2 − 1 − ] code, and it may correct more
than a single error. This is the gist of the so-called (binary) BCH code construction
(Bose–Choudhury, Hocquenguem, 1959).
In this section we restrict ourselves to a brief introduction to the BCH codes;
in greater detail and generality these codes are discussed in Section 3.2. For
N = 2 − 1 field F2  F2 [X]/"g(X)# has the property that its non-zero elements
are the Nth roots of unity (i.e. the zeros of the polynomial 1 + X N ). In other words,
236 Introduction to Coding Theory

polynomial 1 + X N factorises into the product of linear factors ∏ (X − ω j )


1≤ j≤N
where all ω j list the whole of F∗2 . (In the terminology of Section 3.1, F2 is the
splitting field for 1 + X N over F2 .) As before, we use the notation ω := X for the
generator of the multiplicative cyclic group F∗2 . (In fact, it could be any generator
of this group.)
Because ω N = 1 and the power N is minimal with this property, the element
ω is often called a primitive Nth root of unity. Consequently, the powers ω k for
0 ≤ k < N yield distinct elements of the field. This fact is used below when we
conclude that the product

∏ ω kj
− ω ki
= 0
1≤i< j≤δ −1

for every collection of powers ω k1 , . . . , ω kδ −1 . (Such a collection extracts a


(δ − 1) × (δ − 1) submatrix from the (δ − 1) × N parity-check matrix in the proof
of Theorem 2.5.39.)
Definition 2.5.37 Given N = 2 − 1 and δ = 3, . . . N, define a narrow-sense bi-
δ of length N and with designed distance δ as the cyclic code
nary BCH code XN,BCH
formed by binary polynomials a(X) of degree < N such that
   
a(ω ) = a ω 2 = · · · = a ω (δ −1) = 0. (2.5.19)
In other words, XN,BCH
δ is the cyclic code of length N whose generator g(X) is the
minimal binary polynomial with roots including ω , ω 2 , . . . , ω (δ −1) :
( )
g(X) = lcm (X − ω ), . . . , (X − ω (δ −1) )
= lcm {Mω (X), . . . , Mω (δ −1) (X)} . (2.5.20)
Here lcm stands for the least common multiple and Mα (X) denotes the minimal
binary polynomial with root α . For brevity, we will use in this chapter the term
binary BCH codes. (A more general class of BCH codes will be introduced in
Section 3.2.)
Example 2.5.38 For N = 7, the appropriate polynomial field is F2 [X]/"1 + X +
X 3 # or F2 [X]/"1 + X 2 + X 3 #, i.e. one of two realisations of field F8 . Since 7 is a
prime number, any non-zero polynomial from the field has the multiplicative order
7, i.e. is a generator of the multiplicative group in F2 [X]/"1 + X 2 + X 3 #. In fact, we
have the decomposition of polynomial 1 + X 7 into irreducible factors:
1 + X 7 = (1 + X)(1 + X + X 3 )(1 + X 2 + X 3 ).
Further, if we choose polynomial field F2 [X]/"1 + X + X 3 # then ω = X satisfies
 3  3
ω 3 = 1 + ω , ω 2 = 1 + ω 2, ω 4 = 1 + ω 4,
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 237

i.e. the conjugates ω , ω 2 and ω 4 are the roots of the core polynomial 1 + X + X 3 :
  
1 + X + X 3 = (X − ω ) X − ω 2 X − ω 4 .

Next, ω 3 , ω 6 and ω 12 = ω 5 are the roots of 1 + X 2 + X 3 :


   
1 + X2 + X3 = X − ω3 X − ω5 X − ω6 .

Hence, the binary BCH code of length 7 with designed distance 3 is formed by
binary polynomials a(X) of degree ≤ 6 such that

a(ω ) = a(ω 2 ) = 0, that is, a(X) is a multiple of 1 + X + X 3 .

This code is equivalent to the Hamming [4, 7] code; in particular its ‘true’ distance
equals 3.
Next, the binary BCH code of length 7 with designed distance 4 is formed by
binary polynomials a(X) of degree ≤ 6 such that
a(ω ) = a(ω 2 ) = a(ω 3 ) = 0, that is, a(X) is a multiple of
(1 + X + X 3 )(1 + X 2 + X 3 ) = 1 + X + X 2 + X 3 + X 4 + X 5 + X 6 .
This is simply the repetition code R7 .
The staple of the theory of the BCH codes is
Theorem 2.5.39 (The BCH bound) The minimal distance of a binary BCH code
with designed distance δ is ≥ δ .
The proof of Theorem 2.5.39 (sometimes referred to as the BCH theorem) is
based on the following result.
Lemma 2.5.40 Consider the m × m Vandermonde determinant with entries from
a commutative ring:
⎛ ⎞ ⎛ ⎞
α1 α2 . . . αm α1 α12 . . . α1m
⎜α2 α2 . . . α2 ⎟ ⎜ α2 α 2 . . . α m ⎟
⎜ 1 2 m⎟ ⎜ 2 2⎟
det ⎜ . . .. . ⎟ = det ⎜ .. . .. . ⎟. (2.5.21)
⎝ . . .
. . . ⎠
. ⎝ . .
. . . ⎠
.
α1m α2m . . . αmm αm αm2 . . . αmm
The value of this determinant is

∏ αl × ∏ (αi − α j ). (2.5.22)
1≤l≤m 1≤i< j≤m

Proofof Lemma 2.5.40 Both determinants in (2.5.21) are polynomial expressions


in α1 , . . . , αm . If α = α j for i < j then the determinant has repeated rows (columns),
and hence vanishes (as in the standard arithmetic). Hence, the determinant divides
238 Introduction to Coding Theory

the product ∏ (αi − α j ). Next, we compare the powers of αi in (2.5.21) and


1≤i< j≤m
(2.5.22): this immediately leads to the assertion of Lemma 2.5.40.

Proofof Theorem 2.5.39 Let the polynomial a(X) ∈ X . Then a(∗ω ∗ j ) = 0 for all
j = 1, . . . , δ − 1. That is,
⎛ ⎞⎛ ⎞
1 ω ω ∗2 ... ω ∗(N−1) a0
⎜1 ω ∗2 ω ∗4 ω ∗2(N−1) ⎟ ⎜ ⎟
⎜ ... ⎟ ⎜ a1 ⎟
⎜ .. .. .. .. ⎟ ⎜ . ⎟ = 0.
⎝. . . . ⎠ ⎝ .. ⎠
1 ω ∗(δ −1) ω ∗2(δ −1) . . . ω ∗(N−1)(δ −1) aN−1
Due to Lemma 2.5.40, any (δ −1) columns of this ((δ −1)×N) matrix are linearly
independent. Hence, there must be at least δ non-zero coefficients in a(X). Thus,
the distance of X is ≥ δ .

Example 2.5.41 (Here a mistake in [18], p. 106, is corrected.) Consider a BCH


code with N = 15 and δ = 5. Use the following decomposition into irreducible
polynomials:

X 15 − 1 = (X + 1)(X 2 + X + 1)(X 4 + X + 1)(X 4 + X 3 + 1)


× (X 4 + X 3 + X 2 + X + 1).

The generator of the code is

g(x) = (X 4 + X + 1)(X 4 + X 3 + X 2 + X + 1) = X 8 + X 7 + X 6 + X 4 + 1.

Indeed, g(ω 3 ) = g(ω 9 ) = 0. The set of zeros of X 4 + X 3 + X 2 + X + 1 is


(ω 3 , ω 9 , ω 12 , ω 9 ). The set of zeros of X 4 + X + 1 is (ω , ω 2 , ω 4 , ω 8 ). The set of
zeros of X 4 + X 3 + 1 is (ω 7 , ω 14 , ω 13 , ω 11 ). The set of zeros of X 2 + X + 1 is
(ω 5 , ω 10 ).
(b) Let N = 31 and ω be a primitive element of F32 . The minimal polynomial with
root ω is

Mω (X) = (X − ω )(X − ω 2 )(X − ω 4 )(X − ω 8 )(X − ω 16 ).

We find also the minimal polynomial for ω 5 :

Mω 5 (X) = (X − ω 5 )(X − ω 10 )(X − ω 20 )(X − ω 9 )(X − ω 18 ).

By definition, the generator of the BCH code of length 31 with a designed dis-
tance δ = 8 is g(X) = lcm(Mω (X), Mω 3 (X), Mω 5 (X), Mω 7 (X)). In fact, the mini-
mal distance of the BCH code (which is, obviously, at least 9) is in fact at least 11.
This follows from Theorem 2.5.39 because all the powers ω , ω 2 , . . . , ω 10 are listed
among the roots of g(X).
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 239

There exists a decoding procedure for a BCH code which is simple to imple-
ment: it generalises the Hamming code decoding procedure. In view Aof Theorem B
δ −1
2.5.39, the BCH code with designed distance δ corrects at least t = er-
2
rors. Suppose a codeword c = c0 . . . cN−1 has been sent and corrupted to r = c + e
where e = e0 . . . eN−1 . Assume that e has at most t non-zero entries. Introduce the
corresponding polynomials c(X), r(X) and e(X), all of degrees < N. For c(X) we
have that c(ω ) = c(ω 2 ) = · · · = c(ω (δ −1) ) = 0. Then, clearly,
   
r(ω ) = e(ω ), r ω 2 = e ω 2 , . . . , r ω (δ −1) = e ω (δ −1) . (2.5.23)

So, we calculate r(ω i ) for i = 1, . . . , δ − 1. If these are all 0, r(X) ∈ X (no error
or at least t + 1 errors). Otherwise, let E = {i : ei = 1} indicate the erroneous digits
and assume that 0 <  E ≤ t. Introduce the error locator polynomial
σ (X) = ∏(1 − ω i X), (2.5.24)
i∈E

with binary coefficients, of degree  E and with the lowest coefficient 1. If we know
σ (X), we can find which powers ω −i are its roots and hence find the erroneous
digits i ∈ E. We then simply change these digits and correct the errors.
In order to calculate σ (X), consider the formal power series
 
ζ (X) = ∑ e ω j X j .
j≥1

(Observe that, as ω N = 1, the coefficients of this power series recur.) For the initial
(δ − 1) coefficients, we have equalities, by virtue of (2.5.23):
   
e ω j = r ω j , j = 1, . . . , δ − 1;
these are the only ones needed for our purpose, and they are calculated in terms of
the received word r.
Now set
ω (X) = ∑ ω i X ∏ (1 − ω j X). (2.5.25)
i∈E j∈E: j =i

Next, rewrite the above formal series as


ω iX ω (X)
ζ (X) = ∑ ∑ ω i j X j = ∑ ∑ ω i j X j = ∑ 1 − ω i X = σ (X) . (2.5.26)
j≥1 i∈E i∈E j≥1 i∈E

Observe that both polynomials ω (X) and σ (X) are of degree  E ≤ t.


Now, the equation ζ (X)σ (X) = ω (X) from (2.5.26) can be written in terms of
the coefficients, with the help of the fact that
   
e ω j = r ω j , j = 1, . . . , 2t;
240 Introduction to Coding Theory

namely,

σ0 + σ1 X + · · · + σt X t
   
× r(ω )X + + · · · + r ω 2t X 2t + e ω (2t+1) X 2t+1 + · · · (2.5.27)
= ω0 + ω1 X + · · · + ωt X t .

We are interested in the coefficients of X k for t < k ≤ 2t: these satisfy



∑ σ j r ω (k− j) = 0, (2.5.28)
0≤ j≤t
 
which does not involve any of the terms e ω l . We obtain the following equations:
⎛  (t+1)  ⎞⎛ ⎞
r ω  r (ω t)
 (t+1)  . . . r( ω ) σ0
⎜r ω (t+2) r ω 2 ⎟
. . . r(ω )⎟ ⎜σ1 ⎟

⎜ ⎟
⎜ .. .. .. .. ⎟ ⎜ .. ⎟ = 0.
⎝ . ⎠ ⎝ .⎠
 .(2t)  .
 (2t−1)  .
r ω r ω . . . r(ω ) t σt
The above matrix is t × (t + 1), so it always has a non-zero vector in the kernel;
this vector identifies the error locator polynomial σ (X). We see that the above
routine (called the Berlekamp–Massey decoding algorithm) enables us to specify
the set E and hence correct ≤ t errors.
Unfortunately, the BCH codes are asymptotically ‘bad’: for any sequence of
BCH codes of length N → ∞, either k/N or d/N → 0. In other words, they lie
at the bottom of Figure 2.2. To obtain codes that meet the Gilbert–Varshamov
(GV) bound, one needs more powerful methods, based on algebraic geometry. Such
codes were constructed in the early 1970s (the Goppa and Justesen codes). It re-
mains a problem to construct codes that lie above the Gilbert–Varshamov curve.
As was mentioned just before on page 160, a new class of codes was invented in
1982 by Tsfasman, Vlǎdut and Zink; these codes lie above the GV curve when the
number of symbols in the code alphabet is large. However, for binary codes, the
problem is still waiting for solution.
Worked Example 2.5.42 Compute the rank and minimum distance of the cyclic
code with generator polynomial g(X) = X 3 + X + 1 and parity-check polynomial
h(X) = X 4 + X 2 + X + 1. Now let ω be a root of g(X) in the field F8 . We receive
the word r(X) = X 5 + X 3 + X (mod X 7 − 1). Verify that r(ω ) = ω 4 , and hence
decode r(X) using minimum-distance decoding.

Solution A cyclic code X of length N has generator polynomial g(X) ∈ F2 [X]


and parity-check polynomial h(X) ∈ F2 [X] with g(X)h(X) = 1 + X N . Recall that
if g(X) has degree k, i.e. g(X) = a0 + a1 X + · · · + ak X k where ak = 0, then g(X),
2.5 Cyclic codes and polynomial algebra. Introduction to BCH codes 241

Xg(X), . . . , X N−k−1 g(X) form a basis for X . In particular, the rank of X equals
N − k. In this example, N = 7, k = 3 and rank(X ) = 4.
If h(X) = b0 + b1 X + · · · + bN−k X N−k then the parity-check matrix H for X has
the form
⎛ ⎞
bN−k bN−k−1 ... b1 b0 0 ... 0 0
⎜ 0 bN−k bN−k−1 . . . b1 b0 ... 0 0 ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ . . . . . . ⎟.
⎜ 0 .. .. .. .. .. .. ⎟
⎜ ⎟
⎝ ⎠
0 0 ... 0 bN−k bN−k−1 . . . b1 b0
: ;< =
N

The codewords of X are linear dependence relations between the columns of H.


In the example,
⎛ ⎞
1 0 1 1 1 0 0
H = ⎝0 1 0 1 1 1 0 ⎠
0 0 1 0 1 1 1

and we have the following implications:

no zero column ⇒ no codewords of weight 1,


no repeated column ⇒ no codewords of weight 2.

The minimum distance d(X ) of a linear code X is the minimum non-zero weight
of a codeword. In the example, d(X ) = 3. [In fact, X is equivalent to the Ham-
ming [7, 4] code.]
>
Since g(X) ∈ F2 [X] is irreducible, the code X ∈ F2 [X] "X 7 − 1# is the cyclic
code defined by ω . The multiplicative cyclic group Z×
7 of non-zero elements of
field F8 is

ω 0 = 1, ω , ω 2 , ω 3 = ω + 1, ω 4 = ω 2 + ω ,
ω 5 = ω 3 + ω 2 = ω 2 + ω + 1, ω 6 = ω 3 + ω 2 + ω = ω 2 + 1,
ω 7 = ω 3 + ω = 1.

Next, the value r(ω ) is

r(ω ) = ω + ω 3 + ω 5
= ω + (ω + 1) + (ω 2 + ω + 1)
= ω 2 + ω = ω 4,
242 Introduction to Coding Theory

as required. Let c(X) = r(X) + X 4 mod (X 7 − 1). Then c(ω ) = 0, i.e. c(X) is a
codeword. Since d(X ) = 3 the code is 1-error correcting. We just found a code-
word c(X) at distance 1 from r(X). Then r(X) = X +X 3 +X 5 should be decoded by

c(X) = X + X 3 + X 4 + X 5 mod (X 7 − 1)

under minimum-distance decoding.

We conclude this section with two useful statements.


Worked Example 2.5.43 (The Euclid algorithm for polynomials) The Euclid al-
gorithm is a method for computing the greatest common divisor of two polynomi-
als, f (X) and g(X), over the same finite field F. Assume that deg g(X) ≤ deg f (X)
and set f (X) = r−1 (X), g(X) = r0 (X). Then

(1) divide r−1 (X) by r0 (X):

r−1 (X) = q1 (X)r0 (X) + r1 (X) where deg r1 (X) < deg r0 (X),

(2) divide r0 (X) by r1 (X):

r0 (X) = q2 (X)r1 (X) + r2 (X) where deg r1 (X) < deg r1 (X),
.
..
(k) divide rk−1 (X) by rk−1 (X):

rk−2 (X) = qk (X)rk−1 (X) + rk (X) where deg rk (X) < deg Rk−1 (X),

...

The algorithm continues until the remainder is 0:

(s) divide rs−2 (X) by rs−1 (X):

rs−2 (X) = qs (X)rs−1 (X).

Then
 
gcd f (X), g(X) = rs−1 (X). (2.5.29)

At each stage, the equation for the current remainder rk (X) involves two previous
remainders. Hence, all remainders, including gcd( f (X), g(X)), can be written in
terms of f (X) and g(X). In fact,
Lemma 2.5.44 The remainders rk (X) in the Euclid algorithm satisfy

rk (X) = ak (X) f (X) + bk (X)g(X), k ≤ −1,


2.6 Additional problems for Chapter 2 243

where
a−1 (X) = b−1 (X) = 0,
a0 (X) = 0, b0 (X) = 1,
ak (X) = −qk (X)ak−1 (X) + ak−2 (X), k ≥ 1,
bk (X) = −qk (X)bk−1 (X) + bk−2 (X), k ≥ 1.
In particular, there exist polynomials a(X), b(X) such that
gcd ( f (X), g(X)) = a(X) f (X) + b(X)g(X).
Furthermore:
(1) deg ak (X) = ∑ deg qi (X), deg bk (X) = ∑ deg qk (X).
2≤i≤k 1≤i≤k
(2) deg rk (X) = deg f (X) − ∑ deg qk (X).
1≤i≤k+1
(3) deg bk (X) = deg f (X) − deg rk−1 (X).
(4) ak (X)bk+1 (X) − ak+ (X)bk (X) = (−1)k+1 .
(5) ak (X) and bk (X) are co-prime.
(6) rk (X)bk+1 (X) − rk+1 (X)bk (X) = (−1)k+1 f (X).
(7) rk+1 (X)ak (X) − rk (X)ak+1 (X) = (−1)k+1 g(X).
Proof The proof is left as an exercise.

2.6 Additional problems for Chapter 2


Problem 2.1 A check polynomial h(X) of a binary cyclic code X of length N
is defined by the condition a(X) ∈ X if and only if a(X)h(X) = 0 mod (1 + X N ).
How is the check polynomial related to the generator of X ? Given h(X), construct
the parity-check matrix and interpret the cosets X + y of X .
Describe all cyclic codes of length 16 and 15. Find the generators and the check
polynomials of the repetition and parity-check codes. Find the generator and the
check polynomial of Hamming code of length 7.

Solution All cyclic codes of length 16 are divisors of 1 + X 16 = (1 + X)16 , i.e. are
generated by g(X) = (1 + X)k where k = 0, 1, . . . , 16. Here k = 0 gives the whole
{0, 1}16 , k = 1 the parity-check code, k = 15 the repetition code {00 . . . 0, 11 . . . 1}
and k = 16 the zero code. For N = 15, the decomposition into irreducible polyno-
mials looks as follows:
1 + X 15 = (1 + X)(1 + X + X 2 )(1 + X + X 4 )(1 + X 3 + X 4 )
×(1 + X + X 2 + X 3 + X 4 ).
Any product of the listed irreducible polynomials generates a cyclic code.
244 Introduction to Coding Theory

In general, 1 + X N = (1 + X)(1 + X + · · · + X N−1 ); g(X) = 1 + X generates the


parity-check code and g(X) = 1 + X + · · · + X N−1 the repetition code. In the case
of a Hamming [7, 4] code, the generator is g(X) = 1 + X + X 3 , by inspection.

The check polynomial h(X) equals the ratio (1 + X N ) g(X). In fact, for all
a(X) ∈ X , a(X)h(X) = v(X)g(X)h(X) = v(X)(1 + X N ) = 0 mod (1 + X N ). Con-
versely, if a(X)h(X) = v(X)(1 + X N ) then a(X) must be of the form v(X)g(X), by
the uniqueness of the irreducible decomposition.
The cosets y + X are in a one-to-one correspondence with the remainders
y(X) = u(X) mod g(X). In other words, two words y(1) , y(2) belong to the same
coset iff, in the division algorithm representation,

y(i) (X) = vi (X)g(X) + u(i) (X), i = 1, 2, where u(1) (X) = u(2) (X).

In fact, y(1) and y(2) belong to the same coset iff y(1) + y(2) ∈ X . This is equivalent
to u(1) (X) + u(2) (X) = 0, i.e. u(1) (X) = u(2) (X).
k
If we write h(X) = ∑ h j X j , then the dot-product
j=0

'
i 1, i = 0, N,
∑ g j hi− j = 0, 1 ≤ i < N.
j=0

So, "g(X) · h⊥ (X)# = 0 where h⊥ (X) = hk + hk−1 X + · · · + h0 X k . Therefore, the


parity-check matrix H for X is formed by rows that are cyclic shifts of h =
hk hk−1 · · · h0 0 · · · 0 . The check polynomials for the repetition and parity-check
codes then are 1 + X and 1 + X + · · · + X N−1 , and they are dual of each other.
The check polynomial for the Hamming [7, 4] code equals 1 + X + X 2 + X 4 , by
inspection.

Problem 2.2 (a) Prove the Hamming and Gilbert–Varshamov bounds on the
size of a binary [N, d] code in terms of vN (d), the volume of an N -dimensional
Hamming ball of radius d .
Suppose that the minimum distance is λ N for some fixed λ ∈ (0, 1/4). Let
α (N, λ N) be the largest information rate of any binary code correcting λ N
errors. Show that

1 − η (λ ) ≤ lim inf α (N, λ N) ≤ lim sup α (N, λ N) ≤ 1 − η (λ /2). (2.6.1)
N→∞ N→∞

(b) Fix R ∈ (0, 1) and suppose we want to send one of a collection UN of messages
of length N , where the size UN = 2NR . The message is transmitted through an
2.6 Additional problems for Chapter 2 245

MBSC with error-probability p < 1/2, so that we expect about pN errors. Accord-
ing to the asymptotic bound of part (a), for which values of p can we correct pN
errors, for large N ?

Solution (a) A code X ⊂ FN2 is said to be E-error correcting if B(x, E) ∩ B(y, E) =


0/ for all x, y ∈ X with x = y. The Hamming bound for a code of size M, distance
d −1
d, correcting E =   errors is as follows. The balls of radius E about the
2
codewords are disjoint: their total volume equals M × vN (E). But their union lies
inside FN2 , so M ≤ 2N /vN (E).
On the other hand, take an E-correcting code X ∗ of maximum size  X . Then
there will be no word
y ∈ FN2 \ ∪x∈X ∗ B(x, 2E + 1)

or we could add such a word to X ∗ , increasing the size but preserving the error-
correcting property. Since every word y ∈ FN2 is less than d − 1 from a codeword,
we can add y to the code. Hence, balls of radius d − 1 cover the whole of FN2 , i.e.
M × vN (d − 1) ≥ 2N , or

M ≥ 2N /vN (d − 1) (the Varshamov–Gilbert bound).


 
Combining these bounds yields, for α (N, E) = log  X N:
log vN (2E + 1) log vN (E)
1− ≤ α (N, E) ≤ 1 − .
N N
Observe that for any s < κ N with 0 < κ < 1/2
     
N s N κ N
= < .
s−1 N −s+1 s 1−κ s
Consequently,
    E  j
N N κ
E
≤ vN (E) ≤ ∑ 1−κ .
E j=0

Now, by the Stirling formula as N, E → ∞ and E/N → λ ∈ (0, 1/4)


 
1 N
log → η (λ /2).
N E

So, we proved that limN→∞ N1 log vN ([λ N]) = η (λ ), and


1 1
1 − η (λ ) ≤ lim inf log M ≤ lim sup log M ≤ 1 − η (λ /2).
N→∞ N N→∞ N
246 Introduction to Coding Theory
A B
d −1
(b) We can correct pN errors if the minimum distance d satisfies ≥ pN,
2
i.e. λ /2 ≥ p. Using the asymptotic Hamming bound we obtain R ≤ 1 − η (λ /2) ≤
1 − η (p). So, the reliable transmission is possible if p ≤ η −1 (1 − R),
The Shannon SCT states:

capacity C of a memoryless channel = sup I(X : Y ).


pX

Here I(X : Y ) = h(Y ) − h(Y |X) is the mutual entropy between the single-letter
random input and output of the channel, maximised over all distributions of the
input letter X. For an MBSC with the error-probability p, the conditional entropy
h(Y |X) equals η (p). Then

C = sup h(Y ) − η (p).


pX

But h(Y ) attains its maximum 1, by using the equidistributed input X (then Y is also
equidistributed). Hence, for the MBSC, C = 1− η (p). So, a reliable transmission is
possible via MBSC with R ≤ 1 − η (p), i.e. p ≤ η −1 (1 − R). These two arguments
lead to the same answer.

Problem 2.3 Prove that the binary code of length 23 generated by the polynomial
g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 has minimum distance 7, and is perfect.
Hint: Observe that by the BCH bound (see Theorem 2.5.39) if a generator polyno-
mial of a cyclic code has roots {ω , ω 2 , . . . , ω δ −1 } then the code has distance ≥ δ ,
and check that X 23 + 1 ≡ (X + 1)g(X)grev (X) mod 2, where grev (X) = X 11 g(1/X)
is the reversal of g(X).

Solution First, show that the code is BCH, of designed distance 5. Recall that if
ω is a root of a polynomial p(X) ∈ F2 [X] then so is ω 2 . Thus, if ω is a root of
g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 then so are ω 2 , ω 4 , ω 8 , ω 16 , ω 9 , ω 18 ,
ω 13 , ω 3 , ω 6 , ω 12 . This yields the design sequence {ω , ω 2 , ω 3 , ω 4 }. By the BCH
theorem, the code X = "g(X)# has distance ≥ 5.
Next, the parity-check extension, X + , is self-orthogonal. To check this, we need
only to show that any two rows of the generating matrix of X + are orthogonal.
These are represented by

(X i g(X)|1) and (X j g(X)|1)


2.6 Additional problems for Chapter 2 247

and their dot-product is


1 + (X i g(X))(X j g(X)) = 1 + ∑ gi+r g j+r = 1 + ∑ gi+r grev
11− j−r
r r
= 1 + coefficient of X 11+i− j in g(X) × grev (X)
: ;< =
||
1 + · · · + X 22
= 1 + 1 = 0.
So,
any two words in X + are dot-orthogonal. (2.6.2)
This implies that all words in X + have weight divisible by 4. Indeed, by in-
spection, all rows (X i g(X)|1) of the generating matrix of X + have weight 8.
Then, by induction on the number of rows involved in the sum, if c ∈ X + and
g(i) ∼ (X i g(X)|1) is a row of the generating matrix of X + then
     
w g(i) + c = w g(i) + w(c) − 2w g(i) ∧ c ,
   
 
where g(i) ∧ c l = min g(i) l , cl , l = 1, . . . , 24. We know that 8|w g(i) and by
 (i)   (i) 
the induction hypothesis, 4|w(c).
 (i) Next,
 w g ∧ c is even, so 2w g ∧ c is divis-
ible by 4. Then the LHS, w g + c , is divisible by 4. Therefore, the distance of
X + is 8, as it is ≥ 5 and is divisible by 4. (Clearly, it cannot be bigger than 8 as
then it would be 12.) Then the distance of the original code, X , equals 7.
Finally, the code X is perfect 3-error correcting, since the volume of the 3-ball
in F23
2 equals
       
23 23 23 23
+ + + = 1 + 23 + 253 + 1771 = 2048 = 211 ,
0 1 2 3
and 212 × 211 = 223 . Here, obviously, 12 represents the rank and 23 the length.
Problem 2.4 Show that the Hamming code is cyclic with check polynomial
4 2
X + X + X + 1. What is its generator polynomial? Does Hamming’s original code
contain a subcode equivalent to its dual? Let the decomposition into irreducible
monic polynomials M j (X) be
l
X N + 1 = ∏ M j (X)k j . (2.6.3)
j=1

Prove that the number of cyclic code of length N is ∏lj=1 (k j + 1).

Solution In F72 we have


X 7 − 1 = (X 3 + X + 1)(X 4 + X 2 + X + 1).
248 Introduction to Coding Theory

The cyclic code with generator g(X) = X 3 + X + 1 has check polynomial h(X) =
X 4 + X 2 + X + 1. The parity-check matrix of the code is
⎛ ⎞
1 0 1 1 1 0 0
⎝ 0 1 0 1 1 1 0⎠ . (2.6.4)
0 0 1 0 1 1 1

The columns of this matrix are the non-zero elements of F32 . So, this is equivalent
to Hamming’s original [7, 4] code.
The dual of Hamming’s [7, 4] code has the generator polynomial X 4 + X 3 + X 2 + 1
(the reverse of h(X)). Since X 4 + X 3 + X 2 + 1 = (X + 1)g(X), it is a subcode of
Hamming’s [7, 4] code.
Finally, any irreducible polynomial M j (X) could be included in a generator of a
cyclic code in any power 0, . . . , k j . So, the number of possibilities to construct this
generator equals ∏lj=1 (k j + 1).

Problem 2.5 Describe the construction of a Reed–Muller code. Establish its


information rate and its distance.

Solution The space Fm 2 has N = 2 points. If A ⊆ F2 , let 1A be the indicator


m m

function of A. Consider the collection of hyperplanes

Π j = {p ∈ Fm
2 : p j = 0}.

Set h j = 1Π j , j = 1, . . . , m, and h0 = 1Fm2 ≡ 1. Define sets of functions Fm


2 → F2 :

A0 = {h0 },
A1 = {h j ; j = 1, 2, . . . , m},
A2 = {hi · h j ; i, j = 1, 2, . . . , m, i < j},
..
.
Ak+1 = {a · h j ; a ∈ Ak , j = 1, 2, . . . , m, h j |a},
..
.
Am = {h1 · · · hm }.
The union of these sets has cardinality N = 2m (there are 2m functions altogether).
Therefore, functions from ∪m i=0 Ai can be taken as a basis in F2 .
N

Then the Reed–Muller code RM(r, m) = Xr,m RM of length N = 2m is defined as


 
r m
the span of ∪ri=0 Ai and has rank ∑ . Its information rate is
i=0 i
 
1 r m
∑ i .
2m i=0
2.6 Additional problems for Chapter 2 249

Next, if a ∈ RM(r, m) then

a = (y, y)h j + (x, x) = (x, x + y),

for some x ∈ RM(m − 1, r) and y ∈ RM(m − 1, r − 1). Thus, RM(m, r) coincides


with the bar-product (R(m − 1, r)|R(m − 1, r − 1)). By the bar-product bound,




d RM(m, k) ≥ min 2d RM(m − 1, k) , d RM(m − 1, k − 1) ,

which, by induction, yields


d RM(r, m) ≥ 2m−r .

On the other hand, the vector h1 · h2 · · · · · hm is at distance 2m−r from RM(m, r).
Hence,

d RM(r, m) = 2m−r .

Problem 2.6 (a) Define a parity-check code of length N over the field F2 . Show
that a code is linear iff it is a parity-check code. Define the original Hamming code
in terms of parity-checks and then find a generating matrix for it.
(b) Let X be a cyclic code. Define the dual code
N
X ⊥ = {y = y1 . . . yN : ∑ xi yi = 0 for all x = x1 . . . xN ∈ X }.
i=1

Prove that X ⊥ is cyclic and establish how the generators of X and X ⊥ are re-
lated to each other. Show that the repetition and parity-check codes are cyclic, and
determine their generators.

Solution (a) The parity-check code X PC of a (not necessarily linear) code X is


the collection of vectors y = y1 . . . yN ∈ FN2 such that the dot-product
N
y · x = ∑ xi yi = 0 (in F2 ), for all x = x1 . . . xN ∈ X .
i=1

From the definition it is clear that X PC is also the parity-check code for X , the
PC
linear code spanned by X : X PC = X . Indeed, if y · x = 0 and y · x = 0 then
y · (x + x ) = 0. Hence, the parity-check code X PC is always linear, and it forms
a subspace dot-orthogonal to X . Thus, a given code X is linear iff it is a parity-
check code. A pair of linear codes X and X PC form a dual pair: X PC is the dual
of X and vice versa. The generating matrix H for X PC serves as a parity-check
matrix for X and vice versa.
250 Introduction to Coding Theory

The Hamming code of length N = 2l − 1 is the one whose check matrix is l × N


and lists all non-zero columns from Fl2 (in some agreed order). So, the Hamming
[7, 4] code corresponds to l = 3; its parity-checks are
x1 + x3 + x5 + x7 = 0,
x2 + x3 + x6 + x7 = 0,
x4 + x5 + x6 + x7 = 0,
and the generating matrix equals
⎛ ⎞
1 1 0 1 0 0 0
⎜0 1 1 0 1 0 0⎟
⎜ ⎟.
⎝0 0 1 1 0 1 0⎠
0 0 0 1 1 0 1

(b) The generator of dual code g⊥ (X) = X N−1 g(X −1 ). The repetition code has
g(X) = 1 + X + · · · + X N−1 and the rank 1. The parity-check code has g(X) = 1 + X
and the rank N − 1.
Problem 2.7 (a) How does coding theory apply when the error rate p > 1/2?
(b) Give an example of a code which is not a linear code.
(c) Give an example of a linear code which is not a cyclic code.
(d) Define the binary Hamming code and its dual. Prove that the Hamming code is
perfect. Explain why the Hamming code cannot always correct two errors.
(e) Prove that in the dual code:
(i) The weight of any non-zero codeword equals 2−1 .
(ii) The distance between any pair of words equals 2−1 .

Solution (a) If p > 1/2, we reverse the output to get p = 1 − p.


(b) The code X ⊂ F22 with X = {11} is not linear as 00 ∈ X .
(c) The code X ⊂ F22 with X = {00, 10} is linear, but not cyclic, as 01 ∈ X .
(d) The original Hamming [7, 4] code has distance 3 and is perfect one-error
correcting. Thus, making two errors in a codeword will always lead outside the
ball of radius 1 about the codeword, i.e. to a ball of radius 1 about a different
codeword (at distance 1 of the nearest, at distance 2 from the initial word). Thus,
one detects two errors but never corrects them.
(e) The dual of a Hamming [2 − 1, 2 −  − 1, 3] code is linear, of length N = 2 − 1
and rank , and its generating matrix is  × (2 − 1), with columns listing all non-
zero vectors of length  (the parity-check matrix of the original code). The rows of
this matrix are linearly independent; moreover, any row i = 1, . . . ,  has 2−1 digits
1. This is because each such digit comes from a column, i.e. a non-zero vector of
length , with 1 in position i; there are exactly 2−1 such vectors. Also any pair of
2.6 Additional problems for Chapter 2 251

columns of this matrix are linearly independent, but there are triples of columns
that are linearly dependent (a pair of columns complemented by their sum).
Every non-zero dual codeword x is a sum of rows of the above generating matrix.
Suppose these summands are rows i1 , . . . , is where 1 ≤ i1 < · · · < is ≤ . Then, as
above, the number of digits 1 in the sum equals the number of columns of this
matrix for which the sum of digits i1 , . . . , is is 1. We have no restriction on the
remaining  − s digits, so for them there are 2−s possibilities. For digits i1 , . . . , is
we have 2s−1 possibilities (a half of the total of 2s ). Thus, again 2−s × 2s−1 = 2−1 .
We proved that the weight of every non-zero dual codeword equals 2−1 . That is,
the distance from the zero vector to any dual codeword is 2−1 . Because the dual
code is linear, the distance between any pair of distinct dual codewords x, x equals
2−1 :
δ (x, x ) = δ (0, x − x) = w(x − x ) = 2−1 .

Let J ⊂ {1, . . . , } be the set of contributing rows:

x = ∑ g(i) .
i∈J

Then δ (0, x) =  of non-zero digits in x is calculated as



2−|J| ×  of subsets K ⊆ J with |K| odd
↑ ↑
 of ways to place  of ways to get ∑ xi = 1 mod 2
l∈J
0s and 1s outside J with xi = 0 or 1

which yields 2−|J| 2|J|−1 = 2−1 . In other words, to get a contribution from a digit
(i)
x j = ∑ g j = 1, we must fix (i) a configuration of 0s and 1s over {1, . . . , } \ J (as it
i∈J
is a part of the description of a non-zero vector of length N), and (ii) a configuration
of 0s and 1s over J, with an odd number of 1s.

To check that d X H = 2−1 , it suffices to establish that the distance between

the zero word and any other word x ∈ X H equals 2−1 .

Problem 2.8 (a) What is a necessary and sufficient condition for a polynomial
g(X) to be the generator of a cyclic code of length N ? What is the BCH code?
Show that the BCH code associated with {ω , ω 2 }, where ω is a root of X 3 + X + 1
in an appropriate field, is Hamming’s original code.
(b) Define and evaluate the Vandermonde determinant. Define the BCH code and
obtain a good estimate for its minimum distance.
252 Introduction to Coding Theory

Solution (a) The necessary and sufficient condition for g(X) being the generator of
a cyclic code of length N is g(X)|(X N − 1). The generator g(X) may be irreducible
or not; in the latter case it is represented as a product g(X) = M1 (X) · · · Mk (X)
of its irreducible factors, with k ≤ d = deg g. Let s be the minimal number such
that N|2s − 1. Then g(X) is factorised into the product of first-degree monomials
d
in a field K = F2s ⊇ F2 : g(X) = ∏ (X − ω j ) with ω1 , . . . , ωd ∈ K. [Usually one
i=1
refers to the minimal field – the splitting field for g, but this is not necessary.] Each
element ωi is a root of g(X) and also a root of at least one of its irreducible factors
M1 (X), . . . , Mk (X). [More precisely, each Mi (X) is a sub-product of the above first-
degree monomials.]
We want to select a defining set D of roots among ω1 , . . . , ωd ∈ K: it is a collec-
tion comprising at least one root ω ji for each factor Mi (X). One is naturally tempted
to take a minimal defining set where each irreducible factor is represented by one
root, but this set may not be easy to describe exactly. Obviously, the cardinality |D|
of defining set D is between k and d. The roots forming D are all from field K but
in fact there may be some from its subfield, K ⊂ K containing all ω ji . [Of course,
F2 ⊂ K .] We then can identify the cyclic code X generated by g(X) with the set
of polynomials
* +
f (X) ∈ F2 [X]/"X N − 1# : f (ω ) = 0 for all ω ∈ D .
It is said that X is a cyclic code with defining set of roots (or zeros) D.
(b) A binary BCH code of length N (for N odd) and designed distance δ is a cyclic
code with defining set {ω , ω 2 , . . . , ω δ −1 } where δ ≤ N and ω is a primitive Nth
root of unity, with ω N = 1. It is helpful to note that if ω is a root of a polyno-
s−1
mial p(X) then so are ω 2 , ω 4 , . . . , ω 2 . By considering a defining set of the form
{ω , ω 2 , . . . , ω δ −1 } we ‘fill the gaps’ in the above diadic sequence and produce an
ideal of polynomials whose properties can be analytically studied.
The simplest example is where N = 7 and D = {ω , ω 2 } where ω is a root of
X + X + 1. Here, ω 7 = (ω 3 )2 ω = (ω + 1)2 ω = ω 3 + ω = 1, so ω is a 7th root of
3

unity. [We used the fact that the characteristic is 2.] In fact, it is a primitive root.
Also, as was said, ω 2 is a root of X 3 + X + 1: (ω 2 )3 + ω 2 + 1 = (ω 3 + ω + 1)2 =
0, and so is ω 4 . Then the cyclic code with defining set {ω , ω 2 } has generator
X 3 +X +1 since all roots of this polynomial are engaged. We know that it coincides
with the Hamming [7, 4] code.
The Vandermonde determinant is
⎛ ⎞
1 1 1 ... 1
⎜ x1 x2 x3 . . . xn ⎟
Δ = det ⎜ ⎝ ...
⎟.
... ... ... ... ⎠
x1n−1 x2n−1 x3n−1 . . . xnn−1
2.6 Additional problems for Chapter 2 253

Observe that if xi = x j (i = j) the determinant vanishes (two rows are the same).
Thus xi − x j is a factor of Δ,
Δ = P(x) ∏(xi − x j ),
i< j

with P a polynomial in x1 , . . . , xn . Now consider terms in expansion Δ in the sum


m(i)
of terms of form a ∏i xi with ∑ m(i) = 0 + 1 + · · · + (n − 1) = n(n − 1)/2. But
m(i)
∏i< j (xi −x j ) is a sum of terms a ∏i xi with ∑ m(i) = n(n−1)/2, so P(x) = const.
Considering x2 x32 . . . xnn−1 we have const = 1, so
Δ = ∏(xi − x j ). (2.6.5)
i< j

Suppose N is odd and K is a field containing F2 in which X N − 1 factorises into


linear factors. [This field can be selected as F2s where N|2s − 1.] A cyclic code
N−1
consisting of words c = c0 c1 . . . cN−1 with ∑ c j ω r j = 0 for all r = 1, 2, . . . , δ − 1
j=0
where ω is a primitive Nth root of unity is called a BCH code of design distance
δ < N. Next, X BCH is a vector space over F2 and c ∈ X BCH iff
cH T = 0 (2.6.6)
where
⎛ ⎞
1 ω ω2 ... ω N−1
⎜1 ω2 ω4 ... ω 2N−2 ⎟
⎜ ⎟
⎜ ω3 ω6 ... ω 3N−3 ⎟
H = ⎜1 ⎟. (2.6.7)
⎜. .. .. .. .. ⎟
⎝ .. . . . . ⎠
1 ω δ −1 ω 2δ −2 . . . ω (N−1)(δ −1)

Now rank H = δ . Indeed, by (2.6.5) for any δ × δ minor H


det H = ∏(ω i − ω j ) = 0.
i< j

Thus (2.6.6) tells us that


c ∈ X , c = 0 ⇒ ∑ |c j | ≥ δ .
So, the minimum distance in X BCH is not smaller than δ .
Problem 2.9 A subset X of the Hamming space {0, 1}N of cardinality X = M
and with the minimal Hamming distance d = min[δ (x, x ) : x, x ∈ X , x = x ]
is called an [N, M, d] code (not necessarily linear). An [N, M, d] code is called
maximal if it is not contained in any [N, M + 1, d] code. Prove that an [N, M, d]
code is maximal if and only if for any y ∈ {0, 1}N there exists x ∈ X such that
254 Introduction to Coding Theory

δ (x, y) < d . Conclude that if d or more changes are made in a codeword then the
new word is closer to some other codeword than to the original one.
Suppose that a maximal [N, M, d] code is used for transmitting information via a
binary memoryless channel with the error-probability p, and the receiver uses the
maximum likelihood decoder. Prove that the probability of erroneous decoding,
πerr
ML , obeys the bounds

1 − b(N, d − 1)  πerr
ML
 1 − b(N, (d − 1)/2),

where b(N, m) is a partial binomial sum


 
N k
b(N, m) = ∑ p (1 − p)N−k .
0≤k≤m k

Solution If a code is maximal then adding one more word will reduce the distance.
Hence, for all y there exists x ∈ X such that δ (x, y) < d. Conversely, if this prop-
erty holds then the code cannot be enlarged without reducing d. Then making d or
more changes in a codeword gives a word that is closer to a different codeword.
This will certainly not give the correct guess under the ML decoder as it chooses
the closest codeword.
Therefore,
 
N k
πerr ≥ ∑
ML
p (1 − p)N−k = 1 − b(N, d − 1).
d≤k≤N
k

On the other hand, the code corrects (d − 1) 2 errors. Hence,

πerr
ML
≤ 1 − b (N, d − 1/2) .

Problem 2.10 The Plotkin bound for an [N, M, d] binary code states that M ≤
d
if d > N/2. Let M2∗ (N, d) be the maximum size of a code of length N and
d − N/2
distance d , and let
1
α (λ ) = lim log2 M2∗ (N, λ N).
N→∞ N

Deduce from the Plotkin bound that α (λ ) = 0 for λ ≥ 12 .


Assuming the above bound, show that if d ≤ N/2, then
d
M ≤ 2N−(2d−1) = 2d 2N−(2d−1) .
d − (2d − 1)/2
Deduce the asymptotic Plotkin bound: α (λ ) ≤ 1 − 2λ , 0 ≤ λ < 12 .
2.6 Additional problems for Chapter 2 255

Solution If d > N/2 apply the Plotkin bound and conclude that α (λ ) = 0. If
d ≤ N/2 consider the partition of a code X of length N and distance d ≤ N/2
according to the last N − (2d − 1) digits, i.e. divide X into disjoint subsets, with
fixed N − (2d − 1) last digits. One of these subsets, X , must have size M such
that M 2N−(2d−1) ≥ M.
Hence, X is a code of length N = 2d − 1 and distance d = d, with d > N /2.
Applying Plotkin’s bound to X gives
d d
M ≤
= = 2d.
d − N/2 d − (2d − 1)/2
Therefore,
M ≤ 2N−(2d−1) 2d.
Taking d = λ N with N → ∞ yields α (λ ) ≤ 1 − 2λ , 0 ≤ λ ≤ 1/2.
Problem 2.11 State and prove the Hamming, Singleton and Gilbert–Varshamov
bounds. Give (a) examples of codes for which the Hamming bound is attained, (b)
examples of codes for which the Singleton bound is attained.

Solution The Hamming bound states that the size M of an E-error correcting code
X of length N,
2N
M≤ ,
vN (E)
 
N
where vN (E) = ∑ is the volume of an E-ball in the Hamming space
0≤i≤E i
{0, 1}N . It follows from the fact that the E-balls about the codewords x ∈ X must
be disjoint:
M × vN (E) =  of points covered by M E-balls
≤ 2N =  of points in {0, 1}N .
The Singleton bound is that the size M of a code X of length N and distance d
obeys
M ≤ 2N−d+1 .
It follows by observing that truncating X (i.e. omitting a digit from the codewords
x ∈ X ) d − 1 times still does not merge codewords (i.e. preserves M) while the
resulting code fits in {0, 1}N−d+1 .
The Gilbert–Varshamov bound is that the maximal size M ∗ = M2∗ (N, d) of a
binary [N, d] code satisfies
2N
M∗ ≥ .
vN (d − 1)
256 Introduction to Coding Theory

This bound follows from the observation that any word y ∈ {0, 1}N must be within
distance ≤ d − 1 from a maximum-size code X ∗ . So,

M ∗ × vN (d − 1) ≥  of points within distance d − 1 = 2N .

Codes attaining the Hamming bound are called perfect codes, e.g. the Hamming
[2 − 1, 2 − 1 − , 3] codes. Here, E = 1, vN (1) = 1 + 2 − 1 = 2 and M = 22 −−1 .


Apart from these codes, there is only one example of a (binary) perfect code: the
Golay [23, 12, 7] code.
Codes attaining the Singleton bound are called maximum distance separable
(MDS): their check matrices have any N − M rows linearly independent. Examples
of such codes are (i) the whole {0, 1}N , (ii) the repetition code {0 . . . 0, 1 . . . 1}
and the collection of all words x ∈ {0, 1}N of even weight. In fact, these are all
examples of binary MDS codes. More interesting examples are provided by Reed–
Solomon codes that are non-binary; see Section 3.2. Binary codes attaining the
Gilbert–Varshamov bound for general N and d have not been constructed so far
(though they have been constructed for non-binary alphabets).

Problem 2.12 (a) Explain the existence and importance of error correcting codes
to a computer engineer using Hamming’s original code as your example.
(b) How many codewords in a Hamming code are of weight 1? 2? 3? 4? 5?

Solution (a) Consider the linear map F72 → F32 given by the matrix H of the form
(2.6.4). The Hamming code X is the kernel ker H, i.e. the collection of words
x = x1 x2 x3 x4 x5 x6 x7 ∈ {0, 1}7 such that xH T = 0. Here, we can choose four digits,
say x4 , x5 , x6 , x7 , arbitrarily from {0, 1}; then x1 , x2 , x3 will be determined:

x1 = x4 + x5 + x7 ,
x2 = x4 + x6 + x7 ,
x3 = x5 + x6 + x7 .

It means that code X can be used for encoding 16 binary ‘messages’ of length 4.
If y = y1 y2 y3 y4 y5 y6 y7 differs from a codeword x ∈ X in one place, say y = x + ek
then the equation yH T = ek H T gives the binary decomposition of number k, which
leads to decoding x. Consequently, code X allows a single error to be corrected.
Suppose that the probability of error in any digit is p << 1, independently of
what occurred to other digits. Then the probability of an error in transmitting a
non-encoded (4N)-digit message is

1 − (1 − p)4N  4N p.
2.6 Additional problems for Chapter 2 257

But using the Hamming code we need to transmit 7N digits. An erroneous trans-
mission requires at least two wrong digits, which occurs with probability
   N
7 2
≈ 1− 1− p  21N p2 << 4N p.
2
So, the extra effort of using 3 check digits in the Hamming code is justified.
(b) A Hamming code X H, of length N = 2 − 1 ( ≥ 3) consists of binary words
x = x1 . . . xN such that xH T = 0 where H is an  × N matrix whose columns
h(1) , . . . , h(N) are all non-zero binary vectors of length l. Hence, the number of
N
codewords of weight w(x) = ∑ x j = s equals the number of (non-ordered) collec-
j=1
tions of s binary, non-zero, pair-wise distinct -vectors of total sum 0. In fact, if
xH T = 0 and w(x) = s and x j1 = x j2 = · · · = x js = 1 then the sum of row-vectors
h( j1 ) + · · · + h( js ) = 0.
Thus, one codeword has weight 0, no codeword has weight 1 or 2, N(N − 1)/3!
codewords have weight 3 (i.e. 7 and 35 words of weights 3 for l = 3 and l = 4).
Further we have [N(N − 1)(N − 2) − N(N − 1)]/4! = N(N − 1)(N − 3)/4! words
of weight 4 (i.e. 7 and 105 words of weights 4 for  = 3 and  = 4). Finally, we
have N(N − 1)(N − 3)(N − 7)/5! words weight 5 (i.e. 0 and 168 words of weight
5 for  = 3 and  = 4). Each time when we add a factor, we should avoid -vectors
equal to a linear combination of previously selected vectors. In Problem 3.9 we
will compute the enumerator polynomial for N = 15:
1 + 35X 3 + 105X 4 + 168X 5 + 280X 6 + 435X 7 + 435X 8

+280X 9 + 168X 10 + 105X 11 + 35X 12 + X 15 .

Problem 2.13 (a) The dot-product of vectors x, y from a binary Hamming space
HN is defined as x · y = ∑Ni=1 xi yi (mod 2), and x and y are said to be orthogonal
if x · y = 0. What does it mean to say that X ⊆ HN is a linear [N, k] code with
generating matrix G and parity-check matrix H ? Show that
X ⊥ = {x ∈ HN : x · y = 0 for all y ∈ X }
is a linear [N, N − k] code and find its generator and parity-check matrices.
(b) A linear code X is called self-orthogonal if X ⊆ X ⊥ . Prove that X is self-
orthogonal if the rows of G are self and pairwise orthogonal. A linear code is called
self-dual if X = X ⊥ . Prove that a self-dual code has to be an [N, N/2] code (and
hence N must be even). Conversely, prove that a self-orthogonal [N, N/2] code, for
N even, is self-dual. Give an example of such a code for any even N and prove that
a self-dual code always contains the word 1 . . . 1.
258 Introduction to Coding Theory

(c) Consider now a Hamming [2 −1, 2 −−1] code XH, . Describe the generating
⊥ . Prove that the distance between any two codewords in X ⊥ equals
matrix of XH, H,
−1
2 .

Solution By definition, X ⊥ is preserved under the linear operations; hence X ⊥


is a linear code. From algebraic considerations, dim X ⊥ = N − k. The generating
matrix G⊥ of X ⊥ coincides with H, and the parity-check matrix H ⊥ with G.
If X ⊆ X ⊥ then the rows g(1) , . . . , g(k) of G are self- and pairwise orthogonal.
The converse is also true. From the previous observation, if X is self-dual then
k = N − k, i.e. k = N/2, and N should be even. Similarly, if X is self-orthogonal
and k = N/2 then X is self-dual.
Let 1 = 1 . . . 1. If X = X ⊥ then 1 · g(i) = g(i) · g(i) = 0. So, 1 ∈ X ⊥ and hence
1 ∈ X . An example is a code with the generating matrix
⎛ ⎞ ⎫
1 1 1 ... 1 1 ... 1 ⎪

⎜1 1 0 . . . 0 ⎪
⎜ 1 ... 0⎟ ⎟



⎜ 0⎟
G = ⎜1 0 1 . . . 0 1 ... ⎟ N/2.
⎜. . . . .. ⎟ ⎪

⎝ .. .. .. . . ... ..
.
..
. .⎠ ⎪



1 0 0 ... 1 1 ... 1
← N/2 → ← N/2 →

The dual XH⊥ of a Hamming code X H is called a simplex code. By the above,
it has length 2 − 1 and rank , and its generating matrix GH⊥ is  × (2 − 1), with
columns listing all non-zero vectors of length . To check that dist X H⊥ = 2−1 ,
it suffices to establish that the weight of non-zero word x ∈ X H⊥ equals 2−1 . But
a non-zero word x ∈ X H⊥ is a non-zero linear combination of rows of G⊥H . Let
J ⊂ {1, . . . , } be the set of contributing rows:

x = ∑ g(i) .
i∈J

Clearly, w(g(i) ) = 2−1 as exactly half of all 2 vectors have 1 on any given position.
The proof is finished by induction on J.
A simple and elegant way is to use the MacWilliams identity (cf. Lemma 3.4.4)
which immediately gives
−1
WX ⊥ (s) = 1 + (2 − 1)s2 . (2.6.8)

It is instructive to present this derivation. We will establish in Problem 3.9 the


formula for a weight enumeration polynomial of Hamming code. Then substituting
2.6 Additional problems for Chapter 2 259

this expression into the MacWilliams identity one gets


  
1 1 1−s N
WX ⊥ (s) = N− 1+
2 N +1 1+s
 (N−1)/2   
N 1−s 1 − s (N+1)/2
+ 1+ 1− (1 + s)N
N+ 1 1+s  1+s
1 2 − 1 2−1
= 2 + s
2 2
which is equivalent to (2.6.8).
Problem 2.14 Describe briefly the decoding procedure for the Hamming [2 −1,
2 − 1 − ] code.
The codewords of the Hamming [7, 4] code, with the lexicographical parity-
check matrix H of the form (2.3.4a), are used for encoding 16 symbols, the first 15
letters of the alphabet and the space character ∗. The encoding rule is
A 0011001 E 0111100 I 1010101 M 1111111
B 0100101 F 0001111 J 1100110 N 1000011
C 0010110 G 1101001 K 0101010 O 0000000
D 1110000 H 0110001 L 1001100 ∗ 1011010
You have received a 105-digit message
1000110 0000000 0110001 1000011 1000011 1110101
0111100 0011010 0100101 0111100 1011000 1101001
0000000 0010000 1010000
where some words are corrupted. Decode the received message.

Solution The Hamming [2 − 1, 2 − 1 − ] code,  = 2, 3, . . ., is obtained as a col-


lection of binary ‘strings’ x = x1 . . . xN of length N = 2 − 1 such that xH T = 0.
Matrix H is ( × 2 − 1), with 2 − 1 non-zero binary strings as columns; that is,
⎛ ⎞
1 0 ... 0 1 ... 1
⎜ 0 1 ... 0 0 ... 1 ⎟
H =⎜ ⎟
⎝. . . . . . . . . . . . . . . . . . . . .⎠ .
0 0 ... 1 1 ... 1
Here the columns are meant to be lexicographically ordered. Different matrices
obtained from the above by permuting the rows define different, but equivalent,
codes: they are all named Hamming codes.
To perform decoding, we have to fix a matrix H (the check matrix) and let it
be known to both the sender and the receiver. Upon receiving a word (string) y =
y1 . . . yN we form a syndrome vector yH T . If yH T = 0, we decode y by itself. (We
260 Introduction to Coding Theory

have no means to determine if the original codeword was corrupted by the channel
or not.)
If yH T = 0 then yH T coincides with a column of H. Suppose yH T gives column
j of H; then we decode y by

x∗ = y + e j where e j = 0 . . . 1 . . . 0 (1 in digit j).

In other words, we change digit j in y and decide that it was the word sent through
the channel. This works well when errors in the channel are rare.
If  = 3 a Hamming [7, 4] code contains 24 = 16 codewords. These codewords
are fixed when H is fixed: in the example they are used for encoding 15 letters from
A to O and the space character ∗. Upon receiving a message we divide it into words
of length 7: in the example there are 15 words altogether. Performing the decoding
procedure leads to
JOHNNIE∗BE∗GOOD

Problem 2.15 A (binary) Hamming code of length N = 2 − 1, where  ≥ 2, is


defined as a linear binary code with a parity-check matrix H whose columns consist
of all non-zero binary vectors of length . Find the rank of such a code (i.e. the
dimension of the corresponding linear subspace) and the number of the codewords.
Find the minimum distance for the code and prove that it is single-error correcting.
Prove that the code is perfect (i.e. the union of the one-balls around the codewords
covers the space of all words).
Give a parity-check matrix and a generating matrix for a Hamming code with  =
3. What is the information rate of this code? Why is the case  = 2 not interesting?

Solution The parity-check matrix H for the Hamming code is ×2 −1 and formed
by all non-zero columns of length ; in particular, it includes all l columns of
weight 1. The latter are linearly independent; hence the l columns of H are lin-
early independent. Since XHam = ker H, we have dim X = 2 − 1 −  = rank X .
The number of codewords then equals 22 −−1 .


Since all columns of H are distinct, any pair of columns are linearly independent.
So, the minimal distance of X is > 2. But X contains three columns that are
linearly dependent, e.g.

1 0 0 . . . 0T , 0 1 0 . . . 0T , and 1 1 0 . . . 0T .

Hence, the minimal distance equals 3. Therefore, if a single error occurs, i.e. the
received word is at distance 1 from a codeword, then this codeword is uniquely
determined. Hence, the Hamming code is single-error correcting.
2.6 Additional problems for Chapter 2 261

To prove that it is perfect, we must check that


the  of codewords × the volume of a one-ball
= the total  of words.

In fact, denoting 2l − 1 = N, we have


the  of codewords = 22 −1−l = 2N−l ,
l

   
N N
the volume of a one-ball = + = 1 + N,
0 1
the total  of words = 2N ,
and
(1 + N)2N− = 2 2N− = 2N .
The information rate of the code equals
2 −  − 1
rank length = .
2 − 1
The code with  = 3 has the 3 × 7 parity-check matrix of the form (2.6.4); any
permutation of rows leads to an equivalent code. The generating matrix is 4 × 7:
⎛ ⎞
1 0 0 0 1 1 1
⎜0 1 0 0 0 1 1⎟
⎜ ⎟
⎝0 0 1 0 1 0 1⎠
0 0 0 1 0 1 1
and the information rate 4/7. The Hamming code with  = 2 is trivial: it contains
a single non-zero codeword 1 1 1.
Problem 2.16 Define a BCH code of length N over the field Fq with designed
distance δ . Show that the minimum weight of such a code is at least δ .
Consider a BCH code of length 31 over the field F2 with designed distance 8.
Show that the minimum distance is at least 11.

Solution A BCH code of length N over the field Fq is defined as a cyclic code X
whose minimum degree generator polynomial g(X) ∈ Fq [X], with g(X)|(X N − 1)
(and hence deg g(X) ≤ N), contains among its roots the subsequent powers ω ,
ω 2 , . . . , ω δ −1 where ω ∈ Fqs is a primitive Nth root of unity. (This root ω lies
in an extension field Fqs – the splitting field for X N − 1 over Fq , i.e. N|qs − 1.) Then
δ is called the designed distance for X ; the actual distance (which may be difficult
to calculate in a general situation) is ≥ δ .
If we consider the binary BCH code X of length 31, ω should be a primitive
root of unity of degree 31, with ω 31 = 1 (the root ω lies in an extension field F32 ).
262 Introduction to Coding Theory

We know that in the binary arithmetic, if a polynomial f (X) ∈ F2 [X], of order s,


s−1
has a root ω , it has roots ω 2 , ω 4 , . . . , ω 2 , i.e.

r 
X − ω 2  f (X), r = 0, . . . , s − 1.

Thus, given that the generator g(X) of X has roots ω , ω 2 , ω 3 , ω 4 , ω 5 , ω 6 , ω 7 , it


will also have roots
ω 8 = (ω 4 )2 , ω 9 = (ω 5 )8 , and ω 10 = (ω 5 )2 .
That is, the defining set can be extended to
ω , ω 2 , ω 3 , ω 4 , ω 5 , ω 6 , ω 7 , ω 8 , ω 9 , ω 10
(all these elements are distinct, as ω is a primitive 31st root of unity). In fact, code
X has designed distance ≥ 11. Hence, the minimum distance in X is ≥ 11.
Problem 2.17 Let X be a linear [N, k, d] code over the binary field F2 , and
G be a generating matrix of X , with k rows and N columns, such that exactly
d of the first row’s entries are 1. Let G1 be the matrix, of k − 1 rows and N − d
columns, formed by deleting the first row of G and those columns of G with a
non-zero entry in the first row. Show that X1 , the linear code generated by G1 ,
has minimum distance d ≥ d/2. Here, for a real number x, x is the integer
satisfying x ≤ x < x + 1.
Show also that X1 has rank k − 1. Deduce that
3 i4
N≥ ∑ d 2 .
0≤i≤k−1

Solution Let x be the codeword in X represented by the first row of G and pick
a pair of other rows, say y and z. After the first deleting they become y and z ,
correspondingly. Both weights w(y ) and w(z ) must be ≥ d/2: otherwise at least
one of the original words y and z, say y, would have had minimum d/2 digits 1
among deleted d digits (as w(y) ≥ d by condition). But then
w(x + y) = w(y ) + d − d/2 < d
which contradicts the condition that the distance of X is d.
We want to check that the weight w(y + z ) ≥ d/2. Assume the opposite:
w(y + z ) = m < d/2 .
Then m = w(y0 + z0 ) must be ≥ d − m ≥ d/2 where y0 is the deleted part of y,
of length d, and z0 is the deleted part of z, also of length d. In fact, as before, if
m < d − m then w(y + z) < d which is impossible. But if m ≥ d − m then
w(x + y + z) = d − m + m < d,
again impossible. Hence, the sum of any two rows of G1 has weight ≥ d/2.
2.6 Additional problems for Chapter 2 263

This argument can be repeated for the sum of any number of rows of G1 (not
exceeding k − 1). In fact, in the case of such a sum x + y + · · · + z, we can pass to
new matrices, G and G1 , with this sum among the rows. We conclude that X1 has
minimum distance d ≥ d/2. The rank of X1 is k − 1, for any k − 1 rows of G1
are linearly independent. (The above sum cannot be 0.)
Now, the process of deletion can be applied to X1 (you delete d columns in
G1 yielding digits 1 in a row of G1 with exactly d digits 1). And so on, until you
exhaust the initial rank k by diminishing it by 1. This leads to the required bound
3 4  
N ≥ d + d/2 + d/22 + · · · + d 2k−1 .

Problem 2.18 Define a cyclic linear code X and show that it has a codeword of
minimal length which is unique, under normalisation to be stated. The polynomial
g(X) whose coefficients are the symbols of this codeword is the (minimum degree)
generator polynomial of this code: prove that all words of the code are related to
g(X) in a particular way.
Show further that g(X) can be the generator polynomial of a cyclic code with
words of length N iff it satisfies a certain condition, to be stated.
There are at least three ways of determining the parity-check matrix of the code
from a knowledge of the generator polynomial. Explain one of them.

Solution Let X be the cyclic code of length N with generator polynomial g(X) =
∑ gi X i of degree d. Without loss of generality, assume the code is non-trivial,
0≤i≤d
with 1 < d < N − 1. Let g denote the corresponding codeword g0 . . . gd 0 . . . 0 (there
are d + 1 coefficients gi completed with N − d − 1 zeros). Then:

(a) g(X)|(X N − 1), i.e. g(X)h(X) = X N − 1 for some polynomial h(X) = ∑ hi X i


0≤i≤k
of degree k = N − d;
(b) a string a = a0 . . . aN−1 ∈ X iff the polynomial a(X) = ∑ ai X i has the
0≤i≤N−1
form a(X) = f (X)g(X) mod (X N − 1);
(c) a string g and its cyclic shifts π g, . . . , π k−1 g (corresponding to polynomials
g(X), Xg(X), . . . , X k−1 g(X)) form a basis in X .
By virtue of (a), g0 = h0 = 1 and the sum ∑ gi hl−i representing the lth coefficient
0≤i≤l
of g(X)h(X) is equal to 0, for all l = 1, . . . , N − 1. By virtue of (c), the rank of X
equals k.
264 Introduction to Coding Theory

One way to specify the parity-check matrix is to take the ratio (X N − 1) g(X) =
h(X) = h0 + h1 X + · · · + hk X k . Then form the N × (N − k) matrix
⎛ ⎞
hk hk−1 . . . 0 ... 0 0
⎜0 hk hk−1 . . . h1 . . . 0 ⎟
H =⎜
⎝. . . . . .
⎟. (2.6.9)
. . . . . . . . . . . . . . .⎠
0 0 . . . hk . . . h1 h0
 j ↓
The rows of H are the cyclic shifts π h , 0 ≤ j ≤ d − 1 = N − k − 1, of the string
h↓ = hk . . . h0 0 . . . 0.
We claim that for all a ∈ X , aH T = 0. In fact, it suffices to check that for the
basis words π j g, π j gH T = 0, j = 0, . . . , k − 1. That is, the dot-product
π j1 g · π j2 h↓ = 0, 0 ≤ j1 < k, 0 ≤ j2 < N − k − 1. (2.6.10)
But for j1 = k − 1 and j2 = 0, we have
π k−1 g · h↓ = g0 hk + g1 hk−1 = 0
since it gives the first coefficient (at monomial X) of the product g(X)h(X) =
X N − 1. Similarly, for j1 = k − 2 and j2 = 0, π k−2 g · h↓ gives the second coefficient
of g(X)h(X) (at monomial X 2 ) and is again equal to 0. And so on: for j1 = j2 = 0,
g · h↓ = 0 as the kth-degree coefficient in g(X)h(X).
Continuing, g · π h↓ equals the (k + 1)st-degree coefficient in g(X)h(X), g · π 2 h↓
the (k + 2)nd, and so on; g · π N−k−1 h↓ = gd−1 hk + gd hk−1 the (N − 1)st. As before,
they all vanish.
The same holds true when we simultaneously shift both words cyclically (when
possible) which leads to (2.6.10).
Conversely, suppose that aH T = 0 for some word a = a0 . . . aN−1 . Write the cor-
responding polynomial a(X) as a(X) = f (X)g(X) + r(X) where the ratio f (X) =
∑ fi X i and r(X) is the remainder. Then either r(X) = 0 or 1 ≤ deg r(X) =
0≤i≤k−1
d < d (and rd = 1 and rl = 0 for d < l ≤ n − 1). Then set r = r0 . . . rd .
Assume that r(X) = 0. By the above argument,
(i) aH T = rH T and hence rH T = 0,
(ii) the entries of vector rH T coincide with the coefficients of the product
r(X)h(X), beginning with r0 hk + · · · + rd hk−d and ending with rd hk . So, these
coefficients must be 0. But the equality rd hk = 0 is impossible since rd = hk = 1.
Hence, r(X) = 0 and a(X) = f (X)g(X), i.e. a ∈ X . We conclude that H is the
parity-check matrix for X .
Equivalently, H is the matrix formed by the words corresponding to polynomials
X i h↓ (X) where
h↓ (X) = ∑ hi X k−i .
0≤i≤k
2.6 Additional problems for Chapter 2 265

Alternatively, let h(X) be the check polynomial for the cyclic code X length N
with a generator polynomial g(X) so that g(X)h(X) = X N − 1. Then:
(a) X = { f (X): f (X)h(X) = 0 mod (X N − e)};
(b) if h(X) = h0 + h1 X + · · · + hN−r X N−r then the parity-check matrix H of X
has the form (2.6.9);
(c) the dual code X ⊥ is a cyclic code of dim X ⊥ = r, and X ⊥ = "h⊥ (X)#,
where h⊥ (X) = h−10 X
N−r h(X −1 ) = h−1 (h X N−r + h X N−r−1 + · · · + h
0 0 1 N−r ).

Problem 2.19 Consider the parity-check matrix H of a Hamming [2 − 1, 2 −


 − 1] binary code. Form the parity-check matrix H ∗ of a [2 , 2 −  − 1] code by
augmenting H with a column of zeros and then with a row of ones. The dual of
the resulting code is called a first-order Reed–Muller code. Show that a first-order
Reed–Muller code can correct errors of up to 2−2 − 1 bits per codeword.
For the photographs of Mars taken by the Mariner spacecraft such code with
 = 5 was used in 1972. What was the code rate? Why is this likely to have been
much less than the capacity of the channel?

Solution The code in question is [2 ,  + 1, 2−1 ]; with  = 5, the information rate
equals 6/32 ≈ 1/5. Let us check that all codewords except 0 and 1 have weight
2−1 . For  ≥ 1 the code R() is defined by recursion

R( + 1) = {xx|x ∈ R()} ∨ {x, x + 1|x ∈ R()}.

So, the length of codewords in R( + 1) is obviously 2+1 . As {xx|x ∈ R()}


and {x, x + 1|x ∈ R()} are disjoint, the number of codewords is doubled, i.e.
R( + 1) = 2+2 . Finally, assuming that all codewords in R() except 0 and 1
have weight 2−1 , consider a codeword y ∈ R(l + 1). If y = xx is different from 0
or 1, then x = 0 or 1, and so w(y) = 2w(x) = 2 × 2−1 = 2 .
If y = x, x + 1 we must consider some cases. If x = 0 then y = 01, which has
weight 2l . If x = 1 then y = 10, which also has weight 2l . Finally, if x = 0 or 1
then w(x + 1) = 2 − 2−1 = 2−1 and w(y) = 2 × 2−1 = 2 . It is clear now that
codewords xx and x, x + 1 with w(x) = 2−1 are orthogonal to rows of parity-check
matrix H ∗ .
Up to 7 bits may be in error, thus the probability of a transmission error pe (for
a binary symmetric memoryless channel with the error-probability p) obeys
 
32 i
pe ≤ 1 − ∑ p (1 − p)32−i ,
0≤i≤7
i

which is small when p is small. (As an estimate of an acceptable p, we can take the
solution to 1 + p log p + (1 − p) log(1 − p) = 26/32.) If the block length is fixed
(and rather small), with a low value of p we can’t get near the capacity.
266 Introduction to Coding Theory

Indeed, for  = 5, the code is [32, 6, 16], detecting 15 and correcting 7 errors. That
is, the code can correct a fraction > 1/5 of the total of 32 digits. Its information
rate is 6/32 and if the capacity of the (memoryless) channel is C = 1− η (p) (where
p stands for the symbol-probability of error), we need the bound C > 6/32; that
is, η (p) + 6/32 < 1, for a reliable transmission. This yields |p − 1/2| > |p∗ − 1/2|
where p∗ ∈ (0, 1) solves 26/32 = η (p∗ ). Definitely 0 ≤ p < 1/5 and 4/5 < p ≤ 1
would do. In reality the error-probability was much less.

Problem 2.20 Prove that any binary [5, M, 3] code must have M ≤ 4. Verify that
there exists, up to equivalence, exactly one [5, 4, 3] code.

Solution By the Plotkin bound, if d is odd and d > 12 (N − 1) then

d +1
M2∗ (N, d) ≤ 2 .
2d + 1 − N
In fact,
4
M2∗ (5, 3) ≤ 2  = 2 · 2 = 4.
6+1−5
All [5, 4, 3] codes are equivalent to 00000, 00111, 11001, 11110.

Problem 2.21 Let X be a binary [N, k, d] linear code with generating matrix
G. Verify that we may assume that the first row of G is 1 . . . 1 0 . . . 0 with d ones.
Write:
 
1...1 0...0
G= .
G1 G2

Show that if d2 is the distance of the code with generating matrix G2 then d2 ≥ d/2.

Solution Let X be [N, k, d]. We can always form a generating matrix G of X where
the first row is a codeword x with w(x) = d; by permuting columns of G we can
have the first row in the form 1 . . . 1d 0 . . . 0N−d . So, up to equivalence,
: ;< = : ;< =
 
1...1 0...0
G= .
G1 G2

Suppose d(G2 ) < d/2 then, without loss of generality, we may assume that there
exists a row of (G1 G2 ) where the number of ones among digits d + 1, . . . , N is
< d/2. Then the number of ones among digits 1, . . . d in this row is > d/2, as its
total weight is ≥ d. Then adding this row and 1 . . . 1 0 . . . 0 gives a codeword with
weight < d. So, d(G2 ) ≥ d/2.
2.6 Additional problems for Chapter 2 267

Problem 2.22 (Gilbert–Varshamov bound) Prove that there exists a p-ary linear
[N, k, d] code if pk < 2N /vN−1 (d −2). Thus, if pk is the largest power of p satisfying
this inequality, we have Mp∗ (N, d) ≥ pk .

Solution We construct a parity-check matrix by selecting N columns of length


N − k with the requirement that no d − 1 columns are linearly dependent. The first
column may be any non-zero string in ZN−k p . On the step i ≥ 2 we must choose
a column which is not a linear combination of any d − 2 (or fewer) of previously
selected columns. The number of such linear combinations (with non-zero coeffi-
cients) is
d−2  
i−1
Si = ∑ (p − 1) j .
j=1
j

So, the parity-check matrix may be constructed iff SN + 1 < pN−k . Finally, observe
that SN + 1 = vN−1 (d − 2). Say, there exists [5, 2k , 3] code if 2k < 32/5, so k = 2
and M2∗ (5, 3) ≥ 4, which is, in fact, sharp.
Problem 2.23 An element b ∈ F∗q is called primitive if its order (i.e. the minimal
k such that bk = 1 mod q) is q − 1. It is not difficult to find a primitive element of
the multiplicative group F∗q explicitly. Consider the prime factorisation
s
ν
q − 1 = ∏ pj j.
j=1
νj
(q−1)/p j (q−1)/p j
For any j = 1, . . . , s select a j ∈ Fq such that a j = e. Set b j = a j and
check that b = ∏sj=1 b j has the order q − 1.

ν
Solution Indeed, the order of b j is p j j . Next, if bn = 1 for some n then n = 0
νi νi
ν n∏
mod p j j because bn ∏i = j pi = 1 implies b j i = j i = 1, i.e. n ∏i = j pνi i = 0 mod pνj i =
p

ν
0. Because p j are distinct primes, it follows that n = 0 mod p j j for any j. Hence,
ν
n = ∏sj=1 p j j .
Problem 2.24 The minimal polynomial with a primitive root is called a primitive
polynomial. Check that among irreducible binary polynomials of degree 4 (see
(2.5.9)), 1 + X + X 4 and 1 + X 3 + X 4 are primitive and 1 + X + X 2 + X 3 + X 4 is
not. Check that all six irreducible binary polynomials of degree 5 (see (2.5.15))
are primitive; in practice, one prefers to work with 1 + X 2 + X 5 as the calculations
modulo this polynomial are slightly shorter. Check that among the nine irreducible
polynomials of degree 6 in (2.5.16), there are six primitive: they are listed in the
upper three lines. Prove that a primitive polynomial exists for every given degree.
268 Introduction to Coding Theory

Solution For the solution to the last part, see Section 3.1.
Problem 2.25 A cyclic code X of length N with the generator polynomial g(X)
of degree d = N − k can be described in terms of the roots of g(X), i.e. the elements
α1 , . . . αN−k such that g(α j ) = 0. These elements are called zeros of code X and
belong to a Galois field F2d . As g(X)|(1+X N ), they are also among roots of 1+X N .
That is, α Nj = 1, 1 ≤ j ≤ N − k, i.e. the α j are N th roots of unity. The remaining k
roots of unity α1 , . . . , αk are called non-zeros of X . A polynomial a(X) ∈ X iff,
in Galois field F2d , a(α j ) = 0, 1 ≤ j ≤ N − k.
(a) Show that if X ⊥ is the dual code then the zeros of X ⊥ are α1 −1 , . . . , αk −1 , i.e.
the inverses of the non-zeros of X .
(b) A cyclic code X with generator g(X) is called reversible if, for all x =
x0 . . . xN−1 ∈ X , the word xN−1 . . . x0 ∈ X . Show that X is reversible iff g(α ) = 0
implies that g(α −1 ) = 0.
(c) Prove that a q-ary cyclic code X of length N with (q, N) = 1 is invariant under
the permutation of digits such that πq (i) = qi mod N (i.e. x → xq ). If s = ordN (q)
then the two permutations i → i + 1 and πq (i) generate a subgroup of order Ns in
the group Aut(X ) of the code automorphisms.

Solution Indeed, since a(xq ) = a(x)q is proportional to the same generator polyno-
mial it belongs to the same cyclic code as a(x).
Problem 2.26 Prove that there are 129 non-equivalent cyclic binary codes of
length 128 (including the trivial codes, {0 . . . 0} and {0, 1}128 ). Find all cyclic bi-
nary codes of length 7.

Solution The equivalence classes of the cyclic codes of length 2k are in a one-to-
k
one correspondence with the divisors of 1+X 2 ; the number of those equals 2k +1.
Furthermore, there are eight codes listed by their generators which are divisors of
X 7 − 1 as
X 7 − 1 = (1 + X)(1 + X + X 3 )(1 + X 2 + X 3 ).
3
Further Topics from Coding Theory

3.1 A primer on finite fields


In this section we present a summary of the theory of finite fields, limiting our scope
by material needed in the subsequent sections and following standard texts (see
[92], [93], [131]). A finite field is a (finite) set F possessing two distinct elements,
0 (zero) and e (unity), and equipped with two commutative group operations of
addition and multiplication (where 0 · b = 0 for all b ∈ F) related by a standard
distributivity rule.
A vector space over a field F is a (finite) set V, equipped with a commutative
group operation of addition, and an operation of scalar multiplication by elements
of F, again obeying standard distributivity rules. The dimension dim V of V is the
minimal number d such that any collection of distinct elements v1 , . . . , vd+1 ∈ V
is linearly dependent, i.e. one can find elements k1 , . . . , kd+1 ∈ F, not all equal to
0, such that k1 v1 + · · · + kd+1 vd+1 = 0. Then there exists a collection of elements
b1 , . . . , bd ∈ V, called a basis, such that every v ∈ V can be written as a linear com-
bination a1 b1 +· · ·+ad bd where a1 , . . . , ad are elements of F (uniquely) determined
by v. Unless the opposite is stated, we consider fields up to an isomorphism.
An important parameter of a field is its characteristic, i.e. the minimal integer
number p ≥ 1 such that pe = e + · · · + e (p times) = 0. Such a number, denoted by
char(F), exists by a standard pigeon-hole principle. Furthermore, the characteristic
is a prime number: if p = q1 q2 then pe = (q1 q2 )e = (q1 e)(q2 e) = 0 which implies
that q1 e = 0 or q2 e = 0 leading to a contradiction.
Example 3.1.1 Let p be a prime number. An additive cyclic group Z p =
{0, 1, . . . , p − 1}, with a generator 1, becomes a field with the multiplication
(qe)(q e) = (qq )e. The characteristic of this field equals p.
Let K and F be fields. If F ⊆ K we say that K is an extension of F. Then K is
also a vector space over F whose dimension is denoted by [K : F].
Lemma 3.1.2 Let K be an extension of F, and d = [K : F]. Then  K = ( F)d .

269
270 Further Topics from Coding Theory

Proof Let b1 , . . . , bd be a basis for K over F, with a unique representation


k = ∑ a j b j for all k ∈ K. Then for all j, we have  F possibilities for a j . So,
1≤ j≤d
altogether there exists precisely ( F)d ways to write all combinations.
Lemma 3.1.3 If char(F) = p then  F = pd , for some integer d ≥ 1.
Proof Consider elements 0, e, 2e, . . . , (p − 1)e. They form Z p , i.e. Z p ⊆ F. Then
 F = pd by Lemma 3.1.2.
Corollary 3.1.4 The number of elements in a finite field F must be q = ps where
p = char(F) and s ≥ 1 is a natural number.
From now on, unless otherwise stated, p stands for a prime and q = ps for a
prime power.
Lemma 3.1.5 (A freshman’s dream) If char(F) = p then for all a, b ∈ F and
integers n ≥ 1,
n n n
(a ± b) p = a p + (±b) p . (3.1.1)
Proof Use induction in n: for n = 1,
 
p
(a ± b) p = ∑ k ak (±b) p−k .
0≤k≤p
 
p
For 1 ≤ k ≤ p − 1, the value is a multiple of p and the corresponding term
k
vanishes. Therefore, (a ± b) p = a p + (±b) p . The inductive step is completed by the
n−1 n−1
same argument, with a and ±b replaced by a p and (±b) p .
Lemma 3.1.6 The multiplicative group F∗ of non-zero elements of a field F of
size q is isomorphic to the cyclic group Zq−1 .
Proof Observe that for any divisor d|(q − 1), group F∗ contains exactly φ (d)
elements of multiplicative order d where φ is Euler’s totient phi-function. (Recall
that φ (d) = {k : k < d, gcd(k, d) = 1}.) We’ll see that all elements of order d have
q−1
the form a d r where a is a primitive element, r ≤ d and r, d are co-prime. In fact,
q − 1 = ∑ φ (d), and F∗ will have at least one element of order q − 1 which
d:d|(q−1)
implies that F∗ is cyclic, of order q − 1.
Let a ∈ F∗ be an element of order d where d|(q − 1). Take the cyclic subgroup
{e, a, . . . , ad−1 }. Every element of this subgroup has multiplicative order dividing
d, i.e. is a root of the polynomial X d − e (a dth root of unity). But X d − e has
≤ d distinct roots in F (because F is a field). So, {e, a, . . . , ad−1 } is the set of all
roots of X d − e in F. In particular, each element from F of order d belongs to
3.1 A primer on finite fields 271

{e, a, . . . , ad−1 }. Observe that the cyclic group Zd has exactly φ (d) elements of
order d. So, the whole F∗ has exactly φ (d) elements of order d; in other words, if
ψ (d) is the number of elements in F of order d then either ψ (d) = 0 or ψ (d) = φ (d)
and
q−1 = ∑ ψ (d) ≤ ∑ φ (d) = q − 1,
d:d(n) d:d|n

which implies that for all d|n,


ψ (d) = φ (d).
Definition 3.1.7 A (multiplicative) generator of F∗ (i.e. an element of multiplica-
tive order q − 1) is called a primitive element of field F. Although such an element
is non-unique, we will usually single out one such element and denote it by ω ;
of course a power ω r where r is coprime with (q − 1) will also give a primitive
element.
If a ∈ F∗ with  F∗ = q − 1 then aq−1 = e (the order of every element divides the
order of the group). Hence, aq = a, i.e. a is a root of the polynomial X q − X in F.
But X q − X can have only ≤ q roots (including zero 0), so F gives the set of all
roots of X q − X.
Definition 3.1.8 Given fields K and F, with F ⊆ K, field K is called the splitting
field for a polynomial g(X) with coefficients from F if (a) K contains all roots of
g(X), (b) there is no field K with F ⊂ K ⊂ K satisfying (a). We will write Spl(g(X))
for the splitting field for g(X).
Thus, if  F = q then F contains all roots of polynomial X q −X and is the splitting
field for this polynomial.
Lemma 3.1.9 Any two splitting fields K, K for the same polynomial g(X) with
coefficients from F coincide.
Proof In fact, take the intersection K ∩ K : it contains F and is a subfield of both
K and K . It must then coincide with each of K, K .
Corollary 3.1.10 For any prime p and natural s ≥ 1, there exists at most one
field with ps elements.
Proof Each such field is splitting for polynomial X q − X with coefficients from
Z p and q = ps . So any two such fields coincide.
On the other hand, we will prove later the following.
Theorem 3.1.11 For any non-constant polynomial with coefficients from F, there
exists a splitting field.
272 Further Topics from Coding Theory

Corollary 3.1.12 For any prime p and natural s ≥ 1, there exists precisely one
field with ps elements.
Proof of Corollary 3.1.12 Take again the polynomial X q − X with coefficients from
Z p and q = ps . By Theorem 3.1.11, there exists the splitting field Spl(X q − X)
where X q − X = X(X q−1 − e) is factorised into linear polynomials. So, Spl(X q − X)
contains the roots of X q − X and has characteristic p (as it contains Z p ).
However, the roots of (X q − X) form a subfield: if aq = a and bq = b then (a ±
b) = aq + (±bq ) (Lemma 3.1.5) which coincides with a ± b. Also, (ab−1 )q =
q

aq (bq )−1 = ab−1 . This field cannot be strictly contained in Spl(X q − X) thus it
coincides with Spl(X q − X).
It remains to check that all roots of (X q − X) are distinct: then the cardinality
 Spl(X q − X) will be equal to q. In fact, if X q − X had a multiple root then it would
have had a common factor with its ‘derivative’ ∂X (X q − X) = qX q−1 − e. However,
qX q−1 = 0 in Spl(X q − X) and thus cannot have such factors.
Summarising, we have the two characterisation theorems for finite fields.
Theorem 3.1.13 All finite fields have size ps where p is prime and s ≥ 1 integer.
For all such p, s, there exists a unique field of this size.

The field of size q = ps will be denoted by Fq (a popular alternative notation is


GF(q) (a Galois field)). In the case of the simplest fields F p = {0, 1, . . . , p − 1} (for
p is prime) we use symbol 1 instead of e for the unit.
Theorem 3.1.14 All finite fields can be arranged into sequences (‘towers’). For
a prime p and positive integers s1 , s2 , . . .,
...
...
...

F ps1 s2 ...si

...

F ps1 s2

F ps1

Fp  Zp
Here each arrow is a uniquely defined injective homomorphism.
3.1 A primer on finite fields 273

Example 3.1.15 In Section 2.5 we worked with polynomial fields F2 [X]/"q(X)#


where q(X) is an irreducible binary polynomial; see Theorems 2.5.32 and 2.5.33.
Continuing Example 2.5.35(c), consider the field F16 realised as F2 [X]/"1 + X 3 +
X 4 #. The field structure is as follows:

power
vector
of X polynomial
(string)
mod 1 + X 3 + X 4
−− 0 0000
X0 1 1000
X X 0100
X2 X2 0010
X3 X3 0001
X4 1 + X3 1001
X5 1 + X + X3 1101 (3.1.2)
X6 1 + X + X2 + X3 1111
X7 1 + X + X2 1110
X8 X + X2 + X3 0111
X9 1 + X2 1010
X 10 X + X3 0101
X 11 1 + X2 + X3 1011
X 12 1+X 1100
X 13 X + X2 0110
X 14 X2 + X3 0011

Finally, if we choose to specify the table for F2 [X]/"1 + X + X 2 + X 3 + X 4 #,


the calculations will be considerably longer (and organised differently). The point
is that in this case monomial X will not be a primitive element, since X 5 = 1
mod (1 + X + X 2 + X 3 + X 4 ). Instead, a generator of the multiplicative group will
be a sum of monomials, viz. 1 + X.

Worked Example 3.1.16 (a) How many elements are in the smallest extension
of F5 which contains all roots of polynomials X 2 + X + 1 and X 3 + X + 1?
(b) Determine the number of subfields of F1024 , F729 . Find all primitive elements
of F7 , F9 , F16 . Compute (ω 10 + ω 5 )(ω 4 + ω 2 ) where ω is a primitive element of
F16 .

Solution (a) Clearly, 56 .


(b) F1024 = F210 has 4 subfields: F2 , F4 , F32 and F1024 . F729 = F36 has 4 subfields:
F3 , F9 , F27 and F729 . F7 has 2 primitive elements: ω , ω 5 (with (ω 5 )5 = ω ). F9 has
274 Further Topics from Coding Theory

4 primitive elements, of the form ω , ω 3 , ω 5 , ω 7 . F16 has 8 primitive elements: ω ,


ω 2 , ω 4 , ω 7 , ω 8 , ω 11 , ω 13 , ω 14 .
By using the table for F2 [X]/"1 + X + X 4 # (see Example 2.5.35(c)), we can find
that

(ω 10 + ω 5 )(ω 4 + ω 2 ) = ω 14 + ω 9 + ω 12 + ω 7
= 1001 + 0101 + 1111 + 1101 = 1110
= ω 10 .

However, by taking ω = ω 7 , the RHS becomes

ω 8 + ω 3 + ω 9 + ω 4 = 1010 + 0001 + 0101 + 1100 = 0010 = ω 2 = (ω )11 .

From now on we will focus on polynomial representations of finite fields. Gen-


eralising concepts introduced in Section 2.5, consider

Definition 3.1.17 The set of all polynomials with coefficients from Fq is a com-
mutative ring denoted by Fq [X]. A quotient ring Fq [X]/"g(X)# is where the opera-
tion is modulo a fixed polynomial g(X) ∈ Fq [X].

Definition 3.1.18 A polynomial g(X) ∈ Fq [X] is called irreducible (over Fq ) if it


admits no representation
g(X) = g1 (X)g2 (X)

with g1 (X), g2 (X) ∈ Fq [X].

A generalisation of Theorem 2.5.32 is presented in Theorem 3.1.19 below.

Theorem 3.1.19 Let g(X) ∈ Fq [X] have degree deg g(X) = d . Then
Fq [X]/"g(X)# is a field Fqd iff g(X) is irreducible.

Proof Let g(X) be an irreducible polynomial over Fq . To show that Fq [X]/"g(X)#


is a field we should check that each non-zero element f (X) ∈ Fq [X]/"g(X)# has an
inverse. Consider the set F( f ) of polynomials of the form f (X)h(X) mod g(X)
where h(X) ∈ Fq [X]/"g(X)# (the principal ideal generated by f (X)). If F( f ) con-
tains the unity e ∈ Fq (the constant polynomial equal to e) then the corresponding
h(X) = f (X)−1 . If not, the map h(X) → f (X)h(X) mod g(X), from Fq [X]/"g(X)#
to itself, is not a surjection. That is, f (X)h1 (X) = f (X)h2 (X) mod g(X) for some
distinct h1 (X), h2 (X), i.e.
 
f (X) h1 (X) − h2 (X) = r(X)g(X).
3.1 A primer on finite fields 275
 
Then either g(X)| f (X) or g(X)| h1 (X) − h2 (X) as g(X) is irreducible. So, ei-
ther f (X) = 0 mod g(X) (a contradiction) or h1 (X) = h2 (X) mod g(X). Hence,
Fq [X]/"g(X)# is a field.
The inverse assertion is proved similarly: if g(X) is reducible then Fq [X]/"g(X)#
contains non-zero g1 (X), g2 (X) with g1 (X)g2 (X) = 0. Then Fq [X]/"g(X)# cannot
be a field.

The dimension Fq [X]/"g(X)# : Fq is equal to d, the degree of g(X), so


Fq [X]/"g(X)# = Fqd .
Worked Example 3.1.20  Prove that
 g(X) has an inverse in the polynomial ring
Fq [X]/"X N − e# iff gcd g(X), X N − e = e.

Solution Consider the map Fq [X]/"X N − e# → Fq [X]/"X N − e# given by h(X) →


h(X)g(X) mod (X N − e). If it is a surjection then there exists h(X) with
h(X)g(X) = e and h(X) = g(X)−1 . Suppose it is not. Then there exist h(1) (X) =
h(2) (X) mod (X N − e) such that h(1) (X)g(X) = h(2) (X)g(X) mod (X N − e), i.e.
(h(1) (X) − h(2) (X))g(X) = s(X)(X N − e).
As (X N − e) | (h(1) (X) − h(2) (X)), this means that gcd(g(X), X N − e) = e.
Conversely, if gcd(g(X), X N − e) = d(X) = e then the equation h(X)g(X) = e
mod (X N − e) gives
h(X)g(X) = e + q(X)(X N − e)
where d(X)|LHS and d(X)|q(X)(X N − e)). Therefore, d(X)|e: a contradiction.
Hence, g(X)−1 does not exist.
Example 3.1.21 (Continuing Example 2.5.19) There are six irreducible binary
polynomials of degree 5:
1 + X 2 + X 5, 1 + X 3 + X 5, 1 + X + X 2 + X 3 + X 5,
1 + X + X 2 + X 4 + X 5, 1 + X + X 3 + X 4 + X 5, (3.1.3)
1 + X 2 + X 3 + X 4 + X 5.
Then there are nine irreducible polynomials of degree 6, and so on. Calculating
irreducible polynomials of a large degree is a demanding task, although extensive
tables of such polynomials are now available on the web.
We are now going to prove Theorem 3.1.11.
Proofof Theorem 3.1.11 The key fact is that any non-constant polynomial g(X) ∈
Fq [X] has a root in some extension of Fq . Without loss of generality, assume that
g(X) is irreducible, with deg g(X) = d. Take Fq [X]/"g(X)# = Fqd as an extension
field. In this field, g(α ) = 0 where α is polynomial X ∈ Fq [X]/"g(X)#, so g(X)
276 Further Topics from Coding Theory

has a root. We can divide g(X) by X − α in Fqd and use the same construction
to prove that g1 (X) = g(X)/(X − α ) has a root in some extension of Fqt ,t < d.
Finally, we obtain a field containing all d roots of g(X), i.e. construct the splitting
field Spl(g(X)).

Definition 3.1.22 Given a field F ⊂ K and an element γ ∈ K, we denote by


F(γ ) the smallest field containing F and γ (obviously, F ⊂ F(γ ) ⊂ K). Similarly,
F(γ1 , . . . , γr ) is the smallest field containing F and elements γ1 , . . . , γr ∈ K. For
F = Fq and α ∈ K, set
d−1

Mα ,F (X) = (X − α )(X − α q ) . . . X − α q , (3.1.4)

d
where d is the smallest positive integer such that α q = α (such a d exists as will
be proved in Lemma 3.1.24).
A monic polynomial is the one with the highest coefficient. The minimal poly-
nomial for α ∈ K over F is a unique monic polynomial Mα (X) (= Mα ,F (X)) ∈
F[X] such that Mα (α ) = 0 and Mα (X)|g(X) for each g(X) ∈ F[X] with g(α ) = 0.
When ω is a primitive element of K (generating K∗ ), Mω (X) is called a primitive
polynomial (over F). The order of a polynomial p(X) ∈ F[X] is the smallest n such
that p(X)|(X n − e).

Example 3.1.23 (Continuing Example 3.1.21.) In this example we deal with


polynomials over F2 . The irreducible polynomial X 2 + X + 1 is primitive and has
order 3. The irreducible polynomials X 3 + X + 1 and X 3 + X 2 + 1 are primitive and
of order 7. The polynomials X 4 + X 3 + 1 and X 4 + X + 1 are primitive and have
order 15 whereas X 4 + X 3 + X 2 + X + 1 is not primitive and of order 5. (It is helpful
to note that with d = 4, the order of X 4 +X 3 +1 and X 4 +X +1 equals 2d −1; on the
other hand, the order of element X in the field F2 [X]/"1+X +X 2 +X 3 +X 4 # equals
5, but its order, say, in the field F2 [X]/"1 + X + X 4 # equals 15.) All six polynomials
listed in (3.1.3) are primitive and have order 31 (i.e. appear in the decomposition
of X 31 + 1).

Lemma 3.1.24 Let Fq ⊂ Fqd and α ∈ Fqd . Let Mα (X) ∈ F[X] be the minimal
polynomial for α, of degree deg Mα (X) = d . Then:

(a) Mα (X) is the only irreducible polynomial in Fq [X] with a root at α.


(b) Mα (X) is the only monic polynomial in Fq [X] of degree d with a root at α.
(c) Mα (X) has the form (3.1.4).
3.1 A primer on finite fields 277

Proof Assertions (a), (b) follow from the definition. To prove (c), assume γ ∈ K is
a root of a polynomial f (X) = a0 + a1 X + · · · + ad X d from F[X], i.e. ∑ ai γ i = 0.
0≤i≤d
As aqi = ai (which is true for all a ∈ F) and by virtue of Lemma 3.1.5,
 q
 i q
f (γ ) = ∑ ai γ = ∑ ai γ =
q qi
∑ ai γ = 0,
i
0≤i≤d 0≤i≤d 0≤i≤d
 q 2
so γ q is a root. Similarly, γ q = γ q is a root, and so on.
2 s
For Mα (X) it yields that α , α q , α q , . . . are roots. This will end when α q = α
for the first time (which proves the existence of such an s). Finally, s = d as all
d−1 i j
α , α q , . . . , α q are distinct: if not then α q = α q where, say, i < j. Taking qd− j
d+i− j d
power of both sides, we get α q = α q = α . So, α is a root of polynomial
d+i− j
P(X) = X q − X, and Spl(P(X)) = Fqd+i− j . On the other hand, α is a root of an
irreducible polynomial of degree d, and Spl(Mα (X)) = Fqd . Hence, d|(d + i − j)
i
or d|(i − j), which is impossible. This means that all the roots α q , i < d, are
distinct.
Theorem 3.1.25 For any field Fq and integer d ≥ 1, there exists an irreducible
polynomial f (X) ∈ Fq [X] of degree d .

Proof Take a primitive element ω ∈ Fqd . Then Fq (ω ), the minimal extension of


Fq containing ω , coincides with Fqd . The dimension [Fq (ω ) : Fq ] of vector space
Fq (ω ) over Fq equals [Fqd : Fq ] = d. The minimal polynomial Mω (X) for ω over
d−1
Fq has distinct roots ω , ω q , . . . , ω q and therefore is of degree d.
Although proving irreducibility of a given polynomial is a problem with no gen-
eral solution, the number of irreducible polynomials of a given degree can be evalu-
ated by using an elegant (and not very complicated) method invoking the so-called
Möbius function.
Definition 3.1.26 The Möbius function μ on the set Z+ is given by
μ (1) = 1, μ (n) = 0 if n is divisible by a square of a prime number,
and
μ (n) = (−1)k if n is a product of k distinct prime numbers.
Theorem 3.1.27 The number Nq (n) of irreducible polynomials of degree n in
the polynomial ring Fq [X] is given by
1
n d:∑
Nq (n) = μ (d)qn/d . (3.1.5)
d|n
278 Further Topics from Coding Theory

For example, Nq (20) equals


1
μ (1)q20 + μ (2)q10 + μ (4)q5 + μ (5)q4 + μ (10)q2 + μ (20)q
20
1 20
= q − q10 − q4 + q2 .
20

Proof First, we establish the additive Möbius inversion formula. Let ψ and Ψ be
two functions from Z+ to an Abelian group G with an additive group operation.
Then the following equations are equivalent:

Ψ(n) = ∑ ψ (d) (3.1.6)


d|n

and
n
ψ (n) = ∑ μ (d)Ψ . (3.1.7)
d|n
d

This equivalence follows when we observe that (a) the sum ∑ μ (d) is equal to 0 if
d|n
n > 1 and to 1 if n = 1, and (b) for all n,
 
∑ μ (d)Ψ n/d = ∑ μ (d) ∑ ψ (c)
d: d|n d: d|n c: c|n/d
= ∑ ψ (c) ∑ μ (d) = ψ (n).
c: c|n d: d|n/c

To check (a), let p1 , . . . , pk be different prime factors in decomposition of n then


k
∑ μ (d) = μ (1) + ∑ μ (pi ) + · · · + μ (p1 . . . pk )
d|n
  i=1    
k k k
= 1+ (−1) + (−1)2 + · · · + (−1)k = 0.
1 2 k

Applying (3.1.7) to G = Z, the additive group of integer numbers, with ψ (n) =


nNq (n) and Ψ(n) = qn , gives (3.1.5).
n
Now, decompose the polynomial X q − X into the product of irreducible poly-
n
nomials. Then (3.1.6) holds true as the degree qn of X q − X coincides with the
sum of degrees of all irreducible polynomials whose degrees divide n. Indeed, we
n
simply write X q − X as the product of all irreducible polynomials and observe
that an irreducible polynomial enters the decomposition iff its degree divides n (cf.
Corollary 3.1.30).

Worked Example 3.1.28 Find all irreducible polynomials of degree 2 and 3 over
F3 and determine their orders.
3.1 A primer on finite fields 279

Solution Over F3 = {0, 1, 2} there are three irreducible polynomials of degree 2:


X 2 + 1, of order 4, with
(X 4 − 1)/(X 2 + 1) = X 2 − 1,
and X 2 + X + 2 and X 2 + 2X + 2, of order 8, with
(X 8 − 1)/(X 2 + X + 2)(X 2 + 2X + 2) = X 4 − 1.
Next, there exist (33 − 3)/3 = 8 irreducible polynomials over F3 of degree 3.
Four of them have order 13 (hence, are not primitive):
X 3 + 2X + 2, X 3 + X 2 + 2, X 3 + X 2 + X + 2, X 3 + 2X 2 + 2X + 2.
The remaining four have order 26 (hence, are primitive):
X 3 + 2X + 1, X 3 + X 2 + 2X + 1, X 3 + 2X 2 + 1, X 3 + 2X 2 + X + 1.
Indeed, if p(X) denotes the product of the first four polynomials then (X 13 −
1)/p(X) = X −1. On the other hand, if r(X) stands for the product of the remaining
four then (X 26 − 1)/r(X) equals
(X − 1)(X + 1)(X 3 + 2X + 2)(X 3 + X 2 + 2)
× (X 3 + X 2 + X + 2)(X 3 + 2X 2 + 2X + 2).

Theorem 3.1.29 If g(X) ∈ Fq [X] is irreducible and of degree d and α is a root


of g(X) then the splitting field Spl(g(X)) and the minimal extension Fq (α ) both
coincide with Fqd .
Proof We know that g(X) = Mα ,Fq (X) = irrα ,Fq (X) (by Lemma 3.1.24, as g(X)
is irreducible). We then have that Fq ⊂ Fq (α ) = Fqd ⊆ Spl(g(X)). It is left to check
that any root γ of g(X) lies in Fq (α ): this will imply that Spl(g(X)) ⊆ Fq (α ).
By Theorem 3.1.13, the unique Galois field with qd elements Fq (α ) = Fqd is the
splitting field qd − X), i.e. contains all roots of X qd − X (one of which is α ).
 Spl(X
d 
Then g(X)| X q − X as g(X) = Mα ,Fq (X). Therefore, all roots of g(X) are roots
d
of X q − X and hence lie in Fq (α ).
Corollary 3.1.30 Suppose
 n that g(X) ∈ Fq [X] is an irreducible polynomial of
degree d . Then g(X)| X q − X iff d|n.
 qn 
Proof We have the splitting fields Spl(g(X))
 = F q d and Spl X − X = Fqn . By
n
Theorem 3.1.29, Spl(g(X)) ⊆ Spl X − X iff d|n.
q
 n 
n
Now if g(X)| X q − X , each root of g(X) is a root of X q − X . Then
n
Spl(g(X)) ⊆ Spl X q − X and hence d|n.
280 Further Topics from Coding Theory
 n 
Conversely,
 qn  if d|n, i.e.
 Spl(g(X))
 ⊆ Spl X q − X , then each root of g(X) lies
 in
n qn − X , so
Spl X − X . But Spl X q − X is precisely
 the set
 nof the roots
 of X
n
each root of g(X) is that of X q − X . Then g(X)| X q − X .
Theorem 3.1.31 If g(X) ∈ Fq [X] is an irreducible polynomial of degree d and
d−1
α ∈ Spl(g(X)) = Fqd [X] is its root then all the roots of g(X) are α , α q , . . . , α q .
d
Furthermore, d is the smallest positive integer such that α q = α.
d−1
Proof As in the proof of Lemma 3.1.24, α , α q , . . . , α q are distinct roots. Thus
all the roots are listed and d is the smallest positive integer with the above property.

Corollary 3.1.32 All roots of an irreducible polynomial g(X) ∈ Fq [X] with


deg g(X) = d have in Spl(g(X)) the same multiplicative order dividing qd − 1, and
it gives the order of polynomial g(X) (see Definition 3.1.22).
The order of irreducible polynomial g(X) will be denoted by ord(g(X)).
Worked Example 3.1.33 (a) Prove that for natural n, q such that lcm(n, q) = 1
there exists a natural s such that n|(qs − 1).
(b) Prove that if a polynomial g(X) ∈ F2 [X] is irreducible then g(X)|(X n − 1) iff
ord(g(X))|n.

Solution (a) Set ql − 1 = nal + bl where bl ≤ n and l = 1, 2, . . .. By the pigeon-hole


principle, bl1 = bl2 for some l1 < l2 . Then n|ql1 (ql2 −l1 − 1). Owing to the condition
lcm(n, q) = 1, n|(qs − 1) with s = l2 − l1 .
(b) For an irreducible g(X), the order ord g(X) was introduced in Definition 3.1.22:
ord(g(X)) = min[n : g(X)|(X n − 1)].
First, our goal is to check that if m = ord(g(X)) then m|n iff g(X)|(X n − 1). In-
deed, suppose m|n: n = mr. Then X n − 1 = (X m − 1)(1 + X m + · · · + X m(r−1) ). As
g(X)|(X m − 1), this implies g(X)|(X n − 1).
Conversely, if g(X)|(X n − 1) then the roots of α1 , . . . , αd of g(X) are among
those of X n − 1 in Spl(X n − 1). So, α m
j = α j = 1 in Spl(X − 1), 1 ≤ j ≤ d. Write
n n

n = mb + a where 0 ≤ a < m. Then α nj = α bm j α j = α j = 1, i.e. each α j is a root of


a a

X a − 1. Hence, if a > 0 then g(X)|(X a − 1): a contradiction. So, a = 0 and m|n.


Calculating an irreducible polynomial g(X) ∈ Fq [X] with a given root α ∈ Fqn ,
in particular, the minimal polynomial Mα ,Fq (X), is not easy. This is because the
relation between q, n, α and d = deg Mα (X) is complicated. However, if α = ω
is a primitive element of Fqn then d = n as ω q −1 = e, ω q = ω and n is the least
n n

positive integer with this property. In this case Mω (X) = ∏b∈Fqn (X − b).
3.1 A primer on finite fields 281

For a general irreducible polynomial, the notion of conjugacy is helpful: see Def-
inition 3.1.34 below. This concept was introduced (and used) informally in Section
2.5 for fields F2s .

Definition 3.1.34 Elements α , α ∈ Fqn are called conjugate over Fq if


Mα ,Fq (X) = Mα ,Fq (X).

Summarising what was said above, we deduce the following assertion.


d−1
Theorem 3.1.35 The conjugates of α ∈ Fqn over Fq are α, α q , . . . , α q ∈ Fqn
 j
where d is as before. In particular, ∏ X − α q has all its coefficients in Fq
0≤ j≤d−1
and is a unique irreducible polynomial from Fq [X] with a root at α. It is also a
unique monic polynomial of minimum degree in Fq [X] with a root at α.

Worked Example 3.1.36 Continuing Worked Example 3.1.28, we identify F16


with F2 (ω ), the smallest field containing a root ω of a primitive polynomial of
order 4. So, if we choose 1 + X + X 4 , ω will satisfy ω 4 = 1 + ω, and if we choose
1 + X 3 + X 4 , ω will satisfy ω 4 = 1 + ω 3 . In both cases, the conjugates are ω, ω 2 ,
ω 4 and ω 8 .
Correspondingly, the table in (3.1.2) will take the form

1 + X + X4 1 + X3 + X4
power vector vector
of ω (word) (word)
−− 0000 0000
0 1000 1000
1 0100 0100
2 0010 0010
3 0001 0001
4 1100 1001
5 0110 1101 (3.1.8)
6 0011 1111
7 1101 1110
8 1010 0111
9 0101 1010
10 1110 0101
11 0111 1011
12 1111 1100
13 1011 0110
14 1001 0011
282 Further Topics from Coding Theory

Under the left table addition rule, the minimal polynomial Mω i (X) for the power
ω i is 1 + X + X 4 for i = 1, 2, 4, 8 and 1 + X 3 + X 4 for i = 7, 14, 13, 11, while for
i = 3, 6, 12, 9 it is 1 + X + X 2 + X 3 + X 4 and for i = 5, 10 it is 1 + X + X 2 . Under the
right table addition rule, we have to swap polynomials 1 + X + X 4 and 1 + X 3 + X 4 .
Polynomials 1 + X + X 4 and 1 + X 3 + X 4 are of order 15, polynomial 1 + X + X 2 +
X 3 + X 4 is of order 5 and 1 + X + X 2 of order 3.
 4
A short way to produce these answers is to find the expression for ω i as a
 2  3
linear combination of 1, ω i , ω i and ω i . For example, from the left table we
have for ω 7 :
 7 4
ω = ω 28 = ω 3 + ω 2 + 1,
 7 3
ω = ω 21 = ω 3 + ω 2 ,
 4  3
and readily see that ω 7 = 1 + ω 7 , which yields 1 + X 3 + X 4 . For complete-
 2
ness, write down the unused expression for ω 7 :
 7 2
ω = ω 14 = ω 12 ω 2 = (1 + ω )3 ω 2 = (1 + ω + ω 2 + ω 3 )ω 2
= ω 2 + ω 3 + ω 4 + ω 5 = ω 2 + ω 3 + 1 + ω + (1 + ω )ω = 1 + ω 3 .

For Mω 5 (X) the ‘standard’ approach gives a shortcut:

Mω 5 (X) = (X − ω 5 )(X − ω 10 ) = X 2 + (ω 5 + ω 10 )X + ω 15 = X 2 + X + 1.

So, the full list of minimal polynomials for F16 is

Mω 0 (X) = 1 + X, Mω (X) = 1 + X + X 4 ,
Mω 3 (X) = 1 + X + X 2 + X 3 + X 4 ,
Mω 5 (X) = 1 + X + X 2 , Mω 7 (X) = 1 + X 3 + X 4 .

Example 3.1.37 For the field F32  F2 [X]/"1 + X 2 + X 5 #, the addition table is
calculated below. The minimal polynomials are

(i) 1 + X 2 + X 5 for conjugates {ω , ω 2 , ω 4 , ω 8 , ω 16 },


(ii) 1 + X 2 + X 3 + X 4 + X 5 for {ω 3 , ω 6 , ω 12 , ω 24 , ω 17 },
(iii) 1 + X + X 2 + X 4 + X 5 for {ω 5 , ω 10 , ω 20 , ω 9 , ω 18 },
(iv) 1 + X + X 2 + X 3 + X 5 for {ω 7 , ω 14 , ω 28 , ω 25 , ω 19 },
(v) 1 + X + X 3 + X 4 + X 5 for {ω 11 , ω 22 , ω 13 , ω 26 , ω 21 },
(vi) 1 + X 3 + X 5 for {ω 15 , ω 30 , ω 29 , ω 27 , ω 23 }.
3.1 A primer on finite fields 283

All minimal polynomials have order 31.


power vector power vector
of ω (word) of ω (word)
−− 00000 15 11111
0 10000 16 11011
1 01000 17 11001
2 00100 18 11000
3 00010 19 01100
4 00001 20 00110
5 10100 21 00011
(3.1.9)
6 01010 22 10101
7 00101 23 11110
8 10110 24 01111
9 01011 25 10011
10 10001 26 11101
11 11100 27 11010
12 01110 28 01101
13 00111 29 10010
14 10111 30 01001
Definition 3.1.38 An automorphism of Fqn over Fq (in short, an (Fqn , Fq )-
automorphism) is a bijection σ : Fqn → Fqn with: (a) σ (a + b) = σ (a) + σ (b);
(b) σ (ab) = σ (a)σ (b); (c) σ (c) = c, for all a, b ∈ Fqn , c ∈ Fq .
Theorem 3.1.39 The set of (Fqn , Fq )-automorphisms is isomorphic to the cyclic
group Zn and generated by the Frobenius map σq (a) = aq , a ∈ Fqn .
Proof Let ω ∈ Fqn be a primitive element. Then ω q −1 = e and Mω (X) ∈ Fq [X]
n

2 n−1
has roots ω , ω q , ω q , . . . , ω q . An (Fqn ; Fq )-automorphism τ fixes the coefficients
j
of Mω (X), thus it permutes the roots, and τ (ω ) = ω q for some j, 0 ≤ j ≤ n−1. But
j
as ω is primitive, τ is completely determined by τ (ω ). Then as σq j (ω ) = ω q =
τ (ω ), we have that τ = σq j .
The rest of this section is devoted to a study of roots of unity, i.e. the roots of the
polynomial X n − e over field Fq where q = ps and p =char (Fq ). Without loss of
generality, we suppose from now on that
gcd(n, q) = 1, i.e. n and q are co-prime. (3.1.10)
Indeed, if n and q are not co-prime, we can write n = mpk . Then, by Lemma 3.1.5
k k
X n − e = X mp − e = (X m − e) p ,
and our analysis is reduced to the polynomial X m − e.
284 Further Topics from Coding Theory

Definition 3.1.40 The roots of polynomial (X n − e) ∈ Fq [X] in the splitting field


Spl (X n − e) = Fqs are called the nth roots of unity over Fq (or the (n, Fq )-roots of
unity). The set of all (n, Fq )-roots of unity is denoted by E(n) . It turns out that the
value s is the least integer s ≥ 1 such that qs ≡ 1 mod n (cf. Theorem 3.1.44 below).
This fact is reflected in denoting the value s by ordn (q) and calling it the order of
q mod n.
Under assumption (3.1.10), there is no multiple root (as the derivative ∂X
(X n − e) = nX n−1 does not have roots in Spl(X n − e) = Fqs ). Thus,  E(n) = n.
Theorem 3.1.41 E(n) is a cyclic subgroup of F∗qs .
Proof Suppose α , β ∈ E(n) . Then (αβ −1 )n = α n (β n )−1 = e, i.e. αβ −1 ∈ E(n) .
So, E(n) is a subgroup of the cyclic group F∗qs and so is cyclic.
Definition 3.1.42 A generator of group E(n) (i.e. an nth root of unity whose
multiplicative order equals n) is called a primitive (n, Fq )-root of unity; it will be
denoted by β .
Corollary 3.1.43 There are precisely φ (n) primitive (n, Fq )-roots of unity. In
particular, primitive (n, Fq )-roots of unity exist for any n co-prime to q.
This allows us to calculate s in the splitting field Fqs = Spl(X n − e). If β is a
primitive (n, Fq )-root of unity then its multiplicative order equals n. As ω = 0, we
have that ω ∈ Fqr if β q = β , i.e. β q −1 = e. This happens iff n|(qr − 1). But s is
r r

the least r with Fqr ! ω .


Theorem 3.1.44 Spl(X n − e) = Fqs , where s = ordn (q) is the least integer ≥ 1
for which n|(qs − 1), i.e. the least integer s ≥ 1 with qs ≡ 1 mod n.
It is instructive to stress similarities and differences between primitive elements
and primitive (n, Fq )-roots of unity in field Fqs with s = ordn (q). A primitive field
element, ω , generates the multiplicative cyclic group F∗qs : F∗qs = {e, ω , . . . , ω q −2 };
s

its multiplicative order equals qs − 1. A primitive root of unity, β , generates the


multiplicative cyclic group E(n) : E(n) = {e, β , . . . , β n−1 }; its multiplicative order
equals n. [On the other hand, β generates Fqs as a field element: Fqs = Fq (β ) =
Fq (E(n) ).] This suggests that β = ω happens iff n = qs − 1. In fact, let us ask under
what condition a power ω k is a primitive nth root of unity. As was established in
Worked Example 3.1.33 this happens when n|(qs − 1), i.e. qs − 1 = nr. In fact, if
k ≥ 1 is such that
gcd(k, nr) = gcd(k, qs − 1) = r
then element ω k is a primitive nth root of unity as its multiplicative order equals
qs − 1 nr
= = n.
gcd(k, qs − 1) r
3.1 A primer on finite fields 285

This holds when k = ru and u is co-prime with n. Conversely, if ω k is a primitive


root of unity then gcd(k, qs − 1) = (qs − 1)/n. Hence we obtain the following.

Theorem 3.1.45 Let P(n) be the set of the primitive (n, Fq )-roots of unity and T(n)
the set of primitive elements in Fqs = Spl(X n − e). Then either (i) P(n) ∩ T(n) = 0/
or (ii) P(n) = T(n) ; case (ii) occurs iff n = qs − 1.

Now we can factorise polynomial (X n − e) over Fq by taking the product of the


distinct minimal polynomials for the (n, Fq )-roots of unity:
 
X n − e = lcm Mβ (X) : β ∈ E(n) . (3.1.11)

If we begin with a primitive element ω ∈ Fqs where s = ordn (q) then β = ω (q −1)/n
s

is a primitive root of unity and E(n) = {e, β , . . . , β n−1 }.


This enables us to calculate the minimal polynomial Mβ i (X). For all i =
d−1
0, . . . , n − 1, the conjugates of β i are β i , β iq , . . . , β iq where d(= d(i)) is the least
positive integer for which β iq = β i , i.e. β iq −i = e. This is equivalent to n|(iqd − i),
d d

i.e. iqd = i mod n. Therefore,


     d−1

Mi (X) = Mβ i (X) = X − β i X − β iq · · · X − β iq . (3.1.12)

Definition 3.1.46 The set of exponents i, iq, . . . , iqd−1 where d(= d(i)) is the
minimal positive integer such that iqd = i mod n is called a cyclotomic coset (for i)
and denoted by Ci (= Ci (n, q)) (alternatively, Cω i is defined as the set of non-zero
d−1
field elements ω i , ω iq , . . . , ω iq ).

Worked Example 3.1.47 Check that polynomials X 2 + X + 2 and X 3 + 2X 2 + 1


are primitive over F3 and compute the field tables for F9 and F27 generated by these
polynomials.

Solution The field F9 is isomorphic to F3 [X]/"X 2 + X + 2#. The multiplicative


powers of ω ∼ X are

ω 2 ∼ 2X + 1, ω 3 ∼ 2X + 2, ω 4 ∼ 2,
ω 5 ∼ 2X, ω 6 ∼ X + 2, ω 7 ∼ X + 1, ω 8 ∼ 1.

The cyclotomic coset of ω is {ω , ω 3 } (as ω 9 = ω ). Then the minimal polynomial

Mω (X) = (X − ω )(X − ω 3 ) = X 2 − (ω + ω 3 )X + ω 4
= X 2 − 2X + 2 = X 2 + X + 2.

Hence, X 2 + X + 2 is primitive.
286 Further Topics from Coding Theory

Next, F27  F3 [X]/"X 3 + 2X 2 + 1#, and with ω ∼ X, we have

ω 2 ∼ X 2 , ω 3 ∼ X 2 + 2, ω 4 ∼ X 2 + 2X + 2, ω 5 ∼ 2X + 2,
ω 6 ∼ 2X 2 + 2X, ω 7 ∼ X 2 + 1, ω 8 ∼ X 2 + X + 2,
ω 9 ∼ 2X 2 + 2X + 2, ω 10 ∼ X 2 + 2X + 1, ω 11 ∼ X + 2,
ω ∼ X 2 + 2X, ω 13 ∼ 2, ω 14 ∼ 2X, ω 15 ∼ 2X 2 , ω 16 ∼ 2X 2 + 1,
12

ω 17 ∼ 2X 2 + X + 1, ω 18 ∼ X + 1, ω 19 ∼ X 2 + X,
ω ∼ 2X 2 + 2, ω 21 ∼ 2X 2 + 2X + 1, ω 22 ∼ X 2 + X + 1,
20

ω 23 ∼ 2X 2 + X + 2, ω 24 ∼ 2X + 1, ω 25 ∼ 2X 2 + X, ω 26 ∼ 1.

The cyclotomic coset of ω in F27 is {ω , ω 3 , ω 9 }. Consequently, the primitive poly-


nomial
Mω (X) = (X − ω )(X − ω 3 )(X − ω 9 )
= X 3 − (ω + ω 3 + ω 9 )X 2 + (ω 4 + ω 10 + ω 12 )X − ω 13
= X 3 + 2X 2 + 1

as required.

Worked Example 3.1.48 (a) Consider the polynomial X 15 − 1 over F2 (with


n = 15, q = 2). Then ω = 2, s = ord15 (2) = 4 and Spl(X 15 − 1) = F24 = F16 .
The polynomial g(X) = 1 + X + X 4 is primitive: any of its roots β are primitive
in F16 . So, the primitive (15, F2 )-root of unity is
4 −1)/15
β = ω (2 = ω.

Hence, the roots of X 15 − 1 are 1, β , . . . , β 14 . The minimal polynomials for them


have been calculated in Worked Example 3.1.36. So, we have the factorisation

X 15 − 1 = (1 + X)(1 + X + X 4 )(1 + X + X 2 + X 3 + X 4 )
×(1 + X + X 2 )(1 + X 3 + X 4 ).

(b) Knowing the cyclotomic cosets we can show that a particular factorisation of
X n − e contains irreducible factors. Explicitly, take the polynomial X 9 − 1 over F2
(with n = 9, q = 2). There are three cyclotomic cosets:

C0 = {0},C1 = {1, 2, 4, 8, 7, 5},C3 = {3, 6};

the corresponding minimal polynomials are of degree 1, 6 and 2, respectively:

1 + X, 1 + X 3 + X 6 and 1 + X + X 2 .

This yields
X 9 − 1 = (1 + X)(1 + X + X 2 )(1 + X 3 + X 6 ).
3.1 A primer on finite fields 287

(c) Let us check primitivity of the polynomial

f (X) = 1 + X + X 6

over F2 , with n = 6, q = 2. Here, 26 − 1 = 63 = 32 · 7. As 63/3 =


21, 32 | ord( f (X)) ⇔ X 21 − 1 = 0 mod (1 + X + X 6 ). But X 21 = 1 + X + X 3 +
X 4 + X 5 = 1 mod (1 + X + X 6 ), so 32 | ord( f (X)).
Next, as 63/7 = 9, 7| ord( f (X)) ⇔ X 9 −1 = 0 mod (1+X +X 6 ). But X 9 = 1+
X 3 + X 4 = 1 mod (1 + X + X 6 ), so 7| ord( f (X)). Therefore, ord( f (X)) = 63, and
f (X) is primitive. Theorem 3.1.53 below shows that any irreducible polynomial of
order 63 has degree 6 as 26 = 1 mod 63.
(d) Now consider the polynomial

g(X) = 1 + X + X 2 + X 4 + X 6 ,

again over F2 (here n = 6 and q = 2, as before). Again 32 | ord(g(X)) ⇔ X 21 = 1


mod (1 + X + X 2 + X 4 + X 6 ). However, in F2

X 21 − 1 = (1 + X)(1 + X + X 2 )(1 + X + X 3 )(1 + X 2 + X 3 )


× (1 + X + X 2 + X 4 + X 6 )(1 + X 2 + X 4 + X 5 + X 6 ).

Hence, X 21 − 1 = 0 mod (1 + X + X 2 + X 4 + X 6 ) = 1, and so 32 does not divide


ord(g(X)).
Next, 3| ord(g(X)) ⇔ X 7 = 1 mod (1 + X + X 2 + X 4 + X 6 ). As X 7 = X + X 2 +
X 3 + X 5 = 1 mod (1 + X + X 2 + X 4 + X 6 ), 3 is a divisor for ord(g(X)).
Finally, 7| ord(g(X)) ⇔ X 9 = 1 mod (1 + X + X 2 + X 4 + X 6 ), and as X 9 = 1 +
X 2 +X 4 = 1 mod (1+X +X 2 +X 4 +X 6 ), 7 divides ord(g(X)). So, ord(g(X)) = 21.

Let us summarise results about minimal polynomials and roots of unity. We


know from Theorem 3.1.25 that for all integers d ≥ 1 and for all q = pd , where
p is prime and s ≥ 1 integer, there exists a primitive polynomial of degree d, say
Mω (X), where ω is a primitive element in the field Fqd . On the other hand, for all
irreducible polynomials f (X) ∈ Fq [X] of degree d, the roots of f (X) lie in the field
Spl( f (X)) = Fqd and have the same multiplicative order ord( f (X)).

Theorem 3.1.49 Let polynomial f (X) ∈ Fq [X] be irreducible, of degree d , and


ord( f (X)) = . Then:

(a) |(qd − 1),


 
(b)  f (X)| X  − e ,
 
(c) |n iff f (X)| X n − e ,
 
(d)  is the least positive integer such that f (X)| X  − e .
288 Further Topics from Coding Theory
d −1
Proof (a) Spl( f (X)) = Fqd , hence every root α of f (X) is a root of X q − e. So,
it has ord(α )|(qd − 1).

 α of f (X) in Spl( f (X)) has ord(α ) =  and hence is a root of (X −e).
(b) Each root 

So, f (X)| X − e .
 
(c) If f (X)| X n − e then each root of f (X) is a root of X n − e, i.e. ord(α )|n. So,
|n. Conversely, if n = k then (X  − e)|(X k − e) and f (X)|(X n − e) by (b).
(d) Follows from (c).

Theorem 3.1.50 If f (X) ∈ Fq [X] is an irreducible polynomial of degree d and


order  then d = ord (q).

Proof If α ∈ Fqd has f (α ) = 0 then by Theorem 3.1.29, Fq (α ) = Fqd =


Spl( f (X)). But α is also a primitive (, Fq )-root of unity, so Fq (α ) = Fq (E() ) =
Spl(X  − e) = Fqs where s = ord (q). Hence, d = ord (q).

Worked Example 3.1.51 Use the Frobenius map σ : a → aq to prove that every
element a ∈ Fqn has a unique q j th root, for j = 1, . . . , n − 1.
Suppose that q = ps is odd. Show that exactly a half of the non-zero elements of
Fq have square roots.

Solution The Frobenius map σ : a → aq is a bijection Fqn → Fqn . So, for all b ∈ Fqn
j
there exists unique a with aq = b (the qth root). The jth power iteration σ j : a → aq
j
is also a bijection, so again for all b ∈ Fqn there exists unique a with aq = b.
j
Observe that for all c ∈ Fq , c1/q = c.
Now take τ : a → a2 , a multiplicative homomorphism F∗q → F∗q . If q is odd then

Fq  Zq−1 has an even number of elements q − 1. We want to show that if τ (a) =
b then τ −1 (b) consists of two elements, a and −a. In fact, τ (−a) = b. Also, if
τ (a ) = b then τ (a a−1 ) = e.
So, we want to analyse τ −1 (e). Clearly, ±e ∈ τ −1 (e). On the other hand, if ω
is a primitive element then τ (ω (q−1)/2 ) = ω q−1 = e and τ −1 (e) consists of e = ω 0
and ω (q−1)/2 . So, ω (q−1)/2 = −e.
Now if τ (a a−1 ) = e then a a−1 = ±e and a = ±a. Hence, τ sends precisely two
elements, a and −a, into the same image, and its range τ (F∗q ) is a half of F∗q .

Theorem 3.1.52 (cf. [92], Theorem 3.46.) Let polynomial p(X) ∈ Fq [X] be irre-
ducible, of degree n. Set m = gcd(d, n). Then m|n and p(X) factorises over Fqd into
m irreducible polynomials of degree n/m each. Hence, p(X) is irreducible over Fqd
iff m = 1.
Theorem 3.1.53 (cf. [92], Theorem 3.5.) Let gcd(d, q) = 1. The number of monic
irreducible polynomials of order  and degree d equals φ ()/d if  ≥ 2, and the
3.1 A primer on finite fields 289

degree d = ord (q), equals 2 if  = d = 1, equals 0 in all other cases. In particu-


lar, the degree of an order  irreducible polynomial always equals ord (q), i.e. the
minimal s such that qs = 1 mod . Here φ () is the Euler totient function.
The proofs of Theorems 3.1.52 and 3.1.53 are omitted (see [92]). We only
make a short comment about Theorem 3.1.53. If p(0) = 0, the order of irreducible
polynomial p(X) of degree d coincides with the order of any of its roots in the
multiplicative group F∗qd . So, the order is  iff d = ord (q) and p(X) divides the
so-called circular polynomial
Q (X) = ∏ (X − ω s ).
s:gcd(s,)=1

In fact, the circular polynomial could be decomposed into a product of irreducible


polynomials, all of degree d = ord (q), and their number equals φ ()/d. (In the
case d =  = 1 the polynomial p(X) = X should be accounted for as well.)

Concluding this section, we give short summaries of the facts of the theory of
finite fields discussed above.

Summary 1.55. A field is a ring such that its non-zero elements form a commuta-
tive group under multiplication. (i) Any finite field F has the number of elements
q = ps where p is prime, and the characteristic char(F) = p. (ii) Any two finite
fields with the same number of elements are isomorphic. Thus, for a given q = ps ,
there exists, up to isomorphism, a unique field of cardinality q; such a field is de-
noted by Fq (it is often called a Galois field of size q). When q is prime, the field
Fq is isomorphic to the additive cyclic group Z p of p elements, equipped with mul-
tiplication mod p. (iii) The multiplicative group F∗q of non-zero elements from Fq
is isomorphic to the additive cyclic group Zq−1 of q − 1 elements. (iv) Field Fq
contains Fr as a subfield iff r|q; in this case Fq is isomorphic to a linear space over
(i.e. with coefficients from) Fr , of dimension log p (q/r). So, each prime number
p gives rise to an increasing sequence of finite fields F ps , s = 1, 2, . . . An element
ω ∈ Fq generating the multiplicative group F∗q is called a primitive element of Fq .

Summary 1.56. The polynomial ring over Fq is denoted by Fq [X]; if the polyno-
mials are considered mod g(X), Fq [X], the correspond-
a fixed polynomial from
ing ring is denoted by Fq [X] "g(X)#. (i) Ring Fq [X] "g(X)# is a field iff g(X)
is irreducible over Fq (i.e. does not admit a decomposition g(X) = g1 (X)g2 (X)
where deg(g1 (X)), deg(g2 (X)) < deg(g(X))). (ii) For any q and a positive integer
d there exists an irreducible polynomial g(X) over Fq of degree d. (iii) If g(X) is
irreducible
and deg g(X) = d then the cardinality of field Fq [X] "g(X)# is qd , i.e.
Fq [X] "g(X)# is isomorphic to Fqd and belongs to the same series of fields as Fq
(that is, char(Fqd ) = char(Fq )).
290 Further Topics from Coding Theory

Summary 1.57. An extension of a field Fq by a finite family of elements α1 , . . . , αu


(contained in a larger field from the same series) is the smallest field containing Fq
and αi , 1 ≤ i ≤ u. Such a field is denoted by Fq (α1 , . . . , αu ). (i) For any monic
polynomial p(X) ∈ Fq [X] there exists a larger field Fq from the same series as Fq
such that p(X) factors over Fq :
 
p(X) = ∏ X − α j , u = deg p(X), α1 , . . . , αu ∈ Fq . (3.1.13)
1≤ j≤u

The smallest field Fq with this property (i.e. field Fq (α1 , . . . , αu )) is called a split-
ting field for p(X); we also say that p(X) splits over Fq (α1 , . . . , αu ). The splitting
field for p(X) is denoted by Spl(p(X)); an element α ∈ Spl(p(X)) takes part in de-
composition (3.1.13) iff p(α ) = 0. Field Spl(p(X)) is described as the set {g(α j )}
where j = 1, . . . , u, and g(X) ∈ Fq [X] are polynomials of degree < deg(p(X)). (ii)
Field Fq is splitting for the polynomial X q − X. (iii) If polynomial p(X) of de-
gree d is irreducible over Fq and α is a root of p(X) in field Spl(p(X)) then Fqd
 Fq [X] "p(X)# is isomorphic to Fq (α ) and all the roots of p(X) in Spl(p(X))
2 d−1
are given by the conjugate elements α , α q , α q , . . . , α q . Thus, d is the small-
d
est positive integer for which α q = α . (iv) Suppose that, for a given field Fq ,
a monic polynomial p(X) ∈ Fq [X] and an element α from a larger field we have
p(α ) = 0. Then there exists a unique minimal polynomial Mα (X) with the property
that Mα (α ) = 0 (i.e. such that any other polynomial p(X) with p(α ) = 0 is divided
by Mα (X)). Polynomial Mα (X) is the unique irreducible polynomial over Fq van-
ishing at α . It is also the unique polynomial of the minimum degree vanishing at α .
We call Mα (X) the minimal polynomial of α over Fq . If ω is a primitive element
of Fqd then Mω (X) is called a primitive polynomial for Fqd over Fq . We say that
elements α , β ∈ Fqd are conjugate over Fq if they have the same minimal poly-
d−1
nomial over Fq . Then (v) the conjugates of α ∈ Fqd over Fq are α , α q , . . . , α q ,
d
where d is the smallest positive integer with α q = α . When α = ω i where ω
is a primitive element, the congugacy class is associated with a cyclotomic coset
d−1
Cω i = {ω i , ω iq , . . . , ω iq }.

Summary 1.58. Now assume that n and q = ps are co-prime and take polynomial
X n − e. The roots of X n − e in the splitting field Spl(X n − e) are called nth roots
of unity over Fq . The set of all nth roots of unity is denoted by En . (i) Set En is
a cyclic subgroup of order n in the multiplicative group of field Spl(X n − e). An
nth root of unity generating En is called a primitive nth root of unity. (ii) If Fqs is
Spl(X n − e) then s is the smallest positive integer with n|(qs − 1). (iii) Let Πn be the
set of primitive nth roots of unity over field Fq and Φn the set of primitive elements
of the splitting field Fqs = Spl(X s − e). Then either Πn ∩ Φn = 0/ or Πn = Φn , the
latter happening iff n = qs − 1.
3.2 Reed–Solomon codes. The BCH codes revisited 291

3.2 Reed–Solomon codes. The BCH codes revisited


From now on we consider finite fields Fq up to isomorphism, but from time to
time refer to a specific field table (e.g. by specifying F ps as F p [X]/"P(X)# where
P(X) ∈ F p [X] is an irreducible polynomial of degree s).
In Definition 2.5.37 we introduced narrow-sense binary BCH codes. Our study
will continue in this section with general q-ary BCH codes Xq,N, BCH
δ ,ω ,b , of length
N, designed distance δ and zeros ω b , . . . , ω b+δ −1 ; see in Definition 3.2.7 below.
Prior to that, we discuss an interesting special class of BCH codes formed by the
Reed–Solomon (RS) codes; as we shall see, their analysis is facilitated by the fact
that the RS codes are MDS (maximum distance separable).
Definition 3.2.1 Given q ≥ 3, a q-ary Reed–Solomon code is defined as a cyclic
code of length N = q − 1 with the generator

g(X) = (X − ω b )(X − ω b+1 ) . . . (X − ω b+δ −2 ), (3.2.1)

where δ and b are integers, 1 ≤ δ , b < q − 1, and ω is a primitive element of Fq


(or
 equivalently, a primitive Nth root of unity). Such a code is denoted by X RS
= Xq,RS
δ ,ω ,b .

According to Definition 3.2.7, the RS code is identified as Xq,q−1,


BCH
δ ,ω ,b , i.e. as a
q-ary BCH code of length q − 1 and designed distance δ . There are no reasonable
binary RS codes, as in this case the length q − 1 = 1. Observe that q − 1 gives the
number of non-zero elements in the alphabet field Fq . Moreover, for N = q − 1 we
have
X N − e = X q−1 − e = ∏ (X − α )
α ∈F∗q

(as the splitting field Spl (X q − X) is Fq ). Furthermore, owing to the fact that ω is a
primitive (q − 1, Fq ) root of unity (or, equivalently, a primitive element of Fq ), the
minimal polynomial Mi (X) is just X − ω i , for all i = 0, . . . , N − 1.
An important property is that the RS codes are MDS. Indeed, the generator g(X)
δ ,ω ,b has deg g(X) = δ − 1. Hence, the rank k is given by
of Xq,RS

δ ,ω ,b ) = N − deg g(X) = N − δ + 1.
k = dim(Xq,RS (3.2.2)

By the generalised BCH bound (see Theorem 3.2.9 below), the minimal distance
 
δ ,ω ,b ≥ δ = N − k + 1.
d Xq,RS

But the Singleton bound states that d(X RS ) ≤ N − k + 1. Hence,

ω ,b ) = N − k + 1 = δ .
RS
d(Xq,d, (3.2.3)
292 Further Topics from Coding Theory

Thus the RS codes have the largest possible minimal distance among all q-ary
codes of length q − 1 and dimension k = q − δ . Summarising, we obtain

Theorem 3.2.2 δ ,ω ,b is MDS and has distance δ and rank q − δ .


The code Xq,RS

The dual of a BCH code is not always BCH. However,

Theorem 3.2.3 The dual of an RS code is an RS code.



δ ,ω ,b ) = Xq,q−δ ,ω ,b+δ −1 .
Proof The proof is straightforward, as (Xq,RS RS

Theorem 3.2.4 Let X RS be a [N, k, δ ] RS code. Then its parity-check extension


is a [N + 1, k, δ + 1] code, with distance one more than that of X RS .

Proof Let c(X) = c0 + c1 X + · · · + cN−1 X N−1 ∈ X RS , with weight w(c(X)) = δ .


Its extension is c(X) = c(X) + cN X N , with cN = − ∑ ci = −c(e). We want to
0≤i≤N−1
show that c(e) = 0 and hence w( c(X)) = δ + 1.
To simplify notation assume that b = 1 and let g(X) = (X − ω )(X − ω 2 ) . . . (X −
ω δ −1 ) be the generator of X RS . Write c(X) = g(X)p(X) for some p(X), yielding
that c(e) = p(e)g(e). Clearly, g(e) = 0, as ω i = e for all i = 1, . . . , δ −1. If p(e) = 0,
the polynomial g1 (X) = (X − e)g(X) divides c(X). Then c(X) ∈ "g1 (X)# where
g1 (X) = (X −e)(X − ω ) . . . (X − ω δ −1 ). That is, "g1 (X)# is BCH, with the designed
distance ≥ δ + 1. But this contradicts the choice of c(X).

RS codes admit specific (and elegant) encoding and decoding procedures. Let
X RS be an [N, k, δ ] RS code, with N = q − 1. For a message string a0 . . . ak−1 set
a(X) = ∑ ai X i and encode a(X) as c(X) = ∑ a(ω j )X j . To show that
0≤i≤k−1 0≤ j≤N−1
c(X) ∈ X RS , we have to check that c(ω ) = · · · = c(ω δ −1 ) = 0. Think of a(X) as
a polynomial ∑ ai X i with ai = 0 for i ≥ k, and use
0≤i≤N−1

Lemma 3.2.5 Let a(X) = a0 + a1 X + · · · + aN−1 X N−1 ∈ Fq [X] and ω be a prim-


itive (N, Fq ) root of unity over Fq , N = q − 1. Then
1
ai = ∑ a(ω j )ω −i j .
N 0≤ j≤N−1
(3.2.4)

We postpone the proof till after Lemma 3.2.12.


Indeed, by Lemma 3.2.5
1 1 1
ai = ∑ a(ω j )ω −i j = N c(ω −i ) = N c(ω N−i ),
N 0≤ j≤N−1
3.2 Reed–Solomon codes. The BCH codes revisited 293

so c(ω j ) = NaN− j . For 0 ≤ j ≤ δ − 1 = N − k, c(ω j ) = NaN− j = 0. Therefore,


c(X) ∈ X RS . In addition, the original message is easy to recover from c(X): ai =
N c(ω
1 N−i ).

To decode the received word u(X) = c(X) + e(X), write


ui = ci + ei = ei + a(ω i ), 0 ≤ i ≤ N − 1.
Then obtain
u0 = e0 + a0 + a1 + · · · + ak−1 ,
u1 = e1 + a0 + a1 ω + · · · + ak−1 ω k−1 ,
u2 = e2 + a0 + a1 ω 2 + · · · + ak−1 ω 2(k−1) ,
..
.
uN−1 = eN−1 + a0 + a1 ω N−1 + · · · + ak−1 ω (N−1)(k−1) .
If there are no errors, i.e. e0 = · · · = eN−1 = 0, any k of these equations can
be solved in the k unknowns a0 , . . . , ak−1 , as the corresponding matrix is Vander-
monde. In fact, any subsystem of k equations can be solved for any error vector (it
is a different matter if the solution will give the correct string a0 , . . . , ak−1 or not).
Now suppose that t errors have occurred, t < N − k. Call the equations with
ei = 0 good and ei = 0 bad, then we have  t bad  and N − t good ones. If we solve
N −t
all subsystems of k equations then the subsystems consisting of k good
k
equations will give the correct values of the ai s. Moreover, a given incorrect solu-
tion cannot satisfy any set of k good equations; it can satisfy at most k − 1 correct
equations. In addition, it can satisfy at most t incorrect
t+k−1equations.
 So, it is a solu-
tion to ≤ t + k − 1 equations, i.e. can be obtained ≤ k times from subsystems
of k equations. Hence, if
   
N −t t +k−1
> ,
k k
the majority solution from among (Nk ) solutions gives the true values of the ai s. The
last inequality holds iff N −t > t + k − 1, i.e. δ = N − k + 1 > 2t. Therefore we get:
Theorem 3.2.6 For a [N, k, δ ] RS code X RS , the majority
  logic decoding cor-
N
rects up to t < δ /2 errors, at the cost of having to solve systems of equations
k
of size k × k.
Reed–Solomon codes were discovered in 1960 by Irving S. Reed and Gustave
Solomon, both working at that time in the Lincoln Laboratory of MIT. When their
joint article was published, an efficient decoding algorithm for these codes was
294 Further Topics from Coding Theory

not known. Such an algorithm solution for the latter was found in 1969 by El-
wyn Berlekamp and James Massey, and is known since as the Berlekamp–Massey
decoding algorithm (cf. [20]); see Section 3.3. Later on, other algorithms were
proposed: continued fraction algorithm and Euclidean algorithm (see [112]).
Reed–Solomon codes played an important role in transmitting digital pictures
from American spacecraft throughout the 1970s and 1980s, often in combination
with other code constructions. These codes still figure prominently in modern space
missions although the advent of turbo-codes provides a much wider choice of cod-
ing and decoding procedures.
Reed–Solomon codes are also a key component in compact disc and digital game
production. The encoding and decoding schemes employed here are capable of cor-
recting bursts of up to 4000 errors (which makes about 2.5mm on the disc surface).

Definition 3.2.7 A BCH code Xq,N, δ ,ω ,b with parameters q, N, ω , δ and b is the


BCH

q-ary cyclic code XN = "g(X)# with length N, designed distance δ , such that its
generating polynomial is
 
g(X) = lcm Mω b (X), Mω b+1 (X), . . . , Mω b+δ −2 (X) , (3.2.5)

i.e.
*
Xq,N,
BCH = f (X) ∈ Fq [X] mod (X N − 1) :
δ ,ω ,b +
f (ω b+i ) = 0, 0 ≤ i ≤ δ − 2 .

If b = 1, this is a narrow sense BCH code. If ω is a primitive Nth root of unity, i.e. a
primitive root of the polynomial X N − 1, the BCH code is called primitive. (Recall
that under condition gcd(q, N) = 1 these roots form a commutative multiplicative
group which is cyclic, of order N, and ω is a generator of this group.)

The BCH code Xq,N, δ ,ω ,b has minimum distance ≥ δ .


Lemma 3.2.8 BCH

Proof Without loss of generality consider a narrow sense code. Set the parity-
check (δ − 1) × N matrix
⎛ ⎞
1 ω ω2 ... ω N−1
⎜1 ω 2 ω4 ... ω 2(N−1) ⎟
⎜ ⎟
H = ⎜. .. . .. ⎟.
⎝. . . . . . ⎠
1 ω δ −1 ω 2( δ −1) ... ω ( δ −1)(N−1)

The codewords of X are linear dependence relations between the columns of H.


Then Lemma 2.5.40 implies that any δ − 1 columns of H are linearly independent.
In fact, select columns with top (row 1) entries ω k1 , . . . , ω kδ −1 where 0 ≤ k1 < · · · <
kδ −1 ≤ N − 1. They form a square (δ − 1) × (δ − 1) matrix
3.2 Reed–Solomon codes. The BCH codes revisited 295
⎛ ⎞
ω k1 · 1 ω k2 · 1 ... ω kδ −1 · 1
⎜ ω k1 · ω k1 ω 2 ·ω 2
k k ... ω δ −1 · ω kδ −1 ⎟
k
⎜ ⎟
D=⎜ .. .. .. .. ⎟
⎝ . . . . ⎠
ω k1 · ω k1 (δ −2) ω k2 · ω k2 (δ −2) . . . ω kδ −1 · ω kδ −1 (δ −2)

that differs from the Vandermonde matrix by factors ω ks in front of the sth column.
Then the determinant of D is the product
 
 1 1 ... 1 
 
 
δ −1  ω 1 k ω 2 k ... ω δ −1 
k
det D = ∏ ω ks  .. .. .. .. 
 . . . . 
s=1
 
ω k1 (δ −2) α k2 (δ −2) . . . ω kδ −1 (δ −2) 
   
δ −1  k 
= ∏ω ks × ∏ ω −ω i k j = 0,
s=1 i> j

and any δ − 1 columns of H are indeed linearly independent. In turn, this means
that any non-zero codeword in X has weight at least δ . Thus, X has minimum
distance ≥ δ .

Theorem 3.2.9 (A generalisation of the BCH bound) Let ω be a primitive N th


root of unity and b ≥ 1, r ≥ 1 and δ > 2 integers, with gcd(r, N) = 1. Consider a
cyclic code X = "g(X)# of length N where g(X) is a monic polynomial of small-
est degree with g(ω b ) = g(ω b+r ) = · · · = g(ω b+(δ −2)r ) = 0. Prove that X has
d(X ) ≥ δ .

Proof As gcd(r, N) = 1, ω r is a primitive root of unity. So, we can repeat the


proof given above, with b replaced by bru where ru is found from ru + Nv = 1. An
alternative solution: the matrix N × (δ − 1)
⎛ ⎞
1 1 ... 1
⎜ ωb ω b+r . . . ω b+(δ −2)r ⎟
⎜ ⎟
⎜ ω 2b ω 2(b+r) . . . ω 2(b+(δ −2)r) ⎟
⎜ ⎟
⎜ . . . . ⎟
⎝ .. .
. . .
. . ⎠
ω (N−1)b ω (N−1)b+r ... ω (N−1)(b+( δ −2)r)

checks the code X = "g(X)#. Take any of its (δ − 1) × (δ − 1) submatrices, say,


with rows i1 < i2 < · · · < iδ −1 . Denote it by D = (D jk ). Then

det D = ∏ ω (il −1)b det(ω r(i j −1)(δ −2) )


1≤l≤δ −1

= ∏ ω (il −1)b det (Vandermonde) = 0,


1≤l≤δ −1

because gcd(r, N) = 1. So, d(X) ≥ δ .


296 Further Topics from Coding Theory

Worked Example 3.2.10 Let ω be a primitive n-root of unity in an extension


field of Fq and a(X) = ∑ ai X i be a polynomial of degree at most n − 1. The
0≤i≤n−1
Mattson–Solomon polynomial is defined by
n
aMS (X) = ∑ a(ω j )X n− j . (3.2.6)
j=1

Let q = 2 and a(X) ∈ F2 [X]/"X n −1#. Prove that the Mattson–Solomon polynomial
aMS (X) is idempotent, i.e. aMS (X)2 = aMS (X) in F2 [X]/"X n − 1#.

Solution Let a(X) = ∑ ai X i , then nai = aMS (ω i ), 0 ≤ i ≤ n − 1, by Lemma


0≤i≤n−1
3.2.5. In F2 , (nai )2 = nai , so aMS (ω i )2 = aMS (ω i ). For polynomials, write b(2) (X)
for the square in F2 [X] and b(X)2 for the square in F2 [X]/"X n − 1#:

b(2) (X) = c(X)(X n − 1) + b(X)2 .

Then
(2)
aMS (X) X=ω i = (aMS (X) X=ω i )2 = aMS (X) X=ω i
= aMS (X)2 X=ω i ,

i.e. polynomials aMS (X) and aMS (X)2 agree at ω 0 = e, ω , . . . , ω n−1 . Write this in
the matrix form, with aMS (X) = a0,MS + a1,MS X + · · · + an−1,MS X n−1 , aMS (X)2 =
a 0,MS X + · · · + a n−1,MS X n−1 :
⎛ ⎞
e e ... e
⎜ e ω . . . ω n−1 ⎟
(2) ⎜ ⎟
(aMS − aMS ) ⎜ .. .. . .. ⎟ = 0.
⎝ . . . . . ⎠
2
e ω n−1 . . . ω (n−1)
As the matrix is Vandermonde, its determinant is

∏ (ω j − ω i ) = 0,
0≤i< j≤n−1

(2)
and aMS = aMS . So, aMS (X) = aMS (X)2 .
Definition 3.2.11 Let v = v0 v1 . . . vN−1 be a vector over Fq , and let ω be a prim-
itive (N, Fq ) root of unity over Fq . The Fourier transform of the vector v is the
vector V = V0V1 . . .VN−1 with components given by
N−1
Vj = ∑ ω i j vi , j = 0, . . . , N − 1. (3.2.7)
i=0
3.2 Reed–Solomon codes. The BCH codes revisited 297

Lemma 3.2.12 (The inversion formula) The vector v is recovered from its Fourier
transform V by the formula

1 N−1 −i j
vi = ∑ ω Vj.
N j=0
(3.2.8)

Proof In any field X N − 1 = (X − 1)(X N−1 + · · · + X + 1). As the order of ω is N,


for any r, ω r is a zero of LHS. Hence for all r = 0 mod N, ω r is a zero of the last
term, i.e.
N−1
∑ ω r j = 0 mod N.
j=0

On the other hand, for r = 0


N−1
∑ ω r j = N mod p
j=0

which is not zero if N is not a multiple of the field characteristic p. But q − 1 =


ps − 1 is a multiple of N, so N is not a multiple of p. Hence, N = 0 mod p. Finally,
change the order of summation to obtain that

1 N−1 −i j 1 N−1 N−1



N j=0
ω V j = ∑ vk ∑ ω (k−i) j = vi .
N k=0 j=0

Proof of Lemma 3.2.5 Let a(X) = a0 + a1 X + · · · + aN−1 X N−1 ∈ Fq [X] and ω be


a primitive (N, Fq ) root of unity over Fq . Then write

N −1 ∑ a(ω j )ω −i j = N −1 ∑ ∑ ak ω jk ω −i j
0≤ j≤N−1 0≤ j≤N−1 0≤k≤N−1

=N −1
∑ ak ∑ ω j(k−i)
=N −1
∑ ak N δki = ai .
0≤k≤N−1 0≤ j≤N−1 0≤k≤N−1

Here we used the fact that, for 1 ≤  ≤ N − 1, ω  = 1, and

∑ ω j = ∑ (ω  ) j = (e − (ω  )N )(e − ω  )−1 = 0.
0≤ j≤N−1 0≤ j≤N−1

Hence
1
ai = ∑ a(ω j )ω −i j .
N 0≤ j≤N−1
(3.2.9)
298 Further Topics from Coding Theory

Worked Example 3.2.13 Give an alternative proof of the BCH bound: Let ω be a
primitive (N, Fq ) root of unity and b ≥ 1 and δ ≥ 2 integers. Let XN = "g(X)# be a
cyclic code where g(X) ∈ Fq [X]/"X N −e# is a monic polynomial of smallest degree
having ω b , ω b+1 , . . . , ω b+δ −2 among its roots. Then XN has minimum distance at
least δ .

Solution Let a(X) = ∑ a j X j ∈ XN satisfy condition g(X)|a(X) and


0≤ j≤N−1
a(ω i ) = 0 for i = b, . . . , b + δ − 2. Consider the Mattson–Solomon polynomial
cMS (X) for a(X):
cMS (X) = ∑ a(ω −i )X i = ∑ a(ω N−i )X i
0≤i≤N−1 0≤i≤N−1
= ∑ a(ω )X
i N−i
1≤i≤N

= ∑ a(ω j )X N− j + 0 + · · · + 0 (from ω b , . . . , ω b+δ −2 )


1≤ j≤b−1

+ a(ω b+δ −1 )X N−b−δ +1 + · · · + a(ω N ). (3.2.10)


Multiply by X b−1 and group:
X b−1 cMS (X) = a(ω )X N+b−2 + · · · + a(ω b−1 )X N
+a( ω b+δ −1 )X N−δ + · · · + a(ω N
)X b−1
= X N a(ω )X b−2 + · · · + a(ω b−1 )

+ a(ω b+δ −1 )X N−δ + · · · + a(ω N )X b−1


= X N p1 (X) + q(X)
= (X N − e)p1 (X) + p1 (X) + q(X).
We see that cMS (ω i ) = 0 iff p1 (ω i ) + q(ω i ) = 0. But p1 (X) + q(X) is a polynomial
of degree ≤ N − δ so it has at most N − δ roots. Thus, cMS (X) has at most N − δ
roots of the form ω i .
Therefore, the inversion formula (3.2.8) implies that the weight w(a(X)) (i.e. the
weight of the coefficient string a0 . . . aN−1 ) obeys
w(a(X)) ≥ N − the number of roots of cMS (X) of the form ω i . (3.2.11)
That is,
w(a(X)) ≥ N − (N − δ ) = δ .

We finish this section with a brief discussion of the Guruswami–Sudan decoding


algorithm, for list decoding of Reed–Solomon codes. First, we have to provide an
alternative description of the Reed–Solomon codes (as Reed and Solomon have
3.2 Reed–Solomon codes. The BCH codes revisited 299

done it in their joint paper). For brevity, we take the value b = 1 (but will be able
to extend the definition to values of N > q − 1).
Given N ≤ q, let S = {x1 , . . . , xN } ⊂ Fq be a set of N distinct points in Fq (a
supporting set). Let Ev denote the evaluation map
Ev : f ∈ Fq [X] → Ev( f ) = ( f (x1 ), . . . , f (xN )) ∈ FNq (3.2.12)
and take
L = { f ∈ Fq [X] : deg f < k}. (3.2.13)
Then the q-ary Reed–Solomon code of length N and dimension k can be defined as
X = Ev(L); (3.2.14)
A B
d −1
it has the minimum distance d = d(X ) = N − k + 1 and corrects up to er-
2
rors. The encoding of a source message u = u0 . . . uk−1 ∈ Fkq consists in calculating
the values of the polynomial f (X) = u0 + u1 X + · · · + uk X k−1 at points xi ∈ S.
Definition 3.2.1 (where X was defined as the set of polynomials c(X) =
∑ cl X l ∈ Fq [X] with c(ω ) = c(ω 2 ) = · · · = c(ω δ −1 ) = 0) emerges when
0≤l<q−1
N = q − 1, k = N − δ + 1 = q − δ , the supporting set S = {e, ω , . . . , ω N−1 } and
the coefficients c0 , c1 , . . . , cN−1 are related to the polynomial f (X) by
ci = f (ω i ), 0 ≤ i ≤ N − 1.
This determines uniquely the coefficients fl in the representation f (X) =
∑ fl X l , via the discrete inverse Fourier transform relation
0≤l<N

N fl = c(ω N−l ), or N fN−l−1 = c(ω l+1 ), l = 0, . . . , N − 1,


guaranteeing, in particular, that fk = · · · = fN−1 = 0.
Given f ∈ Fq [X] and y = y1 . . . yN ∈ FNq , set
dist ( f , y) = ∑ 1( f (xi ) = yi ).
1≤i≤N
B A
d −1
Now assume y = y1 . . . yN is a received word and set t = . The above-
2
mentioned ‘conventional’ decoding algorithms (the Berlekamp–Massey algorithm,
the continued fractions algorithm and the Euclidean algorithm) follow the same
principle: the algorithm either finds a unique f such that dist ( f , y) ≤ t or reports
that such f does not exist. On the other hand, given s > t, list decoding attempts to
find all f with dist ( f , y) ≤ s; the hope is that if we are lucky, the codeword with
this property will be unique, and we will be able to correct s errors, exceeding the
‘conventional’ limit of t errors.
300 Further Topics from Coding Theory

This idea goes back to Shannon’s bounded distance decoding: upon receiving
a word y, you inspect the Hamming balls around y until you encounter a closest
codeword (or a collection of closest codewords) to y. Of course, we want two
things: that (i) when we take s ‘moderately’ larger than t, the chance of finding
two or more codewords within distance s is small, and (ii) the algorithm has a
reasonable computational complexity.
Example 3.2.14 The [32, 8] RS code over F32 has d = 25 and t = 12. If we take
s = 13, the Hamming ball about the received word y may contain two codewords.
However, assuming that all error vectors e of weight 13 are equally likely, the
probability of this event is 2.08437 × 10−12 .
The Guruswami–Sudan list decoding algorithm (see [59]) performs the task of
finding the codewords within distance s for t ≤ s ≤ tGS in a polynomial time. Here
L$ M
tGS = n − 1 − (k − 1)n ,

and tGS can considerably exceed t.


In the above example, tGS = 17. Asymptotically, for RS codes of rate R, the
conventional decoding algorithms will correct
√ a fraction (1 − R)/2 of errors, while
the GS algorithm can correct up to 1 − R. The expected number of codewords in
a ball of radius s ≤ tGS (under the assumption of error-vector equidistribution) can
also be assessed.
The Guruswami–Sudan algorithm works not only for the RS codes. In the origi-
nal GS paper, the algorithm was shown to perform well for several classes of codes;
later on it was extended to cover the RM codes as well (see [7]).

3.3 Cyclic codes revisited. Decoding the BHC codes


Let us begin afresh. As before, we assume that gcd(N, q) = 1 (so if q = 2, N is odd),
and write words x ∈ HN,q as x0 . . . xN−1 . Remind that a linear code X ⊆ HN is
called cyclic if, for all x = x0 . . . xN−1 ∈ X , the cyclic shift π x = xN−1 x0 . . . xN−2 ∈
X . With each word c = c0 . . . cN−1 we associate a polynomial c(X) ∈ Fq [X]:
c(X) = c0 + c1 X + · · · + cN−1 X N−1 .
The map c ↔ c(X) is an isomorphism between X and a linear subspace of Fq [X].
Writing c(X) ∈ X simply means that the coefficient string c0 . . . cN−1 ∈ X .
Lemma 3.3.1 The code X is cyclic iff its image under the above isomorphism
is an ideal in the quotient ring Fq [X]/"X N − e#.
Proof Cyclic shift corresponds to multiplying a polynomial c(X) by X. Hence,
multiplication by any polynomial preserves X .
3.3 Cyclic codes revisited. Decoding the BHC codes 301

It is fruitful to think of X as the ideal in Fq [X]/"X N − e# and consider all poly-


nomials mod (X N − e). Moreover, Fq [X]/"X N − e# is a principal ideal ring: each
its ideal is of the form
"g(X)# = { f (X) : f (X) = g(X)h(X), h(X) ∈ Fq [X]/"X N − e#} (3.3.1)
where g(X) is a fixed polynomial.
Theorem 3.3.2 If the code X ⊆ HN,q is cyclic then there exists a unique monic
polynomial g(X) ∈ X such that:
(i) X = "g(X)#;
(ii) g(X) has the minimum degree among all polynomials f (X) ∈ X . Further-
more,
(a) g(X)|(X N − e),
(b) if deg g(X) = d then dim X = N − d ,
(c) X = { f (X) : f (X) = g(X)h(X), h(X) ∈ Fq [X], deg h(X) < N − d},
(d) if g(X) = g0 + g1 X + g2 X 2 + · · · + gd X d , with gd = e, then g0 = 0 and
⎛ ⎞
g0 g1 g2 . . . gd 0 0 ... 0
⎜ 0 g0 g1 . . . gd−1 gd 0 . . . 0 ⎟
G=⎜ ⎝


... ...
0 0 0 ... g0 g1 . . . gd
is a generating matrix for X , with row i being the cyclic shift of row
i − 1, i = 2, . . . , N − d .
Conversely, for any polynomial g(X)|(X N − e), the set "g(X)# =
{ f (X) : f (X) = g(X)h(X), h(X) ∈ Fq [X]/"X N − e#} is an ideal in
Fq [X]/"X N − e#, i.e. a cyclic code X , and the above properties (b)–(d) hold.
Proof Take g(X) ∈ F2 [X] a non-zero polynomial of the least degree in X . Take
p(X) ∈ X and write
p(X) = q(X)g(X) + r(X), with deg r(X) < deg g(X).
Then r(X) mod (X N − 1) belongs to X . This contradicts the choice of g(X) unless
r(X) = 0. Therefore, g(X)|p(x) which proves (i). Taking p(X) = X N −1 proves (ii).
Finally, if g(X) and g(X) both satisfy (i) and (ii) then g(X)|g(X) and g(X)|g(X),
implying g(X) = g(X).
Corollary 3.3.3 The cyclic codes of length N are in a one-to-one correspondence
with factors of X N − e. In other words, the map
* + * +
cyclic codes of length N → divisors of X N − 1 ,
X → g(X),
is a bijection.
302 Further Topics from Coding Theory

With the identification


* +
F2 [X]/"(X N − 1)# = f ∈ F2 [X] : deg( f ) < N = FN2

the cyclic codes become ideals in the polynomial ring F2 [X]/"(X N − 1)#. They are
in a one-to-one correspondence with the ideals in F2 [X] containing polynomial
X N − 1. Because F2 [X] is a Euclidean domain, all ideals in F2 [X] are principal, i.e.
of the form { f (X)g(X) : f (X) ∈ F2 [X]}. In fact, all ideals in F2 [X]/"(X N − 1)# are
also principal ideals.

Definition 3.3.4 The polynomial g(X) is called the minimal degree generator
(or simply the generator) of the cyclic code X . The ratio h(X) = (X N − e)/g(X),
of degree N − deg g(X), is called the check polynomial for the cyclic code X =
"g(X)#.

Example 3.3.5 X − e generates the parity-check code {x : ∑ xi = 0} and e + X +


i
· · · + X N−1 the repetition code {a . . . a, a ∈ Fq }; X ≡ e generates X = H .

Worked Example 3.3.6 (a) A cyclic code X = "g(X)# of length N is called


reversible if c0 . . . cN−1 ∈ X implies cN−1 . . . c0 ∈ X . Prove that X is reversible
iff g(α ) = 0 implies g(α −1 ) = 0.
(b) A cyclic code is called degenerate if, for some r|N , each codeword c ∈ X is a
concatenation c c · · · c of N/r copies of some string c of length r. Prove that X
is degenerate iff its check polynomial h(X)|(X r − 1).

[Hint: Prove that the generating polynomial g(X) = a(X) 1 + X r + X 2r + · · · +
X N−r . ]

Solution (a) If the code X = "g(X)# is reversible and g = g0 . . . gN−k 0 . . . 0


then X N−1 g(X −1 ) ∼ 0 . . . 0gN−k . . . g0 ∈ X , i.e. X N−1 g(X −1 ) = g(X)q(X). Thus,
if g(α ) = 0 then α N−1 g(α −1 ) = 0, i.e. g(α −1 ) = 0.
Conversely, g(α ) = 0 implies g(α −1 ) = 0. Suppose that c(X) ∈ X then
g(X)|c(X). Moreover, X N−1 c(X −1 ) has all zeros of g(X) among its roots, and so
belongs to X . But X N−1 c(X −1 ) ∼ cN−1 . . . c0 , so X is reversible.
(b) The condition g = a . . . a means g(X) = a(X)(e + X r + X 2r + · · · + X N−r ). On
the other hand,

X N − e = (X r − e)(X N−r + · · · + X r + e) = h(X)g(X).

Thus, if X = "g(X)# is degenerate then X r − e = h(X)a(X), i.e. h(X)|(X r − e).


3.3 Cyclic codes revisited. Decoding the BHC codes 303

Conversely, if h(X)|(X r − e) then X r − e = a(X)h(X) and


X N − e = (X r − e)(X N−r + · · · + X r + e)
= h(X)a(X)(X N−r + · · · + X r + e).
Then
g(X) = a(X)(X N−r + · · · + X r + e),
i.e. g = a . . . a . Furthermore, any c(X) ∈ X is of the form c(X) = q(X)g(X) where
deg q(X) ≤ N − deg g(X). Write
c(X) = q(X)g(X) = a(X)q(X)(X N−r + · · · + X r + e);
we conclude that deg a(X)q(X) < r (after multiplying by X N−r the degree cannot
exceed N − 1). Then c = c . . . c is the concatenation c . . . c where c ∼ a(X)q(X).

Worked Example 3.3.7 Show that Hamming’s [7, 4] code is a cyclic code with
check polynomial X 4 + X 2 + X + 1. What is its generator polynomial? Does Ham-
ming’s original code contain a subcode equivalent to its dual?

Solution In F72 we have


X 7 − 1 = (X 3 + X + 1)(X 4 + X 2 + X + 1).
The cyclic code with generator g(X) = X 3 + X + 1 has check polynomial h(X) =
X 4 + X 2 + X + 1. The parity-check matrix of the code is
⎛ ⎞
1 0 1 1 1 0 0
⎝ 0 1 0 1 1 1 0⎠ .
0 0 1 0 1 1 1
The columns of this matrix are the non-zero elements of F32 . So, it is equivalent to
Hamming’s [7, 4] code.
The dual of Hamming’s [7, 4] code has generator polynomial X 4 + X 3 + X 2 + 1
(the reverse of h(X)). Since X 4 + X 3 + X 2 + 1 = (X + 1)g(X), it is a subcode of
Hamming’s [7, 4] code.
Worked Example 3.3.8 Let ω be a primitive N th root of unity. Let X = "g(X)#
be a cyclic code of length N . Show that the dimension dim (X ) equals the number
of powers ω j such that g(ω j ) = 0.

Solution Denote E(N) = {ω , ω 2 , . . . , ω N = e}, dim"g(X)# = N − d, d = deg g(X).


But g(X) = ∏ (X − ω i j ) where ω i1 , . . . , ω id are the zeros of "g(X)#. Hence, the
1≤ j≤d
remaining N − d roots of unity ω l satisfy the condition g(ω l ) = 0.
304 Further Topics from Coding Theory

It is important to note that the generator polynomial of a cyclic code X = "g(X)#


is not unique. In particular, there exists a unique polynomial i(X) ∈ X such that
i(X)2 = i(X) and X = "i(X)# (an idempotent generator).

Theorem 3.3.9 If X1 = "g1 (X)# and X2 = "g2 (X)# are cyclic codes with gener-
ators g1 (X) and g2 (X) then

(a) X1 ⊂ X2 iff g2 (X)|g1 (X),


(b) X1 ∩ X2 = "lcm (g1 (X), g2 (X))#,
(c) X1 |X2 = "gcd (g1 (X), g2 (X))#.

Theorem 3.3.10 Let h(X) be the check polynomial for X . Then

(a) X = { f (X): f (X)h(X) = 0 mod (X N − e)},


(b) if h(X) = h0 + h1 X + · · · + hN−r X N−r then the parity-check matrix H of X is
⎛ ⎞
hN−r hN−r−1 . . . h1 h0 0 0 ... 0
⎜ 0 hN−r . . . . . . h1 h0 0 . . . 0 ⎟
H =⎜⎝ ...
⎟,
... ... ... ... ... ... ... ... ⎠
0 0 . . . hN−r hN−r−1 . . . . . . . . . h0

(c) the dual code X ⊥ is a cyclic code of dim X ⊥ = r, and X ⊥ = "g⊥ (X)#, where
g⊥ (X) = h−1
0 X
N−r h(X −1 ) = h−1 (h X N−r + h X N−r−1 + · · · + h
0 0 1 N−r ).

The generator g(X) of a cyclic code is specified, in terms of factorisation of


XN − e, as a ‘sub-product’,
 
X N − e = lcm Mω (X) : ω ∈ E(N) , (3.3.2)

of some minimal polynomials Mω (X). A convenient way is to characterise a cyclic


code via roots of g(X). If ω is a root of Mω (X) in an extension field Fq (ω ) then
Mω (X) is the minimal polynomial for ω over Fq . For any polynomial f (X) ∈ Fq [X]
we have f (ω ) = 0 iff f (X) = a(X)Mω (X), and if in addition f (X) ∈ Fq [X]/"X N −
e# then f (ω ) = 0 iff f (X) ∈ "Mω (X)#. Hence we get

Theorem 3.3.11 Let g(X) = q1 (X) . . . qt (X) be a product of irreducible factors


of X N − e, and ω1 , . . . , ωu be the roots of g(X) in Spl(X N − e) over Fq . Then

"g(X)# = { f (X) ∈ Fq [X]/"X N − e# : f (ω1 ) = · · · = f (ωu ) = 0}. (3.3.3)

Furthermore, it is enough to pick up a single root of each irreducible factor: if


ω j is any root of Mω (X), 1 ≤ j ≤ t, then

"g(X)# = { f (X) ∈ Fq [X]/"X N − e# : f (ω1 ) = · · · = f (ωt ) = 0}. (3.3.4)


3.3 Cyclic codes revisited. Decoding the BHC codes 305

Conversely, if ω1 , . . . , ωu is a set of roots of X N − e then the code { f (X) ∈


Fq [X]/"X N − e# : f (ω1 ) = · · · = f (ωu ) = 0} has a generator which is the lcm
of the minimal polynomials for ω1 , . . . , ωu .

Definition 3.3.12 The roots of generator g(X) are called the zeros of the cyclic
code "g(X)#. Other roots of unity are often called non-zeros of the code.

Let {ω1 , . . . , ωu } be a set of roots of X N − e lying in an extension field Fql . Recall


that l is the minimal integer such that N|ql − 1. If f (X) = ∑ fi X i is a polynomial
in Fq [X]/"X N − e# then f (ω j ) = 0 iff ∑ fi ω ij = 0. Representing Fql as a vector
0≤i≤u
space over Fq of dimension l, we associate ω ij with a (column) vector → −ω ij of length
−−−→
l over Fq , writing the last equality as ∑ fi → −
ω ij = ∑ fi ω ij = 0. So, the (ul) × N matrix
i i
⎛ →− →
− →− ⎞
ω 01 ω 11 . . . ω N−1
1
→−
⎜ ω0 →
−ω 12 . . . →−
ω N−1 ⎟
T ⎜ 2 2 ⎟
H =⎜ . .. .. .. ⎟ (3.3.5)
⎝ .. . . . ⎠

−ω 0u →
−ω 1 ... →−
ωuN−1
u

can be considered as a parity-check matrix for the code with zeros ω1 , . . . , ωu (with
the proviso that its rows may not be linearly independent).

Theorem 3.3.13 For q = 2, the Hamming [2l − 1, 2l − l − 1, 3] code is equivalent


i
to a cyclic code "Mω (X)# = ∏0≤i≤l−1 (X − ω 2 ) where ω is a primitive element in
F 2l .

Proof Let ω be a primitive (N, F2 ) root of unity where N = 2l − 1. The splitting


field Spl (X N − e) is F2l (as ordN (2) = l). So, ω is a primitive element in F2l .
l−1
Take Mω (X) = (X − ω )(X − ω 2 ) · · · X − ω 2 , of degree l. The powers ω 0 =
e, ω , . . . , ω N−1 form F∗2l , the list of the non-zero elements and the columns of the
l × N matrix
− 0 → 
H= → ω ,− ω ,...,→−
ω N−1 (3.3.6)

consist of all non-zero binary vectors of length l. Hence, the Hamming [2l − 1, 2l −
l − 1, 3] code is (equivalent to) the cyclic code "Mω (X)# whose zeros consist of a
primitive (2l − 1; F2 ) root of unity ω and (necessarily) all the other roots of the
minimal polynomial for ω .

Theorem
 l 3.3.14  If gcd(l, q − 1) = 1 then the q-ary Hamming
q −1 q −1
l
, − l, 3 code is equivalent to the cyclic code.
q−1 q−1
306 Further Topics from Coding Theory
−1 l
Proof Write Spl(X N − e) = Fql where l = ordN (q), N = qq−1 . To justify the
q −1
l
selection of l observe that = q − 1 and l is the least positive integer with
N
l −1
this property as qq−1 > ql−1 − 1.
Therefore, Spl(X N − e) = Fql . Take a primitive β ∈ Fql . Then ω = β (q −1)/N =
l

β q−1 is a primitive (N, Fq ) root of unity. As before, take the minimal polynomial
 l−1 
Mω (X) = (X − ω )(X − ω q ) · · · X − ω q and consider the cyclic code "Mω (X)#
l−1
with the zero ω (and necessarily ω , . . . , ω q ). Consider again the l × N matrix
q

(3.3.6). We want to check that any two distinct columns of H are linearly indepen-
dent. If not, there exist i < j such that ω i and ω j are scalar multiples of the element
ω j−i ∈ Fq . But then (ω j−i )q−1 = ω ( j−i)(q−1) = e in Fq ; as ω is a primitive Nth root
of unity, this holds iff ( j − i)(q − 1) ≡ 0 mod N. Write
ql − 1
N= = 1 + · · · + ql−1 .
q−1
As (q − 1)|(qr − 1) for all r ≥ 1, we have qr = (q − 1)vr + 1 for some natural vr .
Summing over 0 ≤ r ≤ s − 1 yields
N = (q − 1) ∑ vr + l. (3.3.7)
r

As gcd(q − 1, l) = 1 we have gcd(q − 1, N) = 1. But then the equality ( j − i)


(q − 1) = 0 mod N is impossible.
So, the code with the parity-check matrix H has length N, rank k ≥ N − l and
distance d ≥ 3. But the Hamming bound says that
   −1
N d −1
qk ≤ qN ∑ m (q − 1)m , E =  2 .
0≤m≤E

As the volume of the ball vN,q (E) ≥ ql , this implies that in fact k = N − l, E = 1
and d = 3. So, this code is equivalent to a Hamming code.
Next, we look in more detail on BCH codes correcting several errors. Recall that
if ω1 , . . . , ωu ∈ E(N,q) are (N, Fq ) roots of unity then
XN = { f (X) ∈ Fq [X]/"X N − e# : f (ω1 ) = · · · = f (ωu ) = 0}
is a cyclic code "g(X)# where

g(X) = lcm Mω1 ,Fq (X), . . . , Mωu ,Fq (X)(X) (3.3.8)

is the product of distinct minimal polynomials for ω1 , . . . , ωu over Fq . In particular,


if q = 2, N = 2l − 1, and ω is a primitive element in F2l then the cyclic code with
l−1
roots ω , ω 2 , . . . , ω 2 (which is the same as with a single root ω ) coincides with
3.3 Cyclic codes revisited. Decoding the BHC codes 307

"Mω (X)# and is equivalent to the Hamming code. We could try other possibilities
for zeros of X to see if it leads to interesting examples. This is the way to discover
the BCH codes [25], [70].
Recall the factorisation into minimal polynomials Mi (X)(= Mω i ,Fq (X)),
 
X N − 1 = lcm Mi (X) : i = 0, . . . ,t , (3.3.9)
where ω is a primitive (N, Fq ) root of unity. The roots of Mi (X) are conjugate,
d−1
i.e. have the form ω i , ω iq , . . . , ω iq where d(= d(i)) is the least integer ≥ 1 such
that iqd = i mod N. The set Ci = {i, iq, . . . , iqd−1 } is the ith cyclotomic coset of q
mod N. So,
Mi (X) = ∏ (X − ω j ). (3.3.10)
j∈Ci

In Section 3.2, we obtained a cyclic code of minimal distance ≥ δ by requiring


that the generator g(X) has (δ − 1) successive roots (with successive exponents).
Compare Theorem 3.3.16 below.
Example 3.3.15 A binary Hamming code is a binary primitive narrow sense
BCH of designed distance δ = 3.
 BCH 
By Lemma 3.2.8, the distance d Xq,N, δ ≥ δ . As Spl(X − e) = Fq where s =
N s

ordN (q), we have that


deg Mω b+ j (X) ≤ s. (3.3.11)
BCH ) = N − deg(g(X)) ≥ N − (δ − 1)s. So:
Hence, the rank (Xq,N, δ
BCH has distance ≥ δ and rank ≥
Theorem 3.3.16 The q-ary BCH code Xq,N, δ
N − (δ − 1) ordN (q).
As before, we can form a parity-check matrix for X BCH by writing
ω b , ω b+1 , . . . , ω b+δ −2 and their powers as vectors from Fsq where s = ordN (q). Set
⎛ →
−e →
−e →
−e ⎞
...
⎜ → −
ωb →

ω b+1 ... →
−ω b+δ −2 ⎟
⎜ → − →
− →
− ⎟
H =⎜
T
⎜ ω
2b ω 2(b+1) ... ω 2(b+ δ −2) ⎟.
⎟ (3.3.12)
⎝ ... ⎠
ω (N−1)b →

− −ω (N−1)(b+1) . . . →

ω (N−1)(b+δ −2)
The ‘proper’ parity-check matrix H is obtained by removing redundant rows.
The binary BCH codes are simplest to deal with. Let Ci = {i, 2i, . . . , i2d−1 } be
the ith cyclotomic coset (with d(= d(i)) being the smallest non-zero integer such
that i · 2d = i mod N). Then u ∈ Ci iff 2u mod N ∈ Ci . So, Mi (X) = M2i (X), and for
all s ≥ 1 the polynomials
g2s−1 (X) = g2s (X) = lcm{M1 (X), M2 (X), . . . , M2s (X)}.
308 Further Topics from Coding Theory

We immediately deduce that X2,N,2s+1


BCH = X2,N,2s
BCH . So we can focus on the narrow

sense BCH codes with odd designed distance δ = 2E + 1, and obtain an improve-
ment of Theorem 3.3.16:
Theorem 3.3.17 The rank of a binary BCH code X2,N,2E+1
BCH is ≥ N − E ordN (2).
The problem of determining exactly the minimum distance of a BCH code has
been solved only partially (although a number of results exist in the literature). We
present the following theorem without proof.
Theorem 3.3.18 The minimum distance of a binary primitive narrow sense BCH
code is an odd number.
The previous results can be sharpened in a number of particular cases.
Worked Example 3.3.19 Prove that log2 (N + 1) > 1 + log2 (E + 1)! implies
 
N
(N + 1) < ∑
E
. (3.3.13)
0≤i≤E+1 i

Solution For i ≤ E +1 we obtain i! ≤ (E +1)! < (N +1)/2. Hence, (3.3.13) follows


from
(N + 1)E+1 ≤ 2 ∑ N(N − 1) . . . (N − i + 1) = S(E). (3.3.14)
0≤i≤E+1

Inequality (3.3.14) holds for E = 0, and is checked by induction in E. Write the


RHS of (3.3.14) as S(E + 1) = S(E) + N(N − 1) . . . (N − E). Then S(E) > (N +
1)E+1 by the induction hypothesis and it remains to check
N(N +1)E+1 < 2N(N −1) . . . (N −E)(N −E −1), for N +1 > 2(E +2)!. (3.3.15)
Consider the polynomial (y + 1)E+1 − 2(y − 1) . . . (y − E)(y − E − 1) and group
together the monomials of degrees E + 1 and E. Clearly, they are negative for
y > 2(E + 2)!. Continue this procedure, concluding that (3.3.13) holds.
 
N
Theorem 3.3.20 Let N = 2 − 1. If 2 < ∑
s sE
then a primitive binary
0≤i≤E+1
i
narrow sense BCH code X2,2
BCH
s −1,2E+1 has distance 2E + 1.

Proof By Theorem 3.3.18, the distance is odd. So, d(X2,2 BCH 2E + 2.


s −1,2E+1 ) =

Suppose the distance is ≥ 2E + 3. Observe that the rank X2,2s −1,2E+1 ≥ N − sE,
BCH

and use the Hamming bound


   
N N
2 N−sE
∑ i ≤ 2 , i.e. 2 ≥ ∑ i .
N sE
0≤i≤E+1 0≤i≤E+1
BCH
The contradiction implies d(X2,2 s −1,2E+1 ) = 2E + 1.
3.3 Cyclic codes revisited. Decoding the BHC codes 309

Corollary 3.3.21 If N = 2s − 1 and s > 1 + log2 (E + 1)! then d(X2,2 BCH


s −1,2E+1 ) =

2E + 1. In particular, let N = 31 and s = 5. Then we easily verify that

 
31
2 5E
< ∑ i
0≤i≤E+1

BCH in fact equals δ


for E = 1, 2 and 3. This proves that the actual distance of X2,31,d
for δ = 3, 5, 7.
 
N
Proof s > 1 + log2 (E + 1)! implies that 2sE < ∑ .
0≤i≤E+1
i

Theorem 3.3.22 If δ |N , the minimum distance of primitive binary narrow sense


BCH code of designed distance δ equals δ .

Proof Set N = δ m, then

X N − 1 = X δ m − 1 = (X m − 1)(1 + X m + · · · + X (δ −1)m ).

As ω jm = 1 for j = 1, . . . , δ − 1, none of ω , ω 2 , . . . , ω δ −1 is a root of X m − 1.


So, they must be roots of 1 + X m + · · · + X (δ −1)m . Then this polynomial gives a
codeword, of weight δ . So, δ is the distance.

Two more results on the minimal distance of a BCH code are presented in The-
orems 3.3.23 and 3.3.25. The full proofs are beyond the scope of this book and
omitted.

Theorem 3.3.23 Let N = qs − 1. The minimal distance of a primitive q-ary nar-


row sense BCH code Xq,q
BCH
s −1,qk −1,ω ,1 of designed distance q − 1 equals q − 1.
k k

Theorem 3.3.24 The minimal distance of a primitive q-ary narrow sense BCH
code X BCH = Xq,q s −1,δ ,ω ,1 of designed distance δ is at most qδ − 1.
BCH

Proof Take k to be an integer ≥ 1 with qk−1 ≤ δ ≤ qk − 1. Set δ = qk − 1 and


consider X (= Xq,q BCH
s −1,δ ,ω ,1 ), the q-ary primitive narrow sense BCH code of

the same length N = qs − 1 and designed distance δ . The roots of the generator
of X are among those of X , so X ⊆ X . But according to Theorem 3.3.22,
d(X ) = δ which is ≤ δ q − 1.

The following result shows that BCH codes are not ‘asymptotically good’. How-
ever, for small N (a few thousand or less), the BCH are among the best codes
known.
310 Further Topics from Coding Theory

Theorem 3.3.25 There exists no infinite sequence of q-ary primitive BCH


codes XNBCH of length N such that d(XN )/N and rank(XN )/N are bounded away
from 0.
Decoding BCH codes can be done by using the so-called Berlekamp–Massey
algorithm. To begin with, consider a binary primitive narrow sense BCH code
X BCH (= X2,N,5BCH ) of length N = 2s − 1 and designed distance 5. With E = 2 and
 
N
s ≥ 4, inequality 2 < ∑
sE holds, and by Theorem 3.3.20, the distance
i
 BCH  0≤i≤E+1
d X equals 5. Thus, the code is two-error correcting. Also, by Theorem
3.3.17, the rank of X BCH is ≥ N − 2s. [For s = 4, the rank is actually equal to
N − 2s = 15 − 8 = 7.] So, X BCH is [2s − 1, ≥ 2s − 1 − 2s, 5].
The defining zeros are ω , ω 2 , ω 3 , ω 4 where ω is a primitive Nth root of unity
over F2 (which is also a primitive element ω of F2s ). We know that ω and ω 3
suffice as defining zeros: X BCH = {c(X) ∈ F2 [X]/"X N − 1# : c(ω ) = c(ω 3 ) = 0}.
So, the parity-check matrix H in (3.3.12) can be taken in the form
 →−e → − 
T ω → −
ω 2 ··· → −
ω N−1
H = → −e → ω3 →
− −
ω 6 ··· → −ω 3(N−1)
. (3.3.16)

It is instructive to compare the situation with the binary Hamming [2l − 1, 2l −


1−l] code X ( H) . In the case of code X BCH , suppose again that a codeword c(X) ∈
X was sent and the received word r(X) has ≤ 2 errors. Write r(X) = c(X) + e(X)
where the error polynomial e(X) now has weight ≤ 2. There are three cases to
consider: e(X) = 0, e(X) = X i or e(X) = X i + X j , 0 ≤ i = j ≤ N − 1. If r(ω ) = r1
and r(ω 3 ) = r3 then e(ω ) = r1 and e(ω 3 ) = r3 . In the case of no error (e(X) = 0),
r1 = r3 = 0, and vice versa. In the single-error case (e(X) = X i ),
r3 = e(ω 3 ) = ω 3i = (ω i )3 = (e(ω ))3 = r13 = 0.
Conversely, if r3 = r13 = 0 then e(ω 3 ) = e(ω )3 . If e(X) = X i + X j with i = j then
ω 3i + ω 3 j = (ω i + ω j )3 = ω 3i + ω 2i ω j + ω i ω 2 j + ω 3 j ,
i.e. ω 2i ω j + ω i ω 2 j = 0 or ω i + ω j = 0 which implies i = j, a contradiction. So, the
single error occurs iff r3 = r13 = 0, and the wrong digit is i such that r1 = ω i . So,
in the single-error case we identify a column of H, i.e. a pair (ω i , ω 3i ) = (r1 , r3 )
and change digit i in r(X). This is completely similar to the decoding procedure for
Hamming codes.
In the two-error case (e(X) = X i + X j , i = j), in the spirit of the Hamming codes,
we try to find a pair of columns (ω i , ω 3i ) and (ω j , ω 3 j ) such that the sum (ω i +
ω j , ω 3i + ω 3 j ) = (r1 , r3 ), i.e. solve the equation
r1 = ω i + ω j , r3 = ω 3i + ω 3 j .
3.3 Cyclic codes revisited. Decoding the BHC codes 311

Then find i, j such that y1 = ω i , y2 = ω j (y1 , y2 are called error locators). If such i,
j (or equivalently, error locators y1 , y2 ) are found, we know that errors occurred at
positions i and j.
It is convenient to introduce an error-locator polynomial σ (X) whose roots are
y−1 −1
1 , y2 :

σ (X) = (1 − y1 X)(1 − y2 X) = 1 − (y1 + y2 )X + y1 y2 X 2


(3.3.17)
= 1 − r1 X + (r3 r1−1 − r12 )X 2 .

As y1 + y2 = r1 , we check that y1 y2 = r3 r1−1 − r12 . Indeed,

r3 = y31 + y32 = (y1 + y2 )(y21 + y1 y2 + y22 ) = r1 (r12 + y1 y2 ).

If N is not large, the roots of σ (X) can be found by trying all 2s − 1 non-zero
elements of F∗2s . (The standard formula for the roots of a quadratic polynomial
does not apply over F2 .) Thus, the following assertion arises:

Theorem 3.3.26 For N = 2l − 1, consider a two-error correcting binary primi-


tive narrow sense BCH code X (which equals X BCH ) of length N and designed
distance 5, with the parity-check matrix produced from
 
T e ω ω 2 · · · ω N−1
H = ,
e ω 3 ω 6 · · · ω 3(N−1)
where ω is the primitive element of F2s . [The rank of the code is ≥ N − 2l and
for l ≥ 4 the distance equals 5, i.e. X is [2l − 1, ≥ 2l − 1 − 2l, 5] and corrects two
errors.] Assume that at most two errors occurred in a received word r(X) and let
r(ω ) = r1 , r(ω 3 ) = r3 . Then:

(a) if r1 = 0 then r3 = 0 and no error occurred;


(b) if r3 = r13 = 0 then a single error occurred at position i where r1 = ω i ;
0 and r3 = r13 then two errors occurred: the error locator polynomial
(c) if r1 =
σ (X) = 1 − r1 X + (r3 r1−1 − r12 )X 2 has two distinct roots ω N−1−i , ω N−1− j and
the errors occurred at positions i and j.
For a binary BCH code with a general designed distance δ (δ = 2t +1 is assumed
odd), we follow the same idea: compute

r1 = e(ω ), r3 = e(ω 3 ), . . . , rδ −2 = e(ω δ −2 )

for the received word r(X) = c(X) + e(X). Suppose that errors occurred at places
i1 , . . . , it . Then
e(X) = ∑ Xij.
1≤ j≤t
312 Further Topics from Coding Theory

As before, consider the system

∑ ω i j = r1 , ∑ ω 3i j = r3 , . . . , ∑ ω (δ −2)i j = rδ −2 ,
1≤ j≤t 1≤ j≤t 1≤ j≤t

and introduce the error locators y j = ω i j :

∑ y j = r1 , ∑ y3j = r3 , . . . , ∑ yδj −2 = rδ −2 .
1≤ j≤t 1≤ j≤t 1≤ j≤t

The error locator polynomial

σ (X) = ∏ (1 − y j X)
1≤ j≤t

has the roots y−1


j . The coefficients σi in σ (X) = ∑ σi X can be determined from
i
0≤i≤t
the equations below
⎛ ⎞⎛ ⎞
1 0 0 0 0 ... 0 σ1
⎜ r2 r1 1 0 0 ... 0 ⎟⎜ σ2 ⎟
⎜ ⎟⎜ ⎟
⎜ r4 ⎟⎜ ⎟
⎜ r3 r2 r1 1 ... 0 ⎟⎜ σ3 ⎟
⎜ . .. .. .. .. .. .. ⎟⎜ ⎟
⎜ .. . . . . . . ⎟⎜ .. ⎟
⎜ ⎟⎜ . ⎟
⎜ .. .. .. .. ⎟⎝ ⎠
⎝ r2t−4 r2t−5 . . . . rt−3 ⎠ σ2t−3
r2t−2 r2t−3 ... ... ... ... rt−1 σ2t−1

⎛ ⎞
r1
⎜ r3 ⎟
⎜ ⎟
⎜ r5 ⎟
⎜ ⎟
=⎜ .. ⎟
⎜ . ⎟
⎜ ⎟
⎝ r2t−3 ⎠
r2t−1

This requires computing rk only for k odd as

r2 j = e(ω 2 j ) = e(ω j )2 = r2j .

Once the σi are found, the roots y−1


j can be determined by trial and error.


Example 3.3.27 Consider X2,16, ω ,5 where ω is a primitive element of F16 . We
BCH

know that the primitive polynomial is M1 (X) = X 4 +X +1 and M3 (X) = X 4 +X 3 +


X 2 + X + 1. Hence, the generator of the code

g(X) = M1 (X)M3 (X) = X 8 + X 7 + X 6 + X 4 + 1.


3.4 The MacWilliams identity and the linear programming bound 313

Let us introduce two errors in the codeword c = 10001011100000000 at the 4th


and 12th positions by taking a(X) = X 12 + X 8 + X 7 + X 6 + 1. Then

r1 = a(ω ) = ω 12 + ω 8 + ω 7 + ω 6 + 1 = ω 6 ,
r3 = a(ω 3 ) = ω 36 + ω 24 + ω 21 + ω 18 + 1 = ω 9 + ω 3 + 1 = ω 4 .

Since r3 = r13 , consider the location polynomial

σ (X) = 1 + ω 6 X + (ω 13 + ω 12 )X 2 .

The roots of l(X) are ω 3 and ω 11 by the direct check. Hence we discover the errors
at the 4th and 12th positions.

3.4 The MacWilliams identity and the linear programming bound


The MacWilliams identity for linear codes deals with the so-called weight-
enumerator polynomials WX (z) and WX ⊥ (z) where X and X ⊥ are a pair of dual
codes of a given length N. The polynomials WX (z) and WX ⊥ (z) are defined by

WX (z) = ∑ Ak zk and WX ⊥ (z) = ∑ A⊥


kz
k
(3.4.1)
0≤k≤N 0≤k≤N

where Ak (= Ak (X )) equals the number of codewords of weight k in X , and A⊥k


(= Ak (X ⊥ )) the number in X ⊥ . The identity for q-ary codes reads
 
1
N 1−z
WX ⊥ (z) = 1 + (q − 1)z WX , z ∈ C, (3.4.2)
X 1 + (q − 1)z
and takes a particularly elegant form in the binary case (q = 2):
 
1 n 1−z
WX ⊥ (z) = (1 + z) WX . (3.4.3)
X 1+z
A short derivation of the abstract MacWilliams identity is rather algebraic. It
may be skipped at the first reading as only its specification for linear codes will be
used later on.

Definition 3.4.1 Let (G, +) be a group. A homomorphism χ : G to the multiplica-


tive group of complex numbers S = {z ∈ C : |z| = 1} is called a (one-dimensional)
character of G. Since χ is a homomorphism

χ (g1 + g2 ) = χ (g1 )χ (g2 ), χ (0) = 1. (3.4.4)

We say χ is trivial (or principal) if χ (·) ≡ 1.


314 Further Topics from Coding Theory

More generally, a linear representation D of a group G over a field F (not nec-


essarily finite) is defined as a homomorphism
D : G → GL(V ) : g → D(g) (3.4.5)
from G into the group GL(V ) of invertible linear mappings of a finite-dimensional
space V over F. The vector space V is called the representation space and its di-
mension dim(V ) is called the dimension of representation.
Let D be a representation of a group G. Then the map
 
χ D : G → F : g → ∑ dii (g) = trace D(g) , (3.4.6)
which takes g ∈ G to χ D (g), the trace of D(g) = (di j (g)), is called the character of
D. Representations and characters over the field C of complex numbers are called
ordinary. In the situation where the underlying field F is finite, they are called
modular.
In our case G = Fq with additive group operation. Fix a primitive qth root of
unity ω = e2π i/q ∈ S and for any j ∈ Fq define a one-dimensional representation
of the group Fq as follows:
χ ( j) : Fq → S : u → ω ju .
The character χ ( j) is non-trivial for j = 0. In fact, all characters of Fq can be de-
scribed in this way, but we omit the proof of this assertion.
Next, we define a character of the group G = FNq . Fix a non-trivial one-
dimensional ordinary character χ : Fq → S and a non-zero element v ∈ FNq and
define a character of the additive group G = FNq as follows:
χ(v) : FNq → S : u → χ (v · u), (3.4.7)
where v · u, as before, is the dot-product.
Lemma 3.4.2 Let χ be a non-trivial (i.e. χ ≡ 1) character of a finite group G.
Then
∑ χ (g) = 0. (3.4.8)
g∈G

If χ is trivial then ∑ χ (g) =  G.


g∈G

Proof Since χ is non-trivial, there exists an element h ∈ G such that χ (h) = 1.


From
χ (h) ∑ χ (g) = ∑ χ (hg) = ∑ χ (g),
g∈G g∈G g∈G

we obtain that (χ (h) − 1) ∑ χ (g) = 0. Therefore, ∑ χ (g) = 0.


g∈G g∈G
3.4 The MacWilliams identity and the linear programming bound 315

In the case G = FNq , ∑ χ (x) = qN for a trivial.


x∈FN
q

Definition 3.4.3 The discrete Fourier transform (in short, DFT) of a function f
on FNq is defined by
f = ∑ f (v)χ(v) . (3.4.9)
v∈FN
q

Sometimes, the weight enumerator polynomial of code X is defined as a func-


tion of two formal variables x, y:
WX (x, y) = ∑ xw(v) yN−w(v) (3.4.10)
v∈X

(if one sets x = z, y = 1, (3.4.10) coincides with (3.4.1)). So, we want to apply the
DFT to the function (no harm to say that x, y ∈ S )
g : FNq → C [x, y] : v → xw(v) yN−w(v) . (3.4.11)
Lemma 3.4.4 (The abstract MacWilliams identity) For v ∈ FNq let

g : FNq → C [x, y] : v → xw(v) yN−w(v) . (3.4.12)


Then
g(u) = (y − x)w(u) (y + (q − 1)x)N−w(u) . (3.4.13)
Proof Let χ denote a non-trivial ordinary character of the additive group G = Fq .
Given α ∈ Fq , set |α | = 0 if α = 0 and |α | = 1 otherwise. Then for all u ∈ FNq we
compute
 
g(u) = ∑ χ "v, u# g(v)
v∈FN
q
 
= ∑ χ "v, u# xw(v) yN−w(v)
v∈FN
q

 N−1  |v |+···|v | (1−|v |)+···+(1−|v |)


= ∑ ... ∑ χ ∑ vi ui x 0 N−1
y 0 N−1

v0 ∈Fq vN−1 ∈Fq i=0


N−1
= ∑ ... ∑ ∏ χ (vi ui )x|v | y1−|v | i i

v0 ∈Fq vN−1 ∈Fq i=0


N−1
= ∏ ∑ χ (gui )x|g| y1−|g| .
i=0 g∈G

If ui = 0 then χ (gui ) = χ (0) = 1 and so

∑ x|g| y1−|g| = y + (q − 1)x.


g∈G
316 Further Topics from Coding Theory

If ui = 0 then

∑ χ (gui )x|g| y1−|g| = y + ∑ χ (gui )x = y − χ (0)x = y − x.


g∈G g∈G\0

Lemma 3.4.5 (MacWilliams identity for linear codes) If X is a linear [N, k]


code over Fq then
∑ f(x) = qk ∑ f (y). (3.4.14)
x∈X y∈X ⊥

Proof Consider the following sum:

∑ f(x) = ∑ ∑ χ(v) (x) f (v)


x∈X x∈X v∈FN
q

= ∑ ∑ χ "v, x# f (v)
v∈FN
q x∈X

= ∑ ∑ χ "v, x# f (v)
v∈X ⊥ x∈X

+ ∑ ⊥
∑ χ "v, x# f (v).
q \X x∈X
v∈FN

In the first sum we have χ "v, x# = χ (0) = 1 for all v ∈ X ⊥ and all x ∈ X . In
the second sum we study the linear form

X → Fq : x → "v, x#.

Since v ∈ FNq \ X ⊥ , this linear form is surjective, whence its kernel has dimension
k − 1, i.e. for any g ∈ Fq there exist qk−1 vectors x ∈ X such that "v, x# = g. This
implies

∑ f(x) = qk ∑ f (y) + qk−1 ∑ f (v) ∑ χ (g)


x∈X y∈X ⊥ v∈Fnq \X ⊥ g∈G

= qk ∑ f (y)
y∈X ⊥

as the second term vanishes by Lemma 3.4.2.

Lemma 3.4.6 The weight enumerator of an [N, k] code X over Fq is related to


the weight enumerator of its dual as follows:

WX ⊥ (x, y) = q−kWX (y − x, y + (q − 1)x). (3.4.15)


3.4 The MacWilliams identity and the linear programming bound 317

Proof By Lemma 3.4.5 with g(v) = xw(v) yN−w(v)

WX ⊥ (x, y) = ∑ g(v) = q−k ∑ g(v)


v∈X ⊥ v∈X
−k
= q WX (y − x, y + (q − 1)x).

Substituting x = z, y = 1 we obtain (3.4.3).

Example 3.4.7 (i) For all codes X , WX (0) = A0 = 1 and WX (1) =  X . When
X = F×N
q , WX (z) = [1 + z(q − 1)] .
N

(ii) For a binary repetition code X = {0000, 1111}, WX (x, y) = x4 + y4 . Hence,


1 
WX ⊥ (x, y) = (y − x)4 + (y + x)4 = y4 + 6x2 y2 + x4 .
2

(iii) Let X be the Hamming [7, 4] code. The dual code X ⊥ has 8 codewords; all
except 0 are of weight 4. Hence, WX ⊥ (x, y) = x7 + 7x4 y3 , and, by the MacWilliams
identity,
1 1 
WX = 3
WX ⊥ (x − y, x + y) = 3 (x − y)7 + 7(x − y)4 (x + y)3
2 2
= x7 + 7x4 y3 + 7x3 y4 + y4 .

Hence, X has 7 words of weight 3 and 4 each. Together with the 0 and 1 words,
this accounts for all 16 words of the Hamming [7, 4] code.
Another way to derive the identity (3.4.1) is to use an abstract result related
to group algebras and character transforms for Hamming spaces F×N q (which are
linear spaces over field Fq of dimension N). For brevity, the subscript q and super-
script (N) will be often omitted.
Definition 3.4.8 The (complex) group algebra CF×N for space F×N is defined as
the linear space of complex functions G : x ∈ F×N → G(x) ∈ C equipped by a com-
plex involution (conjugation) and multiplication. Thus, we have four operations for
functions G(x); addition and scalar (complex) multiplication are standard (point-
wise), with (G + G )(x) = G(x) + G (x) and (aG)(x) = aG(x), G, G ∈ CF×N ,
a ∈ C, x ∈ F×N . The involution is just the (point-wise) complex conjugation:
G∗ (x) = G(x)∗ ; it is an idempotent operation, with G∗ ∗ = G. However, the mul-
tiplication (denoted by ) is a convolution:

(G  G )(x) = ∑ G(y)G (x − y), x ∈ F×N . (3.4.16)


y∈F×N

This makes CF×N a commutative ring and at the same time a (complex) linear
space, of dimension dim CF×N = qN , with involution. (A set that is a commutative
318 Further Topics from Coding Theory

ring and a linear space is called an algebra.) The natural basis in CF×N is formed
by Dirac’s (or Kronecker’s) delta-functions δ y , with δ y (x) = 1(x = y), x, y ∈ H .

If X ⊆ F×N is a linear code, we set GX (x) = 1(x ∈ X ).


The multiplication rule (3.4.16) requires an explanation. If we rewrite the
RHS in a symmetric form ∑ G(y)G(y ) (which makes the commu-
y,y ∈F×N :y+y =x
tativity of the -multiplication obvious) then there will be an analogy with the
multiplication of polynomials. In fact, if A(t) = a0 + a1t + · · · + al−1t l−1 and

A (t) = a 0 + a 1t + · · · + a l −1t l −1 are two polynomials, with coefficient strings
(a0 , . . . , al−1 ) and (a 0 , . . . , a l −1 ), then the product B(t) = A(t)A (t) has a string
of coefficients (b0 , . . . , bl−1+l −1 ) where bk = ∑ am a m .
m,m ≥0:m+m =k
From this point of view, rule (3.4.16) is behind some polynomial-type multipli-
cation. Polynomials of degree ≤ n − 1 form of course a (complex) linear space of
dimension n. However, they do not form a group (or even a semi-group). To make
a group, we should affiliate inverse monomials 1/t, 1/t 2 , and so on, and either con-
sider infinite series or make an agreement that t n = 1 (i.e. treat t as an element of
a cyclic group, not a ‘free’ variable). Similar constructions can be done for poly-
nomials of several variables, but there we have a variety of possible agreements on
relations between variables.
Returning to our group algebra CH , we make the following steps:
(i) Produce a ‘multiplicative version’ of the Hamming group H . That is, take a
collection of ‘formal’ variables t (x) labelled by elements x ∈ H and postulate the

rule t (x)t (x ) = t (x+x ) for all x, x ∈ CH .
(ii) Then consider the set TH of all (complex) linear combinations G =
∑x∈H γxt (x) and introduce (ii1) the addition G + G = ∑x∈H (γx + γx )t (x) and (ii2)
the scalar multiplication aG = ∑x∈H (aγx )t (x) , G, G ∈ TH , a ∈ C. We again ob-
tain a linear space of dimension qN , with the basis formed by ‘basic’ combina-
tions t (x) , x ∈ H . Obviously, TH and CH are isomorphic as linear spaces, with
G ⇐⇒ g.

(iii) Now remove brackets in t (x) (but keep the rule t xt x = t x+x ) and write
∑x∈H γxt x as g(t) thinking that this is a function (in fact, a ‘polynomial’) of some
‘variable’ t obeying the above rule. Finally, consider the polynomial multiplication
g(t)g (t) in TH . Then TH and CH become isomorphic not only as linear spaces
but also as rings, i.e. as algebras.
The above construction is very powerful and can be used for any group, not just
for HN . Its power will be manifested in the derivation of the MacWilliams identity.
So, we will think of CH as a set of functions

g(t) = ∑ γxt x (3.4.17)


x∈Hn
3.4 The MacWilliams identity and the linear programming bound 319

of a formal variable t obeying an ‘exponentiation rule’: t x+x


= t xt , with addition x

and multiplication of formal polynomials.


In agreement with (3.4.17), for a linear code X ⊂ Hn we set

gX (t) = ∑ t x; (3.4.18)
x∈X

gX (t) is often called the generating function of X .


Definition 3.4.3 admits a straightforward generalisation for any non-principal
character χ : F → S. Note the similarity with the Fourier transform (and other types
of popular transforms (viz. the Hadamard transform in the group theory)).

Definition 3.4.9 The character transform g → ĝ of the group algebra CHn is


defined by
ĝ(t) = ∑ Xx (g)t x , (3.4.19a)
x∈Hn

where g ∼ (γx , x ∈ Hn ) and

Xx (g) = ∑ γy χ (x · y) (3.4.19b)
y∈Hn

and x · y is the dot-product ∑1≤ j≤n x j y j in Hn .

Now define the weight enumerator of a group algebra element g ∈ CH as a


polynomial Wg (s) in a variable s (which may be thought of as a complex variable):
n  
Wg (s) = ∑ γx s w(x)
=∑ ∑ γx sk = ∑ Ak sk , s ∈ C. (3.4.20)
x∈H k=0 x:w(x)=k 0≤k≤n

Here
Ak = ∑ γx . (3.4.21)
x∈H :w(x)=k

For a linear code X , with generating function gX (t) (see 3.4.18)), Ak gives the
number of codewords of weight k:

Ak = #{x ∈ X : w(x) = k}. (3.4.22)

The weight enumerator Wĝ (s) of the character transform ĝ of g ∼ (γx , x ∈ H ) is


given by
 
Wĝ (s) = ∑ Xx (g)s w(x)
= ∑ ∑ Xx (g) sk = ∑ Âk sk , (3.4.23)
x∈H 0≤k≤n x: w(x)=k k
320 Further Topics from Coding Theory

where
Âk = ∑ Xx (g). (3.4.24)
x∈H : w(x)=k

The ‘abstract’ MacWilliams identity is established in the following result.


Theorem 3.4.10 We have
 
1−s
Wĝ (s) = (1 + (q − 1)s) Wg n
. (3.4.25)
1 + (q − 1)s
Proof Basically coincides with that of Lemma 3.4.4.
Rewrite (3.4.25) in terms of coefficients Ak and Âk :
n n
∑ Âk sk = ∑ Ak (1 − s)k (1 + (q − 1)s)n−k (3.4.26)
k=0 k=0

and expand:
n
(1 − s)k (1 + (q − 1)s)n−k = ∑ Ki (k)si . (3.4.27)
i=0

Here Ki (k)(= Ki (k, n, q)) is a Kravchuk polynomial: for all i, k = 0, 1, . . . , n,


  
i∧k k n−k
Ki (k) = ∑ (−1) j (q − 1)i− j ,
j=0∨(i+k−n) j i − j (3.4.28)
0 ∨ (i + k − n) = max [0, i + k − n], i ∧ k = min [i, k].
Then

∑ Âk sk = ∑ Ak ∑ Ki (k)si = ∑ ∑ Ak Ki (k)si


0≤k≤n 0≤k≤n 0≤i≤n 0≤i≤n 0≤k≤n

= ∑ ∑ Ai Kk (i)s , k
0≤k≤n 0≤i≤n

i.e.
Âk = ∑ Ai Kk (i). (3.4.29)
0≤i≤n

Lemma 3.4.11 For any (linear) code X ⊆ Hn , with generating function gX ∼


1(x ∈ X ), the character transform coefficients are related by
Xu (gX ) = #X 1(u ∈ X ⊥ ) (3.4.30)
and the character transform
ĝX = #X gX ⊥ . (3.4.31)
Here, X ⊥ is the dual code.
3.4 The MacWilliams identity and the linear programming bound 321

Proof By Lemma 3.4.2


 
Xu (gX ) = Xu ∑ t x
= ∑ χ (y · u) = #X 1(u ∈ X ⊥ ).
x∈X y∈X

In fact, the character y ∈ X → χ (y · u) is principal iff u ∈ X ⊥ . Consequently,


ĝ(t) = ∑ Xx (gX )t x = ∑ #X 1(x ∈ X ⊥ )t x
x∈H x∈H
= #X ∑ t x = #X gX ⊥ (t).
x∈X ⊥

Hence,
WĝX (s) = #X WgX ⊥ (s), (3.4.32)
and we obtain the MacWilliams identity for linear codes:
Theorem 3.4.12 Let X ⊂ Hn be a linear code, X ⊥ its dual, and
n n
WX (s) = ∑ Ak sk , WX ⊥ (s) = ∑ A⊥k sk (3.4.33)
k=0 k=0

the w-enumerators for X and X ⊥ , respectively, with Ak = #{x ∈ X : w(x) = k}


and Âk = #{x ∈ X ⊥ : w(x) = k}. Then
 
1 1−s
WX ⊥ (s) = (1 + (q − 1)s) WX
n
, s ∈ C, (3.4.34)
#X 1 + (q − 1)s
or, equivalently,
1
A⊥
k =
#X ∑ Ai Kk (i), (3.4.35)
0≤i≤n

where Kk (i) are Kravchuk polynomials (see (3.4.28)).


For a binary code, i.e. q = 2, (3.4.34) takes the form (3.4.3). Sometimes the
weight enumerators are defined as
WX ⊥ (s, r) = ∑ Ak sk rn−k . (3.4.36)
k

Then the MacWilliams identity (3.4.33) takes the form


1
WX ⊥ (s, r) = WX (r − s, r + (q − 1)s). (3.4.37)
#X
The MacWilliams identity is a powerful result providing a deep insight into the
structure of a (linear) code, particularly when the code is self-dual.
322 Further Topics from Coding Theory

The MacWilliams identity helps to establish an interesting bound on linear codes


called the linear programming (LP) bound. First, we discuss some immediate con-
sequences of this identity. If X ⊂ HN,q is a code of size M, set
1
Bk = {(x, y) : x, y ∈ X , δ (x, y) = k},
M
k = 0, 1, . . . , N
(each pair x, y is counted two times). The numbers B0 , B1 , . . . , BN form the distance
distribution of code X . The expression
BX (s) = ∑ Bk sk (3.4.38)
0≤k≤N

is called the distance enumerator of X . Clearly, the w- and d-distributions of a


linear code coincide. Furthermore we have
Lemma 3.4.13 The d -enumerator of an [N, M] code X coincides with the w-
enumerator of the group algebra element
1
hX (s) := ζX (s)ζX (s−1 )) (3.4.39)
M
where the generating function of X is
ζX (s) = ∑ sx . (3.4.40)
x∈X

Proof Using the notation (s−1 )x , write


1 1
hX (s) = ∑ sx ∑ s−y =
M x∈X y∈X ∑ sx−y
M x,y∈X

and hence
1  
WhX (s) = ∑ ∑
M 0≤k≤N x,y∈X :
1 w(x − y) = k sk = ∑ Bk sk
0≤k≤N
= BX (s).

Now by the MacWilliams identity, for a given non-trivial character χ and the
corresponding transform ζ → ζ, we obtain
Theorem 3.4.14 For hX (s) as above, if 
hX (s) is the character transform and
Wh (s) its w-enumerator, with
X
 
Wh (s) = ∑ Bk s = ∑
k
∑ χx (hX ) sk ,
X
0≤k≤N 0≤k≤N w(x)=k
3.4 The MacWilliams identity and the linear programming bound 323

then
Bk = ∑ Bi Kk (i),
0≤i≤N

where Kk (i) are Kravchuk polynomials.


The following assertion is straightforward.
Lemma 3.4.15 The following identity holds: χx (ζX (s−1 )) = χx (ζX (s)), where
the bar denotes the complex conjugate.
With the help of Lemma 3.4.15, we can write
1 1
χx (hX (t)) =χx (ζX (s)ζX (s−1 )) = χx (ζX (s))χx (ζX (s−1 ))
M M
1 1
= χx (ζX (s))χx (ζX (s)) = |χx (ζX (s))|2 ,
M M
and so,
1
Bk = ∑ χx (hX ) = ∑ |χx (ζX )|2 ≥ 0.
x:w(x)=k
M w(x)=k

Thus:
Theorem 3.4.16 For all [N, M] codes X and k = 0, . . . , N ,

∑ Bi Kk (i) ≥ 0. (3.4.41)
0≤i≤N

Now counting the number of pairs (x, y) ∈ X × X :

∑ Bi = M 2
0≤i≤N
or
1
∑ Ei = M, with Ei =
M
Bi (3.4.42)
0≤i≤N

(sometimes E0 , E1 , . . . , EN are called the d -distribution of X ). Then, by (3.4.41)–


(3.4.42),
∑ Ei Kk (i) ≥ 0.
0≤i≤N

In addition, by definition, Ei ≥ 0, 0 ≤ i ≤ N , and E0 = 1 and Ei = 0, 1 ≤ i < d .


Proof Let ω be a primitive qth root of unity and x ∈ FNq be a fixed word of weight
i. Then
∑ ω "x,y# = Kk (i). (3.4.43)
y∈FN
q :w(y)=k
324 Further Topics from Coding Theory

Indeed, we may assume that x = x1 x2 . . . xi 0 . . . 0 where the coordinates xi are not


0. Let D be a set of words that have their non-zero coordinates in a given set of
k positions. Suppose that exactly j positions h1 , . . . , hk belong to [0,  k −j
 i]and
i N −i
positions belong to [i + 1, N]. For such, a set could be selected in
j k− j
choices. Then

∑ ω "x,y# = ∑ ... ∑ ω xh1 yh1 +···+xhk yhk


y∈D yh1 ∈F∗q yhk ∈F∗q
j
= (q − 1)k− j ∏ ∑ ωx hi y
= (−1) j (q − 1)k− j .
i=1 y∈F∗q

Hence,
N N
M ∑ Bi Kk (i) = ∑ ∑ ∑ ω "x−y,z#
i=0 i=0 x,y∈X :δ (x,y)=i z∈FN
q :w(z)=k

= ∑ | ∑ ω "x,z# |2 ≥ 0.
z∈FN
q :w(z)=k x∈X

This leads us to the so-called linear programming (LP) bound stated in Theorem
3.4.17 below.
Theorem 3.4.17 (The LP bound) The following inequality holds:

Mq (N, d) ≤ max ∑ Ei : Ei ≥ 0, E0 = 1, Ei = 0 for 1 ≤ i < d

0≤i≤N

and ∑ Ei Kk (i) ≥ 0 for 0 ≤ k ≤ N . (3.4.44)
0≤i≤N

For q = 2, the LP bound will be slightly improved in Theorem 3.4.19. First, an


auxiliary result whose proof is straightforward and left as an exercise.
Lemma 3.4.18
(a) If there exists a binary [N, M, d] code, with d even, then there exists a binary
[N, M, d] code where any codeword has even weight, and so all distances are
even. So, if q = 2 and d is even, we may assume that Ei = 0 for all odd values
of i.
(b) For q = 2,
Ki (2k) = KN−i (2k).
3.4 The MacWilliams identity and the linear programming bound 325

Hence, for d even, as we can assume that E2i+1 = 0, the constraint in (3.4.44)
need only be considered for k = 0, . . . , [N/2].
(c) K0 (i) = 1 for all i, and thus the bound ∑ Ei K0 (i) ≥ 0 follows from Ei ≥ 0.
0≤i≤N

Lemma 3.4.18 directly implies

Theorem 3.4.19 (The LP for q = 2) If d is even then


M2∗ (N, d) ≤ max ∑ Ei : Ei ≥ 0, E0 = 1, Ei = 0 for 1 < i < d,
0≤i≤N
 
N
Ei = 0 for i odd, and + ∑ Ei Kk (i) ≥ 0 (3.4.45)
k d≤i≤N
A B
N
for k = 1, . . . , .
2

Since M2∗ (N, 2t + 1) = M2∗ (N + 1, 2t + 2), Theorem 3.4.19 provides a useful


bound also when d is odd. We will explore the MacWilliams identity further on.
The LP bound represents a rather universal tool in the theory of codes. For in-
stance, the Singleton, Hamming and Plotkin bounds can all be derived from the LP
bound. However, we will not exploit this avenue in detail.

Worked Example 3.4.20 For positive integers N and d ≤ N , let

N
f (x) = 1 + ∑ f j K j (x)
j=1

be a polynomial such that f j ≥ 0, 1 ≤ j ≤ N and f (i) ≤ 0 for d ≤ i ≤ N . Prove that

Mq∗ (N, d) ≤ f (0). (3.4.46)

Derive the Singleton bound from (3.4.46).

Solution Let M = Mq∗ (N, d) and X be a q-ary [N, M] code with the distance
distribution Bi (X ), i = 0, . . . , N. The condition f (i) ≤ 0 for d ≤ i ≤ N implies
326 Further Topics from Coding Theory
N
∑ B j (X ) f ( j) ≤ 0. Using the LP bound (3.4.45) for k = 0 obtain Ki (0) ≥
j=d
N
− ∑ B j (X )Ki ( j). Hence,
j=d

N
f (0) = 1 + ∑ f j K j (0)
j=1
N N
≥ 1 − ∑ fk ∑ Bi (X )Kk (i)
k=1 i=d
N N
= 1 − ∑ Bi (X ) ∑ fk Kk (i)
i=d k=1
N
= 1 − ∑ Bi (X )( f (i) − 1)
i=d
N
≥ 1 + ∑ Bi (X )
i=d
= M = Mq∗ (N, d).
To obtain the Singleton bound select
N  
x
f (x) = q N−d+1
∏ 1−
j
.
j=d

Then by the identity


j    
N −i N −k
∑ N− j
Ki (k) = q j
j
i=0
with j = d − 1 we have that
N
1
fk =
qN ∑ f (i)Ki (k)
i=0
d−1   
1 N −i N
= ∑ N −d +1 i
qd−1 i=0
K (k)/
d −1
   
N −k N
= / ≥ 0.
d −1 d −1
Here we use the identity
j    
N −k N −x
∑ N− j
Kk (x) = q j
j
. (3.4.47)
k=0

Clearly, f (i) = 0 for d ≤ i ≤ N. Hence, Rq (N, d) ≤ f (0) = qN−d+1 . In a similar


manner Hamming’s and Plotkin’s bounds may be derived as well, cf. [97].
3.4 The MacWilliams identity and the linear programming bound 327

Worked Example 3.4.21 Using the linear programming bound, prove that
M2∗ (13, 5) = M2∗ (14, 6) ≤ 64. Compare it with the Elias bound. [Hint: E6 = 42,
E8 = 7, E10 = 14, E12 = E14 = 0. You may need a computer to get the solution.]

Solution The LP bound for linear codes reads


M2∗ (N, d) = max ∑ Ei
0≤i≤N
subject to Ei ≥ 0, E0 = 1, E j = 0 for 1 ≤ j < d,
Ei = 0 for j odd, and
  A B
N N
+ ∑ Ei Kk (i) ≥ 0 for k = 1, . . . , .
k d≤i≤N 2
i even

For N = 14, d = 6, the constraints are

E0 = 1, E1 = E2 = E3 = E4 = E5 = E7 = E9 = E11 = E13 = 0,
E6 , E8 , E10 , E12 , E14 ≥ 0,
14 + 2E6 − 2E8 − 6E10 − 10E12 − 14E14 ≥ 0,
91 − 5E6 − 5E8 + 11E10 + 43E12 + 91E14 ≥ 0,
364 − 12E6 + 12E8 + 4E10 − 100E12 − 364E14 ≥ 0,
1001 + 9E6 + 9E8 − 39E10 + 121E12 + 1001E14 ≥ 0,
2002 + 30E6 − 30E8 + 38E10 − 22E12 − 2002E14 ≥ 0,
3003 − 5E6 − 5E8 + 27E10 − 165E12 + 3003E14 ≥ 0,
3432 − 40E6 + 40E8 − 72E10 + 264E12 − 3432E14 ≥ 0,

and the maximiser of the objective function S = E6 + E8 + E10 + E12 + E14 is

E6 = 42, E8 = 7, E10 = 14, E12 = E14 = 0,

with S = 63, E0 + S = 1 + 63 = 64. So, the LP bound yields

M2∗ (13, 5) = M2∗ (14, 6) ≤ 64.

Note that the bound is sharp as a [13, 64, 5] binary code actually exists. Compare
the LP bound with the Hamming bound:

M2∗ (13, 5) ≤ 213 (1 + 13 + 13 · 6) = 213 92 = 211 /23,

i.e.
M2∗ (13, 5) ≤ 91.

Next, the Singleton bound gives k ≤ 13 − 5 − 1 = 7,

M2∗ (13, 5) ≤ 27 = 128.


328 Further Topics from Coding Theory

It is also interesting to see what the Elias bound gives:


65/2 213
M2∗ (13, 5) ≤  
s − 13s + 65/2
2
13
1 + 13 + · · · +
5
for all s < 13 such that s2 − 13s + 65/2 > 0.

Substituting s = 2 yields s2 − 13s + 65/2 = 4 − 26 + 65/2 = 21/2 > 0 and


65 13  
M2∗ (13, 5) ≤ 2 1 + 13 + 13 · 6 = 2.33277 × 106 :
21
not good enough. Next, s = 3 yields s2 − 13s + 65/2 = 9 − 39 + 65/2 = 5/2 > 0
and
65 13   212 13 11
M2∗ (13, 5) ≤ 2 1 + 13 + 13 · 6 + 13 · 2 · 5 = 13 × ≥ 2 :
5 111 66
not as good as Hamming’s. Finally, observe that 42 − 13 × 4 + 65/2 < 0, and the
procedure stops.

3.5 Asymptotically good codes


In this section we briefly discuss some families of codes where the number of
corrected errors gives a non-zero fraction of the codeword length. For more details,
see [54], [71], [131].
Definition 3.5.1 A sequence of [Ni , ki , di ] codes with Ni → ∞ is said to be asymp-
totically good if ki /Ni and di /Ni are both bounded away from 0.
Theorem 3.3.25 showed that there is no asymptotically good sequence of primi-
tive BCH codes (in fact, there is no asymptotically good sequence of BCH codes of
any form). Theoretically, an elegant way to produce an asymptotically good family
are the so-called Justensen codes. As the first attempt to define a good code take
0 = α ∈ F2m  Fm 2 and define the set

Xα = {(a, α a) : a ∈ Fm
2 }. (3.5.1)

Then Xα is a [2m, m] linear code and has information rate 1/2. We can recover α
from any non-zero codeword (a, b) ∈ Xα , as α = ba−1 (division in F2m ). Hence,
if α = α then Xα ∩ Xα = {0}.
Now, given λ = λm ∈ (0, 1/2], we want to find α = αm such that code Xα has
minimum weight ≥ 2mλ . Since a non-zero binary (2m)-word can enter at most
one of the Xα ’s, we can find such α if the number of the non-zero (2m)-words
3.5 Asymptotically good codes 329

of weight < 2mλ is < 2m − 1, the number of distinct codes Xα . That is, we can
manage if
 
2m
∑ i
< 2m − 1
1≤i≤2mλ −1
 
2m
or even better, ∑ < 2m − 1. Now use the following:
1≤i≤2mλ i
Lemma 3.5.2 For 0 ≤ λ ≤ 1/2,
 
N
∑ k
≤ 2N η (λ ) , (3.5.2)
0≤k≤λ N

where η (λ ) is the binary entropy.


Proof Observe that (3.5.2) holds for λ = 0 (here both sides are equal to 1) and
λ = 1/2 (where the RHS equals 2N ). So we will assume that 0 < λ < 1/2. Consider
a random variable ξ with the binomial distribution
 
 N
P ξ = k) = (1/2)N , 0 ≤ k ≤ N.
k
Given t ∈ R+ , use the following Chebyshev-type inequality:
   N
N 1
∑ k 2 = P(ξ ≤ λ N)
0≤k≤λ N
 
= P exp (−t ξ ) ≥ e−λ Nt
≤ e−λ Nt Ee−t ξ
 
−λ Nt 1 1 −t N
=e + e . (3.5.3)
2 2
Minimise the RHS of (3.5.3) in x = e−t for t > 0, i.e. for 0 < x < 1. This yields the
minimiser e−t = λ /(1 − λ ) and the minimal value
 −λ N  N  N
λ 1 λ
1+
1−λ 2   1−λ  
1 N 1 N
= λ −λ N μ − μ N = 2N η (λ ) ,
2 2
with μ = 1 − λ . Hence, (3.5.2) implies
   N  N
N 1 N η (λ ) 1
∑ k 2 ≤2 2
.
0≤k≤λ N
330 Further Topics from Coding Theory

Owing to Lemma 3.5.2, inequality (3.5.1) occurs when


22mη (λ ) < 2m − 1. (3.5.4)
Now if, for example,
 
−1 1
λ = λm = η 1/2 −
log m
(with 0 < λ < 12 ), bound (3.5.4) becomes 2m−2m/ log m < 2m − 1 which is true for
m large enough. And λm → h−1 (1/2) > 0, as m → ∞. Here and below, η −1 is
the inverse function to λ ∈ (0, 1/2] → η (λ ). In the code (3.5.1) with a fixed α
the information rate is 1/2 but one cannot guarantee that d/2m is bounded away
from 0. Moreover, there is no effective way of finding a proper α = αm . However,
in 1972, Justensen [81] showed how to obtain a good sequence of codes cleverly
using the concatenation of words from an RS code.
More precisely, consider a binary (k1 k2 )-word a organised as k1 separate k2 -
words: a = a(0) a(1) . . . a(k1 −1) . Pictorially,
k2 k2
←→ ←→
a = ... a(i) ∈ F2k2 , 0 ≤ i ≤ k1 − 1.
a(0) a(k1 −1)

We fix an [N1 , k1 , d1 ] code X1 over F2k2 called an outer code: X1 ⊂ FN2k12 . Then
string a is encoded into a codeword c = c0 c1 . . . cN1 −1 ∈ X1 . Next, each ci ∈ F2k2
is encoded by a codeword bi from an [N2 , k2 , d2 ] code X2 over F2 , called an inner
code. The result is a string b = b(0) . . . b(N1 −1) ∈ FNq 1 N2 of length N1 N2 :
N2 N2
←→ ←→
b = ... b(i) ∈ F2N2 , 0 ≤ i ≤ N1 − 1.
b(0) b(N1 −1)
The encoding is represented by the diagram:
input: a (k1 k2 ) string a, output: an (N1 N2 ) codeword b.
Observe that different symbols ci can be encoded by means of different inner
codes. Let the outer code X1 be a [2m − 1, k, d] RS code X RS over F2m . Write a
binary (k2m )-word a as a concatenation a(0) . . . a(k−1) , with a(i) ∈ F2m . Encoding a
using X RS gives a codeword c = c0 . . . cN−1 , with N = 2m − 1 and ci ∈ F2m . Let β
be a primitive element in F2m . Then for all j = 0, . . . , N − 1 = 2m − 2, consider the
inner code
* +
X ( j) = (c, β j c) : c ∈ F2m . (3.5.5)
3.5 Asymptotically good codes 331

The resulting codeword (a ‘super-codeword’) is


b = (c0 , c0 )(c1 , β c1 )(c2 , β 2 c2 ) . . . (cN−1 , β N−1 cN−1 ). (3.5.6)
Definition 3.5.3 The Justensen code Xm,k Ju is the collection of binary super-words

b obtained as above, with the [2 − 1, k, d] RS code as an outer code X1 , and


m

X ( j) (see (3.5.6)) as the inner codes, where 0 ≤ j ≤ 2m − 2. Code Xm,k Ju has length

k
2m(2m − 1), rank mk and hence rate < 1/2.
2(2 − 1)
m

A convenient parameter describing Xm,k


Ju is N = 2m − 1, the length of the outer

RS code. We want to construct a sequence Xm,k Ju with N → ∞, but k/(2m(2m −

1)) and d/(2m(2m − 1)) bounded away from 0. Fix R0 ∈ (0, 1/2) and choose a
sequence of outer RS codes XNRS of length N, with N = 2m − 1 and k = [2NR0 ].
Ju is k/(2N) ≥ R .
Then the rate of Xm,k 0
Now consider the minimum weight
 Ju 
  Ju 
w Xm,k = min w(x) : x ∈ Xm,k
Ju
, x = 0 = d Xm,k . (3.5.7)
For any fixed m, if the outer RS code XNRS , N = 2m − 1, has minimum weight
d then any super-codeword b = (c0 , c0 )(c1 , β c1 ) . . . (cN−1 , β N−1 cN−1 ) ∈ Xm,k
Ju has

≥ d non-zero first components c0 , . . . , cN−1 . Furthermore, any two inner codes


among X (0) , X (1) , . . . , X (N−1) have only 0 in common. So, the corresponding d
ordered pairs, being from different codes, must be distinct. That is, super-codeword
b has ≥ d distinct non-zero binary (2m)-strings.
Next, the weight of super-codeword b ∈ Xm,k Ju is at least the sum of the weights

of the above d distinct non-zero binary (2m)-strings. So, we need to establish a


lower bound on such a sum. Note that
 
k−1
d = N −k+1 = N 1− ≥ N(1 − 2R0 ).
N
Hence, a super-codeword b ∈ Xm,k
Ju has at least N(1 − 2R ) distinct non-zero binary
0
(2m)-strings.
Lemma 3.5.4 The sum of the weights of any N(1 − 2R0 ) distinct non-zero binary
(2m)-strings is
   
−1 1
≥ 2mN(1 − 2R0 ) η − o(1) . (3.5.8)
2
Proof By Lemma 3.5.2, for any λ ∈ [0, 1/2], the number of non-zero binary (2m)-
strings of weight ≤ 2mλ is
 
2m
≤ ∑ ≤ 22mη (λ ) .
1≤i≤2mλ
i
332 Further Topics from Coding Theory

Discarding these lightweight strings, the total weight is


 
2mη (λ )
2mη (λ )
≥ 2mλ N(1 − 2R0 ) − 2 = 2mN λ (1 − 2R0 ) 1 − .
N(1 − 2R0 )
 
−1 1 1
Select λm = η − ∈ (0, 1/2). Then λm → η −1 (1/2), as h−1 is
2 log(2m)
continuous on [0, 1/2]. So,
   
1 1 1
λm = η −1 − = η −1 − o(1).
2 log(2m) 2
Since N = 2m − 1, we have that as m → ∞, N → ∞, and
2mη (λ ) 1 2m−2m/ log(2m)
=
N(1 − 2R0 ) 1 − 2R0 2m − 1
1 2m 1
= → 0.
1 − 2R0 2 −1 2
m 2m/ log(2m)

So the total weight of the N(1 − 2R0 ) distinct (2m)-strings is bounded below by
   
2mN(1 − 2R0 ) η −1 (1/2) − o(1) (1 − o(1)) = 2mN(1 − 2R0 ) η −1 (1/2) − o(1) .
Thus the result follows.
Lemma 3.5.4 demonstrates that Xm,k
Ju has

   
 Ju  −1 1
w Xm,k ≥ 2mN(1 − 2R0 ) η − o(1) . (3.5.9)
2
Then
w Xm,kJu
≥ (1 − 2R0 )(η −1 (1/2) − o(1)) → (1 − 2R0 )η −1 (1/2)
length Xm,k Ju

≈ c(1 − 2R0 ) > 0.


So, the sequence Xm,kJu with k = [2NR ], N = 2m − 1 and a fixed 0 < R < 1/2, has
0 0
information rate ≥ R0 > 0 and

w Xm,k Ju
→ c(1 − 2R0 ) > 0, c = η −1 (1/2) > 0.3. (3.5.10)
length Xm,k Ju

In the construction, R0 ∈ (0, 1/2). However, by truncating one can achieve any
given rate R0 ∈ (0, 1); see [110].

The next family of codes to be discussed in this section is formed by alternant


codes. Alternant codes are a generalisation of BCH (though in general not cyclic).
Like Justesen codes, alternant codes also form an asymptotically good family.
3.5 Asymptotically good codes 333

Let M be a (r × n) matrix over field Fqm :


⎛ ⎞
c11 . . . c1n
⎜ .. .. .. ⎟ .
M=⎝ . . . ⎠
cr1 . . . crn
As before, each ci j can be written as − c→i j ∈ (Fq ) , a column vector of length m over
m

Fq . That is, we can think of M as an (mr × n) matrix over Fq (denoted again by M).
Given elements a1 , . . . , an ∈ Fqm , we have
⎛ ⎞
⎛ ⎞ ⎛ ⎞⎛ ⎞ ∑ a j ci j
a1 c11 . . . c1n a1 ⎜ 1≤ j≤n ⎟
⎜ .. ⎟ ⎜ .. . . ⎟ ⎜ .. ⎟ ⎜ .. ⎟
M⎝ . ⎠ = ⎝ . . . . ⎠⎝ . ⎠ = ⎜
. . ⎟.
⎝ ⎠
an cr1 . . . crn an ∑ a j cr j
1≤ j≤n

Furthermore, if b ∈ Fq and c, d ∈ Fqm then b→ −c = →


− −c + →
bc and →
− −−→
d = (c + d). Thus,
if a1 , . . . , an ∈ Fq , then
⎛ −−→ ⎞
⎛ ⎞ ⎛ −→ −→ ⎞⎛ ⎞ ∑ ai ci j
a1 c11 . . . c1n a1 ⎜ 1≤ j≤n ⎟
⎜ .. ⎟ ⎜ .. . . ⎟ ⎜ . ⎟
. . .. ⎠ ⎝ .. ⎠ = ⎜ .. ⎟
M⎝ . ⎠ = ⎝ . ⎜ . ⎟.

→ −→ ⎝ −−→ ⎠
an cr1 . . . crn an ∑ ai cr j
1≤ j≤n

So, if the columns of M are linearly independent as r-vectors over Fqm , they are
also linearly independent as (rm)-vectors over Fq . That is, the columns of M are
linearly independent over Fq .
Recall that if ω is a primitive (n, Fqm ) root of unity and δ ≥ 2 then the n × (mδ )
Vandermonde matrix over Fq
⎛ → −e →
−e →
−e ⎞
...
⎜ → −ω →

ω2 ... →

ω δ −1 ⎟
⎜ → − →
− →
− ⎟
H =⎜ ω
T ⎜ 2 ω 4 ... ω 2( δ −1) ⎟

⎝ ... ⎠

− →
− →

ω n−1 ω 2(n−1) . . . ω (δ −1)(n−1)

checks a narrow-sense BCH code Xq,n, BCH (a proper parity-check matrix emerges
ω ,δ
after column purging). Generalise it by taking an n × r matrix over Fqm
⎛ ⎞
h1 h1 α1 . . . h1 α1r−2 h1 α1r−1
⎜ h2 h2 α2 . . . h2 α r−2 h2 α r−1 ⎟
⎜ 2 2 ⎟
A=⎜ . .. . .. ⎟, (3.5.11)
⎝ . . . . . . ⎠
hn hn αn . . . hn αnr−2 hn αnr−1
334 Further Topics from Coding Theory

or its n × (mr) version over Fq :


⎛ →− −−→ −−→ −−→r−1 ⎞
h1 h1 α1 . . . h1 α1 r−2 h1 α1
⎜ →− −−→ −−→ −−→r−1 ⎟
→ ⎜
− ⎜ h2 h2 α2 . . . h2 α2 r−2 h2 α2 ⎟
A =⎜ . ⎟. (3.5.12)
.. .. .. ⎟
⎝ .. . . . ⎠
− −−→
→ −−→r−2 −−→r−1
hn hn αn . . . hn αn hn αn
Here r < n, h1 , . . . , hn are non-zero elements and α1 , . . . , αn are distinct elements
from Fq .
Note that any r rows of A in (3.5.11) form a square sub-matrix K that is similar
to Vandermonde’s. It has a non-zero determinant and hence any r rows of A are lin-
early independent over Fqm and hence over Fq . Also the columns of A in (3.5.11)


are linearly independent over Fqm . However, columns of A in (3.5.12) can be lin-
early dependent and purging such columns may be required to produce a ‘genuine’
parity-check matrix H.
Definition 3.5.5 Let α = (α1 , . . . , αn ) and h = (h1 , . . . , hn ) where α1 , . . . , αn are
distinct and h1 , . . . , hn non-zero elements of Fqm . Given r < n, an alternant code
,h is the kernel of the n × (rm) matrix A in (3.5.12).
XαAlt

Theorem 3.5.6 XαAlt has length n, rank k satisfying n − mr ≤ k ≤ n − r and


 ,hAlt 
minimum distance d Xα ,h ≥ r + 1.
We see that the alternant codes are indeed generalisations of BCH. The main
outcome of the theory of alternant codes is the following Theorem 3.5.7 (not to be
proven here).
Theorem 3.5.7 There exist arbitrarily long alternant codes XαAlt
,h meeting the
Gilbert–Varshamov bound.
So, alternant codes are asymptotically good. More precisely, a sequence of
asymptotically good alternant codes is formed by the so-called Goppa codes. See
below.
The Goppa codes are particular examples of alternant codes. They were invented
by a Russian coding theorist, Valery Goppa, in 1972 by following an elegant idea
that has its origin in algebraic geometry. Here, we perform the construction by
using methods developed in this section.
Let G(X) ∈ Fqm [X] be a polynomial over Fqm and consider Fqm [X]/"G(X)#, the
polynomial ring mod G(X) over Fqm . Then Fqm [X]/"G(X)# is a field iff G(X) is
irreducible. But if, for a given α ∈ Fqm , G(α ) = 0, the linear polynomial X − α is
invertible in Fqm [X]/"G(X)#. In fact, write
G(X) = q(X)(X − α ) + G(α ), (3.5.13)
3.5 Asymptotically good codes 335

with q(X) ∈ Fq [X], deg q(X) = deg G(X) − 1.


So, q(X)(X − α ) = −G(α ) mod G(X) or

(−G(α )−1 q(X))(X − α ) = e mod G(X)

and

(X − α )−1 = (−G(α )−1 q(X)) mod G(X). (3.5.14a)

As q(X) = (G(X) − G(α ))(X − α )−1 , we have that

(X − α )−1 = −(G(X) − G(α ))(X − α )−1 G(α )−1 mod G(X). (3.5.14b)

So we define (X − α )−1 as a polynomial in Fqm [X]/"G(X)# given by (3.5.14a).

Definition 3.5.8 Fix a polynomial G(X) ∈ Fq [X] and a set α = {α1 , . . ., αn } of


distinct elements of Fqm , qm ≥ n > deg G(X), where G(α j ) = 0, 1 ≤ j ≤ n. Given
a word b = b1 . . . bn with bi ∈ Fq , 1 ≤ i ≤ n, set

Rb (X) = ∑ bi (X − αi )−1 ∈ Fqm [X]/"G(X)#. (3.5.15)


1≤i≤n

The q-ary Goppa code X Go (= XαGo


,G ) is defined as the set

{b ∈ Fnq : Rb (X) = 0 mod G(X)}. (3.5.16)

Clearly, XαGo
,G is a linear code. The polynomial G(X) is called the Goppa polyno-
mial; if G(X) is irreducible, we say that X Go is irreducible.

So, b = b1 . . . bn ∈ X Go iff in Fqm [X]

∑ bi (G(X) − G(αi ))(X − αi )−1 G(αi )−1 = 0. (3.5.17)


1≤i≤n

Write G(X) = ∑ gi X i where deg G(X) = r, gr = 1 and r < n. Then in Fqm [X]
0≤i≤r

(G(X) − G(αi ))(X − αi )−1


 
= ∑ g j X j − αij (X − αi )−1
0≤ j≤r

= ∑0≤ j≤r g j ∑0≤u≤ j−1 X u αij−1−u


336 Further Topics from Coding Theory

and so
∑ bi (G(X) − G(αi ))(X − αi )−1 G(αi )−1
1≤i≤n

= ∑ bi ∑ g j ∑ αij−1−u X u G(α j )−1


1≤i≤n 0≤ j≤r 0≤u≤ j−1

= ∑ X u ∑ bi G(αi )−1 ∑ g j αij−1−u .


0≤u≤r−1 1≤i≤n u+1≤ j≤r

Hence, b ∈ X Go iff in Fqm

∑ bi G(αi )−1 ∑ g j αij−1−u = 0 (3.5.18)


1≤i≤n u+1≤ j≤r

for all u = 0, . . . , r − 1.
Equation (3.5.18) leads to the parity-check matrix for X Go . First, we see that
the matrix
⎛ ⎞
G(α1 )−1 G(α2 )−1 ... G(αn )−1
⎜ α1 G(α1 )−1 α2 G(α2 )−1 . . . αn G(αn )−1 ⎟
⎜ ⎟
⎜ α 2 G(α1 )−1 α22 G(α2 )−1 . . . αn2 G(αn )−1 ⎟
⎜ 1 ⎟, (3.5.19)
⎜ . . . . ⎟
⎝ .
. .
. . . .
. ⎠
α1 G(α1 )
r−1 −1 α2 G(α2 )
r−1 −1 . . . αn G(αn )
r−1 −1

which is (n × r) over Fm q , provides a parity-check. As before, any r rows of ma-


trix (3.5.19) are linearly independent over Fqm and so are its columns. Then again
we write (3.5.19) as an n × (mr) matrix over Fq ; after purging linearly dependent
columns it will give the parity-check matrix H.
We see that X Go is an alternant code XαAlt ,h where α = (α1 , . . . , αn ) and h =
−1 −1
(G(α1 ) , . . . , G(αn ) ). So, Theorem 3.5.6 implies
Theorem 3.5.9 The q-ary Goppa code X = XαGo ,G , where α = {α1 , . . . , αn } and
deg G(X) = r < n, has length n, rank k satisfying n − mr ≤ k ≤ n − r and minimum
distance d(X ) ≥ r + 1.
As before, the above bound on minimum distance can be improved in the binary
case. Suppose that a binary word b = b1 . . . bn ∈ X where X is a Goppa code
,G , where α ⊂ F2m and G(X) ∈ F2 [X]. Suppose w(b) = w and bi1 = · · · = biw = 1.
XαGo
Take fb (X) = ∏ (X − αi j ) and write the derivative ∂X fb (X) as
1≤ j≤w

∂X fb (X) = Rb (X) fb (X) (3.5.20)


 −1
where Rb (X) = ∑ X − αi j (cf. (3.5.15)). As polynomials fb (X) and Rb (X)
1≤ j≤w
have no common roots in any extension F2K , they are co-prime. Then b ∈ X Go
3.5 Asymptotically good codes 337

iff G(X) divides Rb (X) which is the same as G(X) divides ∂X fb (X). For q = 2,
∂X fb (X) has only even powers of X (as its monomials are of the form X −1 times
a product of some αi j ’s: this vanishes when  is even). In other words, ∂X fb =
h(X 2 ) = (h(X))2 for some polynomial h(X). Hence if g(X) is the polynomial of
lowest degree which is a square and divisible by G(X) then G(X) divides ∂X fb (X)
iff g(X) divides ∂X fb (X). So,

b ∈ X Go ⇔ g(X)|∂X fb (X) ⇔ Rb (X) = 0 mod g(X). (3.5.21)

Theorem 3.5.10 Let X be a binary Goppa code XαGo ,G . If g(X) is a polynomial


of the lowest degree which is a square and divisible by G(X) then X = XαGo ,g .
Hence, d(X Go ) ≥ deg g(X) + 1.
Corollary 3.5.11 Suppose that the Goppa polynomial G(X) ∈ F2 [X] has no mul-
tiple roots in any extension field. Then XαGo
,G = Xα ,G2 , and the minimum distance
Go
 Go 
d Xα ,G is ≥ 2 deg G(X) + 1. Therefore, XαGo ,G can correct ≥ deg G(X) errors.

A binary Goppa code XαGo ,G where polynomial G(X) has no multiple roots is
called separable.
It is interesting to discuss a particular decoding procedure applicable for alter-
nant codes and based on the Euclid algorithm; cf. Section 2.5.
The initial setup for decoding an alternant code XαAlt ,h over Fq is as follows. As
− −−→i−1

in (3.5.12), we take the n × (mr) matrix A = h j α j over Fq obtained from the
 
n × r matrix A = h j α i−1 j over Fqm by replacing the entries with rows of length m.


Then purge linearly dependent columns from A . Recall that h1 , . . . , hn are non-zero
and α1 , . . . , αn are distinct elements of Fqm . Suppose a word u = c + e is received,
where c is the right codeword and e an error vector. We assume that r is even and
that t ≤ r/2 errors have occurred, at digits 1 ≤ i1 < · · · < it ≤ n. Let the i j th entry
of e be ei j = 0. It is convenient to identify the error locators with elements αi j : as
αi = αi for i = i (the αi are distinct), we will know the erroneous positions if we
determine αi1 , . . . , αit . Moreover, if we introduce the error locator polynomial
t
(X) = ∏ (1 − αi j X) = ∑ i X i , (3.5.22)
j=1 0≤i≤t

with 0 = 1 and the roots αi−1j


, then it will be enough to find (X) (i.e. coeffi-
cients i ).
We have to calculate the syndrome (we will call it an A-syndrome) emerging by
acting on matrix A:
uA = eA = 0 . . . 0ei1 . . . eit 0 . . . 0 A.
338 Further Topics from Coding Theory

Suppose the A-syndrome is s = s0 . . . sr−1 , with s(X) = ∑ si X i . It is convenient


0≤i≤r−1
to introduce the error evaluator polynomial ε (X), by
ε (X) = ∑ hik eik ∏ (1 − αi j X). (3.5.23)
1≤k≤t 1≤ j≤t: j =k

Lemma 3.5.12 For all u = 1, . . . ,t ,


ε (αi−1
j
)
ei j = . (3.5.24)
hi j ∏ (1 − αi j αi−1
j
)
1≤ j≤t, j = j

Proof Straightforward.
The crucial fact is that (X), ε (X) and s(X) are related by
Lemma 3.5.13 The following formula holds true:
ε (X) = (X)s(X) mod X r . (3.5.25)
Proof Write the following sequence:
ε (X) − (X)s(X) = ∑ hik eik ∏ (1 − αi j X) − (X) ∑ sl X l
1≤k≤t 1≤ j≤t: j =k 0≤l≤r−1

= ∑ hik eik ∏ (1 − αi j X) − (X) ∑ ∑ hik αilk eik X l


k 1≤ j≤t: j =k l 1≤k≤t

= ∑ hik eik ∏(1 − αi j X) − (X) ∑ hik eik ∑ αilk X l


k j =k k l
 
= ∑ hik eik ∏(1 − αi X) − (X) ∑ αil X l
j k
k j =k l

= ∑ hik eik ∏(1 − αi j X)(1 − (1 − eik X) ∑ αilk X l )


k j =k l
 
1 − αirk X r
= ∑ hik eik ∏(1 − αi j X) 1 − (1 − αik X)
k j =k 1 − αik X
= ∑ hik αik ∏(1 − αi j X) αirk X r = 0 mod X r .
k j =k

Lemma 3.5.13 shows the way of decoding alternant codes. We know that there
exists a polynomial q(X) such that
ε (X) = q(X)X r + (X)s(X). (3.5.26)
We also have deg ε (X) ≤ t − 1 < r/2, deg (X) = t ≤ r/2 and that ε (X) and (X)
are co-prime as they have no common roots in any extension. Suppose we apply the
3.5 Asymptotically good codes 339

Euclid algorithm to the known polynomials f (X) = X r and g(X) = s(X) with the
aim to find ε (X) and (X). By Lemma 2.5.44, a typical step produces a remainder
rk (X) = ak (X)X r + bk (X)s(X). (3.5.27)
If we want rk (X) and bk (X) to give ε (X) and (X), their degrees must match:
at least we must have deg rk (X) < r/2 and deg bk (X) ≤ r/2. So, the algorithm is
repeated until deg rk−1 (X) ≥ r/2 and deg rk (X) < r/2. Then, according to Lemma
2.5.44, statement (3), deg bk (X) = deg X r − deg rk−1 (X) ≤ r − r/2 = r/2. This is
possible as the algorithm can be iterated until rk (X) = gcd(X r , s(X)). But then
rk (X)|ε (X) and hence deg rk (X) ≤ deg ε (X) < r/2. So we can assume deg rk (X) ≤
r/2, deg bk (X) ≤ r/2.
The relevant equations are
ε (X) = q(X)X r + (X)s(X),
deg ε (X) < r/2, deg (X) ≤ r/2,
gcd (ε (X), (X)) = 1,
and also
rk (X) = ak (X)X r + bk (X)s(X), deg rk (X) < r/2, deg bk (X) ≤ r/2.
We want to show that polynomials rk (X) and bk (X) are scalar multiples of ε (X)
and (X). Exclude s(X) to get
bk (X)ε (X) − rk (X)(X) = (bk (X)q(X) − ak (X)(X))X r .
As
deg bk (X)ε (X) = deg bk (X) + deg ε (X) < r/2 + r/2 = r
and
deg rk (X)(X) = deg rk (X) + deg (X) < r/2 + r/2 = r,
deg(b(X)ε (X) − rk (X)(X)) < r. Hence, bk (X)ε (X) − rk (X)(X) must be 0, i.e.
(X)rk (X) = ε (X)bk (X), bk (X)q(X) = ak (X)(X).
So, (X)|ε (X)bk (X) and bk (X)|ak (X)(X). But (X) and ε (X) are co-primes as
well as ak (X) and bk (X) (by statement (5) of Lemma 2.5.44). Therefore, (X) =
λ bk (X) and hence ε (X) = λ rk (X). As l(0) = 1, λ = bk (0)−1 .
To summarise:
Theorem 3.5.14 (The decoding algorithm for alternant codes) Suppose XαAlt ,h is
an alternant code, with even r, and that t ≤ r/2 errors occurred in a received word
u. Then, upon receiving word u:
340 Further Topics from Coding Theory

(a) Find the A-syndrome uA = s0 . . . sr−1 , with the corresponding polynomial


s(X) = ∑l sl X l .
(b) Use the Euclid algorithm, beginning with f (X) = X r , g(X) = s(X), to obtain
rk (X) = ak (X)X r + bk (X)s(X) with deg rk−1 (X) ≥ r/2 and deg rk (X) < r/2.
(c) Set (X) = bk (0)−1 bk (X), ε (X) = bk (0)−1 rk (X).

Then (X) is the error locator polynomial whose roots are the inverses of
αi1 , . . . , yt = αit , and i1 , . . . , it are the error digits. The values ei j are given by

ε (αi−1
j
)
ei j = . (3.5.28)
hi j ∏l = j (1 − αil αi−1
j
)

The ideas discussed in this section found a far-reaching development in


algebraic-geometric codes. Algebraic geometry provided powerful tools in modern
code design; see [98], [99], [158], [160].

3.6 Additional problems for Chapter 3


Problem 3.1 Define Reed–Solomon codes and prove that they are maximum dis-
tance separable. Prove that the dual of a Reed–Solomon code is a Reed–Solomon
code.
Find the minimum distance of a Reed–Solomon code of length 15 and rank 11
and the generator polynomial g1 (X) over F16 for this code. Use the provided F16
field table to write g1 (X) in the form ω i0 + ω i1 X + ω i2 X 2 + · · · , identifying each
coefficient as a single power of a primitive element ω of F16 .
Find the generator polynomial g2 (X) and the minimum distance of a Reed–
Solomon code of length 10 and rank 6. Use the provided F11 field table to write
g2 (X) in the form a0 + a1 X + a2 X 2 + · · · , where each coefficient is a number from
{0, 1, . . . , 10}.
Determine a two-error correcting Reed–Solomon code over F16 and find its
length, rank and generator polynomial.
The field table for F11 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, with addition and multipli-
cation mod 11:

i 0 1 2 3 4 5 6 7 8 9
ω i 1 2 4 8 5 10 9 7 3 6
3.6 Additional problems for Chapter 3 341

The field table for F16 = F24 :

i 0 1 2 3 4 5 6 7 8
ω i 0001 0010 0100 1000 0011 0110 1100 1011 0101

i 9 10 11 12 13 14
ω 1010 0111 1110 1111 1101 1001
i

Solution A q-ary RS code X RS of designed distance δ ≤ q − 1 is defined as a


cyclic code of length N = q − 1 over Fq , with a generator polynomial

g(X) = (X − ω b )(X − ω b+1 ) . . . (X − ω b+δ −2 )

of deg(g(X)) = δ −1. Here ω is a primitive (q−1, Fq ) root of unity (i.e. a primitive


element of F∗q ) and b = 0, 1, . . . , q − 2. The powers ω b , . . . , ω b+δ −2 are called the
(defining) zeros and the remaining N − δ + 1 powers of ω non-zeros of X RS .
The rank of X RS equals k = N − δ + 1. The distance is ≥ δ = N − k + 1, but
by the Singleton bound should be ≤ δ = N − k + 1. So, the distance equals δ =
N − k + 1, i.e. X RS is maximum distance separable.
The dual (X RS )⊥ of X RS is cyclic and its zeros are the inverses of the non-
zeroes of X RS :
ω q−1− j = (ω j )−1 , j = b, . . . , b + δ − 2.

That is, they are


ω q−b , ω q−b+1 , . . . , ω q−b+q−δ −1

and the generator polynomial g⊥ (X) for (X RS )⊥ is


⊥ ⊥ +1 ⊥ +q−δ −1
g⊥ (X) = (X − ω b )(X − ω b ) . . . (X − ω b )

where b⊥ = q − b. That is (X RS )⊥ is an RS code of designed distance q − δ + 1.


In the example, length 15 means q = 15 + 1 = 16 and rank 11 yields distance
δ = 15 − 11 + 1 = 5. The generator g1 (X) over F16 = F42 , for the code with b = 1,
reads
g1 (X) = (X − ω )(X − ω 2 )(X − ω 3 )(X − ω 4 )
= X 4 − (ω + ω 2 + ω 3 + ω 4 )X 3
+(ω 3 + ω 4 + ω 5 + ω 5 + ω 6 + ω 7 )X 2
−(ω 6 + ω 7 + ω 8 + ω 9 )X + ω 10
= X 4 + ω 13 X 3 + ω 6 X 2 + ω 3 X + ω 10

where the calculation is accomplished by using the F16 field table.


342 Further Topics from Coding Theory

Similarly, length 10 means q = 11 and rank 6 yields distance δ = 10 − 6 + 1 = 5.


The generator g2 (X) is over F11 and, again for b = 1, reads
g2 (X) = (X − ω )(X − ω 2 )(X − ω 3 )(X − ω 4 )
= X 4 − (ω + ω 2 + ω 3 + ω 4 )X 3
+(ω 3 + ω 4 + ω 5 + ω 5 + ω 6 + ω 7 )X 2
−(ω 6 + ω 7 + ω 8 + ω 9 )X + ω 10
= X 4 + 3X 3 + 5X 2 + 8X + 1
where the calculation is accomplished by using the F11 -field table.
Finally, a two-error correcting RS code over F16 must have length N = 15 and
distance δ = 5, hence rank 11. So, it coincides with the above 16-ary [15, 11] RS
code.
Problem 3.2 Let X be a binary linear [N, k] code and X ev be the set of words
x ∈ X of even weight. Prove that either

(i) X = X ev or
(ii) X ev is an [N, k − 1] linear subcode of X .
Prove that if the generating matrix G of X has no zero column then the total weight
∑ w(x) equals N2k−1 .
x∈X
[Hint: Consider the contribution from each column of G.]
Denote by XH, the binary Hamming code of length N = 2 − 1 and by XH, ⊥

the dual simplex code,  = 3, 4, . . .. Is it always true that the N -vector 1 . . . 1 (with
all digits one) is a codeword in XH, ? Let As and A⊥ s denote the number of words
of weight s in XH, and XH, , respectively, with A0 = A⊥

0 = 1 and A1 = A2 = 0.
Check that

A3 = N(N − 1) 3!, A4 = N(N − 1)(N − 3) 4!,

and

A5 = N(N − 1)(N − 3)(N − 7) 5!.

Prove that A⊥2−1


= 2 − 1 (i.e. all non-zero words x ∈ XH,⊥ have weight 2−1 ). By

using the last fact and the MacWilliams identity for binary codes, give a formula
for As in terms of Ks (2−1 ), the value of the Kravchuk polynomial:
s∧2−1  −1    
2 2 − 1 − 2−1
Ks (2−1 ) = ∑−1  j s − j
(−1) j .
j=0∨s+2 −2 +1

Here 0 ∨ s + 2−1 − 2 + 1 = max [0, s + 2−1 − 2 + 1] and s ∧ 2−1 = min[s, 2−1 ].


Check that your formula gives the right answer for s = N = 2 − 1.
3.6 Additional problems for Chapter 3 343

Solution X ev is always a linear subcode of X . In fact, for binary words x and x ,


w(x + x ) = w(x) + w(x ) − 2w(x ∧ x ), where digit (x ∧ x ) j = x j x j = min[xi , xi ].
If both x and x are even words (have even weight) or odd (have odd weight) then
x + x is even, and if x is even and x odd then x + x is odd. So, if X ev = X then
X ev is a subgroup in X of index [X ev : X ] two. Thus, there are two cosets, and
X ev is a half of X . So, X ev is [N, k − 1] code.
Let g( j) = (g1 j , . . . , gk j )T be column j of G, the generating matrix of X . Set
W j = {i = 1, . . . , k : gi j = 1}, with W j = w(g( j) ) = w j ≥ 1, 1 ≤ j ≤ N. The contri-
bution into the sum ∑x∈X w(x) coming from g( j) equals
2k−w j × 2w j −1 = 2k−1 .
Here 2k−w j represents the number of subsets of the complement {1, . . . , k} \ W j and
2w j −1 the number of odd subsets of W j . Multiplying by N (the number of columns)
gives N2k−1 .
If H = HH, is the parity-check matrix of the Hamming code XH, then weight
of row j of H equals the number of digits one in position j in the binary decom-
position of numbers 1, . . . , 2l − 1, 1 ≤ j ≤ l. So, w(h( j) ) = 2l−1 (a half of numbers
1, . . . , 2l − 1 have zero in position j and a half one). Then for all j = 1, . . . , N, the
dot-product 1 · h( j) = w(h( j) ) mod 2 = 0, i.e. 1 . . . 1 ∈ XH, .
Now A3 = N(N − 1)/3!, the number of linearly dependent triples of columns
of H (as the choice is made by fixing two distinct columns: the third is their
sum). Next, A4 = N(N − 1)(N − 3)/4!, the number of linearly dependent 4-ples of
columns of H (as the choice is made by fixing: (a) two arbitrary distinct columns,
(b) a third column that is not their sum, with (c) the fourth column being the sum
of the first three), and similarly, A5 = N(N − 1)(N − 3)(N − 7)/5!, the number of
linearly dependent 5-ples of columns of H. (Here N − 7 indicates that while choos-
ing the fourth column we should avoid any of 23 − 1 = 7 linear combinations of
the first three.)
In fact, any non-zero word x in the dual code XH, ⊥ has w(x) = 2l−1 . To prove

this note that the generating matrix of XH, ⊥ is H. So, write x as a sum of rows of

H, and let W be the set of rows of H contributing into this sum, with W = w ≤ .
Then w(x) equals the number of j among 1, 2, . . . , 2 − 1 such that in the binary
decomposition j = 20 j0 + 21 j1 + · · · + 2−1 jl−1 the sum ∑t∈W jt mod 2 equals one.
As before, this is equal to 2w−1 (the number of subsets of W of odd cardinality).
So, w(x) = 2−w+w−1 = 2−1 . Note that the rank of XH,l ⊥ is 2 − 1 − (2 − 1 − l) = l

and the size XH, ⊥ = 2 .

The MacWilliams identity (in the reversed form) reads


N
1
As = ⊥
XH,
∑ A⊥i Ks (i) (3.6.1)
i=1
344 Further Topics from Coding Theory

where
s∧i i N − i
Ks (i) = ∑ j s − j
(−1) j , (3.6.2)
j=0∨s+i−N

with 0 ∨ s + i − N = max[0, s + i − N], s ∧ i = min[s, i]. In our case, A⊥ ⊥


0 = 1, A2 −1 =
2 − 1 (the number of the non-zero words in XH, ⊥ ). Thus,

1  − 1)K (2−1 )

As = 1 + (2 s
2  
1 s∧2−1 2−1 2 − 1 − 2−1
=  1 + (2 − 1)

∑ j
(−1) .
2 j=0∨s+2−1 −2 +1 j 2 − 1 − j

For s = N = 2 − 1, A2 −1 can be either 1 (if the 2 -word 11 . . . 1 lies in XH, ) or 0


(if it doesn’t). The last formula yields

1 2−1 2−1 2 − 1 − 2−1
A2 −1 =  1 + (2 − 1) ∑
2 j=2−1 j 2 − 1 − j
1
=  1 + 2 − 1 = 1,
2
which agrees with the fact that 11 . . . 1 ∈ XH, .

Problem 3.3 Let ω be a root of M(X) = X 5 + X 2 + 1 in F32 ; given that M(X) is a


primitive polynomial for F32 , ω is a primitive (31, F32 )-root of unity. Use elements
ω, ω 2 , ω 3 , ω 4 to construct a binary narrow-sense primitive BCH code X of length
31 and designed distance 5. Identify the cyclotomic coset {i, 2i, . . . , 2d−1 i} for each
of ω, ω 2 , ω 3 , ω 4 . Check that ω and ω 3 suffice as defining zeros of X and that the
actual minimum distance of X equals 5. Show that the generator polynomial g(X)
for X is the product

(X 5 + X 2 + 1)(X 5 + X 4 + X 3 + X 2 + 1)
= X 10 + X 9 + X 8 + X 6 + X 5 + X 3 + 1.

Suppose you received a word u(X) = X 12 + X 11 + X 9 + X 7 + X 6 + X 2 + 1 from a


sender who uses code X . Check that u(ω ) = ω 3 and u(ω 3 ) = ω 9 , argue that u(X)
should be decoded as

c(X) = X 12 + X 11 + X 9 + X 7 + X 6 + X 3 + X 2 + 1

and verify that c(X) is indeed a codeword in X .


The field table for F32 = F25 and the list of irreducible polynomials of degree 5
over F2 are also provided to help with your calculations.
3.6 Additional problems for Chapter 3 345

The field table for F32 = F25 :


i 0 1 2 3 4 5 6 7
ω 00001 00010 00100 01000 10000 00101 01010 10100
i

i 8 9 10 11 12 13 14 15
ω 01101 11010 10001 00111 01110 11100 11101 11111
i

i 16 17 18 19 20 21 22 23
ω i 11011 10011 00011 00110 01100 11000 10101 01111

i 24 25 26 27 28 29 30
ω i 11110 11001 10111 01011 10110 01001 10010
The list of irreducible polynomials of degree 5 over F2 :

X 5 + X 2 + 1, X 5 + X 3 + 1, X 5 + X 3 + X 2 + X + 1,

X 5 + X 4 + X 3 + X + 1, X 5 + X 4 + X 3 + X 2 + 1;

they all have order 31. Polynomial X 5 + X 2 + 1 is primitive.

Solution As M(X) = X 5 + X 2 + 1 is a primitive polynomial in F2 [X], any root ω of


M(X) is a primitive (31, F2 )-root of unity, i.e. satisfies ω 31 + 1 = 0. Furthermore,
M(X) is the minimal polynomial for ω .
The BCH code X under construction is a cyclic code whose generator is a
polynomial of smallest degree having ω , ω 2 , ω 3 , ω 4 among its roots (i.e. a cyclic
code whose zeros form a minimal set including ω , ω 2 , ω 3 , ω 4 ). Thus, the generator
polynomial g(X) of X is the lcm of the minimal polynomials for ω , ω 2 , ω 3 , ω 4 .
The cyclotomic coset for ω is C = {1, 2, 4, 8, 16}; hence

(X − ω )(X − ω 2 )(X − ω 4 )(X − ω 8 )(X − ω 16 ) = X 5 + X 2 + 1

is the minimal polynomial for ω , ω 2 and ω 4 . The cyclotomic coset for ω 3 is C =


{3, 6, 12, 24, 17} and the minimal polynomial Mω 3 (X) for ω 3 equals
Mω 3 (X) = (X − ω 3 )(X − ω 6 )(X − ω 12 )(X − ω 24 )(X − ω 17 )
= X 5 + (ω 3 + ω 6 + ω 12 + ω 24 + ω 17 )X 4 + (ω 9 + ω 15 + ω 27
+ω 20 + ω 18 + ω 30 + ω 23 + ω 36 + ω 29 + ω 41 )X 3
+(ω 21 + ω 33 + ω 26 + ω 39 + ω 32 + ω 44 + ω 42 + ω 35
+ω 47 + ω 53 )X 2 + (ω 45 + ω 38 + ω 50 + ω 56 + ω 59 )X + ω 62
= X5 + X4 + X3 + X2 + 1
by a direct field-table calculation or by inspecting the list of irreducible polynomi-
als over F2 of degree 5.
346 Further Topics from Coding Theory

So, ω and ω 3 suffice as zeros, and the generating polynomial g(X) equals
(X 5 + X 2 + 1)(X 5 + X 4 + X 3 + X 2 + 1)
= X 10 + X 9 + X 8 + X 6 + X 5 + X 3 + 1,
as required. In other words:
X = {c(X) ∈ F2 [X]/(X 31 + 1) : c(ω ) = c(ω 3 ) = 0}

= {c(X) ∈ F2 [X]/(X 31 + 1) : g(X)|c(X)}.


The rank of X equals 21. The minimum distance of X equals 5, its designed
 3.3.20:
distance. This follows from Theorem
N
Let N = 2 − 1. If 2 < ∑
 E then the binary narrow-sense primitive BCH
0≤i≤E+1 i
code of designed distance 2E + 1 has minimum distance 2E + 1.
In fact, N = 31 = 25 − 1 with  = 5 and E = 2, i.e. 2E + 1 = 5, and
31 × 30 31 × 30 × 29
1024 = 210 < 1 + 31 + + = 4992.
2 2×3
Thus, X corrects two errors. The Berlekamp–Massey decoding procedure re-
quires calculating the values of the received polynomial at the defining zeros. From
the F32 field table we have
u(ω ) = ω 12 + ω 11 + ω 9 + ω 7 + ω 6 + ω 2 + 1 = ω 3 ,
u(ω 3 ) = ω 36 + ω 33 + ω 27 + ω 18 + ω 6 + 1 = ω 9 .
So, u(ω 3 ) = u(ω )3 . As u(ω ) = ω 3 , we conclude that a single error occurred, at
digit three, i.e. u(X) is decoded by
c(X) = X 12 + X 11 + X 9 + X 7 + X 6 + X 3 + X 2 + 1
which is (X 2 + 1)g(X) as required.
Problem 3.4 Define the dual X ⊥ of a linear [N, k] code of length N and dimen-
sion k with alphabet F. Prove or disprove that if X is a binary [N, (N − 1)/2] code
with N odd then X ⊥ is generated by a basis of X plus the word 1 . . . 1. Prove or
disprove that if a binary code X is self-dual, X = X ⊥ , then N is even and the
word 1 . . . 1 belongs to X .
Prove that a binary self-dual linear [N, N/2] code X exists for each even N .
Conversely, prove that if a binary linear [N, k] code X is self-dual then k = N/2.
Give an example of a non-binary linear self-dual code.

Solution The dual X ⊥ of the [N, k] linear code X is given by


X ⊥ = {x = x1 . . . xN ∈ FN : x · y = 0 for all y ∈ X }
3.6 Additional problems for Chapter 3 347

where x · y = x1 y1 + · · · + xN yN . Take N = 5, k = (N − 1)/2 = 2,


⎛ ⎞
1 0 0 0 0
⎜ 0 1 0 0 0⎟
X =⎜ ⎝ 1 1 0 0 0⎠ .

0 0 0 0 0
Then X ⊥ is generated by
⎛ ⎞
0 0 1 0 0
⎝ 0 0 0 1 0⎠ .
0 0 0 0 1
None of the vectors from X belongs to X ⊥ , so the claim is false.
Now take a self-dual code X = X ⊥ . If the word 1 = 1 . . . 1 ∈ X then there
exists x ∈ X such that x · 1 = 0. But x · 1 = ∑ xi = w(x) mod 2. On the other hand,
∑ xi = x · x, so x · x = 0. But then x ∈ X ⊥ . Hence 1 ∈ X . But then 1 · 1 = 0 which
implies that N is even.
Now let N = 2k. Divide digits 1, . . . , N into k disjoint pairs (α1 , β1 ), . . . , (αk , βk ),
with αi < βi . Then consider k binary words x(1) , . . . , x(k) of length N and weight 2,
with the non-zero digits in the word x(i) in positions (αi , βi ). Then form the [N, k]
code generated by x(1) , . . . , x(k) .

This code X is self-dual. In fact, x(i) · x(i ) = 0 for all i, i , hence X ⊂ X ⊥ .
Conversely, let y ∈ X ⊥ . Then y · x(i) = 0 for all i. This means that for all i, y
has either both 0 or both non-zero digits at positions (αi , βi ). Then y ∈ X . So,
X = X ⊥.
Now assume X = X ⊥ . Then N is even. But the dimension must be k by the
rank-nullity theorem.
The non-binary linear self-dual code is the ternary Golay [12, 6] with a generat-
ing matrix ⎛ ⎞
1 0 0 0 0 0 0 1 1 1 1 1
⎜0 1 0 0 0 0 1 0 1 2 2 1⎟
⎜ ⎟
⎜ ⎟
⎜0 0 1 0 0 0 1 1 0 1 2 2⎟
G=⎜ ⎟
⎜0 0 0 1 0 0 1 2 1 0 1 2⎟
⎜ ⎟
⎝0 0 0 0 1 0 1 2 2 1 0 1⎠
0 0 0 0 0 1 1 1 2 2 1 0
Here rows of G are orthogonal (including self-orthogonal). Hence, X ⊂ X ⊥ .
But dim(X ) = dim(X ⊥ ) = 6, so X = X ⊥ .
Problem 3.5 Define a finite field Fq with q elements and prove that q must have
the form q = ps where p is a prime integer and s  1 a positive integer. Check that
p is the characteristic of Fq .
348 Further Topics from Coding Theory

Prove that for any p and s as above there exists a finite field Fsp with ps elements,
and this field is unique up to isomorphism.
Prove that the set F∗ps of the non-zero elements of F ps is a cyclic group Z ps −1 .
Write the field table for F9 , identifying the powers ω i of a primitive element
ω ∈ F9 as vectors over F3 . Indicate all vectors α in this table such that α 4 = e.

Solution A field Fq with q elements is a set of cardinality q with two commutative


group operations, + and ·, with standard distributivity rules. It is easy to check that
char(Fq ) = p is a prime number. Then F p ⊂ Fq and q =  Fq = ps where s = [Fq : F p ]
is the dimension of Fq as a vector space over F p , a field of p elements.
Now, let F∗q , the multiplicative group of non-zero elements from Fq , contain an
element of order q − 1 =  F∗q . In fact, every b ∈ F∗q has a finite order ord(b) = r(b);
set r0 = max[r(b) : b ∈ F∗q ]. and fix a ∈ F∗q with r(a) = r0 . Then r(b)|r0 for all

b ∈ F∗q . Next, pick γ , a prime factor of r(b), and write r(b) = γ s ω , r0 = γ s α . Let us

check that s ≥ s . Indeed, aγ has order α , bω order γ s and aγ bω order γ s α . Thus,
s s

if s > s, we obtain an element of order > r0 . Hence, s ≥ s which holds for any
prime factor of r(b), and r(b)|r(a).
Then br(a) = e, for all b ∈ F∗q , i.e. the polynomial X r0 − e is divisible by (X − b).
It must then be the product ∏b∈F∗q (X − b). Then r0 =  F∗q = q − 1. Then F∗q is a
cyclic group with generator a.
For each prime p and positive integer s there exists at most one field Fq with
q = ps , up to isomorphism. Indeed, if Fq and F q are two such fields then they both
are isomorphic to Spl(X q − X), the splitting field of X q − X (over F p , the basic
field).
The elements α of F9 = F3 × F3 with α 4 = e are e = 01, ω 2 = 1 + 2ω = 21,
ω 4 = 02, ω 6 = 2 + ω = 12 where ω = 10.
Problem 3.6 Give the definition of a cyclic code of length N with alphabet Fq .
What are the defining zeros of a cyclic code
 and why are they always
 (N, Fq )-roots
3s − 1 3s − 1
of unity? Prove that the ternary Hamming , − s, 3 code is equivalent
2 2
to a cyclic code and identify the defining zeros of this cyclic code.
A sender uses the ternary [13, 10, 3] Hamming code, with field alphabet F3 =
{0, 1, 2} and the parity-check matrix H of the form
⎛ ⎞
1 0 1 2 0 1 2 0 1 2 0 1 2
⎝ 0 1 1 1 0 0 0 1 1 1 2 2 2⎠ .
0 0 0 0 1 1 1 1 1 1 1 1 1
The receiver receives the word x = 2 1 2 0 1 1 0 0 2 1 1 2 0. How should he
decode it?
3.6 Additional problems for Chapter 3 349

Solution As g(X)|(X N − 1), all roots of


 g(X)  of unity. Let gcd(l, q −
are Nth roots
ql − 1 ql − 1
1) = 1. We prove that the Hamming , − l code is equivalent to a
q−1 q−1  
cyclic code, with defining zero ω = β q−1 where β is the primitive ql −1 (q−1)-
root of unity. Indeed, set N = (ql −1) (q−1). The splitting field Spl(X N −1) = Fqr

where r = ordN (q) = min[s : N|(q − 1)]. Then r = l as q − 1 (q − 1)|(ql − 1)
s l

and l is the least such power. So, Spl(X N − 1) = Fql .

ql −1
If β is a primitive element is Fql then ω = β N = β q−1 is a primitive Nth root
of unity in Fql . Write ω 0 = e, ω , ω 2 , . . . , ω N−1 as column vectors in Fq × . . . × Fq
and form an l × N check matrix H. We want to check that any two distinct columns
of H are linearly independent. This is done exactly as in Theorem 3.3.14.

Then the code with parity-check matrix H has distance ≥ 3, rank k ≥ N − l. The
Hamming bound with N = (ql − 1)/(q − 1)

 −1 A B
N d −1
q ≤q
k N
∑ m
(q − 1) m
, with E =
2
, (3.6.3)
0≤m≤E

shows that d = 3 and k = N − l. So, the cyclic code with the parity-check matrix H
is equivalent to Hamming’s.

To decode the code in question, calculate the syndrome xH T = 2 0 2 = 2 · (1 0 1)


indicating the error is in the 6th position. Hence, x − 2e(6) = y + e(6) and the correct
word is c = 2 1 2 0 1 2 0 0 2 1 1 2 0.

Problem 3.7 Compute the rank and minimum distance of the cyclic code with
generator polynomial g(X) = X 3 +X +1 and parity-check polynomial h(X) = X 4 +
X 2 + X + 1. Now let ω be a root of g(X) in the field F8 . We receive the word
r(X) = X 5 + X 3 + X(mod X 7 − 1). Verify that r(ω ) = ω 4 , and hence decode r(X)
using minimum-distance decoding.

Solution A cyclic code X of length N has generator polynomial g(X) ∈ F2 [X]


and parity-check polynomial h(X) ∈ F2 [X] with g(X)h(X) = X N − 1. Recall that
if g(X) has degree k, i.e. g(X) = a0 + a1 X + · · · + ak X k where ak = 0, then
g(X), Xg(X), . . . , X n−k−1 g(X) form a basis for X . In particular, the rank of X
equals N − k. In this question, k = 3 and rank(X ) = 4.
350 Further Topics from Coding Theory

If h(X) = b0 + b1 X + · · · + bN−k X N−k then the parity-check matrix H of code


X is
⎛ ⎞
bN−k bN−k−1 ... b1 b0 0 ... 0 0
⎜ 0 b b . .. b1 b0 ... 0 0⎟
⎜ N−k N−k−1 ⎟
⎜ ⎟
⎜ ⎟
⎜ .. .. .. .. .. .. ⎟.
⎜ 0 . . . . . . ⎟
⎜ ⎟
⎝ ⎠
0 0 ... 0 bN−k bN−k−1 . . . b1 b0
: ;< =
N
The codewords of X are linear dependence relations between the columns of H.
The minimum distance d(X ) of a linear code X is the minimum non-zero weight
of a codeword.
In this question, N = 7, and
⎛ ⎞
1 0 1 1 1 0 0
H = ⎝0 1 0 1 1 1 0 ⎠ .
0 0 1 0 1 1 1

no zero column ⇒ no codewords of weight 1


no repeated column ⇒ no codewords of weight 2
Hence, d(X ) = 3. In fact, X is equivalent to Hamming’s original [7, 4] code.
Since g(X) ∈ F2 [X] is irreducible, the code X ⊂ F72 = F2 [X] (X 7 − 1) is the
cyclic code defined by ω . The multiplicative cyclic group F∗8  Z×7 of non-zero
elements of field F8 is
ω 0 = 1,
ω,
ω 2,
ω 3 = ω + 1,
ω4 = ω2 + ω,
ω 5 = ω 3 + ω 2 = ω 2 + ω + 1,
ω 6 = ω 3 + ω 2 + ω = ω 2 + 1,
ω 7 = ω 3 + ω = 1.
Next, the value r(ω ) is
r(ω ) = ω + ω 3 + ω 5
= ω + (ω + 1) + (ω 2 + ω + 1)
= ω2 + ω
= ω 4,
3.6 Additional problems for Chapter 3 351

as required. Let c(X) = r(X)+X 4 mod(X 7 − 1). Then c(ω ) = 0, i.e. c(X) is a code-
word. Since d(X ) = 3 the code is 1-error correcting. We just found a codeword
c(X) at distance 1 from r(X). Then r(X) is written as
c(X) = X + X 3 + X 4 + X 5 mod (X 7 − 1),
and should be decoded by c(X) under minimum-distance decoding.
Problem 3.8 If X is a linear [N, k] code, define its weight enumeration polyno-
mial WX (s,t). Show that:
(a) WX (1, 1) = 2k ,
(b) WX (0, 1) = 1,
(c) WX (1, 0) has value 0 or 1,
(d) WX (s,t) = WX (t, s) if and only if WX (1, 0) = 1.

Solution If x ∈ X the weight w(x) of X is given by w(x) = {xi : xi = 1}. Define


the weight enumeration polynomial
WX (s,t) = ∑ A j s j t N− j (3.6.4)
where A j = {x ∈ X : w(x) = j}. Then:
(a) WX (1, 1) = {x : x ∈ X } = 2dim X = 2k .
(b) WX (0, 1) = A0 = {0} = 1; note 0 ∈ X since X is a subspace.
(c) WX (1, 0) = 1 ⇔ AN = 1, i.e. 11 . . . 1 ∈ X , WX (1, 0) = 0 ⇔ AN = 0, i.e.
11 . . . 1 ∈ X .
(d) WX (s,t) = WX (t, s) ⇒ W (0, 1) = W (1, 0) ⇒ WX (1, 0) = 1 by (b).
So, if WX (1, 0) = 1 then
 {x ∈ X : w(x) = j} =  {x + 11 . . . 1 : x ∈ X , w(x) = j}
=  {y ∈ X : w(y) = N − j}
and WX (1, 0) = 1 implies AN− j = A j for all j. Hence, WX (s,t) = WX (t, s).
Problem 3.9 State the MacWilliams identity, connecting the weight enumerator
polynomials of a code X and its dual X ⊥ .
Prove that the weight enumerator of the binary Hamming code XH,l of length
N = 2l − 1 equals
1 
WX H (z) = l (1 + z)2 −1 + (2l − 1)(1 − z2 )(2 −2)/2 (1 − z) .
l l
(3.6.5)
l 2
Solution (The second part only) Let Ai be the number of codewords of weight i.
Consider i − 1 columns of the parity-check matrix H. There are three possibilities:
(a) the sum of these columns is 0;
352 Further Topics from Coding Theory

(b) the sum of these columns is one of the chosen columns;


(c) the sum of these columns is one of the remaining columns.
Possibility (a) occurs Ai−1 times; possibility (c) occurs iAi times as the selected
combination of i − 1 columns may be obtained from any word of weight i by
dropping any of its non-zero components. Next, observe that possibility (b) oc-
curs (N − (i − 2))Ai−2 times. Indeed, this combination may be obtained from a
codeword of weight i − 2 by adding any of the  N − (i − 2) remaining columns.
N
However, we can choose i − 1 columns in i−1 ways. Hence,
 
N
iAi = − Ai−1 − (N − i + 2)Ai−2 , (3.6.6)
i−1
which is trivially correct if i > N + 1. If we multiply both sides by zi−1 and then
sum over i we obtain an ODE
A (z) = (1 + z)N − A(z) − NzA(z) + z2 A (z). (3.6.7)
Since A(0) = 1, the unique solution of this ODE is
1 N
A(z) = (1 + z)N + (1 + z)(N−1)/2 (1 − z)(N+1)/2 (3.6.8)
N +1 N +1
which is equivalent to (3.6.5).
Problem 3.10 Let X be a linear code over F2 of length N and rank k and let Ai be
the number of words in X of weight i, i = 0, . . . , N . Define the weight enumerator
polynomial of X as
W (X , z) = ∑ Ai zi .
0≤i≤N

Let X ⊥ denote the dual code to X . Show that


 
1−z
W X ⊥ , z = 2−k (1 + z)N W X , . (3.6.9)
1+z

[Hint: Consider g(u) = ∑ (−1)u·v zw(v) where w(v) denotes the weight of the
v∈FN
2
vector v and average over X .]
Hence or otherwise show that if X corrects at least one error then the words of
X ⊥ have average weight N/2.
Apply (3.6.9) to the enumeration polynomial of Hamming code,
1 N
W (XHam , z) = (1 + z)N + (1 + z)(N−1)/2 (1 − z)(N+1)/2 , (3.6.10)
N +1 N +1
to obtain the enumeration polynomial of the simplex code:
W (Xsimp , z) = 2−k 2N /2l + 2−k (2l − 1)/2l × 2N z2
l−1 l−1
= 1 + (2l − 1)z2 .
3.6 Additional problems for Chapter 3 353

Solution The dual code X ⊥ , of a linear code X with the generating matrix G and
the parity-check matrix H, is defined as a linear code with the generating matrix
H. If X is an [N, k] code, X ⊥ is an [N, N − k] code, and the parity-check matrix
for X ⊥ is G.
Equivalently, X ⊥ is the code which is formed by the linear subspace in FN2
orthogonal to X in the dot-product
"x, y# = ∑ xi yi , x = x1 . . . xN , y = y1 . . . yN .
1≤i≤N

By definition,

W (X , z) = ∑ zw(u) , W X ⊥ , z = ∑ zw(v) .
u∈X v∈X ⊥

Following the hint, consider the average


1
X ∑ g(u), where g(u) = ∑(−1)"u,v# zw(v) . (3.6.11)
u∈X v

Then write (3.6.11) as


1
X ∑ zw(v) ∑ (−1)"u,v#. (3.6.12)
v u∈X

Note that when v ∈ X ⊥ , the sum ∑ (−1)"u,v# =  X . On the other hand, when
u∈X
v ∈ X ⊥ then there exists u0 ∈ X such that "u0 , v# = 0 (i.e. "u0 , v# = 1). Hence, if
v ∈ X ⊥ , then, with the change of variables u → u + u0 , we obtain

∑ (−1)"u,v# = ∑ (−1)"u+u ,v# 0

u∈X u∈X
= (−1)"u0 ,v# ∑ (−1)"u,v# = − ∑ (−1)"u,v#,
u∈X u∈X

which yields that in this case ∑ (−1)"u,v# = 0. We conclude that the sum in
u∈X
(3.6.11) equals
1    
X ∑ ⊥
zw(v)  X = W X ⊥ , z . (3.6.13)
v∈X

On the other hand, for u = u1 . . . uN ,


g(u) = ∑ ∏ zw(vi ) (−1)ui vi
v1 ,...,vN 1≤i≤N

= ∏ ∑ zw(a) (−1)aui
1≤i≤N a=0,1

= ∏ 1 + z(−1)ui . (3.6.14)
1≤i≤N
354 Further Topics from Coding Theory

Here w(a) = 0 for a = 0 and w(a) = 1 for a = 1. The RHS of (3.6.14) equals

(1 − z)w(u) (1 + z)N−w(u) .

Hence, an alternative expression for (3.6.11) is


 w(u)  
1 1−z 1 1−z
(1 + z)N ∑ = (1 + z)N W X , . (3.6.15)
X u∈X 1+z X 1+z

Equating (3.6.13) and (3.6.15) yields


 
1 1−z
(1 + z) W X ,
N
= W X ⊥, z (3.6.16)
X 1+z

which gives the required equation as  X = 2k .


Next, differentiate (3.6.16) in z at z = 1. The RHS gives
     
∑ iAi X ⊥ =  X ⊥ × the average weight in X ⊥ .
0≤i≤N

On the other hand, in the LHS we have


 

d 1 N−i 
dz  X 0≤i≤N∑ A i (X )(1 − z) i
(1 + z) 

z=1
1  
= N2N−1 − A1 (X )2N−1 (only terms i = 0, 1 contribute)
X
2N N
= (A1 (X ) = 0 as the code is at least 1-error correcting,
X 2
with distance ≥ 3).

Now take into account that

( X ) × ( X ⊥ ) = 2k × 2N−k = 2N .

The equality
N
the average weight in X ⊥ =
2
follows. The enumeration polynomial of the simplex code is obtained by substitu-
tion. In this case the average length is (2l − 1)/2.

Problem 3.11 Describe the binary narrow-sense BCH code X of length 15 and
the designed distance 5 and find the generator polynomial. Decode the message
100000111000100.
3.6 Additional problems for Chapter 3 355

Solution Take the binary narrow-sense BCH code X of length 15 and the designed
distance 5. We have Spl(X 15 − 1) = F24 = F16 . We know that X 4 + X + 1 is a
primitive polynomial over F16 . Let ω be a root of X 4 + X + 1. Then
M1 (X) = X 4 + X + 1, M3 (X) = X 4 + X 3 + X 2 + X + 1,
and the generator g(X) for X is
g(X) = M1 (X)M3 (X) = X 8 + X 7 + X 6 + X 4 + 1.
Take g(X) as example of a codeword. Introduce 2 errors – at positions 4 and 12
– by taking
u(X) = X 12 + X 8 + X 7 + X 6 + 1.
Using the field table for F16 , obtain
u1 = u(ω ) = ω 12 + ω 8 + ω 7 + ω 6 + 1 = ω 6
and
u3 = u(ω 3 ) = ω 36 + ω 24 + ω 18 + 1 = ω 9 + ω 3 + 1 = ω 4 .
As u1 = 0 and u31 = ω 18 = ω 3 = u3 , deduce that ≥ 2 errors occurred. Calculate the
locator polynomial
l(X) = 1 + ω 6 X + (ω 13 + ω 12 )X 2 .
Substituting 1, ω , . . . , ω 14 into l(X), check that ω 3 and ω 11 are roots. This confirms
that, if exactly 2 errors occurred their positions are 4 and 12 then the codeword sent
was 100010111000000.
Problem 3.12 For a word x = x1 . . . xN ∈ FN2 the weight w(x) is the number
of non-zero digits: w(x) =  {i : xi = 0}. For a linear [N, k] code X let Ai be the
number of words in X of weight i (0 ≤ i ≤ N). Define the weight enumerator
N
polynomial W (X , z) = ∑ Ai zi . Show that if we use X on a binary symmetric
i=0
channel with error-probability
p, the
probability of failing to detect an incorrect
p
word is (1 − p) W X , 1−p − 1 .
N

Solution Suppose we have sent the zero codeword 0. Then the error-probability
 
E = ∑ P x |0 sent = ∑ Ai pi (1 − p)N−i =
x∈X \0  i≥1

 i    
p p
(1 − p) N
∑ Ai − 1 = (1 − p) N W X, −1 .
i≥0 1− p 1− p
356 Further Topics from Coding Theory

Problem 3.13 Let X be a binary linear [N, k, d] code, with the weight enumer-
ator WX (s). Find expressions, in terms of WX (s), for the weight enumerators of:

(i) the subcode X ev ⊆ X consisting of all codewords x ∈ X of even weight,


(ii) the parity-check extension X pc of X .
Prove that if d is even then there exists an [N, k, d] code where each codeword has
even weight.

Solution (i) All words with even weights from X belong to subcode X ev . Hence
ev 1
WX (s) = [WX (s) +WX (−s)] .
2
(ii) Clearly, all non-zero coefficients of weight enumeration polynomial for X +
corresponds to even powers of z, and A2i (X + ) = A2i (X )+A2i−1 (X ), i = 1, 2, . . ..
Hence,
pc 1
WX (s) = [(1 + s)WX (s) + (1 − s)WX (−s)] .
2
If X is binary [N, k, d] then you first truncate X to X − then take the parity-
check extension (X − ) . This preserves k and d (if d is even) and makes all code-
+

words of even weight.


Problem 3.14 Check that polynomials X 4 + X 3 + X 2 + X + 1 and X 4 + X + 1 are
irreducible over F2 . Are these polynomials primitive over F2 ? What about polyno-
mials X 3 + X + 1, X 3 + X 2 + 1? X 4 + X 3 + 1?

Solution As both polynomials X 4 + X 3 + X 2 + X + 1 and X 4 + X + 1 do not vanish


at X = 0 or X = 1, they are not divisible by X or X +1. They are also not divisible by
X 2 + X + 1, the only irreducible polynomial of degree 2, or by X 3 + X + 1 or X 3 +
X 2 + 1, the only irreducible polynomials of degree 3. Hence, they are irreducible.
The polynomial X 4 + X 3 + X 2 + X + 1 cannot be primitive polynomial as it di-
vides X 5 − 1. Let us check that X 4 + X + 1 is primitive. Take F2 [X]/"X 4 + X + 1#
and use the F42 field table. The cyclotomic coset is {ω , ω 2 , ω 4 , ω 8 } (as ω 16 = ω ).
The primitive polynomial Mω (X) is then
(X − ω )(X − ω 2 )(X − ω 4 )(X − ω 8 )
= X 4 − (ω + ω 2 + ω 4 + ω 8 )X 2
+(ωω 2 + ωω 4 + ωω 8 + ω 2 ω 4 + ω 2 ω 8 + ω 4 ω 8 )X 2
−(ωω 2 ω 4 + ωω 2 ω 8 + ωω 4 ω 8 + ω 2 ω 4 ω 8 )x + ωω 2 ω 4 ω 8
= X 4 − (ω + ω 2 + ω 4 + ω 8 )X 2 + (ω 3 ω 5 + ω 9 + ω 6 + ω 10 + ω 12 )X 2
−(ω 7 + ω 11 + ω 13 + ω 14 )X + ω 15 = X 4 + X + 1.
3.6 Additional problems for Chapter 3 357

The order of X 4 + X + 1 is 15: other primitive polynomials of order 15 are X 4 +


X 3 + 1 and X 4 + X + 1. Thus, the only primitive polynomial of degree 4 is X 4 +
X + 1. Similarly, the only primitive polynomials of degree 3 are X 3 + X + 1 and
X 3 + X 2 + 1, both of order 7.
Problem 3.15 Suppose a binary narrow-sense BCH code is used, of length 15,
designed distance 5, and the received word is X 10 + X 5 + X 4 + X + 1. How is it
decoded? If the received word is X 11 + X 10 + X 6 + X 5 + X 4 + X + 1, what is the
number of errors?

Solution Suppose the received word is


r(X) = X 10 + X 5 + X 4 + X + 1,
and let ω be a primitive element in F16 . Then
s1 = r(ω ) = ω 10 + ω 5 + ω 4 + ω + e
= 0111 + 0110 + 0011 + 0010 + 0001 = 0001 = e,
s3 = r(ω 3 ) = ω 30 + ω 15 + ω 12 + ω 3 + e
= 0001 + 0001 + 1111 + 1000 + 0001 = 0110 = ω 5 .
See that s3 = s31 : two errors. The error-locator polynomial
σ (X) = e + s1 X + (s3 s−1
1 + s1 )X = e + X + (ω + e)X = e + X + ω X .
2 2 5 2 10 2

Checking for the roots, ω 0 = e, ω 1 , ω 2 , ω 3 , ω 4 , ω 5 , ω 6 : no, ω 7 : yes. Then divide:



(ω 10 X 2 + X + e) (X + ω 7 ) = ω 10 X + ω 8 = ω 10 (X + ω 13 ),
and identify the second root: ω 13 . So, the errors occurred at positions 15 − 7 = 8
and 15 − 13 = 2. Decode:
r(X) → X 10 + X 8 + X 5 + X 4 + X 2 + X + 1.

Problem 3.16 Prove that the binary code of length 23 generated by the poly-
nomial g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 has minimal distance 7, and is
perfect.
[Hint: If grev (X) = X 11 g(1/X) is the reversal of g(X) then
X 23 + 1 ≡ (X + 1)g(X)grev (X) mod 2.]

Solution First, show that the code is BCH, of designed distance 5. By the fresher’s
dream Lemma 3.1.5, if ω is a root of a polynomial f (X) ∈ F2 [X] then so is ω 2 .
Thus, if ω is a root of g(X) = 1 + X + X 5 + X 6 + X 7 + X 9 + X 11 then so are ω ,
ω 2 , ω 4 , ω 8 , ω 16 , ω 9 , ω 18 , ω 13 , ω 3 , ω 6 , ω 12 . This yields the design sequence
358 Further Topics from Coding Theory

{ω , ω 2 , ω 3 , ω 4 }. By the BCH bound (Theorem 2.5.39 and Theorem 3.2.9), the


cyclic code X generated by g(X) has distance ≥ 5.
Next, the parity-check extension, X + , is self-orthogonal. To check this, we need
only to show that any two rows of the generating matrix of X + are orthogonal.
These are represented by the concatenated words
(X i g(X)|1) and (X j g(X)|1).
Their dot-product equals
1 + (X i g(X))(X j g(X)) = 1 + ∑ gi+r g j+r
r
= 1 + ∑ gi+r grev
11− j−r
r
= 1 + coefficient of X 11+i− j in g(X) × grev (X)
: ;< =
||
1 + · · · + X 22
= 1 + 1 = 0.
We conclude that
any two words in X + are dot-orthogonal.
Next, observe that all words in X + have weights divisible by 4. Indeed, by
inspection, all rows (X i g(X)|1) of the generating matrix of X + have weight 8.
Then, by induction on the number of rows involved in the sum, if x ∈ X + and
g(i) ∼ (X i g(X)|1) is a row of the generating matrix of X + then
     
w g(i) + x = w g(i) + w(x) − 2w g(i) ∧ x , (3.6.17)
 (i)   (i) 
 
where g ∧ x l = min g l , xl , l = 1, . . . , 24. We know that 8 divides w g(i) .

Moreover,
 by theinduction
 hypothesis, 4 divides w(x). Next,by (3.6.17),
 w g(i) ∧
x is even, so 2w g(i) ∧ x is divisible by 4. Then the LHS, w g(i) + x , is divisible
by 4.
Therefore, the distance of X + is 8, as it is ≥ 5 and is divisible by 4. (It is easy to
see that it cannot be > 8 as then it would be 12.) Then the distance of the original
code, X , equals 7.
The code X is perfect 3-error correcting, since the volume of the 3-ball in F23 2
equals
       
23 23 23 23
+ + + = 1 + 23 + 253 + 1771 = 2048 = 211 ,
0 1 2 3
and
211 × 212 = 223 .
Here, obviously, 12 represents the rank and 23 the length.
3.6 Additional problems for Chapter 3 359

Problem 3.17 Use the MacWilliams identity to prove that the weight distribution
of a q-ary MDS code of distance d is
   
N j i
i 0≤ ∑
Ai = (−1) qi−d+1− j
− 1
j≤i−d j
   
N j i−1
= (q − 1) ∑ (−1) qi−d− j , d ≤ i ≤ N.
i 0≤ j≤i−d j

[Hint: To begin the solution,

(a) write the standard MacWilliams identity,


(b) swap X and X ⊥ ,
(c) change s → s−1 ,
(d) multiply by sn and
(e) take the derivative d r /dsr , 0 ≤ r ≤ k (which equals d(X ⊥ ) − 1).

Use the Leibniz rule


  j   r− j 
dr   r d d
f (s)g(s) = ∑ f (s) g(s) . (3.6.18)
dsr 0≤ j≤r j ds j dsr− j

Use the fact that d(X ) = N −k +1 and d(X ⊥ ) = k +1 and obtain simplified equa-
tions involving AN−k+1 , . . . , AN−r only. Subsequently, determine AN−k+1 , . . . , AN−r .
Varying r, continue up to AN .]

Solution The MacWilliams identity is


1
N−i
∑ A⊥ i
i s = ∑
qk 1≤i≤N
A i (1 − s) i
1 + (q − 1)s .
1≤i≤N

Swap X and X ⊥ , change s → s−1 and multiply by sN . After this differentiate


r ≤ k times and substitute s = 1:
   
1 N −i 1 ⊥ N −i

qk 0≤i≤N−r r
Ai = r ∑ Ai
q 0≤i≤r N −r
(3.6.19)

(the Leibniz rule (3.6.18) is used here). Formula (3.6.19) is the starting point. For
an MDS code, A0 = A⊥ 0 = 1, and

Ai = 0, 1 ≤ i ≤ N − k (= d − 1), A⊥ ⊥
i = 0, 1 ≤ i ≤ k (= d − 1).

Then
       
N 1 1 N−r N −i 1 N 1 N
+ ∑
r qk qk i=N−k+1 r
Ai = r
q N −r
= r
q r
,
360 Further Topics from Coding Theory

i.e.
   
N−r
N −i N
∑ r
Ai =
r
(qk−r − 1).
i=N−k+1

For r = k we obtain 0 = 0, for r = k − 1


 
N
AN−k+1 = (q − 1), (3.6.20)
k−1
for r = k − 2
   
k−1 N
A + AN−k+2 = (q2 − 1),
k − 2 N−k+1 k−2
etc. This is a triangular system of equations for AN−k+1 , . . . , AN−r . Varying r, we
can get AN−k+1 , . . . , AN−1 . The result is
   
N i
Ai = ∑ (−1) j (qi−d+1− j − 1)
i 0≤ j≤i−d j
   
N i−1
= ∑ (−1) j (qqi−d− j − 1)
i j
0≤ j≤i−d
  
i−1
− ∑ (−1) j−1 (qi−d+1− j − 1)
j−1
1≤  j≤i−d+1
 
N i − 1 i−d− j
= (q − 1) ∑ (−1) j q , d ≤ i ≤ N,
i 0≤ j≤i−d j

as required.
In fact, (3.6.20) can be obtained without calculations: in an MDS code of rank k
and distance d any k = N − d + 1 digits determine the codeword uniquely. Further,
for any choice of N − d positions there are exactly q codewords with digits 0 in
these positions. One of them is the zero codeword, and the remaining q − 1 are of
weight d. Hence,
 
N
AN−k+1 = Ad = (q − 1).
d

Problem 3.18 Prove the following properties of Kravchuk’s polynomials Kk (i).


   
N N
(a) For all q: (q − 1) i
Kk (i) = (q − 1) k
Ki (k).
i k
(b) For q = 2: Kk (i) = (−1)k Kk (N − i).
(c) For q = 2: Kk (2i) = KN−k (2i).
3.6 Additional problems for Chapter 3 361

Solution Write
  
i N −i
Kk (i) = ∑ j k − j
(−1) j (q − 1)k− j .
0∨(i+k−N)≤ j≤k∧i

Next:
(a) The following straightforward equation holds true:
   
i N k N
(q − 1) Kk (i) = (q − 1) Ki (k)
i k
(as all summands become insensitive to swapping i ↔ k).
   
N N
For q = 2 this yields Kk (i) = Ki (k); in particular,
i k
   
N N
K0 (i) = Ki (0) = Ki (0).
i 0
(b) Also, for q = 2: Kk (i) = (−1)k Kk (N − i) (again straightforward, after swapping
i ↔ i − j).
   
N N
(c) Thus, still for q = 2: K (2i) = K2i (k) which equals
2i k k
   
N N
(−1) 2i K2i (N − k) = K (2i). That is,
N −k 2i N−k

Kk (2i) = KN−k (2i).

Problem 3.19 What is an (n, Fq )-root of unity? Show that the set E(n,q) of the
(n, Fq )-roots of unity form a cyclic group. Check that the order of E(n,q) equals n if
n and q are co-prime. Find the minimal s such that E(n,q) ⊂ Fqs .
Define a primitive (n, Fq )-root of unity. Determine the number of primitive
(n, Fq )-roots of unity when n and q are co-prime. If ω is a primitive (n, Fq )-root of
unity, find the minimal  such that ω ∈ Fq .
Find representation of all elements of F9 as vectors over F3 . Find all (4, F9 )-roots
of unity as vectors over F3 .

Solution We know that any root of an irreducible polynomial of degree 2 over field
F3 = {0, 1, 2} belongs to F9 . Take the polynomial f (X) = X 2 + 1 and denote its
root by α (any of the two). Then all elements of F9 may be represented as a0 + a1 α
where a0 , a1 ∈ F3 . In fact,

F9 = {0, 1, α , 1 + α , 2 + α , 2α , 1 + 2α , 2 + 2α }.
362 Further Topics from Coding Theory

Another approach is as follows: we know that X 8 − 1 = ∏ (X − ζ i ) in the field


1≤i≤8
F9 where ζ is a primitive (8, F9 )-root of unity. In terms of circular polynomials,
X 8 − 1 = Q1 (X)Q2 (X)Q4 (X)Q8 (X). Here Qn (x) = ∏s:gcd(s,n)=1 (x − ω s ) where ω
is a primitive (n, F9 )-root of unity. Write X 8 − 1 = ∏d:d|8 Qd (x). Next, compute

Q1 (X) = −1 + X, Q2 (X) = 1 + X, Q4 (X) = 1 + X 2 ,


   
Q8 (X) = X 8 − 1 Q1 (X)Q2 (X)Q4 (X) = (X 8 − 1)/(X 4 − 1) = X 4 + 1.
As 32 = 1 mod 8, by Theorem 3.1.53 Q8 (X) should be decomposed over F3 into
product of φ (8)/2 = 2 irreducible polynomials of degree 2. Indeed,
Q8 (X) = (X 2 + X + 2)(X 2 + 2X + 2).
Let ζ be a root of X 2 + X + 2, then it is a primitive root of degree 8 over F3 and
F9 = F3 (ζ ). Hence, F9 = {0, ζ , ζ 2 , ζ 3 , ζ 4 , ζ 5 , ζ 6 , ζ 7 , ζ 8 }, and ζ = 1 + α . Finally,
we present the index table
ζ = 1 + α , ζ 2 = 2α , ζ 3 = 1 + 2α , ζ 4 = 2,
ζ 5 = 2 + 2α , ζ 6 = α , ζ 7 = 2 + α , ζ 8 = 1.
Hence, the roots of degree 4 are ζ 2 , ζ 4 , ζ 6 , ζ 8 .
Problem 3.20 Define a cyclic code of length N over the field Fq . Show that there
is a bijection between the cyclic codes of length N , and the factors of X N − e in the
polynomial ring Fq [X].
Now consider binary cyclic codes. If N is an odd integer then we can find a finite
extension K of F2 that contains a primitive N th root of unity ω. Show that a cyclic
code of length N with defining set {ω , ω 2 , . . . , ω δ −1 } has minimum distance at
least δ . Show that if N = 2 − 1 and δ = 3 then we obtain the Hamming [2 − 1,
2 − 1 − , 3] code.

Solution A linear code X ⊂ F×N q is a cyclic code if x1 . . . xN ∈ X implies that


x2 , . . . xN x1 ∈ X . Bijection of cyclic codes and factors of X N − 1 can be established
as in Corollary 3.3.3.
Passing to binary codes, consider, for brevity, N = 7. Factorising in F72 renders
the decomposition
X 7 − 1 = (X − 1)(X 3 + X + 1)(X 3 + X 2 + 1) := (X − 1) f1 (X) f2 (X).
Suppose ω is a root of f1 (X). Since f1 (X)2 = f1 (X 2 ) in F2 [X] we have
f1 (ω ) = f1 (ω 2 ) = 0.
It follows that the cyclic code X with defining root ω has the generator polynomial
f1 (X) and the check polynomial (X − 1) f2 (X) = X 4 + X 2 + X + 1. This property
3.6 Additional problems for Chapter 3 363

characterises Hamming’s original code (up to equivalence). The case where ω is a


root of f2 (X) is similar (in fact, we just reverse every codeword). For a general N =
2l − 1, we take a primitive element ω ∈ F2l and its minimal polynomial Mω (X).
l−1
The roots of Mω (X) are ω , ω 2 , . . . , ω 2 , hence deg Mω (X) = l. Thus, a code with
defining root ω has rank N − l = 2l − 1 − l, as in the Hamming [2l − 1, 2l − l − 1]
code.

Problem 3.21 Write an essay comparing the decoding procedures for Hamming
and two-error correcting BCH codes.

Solution To clarify the ideas behind the BCH construction, we first return to the
Hamming codes. The Hamming [2l − 1, 2l − 1 − l] code is a perfect one-error cor-
recting code of length N = 2l − 1. The procedure of decoding the Hamming code is
as follows. Having a word y = y1 . . . yN , N = 2l − 1, form the syndrome s = yH T .
If s = 0, decode y by y. If s = 0 then s is among the columns of H = HHam . If this is
column i, decode y by x∗ = y + ei , where ei = 0 . . . 010 . . . 0 (1 in the ith position,
0 otherwise).
We can try the following idea to be able to correct more than one error (two to
start with). Select 2l of the rows of the parity-check matrix in the form
 
H
H= . (3.6.21)
ΠH

Here ΠHHam is obtained by permuting the columns of HHam (Π is a permutation


of degree 2l − 1). The new matrix H contains 2l linearly independent rows: it then
determines a [2l − 1, 2l − 1 − 2l] linear code. The syndromes are now words of
length 2l (or pairs of words of length l): yH T = (ss ). A syndrome (s, s )T may or
may not be among the columns of H. Recall, we want the new code to be two-error
correcting, and the decoding procedure to be similar to the one for the Hamming
codes. Suppose two errors occur, i.e. y differs from a codeword x by two digits, say
i and j. Then the syndrome is

yH T = (si + s j , sΠi + sΠ j )

where sk is the word representing column k in H. We organise our permutation so


that, knowing vector (si + s j , sΠi + sΠ j ), we can always find i and j (or equivalently,
si and s j ). In other words, we should be able to solve the equations

si + s j = z, sΠi + sΠ j = z (3.6.22)

for any pair (z, z ) that may eventually occur as a syndrome under two errors.
A natural guess is to try a permutation Π that has some algebraic significance,
e.g. sΠi = si si = (si )2 (a bad choice) or sΠi = si si si = (si )3 (a good choice)
364 Further Topics from Coding Theory

or, generally, sΠi = si si · · · si (k times). Say, one can try the multiplication
mod 1 + X N ; unfortunately, the multiplication does not lead to a field. The reason
is that polynomial 1 + X N is always reducible. So, suppose we organise the check
matrix as
⎛ ⎞
(1 . . . 00) (1 . . . 00)k
⎜ .. ⎟
HT = ⎝ . ⎠.
(1 . . . 11) (1 . . . 11)k
Then we have to deal with equations of the type

si + s j = z, ski + skj = z . (3.6.23)

For solving (3.6.23), we need the field structure of the Hamming space, i.e. not
only multiplication but also division. Any field structure on the Hamming space
N is isomorphic to F2N , and a concrete realisation of such a structure is
of length
F2 [X] "c(X)#, a polynomial field modulo an irreducible polynomial c(X) of degree
N. Such a polynomial always exists: it is one of the primitive polynomials of degree
N. In fact, the simplest consistent system of the form (3.6.23) is

s + s = z, s3 + s = z ;
3

it is reduced to a single equation zs2 − z2 s + z3 − z = 0, and our problem becomes


to factorise the polynomial zX 2 − z2 X + z3 − z .
For N = 2l − 1, l = 4 we obtain [15, 7, 5] code. The rank 7 is due to the linear
independence of the columns of H. The key point is to check that the code corrects
up to two errors. First suppose we received a word y = y1 . . . y15 in which two
errors occurred in digits i and j that are unknown. In order to find these places,
calculate the syndrome yH T = (z, z )T . Recall that z and z are words of length 4;
the total length of the syndrome is 8. Note that z = z3 : if z = z3 , precisely one
error occurred. Write a pair of equations

s + s = z, s3 + s = z ,
3
(3.6.24)

where s and s are words of length 4 (or equivalently their polynomials), and the
multiplication is mod 1 + X + X 4 . In the case of two errors it is guaranteed that
there is exactly one pair of solutions to (3.6.24), one vector occupying position i
and another position j, among the columns of the upper (Hamming) half of matrix
H. Moreover, (3.6.24) cannot have more than one pair of solutions because

z = s3 + s = (s + s )(s2 + ss + s ) = z(z2 + ss )
3 2

implies that
ss = z z−1 + z2 . (3.6.25)
3.6 Additional problems for Chapter 3 365

Now (3.6.25) and the first equation in (3.6.24) give that s, s are precisely the roots
of a quadratic equation
 
X 2 + zX + z z−1 + z2 = 0 (3.6.26)
(with z z−1 + z2 = 0). But the polynomial in the LHS of (3.6.26) cannot have more
than two distinct roots (it could have no root or two coinciding roots, but it is
excluded by the assumption that there are precisely two errors). In the case of a
single error, we have z = z3 ; in this case s = z is the only root and we just find the
word z among the columns of the upper half of matrix H.
Summarising, the decoding scheme, in the case of the above [15, 7] code, is as
follows: Upon receiving word y, form a syndrome yH T = (z, z )T . Then
(i) If both z and z are zero words, conclude that no error occurred and decode y
by y itself.
(ii) If z = 0 and z3 = z , conclude that a single error occurred and find the location
of the error digit by identifying word z among the columns of the Hamming
check matrix.
(iii) If z = 0 and z3 = z , form the quadric (3.6.24), and if it has two distinct roots
s and s , conclude that two errors occurred and locate the error digits by iden-
tifying words s and s among the columns of the Hamming check matrix.
(iv) If z = 0 and z3 = z and quadric (3.6.26) has no roots, or if z is zero but z is
not, conclude that there are at least three errors.
Note that the case where z = 0, z3 = z and quadric (3.6.26) has a single root is
impossible: if (3.6.26) has a root, s say, then either another root s = s or z = 0 and
a single error occurs.
The decoding procedure allows us to detect, in some cases, that more than three
errors occurred. However, this procedure may lead to a wrong codeword when
three or more errors occur.
4
Further Topics from Information Theory

In Chapter 4 it will be convenient to work in a general setting which covers both


discrete and continuous-type probability distributions. To do this, we assume that
probability distributions under considerations are given by their Radon–Nikodym
derivatives with respect to underlying reference measures usually denoted by μ
or ν . The role of a reference measure can be played by a counting measure sup-
ported by a discrete set or by the Lebesgue measure on Rd ; we need only that
the reference measure is locally finite (i.e. it assigns finite values to compact sets).
The Radon–Nikodym derivatives will be called probability mass functions (PMFs):
they represent probabilities in the discrete case and probability density functions
(PDFs) in the continuous case.
The initial setting of the channel capacity theory developed for discrete channels
in Chapter 1 (see Section 1.4) goes almost unchanged for a continuously distributed
noise by adopting the logical scheme:
3 4
a set U of messages, of cardinality M = 2NR
→ a codebook X of size M with codewords of length N
→ reliable rate R of transmission through a noisy channel
→ the capacity of the channel.
However, to simplify the exposition, we will assume from now on that encoding
U → X is a one-to-one map and identify a code with its codebook.

4.1 Gaussian channels and beyond


Here we study channels with continuously distributed noise; they are the basic
models in telecommunication, including both wireless and telephone transmission.
The most popular model of such a channel is a memoryless additive Gaussian chan-
nel (MAGC) but other continuous-noise models are also useful. The case of an

366
4.1 Gaussian channels and beyond 367

MAGC is particularly attractive because it allows one to do some handy and far-
reaching calculations with elegant answers.
However, Gaussian (and other continuously distributed) channels present a chal-
lenge that was absent in the case of finite alphabets considered in Chapter 1.
Namely, because codewords (or, using a slightly more appropriate term, codevec-
tors) can a priori take values from a Euclidean space (as well as noise vectors),
the definition of the channel capacity has to be modified, by introducing a power
constraint. More generally, the value of capacity for a channel will depend upon
the so-called regional constraints which can generate analytic difficulties. In the
case of MAGC, the way was shown by Shannon, but it took some years to make
his analysis rigorous.
An input word of length N (designed to use the channel over N slots in succes-
sion) is identified with an input N-vector
⎛ ⎞
x1
⎜ ⎟
x(= x(N) ) = ⎝ ... ⎠ .
xN

We assume that xi ∈ R and hence x(N) ∈ RN (to make the notation shorter, the upper
index (N) will be often omitted).
In an⎛additive
⎞ channels an input vector x is transformed to a random vector
Y1
⎜ ⎟
Y(N) = ⎝ ... ⎠ where Y = x + Z, or, component-wise,
YN

Y j = x j + Z j , 1 ≤ j ≤ N. (4.1.1)

Here and below,


⎛ ⎞
Z1
⎜ ⎟
Z = ⎝ ... ⎠
ZN

is a noise vector composed of random variables Z1 , . . . , ZN . Thus, the noise can be


characterised by a joint PDF f no (z) ≥ 0 where
⎛ ⎞
z1
⎜ .. ⎟
z=⎝ . ⎠
zN
368 Further Topics from Information Theory
0
and the total integral f no (z)dz1 . . . dzN = 1. The N-fold noise probability distri-
bution is determined by integration over a given set of values for Z:
0
P (Z ∈ A) =
no
f no (z)dz1 . . . dzN , for A ⊆ RN .
A
Example 4.1.1 An additive⎛channel
⎞ is called Gaussian (an AGC, in short) if,
Z1
⎜ ⎟
for each N, the noise vector ⎝ ... ⎠ is a multivariate normal; cf. PSE I, p. 114.
ZN
We assume from now on that the mean value EZ j = 0. Recall that the multivariate
normal distribution with the zero mean is completely determined by its covariance
matrix. More precisely, the joint PDF fZno(N) (z(N) ) for an AGC has the form
⎛ ⎞
  z1
1 1 T −1 ⎜ .. ⎟
1/2 exp − 2 z Σ z , z = ⎝ . ⎠ ∈ R .
N
 (4.1.2)
(2π ) N/2 det Σ
zN
Here Σ is an N × N matrix assumed  to be real, symmetric and strictly positive def-
inite, with entries Σ j j = E Z j Z j representing the covariance of noise random
variables Z j and Z j , 1 ≤ j, j ≤ N. (Real strict positive definiteness means that Σ is
of the form BBT where B is an N × N real invertible matrix; if Σ is strictly positive
definite then Σ has N mutually orthogonal eigenvectors, and all N eigenvalues of Σ
are greater than 0.) In particular, each random variable Z j is normal: Z j ∼ N(0, σ 2j )
where σ 2j = EZ 2j coincides with the diagonal entry Σ j j . (Due to strict positive def-
initeness, Σ j j > 0 for all j = 1, . . . , N.)
If in addition the random variables Z1 , Z2 , . . . are IID, the channel is called mem-
oryless Gaussian (MGC) or a channel with (additive) Gaussian white noise. In this
case matrix Σ is diagonal: Σi j = 0 when i = j and Σii > 0 when i = j. This is an
important model example (both educationally and practically) since it admits some
nice final formulas and serves as a basis for further generalisations.
Thus, an MGC has IID noise random variables Zi ∼ N(0, σ 2 ) where σ 2 =
VarZi = EZi2 . For normal random  variables,
 independence is equivalent to decorre-
lation. That is, the equality E Z j Z j = 0 for all j, j = 1, . . . , N with j = j implies
that the components Z1 , . . . , ZN of the noise vector Z(N) are mutually independent.
This can be deduced from (4.1.2): if matrix Σ has Σ j j = 0 for j = j then Σ is
diagonal, with det Σ = ∏ Σ j j , and the joint PDF in (4.1.2) decomposes into a
1≤ j≤N
product of N factors representing individual PDFs of components Z j , 1 ≤ j ≤ N:
 
1 z2j
∏  1/2 exp − 2Σ . (4.1.3)
1≤ j≤N 2π Σ j j jj
4.1 Gaussian channels and beyond 369

Moreover, under the IID assumption, with Σ j j ≡ σ 2 > 0, all random variables Z j ∼
N(0, σ 2 ), and the noise distribution for an MGC is completely specified by a single
parameter σ > 0. More precisely, the joint PDF from (4.1.3) is rewritten as
 N  
1 1
√ exp − 2 ∑ z2j .
2πσ 2σ 1≤ j≤N

It is often convenient to think that an infinite random sequence Z∞ 1 = {Z1 , Z2 , . . .}


(N)
is given, and the above noise vector Z is formed by the first N members of this
sequence. In the Gaussian case, Z∞ 1 is called a random Gaussian process; with
EZ j ≡ 0, this
 process
 is determined, like before, by its covariance Σ, with Σi j =
Cov Zi , Z j = E Zi Z j . The term ‘white Gaussian noise’ distinguishes this model
from a more general model of a channel with ‘coloured’ noise; see below.
Channels with continuously distributed noise are analysed by using a scheme
similar to the one adopted in the discrete case: in particular, if the channel is used
for transmitting one of M ∼ 2RN , R < 1, encoded messages, we need a codebook
that consists of M codewords of length N: xT (i) = (x1 (i), . . . , xN (i)), 1 ≤ i ≤ M:
⎧⎛ ⎞ ⎛ ⎞⎫
( ) ⎪ ⎨ x1 (1) x1 (M) ⎪ ⎬
⎜ ⎟ ⎜ ⎟
XM,N = x(N) (1), . . . , x(N) (M) = ⎝ ... ⎠ , . . . , ⎝ ... ⎠ . (4.1.4)

⎩ ⎪

xN (1) xN (M)

The codebook is, of course, presumed to be known to both the sender and the
receiver. The transmission rate R is given by
log2 M
R= . (4.1.5)
N
Now suppose that⎛a codevector⎞x(i) had been sent. Then the received random
x1 (i) + Z1
⎜ .. ⎟
vector Y(= Y(i)) = ⎝ . ⎠ is decoded by using a chosen decoder d : y →
xN (i) + ZN
d(y) ∈ XM,N . Geometrically, the decoder looks for the nearest codeword x(k),
relative to a certain distance (adapted to the decoder); for instance, if we choose to
use the Euclidean distance then vector Y is decoded by the codeword minimising
the sum of squares:
 
d(Y) = arg min ∑ (Y j (i) − x j (l))2 : x(l) ∈ XM,N ; (4.1.6)
1≤ j≤N

when d(y) = x(i) we have an error. Luckily, the choice of a decoder is conveniently
resolved on the basis of the maximum-likelihood principle; see below.
370 Further Topics from Information Theory

There is an additional subtlety here: one assumes that, for an input word x to get a
chance of successful decoding, it should belong to a certain ‘transmittable’ domain
in RN . For example, working with an MAGC, one imposes the power constraint
1
N 1≤∑
x2j ≤ α (4.1.7)
j≤N

where α > 0 is a given constant. In the context of wireless transmission this means
that the amplitude square power per signal in an N-long input vector should be
bounded by α , otherwise the result of transmission is treated as ‘undecodable’.
Geometrically, in order to perform decoding, the input√codeword x(i) constituting

the codebook must lie inside the Euclidean ball BN2 ( α N) of radius r = α N
centred at 0 ∈ RN :
⎧ ⎛ ⎞ ⎫
⎪ x1  1/2 ⎪
⎨ ⎬
⎜ .. ⎟
∑ j
(N)
B2 (r) = x = ⎝ . ⎠ : x 2
≤ r .

⎩ 1≤ j≤N


xN
The subscript 2 stresses that RN with the standard Euclidean distance is viewed as
a Hilbert 2 -space.
In fact, it is not required that the whole codebook XM,N lies in a decodable
domain; the agreement is only that if a codeword x(i) falls outside then it is decoded
wrongly with probability 1. Pictorially, the requirement is that ‘most’ of codewords
lie within BN2 ((N α )1/2 ) but not necessarily all of them. See Figure 4.1.
A reason for the ‘regional’ constraint (4.1.7) is that otherwise the codewords
can be positioned in space at an arbitrarily large distance from each other, and,
eventually, every transmission rate would become reliable. (This would mean that
the capacity of the channel is infinite; although such channels should not be dis-
missed outright, in the context of an AGC the case of an infinite capacity seems
impractical.)
Typically, the decodable region D(N) ⊂ RN is represented by a ball in RN , centred
at the origin, and specified relative to a particular distance in RN . Say, in the case
of exponentially distributed noise it is natural to select
⎧ ⎛ ⎞ ⎫

⎨ x1 ⎪

⎜ .. ⎟
D = B1 (N α ) = x = ⎝ . ⎠ : ∑ |x j | ≤ N α
(N) (N)

⎩ 1≤ j≤N


xN
the ball in the 1 -metric. When an output-signal vector falling within distance r
from a codeword is decoded by this codeword, we have a correct decoding if (i)
the output signal falls in exactly one sphere around a codeword, (ii) the codeword
in question lies within D(N) , and (iii) this specific codeword was sent. We have
possibly an error when more than one codeword falls into the sphere.
4.1 Gaussian channels and beyond 371

Figure 4.1

As in the discrete case, a more general channel is represented by a family of


(conditional) probability distributions for received vectors of length N given that
an input word x(N) ∈ RN has been sent:
(N) (N)
Pch ( · | x(N) ) = Pch ( · |word x(N) sent), x ∈ RN . (4.1.8)
As before, N = 1, 2, . . . indicates how many slots of the channel were used for trans-
mission, and we will consider the limit N → ∞. Now assume that the distribution
(N) (N)
Pch ( · | x(N) ) is determined by a PMF fch (y(N) | x(N) ) relative to a fixed measure
ν (N) on RN :
0
(N) (N)
Pch (Y(N) ∈ A| x
(N)
)= fch ( · | x(N) )dν (N) (y(N) ). (4.1.9a)
A

A typical assumption is that ν (N) is a product-measure of the form


ν (N) = ν × · · · × ν (N times); (4.1.9b)
for instance, ν (N) can be the Lebesgue measure on RN which is the product of
Lebesgue measures on R: dx(N) = dx1 × · · · × dxN . In the discrete case where
digits xi represent letters from an input channel alphabet A (say, binary, with
A = {0, 1}), ν is the counting measure on A , assigning weight 1 to each sym-
bol of the alphabet. Then ν (N) is the counting measure on A N , the set of all input
words of length N, assigning weight 1 to each such word.
372 Further Topics from Information Theory

Assuming the product-form reference measure ν (N) (4.1.9b), we specify a mem-


(N)
oryless channel by a product form PMF fch (y(N) | x(N) ):


(N)
fch (y(N) | x(N) ) = fch (y j |x j ). (4.1.10)
1≤ j≤N

Here fch (y|x) is the symbol-to-symbol channel PMF describing the impact of a
single use of the channel. For an MGC, fch (y|x) is a normal N(x, σ 2 ). In other
words, fch (y|x) gives the PDF of a random variable Y = x + Z where Z ∼ N(0, σ 2 )
represents the ‘white noise’ affecting an individual input value x.
Next, we turn to a codebook XM,N , the image of a one-to-one map M → RN
where M is a finite collection of messages (originally written in a message alpha-
bet); cf. (4.1.4). As in the discrete case, the ML decoder dML decodes the received
(N)
word Y = y(N) by maximising fch (y| x) in the argument x = x(N) ∈ XM,N :
 
(N)
dML (y) = arg max fch (y| x) : x ∈ XM,N . (4.1.11)
The case when maximiser is not unique will be treated as an error.
(N),ε
Another useful example is the joint typicality (JT) decoder dJT = dJT (see
below); it looks for the codeword x such that x and y lie in the ε -typical set TεN :
dJT (y) = x if x ∈ XM,N and (x, y) ∈ TεN . (4.1.12)
The JT decoder is designed – via a specific form of set TεN – for codes generated as
samples of a random code X M,N . Consequently, for given output vector yN and a
code XM,N , the decoded word dJT (y) ∈ XM,N may be not uniquely defined (or not
defined at all), again leading to an error. A general decoder should be understood
as a one-to-one map defined on a set K(N) ⊆ RN taking points yN ∈ KN to points
x ∈ XM,N ; outside set K(N) it may be not defined correctly. The decodable region
K(N) is a part of the specification of decoder d (N) . In any case, we want to achieve

(N) (N)
Pch d (N) (Y) = x|x sent = Pch Y ∈ K(N) |x sent

(N)
+ Pch Y ∈ K(N) , d(Y) = x|x sent → 0
as N → ∞. In the case of an MGC, for any code XM,N , the ML decoder from
(4.1.6) is defined uniquely almost everywhere in RN (but does not necessarily give
the right answer).
We also require that the input vector x(N) ∈ D(N) ⊂ RN and when x(N) ∈ D(N) ,
the result of transmission is rendered undecodable (regardless of the qualities of the
decoder used). Then the average probability of error, while using codebook XM,N
and decoder d (N) , is defined by
1
eav (XM,N , d (N) , D(N) ) = ∑ e(x, d (N), D(N) ),
M x∈X
(4.1.13a)
M,N
4.1 Gaussian channels and beyond 373

and the maximum probability of error by


emax (XM,N , d (N) , D(N) ) = max e(x, d (N) , D(N) ) : x ∈ XM,N . (4.1.13b)
Here e(x, d (N) , D(N) ) is the probability of error when codeword x had been trans-
mitted:

⎨ 1, x ∈ D(N) ,
e(x, d (N) , D(N) ) = (4.1.14)
⎩P(N)
ch d
(N) (Y) = x|x , x ∈ D(N) .

In (4.1.14) the order of the codewords in the codebook XM,N does not matter;
thus XM,N may be regarded simply as a set of M points in the Euclidean space RN .
Geometrically, we want the points of XM,N to be positioned so as to maximise the
chance of correct ML-decoding and lying, as a rule, within domain D(N) (which
again leads us to a sphere-packing problem).
4 suppose that a number R > 0 is fixed, the size of the codebook XM,N :
To 3this end,
M = 2NR . We want to define a reliable transmission rate as N → ∞ in a fashion
similar to how it was done in Section 1.4.
Definition 4.1.2 Value R >30 is 4called a reliable transmission rate with regional
constraint D(N) if, with M = 2NR , there exist a sequence {XM,N } of codebooks
XM,N ⊂ RN and a sequence {d (N) } of decoders d (N) : RN → RN such that
lim eav (XM,N , d (N) , D(N) ) = 0. (4.1.15)
N→∞

Remark 4.1.3 It is easy to verify that a transmission rate R reliable in the sense of
average error-probability eav (XM,N , d (N) , D(N) ) is reliable for the maximum error-
probability emax (XM,N , d (N) , D(N) ). In fact, assume that R is reliable in the sense of
Definition 4.1.2, i.e. in the sense of the average error-probability. Take a sequence
{XM,N } of the corresponding codebooks with M = 2RN  and a sequence {dN } of
(0)
the corresponding decoding rules. Divide each code XN into two halves, XN and
(1)
XN , by ordering the codewords in the non-decreasing order of their probabilities
(0)
of erroneous decoding and listing the first M (0) = M/2 codewords in XN and
(1) (0)
the rest, M (1) = M − M (0) , in XN . Then, for the sequence of codes {XM,N }:
(i) the information rate approaches the value R as N → ∞ as
1
log M (0) ≥ R + O(N −1 );
N
(ii) the maximum error-probability, while using the decoding rule dN ,
1 M
Pemax XN , dN ≤ (1) ∑ Pe (x(N) , dN ) ≤ (1) Peav (XN , dN ) .
(0)
M (1) M
(N)x ∈XN
374 Further Topics from Information Theory

Since M/M (1) ≤ 2, the RHS tends to 0 as N → ∞. We conclude that R is


a reliable transmission rate for the maximum error-probability. The converse
assertion, that a reliable transmission rate R in the sense of the maximum error-
probability is also reliable in the sense of the average error-probability, is ob-
vious.

Next, the capacity of the channel is the supremum of reliable transmission rates:
 
C = sup R > 0 : R is reliable ; (4.1.16)

it varies from channel to channel and with the shape of constraining domains.
It turns out (cf. Theorem 4.1.9 below) that for the MGC, under the average power
constraint threshold α (see (4.1.7)), the channel capacity C(α , σ 2 ) is given by the
following elegant expression:

1 α
C(α , σ 2 ) = log2 1 + 2 . (4.1.17)
2 σ

Furthermore, like in Section 1.4, the capacity C(α , σ 2 ) is achieved by a se-


quence of random codings where codeword x(i) = (X1 (i), . . . , XN (i)) has IID
components X j (i) ∼ N(0, α − εN ), j = 1, . . . , N, i = 1, . . . , M, with εN → 0 as
N → ∞. Although such random codings do not formally obey the constraint
 probability as N → ∞ (since
(4.1.7) for finite N, it is violated with a vanishing
1
N 1≤∑
lim supN→∞ P max X j (i) : 1 ≤ i ≤ M ≤ α = 1 with a proper choice
2
j≤N
of εN ). Consequently, the average error-probability (4.1.13a) goes to 0 (of course,
for a random coding the error-probability becomes itself random).

Example 4.1.4 Next, we discuss an AGC with coloured Gaussian noise. Let a
codevector x = (x1 , . . . , xN ) have multi-dimensional entries
⎛ ⎞
x j1
⎜ .. ⎟
x j = ⎝ . ⎠ ∈ Rk , 1 ≤ j ≤ N,
x jk

and the components Z j of the noise vector


⎛ ⎞
Z1
⎜ ⎟
Z = ⎝ ... ⎠
ZN
4.1 Gaussian channels and beyond 375

are also random vectors of dimension k:


⎛ ⎞
Z j1
⎜ ⎟
Z j = ⎝ ... ⎠ .
Z jk
For instance, Z1 , . . . , ZN may be IID N(0, Σ) (with k-variate normal), where Σ is a
given k × k covariance matrix.
The ‘coloured’ model arises when one uses a system of k scalar Gaussian chan-
nels in parallel. Here, a scalar signal x j1 is sent through channel 1, x j2 through
channel 2, etc., at the jth use of the system. A reasonable assumption is that at
each use the scalar channels produce jointly Gaussian noise; different channels
may be independent (with matrix Σ being k × k diagonal) or dependent (when Σ is
a general positive-definite k × k matrix).
Here a( codebook, as) before, is an (ordered or unordered) collection
XM,N = x(1), . . . , x(M) where each codeword x(i) is a ‘multi-vector’
(x1 (i), . . . , xN (i))T ∈ Rk×N := Rk × · · · × Rk . Let Q be a positive-definite k × k ma-
trix commuting with Σ: QΣ = ΣQ. The power constraint is now
1 I J

N 1≤ j≤N
x j (i), Qx j (i) ≤ α . (4.1.18)

The formula for the capacity of an AGC with coloured noise is, not surprisingly,
more complicated. As ΣQ = QΣ, matrices Σ and Q may be simultaneously diag-
onalised. Let λi and γi , i = 1, . . . , k, be the eigenvalues of Σ and Q, respectively
(corresponding to the same eigenvectors). Then
 
1 (νγl−1 − λl )+
C(α , Q, Σ) = ∑ log2 1 +
2 1≤l≤k λl
, (4.1.19)

 −1 
where (νγl−1 − λl )+ = max  −1 νγ l − λl , 0 . In other words, (νγl−1 − λl )+ are the
eigenvalues of the matrix ν Q −Σ + representing the positive-definite part of the
Hermitian matrix ν Q−1 − Σ. Next, ν = ν (α ) > 0 is determined from the condition
  
tr ν I − QΣ + = α . (4.1.20)
 
The positive-definite part ν I − QΣ + is in turn defined by
   
ν I − QΣ + = Π+ ν I − QΣ Π+

where Π+ is the orthoprojection (in Rk ) onto the subspace


 spanned by
 the eigen-
vectors of QΣ with eigenvalues γl λl < ν . In (4.1.20) tr ν I − QΣ + ≥ 0 (since
tr AB ≥ 0 for all pair of positive-definite matrices), equals 0 for ν = 0 (as
376 Further Topics from Information Theory
 
− QΣ + = 0) and monotonically increases with ν to +∞. Therefore, for any given
α > 0, (4.1.20) determines the value of ν = ν (α ) uniquely.
Though (4.1.19) looks much more involved than (4.1.17) both expressions are
corollaries of two facts: (i) the capacity can be identified as the maximum of the
mutual entropy between the (random) input and output signals, just as in the dis-
crete case (cf. Sections 1.3 and 1.4), and (ii) the mutual information in the case
of a Gaussian noise (white or coloured) is attained when the input signal is itself
Gaussian whose covariance solves an auxiliary optimisation problem. In the case
of (4.1.17) this optimisation problem is rather simple, while for (4.1.19) it is more
complicated (but still has a transparent meaning).
Correspondingly, the random encoding achieving the capacity C(α ; Q; Σ) is
where signals X j (i), 1 ≤ j ≤ N, i = 1, . . . , M, are IID, and X j (i) ∼ N(0, A − εN I)
where A is the k ×k positive-definite matrix maximising the determinant det(A+Σ)
subject to the constraint tr QA = α ; such a matrix turns out to be of the form
ν Q−1 − Σ + . The random encoding provides a convenient tool for calculating the
capacity in various models. We will discuss a number of such models in Worked
Examples.
The notable difference emerging for channels with continuously distributed
noise is that the entropy should be replaced – when appropriate – with the differen-
tial entropy. Recall the differential entropy introduced in Section 1.5. The mutual
entropy between two random variables X and Y with the joint PMF fX,Y (x, y) rel-
1
ative to a reference measure μ × ν and marginal PMFs fX (x) = fX,Y (x, y)ν (dy)
1
and fY (y) = fX,Y (x, y)μ (dx) is

fX,Y (X,Y )
I(X : Y ) = E log
fX (X) fY (Y )
0
fX,Y (x, y)
= fX,Y (x, y) log μ (dx)ν (dy).
fX (x) fY (y)

A similar definition works when X and Y are replaced by random vectors X =


(X1 , . . . , XN ) and Y = (Y1 , . . . ,YN ) (or even multi-vectors where – as in Example
4.1.4 – components X j and Y j are vectors themselves):


(N )
fX(N) ,Y(N ) (X(N) , Y(N ) )
I(X (N)
:Y ) = E log . (4.1.21a)
fX(N) (X(N) ) fY(N ) (Y(N ) )


Here fX(N) (x(N) ) and fY(N ) (y(N ) ) are the marginal PMFs for X(N) and Y(N ) (i.e.
joint PMFs for components of these vectors).
4.1 Gaussian channels and beyond 377

Specifically, if N = N , X(N) represents a random input and Y(N) = X(N) + Z(N)


the corresponding random output of a channel with a (random) probability of error:

⎨ 1, x(N) ∈ D(N) ,
E(x(N) , D(N) ) =
⎩P(N)
ch dML (Y
(N) ) = x(N) |x(N) , x(N) ∈ D(N) ;

cf. (4.1.14). Furthermore, we are interested in the expected value


E (PX(N) ; D(N) ) = E E(X(N) , D(N) ) . (4.1.21b)

Next, given ε > 0, we can define the supremum of the mutual information per
signal (i.e. per a single use of the channel), over all input probability distributions
PX(N) with E (PX(N) , D(N) ) ≤ ε :

1  
Cε ,N = sup I(X(N) : Y(N) ) : E (PX(N) , D(N) ) ≤ ε , (4.1.22)
N
Cε = lim sup Cε ,N C = lim inf Cε . (4.1.23)
N→∞ ε →0

We want to stress that the supremum in (4.1.22) should be taken over all proba-
bility distributions PX(N) of the input word X(N) with the property that the expected
error-probability is ≤ ε , regardless of whether these distributions are discrete or
continuous or mixed (contain both parts). This makes the correct evaluation of
CN,ε quite difficult. However, the limiting value C is more amenable, at least in
some important examples.
We are now in a position to prove the converse part of the Shannon second
coding theorem:

Theorem 4.1.5 (cf. Theorems 1.4.14 and 2.2.10.) Consider


a channel given by
a sequence of probability distributions Pch · | x sent for the random output
(N)

words Y(N) and decodable domains D(N) . Then quantity C from (4.1.22), (4.1.23)
gives an upper bound for the capacity:

C ≤ C. (4.1.24)

Proof Let R be a reliable transmission rate and {XM,N } be a sequence of code-


books with M =  XM,N ∼ 2NR for which lim eav (XM,N , D(N) ) = 0. Consider the
N→∞
(N)
pair (x, dML (y)) where (i) x = xeq is the random input word equidistributed over
XM,N , (ii) Y = Y(N) is the received word and (iii) dML (y) is the codeword guessed
while using the ML decoding rule dML after transmission. Words x and dML (Y) run
378 Further Topics from Information Theory

jointly over XM,N , i.e. have a discrete-type joint distribution. Then, by the gener-
alised Fano inequality (1.2.23),
hdiscr (X|d(Y)) ≤ 1 + log(M − 1) ∑ P(x = x, dML (Y) = x)
X∈XM,N
NR
≤ 1+ ∑ Pch (dML (Y) = x|x sent)
M x∈XM,N

= 1 + NReav (XM,N , D(N) ) := N θN ,


(N) (N)
where θN → 0 as N → ∞. Next, with h(Xeq ) = log M, we have NR − 1 ≤ h(Xeq ).
Therefore,
(N)
1 + h(Xeq )
R≤
N
1 (N) 1 (N)
= I(Xeq : d(Y(N) )) + h(Xeq |d(Y(N) ))
N N
1 1 (N) (N)
+ ≤ I(xeq : Y ) + θN .
N N
For any given ε > 0, for N sufficiently large, the average error-probability will
satisfy eav (XM,N , D(N) ) < ε . Consequently, R ≤ Cε ,N , for N large enough. (Because
the equidistribution over a codebook XM,N with  eav (XM,N , D(N) ) gives a specific
example of an input distribution PX(N) with E PX(N) , D(N) ) ≤ ε .) Thus, for all ε > 0,
R ≤ Cε , implying that the transition rate R ≤ C. Therefore, C ≤ C, as claimed.
The bound C ≤ C in (4.1.24) becomes exact (with C = C) in many interesting sit-
uations. Moreover, the expression for C simplifies in some cases of interest. For ex-
ample, for an MAGC instead of maximising the mutual information I(X(N) : Y(N) )
for varying N it becomes possible to maximise I(X : Y ), the mutual information be-
tween single input and output signals subject to an appropriate constraint. Namely,
for an MAGC,

C = C = sup I(X : Y ) : EX 2 < α . (4.1.25a)


The quantity sup I(X : Y ) : EX 2 ≤ α is often called the information capacity of


an MAGC, under the square-power constraint α . Moreover, for a general AGC,
 
1 1
N 1≤∑
(N)
C = C = lim sup I(X : Y ) : (N)
EX j < α .
2
(4.1.25b)
N→∞ N
j≤N

Example 4.1.6 Here we estimate the capacity C(α , σ 2 ) of an MAGC with addi-
tive white Gaussian noise of variance σ 2 , under the average power constraint (with
D(N) = B(N) ((N α )1/2 ) (cf. Example 4.1.1.), i.e. bound from above the right-hand
side of (4.1.25b).
4.1 Gaussian channels and beyond 379

Given an input distribution PX(N) , we write

I(X(N) : Y(N) ) = h(Y(N) ) − h(Y(N) |X(N) )


= h(Y(N) ) − h(Z(N) )
≤ ∑ h(Y j ) − h(Z(N) )
1≤ j≤N
 
= ∑ h(Y j ) − h(Z j ) . (4.1.26)
1≤ j≤N

Denote by α 2j = EX j2 the second moment of the single-input random variable X j ,


the jth entry of the random input vector X(N) . Then the corresponding random
output random variable Y j has

EY j2 = E(X j + Z j )2 = EX j2 + 2EX j Z j + EZ 2j = α 2j + σ 2 ,

as X j and Z j are independent and EZ j = 0.


Note that for a Gaussian channel, Y j has a continuous distribution (with the PDF
1
fY j (y) given by the convolution φσ 2 (x − y)dFX j (x) where φσ 2 is the PDF of Z j ∼
N(0, σ 2 )). Consequently, the entropies figuring in (4.1.26) and – implicitly – in
(4.1.25a,b) are the differential entropies. Recall that for a random variable Y j with
a PDF fY j , under the condition EY j2 ≤ α 2j + σ 2 , the maximum of the differential
1
entropy h(Y j ) ≤ log2 [2π e(α 2j + σ 2 )]. In fact, by Gibbs,
2
0
h(Y j ) = − fY j (y) log2 fY j (y)dy
0
≤− fY j (y) log2 φα 2j +σ 2 (y)dy
1
log2 e
= log2 2π (α 2j + σ 2 ) + EY 2
2 2(α 2j + σ 2 ) j
1

≤ log2 2π e(α 2j + σ 2 ) ,
2
and consequently,

I(X j : Y j ) = h(Y j ) − h(Z j )


≤ log2 [2π e(α 2j + σ 2 )] − log2 (2π eσ 2 )
 
α 2j
= log2 1 + 2 ,
σ

with equality iff Y j ∼ N(0, α 2j + σ 2 ).


380 Further Topics from Information Theory

The bound ∑ EX j2 = ∑ α 2j < N α in (4.1.25b) implies, by the law of large


1≤ j≤N
1≤ j≤N

numbers, that lim PX(N) B(N) ( N α ) = 1. Moreover, for any input probability
N→∞
distribution PX(N) with EX j2 ≤ α 2j , 1 ≤ j ≤ N, we have that
 
1 1 α 2j
2N 1≤∑
I(X : Y ) ≤
(N) (N)
log2 1 + 2 .
N j≤N σ

The Jensen inequality, applied to the concave function x → log2 (1 + x), implies
   
1 α 2j 1 1 α 2j
2N 1≤∑ N 1≤∑
log2 1 + 2 ≤ log2 1 +
j≤N σ 2 j≤N σ
2

1 α
≤ log2 1 + 2 .
2 σ
Therefore, in this example, the information capacity C, taken as the RHS of
(4.1.25b), obeys
1 α
C ≤ log2 1 + 2 . (4.1.27)
2 σ
After establishing Theorem 4.1.8, we will be able to deduce that the capacity
C(α , σ 2 ) equals the RHS, confirming the answer in (4.1.17).

Example 4.1.7 For the coloured Gaussian noise the bound from (4.1.26) can be
repeated:
I(X(N) : Y(N) ) ≤ ∑ [h(Y j ) − h(Z j )].
1≤ j≤N

Here we work with the mixed second-order moments for the random vectors of
input and output signals X j and Y j = X j + Z j :
1
N 1≤∑
α 2j = E"X j , QX j #, E"Y j , QY j # = α 2j + tr (QΣ), α 2j ≤ α .
j≤N

In this calculation we again made use of the fact that X j and Z j are independent
and the expected value EZ j = 0.
1
Next, as in the scalar case, I(X(N) : Y(N) ) does not exceed the difference
N
h(Y ) − h(Z) where Z ∼ N(0, Σ) is the coloured noise vector and Y = X + Z is
a multivariate normal distribution maximising the differential entropy under the
trace restriction. Formally:
1
I(X(N) : Y(N) ) ≤ h(α , Q, Σ) − h(Z)
N
4.1 Gaussian channels and beyond 381

where K is the covariance matrix of a signal, and


1 *

h(α , Q, Σ) = max log (2π )k e det(K + Σ) :


2 +
K positive-definite k × k matrix with tr (QK) ≤ α .

Write Σ in the diagonal form Σ = CΛCT where C is an orthogonal and Λ the diag-
onal k × k matrix formed by the eigenvalues of Σ:
⎛ ⎞
λ1 0 . . . 0
⎜ 0 λ2 . . . 0 ⎟
⎜ ⎟
Λ=⎜ . ⎟.
⎝0 0 . . 0⎠
0 0 . . . λk

Write CT KC = B and maximise det(B + Λ) subject to the constraint


B positive-definite and tr (ΓB) ≤ α where Γ = CT QC.

By the Hadamard inequality of Worked Example 1.5.10, det(B + Λ) ≤


∏ (Bii + λi ), with equality iff B is diagonal (i.e. matrices Σ and K have the same
1≤i≤k
eigenbasis), and B11 = β1 , . . . , Bkk = βk are the eigenvalues of K. As before, as-
sume QΣ = ΣQ, then tr (ΓB) = ∑ γi βi . So, we want to maximise the product
1≤i≤k
∏ (βi + λi ), or equivalently, the sum
1≤i≤k

∑ log(βi + λi ), subject to β1 , . . . , βk ≥ 0 and ∑ γi βi ≤ α .


1≤i≤k 1≤i≤k

If we discard the regional constraints β1 , . . . , βk ≥ 0, the Lagrangian


 
L (β1 , . . . , βk ; κ ) = ∑ log(βi + λi ) + κ α − ∑ γi βi
1≤i≤k 1≤i≤k

is maximised at
1 1
= κγi , i.e. βi = − λi , i = 1, . . . , k.
βi + λi κγi
To satisfy the regional constraint, we take
 
1
βi = − λi , i = 1, . . . , k,
κγi +

and adjust κ > 0 so that


 
1
∑ κ
− γi λi = α. (4.1.28)
1≤i≤k +
382 Further Topics from Information Theory

This yields that the information capacity C(α , Q, Σ) obeys


 
1 (νγl−1 − λl )+
C(α , Q, Σ) ≤ ∑ log2 1 +
2 1≤l≤k λl
, (4.1.29)

where the RHS comes from (4.1.28) with ν = 1/κ . Again, we will show that the
capacity C(α , Q, Σ) equals the last expression, confirming the answer in (4.1.19).

We now pass to the direct part of the second Shannon coding theorem for general
channels with regional restrictions. Although the statement of this theorem differs
from that of Theorems 1.4.15 and 2.2.1 only in the assumption of constraints upon
the codewords (and the proof below is a mere repetition of that of Theorem 1.4.15),
it is useful to put it in the formal context.

Theorem 4.1.8 Let a channel be specified by a sequence of conditional probabili-


(N)  
ties Pch · | x(N) sent for the received word Y(N) and a sequence of decoding con-
(N)  
straints x(N) ∈ D(N) for the input vector. Suppose that probability Pch · | x(N) sent
is given by a PMF fch (y(N) |x(N) sent) relative to a reference measure ν (N) . Given
c > 0, suppose that there exists a sequence of input probability distributions PX(N)
such that

(i) lim PX(N) (D(N) ) = 1,


N→∞
(ii) the distribution PX(N) is given by a PMF fX(N) (x(N) ) relative to a reference
measure μ (N) ,
(iii) the following convergence in probability holds true: for all ε > 0,
 N
 limN→∞ PX(N) ,Y(N) Tε = 1,  
1 f (N) ,Y(N) (x
(N) , Y(N) ) 
 x  (4.1.30a)
TεN =  log+ − c ≤ ε ,
N fX(N) (x(N) ) fY(N) (Y(N) ) 

where

0 , y ) = fX(N) (x ) fch (y |x sent),


fX(N) ,Y(N) (x(N) (N) (N) (N) (N)
(4.1.30b)
fY(N) (y(N) ) = fX(N) (xN ) fch (y(N) |x(N) sent)μ ×N dx(N) .

Then the capacity of the channel satisfies C ≥ c.


* +
Proof Take R < c and consider a random codebook xN (1), . . . , xN (M) , with
M ∼ 2NR , composed by IID codewords where each codeword xN ( j) is drawn ac-
cording to PN = PxN . Suppose that a (random) codeword xN ( j) has been sent and a
4.1 Gaussian channels and beyond 383

(random) word YN = YN ( j) received, with the joint PMF fX(N) ,Y(N) as in (4.1.30b).
We take ε > 0 and decode YN by using joint typicality:
dJT (YN ) = xN (i) when xN (i) is the only vector among
xN (1), . . . , xN (M) such that (xN (i), YN ) ∈ TεN .
Here set TεN is specified in (4.1.30a).
Suppose a random vector xN ( j) has been sent. It is assumed that an error occurs
every time when
(i) xN ( j) ∈ D(N) , or
(ii) the pair (xN ( j), YN ) ∈ TεN , or
(iii) (xN (i), YN ) ∈ TεN for some i = j.
These possibilities do not exclude each other but if none of them occurs then
(a) xN ( j) ∈ D(N) and
(b) x( j) is the only word among xN (1), . . . , xN (M) with (xN ( j), YN ) ∈ TεN .
Therefore, the JT decoder will return the correct result. Consider the average error-
probability
1
M 1≤∑
EM (PN ) = E( j, PN )
j≤M

where E( j, PN ) is the probability that any of the above possibilities (i)–(iii) occurs:
* + * +
E( j, PN ) = P xN ( j) ∈ D(N) ∪ (xN ( j), YN ) ∈ TεN
* +
∪ (xN (i), YN ) ∈ TεN for some i = j

= E1 xN ( j) ∈ D(N)

+ E1 xN ( j) ∈ D(N) , dJT (YN ) = xN ( j) . (4.1.31)

The symbols P and E in (4.1.31) refer to (1) a collection of IID input vectors
xN (1), . . . , xN (M), and (2) the output vector YN related to xN ( j) by the action of
the channel. Consequently, YN is independent of vectors xN (i) with i = j. It is in-
structive to represent the corresponding probability distribution P as the Cartesian
product; e.g. for j = 1 we refer in (4.1.31) to
P = PxN (1),YN (1) × PxN (2) × · · · × PxN (M)
where PxN (1),YN (1) stands for the joint distribution of the input vector xN (1) and the
output vector YN (1), determined by the joint PMF
fxN (1),YN (1) (xN , yN ) = fxN (1) (xN ) fch (yN |xN sent).
384 Further Topics from Information Theory

By symmetry, E( j, PN ) does not depend on j, thus in the rest of the argument we


can take j = 1. Next, probability E(1, PN ) does not exceed the sum of probabilities
   
P xN (1) ∈ D(N) + P (xN (1), YN ) ∈ TεN
M  
+ ∑ P (xN (i), YN ) ∈ TεN .
i=2

Thanks to the condition that lim Px(N) (D(N) ) = 1, the first summand vanishes as
N→∞
N → ∞. The second summand vanishes, again in the limit N → ∞, because of
M  
(4.1.30a). It remains to estimate the sum ∑ P (xN (i), YN ) ∈ TεN .
i=2
First, note that, by symmetry, all summands are equal, so
M  N   
∑P (x (i), YN ) ∈ TεN = 2NR − 1 P (xN (2), YN ) ∈ TεN .
i=2

Next, by Worked Example 4.2.3 (see (4.2.9) below)


 
P (xN (2), YN ) ∈ TεN ≤ 2−N(c−3ε )

and hence
m  N 
∑P (x (i), YN ) ∈ TεN ≤ 2N(R−c+3ε )
i=2

which tends to 0 as N → ∞ when ε < (c − R)/3.


Therefore, for R < c, lim EM (PN ) = 0. But EM (PN ) admits the representation
N→∞
 
1
M 1≤∑
Em (P ) = EPxN (1) ×···×PxN (M)
N
E( j)
j≤M

where quantity E( j) represents the error-probability as defined in (4.1.14):


'
1, xN ∈ D(N) ,
E( j) = (N),ε
Pch dJT (YN ) = xN ( j)|xN ( j) sent , xN ∈ D(N) .

We conclude that there exists a sequence of sample codebooks XM,N such that
the average error-probability

1
∑ e(x) → 0
M x∈XM,N
4.1 Gaussian channels and beyond 385
(N),ε
where e(x) = e(x, XM,N , D(N) , dJT ) is the error-probability for the input word x
in code XM,N , under the JT decoder and with regional constraint specified by D(N) :

⎨ 1, xN ∈ D(N) ,
e(x) =
(N),ε
⎩Pch dJT (YN ) = x|x sent , xN ∈ D(N) .

Hence, R is a reliable transmission rate in the sense of Definition 4.1.2. This com-
pletes the proof of Theorem 4.1.8.

We also have proved in passing the following result.

Theorem 4.1.9 Assume that the conditions of Theorem 4.1.5 hold true. Then,
for all R < C, there exists a sequence of codes XM,N of length N and size M ∼ 2RN
such that the maximum probability of error tends to 0 as N → ∞.

Example 4.1.10 Theorem 4.1.8 enables us to specify the expressions in (4.1.17)


and (4.1.19) as the true values of the corresponding capacities (under the ML rule):
for a scalar white noise of variance σ 2 , under an average input power constraint
∑ x2j ≤ N α ,
1≤ j≤N

1 α
C(α , σ 2 ) = log 1 + 2 ,
2 σ
for a vector white noise with variances σ 2 = (σ12 , . . . , σk2 ), under the constraint
∑ xTj x j ≤ N α ,
1≤ j≤N
 
1 (ν − σi2 )+
C(α , σ 2 ) = ∑
2 1≤i≤k
log 1 +
σi2
, where ∑ (ν − σi2 )+ = α 2 ,
1≤i≤k

and for the coloured vector noise with a covariance matrix Σ, under the constraint
∑ xTj Qx j ≤ N α ,
1≤ j≤N
 
1 (νγi−1 − λi )+
C(α , Q, Σ) = ∑ log 1 +
2 1≤i≤k λi
,

where ∑ (ν − γi λi )+ = α .
1≤i≤k
Explicitly, for a scalar white noise we take the random coding where the signals
X j (i), 1 ≤ j ≤ N, 1 ≤ i ≤ M = 2NR , are IID N(0, α − ε ). We have to check the
conditions of Theorem 4.1.5 in this case: as N → ∞,

(i) lim P(x(N) (i) ∈ B(N) ( N α ), for all i = 1, . . . , M) = 1;
N→∞
386 Further Topics from Information Theory

(ii) lim lim θN = C(α , σ 2 ) in probability where


ε →0 N→∞

1 P(X,Y )
N 1≤∑
θN = log .
j≤M PX (X)PY (Y )

First, property (i):



P x(N) (i) ∈ B(N) ( N α ), for some i = 1, . . . , M
 
1
≤P ∑ ∑ X j (i)2 ≥ α
NM 1≤i≤M 1≤ j≤N
 
1
=P ∑ ∑ (X j (i) − σ ) ≥ ε
NM 1≤i≤M
2 2
1≤ j≤N
 
1
≤ E(X − σ )
2 2 2
→ 0.
NM ε 2
Next, (ii): since pairs (X j ,Y j ) are IID, we apply the law of large numbers and
obtain that
P(X,Y )
θN → E log = I(X1 : Y1 ).
PX (X)PY (Y )
But

I(X1 : Y1 ) = h(Y1 ) − h(Y1 |X1 )


1
1  
= log 2π e(α − ε + σ 2 ) − log 2π eσ 2
2   2
1 α −ε
= log 1 + → C(α , σ 2 ) as ε → 0.
2 σ2
Hence, the capacity equals C(α , σ 2 ), as claimed. The case of coloured noise is
studied similarly.

Remark 4.1.11 Introducing a regional constraint described by a domain D does


not mean one has to secure that the whole code X should lie in D. To guarantee
that the error-probability Peav (X ) → 0 we only have to secure that the ‘majority’
of codewords x(i) ∈ X belong to D when the codeword-length N → ∞.

Example 4.1.12 Here we consider a non-Gaussian additive channel, where the


noise vector
⎛ ⎞
Z1
⎜ ⎟
Z = ⎝ ... ⎠
ZN
4.1 Gaussian channels and beyond 387

has two-side exponential IID components Z j ∼ (2) Exp(λ ), with the PDF
1
fZ j (z) = λ e−λ |z| , −∞ < z < ∞,
2
where Exp denotes the exponential distribution, λ > 0 and E|Z j | = 1/λ (see PSE I,
Appendix). Again we will calculate the capacity under the ML rule and with a
regional constraint x(N) ∈ Ł(N α ) where
( )
Ł(N α ) = x(N) ∈ RN : ∑ |x j | ≤ N α .
1≤ j≤N

First, observe that if the random variable X has E|X| ≤ α and the random variable
Z has E|Z| ≤ ζ then E|X + Z| ≤ α + ζ . Next, we use the fact that a random variable
Y with PDF fY and E|Y | ≤ η has the differential entropy

h(Y ) ≤ 2 + log2 η ; with equality iff Y ∼ (2) Exp(1/η ).

In fact, as before, by Gibbs


0
h(Y ) = − fY (y) log fY (y)dy
0
≤− fY (y) log φ (2) Exp(1/η ) (y)dy
0
1
= 1+ fY (y)|y|dy + log η
η
= 1 + log η ≤ 2 + log η
1
+ E|Y |
η
0
=− φ (2) Exp(1/η ) (y) log φ (2) Exp(1/η ) (y)dy,

and the equalities are achieved only when fY = φ (2) Exp(1/η ) .


Then, by the converse part of the SSCT,
1 (N) (N) 1
I(x : Y ) = ∑ h(Y j ) − h(Z j )
N N
1

≤ ∑ 2 + log2 (α j + λ −1 ) − 2 + log2 (λ )
N
1  

= ∑ log2 1 + α j λ
N  
≤ log2 1 + αλ .

The same arguments as before establish that the RHS gives the capacity of the
channel.
388 Further Topics from Information Theory

C
inf
= ln 7 M = 13
m=6
2b
_ 0
A A

Figure 4.2

Worked Example 4.1.13 Next, we consider a channel with an additive uniform


noise, where the noise random variable Z ∼ U(−b, b) with b > 0 representing the
limit for the noise amplitude. Let us choose the region constraint for the input signal
as a finite set A ⊂ R (an input ‘alphabet’) of the form A = {a, a + b, . . . , a + (M −
1)b}. Compute the information capacity of the channel:

Cinf = sup I(X : Y ) : pX (A ) = 1,Y = X + Z .

Solution Because of the shift-invariance, we can assume that a = −A and a + Mb


= A where 2A = Mb is the ‘width’ of the input signal set. The formula I(X : Y ) =
h(Y ) − h(Y |X) where h(Y |X) = h(Z) = ln(2b) (in nats) shows that we must max-
imise the output signal entropy h(Y ). The limits for Y are −A − b ≤ Y ≤ A + b, so
the distribution PY must be as close to uniform U(−A − b, A + b) as possible.
First, suppose M is odd:  A = 2m + 1, with

A = {0, ±A/m, ±2A/m, . . . , ±A} and b = A/m.

That is, the points of A partition the interval [−A, A] into 2m intervals of length
A/m; the ‘extended’ interval [−A − b, A + b] contains 2(m + 1) such intervals. The
maximising probability distribution PX can be spotted without calculations: it as-
signs equal probabilities 1/(m + 1) to m + 1 points

−A, −A + 2b, . . . , A − 2b, A.

In other words, we ‘cross off’ every second ‘letter’ from A and use the remaining
letters with equal probabilities.
In fact, with PX (−A) = PX (−A + 2b) = · · · = PX (A), the output signal PDF fY
assigns the value [2b(m + 1)]−1 to every point y ∈ [−A − b, A + b]. In other words,
Y ∼ U(−A − b, A + b) as required. The information capacity Cinf in this case is
equal (in nats) to
ln(2A + 2b) − ln 2b = ln (1 + m) . (4.1.32)
4.1 Gaussian channels and beyond 389

Say, for M = 3 (three input signals, at −A, 0, A, and b = A), Cinf = ln 2. For
M = 5 (five input signals, at −A, −A/2, 0, A/2, A, and b = A/2), Cinf = ln 3. See
Figure 4.2 for M = 13.

Remark 4.1.14 It can be proved that (4.1.32) gives the maximum mutual infor-
mation I(X : Y ) between the input and output signals X and Y = X + Z when (i)
the noise random variable Z ∼ U(−b, b) is independent of X and (ii) X has a gen-
eral distribution supported on the interval [−A, A] with b = A/m. Here, the mutual
information I(X : Y ) is defined according to Kolmogorov:

I(X : Y ) = sup I(Xξ : Yη ) (4.1.33)


ξ ,η

where the supremum is taken over all finite partitions ξ and η of intervals [−A, A]
and [−A − b, A + b], and Xξ and Yη stand for the quantised versions of random
variables X and Y , respectively.
In other words, the input-signal distribution PX with
1
PX (−A) = PX (−A + 2b) = · · · = PX (A − 2b) = PX (A) = (4.1.34)
m+1
maximises I(X : Y ) under assumptions (i) and (ii). We denote this distribution by
(A,A/m) (bm,b)
PX , or, equivalently, PX .
However, if M = 2m, i.e. the number  A of allowed signals is even, the cal-
culation becomes more involved. Here, clearly, the uniform distribution U(−A −
b, A + b) for the output signal Y cannot be achieved. We have to maximise h(Y ) =
h(X + Z) within the class of piece-wise constant PDFs fY on [−A − b, A + b]; see
below.
Equal spacing in [−A, A] is generated by points ±A/(2m − 1), ±3A/(2m −
1), . . . , ±A; they are described by the formula ±(2k − 1)A/(2m − 1) for k =
1, . . . , m. These points divide the interval [−A, A] into (2m − 1) intervals of length
2A/(2m − 1). With Z ∼ U(−b, b) and A = b(m − 1/2), we again have the output-
signal PDF fY (y) supported in [−A − b, A + b]:


⎪ if b(m − 1/2) ≤ y ≤ b(m + 1/2),
⎪pm /(2b), 




⎪ pk + pk+1 (2b), if b(k − 1/2) ≤ y ≤ b(k + 1/2)



⎪ for k = 1, . . . , m − 1,
⎨ 
fY (y) = p−1 + p1 (2b), if − b/2 ≤ y ≤ b/2,

⎪  

⎪ pk + pk+1 (2b), if b(k − 1/2) ≤ y ≤ b(k + 1/2)





⎪ for k = −1, . . . , −m + 1,


⎩ p /(2b), if − b(m + 1/2) ≤ y ≤ −b(m − 1/2),
−m
390 Further Topics from Information Theory

where
    
1 (2k − 1)A
p±k = pX ±b k − =P X =± , k = 1, . . . , m,
2 2m − 1

stand for the input-signal probabilities. The entropy h(Y ) = h(X + Z) is written as

pm pm pk + pk+1 pk + pk+1 p−1 + p1 p−1 + p1


− ln − ∑ ln − ln
2 2b 1≤k<m 2 2b 2 2b
pk + pk+1 pk + pk+1 p−m p−m
− ∑ ln − ln .
−m<k≤−1 2 2b 2 2b

It turns out that the maximising distribution PX has p−k = pk , for k = 1, . . . , m.


Thus, we face an optimisation problem:

pm pk + pk+1 p1
maximise G(p) = −pm ln − ∑ (pk + pk+1 ) ln − p1 ln
2b 1≤k<m 2b b
(4.1.35)
subject to the probabilistic constraints pk ≥ 0 and 2 ∑ pk = 1. The Lagrangian
1≤k≤m
L (PX ; λ ) reads

L (PX ; λ ) = G(p) + λ (2p1 + · · · + 2pm − 1)

and is maximised when


L (PX ; λ ) = 0, k = 1, . . . , m.
∂ pk

Thus, we have m equations, with the same RHS:

pm (pm−1 + pm )
− ln − 2 + 2λ = 0, (implies) pm (pm−1 + pm ) = 4b2 e2λ −2 ,
4b2
(pk−1 + pk )(pk + pk+1 )
− ln − 2 + 2λ = 0,
4b2
(implies) (pk−1 + pk )(pk + pk+1 ) = 4b2 e2λ −2 , 1 < k < m,
2p1 (p1 + p2 )
− ln − 2 + 2λ = 0 (implies) 2p1 (p1 + p2 ) = 4b2 e2λ −2 .
4b2

This yields
K
pm = pm−1 + pm−2 = · · · = p3 + p2 = 2p1 ,
for m even,
pm + pm−1 = pm−2 + pm−3 = · · · = p2 + p1 ,
4.1 Gaussian channels and beyond 391

and
K
pm = pm−1 + pm−2 = · · · = p2 + p1 ,
for m odd.
pm + pm−1 = pm−2 + pm−3 = · · · = p3 + p2 = 2p1 ,

For small values of M = 2m the solution is straightforward. Viz., for M = 2


(two input signals at ±A with b = 2A): p1 = 1/2 and the maximising output-signal
PDF is


⎨1/(4b), A ≤ y ≤ 3A,

fY (y) = 1/(2b), −A ≤ y ≤ A, yielding Cinf = (ln 2)/2.


1/(4b), −3A ≤ y ≤ −A,

For M = 4 (four input signals at −A, −A/3, A/3, A, with b = 2A/3): p1 = 1/6,
p2 = 1/3, and the maximising output-signal PDF is


⎨1/(6b), A ≤ y ≤ 5A/3 and − 5A/3 ≤ y ≤ −A,

fY (y) = 1/(4b), 2A/3 ≤ y ≤ A and − A ≤ y ≤ −2A/3,


1/(6b), −2A/3 ≤ y ≤ 2A/3,

which yields Cinf = ln(61/2 41/3 /2).


For M = 6 (six input signals at −A, −3A/5, −A/5, A/5, 3A/5, A, with b =
2A/5): p1 = 1/6, p2 = 1/12, p3 = 1/4. Similarly, for M = 8 (eight input signals
at −A, −5A/7, −3A/7, −A/7, A/7, 3A/7, 5A/7, A, with b = 2A/7): p1 = 1/10,
p2 = 3/20, p3 = 1/20, p4 = 1/5.
In general, we can write all probabilities in terms of p1 . Viz., for m even:

pm = 2p1 ,
pm−1 = p2 − p1 ,
pm−2 = 3p1 − p2 ,
pm−3 = 2(p2 − p1 ),
pm−4 = 4p1 − 2p2 ,
..
m .
p3 = − 1 (p2 − p1 ),
2
m+2
p2 = p1 ,
m
392 Further Topics from Information Theory

whence
m+2
p2 = p1 ,
m
m−2
p3 = p1 ,
m
m+4
p4 = p1 ,
m
m−4
p5 = p1 ,
m
.. (4.1.36)
.
2m − 2
pm−2 = ,
m
2
pm−1 = ,
m
pm = 2p1 ,
1
with p1 = .
2(m + 1)

The corresponding PDF fY gives the value

1 1 1 1
h(Y ) = − ln inf
and CA = − ln − ln 2. (4.1.37)
2 4m(m + 1)b2 2 4m(m + 1)

On the other hand, for a general odd m, the maximising input-signal distribution
PX has
m+1
p1 = ,
2m(m + 1)
m−1
p2 = ,
2m(m + 1)
m+3
p3 = ,
2m(m + 1)
m−3 (4.1.38)
p4 = ,
2m(m + 1)
..
.
1
pm−1 = ,
2m(m + 1)
m
pm = .
m(m + 1)
4.1 Gaussian channels and beyond 393

This yields the same answer for the maximum entropy and the restricted capacity:
1 1 1 1
h(Y ) = − ln inf
and CA = − ln − ln 2. (4.1.39)
2 4m(m + 1)b2 2 4m(m + 1)
In future, we will refer to the input-signal distributions specified in (4.1.36) and
(A,2A/(2m−1))
(4.1.38) as PX .

Remark 4.1.15 It is natural to suggest that the above formulas give the maxi-
mum mutual information I(X : Y ) when (i) the noise random variable Z ∼ U(−b, b)
is independent of X and (ii) the input-signal distribution PX is confined to [−A, A]
with b = 2A/(2m − 1), but otherwise is arbitrary (with I(X : Y ) again defined as in
(4.1.33)). A further-reaching (and more speculative) conjecture is about the max-
imiser under the above assumptions (i) and (ii) but for arbitrary A > b > 0, not
necessarily with A/b being integer or half-integer. Here number M = 2A/b + 1
will not be integer either, but remains worth keeping as a value of reference.
So when b decays from A/m to A/(m + 1) (or, equivalently, A grows from bm
to b(m + 1) and, respectively, M increases from 2m + 1 to 2m + 3), the maximiser
(A,b) (bm,b) (b(m+1),b)
PX evolves from PX to PX ; at A = b(m + 1/2) (when M = 2(m + 1))
(A,b) (A,b)
distribution PX may or may not coincide with the distribution PX from
(4.1.36), (4.1.38).
To (partially) clarify the issue, consider the case where A/2 ≤ b ≤ A (i.e. 3 ≤
M ≤ 5) and assume that the input-signal distribution PX has
1
PX (−A) = PX (A) = p and PX (0) = 1 − 2p where 0 ≤ p ≤ . (4.1.40)
2
Then
 
1 p 1− p 1 − 2p
hy(Y ) = − Ap ln + (2b − A)(1 − p) ln + (A − b)(1 − 2p) ln ,
b 2b 2b 2b
(4.1.41)
and the equation dh(Y ) dp = 0 is equivalent to

pA = (1 − p)2b−A (1 − 2p)2(A−b) . (4.1.42)

For b = A/2 this yields pA = (1 − 2p)A , i.e. p = 1 − 2p whence p = 1/3; similarly,


for b = A, p = 1/2. These coincide with previously obtained results. For b = 2A/3
we have that
pA = (1 − p)A/3 (1 − 2p)2A/3 ;

i.e.
p3 = (1 − p)(1 − 2p)2 . (4.1.43a)
394 Further Topics from Information Theory

We are interested in the solution lying in (0, 1/2) (in fact, in (1/3, 1/2)). For b =
3A/4, the equation becomes
pA = (1 − p)A/2 (1 − 2p)A/2 ,
i.e.
p2 = (1 − p)(1 − 2p), (4.1.43b)

whence p = (3 − 5) 2.
Example 4.1.16 It is useful to look at the example where the noise random
variable Z has two components: discrete and continuous. To start with, one could
try the case where
fZ (z) = qδ0 + (1 − q)φ (z; σ 2 ),
i.e. Z = 0 with probability q and Z ∼ N(0, σ 2 ) with probability 1 − q ∈ (0, 1). (So,
1 − q gives the total probability of error.) Here, we consider the case
1
fZ = qδ0 + (1 − q) 1(|z| ≤ b),
2b
and study the input-signal PMF of the form
PX (−A) = p−1 , PX (0) = p0 , PX (A) = p1 , (4.1.44a)
where
p−1 , p0 , p1 ≥ 0, p−1 + p0 + p1 = 1, (4.1.44b)
with b = A and M = 3 (three signal levels in (−A, A)). The input-signal entropy is
h(X) = h(p−1 , p0 , p1 ) = −p−1 ln p−1 − p0 ln p0 − p1 ln p1 .
The output-signal PMF has the form
  1
fY (y) = q p−1 δ−A + p0 δ0 + p1 δA + (1 − q)
 2b 
× p−1 1(−2A ≤ y ≤ 0) + p0 1(−A ≤ y ≤ A) + p1 1(0 ≤ y ≤ 2A)

and its entropy h(Y ) (calculated relative to the reference measure μ on R, whose
absolutely continuous component coincides with the Lebesgue and discrete com-
ponent assigns value 1 to points −A, 0 and A) is given by
h(Y ) = −q ln q − (1 − q) ln(1 − q) − qh(p−1 , p0 , p1 )

p−1   p−1 + p0
− (1 − q)A p−1 ln + p−1 + p0 ln
2A 2A

  p0 + p1 p1
+ p0 + p1 ln + p1 ln .
2A 2A
4.1 Gaussian channels and beyond 395

By symmetry, h(Y ) is maximised when p−1 = p1 = p, p0 = 1 − 2p, and we have


to maximise, in q ∈ (0, 1), the expression
 
p 1 − 2p
h(Y ) = h(q, 1 − q) − qh(p, p, 1 − 2p) − (1 − q)A 2p ln + (1 − 2p) ln ,
2A 2A
for a given q ∈ (0, 1).
Differentiating yields
 −(1−q)A/q
d p p
h(Y ) = 0 ↔ = .
dp 1 − 2p 1− p
If (1 − q)A/q > 1 this equation yields a unique solution which defines an optimal
input-signal distribution PX of the form (4.1.44a)–(4.1.44b).
If we wish to see what value of q yields the maximum of h(Y ) (and hence, the
maximum information capacity), we differentiate in q as well:
d q
h(Y ) = 0 ↔ log = (A − 1)h(p, p, 1 − 2p) − 2A ln 2A.
dq 1−q
If we wish to consider a continuously distributed input signal on [−A, A], with a
PDF fX (x), then the output random variable Y = X + Z has the PDF given by the
convolution:
0
1 (y+b)∧A
fY (y) = fX (x)dx.
2b (y−b)∨(−A)
1
The differential entropy h(Y ) = − fY (y) ln fY (y)dy, in terms of fX , takes the form
0 0 b  0 (x+z+b)∧A 
1 A 1
h(X + Z) = − fX (x) ln fX (x )dx dzdx.
2b −A −b 2b (x+z−b)∨(−A)
The PDF fX minimising the differential entropy h(X + Z) yields a solution to
0 b   0 (x+z+b)∧A 
1
0= ln fX (x )dx + fX (x)
−b 2b (x+z−b)∨(−A)
0 (x+z+b)∧A −1 

× fX (x )dx fX (x + z + b) − fX (x + z − b) dz.
(x+z−b)∨(−A)

An interesting question emerges when we think of a two-time-per-signal use of


a channel with a uniform noise. Suppose an input signal is represented by a point
x = (x1 , x2 ) in a plane R2 and assume as before that Z ∼ U(−b, b), independently
of the input signal. Then the square Sb (x) = (x1 − b, x1 + b) × (x2 − b, x2 + b), with
the uniform PDF 1/(4b2 ), outlines the possible positions of the output signal Y
given that X = (x1 , x2 ). Suppose that we have to deal with a finite input alphabet
A ⊂ R2 ; then the output-signal domain is the finite union B = ∪x∈A S(x). The
above argument shows that if we can find a subset A ⊆ A such that squares Sb (x)
396 Further Topics from Information Theory

inf
C = ln 10 2b points from A (10 in total)
points from A \ A (8 in total)

an example of set B
set B
square Sb ( _x )

2b
_x

Figure 4.3

with x ∈ A partition domain B (i.e. cover B but do not intersect each other) then,
for the input PMF Px with Px (x) = 1 ( A ) (a uniform distribution over A ), the

output-vector-signal PDF fY is uniform on B (that is, fY (y) = 1 area of B ).
Consequently, the output-signal entropy h(Y ) = ln area of B is attaining the
maximum over all input-signal PMFs Px with Px (A ) = 1 (and even attaining the
maximum over all input-signal PMFs Px with Px (B ) = 1 where B ⊂ B is an
arbitrary subset with the property that ∪x ∈B S(x ) lies within B). Finally, the in-
formation capacity for the channel under consideration,
1 area of B
Cinf = ln nats/(scalar input signal).
2 4b2
See Figure 4.3.
To put it differently, any bounded set D2 ⊂ R2 that can be partitioned into disjoint
squares of length 2b yields the information capacity
1 area of D2
C2inf = ln nats/(scalar input signal),
2 4b2
of an additive channel with a uniform noise over (−b, b), when the channel is used
two times per scalar input signal and the random vector input x = (X1 , X2 ) is subject
to the regional constraint x ∈ D2 . The maximising input-vector PMF assigns equal
probabilities to the centres of squares forming the partition.
A similar conclusion holds in R3 when the channel is used three times for every
input signal, i.e. the input signal is a three-dimensional vector x = (x1 , x2 , x3 ), and
so on. In general, when we use a K-dimensional input signal x = (x1 , . . . , xk ) ∈ RK ,
and the regional constraint is x ∈ DK ⊂ RK where DK is a bounded domain that can
4.2 The asymptotic equipartition property in continuous time setting 397

be partitioned into disjoint cubes of length 2b, the information capacity


1 volume of DK
CKinf = ln nats/(scalar input signal)
K (2b)K
is achieved at the input-vector-signal PMF Px assigning equal masses to the centres
of the cubes forming the partition.
As K → ∞, the quantity CK may converge to a limit C∞inf yielding the capacity
per scalar input signal under the sequence of regional constraint domains DK . A
trivial example of such a situation is where DK is a K-dimensional cube

SbK = (−2bm, 2bm)×K ;

then CKinf = ln(1 + m) does not depend on K (and the channel is memoryless).

4.2 The asymptotic equipartition property in the continuous


time setting

The errors of a wise man make your rule,


Rather than perfections of a fool.
William Blake (1757–1821), English poet

This section provides a missing step in the proof of Theorem 4.1.8 and ad-
ditional Worked Examples. We begin with a series of assertions illustrating the
asymptotic equipartition property in various forms. The central facts are based on
the Shannon–McMillan–Breiman (SMB) theorem which is considered a corner-
stone of information theory. This theorem gives the information rate of a stationary
ergodic process X = (Xn ). Recall that a transformation of a probability space T is
called ergodic if every set A such that TA = A almost everywhere, satisfies P(A) = 0
or 1. For a stationary ergodic source with a finite expected value, Birkhoff’s ergodic
theorem states the law of large numbers (with probability 1):
1 n
∑ Xi → EX.
n i=1
(4.2.1)

Typically, for a measurable function f (Xt ) of ergodic process,


1 n
∑ f (Xi ) → E f (X).
n i=1
(4.2.2)

Theorem 4.2.1 (Shannon–McMillan–Breiman) For any stationary ergodic pro-


cess X with finitely many values the information rate R = h, i.e. the limit in (4.2.3)
398 Further Topics from Information Theory

exists in the sense of the a.s. convergence and equals to entropy


1  
− lim log pX n−1 X0n−1 = h a.s. (4.2.3)
n→∞ n 0

The proof of Theorem 4.2.1 requires some auxiliary lemmas and is given at the
end of the section.
Worked Example 4.2.2 (A general asymptotic equipartition property) Given a
⎛ ⎞X1 , X2 , . . ., for all N = 1, 2, . . ., the distribution of
sequence of random variables
X1
⎜ .. ⎟
the random vector x1 = ⎝ . ⎠ is determined by a PMF fxN (xN1 ) with respect to
N
1
XN
measure μ = μ ×· · ·× μ (N factors). Suppose that the statement of the Shannon–
(N)

McMillan–Breiman theorem holds true:


1
− log fxN (xN1 ) → h in probability,
N 1

where h > 0 is a constant (typically, h = lim h(Xi )). Given ε > 0, consider the
i→∞
typical set
⎧ ⎞ ⎛ ⎫

⎨ x1 ⎪

⎜ .. ⎟ 1
Sε = x1 = ⎝ . ⎠ : −ε ≤ log fxN (x1 ) + h ≤ ε .
N N N

⎩ N 1 ⎪

xN
0
The volume μ (N) (SεN ) = μ (dx1 ) . . . μ (dxN ) of set SεN has the following proper-
SεN
ties:
μ (N) (SεN ) ≤ 2N(h+ε ) , for all ε and N, (4.2.4)
and, for 0 < ε < h and for all δ > 0,
μ (N) (SεN ) ≥ (1 − δ )2N(h−ε ) , for N large enough, depending on δ . (4.2.5)
0
Solution Since P(RN ) =
RN
fxN (xN1 )
1
∏ μ (dx j ) = 1, we have that
1≤ j≤N
0
1=
RN
fxN (xN1 )
1
∏ μ (dx j )
1≤ j≤N
0

SεN
fxN (xN1 )
1
∏ μ (dx j )
1≤ j≤N
0
≥2 −N(h+ε )
SεN
∏ μ (dx j ) = 2−N(h+ε ) μ (N) (SεN ),
1≤ j≤N
4.2 The asymptotic equipartition property in continuous time setting 399

giving the upper bound (4.2.4). On the other hand, given δ > 0, we can take N
large so that P(SεN ) ≥ 1 − δ , in which case, for 0 < ε < h,

1 − δ ≤ P(SεN )
0
=
SεN
fxN (xN1 )
1
∏ μ (dx j )
1≤ j≤N
0
≤2 −N(h−ε )

SεN 1≤ j≤N
μ (dx j ) = 2−N(h−ε ) μ (N) (SεN ).

This yields the lower bound (4.2.5).

The next step is to extend the asymptotic equipartition property to joint distri-
butions of pairs XN1 , YN1 (in applications, XN1 will play a role of an input and YN1
of an output of a channel). Formally, given two sequences of random variables,
X1 , X2 , . . . and Y1 ,Y2 , . . ⎛
., for⎞
all N = 1, 2, .⎛
. ., consider
⎞ the joint distribution of the
X1 Y1
⎜ ⎟ ⎜ ⎟
random vectors XN1 = ⎝ ... ⎠ and YN1 = ⎝ ... ⎠ which is determined by a (joint)
XN YN
PMF fxN ,YN with respect to measure μ (N) × ν (N) where μ (N) = μ × · · · × μ and
1 1
ν (N) = ν × · · · × ν (N factors in both products). Let fXN1 and fYN1 stand for the
(joint) PMFs of vectors XN1 and YN1 , respectively.
As in Worked Example 4.2.2, we suppose that the statements of the Shannon–
McMillan–Breiman theorem hold true, this time for the pair (XN1 , YN1 ) and each of
XN1 and YN1 : as N → ∞,

1 1
− log fXN (XN1 ) → h1 , − log fYN (YN1 ) → h2 ,
N 1 N 1
in probability,
1
− log fXN ,YN (X1 , Y1 ) → h,
N N
N 1 1

where h1 , h2 and h are positive constants, with

h1 + h2 ≥ h; (4.2.6)

typically, h1 = lim h(Xi ), h2 = lim h(Yi ), h = lim h(Xi ,Yi ) and h1 + h2 − h =


i→∞ i→∞ i→∞
lim I(Xi : Yi ). Given ε > 0, consider the typical set formed by sample pairs (xN1 , yN1 )
i→∞
where
⎛ ⎞
x1
⎜ ⎟
xN1 = ⎝ ... ⎠
xN
400 Further Topics from Information Theory

and
⎛ ⎞
y1
⎜ ⎟
yN1 = ⎝ ... ⎠ .
yN
Formally,
%
1
TεN= (xN1 , yN1 ) : − ε ≤ log fxN (xN1 ) + h1 ≤ ε ,
N 1

1
− ε ≤ log fYN (yN1 ) + h2 ≤ ε ,
N 1
K
1
− ε ≤ log fxN ,YN (x1 , y1 ) + h ≤ ε ; (4.2.7)
N N
N 1 1

 N
by the above assumption we have that lim P Tε = 1 for all ε > 0. Next, define
N→∞
the volume of set TεN :
 N 0
μ (N)
×ν (N)
Tε = μ (N) (dxN1 )ν (N) (dyN1 ).
TεN

Finally, consider an independent pair XN1 , YN1 where component XN1 has the same
PMF as XN1 and YN1 the same PMF as YN1 . That is, the joint PMF for XN1 and YN1
has the form
fXN ,YN (xN1 , yN1 ) = fXN (xN1 ) fYN (yN1 ). (4.2.8)
1 1 1 1

Next, we assess the volume of set TεN and then the probability that xN1 , YN1 ∈
TεN .
Worked Example 4.2.3 (A general joint asymptotic equipartition property)
(I) The volume of the typical set has the following properties:
 
μ (N) × ν (N) TεN ≤ 2N(h+ε ) , for all ε and N, (4.2.9)
and, for all δ > 0 and 0 < ε < h, for N large enough, depending on δ ,
 
μ (N) × ν (N) TεN ≥ (1 − δ )2N(h−ε ) . (4.2.10)

(II) For the independent pair XN1 , YN1 ,

P XN1 , YN1 ∈ TεN ≤ 2−N(h1 +h2 −h−3ε ) , for all ε and N, (4.2.11)

and, for all δ > 0, for N large enough, depending on δ ,



P XN1 , YN1 ∈ TεN ≥ (1 − δ )2−N(h1 +h2 −h+3ε ) , for all ε . (4.2.12)
4.2 The asymptotic equipartition property in continuous time setting 401

Solution (I) Completely follows the proofs of (4.2.4) and (4.2.5) with integration
of fxN ,YN .

1 1

(II) For the probability P XN1 , YN1 ∈ TεN we obtain (4.2.11) as follows:

0
P XN1 , YN1 ∈ TεN = fxN ,YN μ (dxN1 )ν (dyN1 )
TεN 1 1

by definition
0
= fxN (xN1 ) fYN (yN1 )μ (dxN1 )ν (dyN1 )
1 1
TεN
substituting (4.2.8)
0
≤ 2−N(h1 −ε ) 2−N(h2 −ε ) μ (dxN1 )ν (dyN1 )
TεN
according to (4.2.7)

≤ 2−N(h1 −ε ) 2−N(h2 −ε ) 2N(h+ε ) = 2−N(h1 +h2 −h−3ε )


because of bound (4.2.9).

Finally, by reversing the inequalities in the last two lines, we can cast them as
0
≥ 2−N(h1 +ε ) 2−N(h2 +ε ) μ (dxN1 )ν (dyN1 )
TεN
according to (4.2.7)

≥ (1 − δ )2−N(h1 +ε ) 2−N(h2 +ε ) 2N(h−ε ) = (1 − δ )2−N(h1 +h2 −h+3ε )


because of bound (4.2.10).

Formally, we assumed here that 0 < ε < h (since it was assumed in (4.2.10)), but
increasing ε only makes the factor 2−N(h1 +h2 −h+3ε ) smaller. This proves bound
(4.2.12).

A more convenient (and formally a broader) extension of the asymp-


totic equipartition property is where we suppose that the statements of
the Shannon–McMillan–Breiman
theorem

hold true directly for the ratio
N N N N
fxN ,YN (X1 , Y1 ) fXN (X1 ) fYN (Y1 )) . That is,
1 1 1 1

1 fXN ,YN (XN1 , YN1 )


log 1 1N → c in probability, (4.2.13)
N fxN (X1 ) fYN (YN1 )
1 1
402 Further Topics from Information Theory

where c > 0 is a constant. Recall that fXN ,YN represents the joint PMF while fXN
1 1 1
and fxN individual PMFs for the random input and output vectors xN and YN , with
1
respect to reference measures μ (N) and ν (N) :

fXN ,YN (xN1 , yN1 ) = fXN (xN1 ) fch (yN1 |xN1 sent ),
1 1 0 1
 
fYN (YN1 ) = fXN ,YN (xN1 , yN1 )μ (N) dxN1 .
1 1 1

Here, for ε > 0, we consider the typical set


% K
1 fXN ,YN (xN1 , yN1 )
TεN = (XN1 , yN1 ) : −ε ≤ log 1 1
− c ≤ ε ; (4.2.14)
N fXN (xN1 ) fYN (yN1 )
1 1
 N N  
by assumption (4.2.13) we have that lim P X1 , Y1 ∈ TεN = 1 for all ε > 0.
N→∞
Again, we will consider an independent pair XN1 , YN1 where component XN1
has the same PMF as XN1 and YN1 the same PMF as YN1 .

Theorem 4.2.4 (Deviation from the joint asymptotic equipartition property)


Assume that property (4.2.13) holds true. For an independent pair XN1 , YN1 , the

probability that XN1 , YN1 ∈ TεN obeys

P XN1 , YN1 ∈ TεN ≤ 2−N(c−ε ) , for all ε and N , (4.2.15)

and, for all δ > 0, for N large enough, depending on δ ,



P XN1 , YN1 ∈ TεN ≥ (1 − δ )2−N(c+ε ) , for all ε . (4.2.16)

Proof Again, we obtain (4.2.15) as follows:


0
P XN1 , YN1 ∈ TεN = fXN ,YN μ ×N (dXN1 )ν ×N (dyN1 )
TεN 1 1
0
= fxN (xN1 ) fYN (yN1 )μ (dXN1 )ν (dyN1 )
1 1
TεN
0
 
fXN ,YN (xN1 , yN1 )
= exp − 1 N1
TεN fXN (x1 ) fYN (yN1 )
1 1

× fXN ,YN (xN1 , yN1 )μ ×N (dxN1 )ν ×N (dyN1 )


0 1 1

−N(c−ε )
≤2 fXN ,YN (xN1 , yN1 )μ (dxN1 )ν (dyN1 )
1 1
TεN
−N(c−ε )
 N N  
=2 P X1 , Y1 ∈ TεN
≤ 2−N(c−ε ) .
4.2 The asymptotic equipartition property in continuous time setting 403

The first equality is by definition, the second step follows by substituting (4.2.8),
the third is by direct calculation, and the fourth because of the bound (4.2.14).
Finally, by reversing the inequalities in the last two lines, we obtain the bound
(4.2.16):
0
≥ 2−N(c+ε ) fXN ,YN (xN1 , yN1 )μ (dxN1 )ν (dyN1 )
1 1
TεN
−N(c+ε )
 N N  
=2 P X1 , Y1 ∈ TεN ≥ 2−N(c+ε ) (1 − δ ),
the first inequality following because of (4.2.14).
Worked Example 4.2.5 Let x = {X(1), . . . , X(n)}T be a given vector/collection
of random variables. Let us write x(C) for subcollection {X(i) : i ∈ C} where C is
a non-empty subset in the index set {1, . . . , n}. Assume that the joint distribution
for any subcollection x(C) with C = k, 1 ≤ k ≤ n, is given by a joint PMF fx(C)
relative to measure μ × · · · × μ (k factors, each corresponding
⎛ ⎞ to a random variable
x(1)
⎜ ⎟
X(i) with i ∈ C). Similarly, given a vector x = ⎝ ... ⎠ of values for x, denote by
x(n)
x(C) the argument {x(i) : i ∈ C} (the sub-column in x extracted by picking the rows
with i ∈ C). By the Gibbs inequality, for all partitions {C1 , . . . ,Cs } of set {1, . . . , n}
into non-empty disjoint subsets C1 , . . . ,Cs (with 1 ≤ s ≤ n), the integral
0
fxn1 (x)
fx(C1 ) (x(C1 )) . . . fx(Cs ) (x(Cs )) 1≤∏
fx (x) log μ (dx( j)) ≥ 0. (4.2.17)
j≤n

What is the partition for which the integral in (4.2.17) attains its maximum?

Solution The partition in question has s = n subsets, each consisting of a single


point. In fact, consider the partition of set {1, . . . , n} into single points; the corre-
sponding integral equals
0
fxn1 (x)
fx (x) log ∏ μ (dx( j)).
∏ fXi (xi ) 1≤ j≤n
(4.2.18)
1≤i≤n

Let {C1 , . . . ,Cs } be any partition of {1, . . . , n}. Multiply and divide the fraction
under the log by the product of joint PMFs ∏ fx(Cl ) (x(Cl )). Then the integral
1≤i≤s
(4.2.18) is represented as the sum
0
fxn1 (x)
fx (x) log ∏ μ (dx( j)) + terms ≥ 0.
∏ fx(Ci ) (x(Ci )) 1≤ j≤n
1≤i≤s

The answer follows.


404 Further Topics from Information Theory

Worked Example 4.2.6 Let x = {X(1), . . . , X(n)} be a collection of random


variables as in Worked Example 4.2.5, and let Y be another random variable.
Suppose that there exists a joint PMF fx,Y , relative to a measure μ (n) × ν where
μ (n) = μ × · · · × μ (n times). Given a subset C ⊆ {1, . . . , n}, consider the sum

I(x(C) : Y ) + E I(x(C : Y )|x(C) .


Here x(C) = {X(i) : i ∈ C}, x(C) = {X(i) : i ∈ C}, and E I(x(C : Y )|x(C) stands
for the expectation of I(x(C : Y ) conditional on the value of x(C). Prove that this
sum does not depend on the choice of set C.

Solution Check that the expression in question equals I(x : Y ).


In Section 4.3 we need the following facts about parallel (or product) channels.
Worked Example 4.2.7 (Lemma A in [173]; see also [174]) Show that the
capacity of the product of r time-discrete Gaussian channels with parameters
(α j , p( j) , σ 2j ) equals
 
αj p( j)
C= ∑ ln 1 + . (4.2.19)
1≤ j≤r 2 α j σ 2j
Moreover, (4.2.19) holds when some of the α j s equal +∞: in this case the corre-
sponding summand takes the form p( j) σ 2j .

Solution Suppose that multi-vector data x = {x1 , . . . , xr } are transmitted


⎛ ⎞ via r par-
x j1
⎜ .. ⎟
allel channels of capacities C1 , . . . ,Cr , where each vector x j = ⎝ . ⎠ ∈ Rn j . It is
x jn j
convenient to set n j = α j τ  where τ → ∞. It is claimed that the capacity for this
product-channel equals the sum ∑ Ci . By induction, it is sufficient to consider
1≤i≤r
the case r = 2. For the direct part, assume that R < C1 +C2 and ε > 0 are given. For
τ sufficiently large we must find a code for the product channel with M = eRτ code-
words and Pe < ε . Set η = (C1 +C2 − R)/2. Let X 1 and X 2 be codes for channels
1 and 2 respectively with M1 ∼ e(C1 −η )τ and M2 ∼ e(C2 −η )τ and error-probabilities
PeX , PeX ≤ ε /2. Construct a concatenation code X with codewords x = xk1 xl2
1 2

where x•i ∈ X i , i = 1, 2. Then, for the product-channel under consideration, with


codes X 1 and X 2 , the error-probability PeX , X is decomposed as follows:
1 2

1  
PeX , X = ∑
1 2
P error in channel 1 or 2| xk1 xl2 sent .
M1 M2 1≤k≤M1 ,1≤l≤M2

By independence of the channels, PeX


1, X 2
≤ PeX + PeX ≤ ε which yields the
1 2

direct part.
4.2 The asymptotic equipartition property in continuous time setting 405

The proof of the inverse is more involved and we present only a sketch, referring
the interested reader to [174]. The idea is to apply the so-called list decoding:
suppose we have a code Y of size M and a decoding rule d = d Y . Next, given
that a vector y has been received at the output port of a channel, a list of L possible
code-vectors from Y has to be produced, by using a decoding rule d = dlist Y , and the

decoding (based on rule d) is successful if the correct word is in the list. Then, for
the average error-probability Pe = PeY (d) over code Y , the following inequality is
satisfied:
Pe ≥ Pe ( d ) PeAV (L, d) (4.2.20)

where the error-probability Pe (d) = PeY (d) refers to list decoding and PeAV (L, d) =
PeAV (Y , L, d) stands for the error-probability under decoding rule d averaged over
all subcodes in Y of size L.
Now, going back to the product-channel with marginal capacities C1 and C2 ,
choose R > C1 +C2 , set η = (R −C1 −C2 )/2 and let the list size be L = eRL τ , with
RL = C2 + η . Suppose we use a code Y of size eRτ with a decoding rule d and a
list decoder d with the list-size L. By using (4.2.20), write

Pe ≥ Pe (d)PeAV (eRL τ , d) (4.2.21)

and use the facts that RL > C2 and the value PeAV (eRL τ , d) is bounded away from
zero. The assertion of the inverse part follows from the following observation dis-
cussed in Worked Example 4.2.8. Take R2 < R− RL and consider subcodes L ⊂ Y
of size  L = eR2 τ . Suppose we choose subcode L at random, with equal proba-
bilities. Let M2 = eR2 τ and PeY ,M2 (d) stand for the mean error-probability averaged
over all subcodes L ⊂ Y of size  L = eR2 τ . Then

Pe (d) ≥ PeY ,M2 (d) + ε (τ ) (4.2.22)

where ε (τ ) → 0 as τ → ∞.

Worked Example 4.2.8 Let L = eRL τ and M = eRτ . We aim to show that if
R2 < R − RL and M2 = eR2 τ then the following holds. Given a code X of size M ,
a decoding rule d and a list decoder d with list-size L, consider the mean error-
probability PeX ,M2 (d) averaged over the equidistributed subcodes S ⊂ X of size
 S = M2 . Then PeX ,M2 (d) and the list-error-probability PeX (d) satisfy

PeX (d) ≥ PeX ,M2 (d) + ε (τ ) (4.2.23)

where ε (τ ) → 0 as τ → ∞.

Solution Let X , S and d be as above and suppose we use a list decoder d with
list-length L.
406 Further Topics from Information Theory

Given a subcode S ⊂ X with M2 codewords, we will use the following de-


coding. Let L be the output of decoder d. If exactly one element x j ∈ S belongs
to L , the decoder for S will declare x j . Otherwise, it will pronounce an error.
Denote the decoder for S by d S . Thus, given that xk ∈ S was transmitted, the
resulting error-probability, under the above decoding rule, takes the form

Pek = ∑ p(L |xk )ES (L |xk )


L

where p(L |xk ) is a probability of obtaining the output L after transmitting xk


under the rule d X and ES (L |xk ) is the error-probability for d S . Next, split
ES (L |xk ) = ES 1 (L |x ) + E 2 (L |x ) where E 1 (L |x ) stands for the probabil-
k S k S k
ity that xk ∈ L and ES 2 (L |x ) for the probability that word x ∈ L was decoded
k k
by a wrong code-vector from S (both probabilities conditional upon sending xk ).
Further, ES 2 (L |x ) is split into a sum of (conditional) probabilities E (L , x |x )
k S j k
that the decoder returned vector x j ∈ L with j = k.
Let PeS (d) = PeS , AV (d) denote the average error-probability for subcode S .
The above construction yields
 
1
M2 k: x∑ ∑ p(L |xk ) ES1 (L |xk ) + ∑ ES (L , x j |xk ) . (4.2.24)
PeS (d) ≤
k ∈S L j =k

Inequality (4.2.24) is valid for any subcode S. We now select S at random from X choosing each subcode of size M_2 with equal probability. After averaging over all such subcodes we obtain a bound for the averaged error-probability P_e^{X,M_2} = P_e^{X,M_2}(d):

P_e^{X,M_2} ≤ P_e^X(d̃) + (1/M_2) ∑_{k=1}^{M_2} ∑_{L′} ∑_{j≠k} ⟨ p(L′|x_k) E_•^2(L′, x_j|x_k) ⟩^{X,M_2}   (4.2.25)

where ⟨ · ⟩^{X,M_2} means the average over all selections of subcodes. As x_j and x_k are chosen independently,

⟨ p(L′|x_k) E_•^2(L′, x_j) ⟩^{X,M_2} = ⟨ p(L′|x_k) ⟩^{X,M_2} ⟨ E_•^2(L′, x_j) ⟩^{X,M_2}.

Next,

⟨ p(L′|x_k) ⟩^{X,M_2} = (1/M) ∑_{x∈X} p(L′|x),   ⟨ E_•^2(L′, x_j|x_k) ⟩^{X,M_2} = L/M,

and we obtain

P_e^{X,M_2} ≤ P_e^X(d̃) + (1/M_2) ∑_{k=1}^{M_2} ∑_{L′} (1/M) ∑_{x∈X} p(L′|x) ∑_{j≠k} L/M,

which implies

P_e^{X,M_2} ≤ P_e^X(d̃) + M_2 L/M.   (4.2.26)

Since M_2 L/M = e^{R_2 τ} e^{−(R−R_L)τ} → 0 when τ → ∞ as R_2 < R − R_L, inequality (4.2.23) is proved.

We now give the proof of Theorem 4.2.1. Consider the sequence of kth-order Markov approximations of a process X, by setting

p^(k)(X_0^{n−1}) = p_{X_0^{k−1}}(X_0^{k−1}) ∏_{i=k}^{n−1} p(X_i | X_{i−k}^{i−1}).   (4.2.27)

Set also

H^(k) = E[ − log p(X_0 | X_{−k}^{−1}) ] = h(X_0 | X_{−k}^{−1})   (4.2.28)

and

H = E[ − log p(X_0 | X_{−∞}^{−1}) ] = h(X_0 | X_{−∞}^{−1}).   (4.2.29)

The proof is based on the following three results: Lemma 4.2.9 (the sandwich
lemma), Lemma 4.2.10 (a Markov approximation lemma) and Lemma 4.2.11 (a
no-gap lemma).

Lemma 4.2.9  For any stationary process X,

lim sup_{n→∞} (1/n) log [ p^(k)(X_0^{n−1}) / p(X_0^{n−1}) ] ≤ 0  a.s.,   (4.2.30)

lim sup_{n→∞} (1/n) log [ p(X_0^{n−1}) / p(X_0^{n−1} | X_{−∞}^{−1}) ] ≤ 0  a.s.   (4.2.31)

Proof  If A_n is a support event for p_{X_0^{n−1}} (i.e. P(X_0^{n−1} ∈ A_n) = 1), write

E [ p^(k)(X_0^{n−1}) / p(X_0^{n−1}) ] = ∑_{x_0^{n−1} ∈ A_n} p(x_0^{n−1}) [ p^(k)(x_0^{n−1}) / p(x_0^{n−1}) ]
  = ∑_{x_0^{n−1} ∈ A_n} p^(k)(x_0^{n−1}) = p^(k)(A_n) ≤ 1.
Similarly, if B_n = B_n(X_{−∞}^{−1}) is a support event for p_{X_0^{n−1} | X_{−∞}^{−1}} (i.e. P(X_0^{n−1} ∈ B_n | X_{−∞}^{−1}) = 1), write

E [ p(X_0^{n−1}) / p(X_0^{n−1} | X_{−∞}^{−1}) ] = E_{X_{−∞}^{−1}} ∑_{x_0^{n−1} ∈ B_n} p(x_0^{n−1} | X_{−∞}^{−1}) [ p(x_0^{n−1}) / p(x_0^{n−1} | X_{−∞}^{−1}) ]
  = E_{X_{−∞}^{−1}} ∑_{x_0^{n−1} ∈ B_n} p(x_0^{n−1}) = E_{X_{−∞}^{−1}} P(B_n) ≤ 1.

By the Markov inequality,

P ( p^(k)(X_0^{n−1}) / p(X_0^{n−1}) ≥ t_n ) = P ( (1/n) log [ p^(k)(X_0^{n−1}) / p(X_0^{n−1}) ] ≥ (1/n) log t_n ) ≤ 1/t_n,

and similarly for P ( p(X_0^{n−1}) / p(X_0^{n−1} | X_{−∞}^{−1}) ≥ t_n ). Letting t_n = n^2, so that ∑_n 1/t_n < ∞, and using the Borel–Cantelli lemma completes the proof.
Lemma 4.2.10  For a stationary ergodic process X,

−(1/n) log p^(k)(X_0^{n−1}) → H^(k)  a.s.,   (4.2.32)

−(1/n) log p(X_0^{n−1} | X_{−∞}^{−1}) → H  a.s.   (4.2.33)

Proof  Substituting f = − log p(X_0 | X_{−k}^{−1}) and f = − log p(X_0 | X_{−∞}^{−1}) into Birkhoff's ergodic theorem (see for example Theorem 9.1 from [36]) yields

−(1/n) log p^(k)(X_0^{n−1}) = −(1/n) log p(X_0^{k−1}) − (1/n) ∑_{i=k}^{n−1} log p^(k)(X_i | X_{i−k}^{i−1}) → 0 + H^(k)  a.s.   (4.2.34)

and

−(1/n) log p(X_0^{n−1} | X_{−∞}^{−1}) = −(1/n) ∑_{i=0}^{n−1} log p(X_i | X_{−∞}^{i−1}) → H  a.s.,   (4.2.35)

respectively.
So, by Lemmas 4.2.9 and 4.2.10,

lim sup_{n→∞} (1/n) log [ 1 / p(X_0^{n−1}) ] ≤ lim_{n→∞} (1/n) log [ 1 / p^(k)(X_0^{n−1}) ] = H^(k),   (4.2.36)

and

lim inf_{n→∞} (1/n) log [ 1 / p(X_0^{n−1}) ] ≥ lim_{n→∞} (1/n) log [ 1 / p(X_0^{n−1} | X_{−∞}^{−1}) ] = H,

which we rewrite as

H ≤ lim inf_{n→∞} [ −(1/n) log p(X_0^{n−1}) ] ≤ lim sup_{n→∞} [ −(1/n) log p(X_0^{n−1}) ] ≤ H^(k).   (4.2.37)

Lemma 4.2.11  For any stationary process X, H^(k) ↓ H̃ = H.

Proof  The convergence H^(k) ↓ H̃ follows by stationarity and by conditioning not to increase entropy. It remains to show that H^(k) → H, so that H̃ = H. The Doob–Lévy martingale convergence theorem for conditional probabilities yields

p(X_0 = x_0 | X_{−k}^{−1}) → p(X_0 = x_0 | X_{−∞}^{−1})  a.s.,  k → ∞.   (4.2.38)

As the set of values I is supposed to be finite, and the function p ∈ [0, 1] → −p log p is bounded, the bounded convergence theorem gives that as k → ∞,

H^(k) = E [ − ∑_{x_0 ∈ I} p(X_0 = x_0 | X_{−k}^{−1}) log p(X_0 = x_0 | X_{−k}^{−1}) ]
  → E [ − ∑_{x_0 ∈ I} p(X_0 = x_0 | X_{−∞}^{−1}) log p(X_0 = x_0 | X_{−∞}^{−1}) ] = H.
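The convergence in Lemma 4.2.10 and Theorem 4.2.1 is easy to observe numerically. The following minimal Python sketch (the two-state transition matrix and all parameter values are arbitrary choices, not taken from the text) simulates a stationary Markov chain and compares −(1/n) log p(X_0^{n−1}) with the entropy rate h(X_0|X_{−1}).

import numpy as np

rng = np.random.default_rng(4)
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])          # transition matrix (arbitrary example)
pi = np.array([0.8, 0.2])           # its stationary distribution

n = 100_000
x = np.empty(n, dtype=int)
x[0] = rng.choice(2, p=pi)
for i in range(1, n):
    x[i] = rng.choice(2, p=P[x[i - 1]])

# -(1/n) log p(X_0^{n-1}) along the simulated trajectory
log_p = np.log(pi[x[0]]) + np.log(P[x[:-1], x[1:]]).sum()
H = -np.sum(pi[:, None] * P * np.log(P))   # entropy rate h(X_0|X_{-1}), in nats
print(-log_p / n, H)                       # the two numbers nearly coincide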

4.3 The Nyquist–Shannon formula


In this section we give a rigorous derivation of the famous Nyquist–Shannon for-
mula1 for the capacity of a continuous-time channel with the power constraint and
a finite bandwidth, the result broadly considered an ultimate fact of information
theory. Our exposition follows (with minor deviations) the paper [173]. Because
it is quite long, we divide the section into subsections, each of which features a
particular step of the construction.
Harry Nyquist (1889–1976) is considered a pioneer of information theory whose
works, together with those of Ralph Hartley (1888–1970), helped to create the
concept of the channel capacity.
1 Some authors speak in this context of a Shannon–Hartley theorem.
The setting is as follows. Fix numbers τ, α, p > 0 and assume that every τ seconds a coder produces a real code-vector

x = (x_1, . . . , x_n)^T,

where n = ⌊ατ⌋. All vectors x generated by the coder lie in a finite set X = X_n ⊂ R^n of cardinality M ∼ 2^{R_bit τ} = e^{R_nat τ} (a codebook); sometimes we write, as before, X_{M,n} to stress the role of M and n. It is also convenient to list the code-vectors from X as x(1), . . . , x(M) (in an arbitrary order) where

x(i) = (x_1(i), . . . , x_n(i))^T,  1 ≤ i ≤ M.

Code-vector x is then converted into a continuous-time signal

x(t) = ∑_{i=1}^n x_i φ_i(t),  where 0 ≤ t ≤ τ,   (4.3.1)

by using an orthonormal basis in Ł2[0, τ] formed by functions φ_i(t), i = 1, 2, . . . (with ∫_0^τ φ_i(t)φ_j(t) dt = δ_{ij}). Then the entry x_i can be recovered by integration:

x_i = ∫_0^τ x(t) φ_i(t) dt.   (4.3.2)

The instantaneous signal power at time t is associated with |x(t)|^2; then the square-norm ||x||^2 = ∫_0^τ |x(t)|^2 dt = ∑_{1≤i≤n} |x_i|^2 will represent the full energy of the signal in the interval [0, τ]. The upper bound on the total energy spent on transmission takes the form

||x||^2 ≤ pτ,  or  x ∈ B_n(√(pτ)).   (4.3.3)

(In the theory of waveguides, the dimension n is called the Nyquist number and the value W = n/(2τ) ∼ α/2 the bandwidth of the channel.)

The code-vector x(i) is sent through an additive channel, where the receiver gets the (random) vector

Y = (Y_1, . . . , Y_n)^T,  where Y_k = x_k(i) + Z_k,  1 ≤ k ≤ n.   (4.3.4)

The assumption we will adopt is that Z = (Z_1, . . . , Z_n)^T is a vector with IID entries Z_k ∼ N(0, σ^2). (In applications, engineers use the representation Z_i = ∫_0^τ Z(t)φ_i(t) dt, in terms of a 'white noise' process Z(t).)

From the start we declare that if x(i) ∈ X \ B_n(√(pτ)), i.e. ||x(i)||^2 > pτ, the output signal vector Y is rendered 'non-decodable'. In other words, the probability of correctly decoding the output vector Y = x(i) + Z with ||x(i)||^2 > pτ is taken to be zero (regardless of the fact that the noise vector Z can be small and the output vector Y close to x(i), with a positive probability).

Otherwise, i.e. when ||x(i)||^2 ≤ pτ, the receiver applies, to the output vector Y, a decoding rule d(= d_{n,X}), i.e. a map y ∈ K → d(y) ∈ X where K ⊂ R^n is a 'decodable domain' (where map d has been defined). In other words, if Y ∈ K then vector Y is decoded as d(Y) ∈ X. Here, an error arises either if Y ∉ K or if d(Y) ≠ x(i) given that x(i) was sent. This leads to the following formula for the probability of erroneously decoding the input code-vector x(i):

P_e(i, d) = { 1,  ||x(i)||^2 > pτ,
            { P_ch( Y ∉ K or d(Y) ≠ x(i) | x(i) sent ),  ||x(i)||^2 ≤ pτ.   (4.3.5)

The average error-probability P_e = P_e^{X,av}(d) for the code X is then defined by

P_e = (1/M) ∑_{1≤i≤M} P_e(i, d).   (4.3.6)

Furthermore, we say that R_bit (or R_nat) is a reliable transmission rate (for given α and p) if for all ε > 0 we can specify τ_0(ε) > 0 such that for all τ > τ_0(ε) there exists a codebook X of size ♯X ∼ e^{R_nat τ} and a decoding rule d such that P_e = P_e^{X,av}(d) < ε. The channel capacity C is then defined as the supremum of all reliable transmission rates, and the argument from Section 4.1 yields

C = (α/2) ln( 1 + p/(ασ^2) )  (in nats);   (4.3.7)

cf. (4.1.17). Note that when α → ∞, the RHS in (4.3.7) tends to p/(2σ 2 ).
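As a quick numerical illustration of (4.3.7) (a minimal sketch; the values of p and σ^2 below are arbitrary and not prescribed by the text), the capacity indeed approaches p/(2σ^2) as α grows:

import numpy as np

def capacity_nats(alpha, p, sigma2):
    # C = (alpha/2) ln(1 + p/(alpha sigma^2)), cf. (4.3.7)
    return 0.5 * alpha * np.log(1.0 + p / (alpha * sigma2))

p, sigma2 = 4.0, 1.0
for alpha in [1.0, 10.0, 100.0, 1e4, 1e6]:
    print(alpha, capacity_nats(alpha, p, sigma2))
print("limit p/(2 sigma^2) =", p / (2 * sigma2))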

Figure 4.4

In the time-continuous set-up, Shannon (and Nyquist before him) discussed an application of formula (4.3.7) to band-limited signals. More precisely, set W = α/2; then the formula

C = W ln( 1 + p/(2σ_0^2 W) )   (4.3.8)

should give the capacity of the time-continuous additive channel with white noise of variance σ^2 = σ_0^2 W, for a band-limited signal x(t) with the spectrum in [−W, W] and of energy per unit time ≤ p.

This sentence, perfectly clear to a qualified engineer, became a stumbling point for mathematicians and required a technically involved argument for justifying its validity. In engineers' language, an 'ideal' orthonormal system on [0, τ] to be used in (4.3.1) would be a collection of n ∼ 2Wτ equally spaced δ-functions. In other words, it would have been very convenient to represent the code-vector x(i) = (x_1(i), . . . , x_n(i)) by a function f_i(t), of the time argument t ∈ [0, τ], given by the sum

f_i(t) = ∑_{1≤k≤n} x_k(i) δ( t − k/(2W) )   (4.3.9)

where n = ⌊2Wτ⌋ (and α = 2W). Here δ(t) represents a 'unit impulse' appearing near time 0 and graphically visualised as an 'acute unit peak' around point t = 0. Then the shifted function δ(t − k/(2W)) yields a peak concentrated near t = k/(2W), and the graph of function f_i(t) is shown in Figure 4.4.

We may think that our coder produces functions x_i(t) every τ seconds, and each such function is the result of encoding a message i. Moreover, within each time interval of length τ, the peaks x_k(i)δ(t − k/(2W)) appear at time-step 1/(2W). Here δ(t − k/(2W)) is the time-shifted Dirac delta-function.

The problem is that δ(t) is a so-called 'generalised function', and δ ∉ Ł2. A way to sort out this difficulty is to pass the signal through a low-frequency filter. This produces, instead of f_i(t), the function f̃_i(t)(= f̃_{W,i}(t)) given by

f̃_i(t) = ∑_{1≤k≤n} x_k(i) sinc(2Wt − k).   (4.3.10)

Here

sinc(2Wt − k) = sin(π(2Wt − k)) / (π(2Wt − k))   (4.3.11)

is the value of the shifted and rescaled (normalised) sinc function:

sinc(s) = { sin(πs)/(πs),  s ≠ 0,
          { 1,  s = 0,   s ∈ R,   (4.3.12)

featured in Figure 4.5.

The procedure of removing high-frequency harmonics (or, more generally, high-resolution components) and replacing the signal f_i(t) with its (approximate) lower-resolution version f̃_i(t) is widely used in modern computer graphics and other areas of digital processing.

Example 4.3.1 (The Fourier transform in Ł2)  Recall that the Fourier transform φ → Fφ of an integrable function φ (i.e. a function with ∫ |φ(x)| dx < +∞) is defined by

Fφ(ω) = ∫ φ(x) e^{iωx} dx,  ω ∈ R.   (4.3.13)

The inverse Fourier transform can be written as an inverse map:

F^{−1}φ(x) = (1/(2π)) ∫ φ(ω) e^{−iωx} dω.   (4.3.14)

A profound fact is that (4.3.13) and (4.3.14) can be extended to square-integrable functions φ ∈ Ł2(R) (with ||φ||^2 = ∫ |φ(x)|^2 dx < +∞). We have no room here to go into detail; the enthusiastic reader is referred to [127]. Moreover, the Fourier-transform techniques turn out to be extremely useful in numerous applications. For instance, denoting Fφ = φ̂ and writing F^{−1}φ̂ = φ, we obtain from (4.3.13), (4.3.14) that

φ(x) = (1/(2π)) ∫ φ̂(ω) e^{−ixω} dω.   (4.3.15)

In addition, for any two square-integrable functions φ_1, φ_2 ∈ Ł2(R),

2π ∫ φ_1(x) φ_2(x) dx = ∫ φ̂_1(ω) φ̂_2(ω) dω.   (4.3.16)

Figure 4.5 [the kernel K(t − s) plotted against t − s for W = 0.5, 1, 2]

Furthermore, the Fourier transform can be defined for generalised functions too; see again [127]. In particular, the equations similar to (4.3.13)–(4.3.14) for the delta-function look like this:

δ(t) = (1/(2π)) ∫ e^{−iωt} dω,   1 = ∫ δ(t) e^{itω} dt,   (4.3.17)

implying that the Fourier transform of the Dirac delta is δ̂(ω) ≡ 1. For the shifted delta-function we obtain

δ( t − k/(2W) ) = (1/(2π)) ∫ e^{ikω/(2W)} e^{−iωt} dω.   (4.3.18)

The Shannon–Nyquist formula is established for a device where the channel is preceded by a 'filter' that 'cuts off' all harmonics e^{±itω} with frequencies ω outside the interval [−2πW, 2πW]. In other words, a (shifted) unit impulse δ(t − k/(2W)) in (4.3.18) is replaced by its cut-off version which emerges after the filter cuts off the harmonics e^{−itω} with |ω| > 2πW.

The sinc function (a famous object in applied mathematics) is a classical function arising when we reduce the integral in ω in (4.3.17) to the interval [−π, π]:

sinc(t) = (1/(2π)) ∫_{−π}^{π} e^{−iωt} dω,   1_{[−π,π]}(ω) = ∫ sinc(t) e^{itω} dt,  t, ω ∈ R^1   (4.3.19)

(symbolically, function sinc = F^{−1} 1_{[−π,π]}). In our context, the function t → A sinc(At) can be considered, for large values of parameter A > 0, as a convenient approximation for δ(t). A customary caution is that sinc(t) is not an integrable function on the whole axis R (due to the 1/t factor), although it is square-integrable: ∫ sinc(t)^2 dt < ∞. Thus, the right equation in (4.3.19) should be understood in an Ł2-sense.
However, it does not make the mathematical and physical aspects of the theory less tricky (as well as the engineering ones). Indeed, an ideal filter producing a clear cut of unwanted harmonics is considered, rightly, as 'physically unrealisable'. Moreover, assuming that such a perfect device is available, we obtain a signal f̃_i(t) that is no longer confined to the time interval [0, τ] but is widely spread over the whole time axis. To overcome this obstacle, one needs to introduce further technical approximations.

Worked Example 4.3.2  Verify that the functions

t → 2√(πW) sinc(2Wt − k),  k = 1, . . . , n,   (4.3.20)

are orthonormal in the space Ł2(R^1):

4πW ∫ [sinc(2Wt − k)] [sinc(2Wt − k′)] dt = δ_{kk′}.

Solution  The shortest way to see this is to write the Fourier decomposition (in Ł2(R)) implied by (4.3.19):

2√(πW) sinc(2Wt − k) = (1/(2√(πW))) ∫_{−2πW}^{2πW} e^{ikω/(2W)} e^{−itω} dω   (4.3.21)

and check that the functions representing the Fourier transforms

(1/(2√(πW))) 1(|ω| ≤ 2πW) e^{ikω/(2W)},  k = 1, . . . , n,

are orthonormal. That is,

(1/(4πW)) ∫_{−2πW}^{2πW} e^{i(k−k′)ω/(2W)} dω = δ_{kk′}   (4.3.22)

where

δ_{kk′} = { 1,  k = k′,
          { 0,  k ≠ k′,

is the Kronecker symbol. But (4.3.22) can be verified by a standard integration.
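The integrals above can also be spot-checked numerically. The sketch below (an arbitrary value of W and a crude truncation of the time axis, so the figures are only approximate) confirms that the cross-integrals ∫ sinc(2Wt − k) sinc(2Wt − k′) dt vanish for k ≠ k′ and take a common value 1/(2W) for k = k′; the overall normalisation constant in (4.3.20) then depends on the Fourier/inner-product convention in use.

import numpy as np

W = 1.5                                   # arbitrary bandwidth parameter
t = np.linspace(-200.0, 200.0, 400_001)   # truncated time axis
dt = t[1] - t[0]

def s(k):
    return np.sinc(2.0 * W * t - k)       # np.sinc(x) = sin(pi x)/(pi x)

for k, kk in [(1, 1), (3, 3), (1, 2), (3, 7)]:
    print(k, kk, np.sum(s(k) * s(kk)) * dt)   # ~1/(2W) on the diagonal, ~0 otherwise
print("1/(2W) =", 1.0 / (2.0 * W))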

Since the functions in (4.3.20) are orthonormal, we obtain that

||x(i)||^2 = (4πW) ||f̃_i||^2,  where ||f̃_i||^2 = ∫ |f̃_i(t)|^2 dt,   (4.3.23)

and the functions f̃_i have been introduced in (4.3.10). Thus, the power constraint can be written as

||f̃_i||^2 ≤ pτ/(4πW) = p_0 τ.   (4.3.24)

In fact, the coefficients x_k(i) coincide with the values f̃_i(k/(2W)) of function f̃_i calculated at time points k/(2W), k = 1, . . . , n; these points can be referred to as 'sampling instances'.

Thus, the input signal f̃_i(t) develops in continuous time although it is completely specified by its values f̃_i(k/(2W)) = x_k(i). Thus, if we think that different signals are generated in disjoint time intervals (0, τ), (τ, 2τ), . . ., then, despite interference caused by the infinite tails of the function sinc(t), these signals are clearly identifiable through their values at the sampling instances.

The Nyquist–Shannon assumption is that signal f̃_i(t) is transformed in the channel into

g(t) = f̃_i(t) + Z(t).   (4.3.25)

Here Z(t) is a stationary continuous-time Gaussian process with the zero mean (EZ(t) ≡ 0) and the (auto-)correlation function

E Z(s) Z(t + s) = 2σ_0^2 W sinc(2Wt),  t, s ∈ R.   (4.3.26)

In particular, when t is a multiple of 1/(2W) (i.e. point t coincides with a sampling instance), the random variables Z(s) and Z(t + s) are independent. An equivalent form of this condition is that the spectral density

Φ(ω) := ∫ e^{itω} E[Z(0)Z(t)] dt = σ_0^2 1(|ω| < 2πW).   (4.3.27)

We see that the received continuous-time signal y(t) can be identified through its values y_k = y(k/(2W)) via equations

y_k = x_k(i) + Z_k  where  Z_k = Z(k/(2W)) are IID N(0, 2σ_0^2 W).

This corresponds to the system considered in Section 4.1 with p = 2W p_0 and σ^2 = 2σ_0^2 W. It has been generally believed in the engineering community that the capacity C of the current system is given by (4.3.8), i.e. the transmission rates below this value of C are reliable and above it they are not.

However, a number of problems are to be addressed in order to understand formula (4.3.8) rigorously. One is that, as was noted above, a 'sharp' filter band-limiting the signal to a particular frequency interval is an idealised device. Another is that the output signal g(t) in (4.3.25) can be reconstructed after it has been recorded over a small time interval, because any sample function of the form

t ∈ R → ∑_{1≤k≤n} (x_k(i) + z_k) sinc(2Wt − k)   (4.3.28)

is analytic in t. Therefore, the notion of rate should be properly re-defined.


The simplest solution (proposed in [173]) is to introduce a class of functions
A (τ ,W, p0 ) which are

(i) approximately band-limited to W cycles per a unit of time (say, a second),


(ii) supported by a time interval of length τ (it will be convenient to specify this
interval as [−τ /2, τ /2]),
(iii) have the total energy (the Ł2 (R)-norm) not exceeding p0 τ .

These restrictions determine the regional constraints upon the system.



Thus, consider a code X of size M ∼ e^{Rτ}, i.e. a collection of functions f_1(t), . . . , f_M(t), of a time variable t. If a given code-function f_i ∉ A(τ, W, p_0), it is declared non-decodable: it generates an error with probability 1. Otherwise, the signal f_i ∈ A(τ, W, p_0) is subject to the additive Gaussian noise Z(t) with mean EZ(t) ≡ 0 characterised by (4.3.27) and is transformed to g(t) = f_i(t) + Z(t), the signal at the output port of the channel (cf. (4.3.25)). The receiver uses a decoding rule, i.e. a map d : K → X where K is, as earlier, the domain of definition of d, i.e. some given class of functions where map d is defined. (As before, the decoder d may vary with the code, prompting the notation d = d_X.) Again, if g ∉ K, the transmission is considered as erroneous. Finally, if g ∈ K then the received signal g is decoded by the code-function d_X(g)(t) ∈ X. The probability of error for code X when the code-signal generated by the coder was f_i ∈ X is set to be

P_e(i) = { 1,  f_i ∉ A(τ, W, p_0),
         { P_ch( K^c ∪ {g : d_X(g) ≠ f_i} ),  f_i ∈ A(τ, W, p_0).   (4.3.29)

The average error-probability P_e = P_e^{X,av}(d) for code X (and decoder d) equals

P_e = (1/M) ∑_{1≤i≤M} P_e(i, d).   (4.3.30)

Value R(= R_nat) is called a reliable transmission rate if, for all ε > 0, there exist τ and a code X of size M ∼ e^{Rτ} such that P_e < ε.

Now fix a value η ∈ (0, 1). The class A(τ, W, p_0) = A(τ, W, p_0, η) is defined as the set of functions f°(t) such that

(i) f° = D_τ f where

D_τ f(t) = f(t) 1(|t| < τ/2),  t ∈ R,

and f(t) has the Fourier transform ∫ e^{itω} f(t) dt vanishing for |ω| > 2πW;
(ii) the ratio

||f°||^2 / ||f||^2 ≥ 1 − η;

and
(iii) the norm ||f°||^2 ≤ p_0 τ.

In other words, the 'transmittable' signals f° ∈ A(τ, W, p_0, η) are 'sharply localised' in time and 'nearly band-limited' in frequency.

The Nyquist–Shannon formula can be obtained as a limiting case from several assertions; the simplest one is Theorem 4.3.3 below. An alternative approach will be presented later in Theorem 4.3.7.

Theorem 4.3.3  The capacity C = C(η) of the above channel with constraint domain A(τ, W, p_0, η) described in conditions (i)–(iii) above is given by

C = W ln( 1 + p_0/(2σ_0^2 W) ) + (η/(1 − η)) (p_0/σ_0^2).   (4.3.31)

As η → 0,

C(η) → W ln( 1 + p_0/(2σ_0^2 W) ),   (4.3.32)

which yields the Nyquist–Shannon formula (4.3.8).
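For a feel of how small the correction term in (4.3.31) is for small η, here is a short sketch evaluating C(η) for a few values of η (arbitrary W, p_0, σ_0^2, not taken from the text) and comparing with the limit (4.3.32).

import numpy as np

W, p0, s0sq = 3.0, 10.0, 1.0        # arbitrary sample parameters

def C(eta):
    # (4.3.31): W ln(1 + p0/(2 s0^2 W)) + (eta/(1-eta)) p0/s0^2
    return W * np.log(1 + p0 / (2 * s0sq * W)) + eta / (1 - eta) * p0 / s0sq

for eta in [0.3, 0.1, 0.01, 0.001]:
    print(eta, C(eta))
print("eta -> 0 limit:", W * np.log(1 + p0 / (2 * s0sq * W)))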

Before going into (quite involved) technical detail, we will discuss some facts relevant to the product, or parallel combination, of r time-discrete Gaussian channels. (In essence, this model was discussed at the end of Section 4.2.) Here, every τ time units, the input signal is generated, which is an ordered collection of vectors

(x^(1), . . . , x^(r))  where  x^(j) = (x_1^(j), . . . , x_{n_j}^(j))^T ∈ R^{n_j},  1 ≤ j ≤ r,   (4.3.33)

and n_j = ⌊α_j τ⌋ with α_j being a given value (the speed of the digital production from coder j). For each vector x^(j) we consider a specific power constraint:

||x^(j)||^2 ≤ p^(j) τ,  1 ≤ j ≤ r.   (4.3.34)

The output signal is a collection of (random) vectors

(Y^(1), . . . , Y^(r))  where  Y^(j) = (Y_1^(j), . . . , Y_{n_j}^(j))^T  and  Y_k^(j) = x_k^(j) + Z_k^(j),   (4.3.35)

with Z_k^(j) being IID random variables, Z_k^(j) ∼ N(0, (σ^(j))^2), 1 ≤ k ≤ n_j, 1 ≤ j ≤ r.

A codebook X with information rate R, for the product-channel under consideration, is an array of M input signals,

(x^(1)(1), . . . , x^(r)(1)),
(x^(1)(2), . . . , x^(r)(2)),
  . . .
(x^(1)(M), . . . , x^(r)(M)),   (4.3.36)

each of which has the same structure as in (4.3.33). As before, a decoder d is a map acting on a given set K of sample output signals (y^(1), . . . , y^(r)) and taking these signals to X.

As above, for i = 1, . . . , M, we define the error-probability P_e(i, d) for code X when sending an input signal (x^(1)(i), . . . , x^(r)(i)):

P_e(i, d) = 1,  if ||x^(j)(i)||^2 ≥ p^(j) τ for some j = 1, . . . , r,

and

P_e(i, d) = P_ch( (Y^(1), . . . , Y^(r)) ∉ K or d(Y^(1), . . . , Y^(r)) ≠ (x^(1)(i), . . . , x^(r)(i)) | (x^(1)(i), . . . , x^(r)(i)) sent ),
  if ||x^(j)(i)||^2 < p^(j) τ for all j = 1, . . . , r.

The average error-probability P_e = P_e^{X,av}(d) for code X (while using decoder d) is then again given by

P_e = (1/M) ∑_{1≤i≤M} P_e(i, d).

As usual, R is said to be a reliable transmission rate if for all ε > 0 there exists a τ_0 > 0 such that for all τ > τ_0 there exists a code X of cardinality M ∼ e^{Rτ} and a decoding rule d such that P_e < ε. The capacity of the combined channel is again defined as the supremum of all reliable transmission rates. In Worked Example 4.2.7 the following fact has been established (cf. Lemma A in [173]; see also [174]).

Lemma 4.3.4  The capacity of the product-channel under consideration equals

C = ∑_{1≤j≤r} (α_j/2) ln( 1 + p^(j)/(α_j σ_j^2) ).   (4.3.37)

Moreover, (4.3.37) holds when some of the α_j equal +∞: in this case the corresponding summand takes the form p^(j)/(2σ_j^2).

Our next step is to consider jointly constrained products of time-discrete Gaussian channels. We discuss the following types of joint constraints.

Case I. Take r = 2, assume σ_1^2 = σ_2^2 = σ_0^2 and replace condition (4.3.34) with

||x^(1)||^2 + ||x^(2)||^2 < p_0 τ.   (4.3.38a)

In addition, if α_1 ≤ α_2, we introduce β ∈ (0, 1) and require that

||x^(2)||^2 ≤ β ( ||x^(1)||^2 + ||x^(2)||^2 ).   (4.3.38b)

Otherwise, i.e. if α_2 ≤ α_1, formula (4.3.38b) is replaced by

||x^(1)||^2 ≤ β ( ||x^(1)||^2 + ||x^(2)||^2 ).   (4.3.38c)

Case II. Here we take r = 3 and assume that σ_1^2 = σ_2^2 ≥ σ_3^2 and α_3 = +∞. The requirements are now that

∑_{1≤j≤3} ||x^(j)||^2 < p_0 τ   (4.3.39a)

and

||x^(3)||^2 ≤ β ∑_{1≤j≤3} ||x^(j)||^2.   (4.3.39b)

Case III. As in Case I, take r = 2 and assume σ_1^2 = σ_2^2 = σ_0^2. Further, let α_2 = +∞. The constraints now are

||x^(1)||^2 < p_0 τ   (4.3.40a)

and

||x^(2)||^2 < β ( ||x^(1)||^2 + ||x^(2)||^2 ).   (4.3.40b)

Worked Example 4.3.5 (cf. Theorem 1 in [173])  We want to prove that the capacities of the above combined parallel channels of types I–III are as follows.

Case I, α_1 ≤ α_2:

C = (α_1/2) ln( 1 + (1 − ζ)p_0/(α_1 σ_0^2) ) + (α_2/2) ln( 1 + ζ p_0/(α_2 σ_0^2) )   (4.3.41a)

where

ζ = min( β, α_2/(α_1 + α_2) ).   (4.3.41b)

If α_2 ≤ α_1, subscripts 1 and 2 should replace each other in these equations. Further, when α_i = +∞, one uses the limiting expression lim_{α→∞} (α/2) ln(1 + v/α) = v/2. In particular, if α_1 < α_2 = +∞ then β = ζ, and the capacity becomes

C = (α_1/2) ln( 1 + (1 − β)p_0/(α_1 σ_0^2) ) + β p_0/(2σ_0^2).   (4.3.41c)

This means that the best transmission rate is attained when one puts as much 'energy' into channel 2 as is allowed by (4.3.38b).

Case II:

C = (α_1/2) ln( 1 + (1 − β)p_0/((α_1 + α_2)σ_1^2) ) + (α_2/2) ln( 1 + (1 − β)p_0/((α_1 + α_2)σ_1^2) ) + β p_0/(2σ_3^2).   (4.3.42)

Case III:

C = (α_1/2) ln( 1 + p_0/(α_1 σ_0^2) ) + β p_0/(2(1 − β)σ_0^2).   (4.3.43)

Solution  We present the proof for Case I only. For definiteness, assume that α_1 < α_2 ≤ ∞. First, the direct part. With p_1 = (1 − ζ)p_0, p_2 = ζ p_0, consider the parallel combination of two channels, with individual power constraints on the input signals x^(1) and x^(2):

||x^(1)||^2 ≤ p_1 τ,   ||x^(2)||^2 ≤ p_2 τ.   (4.3.44a)

Of course, (4.3.44a) implies (4.3.38a). Next, with ζ ≤ β, condition (4.3.38b) also holds true. Then, according to the direct part of Lemma 4.3.4, any rate R with R < C_1(p_1) + C_2(p_2) is reliable. Here and below,

C_ι(q) = (α_ι/2) ln( 1 + q/(α_ι σ_0^2) ),  ι = 1, 2.   (4.3.44b)

This implies the direct part.

A longer argument is needed to prove the inverse. Set C* = C_1(p_1) + C_2(p_2). The aim is to show that any rate R > C* is not reliable. Assume the opposite: there exists such a reliable R = C* + ε; let us recall what this formally means. There exist a sequence of values τ^(l) → ∞ and (a) a sequence of codes

X^(l) = { x(i) = (x^(1)(i), x^(2)(i)),  1 ≤ i ≤ M^(l) }
of size M^(l) ∼ e^{Rτ^(l)}, composed of 'combined' code-vectors x(i) = (x^(1)(i), x^(2)(i)) with square-norms ||x(i)||^2 = ||x^(1)(i)||^2 + ||x^(2)(i)||^2, and (b) a sequence of decoding maps d^(l) : y ∈ K^(l) → d^(l)(y) ∈ X^(l) such that P_e → 0. Here, as before, P_e = P_e^{X^(l),av}(d^(l)) stands for the average error-probability:

P_e = (1/M^(l)) ∑_{1≤i≤M^(l)} P_e(i, d^(l))

calculated from individual error-probabilities P_e(i, d^(l)):

P_e(i, d^(l)) = 1,  if ||x(i)||^2 > p_0 τ^(l) or ||x^(2)(i)||^2 > β||x(i)||^2,

P_e(i, d^(l)) = P_ch( Y ∉ K^(l) or d^(l)(Y) ≠ x(i) | x(i) sent ),
  if ||x(i)||^2 ≤ p_0 τ^(l) and ||x^(2)(i)||^2 ≤ β||x(i)||^2.

The component vectors


⎛ (1) ⎞ (1)
⎛ ⎞
x1 (i) x1 (i)
⎜ .. ⎟ ⎜ .. ⎟
x(1) (i) = ⎜
⎝ . ⎟ ∈ Rα1 τ (l) 
⎠ and x(2) (i) = ⎜
⎝ . ⎟ ∈ Rα2 τ (l) 

(1) (2)
xα τ (l)  (i) xα τ (l)  (i)
1 2

are sent through their respective parts of the parallel-channel combination, which
results in output vectors
⎛ (1) ⎞ ⎛ (2) ⎞
Y1 (i) Y1 (i)
⎜ .. ⎟ ⎜ .. ⎟
Y(1) = ⎜
⎝ . ⎟ ∈ Rα1 τ (l)  , Y(2) = ⎜
⎠ ⎝ . ⎟ ∈ Rα2 τ (l) 

(1) (2)
Yα τ (l)  (i) Yα τ (l)  (i)
1 2

* +
forming the combined output signal Y = Y(1) , Y(2) . The entries of vectors Y(1)
and Y(2) are sums
(1) (1) (1) (2) (2) (2)
Yj = x j (i) + Z j , Yk = xk (i) + Zk ,
(1) (2)
where Z j and Zk are IID, N(0, σ02 ) random variables. Correspondingly, Pch
(1) (2)
refers to the joint distribution of the random variables Y j and Yk , 1 ≤ j ≤
α1 τ (l) , 1 ≤ k ≤ α2 τ (l) .
Observe that function q → C1 (q) is uniformly continuous in q on [0, p0 ]. Hence,
we can find an integer J0 large enough such that
  
 
C1 (q) −C1 q − ζ p0  < ε , for all q ∈ (0, ζ p0 ).
 J0  2
(l)
Then we partition the code X (l) into J0 classes (subcodes) X j , j = 1, . . . , J0 : a
  (l)
code-vector x(1) (i), x(2) (i) falls in class X j if
ζ p0 τ ζ p0 τ

(2)
( j − 1) < (xk )2 ≤ j . (4.3.45a)
J0 1≤k≤α τ (l) 
J0
2

Since a transmittable code-vector x has a component x(2) with x(2) 2 ≤ ζ x2 ,


each such x lies in one and only one class. (We make an agreement that zero
(l) (l)
code-vectors belong to X1 .) The class X j containing the most code-vectors
(l) (l)
is denoted by X∗ . Then, obviously, the cardinality  X∗ ≥ M (l) J0 , and the
(l)
transmission rate R∗ of code X∗ satisfies
1
R∗ ≥ R − ln J0 . (4.3.45b)
τ (l)
(l)
On the other hand, the maximum error-probability for subcode X∗ is not larger
than for the whole code X (l) (when using the same decoder d (l) ); consequently,
(l)  
the error-probability PeX∗ ,av d (l) ≤ Pe → 0.
(l)

Having a fixed number J0 of classes in the partition of X (l) , we can find at


least one j0 ∈ {1, . . . , J0 } such that, for infinitely many l, the most numerous class
(l) (l)
X∗ coincides with X j . Reducing our argument to those l, we may assume that
(l) (l)   (l)
X∗ = X j0 for all l. Then, for all x(1) , x(2) ∈ X∗ , with
⎛ ⎞
(i)
x1
⎜ .. ⎟
x(i) = ⎜ ⎟
⎝ . ⎠ , i = 1, 2,
(i)
xni
using (4.3.38a) and (4.3.45a)
O O2  ( j0 − 1)ζ
 O O2 j0 ζ
O (1) O (l) O (2) O
O O
x ≤ 1 − p0 τ , Ox O ≤ p0 τ (l) .
J0 J0
( )
(l)
That is, X∗ , d (l) is a coder/decoder sequence for the ‘standard’ parallel-
channel combination (cf. (4.3.34)), with
 
( j0 − 1)ζ j0 ζ
p1 = 1 − p0 and p2 = p0 .
J0 J0
(l)  
As the error-probability PeX∗ ,av d (l) → 0, rate R is reliable for this combination
of channels. Hence, this rate does not surpass the capacity:
    
( j0 − 1)ζ j0 ζ
R∗ ≤ C1 1− p0 +C2 p0 .
J0 J0

Here and below we refer to the definition of Ci (u) given in (4.3.44b), i.e.
ε
R∗ ≤ C1 ((1 − δ )p0 ) +C2 (δ p0 ) + (4.3.46)
2
where δ = j0 ζ /J0 .
Now note that, for α2 ≥ α1 , the function

δ → C1 ((1 − δ )p0 ) +C2 (δ p0 )

increases in δ when δ < α2 /(α1 + α2 ) and decreases when δ > α2 /(α1 + α2 ).


Consequently, as δ = j0 ζ /J0 ≤ ζ , we obtain that, with ζ = min [β , α2 /(α1 + α2 )],

C1 ((1 − δ )p0 ) +C2 (δ p0 ) ≤ C1 (p1 ) +C2 (p2 ) = C∗ . (4.3.47)

In turn, this implies, owing to (4.3.45b), (4.3.46) and (4.3.47), that


ε 1 ε
R ≤ C∗ + + (l) ln J0 , or R ≤ C∗ + when τ (l) → ∞.
2 τ 2
The contradiction to R = C∗ + ε yields the inverse.
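The monotonicity property used in the converse — the split δ ↦ C_1((1 − δ)p_0) + C_2(δ p_0) increases up to α_2/(α_1 + α_2) and decreases afterwards — and the choice of ζ in (4.3.41b) can be checked numerically. A minimal sketch follows; all parameter values are arbitrary.

import numpy as np

def C(alpha, q, sigma2):
    return 0.5 * alpha * np.log(1.0 + q / (alpha * sigma2))

alpha1, alpha2, p0, s0sq, beta = 2.0, 6.0, 5.0, 1.0, 0.9
delta = np.linspace(0.0, beta, 100_001)
vals = C(alpha1, (1 - delta) * p0, s0sq) + C(alpha2, delta * p0, s0sq)
zeta = min(beta, alpha2 / (alpha1 + alpha2))
print(delta[np.argmax(vals)], zeta)   # maximiser agrees with zeta (up to the grid step)
print(vals.max(), C(alpha1, (1 - zeta) * p0, s0sq) + C(alpha2, zeta * p0, s0sq))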

Example 4.3.6 (Prolate spheroidal wave functions (PSWFs); see [146], [90], [91])  For any given τ, W > 0 there exists a sequence of real functions ψ_1(t), ψ_2(t), . . ., of a variable t ∈ R, belonging to the Hilbert space Ł2(R) (i.e. with ∫ ψ_n(t)^2 dt < ∞), called prolate spheroidal wave functions (PSWFs), such that

(a) The Fourier transforms ψ̂_n(ω) = ∫ ψ_n(t) e^{itω} dt vanish for |ω| > 2πW; moreover, the functions ψ_n(t) form an orthonormal basis in the Hilbert subspace formed by functions from Ł2(R) with this property.
(b) The functions ψ_n°(t) := ψ_n(t) 1(|t| < τ/2) (the restrictions of ψ_n(t) to (−τ/2, τ/2)) are pairwise orthogonal:

∫ ψ_n°(t) ψ_{n′}°(t) dt = ∫_{−τ/2}^{τ/2} ψ_n(t) ψ_{n′}(t) dt = 0  when n ≠ n′.   (4.3.48a)

Furthermore, the functions ψ_n° form a complete system in Ł2(−τ/2, τ/2): if a function φ ∈ Ł2(−τ/2, τ/2) has ∫_{−τ/2}^{τ/2} φ(t) ψ_n(t) dt = 0 for all n ≥ 1 then φ(t) = 0 in Ł2(−τ/2, τ/2).
(c) The functions ψ_n(t) satisfy, for all n ≥ 1 and t ∈ R, the equations

λ_n ψ_n(t) = 2W ∫_{−τ/2}^{τ/2} ψ_n(s) sinc( 2W(t − s) ) ds.   (4.3.48b)

That is, functions ψn (t)0 are the eigenfunctions, with the eigenvalues λn , of the
integral operator ϕ → ϕ (s)K( · , s) ds with the integral kernel
 
K(t, s) = 1(|s| < τ /2)(2W ) sinc 2W (t − s)
sin(2π W (t − s))
= 1(|s| < τ /2) , −τ /2 ≤ s;t ≤ τ /2.
π (t − s)
(d) The eigenvalues λn satisfy the condition
0 τ /2
λn = ψn (t)2 dt with 1 > λ1 > λ2 > · · · > 0.
−τ /2

An equivalent formulation
0 can be given in terms involving the Fourier trans-
forms [Fψn◦ ] (ω ) = ψn◦ (t)eit ω dt:
0 2π W 0 τ /2
1
| [Fψn◦ ] (ω ) |2 dω |ψn (t)|2 dt = λn ,
2π −2π W −τ /2

which means that λn gives a ‘frequency concentration’ for the truncated func-
tion ψn◦ .
(e) It can be checked that functions ψn (t) (and hence numbers λn ) depend on W
and τ through the product W τ only. Moreover, for all θ ∈ (0, 1), as W τ → ∞,
λ2W τ (1−θ ) → 1, and λ2W τ (1+θ ) → 0. (4.3.48c)
That is, for τ large, nearly 2W τ of values λn are close to 1 and the rest are close
to 0.
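Property (e) can be observed by discretising the integral operator in (4.3.48b) on a grid over [−τ/2, τ/2] and inspecting the eigenvalues of the resulting matrix. The sketch below uses arbitrary values of W and τ and a crude uniform grid, so it is only a qualitative illustration.

import numpy as np

W, tau, m = 1.0, 10.0, 2000            # arbitrary; here 2*W*tau = 20
t = np.linspace(-tau / 2, tau / 2, m)
dt = t[1] - t[0]
# discretised kernel K(t,s) = 2W sinc(2W(t-s)) on the square [-tau/2, tau/2]^2
K = 2 * W * np.sinc(2 * W * (t[:, None] - t[None, :]))
lam = np.sort(np.linalg.eigvalsh(K * dt))[::-1]
print(np.round(lam[:25], 3))
# roughly the first 2*W*tau = 20 eigenvalues are close to 1,
# after which they drop sharply towards 0, in line with (4.3.48c)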

An important part of the argument currently being developed is the Karhunen–Loève decomposition. Suppose Z(t) is a Gaussian random process with spectral density Φ(ω) given by (4.3.27). The Karhunen–Loève decomposition states that for all t ∈ (−τ/2, τ/2), the random variable Z(t) can be written as a convergent (in the mean-square sense) series

Z(t) = ∑_{n≥1} A_n ψ_n(t),   (4.3.49)

where ψ_1(t), ψ_2(t), . . . are the PSWFs discussed in Worked Example 4.3.9 below and A_1, A_2, . . . are independent random variables with A_n ∼ N(0, λ_n), where λ_n are the corresponding eigenvalues. Equivalently, one writes Z(t) = ∑_{n≥1} √λ_n ξ_n ψ_n(t) where the ξ_n ∼ N(0, 1) are IID random variables.
The proof of this fact goes beyond the scope of this book, and the interested
reader is referred to [38] or [103], p. 144.

The idea of the proof of Theorem 4.3.3 is as follows. Given W and τ , an input
signal s◦ (t) from A (τ ,W, p0 , η ) is written as a Fourier series in the PSWFs ψn .
In this series, the first 2W τ summands represent the part of the signal confined
between the frequency band-limits ±2π W and the time-limits ±τ /2. Similarly, the
noise realisation Z(t) is decomposed in a series in functions ψn . The action of the
continuous-time channel is then represented in terms of a parallel combination of
two jointly constrained discrete-time Gaussian channels. Channel 1 deals with the
first 2W τ PSWFs in the signal decomposition and has α1 = 2W . Channel 2 receives
the rest of the expansion and has α2 = +∞. The power constraint s2 ≤ p0 τ leads
to a joint constraint, as in (4.3.38a). In addition, a requirement emerges that the
energy allocated outside the frequency band-limits ±2π W or time-limits ±τ /2 is
small: this results in another power constraint, as in (4.3.38b). Applying Worked
Example 4.3.5 for Case I results in the assertion of Theorem 4.3.3.
To make these ideas precise, we first derive Theorem 4.3.7 which gives an al-
ternative approach to the Nyquist–Shannon formula (more complex in formulation
but somewhat simpler in the (still quite lengthy) proof).

Theorem 4.3.7  Consider the following modification of the model from Theorem 4.3.3. The set of allowable signals A_2(τ, W, p_0, η) consists of functions t ∈ R → s(t) such that

(1) ||s||^2 = ∫ |s(t)|^2 dt ≤ p_0 τ;
(2) the Fourier transform [Fs](ω) = ∫ s(t) e^{itω} dt vanishes when |ω| > 2πW; and
(3) the ratio ∫_{−τ/2}^{τ/2} |s(t)|^2 dt / ||s||^2 > 1 − η. That is, the functions s ∈ A_2(τ, W, p_0, η) are 'sharply band-limited' in frequency and 'nearly localised' in time.

The noise process is Gaussian, with the spectral density vanishing when |ω| > 2πW and equal to σ_0^2 for |ω| ≤ 2πW.

Then the capacity of such a channel is given by

C = C_η = W ln( 1 + (1 − η) p_0/(2σ_0^2 W) ) + η p_0/(2σ_0^2).   (4.3.50)

As η → 0,

C_η → W ln( 1 + p_0/(2σ_0^2 W) ),

yielding the Nyquist–Shannon formula (4.3.8).



Proofof Theorem 4.3.7 First, we establish the direct half. Take


 
(1 − η )p0 η p0
R < W ln 1 + + 2 (4.3.51)
2σ0 W
2 2σ0
and take δ ∈ (0, 1) and ξ ∈ (0, min [η , 1 − η ]) such that R is still less than
 
∗ (1 − η + ξ )p0 (η − ξ )p0
C = W (1 − δ ) ln 1 + + . (4.3.52)
2σ0 W (1 − δ )
2 2σ02
According to Worked Example 4.3.5, C∗ is the capacity of a jointly constrained
discrete-time pair of parallel channels as in Case I, with

α1 = 2W (1 − δ ), α2 = +∞, β = η − ξ , p = p0 , σ 2 = σ02 ; (4.3.53)

cf. (4.3.41a). We want to construct codes and decoding rules for the time-
continuous version of the channel,
 (1) (2)yielding
 asymptotically vanishing probability
of error as τ → ∞. Assume x , x is an allowable input signal for the parallel
pair of discrete-time channels with parameters given in (4.3.53). The input for the
time-continuous channel is the following series of (W, τ ) PSWFs:

∑ ∑
(1) (2)
s(t) = xk ψk (t) + xk ψk+α1 τ  (t). (4.3.54)
1≤k≤α1 τ  1≤k<∞

The first fact to verify is that the signal in (4.3.54) belongs to A2 (τ ,W, p0 , η ), i.e.
satisfies conditions (1)–(3) of Theorem 4.3.7.
To check property (1), write
2 2 O O2 O O2
O O O O
∑ ∑ xk = Ox(1) O + Ox(2) O ≤ p0 τ .
(1) (2)
s2 = xk +
1≤k≤α1 τ  1≤k<∞

Next, the signal s(t) is band-limited, inheriting this property from the PSWFs
ψk (t). Thus, (2) holds true.
A more involved argument is needed to establish property (3). Because the
PSWFs ψk (t) are orthogonal in Ł2 [−τ /2, τ /2] (cf. (4.3.48a)), and using the mono-
tonicity of the values λn (cf. (4.3.48b)), we have that
0 τ /2 
(1 − Dτ )s||2
1− |s(t)| dt s2 =
2
−τ /2 ||s||2
2 2
(1) (2)
(1 − λk ) xk (1 − λk+α1 τ  ) xk
= ∑ OO (1) OO2 OO (2) OO2 + ∑ OO (1) OO2 OO (2) OO2
1≤k≤α1 τ  x + x 1≤k<∞ x + x
O (1) O2 O (2) O2
Ox O Ox O
≤ 1 − λα1 τ  O O2 O O2 + O O2 O O2 .
Ox(1) O + Ox(2) O Ox(1) O + Ox(2) O

Now, as τ → ∞, the value λα1 τ  → 1 (see (4.3.48c)). With the ratio


O (1) O2 O (1) O2 O (2) O2
Ox O Ox O + Ox O ≤ 1, we have that for τ large enough,

O (1) O2
Ox O
1 − λα1 τ  O O2 O O2 ≤ ξ .
Ox(1) O + Ox(2) O
O O2 O (1) O2 O (2) O2
Next, the ratio Ox(2) O Ox O + Ox O ≤ η − ξ (referring to (4.3.38b)). This
finally yields
0 τ /2 
(1 − Dτ )s||2
1− |s(t)| dt s2 =
2
≤ ξ + η − ξ = η,
−τ /2 ||s||2
i.e. property (3).
Further, the noise can be expanded in accordance with Karhunen–Loève:

∑ ∑
(1) (2)
Z(t) = Zk ψk (t) + Zk ψk+α1 τ  (t). (4.3.55)
1≤k≤α1 τ  1≤k<∞

( j)
Here again, ψk (t) are the PSWFs and IID random variables Zk ∼ N(0, λk ). Cor-
respondingly, the output signal is written as

∑ ∑
(1) (2)
Y (t) = Yk ψk (t) + Yk ψk+α1 τ  (t) (4.3.56)
1≤k≤α1 τ  1≤k<∞

where
( j) ( j) ( j)
Yk = xk + Zk , j = 1, 2, k ≥ 1. (4.3.57)
So, the continuous-time channel is equivalent to a jointly constrained parallel com-
bination. As we checked, the capacity equals C∗ specified in (4.3.52). Thus, for
R < C∗ we can construct codes of rate R and decoding rules such that the error-
probability tends to 0.
For the converse, assume that there exists a sequence τ (l) → ∞, a sequence of
(l)
transmissible domains A2 (τ (l) ,W, p0 , η (l) ) described in (1)–(3) and a sequence
of codes X (l) of size M = eRτ  where
(l)

 
(1 − η )p0 η p0
R > W ln 1 + + 2 .
2W σ0 2 σ0

As usual, we want to show that the error-probability PeX ,av (d (l) ) does not tend to
(l)

0.
As before, we take δ > 0 and ξ ∈ (0, 1 − η ) to ensure that R > C∗ where
 
∗ (1 − η − ξ ) p0 η p0
C = W (1 + δ ) ln 1 + + .
(1 − ξ ) 2W σ0 (1 + δ )
2 (1 − ξ )σ02

Then, as in the argument on the direct half, C∗ is the capacity of the type I jointly
constrained parallel combination of channels with
η
β= , σ 2 = σ02 , p = p0 , α1 = 2W (1 + δ ), α2 = +∞. (4.3.58)
1−ξ
(l)
Let s(t) ∈ X (l) ∩ A2 (τ (l) ,W, p0 , η (l) ) be a continuous-time code-function.
Since the PSWFs ψk (t) form an ortho-basis in Ł2 (R), we can decompose

∑ ∑
(1) (2)
s(t) = xk ψk (t) + xk ψk+α1 τ (l)  (t),t ∈ R. (4.3.59)
1≤k≤α1 τ (l)  1≤k<∞
 
We want to show that the discrete-time signal x = x(1) , x(2) represents an al-
lowable input to the type I jointly constrained parallel combination specified in
(4.3.38a–c). By orthogonality of PSWFs ψk (t) in Ł2 (R) we can write
x2 = ||s||2 ≤ p0 τ (l)
ensuring that condition (4.3.38a) is satisfied. Further, using orthogonality of PSW
functions ψk (t) in Ł2 (−τ /2, τ /2) and the fact that the eigenvalues λk decrease
monotonically, we obtain that
0 τ (l) /2

 (1 − Dτ (l) ) s2
1− |s(t)|2 dt s2 =
−τ (l) /2 ||s||2
2 2
(1) (2)
(1 − λk ) xk 1 − λk+α1 τ (l)  xk
= ∑ + ∑
1≤k≤α1 τ (l)  x2 1≤k<∞ x2
O O
Ox(2) O 2
≥ 1 − λα1 τ (l)  .
x2
By virtue of (4.3.48c), λα1 τ (l)  ≤ ξ for l large enough. Moreover, since 1 −
0 (l)

τ /2
|s(t)|2 dt s2 ≤ η , we can write
−τ (l) /2
O (2) O2
Ox O η

x 2 1−ξ
and deduce property (4.3.38b).
Next, as in the direct half, we again use the Karhunen–Loève decomposition
of noise Z(t) to deduce that for each code for the continuous-time channel there
corresponds a code for the jointly constrained parallel combination of discrete-time
channels, with the same rate and error-probability. Since R is > C∗ , the capacity
of the discrete-time channel, the error-probability PeX ,av (d (l) ) remains bounded
(l)

away from 0 as l → ∞. This yields the converse.



Proof of Theorem 4.3.3 (Sketch) The formal argument proceeds as in Theorem


4.3.7: we have to prove the direct and converse parts of the theorem. Recall that the
direct part states that the capacity is ≥ C, the value indicated in (4.3.31), while the
converse/inverse that it is ≤ C. For the direct part, the channel is decomposed into
the product of two parallel channels, as in Case III, with

α1 = 2W (1 − θ ), α2 = +∞, p = p0 , σ 2 = σ02 , β = η − ξ , (4.3.60)

where θ ∈ (0, 1) (cf. property (e) of PSWFs in Example 4.3.6) and ξ ∈ (0, η ) are
auxiliary values.
For the converse half we use the decomposition into two parallel channels, again
as in Case III, with
η
α1 = 2W (1 + θ ), α2 = +∞, p = p0 , σ 2 = σ02 , β = . (4.3.61)
1−ξ
Here, as before, value θ ∈ (0, 1) emerges from property (e) of PSWFs, whereas
value ξ ∈ (0, 1).

Summing up our previous observations we obtain the famous

Lemma 4.3.8 (The Nyquist–Shannon–Kotelnikov–Whittaker sampling lemma)  Let f be a function t ∈ R → f(t) ∈ R with ∫ |f(t)| dt < +∞. Suppose that the Fourier transform

[Ff](ω) = ∫ e^{itω} f(t) dt

vanishes for |ω| > 2πW. Then, for all x ∈ R, function f can be uniquely reconstructed from its values f(x + n/(2W)) calculated at points x + n/(2W), where n = 0, ±1, ±2, . . . . More precisely, for all t ∈ R,

f(t) = ∑_{n∈Z^1} f( n/(2W) ) sin[π(2Wt − n)] / (π(2Wt − n)).   (4.3.62)
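The sampling formula (4.3.62) is easy to test numerically. In the sketch below the test function is band-limited to |ω| ≤ 2πW by construction (it is the square of a sinc of half the bandwidth); all numerical choices are arbitrary, and the series is truncated, which accounts for the small residual error.

import numpy as np

W = 1.0
a = W / 2.0
f = lambda t: np.sinc(2 * a * t) ** 2      # spectrum supported in [-2 pi W, 2 pi W]

N = 2000
n = np.arange(-N, N + 1)
samples = f(n / (2 * W))                   # the sample values f(n/(2W))
for t0 in [0.3, 1.7, -2.45]:
    recon = np.sum(samples * np.sinc(2 * W * t0 - n))
    print(t0, f(t0), recon)                # reconstruction matches up to truncation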

Worked Example 4.3.9 By the famous uncertainty principle of quantum


physics, a function and its Fourier transform cannot be localised simultaneously
in finite intervals [−τ , τ ] and [−2πW, 2πW ]. What could be said about the case
when both function and its Fourier transform are nearly localised? How can we
quantify the uncertainty in this case?

Solution  Assume the function f ∈ Ł2(R) and let f̂ = Ff ∈ Ł2(R) be the Fourier transform of f. (Recall that the space Ł2(R) consists of functions f on R with ||f||^2 = ∫ |f(t)|^2 dt < +∞ and that for all f, g ∈ Ł2(R), the inner product ∫ f(t)g(t) dt is finite.) We shall see that if

∫_{t_0−τ/2}^{t_0+τ/2} |f(t)|^2 dt / ∫_{−∞}^{∞} |f(t)|^2 dt = α^2   (4.3.63)

and

∫_{−2πW}^{2πW} |Ff(ω)|^2 dω / ∫_{−∞}^{∞} |Ff(ω)|^2 dω = β^2   (4.3.64)

then Wτ ≥ η, where η = η(α, β) will be found explicitly. (The inequality will be sharp, and functions yielding equality will be specified.)
Ł2 (R) given by
D f (t) = f (t)1(|t| ≤ τ /2) (4.3.65)
and
0 2π W 0 ∞
1 1 sin 2π W (t − s)
B f (t) = F f (ω )e−iω t dω = f (s) ds. (4.3.66)
2π −2π W π −∞ t −s
We are interested in the product of these operators, A = BD:
0 τ /2
1 sin 2π W (t − s)
A f (t) = f (s) ds; (4.3.67)
π −τ /2 t −s
see Example 4.3.6. The eigenvalues λn of A obey 1 > λ0 > λ1 > · · · and tend to zero
as n → ∞; see [91]. We are interested in the eigenvalue λ0 : it can be shown that λ0
is a function of the product W τ . In fact, the eigenfunctions (ψ j ) of (4.3.67) yield an
orthonormal basis in Ł2 (R); at the same time these functions form an orthogonal
basis in Ł2 [−τ /2, τ /2]:
0 τ /2
ψ j (t)ψi (t)dt = λi δi j .
−τ /2

As usual, the angle between f and g in Hilbert space Ł2 (R) is determined by


 0 
−1 1
θ ( f , g) = cos Re f (t)g(t)dt . (4.3.68)
|| f || ||g||
The angle between two subspaces is the minimal angle between vectors in these
subspaces. We will show that there exists a positive angle θ (B, D) between the
subspaces B and D, the image spaces of operators B and D. That is, B is the
linear subspace of all band-limited functions while D is that of all time-limited
functions. Moreover,
$
θ (B, D) = cos−1 λ0 (4.3.69)

and inf f ∈B,g∈D θ ( f , g) is achieved when f = ψ0 , g = Dψ0 where ψ0 is the (unique)


eigenfunction with the eigenvalue λ0 .
To this end, we verify that for any f ∈ B
||D f ||
min θ ( f , g) = cos−1 . (4.3.70)
g∈D || f ||
1
Indeed,
expand f = f − D f + D f and observe that the integral f (t) −
D f (t) g(t)dt = 0 (since the supports of g and f − D f are disjoint). This implies
that  0  0  0 
     
Re f (t)g(t)dt  ≤  f (t)g(t)dt  =  D f (t)g(t)dt  .
     

Hence,
0
1 ||D f ||
Re f (t)g(t)dt ≤
|| f ||||g|| || f ||
which implies (4.3.70), by picking g = D f .

Next, we expand f = ∑ an ψn , relative to the eigenfunctions of A. This yields
n=0
the formula
 1/2
||D f || ∑n |an |2 λn
cos−1 = cos−1 . (4.3.71)
|| f || ∑n |an |2
The supremum of the RHS in f is achieved when an = 0 for n ≥ 1, and f = ψ0 .
We conclude that there exists the minimal angle between subspaces B and D, and
this angle is achieved on the pair f = ψ0 , g = Dψ0 , as required.
Next, we establish
Lemma 4.3.10 There exists a function f ∈ Ł2 such that || f || = 1, ||D f || = α and
||B f || = β if and only if α and β fall in one of the following cases (a)–(d):
(a) α = 0 and√0 ≤ β < 1;
(b) 0 < α < λ0 < 1 and 0 ≤ β ≤ 1;
√ √
−1 α + cos−1 β ≥ cos−1 λ ;
(c) λ0 ≤ α < 1 and cos
√ 0
(d) α = 1 and 0 < β ≤ λ0 .
Proof Given α ∈ [0, 1], let G (α ) be the family of functions f ∈ L2 with norms
|| f || = 1 and ||D f || = α . Next, determine β ∗ (α ) := sup f ∈G (α ) ||B f ||.
(a) If α = 0, the family G (0) can contain no function with β = B f  = 1. Further-
more, if D f  = 0 and B f  = 1 for f ∈ B then f is analytic and f (t) = 0 for
|t| < τ /2, implying f ≡ 0. To show that G (0) contains functions with all values of
ψn − Dψn √
β ∈ [0, 1), we set fn = √ . Then the norm ||B fn || = 1 − λn . Since there
1 − λn

exist eigenvalues λn arbitrarily close to zero, ||B fn || becomes arbitrarily close to 1.


By considering the functions eipt f (t) we can obtain all values of β between points

1 − λn since

0 −p+π W
1/2
||Beipt
f || = |Fn (ω | dω
2
.
−p−π W

The norm ||Beipt f || is continuous in p and approaches 0 as p → ∞. This completes


the analysis of case (a).

(b) When 0 < α < λ0 < 1, we set

$ $
α 2 − λn ψ0 − λ0 − α 2 ψn
f= √ ,
λ0 − λn

for n large when the eigenvalue λn is close to 0. We have that f ∈ B, || f || = ||B f || =


1, while a simple computation shows that ||D f || = α . This includes the case β = 1
as, by choosing eipt f (t) appropriately, we can obtain any 0 < β < 1.

(c) and (d) If λ0 ≤ α < 1 we decompose f ∈ G (α ) as follows:

f = a1 D f + a2 B f + g (4.3.72)

with g orthogonal to both D f and B f . Taking the inner product of the sum in the
RHS of (4.3.72), subsequently, with f , D f , B f and g we obtain four equations:

0
1 = a1 α 2 + a2 β 2 + g(t) f (t)dt,
0
α 2 = a1 α 2 + a2 B f (t)Dg(t)dt,
0
β 2 = a1 D f (t)B f (t)dt + a2 β 2 ,
0
f (t)g(t)dt = g2 .

These equations imply

0 0
α 2 + β 2 − 1 + ||g||2 = a1 D f (t)B f (t)dt + a2 B f (t)D f (t)dt.
1
By eliminating g(t) f (t)dt, a1 and a2 we find, for αβ = 0,
1 − α 2 − ||g||2
β2 = 0 β2
(β 2 − B f (t)D f (t)dt)
⎡ ⎤
0
⎢ 1 − α 2 − ||g||2 ⎥
+ ⎣1 − 0 B f (t)D f (t)dt ⎦
α 2 (β 2 − B f (t)D f (t)dt)
0
× D f (t)B f (t)dt

which is equivalent to
0
β − 2Re
2
D f (t)B f (t)dt
 0 2 
1  
≤ −α + (1 − 2 2  D f (t)B f (t)dt 
2
(4.3.73)
α β
 0 2 
1  
− ||g|| 1 − 2 2  D f (t)B f (t)dt  .
2
(4.3.74)
α β
In terms of the angle θ , we can write
0 0 
 
αβ cos θ = Re D f (t)B f (t)dt ≤  D f (t)B f (t)dt  ≤ αβ .


Substituting into (4.3.74) and completing the square we obtain


(β − α cos θ )2 ≤ (1 − α 2 ) sin2 θ (4.3.75)
0
with equality if and only if g = 0 and the integral D f (t)B f (t)dt is real. Since

θ ≥ cos−1 λ0 , (4.3.75) implies that
$
cos−1 α + cos−1 β ≥ cos−1 λ0 . (4.3.76)
The locus of points (α , β ) satisfying (4.3.76) is up and to the right of the curve
where
$
cos−1 α + cos−1 β = cos−1 λ0 . (4.3.77)
See Figure 4.6.
Equation (4.3.77) holds for the function f = b1 ψ0 + b2 Dψ0 with
R R
1 − α2 α 1 − α2
b1 = and b2 = √ − .
1 − λ0 λ0 1 − λ0

All intermediate values of β are again attained by employing eipt f .



Figure 4.6 [the locus (4.3.77) in the (α², β²)-plane, shown for W = 0.5, 1, 2]

4.4 Spatial point processes and network information theory


For a discussion of capacity of distributed systems and construction of random
codebooks based on point processes we need some background. Here we study the
spatial Poisson process in Rd , and some more advanced models of point process
are introduced with a good code distance. This section could be read independently
of PSE II, although some knowledge of its material may be very useful.

Definition 4.4.1 (cf. PSE II, p. 211) Let μ be a measure on R with values μ (A)
for measurable subsets A ⊆ R. Assume that μ is (i) non-atomic and (ii) σ -finite,
i.e. (i) μ (A) = 0 for all countable sets A ⊂ R and (ii) there exists a partition R =
∪ j J j of R into pairwise disjoint intervals J1 , J2 , . . . such that μ (J j ) < ∞. We say
that a random counting measure M defines a Poisson random measure (PRM, for
short) with mean, or intensity, measure μ if for all collection of pairwise disjoint
intervals I1 , . . . , In on R, the values M(Ik ), k = 1, . . . , n, are independent, and each
M(Ik ) ∼ Po(μ (Ik )).

We will state several facts, without proof, about the existence and properties of
the Poisson random measure introduced in Definition 4.4.1.

Theorem 4.4.2 For any non-atomic and σ -finite measure μ on R+ there exists
a unique PRM satisfying Definition 4.4.1. If measure μ has the form μ (dt) =
λ dt where λ > 0 is a constant (called the intensity of μ), this PRM is a Poisson
process PP(λ ). If the measure μ has the form μ (dt) = λ (t)dt where λ (t) is a given
function, this PRM gives an inhomogeneous Poisson process PP(λ (t)).

Theorem 4.4.3 (The mapping theorem) Let μ be a non-atomic and σ -finite


measure on R such that for all t ≥ 0 and h > 0, the measure μ (t,t + h) of the
interval (t,t + h) is positive and finite (i.e. the value μ (t,t + h) ∈ (0, ∞)), with
lim μ (0, h) = 0 and μ (R+ ) = lim μ (0, u) = +∞. Consider the function
h→0 u→+∞

f : u ∈ R+ → μ (0, u),

and let f −1 be the inverse function of f . (It exists because f (u) = μ (0, u) is strictly
monotone in u.) Let M be the PRM(μ ). Define a random measure f ∗ M by

(f ∗ M)(I) = M( f^{−1}I ) = M( (f^{−1}(a), f^{−1}(b)) ),   (4.4.1)

for interval I = (a, b) ⊂ R+ , and continue it on R. Then f ∗ M ∼ PP(1), i.e. f ∗ M


yields a Poisson process of the unit rate.

We illustrate the above approach in a couple of examples.

Worked Example 4.4.4 Let the rate function of a Poisson process Π = PP(λ (x))
on the interval S = (−1, 1) be

λ (x) = (1 + x)−2 (1 − x)−3 .

Show that Π has, with probability 1, infinitely many points in S, and that they
can be labelled in ascending order as

· · · X−2 < X−1 < X0 < X1 < X2 < · · ·

with X0 < 0 < X1 .


Show that there is an increasing function f : S → R with f(0) = 0 such that the points f(X) (X ∈ Π) form a Poisson process of unit rate on R, and use the strong law of large numbers to show that, with probability 1,

lim_{n→+∞} (2n)^{1/2} (1 − X_n) = 1/2.   (4.4.2)
Find a corresponding result as n → −∞.

Solution  Since

∫_{−1}^{1} λ(x) dx = ∞,

there are with probability 1 infinitely many points of Π in (−1, 1). On the other hand,

∫_{−1+δ}^{1−δ} λ(x) dx < ∞

for every δ > 0, so that Π(−1 + δ, 1 − δ) is finite with probability 1. This is enough to label uniquely in ascending order the points of Π. Let

f(x) = ∫_0^x λ(y) dy.

As f : S → R is increasing, f maps Π into a Poisson process whose mean measure μ is given by

μ(a, b) = ∫_{f^{−1}(a)}^{f^{−1}(b)} λ(x) dx = b − a.

With this choice of f, the points (f(X_n)) form a Poisson process of unit rate on R. The strong law of large numbers shows that, with probability 1, as n → ∞,

n^{−1} f(X_n) → 1,  and  n^{−1} f(X_{−n}) → −1.

Now, observe that

λ(x) ∼ (1/4)(1 − x)^{−3}  and  f(x) ∼ (1/8)(1 − x)^{−2},  as x → 1.

Thus, as n → ∞, with probability 1,

(1/8) n^{−1} (1 − X_n)^{−2} → 1,

which is equivalent to (4.4.2). Similarly,

λ(x) ∼ (1/8)(1 + x)^{−2}  and  f(x) ∼ −(1/8)(1 + x)^{−1},  as x → −1,

implying that with probability 1, as n → ∞,

(1/8) n^{−1} (1 + X_{−n})^{−1} → 1.

Hence, with probability 1,

lim_{n→∞} n(1 + X_{−n}) = 1/8.
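The asymptotics (4.4.2) can be illustrated by simulation: generate unit-rate Poisson points Y_n on (0, ∞), pull them back through f^{−1} to obtain the points X_n of Π in (0, 1), and look at (2n)^{1/2}(1 − X_n). The sketch below tabulates f on a fine grid and inverts it by interpolation; all numerical choices (grid, sample size, seed) are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
lam = lambda x: (1 + x) ** -2 * (1 - x) ** -3

# tabulate f(x) = int_0^x lam(y) dy on (0, 1) and invert it by interpolation
x = np.linspace(0.0, 1.0 - 1e-7, 2_000_000)
f = np.concatenate(([0.0], np.cumsum(0.5 * (lam(x[1:]) + lam(x[:-1])) * np.diff(x))))

N = 200_000
Y = np.cumsum(rng.exponential(1.0, size=N))   # unit-rate Poisson points on (0, infinity)
X = np.interp(Y, f, x)                        # X_n = f^{-1}(Y_n)
n = np.arange(1, N + 1)
print(np.sqrt(2 * n[-5:]) * (1 - X[-5:]))     # values close to 1/2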

Worked Example 4.4.5  Show that, if Y_1 < Y_2 < Y_3 < · · · are points of a Poisson process on (0, ∞) with constant rate function λ, then

lim_{n→∞} Y_n/n = 1/λ

with probability 1. Let the rate function of a Poisson process Π = PP(λ(x)) on (0, 1) be

λ(x) = x^{−2}(1 − x)^{−1}.

Show that the points of Π can be labelled as

· · · < X_{−2} < X_{−1} < 1/2 < X_0 < X_1 < · · ·

and that

lim_{n→−∞} X_n = 0,   lim_{n→∞} X_n = 1.

Prove that

lim_{n→∞} n X_{−n} = 1

with probability 1. What is the limiting behaviour of X_n as n → +∞?

Solution The first part again follows from the strong law of large numbers. For the
second part we set
0 x
f (x) = λ (ξ )dξ ,
1/2

and use the fact that f maps Π into a PP of constant rate on ( f (0), f (1)): f (Π) =
PP(1). In our case, f (0) = −∞ and f (1) = ∞, and so f (Π) is a PP on R. Its points
may be labelled
· · · < Y−2 < Y−1 < 0 < Y0 < Y1 < · · ·
with
lim Yn = −∞, lim Yn = +∞.
n→−∞ n→+∞

Then Xn = f −1 (Yn ) has the required properties.


The strong law of large numbers applied to Y−n gives
f (Xn ) Yn
lim = lim = 1, a.s.
n→−∞ n n→−∞ n

Now, as x → 0,
0 1/2 0 1/2
−2 −1
f (x) = − ξ (1 − ξ ) dξ ∼ − ξ −2 dξ ∼ −x−1 ,
x x

implying that
−1
X−n
lim = 1, i.e. lim nX−n = 1, a.s.
n→∞ n n→∞

Similarly,
f (Xn )
lim = 1, a.s.,
n→+∞ n
and as x → 1,
0 x
f (x) ∼ (1 − ξ )−1 dξ ∼ − ln(1 − x).
1/2

This implies that


ln(1 − Xn )
lim − = 1, a.s.
n→∞ n

Next, we discuss the concept of a Poisson random measure (PRM) on a general


set E. Formally, we assume that E had been endowed with a σ -algebra E of subsets,
and a measure μ assigning to every A ∈ E a value μ (A), so that if A1 , A2 , . . . are
pairwise disjoint sets from E then

μ (∪n An ) = ∑ μ (An ).
n

The value μ (E) can be finite or infinite. Our aim is to define a random counting
measure M = (M(A), A ∈ E ), with the following properties:
(a) The random variable M(A) takes non-negative integer values (including, possi-
bly, +∞). Furthermore,
'
∼ Po(λ μ (A)), if μ (A) < ∞,
M(A) (4.4.3)
= +∞ with probability 1, if μ (A) = ∞.

(b) If A1 , A2 , . . . ∈ E are disjoint sets then

M (∪i Ai ) = ∑ M(Ai ). (4.4.4)


i

(c) The random variables M(A1 ), M(A2 ), . . . are independent if sets A1 , A2 , . . . ∈ E


are disjoint. That is, for all finite collections of disjoint sets A1 , . . . , An ∈ E and
non-negative integers k1 , . . . , kn

P (M(Ai ) = ki , 1 ≤ i ≤ n) = ∏ P (M(Ai ) = ki ) . (4.4.5)


1≤i≤n

First assume that μ(E) < ∞ (if not, split E into subsets of finite measure). Fix a random variable M(E) ∼ Po(λμ(E)). Consider a sequence X_1, X_2, . . . of IID random points in E, with X_i ∼ μ/μ(E), independently of M(E). This means that for all n ≥ 1 and sets A_1, . . . , A_n ∈ E (not necessarily disjoint)

P( M(E) = n, X_1 ∈ A_1, . . . , X_n ∈ A_n ) = e^{−λμ(E)} ( (λμ(E))^n / n! ) ∏_{i=1}^n ( μ(A_i)/μ(E) ),   (4.4.6)

and conditionally,

P( X_1 ∈ A_1, . . . , X_n ∈ A_n | M(E) = n ) = ∏_{i=1}^n ( μ(A_i)/μ(E) ).   (4.4.7)

Then set

M(A) = ∑_{i=1}^{M(E)} 1(X_i ∈ A),  A ∈ E.   (4.4.8)

Theorem 4.4.6 If μ (E) < ∞, equation (4.4.8) defines a random measure M on E


satisfying properties (a)–(c) above.

Worked Example 4.4.7 Let M be a Poisson random measure of intensity λ on


the plane R2 . Denote by C(r) the circle {x ∈ R2 : |x| < r} of radius r in R2 centred
at the origin and let Rk be the largest radius such that C(Rk ) contains precisely k
points of M . [Thus C(R0 ) is the largest circle about the origin containing no points
of M , C(R1 ) is the largest circle about the origin containing a single point of M ,
and so on.] Calculate ER0 , ER1 and ER2 .

Solution  Clearly,

P(R_0 > r) = P( C(r) contains no point of M ) = e^{−λπr^2},  r > 0,

and

P(R_1 > r) = P( C(r) contains at most one point of M ) = (1 + λπr^2) e^{−λπr^2},  r > 0.

Similarly,

P(R_2 > r) = ( 1 + λπr^2 + (λπr^2)^2/2 ) e^{−λπr^2},  r > 0.

Then

E R_0 = ∫_0^∞ P(R_0 > r) dr = (1/√(2πλ)) ∫_0^∞ e^{−πλr^2} d(√(2πλ) r) = 1/(2√λ),

E R_1 = ∫_0^∞ P(R_1 > r) dr = 1/(2√λ) + ∫_0^∞ e^{−πλr^2} λπr^2 dr
      = 1/(2√λ) + (1/(2√(2πλ))) ∫_0^∞ 2πλr^2 e^{−πλr^2} d(√(2πλ) r) = 3/(4√λ),

E R_2 = 3/(4√λ) + ∫_0^∞ ( (λπr^2)^2/2 ) e^{−πλr^2} dr
      = 3/(4√λ) + (1/(8√(2πλ))) ∫_0^∞ (2λπr^2)^2 e^{−πλr^2} d(√(2λπ) r)
      = 3/(4√λ) + 3/(16√λ) = 15/(16√λ).
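A quick Monte Carlo check of these three answers (with an arbitrary value of λ): since the squared distances of the points of M from the origin, multiplied by πλ, form a unit-rate Poisson process on (0, ∞), the radius R_k satisfies πλ R_k^2 ∼ Gamma(k + 1, 1).

import numpy as np

rng = np.random.default_rng(1)
lam, trials = 2.0, 1_000_000
for k in range(3):
    R = np.sqrt(rng.gamma(shape=k + 1, scale=1.0, size=trials) / (np.pi * lam))
    print(k, R.mean())
print([c / np.sqrt(lam) for c in (1 / 2, 3 / 4, 15 / 16)])   # theoretical values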

We shall use for the PRM M on the phase space E with intensity measure μ constructed in Theorem 4.4.6 the notation PRM(E, μ). Next, we extend the definition of the PRM to integral sums: for all functions g : E → R_+ define

M(g) = ∑_{i=1}^{M(E)} g(X_i) := ∫ g(y) dM(y);   (4.4.9)

summation is taken over all points X_i ∈ E, and M(E) is the total number of such points. Next, for a general g : E → R we set

M(g) = M(g_+) − M(−g_−),

with the standard agreement that +∞ − a = +∞ and a − ∞ = −∞ for all a ∈ (0, ∞). [When both M(g_+) and M(−g_−) equal ∞, the value M(g) is declared not defined.] Then

Theorem 4.4.8 (Campbell theorem)  For all θ ∈ R and for all functions g : E → R such that e^{θg(y)} − 1 is μ-integrable,

E e^{θM(g)} = exp[ λ ∫_E ( e^{θg(y)} − 1 ) dμ(y) ].   (4.4.10)

Proof  Write

E e^{θM(g)} = E[ E( e^{θM(g)} | M(E) ) ] = ∑_k P(M(E) = k) E[ exp( θ ∑_{i=1}^k g(X_i) ) | M(E) = k ].

Owing to conditional independence (4.4.7),

E[ exp( θ ∑_{i=1}^k g(X_i) ) | M(E) = k ] = ∏_{i=1}^k E e^{θg(X_i)} = ( E e^{θg(X_1)} )^k = ( (1/μ(E)) ∫_E e^{θg(x)} dμ(x) )^k,

and

E e^{θM(g)} = ∑_k e^{−λμ(E)} ( (λμ(E))^k / k! ) (1/μ(E))^k ( ∫_E e^{θg(x)} dμ(x) )^k
            = e^{−λμ(E)} exp[ λ ∫_E e^{θg(x)} dμ(x) ]
            = exp[ λ ∫_E ( e^{θg(x)} − 1 ) dμ(x) ].

Corollary 4.4.9 The expected value of M(g) is given by


0
EM(g) = λ g(y)dμ (y);
E

it exists if and only if the integral on the RHS is well defined.


Proof The proof follows by differentiation of the MGF at θ = 0.

Example 4.4.10 Suppose that the wireless transmitters are located at the points
of Poisson process Π on R2 of rate λ . Let ri be the distance from transmitter i to
the central receiver at 0, and the minimal distance to a transmitter is r0 . Suppose
that the power of the received signal is Y = ∑Xi ∈Π rPα for some α > 2. Then
i
⎡ ⎤
0∞
Eeθ Y = exp ⎣2λ π eθ g(r) − 1 rdr⎦ , (4.4.11)
r0

P
where g(r) = rα where P is the transmitter power.
444 Further Topics from Information Theory

A popular model in application is the so-called marked point process with the
space of marks D. This is simply a random measure on Rd × D or on its subset. We
will need the following product property proved below in the simplest set-up.

Theorem 4.4.11 (The product theorem) Suppose that a Poisson process with the
constant rate λ is given on R, and marks Yi are IID with distribution ν. Define a
random measure M on R+ × D by
∞  
M(A) = ∑I (Tn ,Yn ) ∈ A , A ⊆ R+ × D. (4.4.12)
n=1

This measure is a PRM on R+ × D, with the intensity measure λ m × ν where m is


a Lebesgue measure.

Proof First, consider a set A ⊆ [0,t) × D where t > 0. Then


Nt
M(A) = ∑ 1((Tn ,Yn ) ∈ A).
n=1

Consider the MGF Eeθ M(A) and use standard conditioning


 
Eeθ M(A) = E E eθ M(A) |Nt

= ∑ P(Nt = k)E eθ M(A) |Nt = k .
k=0

We know that Nt ∼ Po(λ t). Further, given that Nt = k, the jump points T1 , . . . , Tk
have the conditional joint PDF fT1 ,...,Tk ( · |Nt = k) given by (4.4.7). Then, by using
further conditioning, by T1 , . . . , Tk , in view of the independence of the Yn , we have

E eθ M(A) |Nt = k
 
= E E eθ M(A) |Nt = k; T1 , . . . , Tk
0t 0t
= ... dxk . . . dx1 fT1 ,...,Tk (x1 , . . . , xk |N = k)
0 0
   
k  
× E exp θ ∑ I (xi ,Yi ) ∈ A |Nt = k; T1 = x1 , . . . , Tk = xk
i=1
⎛ ⎞k
0t 0
1⎝
= eθ IA (x,y) dν (y)dx⎠ .
tk
0 D
4.4 Spatial point processes and network information theory 445

Then
⎛ ⎞k
∞ 0t 0
(λ t)k 1⎝
Eeθ M(A) = e−λ t ∑ eθ IA (x,y) dν (y)dx⎠
k=0 k! t k
0 D
⎡ ⎤
0t 0
= exp ⎣λ eθ IA (x,y) − 1 dν (y)dx⎦ .
0 D

The expression eθ IA (x,y) − 1 takes value eθ − 1 for (x, y) ∈ A and 0 for (x, y) ∈ A.
Hence,
⎡ ⎤
  0
Eeθ M(A) = exp ⎣ eθ − 1 λ dν (y)dx⎦ , θ ∈ R. (4.4.13)
A

Therefore, M(A) ∼ Po(λ m × ν (A)).


Moreover, if A1 , . . . , An are disjoint subsets of [0,t)×D then the random variables
M(A1 ), . . . , M(An ) are independent. To see this, note first that, by definition, M is
additive: M(A) = M(A1 ) + · · · + M(An ) where A = A1 ∪ · · · ∪ An . From (4.4.13)
⎡ ⎤
  n 0 n
Eeθ M(A) = exp ⎣ eθ − 1 λ ∑ dν (y)dx⎦ = ∏ Eeθ M(Ai ) , θ ∈ R,
i=1 i=1
Ai

which implies independence.


So, the restriction of M to E n = [0, n) × D is an (E n , λ dmn × ν ) PRM, where
mn = m|[0,n) . Then, by the extension property, M is an (R+ × D, λ m × ν ) PRM.

Worked Example 4.4.12 Use the product and Campbell’s theorems to solve
the following problem. Stars are scattered over three-dimensional space R3 in a
Poisson process Π with density ν (X) (X ∈ R3 ). Masses of the stars are IID random
variables; the mass mX of a star at X has PDF ρ (X, dm). The gravitational potential
at the origin is given by
GmX
F= ∑ ,
X∈Π |X|

where G is a constant. Find the MGF Eeθ F .


A galaxy occupies a sphere of radius R centred at the origin. The density of
stars is ν (x) = 1/|x| for points x inside the sphere; the mass of each star has
the exponential distribution with mean M . Calculate the expected potential due to
the galaxy at the origin. Let C be a positive constant. Find the distribution of the
distance from the origin to the nearest star whose contribution to the potential F is
at least C.
446 Further Topics from Information Theory

Solution Campbell’s theorem says that if M is a Poisson random measure on the


space E with intensity measure ν and a : E → R is a bounded measurable function
then
⎛ ⎞
0
Eeθ Σ = exp ⎝ eθ a(y) − 1 ν (dy)⎠ ,
E

where
0
Σ= a(y)M(dy) = ∑ a(X).
X∈Π
E

By the product theorem, pairs (X, mX ) (position, mass) form a PRM on R3 ×


R+ , with intensity measure μ (dx × dm) = ν (x)dxρ (x, dm). Then by Campbell’s
theorem:
⎛ ⎞
0 0∞
Eeθ F = exp ⎝ μ (dx × dm) eθ Gm/|x| − 1 ⎠ .
R3 0

dEeθ F
The expected potential at the origin is EF = | and equals
dθ θ =0
0 0∞ 0
Gm 1
ν (x)dx ρ (x, dm) = GM dx 1(|x| ≤ R).
|x| |x|2
R3 0 R3

In the spherical coordinates,

0 0R 0 0
1 1 2
dx 2 1(|x| ≤ R) = dr r dϑ cos ϑ dφ = 4π R
|x| r2
R3 0

which yields
EF = 4π GMR.

Finally, let D be the distance to the nearest star contributing to F at least C. Then,
by the product theorem,
 
P(D ≥ d) = P(no points in A) = exp − μ (A) .

Here
% K
Gm
A = (x, m) ∈ R3 × R+ : |x| ≤ d, ≥C ,
|x|
4.4 Spatial point processes and network information theory 447
0
and μ (A) = μ (dx × dm) is represented as
A

0d 0 0 0∞
1 −1
dr r2 dϑ cos ϑ dφ M dme−m/M
r
0 Cr/G
0d
= 4π drre−Cr/(GM)
0
 2  
GM −Cd/(GM) Cd −Cd/(GM)
= 4π 1−e − e .
C GM
This determines the distribution of D on [0, R].

In distributed systems of transmitters and receivers like wireless networks of


mobile phones the admissible communication rate between pairs of nodes in the
wireless network depends on their random positions and their transmission strate-
gies. Usually, the transmission is performed along the chain of transmitters from
the source to destination. So, the new interesting direction in information theory
has emerged; some experts even coined the term ‘network information theory’.
This field of research has many connections with probability theory, in particu-
lar, percolation and spatial point processes. We do not attempt here to give even
a glimpse of this rapidly developing field, but no presentation of information the-
ory nowadays can completely avoid network aspects. Here we touch slightly a few
topics and refer the interested reader to [48] and the literature cited therein.

Example 4.4.13 Suppose that the receiver is located at point y and the transmit-
ters are scattered on the plane R2 at the points of xi ∈ Π of Poisson process of rate
λ . Then the simplest model for the power of the received signal is

Y= ∑ P(|xi − y|) (4.4.14)


xi ∈Π

where P is the emitted signal power and the function  describes the fading of the
signal. In the case of so-called Rayleigh fading (|x|) = e−β |x| , and in the case of
the power fading (|x|) = |x|−α , α > 2. By the Campbell theorem

0 ∞  
φ (θ ) = E eθ Y = exp 2λ π r eθ P(r) − 1 dr . (4.4.15)
0

A more realistic model of the wireless network may be described as follows.


Suppose that receivers are located at points y j , j = 1, . . . , J, and transmitters are
scattered on the plane R2 at the points of xi ∈ Π of Poisson process of rate λ .
448 Further Topics from Information Theory

Assuming that the signal Sk from the point xk is amplified by the coefficient P
we write the signal
Yj = ∑ h jk Sk + Z j , j = 1, . . . , J. (4.4.16)
xk ∈Π

Here the simplest model of the transmission function is


√ e2π ir jk /ν
h jk = P α /2 , (4.4.17)
r jk
where ν is the transmission wavelength and r jk = |y j − xk |. The noise random vari-
ables Z j are assumed to be IID N(0, σ02 ). A similar formula could be written for
Rayleigh fading. We know that in the case of J = 1 and a single transmitter K = 1,
by the Nyquist–Shannon theorem of Section 4.3, the capacity of the continuous
time, additive white Gaussian noise channel Y (t) = X(t)(x, y) + Z(t) with attenu-
1 τ /2
ation factor (x, y), subject to the power constraint −τ /2 X 2 (t)dt < Pτ , bandwidth
W , and noise power spectral density σ02 , is
P2 (x, y)
C = W log 1 + . (4.4.18)
2W σ02
Next, consider the case of finite numbers K of transmitters and J of receivers
K
y j (t) = ∑ (xi , y j )xi (t) + z j (t), j = 1, . . . , J, (4.4.19)
i=1

with a power constraint Pk for transmitter k = 1, . . . , K. Using Worked Example


4.3.5 for the capacity of parallel channels it can be proved (cf. [48]) that the capac-
ity of the channel is
K  
Pk s2k
C = ∑ W log 1 + (4.4.20)
k=1 2W σ02

where sk is the kth largest singular value of the matrix L = (|y j − xk |) . Next,
we assume that the bandwidth W = 1. It is also interesting to describe the capacity
region of a distributed system with K transmitters and J receivers under constant,
on average, power of transmission K −1 ∑k Pk ≤ P. Again, the interested reader is
referred to [48] where the following capacity domain for allowable rates Rk j is
established:
K J K  
Pk s2k
∑ ∑ Rk j ≤ Pk ≥0,max ∑ log 1 + 2σ 2 .
∑k Pk ≤KP k=1
(4.4.21)
k=1 j=1 0

Theorem 4.4.14 Consider an arbitrary configuration S of 2n nodes placed inside



the box Bn of area n (i.e. size n); partition them into two sets S1 and S2 , so
4.4 Spatial point processes and network information theory 449
n n
that S1 ∩ S2 = 0/ , S1 ∪ S2 = S, S1 = S2 = n. The sum Cn = ∑ ∑ Rk j of reliable
k=1 j=1
transmission rates in model (4.4.19) from the transmitters xk ∈ S1 to the receivers
yi ∈ S2 is bounded from above:
n n n Pk s2k
Cn = ∑ ∑ Rk j ≤ max ∑
Pk ≥0,∑ Pk ≤nP k=1
log 1 +
2σ02
,
k=1 j=1

where sk is the kth largest singular value of the matrix L = ((xk , y j )), σ02 is the
noise power spectral density, and the bandwidth W = 1.

This result allows us to find the asymptotic of capacity as n → ∞. In the most


2
interesting case of Rayleigh fading R(n) = Cn /n ∼ O( (log
√ n) ); in the case of power
n
1/α (log n)2
α > 2 fading, R(n) ∼ O( n √
n
): see again [48].

Next we discuss the interference limited networks. Let Π be a Poisson process of


rate λ on the plane R2 . Let the function  : R2 × R2 → R+ describing an attenuation
factor of signal emitted from x at point y be symmetric: (x, y) = (y, x), x, y ∈ R2 .
The most popular examples are (x, y) = Pe−β |x−y| and (x, y) = |x−y|P
α , α > 2. A

general theory is developed under the following assumptions:


1
(i) (x, y) = (|x − y|), r∞0 r(r)dr < ∞ for some r0 > 0.
(ii) l(0) > kσ02 /P, (x) ≤ 1 for all x > 0 where k > 0 is an admissible level of
interference.
(iii)  is continuous and strictly decreasing where it is non-zero.
For each pair of points xi , x j ∈ Π define the signal/noise ratio

P2 (xi , x j )
SNR(xi → x j ) = (4.4.22)
σ02 + γ ∑k =i, j P2 (xk , x j )

where P, σ02 , k > 0 and 0 ≤ γ < 1k . We say that a transmitter located at xi can send a
message to receiver located at x j if SNR(xi → x j ) ≥ k. For any k > 0 and 0 < κ < 1,
let An (k, κ ) be an event that there exists a set Sn of at least κ n points of Π such that
for any two points s, d ∈ Sn , SNR(s, d) > k. It can be proved (see [48]) that for all
κ ∈ (0, 1) there exists k = k(κ ) such that
 
lim P An (k(κ ), κ ) = 1. (4.4.23)
n→∞

Then we say that the network is supercritical at interference level k(κ ); it means
that the number of other points the given transmitter (say, located at the origin 0)
could communicate to, by using re-transmission at intermediate points, is infinite
with a positive probability.
450 Further Topics from Information Theory

First, we note that any given transmitter may be directly connected to at most
1 + (γ k)−1 receivers. Indeed, suppose that nx nodes are connected to the node x.
Denote by x1 the node connected to x and such that
(|x1 − x|) ≤ (|xi − x|), i = 2, . . . , nx . (4.4.24)
Since x1 is connected to x we have
P(|x1 − x|)
∞ ≥k
σ02 + γ ∑ P(|xi − x|)
i=2

which implies
P(|x1 − x|) ≥ kσ02 + kγ ∑ P(|xi − x|)
i≥2

≥ kσ0 + kγ (nx − 1)P(|x1 − x|) + kγ


2
∑ P(|xi − x|)
i≥nx +1
≥ kγ (nx − 1)P(|x1 − x|). (4.4.25)
We conclude from (4.4.25) that nx ≤ 1 + (kγ )−1 . However, the network percolates
for some values of parameters in view of (4.4.23). This means that with positive
probability a given transmitter may be connected to an infinite number of oth-
ers with re-transmissions. In particular, the model percolates for γ = 0, above the
critical rate for percolation of Poisson flow λcr . It may be demonstrated that for
λ > λcr the critical value of γ ∗ (λ ) first increases with λ but then starts to decay
because the interference becomes too strong. The proof of the following result may
be found in [48].
Theorem 4.4.15 Let λcr be the critical node density for γ = 0. For any node
density λ > λcr , there exists γ ∗ (λ ) > 0 such that for γ ≤ γ ∗ (λ ), the interference
model percolates. For λ → ∞ we have that
γ ∗ (λ ) = O(λ −1 ). (4.4.26)
Another interesting connection with the theory of spatial point processes in RN
is in using realisations of point processes for producing random codebooks. An al-
ternative (and rather efficient) way to generate a random coding attaining the value
C(α ) in (4.1.17) is as follows. Take a Poisson process Π(N) in RN , of rate λN = eNRN
where RN → R as N → ∞. Here R < 12 log 2π 1eσ 2 where σ02 be the variance of additive
0
Gaussian noise in a channel. Enlist in the codebook XM,N√the random points X(i)
from process ξ (N) lying inside the Euclidean ball B(N) ( N α ) and surviving the
following ‘purge’. Fix r > 0 (the minimal distance of the random code) and for any
point X j of a Poisson process Π(N) generate an IID random variable T j ∼ U([0, 1])
(a random mark). Next, for every point X j of the original Poisson process examine
4.4 Spatial point processes and network information theory 451

the ball B(N) (X j , r) of radius r centred at X j . The point X j will survive only if its
mark T j is strictly smaller than the marks of all other points from Π(N) lying in
B(N) (X j , r). The resulting point process ξ (N) is known as the Matérn process; it is
an example of a more general construction discussed in the recent paper [1].
The main parameter of a random codebook with codewords x(N) of length N is
the induced distribution of the distance between codewords. In the case of code-
books generated by stationary point processes it is convenient to introduce a func-
tion K(t) such that λ 2 K(t) gives the expected number of ordered pairs of distinct
points in a unit volume less than distance t apart. In other words, λ K(t) is the ex-
pected number of further points within t of an arbitrary point of a process. Say,
for Poisson process on R2 of rate λ , K(t) = π t 2 . In random codebooks we are in-
terested in models where K(t) is much smaller for small and moderate t. Hence,
random codewords appear on a small distance from one another much more rarely
than in a Poisson process. It is convenient to introduce the so-called product density
λ 2 dK(t)
ρ (t) = , (4.4.27)
c(t) dt
where c(t) depends on the state space of the point process. Say, c(t) = 2π t on R1 ,
c(t) = 2π t 2 on R2 , c(t) = 2π sint on the unit sphere, etc.
Some convenient models of this type have been introduced by B. Matérn. Here
we discuss two rather intuitive models of point processes on RN . The first is ob-
tained by sampling a Poisson process of rate λ and deleting any point which is
within 2R of any other whether or not this point has already been deleted. The rate
of this process for N = 2 is
λM,1 = λ e−4πλ R .
2
(4.4.28)

The product density k(t) = 0 for t < 2R, and

ρ (t) = λ 2 e−2U(t) ,t > 2R,


where
U(t) = meas[B((0, 0), 2R) ∪ B((t, 0), 2R)]. (4.4.29)

Here B((0, 0), 2R) is the ball with centre (0, 0) of radius 2R, and B((t, 0), 2R) is
the ball with centre (t, 0) of radius 2R. For varying λ this model has the maximum
rate of (4π eR2 )−1 and
√ so cannot model densely packed codes. This is 10% of the
theoretical bound ( 12R2 )−1 which is attained by the triangular lattice packing,
cf. [1].
The second Matérn model is an example of the so-called marked point process.
The points of a Poisson process of rate λ are independently marked by IID random
variables with distribution U([0, 1]). A point is deleted if there is another point of
452 Further Topics from Information Theory

the process within distance 2R which has a bigger mark whether or not this point
has already been deleted. The rate of this process for N = 2 is

λM,2 = (1 − e−λ c )/c, c = U(0) = 4π R2 . (4.4.30)

The product density ρ (t) = 0 for t < 2R, and


 2   
2U(t) 1 − e−4π R λ − 2c 1 − e−λ U(t)
ρ (t) = ,t > 2R. (4.4.31)
cU(t)(U(t) − c)

An equivalent definition is as follows. Given two points X and Y of the primary


Poisson process on the distance t = |X − Y | define the probability k(t) = ρ (t)/λ 2
that both of them are retained in the secondary process. Then k(t) = 0 for t < 2R,
and

2U(t)(1 − e−4π R λ ) − 8π R2 (1 − e−λ U(t))


2

k(t) = , t > 2R.


4λ 2 π R2U(t)(U(t) − 4π R2 )

Example 4.4.16 (Outage probability in a wireless network) Suppose a receiver is


located at the origin and transmitters are distributed according to the Matérn hard-
core process with the inner radius r0 . We suppose that no transmitters are closer to
each other than r0 and the coverage distance is a. The sum of received powers at
the central receiver from the signals from all wireless network is written as

P
Xr0 = ∑ rα (4.4.32)
Jr0 ,a i

where Jr0 ,a denotes the set of interfering transmitters such that r0 ≤ ri < a. Let λP
be the rate of Poisson process producing a Matérn process after thinning. The rate
of thinned process is
 
1 − exp − λP π r02
λ= .
π r02

Using the Campbell theorem we compute the MGF of Xr0 :



φ (θ ) = E eθ Xr0
 0 1

0 a 2r θ g(r)

= exp λP π (a 2
− r02 ) q(t)dt e dr − 1 . (4.4.33)
0 r0 (a2 − r02 )
4.5 Selected examples and problems from cryptography 453
P  
Here g(r) = α
and q(t) = exp − λP π r02t is the retaining probability of a point
r 0
1 λ
of mark t. Since q(t)dt = , we obtain
0 λP
 0 a 
2r θ g(r)
φ (θ ) = exp λ π (a − r0 )
2 2
e dr − 1 . (4.4.34)
r0 (a2 − r0 )
2

Now we can compute all absolute moments of the interfering signal:


0 a
2λ π Pk Pk
μk = λ π 2r(g(r))k dr = − . (4.4.35)
r0 kα − 2 r0kα −2 akα −2

Engineers say that outage happens at the central receiver, i.e. the interference pre-
vents one from reading a signal obtained from a sender at distance rs , if
P/rsα
≤ k.
σ02 + ∑Jr0 ,a P/riα

Here, σ02 is the noise power, rs is the distance to sender and k is the minimal SIR
(signal/noise ratio) required for successful reception. Different approximations of
outage probability based on the moments computed in (4.4.35) are developed. Typ-
ically, the distribution of Xr0 is close to log-normal; see, e.g., [113].

4.5 Selected examples and problems from cryptography


Cryptography, commonly defined as ‘the practice and study of hiding information’,
became a part of many courses and classes on coding; in our exposition we mainly
follow the traditions of the Cambridge course Coding and Cryptography. We keep
the theoretical explanations to the bare minimum and refer the reader to specialised
books for details. Cryptography has a long and at times fascinating history where
mathematics is interleaved with other sciences and even non-sciences. It has in-
spired countless fiction and half-fiction books, films, and broadcast programmes;
its popularity does not seem to be waning.
A popular method of producing encrypted digit sequences is through the so-
called feedback shift registers. We will restrict ourselves to the binary case, work-
ing with string spaces Hn,2 = {0, 1}n = F×n2 .

Definition 4.5.1 A (general) binary feedback shift register of length d is a map


{0, 1}d → {0, 1}d of the form
 
(x0 , . . . , xd−1 ) → x1 , . . . , xd−1 , f (x0 , . . . , xd−1 )
454 Further Topics from Information Theory

for some function f : {0, 1}d → {0, 1} (a feedback function). The initial string
(x0 , . . . , xd−1 ) is called an initial fill; it produces an output stream (xn )n≥0 satisfying
the recurrence equation

xn+d = f (xn , . . . , xn+d−1 ), for all n ≥ 0. (4.5.1)

A feedback shift register is said to be linear (an LFSR, for short) if function f is
linear and c0 = 1:
d−1
f (x0 , . . . , xd−1 ) = ∑ ci xi , where ci = 0, 1, c0 = 1; (4.5.2)
i=0

in this case the recurrence equation is linear:


d−1
xn+d = ∑ ci xn+i for all n ≥ 0. (4.5.3)
i=0

It is convenient to write (4.5.3) in the matrix form

xn+d n+d−1
n+1 = Vxn (4.5.4)

where
⎛ ⎞ ⎛ ⎞
0 1 0 ... 0 0 xn
⎜0 0 1 ... 0 0 ⎟ ⎜ xn+1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ .. ⎟
V = ⎜ ... ..
.
..
.
..
.
..
.
..
. ⎟ , xn+d−1
= ⎜ . ⎟. (4.5.5)
⎜ ⎟ n
⎜ ⎟
⎝0 0 0 ... 0 1 ⎠ ⎝xn+d−2 ⎠
c0 c1 c2 . . . cd−2 cd−1 xn+d−1
By the expansion of the determinant along the first column one can see that det V =
1 mod 2: the cofactor for the (n, 1) entry c0 is the matrix Id−1 . Hence,

det V = c0 det Id−1 = c0 = 1, and the matrix V is invertible. (4.5.6)

A useful concept is the auxiliary, or feedback, polynomial of an LFSR from


(4.5.3):
C(X) = c0 + c1 X + · · · + cd−1 X d−1 + X d . (4.5.7)

Observe that general feedback shift registers, after an initial run, become peri-
odic:

Theorem 4.5.2 The output stream (xn ) of a general feedback shift register of
length d has the property that there exists integer r, 0 ≤ r < 2d , and integer D,
1 ≤ D < 2d − r, such that xk+D = xk for all k ≥ r.
4.5 Selected examples and problems from cryptography 455

Proof A segment xM . . . xM+d−1 determines uniquely the rest of the output stream
in (4.5.1), i.e. (xn , n ≥ M + d − 1). We see that if such a segment is reproduced in
the stream, it will be repeated. There are 2d different possibilities for a string of d
subsequent digits. Hence, by the pigeonhole principle, there exists 0 ≤ r < R < 2d
such that the two segments of length d of the output stream, from positions r and
R onwards, will be the same: xr+ j = xR+ j , 0 ≤ r < R < d. Then, as was noted,
xr+ j = xR+ j for all j ≥ 0, and the assertion holds true with D = R − r.

In the linear case (LFSR), we can repeat the above argument, with the zero string
discarded. This allows us to reduce 2d to 2d − 1. However, an LFSR is periodic in
a ‘proper sense’:

Theorem 4.5.3 An LFSR (xn ) is periodic, i.e. there exists D ≤ 2d − 1 such that
xn+D = xn for all n. The smallest D with this property is called the period of the
LFSR.

Proof Indeed, the column vectors xn+d−1 n , n ≥ 0, are related by the equation
xn+1 = Vxn = Vn+1 x0 , n ≥ 0, where matrix V was defined in (4.5.5). We noted that
det V = c0 = 0 and hence V is invertible. As was said before, we may discard the
zero initial fill. For each vector xn ∈ {0, 1}d there are only 2d − 1 non-zero possibil-
ities. Therefore, as in the proof of Theorem 4.5.2, among the initial 2d − 1 vectors
xn , 0 ≤ n ≤ 2d − 2, either there will be repeats, or there will be a zero vector. The
second possibility can be again discarded, as it leads to the zero initial fill. Thus,
suppose that the first repeat was for j and D + j: x j = x j+D , i.e. V j+D x0 = V j x0 .
If j = 0, we multiply by V−1 and arrive at an earlier repeat. So: j = 0, D ≤ 2d − 1
and VD x0 = x0 . Then, obviously, xn+D = Vn+D x0 = Vn x0 = xn .

Worked Example 4.5.4 Give an example of a general feedback register with


output k j , and initial fill (k0 , k1 , . . . , kN ), such that

(kn , kn+1 , . . . , kn+N ) = (k0 , k1 , . . . , kN ) for all n ≥ 1.

Solution Take f : {0, 1}2 → {0, 1}2 with f (x1 , x2 ) = x2 1. The initial fill 00 yields
00111111111 . . .. Here, kn+1 = 0 = k1 for all n ≥ 1.

Worked Example 4.5.5 Let matrix V be defined by (4.5.5), for the linear re-
cursion (4.5.3). Define and compute the characteristic and minimal polynomials
for V.
456 Further Topics from Information Theory

Solution The characteristic polynomial of matrix V is hV (X) ∈ F2 [X] = X →


det(XI − V):
⎛ ⎞
X 1 0 ... 0 0
⎜0 X 1 ... 0 0 ⎟
⎜ ⎟
⎜ ⎟
hV (X) = det ⎜ ... ..
.
..
.
..
.
..
.
..
. ⎟ (4.5.8)
⎜ ⎟
⎝0 0 0 ... X 1 ⎠
c0 c1 c2 . . . cd−2 (cd−1 + X)

(recall, entries 1 and ci are considered in F2 ). Expanding along the bottom row,
the polynomial hV (t) is written as a linear combination of determinants of size
(d − 1) × (d − 1) (co-factors):
⎛ ⎞ ⎛ ⎞
1 0 ... 0 0 X 0 ... 0 0
⎜X 1 . . . 0 0⎟ ⎜X 1 . . . 0 0⎟
⎜ ⎟ ⎜ ⎟
c0 det ⎜ . . . . . ⎟ + c1 det ⎜ . . . .. .. ⎟
⎝. .
. . . . . .⎠
. . ⎝. .
. . . . . .⎠
0 0 ... X 1 0 0 ... X 1
⎛ ⎞
X 1 ... 0 0
⎜0 X . . . 0 0⎟
⎜ ⎟
+ · · · + cd−2 det ⎜ . .. . . .. .. ⎟
⎝ .. . . . .⎠
0 0 ... 0 1
⎛ ⎞
X 1 ... 0 0
⎜0 X ... 0 0⎟
⎜ ⎟
+(cd−1 + X) det ⎜ . . . .. ⎟
⎝ .. .. . . ... .⎠
0 0 ... 0 X

= c0 + c1 X + · · · + cd−2 X d−2 + (cd−1 + X)X d−1


= ∑ ci X i + X d ,
0≤i≤d−1

which gives the characteristic polynomial C(X) of the recursion.


By the Cayley–Hamilton theorem,

hV (V) = c0 I + c1 V + · · · + cd−1 Vd−1 + Vd = O.

The minimal polynomial, mV (X), of matrix V is the polynomial of minimal


degree such that mV (V) = O. It is a divisor of hV (X), and every root of hV (X) is a
root of mV (X). The difference between mV (X) and hV (X) is in multiplicities: the
multiplicity of a root μ of mV (X) equals the maximal size of the Jordan cell of V
corresponding to μ whereas for hV (X) it is the sum of the sizes of all Jordan cells
in V corresponding to μ .
4.5 Selected examples and problems from cryptography 457

To calculate mV (X), we:


(i) take a basis e1 , . . . , ed (in F×d 2 );
(ii) then for any vector e j we find the minimal number d j such that vectors e j ,
Ve j , . . . , Vd j e j , Vd j +1 e j are linearly dependent;
(iii) identify the corresponding linear combination
( j) ( j) ( j)
a0 e j + a1 Ve j + · · · + ad j Vd j e j + Vd j +1 e j = 0.
(iv) Further, we form the corresponding polynomial


( j) ( j)
mV (X) = ai X i + X d j +1 .
0≤i≤d j

(v) Then,
 
(1) (d)
mV (X) = lcm mV (X), . . . , mV (X) .

In our case, it is convenient to take


⎛ ⎞
0
⎜ .. ⎟ ..
⎜ . ⎟ .
⎜ ⎟
ej = ⎜ ⎟
⎜ 1 ⎟∼ j.
⎜ .. ⎟ ..
⎝ . ⎠ .
0
Then V j e1 = e j , and we obtain that d1 = d, and


(1)
mV (X) = ci X i + X d = hV (X).
0≤i≤d−1

We see that the feedback polynomial C(X) of the recursion coincides with the
characteristic and the minimal polynomial for V. Observe that at X = 0 we obtain
hV (0) = C(0) = c0 = 1 = det V. (4.5.9)

Any polynomial can be identified through its roots; we saw that such a descrip-
tion may be extremely useful. In the case of an LFSR, the following example is
instructive.
Theorem 4.5.6 Consider the binary linear recurrence in (4.5.3) and the corre-
sponding auxiliary polynomial C(X) from (4.5.7).
(a) Suppose K is a field containing F2 such that polynomial C(X) has a root α of
multiplicity m in K. Then, for all k = 0, 1, . . . , m − 1,
xn = A(n, k)α n , n = 0, 1, . . . , (4.5.10)
458 Further Topics from Information Theory

is a solution to (4.5.3) in K, where




⎨1, k = 0,
 
A(n, k) = (4.5.11)

⎩ ∏ (n − l)+ mod 2, k ≥ 1.
0≤l≤k−1

Here, and below, (a)+ stands for max[a, 0]. In other words, sequence x(k) =
(xn ), where xn is given by (4.5.10), is an output of the LFSR with auxiliary
polynomial C(X).
(b) Suppose K is a field containing F2 such that C(X) factorises in K into lin-
ear factors. Let α1 , . . . , αr ∈ K be distinct roots of C(X) of multiplicities
m1 , . . . , mr , with ∑ mi = d . Then the general solution of (4.5.3) in K is
1≤i≤r

xn = ∑ ∑ bi,k A(n, k)αin (4.5.12)


1≤i≤r 0≤k≤mi −1

for some bu,v ∈ K. In other words, sequences x(i,k) = (xn ), where xn = A(n, k)αin
and A(n, k) is given by (4.5.11), span the set of all output streams of the LFSR
with auxiliary polynomial C(X).
Proof (a) If C(X) has a root α ∈ K of multiplicity m then C(X) = (X − α )mC(X)
where C(X) is a polynomial of degree d −m (with coefficients from a field K ⊆ K).
Then, for all k = 0, . . . , m − 1, and for all n ≥ d, the polynomial
dk  
Dk,n (X) := X k k X n−d C(X)
dX
(with coefficients taken mod 2) vanishes at X = α (in field K):
Dk,n (α ) = ∑ ci A(n − d + i, k)α n−d+i + A(n, k)α n .
0≤i≤d−1

This yields
A(n, k)α n = ∑ ci A(n − d + i, k)α n−d+i .
0≤i≤d−1

Thus, stream x(k) = (xn ) with xn as in (4.5.10) solves the recursion xn =


∑ ci xn−d+i in K. The number of such solutions equals m, the multiplicity
0≤i≤d−1
of root α .
(b) First, observe that the set of output streams (xn )n≥0 forms a linear space W over
K (in the set of all sequences with entries from K). The dimension of W equals d,
as every stream is uniquely defined by a seed (initial fill) x0 x1 . . . xd−1 ∈ Kd . On the
 (i,k) 
other hand, d = ∑ mi , the total number of sequences x(i,k) = xn with entries
1≤i≤r
(i,k)
xn = A(n, k)αin , n = 0, 1, . . . .
4.5 Selected examples and problems from cryptography 459

Thus, it suffices to check that the streams x(i,k) , where i = 1, . . . , r,


k = 0, 1 . . . , mi − 1, are linearly independent over K.
To this end, take a linear combination ∑ ∑ bi,k x(i,k) and assume it gives
1≤i≤r 0≤k≤mi −1
0. Let us also agree that sequence x(i,k)
= 0 for k < 0. It is convenient to introduce
a shift operator x = (xn ) → Sx where sequence Sx = (xn ) has entries xn = xn+1 ,
n = 0, 1, . . .. The key observation is as follows. Let I stand for the identity transfor-
mation. Then for all β ∈ K,
(S − β I)x(i,k) = (αi − β )x(i,k) + kαi x(i,k−1) .
In fact, the nth entry of the sequence (S − β I)x(i,k) equals
A(n + 1, k)αin+1 − β A(k, n)αin
= [A(n, k) + kA(n, k − 1)] αin+1 − β A(n, k)αin
= (αi − β )A(n, k)αin + kαi A(n, k − 1)αin ,
in agreement with the above equation for sequences. We have used here the ele-
mentary equation
A(n + 1, k) = A(n, k) + kA(n, k − 1).
Then, iterating, we obtain
(S − β1 I)(S − β2 I)x(i,k) = (αi − β1 )(αi − β2 )x(i,k)
+ kαi (αi − β1 + αi − β2 )x(i,k−1) + k2 αi2 x(i,k−2)
= (S − β2 I)(S − β1 I)x(i,k) ,
and so on (all operations with coefficients are performed in field K). In particular,
with β = αi :
%
(kαi )l x(i,k−l) , 1 ≤ l ≤ k,
(S − αi I) x
l (i,k)
=
0, l > k.

Now consider the product of operators ∏ (S − αi I)mi (S − αr I)mr −1 applied to


1≤i<r
our vanishing linear combination ∑ ∑ bi,k x(i,k) . The only term that sur-
1≤i≤r 0≤k≤mi −1
vives comes from the summand br,mr −1 x(r,mr −1) . This gives
br,mr −1 ∏ (αi − αr )m [(mr − 1)αr ]m −1 x(i,0) = 0.
i r

1≤i<r

Hence, br,mr −1 = 0. Next, we apply ∏ (S − αi I)mi (S − αr I)mr −2 to obtain that


1≤i<r
br,mr −2 = 0. Continuing in a similar fashion, we can guarantee that each coefficient
bi,k = 0.
460 Further Topics from Information Theory

Upon seeing a stream of digits (xn )n≥0 , an observer may wish to determine
whether it was produced by an LFSR. This can be done by using the so-called
Berlekamp–Massey (BM) algorithm, solving a system of linear equations. If a se-
d−1
quence (xn ) comes from an LFSR with feedback polynomial C(X) = ∑ ci X i + X d
i=0
d−1
then the recurrence xn+d = ∑ ci xn+i for n = 0, . . . , d can be written in a vector-
i=0
matrix form Ad cd = 0 where
⎛ ⎞
⎛ ⎞ c0
x0 x1 x2 . . . xd ⎜ ⎟
⎜x1 c1
⎜ x2 x3 . . . xd+1 ⎟





..
Ad = ⎜ . .. .. .. .. ⎟ , cd = ⎜ ⎟.
. (4.5.13)
⎝ .. . . . . ⎠ ⎜ ⎟
⎝cd−1 ⎠
xd xd+1 xd+2 . . . x2d
1

Consequently, the (d + 1) × (d + 1) matrix Ad must have determinant 0, and the


(d + 1)-dimensional vector cd must lie in the null-space ker Ad .
The algorithm begins with an inspection of matrix Ar for a small value of r
(known to be ≤ d):
⎛ ⎞
x0 x1 x2 . . . xr
⎜x1 x2 x3 . . . xr+1 ⎟
⎜ ⎟
Ar = ⎜ . .. .. .. .. ⎟ .
⎝ .. . . . . ⎠
xr xr+1 xr+2 . . . x2r

We calculate det Ar : if det Ar = 0, we conclude that d = r and increase r by 1. If


det Ar = 0 then we solve the equation Ar ar = 0, i.e. try d = r:
⎛ ⎞
a0 ⎛ ⎞
⎜ ⎟ x0 x1 . . . xd
⎜ a1
⎟ ⎜x1 x2 . . . xd+1 ⎟
⎜ ⎟.. ⎜ ⎟
Ad ⎜ ⎟ = 0, where Ad = ⎜ ..
. .. .. .. ⎟
⎜ ⎟ ⎝. . . . ⎠
⎝ad−1 ⎠
xd xd+1 . . . x2d
1

(e.g. by Gaussian elimination) and test sequence (xn ) for the recursion xn+d =
∑ ai xn+i . If we discover a discrepancy, we choose a different vector cr ∈
0≤i≤d−1
ker Ar or – if it fails – increase r.
The BM algorithm can be stated in an elegant algebraic form. Given a sequence

(xn ), consider a formal power series in X: ∑ x j X j . The fact that (xn ) is produced
j=0
4.5 Selected examples and problems from cryptography 461

by the LFSR with a feedback polynomial C(X) is equivalent to the fact that the
d
above series is obtained by dividing a polynomial A(X) = ∑ ai X i by C(X):
i=0

A(X)
∑ x j X j = C(X) . (4.5.14)
j=0

Indeed, as c0 = 1, A(X) = C(X) ∑ x j X j is equivalent to
j=0
n
an = ∑ ci xn−i , n = 1, . . . , (4.5.15)
i=1
or ⎧


n−1
⎨an − ∑ ci xn−i , n = 0, 1, . . . , d,
xn = i=1 (4.5.16)


n−1
⎩− ∑ ci xn−i , n > d.
i=0

In other words, A(X) takes part in specifying the initial fill, and C(X) acts as the
feedback polynomial.
Worked Example 4.5.7 What is a linear feedback shift register? Explain the
Berlekamp–Massey method for recovering the feedback polynomial of a linear
feedback shift register from its output. Illustrate in the case when we observe out-
puts
1 0 1 0 1 1 0 0 1 0 0 0 ...,

0 1 0 1 1 1 1 0 0 0 1 0 ...
and
1 1 0 0 1 0 1 1.

Solution An initial fill x0 . . . xd−1 produces an output stream (xn )n≥0 satisfying the
recurrence equation
d−1
xn+d = ∑ ci xn+i for all n ≥ 0.
i=0

The feedback polynomial


C(X) = c0 + c1 X + · · · + cd−1 X d−1 + X d
is the characteristic polynomial for this recurrence equation determining its solu-
tions. We will assume that coefficient c0 = 0; otherwise value xn has no impact on
xn+d and the register can be treated as the one of length d − 1.
462 Further Topics from Information Theory

The Berlekamp–Massey algorithm begins with an inspection of matrix


 
1 0
A1 = , with det A1 = 0,
0 1

but
⎛ ⎞
1 0 1
A2 = ⎝0 1 0⎠ , with det A2 = 0,
1 0 1
⎛ ⎞
c0

and A2 c1 ⎠ = 0 has the solution c0 = 1, c1 = 0. This gives the recursion
1
xn+2 = xn ,

which does not fit the remaining digits. So, we move to A3 :


⎛ ⎞
1 0 1 0
⎜ 0 1 0 1⎟
A3 = ⎜ ⎟
⎝1 0 1 1⎠ , with det A3 = 0,
0 1 1 0

and then to A4 :
⎛ ⎞
1 0 1 0 1
⎜0 1 0 1 1⎟
⎜ ⎟
A4 = ⎜
⎜1 0 1 1 0⎟⎟ , with det A4 = 0.
⎝0 1 1 0 0⎠
1 1 0 0 1
⎛ ⎞
1
⎜ 0⎟
The equation A4 c4 = 0 is solved by c4 = ⎜ ⎟
⎝ 0⎠. This yields
1
xn+4 = xn + xn+3 ,

which fits the rest of the string. In the second example we have:
⎛ ⎞
⎛ ⎞ 0 1 0 1
  0 1 0
0 1 ⎜ 1 0 1 1⎟
det = 0, det ⎝1 0 1⎠ = 0, det ⎜ ⎝0 1 1
⎟ = 0
1 0 1⎠
0 1 1
1 1 1 1
4.5 Selected examples and problems from cryptography 463

and
⎛ ⎞⎛ ⎞
0 1 0 1 1 1
⎜1 0 1 1 1⎟ ⎜ 1⎟
⎜ ⎟⎜ ⎟
⎜0 1⎟ ⎜ ⎟
⎜ 1 1 1 ⎟ ⎜0⎟ = 0.
⎝1 1 1 1 0 ⎝ 0⎠

1 1 1 0 0 1
This yields the solution: d = 4, xn+4 = xn + xn+1 . The linear recurrence relation is
satisfied by every term of the output sequence given. The feedback polynomial is
then X 4 + X + 1.
In the third example the recursion is xn+3 = xn + xn+1 .
LFSRs are used for producing additive stream ciphers. Additive stream ciphers
were invented in 1917 by Gilbert Vernam, at the time an engineer with the AT&T
Bell Labs. Here, the sending party uses an output stream from an LFSR (kn ) to
encrypt a plain text (pn ) by (zn ) where

zn = pn + kn mod 2, n ≥ 0. (4.5.17)

The recipient would decrypt it by

pn = zn + kn mod 2, n ≥ 0, (4.5.18)

but of course he must know the initial fill k0 . . . kd−1 and the string c0 . . . cd−1 . The
main deficiency of the stream cipher is its periodicity. Indeed, if the generating
LFSR has period D then it is enough for an ‘attacker’ to have in his possession a
cipher text z0 z1 . . . z2D−1 and the corresponding plain text p0 p1 . . . p2D−1 , of length
2D. (Not an unachievable task for a modern-day Sherlock Holmes.) If by some luck
the attacker knows the value of the period D then he only needs z0 z1 . . . zD−1 and
p0 p1 . . . pD−1 . This will allow the attacker to break the cipher, i.e. to decrypt the
whole plain text, however long.
Clearly, short-period LFSRs are easier to break when they are used repeat-
edly. The history of World War II and the subsequent Cold War has a number of
spectacular examples (German code-breakers succeeding in part in reading British
Navy codes, British and American code-breakers succeeding in breaking German
codes, the American project ‘Venona’ deciphering Soviet codes) achieved because
of intensive message traffic. However, even ultra-long periods cannot guarantee
safety.
As far as this section of the book is concerned, the period of an LFSR can be
increased by combining several LFSRs.
Theorem 4.5.8 Suppose a stream (xn ) is produced by an LFSR of length d1 ,
period D1 and with an auxiliary polynomial C1 (X), and a stream (yn ) by an LFSR
464 Further Topics from Information Theory

of length d2 , period D2 and with an auxiliary polynomial C2 (X). Let α1 , . . . , αr1


and β1 , . . . , βr2 be the distinct roots of C1 (X) and C2 (X), respectively, lying in some
field K ⊃ F2 . Let mi be the multiplicity of root αi and m j be the multiplicity of root
β j , with d1 = ∑ mi and d2 = ∑ m i . Then
1≤i≤r1 1≤i≤r2

(a) Stream (xn + yn ) is produced by an LFSR with the auxiliary polynomial


lcm(C1 (X),C2 (X)).
(b) Stream (xn yn ) is produced by an LFSR with the auxiliary polynomial C(X) =

∏ ∏ (X − αi β j )mi +m j −1 .
1≤i≤r1 1≤ j≤r2

In particular, the period of the resulting LFSR is in both cases divisible by


lcm(D1 , D2 ).
Proof According to Theorem 4.5.6, the output streams (xn ) and (yn ) for the LF-
SRs in question have the following form in field K:
xn = ∑ ∑ ai,k A(n, k)αin , yn = ∑ ∑ b j,l A(n, l)β jn , (4.5.19)
1≤i≤r1 0≤k≤mi −1 1≤ j≤r2 0≤l≤m j −1

for some ai,k , b j,l ∈ K.


(a) Writing xn +yn as the sum of the expressions from (4.5.19) and grouping similar
terms leads to the statement (a).
(b) For the product xn yn we have the expression

∑ ∑ ai,k b j,l A(n, k)A(n, l)(αiβ j )n .


i, j k,l

The product ai,k b j,l A(n, k)A(n, l) can be written as a sum


∑ A(n,t)ut (ai,k , b j,l ) where coefficients ut (ai,k , b j,l ) ∈ K. This gives
k∧l≤t≤k+l−1
the following representation of xn yn :

∑ ∑ ∑ A(n,t) ∑ ut (ai,k , b j,l )(αi β j )n


1≤i≤r1 1≤ j≤r2 0≤t≤mi +m j −2 k,l:k∧l≤t≤k+l−1

which in turn can be written as


xn yn = ∑ ∑ ∑ A(n,t)vi, j;t (αi β j )n
1≤i≤r1 1≤ j≤r2 0≤t≤mi +m j −2

corresponding to the generic form of the output stream for the LFSR with the aux-
iliary polynomial C(X) in statement (b).
Despite serious drawbacks, LFSRs remain in use in a variety of situations: they
allow simple enciphering and deciphering without ‘lookahead’ and display a ‘lo-
cal’ effect of an error, be it encoding, transmission or decoding. More generally,
4.5 Selected examples and problems from cryptography 465

non-linear LFSRs often offer only marginal advantages while bringing serious dis-
advantages, in particular with deciphering.

. . . an error by the same example


Will rush into the state.
William Shakespeare (1564–1616), English playwright and poet
from Merchant of Venice

Worked Example 4.5.9 (a) Let (xn ), (yn ), (zn ) be three streams produced by
LFSRs. Set

kn = xn if yn = zn ,

kn = yn if yn = zn .

Show that kn is also a stream produced by a linear feedback register.


(b) A cipher stream is given by a linear feedback register of known length d . Show
that, given plain text and ciphered text of length 2d , we can find the cipher stream.

Solution (a) For three streams (xn ), (yn ), (zn ) produced by LFSRs we set

kn = xn + (xn + yn )(yn + zn ) (in F2 ).

So it suffices to note that (pointwise) sums and products of streams produced by


LFSRs also yield some streams produced by LFSRs.
(b) Suppose the plain text is y1 y2 . . . y2d , and the ciphered text is x1 + y1 x2 + y2 . . .
x2d + y2d . Then we can recover x1 . . . x2d . We know that c1 . . . cd must satisfy d
simultaneous linear equations

d
xd+ j = ∑ ci x j+i−1 , for j = 1, 2, . . . , d.
i=1

Solve these to find c1 , c2 , . . . , cd and hence the cipher stream.

Worked Example 4.5.10 A binary non-linear feedback register of length 4 has


defining relation

xn+1 = xn−1 + xn xn−2 + xn−3 .

Show that the state space contains cycles of lengths 1, 4, 9 and 2.


466 Further Topics from Information Theory

Solution There are 24 = 16 initial binary strings. By inspection,

0000 → 0000 (a cycle of length 1),


0001 → 0010 → 0100 → 1000 → 0001 (a cycle of length 4),
0011 → 0111 → 1111 → 1110 → 1101
→ 1011 → 0110 → 1100 → 1001 → 0011 (a cycle of length 9),
0101 → 1010 → 0101 (a cycle of length 2).
All 16 initial fills have appeared in the list, so the analysis is complete.

Worked Example 4.5.11 Describe how an additive stream cipher operates. What
is a one-time pad? Explain briefly why a one-time pad is safe if used only once but
becomes unsafe if used many times. A one-time pad is used to send the message
x1 x2 x3 x4 x5 x6 y7 which is encoded as 0101011. By mistake, it is reused to send the
message y0 x1 x2 x3 x4 x5 x6 which is encoded as 0100010. Show that x1 x2 x3 x4 x5 x6 is
one of two possible messages, and find the two possibilities.

Solution A one-time pad is an example of a cipher based on a random key and


proposed by Gilbert Vernam and Joseph Mauborgne (the Chief of the USA Signal
Corps during World War II). The cipher uses a random number generator producing
a sequence k1 k2 k3 . . . from the alphabet J of size q. More precisely, each letter
is uniformly distributed over J and different letters are independent. A message
m = a1 a2 . . . an is encrypted as c = c1 c2 . . . cn where
 
ci = ai + ki mod q .

To show that the one-time pad achieves perfect secrecy, write

P(M = m,C = c) = P(M = m, K = c − m)


1
= P(M = m)P(K = c − m) = P(M = m) ;
qn
here the subtraction c − m is digit-wise and mod q. Hence, the conditional proba-
bility
P(M = m,C = c) 1
P(C = c|M = m) = = n
P(M = m) q
does not depend on m. Hence, M and C are independent.
Working in F2 , consider a cipher key stream k1 k2 k3 . . .. The plain (input) text
stream p1 p2 p3 . . . is encrypted as the cipher text stream c1 c2 c3 . . ., where c j =
p j + k j . If the k j are IID random numbers and the cipher key stream is only used
once (which happens in practice) then we have a one-time pad. (It is assumed that
4.5 Selected examples and problems from cryptography 467

the cipher key stream is known only to the sender and the recipient.) In the example,
we have
x1 x2 x3 x4 x5 x6 y7 → 0101011,
y0 x1 x2 x3 x4 x5 x6 → 0100010.
Suppose x1 = 0. Then
k0 = 0, k1 = 1, x2 = 0, k2 = 0, x3 = 0, k3 = 0, x4 = 1, k1 = 0,
x5 = 1, k5 = 0, x6 = 1, k6 = 1.
Thus,
k = 0100101, x = 000111.
If x1 = 1, every digit changes, so
k = 1011010, x = 111000.
Alternatively, set x0 = y0 and x7 = y7 . If the first cipher is q1 q2 . . ., the second is
p1 p2 . . . and the one-time pad is k1 , k2 , . . ., then
q j = x j+1 + k j , p j = x j + k j .
So,
x j + x j+1 = q j + p j ,
and
x1 + x2 = 0, x2 + x3 = 0,
x3 + x4 = 1, x4 + x5 = 0, x5 + x6 = 0.
This yields
x1 = x2 = x3 , x4 = x5 = x6 , x4 = x3 + 1.
The message is 000111 or 111000.
Worked Example 4.5.12 (a) Let θ : Z+ → {0, 1} be given by θ (n) = 1 if n is
odd, θ (n) = 0 if n is even. Consider the following recurrence relation over F2 :
un+3 + un+2 + un+1 + un = 0. (4.5.20)
Is it true that the general solution of (4.5.20) is un = A + Bθ (n) + Cθ (n2 )? If it is
true, prove it. If not, explain why it is false and state and prove the correct result.
(b) Solve the recurrence relation un+2 + un = 1 over F2 , subject to u0 = 1, u1 = 0,
expressing the solution in terms of θ and n.
(c) Four streams wn , xn , yn , zn are produced by linear feedback registers. If we set
%
xn + yn + zn if zn + wn = 1,
kn =
xn + wn if zn + wn = 0,
show that kn is also a stream produced by a linear feedback register.
468 Further Topics from Information Theory

Solution (a) Observe that θ (n2 ) = θ (n), so the suggested


 sum contains only two
arbitrary constants. Now consider g(n) = θ n(n − 1)/2 . Then
g(n + 3) +
g(n + 2) + g(n + 1) +g(n) 
(n + 3)(n + 2) (n + 2)(n + 1)
=θ +θ
 2   2
(n + 1)n n(n − 1)
+θ +θ
 2 2
= θ (n + 2)2 + n2 = 0,
and g(0) = g(1) = 0, g(2) = 1. Then we substitute n = 0 and n = 1 into the relation
aθ (n) + b + cg(n) = 0, and observe that a = b = c = 0. So, θ (n), 1, g(n) are inde-
pendent. Thus, Aθ (n)+B+Cg(n) is a general solution of the third-order difference
equation.
(b) First try to solve the recurrence relation un+2 + un = 1 without additional con-
ditions
 
n(n − 1) (n + 2)(n + 1)
g(n) + g(n + 2) = θ +
2 2
 2 
n − n + n + 3n + 2
2

2
 2 
= θ n + n + 1 = 1.
Now substitute n = 0 and n = 1 into relation un = A + Bθ (n) + g(n) to get A = B =
1. Thus, un = 1 + θ (n) + g(n).
(c) The sequence kn is produced by the linear register
kn = xn + wn + (zn + wn )(yn + zn + wn ).

In the next part of this section, we discuss properties of a class of cryptosystems


used in modern practice and called public-key ciphers, focusing in particular on the
RSA and the bit commitment cryptosystems.
Definition 4.5.13 We say that a formal cryptosystem is given, if we can identify:
(a) a set P of plaintexts (source messages in the language of Chapter 1);
(b) a set C of ciphertexts (codewords in the language of Chapter 1);
c) a set K of keys that label the encoding maps;
(d) the set E of encryptic functions (encoding maps) where each function Ek takes
P ∈ P → Ek (P) ∈ C and is labelled by an element k ∈ K ;
(e) the set D of decryptic functions (decoding maps) where each function Dk takes
C ∈ C → Dk (C) ∈ P and is again labelled by an element k ∈ K ;
4.5 Selected examples and problems from cryptography 469

such that
(f) for all key e ∈ K there is a key d ∈ K , with the property that Dd (Ee (P)) = P
for all plaintext P ∈ P.

Example 4.5.14 Suppose that two parties, Bob and Alice, intend to have a two-
side private communication. They want to exchange their keys, EA and EB , by us-
ing an insecure binary channel. An obvious protocol is as follows. Alice encrypts
a plain-text m as EA (m) and sends it to Bob. He encrypts it as EB (EA (m)) and
returns it to Alice. Now we make a crucial assumption that EA and EB commute
for any plaintext m : EA ◦ EB (m ) = EB ◦ EA (m ). In this case Alice can decrypt
this message as DA (EA (EB (m))) = EB (m) and send this to Bob, who then calcu-
lates DB (EB (m)) = m. Under this protocol, at no time during the transaction is an
unencrypted message transmitted.
However, a further thought shows that this is no solution at all. Indeed, suppose
that Alice uses a one-time pad kA and Bob uses a one-time pad kB . Then any sin-
gle interception provides no information about plaintext m. However, if all three
transmissions are intercepted, it is enough to take the sum

(m + kA ) + (m + kA + kB ) + (m + kB ) = m

to obtain the plaintext m. So, more sophisticated protocols should be developed:


this is where public key cryptosystems are helpful.

Another popular example is a network of investors and brokers dealing in a


market and using an open access cryptosystem such as RSA. An investor’s concern
is that a broker will buy shares without her authorisation and, in the case of a loss,
claim that he had a written request from the client. Indeed, it is easy for a broker
to generate a coded order requesting to buy the stocks as the encoding key is in the
public domain. On the other hand, a broker may be concerned that if he buys the
shares by the investor’s request and the market goes down, the investor may claim
that she never ordered this transaction and that her coded request is a fake.
However, it is easy to develop a protocol which addresses these concerns. An
investor Alice sends to a broker Bob, together with her request p to buy shares, her
‘electronic signature’ fB fA−1 (p). After receiving this message Bob sends a receipt
r encoded as fA fB−1 (r). If a conflict emerges, both sides can provide a third party
(say, a court) with these coded messages and the keys. Since no-one but Alice could
generate the message coded by fB fA−1 and no-one but Bob could generate the mes-
sage coded by fA fB−1 , no doubts would remain. This is the gist of bit commitment.
The above-mentioned RSA (Rivest–Shamir–Adelman) scheme is a prime example
470 Further Topics from Information Theory

of a public key cryptosystem. Here, a recipient user (Bob, possibly a collective


entity) sets
N = pq, where p and q are two large primes, kept secret. (4.5.21)
Number N is often called the RSA modulus (and made public). The value of the
Euler totient function is
φ (N) = (p − 1)(q − 1), kept secret.
Next, the recipient user chooses (or is given by the key centre) an integer l such
that
1 < l < φ (N) and gcd (φ (N), l) = 1. (4.5.22)
Finally, an integer d is computed (again, by Bob or on his behalf) such that
1 < d < φ (N) and l d = 1 mod φ (N). (4.5.23)
[The value of d can be computed via the extended Euclid’s algorithm.] The public
key eB used for encryption is the pair (N, l) (listed in the public directory). The
sender (Alice), when communicating to Bob, understands that Bob’s plaintext and
ciphertext sets are P = C = {1, . . . , N − 1}. She then encrypts her chosen plaintext
m = 1, . . . , N − 1 as the ciphertext
EN,l (m) = c where c = ml mod N. (4.5.24)
Bob’s private key dB is the pair (N, d) (or simply number d): it is kept secret
from public but made known to Bob. The recipient decrypts ciphertext c as
Dd (c) = cd mod N. (4.5.25)
In the literature, l is often called the encryption and d the decryption exponent.
Theorem 4.5.15 below guarantees that
Dd (c) = mdl = m mod N, (4.5.26)
i.e. the ciphertext c is decrypted correctly. More precisely,
Theorem 4.5.15 For all integers m = 0, . . . , N − 1, the equation (4.5.26) holds
true, where l and d satisfy (4.5.22) and (4.5.23) and N is as in (4.5.21).
Proof By virtue of (4.5.23),
l d = 1 + b(p − 1)(q − 1)
where b is an integer. Then
 (q−1)b
(ml )d = mld = m1+b(p−1)(q−1) = m m(p−1) .
4.5 Selected examples and problems from cryptography 471

Recall the Euler–Fermat theorem: If gcd (m, p) = 1 then mφ (p) = 1 mod p.


We deduce that if m is not divisible by p then

(ml )d = m mod p. (4.5.27)

Otherwise, i.e. when p|m, (4.5.27) still holds as m and (ml )d are both equal to
0 mod p. By a similar argument,

(ml )d = m mod q. (4.5.28)

By the Chinese remainder theorem (CRT) – [28], [114] – (4.5.27) and (4.5.28)
imply (4.5.26).

Example 4.5.16 Suppose Bob has chosen p = 29, q = 31, with N = 899 and
φ (N) = 840. The smallest possible value of e with gcd(l, φ (N)) = 1 is l = 11, after
that 13 followed by 17, and so on. The (extended) Euclid algorithm yields d = 611
for l = 11, d = 517 for l = 13, and so on. In the first case, the encrypting key E899,11
is
m → m11 mod 899, that is, E899,11 (2) = 250.

The ciphertext 250 is decoded by

D611 (250) = 250611 mod 899 = 2,

with the help of the computer. [The computer is needed even after the simplification
rendered by the use of the CRT. For instance, the command in Mathematica is
PowerMod[250,611,899].]

Worked Example 4.5.17 (a) Referring to the RSA cryptosystem with public key
(N, l) and private key (φ (N), d), discuss possible advantages or disadvantages of
taking (i) l = 232 + 1 or (ii) d = 232 + 1.
(b) Let a (large) number N be given, and we know that N is a product of two distinct
prime numbers, N = pq, but we do not know the numbers p and q. Assume that
another positive integer, m, is given, which is a multiple of φ (N). Explain how to
find p and q.
(c) Describe how to solve the bit commitment problem by means of the RSA.

Solution Using l = 232 + 1 provides fast encryption (you need just 33 multiplica-
tions using repeated squaring). With d = 232 + 1 one can decrypt messages quickly
(but an attacker can easily guess it).
472 Further Topics from Information Theory

(b) Next, we show that if we know a multiple m of φ (N) then it is ‘easy’ to factor N.
Given positive integers y > 1 and M > 1, denote by ordM (y) the order of y relative
to M:

ordM (y) = min s = 1, 2, . . . : ys = 1 mod M .


Assume that m = 2a b where a ≥ 0 and b is odd. Set
* +
X = x = 1, 2, . . . , N : ord p (xb ) = ordq (xb ) . (4.5.29)
Given N, l and d, we put m = dl − 1. As φ (N)|dl − 1 we can use Lemma 4.5.18
below to factor N. We select x < N. Suppose gcd(x, N) = 1; otherwise the search
is already successful. The probability of finding a non-trivial factor is 1/2, so the
probability of failure after r random choices of x ∈ X is 1/2r .
(c) The bit commitment problem arises in the following case: Alice sends a mes-
sage to Bob in such a way that
(i) Bob cannot read the message until Alice sends further information;
(ii) Alice cannot change the message.
A solution is to use the electronic signature: Bob cannot read the message until
Alice (later) reveals her private key. This does not violate conditions (i), (ii) and
makes it (legally) impossible for Alice to refuse acknowledging her authorship.

Lemma 4.5.18 (i) Let N = pq, m be as before, i.e. φ (N)|m, and  2tdefine  set X
the
as in (4.5.29). If x ∈ X then there exists 0 ≤ t < a such that gcd x − 1, N > 1 is
b

a non-trivial factor of N = pq.


(ii) The cardinality  X ≥ φ (N)/2.
Proof (i) Put y = xb mod N. The Euler–Fermat theorem implies that xφ (N) ≡
a
1 mod N and hence y2 ≡ 1 mod N. Then
ord p (xb ) and ordq (xb ) are powers of 2.
As we know, ord p (xb ) = ordq (xb ); say, ord p (xb ) < ordq (xb ). Then there exists 0 ≤
t < a such that
t t
y2 ≡ 1 mod p, y2 ≡ 1 mod q.
t
So, gcd y2 − 1, N = p, as required.
(ii) By the CRT, there is a bijection
* +   * + * +
x ∈ 1, . . . , N ↔ x mod p, x mod q ∈ 1, . . . , p × 1, . . . , q ,
with
* the agreement
+ that N ↔ (p, q). Then it suffices to show that if we partition set
1, . . . , p into subsets according to the value of ord p (xb ), x ∈ X, then each subset
4.5 Selected examples and problems from cryptography 473

has size ≤ (p − 1)/2. We will do this by exhibiting such a subset of size (p − 1)/2.
Note that
φ (N)|2a b implies that there exists γ ∈ {1, . . . , p − 1}
such that ord p (γ b )is a power of 2.
In turn, the latter statement implies that
%
δb = ord p (γ b ), δ odd,
ord p (γ )
< ord p (γ b ), δ even.
* +
Therefore, γ δ b mod p : δ odd is the required subset.

Our next example of a cipher is the Rabin, or Rabin–Williams cryptosystem.


Here, again, one uses the factoring problem to provide security. For this system,
the relation with the factoring problem has been proved to be mutual: knowing
the solution to the factoring problem breaks the cryptosystem, and the ability of
breaking the cryptosystem leads to factoring. [That is not so in the case with the
RSA: it is not known whether breaking the RSA enables one to solve the factoring
problem.]
In the Rabin system the recipient user (Alice) chooses at random two large
primes, p and q, with
p = q = 3 mod 4. (4.5.30)

Furthermore:
Alice’s public key is N = pq; her secret key is the pair (p, q);
Alice’s plaintext and ciphertext are numbers m = 0, 1 . . . , N − 1, (4.5.31)
and her encryption rule is EN (m) = c where c = m2 mod N.
To decrypt a ciphertext c addressed to her, Alice computes

m p = c(p+1)/4 mod p and mq = c(q+1)/4 mod q. (4.5.32)

Then
±m p = c1/2 mod p and ± mq = c1/2 mod q,

i.e. ±m p and ±mq are the square roots of c mod p and mod q, respectively. In fact,
 2   p−1
± m p = c(p+1)/2 = c(p−1)/2 c = ± m p c = c mod p;

at the last step the Euler–Fermat theorem has been used. The argument for ±mq is
similar. Then Alice computes, via Euclid’s algorithm, integers u(p) and v(q) such
that
u(p)p + v(q)q = 1.
474 Further Topics from Information Theory

Finally, Alice computes



±r = ± u(p)pmq + v(q)qm p ] mod N
and

±s = ± u(p)pmq − v(q)qm p ] mod N.
These are four square roots of c mod N. The plaintext m is one of them. To
secure that she can identify the original plaintext, Alice may reduce the plaintext
space P, allowing only plaintexts with some special features (like the property
that their first 32 and last 32 digits are repetitions of each other), so that it becomes
unlikely that more than one square root has this feature. However, such a measure
may result in a reduced difficulty of breaking the cipher as it will be not always
true that the ‘reduced’ problem is equivalent to factoring.

I have often admired the mystical way of Pythagoras


and the secret magic of numbers.
Thomas Browne (1605–1682), English author who wrote on
medicine, religion, science and the esoteric

Example 4.5.19 Alice uses prime numbers p = 11 and q = 23. Then N = 253.
Bob encrypts the message m = 164, with
c = m2 mod N = 78.
Alice calculates m p = 1, mq = 3, u(p) = −2, v(q) = 1. Then Alice computes
r = ±[u(p)pmq + v(q)qm p ] mod N = 210 and 43

s = ±[u(p)pmq − v(q)qm p ] mod N = 164 and 89


and finds out the message m = 164 among the solutions: 1642 = 78 mod 253.
We continue with the Diffie–Hellman key exchange scheme. Diffie and Hellman
proposed a protocol enabling a pair of users to exchange secret keys via insecure
channels. The Diffie–Hellman scheme is not a public-key cryptosystem but its im-
portance has been widely recognised since it forms a basis for the ElGamal signa-
ture cryptosystem.
The Diffie–Hellman protocol is related to the discrete logarithm problem (DLP):
we are given a prime number p, field F p with the multiplicative group F∗p  Z p−1
and a generator γ of F∗p (i.e. a primitive element in F∗p ). Then, for all b ∈ F∗p , there
exists a unique α ∈ {0, 1, . . . , p − 2} such that
b = γ α mod p. (4.5.33)
4.5 Selected examples and problems from cryptography 475

Then α is called the discrete logarithm, mod p, of b to base γ; some authors write
α = dlog_γ b mod p. Computing discrete logarithms is considered a difficult problem:
no efficient (polynomial) algorithm is known, although there is no proof that
it is indeed a non-polynomial problem. [In an additive cyclic group Z/(nZ), the
DLP becomes b = αγ mod n and is solved by Euclid's algorithm.]
The Diffie–Hellman protocol allows Alice and Bob to establish a common secret
key using field tables for F p , for a sufficient quantity of prime numbers p. That is,
they know a primitive element γ in each of these fields. They agree to fix a large
prime number p and a primitive element γ ∈ F p . The pair (p, γ ) may be publicly
known: Alice and Bob can fix p and γ through the insecure channel.
Next, Alice chooses a ∈ {0, 1, . . . , p − 2} at random, computes
A = γ^a mod p
and sends A to Bob, keeping a secret. Symmetrically, Bob chooses b ∈ {0, 1, . . . , p − 2} at random, computes
B = γ^b mod p
and sends B to Alice, keeping b secret. Then
Alice computes B^a mod p and Bob computes A^b mod p,
and their secret key is the common value
K = γ^{ab} = B^a = A^b mod p.
The attacker may intercept p, γ, A and B but knows
neither a = dlog_γ A mod p nor b = dlog_γ B mod p.

If the attacker can find discrete logarithms mod p then he can break the secret
key: this is the only known way to do so. The opposite question – solving the
discrete logarithm problem if he is able to break the protocol – remains open (it is
considered an important problem in public key cryptography).
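A minimal Python sketch of the exchange just described (ours, not from the text; the prime 2^31 − 1 and the base 7 — a primitive root of that prime — are illustrative choices, and in practice p would be far larger):

# Diffie-Hellman key exchange sketch with toy parameters.
import random

def dh_demo(p, gamma):
    a = random.randrange(1, p - 1)        # Alice's secret exponent
    b = random.randrange(1, p - 1)        # Bob's secret exponent
    A = pow(gamma, a, p)                  # sent to Bob over the insecure channel
    B = pow(gamma, b, p)                  # sent to Alice over the insecure channel
    key_alice = pow(B, a, p)              # Alice computes B^a mod p
    key_bob = pow(A, b, p)                # Bob computes A^b mod p
    assert key_alice == key_bob == pow(gamma, a * b, p)
    return key_alice

print(dh_demo(p=2**31 - 1, gamma=7))      # both parties obtain the same K = gamma^(ab) mod p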
However, like previously discussed schemes, the Diffie–Hellman protocol has a
particular weak point: it is vulnerable to the man in the middle attack. Here, the
attacker uses the fact that neither Alice nor Bob can verify that a given message re-
ally comes from the opposite party and not from a third party. Suppose the attacker
can intercept all messages between Alice and Bob. Suppose he can impersonate
Bob and exchange keys with Alice pretending to be Bob and, at the same time,
impersonate Alice and exchange keys with Bob pretending to be Alice. Electronic
signatures are needed to detect this kind of forgery.

We conclude Section 4.5 with the ElGamal cryptosystem based on electronic


signatures. The ElGamal cipher can be considered a development of the Diffie–
Hellman protocol. Both schemes are based on the difficulty of the discrete log-
arithm problem (DLP). In the ElGamal system, a recipient user, Alice, selects a
prime number p and a primitive element γ ∈ F p . Next, she chooses, at random, an
exponent a ∈ {0, . . . , p − 2}, computes
A = γ^a mod p
and announces/broadcasts
the triple (p, γ , A), her public key.
At the same time, she keeps in secret
exponent a, her private key.
Alice’s plaintext set P is numbers 0, 1, . . . , p − 1.
Another user Bob, wishing to send a message to Alice and knowing triple
(p, γ , A), chooses, again at random, an exponent b ∈ {0, 1, . . . , p − 2}, and com-
putes
B = γ^b mod p.
Then Bob lets Alice know B (which he can do by broadcasting value B). The value
B will play the role of Bob’s ‘signature’. In contrast, the value b of Bob’s exponent
is kept secret.
Now, to send to Alice a message m ∈ {0, 1, . . . , p − 1}, Bob encrypts m by the
pair
E_b(m) = (B, c) where c = A^b m mod p.
That is, Bob’s ciphertext consists of two components: the encrypted message c and
his signature B.
Clearly, values A and B are parts of the Diffie–Hellman protocol; in this sense
the latter can be considered as a part of the ElGamal cipher. Further, the encrypted
message c is the product of m by A^b, the factor combining part A of Alice's public
key and Bob’s exponent b.
When Alice receives the ciphertext (B, c) she uses her secret key a. Namely,
she divides c by B^a mod p. A convenient way is to calculate x = p − 1 − a: as
1 ≤ a ≤ p − 2, the value x also satisfies 1 ≤ x ≤ p − 2. Then Alice decrypts c by
B^x c mod p. This yields the original message m, since
B^x c = γ^{b(p−1−a)} A^b m = (γ^{p−1})^b (γ^a)^{−b} A^b m = A^{−b} A^b m = m mod p.
Example 4.5.20 With p = 37, γ = 2 and a = 12 we have
A = γ^a mod p = 26

and Alice’s public key is (p = 37, γ = 2, A = 26), her plaintexts are 0, 1, . . . , 36 and
private key a = 12. Assume Bob has chosen b = 32; then
B = 2^32 mod 37 = 7.
Suppose Bob wants to send m = 31. He encrypts m by
c = A^b m mod p = 26^32 m mod 37 = 10 × 31 mod 37 = 14.
Alice decodes this message using 2^32 = 7 mod 37 and 7^24 = 26 mod 37:
14 × 2^{32(37−12−1)} mod 37 = 14 × 7^24 mod 37 = 14 × 26 mod 37 = 31.
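A quick Python check of Example 4.5.20 (ours; variable names follow the text):

# ElGamal with p = 37, gamma = 2, Alice's private key a = 12, Bob's exponent b = 32.
p, gamma = 37, 2
a = 12                                   # Alice's private key
A = pow(gamma, a, p)                     # 26, part of her public key (p, gamma, A)
b, m = 32, 31                            # Bob's exponent and message
B = pow(gamma, b, p)                     # 7, Bob's 'signature'
c = pow(A, b, p) * m % p                 # 14, the encrypted message
x = p - 1 - a                            # 24
m_rec = pow(B, x, p) * c % p             # Alice decrypts: B^x * c mod p
print(A, B, c, m_rec)                    # 26 7 14 31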
Worked Example 4.5.21 Suppose that Alice wants to send the message ‘today’
to Bob using the ElGamal encryption. Describe how she does this using the prime
p = 15485863, γ = 6 a primitive root mod p, and her choice of b = 69. Assume
that Bob has private key a = 5. How does Bob recover the message using the
Mathematica program?

Solution Bob has public key (15485863, 6, 7776), which Alice obtains. She converts the English plaintext, using the alphabet order, to the numerical equivalent:
19, 14, 3, 0, 24. Since 26^5 < p < 26^6, she can represent the plaintext message as a
single 5-digit base 26 integer:
m = 19 × 26^4 + 14 × 26^3 + 3 × 26^2 + 0 × 26 + 24 = 8930660.
Now she computes γ^b = 6^69 = 13733130 mod 15485863, then
m γ^{ab} = 8930660 × 7776^69 = 4578170 mod 15485863.
Alice sends c = (13733130, 4578170) to Bob. He uses his private key to compute
(γ^b)^{p−1−a} = 13733130^{15485863−1−5} = 2620662 mod 15485863
and
(γ^b)^{−a} m γ^{ab} = 2620662 × 4578170 = 8930660 mod 15485863,
and converts the message back to the English plaintext.
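The same computation can be done in Python instead of Mathematica; the sketch below (ours — the base-26 helper is one way of coding the conversion used in the text) encrypts and then decrypts the message.

# Worked Example 4.5.21 redone in Python; the base-26 helper is ours.
p, gamma = 15485863, 6
a = 5                                     # Bob's private key
A = pow(gamma, a, p)                      # 7776 = 6^5, part of Bob's public key

def encode26(word):
    """'today' -> 19,14,3,0,24 -> a single base-26 integer."""
    n = 0
    for ch in word:
        n = 26 * n + (ord(ch) - ord('a'))
    return n

m = encode26('today')                     # 8930660
b = 69                                    # Alice's random exponent
ciphertext = (pow(gamma, b, p), pow(A, b, p) * m % p)   # (gamma^b, A^b * m) mod p

# Bob's decryption with his private key a:
B, c = ciphertext
m_rec = pow(B, p - 1 - a, p) * c % p
print(m, m_rec)                           # both equal 8930660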
Worked Example 4.5.22 (a) Describe the Rabin–Williams scheme for coding
a message x as x^2 modulo a certain N. Show that, if N is chosen appropriately,
breaking this code is equivalent to factorising the product of two primes.
(b) Describe the RSA system associated with a public key e, a private key d and
the product N of two large primes.
Give a simple example of how the system is vulnerable to a homomorphism
attack. Explain how a signature system prevents such an attack. Explain how to
factorise N when e, d and N are known.

Solution (a) Fix two large primes p, q ≡ −1 mod 4, which form the private key; the
broadcast public key is the product N = pq. The properties used are:
(i) If p is a prime, the congruence a^2 ≡ d mod p has at most two solutions.
(ii) For a prime p = −1 mod 4, i.e. p = 4k − 1, if the congruence a^2 ≡ c mod p has
a solution then a ≡ c^{(p+1)/4} mod p is one solution and a ≡ −c^{(p+1)/4} mod p is
another solution. [Indeed, if c ≡ a^2 mod p then, by the Euler–Fermat theorem,
c^{2k} = a^{4k} = a^{(p−1)+2} = a^2 mod p, implying c^k = ±a.]
The message is a number m from M = {0, 1, . . . , N − 1}. The encrypter (Bob) sends
(broadcasts) c = m^2 mod N. The decrypter (Alice) uses property (ii) to recover the
two possible values of m mod p and two possible values of m mod q. The CRT then
yields four possible values for m: three of them would be incorrect and one correct.
So, if one can factorise N then the code would be broken. Conversely, suppose
that we can break the code. Then we can find all four distinct square roots u_1, u_2,
u_3, u_4 mod N for a general u. (The CRT plus property (i) shows that u has zero
or four square roots unless it is a multiple of p or q.) Then u_j u_1^{−1} (calculable via
Euclid's algorithm) gives rise to the four square roots, 1, −1, ε_1 and ε_2, of 1 mod N,
with
ε_1 ≡ 1 mod p, ε_1 ≡ −1 mod q
and
ε_2 ≡ −1 mod p, ε_2 ≡ 1 mod q.

By interchanging p and q, if necessary, we may suppose we know ε_1. As ε_1 − 1
is divisible by p and not by q, gcd(ε_1 − 1, N) = p; that is, p can be found by
Euclid's algorithm. Then q can also be identified.
In practice, it can be done as follows. Assuming that we can find square roots
mod N, we pick x at random and solve the congruence y^2 ≡ x^2 mod N. With
probability 1/2, we have x ≢ ±y mod N, and then gcd(x − y, N) is a non-trivial
factor of N. We repeat the procedure until we identify a factor; after k trials the
probability of success is 1 − 2^{−k}.
(b) To define the RSA cryptosystem let us randomly choose large primes p and q.
By Fermat's little theorem,
x^{p−1} ≡ 1 mod p, x^{q−1} ≡ 1 mod q.
Thus, by writing N = pq and λ(N) = lcm(p − 1, q − 1), we have
x^{λ(N)} ≡ 1 mod N,
for all integers x coprime to N.



Next, we choose e randomly. Either Euclid's algorithm will reveal that e is not
co-prime to λ(N) or we can use Euclid's algorithm to find d such that
de ≡ 1 mod λ(N).
With a very high probability a few trials will give appropriate d and e.
We now give out the value e of the public key and the value of N but keep secret
the private key d. Given a message m with 1 ≤ m ≤ N − 1, it is encoded as the
integer c with
1 ≤ c ≤ N − 1 and c ≡ m^e mod N.
Unless m is not co-prime to N (an event of negligible probability), we can decode
by observing that
m ≡ m^{de} ≡ c^d mod N.

As an example of a homomorphism attack, suppose the system is used to transmit
a number m (dollars to be paid) and someone knowing this replaces the coded
message c by c^2. Then
(c^2)^d ≡ m^{2de} ≡ m^2 mod N
and the recipient of the (falsified) message believes that m^2 dollars are to be paid.
Suppose that a signature B(m) is also encoded and transmitted, where B is a
many-to-one function with no simple algebraic properties. Then the attack above
will produce a message and signature which do not correspond, and the recipient
will know that the message was tampered with.
Suppose e, d and N are known. Since
de − 1 ≡ 0 mod λ(N)
and λ(N) is even, de − 1 is even. Thus de − 1 = 2^a b with b odd and a ≥ 1.
Choose x at random and set z ≡ x^b mod N; since z^{2^a} = x^{de−1} ≡ 1 mod N,
squaring z repeatedly reaches 1, and the last element different from 1 in the chain
z, z^2, z^4, . . . is a square root of 1 mod N (we keep calling it z). By the CRT, z is a
square root of 1 mod N = pq if and only if it is a square root of 1 mod p and mod q.
As F_p is a field,
x^2 ≡ 1 mod p ⇔ (x − 1)(x + 1) ≡ 0 mod p ⇔ (x − 1) ≡ 0 mod p or (x + 1) ≡ 0 mod p.
Thus 1 has four square roots w mod N satisfying w ≡ ±1 mod p and w ≡ ±1 mod q.
In other words,
w ≡ 1 mod N, w ≡ −1 mod N,
w ≡ w_1 mod N with w_1 ≡ 1 mod p and w_1 ≡ −1 mod q,
or
w ≡ w_2 mod N with w_2 ≡ −1 mod p and w_2 ≡ 1 mod q.

Now z (a square root of 1 mod N) does not satisfy z ≡ 1 mod N. If z ≡ −1 mod N,
we are unlucky and try again with a fresh x. Otherwise we know that z + 1 is not
congruent to 0 mod N but is divisible by one of the two prime factors of N. Applying
Euclid's algorithm to z + 1 and N yields this factor. Having found one prime factor,
we can find the other one by division or by looking at z − 1.
Since square roots of 1 are algebraically indistinguishable, the probability of this
method’s failure tends to 0 rapidly with the number of trials.
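The last step — recovering p and q from a known pair (e, d) — can be programmed directly. Below is a Python sketch (ours, not from the text) following the 2^a·b argument above; the toy values p = 101, q = 113, e = 3533, d = 6597 satisfy de ≡ 1 mod (p − 1)(q − 1).

# Sketch of recovering p, q from (N, e, d); randomised, succeeds with high probability.
import math, random

def factor_rsa(N, e, d, max_trials=100):
    k = d * e - 1                          # k = de - 1 = 2^a * b with b odd
    a, b = 0, k
    while b % 2 == 0:
        a += 1
        b //= 2
    for _ in range(max_trials):
        x = random.randrange(2, N - 1)
        if math.gcd(x, N) != 1:            # extremely unlikely, but already a factor
            p = math.gcd(x, N)
            return p, N // p
        z = pow(x, b, N)
        for _ in range(a):
            z_next = pow(z, 2, N)
            if z_next == 1 and z not in (1, N - 1):
                p = math.gcd(z - 1, N)     # z is a nontrivial square root of 1 mod N
                return p, N // p
            z = z_next
    return None

print(factor_rsa(101 * 113, 3533, 6597))   # prints (101, 113) or (113, 101)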

4.6 Additional problems for Chapter 4


Problem 4.1 (a) Let (Nt )t≥0 be a Poisson process of rate λ > 0 and p ∈ (0, 1).
Suppose that each jump in (Nt ) is counted as type one with probability p and type
two with probability 1 − p, independently for different jumps and independently of
the Poisson process. Let M_t^{(1)} be the number of type-one jumps and M_t^{(2)} = N_t − M_t^{(1)}
the number of type-two jumps by time t. What is the joint distribution of the
pair of processes (M_t^{(1)})_{t≥0} and (M_t^{(2)})_{t≥0}? What if we fix probabilities p_1, . . . , p_m
with p_1 + · · · + p_m = 1 and consider m types instead of two?
(b) A person collects coupons one at a time, at jump times of a Poisson process
(Nt )t≥0 of rate λ . There are m types of coupons, and each time a coupon of type j
is obtained with probability p j , independently of the previously collected coupons
and independently of the Poisson process. Let T be the first time when a complete
set of coupon types is collected. Show that
P(T < t) = ∏_{j=1}^{m} (1 − e^{−p_j λ t}). (4.6.1)

Let L = NT be the total number of coupons collected by the time the complete
set of coupon types is obtained. Show that λ ET = EL. Hence, or otherwise, deduce
that EL does not depend on λ .

Solution Part (a) follows directly from the definition of a Poisson process: (M_t^{(1)})_{t≥0} and (M_t^{(2)})_{t≥0} are independent Poisson processes of rates pλ and (1 − p)λ; with m types one obtains m independent Poisson processes of rates p_1 λ, . . . , p_m λ.
(b) Let T_j be the time of the first collection of a type j coupon. Then T_j ∼ Exp(p_j λ),
independently for different j. We have

T = max{T_1, . . . , T_m},
and hence
P(T < t) = P(max{T_1, . . . , T_m} < t) = ∏_{j=1}^{m} P(T_j < t) = ∏_{j=1}^{m} (1 − e^{−p_j λ t}).

Next, observe that the random variable L counts the jumps in the original Poisson
process (N_t) until the time of collecting a complete set of coupon types. That is,
T = ∑_{i=1}^{L} S_i,
where S_1, S_2, . . . are the holding times in (N_t), with S_j ∼ Exp(λ), independently for
different j. Then
E(T | L = n) = n ES_1 = n λ^{−1}.
Moreover, L is independent of the random variables S_1, S_2, . . .. Thus,
ET = ∑_{n≥m} P(L = n) E(T | L = n) = ES_1 ∑_{n≥m} n P(L = n) = λ^{−1} EL.

But
λ ET = λ ∫_0^∞ P(T > t) dt = λ ∫_0^∞ [ 1 − ∏_{j=1}^{m} (1 − e^{−p_j λ t}) ] dt
= ∫_0^∞ [ 1 − ∏_{j=1}^{m} (1 − e^{−p_j t}) ] dt,
and the RHS does not depend on λ.


Equivalently, L is identified as the number of collections needed for collecting
a complete set of coupons when collections occur at positive integer times t =
1, 2, . . ., with probability p j of obtaining a coupon of type j, regardless of the results
of previous collections. In this construction, λ does not figure, so the mean EL does
not depend on λ (as, in fact, the whole distribution of L).
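A Monte Carlo sketch of part (b) (ours; the probabilities p_j and the values of λ are illustrative): it estimates λET and EL for two values of λ and illustrates that they agree and do not depend on λ.

# Monte Carlo check for Problem 4.1(b): lambda*E[T] = E[L], independent of lambda.
import random

def one_run(rates_p, lam):
    """Collect coupons at jump times of a PP(lam); return (T, L) for one realisation."""
    collected, T, L = set(), 0.0, 0
    while len(collected) < len(rates_p):
        T += random.expovariate(lam)            # next jump of the Poisson process
        L += 1
        collected.add(random.choices(range(len(rates_p)), weights=rates_p)[0])
    return T, L

def estimate(rates_p, lam, n=20000):
    runs = [one_run(rates_p, lam) for _ in range(n)]
    ET = sum(t for t, _ in runs) / n
    EL = sum(l for _, l in runs) / n
    return lam * ET, EL

p = [0.5, 0.3, 0.2]
print(estimate(p, lam=1.0))    # the two numbers should agree (approximately)
print(estimate(p, lam=5.0))    # and E[L] should be unchanged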
Problem 4.2 Queuing systems are discussed in detail in PSE II. We refer to this
topic occasionally as they provide a rich source of examples in point processes.
Consider a system of k queues in series, each with infinitely many servers, in which,
for i = 1, . . . , k − 1, customers leaving the ith queue immediately arrive at the (i +
1)th queue. Arrivals to the first queue form a Poisson process of rate λ . Service
times at the ith queue are all independent with distribution F , and independent of
service times at other queues, for all i. Assume that initially the system is empty
and write Vi (t) for the number of customers at queue i at time t ≥ 0. Show that
V1 (t), . . . ,Vk (t) are independent Poisson random variables.
In the case F(t) = 1 − e^{−μt} show that
EV_i(t) = (λ/μ) P(N_t ≥ i), t ≥ 0, i = 1, . . . , k, (4.6.2)
where (Nt )t≥0 is a Poisson process of rate μ.

Suppose now that arrivals to the first queue stop at time T . Determine the mean
number of customers at the ith queue at each time t ≥ T .

Solution We apply the product theorem to the Poisson process of arrivals with
random vectors Y_n = (S_n^1, . . . , S_n^k), where S_n^i is the service time of the nth customer
at the ith queue. Then
V_i(t) = the number of customers in the ith queue at time t
= ∑_{n≥1} 1(the nth customer, arrived in the first queue at time J_n, is in the ith queue at time t)
= ∑_{n≥1} 1(J_n > 0, S_n^1, . . . , S_n^k ≥ 0, J_n + S_n^1 + · · · + S_n^{i−1} < t < J_n + S_n^1 + · · · + S_n^i)
= ∑_{n≥1} 1((J_n, (S_n^1, . . . , S_n^k)) ∈ A_i(t)) = M(A_i(t)).

Here (J_n : n ∈ N) denote the jump times of a Poisson process of rate λ, and the
measures M and ν on (0, ∞) × R_+^k are defined by
M(A) = ∑_{n≥1} 1((J_n, Y_n) ∈ A), A ⊂ (0, ∞) × R_+^k,
and
ν((0, t] × B) = λ t μ(B).
The product theorem states that M is a Poisson random measure on (0, ∞) × R_+^k
with intensity measure ν. Next, the set A_i(t) ⊂ (0, ∞) × R_+^k is defined by
A_i(t) = {(τ, s_1, . . . , s_k) : 0 < τ < t, s_1, . . . , s_k ≥ 0
and τ + s_1 + · · · + s_{i−1} ≤ t < τ + s_1 + · · · + s_i}
= {(τ, s_1, . . . , s_k) : 0 < τ < t, s_1, . . . , s_k ≥ 0
and ∑_{l=1}^{i−1} s_l ≤ t − τ < ∑_{l=1}^{i} s_l}.

Sets A_i(t) are pairwise disjoint for i = 1, . . . , k (as t − τ can fall between subsequent
partial sums ∑_{l=1}^{i−1} s_l and ∑_{l=1}^{i} s_l only once). So, the random variables V_i(t) are
independent Poisson.

A direct verification is through the joint MGF. Namely, let N_t ∼ Po(λt) be the
number of arrivals at the first queue by time t. Then write
M_{V_1(t),...,V_k(t)}(θ_1, . . . , θ_k) = E exp(θ_1 V_1(t) + · · · + θ_k V_k(t))
= E[ E( exp(∑_{i=1}^{k} θ_i V_i(t)) | N_t; J_1, . . . , J_{N_t} ) ].

In turn, given n = 1, 2, . . . and points 0 < τ_1 < · · · < τ_n < t, the conditional expectation is
E( exp(∑_{i=1}^{k} θ_i V_i(t)) | N_t = n; J_1 = τ_1, . . . , J_n = τ_n )
= E exp( ∑_{i=1}^{k} θ_i ∑_{j=1}^{n} 1((τ_j, (S_j^1, . . . , S_j^k)) ∈ A_i(t)) )
= E exp( ∑_{j=1}^{n} ∑_{i=1}^{k} θ_i 1((τ_j, (S_j^1, . . . , S_j^k)) ∈ A_i(t)) )
= ∏_{j=1}^{n} E exp( ∑_{i=1}^{k} θ_i 1((τ_j, (S_j^1, . . . , S_j^k)) ∈ A_i(t)) ).

Next, perform summation over n and integration over τ_1, . . . , τ_n:
E[ E( exp(∑_{i=1}^{k} θ_i V_i(t)) | N_t; J_1, . . . , J_{N_t} ) ]
= ∑_{n≥0} λ^n e^{−λt} ∫_0^t ∫_0^{τ_n} · · · ∫_0^{τ_2} ∏_{j=1}^{n} E exp( ∑_{i=1}^{k} θ_i 1((τ_j, (S_j^1, . . . , S_j^k)) ∈ A_i(t)) ) dτ_1 · · · dτ_{n−1} dτ_n
= ∑_{n≥0} e^{−λt} (λ^n / n!) ( ∫_0^t E exp( ∑_{i=1}^{k} θ_i 1((τ, (S^1, . . . , S^k)) ∈ A_i(t)) ) dτ )^n
= exp( λ ∫_0^t [ E exp( ∑_{i=1}^{k} θ_i 1((τ, (S^1, . . . , S^k)) ∈ A_i(t)) ) − 1 ] dτ )
= exp( λ ∫_0^t ∑_{i=1}^{k} P((τ, (S^1, . . . , S^k)) ∈ A_i(t)) (e^{θ_i} − 1) dτ )
= ∏_{i=1}^{k} exp( (e^{θ_i} − 1) λ ∫_0^t P( ∑_{l=1}^{i−1} S^l < t − τ < ∑_{l=1}^{i} S^l ) dτ ).

By the uniqueness of a random variable with a given MGF, this implies that
V_i(t) ∼ Po( λ ∫_0^t P( ∑_{l=1}^{i−1} S^l < t − τ < ∑_{l=1}^{i} S^l ) dτ ), independently.

If F(t) = 1 − e^{−μt} then the partial sums S^1, S^1 + S^2, . . . mark the subsequent points
of a Poisson process (N_s) of rate μ. In this case, EV_i(t) = ν(A_i(t)) equals
λ ∫_0^t P( ∑_{l=1}^{i−1} S^l ≤ t − τ < ∑_{l=1}^{i} S^l ) dτ = λ ∫_0^t P(N_{t−τ} = i − 1) dτ
= λ E ∫_0^t 1(N_s = i − 1) ds = (λ/μ) P(N_t ≥ i).

Finally, write V_i(t, T) for the number of customers in queue i at time t after closing
the entrance at time T. Then
EV_i(t, T) = λ ∫_0^T P(N_{t−τ} = i − 1) dτ = λ E ∫_{t−T}^{t} 1(N_s = i − 1) ds
= (λ/μ) [ P(N_t ≥ i) − P(N_{t−T} ≥ i) ].
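For the exponential case F(t) = 1 − e^{−μt}, formula (4.6.2) is easy to test numerically. The following simulation sketch (ours; the values of λ, μ, k, t are illustrative) compares the empirical EV_i(t) with (λ/μ)P(N_t ≥ i).

# Simulation check for Problem 4.2 with exponential service times.
import math, random

def simulate(lam, mu, k, t, n_runs=20000):
    counts = [0.0] * k
    for _ in range(n_runs):
        s = random.expovariate(lam)                  # arrival times J_n in (0, t)
        while s < t:
            total = 0.0                              # cumulative service through the queues
            for i in range(k):
                start = s + total
                total += random.expovariate(mu)
                if start <= t < s + total:           # customer is in queue i+1 at time t
                    counts[i] += 1
                    break
            s += random.expovariate(lam)
    return [c / n_runs for c in counts]

def theory(lam, mu, k, t):
    # (lambda/mu) * P(N_t >= i) with N_t ~ Po(mu * t)
    pois = [math.exp(-mu * t) * (mu * t) ** j / math.factorial(j) for j in range(200)]
    return [lam / mu * (1 - sum(pois[:i])) for i in range(1, k + 1)]

lam, mu, k, t = 2.0, 1.0, 3, 1.5
print(simulate(lam, mu, k, t))
print(theory(lam, mu, k, t))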

Problem 4.3 The arrival times of customers at a supermarket form a Poisson


process of rate λ . Each customer spends a random length of time, S, collecting
items to buy, where S has PDF ( f (s,t) : s ≥ 0) for a customer arriving at time t .
Customers behave independently of one another. At a checkout it takes time g(S)
to buy the items collected. The supermarket has a policy that nobody should wait
at the checkout, so more tills are made available as required. Find
(i) the probability that the first customer has left before the second has arrived,
(ii) the distribution of the number of checkouts in use at time T .

Solution (i) If J1 is the arrival time of the first customer then J1 + S1 is the time he
enters the checkout till and J1 + S1 + g(S1 ) the time he leaves. Let J2 be the time of
arrival of the second customer. Then J1 , J2 − J1 ∼ Exp(λ ), independently.
Then
P(S_1 + g(S_1) < J_2 − J_1)
= ∫_0^∞ dt_1 λ e^{−λ t_1} ∫_0^∞ dt_2 λ e^{−λ t_2} ∫_0^{t_2} ds_1 f(s_1, t_1) 1(s_1 + g(s_1) < t_2)
= ∫_0^∞ dt_1 λ e^{−λ t_1} ∫_0^∞ ds_1 f(s_1, t_1) ∫_{s_1+g(s_1)}^∞ dt_2 λ e^{−λ t_2}
= ∫_0^∞ dt_1 λ e^{−λ t_1} ∫_0^∞ ds_1 f(s_1, t_1) e^{−λ(s_1+g(s_1))}.

(ii) Let N_T^{ch} be the number of checkouts used at time T. By the product theorem,
4.4.11, N_T^{ch} ∼ Po(Λ(T)) where
Λ(T) = λ ∫_0^T du ∫_0^∞ ds f(s, u) 1(u + s < T, u + s + g(s) > T)
= λ ∫_0^T du ∫_0^∞ ds f(s, u) 1(T − g(s) < u + s < T).

In fact, if N_T^{arr} ∼ Po(λT) is the number of arrivals by time T, then
N_T^{ch} = ∑_{i=1}^{N_T^{arr}} 1(J_i + S_i < T < J_i + S_i + g(S_i)),
and the MGF
E exp(θ N_T^{ch}) = E[ E( exp(θ N_T^{ch}) | N_T^{arr}; J_1, . . . , J_{N_T^{arr}} ) ]
= e^{−λT} ∑_{k≥0} λ^k ∫_0^T ∫_0^{t_k} · · · ∫_0^{t_2} ∏_{i=1}^{k} E exp( θ 1(t_i + S_i < T < t_i + S_i + g(S_i)) ) dt_1 · · · dt_k
= e^{−λT} ∑_{k≥0} (λ^k / k!) ∫_0^T · · · ∫_0^T ∏_{i=1}^{k} E exp( θ 1(t_i + S_i < T < t_i + S_i + g(S_i)) ) dt_1 · · · dt_k
= e^{−λT} ∑_{k≥0} (λ^k / k!) ( ∫_0^T E exp( θ 1(t + S < T < t + S + g(S)) ) dt )^k
= exp( λ ∫_0^T [ E exp( θ 1(t + S < T < t + S + g(S)) ) − 1 ] dt )
= exp( λ (e^θ − 1) ∫_0^T P( t + S < T < t + S + g(S) ) dt )
= exp( (e^θ − 1) λ ∫_0^T ∫_0^∞ f(s, u) 1( u + s < T < u + s + g(s) ) ds du ),
which verifies the claim.

Problem 4.4 A library is open from 9am to 5pm. No student may enter after
5pm; a student already in the library may remain after 5pm. Students arrive at the
library in the period from 9am to 5pm in the manner of a Poisson process of rate
λ . Each student spends in the library a random amount of time, H hours, where

0 ≤ H ≤ 8 is a random variable with PDF h and E[H] = 1. The periods of stay of


different students are IID random variables.

(a) Find the distribution of the number of students who leave the library between
3pm and 4pm.
(b) Prove that the mean number of students who leave between 3pm and 4pm is
E[min(1, (7 − H)+ )], where w+ denotes max[w, 0].
(c) What is the number of students still in the library at closing time?

Solution The library is open from 9am to 5pm. Students arrive as a PP(λ ). The
problem is equivalent to an M/GI/∞ queue (until 5pm, when the restriction of no
more arrivals applies, but for problems involving earlier times this is unimportant).
Denote by Jn the arrival time of the nth student using the 24 hour clock.
Denote by Hn the time the nth student spends in the library.
Again use the product theorem, 4.4.11, for the random measure on (0, 8) × (0, 8)
with atoms (J_n, H_n), where (J_n : n ∈ N) are the arrival times and (H_n : n ∈ N) are
the periods of time that students stay in the library. Define the counting measure
N(A) = ∑_n 1((J_n, H_n) ∈ A) on (0, ∞) × R_+. Then N is a Poisson random
measure with intensity ν([0, t] × [0, y]) = λ t F(y), where F(y) = ∫_0^y h(x) dx (the
time t = 0 corresponds to 9am).

(a) Now, the number of students leaving the library between 3pm and 4pm (i.e.
with departure time between hours 6 and 7 after opening) has a Poisson distribution
Po(ν(A)) where
A = {(s, r) : r ∈ [0, 7], s ∈ [6 − r, 7 − r] if r ≤ 6; s ∈ [0, 7 − r] if r > 6},
with s the arrival time and r the length of stay. Here
ν(A) = λ ∫_0^8 dF(r) ∫_{(6−r)^+}^{(7−r)^+} ds = λ ∫_0^8 [(7 − r)^+ − (6 − r)^+] dF(r).
So, the number of students leaving the library between 3pm and 4pm is Poisson
with rate λ ∫_0^8 [(7 − r)^+ − (6 − r)^+] dF(r).

(b) Note that
(7 − r)^+ − (6 − r)^+ = 0 if r ≥ 7, = 7 − r if 6 ≤ r ≤ 7, and = 1 if r ≤ 6,
i.e. (7 − r)^+ − (6 − r)^+ = min(1, (7 − r)^+).

The mean number of students leaving the library between 3pm and 4pm is
ν(A) = λ ∫_0^8 min(1, (7 − r)^+) dF(r) = λ E[min(1, (7 − H)^+)]

as required.
(c) For a student still to be in the library at closing time we require J + H ≥ 8; as H
ranges over [0, 8], J ranges over [8 − H, 8]. Let
B = {(t, x) : t ∈ [0, 8], x ∈ [8 − t, 8]}.
So,
ν(B) = λ ∫_0^8 dt ∫_{8−t}^{8} dF(x) = λ ∫_0^8 dF(x) ∫_{8−x}^{8} dt = λ ∫_0^8 x dF(x) = λ E[H] = λ,
since ∫_0^8 x dF(x) = E[H] = 1. Hence, the number of students still in the library at
closing time is Poisson with mean λ.

Problem 4.5 (i) Prove Campbell’s theorem, i.e. show that if M is a Poisson
random measure on the state space E with intensity measure μ and a : E → R is a
bounded measurable function, then
E[e^{θX}] = exp( ∫_E (e^{θ a(y)} − 1) μ(dy) ), (4.6.3)
where X = ∫_E a(y) M(dy) (assume that λ = μ(E) < ∞).
(ii) Shots are heard at jump times J_1, J_2, . . . of a Poisson process with rate λ. The
initial amplitudes of the gunshots A_1, A_2, . . . ∼ Exp(2) are IID exponentially distributed
with parameter 2, and the amplitudes decay linearly at rate α. Compute the
MGF of the total amplitude X_t at time t:
X_t = ∑_n A_n (1 − α(t − J_n)^+) 1(J_n ≤ t);

x+ = x if x ≥ 0 and 0 otherwise.

Solution (i) Conditioned on M(E) = n, the atoms of M form a random sample
Y_1, . . . , Y_n with distribution μ/λ, so
E[e^{θX} | M(E) = n] = E[ e^{θ ∑_{k=1}^{n} a(Y_k)} ] = ( ∫_E e^{θ a(y)} μ(dy)/λ )^n.

Hence,
E[e^{θX}] = ∑_n E[e^{θX} | M(E) = n] P(M(E) = n)
= ∑_n ( e^{−λ} λ^n / n! ) ( ∫_E e^{θ a(y)} μ(dy)/λ )^n
= exp( ∫_E (e^{θ a(y)} − 1) μ(dy) ).

(ii) Fix t and let E = [0, t] × R_+ and ν and M be such that ν(ds, dx) = 2λ e^{−2x} ds dx,
M(B) = ∑_n 1{(J_n, A_n) ∈ B}. By the product theorem M is a Poisson random
measure with intensity measure ν. Set a_t(s, x) = x(1 − α(t − s))^+; then
X_t = ∫_E a_t(s, x) M(ds, dx). So, by Campbell's theorem, for θ < 2,
E[e^{θ X_t}] = exp( ∫_E (e^{θ a_t(s,x)} − 1) ν(ds, dx) )
= e^{−λt} exp( 2λ ∫_0^t ∫_0^∞ e^{−x(2 − θ(1 − α(t−s))^+)} dx ds )
= e^{−λt} exp( 2λ ∫_0^t ds / (2 − θ(1 − α(t − s))^+) )
= e^{−λ min[t, 1/α]} ( (2 − θ + θα min[t, 1/α]) / (2 − θ) )^{2λ/(θα)},
by splitting the integral ∫_0^t = ∫_0^{t−1/α} + ∫_{t−1/α}^{t} in the case t > 1/α.
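The closed-form MGF can be checked by simulation; here is a sketch (ours; the values of θ, λ, α, t are illustrative, with θ < 2) comparing a Monte Carlo estimate of E[e^{θX_t}] with the formula above.

# Monte Carlo check of the shot-noise MGF in Problem 4.5(ii).
import math, random

def mgf_exact(theta, lam, alpha, t):
    m = min(t, 1.0 / alpha)
    return math.exp(-lam * m) * ((2 - theta + theta * alpha * m) / (2 - theta)) ** (2 * lam / (theta * alpha))

def mgf_mc(theta, lam, alpha, t, n=100000):
    acc = 0.0
    for _ in range(n):
        x, s = 0.0, random.expovariate(lam)
        while s < t:
            a = random.expovariate(2.0)               # amplitude ~ Exp(2)
            x += a * max(0.0, 1 - alpha * (t - s))    # decayed amplitude at time t
            s += random.expovariate(lam)
        acc += math.exp(theta * x)
    return acc / n

theta, lam, alpha, t = 0.5, 1.0, 0.7, 2.0
print(mgf_exact(theta, lam, alpha, t), mgf_mc(theta, lam, alpha, t))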

Problem 4.6 Seeds are planted in a field S ⊂ R2 . The random way they are sown
means that they form a Poisson process on S with density λ (x, y). The seeds grow
into plants that are later harvested as a crop, and the weight of the plant at (x, y) has

mean m(x, y) and variance v(x, y). The weights of different plants are independent
random variables. Show that the total weight W of all the plants is a random variable
with finite mean
I_1 = ∫_S m(x, y) λ(x, y) dx dy
and variance
I_2 = ∫_S [m(x, y)^2 + v(x, y)] λ(x, y) dx dy,

so long as these integrals are finite.

Solution Suppose first that
μ = ∫_S λ(x, y) dx dy
is finite. Then the number N of plants is finite and has the distribution Po(μ).
Conditional on N, their positions may be taken as independent random variables
(X_n, Y_n), n = 1, . . . , N, with density λ/μ on S. The weights W_1, . . . , W_N of the plants
are then independent, with
EW_n = ∫_S m(x, y) λ(x, y) μ^{−1} dx dy = μ^{−1} I_1
and
EW_n^2 = ∫_S [m(x, y)^2 + v(x, y)] λ(x, y) μ^{−1} dx dy = μ^{−1} I_2,

where I_1 and I_2 are finite. Hence,
E(W | N) = ∑_{n=1}^{N} μ^{−1} I_1 = N μ^{−1} I_1
and
Var(W | N) = ∑_{n=1}^{N} ( μ^{−1} I_2 − μ^{−2} I_1^2 ) = N ( μ^{−1} I_2 − μ^{−2} I_1^2 ).

Then
EW = E[N] μ^{−1} I_1 = I_1
and
Var W = E[ Var(W | N) ] + Var( E(W | N) )
= μ ( μ^{−1} I_2 − μ^{−2} I_1^2 ) + Var(N) μ^{−2} I_1^2 = I_2,

as required.

If μ = ∞, we divide S into disjoint S_k on which λ is integrable, then write W =
∑_k W_{(k)}, where the harvests W_{(k)} on the S_k are independent, and use
EW = ∑_k EW_{(k)} = ∑_k ∫_{S_k} m(x, y) λ(x, y) dx dy = ∫_S m(x, y) λ(x, y) dx dy
and similarly for Var W.
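A simulation sketch (ours) of the case μ < ∞, taking S to be the unit square with constant λ, m and v, so that I_1 = λm and I_2 = λ(m^2 + v); the parameter values are illustrative, and the positions play no role because everything is constant.

# Monte Carlo check for Problem 4.6 on the unit square with constant lambda, m, v.
import random, statistics

lam, m, v = 50.0, 2.0, 0.25                  # intensity, mean and variance of a single weight

def total_weight():
    # number of plants ~ Po(lam * area), area = 1, sampled via Exp(1) interarrivals
    n, s = 0, random.expovariate(1.0)
    while s < lam:
        n += 1
        s += random.expovariate(1.0)
    # weights: any distribution with mean m and variance v; Gaussian is used here
    return sum(random.gauss(m, v ** 0.5) for _ in range(n))

samples = [total_weight() for _ in range(20000)]
print(statistics.mean(samples), lam * m)                # both close to I1 = 100
print(statistics.variance(samples), lam * (m * m + v))  # both close to I2 = 212.5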


Problem 4.7 A line L in R2 not passing through the origin O can be defined by
its perpendicular distance p > 0 from O and the angle θ ∈ [0, 2π ) that the perpen-
dicular from O to L makes with the x-axis. Explain carefully what is meant by a
Poisson process of such lines L.
A Poisson process Π of lines L has mean measure μ given by
μ(B) = ∫∫_B dp dθ (4.6.4)
for B ⊆ (0, ∞) × [0, 2π). A random countable set Φ ⊂ R^2 is defined to consist of all


intersections of pairs of lines in Π. Show that the probability that there is at least
one point of Φ inside the circle with centre O and radius r is less than

1 − (1 + 2πr) e^{−2πr}.

Is Φ a Poisson process?

Solution Suppose that μ is a measure on the space L of lines in R2 not passing


through 0. A Poisson process with mean measure μ is a random countable subset
Π of L such that
(1) the number N(A) of points of Π in a measurable subset A of L has distribution
Po(μ (A)), and
(2) for disjoint A1 , . . . , An , the N(A j ) are independent.
In the problem, the number N of lines which meet the disc D of centre 0 and radius
r equals the number of lines with p < r. It is Poisson with mean
∫_0^r ∫_0^{2π} dp dθ = 2πr.

If there is at least one point of Φ in D then there must be at least two lines of Π
meeting D, and this has probability
∑_{n≥2} ((2πr)^n / n!) e^{−2πr} = 1 − (1 + 2πr) e^{−2πr}.

The probability of a point of Φ lying in D is strictly less than this, because there
may be two lines meeting D whose intersection lies outside D.
Finally, Φ is not a Poisson process since, with positive probability, it contains
collinear points (all intersections lying on a single line of Π are collinear).

Problem 4.8 Particular cases of the Poisson–Dirichlet distribution for the random
sequence (p_1, p_2, p_3, . . .) with parameter θ appeared in PSE II; the definition is
given below. Show that, for any polynomial φ with φ(0) = 0,
E[ ∑_{n=1}^{∞} φ(p_n) ] = θ ∫_0^1 φ(x) x^{−1} (1 − x)^{θ−1} dx. (4.6.5)

What does this tell you about the distribution of p1 ?

Solution The simplest way to introduce the Poisson–Dirichlet distribution is to say
that p = (p_1, p_2, . . .) has the same distribution as (ξ_n/σ), where {ξ_n, n = 1, 2, . . .}
are the points, in descending order, of a Poisson process on (0, ∞) with rate θ x^{−1} e^{−x},
and σ = ∑_{n≥1} ξ_n. By Campbell's theorem, σ is a.s. finite and has distribution Gam(θ)
(where θ > 0 can be arbitrary) and is independent of the vector p = (p_1, p_2, . . .) with
p_1 ≥ p_2 ≥ · · · , ∑_{n≥1} p_n = 1, with probability 1.

Here Gam stands for the Gamma distribution; see PSE I, Appendix.
To prove (4.6.5), we can take p_n = ξ_n/σ and use the fact that σ and p are independent. For k ≥ 1,
E[ ∑_{n≥1} ξ_n^k ] = ∫_0^∞ x^k θ x^{−1} e^{−x} dx = θ Γ(k).

The left side equals
E[ σ^k ∑_{n≥1} p_n^k ] = Γ(θ + k) Γ(θ)^{−1} E[ ∑_{n≥1} p_n^k ].

Thus,
E[ ∑_{n≥1} p_n^k ] = θ Γ(k) Γ(θ) / Γ(k + θ) = θ ∫_0^1 x^{k−1} (1 − x)^{θ−1} dx.

We see that the identity (4.6.5) holds for φ (x) = xk (with k ≥ 1) and hence by
linearity for all polynomials with φ (0) = 0.

Approximating step functions by polynomials shows that the mean number of
p_n in an interval (a, b) (with 0 < a < b < 1) equals
θ ∫_a^b x^{−1} (1 − x)^{θ−1} dx.
If a > 1/2, there can be at most one such p_n, so that p_1 has the PDF
θ x^{−1} (1 − x)^{θ−1} on (1/2, 1).

But this fails on (0, 1/2), and the identity (4.6.5) does not determine the distribution
of p1 on this interval.

Problem 4.9 The positions of trees in a large forest can be modelled as a Pois-
son process Π of constant rate λ on R2 . Each tree produces a random number of
seeds having a Poisson distribution with mean μ. Each seed falls to earth at a point
uniformly distributed over the circle of radius r whose centre is the tree. The po-
sitions of the different seeds relative to their parent tree, and the numbers of seeds
produced by a given tree, are independent of each other and of Π. Prove that, con-
ditional on Π, the seeds form a Poisson process Π∗ whose mean measure depends
on Π. Is the unconditional distribution of Π∗ that of a Poisson process?

Solution By a direct calculation, the seeds from a tree at X form a Poisson process
with rate
ρ_X(x) = μ π^{−1} r^{−2} for |x − X| < r, and ρ_X(x) = 0 otherwise.

Superposing these independent Poisson processes gives a Poisson process with rate

Λ_Π(x) = ∑_{X∈Π} ρ_X(x);

it clearly depends on Π. The unrealistic assumption of a circular uniform distri-


bution is chosen to create no doubt about this dependence – in this case Π can be
reconstructed from the contours of ΛΠ .
Here we meet for the first time the doubly stochastic (Cox) processes, i.e. Poisson
processes with random intensity. The number of seeds in a bounded set A has
mean
EN(A) = E[ E(N(A) | Π) ] = E ∫_A Λ_Π(x) dx

and variance
Var N(A) = E[ Var(N(A) | Π) ] + Var( E(N(A) | Π) )
= EN(A) + Var( ∫_A Λ_Π(x) dx )
> EN(A).

Hence, Π∗ is not a Poisson process.

Problem 4.10 A uniform Poisson process Π in the unit ball of R3 is one whose
mean measure is Lebesgue measure (volume) on

B = {(x, y, z) ∈ R^3 : r^2 = x^2 + y^2 + z^2 ≤ 1}.

Show that
Π1 = {r : (x, y, z) ∈ Π}

is a Poisson process on [0, 1] and find its mean measure. Show that

Π2 = {(x/r, y/r, z/r) : (x, y, z) ∈ Π}

is a Poisson process on the boundary of B, whose mean measure is a multiple of


surface area. Are Π1 and Π2 independent processes?

Solution By the mapping theorem, Π1 is Poisson, with expected number of points


in (a, b) equal to λ × (the volume of the shell with radii a and b), i.e.
λ( (4/3)π b^3 − (4/3)π a^3 ).
Thus, the mean measure of Π_1 has the PDF
4λπ r^2 (0 < r < 1).

Similarly, the expected number of points of Π_2 in A ⊆ ∂B equals
λ × (the conic volume from 0 to A) = (1/3) λ × (the surface area of A).
Finally, Π1 and Π2 are not independent since they have the same number of points.

Problem 4.11 The points of Π are coloured randomly either red or green, the
probability of any point being red being r, 0 < r < 1, and the colours of different
points being independent. Show that the red and the green points form independent
Poisson processes.

Solution If A ⊆ S has μ (A) < ∞ then write

N(A) = N1 (A) + N2 (A)

where N1 and N2 are the numbers of red and green points. Conditional on N(A) = n,
N1 (A) has the binomial distribution Bin(n, r). Thus,
 
P(N_1(A) = k, N_2(A) = l) = P(N(A) = k + l) P(N_1(A) = k | N(A) = k + l)
= ( μ(A)^{k+l} e^{−μ(A)} / (k + l)! ) × ( (k + l)! / (k! l!) ) r^k (1 − r)^l
= ( [rμ(A)]^k e^{−rμ(A)} / k! ) × ( [(1 − r)μ(A)]^l e^{−(1−r)μ(A)} / l! ).
Hence, N1 (A) and N2 (A) are independent Poisson random variables with means
r μ (A) and (1 − r)μ (A), respectively.
If A1 , A2 , . . . are disjoint sets then the pairs

(N1 (A1 ), N2 (A1 )), (N1 (A2 ), N2 (A2 )), . . .

are independent, and hence


   
N1 (A1 ), N1 (A2 ), . . . and N2 (A1 ), N2 (A2 ), . . .

are two independent sequences of independent random variables. If μ (A) = ∞ then


N(A) = ∞ a.s., and since r > 0 and 1 − r > 0, there are a.s. infinitely many red and
green points in A.

Problem 4.12 A model of a rainstorm falling on a level surface (taken to be the


plane R2 ) describes each raindrop by a triple (X, T,V ), where X ∈ R2 is the hori-
zontal position of the centre of the drop, T is the instant at which the drop hits the
plane, and V is the volume of water in the drop. The points (X, T,V ) are assumed
to form a Poisson process on R4 with a given rate λ (x,t, v). The drop forms a wet
circular patch on the surface, with centre X and a radius that increases with time,
the radius at time (T + t) being a given function r(t,V ). Find the probability that
a point ξ ∈ R2 is dry at time τ, and show that the total rainfall in the storm has
expectation
0
vλ (x,t, v)dxdtdv
R4

if this integral converges.



Solution The point ξ ∈ R^2 is wet iff there is a point (X, T, V) of Π with T < τ and
||X − ξ|| < r(τ − T, V)
(there is no problem about whether or not the inequality is strict, since the difference
involves events of zero probability). The number of points of Π satisfying these
two inequalities is Poisson, with mean
μ = ∫ λ(x, t, v) 1( t < τ, ||x − ξ|| < r(τ − t, v) ) dx dt dv.
Hence, the probability that ξ is dry is e^{−μ} (or 0 if μ = +∞). Finally, the formula
for the expected total rainfall,
∑_{(X,T,V)∈Π} V,

is a direct application of Campbell’s theorem.

Problem 4.13 Let M be a Poisson random measure on E = R × [0, π ) with con-


stant intensity λ . For (x, θ ) ∈ E , denote by l(x, θ ) the line in R2 obtained by rotat-
ing the line {(x, y) : y ∈ R} through an angle θ about the origin.
Consider the line process L = M ◦ l −1 .

(i) What is the distribution of the number of lines intersecting the disk Da = {z ∈
R2 : | z |≤ a}?
(ii) What is the distribution of the distance from the origin to the nearest line?
(iii) What is the distribution of the distance from the origin to the kth nearest line?

Solution (i) A line intersects the disk Da = {z ∈ R2 : |z| ≤ a} if and only if its
representative point (x, θ) lies in (−a, a) × [0, π). Hence, the number of lines
intersecting D_a is ∼ Po(2aπλ).

(ii) Let Y be the distance from the origin to the nearest line. Then
 
P(Y ≥ a) = P( M((−a, a) × [0, π)) = 0 ) = exp(−2aλπ),

i.e. Y ∼ Exp(2πλ ).
(iii) Let Y1 ,Y2 , . . . be the distances from the origin to the nearest line, the second
nearest line, and so on. Then the Yi are the atoms of the PRM N on R+ which is
obtained from M by the projection (x, θ ) → |x|. By the mapping theorem, N is the
Poisson process on R+ of rate 2πλ . Hence, Yk ∼ Gam(k, 2λ π ), as Yk = S1 +· · ·+Sk
where Si ∼ Exp(2πλ ), independently.
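Part (iii) can be illustrated numerically: the sketch below (ours; λ and the window size are illustrative, the angle θ is not sampled since the distance depends only on |x|, and the window is taken large enough that fewer than k lines is negligible) compares the empirical mean of Y_k with the Gam(k, 2πλ) mean k/(2πλ).

# Simulation sketch for Problem 4.13(iii).
import math, random

def ordered_distances(lam, x_max):
    """Sample the lines with |x| <= x_max and return the ordered distances |x| from O."""
    area = 2 * x_max * math.pi           # measure of (-x_max, x_max) x [0, pi)
    n, s = 0, random.expovariate(1.0)    # n ~ Po(lam * area) via Exp(1) interarrivals
    while s < lam * area:
        n += 1
        s += random.expovariate(1.0)
    return sorted(abs(random.uniform(-x_max, x_max)) for _ in range(n))

lam, x_max, k, runs = 1.0, 5.0, 3, 20000
mean_Yk = sum(ordered_distances(lam, x_max)[k - 1] for _ in range(runs)) / runs
print(mean_Yk, k / (2 * math.pi * lam))  # Gam(k, 2*pi*lam) has mean k/(2*pi*lam)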

Problem 4.14 One wishes to transmit one of M equiprobable distinct messages


through a noisy channel. The jth message is encoded by the sequence of scalars
a_{jt} (t = 1, 2, . . . , n) which, after transmission, is received as a_{jt} + ε_t (t = 1, 2, . . . , n).
Here the noise random variables ε_t are independent and normally distributed, with
zero mean and with time-dependent variance Var ε_t = v_t.
Find an inference rule at the receiver for which the average probability that the
message value is incorrectly inferred has the upper bound
P(error) ≤ (1/M) ∑_{1≤ j≠k≤M} exp(−d_{jk}/8), (4.6.6)
where
d_{jk} = ∑_{1≤t≤n} (a_{jt} − a_{kt})^2 / v_t.

Suppose that M = 2 and that the transmitted waveforms are subject to the power
constraint ∑_{1≤t≤n} a_{jt}^2 ≤ K, j = 1, 2. Which of the two waveforms minimises the
probability of error?
[Hint: You may assume validity of the bound P(Z ≥ a) ≤ exp(−a^2/2), where Z is
a standard N(0, 1) random variable.]

Solution Let f_j = f_ch(y | X = A_j) be the PDF of receiving a vector y given that a
'waveform' A_j = (a_{jt}) was transmitted. Then
P(error) ≤ (1/M) ∑_j ∑_{k: k≠j} P({y : f_k(y) ≥ f_j(y)} | X = A_j).

Let V be the diagonal matrix with the diagonal elements v_t. In the present case,
f_j = C exp( −(1/2) ∑_{t=1}^{n} (y_t − a_{jt})^2 / v_t ) = C exp( −(1/2) (Y − A_j)^T V^{−1} (Y − A_j) ).
Then if X = A_j and Y = A_j + ε we have
log f_k − log f_j = −(1/2)(A_j − A_k + ε)^T V^{−1} (A_j − A_k + ε) + (1/2) ε^T V^{−1} ε
= −(1/2) d_{jk} − (A_j − A_k)^T V^{−1} ε
= −(1/2) d_{jk} + √(d_{jk}) Z,

where Z ∼ N(0, 1). Thus, by the hint, (4.6.6) follows:
P( f_k ≥ f_j ) = P( Z ≥ √(d_{jk})/2 ) ≤ e^{−d_{jk}/8}.

In the case M = 2 we have to maximise
d_{12} = (A_1 − A_2)^T V^{−1} (A_1 − A_2) = ∑_{1≤t≤n} (a_{1t} − a_{2t})^2 / v_t
subject to
∑_t a_{jt}^2 ≤ K, i.e. A_j^T A_j ≤ K, j = 1, 2.

By Cauchy–Schwarz,
(A_1 − A_2)^T V^{−1} (A_1 − A_2) ≤ ( √(A_1^T V^{−1} A_1) + √(A_2^T V^{−1} A_2) )^2 (4.6.7)

with equality holding when A_1 = const × A_2. Further, in our case V is diagonal, and
(4.6.7) is maximised when A_j^T A_j = K, j = 1, 2. We conclude that
a_{1t} = −a_{2t} = b_t
with b_t non-zero only for t such that v_t is minimal, and ∑_t b_t^2 = K.

Problem 4.15 A random variable Y is distributed on the non-negative integers.


Show that the maximum entropy of Y , subject to EY ≤ M , is

−M log M + (M + 1) log(M + 1)

attained by a geometric distribution with mean M .


A memoryless channel produces outputs Y from non-negative integer-valued
inputs X by
Y = X + ε,

where ε is independent of X , P(ε = 1) = p, P(ε = 0) = 1 − p = q and inputs X are


constrained by EX ≤ q. Show that, provided p ≤ 1/3, the optimal input distribution
is
P(X = r) = (1 + p)^{−1} [ 1/2^{r+1} − (−p/q)^{r+1} ], r = 0, 1, 2, . . . ,

and determine the capacity of the channel.


Describe, very briefly, the problem of determining the channel capacity if p >
1/3.

Solution First, consider the problem:
maximise h(Y) = − ∑_{y≥0} p_y log p_y subject to p_y ≥ 0, ∑_y p_y = 1, ∑_y y p_y = M.
The solution, found by using Lagrangian multipliers, is
p_y = (1 − λ)λ^y, y = 0, 1, . . . , with M = λ/(1 − λ), or λ = M/(M + 1),
with the optimal value
h(Y) = (M + 1) log(M + 1) − M log M.
Next, for g(m) = (m + 1) log(m + 1) − m log m,
g′(m) = log(m + 1) − log m > 0,
implying that the optimal value h(Y) increases with M. Therefore, the maximiser and
the optimal value are the same under the weaker constraint EY ≤ M, as required.

Now, the capacity C = sup[h(Y) − h(Y|X)] = h(Y) − h(ε), and the condition
EX ≤ q implies that EY ≤ q + Eε = q + p = 1. With h(ε) = −p log p − q log q, we
want Y geometric, with M = 1, λ = 1/2, yielding
C = 2 log 2 + p log p + q log q = log(4 p^p q^q).
Then
E z^X = E z^Y / E z^ε = [(1 − λ)/(1 − λz)] (pz + q)^{−1} = 1/((2 − z)(q + pz))
= [ (2 − z)^{−1} + p(q + pz)^{−1} ] / (1 + p)
= (1 + p)^{−1} [ ∑_r (1/2)^{1+r} z^r + (p/q) ∑_r (−p/q)^r z^r ].

If p > 1/3 then p/q > 1/2 and the alternate probabilities become negative,
which means that there is no distribution for X giving an optimum for Y . Then
we would have to maximise
− ∑_y p_y log p_y, subject to p_y = p π_{y−1} + q π_y,
where π_y ≥ 0, ∑_y π_y = 1 and ∑_y y π_y ≤ q.


Problem 4.16 Assuming the bounds on channel capacity asserted by the second
coding theorem, deduce the capacity of a memoryless Gaussian channel.
A channel consists of r independent memoryless Gaussian channels, the noise in
the ith channel having variance v_i, i = 1, 2, . . . , r. The compound channel is subject
to an overall power constraint E[ ∑_i x_{it}^2 ] ≤ p, for each t, where x_{it} is the input of
channel i at time t. Determine the capacity of the compound channel.

Solution For the first part see Section 4.3.


If the power in the ith channel is reduced to p_i, we would have capacity
C′ = (1/2) ∑_i log( 1 + p_i/v_i ).
The actual capacity is given by C = max C′ subject to p_1, . . . , p_r ≥ 0, ∑_i p_i = p.


Thus, we have to maximise the Lagrangian
L = (1/2) ∑_i log( 1 + p_i/v_i ) − λ ∑_i p_i,
with
∂L/∂p_i = (1/2)(v_i + p_i)^{−1} − λ, i = 1, . . . , r,
and the maximum at
p_i = max( 0, 1/(2λ) − v_i ) = ( 1/(2λ) − v_i )_+.

To adjust the constraint, choose λ = λ∗ where λ∗ is determined from
∑_i ( 1/(2λ∗) − v_i )_+ = p.
The existence and uniqueness of λ∗ follows since the LHS monotonically decreases
from +∞ to 0. Thus,
C = (1/2) ∑_i log( 1/(2λ∗ v_i) ),
the sum being taken over those i with 2λ∗ v_i < 1 (for the other channels p_i = 0).
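The water-filling solution is easy to compute numerically; below is a sketch (ours; the noise variances and power budget are illustrative, logarithms are natural so the capacity is in nats) that finds λ∗ by bisection on the water level 1/(2λ).

# Water-filling sketch for Problem 4.16.
import math

def water_filling(v, p, tol=1e-12):
    # total allocated power as a function of the 'water level' w = 1/(2*lambda)
    def used(w):
        return sum(max(0.0, w - vi) for vi in v)
    lo, hi = 0.0, max(v) + p                   # used(lo) = 0 <= p <= used(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if used(mid) > p:
            hi = mid
        else:
            lo = mid
    w = 0.5 * (lo + hi)
    powers = [max(0.0, w - vi) for vi in v]
    capacity = 0.5 * sum(math.log(1 + pi / vi) for pi, vi in zip(powers, v))
    return powers, capacity

v = [1.0, 2.0, 4.0]                            # noise variances (illustrative)
p = 3.0                                        # total power budget
print(water_filling(v, p))                     # powers [2, 1, 0], capacity 0.5*log(4.5) nats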

Problem 4.17 Here we consider random variables taking values in a given


set A (finite, countable or uncountable) whose distributions are determined by
PMFs with respect to a given reference measure μ. Let ψ be a real function
and β a real number. Prove that the maximum h_max(X) of the entropy h(X) =
−∫ f_X(x) log f_X(x) μ(dx) subject to the constraint Eψ(X) = β is achieved at the
random variable X∗ with the PMF
f_{X∗}(x) = (1/Ξ) exp( −γψ(x) ) (4.6.8a)
where Ξ = Ξ(γ) = ∫ exp( −γψ(x) ) μ(dx) is the normalising constant and γ is chosen
so that
Eψ(X∗) = ∫ (ψ(x)/Ξ) exp( −γψ(x) ) μ(dx) = β. (4.6.8b)
Assume that the value γ with the property ∫ (ψ(x)/Ξ) exp( −γψ(x) ) μ(dx) = β exists.
Show that if, in addition, function ψ is non-negative, then, for any given β > 0,
the PMF fX ∗ from (4.6.8a), (4.6.8b) maximises the entropy h(X) under a wider
constraint Eψ (X) ≤ β .
Consequently, calculate the maximal value of h(X) subject to Eψ(X) ≤ β, in
the following cases: (i) when A is a finite set, μ is a positive measure on A (with
μ_i = μ({i}) = 1/μ(A) where μ(A) = ∑_{j∈A} μ_j) and ψ(x) ≡ 1, x ∈ A; (ii) when A is
an arbitrary set, μ is a positive measure on A with μ(A) < ∞ and ψ(x) ≡ 1, x ∈ A;
(iii) when A = R is the real line, μ is the Lebesgue measure and ψ(x) = |x|; (iv)
when A = R^d, μ is the d-dimensional Lebesgue measure and ψ(x) = ∑_{1≤i,j≤d} K_{ij} x_i x_j,
where K = (K_{ij}) is a d × d positive definite real matrix.

Solution With ln f_{X∗}(x) = −γψ(x) − ln Ξ, we use the Gibbs inequality:
h(X) = −∫ f_X(x) ln f_X(x) μ(dx) ≤ ∫ f_X(x) [ γψ(x) + ln Ξ ] μ(dx)
= ∫ f_{X∗}(x) [ γψ(x) + ln Ξ ] μ(dx) = h(X∗)

with equality if and only if X ∼ X ∗ . This proves the first assertion.


If ψ ≥ 0 then Eψ(X) ≥ 0 for every X and, with γ ≥ 0, the Gibbs bound above gives
h(X) ≤ γ Eψ(X) + ln Ξ ≤ γβ + ln Ξ = h(X∗) whenever Eψ(X) ≤ β, so the same
PMF maximises the entropy under the wider constraint.
Bibliography

[1] V. Anantharam, F. Baccelli. A Palm theory approach to error exponents. In


Proceedings of the 2008 IEEE Symposium on Information Theory, Toronto,
pp. 1768–1772, 2008.
[2] J. Adámek. Foundations of Coding: Theory and Applications of Error-Correcting
Codes, with an Introduction to Cryptography and Information Theory. Chichester:
Wiley, 1991.
[3] D. Applebaum. Probability and Information: An Integrated Approach. Cambridge:
Cambridge University Press, 1996.
[4] R.B. Ash. Information Theory. New York: Interscience, 1965.
[5] E.F. Assmus, Jr., J.D. Key. Designs and their Codes. Cambridge: Cambridge
University Press, 1992.
[6] K.A. Arwini, C.T.J. Dodson. Information Geometry: Near Randomness and Near
Independence. Lecture notes in mathematics, 1953. Berlin: Springer, 2008.
[7] D. Augot, M. Stepanov. A note on the generalisation of the Guruswami–Sudan
list decoding algorithm to Reed–Muller codes. In Gröbner Bases, Coding, and
Cryptography. RISC Book Series. Springer, Heidelberg, 2009.
[8] R.U. Ayres. Manufacturing and Human Labor as Information Processes. Laxen-
burg: International Institute for Applied System Analysis, 1987.
[9] A.V. Balakrishnan. Communication Theory (with contributions by J.W. Carlyle
et al.). New York: McGraw-Hill, 1968.
[10] J. Baylis. Error-Correcting Codes: A Mathematical Introduction. London:
Chapman & Hall, 1998.
[11] A. Betten et al. Error-Correcting Linear Codes Classification by Isometry and
Applications. Berlin: Springer, 2006.
[12] T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression.
Englewood Cliffs, NJ: Prentice-Hall, 1971.
[13] E.R. Berlekamp. A Survey of Algebraic Coding Theory. Wien: Springer, 1972.
[14] E.R. Berlekamp. Algebraic Coding Theory. New York: McGraw-Hill, 1968.
[15] J. Berstel, D. Perrin. Theory of Codes. Orlando, FL: Academic Press, 1985.
[16] J. Bierbrauer. Introduction to Coding Theory. Boca Raton, FL: Chapman &
Hall/CRC, 2005.
[17] P. Billingsley. Ergodic Theory and Information. New York: Wiley, 1965.

501
502 Bibliography

[18] R.E. Blahut. Principles and Practice of Information Theory. Reading, MA:
Addison-Wesley, 1987.
[19] R.E. Blahut. Theory and Practice of Error Control Codes. Reading, MA: Addison-
Wesley, 1983. See also Algebraic Codes for Data Transmission. Cambridge:
Cambridge University Press, 2003.
[20] R.E. Blahut. Algebraic Codes on Lines, Planes, and Curves. Cambridge:
Cambridge University Press, 2008.
[21] I.F. Blake, R.C. Mullin. The Mathematical Theory of Coding. New York: Academic
Press, 1975.
[22] I.F. Blake, R.C. Mullin. An Introduction to Algebraic and Combinatorial Coding
Theory. New York: Academic Press, 1976.
[23] I.F. Blake (ed). Algebraic Coding Theory: History and Development. Stroudsburg,
PA: Dowden, Hutchinson & Ross, 1973.
[24] N. Blachman. Noise and its Effect on Communication. New York: McGraw-Hill,
1966.
[25] R.C. Bose, D.K. Ray-Chaudhuri. On a class of errors, correcting binary group
codes. Information and Control, 3(1), 68–79, 1960.
[26] W. Bradley, Y.M. Suhov. The entropy of famous reals: some empirical results.
Random and Computational Dynamics, 5, 349–359, 1997.
[27] A.A. Bruen, M.A. Forcinito. Cryptography, Information Theory, and Error-
Correction: A Handbook for the 21st Century. Hoboken, NJ: Wiley-Interscience,
2005.
[28] J.A. Buchmann. Introduction to Cryptography. New York: Springer-Verlag, 2002.
[29] P.J. Cameron, J.H. van Lint. Designs, Graphs, Codes and their Links. Cambridge:
Cambridge University Press, 1991.
[30] J. Castiñeira Moreira, P.G. Farrell. Essentials of Error-Control Coding. Chichester:
Wiley, 2006.
[31] W.G. Chambers. Basics of Communications and Coding. Oxford: Clarendon, 1985.
[32] G.J. Chaitin. The Limits of Mathematics: A Course on Information Theory and the
Limits of Formal Reasoning. Singapore: Springer, 1998.
[33] G. Chaitin. Information-Theoretic Incompleteness. Singapore: World Scientific,
1992.
[34] G. Chaitin. Algorithmic Information Theory. Cambridge: Cambridge University
Press, 1987.
[35] F. Conway, J. Siegelman. Dark Hero of the Information Age: In Search of Norbert
Wiener, the Father of Cybernetics. New York: Basic Books, 2005.
[36] T.M. Cover, J.M. Thomas. Elements of Information Theory. New York: Wiley,
2006.
[37] I. Csiszár, J. Körner. Information Theory: Coding Theorems for Discrete Memo-
ryless Systems. New York: Academic Press, 1981; Budapest: Akadémiai Kiadó,
1981.
[38] W.B. Davenport, W.L. Root. Random Signals and Noise. New York: McGraw Hill,
1958.
[39] A. Dembo, T. M. Cover, J. A. Thomas. Information theoretic inequalities. IEEE
Transactions on Information Theory, 37, (6), 1501–1518, 1991.
Bibliography 503

[40] R.L. Dobrushin. Taking the limit of the argument of entropy and information func-
tions. Teoriya Veroyatn. Primen., 5, (1), 29–37, 1960; English translation: Theory
of Probability and its Applications, 5, 25–32, 1960.
[41] F. Dyson. The Tragic Tale of a Genius. New York Review of Books, July 14, 2005.
[42] W. Ebeling. Lattices and Codes: A Course Partially Based on Lectures by F. Hirze-
bruch. Braunschweig/Wiesbaden: Vieweg, 1994.
[43] N. Elkies. Excellent codes from modular curves. STOC’01: Proceedings of the
33rd Annual Symposium on Theory of Computing (Hersonissos, Crete, Greece),
pp. 200–208, NY: ACM, 2001.
[44] S. Engelberg. Random Signals and Noise: A Mathematical Introduction. Boca Ra-
ton, FL: CRC/Taylor & Francis, 2007.
[45] R.M. Fano. Transmission of Information: A Statistical Theory of Communication.
New York: Wiley, 1961.
[46] A. Feinstein. Foundations of Information Theory. New York: McGraw-Hill, 1958.
[47] G.D. Forney. Concatenated Codes. Cambridge, MA: MIT Press, 1966.
[48] M. Franceschetti, R. Meester. Random Networks for Communication. From Sta-
tistical Physics to Information Science. Cambridge: Cambridge University Press,
2007.
[49] R. Gallager. Information Theory and Reliable Communications. New York: Wiley,
1968.
[50] A. Gofman, M. Kelbert, Un upper bound for Kullback–Leibler divergence with a
small number of outliers. Mathematical Communications, 18, (1), 75–78, 2013.
[51] S. Goldman. Information Theory. Englewood Cliffs, NJ: Prentice-Hall, 1953.
[52] C.M. Goldie, R.G.E. Pinch. Communication Theory. Cambridge: Cambridge
University Press, 1991.
[53] O. Goldreich. Foundations of Cryptography, Vols 1, 2. Cambridge: Cambridge
University Press, 2001, 2004.
[54] V.D. Goppa. Geometry and Codes. Dordrecht: Kluwer, 1988.
[55] S. Gravano. Introduction to Error Control Codes. Oxford: Oxford University Press,
2001.
[56] R.M. Gray. Source Coding Theory. Boston: Kluwer, 1990.
[57] R.M. Gray. Entropy and Information Theory. New York: Springer-Verlag, 1990.
[58] R.M. Gray, L.D. Davisson (eds). Ergodic and Information Theory. Stroudsburg,
CA: Dowden, Hutchinson & Ross, 1977 .
[59] V. Guruswami, M. Sudan. Improved decoding of Reed–Solomon codes and alge-
braic geometry codes. IEEE Trans. Inform. Theory, 45, (6), 1757–1767, 1999.
[60] R.W. Hamming. Coding and Information Theory. 2nd ed. Englewood Cliffs, NJ:
Prentice-Hall, 1986.
[61] T.S. Han. Information-Spectrum Methods in Information Theory. New York:
Springer-Verlag, 2002.
[62] D.R. Hankerson, G.A. Harris, P.D. Johnson, Jr. Introduction to Information Theory
and Data Compression. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC, 2003.
[63] D.R. Hankerson et al. Coding Theory and Cryptography: The Essentials. 2nd ed.
New York: M. Dekker, 2000. (Earlier version: D. G. Hoffman et al. Coding Theory:
The Essentials. New York: M. Dekker, 1991.)
[64] W.E. Hartnett. Foundations of Coding Theory. Dordrecht: Reidel, 1974.
504 Bibliography

[65] S.J. Heims. John von Neumann and Norbert Wiener: From Mathematics to the
Technologies of Life and Death. Cambridge, MA: MIT Press, 1980.
[66] C. Helstrom. Statistical Theory of Signal Detection. 2nd ed. Oxford: Pergamon
Press, 1968.
[67] C.W. Helstrom. Elements of Signal Detection and Estimation. Englewood Cliffs,
NJ: Prentice-Hall, 1995.
[68] R. Hill. A First Course in Coding Theory. Oxford: Oxford University Press, 1986.
[69] T. Ho, D.S. Lun. Network Coding: An Introduction. Cambridge: Cambridge Uni-
versity Press, 2008.
[70] A. Hocquenghem. Codes correcteurs d’erreurs. Chiffres, 2, 147–156, 1959.
[71] W.C. Huffman, V. Pless. Fundamentals of Error-Correcting Codes. Cambridge:
Cambridge University Press, 2003.
[72] J.F. Humphreys, M.Y. Prest. Numbers, Groups, and Codes. 2nd ed. Cambridge:
Cambridge University Press, 2004.
[73] S. Ihara. Information Theory for Continuous Systems. Singapore: World Scientific,
1993 .
[74] F.M. Ingels. Information and Coding Theory. Scranton: Intext Educational Pub-
lishers, 1971.
[75] I.M. James. Remarkable Mathematicians. From Euler to von Neumann.
Cambridge: Cambridge University Press, 2009 .
[76] E.T. Jaynes. Papers on Probability, Statistics and Statistical Physics. Dordrecht:
Reidel, 1982.
[77] F. Jelinek. Probabilistic Information Theory. New York: McGraw-Hill, 1968.
[78] G.A. Jones, J.M. Jones. Information and Coding Theory. London: Springer, 2000.
[79] D.S. Jones. Elementary Information Theory. Oxford: Clarendon Press, 1979.
[80] O. Johnson. Information Theory and the Central Limit Theorem. London: Imperial
College Press, 2004.
[81] J. Justensen. A class of constructive asymptotically good algebraic codes. IEEE
Transactions Information Theory, 18(5), 652–656, 1972.
[82] M. Kelbert, Y. Suhov. Continuity of mutual entropy in the large signal-to-noise
ratio limit. In Stochastic Analysis 2010, pp. 281–299, 2010. Berlin: Springer.
[83] N. Khalatnikov. Dau, Centaurus and Others. Moscow: Fizmatlit, 2007.
[84] A.Y. Khintchin. Mathematical Foundations of Information Theory. New York:
Dover, 1957.
[85] T. Klove. Codes for Error Detection. Singapore: World Scientific, 2007.
[86] N. Koblitz. A Course in Number Theory and Cryptography. New York: Springer,
1993 .
[87] H. Krishna. Computational Complexity of Bilinear Forms: Algebraic Coding The-
ory and Applications of Digital Communication Systems. Lecture notes in control
and information sciences, Vol. 94. Berlin: Springer-Verlag, 1987.
[88] S. Kullback. Information Theory and Statistics. New York: Wiley, 1959.
[89] S. Kullback, J.C. Keegel, J.H. Kullback. Topics in Statistical Information Theory.
Berlin: Springer, 1987.
[90] H.J. Landau, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, II. Bell System Technical Journal, 64–84, 1961.
Bibliography 505

[91] H.J. Landau, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, III. The dimension of the space of essentially time- and band-limited
signals. Bell System Technical Journal, 1295–1336, 1962.
[92] R. Lidl, H. Niederreiter. Finite Fields. Cambridge: Cambridge University Press,
1997.
[93] R. Lidl, G. Pilz. Applied Abstract Algebra. 2nd ed. New York: Wiley, 1999.
[94] E.H. Lieb. Proof of entropy conjecture of Wehrl. Commun. Math. Phys., 62, (1),
35–41, 1978.
[95] S. Lin. An Introduction to Error-Correcting Codes. Englewood Cliffs, NJ; London:
Prentice-Hall, 1970.
[96] S. Lin, D.J. Costello. Error Control Coding: Fundamentals and Applications.
Englewood Cliffs, NJ: Prentice-Hall, 1983.
[97] S. Ling, C. Xing. Coding Theory. Cambridge: Cambridge University Press, 2004.
[98] J.H. van Lint. Introduction to Coding Theory. 3rd ed. Berlin: Springer, 1999.
[99] J.H. van Lint, G. van der Geer. Introduction to Coding Theory and Algebraic
Geometry. Basel: Birkhäuser, 1988.
[100] J.C.A. van der Lubbe. Information Theory. Cambridge: Cambridge University
Press, 1997.
[101] R.E. Lewand. Cryptological Mathematics. Washington, DC: Mathematical Asso-
ciation of America, 2000.
[102] J.A. Llewellyn. Information and Coding. Bromley: Chartwell-Bratt; Lund:
Studentlitteratur, 1987.
[103] M. Loève. Probability Theory. Princeton, NJ: van Nostrand, 1955.
[104] D.G. Luenberger. Information Science. Princeton, NJ: Princeton University Press,
2006.
[105] D.J.C. Mackay. Information Theory, Inference and Learning Algorithms.
Cambridge: Cambridge University Press, 2003.
[106] H.B. Mann (ed). Error-Correcting Codes. New York: Wiley, 1969 .
[107] M. Marcus. Dark Hero of the Information Age: In Search of Norbert Wiener, the
Father of Cybernetics. Notices of the AMS 53, (5), 574–579, 2005.
[108] A. Marshall, I. Olkin. Inequalities: Theory of Majorization and its Applications.
New York: Academic Press, 1979 .
[109] V.P. Maslov, A.S. Chernyi. On the minimization and maximization of entropy in
various disciplines. Theory Probab. Appl. 48, (3), 447–464, 2004.
[110] F.J. MacWilliams, N.J.A. Sloane. The Theory of Error-Correcting Codes, Vols I,
II. Amsterdam: North-Holland, 1977.
[111] R.J. McEliece. The Theory of Information and Coding. Reading, MA: Addison-
Wesley, 1977. 2nd ed. Cambridge: Cambridge University Press, 2002.
[112] R. McEliece. The Theory of Information and Coding. Student ed. Cambridge:
Cambridge University Press, 2004.
[113] A. Menon, R.M. Buecher, J.H. Read. Impact of exclusion region and spreading in
spectrum-sharing ad hoc networks. ACM 1-59593-510-X/06/08, 2006 .
[114] R.A. Mollin. RSA and Public-Key Cryptography. New York: Chapman & Hall,
2003.
[115] R.H. Morelos-Zaragoza. The Art of Error-Correcting Coding. 2nd ed. Chichester:
Wiley, 2006.
506 Bibliography

[116] G.L. Mullen, C. Mummert. Finite Fields and Applications. Providence, RI:
American Mathematical Society, 2007.
[117] A. Myasnikov, V. Shpilrain, A. Ushakov. Group-Based Cryptography. Basel:
Birkhäuser, 2008.
[118] G. Nebe, E.M. Rains, N.J.A. Sloane. Self-Dual Codes and Invariant Theory. New
York: Springer, 2006.
[119] H. Niederreiter, C. Xing. Rational Points on Curves over Finite Fields: Theory and
Applications. Cambridge: Cambridge University Press, 2001.
[120] W.W. Peterson, E.J. Weldon. Error-Correcting Codes. 2nd ed. Cambridge,
MA: MIT Press, 1972. (Previous ed. W.W. Peterson. Error-Correcting Codes.
Cambridge, MA: MIT Press, 1961.)
[121] M.S. Pinsker. Information and Information Stability of Random Variables and Pro-
cesses. San Francisco: Holden-Day, 1964.
[122] V. Pless. Introduction to the Theory of Error-Correcting Codes. 2nd ed. New York:
Wiley, 1989.
[123] V.S. Pless, W.C. Huffman (eds). Handbook of Coding Theory, Vols 1, 2. Amster-
dam: Elsevier, 1998.
[124] P. Piret. Convolutional Codes: An Algebraic Approach. Cambridge, MA: MIT
Press, 1988.
[125] O. Pretzel. Error-Correcting Codes and Finite Fields. Oxford: Clarendon Press,
1992; Student ed. 1996.
[126] T.R.N. Rao. Error Coding for Arithmetic Processors. New York: Academic Press,
1974.
[127] M. Reed, B. Simon. Methods of Modern Mathematical Physics, Vol. II. Fourier
analysis, self-adjointness. New York: Academic Press, 1975.
[128] A. Rényi. A Diary on Information Theory. Chichester: Wiley, 1987; initially published Budapest: Akadémiai Kiadó, 1984.
[129] F.M. Reza. An Introduction to Information Theory. New York: Constable, 1994.
[130] S. Roman. Coding and Information Theory. New York: Springer, 1992.
[131] S. Roman. Field Theory. 2nd ed. New York: Springer, 2006.
[132] T. Richardson, R. Urbanke. Modern Coding Theory. Cambridge: Cambridge Uni-
versity Press, 2008.
[133] R.M. Roth. Introduction to Coding Theory. Cambridge: Cambridge University
Press, 2006.
[134] B. Ryabko, A. Fionov. Basics of Contemporary Cryptography for IT Practitioners.
Singapore: World Scientific, 2005.
[135] W.E. Ryan, S. Lin. Channel Codes: Classical and Modern. Cambridge: Cambridge
University Press, 2009.
[136] T. Schürmann, P. Grassberger. Entropy estimation of symbol sequences. Chaos, 6,
(3), 414–427, 1996.
[137] P. Seibt. Algorithmic Information Theory: Mathematics of Digital Information Pro-
cessing. Berlin: Springer, 2006.
[138] C.E. Shannon. A mathematical theory of cryptography. Bell Lab. Tech. Memo.,
1945.
[139] C.E. Shannon. A mathematical theory of communication. Bell System Technical
Journal, 27, July, October, 379–423, 623–658, 1948.
[140] C.E. Shannon: Collected Papers. N.J.A. Sloane, A.D. Wyner (eds). New York:
IEEE Press, 1993.
[141] C.E. Shannon, W. Weaver. The Mathematical Theory of Communication. Urbana,
IL: University of Illinois Press, 1949.
[142] P.C. Shields. The Ergodic Theory of Discrete Sample Paths. Providence, RI:
American Mathematical Society, 1996.
[143] M.S. Shrikhande, S.S. Sane. Quasi-Symmetric Designs. Cambridge: Cambridge
University Press, 1991.
[144] S. Simic. Best possible global bounds for Jensen functionals. Proc. AMS, 138, (7),
2457–2462, 2010.
[145] A. Sinkov. Elementary Cryptanalysis: A Mathematical Approach. 2nd ed. revised
and updated by T. Feil. Washington, DC: Mathematical Association of America,
2009.
[146] D. Slepian, H.O. Pollak. Prolate spheroidal wave functions, Fourier analysis and
uncertainty, Vol. I. Bell System Technical Journal, 43–64, 1961.
[147] W. Stallings. Cryptography and Network Security: Principles and Practice. 5th ed.
Boston, MA: Prentice Hall; London: Pearson Education, 2011.
[148] H. Stichtenoth. Algebraic Function Fields and Codes. Berlin: Springer, 1993.
[149] D.R. Stinson. Cryptography: Theory and Practice. 2nd ed. Boca Raton, FL;
London: Chapman & Hall/CRC, 2002.
[150] D. Stoyan, W.S. Kendall, J. Mecke. Stochastic Geometry and its Applications. Berlin: Akademie-Verlag, 1987.
[151] C. Schlegel, L. Perez. Trellis and Turbo Coding. New York: Wiley, 2004.
[152] Š. Šujan. Ergodic Theory, Entropy and Coding Problems of Information Theory.
Praha: Academia, 1983.
[153] P. Sweeney. Error Control Coding: An Introduction. New York: Prentice Hall,
1991.
[154] Te Sun Han, K. Kobayashi. Mathematics of Information and Coding. Providence,
RI: American Mathematical Society, 2002.
[155] T.M. Thompson. From Error-Correcting Codes through Sphere Packings to Simple
Groups. Washington, DC: Mathematical Association of America, 1983.
[156] R. Togneri, C.J.S. deSilva. Fundamentals of Information Theory and Coding
Design. Boca Raton, FL: Chapman & Hall/CRC, 2002.
[157] W. Trappe, L.C. Washington. Introduction to Cryptography: With Coding Theory.
2nd ed. Upper Saddle River, NJ: Pearson Prentice Hall, 2006.
[158] M.A. Tsfasman, S.G. Vlǎdut. Algebraic-Geometric Codes. Dordrecht: Kluwer
Academic, 1991.
[159] M. Tsfasman, S. Vlǎdut, T. Zink. Modular curves, Shimura curves and Goppa
codes, better than Varshamov–Gilbert bound. Mathematische Nachrichten, 109,
21–28, 1982.
[160] M. Tsfasman, S. Vlǎdut, D. Nogin. Algebraic Geometric Codes: Basic Notions.
Providence, RI: American Mathematical Society, 2007.
[161] M.J. Usher. Information Theory for Information Technologists. London: Macmil-
lan, 1984.
[162] M.J. Usher, C.G. Guy. Information and Communication for Engineers. Basingstoke: Macmillan, 1997.
[163] I. Vajda. Theory of Statistical Inference and Information. Dordrecht: Kluwer, 1989.
[164] S. Verdú. Multiuser Detection. New York: Cambridge University Press, 1998.
[165] S. Verdú, D. Guo. A simple proof of the entropy–power inequality. IEEE Trans.
Inform. Theory, 52, (5), 2165–2166, 2006.
[166] L.R. Vermani. Elements of Algebraic Coding Theory. London: Chapman & Hall,
1996.
[167] B. Vucetic, J. Yuan. Turbo Codes: Principles and Applications. Norwell, MA:
Kluwer, 2000.
[168] G. Wade. Coding Techniques: An Introduction to Compression and Error Control.
Basingstoke: Palgrave, 2000.
[169] J.L. Walker. Codes and Curves. Providence, RI: American Mathematical Society,
2000.
[170] D. Welsh. Codes and Cryptography. Oxford: Oxford University Press, 1988.
[171] N. Wiener. Cybernetics or Control and Communication in Animal and Machine.
Cambridge, MA: MIT Press, 1948; 2nd ed: 1961, 1962.
[172] J. Wolfowitz. Coding Theorems of Information Theory. Berlin: Springer, 1961; 3rd
ed: 1978.
[173] A.D. Wyner. The capacity of the band-limited Gaussian channel. Bell System Technical Journal, 359–395, 1966.
[174] A.D. Wyner. The capacity of the product of channels. Information and Control,
423–433, 1966.
[175] C. Xing. Nonlinear codes from algebraic curves beating the Tsfasman–Vlǎdut–
Zink bound. IEEE Trans. Inform. Theory, 49, 1653–1657, 2003.
[176] A.M. Yaglom, I.M. Yaglom. Probability and Information. Dordrecht, Holland:
Reidel, 1983.
[177] R. Yeung. A First Course in Information Theory. Boston: Kluwer Academic, 1992;
2nd ed. New York: Kluwer, 2002.
Index
additive stream cipher, 463 bound
algebra (a commutative ring and a linear space), BCH bound, 237, 295
318 Elias bound, 177
group algebra, 317 Gilbert bound, 198
polynomial algebra, 214 Gilbert–Varshamov bound, 154
σ -algebra, 440 Griesmer bound, 197
algebraic-geometric code, 340 Hamming bound, 150
algorithm, 9 Johnson bound, 177
Berlekamp–Massey (BM) decoding algorithm for linear programming bound, 322
BCH codes, 240 Plotkin bound, 155
Berlekamp-Massey (BM) algorithm for solving Singleton bound, 154
linear equations, 460 bar-product, 152
division algorithm for polynomials, 214
Euclid algorithm for integers, 473 capacity, 61
extended Euclid algorithm for integers, 470 capacity of a discrete channel, 61
Euclid algorithm for polynomials, 242 capacity of a memoryless Gaussian channel with
Guruswami–Sudan (GS) decoding algorithm for white noise, 374
Reed–Solomon codes, 298 capacity of a memoryless Gaussian channel with
Huffman encoding algorithm, 9 coloured noise, 375
alphabet, 3 operational channel capacity, 102
source alphabet, 8 character (as a digit or a letter or a symbol), 53
coder (encoding) alphabet, 3 character (of a homomorphism), 313
channel input alphabet, 60 modular character, 314
channel output alphabet, 65 trivial, or principal, character, 313
asymptotic equipartition property, 44 character transform, 319
asymptotically good sequence of codes, 78 characteristic of a field, 269
automorphism, 283 channel, 60
additive Gaussian channel (AGC), 368
band-limited signal, 411 memoryless Gaussian channel (MGC), 368
bandwidth, 409 memoryless additive Gaussian channel (MAGC),
basis, 149, 184 366
BCH (Bose–Ray-Chaudhuri–Hocquenghem) bound, memoryless binary channel (MBC), 60
or BCH theorem, 237, 295 memoryless binary symmetric channel (MBSC), 60
BCH code, 213 noiseless channel, 103
BCH code in a narrow sense, 235 channel capacity, 61
binary BCH code in a narrow sense, 235 operational channel capacity, 102
Bernoulli source, 3 check matrix: see parity-check matrix
bit (a unit of entropy), 9 cipher (or a cryptosystem), 463
bit commitment cryptosystem, 468 additive stream cipher, 463
cipher (or a cryptosystem) (cont.) core polynomial of a field, 231
one-time pad cipher, 466 coset, 192
public-key cipher, 467 cyclotomic coset, 285
ciphertext, 468 leader of a coset, 192
code, or encoding, viii, 4 cryptosystem (or a cipher), 468
alternant code, 332 bit commitment cryptosystem, 468
BCH code, 213 ElGamal cryptosystem, 475
binary code, 10, 95 public key cryptosystem, 468
cardinality of a code, 253 RSA (Rivest–Shamir–Adelman) cryptosystem,
cyclic code, 216 468
decipherable code, 14 Rabin, or Rabin–Williams cryptosystem, 473
dimension of a linear code, 149 cyclic group, 231
D error detecting code, 147 generator of a cyclic group
dual code, 153 cyclic shift, 216
equivalent codes, 190
E error correcting code, 147 data-processing inequality, 80
Golay code, 151 detailed balance equations (DBEs), 56
Goppa code, 160, 334 decoder, or a decoding rule, 65
Hamming code, 199 geometric (or minimal distance) decoding rule, 163
Huffman code, 9 ideal observer (IO) decoding rule, 66
information rate of a code, 147 maximum likelihood (ML) decoding rule, 66
Justesen code, 240, 332 joint typicality (JT) decoder, 372
lossless code, 4 decoding, 167
linear code, 148 decoding alternant codes, 337
maximal distance separating (MDS), 155 decoding BCH codes, 239, 310
parity-check code, 149 decoding cyclic codes, 214
perfect code, 151 decoding Hamming codes, 200
prefix-free code, 4 list decoding, 192, 405
random code, 68, 372 decoding Reed–Muller codes, 209
rank of a linear code, 184 decoding Reed–Solomon codes, 292
Reed–Muller (RM) code, 203 decoding Reed–Solomon codes by the
Reed–Solomon code, 256, 291 Guruswami–Sudan algorithm, 299
repetition code, 149 syndrome decoding, 193
reversible cyclic code, 230 decrypt function, 469
self-dual code, 201, 227 degree of a polynomial, 206, 214
self-orthogonal, 227 density of a probability distribution (PDF), 86
symplex code, 194 differential entropy, 86
codebook, 67 digit, 3
random codebook, 68 dimension, 149
coder, or encoder, 3 dimension of a code, 149
codeword, 4 dimension of a linear representation, 314
random codeword, 6 discrete Fourier transform (FFT), 296
coding: see encoding discrete-time Markov chain (DTMC), 1, 3
coloured noise, 374 discrete logarithm, 474
concave, 19, 32 distributed system, or a network (of transmitters), 436
strictly concave, 32 Dirac δ -function, 318
concavity, 20 distance, 20
conditionally independent, 26 Kullback–Leibler distance, 20
conjugacy, 281 Hamming distance, 144
conjugate, 229 minimal distance of a code, 147
convergence almost surely (a.s.), 131 distance enumerator polynomial, 322
convergence in probability, 43 divisor, 217
convex, 32 greatest common divisor (gcd), 223
strictly convex, 104 dot-product, 153
convexity, 142 doubly stochastic (Cox) random process, 492
electronic signature, 469, 476 generator (of a cyclic code), 218
encoding, or coding, vii, 4 minimal degree generator polynomial, 218
Huffman encoding, 9 generator (of a cyclic group), 232
Shannon–Fano encoding, 9 geometric (or minimal distance) decoding rule, 163
random coding, 67 group, 146
entropy, vii, 7 group algebra, 317
axiomatic definition of entropy, 36 commutative, or Abelian, group, 146
binary entropy, 7 cyclic group, 231
conditional entropy, 20 linear representation of a group, 314
differential entropy, 86 generalized function, 412
entropy of a random variable, 18 greatest common divisor (gcd), 223
entropy of a probability distribution, 18
joint entropy, 20 ideal observer (IO) decoding rule, 66
mutual entropy, 28 ideal of a ring, 217
entropy–power inequality, 92 principal ideal, 219
q-ary entropy, 7 identity (for weight enumerator polynomials), 258
entropy rate, vii, 41 abstract MacWilliams identity, 315
relative entropy, 20 MacWilliams identity for a linear code, 258, 313
encrypt function, 468 independent identically distributed (IID) random
ergodic random process (stationary), 397 variables, 1, 3
ergodic transformation of a probability space, 397 inequality, 4
error locator, 311 Brunn–Minkovski inequality, 93
error locator polynomial, 239, 311 Cauchy–Schwarz inequality, 124
error-probability, 58 Chebyshev inequality, 128
extension of a code, 151 data-processing inequality, 80
parity-check extension, 151 entropy–power inequality, 92
extension field, 261 Fano inequality, 25
generalized Fano inequality, 27
factor (as a divisor), 39 Gibbs inequality, 17
irreducible factor, 219 Hadamard inequality, 91
prime factor, 39 Kraft inequality, 4
factorization, 230 Ky–Fan inequality, 91
fading of a signal, 447 log-sum inequality, 103
power fading, 447 Markov inequality, 408
Rayleigh fading, 447 pooling inequalities, 24
feedback shift register, 453 information, 2, 18
linear feedback shift register (LFSR), 454 mutual information, or mutual entropy, 28
feedback polynomial, 454 information rate, 15
field (a commutative ring with inverses), 146, 230 information source (random source), 2, 44
extension field, 261 Bernoulli information source, 3
Galois field, 272 Markov information source, 3
finite field, 194 information symbols, 209
polynomial field, 231 initial fill, 454
primitive element of a field, 230, 232 intensity (of a random measure), 437
splitting field, 236, 271 intensity measure, 437
Frobenius map, 283
joint entropy, 20
Gaussian channel, 366 joint input/output distribution (of a channel), 67
additive Gaussian channel (AGC), 368 joint typicality (JT) decoder, 372
memoryless Gaussian channel (MGC), 368
memoryless additive Gaussian channel (MAGC), key (as a part of a cipher), 466
366 decoding key (a label of a decoding, or
Gaussian coloured noise, 374 decrypting, map), 469
Gaussian white noise, 368 encoding key (a label of an encoding, or
Gaussian random process, 369 encrypting, map), 468
generating matrix, 185 random key of a one-pad cipher, 466
key (as a part of a cipher) (cont.) measure (as a countably additive function of a set),
private key, 470 366
public key, 469 intensity (or mean) measure, 436
secret key, 473 non-atomic measure, 436
Karhunen–Loève decomposition, 426 Poisson random measure, 436
product-measure, 371
law of large numbers, 34 random measure, 436
strong law of large numbers, 438 reference measure, 372
leader of a coset, 192 σ -finite, 436
least common multiple (lcm), 223 Möbius function, 277
lemma Möbius inversion formula, 278
Borel–Cantelli lemma, 418 moment generating function, 442
Nyquist–Shannon–Kotelnikov–Whittaker lemma,
431 network: see distributed system
letter, 2 supercritical network, 449
linear code, 148 network information theory, 436
linear representation of a group, 314 noise (in a channel), 2, 70
space of a linear representation, 314 Gaussian coloured noise, 374
dimension of a linear representation, 314 Gaussian white noise, 368
linear space, 146 noiseless channel, 103
linear subspace, 148 noisy (or fully noisy) channel, 81
linear feedback shift register (LFSR), 454
auxiliary, or feedback, polynomial of an LFSR, 454 one-time pad cipher, 466
operational channel capacity, 102
Markov chain, 1, 3 order of an element, 267
discrete-time Markov chain (DTMC), 1, 3 order of a polynomial, 231
coupled Markov chain, 50 orthogonal, 185
irreducible and aperiodic Markov chain, 128 ortho-basis, 430
kth-order Markov chain approximation, 407 orthogonal complement, 185
second-order Markov chain, 131 orthoprojection, 375
transition matrix of a Markov chain, 3 self-orthogonal, 227
Markov inequality, 408 output stream of a register, 454
Markov property, 33
strong Markov property, 50 parity-check code, 149
Markov source, 3 parity-check extension, 151
stationary Markov source, 3 parity-check matrix, 186
Markov triple, 33 plaintext, 468
Matérn process (with a hard core), 451 Poisson process, 436
first model of the Matérn process, 451 Poisson random measure, 436
second model of the Matérn process, 451 polynomial, 206
matrix, 13 algebra, polynomial, 214
covariance matrix, 88 degree of a polynomial, 206, 214
generating matrix, 185 distance enumerator polynomial, 322
generating check matrix, canonical, or standard, error locator polynomial, 239
form of, 189 Goppa polynomial, 335
parity-check matrix, 186 irreducible polynomial, 219
parity-check matrix, canonical, or standard, form Mattson–Solomon polynomial, 296
of, 189 minimal polynomial, 236
parity-check matrix of a Hamming code, 191 order of a polynomial, 231
positive definite matrix, 91 reducible polynomial, 221
recursion matrix, 174 primitive polynomial, 230, 267
Töplitz matrix, 93 Kravchuk polynomial, 320
transition matrix of a Markov chain, 3 weight enumerator polynomial, 319, 351
transition matrix, doubly stochastic, 34 probability distribution, vii, 1
Vandermonde matrix, 295 conditional probability, 1
maximum likelihood (ML) decoding rule, 66 probability density function (PDF), 86
equiprobable, or uniform, distribution, 3, 22 register, 453
exponential distribution (with exponential density), feedback shift register, 453
89 linear feedback shift register (LFSR), 454
geometric distribution, 21 feedback, or auxiliary, polynomial of an LFSR, 454
joint probability, 1 initial fill of register, 454
multivariate normal distribution, 88 output stream of a register, 454
normal distribution (with univariate normal repetition code, 149
density), 89 repetition of a code, 152
Poisson distribution, 101 ring, 217
probability mass function (PMF), 366 ideal of a ring, 217
probability space, 397 quotient ring, 274
prolate spheroidal wave function (PSWF), 425 root of a cyclic code, 230
protocol of a private communication, 469 defining root of a cyclic code, 233
Diffie–Hellman protocol, 474 root of a polynomial, 228
prefix, 4 root of unity, 228
prefix-free code, 4 primitive root of unity, 236
product-channel, 404
public-key cipher, 467 sample, viii, 2
signal/noise ratio (SNR), 449
quantum mechanics, 431 sinc function, 413
size of a code, 147
random code, 68, 372 space, 35
random codebook, 68 Hamming space, 144
random codeword, 6 space L2(R1), 415
random measure, 436 linear space, 146
Poisson random measure (PRM), 436 linear subspace, 148
random process, vii space of a linear representation, 314
Gaussian random process, 369 state space of a Markov chain, 35
Poisson random process, 436 vector space over a field, 269
stationary random process, 397 stream, 463
stationary ergodic random process, 397 strictly concave, 32
random variable, 18 strictly convex, 104
conditionally independent random variables, 26 string, or a word (of characters, digits, letters or
equiprobable, or uniform, random variable, symbols), 3
3, 22 source of information (random), 2, 44
exponential random variable (with exponential Bernoulli source, 3
density), 89 equiprobable Bernoulli source, 3
geometric random variable, 21 Markov source, 3
independent identically distributed (IID) random stationary Markov source, 3
variables, 1, 3 spectral density, 417
joint probability distribution of random variables, 1 stationary, 3
normal random variable (with univariate normal stationary Markov source, 3
density), 89 stationary random process, 397
Poisson random variable, 101 stationary ergodic random process, 397
random vector, 20 supercritical network, 449
multivariate normal random vector, 88 symbol, 2
rank of a code, 184 syndrome, 192
rank-nullity property, 186
rate, 15 theorem
entropy rate, vii, 41 Brunn–Minkovski theorem, 93
information rate of a source, 15 Campbell theorem, 442
reliable encoding (or encodable) rate, 15 Cayley–Hamilton theorem, 456
reliable transmission rate, 62 central limit theorem (CLT), 94
reliable transmission rate with regional constraint, Doob–Lévy theorem, 409
373 local De Moivre–Laplace theorem, 53
regional constraint for channel capacity, 367 mapping theorem, 437
theorem (cont.) Fourier transform, discrete, 296
product theorem, 444 Fourier transform in L2, 413
Shannon theorem, 8 transmitter, 443
Shannon’s noiseless coding theorem (NLCT), 8
Shannon’s first coding theorem (FCT), 42 uncertainty principle, 431
Shannon’s second coding theorem (SCT), or noisy
coding theorem (NCT), 59, 162 Vandermonde determinant, 237
Shannon’s SCT: converse part, 69 Vandermonde matrix, 297
Shannon’s SCT: strong converse part, 175
Shannon’s SCT: direct part, 71, 163 wedge-product, 149
Shannon–McMillan–Breiman theorem, 397 weight enumerator polynomial, 319
totient function, 270 white noise, 368
transform word, or a string (of characters, digits, letters or
character transform, 319 symbols), 3
Fourier transform, 296 weight of a word, 144