Probability in Electrical Engineering and Computer Science
An Application-Driven Course
Jean Walrand
Department of EECS
University of California, Berkeley
Berkeley, CA, USA
https://www.springer.com/us/book/9783030499945
© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the book’s
Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my wife Annie, my daughters Isabelle and Julie,
and my grandchildren Melanie and Benjamin,
who will probably never read this book.
Preface
This book is about extracting information from noisy data, making decisions that
have uncertain consequences, and mitigating the potentially detrimental effects of
uncertainty.
Applications of those ideas are prevalent in computer science and electrical
engineering: digital communication, GPS, self-driving cars, voice recognition,
natural language processing, face recognition, computational biology, medical tests,
radar systems, games of chance, investments, data science, machine learning,
artificial intelligence, and countless (in a colloquial sense) others.
This material is truly exciting and fun. I hope you will share my enthusiasm for
the ideas.
Acknowledgements
I am grateful to my colleagues and students who made this book possible. I thank
Professor Ramtin Pedarsani for his careful reading of the manuscript, Sinho Chewi
for pointing out typos in the first edition and suggesting improvements of the text,
Dr. Abhay Parekh for teaching the course with me, Professors David Aldous, Venkat
Anantharam, Tom Courtade, Michael Lustig, John Musacchio, Shyam Parekh,
Kannan Ramchandran, Anant Sahai, David Tse, Martin Wainwright, and Avideh
Zakhor for their useful comments, Stephan Adams, Kabir Chandrasekher, Dr.
Shiang Jiang, Dr. Sudeep Kamath, Dr. Jerome Thai, Professors Antonis Dimakis,
Vijay Kamble, and Baosen Zhang for serving as teaching assistants for the course
and designing assignments, Professor Longbo Huang for translating the book in
Mandarin and providing many valuable suggestions, Professors Pravin Varaiya and
Eugene Wong for teaching me Probability, Professor Tsu-Jae King Liu for her
support, and the students in EECS126 for their feedback.
Finally, I want to thank Professor Tarek El-Bawab for making a number of valu-
able suggestions for the second edition and the Springer editorial team, including
Mary James, Zoe Kennedy, Vidhya Hariprasanth, and Lavanya Venkatesan for their
help in the preparation of this edition.
Introduction
level students. Parts B contain more difficult aspects of the material. It is possible
to teach only the appendices and parts A. This would constitute a good junior-level
course. One possible approach is to teach parts A in a first course and parts B in a
second course. For a more ambitious course, one may teach parts A, then parts B.
It is also possible to teach the chapters in order. The last chapter is a collection of
more advanced topics that the reader and instructor can choose from.
The appendices should be useful for most readers. Appendix A discusses the
elementary notions of probability on simple examples. Students might benefit from
a quick read of this chapter.
Appendix B reviews the basic concepts of probability. Depending on the
background of the students, it may be recommended to start the course with a review
of that appendix.
The theory starts with models of uncertain quantities. Let us denote such
quantities by X and Y. A model enables one to calculate the expected value E(h(X))
of a function h(X) of X. For instance, X might specify the output of a solar panel
every day during 1 month and h(X) the total energy that the panel produced. Then
E(h(X)) is the average energy that the panel produces per month. Other examples
are the average delay of packets in a communication network or the average time a
data center takes to complete one job (Fig. 1).
[Fig. 1 Evaluation: from a model of the uncertain quantity X, compute E(h(X)).]
Suppose now that the model depends on parameters θ, so that the quantity of interest is E(h(X, θ)). One important problem is then to find the values of the parameters θ that
maximize E(h(X, θ )). This is not a simple problem if one does not have an
analytical expression for this average value in terms of θ . We explain such
optimization problems in the book.
There are many situations where one observes Y and one is interested in guessing
the value of X, which is not observed. As an example, X may be the signal that a
transmitter sends and Y the signal that the receiver gets (Fig. 3).
[Fig. 3 Inference: from the observation Y, guess the unobserved X.]
[Fig. 4 Control.]
The website https://www.springer.com/us/book/9783030499945 provides additional resources for this book, such as an Errata, Additional Problems, and Python Labs.
This second edition differs from the first in a few aspects. The Matlab exercises
have been deleted as most students use Python. Python exercises are not included
in the book; they can be found on the website. The appendix on Linear Algebra has
been deleted. The relevant results from that theory are introduced in the text when
needed. Appendix A is new. It is motivated by the realization that some students are
confused by basic notions. The chapters on networks are new. They were requested
by some colleagues. Basic statistics are discussed in Chap. 8. Neural networks are
explained in Chap. 12.
Contents
1 PageRank: A
  1.1 Model
  1.2 Markov Chain
    1.2.1 General Definition
    1.2.2 Distribution After n Steps and Invariant Distribution
  1.3 Analysis
    1.3.1 Irreducibility and Aperiodicity
    1.3.2 Big Theorem
    1.3.3 Long-Term Fraction of Time
  1.4 Illustrations
  1.5 Hitting Time
    1.5.1 Mean Hitting Time
    1.5.2 Probability of Hitting a State Before Another
    1.5.3 FSE for Markov Chain
  1.6 Summary
    1.6.1 Key Equations and Formulas
  1.7 References
  1.8 Problems
2 PageRank: B
  2.1 Sample Space
  2.2 Laws of Large Numbers for Coin Flips
    2.2.1 Convergence in Probability
    2.2.2 Almost Sure Convergence
  2.3 Laws of Large Numbers for i.i.d. RVs
    2.3.1 Weak Law of Large Numbers
    2.3.2 Strong Law of Large Numbers
  2.4 Law of Large Numbers for Markov Chains
  2.5 Proof of Big Theorem
    2.5.1 Proof of Theorem 1.1 (a)
    2.5.2 Proof of Theorem 1.1 (b)
    2.5.3 Periodicity
  2.6 Summary
    2.6.1 Key Equations and Formulas
  2.7 References
  2.8 Problems
3 Multiplexing: A
  3.1 Sharing Links
  3.2 Gaussian Random Variable and CLT
    3.2.1 Binomial and Gaussian
    3.2.2 Multiplexing and Gaussian
    3.2.3 Confidence Intervals
  3.3 Buffers
    3.3.1 Markov Chain Model of Buffer
    3.3.2 Invariant Distribution
    3.3.3 Average Delay
    3.3.4 A Note About Arrivals
    3.3.5 Little's Law
  3.4 Multiple Access
  3.5 Summary
    3.5.1 Key Equations and Formulas
  3.6 References
  3.7 Problems
4 Multiplexing: B
  4.1 Characteristic Functions
  4.2 Proof of CLT (Sketch)
  4.3 Moments of N(0, 1)
  4.4 Sum of Squares of 2 i.i.d. N(0, 1)
  4.5 Two Applications of Characteristic Functions
    4.5.1 Poisson as a Limit of Binomial
    4.5.2 Exponential as Limit of Geometric
  4.6 Error Function
  4.7 Adaptive Multiple Access
  4.8 Summary
    4.8.1 Key Equations and Formulas
  4.9 References
  4.10 Problems
5 Networks: A
  5.1 Spreading Rumors
  5.2 Cascades
  5.3 Seeding the Market
  5.4 Manufacturing of Consent
  5.5 Polarization
  5.6 M/M/1 Queue
  5.7 Network of Queues
  5.8 Optimizing Capacity
  5.9 Internet and Network of Queues
References
Index
1 PageRank: A
Background:
1.1 Model
The World Wide Web is a collection of linked web pages (Fig. 1.1). These pages
and their links form a graph. The nodes of the graph are the pages, which form a set X, and there is an arc (a directed edge) from i to j if page i has a link to page j.
Intuitively, a page has a high rank if other pages with a high rank point to it. (The
actual ordering of search engines results depends also on the presence of the search
keywords in the pages and on many other factors, in addition to the rank measure
that we discuss here.) Thus, the rank π(i) of page i is a positive number and
    π(i) = Σ_{j∈X} π(j) P(j, i),   i ∈ X,
where P (j, i) is the fraction of links in j that point to i and is zero if there is no
such link. In our example, P (A, B) = 1/2, P (D, E) = 1/3, P (B, A) = 0, etc.
(The basic idea of the algorithm is due to Larry Page (Fig. 1.2), hence the name
PageRank. Since it ranks pages, the name is doubly appropriate.)
We can write these equations in matrix notation as
π = πP, (1.1)
where we treat π as a row vector with components π(i) and P as a square matrix
with entries P (i, j ) (Figs. 1.3, 1.4 and 1.5).
Equations (1.1) are called the balance equations. Note that if π solves these
equations, then any multiple of π also solves the equations. For convenience, we
normalize the solution so that the ranks of the pages add up to one, i.e.,
    Σ_{i∈X} π(i) = 1.   (1.2)
Solving these equations with the condition that the numbers add up to one yields
    π = [π(A), π(B), π(C), π(D), π(E)] = (1/39) [12, 9, 10, 6, 2].
Thus, page A has the highest rank and page E has the smallest. A search engine that
uses this method would combine these ranks with other factors to order the pages.
Search engines also use variations on this measure of rank.
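The balance equations (1.1) with the normalization (1.2) can also be solved numerically. The sketch below does this in Python with NumPy. Since the graph of Fig. 1.2 is not reproduced in this excerpt, the transition matrix used here is an assumption, chosen to be consistent with the probabilities quoted in the text (P(A, B) = 1/2, P(D, E) = 1/3, P(B, A) = 0) and with the solution above.

```python
import numpy as np

# Assumed transition matrix for the five-page example, states ordered A, B, C, D, E.
P = np.array([
    [0,   1/2, 0,   1/2, 0  ],   # A -> B or D
    [0,   0,   1,   0,   0  ],   # B -> C
    [1,   0,   0,   0,   0  ],   # C -> A
    [1/3, 1/3, 0,   0,   1/3],   # D -> A, B, or E
    [0,   1/2, 1/2, 0,   0  ],   # E -> B or C
])

# Solve pi = pi P together with sum(pi) = 1: stack (P^T - I) with a row of ones
# and solve the resulting overdetermined linear system in the least-squares sense.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.round(pi, 4))       # approximately [0.3077 0.2308 0.2564 0.1538 0.0513]
print(np.round(pi * 39, 2))  # approximately [12  9 10  6  2]
```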
Imagine that you are browsing the web. After viewing a page i, say for one unit
of time, you go to another page by clicking one of the links on page i, chosen at
random. In this process, you go from page i to page j with probability P (i, j )
where P (i, j ) is the same as we defined earlier. The resulting sequence of pages
that you visit is called a Markov chain, a model due to Andrey Markov (Fig. 1.4).
More generally, consider a finite graph with nodes X = {1, 2, . . . , N } and directed
edges. In this graph, some edges can go from a node to itself. To each edge (i, j )
one assigns a positive number P (i, j ) in a way that the sum of the numbers on the
edges out of each node is equal to one. By convention, P (i, j ) = 0 if there is no
edge from i to j .
The corresponding matrix P = [P (i, j )] with nonnegative entries and rows that
add up to one is called a stochastic matrix. The sequence {X(n), n ≥ 0} that goes
from node i to node j with probability P (i, j ), independently of the nodes it visited
before, is then called a Markov chain. The nodes are called the states of the Markov
chain and the P (i, j ) are called the transition probabilities. We say that X(n) is the
state of the Markov chain at time n, for n ≥ 0. Also, X(0) is called the initial state.
The graph is the state transition diagram of the Markov chain.
Figure 1.6 shows the state transition diagrams of three Markov chains.
Thus, our description corresponds to the following property:
The probability of moving from i to j does not depend on the previous states. This
“amnesia” is called the Markov property. It formalizes the fact that X(n) is indeed
a “state” in that it contains all the information relevant for predicting the future of
the process.
Indeed, the event that the Markov chain is in state i at step n + 1 is the union over
all j of the disjoint events that it is in state j at step n and in state i at step n + 1.
The probability of a disjoint union of events is the sum of the probabilities of the
individual events. Also, the probability that the Markov chain is in state j at step n
and in state i at step n + 1 is πn (j )P (j, i).
Thus, in matrix notation,
πn+1 = πn P ,
so that
πn = π0 P n , n ≥ 0. (1.5)
Observe that πn (i) = π0 (i) for all n ≥ 0 and all i ∈ X if and only if π0
solves the balance equations (1.1). In that case, we say that π0 is an invariant
distribution. Thus, an invariant distribution is a nonnegative solution π of (1.1)
whose components sum to one.
1.3 Analysis
(a) A Markov chain is irreducible if it can go from any state to any other state, possibly after many steps.
(b) Assume the Markov chain is irreducible and let

    d(i) := g.c.d.{n > 0 | P^n(i, i) > 0},   i ∈ X.   (1.6)

Then d(i) has the same value d for all i, as shown in Lemma 2.2. The Markov chain is aperiodic if d = 1. Otherwise, it is periodic with period d.
The Markov chains (a) and (b) in Fig. 1.6 are irreducible and (c) is not. Also, (a)
is periodic and (b) is aperiodic.
Simple examples show that the answers to Q2–Q3 can be negative. For instance,
every distribution is invariant for a Markov chain that does not move. Also, a Markov
chain that alternates between the states 0 and 1 with π0 (0) = 1 is such that πn (0) =
1 when n is even and πn (0) = 0 when n is odd, so that πn does not converge.
However, we have the following key result.
Theorem 1.1 (Big Theorem)

(a) If the Markov chain is finite and irreducible, it has a unique invariant distribu-
tion π and π(i) is the long-term fraction of time that X(n) is equal to i.
(b) If the Markov chain is also aperiodic, then the distribution πn of X(n) converges
to π .
In this theorem, the long-term fraction of time that X(n) is equal to i is defined
as the limit
    lim_{N→∞} (1/N) Σ_{n=0}^{N−1} 1{X(n) = i}.
In this expression, 1{X(n) = i} takes the value 1 if X(n) = i and the value 0
otherwise. Thus, in the expression above, the sum is the total time that the Markov
chain is in state i during the first N steps. Dividing by N gives the fraction of time.
Taking the limit yields the long-term fraction of time.
The theorem says that, if the Markov chain is irreducible, this limit exists and is
equal to π(i). In particular, this limit does not depend on the particular realization
of the random variables. This means that every simulation yields the same limit, as
you will verify in Problem 1.8.
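To make this concrete, here is a minimal simulation sketch (essentially what Problem 1.8 asks you to build), reusing the transition matrix assumed in the earlier snippet; the empirical fractions of time settle near π = (1/39)[12, 9, 10, 6, 2].

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(P, x0, N):
    """Simulate N steps of a Markov chain with transition matrix P, starting at x0."""
    states = [x0]
    for _ in range(N - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return np.array(states)

# Assumed matrix as before (states A=0, B=1, C=2, D=3, E=4).
P = np.array([[0, .5, 0, .5, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0],
              [1/3, 1/3, 0, 0, 1/3], [0, .5, .5, 0, 0]])

X = simulate(P, x0=0, N=100_000)
fractions = np.bincount(X, minlength=5) / len(X)
print(np.round(fractions, 3))   # close to [0.308, 0.231, 0.256, 0.154, 0.051]
```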
Why should the fraction of time that a Markov chain spends in one state converge?
In our browsing example, if we count the time that we spend on page A over n time
units and we divide that time by n, it turns out that the ratio converges to π(A).
This result is similar to the fact that, when we flip a fair coin repeatedly, the
fraction of “heads” converges to 50%. Thus, even though the coin has no memory,
it makes sure that the fraction of heads approaches 50%. How does it do it?
These convergence results are examples of the Law of Large Numbers. This
law is at the core of our intuitive understanding of probability and it captures our
notion of statistical regularity. Even though outcomes are uncertain, one can make
predictions. Here is a statement of the result. We discuss it in Chap. 2.
1 “Almost sure” is a somewhat confusing technical expression. It means that, although there are
outcomes for which the convergence does not happen, all these outcomes have probability zero.
For instance, if you flip a fair coin, the outcome where the coin flips keep on yielding tails
is such that the fraction of tails does not converge to 0.5. The same is true for the outcome
H, H, T , H, H, T , . . .. So, almost sure means that it happens with probability 1, but not for a
set of outcomes that has probability zero.
1.4 Illustrations
We illustrate Theorem 1.1 for the Markov chains in Fig. 1.6. The three situations are
different and quite representative. We explore them one by one.
Figures 1.8, 1.9 and 1.10 correspond to each of the three Markov chains in
Fig. 1.6, as shown on top of each figure. The top graph of each figure shows
the successive values of Xn for n = 0, 1, . . . , 100. The middle graph of the
figure shows, for n = 0, . . . , 100, the fraction of time that Xm is equal to the
different states during {0, 1, . . . , n}. The bottom graph of the figure shows, for
n = 0, . . . , 100, the probability that Xn is equal to each of the states.
In Fig. 1.8, the fraction of time that the Markov chain is equal to each of the
states {1, 2, 3} converges to positive values. This is the case because the Markov
chain is irreducible. (See Theorem 1.1(a).) However, the probability of being in a
given state does not converge. This is because the Markov chain is periodic. (See
Theorem 1.1(b).)
For the Markov chain in Fig. 1.9, the probabilities converge, because the Markov
chain is aperiodic. (See again Theorem 1.1.)
Finally, for the Markov chain in Fig. 1.10, eventually Xn = 3; the fraction of
time in state 3 converges to one and so does the probability of being in state 3. What
happens in this case is that state 3 is absorbing: once the Markov chain gets there,
it cannot leave.
Say that you start in page A in Fig. 1.2 and that, at every step, you follow each
outgoing link of the page where you are with equal probabilities. How many steps
does it take to reach page E? This time is called the hitting time, or first passage
time, of page E and we designate it by TE . As we can see from the figure, TE can
be as small as 2, but it has a good chance of being much larger than 2 (Fig. 1.11).
Our goal is to calculate the average value of TE starting from X0 = A. That is, we want to calculate

    β(A) := E[TE | X0 = A].

The key idea to perform this calculation is to in fact calculate the mean hitting time for all possible initial pages. That is, we will calculate β(i) for i = A, B, C, D, E, where

    β(i) := E[TE | X0 = i].
The reason for considering these different values is that the mean time to hit E
starting from A is clearly related to the mean hitting time starting from B and
from D. These in turn are related to the mean hitting time starting from C. We
claim that
    β(A) = 1 + (1/2) β(B) + (1/2) β(D).   (1.7)
To see this, note that, starting from A, after one step, the Markov chain is in state B
with probability 1/2 and it is in state D with probability 1/2. Thus, after one step,
the average time to hit E is the average time starting from B, with probability 1/2,
and it is the average time starting from D, with probability 1/2.
This situation is similar to the following one. You flip a fair coin. If the outcome
is heads you get a random amount of money equal to X and if it is tails you get a
random amount Y . On average, you get
    (1/2) E(X) + (1/2) E(Y).
Similarly, we can see that
    β(B) = 1 + β(C)
    β(C) = 1 + β(A)
    β(D) = 1 + (1/3) β(A) + (1/3) β(B) + (1/3) β(E)
    β(E) = 0.
These equations, together with (1.7), are called the first step equations (FSE).
Solving them, we find

    β(A) = 17, β(B) = 19, β(C) = 18, β(D) = 13, β(E) = 0.
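These first step equations form a small linear system, which is easy to solve numerically. A minimal sketch, again with the transition matrix assumed in the earlier snippets:

```python
import numpy as np

# States A=0, B=1, C=2, D=3, E=4; assumed transition matrix as before.
P = np.array([[0, .5, 0, .5, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0],
              [1/3, 1/3, 0, 0, 1/3], [0, .5, .5, 0, 0]])

target = 4  # state E

# FSE: beta(i) = 1 + sum_j P(i, j) beta(j) for i != E, with beta(E) = 0.
# In matrix form: (I - Q) beta = 1, where Q is P restricted to the non-target states.
keep = [i for i in range(5) if i != target]
Q = P[np.ix_(keep, keep)]
beta_keep = np.linalg.solve(np.eye(len(keep)) - Q, np.ones(len(keep)))

beta = np.zeros(5)
beta[keep] = beta_keep
print(beta)   # [17. 19. 18. 13.  0.]
```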
Consider once again the same situation, but say that we are interested in the probability that, starting from A, we visit state C before E. We write this probability as

    α(A) := P[visit C before E | X0 = A].

As in the previous case, it turns out that we need to calculate α(i) for i = A, B, C, D, E, where α(i) is the corresponding probability starting from state i. We claim that
    α(A) = (1/2) α(B) + (1/2) α(D).   (1.8)
To see this, note that, starting from A, after one step you are in state B with
probability 1/2 and you will then visit C before E with probability α(B). Also,
with probability 1/2, you will be in state D after one step and you will then visit C
before E with probability α(D). Thus, the event that you visit C before E starting
from A is the union of two disjoint events: either you do that by first going to B or
by first going to D. Adding the probabilities of these two events, we get (1.8).
    α(B) = α(C)
    α(C) = 1
    α(D) = (1/3) α(A) + (1/3) α(B) + (1/3) α(E)
    α(E) = 0.
These equations, together with (1.8), are also called the first step equations. Solving
them, we find
    α(A) = 4/5, α(B) = 1, α(C) = 1, α(D) = 3/5, α(E) = 0.
More generally, let A be a set of states, let TA be the first time the Markov chain enters A, and let h(·) be a function on the states. Define

    Y = Σ_{n=0}^{TA} h(X(n)).
That is, you collect an amount h(i) every time you visit state i, until you enter set
A. Let
    Z = Σ_{n=0}^{TA} β^n h(X(n)),

where β ∈ (0, 1) is a discount factor.
Hopefully these examples give you a sense of the variety of questions that can be
answered for finite Markov chains. This is very fortunate, because Markov chains
can be used to model a broad range of engineering and natural systems.
1.6 Summary
1.7 References
There are many excellent books on Markov chains. Some of my favorites are
Grimmett and Stirzaker (2001) and Bertsekas and Tsitsiklis (2008). The original
patent on PageRank is Page (2001). The online book Easley and Kleinberg (2012)
is an inspiring discussion of social networks. Chapter 14 of that reference discusses
PageRank.
1.8 Problems
Problem 1.1 Construct a Markov chain that is not irreducible but whose distribu-
tion converges to its unique invariant distribution.
Problem 1.2 Show a Markov chain whose distribution converges to a limit that
depends on the initial distribution.
Problem 1.3 Can you find a finite irreducible aperiodic Markov chain whose
distribution does not converge?
Problem 1.4 Show a finite irreducible aperiodic Markov chain that converges very
slowly to its invariant distribution.
Problem 1.5 Show that a function Y (n) = g(X(n)) of a Markov chain X(n) may
not be a Markov chain.
Problem 1.6 Construct a Markov chain that is a sequence of i.i.d. random variables.
Is it irreducible and aperiodic?
Problem 1.7 Consider the Markov chain X(n) with the state diagram shown in
Fig. 1.12 where a, b ∈ (0, 1).
Problem 1.8 Use Python to write a simulator for a Markov chain {X(n), n ≥ 1}
with K states, initial distribution π , and transition probability matrix P . The
program should be able to do the following:
1. Plot {X(n), n = 1, . . . , N };
2. Plot the fraction of time that X(n) is in some chosen states during {1, 2, . . . , m}
as a function of m, for m = 1, . . . , N ;
3. Plot the probability that X(n) is equal to some chosen states, for n = 1, . . . , N;
4. Use this program to simulate a periodic Markov chain with five states;
5. Use the program to simulate an aperiodic Markov chain with five states.
Problem 1.9 Use your simulator to simulate the Markov chains of Figs. 1.2 and 1.6.
Problem 1.10 Find the invariant distribution for the Markov chains of Fig. 1.6.
Problem 1.11 Calculate d(1), d(2), and d(3), defined in (1.6), for the Markov
chains of Fig. 1.6.
Problem 1.12 Calculate d(A), defined in (1.6), for the Markov chain of Fig. 1.2.
Problem 1.13 Let {Xn , n ≥ 0} be a finite Markov chain. Assume that it has
a unique invariant distribution π and that πn converges to π for every initial
distribution π0 . Then (choose the correct answers, if any)
• Xn is irreducible;
• Xn is periodic;
• Xn is aperiodic;
• Xn might not be irreducible.
Problem 1.14 Consider the Markov chain {Xn , n ≥ 0} on {0, 1} with P (0, 1) =
0.1 and P (1, 0) = 0.3. Then (choose the correct answers, if any)
Problem 1.15 Consider the MC with the state transition diagram shown in
Fig. 1.13.
Problem 1.16 Consider the MC with the state transition diagram shown in
Fig. 1.14.
Problem 1.17 Consider the MC with the state transition diagram shown in
Fig. 1.15.
[Figs. 1.13–1.15: state transition diagrams for Problems 1.15–1.17 (not reproduced in this excerpt).]
Problem 1.19 For the Markov chain {Xn , n ≥ 0} with transition diagram shown in
Fig. 1.17, assume that X0 = 0. Find the probability that Xn hits 2 before it hits 1
twice.
Problem 1.20 Draw an irreducible aperiodic MC with six states and choose the
transition probabilities. Simulate the MC in Python. Plot the fraction of time in the
six states. Assume you start in state 1. Plot the probability of being in each of the
six states.
Problem 1.22 How would you trick the PageRank algorithm into believing that
your home page should be given a high rank?
Problem 1.23 Show that the holding time of a state is geometrically distributed.
Problem 1.24 You roll a die until the sum of the last two rolls is exactly 10. How
many times do you have to roll, on average?
Problem 1.25 You roll a die until the sum of the last three rolls is at least 15. How
many times do you have to roll, on average?
Problem 1.26 A doubly stochastic matrix is a nonnegative matrix whose rows and
columns add up to one. Show that the invariant distribution is uniform for such a
transition matrix.
Problem 1.27 Assume that the Markov chain (c) of Fig. 1.6 starts in state 1.
Calculate the average number of times it visits state 1 before being absorbed in
state 3.
Problem 1.28 A man tries to go up a ladder that has N rungs. Every step he makes,
he has a probability p of dropping back to the ground and he goes up one rung
otherwise. Use the first step equations to calculate analytically the average time he
takes to reach the top, for N = 1, . . . , 20 and p = 0.05, 0.1, and 0.2. Use Python
to plot the corresponding graphs.
Problem 1.29 Let {Xn , n ≥ 0} be a finite irreducible Markov chain with transition
probability matrix P and invariant distribution π . Show that, for all i, j ,
    (1/N) Σ_{n=0}^{N−1} 1{Xn = i, Xn+1 = j} → π(i)P(i, j), w.p. 1 as N → ∞.
Xn+1 = f (Xn , Vn ), n ≥ 0,
Problem 1.31 Let P and P̃ be two stochastic matrices and π a pmf on the finite set
X . Assume that
Problem 1.32 Let Xn be a Markov chain on a finite set X . Assume that the
transition diagram of the Markov chain is a tree, as shown in Fig. 1.18. Show that if
π is invariant and if P is the transition matrix, then it satisfies the following detailed balance equations:

    π(i)P(i, j) = π(j)P(j, i), for all i, j ∈ X.
Problem 1.33 Let Xn be a Markov chain such that X0 has the invariant distribution
π and the detailed balance equations are satisfied. Show that

    P(X0 = x0, X1 = x1, . . . , Xn = xn) = P(XN = x0, XN−1 = x1, . . . , XN−n = xn)

for all n, all N ≥ n, and all x0, . . . , xn. Thus, the evolution of the Markov chain in
reverse time (N, N −1, N −2, . . . , N −n) cannot be distinguished from its evolution
in forward time (0, 1, . . . , n). One says that the Markov chain is time-reversible.
Yn = X0 + · · · + Xn , n ≥ 0.
Problem 1.35 You flip a fair coin repeatedly, forever. Show that the probability that
the number of heads is always ahead of the number of tails is zero.
2 PageRank: B
Background:
• Borel–Cantelli (B.1.2);
• monotonicity of expectation (B.2);
• convergence of expectation (B.8)–(B.9);
• properties of variance: (B.3) and Theorem B.4.
Let us connect the definition of a Markov chain X = {Xn, n ≥ 0} with the general
framework of Sect. B.1. (We write Xn or X(n).) In that section, we explained
that a random experiment is described by a sample space. The elements of the
sample space are the possible outcomes of the experiment. A probability is defined
on subsets, called events, of that sample space. Random variables are real-valued
functions of the outcome of the experiment.
To clarify these concepts, consider the case where the Xn are i.i.d. Bernoulli
random variables with P (Xn = 1) = P (Xn = 0) = 0.5. These random variables
describe flips of a fair coin. The random experiment is to flip the coin repeatedly,
forever. Thus, one possible outcome of this experiment is an infinite sequence of
0’s and 1’s. Note that an outcome is not 0 or 1: it is an infinite sequence since the
outcome specifies what happens when we flip the coin forever. Thus, the set Ω of
outcomes is the set {0, 1}∞ of infinite sequences of 0’s and 1’s. If ω is one such
sequence, we have ω = (ω0 , ω1 , . . .) where ωn ∈ {0, 1}. It is then natural to define
Xn (ω) = ωn , which simply says that Xn is the outcome of flip n, for n ≥ 0. Hence
Xn(ω) ∈ ℜ for all ω ∈ Ω and we see that each Xn is a real-valued function defined
on Ω. For instance, X0 (1101001 . . .) = 1 since ω0 = 1 when ω = 1101001 . . . .
Similarly, X1 (1101001 . . .) = 1 and X2 (1101001 . . .) = 0. To specify the random
experiment, it remains to define the probability on Ω. The simplest way is to say
that
    P({ω | ω0 = a, ω1 = b, . . . , ωn = z}) = P(X0 = a, . . . , Xn = z) = 1/2^{n+1},

for all n ≥ 0 and all a, b, . . . , z ∈ {0, 1}.
Similarly,
P (X0 = i0 , X1 = i1 , . . . , Xn = in )
= π0 (i0 )P (i0 , i1 ) × · · · × P (in−1 , in ), (2.1)
for all n ≥ 0 and i0 , i1 , . . . , in in X . Here, π0 (i0 ) is the probability that the Markov
chain starts in state i0 .
This identity is equivalent to (1.3). Indeed, if we let
An = {X0 = i0 , X1 = i1 , . . . , Xn = in }
then (2.1) states that P(An) = P(An−1)P(in−1, in), i.e., that P(Xn = in | An−1) = P(in−1, in), which is precisely the Markov property (1.3).
Before we discuss the case of Markov chains, let us consider the simpler example of
coin flips. Let then {Xn , n ≥ 0} be i.i.d. Bernoulli random variables with P (Xn =
0) = P (Xn = 1) = 0.5, as in the previous section. We think of Xn = 1 if flip n
yields heads and Xn = 0 if it yields tails. We want to show that, as we keep flipping
the coin, the fraction of heads approaches 50%. There are two statements that make
this idea precise.
The first statement, called the Weak Law of Large Numbers (WLLN), says that it is
very unlikely that the fraction of heads in n coin flips differs from 50% by even a
small amount, say 1%, if n is large. For instance, let n = 10^5. We want to show that the likelihood that the fraction of heads among 10^5 flips is more than 51% or less
than 49% is small. Moreover, this likelihood can be made as small as we wish if we
flip the coin more times.
To show this, let
    Yn = (X0 + · · · + Xn−1)/n
be the fraction of heads in the first n flips. We claim that
    P(|Yn − E(Yn)| ≥ ε) ≤ var(Yn)/ε².   (2.2)
This result is called Chebyshev’s inequality (Fig. 2.2).
To see (2.2), observe that¹

    1{|Yn − E(Yn)| ≥ ε} ≤ (Yn − E(Yn))²/ε².   (2.3)
Indeed, if |Yn − E(Yn)| ≥ ε, then (Yn − E(Yn))² ≥ ε², so that if the left-hand side
of inequality (2.3) is one, the right-hand side is at least equal to one. Also, if the
left-hand side is zero, it is less than or equal to the right-hand side. Thus, (2.3) holds
and (2.2) follows by taking the expected values in (2.3), since E(1A ) = P (A) and
E((Yn − E(Yn ))2 ) = var(Yn ) and since expectation is monotone (B.2).
1 By definition, 1{C} takes the value 1 if the condition C holds and the value 0 otherwise.
Now, var(Yn) = var(X0)/n, because the Xn are i.i.d. (by the properties of variance reviewed in Appendix B), and E(Yn) = 0.5. Hence,

    P(|Yn − 0.5| ≥ ε) ≤ var(X0)/(n ε²).

Thus, since var(X0) = 1/4,

    P(|Yn − 0.5| ≥ ε) ≤ 1/(4n ε²).

In particular, if we choose ε = 1% = 0.01, we find

    P(|Yn − 0.5| ≥ 1%) ≤ 2,500/n = 0.025 with n = 10^5.
More generally, we have shown that

    P(|Yn − 0.5| ≥ ε) ≤ 1/(4n ε²) → 0 as n → ∞, for every ε > 0.
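A quick simulation (a sketch, not from the text) makes the bound concrete: estimate P(|Yn − 0.5| ≥ 1%) for n = 10^5 by generating many batches of fair coin flips and compare with the Chebyshev bound 0.025. The true probability is in fact far smaller than the bound, which is only an upper estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 100_000, 0.01, 10_000

# The number of heads in n fair flips is Binomial(n, 1/2); Y is the fraction of heads.
Y = rng.binomial(n, 0.5, size=trials) / n
estimate = np.mean(np.abs(Y - 0.5) >= eps)

print("empirical P(|Yn - 0.5| >= 1%):", estimate)               # much smaller than the bound
print("Chebyshev bound 1/(4 n eps^2):", 1 / (4 * n * eps**2))   # 0.025
```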
The second statement is the Strong Law of Large Numbers (SLLN). It says that,
for all the sequences of coin flips we will ever observe, the fraction Yn actually
converges to 50% as we keep on flipping the coin.
There are many sequences of coin flips for which the fraction of heads does not
approach 50%. For instance, the sequence that yields heads for every flip is such
that Yn = 1 for all n and thus Yn does not converge to 50%. Similarly, the sequence
001001001001001 . . . is such that Yn approaches 1/3 and not 50%. What the SLLN
implies is that all those sequences such that Yn does not converge to 50% have
probability 0: they will never be observed.
Thus, this statement is very deep because there are so many sequences to rule
out. Keeping track of all of them seems rather formidable. Indeed, the proof of this
statement is quite clever. Here is how it proceeds. Note that
    P(|Yn − 0.5| ≥ ε) ≤ E(|Yn − 0.5|⁴)/ε⁴, ∀n, ε > 0.
Indeed,
    1{|Yn − 0.5| ≥ ε} ≤ |Yn − 0.5|⁴/ε⁴
and the previous inequality follows by taking expectations. Now,
    E(|Yn − 0.5|⁴) = E(((X0 − 0.5) + · · · + (Xn−1 − 0.5))⁴)/n⁴ = (1/n⁴) E( Σ_{a,b,c,d} Za Zb Zc Zd ),

where Zm := Xm − 0.5 and the sum is over all a, b, c, d ∈ {0, 1, . . . , n − 1}. This sum consists of n terms Za⁴, 3n(n − 1) terms Za²Zb² with a ≠ b, and other terms where at least one factor Za is not repeated. The latter terms have zero mean since E(Za Zb Zc Zd) = E(Za)E(Zb Zc Zd) = 0, by independence, whenever b, c, and d are all different from a. Consequently,

    E( Σ_{a,b,c,d} Za Zb Zc Zd ) = nE(Z0⁴) + 3n(n − 1)E(Z0²Z1²) = nα + 3n(n − 1)β
with α := E(Z0⁴) and β := E(Z0²Z1²). Hence, substituting the result of this calculation in the previous expressions, we find that

    P(|Yn − 0.5| ≥ ε) ≤ (nα + 3n(n − 1)β)/(n⁴ ε⁴) ≤ C/(n² ε⁴),

for some finite constant C.
This expression shows that the events An := {|Yn − 0.5| ≥ ε} have probabilities that
add up to a finite number. From the Borel–Cantelli Theorem B.1, we conclude that
P (An , i.o.) = 0.
This result says that, with probability one, ω belongs only to finitely many An's. Hence,³ with probability one, there is some n(ω) so that ω ∉ An for n ≥ n(ω). That is,

    |Yn(ω) − 0.5| < ε for all n ≥ n(ω).
Since this property holds for an arbitrary ε > 0, we conclude that, with probability
one,
Yn (ω) → 0.5 as n → ∞.
Indeed, if Yn(ω) does not converge to 50%, there must be some ε > 0 so that |Yn − 0.5| > ε for infinitely many n's and we have seen that this is not the case.
The results that we proved for coin flips extend to i.i.d. random variables {Xn , n ≥
0} to show that
    Yn := (X0 + · · · + Xn−1)/n
approaches E(X0 ) as n → ∞. As for coin flips, there are two ways of making that
statement precise.
² Recall that Σ_n 1/n² < ∞.
We need a definition.
The random variables Xn converge in probability to X, written Xn →p X, if

    P(|Xn − X| ≥ ε) → 0 as n → ∞, for all ε > 0.
    Yn = (X0 + · · · + Xn−1)/n →p μ.   (2.4)
Proof Assume that E(Xn²) < ∞. The proof is then the same as for coin flips and is
left as an exercise. For the general case, see Theorem 15.14.
The first result of this type was proved by Jacob Bernoulli (Fig. 2.3).
The random variables Xn converge almost surely to X, written Xn →a.s. X, if

    P( lim_{n→∞} Xn(ω) = X(ω) ) = 1.
Thus, this convergence means that the sequence of real numbers Xn (ω) converges
to the real number X(ω) as n → ∞, with probability one.
Let {Xn , n ≥ 0} be as in the statement of Theorem 2.1. We have the following
result.4
    (X0 + · · · + Xn−1)/n → μ as n → ∞, with probability 1.
Thus, the sample mean values Yn := (X0 + · · · + Xn−1 )/n converge to the
expected value, with probability 1. (See Fig. 2.4.)
The proof is then the same as for coin flips and is left as an exercise. The proof of
the SLLN in the general case is given in Theorem 15.14.
Figure 2.5 illustrates the SLLN and WLLN. The SLLN states that the sample
means of i.i.d. random variables converge to the mean, with probability one. The
WLLN says that as the number of samples increases, the fraction of realizations
where the sample mean differs from the mean by some amount gets small.
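A small simulation (a sketch, with an assumed distribution) illustrates both statements: plot the sample mean of i.i.d. random variables as a function of n and watch it settle at the mean.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n = 10_000

# i.i.d. exponential random variables with mean 2; the sample means should converge to 2.
X = rng.exponential(scale=2.0, size=n)
sample_means = np.cumsum(X) / np.arange(1, n + 1)

plt.plot(sample_means, label="sample mean")
plt.axhline(2.0, color="k", linestyle="--", label="E(X) = 2")
plt.xlabel("n")
plt.legend()
plt.show()
```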
The long-term fraction of time that a finite irreducible Markov chain spends in a
given state is the invariant probability of that state. For instance, a Markov chain
X(n) on {0, 1} with P (0, 1) = a = P (1, 0) with a ∈ (0, 1] spends half of the time
in state 0, in the long term. The Markov chain in Fig. 1.2 spends a fraction 12/39 of
the time in state A, in the long term.
To understand this property, one should look at the returns to state i, as shown
in Fig. 2.6. The figure shows a particular sequence of values of X(n) and it
decomposes this sequence into cycles between successive returns to a given state
i. A new cycle starts when the Markov chain comes back to i. The durations of
these successive cycles, T1 , T2 , T3 , . . ., are independent and identically distributed,
because the Markov chains start afresh from state i at each time Tn , independently
of the previous states. This is a consequence of the Markov property for any given
value k of Tn and of the fact that the distribution of the evolution starting from state
i at time k does not depend on k.
It is easy to see that these random times have a finite mean. Indeed, fix one state
i. Then, starting from any given state j , there is some minimum number Mj of steps
required to go to state i. Also, there is some probability pj that the Markov chain
will go from j to i in Mj steps. Let then M = maxj Mj and p = minj pj . We can
then argue that, starting from any state at time 0, there is at least a probability p that
the Markov chain visits state i after at most M steps. If it does not, we repeat the
argument starting at time M. We conclude that Ti ≤ Mτ where τ is a geometric
Fig. 2.6 The cycles between returns to state i are i.i.d. The law of large numbers explains the
convergence of the long-term fraction of time to a constant
random with parameter p. Hence E(Ti ) ≤ ME(τ ) = M/p < ∞, as claimed. Note
also that E(Ti⁴) ≤ M⁴ E(τ⁴) < ∞.
The Strong Law of Large Numbers states that
    (T1 + T2 + · · · + Tk)/k → E(T1), as k → ∞, with probability 1.   (2.5)
Thus, the long-term fraction of time that the Markov chain spends in state i is
given by
    lim_{k→∞} k/(T1 + T2 + · · · + Tk) = 1/E(T1), with probability 1.   (2.6)
Let us clarify why (2.6) implies that the fraction of time in state i converges to
1/E(T1 ). Let A(n) be the number of visits to state i by time n. We want to show
that A(n)/n converges to 1/E(T1). Note that if T1 + · · · + Tk ≤ n < T1 + · · · + Tk+1, then A(n) = k, so that

    k/(T1 + · · · + Tk+1) < A(n)/n = k/n ≤ k/(T1 + · · · + Tk).

By (2.6), the right-hand side converges to 1/E(T1). The left-hand side equals 1/((T1 + · · · + Tk)/k + Tk+1/k), which also converges to 1/E(T1) provided that Tk+1/k → 0. Hence,

    A(n)/n → 1/E(T1).
To see that Tk+1/k → 0 with probability 1, note that, by the argument above,

    P(Tk+1/k > ε) = P(Tk+1 > kε) ≤ (1 − p)^{⌊kε/M⌋} ≤ (1 − p)^{αk − 1},

with α = ε/M. These probabilities are summable in k. Thus, by the Borel–Cantelli Theorem B.1, the event Tk+1/k > ε occurs only for finitely many values of k, which proves the convergence to zero.
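A short simulation (a sketch, with an assumed two-state chain, not the example of the text) illustrates the cycle argument: the empirical fraction of time in state 0 and the reciprocal of the average return time to state 0 agree.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-state chain with P(0,1) = a and P(1,0) = b; its invariant distribution
# has pi(0) = b / (a + b).  (Illustrative values below.)
a, b, N = 0.2, 0.3, 200_000
P = np.array([[1 - a, a], [b, 1 - b]])

x, visits0, returns, last_visit = 0, 0, [], 0
for n in range(1, N + 1):
    x = rng.choice(2, p=P[x])
    if x == 0:
        visits0 += 1
        returns.append(n - last_visit)   # length of the cycle that just ended
        last_visit = n

print("fraction of time in 0:", visits0 / N)            # about b/(a+b) = 0.6
print("1 / mean return time :", 1 / np.mean(returns))   # about the same value
```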
This section presents the proof of the main result about Markov chains.
    (1/N) Σ_{n=0}^{N−1} 1{X(n) = i} → π(i).
However, taking expectation, we find that the left-hand side is equal to φ(i). Thus,
φ = π and the invariant distribution is unique.5
If the Markov chain is irreducible but not aperiodic, then πn may not converge to
the invariant distribution π . For instance, if the Markov chain alternates between 0
and 1 and starts from 0, then πn = [1, 0] for n even and πn = [0, 1] for n odd, so
that πn does not converge to π = [0.5, 0.5].
If the Markov chain is aperiodic, πn → π . Moreover, the convergence is
geometric. We first illustrate the argument on a simple example shown in Fig. 2.7.
Consider the number of steps to go from 1 to 1. Note that
P [X(M) = 1|X(0) = i] ≥ p, i = 1, 2, 3, 4.
Now, consider two copies of the Markov chain: {X(n), n ≥ 0} and {Y (n), n ≥ 0}.
One chooses X(0) with distribution π0 and Y (0) with the invariant distribution π .
The two Markov chains evolve independently initially. We define

    τ := min{n ≥ 0 | X(n) = Y(n)},

the first time the two chains meet.
Thus, P(τ > M) ≤ 1 − p². If τ > M, then the two Markov chains have not met yet by time M. Using the same argument as before, we see that they have a probability at least p² of meeting in the next M steps. Thus,

    P(τ > kM) ≤ (1 − p²)^k.
Now, modify X(n) by gluing it to Y (n) after time τ . This coupling operation does
not change the fact that X(n) still evolves according to the transition matrix P , so
that P (X(n) = i) = πn (i) where πn = π0 P n .
Now,

    Σ_i |P(X(n) = i) − P(Y(n) = i)| ≤ 2P(X(n) ≠ Y(n)) ≤ 2P(τ > n).

Hence,

    Σ_i |πn(i) − π(i)| ≤ 2P(τ > n),

so that πn(i) → π(i) for every i, geometrically fast.
To extend this argument to a general aperiodic Markov chain, we need the fact
that for each state i there is some integer ni such that P^n(i, i) > 0 for all n ≥ ni.
We prove that fact as Lemma 2.3 in the following section.
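The geometric convergence is easy to observe numerically. A sketch, using the transition matrix assumed earlier for the five-page example (that chain is irreducible and aperiodic):

```python
import numpy as np

P = np.array([[0, .5, 0, .5, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0],
              [1/3, 1/3, 0, 0, 1/3], [0, .5, .5, 0, 0]])
pi = np.array([12, 9, 10, 6, 2]) / 39       # invariant distribution from Chapter 1

pi_n = np.array([1.0, 0, 0, 0, 0])          # start in state A
for n in range(1, 31):
    pi_n = pi_n @ P                         # pi_{n} = pi_{n-1} P
    if n % 5 == 0:
        print(n, np.sum(np.abs(pi_n - pi))) # this distance decays geometrically
```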
2.5.3 Periodicity
We start with a property of the set of return times of an irreducible Markov chain.
Lemma 2.1 Fix a state i and let S := {n > 0 | P^n(i, i) > 0} and d = g.c.d.(S).
There must be two integers n and n + d in the set S.
Proof To illustrate the argument, suppose that a, b ∈ S with g.c.d.{a, b} = 3, and apply the following version of Euclid's algorithm to the pair (a, b). At each step, we go from (x, y) with x ≤ y to the ordered pair of {x, y − x}. Note
that at each step, each term in the pair (x, y) is an integer linear combination of a
and b. For instance, (6, 15) = (b − a, a). Then, (6, 9) = (b − a, a − (b − a)) =
(b − a, 2a − b), and so on. Eventually, we must get to (3, 3). Indeed, the terms are
always decreasing until we get to zero. Assume we get to (x, x) with x ≠ 3. At the
previous step, we had (x, 2x). The step before must have been (x, 3x), and so on.
Going back all the way to (a, b), we see that a and b are both multiples of x. But
then, g.c.d.{a, b} = x, a contradiction.
From this construction, since at each step the terms are integer linear combina-
tions of a and b, we see that
3 = ma + nb
for some integers m and n. Write m = m⁺ − m⁻ and n = n⁺ − n⁻, where m⁺, m⁻, n⁺, n⁻ are nonnegative integers, so that

    3 = m⁺a + n⁺b − m⁻a − n⁻b,

and define

    N = m⁻a + n⁻b, so that N + 3 = m⁺a + n⁺b.
The last step of the argument is to notice that if a, b ∈ S, then αa + βb ∈ S for any nonnegative integers α and β that are not both zero. This fact follows from the definition of S as
the return times from i to i. Hence, both N and N + 3 are in S.
The proof for a general set S with gcd equal to d is identical.
This result enables us to show that the period of a Markov chain is well-defined.
Lemma 2.2 For an irreducible Markov chain, d(i) defined in (1.6) has the same
value for all states.
Proof Pick j ≠ i. We show that d(j) ≤ d(i). This suffices to prove the lemma,
since by symmetry one also has d(i) ≤ d(j ).
By irreducibility, P^m(j, i) > 0 for some m and P^n(i, j) > 0 for some n. Now, by definition of d(i) and by the previous lemma, there is some integer N such that P^N(i, i) > 0 and P^{N+d(i)}(i, i) > 0. But then,

    P^{n+N+m}(j, j) ≥ P^m(j, i) P^N(i, i) P^n(i, j) > 0,

and, similarly, P^{n+N+d(i)+m}(j, j) > 0.
This implies that the integers K := n + N + m and K + d(i) are both in S := {n > 0 | P^n(j, j) > 0}. Clearly, this shows that d(j) = g.c.d.(S) divides both K and K + d(i), hence it divides d(i), so that d(j) ≤ d(i).
The following fact then suffices for our proof of convergence, as we explained in
the example.
Lemma 2.3 Let the Markov chain be irreducible and aperiodic, and fix a state i. Then there is some integer ni such that P^n(i, i) > 0 for all n ≥ ni.

Proof We know from Lemma 2.1 that there is some integer N such that N, N + 1 ∈
S. We claim that
    n ∈ S, ∀n > N².
Indeed, any such n can be written as n = mN + k with 0 ≤ k ≤ N − 1 and, since n > N², with m ≥ N > k. Now,

    mN + 0 = mN,
    mN + 1 = (m − 1)N + (N + 1),
    mN + 2 = (m − 2)N + 2(N + 1),
    . . . ,
    mN + N − 1 = (m − N + 1)N + (N − 1)(N + 1).

Thus, n = mN + k = (m − k)N + k(N + 1) is a nonnegative integer combination of N and N + 1 with not both coefficients zero, so that n ∈ S, as claimed.
2.6 Summary
• Sample Space;
• Laws of Large Numbers: SLLN and WLLN;
• WLLN from Chebyshev’s Inequality;
• SLLN from Borel–Cantelli and fourth moment bound;
• SLLN for Markov chains using the i.i.d. return times to a state;
• Proof of Big Theorem.
2.7 References
2.8 Problems
Problem 2.1 Consider a Markov chain Xn that takes values in {0, 1}. Explain why
{0, 1} is not its sample space.
Problem 2.2 Consider again a Markov chain that takes values in {0, 1} with
P (0, 1) = a and P (1, 0) = b. Exhibit two different sample spaces and the
probability on them for that Markov chain.
Problem 2.3 Draw the smallest periodic Markov chain. Show that the fraction of
time in the states converges but the probability of being in a state at time n does not
converge.
Problem 2.4 For the Markov chain in Problem 2.2, calculate the eigenvalues and
use them to get a bound on the distance between the distribution at time n and the
invariant distribution.
Problem 2.5 Why does the strong law imply the weak law? More concretely, let
Xn , X be random variables such that Xn → X almost surely. Show that Xn → X
in probability.
Hint Fix ε > 0 and define Zn = 1{|Xn − X| ≥ ε}. Use DCT to show that E(Zn) →
0 as n → ∞ if Xn → X almost surely.
Problem 2.6 Draw a Markov chain with four states that is irreducible and aperi-
odic. Consider two independent versions of the Markov chain: one that starts in
state 1, the other in state 2. Explain why they will meet after a finite time.
Problem 2.7 Consider the Markov chain of Fig. 1.2. Use Python to calculate the
eigenvalues of P . Let λ be the largest absolute value of the eigenvalues other than
1. Use Python to calculate
    d(n) := Σ_i |π(i) − πn(i)|,
Problem 2.8 You flip a fair coin. If the outcome is “head,” you get a random
amount of money equal to X and if it is “tail,” you get a random amount Y. Prove formally that on average, you get

    (1/2) E(X) + (1/2) E(Y).
Problem 2.9 Can you find random variables that converge to 0 almost surely, but
not in probability?
Problem 2.10 Let {Xn , n ≥ 1} be i.i.d. zero-mean random variables with variance
σ 2 . Show that Xn /n → 0 with probability one as n → ∞.
Hint Borel–Cantelli.
    (1/N) Σ_{n=0}^{N−1} f(Xn) → Σ_{i∈X} π(i)f(i) w.p. 1, as N → ∞.
3 Multiplexing: A
Background:
• General RV (B.4)
[Figure: telephone, TV, and internet traffic sharing a link of rate C.]
In the internet, at any given time, a number of packet flows share links. For
instance, 20 users may be downloading web pages or video files and use the same
coaxial cable of their service provider.
The transmission control protocol (TCP) arranges for these different flows to
share the links as equally as possible (at least, in principle).
We focus our attention on a single link, as shown in Fig. 3.3. The link transmits
bits at rate C bps. If ν connections are active at a given time, they each get a rate
C/ν. We want to study the typical rate that a connection gets. The nontrivial aspect
of the problem is that ν is a random variable.
As a simple model, assume that there are N ≫ 1 users who can potentially
use that link. Assume also that the users are active independently, with probability
p. Thus, the number ν of active users is Binomial(N, p) that we also write as
B(N, p). (See Sect. B.2.8.)
Figure 3.4 shows the probability mass function for N = 100 and p = 0.1, 0.2,
and 0.5. To be specific, assume that N = 100 and p = 0.2. The number ν of active
users is B(100, 0.2) that we also write as Binomial(100, 0.2). On average, there
Fig. 3.4 The probability mass function of the Binomial(100, p) distribution, for p = 0.1, 0.2
and 0.5
are Np = 20 active users. However, there is some probability that a few more than
20 users are active. We want to find a number m so that the likelihood that there are
more than m active users is negligible, say 5%. Given that value, we know that each
active user gets at least a rate C/m, with probability 95%.
Thus, we can dimension the links, or provision the network, based on that value
m. Intuitively, m should be slightly larger than the mean. Looking at the actual
distribution, for instance, by using Python’s “ppf” as in Fig. 3.5, we find that
P (ν ≤ 27) = 0.966 > 95% and P (ν ≤ 26) = 0.944 < 95%. (3.1)
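In Python, the computation behind (3.1) can be done with scipy.stats; a sketch along the lines of the ppf-based code of Fig. 3.5, which is not reproduced here:

```python
from scipy.stats import binom

N, p = 100, 0.2
nu = binom(N, p)

# Smallest m such that P(nu <= m) >= 95%: the 95th percentile of the distribution.
m = int(nu.ppf(0.95))
print(m)                        # 27
print(nu.cdf(27), nu.cdf(26))   # approximately 0.966 and 0.944, as in (3.1)
```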
To avoid having to use distribution tables or computation tools, we use the fact
that the binomial distribution is well approximated by a Gaussian random variable
that we discuss next.
(a) A random variable W is Gaussian, or normal, with mean 0 and variance 1, and
one writes W =D N (0, 1), if its probability density function (pdf) is fW where
    fW(x) = (1/√(2π)) exp(−x²/2), x ∈ ℜ.

(b) More generally, a random variable X is Gaussian, or normal, with mean μ and variance σ², and one writes X =D N(μ, σ²), if

    X = μ + σW,

where W =D N(0, 1).
Figure 3.7 shows the pdf of a N (0, 1) random variable W . Note in particular
that
1 See (B.9).
P (W > 1.65) ≈ 5%, P (W > 1.96) ≈ 2.5% and P (W > 2.32) ≈ 1%. (3.2)
The Central Limit Theorem states that the sum of many small independent
random variables is approximately Gaussian. This result explains that thermal noise,
due to the agitation of many electrons, is Gaussian. Many other natural phenomena
exhibit a Gaussian distribution when they are caused by a superposition of many
independent effects.
    P(Y(n) ≤ x) → P(W ≤ x), ∀x ∈ ℜ,
where W is a N (0, 1) random variable. We prove this result in the next chapter.
More generally, one has the following definition.
Consider, for example, X(n) = 3 + 1/n and X = 3. Then P(X(n) ≤ 3) = 0, which does not converge to P(X ≤ 3) = 1. But,

    P(X(n) ≤ x) → P(X ≤ x), ∀x ≠ 3.

This example explains why the definition (3.4) requires convergence of P(X(n) ≤ x) to P(X ≤ x) only for x such that P(X = x) = 0.
How does this notion of convergence relate to convergence in probability and
almost sure convergence? First note that convergence in distribution is defined even
if the random variables X(n) and X are not on the same probability space, since it
involves only the distributions of the individual random variables. One can show2
that

    X(n) →a.s. X implies X(n) →p X implies X(n) ⇒ X.
This may seem mysterious but is in fact quite obvious. First note that a random
variable with cdf F (·) can be constructed by choosing a random variable Z =D
U[0, 1] and defining (see Fig. 3.8)

    X(Z) := min{x | F(x) ≥ Z}.
Indeed, one then has P (X(Z) ≤ a) = F (a) since X(Z) ≤ a if and only if Z ∈
[0, F (a)], which has probability F (a) since Z =D U [0, 1]. But then, if X(n) ⇒ X,
we have FXn (x) → FX (x) whenever P (X = x) = 0, and this implies that
Xn(z) → X(z) for almost all z, so that the constructed random variables Xn(Z) converge to X(Z) with probability 1.
Note that a B(N, p) random variable X can be written as

    X = Y1 + · · · + YN,
where the random variables Yn are i.i.d. and Bernoulli with parameter p. Thus, by
the CLT,
    (X − Np)/√N ≈ N(0, σ²),
where σ 2 = var(Y1 ) = E(Y12 ) − (E(Y1 ))2 = p(1 − p). Hence, one can argue that
B(N, p) ≈D N Np, N σ 2 =D N (Np, Np(1 − p)). (3.5)
For p = 0.2 and N = 100, one concludes that B(100, 0.2) ≈ N (20, 16), which is
confirmed by Fig. 3.9.
46 3 Multiplexing: A
A look at Fig. 3.9 shows that it is indeed unlikely that ν is larger than 27 when
ν =D B(100, 0.2).
One can invert the calculation that we did in the previous section and try to guess p
from the observed fraction Y (N) of active users out of N 1. From the ideas (1)
and (2) above, together with the symmetry of the Gaussian distribution around its
mean, we see that the events
A1 = {B(N, p) ≥ Np + 1.65 Np(1 − p)}
and
A2 = {B(N, p) ≤ Np − 1.65 Np(1 − p)}
each have a probability close to 5%. With Y (N) =D B(N, p)/N , we see that
p(1 − p)
A1 = Y (N) ≥ p + 1.65
N
and
p(1 − p)
A2 = Y (N) ≤ p − 1.65 .
N
3.2 Gaussian Random Variable and CLT 47
Hence, the event A1 ∪ A2 has probability close to 10%, so that its complement has
probability close to 90%. Consequently,
p(1 − p) p(1 − p)
P Y (N) − 1.65 ≤ p ≤ Y (N) + 1.65 ≈ 90%.
N N
For instance, if we observe that 30% of the 100 users are active, then we guess
that p is between 0.22 and 0.38, with probability 90%. In other words, [Y (N) −
0.08, Y (N) + 0.08] is a 90%-confidence interval for p.
Figure 3.7 shows that we can get a 5%-confidence interval by replacing 1.65 by
2. Thus, we see that
1 1
Y (N) − √ , Y (N) + √ (3.6)
N N
Thus, Y (1, 089) is an estimate of p with an error less than 0.03, with probability
95%. Such results form the basis for the design of public opinion surveys.
In many cases, one does not know a bound on the variance. In such situations,
one replaces the standard deviation by the sample standard deviation. That is, for
i.i.d. random variables {X(n), n ≥ 1} with mean μ, the confidence intervals for μ
are as follows:
σn σn
μn − 1.65 √ , μn + 1.65 √ = 90% − Confidence Interval
n n
σn σn
μn − 2 √ , μn + 2 √ = 95% − Confidence Interval,
n n
48 3 Multiplexing: A
where
X(1) + · · · + X(n)
μn =
n
and
n
n
m=1 (X(m) − μn )
2 2
n m=1 X(m)
σn2 = = − μ2n .
n−1 n−1 n
For the second equality, note that the cross-terms E(X(i)X(j )) for i = j vanish
because the random variables are independent and zero-mean.
Hence,
n−1 2
n
E (X(1) − μn )2 = σ and E (X(m) − μn )2 = (n − 1)σ 2 .
n
m=1
1
n
σn2 := E (X(m) − μn )2 .
n−1
m=1
3.3 Buffers
somewhat like an envelope you send by regular mail (if you remember that). A host
(e.g., a computer, a smartphone, or a web cam) sends packets to a switch. The switch
has multiple input and output ports, as shown in Fig. 3.10.
The switch stores the packets as they arrive and sends them out on the appropriate
output port, based on the destination address of the packets. The packets arrive at
random times at the switch and, occasionally, packets that must go out on a specific
output port arrive faster than the switch can send them out. When this happens,
packets accumulate in a buffer. Consequently, packets may face a queueing3 delay
before they leave the switch. We study a simple model of such a system.
3 Queueing and queuing are alternative spellings; queueing tends to be preferred by researchers and
has the peculiar feature of having five vowels in a row, somewhat appropriately.
50 3 Multiplexing: A
0 1 n−1 n n+1 N −1 N
p2 p2 p2 p2 p2 p2 p2 p2
1 − p2 ...... ...... 1 − p0
p0 p0 p0 p0 p0 p0 p0 p0
p1 p1 p1 p1 p1
Fig. 3.11 The transition probabilities for the buffer occupancy for one of the output ports
In this diagram,
p2 = λ(1 − μ)
p0 = μ(1 − λ)
p 1 = 1 − p0 − p 2 .
For instance, p2 is the probability that one new packet arrives and that the
transmission of a previous does not complete, so that the number of packets in the
buffer increases by one.
N
N
E(X) = iπ(i) = π(0) iρ i
i=0 i=0
Nρ N +1 − (N + 1)ρ N + 1
=ρ
(1 − ρ)(1 − ρ N +1 )
ρ p2 λ(1 − μ)
≈ = = ,
1−ρ p 0 − p2 μ−λ
How long do packets stay in the switch? Consider a packet that arrives when there
are k packets already in the buffer. That packet then leaves after k + 1 packet
transmissions. Since each packet transmission takes 1/μ steps, on average, the
expected time that the packet spends in the switch is (k + 1)/μ. Thus, to find the
expected time a packet stays in the switch, we need to calculate the probability φ(k)
that an arriving packet finds k packets already in the buffer. Then, the expected time
W that a packet stays in the switch is given by
k+1
W = φ(k).
μ
k≥0
The result of the calculation is given in the next theorem.
52 3 Multiplexing: A
1 1−μ
W = E(X) = .
λ λ−μ
Proof The calculation is a bit lengthy and the details may not be that interesting,
except that they explain how to calculate φ(k) and that they show that the simplicity
of the result is quite remarkable.
Recall that φ(k) is the probability that there are k + 1 packets in the buffer after a
given packet arrives at time n. Thus, φ(k) = P [X(n+1) = k +1 | A(n) = 1] where
A(n) is the number of arrivals at time n. Now, if D(n) is the number of transmission
completions at time n,
Also,
Consequently, the expected time W that a packet spends in the switch is given by
k+1 1 1
W = φ(k) = + kπ(k)(1 − μ1{k = 0}) + kπ(k + 1)
μ μ μ
k≥0 k≥0 k≥0
1 1
= + kπ(k)(1 − μ) + (k − 1)π(k)
μ μ
k≥0 k≥1
1 1−μ 1 1
= + E(X) + E(X) − 1 = + E(X) − 1
μ μ μ μ
1 λ(1 − μ) 1−μ 1
= + −1= = E(X).
μ μ(μ − λ) μ−λ λ
P [Xn+1 = k + 1 | An = 1] = P [Xn = k | An = 1]
= P [Xn = k] = π(k),
where the second identity comes from the independence of the arrivals An and the
backlog Xn . However, the first identity does not hold since it is possible that Xn+1 =
k, Xn = k, and An = 1. Indeed, one may have Dn = 1.
If one assumes that λ < μ 1, then the probability that An = 1 and Dn = 1
is negligible and it is then the case that π(k) ≈ π(k). We encounter that situation in
Sect. 5.6.
The previous result is a particular case of Little’s Law (Little 1961) (Fig. 3.13).
L = λW,
54 3 Multiplexing: A
One way to understand this law is to consider a packet that leaves the switch
after having spent T time units. During its stay, λT packets arrive, on average. So
the average backlog in the switch should be λT .
It turns out that Little’s law applies to very general systems, even those that do
not serve the packets in their order of arrival.
One way to see this is to think that each packet pays the switch one unit of money
per unit of time it spends in the switch. If a packet spends T time units, on average,
in the switch, then each packet pays T , on average. Thus, the switch collects money
at the rate of λT per unit of time, since λ packets go through the switch per unit of
time and each pays an average of T . Another way to look at the rate at which the
switch is getting paid is to realize that if there are L packets in the switch at any
given time, on average, then the switch collects money at rate L, since each packet
pays one unit per unit time. Thus, L = λT .
The maximum over p of this success rate occurs for p = 1/N and it is λ∗ where
3.5 Summary 55
1 N −1 1
λ∗ = 1 − ≈ ≈ 0.36.
N e
3.5 Summary
3.6 References
3.7 Problems
Problem 3.1 Write a Python code to compute the number of people to poll in a
public opinion survey to estimate the fraction of the population that will vote in
favor of a proposition within α percent, with probability at least 1 − β. Use an upper
bound on the variance. Assume that we know that p ∈ [0.4, 0.7].
Problem 3.2 We are conducting a public opinion poll to determine the fraction p
of people who will vote for Mr. Whatshisname as the next president. We ask N1
college-educated and N2 non-college-educated people. We assume that the votes
in each of the two groups are i.i.d. B(p1 ) and B(p2 ), respectively, in favor of
Whatshisname. In the general population, the percentage of college-educated people
is known to be q.
(a) What is a 95%-confidence interval for p, using an upper bound for the variance.
(b) How do we choose N1 and N2 subject to N1 + N2 = N to minimize the width
of that interval?
Problem 3.3 You flip a fair coin 10,000 times. The probability that there are more
than 5085 heads is approximately (choose the correct answer)
15%;
10%;
5%;
2.5%;
1%.
3.7 Problems 57
Problem 3.5 Consider a buffer that can transmit up to M packets in parallel. That
is, when there are m packets in the buffer, min{m, M} of these packets are being
transmitted. Also, each of these packets completes transmission independently in
the next time slot with probability μ. At each time step, a packet arrives with
probability λ.
(a) What are the transition probabilities of the corresponding Markov chain?
(b) For what values of λ, M, and μ do you expect the system to be stable?
(c) Write a Python simulation of this system.
Problem 3.6 In order to estimate the probability of head in a coin flip, p, you flip a
coin n times, and count the number of heads, Sn . You use the estimator p̂ = Sn /n.
You choose the sample size n to have a guarantee
P (|Sn /n − p| ≥ ) ≤ δ.
Problem 3.8 Consider one buffer where packets arrive one by one every 2 s and
take 1 s to transmit. What is the average delay through the queue per packet? Repeat
the problem assuming that the packets arrive ten at a time every 20 s. This example
shows that the delay depends on how “bursty” the traffic is.
p
Problem 3.9 Show that if X(n) → X, then X(n) ⇒ X.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Multiplexing: B
4
Before we explain the proof of the CLT, we have to describe the use of characteristic
functions.
φX (u) = E(eiuX ), u ∈ .
√
In this expression, i := −1.
Note that
∞
φX (u) = eiux fX (x)dx,
−∞
so that φX (u) is the Fourier transform of fX (x). As such, the characteristic function
determines the pdf uniquely.
As an important example, we have the following result.
Theorem 4.1 (Characteristic Function of N (0, 1)) Let X =D N (0, 1). Then,
u2
φX (u) = e− 2 . (4.1)
Proof One has
∞ 1 x2
φX (u) = eiux √ e− 2 dx,
−∞ 2π
so that
∞ ∞
d 1 x2 1 x2
φX (u) = ixeiux √ e− 2 dx = − ieiux √ de− 2
du −∞ 2π −∞ 2π
∞ ∞
1 x2 1 x2
= i √ e− 2 deiux = −u eiux √ e− 2 dx
−∞ 2π −∞ 2π
= −uφX (u).
d d u2
log(φX (u)) = −u = − ,
du du 2
which implies that
u2
φX (u) = Ae− 2 .
Since φX (0) = E(ei0X ) = 1, we see that A = 1, and this proves the result (4.1).
We have
4.3 Moments of N (0, 1) 61
iu(X(m) − μ)
φY (n) (u) = E e iuY (n)
= E Πm=1 exp
n
√
σ n
n
iu(X(1) − μ)
= E exp √
σ n
n
iu(X(1) − μ) u2 (X(1) − μ)2
= E 1+ √ − + o(1/n)
σ n 2σ 2 n
n !
= 1 − u2 /(2n) + o(1/n) → exp −u2 /2 , as n → ∞.
The third equality holds because the X(m) are i.i.d. and the fourth one follows from
the Taylor expansion of the exponential:
1
ea ≈ 1 + a + a 2 .
2
Third, we match the coefficients of u2m in these two expressions and we find that
1 2m 2m 1 1 m
i E X = − ,
(2m)! m! 2
62 4 Multiplexing: B
This gives1
(2m)!
E X2m = . (4.2)
m!2m
For instance,
2! 4!
E(X2 ) = = 1, E X4 = = 3.
1!21 2!22
Finally, we note that the coefficients of odd powers of u must be zero, so that
E X2m+1 = 0, for m = 0, 1, 2, . . . .
Z = X2 + Y 2 =D Exp(1/2).
Let θ be the angle of the vector (X, Y ) and R 2 = X2 + Y 2 . Thus (see Fig. 4.1)
dxdy = rdrdθ.
i 2m = (−1)m .
4.5 Two Applications of Characteristic Functions 63
x
dq
where
2
1 r
fθ (θ ) = 1{0 < θ < 2π } and fR (r) = r exp − 1{r ≥ 0}.
2π 2
√
Thus, the angle θ of (X, Y ) and the norm R = X2 + Y 2 are independent and have
the indicated distributions. But then, if V = R 2 =: g(R), we find that, for v ≥ 0,
2
1 1 r 1 v!
fV (v) = fR (r) = r exp − = exp −
|g (R)| 2r 2 2 2
which shows that the angle θ and V = X2 + Y 2 are independent, the former being
uniformly distributed in [0, 2π ] and the latter being exponentially distributed with
mean 2.
We have used characteristic functions to prove the CLT. Here are two other cute
applications.
A Poisson random variable X with mean λ can be viewed as a limit of a B(n, λ/n)
random variable Xn as n → ∞. To see this, note that
where the random variables {Zn (1), . . . , Zn (n)} are i.i.d. Bernoulli with mean λ/n.
Hence,
64 4 Multiplexing: B
n
" #n λ iu
E(exp{iuXn }) = E(exp{iu(Zn (1)})) = 1 + (e − 1) .
n
For the second identity, we use the fact that if Z =D B(p), then
Also, since
∞
λm −λ am
P (X = m) = e and ea = ,
m! m!
m=0
we find that
∞
λm
E(exp{iuX}) = exp{−λ}eium = exp{λ(eiu − 1)}.
m!
m=0
1
Xn → X, in distribution.
n
To see this, recall that
Also,
∞ 1
e−βx =
0 β
Moreover, since
P (Xn = m) = (1 − p)m p, m ≥ 0,
The function Q(x) is called the error function. With Python or the appropriate smart
phone app, you can get the value of Q(x). Nevertheless, the following bounds (see
Fig. 4.2) may be useful.
Proof Here is a derivation of the upper bound. For x > 0, one has
∞ ∞ ∞
1 y2 1 y − y2
Q(x) = fX (y)dy = √ e− 2 dy = √ e 2 dy
x x 2π 2π x y
66 4 Multiplexing: B
∞ ∞
1 y − y2 1 y2
≤√ e 2 dy = √ ye− 2 dy
2π x x x 2π x
∞ 2
1 − y2 1 x2
=− √ de = √ e− 2 .
x 2π x x 2π
For the lower bound, one uses the following calculation, again with x > 0:
∞ ∞
1 y2 1 y2
1+ e− 2 dy ≥ 1+ e− 2 dy
x2 x x y2
∞
1 − y2 1 − x2
=− d e 2 = e 2.
x y x
0.7
Tn
0.6
0.5
N = 100
0.4
0.3
N = 40
0.2
0.1
n
0
0 200 400 600 800 1000 1200
In these update rules, a and b are constants with a ∈ (0, 1) and b > 1. The idea
is to increase p(n) if no device transmitted and to decrease it after a collision. This
scheme is due to Hajek and Van Loon (1982) (Fig. 4.3).
Figure 4.4 shows the evolution over time of the success rate Tn . Here,
1
n−1
Tn = 1{X(m) = 1}.
n
m=0
The figure uses a = 0.8 and b = 1.2. We see that the throughput approaches the
optimal value for N = 40 and for N = 100. Thus, the scheme adapts automatically
to the number of active devices.
68 4 Multiplexing: B
4.8 Summary
• Characteristics Function;
• Proof of CLT;
• Moments of Gaussian;
• Sum of Squares of Gaussians;
• Poisson as limit of Binomial;
• Exponential as limit of Geometric;
• Adaptive Multiple Access Protocol.
4.9 References
The CLT is a classical result, see Bertsekas and Tsitsiklis (2008), Grimmett and
Stirzaker (2001) or Billingsley (2012).
4.10 Problems
Problem 4.1 Let X be a N(0, 1) random variable. You will recall that E(X2 ) = 1
and E(X4 ) = 3.
explained in the text. Plot the total backlog in all the stations as a function of
time.
Problem 4.3 Consider a multiple access scheme where the N stations indepen-
dently transmit short reservation packets with duration equal to one time unit with
probability p. If the reservation packets collide or no station transmits a reservation
packet, the stations try again. Once a reservation is successful, the succeeding station
transmits a packet during K time units. After that transmission, the process repeats.
Calculate the maximum fraction of time that the channel can be used for transmitting
packets. Note: This scheme is called Reservation Aloha.
Problem 4.4 Let X be a random variable with mean zero and variance 1. Show that
E(X4 ) ≥ 1.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Networks: A
5
We prove that result in the next chapter. The result should be intuitive: if Z < 1,
the spreading dies out, like a population that does not reproduce enough. This model
is also relevant for the spread of epidemics or cyber viruses.
5.2 Cascades
If most of your friends prefer Apple over Samsung, you may follow the majority. In
turn, your advice will influence other friends. How big is such an influence cascade?
We model that situation with nodes arranged in a line, in the chronological order
of their decisions, as shown in Fig. 5.2. Node n listens to the advice of a subset of
{0, 1, . . . , n − 1} who have decided before him. Specifically, node n listens to the
advice of node n − k independently with probability pk , for k = 1, . . . , n. If the
majority of these friends are blue, node n turns blue; if the majority are red, node
n turns red; in case of a tie, node n flips a fair coin and turns red with probability
1/2 or blue otherwise. Assume that, initially, node 0 is red. Does the fraction of red
nodes become larger than 0.5, or does the initial effect of node 0 vanish?
A first observation is that if nodes listen only to their left-neighbor with
probability p ∈ (0, 1), the cascade ends. Indeed, there is a first node that does
not listen to its neighbor and then turns red or blue with equal probabilities.
Consequently, there will be a string of red nodes followed by a string of blue node,
and so on. By symmetry, the lengths of those strings are independent and identically
distributed. It is easy to see they have a finite mean. The SLLN then implies that the
fraction of red nodes among the first n nodes converges to 0.5. In other words, the
influence of the first node vanishes.
The situation is less obvious if pk = p < 1 for all k. Indeed, in this case, as n
gets large, node n is more likely to listen to many previous neighbors. The slightly
5.3 Seeding the Market 73
surprising result is that, no matter how small p is, there is a positive probability that
all the nodes turn red.
Theorem 5.2 (Cascades) Assume pk = p ∈ (0, 1] for all k ≥ 1. Then, all nodes
turn red with probability at least equal to θ where
1−p
θ = exp − .
p
We prove the result in the next chapter. It turns out to be possible that every node
listens to at least one previous node. In that case, all the nodes turn red.
Some companies distribute free products to spread their popularity. What is the best
fraction of customers who should get free products? To explore this question, let
us go back to our model where each node listens only to its left-neighbor with
probability p. The system is the same as before, except that each node gets a free
product and turns red with probability λ. The fraction of red nodes increases in λ
and we write it as ψ(λ). If the cost of a product is c and the selling price is s, the
company makes a profit (s − c)ψ(λ) − cλ since it makes a profit s − c from a buyer
and loses c for each free product. The company then can select λ to optimize its
profit. Next, we calculate ψ(λ).
Let π(n − 1) be the probability that user n − 1 is red. If user n listens to n − 1,
he turns red unless n − 1 is blue and he does not get a free product. If he does not
listen to n − 1, he turns red with probability 0.5 if he does not get a free product and
with probability one otherwise. Thus,
Since p(1 − λ) < 1, the value of π(n) converges to the value ψ(λ) that solves the
fixed point equation
Hence,
1 + λ − p + λp
ψ(λ) = 0.5 .
1 − p(1 − λ)
To maximize the profit (s − c)ψ(λ) − cλ, we substitute the expression for ψ(λ)
in the profit and we set the derivative with respect to λ equal to zero. After some
algebra, we find that the optimal λ∗ is given by
∗ (1 − p)1/2 − (1 − p) 0.5(s − c)
λ = min 1, .
p c
Not surprisingly, λ∗ increases with the profit margin (s − c)/c and decreases with
p.
Three people walk into a bar. No, this is not a joke. They chat and, eventually, leave
with the same majority opinion. As such events repeat, the opinion of the population
evolves. We explore a model of this evolution.
Consider a population of 2N ≥ 4 people. Initially, half believe red and the other
half believe blue. We choose three people at random. If two are blue and one is red,
they all become blue, and they return to the general population. The other cases are
similar. The same process then repeats. Let Xn be the number of blue people after n
steps, for n ≥ 1 and let X0 = N. Then Xn is a Markov chain. This Markov chain has
two absorbing states: 0 and 2N. Indeed, if Xn = k for some k ∈ {1, . . . , 2N − 1},
there is a positive probability of choosing three people where two have one opinion
and the third has a different one. After their meeting, Xn+1 = Xn . The Markov
chain is such that P (1, 0) = 1 and P (2N − 1, 2N) = 1. Moreover, P (k, k) >
0, P (k, k + 1) > 0, and P (k, k − 1) > 0 for all k ∈ {2, . . . , 2N − 2}. Consequently,
with probability one,
Thus, eventually, everyone is blue or everyone is red. By symmetry, the two limits
have probability 0.5.
What is the effect of the media on the limiting consensus? Let us modify our
previous model by assuming that when two blue and one red person meet, they all
turn blue with probability 1 − p and remain as before with probability p. Here p
models the power of the media at convincing people to stay red. If two red and one
blue meet, they all turn red.
We have, for k ∈ {2, . . . , 2N − 2},
5.4 Manufacturing of Consent 75
k(k − 1)(2N − k)
P [Xn+1 = k + 1 | Xn = k] = (1 − p)3 =: p(k).
2N(2N − 1)(2N − 2)
We want to calculate
where T0 is the first time that Xn = 0 and T2N is the first time that Xn = 2N. Then,
α(N) is the probability that the population eventually becomes all red.
The first step equations are, for k ∈ {2, . . . , 2N − 2},
i.e.,
q(k) q(k)
α(k + 1) = (1 + )α(k) − α(k − 1), k = 2, 3, . . . , 2N − 2
p(k) p(k)
2N − k − 1
= 1+ α(k)
(1 − p)(k − 1)
2N − k − 1)
− α(k − 1), k = 2, 3, . . . , 2N − 2.
(1 − p)(k − 1)
Fig. 5.3 The effect of the media. Here, p is the probability that someone remains red after chatting
with two blue people. The graph shows the probability that the whole population turns blue instead
of red. A small amount of persuasion goes a long way
5.5 Polarization
In most countries, the population is split among different political and religious
persuasions. How is this possible if everyone is faced with the same evidence? One
effect is that interactions are not fully mixing. People belong to groups that may
converge to a consensus based on the majority opinion of the group.
To model this effect, we consider a population of N people. An adjacency matrix
G specifies which people are friends. Here, G(v, w) = 1 if v and w are friends and
G(v, w) = 0 otherwise.
Initially, people are blue or red with equal probabilities. We pick one person at
random. If that person has a majority of red friends, she becomes red. If the majority
of her friends are blue, she becomes blue. If it is a tie, she does not change. We repeat
the process. Note that the graph does not change; it is fixed throughout. We want to
explore how the coloring of people evolves over time.
Let Xn (v) ∈ {B, R} be the state of person v at time n, for n ≥ 0 and v ∈
{1, . . . , N }. We pick v at random. We count the number of red friends and blue
friends of v. They are given by
G(v, w)1{Xn (w) = R} and G(v, w)1{Xn (w) = B}.
w w
Thus,
⎧
⎨ R, if w G(v, w)1{Xn (w) = R} > w G(v, w)1{Xn (w) = B}
Xn+1 (v) = B, if w G(v, w)1{Xn (w) = R} < w G(v, w)1{Xn (w) = B}
⎩
Xn (v), otherwise.
That is, V (Xn ) is the number of disagreements among friends. The rules of
evolution guarantee that V (Xn+1 ) ≤ V (Xn ) and that P (V (Xn+1 ) < V (Xn )) > 0
unless P (Xn+1 = Xn ) = 1. Indeed, if the state of v changes, it is to make that
person agree with more of her neighbors. Also, if there is no v who can reduce
her number of disagreements, then the state can no longer change. These properties
imply that the state converges.
A simple example shows that the limit may be random. Consider four people at
the vertices of a square that represents G. Assume that two opposite vertices are
blue and the other two are red. If the first person v to reconsider her opinion is blue,
she turns red, and the limit is all red. If v is red, the limit is all blue. Thus, the limit
is equally likely to be all red or all blue.
In the limit, it may be that a fraction of the nodes are red and the others are blue.
For instance, if the nodes are arranged in a line graph, then the limit is alternating
sequences of at least two red nodes and sequences of at least two blue nodes.
The properties of the limit depend on the adjacency graph G. One might think
that a close group of friends should have the same color, but that is not necessarily
the case, as the example of Fig. 5.4 shows.
¸ ¸ ¸ ¸
0 1 2 3 .....
¹ ¹ ¹ ¹
5.6 M/M/1 Queue 79
The bottom part of Fig. 5.5 is a state transition diagram that indicates the rates of
transitions. For instance, the arrow from 1 to 2 is marked with λ to indicate that, in
1 s, the queue length jumps from 1 to 2 with probability λ. The figure shows
that arrivals (that increase the queue length) occur at the same rate λ, independently
of the queue length. Also, service completions (that reduce the queue length) occur
at rate μ as long as the queue is nonempty.
Note that
The first identity is the law of total probability: the event {Xt+ = 0} is the union of
the two disjoint events {Xt = 0, Xt+ = 0} and {Xt = 1, Xt+ = 0}. The second
identity uses the fact that {Xt = 0, Xt+ = 0} occurs when Xt = 0 and there is no
arrival during (t, t +]. This event has probability P (Xt = 0) multiplied by (1−λ)
since arrivals are independent of the current queue length. The other term is similar.
Now, imagine that π is a pmf on Z≥0 := {0, 1, . . .} such that P (Xt = i) = π(i)
for all time t and i ∈ Z≥0 . That is, assume that π is an invariant distribution for Xt .
In that case, P (Xt+ = 0) = π(0), P (Xt = 0) = π(0), and P (Xt = 1) = π(1).
Hence, the previous identity implies that
π(0)λ ≈ π(1)μ.
Hence,
The Eqs. (5.1)–(5.2) are called the balance equations. Thus, if π is invariant for Xt ,
it must satisfy the balance equations. Looking back at our calculations, we also see
that if π satisfies the balance equations, and if P (Xt = i) = π(i) for all i, then
P (Xt+ = i) = π(i) for all i. Thus, π is invariant for Xt if and only if it satisfies
the balance equations.
One can solve the balance equations (5.1)–(5.2) as follows. Equation (5.1) shows
that π(1) = ρπ(0) with ρ = λ/μ. Subtracting (5.1) from (5.2) yields
π(1)λ = π(2)μ.
This equation then shows that π(2) = π(1)ρ = π(0)ρ 2 . Continuing in this way
shows that π(n) = π(0)ρ n for n ≥ 0. To find π(0), we use the fact that n π(n) =
1. That is
∞
π(0)ρ n = 1.
n=0
1
π(0) = 1,
1−ρ
π(n) = (1 − ρ)ρ n , n ≥ 0.
To calculate the average delay W of a customer in the queue, one can use Little’s
Law L = λW . This identity implies that
1
W = .
μ−λ
completions before he leaves. Since very service completions lasts 1/μ on average,
his average delay is (k + 1)/μ. Now, the probability that this customer finds k other
customers in the queue is π(k). To see this, note that the probability that a customer
who enters the queue between time t and t + finds k customers in the queue is
Figure 5.6 shows a representative network of queues. Two types of customers arrive
into the network, with respective rates γ1 and γ2 . The first type goes through queue
1, then queue 3, and should leave the network. However, with probability p1 these
customers must go back to queue 1 and try again. In a communication network, this
event models an transmission error where a packet (a group of bits) gets corrupted
and has to be retransmitted. The situation is similar for the other type. Thus, in
1 time unit, a packet of the first type arrives with probability γ1 , independently
of what happened previously. This is similar to the arrivals into an M/M/1 queue.
Also, we assume that the service times are exponentially distributed with rate μk in
queue k, for k = 1, 2, 3.
Let Xtk be the number of customers in queue k at time t, for k = 1, 2 and t ≥ 0.
Let also Xt3 be the list of customer types in queue 3 at time t. For instance, in
Fig. 5.6, one has Xt3 = (1, 1, 2, 1), from tail to head of the queue to indicate that the
customer at the head of the queue is of type 1, that he is followed by a customer of
γ1 μ1 μ1 γ1
μ3 p1
μ3 p2 (3, 2, (1, 1, 2, 1))
λ2 μ3 (1 − p1 ) γ2
μ2
γ2 μ2 (4, 2, (1, 1, 2)) (3, 1, (2, 1, 1, 2, 1)) (3, 3, (1, 1, 2, 1))
type 2, etc. Because of the memoryless property of the exponential distribution, the
process Xt = (X11 , Xt2 , Xt3 ) is a Markov chain: observing the past up to time t does
not help predict the time of the next arrival or service completion.
Figure 5.6 shows the transition rates out of the current state (3, 2, (1, 1, 2, 1)).
For instance, with rate μ3 p1 , a service completes in queue 3 and that customer has
to go back to queue 1, so that the new state is (4, 2, (1, 1, 2)). The other transitions
are similar.
One can then, in principle, write down the balance equations and try to solve
them. This looks like a very complex task and it seems very unlikely that one
could solve these equations analytically. However, a miracle occurs and one has the
remarkably simple result stated in the next theorem. Before we state the result, we
need to define λ1 , λ2 , and λ3 . As sketched in Fig. 5.6, for k = 1, 2, 3, the quantity
λk is the rate at which customers go through queue k, in the long term. These rates
should be such that
λ 1 = γ 1 + λ 1 p1
λ 2 = γ 2 + λ 2 p2
λ3 = λ1 + λ2 .
For instance, the rate λ1 at which customers enter queue 1 is the rate γ1 plus the rate
at which customers of type 1 that leave queue 3 are sent back to queue 1. Customers
of type 1 go through queue 3 at rate λ1 , since they come out of queue 1 at rate λ1 ;
also, a fraction p1 of these customers go back to queue 1. The other expressions
can be understood similarly. The equations above are called the flow conservation
equations.
These equations admit the following solution:
γ1 γ2
λ1 = , λ2 = , λ3 = λ1 + λ2 .
1 − p1 1 − p2
5.7 Network of Queues 83
This result shows that the invariant distribution has a product form.
We prove this result in the next chapter. It indicates that under the invariant
distribution π , the states of the three queues are independent. Moreover, the state
of queue 1 has the same invariant distribution as an M/M/1 queue with arrival rate
λ1 and service rate μ1 , and similarly for queue 2. Finally, queue 3 has the same
invariant distribution as a single queue with arrival rates λ1 and λ2 and service rate
μ3 : the length of queue 3 has the same distribution as an M/M/1 queue with arrival
rate λ1 + λ3 and the types of the customers in the queue are independent and of type
1 with probability p(1) and 2 with probability p(2).
This result is remarkable not only for its simplicity but mostly because it is
surprising. The independence of the states of the queues is shocking: the arrivals into
queue 3 are the departures from the other two queues, so it seems that if customers
are delayed in queues 1 and 2, one should have larger values for Xt1 and Xt2 and a
smaller one for the length of queue 3. Thus, intuition suggests a strong dependency
between the queue lengths. Moreover, the fact that the invariant distributions of the
queues are the same as for M/M/1 queues is also shocking. Indeed, if there are
many customers in queue 1, we know that a fraction of them will come back into
the queue, so that future arrivals into queue 1 depend on the current queue length,
which is not the case for an M/M/1 queue. The paradox is explained in a reference.
We use this theorem to calculate the delay of customers in the network.
where
γ1 γ2
λ1 = and λ2 = .
1 − p1 1 − p2
Proof We use Little’s Law that says that Lk = γk Wk where Lk is the average
number of customers of type k in the network. Consider the case k = 1. The other
one is similar. L1 is the average number of customers in queue 1 plus the average
number of customers of type 1 in queue 3.
The average length of queue 1 is λ1 /(μ1 − λ1 ) because the invariant distribution
of queue 1 is the same as that of an M/M/1 queue with arrival rate λ1 and service
rate μ1 .
The average length of queue 3 is (λ1 + λ2 )/(μ3 − λ1 − λ2 ) because the invariant
distribution of queue 3 is the same as queue with arrival rate λ1 and λ2 and service
rate μ3 . Also, the probability that any customer in queue 3 is of type 1 is p(1) =
λ1 /(λ1 + λ2 ). Thus, the average number of customers of type 1 in queue 3 is
84 5 Networks: A
λ1 + λ2 λ1
p(1) = .
μ3 − λ1 − λ2 μ3 − λ1 − λ2
Hence,
λ1 λ1
L1 = + .
μ1 − λ1 μ3 − λ1 − λ2
We use our network model to optimize the rates of the transmitters. The basic idea
is that nodes with more traffic should have faster transmitter. To make this idea
precise, we formulate an optimization problem: minimize a delay cost subject to a
given budget for buying the transmitters.
We carry out the calculations not because of the importance of the specific
example (it is not important!) but because they are representative of problems of
this type.
Consider once again the network in Fig. 5.6. Assume that the cost of the
transmitters is c1 μ1 + c2 μ2 + c3 μ3 . The delay cost is d1 W1 + d2 W2 where Wk
is the average delay for packets of type k (k = 1, 2). The problem is then as follows:
Minimize D(μ1 , μ2 , μ3 ) := d1 W1 + d2 W2
subject to C(μ1 , μ2 , μ3 ) := c1 μ1 + c2 μ2 + c3 μ3 ≤ B.
dk 1 1
D(μ1 , μ2 , μ3 ) = + .
1 − pk μk − λk μ3 − λ1 − λ2
k=1,2
where λ > 0 is a Lagrange multiplier that penalizes capacities that have a high cost.
To solve this problem for a given value of λ, we set to zero the derivative of this
expression with respect to each μk . For k = 1, 2 we find
5.8 Optimizing Capacity 85
∂ ∂
0= D(μ1 , μ2 , μ3 ) + αC(μ1 , μ2 , μ3 )
∂μk ∂μ1
dk 1
=− + αck .
1 − pk (μk − λk )2
For k = 3, we find
d1 /(1 − p1 ) + d2 /(1 − p2 )
0=− + αc3 .
(μ3 − λ1 − λ2 )2
Hence,
1/2
dk
μk = λk + , for k = 1, 2
αck (1 − pk )
d1 /(1 − p1 ) + d2 /(1 − p2 ) 1/2
μ3 = λ1 + λ2 + .
αc3
C(μ1 , μ2 , μ3 ) = c1 λ1 + c2 λ2 + c3 (λ1 + λ3 )
⎡ ⎛ ⎞1/2 ⎤
1 ⎢ dk ck 1/2 dk
⎠ ⎥
+ c3 ⎝
1/2
+√ ⎣ ⎦.
α 1 − pk 1 − pk
k=1,2 k=1,2
where
B − c1 λ1 − c2 λ2 − c3 (λ1 + λ2 )
D= 1/2 1/2 .
dk c k 1/2 dk ck
k=1,2 1−pk + c3 k=1,2 1−pk
86 5 Networks: A
These results show that, for k = 1, 2, the capacity μk increases with dk , i.e.,
the cost of delays of packets of type k; it also decreases with ck , i.e., the cost of
providing that capacity.
A numerical solution can be obtained using a scipy optimization tool called
minimize. Here is the code.
import numpy as np
from scipy.optimize import minimize
d = [1, 2] # delay cost coefficients
c = [2, 3, 4] # capacity cost coefficients
l = [3, 2] # rates l[0] = lambda1, etc
p = [0.1, 0.2] # error probabilities
B = 60 # capacity budget
UB = 50 # upper bound on capacity
# x = mu1, mu2, mu3: x[0] = mu1, etc
def objective(x): # objective to minimize
z = 0
for k in range(2):
z = z + (d[k]/(1 - p[k]))*(1/(x[k] - l[k])
+ 1/(x[2] - l[0]-l[1]))
return z
def constraint(x): # budget constraint >= 0
z = B
for k in range(3):
z = z - c[k]*x[k]
return z
x0 = [5,5,10] # initial value for optimization
b0 = (l[0], UB) # lower and upped bound for x[0]
b1 = (l[1], UB) # lower and upped bound for x[1]
b2 = (l[0]+l[1], UB) # lower and upped bound for x[1]
bnds = (b0,b1,b2) # bounds for the three variables x
con = {’type’: ’ineq’, ’fun’: constraint}
# specifies constraints
sol = minimize(objective,x0,method=’SLSQP’,
bounds = bnds, constraints=con)
# sol will be the solution
print(sol)
The code produces an approximate solution. The advantage is that one does not
need any analytical skills. The disadvantage is that one does not get any qualitative
insight.
5.10 Product-Form Networks 87
Can one model the internet as a network of queues? If so, does the result of
the previous section really apply? Well, the mathematical answers are maybe and
maybe.
The internet transports packets (groups of bits) from node to node. The nodes
are sources and destinations such as computers, webcams, smartphones, etc., and
network nodes such as switches or routers. The packets go from buffer to buffer.
These buffers look like queues. The service times are the transmission times of
packets. The transmission time of a packet (in seconds) is the number of bits in
the packet divided by the rate of the transmitter (in bits per second). The packets
have random lengths, so the service times are random. So, the internet looks like a
network of queues. However, there are some important ways in which our network
of queues is not an exact model of the internet. First, the packet lengths are not
exponentially distributed. Second, a packet keeps the same number of bits as it
moves from one queue to the next. Thus, the service times of a given packet in the
different queues are all proportional to each other. Third, the time between the arrival
two successive packets from a given node cannot be smaller than the transmission
time of the first packet. Thus, the arrival times and the service times in one queue
are not independent and the times between arrivals are not exponentially distributed.
The real question is whether the internet can be approximated by a network
similar to that of the previous section. For instance, if we use that model, are we
very far off when we try to estimate delays of queue lengths? Experiments suggest
that the approximation may be reasonable to a first order. One intuitive justification
is the diversity of streams of packets. It goes as follows. Consider one specific queue
in a large network node of the internet. This node is traversed by packets that come
from many different sources and go to many destinations. Thus, successive packets
that arrive at the queue may come from different previous nodes, which reduces
the dependency of the arrivals and the service times. The service time distribution
certainly affects the delays. However, the results obtained assuming an exponential
distribution may provide a reasonable estimate.
Define λci as the average rate of customers of class c that go through queue i, for
i ∈ {1, . . . , N } and for c ∈ {1, . . . , C}. Assume that the rate of arrivals of customers
of a given class into a queue is equal to the rate of departures of those customers
from the queue. Then the rates λci should satisfy the following flow conservation
equations:
N
C
d,c
λci = γic + rj,i , i ∈ {1, . . . , N }, c ∈ {1, . . . , C}.
j =1 d=1
Let also X(t) = {Xi (t), i = 1, . . . , N} where Xi (t) is the configuration of queue
i at time t ≥ 0. That is, Xi (t) is the list of customer classes in queue i, from the tail
of the queue to the head of the queue. For instance, Xi (t) = 132,312 if the customer
at the tail of queue i is of class 1, the customer in front of her is of class 3, and so
on, and the customer at the head of the queue and being served is of class 2. If the
queue is empty, then Xi (t) = [], where [] designates the empty string.
One then has the following theorem.
π(x) = AΠi=1
N
gi (xi ),
where
λci 1 · · · λci N
gi (c1 · · · cn ) =
μni
and A is a constant such that πi sums to one over all the possible configurations
of the queues.
(b) If the network is open in that every customer can leave the network, then the
invariant distribution becomes
π(x) = Πi=1
N
πi (xi ),
where
c c
λ i λi 1 · · · λi n
πi (c1 · · · cn ) = 1 − .
μi μni
In this case, under the invariant distribution, the queue lengths at time t are
all independent, the length of queue i has the same distribution as that of an
5.11 References 89
M/M/1 queue with arrival rate λi and service rate μi , and the customer classes
are all independent and are equal to c with probability λci /λi .
The proof of this theorem is the same as that of the particular example given in
the next chapter.
5.10.1 Example
Figure 5.7 shows a network with two types of jobs. There is a single gray job that
visits the two queues as shown. The white jobs go through the two queues once. The
gray job models “hello” messages that the queues keep on exchanging to verify that
the system is alive. For ease of notation, we assume that the service rates in the two
queues are identical.
We want to calculate the average time that the white jobs spend in the system
and compare that value to the case when there is no gray job. That is, we want
to understand the “cost” of using hello messages. The point of the example is to
illustrate the methodology for networks where some customers never leave. The
calculations show the following somewhat surprising result.
Theorem 5.7 Using a hello message increases the expected delay of the white jobs
by 50%.
We prove the theorem in the next chapter. In that proof, we use Theorem 5.6 to
calculate the invariant distribution of the system, derive the expected number L of
white jobs in the network, then use Little’s Law to calculate the average delay W of
the white jobs as W = L/γ . We then compare that value to the case where there is
not gray job.
5.11 References
The literature on social networks is vast and growing. The textbook Easley and
Kleinberg (2012) contains many interesting models and result. The text Shah (2009)
studies the propagation of information in networks.
The book Kelly (1979) is the most elegant presentation of the theory of queueing
networks. It is readily available online. The excellent notes Kelly and Yudovina
(2013) discuss recent results. The nice textbook Srikant and Ying (2014) explains
network optimization and other performance evaluation problems. The books
90 5 Networks: A
Bremaud (2017) and Lyons and Perez (2017) are excellent sources for deeper studies
of networks. The text Walrand (1988) is more clumsy but may be useful.
5.12 Problems
Problem 5.1 There are K users of a social network who collaborate to estimate
some quantity by exchanging information. At each step, a pair (i, j ) of users is
selected uniformly at random and user j sends a message to user i with his estimate.
User i then replaces his estimate by the average of his estimate and that of user j .
Show that the estimates of all the users converge in probability to the average value
of the initial estimates. This is an example of consensus algorithm.
Hint Let Xn (i) be the estimate of user i at step n and Xn the vector with components
Xn (i). Show that
so that
and
Markov’s inequality then shows that P (|Xn (i) − A| > ) → 0 for any > 0.
Problem 5.2 Jobs arrive at rate γ in the system shown in Fig. 5.8. With probability
p, a customer is sent to queue 1, independently of the other jobs; otherwise, the job
is sent to queue 2. For i = 1, 2, queue i serves the jobs at rate μi . Find the value
of p that minimizes the average delay of jobs in the system. Compare the resulting
average delay to that of the system where the jobs are in one queue and join the
available server when they reach the head of the queue, and the fastest server if both
are idle, as shown in the bottom part of Fig. 5.8.
Hint The system of the top part of the figure is easy to analyze: with probability
p, a job faces the average delay 1/(μ1 − γp) in the top queue and with probability
1 − p the job faces the average delay 1/(μ2 − γ (1 − p)), One the finds the value of
p that minimizes the expected delay. For the system in the bottom part of the figure,
the state is n with n ≥ 2 when there are at least two jobs and the two servers are
5.12 Problems 91
μ1
γ
μ2
busy, or (1, s) where s ∈ {1, 2} indicates which server is busy, or 0 when the system
is empty. One then needs to find the invariant distribution of the state, compute the
average number of jobs, and use Little’s Law to find the average delay. The state
transition diagram is shown in Fig. 5.9.
Problem 5.3 This problem compares parallel queues to a single queue. There are
N servers. Each server serves customers at rate μ. The customers arrive at rate
N λ. In the first system, the customers are split into N queues, one for each server.
Customers arrive at each queue with rate λ. The average delay is that of an M/M/1
queue, i.e., 1/(μ − λ). In the second system, the customers join a single queue. The
customer at the head of the queue then goes to the next available server. Calculate
the average delay in this system. Write a Python program to plot the average delays
of the two systems as a function ρ := λ/μ for different values of N .
Problem 5.4 In this problem, we explore a system of parallel queues where the
customers join the shortest queue. Customers arrive at rate Nλ and there are N
queues, each with a server who serves customers at rate μ > λ. When a customer
arrives, she joins the shortest queue. The goal is to analyze the expected delay in
the system. Unfortunately, this problem cannot be solved analytically. So, your task
is to write a Python program to evaluate the expected delay numerically. The first
92 5 Networks: A
2 μ
1
A
step is to draw the state transition diagram. Approximate the system by discarding
customers who arrive when there are already M customers in the system. The second
step is to write the balance equations. Finally, one writes a program to solve the
equations numerically.
Problem 5.5 Figure 5.11 shows a system of N queues that serve jobs at rate μ. If
there is a single job, it takes on average N/μ time units for it to go around the circle.
Thus, the average rate at which a job leaves a particular queue is μ/N . Show that
when there are two jobs, this rate is 2μ/(N + 1).
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Networks—B
6
Proof For part (a), let Xn be the number of nodes that are n steps from the root. If
Xn = k, we can write Xn+1 = Y1 + · · · + Yk where Yj is the number of children of
node j at level n. By assumption, E(Yj ) = μ for all j . Hence,
E(Xn ) = μn , n ≥ 0.
1 − μn+1
E(Zn ) = μ0 + · · · + μn = .
1−μ
Indeed, if X1 = k, the probability that none of the k children of the root has a
survivor after n generations is (1 − αn )k . Hence,
Also, α0 = 1. As n → ∞, one has αn → α ∗ = P (Xn > 0, for all n). Figure 6.1
shows that α ∗ > 0. The key observations are that
g(0) = 0
g(1) = P (X1 > 0) < 1
g (0) = E(X1 (1 − α)X1 −1 ) |α=0 = μ > 1
g (1) = E(X1 (1 − α)X1 −1 ) |α=1 = 0,
Theorem 6.2 (Cascades) Assume pk = p ∈ (0, 1] for all k ≥ 1. Then, all nodes
turn red with probability at least equal to θ where
1−p
θ = exp − .
p
Proof The probability that node n does not listen to anyone is an = (1 − p)n . Let
X be the index of the first node that does not listen to anyone. Then
6.2 Continuous-Time Markov Chains 95
P (X1 > 0)
μ>1
0 α
0 α1 1
α2
α∗
Now,
1−p
P (X = ∞) = lim P (X > n) ≥ exp − = θ.
n p
Thus, with probability at least θ , every node listens to at least one previous node.
When that is the case, all the nodes turn red. To see this, assume that n is the first
blue node. That is not possible since it listened to some previous nodes that are all
red.
Our goal is to understand networks where packets travel from node to node until
they reach their destination. In particular, we want to study the delay of packets
from source to destination and the backlog in the nodes.
It turns out that the analysis of such systems is much easier in continuous time
than in discrete time. To carry out such analysis, we have to introduce continuous-
time Markov chains. We do this on a few simple examples.
96 6 Networks—B
Figure 6.2 illustrates a random process {Xt , t ≥ 0} that takes values in {0, 1}. A
random process is a collection of random variables indexed by t ≥ 0. Saying that
such a random process is defined means that one can calculate the probability that
{Xt1 = x1 , Xt2 = x2 , . . . , Xtn = xn } for any value of n ≥ 1, any 0 ≤ t1 ≤
· · · ≤ tn , and x1 , . . . , xn ∈ {0, 1}. We explain below how one could calculate such a
probability.
We call Xt the state of the process at time t. The possible values {0, 1} are also
called states. The state Xt evolves according to rules characterized by two positive
numbers λ and μ. As Fig. 6.2 shows, if X0 = 0, the state remains equal to zero
for a random time T0 that is exponentially distributed with parameter λ, thus with
mean 1/λ. The state Xt then jumps to 1 where it stays for a random time T1 that is
exponentially distributed with rate μ, independent of T0 , and so on. The definition is
similar if X0 = 1. In that case, Xt keeps the value 1 for an exponentially distributed
time with rate μ, then jumps to 0, etc.
Thus, the pdf of T0 is
In particular,
Throughout this chapter, the symbol ≈ means “up to a quantity negligible compared
to .” It is shown in Theorem 15.3 that exponentially distributed random variable is
memoryless. That is,
Exp(λ)
6.2 Continuous-Time Markov Chains 97
0 t
time s, because the time in 0 is memoryless and independent of the previous times
in 0 and 1. This property is written as
for k = 0, 1, for all s ≥ 0, and for all sets A of possible trajectories. A generic set
A of trajectories is
for given 0 < t1 < · · · < tn and i1 , . . . , in ∈ {0, 1}. Here, C+ is the set of right-
continuous functions of t ≥ 0 that take values in {0, 1}.
This property is the continuous-time version of the Markov property for Markov
chains. One says that the process Xt satisfies the Markov property and one calls
{Xt , t ≥ 0} is a continuous-time Markov chain (CTMC).
For instance,
P [Xt+ = 1 | Xt = 0] ≈ λ.
98 6 Networks—B
0 t
τ
Fig. 6.5 The state transition λ
diagram
0 1
Indeed, the process jumps from 0 to 1 in time units if the exponential time in 0 is
less than , which has probability approximately λ.
Similarly,
P [Xt+ = 0 | Xt = 1] ≈ μ.
We say that the transition rate from 0 to 1 is equal to λ and that from 1 to 0 is equal
to μ to indicate that the probability of a transition from 0 to 1 in units of time is
approximately λ and that from 1 to 0 is approximately μ.
Figure 6.5 illustrates these transition rates. This figure is called the state
transition diagram.
The previous two identities imply that
πt+ ≈ πt (I + Q),
where I is the identity matrix. Subtracting πt from both sides, dividing by , and
letting → 0, we find
d
πt = πt Q. (6.1)
dt
By analogy with the scalar equation dxt /dt = axt whose solution is xt =
x0 exp{at}, we conclude that
πt = π0 exp{Qt}, (6.2)
where
1 2 2 1
exp{Qt} := I + Qt + Q t + Q3 t 3 + · · · .
2! 3!
Note that
d 1
exp{Qt} = 0 + Q + Q2 t + Q3 t 2 + · · · = Q exp{Qt}.
dt 2!
Observe also that πt = π for all t ≥ 0 if and only if π0 = π and
π Q = 0. (6.3)
1− 0 1 1−μ
i.e.,
π(0)(−λ) + π(1)μ = 0
π(0)λ − π(1)μ = 0.
These two equations are identical. To determine π , we use the fact that π(0) +
π(1) = 1. Combined with the previous identity, we find
μ λ
[π(0), π(1)] = , .
λ+μ λ+μ
πn → π, as n → ∞.
πt → π, as t → ∞.
The previous Markov chain alternates between the states 0 and 1. More general
Markov chains visit states in a random order. We explain that feature in our next
example with 3 states. Fortunately, this example suffices to illustrate the general
case. We do not have to look at Markov chains with 4, 5, . . . states to describe the
general model.
6.2 Continuous-Time Markov Chains 101
q(2, 0)
Exp(q1 ) Exp(q2 )
Xt
2
Γ(0, 2) T3
1
Γ(0, 1) T1
T0 T2
0 t
Exp(q0 )
q0 = q(0, 1) + q(0, 2) Γ(0, 1) = q(0, 1)/q0
q1 = q(1, 2) Γ(0, 2) = q(0, 2)/q0
q2 = q(2, 0)
In the example shown in Fig. 6.7, the rules of evolution are characterized
by positive numbers q(0, 1), q(0, 2), q(1, 2), and q(2, 0). One also defines
q0 , q1 , q2 , Γ (0, 1), and Γ (0, 2) as in the figure.
If X0 = 0, the state Xt remains equal to 0 for some random time T0 that
is exponentially distributed with rate q0 . At time T0 , the state jumps to 1 with
probability Γ (0, 1) or to state 2 otherwise, with probability Γ (0, 2). If Xt jumps
to 1, it stays there for an exponentially distributed time T1 with rate q1 that is
independent of T0 . More generally, when Xt enters state k, it stays there for a
random time that is exponentially distributed with rate qk that is independent of the
past evolution. From this definition, it should be clear that the process Xt satisfies
the Markov property.
Define πt = [πt (0), πt (1), πt (2)] where πt (k) = P (Xt = k) for k = 0, 1, 2.
One has, for 0 < 1,
Indeed, the process jumps from 0 to 1 in time units if the exponential time with
rate q0 is less than and if the process then jumps to 1 instead of jumping to 2.
Similarly,
Also,
102 6 Networks—B
P [Xt+ = 1 | Xt = 1] ≈ 1 − q1 ,
since this is approximately the probability that the exponential time with rate q1 is
larger than . Moreover,
P [Xt+ = 1 | Xt = 2] ≈ 0,
because the probability that both the exponential time with rate q2 in state 2 and the
exponential time with rate q0 in state 0 are less than is roughly (q2 ) × (q1 ), and
this is negligible compared to .
These observations imply that
Similarly to the two-state example, let us define the rate matrix Q as follows:
⎡ ⎤
−q0 q(0, 1) q(0, 2)
Q=⎣ 0 −q1 q(0, 1) ⎦ .
q(2, 0) 0 −q2
πt+ ≈ πt [I + Q].
Subtracting πt from both sides, dividing by , and letting → 0 then shows that
d
πt = πt Q.
dt
As before, the solution of this equation is
πt = π0 exp{Qt}, t ≥ 0.
q(0, 1) 1 q(1, 2)
q(0, 2)
0 2
1 − q0 1 − q2
q(2, 0)
π Q = 0.
P (Xn = k) → π(k), as n → ∞.
πt → π, as t → ∞.
Also, since Xn is irreducible, the long-term fraction of time that it spends in the
different states converge to π , and we can then expect the same for Xt .
104 6 Networks—B
Xt
Exp(qj ) Exp(qk )
Exp(ql )
k
Γ(i, k)
j
Γ(i, j) l
i t
Exp(qi )
This definition means that the process jumps from i to j = i with probability
q(i, j ) in 1 time units. Thus, q(i, j ) is the probability of jumping from i to j ,
per unit of time. Note that the sum of these expressions over all j gives 1, as should
be.
One construction of this process is as follows. Say that Xt = i. One then chooses
a random time τ that is exponentially distributed with rate qi := −q(i, i). At time
t + τ , the process jumps and goes to state y with probability Γ (i, j ) = q(i, j )/qi
for j = i (Fig. 6.9).
Thus, if Xt = i, the probability that Xt+ = j is the probability that the process
jumps in (t, t + ), which is qi , times the probability that it then jumps to j , which
is Γ (i, j ). Hence,
q(i, j )
P [Xt+ = j |Xt = i] = qi = q(i, j ),
qi
d
πt = πt Q,
dt
so that
πt = π0 exp{Qt}.
0 = π Q.
These equations express the equality of the rate of leaving a state and the rate of
entering that state.
Define
P (Xt1 = i1 , . . . , Xtn = in ) = P (Xt1 = i1 )Pt2 −t1 (i1 , i2 )Pt3 −t2 (i2 , i3 ) · · · Ptn −tn−1 (in−1 , in ),
Hence,
P (Xt1 = i1 , . . . , Xtn = in ) = π(i1 )Pt2 −t1 (i1 , i2 )Pt3 −t2 (I2 , i3 ) · · · Ptn −tn−1 (in−1 , in ),
(a) If the Markov chain is irreducible, the states are either all transient, all positive
recurrent, or all null recurrent. We then say that the Markov chain is transient,
positive recurrent, or null recurrent, respectively.
(b) If the Markov chain is positive recurrent, it has a unique invariant distribution
π and π(i) is the long-term fraction of time that Xt is equal to i. Moreover, the
probability πt (i) that the Markov chain Xt is in state i converges to π(i).
(c) If the Markov chain is not positive recurrent, it does not have an invariant
distribution and the fraction of time that it spends in any state goes to zero.
6.2.4 Uniformization
Let ν be the invariant distribution of this jump chain. That is, ν = νΓ . Since ν(i) is
the long-term fraction of time that the jump chain is in state i, and since the CTMC
Xt spends an average time 1/qi in state i whenever it visits that state, the fraction of
time that Xt spends in state i should be proportional to ν(i)/qi . That is, one expects
π(i) = Aν(i)/qi
[ν(i)/qi ]q(i, j ) = ν(i)Γ (i, j ) + ν(i)q(i, i)/qi = ν(i) − ν(i) = 0.
j j =i
To see this, assume that Yt = i. The next jump will occur with rate λ. With
probability (λ − qi )/λ, it is a dummy jump from i to i. With probability qi /λ it
is an actual jump where Yt jumps to j = i with probability Γ (i, j ). Hence, Yt
jumps from i to i with probability (λ − qi )/λ and from i to j = i with probability
(qi /λ)Γ (i, j ) = q(i, j )/λ.
Note that
1
P =I+ Q,
λ
where I is the identity matrix.
Now, define Zn to be the jump chain of Yt , i.e., the Markov chain with transition
matrix P . Since the jumps of Yt occur at rate λ, independently of the value of the
state Yt , we can simulate Yt as follows. Let Nt be a Poisson process with rate λ. The
jump times {t1 , t2 , . . .} of Nt will be the jump times of Yt . The successive values of
Yt are those of Zn . Formally,
Yt = ZNt .
That identity holds since π Q = 0. Thus, the DTMC Zn has the same invariant
distribution as Xt . Observe that Zn is not the same as the jump chain of Xt . Also, it
is not a discrete-time approximation of Xt . This DTMC shows that a CTMC can be
seen as a DTMC where one replaces the constant time steps by i.i.d. exponentially
distributed time steps between the jumps.
108 6 Networks—B
As a preparation for our study of networks of queues, we note the following result.
Theorem 6.2 (Kelly’s Lemma) Let Q be the rate matrix of a Markov chain on X .
Let also Q̃ be another rate matrix on X . Assume that π is a distribution on X and
that
qi = q̃i , i ∈ X and
π(i)q(i, j ) = π(j )q̃(j, i), ∀i = j.
Then π Q = 0.
Proof We have
π(j )q(j, i) = p(i)q̃(i, j ) = p(i) q̃(i, j ) = p(i)q̃i = p(i)qi ,
j =i j =i j =i
so that π Q = 0.
The following result explains the meaning of Q̃ in the previous theorem. We state
it without proof.
Theorem 6.3 Assume that Xt has the invariant distribution π . Then Xt reversed in
time is a Markov chain with rate matrix Q̃ given by
π(j )q(j, i)
q̃(i, j ) = .
π(i)
μ1 μ3
λ2
p2 μ2
Proof Figure 6.10 shows a guess for the time-reversal of the network.
Let Q be the rate matrix of the top network and Q̃ that of the bottom one. Let
also π be as stated in the theorem. We show that π, Q, Q̃ satisfy the conditions of
Kelly’s Lemma.
For instance, we verify that
i.e.,
p(1)ρ3 μ3 = ρ1 μ1 ,
i.e.,
λ1 λ3 λ1
μ3 = μ1
λ3 μ3 μ1
and this equation is seen to be satisfied. A similar argument shows that Kelly’s
lemma is satisfied for all pairs of states.
The first step in using the theorem is to solve the flow conservation equations. Let
us call class 1 that of the white jobs and class 2 that of the gray job. Then we see
that
solve the flow conservation equations for any α > 0. We have to assume γ < μ for
the services to be able to keep up with the white jobs. With this assumption, we can
choose α small enough so that λ1 = λ2 = λ := γ + α < min{μ1 , μ2 }.
The second step is to use the theorem to obtain the invariant distribution. It is
with
n1 (xi )
n2 (xi )
γ α n (x ) n (x )
h(xi ) = = ρ1 1 i ρ2 2 i ,
μ μ
where ρ1 = γ /μ, ρ2 = α/μ, and nc (x) is the number of jobs of class c in xi , for
c = 1, 2. To calculate A, we note that there are n + 1 states xi with n class 1 jobs
and 1 class 2 job, and 1 state xi with n classes 1 jobs and no class 2 job. Indeed, the
class 2 customer can be in n + 1 positions in the queue with the n customers of class
1.
Also, all the possible pairs (x1 , x2 ) must have one class 2 customer either in
queue 1 or in queue 2. Thus,
6.4 Proof of Theorem 5.7 111
∞
∞
1= π(x1 , x2 ) = A G(m, n),
(x1 ,x2 ) m=0 n=0
where
In this expression, the first term corresponds to the states with m class 1 customers
and one class 2 customer in queue 1 and n customers of class 1 in queue 2; the
second term corresponds to the states with m customer of class 1 in queue 1, and n
customers of class 1 and one customer of class 2 in queue 2. Thus, AG(m, n) is the
probability that there are m customers of class 1 in the first queue and n customers
of class 1 in the second queue.
Hence,
∞
∞ ∞
∞
1=A [(m + 1)ρ1m+n ρ2 + (n + 1)ρ1m+n ρ2 ] = 2A (m + 1)ρ1m+n ρ2 ,
m=0 n=0 m=0 n=0
and
∞
∞
∂ n+1 ∂
(n + 1)ρ =
n
ρ = [(1 − ρ)−1 − 1] = (1 − ρ)−2 .
∂ρ ∂ρ
n=0 n=0
1 = 2Aρ2 (1 − ρ1 )−3 ,
so that
(1 − ρ1 )3
A= .
2ρ2
112 6 Networks—B
Third, we calculate the expected number L of jobs of class 1 in the two queues.
One has
∞
∞
L= A(m + n)G(m, n)
m=0 n=0
∞
∞ ∞
∞
= A(m + n)(m + 1)ρ1m+n ρ2 + A(m + n)(n + 1)ρ1m+n ρ2
m=0 n=0 m=0 n=0
∞
∞
=2 A(m + n)(m + 1)ρ1m+n ρ2 ,
m=0 n=0
where the last identity follows from the symmetry of the two terms. Thus,
∞
∞ ∞
∞
L=2 Am(m + 1)ρ1m+n ρ2 + 2 An(m + 1)ρ1m+n ρ2
m=0 n=0 m=0 n=0
∞
∞ ∞
∞
= 2Aρ2 m(m + 1)ρ1m ρ1n + 2Aρ2 (m + 1)ρ1m nρ1n
m=0 n=0 m=0 n=0
∞
∞
= 2Aρ2 m(m + 1)ρ1m (1 − ρ1 )−1 + 2Aρ2 (1 − ρ)−2 nρ1n .
m=0 n=0
= 2ρ(1 − ρ)−3 .
Also,
∞
∞
∞
nρ1n = ρ1 nρ1n−1 = ρ1 (n + 1)ρ1n = ρ1 (1 − ρ1 )−2 .
n=0 n=0 n=0
Hence,
Finally, we get the average time W that jobs of class 1 spend in the network: W =
L/γ .
Without the gray job, the expected delay W of the white jobs would be the sum
of delays in two M/M/1 queues, i.e., W = L /γ where
ρ1
L = 2 .
1 − ρ1
W = 1.5W ,
so that using a hello message increases the average delay of the class 1 customers
by 50%.
6.5 References
The time-reversal arguments are developed in Kelly (1979). That book also explains
many other models that can be analyzed using that approach. See also Bremaud
(2008), Lyons and Perez (2017), Neely (2010).
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Digital Link—A
7
A digital link consists of a transmitter and a receiver. It transmits bits over some
physical medium that can be a cable, a phone line, a laser beam, an optical fiber, an
electromagnetic wave, or even a sound wave. This contrasts with an analog system
that transmits signals without converting them into bits, as in Fig. 7.1.
An elementary such system1 consists of a phone line and, to send a bit 0, the
transmitter applies a voltage −1 Volt across its end of the line for T seconds; to
send a bit 1, it applies the voltage +1 Volt for T second. The receiver measures
the voltage across its end of the line. If the voltage that the receiver measures is
negative, it decides that the transmitter must have sent a 0; if it is positive, it decides
that the transmitter sent a 1. This system is not error-free. The receiver gets a noisy
and attenuated version of what the transmitter sent. Thus, there is a chance that a 0
is mistaken for a 1, and vice versa. Various coding techniques are used to reduce the
chances of such errors Fig. 7.2 shows the general structure of a digital link.
In this chapter, we explore the operating principles of digital links and their
characteristics. We start with a discussion of Bayes’ rule and of detection theory.
We apply these ideas to a simple model of communication link. We then explore a
coding scheme that makes the transmissions faster. We conclude the chapter with a
TRANSMITTER
Source Channel
Source Modulator
Coder Coder
Channel
Source Channel
Destination Demodulator
Decoder Decoder
RECEIVER
The receiver gets some signal S and tries to guess what the transmitter sent. We
explore a general model of this problem and we then apply it to concrete situations.
qi S
pi
qN
pN
causation
conditional
probabilities
priors CN
where
N
pi ≥ 0, qi ∈ [0, 1] for i = 1, . . . , N and pi = 1.
i=1
P (Ci and S)
π(i) = P [Ci |S] =
P (S)
P (Ci and S) P [S|Ci ]P (Ci )
= N = N
j =1 P (Cj and S) j =1 P [S|Cj ]P (Cj )
pi qi
= N .
j =1 pj qj
This rule is very simple but is a canonical example of how observations affect
our beliefs. It is due to Thomas Bayes (Fig. 7.4).
118 7 Digital Link—A
Given the previous model, we see that the most likely circumstance under which the
symptom occurs, which we call the Maximum A Posteriori (MAP) estimate of the
circumstance given the symptom, is
The notation is that if h(·) is a function, then arg maxx h(x) is any value of x that
achieves the maximum of h(·). Thus, if x ∗ = arg maxx h(x), then h(x ∗ ) ≥ h(x) for
all x.
Thus, the MAP is the most likely circumstance, a posteriori, that is, after having
observed the symptom.
Note that if all the prior probabilities are equal, i.e., if pi = 1/N for all i, then
the MAP maximizes qi . In general, the estimate that maximizes qi is called the
Maximum Likelihood Estimate (MLE) of the circumstance given the symptom. That
is,
That is, the MLE is the circumstance that makes the symptom most likely.
More generally, one has the following definitions.
7.2 Detection and Bayes’ Rule 119
Definition 7.1 (MAP and MLE) Let (X, Y ) be discrete random variables. Then
and
These definitions extend in the natural way to the continuous case, as we will get
to see later.
P (sunburn and ice cream) = 50 < P (sunburn and no ice cream) = 600,
so that among those who have a sunburn, a minority eat ice cream, so that it is more
likely that a sunburn person did not eat ice cream. Hence, the MAP if No. However,
the fraction of people who have a sunburn is larger among those who eat ice cream
(10%) than among those who do not (0.6%). Hence, the MLE is Yes.
p
1 1
1−p
120 7 Digital Link—A
0
0 p 0.5 1 − p 1
Note that if p = 0 or p = 1, then one can recover exactly every bit that is
sent. Also, if p = 0.5, then the output is independent of the input and no useful
information goes through the channel. What happens in the other cases?
Call X ∈ {0, 1} the input of the channel and Y ∈ {0, 1} its output. Assume that
you observe Y = 1 and that P (X = 1) = α, so that P (X = 0) = 1 − α. We have
the following result illustrated in Fig. 7.6.
Theorem 7.2 (MAP and MLE for BSC) For the BSC with p < 0.5,
and
MLE[X|Y ] = Y.
To understand the MAP results, consider the case Y = 1. Since p < 0.5, we are
inclined to think that X = 1. However, if α is small, this is unlikely. The result is
that X = 1 is more likely than X = 0 if α > p, i.e., if the prior is “stronger” than
the noise. The case Y = 0 is similar.
Proof In the terminology of Bayes’ rule, the event Y = 1 is the symptom. Also, the
prior probabilities are
p0 = 1 − α and p1 = α,
Hence,
Thus,
1, if p1 q1 = α(1 − p) > p0 q0 = (1 − α)p
MAP [X|Y = 1] =
0, otherwise.
Hence, MAP [X|Y = 1] = 1{α > p}. That is, when Y = 1, your guess is that
X = 1 if the prior that X = 1 is larger than the probability that the channel makes
an error.
Also,
Thus,
1, if p1 (1 − q1 ) = αp > p0 (1 − q0 )(1 − α)(1 − p)
MAP [X|Y = 0] =
0, otherwise.
Hence, MAP [X|Y = 0] = 1{α > 1 − p}. Thus, when Y = 0, you guess that X = 1
if X = 1 is more likely a priori than the channel being correct.
Also, MLE[X|Y = 0] = 0 because p < 0.5, irrespectively of α.
Coding can improve the characteristics of a digital link. We explore Huffman codes
in this section.
Say that you want to transmit strings of symbols A, B, C, D across a digital link.
The simplest method is to encode these symbols as 00, 01, 10, and 11, respectively.
In so doing, each symbol requires transmitting two bits. Assuming that there is no
error, if the receiver gets the bits 0100110001, it recovers the string BADAB.
122 7 Digital Link—A
Now assume that the strings are such that the symbols occur with the following
frequencies: (A, 55%), (B, 30%), (C, 10%), (D, 5%). Thus, A occurs 55% of the
time, and similarly for the other symbols. In this situation, one may design a code
where A requires fewer bits than D.
The Huffman code (Huffman 1952, Fig. 7.7) for this example is as follows:
Thus, one saves 20% of the transmissions and the resulting system is 25% faster
(ah! arithmetics). Note that the code is such that, when there is no error, the receiver
can recover the symbols uniquely from the bits it gets. For instance, if the receiver
gets 110100111, the symbols are CBAD, without ambiguity.
The reason why there is no possible ambiguity is that one can picture the bits as
indicating the path in a tree that ends with a leaf of the tree, as shown in Fig. 7.8.
Thus, starting with the first bit received, one walks down the tree until one reaches
a leaf. One then repeats for the subsequent bits. In our example, when the bits are
110100111, one starts at the top of the tree, then one follows the branches 110 and
reaches leaf C, then one restarts from the top and follows the branches 10 and gets
to the leaf B, and so on. Codes that have this property of being uniquely decodable
in one pass are called prefix-free codes.
The construction of the code is simple. As shown in Fig. 7.8, one joins the two
symbols with the smallest frequency of occurrence, here C and D, with branches 0
and 1 and assigns the group CD the sum of the symbol frequencies, here 0.15. One
then continues in the same way, joining CD and B and assigning the group BCD
the frequency 0.3 + 0.15 = 0.45. Finally, one joins A and BCD. The resulting tree
specifies the code.
The following property is worth noting.
7.3 Huffman Codes 123
0.15
0 1
A B C D
0.55 0.3 0.1 0.05
0 10 110 111
Theorem 7.3 (Optimality of Huffman Code) The Huffman code has the smallest
average number of bits per symbol among all prefix-free codes.
It should be noted that other codes have a smaller average length, but they are
not symbol-by-symbol codes and are more complex. One code is based on the
observation that there are only 2nH likely strings of n 1 symbols, where
H =− x log2 (x).
X
In this expression, x is the frequency of symbol X and the sum is over all the
symbols. This expression H is the entropy of the distribution of the symbols. Thus,
by listing all these strings and assigning nH bits to identify them, one requires only
nH bits for n symbols, or H bits per symbol (See Sect. 15.7.).
In our example, one has
Thus, for this example, the savings over the Huffman code are not spectacular, but
it is easy to find examples for which they are. For instance, assume that there are
only two symbols A and B with frequencies p and 1 − p, for some p ∈ (0, 1). The
Huffman code requires one bit per symbol, but codes based on long strings require
only −p log2 (p) − (1 − p) log2 (1 − p) bits per symbol. For p = 0.1, this is 0.47,
which is less than half the number of bits of the Huffman code.
Coding based on long strings of symbols are discussed in Sect. 15.7.
124 7 Digital Link—A
Y = X + Z.
Hence,
0 1 y
7.4 Gaussian Channel 125
Similarly,
Also,
If we choose the MLE detection rule, the system has the same probability of error
as a BSC channel with
0.5
p = p(σ 2 ) := P (N (0, σ 2 ) > 0.5) = P N (0, 1) > .
σ
Simulation
Figure 7.10 shows the simulation results when α = 0.5 and σ = 1. The code is in
the Jupyter notebook for this chapter.
7.4.1 BPSK
The system in the previous section was very simple and corresponds to a practical
transmission scheme called Binary Phase Shift Keying (BPSK). In this system,
instead of sending a constant voltage for T seconds to represent either a bit 0 or
a bit 1, the transmitter sends a sine wave for T seconds and the phase of that sine
wave depends on whether the transmitter sends a 0 or a 1 (Fig. 7.11).
Specifically, to send bit 0, the transmitter sends the signal
signal is a sine wave around frequency f and the designer can choose a frequency
that the transmission medium transports well. For instance, if the transmission is
wireless, the frequency f is chosen so that the antennas radiate and receive that
frequency well. The wavelength of the transmitted electromagnetic wave is the
speed of light divided by f and it should be of the same order as the physical length
of the antenna. For instance, 1GHz corresponds to a wavelength of one foot and it
can be transmitted and received by suitably shaped cell phone antennas.
In any case, the transmitter sends the signal si to send a bit i, for i = 0, 1. The
receiver attempts to detect whether s0 or s1 = −s0 was sent. To do this, it multiplies
the received signal by a sine wave at the frequency f , then computes the average
value of the product. That is, if the receiver gets the signal r = {rt , 0 ≤ t ≤ T }, it
computes
T
1
rt sin(2πf t)dt.
T 0
You can verify that if r = s0 , then the result is A/2 and if r = s1 , then the result is
−A/2. Thus, the receiver guesses that bit 0 was transmitted if this average value is
positive and that bit 1 was transmitted otherwise.
The signal that the receiver gets is not si when the transmitter sends si . Instead,
the receiver gets an attenuated and noisy version of that signal. As a result, after
doing its calculation, the receiver gets B + Z or −B + Z where B is some constant
7.5 Multidimensional Gaussian Channel 127
When using BP SK, the transmitter has a choice between two signals: s0 and s1 .
Thus, in T seconds, the transmitter sends one bit. To increase the transmission
rate, communication engineers devised a more efficient scheme called Quadrature
Amplitude Modulation (QAM). When using this scheme, a transmitter can send a
number k of bits every T seconds. The scheme can be designed for different values
of k. When k = 1, the scheme is identical to BPSK. For k > 1, there are 2k different
signals and each one is of the form
where the coefficients (a, b) characterize the signal and correspond to a given string
of k-bits. These coefficients form a constellation as shown in Fig. 7.12 in the case
of QAM-16, which corresponds to k = 4.
When the receiver gets the signal, it multiplies it by 2 cos(2πf t) and computes
the average over T seconds. This average value should be the coefficient a if
there was not attenuation and no noise. The receiver also multiplies the signal by
2 sin(2πf t) and computes the average over T seconds. The result should be the
coefficient b. From the value of (a, b), the receiver can tell the four bits that the
transmitter sent.
Because of the noise (we can correct for the attenuation), the receiver gets a pair
of values Y = (Y1 , Y2 ), as shown in the figure. The receiver essentially finds the
constellation point closest to the measured point Y and reads off the corresponding
bits.
x 16
x1
128 7 Digital Link—A
The values of |a| and |b| are bounded, because of a power constraint on the
transmitter. Accordingly, a constellation with more points (i.e., a larger value of k)
has points that are closer together. This proximity increases the likelihood that the
noise misleads the receiver. Thus, the size of the constellation should be adapted
to the power of the noise. This is in fact what actual systems do. For instance, a
cable modem and an ADSL modem divide the frequency band into small channels
and they measure the noise power in each channel and choose the appropriate
constellation for each. WiFi, LTE, and 5G systems use a similar scheme.
Y=X+Z
where Z = (Z1 , Z2 ) and Z1 , Z2 are i.i.d. N(0, σ 2 ) random variables. That is, we
assume that the errors in Y1 and Y2 are independent and Gaussian. In this case, we
can calculate the conditional density fY|X [y|x] as follows. Given X = x, we see that
Y1 = x1 + Z1 and Y2 = x2 + Z2 . Since Z1 and Z2 are independent, it follows that
Y1 and Y2 are independent as well. Moreover, Y1 = N(x1 , σ 2 ) and Y2 = N(x2 , σ 2 ).
Hence,
1 (y1 − x1 )2 1 (y2 − x2 )2
fY|X [y|x] = √ exp − √ exp − .
2π σ 2 2σ 2 2π σ 2 2σ 2
Recall that MLE[X|Y = y] is the value of x ∈ {x1 , . . . , x16 } that maximizes this
expression. Accordingly, it is the value xk that minimizes
Thus, MLE[X|Y] is indeed the constellation point that is the closest to the measured
value Y.
There are many situations where the MAP and MLE are not satisfactory guesses.
This is the case for designing alarms, medical tests, failure detection algorithms,
and many other applications. We describe an important formulation, called the
hypothesis testing problem.
7.6 Hypothesis Testing 129
7.6.1 Formulation
We consider the case where X ∈ {0, 1} and where one assumes a distribution of Y
given X. The goal will be to solve the following problem:
A typical ROC is shown in Fig. 7.13. The terminology comes from the fact that
this function depends on the conditional distributions of Y given X = 0 and given
X = 1, i.e., of the signal that is received about X.
Note the following features of that curve. First, R(1) = 1 because if one is
allowed to have P F A = 1, then one can choose X̂ = 1 for all observations; in that
case P CD = 1.
Second, the function R(β) is concave. To see this, let 0 ≤ β1 < β2 ≤ 1 and
assume that gi (Y ) achieves P [gi (Y ) = 1|X = 1] = R(βi ) and P [gi (Y ) = 1|X =
0] = βi for i = 1, 2. Choose ∈ (0, 1) and define X = g1 (Y ) with probability
and X = g2 (Y ) otherwise. Then,
2 IfH0 means that you are healthy and H1 means that you have a disease, P F A is the probability
of a false positive test and 1 − P CD is the probability of a false negative test. These are also called
type I and type II errors in the literature. P F A is also called the p-value of the test.
130 7 Digital Link—A
b
0
0 1
Also,
Now, the decision rule X̂ that maximizes P [X̂ = 1|X = 1] subject to P [X̂ =
1|X = 0] = β1 + (1 − )β2 must be at least as good as X . Hence,
R(β1 + (1 − )β2 ) ≥ β1 + (1 − )β2 .
7.6.2 Solution
The solution of the hypothesis testing problem is stated in the following theorem.
In these expressions,
fY |X [y|1]
L(y) =
fY |X [y|0]
is the likelihood ratio, i.e., the ratio of the likelihood of y when X = 1 divided by its
likelihood when X = 0. Also, λ > 0 and γ ∈ [0, 1] are chosen so that the resulting
X̂ satisfies
P [X̂ = 1|X = 0] = β.
Thus, if L(Y ) is large, X̂ = 1. The fact that L(Y ) is large means that the observed
value Y is much more likely when X = 1 than when X = 0. One is then inclined
to decide that X = 1, i.e. to guess X̂ = 1. The situation is similar when L(Y ) is
small. By adjusting λ, one controls the sensitivity of the detector. If λ is small, one
tends to choose X̂ = 1 more frequently, which increases P CD but also P F A. One
then chooses λ so that the detector is just sensitive enough so that P F A = β. In
some problems, one may have to hedge the guess for the critical value λ as we will
explain in examples (Fig. 7.14).
We prove this theorem in the next chapter. Let us consider a number of examples.
7.6.3 Examples
Gaussian Channel
Recall our model of the scalar Gaussian channel:
Y = X + Z,
We looked at two formulations: MLE and MAP. In the MLE, we want to find the
value of X that makes Y most likely. That is,
To calculate the MAP, one needs to know the prior probability p0 that X = 0. We
found out that MAP [X|Y = y] = 1 if y ≥ 0.5 + σ 2 log(p0 /p1 ) and MAP [X|Y =
y] = 0 otherwise.
In the hypothesis testing formulation, we choose a bound β on P F A = P [X̂ =
1|X = 0]. According to Theorem 7.4, we should calculate the likelihood ratio L(Y ).
We find that
!
exp − (y−1)
2
2σ 2 2y − 1
L(y) = ! = exp .
exp − y
2
2σ 2
2σ 2
Note that, for any given λ, P (L(Y ) = λ) = 0. Moreover, L(y) is strictly increasing
in y. Hence, (7.3) simplifies to
1, if y ≥ y0
X̂ =
0, otherwise.
P [X̂ = 1|X = 0] = P [Y ≥ y0 |X = 0] = β.
P (N(0, σ 2 ) ≥ y0 ) = β,
PFA = b
X̂ = 0 X̂ = 1
Let us calculate the ROC for the Gaussian channel. Let y(β) be such that
P (N (0, 1) ≥ y(β)) = β, so that y0 = y(β)σ . The probability of correct detection
is then
Figure 7.16 shows the ROC for different values of σ , obtained using Python. Not
surprisingly, the performance of the system degrades when the channel is noisier.
fY |X [y|1] Π n λ1 exp{−λ1 yi }
L(y) = = i=1
fY |X [y|0] n λ exp{−λ y }
Πi=1 0 0 i
n
λ1
n
= exp −(λ1 − λ0 ) yi .
λ0
i=1
Since λ1 > λ0 , we find that L(y) is strictly decreasing in i yi and also that
P (L(Y ) = λ) = 0 for all λ. Thus, (7.3) simplifies to
1, if ni=1 Yi ≤ a
X̂ =
0, otherwise,
Now, when X = 0, the Yi are i.i.d. random variables that are exponentially
distributed with mean 1/λ0 . The distribution of their sum is rather complicated.
We approximate it using the Central Limit Theorem.
We have3
Y1 + · · · + Yn − nλ−1
√ 0
≈ N(0, λ−2
0 ).
n
Now,
Y1 + · · · + Yn − nλ−1 a − nλ−1
n
Yi ≤ a ⇔ √ 0
≤ √ 0 .
n n
i=1
Hence,
n
a − nλ−1
P Yi ≤ a|X = 0 ≈ P N(0, λ−2
0 ) ≤ √ 0
n
i=1
a − nλ−1
=P N(0, 1) ≤ λ0 √ 0 .
n
a − nλ−1
λ0 √ 0 = 1.65,
n
i.e.,
√
a = (n + 1.65 n)λ−1
0 .
One point is worth noting for this example. We see that the calculation of X̂ is
based on Y1 + · · · + Yn . Thus, although one has measured the individual lifespans of
the n bulbs, the decision is based only on their sum, or equivalently on their average.
Bias of a Coin
In this example, we observe n coin flips. Given X = x ∈ {0, 1}, the coins are
i.i.d. B(px ). That is, given X = x, the outcomes Y1 , . . . , Yn of the coin flips are
i.i.d. and equal to 1 with probability px and to zero otherwise. We assume that
p1 > p0 = 0.5. That is, we want to test whether the coin is fair or biased.
Here, the random variables Yi are discrete. We see that
Hence,
P [Yi = yi , i = 1, . . . , n|X = 1]
L(Y1 , . . . , Yn ) =
P [Yi = yi , i = 1, . . . , n|X = 0]
S
p1 1 − p1 n−S 1 − p1 n p1 (1 − p0 ) S
= = .
p0 1 − p0 1 − p0 p0 (1 − p1 )
Since p1 > p0 , we see that the likelihood ratio is increasing in S. Thus, the solution
of the hypothesis testing problem is
X̂ = 1{S ≥ n0 },
2n0 − n
√ = 1.65,
n
by (3.2). Hence,
√
n0 = 0.5n + 0.83 n.
Discrete Observations
In the examples that we considered so far, the random variable L(Y ) is continuous.
In such cases, the probability that L(Y ) = λ is always zero, and there is no need to
randomize the choice of X̂ for specific values of Y . In our next examples, that need
arises.
First consider, as usual, the problem of choosing X̂ ∈ {0, 1} to maximize the
probability of correct detection P [X̂ = 1|X = 1] subject to a bound P [X̂ =
1|X = 0] ≤ β on the probability of false alarm. However, assume that we make
no observation. In this case, the solution is to choose X̂ = 1 with probability
β. This choice meets the bound on the probability of false alarm and achieves a
probability of correct detection equal to β. This randomized choice is better than
always deciding X̂ = 0.
Now consider a more complex example where Y ∈ {A, B, C} and
0.8 R(β)
0.6
β
0
0 0.3 0.5 1
7.7 Summary
Bayes’ Rule πi = pi qi /( j pj qj ) Theorem 7.1
MAP [X|Y = y] arg maxx P [X = x|Y = y] Definition 7.1
MLE[X|Y = y] arg maxx P [Y = y|X = x] Definition 7.1
Likelihood Ratio L(y) = fY |X [y|1]/fY |X [y|0] Theorem 7.4
Gaussian Channel MAP [X|Y = y] = 1{y ≥ 12 + σ 2 log( pp01 )} (7.2)
Neyman–Pearson Theorem P [X̂ = 1|Y ] = 1{L(Y ) > λ} + γ 1{L(Y ) = λ} Theorem 7.4
ROC ROC(β) = max. P CD s.t. P F A ≤ β Definition 7.2
7.8 References
7.9 Problems
Problem 7.3 A digital link uses the QAM-16 constellation shown in Fig. 7.12 with
x1 = (1, −1). The received signal is Y = X + Z where Z =D N (0, σ 2 I). The
receiver uses the MAP. Simulate the system using Python to estimate the fraction of
errors for σ = 0.2, 0.3.
Problem 7.4 Use Python to verify the CLT with i.i.d. U [0, 1] random variables Xn .
That is, generate the random variables {X1 , . . . , XN } for N = 10000. Calculate
X100n+1 + · · · + X(n+1)100 − 50
Yn = , n = 0, 1, . . . , 99.
10
7.9 Problems 139
Plot the empirical cdf of {Y0 , . . . , Y99 } and compare with the cdf of a N (0, 1/12)
random variable.
Problem 7.5 You are testing a digital link that corresponds to a BSC with some
error probability ∈ [0, 0.5).
(a) Assume you observe the input and the output of the link. How do you find the
MLE of .
(b) You are told that the inputs are i.i.d. bits that are equal to 1 with probability 0.6
and to 0 with probability 0.4. You observe n outputs. How do you calculate the
MLE of .
(c) The situation is as in the previous case, but you are told that has pdf 4 − 8x on
[0, 0.5). How do you calculate the MAP of given n outputs.
Problem 7.6 The situation is the same as in the previous problem. You observe n
inputs and outputs of the BSC. You want to solve a hypothesis problem to detect
that > 0.1 with a probability of false alarm at most equal to 5%. Assume that n is
very large and use the CLT.
y1 = x1 + Z1 ,
y2 = x2 + Z2 ,
where Z1 and Z2 are independent N(0, σ 2 ) random variables. Find the MAP
detector and ML detector analytically.
140 7 Digital Link—A
Simulate the channel using Python for π = [0.1, 0.2, 0.3, 0.4], and σ = 0.1 and
σ = 0.5. Evaluate the probability of correct detection.
Problem 7.9 Let X be equally likely to take any of the values {1, 2, 3}. Given X,
the random variable Y is N (X, 1).
(a) Assume that you have observed Y n = (Y1 , . . . , Yn ). What is the guess X̂n based
on these observations that maximizes the probability that X̂n = X?
(b) What is the corresponding value of P (X̂n = X)?
(c) Choose n to maximize P (X = X̂n ) − βn where X̂n is chosen on the basis of
Y1 , . . . , Yn ). Hint: You will recall that
d x
(a ) = a x log(a).
dx
Problem 7.13 Assume that Y =D U [a, b]. You observe n i.i.d. samples Y1 , . . . , Yn
of this random variable. Calculate the maximum likelihood estimator â of a and b̂
of b. What is the bias of â and b̂?
Problem 7.15 Given θ ∈ {0, 1}, X = θ (1, 1) +V where V1 and V2 are independent
and uniformly distributed in [−2, 2]. Solve the hypothesis testing problem:
(a) Find θ̂ = H T [θ |X, β], defined as the random variable θ̂ determined from X
that maximizes P [θ̂ = 1|θ = 1] subject to P [θ̂ = 1|θ = 0] ≤ β;
(b) Compute the resulting value of α(β) = P [θ̂ = 1|θ = 1];
(c) Sketch the ROC curve α(β) for β ∈ [0, 1].
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Digital Link—B
8
Theorem 8.1 (Optimality of Huffman Code) The Huffman code has the smallest
average number of bits per symbol among all prefix-free codes (Fig. 8.1).
0.15
0 1
A B C D
0.55 0.3 0.1 0.05
0 10 110 111
Thus, L(n + 1) ≤ A(n + 1), which contradicts the assumption that the Huffman
code is not optimal for n + 1 symbols. It remains to prove the claim about X and
Y being siblings. First note that Y having the maximum path length, it cannot be
an only child, for otherwise, we would replace its parent by Y and reduce the path
length. Say that Y has a sibling V other than X. By swapping V and X, one does
not increase the average path length, since the frequency of V is not smaller than
that of X. This concludes the proof.
The idea of the proof is to consider any other decision rule that produces an estimate
X̃ with P [X̃ = 1|X = 0] ≤ β and to show that
(X̂ − X̃)(L(Y ) − λ) ≥ 0.
Indeed, when L(Y ) − λ > 0, one has X̂ = 1 ≥ X̃, so that the expression above is
indeed nonnegative. Similarly, when L(Y ) − λ < 0, one has X̂ = 0 ≤ X̃, so that
the expression is again nonnegative.
Taking the expected value of this expression given X = 0, we find
Now,
Note that this result continues to hold even for a function g(Y, Z) where Z is a
random variable that is independent of X and Y . In particular,
Similarly,
In many systems, the errors in the different components of the measured vector Y
are not independent. A suitable model for this situation is that
Y = X + AZ,
1 1 −1
fY|X [y|x] = exp − (y − x) (AA ) (y − x) , (8.4)
2π |A| 2
where A is the transposed of matrix A, i.e., A (i, j ) = A(j, i) for i, j ∈ {1, 2}.
Consequently, the MLE is the value xk of x that minimizes
W := A−1 Y = A−1 X + Z =: V + Z.
Thus, if we calculate A−1 Y from the measured vector Y, we find that its components
are i.i.d. N(0, 1) for a given value of X. Hence, it is easy to calculate MLE[V|W =
w]: it is the closest value to w in the set {A−1 x1 , . . . , A−1 x16 } of possible values of
V. It is then reasonable to expect that we can recover the MLE of X by multiplying
the MLE of V = A−1 X by A, i.e., that
Our goal in this section is to explain (8.4) and more general versions of this result.
We start by stating the main definition and a result that we prove later.
Y = AX + μY with ΣY = AA ,
20 5u2
A
u2
v1
u1
v2
A 60 5u1
The level curves of this jpdf are ellipses, as sketched in Fig. 8.2.
Note that this joint distribution is determined by the mean and the covariance
matrix. In particular, if Y = (V , W ) are jointly Gaussian, then the joint
distribution is characterized by the mean and ΣV , ΣW and cov(V, W). We know
that if V and W are independent, then they are uncorrelated, i.e., cov(V, W) =
0. Since the joint distribution is characterized by the mean and covariance, we
conclude that if they are uncorrelated, they are independent. We note this fact as
a theorem.
Theorem 8.3 (Jointly Gaussian RVs Are Independent Iff Uncorrelated) Let V
and W be jointly Gaussian random variables. Then, there are independent if and
only if they are uncorrelated.
Proof By definition, V and W are jointly Gaussian if they are linear functions of
i.i.d. N (0, 1) random variables. But then AV + a and BW + b are linear functions
of the same i.i.d. N(0, 1) random variables, so that they are jointly Gaussian. More
explicitly, there are some i.i.d. N (0, 1) random variables X so that
V c C
= + X,
W d D
148 8 Digital Link—B
so that
AV + a a + Ac AC
= + X.
BW + b b + Bd BD
Indeed, these random variables are jointly Gaussian by Theorem 8.4. Also, they are
uncorrelated since
Let us apply (8.6) to the case where X is a vector of n i.i.d. N(0, 1) random
variables. In this case,
1 xi2
fX (x) = Πi=1 fXi (xi ) = Πi=1 √ exp −
n n
2π 2
1 ||x||2
= exp − .
(2π )n/2 2
where Ax + μY = y. Thus,
x = A−1 (y − μY )
and
ΣY = AΣX A = AA .
In particular,
|ΣY | = |A|2 .
This section explains some basic statistical tests that are at the core of “data science.”
8.4.1 Zero-Mean?
H0 : μ = 0. (8.7)
H1 : μ = 0. (8.8)
We know that P [|Y | > 2 | H0 ] ≈ 5%. That is, if we reject H0 when |Y | > 2,
the probability of “false alarm,” i.e., of rejecting the hypothesis when it is correct is
5%. This is what all the tests that we will discuss in this chapter do. However, there
are many tests that achieve the same false alarm probability. For instance, we could
reject H0 when Y > 1.64 and the probability of false alarm would also be 5%. Or,
we could reject H0 when Y is in the interval [1, 1.23]. The probability of that event
under H0 is also about 5%.
150 8 Digital Link—B
Thus, there are many tests that reject H0 with a probability of false alarm equal
to 5%. Intuitively, we feel that the first one—rejecting H0 when |Y | > 2—is
more sensible than the others. This intuition probably comes from the idea that
the alternative hypothesis H1 : μ = 0 appears to be a symmetric assumption about
the likely values of μ. That is, we do not have a reason to believe that under H1 the
mean μ is more likely to be positive than negative. We just know that it is nonzero.
Given this symmetry, it is intuitively reasonable that the test should be symmetric.
However, there are many symmetric tests! So, we need a more careful justification.
To justify the test |Y | > 2, we note the following simple result.
H0 : μ = 0
H1 : μ has a symmetric distribution about 0.
Proof We know that the Neyman–Pearson test is a likelihood ratio test. Thus, it
suffices to show that the likelihood ratio is increasing in |Y |. Assume that the density
of μ under H1 is h(x). (The same argument goes through it μ is a mixed random
variable.) Then the pdf f1 (y) of Y under H1 is as follows:
f1 (y) = h(x)f (y − x)dx,
√
where f (x) = (1/ 2π ) exp{−0.5y 2 } is the pdf of a N (0, 1) random variable.
Consequently, the likelihood ratio L(y) of Y is given by
2
f1 (y) f (y − x) x
L(y) = = h(x) dx = h(x) exp{−xy} exp − dx
f (y) f (y) 2
2
x
= 0.5 [h(x) + h(−x)] exp{−xy} exp − dx
2
= 0.5 h(x)[exp{xy} + exp{−xy}]dx,
where the fourth identity comes from h(x) = 0.5h(x) + 0.5h(−x), since h(x) =
h(−x). This expression shows that L(y) = L(−y). Also,
8.4 Elementary Statistics 151
∞
L (y)=0.5 h(x)x[exp{xy}− exp{−xy}]dx= h(x)x[exp{xy}− exp{−xy}]dx,
0
by symmetry of the integrand. For y > 0 and x > 0, we see that the last integrand
is positive, so that L (y) > 0 for y > 0.
Hence, L(y) is symmetric and increasing in y > 0, so that it is an increasing
function of |y|, which completes the proof.
As a simple application, say that you buy 100 light bulbs from brand A and 100
from brand B. You want to test whether that have the same mean lifetime. You
measure the lifetimes {X1A , . . . , X100
A } and {X B , . . . , X B } of the bulbs of the two
1 100
batches and you calculate
(X1A + · · · X100
A ) − (X B + · · · X B )
Y = √ 1 100
,
σ N
|μ̂|
> λ,
σ̂
1
n
σ̂ 2 = (Ym − μ̂)2
n−1
m=1
|tn−1 |
is the sample variance, and λ is such that P ( √ n−1
> tn−1 ) = β.
Here, tn−1 is a random variable with a t distribution with n − 1 degrees of
freedom. By definition, this means that
N (0, 1)
tn−1 = * ,
2 /(n − 1)
χn−1
152 8 Digital Link—B
σZ σW
μ̂1
σV 1
V = N (0, 1)
0
σW2 = (n − 1)σ̂ 2
2
where χn−1 is the sum of the squares of n − 1 i.i.d. N (0, 1) random variables.
Thus, this chi-squared test is very similar to the previous one, except that one
replaces the standard deviation σ by it estimate σ̂ and the threshold λ is adjusted
(increased) to reflect the uncertainty in σ . Statistical packages provide routines to
calculate the appropriate value of λ. (See scipy.stats.chisquare for Python.)
Figure 8.3 explains the result. The rotation symmetry of Z implies that we can
assume that V = Z1 and that W = (0, Z2 , . . . , Zn ). As in the previous examples,
one uses the symmetry assumption under H1 to prove that the likelihood ratio is
monotone in μ̂/σ̂ .
Coming back to our lightbulbs example, what should we do if we have different
number of bulbs of the two brands? The next test covers that situation.
Needless to say, some care must be taken. It is not difficult to find distributions
for which this test does not perform well. This fact helps explain why many poorly
conducted statistical studies regularly contradict one another. Many publications
decry this fallacy of the p-value. The p-value is the name given to the probability of
false alarm.
H0 : Y = N (μ, σ 2 I), μ ∈ L
H1 : Y = N (μ, σ 2 I), μ ∈ n .
1
H = H1 if and only if Y − μ̂2 > βn−m ,
σ2
where
2
In this expression, χn−m represents a random variable that has a chi-square
distribution with n − m degrees of freedom. This means that it is distributed like
the sum of n − m random variables that are i.i.d. N (0, 1).
Figure 8.4 shows that
Y − μ̂ = σ Z.
Y − μ̂ = σ (0, . . . , 0, Zm+1 , . . . , Zn ),
8.4.5 ANOVA
Our next model is more general and is widely used. In this model, Y =
N (Aγ , σ 2 I). We would like to test whether Mγ = 0, which is the H0 hypothesis.
Here, A is a n × k matrix, with k < n. Also, M is a q × k matrix with q < k.
The decision is to reject H0 if F > F0 where
Y − μ0 2 − Y − μ1 2 n−k
F = ×
Y − μ1 2 q
μ0 = arg min{Y − μ2 : μ = Aγ , Mγ = 0}
μ
Low Density Parity Check (LDPC) codes are among the most efficient codes used in
practice. Gallager invented these codes in his 1960 thesis (Gallager 1963, Fig. 8.6).
8.5 LDPC Codes 155
0 [k] · 2 = σ 2 χ2q
These codes are used extensively today, for instance, in satellite video transmissions.
They are almost optimal for BSC channels and also for many other channels.
The LDPC codes are as follows. Let x ∈ {0, 1}n be an n-bit string to be
transmitted. One augments this string with the m-bit string y where
y = H x. (8.9)
Here, H is an m × n matrix with entries in {0, 1}, one views x and y as column
vectors and the operations are addition modulo 2. For instance, if
⎡ ⎤
1 0 1 1 1 0 0 0
⎢0 1 0 1 1 0 1 0⎥
H =⎢
⎣1
⎥
1 0 0 0 1 0 1⎦
0 0 1 0 1 1 1 1
and x = [01001010], then y = [1110]. This calculation of the parity check bits y
from x is illustrated by the graph, called Tanner graph, shown in Fig. 8.7.
Thus, instead of simply sending the bit string x, one sends both x and y. The bits
in y are parity check bits. Because of possible transmission errors, the receiver may
get x̃ and ỹ instead of x and y. The receiver computes H x̃ and compares the result
with ỹ. The idea is that if ỹ = H x̃, then it is likely that x̃ = x and ỹ = y. In other
words, it is unlikely that errors would have corrupted x and y in a way that these
156 8 Digital Link—B
vectors would still satisfy the relation ỹ = H x̃. Thus, one expects the scheme to be
good at detecting errors, at least if the matrix H is well chosen.
In addition to detecting errors, the LDPC code is used for error correction. If
ỹ = H x̃, one tries to find the least number of components of x̃ and ỹ that can
be changed to satisfy the equations. These would be the most likely transmission
errors, if we assume that bit errors are i.i.d. have a very small probability. However,
searching for the possible combinations of components to change is exponentially
hard. Instead, one uses iterative algorithms that approximate the solution.
We illustrate a commonly used decoding algorithm, called belief propagation
(BP). We assume that each received bit is erroneous with probability 1 and
correct with probability ¯ = 1 − , independently of the other bits. We also assume
that the transmitted bits xj are equally likely to be 0 or 1. This implies that the parity
check bits yi are also equally likely to be 0 or 1, by symmetry. In this algorithm, the
message nodes xj and the check nodes yi exchange beliefs along the links of the
graph of Fig. 8.7 about the probability that the xj are equal to 1.
In steps 1, 3, 5, . . . of the algorithm, each node xj sends to each node yi to which
it is attached an estimate of P (xj = 1). Each node yi then combines these estimates
to send back new estimates to each xj about P (xj = 1). Here is the calculation
that the y nodes perform. Consider a situation shown in Fig. 8.8 where node y1 gets
the estimates a = P (x1 = 1), b = P (x2 = 1), c = P (x3 = 1). Assume also that
ỹ1 = 1, from which node y1 calculates P [y1 = 1|ỹ1 ] = 1 − = , ¯ by Bayes’ rule.
Since the graph shows that x1 + x2 + x3 = y1 , node y1 estimates the probability that
x1 = 1 as the probability that an odd number of bits among {x2 , x3 , y1 } are equal to
one (Fig. 8.9).
To see how to do the calculation, assume that x1 , . . . , xn are independent {0, 1}-
random variables with pi = P (xi = 1). Note that
1 − (1 − 2x1 ) × · · · × (1 − 2xn )
8.5 LDPC Codes 157
p1 p2 pn
1 1 n
P (odd) = − Π (1 − 2pj )
2 2 j=1
Fig. 8.9 Each node j is equal to one w.p. pj and to zero otherwise, independently of the other
nodes. The probability that an odd number of nodes are one is given in the figure
is equal to zero if the number of variables that are equal to one among {x1 , . . . , xn }
is even and is equal to two if it is odd. Thus, taking expectation,
2P (odd) = 1 − Πi=1
n
(1 − 2pi ),
so that
1 1 n
P (odd) = − Π (1 − 2pi ). (8.10)
2 2 i=1
Thus, in Fig. 8.8, one finds that
a
d b
c
abc
d=
abc + (1 − a)(1 − b)(1 − c)
One has
P (X = 1, Y1 , . . . , YN )
P [X = 1|Y1 , . . . , YN ] =
P (Y1 , . . . , YN )
P [Y1 , . . . , YN |X = 1]P (X = 1)
=
x=0,1 P [Y1 , . . . , YN |X = x]P (X = x)
P [Y1 |X = 1] × · · · × P [YN |X = 1]
= .
x=0,1 P [Y1 |X = x] × · · · × P [YN |X = x]
(8.12)
Now,
P (X = x, Yn ) P [X = x|Yn ]P (Yn )
P [Yn |X = x] = = .
P (X = x) 1/2
Thus,
y3
from the nodes y1 , y2 , y3 and node x1 assumes that these estimates were based on
independent observations.
To calculate a new estimate that it will send to node y1 , node x1 combines the
estimates from x̃1 , y2 and y3 . This estimate is
bc
, (8.14)
bc + ¯ b̄c̄
where b̄ = 1 − b and c̄ = 1 − c. In the next step, node x1 will send that estimate to
node y1 . It also calculates estimates for nodes y2 and y3 .
Summing up, the algorithm is as follows. At each odd step, node xj sends X(i, j )
to each node yi . At each even step, node yi sends Y (i, j ) to each node xj . One has
1 1
Y (i, j ) = − (1 − 2)(1 − 2ỹi )Πs∈A(i,j ) (1 − 2X(i, s)), (8.15)
2 2
where A(i, j ) = {s = j | H (i, s) = 1} and
N(i, j )
X(i, j ) = , (8.16)
N(i, j ) + D(i, j )
where
and
with
Also, node xj can update its probability of being 1 by merging the opinions of
the experts as
N(j )
X(j ) = , (8.17)
N(j ) + D(j )
160 8 Digital Link—B
where
and
After enough iterations, one makes the detection decisions xj = 1{X(j ) ≥ 0.5}.
Figure 8.12 shows the evolution over time of the estimated probabilities that
the xj are equal to one. Our code is a direct implementations of the formulas in
this section. More sophisticated implementations use sums of logarithms instead of
products.
Simulations, and a deep theory, show that this algorithm performs well if the
graph does not have small cycles. In such a case, the assumption that the estimates
are obtained from independent observations is almost correct.
8.6 Summary
• LDPC Codes;
• Jointly Gaussian Random Variables, independent if uncorrelated;
• Proof of Neyman–Pearson Theorem;
• Testing properties of the mean.
8.8 Problems 161
LDPC y = Hx (8.9)
P(odd) P ( j Xj = 1) = 0.5 − 0.5Πj (1 − 2pj ) (8.10)
Fusion of Experts P [X = 1|Y1 , . . . , Yn ] = Πj pj /(Πj pj + Πj p̄j ) (8.13)
Jointly Gaussian N (μ, Σ) ⇔ fX = . . . (8.4)
If X, Y are J.G., then X ⊥ Y ⇒ X, Y are independent Theorem 8.3
8.7 References
8.8 Problems
Problem 8.1 Construct two Gaussian random variables that are not jointly Gaus-
sian. Hint: Let X =D N (0, 1) and Z be independent random variables with
P (Z = 1) = P (Z = −1) = 1/2. Define Y = XZ. Show that X and Y meet
the requirements of the problem.
√
Problem 8.2 Assume that X =D (Y + Z)/ 2 where Y and Z are independent and
distributed like X. Show that X = N (0, σ 2 ) for some σ 2 ≥ 0. Hint: First √ show
that E(X) = 0. Second, show by induction that X =D (V1 + · · · + Vm )/ m for
m = 2n . where the Vi are i.i.d. and distributed like X. Conclude using the CLT.
Problem 8.3 Consider Problem 7.8 but assume now that Z =D N (0, Σ) where
0.2 0.1
Σ= .
0.1 0.3
The symbols are equally likely and the receiver uses the MLE. Simulate the system
using Python to estimate the fraction of errors.
162 8 Digital Link—B
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Tracking—A
9
9.1 Examples
A GPS receiver uses the signals it gets from satellites to estimate its location
(Fig. 9.1). Temperature and pressure sensors provide signals that a computer uses
to estimate the state of a chemical reactor.
A radar measures electromagnetic waves that an object reflects and uses the
measurements to estimate the position of that object (Fig. 9.2).
Similarly, your car’s control computer estimates the state of the car from
measurements it gets from various sensors (Fig. 9.3).
The objective is to choose the inference function g(·) to minimize the expected
error C(g) where
In this expression, c(X, X̂) is the cost of guessing X̂ when the actual value is X. A
standard example is
We will also study the case when X ∈ d for d > 1. In such a situation, one
uses c(X, X̂) = ||X − X̂||2 . If the function g(·) can be arbitrary, the function that
minimizes C(g) is the Minimum Mean Squares Estimate (MMSE) of X given Y . If
the function g(·) is restricted to be linear, i.e., of the form a +BY , the linear function
that minimizes C(g) is the Linear Least Squares Estimate (LLSE) of X given Y . One
may also restrict g(·) to be a polynomial of a given degree. For instance, one may
define the Quadratic Least Squares Estimate QLSE of X given Y . See Fig. 9.4.
9.3 Linear Least Squares Estimates 165
X̂
As we will see, a general method for the off-line inference problem is to choose
a parametric class of functions {gw , w ∈ d } and to then minimize the empirical
error
K
c(Xk , gw (Yk ))
k=1
over the parameters w. Here, the (Xk , Yk ) are the observed samples. The parametric
function could be linear, polynomial, or a neural network.
For the on-line problem, one also chooses a similar parametric family of
functions and one uses a stochastic gradient descent algorithm of the form
where ∇ is the gradient with respect to w and γ > 0 is a small step size. The
justification for this approach is that, since γ is small, by the SLLN, the update
tends to be in the direction of
k+K−1
− ∇w c(Xi+1 , gw (Yi+1 ) ≈ −K∇E(c(Xk , gw (Yk )) = −K∇C(gw ),
i=k
In this section, we study the linear least squares estimates. Recall the setup that we
explained in the previous section. There is a pair (X, Y ) of random variables with
some joint distribution and the problem is to find the function g(Y ) = a + bY that
minimizes
One consider the cases where the distribution is known, or a set of samples has been
observed, or one observes one sample at a time.
Assume that the joint distribution of (X, Y ) is known. This means that we know
the joint cumulative distribution function (j.c.d.f.) FX,Y (x, y).1
We are looking for the function g(Y ) = a + bY that minimizes
Definition 9.1 (Linear Least Squares Estimate (LLSE)) The LLSE of X given
Y , denoted by L[X|Y ], is the linear function a + bY that minimizes
E(|X − a − bY |2 ).
Note that
To find the values of a and b that minimize that expression, we set to zero the partial
derivatives with respect to a and b. This gives the following two equations:
cov(X, Y )
L[X|Y ] = a + bY = E(X) + (Y − E(Y )),
var(Y )
1 See Appendix B.
9.3 Linear Least Squares Estimates 167
cov(X, Y )
L[X|Y ] = E(X) + (Y − E(Y )). (9.3)
var(Y )
Y = αX + Z, (9.4)
Hence,
αE(X2 ) α −1 Y
L[X|Y ] = Y = ,
α 2 E(X2 ) + E(Z 2 ) 1 + SN R −1
where
α 2 E(X2 )
SNR :=
σ2
is the signal-to-noise ratio, i.e., the ratio of the power E(α 2 X2 ) of the signal in Y
divided by the power E(Z 2 ) of the noise. Note that if SN R is small, then L[X|Y ] is
close to zero, which is the best guess about X if one does not make any observation.
Also, if SN R is very large, then L[X|Y ] ≈ α −1 Y , which is the correct guess if
Z = 0.
As a second example, assume that
X = αY + βY 2 , (9.5)
E(Y k ) = (1 + k)−1 .
168 9 Tracking—A
0 X
0
Hence,
This estimate is sketched in Fig. 9.5. Obviously, if one observes Y , one can compute
X. However, recall that L[X|Y ] is restricted to being a linear function of Y .
9.3.1 Projection
h(Y )
L[X|Y ]
Equivalently,
These two equations are the same as (9.1)–(9.2). We call the identities (9.6) the
projection property.
Figure 9.7 illustrates the projection when
||X̂|| ||X||
= ,
||X|| ||Y ||
so that
||X̂|| 1 ||Y ||
=√ = ,
1 1+σ 2 1 + σ2
√
since ||Y || = 1 + σ 2 . This shows that
1
X̂ = Y.
1 + σ2
170 9 Tracking—A
To see why the projection property implies that L[X|Y ] is the closest point to X
in L (Y ), as suggested by Fig. 9.6, we verify that
for any given h(Y ) = c + dY . The idea of the proof is to verify Pythagoras’ identity
on the right triangle with vertices X, L[X|Y ] and h(Y ). We have
Now, the projection property (9.6) implies that the last term in the above expression
is equal to zero. Indeed, L[X|Y ] − h(Y ) is a linear function of Y . It follows that
as was to be proved.
Assume now that, instead of knowing the joint distribution of (X, Y ), we observe K
i.i.d. samples (X1 , Y1 ), . . . , (XK , YK ) of these random variables. Our goal is still to
construct a function g(Y ) = a + bY so that
E(|X − a − bY |2 )
K
|Xk − a − bYk |2 .
k=1
To do this, we set to zero the derivatives of this sum with respect to a and b. Algebra
shows that the resulting values of a and b are such that
covK (X, Y )
a + bY = EK (X) + (Y − EK (Y )), (9.7)
varK (Y )
where we defined
9.4 Linear Regression 171
Linear regression
1 1
K K
EK (X) = Xk , EK (Y ) = Yk ,
K K
k=1 k=1
1
K
covK (X, Y ) = Xk Yk − EK (X)EK (Y ),
K
k=1
1
K
varK (Y ) = Yk2 − EK (Y )2 .
K
k=1
That is, the expression (9.7) is the same as (9.3), except that the expectation is
replaced by the sample mean. The expression (9.7) is called the linear regression of
X over Y . It is shown in Fig. 9.8.
One has the following result.
Combined with the expressions for the linear regression and the LLSE, these
properties imply the result.
Formula (9.3) and the linear regression provide an intuitive meaning of the
covariance cov(X, Y ). If this covariance is zero, then L[X|Y ] does not depend
on Y . If it is positive (negative), it increases (decreases, respectively) with Y .
Thus, cov(X, Y ) measures a form of dependency in terms of linear regression. For
172 9 Tracking—A
instance, the random variables in Fig. 9.9 are uncorrelated since L[X|Y ] does not
depend on Y .
In the previous section, we examined the problem of finding the linear function a +
bY that best approximates X, in the mean squared error sense. We could develop the
corresponding theory for quadratic approximations a + bY + cY 2 , or for polynomial
approximations of a given degree. The ideas would be the same and one would have
a similar projection interpretation.
In principle, a higher degree polynomial approximates X better than a lower
degree one since there are more such polynomials. The question of fitting the
parameters with a given number of observations is more complex.
Assume you observe N data points {(Xn , Yn ), n = 1, . . . , N}. If the values Yn
are different, one can define the function g(·) by g(Yn ) = Xn for n = 1, . . . , N.
This function achieves a zero-mean squared error. What is then the point of looking
for a linear function, or a quadratic, or some polynomial of a given degree? Why not
simply define g(Yn ) = Xn ?
Remember that the goal of the estimation is to discover a function g(·) that is
likely to work well for data points we have not yet observed. For instance, we hope
that E(C(XN +1 , g(YN +1 )) is small, where (XN +1 , YN +1 ) has the same distribution
as the samples (Xn , Yn ) we have observed for n = 1, . . . , N .
If we define g(Yn ) = Xn , this does not tell us how to calculate g(YN +1 ) for a
value YN +1 we have not observed. However, if we construct a polynomial g(·) of
a given degree based on the N samples, then we can calculate g(Yn+1 ). The key
observation is that a higher degree polynomial may not be a better estimate because
it tends to fit noise instead of important statistics.
As a simple illustration of overfitting, say that we observe (X1 , Y1 ) and Y2 .
We want to guess X2 . Assume that the samples Xn , Yn are all independent and
U [−1, 1]. If we guess X̂2 = 0, the mean squared error is E((X2 − X̂2 )2 ) =
E(X22 ) = 1/3. If we use the guess X̂2 = X1 based on the observations, then
E((X2 − X̂2 )2 ) = E((X2 − X1 )2 ) = 2/3. Hence, ignoring the observation is better
than taking it into account.
The practical question is how to detect overfitting. For instance, how does one
determine whether a linear regression is better than a quadratic regression? A simple
9.6 MMSE 173
9.6 MMSE
For now, assume that we know the joint distribution of (X, Y ) and consider the
problem of finding the function g(Y ) that minimizes
per all the possible functions g(·). The best function is called the MMSE of X
given Y . We have the following theorem:
g(Y ) = E[X|Y ],
where
fX,Y (x, y)
fX|Y [x|y] :=
fY (y)
Figure 9.10 illustrates the conditional expectation. That figure assumes that the
pair (X, Y ) is picked uniformly in the shaded area. Thus, if one observes that Y ∈
174 9 Tracking—A
X
E[X|Y = y]
(y, y + dy), the point X is uniformly distributed along the segment that cuts the
shaded area at Y = y. Accordingly, the average value of X is the mid-point of that
segment, as indicated in the figure. The dashed red line shows how that mean value
depends on Y and it defines E[X|Y ].
The following result is a direct consequence of the definition: for every function φ(·),

E(E[X|Y]φ(Y)) = E(Xφ(Y)),        (9.8)

i.e., the error X − E[X|Y] is orthogonal to every function φ(Y) of Y.
Proof of Theorem 9.3 The identity (9.8) is the projection property. It states that X −
E[X|Y] is orthogonal to the set G(Y) of functions of Y, as shown in Fig. 9.11.
In particular, it is orthogonal to h(Y) − E[X|Y]. As in the case of the LLSE, this
projection property implies that

E((X − h(Y))²) = E((X − E[X|Y])²) + E((E[X|Y] − h(Y))²) ≥ E((X − E[X|Y])²)

for any function h(·). This implies that E[X|Y] is indeed the MMSE of X given Y.
From the definition, we see how to calculate E[X|Y ] from the conditional density
of X given Y . However, in many cases one can calculate E[X|Y ] more simply. One
approach is to use the following properties of conditional expectation.
(a) Linearity: E[a1X1 + a2X2 | Y] = a1E[X1|Y] + a2E[X2|Y];
(b) Factoring known values: E[h(Y)X | Y] = h(Y)E[X|Y];
(c) Independence: if X and Y are independent, then E[X|Y] = E(X);
(d) Smoothing: E(E[X|Y]) = E(X);
(e) Tower: E[E[X|Y, Z] | Y] = E[X|Y].
Proof
(a) The sum of the terms ai(Xi − E[Xi|Y]) is orthogonal to every function φ(Y) when each term is, which proves linearity.
(c) Now, X − E(X) is orthogonal to G(Y). Indeed,

E((X − E(X))φ(Y)) = E(X − E(X))E(φ(Y)) = 0.

The first equality follows from the fact that X − E(X) and φ(Y) are independent
since they are functions of independent random variables.4 Hence E[X|Y] = E(X).
(d) Letting φ(Y) = 1 in (9.8), we find

E(X − E[X|Y]) = 0,

i.e., E(E[X|Y]) = E(X).
(e) For the tower property, one must show that E(h(Y)(E[X|Y, Z] − E[X|Y])) = 0
for any function h(Y). But E(h(Y)(X − E[X|Y, Z])) = 0 by the projection
property, because h(Y) is some function of (Y, Z). Also, E(h(Y)(X −
E[X|Y])) = 0, also by the projection property. Hence, subtracting these identities proves the claim.
For instance, the properties above make it easy to compute expressions such as E[(X + 2Y)²|Y] = E[X²|Y] + 4Y E[X|Y] + 4Y².
4 See Appendix B.
As another example, assume that X, Y, Z are i.i.d. random variables. We find

E[X | X + Y + Z] = (X + Y + Z)/3.        (9.10)

To see this, note that, by symmetry,

E[X | X + Y + Z] = E[Y | X + Y + Z] = E[Z | X + Y + Z].

Denote by V the common value of these random variables. Note that their sum is

E[X + Y + Z | X + Y + Z] = X + Y + Z,

so that 3V = X + Y + Z and (9.10) follows.
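Identity (9.10) can be checked numerically: since E[X | X + Y + Z] = (X + Y + Z)/3, regressing X on S = X + Y + Z should give slope 1/3 and intercept 0. A minimal sketch, assuming Python with NumPy and an arbitrary i.i.d. distribution:

import numpy as np

rng = np.random.default_rng(1)
N = 10**6

# X, Y, Z i.i.d. (exponential here, but any i.i.d. distribution works).
X, Y, Z = rng.exponential(1.0, (3, N))
S = X + Y + Z

# Since E[X | S] = S / 3, the regression of X on S has slope 1/3 and intercept 0.
slope, intercept = np.polyfit(S, X, 1)
print(slope, intercept)   # approximately 0.333 and 0.0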
A remarkable fact is that if X and Y are jointly Gaussian, then the MMSE of X given Y is linear:

E[X|Y] = L[X|Y] = E(X) + (cov(X, Y)/var(Y)) (Y − E(Y)).

To see this, note that X − L[X|Y] and Y are two linear functions of the jointly Gaussian random
variables X and Y. Consequently, they are jointly Gaussian by Theorem 8.4 and, since cov(X − L[X|Y], Y) = 0 by the projection property of the LLSE, they are independent by Theorem 8.3.
Consequently,

E((X − L[X|Y])φ(Y)) = E(X − L[X|Y])E(φ(Y)) = 0

for any φ(·), because functions of independent random variables are independent by
Theorem B.11 in Appendix B. Hence,

X − L[X|Y] is orthogonal to G(Y),

which shows that L[X|Y] = E[X|Y].
9.7 Vector Case

The notions of LLSE and MMSE extend to random vectors. The LLSE of the random vector X given the random vector Y is the linear function

L[X|Y] = AY + b

that minimizes

E(||X − AY − b||²).

Thus, as in the scalar case, the LLSE is the linear function of the observations
that best approximates X, in the mean squared error sense.
Before proceeding, review the notation of Sect. B.6 for ΣY and cov(X, Y).
Theorem 9.7 (LLSE of Vectors) Let X and Y be random vectors such that ΣY is
nonsingular.
(a) Then

L[X|Y] = E(X) + cov(X, Y)ΣY^{−1}(Y − E(Y)).        (9.11)

(b) Moreover, the resulting error satisfies

E(||X − L[X|Y]||²) = tr(ΣX − cov(X, Y)ΣY^{−1}cov(Y, X)).
Proof
(a) The proof is similar to the scalar case. Let Z be the right-hand side of (9.11).
One shows that the error X − Z is orthogonal to all the linear functions of Y.
One then uses that fact to show that X is closer to Z than to any other linear
function h(Y) of Y.
We claim that the last term is equal to zero. To see this, note that

E((X − Z)^T(Z − h(Y))) = Σ_{i=1}^{n} E((Xi − Zi)(Zi − hi(Y))).
Also,
where the first equality comes from the fact that tr(AB) = tr(BA) for matrices
of compatible dimensions.)
(b) Let X̃ := X − L[X|Y] be the estimation error. Expanding E(||X̃||²) using (9.11) and the orthogonality of X̃ and Y yields the expression in part (b).
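Formula (9.11) can be evaluated from sample estimates of the means and covariances. A minimal sketch, assuming Python with NumPy and an illustrative linear model that is not from the text:

import numpy as np

rng = np.random.default_rng(2)
N = 200_000

# Illustrative model: X in R^2 is a linear function of Y in R^3 plus noise.
Y = rng.normal(size=(N, 3))
A_true = np.array([[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]])
X = Y @ A_true.T + 0.3 * rng.normal(size=(N, 2))

mX, mY = X.mean(axis=0), Y.mean(axis=0)
SigmaY = np.cov(Y, rowvar=False)                 # estimate of cov(Y)
covXY = (X - mX).T @ (Y - mY) / (N - 1)          # estimate of cov(X, Y)

A = covXY @ np.linalg.inv(SigmaY)                # matrix in L[X|Y]
b = mX - A @ mY
print(A)   # close to A_true, since the model is linear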
9.8 Kalman Filter

The Kalman Filter is an algorithm to update the estimate of the state of a system
using its output, as sketched in Fig. 9.13. The system has a state X(n) and an output
Y(n) at time n = 0, 1, . . .. These variables are defined through a system of linear
equations:

X(n + 1) = AX(n) + V(n)
Y(n) = CX(n) + W(n).
In these equations, the random variables {X(0), V (n), W (n), n ≥ 0} are all
orthogonal and zero-mean. The covariance of V (n) is ΣV and that of W (n) is ΣW .
The filter is developed when the variables are random vectors and A, C are matrices
of compatible dimensions.
The objective is to derive recursive equations to calculate

X̂(n) := L[X(n) | Y(0), . . . , Y(n)], n ≥ 0.

Here is the result, due to Rudolf Kalman (Fig. 9.14), which we prove in the next
chapter. Do not panic when you see the equations!

Theorem 9.8 (Kalman Filter) The estimates can be computed recursively:

X̂(n) = AX̂(n − 1) + Kn[Y(n) − CAX̂(n − 1)],

where the gain Kn and the covariance Σn of the error X(n) − X̂(n) are given by

Sn = AΣn−1A^T + ΣV,
Kn = SnC^T[CSnC^T + ΣW]^{−1},
Σn = [I − KnC]Sn.

Moreover, if the system is observable and reachable (terms defined in the next chapter), then Σn → Σ and Kn → K as n → ∞, and the filter that uses the constant gain K achieves the same asymptotic error covariance.

We will give a number of examples of this result. But first, let us make a few
comments.
9.8.2 Examples
Random Walk

The first example is a filter to track a “random walk” by making noisy observations.
Let

X(n + 1) = X(n) + V(n)
Y(n) = X(n) + W(n).

That is, X(n) has orthogonal increments and it is observed with orthogonal noise.
Figure 9.15 shows a simulation of the filter. The left-hand part of the figure shows
that the estimate tracks the state with a bounded error. The middle part of the figure
shows the variance of the error, which can be precomputed. The right-hand part of
the figure shows the filter with the time-varying gain (in blue) and the filter with the
limiting gain (in green). The filter with the constant gain performs as well as the one
with the time-varying gain, in the limit, as justified by part (c) of the theorem.
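A minimal sketch of this random walk filter, assuming Python with NumPy, scalar A = C = 1, and noise variances chosen only for the illustration; the recursion follows the equations of Theorem 9.8:

import numpy as np

rng = np.random.default_rng(3)
n_steps, sig_v, sig_w = 200, 1.0, 2.0   # illustrative noise levels

# Simulate X(n+1) = X(n) + V(n), Y(n) = X(n) + W(n).
X = np.cumsum(rng.normal(0, sig_v, n_steps))
Y = X + rng.normal(0, sig_w, n_steps)

x_hat, Sigma = 0.0, 0.0                 # estimate and error variance
estimates = []
for n in range(n_steps):
    S = Sigma + sig_v**2                # S_n = A Sigma_{n-1} A^T + Sigma_V
    K = S / (S + sig_w**2)              # K_n = S C^T [C S C^T + Sigma_W]^{-1}
    x_hat = x_hat + K * (Y[n] - x_hat)  # update with the innovation
    Sigma = (1 - K) * S                 # Sigma_n = (I - K C) S
    estimates.append(x_hat)

print(np.mean((np.array(estimates) - X) ** 2), Sigma)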
In this model, X2 (n) is the constant but unknown drift and X1 (n) is the value of
the “random walk.” Figure 9.16 shows a simulation of the filter. It shows that the
filter eventually estimates the drift and that the estimate of the position of the walk
is quite accurate.
In this model, X2 (n) is the varying drift and X1 (n) is the value of the “random
walk.” Figure 9.17 shows a simulation of the filter. It shows that the filter tries to
track the drift and that the estimate of the position of the walk is quite accurate.
Falling Object
In the fourth example, one tracks a falling object. The elevation Z(n) of that falling
object follows the equation
where S(0) is the initial vertical velocity of the object and g is the gravitational
constant at the surface of the earth. In this expression, V (n) is some noise that
perturbs the motion. We observe η(n) = Z(n) + W (n), where W (n) is some noise.
With this change of variables, the system is described by the following equations:
Figure 9.18 shows a simulation of the filter that computes X̂1 (n) from which we
subtract gt 2 /2 to get an estimate of the actual altitude Z(n) of the object.
9.9 Summary
9.10 References
LLSE, MMSE, and linear regression are covered in Chapter 4 of Bertsekas and
Tsitsiklis (2008). The Kalman filter was introduced in Kalman (1960). The text
(Brown and Hwang 1996) is an easy introduction to Kalman filters with many
examples.
9.11 Problems
Problem 9.1 Assume that Xn = Yn + 2Yn² + Zn, where the Yn and Zn are i.i.d.
U [0, 1]. Let also X = X1 and Y = Y1 .
Problem 9.2 We want to compare the off-line and on-line methods for computing
L[X|Y ]. Use the setup of the previous problem.
(a) Generate N = 1, 000 samples and compute the linear regression of X given Y .
Say that this is X = aY + b
(b) Using the same samples, compute the linear fit recursively using the stochastic
gradient algorithm. Say that you obtain X = cY + d
(c) Evaluate the quality of the two estimates you obtained by computing E((X −
aY − b)²) and E((X − cY − d)²).
Problem 9.4 You observe three i.i.d. samples X1 , X2 , X3 from the distribution
fX|θ(x) = (1/2)e^{−|x−θ|}, where θ ∈ ℝ is the parameter to estimate. Find
MLE[θ |X1 , X2 , X3 ].
Problem 9.5
(a) Given three independent N(0, 1) random variables X, Y , and Z, find the
following minimum mean square estimator:
(b) For the above, compute the mean squared error of the estimator.
Problem 9.6 Given two independent N(0, 1) random variables X and Y, find the
following linear least square estimator:
L[X|X² + Y].
Problem 9.7 Consider a sensor network with n sensors that are making observa-
tions Yn = (Y1 , . . . , Yn ) of a signal X where
Yi = aX + Zi , i = 1, . . . , n.
nC + σn2 .
νC + σν2 ,
Problem 9.8 We want to use a Kalman filter to detect a change in the popularity of
a word in twitter messages. To do this, we create a model of the number Yn of times
that particular word appears in twitter messages on day n. The model is as follows:
X(n + 1) = X(n)
Y (n) = X(n) + W (n),
where the W (n) are zero-mean and uncorrelated. This model means that we are
observing numbers of occurrences with an unknown mean X(n) that is supposed
to be constant. The idea is that if the mean actually changes, we should be able to
detect it by noticing that the errors between Ŷ (n) and Y (n) are large. Propose an
algorithm for detecting that change and implement it in Python.
Xn+1 = aXn + Vn , n ≥ 0.
Problem 9.12 Let θ =D U [0, 1], and given θ , the random variable X is uniformly
distributed in [0, θ ]. Find E[θ |X].
Problem 9.13 Let (X, Y )T ∼ N([0; 0], [3, 1; 1, 1]). Find E[X2 |Y ].
Problem 9.14 Let (X, Y, Z)T ∼ N([0; 0; 0], [5, 3, 1; 3, 9, 3; 1, 3, 1]). Find
E[X|Y, Z].
Problem 9.15 Consider arbitrary random variables X and Y . Prove the following
property:
Problem 9.16 Let the joint p.d.f. of two random variables X and Y be

fX,Y(x, y) = (1/4)(2x + y) 1{0 ≤ x ≤ 1} 1{0 ≤ y ≤ 2}.
First show that this is a valid joint p.d.f. Suppose you observe Y drawn from this
joint density. Find MMSE[X|Y ].
E[X + 2Y + 3Z|Y + 5Z + 4V ].
Problem 9.18 Assume that X, Y are two random variables that are such that
E[X|Y ] = L[X|Y ]. Then, it must be that (choose the correct answers, if any)
Problem 9.19 In a linear system with independent Gaussian noise, with state Xn
and observation Yn , the Kalman filter computes (choose the correct answers, if any)
MLE[Yn |Xn ];
MLE[Xn |Y n ];
MAP [Yn |Xn ];
MAP [Xn |Y n ];
E[Xn |Y n ];
E[Yn |Xn ];
E[Xn |Yn ];
E[Yn |Xn ].
Find E[X|Y].
Problem 9.23 Given two independent N(0, 1) random variables X and Y, find the
following linear least square estimator:
L[X|X³ + Y].
E[X|X + Y, X + Z, Y − Z].
Find L[X|Y1 , Y2 , Y3 ]. Hint: You will observe that ΣY is singular. This means that at
least one of the observations Y1 , Y2 , or Y3 is redundant, i.e., is a linear combination
of the others. This implies that L[X|Y1 , Y2 , Y3 ] = L[X|Y1 , Y2 ].
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Tracking: B
10
In many situations, one keeps making observations and one wishes to update
the estimate accordingly, hopefully without having to recompute everything from
scratch. That is, one hopes for a method that enables to calculate L[X|Y, Z] from
L[X|Y] and Z.
The key idea is in the following result.

Theorem 10.1 If the zero-mean random vectors Y and Z are orthogonal, i.e., cov(Y, Z) = 0, then

L[X|Y, Z] = L[X|Y] + L[X|Z].        (10.1)

Proof Figure 10.1 shows why the result holds. To be convinced mathematically, we
need to show that the error

X − (L[X|Y] + L[X|Z])

is orthogonal to every linear function BY + DZ of Y and Z.
Fig. 10.1 The LLSE is easy to update after an additional orthogonal observation
Write the error as (X − L[X|Y]) − L[X|Z]. The first term is orthogonal to Y by the projection property, and the second is a linear function of Z, which is orthogonal to Y; hence the error is orthogonal to BY. By symmetry, it is also orthogonal to DZ, which proves the result.
Theorem 10.2 Let X, Y, Z be zero-mean random vectors. Then

L[X|Y, Z] = L[X|Y] + L[X|Z̃], where Z̃ := Z − L[Z|Y].        (10.2)

Proof The idea here is that one considers the innovation Z̃ := Z − L[Z|Y], which
is the information in the new observation Z that is orthogonal to Y.
To see why the result holds, note that any linear combination of Y and Z can be
written as a linear combination of Y and Z̃. For instance, if L[Z|Y] = CY, then

BY + DZ = BY + D(Z̃ + CY) = (B + DC)Y + DZ̃.

Thus, the set of linear functions of Y and Z is the same as the set of linear functions
of Y and Z̃, so that

L[X|Y, Z] = L[X|Y, Z̃].
Thus, (10.2) follows from Theorem 10.1 since Y and Z̃ are orthogonal.
We derive the equations for the Kalman filter, as stated in Theorem 9.8. For
convenience, we repeat those equations here:

X̂(n) = AX̂(n − 1) + Kn[Y(n) − CAX̂(n − 1)],
Sn = AΣn−1A^T + ΣV,
Kn = SnC^T[CSnC^T + ΣW]^{−1},
Σn = [I − KnC]Sn.

The derivation uses the linearity of L[·|·] and the fact that

cov(BV, DW) = B cov(V, W)D^T.
The algebra is a bit tedious, but the key steps are worth noting.
Let
Y n = (Y (0), . . . , Y (n)).
Note that

L[X(n)|Y^{n−1}] = L[AX(n − 1) + V(n − 1)|Y^{n−1}] = AX̂(n − 1).

Hence,

L[Y(n)|Y^{n−1}] = L[CX(n) + W(n)|Y^{n−1}] = CL[X(n)|Y^{n−1}] = CAX̂(n − 1),
Thus,
X̂(n) = L[X(n)|Y^n] = L[X(n)|Y^{n−1}] + L[X(n) | Y(n) − L[Y(n)|Y^{n−1}]]
       = AX̂(n − 1) + Kn[Y(n) − CAX̂(n − 1)].
This derivation shows that (10.16) is a fairly direct consequence of the formula in
Theorem 10.2 for updating the LLSE.
The calculation of the gain Kn is a bit more complex. Let

Ỹ(n) = Y(n) − L[Y(n)|Y^{n−1}] = Y(n) − CAX̂(n − 1).

Then

Kn = cov(X(n), Ỹ(n)) [cov(Ỹ(n))]^{−1}.
Now,

cov(X(n), Ỹ(n)) = cov(X(n) − L[X(n)|Y^{n−1}], Ỹ(n)),

by (10.20). Let Sn := cov(X(n) − L[X(n)|Y^{n−1}]). Since Ỹ(n) = C(X(n) − L[X(n)|Y^{n−1}]) + W(n), this gives

cov(X(n), Ỹ(n)) = SnC^T.

To calculate cov(Ỹ(n)), we note that

cov(Ỹ(n)) = cov(CX(n) + W(n) − CL[X(n)|Y^{n−1}]) = CSnC^T + ΣW.

Thus,

Kn = SnC^T [CSnC^T + ΣW]^{−1}.

Finally, since X(n) − L[X(n)|Y^{n−1}] = A(X(n − 1) − X̂(n − 1)) + V(n − 1),

Sn = AΣn−1A^T + ΣV.
We observe that

X(n) − L[X(n)|Y^n] = X(n) − AX̂(n − 1) − Kn[Y(n) − CAX̂(n − 1)]
                   = X(n) − AX̂(n − 1) − Kn[CX(n) + W(n) − CAX̂(n − 1)]
                   = [I − KnC][X(n) − AX̂(n − 1)] − KnW(n),

so that

Σn = [I − KnC]Sn[I − KnC]^T + KnΣWKn^T,

which, using the expression for Kn, simplifies to Σn = [I − KnC]Sn,
as we wanted to show.
The goal of this section is to explain and justify the following result. The terms
observable and reachable are defined after the statement of the theorem.

Theorem 10.3 (a) If (A, C) is observable, then the error covariance Σn remains bounded and

Σn → Σ and Kn → K,        (10.37)

for some limits Σ and K.
(b) If, in addition, the system is reachable, then the filter that uses the constant gain K has the same asymptotic error covariance as the filter with the time-varying gains Kn.
We explain these properties in the subsequent sections. Let us first make a few
comments.
• For some systems, the errors grow without bound. For instance, if one does not
observe anything (e.g., C = 0) and if the system is unstable (e.g., X(n) =
2X(n − 1) + V (n)), then Σn goes to infinity. However, (a) says that “if the
observations are rich enough,” this does not happen: one can track X(n) with an
error that has a bounded covariance.
• Part (b) of the theorem says that in some cases, one can use the filter with
a constant gain K without having a bigger error, asymptotically. This is very
convenient as one does not have to compute a new gain at each step.
10.3.1 Observability
Are the observations good enough to track the state with a bounded error covari-
ance? Before stating the result, we need a precise notion of good observations.
Definition 10.1 (Observability) We say that (A, C) is observable if the null space
of the stacked matrix

[C; CA; . . . ; CA^d]

is {0}. Here, d is the dimension of X(n). A matrix M has null space {0} if 0 is the
only vector v such that Mv = 0.
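In practice, observability can be checked numerically by computing the rank of the stacked matrix in Definition 10.1. A minimal sketch, assuming Python with NumPy and an example pair (A, C) chosen only for illustration:

import numpy as np

def is_observable(A, C):
    """Return True if the stacked matrix [C; CA; ...; CA^d] has full column rank."""
    d = A.shape[0]
    blocks = [C @ np.linalg.matrix_power(A, k) for k in range(d + 1)]
    O = np.vstack(blocks)
    return np.linalg.matrix_rank(O) == d

# Example: position-velocity model where only the position is observed.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
print(is_observable(A, C))   # True: velocity can be inferred from positions

C2 = np.array([[0.0, 0.0]])  # observing nothing
print(is_observable(A, C2))  # False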
Proof
(a) Observability implies that there is only one X(0) that corresponds to
(Y(0), . . . , Y(d)) if the system has no noise. Indeed, in that case, X(n) = AX(n − 1), so that X(n) = A^nX(0) and

Y(n) = CX(n) = CA^nX(0).

Consequently,

(Y(0), Y(1), . . . , Y(d))^T = [C; CA; . . . ; CA^d] X(0),

where [C; CA; . . . ; CA^d] is the stacked matrix of Definition 10.1.
Now, imagine that there are two different initial states, say X(0) and X̊(0) that
give the same outputs Y (0), . . . , Y (d). Then,
(Y(0), Y(1), . . . , Y(d))^T = [C; CA; . . . ; CA^d] X(0) = [C; CA; . . . ; CA^d] X̊(0),

so that

[C; CA; . . . ; CA^d] (X(0) − X̊(0)) = 0.

Since the null space of this matrix is {0}, it follows that X(0) = X̊(0). Thus, in the noiseless case, the first observations determine X(0) exactly.
This implies that the error between X(n) and X̂(n) is a linear combination of
d noise contributions, so that Σn is bounded.
(b) One can show that if Σ0 = 0, i.e., if we know X(0), then Σn increases in the
sense that Σn − Σn−1 is nonnegative definite. Being bounded and increasing
implies that Σn converges, and so does Kn .
10.3.2 Reachability

Write ΣV = QQ^T. We say that (A, Q) is reachable if the rank of the matrix

[Q, AQ, . . . , A^{d−1}Q]

is full. To appreciate the meaning of this property, note that we can write the state
equations as

X(n + 1) = AX(n) + Qηn,
where cov(ηn) = I. That is, the components of η are orthogonal. In the Gaussian
case, the components of η are N(0, 1) and independent. If (A, Q) is reachable, this
means that for any x ∈ ℝ^d, there is some sequence η0, . . . , η_{d−1} such that if X(0) = 0,
then X(d) = x. Indeed,
X(d) = Σ_{k=0}^{d−1} A^k Q η_{d−1−k} = [Q, AQ, . . . , A^{d−1}Q] (η_{d−1}, η_{d−2}, . . . , η0)^T.
Since the matrix is full rank, the span of its columns is ℝ^d, which means precisely
that there is a linear combination of these columns that is equal to any given vector
in ℝ^d.
The proof of part (b) of the theorem is a bit too involved for this course.
10.4 Extended Kalman Filter

The Kalman filter is often used for nonlinear systems. The idea is that if the system
is almost linear over a few steps, then one may be able to use the Kalman filter
locally and change the matrices A and C as the estimate of the state changes.
The model is as follows:

X(n + 1) = f(X(n)) + V(n)
Y(n) = g(X(n)) + W(n),

where f(·) and g(·) are differentiable functions and

[An]ij = (∂/∂xj) fi(X̂(n)) and [Cn]ij = (∂/∂xj) gi(X̂(n)).
Thus, the idea is to linearize the system around the estimated state value and then
apply the usual Kalman filter.
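A minimal sketch of one such step, assuming Python with NumPy, the model X(n + 1) = f(X(n)) + V(n), Y(n) = g(X(n)) + W(n), and user-supplied Jacobians; this is only an illustration of the linearize-then-update idea, not the exact implementation used for the figures.

import numpy as np

def ekf_step(x_hat, Sigma, y, f, g, jac_f, jac_g, Sigma_V, Sigma_W):
    """One extended-Kalman-filter step: linearize at the estimate, then update."""
    A = jac_f(x_hat)                       # A_n: Jacobian of f at the estimate
    x_pred = f(x_hat)                      # predicted state
    C = jac_g(x_pred)                      # C_n: Jacobian of g at the prediction
    S = A @ Sigma @ A.T + Sigma_V          # predicted error covariance
    K = S @ C.T @ np.linalg.inv(C @ S @ C.T + Sigma_W)   # gain
    x_new = x_pred + K @ (y - g(x_pred))   # correct with the innovation
    Sigma_new = (np.eye(len(x_new)) - K @ C) @ S
    return x_new, Sigma_new

# Tiny illustration with a 1-D nonlinear system x' = sin(x) + noise, y = x^3 + noise.
f = np.sin
g = lambda x: x**3
x, S = ekf_step(np.array([0.1]), np.eye(1), np.array([0.2]), f, g,
                lambda x: np.array([[np.cos(x[0])]]),
                lambda x: np.array([[3 * x[0]**2]]),
                0.01 * np.eye(1), 0.1 * np.eye(1))
print(x, S)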
Note that we are now in the realm of heuristics and that very little can be said
about the properties of this filter. Experiments show that it works well when the
nonlinearities are small, whatever this means precisely, but that it may fail miserably
in other conditions.
10.4.1 Examples
Tracking a Vehicle
In this example, borrowed from “Eric Feron, Notes for AE6531, Georgia Tech.”, the
goal is to track a vehicle that moves in the plane by using noisy measurements of
distances to 9 points pi ∈ ℝ². Let p(n) ∈ ℝ² be the position of the vehicle and
u(n) ∈ ℝ² be its velocity at time n ≥ 0.
We assume that the velocity changes according to a known rule, except for some
random perturbation. Specifically, we assume that
where the w(n) are i.i.d. N(0, I). The measurements are
Fig. 10.2 The Extended Kalman Filter for the system (10.38)–(10.39)
and
y = RT (CA + CB + CC ).
As shown in the top part of Fig. 10.4, this filter does not track the concentrations
correctly. In fact, some concentrations that the filter estimates are negative!
Fig. 10.4 The top two graphs show that the extended Kalman filter does not track the concentra-
tions correctly. The bottom two graphs show convergence after modifying the equations
The bottom graphs show that the filter tracks the concentrations correctly after
modifying the equations and replacing negative estimates by 0.
The point of this example is that the extended Kalman filter is not guaranteed to
converge and that, sometimes, a simple modification makes it converge.
10.5 Summary
• Updating LLSE;
• Derivation of Kalman Filter;
• Observability and Reachability;
• Extended Kalman Filter.
10.6 References
The book Goodwin and Sin (2009) surveys filtering and applications to control. The
textbook Kumar and Varaiya (1986) is a comprehensive yet accessible presentation
of control theory, filtering, and adaptive control. It is available online.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Speech Recognition: A
11
A hidden Markov chain is a Markov chain together with a state observation model.
The Markov chain is {X(n), n ≥ 0} and it has its transition matrix P on the state
space X and its initial distribution π0 . The state observation model specifies that
when the state of the Markov chain is x, one observes a value y with probability
Q(x, y), for y ∈ Y . More precisely, here is the definition (Fig. 11.2).
In the speech recognition application, the Xn are “parts of speech,” i.e., segments
of sentences, and the Yn are sounds. The structure of the language determines
relationships between the Xn that can be approximated by a Markov chain. The
relationship between Xn and Yn is speaker-dependent.
The recognition problem is the following. Assume that you have observed that
Yn := (Y0 , . . . , Yn ) = yn := (y0 , . . . , yn ). What is the most likely sequence Xn :=
(X0 , . . . , Xn )? That is, in the terminology of Chap. 7, we want to compute
MAP[Xn | Yn = yn],

i.e., the value xn of Xn that maximizes

P[Xn = xn | Yn = yn].
Note that

P[Xn = xn | Yn = yn] = P(Xn = xn, Yn = yn) / P(Yn = yn).
The MAP is the value of xn that maximizes the numerator. Now, by (11.1), the
logarithm of the numerator is equal to

log(π0(x0)Q(x0, y0)) + Σ_{m=1}^{n} log(P(xm−1, xm)Q(xm, ym)).
Define

d(x0) := −log(π0(x0)Q(x0, y0)) and dm(xm−1, xm) := −log(P(xm−1, xm)Q(xm, ym)).

Maximizing the numerator is then equivalent to minimizing

d(x0) + Σ_{m=1}^{n} dm(xm−1, xm).        (11.2)
The expression (11.2) can be viewed as the length for a path in the graph shown
in Fig. 11.3. Finding the MAP is then equivalent to solving a shortest path problem.
There are a few standard algorithms for solving such problems. We describe the
Bellman–Ford Algorithm due to Bellman (Fig. 11.4) and Ford.
For m = 0, . . . , n and x ∈ X, let Vm(x) be the length of the shortest path from
X(m) = x to the column X(n) in the graph. Also, let Vn(x) = 0 for all x ∈ X.
Then, one has

Vm(x) = min_{x′ ∈ X} {dm+1(x, x′) + Vm+1(x′)}, x ∈ X, m = 0, . . . , n − 1.        (11.3)
Finally, let

V := min_{x0} {d(x0) + V0(x0)};

the MAP sequence is obtained by choosing the x0 that achieves this minimum and then following the minimizing choices in (11.3).
Equations (11.3) are the Bellman–Ford Equations. They are a particular version
of Dynamic Programming Equations (DPE) for the shortest path problem.
Note that the essential idea was to define the length of the shortest remaining path
starting from every node in the graph and to write recursive expressions for those
quantities. Thus, one solves the DPE backwards and then one finds the shortest path
forward. This application of the shortest path algorithm for finding a MAP is called
the Viterbi Algorithm due to Andrew Viterbi (Fig. 11.5).
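A minimal sketch of this computation, assuming Python with NumPy and a hidden Markov chain specified by (π0, P, Q); it solves the backward recursion (11.3) and then reads off the minimizing path forward. The example transition and observation matrices are chosen only for illustration.

import numpy as np

def viterbi(pi0, P, Q, y):
    """MAP state sequence for observations y, via the Bellman-Ford recursion (11.3)."""
    n, K = len(y), len(pi0)
    d0 = -np.log(pi0 * Q[:, y[0]])            # d(x0): negative log-likelihood
    V = np.zeros((n, K))                      # V[m, x]: shortest remaining length
    choice = np.zeros((n, K), dtype=int)
    for m in range(n - 2, -1, -1):            # backward pass
        for x in range(K):
            lengths = -np.log(P[x, :] * Q[:, y[m + 1]]) + V[m + 1, :]
            choice[m, x] = np.argmin(lengths)
            V[m, x] = lengths[choice[m, x]]
    path = [int(np.argmin(d0 + V[0, :]))]     # best initial state
    for m in range(n - 1):                    # forward pass: follow the choices
        path.append(int(choice[m, path[-1]]))
    return path

# Example: two states observed through a noisy channel.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
Q = np.array([[0.8, 0.2], [0.2, 0.8]])
print(viterbi(np.array([0.5, 0.5]), P, Q, [0, 0, 1, 1, 1]))   # typically [0, 0, 1, 1, 1]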
You look at a set of N exam results {X(1), . . . , X(N)} in your probability course
and you must decide who are the A and the B students. To study this problem, we
assume that the results of A students are i.i.d. N (a, σ 2 ) and those of B students are
N (b, σ 2 ) where a > b.
For simplicity, assume that we know σ 2 and that each student has probability 0.5
of being an A student. However, we do not know the parameters (a, b).
(The same method applies when one does not know the variances of the scores
of A and B students, nor the prior probability that a student is of type A.)
One heuristic is as follows (see Fig. 11.6). Start with a guess (a1 , b1 ) for (a, b).
Student n with score X(n) is more likely to be of type A if X(n) > (a1 + b1 )/2.
Let us declare that such students are of type A and the others are of type B. Let then
a2 be the average score of the students declared to be of type A and b2 that of the
other students. We repeat the procedure after replacing (a1 , b1 ) by (a2 , b2 ) and we
keep doing this until the values seem to converge. This heuristic is called the hard
expectation maximization algorithm.
A slightly different heuristic is as follows (see Fig. 11.7). Again, we start with a
guess (a1 , b1 ).
Using Bayes’ rule, we calculate the probability p(n) that student n with score
X(n) is of type A. We then calculate
a2 = Σn X(n)p(n) / Σn p(n) and b2 = Σn X(n)(1 − p(n)) / Σn (1 − p(n)).
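A minimal sketch of this soft heuristic, assuming Python with NumPy, a known σ, equal prior probabilities for the two types, and illustrative values for the true (a, b):

import numpy as np

rng = np.random.default_rng(4)
sigma = 5.0
true_a, true_b = 80.0, 60.0
types = rng.random(500) < 0.5                       # True = type A
X = np.where(types, true_a, true_b) + sigma * rng.normal(size=500)

a, b = 75.0, 65.0                                   # initial guess (a1, b1)
for _ in range(50):
    # Bayes' rule gives p(n) = P[type A | X(n)] under the current guess.
    la = np.exp(-(X - a) ** 2 / (2 * sigma**2))
    lb = np.exp(-(X - b) ** 2 / (2 * sigma**2))
    p = la / (la + lb)
    # Weighted averages, as in the update for (a2, b2).
    a = np.sum(X * p) / np.sum(p)
    b = np.sum(X * (1 - p)) / np.sum(1 - p)

print(a, b)   # close to (80, 60)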
In the previous example, one attempts to estimate some parameter θ = (a, b) based
on some observations X = (X1 , . . . , XN ). Let Z = (Z1 , . . . , ZN ) where Zn = A if
student n is of type A and Zn = B otherwise.
We would like to maximize f[x|θ] over θ, to find MLE[θ|X = x]. One has

f[x|θ] = Σz f[x|z, θ]P[z|θ],
where the sum is over the 2N possible values of Z. This is computationally too
difficult.
The hard EM algorithm replaces this sum by

f[x|z∗, θ]P[z∗|θ],

where z∗ is the most likely value of Z given the observations and a current guess for
θ. That is, if the current guess is θk, then

z∗ = arg maxz P[z | x, θk].

The (soft) EM algorithm instead replaces log(f[x|θ]) by

Σz log(f[x|z, θ])P[z | x, θk],

and the new guess θk+1 is the maximizer of that expression over θ. Thus, it replaces
the distribution of Z by the conditional distribution given the current guess and the
observations.
If this heuristic did not work in practice, nobody would mention it. Surprisingly, it
seems to work for some classes of problems. There is some theoretical justification
for the heuristic. One can show that it converges to a local maximum of f [x|θ ].
Generally, this is little comfort because most problems have many local maxima.
See Roche (2012).
Consider once again a hidden Markov chain model but assume that (π, P , Q) are
functions of some parameter θ that we wish to estimate. We write this explicitly as
(πθ , Pθ , Qθ ). We are interested in the value of θ that makes the observed sequence
yn most likely.
Recall that the MLE of θ given that Yn = yn is defined as

MLE[θ | Yn = yn] := arg maxθ P[Yn = yn | θ].
11.4.1 HEM
One starts with a guess θ0 and computes

P[Xn = xn∗ | Yn, θ0],

where

xn∗ := MAP[Xn | Yn = yn, θ0].

Recall that one can find xn∗ by using Viterbi's algorithm. Also,

P[Xn = xn, Yn = yn | θ] = πθ(x0)Qθ(x0, y0)Pθ(x0, x1)Qθ(x1, y1) × · · · × Pθ(xn−1, xn)Qθ(xn, yn).
11.5 Summary
11.6 References
The text Wainwright and Jordan (2008) is a great presentation of graphical models. It
covers expectation maximization and many other useful techniques.
11.7 Problems
Problem 11.1 Let (Xn , Yn ) be a hidden Markov chain. Let Y n = (Y0 , . . . , Yn ) and
Xn = (X0 , . . . , Xn ). The Viterbi algorithm computes
MLE[Y n |Xn ];
MLE[Xn |Y n ];
MAP [Y n |Xn ];
MAP [Xn |Y n ].
Problem 11.2 Assume that the Markov chain Xn is such that X = {a, b}, π0(a) =
π0(b) = 0.5, P(x, x′) = α for x′ ≠ x, and P(x, x) = 1 − α. Assume also
that Xn is observed through a BSC with error probability ε, as shown in Fig. 11.9.
Implement the Viterbi algorithm and evaluate its performance.
Problem 11.3 Suppose that the grades of students in a class are distributed as a
mixture of two Gaussian distribution, N(μ1 , σ12 ) with probability p and N(μ2 , σ22 )
with probability 1 − p. All the parameters θ = (μ1 , σ1 , μ2 , σ2 , p) are unknown.
(a) You observe n i.i.d. samples, y1 , . . . , yn drawn from the mixed distribution. Find
f (y1 , . . . , yn |θ ).
(b) Let the type random variable Xi be 0 if Yi ∼ N(μ1 , σ12 ) and 1 if Yi ∼
N(μ2 , σ22 ). Find MAP [Xi |Yi , θ ].
(c) Implement Hard EM algorithm to approximately find MLE[θ |Y1 , . . . , Yn ]. To
this end, use MATLAB to generate 1000 data points (y1 , . . . , y1000 ), according
to θ = (10, 4, 30, 6, 0.4). Use your data to estimate θ . How well is your
algorithm working?
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Speech Recognition: B
12
This section explains the stochastic gradient descent algorithm, which is a technique
used in many learning schemes.
Recall that a linear regression finds the parameters a and b that minimize the
error

Σ_{k=1}^{K} (Xk − a − bYk)²,
where the (Xk , Yk ) are observed samples that are i.i.d. with some unknown
distribution fX,Y (x, y).
Assume that, instead of calculating the linear regression based on K samples, we
keep updating the parameters (a, b) every time we observe a new sample.
Our goal is to find a and b that minimize
E((X − a − bY)²) = E(X²) + a² + b²E(Y²) − 2aE(X) − 2bE(XY) + 2abE(Y) =: h(a, b).
One idea is to use a gradient descent algorithm to minimize h(a, b). Say that at
step k of the algorithm, one has calculated (a(k), b(k)). The gradient algorithm
would update (a(k), b(k)) in the direction opposite of the gradient, to make
h(a(k), b(k)) decrease. That is, the algorithm would compute
a(k + 1) = a(k) − α (∂/∂a) h(a(k), b(k)),
b(k + 1) = b(k) − α (∂/∂b) h(a(k), b(k)),
where α is a small positive number that controls the step size. Thus,

a(k + 1) = a(k) + 2α[E(X) − a(k) − b(k)E(Y)],
b(k + 1) = b(k) + 2α[E(XY) − a(k)E(Y) − b(k)E(Y²)].
However, we do not know the distributions and cannot compute the expected
values. Instead, we replace the mean values by the values of the new samples. That
is, we compute

a(k + 1) = a(k) + 2α[Xk+1 − a(k) − b(k)Yk+1],
b(k + 1) = b(k) + 2α[Xk+1 − a(k) − b(k)Yk+1]Yk+1.
That is, instead of using the gradient algorithm we use a stochastic gradient
algorithm where the gradient is replaced by a noisy version. The intuition is that,
if the step size is small, the errors between the true gradient and its noisy version
average out.
The top part of Fig. 12.1 shows the updates of this algorithm for the example
(9.4) with α = 0.002, E(X2 ) = 1, and E(Z 2 ) = 0.3. In this example, we know that
the LLSE is

L[X|Y] = a + bY = (1/1.3)Y ≈ 0.77Y.
The figure shows that (ak , bk ) approaches (0, 0.77).
The bottom part of Fig. 12.1 shows the coefficients for (9.5) with γ = 0.05, α =
1, and β = 6. We see that (ak , bk ) approaches (−1, 7), which are the values for the
LLSE.
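A minimal sketch of this online procedure, assuming Python with NumPy and that example (9.4) is the model Y = X + Z with E(X²) = 1 and E(Z²) = 0.3 (an assumption consistent with the stated LLSE):

import numpy as np

rng = np.random.default_rng(5)
alpha = 0.002
a, b = 0.0, 0.0

for k in range(200_000):
    X = rng.normal(0, 1.0)
    Z = rng.normal(0, np.sqrt(0.3))
    Y = X + Z
    err = X - a - b * Y                 # noisy estimate of the residual
    a += 2 * alpha * err                # stochastic gradient step in a
    b += 2 * alpha * err * Y            # stochastic gradient step in b

print(a, b)   # (a, b) approaches (0, 0.77)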
In this section, we explain the theory of the stochastic gradient algorithm that
we illustrated in the case of online regression. We start with a discussion of the
deterministic gradient projection algorithm.
Consider a smooth convex function on a convex set, such as a soup bowl. A
standard algorithm to minimize that function, i.e., to find the bottom of the bowl,
is the gradient projection algorithm. This algorithm is similar to going downhill by
making smaller and smaller jumps along the steepest slope. The projection makes
sure that one remains in the acceptable set. The step size of the algorithm decreases
over time so that one does not keep on overshooting the minimum.
The stochastic gradient projection algorithm is similar except that one has access
only to a noisy version of the gradient. As the step size gets small, the errors in
the gradient tend to average out and the algorithm converges to the minimum of the
function.
We first review the gradient projection algorithm and then discuss the stochastic
gradient projection algorithm.
Recall that a set C ⊂ ℝ^d is convex if λx + (1 − λ)y ∈ C whenever x, y ∈ C and λ ∈ [0, 1]. That is, C contains the line segment between any two of its points, so that there
are no holes or kinks in the set boundary (Fig. 12.2).
Also (see Fig. 12.3), recall that a function f : C → ℝ is a convex function if

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all x, y ∈ C and λ ∈ [0, 1].
The gradient projection (GP) algorithm computes

xn+1 = [xn − αn∇f(xn)]C.

Here,

∇f(x) := ((∂/∂x1)f(x), . . . , (∂/∂xd)f(x))

is the gradient of f(·) at x and [y]C indicates the closest point to y in C, also called
the projection of y onto C. The constants αn > 0 are called the step sizes of the
algorithm.
As a simple example, let f (x) = 6(x − 0.2)2 for x ∈ C := [0, 1]. The factor 6 is
there only to have big steps initially and show the necessity of projecting back into
the convex set. With αn = 1/n and x0 = 0, the algorithm is
xn+1 = [xn − (12/n)(xn − 0.2)]C.        (12.3)

Equivalently,

yn+1 = xn − (12/n)(xn − 0.2)        (12.4)
xn+1 = max{0, min{1, yn+1}}        (12.5)
with y0 = x0 .
As the Fig. 12.4 shows, when the step size is large, the update yn+1 falls outside
the set C and it is projected back into that set. Eventually, the updates fall into the
set C .
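A minimal sketch of the recursion (12.3)–(12.5), assuming Python:

# Gradient projection for f(x) = 6(x - 0.2)^2 on C = [0, 1], with step 1/n.
x = 0.0
for n in range(1, 26):
    y = x - (12.0 / n) * (x - 0.2)      # gradient step (12.4); f'(x) = 12(x - 0.2)
    x = max(0.0, min(1.0, y))           # projection onto [0, 1], as in (12.5)
    print(n, round(x, 4))
# Early iterates overshoot and are projected back; x then converges to 0.2.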
There are many known sufficient conditions that guarantee that the algorithm
converges to the unique minimizer of f (·) on C . Here is an example.
Theorem 12.1 Assume that f(x) is convex and differentiable on the convex set C,
that ||∇f(x)||² ≤ K for all x ∈ C, and that x∗ is the unique minimizer of f over C. Assume also that the step sizes satisfy αn > 0, Σn αn = ∞, and Σn α²n < ∞. Then

xn → x∗ as n → ∞.
Proof The idea of the proof is as follows. Let dn = ½||xn − x∗||². Fix ε > 0. One
shows that there is some n0(ε) so that, when n ≥ n0(ε),

dn+1 ≤ dn − γn, if dn ≥ ε        (12.9)
dn+1 ≤ 2ε, if dn < ε.        (12.10)

Moreover, in (12.9), γn > 0 and Σn γn = ∞.
It follows from (12.9) that, eventually, for some n = n1(ε) ≥ n0(ε), one has
dn < ε. But then, because of (12.9) and (12.10), dn < 2ε for all n ≥ n1(ε). Since
ε > 0 is arbitrary, this proves that xn → x∗.
To show (12.9) and (12.10), we first claim that

dn+1 ≤ dn + αn(x∗ − xn)^T ∇f(xn) + ½ α²n K.        (12.11)

To see this, note that

dn+1 = ½||[xn − αn∇f(xn)]C − x∗||²
     ≤ ½||xn − αn∇f(xn) − x∗||²        (12.12)
     ≤ dn + αn(x∗ − xn)^T ∇f(xn) + ½ α²n K.        (12.13)

The inequality in (12.12) comes from the fact that projection on a convex set is
non-expansive. That is,

||[x]C − [y]C|| ≤ ||x − y||.
Next, one uses the convexity of f(·): if dn ≥ ε, so that xn is bounded away from x∗, then

(x∗ − xn)^T ∇f(xn) ≤ −δ(ε)

for some δ(ε) > 0.
Combining this with (12.11) shows that, when dn ≥ ε,

dn+1 ≤ dn − αnδ(ε) + ½ α²n K.

Now, let

γn = αnδ(ε) − ½ α²n K.        (12.15)
There are many situations where one cannot measure directly the gradient ∇f (xn )
of the function. Instead, one has access to a random estimate of that gradient,
∇f (xn ) + ηn , where ηn is a random variable. One hopes that, if the error ηn is small
enough, GP still converges to x ∗ when one uses ∇f (xn ) + ηn instead of ∇f (xn ).
The point of this section is to justify this hope.
The algorithm is as follows (see Fig. 12.7):

xn+1 = [xn − αn gn]C,        (12.16)

where

gn = ∇f(xn) + zn + bn        (12.17)

is a noisy estimate of the gradient, with zn a zero-mean noise and bn a bias.
In this expression, the zn are i.i.d. U [−0.5, 0.5]. Figure 12.8 shows the values that
the algorithm produces.
This algorithm converges to the minimum x ∗ = 0.2 of the function, albeit slowly.
For the algorithm (12.16) and (12.17) to converge, one needs the estimation noise
zn and bias bn to be small. Specifically, one has the following result.

Theorem 12.2 Assume, in addition to the conditions of Theorem 12.1, that the bias terms satisfy bn → 0 and that the noise terms satisfy

E[zn+1 | z0, z1, . . . , zn] = 0;        (12.23)
E(||zn||²) ≤ A, n ≥ 0.        (12.24)

Then xn → x∗ as n → ∞.
Proof The proof is essentially the same as for the deterministic case.
The inequality (12.11) becomes

dn+1 ≤ dn + αn(x∗ − xn)^T [∇f(xn) + zn + bn] + ½ α²n K.        (12.25)
We discuss the theory of martingales in Sect. 15.9. Here are the ideas we needed in
the proof of Theorem 12.2.
Let {xn, yn, n ≥ 0} be random variables such that E(xn) is well-defined for all
n. The sequence xn is said to be a martingale with respect to {(xm, ym), m ≥ 0} if

E[xn+1 | xm, ym, m ≤ n] = xn, for all n ≥ 0.
The web makes it easy to collect a vast amount of data from many sources. Examples
include books, movie, and restaurants that people like, website that they visit, their
mobility patterns, their medical history, and measurements from sensors. This data
2 See the next section.
3 Recall that if a series Σn wn converges, then the tail Σ_{m≥n} wm of the series converges to zero as
n → ∞.
can be useful to recommend items that people will probably like, treatments that
are likely to be effective, people you might want to commute with, to discover
who talks to who, efficient management techniques, and so on. Moreover, new
technologies for storage, databases, and cloud computing make it possible to process
huge amounts of data. This section explains a few of the formulations of such
problems and algorithms to solve them (Fig. 12.9).
Many factors potentially affect an outcome, but what are the most relevant ones?
For instance, the success in college of a student is correlated with her high-school
GPA, her scores in advanced placement courses and standardized tests. How does
one discover the factors that best predict her success? A similar situation occurs for
predicting the odds of getting a particular disease, the likelihood of success of a
medical treatment, and many other applications.
Identifying these important factors can be most useful to improve outcomes. For
instance, if one discovers that the odds of success in college are most affected by the
number of books that a student has to read in high-school and by the number of hours
she spends playing computer games, then one may be able to suggest strategies for
improving the odds of success.
One formulation of the problem is that the outcome Y is correlated with a
collection of factors that we represent by a vector X with N ≫ 1 components.
For instance, if Y is the GPA after 4 years in college, the first component X1 of
X might indicate the high-school GPA, the second component X2 the score on a
specific standardized test, X3 the number of books the student had to write reports
on, and so on. Intuition suggests that, although N ≫ 1, only relatively few of the
components of X really affect the outcome Y in a significant way. However, we do
not want to presume that we know what these components are.
Say that you want to predict Y on the basis of six components of X. Which
ones should you consider? This problem turns out to be hard because there are
many (about N^6/6!) subsets with 6 elements in N = {1, 2, . . . , N}, and this
combinatorial aspect of the problem makes it intractable when N is large. To
228 12 Speech Recognition: B
make progress, we change the formulation slightly and resort to some heuristic
(Fig. 12.10).4
The change in formulation is to consider the problem of minimizing

J(b) = E((Y − Σn bnXn)²) + λ Σn |bn|.
This is called the LASSO problem, for “least absolute shrinkage and selection
operator.” Thus, the hard constraint on the number of components is replaced by a
cost for using large coefficients. Intuitively, the problem is still qualitatively similar.
Also, the constraint is such that the solution of the problem has many bn equal to
zero. Intuitively, if a component is less useful than others, its coefficient is probably
equal to zero in the solution.
One interpretation of this problem is as follows. In order to simplify the algebra,
we assume that Y and X are zero-mean. Assume that

Y = Σn BnXn + Z,
where Z is N (0, σ 2 ) and the coefficients Bn are random and independent with a
prior distribution of Bn given by
fn(b) = (λ/2) exp{−λ|b|}.

Then maximizing the posterior density of the coefficients given the observed data amounts to minimizing a cost of the form of J(b), so the LASSO solution can be viewed as a MAP estimate under this prior.
4 If
you cannot crack a nut, look for another one. (A difference between Engineering and
Mathematics?)
To get some intuition, consider first estimating Y from a single component Xn. One has

L[Y|Xn] = (cov(Y, Xn)/var(Xn)) Xn =: bnXn
and

E((Y − L[Y|Xn])²) = var(Y) − cov(Y, Xn)²/var(Xn) = var(Y) − |cov(Y, Xn)| × |bn|.
Thus, one unit of “cost” C(bn ) = |bn | invested in bn brings a reduction |cov(Y, Xn )|
in the objective J (bn ). It then makes sense to choose the first component with the
largest value of “reward per unit cost” |cov(Y, Xn )|. Say that this component is X1
and let Ŷ1 = L[Y |X1 ].
Second, assume that we stick to our choice of X1 with coefficient b1 and that we
look for a second component Xn with n = 1 to add to our estimate. Note that
E((Y − b1X1 − bnXn)²) = E((Y − b1X1)²) − 2bn cov(Y − b1X1, Xn) + b²n var(Xn).
The value of bn that minimizes this expression is

bn = cov(Y − b1X1, Xn)/var(Xn),

and the resulting mean squared error is

E((Y − b1X1)²) − cov(Y − b1X1, Xn)²/var(Xn).
Thus, as before, one unit of cost |bn| invested in Xn brings a reduction |cov(Y − b1X1, Xn)|
in the cost J (b1 , bn ). This suggests that the second component Xn to pick should be
the one with the largest covariance with Y − b1 X1 .
These observations suggest the following algorithm, called the stepwise regres-
sion algorithm. At each step k, the algorithm finds the component Xn that is
most correlated with the residual error Y − Ŷk , where Ŷk is the current estimate.
Specifically, the algorithm is as follows:

Step k + 1: Find n ∉ Sk that maximizes |E((Y − Ŷk)Xn)|;
let Sk+1 = Sk ∪ {n}, Ŷk+1 = L[Y | Xn, n ∈ Sk+1], and k = k + 1.
In practice, one does not know the joint distribution of (Y, X). Instead, one observes M samples {(Y^m, X^m), m = 1, . . . , M}; for instance, each sample may correspond to one
student in the college success example. From those samples, one can estimate the
mean values by the sample means. Thus, in step k, one has calculated coefficients
(b1, . . . , bk) and one estimates E((Y − Ŷk)Xn) by

(1/M) Σ_{m=1}^{M} (Y^m − Ŷ^m_k) X^m_n.
Reasonably accurate estimates can typically be obtained from a few thousand samples. Recall that one can use the sample moments to
compute confidence intervals for these estimates.
Signal processing uses a similar algorithm called matching pursuit introduced in
Mallat and Zhang (1993). In that context, the problem is to find a compact represen-
tation of a signal, such as a picture or a sound. One considers a representation of the
signal as a linear combination of basis functions. The matching pursuit algorithm
finds the most important basis functions to use in the representation.
An Example
Our example is very small, so that we can understand the steps. We assume that all
the random variables are zero-mean and that N = 3, with Z := (Y, X1, X2, X3)^T and

ΣZ = [ 4 3 2 2
       3 4 2 2
       2 2 4 1
       2 2 1 4 ].
We first try the stepwise regression. The component Xn most correlated with Y
is X1 . Thus,
Ŷ1 = L[Y|X1] = (cov(Y, X1)/var(X1)) X1 = (3/4) X1 =: b1X1.
The next step is to compute the correlations E(Xn(Y − Ŷ1)) for n = 2, 3. We find
that both are equal to 0.5, so either component can be added; say the algorithm selects X2 as the next
component. One then finds

Ŷ2 = L[Y|X1, X2] = [3  2] [4 2; 2 4]^{−1} [X1; X2] = (2/3) X1 + (1/6) X2.
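These two steps can be reproduced directly from ΣZ. A minimal sketch, assuming Python with NumPy and the ordering Z = (Y, X1, X2, X3):

import numpy as np

# Covariance of Z = (Y, X1, X2, X3), as in the example.
Sigma = np.array([[4., 3., 2., 2.],
                  [3., 4., 2., 2.],
                  [2., 2., 4., 1.],
                  [2., 2., 1., 4.]])
cov_Y_X = Sigma[0, 1:]        # cov(Y, Xn)
Sigma_X = Sigma[1:, 1:]       # covariance of (X1, X2, X3)

selected = []
for _ in range(2):            # two steps of stepwise regression
    resid_corr = cov_Y_X.copy()
    if selected:
        # cov(Y - Y_hat, Xn) = cov(Y, Xn) - cov(Y, X_S) Sigma_S^{-1} cov(X_S, Xn)
        coef = np.linalg.solve(Sigma_X[np.ix_(selected, selected)], cov_Y_X[selected])
        resid_corr = cov_Y_X - Sigma_X[:, selected] @ coef
    resid_corr[selected] = 0.0
    n = int(np.argmax(np.abs(resid_corr)))
    selected.append(n)
    print("selected X%d, residual correlations:" % (n + 1), resid_corr)

coef = np.linalg.solve(Sigma_X[np.ix_(selected, selected)], cov_Y_X[selected])
print("coefficients:", coef, "for", ["X%d" % (n + 1) for n in selected])   # [2/3, 1/6]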
Complex looking objects may have a simple hidden structure. For example, the
signal s(t) shown in Fig. 12.11 is the sum of three sine waves. That is,
s(t) = Σ_{i=1}^{3} bi sin(2πφit), t ≥ 0.        (12.27)
A classical result, called the Nyquist sampling theorem, states that one can
reconstruct a signal exactly from its values measured every T seconds, provided that
1/T is at least twice the largest frequency in the signal. According to that result, we
could reconstruct s(t) by specifying its value every T seconds if T < 1/(2φi ) for
i = 1, 2, 3. However, in the case of (12.27), one can describe s(t) completely by
specifying the values of the six parameters {bi , φi , i = 1, 2, 3}. Also, it seems clear
in this particular case that one does not need to know many sample values s(tk )
for different times tk to be able to reconstruct the six parameters and therefore the
signal s(t) for all t ≥ 0. Moreover, one expects the reconstruction to be unique if
we choose a few sampling times tk randomly. The same is true if the representation
is in terms of different functions, such as polynomials or wavelets.
This example suggests that if a signal has a simple representation in terms of
some basis functions (e.g., sine waves), then it is possible to reconstruct it exactly
from a small number of samples.
Computing the parameters of (12.27) from a number of samples s(tk ) is highly
nontrivial, so that the fact that it is possible does not seem very useful. However, a
slightly different perspective shows that the problem can be solved. Assume that we
have a collection of functions (Fig. 12.12)

gn(t) = sin(2πfnt), n = 1, . . . , N.

Assume also that the frequencies {φ1, φ2, φ3} in s(t) are in the collection {fn, n =
1, . . . , N}. We can then try to find the vector a = {an, n = 1, . . . , N} with the fewest nonzero components such that

s(tk) = Σ_{n=1}^{N} an gn(tk), for k = 1, . . . , K.
That is, one tries to find the most economical representation of s(t) as a linear
combination of functions in the collection.
Unfortunately, this problem is intractable because of the number of choices of
sets of nonzero coefficients an , a difficulty we already faced in the previous section.
The key trick is, as before, to convert the problem into a much easier one that retains
the main goal.
The new problem is as follows:
Minimize |an |
n
such that s(tk ) = an gn (tk ), for k = 1, . . . , K.
n
(12.28)
Theorem 12.4 (Exact Recovery from Random Samples) The signal s(t) can be
recovered exactly with a very high probability from K samples by solving (12.28) if
K ≥ C × B × log(N ).
In this expression, C is a small constant, B is the number of sine waves that make
up s(t), and N is the number of sine waves in the collection.
Note that this is a probabilistic statement. Indeed, one could be unlucky and
choose sampling times tk , where s(tk ) = 0 (see Fig. 12.11) and these samples
would not enable the reconstruction of s(t). More generally, the samples could be
chosen so that they do not enable an exact reconstruction. The theorem says that the
probability of poor samples is very small.
Thus, in our example, where B = 3, one can expect to recover the signal s(t)
exactly from about 3 log(100) ≈ 14 samples if N ≤ 100.
Problem (12.28) is equivalent to the following linear programming problem,
which implies that it is easy to solve:

Minimize Σn bn
such that s(tk) = Σn an gn(tk), for k = 1, . . . , K,
and −bn ≤ an ≤ bn, for n = 1, . . . , N.        (12.29)

As an illustration, assume that

s(t) = g10(t) + 2g12(t) + 3g16(t),

where fn = n/10.
The frequencies of the sine waves in the collection are 0.1, 0.2, . . . , 10. Thus,
the frequencies in s(t) are contained in the collection, so that perfect reconstruction
is possible as
s(t) = Σn an gn(t)
with a10 = 1, a12 = 2, and a16 = 3, and all the other coefficients an equal to zero.
The theory tells us that reconstruction should be possible with about 14 samples. We
choose 15 sampling times tk randomly and uniformly in [0, 1]. We then ask Python
to solve (12.29). The solution is shown in Fig. 12.13.
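A minimal sketch of this reconstruction, assuming Python with NumPy and SciPy's linprog; the variables are (a, b) with |an| ≤ bn, the equality constraints force the reconstruction to match the samples, and the signal and frequencies are those of the example above.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
freqs = np.arange(1, 101) / 10.0                    # collection f_n = n/10
t = rng.uniform(0, 1, 15)                           # 15 random sampling times
s = np.sin(2*np.pi*1.0*t) + 2*np.sin(2*np.pi*1.2*t) + 3*np.sin(2*np.pi*1.6*t)

N = len(freqs)
G = np.sin(2 * np.pi * np.outer(t, freqs))          # G[k, n] = g_n(t_k)

# Variables [a_1..a_N, b_1..b_N]; minimize sum(b) s.t. G a = s and -b <= a <= b.
c = np.concatenate([np.zeros(N), np.ones(N)])
A_eq = np.hstack([G, np.zeros((len(t), N))])
A_ub = np.vstack([np.hstack([np.eye(N), -np.eye(N)]),      #  a - b <= 0
                  np.hstack([-np.eye(N), -np.eye(N)])])    # -a - b <= 0
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * N),
              A_eq=A_eq, b_eq=s,
              bounds=[(None, None)] * N + [(0, None)] * N)
a = res.x[:N]
big = np.abs(a) > 1e-6
print(np.nonzero(big)[0] + 1, np.round(a[big], 3))  # in most runs: indices 10, 12, 16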
Another Example
Figure 12.14, from Candes and Romberg (2007), shows another example. The image
on top has about one million pixels. However, it can be represented as a linear
combination of 25,000 functions called wavelets. Thus, the compressed sensing
results tell us that one should be able to reconstruct the picture exactly from a small
multiple of 25,000 randomly chosen pixels. It turns out that this is indeed the case
with about 96,000 pixels.
Which movie would you like to watch? One formulation of the problem is as
follows. There is a K × N matrix Y . The entry Y (k, n) of the matrix indicates
how much user k likes movie n. However, one does not get to observe the complete
matrix. Instead, one observes a number of entries, when users actually watch movies
and one gets to record their rankings. The problem is to complete the matrix to be
able to recommend movies to users.
This matrix completion is based on the idea that the entries of the matrix are
not independent. For instance, assume that Bob and Alice have seen the same five
movies and gave them the same ranking. Assume that Bob has seen another movie
he loved. Chances are that Alice would also like it.
To formulate this dependency of the entries of the matrix Y , one observes that
even though there are thousands of movies, a few factors govern how much users
like them. Thus, it is reasonable to expect that many columns of the matrix are
combinations of a few common vectors that correspond to the hidden factors that
influence the rankings by users. Thus, a few independent vectors get combined
into linear combinations that form the columns. Consequently the matrix Y has a
small number of linearly independent columns, i.e., it is a low rank matrix.5 This
observation leads to the question of whether one can recover a low rank matrix Y
from observed entries?
One possible formulation is

Minimize rank(X)
such that X(k, n) = M(k, n), for (k, n) ∈ Ω.

Here, {M(k, n), (k, n) ∈ Ω} is the set of observed entries of the matrix. Thus, one
wishes to find the lowest-rank matrix X that is consistent with the observed entries.
As before, such a problem is hard. To simplify the problem, one replaces the rank
by the nuclear norm

||X||∗ = Σi σi,

where the σi are the singular values of the matrix X.
the number of nonzero singular values. The nuclear norm is a convex function of
the entries of the matrix, which makes the problem a convex programming problem
that is easy to solve. Remarkably, as in the case of compressed sensing, the solution
of the modified problem is very good.
Theorem 12.5 (Exact Matrix Completion from Random Entries) The solution
of the problem

Minimize ||X||∗ such that X(k, n) = M(k, n), for (k, n) ∈ Ω

is the matrix Y with a very high probability if the observed entries are chosen
uniformly at random and if there are at least

C n^{1.25} r log(n)

of them, where r is the rank of Y and n = max{K, N}.
This result is useful in many situations where this number of required observa-
tions is much smaller than K ×N, which is the number of entries of Y . The reference
contains many extensions of these results and details on numerical solutions.
Deep neural networks (DNN) are electronic processing circuits inspired by the
structure of the brain. For instance, our vision system consists of layers. The first
layer is in the retina that captures the intensity and color of zones in our field of
vision. The next layer extracts edges and motion. The brain receives these signals
and extracts higher level features. A simplistic model of this processing is that the
neurons are arranged in successive layers, where each neuron in one layer gets
inputs from neurons in the previous layer through connections called synapses.
Presumably, the weights of these connections get tuned as we grow up and learn
to perform tasks, possibly by trial and errors. The figure sketches a DNN. The
inputs at the left of the DNN are the features X from which the system produces
the probability that X corresponds to a dog, or the estimate of some quantity
(Fig. 12.15).
Each circle is a circuit that we call a neuron. In the figure, zk is the output of
neuron k. It is multiplied by θk to contribute the quantity θk zk to the total input Vl of
neuron l. The parameter θk represents
the strength of the connection between neuron
k and neuron l. Thus, Vl = Σn θnzn, where the sum is over all the neurons n of the
layer to the immediate left of neuron l, including neuron k. The output zl of neuron
l is equal to f (al , Vl ), where al is a parameter specific to that neuron and f is some
function that we discuss later.
With this structure, it is easy to compute the derivative of some output Z with
respect to some weight, say θk . We do it in the last section of this chapter.
What should be the functions f (a, V )? Inspired by the idea that a neuron fires if
it is excited enough, one may use a function f (a, V ) that is close to 1 if V > a and
close to −1 if V < a. To make the function differentiable, one may use f (a, V ) =
g(V − a) with
g(v) = 2/(1 + e^{−βv}) − 1,
where β is a positive constant. If β is large, then e−βv goes from a very large to a
very small value when v goes from negative to positive. Consequently, g(v) goes
from −1 to +1 (Fig. 12.16).
The DNN is able to model many functions by adjusting its parameters. To see
why, consider neuron l. The output of this neuron indicates whether the linear
combination Vl = Σn θnzn is larger or smaller than the thresholds al of the
neurons. Consequently, the first layer divides the set of inputs into regions separated
by hyperplanes. The next layer then further divides these regions. The number of
regions that can be obtained by this process is exponential in the number of layers.
The final layer then assigns values to the regions, thus approximating a complex
function of the input vector by an almost piecewise constant function.
The missing piece of the puzzle is that, unfortunately, the cost function is not a
nice convex function of the parameters of the DNN. Instead, it typically has many
local minima. Consequently, by using the SGD algorithm, the tuning of the DNN
may get stuck in a local minimum. Also, to reduce the number of parameters to
tune, one usually selects a few layers with fixed parameters, such as edge detectors
in vision systems. Thus, the selection of the DNN becomes somewhat of an art, like
cooking.
Thus, it remains impossible to predict whether the DNN will be a good technique
for machine learning in a specific application. The answer of the practitioners is to
try and see. If it works, they publish a paper. We are far from the proven convergence
results of adaptive systems. Ah, nostalgia. . . .
There is a worrisome aspect to these black-box approaches. When the DNN
has been tuned and seems to perform well on many trials, not only one does
not understand what it really does, but one has no guarantee that it will not
seriously misbehave for some inputs. Imagine then a killer drone with a DNN target
recognition system. . . . It is not surprising that a number of serious scientists have
raised concerns about “artificial stupidity” and the need to build safeguards into such
systems. “Open the pod bay doors, Hal.”
For instance, if neuron k feeds neuron l, which feeds neuron m, which feeds the output neuron r that produces Z, the chain rule gives

dZ/dθk = zk fV(al, Vl) θl fV(am, Vm) θm fV(ar, Vr),

where fV(a, V) := ∂f(a, V)/∂V.
The details do not matter too much. The point is that the structure of the network
makes the calculation of the derivatives straightforward.
12.5 Summary
12.6 References
Online linear regression algorithms are discussed in Strehl and Littman (2007).
The book Bertsekas and Tsitsiklis (1989) is an excellent presentation of dis-
tributed optimization algorithms. It explains the gradient projection algorithm and
distributed implementations. The LASSO algorithm and many other methods are
clearly explained in Hastie et al. (2009), together with applications. The theory of
martingales is nicely presented by its father in Doob (1953). Theorem 12.4 is from
Candes and Romberg (2007).
12.7 Problems
Problem 12.1 Let {Yn , n ≥ 1} be i.i.d. U [0, 1] random variables and {Zn , n ≥ 1}
be i.i.d. N (0, 1) random variables. Define Xn = 1{Yn ≥ a} + Zn for some constant
a. The goal of the problem is to design an algorithm that “learns” the value of a
from the observation of pairs (Xn , Yn ). We construct a model
Xn = g(Yn − θ ),
where
g(u) = 1/(1 + exp{−λu})        (12.31)
with λ = 10. Note that when u > 0, the denominator of g(u) is close to 1, so
that g(u) ≈ 1. Also, when u < 0, the denominator is large and g(u) ≈ 0. Thus,
g(u) ≈ 1{u ≥ 0}. The function g(·) is called the logistic function. Use SGD in
Python to estimate θ (Fig. 12.17).
where you choose sampling times tk independently and uniformly in [0, 1]. Assume
that the collection of sine waves has the frequencies {0.1, 0.2, . . . , 3}.
What is the minimum number of samples that you need for exact reconstruction?
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Route Planning: A
13
13.1 Model
One is given a finite connected directed graph. Each edge (i, j ) is associated with a
travel time T (i, j ). The travel times are independent and have known distributions.
There are a start node s and a destination node d. The goal is to choose a fast route
from s to d. We consider a few different formulations (Fig. 13.1).
To make the situation concrete, we consider the very simple example illustrated
in Fig. 13.2.
The goal is to choose the fastest path from s to d. In this example, the possible
paths are sd, sad, and sabd. We assume that the delays T (i, j ) on the edges (i, j )
are as follows:

T(s, a) =D U[5, 13], T(a, d) = 10, T(a, b) =D U[2, 10], T(b, d) = 4, T(s, d) = 20.
Thus, the delay from s to a is uniformly distributed in [5, 13], the delay from a to
d is equal to 10, and so on. The delays are assumed to be independent, which is an
unrealistic simplification.
In this formulation, one does not observe anything and one plans the journey ahead
of time. In this case, the solution is to look at the average travel times E(T (i, j )) =
c(i, j ) and to run a shortest path algorithm.
For our example, the average delays are c(s, a) = 9, c(a, d) = 10, and so on, as
shown in the top part of Fig. 13.3.
Let V (i) be the minimum average travel time from node i to the destination d.
The Bellman–Ford Algorithm calculates these values as follows. Let Vn (i) be an
estimate of the shortest average travel time from i to d, as calculated after the n-th
iteration of the algorithm. The algorithm starts with V0 (d) = 0 and V0 (i) = ∞ for
i ≠ d. Then, the algorithm calculates

Vn+1(i) = min_{j: (i,j) is an edge} {c(i, j) + Vn(j)}, i ≠ d, Vn+1(d) = 0.        (13.1)
The interpretation is that Vn (i) is the minimum expected travel time from i to d
over all paths that go through at most n edges. The distance is infinite if no path
with at most n edges reaches the destination d. This is exactly the same algorithm
we discussed in Sect. 11.2 to develop the Viterbi algorithm.
These relations are justified by the fact that the mean value of a sum is the sum of
the mean values. For instance, say that the minimum average travel time from a to
d using a path that has at most 2 edges is V2 (a, d) and it corresponds to a path with
random travel time W2 (a, d). Then, the minimum average travel time from s to d
using a path that has at most 3 edges follows either the direct path sd, that has travel
time T (s, d), or the edge sa followed by the fastest path from a to d that uses at most
2 edges with travel time W2 (a, d). Accordingly, the minimum expected travel time
V3 (s) from s to d using at most three edges is the minimum of E(T (s, d)) = c(s, d)
and the mean value of T(s, a) + W2(a, d). Thus, V3(s) = min{ c(s, d), c(s, a) + V2(a) }. In the limit, the values V(i) satisfy
V(i) = min_j { c(i, j) + V(j) }, for i ≠ d, with V(d) = 0.   (13.2)
These are called the dynamic programming equations (DPE). Thus, (13.1) is an algorithm for solving (13.2).
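As a concrete illustration, here is a minimal Python sketch of the iteration (13.1) for the example of Fig. 13.2, using the mean delays c(i, j) listed above; the dictionary and variable names are only illustrative.

# Sketch of the value iteration (13.1) for the pre-planning formulation.
# The mean delays c(i, j) follow the example of Fig. 13.2.
import math

c = {('s', 'a'): 9, ('s', 'd'): 20, ('a', 'b'): 6, ('a', 'd'): 10, ('b', 'd'): 4}
nodes = ['s', 'a', 'b', 'd']

V = {i: (0 if i == 'd' else math.inf) for i in nodes}          # V_0
for _ in range(len(nodes) - 1):                                 # paths use at most 3 edges
    V = {i: 0 if i == 'd' else
            min((c[i, j] + V[j] for j in nodes if (i, j) in c), default=math.inf)
         for i in nodes}

print(V['s'])   # 19: the pre-planned route s-a-d with average delay 9 + 10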
13.3 Formulation 2: Adapting
We now assume that when we get to a node i, we see the actual travel times along
the edges out of i. However, we do not see beyond those edges. How should we
modify our path planning? If the travel times are in fact deterministic, then nothing
changes. However, if they are random, we may notice that the actual travel times on
some edges out of i are smaller than their mean value, whereas others may be larger.
Clearly, we should use that information.
Here is a systematic procedure for calculating the best path. Let V (i) be the
minimum average time to get to d starting from node i, for i ∈ {s, a, b, d}. We see
that V (b) = T (b, d) = 4.
To calculate V (a), define W (a) to be the minimum expected time from a to d
given the observed delays along the edges out of a. That is,
W(a) = min{ T(a, d), T(a, b) + V(b) }, and V(a) = E[W(a)].
For this example, we see that T (a, b) + V (b) =D U [6, 14]. Since T (a, d) = 10, if
T (a, b) + V (b) < 10, which occurs with probability 1/2, we choose the path abd
that has a travel time uniformly distributed in [6, 10] with a mean value 8. Also,
if T (a, b) + V (b) > 10, then we choose the travel time T (a, d) = 10, also with
probability 1/2. Thus, the minimum expected travel time V (a) from a to d is equal
to 8 with probability 1/2 and to 10 with probability 1/2, so that its average value is
8(1/2) + 10(1/2) = 9. Hence, V (a) = 9.
Similarly,
V(s) = E[ min{ T(s, d), T(s, a) + V(a) } ],
where T (s, a) + V (a) =D U [14, 22] and T (s, d) = 20. Thus, if T (s, a) + V (a) <
20, which occurs with probability (20 − 14)/(22 − 14) = 3/4, then we choose a
path that goes from s to a and has a delay that is uniformly distributed in [14, 20],
with mean value 17. If T (s, a) + V (a) > 20, which occurs with probability 1/4, we
choose the direct path sd that has delay 20. Hence V (s) = 17(3/4) + 20(1/4) =
71/4 = 17.75.
Note that by observing the delays on the next edges and making the appropriate
decisions, we reduce the expected travel time from s to d from 19 to 17.75. Not
surprisingly, more information helps. Observe also that the decisions we make
depend on the observed delays. For instance, starting in node s, we go along edge sd
if T (s, a) + V (a) > T (s, d), i.e., if T (s, a) + 9 > 20, or T (s, a) > 11. Otherwise,
we follow the edge sa.
Let us now go back to the general model. The key relationships are as follows:
V(i) = E[ min_j { T(i, j) + V(j) } ], for i ≠ d, with V(d) = 0.   (13.4)
The interpretation is simple: starting from i, one can choose to go next to j . In that
case, one faces a travel time T (i, j ) from i to j and a subsequent minimum average
time from j to d equal to V (j ). Since the path from i to d must necessarily go to a
next node j , the minimum expected travel time from i to d is given by the expression
above. As before, these equations are justified by the fact that the expected value of
a sum is the sum of the expected values.
An algorithm for solving these fixed-point equations is
V_{n+1}(i) = E[ min_j { T(i, j) + V_n(j) } ], n ≥ 0,   (13.5)
where V0 (i) = 0 for all i. The interpretation of Vn (i) is the same as before: it is the
minimum expected time from i to d using a path with at most n edges, given that at
each step along the path one observes the delays along the edges out of the current
node.
Equations (13.4) are the stochastic dynamic programming equations for the
problem. Equations (13.5) are called the value iteration equations.
13.4 Markov Decision Problem
A more general version of the path planning problem is the control of a Markov
chain. At each step, one looks at the state and one chooses an action that determines
the transition probabilities and also the cost for the next step.
More precisely, to define a controlled Markov chain X(n) on some state space
X , one specifies, for each x ∈ X , a set A(x) of possible actions. For each state
x ∈ X and each action a ∈ A(x), one has transition probabilities P(x, x′; a) ≥ 0 with Σ_{x′∈X} P(x, x′; a) = 1. One also specifies a cost c(x, a) of taking the action
a when in state x.
The sequence X(n) is then defined by
P[X(n + 1) = x′ | X(n) = x, a(n) = a, X(m), a(m), m < n] = P(x, x′; a).
The goal is to choose the actions to minimize the average total cost
E[ Σ_{m=0}^{n} c(X(m), a(m)) | X(0) = x ].   (13.6)
This minimization can be performed by the backward recursion
V_m(x) = min_{a∈A(x)} { c(x, a) + Σ_y P(x, y; a) V_{m−1}(y) }, m ≥ 1, with V_0(x) = min_{a∈A(x)} c(x, a).   (13.7)
Let a = g_m(x) be the value of a ∈ A(x) that achieves the minimum in (13.7). Then the choices a(m) = g_{n−m}(X(m)) achieve the minimum of (13.6).
The existence of the minimizing a in (13.7) is clear if X and each A(x) are finite
and also under weaker assumptions.
13.4.1 Examples
Guess a Card
Here is a simple example. One is given a perfectly shuffled deck of 52 cards. The
cards are turned over one at a time. Before one turns over a new card, you have the
option of saying “Stop.” If the next card is an ace, you win $1.00. If not, the game
stops and you lose. The problem is for you to decide when to stop (Fig. 13.4).
Assume that there are still x aces in a deck with m remaining cards. Then, if you
say stop, you win with probability x/m. If you do not say stop, then after the next
card is turned over, x − 1 aces remain with probability x/m and x remain otherwise.
Let V (m, x) be the maximum expected probability that you win if there are still
x aces in the deck with m remaining cards.
The DPE are
V(m, x) = max{ x/m, (x/m) V(m − 1, x − 1) + ((m − x)/m) V(m − 1, x) }.
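Here is a short Python sketch that solves this recursion; the base cases (no aces left, or only aces left) are the natural ones, and the printed value is the winning probability for a full deck.

# Sketch of the "guess a card" dynamic program.  V(m, x) is the maximum
# probability of winning when x aces remain among m cards.
from functools import lru_cache

@lru_cache(maxsize=None)
def V(m, x):
    if x == 0:
        return 0.0              # no ace left: you cannot win
    if x == m:
        return 1.0              # only aces remain: say "stop" and win
    stop = x / m
    cont = (x / m) * V(m - 1, x - 1) + ((m - x) / m) * V(m - 1, x)
    return max(stop, cont)

print(V(52, 4))   # 4/52, about 0.0769: every stopping rule does equally well here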
Scheduling Jobs
You have two sets of jobs to perform. Jobs of type i (for i = 1, 2) have a waiting
cost equal to ci per unit of waiting time until they are completed. Also, when you
work on a job of type i, it completes with probability μi in the next time unit,
independently of how long you have worked on it. That is, the job processing times
are geometrically distributed with parameter μi . The problem is to decide which job
to work on to minimize the total waiting cost of the jobs.
Let V (x1 , x2 ) be the minimum expected total remaining waiting cost given that
there are x1 jobs of type 1 and x2 jobs of type 2. The DPE are
where
and
As can be verified directly, the solution of the DPE is as follows. Assume that
c1 μ1 > c2 μ2 . Then
V(x1, x2) = c1 x1(x1 + 1)/(2μ1) + c2 x2(x2 + 1)/(2μ2) + c2 x1 x2/μ1.
Moreover, this minimum expected cost is achieved by performing all the jobs of type
1 first and then the jobs of type 2. This strategy is called the cμ rule. Thus, although
one might be tempted to work on the longest queue first, this is not optimal.
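Here is a quick numerical check of this claim, under the simplifying assumption that the server keeps working on a job until it completes (which costs nothing here, by memorylessness, since the expected completion time of a type-i job is 1/μi). The parameter values are illustrative.

# Numerical check of the cμ-rule formula for V(x1, x2).  All waiting jobs pay
# their holding cost during the (geometric) service of the job being worked on.
from functools import lru_cache

c1, c2, mu1, mu2 = 3.0, 1.0, 0.5, 0.7      # c1*mu1 = 1.5 > c2*mu2 = 0.7

@lru_cache(maxsize=None)
def V(x1, x2):
    if x1 == 0 and x2 == 0:
        return 0.0
    rate_cost = c1 * x1 + c2 * x2           # holding cost per unit of time
    options = []
    if x1 > 0:
        options.append(rate_cost / mu1 + V(x1 - 1, x2))
    if x2 > 0:
        options.append(rate_cost / mu2 + V(x1, x2 - 1))
    return min(options)

def formula(x1, x2):
    return (c1 * x1 * (x1 + 1) / (2 * mu1) + c2 * x2 * (x2 + 1) / (2 * mu2)
            + c2 * x1 * x2 / mu1)

print(V(5, 7), formula(5, 7))   # the two values agree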
There is a simple interchange argument to confirm the optimality of the cμ rule.
Say that you decide to work on the jobs in the following order: 1221211. Thus, you
work on a job of type 1 until it completes, then a job of type 2, then another job of
type 2, and so on. Modify the strategy as follows. Instead of working on the second
job of type 2, work on the second job of type 1, until it completes. Then work on the
second job of type 2 and continue as you would have. Thus, the processings of two
jobs have been interchanged: the second job of type 2 and the second job of type
1. Only the waiting times of these two jobs change. The waiting time of the job of
type 1 is reduced by 1/μ2 , on average, since this is the average completion time of
the job of type 2 that was previously processed before the job of type 1. Thus, the
waiting cost of the job of type 1 is reduced by c1 /μ2 . Similarly, the waiting cost of
the job of type 2 is increased by c2 /μ1 , on average. Thus, the average cost decreases
by c1 /μ2 − c2 /μ1 which is a positive amount since c1 μ1 > c2 μ2 . By induction, it
is optimal to process all the jobs of type 1 first.
Of course, there are very few examples of control problems where the optimal
policy can be proved by a simple argument. Nevertheless, keep this possibility
in mind because it can yield elegant results simply. For instance, assume that
jobs arrive at the queues shown in Fig. 13.5 according to independent Bernoulli
processes. That is, with probability λi , a job of type i arrives during each time step,
independently of the past, for i = 1, 2. The same interchange argument shows that
the cμ rule minimizes the long-term average expected waiting cost of the jobs (a
cost that we have not defined, but you may be able to imagine what it means).
This is useful because the DPE can no longer be solved explicitly and proving the
optimality of this rule analytically is quite complicated.
Hiring a Helper
Jobs arrive at random times and you must decide whether to work on them yourself
or hire some helper. Intuition suggests that you should get some help if the backlog
of jobs to be performed exceeds some threshold. We examine a model of this
situation.
At time n = 0, 1, . . ., a job arrives with probability λ ∈ (0, 1). If you work alone,
you complete a job with probability μ ∈ (0, 1) in one time unit, independently of
the past. If you hire a helper, then together you complete a job with probability
αμ ∈ (0, 1) in one unit of time, where α > 1. Let the cost at time n be c(n) = β > 0
if you hire a helper at time step n and c(n) = 0 otherwise. The goal is to minimize
E[ Σ_{n=0}^{N} (X(n) + c(n)) ],
where X(n) is the number of jobs yet to be processed at time n. This cost measures
the waiting cost of the jobs plus the cost of hiring the helper. The waiting cost is
minimized if you hire the helper all the time and the helper cost is minimized if you
never hire him. The goal of the problem is to figure out when to hire a helper to
achieve the best trade-off between these two costs.
The state of the system is X(n) at time n. Let
V_m(x) = min E[ Σ_{n=0}^{m} (X(n) + c(n)) | X(0) = x ],
where the minimum is over the possible choices of actions (hiring or not) that
depend on the state up to that time. The stochastic dynamic programming equations
are
where we defined μ(0) = μ and μ(1) = αμ and V−1 (x) = 0. Also, we limit the
backlog of jobs to K, so that if one job arrives where there are already K, we discard
the new arrival.
We solve these equations using Python. As expected, the solution shows that one
should hire a helper at time n if X(n) > γ (N − n), where γ (m) is a constant that
decreases with m. As the time to go m increases, the cost of holding extra jobs
increases and so does the incentive to hire a helper. Figure 13.6 shows the values of
γ (n) for β = 14 and β = 20. The figure corresponds to λ = 0.5, μ = 0.6, α =
1.5, K = 20, and N = 200. Not surprisingly, when the helper is more expensive,
one waits until the backlog is larger before hiring him.
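Here is a minimal sketch of the backward recursion just described. The order "departure, then arrival" within a slot is one simple convention, not necessarily the one used for Fig. 13.6; the parameter values are those stated above.

# Value iteration for the hiring problem: per-step cost x + beta*a, service
# probability mu without the helper and alpha*mu with the helper, backlog capped at K.
lam, mu, alpha, beta = 0.5, 0.6, 1.5, 14.0
K, N = 20, 200
serv = {0: mu, 1: alpha * mu}

V_prev = [0.0] * (K + 1)               # V_{-1}(x) = 0
gamma = []                             # gamma[m]: smallest backlog at which hiring is optimal
for m in range(N + 1):
    V, threshold = [0.0] * (K + 1), K + 1
    for x in range(K + 1):
        costs = []
        for a in (0, 1):
            nxt = 0.0
            outcomes = [(serv[a], x - 1), (1 - serv[a], x)] if x > 0 else [(1.0, 0)]
            for p_dep, y in outcomes:
                nxt += p_dep * (lam * V_prev[min(y + 1, K)] + (1 - lam) * V_prev[y])
            costs.append(x + beta * a + nxt)
        V[x] = min(costs)
        if costs[1] < costs[0] and x < threshold:
            threshold = x
    gamma.append(threshold)
    V_prev = V

print(gamma[-1])   # hire roughly when the backlog exceeds this level (time to go N)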
(x1 + 1)/μ1 < (x2 + 1)/μ2,
as this will minimize the expected time until you are served. However, if we consider
the problem of minimizing the total average waiting time of customers in the two
queues, we find that the optimal policy does not agree with the selfish choice of
individual customers. Figure 13.7 shows an example with μ2 < μ1 . It indicates that
under the socially optimal policy some customers should join queue 2, even though
they will then incur a longer delay than under the selfish policy.
This example corresponds to minimizing the total cost
Σ_{n=0}^{N} β^n E(X1(n) + X2(n)).
The problem of minimizing (13.6) involves a finite horizon. The problem stops at
time n. We have seen that the minimum cost to go when there are m more steps is
Vm (x) when in state x. Thus, not surprisingly, the cost to go depends on the time to
go and, consequently, the best action to choose in a given state x generally depends
on the time to go.
The problem is simpler when one considers an infinite horizon because the time
to go remains the same at each step. To make the total cost finite, one discounts the
future costs. That is, one considers the problem of minimizing the expected total
discounted cost:
∞
E β c(X(m), a(m))|X(0) = x .
m
(13.8)
m=0
In this expression, 0 < β < 1 is the discount rate. Intuitively, if β is small, then
future costs do not matter much and one tends to be short-sighted. However, if β is
close to 1, then one pays a lot of attention to the long term.
Define V (x) to be the minimum value of the cost (13.8), where the minimum is
over all the possible choices of the actions at each step. Arguing as before, one can
show that
V(x) = min_{a∈A(x)} { c(x, a) + β Σ_y P(x, y; a) V(y) }.   (13.9)
These equations are similar to (13.7), with two differences: the discount factor
and the fact that the value function does not depend on time. Note that these
equations are fixed-point equations. A standard method to solve them is to consider
the equations
V_{n+1}(x) = min_{a∈A(x)} { c(x, a) + β Σ_y P(x, y; a) V_n(y) }, n ≥ 0,   (13.10)
where one chooses V0 (x) = 0, ∀x. Note that these equations correspond to
V_n(x) = min E[ Σ_{m=0}^{n} β^m c(X(m), a(m)) | X(0) = x ].   (13.11)
One can show that the solution Vn (x) of (13.10) is such that Vn (x) → V (x) as
n → ∞, where V (x) is the solution of (13.9).
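Here is a generic Python sketch of the value iteration (13.10) for a finite state and action space; the two-state "machine repair" example used to exercise it is purely illustrative.

# Value iteration (13.10) for a discounted MDP.  P[x][a] maps next states to
# P(x, y; a); c[x][a] is the one-step cost.
def value_iteration(states, actions, P, c, beta, n_iter=500):
    V = {x: 0.0 for x in states}
    for _ in range(n_iter):
        V = {x: min(c[x][a] + beta * sum(p * V[y] for y, p in P[x][a].items())
                    for a in actions[x])
             for x in states}
    return V

states = ['good', 'bad']
actions = {'good': ['wait', 'repair'], 'bad': ['wait', 'repair']}
P = {'good': {'wait': {'good': 0.9, 'bad': 0.1}, 'repair': {'good': 1.0}},
     'bad':  {'wait': {'bad': 1.0},              'repair': {'good': 1.0}}}
c = {'good': {'wait': 0.0, 'repair': 5.0}, 'bad': {'wait': 2.0, 'repair': 5.0}}

print(value_iteration(states, actions, P, c, beta=0.9))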
13.6 Summary
13.7 References
13.8 Problems
Problem 13.1 Consider a single queue with one server in discrete time. At each
time, a new customer arrives to the queue with probability λ < 1, and if the server
works on the queue at rate μ ∈ [0, 1], it serves one customer in one unit of time
with probability μ. Due to energy constraints, you want your server to work at
the smallest rate possible without making the queue unstable. Thus, you want
your server to work at rate μ∗ = λ. Unfortunately, you do not know the value of
λ. All you can observe is the queue length. We try to design an algorithm based on
stochastic gradient to learn μ∗ in the following steps:
(a) Minimize the function V(μ) = (1/2)(λ − μ)² over μ using gradient descent.
(b) Find E[Q(n+1)−Q(n)|Q(n) = q], for some q > 0, given that server allocates
capacity μn during time slot n. Q(n) is the queue length at time n. What happens
if q = 0?
(c) Use the stochastic gradient projection algorithm and write a Python code based
on parts (a) and (b) to learn μ∗ . Note that 0 ≤ μ ≤ 1.
Hint To avoid the case when the queue length is 0, start with a large initial queue
length.
Problem 13.2 Consider a routing network with three nodes: the start node s, the
destination node d, and an intermediate node r. There is a direct path from s to d
with travel time 20. The travel time from s to r is 7. There are two paths from r to
d. They have independent travel times that are uniformly distributed between 8 and
20.
Problem 13.3 Consider a single queue in discrete time with Bernoulli arrival
process of rate λ. The queue can hold K jobs, and there is a fee γ when its
backlog reaches K. There is one server dedicated to the queue with service rate
μ(0). You can decide to allocate another server to the queue that increases the rate
to μ(1) ∈ (μ(0), 1). However, using the additional server has some cost. You want
to minimize the cost
Σ_{n=0}^{∞} β^n E( X(n) + αH(n) + γ 1{X(n) = K} ),
where H (n) is equal to one if you use an extra helper at time n and is zero otherwise.
Problem 13.4 We want to plan routing from node 1 to 5 in the graph of Fig. 13.8.
The travel times on the edges of the graph are as follows: T (1, 2) = 2, T (1, 3) ∼
U [2, 4], T (2, 4) = 1, T (2, 5) ∼ U [4, 6], T (4, 5) ∼ U [3, 5], and T (3, 5) = 4. Note
that X ∼ U [a, b] means X is a random variable uniformly distributed between a
and b.
(a) If you want to do pre-planning, which path would you choose? What is the
expected travel time?
(b) Now suppose that at each node, the travel times of two steps ahead are revealed.
Thus, at node 1 all the travel times are revealed except T (4, 5). Write the
dynamic programming equations that solve the route planning problem and
solve them. That is, let V (i) be the minimum expected travel time from i to
5, for 1 ≤ i ≤ 5. Find V(i) for 1 ≤ i ≤ 5.
Problem 13.5 Consider a factory, DilBox, that stores boxes. At the beginning of
year k, they have xk boxes in storage. Now at the end of every year k they are
mandated by contracts to provide dk boxes. However, the number of boxes dk is
unknown until the year actually ends.
At the beginning of the year, they can request uk boxes. Using very shoddy
Elbonian labor, each box costs A to produce. At the end of the year DilBox
is able to borrow yk boxes from BoxR’Us at the cost s(yk ) to meet the contract.
The boxes remaining after meeting the demand are carried over to the next year
xk+1 = xk + uk + yk − dk . Sadly, they need to pay to store the boxes at a cost given
by a function r(xk+1 ).
Now your job is to provide a box creation and storage plan for the upcoming 20
years. Your goal is to minimize the total cost for the 20 years. You can treat costs
as being paid at the end of the year and there is no inflation. Also, you get your
pension after 20 years so you do not care about costs beyond those paid in the 20th
year. (Assume you start with zero boxes; of course, it does not really matter.) Use the following data:
– r(xk ) = 5xk ;
– s(yk ) = 20yk ;
– A = 1;
– dk =D U {1, . . . , 10}.
Problem 13.6 Consider a video game duel where Bob starts at time 0 at distance
T = 10 from Alice and gets closer to her at speed 1. For instance, Alice is at location
(0, 0) in the plane and Bob starts at location (0, T ) and moves toward Alice, so that
after t seconds, Bob is at location (0, T − t). Alice has picked a random time,
uniformly distributed in [0, T ], when she will shoot Bob. If Alice shoots first, Bob
is dead. Alice never misses. [This is only a video game.]
(a) Bob has to find at what time t he should shoot Alice to maximize the probability
of killing her. If Bob shoots from a distance x, the probability that he hits (and
kills) Alice is 1/(1 + x)2 . Bob has only one bullet.
(b) What is the maximum probability that Bob wins the duel?
(c) Assume now that Bob has two bullets. You must find the times t1 and t2 when
Bob should shoot Alice to maximize the probability that he wins the duel. Again,
for each bullet that Bob shoots from distance x, the probability of success is
1/(1 + x)2 , independently for each bullet.
Problem 13.7 You play a game where you win the amount you bet with probability
p ∈ (0, 0.5) and you lose it with probability 1 − p. Your initial fortune is 16 and
you gamble a fixed amount γ at each step, where γ ∈ {1, 2, 4, 8, 16}. Find the
probability that you reach a fortune equal to 256 before you go broke. What is the
gambling amount that maximizes that probability?
Route Planning: B
14
14.1 LQG Control
Consider the linear system
X(n + 1) = a X(n) + U(n) + V(n), n ≥ 0.   (14.1)
Here, X(n) is the state, U(n) is a control value, and V(n) is the noise. We assume that the random variables V(n) are i.i.d. and N(0, σ²).
The problem is to choose, at each time n, the control value U(n) based on the observed state values up to time n to minimize the expected cost
E[ Σ_{n=0}^{N} ( X(n)² + β U(n)² ) | X(0) = x ].   (14.2)
Thus, the goal of the control is to keep the state value close to zero, and one pays a
cost for the control.
The problem is then to trade-off the cost of a large value of the state and that of
the control that can bring the state back close to zero. To get some intuition for the
solution, consider a simple form of this trade-off: minimizing
(ax + u)² + β u².
In this simple version of the problem, there is no noise and we apply the control
only once. To minimize this expression over u, we set the derivative with respect to
u equal to zero and we find
2(ax + u) + 2βu = 0,
so that
u = − (a/(1 + β)) x.
Thus, the value of the control that minimizes the cost is linear in the state. We should
use a large control value when the state is far from the desired value 0. The following
result shows that the same conclusion holds for our problem (Fig. 14.1).
Theorem 14.1 (Optimal LQG Control) The control values U(n) that minimize (14.2) for the system (14.1) are
U(n) = g(N − n) X(n),
where
g(m) = − a d(m − 1) / (β + d(m − 1)), m ≥ 0;   (14.3)
d(m) = 1 + a² β d(m − 1) / (β + d(m − 1)), m ≥ 0   (14.4)
with d(−1) = 0.
That is, the optimal control is linear in the state and the coefficient depends on
the time-to-go. These coefficients can be pre-computed at time 0 and they do not
depend on the noise variance. Thus, the control values would be calculated in the
same way if V (n) = 0 for all n.
Proof Let Vm (x) be the minimum value of (14.2) when N is replaced by m. The
stochastic dynamic programming equations are
V_m(x) = min_u { x² + β u² + E(V_{m−1}(ax + u + V)) }, m ≥ 0,   (14.5)
We claim that V_m(x) = d(m) x² + c(m) for some constants c(m) and d(m), where d(m) satisfies (14.4), and that the minimizer in (14.5) is u = g(m) x with g(m) given by (14.3).
The verification is a simple algebraic exercise that we leave to the reader.
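As a small illustration, here is a Python sketch that computes the gains g(m) from (14.3)–(14.4) and simulates the controlled system (14.1) with them; the values of a, β, σ, and N are illustrative.

# Compute g(m), d(m) and simulate X(n+1) = aX(n) + U(n) + V(n) with U(n) = g(N-n)X(n).
import random

a, beta, sigma, N = 0.9, 1.0, 0.5, 50

g, d_prev = [], 0.0                      # d(-1) = 0
for m in range(N + 1):
    g.append(-a * d_prev / (beta + d_prev))               # (14.3)
    d_prev = 1 + a**2 * beta * d_prev / (beta + d_prev)   # (14.4)

x, cost = 5.0, 0.0
for n in range(N + 1):
    u = g[N - n] * x                     # optimal control is linear in the state
    cost += x**2 + beta * u**2
    x = a * x + u + random.gauss(0.0, sigma)

print(g[-1], cost)                       # gain for a large time-to-go, one realized cost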
14.1.1 Letting N → ∞
We claim that, as m → ∞, d(m) converges to the solution d of the fixed-point equation
d = f(d) := 1 + a² β d / (β + d).
Indeed,
f′(d) = a² β² / (β + d)²,
so that 0 < f′(d) < a² for d ≥ 0. Also, f(d) > 0 for d ≥ 0. Hence, f(d) is a contraction. That is,
|f(d) − f(d′)| ≤ α |d − d′| for all d, d′ ≥ 0, for some α ∈ (0, 1).
Thus,
|d − d(m)| ≤ α^m |d − d(0)|,
which shows that d(m) → d, as claimed. Consequently, (14.3) shows that g(m) →
g as m → ∞, where
g = − a d / (β + d).
Thus, when the time-to-go m is very large, the optimal control approaches U (N −
m) = gX(N − m). This suggests that this control may minimize the cost (14.2)
when N tends to infinity (Fig. 14.2).
The formal way to study this problem is to consider the long-term average cost
defined by
lim_{N→∞} (1/N) E[ Σ_{n=0}^{N} ( X(n)² + β U(n)² ) | X(0) = x ].
This expression is the average cost per unit time. One can show that if |a| < 1, then
the control U (n) = gX(n) with g defined as before indeed minimizes that average
cost.
14.2 LQG with Noisy Observations
In the previous section, we controlled a linear system with Gaussian noise assuming that we observed the state. We now consider the case of noisy observations.
The system is
where the random variables W(n) are i.i.d. N(0, w²) and are independent of the
V (n).
The problem is to find, for each n, the value of U (n) based on the values of
Y^n := {Y(0), . . . , Y(n)} that minimize the expected total cost (14.2).
The following result gives the solution of the problem (Fig. 14.3).
Theorem 14.2 (Optimal LQG Control with Noisy Observations) The solution of the problem is
U(n) = g(N − n) X̂(n),
where
X̂(n) := E[X(n) | Y^n]
can be computed by using the Kalman filter and the constants g(m) are given by (14.3)–(14.4).
Thus, the control values are the same as when X(n) is observed exactly, except
that X(n) is replaced by X̂(n). This feature is called certainty equivalence.
Proof The fact that the values of g(n) do not depend on the noise V (n) gives us
some inkling as to why the result in the theorem can be expected: given Y^n, the state X(n) is N(X̂(n), v²) for some variance v². Thus, we can view the noisy
observation as increasing the variance of the state, as if the variance of V (n) were
increased.
Instead of providing the complete algebra, let us sketch why the result holds.
Assume that the minimum expected cost-to-go at time N − m + 1 given Y^{N−m+1} has been computed. Write
X(N − m) = X̂(N − m) + η,
where η is the estimation error. The Kalman filter updates the estimate as
X̂(N − m + 1) = a X̂(N − m) + u + K(N − m + 1){ Y(N − m + 1) − E[Y(N − m + 1) | Y^{N−m}] }
=: a X̂(N − m) + u + Z.
14.2.1 Letting N → ∞
As when X(n) is observed exactly, one can show that, if |a| < 1, the control
U (n) = g X̂(n)
minimizes the average cost per unit time. Also, in this case, we know that the
Kalman filter becomes stationary and has the form (Fig. 14.4)
In the previous chapter, we considered a controlled Markov chain and the action
is based on the knowledge of the state. In this section, we look at problems where
the state of the Markov chain is not observed exactly. In other words, we look at
a controlled hidden Markov chain. These problems are called partially observed
Markov decision problems (POMDPs).
Instead of discussing the general version of this problem, we look at one concrete
example to convey the basic ideas.
The example is illustrated in Fig. 14.5. You have misplaced your keys but you
know that they are either in bag A, with probability p, or in bag B, otherwise.
Unfortunately, your bags are cluttered and if you spend one unit of time (say 10 s)
looking in bag A, you find your keys with probability α if they are there. Similarly,
the probability for bag B is β. Every time unit, you choose which bag to explore.
Your objective is to minimize the expected time until you find your keys.
The state of the system is the location A or B of your keys. However, you do
not observe that state. The key idea (excuse the pun) is to consider the conditional
probability pn that the keys are in bag A given all your observations up to time n. It
turns out that pn is a controlled Markov chain, as we explain shortly. Unfortunately,
the set of possible values of pn is [0, 1], which is not finite, nor even countable. Let
us not get discouraged by this technical issue.
Assume that at time n, when the keys are in bag A with probability pn , you look
in bag A for one unit of time and you do not see the keys. What is then pn+1 ? We
claim that
p_{n+1} = p_n (1 − α) / ( p_n (1 − α) + (1 − p_n) ) =: f(A, p_n).
Indeed, this is the probability that the keys are in bag A and we do not see them,
divided by the probability that we do not see the keys (either when they are there or
when they are not). Of course, if we see the keys, the problem stops.
Similarly, say that we look in bag B and we do not see the keys. Then
p_{n+1} = p_n / ( p_n + (1 − p_n)(1 − β) ) =: f(B, p_n).
Thus, we control pn with our actions. Let V (p) be the minimum expected time
until we find the keys, given that they are in bag A with probability p. Then, the
DPE are
V(p) = 1 + min{ (1 − pα) V(f(A, p)), (1 − (1 − p)β) V(f(B, p)) }.   (14.9)
The constant 1 is the duration of the first step. The first term in the minimum is what
happens when you look in bag A. With probability 1 − pα, you do not find your
keys and you will then have to wait a minimum expected time equal to V (f (A, p))
to find your keys, because the probability that they are in bag A is now f (A, p).
The other term corresponds to first looking in bag B.
These equations look hopeless. However, they are easy to solve in Python. One
discretizes [0, 1] into K intervals and one rounds off the updates f (A, p) and
f (B, p).
Thus, the updates are for a finite vector V = (V (1/K), V (2/K), . . . , V (1)).
With this discretization, the equations (14.9) look like
V = φ(V),
where φ(·) is the right-hand side of (14.9). These are fixed-point equations. To solve
them, we initialize V0 = 0 and we iterate
Vt+1 = φ(Vt ), t ≥ 0.
With a bit of luck, that can be justified mathematically, this algorithm converges to
V, the solution of the DPE. The solution is shown in Fig. 14.6, for different values
of α and β. The figure also shows the optimum action as a function of p. The
discretization uses K = 1000 values in [0, 1] and the iteration is performed 100
times.
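Here is a minimal Python sketch of that discretized fixed-point iteration; the values of α, β, and K are illustrative.

# Discretized iteration for the key-search DPE (14.9): p is rounded to a grid of K values.
alpha, beta, K = 0.3, 0.5, 1000

def f_A(p):                      # posterior after an unsuccessful look in bag A
    return p * (1 - alpha) / (p * (1 - alpha) + (1 - p))

def f_B(p):                      # posterior after an unsuccessful look in bag B
    return p / (p + (1 - p) * (1 - beta))

grid = [k / K for k in range(1, K + 1)]
idx = lambda p: min(max(round(p * K), 1), K) - 1     # round p onto the grid

V = [0.0] * K
for _ in range(100):
    V = [1 + min((1 - p * alpha) * V[idx(f_A(p))],
                 (1 - (1 - p) * beta) * V[idx(f_B(p))])
         for p in grid]

# optimal action: look in bag A when the first term achieves the minimum
look_in_A = [(1 - p * alpha) * V[idx(f_A(p))] <= (1 - (1 - p) * beta) * V[idx(f_B(p))]
             for p in grid]
print(V[idx(0.5)], look_in_A[idx(0.5)])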
14.4 Summary
14.5 References
The texts Bertsekas (2005), Kumar and Varaiya (1986) and Goodwin and Sin (2009)
cover LQG control. The first two texts discuss POMDP.
14.6 Problems
where X(0) = 0 and the random variables V (n) are i.i.d. and N (0, 0.2). The U (n)
are control values.
where X(0) = 0 and the random variables V (n), W (n) are independent with
V (n) =D N (0, 0.2) and W (n) =D N (0, σ 2 ).
(a) Implement the control described in Theorem 14.2 for σ 2 = 0.1 and σ 2 = 0.4
and simulate the controlled system.
(b) Implement the limiting control with the limiting gain and the stationary Kalman
filter for σ 2 = 0.1 and σ 2 = 0.4. Simulate the system.
(c) Compare the systems with the time-varying and the limiting controls.
Problem 14.3 There are two coins. One is fair and the other one has a probability
of “head” equal to 0.6. You cannot tell which is which by looking at the coins. At
each step n ≥ 1, you must choose which coin to flip. The goal is to maximize the
expected number of “heads.”
Perspective and Complements
15
15.1 Inference
One key concept that we explored is that of inference. The general problem of
inference can be formulated as follows. There is a pair of random quantities (X, Y ).
One observes Y and one wants to guess X (Fig. 15.1).
Thus, the goal is to find a function g(·) such that X̂ := g(Y ) is close to X, in a
sense to be made precise. Here are a few sample problems:
Fig. 15.1 The inference problem is to guess the value of X from that of Y
15.2 Sufficient Statistic
A useful notion for inference problems is that of a sufficient statistic. We have not
discussed this notion so far. It is time to do it.
Definition 15.1 (Sufficient Statistic) We say that h(Y) is a sufficient statistic for X if
f_{Y|X}[y|x] = f(h(y), x) g(y)
for some functions f(·, ·) and g(·),
or, equivalently, it
Before we discuss the meaning of this definition, let us explore some implications. First note that if we have a prior fX(x) and we want to calculate MAP[X|Y = y], we have
MAP[X|Y = y] = arg max_x fX(x) f_{Y|X}[y|x] = arg max_x fX(x) f(h(y), x),
so that MAP[X|Y = y] is a function of h(y) alone. In words, the information in Y that is useful to calculate MAP[X|Y] is contained in h(Y).
In the same way, we see that MLE[X|Y ] is also a function of h(Y ).
Observe also that
f_{X|Y}[x|y] = fX(x) f_{Y|X}[y|x] / fY(y) = fX(x) f(h(y), x) g(y) / fY(y).
Now,
fY(y) = ∫_{−∞}^{∞} fX(x) f(h(y), x) g(y) dx = g(y) ∫_{−∞}^{∞} fX(x) f(h(y), x) dx = g(y) φ(h(y)),
where
φ(h(y)) = ∫_{−∞}^{∞} fX(x) f(h(y), x) dx.
Hence,
f_{X|Y}[x|y] = fX(x) f(h(y), x) / φ(h(y)).
Now, consider the hypothesis testing problem when X ∈ {0, 1}. Note that the likelihood ratio is
L(y) = f_{Y|X}[y|1] / f_{Y|X}[y|0] = f(h(y), 1) / f(h(y), 0).
Thus, the likelihood ratio depends only on h(y) and it follows that the solution of
the hypothesis testing problem is also a function of h(Y ).
15.2.1 Interpretation
The definition of sufficient statistic is quite abstract. The intuitive meaning is that if
h(Y ) is sufficient for X, then Y is some function of h(Y ) and a random variable Z
that is independent of X. That is,
Y = g(h(Y), Z)   (15.1)
for some function g(·, ·).
For instance, say that Y = (Y1 , . . . , Yn ) where the Ym are i.i.d. and Bernoulli with
parameter X ∈ [0, 1]. Let h(Y ) = Y1 + · · · + Yn . Then we can think of Y as being
constructed from h(Y ) by selecting randomly which h(Y ) random variables among
(Y1 , . . . , Yn ) are equal to one. This random choice is some independent random
variable Z. In such a case, we see that Y does not contain any information about X
that is not already in h(Y ).
To see the equivalence between this interpretation and the definition, first assume
that (15.1) holds. Then, given X = x, the density of Y factors as in the definition,
so that h(Y) is sufficient for X. Conversely, if h(Y) is sufficient for X, then we can
find some Z such that g(h(y), Z) has the density fY |h(Y ) [y|h(y)].
Assume that p ∈ (0, 1). Then the Markov chain is irreducible. However, it is
intuitively clear that X(n) → ∞ as n → ∞ if p > 0.5. To see that this is indeed
the case, let Z(n) be i.i.d. random variables with P(Z(n) = 1) = p and P(Z(n) = −1) = q. Then note that
X(n) ≥ X(0) + Z(1) + ··· + Z(n),
so that
X(n)/n ≥ X(0)/n + (Z(1) + ··· + Z(n))/n.
Also,
(Z(1) + ··· + Z(n))/n → p − q > 0,
where the convergence follows by the SLLN. This implies that X(n) → ∞, as claimed.
Thus, X(n) eventually is larger than any given N and remains larger. This shows
that X(n) visits every state only finitely many times. We say that the states are
transient because they are visited only finitely often.
We say that a state is recurrent if it is not transient. In that case, the state is called
positive recurrent if the average time between successive visits is finite; otherwise
it is called null recurrent.
Here is the result that corresponds to Theorem 1.1
Theorem 15.1 (Big Theorem for Infinite Markov Chains) Consider an infinite
Markov chain.
(a) If the Markov chain is irreducible, the states are either all transient, all positive
recurrent, or all null recurrent. We then say that the Markov chain is transient,
positive recurrent, or null recurrent, respectively.
(b) If the Markov chain is positive recurrent, it has a unique invariant distribution
π and π(i) is the long-term fraction of time that X(n) is equal to i.
(c) If the Markov chain is positive recurrent and also aperiodic, then the distribu-
tion πn of X(n) converges to π .
(d) If the Markov chain is not positive recurrent, it does not have an invariant
distribution and the fraction of time that it spends in any state goes to zero.
It turns out that the Markov chain in Fig. 15.2 is null recurrent for p = 0.5 and
positive recurrent for p < 0.5. In the latter case, its invariant distribution is
π(i) = (1 − ρ) ρ^i, i ≥ 0, where ρ := p/q.
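Here is a quick numerical check of this invariance, assuming the chain of Fig. 15.2 is the reflected random walk on {0, 1, 2, . . .} that moves up with probability p, down with probability q = 1 − p, and stays at 0 with probability q; since the figure is not reproduced here, that transition structure at 0 is an assumption.

# Check that pi(i) = (1 - rho) * rho**i satisfies pi = pi P for the reflected walk
# with P(0,0) = q, P(i, i+1) = p, P(i, i-1) = q (assumed transition structure).
p = 0.3
q = 1 - p
rho = p / q
M = 200                                    # truncate the state space for the check

pi = [(1 - rho) * rho**i for i in range(M)]

def flow_in(i):
    # probability mass flowing into state i in one step when the chain starts in pi
    if i == 0:
        return pi[0] * q + pi[1] * q
    nxt = pi[i + 1] * q if i + 1 < M else 0.0
    return pi[i - 1] * p + nxt

print(max(abs(flow_in(i) - pi[i]) for i in range(M - 1)))   # ~ 0 up to truncation error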
15.4 Poisson Process
15.4.1 Definition
Definition 15.2 (Poisson Process) Let λ > 0 and {S1, S2, . . .} be i.i.d. Exp(λ) random variables. Let also Tn = S1 + ··· + Sn for n ≥ 1, with T0 := 0. Define
N_t = max{ n ≥ 0 | T_n ≤ t }, t ≥ 0.
The process {N_t, t ≥ 0} is a Poisson process with rate λ.
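Here is a small Python sketch that simulates N_t directly from this definition and compares the empirical mean to λt; the parameter values are illustrative.

# Simulate the Poisson process from i.i.d. Exp(lambda) inter-jump times.
import random

lam, t, runs = 2.0, 5.0, 10000

def N_t():
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)   # next inter-jump time
        if total > t:
            return n
        n += 1

print(sum(N_t() for _ in range(runs)) / runs, lam * t)   # both are close to 10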
Before exploring the properties of the Poisson process, we recall two properties of
the exponential distribution.
Fτ (t) = P (τ ≤ t) = 1 − exp{−λt}, t ≥ 0.
In particular, the pdf of τ is fτ (t) = λ exp{−λt} for t ≥ 0. Also, E(τ ) = λ−1 and
var(τ ) = λ−2 .
Then,
P[τ ≤ t + ε | τ > t] = λε + o(ε), as ε → 0.
Proof
P[τ > t + s | τ > s] = P(τ > t + s) / P(τ > s) = exp{−λ(t + s)} / exp{−λs} = exp{−λt} = P(τ > t).
Proof Figure 15.4 illustrates that result. Given {Ns , s ≤ t}, the first jump time of
{Ns+t − Nt , s ≥ 0} is Exp(λ), by the memoryless property of the exponential
distribution. The subsequent inter-jump times are i.i.d. and Exp(λ). This proves the
theorem.
Proof There are a number of ways of showing this result. The standard way is as
follows. Note that
P(N_{t+ε} = n) = P(N_t = n)(1 − λε) + P(N_t = n − 1) λε + o(ε).
Hence,
d/dt P(N_t = n) = λ P(N_t = n − 1) − λ P(N_t = n).
Thus,
d/dt P(N_t = 0) = −λ P(N_t = 0),
so that P(N_t = 0) = exp{−λt}. Writing P(N_t = n) = g(n, t) exp{−λt}, the identity above gives
d/dt [ g(n, t) exp{−λt} ] = λ [ g(n − 1, t) − g(n, t) ] exp{−λt},
i.e.,
d/dt g(n, t) = λ g(n − 1, t).
This expression shows by induction that g(n, t) = (λt)^n / n!.
A different proof makes use of the density of the jumps. Let Tn be the n-th jump
of the process and Sn = Tn − Tn−1 , as before. Then
To derive this expression, we used the fact that the Sn are i.i.d. Exp(λ). The
expression above shows that, given that there are n jumps in [0, t], they are equally
likely to be anywhere in the interval. Also,
P(N_t = n) = λ^n ∫_S dt_1 ··· dt_n exp{−λt},
where S = {t1 , . . . , tn |0 < t1 < · · · < tn < t}. Now, observe that S is a fraction of
[0, t]n that corresponds to the times ti being in a particular order. There are n! such
orders and, by symmetry, each order corresponds to a subset of [0, t]n of the same
size. Thus, the volume of S is t n /n!. We conclude that
P(N_t = n) = (t^n / n!) λ^n exp{−λt} = ((λt)^n / n!) exp{−λt},
which proves the result.
15.5 Boosting
You follow the advice of some investment experts when you buy stocks. Their
recommendations are often contradictory. How do you make your decisions so that,
in retrospect, you are not doing too bad compared to the best of the experts? The
intuition is that you should try to follow the leader, but randomly. To make the
situation concrete, Fig. 15.5 shows three experts (B, I, T ) and the profits one would
make by following their advice on the successive days.
On a given day, you choose which expert to follow the next day. Figure 15.6
shows your profit if you make the sequence of selections indicated by the red circles.
Fig. 15.5 The three experts and the profits of their recommended stocks
Fig. 15.6 A specific sequence of choices and the resulting profit and regrets
In these selections, you choose to follow B the first 2 days, then I the next two
days, then T the last day. Of course, you have to choose the day before, and the
actual profit is only known the next day. The figure also shows the regrets that you
accumulate when comparing your profit to that of the three experts. Your total profit
is −5 and the profit you would have made if you had followed B all the time would
have been −2, so your regret compared to B is −2 − (−5) = 3, and similarly for
the other two experts.
The problem is to make the expert selection every day so as to minimize the worst
regret, i.e., the regret with respect to the most successful expert. More precisely, the
goal is to minimize the rate of growth of the worst regret. Here is the result.
Theorem 15.6 (Minimum Regret Algorithm) Generally, the worst regret grows like O(√n) with the number n of steps. One algorithm that achieves this rate of
regret is to choose expert E at step n + 1 with probability πn+1 (E) given by
π_{n+1}(E) = A_n exp{ η P_n(E) / √n }, for E ∈ {B, I, T},
where η > 0 is a constant, An is such that these probabilities add up to one, and
Pn (E) is the profit that expert E makes in the first n days.
Thus, the algorithm favors successful experts. However, the algorithm makes
random selections. It is easy to construct examples where a deterministic algorithm
accumulates a regret that grows like n.
Figure 15.7 shows a simulation of three experts and of the selection algorithm
in the theorem. The experts are random walks with drift 0.1. The simulation shows that the selection algorithm tends to fall behind the best expert by O(√n).
The proof of the theorem can be found in Cesa-Bianchi and Lugosi (2006).
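Here is a minimal Python sketch of the randomized selection rule of Theorem 15.6, with random-walk experts as in Fig. 15.7; the constant η and the drift are illustrative.

# Randomized "follow the leader": expert E is chosen with probability
# proportional to exp(eta * P_n(E) / sqrt(n)).  Experts are random walks with drift 0.1.
import math
import random

n_steps, eta, drift, experts = 250, 1.0, 0.1, 3
P = [0.0] * experts                     # cumulative profit P_n(E) of each expert
my_profit = 0.0

for n in range(n_steps):
    s = math.sqrt(max(n, 1))
    w = [math.exp(eta * P[e] / s) for e in range(experts)]
    choice = random.choices(range(experts), weights=w)[0]
    gains = [drift + random.gauss(0.0, 1.0) for _ in range(experts)]
    my_profit += gains[choice]
    P = [P[e] + gains[e] for e in range(experts)]

print(max(P) - my_profit)               # worst regret, typically of order sqrt(n_steps)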
15.6 Multi-Armed Bandits
Here is a classical problem. You are given two coins, both with an unknown bias
(the probability of heads). At each step k = 1, 2, . . . you choose a coin to flip.
Your goal is to accumulate heads as fast as possible. Let Xk be the number of heads
you accumulate after k steps. Let also Xk∗ be the number of heads that you would
accumulate if you always flipped the coin with the largest bias. The regret of your
strategy after k steps is defined as
Rk = E(Xk∗ − Xk ).
Let θ1 and θ2 be the bias of coins 1 and 2, respectively. Then E(Xk∗ ) = k max{θ1 , θ2 }
and the best strategy is to flip the coin with the largest bias at each step. However,
since the two biases are unknown, you cannot use that strategy. We explain below
that there is a strategy such that the regret grows like log(k) with the number of
steps.
Any good strategy keeps on estimating the biases. Indeed, any strategy that stops
estimating and then forever flips the coin that is believed to be best has a positive
probability of getting stuck with the worst coin, thus accumulating a regret that
grows linearly over time. Thus, a good strategy must constantly explore, i.e., flip
both coins to learn their bias.
However, a good strategy should exploit the estimates by flipping the coin that is
believed to be better more frequently than the other. Indeed, if you were to flip the
two coins the same fraction of time, the regret would also grow linearly. Hence, a
good strategy must exploit the accumulated knowledge about the biases.
The key question is how to balance exploration and exploitation. The strategy
called Thompson Sampling does this optimally. Assume that the biases θ1 and
θ2 of the two coins are independent and uniformly distributed in [0, 1]. Say that
you have flipped the coins a number of times. Given the outcomes of these coin
flips, one can in principle compute the conditional distributions of θ1 and θ2 . Given
these conditional distributions, one can calculate the probability that θ1 > θ2 . The
Thompson Sampling strategy is to choose coin 1 with that probability and coin 2
otherwise for the next flip. Here is the key result.
Rk ≥ O(log k).
The notation O(log k) indicates a function g(k) that grows like log k, i.e., such
that g(k)/ log k converges to a positive constant as k → ∞.
Thus this strategy does not necessarily choose the coin with the largest expected
bias. It is the case that the strategy favors the coin that has been more successful so
far, thus exploiting the information. But the selection is random, which contributes
to the exploration.
One can show that if flips of coin 1 have produced h heads and t tails, then the
conditional density of θ1 is g(θ ; h, t), where
g(θ; h, t) = ((h + t + 1)! / (h! t!)) θ^h (1 − θ)^t, θ ∈ [0, 1].
The same result holds for coin 2. Thus, Thompson Sampling generates θ̂1 and θ̂2
according to these densities.
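Here is a minimal Python sketch of Thompson Sampling for the two coins: sample from both Beta posteriors and flip the coin with the larger sample. The true biases are of course unknown to the algorithm; the values used here are illustrative.

# Thompson Sampling for two coins: the posterior after h heads and t tails
# (uniform prior) is Beta(h + 1, t + 1).
import random

theta = [0.45, 0.55]            # true (unknown) biases, illustrative
heads, tails = [0, 0], [0, 0]
flips, my_heads = 10000, 0

for k in range(flips):
    samples = [random.betavariate(heads[i] + 1, tails[i] + 1) for i in (0, 1)]
    i = 0 if samples[0] > samples[1] else 1
    if random.random() < theta[i]:
        heads[i] += 1
        my_heads += 1
    else:
        tails[i] += 1

print(flips * max(theta) - my_heads)   # realized regret, which grows roughly like log(k)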
For a proof of this result, see Agrawal and Goyal (2012). See also Russo et al.
(2018) for applications of multi-armed bandits.
A rough justification of the result goes as follows. Say that θ1 > θ2 . One can
show that after flipping coin 2 a number n of times, it takes about n steps until you
flip it again when using Thompson Sampling. Your regret then grows by one at times
1, 1 + 1, 2 + 2, 4 + 4, . . . , 2^n, 2^{n+1}, . . .. Thus, the regret is of order n after O(2^n) steps. Equivalently, after N = 2^n steps, the regret is of order n = log N.
15.7 Capacity of BSC
Consider a binary symmetric channel with error probability p ∈ (0, 0.5). Every bit
that the transmitter sends has a chance of being corrupted. Thus, it is impossible
to transmit any bit string fully reliably across this channel. No matter what the
transmitter sends, the receiver can never be sure that it got the message right.
However, one might be able to achieve a very small probability of error. For
instance, say that p = 0.1 and that one transmits a bit by repeating it N times,
where N ≫ 1. As the receiver gets the N bits, it uses a majority decoding. That is,
if it gets more zeros than ones, it decides that transmitter sent a zero, and conversely
for a one. The probability of error can be made arbitrarily small by choosing N very
large. However, this scheme gets to transmit only one bit every N steps. We say
that the rate of the channel is 1/N and it seems that to achieve a very small error
probability, the rate has to become negligible.
It turns out that our pessimistic conclusion is wrong. Claude Shannon (Fig. 15.8),
in the late 1940s, explained that the channel can transmit at any rate less than C(p),
where (see Fig. 15.9)
C(p) = 1 − H(p), with H(p) := −p log2(p) − (1 − p) log2(1 − p).
error less than 10^{−12}. The actual scheme that we use depends on ε, and it becomes more complex when ε is smaller; however, the rate R does not depend on ε. Quite a
remarkable result! Needless to say, it baffled all the engineers who had been busily
designing various ad hoc transmission schemes.
Shannon’s key insight is that long sequences are typical. There is a statistical
regularity in random sequences such as Markov chains or i.i.d. random variables and
this regularity manifests itself in a characteristic of long sequences. For instance, flip
many times a biased coin with P (head) = 0.1. The sequence that you will observe
is likely to have about 10% of heads. Many other sequences are so unlikely that you
will not see them. Thus, there are relatively few long sequences that are possible.
In this example, although there are M = 2^N possible sequences of N coin flips, only about √M are typical when P(head) = 0.1. Moreover, by symmetry, these
typical sequences are all equally likely. For that reason, the errors of the BSC must
correspond to relatively few patterns. Say that there are only A possible patterns of
errors for N transmissions. Then, any bit string of length N that the sender transmits
will correspond to A possible received “output” strings: one for every typical error
sequence. Thus, it might be possible to choose B different “input” strings of length
N for the transmitter so that the A received “output” strings for each one of these
B input strings are all distinct. However, one might worry that choosing the B input
strings would be rather complex if we want their sets of output strings to be distinct.
Shannon noticed that if we pick the input strings completely randomly, this will
work. Thus, Shannon's scheme is as follows. Pick a large N. Choose B strings of N bits randomly, each time by flipping a fair coin N times. Call these input strings X1, . . . , XB. These are the codewords. Let S1 be the set of A typical outputs that
correspond to X1 . Let Yj be the output that corresponds to input Xj . Note that the
Yj are sequences of fair coin flips, by symmetry of the channel. Thus, each Yj
is equally likely to be any one of the 2^N possible output strings. In particular, the probability that Yj falls in S1 is A/2^N (Fig. 15.10).
In fact,
P(Y2 ∈ S1 or Y3 ∈ S1 . . . or YB ∈ S1) ≤ B × A × 2^{−N}.
Indeed, the probability of a union of events is not larger than the sum of their
probabilities. We explain below that A = 2^{NH(p)}. Thus, if we choose B = 2^{NR}, we see that the expression above is less than or equal to
2^{NR} × 2^{NH(p)} × 2^{−N} = 2^{−N(1 − H(p) − R)},
which is negligible for large N as long as R < 1 − H(p) = C(p).
Thus, the receiver makes an error with a negligible probability if one does not choose
too many codewords. Note that B = 2^{NR} corresponds to transmitting NR different
bits in N steps, thus transmitting at rate R.
How does the receiver recognize the bit string that the transmitter sent? The idea
is to give the list of the B input strings, i.e., codewords, to the receiver. When
it receives a string, the receiver looks in the list to find the codeword that is the
closest to the string it received. With a very high probability, it is the string that the
transmitter sent.
It remains to show that A = 2^{NH(p)}. Fortunately, this calculation is a simple
consequence of the SLLN. Let X := {X(n), n = 1, . . . , N } be i.i.d. random
variables with P (X(n) = 1) = p and P (X(n) = 0) = 1 − p. For a given sequence
x = (x(1), . . . , x(N )) ∈ {0, 1}N , let
ψ(x) := (1/N) log2( P(X = x) ).   (15.3)
Note that, with |x| := Σ_{n=1}^{N} x(n),
ψ(x) = (1/N) log2( p^{|x|} (1 − p)^{N−|x|} ) = (|x|/N) log2(p) + ((N − |x|)/N) log2(1 − p).
Consequently,
ψ(X) = (|X|/N) log2(p) + ((N − |X|)/N) log2(1 − p) ≈ p log2(p) + (1 − p) log2(1 − p) = −H(p)
for N large, since |X|/N → p by the SLLN.
This calculation shows that any sequence x of values that X takes has approximately
the same value of ψ(x). But, by (15.3), this implies all the sequences x that occur
have approximately the same probability
2^{−NH(p)}.
We conclude that there are 2^{NH(p)} typical sequences and that they are all essentially equally likely. Thus, A = 2^{NH(p)}.
Recall that for the Gaussian channel with the MLE detection rule, the channel
becomes a BSC with
is called the entropy rate of the Markov chain. A practical scheme, called Lempel–Ziv compression, essentially achieves this limit. It is the basis for most file compression algorithms (e.g., ZIP).
Shannon put these two ideas together: channel capacity and source coding. Here
is an example of his source–channel coding result. How fast can one send the
symbols X(n) produced by the Markov chain through a BSC channel? The answer
is C(p)/H (P ). Intuitively, it takes H (P ) bits per symbol X(n) and the BSC can
send C(p) bits per unit time. Moreover, to accomplish this rate, one first encodes
the source and one separately chooses the codewords for the BSC, and one then
uses them together. Thus, the channel coding is independent of the source coding
and vice versa. This is called the separation theorem of Claude Shannon.
15.8 Bounds on Probabilities
Let f be a nondecreasing, nonnegative function. Then, for any a,
P(X ≥ a) ≤ E(f(X)) / f(a).   (15.4)
Proof
1{X ≥ a} ≤ f(X) / f(a),
as shown in Fig. 15.14. The inequality (15.4) then follows by taking expectations.
Recall the multiplexing problem. There are N users who are independently active
with probability p. Thus, the number of active users Z is B(N, p). We want to find
m so that P (Z ≥ m) = 5%.
As a first estimate of m, we use Chebyshev’s inequality (2.2) which says that
P(|ν − E(ν)| > ε) ≤ var(ν) / ε².
Now, if Z = B(N, p), one has E(Z) = Np and var(Z) = Np(1 − p).4 Hence,
since ν = B(100, 0.2), one has E(ν) = 20 and var(ν) = 16. Chebyshev’s inequality
gives
P(|ν − 20| > ε) ≤ 16 / ε².
Thus, we expect that
P(ν − 20 > ε) ≤ 8 / ε²,
because it is reasonable to think that the distribution of ν is almost symmetric around
its mean, as we see in Fig. 3.4. We want to choose m = 20 + ε so that P(ν > m) ≤ 5%. This means that we should choose ε so that 8/ε² = 5%. This gives ε ≈ 13, so
that m = 33. Thus, according to Chebyshev’s inequality, it is safe to assume that no
more than 33 users are active and we can choose C so that C/33 is a satisfactory
rate for users.
As a second approach, we use Chernoff's inequality (15.5) which states that
P(Z ≥ a) ≤ E(exp{θ(Z − a)}) = exp{−θa} E(exp{θZ}), for all θ > 0.   (15.5)
To calculate the right-hand side, we note that if Z = B(N, p), then we can write Z = X(1) + ··· + X(N), where the X(n) are i.i.d. random variables with P(X(n) = 1) = p and P(X(n) = 0) = 1 − p. Then,
To continue the calculation, we note that, since the X(n) are independent, so
are the random variables exp{θ X(n)}.5 Also, the expected value of a product
of independent random variables is the product of their expected values (see
Appendix A). Hence,
where we define
4 See Appendix A.
5 Indeed, functions of independent random variables are independent. See Appendix A.
Fig. 15.15 The logarithm divided by N of the probability of too many active users
Since this inequality holds for every θ > 0, let us minimize the right-hand side with
respect to θ . That is, let us define
so that
and
Setting to zero the derivative with respect to θ of the term between brackets, we find
a = p e^θ / (1 − p + p e^θ),
i.e.,
e^θ = a(1 − p) / ((1 − a) p).
Λ*(a) = a log(a/p) + (1 − a) log((1 − a)/(1 − p)), ∀a > p.
P (ν ≥ Na) ≈ 0.05.
i.e.,
Λ*(a) = − log(0.05)/N ≈ 0.03.
Looking at Fig. 15.15, we find a = 0.30. This corresponds to m = 30. Thus,
Chernoff’s estimate says that P (ν > 30) ≈ 5% and that we can size the network
assuming that only 30 users are active at any one time.
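Here is a short Python sketch that reproduces these numbers: it computes Λ*(a) and finds the smallest a > p with exp{−NΛ*(a)} ≤ 5%, for N = 100 and p = 0.2 as above.

# Chernoff estimate for the multiplexing example.
import math

N, p = 100, 0.2

def Lambda_star(a):
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

target = -math.log(0.05) / N          # about 0.03
a = next(k / 1000 for k in range(201, 1000) if Lambda_star(k / 1000) >= target)
print(a)                              # about 0.30, matching the value read off Fig. 15.15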
By the way, the calculations we have performed above show that Chernoff’s
bound can be written as
P(Z ≥ Na) ≤ P(B(N, p) = Na) / P(B(N, a) = Na).
15.9 Martingales
15.9.1 Definitions
Let Xn be the fortune at time n ≥ 0 when one plays a game of chance. The game is fair if
E[X_{n+1} | X^n] = X_n, ∀n ≥ 0.
In this expression, X^n := {X_m, m ≤ n}. Thus, in a fair game, one cannot expect
to improve one’s fortune. A sequence {Xn , n ≥ 0} of random variables with that
property is a martingale.
This basic definition generalizes to the case where one has access to additional
information and is still unable to improve one’s fortune. For instance, say that the
additional information is the value of other random variables Yn . One then has the
following definitions.
Definition 15.3 (Martingale) The sequence {Xn, n ≥ 0} is a martingale with respect to {Yn, n ≥ 0} if E[X_{n+1} | X^n, Y^n] = X_n for all n ≥ 0; it is a submartingale if E[X_{n+1} | X^n, Y^n] ≥ X_n and a supermartingale if E[X_{n+1} | X^n, Y^n] ≤ X_n.
In many cases, we do not specify the random variables Yn and we simply say that Xn is a martingale, or a submartingale, or a supermartingale.
Note that if Xn is a martingale, then
E(Xn ) = E(X0 ), ∀n ≥ 0.
15.9.2 Examples
Random Walk
Let {Zn , n ≥ 0} be independent and zero-mean random variables. Then Xn :=
Z0 + ··· + Zn for n ≥ 0 is a martingale. Indeed,
E[X_{n+1} | X^n] = X_n + E(Z_{n+1}) = X_n.
Product
Let {Zn , n ≥ 0} be independent random variables with mean 1. Then Xn := Z0 ×
··· × Zn for n ≥ 0 is a martingale. Indeed,
E[X_{n+1} | X^n] = X_n E(Z_{n+1}) = X_n.
Branching Process
For m ≥ 1 and n ≥ 0, let X_m^n be i.i.d. random variables distributed like X that take values in {0, 1, 2, . . .}, with μ := E(X). Given Y0, define
Y_{n+1} = Σ_{m=1}^{Y_n} X_m^n, n ≥ 0.
Then
Z_n = μ^{−n} Y_n, n ≥ 0
is a martingale. Indeed,
E[Y_{n+1} | Y0, . . . , Yn] = Yn μ,
so that
E[Z_{n+1} | Y0, . . . , Yn] = μ^{−(n+1)} Yn μ = Z_n.
W_n = q^{Z_n}, n ≥ 1
is a martingale.
Proof Exercise.
Doob Martingale
Let {Xn , n = 1, . . . , N } be random variables and Y = f (X1 , . . . , XN ), where f is
some bounded measurable real-valued function. Then
Z_n := E[Y | X_1, . . . , X_n], n = 0, . . . , N,
is a martingale. Here are two examples.
1. Throw N balls into M bins, and let Y be some function of the throws: the
number of empty bins, the max load, the second-highly loaded bin, or some
similar function. Let Xn be the index of the bin into which ball n lands. Then
Zn = E[Y | X1, . . . , Xn] is a martingale.
2. Suppose we have r red and b blue balls in a bin. We draw balls without
replacement from this bin: what is the number of red balls drawn? Let Xn be
the indicator for whether ball n is red, and let Y = X1 + · · · + Xn be the number
of red balls. Then Zn is a martingale.
Y_n = Σ_{m=1}^{n} V_{m−1} (X_m − X_{m−1}), n ≥ 1,   (15.10)
with Y0 := 0 is a martingale.
The meaning of Yn is the fortune that you would get by betting Vm−1 at time
m − 1 on the gain Xm − Xm−1 of the next round of the game. This bet must be based
on the information (Xm−1 , Z m−1 ) that you have when placing the bet, not on the
outcome of the next round, obviously. The theorem says that your fortune remains
a martingale even after adjusting your bets in real time.
Stopping Times
When playing a game of chance, one may decide to stop after observing a particular
sequence of gains and losses. The decision to stop is non-anticipative. That is, one
cannot say “never mind, I did not mean to play the last three rounds.” Thus, the
random stopping time τ must have the property that the event {τ ≤ n} must be a
function of the information available at time n, for all n ≥ 0. Such a random time is
a stopping time.
Definition 15.4 (Stopping Time) A random variable τ is a stopping time for the
sequence {Xn , Yn , n ≥ 0} if τ takes values in {0, 1, 2, . . .} and
P[τ ≤ n | X_m, Y_m, m ≥ 0] = φ_n(X^n, Y^n), ∀n ≥ 0,
for some functions φ_n(·).
For instance,
τ = min{n ≥ 0 | (Xn , Yn ) ∈ A },
where A is a set in ℝ², is a stopping time for the sequence {Xn, Yn, n ≥ 0}. Thus,
you may want to stop the first time that either you go broke or your fortune exceeds
$1000.00.
One might hope that a smart choice of when to stop playing a fair game could
improve one’s expected fortune. However, that is not the case, as the following fact
shows.
E[X_{τ∧n} | X0, Y0] = X0.   (15.11)
In the statement of the theorem, for a random time σ one defines Xσ := Xn when
σ = n.
You will note that bounding τ ∧ n in the theorem above is essential. For instance,
let Xn correspond to the random walk described above with P (Zn = 1) = P (Zn =
−1) = 0.5. If we define τ = min{n ≥ 0 | Xn = 10}, one knows that τ is finite. (See
the comments below Theorem 15.1.) Hence, Xτ = 10, so that
E[Xτ | X0 = 0] = 10 ≠ X0.
Note that
lim_{n→∞} X_{τ∧n} = X_τ = 10,
because τ is finite. One then might conclude that the left-hand side of (15.11) goes
to 10, which would contradict (15.11). However, the limit and the expectation do
not interchange because the random variables Xτ ∧n are not bounded. However, if
they were, one would get E[Xτ |X0 ] = X0 , by the dominated convergence theorem.
We record this observation as the next result.
E[Xτ |X0 , Y0 ] = X0 .
6 τ ∧ n := min{τ, n}
L1 -Bounded Martingales
An L1 -bounded martingale cannot bounce up and down infinitely often across an
interval [a, b]. For if it did, you could increase your fortune without bound by
betting 1 on the way up across the interval and betting 0 on the way down. We
will see shortly that this cannot happen. As a result, the martingale must converge.
(Note that this is not true if the martingale is not L1 -bounded, as the random walk
example shows.)
Proof Consider an interval [a, b]. We show that Xn cannot up-cross this interval
infinitely often. (See Fig. 15.16.) Let us bet 1 on the way up and 0 on the way down.
That is, wait until Xn gets first below a, then bet 1 at every step until Xn > b, then
stop betting until Xn gets below a, and continue in this way.
If Xm crossed the interval Un times by time n, your fortune Yn is now at least
(b − a)Un + (Xn − a). Indeed, your gain was at least b − a for every upcrossing
and, in the last steps of your playing, you lose at most Xn − a if Xn never crosses
above b after you last resumed betting. But, since Yn is a martingale, we have
(We used the fact that Xn ≥ −|Xn|, so that E(Xn) ≥ −E(|Xn|) ≥ −K.) This shows that E(Un) ≤ B := (K + Y0 + a)/(b − a) < ∞. Letting n → ∞, since Un ↑ U,
where U is the total number of upcrossings of the interval [a, b], it follows by the
monotone convergence theorem that E(U ) ≤ B. Consequently, U is finite. Thus,
Xn cannot up-cross any given interval [a, b] infinitely often.
Consequently, the probability that it up-crosses infinitely often any interval with
rational limits is zero (since there are countably many such intervals).
This implies that Xn must converge, either to +∞, −∞, or to a finite value.
Since E(|Xn |) ≤ K, the probability that Xn converges to +∞ or −∞ is zero.
The following is a direct but useful consequence. We used this result in the proof
of the convergence of the stochastic gradient projection algorithm (Theorem 12.2).
Proof We have E(|Xn|)² ≤ E(Xn²), by Jensen's inequality. Thus, it follows that E(|Xn|) is bounded for all n, so that the result of the theorem applies to this martingale.
Theorem (Strong Law of Large Numbers) Let {Xn, n ≥ 1} be i.i.d. random variables with finite mean μ. Then
(X1 + · · · + Xn)/n → μ, almost surely as n → ∞.
Proof Let
Sn = X1 + · · · + Xn , n ≥ 1.
Note that
E[X1 | Sn, Sn+1, . . .] = (1/n) Sn =: Y−n,   (15.12)
by symmetry. Thus, Y−n converges to some limit Y−∞ as n → ∞. Since
Y−∞ = lim_{n→∞} (X1 + · · · + Xn)/n,
we see that Y−∞ is independent of (X1, . . . , Xn) for any finite n. Indeed, the limit does not depend on the values of the first n random variables. However, since Y−∞ is a function of {Xn, n ≥ 1}, it must be independent of itself, i.e., be a constant. Since E(Y−∞) = E(Y−1) = μ, we see that Y−∞ = μ.
E(Yτ ∧n ) = E(Y1 ) = 0,
which gives the identity with τ replaced by τ ∧ n. If E(τ ) < ∞, one can let n go to
infinity and get the result. (For instance, replace Xi by Xi+ and use MCT, similarly
for Xi− , then subtract.)
15.10 Summary
15.11 References
For the theory of Markov chains, see Chung (1967). The text Harchol-Balter (2013)
explains basic queueing theory and many applications to computer systems and
operations research.
The book Bremaud (1998) is also highly recommended for its clarity and the
breadth of applications. Information Theory is explained in the textbook Cover and
Thomas (1991). I learned the theory of martingales mostly from Neveu (1975). The
theory of multi-armed bandits is explained in Cesa-Bianchi and Lugosi (2006). The
text Hastie et al. (2009) is an introduction to applications of statistics in data science
(Fig. 15.17).
15.12 Problems
Problem 15.2 Customers arrive at a store according to a Poisson process with rate 4 (per hour).
Problem 15.3 Consider two independent Poisson processes with rates λ1 and λ2 .
Those processes measure the number of customers arriving in stores 1 and 2.
(a) What is the probability that a customer arrives in store 1 before any arrives in
store 2?
(b) What is the probability that in the first hour exactly 6 customers arrive at the two
stores? (The total for both is 6)
(c) Given exactly 6 have arrived at the two stores, what is the probability all 6 went
to store 1?
(a) Construct the Markov chain that models the queue. What are the states and
transition probabilities? [Hint: Suppose the head of the line task of the queue
still requires z units of service. Include z in the state description of the MC.]
(b) Use a Lyapunov–Foster argument to show that the queue is stable or, equivalently, that the MC is positive recurrent.
Problem 15.7 Let {Nt , t ≥ 0} be a Poisson process with rate λ. Let Sn denote the
time of the n-th event. Find
Problem 15.8 A queue has Poisson arrivals with rate λ. It has two servers that work
in parallel. When there are at least two customers in the queue, two are being served.
When there is only one customer, only one server is active. The service times are
i.i.d. Exp(μ).
Problem 15.9 Let {Xt, t ≥ 0} be a continuous-time Markov chain with rate matrix Q = {q(i, j)}. Define q(i) = Σ_{j≠i} q(i, j). Let also Ti = inf{t > 0 | Xt = i} and Si = inf{t > 0 | Xt ≠ i}. Then (select the correct answers)
Problem 15.10 A continuous-time queue has Poisson arrivals with rate λ, and it is
equipped with infinitely many servers. The servers can work in parallel on multiple
customers, but they are non-cooperative in the sense that a single customer can only
be served by one server. Thus, when there are k customers in the queue, k servers are
active. Suppose that the service time of each customer is exponentially distributed
with rate μ and they are i.i.d.
(a) Argue that the queue length is a Markov chain. Draw the transition diagram of
the Markov chain.
(b) Prove that for all finite values of λ and μ the Markov chain is positive recurrent
and find the invariant distribution.
(a) Given S3 = s, find the joint distribution of S1 and S2. Show your work.
(b) Find E[S2 |S3 = s].
(c) Find E[S3 |N1 = 2].
Problem 15.12 Let S = Σ_{i=1}^{N} Xi denote the total amount of money withdrawn from an ATM in 8 h, where:
(a) Xi are i.i.d. random variables denoting the amount withdrawn by each customer
with E[Xi ] = 30 and V ar[Xi ] = 400.
(b) N is a Poisson random variable denoting the total number of customers with
E[N ] = 80.
Problem 15.13 One is given two independent Poisson processes Mt and Nt with
respective rates λ and μ, where λ > μ. Find E(τ ), where
τ = max{t ≥ 0 | Mt ≤ Nt + 5}.
Problem 15.14 Consider a queue with Poisson arrivals with rate λ. The service
times are all equal to one unit of time. Let Xt be the queue length at time t (t ≥ 0).
Problem 15.15 Consider a queue with Poisson arrivals with rate λ. The queue can
hold N customers. The service times are i.i.d. Exp(μ). When a customer arrives,
you can choose to pay him c so that he does not join the queue. You also pay c when
a customer arrives at a full queue. You want to decide when to accept customers to
minimize the cost of rejecting them, plus the cost of the average waiting time they
spend in the queue.
(a) Formulate the problem as a Markov decision problem. For simplicity, consider
a total discounted cost. That is, if xt customers are in the system at time t, then the waiting cost during [t, t + dt] is e^{−βt} xt dt. Similarly, if you reject a customer at time t, then the cost is c e^{−βt}.
(b) Write the dynamic programming equations.
(c) Use Python to solve the equations.
Problem 15.17 Figure 15.18 shows a system where a source alternates between the
ON and OFF states according to a continuous-time Markov chain with the transition
rates indicated. When the source is ON, it sends a fluid with rate 2 into the queue.
When the source is OFF, it does not send any fluid. The queue is drained at constant
rate 1 whenever it contains some fluid. Let Xt be the amount of fluid in the queue at
time t ≥ 0.
Problem 15.18 Let {Nt, t ≥ 0} be a Poisson process with rate λ and let T be an independent random time that is exponentially distributed with rate μ > 0.
Problem 15.19 Consider two queues in parallel in discrete time with Bernoulli
arrival processes of rates λ1 and λ2 , and geometric service rates of μ1 and μ2 ,
respectively. There is only one server, which can serve either queue 1 or queue 2 at each time. Consider the scheduling policy that serves queue 1 at time n if μ1 Q1(n) > μ2 Q2(n), and serves queue 2 otherwise, where Q1(n) and Q2(n) are the queue lengths at time n. Use the Lyapunov function V(Q1(n), Q2(n)) = Q1²(n) + Q2²(n) to show that the queues are stable if λ1/μ1 + λ2/μ2 < 1. This scheduling policy is known as the Max-Weight or Back-Pressure policy.
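As a quick illustration of the policy (not part of the problem statement), here is a minimal Python simulation sketch; the arrival and service parameters below are made up so that λ1/μ1 + λ2/μ2 < 1.

import random

lam = (0.3, 0.3)     # Bernoulli arrival probabilities (illustrative)
mu = (0.7, 0.8)      # success probability of a service attempt in one slot
q = [0, 0]
random.seed(1)
for t in range(100000):
    for i in (0, 1):                       # arrivals
        if random.random() < lam[i]:
            q[i] += 1
    # Max-Weight: serve the queue with the larger mu_i * Q_i(n)
    i = 0 if mu[0] * q[0] > mu[1] * q[1] else 1
    if q[i] > 0 and random.random() < mu[i]:
        q[i] -= 1
print(q)   # stays small, since 0.3/0.7 + 0.3/0.8 = 0.80 < 1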
A Elementary Probability
A.1 Symmetry
Fig. A.1 Ten marbles marked with a blue and a red number: (1, 1), (1, 1), (1, 3), (1, 3), (1, 4), (1, 4), (1, 4), (2, 3), (2, 4), (2, 4)
Each marble is marked with a blue number B and a red number R, as shown in Fig. A.1. You pick one marble at random, so that each of the ten marbles is equally likely to be picked. For a set A of marbles, P(A) = |A|/10, where |A| is the number of marbles in the set A. For
instance, if A is the set of marbles where (B = 1, R = 3) or (B = 2, R = 4), then
P (A) = 0.4 since there are four such marbles out of 10.
It is clear that if A1 and A2 are disjoint sets (i.e., have no marble in common),
then P (A1 ∪ A2 ) = P (A1 ) + P (A2 ). Indeed, when A1 and A2 are disjoint, the
number of marbles in A1 ∪ A2 is the number of marbles in A1 plus the number
of marbles in A2 . If we divide by ten, we conclude that the probability of picking
a marble that is in A1 ∪ A2 is the sum of the probabilities of picking one in A1
or in A2 . We say that probability is additive. This property extends to any finite
collection of events that are pairwise disjoint.
Note that if A1 and A2 are not disjoint, then P (A1 ∪ A2 ) < P (A1 ) + P (A2 ). For
instance, if A1 is the set of marbles such that B = 1 and A2 is the set of marbles such
that R = 4, then P (A1 ∪ A2 ) = 0.9, whereas P (A1 ) + P (A2 ) = 0.7 + 0.5 = 1.2.
What is happening is that P (A1 ) + P (A2 ) is double-counting the marbles that are
in both A1 and A2 , i.e., the marbles such that (B = 1, R = 4). We can eliminate
this double-counting and check that P (A1 ∪ A2 ) = P (A1 ) + P (A2 ) − P (A1 ∩ A2 ).
Thus, one has to be a bit careful when examining the different ways that
something can happen. When adding up the probabilities of these different ways,
one should make sure that they are exclusive, i.e., that they cannot happen together.
For example, the probability that your car is red or that it is a Toyota is not the
sum of the probability that it is red plus the probability that it is a Toyota. This sum
double-counts the probability that your car is a red Toyota. Such double-counting
mistakes are surprisingly common.
A.2 Conditioning
Now, imagine that you pick a marble and tell me that B = 1. How do I guess R?
Looking at the marbles, we see that there are 7 marbles with B = 1, among
which two are such that R = 1. Thus, given that B = 1, the probability that R = 1
is 2/7. Indeed, given that B = 1, you are equally likely to have picked any one of
the 7 marbles with B = 1. Since 2 out of these 7 marbles are such that R = 1, we
conclude that the probability that R = 1 given that B = 1 is 2/7.
We write P [R = 1 | B = 1] = 2/7. Similarly, P [R = 3 | B = 1] = 2/7
and P [R = 4 | B = 1] = 3/7. We say that P [R = 1 | B = 1] is the conditional
probability that R = 1 given that B = 1.
So, we are not sure of the value of R when we are told that B = 1, but the
information is useful. For instance, you can see that P [R = 1 | B = 2] = 0,
whereas P [R = 1 | B = 1] = 2/7. Thus, knowing B tells us something about R.
P(B = 1, R = 1) = P(B = 1) × P[R = 1 | B = 1].
Indeed, P(B = 1, R = 1) = 2/10 = (7/10) × (2/7) = P(B = 1) × P[R = 1 | B = 1].
To make the previous identity intuitive, we argue in the following way. For (B =
1, R = 1) to occur, B = 1 must occur and then R = 1 must occur given that B = 1.
Thus, the probability of (B = 1, R = 1) is the probability of B = 1 times the
probability of R = 1 given that B = 1.
The previous identity shows that
P[R = 1 | B = 1] = P(B = 1, R = 1)/P(B = 1).
More generally, one defines
P[R = r | B = b] = P(B = b, R = r)/P(B = b)   (A.1)
and, equivalently,
P(B = b, R = r) = P(B = b) × P[R = r | B = b].   (A.2)
In that model, we see that P[High Fever | Ebola] = 1 > P[High Fever | Flu] = 0.75, and P[Flu | High Fever] = 15/16 > P[Ebola | High Fever] = 1/16.
The discussion so far probably seems quite elementary. However, most of the
confusion about probability arises with these basic ideas. Let us look at some
examples.
You are told that Bill has two children and one of them is named Isabelle.
What is the probability that Bill has two daughters? You might argue that Bill’s
other child has a 50% probability of being a girl, so that the probability that
Bill has two daughters must be 0.5. In fact, the correct answer is 1/3. To see
this, look at the four equally likely outcomes for the sex of the two children:
(M, M), (F, M), (M, F ), (F, F ) where (M, F ) means that the first child is male
and the second is female, and similarly for the other cases. Out of these four
outcomes, three are consistent with the information that “one of them is named
Isabelle.” Out of these three outcomes, one corresponds to Bill having two daugh-
ters. Hence, the probability that Bill has two daughters given that one of his two
children is named Isabelle is 1/3, not 50%.
This example shows that confusion in Probability is not caused by the sophis-
tication of the mathematics involved. It is not a lack of facility with Calculus
or Algebra that causes the difficulty. It is the lack of familiarity with the basic
formalism: looking at the possible outcomes and identifying precisely what the
given information tells us about these outcomes.
Another common source of confusion concerns chance fluctuations. Say that you
flip a fair coin ten times. You expect about half of the outcomes to be tails and half
to be heads. Now, say that the first six outcomes happen to be heads. Do you think
the next four are more likely to be tails, to catch up with the average? Of course not.
After 4 years of drought in California, do you expect the next year to be rainier than
average? You should not.
Surprisingly, many people believe in the memory of purely random events. A
useful saying is that “lady luck has no memory nor vengeance.”
A related concept is “regression to the mean.” A simple example goes as follows.
Flip a fair coin twenty times. Say that eight of the first ten flips are heads. You
expect the next ten flips to be more balanced. This does not mean that the next ten
flips are more likely to be tails to compensate for the first ten flips. It simply means
that the abnormal fluctuations in the first ten flips do not carry over to the next ten
flips. More subtle scenarios of this example involve the stock market or the scores of
sports teams, but the basic idea is the same. Of course, if you do not know whether
the coin is fair or biased, observing eight heads out of the first ten flips suggests that
the coin is biased in favor of heads, so that the next ten coin flips are likely to give
more heads than tails. But, if you have observed many flips of that coin in the past,
then you may know that it is fair, and in that case regression to the mean makes
sense.
Now, you might ask “how can about half of the coin flips be heads if the coin
does not make up for an excessive number of previous tails?” The answer is that
among the 2^10 = 1024 equally likely strings of 10 heads and tails, a very large proportion have about 50% of heads. Indeed, 672 such strings have either 4, 5, or 6 heads. Thus the probability that the fraction of heads is between 40% and 60% is 672/1024 = 65.6%. This probability gets closer to one as you flip more coins. For twenty coins, the probability that the fraction of heads is between 40% and 60% is 73.5%.
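These figures are easy to check numerically. Here is a small Python sketch (not from the text) that computes the probability that the fraction of heads falls in [0.4, 0.6]:

from math import comb

def prob_fraction_between(n, lo=0.4, hi=0.6):
    # P(lo <= (number of heads)/n <= hi) for n fair coin flips
    return sum(comb(n, k) for k in range(n + 1) if lo <= k / n <= hi) / 2**n

print(prob_fraction_between(10))   # 672/1024, about 0.656
print(prob_fraction_between(20))   # about 0.74, and it keeps increasing with n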
To avoid being confused, always keep the basic formalism in mind. What are the
outcomes? How likely are they? What does the known information tell you about
them?
A.4 Independence
Look at the marbles in Fig. A.2. For these marbles, we see that P [R = 1 | B =
1] = P [R = 3 | B = 1] = 2/4 = 0.5. Also, P [R = 1 | B = 2] = P [R = 3 | B =
2] = 3/6 = 0.5. Thus, knowing the value of B does not change the probability of
the different values of R. We say that for this experiment, R and B are independent.
Here, the value of R tells us something about which marble you picked, but that
information does not change the probability of the different values of B.
In contrast, for the marbles in Fig. A.1, we saw that P [R = 1 | B = 2] = 0
and P [R = 1 | B = 1] = 2/7 so that, for that experiment, R and B are not
independent: knowing the value of B changes the probability of R = 1. That is, B
tells you something about R. This is rather common. The temperature in Berkeley
tells us something about the temperature in San Francisco. If it rains in Berkeley, it
is likely to rain in San Francisco.
This fact, that observations tell us something about what we do not observe directly,
is central in applied probability. It is at the core of data science. What information
do we get from data? We explore this question later in this appendix.
Summarizing, we say that B and R are independent if P[R = r | B = b] = P(R = r) for all values of r and b. In view of (A.2), B and R are independent if
P(B = b, R = r) = P(B = b) × P(R = r) for all values of r and b.
Fig. A.2 Ten other marbles marked with a blue and a red number: (1, 1), (1, 1), (1, 3), (1, 3), (2, 1), (2, 1), (2, 1), (2, 3), (2, 3), (2, 3)
For example, flip a fair coin ten times and let X be determined by the first four flips and Y by the last six flips. Suppose that the event A = {X = x} corresponds to a outcomes of the first four flips and that the event B = {Y = y} corresponds to b outcomes of the last six flips. Then
P(A) = (a × 2^6)/2^10 = a/2^4  and  P(B) = (2^4 × b)/2^10 = b/2^6.
Moreover,
P(A ∩ B) = (a × b)/2^10.
Hence,
P(A ∩ B) = P(A) × P(B).
If you look back at the calculation, you will notice that it boils down to the area of a
rectangle being the product of the sides. The key observation is then that the set of
outcomes where X = x and Y = y is a rectangle. This is so because X = x imposes
a constraint on the first four flips and Y = y imposes a constraint on the other flips.
A.5 Expectation
Going back to the marbles of Fig. A.1, what do you expect the value of B to be? How
much would you be willing to pay to pick a marble given that you get the value of B
in dollars? An intuitive argument is that if you were to repeat the experiment 1000
times, you should get a marble with B = 1 about 70% of the time, i.e., about 700
times. The other 300 times, you would get B = 2. The total amount should then be
1 × 700 + 2 × 300. The average value per experiment is then
(1 × 700 + 2 × 300)/1000 = 1 × 0.7 + 2 × 0.3.
We call this number the expected value of B and we write it E(B). Similarly, we define
E(R) = 1 × 0.2 + 3 × 0.3 + 4 × 0.5 = 3.1.
Thus, the expected value is defined as the sum of the values multiplied by
their probability. The interpretation we gave by considering the experiment being
repeated a large number of times is only an interpretation, for now.
Reviewing the argument, and extending it somewhat, let us assume that we have
N marbles marked with a number X that takes the possible values {x1 , x2 , . . . , xM }
and that a fraction pm of the marbles are marked with X = xm , for m = 1, . . . , M.
Then we write P (X = xm ) = pm for m = 1, . . . , M. We define the expected value
of X as
E(X) = Σ_{m=1}^{M} x_m p_m = Σ_{m=1}^{M} x_m P(X = x_m).   (A.4)
Consider a random variable X that is equal to the same constant x for every
outcome. For instance, X could be a number on a marble when all the marbles are
marked with the same number x. In this case,
E(X) = x × P (X = x) = x.
Thus, the expected value of a constant is the constant. For instance, if we designate
by a a random variable that always takes the value a, then we have
E(a) = a.
There is a slightly different but very useful way to compute the expectation. We
can write E(X) as the sum over all the possible marbles we could pick of the product
of X for that marble times the probability 1/N that we pick that particular marble.
Doing this, we have
E(X) = Σ_{n=1}^{N} X(n) × (1/N),
where X(n) is the value of X for marble n. This expression gives the same value as
the previous calculation. Indeed, in this sum there are pm N terms with X(n) = xm
because we know that a fraction pm of the N marbles, i.e., pm N marbles, are marked
with X = xm . Hence, the sum above is equal to
Σ_{m=1}^{M} (1/N)(p_m N) x_m = Σ_{m=1}^{M} p_m x_m,
which is the expected value of X as defined in (A.4). This second expression makes some properties of expectation easy to see. For instance, for the marbles of Fig. A.1,
E(B + R) = (1 + 1)(1/10) + · · · + (2 + 4)(1/10).
If we decompose this sum by regrouping the values of B and then those of R, we
see that
E(B + R) = [1 × (1/10) + · · · + 2 × (1/10)] + [1 × (1/10) + · · · + 4 × (1/10)].
The first sum is E(B) and the second is E(R). Thus, the expected value of a sum is
the sum of the expected values. We say that expectation is linear. Notice that this is
so even though the values B and R are not independent.
More generally, for our N marbles, if marble n is marked with two numbers X(n)
and Y (n), then we see that
E(X + Y) = (1/N) Σ_{n=1}^{N} (X(n) + Y(n)) = (1/N) Σ_{n=1}^{N} X(n) + (1/N) Σ_{n=1}^{N} Y(n) = E(X) + E(Y).   (A.5)
Linearity shows that if we get 5 + 3X² + 4Y³ when we pick a marble marked with the numbers X and Y, then
E(5 + 3X² + 4Y³) = 5 + 3E(X²) + 4E(Y³).
Indeed,
E(5 + 3X² + 4Y³) = (1/N) Σ_{n=1}^{N} (5 + 3X²(n) + 4Y³(n)) = 5 + 3E(X²) + 4E(Y³).
Similarly, assume now that X and Y are independent. Then
E(XY) = (1/N) Σ_{n=1}^{N} X(n)Y(n) = Σ_i Σ_j x_i y_j N(i, j) (1/N),
where N (i, j ) is the number of marbles marked with (xi , yj ). We obtained the
last term by regrouping the terms based on the values of X(n) and Y (n). Now,
N (i, j )/N = P (X = xi , Y = yj ). Also, by independence, P (X = xi , Y = yj ) =
P (X = xi )P (Y = yj ). Thus, we can write the sum above as follows:
E(XY) = Σ_i Σ_j x_i y_j P(X = x_i, Y = y_j) = Σ_i Σ_j x_i y_j P(X = x_i) P(Y = y_j).
We now compute the sum on the right by first summing over j . We get
E(XY) = Σ_i [ Σ_j x_i y_j P(X = x_i) P(Y = y_j) ] = [ Σ_i x_i P(X = x_i) ] × [ Σ_j y_j P(Y = y_j) ] = E(X) E(Y),
as claimed. Thus, if X and Y are independent, the expected value of their product is
the product of their expected values.
We can check that this property does not generally hold if the random variables are not independent. For instance, consider R and B in Fig. A.1. We find that E(RB) = 4.2, whereas E(R)E(B) = 3.1 × 1.3 = 4.03.
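A quick numerical check of these marble computations, using the (B, R) pairs of Fig. A.1 (a short Python sketch, not part of the original text):

marbles = [(1,1),(1,1),(1,3),(1,3),(1,4),(1,4),(1,4),(2,3),(2,4),(2,4)]
N = len(marbles)
EB = sum(b for b, r in marbles) / N          # 1.3
ER = sum(r for b, r in marbles) / N          # 3.1
EBR = sum(b * r for b, r in marbles) / N     # 4.2, not equal to EB * ER = 4.03
print(EB, ER, EBR, EB * ER)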
Summing up, we saw the following results about expectation:
• expectation is linear
• expectation is monotone
• the expected value of the product of two independent random variables is the product of their expected values.
A.6 Variance
The variance of a random variable X is defined as var(X) = E((X − E(X))²). It measures the variability around the mean: if the variance is small, the random variable is likely to be close to its mean.
By linearity of expectation, we have
E((X − E(X))²) = E(X²) − 2E(X)E(X) + E(X)².
Hence,
var(X) = E(X²) − E(X)².
Similarly, for a constant a one finds var(aX) = E(a²X²) − (aE(X))² = a² var(X).
Now, assume that X and Y are independent random variables. Then we find that
var(X + Y) = var(X) + var(Y),
where the last expression results from the fact that E(XY) = E(X)E(Y) when the random variables are independent.
The square root of the variance is called the standard deviation.
Summing up, we saw the following results about variance:
• when one multiplies a random variable by a constant, its variance gets multiplied
by the square of the constant
• the variance of the sum of independent random variables is the sum of their
variances
• the standard deviation of a random variable is the square root of its variance.
A.7 Inequalities
The fact that expectation is monotone yields some inequalities that are useful to
bound the probability that a random variable takes large values. Intuitively, if a
random variable is likely to take large values, its expected value is large.
The simplest such inequality is as follows. Let X be a random variable that is
always non-negative, then
P(X ≥ a) ≤ E(X)/a, for a > 0.
This is called Markov’s inequality. To prove it, we define the random variable Y as
being 0 when X < a and 1 when X ≥ a. Hence, E(Y) = P(Y = 1) = P(X ≥ a).
We note that Y ≤ X/a. Indeed, that inequality is immediate if X < a because then
Y = 0 and X/a > 0. It is also immediate when X ≥ a because then Y = 1 and
X/a ≥ 1. Consequently, by monotonicity of expectation, E(Y ) ≤ E(X/a). Hence,
P(X ≥ a) = E(Y) ≤ E(X/a) = E(X)/a, as claimed.
A related result is Chebyshev's inequality, obtained by applying Markov's inequality to the nonnegative random variable (X − E(X))²:
P(|X − E(X)| ≥ ε) ≤ var(X)/ε², for ε > 0.
As an application, let X1, . . . , Xn be independent random variables with the same mean μ and the same variance σ², and let Y = (X1 + · · · + Xn)/n denote their average. Then
E(Y) = E[(X1 + · · · + Xn)/n] = nμ/n = μ.
Also,
var(Y) = (1/n²) var(X1 + · · · + Xn) = (1/n²) × nσ² = σ²/n.
Consequently, Chebyshev's inequality implies that
P(|Y − μ| ≥ ε) ≤ σ²/(nε²).
This probability becomes arbitrarily small as n increases. Thus, if Y is the
average of n random variables that are independent and have the same mean μ
and the same variance, then Y is very close to μ, with a high probability, when n is
large. This is called the Weak Law of Large Numbers.
Note that this result extends to the case where the random variables are
independent, have the same mean, and have a variance bounded by some σ².
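The following minimal Python sketch (not from the text) illustrates the weak law for fair coin flips; it compares the empirical frequency of the event {|Y − 0.5| ≥ 0.05} with the Chebyshev bound σ²/(nε²) = 100/n (the run counts are illustrative):

import random

random.seed(0)
for n in (10, 100, 1000, 10000):
    runs = 2000
    bad = sum(abs(sum(random.random() < 0.5 for _ in range(n)) / n - 0.5) >= 0.05
              for _ in range(runs))
    # Chebyshev: P(|Y - 0.5| >= 0.05) <= 0.25/(n * 0.05**2) = 100/n
    print(n, bad / runs, min(1.0, 100 / n))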
Consider once again the N marbles with the numbers (X, Y). We define the covariance of X and Y as
cov(X, Y) := E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).
Intuitively, the covariance indicates how X and Y vary together, and it suggests that observing X should help us estimate Y. Suppose that we use a linear estimate of the form Ŷ = E(Y) + a + b(X − E(X)). To make this precise, we look for the values of a and b that minimize E((Y − Ŷ)²). That is, we want the error Y − Ŷ to be small, on average. We consider the square of the error because this is much easier to analyze than choosing the absolute value of the error.
Now,
E((Y − Ŷ)²) = E(([Y − E(Y)] − a − b[X − E(X)])²) = var(Y) + a² + b² var(X) − 2b cov(X, Y).
To do the calculation, we used the linearity of expectation and the facts that the expected values of X − E(X) and Y − E(Y) are equal to zero. To minimize this expression over a, we should choose a = 0. To minimize it over b, we set the derivative with respect to b equal to zero and we find
2b var(X) − 2 cov(X, Y) = 0,
so that b = cov(X, Y)/var(X). Hence,
Ŷ = E(Y) + [cov(X, Y)/var(X)] (X − E(X)).
We call Ŷ the Linear Least Squares Estimate (LLSE) of Y given X. It is the linear
function of X that minimizes the mean squared error with Y .
As an example, consider again the N marbles. There,
E(X) = (1/N) Σ_{n=1}^{N} X(n),   E(Y) = (1/N) Σ_{n=1}^{N} Y(n),
var(X) = (1/N) Σ_{n=1}^{N} X²(n) − [E(X)]²,
var(Y) = (1/N) Σ_{n=1}^{N} Y²(n) − [E(Y)]²,
cov(X, Y) = (1/N) Σ_{n=1}^{N} X(n)Y(n) − E(X)E(Y).
In this case, one calls the resulting expression for Ŷ the linear regression of Y
against X. The linear regression is the same as the LLSE when one considers that
(X, Y ) are random variables that are equal to a given sample (X(n), Y (n)) with
probability 1/N for n = 1, . . . , N . That is, to compute the linear regression, one
assumes that the sample values one has observed are representative of the random
pair (X, Y ) and that each sample is an equally likely value for the random pair.
A.10 Why Do We Need a More Sophisticated Formalism?
The previous sections show that one can get quite far in the discussion of probability
concepts by considering a finite set of marbles, and random variables that can only
take finitely many values. In engineering, one might think that this is enough. One
may approximate any sensible quantity with a finite number of bits, so for all
applications one may consider that there are only finitely many possibilities. All
that is true, but it results in clumsy models. For instance, try to write the equations
of a falling object with discretized variables. The continuous versions are usually
simpler than the discrete ones. As another example, a Gaussian random variable is
easier to work with than a binomial or Poisson random variable, as we will see.
Thus we need to extend the model to random variables that have an infinite, even
an uncountable set of possible values. Does this step cause formidable difficulties?
Not at the intuitive level. The continuous version is a natural extension of the
discrete case. However, there is a philosophical difficulty in going from discrete
to continuous. Some thinkers do not accept the idea of making an infinite number
of choices before moving on. That is, say that we are given an infinite collection
{A1 , A2 , . . .} of nonempty sets. Can we reasonably define a new set B that contains
one element from each An ? We can define it, but if there is no finite way of building
it, can we assume that it exists? A theory that does not rely on this axiom of choice
is considerably more complex than those that do. The classical theory of probability
(due to Kolmogorov) accepts the axiom of choice, and we follow that theory.
One key axiom of probability theory enables one to define the probability of a set A
of outcomes as a limit of that of simpler sets An that approach A. This is similar
to approximating the area of a circle by the sum of the areas of disjoint rectangles
that approach it from inside, or approximating an integral by a sum of rectangles.
This key axiom says that if A1 ⊂ A2 ⊂ A3 ⊂ · · · and A = ∪n An , then P (A) =
lim P (An ). Thus, if sets An approximate A from inside in the sense that these sets
grow and eventually contain every point of A, then the probability of A is equal to
the limit of the probability of An . This is a natural way of extending the definition
of probability of simple sets to more complex ones. The trick is to show that this
is a consistent definition in the sense that different approximating sequences of sets
must have the same limiting probability.
This key axiom enables one to prove the strong law of large numbers. That law states
that as you keep on flipping coins, the fraction of heads converges to the probability
that one coin yields heads. Thus, not only is the fraction of heads very likely to be
close to that probability when you flip many coins, but in fact the fraction gets closer
and closer to that probability. This property justifies the frequentist interpretation of
probability of an event as the long-term fraction of time that event occurs when one
repeats the experiment. This is the interpretation that we used to justify the definition
of expected value.
A.11 References
There are many useful texts and websites on elementary probability. Readers might
find Walrand (2019) worthwhile, especially since it is free on Kindle.
A.12 Solved Problems
Problem A.1 You have a bag with 20 red marbles and 30 blue marbles. You shake
the bag and pick three marbles, one at a time, without replacement. What is the
probability that the third marble is red?
Solution As is often the case, there is a difficult and an easy way to solve this
problem. The difficult way is to consider the first marble, then find the probability
that the second marble is red or blue given the color of the first marble, then find
the probability that the third marble is red given the colors of the first two marbles.
The easy way is to notice that, by symmetry, the probability that the third marble
is red is the same as the probability that the first marble is red, which is 20/50 = 0.4.
It may be useful to make the symmetry argument explicit. Think of the marbles as
being numbered from 1 to 50. Imagine that shaking the bag results in some ordering
in which the marbles would be picked one by one out of the bag. All the orderings
are equally likely. Now think of interchanging marble one and marble three in each
ordering. You end up with a new set of orderings that are again equally likely. In
this new ordering, the third marble is the first one to get out of the bag. Thus, the
probability that the third marble is red is the same as the probability that the first
marble is red.
Problem A.2 Your applied probability class has 275 students who all turn in their
homework assignment. The professor returns the graded assignments in a random
order to the students. What is the expected number of students who get their own
assignment back?
Solution The difficult way to solve the problem is to consider the first assignment,
then the second, and so on, and for each to explore what happens if it is returned to
its owner or not. The probability that one student gets her assignment back depends
on what happened to the other students. It all seems very complicated.
The easy way is to argue that, by symmetry, the probability that any given student
gets his assignment back is the probability that the first one gets his assignment
back, which is 1/275. Let then Xn = 1 if student n gets his/her own assignment and
Xn = 0 otherwise. Thus, E(Xn ) = 1/275. The number of students who get their
assignment back is X1 + · · · + X275 . Now, by linearity of expectation, E(X1 + · · · +
X275 ) = E(X1 ) + · · · + E(X275 ) = 275 × (1/275) = 1.
Problem A.3 A monkey types a string of one million symbols, each picked independently and uniformly at random from a keyboard with 40 symbols. What is the expected number of times that the name "walrand" appears in the string?
Solution The easy solution uses the linearity of expectation. Let Xn = 1 if the name
“walrand” appears in the sequence, starting at the n-th symbol of the string. The
number of times that the name appears is then Z = X1 + · · · + XN with N = 10^6 − 6. By symmetry, E(Xn) = E(X1) for all n. Now, the probability that X1 = 1 is equal to the probability that the first symbol is w, that the second symbol is a, and so on. Thus, E(X1) = P(X1 = 1) = (1/40)^7. Hence, the expected number of times that "walrand" appears is E(Z) = (10^6 − 6) × (1/40)^7 ≈ 6 × 10^{−6}. So, it is true that
a monkey could eventually type one of Shakespeare’s plays, but he is likely to die
before succeeding.
Note that Markov’s inequality implies that
P(Z ≥ 1) ≤ E(Z) ≈ 6 × 10^{−6}.
Problem A.4 You flip a fair coin n times and the fraction of heads is Y . How large
does n have to be to be sure that P (|Y − 0.5| ≥ 0.05) ≤ 0.05?
Solution By Chebyshev's inequality,
P(|Y − 0.5| ≥ ε) ≤ var(Y)/ε².
We saw in our discussion of the weak law of large numbers that var(Y ) =
var(X1 )/n where X1 = 1 if the first coin yields heads and X1 = 0 otherwise. Since
P(X1 = 1) = P(X1 = 0) = 0.5, we find that E(X1) = 0.5 and E(X1²) = 0.5.
Hence, var(X1) = E(X1²) − [E(X1)]² = 0.25. Consequently,
P(|Y − 0.5| ≥ 0.05) ≤ 0.25/(n × 25 × 10⁻⁴) = 100/n.
This bound is at most 0.05 as soon as n ≥ 2000.
Problem A.5 What is the probability that two friends share a birthday? What about
three friends? What about n? How large does n have to be for this probability to be
50%?
Solution In the case of two friends, it is the probability that the second has the same
birthday as the first, which is 1/365 (ignoring February 29).
The case of three friends looks more complicated: two of the three or all of them
could share a birthday. It is simpler to look at the probability that they do not share
a birthday. This is the probability that the second friend does not have the same
birthday as the first, which is 364/365 times the probability that the third does not
share a birthday with the first two, which is 363/365. Let us explore this a bit further
to make sure we fully understand this solution. First, we consider all the strings of
three numbers picked in {1, 2, . . . , 365}. There are 365³ such strings because there
are 365 choices for the first number, then 365 for the second, and finally 365 for
the third. Second, consider the strings of three different numbers from the same set.
There are 365 choices for the first, then 364 for the second, then 363 for the third.
Hence, there are 365 × 364 × 363 such strings. Since all the strings are equally
likely to be picked (a reasonable assumption), the probability that the friends do not
share a birthday is
(365 × 364 × 363)/365³ = (364/365) × (363/365).
The case of n friends is then clear: they do not share a birthday with probability p, where
p = (1 − 1/365) × (1 − 2/365) × · · · × (1 − (n − 1)/365).
To evaluate this expression, we use the fact that 1 − x ≈ exp{−x} when |x| ≪ 1. We use this fact repeatedly in this book. Do not worry, there are not too many
such tricks. In practice, this approximation is good for |x| < 0.1. For instance,
exp{−0.1} ≈ 0.90483. Thus, assuming that n/365 ≤ 0.1, i.e., n ≤ 36, we find
p ≈ 1 × exp{−1/365} × exp{−2/365} × · · · × exp{−(n − 1)/365}
= exp{−1/365 − 2/365 − · · · − (n − 1)/365} = exp{−(1 + 2 + · · · + (n − 1))/365}
= exp{−(n − 1)n/730}.
Hence, the probability that at least two friends in a group of 24 share a birthday
is about 50%. This result is somewhat surprising because 24 is small compared to
365. One calls this observation the birthday paradox. Many people think that it takes
about 365/2 ≈ 180 friends for the probability that they share a birthday to be 50%.
The paradox is less mysterious when you think of the many ways that friends can
share birthdays.
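A short Python sketch (not from the text) that compares the exact product formula for p with the exponential approximation used above:

from math import exp

def p_no_shared(n, days=365):
    p = 1.0
    for k in range(1, n):
        p *= (days - k) / days
    return p

for n in (10, 20, 23, 24, 30):
    # probability that at least two of n friends share a birthday: exact, then approximate
    print(n, 1 - p_no_shared(n), 1 - exp(-(n - 1) * n / 730))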
Problem A.6 You throw M marbles into B bins, each time independently and in
a way that each marble is equally likely to fall into each bin. What is the expected
number of empty bins? What is the probability that no bin contains more than one
marble?
Solution The first bin is empty with probability α := [(B − 1)/B]^M, and the same is true for every bin. Hence, if Xb = 1 when bin b is empty and Xb = 0 otherwise, we see that E(Xb) = α. Hence, the expected value of the number Z = X1 + · · · + XB of empty bins is equal to
E(Z) = Bα = B[(B − 1)/B]^M = B[1 − 1/B]^M.
Using the approximation 1 − 1/B ≈ exp{−1/B} and writing β := M/B, we find that
E(Z) ≈ B exp{−β}.
For instance, with M = 20 and B = 30, one has β = 2/3 and E(Z) ≈
30 exp{−2/3} ≈ 15. That is, the 20 marbles are likely to fall into 15 of the 30
bins.
The probability that no bins contain more than one marble is the same as the
probability that no two friends share a birthday when there are B different days and
M friends. We saw in the last problem that this is given by
exp{−M(M − 1)/(2B)} ≈ exp{−M²/(2B)}.
Problem A.7 You store M files. For each file, you compute a b-bit checksum and attach the checksum to the file. When you read the file, you recompute the checksum and
you compare with the one attached to the file. If the checksums agree, you assume
that no storage/retrieval error occurred. How large can M be before the probability
that two files share a checksum exceeds 10⁻⁶? A similar scheme is used as a digital
signature to make sure that files are not modified.
Solution There are B = 2^b possible checksums. Let us assume that each file is equally likely to get any one of the B checksums. In view of the previous problem,
we want to find M such that
M2
exp − = 10−6 = exp{−6 log(10)} ≈ exp{−14}.
2B
√
Thus, M 2 /(2B) = 14, so that M 2 = 28B = 28 × 2b and M = 2b/2 28 ≈
5.3 × 2b/2 . With b = 32, we find M ≈ 350,000.
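A small Python sketch of this calculation for a few checksum sizes b (the formula solves exp{−M²/(2B)} = 10⁻⁶ for M; the set of b values is illustrative):

from math import log, sqrt

for b in (32, 64, 128):
    B = 2**b
    M = sqrt(2 * B * 6 * log(10))     # largest M keeping the collision probability below 1e-6
    print(b, int(M))                  # for b = 32 this is about 350,000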
Problem A.8 N people apply for a job with your company. You will interview
them sequentially but you must either hire or decline a person right at the end of
the interview. How should you proceed to maximize the chance of picking the best
of all the candidates? Implicitly, we assume that the qualities of the candidates are
all independent and equally likely to be any number in {1, . . . , Q} where Q is very
large.
Solution The best strategy is to interview and decline about M = N/e candidates
and then hire the first subsequent candidate who is better than those M. Here, e =
exp{1} ≈ 2.72. If no candidate among {M + 1, . . . , N} is better than the first M,
you hire the last candidate.
To justify this procedure, we compute the probability that the candidate you select
is the best, for a given value of M. By symmetry, the best candidate appears in
position b with probability 1/N. You then pick the best candidate if b > M and if
the best candidate among the first b − 1 is among the first M, which has probability
M/(b − 1), by symmetry. Since probability is additive, the probability p that you
pick the best candidate is given by
p = Σ_{b=M+1}^{N} (1/N)(M/(b − 1)) = (M/N) Σ_{b=M}^{N−1} (1/b) ≈ (M/N) ∫_M^N (1/b) db = (M/N)[log(N) − log(M)].
To find the maximizing value of M, we set the derivative of this expression with
respect to M equal to zero. This shows that N/M ≈ e.
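A minimal Python simulation sketch (not from the text) of this strategy, with N = 100 candidates as an illustrative value; the estimated success probability is close to 1/e ≈ 0.37:

import random
from math import e

def success(N, M):
    ranks = list(range(N))            # rank N-1 is the best candidate
    random.shuffle(ranks)
    best_seen = max(ranks[:M])
    for r in ranks[M:]:
        if r > best_seen:
            return r == N - 1         # hire the first candidate better than the first M
    return ranks[-1] == N - 1         # otherwise hire the last candidate

random.seed(0)
N = 100
M = round(N / e)
print(sum(success(N, M) for _ in range(20000)) / 20000)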
B Basic Probability
The general model of Probability Theory may seem a bit abstract and disconcerting.
However, it unifies all the key ideas in a systematic framework and results in a great
conceptual clarity. You should try to keep in mind this underlying framework when
we discuss concrete examples.
B.1 General Framework
To describe a random experiment, one first specifies the set Ω of all the possible
outcomes. This set is called the sample space. For instance, when we flip a coin, the
sample space is Ω = {H, T }; when we roll a die, Ω = {1, 2, 3, 4, 5, 6}, when one
measures a voltage one may have Ω = ℝ = (−∞, +∞); and so on.
Second, one specifies the probability that the outcome falls in subsets of Ω. That
is, for A ⊂ Ω, one specifies a number P (A) ∈ [0, 1] that represents the likelihood
that the random experiment yields an outcome in A. For instance, when rolling a
die, the probability that the outcome is in a set A ⊆ {1, 2, 3, 4, 5, 6} is given by
P (A) = |A|/6 where |A| is the number of elements of A. When we measure a
voltage, the probability that it has any given value is typically 0, but the probability
that it is less than 15 in absolute value may be 95%, which is why we specify the
probability of subsets, not of specific outcomes.
Theorem (Borel–Cantelli) Let {An, n ≥ 1} be events such that Σ_n P(An) < ∞. Then
P(An, i.o.) = 0.
Here, {An , i.o.} is defined as the set of outcomes ω that are in infinitely many
sets An . So, stating that the probability of this set is equal to zero means that
the probability that the events An occur for infinitely many n’s is zero. So, the
probability that the events An occur infinitely often is equal to zero. In other words,
for any outcome ω that occurs, there is some m such that An does not occur for any
n larger than m.
Proof Observe that
{An, i.o.} = ∩_n Bn =: B,
where Bn = ∪m≥n Am is a decreasing sequence of sets. To see this, note that the
outcome ω is in infinitely many sets An , i.e., that ω ∈ {An , i.o.}, if and only if for
every n, the outcome ω is in some Am for m ≥ n. Also, ω is in ∪m≥n Am = Bn for
all n if and only if ω is in ∩n Bn = B. Hence ω ∈ {An , i.o.} if and only if ω ∈ B.
Now, B1 ⊇ B2 ⊇ · · · , so that P(Bn) → P(B). Also, P(Bn) ≤ Σ_{m=n}^{∞} P(Am), so that¹
P(Bn) → 0 as n → ∞.
Thus, P(B) = 0, as claimed.
B.1.3 Independence
It is easy to construct events that are pairwise independent but not mutually
independent. For instance, let Ω = {1, 2, 3, 4} where the four outcomes are equally
likely and let A = {1, 2}, B = {1, 3}, and C = {1, 4}. You can check that these
events are pairwise independent but not mutually independent since P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).
¹ Recall that if the nonnegative numbers an are such that Σ_{n=0}^{∞} an < ∞, then Σ_{m=n}^{∞} am goes to zero as n → ∞.
Hence, {An, i.o.}^c = ∪_n B_n^c, where B_n^c := ∩_{m≥n} A_m^c.
Thus, to prove the theorem, it suffices to show that P(B_n^c) = 0 for all n. Indeed, if that is the case, then P(∪_{n=1}^{N} B_n^c) ≤ Σ_{n=1}^{N} P(B_n^c) = 0; moreover, the sets ∪_{n=1}^{N} B_n^c increase with N and their union is ∪_n B_n^c, so that P(∪_n B_n^c) = lim_{N→∞} P(∪_{n=1}^{N} B_n^c) = 0.
Now, since the events Am are mutually independent,
P(B_n^c) = P(∩_{m≥n} A_m^c) = lim_{N→∞} Π_{m=n}^{N} P(A_m^c) = lim_{N→∞} Π_{m=n}^{N} [1 − P(Am)]
≤ lim_{N→∞} Π_{m=n}^{N} exp{−P(Am)} = lim_{N→∞} exp{−Σ_{m=n}^{N} P(Am)} = 0.
In this derivation we used the facts that 1 − x ≤ exp{−x} and that Σ_{m=n}^{N} P(Am) → ∞ as N → ∞.
Let A and B be two events. Assume that P (B) > 0. One defines the conditional
probability P [A|B] of A given B as follows:
P[A|B] := P(A ∩ B)/P(B).
The meaning of P [A|B] is the probability that the outcome of the experiment is
in A given that it is in B. As an example, say that a random experiment has 1000
equally likely outcomes. Assume that A contains |A| outcomes and B contains |B|
outcomes. If we know that the outcome is in B, we know that it is equally likely
to be any one of these |B| outcomes. Given that information, the probability that
the outcome is in A is then the fraction of outcomes in B that are also in A. This
fraction is
|A ∩ B| |A ∩ B|/1000 P (A ∩ B)
= = .
|B| |B|/1000 P (B)
Note that the definition implies that if A and B are independent, then P [A|B] =
P (A), which makes intuitive sense. Also,
P (A ∩ B) = P [A|B]P (B).
This expression extends to more than two events. For instance, with events {A1, . . . , An} one has
P(A1 ∩ · · · ∩ An) = P(A1) P[A2 | A1] P[A3 | A1 ∩ A2] · · · P[An | A1 ∩ · · · ∩ An−1].
Indeed, if one expands each conditional probability as a ratio, the product telescopes and is equal to the left-hand side of the identity above.
The interpretation is the natural one: the probability that X ∈ B is the probability
that the outcome ω is such that X(ω) ∈ B.
In particular, one defines the cumulative distribution function (cdf) of the random
variable X as FX (x) = P (X ∈ (−∞, x]) =: P (X ≤ x). This function is
nondecreasing and right-continuous; it tends to zero as x → −∞ and to one as
x → +∞.
Figure B.1 summarizes this general framework for one random variable.
B.2.1 Definition
A discrete random variable X is specified by the list of its possible values and their probabilities:
X ≡ {(xn, pn), n = 1, 2, . . . , N}.   (B.1)
Here, the xn are real numbers and the pn are positive and add up to one. By
definition, pn is the probability that X takes the value xn and we write
pn = P (X = xn ), n = 1, . . . , N.
The number of values N can be infinite. This list is called the probability mass
function (pmf) of the random variable X.
As an example,
V ≡ {(1, 0.1), (2, 0.3), (3, 0.6)}
is a random variable that has three possible values (1, 2, 3) and takes these values with probability 0.1, 0.3, and 0.6, respectively. Equivalently, one can write
V = 1 with probability 0.1; V = 2 with probability 0.3; V = 3 with probability 0.6.
P (X = xn ) = P ({ω ∈ Ω|X(ω) = xn }) = pn .
B.2.2 Expectation
The expected value, or mean, of the random variable X is denoted E(X) and is
defined as (Fig. B.2)
E(X) = Σ_{n=1}^{N} x_n p_n.
In our example, E(V) = 1 × 0.1 + 2 × 0.3 + 3 × 0.6 = 2.5.
When N is infinite, the definition makes sense unless the sum of the positive
terms and that of the negative terms are both infinite. In such a case, one says that
X does not have an expected value.
It is a simple exercise to verify that the number a that minimizes E((X − a)²) is
a = E(X). Thus, the mean is the “least squares estimate” of X.
B.2.3 Function of a RV
Let h be a function from ℝ to ℝ. Then Y = h(X) is the discrete random variable with pmf
{(h(xn), pn), n = 1, . . . , N}.
Note that the values h(xn) may not be distinct, so that to conform to our definition of the pmf one should merge identical values and add their probabilities.
For instance, say that h(1) = h(2) = 10 and h(3) = 15. Then
h(V) ≡ {(10, 0.4), (15, 0.6)},
where we merged the two values h(1) and h(2) because they are equal to 10. Thus,
E(h(V)) = 10 × 0.4 + 15 × 0.6 = 13.
Observe that
E(h(V)) = Σ_{n=1}^{N} h(vn) pn,
since
Σ_{n=1}^{3} h(vn) pn = h(1) × 0.1 + h(2) × 0.3 + h(3) × 0.6 = 10 × 0.4 + 15 × 0.6 = E(h(V)).
In general, for a discrete random variable X,
E(h(X)) = Σ_{n=1}^{N} h(xn) pn.
B.2.4 Nonnegative RV
We say that X is nonnegative, and we write X ≥ 0, if all its possible values xn are
nonnegative. As before,
E(h1(X) + h2(X)) = Σ_{n=1}^{N} (h1(xn) + h2(xn)) pn = E(h1(X)) + E(h2(X)).
By X ≥ 0 we mean that all the possible values of X are nonnegative, i.e., that
X(ω) ≥ 0 for all ω. In that case, E(X) ≥ 0 since E(X) = Σ_n xn P(X = xn) and
all the xn are nonnegative.
We also write X ≤ Y if X(ω) ≤ Y(ω) for all ω. The linearity of expectation then implies that E(X) ≤ E(Y), since 0 ≤ E(Y − X) = E(Y) − E(X). Hence, expectation is monotone.
Bernoulli We say that X is Bernoulli with parameter p ∈ [0, 1], and we write
X =D B(p), if2
i.e., if
P (X = 0) = 1 − p and P (X = 1) = p.
You should check that E(X) = p and var(X) = p(1 − p). This random variable
models a coin flip where 1 represents “heads” and 0 “tails.”
Geometric We say that X is geometric with parameter p ∈ (0, 1], and we write X =D G(p), if
P(X = n) = (1 − p)^{n−1} p, n ≥ 1.
You should check that E(X) = 1/p and var(X) = (1 − p)/p². This random variable models the number of coin flips until the first "heads" when the probability of heads on each flip is p.
Fig. B.6 The probability mass function of the B(100, p) distribution, for p = 0.1, 0.2, and 0.5
Binomial We say that X is binomial with parameters N and p, and we write X =D B(N, p), if
P(X = n) = (N choose n) p^n (1 − p)^{N−n}, n = 0, 1, . . . , N,
where
(N choose n) = N!/((N − n)! n!).
You should verify that E(X) = Np and var(X) = Np(1 − p). This random variable
models the number of heads in N coin flips; it is the sum of N independent Bernoulli
random variables with parameter p. Indeed, there are (N choose n) strings of N symbols in {H, T} with n symbols H and N − n symbols T. The probability of each of these sequences is p^n (1 − p)^{N−n} (Figs. B.6 and B.7).
Fig. B.7 The binomial distribution as a sum of Bernoulli random variables. At each step, every steel ball moves to the left or to the right with equal probabilities, i.e., by 2Xn − 1 where Xn is Bernoulli with parameter 0.5. The position after N steps is Y = Σ_{n=1}^{N} (2Xn − 1) = 2B(N, 0.5) − N. After M balls, the stacks show approximately the values of M × P(Y = y) for integer y's
B.3 Multiple Discrete Random Variables
Quite often one is interested in multiple random variables. These random variables
may be related. For instance, the weight and height of a person, the voltage that a
transmitter sends and the one that the receiver gets, and the backlog and delay at a
queue are pairs of non-independent random variables (Fig. B.9).
To study such dependent random variables, one needs a description more complete
than simply looking at the random variables individually. Consider the following
example. Roll a die and let X = 1 if the outcome is odd and X = 0 otherwise.
Let also Y = 1 if the outcome is in {2, 3, 4} and Y = 0 if it is in {1, 5, 6}. Note
that P (X = 1) = P (X = 0) = 0.5 and P (Y = 1) = P (Y = 0) = 0.5.
Thus, individually, X and Y could describe the outcomes of flipping two fair coins.
However, jointly, the pair (X, Y ) does not look like the outcomes of two coin flips.
For instance, X = 1 and Y = 1 only if the outcome is 3, which has probability 1/6.
If X and Y were the outcomes of two flips of a fair coin, one would have X = 1 and
Y = 1 in one out of four equally likely outcomes.
In the discrete case, one describes a pair (X, Y ) of random variables by listing
the possible values and their probabilities (see Fig. B.10):
(X, Y) ≡ {((xi, yj), pi,j), i = 1, . . . , m; j = 1, . . . , n},
where the pi,j are nonnegative and add up to one. Here, m and n can be infinite.
This description specifies the joint probability mass function (jpmf) of the random
variables (X, Y ). (See Fig. B.10.)
From this description, one can in particular recover the probability mass of X
and that of Y . For instance,
P(X = xi) = Σ_{j=1}^{n} P(X = xi, Y = yj) = Σ_{j=1}^{n} pi,j.
B.3.2 Independence
We say that the discrete random variables X and Y are independent if P(X = xi, Y = yj) = P(X = xi)P(Y = yj) for all i, j. In the die-roll example above,
P(X = 1, Y = 1) = 1/6 ≠ P(X = 1)P(Y = 1) = 1/4,
so that X and Y are not independent (Fig. B.11).
For a function h(X, Y) of the pair, one has
E(h(X, Y)) = Σ_{i=1}^{m} Σ_{j=1}^{n} h(xi, yj) pi,j.
In particular, if h = h1 + h2, then
E(h(X, Y)) = Σ_{i=1}^{m} Σ_{j=1}^{n} h(xi, yj) pi,j
= Σ_{i=1}^{m} Σ_{j=1}^{n} [h1(xi, yj) + h2(xi, yj)] pi,j
= Σ_{i=1}^{m} Σ_{j=1}^{n} h1(xi, yj) pi,j + Σ_{i=1}^{m} Σ_{j=1}^{n} h2(xi, yj) pi,j
= E(h1(X, Y)) + E(h2(X, Y)),
so that expectation is again linear.
B.3.4 Covariance
One defines the covariance of X and Y as
cov(X, Y) := E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).
One says that X and Y are uncorrelated if cov(X, Y) = 0. One says that X and Y are positively correlated if cov(X, Y) > 0 and that they are negatively correlated if cov(X, Y) < 0 (Fig. B.12).
In the die roll example, one finds
cov(X, Y) = E(XY) − E(X)E(Y) = 1/6 − 1/4 < 0,
so that X and Y are negatively correlated. This negative correlation suggests that if
X is larger than average, then Y tends to be smaller than average. In our example,
we see that if X = 1, then the outcome is odd and Y is more likely to be 0 than 1.
Here is an important result.
Theorem B.3 (a) If X and Y are independent, then they are uncorrelated. (b) The converse is not true in general: uncorrelated random variables need not be independent. (c) If X and Y are uncorrelated, then var(X + Y) = var(X) + var(Y).
Proof
(b) As a simple example (see Fig. B.13), say that (X, Y) is equally likely to take each of the following four values:
(1, 0), (−1, 0), (0, 1), (0, −1).
Then one sees that E(XY ) = 0 = E(X)E(Y ) so that X and Y are uncorrelated.
However, P(X = −1, Y = 1) = 0 ≠ P(X = −1)P(Y = 1), so that X and Y
are not independent.
(c) Let X and Y be uncorrelated random variables. Then
var(X + Y) = E((X + Y)²) − (E(X) + E(Y))²
= E(X²) + 2E(XY) + E(Y²) − E(X)² − 2E(X)E(Y) − E(Y)²
= var(X) + var(Y).
The third equality in this derivation comes from the fact that E(XY) = E(X)E(Y).
Given two discrete random variables X and Y, one defines the conditional probability that Y = yj given that X = xi as
P[Y = yj | X = xi] = P(X = xi, Y = yj)/P(X = xi).
In the same spirit as Theorem B.3, one has the following result:
B.4 General Random Variables
Not all random variables have a discrete set of possible values. For instance, the
voltage across a phone line, wind speed, temperature, and the time until the next
customer arrives at a cashier have a continuous range of possible values.
In practice, one can always approximate values by choosing a finite number
of bits to represent them. For instance, one can measure temperature in degrees,
ignoring fractions, and fixing a lower and upper bound. Thus, discrete random
variables suffice to describe systems with an arbitrary degree of precision. However,
this discretization is rather artificial and complicates things. For instance, writing
Newton’s equation F = ma where a = dv(t)/dt with discrete variables is
rather bizarre since a discrete speed does not admit a derivative. Hence, although
computers perform all their calculations on discrete variables, the analysis and
derivation of algorithms are often more natural with general variables. Nevertheless,
the approximation intuition is useful and we make use of it.
We start with a definition of a general random variable.
B.4.1 Definitions
(a) The cumulative distribution function (cdf) of X is the function FX (x) defined
by
FX(x) = P(X ≤ x), x ∈ ℝ.
(b) The probability density function (pdf) of X is the function
fX(x) = (d/dx) FX(x),
if this derivative exists.
Thus, P(X ∈ (a, b]) = FX(b) − FX(a) = ∫_a^b fX(x) dx, where the last expression makes sense if the derivative exists. Also, if the pdf exists, FX(x) = ∫_{−∞}^{x} fX(u) du.
B.4.2 Examples
Example B.1 (U [a, b]) As a first example, we say that X is uniformly distributed
in [a, b], for some a < b, and we write X =D U [a, b] if
fX(x) = (1/(b − a)) 1{a ≤ x ≤ b}.
Figure B.14 illustrates the pdf and the cdf of a U [a, b] random variable.
As a second example, we say that X is exponentially distributed with rate λ > 0, and we write X =D Exp(λ), if
fX(x) = λ exp{−λx} 1{x ≥ 0}.
Figure B.15 illustrates this pdf. As before, you can verify that
FX(x) = 1 − exp{−λx} for x ≥ 0,
so that
P(X ≥ x) = exp{−λx}, ∀x ≥ 0.
It may help intuition to realize that a random variable X with cdf FX(·) can be approximated by a discrete random variable Y that takes values in {. . . , −2ε, −ε, 0, ε, 2ε, . . .} with
P(Y = nε) = FX((n + 1)ε) − FX(nε) = P(X ∈ (nε, (n + 1)ε]).
(See Fig. B.16.)
B.4.3 Expectation
One then defines E(h(X)) as the limit of E(h(Y)) = Σ_n h(nε) P(X ∈ (nε, (n + 1)ε]), where the last term is defined as the limit of the sum as ε → 0. If the pdf exists, one sees that
E(h(Y)) ≈ Σ_n h(nε) fX(nε) ε ≈ ∫_{−∞}^{∞} h(x) fX(x) dx.
More generally, one defines
E(h(X)) = ∫_{−∞}^{∞} h(x) dFX(x).
In particular, for X =D U[0, 1] one can compute E(X) = 1/2 and E(X²) = 1/3, so that
var(X) = E(X²) − E(X)² = 1/3 − (1/2)² = 1/12.
Similarly, for X =D Exp(λ),
E(X) = ∫_0^∞ x λ e^{−λx} dx = ∫_0^∞ e^{−λx} dx = −λ^{−1} [e^{−λx}]_0^∞ = λ^{−1}.
Also,
E(X²) = ∫_0^∞ x² λ e^{−λx} dx = −∫_0^∞ x² de^{−λx}
= −[x² e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx²
= 2 ∫_0^∞ x e^{−λx} dx = 2λ^{−2}.
In particular,
var(X) = E(X²) − E(X)² = 2λ^{−2} − λ^{−2} = λ^{−2}.
As another example, consider the following mixed random variable. Flip a biased coin that yields heads with probability 0.4 and tails with probability 0.6. If the outcome of the coin flip is heads, then X = 0.3. If the outcome is tails, then X is picked uniformly in [0, 1]. Then,
FX(x) = 0.6x for 0 ≤ x < 0.3 and FX(x) = 0.4 + 0.6x for 0.3 ≤ x ≤ 1.
This cdf is illustrated in Fig. B.17. We can define the derivative of FX (x) formally
by using the Dirac impulse as the formal derivative of a step function.
For this random variable, one finds that
E(X^k) = ∫_{−∞}^{∞} x^k fX(x) dx = ∫_{−∞}^{∞} x^k 0.4 δ(x − 0.3) dx + ∫_0^1 x^k 0.6 dx
= 0.4 (0.3)^k + 0.6/(k + 1).
We state without proof two useful technical properties of expectation. They address
the following question. Assume that Xn → X as n → ∞. Can we conclude that
E(Xn ) → E(X)? In other words, is expectation “continuous”?
The following counterexample shows that some conditions are needed (see
Fig. B.18). Say that ω is chosen uniformly in [0, 1], so that P ([0, a]) = a for
a ∈ Ω := (0, 1]. Define Xn (ω) = n × 1{ω ≤ 1/n} for n ≥ 1. That is,
Xn (ω) = n if ω ≤ 1/n and Xn (ω) = 0 otherwise. Then P (Xn = n) = 1/n
and P (Xn = 0) = 1 − 1/n, so that E(Xn ) = 1 for all n. Also, Xn (ω) → 0 as
n → ∞, for all ω ∈ Ω. Indeed, Xn (ω) = 0 for all n > 1/ω. Thus, Xn → X = 0
but E(Xn ) = 1 does not converge to 0 = E(X).
For the last equality we use the fact that x(1 − FX (x)) = xP (X > x) goes to
zero as x → ∞. This fact follows from DCT . To see this, define Xn = n1{X > n}.
Then |Xn | ≤ X for all n. Also, Xn → 0 as n → ∞. Since E(X) < ∞, DCT then
implies that nP (X > n) = E(Xn ) → 0.
B.5 Multiple Random Variables
Let X = (X1, . . . , Xn) be a collection of random variables. Their joint cumulative distribution function is
FX(x) = P(X1 ≤ x1, . . . , Xn ≤ xn), x ∈ ℝⁿ.
The derivative of this function, if it exists, is the joint probability density function (jpdf) fX(x). That is,
FX(x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} fX(u) du1 · · · dun.
For example, suppose that
fX,Y(x, y) = (1/π) 1{x² + y² ≤ 1}, x, y ∈ ℝ.
Then, we say that (X, Y ) is picked uniformly at random inside the unit circle.
One intuitive way to look at these random variables is to approximate them by points on a fine grid with mesh size ε > 0. For instance, an ε-approximation of a pair (X, Y) is (X̃, Ỹ) defined by
(X̃, Ỹ) = (mε, nε) with probability fX,Y(mε, nε) ε².
More generally,
E(h(X)) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x) fX(x) dx1 · · · dxn.
The random variables X and Y are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)
for all sets A and B. It is a simple exercise to show that, if the jpdf exists, the random variables are independent if and only if
fX,Y(x, y) = fX(x) fY(y) for all x, y.
Also, if X and Y are independent and W = max{X, Y}, then
P(W ≤ w) = P(X ≤ w, Y ≤ w) = P(X ≤ w) P(Y ≤ w).
Hence, W has the cdf FW(w) = FX(w)FY(w). Similarly, let Z = X + Y where X and Y are independent with densities fX and fY. Then
fZ(z) dz = P(Z ∈ (z, z + dz)) = ∫_{−∞}^{+∞} fX(x) fY(z − x) dx dz.
We conclude that
fZ(z) = ∫_{−∞}^{+∞} fX(x) fY(z − x) dx = fX ∗ fY(z),
where g ∗ h indicates the convolution of two functions. If you took a class on signals
and systems, you learned the “flip and drag” graphical method to find a convolution.
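As a sanity check of the convolution formula, here is a minimal Python sketch (using numpy, with an illustrative step size) that convolves the density of U[0, 1] with itself and compares the result with the triangular density of the sum:

import numpy as np

dx = 0.001
x = np.arange(0, 1, dx)
f = np.ones_like(x)                    # density of U[0, 1] on its support
fz = np.convolve(f, f) * dx            # numerical version of (fX * fY)(z)
z = np.arange(len(fz)) * dx
triangular = np.minimum(z, 2 - z)      # exact density of the sum on [0, 2]
print(np.max(np.abs(fz - triangular)))   # small discretization error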
FX(x1, . . . , xn) := P(X1 ≤ x1, . . . , Xn ≤ xn), xi ∈ ℝ, i = 1, . . . , n.
The Joint Probability Density Function (jpdf) is the function fX(x) such that
FX(x1, . . . , xn) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} fX(u1, . . . , un) du1 . . . dun,
Thus, the jcdf and the jpdf specify the likelihood that the random vector takes values in given subsets of ℝⁿ.
As in the case of two random variables, one has
E(h(X)) = ∫ · · · ∫ h(u) fX(u) du1 . . . dun,
Definition B.5 (Mean and Covariance) Let X, Y be random vectors. One defines E(X) as the vector with components E(Xi), and one defines the covariance of X and Y as
cov(X, Y) := E((X − E(X))(Y − E(Y))') = E(XY') − E(X)E(Y)'.
Thus, the mean value of a vector is the vector of mean values. Similarly, the mean value of a matrix is defined as the matrix of mean values. Also, the covariance of X and Y is the matrix of covariances. Indeed, the (i, j) entry of cov(X, Y) is E((Xi − E(Xi))(Yj − E(Yj))) = cov(Xi, Yj).
The notions of orthogonality and of projection are essential when studying estima-
tion.
Let X and Y be two random vectors. We say that X and Y are orthogonal and we
write X ⊥ Y if
E(XY') = 0.
since E(X) = 0.
The following fact is very useful (see Fig. B.19).
Now, if X ⊥ Y, then E(Xi Yj) = 0 for all i, j. Consequently, E(X'Y) = Σ_i E(Xi Yi) = 0. This proves the result.
B.7 Density of a Function of Random Variables
Assume that X has p.d.f. fX(x). Let Y = aX + b for some a > 0. How do we
calculate fY (y)?
As we see in Fig. B.20, we have
P(Y ∈ (y, y + dy)) = P(X ∈ (x, x + dx)) = fX(x) dx, with x = (y − b)/a and dx = dy/a,
so that
fY(y) = (1/a) fX(x) where ax + b = y.   (B.9)
The case a < 0 is not that different. Repeating the argument above, one finds
fY(y) = (1/|a|) fX(x) where ax + b = y.
What about a pair of random variables? Assume that X is a random vector that
takes values in ℝ² with p.d.f. fX(x). Let
Y = AX + b, where A is a nonsingular 2 × 2 matrix and b ∈ ℝ².
As in the scalar case, one finds
fY(y) = (1/|A|) fX(x) where Ax + b = y,   (B.10)
where |A| denotes the absolute value of the determinant of A.
As an example where no density exists, let X1 =D U[0, 1] and define Y = (X1, X1) (see Fig. B.22). Then Y has no density in ℝ². Indeed, if it had one, one would find that, with L = {y | y1 = y2 and 0 ≤ y1 ≤ 1},
P(Y ∈ L) = ∫_L fY(y) dy = 0,
since the line L has zero area. But P(Y ∈ L) = 1, a contradiction.
The case when Y = g(X) for a nonlinear function g(·) is slightly more tricky. Let
us look at one example first.
First Example
Say that X =D U[0, 1] and Y = X², as shown in Fig. B.23.
As the figure shows, for 0 < ε ≪ 1, one has Y ∈ [y, y + ε) if and only if X ∈ [x1, x1 + δ), where
δ = ε/g'(x1) = ε/(2x1), with g(x1) = x1² = y.
Now,⁴
P(Y ∈ [y, y + ε)) = fY(y) ε + o(ε)
and
fY(y) ε + o(ε) = fX(x1) δ + o(ε) = fX(x1) (ε/g'(x1)) + o(ε),
⁴ Recall that o(ε) designates a function of ε such that o(ε)/ε → 0 as ε → 0.
so that
fY(y) = (1/g'(x1)) fX(x1) where g(x1) = y.
Thus, for this example,
fY(y) = 1/(2√y), y ∈ (0, 1),
because g'(x1) = 2x1 = 2√y and fX(x1) = 1.
Second Example
We now look at a slightly more complex example. Assume that Y = g(X) = X², where X takes values in [−1, 1] and has p.d.f.
fX(x) = (3/8)(1 + x)², x ∈ [−1, 1]
(see Fig. B.24).
Consider one value of y ∈ (0, 1). Note that there are now two values of x, namely x1 = √y and x2 = −√y, such that g(x) = y. Thus,
fY(y) ε + o(ε) = fX(x1) δ1 + fX(x2) δ2 + o(ε),
where
δ1 = ε/g'(x1) and δ2 = ε/|g'(x2)|.
Hence,
fY(y) ε + o(ε) = fX(x1) (ε/g'(x1)) + fX(x2) (ε/|g'(x2)|) + o(ε),
so that
fY(y) = (1/(2√y))(3/8)(1 + √y)² + (1/(2√y))(3/8)(1 − √y)² = 3(1 + y)/(8√y).
Third Example
Our next example is a general differentiable function g(·). From the second example,
we can see that if Y = g(X), then
f_Y(y) = \sum_i \frac{1}{|g'(x_i)|} f_X(x_i),      (B.11)
where the sum is over all the values x_i such that g(x_i) = y.
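Formula (B.11) translates directly into code. Here is a minimal Python sketch (assuming NumPy is available; the function names f_X and f_Y are only illustrative) for the second example, where g(x) = x² has the two preimages ±√y:

import numpy as np

def f_X(x):
    return (3 / 8) * (1 + x) ** 2                   # p.d.f. of X on [-1, 1]

def f_Y(y):
    roots = [np.sqrt(y), -np.sqrt(y)]               # the x_i with g(x_i) = y
    return sum(f_X(x) / abs(2 * x) for x in roots)  # |g'(x_i)| = |2 x_i|

y = 0.25
print(f_Y(y), 3 * (1 + y) / (8 * np.sqrt(y)))       # both equal 0.9375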
Fourth Example
What about the multi-dimensional case? The key idea is that, locally, the transformation from x to y looks linear. Observe that
g_i(x + dx) ≈ g_i(x) + \sum_j \frac{∂}{∂x_j} g_i(x) \, dx_j, i.e., g(x + dx) ≈ g(x) + J(x) dx,
where J(x) is the Jacobian matrix with entries
J_{i,j}(x) = \frac{∂}{∂x_j} g_i(x).
Arguing as in the linear case (B.10), if Y = g(X), then
f_Y(y) = \sum_i \frac{1}{|J(x_i)|} f_X(x_i),
where the sum is over all the xi such that g(xi ) = y and |J (xi )| is the absolute value
of the determinant of the Jacobian evaluated at xi .
Here is an example to illustrate this result. Assume that X = (X_1, X_2) where the
X_i are i.i.d. U[0, 1]. Consider the transformation
Y = g(X) = (X_1^2 + X_2^2, \; 2 X_1 X_2).
Then
J(x) = \begin{pmatrix} 2x_1 & 2x_2 \\ 2x_2 & 2x_1 \end{pmatrix},
so that det J(x) = 4(x_1^2 − x_2^2) and |J(x)| = 4|x_1^2 − x_2^2|.
There are two values of x that correspond to each value of y. These values are
x_1 = \frac{1}{2}\left(\sqrt{y_1 + y_2} + \sqrt{y_1 − y_2}\right) and x_2 = \frac{1}{2}\left(\sqrt{y_1 + y_2} − \sqrt{y_1 − y_2}\right),
and
x_1 = \frac{1}{2}\left(\sqrt{y_1 + y_2} − \sqrt{y_1 − y_2}\right) and x_2 = \frac{1}{2}\left(\sqrt{y_1 + y_2} + \sqrt{y_1 − y_2}\right).
For these values, since (x_1^2 − x_2^2)^2 = (x_1^2 + x_2^2)^2 − (2 x_1 x_2)^2 = y_1^2 − y_2^2, one finds
|J(x)| = 4\sqrt{y_1^2 − y_2^2}.
Hence, since f_X(x) = 1 at both of these values,
f_Y(y) = \frac{2}{4\sqrt{y_1^2 − y_2^2}} = \frac{1}{2\sqrt{y_1^2 − y_2^2}}.
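A Monte Carlo sanity check of this density (a minimal Python sketch, assuming NumPy is available; the test point y = (0.9, 0.5) and the box half-width are arbitrary choices) estimates P(Y ∈ small box)/area and compares it with 1/(2√(y_1² − y_2²)).

import numpy as np

rng = np.random.default_rng(0)
n = 2 * 10**6
x1, x2 = rng.random(n), rng.random(n)            # X1, X2 i.i.d. U[0, 1]
y1, y2 = x1**2 + x2**2, 2 * x1 * x2              # Y = (X1^2 + X2^2, 2 X1 X2)
y, h = np.array([0.9, 0.5]), 0.02                # small box around the test point
in_box = (np.abs(y1 - y[0]) < h) & (np.abs(y2 - y[1]) < h)
print(in_box.mean() / (2 * h) ** 2)              # estimated density near y
print(1 / (2 * np.sqrt(y[0]**2 - y[1]**2)))      # value given by the formula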
B.8 References
Mastering probability theory requires curiosity, intuition, and patience. Good books
are very helpful. Personally, I enjoyed Pitman (1993). The home page of David
Aldous (2018) is a source of witty and inspiring comments about probability. The
textbooks Bertsekas and Tsitsiklis (2008), Grimmett and Stirzaker (2001), and
Billingsley (2012) are very useful. The text Wong and Hajek (1985) provides a
deeper discussion of the topics in this book. The books Gallager (2014) and Hajek
(2017) are great resources and are highly recommended to complement this course.
Wikipedia and YouTube are cool sources of information about everything,
including probability. I like to joke, “Don’t take notes, it’s all on the web.”
B.9 Problems
Problem B.1 You have a collection of coins, and the probability that coin n
yields heads is p_n. Show that, as you keep flipping the coins, the flips yield a finite
number of heads if and only if \sum_n p_n < ∞.
Hint This is a direct consequence of the Borel–Cantelli Theorem and its converse.
Problem B.2 Indicate whether the following statements are true or false:
P [A|C] < P (A), P [A|B] > P (A) and P [B|A] > P (B).
Problem B.4 Roll two balanced dice. Let A be the event “the sum of the faces is
less than or equal to 8.” Let B be the event “the face of the first die is larger than or
equal to 3.”
• What is the probability that out of the first 1000 flips the number of heads is
even?
• What is the probability that the number of heads is always ahead of the number
of tails in the first 4 flips?
Problem B.6 Let X, Y be i.i.d. Exp(1), i.e., exponentially distributed with rate 1.
Derive the p.d.f. of Z = X + Y .
Problem B.7 You pick four cards randomly from a perfectly shuffled 52-card deck.
Assume that the four cards you got are all numbered between 2 and 10. For instance,
you got a 2 of diamonds, a 10 of hearts, a 6 of clubs, and a 2 of spades. Write
a MATLAB script to calculate the probability that the sum of the numbers on the
black cards is exactly twice the sum of the numbers on the red cards.
Problem B.9 Let X, Y be i.i.d. U[0, 1]. Calculate E(max{X, Y} − min{X, Y}).
Problem B.10 Let X =D P (λ) (i.e., Poisson distributed with mean λ). Find
P (X is even).
Problem B.11 Consider Ω = [0, 1] with the uniform distribution. Let X(ω) =
1{a < ω < b} and Y = 1{c < ω < d} for some 0 < a < b < 1 and 0 < c < d < 1.
Assume that X and Y are uncorrelated. Are they necessarily independent?
Problem B.12 Let X and Y be i.i.d. U [−1, 1] and define Z = XY . Are X and Z
uncorrelated? Are they independent?
Problem B.14 You are given a one meter long stick. You choose two points X and
Y independently and uniformly along the stick and cut the stick at those two points.
What is the probability that you can make a triangle with the three pieces?
Problem B.15 Two friends go independently to a bar at times that are uniformly
distributed between 5:00 pm and 6:00 pm. Each waits for ten minutes after arriving.
What is the probability that they meet?
Problem B.17 Assume that Z and 1/Z are random variables with the same
probability distribution and such that E(|Z|) is well-defined. Show that E(|Z|) ≥ 1.
Problem B.18 Let {Xn , n ≥ 1} be i.i.d. with mean 0 and variance 1. Define Yn =
(X1 + · · · + Xn )/n.
Problem B.21 Pick two points X and Y independently and uniformly in [0, 1]2 .
Calculate E(||X − Y ||2 ).
Problem B.22 Let (X, Y ) be picked uniformly in the triangle with corners
(−1, 0), (1, 0), (0, 1). Find cov(X, Y ).
Problem B.23 Let X be a random variable with mean 1 and variance 0.5. Show
that
Problem B.24 Let X, Y, Z be i.i.d. and uniformly distributed in {−1, +1} (i.e.,
equally likely to be −1 or +1). Define V1 = XY, V2 = Y Z, V3 = XZ.
Problem B.25 Let A and B be events with probabilities P(A) = 3/4 and P(B) = 1/3.
Show that 1/12 ≤ P(A ∩ B) ≤ 1/3, and give examples to show that both the upper
and the lower bound are tight. Find corresponding bounds for P(A ∪ B).
Problem B.26 A power system supplies electricity to a city from N plants. Each
power plant fails with probability p independently of the other plants. The city will
experience a blackout if fewer than k plants are supplying it, where 0 < k < N .
What is the probability of blackout?
Problem B.27 Figure B.25 is the reliability graph of a system. The links of the
graph represent components of the system. Each link i is working with probability
pi and defective with probability 1−pi , independently of the other links. The system
is operational if the nodes S and T are connected. Thus, the system is built of two
redundant subsystems. Each subsystem consists of a number of components.
Problem B.28 Figure B.26 illustrates an RC-circuit used as a timer. Initially, the
capacitor is charged by the power supply to 5 V. At time t = 0, the switch is flipped
and the capacitor starts discharging through the resistor. An external circuit detects
the time τ when V(t) first drops below 1 V.
Problem B.29 Alice and Bob play the game of matching pennies. In this game,
they both choose the side of the penny to show. Alice wins if the two sides are
different and Bob wins otherwise (Fig. B.27).
(a) Assume that Alice chooses to show “head” with probability pA ∈ [0, 1].
Calculate the probability pB with which Bob should show “head” to maximize
his probability of winning.
(b) From your calculations, find the best choices of pA and pB for Alice and Bob.
Argue that those choices are such that Alice cannot improve her chance of
winning by modifying pA and similarly for Bob. A solution with that property
is called a Nash equilibrium.
Problem B.30 You find two old batteries in a drawer. They produce the voltages X
and Y . Assume that X and Y are i.i.d. and uniformly distributed in [0, 1.5].
(a) What is the probability that if you put them in series they produce a voltage
larger than 2?
(b) What is the probability that at least one of the two batteries has a voltage that
exceeds 1 V?
(c) What is the probability that both batteries have a voltage that exceeds 1 V?
(d) You find more similar batteries in that drawer. You test them one by one until
you find one whose voltage exceeds 1.2 V. What is the expected number of
batteries that you have to test?
(e) You pick three batteries. What is the probability that at least two of them have
voltages that add up to more than 2? (Fig. B.28).
Problem B.31 You want to sell your old iPhone 4S. Two friends, Alice and Bob,
are interested. You know that they value the phone at X and Y , respectively, where X
and Y are i.i.d. U[50, 150]. You propose the following auction. You ask for a price
R. If Alice bids A and Bob bids B, then the phone goes to the highest bidder, provided
that the highest bid is larger than R, and the highest bidder pays the maximum of the
second bid and R. Thus, if A < R < B, then Bob gets the phone and pays R. If
R < A < B, then Bob gets the phone and pays A (Fig. B.29).
(a) What is the expected payment that you get for the phone if A = X and B = Y ?
(b) Find the value of R that maximizes this expected payment.
(c) The surplus of Alice is X − P if she gets the phone and pays P for it; it is
zero if she does not get the phone. Bob’s surplus is defined similarly. Show that
Alice maximizes her expected surplus by bidding A = X and similarly for Bob.
We say that this auction is incentive compatible. Also, this auction is revenue
maximizing.
Problem B.32 Recall that the trace tr(S) of a square matrix S is the sum of its
diagonal elements. Let A be an m × n matrix and B an n × m matrix. Show that
tr(AB) = tr(BA).
Problem B.33 Let Σ be the covariance of some random vector X. Show that
a^T Σ a ≥ 0 for every real vector a.
Problem B.34 You want to buy solar panels for your house. Panels that deliver a
maximum power K cost αK per unit of time, after amortizing the cost over the
lifetime of the panels. Assume that the actual power Z that such panels deliver is
U [0, K] (Fig. B.30).
The power X that you need is U [0, A] and we assume it is independent of the
power that the solar panels deliver. If you buy panels with a maximum power K,
your cost per unit time is
αK + β max{0, X − Z},
where max{0, X − Z} is the amount of power that you have to buy from the grid,
at a price β per unit. Find the maximum power K of the panels you should buy to
minimize your expected cost per unit time.
References
B. Hajek, T. Van Loon, Decentralized dynamic control of a multiple access broadcast channel.
IEEE Trans. Autom. Control, AC-27(3), 559–569 (1982)
M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in
Action. (Cambridge University Press, Cambridge, 2013)
T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd edn. (Springer, Berlin, 2009)
D.A. Huffman, A method for the construction of minimum-redundancy codes. Proceedings of the
IRE, pp. 1098–1101 (1952)
R.E. Kalman, A new approach to linear filtering and prediction problems. Trans. ASME J. Basic
Eng. 82(Series D), 35–45 (1960)
F. Kelly, Reversibility and Stochastic Networks (Wiley, Hoboken, 1979)
F. Kelly, E. Yudovina, Lecture Notes in Stochastic Networks (2013). http://www.statslab.cam.ac.
uk/~frank/STOCHNET/LNSN/book.pdf
L. Kleinrock, Queueing Systems, vols.1 and 2 (J. Wiley, Hoboken, 1975–1976)
P.R. Kumar, P.P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control
(Prentice-Hall, Upper Saddle River, 1986)
E.L. Lehmann, Testing Statistical Hypotheses, 3rd edn. (Springer, Berlin, 2010)
J.D.C. Little, A proof for the queuing formula: L = λW . Oper. Res. 9(3), 383–401 (1961)
R. Lyons, Y. Peres, Probability on Trees and Networks. Cambridge Series in Statistical and
Probabilistic Mathematics (2017)
S.G. Mallat, Z. Zhang, Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal
Process. 41, 3397–3415 (1993)
M.J. Neely, Stochastic Network Optimization with Application to Communication and Queueing
Systems (Morgan & Claypool, San Rafael, 2010)
J. Neveu, Discrete Parameter Martingales (American Elsevier, North-Holland, 1975)
J. Neyman, E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses. Phil.
Trans. R. Soc. Lond. 231, 289–337 (1933)
L. Page, Method for node ranking in a linked database. United States Patent and Trademark Office,
6,285,999 (2001)
J. Pitman, Probability. Springer Texts in Statistics (Springer, New York, 1993)
J. Proakis, Digital Communications, 4th edn. (McGraw-Hill Science/Engineering/Math, New
York, 2000)
T. Richardson, R. Urbanke. Modern Coding Theory (Cambridge University Press, Cambridge,
2008)
E. Roche, EM algorithm and variants: an informal tutorial (2012). arXiv:1105.1476v2 [stat.CO]
S.M. Ross, Introduction to Stochastic Dynamic Programming (Academic Press, Cambridge, 1995)
D. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, A tutorial on Thompson sampling.
Found. Trends Mach. Learn. 11(1), 1–96 (2018)
D. Shah, Gossip algorithms. Found. Trends Netw. 3, 1–25 (2009)
R. Srikant, L. Ying, Communication Networks: An Optimization, Control, and Stochastic Networks
Perspective (Cambridge University Press, Cambridge, 2014)
A.L. Strehl, M.L. Littman, Online linear regression and its application to model-based reinforcement
learning, in Advances in Neural Information Processing Systems 20 (NIPS 2007),
pp. 737–744 (2007)
D. Tse, P. Viswanath, Fundamentals of Wireless Communication (Cambridge University Press,
Cambridge, 2005)
M.J. Wainwright, M. Jordan, Graphical Models, Exponential Families, and Variational Inference
(Now Publishers, Boston, 2008)
J. Walrand, An Introduction to Queueing Networks (Prentice-Hall, Upper Saddle River, 1988)
J. Walrand, Uncertainty: A User Guide (Amazon, Seattle, 2019)
E. Wong, B. Hajek, Stochastic Processes in Engineering Systems (Springer, Berlin, 1985)
Index
A
Adaptive randomized multiple access protocol, 66
Additive Gaussian noise channel, 124
Almost sure, 7
Almost sure convergence, 28
ANOVA, 154
Aperiodic, 5
Arg max, 118
Axiom of choice, 323

B
Balance equations, 2
  detailed, 18
Bayes' Rule, 117
Belief propagation, 156
Bellman-Ford Algorithm, 207, 244
Bellman-Ford Equations, 208
Bernoulli RV, 339
Binary Phase Shift Keying (BPSK), 125
Binary Symmetric Channel (BSC), 119
Binomial distribution, 340
Binomial RV, 340
Boosting, 280
Borel-Cantelli Theorem, 330
  Converse, 332

C
Cμ rule, 249
Cascades, 72
Cauchy-Schwarz inequality, 69
Central Limit Theorem (CLT), 43
Certainty equivalence, 263
Characteristic function, 59
Chebyshev's inequality, 24, 320
Chernoff's inequality, 288
Chi-squared test, 152
Clustering, 209
Codewords, 285
Complementary cdf, 354
Compressed sensing, 234
Conditional probability, 310
Confidence interval, 47
Consensus algorithm, 90
Contraction, 261
Convergence in distribution, 43
Convergence in probability, 28
Convex function, 220
Convex set, 220
Covariance, 321, 344
Cumulative distribution function (cdf), 333, 348
  joint, 354

D
Deep neural networks (DNN), 237
Digital link, 115
Discrete random variable, 334
Doob martingale, 296
Dynamic programming equations, 208, 245

E
Entropy, 123, 288
Entropy rate, 288
Epidemics, 72
Error correction, 156
Error function, 65
  bounds, 65
Estimation problem, 163
Expectation, 315
  linearity, 316, 338
  monotone, 338
  monotonicity, 317
  of product of independent RVs, 318

U
Uncorrelated, 321, 344, 358
Uniformization, 106
Uniformly distributed, 348

W
Wald's equality, 301
Wavelength, 126
Weak Law of Large Numbers, 28, 321
WiFi Access Point, 54