Probability in Electrical Engineering and Computer Science
An Application-Driven Course
Jean Walrand
Department of EECS
University of California, Berkeley
Berkeley, CA, USA
https://www.springer.com/us/book/9783030499945
© The Editor(s) (if applicable) and The Author(s) 2021. This book is an open access publication
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this book are included in the book’s Creative Commons
license, unless indicated otherwise in a credit line to the material. If material is not included in the book’s
Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my wife Annie, my daughters Isabelle and Julie,
and my grandchildren Melanie and Benjamin,
who will probably never read this book.
Preface
This book is about extracting information from noisy data, making decisions that
have uncertain consequences, and mitigating the potentially detrimental effects of
uncertainty.
Applications of those ideas are prevalent in computer science and electrical
engineering: digital communication, GPS, self-driving cars, voice recognition,
natural language processing, face recognition, computational biology, medical tests,
radar systems, games of chance, investments, data science, machine learning,
artificial intelligence, and countless (in a colloquial sense) others.
This material is truly exciting and fun. I hope you will share my enthusiasm for
the ideas.
Acknowledgements
I am grateful to my colleagues and students who made this book possible. I thank
Professor Ramtin Pedarsani for his careful reading of the manuscript, Sinho Chewi
for pointing out typos in the first edition and suggesting improvements of the text,
Dr. Abhay Parekh for teaching the course with me, Professors David Aldous, Venkat
Anantharam, Tom Courtade, Michael Lustig, John Musacchio, Shyam Parekh,
Kannan Ramchandran, Anant Sahai, David Tse, Martin Wainwright, and Avideh
Zakhor for their useful comments, Stephan Adams, Kabir Chandrasekher, Dr.
Shiang Jiang, Dr. Sudeep Kamath, Dr. Jerome Thai, Professors Antonis Dimakis,
Vijay Kamble, and Baosen Zhang for serving as teaching assistants for the course
and designing assignments, Professor Longbo Huang for translating the book in
Mandarin and providing many valuable suggestions, Professors Pravin Varaiya and
Eugene Wong for teaching me Probability, Professor Tsu-Jae King Liu for her
support, and the students in EECS126 for their feedback.
Finally, I want to thank Professor Tarek El-Bawab for making a number of valu-
able suggestions for the second edition and the Springer editorial team, including
Mary James, Zoe Kennedy, Vidhya Hariprasanth, and Lavanya Venkatesan for their
help in the preparation of this edition.
Introduction
level students. Parts B contain more difficult aspects of the material. It is possible
to teach only the appendices and parts A. This would constitute a good junior-level
course. One possible approach is to teach parts A in a first course and parts B in a
second course. For a more ambitious course, one may teach parts A, then parts B.
It is also possible to teach the chapters in order. The last chapter is a collection of
more advanced topics that the reader and instructor can choose from.
The appendices should be useful for most readers. Appendix A discusses the
elementary notions of probability on simple examples. Students might benefit from
a quick read of this chapter.
Appendix B reviews the basic concepts of probability. Depending on the
background of the students, it may be recommended to start the course with a review
of that appendix.
The theory starts with models of uncertain quantities. Let us denote such
quantities by X and Y. A model enables one to calculate the expected value E(h(X))
of a function h(X) of X. For instance, X might specify the output of a solar panel
every day during 1 month and h(X) the total energy that the panel produced. Then
E(h(X)) is the average energy that the panel produces per month. Other examples
are the average delay of packets in a communication network or the average time a
data center takes to complete one job (Fig. 1).
[Fig. 1 Evaluation: from a model of the uncertain quantity X, compute E(h(X)).]
Suppose now that the model depends on parameters θ, so that the quantity of interest is E(h(X, θ)). One important problem is then to find the values of the parameters θ that
maximize E(h(X, θ )). This is not a simple problem if one does not have an
analytical expression for this average value in terms of θ . We explain such
optimization problems in the book.
There are many situations where one observes Y and one is interested in guessing
the value of X, which is not observed. As an example, X may be the signal that a
transmitter sends and Y the signal that the receiver gets (Fig. 3).
[Fig. 3 Inference: from the observation Y, guess the unobserved X.]
[Fig. 4 Control.]
The website https://www.springer.com/us/book/9783030499945 provides additional resources for this book, such as an Errata, Additional Problems, and Python Labs.
This second edition differs from the first in a few aspects. The Matlab exercises
have been deleted as most students use Python. Python exercises are not included
in the book; they can be found on the website. The appendix on Linear Algebra has
been deleted. The relevant results from that theory are introduced in the text when
needed. Appendix A is new. It is motivated by the realization that some students are
confused by basic notions. The chapters on networks are new. They were requested
by some colleagues. Basic statistics are discussed in Chap. 8. Neural networks are
explained in Chap. 12.
Contents
1 PageRank: A
  1.1 Model
  1.2 Markov Chain
    1.2.1 General Definition
    1.2.2 Distribution After n Steps and Invariant Distribution
  1.3 Analysis
    1.3.1 Irreducibility and Aperiodicity
    1.3.2 Big Theorem
    1.3.3 Long-Term Fraction of Time
  1.4 Illustrations
  1.5 Hitting Time
    1.5.1 Mean Hitting Time
    1.5.2 Probability of Hitting a State Before Another
    1.5.3 FSE for Markov Chain
  1.6 Summary
    1.6.1 Key Equations and Formulas
  1.7 References
  1.8 Problems
2 PageRank: B
  2.1 Sample Space
  2.2 Laws of Large Numbers for Coin Flips
    2.2.1 Convergence in Probability
    2.2.2 Almost Sure Convergence
  2.3 Laws of Large Numbers for i.i.d. RVs
    2.3.1 Weak Law of Large Numbers
    2.3.2 Strong Law of Large Numbers
  2.4 Law of Large Numbers for Markov Chains
  2.5 Proof of Big Theorem
    2.5.1 Proof of Theorem 1.1 (a)
    2.5.2 Proof of Theorem 1.1 (b)
    2.5.3 Periodicity
  2.6 Summary
    2.6.1 Key Equations and Formulas
  2.7 References
  2.8 Problems
3 Multiplexing: A
  3.1 Sharing Links
  3.2 Gaussian Random Variable and CLT
    3.2.1 Binomial and Gaussian
    3.2.2 Multiplexing and Gaussian
    3.2.3 Confidence Intervals
  3.3 Buffers
    3.3.1 Markov Chain Model of Buffer
    3.3.2 Invariant Distribution
    3.3.3 Average Delay
    3.3.4 A Note About Arrivals
    3.3.5 Little's Law
  3.4 Multiple Access
  3.5 Summary
    3.5.1 Key Equations and Formulas
  3.6 References
  3.7 Problems
4 Multiplexing: B
  4.1 Characteristic Functions
  4.2 Proof of CLT (Sketch)
  4.3 Moments of N(0, 1)
  4.4 Sum of Squares of 2 i.i.d. N(0, 1)
  4.5 Two Applications of Characteristic Functions
    4.5.1 Poisson as a Limit of Binomial
    4.5.2 Exponential as Limit of Geometric
  4.6 Error Function
  4.7 Adaptive Multiple Access
  4.8 Summary
    4.8.1 Key Equations and Formulas
  4.9 References
  4.10 Problems
5 Networks: A
  5.1 Spreading Rumors
  5.2 Cascades
  5.3 Seeding the Market
  5.4 Manufacturing of Consent
  5.5 Polarization
  5.6 M/M/1 Queue
  5.7 Network of Queues
  5.8 Optimizing Capacity
  5.9 Internet and Network of Queues
References
Index
1 PageRank: A
Background:
1.1 Model
The World Wide Web is a collection of linked web pages (Fig. 1.1). These pages
and their links form a graph. The nodes of the graph are the pages, which form a set X, and there is an arc (a directed edge) from i to j if page i has a link to page j.
Intuitively, a page has a high rank if other pages with a high rank point to it. (The
actual ordering of search engines results depends also on the presence of the search
keywords in the pages and on many other factors, in addition to the rank measure
that we discuss here.) Thus, the rank π(i) of page i is a positive number and
    π(i) = Σ_{j∈X} π(j) P(j, i),   i ∈ X,
where P (j, i) is the fraction of links in j that point to i and is zero if there is no
such link. In our example, P (A, B) = 1/2, P (D, E) = 1/3, P (B, A) = 0, etc.
(The basic idea of the algorithm is due to Larry Page (Fig. 1.2), hence the name
PageRank. Since it ranks pages, the name is doubly appropriate.)
We can write these equations in matrix notation as
π = πP, (1.1)
where we treat π as a row vector with components π(i) and P as a square matrix
with entries P (i, j ) (Figs. 1.3, 1.4 and 1.5).
Equations (1.1) are called the balance equations. Note that if π solves these
equations, then any multiple of π also solves the equations. For convenience, we
normalize the solution so that the ranks of the pages add up to one, i.e.,
    Σ_{i∈X} π(i) = 1.   (1.2)
Solving these equations with the condition that the numbers add up to one yields
    π = [π(A), π(B), π(C), π(D), π(E)] = (1/39) [12, 9, 10, 6, 2].
Thus, page A has the highest rank and page E has the smallest. A search engine that
uses this method would combine these ranks with other factors to order the pages.
Search engines also use variations on this measure of rank.
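The balance equations (1.1) with the normalization (1.2) can also be solved numerically. The sketch below does this in Python with NumPy. Since the graph of Fig. 1.2 is not reproduced in this excerpt, the transition matrix used here is an assumption, chosen to be consistent with the probabilities quoted in the text (P(A, B) = 1/2, P(D, E) = 1/3, P(B, A) = 0) and with the solution above.

```python
import numpy as np

# Assumed transition matrix for the five-page example, states ordered A, B, C, D, E.
P = np.array([
    [0,   1/2, 0,   1/2, 0  ],   # A -> B or D
    [0,   0,   1,   0,   0  ],   # B -> C
    [1,   0,   0,   0,   0  ],   # C -> A
    [1/3, 1/3, 0,   0,   1/3],   # D -> A, B, or E
    [0,   1/2, 1/2, 0,   0  ],   # E -> B or C
])

# Solve pi = pi P together with sum(pi) = 1: stack (P^T - I) with a row of ones
# and solve the resulting overdetermined linear system in the least-squares sense.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.round(pi, 4))       # approximately [0.3077 0.2308 0.2564 0.1538 0.0513]
print(np.round(pi * 39, 2))  # approximately [12  9 10  6  2]
```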
Imagine that you are browsing the web. After viewing a page i, say for one unit
of time, you go to another page by clicking one of the links on page i, chosen at
random. In this process, you go from page i to page j with probability P (i, j )
where P (i, j ) is the same as we defined earlier. The resulting sequence of pages
that you visit is called a Markov chain, a model due to Andrey Markov (Fig. 1.4).
More generally, consider a finite graph with nodes X = {1, 2, . . . , N } and directed
edges. In this graph, some edges can go from a node to itself. To each edge (i, j )
one assigns a positive number P (i, j ) in a way that the sum of the numbers on the
edges out of each node is equal to one. By convention, P (i, j ) = 0 if there is no
edge from i to j .
The corresponding matrix P = [P (i, j )] with nonnegative entries and rows that
add up to one is called a stochastic matrix. The sequence {X(n), n ≥ 0} that goes
from node i to node j with probability P (i, j ), independently of the nodes it visited
before, is then called a Markov chain. The nodes are called the states of the Markov
chain and the P (i, j ) are called the transition probabilities. We say that X(n) is the
state of the Markov chain at time n, for n ≥ 0. Also, X(0) is called the initial state.
The graph is the state transition diagram of the Markov chain.
Figure 1.6 shows the state transition diagrams of three Markov chains.
Thus, our description corresponds to the following property:
The probability of moving from i to j does not depend on the previous states. This
“amnesia” is called the Markov property. It formalizes the fact that X(n) is indeed
a “state” in that it contains all the information relevant for predicting the future of
the process.
Indeed, the event that the Markov chain is in state i at step n + 1 is the union over
all j of the disjoint events that it is in state j at step n and in state i at step n + 1.
The probability of a disjoint union of events is the sum of the probabilities of the
individual events. Also, the probability that the Markov chain is in state j at step n
and in state i at step n + 1 is πn (j )P (j, i).
Thus, in matrix notation,
πn+1 = πn P ,
so that
πn = π0 P n , n ≥ 0. (1.5)
Observe that πn (i) = π0 (i) for all n ≥ 0 and all i ∈ X if and only if π0
solves the balance equations (1.1). In that case, we say that π0 is an invariant
distribution. Thus, an invariant distribution is a nonnegative solution π of (1.1)
whose components sum to one.
1.3 Analysis
(a) A Markov chain is irreducible if it can go from any state to any other state, possibly after many steps.
(b) Assume the Markov chain is irreducible and let

    d(i) := g.c.d.{n > 0 | P^n(i, i) > 0},   i ∈ X.   (1.6)

Then d(i) has the same value d for all i, as shown in Lemma 2.2. The Markov chain is aperiodic if d = 1. Otherwise, it is periodic with period d.
The Markov chains (a) and (b) in Fig. 1.6 are irreducible and (c) is not. Also, (a)
is periodic and (b) is aperiodic.
Simple examples show that the answers to Q2–Q3 can be negative. For instance,
every distribution is invariant for a Markov chain that does not move. Also, a Markov
chain that alternates between the states 0 and 1 with π0 (0) = 1 is such that πn (0) =
1 when n is even and πn (0) = 0 when n is odd, so that πn does not converge.
However, we have the following key result.
Theorem 1.1 (Big Theorem)

(a) If the Markov chain is finite and irreducible, it has a unique invariant distribu-
tion π and π(i) is the long-term fraction of time that X(n) is equal to i.
(b) If the Markov chain is also aperiodic, then the distribution πn of X(n) converges
to π .
In this theorem, the long-term fraction of time that X(n) is equal to i is defined
as the limit
    lim_{N→∞} (1/N) Σ_{n=0}^{N−1} 1{X(n) = i}.
In this expression, 1{X(n) = i} takes the value 1 if X(n) = i and the value 0
otherwise. Thus, in the expression above, the sum is the total time that the Markov
chain is in state i during the first N steps. Dividing by N gives the fraction of time.
Taking the limit yields the long-term fraction of time.
The theorem says that, if the Markov chain is irreducible, this limit exists and is
equal to π(i). In particular, this limit does not depend on the particular realization
of the random variables. This means that every simulation yields the same limit, as
you will verify in Problem 1.8.
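To make this concrete, here is a minimal simulation sketch (essentially what Problem 1.8 asks you to build), reusing the transition matrix assumed in the earlier snippet; the empirical fractions of time settle near π = (1/39)[12, 9, 10, 6, 2].

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(P, x0, N):
    """Simulate N steps of a Markov chain with transition matrix P, starting at x0."""
    states = [x0]
    for _ in range(N - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return np.array(states)

# Assumed matrix as before (states A=0, B=1, C=2, D=3, E=4).
P = np.array([[0, .5, 0, .5, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0],
              [1/3, 1/3, 0, 0, 1/3], [0, .5, .5, 0, 0]])

X = simulate(P, x0=0, N=100_000)
fractions = np.bincount(X, minlength=5) / len(X)
print(np.round(fractions, 3))   # close to [0.308, 0.231, 0.256, 0.154, 0.051]
```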
Why should the fraction of time that a Markov chain spends in one state converge?
In our browsing example, if we count the time that we spend on page A over n time
units and we divide that time by n, it turns out that the ratio converges to π(A).
This result is similar to the fact that, when we flip a fair coin repeatedly, the
fraction of “heads” converges to 50%. Thus, even though the coin has no memory,
it makes sure that the fraction of heads approaches 50%. How does it do it?
These convergence results are examples of the Law of Large Numbers. This
law is at the core of our intuitive understanding of probability and it captures our
notion of statistical regularity. Even though outcomes are uncertain, one can make
predictions. Here is a statement of the result. We discuss it in Chap. 2.
1 “Almost sure” is a somewhat confusing technical expression. It means that, although there are
outcomes for which the convergence does not happen, all these outcomes have probability zero.
For instance, if you flip a fair coin, the outcome where the coin flips keep on yielding tails
is such that the fraction of tails does not converge to 0.5. The same is true for the outcome
H, H, T , H, H, T , . . .. So, almost sure means that it happens with probability 1, but not for a
set of outcomes that has probability zero.
1.4 Illustrations
We illustrate Theorem 1.1 for the Markov chains in Fig. 1.6. The three situations are
different and quite representative. We explore them one by one.
Figures 1.8, 1.9 and 1.10 correspond to each of the three Markov chains in
Fig. 1.6, as shown on top of each figure. The top graph of each figure shows
the successive values of Xn for n = 0, 1, . . . , 100. The middle graph of the
figure shows, for n = 0, . . . , 100, the fraction of time that Xm is equal to the
different states during {0, 1, . . . , n}. The bottom graph of the figure shows, for
n = 0, . . . , 100, the probability that Xn is equal to each of the states.
In Fig. 1.8, the fraction of time that the Markov chain is equal to each of the
states {1, 2, 3} converges to positive values. This is the case because the Markov
chain is irreducible. (See Theorem 1.1(a).) However, the probability of being in a
given state does not converge. This is because the Markov chain is periodic. (See
Theorem 1.1(b).)
For the Markov chain in Fig. 1.9, the probabilities converge, because the Markov
chain is aperiodic. (See again Theorem 1.1.)
Finally, for the Markov chain in Fig. 1.10, eventually Xn = 3; the fraction of
time in state 3 converges to one and so does the probability of being in state 3. What
happens in this case is that state 3 is absorbing: once the Markov chain gets there,
it cannot leave.
Say that you start in page A in Fig. 1.2 and that, at every step, you follow each
outgoing link of the page where you are with equal probabilities. How many steps
does it take to reach page E? This time is called the hitting time, or first passage
time, of page E and we designate it by TE . As we can see from the figure, TE can
be as small as 2, but it has a good chance of being much larger than 2 (Fig. 1.11).
Our goal is to calculate the average value of TE starting from X0 = A. That is, we want to calculate

    β(A) := E[TE | X0 = A].

The key idea to perform this calculation is to in fact calculate the mean hitting time for all possible initial pages. That is, we will calculate β(i) for i = A, B, C, D, E, where

    β(i) := E[TE | X0 = i].
The reason for considering these different values is that the mean time to hit E
starting from A is clearly related to the mean hitting time starting from B and
from D. These in turn are related to the mean hitting time starting from C. We
claim that
    β(A) = 1 + (1/2) β(B) + (1/2) β(D).   (1.7)
To see this, note that, starting from A, after one step, the Markov chain is in state B
with probability 1/2 and it is in state D with probability 1/2. Thus, after one step,
the average time to hit E is the average time starting from B, with probability 1/2,
and it is the average time starting from D, with probability 1/2.
This situation is similar to the following one. You flip a fair coin. If the outcome
is heads you get a random amount of money equal to X and if it is tails you get a
random amount Y . On average, you get
    (1/2) E(X) + (1/2) E(Y).
Similarly, we can see that
    β(B) = 1 + β(C)
    β(C) = 1 + β(A)
    β(D) = 1 + (1/3) β(A) + (1/3) β(B) + (1/3) β(E)
    β(E) = 0.
These equations, together with (1.7), are called the first step equations (FSE).
Solving them, we find

    β(A) = 17, β(B) = 19, β(C) = 18, β(D) = 13, β(E) = 0.
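These first step equations form a small linear system, which is easy to solve numerically. A minimal sketch, again with the transition matrix assumed in the earlier snippets:

```python
import numpy as np

# States A=0, B=1, C=2, D=3, E=4; assumed transition matrix as before.
P = np.array([[0, .5, 0, .5, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0],
              [1/3, 1/3, 0, 0, 1/3], [0, .5, .5, 0, 0]])

target = 4  # state E

# FSE: beta(i) = 1 + sum_j P(i, j) beta(j) for i != E, with beta(E) = 0.
# In matrix form: (I - Q) beta = 1, where Q is P restricted to the non-target states.
keep = [i for i in range(5) if i != target]
Q = P[np.ix_(keep, keep)]
beta_keep = np.linalg.solve(np.eye(len(keep)) - Q, np.ones(len(keep)))

beta = np.zeros(5)
beta[keep] = beta_keep
print(beta)   # [17. 19. 18. 13.  0.]
```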
Consider once again the same situation, but say that we are interested in the probability that, starting from A, we visit state C before E. We write this probability as

    α(A) := P[visit C before E | X0 = A].

As in the previous case, it turns out that we need to calculate α(i) for i = A, B, C, D, E, where α(i) is the corresponding probability starting from state i. We claim that
    α(A) = (1/2) α(B) + (1/2) α(D).   (1.8)
To see this, note that, starting from A, after one step you are in state B with
probability 1/2 and you will then visit C before E with probability α(B). Also,
with probability 1/2, you will be in state D after one step and you will then visit C
before E with probability α(D). Thus, the event that you visit C before E starting
from A is the union of two disjoint events: either you do that by first going to B or
by first going to D. Adding the probabilities of these two events, we get (1.8).
    α(B) = α(C)
    α(C) = 1
    α(D) = (1/3) α(A) + (1/3) α(B) + (1/3) α(E)
    α(E) = 0.
These equations, together with (1.8), are also called the first step equations. Solving
them, we find
    α(A) = 4/5, α(B) = 1, α(C) = 1, α(D) = 3/5, α(E) = 0.
More generally, let A be a set of states, let TA be the first time the Markov chain enters A, and let h(·) be a function on the states. Define

    Y = Σ_{n=0}^{TA} h(X(n)).
That is, you collect an amount h(i) every time you visit state i, until you enter set
A. Let
    Z = Σ_{n=0}^{TA} β^n h(X(n)),

where β ∈ (0, 1) is a discount factor.
Hopefully these examples give you a sense of the variety of questions that can be
answered for finite Markov chains. This is very fortunate, because Markov chains
can be used to model a broad range of engineering and natural systems.
1.6 Summary
1.7 References
There are many excellent books on Markov chains. Some of my favorites are
Grimmett and Stirzaker (2001) and Bertsekas and Tsitsiklis (2008). The original
patent on PageRank is Page (2001). The online book Easley and Kleinberg (2012)
is an inspiring discussion of social networks. Chapter 14 of that reference discusses
PageRank.
1.8 Problems
Problem 1.1 Construct a Markov chain that is not irreducible but whose distribu-
tion converges to its unique invariant distribution.
Problem 1.2 Show a Markov chain whose distribution converges to a limit that
depends on the initial distribution.
Problem 1.3 Can you find a finite irreducible aperiodic Markov chain whose
distribution does not converge?
Problem 1.4 Show a finite irreducible aperiodic Markov chain that converges very
slowly to its invariant distribution.
Problem 1.5 Show that a function Y (n) = g(X(n)) of a Markov chain X(n) may
not be a Markov chain.
Problem 1.6 Construct a Markov chain that is a sequence of i.i.d. random variables.
Is it irreducible and aperiodic?
Problem 1.7 Consider the Markov chain X(n) with the state diagram shown in
Fig. 1.12 where a, b ∈ (0, 1).
Problem 1.8 Use Python to write a simulator for a Markov chain {X(n), n ≥ 1}
with K states, initial distribution π , and transition probability matrix P . The
program should be able to do the following:
1. Plot {X(n), n = 1, . . . , N };
2. Plot the fraction of time that X(n) is in some chosen states during {1, 2, . . . , m}
as a function of m, for m = 1, . . . , N ;
3. Plot the probability that X(n) is equal to some chosen states, for n = 1, . . . , N;
4. Use this program to simulate a periodic Markov chain with five states;
5. Use the program to simulate an aperiodic Markov chain with five states.
Problem 1.9 Use your simulator to simulate the Markov chains of Figs. 1.2 and 1.6.
Problem 1.10 Find the invariant distribution for the Markov chains of Fig. 1.6.
Problem 1.11 Calculate d(1), d(2), and d(3), defined in (1.6), for the Markov
chains of Fig. 1.6.
Problem 1.12 Calculate d(A), defined in (1.6), for the Markov chain of Fig. 1.2.
Problem 1.13 Let {Xn , n ≥ 0} be a finite Markov chain. Assume that it has
a unique invariant distribution π and that πn converges to π for every initial
distribution π0 . Then (choose the correct answers, if any)
• Xn is irreducible;
• Xn is periodic;
• Xn is aperiodic;
• Xn might not be irreducible.
Problem 1.14 Consider the Markov chain {Xn , n ≥ 0} on {0, 1} with P (0, 1) =
0.1 and P (1, 0) = 0.3. Then (choose the correct answers, if any)
Problem 1.15 Consider the MC with the state transition diagram shown in
Fig. 1.13.
Problem 1.16 Consider the MC with the state transition diagram shown in
Fig. 1.14.
Problem 1.17 Consider the MC with the state transition diagram shown in
Fig. 1.15.
[Figs. 1.13–1.15: state transition diagrams for Problems 1.15–1.17 (not reproduced in this excerpt).]
Problem 1.19 For the Markov chain {Xn , n ≥ 0} with transition diagram shown in
Fig. 1.17, assume that X0 = 0. Find the probability that Xn hits 2 before it hits 1
twice.
Problem 1.20 Draw an irreducible aperiodic MC with six states and choose the
transition probabilities. Simulate the MC in Python. Plot the fraction of time in the
six states. Assume you start in state 1. Plot the probability of being in each of the
six states.
Problem 1.22 How would you trick the PageRank algorithm into believing that
your home page should be given a high rank?
Problem 1.23 Show that the holding time of a state is geometrically distributed.
Problem 1.24 You roll a die until the sum of the last two rolls is exactly 10. How
many times do you have to roll, on average?
Problem 1.25 You roll a die until the sum of the last three rolls is at least 15. How
many times do you have to roll, on average?
Problem 1.26 A doubly stochastic matrix is a nonnegative matrix whose rows and
columns add up to one. Show that the invariant distribution is uniform for such a
transition matrix.
Problem 1.27 Assume that the Markov chain (c) of Fig. 1.6 starts in state 1.
Calculate the average number of times it visits state 1 before being absorbed in
state 3.
Problem 1.28 A man tries to go up a ladder that has N rungs. Every step he makes,
he has a probability p of dropping back to the ground and he goes up one rung
otherwise. Use the first step equations to calculate analytically the average time he
takes to reach the top, for N = 1, . . . , 20 and p = 0.05, 0.1, and 0.2. Use Python
to plot the corresponding graphs.
Problem 1.29 Let {Xn , n ≥ 0} be a finite irreducible Markov chain with transition
probability matrix P and invariant distribution π . Show that, for all i, j ,
    (1/N) Σ_{n=0}^{N−1} 1{Xn = i, Xn+1 = j} → π(i)P(i, j), w.p. 1 as N → ∞.
Xn+1 = f (Xn , Vn ), n ≥ 0,
Problem 1.31 Let P and P̃ be two stochastic matrices and π a pmf on the finite set
X . Assume that
Problem 1.32 Let Xn be a Markov chain on a finite set X . Assume that the
transition diagram of the Markov chain is a tree, as shown in Fig. 1.18. Show that if
π is invariant and if P is the transition matrix, then it satisfies the following detailed balance equations:

    π(i)P(i, j) = π(j)P(j, i), for all i, j ∈ X.
Problem 1.33 Let Xn be a Markov chain such that X0 has the invariant distribution
π and the detailed balance equations are satisfied. Show that

    P(X0 = x0, X1 = x1, . . . , Xn = xn) = P(XN = x0, XN−1 = x1, . . . , XN−n = xn)

for all n, all N ≥ n, and all x0, . . . , xn. Thus, the evolution of the Markov chain in
reverse time (N, N −1, N −2, . . . , N −n) cannot be distinguished from its evolution
in forward time (0, 1, . . . , n). One says that the Markov chain is time-reversible.
Yn = X0 + · · · + Xn , n ≥ 0.
Problem 1.35 You flip a fair coin repeatedly, forever. Show that the probability that
the number of heads is always ahead of the number of tails is zero.
2 PageRank: B
Background:
• Borel–Cantelli (B.1.2);
• monotonicity of expectation (B.2);
• convergence of expectation (B.8)–(B.9);
• properties of variance: (B.3) and Theorem B.4.
Let us connect the definition of a Markov chain X = {Xn, n ≥ 0} with the general
framework of Sect. B.1. (We write Xn or X(n).) In that section, we explained
that a random experiment is described by a sample space. The elements of the
sample space are the possible outcomes of the experiment. A probability is defined
on subsets, called events, of that sample space. Random variables are real-valued
functions of the outcome of the experiment.
To clarify these concepts, consider the case where the Xn are i.i.d. Bernoulli
random variables with P (Xn = 1) = P (Xn = 0) = 0.5. These random variables
describe flips of a fair coin. The random experiment is to flip the coin repeatedly,
forever. Thus, one possible outcome of this experiment is an infinite sequence of
0’s and 1’s. Note that an outcome is not 0 or 1: it is an infinite sequence since the
outcome specifies what happens when we flip the coin forever. Thus, the set Ω of
outcomes is the set {0, 1}∞ of infinite sequences of 0’s and 1’s. If ω is one such
sequence, we have ω = (ω0 , ω1 , . . .) where ωn ∈ {0, 1}. It is then natural to define
Xn (ω) = ωn , which simply says that Xn is the outcome of flip n, for n ≥ 0. Hence
Xn(ω) ∈ ℜ for all ω ∈ Ω and we see that each Xn is a real-valued function defined
on Ω. For instance, X0 (1101001 . . .) = 1 since ω0 = 1 when ω = 1101001 . . . .
Similarly, X1 (1101001 . . .) = 1 and X2 (1101001 . . .) = 0. To specify the random
experiment, it remains to define the probability on Ω. The simplest way is to say
that
    P({ω | ω0 = a, ω1 = b, . . . , ωn = z}) = P(X0 = a, . . . , Xn = z) = 1/2^{n+1},

for all n ≥ 0 and all a, b, . . . , z ∈ {0, 1}.
Similarly,
P (X0 = i0 , X1 = i1 , . . . , Xn = in )
= π0 (i0 )P (i0 , i1 ) × · · · × P (in−1 , in ), (2.1)
for all n ≥ 0 and i0 , i1 , . . . , in in X . Here, π0 (i0 ) is the probability that the Markov
chain starts in state i0 .
This identity is equivalent to (1.3). Indeed, if we let
An = {X0 = i0 , X1 = i1 , . . . , Xn = in }
then (2.1) states that P(An) = P(An−1)P(in−1, in), i.e., that P(Xn = in | An−1) = P(in−1, in), which is precisely the Markov property (1.3).
Before we discuss the case of Markov chains, let us consider the simpler example of
coin flips. Let then {Xn , n ≥ 0} be i.i.d. Bernoulli random variables with P (Xn =
0) = P (Xn = 1) = 0.5, as in the previous section. We think of Xn = 1 if flip n
yields heads and Xn = 0 if it yields tails. We want to show that, as we keep flipping
the coin, the fraction of heads approaches 50%. There are two statements that make
this idea precise.
The first statement, called the Weak Law of Large Numbers (WLLN), says that it is
very unlikely that the fraction of heads in n coin flips differs from 50% by even a
small amount, say 1%, if n is large. For instance, let n = 10^5. We want to show that the likelihood that the fraction of heads among 10^5 flips is more than 51% or less
than 49% is small. Moreover, this likelihood can be made as small as we wish if we
flip the coin more times.
To show this, let
    Yn = (X0 + · · · + Xn−1)/n
be the fraction of heads in the first n flips. We claim that
    P(|Yn − E(Yn)| ≥ ε) ≤ var(Yn)/ε².   (2.2)
This result is called Chebyshev’s inequality (Fig. 2.2).
To see (2.2), observe that¹

    1{|Yn − E(Yn)| ≥ ε} ≤ (Yn − E(Yn))²/ε².   (2.3)
Indeed, if |Yn − E(Yn)| ≥ ε, then (Yn − E(Yn))² ≥ ε², so that if the left-hand side
of inequality (2.3) is one, the right-hand side is at least equal to one. Also, if the
left-hand side is zero, it is less than or equal to the right-hand side. Thus, (2.3) holds
and (2.2) follows by taking the expected values in (2.3), since E(1A ) = P (A) and
E((Yn − E(Yn ))2 ) = var(Yn ) and since expectation is monotone (B.2).
1 By definition, 1{C} takes the value 1 if the condition C holds and the value 0 otherwise.
Now, var(Yn) = var(X0)/n, because the Xn are i.i.d. (by the properties of variance reviewed in Appendix B), and E(Yn) = 0.5. Hence,

    P(|Yn − 0.5| ≥ ε) ≤ var(X0)/(n ε²).

Thus, since var(X0) = 1/4,

    P(|Yn − 0.5| ≥ ε) ≤ 1/(4n ε²).

In particular, if we choose ε = 1% = 0.01, we find

    P(|Yn − 0.5| ≥ 1%) ≤ 2,500/n = 0.025 with n = 10^5.
More generally, we have shown that

    P(|Yn − 0.5| ≥ ε) ≤ 1/(4n ε²) → 0 as n → ∞, for every ε > 0.
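A quick simulation (a sketch, not from the text) makes the bound concrete: estimate P(|Yn − 0.5| ≥ 1%) for n = 10^5 by generating many batches of fair coin flips and compare with the Chebyshev bound 0.025. The true probability is in fact far smaller than the bound, which is only an upper estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, trials = 100_000, 0.01, 10_000

# The number of heads in n fair flips is Binomial(n, 1/2); Y is the fraction of heads.
Y = rng.binomial(n, 0.5, size=trials) / n
estimate = np.mean(np.abs(Y - 0.5) >= eps)

print("empirical P(|Yn - 0.5| >= 1%):", estimate)               # much smaller than the bound
print("Chebyshev bound 1/(4 n eps^2):", 1 / (4 * n * eps**2))   # 0.025
```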
The second statement is the Strong Law of Large Numbers (SLLN). It says that,
for all the sequences of coin flips we will ever observe, the fraction Yn actually
converges to 50% as we keep on flipping the coin.
There are many sequences of coin flips for which the fraction of heads does not
approach 50%. For instance, the sequence that yields heads for every flip is such
that Yn = 1 for all n and thus Yn does not converge to 50%. Similarly, the sequence
001001001001001 . . . is such that Yn approaches 1/3 and not 50%. What the SLLN
implies is that all those sequences such that Yn does not converge to 50% have
probability 0: they will never be observed.
Thus, this statement is very deep because there are so many sequences to rule
out. Keeping track of all of them seems rather formidable. Indeed, the proof of this
statement is quite clever. Here is how it proceeds. Note that
    P(|Yn − 0.5| ≥ ε) ≤ E(|Yn − 0.5|⁴)/ε⁴, ∀n, ε > 0.
Indeed,
    1{|Yn − 0.5| ≥ ε} ≤ |Yn − 0.5|⁴/ε⁴
and the previous inequality follows by taking expectations. Now,
    E(|Yn − 0.5|⁴) = E(((X0 − 0.5) + · · · + (Xn−1 − 0.5))⁴)/n⁴ = (1/n⁴) E( Σ_{a,b,c,d} Za Zb Zc Zd ),

where Zm := Xm − 0.5 and the sum is over all a, b, c, d ∈ {0, 1, . . . , n − 1}. This sum consists of n terms Za⁴, 3n(n − 1) terms Za²Zb² with a ≠ b, and other terms where at least one factor Za is not repeated. The latter terms have zero mean since E(Za Zb Zc Zd) = E(Za)E(Zb Zc Zd) = 0, by independence, whenever b, c, and d are all different from a. Consequently,

    E( Σ_{a,b,c,d} Za Zb Zc Zd ) = nE(Z0⁴) + 3n(n − 1)E(Z0²Z1²) = nα + 3n(n − 1)β
with α := E(Z0⁴) and β := E(Z0²Z1²). Hence, substituting the result of this calculation in the previous expressions, we find that

    P(|Yn − 0.5| ≥ ε) ≤ (nα + 3n(n − 1)β)/(n⁴ ε⁴) ≤ C/(n² ε⁴),

for some finite constant C.
This expression shows that the events An := {|Yn − 0.5| ≥ ε} have probabilities that
add up to a finite number. From the Borel–Cantelli Theorem B.1, we conclude that
P (An , i.o.) = 0.
This result says that, with probability one, ω belongs only to finitely many An's. Hence,³ with probability one, there is some n(ω) so that ω ∉ An for n ≥ n(ω). That is,

    |Yn(ω) − 0.5| < ε for all n ≥ n(ω).
Since this property holds for an arbitrary ε > 0, we conclude that, with probability
one,
Yn (ω) → 0.5 as n → ∞.
Indeed, if Yn(ω) does not converge to 50%, there must be some ε > 0 so that |Yn − 0.5| > ε for infinitely many n's and we have seen that this is not the case.
The results that we proved for coin flips extend to i.i.d. random variables {Xn , n ≥
0} to show that
    Yn := (X0 + · · · + Xn−1)/n
approaches E(X0 ) as n → ∞. As for coin flips, there are two ways of making that
statement precise.
² Recall that Σ_n 1/n² < ∞.
We need a definition.
The random variables Xn converge in probability to X, written Xn →p X, if

    P(|Xn − X| ≥ ε) → 0 as n → ∞, for all ε > 0.
    Yn = (X0 + · · · + Xn−1)/n →p μ.   (2.4)
Proof Assume that E(Xn²) < ∞. The proof is then the same as for coin flips and is
left as an exercise. For the general case, see Theorem 15.14.
The first result of this type was proved by Jacob Bernoulli (Fig. 2.3).
The random variables Xn converge almost surely to X, written Xn →a.s. X, if

    P( lim_{n→∞} Xn(ω) = X(ω) ) = 1.
Thus, this convergence means that the sequence of real numbers Xn (ω) converges
to the real number X(ω) as n → ∞, with probability one.
Let {Xn , n ≥ 0} be as in the statement of Theorem 2.1. We have the following
result.4
    (X0 + · · · + Xn−1)/n → μ as n → ∞, with probability 1.
Thus, the sample mean values Yn := (X0 + · · · + Xn−1 )/n converge to the
expected value, with probability 1. (See Fig. 2.4.)
The proof is then the same as for coin flips and is left as an exercise. The proof of
the SLLN in the general case is given in Theorem 15.14.
Figure 2.5 illustrates the SLLN and WLLN. The SLLN states that the sample
means of i.i.d. random variables converge to the mean, with probability one. The
WLLN says that as the number of samples increases, the fraction of realizations
where the sample mean differs from the mean by some amount gets small.
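A small simulation (a sketch, with an assumed distribution) illustrates both statements: plot the sample mean of i.i.d. random variables as a function of n and watch it settle at the mean.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n = 10_000

# i.i.d. exponential random variables with mean 2; the sample means should converge to 2.
X = rng.exponential(scale=2.0, size=n)
sample_means = np.cumsum(X) / np.arange(1, n + 1)

plt.plot(sample_means, label="sample mean")
plt.axhline(2.0, color="k", linestyle="--", label="E(X) = 2")
plt.xlabel("n")
plt.legend()
plt.show()
```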
The long-term fraction of time that a finite irreducible Markov chain spends in a
given state is the invariant probability of that state. For instance, a Markov chain
X(n) on {0, 1} with P (0, 1) = a = P (1, 0) with a ∈ (0, 1] spends half of the time
in state 0, in the long term. The Markov chain in Fig. 1.2 spends a fraction 12/39 of
the time in state A, in the long term.
To understand this property, one should look at the returns to state i, as shown
in Fig. 2.6. The figure shows a particular sequence of values of X(n) and it
decomposes this sequence into cycles between successive returns to a given state
i. A new cycle starts when the Markov chain comes back to i. The durations of
these successive cycles, T1 , T2 , T3 , . . ., are independent and identically distributed,
because the Markov chains start afresh from state i at each time Tn , independently
of the previous states. This is a consequence of the Markov property for any given
value k of Tn and of the fact that the distribution of the evolution starting from state
i at time k does not depend on k.
It is easy to see that these random times have a finite mean. Indeed, fix one state
i. Then, starting from any given state j , there is some minimum number Mj of steps
required to go to state i. Also, there is some probability pj that the Markov chain
will go from j to i in Mj steps. Let then M = maxj Mj and p = minj pj . We can
then argue that, starting from any state at time 0, there is at least a probability p that
the Markov chain visits state i after at most M steps. If it does not, we repeat the
argument starting at time M. We conclude that Ti ≤ Mτ where τ is a geometric
Fig. 2.6 The cycles between returns to state i are i.i.d. The law of large numbers explains the
convergence of the long-term fraction of time to a constant
random with parameter p. Hence E(Ti ) ≤ ME(τ ) = M/p < ∞, as claimed. Note
also that E(Ti⁴) ≤ M⁴ E(τ⁴) < ∞.
The Strong Law of Large Numbers states that
    (T1 + T2 + · · · + Tk)/k → E(T1), as k → ∞, with probability 1.   (2.5)
Thus, the long-term fraction of time that the Markov chain spends in state i is
given by
    lim_{k→∞} k/(T1 + T2 + · · · + Tk) = 1/E(T1), with probability 1.   (2.6)
Let us clarify why (2.6) implies that the fraction of time in state i converges to
1/E(T1 ). Let A(n) be the number of visits to state i by time n. We want to show
that A(n)/n converges to 1/E(T1). Note that if T1 + · · · + Tk ≤ n < T1 + · · · + Tk+1, then A(n) = k, so that

    k/(T1 + · · · + Tk+1) < A(n)/n = k/n ≤ k/(T1 + · · · + Tk).

By (2.6), the right-hand side converges to 1/E(T1). The left-hand side equals 1/((T1 + · · · + Tk)/k + Tk+1/k), which also converges to 1/E(T1) provided that Tk+1/k → 0. Hence,

    A(n)/n → 1/E(T1).
To see that Tk+1/k → 0 with probability 1, note that, by the argument above,

    P(Tk+1/k > ε) = P(Tk+1 > kε) ≤ (1 − p)^{⌊kε/M⌋} ≤ (1 − p)^{αk − 1},

with α = ε/M. These probabilities are summable in k. Thus, by the Borel–Cantelli Theorem B.1, the event Tk+1/k > ε occurs only for finitely many values of k, which proves the convergence to zero.
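A short simulation (a sketch, with an assumed two-state chain, not the example of the text) illustrates the cycle argument: the empirical fraction of time in state 0 and the reciprocal of the average return time to state 0 agree.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-state chain with P(0,1) = a and P(1,0) = b; its invariant distribution
# has pi(0) = b / (a + b).  (Illustrative values below.)
a, b, N = 0.2, 0.3, 200_000
P = np.array([[1 - a, a], [b, 1 - b]])

x, visits0, returns, last_visit = 0, 0, [], 0
for n in range(1, N + 1):
    x = rng.choice(2, p=P[x])
    if x == 0:
        visits0 += 1
        returns.append(n - last_visit)   # length of the cycle that just ended
        last_visit = n

print("fraction of time in 0:", visits0 / N)            # about b/(a+b) = 0.6
print("1 / mean return time :", 1 / np.mean(returns))   # about the same value
```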
This section presents the proof of the main result about Markov chains.
    (1/N) Σ_{n=0}^{N−1} 1{X(n) = i} → π(i).
However, taking expectation, we find that the left-hand side is equal to φ(i). Thus,
φ = π and the invariant distribution is unique.5
If the Markov chain is irreducible but not aperiodic, then πn may not converge to
the invariant distribution π . For instance, if the Markov chain alternates between 0
and 1 and starts from 0, then πn = [1, 0] for n even and πn = [0, 1] for n odd, so
that πn does not converge to π = [0.5, 0.5].
If the Markov chain is aperiodic, πn → π . Moreover, the convergence is
geometric. We first illustrate the argument on a simple example shown in Fig. 2.7.
Consider the number of steps to go from 1 to 1. Note that
P [X(M) = 1|X(0) = i] ≥ p, i = 1, 2, 3, 4.
Now, consider two copies of the Markov chain: {X(n), n ≥ 0} and {Y (n), n ≥ 0}.
One chooses X(0) with distribution π0 and Y (0) with the invariant distribution π .
The two Markov chains evolve independently initially. We define

    τ := min{n ≥ 0 | X(n) = Y(n)},

the first time the two chains meet.
Thus, P(τ > M) ≤ 1 − p². If τ > M, then the two Markov chains have not met yet by time M. Using the same argument as before, we see that they have a probability at least p² of meeting in the next M steps. Thus,

    P(τ > kM) ≤ (1 − p²)^k.
Now, modify X(n) by gluing it to Y (n) after time τ . This coupling operation does
not change the fact that X(n) still evolves according to the transition matrix P , so
that P (X(n) = i) = πn (i) where πn = π0 P n .
Now,

    Σ_i |P(X(n) = i) − P(Y(n) = i)| ≤ 2P(X(n) ≠ Y(n)) ≤ 2P(τ > n).

Hence,

    Σ_i |πn(i) − π(i)| ≤ 2P(τ > n),

so that πn(i) → π(i) for every i, geometrically fast.
To extend this argument to a general aperiodic Markov chain, we need the fact
that for each state i there is some integer ni such that P^n(i, i) > 0 for all n ≥ ni.
We prove that fact as Lemma 2.3 in the following section.
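The geometric convergence is easy to observe numerically. A sketch, using the transition matrix assumed earlier for the five-page example (that chain is irreducible and aperiodic):

```python
import numpy as np

P = np.array([[0, .5, 0, .5, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0],
              [1/3, 1/3, 0, 0, 1/3], [0, .5, .5, 0, 0]])
pi = np.array([12, 9, 10, 6, 2]) / 39       # invariant distribution from Chapter 1

pi_n = np.array([1.0, 0, 0, 0, 0])          # start in state A
for n in range(1, 31):
    pi_n = pi_n @ P                         # pi_{n} = pi_{n-1} P
    if n % 5 == 0:
        print(n, np.sum(np.abs(pi_n - pi))) # this distance decays geometrically
```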
2.5.3 Periodicity
We start with a property of the set of return times of an irreducible Markov chain.
Lemma 2.1 Fix a state i and let S := {n > 0 | P^n(i, i) > 0} and d = g.c.d.(S).
There must be two integers n and n + d in the set S.
Proof To illustrate the argument, suppose that a, b ∈ S with g.c.d.{a, b} = 3, and apply the following version of Euclid's algorithm to the pair (a, b). At each step, we go from (x, y) with x ≤ y to the ordered pair of {x, y − x}. Note
that at each step, each term in the pair (x, y) is an integer linear combination of a
and b. For instance, (6, 15) = (b − a, a). Then, (6, 9) = (b − a, a − (b − a)) =
(b − a, 2a − b), and so on. Eventually, we must get to (3, 3). Indeed, the terms are
always decreasing until we get to zero. Assume we get to (x, x) with x ≠ 3. At the
previous step, we had (x, 2x). The step before must have been (x, 3x), and so on.
Going back all the way to (a, b), we see that a and b are both multiples of x. But
then, g.c.d.{a, b} = x, a contradiction.
From this construction, since at each step the terms are integer linear combina-
tions of a and b, we see that
3 = ma + nb
for some integers m and n. Write m = m⁺ − m⁻ and n = n⁺ − n⁻, where m⁺, m⁻, n⁺, n⁻ are nonnegative integers, so that

    3 = m⁺a + n⁺b − m⁻a − n⁻b,

and define

    N = m⁻a + n⁻b, so that N + 3 = m⁺a + n⁺b.
The last step of the argument is to notice that if a, b ∈ S, then αa + βb ∈ S for any nonnegative integers α and β that are not both zero. This fact follows from the definition of S as
the return times from i to i. Hence, both N and N + 3 are in S.
The proof for a general set S with gcd equal to d is identical.
This result enables us to show that the period of a Markov chain is well-defined.
Lemma 2.2 For an irreducible Markov chain, d(i) defined in (1.6) has the same
value for all states.
Proof Pick j ≠ i. We show that d(j) ≤ d(i). This suffices to prove the lemma,
since by symmetry one also has d(i) ≤ d(j ).
By irreducibility, P^m(j, i) > 0 for some m and P^n(i, j) > 0 for some n. Now, by definition of d(i) and by the previous lemma, there is some integer N such that P^N(i, i) > 0 and P^{N+d(i)}(i, i) > 0. But then,

    P^{n+N+m}(j, j) ≥ P^m(j, i) P^N(i, i) P^n(i, j) > 0,

and, similarly, P^{n+N+d(i)+m}(j, j) > 0.
This implies that the integers K := n + N + m and K + d(i) are both in S := {n > 0 | P^n(j, j) > 0}. Clearly, this shows that d(j) = g.c.d.(S) divides both K and K + d(i), hence it divides d(i), so that d(j) ≤ d(i).
The following fact then suffices for our proof of convergence, as we explained in
the example.
Lemma 2.3 Let the Markov chain be irreducible and aperiodic, and fix a state i. Then there is some integer ni such that P^n(i, i) > 0 for all n ≥ ni.

Proof We know from Lemma 2.1 that there is some integer N such that N, N + 1 ∈
S. We claim that
    n ∈ S, ∀n > N².
Indeed, any such n can be written as n = mN + k with 0 ≤ k ≤ N − 1 and, since n > N², with m ≥ N > k. Now,

    mN + 0 = mN,
    mN + 1 = (m − 1)N + (N + 1),
    mN + 2 = (m − 2)N + 2(N + 1),
    . . . ,
    mN + N − 1 = (m − N + 1)N + (N − 1)(N + 1).

Thus, n = mN + k = (m − k)N + k(N + 1) is a nonnegative integer combination of N and N + 1 with not both coefficients zero, so that n ∈ S, as claimed.
2.6 Summary
• Sample Space;
• Laws of Large Numbers: SLLN and WLLN;
• WLLN from Chebyshev’s Inequality;
• SLLN from Borel–Cantelli and fourth moment bound;
• SLLN for Markov chains using the i.i.d. return times to a state;
• Proof of Big Theorem.
2.7 References
2.8 Problems
Problem 2.1 Consider a Markov chain Xn that takes values in {0, 1}. Explain why
{0, 1} is not its sample space.
Problem 2.2 Consider again a Markov chain that takes values in {0, 1} with
P (0, 1) = a and P (1, 0) = b. Exhibit two different sample spaces and the
probability on them for that Markov chain.
Problem 2.3 Draw the smallest periodic Markov chain. Show that the fraction of
time in the states converges but the probability of being in a state at time n does not
converge.
Problem 2.4 For the Markov chain in Problem 2.2, calculate the eigenvalues and
use them to get a bound on the distance between the distribution at time n and the
invariant distribution.
Problem 2.5 Why does the strong law imply the weak law? More concretely, let
Xn , X be random variables such that Xn → X almost surely. Show that Xn → X
in probability.
Hint Fix ε > 0 and define Zn = 1{|Xn − X| ≥ ε}. Use DCT to show that E(Zn) →
0 as n → ∞ if Xn → X almost surely.
Problem 2.6 Draw a Markov chain with four states that is irreducible and aperi-
odic. Consider two independent versions of the Markov chain: one that starts in
state 1, the other in state 2. Explain why they will meet after a finite time.
Problem 2.7 Consider the Markov chain of Fig. 1.2. Use Python to calculate the
eigenvalues of P . Let λ be the largest absolute value of the eigenvalues other than
1. Use Python to calculate
    d(n) := Σ_i |π(i) − πn(i)|,
Problem 2.8 You flip a fair coin. If the outcome is “head,” you get a random
amount of money equal to X and if it is “tail,” you get a random amount Y. Prove formally that on average, you get

    (1/2) E(X) + (1/2) E(Y).
Problem 2.9 Can you find random variables that converge to 0 almost surely, but
not in probability?
Problem 2.10 Let {Xn , n ≥ 1} be i.i.d. zero-mean random variables with variance
σ 2 . Show that Xn /n → 0 with probability one as n → ∞.
Hint Borel–Cantelli.
    (1/N) Σ_{n=0}^{N−1} f(Xn) → Σ_{i∈X} π(i)f(i) w.p. 1, as N → ∞.
3 Multiplexing: A
Background:
• General RV (B.4)
[Figure: telephone, TV, and internet traffic sharing a link of rate C.]
In the internet, at any given time, a number of packet flows share links. For
instance, 20 users may be downloading web pages or video files and use the same
coaxial cable of their service provider.
The transmission control protocol (TCP) arranges for these different flows to
share the links as equally as possible (at least, in principle).
We focus our attention on a single link, as shown in Fig. 3.3. The link transmits
bits at rate C bps. If ν connections are active at a given time, they each get a rate
C/ν. We want to study the typical rate that a connection gets. The nontrivial aspect
of the problem is that ν is a random variable.
As a simple model, assume that there are N ≫ 1 users who can potentially
use that link. Assume also that the users are active independently, with probability
p. Thus, the number ν of active users is Binomial(N, p) that we also write as
B(N, p). (See Sect. B.2.8.)
Figure 3.4 shows the probability mass function for N = 100 and p = 0.1, 0.2,
and 0.5. To be specific, assume that N = 100 and p = 0.2. The number ν of active
users is B(100, 0.2) that we also write as Binomial(100, 0.2). On average, there
Fig. 3.4 The probability mass function of the Binomial(100, p) distribution, for p = 0.1, 0.2
and 0.5
are Np = 20 active users. However, there is some probability that a few more than
20 users are active. We want to find a number m so that the likelihood that there are
more than m active users is negligible, say 5%. Given that value, we know that each
active user gets at least a rate C/m, with probability 95%.
Thus, we can dimension the links, or provision the network, based on that value
m. Intuitively, m should be slightly larger than the mean. Looking at the actual
distribution, for instance, by using Python’s “ppf” as in Fig. 3.5, we find that
P (ν ≤ 27) = 0.966 > 95% and P (ν ≤ 26) = 0.944 < 95%. (3.1)
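In Python, the computation behind (3.1) can be done with scipy.stats; a sketch along the lines of the ppf-based code of Fig. 3.5, which is not reproduced here:

```python
from scipy.stats import binom

N, p = 100, 0.2
nu = binom(N, p)

# Smallest m such that P(nu <= m) >= 95%: the 95th percentile of the distribution.
m = int(nu.ppf(0.95))
print(m)                        # 27
print(nu.cdf(27), nu.cdf(26))   # approximately 0.966 and 0.944, as in (3.1)
```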
To avoid having to use distribution tables or computation tools, we use the fact
that the binomial distribution is well approximated by a Gaussian random variable
that we discuss next.
(a) A random variable W is Gaussian, or normal, with mean 0 and variance 1, and
one writes W =D N (0, 1), if its probability density function (pdf) is fW where
    fW(x) = (1/√(2π)) exp(−x²/2), x ∈ ℜ.

(b) More generally, a random variable X is Gaussian, or normal, with mean μ and variance σ², and one writes X =D N(μ, σ²), if

    X = μ + σW,

where W =D N(0, 1).
Figure 3.7 shows the pdf of a N (0, 1) random variable W . Note in particular
that
1 See (B.9).
P (W > 1.65) ≈ 5%, P (W > 1.96) ≈ 2.5% and P (W > 2.32) ≈ 1%. (3.2)
The Central Limit Theorem states that the sum of many small independent
random variables is approximately Gaussian. This result explains that thermal noise,
due to the agitation of many electrons, is Gaussian. Many other natural phenomena
exhibit a Gaussian distribution when they are caused by a superposition of many
independent effects.
    P(Y(n) ≤ x) → P(W ≤ x), ∀x ∈ ℜ,
where W is a N (0, 1) random variable. We prove this result in the next chapter.
More generally, one has the following definition.
Consider, for example, X(n) = 3 + 1/n and X = 3. Then P(X(n) ≤ 3) = 0, which does not converge to P(X ≤ 3) = 1. But,

    P(X(n) ≤ x) → P(X ≤ x), ∀x ≠ 3.

This example explains why the definition (3.4) requires convergence of P(X(n) ≤ x) to P(X ≤ x) only for x such that P(X = x) = 0.
How does this notion of convergence relate to convergence in probability and
almost sure convergence? First note that convergence in distribution is defined even
if the random variables X(n) and X are not on the same probability space, since it
involves only the distributions of the individual random variables. One can show2
that

    X(n) →a.s. X implies X(n) →p X implies X(n) ⇒ X.
This may seem mysterious but is in fact quite obvious. First note that a random
variable with cdf F (·) can be constructed by choosing a random variable Z =D
U[0, 1] and defining (see Fig. 3.8)

    X(Z) := min{x | F(x) ≥ Z}.
Indeed, one then has P (X(Z) ≤ a) = F (a) since X(Z) ≤ a if and only if Z ∈
[0, F (a)], which has probability F (a) since Z =D U [0, 1]. But then, if X(n) ⇒ X,
we have FXn (x) → FX (x) whenever P (X = x) = 0, and this implies that
Xn(z) → X(z) for almost all z, so that the constructed random variables Xn(Z) converge to X(Z) with probability 1.
Note that a B(N, p) random variable X can be written as

    X = Y1 + · · · + YN,
where the random variables Yn are i.i.d. and Bernoulli with parameter p. Thus, by
the CLT,
    (X − Np)/√N ≈ N(0, σ²),
where σ 2 = var(Y1 ) = E(Y12 ) − (E(Y1 ))2 = p(1 − p). Hence, one can argue that
B(N, p) ≈D N Np, N σ 2 =D N (Np, Np(1 − p)). (3.5)
For p = 0.2 and N = 100, one concludes that B(100, 0.2) ≈ N (20, 16), which is
confirmed by Fig. 3.9.
46 3 Multiplexing: A
A look at Fig. 3.9 shows that it is indeed unlikely that ν is larger than 27 when
ν =D B(100, 0.2).
One can invert the calculation that we did in the previous section and try to guess p
from the observed fraction Y (N) of active users out of N 1. From the ideas (1)
and (2) above, together with the symmetry of the Gaussian distribution around its
mean, we see that the events
A1 = {B(N, p) ≥ Np + 1.65 Np(1 − p)}
and
A2 = {B(N, p) ≤ Np − 1.65 Np(1 − p)}
each have a probability close to 5%. With Y (N) =D B(N, p)/N , we see that
p(1 − p)
A1 = Y (N) ≥ p + 1.65
N
and
p(1 − p)
A2 = Y (N) ≤ p − 1.65 .
N
3.2 Gaussian Random Variable and CLT 47
Hence, the event A1 ∪ A2 has probability close to 10%, so that its complement has
probability close to 90%. Consequently,
p(1 − p) p(1 − p)
P Y (N) − 1.65 ≤ p ≤ Y (N) + 1.65 ≈ 90%.
N N
For instance, if we observe that 30% of the 100 users are active, then we guess
that p is between 0.22 and 0.38, with probability 90%. In other words, [Y (N) −
0.08, Y (N) + 0.08] is a 90%-confidence interval for p.
Figure 3.7 shows that we can get a 5%-confidence interval by replacing 1.65 by
2. Thus, we see that
1 1
Y (N) − √ , Y (N) + √ (3.6)
N N
Thus, Y (1, 089) is an estimate of p with an error less than 0.03, with probability
95%. Such results form the basis for the design of public opinion surveys.
In many cases, one does not know a bound on the variance. In such situations,
one replaces the standard deviation by the sample standard deviation. That is, for
i.i.d. random variables {X(n), n ≥ 1} with mean μ, the confidence intervals for μ
are as follows:
σn σn
μn − 1.65 √ , μn + 1.65 √ = 90% − Confidence Interval
n n
σn σn
μn − 2 √ , μn + 2 √ = 95% − Confidence Interval,
n n
48 3 Multiplexing: A
where
X(1) + · · · + X(n)
μn =
n
and
n
n
m=1 (X(m) − μn )
2 2
n m=1 X(m)
σn2 = = − μ2n .
n−1 n−1 n
For the second equality, note that the cross-terms E(X(i)X(j )) for i = j vanish
because the random variables are independent and zero-mean.
Hence,
n−1 2
n
E (X(1) − μn )2 = σ and E (X(m) − μn )2 = (n − 1)σ 2 .
n
m=1
1
n
σn2 := E (X(m) − μn )2 .
n−1
m=1
3.3 Buffers
somewhat like an envelope you send by regular mail (if you remember that). A host
(e.g., a computer, a smartphone, or a web cam) sends packets to a switch. The switch
has multiple input and output ports, as shown in Fig. 3.10.
The switch stores the packets as they arrive and sends them out on the appropriate
output port, based on the destination address of the packets. The packets arrive at
random times at the switch and, occasionally, packets that must go out on a specific
output port arrive faster than the switch can send them out. When this happens,
packets accumulate in a buffer. Consequently, packets may face a queueing3 delay
before they leave the switch. We study a simple model of such a system.
3 Queueing and queuing are alternative spellings; queueing tends to be preferred by researchers and
has the peculiar feature of having five vowels in a row, somewhat appropriately.
50 3 Multiplexing: A
0 1 n−1 n n+1 N −1 N
p2 p2 p2 p2 p2 p2 p2 p2
1 − p2 ...... ...... 1 − p0
p0 p0 p0 p0 p0 p0 p0 p0
p1 p1 p1 p1 p1
Fig. 3.11 The transition probabilities for the buffer occupancy for one of the output ports
In this diagram,
p2 = λ(1 − μ)
p0 = μ(1 − λ)
p 1 = 1 − p0 − p 2 .
For instance, p2 is the probability that one new packet arrives and that the
transmission of a previous does not complete, so that the number of packets in the
buffer increases by one.
N
N
E(X) = iπ(i) = π(0) iρ i
i=0 i=0
Nρ N +1 − (N + 1)ρ N + 1
=ρ
(1 − ρ)(1 − ρ N +1 )
ρ p2 λ(1 − μ)
≈ = = ,
1−ρ p 0 − p2 μ−λ
How long do packets stay in the switch? Consider a packet that arrives when there
are k packets already in the buffer. That packet then leaves after k + 1 packet
transmissions. Since each packet transmission takes 1/μ steps, on average, the
expected time that the packet spends in the switch is (k + 1)/μ. Thus, to find the
expected time a packet stays in the switch, we need to calculate the probability φ(k)
that an arriving packet finds k packets already in the buffer. Then, the expected time
W that a packet stays in the switch is given by
k+1
W = φ(k).
μ
k≥0
The result of the calculation is given in the next theorem.
52 3 Multiplexing: A
1 1−μ
W = E(X) = .
λ λ−μ
Proof The calculation is a bit lengthy and the details may not be that interesting,
except that they explain how to calculate φ(k) and that they show that the simplicity
of the result is quite remarkable.
Recall that φ(k) is the probability that there are k + 1 packets in the buffer after a
given packet arrives at time n. Thus, φ(k) = P [X(n+1) = k +1 | A(n) = 1] where
A(n) is the number of arrivals at time n. Now, if D(n) is the number of transmission
completions at time n,
Also,
Consequently, the expected time W that a packet spends in the switch is given by
k+1 1 1
W = φ(k) = + kπ(k)(1 − μ1{k = 0}) + kπ(k + 1)
μ μ μ
k≥0 k≥0 k≥0
1 1
= + kπ(k)(1 − μ) + (k − 1)π(k)
μ μ
k≥0 k≥1
1 1−μ 1 1
= + E(X) + E(X) − 1 = + E(X) − 1
μ μ μ μ
1 λ(1 − μ) 1−μ 1
= + −1= = E(X).
μ μ(μ − λ) μ−λ λ
P [Xn+1 = k + 1 | An = 1] = P [Xn = k | An = 1]
= P [Xn = k] = π(k),
where the second identity comes from the independence of the arrivals An and the
backlog Xn . However, the first identity does not hold since it is possible that Xn+1 =
k, Xn = k, and An = 1. Indeed, one may have Dn = 1.
If one assumes that λ < μ 1, then the probability that An = 1 and Dn = 1
is negligible and it is then the case that π(k) ≈ π(k). We encounter that situation in
Sect. 5.6.
The previous result is a particular case of Little’s Law (Little 1961) (Fig. 3.13).
L = λW,
54 3 Multiplexing: A
One way to understand this law is to consider a packet that leaves the switch
after having spent T time units. During its stay, λT packets arrive, on average. So
the average backlog in the switch should be λT .
It turns out that Little’s law applies to very general systems, even those that do
not serve the packets in their order of arrival.
One way to see this is to think that each packet pays the switch one unit of money
per unit of time it spends in the switch. If a packet spends T time units, on average,
in the switch, then each packet pays T , on average. Thus, the switch collects money
at the rate of λT per unit of time, since λ packets go through the switch per unit of
time and each pays an average of T . Another way to look at the rate at which the
switch is getting paid is to realize that if there are L packets in the switch at any
given time, on average, then the switch collects money at rate L, since each packet
pays one unit per unit time. Thus, L = λT .
The maximum over p of this success rate occurs for p = 1/N and it is λ∗ where
3.5 Summary 55
1 N −1 1
λ∗ = 1 − ≈ ≈ 0.36.
N e
3.5 Summary
3.6 References
3.7 Problems
Problem 3.1 Write a Python code to compute the number of people to poll in a
public opinion survey to estimate the fraction of the population that will vote in
favor of a proposition within α percent, with probability at least 1 − β. Use an upper
bound on the variance. Assume that we know that p ∈ [0.4, 0.7].
Problem 3.2 We are conducting a public opinion poll to determine the fraction p
of people who will vote for Mr. Whatshisname as the next president. We ask N1
college-educated and N2 non-college-educated people. We assume that the votes
in each of the two groups are i.i.d. B(p1 ) and B(p2 ), respectively, in favor of
Whatshisname. In the general population, the percentage of college-educated people
is known to be q.
(a) What is a 95%-confidence interval for p, using an upper bound for the variance.
(b) How do we choose N1 and N2 subject to N1 + N2 = N to minimize the width
of that interval?
Problem 3.3 You flip a fair coin 10,000 times. The probability that there are more
than 5085 heads is approximately (choose the correct answer)
15%;
10%;
5%;
2.5%;
1%.
3.7 Problems 57
Problem 3.5 Consider a buffer that can transmit up to M packets in parallel. That
is, when there are m packets in the buffer, min{m, M} of these packets are being
transmitted. Also, each of these packets completes transmission independently in
the next time slot with probability μ. At each time step, a packet arrives with
probability λ.
(a) What are the transition probabilities of the corresponding Markov chain?
(b) For what values of λ, M, and μ do you expect the system to be stable?
(c) Write a Python simulation of this system.
Problem 3.6 In order to estimate the probability of head in a coin flip, p, you flip a
coin n times, and count the number of heads, Sn . You use the estimator p̂ = Sn /n.
You choose the sample size n to have a guarantee
P (|Sn /n − p| ≥ ) ≤ δ.
Problem 3.8 Consider one buffer where packets arrive one by one every 2 s and
take 1 s to transmit. What is the average delay through the queue per packet? Repeat
the problem assuming that the packets arrive ten at a time every 20 s. This example
shows that the delay depends on how “bursty” the traffic is.
p
Problem 3.9 Show that if X(n) → X, then X(n) ⇒ X.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Multiplexing: B
4
Before we explain the proof of the CLT, we have to describe the use of characteristic
functions.
φX (u) = E(eiuX ), u ∈ .
√
In this expression, i := −1.
Note that
∞
φX (u) = eiux fX (x)dx,
−∞
so that φX (u) is the Fourier transform of fX (x). As such, the characteristic function
determines the pdf uniquely.
As an important example, we have the following result.
Theorem 4.1 (Characteristic Function of N (0, 1)) Let X =D N (0, 1). Then,
u2
φX (u) = e− 2 . (4.1)
Proof One has
∞ 1 x2
φX (u) = eiux √ e− 2 dx,
−∞ 2π
so that
∞ ∞
d 1 x2 1 x2
φX (u) = ixeiux √ e− 2 dx = − ieiux √ de− 2
du −∞ 2π −∞ 2π
∞ ∞
1 x2 1 x2
= i √ e− 2 deiux = −u eiux √ e− 2 dx
−∞ 2π −∞ 2π
= −uφX (u).
d d u2
log(φX (u)) = −u = − ,
du du 2
which implies that
u2
φX (u) = Ae− 2 .
Since φX (0) = E(ei0X ) = 1, we see that A = 1, and this proves the result (4.1).
We have
4.3 Moments of N (0, 1) 61
iu(X(m) − μ)
φY (n) (u) = E e iuY (n)
= E Πm=1 exp
n
√
σ n
n
iu(X(1) − μ)
= E exp √
σ n
n
iu(X(1) − μ) u2 (X(1) − μ)2
= E 1+ √ − + o(1/n)
σ n 2σ 2 n
n !
= 1 − u2 /(2n) + o(1/n) → exp −u2 /2 , as n → ∞.
The third equality holds because the X(m) are i.i.d. and the fourth one follows from
the Taylor expansion of the exponential:
1
ea ≈ 1 + a + a 2 .
2
Third, we match the coefficients of u2m in these two expressions and we find that
1 2m 2m 1 1 m
i E X = − ,
(2m)! m! 2
62 4 Multiplexing: B
This gives1
(2m)!
E X2m = . (4.2)
m!2m
For instance,
2! 4!
E(X2 ) = = 1, E X4 = = 3.
1!21 2!22
Finally, we note that the coefficients of odd powers of u must be zero, so that
E X2m+1 = 0, for m = 0, 1, 2, . . . .
Z = X2 + Y 2 =D Exp(1/2).
Let θ be the angle of the vector (X, Y ) and R 2 = X2 + Y 2 . Thus (see Fig. 4.1)
dxdy = rdrdθ.
i 2m = (−1)m .
4.5 Two Applications of Characteristic Functions 63
x
dq
where
2
1 r
fθ (θ ) = 1{0 < θ < 2π } and fR (r) = r exp − 1{r ≥ 0}.
2π 2
√
Thus, the angle θ of (X, Y ) and the norm R = X2 + Y 2 are independent and have
the indicated distributions. But then, if V = R 2 =: g(R), we find that, for v ≥ 0,
2
1 1 r 1 v!
fV (v) = fR (r) = r exp − = exp −
|g (R)| 2r 2 2 2
which shows that the angle θ and V = X2 + Y 2 are independent, the former being
uniformly distributed in [0, 2π ] and the latter being exponentially distributed with
mean 2.
We have used characteristic functions to prove the CLT. Here are two other cute
applications.
A Poisson random variable X with mean λ can be viewed as a limit of a B(n, λ/n)
random variable Xn as n → ∞. To see this, note that
where the random variables {Zn (1), . . . , Zn (n)} are i.i.d. Bernoulli with mean λ/n.
Hence,
64 4 Multiplexing: B
n
" #n λ iu
E(exp{iuXn }) = E(exp{iu(Zn (1)})) = 1 + (e − 1) .
n
For the second identity, we use the fact that if Z =D B(p), then
Also, since
∞
λm −λ am
P (X = m) = e and ea = ,
m! m!
m=0
we find that
∞
λm
E(exp{iuX}) = exp{−λ}eium = exp{λ(eiu − 1)}.
m!
m=0
1
Xn → X, in distribution.
n
To see this, recall that
Also,
∞ 1
e−βx =
0 β
Moreover, since
P (Xn = m) = (1 − p)m p, m ≥ 0,
The function Q(x) is called the error function. With Python or the appropriate smart
phone app, you can get the value of Q(x). Nevertheless, the following bounds (see
Fig. 4.2) may be useful.
Proof Here is a derivation of the upper bound. For x > 0, one has
∞ ∞ ∞
1 y2 1 y − y2
Q(x) = fX (y)dy = √ e− 2 dy = √ e 2 dy
x x 2π 2π x y
66 4 Multiplexing: B
∞ ∞
1 y − y2 1 y2
≤√ e 2 dy = √ ye− 2 dy
2π x x x 2π x
∞ 2
1 − y2 1 x2
=− √ de = √ e− 2 .
x 2π x x 2π
For the lower bound, one uses the following calculation, again with x > 0:
∞ ∞
1 y2 1 y2
1+ e− 2 dy ≥ 1+ e− 2 dy
x2 x x y2
∞
1 − y2 1 − x2
=− d e 2 = e 2.
x y x
0.7
Tn
0.6
0.5
N = 100
0.4
0.3
N = 40
0.2
0.1
n
0
0 200 400 600 800 1000 1200
In these update rules, a and b are constants with a ∈ (0, 1) and b > 1. The idea
is to increase p(n) if no device transmitted and to decrease it after a collision. This
scheme is due to Hajek and Van Loon (1982) (Fig. 4.3).
Figure 4.4 shows the evolution over time of the success rate Tn . Here,
1
n−1
Tn = 1{X(m) = 1}.
n
m=0
The figure uses a = 0.8 and b = 1.2. We see that the throughput approaches the
optimal value for N = 40 and for N = 100. Thus, the scheme adapts automatically
to the number of active devices.
68 4 Multiplexing: B
4.8 Summary
• Characteristics Function;
• Proof of CLT;
• Moments of Gaussian;
• Sum of Squares of Gaussians;
• Poisson as limit of Binomial;
• Exponential as limit of Geometric;
• Adaptive Multiple Access Protocol.
4.9 References
The CLT is a classical result, see Bertsekas and Tsitsiklis (2008), Grimmett and
Stirzaker (2001) or Billingsley (2012).
4.10 Problems
Problem 4.1 Let X be a N(0, 1) random variable. You will recall that E(X2 ) = 1
and E(X4 ) = 3.
explained in the text. Plot the total backlog in all the stations as a function of
time.
Problem 4.3 Consider a multiple access scheme where the N stations indepen-
dently transmit short reservation packets with duration equal to one time unit with
probability p. If the reservation packets collide or no station transmits a reservation
packet, the stations try again. Once a reservation is successful, the succeeding station
transmits a packet during K time units. After that transmission, the process repeats.
Calculate the maximum fraction of time that the channel can be used for transmitting
packets. Note: This scheme is called Reservation Aloha.
Problem 4.4 Let X be a random variable with mean zero and variance 1. Show that
E(X4 ) ≥ 1.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Networks: A
5
We prove that result in the next chapter. The result should be intuitive: if Z < 1,
the spreading dies out, like a population that does not reproduce enough. This model
is also relevant for the spread of epidemics or cyber viruses.
5.2 Cascades
If most of your friends prefer Apple over Samsung, you may follow the majority. In
turn, your advice will influence other friends. How big is such an influence cascade?
We model that situation with nodes arranged in a line, in the chronological order
of their decisions, as shown in Fig. 5.2. Node n listens to the advice of a subset of
{0, 1, . . . , n − 1} who have decided before him. Specifically, node n listens to the
advice of node n − k independently with probability pk , for k = 1, . . . , n. If the
majority of these friends are blue, node n turns blue; if the majority are red, node
n turns red; in case of a tie, node n flips a fair coin and turns red with probability
1/2 or blue otherwise. Assume that, initially, node 0 is red. Does the fraction of red
nodes become larger than 0.5, or does the initial effect of node 0 vanish?
A first observation is that if nodes listen only to their left-neighbor with
probability p ∈ (0, 1), the cascade ends. Indeed, there is a first node that does
not listen to its neighbor and then turns red or blue with equal probabilities.
Consequently, there will be a string of red nodes followed by a string of blue node,
and so on. By symmetry, the lengths of those strings are independent and identically
distributed. It is easy to see they have a finite mean. The SLLN then implies that the
fraction of red nodes among the first n nodes converges to 0.5. In other words, the
influence of the first node vanishes.
The situation is less obvious if pk = p < 1 for all k. Indeed, in this case, as n
gets large, node n is more likely to listen to many previous neighbors. The slightly
5.3 Seeding the Market 73
surprising result is that, no matter how small p is, there is a positive probability that
all the nodes turn red.
Theorem 5.2 (Cascades) Assume pk = p ∈ (0, 1] for all k ≥ 1. Then, all nodes
turn red with probability at least equal to θ where
1−p
θ = exp − .
p
We prove the result in the next chapter. It turns out to be possible that every node
listens to at least one previous node. In that case, all the nodes turn red.
Some companies distribute free products to spread their popularity. What is the best
fraction of customers who should get free products? To explore this question, let
us go back to our model where each node listens only to its left-neighbor with
probability p. The system is the same as before, except that each node gets a free
product and turns red with probability λ. The fraction of red nodes increases in λ
and we write it as ψ(λ). If the cost of a product is c and the selling price is s, the
company makes a profit (s − c)ψ(λ) − cλ since it makes a profit s − c from a buyer
and loses c for each free product. The company then can select λ to optimize its
profit. Next, we calculate ψ(λ).
Let π(n − 1) be the probability that user n − 1 is red. If user n listens to n − 1,
he turns red unless n − 1 is blue and he does not get a free product. If he does not
listen to n − 1, he turns red with probability 0.5 if he does not get a free product and
with probability one otherwise. Thus,
Since p(1 − λ) < 1, the value of π(n) converges to the value ψ(λ) that solves the
fixed point equation
Hence,
1 + λ − p + λp
ψ(λ) = 0.5 .
1 − p(1 − λ)
To maximize the profit (s − c)ψ(λ) − cλ, we substitute the expression for ψ(λ)
in the profit and we set the derivative with respect to λ equal to zero. After some
algebra, we find that the optimal λ∗ is given by
∗ (1 − p)1/2 − (1 − p) 0.5(s − c)
λ = min 1, .
p c
Not surprisingly, λ∗ increases with the profit margin (s − c)/c and decreases with
p.
Three people walk into a bar. No, this is not a joke. They chat and, eventually, leave
with the same majority opinion. As such events repeat, the opinion of the population
evolves. We explore a model of this evolution.
Consider a population of 2N ≥ 4 people. Initially, half believe red and the other
half believe blue. We choose three people at random. If two are blue and one is red,
they all become blue, and they return to the general population. The other cases are
similar. The same process then repeats. Let Xn be the number of blue people after n
steps, for n ≥ 1 and let X0 = N. Then Xn is a Markov chain. This Markov chain has
two absorbing states: 0 and 2N. Indeed, if Xn = k for some k ∈ {1, . . . , 2N − 1},
there is a positive probability of choosing three people where two have one opinion
and the third has a different one. After their meeting, Xn+1 = Xn . The Markov
chain is such that P (1, 0) = 1 and P (2N − 1, 2N) = 1. Moreover, P (k, k) >
0, P (k, k + 1) > 0, and P (k, k − 1) > 0 for all k ∈ {2, . . . , 2N − 2}. Consequently,
with probability one,
Thus, eventually, everyone is blue or everyone is red. By symmetry, the two limits
have probability 0.5.
What is the effect of the media on the limiting consensus? Let us modify our
previous model by assuming that when two blue and one red person meet, they all
turn blue with probability 1 − p and remain as before with probability p. Here p
models the power of the media at convincing people to stay red. If two red and one
blue meet, they all turn red.
We have, for k ∈ {2, . . . , 2N − 2},
5.4 Manufacturing of Consent 75
k(k − 1)(2N − k)
P [Xn+1 = k + 1 | Xn = k] = (1 − p)3 =: p(k).
2N(2N − 1)(2N − 2)
We want to calculate
where T0 is the first time that Xn = 0 and T2N is the first time that Xn = 2N. Then,
α(N) is the probability that the population eventually becomes all red.
The first step equations are, for k ∈ {2, . . . , 2N − 2},
i.e.,
q(k) q(k)
α(k + 1) = (1 + )α(k) − α(k − 1), k = 2, 3, . . . , 2N − 2
p(k) p(k)
2N − k − 1
= 1+ α(k)
(1 − p)(k − 1)
2N − k − 1)
− α(k − 1), k = 2, 3, . . . , 2N − 2.
(1 − p)(k − 1)
Fig. 5.3 The effect of the media. Here, p is the probability that someone remains red after chatting
with two blue people. The graph shows the probability that the whole population turns blue instead
of red. A small amount of persuasion goes a long way
5.5 Polarization
In most countries, the population is split among different political and religious
persuasions. How is this possible if everyone is faced with the same evidence? One
effect is that interactions are not fully mixing. People belong to groups that may
converge to a consensus based on the majority opinion of the group.
To model this effect, we consider a population of N people. An adjacency matrix
G specifies which people are friends. Here, G(v, w) = 1 if v and w are friends and
G(v, w) = 0 otherwise.
Initially, people are blue or red with equal probabilities. We pick one person at
random. If that person has a majority of red friends, she becomes red. If the majority
of her friends are blue, she becomes blue. If it is a tie, she does not change. We repeat
the process. Note that the graph does not change; it is fixed throughout. We want to
explore how the coloring of people evolves over time.
Let Xn (v) ∈ {B, R} be the state of person v at time n, for n ≥ 0 and v ∈
{1, . . . , N }. We pick v at random. We count the number of red friends and blue
friends of v. They are given by
G(v, w)1{Xn (w) = R} and G(v, w)1{Xn (w) = B}.
w w
Thus,
⎧
⎨ R, if w G(v, w)1{Xn (w) = R} > w G(v, w)1{Xn (w) = B}
Xn+1 (v) = B, if w G(v, w)1{Xn (w) = R} < w G(v, w)1{Xn (w) = B}
⎩
Xn (v), otherwise.
That is, V (Xn ) is the number of disagreements among friends. The rules of
evolution guarantee that V (Xn+1 ) ≤ V (Xn ) and that P (V (Xn+1 ) < V (Xn )) > 0
unless P (Xn+1 = Xn ) = 1. Indeed, if the state of v changes, it is to make that
person agree with more of her neighbors. Also, if there is no v who can reduce
her number of disagreements, then the state can no longer change. These properties
imply that the state converges.
A simple example shows that the limit may be random. Consider four people at
the vertices of a square that represents G. Assume that two opposite vertices are
blue and the other two are red. If the first person v to reconsider her opinion is blue,
she turns red, and the limit is all red. If v is red, the limit is all blue. Thus, the limit
is equally likely to be all red or all blue.
In the limit, it may be that a fraction of the nodes are red and the others are blue.
For instance, if the nodes are arranged in a line graph, then the limit is alternating
sequences of at least two red nodes and sequences of at least two blue nodes.
The properties of the limit depend on the adjacency graph G. One might think
that a close group of friends should have the same color, but that is not necessarily
the case, as the example of Fig. 5.4 shows.
¸ ¸ ¸ ¸
0 1 2 3 .....
¹ ¹ ¹ ¹
5.6 M/M/1 Queue 79
The bottom part of Fig. 5.5 is a state transition diagram that indicates the rates of
transitions. For instance, the arrow from 1 to 2 is marked with λ to indicate that, in
1 s, the queue length jumps from 1 to 2 with probability λ. The figure shows
that arrivals (that increase the queue length) occur at the same rate λ, independently
of the queue length. Also, service completions (that reduce the queue length) occur
at rate μ as long as the queue is nonempty.
Note that
The first identity is the law of total probability: the event {Xt+ = 0} is the union of
the two disjoint events {Xt = 0, Xt+ = 0} and {Xt = 1, Xt+ = 0}. The second
identity uses the fact that {Xt = 0, Xt+ = 0} occurs when Xt = 0 and there is no
arrival during (t, t +]. This event has probability P (Xt = 0) multiplied by (1−λ)
since arrivals are independent of the current queue length. The other term is similar.
Now, imagine that π is a pmf on Z≥0 := {0, 1, . . .} such that P (Xt = i) = π(i)
for all time t and i ∈ Z≥0 . That is, assume that π is an invariant distribution for Xt .
In that case, P (Xt+ = 0) = π(0), P (Xt = 0) = π(0), and P (Xt = 1) = π(1).
Hence, the previous identity implies that
π(0)λ ≈ π(1)μ.
Hence,
The Eqs. (5.1)–(5.2) are called the balance equations. Thus, if π is invariant for Xt ,
it must satisfy the balance equations. Looking back at our calculations, we also see
that if π satisfies the balance equations, and if P (Xt = i) = π(i) for all i, then
P (Xt+ = i) = π(i) for all i. Thus, π is invariant for Xt if and only if it satisfies
the balance equations.
One can solve the balance equations (5.1)–(5.2) as follows. Equation (5.1) shows
that π(1) = ρπ(0) with ρ = λ/μ. Subtracting (5.1) from (5.2) yields
π(1)λ = π(2)μ.
This equation then shows that π(2) = π(1)ρ = π(0)ρ 2 . Continuing in this way
shows that π(n) = π(0)ρ n for n ≥ 0. To find π(0), we use the fact that n π(n) =
1. That is
∞
π(0)ρ n = 1.
n=0
1
π(0) = 1,
1−ρ
π(n) = (1 − ρ)ρ n , n ≥ 0.
To calculate the average delay W of a customer in the queue, one can use Little’s
Law L = λW . This identity implies that
1
W = .
μ−λ
completions before he leaves. Since very service completions lasts 1/μ on average,
his average delay is (k + 1)/μ. Now, the probability that this customer finds k other
customers in the queue is π(k). To see this, note that the probability that a customer
who enters the queue between time t and t + finds k customers in the queue is
Figure 5.6 shows a representative network of queues. Two types of customers arrive
into the network, with respective rates γ1 and γ2 . The first type goes through queue
1, then queue 3, and should leave the network. However, with probability p1 these
customers must go back to queue 1 and try again. In a communication network, this
event models an transmission error where a packet (a group of bits) gets corrupted
and has to be retransmitted. The situation is similar for the other type. Thus, in
1 time unit, a packet of the first type arrives with probability γ1 , independently
of what happened previously. This is similar to the arrivals into an M/M/1 queue.
Also, we assume that the service times are exponentially distributed with rate μk in
queue k, for k = 1, 2, 3.
Let Xtk be the number of customers in queue k at time t, for k = 1, 2 and t ≥ 0.
Let also Xt3 be the list of customer types in queue 3 at time t. For instance, in
Fig. 5.6, one has Xt3 = (1, 1, 2, 1), from tail to head of the queue to indicate that the
customer at the head of the queue is of type 1, that he is followed by a customer of
γ1 μ1 μ1 γ1
μ3 p1
μ3 p2 (3, 2, (1, 1, 2, 1))
λ2 μ3 (1 − p1 ) γ2
μ2
γ2 μ2 (4, 2, (1, 1, 2)) (3, 1, (2, 1, 1, 2, 1)) (3, 3, (1, 1, 2, 1))
type 2, etc. Because of the memoryless property of the exponential distribution, the
process Xt = (X11 , Xt2 , Xt3 ) is a Markov chain: observing the past up to time t does
not help predict the time of the next arrival or service completion.
Figure 5.6 shows the transition rates out of the current state (3, 2, (1, 1, 2, 1)).
For instance, with rate μ3 p1 , a service completes in queue 3 and that customer has
to go back to queue 1, so that the new state is (4, 2, (1, 1, 2)). The other transitions
are similar.
One can then, in principle, write down the balance equations and try to solve
them. This looks like a very complex task and it seems very unlikely that one
could solve these equations analytically. However, a miracle occurs and one has the
remarkably simple result stated in the next theorem. Before we state the result, we
need to define λ1 , λ2 , and λ3 . As sketched in Fig. 5.6, for k = 1, 2, 3, the quantity
λk is the rate at which customers go through queue k, in the long term. These rates
should be such that
λ 1 = γ 1 + λ 1 p1
λ 2 = γ 2 + λ 2 p2
λ3 = λ1 + λ2 .
For instance, the rate λ1 at which customers enter queue 1 is the rate γ1 plus the rate
at which customers of type 1 that leave queue 3 are sent back to queue 1. Customers
of type 1 go through queue 3 at rate λ1 , since they come out of queue 1 at rate λ1 ;
also, a fraction p1 of these customers go back to queue 1. The other expressions
can be understood similarly. The equations above are called the flow conservation
equations.
These equations admit the following solution:
γ1 γ2
λ1 = , λ2 = , λ3 = λ1 + λ2 .
1 − p1 1 − p2
5.7 Network of Queues 83
This result shows that the invariant distribution has a product form.
We prove this result in the next chapter. It indicates that under the invariant
distribution π , the states of the three queues are independent. Moreover, the state
of queue 1 has the same invariant distribution as an M/M/1 queue with arrival rate
λ1 and service rate μ1 , and similarly for queue 2. Finally, queue 3 has the same
invariant distribution as a single queue with arrival rates λ1 and λ2 and service rate
μ3 : the length of queue 3 has the same distribution as an M/M/1 queue with arrival
rate λ1 + λ3 and the types of the customers in the queue are independent and of type
1 with probability p(1) and 2 with probability p(2).
This result is remarkable not only for its simplicity but mostly because it is
surprising. The independence of the states of the queues is shocking: the arrivals into
queue 3 are the departures from the other two queues, so it seems that if customers
are delayed in queues 1 and 2, one should have larger values for Xt1 and Xt2 and a
smaller one for the length of queue 3. Thus, intuition suggests a strong dependency
between the queue lengths. Moreover, the fact that the invariant distributions of the
queues are the same as for M/M/1 queues is also shocking. Indeed, if there are
many customers in queue 1, we know that a fraction of them will come back into
the queue, so that future arrivals into queue 1 depend on the current queue length,
which is not the case for an M/M/1 queue. The paradox is explained in a reference.
We use this theorem to calculate the delay of customers in the network.
where
γ1 γ2
λ1 = and λ2 = .
1 − p1 1 − p2
Proof We use Little’s Law that says that Lk = γk Wk where Lk is the average
number of customers of type k in the network. Consider the case k = 1. The other
one is similar. L1 is the average number of customers in queue 1 plus the average
number of customers of type 1 in queue 3.
The average length of queue 1 is λ1 /(μ1 − λ1 ) because the invariant distribution
of queue 1 is the same as that of an M/M/1 queue with arrival rate λ1 and service
rate μ1 .
The average length of queue 3 is (λ1 + λ2 )/(μ3 − λ1 − λ2 ) because the invariant
distribution of queue 3 is the same as queue with arrival rate λ1 and λ2 and service
rate μ3 . Also, the probability that any customer in queue 3 is of type 1 is p(1) =
λ1 /(λ1 + λ2 ). Thus, the average number of customers of type 1 in queue 3 is
84 5 Networks: A
λ1 + λ2 λ1
p(1) = .
μ3 − λ1 − λ2 μ3 − λ1 − λ2
Hence,
λ1 λ1
L1 = + .
μ1 − λ1 μ3 − λ1 − λ2
We use our network model to optimize the rates of the transmitters. The basic idea
is that nodes with more traffic should have faster transmitter. To make this idea
precise, we formulate an optimization problem: minimize a delay cost subject to a
given budget for buying the transmitters.
We carry out the calculations not because of the importance of the specific
example (it is not important!) but because they are representative of problems of
this type.
Consider once again the network in Fig. 5.6. Assume that the cost of the
transmitters is c1 μ1 + c2 μ2 + c3 μ3 . The delay cost is d1 W1 + d2 W2 where Wk
is the average delay for packets of type k (k = 1, 2). The problem is then as follows:
Minimize D(μ1 , μ2 , μ3 ) := d1 W1 + d2 W2
subject to C(μ1 , μ2 , μ3 ) := c1 μ1 + c2 μ2 + c3 μ3 ≤ B.
dk 1 1
D(μ1 , μ2 , μ3 ) = + .
1 − pk μk − λk μ3 − λ1 − λ2
k=1,2
where λ > 0 is a Lagrange multiplier that penalizes capacities that have a high cost.
To solve this problem for a given value of λ, we set to zero the derivative of this
expression with respect to each μk . For k = 1, 2 we find
5.8 Optimizing Capacity 85
∂ ∂
0= D(μ1 , μ2 , μ3 ) + αC(μ1 , μ2 , μ3 )
∂μk ∂μ1
dk 1
=− + αck .
1 − pk (μk − λk )2
For k = 3, we find
d1 /(1 − p1 ) + d2 /(1 − p2 )
0=− + αc3 .
(μ3 − λ1 − λ2 )2
Hence,
1/2
dk
μk = λk + , for k = 1, 2
αck (1 − pk )
d1 /(1 − p1 ) + d2 /(1 − p2 ) 1/2
μ3 = λ1 + λ2 + .
αc3
C(μ1 , μ2 , μ3 ) = c1 λ1 + c2 λ2 + c3 (λ1 + λ3 )
⎡ ⎛ ⎞1/2 ⎤
1 ⎢ dk ck 1/2 dk
⎠ ⎥
+ c3 ⎝
1/2
+√ ⎣ ⎦.
α 1 − pk 1 − pk
k=1,2 k=1,2
where
B − c1 λ1 − c2 λ2 − c3 (λ1 + λ2 )
D= 1/2 1/2 .
dk c k 1/2 dk ck
k=1,2 1−pk + c3 k=1,2 1−pk
86 5 Networks: A
These results show that, for k = 1, 2, the capacity μk increases with dk , i.e.,
the cost of delays of packets of type k; it also decreases with ck , i.e., the cost of
providing that capacity.
A numerical solution can be obtained using a scipy optimization tool called
minimize. Here is the code.
import numpy as np
from scipy.optimize import minimize
d = [1, 2] # delay cost coefficients
c = [2, 3, 4] # capacity cost coefficients
l = [3, 2] # rates l[0] = lambda1, etc
p = [0.1, 0.2] # error probabilities
B = 60 # capacity budget
UB = 50 # upper bound on capacity
# x = mu1, mu2, mu3: x[0] = mu1, etc
def objective(x): # objective to minimize
z = 0
for k in range(2):
z = z + (d[k]/(1 - p[k]))*(1/(x[k] - l[k])
+ 1/(x[2] - l[0]-l[1]))
return z
def constraint(x): # budget constraint >= 0
z = B
for k in range(3):
z = z - c[k]*x[k]
return z
x0 = [5,5,10] # initial value for optimization
b0 = (l[0], UB) # lower and upped bound for x[0]
b1 = (l[1], UB) # lower and upped bound for x[1]
b2 = (l[0]+l[1], UB) # lower and upped bound for x[1]
bnds = (b0,b1,b2) # bounds for the three variables x
con = {’type’: ’ineq’, ’fun’: constraint}
# specifies constraints
sol = minimize(objective,x0,method=’SLSQP’,
bounds = bnds, constraints=con)
# sol will be the solution
print(sol)
The code produces an approximate solution. The advantage is that one does not
need any analytical skills. The disadvantage is that one does not get any qualitative
insight.
5.10 Product-Form Networks 87
Can one model the internet as a network of queues? If so, does the result of
the previous section really apply? Well, the mathematical answers are maybe and
maybe.
The internet transports packets (groups of bits) from node to node. The nodes
are sources and destinations such as computers, webcams, smartphones, etc., and
network nodes such as switches or routers. The packets go from buffer to buffer.
These buffers look like queues. The service times are the transmission times of
packets. The transmission time of a packet (in seconds) is the number of bits in
the packet divided by the rate of the transmitter (in bits per second). The packets
have random lengths, so the service times are random. So, the internet looks like a
network of queues. However, there are some important ways in which our network
of queues is not an exact model of the internet. First, the packet lengths are not
exponentially distributed. Second, a packet keeps the same number of bits as it
moves from one queue to the next. Thus, the service times of a given packet in the
different queues are all proportional to each other. Third, the time between the arrival
two successive packets from a given node cannot be smaller than the transmission
time of the first packet. Thus, the arrival times and the service times in one queue
are not independent and the times between arrivals are not exponentially distributed.
The real question is whether the internet can be approximated by a network
similar to that of the previous section. For instance, if we use that model, are we
very far off when we try to estimate delays of queue lengths? Experiments suggest
that the approximation may be reasonable to a first order. One intuitive justification
is the diversity of streams of packets. It goes as follows. Consider one specific queue
in a large network node of the internet. This node is traversed by packets that come
from many different sources and go to many destinations. Thus, successive packets
that arrive at the queue may come from different previous nodes, which reduces
the dependency of the arrivals and the service times. The service time distribution
certainly affects the delays. However, the results obtained assuming an exponential
distribution may provide a reasonable estimate.
Define λci as the average rate of customers of class c that go through queue i, for
i ∈ {1, . . . , N } and for c ∈ {1, . . . , C}. Assume that the rate of arrivals of customers
of a given class into a queue is equal to the rate of departures of those customers
from the queue. Then the rates λci should satisfy the following flow conservation
equations:
N
C
d,c
λci = γic + rj,i , i ∈ {1, . . . , N }, c ∈ {1, . . . , C}.
j =1 d=1
Let also X(t) = {Xi (t), i = 1, . . . , N} where Xi (t) is the configuration of queue
i at time t ≥ 0. That is, Xi (t) is the list of customer classes in queue i, from the tail
of the queue to the head of the queue. For instance, Xi (t) = 132,312 if the customer
at the tail of queue i is of class 1, the customer in front of her is of class 3, and so
on, and the customer at the head of the queue and being served is of class 2. If the
queue is empty, then Xi (t) = [], where [] designates the empty string.
One then has the following theorem.
π(x) = AΠi=1
N
gi (xi ),
where
λci 1 · · · λci N
gi (c1 · · · cn ) =
μni
and A is a constant such that πi sums to one over all the possible configurations
of the queues.
(b) If the network is open in that every customer can leave the network, then the
invariant distribution becomes
π(x) = Πi=1
N
πi (xi ),
where
c c
λ i λi 1 · · · λi n
πi (c1 · · · cn ) = 1 − .
μi μni
In this case, under the invariant distribution, the queue lengths at time t are
all independent, the length of queue i has the same distribution as that of an
5.11 References 89
M/M/1 queue with arrival rate λi and service rate μi , and the customer classes
are all independent and are equal to c with probability λci /λi .
The proof of this theorem is the same as that of the particular example given in
the next chapter.
5.10.1 Example
Figure 5.7 shows a network with two types of jobs. There is a single gray job that
visits the two queues as shown. The white jobs go through the two queues once. The
gray job models “hello” messages that the queues keep on exchanging to verify that
the system is alive. For ease of notation, we assume that the service rates in the two
queues are identical.
We want to calculate the average time that the white jobs spend in the system
and compare that value to the case when there is no gray job. That is, we want
to understand the “cost” of using hello messages. The point of the example is to
illustrate the methodology for networks where some customers never leave. The
calculations show the following somewhat surprising result.
Theorem 5.7 Using a hello message increases the expected delay of the white jobs
by 50%.
We prove the theorem in the next chapter. In that proof, we use Theorem 5.6 to
calculate the invariant distribution of the system, derive the expected number L of
white jobs in the network, then use Little’s Law to calculate the average delay W of
the white jobs as W = L/γ . We then compare that value to the case where there is
not gray job.
5.11 References
The literature on social networks is vast and growing. The textbook Easley and
Kleinberg (2012) contains many interesting models and result. The text Shah (2009)
studies the propagation of information in networks.
The book Kelly (1979) is the most elegant presentation of the theory of queueing
networks. It is readily available online. The excellent notes Kelly and Yudovina
(2013) discuss recent results. The nice textbook Srikant and Ying (2014) explains
network optimization and other performance evaluation problems. The books
90 5 Networks: A
Bremaud (2017) and Lyons and Perez (2017) are excellent sources for deeper studies
of networks. The text Walrand (1988) is more clumsy but may be useful.
5.12 Problems
Problem 5.1 There are K users of a social network who collaborate to estimate
some quantity by exchanging information. At each step, a pair (i, j ) of users is
selected uniformly at random and user j sends a message to user i with his estimate.
User i then replaces his estimate by the average of his estimate and that of user j .
Show that the estimates of all the users converge in probability to the average value
of the initial estimates. This is an example of consensus algorithm.
Hint Let Xn (i) be the estimate of user i at step n and Xn the vector with components
Xn (i). Show that
so that
and
Markov’s inequality then shows that P (|Xn (i) − A| > ) → 0 for any > 0.
Problem 5.2 Jobs arrive at rate γ in the system shown in Fig. 5.8. With probability
p, a customer is sent to queue 1, independently of the other jobs; otherwise, the job
is sent to queue 2. For i = 1, 2, queue i serves the jobs at rate μi . Find the value
of p that minimizes the average delay of jobs in the system. Compare the resulting
average delay to that of the system where the jobs are in one queue and join the
available server when they reach the head of the queue, and the fastest server if both
are idle, as shown in the bottom part of Fig. 5.8.
Hint The system of the top part of the figure is easy to analyze: with probability
p, a job faces the average delay 1/(μ1 − γp) in the top queue and with probability
1 − p the job faces the average delay 1/(μ2 − γ (1 − p)), One the finds the value of
p that minimizes the expected delay. For the system in the bottom part of the figure,
the state is n with n ≥ 2 when there are at least two jobs and the two servers are
5.12 Problems 91
μ1
γ
μ2
busy, or (1, s) where s ∈ {1, 2} indicates which server is busy, or 0 when the system
is empty. One then needs to find the invariant distribution of the state, compute the
average number of jobs, and use Little’s Law to find the average delay. The state
transition diagram is shown in Fig. 5.9.
Problem 5.3 This problem compares parallel queues to a single queue. There are
N servers. Each server serves customers at rate μ. The customers arrive at rate
N λ. In the first system, the customers are split into N queues, one for each server.
Customers arrive at each queue with rate λ. The average delay is that of an M/M/1
queue, i.e., 1/(μ − λ). In the second system, the customers join a single queue. The
customer at the head of the queue then goes to the next available server. Calculate
the average delay in this system. Write a Python program to plot the average delays
of the two systems as a function ρ := λ/μ for different values of N .
Problem 5.4 In this problem, we explore a system of parallel queues where the
customers join the shortest queue. Customers arrive at rate Nλ and there are N
queues, each with a server who serves customers at rate μ > λ. When a customer
arrives, she joins the shortest queue. The goal is to analyze the expected delay in
the system. Unfortunately, this problem cannot be solved analytically. So, your task
is to write a Python program to evaluate the expected delay numerically. The first
92 5 Networks: A
2 μ
1
A
step is to draw the state transition diagram. Approximate the system by discarding
customers who arrive when there are already M customers in the system. The second
step is to write the balance equations. Finally, one writes a program to solve the
equations numerically.
Problem 5.5 Figure 5.11 shows a system of N queues that serve jobs at rate μ. If
there is a single job, it takes on average N/μ time units for it to go around the circle.
Thus, the average rate at which a job leaves a particular queue is μ/N . Show that
when there are two jobs, this rate is 2μ/(N + 1).
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Networks—B
6
Proof For part (a), let Xn be the number of nodes that are n steps from the root. If
Xn = k, we can write Xn+1 = Y1 + · · · + Yk where Yj is the number of children of
node j at level n. By assumption, E(Yj ) = μ for all j . Hence,
E(Xn ) = μn , n ≥ 0.
1 − μn+1
E(Zn ) = μ0 + · · · + μn = .
1−μ
Indeed, if X1 = k, the probability that none of the k children of the root has a
survivor after n generations is (1 − αn )k . Hence,
Also, α0 = 1. As n → ∞, one has αn → α ∗ = P (Xn > 0, for all n). Figure 6.1
shows that α ∗ > 0. The key observations are that
g(0) = 0
g(1) = P (X1 > 0) < 1
g (0) = E(X1 (1 − α)X1 −1 ) |α=0 = μ > 1
g (1) = E(X1 (1 − α)X1 −1 ) |α=1 = 0,
Theorem 6.2 (Cascades) Assume pk = p ∈ (0, 1] for all k ≥ 1. Then, all nodes
turn red with probability at least equal to θ where
1−p
θ = exp − .
p
Proof The probability that node n does not listen to anyone is an = (1 − p)n . Let
X be the index of the first node that does not listen to anyone. Then
6.2 Continuous-Time Markov Chains 95
P (X1 > 0)
μ>1
0 α
0 α1 1
α2
α∗
Now,
1−p
P (X = ∞) = lim P (X > n) ≥ exp − = θ.
n p
Thus, with probability at least θ , every node listens to at least one previous node.
When that is the case, all the nodes turn red. To see this, assume that n is the first
blue node. That is not possible since it listened to some previous nodes that are all
red.
Our goal is to understand networks where packets travel from node to node until
they reach their destination. In particular, we want to study the delay of packets
from source to destination and the backlog in the nodes.
It turns out that the analysis of such systems is much easier in continuous time
than in discrete time. To carry out such analysis, we have to introduce continuous-
time Markov chains. We do this on a few simple examples.
96 6 Networks—B
Figure 6.2 illustrates a random process {Xt , t ≥ 0} that takes values in {0, 1}. A
random process is a collection of random variables indexed by t ≥ 0. Saying that
such a random process is defined means that one can calculate the probability that
{Xt1 = x1 , Xt2 = x2 , . . . , Xtn = xn } for any value of n ≥ 1, any 0 ≤ t1 ≤
· · · ≤ tn , and x1 , . . . , xn ∈ {0, 1}. We explain below how one could calculate such a
probability.
We call Xt the state of the process at time t. The possible values {0, 1} are also
called states. The state Xt evolves according to rules characterized by two positive
numbers λ and μ. As Fig. 6.2 shows, if X0 = 0, the state remains equal to zero
for a random time T0 that is exponentially distributed with parameter λ, thus with
mean 1/λ. The state Xt then jumps to 1 where it stays for a random time T1 that is
exponentially distributed with rate μ, independent of T0 , and so on. The definition is
similar if X0 = 1. In that case, Xt keeps the value 1 for an exponentially distributed
time with rate μ, then jumps to 0, etc.
Thus, the pdf of T0 is
In particular,
Throughout this chapter, the symbol ≈ means “up to a quantity negligible compared
to .” It is shown in Theorem 15.3 that exponentially distributed random variable is
memoryless. That is,
Exp(λ)
6.2 Continuous-Time Markov Chains 97
0 t
time s, because the time in 0 is memoryless and independent of the previous times
in 0 and 1. This property is written as
for k = 0, 1, for all s ≥ 0, and for all sets A of possible trajectories. A generic set
A of trajectories is
for given 0 < t1 < · · · < tn and i1 , . . . , in ∈ {0, 1}. Here, C+ is the set of right-
continuous functions of t ≥ 0 that take values in {0, 1}.
This property is the continuous-time version of the Markov property for Markov
chains. One says that the process Xt satisfies the Markov property and one calls
{Xt , t ≥ 0} is a continuous-time Markov chain (CTMC).
For instance,
P [Xt+ = 1 | Xt = 0] ≈ λ.
98 6 Networks—B
0 t
τ
Fig. 6.5 The state transition λ
diagram
0 1
Indeed, the process jumps from 0 to 1 in time units if the exponential time in 0 is
less than , which has probability approximately λ.
Similarly,
P [Xt+ = 0 | Xt = 1] ≈ μ.
We say that the transition rate from 0 to 1 is equal to λ and that from 1 to 0 is equal
to μ to indicate that the probability of a transition from 0 to 1 in units of time is
approximately λ and that from 1 to 0 is approximately μ.
Figure 6.5 illustrates these transition rates. This figure is called the state
transition diagram.
The previous two identities imply that
πt+ ≈ πt (I + Q),
where I is the identity matrix. Subtracting πt from both sides, dividing by , and
letting → 0, we find
d
πt = πt Q. (6.1)
dt
By analogy with the scalar equation dxt /dt = axt whose solution is xt =
x0 exp{at}, we conclude that
πt = π0 exp{Qt}, (6.2)
where
1 2 2 1
exp{Qt} := I + Qt + Q t + Q3 t 3 + · · · .
2! 3!
Note that
d 1
exp{Qt} = 0 + Q + Q2 t + Q3 t 2 + · · · = Q exp{Qt}.
dt 2!
Observe also that πt = π for all t ≥ 0 if and only if π0 = π and
π Q = 0. (6.3)
1− 0 1 1−μ
i.e.,
π(0)(−λ) + π(1)μ = 0
π(0)λ − π(1)μ = 0.
These two equations are identical. To determine π , we use the fact that π(0) +
π(1) = 1. Combined with the previous identity, we find
μ λ
[π(0), π(1)] = , .
λ+μ λ+μ
πn → π, as n → ∞.
πt → π, as t → ∞.
The previous Markov chain alternates between the states 0 and 1. More general
Markov chains visit states in a random order. We explain that feature in our next
example with 3 states. Fortunately, this example suffices to illustrate the general
case. We do not have to look at Markov chains with 4, 5, . . . states to describe the
general model.
6.2 Continuous-Time Markov Chains 101
q(2, 0)
Exp(q1 ) Exp(q2 )
Xt
2
Γ(0, 2) T3
1
Γ(0, 1) T1
T0 T2
0 t
Exp(q0 )
q0 = q(0, 1) + q(0, 2) Γ(0, 1) = q(0, 1)/q0
q1 = q(1, 2) Γ(0, 2) = q(0, 2)/q0
q2 = q(2, 0)
In the example shown in Fig. 6.7, the rules of evolution are characterized
by positive numbers q(0, 1), q(0, 2), q(1, 2), and q(2, 0). One also defines
q0 , q1 , q2 , Γ (0, 1), and Γ (0, 2) as in the figure.
If X0 = 0, the state Xt remains equal to 0 for some random time T0 that
is exponentially distributed with rate q0 . At time T0 , the state jumps to 1 with
probability Γ (0, 1) or to state 2 otherwise, with probability Γ (0, 2). If Xt jumps
to 1, it stays there for an exponentially distributed time T1 with rate q1 that is
independent of T0 . More generally, when Xt enters state k, it stays there for a
random time that is exponentially distributed with rate qk that is independent of the
past evolution. From this definition, it should be clear that the process Xt satisfies
the Markov property.
Define πt = [πt (0), πt (1), πt (2)] where πt (k) = P (Xt = k) for k = 0, 1, 2.
One has, for 0 < 1,
Indeed, the process jumps from 0 to 1 in time units if the exponential time with
rate q0 is less than and if the process then jumps to 1 instead of jumping to 2.
Similarly,
Also,
102 6 Networks—B
P [Xt+ = 1 | Xt = 1] ≈ 1 − q1 ,
since this is approximately the probability that the exponential time with rate q1 is
larger than . Moreover,
P [Xt+ = 1 | Xt = 2] ≈ 0,
because the probability that both the exponential time with rate q2 in state 2 and the
exponential time with rate q0 in state 0 are less than is roughly (q2 ) × (q1 ), and
this is negligible compared to .
These observations imply that
Similarly to the two-state example, let us define the rate matrix Q as follows:
⎡ ⎤
−q0 q(0, 1) q(0, 2)
Q=⎣ 0 −q1 q(0, 1) ⎦ .
q(2, 0) 0 −q2
πt+ ≈ πt [I + Q].
Subtracting πt from both sides, dividing by , and letting → 0 then shows that
d
πt = πt Q.
dt
As before, the solution of this equation is
πt = π0 exp{Qt}, t ≥ 0.
q(0, 1) 1 q(1, 2)
q(0, 2)
0 2
1 − q0 1 − q2
q(2, 0)
π Q = 0.
P (Xn = k) → π(k), as n → ∞.
πt → π, as t → ∞.
Also, since Xn is irreducible, the long-term fraction of time that it spends in the
different states converge to π , and we can then expect the same for Xt .
104 6 Networks—B
Xt
Exp(qj ) Exp(qk )
Exp(ql )
k
Γ(i, k)
j
Γ(i, j) l
i t
Exp(qi )
This definition means that the process jumps from i to j = i with probability
q(i, j ) in 1 time units. Thus, q(i, j ) is the probability of jumping from i to j ,
per unit of time. Note that the sum of these expressions over all j gives 1, as should
be.
One construction of this process is as follows. Say that Xt = i. One then chooses
a random time τ that is exponentially distributed with rate qi := −q(i, i). At time
t + τ , the process jumps and goes to state y with probability Γ (i, j ) = q(i, j )/qi
for j = i (Fig. 6.9).
Thus, if Xt = i, the probability that Xt+ = j is the probability that the process
jumps in (t, t + ), which is qi , times the probability that it then jumps to j , which
is Γ (i, j ). Hence,
q(i, j )
P [Xt+ = j |Xt = i] = qi = q(i, j ),
qi
d
πt = πt Q,
dt
so that
πt = π0 exp{Qt}.
0 = π Q.
These equations express the equality of the rate of leaving a state and the rate of
entering that state.
Define
P (Xt1 = i1 , . . . , Xtn = in ) = P (Xt1 = i1 )Pt2 −t1 (i1 , i2 )Pt3 −t2 (i2 , i3 ) · · · Ptn −tn−1 (in−1 , in ),
Hence,
P (Xt1 = i1 , . . . , Xtn = in ) = π(i1 )Pt2 −t1 (i1 , i2 )Pt3 −t2 (I2 , i3 ) · · · Ptn −tn−1 (in−1 , in ),
(a) If the Markov chain is irreducible, the states are either all transient, all positive
recurrent, or all null recurrent. We then say that the Markov chain is transient,
positive recurrent, or null recurrent, respectively.
(b) If the Markov chain is positive recurrent, it has a unique invariant distribution
π and π(i) is the long-term fraction of time that Xt is equal to i. Moreover, the
probability πt (i) that the Markov chain Xt is in state i converges to π(i).
(c) If the Markov chain is not positive recurrent, it does not have an invariant
distribution and the fraction of time that it spends in any state goes to zero.
6.2.4 Uniformization
Let ν be the invariant distribution of this jump chain. That is, ν = νΓ . Since ν(i) is
the long-term fraction of time that the jump chain is in state i, and since the CTMC
Xt spends an average time 1/qi in state i whenever it visits that state, the fraction of
time that Xt spends in state i should be proportional to ν(i)/qi . That is, one expects
π(i) = Aν(i)/qi
[ν(i)/qi ]q(i, j ) = ν(i)Γ (i, j ) + ν(i)q(i, i)/qi = ν(i) − ν(i) = 0.
j j =i
To see this, assume that Yt = i. The next jump will occur with rate λ. With
probability (λ − qi )/λ, it is a dummy jump from i to i. With probability qi /λ it
is an actual jump where Yt jumps to j = i with probability Γ (i, j ). Hence, Yt
jumps from i to i with probability (λ − qi )/λ and from i to j = i with probability
(qi /λ)Γ (i, j ) = q(i, j )/λ.
Note that
1
P =I+ Q,
λ
where I is the identity matrix.
Now, define Zn to be the jump chain of Yt , i.e., the Markov chain with transition
matrix P . Since the jumps of Yt occur at rate λ, independently of the value of the
state Yt , we can simulate Yt as follows. Let Nt be a Poisson process with rate λ. The
jump times {t1 , t2 , . . .} of Nt will be the jump times of Yt . The successive values of
Yt are those of Zn . Formally,
Yt = ZNt .
That identity holds since π Q = 0. Thus, the DTMC Zn has the same invariant
distribution as Xt . Observe that Zn is not the same as the jump chain of Xt . Also, it
is not a discrete-time approximation of Xt . This DTMC shows that a CTMC can be
seen as a DTMC where one replaces the constant time steps by i.i.d. exponentially
distributed time steps between the jumps.
108 6 Networks—B
As a preparation for our study of networks of queues, we note the following result.
Theorem 6.2 (Kelly’s Lemma) Let Q be the rate matrix of a Markov chain on X .
Let also Q̃ be another rate matrix on X . Assume that π is a distribution on X and
that
qi = q̃i , i ∈ X and
π(i)q(i, j ) = π(j )q̃(j, i), ∀i = j.
Then π Q = 0.
Proof We have
π(j )q(j, i) = p(i)q̃(i, j ) = p(i) q̃(i, j ) = p(i)q̃i = p(i)qi ,
j =i j =i j =i
so that π Q = 0.
The following result explains the meaning of Q̃ in the previous theorem. We state
it without proof.
Theorem 6.3 Assume that Xt has the invariant distribution π . Then Xt reversed in
time is a Markov chain with rate matrix Q̃ given by
π(j )q(j, i)
q̃(i, j ) = .
π(i)
μ1 μ3
λ2
p2 μ2
Proof Figure 6.10 shows a guess for the time-reversal of the network.
Let Q be the rate matrix of the top network and Q̃ that of the bottom one. Let
also π be as stated in the theorem. We show that π, Q, Q̃ satisfy the conditions of
Kelly’s Lemma.
For instance, we verify that
i.e.,
p(1)ρ3 μ3 = ρ1 μ1 ,
i.e.,
λ1 λ3 λ1
μ3 = μ1
λ3 μ3 μ1
and this equation is seen to be satisfied. A similar argument shows that Kelly’s
lemma is satisfied for all pairs of states.
The first step in using the theorem is to solve the flow conservation equations. Let
us call class 1 that of the white jobs and class 2 that of the gray job. Then we see
that
solve the flow conservation equations for any α > 0. We have to assume γ < μ for
the services to be able to keep up with the white jobs. With this assumption, we can
choose α small enough so that λ1 = λ2 = λ := γ + α < min{μ1 , μ2 }.
The second step is to use the theorem to obtain the invariant distribution. It is
with
n1 (xi )
n2 (xi )
γ α n (x ) n (x )
h(xi ) = = ρ1 1 i ρ2 2 i ,
μ μ
where ρ1 = γ /μ, ρ2 = α/μ, and nc (x) is the number of jobs of class c in xi , for
c = 1, 2. To calculate A, we note that there are n + 1 states xi with n class 1 jobs
and 1 class 2 job, and 1 state xi with n classes 1 jobs and no class 2 job. Indeed, the
class 2 customer can be in n + 1 positions in the queue with the n customers of class
1.
Also, all the possible pairs (x1 , x2 ) must have one class 2 customer either in
queue 1 or in queue 2. Thus,
6.4 Proof of Theorem 5.7 111
∞
∞
1= π(x1 , x2 ) = A G(m, n),
(x1 ,x2 ) m=0 n=0
where
In this expression, the first term corresponds to the states with m class 1 customers
and one class 2 customer in queue 1 and n customers of class 1 in queue 2; the
second term corresponds to the states with m customer of class 1 in queue 1, and n
customers of class 1 and one customer of class 2 in queue 2. Thus, AG(m, n) is the
probability that there are m customers of class 1 in the first queue and n customers
of class 1 in the second queue.
Hence,
∞
∞ ∞
∞
1=A [(m + 1)ρ1m+n ρ2 + (n + 1)ρ1m+n ρ2 ] = 2A (m + 1)ρ1m+n ρ2 ,
m=0 n=0 m=0 n=0
and
∞
∞
∂ n+1 ∂
(n + 1)ρ =
n
ρ = [(1 − ρ)−1 − 1] = (1 − ρ)−2 .
∂ρ ∂ρ
n=0 n=0
1 = 2Aρ2 (1 − ρ1 )−3 ,
so that
(1 − ρ1 )3
A= .
2ρ2
112 6 Networks—B
Third, we calculate the expected number L of jobs of class 1 in the two queues.
One has
∞
∞
L= A(m + n)G(m, n)
m=0 n=0
∞
∞ ∞
∞
= A(m + n)(m + 1)ρ1m+n ρ2 + A(m + n)(n + 1)ρ1m+n ρ2
m=0 n=0 m=0 n=0
∞
∞
=2 A(m + n)(m + 1)ρ1m+n ρ2 ,
m=0 n=0
where the last identity follows from the symmetry of the two terms. Thus,
∞
∞ ∞
∞
L=2 Am(m + 1)ρ1m+n ρ2 + 2 An(m + 1)ρ1m+n ρ2
m=0 n=0 m=0 n=0
∞
∞ ∞
∞
= 2Aρ2 m(m + 1)ρ1m ρ1n + 2Aρ2 (m + 1)ρ1m nρ1n
m=0 n=0 m=0 n=0
∞
∞
= 2Aρ2 m(m + 1)ρ1m (1 − ρ1 )−1 + 2Aρ2 (1 − ρ)−2 nρ1n .
m=0 n=0
= 2ρ(1 − ρ)−3 .
Also,
∞
∞
∞
nρ1n = ρ1 nρ1n−1 = ρ1 (n + 1)ρ1n = ρ1 (1 − ρ1 )−2 .
n=0 n=0 n=0
Hence,
Finally, we get the average time W that jobs of class 1 spend in the network: W =
L/γ .
Without the gray job, the expected delay W of the white jobs would be the sum
of delays in two M/M/1 queues, i.e., W = L /γ where
ρ1
L = 2 .
1 − ρ1
W = 1.5W ,
so that using a hello message increases the average delay of the class 1 customers
by 50%.
6.5 References
The time-reversal arguments are developed in Kelly (1979). That book also explains
many other models that can be analyzed using that approach. See also Bremaud
(2008), Lyons and Perez (2017), Neely (2010).
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Digital Link—A
7
A digital link consists of a transmitter and a receiver. It transmits bits over some
physical medium that can be a cable, a phone line, a laser beam, an optical fiber, an
electromagnetic wave, or even a sound wave. This contrasts with an analog system
that transmits signals without converting them into bits, as in Fig. 7.1.
An elementary such system1 consists of a phone line and, to send a bit 0, the
transmitter applies a voltage −1 Volt across its end of the line for T seconds; to
send a bit 1, it applies the voltage +1 Volt for T second. The receiver measures
the voltage across its end of the line. If the voltage that the receiver measures is
negative, it decides that the transmitter must have sent a 0; if it is positive, it decides
that the transmitter sent a 1. This system is not error-free. The receiver gets a noisy
and attenuated version of what the transmitter sent. Thus, there is a chance that a 0
is mistaken for a 1, and vice versa. Various coding techniques are used to reduce the
chances of such errors Fig. 7.2 shows the general structure of a digital link.
In this chapter, we explore the operating principles of digital links and their
characteristics. We start with a discussion of Bayes’ rule and of detection theory.
We apply these ideas to a simple model of communication link. We then explore a
coding scheme that makes the transmissions faster. We conclude the chapter with a
TRANSMITTER
Source Channel
Source Modulator
Coder Coder
Channel
Source Channel
Destination Demodulator
Decoder Decoder
RECEIVER
The receiver gets some signal S and tries to guess what the transmitter sent. We
explore a general model of this problem and we then apply it to concrete situations.
qi S
pi
qN
pN
causation
conditional
probabilities
priors CN
where
N
pi ≥ 0, qi ∈ [0, 1] for i = 1, . . . , N and pi = 1.
i=1
P (Ci and S)
π(i) = P [Ci |S] =
P (S)
P (Ci and S) P [S|Ci ]P (Ci )
= N = N
j =1 P (Cj and S) j =1 P [S|Cj ]P (Cj )
pi qi
= N .
j =1 pj qj
This rule is very simple but is a canonical example of how observations affect
our beliefs. It is due to Thomas Bayes (Fig. 7.4).
118 7 Digital Link—A
Given the previous model, we see that the most likely circumstance under which the
symptom occurs, which we call the Maximum A Posteriori (MAP) estimate of the
circumstance given the symptom, is
The notation is that if h(·) is a function, then arg maxx h(x) is any value of x that
achieves the maximum of h(·). Thus, if x ∗ = arg maxx h(x), then h(x ∗ ) ≥ h(x) for
all x.
Thus, the MAP is the most likely circumstance, a posteriori, that is, after having
observed the symptom.
Note that if all the prior probabilities are equal, i.e., if pi = 1/N for all i, then
the MAP maximizes qi . In general, the estimate that maximizes qi is called the
Maximum Likelihood Estimate (MLE) of the circumstance given the symptom. That
is,
That is, the MLE is the circumstance that makes the symptom most likely.
More generally, one has the following definitions.
7.2 Detection and Bayes’ Rule 119
Definition 7.1 (MAP and MLE) Let (X, Y ) be discrete random variables. Then
and
These definitions extend in the natural way to the continuous case, as we will get
to see later.
P (sunburn and ice cream) = 50 < P (sunburn and no ice cream) = 600,
so that among those who have a sunburn, a minority eat ice cream, so that it is more
likely that a sunburn person did not eat ice cream. Hence, the MAP if No. However,
the fraction of people who have a sunburn is larger among those who eat ice cream
(10%) than among those who do not (0.6%). Hence, the MLE is Yes.
p
1 1
1−p
120 7 Digital Link—A
0
0 p 0.5 1 − p 1
Note that if p = 0 or p = 1, then one can recover exactly every bit that is
sent. Also, if p = 0.5, then the output is independent of the input and no useful
information goes through the channel. What happens in the other cases?
Call X ∈ {0, 1} the input of the channel and Y ∈ {0, 1} its output. Assume that
you observe Y = 1 and that P (X = 1) = α, so that P (X = 0) = 1 − α. We have
the following result illustrated in Fig. 7.6.
Theorem 7.2 (MAP and MLE for BSC) For the BSC with p < 0.5,
and
MLE[X|Y ] = Y.
To understand the MAP results, consider the case Y = 1. Since p < 0.5, we are
inclined to think that X = 1. However, if α is small, this is unlikely. The result is
that X = 1 is more likely than X = 0 if α > p, i.e., if the prior is “stronger” than
the noise. The case Y = 0 is similar.
Proof In the terminology of Bayes’ rule, the event Y = 1 is the symptom. Also, the
prior probabilities are
p0 = 1 − α and p1 = α,
Hence,
Thus,
1, if p1 q1 = α(1 − p) > p0 q0 = (1 − α)p
MAP [X|Y = 1] =
0, otherwise.
Hence, MAP [X|Y = 1] = 1{α > p}. That is, when Y = 1, your guess is that
X = 1 if the prior that X = 1 is larger than the probability that the channel makes
an error.
Also,
Thus,
1, if p1 (1 − q1 ) = αp > p0 (1 − q0 )(1 − α)(1 − p)
MAP [X|Y = 0] =
0, otherwise.
Hence, MAP [X|Y = 0] = 1{α > 1 − p}. Thus, when Y = 0, you guess that X = 1
if X = 1 is more likely a priori than the channel being correct.
Also, MLE[X|Y = 0] = 0 because p < 0.5, irrespectively of α.
Coding can improve the characteristics of a digital link. We explore Huffman codes
in this section.
Say that you want to transmit strings of symbols A, B, C, D across a digital link.
The simplest method is to encode these symbols as 00, 01, 10, and 11, respectively.
In so doing, each symbol requires transmitting two bits. Assuming that there is no
error, if the receiver gets the bits 0100110001, it recovers the string BADAB.
122 7 Digital Link—A
Now assume that the strings are such that the symbols occur with the following
frequencies: (A, 55%), (B, 30%), (C, 10%), (D, 5%). Thus, A occurs 55% of the
time, and similarly for the other symbols. In this situation, one may design a code
where A requires fewer bits than D.
The Huffman code (Huffman 1952, Fig. 7.7) for this example is as follows:
Thus, one saves 20% of the transmissions and the resulting system is 25% faster
(ah! arithmetics). Note that the code is such that, when there is no error, the receiver
can recover the symbols uniquely from the bits it gets. For instance, if the receiver
gets 110100111, the symbols are CBAD, without ambiguity.
The reason why there is no possible ambiguity is that one can picture the bits as
indicating the path in a tree that ends with a leaf of the tree, as shown in Fig. 7.8.
Thus, starting with the first bit received, one walks down the tree until one reaches
a leaf. One then repeats for the subsequent bits. In our example, when the bits are
110100111, one starts at the top of the tree, then one follows the branches 110 and
reaches leaf C, then one restarts from the top and follows the branches 10 and gets
to the leaf B, and so on. Codes that have this property of being uniquely decodable
in one pass are called prefix-free codes.
The construction of the code is simple. As shown in Fig. 7.8, one joins the two
symbols with the smallest frequency of occurrence, here C and D, with branches 0
and 1 and assigns the group CD the sum of the symbol frequencies, here 0.15. One
then continues in the same way, joining CD and B and assigning the group BCD
the frequency 0.3 + 0.15 = 0.45. Finally, one joins A and BCD. The resulting tree
specifies the code.
The following property is worth noting.
7.3 Huffman Codes 123
0.15
0 1
A B C D
0.55 0.3 0.1 0.05
0 10 110 111
Theorem 7.3 (Optimality of Huffman Code) The Huffman code has the smallest
average number of bits per symbol among all prefix-free codes.
It should be noted that other codes have a smaller average length, but they are
not symbol-by-symbol codes and are more complex. One code is based on the
observation that there are only 2nH likely strings of n 1 symbols, where
H =− x log2 (x).
X
In this expression, x is the frequency of symbol X and the sum is over all the
symbols. This expression H is the entropy of the distribution of the symbols. Thus,
by listing all these strings and assigning nH bits to identify them, one requires only
nH bits for n symbols, or H bits per symbol (See Sect. 15.7.).
In our example, one has
Thus, for this example, the savings over the Huffman code are not spectacular, but
it is easy to find examples for which they are. For instance, assume that there are
only two symbols A and B with frequencies p and 1 − p, for some p ∈ (0, 1). The
Huffman code requires one bit per symbol, but codes based on long strings require
only −p log2 (p) − (1 − p) log2 (1 − p) bits per symbol. For p = 0.1, this is 0.47,
which is less than half the number of bits of the Huffman code.
Coding based on long strings of symbols are discussed in Sect. 15.7.
124 7 Digital Link—A
Y = X + Z.
Hence,
0 1 y
7.4 Gaussian Channel 125
Similarly,
Also,
If we choose the MLE detection rule, the system has the same probability of error
as a BSC channel with
0.5
p = p(σ 2 ) := P (N (0, σ 2 ) > 0.5) = P N (0, 1) > .
σ
Simulation
Figure 7.10 shows the simulation results when α = 0.5 and σ = 1. The code is in
the Jupyter notebook for this chapter.
7.4.1 BPSK
The system in the previous section was very simple and corresponds to a practical
transmission scheme called Binary Phase Shift Keying (BPSK). In this system,
instead of sending a constant voltage for T seconds to represent either a bit 0 or
a bit 1, the transmitter sends a sine wave for T seconds and the phase of that sine
wave depends on whether the transmitter sends a 0 or a 1 (Fig. 7.11).
Specifically, to send bit 0, the transmitter sends the signal
signal is a sine wave around frequency f and the designer can choose a frequency
that the transmission medium transports well. For instance, if the transmission is
wireless, the frequency f is chosen so that the antennas radiate and receive that
frequency well. The wavelength of the transmitted electromagnetic wave is the
speed of light divided by f and it should be of the same order as the physical length
of the antenna. For instance, 1GHz corresponds to a wavelength of one foot and it
can be transmitted and received by suitably shaped cell phone antennas.
In any case, the transmitter sends the signal si to send a bit i, for i = 0, 1. The
receiver attempts to detect whether s0 or s1 = −s0 was sent. To do this, it multiplies
the received signal by a sine wave at the frequency f , then computes the average
value of the product. That is, if the receiver gets the signal r = {rt , 0 ≤ t ≤ T }, it
computes
T
1
rt sin(2πf t)dt.
T 0
You can verify that if r = s0 , then the result is A/2 and if r = s1 , then the result is
−A/2. Thus, the receiver guesses that bit 0 was transmitted if this average value is
positive and that bit 1 was transmitted otherwise.
The signal that the receiver gets is not si when the transmitter sends si . Instead,
the receiver gets an attenuated and noisy version of that signal. As a result, after
doing its calculation, the receiver gets B + Z or −B + Z where B is some constant
7.5 Multidimensional Gaussian Channel 127
When using BP SK, the transmitter has a choice between two signals: s0 and s1 .
Thus, in T seconds, the transmitter sends one bit. To increase the transmission
rate, communication engineers devised a more efficient scheme called Quadrature
Amplitude Modulation (QAM). When using this scheme, a transmitter can send a
number k of bits every T seconds. The scheme can be designed for different values
of k. When k = 1, the scheme is identical to BPSK. For k > 1, there are 2k different
signals and each one is of the form
where the coefficients (a, b) characterize the signal and correspond to a given string
of k-bits. These coefficients form a constellation as shown in Fig. 7.12 in the case
of QAM-16, which corresponds to k = 4.
When the receiver gets the signal, it multiplies it by 2 cos(2πf t) and computes
the average over T seconds. This average value should be the coefficient a if
there was not attenuation and no noise. The receiver also multiplies the signal by
2 sin(2πf t) and computes the average over T seconds. The result should be the
coefficient b. From the value of (a, b), the receiver can tell the four bits that the
transmitter sent.
Because of the noise (we can correct for the attenuation), the receiver gets a pair
of values Y = (Y1 , Y2 ), as shown in the figure. The receiver essentially finds the
constellation point closest to the measured point Y and reads off the corresponding
bits.
x 16
x1
128 7 Digital Link—A
The values of |a| and |b| are bounded, because of a power constraint on the
transmitter. Accordingly, a constellation with more points (i.e., a larger value of k)
has points that are closer together. This proximity increases the likelihood that the
noise misleads the receiver. Thus, the size of the constellation should be adapted
to the power of the noise. This is in fact what actual systems do. For instance, a
cable modem and an ADSL modem divide the frequency band into small channels
and they measure the noise power in each channel and choose the appropriate
constellation for each. WiFi, LTE, and 5G systems use a similar scheme.
Y=X+Z
where Z = (Z1 , Z2 ) and Z1 , Z2 are i.i.d. N(0, σ 2 ) random variables. That is, we
assume that the errors in Y1 and Y2 are independent and Gaussian. In this case, we
can calculate the conditional density fY|X [y|x] as follows. Given X = x, we see that
Y1 = x1 + Z1 and Y2 = x2 + Z2 . Since Z1 and Z2 are independent, it follows that
Y1 and Y2 are independent as well. Moreover, Y1 = N(x1 , σ 2 ) and Y2 = N(x2 , σ 2 ).
Hence,
1 (y1 − x1 )2 1 (y2 − x2 )2
fY|X [y|x] = √ exp − √ exp − .
2π σ 2 2σ 2 2π σ 2 2σ 2
Recall that MLE[X|Y = y] is the value of x ∈ {x1 , . . . , x16 } that maximizes this
expression. Accordingly, it is the value xk that minimizes
Thus, MLE[X|Y] is indeed the constellation point that is the closest to the measured
value Y.
There are many situations where the MAP and MLE are not satisfactory guesses.
This is the case for designing alarms, medical tests, failure detection algorithms,
and many other applications. We describe an important formulation, called the
hypothesis testing problem.
7.6 Hypothesis Testing 129
7.6.1 Formulation
We consider the case where X ∈ {0, 1} and where one assumes a distribution of Y
given X. The goal will be to solve the following problem:
A typical ROC is shown in Fig. 7.13. The terminology comes from the fact that
this function depends on the conditional distributions of Y given X = 0 and given
X = 1, i.e., of the signal that is received about X.
Note the following features of that curve. First, R(1) = 1 because if one is
allowed to have P F A = 1, then one can choose X̂ = 1 for all observations; in that
case P CD = 1.
Second, the function R(β) is concave. To see this, let 0 ≤ β1 < β2 ≤ 1 and
assume that gi (Y ) achieves P [gi (Y ) = 1|X = 1] = R(βi ) and P [gi (Y ) = 1|X =
0] = βi for i = 1, 2. Choose ∈ (0, 1) and define X = g1 (Y ) with probability
and X = g2 (Y ) otherwise. Then,
2 IfH0 means that you are healthy and H1 means that you have a disease, P F A is the probability
of a false positive test and 1 − P CD is the probability of a false negative test. These are also called
type I and type II errors in the literature. P F A is also called the p-value of the test.
130 7 Digital Link—A
b
0
0 1
Also,
Now, the decision rule X̂ that maximizes P [X̂ = 1|X = 1] subject to P [X̂ =
1|X = 0] = β1 + (1 − )β2 must be at least as good as X . Hence,
R(β1 + (1 − )β2 ) ≥ β1 + (1 − )β2 .
7.6.2 Solution
The solution of the hypothesis testing problem is stated in the following theorem.
In these expressions,
fY |X [y|1]
L(y) =
fY |X [y|0]
is the likelihood ratio, i.e., the ratio of the likelihood of y when X = 1 divided by its
likelihood when X = 0. Also, λ > 0 and γ ∈ [0, 1] are chosen so that the resulting
X̂ satisfies
P [X̂ = 1|X = 0] = β.
Thus, if L(Y ) is large, X̂ = 1. The fact that L(Y ) is large means that the observed
value Y is much more likely when X = 1 than when X = 0. One is then inclined
to decide that X = 1, i.e. to guess X̂ = 1. The situation is similar when L(Y ) is
small. By adjusting λ, one controls the sensitivity of the detector. If λ is small, one
tends to choose X̂ = 1 more frequently, which increases P CD but also P F A. One
then chooses λ so that the detector is just sensitive enough so that P F A = β. In
some problems, one may have to hedge the guess for the critical value λ as we will
explain in examples (Fig. 7.14).
We prove this theorem in the next chapter. Let us consider a number of examples.
7.6.3 Examples
Gaussian Channel
Recall our model of the scalar Gaussian channel:
Y = X + Z,
We looked at two formulations: MLE and MAP. In the MLE, we want to find the
value of X that makes Y most likely. That is,
To calculate the MAP, one needs to know the prior probability p0 that X = 0. We
found out that MAP [X|Y = y] = 1 if y ≥ 0.5 + σ 2 log(p0 /p1 ) and MAP [X|Y =
y] = 0 otherwise.
In the hypothesis testing formulation, we choose a bound β on P F A = P [X̂ =
1|X = 0]. According to Theorem 7.4, we should calculate the likelihood ratio L(Y ).
We find that
!
exp − (y−1)
2
2σ 2 2y − 1
L(y) = ! = exp .
exp − y
2
2σ 2
2σ 2
Note that, for any given λ, P (L(Y ) = λ) = 0. Moreover, L(y) is strictly increasing
in y. Hence, (7.3) simplifies to
1, if y ≥ y0
X̂ =
0, otherwise.
P [X̂ = 1|X = 0] = P [Y ≥ y0 |X = 0] = β.
P (N(0, σ 2 ) ≥ y0 ) = β,
PFA = b
X̂ = 0 X̂ = 1
Let us calculate the ROC for the Gaussian channel. Let y(β) be such that
P (N (0, 1) ≥ y(β)) = β, so that y0 = y(β)σ . The probability of correct detection
is then
Figure 7.16 shows the ROC for different values of σ , obtained using Python. Not
surprisingly, the performance of the system degrades when the channel is noisier.
fY |X [y|1] Π n λ1 exp{−λ1 yi }
L(y) = = i=1
fY |X [y|0] n λ exp{−λ y }
Πi=1 0 0 i
n
λ1
n
= exp −(λ1 − λ0 ) yi .
λ0
i=1
Since λ1 > λ0 , we find that L(y) is strictly decreasing in i yi and also that
P (L(Y ) = λ) = 0 for all λ. Thus, (7.3) simplifies to
1, if ni=1 Yi ≤ a
X̂ =
0, otherwise,
Now, when X = 0, the Yi are i.i.d. random variables that are exponentially
distributed with mean 1/λ0 . The distribution of their sum is rather complicated.
We approximate it using the Central Limit Theorem.
We have3
Y1 + · · · + Yn − nλ−1
√ 0
≈ N(0, λ−2
0 ).
n
Now,
Y1 + · · · + Yn − nλ−1 a − nλ−1
n
Yi ≤ a ⇔ √ 0
≤ √ 0 .
n n
i=1
Hence,
n
a − nλ−1
P Yi ≤ a|X = 0 ≈ P N(0, λ−2
0 ) ≤ √ 0
n
i=1
a − nλ−1
=P N(0, 1) ≤ λ0 √ 0 .
n
a − nλ−1
λ0 √ 0 = 1.65,
n
i.e.,
√
a = (n + 1.65 n)λ−1
0 .
One point is worth noting for this example. We see that the calculation of X̂ is
based on Y1 + · · · + Yn . Thus, although one has measured the individual lifespans of
the n bulbs, the decision is based only on their sum, or equivalently on their average.
Bias of a Coin
In this example, we observe n coin flips. Given X = x ∈ {0, 1}, the coins are
i.i.d. B(px ). That is, given X = x, the outcomes Y1 , . . . , Yn of the coin flips are
i.i.d. and equal to 1 with probability px and to zero otherwise. We assume that
p1 > p0 = 0.5. That is, we want to test whether the coin is fair or biased.
Here, the random variables Yi are discrete. We see that
Hence,
P [Yi = yi , i = 1, . . . , n|X = 1]
L(Y1 , . . . , Yn ) =
P [Yi = yi , i = 1, . . . , n|X = 0]
S
p1 1 − p1 n−S 1 − p1 n p1 (1 − p0 ) S
= = .
p0 1 − p0 1 − p0 p0 (1 − p1 )
Since p1 > p0 , we see that the likelihood ratio is increasing in S. Thus, the solution
of the hypothesis testing problem is
X̂ = 1{S ≥ n0 },
2n0 − n
√ = 1.65,
n
by (3.2). Hence,
√
n0 = 0.5n + 0.83 n.
Discrete Observations
In the examples that we considered so far, the random variable L(Y ) is continuous.
In such cases, the probability that L(Y ) = λ is always zero, and there is no need to
randomize the choice of X̂ for specific values of Y . In our next examples, that need
arises.
First consider, as usual, the problem of choosing X̂ ∈ {0, 1} to maximize the
probability of correct detection P [X̂ = 1|X = 1] subject to a bound P [X̂ =
1|X = 0] ≤ β on the probability of false alarm. However, assume that we make
no observation. In this case, the solution is to choose X̂ = 1 with probability
β. This choice meets the bound on the probability of false alarm and achieves a
probability of correct detection equal to β. This randomized choice is better than
always deciding X̂ = 0.
Now consider a more complex example where Y ∈ {A, B, C} and
0.8 R(β)
0.6
β
0
0 0.3 0.5 1
7.7 Summary
Bayes’ Rule πi = pi qi /( j pj qj ) Theorem 7.1
MAP [X|Y = y] arg maxx P [X = x|Y = y] Definition 7.1
MLE[X|Y = y] arg maxx P [Y = y|X = x] Definition 7.1
Likelihood Ratio L(y) = fY |X [y|1]/fY |X [y|0] Theorem 7.4
Gaussian Channel MAP [X|Y = y] = 1{y ≥ 12 + σ 2 log( pp01 )} (7.2)
Neyman–Pearson Theorem P [X̂ = 1|Y ] = 1{L(Y ) > λ} + γ 1{L(Y ) = λ} Theorem 7.4
ROC ROC(β) = max. P CD s.t. P F A ≤ β Definition 7.2
7.8 References
7.9 Problems
Problem 7.3 A digital link uses the QAM-16 constellation shown in Fig. 7.12 with
x1 = (1, −1). The received signal is Y = X + Z where Z =D N (0, σ 2 I). The
receiver uses the MAP. Simulate the system using Python to estimate the fraction of
errors for σ = 0.2, 0.3.
Problem 7.4 Use Python to verify the CLT with i.i.d. U [0, 1] random variables Xn .
That is, generate the random variables {X1 , . . . , XN } for N = 10000. Calculate
X100n+1 + · · · + X(n+1)100 − 50
Yn = , n = 0, 1, . . . , 99.
10
7.9 Problems 139
Plot the empirical cdf of {Y0 , . . . , Y99 } and compare with the cdf of a N (0, 1/12)
random variable.
Problem 7.5 You are testing a digital link that corresponds to a BSC with some
error probability ∈ [0, 0.5).
(a) Assume you observe the input and the output of the link. How do you find the
MLE of .
(b) You are told that the inputs are i.i.d. bits that are equal to 1 with probability 0.6
and to 0 with probability 0.4. You observe n outputs. How do you calculate the
MLE of .
(c) The situation is as in the previous case, but you are told that has pdf 4 − 8x on
[0, 0.5). How do you calculate the MAP of given n outputs.
Problem 7.6 The situation is the same as in the previous problem. You observe n
inputs and outputs of the BSC. You want to solve a hypothesis problem to detect
that > 0.1 with a probability of false alarm at most equal to 5%. Assume that n is
very large and use the CLT.
y1 = x1 + Z1 ,
y2 = x2 + Z2 ,
where Z1 and Z2 are independent N(0, σ 2 ) random variables. Find the MAP
detector and ML detector analytically.
140 7 Digital Link—A
Simulate the channel using Python for π = [0.1, 0.2, 0.3, 0.4], and σ = 0.1 and
σ = 0.5. Evaluate the probability of correct detection.
Problem 7.9 Let X be equally likely to take any of the values {1, 2, 3}. Given X,
the random variable Y is N (X, 1).
(a) Assume that you have observed Y n = (Y1 , . . . , Yn ). What is the guess X̂n based
on these observations that maximizes the probability that X̂n = X?
(b) What is the corresponding value of P (X̂n = X)?
(c) Choose n to maximize P (X = X̂n ) − βn where X̂n is chosen on the basis of
Y1 , . . . , Yn ). Hint: You will recall that
d x
(a ) = a x log(a).
dx
Problem 7.13 Assume that Y =D U [a, b]. You observe n i.i.d. samples Y1 , . . . , Yn
of this random variable. Calculate the maximum likelihood estimator â of a and b̂
of b. What is the bias of â and b̂?
Problem 7.15 Given θ ∈ {0, 1}, X = θ (1, 1) +V where V1 and V2 are independent
and uniformly distributed in [−2, 2]. Solve the hypothesis testing problem:
(a) Find θ̂ = H T [θ |X, β], defined as the random variable θ̂ determined from X
that maximizes P [θ̂ = 1|θ = 1] subject to P [θ̂ = 1|θ = 0] ≤ β;
(b) Compute the resulting value of α(β) = P [θ̂ = 1|θ = 1];
(c) Sketch the ROC curve α(β) for β ∈ [0, 1].
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Digital Link—B
8
Theorem 8.1 (Optimality of Huffman Code) The Huffman code has the smallest
average number of bits per symbol among all prefix-free codes (Fig. 8.1).
0.15
0 1
A B C D
0.55 0.3 0.1 0.05
0 10 110 111
Thus, L(n + 1) ≤ A(n + 1), which contradicts the assumption that the Huffman
code is not optimal for n + 1 symbols. It remains to prove the claim about X and
Y being siblings. First note that Y having the maximum path length, it cannot be
an only child, for otherwise, we would replace its parent by Y and reduce the path
length. Say that Y has a sibling V other than X. By swapping V and X, one does
not increase the average path length, since the frequency of V is not smaller than
that of X. This concludes the proof.
The idea of the proof is to consider any other decision rule that produces an estimate
X̃ with P [X̃ = 1|X = 0] ≤ β and to show that
(X̂ − X̃)(L(Y ) − λ) ≥ 0.
Indeed, when L(Y ) − λ > 0, one has X̂ = 1 ≥ X̃, so that the expression above is
indeed nonnegative. Similarly, when L(Y ) − λ < 0, one has X̂ = 0 ≤ X̃, so that
the expression is again nonnegative.
Taking the expected value of this expression given X = 0, we find
Now,
Note that this result continues to hold even for a function g(Y, Z) where Z is a
random variable that is independent of X and Y . In particular,
Similarly,
In many systems, the errors in the different components of the measured vector Y
are not independent. A suitable model for this situation is that
Y = X + AZ,
1 1 −1
fY|X [y|x] = exp − (y − x) (AA ) (y − x) , (8.4)
2π |A| 2
where A is the transposed of matrix A, i.e., A (i, j ) = A(j, i) for i, j ∈ {1, 2}.
Consequently, the MLE is the value xk of x that minimizes
W := A−1 Y = A−1 X + Z =: V + Z.
Thus, if we calculate A−1 Y from the measured vector Y, we find that its components
are i.i.d. N(0, 1) for a given value of X. Hence, it is easy to calculate MLE[V|W =
w]: it is the closest value to w in the set {A−1 x1 , . . . , A−1 x16 } of possible values of
V. It is then reasonable to expect that we can recover the MLE of X by multiplying
the MLE of V = A−1 X by A, i.e., that
Our goal in this section is to explain (8.4) and more general versions of this result.
We start by stating the main definition and a result that we prove later.
Y = AX + μY with ΣY = AA ,
20 5u2
A
u2
v1
u1
v2
A 60 5u1
The level curves of this jpdf are ellipses, as sketched in Fig. 8.2.
Note that this joint distribution is determined by the mean and the covariance
matrix. In particular, if Y = (V , W ) are jointly Gaussian, then the joint
distribution is characterized by the mean and ΣV , ΣW and cov(V, W). We know
that if V and W are independent, then they are uncorrelated, i.e., cov(V, W) =
0. Since the joint distribution is characterized by the mean and covariance, we
conclude that if they are uncorrelated, they are independent. We note this fact as
a theorem.
Theorem 8.3 (Jointly Gaussian RVs Are Independent Iff Uncorrelated) Let V
and W be jointly Gaussian random variables. Then, there are independent if and
only if they are uncorrelated.
Proof By definition, V and W are jointly Gaussian if they are linear functions of
i.i.d. N (0, 1) random variables. But then AV + a and BW + b are linear functions
of the same i.i.d. N(0, 1) random variables, so that they are jointly Gaussian. More
explicitly, there are some i.i.d. N (0, 1) random variables X so that
V c C
= + X,
W d D
148 8 Digital Link—B
so that
AV + a a + Ac AC
= + X.
BW + b b + Bd BD
Indeed, these random variables are jointly Gaussian by Theorem 8.4. Also, they are
uncorrelated since
Let us apply (8.6) to the case where X is a vector of n i.i.d. N(0, 1) random
variables. In this case,
1 xi2
fX (x) = Πi=1 fXi (xi ) = Πi=1 √ exp −
n n
2π 2
1 ||x||2
= exp − .
(2π )n/2 2
where Ax + μY = y. Thus,
x = A−1 (y − μY )
and
ΣY = AΣX A = AA .
In particular,
|ΣY | = |A|2 .
This section explains some basic statistical tests that are at the core of “data science.”
8.4.1 Zero-Mean?
H0 : μ = 0. (8.7)
H1 : μ = 0. (8.8)
We know that P [|Y | > 2 | H0 ] ≈ 5%. That is, if we reject H0 when |Y | > 2,
the probability of “false alarm,” i.e., of rejecting the hypothesis when it is correct is
5%. This is what all the tests that we will discuss in this chapter do. However, there
are many tests that achieve the same false alarm probability. For instance, we could
reject H0 when Y > 1.64 and the probability of false alarm would also be 5%. Or,
we could reject H0 when Y is in the interval [1, 1.23]. The probability of that event
under H0 is also about 5%.
150 8 Digital Link—B
Thus, there are many tests that reject H0 with a probability of false alarm equal
to 5%. Intuitively, we feel that the first one—rejecting H0 when |Y | > 2—is
more sensible than the others. This intuition probably comes from the idea that
the alternative hypothesis H1 : μ = 0 appears to be a symmetric assumption about
the likely values of μ. That is, we do not have a reason to believe that under H1 the
mean μ is more likely to be positive than negative. We just know that it is nonzero.
Given this symmetry, it is intuitively reasonable that the test should be symmetric.
However, there are many symmetric tests! So, we need a more careful justification.
To justify the test |Y | > 2, we note the following simple result.
H0 : μ = 0
H1 : μ has a symmetric distribution about 0.
Proof We know that the Neyman–Pearson test is a likelihood ratio test. Thus, it
suffices to show that the likelihood ratio is increasing in |Y |. Assume that the density
of μ under H1 is h(x). (The same argument goes through it μ is a mixed random
variable.) Then the pdf f1 (y) of Y under H1 is as follows:
f1 (y) = h(x)f (y − x)dx,
√
where f (x) = (1/ 2π ) exp{−0.5y 2 } is the pdf of a N (0, 1) random variable.
Consequently, the likelihood ratio L(y) of Y is given by
2
f1 (y) f (y − x) x
L(y) = = h(x) dx = h(x) exp{−xy} exp − dx
f (y) f (y) 2
2
x
= 0.5 [h(x) + h(−x)] exp{−xy} exp − dx
2
= 0.5 h(x)[exp{xy} + exp{−xy}]dx,
where the fourth identity comes from h(x) = 0.5h(x) + 0.5h(−x), since h(x) =
h(−x). This expression shows that L(y) = L(−y). Also,
8.4 Elementary Statistics 151
∞
L (y)=0.5 h(x)x[exp{xy}− exp{−xy}]dx= h(x)x[exp{xy}− exp{−xy}]dx,
0
by symmetry of the integrand. For y > 0 and x > 0, we see that the last integrand
is positive, so that L (y) > 0 for y > 0.
Hence, L(y) is symmetric and increasing in y > 0, so that it is an increasing
function of |y|, which completes the proof.
As a simple application, say that you buy 100 light bulbs from brand A and 100
from brand B. You want to test whether that have the same mean lifetime. You
measure the lifetimes {X1A , . . . , X100
A } and {X B , . . . , X B } of the bulbs of the two
1 100
batches and you calculate
(X1A + · · · X100
A ) − (X B + · · · X B )
Y = √ 1 100
,
σ N
|μ̂|
> λ,
σ̂
1
n
σ̂ 2 = (Ym − μ̂)2
n−1
m=1
|tn−1 |
is the sample variance, and λ is such that P ( √ n−1
> tn−1 ) = β.
Here, tn−1 is a random variable with a t distribution with n − 1 degrees of
freedom. By definition, this means that
N (0, 1)
tn−1 = * ,
2 /(n − 1)
χn−1
152 8 Digital Link—B
σZ σW
μ̂1
σV 1
V = N (0, 1)
0
σW2 = (n − 1)σ̂ 2
2
where χn−1 is the sum of the squares of n − 1 i.i.d. N (0, 1) random variables.
Thus, this chi-squared test is very similar to the previous one, except that one
replaces the standard deviation σ by it estimate σ̂ and the threshold λ is adjusted
(increased) to reflect the uncertainty in σ . Statistical packages provide routines to
calculate the appropriate value of λ. (See scipy.stats.chisquare for Python.)
Figure 8.3 explains the result. The rotation symmetry of Z implies that we can
assume that V = Z1 and that W = (0, Z2 , . . . , Zn ). As in the previous examples,
one uses the symmetry assumption under H1 to prove that the likelihood ratio is
monotone in μ̂/σ̂ .
Coming back to our lightbulbs example, what should we do if we have different
number of bulbs of the two brands? The next test covers that situation.
Needless to say, some care must be taken. It is not difficult to find distributions
for which this test does not perform well. This fact helps explain why many poorly
conducted statistical studies regularly contradict one another. Many publications
decry this fallacy of the p-value. The p-value is the name given to the probability of
false alarm.
H0 : Y = N (μ, σ 2 I), μ ∈ L
H1 : Y = N (μ, σ 2 I), μ ∈ n .
1
H = H1 if and only if Y − μ̂2 > βn−m ,
σ2
where
2
In this expression, χn−m represents a random variable that has a chi-square
distribution with n − m degrees of freedom. This means that it is distributed like
the sum of n − m random variables that are i.i.d. N (0, 1).
Figure 8.4 shows that
Y − μ̂ = σ Z.
Y − μ̂ = σ (0, . . . , 0, Zm+1 , . . . , Zn ),
8.4.5 ANOVA
Our next model is more general and is widely used. In this model, Y =
N (Aγ , σ 2 I). We would like to test whether Mγ = 0, which is the H0 hypothesis.
Here, A is a n × k matrix, with k < n. Also, M is a q × k matrix with q < k.
The decision is to reject H0 if F > F0 where
Y − μ0 2 − Y − μ1 2 n−k
F = ×
Y − μ1 2 q
μ0 = arg min{Y − μ2 : μ = Aγ , Mγ = 0}
μ
Low Density Parity Check (LDPC) codes are among the most efficient codes used in
practice. Gallager invented these codes in his 1960 thesis (Gallager 1963, Fig. 8.6).
8.5 LDPC Codes 155
0 [k] · 2 = σ 2 χ2q
These codes are used extensively today, for instance, in satellite video transmissions.
They are almost optimal for BSC channels and also for many other channels.
The LDPC codes are as follows. Let x ∈ {0, 1}n be an n-bit string to be
transmitted. One augments this string with the m-bit string y where
y = H x. (8.9)
Here, H is an m × n matrix with entries in {0, 1}, one views x and y as column
vectors and the operations are addition modulo 2. For instance, if
⎡ ⎤
1 0 1 1 1 0 0 0
⎢0 1 0 1 1 0 1 0⎥
H =⎢
⎣1
⎥
1 0 0 0 1 0 1⎦
0 0 1 0 1 1 1 1
and x = [01001010], then y = [1110]. This calculation of the parity check bits y
from x is illustrated by the graph, called Tanner graph, shown in Fig. 8.7.
Thus, instead of simply sending the bit string x, one sends both x and y. The bits
in y are parity check bits. Because of possible transmission errors, the receiver may
get x̃ and ỹ instead of x and y. The receiver computes H x̃ and compares the result
with ỹ. The idea is that if ỹ = H x̃, then it is likely that x̃ = x and ỹ = y. In other
words, it is unlikely that errors would have corrupted x and y in a way that these
156 8 Digital Link—B
vectors would still satisfy the relation ỹ = H x̃. Thus, one expects the scheme to be
good at detecting errors, at least if the matrix H is well chosen.
In addition to detecting errors, the LDPC code is used for error correction. If
ỹ = H x̃, one tries to find the least number of components of x̃ and ỹ that can
be changed to satisfy the equations. These would be the most likely transmission
errors, if we assume that bit errors are i.i.d. have a very small probability. However,
searching for the possible combinations of components to change is exponentially
hard. Instead, one uses iterative algorithms that approximate the solution.
We illustrate a commonly used decoding algorithm, called belief propagation
(BP). We assume that each received bit is erroneous with probability 1 and
correct with probability ¯ = 1 − , independently of the other bits. We also assume
that the transmitted bits xj are equally likely to be 0 or 1. This implies that the parity
check bits yi are also equally likely to be 0 or 1, by symmetry. In this algorithm, the
message nodes xj and the check nodes yi exchange beliefs along the links of the
graph of Fig. 8.7 about the probability that the xj are equal to 1.
In steps 1, 3, 5, . . . of the algorithm, each node xj sends to each node yi to which
it is attached an estimate of P (xj = 1). Each node yi then combines these estimates
to send back new estimates to each xj about P (xj = 1). Here is the calculation
that the y nodes perform. Consider a situation shown in Fig. 8.8 where node y1 gets
the estimates a = P (x1 = 1), b = P (x2 = 1), c = P (x3 = 1). Assume also that
ỹ1 = 1, from which node y1 calculates P [y1 = 1|ỹ1 ] = 1 − = , ¯ by Bayes’ rule.
Since the graph shows that x1 + x2 + x3 = y1 , node y1 estimates the probability that
x1 = 1 as the probability that an odd number of bits among {x2 , x3 , y1 } are equal to
one (Fig. 8.9).
To see how to do the calculation, assume that x1 , . . . , xn are independent {0, 1}-
random variables with pi = P (xi = 1). Note that
1 − (1 − 2x1 ) × · · · × (1 − 2xn )
8.5 LDPC Codes 157
p1 p2 pn
1 1 n
P (odd) = − Π (1 − 2pj )
2 2 j=1
Fig. 8.9 Each node j is equal to one w.p. pj and to zero otherwise, independently of the other
nodes. The probability that an odd number of nodes are one is given in the figure
is equal to zero if the number of variables that are equal to one among {x1 , . . . , xn }
is even and is equal to two if it is odd. Thus, taking expectation,
2P (odd) = 1 − Πi=1
n
(1 − 2pi ),
so that
1 1 n
P (odd) = − Π (1 − 2pi ). (8.10)
2 2 i=1
Thus, in Fig. 8.8, one finds that
a
d b
c
abc
d=
abc + (1 − a)(1 − b)(1 − c)
One has
P (X = 1, Y1 , . . . , YN )
P [X = 1|Y1 , . . . , YN ] =
P (Y1 , . . . , YN )
P [Y1 , . . . , YN |X = 1]P (X = 1)
=
x=0,1 P [Y1 , . . . , YN |X = x]P (X = x)
P [Y1 |X = 1] × · · · × P [YN |X = 1]
= .
x=0,1 P [Y1 |X = x] × · · · × P [YN |X = x]
(8.12)
Now,
P (X = x, Yn ) P [X = x|Yn ]P (Yn )
P [Yn |X = x] = = .
P (X = x) 1/2
Thus,
y3
from the nodes y1 , y2 , y3 and node x1 assumes that these estimates were based on
independent observations.
To calculate a new estimate that it will send to node y1 , node x1 combines the
estimates from x̃1 , y2 and y3 . This estimate is
bc
, (8.14)
bc + ¯ b̄c̄
where b̄ = 1 − b and c̄ = 1 − c. In the next step, node x1 will send that estimate to
node y1 . It also calculates estimates for nodes y2 and y3 .
Summing up, the algorithm is as follows. At each odd step, node xj sends X(i, j )
to each node yi . At each even step, node yi sends Y (i, j ) to each node xj . One has
1 1
Y (i, j ) = − (1 − 2)(1 − 2ỹi )Πs∈A(i,j ) (1 − 2X(i, s)), (8.15)
2 2
where A(i, j ) = {s = j | H (i, s) = 1} and
N(i, j )
X(i, j ) = , (8.16)
N(i, j ) + D(i, j )
where
and
with
Also, node xj can update its probability of being 1 by merging the opinions of
the experts as
N(j )
X(j ) = , (8.17)
N(j ) + D(j )
160 8 Digital Link—B
where
and
After enough iterations, one makes the detection decisions xj = 1{X(j ) ≥ 0.5}.
Figure 8.12 shows the evolution over time of the estimated probabilities that
the xj are equal to one. Our code is a direct implementations of the formulas in
this section. More sophisticated implementations use sums of logarithms instead of
products.
Simulations, and a deep theory, show that this algorithm performs well if the
graph does not have small cycles. In such a case, the assumption that the estimates
are obtained from independent observations is almost correct.
8.6 Summary
• LDPC Codes;
• Jointly Gaussian Random Variables, independent if uncorrelated;
• Proof of Neyman–Pearson Theorem;
• Testing properties of the mean.
8.8 Problems 161
LDPC y = Hx (8.9)
P(odd) P ( j Xj = 1) = 0.5 − 0.5Πj (1 − 2pj ) (8.10)
Fusion of Experts P [X = 1|Y1 , . . . , Yn ] = Πj pj /(Πj pj + Πj p̄j ) (8.13)
Jointly Gaussian N (μ, Σ) ⇔ fX = . . . (8.4)
If X, Y are J.G., then X ⊥ Y ⇒ X, Y are independent Theorem 8.3
8.7 References
8.8 Problems
Problem 8.1 Construct two Gaussian random variables that are not jointly Gaus-
sian. Hint: Let X =D N (0, 1) and Z be independent random variables with
P (Z = 1) = P (Z = −1) = 1/2. Define Y = XZ. Show that X and Y meet
the requirements of the problem.
√
Problem 8.2 Assume that X =D (Y + Z)/ 2 where Y and Z are independent and
distributed like X. Show that X = N (0, σ 2 ) for some σ 2 ≥ 0. Hint: First √ show
that E(X) = 0. Second, show by induction that X =D (V1 + · · · + Vm )/ m for
m = 2n . where the Vi are i.i.d. and distributed like X. Conclude using the CLT.
Problem 8.3 Consider Problem 7.8 but assume now that Z =D N (0, Σ) where
0.2 0.1
Σ= .
0.1 0.3
The symbols are equally likely and the receiver uses the MLE. Simulate the system
using Python to estimate the fraction of errors.
162 8 Digital Link—B
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Tracking—A
9
9.1 Examples
A GPS receiver uses the signals it gets from satellites to estimate its location
(Fig. 9.1). Temperature and pressure sensors provide signals that a computer uses
to estimate the state of a chemical reactor.
A radar measures electromagnetic waves that an object reflects and uses the
measurements to estimate the position of that object (Fig. 9.2).
Similarly, your car’s control computer estimates the state of the car from
measurements it gets from various sensors (Fig. 9.3).
The objective is to choose the inference function g(·) to minimize the expected
error C(g) where
In this expression, c(X, X̂) is the cost of guessing X̂ when the actual value is X. A
standard example is
We will also study the case when X ∈ d for d > 1. In such a situation, one
uses c(X, X̂) = ||X − X̂||2 . If the function g(·) can be arbitrary, the function that
minimizes C(g) is the Minimum Mean Squares Estimate (MMSE) of X given Y . If
the function g(·) is restricted to be linear, i.e., of the form a +BY , the linear function
that minimizes C(g) is the Linear Least Squares Estimate (LLSE) of X given Y . One
may also restrict g(·) to be a polynomial of a given degree. For instance, one may
define the Quadratic Least Squares Estimate QLSE of X given Y . See Fig. 9.4.
9.3 Linear Least Squares Estimates 165
X̂
As we will see, a general method for the off-line inference problem is to choose
a parametric class of functions {gw , w ∈ d } and to then minimize the empirical
error
K
c(Xk , gw (Yk ))
k=1
over the parameters w. Here, the (Xk , Yk ) are the observed samples. The parametric
function could be linear, polynomial, or a neural network.
For the on-line problem, one also chooses a similar parametric family of
functions and one uses a stochastic gradient descent algorithm of the form
where ∇ is the gradient with respect to w and γ > 0 is a small step size. The
justification for this approach is that, since γ is small, by the SLLN, the update
tends to be in the direction of
k+K−1
− ∇w c(Xi+1 , gw (Yi+1 ) ≈ −K∇E(c(Xk , gw (Yk )) = −K∇C(gw ),
i=k
In this section, we study the linear least squares estimates. Recall the setup that we
explained in the previous section. There is a pair (X, Y ) of random variables with
some joint distribution and the problem is to find the function g(Y ) = a + bY that
minimizes
One consider the cases where the distribution is known, or a set of samples has been
observed, or one observes one sample at a time.
Assume that the joint distribution of (X, Y ) is known. This means that we know
the joint cumulative distribution function (j.c.d.f.) FX,Y (x, y).1
We are looking for the function g(Y ) = a + bY that minimizes
Definition 9.1 (Linear Least Squares Estimate (LLSE)) The LLSE of X given
Y , denoted by L[X|Y ], is the linear function a + bY that minimizes
E(|X − a − bY |2 ).
Note that
To find the values of a and b that minimize that expression, we set to zero the partial
derivatives with respect to a and b. This gives the following two equations:
cov(X, Y )
L[X|Y ] = a + bY = E(X) + (Y − E(Y )),
var(Y )
1 See Appendix B.
9.3 Linear Least Squares Estimates 167
cov(X, Y )
L[X|Y ] = E(X) + (Y − E(Y )). (9.3)
var(Y )
Y = αX + Z, (9.4)
Hence,
αE(X2 ) α −1 Y
L[X|Y ] = Y = ,
α 2 E(X2 ) + E(Z 2 ) 1 + SN R −1
where
α 2 E(X2 )
SNR :=
σ2
is the signal-to-noise ratio, i.e., the ratio of the power E(α 2 X2 ) of the signal in Y
divided by the power E(Z 2 ) of the noise. Note that if SN R is small, then L[X|Y ] is
close to zero, which is the best guess about X if one does not make any observation.
Also, if SN R is very large, then L[X|Y ] ≈ α −1 Y , which is the correct guess if
Z = 0.
As a second example, assume that
X = αY + βY 2 , (9.5)
E(Y k ) = (1 + k)−1 .
168 9 Tracking—A
0 X
0
Hence,
This estimate is sketched in Fig. 9.5. Obviously, if one observes Y , one can compute
X. However, recall that L[X|Y ] is restricted to being a linear function of Y .
9.3.1 Projection
h(Y )
L[X|Y ]
Equivalently,
These two equations are the same as (9.1)–(9.2). We call the identities (9.6) the
projection property.
Figure 9.7 illustrates the projection when
||X̂|| ||X||
= ,
||X|| ||Y ||
so that
||X̂|| 1 ||Y ||
=√ = ,
1 1+σ 2 1 + σ2
√
since ||Y || = 1 + σ 2 . This shows that
1
X̂ = Y.
1 + σ2
170 9 Tracking—A
To see why the projection property implies that L[X|Y ] is the closest point to X
in L (Y ), as suggested by Fig. 9.6, we verify that
for any given h(Y ) = c + dY . The idea of the proof is to verify Pythagoras’ identity
on the right triangle with vertices X, L[X|Y ] and h(Y ). We have
Now, the projection property (9.6) implies that the last term in the above expression
is equal to zero. Indeed, L[X|Y ] − h(Y ) is a linear function of Y . It follows that
as was to be proved.
Assume now that, instead of knowing the joint distribution of (X, Y ), we observe K
i.i.d. samples (X1 , Y1 ), . . . , (XK , YK ) of these random variables. Our goal is still to
construct a function g(Y ) = a + bY so that
E(|X − a − bY |2 )
K
|Xk − a − bYk |2 .
k=1
To do this, we set to zero the derivatives of this sum with respect to a and b. Algebra
shows that the resulting values of a and b are such that
covK (X, Y )
a + bY = EK (X) + (Y − EK (Y )), (9.7)
varK (Y )
where we defined
9.4 Linear Regression 171
Linear regression
1 1
K K
EK (X) = Xk , EK (Y ) = Yk ,
K K
k=1 k=1
1
K
covK (X, Y ) = Xk Yk − EK (X)EK (Y ),
K
k=1
1
K
varK (Y ) = Yk2 − EK (Y )2 .
K
k=1
That is, the expression (9.7) is the same as (9.3), except that the expectation is
replaced by the sample mean. The expression (9.7) is called the linear regression of
X over Y . It is shown in Fig. 9.8.
One has the following result.
Combined with the expressions for the linear regression and the LLSE, these
properties imply the result.
Formula (9.3) and the linear regression provide an intuitive meaning of the
covariance cov(X, Y ). If this covariance is zero, then L[X|Y ] does not depend
on Y . If it is positive (negative), it increases (decreases, respectively) with Y .
Thus, cov(X, Y ) measures a form of dependency in terms of linear regression. For
172 9 Tracking—A
instance, the random variables in Fig. 9.9 are uncorrelated since L[X|Y ] does not
depend on Y .
In the previous section, we examined the problem of finding the linear function a +
bY that best approximates X, in the mean squared error sense. We could develop the
corresponding theory for quadratic approximations a + bY + cY 2 , or for polynomial
approximations of a given degree. The ideas would be the same and one would have
a similar projection interpretation.
In principle, a higher degree polynomial approximates X better than a lower
degree one since there are more such polynomials. The question of fitting the
parameters with a given number of observations is more complex.
Assume you observe N data points {(Xn , Yn ), n = 1, . . . , N}. If the values Yn
are different, one can define the function g(·) by g(Yn ) = Xn for n = 1, . . . , N.
This function achieves a zero-mean squared error. What is then the point of looking
for a linear function, or a quadratic, or some polynomial of a given degree? Why not
simply define g(Yn ) = Xn ?
Remember that the goal of the estimation is to discover a function g(·) that is
likely to work well for data points we have not yet observed. For instance, we hope
that E(C(XN +1 , g(YN +1 )) is small, where (XN +1 , YN +1 ) has the same distribution
as the samples (Xn , Yn ) we have observed for n = 1, . . . , N .
If we define g(Yn ) = Xn , this does not tell us how to calculate g(YN +1 ) for a
value YN +1 we have not observed. However, if we construct a polynomial g(·) of
a given degree based on the N samples, then we can calculate g(Yn+1 ). The key
observation is that a higher degree polynomial may not be a better estimate because
it tends to fit noise instead of important statistics.
As a simple illustration of overfitting, say that we observe (X1 , Y1 ) and Y2 .
We want to guess X2 . Assume that the samples Xn , Yn are all independent and
U [−1, 1]. If we guess X̂2 = 0, the mean squared error is E((X2 − X̂2 )2 ) =
E(X22 ) = 1/3. If we use the guess X̂2 = X1 based on the observations, then
E((X2 − X̂2 )2 ) = E((X2 − X1 )2 ) = 2/3. Hence, ignoring the observation is better
than taking it into account.
The practical question is how to detect overfitting. For instance, how does one
determine whether a linear regression is better than a quadratic regression? A simple
9.6 MMSE 173
9.6 MMSE
For now, assume that we know the joint distribution of (X, Y ) and consider the
problem of finding the function g(Y ) that minimizes
per all the possible functions g(·). The best function is called the MMSE of X
given Y . We have the following theorem:
g(Y ) = E[X|Y ],
where
fX,Y (x, y)
fX|Y [x|y] :=
fY (y)
Figure 9.10 illustrates the conditional expectation. That figure assumes that the
pair (X, Y ) is picked uniformly in the shaded area. Thus, if one observes that Y ∈
174 9 Tracking—A
X
E[X|Y = y]
(y, y + dy), the point X is uniformly distributed along the segment that cuts the
shaded area at Y = y. Accordingly, the average value of X is the mid-point of that
segment, as indicated in the figure. The dashed red line shows how that mean value
depends on Y and it defines E[X|Y ].
The following result is a direct consequence of the definition: for every function φ(·),

E(E[X|Y]φ(Y)) = E(Xφ(Y)),        (9.8)

i.e., the error X − E[X|Y] is orthogonal to every function φ(Y) of Y.
Proof of Theorem 9.3 The identity (9.8) is the projection property. It states that X −
E[X|Y] is orthogonal to the set G(Y) of functions of Y, as shown in Fig. 9.11.
In particular, it is orthogonal to h(Y) − E[X|Y]. As in the case of the LLSE, this
projection property implies that

E((X − h(Y))²) = E((X − E[X|Y])²) + E((E[X|Y] − h(Y))²) ≥ E((X − E[X|Y])²)

for any function h(·). This implies that E[X|Y] is indeed the MMSE of X given Y.
From the definition, we see how to calculate E[X|Y ] from the conditional density
of X given Y . However, in many cases one can calculate E[X|Y ] more simply. One
approach is to use the following properties of conditional expectation.
(a) Linearity: E[a1X1 + a2X2 | Y] = a1E[X1|Y] + a2E[X2|Y];
(b) Factoring known values: E[h(Y)X | Y] = h(Y)E[X|Y];
(c) Independence: if X and Y are independent, then E[X|Y] = E(X);
(d) Smoothing: E(E[X|Y]) = E(X);
(e) Tower: E[E[X|Y, Z] | Y] = E[X|Y].
Proof
(a) The sum of the terms ai(Xi − E[Xi|Y]) is orthogonal to every function φ(Y) when each term is, which proves linearity.
(c) Now, X − E(X) is orthogonal to G(Y). Indeed,

E((X − E(X))φ(Y)) = E(X − E(X))E(φ(Y)) = 0.

The first equality follows from the fact that X − E(X) and φ(Y) are independent
since they are functions of independent random variables.4 Hence E[X|Y] = E(X).
(d) Letting φ(Y) = 1 in (9.8), we find

E(X − E[X|Y]) = 0,

i.e., E(E[X|Y]) = E(X).
(e) For the tower property, one must show that E(h(Y)(E[X|Y, Z] − E[X|Y])) = 0
for any function h(Y). But E(h(Y)(X − E[X|Y, Z])) = 0 by the projection
property, because h(Y) is some function of (Y, Z). Also, E(h(Y)(X −
E[X|Y])) = 0, also by the projection property. Hence, subtracting these identities proves the claim.
For instance, the properties above make it easy to compute expressions such as E[(X + 2Y)²|Y] = E[X²|Y] + 4Y E[X|Y] + 4Y².
4 See Appendix B.
As another example, assume that X, Y, Z are i.i.d. random variables. We find

E[X | X + Y + Z] = (X + Y + Z)/3.        (9.10)

To see this, note that, by symmetry,

E[X | X + Y + Z] = E[Y | X + Y + Z] = E[Z | X + Y + Z].

Denote by V the common value of these random variables. Note that their sum is

E[X + Y + Z | X + Y + Z] = X + Y + Z,

so that 3V = X + Y + Z and (9.10) follows.
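Identity (9.10) can be checked numerically: since E[X | X + Y + Z] = (X + Y + Z)/3, regressing X on S = X + Y + Z should give slope 1/3 and intercept 0. A minimal sketch, assuming Python with NumPy and an arbitrary i.i.d. distribution:

import numpy as np

rng = np.random.default_rng(1)
N = 10**6

# X, Y, Z i.i.d. (exponential here, but any i.i.d. distribution works).
X, Y, Z = rng.exponential(1.0, (3, N))
S = X + Y + Z

# Since E[X | S] = S / 3, the regression of X on S has slope 1/3 and intercept 0.
slope, intercept = np.polyfit(S, X, 1)
print(slope, intercept)   # approximately 0.333 and 0.0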
A remarkable fact is that if X and Y are jointly Gaussian, then the MMSE of X given Y is linear:

E[X|Y] = L[X|Y] = E(X) + (cov(X, Y)/var(Y)) (Y − E(Y)).

To see this, note that X − L[X|Y] and Y are two linear functions of the jointly Gaussian random
variables X and Y. Consequently, they are jointly Gaussian by Theorem 8.4 and, since cov(X − L[X|Y], Y) = 0 by the projection property of the LLSE, they are independent by Theorem 8.3.
Consequently,

E((X − L[X|Y])φ(Y)) = E(X − L[X|Y])E(φ(Y)) = 0

for any φ(·), because functions of independent random variables are independent by
Theorem B.11 in Appendix B. Hence,

X − L[X|Y] is orthogonal to G(Y),

which shows that L[X|Y] = E[X|Y].
9.7 Vector Case

The notions of LLSE and MMSE extend to random vectors. The LLSE of the random vector X given the random vector Y is the linear function

L[X|Y] = AY + b

that minimizes

E(||X − AY − b||²).

Thus, as in the scalar case, the LLSE is the linear function of the observations
that best approximates X, in the mean squared error sense.
Before proceeding, review the notation of Sect. B.6 for ΣY and cov(X, Y).
Theorem 9.7 (LLSE of Vectors) Let X and Y be random vectors such that ΣY is
nonsingular.
(a) Then

L[X|Y] = E(X) + cov(X, Y)ΣY^{−1}(Y − E(Y)).        (9.11)

(b) Moreover, the resulting error satisfies

E(||X − L[X|Y]||²) = tr(ΣX − cov(X, Y)ΣY^{−1}cov(Y, X)).
Proof
(a) The proof is similar to the scalar case. Let Z be the right-hand side of (9.11).
One shows that the error X − Z is orthogonal to all the linear functions of Y.
One then uses that fact to show that X is closer to Z than to any other linear
function h(Y) of Y.
We claim that the last term is equal to zero. To see this, note that

E((X − Z)^T(Z − h(Y))) = Σ_{i=1}^{n} E((Xi − Zi)(Zi − hi(Y))).
Also,
where the first equality comes from the fact that tr(AB) = tr(BA) for matrices
of compatible dimensions.)
(b) Let X̃ := X − L[X|Y] be the estimation error. Expanding E(||X̃||²) using (9.11) and the orthogonality of X̃ and Y yields the expression in part (b).
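Formula (9.11) can be evaluated from sample estimates of the means and covariances. A minimal sketch, assuming Python with NumPy and an illustrative linear model that is not from the text:

import numpy as np

rng = np.random.default_rng(2)
N = 200_000

# Illustrative model: X in R^2 is a linear function of Y in R^3 plus noise.
Y = rng.normal(size=(N, 3))
A_true = np.array([[1.0, -2.0, 0.5], [0.0, 1.0, 1.0]])
X = Y @ A_true.T + 0.3 * rng.normal(size=(N, 2))

mX, mY = X.mean(axis=0), Y.mean(axis=0)
SigmaY = np.cov(Y, rowvar=False)                 # estimate of cov(Y)
covXY = (X - mX).T @ (Y - mY) / (N - 1)          # estimate of cov(X, Y)

A = covXY @ np.linalg.inv(SigmaY)                # matrix in L[X|Y]
b = mX - A @ mY
print(A)   # close to A_true, since the model is linear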
9.8 Kalman Filter

The Kalman Filter is an algorithm to update the estimate of the state of a system
using its output, as sketched in Fig. 9.13. The system has a state X(n) and an output
Y(n) at time n = 0, 1, . . .. These variables are defined through a system of linear
equations:

X(n + 1) = AX(n) + V(n)
Y(n) = CX(n) + W(n).
In these equations, the random variables {X(0), V (n), W (n), n ≥ 0} are all
orthogonal and zero-mean. The covariance of V (n) is ΣV and that of W (n) is ΣW .
The filter is developed when the variables are random vectors and A, C are matrices
of compatible dimensions.
The objective is to derive recursive equations to calculate

X̂(n) := L[X(n) | Y(0), . . . , Y(n)], n ≥ 0.

Here is the result, due to Rudolf Kalman (Fig. 9.14), which we prove in the next
chapter. Do not panic when you see the equations!

Theorem 9.8 (Kalman Filter) The estimates can be computed recursively:

X̂(n) = AX̂(n − 1) + Kn[Y(n) − CAX̂(n − 1)],

where the gain Kn and the covariance Σn of the error X(n) − X̂(n) are given by

Sn = AΣn−1A^T + ΣV,
Kn = SnC^T[CSnC^T + ΣW]^{−1},
Σn = [I − KnC]Sn.

Moreover, if the system is observable and reachable (terms defined in the next chapter), then Σn → Σ and Kn → K as n → ∞, and the filter that uses the constant gain K achieves the same asymptotic error covariance.

We will give a number of examples of this result. But first, let us make a few
comments.
9.8.2 Examples
Random Walk

The first example is a filter to track a “random walk” by making noisy observations.
Let

X(n + 1) = X(n) + V(n)
Y(n) = X(n) + W(n).

That is, X(n) has orthogonal increments and it is observed with orthogonal noise.
Figure 9.15 shows a simulation of the filter. The left-hand part of the figure shows
that the estimate tracks the state with a bounded error. The middle part of the figure
shows the variance of the error, which can be precomputed. The right-hand part of
the figure shows the filter with the time-varying gain (in blue) and the filter with the
limiting gain (in green). The filter with the constant gain performs as well as the one
with the time-varying gain, in the limit, as justified by part (c) of the theorem.
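A minimal sketch of this random walk filter, assuming Python with NumPy, scalar A = C = 1, and noise variances chosen only for the illustration; the recursion follows the equations of Theorem 9.8:

import numpy as np

rng = np.random.default_rng(3)
n_steps, sig_v, sig_w = 200, 1.0, 2.0   # illustrative noise levels

# Simulate X(n+1) = X(n) + V(n), Y(n) = X(n) + W(n).
X = np.cumsum(rng.normal(0, sig_v, n_steps))
Y = X + rng.normal(0, sig_w, n_steps)

x_hat, Sigma = 0.0, 0.0                 # estimate and error variance
estimates = []
for n in range(n_steps):
    S = Sigma + sig_v**2                # S_n = A Sigma_{n-1} A^T + Sigma_V
    K = S / (S + sig_w**2)              # K_n = S C^T [C S C^T + Sigma_W]^{-1}
    x_hat = x_hat + K * (Y[n] - x_hat)  # update with the innovation
    Sigma = (1 - K) * S                 # Sigma_n = (I - K C) S
    estimates.append(x_hat)

print(np.mean((np.array(estimates) - X) ** 2), Sigma)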
In this model, X2 (n) is the constant but unknown drift and X1 (n) is the value of
the “random walk.” Figure 9.16 shows a simulation of the filter. It shows that the
filter eventually estimates the drift and that the estimate of the position of the walk
is quite accurate.
In this model, X2 (n) is the varying drift and X1 (n) is the value of the “random
walk.” Figure 9.17 shows a simulation of the filter. It shows that the filter tries to
track the drift and that the estimate of the position of the walk is quite accurate.
Falling Object
In the fourth example, one tracks a falling object. The elevation Z(n) of that falling
object follows the equation
where S(0) is the initial vertical velocity of the object and g is the gravitational
constant at the surface of the earth. In this expression, V (n) is some noise that
perturbs the motion. We observe η(n) = Z(n) + W (n), where W (n) is some noise.
With this change of variables, the system is described by the following equations:
Figure 9.18 shows a simulation of the filter that computes X̂1 (n) from which we
subtract gt 2 /2 to get an estimate of the actual altitude Z(n) of the object.
9.9 Summary
9.10 References
LLSE, MMSE, and linear regression are covered in Chapter 4 of Bertsekas and
Tsitsiklis (2008). The Kalman filter was introduced in Kalman (1960). The text
(Brown and Hwang 1996) is an easy introduction to Kalman filters with many
examples.
9.11 Problems
Problem 9.1 Assume that Xn = Yn + 2Yn² + Zn, where the Yn and Zn are i.i.d.
U [0, 1]. Let also X = X1 and Y = Y1 .
Problem 9.2 We want to compare the off-line and on-line methods for computing
L[X|Y ]. Use the setup of the previous problem.
(a) Generate N = 1, 000 samples and compute the linear regression of X given Y .
Say that this is X = aY + b
(b) Using the same samples, compute the linear fit recursively using the stochastic
gradient algorithm. Say that you obtain X = cY + d
(c) Evaluate the quality of the two estimates you obtained by computing E((X −
aY − b)²) and E((X − cY − d)²).
Problem 9.4 You observe three i.i.d. samples X1 , X2 , X3 from the distribution
fX|θ(x) = (1/2)e^{−|x−θ|}, where θ ∈ ℝ is the parameter to estimate. Find
MLE[θ |X1 , X2 , X3 ].
Problem 9.5
(a) Given three independent N(0, 1) random variables X, Y , and Z, find the
following minimum mean square estimator:
(b) For the above, compute the mean squared error of the estimator.
Problem 9.6 Given two independent N(0, 1) random variables X and Y, find the
following linear least square estimator:
L[X|X² + Y].
Problem 9.7 Consider a sensor network with n sensors that are making observa-
tions Yn = (Y1 , . . . , Yn ) of a signal X where
Yi = aX + Zi , i = 1, . . . , n.
nC + σn2 .
νC + σν2 ,
Problem 9.8 We want to use a Kalman filter to detect a change in the popularity of
a word in twitter messages. To do this, we create a model of the number Yn of times
that particular word appears in twitter messages on day n. The model is as follows:
X(n + 1) = X(n)
Y (n) = X(n) + W (n),
where the W (n) are zero-mean and uncorrelated. This model means that we are
observing numbers of occurrences with an unknown mean X(n) that is supposed
to be constant. The idea is that if the mean actually changes, we should be able to
detect it by noticing that the errors between Ŷ (n) and Y (n) are large. Propose an
algorithm for detecting that change and implement it in Python.
Xn+1 = aXn + Vn , n ≥ 0.
Problem 9.12 Let θ =D U [0, 1], and given θ , the random variable X is uniformly
distributed in [0, θ ]. Find E[θ |X].
Problem 9.13 Let (X, Y )T ∼ N([0; 0], [3, 1; 1, 1]). Find E[X2 |Y ].
Problem 9.14 Let (X, Y, Z)T ∼ N([0; 0; 0], [5, 3, 1; 3, 9, 3; 1, 3, 1]). Find
E[X|Y, Z].
Problem 9.15 Consider arbitrary random variables X and Y . Prove the following
property:
Problem 9.16 Let the joint p.d.f. of two random variables X and Y be

fX,Y(x, y) = (1/4)(2x + y) 1{0 ≤ x ≤ 1} 1{0 ≤ y ≤ 2}.
First show that this is a valid joint p.d.f. Suppose you observe Y drawn from this
joint density. Find MMSE[X|Y ].
E[X + 2Y + 3Z|Y + 5Z + 4V ].
Problem 9.18 Assume that X, Y are two random variables that are such that
E[X|Y ] = L[X|Y ]. Then, it must be that (choose the correct answers, if any)
Problem 9.19 In a linear system with independent Gaussian noise, with state Xn
and observation Yn , the Kalman filter computes (choose the correct answers, if any)
MLE[Yn |Xn ];
MLE[Xn |Y n ];
MAP [Yn |Xn ];
MAP [Xn |Y n ];
E[Xn |Y n ];
E[Yn |Xn ];
E[Xn |Yn ];
E[Yn |Xn ].
Find E[X|Y].
Problem 9.23 Given two independent N(0, 1) random variables X and Y, find the
following linear least square estimator:
L[X|X³ + Y].
E[X|X + Y, X + Z, Y − Z].
Find L[X|Y1 , Y2 , Y3 ]. Hint: You will observe that ΣY is singular. This means that at
least one of the observations Y1 , Y2 , or Y3 is redundant, i.e., is a linear combination
of the others. This implies that L[X|Y1 , Y2 , Y3 ] = L[X|Y1 , Y2 ].
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Tracking: B
10
In many situations, one keeps making observations and one wishes to update
the estimate accordingly, hopefully without having to recompute everything from
scratch. That is, one hopes for a method that enables to calculate L[X|Y, Z] from
L[X|Y] and Z.
The key idea is in the following result.

Theorem 10.1 If the zero-mean random vectors Y and Z are orthogonal, i.e., cov(Y, Z) = 0, then

L[X|Y, Z] = L[X|Y] + L[X|Z].        (10.1)

Proof Figure 10.1 shows why the result holds. To be convinced mathematically, we
need to show that the error

X − (L[X|Y] + L[X|Z])

is orthogonal to every linear function BY + DZ of Y and Z.
Fig. 10.1 The LLSE is easy to update after an additional orthogonal observation
Write the error as (X − L[X|Y]) − L[X|Z]. The first term is orthogonal to Y by the projection property, and the second is a linear function of Z, which is orthogonal to Y; hence the error is orthogonal to BY. By symmetry, it is also orthogonal to DZ, which proves the result.
Theorem 10.2 Let X, Y, Z be zero-mean random vectors. Then

L[X|Y, Z] = L[X|Y] + L[X|Z̃], where Z̃ := Z − L[Z|Y].        (10.2)

Proof The idea here is that one considers the innovation Z̃ := Z − L[Z|Y], which
is the information in the new observation Z that is orthogonal to Y.
To see why the result holds, note that any linear combination of Y and Z can be
written as a linear combination of Y and Z̃. For instance, if L[Z|Y] = CY, then

BY + DZ = BY + D(Z̃ + CY) = (B + DC)Y + DZ̃.

Thus, the set of linear functions of Y and Z is the same as the set of linear functions
of Y and Z̃, so that

L[X|Y, Z] = L[X|Y, Z̃].
Thus, (10.2) follows from Theorem 10.1 since Y and Z̃ are orthogonal.
We derive the equations for the Kalman filter, as stated in Theorem 9.8. For
convenience, we repeat those equations here:

X̂(n) = AX̂(n − 1) + Kn[Y(n) − CAX̂(n − 1)],
Sn = AΣn−1A^T + ΣV,
Kn = SnC^T[CSnC^T + ΣW]^{−1},
Σn = [I − KnC]Sn.

The derivation uses the linearity of L[·|·] and the fact that

cov(BV, DW) = B cov(V, W)D^T.
The algebra is a bit tedious, but the key steps are worth noting.
Let
Y n = (Y (0), . . . , Y (n)).
Note that

L[X(n)|Y^{n−1}] = L[AX(n − 1) + V(n − 1)|Y^{n−1}] = AX̂(n − 1).

Hence,

L[Y(n)|Y^{n−1}] = L[CX(n) + W(n)|Y^{n−1}] = CL[X(n)|Y^{n−1}] = CAX̂(n − 1),
Thus,
X̂(n) = L[X(n)|Y^n] = L[X(n)|Y^{n−1}] + L[X(n) | Y(n) − L[Y(n)|Y^{n−1}]]
       = AX̂(n − 1) + Kn[Y(n) − CAX̂(n − 1)].
This derivation shows that (10.16) is a fairly direct consequence of the formula in
Theorem 10.2 for updating the LLSE.
The calculation of the gain Kn is a bit more complex. Let

Ỹ(n) = Y(n) − L[Y(n)|Y^{n−1}] = Y(n) − CAX̂(n − 1).

Then

Kn = cov(X(n), Ỹ(n)) [cov(Ỹ(n))]^{−1}.
Now,

cov(X(n), Ỹ(n)) = cov(X(n) − L[X(n)|Y^{n−1}], Ỹ(n)),

by (10.20). Let Sn := cov(X(n) − L[X(n)|Y^{n−1}]). Since Ỹ(n) = C(X(n) − L[X(n)|Y^{n−1}]) + W(n), this gives

cov(X(n), Ỹ(n)) = SnC^T.

To calculate cov(Ỹ(n)), we note that

cov(Ỹ(n)) = cov(CX(n) + W(n) − CL[X(n)|Y^{n−1}]) = CSnC^T + ΣW.

Thus,

Kn = SnC^T [CSnC^T + ΣW]^{−1}.

Finally, since X(n) − L[X(n)|Y^{n−1}] = A(X(n − 1) − X̂(n − 1)) + V(n − 1),

Sn = AΣn−1A^T + ΣV.
We observe that

X(n) − L[X(n)|Y^n] = X(n) − AX̂(n − 1) − Kn[Y(n) − CAX̂(n − 1)]
                   = X(n) − AX̂(n − 1) − Kn[CX(n) + W(n) − CAX̂(n − 1)]
                   = [I − KnC][X(n) − AX̂(n − 1)] − KnW(n),

so that

Σn = [I − KnC]Sn[I − KnC]^T + KnΣWKn^T,

which, using the expression for Kn, simplifies to Σn = [I − KnC]Sn,
as we wanted to show.
The goal of this section is to explain and justify the following result. The terms
observable and reachable are defined after the statement of the theorem.

Theorem 10.3 (a) If (A, C) is observable, then the error covariance Σn remains bounded and

Σn → Σ and Kn → K,        (10.37)

for some limits Σ and K.
(b) If, in addition, the system is reachable, then the filter that uses the constant gain K has the same asymptotic error covariance as the filter with the time-varying gains Kn.
We explain these properties in the subsequent sections. Let us first make a few
comments.
• For some systems, the errors grow without bound. For instance, if one does not
observe anything (e.g., C = 0) and if the system is unstable (e.g., X(n) =
2X(n − 1) + V (n)), then Σn goes to infinity. However, (a) says that “if the
observations are rich enough,” this does not happen: one can track X(n) with an
error that has a bounded covariance.
• Part (b) of the theorem says that in some cases, one can use the filter with
a constant gain K without having a bigger error, asymptotically. This is very
convenient as one does not have to compute a new gain at each step.
10.3.1 Observability
Are the observations good enough to track the state with a bounded error covari-
ance? Before stating the result, we need a precise notion of good observations.
Definition 10.1 (Observability) We say that (A, C) is observable if the null space
of the stacked matrix

[C; CA; . . . ; CA^d]

is {0}. Here, d is the dimension of X(n). A matrix M has null space {0} if 0 is the
only vector v such that Mv = 0.
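In practice, observability can be checked numerically by computing the rank of the stacked matrix in Definition 10.1. A minimal sketch, assuming Python with NumPy and an example pair (A, C) chosen only for illustration:

import numpy as np

def is_observable(A, C):
    """Return True if the stacked matrix [C; CA; ...; CA^d] has full column rank."""
    d = A.shape[0]
    blocks = [C @ np.linalg.matrix_power(A, k) for k in range(d + 1)]
    O = np.vstack(blocks)
    return np.linalg.matrix_rank(O) == d

# Example: position-velocity model where only the position is observed.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
print(is_observable(A, C))   # True: velocity can be inferred from positions

C2 = np.array([[0.0, 0.0]])  # observing nothing
print(is_observable(A, C2))  # False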
Proof
(a) Observability implies that there is only one X(0) that corresponds to
(Y(0), . . . , Y(d)) if the system has no noise. Indeed, in that case, X(n) = AX(n − 1), so that X(n) = A^nX(0) and

Y(n) = CX(n) = CA^nX(0).

Consequently,

(Y(0), Y(1), . . . , Y(d))^T = [C; CA; . . . ; CA^d] X(0),

where [C; CA; . . . ; CA^d] is the stacked matrix of Definition 10.1.
Now, imagine that there are two different initial states, say X(0) and X̊(0) that
give the same outputs Y (0), . . . , Y (d). Then,
(Y(0), Y(1), . . . , Y(d))^T = [C; CA; . . . ; CA^d] X(0) = [C; CA; . . . ; CA^d] X̊(0),

so that

[C; CA; . . . ; CA^d] (X(0) − X̊(0)) = 0.

Since the null space of this matrix is {0}, it follows that X(0) = X̊(0). Thus, in the noiseless case, the first observations determine X(0) exactly.
This implies that the error between X(n) and X̂(n) is a linear combination of
d noise contributions, so that Σn is bounded.
(b) One can show that if Σ0 = 0, i.e., if we know X(0), then Σn increases in the
sense that Σn − Σn−1 is nonnegative definite. Being bounded and increasing
implies that Σn converges, and so does Kn .
10.3.2 Reachability

Write ΣV = QQ^T. We say that (A, Q) is reachable if the rank of the matrix

[Q, AQ, . . . , A^{d−1}Q]

is full. To appreciate the meaning of this property, note that we can write the state
equations as

X(n + 1) = AX(n) + Qηn,
where cov(ηn) = I. That is, the components of η are orthogonal. In the Gaussian
case, the components of η are N(0, 1) and independent. If (A, Q) is reachable, this
means that for any x ∈ ℝ^d, there is some sequence η0, . . . , η_{d−1} such that if X(0) = 0,
then X(d) = x. Indeed,
X(d) = Σ_{k=0}^{d−1} A^k Q η_{d−1−k} = [Q, AQ, . . . , A^{d−1}Q] (η_{d−1}, η_{d−2}, . . . , η0)^T.
Since the matrix is full rank, the span of its columns is ℝ^d, which means precisely
that there is a linear combination of these columns that is equal to any given vector
in ℝ^d.
The proof of part (b) of the theorem is a bit too involved for this course.
10.4 Extended Kalman Filter

The Kalman filter is often used for nonlinear systems. The idea is that if the system
is almost linear over a few steps, then one may be able to use the Kalman filter
locally and change the matrices A and C as the estimate of the state changes.
The model is as follows:

X(n + 1) = f(X(n)) + V(n)
Y(n) = g(X(n)) + W(n),

where f(·) and g(·) are differentiable functions and

[An]ij = (∂/∂xj) fi(X̂(n)) and [Cn]ij = (∂/∂xj) gi(X̂(n)).
Thus, the idea is to linearize the system around the estimated state value and then
apply the usual Kalman filter.
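A minimal sketch of one such step, assuming Python with NumPy, the model X(n + 1) = f(X(n)) + V(n), Y(n) = g(X(n)) + W(n), and user-supplied Jacobians; this is only an illustration of the linearize-then-update idea, not the exact implementation used for the figures.

import numpy as np

def ekf_step(x_hat, Sigma, y, f, g, jac_f, jac_g, Sigma_V, Sigma_W):
    """One extended-Kalman-filter step: linearize at the estimate, then update."""
    A = jac_f(x_hat)                       # A_n: Jacobian of f at the estimate
    x_pred = f(x_hat)                      # predicted state
    C = jac_g(x_pred)                      # C_n: Jacobian of g at the prediction
    S = A @ Sigma @ A.T + Sigma_V          # predicted error covariance
    K = S @ C.T @ np.linalg.inv(C @ S @ C.T + Sigma_W)   # gain
    x_new = x_pred + K @ (y - g(x_pred))   # correct with the innovation
    Sigma_new = (np.eye(len(x_new)) - K @ C) @ S
    return x_new, Sigma_new

# Tiny illustration with a 1-D nonlinear system x' = sin(x) + noise, y = x^3 + noise.
f = np.sin
g = lambda x: x**3
x, S = ekf_step(np.array([0.1]), np.eye(1), np.array([0.2]), f, g,
                lambda x: np.array([[np.cos(x[0])]]),
                lambda x: np.array([[3 * x[0]**2]]),
                0.01 * np.eye(1), 0.1 * np.eye(1))
print(x, S)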
Note that we are now in the realm of heuristics and that very little can be said
about the properties of this filter. Experiments show that it works well when the
nonlinearities are small, whatever this means precisely, but that it may fail miserably
in other conditions.
10.4.1 Examples
Tracking a Vehicle
In this example, borrowed from “Eric Feron, Notes for AE6531, Georgia Tech.”, the
goal is to track a vehicle that moves in the plane by using noisy measurements of
distances to 9 points pi ∈ ℝ². Let p(n) ∈ ℝ² be the position of the vehicle and
u(n) ∈ ℝ² be its velocity at time n ≥ 0.
We assume that the velocity changes according to a known rule, except for some
random perturbation. Specifically, we assume that
where the w(n) are i.i.d. N(0, I). The measurements are
Fig. 10.2 The Extended Kalman Filter for the system (10.38)–(10.39)
and
y = RT (CA + CB + CC ).
As shown in the top part of Fig. 10.4, this filter does not track the concentrations
correctly. In fact, some concentrations that the filter estimates are negative!
Fig. 10.4 The top two graphs show that the extended Kalman filter does not track the concentra-
tions correctly. The bottom two graphs show convergence after modifying the equations
The bottom graphs show that the filter tracks the concentrations correctly after
modifying the equations and replacing negative estimates by 0.
The point of this example is that the extended Kalman filter is not guaranteed to
converge and that, sometimes, a simple modification makes it converge.
10.5 Summary
• Updating LLSE;
• Derivation of Kalman Filter;
• Observability and Reachability;
• Extended Kalman Filter.
10.6 References
The book Goodwin and Sin (2009) surveys filtering and applications to control. The
textbook Kumar and Varaiya (1986) is a comprehensive yet accessible presentation
of control theory, filtering, and adaptive control. It is available online.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Speech Recognition: A
11
A hidden Markov chain is a Markov chain together with a state observation model.
The Markov chain is {X(n), n ≥ 0} and it has its transition matrix P on the state
space X and its initial distribution π0 . The state observation model specifies that
when the state of the Markov chain is x, one observes a value y with probability
Q(x, y), for y ∈ Y . More precisely, here is the definition (Fig. 11.2).
In the speech recognition application, the Xn are “parts of speech,” i.e., segments
of sentences, and the Yn are sounds. The structure of the language determines
relationships between the Xn that can be approximated by a Markov chain. The
relationship between Xn and Yn is speaker-dependent.
The recognition problem is the following. Assume that you have observed that
Yn := (Y0 , . . . , Yn ) = yn := (y0 , . . . , yn ). What is the most likely sequence Xn :=
(X0 , . . . , Xn )? That is, in the terminology of Chap. 7, we want to compute
MAP[Xn | Yn = yn],

i.e., the value xn of Xn that maximizes

P[Xn = xn | Yn = yn].
Note that

P[Xn = xn | Yn = yn] = P(Xn = xn, Yn = yn) / P(Yn = yn).
The MAP is the value of xn that maximizes the numerator. Now, by (11.1), the
logarithm of the numerator is equal to

log(π0(x0)Q(x0, y0)) + Σ_{m=1}^{n} log(P(xm−1, xm)Q(xm, ym)).
Define

d(x0) := −log(π0(x0)Q(x0, y0)) and dm(xm−1, xm) := −log(P(xm−1, xm)Q(xm, ym)).

Maximizing the numerator is then equivalent to minimizing

d(x0) + Σ_{m=1}^{n} dm(xm−1, xm).        (11.2)
The expression (11.2) can be viewed as the length for a path in the graph shown
in Fig. 11.3. Finding the MAP is then equivalent to solving a shortest path problem.
There are a few standard algorithms for solving such problems. We describe the
Bellman–Ford Algorithm due to Bellman (Fig. 11.4) and Ford.
For m = 0, . . . , n and x ∈ X, let Vm(x) be the length of the shortest path from
X(m) = x to the column X(n) in the graph. Also, let Vn(x) = 0 for all x ∈ X.
Then, one has

Vm(x) = min_{x′ ∈ X} {dm+1(x, x′) + Vm+1(x′)}, x ∈ X, m = 0, . . . , n − 1.        (11.3)
Finally, let

V := min_{x0} {d(x0) + V0(x0)};

the MAP sequence is obtained by choosing the x0 that achieves this minimum and then following the minimizing choices in (11.3).
Equations (11.3) are the Bellman–Ford Equations. They are a particular version
of Dynamic Programming Equations (DPE) for the shortest path problem.
Note that the essential idea was to define the length of the shortest remaining path
starting from every node in the graph and to write recursive expressions for those
quantities. Thus, one solves the DPE backwards and then one finds the shortest path
forward. This application of the shortest path algorithm for finding a MAP is called
the Viterbi Algorithm due to Andrew Viterbi (Fig. 11.5).
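A minimal sketch of this computation, assuming Python with NumPy and a hidden Markov chain specified by (π0, P, Q); it solves the backward recursion (11.3) and then reads off the minimizing path forward. The example transition and observation matrices are chosen only for illustration.

import numpy as np

def viterbi(pi0, P, Q, y):
    """MAP state sequence for observations y, via the Bellman-Ford recursion (11.3)."""
    n, K = len(y), len(pi0)
    d0 = -np.log(pi0 * Q[:, y[0]])            # d(x0): negative log-likelihood
    V = np.zeros((n, K))                      # V[m, x]: shortest remaining length
    choice = np.zeros((n, K), dtype=int)
    for m in range(n - 2, -1, -1):            # backward pass
        for x in range(K):
            lengths = -np.log(P[x, :] * Q[:, y[m + 1]]) + V[m + 1, :]
            choice[m, x] = np.argmin(lengths)
            V[m, x] = lengths[choice[m, x]]
    path = [int(np.argmin(d0 + V[0, :]))]     # best initial state
    for m in range(n - 1):                    # forward pass: follow the choices
        path.append(int(choice[m, path[-1]]))
    return path

# Example: two states observed through a noisy channel.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
Q = np.array([[0.8, 0.2], [0.2, 0.8]])
print(viterbi(np.array([0.5, 0.5]), P, Q, [0, 0, 1, 1, 1]))   # typically [0, 0, 1, 1, 1]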
You look at a set of N exam results {X(1), . . . , X(N)} in your probability course
and you must decide who are the A and the B students. To study this problem, we
assume that the results of A students are i.i.d. N (a, σ 2 ) and those of B students are
N (b, σ 2 ) where a > b.
For simplicity, assume that we know σ 2 and that each student has probability 0.5
of being an A student. However, we do not know the parameters (a, b).
(The same method applies when one does not know the variances of the scores
of A and B students, nor the prior probability that a student is of type A.)
One heuristic is as follows (see Fig. 11.6). Start with a guess (a1 , b1 ) for (a, b).
Student n with score X(n) is more likely to be of type A if X(n) > (a1 + b1 )/2.
Let us declare that such students are of type A and the others are of type B. Let then
a2 be the average score of the students declared to be of type A and b2 that of the
other students. We repeat the procedure after replacing (a1 , b1 ) by (a2 , b2 ) and we
keep doing this until the values seem to converge. This heuristic is called the hard
expectation maximization algorithm.
A slightly different heuristic is as follows (see Fig. 11.7). Again, we start with a
guess (a1 , b1 ).
Using Bayes’ rule, we calculate the probability p(n) that student n with score
X(n) is of type A. We then calculate
a2 = Σn X(n)p(n) / Σn p(n) and b2 = Σn X(n)(1 − p(n)) / Σn (1 − p(n)).
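A minimal sketch of this soft heuristic, assuming Python with NumPy, a known σ, equal prior probabilities for the two types, and illustrative values for the true (a, b):

import numpy as np

rng = np.random.default_rng(4)
sigma = 5.0
true_a, true_b = 80.0, 60.0
types = rng.random(500) < 0.5                       # True = type A
X = np.where(types, true_a, true_b) + sigma * rng.normal(size=500)

a, b = 75.0, 65.0                                   # initial guess (a1, b1)
for _ in range(50):
    # Bayes' rule gives p(n) = P[type A | X(n)] under the current guess.
    la = np.exp(-(X - a) ** 2 / (2 * sigma**2))
    lb = np.exp(-(X - b) ** 2 / (2 * sigma**2))
    p = la / (la + lb)
    # Weighted averages, as in the update for (a2, b2).
    a = np.sum(X * p) / np.sum(p)
    b = np.sum(X * (1 - p)) / np.sum(1 - p)

print(a, b)   # close to (80, 60)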
In the previous example, one attempts to estimate some parameter θ = (a, b) based
on some observations X = (X1 , . . . , XN ). Let Z = (Z1 , . . . , ZN ) where Zn = A if
student n is of type A and Zn = B otherwise.
We would like to maximize f[x|θ] over θ, to find MLE[θ|X = x]. One has

f[x|θ] = Σz f[x|z, θ]P[z|θ],
where the sum is over the 2N possible values of Z. This is computationally too
difficult.
The hard EM algorithm replaces this sum by

f[x|z∗, θ]P[z∗|θ],

where z∗ is the most likely value of Z given the observations and a current guess for
θ. That is, if the current guess is θk, then

z∗ = arg maxz P[z | x, θk].

The (soft) EM algorithm instead replaces log(f[x|θ]) by

Σz log(f[x|z, θ])P[z | x, θk],

and the new guess θk+1 is the maximizer of that expression over θ. Thus, it replaces
the distribution of Z by the conditional distribution given the current guess and the
observations.
If this heuristic did not work in practice, nobody would mention it. Surprisingly, it
seems to work for some classes of problems. There is some theoretical justification
for the heuristic. One can show that it converges to a local maximum of f [x|θ ].
Generally, this is little comfort because most problems have many local maxima.
See Roche (2012).
Consider once again a hidden Markov chain model but assume that (π, P , Q) are
functions of some parameter θ that we wish to estimate. We write this explicitly as
(πθ , Pθ , Qθ ). We are interested in the value of θ that makes the observed sequence
yn most likely.
Recall that the MLE of θ given that Yn = yn is defined as

MLE[θ | Yn = yn] := arg maxθ P[Yn = yn | θ].
11.4.1 HEM
One starts with a guess θ0 and computes

P[Xn = xn∗ | Yn, θ0],

where

xn∗ := MAP[Xn | Yn = yn, θ0].

Recall that one can find xn∗ by using Viterbi's algorithm. Also,

P[Xn = xn, Yn = yn | θ] = πθ(x0)Qθ(x0, y0)Pθ(x0, x1)Qθ(x1, y1) × · · · × Pθ(xn−1, xn)Qθ(xn, yn).
11.5 Summary
11.6 References
The text Wainwright and Jordan (2008) is a great presentation of graphical models. It
covers expectation maximization and many other useful techniques.
11.7 Problems
Problem 11.1 Let (Xn , Yn ) be a hidden Markov chain. Let Y n = (Y0 , . . . , Yn ) and
Xn = (X0 , . . . , Xn ). The Viterbi algorithm computes
MLE[Y n |Xn ];
MLE[Xn |Y n ];
MAP [Y n |Xn ];
MAP [Xn |Y n ].
Problem 11.2 Assume that the Markov chain Xn is such that X = {a, b}, π0(a) =
π0(b) = 0.5, P(x, x′) = α for x′ ≠ x, and P(x, x) = 1 − α. Assume also
that Xn is observed through a BSC with error probability ε, as shown in Fig. 11.9.
Implement the Viterbi algorithm and evaluate its performance.
Problem 11.3 Suppose that the grades of students in a class are distributed as a
mixture of two Gaussian distribution, N(μ1 , σ12 ) with probability p and N(μ2 , σ22 )
with probability 1 − p. All the parameters θ = (μ1 , σ1 , μ2 , σ2 , p) are unknown.
(a) You observe n i.i.d. samples, y1 , . . . , yn drawn from the mixed distribution. Find
f (y1 , . . . , yn |θ ).
(b) Let the type random variable Xi be 0 if Yi ∼ N(μ1 , σ12 ) and 1 if Yi ∼
N(μ2 , σ22 ). Find MAP [Xi |Yi , θ ].
(c) Implement Hard EM algorithm to approximately find MLE[θ |Y1 , . . . , Yn ]. To
this end, use MATLAB to generate 1000 data points (y1 , . . . , y1000 ), according
to θ = (10, 4, 30, 6, 0.4). Use your data to estimate θ . How well is your
algorithm working?
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Speech Recognition: B
12
This section explains the stochastic gradient descent algorithm, which is a technique
used in many learning schemes.
Recall that a linear regression finds the parameters a and b that minimize the
error

Σ_{k=1}^{K} (Xk − a − bYk)²,
where the (Xk , Yk ) are observed samples that are i.i.d. with some unknown
distribution fX,Y (x, y).
Assume that, instead of calculating the linear regression based on K samples, we
keep updating the parameters (a, b) every time we observe a new sample.
Our goal is to find a and b that minimize
E((X − a − bY)²) = E(X²) + a² + b²E(Y²) − 2aE(X) − 2bE(XY) + 2abE(Y) =: h(a, b).
One idea is to use a gradient descent algorithm to minimize h(a, b). Say that at
step k of the algorithm, one has calculated (a(k), b(k)). The gradient algorithm
would update (a(k), b(k)) in the direction opposite of the gradient, to make
h(a(k), b(k)) decrease. That is, the algorithm would compute
a(k + 1) = a(k) − α (∂/∂a) h(a(k), b(k)),
b(k + 1) = b(k) − α (∂/∂b) h(a(k), b(k)),
where α is a small positive number that controls the step size. Thus,

a(k + 1) = a(k) + 2α[E(X) − a(k) − b(k)E(Y)],
b(k + 1) = b(k) + 2α[E(XY) − a(k)E(Y) − b(k)E(Y²)].
However, we do not know the distributions and cannot compute the expected
values. Instead, we replace the mean values by the values of the new samples. That
is, we compute

a(k + 1) = a(k) + 2α[Xk+1 − a(k) − b(k)Yk+1],
b(k + 1) = b(k) + 2α[Xk+1 − a(k) − b(k)Yk+1]Yk+1.
That is, instead of using the gradient algorithm we use a stochastic gradient
algorithm where the gradient is replaced by a noisy version. The intuition is that,
if the step size is small, the errors between the true gradient and its noisy version
average out.
The top part of Fig. 12.1 shows the updates of this algorithm for the example
(9.4) with α = 0.002, E(X2 ) = 1, and E(Z 2 ) = 0.3. In this example, we know that
the LLSE is

L[X|Y] = a + bY = (1/1.3)Y ≈ 0.77Y.
The figure shows that (ak , bk ) approaches (0, 0.77).
The bottom part of Fig. 12.1 shows the coefficients for (9.5) with γ = 0.05, α =
1, and β = 6. We see that (ak , bk ) approaches (−1, 7), which are the values for the
LLSE.
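A minimal sketch of this online procedure, assuming Python with NumPy and that example (9.4) is the model Y = X + Z with E(X²) = 1 and E(Z²) = 0.3 (an assumption consistent with the stated LLSE):

import numpy as np

rng = np.random.default_rng(5)
alpha = 0.002
a, b = 0.0, 0.0

for k in range(200_000):
    X = rng.normal(0, 1.0)
    Z = rng.normal(0, np.sqrt(0.3))
    Y = X + Z
    err = X - a - b * Y                 # noisy estimate of the residual
    a += 2 * alpha * err                # stochastic gradient step in a
    b += 2 * alpha * err * Y            # stochastic gradient step in b

print(a, b)   # (a, b) approaches (0, 0.77)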
In this section, we explain the theory of the stochastic gradient algorithm that
we illustrated in the case of online regression. We start with a discussion of the
deterministic gradient projection algorithm.
Consider a smooth convex function on a convex set, such as a soup bowl. A
standard algorithm to minimize that function, i.e., to find the bottom of the bowl,
is the gradient projection algorithm. This algorithm is similar to going downhill by
making smaller and smaller jumps along the steepest slope. The projection makes
sure that one remains in the acceptable set. The step size of the algorithm decreases
over time so that one does not keep on overshooting the minimum.
The stochastic gradient projection algorithm is similar except that one has access
only to a noisy version of the gradient. As the step size gets small, the errors in
the gradient tend to average out and the algorithm converges to the minimum of the
function.
We first review the gradient projection algorithm and then discuss the stochastic
gradient projection algorithm.
Recall that a set C ⊂ ℝ^d is convex if λx + (1 − λ)y ∈ C whenever x, y ∈ C and λ ∈ [0, 1]. That is, C contains the line segment between any two of its points, so that there
are no holes or kinks in the set boundary (Fig. 12.2).
Also (see Fig. 12.3), recall that a function f : C → ℝ is a convex function if

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all x, y ∈ C and λ ∈ [0, 1].
The gradient projection (GP) algorithm computes

xn+1 = [xn − αn∇f(xn)]C.

Here,

∇f(x) := ((∂/∂x1)f(x), . . . , (∂/∂xd)f(x))

is the gradient of f(·) at x and [y]C indicates the closest point to y in C, also called
the projection of y onto C. The constants αn > 0 are called the step sizes of the
algorithm.
As a simple example, let f (x) = 6(x − 0.2)2 for x ∈ C := [0, 1]. The factor 6 is
there only to have big steps initially and show the necessity of projecting back into
the convex set. With αn = 1/n and x0 = 0, the algorithm is
xn+1 = [xn − (12/n)(xn − 0.2)]C.        (12.3)

Equivalently,

yn+1 = xn − (12/n)(xn − 0.2)        (12.4)
xn+1 = max{0, min{1, yn+1}}        (12.5)
with y0 = x0 .
As the Fig. 12.4 shows, when the step size is large, the update yn+1 falls outside
the set C and it is projected back into that set. Eventually, the updates fall into the
set C .
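A minimal sketch of the recursion (12.3)–(12.5), assuming Python:

# Gradient projection for f(x) = 6(x - 0.2)^2 on C = [0, 1], with step 1/n.
x = 0.0
for n in range(1, 26):
    y = x - (12.0 / n) * (x - 0.2)      # gradient step (12.4); f'(x) = 12(x - 0.2)
    x = max(0.0, min(1.0, y))           # projection onto [0, 1], as in (12.5)
    print(n, round(x, 4))
# Early iterates overshoot and are projected back; x then converges to 0.2.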
There are many known sufficient conditions that guarantee that the algorithm
converges to the unique minimizer of f (·) on C . Here is an example.
Theorem 12.1 Assume that f(x) is convex and differentiable on the convex set C,
that ||∇f(x)||² ≤ K for all x ∈ C, and that x∗ is the unique minimizer of f over C. Assume also that the step sizes satisfy αn > 0, Σn αn = ∞, and Σn α²n < ∞. Then

xn → x∗ as n → ∞.
Proof The idea of the proof is as follows. Let dn = ½||xn − x∗||². Fix ε > 0. One
shows that there is some n0(ε) so that, when n ≥ n0(ε),

dn+1 ≤ dn − γn, if dn ≥ ε        (12.9)
dn+1 ≤ 2ε, if dn < ε.        (12.10)

Moreover, in (12.9), γn > 0 and Σn γn = ∞.
It follows from (12.9) that, eventually, for some n = n1(ε) ≥ n0(ε), one has
dn < ε. But then, because of (12.9) and (12.10), dn < 2ε for all n ≥ n1(ε). Since
ε > 0 is arbitrary, this proves that xn → x∗.
To show (12.9) and (12.10), we first claim that

dn+1 ≤ dn + αn(x∗ − xn)^T ∇f(xn) + ½ α²n K.        (12.11)

To see this, note that

dn+1 = ½||[xn − αn∇f(xn)]C − x∗||²
     ≤ ½||xn − αn∇f(xn) − x∗||²        (12.12)
     ≤ dn + αn(x∗ − xn)^T ∇f(xn) + ½ α²n K.        (12.13)

The inequality in (12.12) comes from the fact that projection on a convex set is
non-expansive. That is,

||[x]C − [y]C|| ≤ ||x − y||.
Next, one uses the convexity of f(·): if dn ≥ ε, so that xn is bounded away from x∗, then

(x∗ − xn)^T ∇f(xn) ≤ −δ(ε)

for some δ(ε) > 0.
Combining this with (12.11) shows that, when dn ≥ ε,

dn+1 ≤ dn − αnδ(ε) + ½ α²n K.

Now, let

γn = αnδ(ε) − ½ α²n K.        (12.15)
There are many situations where one cannot measure directly the gradient ∇f (xn )
of the function. Instead, one has access to a random estimate of that gradient,
∇f (xn ) + ηn , where ηn is a random variable. One hopes that, if the error ηn is small
enough, GP still converges to x ∗ when one uses ∇f (xn ) + ηn instead of ∇f (xn ).
The point of this section is to justify this hope.
The algorithm is as follows (see Fig. 12.7):

xn+1 = [xn − αn gn]C,        (12.16)

where

gn = ∇f(xn) + zn + bn        (12.17)

is a noisy estimate of the gradient, with zn a zero-mean noise and bn a bias.
In this expression, the zn are i.i.d. U [−0.5, 0.5]. Figure 12.8 shows the values that
the algorithm produces.
This algorithm converges to the minimum x ∗ = 0.2 of the function, albeit slowly.
For the algorithm (12.16) and (12.17) to converge, one needs the estimation noise
zn and bias bn to be small. Specifically, one has the following result.

Theorem 12.2 Assume, in addition to the conditions of Theorem 12.1, that the bias terms satisfy bn → 0 and that the noise terms satisfy

E[zn+1 | z0, z1, . . . , zn] = 0;        (12.23)
E(||zn||²) ≤ A, n ≥ 0.        (12.24)

Then xn → x∗ as n → ∞.
Proof The proof is essentially the same as for the deterministic case.
The inequality (12.11) becomes

dn+1 ≤ dn + αn(x∗ − xn)^T [∇f(xn) + zn + bn] + ½ α²n K.        (12.25)
We discuss the theory of martingales in Sect. 15.9. Here are the ideas we needed in
the proof of Theorem 12.2.
Let {xn, yn, n ≥ 0} be random variables such that E(xn) is well-defined for all
n. The sequence xn is said to be a martingale with respect to {(xm, ym), m ≥ 0} if

E[xn+1 | xm, ym, m ≤ n] = xn, for all n ≥ 0.
The web makes it easy to collect a vast amount of data from many sources. Examples
include books, movie, and restaurants that people like, website that they visit, their
mobility patterns, their medical history, and measurements from sensors. This data
2 See the next section.
3 Recall that if a series Σn wn converges, then the tail Σ_{m≥n} wm of the series converges to zero as
n → ∞.
can be useful to recommend items that people will probably like, treatments that
are likely to be effective, people you might want to commute with, to discover
who talks to who, efficient management techniques, and so on. Moreover, new
technologies for storage, databases, and cloud computing make it possible to process
huge amounts of data. This section explains a few of the formulations of such
problems and algorithms to solve them (Fig. 12.9).
Many factors potentially affect an outcome, but what are the most relevant ones?
For instance, the success in college of a student is correlated with her high-school
GPA, her scores in advanced placement courses and standardized tests. How does
one discover the factors that best predict her success? A similar situation occurs for
predicting the odds of getting a particular disease, the likelihood of success of a
medical treatment, and many other applications.
Identifying these important factors can be most useful to improve outcomes. For
instance, if one discovers that the odds of success in college are most affected by the
number of books that a student has to read in high-school and by the number of hours
she spends playing computer games, then one may be able to suggest strategies for
improving the odds of success.
One formulation of the problem is that the outcome Y is correlated with a
collection of factors that we represent by a vector X with N ≫ 1 components.
For instance, if Y is the GPA after 4 years in college, the first component X1 of
X might indicate the high-school GPA, the second component X2 the score on a
specific standardized test, X3 the number of books the student had to write reports
on, and so on. Intuition suggests that, although N ≫ 1, only relatively few of the
components of X really affect the outcome Y in a significant way. However, we do
not want to presume that we know what these components are.
Say that you want to predict Y on the basis of six components of X. Which
ones should you consider? This problem turns out to be hard because there are
many (about N^6/6!) subsets with 6 elements in N = {1, 2, . . . , N}, and this
combinatorial aspect of the problem makes it intractable when N is large. To
228 12 Speech Recognition: B
make progress, we change the formulation slightly and resort to some heuristic
(Fig. 12.10).4
The change in formulation is to consider the problem of minimizing

J(b) = E((Y − Σn bnXn)²) + λ Σn |bn|.
This is called the LASSO problem, for “least absolute shrinkage and selection
operator.” Thus, the hard constraint on the number of components is replaced by a
cost for using large coefficients. Intuitively, the problem is still qualitatively similar.
Also, the constraint is such that the solution of the problem has many bn equal to
zero. Intuitively, if a component is less useful than others, its coefficient is probably
equal to zero in the solution.
One interpretation of this problem is as follows. In order to simplify the algebra,
we assume that Y and X are zero-mean. Assume that

Y = Σn BnXn + Z,
where Z is N (0, σ 2 ) and the coefficients Bn are random and independent with a
prior distribution of Bn given by
fn(b) = (λ/2) exp{−λ|b|}.

Then maximizing the posterior density of the coefficients given the observed data amounts to minimizing a cost of the form of J(b), so the LASSO solution can be viewed as a MAP estimate under this prior.
4 If
you cannot crack a nut, look for another one. (A difference between Engineering and
Mathematics?)
To get some intuition, consider first estimating Y from a single component Xn. One has

L[Y|Xn] = (cov(Y, Xn)/var(Xn)) Xn =: bnXn
and

E((Y − L[Y|Xn])²) = var(Y) − cov(Y, Xn)²/var(Xn) = var(Y) − |cov(Y, Xn)| × |bn|.
Thus, one unit of “cost” C(bn ) = |bn | invested in bn brings a reduction |cov(Y, Xn )|
in the objective J (bn ). It then makes sense to choose the first component with the
largest value of “reward per unit cost” |cov(Y, Xn )|. Say that this component is X1
and let Ŷ1 = L[Y |X1 ].
Second, assume that we stick to our choice of X1 with coefficient b1 and that we
look for a second component Xn with n = 1 to add to our estimate. Note that
E((Y − b1X1 − bnXn)²) = E((Y − b1X1)²) − 2bn cov(Y − b1X1, Xn) + b²n var(Xn).
The value of bn that minimizes this expression is

bn = cov(Y − b1X1, Xn)/var(Xn),

and the resulting mean squared error is

E((Y − b1X1)²) − cov(Y − b1X1, Xn)²/var(Xn).
Thus, as before, one unit of cost |bn| invested in Xn brings a reduction |cov(Y − b1X1, Xn)|
in the cost J (b1 , bn ). This suggests that the second component Xn to pick should be
the one with the largest covariance with Y − b1 X1 .
These observations suggest the following algorithm, called the stepwise regres-
sion algorithm. At each step k, the algorithm finds the component Xn that is
most correlated with the residual error Y − Ŷk , where Ŷk is the current estimate.
Specifically, the algorithm is as follows:

Step k + 1: Find n ∉ Sk that maximizes |E((Y − Ŷk)Xn)|;
let Sk+1 = Sk ∪ {n}, Ŷk+1 = L[Y | Xn, n ∈ Sk+1], and k = k + 1.
In practice, one does not know the joint distribution of (Y, X). Instead, one observes M samples {(Y^m, X^m), m = 1, . . . , M}; for instance, each sample may correspond to one
student in the college success example. From those samples, one can estimate the
mean values by the sample means. Thus, in step k, one has calculated coefficients
(b1, . . . , bk) and one estimates E((Y − Ŷk)Xn) by

(1/M) Σ_{m=1}^{M} (Y^m − Ŷ^m_k) X^m_n.
Reasonably accurate estimates can typically be obtained from a few thousand samples. Recall that one can use the sample moments to
compute confidence intervals for these estimates.
Signal processing uses a similar algorithm called matching pursuit introduced in
Mallat and Zhang (1993). In that context, the problem is to find a compact represen-
tation of a signal, such as a picture or a sound. One considers a representation of the
signal as a linear combination of basis functions. The matching pursuit algorithm
finds the most important basis functions to use in the representation.
An Example
Our example is very small, so that we can understand the steps. We assume that all
the random variables are zero-mean and that N = 3, with Z := (Y, X1, X2, X3)^T and

ΣZ = [ 4 3 2 2
       3 4 2 2
       2 2 4 1
       2 2 1 4 ].
We first try the stepwise regression. The component Xn most correlated with Y
is X1 . Thus,
Ŷ1 = L[Y|X1] = (cov(Y, X1)/var(X1)) X1 = (3/4) X1 =: b1X1.
The next step is to compute the correlations E(Xn(Y − Ŷ1)) for n = 2, 3. We find
that both are equal to 0.5, so either component can be added; say the algorithm selects X2 as the next
component. One then finds

Ŷ2 = L[Y|X1, X2] = [3  2] [4 2; 2 4]^{−1} [X1; X2] = (2/3) X1 + (1/6) X2.
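These two steps can be reproduced directly from ΣZ. A minimal sketch, assuming Python with NumPy and the ordering Z = (Y, X1, X2, X3):

import numpy as np

# Covariance of Z = (Y, X1, X2, X3), as in the example.
Sigma = np.array([[4., 3., 2., 2.],
                  [3., 4., 2., 2.],
                  [2., 2., 4., 1.],
                  [2., 2., 1., 4.]])
cov_Y_X = Sigma[0, 1:]        # cov(Y, Xn)
Sigma_X = Sigma[1:, 1:]       # covariance of (X1, X2, X3)

selected = []
for _ in range(2):            # two steps of stepwise regression
    resid_corr = cov_Y_X.copy()
    if selected:
        # cov(Y - Y_hat, Xn) = cov(Y, Xn) - cov(Y, X_S) Sigma_S^{-1} cov(X_S, Xn)
        coef = np.linalg.solve(Sigma_X[np.ix_(selected, selected)], cov_Y_X[selected])
        resid_corr = cov_Y_X - Sigma_X[:, selected] @ coef
    resid_corr[selected] = 0.0
    n = int(np.argmax(np.abs(resid_corr)))
    selected.append(n)
    print("selected X%d, residual correlations:" % (n + 1), resid_corr)

coef = np.linalg.solve(Sigma_X[np.ix_(selected, selected)], cov_Y_X[selected])
print("coefficients:", coef, "for", ["X%d" % (n + 1) for n in selected])   # [2/3, 1/6]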
Complex looking objects may have a simple hidden structure. For example, the
signal s(t) shown in Fig. 12.11 is the sum of three sine waves. That is,
s(t) = Σ_{i=1}^{3} bi sin(2πφit), t ≥ 0.        (12.27)
A classical result, called the Nyquist sampling theorem, states that one can
reconstruct a signal exactly from its values measured every T seconds, provided that
1/T is at least twice the largest frequency in the signal. According to that result, we
could reconstruct s(t) by specifying its value every T seconds if T < 1/(2φi ) for
i = 1, 2, 3. However, in the case of (12.27), one can describe s(t) completely by
specifying the values of the six parameters {bi , φi , i = 1, 2, 3}. Also, it seems clear
in this particular case that one does not need to know many sample values s(tk )
for different times tk to be able to reconstruct the six parameters and therefore the
signal s(t) for all t ≥ 0. Moreover, one expects the reconstruction to be unique if
we choose a few sampling times tk randomly. The same is true if the representation
is in terms of different functions, such as polynomials or wavelets.
This example suggests that if a signal has a simple representation in terms of
some basis functions (e.g., sine waves), then it is possible to reconstruct it exactly
from a small number of samples.
Computing the parameters of (12.27) from a number of samples s(tk ) is highly
nontrivial, so that the fact that it is possible does not seem very useful. However, a
slightly different perspective shows that the problem can be solved. Assume that we
have a collection of functions (Fig. 12.12)

gn(t) = sin(2πfnt), n = 1, . . . , N.

Assume also that the frequencies {φ1, φ2, φ3} in s(t) are in the collection {fn, n =
1, . . . , N}. We can then try to find the vector a = {an, n = 1, . . . , N} with the fewest nonzero components such that

s(tk) = Σ_{n=1}^{N} an gn(tk), for k = 1, . . . , K.
That is, one tries to find the most economical representation of s(t) as a linear
combination of functions in the collection.
Unfortunately, this problem is intractable because of the number of choices of
sets of nonzero coefficients an , a difficulty we already faced in the previous section.
The key trick is, as before, to convert the problem into a much easier one that retains
the main goal.
The new problem is as follows:
Minimize |an |
n
such that s(tk ) = an gn (tk ), for k = 1, . . . , K.
n
(12.28)
Theorem 12.4 (Exact Recovery from Random Samples) The signal s(t) can be
recovered exactly with a very high probability from K samples by solving (12.28) if
K ≥ C × B × log(N ).
In this expression, C is a small constant, B is the number of sine waves that make
up s(t), and N is the number of sine waves in the collection.
Note that this is a probabilistic statement. Indeed, one could be unlucky and
choose sampling times tk , where s(tk ) = 0 (see Fig. 12.11) and these samples
would not enable the reconstruction of s(t). More generally, the samples could be
chosen so that they do not enable an exact reconstruction. The theorem says that the
probability of poor samples is very small.
Thus, in our example, where B = 3, one can expect to recover the signal s(t)
exactly from about 3 log(100) ≈ 14 samples if N ≤ 100.
Problem (12.28) is equivalent to the following linear programming problem,
which implies that it is easy to solve:

Minimize Σn bn
such that s(tk) = Σn an gn(tk), for k = 1, . . . , K,
and −bn ≤ an ≤ bn, for n = 1, . . . , N.        (12.29)

As an illustration, assume that

s(t) = g10(t) + 2g12(t) + 3g16(t),

where fn = n/10.
The frequencies of the sine waves in the collection are 0.1, 0.2, . . . , 10. Thus,
the frequencies in s(t) are contained in the collection, so that perfect reconstruction
is possible as
s(t) = Σn an gn(t)
with a10 = 1, a12 = 2, and a16 = 3, and all the other coefficients an equal to zero.
The theory tells us that reconstruction should be possible with about 14 samples. We
choose 15 sampling times tk randomly and uniformly in [0, 1]. We then ask Python
to solve (12.29). The solution is shown in Fig. 12.13.
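A minimal sketch of this reconstruction, assuming Python with NumPy and SciPy's linprog; the variables are (a, b) with |an| ≤ bn, the equality constraints force the reconstruction to match the samples, and the signal and frequencies are those of the example above.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(6)
freqs = np.arange(1, 101) / 10.0                    # collection f_n = n/10
t = rng.uniform(0, 1, 15)                           # 15 random sampling times
s = np.sin(2*np.pi*1.0*t) + 2*np.sin(2*np.pi*1.2*t) + 3*np.sin(2*np.pi*1.6*t)

N = len(freqs)
G = np.sin(2 * np.pi * np.outer(t, freqs))          # G[k, n] = g_n(t_k)

# Variables [a_1..a_N, b_1..b_N]; minimize sum(b) s.t. G a = s and -b <= a <= b.
c = np.concatenate([np.zeros(N), np.ones(N)])
A_eq = np.hstack([G, np.zeros((len(t), N))])
A_ub = np.vstack([np.hstack([np.eye(N), -np.eye(N)]),      #  a - b <= 0
                  np.hstack([-np.eye(N), -np.eye(N)])])    # -a - b <= 0
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * N),
              A_eq=A_eq, b_eq=s,
              bounds=[(None, None)] * N + [(0, None)] * N)
a = res.x[:N]
big = np.abs(a) > 1e-6
print(np.nonzero(big)[0] + 1, np.round(a[big], 3))  # in most runs: indices 10, 12, 16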
Another Example
Figure 12.14, from Candes and Romberg (2007), shows another example. The image
on top has about one million pixels. However, it can be represented as a linear
combination of 25,000 functions called wavelets. Thus, the compressed sensing
results tell us that one should be able to reconstruct the picture exactly from a small
multiple of 25,000 randomly chosen pixels. It turns out that this is indeed the case
with about 96,000 pixels.
Which movie would you like to watch? One formulation of the problem is as
follows. There is a K × N matrix Y . The entry Y (k, n) of the matrix indicates
how much user k likes movie n. However, one does not get to observe the complete
matrix. Instead, one observes a number of entries, when users actually watch movies
and one gets to record their rankings. The problem is to complete the matrix to be
able to recommend movies to users.
This matrix completion is based on the idea that the entries of the matrix are
not independent. For instance, assume that Bob and Alice have seen the same five
movies and gave them the same ranking. Assume that Bob has seen another movie
he loved. Chances are that Alice would also like it.
To formulate this dependency of the entries of the matrix Y , one observes that
even though there are thousands of movies, a few factors govern how much users
like them. Thus, it is reasonable to expect that many columns of the matrix are
combinations of a few common vectors that correspond to the hidden factors that
influence the rankings by users. Thus, a few independent vectors get combined
into linear combinations that form the columns. Consequently the matrix Y has a
small number of linearly independent columns, i.e., it is a low rank matrix.5 This
observation leads to the question of whether one can recover a low rank matrix Y
from observed entries?
One possible formulation is

Minimize rank(X)
such that X(k, n) = M(k, n), for (k, n) ∈ Ω.

Here, {M(k, n), (k, n) ∈ Ω} is the set of observed entries of the matrix. Thus, one
wishes to find the lowest-rank matrix X that is consistent with the observed entries.
As before, such a problem is hard. To simplify the problem, one replaces the rank
by the nuclear norm

||X||∗ = Σi σi,

where the σi are the singular values of the matrix X.
the number of nonzero singular values. The nuclear norm is a convex function of
the entries of the matrix, which makes the problem a convex programming problem
that is easy to solve. Remarkably, as in the case of compressed sensing, the solution
of the modified problem is very good.
Theorem 12.5 (Exact Matrix Completion from Random Entries) The solution
of the problem

Minimize ||X||∗ such that X(k, n) = M(k, n), for (k, n) ∈ Ω

is the matrix Y with a very high probability if the observed entries are chosen
uniformly at random and if there are at least

C n^{1.25} r log(n)

of them, where r is the rank of Y and n = max{K, N}.
This result is useful in many situations where this number of required observa-
tions is much smaller than K ×N, which is the number of entries of Y . The reference
contains many extensions of these results and details on numerical solutions.
Deep neural networks (DNN) are electronic processing circuits inspired by the
structure of the brain. For instance, our vision system consists of layers. The first
layer is in the retina that captures the intensity and color of zones in our field of
vision. The next layer extracts edges and motion. The brain receives these signals
and extracts higher level features. A simplistic model of this processing is that the
neurons are arranged in successive layers, where each neuron in one layer gets
inputs from neurons in the previous layer through connections called synapses.
Presumably, the weights of these connections get tuned as we grow up and learn
to perform tasks, possibly by trial and errors. The figure sketches a DNN. The
inputs at the left of the DNN are the features X from which the system produces
the probability that X corresponds to a dog, or the estimate of some quantity
(Fig. 12.15).
Each circle is a circuit that we call a neuron. In the figure, zk is the output of
neuron k. It is multiplied by θk to contribute the quantity θk zk to the total input Vl of
neuron l. The parameter θk represents
the strength of the connection between neuron
k and neuron l. Thus, Vl = Σn θnzn, where the sum is over all the neurons n of the
layer to the immediate left of neuron l, including neuron k. The output zl of neuron
l is equal to f (al , Vl ), where al is a parameter specific to that neuron and f is some
function that we discuss later.
With this structure, it is easy to compute the derivative of some output Z with
respect to some weight, say θk . We do it in the last section of this chapter.
What should be the functions f (a, V )? Inspired by the idea that a neuron fires if
it is excited enough, one may use a function f (a, V ) that is close to 1 if V > a and
close to −1 if V < a. To make the function differentiable, one may use f (a, V ) =
g(V − a) with
g(v) = 2/(1 + e^{−βv}) − 1,
where β is a positive constant. If β is large, then e−βv goes from a very large to a
very small value when v goes from negative to positive. Consequently, g(v) goes
from −1 to +1 (Fig. 12.16).
The DNN is able to model many functions by adjusting its parameters. To see
why, consider neuron l. The output of this neuron indicates whether the linear
combination Vl = Σn θnzn is larger or smaller than the thresholds al of the
neurons. Consequently, the first layer divides the set of inputs into regions separated
by hyperplanes. The next layer then further divides these regions. The number of
regions that can be obtained by this process is exponential in the number of layers.
The final layer then assigns values to the regions, thus approximating a complex
function of the input vector by an almost piecewise constant function.
The missing piece of the puzzle is that, unfortunately, the cost function is not a
nice convex function of the parameters of the DNN. Instead, it typically has many
local minima. Consequently, by using the SGD algorithm, the tuning of the DNN
may get stuck in a local minimum. Also, to reduce the number of parameters to
tune, one usually selects a few layers with fixed parameters, such as edge detectors
in vision systems. Thus, the selection of the DNN becomes somewhat of an art, like
cooking.
Thus, it remains impossible to predict whether the DNN will be a good technique
for machine learning in a specific application. The answer of the practitioners is to
try and see. If it works, they publish a paper. We are far from the proven convergence
results of adaptive systems. Ah, nostalgia. . . .
There is a worrisome aspect to these black-box approaches. When the DNN
has been tuned and seems to perform well on many trials, not only one does
not understand what it really does, but one has no guarantee that it will not
seriously misbehave for some inputs. Imagine then a killer drone with a DNN target
recognition system. . . . It is not surprising that a number of serious scientists have
raised concerns about “artificial stupidity” and the need to build safeguards into such
systems. “Open the pod bay doors, Hal.”
For instance, if neuron k feeds neuron l, which feeds neuron m, which feeds the output neuron r that produces Z, the chain rule gives

dZ/dθk = zk fV(al, Vl) θl fV(am, Vm) θm fV(ar, Vr),

where fV(a, V) := ∂f(a, V)/∂V.
The details do not matter too much. The point is that the structure of the network
makes the calculation of the derivatives straightforward.
12.5 Summary
12.6 References
Online linear regression algorithms are discussed in Strehl and Littman (2007).
The book Bertsekas and Tsitsiklis (1989) is an excellent presentation of dis-
tributed optimization algorithms. It explains the gradient projection algorithm and
distributed implementations. The LASSO algorithm and many other methods are
clearly explained in Hastie et al. (2009), together with applications. The theory of
martingales is nicely presented by its father in Doob (1953). Theorem 12.4 is from
Candes and Romberg (2007).
12.7 Problems
Problem 12.1 Let {Yn , n ≥ 1} be i.i.d. U [0, 1] random variables and {Zn , n ≥ 1}
be i.i.d. N (0, 1) random variables. Define Xn = 1{Yn ≥ a} + Zn for some constant
a. The goal of the problem is to design an algorithm that “learns” the value of a
from the observation of pairs (Xn , Yn ). We construct a model
Xn = g(Yn − θ ),
where
g(u) = 1/(1 + exp{−λu})        (12.31)
with λ = 10. Note that when u > 0, the denominator of g(u) is close to 1, so
that g(u) ≈ 1. Also, when u < 0, the denominator is large and g(u) ≈ 0. Thus,
g(u) ≈ 1{u ≥ 0}. The function g(·) is called the logistic function. Use SGD in
Python to estimate θ (Fig. 12.17).
where you choose sampling times tk independently and uniformly in [0, 1]. Assume
that the collection of sine waves has the frequencies {0.1, 0.2, . . . , 3}.
What is the minimum number of samples that you need for exact reconstruction?
Open Access This chapter is distributed under the terms of the Creative Commons Attribution
4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, dupli-
cation, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, a link is provided to the Creative
Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative
Commons license, unless indicated otherwise in the credit line; if such material is not included
in the work’s Creative Commons license and the respective action is not permitted by statutory
regulation, users will need to obtain permission from the license holder to duplicate, adapt or
reproduce the material.
Route Planning: A
13
13.1 Model
One is given a finite connected directed graph. Each edge (i, j ) is associated with a
travel time T (i, j ). The travel times are independent and have known distributions.
There are a start node s and a destination node d. The goal is to choose a fast route
from s to d. We consider a few different formulations (Fig. 13.1).
To make the situation concrete, we consider the very simple example illustrated
in Fig. 13.2.
The goal is to choose the fastest path from s to d. In this example, the possible
paths are sd, sad, and sabd. We assume that the delays T (i, j ) on the edges (i, j )
are as follows:

T(s, a) =D U[5, 13], T(a, d) = 10, T(a, b) =D U[2, 10], T(b, d) = 4, T(s, d) = 20.
Thus, the delay from s to a is uniformly distributed in [5, 13], the delay from a to
d is equal to 10, and so on. The delays are assumed to be independent, which is an
unrealistic simplification.
In this formulation, one does not observe anything and one plans the journey ahead
of time. In this case, the solution is to look at the average travel times E(T (i, j )) =
c(i, j ) and to run a shortest path algorithm.
For our example, the average delays are c(s, a) = 9, c(a, d) = 10, and so on, as
shown in the top part of Fig. 13.3.
Let V (i) be the minimum average travel time from node i to the destination d.
The Bellman–Ford Algorithm calculates these values as follows. Let Vn (i) be an
estimate of the shortest average travel time from i to d, as calculated after the n-th
iteration of the algorithm. The algorithm starts with V0 (d) = 0 and V0 (i) = ∞ for
i ≠ d. Then, the algorithm calculates

Vn+1(i) = min_{j: (i,j) is an edge} {c(i, j) + Vn(j)}, i ≠ d, Vn+1(d) = 0.        (13.1)
The interpretation is that Vn (i) is the minimum expected travel time from i to d
over all paths that go through at most n edges. The distance is infinite if no path
with at most n edges reaches the destination d. This is exactly the same algorithm
we discussed in Sect. 11.2 to develop the Viterbi algorithm.
These relations are justified by the fact that the mean value of a sum is the sum of
the mean values. For instance, say that the minimum average travel time from a to
d using a path that has at most 2 edges is V2 (a, d) and it corresponds to a path with
random travel time W2 (a, d). Then, the minimum average travel time from s to d
using a path that has at most 3 edges follows either the direct path sd, that has travel
time T (s, d), or the edge sa followed by the fastest path from a to d that uses at most
2 edges with travel time W2 (a, d). Accordingly, the minimum expected travel time
V3 (s) from s to d using at most three edges is the minimum of E(T (s, d)) = c(s, d)
and the mean value of T(s, a) + W2(a, d). Thus, V3(s) = min{ c(s, d), c(s, a) + V2(a) }. In the limit, the values V(i) satisfy
V(i) = min_j { c(i, j) + V(j) }, for i ≠ d, with V(d) = 0.   (13.2)
These are called the dynamic programming equations (DPE). Thus, (13.1) is an algorithm for solving (13.2).
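As a concrete illustration, here is a minimal Python sketch of the iteration (13.1) for the example of Fig. 13.2, using the mean delays c(i, j) listed above; the dictionary and variable names are only illustrative.

# Sketch of the value iteration (13.1) for the pre-planning formulation.
# The mean delays c(i, j) follow the example of Fig. 13.2.
import math

c = {('s', 'a'): 9, ('s', 'd'): 20, ('a', 'b'): 6, ('a', 'd'): 10, ('b', 'd'): 4}
nodes = ['s', 'a', 'b', 'd']

V = {i: (0 if i == 'd' else math.inf) for i in nodes}          # V_0
for _ in range(len(nodes) - 1):                                 # paths use at most 3 edges
    V = {i: 0 if i == 'd' else
            min((c[i, j] + V[j] for j in nodes if (i, j) in c), default=math.inf)
         for i in nodes}

print(V['s'])   # 19: the pre-planned route s-a-d with average delay 9 + 10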
13.3 Formulation 2: Adapting
We now assume that when we get to a node i, we see the actual travel times along
the edges out of i. However, we do not see beyond those edges. How should we
modify our path planning? If the travel times are in fact deterministic, then nothing
changes. However, if they are random, we may notice that the actual travel times on
some edges out of i are smaller than their mean value, whereas others may be larger.
Clearly, we should use that information.
Here is a systematic procedure for calculating the best path. Let V (i) be the
minimum average time to get to d starting from node i, for i ∈ {s, a, b, d}. We see
that V (b) = T (b, d) = 4.
To calculate V (a), define W (a) to be the minimum expected time from a to d
given the observed delays along the edges out of a. That is,
W(a) = min{ T(a, d), T(a, b) + V(b) }, and V(a) = E[W(a)].
For this example, we see that T (a, b) + V (b) =D U [6, 14]. Since T (a, d) = 10, if
T (a, b) + V (b) < 10, which occurs with probability 1/2, we choose the path abd
that has a travel time uniformly distributed in [6, 10] with a mean value 8. Also,
if T (a, b) + V (b) > 10, then we choose the travel time T (a, d) = 10, also with
probability 1/2. Thus, the minimum expected travel time V (a) from a to d is equal
to 8 with probability 1/2 and to 10 with probability 1/2, so that its average value is
8(1/2) + 10(1/2) = 9. Hence, V (a) = 9.
Similarly,
V(s) = E[ min{ T(s, d), T(s, a) + V(a) } ],
where T (s, a) + V (a) =D U [14, 22] and T (s, d) = 20. Thus, if T (s, a) + V (a) <
20, which occurs with probability (20 − 14)/(22 − 14) = 3/4, then we choose a
path that goes from s to a and has a delay that is uniformly distributed in [14, 20],
with mean value 17. If T (s, a) + V (a) > 20, which occurs with probability 1/4, we
choose the direct path sd that has delay 20. Hence V (s) = 17(3/4) + 20(1/4) =
71/4 = 17.75.
Note that by observing the delays on the next edges and making the appropriate
decisions, we reduce the expected travel time from s to d from 19 to 17.75. Not
surprisingly, more information helps. Observe also that the decisions we make
depend on the observed delays. For instance, starting in node s, we go along edge sd
if T (s, a) + V (a) > T (s, d), i.e., if T (s, a) + 9 > 20, or T (s, a) > 11. Otherwise,
we follow the edge sa.
Let us now go back to the general model. The key relationships are as follows:
V(i) = E[ min_j { T(i, j) + V(j) } ], for i ≠ d, with V(d) = 0.   (13.4)
The interpretation is simple: starting from i, one can choose to go next to j . In that
case, one faces a travel time T (i, j ) from i to j and a subsequent minimum average
time from j to d equal to V (j ). Since the path from i to d must necessarily go to a
next node j , the minimum expected travel time from i to d is given by the expression
above. As before, these equations are justified by the fact that the expected value of
a sum is the sum of the expected values.
An algorithm for solving these fixed-point equations is
V_{n+1}(i) = E[ min_j { T(i, j) + V_n(j) } ], n ≥ 0,   (13.5)
where V0 (i) = 0 for all i. The interpretation of Vn (i) is the same as before: it is the
minimum expected time from i to d using a path with at most n edges, given that at
each step along the path one observes the delays along the edges out of the current
node.
Equations (13.4) are the stochastic dynamic programming equations for the
problem. Equations (13.5) are called the value iteration equations.
13.4 Markov Decision Problem
A more general version of the path planning problem is the control of a Markov
chain. At each step, one looks at the state and one chooses an action that determines
the transition probabilities and also the cost for the next step.
More precisely, to define a controlled Markov chain X(n) on some state space
X , one specifies, for each x ∈ X , a set A(x) of possible actions. For each state
x ∈ X and each action a ∈ A(x), one has transition probabilities P(x, x′; a) ≥ 0 with Σ_{x′∈X} P(x, x′; a) = 1. One also specifies a cost c(x, a) of taking the action
a when in state x.
The sequence X(n) is then defined by
P[X(n + 1) = x′ | X(n) = x, a(n) = a, X(m), a(m), m < n] = P(x, x′; a).
The goal is to choose the actions to minimize the average total cost
E[ Σ_{m=0}^{n} c(X(m), a(m)) | X(0) = x ].   (13.6)
This minimization can be performed by the backward recursion
V_m(x) = min_{a∈A(x)} { c(x, a) + Σ_y P(x, y; a) V_{m−1}(y) }, m ≥ 1, with V_0(x) = min_{a∈A(x)} c(x, a).   (13.7)
Let a = g_m(x) be the value of a ∈ A(x) that achieves the minimum in (13.7). Then the choices a(m) = g_{n−m}(X(m)) achieve the minimum of (13.6).
The existence of the minimizing a in (13.7) is clear if X and each A(x) are finite
and also under weaker assumptions.
13.4.1 Examples
Guess a Card
Here is a simple example. One is given a perfectly shuffled deck of 52 cards. The
cards are turned over one at a time. Before one turns over a new card, you have the
option of saying “Stop.” If the next card is an ace, you win $1.00. If not, the game
stops and you lose. The problem is for you to decide when to stop (Fig. 13.4).
Assume that there are still x aces in a deck with m remaining cards. Then, if you
say stop, you win with probability x/m. If you do not say stop, then after the next
card is turned over, x − 1 aces remain with probability x/m and x remain otherwise.
Let V (m, x) be the maximum expected probability that you win if there are still
x aces in the deck with m remaining cards.
The DPE are
V(m, x) = max{ x/m, (x/m) V(m − 1, x − 1) + ((m − x)/m) V(m − 1, x) }.
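Here is a short Python sketch that solves this recursion; the base cases (no aces left, or only aces left) are the natural ones, and the printed value is the winning probability for a full deck.

# Sketch of the "guess a card" dynamic program.  V(m, x) is the maximum
# probability of winning when x aces remain among m cards.
from functools import lru_cache

@lru_cache(maxsize=None)
def V(m, x):
    if x == 0:
        return 0.0              # no ace left: you cannot win
    if x == m:
        return 1.0              # only aces remain: say "stop" and win
    stop = x / m
    cont = (x / m) * V(m - 1, x - 1) + ((m - x) / m) * V(m - 1, x)
    return max(stop, cont)

print(V(52, 4))   # 4/52, about 0.0769: every stopping rule does equally well here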
Scheduling Jobs
You have two sets of jobs to perform. Jobs of type i (for i = 1, 2) have a waiting
cost equal to ci per unit of waiting time until they are completed. Also, when you
work on a job of type i, it completes with probability μi in the next time unit,
independently of how long you have worked on it. That is, the job processing times
are geometrically distributed with parameter μi . The problem is to decide which job
to work on to minimize the total waiting cost of the jobs.
Let V (x1 , x2 ) be the minimum expected total remaining waiting cost given that
there are x1 jobs of type 1 and x2 jobs of type 2. The DPE are
where
and
As can be verified directly, the solution of the DPE is as follows. Assume that
c1 μ1 > c2 μ2 . Then
V(x1, x2) = c1 x1(x1 + 1)/(2μ1) + c2 x2(x2 + 1)/(2μ2) + c2 x1 x2/μ1.
Moreover, this minimum expected cost is achieved by performing all the jobs of type
1 first and then the jobs of type 2. This strategy is called the cμ rule. Thus, although
one might be tempted to work on the longest queue first, this is not optimal.
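Here is a quick numerical check of this claim, under the simplifying assumption that the server keeps working on a job until it completes (which costs nothing here, by memorylessness, since the expected completion time of a type-i job is 1/μi). The parameter values are illustrative.

# Numerical check of the cμ-rule formula for V(x1, x2).  All waiting jobs pay
# their holding cost during the (geometric) service of the job being worked on.
from functools import lru_cache

c1, c2, mu1, mu2 = 3.0, 1.0, 0.5, 0.7      # c1*mu1 = 1.5 > c2*mu2 = 0.7

@lru_cache(maxsize=None)
def V(x1, x2):
    if x1 == 0 and x2 == 0:
        return 0.0
    rate_cost = c1 * x1 + c2 * x2           # holding cost per unit of time
    options = []
    if x1 > 0:
        options.append(rate_cost / mu1 + V(x1 - 1, x2))
    if x2 > 0:
        options.append(rate_cost / mu2 + V(x1, x2 - 1))
    return min(options)

def formula(x1, x2):
    return (c1 * x1 * (x1 + 1) / (2 * mu1) + c2 * x2 * (x2 + 1) / (2 * mu2)
            + c2 * x1 * x2 / mu1)

print(V(5, 7), formula(5, 7))   # the two values agree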
There is a simple interchange argument to confirm the optimality of the cμ rule.
Say that you decide to work on the jobs in the following order: 1221211. Thus, you
work on a job of type 1 until it completes, then a job of type 2, then another job of
type 2, and so on. Modify the strategy as follows. Instead of working on the second
job of type 2, work on the second job of type 1, until it completes. Then work on the
second job of type 2 and continue as you would have. Thus, the processings of two
jobs have been interchanged: the second job of type 2 and the second job of type
1. Only the waiting times of these two jobs change. The waiting time of the job of
type 1 is reduced by 1/μ2 , on average, since this is the average completion time of
the job of type 2 that was previously processed before the job of type 1. Thus, the
waiting cost of the job of type 1 is reduced by c1 /μ2 . Similarly, the waiting cost of
the job of type 2 is increased by c2 /μ1 , on average. Thus, the average cost decreases
by c1 /μ2 − c2 /μ1 which is a positive amount since c1 μ1 > c2 μ2 . By induction, it
is optimal to process all the jobs of type 1 first.
Of course, there are very few examples of control problems where the optimal
policy can be proved by a simple argument. Nevertheless, keep this possibility
in mind because it can yield elegant results simply. For instance, assume that
jobs arrive at the queues shown in Fig. 13.5 according to independent Bernoulli
processes. That is, with probability λi , a job of type i arrives during each time step,
independently of the past, for i = 1, 2. The same interchange argument shows that
the cμ rule minimizes the long-term average expected waiting cost of the jobs (a
cost that we have not defined, but you may be able to imagine what it means).
This is useful because the DPE can no longer be solved explicitly and proving the
optimality of this rule analytically is quite complicated.
Hiring a Helper
Jobs arrive at random times and you must decide whether to work on them yourself
or hire some helper. Intuition suggests that you should get some help if the backlog
of jobs to be performed exceeds some threshold. We examine a model of this
situation.
At time n = 0, 1, . . ., a job arrives with probability λ ∈ (0, 1). If you work alone,
you complete a job with probability μ ∈ (0, 1) in one time unit, independently of
the past. If you hire a helper, then together you complete a job with probability
αμ ∈ (0, 1) in one unit of time, where α > 1. Let the cost at time n be c(n) = β > 0
if you hire a helper at time step n and c(n) = 0 otherwise. The goal is to minimize
E[ Σ_{n=0}^{N} (X(n) + c(n)) ],
where X(n) is the number of jobs yet to be processed at time n. This cost measures
the waiting cost of the jobs plus the cost of hiring the helper. The waiting cost is
minimized if you hire the helper all the time and the helper cost is minimized if you
never hire him. The goal of the problem is to figure out when to hire a helper to
achieve the best trade-off between these two costs.
The state of the system is X(n) at time n. Let
V_m(x) = min E[ Σ_{n=0}^{m} (X(n) + c(n)) | X(0) = x ],
where the minimum is over the possible choices of actions (hiring or not) that
depend on the state up to that time. The stochastic dynamic programming equations
are
where we defined μ(0) = μ and μ(1) = αμ and V−1 (x) = 0. Also, we limit the
backlog of jobs to K, so that if one job arrives where there are already K, we discard
the new arrival.
We solve these equations using Python. As expected, the solution shows that one
should hire a helper at time n if X(n) > γ (N − n), where γ (m) is a constant that
decreases with m. As the time to go m increases, the cost of holding extra jobs
increases and so does the incentive to hire a helper. Figure 13.6 shows the values of
γ (n) for β = 14 and β = 20. The figure corresponds to λ = 0.5, μ = 0.6, α =
1.5, K = 20, and N = 200. Not surprisingly, when the helper is more expensive,
one waits until the backlog is larger before hiring him.
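Here is a minimal sketch of the backward recursion just described. The order "departure, then arrival" within a slot is one simple convention, not necessarily the one used for Fig. 13.6; the parameter values are those stated above.

# Value iteration for the hiring problem: per-step cost x + beta*a, service
# probability mu without the helper and alpha*mu with the helper, backlog capped at K.
lam, mu, alpha, beta = 0.5, 0.6, 1.5, 14.0
K, N = 20, 200
serv = {0: mu, 1: alpha * mu}

V_prev = [0.0] * (K + 1)               # V_{-1}(x) = 0
gamma = []                             # gamma[m]: smallest backlog at which hiring is optimal
for m in range(N + 1):
    V, threshold = [0.0] * (K + 1), K + 1
    for x in range(K + 1):
        costs = []
        for a in (0, 1):
            nxt = 0.0
            outcomes = [(serv[a], x - 1), (1 - serv[a], x)] if x > 0 else [(1.0, 0)]
            for p_dep, y in outcomes:
                nxt += p_dep * (lam * V_prev[min(y + 1, K)] + (1 - lam) * V_prev[y])
            costs.append(x + beta * a + nxt)
        V[x] = min(costs)
        if costs[1] < costs[0] and x < threshold:
            threshold = x
    gamma.append(threshold)
    V_prev = V

print(gamma[-1])   # hire roughly when the backlog exceeds this level (time to go N)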
(x1 + 1)/μ1 < (x2 + 1)/μ2,
as this will minimize the expected time until you are served. However, if we consider
the problem of minimizing the total average waiting time of customers in the two
queues, we find that the optimal policy does not agree with the selfish choice of
individual customers. Figure 13.7 shows an example with μ2 < μ1 . It indicates that
under the socially optimal policy some customers should join queue 2, even though
they will then incur a longer delay than under the selfish policy.
This example corresponds to minimizing the total cost
Σ_{n=0}^{N} β^n E(X1(n) + X2(n)).
The problem of minimizing (13.6) involves a finite horizon. The problem stops at
time n. We have seen that the minimum cost to go when there are m more steps is
Vm (x) when in state x. Thus, not surprisingly, the cost to go depends on the time to
go and, consequently, the best action to choose in a given state x generally depends
on the time to go.
The problem is simpler when one considers an infinite horizon because the time
to go remains the same at each step. To make the total cost finite, one discounts the
future costs. That is, one considers the problem of minimizing the expected total
discounted cost:
∞
E β c(X(m), a(m))|X(0) = x .
m
(13.8)
m=0
In this expression, 0 < β < 1 is the discount rate. Intuitively, if β is small, then
future costs do not matter much and one tends to be short-sighted. However, if β is
close to 1, then one pays a lot of attention to the long term.
Define V (x) to be the minimum value of the cost (13.8), where the minimum is
over all the possible choices of the actions at each step. Arguing as before, one can
show that
V(x) = min_{a∈A(x)} { c(x, a) + β Σ_y P(x, y; a) V(y) }.   (13.9)
These equations are similar to (13.7), with two differences: the discount factor
and the fact that the value function does not depend on time. Note that these
equations are fixed-point equations. A standard method to solve them is to consider
the equations
V_{n+1}(x) = min_{a∈A(x)} { c(x, a) + β Σ_y P(x, y; a) V_n(y) }, n ≥ 0,   (13.10)
where one chooses V0 (x) = 0, ∀x. Note that these equations correspond to
V_n(x) = min E[ Σ_{m=0}^{n} β^m c(X(m), a(m)) | X(0) = x ].   (13.11)
One can show that the solution Vn (x) of (13.10) is such that Vn (x) → V (x) as
n → ∞, where V (x) is the solution of (13.9).
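Here is a generic Python sketch of the value iteration (13.10) for a finite state and action space; the two-state "machine repair" example used to exercise it is purely illustrative.

# Value iteration (13.10) for a discounted MDP.  P[x][a] maps next states to
# P(x, y; a); c[x][a] is the one-step cost.
def value_iteration(states, actions, P, c, beta, n_iter=500):
    V = {x: 0.0 for x in states}
    for _ in range(n_iter):
        V = {x: min(c[x][a] + beta * sum(p * V[y] for y, p in P[x][a].items())
                    for a in actions[x])
             for x in states}
    return V

states = ['good', 'bad']
actions = {'good': ['wait', 'repair'], 'bad': ['wait', 'repair']}
P = {'good': {'wait': {'good': 0.9, 'bad': 0.1}, 'repair': {'good': 1.0}},
     'bad':  {'wait': {'bad': 1.0},              'repair': {'good': 1.0}}}
c = {'good': {'wait': 0.0, 'repair': 5.0}, 'bad': {'wait': 2.0, 'repair': 5.0}}

print(value_iteration(states, actions, P, c, beta=0.9))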
13.6 Summary
13.7 References
13.8 Problems
Problem 13.1 Consider a single queue with one server in discrete time. At each
time, a new customer arrives to the queue with probability λ < 1, and if the server
works on the queue at rate μ ∈ [0, 1], it serves one customer in one unit of time
with probability μ. Due to energy constraints, you want your server to work at
the smallest rate possible without making the queue unstable. Thus, you want
your server to work at rate μ∗ = λ. Unfortunately, you do not know the value of
λ. All you can observe is the queue length. We try to design an algorithm based on
stochastic gradient to learn μ∗ in the following steps:
(a) Minimize the function V(μ) = (1/2)(λ − μ)² over μ using gradient descent.
(b) Find E[Q(n+1)−Q(n)|Q(n) = q], for some q > 0, given that server allocates
capacity μn during time slot n. Q(n) is the queue length at time n. What happens
if q = 0?
(c) Use the stochastic gradient projection algorithm and write a Python code based
on parts (a) and (b) to learn μ∗ . Note that 0 ≤ μ ≤ 1.
Hint To avoid the case when the queue length is 0, start with a large initial queue
length.
Problem 13.2 Consider a routing network with three nodes: the start node s, the
destination node d, and an intermediate node r. There is a direct path from s to d
with travel time 20. The travel time from s to r is 7. There are two paths from r to
d. They have independent travel times that are uniformly distributed between 8 and
20.
Problem 13.3 Consider a single queue in discrete time with Bernoulli arrival
process of rate λ. The queue can hold K jobs, and there is a fee γ when its
backlog reaches K. There is one server dedicated to the queue with service rate
μ(0). You can decide to allocate another server to the queue that increases the rate
to μ(1) ∈ (μ(0), 1). However, using the additional server has some cost. You want
to minimize the cost
Σ_{n=0}^{∞} β^n E( X(n) + αH(n) + γ 1{X(n) = K} ),
where H (n) is equal to one if you use an extra helper at time n and is zero otherwise.
Problem 13.4 We want to plan routing from node 1 to 5 in the graph of Fig. 13.8.
The travel times on the edges of the graph are as follows: T (1, 2) = 2, T (1, 3) ∼
U [2, 4], T (2, 4) = 1, T (2, 5) ∼ U [4, 6], T (4, 5) ∼ U [3, 5], and T (3, 5) = 4. Note
that X ∼ U [a, b] means X is a random variable uniformly distributed between a
and b.
(a) If you want to do pre-planning, which path would you choose? What is the
expected travel time?
(b) Now suppose that at each node, the travel times of two steps ahead are revealed.
Thus, at node 1 all the travel times are revealed except T (4, 5). Write the
dynamic programming equations that solve the route planning problem and
solve them. That is, let V (i) be the minimum expected travel time from i to
5, for 1 ≤ i ≤ 5. Find V(i) for 1 ≤ i ≤ 5.
Problem 13.5 Consider a factory, DilBox, that stores boxes. At the beginning of
year k, they have xk boxes in storage. Now at the end of every year k they are
mandated by contracts to provide dk boxes. However, the number of boxes dk is
unknown until the year actually ends.
At the beginning of the year, they can request uk boxes. Using very shoddy
Elbonian labor, each box costs A to produce. At the end of the year DilBox
is able to borrow yk boxes from BoxR’Us at the cost s(yk ) to meet the contract.
The boxes remaining after meeting the demand are carried over to the next year
xk+1 = xk + uk + yk − dk . Sadly, they need to pay to store the boxes at a cost given
by a function r(xk+1 ).
Now your job is to provide a box creation and storage plan for the upcoming 20
years. Your goal is to minimize the total cost for the 20 years. You can treat costs
as being paid at the end of the year and there is no inflation. Also, you get your
pension after 20 years so you do not care about costs beyond those paid in the 20th
year. (Assume you start with zero boxes; of course, it does not really matter.) Use the following data:
– r(xk ) = 5xk ;
– s(yk ) = 20yk ;
– A = 1;
– dk =D U {1, . . . , 10}.
Problem 13.6 Consider a video game duel where Bob starts at time 0 at distance
T = 10 from Alice and gets closer to her at speed 1. For instance, Alice is at location
(0, 0) in the plane and Bob starts at location (0, T ) and moves toward Alice, so that
after t seconds, Bob is at location (0, T − t). Alice has picked a random time,
uniformly distributed in [0, T ], when she will shoot Bob. If Alice shoots first, Bob
is dead. Alice never misses. [This is only a video game.]
(a) Bob has to find at what time t he should shoot Alice to maximize the probability
of killing her. If Bob shoots from a distance x, the probability that he hits (and
kills) Alice is 1/(1 + x)2 . Bob has only one bullet.
(b) What is the maximum probability that Bob wins the duel?
(c) Assume now that Bob has two bullets. You must find the times t1 and t2 when
Bob should shoot Alice to maximize the probability that he wins the duel. Again,
for each bullet that Bob shoots from distance x, the probability of success is
1/(1 + x)2 , independently for each bullet.
Problem 13.7 You play a game where you win the amount you bet with probability
p ∈ (0, 0.5) and you lose it with probability 1 − p. Your initial fortune is 16 and
you gamble a fixed amount γ at each step, where γ ∈ {1, 2, 4, 8, 16}. Find the
probability that you reach a fortune equal to 256 before you go broke. What is the
gambling amount that maximizes that probability?
Route Planning: B
14
14.1 LQG Control
Consider the linear system
X(n + 1) = a X(n) + U(n) + V(n), n ≥ 0.   (14.1)
Here, X(n) is the state, U(n) is a control value, and V(n) is the noise. We assume that the random variables V(n) are i.i.d. and N(0, σ²).
The problem is to choose, at each time n, the control value U(n) based on the observed state values up to time n to minimize the expected cost
E[ Σ_{n=0}^{N} ( X(n)² + β U(n)² ) | X(0) = x ].   (14.2)
Thus, the goal of the control is to keep the state value close to zero, and one pays a
cost for the control.
The problem is then to trade-off the cost of a large value of the state and that of
the control that can bring the state back close to zero. To get some intuition for the
solution, consider a simple form of this trade-off: minimizing
(ax + u)² + β u².
In this simple version of the problem, there is no noise and we apply the control
only once. To minimize this expression over u, we set the derivative with respect to
u equal to zero and we find
2(ax + u) + 2βu = 0,
so that
u = − (a/(1 + β)) x.
Thus, the value of the control that minimizes the cost is linear in the state. We should
use a large control value when the state is far from the desired value 0. The following
result shows that the same conclusion holds for our problem (Fig. 14.1).
Theorem 14.1 (Optimal LQG Control) The control values U(n) that minimize (14.2) for the system (14.1) are
U(n) = g(N − n) X(n),
where
g(m) = − a d(m − 1) / (β + d(m − 1)), m ≥ 0;   (14.3)
d(m) = 1 + a² β d(m − 1) / (β + d(m − 1)), m ≥ 0   (14.4)
with d(−1) = 0.
That is, the optimal control is linear in the state and the coefficient depends on
the time-to-go. These coefficients can be pre-computed at time 0 and they do not
depend on the noise variance. Thus, the control values would be calculated in the
same way if V (n) = 0 for all n.
Proof Let Vm (x) be the minimum value of (14.2) when N is replaced by m. The
stochastic dynamic programming equations are
V_m(x) = min_u { x² + β u² + E(V_{m−1}(ax + u + V)) }, m ≥ 0,   (14.5)
We claim that V_m(x) = d(m) x² + c(m) for some constants c(m) and d(m), where d(m) satisfies (14.4), and that the minimizer in (14.5) is u = g(m) x with g(m) given by (14.3).
The verification is a simple algebraic exercise that we leave to the reader.
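As a small illustration, here is a Python sketch that computes the gains g(m) from (14.3)–(14.4) and simulates the controlled system (14.1) with them; the values of a, β, σ, and N are illustrative.

# Compute g(m), d(m) and simulate X(n+1) = aX(n) + U(n) + V(n) with U(n) = g(N-n)X(n).
import random

a, beta, sigma, N = 0.9, 1.0, 0.5, 50

g, d_prev = [], 0.0                      # d(-1) = 0
for m in range(N + 1):
    g.append(-a * d_prev / (beta + d_prev))               # (14.3)
    d_prev = 1 + a**2 * beta * d_prev / (beta + d_prev)   # (14.4)

x, cost = 5.0, 0.0
for n in range(N + 1):
    u = g[N - n] * x                     # optimal control is linear in the state
    cost += x**2 + beta * u**2
    x = a * x + u + random.gauss(0.0, sigma)

print(g[-1], cost)                       # gain for a large time-to-go, one realized cost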
14.1.1 Letting N → ∞
We claim that, as m → ∞, d(m) converges to the solution d of the fixed-point equation
d = f(d) := 1 + a² β d / (β + d).
Indeed,
f′(d) = a² β² / (β + d)²,
so that 0 < f′(d) < a² for d ≥ 0. Also, f(d) > 0 for d ≥ 0. Hence, f(d) is a contraction. That is,
|f(d) − f(d′)| ≤ α |d − d′| for all d, d′ ≥ 0, for some α ∈ (0, 1).
Thus,
|d − d(m)| ≤ α^m |d − d(0)|,
which shows that d(m) → d, as claimed. Consequently, (14.3) shows that g(m) →
g as m → ∞, where
g = − a d / (β + d).
Thus, when the time-to-go m is very large, the optimal control approaches U (N −
m) = gX(N − m). This suggests that this control may minimize the cost (14.2)
when N tends to infinity (Fig. 14.2).
The formal way to study this problem is to consider the long-term average cost
defined by
lim_{N→∞} (1/N) E[ Σ_{n=0}^{N} ( X(n)² + β U(n)² ) | X(0) = x ].
This expression is the average cost per unit time. One can show that if |a| < 1, then
the control U (n) = gX(n) with g defined as before indeed minimizes that average
cost.
14.2 LQG with Noisy Observations
In the previous section, we controlled a linear system with Gaussian noise assuming that we observed the state. We now consider the case of noisy observations.
The system is
where the random variables W(n) are i.i.d. N(0, w²) and are independent of the
V (n).
The problem is to find, for each n, the value of U (n) based on the values of
Y^n := {Y(0), . . . , Y(n)} that minimize the expected total cost (14.2).
The following result gives the solution of the problem (Fig. 14.3).
Theorem 14.2 (Optimal LQG Control with Noisy Observations) The solution of the problem is
U(n) = g(N − n) X̂(n),
where
X̂(n) := E[X(n) | Y^n]
can be computed by using the Kalman filter and the constants g(m) are given by (14.3)–(14.4).
Thus, the control values are the same as when X(n) is observed exactly, except
that X(n) is replaced by X̂(n). This feature is called certainty equivalence.
Proof The fact that the values of g(n) do not depend on the noise V (n) gives us
some inkling as to why the result in the theorem can be expected: given Y^n, the state X(n) is N(X̂(n), v²) for some variance v². Thus, we can view the noisy
observation as increasing the variance of the state, as if the variance of V (n) were
increased.
Instead of providing the complete algebra, let us sketch why the result holds.
Assume that the minimum expected cost-to-go at time N − m + 1 given Y^{N−m+1} has been computed. Write
X(N − m) = X̂(N − m) + η,
where η is the estimation error. The Kalman filter updates the estimate as
X̂(N − m + 1) = a X̂(N − m) + u + K(N − m + 1){ Y(N − m + 1) − E[Y(N − m + 1) | Y^{N−m}] }
=: a X̂(N − m) + u + Z.
14.2.1 Letting N → ∞
As when X(n) is observed exactly, one can show that, if |a| < 1, the control
U (n) = g X̂(n)
minimizes the average cost per unit time. Also, in this case, we know that the
Kalman filter becomes stationary and has the form (Fig. 14.4)
In the previous chapter, we considered a controlled Markov chain and the action
is based on the knowledge of the state. In this section, we look at problems where
the state of the Markov chain is not observed exactly. In other words, we look at
a controlled hidden Markov chain. These problems are called partially observed
Markov decision problems (POMDPs).
Instead of discussing the general version of this problem, we look at one concrete
example to convey the basic ideas.
The example is illustrated in Fig. 14.5. You have misplaced your keys but you
know that they are either in bag A, with probability p, or in bag B, otherwise.
Unfortunately, your bags are cluttered and if you spend one unit of time (say 10 s)
looking in bag A, you find your keys with probability α if they are there. Similarly,
the probability for bag B is β. Every time unit, you choose which bag to explore.
Your objective is to minimize the expected time until you find your keys.
The state of the system is the location A or B of your keys. However, you do
not observe that state. The key idea (excuse the pun) is to consider the conditional
probability pn that the keys are in bag A given all your observations up to time n. It
turns out that pn is a controlled Markov chain, as we explain shortly. Unfortunately,
the set of possible values of pn is [0, 1], which is not finite, nor even countable. Let
us not get discouraged by this technical issue.
Assume that at time n, when the keys are in bag A with probability pn , you look
in bag A for one unit of time and you do not see the keys. What is then pn+1 ? We
claim that
p_{n+1} = p_n (1 − α) / ( p_n (1 − α) + (1 − p_n) ) =: f(A, p_n).
Indeed, this is the probability that the keys are in bag A and we do not see them,
divided by the probability that we do not see the keys (either when they are there or
when they are not). Of course, if we see the keys, the problem stops.
Similarly, say that we look in bag B and we do not see the keys. Then
p_{n+1} = p_n / ( p_n + (1 − p_n)(1 − β) ) =: f(B, p_n).
Thus, we control pn with our actions. Let V (p) be the minimum expected time
until we find the keys, given that they are in bag A with probability p. Then, the
DPE are
V(p) = 1 + min{ (1 − pα) V(f(A, p)), (1 − (1 − p)β) V(f(B, p)) }.   (14.9)
The constant 1 is the duration of the first step. The first term in the minimum is what
happens when you look in bag A. With probability 1 − pα, you do not find your
keys and you will then have to wait a minimum expected time equal to V (f (A, p))
to find your keys, because the probability that they are in bag A is now f (A, p).
The other term corresponds to first looking in bag B.
These equations look hopeless. However, they are easy to solve in Python. One
discretizes [0, 1] into K intervals and one rounds off the updates f (A, p) and
f (B, p).
Thus, the updates are for a finite vector V = (V (1/K), V (2/K), . . . , V (1)).
With this discretization, the equations (14.9) look like
V = φ(V),
where φ(·) is the right-hand side of (14.9). These are fixed-point equations. To solve
them, we initialize V0 = 0 and we iterate
Vt+1 = φ(Vt ), t ≥ 0.
With a bit of luck, that can be justified mathematically, this algorithm converges to
V, the solution of the DPE. The solution is shown in Fig. 14.6, for different values
of α and β. The figure also shows the optimum action as a function of p. The
discretization uses K = 1000 values in [0, 1] and the iteration is performed 100
times.
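Here is a minimal Python sketch of that discretized fixed-point iteration; the values of α, β, and K are illustrative.

# Discretized iteration for the key-search DPE (14.9): p is rounded to a grid of K values.
alpha, beta, K = 0.3, 0.5, 1000

def f_A(p):                      # posterior after an unsuccessful look in bag A
    return p * (1 - alpha) / (p * (1 - alpha) + (1 - p))

def f_B(p):                      # posterior after an unsuccessful look in bag B
    return p / (p + (1 - p) * (1 - beta))

grid = [k / K for k in range(1, K + 1)]
idx = lambda p: min(max(round(p * K), 1), K) - 1     # round p onto the grid

V = [0.0] * K
for _ in range(100):
    V = [1 + min((1 - p * alpha) * V[idx(f_A(p))],
                 (1 - (1 - p) * beta) * V[idx(f_B(p))])
         for p in grid]

# optimal action: look in bag A when the first term achieves the minimum
look_in_A = [(1 - p * alpha) * V[idx(f_A(p))] <= (1 - (1 - p) * beta) * V[idx(f_B(p))]
             for p in grid]
print(V[idx(0.5)], look_in_A[idx(0.5)])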
14.4 Summary
14.5 References
The texts Bertsekas (2005), Kumar and Varaiya (1986) and Goodwin and Sin (2009)
cover LQG control. The first two texts discuss POMDP.
14.6 Problems
where X(0) = 0 and the random variables V (n) are i.i.d. and N (0, 0.2). The U (n)
are control values.
where X(0) = 0 and the random variables V (n), W (n) are independent with
V (n) =D N (0, 0.2) and W (n) =D N (0, σ 2 ).
(a) Implement the control described in Theorem 14.2 for σ 2 = 0.1 and σ 2 = 0.4
and simulate the controlled system.
(b) Implement the limiting control with the limiting gain and the stationary Kalman
filter for σ 2 = 0.1 and σ 2 = 0.4. Simulate the system.
(c) Compare the systems with the time-varying and the limiting controls.
Problem 14.3 There are two coins. One is fair and the other one has a probability
of “head” equal to 0.6. You cannot tell which is which by looking at the coins. At
each step n ≥ 1, you must choose which coin to flip. The goal is to maximize the
expected number of “heads.”
Perspective and Complements
15
15.1 Inference
One key concept that we explored is that of inference. The general problem of
inference can be formulated as follows. There is a pair of random quantities (X, Y ).
One observes Y and one wants to guess X (Fig. 15.1).
Thus, the goal is to find a function g(·) such that X̂ := g(Y ) is close to X, in a
sense to be made precise. Here are a few sample problems:
Fig. 15.1 The inference problem is to guess the value of X from that of Y
15.2 Sufficient Statistic
A useful notion for inference problems is that of a sufficient statistic. We have not
discussed this notion so far. It is time to do it.
Definition 15.1 (Sufficient Statistic) We say that h(Y) is a sufficient statistic for X if
f_{Y|X}[y|x] = f(h(y), x) g(y)
for some functions f(·, ·) and g(·),
or, equivalently, it
Before we discuss the meaning of this definition, let us explore some implications. First note that if we have a prior fX(x) and we want to calculate MAP[X|Y = y], we have
MAP[X|Y = y] = arg max_x fX(x) f_{Y|X}[y|x] = arg max_x fX(x) f(h(y), x),
so that MAP[X|Y = y] is a function of h(y) alone. In words, the information in Y that is useful to calculate MAP[X|Y] is contained in h(Y).
In the same way, we see that MLE[X|Y ] is also a function of h(Y ).
Observe also that
f_{X|Y}[x|y] = fX(x) f_{Y|X}[y|x] / fY(y) = fX(x) f(h(y), x) g(y) / fY(y).
Now,
fY(y) = ∫_{−∞}^{∞} fX(x) f(h(y), x) g(y) dx = g(y) ∫_{−∞}^{∞} fX(x) f(h(y), x) dx = g(y) φ(h(y)),
where
φ(h(y)) = ∫_{−∞}^{∞} fX(x) f(h(y), x) dx.
Hence,
f_{X|Y}[x|y] = fX(x) f(h(y), x) / φ(h(y)).
Now, consider the hypothesis testing problem when X ∈ {0, 1}. Note that the likelihood ratio is
L(y) = f_{Y|X}[y|1] / f_{Y|X}[y|0] = f(h(y), 1) / f(h(y), 0).
Thus, the likelihood ratio depends only on h(y) and it follows that the solution of
the hypothesis testing problem is also a function of h(Y ).
15.2.1 Interpretation
The definition of sufficient statistic is quite abstract. The intuitive meaning is that if
h(Y ) is sufficient for X, then Y is some function of h(Y ) and a random variable Z
that is independent of X. That is,
Y = g(h(Y), Z)   (15.1)
for some function g(·, ·).
For instance, say that Y = (Y1 , . . . , Yn ) where the Ym are i.i.d. and Bernoulli with
parameter X ∈ [0, 1]. Let h(Y ) = Y1 + · · · + Yn . Then we can think of Y as being
constructed from h(Y ) by selecting randomly which h(Y ) random variables among
(Y1 , . . . , Yn ) are equal to one. This random choice is some independent random
variable Z. In such a case, we see that Y does not contain any information about X
that is not already in h(Y ).
To see the equivalence between this interpretation and the definition, first assume
that (15.1) holds. Then, given X = x, the density of Y factors as in the definition,
so that h(Y) is sufficient for X. Conversely, if h(Y) is sufficient for X, then we can
find some Z such that g(h(y), Z) has the density fY |h(Y ) [y|h(y)].
Assume that p ∈ (0, 1). Then the Markov chain is irreducible. However, it is
intuitively clear that X(n) → ∞ as n → ∞ if p > 0.5. To see that this is indeed
the case, let Z(n) be i.i.d. random variables with P(Z(n) = 1) = p and P(Z(n) = −1) = q. Then note that
X(n) ≥ X(0) + Z(1) + ··· + Z(n),
so that
X(n)/n ≥ X(0)/n + (Z(1) + ··· + Z(n))/n.
Also,
(Z(1) + ··· + Z(n))/n → p − q > 0,
where the convergence follows by the SLLN. This implies that X(n) → ∞, as claimed.
Thus, X(n) eventually is larger than any given N and remains larger. This shows
that X(n) visits every state only finitely many times. We say that the states are
transient because they are visited only finitely often.
We say that a state is recurrent if it is not transient. In that case, the state is called
positive recurrent if the average time between successive visits is finite; otherwise
it is called null recurrent.
Here is the result that corresponds to Theorem 1.1
Theorem 15.1 (Big Theorem for Infinite Markov Chains) Consider an infinite
Markov chain.
(a) If the Markov chain is irreducible, the states are either all transient, all positive
recurrent, or all null recurrent. We then say that the Markov chain is transient,
positive recurrent, or null recurrent, respectively.
(b) If the Markov chain is positive recurrent, it has a unique invariant distribution
π and π(i) is the long-term fraction of time that X(n) is equal to i.
(c) If the Markov chain is positive recurrent and also aperiodic, then the distribu-
tion πn of X(n) converges to π .
(d) If the Markov chain is not positive recurrent, it does not have an invariant
distribution and the fraction of time that it spends in any state goes to zero.
It turns out that the Markov chain in Fig. 15.2 is null recurrent for p = 0.5 and
positive recurrent for p < 0.5. In the latter case, its invariant distribution is
π(i) = (1 − ρ) ρ^i, i ≥ 0, where ρ := p/q.
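Here is a quick numerical check of this invariance, assuming the chain of Fig. 15.2 is the reflected random walk on {0, 1, 2, . . .} that moves up with probability p, down with probability q = 1 − p, and stays at 0 with probability q; since the figure is not reproduced here, that transition structure at 0 is an assumption.

# Check that pi(i) = (1 - rho) * rho**i satisfies pi = pi P for the reflected walk
# with P(0,0) = q, P(i, i+1) = p, P(i, i-1) = q (assumed transition structure).
p = 0.3
q = 1 - p
rho = p / q
M = 200                                    # truncate the state space for the check

pi = [(1 - rho) * rho**i for i in range(M)]

def flow_in(i):
    # probability mass flowing into state i in one step when the chain starts in pi
    if i == 0:
        return pi[0] * q + pi[1] * q
    nxt = pi[i + 1] * q if i + 1 < M else 0.0
    return pi[i - 1] * p + nxt

print(max(abs(flow_in(i) - pi[i]) for i in range(M - 1)))   # ~ 0 up to truncation error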
15.4 Poisson Process
15.4.1 Definition
Definition 15.2 (Poisson Process) Let λ > 0 and {S1, S2, . . .} be i.i.d. Exp(λ) random variables. Let also Tn = S1 + ··· + Sn for n ≥ 1, with T0 := 0. Define
N_t = max{ n ≥ 0 | T_n ≤ t }, t ≥ 0.
The process {N_t, t ≥ 0} is a Poisson process with rate λ.
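Here is a small Python sketch that simulates N_t directly from this definition and compares the empirical mean to λt; the parameter values are illustrative.

# Simulate the Poisson process from i.i.d. Exp(lambda) inter-jump times.
import random

lam, t, runs = 2.0, 5.0, 10000

def N_t():
    total, n = 0.0, 0
    while True:
        total += random.expovariate(lam)   # next inter-jump time
        if total > t:
            return n
        n += 1

print(sum(N_t() for _ in range(runs)) / runs, lam * t)   # both are close to 10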
Before exploring the properties of the Poisson process, we recall two properties of
the exponential distribution.
Fτ (t) = P (τ ≤ t) = 1 − exp{−λt}, t ≥ 0.
In particular, the pdf of τ is fτ (t) = λ exp{−λt} for t ≥ 0. Also, E(τ ) = λ−1 and
var(τ ) = λ−2 .
Then,
P[τ ≤ t + ε | τ > t] = λε + o(ε), as ε → 0.
Proof
P[τ > t + s | τ > s] = P(τ > t + s) / P(τ > s) = exp{−λ(t + s)} / exp{−λs} = exp{−λt} = P(τ > t).
Proof Figure 15.4 illustrates that result. Given {Ns , s ≤ t}, the first jump time of
{Ns+t − Nt , s ≥ 0} is Exp(λ), by the memoryless property of the exponential
distribution. The subsequent inter-jump times are i.i.d. and Exp(λ). This proves the
theorem.
Proof There are a number of ways of showing this result. The standard way is as
follows. Note that
P(N_{t+ε} = n) = P(N_t = n)(1 − λε) + P(N_t = n − 1) λε + o(ε).
Hence,
d/dt P(N_t = n) = λ P(N_t = n − 1) − λ P(N_t = n).
Thus,
d/dt P(N_t = 0) = −λ P(N_t = 0),
so that P(N_t = 0) = exp{−λt}. Writing P(N_t = n) = g(n, t) exp{−λt}, the identity above gives
d/dt [ g(n, t) exp{−λt} ] = λ [ g(n − 1, t) − g(n, t) ] exp{−λt},
i.e.,
d/dt g(n, t) = λ g(n − 1, t).
This expression shows by induction that g(n, t) = (λt)^n / n!.
A different proof makes use of the density of the jumps. Let Tn be the n-th jump
of the process and Sn = Tn − Tn−1 , as before. Then
To derive this expression, we used the fact that the Sn are i.i.d. Exp(λ). The
expression above shows that, given that there are n jumps in [0, t], they are equally
likely to be anywhere in the interval. Also,
P(N_t = n) = λ^n ∫_S dt_1 ··· dt_n exp{−λt},
where S = {t1 , . . . , tn |0 < t1 < · · · < tn < t}. Now, observe that S is a fraction of
[0, t]n that corresponds to the times ti being in a particular order. There are n! such
orders and, by symmetry, each order corresponds to a subset of [0, t]n of the same
size. Thus, the volume of S is t n /n!. We conclude that
P(N_t = n) = (t^n / n!) λ^n exp{−λt} = ((λt)^n / n!) exp{−λt},
which proves the result.
15.5 Boosting
You follow the advice of some investment experts when you buy stocks. Their
recommendations are often contradictory. How do you make your decisions so that,
in retrospect, you are not doing too bad compared to the best of the experts? The
intuition is that you should try to follow the leader, but randomly. To make the
situation concrete, Fig. 15.5 shows three experts (B, I, T ) and the profits one would
make by following their advice on the successive days.
On a given day, you choose which expert to follow the next day. Figure 15.6
shows your profit if you make the sequence of selections indicated by the red circles.
Fig. 15.5 The three experts and the profits of their recommended stocks
Fig. 15.6 A specific sequence of choices and the resulting profit and regrets
In these selections, you choose to follow B the first 2 days, then I the next two
days, then T the last day. Of course, you have to choose the day before, and the
actual profit is only known the next day. The figure also shows the regrets that you
accumulate when comparing your profit to that of the three experts. Your total profit
is −5 and the profit you would have made if you had followed B all the time would
have been −2, so your regret compared to B is −2 − (−5) = 3, and similarly for
the other two experts.
The problem is to make the expert selection every day so as to minimize the worst
regret, i.e., the regret with respect to the most successful expert. More precisely, the
goal is to minimize the rate of growth of the worst regret. Here is the result.
Theorem 15.6 (Minimum Regret Algorithm) Generally, the worst regret grows like O(√n) with the number n of steps. One algorithm that achieves this rate of
regret is to choose expert E at step n + 1 with probability πn+1 (E) given by
π_{n+1}(E) = A_n exp{ η P_n(E) / √n }, for E ∈ {B, I, T},
where η > 0 is a constant, An is such that these probabilities add up to one, and
Pn (E) is the profit that expert E makes in the first n days.
Thus, the algorithm favors successful experts. However, the algorithm makes
random selections. It is easy to construct examples where a deterministic algorithm
accumulates a regret that grows like n.
Figure 15.7 shows a simulation of three experts and of the selection algorithm
in the theorem. The experts are random walks with drift 0.1. The simulation shows that the selection algorithm tends to fall behind the best expert by O(√n).
The proof of the theorem can be found in Cesa-Bianchi and Lugosi (2006).
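Here is a minimal Python sketch of the randomized selection rule of Theorem 15.6, with random-walk experts as in Fig. 15.7; the constant η and the drift are illustrative.

# Randomized "follow the leader": expert E is chosen with probability
# proportional to exp(eta * P_n(E) / sqrt(n)).  Experts are random walks with drift 0.1.
import math
import random

n_steps, eta, drift, experts = 250, 1.0, 0.1, 3
P = [0.0] * experts                     # cumulative profit P_n(E) of each expert
my_profit = 0.0

for n in range(n_steps):
    s = math.sqrt(max(n, 1))
    w = [math.exp(eta * P[e] / s) for e in range(experts)]
    choice = random.choices(range(experts), weights=w)[0]
    gains = [drift + random.gauss(0.0, 1.0) for _ in range(experts)]
    my_profit += gains[choice]
    P = [P[e] + gains[e] for e in range(experts)]

print(max(P) - my_profit)               # worst regret, typically of order sqrt(n_steps)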
15.6 Multi-Armed Bandits
Here is a classical problem. You are given two coins, both with an unknown bias
(the probability of heads). At each step k = 1, 2, . . . you choose a coin to flip.
Your goal is to accumulate heads as fast as possible. Let Xk be the number of heads
you accumulate after k steps. Let also Xk∗ be the number of heads that you would
accumulate if you always flipped the coin with the largest bias. The regret of your
strategy after k steps is defined as
Rk = E(Xk∗ − Xk ).
Let θ1 and θ2 be the bias of coins 1 and 2, respectively. Then E(Xk∗ ) = k max{θ1 , θ2 }
and the best strategy is to flip the coin with the largest bias at each step. However,
since the two biases are unknown, you cannot use that strategy. We explain below
that there is a strategy such that the regret grows like log(k) with the number of
steps.
Any good strategy keeps on estimating the biases. Indeed, any strategy that stops
estimating and then forever flips the coin that is believed to be best has a positive
probability of getting stuck with the worst coin, thus accumulating a regret that
grows linearly over time. Thus, a good strategy must constantly explore, i.e., flip
both coins to learn their bias.
However, a good strategy should exploit the estimates by flipping the coin that is
believed to be better more frequently than the other. Indeed, if you were to flip the
two coins the same fraction of time, the regret would also grow linearly. Hence, a
good strategy must exploit the accumulated knowledge about the biases.
The key question is how to balance exploration and exploitation. The strategy
called Thompson Sampling does this optimally. Assume that the biases θ1 and
θ2 of the two coins are independent and uniformly distributed in [0, 1]. Say that
you have flipped the coins a number of times. Given the outcomes of these coin
flips, one can in principle compute the conditional distributions of θ1 and θ2 . Given
these conditional distributions, one can calculate the probability that θ1 > θ2 . The
Thompson Sampling strategy is to choose coin 1 with that probability and coin 2
otherwise for the next flip. Here is the key result.
Rk ≥ O(log k).
The notation O(log k) indicates a function g(k) that grows like log k, i.e., such
that g(k)/ log k converges to a positive constant as k → ∞.
Thus this strategy does not necessarily choose the coin with the largest expected
bias. It is the case that the strategy favors the coin that has been more successful so
far, thus exploiting the information. But the selection is random, which contributes
to the exploration.
One can show that if flips of coin 1 have produced h heads and t tails, then the
conditional density of θ1 is g(θ ; h, t), where
g(θ; h, t) = ((h + t + 1)! / (h! t!)) θ^h (1 − θ)^t, θ ∈ [0, 1].
The same result holds for coin 2. Thus, Thompson Sampling generates θ̂1 and θ̂2
according to these densities.
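Here is a minimal Python sketch of Thompson Sampling for the two coins: sample from both Beta posteriors and flip the coin with the larger sample. The true biases are of course unknown to the algorithm; the values used here are illustrative.

# Thompson Sampling for two coins: the posterior after h heads and t tails
# (uniform prior) is Beta(h + 1, t + 1).
import random

theta = [0.45, 0.55]            # true (unknown) biases, illustrative
heads, tails = [0, 0], [0, 0]
flips, my_heads = 10000, 0

for k in range(flips):
    samples = [random.betavariate(heads[i] + 1, tails[i] + 1) for i in (0, 1)]
    i = 0 if samples[0] > samples[1] else 1
    if random.random() < theta[i]:
        heads[i] += 1
        my_heads += 1
    else:
        tails[i] += 1

print(flips * max(theta) - my_heads)   # realized regret, which grows roughly like log(k)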
For a proof of this result, see Agrawal and Goyal (2012). See also Russo et al.
(2018) for applications of multi-armed bandits.
A rough justification of the result goes as follows. Say that θ1 > θ2 . One can
show that after flipping coin 2 a number n of times, it takes about n steps until you
flip it again when using Thompson Sampling. Your regret then grows by one at times
1, 1 + 1, 2 + 2, 4 + 4, . . . , 2^n, 2^{n+1}, . . .. Thus, the regret is of order n after O(2^n) steps. Equivalently, after N = 2^n steps, the regret is of order n = log N.
15.7 Capacity of BSC
Consider a binary symmetric channel with error probability p ∈ (0, 0.5). Every bit
that the transmitter sends has a chance of being corrupted. Thus, it is impossible
to transmit any bit string fully reliably across this channel. No matter what the
transmitter sends, the receiver can never be sure that it got the message right.
However, one might be able to achieve a very small probability of error. For
instance, say that p = 0.1 and that one transmits a bit by repeating it N times,
where N ≫ 1. As the receiver gets the N bits, it uses a majority decoding. That is,
if it gets more zeros than ones, it decides that transmitter sent a zero, and conversely
for a one. The probability of error can be made arbitrarily small by choosing N very
large. However, this scheme gets to transmit only one bit every N steps. We say
that the rate of the channel is 1/N and it seems that to achieve a very small error
probability, the rate has to become negligible.
It turns out that our pessimistic conclusion is wrong. Claude Shannon (Fig. 15.8),
in the late 1940s, explained that the channel can transmit at any rate less than C(p),
where (see Fig. 15.9)
C(p) = 1 − H(p), with H(p) := −p log2(p) − (1 − p) log2(1 − p).
error less than 10^{−12}. The actual scheme that we use depends on ε, and it becomes more complex when ε is smaller; however, the rate R does not depend on ε. Quite a
remarkable result! Needless to say, it baffled all the engineers who had been busily
designing various ad hoc transmission schemes.
Shannon’s key insight is that long sequences are typical. There is a statistical
regularity in random sequences such as Markov chains or i.i.d. random variables and
this regularity manifests itself in a characteristic of long sequences. For instance, flip
many times a biased coin with P (head) = 0.1. The sequence that you will observe
is likely to have about 10% of heads. Many other sequences are so unlikely that you
will not see them. Thus, there are relatively few long sequences that are possible.
In this example, although there are M = 2^N possible sequences of N coin flips, only about √M are typical when P(head) = 0.1. Moreover, by symmetry, these
typical sequences are all equally likely. For that reason, the errors of the BSC must
correspond to relatively few patterns. Say that there are only A possible patterns of
errors for N transmissions. Then, any bit string of length N that the sender transmits
will correspond to A possible received “output” strings: one for every typical error
sequence. Thus, it might be possible to choose B different “input” strings of length
N for the transmitter so that the A received “output” strings for each one of these
B input strings are all distinct. However, one might worry that choosing the B input
strings would be rather complex if we want their sets of output strings to be distinct.
Shannon noticed that if we pick the input strings completely randomly, this will
work. Thus, Shannon's scheme is as follows. Pick a large N. Choose B strings of N bits randomly, each time by flipping a fair coin N times. Call these input strings X1, . . . , XB. These are the codewords. Let S1 be the set of A typical outputs that
correspond to X1 . Let Yj be the output that corresponds to input Xj . Note that the
Yj are sequences of fair coin flips, by symmetry of the channel. Thus, each Yj
is equally likely to be any one of the 2^N possible output strings. In particular, the probability that Yj falls in S1 is A/2^N (Fig. 15.10).
In fact,
P(Y2 ∈ S1 or Y3 ∈ S1 . . . or YB ∈ S1) ≤ B × A × 2^{−N}.
Indeed, the probability of a union of events is not larger than the sum of their
probabilities. We explain below that A = 2^{NH(p)}. Thus, if we choose B = 2^{NR}, we see that the expression above is less than or equal to
2^{NR} × 2^{NH(p)} × 2^{−N} = 2^{−N(1 − H(p) − R)},
which is negligible for large N as long as R < 1 − H(p) = C(p).
Thus, the receiver makes an error with a negligible probability if one does not choose
too many codewords. Note that B = 2^{NR} corresponds to transmitting NR different
bits in N steps, thus transmitting at rate R.
How does the receiver recognize the bit string that the transmitter sent? The idea
is to give the list of the B input strings, i.e., codewords, to the receiver. When
it receives a string, the receiver looks in the list to find the codeword that is the
closest to the string it received. With a very high probability, it is the string that the
transmitter sent.
It remains to show that A = 2^{NH(p)}. Fortunately, this calculation is a simple
consequence of the SLLN. Let X := {X(n), n = 1, . . . , N } be i.i.d. random
variables with P (X(n) = 1) = p and P (X(n) = 0) = 1 − p. For a given sequence
x = (x(1), . . . , x(N )) ∈ {0, 1}N , let
ψ(x) := (1/N) log2( P(X = x) ).   (15.3)
Note that, with |x| := Σ_{n=1}^{N} x(n),
ψ(x) = (1/N) log2( p^{|x|} (1 − p)^{N−|x|} ) = (|x|/N) log2(p) + ((N − |x|)/N) log2(1 − p).
Consequently,
ψ(X) = (|X|/N) log2(p) + ((N − |X|)/N) log2(1 − p) ≈ p log2(p) + (1 − p) log2(1 − p) = −H(p)
for N large, since |X|/N → p by the SLLN.
This calculation shows that any sequence x of values that X takes has approximately
the same value of ψ(x). But, by (15.3), this implies all the sequences x that occur
have approximately the same probability
2^{−NH(p)}.
We conclude that there are 2^{NH(p)} typical sequences and that they are all essentially equally likely. Thus, A = 2^{NH(p)}.
Recall that for the Gaussian channel with the MLE detection rule, the channel
becomes a BSC with
is called the entropy rate of the Markov chain. A practical scheme, called Lempel–Ziv compression, essentially achieves this limit. It is the basis for most file compression algorithms (e.g., ZIP).
Shannon put these two ideas together: channel capacity and source coding. Here
is an example of his source–channel coding result. How fast can one send the
symbols X(n) produced by the Markov chain through a BSC channel? The answer
is C(p)/H (P ). Intuitively, it takes H (P ) bits per symbol X(n) and the BSC can
send C(p) bits per unit time. Moreover, to accomplish this rate, one first encodes
the source and one separately chooses the codewords for the BSC, and one then
uses them together. Thus, the channel coding is independent of the source coding
and vice versa. This is called the separation theorem of Claude Shannon.
15.8 Bounds on Probabilities
Let f be a nondecreasing, nonnegative function. Then, for any a,
P(X ≥ a) ≤ E(f(X)) / f(a).   (15.4)
Proof
1{X ≥ a} ≤ f(X) / f(a),
as shown in Fig. 15.14. The inequality (15.4) then follows by taking expectations.
Recall the multiplexing problem. There are N users who are independently active
with probability p. Thus, the number of active users Z is B(N, p). We want to find
m so that P (Z ≥ m) = 5%.
As a first estimate of m, we use Chebyshev’s inequality (2.2) which says that
P(|ν − E(ν)| > ε) ≤ var(ν) / ε².
Now, if Z = B(N, p), one has E(Z) = Np and var(Z) = Np(1 − p).4 Hence,
since ν = B(100, 0.2), one has E(ν) = 20 and var(ν) = 16. Chebyshev’s inequality
gives
P(|ν − 20| > ε) ≤ 16 / ε².
Thus, we expect that
P(ν − 20 > ε) ≤ 8 / ε²,
because it is reasonable to think that the distribution of ν is almost symmetric around
its mean, as we see in Fig. 3.4. We want to choose m = 20 + ε so that P(ν > m) ≤ 5%. This means that we should choose ε so that 8/ε² = 5%. This gives ε ≈ 13, so
that m = 33. Thus, according to Chebyshev’s inequality, it is safe to assume that no
more than 33 users are active and we can choose C so that C/33 is a satisfactory
rate for users.
As a second approach, we use Chernoff's inequality (15.5) which states that
P(Z ≥ a) ≤ E(exp{θ(Z − a)}) = exp{−θa} E(exp{θZ}), for all θ > 0.   (15.5)
To calculate the right-hand side, we note that if Z = B(N, p), then we can write Z = X(1) + ··· + X(N), where the X(n) are i.i.d. random variables with P(X(n) = 1) = p and P(X(n) = 0) = 1 − p. Then,
To continue the calculation, we note that, since the X(n) are independent, so
are the random variables exp{θ X(n)}.5 Also, the expected value of a product
of independent random variables is the product of their expected values (see
Appendix A). Hence,
where we define
4 See Appendix A.
5 Indeed, functions of independent random variables are independent. See Appendix A.
Fig. 15.15 The logarithm divided by N of the probability of too many active users
Since this inequality holds for every θ > 0, let us minimize the right-hand side with
respect to θ . That is, let us define
so that
and
Setting to zero the derivative with respect to θ of the term between brackets, we find
a = p e^θ / (1 − p + p e^θ),
i.e.,
e^θ = a(1 − p) / ((1 − a) p).
Λ*(a) = a log(a/p) + (1 − a) log((1 − a)/(1 − p)), ∀a > p.
P (ν ≥ Na) ≈ 0.05.
i.e.,
Λ*(a) = − log(0.05)/N ≈ 0.03.
Looking at Fig. 15.15, we find a = 0.30. This corresponds to m = 30. Thus,
Chernoff’s estimate says that P (ν > 30) ≈ 5% and that we can size the network
assuming that only 30 users are active at any one time.
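Here is a short Python sketch that reproduces these numbers: it computes Λ*(a) and finds the smallest a > p with exp{−NΛ*(a)} ≤ 5%, for N = 100 and p = 0.2 as above.

# Chernoff estimate for the multiplexing example.
import math

N, p = 100, 0.2

def Lambda_star(a):
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

target = -math.log(0.05) / N          # about 0.03
a = next(k / 1000 for k in range(201, 1000) if Lambda_star(k / 1000) >= target)
print(a)                              # about 0.30, matching the value read off Fig. 15.15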
By the way, the calculations we have performed above show that Chernoff’s
bound can be written as
P(Z ≥ Na) ≤ P(B(N, p) = Na) / P(B(N, a) = Na).
15.9 Martingales
15.9.1 Definitions
Let Xn be the fortune at time n ≥ 0 when one plays a game of chance. The game is fair if
E[X_{n+1} | X^n] = X_n, ∀n ≥ 0.
In this expression, X^n := {X_m, m ≤ n}. Thus, in a fair game, one cannot expect
to improve one’s fortune. A sequence {Xn , n ≥ 0} of random variables with that
property is a martingale.
This basic definition generalizes to the case where one has access to additional
information and is still unable to improve one’s fortune. For instance, say that the
additional information is the value of other random variables Yn . One then has the
following definitions.
Definition 15.3 (Martingale) The sequence {Xn, n ≥ 0} is a martingale with respect to {Yn, n ≥ 0} if E[X_{n+1} | X^n, Y^n] = X_n for all n ≥ 0; it is a submartingale if E[X_{n+1} | X^n, Y^n] ≥ X_n and a supermartingale if E[X_{n+1} | X^n, Y^n] ≤ X_n.
In many cases, we do not specify the random variables Yn and we simply say that Xn is a martingale, or a submartingale, or a supermartingale.
Note that if Xn is a martingale, then
E(Xn ) = E(X0 ), ∀n ≥ 0.
15.9.2 Examples
Random Walk
Let {Zn , n ≥ 0} be independent and zero-mean random variables. Then Xn :=
Z0 + ··· + Zn for n ≥ 0 is a martingale. Indeed,
E[X_{n+1} | X^n] = X_n + E(Z_{n+1}) = X_n.
Product
Let {Zn , n ≥ 0} be independent random variables with mean 1. Then Xn := Z0 ×
··· × Zn for n ≥ 0 is a martingale. Indeed,
E[X_{n+1} | X^n] = X_n E(Z_{n+1}) = X_n.
Branching Process
For m ≥ 1 and n ≥ 0, let X_m^n be i.i.d. random variables distributed like X that take values in {0, 1, 2, . . .}, with μ := E(X). Given Y0, define
Y_{n+1} = Σ_{m=1}^{Y_n} X_m^n, n ≥ 0.
Then
Z_n = μ^{−n} Y_n, n ≥ 0
is a martingale. Indeed,
E[Y_{n+1} | Y0, . . . , Yn] = Yn μ,
so that
E[Z_{n+1} | Y0, . . . , Yn] = μ^{−(n+1)} Yn μ = Z_n.
W_n = q^{Z_n}, n ≥ 1
is a martingale.
Proof Exercise.
Doob Martingale
Let {Xn , n = 1, . . . , N } be random variables and Y = f (X1 , . . . , XN ), where f is
some bounded measurable real-valued function. Then
Z_n := E[Y | X_1, . . . , X_n], n = 0, . . . , N,
is a martingale. Here are two examples.
1. Throw N balls into M bins, and let Y be some function of the throws: the
number of empty bins, the max load, the second-highly loaded bin, or some
similar function. Let Xn be the index of the bin into which ball n lands. Then
Zn = E[Y | X1, . . . , Xn] is a martingale.
2. Suppose we have r red and b blue balls in a bin. We draw balls without
replacement from this bin: what is the number of red balls drawn? Let Xn be
the indicator for whether ball n is red, and let Y = X1 + · · · + Xn be the number
of red balls. Then Zn is a martingale.
Y_n = Σ_{m=1}^{n} V_{m−1} (X_m − X_{m−1}), n ≥ 1,   (15.10)
with Y0 := 0 is a martingale.
The meaning of Yn is the fortune that you would get by betting Vm−1 at time
m − 1 on the gain Xm − Xm−1 of the next round of the game. This bet must be based
on the information (Xm−1 , Z m−1 ) that you have when placing the bet, not on the
outcome of the next round, obviously. The theorem says that your fortune remains
a martingale even after adjusting your bets in real time.
Stopping Times
When playing a game of chance, one may decide to stop after observing a particular
sequence of gains and losses. The decision to stop is non-anticipative. That is, one
cannot say “never mind, I did not mean to play the last three rounds.” Thus, the
random stopping time τ must have the property that the event {τ ≤ n} must be a
function of the information available at time n, for all n ≥ 0. Such a random time is
a stopping time.
Definition 15.4 (Stopping Time) A random variable τ is a stopping time for the
sequence {Xn , Yn , n ≥ 0} if τ takes values in {0, 1, 2, . . .} and
P[τ ≤ n | X_m, Y_m, m ≥ 0] = φ_n(X^n, Y^n), ∀n ≥ 0,
for some functions φ_n(·).
For instance,
τ = min{n ≥ 0 | (Xn , Yn ) ∈ A },
where A is a set in ℝ², is a stopping time for the sequence {Xn, Yn, n ≥ 0}. Thus,
you may want to stop the first time that either you go broke or your fortune exceeds
$1000.00.
One might hope that a smart choice of when to stop playing a fair game could
improve one’s expected fortune. However, that is not the case, as the following fact
shows.
E[X_{τ∧n} | X0, Y0] = X0.   (15.11)
In the statement of the theorem, for a random time σ one defines Xσ := Xn when
σ = n.
You will note that bounding τ ∧ n in the theorem above is essential. For instance,
let Xn correspond to the random walk described above with P (Zn = 1) = P (Zn =
−1) = 0.5. If we define τ = min{n ≥ 0 | Xn = 10}, one knows that τ is finite. (See
the comments below Theorem 15.1.) Hence, Xτ = 10, so that
E[Xτ | X0 = 0] = 10 ≠ X0.
Note that
lim_{n→∞} X_{τ∧n} = X_τ = 10,
because τ is finite. One then might conclude that the left-hand side of (15.11) goes
to 10, which would contradict (15.11). However, the limit and the expectation do
not interchange because the random variables Xτ ∧n are not bounded. However, if
they were, one would get E[Xτ |X0 ] = X0 , by the dominated convergence theorem.
We record this observation as the next result.
E[Xτ |X0 , Y0 ] = X0 .
6 τ ∧ n := min{τ, n}
L1 -Bounded Martingales
An L1 -bounded martingale cannot bounce up and down infinitely often across an
interval [a, b]. For if it did, you could increase your fortune without bound by
betting 1 on the way up across the interval and betting 0 on the way down. We
will see shortly that this cannot happen. As a result, the martingale must converge.
(Note that this is not true if the martingale is not L1 -bounded, as the random walk
example shows.)
Proof Consider an interval [a, b]. We show that Xn cannot up-cross this interval
infinitely often. (See Fig. 15.16.) Let us bet 1 on the way up and 0 on the way down.
That is, wait until Xn gets first below a, then bet 1 at every step until Xn > b, then
stop betting until Xn gets below a, and continue in this way.
If Xm crossed the interval Un times by time n, your fortune Yn is now at least
(b − a)Un + (Xn − a). Indeed, your gain was at least b − a for every upcrossing
and, in the last steps of your playing, you lose at most Xn − a if Xn never crosses
above b after you last resumed betting. But, since Yn is a martingale, we have
(We used the fact that Xn ≥ −|Xn|, so that E(Xn) ≥ −E(|Xn|) ≥ −K.) This shows that E(Un) ≤ B := (K + Y0 + a)/(b − a) < ∞. Letting n → ∞, since Un ↑ U,
where U is the total number of upcrossings of the interval [a, b], it follows by the
monotone convergence theorem that E(U ) ≤ B. Consequently, U is finite. Thus,
Xn cannot up-cross any given interval [a, b] infinitely often.
Consequently, the probability that it up-crosses infinitely often any interval with
rational limits is zero (since there are countably many such intervals).
This implies that Xn must converge, either to +∞, −∞, or to a finite value.
Since E(|Xn |) ≤ K, the probability that Xn converges to +∞ or −∞ is zero.
The following is a direct but useful consequence. We used this result in the proof
of the convergence of the stochastic gradient projection algorithm (Theorem 12.2).
Proof We have E(|Xn|)² ≤ E(Xn²), by Jensen's inequality. Thus, it follows that E(|Xn|) is bounded for all n, so that the result of the theorem applies to this martingale.
Theorem (Strong Law of Large Numbers) Let {Xn, n ≥ 1} be i.i.d. random variables with finite mean μ. Then
(X1 + · · · + Xn)/n → μ, almost surely as n → ∞.
Proof Let
Sn = X1 + · · · + Xn , n ≥ 1.
Note that
E[X1 | Sn, Sn+1, . . .] = (1/n) Sn =: Y−n,   (15.12)
by symmetry. Thus, Y−n converges to some limit Y−∞ as n → ∞. Since
Y−∞ = lim_{n→∞} (X1 + · · · + Xn)/n,
we see that Y−∞ is independent of (X1, . . . , Xn) for any finite n. Indeed, the limit does not depend on the values of the first n random variables. However, since Y−∞ is a function of {Xn, n ≥ 1}, it must be independent of itself, i.e., be a constant. Since E(Y−∞) = E(Y−1) = μ, we see that Y−∞ = μ.
E(Yτ ∧n ) = E(Y1 ) = 0,
which gives the identity with τ replaced by τ ∧ n. If E(τ ) < ∞, one can let n go to
infinity and get the result. (For instance, replace Xi by Xi+ and use MCT, similarly
for Xi− , then subtract.)
15.10 Summary
15.11 References
For the theory of Markov chains, see Chung (1967). The text Harchol-Balter (2013)
explains basic queueing theory and many applications to computer systems and
operations research.
The book Bremaud (1998) is also highly recommended for its clarity and the
breadth of applications. Information Theory is explained in the textbook Cover and
Thomas (1991). I learned the theory of martingales mostly from Neveu (1975). The
theory of multi-armed bandits is explained in Cesa-Bianchi and Lugosi (2006). The
text Hastie et al. (2009) is an introduction to applications of statistics in data science
(Fig. 15.17).
15.12 Problems
Problem 15.2 Customers arrive at a store according to a Poisson process with rate 4 (per hour).
Problem 15.3 Consider two independent Poisson processes with rates λ1 and λ2 .
Those processes measure the number of customers arriving in stores 1 and 2.
(a) What is the probability that a customer arrives in store 1 before any arrives in
store 2?
(b) What is the probability that in the first hour exactly 6 customers arrive at the two
stores? (The total for both is 6)
(c) Given exactly 6 have arrived at the two stores, what is the probability all 6 went
to store 1?
(a) Construct the Markov chain that models the queue. What are the states and
transition probabilities? [Hint: Suppose the head of the line task of the queue
still requires z units of service. Include z in the state description of the MC.]
(b) Use a Lyapunov–Foster argument to show that the queue is stable or, equivalently, that the MC is positive recurrent.
Problem 15.7 Let {Nt , t ≥ 0} be a Poisson process with rate λ. Let Sn denote the
time of the n-th event. Find
Problem 15.8 A queue has Poisson arrivals with rate λ. It has two servers that work
in parallel. When there are at least two customers in the queue, two are being served.
When there is only one customer, only one server is active. The service times are
i.i.d. Exp(μ).
Problem 15.9 Let {Xt, t ≥ 0} be a continuous-time Markov chain with rate matrix Q = {q(i, j)}. Define q(i) = Σ_{j≠i} q(i, j). Let also Ti = inf{t > 0 | Xt = i} and Si = inf{t > 0 | Xt ≠ i}. Then (select the correct answers)
Problem 15.10 A continuous-time queue has Poisson arrivals with rate λ, and it is
equipped with infinitely many servers. The servers can work in parallel on multiple
customers, but they are non-cooperative in the sense that a single customer can only
be served by one server. Thus, when there are k customers in the queue, k servers are
active. Suppose that the service time of each customer is exponentially distributed
with rate μ and they are i.i.d.
(a) Argue that the queue length is a Markov chain. Draw the transition diagram of
the Markov chain.
(b) Prove that for all finite values of λ and μ the Markov chain is positive recurrent
and find the invariant distribution.
(a) Given S3 = s, find the joint distribution of S1 and S2. Show your work.
(b) Find E[S2 |S3 = s].
(c) Find E[S3 |N1 = 2].
Problem 15.12 Let S = Σ_{i=1}^{N} Xi denote the total amount of money withdrawn from an ATM in 8 h, where:
(a) Xi are i.i.d. random variables denoting the amount withdrawn by each customer
with E[Xi ] = 30 and V ar[Xi ] = 400.
(b) N is a Poisson random variable denoting the total number of customers with
E[N ] = 80.
Problem 15.13 One is given two independent Poisson processes Mt and Nt with
respective rates λ and μ, where λ > μ. Find E(τ ), where
τ = max{t ≥ 0 | Mt ≤ Nt + 5}.
Problem 15.14 Consider a queue with Poisson arrivals with rate λ. The service
times are all equal to one unit of time. Let Xt be the queue length at time t (t ≥ 0).
Problem 15.15 Consider a queue with Poisson arrivals with rate λ. The queue can
hold N customers. The service times are i.i.d. Exp(μ). When a customer arrives,
you can choose to pay him c so that he does not join the queue. You also pay c when
a customer arrives at a full queue. You want to decide when to accept customers to
minimize the cost of rejecting them, plus the cost of the average waiting time they
spend in the queue.
(a) Formulate the problem as a Markov decision problem. For simplicity, consider
a total discounted cost. That is, if xt customers are in the system at time t, then the waiting cost during [t, t + dt] is e^{−βt} xt dt. Similarly, if you reject a customer at time t, then the cost is c e^{−βt}.
(b) Write the dynamic programming equations.
(c) Use Python to solve the equations.
Problem 15.17 Figure 15.18 shows a system where a source alternates between the
ON and OFF states according to a continuous-time Markov chain with the transition
rates indicated. When the source is ON, it sends a fluid with rate 2 into the queue.
When the source is OFF, it does not send any fluid. The queue is drained at constant
rate 1 whenever it contains some fluid. Let Xt be the amount of fluid in the queue at
time t ≥ 0.
Problem 15.18 Let {Nt, t ≥ 0} be a Poisson process with rate λ and let T be an independent random time that is exponentially distributed with rate μ > 0.
Problem 15.19 Consider two queues in parallel in discrete time with Bernoulli
arrival processes of rates λ1 and λ2 , and geometric service rates of μ1 and μ2 ,
respectively. There is only one server, which can serve either queue 1 or queue 2 at each time. Consider the scheduling policy that serves queue 1 at time n if μ1 Q1(n) > μ2 Q2(n), and serves queue 2 otherwise, where Q1(n) and Q2(n) are the queue lengths at time n. Use the Lyapunov function V(Q1(n), Q2(n)) = Q1²(n) + Q2²(n) to show that the queues are stable if λ1/μ1 + λ2/μ2 < 1. This scheduling policy is known as the Max-Weight or Back-Pressure policy.
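As a quick illustration of the policy (not part of the problem statement), here is a minimal Python simulation sketch; the arrival and service parameters below are made up so that λ1/μ1 + λ2/μ2 < 1.

import random

lam = (0.3, 0.3)     # Bernoulli arrival probabilities (illustrative)
mu = (0.7, 0.8)      # success probability of a service attempt in one slot
q = [0, 0]
random.seed(1)
for t in range(100000):
    for i in (0, 1):                       # arrivals
        if random.random() < lam[i]:
            q[i] += 1
    # Max-Weight: serve the queue with the larger mu_i * Q_i(n)
    i = 0 if mu[0] * q[0] > mu[1] * q[1] else 1
    if q[i] > 0 and random.random() < mu[i]:
        q[i] -= 1
print(q)   # stays small, since 0.3/0.7 + 0.3/0.8 = 0.80 < 1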
A Elementary Probability
A.1 Symmetry
Fig. A.1 Ten marbles marked with a blue and a red number: (1, 1), (1, 1), (1, 3), (1, 3), (1, 4), (1, 4), (1, 4), (2, 3), (2, 4), (2, 4)
Each marble is marked with a blue number B and a red number R, as shown in Fig. A.1. You pick one marble at random, so that each of the ten marbles is equally likely to be picked. For a set A of marbles, P(A) = |A|/10, where |A| is the number of marbles in the set A. For
instance, if A is the set of marbles where (B = 1, R = 3) or (B = 2, R = 4), then
P (A) = 0.4 since there are four such marbles out of 10.
It is clear that if A1 and A2 are disjoint sets (i.e., have no marble in common),
then P (A1 ∪ A2 ) = P (A1 ) + P (A2 ). Indeed, when A1 and A2 are disjoint, the
number of marbles in A1 ∪ A2 is the number of marbles in A1 plus the number
of marbles in A2 . If we divide by ten, we conclude that the probability of picking
a marble that is in A1 ∪ A2 is the sum of the probabilities of picking one in A1
or in A2 . We say that probability is additive. This property extends to any finite
collection of events that are pairwise disjoint.
Note that if A1 and A2 are not disjoint, then P (A1 ∪ A2 ) < P (A1 ) + P (A2 ). For
instance, if A1 is the set of marbles such that B = 1 and A2 is the set of marbles such
that R = 4, then P (A1 ∪ A2 ) = 0.9, whereas P (A1 ) + P (A2 ) = 0.7 + 0.5 = 1.2.
What is happening is that P (A1 ) + P (A2 ) is double-counting the marbles that are
in both A1 and A2 , i.e., the marbles such that (B = 1, R = 4). We can eliminate
this double-counting and check that P (A1 ∪ A2 ) = P (A1 ) + P (A2 ) − P (A1 ∩ A2 ).
Thus, one has to be a bit careful when examining the different ways that
something can happen. When adding up the probabilities of these different ways,
one should make sure that they are exclusive, i.e., that they cannot happen together.
For example, the probability that your car is red or that it is a Toyota is not the
sum of the probability that it is red plus the probability that it is a Toyota. This sum
double-counts the probability that your car is a red Toyota. Such double-counting
mistakes are surprisingly common.
A.2 Conditioning
Now, imagine that you pick a marble and tell me that B = 1. How do I guess R?
Looking at the marbles, we see that there are 7 marbles with B = 1, among
which two are such that R = 1. Thus, given that B = 1, the probability that R = 1
is 2/7. Indeed, given that B = 1, you are equally likely to have picked any one of
the 7 marbles with B = 1. Since 2 out of these 7 marbles are such that R = 1, we
conclude that the probability that R = 1 given that B = 1 is 2/7.
We write P [R = 1 | B = 1] = 2/7. Similarly, P [R = 3 | B = 1] = 2/7
and P [R = 4 | B = 1] = 3/7. We say that P [R = 1 | B = 1] is the conditional
probability that R = 1 given that B = 1.
So, we are not sure of the value of R when we are told that B = 1, but the
information is useful. For instance, you can see that P [R = 1 | B = 2] = 0,
whereas P [R = 1 | B = 1] = 2/7. Thus, knowing B tells us something about R.
P(B = 1, R = 1) = P(B = 1) × P[R = 1 | B = 1].
Indeed, P(B = 1, R = 1) = 2/10 = (7/10) × (2/7) = P(B = 1) × P[R = 1 | B = 1].
To make the previous identity intuitive, we argue in the following way. For (B =
1, R = 1) to occur, B = 1 must occur and then R = 1 must occur given that B = 1.
Thus, the probability of (B = 1, R = 1) is the probability of B = 1 times the
probability of R = 1 given that B = 1.
The previous identity shows that
P[R = 1 | B = 1] = P(B = 1, R = 1)/P(B = 1).
More generally, one defines
P[R = r | B = b] = P(B = b, R = r)/P(B = b)   (A.1)
and, equivalently,
P(B = b, R = r) = P(B = b) × P[R = r | B = b].   (A.2)
In that model, we see that P[High Fever | Ebola] = 1 > P[High Fever | Flu] = 0.75, and P[Flu | High Fever] = 15/16 > P[Ebola | High Fever] = 1/16.
The discussion so far probably seems quite elementary. However, most of the
confusion about probability arises with these basic ideas. Let us look at some
examples.
You are told that Bill has two children and one of them is named Isabelle.
What is the probability that Bill has two daughters? You might argue that Bill’s
other child has a 50% probability of being a girl, so that the probability that
Bill has two daughters must be 0.5. In fact, the correct answer is 1/3. To see
this, look at the four equally likely outcomes for the sex of the two children:
(M, M), (F, M), (M, F ), (F, F ) where (M, F ) means that the first child is male
and the second is female, and similarly for the other cases. Out of these four
outcomes, three are consistent with the information that “one of them is named
Isabelle.” Out of these three outcomes, one corresponds to Bill having two daugh-
ters. Hence, the probability that Bill has two daughters given that one of his two
children is named Isabelle is 1/3, not 50%.
This example shows that confusion in Probability is not caused by the sophis-
tication of the mathematics involved. It is not a lack of facility with Calculus
or Algebra that causes the difficulty. It is the lack of familiarity with the basic
formalism: looking at the possible outcomes and identifying precisely what the
given information tells us about these outcomes.
Another common source of confusion concerns chance fluctuations. Say that you
flip a fair coin ten times. You expect about half of the outcomes to be tails and half
to be heads. Now, say that the first six outcomes happen to be heads. Do you think
the next four are more likely to be tails, to catch up with the average? Of course not.
After 4 years of drought in California, do you expect the next year to be rainier than
average? You should not.
Surprisingly, many people believe in the memory of purely random events. A
useful saying is that “lady luck has no memory nor vengeance.”
A related concept is “regression to the mean.” A simple example goes as follows.
Flip a fair coin twenty times. Say that eight of the first ten flips are heads. You
expect the next ten flips to be more balanced. This does not mean that the next ten
flips are more likely to be tails to compensate for the first ten flips. It simply means
that the abnormal fluctuations in the first ten flips do not carry over to the next ten
flips. More subtle scenarios of this example involve the stock market or the scores of
sports teams, but the basic idea is the same. Of course, if you do not know whether
the coin is fair or biased, observing eight heads out of the first ten flips suggests that
the coin is biased in favor of heads, so that the next ten coin flips are likely to give
more heads than tails. But, if you have observed many flips of that coin in the past,
then you may know that it is fair, and in that case regression to the mean makes
sense.
Now, you might ask “how can about half of the coin flips be heads if the coin
does not make up for an excessive number of previous tails?” The answer is that
among the 2^10 = 1024 equally likely strings of 10 heads and tails, a very large proportion have about 50% of heads. Indeed, 672 such strings have either 4, 5, or 6 heads. Thus the probability that the fraction of heads is between 40% and 60% is 672/1024 = 65.6%. This probability gets closer to one as you flip more coins. For twenty coins, the probability that the fraction of heads is between 40% and 60% is 73.5%.
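These figures are easy to check numerically. Here is a small Python sketch (not from the text) that computes the probability that the fraction of heads falls in [0.4, 0.6]:

from math import comb

def prob_fraction_between(n, lo=0.4, hi=0.6):
    # P(lo <= (number of heads)/n <= hi) for n fair coin flips
    return sum(comb(n, k) for k in range(n + 1) if lo <= k / n <= hi) / 2**n

print(prob_fraction_between(10))   # 672/1024, about 0.656
print(prob_fraction_between(20))   # about 0.74, and it keeps increasing with n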
To avoid being confused, always keep the basic formalism in mind. What are the
outcomes? How likely are they? What does the known information tell you about
them?
A.4 Independence
Look at the marbles in Fig. A.2. For these marbles, we see that P [R = 1 | B =
1] = P [R = 3 | B = 1] = 2/4 = 0.5. Also, P [R = 1 | B = 2] = P [R = 3 | B =
2] = 3/6 = 0.5. Thus, knowing the value of B does not change the probability of
the different values of R. We say that for this experiment, R and B are independent.
Here, the value of R tells us something about which marble you picked, but that
information does not change the probability of the different values of B.
In contrast, for the marbles in Fig. A.1, we saw that P [R = 1 | B = 2] = 0
and P [R = 1 | B = 1] = 2/7 so that, for that experiment, R and B are not
independent: knowing the value of B changes the probability of R = 1. That is, B
tells you something about R. This is rather common. The temperature in Berkeley
tells us something about the temperature in San Francisco. If it rains in Berkeley, it
is likely to rain in San Francisco.
This fact, that observations tell us something about what we do not observe directly,
is central in applied probability. It is at the core of data science. What information
do we get from data? We explore this question later in this appendix.
Summarizing, we say that B and R are independent if P[R = r | B = b] = P(R = r) for all values of r and b. In view of (A.2), B and R are independent if
P(B = b, R = r) = P(B = b) × P(R = r) for all values of r and b.
Fig. A.2 Ten other marbles marked with a blue and a red number: (1, 1), (1, 1), (1, 3), (1, 3), (2, 1), (2, 1), (2, 1), (2, 3), (2, 3), (2, 3)
For example, flip a fair coin ten times and let X be determined by the first four flips and Y by the last six flips. Suppose that the event A = {X = x} corresponds to a outcomes of the first four flips and that the event B = {Y = y} corresponds to b outcomes of the last six flips. Then
P(A) = (a × 2^6)/2^10 = a/2^4  and  P(B) = (2^4 × b)/2^10 = b/2^6.
Moreover,
P(A ∩ B) = (a × b)/2^10.
Hence,
P(A ∩ B) = P(A) × P(B).
If you look back at the calculation, you will notice that it boils down to the area of a
rectangle being the product of the sides. The key observation is then that the set of
outcomes where X = x and Y = y is a rectangle. This is so because X = x imposes
a constraint on the first four flips and Y = y imposes a constraint on the other flips.
A.5 Expectation
Going back to the marbles of Fig. A.1, what do you expect the value of B to be? How
much would you be willing to pay to pick a marble given that you get the value of B
in dollars? An intuitive argument is that if you were to repeat the experiment 1000
times, you should get a marble with B = 1 about 70% of the time, i.e., about 700
times. The other 300 times, you would get B = 2. The total amount should then be
1 × 700 + 2 × 300. The average value per experiment is then
(1 × 700 + 2 × 300)/1000 = 1 × 0.7 + 2 × 0.3.
We call this number the expected value of B and we write it E(B). Similarly, we define
E(R) = 1 × 0.2 + 3 × 0.3 + 4 × 0.5 = 3.1.
Thus, the expected value is defined as the sum of the values multiplied by
their probability. The interpretation we gave by considering the experiment being
repeated a large number of times is only an interpretation, for now.
Reviewing the argument, and extending it somewhat, let us assume that we have
N marbles marked with a number X that takes the possible values {x1 , x2 , . . . , xM }
and that a fraction pm of the marbles are marked with X = xm , for m = 1, . . . , M.
Then we write P (X = xm ) = pm for m = 1, . . . , M. We define the expected value
of X as
E(X) = Σ_{m=1}^{M} x_m p_m = Σ_{m=1}^{M} x_m P(X = x_m).   (A.4)
Consider a random variable X that is equal to the same constant x for every
outcome. For instance, X could be a number on a marble when all the marbles are
marked with the same number x. In this case,
E(X) = x × P (X = x) = x.
Thus, the expected value of a constant is the constant. For instance, if we designate
by a a random variable that always takes the value a, then we have
E(a) = a.
There is a slightly different but very useful way to compute the expectation. We
can write E(X) as the sum over all the possible marbles we could pick of the product
of X for that marble times the probability 1/N that we pick that particular marble.
Doing this, we have
E(X) = Σ_{n=1}^{N} X(n) × (1/N),
where X(n) is the value of X for marble n. This expression gives the same value as
the previous calculation. Indeed, in this sum there are pm N terms with X(n) = xm
because we know that a fraction pm of the N marbles, i.e., pm N marbles, are marked
with X = xm . Hence, the sum above is equal to
Σ_{m=1}^{M} (1/N)(p_m N) x_m = Σ_{m=1}^{M} p_m x_m,
which is the expected value of X as defined in (A.4). This second expression makes some properties of expectation easy to see. For instance, for the marbles of Fig. A.1,
E(B + R) = (1 + 1)(1/10) + · · · + (2 + 4)(1/10).
If we decompose this sum by regrouping the values of B and then those of R, we
see that
E(B + R) = [1 × (1/10) + · · · + 2 × (1/10)] + [1 × (1/10) + · · · + 4 × (1/10)].
The first sum is E(B) and the second is E(R). Thus, the expected value of a sum is
the sum of the expected values. We say that expectation is linear. Notice that this is
so even though the values B and R are not independent.
More generally, for our N marbles, if marble n is marked with two numbers X(n)
and Y (n), then we see that
E(X + Y) = (1/N) Σ_{n=1}^{N} (X(n) + Y(n)) = (1/N) Σ_{n=1}^{N} X(n) + (1/N) Σ_{n=1}^{N} Y(n) = E(X) + E(Y).   (A.5)
Linearity shows that if we get 5 + 3X² + 4Y³ when we pick a marble marked with the numbers X and Y, then
E(5 + 3X² + 4Y³) = 5 + 3E(X²) + 4E(Y³).
Indeed,
E(5 + 3X² + 4Y³) = (1/N) Σ_{n=1}^{N} (5 + 3X²(n) + 4Y³(n)) = 5 + 3E(X²) + 4E(Y³).
Similarly, assume now that X and Y are independent. Then
E(XY) = (1/N) Σ_{n=1}^{N} X(n)Y(n) = Σ_i Σ_j x_i y_j N(i, j) (1/N),
where N (i, j ) is the number of marbles marked with (xi , yj ). We obtained the
last term by regrouping the terms based on the values of X(n) and Y (n). Now,
N (i, j )/N = P (X = xi , Y = yj ). Also, by independence, P (X = xi , Y = yj ) =
P (X = xi )P (Y = yj ). Thus, we can write the sum above as follows:
E(XY) = Σ_i Σ_j x_i y_j P(X = x_i, Y = y_j) = Σ_i Σ_j x_i y_j P(X = x_i) P(Y = y_j).
We now compute the sum on the right by first summing over j . We get
E(XY) = Σ_i [ Σ_j x_i y_j P(X = x_i) P(Y = y_j) ] = [ Σ_i x_i P(X = x_i) ] × [ Σ_j y_j P(Y = y_j) ] = E(X) E(Y),
as claimed. Thus, if X and Y are independent, the expected value of their product is
the product of their expected values.
We can check that this property does not generally hold if the random variables are not independent. For instance, consider R and B in Fig. A.1. We find that E(RB) = 4.2, whereas E(R)E(B) = 3.1 × 1.3 = 4.03.
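A quick numerical check of these marble computations, using the (B, R) pairs of Fig. A.1 (a short Python sketch, not part of the original text):

marbles = [(1,1),(1,1),(1,3),(1,3),(1,4),(1,4),(1,4),(2,3),(2,4),(2,4)]
N = len(marbles)
EB = sum(b for b, r in marbles) / N          # 1.3
ER = sum(r for b, r in marbles) / N          # 3.1
EBR = sum(b * r for b, r in marbles) / N     # 4.2, not equal to EB * ER = 4.03
print(EB, ER, EBR, EB * ER)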
Summing up, we saw the following results about expectation:
• expectation is linear
• expectation is monotone
• the expected value of the product of two independent random variables is the product of their expected values.
A.6 Variance
The variance of a random variable X is defined as var(X) = E((X − E(X))²). It measures the variability around the mean: if the variance is small, the random variable is likely to be close to its mean.
By linearity of expectation, we have
E((X − E(X))²) = E(X²) − 2E(X)E(X) + E(X)².
Hence,
var(X) = E(X²) − E(X)².
Similarly, for a constant a one finds var(aX) = E(a²X²) − (aE(X))² = a² var(X).
Now, assume that X and Y are independent random variables. Then we find that
var(X + Y) = var(X) + var(Y),
where the last expression results from the fact that E(XY) = E(X)E(Y) when the random variables are independent.
The square root of the variance is called the standard deviation.
Summing up, we saw the following results about variance:
• when one multiplies a random variable by a constant, its variance gets multiplied
by the square of the constant
• the variance of the sum of independent random variables is the sum of their
variances
• the standard deviation of a random variable is the square root of its variance.
A.7 Inequalities
The fact that expectation is monotone yields some inequalities that are useful to
bound the probability that a random variable takes large values. Intuitively, if a
random variable is likely to take large values, its expected value is large.
The simplest such inequality is as follows. Let X be a random variable that is
always non-negative, then
P(X ≥ a) ≤ E(X)/a, for a > 0.
This is called Markov’s inequality. To prove it, we define the random variable Y as
being 0 when X < a and 1 when X ≥ a. Hence, E(Y) = P(Y = 1) = P(X ≥ a).
We note that Y ≤ X/a. Indeed, that inequality is immediate if X < a because then
Y = 0 and X/a > 0. It is also immediate when X ≥ a because then Y = 1 and
X/a ≥ 1. Consequently, by monotonicity of expectation, E(Y ) ≤ E(X/a). Hence,
P(X ≥ a) = E(Y) ≤ E(X/a) = E(X)/a, as claimed.
A related result is Chebyshev's inequality, obtained by applying Markov's inequality to the nonnegative random variable (X − E(X))²:
P(|X − E(X)| ≥ ε) ≤ var(X)/ε², for ε > 0.
As an application, let X1, . . . , Xn be independent random variables with the same mean μ and the same variance σ², and let Y = (X1 + · · · + Xn)/n denote their average. Then
E(Y) = E[(X1 + · · · + Xn)/n] = nμ/n = μ.
Also,
var(Y) = (1/n²) var(X1 + · · · + Xn) = (1/n²) × nσ² = σ²/n.
Consequently, Chebyshev's inequality implies that
P(|Y − μ| ≥ ε) ≤ σ²/(nε²).
This probability becomes arbitrarily small as n increases. Thus, if Y is the
average of n random variables that are independent and have the same mean μ
and the same variance, then Y is very close to μ, with a high probability, when n is
large. This is called the Weak Law of Large Numbers.
Note that this result extends to the case where the random variables are
independent, have the same mean, and have a variance bounded by some σ².
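The following minimal Python sketch (not from the text) illustrates the weak law for fair coin flips; it compares the empirical frequency of the event {|Y − 0.5| ≥ 0.05} with the Chebyshev bound σ²/(nε²) = 100/n (the run counts are illustrative):

import random

random.seed(0)
for n in (10, 100, 1000, 10000):
    runs = 2000
    bad = sum(abs(sum(random.random() < 0.5 for _ in range(n)) / n - 0.5) >= 0.05
              for _ in range(runs))
    # Chebyshev: P(|Y - 0.5| >= 0.05) <= 0.25/(n * 0.05**2) = 100/n
    print(n, bad / runs, min(1.0, 100 / n))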
Consider once again the N marbles with the numbers (X, Y). We define the covariance of X and Y as
cov(X, Y) := E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).
Intuitively, the covariance indicates how X and Y vary together, and it suggests that observing X should help us estimate Y. Suppose that we use a linear estimate of the form Ŷ = E(Y) + a + b(X − E(X)). To make this precise, we look for the values of a and b that minimize E((Y − Ŷ)²). That is, we want the error Y − Ŷ to be small, on average. We consider the square of the error because this is much easier to analyze than choosing the absolute value of the error.
Now,
E((Y − Ŷ)²) = E(([Y − E(Y)] − a − b[X − E(X)])²) = var(Y) + a² + b² var(X) − 2b cov(X, Y).
To do the calculation, we used the linearity of expectation and the facts that the expected values of X − E(X) and Y − E(Y) are equal to zero. To minimize this expression over a, we should choose a = 0. To minimize it over b, we set the derivative with respect to b equal to zero and we find
2b var(X) − 2 cov(X, Y) = 0,
so that b = cov(X, Y)/var(X). Hence,
Ŷ = E(Y) + [cov(X, Y)/var(X)] (X − E(X)).
We call Ŷ the Linear Least Squares Estimate (LLSE) of Y given X. It is the linear
function of X that minimizes the mean squared error with Y .
As an example, consider again the N marbles. There,
E(X) = (1/N) Σ_{n=1}^{N} X(n),   E(Y) = (1/N) Σ_{n=1}^{N} Y(n),
var(X) = (1/N) Σ_{n=1}^{N} X²(n) − [E(X)]²,
var(Y) = (1/N) Σ_{n=1}^{N} Y²(n) − [E(Y)]²,
cov(X, Y) = (1/N) Σ_{n=1}^{N} X(n)Y(n) − E(X)E(Y).
In this case, one calls the resulting expression for Ŷ the linear regression of Y
against X. The linear regression is the same as the LLSE when one considers that
(X, Y ) are random variables that are equal to a given sample (X(n), Y (n)) with
probability 1/N for n = 1, . . . , N . That is, to compute the linear regression, one
assumes that the sample values one has observed are representative of the random
pair (X, Y ) and that each sample is an equally likely value for the random pair.
A.10 Why Do We Need a More Sophisticated Formalism?
The previous sections show that one can get quite far in the discussion of probability
concepts by considering a finite set of marbles, and random variables that can only
take finitely many values. In engineering, one might think that this is enough. One
may approximate any sensible quantity with a finite number of bits, so for all
applications one may consider that there are only finitely many possibilities. All
that is true, but it results in clumsy models. For instance, try to write the equations
of a falling object with discretized variables. The continuous versions are usually
simpler than the discrete ones. As another example, a Gaussian random variable is
easier to work with than a binomial or Poisson random variable, as we will see.
Thus we need to extend the model to random variables that have an infinite, even
an uncountable set of possible values. Does this step cause formidable difficulties?
Not at the intuitive level. The continuous version is a natural extension of the
discrete case. However, there is a philosophical difficulty in going from discrete
to continuous. Some thinkers do not accept the idea of making an infinite number
of choices before moving on. That is, say that we are given an infinite collection
{A1 , A2 , . . .} of nonempty sets. Can we reasonably define a new set B that contains
one element from each An ? We can define it, but if there is no finite way of building
it, can we assume that it exists? A theory that does not rely on this axiom of choice
is considerably more complex than those that do. The classical theory of probability
(due to Kolmogorov) accepts the axiom of choice, and we follow that theory.
One key axiom of probability theory enables one to define the probability of a set A
of outcomes as a limit of that of simpler sets An that approach A. This is similar
to approximating the area of a circle by the sum of the areas of disjoint rectangles
that approach it from inside, or approximating an integral by a sum of rectangles.
This key axiom says that if A1 ⊂ A2 ⊂ A3 ⊂ · · · and A = ∪n An , then P (A) =
lim P (An ). Thus, if sets An approximate A from inside in the sense that these sets
grow and eventually contain every point of A, then the probability of A is equal to
the limit of the probability of An . This is a natural way of extending the definition
of probability of simple sets to more complex ones. The trick is to show that this
is a consistent definition in the sense that different approximating sequences of sets
must have the same limiting probability.
This key axiom enables one to prove the strong law of large numbers. That law states
that as you keep on flipping coins, the fraction of heads converges to the probability
that one coin yields heads. Thus, not only is the fraction of heads very likely to be
close to that probability when you flip many coins, but in fact the fraction gets closer
and closer to that probability. This property justifies the frequentist interpretation of
probability of an event as the long-term fraction of time that event occurs when one
repeats the experiment. This is the interpretation that we used to justify the definition
of expected value.
A.11 References
There are many useful texts and websites on elementary probability. Readers might
find Walrand (2019) worthwhile, especially since it is free on Kindle.
A.12 Solved Problems
Problem A.1 You have a bag with 20 red marbles and 30 blue marbles. You shake
the bag and pick three marbles, one at a time, without replacement. What is the
probability that the third marble is red?
Solution As is often the case, there is a difficult and an easy way to solve this
problem. The difficult way is to consider the first marble, then find the probability
that the second marble is red or blue given the color of the first marble, then find
the probability that the third marble is red given the colors of the first two marbles.
The easy way is to notice that, by symmetry, the probability that the third marble
is red is the same as the probability that the first marble is red, which is 20/50 = 0.4.
It may be useful to make the symmetry argument explicit. Think of the marbles as
being numbered from 1 to 50. Imagine that shaking the bag results in some ordering
in which the marbles would be picked one by one out of the bag. All the orderings
are equally likely. Now think of interchanging marble one and marble three in each
ordering. You end up with a new set of orderings that are again equally likely. In
this new ordering, the third marble is the first one to get out of the bag. Thus, the
probability that the third marble is red is the same as the probability that the first
marble is red.
Problem A.2 Your applied probability class has 275 students who all turn in their
homework assignment. The professor returns the graded assignments in a random
order to the students. What is the expected number of students who get their own
assignment back?
Solution The difficult way to solve the problem is to consider the first assignment,
then the second, and so on, and for each to explore what happens if it is returned to
its owner or not. The probability that one student gets her assignment back depends
on what happened to the other students. It all seems very complicated.
The easy way is to argue that, by symmetry, the probability that any given student
gets his assignment back is the probability that the first one gets his assignment
back, which is 1/275. Let then Xn = 1 if student n gets his/her own assignment and
Xn = 0 otherwise. Thus, E(Xn ) = 1/275. The number of students who get their
assignment back is X1 + · · · + X275 . Now, by linearity of expectation, E(X1 + · · · +
X275 ) = E(X1 ) + · · · + E(X275 ) = 275 × (1/275) = 1.
Problem A.3 A monkey types a string of one million symbols, each picked independently and uniformly at random from a keyboard with 40 symbols. What is the expected number of times that the name "walrand" appears in the string?
Solution The easy solution uses the linearity of expectation. Let Xn = 1 if the name
“walrand” appears in the sequence, starting at the n-th symbol of the string. The
number of times that the name appears is then Z = X1 + · · · + XN with N = 10^6 − 6. By symmetry, E(Xn) = E(X1) for all n. Now, the probability that X1 = 1 is equal to the probability that the first symbol is w, that the second symbol is a, and so on. Thus, E(X1) = P(X1 = 1) = (1/40)^7. Hence, the expected number of times that "walrand" appears is E(Z) = (10^6 − 6) × (1/40)^7 ≈ 6 × 10^{−6}. So, it is true that
a monkey could eventually type one of Shakespeare’s plays, but he is likely to die
before succeeding.
Note that Markov’s inequality implies that
P(Z ≥ 1) ≤ E(Z) ≈ 6 × 10^{−6}.
Problem A.4 You flip a fair coin n times and the fraction of heads is Y . How large
does n have to be to be sure that P (|Y − 0.5| ≥ 0.05) ≤ 0.05?
Solution By Chebyshev's inequality,
P(|Y − 0.5| ≥ ε) ≤ var(Y)/ε².
We saw in our discussion of the weak law of large numbers that var(Y ) =
var(X1 )/n where X1 = 1 if the first coin yields heads and X1 = 0 otherwise. Since
P(X1 = 1) = P(X1 = 0) = 0.5, we find that E(X1) = 0.5 and E(X1²) = 0.5.
Hence, var(X1) = E(X1²) − [E(X1)]² = 0.25. Consequently,
P(|Y − 0.5| ≥ 0.05) ≤ 0.25/(n × 25 × 10⁻⁴) = 100/n.
This bound is at most 0.05 as soon as n ≥ 2000.
Problem A.5 What is the probability that two friends share a birthday? What about
three friends? What about n? How large does n have to be for this probability to be
50%?
Solution In the case of two friends, it is the probability that the second has the same
birthday as the first, which is 1/365 (ignoring February 29).
The case of three friends looks more complicated: two of the three or all of them
could share a birthday. It is simpler to look at the probability that they do not share
a birthday. This is the probability that the second friend does not have the same
birthday as the first, which is 364/365 times the probability that the third does not
share a birthday with the first two, which is 363/365. Let us explore this a bit further
to make sure we fully understand this solution. First, we consider all the strings of
three numbers picked in {1, 2, . . . , 365}. There are 365³ such strings because there
are 365 choices for the first number, then 365 for the second, and finally 365 for
the third. Second, consider the strings of three different numbers from the same set.
There are 365 choices for the first, then 364 for the second, then 363 for the third.
Hence, there are 365 × 364 × 363 such strings. Since all the strings are equally
likely to be picked (a reasonable assumption), the probability that the friends do not
share a birthday is
(365 × 364 × 363)/365³ = (364/365) × (363/365).
The case of n friends is then clear: they do not share a birthday with probability p, where
p = (1 − 1/365) × (1 − 2/365) × · · · × (1 − (n − 1)/365).
To evaluate this expression, we use the fact that 1 − x ≈ exp{−x} when |x| ≪ 1. We use this fact repeatedly in this book. Do not worry, there are not too many
such tricks. In practice, this approximation is good for |x| < 0.1. For instance,
exp{−0.1} ≈ 0.90483. Thus, assuming that n/365 ≤ 0.1, i.e., n ≤ 36, we find
p ≈ 1 × exp{−1/365} × exp{−2/365} × · · · × exp{−(n − 1)/365}
= exp{−1/365 − 2/365 − · · · − (n − 1)/365} = exp{−(1 + 2 + · · · + (n − 1))/365}
= exp{−(n − 1)n/730}.
Hence, the probability that at least two friends in a group of 24 share a birthday
is about 50%. This result is somewhat surprising because 24 is small compared to
365. One calls this observation the birthday paradox. Many people think that it takes
about 365/2 ≈ 180 friends for the probability that they share a birthday to be 50%.
The paradox is less mysterious when you think of the many ways that friends can
share birthdays.
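A short Python sketch (not from the text) that compares the exact product formula for p with the exponential approximation used above:

from math import exp

def p_no_shared(n, days=365):
    p = 1.0
    for k in range(1, n):
        p *= (days - k) / days
    return p

for n in (10, 20, 23, 24, 30):
    # probability that at least two of n friends share a birthday: exact, then approximate
    print(n, 1 - p_no_shared(n), 1 - exp(-(n - 1) * n / 730))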
Problem A.6 You throw M marbles into B bins, each time independently and in
a way that each marble is equally likely to fall into each bin. What is the expected
number of empty bins? What is the probability that no bin contains more than one
marble?
Solution The first bin is empty with probability α := [(B − 1)/B]^M, and the same is true for every bin. Hence, if Xb = 1 when bin b is empty and Xb = 0 otherwise, we see that E(Xb) = α. Hence, the expected value of the number Z = X1 + · · · + XB of empty bins is equal to
E(Z) = Bα = B[(B − 1)/B]^M = B[1 − 1/B]^M.
Using the approximation 1 − 1/B ≈ exp{−1/B} and writing β := M/B, we find that
E(Z) ≈ B exp{−β}.
For instance, with M = 20 and B = 30, one has β = 2/3 and E(Z) ≈
30 exp{−2/3} ≈ 15. That is, the 20 marbles are likely to fall into 15 of the 30
bins.
The probability that no bins contain more than one marble is the same as the
probability that no two friends share a birthday when there are B different days and
M friends. We saw in the last problem that this is given by
exp{−M(M − 1)/(2B)} ≈ exp{−M²/(2B)}.
Problem A.7 You store M files. For each file, you compute a b-bit checksum and attach the checksum to the file. When you read the file, you recompute the checksum and
you compare with the one attached to the file. If the checksums agree, you assume
that no storage/retrieval error occurred. How large can M be before the probability
that two files share a checksum exceeds 10⁻⁶? A similar scheme is used as a digital
signature to make sure that files are not modified.
Solution There are B = 2^b possible checksums. Let us assume that each file is equally likely to get any one of the B checksums. In view of the previous problem,
we want to find M such that
M2
exp − = 10−6 = exp{−6 log(10)} ≈ exp{−14}.
2B
√
Thus, M 2 /(2B) = 14, so that M 2 = 28B = 28 × 2b and M = 2b/2 28 ≈
5.3 × 2b/2 . With b = 32, we find M ≈ 350,000.
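A small Python sketch of this calculation for a few checksum sizes b (the formula solves exp{−M²/(2B)} = 10⁻⁶ for M; the set of b values is illustrative):

from math import log, sqrt

for b in (32, 64, 128):
    B = 2**b
    M = sqrt(2 * B * 6 * log(10))     # largest M keeping the collision probability below 1e-6
    print(b, int(M))                  # for b = 32 this is about 350,000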
Problem A.8 N people apply for a job with your company. You will interview
them sequentially but you must either hire or decline a person right at the end of
the interview. How should you proceed to maximize the chance of picking the best
of all the candidates? Implicitly, we assume that the qualities of the candidates are
all independent and equally likely to be any number in {1, . . . , Q} where Q is very
large.
Solution The best strategy is to interview and decline about M = N/e candidates
and then hire the first subsequent candidate who is better than those M. Here, e =
exp{1} ≈ 2.72. If no candidate among {M + 1, . . . , N} is better than the first M,
you hire the last candidate.
To justify this procedure, we compute the probability that the candidate you select
is the best, for a given value of M. By symmetry, the best candidate appears in
position b with probability 1/N. You then pick the best candidate if b > M and if
the best candidate among the first b − 1 is among the first M, which has probability
M/(b − 1), by symmetry. Since probability is additive, the probability p that you
pick the best candidate is given by
p = Σ_{b=M+1}^{N} (1/N)(M/(b − 1)) = (M/N) Σ_{b=M}^{N−1} (1/b) ≈ (M/N) ∫_M^N (1/b) db = (M/N)[log(N) − log(M)].
To find the maximizing value of M, we set the derivative of this expression with
respect to M equal to zero. This shows that N/M ≈ e.
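A minimal Python simulation sketch (not from the text) of this strategy, with N = 100 candidates as an illustrative value; the estimated success probability is close to 1/e ≈ 0.37:

import random
from math import e

def success(N, M):
    ranks = list(range(N))            # rank N-1 is the best candidate
    random.shuffle(ranks)
    best_seen = max(ranks[:M])
    for r in ranks[M:]:
        if r > best_seen:
            return r == N - 1         # hire the first candidate better than the first M
    return ranks[-1] == N - 1         # otherwise hire the last candidate

random.seed(0)
N = 100
M = round(N / e)
print(sum(success(N, M) for _ in range(20000)) / 20000)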
B Basic Probability
The general model of Probability Theory may seem a bit abstract and disconcerting.
However, it unifies all the key ideas in a systematic framework and results in a great
conceptual clarity. You should try to keep in mind this underlying framework when
we discuss concrete examples.
B.1 General Framework
To describe a random experiment, one first specifies the set Ω of all the possible
outcomes. This set is called the sample space. For instance, when we flip a coin, the
sample space is Ω = {H, T }; when we roll a die, Ω = {1, 2, 3, 4, 5, 6}, when one
measures a voltage one may have Ω = ℝ = (−∞, +∞); and so on.
Second, one specifies the probability that the outcome falls in subsets of Ω. That
is, for A ⊂ Ω, one specifies a number P (A) ∈ [0, 1] that represents the likelihood
that the random experiment yields an outcome in A. For instance, when rolling a
die, the probability that the outcome is in a set A ⊆ {1, 2, 3, 4, 5, 6} is given by
P (A) = |A|/6 where |A| is the number of elements of A. When we measure a
voltage, the probability that it has any given value is typically 0, but the probability
that it is less than 15 in absolute value may be 95%, which is why we specify the
probability of subsets, not of specific outcomes.
Theorem (Borel–Cantelli) Let {An, n ≥ 1} be events such that Σ_n P(An) < ∞. Then
P(An, i.o.) = 0.
Here, {An , i.o.} is defined as the set of outcomes ω that are in infinitely many
sets An . So, stating that the probability of this set is equal to zero means that
the probability that the events An occur for infinitely many n’s is zero. So, the
probability that the events An occur infinitely often is equal to zero. In other words,
for any outcome ω that occurs, there is some m such that An does not occur for any
n larger than m.
Proof Observe that
{An, i.o.} = ∩_n Bn =: B,
where Bn = ∪m≥n Am is a decreasing sequence of sets. To see this, note that the
outcome ω is in infinitely many sets An , i.e., that ω ∈ {An , i.o.}, if and only if for
every n, the outcome ω is in some Am for m ≥ n. Also, ω is in ∪m≥n Am = Bn for
all n if and only if ω is in ∩n Bn = B. Hence ω ∈ {An , i.o.} if and only if ω ∈ B.
Now, B1 ⊇ B2 ⊇ · · · , so that P(Bn) → P(B). Also, P(Bn) ≤ Σ_{m=n}^{∞} P(Am), so that¹
P(Bn) → 0 as n → ∞.
Thus, P(B) = 0, as claimed.
B.1.3 Independence
It is easy to construct events that are pairwise independent but not mutually
independent. For instance, let Ω = {1, 2, 3, 4} where the four outcomes are equally
likely and let A = {1, 2}, B = {1, 3}, and C = {1, 4}. You can check that these
events are pairwise independent but not mutually independent since P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C).
¹ Recall that if the nonnegative numbers an are such that Σ_{n=0}^{∞} an < ∞, then Σ_{m=n}^{∞} am goes to zero as n → ∞.
Hence, {An, i.o.}^c = ∪_n B_n^c, where B_n^c := ∩_{m≥n} A_m^c.
Thus, to prove the theorem, it suffices to show that P(B_n^c) = 0 for all n. Indeed, if that is the case, then P(∪_{n=1}^{N} B_n^c) ≤ Σ_{n=1}^{N} P(B_n^c) = 0; moreover, the sets ∪_{n=1}^{N} B_n^c increase with N and their union is ∪_n B_n^c, so that P(∪_n B_n^c) = lim_{N→∞} P(∪_{n=1}^{N} B_n^c) = 0.
Now, since the events Am are mutually independent,
P(B_n^c) = P(∩_{m≥n} A_m^c) = lim_{N→∞} Π_{m=n}^{N} P(A_m^c) = lim_{N→∞} Π_{m=n}^{N} [1 − P(Am)]
≤ lim_{N→∞} Π_{m=n}^{N} exp{−P(Am)} = lim_{N→∞} exp{−Σ_{m=n}^{N} P(Am)} = 0.
In this derivation we used the facts that 1 − x ≤ exp{−x} and that Σ_{m=n}^{N} P(Am) → ∞ as N → ∞.
Let A and B be two events. Assume that P (B) > 0. One defines the conditional
probability P [A|B] of A given B as follows:
P[A|B] := P(A ∩ B)/P(B).
The meaning of P [A|B] is the probability that the outcome of the experiment is
in A given that it is in B. As an example, say that a random experiment has 1000
equally likely outcomes. Assume that A contains |A| outcomes and B contains |B|
outcomes. If we know that the outcome is in B, we know that it is equally likely
to be any one of these |B| outcomes. Given that information, the probability that
the outcome is in A is then the fraction of outcomes in B that are also in A. This
fraction is
|A ∩ B| |A ∩ B|/1000 P (A ∩ B)
= = .
|B| |B|/1000 P (B)
Note that the definition implies that if A and B are independent, then P [A|B] =
P (A), which makes intuitive sense. Also,
P (A ∩ B) = P [A|B]P (B).
This expression extends to more than two events. For instance, with events {A1, . . . , An} one has
P(A1 ∩ · · · ∩ An) = P(A1) P[A2 | A1] P[A3 | A1 ∩ A2] · · · P[An | A1 ∩ · · · ∩ An−1].
Indeed, if one expands each conditional probability as a ratio, the product telescopes and is equal to the left-hand side of the identity above.
The interpretation is the natural one: the probability that X ∈ B is the probability
that the outcome ω is such that X(ω) ∈ B.
In particular, one defines the cumulative distribution function (cdf) of the random
variable X as FX (x) = P (X ∈ (−∞, x]) =: P (X ≤ x). This function is
nondecreasing and right-continuous; it tends to zero as x → −∞ and to one as
x → +∞.
Figure B.1 summarizes this general framework for one random variable.
B.2.1 Definition
A discrete random variable X is specified by the list of its possible values and their probabilities:
X ≡ {(xn, pn), n = 1, 2, . . . , N}.   (B.1)
Here, the xn are real numbers and the pn are positive and add up to one. By
definition, pn is the probability that X takes the value xn and we write
pn = P (X = xn ), n = 1, . . . , N.
The number of values N can be infinite. This list is called the probability mass
function (pmf) of the random variable X.
As an example,
V ≡ {(1, 0.1), (2, 0.3), (3, 0.6)}
is a random variable that has three possible values (1, 2, 3) and takes these values with probability 0.1, 0.3, and 0.6, respectively. Equivalently, one can write
V = 1 with probability 0.1; V = 2 with probability 0.3; V = 3 with probability 0.6.
P (X = xn ) = P ({ω ∈ Ω|X(ω) = xn }) = pn .
B.2.2 Expectation
The expected value, or mean, of the random variable X is denoted E(X) and is
defined as (Fig. B.2)
E(X) = Σ_{n=1}^{N} x_n p_n.
In our example, E(V) = 1 × 0.1 + 2 × 0.3 + 3 × 0.6 = 2.5.
When N is infinite, the definition makes sense unless the sum of the positive
terms and that of the negative terms are both infinite. In such a case, one says that
X does not have an expected value.
It is a simple exercise to verify that the number a that minimizes E((X − a)²) is
a = E(X). Thus, the mean is the “least squares estimate” of X.
B.2.3 Function of a RV
Let h be a function from ℝ to ℝ. Then Y = h(X) is the discrete random variable with pmf
{(h(xn), pn), n = 1, . . . , N}.
Note that the values h(xn) may not be distinct, so that to conform to our definition of the pmf one should merge identical values and add their probabilities.
For instance, say that h(1) = h(2) = 10 and h(3) = 15. Then
h(V) ≡ {(10, 0.4), (15, 0.6)},
where we merged the two values h(1) and h(2) because they are equal to 10. Thus,
E(h(V)) = 10 × 0.4 + 15 × 0.6 = 13.
Observe that
E(h(V)) = Σ_{n=1}^{N} h(vn) pn,
since
Σ_{n=1}^{3} h(vn) pn = h(1) × 0.1 + h(2) × 0.3 + h(3) × 0.6 = 10 × 0.4 + 15 × 0.6 = E(h(V)).
In general, for a discrete random variable X,
E(h(X)) = Σ_{n=1}^{N} h(xn) pn.
B.2.4 Nonnegative RV
We say that X is nonnegative, and we write X ≥ 0, if all its possible values xn are
nonnegative. As before,
E(h1(X) + h2(X)) = Σ_{n=1}^{N} (h1(xn) + h2(xn)) pn = E(h1(X)) + E(h2(X)).
By X ≥ 0 we mean that all the possible values of X are nonnegative, i.e., that
X(ω) ≥ 0 for all ω. In that case, E(X) ≥ 0 since E(X) = Σ_n xn P(X = xn) and
all the xn are nonnegative.
We also write X ≤ Y if X(ω) ≤ Y(ω) for all ω. The linearity of expectation then implies that E(X) ≤ E(Y), since 0 ≤ E(Y − X) = E(Y) − E(X). Hence, expectation is monotone.
Bernoulli We say that X is Bernoulli with parameter p ∈ [0, 1], and we write
X =D B(p), if2
i.e., if
P (X = 0) = 1 − p and P (X = 1) = p.
You should check that E(X) = p and var(X) = p(1 − p). This random variable
models a coin flip where 1 represents “heads” and 0 “tails.”
Geometric We say that X is geometric with parameter p ∈ (0, 1], and we write X =D G(p), if
P(X = n) = (1 − p)^{n−1} p, n ≥ 1.
You should check that E(X) = 1/p and var(X) = (1 − p)/p². This random variable models the number of coin flips until the first "heads" when the probability of heads on each flip is p.
Fig. B.6 The probability mass function of the B(100, p) distribution, for p = 0.1, 0.2, and 0.5
Binomial We say that X is binomial with parameters N and p, and we write X =D B(N, p), if
P(X = n) = (N choose n) p^n (1 − p)^{N−n}, n = 0, 1, . . . , N,
where
(N choose n) = N!/((N − n)! n!).
You should verify that E(X) = Np and var(X) = Np(1 − p). This random variable
models the number of heads in N coin flips; it is the sum of N independent Bernoulli
random variables with parameter p. Indeed, there are (N choose n) strings of N symbols in {H, T} with n symbols H and N − n symbols T. The probability of each of these sequences is p^n (1 − p)^{N−n} (Figs. B.6 and B.7).
Fig. B.7 The binomial distribution as a sum of Bernoulli random variables. At each step, every steel ball moves to the left or to the right with equal probabilities, i.e., by 2Xn − 1 where Xn is Bernoulli with parameter 0.5. The position after N steps is Y = Σ_{n=1}^{N} (2Xn − 1) = 2B(N, 0.5) − N. After M balls, the stacks show approximately the values of M × P(Y = y) for integer y's
B.3 Multiple Discrete Random Variables
Quite often one is interested in multiple random variables. These random variables
may be related. For instance, the weight and height of a person, the voltage that a
transmitter sends and the one that the receiver gets, and the backlog and delay at a
queue are pairs of non-independent random variables (Fig. B.9).
To study such dependent random variables, one needs a description more complete
than simply looking at the random variables individually. Consider the following
example. Roll a die and let X = 1 if the outcome is odd and X = 0 otherwise.
Let also Y = 1 if the outcome is in {2, 3, 4} and Y = 0 if it is in {1, 5, 6}. Note
that P (X = 1) = P (X = 0) = 0.5 and P (Y = 1) = P (Y = 0) = 0.5.
Thus, individually, X and Y could describe the outcomes of flipping two fair coins.
However, jointly, the pair (X, Y ) does not look like the outcomes of two coin flips.
For instance, X = 1 and Y = 1 only if the outcome is 3, which has probability 1/6.
If X and Y were the outcomes of two flips of a fair coin, one would have X = 1 and
Y = 1 in one out of four equally likely outcomes.
In the discrete case, one describes a pair (X, Y ) of random variables by listing
the possible values and their probabilities (see Fig. B.10):
(X, Y) ≡ {((xi, yj), pi,j), i = 1, . . . , m; j = 1, . . . , n},
where the pi,j are nonnegative and add up to one. Here, m and n can be infinite.
This description specifies the joint probability mass function (jpmf) of the random
variables (X, Y ). (See Fig. B.10.)
From this description, one can in particular recover the probability mass of X
and that of Y . For instance,
P(X = xi) = Σ_{j=1}^{n} P(X = xi, Y = yj) = Σ_{j=1}^{n} pi,j.
B.3.2 Independence
We say that the discrete random variables X and Y are independent if P(X = xi, Y = yj) = P(X = xi)P(Y = yj) for all i, j. In the die-roll example above,
P(X = 1, Y = 1) = 1/6 ≠ P(X = 1)P(Y = 1) = 1/4,
so that X and Y are not independent (Fig. B.11).
For a function h(X, Y) of the pair, one has
E(h(X, Y)) = Σ_{i=1}^{m} Σ_{j=1}^{n} h(xi, yj) pi,j.
In particular, if h = h1 + h2, then
E(h(X, Y)) = Σ_{i=1}^{m} Σ_{j=1}^{n} h(xi, yj) pi,j
= Σ_{i=1}^{m} Σ_{j=1}^{n} [h1(xi, yj) + h2(xi, yj)] pi,j
= Σ_{i=1}^{m} Σ_{j=1}^{n} h1(xi, yj) pi,j + Σ_{i=1}^{m} Σ_{j=1}^{n} h2(xi, yj) pi,j
= E(h1(X, Y)) + E(h2(X, Y)),
so that expectation is again linear.
B.3.4 Covariance
One defines the covariance of X and Y as
cov(X, Y) := E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).
One says that X and Y are uncorrelated if cov(X, Y) = 0. One says that X and Y are positively correlated if cov(X, Y) > 0 and that they are negatively correlated if cov(X, Y) < 0 (Fig. B.12).
In the die roll example, one finds
cov(X, Y) = E(XY) − E(X)E(Y) = 1/6 − 1/4 < 0,
so that X and Y are negatively correlated. This negative correlation suggests that if
X is larger than average, then Y tends to be smaller than average. In our example,
we see that if X = 1, then the outcome is odd and Y is more likely to be 0 than 1.
Here is an important result.
Theorem B.3 (a) If X and Y are independent, then they are uncorrelated. (b) The converse is not true in general: uncorrelated random variables need not be independent. (c) If X and Y are uncorrelated, then var(X + Y) = var(X) + var(Y).
Proof
(b) As a simple example (see Fig. B.13), say that (X, Y) is equally likely to take each of the following four values:
(1, 0), (−1, 0), (0, 1), (0, −1).
Then one sees that E(XY ) = 0 = E(X)E(Y ) so that X and Y are uncorrelated.
However, P(X = −1, Y = 1) = 0 ≠ P(X = −1)P(Y = 1), so that X and Y
are not independent.
(c) Let X and Y be uncorrelated random variables. Then
var(X + Y) = E((X + Y)²) − (E(X) + E(Y))²
= E(X²) + 2E(XY) + E(Y²) − E(X)² − 2E(X)E(Y) − E(Y)²
= var(X) + var(Y).
The third equality in this derivation comes from the fact that E(XY) = E(X)E(Y).
Given two discrete random variables X and Y, one defines the conditional probability that Y = yj given that X = xi as
P[Y = yj | X = xi] = P(X = xi, Y = yj)/P(X = xi).
In the same spirit as Theorem B.3, one has the following result:
B.4 General Random Variables
Not all random variables have a discrete set of possible values. For instance, the
voltage across a phone line, wind speed, temperature, and the time until the next
customer arrives at a cashier have a continuous range of possible values.
In practice, one can always approximate values by choosing a finite number
of bits to represent them. For instance, one can measure temperature in degrees,
ignoring fractions, and fixing a lower and upper bound. Thus, discrete random
variables suffice to describe systems with an arbitrary degree of precision. However,
this discretization is rather artificial and complicates things. For instance, writing
Newton’s equation F = ma where a = dv(t)/dt with discrete variables is
rather bizarre since a discrete speed does not admit a derivative. Hence, although
computers perform all their calculations on discrete variables, the analysis and
derivation of algorithms are often more natural with general variables. Nevertheless,
the approximation intuition is useful and we make use of it.
We start with a definition of a general random variable.
B.4.1 Definitions
(a) The cumulative distribution function (cdf) of X is the function FX (x) defined
by
FX(x) = P(X ≤ x), x ∈ ℝ.
(b) The probability density function (pdf) of X is the function
fX(x) = (d/dx) FX(x),
if this derivative exists.
Thus, P(X ∈ (a, b]) = FX(b) − FX(a) = ∫_a^b fX(x) dx, where the last expression makes sense if the derivative exists. Also, if the pdf exists, FX(x) = ∫_{−∞}^{x} fX(u) du.
B.4.2 Examples
Example B.1 (U [a, b]) As a first example, we say that X is uniformly distributed
in [a, b], for some a < b, and we write X =D U [a, b] if
fX(x) = (1/(b − a)) 1{a ≤ x ≤ b}.
Figure B.14 illustrates the pdf and the cdf of a U [a, b] random variable.
As a second example, we say that X is exponentially distributed with rate λ > 0, and we write X =D Exp(λ), if
fX(x) = λ exp{−λx} 1{x ≥ 0}.
Figure B.15 illustrates this pdf. As before, you can verify that
FX(x) = 1 − exp{−λx} for x ≥ 0,
so that
P(X ≥ x) = exp{−λx}, ∀x ≥ 0.
It may help intuition to realize that a random variable X with cdf FX(·) can be approximated by a discrete random variable Y that takes values in {. . . , −2ε, −ε, 0, ε, 2ε, . . .} with
P(Y = nε) = FX((n + 1)ε) − FX(nε) = P(X ∈ (nε, (n + 1)ε]).
(See Fig. B.16.)
B.4.3 Expectation
One then defines E(h(X)) as the limit of E(h(Y)) = Σ_n h(nε) P(X ∈ (nε, (n + 1)ε]), where the last term is defined as the limit of the sum as ε → 0. If the pdf exists, one sees that
E(h(Y)) ≈ Σ_n h(nε) fX(nε) ε ≈ ∫_{−∞}^{∞} h(x) fX(x) dx.
More generally, one defines
E(h(X)) = ∫_{−∞}^{∞} h(x) dFX(x).
In particular, for X =D U[0, 1] one can compute E(X) = 1/2 and E(X²) = 1/3, so that
var(X) = E(X²) − E(X)² = 1/3 − (1/2)² = 1/12.
Similarly, for X =D Exp(λ),
E(X) = ∫_0^∞ x λ e^{−λx} dx = ∫_0^∞ e^{−λx} dx = −λ^{−1} [e^{−λx}]_0^∞ = λ^{−1}.
Also,
E(X²) = ∫_0^∞ x² λ e^{−λx} dx = −∫_0^∞ x² de^{−λx}
= −[x² e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx²
= 2 ∫_0^∞ x e^{−λx} dx = 2λ^{−2}.
In particular,
var(X) = E(X²) − E(X)² = 2λ^{−2} − λ^{−2} = λ^{−2}.
As another example, consider the following mixed random variable. Flip a biased coin that yields heads with probability 0.4 and tails with probability 0.6. If the outcome of the coin flip is heads, then X = 0.3. If the outcome is tails, then X is picked uniformly in [0, 1]. Then,
FX(x) = 0.6x for 0 ≤ x < 0.3 and FX(x) = 0.4 + 0.6x for 0.3 ≤ x ≤ 1.
This cdf is illustrated in Fig. B.17. We can define the derivative of FX (x) formally
by using the Dirac impulse as the formal derivative of a step function.
For this random variable, one finds that
E(X^k) = ∫_{−∞}^{∞} x^k fX(x) dx = ∫_{−∞}^{∞} x^k 0.4 δ(x − 0.3) dx + ∫_0^1 x^k 0.6 dx
= 0.4 (0.3)^k + 0.6/(k + 1).
We state without proof two useful technical properties of expectation. They address
the following question. Assume that Xn → X as n → ∞. Can we conclude that
E(Xn ) → E(X)? In other words, is expectation “continuous”?
The following counterexample shows that some conditions are needed (see
Fig. B.18). Say that ω is chosen uniformly in [0, 1], so that P ([0, a]) = a for
a ∈ Ω := (0, 1]. Define Xn (ω) = n × 1{ω ≤ 1/n} for n ≥ 1. That is,
Xn (ω) = n if ω ≤ 1/n and Xn (ω) = 0 otherwise. Then P (Xn = n) = 1/n
and P (Xn = 0) = 1 − 1/n, so that E(Xn ) = 1 for all n. Also, Xn (ω) → 0 as
n → ∞, for all ω ∈ Ω. Indeed, Xn (ω) = 0 for all n > 1/ω. Thus, Xn → X = 0
but E(Xn ) = 1 does not converge to 0 = E(X).
For the last equality we use the fact that x(1 − FX (x)) = xP (X > x) goes to
zero as x → ∞. This fact follows from DCT . To see this, define Xn = n1{X > n}.
Then |Xn | ≤ X for all n. Also, Xn → 0 as n → ∞. Since E(X) < ∞, DCT then
implies that nP (X > n) = E(Xn ) → 0.
B.5 Multiple Random Variables
Let X = (X1, . . . , Xn) be a collection of random variables. Their joint cumulative distribution function is
FX(x) = P(X1 ≤ x1, . . . , Xn ≤ xn), x ∈ ℝⁿ.
The derivative of this function, if it exists, is the joint probability density function (jpdf) fX(x). That is,
FX(x) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} fX(u) du1 · · · dun.
For example, suppose that
fX,Y(x, y) = (1/π) 1{x² + y² ≤ 1}, x, y ∈ ℝ.
Then, we say that (X, Y ) is picked uniformly at random inside the unit circle.
One intuitive way to look at these random variables is to approximate them by points on a fine grid with mesh size ε > 0. For instance, an ε-approximation of a pair (X, Y) is (X̃, Ỹ) defined by
(X̃, Ỹ) = (mε, nε) with probability fX,Y(mε, nε) ε².
More generally,
E(h(X)) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x) fX(x) dx1 · · · dxn.
The random variables X and Y are independent if
P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)
for all sets A and B. It is a simple exercise to show that, if the jpdf exists, the random variables are independent if and only if
fX,Y(x, y) = fX(x) fY(y) for all x, y.
Also, if X and Y are independent and W = max{X, Y}, then
P(W ≤ w) = P(X ≤ w, Y ≤ w) = P(X ≤ w) P(Y ≤ w).
Hence, W has the cdf FW(w) = FX(w)FY(w). Similarly, let Z = X + Y where X and Y are independent with densities fX and fY. Then
fZ(z) dz = P(Z ∈ (z, z + dz)) = ∫_{−∞}^{+∞} fX(x) fY(z − x) dx dz.
We conclude that
fZ(z) = ∫_{−∞}^{+∞} fX(x) fY(z − x) dx = fX ∗ fY(z),
where g ∗ h indicates the convolution of two functions. If you took a class on signals
and systems, you learned the “flip and drag” graphical method to find a convolution.
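As a sanity check of the convolution formula, here is a minimal Python sketch (using numpy, with an illustrative step size) that convolves the density of U[0, 1] with itself and compares the result with the triangular density of the sum:

import numpy as np

dx = 0.001
x = np.arange(0, 1, dx)
f = np.ones_like(x)                    # density of U[0, 1] on its support
fz = np.convolve(f, f) * dx            # numerical version of (fX * fY)(z)
z = np.arange(len(fz)) * dx
triangular = np.minimum(z, 2 - z)      # exact density of the sum on [0, 2]
print(np.max(np.abs(fz - triangular)))   # small discretization error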
FX(x1, . . . , xn) := P(X1 ≤ x1, . . . , Xn ≤ xn), xi ∈ ℝ, i = 1, . . . , n.
The Joint Probability Density Function (jpdf) is the function fX(x) such that
FX(x1, . . . , xn) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xn} fX(u1, . . . , un) du1 . . . dun,
Thus, the jcdf and the jpdf specify the likelihood that the random vector takes values in given subsets of ℝⁿ.
As in the case of two random variables, one has
E(h(X)) = ∫ · · · ∫ h(u) fX(u) du1 . . . dun,
Definition B.5 (Mean and Covariance) Let X, Y be random vectors. One defines E(X) as the vector with components E(Xi), and one defines the covariance of X and Y as
cov(X, Y) := E((X − E(X))(Y − E(Y))') = E(XY') − E(X)E(Y)'.
Thus, the mean value of a vector is the vector of mean values. Similarly, the mean value of a matrix is defined as the matrix of mean values. Also, the covariance of X and Y is the matrix of covariances. Indeed, the (i, j) entry of cov(X, Y) is E((Xi − E(Xi))(Yj − E(Yj))) = cov(Xi, Yj).
The notions of orthogonality and of projection are essential when studying estima-
tion.
Let X and Y be two random vectors. We say that X and Y are orthogonal and we
write X ⊥ Y if
E(XY') = 0.
since E(X) = 0.
The following fact is very useful (see Fig. B.19).
Now, if X ⊥ Y, then E(Xi Yj) = 0 for all i, j. Consequently, E(X'Y) = Σ_i E(Xi Yi) = 0. This proves the result.
B.7 Density of a Function of Random Variables
Assume that X has p.d.f. fX(x). Let Y = aX + b for some a > 0. How do we
calculate fY (y)?
As we see in Fig. B.20, we have
P(Y ∈ (y, y + dy)) = P(X ∈ (x, x + dx)) = fX(x) dx, with x = (y − b)/a and dx = dy/a,
so that
fY(y) = (1/a) fX(x) where ax + b = y.   (B.9)
The case a < 0 is not that different. Repeating the argument above, one finds
fY(y) = (1/|a|) fX(x) where ax + b = y.
What about a pair of random variables? Assume that X is a random vector that
takes values in ℝ² with p.d.f. fX(x). Let
Y = AX + b, where A is a nonsingular 2 × 2 matrix and b ∈ ℝ².
As in the scalar case, one finds
fY(y) = (1/|A|) fX(x) where Ax + b = y,   (B.10)
where |A| denotes the absolute value of the determinant of A.
As an example where no density exists, let X1 =D U[0, 1] and define Y = (X1, X1) (see Fig. B.22). Then Y has no density in ℝ². Indeed, if it had one, one would find that, with L = {y | y1 = y2 and 0 ≤ y1 ≤ 1},
P(Y ∈ L) = ∫_L fY(y) dy = 0,
since the line L has zero area. But P(Y ∈ L) = 1, a contradiction.
The case when Y = g(X) for a nonlinear function g(·) is slightly more tricky. Let
us look at one example first.
First Example
Say that X =D U[0, 1] and Y = X², as shown in Fig. B.23.
As the figure shows, for 0 < ε ≪ 1, one has Y ∈ [y, y + ε) if and only if X ∈ [x1, x1 + δ), where
δ = ε/g'(x1) = ε/(2x1), with g(x1) = x1² = y.
Now,⁴
P(Y ∈ [y, y + ε)) = fY(y) ε + o(ε)
and
fY(y) ε + o(ε) = fX(x1) δ + o(ε) = fX(x1) (ε/g'(x1)) + o(ε),
⁴ Recall that o(ε) designates a function of ε such that o(ε)/ε → 0 as ε → 0.
so that
fY(y) = (1/g'(x1)) fX(x1) where g(x1) = y.
Thus, for this example,
fY(y) = 1/(2√y), y ∈ (0, 1),
because g'(x1) = 2x1 = 2√y and fX(x1) = 1.
Second Example
We now look at a slightly more complex example. Assume that Y = g(X) = X², where X takes values in [−1, 1] and has p.d.f.
fX(x) = (3/8)(1 + x)², x ∈ [−1, 1]
(see Fig. B.24).
Consider one value of y ∈ (0, 1). Note that there are now two values of x, namely x1 = √y and x2 = −√y, such that g(x) = y. Thus,
fY(y) ε + o(ε) = fX(x1) δ1 + fX(x2) δ2 + o(ε),
where
δ1 = ε/g'(x1) and δ2 = ε/|g'(x2)|.
Hence,
fY(y) ε + o(ε) = fX(x1) (ε/g'(x1)) + fX(x2) (ε/|g'(x2)|) + o(ε),
so that
fY(y) = (1/(2√y))(3/8)(1 + √y)² + (1/(2√y))(3/8)(1 − √y)² = 3(1 + y)/(8√y).
Third Example
Our next example is a general differentiable function g(·). From the second example,
we can see that if Y = g(X), then
f_Y(y) = \sum_i \frac{1}{|g'(x_i)|} f_X(x_i),      (B.11)
where the sum is over all the values x_i such that g(x_i) = y.
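Formula (B.11) translates directly into code. Here is a minimal Python sketch (assuming NumPy is available; the function names f_X and f_Y are only illustrative) for the second example, where g(x) = x² has the two preimages ±√y:

import numpy as np

def f_X(x):
    return (3 / 8) * (1 + x) ** 2                   # p.d.f. of X on [-1, 1]

def f_Y(y):
    roots = [np.sqrt(y), -np.sqrt(y)]               # the x_i with g(x_i) = y
    return sum(f_X(x) / abs(2 * x) for x in roots)  # |g'(x_i)| = |2 x_i|

y = 0.25
print(f_Y(y), 3 * (1 + y) / (8 * np.sqrt(y)))       # both equal 0.9375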
Fourth Example
What about the multi-dimensional case? The key idea is that, locally, the transformation from x to y looks linear. Observe that
g_i(x + dx) ≈ g_i(x) + \sum_j \frac{∂}{∂x_j} g_i(x) \, dx_j, i.e., g(x + dx) ≈ g(x) + J(x) dx,
where J(x) is the Jacobian matrix with entries
J_{i,j}(x) = \frac{∂}{∂x_j} g_i(x).
Arguing as in the linear case (B.10), if Y = g(X), then
f_Y(y) = \sum_i \frac{1}{|J(x_i)|} f_X(x_i),
where the sum is over all the xi such that g(xi ) = y and |J (xi )| is the absolute value
of the determinant of the Jacobian evaluated at xi .
Here is an example to illustrate this result. Assume that X = (X_1, X_2) where the
X_i are i.i.d. U[0, 1]. Consider the transformation
Y = g(X) = (X_1^2 + X_2^2, \; 2 X_1 X_2).
Then
J(x) = \begin{pmatrix} 2x_1 & 2x_2 \\ 2x_2 & 2x_1 \end{pmatrix},
so that det J(x) = 4(x_1^2 − x_2^2) and |J(x)| = 4|x_1^2 − x_2^2|.
There are two values of x that correspond to each value of y. These values are
x_1 = \frac{1}{2}\left(\sqrt{y_1 + y_2} + \sqrt{y_1 − y_2}\right) and x_2 = \frac{1}{2}\left(\sqrt{y_1 + y_2} − \sqrt{y_1 − y_2}\right),
and
x_1 = \frac{1}{2}\left(\sqrt{y_1 + y_2} − \sqrt{y_1 − y_2}\right) and x_2 = \frac{1}{2}\left(\sqrt{y_1 + y_2} + \sqrt{y_1 − y_2}\right).
For these values, since (x_1^2 − x_2^2)^2 = (x_1^2 + x_2^2)^2 − (2 x_1 x_2)^2 = y_1^2 − y_2^2, one finds
|J(x)| = 4\sqrt{y_1^2 − y_2^2}.
Hence, since f_X(x) = 1 at both of these values,
f_Y(y) = \frac{2}{4\sqrt{y_1^2 − y_2^2}} = \frac{1}{2\sqrt{y_1^2 − y_2^2}}.
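A Monte Carlo sanity check of this density (a minimal Python sketch, assuming NumPy is available; the test point y = (0.9, 0.5) and the box half-width are arbitrary choices) estimates P(Y ∈ small box)/area and compares it with 1/(2√(y_1² − y_2²)).

import numpy as np

rng = np.random.default_rng(0)
n = 2 * 10**6
x1, x2 = rng.random(n), rng.random(n)            # X1, X2 i.i.d. U[0, 1]
y1, y2 = x1**2 + x2**2, 2 * x1 * x2              # Y = (X1^2 + X2^2, 2 X1 X2)
y, h = np.array([0.9, 0.5]), 0.02                # small box around the test point
in_box = (np.abs(y1 - y[0]) < h) & (np.abs(y2 - y[1]) < h)
print(in_box.mean() / (2 * h) ** 2)              # estimated density near y
print(1 / (2 * np.sqrt(y[0]**2 - y[1]**2)))      # value given by the formula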
B.8 References
Mastering probability theory requires curiosity, intuition, and patience. Good books
are very helpful. Personally, I enjoyed Pitman (1993). The home page of David
Aldous (2018) is a source of witty and inspiring comments about probability. The
textbooks Bertsekas and Tsitsiklis (2008), Grimmett and Stirzaker (2001), and
Billingsley (2012) are very useful. The text Wong and Hajek (1985) provides a
deeper discussion of the topics in this book. The books Gallager (2014) and Hajek
(2017) are great resources and are highly recommended to complement this course.
Wikipedia and YouTube are cool sources of information about everything,
including probability. I like to joke, “Don’t take notes, it’s all on the web.”
B.9 Problems
Problem B.1 You have a collection of coins, and the probability that coin n
yields heads is p_n. Show that, as you keep flipping the coins, the flips yield a finite
number of heads if and only if \sum_n p_n < ∞.
Hint This is a direct consequence of the Borel–Cantelli Theorem and its converse.
Problem B.2 Indicate whether the following statements are true or false:
P [A|C] < P (A), P [A|B] > P (A) and P [B|A] > P (B).
Problem B.4 Roll two balanced dice. Let A be the event “the sum of the faces is
less than or equal to 8.” Let B be the event “the face of the first die is larger than or
equal to 3.”
• What is the probability that out of the first 1000 flips the number of heads is
even?
• What is the probability that the number of heads is always ahead of the number
of tails in the first 4 flips?
Problem B.6 Let X, Y be i.i.d. Exp(1), i.e., exponentially distributed with rate 1.
Derive the p.d.f. of Z = X + Y .
Problem B.7 You pick four cards randomly from a perfectly shuffled 52-card deck.
Assume that the four cards you got are all numbered between 2 and 10. For instance,
you got a 2 of diamonds, a 10 of hearts, a 6 of clubs, and a 2 of spades. Write
a MATLAB script to calculate the probability that the sum of the numbers on the
black cards is exactly twice the sum of the numbers on the red cards.
Problem B.9 Let X, Y be i.i.d. U[0, 1]. Calculate E(max{X, Y} − min{X, Y}).
Problem B.10 Let X =D P (λ) (i.e., Poisson distributed with mean λ). Find
P (X is even).
Problem B.11 Consider Ω = [0, 1] with the uniform distribution. Let X(ω) =
1{a < ω < b} and Y = 1{c < ω < d} for some 0 < a < b < 1 and 0 < c < d < 1.
Assume that X and Y are uncorrelated. Are they necessarily independent?
Problem B.12 Let X and Y be i.i.d. U [−1, 1] and define Z = XY . Are X and Z
uncorrelated? Are they independent?
Problem B.14 You are given a one meter long stick. You choose two points X and
Y independently and uniformly along the stick and cut the stick at those two points.
What is the probability that you can make a triangle with the three pieces?
Problem B.15 Two friends go independently to a bar at times that are uniformly
distributed between 5:00 pm and 6:00 pm. Each waits for ten minutes after arriving.
What is the probability that they meet?
Problem B.17 Assume that Z and 1/Z are random variables with the same
probability distribution and such that E(|Z|) is well-defined. Show that E(|Z|) ≥ 1.
Problem B.18 Let {Xn , n ≥ 1} be i.i.d. with mean 0 and variance 1. Define Yn =
(X1 + · · · + Xn )/n.
Problem B.21 Pick two points X and Y independently and uniformly in [0, 1]2 .
Calculate E(||X − Y ||2 ).
Problem B.22 Let (X, Y ) be picked uniformly in the triangle with corners
(−1, 0), (1, 0), (0, 1). Find cov(X, Y ).
Problem B.23 Let X be a random variable with mean 1 and variance 0.5. Show
that
Problem B.24 Let X, Y, Z be i.i.d. and uniformly distributed in {−1, +1} (i.e.,
equally likely to be −1 or +1). Define V1 = XY, V2 = Y Z, V3 = XZ.
Problem B.25 Let A and B be events with probabilities P(A) = 3/4 and P(B) = 1/3.
Show that 1/12 ≤ P(A ∩ B) ≤ 1/3, and give examples to show that both the upper
and the lower bound are tight. Find corresponding bounds for P(A ∪ B).
Problem B.26 A power system supplies electricity to a city from N plants. Each
power plant fails with probability p independently of the other plants. The city will
experience a blackout if fewer than k plants are supplying it, where 0 < k < N .
What is the probability of blackout?
Problem B.27 Figure B.25 is the reliability graph of a system. The links of the
graph represent components of the system. Each link i is working with probability
pi and defective with probability 1−pi , independently of the other links. The system
is operational if the nodes S and T are connected. Thus, the system is built of two
redundant subsystems. Each subsystem consists of a number of components.
Problem B.28 Figure B.26 illustrates an RC-circuit used as a timer. Initially, the
capacitor is charged by the power supply to 5 V. At time t = 0, the switch is flipped
and the capacitor starts discharging through the resistor. An external circuit detects
the time τ when V(t) first drops below 1 V.
Problem B.29 Alice and Bob play the game of matching pennies. In this game,
they both choose the side of the penny to show. Alice wins if the two sides are
different and Bob wins otherwise (Fig. B.27).
(a) Assume that Alice chooses to show “head” with probability pA ∈ [0, 1].
Calculate the probability pB with which Bob should show “head” to maximize
his probability of winning.
(b) From your calculations, find the best choices of pA and pB for Alice and Bob.
Argue that those choices are such that Alice cannot improve her chance of
winning by modifying pA and similarly for Bob. A solution with that property
is called a Nash equilibrium.
Problem B.30 You find two old batteries in a drawer. They produce the voltages X
and Y . Assume that X and Y are i.i.d. and uniformly distributed in [0, 1.5].
(a) What is the probability that if you put them in series they produce a voltage
larger than 2?
(b) What is the probability that at least one of the two batteries has a voltage that
exceeds 1 V?
(c) What is the probability that both batteries have a voltage that exceeds 1 V?
(d) You find more similar batteries in that drawer. You test them one by one until
you find one whose voltage exceeds 1.2 V. What is the expected number of
batteries that you have to test?
(e) You pick three batteries. What is the probability that at least two of them have
voltages that add up to more than 2? (Fig. B.28).
Problem B.31 You want to sell your old iPhone 4S. Two friends, Alice and Bob,
are interested. You know that they value the phone at X and Y , respectively, where X
and Y are i.i.d. U[50, 150]. You propose the following auction. You ask for a price
R. If Alice bids A and Bob bids B, then the phone goes to the highest bidder, provided
that the highest bid is larger than R, and the highest bidder pays the maximum of the
second bid and R. Thus, if A < R < B, then Bob gets the phone and pays R. If
R < A < B, then Bob gets the phone and pays A (Fig. B.29).
(a) What is the expected payment that you get for the phone if A = X and B = Y ?
(b) Find the value of R that maximizes this expected payment.
(c) The surplus of Alice is X − P if she gets the phone and pays P for it; it is
zero if she does not get the phone. Bob’s surplus is defined similarly. Show that
Alice maximizes her expected surplus by bidding A = X and similarly for Bob.
We say that this auction is incentive compatible. Also, this auction is revenue
maximizing.
Problem B.32 Recall that the trace tr(S) of a square matrix S is the sum of its
diagonal elements. Let A be an m × n matrix and B an n × m matrix. Show that
tr(AB) = tr(BA).
Problem B.33 Let Σ be the covariance of some random vector X. Show that
a^T Σ a ≥ 0 for every real vector a.
Problem B.34 You want to buy solar panels for your house. Panels that deliver a
maximum power K cost αK per unit of time, after amortizing the cost over the
lifetime of the panels. Assume that the actual power Z that such panels deliver is
U [0, K] (Fig. B.30).
The power X that you need is U [0, A] and we assume it is independent of the
power that the solar panels deliver. If you buy panels with a maximum power K,
your cost per unit time is
αK + β max{0, X − Z},
where max{0, X − Z} is the amount of power that you have to buy from the grid,
at a price β per unit. Find the maximum power K of the panels you should buy to
minimize your expected cost per unit time.
References
B. Hajek, T. Van Loon, Decentralized dynamic control of a multiple access broadcast channel.
IEEE Trans. Autom. Control, AC-27(3), 559–569 (1982)
M. Harchol-Balter, Performance Modeling and Design of Computer Systems: Queueing Theory in
Action. (Cambridge University Press, Cambridge, 2013)
T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd edn. (Springer, Berlin, 2009)
D.A. Huffman, A method for the construction of minimum-redundancy codes. Proceedings of the
IRE, pp. 1098–1101 (1952)
R.E. Kalman, A new approach to linear filtering and prediction problems. Trans. ASME J. Basic
Eng. 82(Series D), 35–45 (1960)
F. Kelly, Reversibility and Stochastic Networks (Wiley, Hoboken, 1979)
F. Kelly, E. Yudovina, Lecture Notes in Stochastic Networks (2013). http://www.statslab.cam.ac.
uk/~frank/STOCHNET/LNSN/book.pdf
L. Kleinrock, Queueing Systems, vols.1 and 2 (J. Wiley, Hoboken, 1975–1976)
P.R. Kumar, P.P. Varaiya, Stochastic Systems: Estimation, Identification and Adaptive Control
(Prentice-Hall, Upper Saddle River, 1986)
E.L. Lehmann, Testing Statistical Hypotheses, 3rd edn. (Springer, Berlin, 2010)
J.D.C. Little, A proof for the queuing formula: L = λW . Oper. Res. 9(3), 383–401 (1961)
R. Lyons, Y. Peres, Probability on Trees and Networks. Cambridge Series in Statistical and
Probabilistic Mathematics (2017)
S.G. Mallat, Z. Zhang, Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal
Process. 41, 3397–3415 (1993)
M.J. Neely, Stochastic Network Optimization with Application to Communication and Queueing
Systems (Morgan & Claypool, San Rafael, 2010)
J. Neveu, Discrete Parameter Martingales (American Elsevier, North-Holland, 1975)
J. Neyman, E.S. Pearson, On the problem of the most efficient tests of statistical hypotheses. Phil.
Trans. R. Soc. Lond. 231, 289–337 (1933)
L. Page, Method for node ranking in a linked database. United States Patent and Trademark Office,
6,285,999 (2001)
J. Pitman, Probability. Springer Texts in Statistics (Springer, New York, 1993)
J. Proakis, Digital Communications, 4th edn. (McGraw-Hill Science/Engineering/Math, New
York, 2000)
T. Richardson, R. Urbanke. Modern Coding Theory (Cambridge University Press, Cambridge,
2008)
E. Roche, EM algorithm and variants: an informal tutorial (2012). arXiv:1105.1476v2 [stat.CO]
S.M. Ross, Introduction to Stochastic Dynamic Programming (Academic Press, Cambridge, 1995)
D. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, A tutorial on Thompson sampling.
Found. Trends Mach. Learn. 11(1), 1–96 (2018)
D. Shah, Gossip algorithms. Found. Trends Netw. 3, 1–25 (2009)
R. Srikant, L. Ying, Communication Networks: An Optimization, Control, and Stochastic Networks
Perspective (Cambridge University Press, Cambridge, 2014)
A.L. Strehl, M.L. Littman, Online linear regression and its application to model-based reinforcement
learning, in Advances in Neural Information Processing Systems 20 (NIPS 2007),
pp. 737–744 (2007)
D. Tse, P. Viswanath, Fundamentals of Wireless Communication (Cambridge University Press,
Cambridge, 2005)
M.J. Wainwright, M. Jordan, Graphical Models, Exponential Families, and Variational Inference
(Now Publishers, Boston, 2008)
J. Walrand, An Introduction to Queueing Networks (Prentice-Hall, Upper Saddle River, 1988)
J. Walrand, Uncertainty: A User Guide (Amazon, Seattle, 2019)
E. Wong, B. Hajek, Stochastic Processes in Engineering Systems (Springer, Berlin, 1985)
Index
A
Adaptive randomized multiple access protocol, 66
Additive Gaussian noise channel, 124
Almost sure, 7
Almost sure convergence, 28
ANOVA, 154
Aperiodic, 5
Arg max, 118
Axiom of choice, 323

B
Balance equations, 2
  detailed, 18
Bayes' Rule, 117
Belief propagation, 156
Bellman-Ford Algorithm, 207, 244
Bellman-Ford Equations, 208
Bernoulli RV, 339
Binary Phase Shift Keying (BPSK), 125
Binary Symmetric Channel (BSC), 119
Binomial distribution, 340
Binomial RV, 340
Boosting, 280
Borel-Cantelli Theorem, 330
  Converse, 332

C
Cμ rule, 249
Cascades, 72
Cauchy-Schwarz inequality, 69
Central Limit Theorem (CLT), 43
Certainty equivalence, 263
Characteristic function, 59
Chebyshev's inequality, 24, 320
Chernoff's inequality, 288
Chi-squared test, 152
Clustering, 209
Codewords, 285
Complementary cdf, 354
Compressed sensing, 234
Conditional probability, 310
Confidence interval, 47
Consensus algorithm, 90
Contraction, 261
Convergence in distribution, 43
Convergence in probability, 28
Convex function, 220
Convex set, 220
Covariance, 321, 344
Cumulative distribution function (cdf), 333, 348
  joint, 354

D
Deep neural networks (DNN), 237
Digital link, 115
Discrete random variable, 334
Doob martingale, 296
Dynamic programming equations, 208, 245

E
Entropy, 123, 288
Entropy rate, 288
Epidemics, 72
Error correction, 156
Error function, 65
  bounds, 65
Estimation problem, 163
Expectation, 315
  linearity, 316, 338
  monotone, 338
  monotonicity, 317
  of product of independent RVs, 318

U
Uncorrelated, 321, 344, 358
Uniformization, 106
Uniformly distributed, 348

W
Wald's equality, 301
Wavelength, 126
Weak Law of Large Numbers, 28, 321
WiFi Access Point, 54