Introduction to Probability for Data Science
Stanley H. Chan
Purdue University
Copyright ©2021 Stanley H. Chan
This book is published by Michigan Publishing under an agreement with the author. It is
made available free of charge in electronic form to any student or instructor interested in
the subject matter.
To Vivian, Joanna, and Cynthia Chan
And ye shall know the truth, and the truth shall make you free.
John 8:32
Preface
find lecture videos and homework videos. Throughout the book you will see many “practice
exercises”, which are easy problems with worked-out solutions. They can be skipped without
loss to the flow of the book.
Acknowledgements: If I could thank only one person, it must be Professor Fawwaz
Ulaby of the University of Michigan. Professor Ulaby has been the source of support in
all aspects, from the book’s layout to technical content, proofreading, and marketing. The
book would not have been published without the help of Professor Ulaby. I am deeply
moved by Professor Ulaby’s vision that education should be made accessible to all students.
With textbook prices rocketing up, the EECS free textbook initiative launched by Professor
Ulaby is the most direct response to the publishers, teachers, parents, and students. Thank
you, Fawwaz, for your unbounded support — technically, mentally, and financially. Thank
you also for recommending Richard Carnes. The meticulous details Richard offered have
significantly improved the fluency of the book. Thank you, Richard.
I thank my colleagues at Purdue who shared many thoughts with me when I
taught the course (in alphabetical order): Professors Mark Bell, Mary Comer, Saul Gelfand,
Amy Reibman, and Chih-Chun Wang. My teaching assistant I-Fan Lin was instrumental in
the early development of this book. To the graduate students of my lab (Yiheng Chi, Nick
Chimitt, Kent Gauen, Abhiram Gnanasambandam, Guanzhe Hong, Chengxi Li, Zhiyuan
Mao, Xiangyu Qu, and Yash Sanghvi): Thank you! It would have been impossible to finish
the book without your participation. A few students I taught volunteered to help edit
the book: Benjamin Gottfried, Harrison Hsueh, Dawoon Jung, Antonio Kincaid, Deepak
Ravikumar, Krister Ulvog, Peace Umoru, Zhijing Yao. I would like to thank my Ph.D.
advisor Professor Truong Nguyen for encouraging me to write the book.
Finally, I would like to thank my wife Vivian and my daughters, Joanna and Cynthia,
for their love, patience, and support.
May, 2021
Companion website:
https://probability4datascience.com/
Contents
1 Mathematical Background
1.1 Infinite Series
1.1.1 Geometric Series
1.1.2 Binomial Series
1.2 Approximation
1.2.1 Taylor approximation
1.2.2 Exponential series
1.2.3 Logarithmic approximation
1.3 Integration
1.3.1 Odd and even functions
1.3.2 Fundamental Theorem of Calculus
1.4 Linear Algebra
1.4.1 Why do we need linear algebra in data science?
1.4.2 Everything you need to know about linear algebra
1.4.3 Inner products and norms
1.4.4 Matrix calculus
1.5 Basic Combinatorics
1.5.1 Birthday paradox
1.5.2 Permutation
1.5.3 Combination
1.6 Summary
1.7 Reference
1.8 Problems
2 Probability
2.1 Set Theory
2.1.1 Why study set theory?
2.1.2 Basic concepts of a set
2.1.3 Subsets
2.1.4 Empty set and universal set
2.1.5 Union
2.1.6 Intersection
2.1.7 Complement and difference
2.1.8 Disjoint and partition
2.1.9 Set operations
2.1.10 Closing remarks about set theory
7 Regression
7.1 Principles of Regression
7.1.1 Intuition: How to fit a straight line?
7.1.2 Solving the linear regression problem
7.1.3 Extension: Beyond a straight line
7.1.4 Overdetermined and underdetermined systems
7.1.5 Robust linear regression
7.2 Overfitting
7.2.1 Overview of overfitting
7.2.2 Analysis of the linear case
7.2.3 Interpreting the linear analysis results
7.3 Bias and Variance Trade-Off
7.3.1 Decomposing the testing error
7.3.2 Analysis of the bias
7.3.3 Variance
7.3.4 Bias and variance on the learning curve
7.4 Regularization
7.4.1 Ridge regularization
7.4.2 LASSO regularization
7.5 Summary
7.6 References
7.7 Problems
8 Estimation
8.1 Maximum-Likelihood Estimation
8.1.1 Likelihood function
8.1.2 Maximum-likelihood estimate
8.1.3 Application 1: Social network analysis
8.1.4 Application 2: Reconstructing images
8.1.5 More examples of ML estimation
8.1.6 Regression versus ML estimation
8.2 Properties of ML Estimates
A Appendix
Chapter 1
Mathematical Background
“Data science” has different meanings to different people. If you ask a biologist, data science
could mean analyzing DNA sequences. If you ask a banker, data science could mean predicting
the stock market. If you ask a software engineer, data science could mean programs
and data structures; if you ask a machine learning scientist, data science could mean models
and algorithms. However, one thing that is common in all these disciplines is the concept of
uncertainty. We choose to learn from data because we believe that latent information
is embedded in the data, even though the data is unprocessed, contains noise, and could have
missing entries. If there were no randomness, all data scientists could close their business
because there would simply be no problem to solve. However, the moment we see randomness,
our business comes back. Therefore, data science is the subject of making decisions under uncertainty.
The mathematics of analyzing uncertainty is probability. It is the tool to help us model,
analyze, and predict random events. Probability can be studied in as many ways as you can
think of. You can take a rigorous course in probability theory, or a “probability for dummies”
on the internet, or a typical undergraduate probability course offered by your school. This
book is different from all these. Our goal is to tell you how things work in the context of data
science. For example, why do we need those three axioms of probabilities and not others?
Where does the “bell shape” Gaussian random variable come from? How many samples do
we need to construct a reliable histogram? These questions are at the core of data science,
and they deserve close attention rather than being swept under the rug.
To help you get used to the pace and style of this book, in this chapter, we review some
of the very familiar topics in undergraduate algebra and calculus. These topics are meant
to warm up your mathematics background so that you can follow the subsequent chapters.
Specifically, in this chapter, we cover several topics. First, in Section 1.1 we discuss infinite
series, something that will be used frequently when we evaluate the expectation and variance
of random variables in Chapter 3. In Section 1.2 we review the Taylor approximation,
which will be helpful when we discuss continuous random variables. Section 1.3 discusses
integration and reviews several tricks we can use to make integration easy. Section 1.4
deals with linear algebra, aka matrices and vectors, which are fundamental to modern data
analysis. Finally, Section 1.5 discusses permutation and combination, two basic techniques
to count events.
Imagine that you have a fair coin. If you get a tail, you flip it again. You do this repeatedly
until you finally get a head. What is the probability that you need to flip the coin three
times to get one head?
This is a warm-up exercise. Since the coin is fair, the probability of obtaining a head
is $\frac{1}{2}$. The probability of getting a tail followed by a head is $\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$. Similarly, the
probability of getting two tails and then a head is $\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}$. If you follow this logic, you
can write down the probabilities for all other cases. For your convenience, we have drawn the
first few in Figure 1.1. As you have probably noticed, the probabilities follow the pattern
$\{\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots\}$.
Figure 1.1: Suppose you flip a coin until you see a head. This requires you to have $N - 1$ tails followed
by a head. The probabilities of these sequences of events are $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \ldots$, which form an infinite sequence.
We can also summarize these probabilities using a familiar plot called the histogram,
as shown in Figure 1.2. The histogram for this problem has a special pattern: every
value is one power of $\frac{1}{2}$ higher than the preceding one (that is, half of it), and the sequence is infinitely long.
Figure 1.2: The histogram of flipping a coin until we see a head. The x-axis is the number of coin flips,
and the y-axis is the probability.
Let us ask something harder: On average, if you want to be 90% sure that you will
get a head, what is the minimum number of attempts you need to try? Five attempts?
Ten attempts? Indeed, if you try ten attempts, you will very likely accomplish your goal.
However, this would seem to be overkill. If you try five attempts, then it becomes unclear
whether you will be 90% sure.
1.1. INFINITE SERIES
This warm-up exercise has perhaps raised some of your interest in the subject. However,
we will not tell you everything now. We will come back to the probability in Chapter 3
when we discuss geometric random variables. In the present section, we want to make sure
you have the basic mathematical tools to calculate quantities, such as a sum of fractional
numbers. For example, what if we want to calculate $\mathbb{P}[\text{success after } 10^7 \text{ attempts}]$? Is there
a systematic way of performing the calculation?
Remark. You should be aware that the 93.75% only says that the probability of achieving
the goal is high. If you have a bad day, you may still need more than four attempts. Therefore,
when we stated the question, we asked for 90% “on average”. Sometimes you may need
more attempts and sometimes fewer attempts, but on average, you have a 93.75% chance
of succeeding.
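The 93.75% figure in the remark can be reproduced by adding up the probabilities of the first few outcomes. The following is a minimal sketch, assuming a fair coin ($p = \frac{1}{2}$); the function names `prob_head_within` and `min_attempts` are ours and do not appear in the book.

```python
def prob_head_within(n, p=0.5):
    """Probability that the first head appears within the first n flips."""
    # P[first head on flip k] = (1 - p)^(k - 1) * p, summed over k = 1, ..., n.
    return sum((1 - p) ** (k - 1) * p for k in range(1, n + 1))


def min_attempts(target, p=0.5):
    """Smallest n such that prob_head_within(n, p) >= target."""
    n = 1
    while prob_head_within(n, p) < target:
        n += 1
    return n


for n in range(1, 6):
    print(n, prob_head_within(n))
print("attempts needed for 90%:", min_attempts(0.90))
```

Three attempts give 87.5%, which falls short of 90%, so four attempts is the smallest number that meets the target. Adding terms one by one is fine for a handful of flips, but it does not scale to very large numbers of attempts, which is why a closed-form sum is worth having.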
appears naturally in the context of discrete events. In Chapter 3 of this book, we will use
geometric series when calculating the expectation and moments of a random variable.
Definition 1.1. Let $0 < r < 1$. A finite geometric sequence of power $n$ is a sequence
of numbers
\[
1, r, r^2, \ldots, r^n.
\]
\[
(1 - r)\sum_{k=0}^{n} r^k = \left(1 + r + r^2 + \cdots + r^n\right) - \left(r + r^2 + r^3 + \cdots + r^{n+1}\right) \overset{(a)}{=} 1 - r^{n+1},
\]
Corollary 1.1. Let $0 < r < 1$. The sum of an infinite geometric series is
\[
\sum_{k=0}^{\infty} r^k = 1 + r + r^2 + \cdots = \frac{1}{1 - r}. \tag{1.2}
\]
□
Remark. Note that the condition $0 < r < 1$ is important. If $r > 1$, then the limit
$\lim_{n \to \infty} r^{n+1}$ in Equation (1.2) will diverge. The constant $r$ cannot equal 1, for
otherwise the fraction $(1 - r^{n+1})/(1 - r)$ is undefined. We are not interested in the case when
$r = 0$, because the sum is trivially 1: $\sum_{k=0}^{\infty} 0^k = 1 + 0^1 + 0^2 + \cdots = 1$.
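To see the finite-sum expression $(1 - r^{n+1})/(1 - r)$ and the limit $1/(1 - r)$ at work, you can compare them with a direct term-by-term sum. This is a small sketch with function names of our own choosing; the value $r = 0.5$ is just an example.

```python
def geometric_partial_sum(r, n):
    """Directly add 1 + r + r^2 + ... + r^n."""
    return sum(r ** k for k in range(n + 1))


def geometric_closed_form(r, n):
    """Finite-sum expression (1 - r^(n+1)) / (1 - r), valid for r != 1."""
    return (1 - r ** (n + 1)) / (1 - r)


r = 0.5
for n in (1, 5, 10, 50):
    print(n, geometric_partial_sum(r, n), geometric_closed_form(r, n))
print("limit 1 / (1 - r):", 1 / (1 - r))
```

As $n$ grows, both columns approach the limit $1/(1 - r)$, which is what Corollary 1.1 states.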
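Practice Exercise 1.1. Compute the infinite series $\sum_{k=2}^{\infty} \frac{1}{2^k}$.
Solution.
\[
\begin{aligned}
\sum_{k=2}^{\infty} \frac{1}{2^k} &= \frac{1}{4} + \frac{1}{8} + \cdots \\
&= \frac{1}{4}\left(1 + \frac{1}{2} + \frac{1}{4} + \cdots\right) \\
&= \frac{1}{4} \cdot \frac{1}{1 - \frac{1}{2}} = \frac{1}{2}.
\end{aligned}
\]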
Remark. You should not confuse a geometric series with a harmonic series. A
harmonic series concerns the sum of $\{1, \frac{1}{2}, \frac{1}{3}, \frac{1}{4}, \ldots\}$. It turns out that$^{1}$
\[
\sum_{n=1}^{\infty} \frac{1}{n} = 1 + \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \cdots = \infty.
\]
On the other hand, a squared harmonic series $\{1, \frac{1}{2^2}, \frac{1}{3^2}, \frac{1}{4^2}, \ldots\}$ converges:
\[
\sum_{n=1}^{\infty} \frac{1}{n^2} = 1 + \frac{1}{2^2} + \frac{1}{3^2} + \frac{1}{4^2} + \cdots = \frac{\pi^2}{6}.
\]
Proof. Take the derivative on both sides of Equation (1.2). The left-hand side becomes
\[
\frac{d}{dr} \sum_{k=0}^{\infty} r^k = \frac{d}{dr}\left(1 + r + r^2 + \cdots\right) = 1 + 2r + 3r^2 + \cdots = \sum_{k=1}^{\infty} k r^{k-1}.
\]
The right-hand side becomes $\frac{d}{dr}\left(\frac{1}{1 - r}\right) = \frac{1}{(1 - r)^2}$.
□
Practice Exercise 1.2. Compute the infinite sum $\sum_{k=1}^{\infty} k \cdot \frac{1}{3^k}$.
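If you want to check your answer to this exercise numerically, the differentiated series from the proof above can be summed directly. The sketch below truncates the series at 200 terms, which is an arbitrary choice of ours; it uses the fact that $\sum_{k \ge 1} k r^k = r \sum_{k \ge 1} k r^{k-1}$.

```python
def derivative_series(r, terms=200):
    """Partial sum of sum_{k>=1} k * r^(k-1), truncated after `terms` terms."""
    return sum(k * r ** (k - 1) for k in range(1, terms + 1))


r = 1 / 3
lhs = derivative_series(r)   # truncated series
rhs = 1 / (1 - r) ** 2       # closed form obtained by differentiating 1/(1-r)
print(lhs, rhs)

# Practice Exercise 1.2: sum_{k>=1} k / 3^k = r * sum_{k>=1} k * r^(k-1) with r = 1/3.
exercise_value = r * lhs
print(exercise_value)
```

With $r = \frac{1}{3}$ the closed form gives $\frac{1}{3} \cdot \frac{1}{(1 - 1/3)^2} = \frac{3}{4}$, and the truncated sum agrees with it to within floating-point error.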
1 This result can be found in Tom Apostol, Mathematical Analysis, 2nd Edition, Theorem 8.11.
Figure 1.3: When flipping three coins independently, the probability of getting exactly one head can
come from three different possibilities.
What lessons have we learned in this example? Notice that you need to enumerate
all possible combinations of one head and two tails to solve this problem. The number is
3 in our example. In general, the number of combinations can be systematically studied
using combinatorics, which we will discuss later in the chapter. However, the number of
combinations motivates us to discuss another background technique known as the binomial
series. The binomial series is instrumental in algebra when handling polynomials such as
$(a + b)^2$ or $(1 + x)^3$. It provides a valuable formula when computing these powers.
Theorem 1.2 (Binomial theorem). For any real numbers $a$ and $b$, the binomial series
of power $n$ is
\[
(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k, \tag{1.4}
\]
where $\binom{n}{k} = \frac{n!}{k!(n-k)!}$.
The binomial theorem is valid for any real numbers $a$ and $b$. The quantity $\binom{n}{k}$ reads
as “n choose k”. Its definition is
\[
\binom{n}{k} \overset{\text{def}}{=} \frac{n!}{k!(n - k)!},
\]
Section 1.5. But we can quickly plug “n choose k” into the coin flipping example by
letting $n = 3$ and $k = 1$:
\[
\text{Number of combinations for 1 head and 2 tails} = \binom{3}{1} = \frac{3!}{1!\,2!} = 3.
\]
So you can see why we want you to spend your precious time learning about the binomial
theorem. In MATLAB and Python, $\binom{n}{k}$ can be computed using the commands as follows.
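The listing that the sentence above refers to is not reproduced in this copy. As a stand-in, here is a minimal Python sketch that uses the standard library function `math.comb` (available in Python 3.8 and later) and compares it with the factorial definition; the helper name `n_choose_k` is ours. In MATLAB, the corresponding built-in is `nchoosek(n, k)`.

```python
import math


def n_choose_k(n, k):
    """Binomial coefficient from the definition n! / (k! (n - k)!)."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))


print(math.comb(3, 1))    # built-in binomial coefficient
print(n_choose_k(3, 1))   # same value from the factorial definition
```

Both lines print 3, the number of ways to place one head among three coin flips.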
The binomial theorem makes the most sense when we also learn about Pascal’s
identity.
Theorem 1.3 (Pascal’s identity). Let $n$ and $k$ be positive integers such that $k \le n$.
Then,
\[
\binom{n}{k} + \binom{n}{k-1} = \binom{n+1}{k}. \tag{1.5}
\]
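\[
\begin{aligned}
\binom{n}{k} + \binom{n}{k-1} &= \frac{n!}{k!(n - k)!} + \frac{n!}{(k - 1)!(n - (k - 1))!} \\
&= n!\left[\frac{1}{k!(n - k)!} + \frac{1}{(k - 1)!(n - k + 1)!}\right],
\end{aligned}
\]
where we factor out $n!$ to obtain the second equation. Next, we observe that
\[
\begin{aligned}
\frac{1}{k!(n - k)!} \times \frac{(n - k + 1)}{(n - k + 1)} &= \frac{n - k + 1}{k!(n - k + 1)!}, \\
\frac{1}{(k - 1)!(n - k + 1)!} \times \frac{k}{k} &= \frac{k}{k!(n - k + 1)!}.
\end{aligned}
\]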
□
The Pascal triangle is a visualization of the coefficients of $(a + b)^n$ as shown in Figure 1.4.
For example, when $n = 5$, we know that $\binom{5}{3} = 10$. However, by Pascal’s identity, we
Figure 1.4: Pascal triangle for n = 0, . . . , 5. Note that a number in one row is obtained by summing
two numbers directly above it.
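The rule in the caption, that each number is the sum of the two numbers directly above it, is Pascal’s identity applied row by row. A short sketch that builds the triangle this way is given below; the function name `pascal_triangle` is ours.

```python
import math


def pascal_triangle(max_n):
    """Rows 0, ..., max_n of the Pascal triangle, built with Pascal's identity."""
    rows = [[1]]
    for n in range(1, max_n + 1):
        prev = rows[-1]
        # Interior entries: C(n, k) = C(n-1, k) + C(n-1, k-1) for k = 1, ..., n-1.
        interior = [prev[k] + prev[k - 1] for k in range(1, n)]
        rows.append([1] + interior + [1])
    return rows


rows = pascal_triangle(5)
for row in rows:
    print(row)
```

The last row is the list of coefficients of $(a + b)^5$, and its entry at position $k = 3$ is $\binom{5}{3} = 10$, matching the value quoted in the text.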
This result will be helpful when evaluating binomial random variables in Chapter 3.
We now prove the binomial theorem. Please feel free to skip the proof if this is your first
time reading the book.
Therefore, the base case is verified. Assume up to case n. We need to verify case n + 1.
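\[
\begin{aligned}
(a + b)^{n+1} &= (a + b)(a + b)^n \\
&= (a + b) \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^k \\
&= \sum_{k=0}^{n} \binom{n}{k} a^{n-k+1} b^k + \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^{k+1}.
\end{aligned}
\]
We want to apply Pascal’s identity to combine the two terms. In order to do so, we note
that the second term in this sum can be rewritten as
\[
\begin{aligned}
\sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^{k+1} &= \sum_{k=0}^{n} \binom{n}{k} a^{n+1-k-1} b^{k+1} \\
&= \sum_{\ell=1}^{n+1} \binom{n}{\ell-1} a^{n+1-\ell} b^{\ell}, \quad \text{where } \ell = k + 1 \\
&= \sum_{\ell=1}^{n} \binom{n}{\ell-1} a^{n+1-\ell} b^{\ell} + b^{n+1}.
\end{aligned}
\]
Therefore, the two terms can be combined using Pascal’s identity to yield
\[
\begin{aligned}
(a + b)^{n+1} &= \sum_{\ell=1}^{n} \left[\binom{n}{\ell} + \binom{n}{\ell-1}\right] a^{n+1-\ell} b^{\ell} + a^{n+1} + b^{n+1} \\
&= \sum_{\ell=1}^{n} \binom{n+1}{\ell} a^{n+1-\ell} b^{\ell} + a^{n+1} + b^{n+1} = \sum_{\ell=0}^{n+1} \binom{n+1}{\ell} a^{n+1-\ell} b^{\ell}.
\end{aligned}
\]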
Hence, the (n + 1)th case is also verified. By the principle of mathematical induction, we
have completed the proof.
□
1.2 Approximation
Consider a function $f(x) = \log(1 + x)$, for $x > 0$, as shown in Figure 1.5. This is a nonlinear
function, and we all know that nonlinear functions are not fun to deal with. For example,
if you want to evaluate the integral $\int_a^b x \log(1 + x)\, dx$, then the logarithm will force you
to do integration by parts. However, in many practical problems, you may not need the full
range of $x > 0$. Suppose that you are only interested in values $x \ll 1$. Then the logarithm
can be approximated, and thus the integral can also be approximated.
Figure 1.5: The function $f(x) = \log(1 + x)$ and the approximation $\hat{f}(x) = x$.
To see how this is even possible, we show in Figure 1.5 the nonlinear function $f(x) =
\log(1 + x)$ and an approximation $\hat{f}(x) = x$. The approximation is carefully chosen such that
for $x \ll 1$, the approximation $\hat{f}(x)$ is close to the true function $f(x)$. Therefore, we can
argue that for $x \ll 1$,
\[
\log(1 + x) \approx x, \tag{1.6}
\]
thereby simplifying the calculation. For example, if you want to integrate $x \log(1 + x)$ for
$0 < x < 0.1$, then the integral can be approximated by
\[
\int_0^{0.1} x \log(1 + x)\, dx \approx \int_0^{0.1} x^2\, dx = \left.\frac{x^3}{3}\right|_{0}^{0.1} = 3.33 \times 10^{-4}.
\]
(The actual integral is $3.21 \times 10^{-4}$.) In this section we will learn about
the basic approximation techniques. We will use them when we discuss limit theorems in
Chapter 6, as well as when we relate various distributions, such as going from binomial to Poisson.
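The two numbers quoted above can be checked with a simple numerical integration. The sketch below uses the midpoint rule, which is our choice for illustration and not a method prescribed by the book; the helper name `midpoint_integral` and the step count are also ours.

```python
import math


def midpoint_integral(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


exact = midpoint_integral(lambda x: x * math.log(1 + x), 0.0, 0.1)
approx = midpoint_integral(lambda x: x * x, 0.0, 0.1)
print(f"integral of x*log(1+x): {exact:.4e}")
print(f"integral of x^2:        {approx:.4e}")
print(f"relative gap:           {(approx - exact) / exact:.2%}")
```

The gap between the two values is a few percent on this interval. If you widen the interval, the gap grows, because the approximation $\log(1 + x) \approx x$ is only good for $x \ll 1$.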
Here, the big-O notation $O(\varepsilon^k)$ means any term that has an order of at least power $k$. For
small $\varepsilon$, i.e., $\varepsilon \ll 1$, a high-order term $O(\varepsilon^k) \approx 0$ for large $k$.
\[
f(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots
\]
We show the first few approximations in Figure 1.6.
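One should be reminded that Taylor approximation approximates a function $f(x)$
at a particular point $x = a$. Therefore, the approximation of $f$ near $x = 0$ and the
approximation of $f$ near $x = \pi/2$ are different. For example, the Taylor approximation
at $x = \pi/2$ for $f(x) = \sin x$ is
\[
\begin{aligned}
f(x) &= \sin\frac{\pi}{2} + \cos\frac{\pi}{2}\left(x - \frac{\pi}{2}\right) - \frac{\sin\frac{\pi}{2}}{2!}\left(x - \frac{\pi}{2}\right)^2 - \frac{\cos\frac{\pi}{2}}{3!}\left(x - \frac{\pi}{2}\right)^3 \\
&= 1 + 0 - \frac{1}{2}\left(x - \frac{\pi}{2}\right)^2 - 0 = 1 - \frac{1}{2}\left(x - \frac{\pi}{2}\right)^2.
\end{aligned}
\]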
Figure 1.6: The function $\sin x$ and its 3rd-order, 5th-order, and 7th-order Taylor approximations. (a) Approximate at $x = 0$. (b) Approximate at $x = \pi/2$.
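To get a feel for how the expansion point matters, the sketch below evaluates the Taylor polynomial of $\sin x$ around $x = 0$ for a few orders, and the second-order polynomial around $x = \pi/2$. The function names are ours; the coefficients come from $f(\pi/2) = 1$, $f'(\pi/2) = 0$, and $f''(\pi/2) = -1$.

```python
import math


def sin_taylor_at_zero(x, order):
    """Taylor polynomial of sin x around 0, keeping terms up to x^order."""
    total = 0.0
    for k in range(order // 2 + 1):
        power = 2 * k + 1
        if power > order:
            break
        total += (-1) ** k * x ** power / math.factorial(power)
    return total


def sin_taylor_at_half_pi(x):
    """Second-order Taylor polynomial of sin x around pi/2."""
    d = x - math.pi / 2
    # f(pi/2) = 1, f'(pi/2) = cos(pi/2) = 0, f''(pi/2) = -sin(pi/2) = -1.
    return 1.0 - d ** 2 / 2.0


for order in (3, 5, 7):
    print(order, sin_taylor_at_zero(1.0, order), math.sin(1.0))
print(sin_taylor_at_half_pi(1.5), math.sin(1.5))
```

At $x = 1$ the error of the expansion around 0 shrinks as the order increases. At $x = 1.5$, which is close to $\pi/2$, even the second-order expansion around $\pi/2$ is already very accurate.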
Proof. Let $f(x) = e^x$ for any $x$. Then, the Taylor approximation around $x = 0$ is
\[
\begin{aligned}
f(x) &= f(0) + f'(0)(x - 0) + \frac{f''(0)}{2!}(x - 0)^2 + \cdots \\
&= e^0 + e^0 (x - 0) + \frac{e^0}{2!}(x - 0)^2 + \cdots \\
&= 1 + x + \frac{x^2}{2} + \cdots = \sum_{k=0}^{\infty} \frac{x^k}{k!}.
\end{aligned}
\]
□
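Practice Exercise 1.5. Evaluate $\sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!}$.
Solution.
\[
\sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{-\lambda} e^{\lambda} = 1.
\]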
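The result of this exercise is easy to confirm numerically for a few values of $\lambda$. The sketch below truncates the infinite sum at 100 terms, which is an arbitrary cutoff of ours that is more than enough for the values of $\lambda$ tried here.

```python
import math


def poisson_pmf_sum(lam, terms=100):
    """Partial sum of lam^k * exp(-lam) / k! for k = 0, ..., terms - 1."""
    return sum(lam ** k * math.exp(-lam) / math.factorial(k) for k in range(terms))


for lam in (0.5, 2.5, 10.0):
    print(lam, poisson_pmf_sum(lam))
```

Each sum is 1 up to floating-point error, regardless of $\lambda$. The terms of this sum are the probabilities of what is commonly called the Poisson distribution, so the exercise shows that these probabilities add up to one.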
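\[
\begin{aligned}
\underbrace{e^{j\theta}}_{=\cos\theta + j\sin\theta} &= 1 + j\theta + \frac{(j\theta)^2}{2!} + \cdots \\
&= \underbrace{\left(1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} + \cdots\right)}_{\text{real}} + j\underbrace{\left(\theta - \frac{\theta^3}{3!} + \cdots\right)}_{\text{imaginary}}
\end{aligned}
\]
Matching the real and the imaginary terms, we can show that
\[
\begin{aligned}
\cos\theta &= 1 - \frac{\theta^2}{2!} + \frac{\theta^4}{4!} + \cdots \\
\sin\theta &= \theta - \frac{\theta^3}{3!} + \frac{\theta^5}{5!} + \cdots
\end{aligned}
\]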
This gives the infinite series representations of the two trigonometric functions.
\[
\begin{aligned}
f(x) &= f(0) + f'(0)(x - 0) + \frac{f''(0)}{2}(x - 0)^2 + O(x^3) \\
&= \log 1 + \frac{1}{(1 + 0)}\, x - \frac{1}{2(1 + 0)^2}\, x^2 + O(x^3) \\
&= x - \frac{x^2}{2} + O(x^3).
\end{aligned}
\]
□
The difference between this result and the result we showed in the beginning of this
section is the order of polynomials we used to approximate the logarithm:
First-order: $\log(1 + x) \approx x$
Second-order: $\log(1 + x) \approx x - \frac{x^2}{2}$.
What order of approximation is good? It depends on where you want the approximation to
be good, and how far you want the approximation to go. The difference between first-order
and second-order approximations is shown in Figure 1.7.
Figure 1.7: The function $f(x) = \log(1 + x)$, the first-order approximation $\hat{f}(x) = x$, and the
second-order approximation $\hat{f}(x) = x - \frac{x^2}{2}$.
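To see where each order of approximation is good, you can tabulate the errors at a few points. In the sketch below the second-order coefficient follows from $f(0) = 0$, $f'(0) = 1$, and $f''(0) = -1$, so the quadratic term is $-x^2/2$; the function names and the sample points are ours.

```python
import math


def log1p_first_order(x):
    """First-order Taylor approximation of log(1 + x) around 0."""
    return x


def log1p_second_order(x):
    """Second-order Taylor approximation of log(1 + x) around 0."""
    # f(0) = 0, f'(0) = 1, f''(0) = -1, so the quadratic term is -x^2 / 2.
    return x - x ** 2 / 2


for x in (0.01, 0.1, 0.5, 1.0):
    exact = math.log(1 + x)
    err1 = abs(exact - log1p_first_order(x))
    err2 = abs(exact - log1p_second_order(x))
    print(f"x = {x:4.2f}  exact = {exact:.6f}  err1 = {err1:.2e}  err2 = {err2:.2e}")
```

For small $x$ both approximations are close, and the second-order one is markedly better. As $x$ moves toward 1, both errors grow, which is the point made in the text: how good an approximation is depends on where you use it.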
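Example 1.2. When we prove the Central Limit Theorem in Chapter 6, we need to
use the following result.
\[
\lim_{N \to \infty} \left(1 + \frac{s^2}{2N}\right)^N = e^{s^2/2}.
\]
The proof of this equation can be done using the Taylor approximation. Consider
$N \log\left(1 + \frac{s^2}{2N}\right)$. By the logarithmic lemma, we can obtain the second-order
approximation:
\[
\log\left(1 + \frac{s^2}{2N}\right) \approx \frac{s^2}{2N} - \frac{s^4}{8N^2}.
\]
Therefore, multiplying both sides by $N$ yields
\[
N \log\left(1 + \frac{s^2}{2N}\right) \approx \frac{s^2}{2} - \frac{s^4}{8N}.
\]
Taking the limit as $N \to \infty$, the second term (and every higher-order term) vanishes:
\[
\lim_{N \to \infty} N \log\left(1 + \frac{s^2}{2N}\right) = \frac{s^2}{2}.
\]
Taking the exponential of both sides gives
\[
\exp\left\{\lim_{N \to \infty} N \log\left(1 + \frac{s^2}{2N}\right)\right\} = \exp\left\{\frac{s^2}{2}\right\}.
\]
Moving the limit outside the exponential yields the result. Figure 1.8 provides a
pictorial illustration.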
Figure 1.8: We plot a sequence of functions $f_N(s) = \left(1 + \frac{s^2}{2N}\right)^N$ and its limit $f(s) = e^{s^2/2}$.
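The convergence in Example 1.2 can also be observed numerically. The sketch below fixes $s = 1$, which is an arbitrary choice of ours, and prints the gap between $\left(1 + \frac{s^2}{2N}\right)^N$ and $e^{s^2/2}$ as $N$ grows.

```python
import math


def f_N(s, N):
    """The finite-N expression (1 + s^2 / (2N))^N."""
    return (1 + s ** 2 / (2 * N)) ** N


s = 1.0
limit = math.exp(s ** 2 / 2)
for N in (1, 10, 100, 1000, 10000):
    print(N, f_N(s, N), limit - f_N(s, N))
```

The gap shrinks roughly in proportion to $1/N$, which is consistent with the correction term of order $s^4/N$ that appears in the derivation before the limit is taken.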
1.3 Integration
When you learned calculus, your teacher probably told you that there are two ways to
compute an integral:
Substitution:
\[
\int f(ax)\, dx = \frac{1}{a} \int f(u)\, du.
\]
By parts:
\[
\int u\, dv = u v - \int v\, du.
\]
Besides these two, we want to teach you two more. The first technique is to exploit even and odd
functions when integrating over an interval that is symmetric about the y-axis. If a function is even,
you only need to integrate over half of the interval and double the result. If a function is odd, the integral is zero. The
second technique is to leverage the fact that a probability density function integrates to 1.
We will discuss the first technique here and defer the second technique to Chapter 4.
Besides the two integration techniques, we will review the fundamental theorem of
calculus. We will need it when we study cumulative distribution functions in Chapter 4.
and $f$ is odd if
\[
f(x) = -f(-x). \tag{1.11}
\]
Essentially, an even function is unchanged when flipped about the y-axis, whereas an odd function
is unchanged when flipped about both the x- and y-axes.
See Figure 1.9(a) for illustration. When integrating the function, we have
\[
\int_{-1}^{1} f(x)\, dx = 2 \int_{0}^{1} f(x)\, dx = 2 \int_{0}^{1} \left(x^2 - 0.4 x^4\right) dx = 2 \left[\frac{x^3}{3} - \frac{0.4}{5} x^5\right]_{x=0}^{x=1} = \frac{38}{75}.
\]
\[
f(-x) = (-x) \exp\left\{-\frac{(-x)^2}{2}\right\} = -x \exp\left\{-\frac{x^2}{2}\right\} = -f(x).
\]
See Figure 1.9(b) for illustration. When integrating the function, we can let u = −x.
Then, the integral becomes
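\[
\begin{aligned}
\int_{-1}^{1} f(x)\, dx &= \int_{-1}^{0} f(x)\, dx + \int_{0}^{1} f(x)\, dx \\
&= \int_{0}^{1} f(-u)\, du + \int_{0}^{1} f(x)\, dx \\
&= -\int_{0}^{1} f(u)\, du + \int_{0}^{1} f(x)\, dx = 0.
\end{aligned}
\]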
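Both symmetry shortcuts can be checked numerically with the two functions used in these examples, $x^2 - 0.4x^4$ (even) and $x \exp\{-x^2/2\}$ (odd). The sketch below uses the midpoint rule, which is our choice for illustration; the helper names are ours as well.

```python
import math


def midpoint_integral(f, a, b, steps=20_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def f_even(x):
    return x ** 2 - 0.4 * x ** 4


def f_odd(x):
    return x * math.exp(-x ** 2 / 2)


full_even = midpoint_integral(f_even, -1.0, 1.0)      # integrate over [-1, 1]
half_even = 2 * midpoint_integral(f_even, 0.0, 1.0)   # twice the integral over [0, 1]
full_odd = midpoint_integral(f_odd, -1.0, 1.0)        # should vanish

print(full_even, half_even, 38 / 75)
print(full_odd)
```

The first line prints three nearly identical numbers, and the second prints a value that is zero up to rounding error.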
Before we prove the result, let us understand the theorem if you have forgotten its meaning.
\[
f(x) = \frac{d}{dx} F(x) = \frac{d}{dx} \frac{x^3}{3} = x^2.
\]
The fundamental theorem of calculus basically puts the two together:
\[
f(x) = \frac{d}{dx} \int_0^x f(t)\, dt.
\]
How can the fundamental theorem of calculus ever be useful when studying probabil-
ity? Very soon you will learn two concepts: probability density function and cumulative
distribution function. These two functions are related to each other by the fundamental
theorem of calculus. To give you a concrete example, we write down the probability density
function of an exponential random variable. (Please do not panic about the exponential
random variable. Just think of it as a “rapidly decaying” function.)
\[
f(x) = e^{-x}, \quad x \ge 0.
\]
It turns out that the cumulative distribution function is
\[
F(x) = \int_0^x f(t)\, dt = \int_0^x e^{-t}\, dt = 1 - e^{-x}.
\]
You can also check that $f(x) = \frac{d}{dx} F(x)$. The fundamental theorem of calculus says that if
you tell me $F(x) = \int_0^x e^{-t}\, dt$ (for whatever reason), I will be able to tell you that $f(x) = e^{-x}$
merely by visually inspecting the integrand without doing the differentiation.
Figure 1.10 illustrates the pair of functions $f(x) = e^{-x}$ and $F(x) = 1 - e^{-x}$. One thing
you should notice is that the height of $F(x)$ is the area under the curve of $f(t)$ from $-\infty$ to $x$.
For example, in Figure 1.10 we show the area under the curve from 0 to 2. Correspondingly
in $F(x)$, the height is $F(2)$.
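The relationship between $f$ and $F$ can be checked numerically: integrate $e^{-t}$ to get $F$, then differentiate $F$ to recover $f$. The sketch below uses the midpoint rule for the integral and a central finite difference for the derivative; both are our choices for illustration, and the step sizes are arbitrary.

```python
import math


def F(x, steps=10_000):
    """Midpoint-rule approximation of the integral of exp(-t) from 0 to x."""
    h = x / steps
    return h * sum(math.exp(-(i + 0.5) * h) for i in range(steps))


def derivative(func, x, h=1e-4):
    """Central finite-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)


x = 2.0
print(F(x), 1 - math.exp(-x))          # area under f from 0 to 2 versus 1 - e^{-2}
print(derivative(F, x), math.exp(-x))  # slope of F at 2 versus f(2) = e^{-2}
```

The first line shows that the height $F(2)$ equals the area under $f$ from 0 to 2, and the second shows that the slope of $F$ at 2 equals $f(2)$, which is the statement of the fundamental theorem of calculus for this pair.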
Figure 1.10: The pair of functions $f(x) = e^{-x}$ and $F(x) = 1 - e^{-x}$.
The following proof of the Fundamental Theorem of Calculus can be skipped if it is your
first time reading the book.
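Proof. Our proof is based on Stewart (6th Edition), Section 5.3. Define the integral as a
function $F$:
\[
F(x) = \int_a^x f(t)\, dt.
\]
The derivative of $F$ with respect to $x$ is
\[
\begin{aligned}
\frac{d}{dx} F(x) &= \lim_{h \to 0} \frac{F(x + h) - F(x)}{h} \\
&= \lim_{h \to 0} \frac{1}{h} \left(\int_a^{x+h} f(t)\, dt - \int_a^{x} f(t)\, dt\right) \\
&= \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} f(t)\, dt \\
&\overset{(a)}{\le} \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} \left\{\max_{x \le \tau \le x+h} f(\tau)\right\} dt \\
&= \lim_{h \to 0} \left\{\max_{x \le \tau \le x+h} f(\tau)\right\}.
\end{aligned}
\]
The inequality in (a) holds because
\[
f(t) \le \max_{x \le \tau \le x+h} f(\tau)
\]
for all $x \le t \le x + h$.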
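Similarly, bounding the integrand from below by its minimum gives
\[
\begin{aligned}
\frac{d}{dx} F(x) &= \lim_{h \to 0} \frac{F(x + h) - F(x)}{h} \\
&= \lim_{h \to 0} \frac{1}{h} \left(\int_a^{x+h} f(t)\, dt - \int_a^{x} f(t)\, dt\right) \\
&= \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} f(t)\, dt \\
&\ge \lim_{h \to 0} \frac{1}{h} \int_x^{x+h} \left\{\min_{x \le \tau \le x+h} f(\tau)\right\} dt \\
&= \lim_{h \to 0} \left\{\min_{x \le \tau \le x+h} f(\tau)\right\}.
\end{aligned}
\]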
Since both limits converge to $f(x)$ as $h \to 0$, we conclude that
$\frac{d}{dx} F(x) = f(x)$.
□
Proof. We can prove this with the chain rule: Let $y = g(x)$. Then we have
\[
\frac{d}{dx} \int_a^{g(x)} f(t)\, dt = \frac{dy}{dx} \cdot \frac{d}{dy} \int_a^{y} f(t)\, dt = g'(x)\, f(y),
\]
The two most important subjects for data science are probability, which is the subject of the
book you are reading, and linear algebra, which concerns matrices and vectors. We cannot
cover linear algebra in detail because this would require another book. However, we need to
highlight some ideas that are important for doing data analysis.
1.4. LINEAR ALGEBRA
What questions can we ask about this table? We can ask: What is the most influential
cause of the crime rate? What are the leading contributions to the crime rate? To answer
these questions, we need to describe these numbers. One way to do it is to put the numbers
in matrices and vectors. For example,
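\[
\boldsymbol{y}_{\text{crime}} = \begin{bmatrix} 478 \\ 494 \\ \vdots \\ 940 \end{bmatrix}, \quad
\boldsymbol{x}_{\text{fund}} = \begin{bmatrix} 40 \\ 32 \\ \vdots \\ 66 \end{bmatrix}, \quad
\boldsymbol{x}_{\text{hs}} = \begin{bmatrix} 74 \\ 72 \\ \vdots \\ 67 \end{bmatrix}, \quad \ldots
\]
With this vector expression of the data, the analysis questions can roughly be translated
to finding β’s in the following equation:
\[
\boldsymbol{y}_{\text{crime}} = \beta_{\text{fund}} \boldsymbol{x}_{\text{fund}} + \beta_{\text{hs}} \boldsymbol{x}_{\text{hs}} + \cdots + \beta_{\text{college4}} \boldsymbol{x}_{\text{college4}}.
\]
This equation offers a lot of useful insights. First, it is a linear model of $\boldsymbol{y}_{\text{crime}}$. We call
it a linear model because the observable $\boldsymbol{y}_{\text{crime}}$ is written as a linear combination of the
variables $\boldsymbol{x}_{\text{fund}}$, $\boldsymbol{x}_{\text{hs}}$, etc. The linear model assumes that the variables are scaled and added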
to generate the observed phenomena. This assumption is not always realistic, but it is often
a fair assumption that greatly simplifies the problem. For example, if we can show that all
β’s are zero except βfund , then we can conclude that the crime rate is solely dependent on
the police funding. If two variables are correlated, e.g., high school graduate and college
graduate, we would expect the β’s to change simultaneously.
The linear model can further be simplified to a matrix-vector equation:
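\[
\boldsymbol{y}_{\text{crime}} =
\begin{bmatrix}
| & | & & | \\
\boldsymbol{x}_{\text{fund}} & \boldsymbol{x}_{\text{hs}} & \cdots & \boldsymbol{x}_{\text{college4}} \\
| & | & & |
\end{bmatrix}
\begin{bmatrix}
\beta_{\text{fund}} \\ \beta_{\text{hs}} \\ \vdots \\ \beta_{\text{college4}}
\end{bmatrix}.
\]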
Here, the lines “|” emphasize that the vectors are column vectors. If we denote the matrix
in the middle as A and the vector as β, then the equation is equivalent to y = Aβ. So we
can find β by appropriately inverting the matrix A. If two columns of A are dependent, we
will not be able to resolve the corresponding β’s uniquely.
As you can see from the above data analysis problem, matrices and vectors offer a way
to describe the data. We will discuss the calculations in Chapter 7. However, to understand
how to interpret the results from the matrix-vector equations, we need to review some basic
ideas about matrices and vectors.
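Here, $\boldsymbol{x}_j$ denotes the $j$th column of $\boldsymbol{X}$, and $\boldsymbol{x}^i$ denotes the $i$th row of $\boldsymbol{X}$. The $(i, j)$th element
of $\boldsymbol{X}$ is denoted as $x_{ij}$ or $[\boldsymbol{X}]_{ij}$. The identity matrix is denoted as $\boldsymbol{I}$. The $i$th column of $\boldsymbol{I}$
is denoted as $\boldsymbol{e}_i = [0, \ldots, 1, \ldots, 0]^T$, and is called the $i$th standard basis vector. An all-zero
vector is denoted as $\boldsymbol{0} = [0, \ldots, 0]^T$.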
What is the most important thing to know about linear algebra? From a data analysis
point of view, Figure 1.11 gives us the answer. The picture is straightforward, but it captures
all the essence. In almost all the data analysis problems, ultimately, there are three things we
care about: (i) the observable vector $\boldsymbol{y}$, (ii) the variable vectors $\boldsymbol{x}_n$, and (iii) the coefficients
$\beta_n$. The set of variable vectors $\{\boldsymbol{x}_n\}_{n=1}^{N}$ spans a vector space in which all vectors are living.
Some of these variable vectors are correlated, and some are not. However, for the sake of
this discussion, let us assume they are independent of each other. Then for any observable
vector $\boldsymbol{y}$, we can always project $\boldsymbol{y}$ in the directions determined by $\{\boldsymbol{x}_n\}_{n=1}^{N}$. The projection
of $\boldsymbol{y}$ onto $\boldsymbol{x}_n$ is the coefficient $\beta_n$. A larger value of $\beta_n$ means that the variable $\boldsymbol{x}_n$ contributes
more.
Why is this picture so important? Because most of the data analysis problems can be
expressed, or approximately expressed, by the picture:
\[
\boldsymbol{y} = \sum_{n=1}^{N} \beta_n \boldsymbol{x}_n.
\]
If you recall the crime rate example, this equation is precisely the linear model we used to
describe the crime rate. This equation can also describe many other problems.
Example 1.6. Polynomial fitting. Consider a dataset of pairs of numbers $(t_m, y_m)$ for
$m = 1, \ldots, M$, as shown in Figure 1.12. After a visual inspection of the dataset, we
propose to use a line to fit the data. A line is specified by the equation
\[
y_m = a t_m + b, \quad m = 1, \ldots, M,
\]
where $a \in \mathbb{R}$ is the slope and $b \in \mathbb{R}$ is the y-intercept. The goal of this problem is to
find one line (which is fully characterized by $(a, b)$) such that it has the best fit to all
the data pairs $(t_m, y_m)$ for $m = 1, \ldots, M$. This problem can be described in matrices
1.4. LINEAR ALGEBRA
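$$\begin{bmatrix} y_1 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} t_1 & 1 \\ \vdots & \vdots \\ t_M & 1 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix},$$
or more compactly,
$$y = \beta_1 x_1 + \beta_2 x_2,$$
with $\beta_1 = a$ and $\beta_2 = b$. Here, $x_1 = [t_1, \ldots, t_M]^T$ contains all the variable values, and
$x_2 = [1, \ldots, 1]^T$ contains a constant offset.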
tm        ym
0.1622    2.1227
0.7943    3.3354
⋮         ⋮
0.7379    3.4054
0.2691    2.5672
0.4228    2.3796
0.6020    3.2942
[Plot: the data points together with the best-fit line and a candidate line; horizontal axis from 0 to 1, vertical axis from 1 to 5.]
Figure 1.12: Example of fitting a set of data points. The problem can be described by y =
β1 x1 + β2 x2 .
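The line fit of Example 1.6 can be sketched in Python as follows. This is a minimal illustration, not the book's own code: it uses only the six data pairs visible in the table (the elided rows are not available), and it assumes that "best fit" means the least-squares fit computed by np.linalg.lstsq.

```python
import numpy as np

# Only the six (t_m, y_m) pairs printed in the table next to Figure 1.12;
# the rows elided by the dots in the table are not available here.
t = np.array([0.1622, 0.7943, 0.7379, 0.2691, 0.4228, 0.6020])
y = np.array([2.1227, 3.3354, 3.4054, 2.5672, 2.3796, 3.2942])

x1 = t                           # variable values
x2 = np.ones_like(t)             # constant offset
X = np.column_stack([x1, x2])    # M-by-2 matrix whose columns are x1 and x2

# Least-squares estimate of beta = [beta_1, beta_2] = [a, b]
beta, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)
a, b = beta

y_fit = beta[0] * x1 + beta[1] * x2   # fitted values on the line
print("slope a =", a, "intercept b =", b)
```

The two columns of X are exactly the vectors x1 and x2 of the example, so the fitted line is the linear combination β1 x1 + β2 x2.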
Example 1.7. Image compression. The JPEG compression for images is based on
the concept of discrete cosine transform (DCT). The DCT consists of a set of basis
vectors, or $\{x_n\}_{n=1}^N$ using our notation. In the most standard setting, each basis vector
$x_n$ consists of 8 × 8 pixels, and there are N = 64 of these $x_n$'s. Given an image, we can
partition the image into M small blocks of 8 × 8 pixels. Let us call one of these blocks
y. Then, DCT represents the observation y as a linear combination of the DCT basis
vectors:
$$y = \sum_{n=1}^{N} \beta_n x_n.$$
The coefficients $\{\beta_n\}_{n=1}^N$ are called the DCT coefficients. They provide a representation
of y: once we know $\{\beta_n\}_{n=1}^N$, we can completely describe y because the
basis vectors $\{x_n\}_{n=1}^N$ are known and fixed. The situation is depicted in Figure 1.13.
How can we compress images using DCT? In the 1970s, scientists found that most
images have strong leading DCT coefficients but weak tail DCT coefficients. In other
words, among the N = 64 βn ’s, only the first few are important. If we truncate the
number of DCT coefficients, we can effectively compress the number of bits required
to represent the image.
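The truncation idea can be illustrated with a small NumPy sketch. It is not the actual JPEG pipeline (quantization and entropy coding are not modeled). The smooth 8 × 8 block and the choice of keeping a 3 × 3 corner of low-frequency coefficients are hypothetical, and the orthonormal DCT-II matrix is built by hand so that no extra library is assumed.

```python
import numpy as np

N = 8
n = np.arange(N)
k = n.reshape(-1, 1)

# Orthonormal 1D DCT-II matrix: row k holds the k-th cosine basis vector.
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)

# A smooth, hypothetical 8x8 block of pixel values (plays the role of y).
u, v = np.meshgrid(n, n, indexing="ij")
block = 100.0 + 5.0 * u + 3.0 * v

coeffs = C @ block @ C.T          # the 64 DCT coefficients (the beta_n's)
recon_full = C.T @ coeffs @ C     # using all 64 coefficients recovers the block

# Keep only the 3x3 low-frequency corner, i.e., 9 of the 64 coefficients.
mask = np.zeros((N, N))
mask[:3, :3] = 1.0
recon_trunc = C.T @ (coeffs * mask) @ C

rel_err = np.linalg.norm(block - recon_trunc) / np.linalg.norm(block)
print("coefficients kept:", int(mask.sum()), "of", N * N)
print("relative reconstruction error:", rel_err)
```

For a smooth block like this one, most of the energy sits in the leading coefficients, so dropping the tail changes the block only slightly.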
Figure 1.13: JPEG image compression is based on the concept of discrete cosine transform, which
can be formulated as a matrix-vector problem.
We hope by now you are convinced of the importance of matrices and vectors in the
context of data science. They are not “yet another” subject but an essential tool you must
know how to use. So, what are the technical materials you must master? Here we go.
Practice Exercise 1.7. Let x = [1, 0, −1]T , and y = [3, 2, 0]T . Find xT y.
Solution. The inner product is xT y = (1)(3) + (0)(2) + (−1)(0) = 3.
Inner products are important because they tell us how two vectors are correlated.
Figure 1.14 depicts the geometric meaning of an inner product. If two vectors are correlated
(i.e., nearly parallel), then the inner product will give us a large value. Conversely, if the
two vectors are close to perpendicular, then the inner product will be small. Therefore, the
inner product provides a measure of the closeness/similarity between two vectors.
Figure 1.14: Geometric interpretation of inner product: We project one vector onto the other vector.
The projected distance is the inner product.
Creating vectors and computing the inner products are straightforward in MATLAB.
We simply need to define the column vectors x and y by using the command [] with ; to
denote the next row. The inner product is done using the transpose operation x’ and vector
multiplication *.
In Python, constructing a vector is done using the command np.array. Inside this
command, one needs to enter the array. For a column vector, we write [[1],[2],[3]], with
an outer [], and three inner [] for each entry. If the vector is a row vector, then one can omit
the inner []'s by just calling np.array([1, 2, 3]) (strictly speaking, this creates a
one-dimensional array). Given two column vectors x and y,
the inner product is computed via np.dot(x.T,y), where np.dot is the command for inner
product, and x.T returns the transpose of x. One can also call np.transpose(x), which is
the same as x.T.
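As a minimal sketch of the Python commands just described, the following reuses the vectors of Practice Exercise 1.7. The variable names are illustrative.

```python
import numpy as np

# Column vectors of shape (3, 1), from Practice Exercise 1.7
x = np.array([[1], [0], [-1]])
y = np.array([[3], [2], [0]])

z = np.dot(x.T, y)     # inner product, returned as a 1-by-1 array
print(z)               # [[3]]
print(z.item())        # 3

# With one-dimensional arrays no transpose is needed
x_flat = np.array([1, 0, -1])
y_flat = np.array([3, 2, 0])
print(np.dot(x_flat, y_flat))   # 3
```

Note that np.dot applied to two column vectors returns a 1-by-1 array rather than a plain number; .item() extracts the scalar.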
In data analytics, the inner product of two vectors can be useful. Consider the vectors
in Table 1.1. Just from looking at the numbers, you probably will not see anything wrong.
However, let's compute the inner products. It turns out that $x_1^T x_2 = -0.0031$, whereas
$x_1^T x_3 = 2.0020$. There is almost no correlation between x1 and x2, but there is a substantial
correlation between x1 and x3. What happened? The vectors x1 and x2 are random
vectors constructed independently and uncorrelated to each other. The last vector x3 was
constructed by $x_3 = 2x_1 - \pi/1000$. Since x3 is completely constructed from x1, they have
to be correlated.
   x1         x2         x3
 0.0006    −0.0011    −0.0020
−0.0014    −0.0024    −0.0059
−0.0034     0.0073    −0.0099
   ⋮          ⋮          ⋮
 0.0001    −0.0066    −0.0030
 0.0074     0.0046     0.0116
 0.0007    −0.0061    −0.0017
One caveat for this example is that the naive inner product xTi xj is scale-dependent.
For example, the vectors x3 = x1 and x3 = 1000x1 have the same amount of correlation,
but the simple inner product will give a larger value for the latter case. To solve this problem
we first define the norm of the vectors:
$$\|x\|_p = \left( \sum_{i=1}^{N} |x_i|^p \right)^{1/p}, \qquad (1.15)$$
for any p ≥ 1.
The norm essentially tells us the length of the vector. This is most obvious if we consider
the ℓ2 -norm:
$$\|x\|_2 = \left( \sum_{i=1}^{N} x_i^2 \right)^{1/2}.$$
By taking the square on both sides, one can show that $\|x\|_2^2 = x^T x$. This is called the
squared ℓ2-norm, and is the sum of the squares.
In MATLAB, computing the norm is done using the command norm. Here, we can
indicate the types of norms, e.g., norm(x,1) returns the ℓ1-norm whereas norm(x,2) returns
the ℓ2-norm (which is also the default).
In Python, the norm command is provided in the np.linalg module. To call the ℓ1-norm, we use
np.linalg.norm(x,1), and by default the ℓ2-norm is np.linalg.norm(x).
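A short sketch of these norm commands in Python follows. The vector x is an arbitrary illustrative choice, and the hand-written ℓp expression simply spells out the sum in the norm definition (1.15) with absolute values.

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1 = np.linalg.norm(x, 1)          # sum of absolute values
l2 = np.linalg.norm(x)             # default is the l2-norm
linf = np.linalg.norm(x, np.inf)   # largest absolute value

# General l_p norm written out from the definition, here with p = 3
p = 3
lp_manual = np.sum(np.abs(x) ** p) ** (1.0 / p)

print(l1, l2, linf)                                   # 7.0 5.0 4.0
print(np.isclose(l2 ** 2, x @ x))                     # True
print(np.isclose(lp_manual, np.linalg.norm(x, p)))    # True
```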
Using the norm, one can define an angle called the cosine angle between two vectors.
$$\cos\theta = \frac{x^T y}{\|x\|_2 \, \|y\|_2}. \qquad (1.16)$$
The difference between the cosine angle and the basic inner product is the normaliza-
tion in the denominator, which is the product ∥x∥2 ∥y∥2 . This normalization factor scales
the vector x to x/∥x∥2 and y to y/∥y∥2 . The scaling makes the length of the new vector
equal to unity, but it does not change the vector’s orientation. Therefore, the cosine angle
is not affected by a very long vector or a very short vector. Only the angle matters. See
Figure 1.15.
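The scale invariance described above can be checked numerically. The sketch below is illustrative only: the helper cosine_angle, the random vectors, and the seed are assumptions and are not the data of Table 1.1.

```python
import numpy as np

def cosine_angle(x, y):
    """Cosine of the angle between two 1-D vectors: x^T y / (||x|| ||y||)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

rng = np.random.default_rng(0)       # fixed seed, an arbitrary choice
x1 = rng.standard_normal(1000)
x2 = rng.standard_normal(1000)       # generated independently of x1

# The plain inner product grows with the scale of the second vector ...
inner_same = np.dot(x1, x1)
inner_scaled = np.dot(x1, 1000 * x1)

# ... whereas the cosine angle does not.
cos_same = cosine_angle(x1, x1)
cos_scaled = cosine_angle(x1, 1000 * x1)
cos_indep = cosine_angle(x1, x2)

print(inner_same, inner_scaled)
print(cos_same, cos_scaled, cos_indep)
```

Both cos_same and cos_scaled equal 1 up to rounding, while the cosine angle between the two independent vectors is close to 0.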
Figure 1.15: The cosine angle is the inner product divided by the norms of the vectors.
Going back to the previous example, after normalization we can show that the cosine
angle between x1 and x2 is cos θ1,2 = −0.0031, whereas the cosine angle between x1 and
x3 is cos θ1,3 = 0.8958. There is still a strong correlation between x1 and x3 , but now using
the cosine angle the value is between −1 and +1.
Remark 1: There are other norms one can use. The ℓ1 -norm is useful for sparse models
where we want to have the fewest possible non-zeros. The ℓ1 -norm of x is
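$$\|x\|_1 = \sum_{i=1}^{N} |x_i|,$$
which is the sum of absolute values. The ℓ∞-norm picks the maximum of $\{|x_1|, \ldots, |x_N|\}$:
$$\|x\|_\infty = \lim_{p \to \infty} \left( \sum_{i=1}^{N} |x_i|^p \right)^{1/p} = \max\{|x_1|, \ldots, |x_N|\}.$$
For a diagonal weight matrix W, the weighted ℓ2-norm is
$$\|x\|_W^2 = x^T W x = \begin{bmatrix} x_1 & \cdots & x_N \end{bmatrix} \begin{bmatrix} w_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & w_N \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_N \end{bmatrix} = \sum_{i=1}^{N} w_i x_i^2. \qquad (1.17)$$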
The geometry of the weighted ℓ2 -norm is determined by the matrix W . For example,
if W = I (the identity operator), then $\|x\|_W^2 = \|x\|_2^2$, which defines a circle. If W is any
“non-negative” matrix², then $\|x\|_W^2$ defines an ellipse.
2 The technical term for these matrices is positive semi-definite matrices.
In MATLAB, the weighted inner product is just a sequence of two matrix-vector mul-
tiplications. This can be done using the command x’*W*x as shown below.
In Python, constructing the matrix W and the column vector x is done using np.array.
The matrix-vector multiplication is done using two np.dot commands: one for np.dot(W,x)
and the other one for np.dot(x.T, np.dot(W,x)).
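A minimal Python sketch of the two np.dot calls is given below. The diagonal matrix W and the vector x are illustrative values, not taken from the text.

```python
import numpy as np

# Illustrative diagonal weight matrix and column vector
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
x = np.array([[2.0], [-1.0], [1.0]])

Wx = np.dot(W, x)          # first matrix-vector multiplication
z = np.dot(x.T, Wx)        # second one: x^T (W x), a 1-by-1 array
print(z.item())            # 9.0, i.e., 1*4 + 2*1 + 3*1

# For a diagonal W this agrees with the sum of w_i * x_i^2
z_sum = np.sum(np.diag(W) * x.ravel() ** 2)
print(np.isclose(z.item(), z_sum))   # True
```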
This equation is self-explanatory. The norm $\|\clubsuit - \heartsuit\|^2$ measures the deviation. If y can
be perfectly explained by $\{x_n\}_{n=1}^N$, then the norm can eventually go to zero by finding a
good set of $\{\beta_1, \ldots, \beta_N\}$. The symbol $\underset{\beta_1, \ldots, \beta_N}{\text{minimize}}$ means to minimize the function by finding
$\{\beta_1, \ldots, \beta_N\}$. Note that the norm is taking a vector as the input and generating a scalar as
the output. It can be expressed as
$$\varepsilon(\beta) \overset{\text{def}}{=} \left\| y - \sum_{n=1}^{N} \beta_n x_n \right\|^2,$$
for scalar problems. It is the same story for vectors. What we do is to take the derivative of
the error and set it equal to zero:
$$\frac{d}{d\beta}\, \varepsilon(\beta) = 0.$$
Now the question arises, how do we take the derivatives of ε(β) when it takes a vector as
input? If we can answer this question, we will find the best β. The answer is straightforward.
Since the function has one output and many inputs, take the derivative for each element
independently. This is called the scalar differentiation of vectors.
As you can see from this definition, there is nothing conceptually challenging here. The only
difficulty is that things can get tedious because there will be many terms. However, the good
news is that mathematicians have already compiled a list of identities for common matrix
differentiation. So instead of deriving every equation from scratch, we can enjoy the fruit of
their hard work by referring to those formulae. The best place to find these equations is the
Matrix Cookbook by Petersen and Pedersen.3 Here, we will mention two of the most useful
results.
Example 1.8. Let $y = x^T A x$ for any matrix $A \in \mathbb{R}^{N \times N}$. Find $\frac{dy}{dx}$.
Solution.
$$\frac{d}{dx}\left( x^T A x \right) = Ax + A^T x.$$
Now, if A is symmetric, i.e., $A = A^T$, then
$$\frac{d}{dx}\left( x^T A x \right) = 2Ax.$$
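The identity in Example 1.8 can be sanity-checked numerically by differentiating one coordinate at a time, which is the element-by-element idea described earlier. The sketch below uses central differences; the random matrix, the seed, and the step size h are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed, an arbitrary choice
N = 4
A = rng.standard_normal((N, N))    # a generic (non-symmetric) matrix
x = rng.standard_normal(N)

def f(v):
    """The scalar function y = v^T A v."""
    return v @ A @ v

# Gradient from the identity d/dx (x^T A x) = A x + A^T x
grad_formula = A @ x + A.T @ x

# Gradient from element-wise central differences
h = 1e-6
grad_numeric = np.zeros(N)
for i in range(N):
    e = np.zeros(N)
    e[i] = 1.0                     # i-th standard basis vector
    grad_numeric[i] = (f(x + h * e) - f(x - h * e)) / (2 * h)

print(np.allclose(grad_formula, grad_numeric, atol=1e-6))   # True
```

When A is symmetric, the same check can be run against 2Ax.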
$$\varepsilon = \|Ax - y\|_2^2 = x^T A^T A x - 2 y^T A x + y^T y.$$
3 https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf
Going back to the crime rate problem, we can now show that
$$0 = \frac{d}{d\beta} \|y - X\beta\|^2 = 2X^T(X\beta - y).$$
Therefore, the solution is
$$\widehat{\beta} = (X^T X)^{-1} X^T y.$$
As you can see, if we do not have access to the matrix calculus, we will not be able to solve the
minimization problem. (There are alternative paths that do not require matrix calculus, but
they require an understanding of linear subspaces and properties of the projection operators.
So in some sense, matrix calculus is the easiest way to solve the problem.) When we discuss
the linear regression methods in Chapter 7, we will cover the interpretation of the inverses
and related topics.
Matrix inversion is done using the command inv in MATLAB and np.linalg.inv in
Python. Below is an example in Python.
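A minimal sketch of such an example, with an illustrative 3-by-2 matrix X chosen here for demonstration:

```python
import numpy as np

# Illustrative 3-by-2 matrix (values chosen only for demonstration)
X = np.array([[1.0, 3.0],
              [-2.0, 7.0],
              [0.0, 1.0]])

XtX = np.dot(X.T, X)             # the 2-by-2 matrix X^T X
XtX_inv = np.linalg.inv(XtX)     # its inverse
print(XtX_inv)

# Multiplying a matrix by its inverse should give the identity
print(np.allclose(np.dot(XtX, XtX_inv), np.eye(2)))   # True
```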
Sometimes, instead of computing the matrix inverse we are more interested in solving a
linear equation Xβ = y (the solution of which is $\widehat{\beta} = (X^T X)^{-1} X^T y$). In both MATLAB and
Python, there are built-in commands to do this. In MATLAB, the command is \ (backslash).
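In Python, one option (an assumption of this sketch, since the text above names only the MATLAB command) is np.linalg.lstsq, which returns the least-squares solution without forming the inverse explicitly. The matrix X and vector y below are illustrative.

```python
import numpy as np

# Illustrative data: 3 equations, 2 unknowns
X = np.array([[1.0, 3.0],
              [-2.0, 7.0],
              [0.0, 1.0]])
y = np.array([2.0, 5.0, 0.0])

# Explicit formula: beta = (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver applied directly to X beta = y
beta_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

# Solving the normal equations X^T X beta = X^T y without an explicit inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_inv, beta_lstsq, beta_solve)
```

All three routes give the same β here; in practice the solver-based calls are usually preferred over forming the inverse because they are numerically better behaved.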
1.5. BASIC COMBINATORICS
Closing remark: In this section, we have given a brief introduction to a few of the most
relevant concepts in linear algebra. We will introduce further concepts in linear algebra in
later chapters, such as eigenvalues, principal component analysis, linear transformations,
and regularization, as they become useful for our discussion.
The last topic we review in this chapter is combinatorics. Combinatorics concerns the
number of configurations that can be obtained from certain discrete experiments. It is useful
because it provides a systematic way of enumerating cases. Combinatorics often becomes
very challenging as the complexity of the event grows. However, you may rest assured that
in this book, we will not tackle the more difficult problems of combinatorics; we will confine
our discussion to two of the most basic principles: permutation and combination.
[Plot: probability (vertical axis, 0 to 1) versus number of people (horizontal axis, 0 to 100).]
Figure 1.16: The probability for two people in a group to have the same birthday as a function of the
number of people in the group.
If you think about this problem more deeply, you will probably realize that to solve the
problem, we must carefully enumerate all the possible configurations. How can we do this?
Well, suppose you walk into the room and sequentially pick two people. The probability
Figure 1.17: The probability for two people to have the same birthday as a function of the number of
people in the group. When there is only one person, this person can land on any of the 365 days. When
there are two people, the first person has already taken one day (out of 365 days), so the second person
can only choose 364 days. When there are three people, the first two people have occupied two days,
so there are only 363 days left. If we generalize this process, we see that the number of configurations
is 365 × 364 × · · · × (365 − k + 1), where k is the number of people in the room.
So imagine that you keep going down the list to the 50th person. The probability that
none of these 50 people will have the same birthday is
The first term in our equation, $\frac{365!}{(365-k)!}$, is called the permutation of picking k days from
365 options. We shall discuss this operation shortly.
Why is the probability so high with only 50 people while it seems that we need 366
people to ensure two identical birthdays? The difference is the notion of probabilistic and
deterministic. The 366-people argument is deterministic. If you have 366 people, you are
certain that two people will have the same birthday. This has no conflict with the proba-
bilistic argument because the probabilistic argument says that with 50 people, we have a
97% chance of getting two identical birthdays. With a 97% success rate, you still have a
3% chance of failing. It is unlikely to happen, but it can still happen. The more people you
put into the room, the stronger guarantee you will have. However, even if you have 364
people and the probability is almost 100%, there is still no guarantee. So there is no conflict
between the two arguments since they are answering two different questions.
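The numbers quoted in this discussion can be reproduced with a few lines of Python. The sketch assumes 365 equally likely birthdays and computes the probability that k people all have distinct birthdays as 365!/((365 − k)! · 365^k). The function name is hypothetical, and math.perm requires Python 3.8 or later.

```python
import math

def prob_all_distinct(k, days=365):
    """Probability that k people all have different birthdays,
    assuming every day is equally likely."""
    # math.perm(days, k) = days! / (days - k)!, and is 0 when k > days
    return math.perm(days, k) / days ** k

p50 = 1.0 - prob_all_distinct(50)
print(round(p50, 4))                    # 0.9704

# Probabilistic versus deterministic: with 364 people a shared birthday is
# almost certain but not guaranteed; with 366 people it is guaranteed.
print(prob_all_distinct(364) > 0.0)     # True
print(prob_all_distinct(366) == 0.0)    # True
```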
Now, let’s discuss the two combinatorics questions.
1.5.2 Permutation
Permutation concerns the following question:
Consider a set of n distinct balls. Suppose we want to pick k balls from the set without
replacement. How many ordered configurations can we obtain?
Note that in the above question, the word “ordered” is crucial. For example, the set
A = {a, b, c} can lead to 6 different ordered configurations
(a, b, c), (a, c, b), (b, a, c), (b, c, a), (c, a, b), (c, b, a).
Figure 1.18: Permutation. The number of choices is reduced in every stage. Therefore, the total number
is n × (n − 1) × · · · × (n − k + 1) if there are k stages.
If you start with the base, which contains five balls, you will have five choices. At one
level up, since one ball has already been taken, you have only four choices. You continue
the process until you reach the number of balls you want to collect. The number of
configurations you have generated is the permutation. Here is the formula: the number of
permutations of choosing k out of n is
$$\frac{n!}{(n-k)!} = n \times (n-1) \times \cdots \times (n-k+1).$$
Practice Exercise 1.8. Consider a set of 4 balls {1, 2, 3, 4}. We want to pick two
balls at random without replacement. The ordering matters. How many permutations
can we obtain?
Solution. The possible configurations are (1,2), (2,1), (1,3), (3,1), (1,4), (4,1), (2,3),
(3,2), (2,4), (4,2), (3,4), (4,3). So in total there are 12 configurations. We can also
verify this number by noting that there are 4 balls altogether, so the number
of choices for picking the first ball is 4 and the number of choices for picking the
second ball is (4 − 1) = 3. Thus, the total is 4 · 3 = 12. This
result coincides with the theorem, which states that the number of permutations is
$$\frac{4!}{(4-2)!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{2 \cdot 1} = 12.$$
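The enumeration in Practice Exercise 1.8 can be reproduced with the Python standard library. This is an illustrative check, not part of the original solution; math.perm requires Python 3.8 or later.

```python
import math
from itertools import permutations

balls = [1, 2, 3, 4]

# Every ordered way of picking 2 of the 4 balls without replacement
ordered_pairs = list(permutations(balls, 2))
print(ordered_pairs)
print(len(ordered_pairs))                         # 12

# The formula n! / (n - k)! with n = 4 and k = 2
print(math.perm(4, 2))                            # 12
print(math.factorial(4) // math.factorial(4 - 2)) # 12
```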
1.5.3 Combination
Another operation in combinatorics is combination. Combination concerns the following
question:
Consider a set of n distinct balls. Suppose we want to pick k balls from the set without
replacement. How many unordered configurations can we obtain?
Unlike permutation, combination treats a subset of balls with whatever ordering as
one single configuration. For example, the subset (a, b, c) is considered the same as (a, c, b)
or (b, c, a), etc.
Let’s go back to the 5-ball exercise. Suppose you have picked orange, green, and light
blue. This is the same combination as if you have picked {green, orange, and light blue},
or {green, light blue, and orange}. Figure 1.19 lists all the six possible configurations for
these three balls. So what is combination? Combination needs to take these repeated cases
into account.
Figure 1.19: Combination. In this problem, we are interested in picking 3 colored balls out of 5. This
will give us 5 × 4 × 3 = 60 permutations. However, since we are not interested in the ordering, some of
the permutations are repeated. For example, there are 6 combos of (green, light blue, orange), which is
computed from 3 × 2 × 1. Dividing 60 permutations by these 6 choices of the orderings will give us 10
distinct combinations of the colors.
Proof. We start with the permutation result, which gives us $\frac{n!}{(n-k)!}$ permutations. Note that
every permutation has exactly k balls. However, while these k balls can be arranged in any
order, in combination, we treat them as one single configuration. Therefore, the task is to
count the number of possible orderings for these k balls.
To this end, we note that for a set of k balls, there are in total k! possible ways of
ordering them. The number k! comes from the following table.
Therefore, the total number of orderings for a set of k balls is k!. Since permutation
gives us $\frac{n!}{(n-k)!}$ and every permutation has k! repetitions due to ordering, we divide the
number by k!. Thus the number of combinations is
$$\frac{n!}{k!(n-k)!}.$$
□
Practice Exercise 1.9. Consider a set of 4 balls {1, 2, 3, 4}. We want to pick two
balls at random without replacement. The ordering does not matter. How many com-
binations can we obtain?
Solution. The permutation result gives us 12 permutations. However, among all these
12 permutations, there are only 6 distinct pairs of numbers. We can confirm this by
noting that since we picked 2 balls, there are exactly 2 possible orderings for these 2
balls. Therefore, we have 12/2 = 6 combinations. Using the formula of the
theorem, we check that the number of combinations is
$$\frac{4!}{2!(4-2)!} = \frac{4 \cdot 3 \cdot 2 \cdot 1}{(2 \cdot 1)(2 \cdot 1)} = 6.$$
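As with permutations, the count can be verified by direct enumeration. The sketch below is illustrative; math.comb and math.perm require Python 3.8 or later.

```python
import math
from itertools import combinations

balls = [1, 2, 3, 4]

# Every unordered way of picking 2 of the 4 balls without replacement
pairs = list(combinations(balls, 2))
print(pairs)
print(len(pairs))                                  # 6

# The formula n! / (k! (n - k)!), and the same count obtained by dividing
# the number of permutations by the k! orderings of each pick
print(math.comb(4, 2))                             # 6
print(math.perm(4, 2) // math.factorial(2))        # 6

# The 5-ball example of Figure 1.19: 60 permutations / 6 orderings
print(math.comb(5, 3))                             # 10
```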
Example 1.10. (Ross, 8th edition, Section 1.6) Consider the equation
x1 + x2 + · · · + xK = N,
where {xk } are positive integers. How many combinations of solutions of this equation
are there?
Figure 1.20: One possible solution for N = 16 and K = 4. In general, the problem is equivalent
to inserting K − 1 dividers into the N − 1 gaps between N balls.
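The divider argument implies that the number of positive-integer solutions is the number of ways to choose K − 1 of the N − 1 gaps, i.e., (N − 1 choose K − 1). This is a standard counting result that is not spelled out in the text above. The brute-force check below is a sketch with a hypothetical helper name; it is practical only for small N and K.

```python
import math
from itertools import product

def count_positive_solutions(N, K):
    """Count tuples (x_1, ..., x_K) of positive integers with sum N
    by brute-force enumeration."""
    return sum(1 for xs in product(range(1, N + 1), repeat=K)
               if sum(xs) == N)

# The case shown in Figure 1.20
print(count_positive_solutions(16, 4))   # 455
print(math.comb(16 - 1, 4 - 1))          # 455
```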
Closing remark. Permutations and combinations are two ways to enumerate all the pos-
sible cases. While the conclusions are probabilistic, as the birthday paradox shows, permu-
tation and combination are deterministic. We do not need to worry about the distribution
of the samples, and we are not taking averages of anything. Thus, modern data analysis
seldom uses the concepts of permutation and combination. Accordingly, combinatorics does
not play a large role in this book.
Does it mean that combinatorics is not useful? Not quite, because it still provides us
with powerful tools for theoretical analysis. For example, in binomial random variables, we
need the concept of combination to calculate the repeated cases. The Poisson random vari-
able can be regarded as a limiting case of the binomial random variable, and so combination
is also used. Therefore, while we do not use the concepts of permutation per se, we use them
to define random variables.
1.6 Summary
In this chapter, we have reviewed several background mathematical concepts that will be-
come useful later in the book. You will find that these concepts are important for under-
standing the rest of this book. When studying these materials, we recommend not just
remembering the “recipes” of the steps but focusing on the motivations and intuitions
behind the techniques.
We would like to highlight the significance of the birthday paradox. Many of us come
from an engineering background in which we were told to ensure reliability and guarantee
success. We want to ensure that the product we deliver to our customers can survive even
in the worst-case scenario. We tend to apply deterministic arguments such as requiring 366
people to ensure complete coverage of the 365 days. In modern data analysis, the worst-case
scenario may not always be relevant because of the complexity of the problem and the cost
of such a warranty. The probabilistic argument, or the average argument, is more reasonable
and cost-effective, as you can see from our analysis of the birthday problem. The heart of
the problem is the trade-off between how much confidence you need versus how much effort
you need to expend. Suppose an event is unlikely to happen, but if it happens, it will be
a disaster. In that case, you might prefer to be very conservative to ensure that such a
disaster event has a low chance of happening. Industries related to risk management such
as insurance and investment banking are all operating under this principle.
1.7 Reference
Introductory materials
1-1 Erwin Kreyszig, Advanced Engineering Mathematics, Wiley, 10th Edition, 2011.
1-2 Henry Stark and John W. Woods, Probability and Random Processes with Applications
to Signal Processing, Prentice Hall, 3rd Edition, 2002. Appendix.
1-3 Michael J. Evans and Jeffrey S. Rosenthal, Probability and Statistics: The Science of
Uncertainty, W. H. Freeman, 2nd Edition, 2009. Appendix.
1-4 James Stewart, Single Variable Calculus, Early Transcendentals, Thomson Brooks/Cole,
6th Edition, 2008. Chapter 5.
Combinatorics
1-5 Dimitri P. Bertsekas and John N. Tsitsiklis, Introduction to Probability, Athena Sci-
entific, 2nd Edition, 2008. Section 1.6.
1-6 Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical En-
gineering, Prentice Hall, 3rd Edition, 2008. Section 2.6.
1-7 Athanasios Papoulis and S. Unnikrishna Pillai, Probability, Random Variables and
Stochastic Processes, McGraw-Hill, 4th Edition, 2001. Chapter 3.
Analysis
In some sections of this chapter, we use results from calculus and infinite series. Many formal
proofs can be found in the standard undergraduate real analysis textbooks.
1-8 Tom M. Apostol, Mathematical Analysis, Pearson, 1974.
1-9 Walter Rudin, Principles of Mathematical Analysis, McGraw Hill, 1976.
1.8 Problems
$1 + 2r + 3r^2 + \cdots$.
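Evaluate
$$\sum_{k=0}^{\infty} k \, \frac{\lambda^k e^{-\lambda}}{k!}, \quad \text{and} \quad \sum_{k=0}^{\infty} k^2 \, \frac{\lambda^k e^{-\lambda}}{k!}.$$
(a)
$$\int_a^b \frac{1}{b-a} \left( x - \frac{a+b}{2} \right)^2 dx.$$
(b)
$$\int_0^{\infty} \lambda x e^{-\lambda x} \, dx.$$
(c)
$$\int_{-\infty}^{\infty} \frac{\lambda x}{2} e^{-\lambda |x|} \, dx.$$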
Exercise 4.
(a) Compute the result of the following matrix-vector multiplication using NumPy. Submit
your result and code.
$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \times \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}.$$
(b) Plot a sine function on the interval [−π, π] with 1000 data points.
(c) Generate 10,000 uniformly distributed random numbers on interval [0, 1).
Use matplotlib.pyplot.hist to generate a histogram of all the random numbers.
Exercise 5.
Calculate
$$\sum_{k=0}^{\infty} k \left( \frac{2}{3} \right)^{k+1}.$$
Exercise 6.
Let
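$$\boldsymbol{x} = \begin{bmatrix} x \\ y \end{bmatrix}, \quad \boldsymbol{\mu} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} 4 & 1 \\ 1 & 1 \end{bmatrix}.$$
(d) Using matplotlib.pyplot.contour, plot the function f(x) for the range [−3, 3] ×
[−3, 3].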
Exercise 7.
Out of seven electrical engineering (EE) students and five mechanical engineering (ME)
students, a committee consisting of three EEs and two MEs is to be formed. In how many
ways can this be done if
(a) any of the EEs and any of the MEs can be included?
(b) one particular EE must be on the committee?
(c) two particular MEs cannot be on the committee?
Exercise 8.
Five blue balls, three red balls, and three white balls are placed in an urn. Three balls are
drawn at random without regard to the order in which they are drawn. Using the counting
approach to probability, find the probability that
(a) one blue ball, one red ball, and one white ball are drawn.
(b) all three balls drawn are red.
(c) exactly two of the balls drawn are blue.
Exercise 9.
A collection of 26 English letters, a-z, is mixed in a jar. Two letters are drawn at random,
one after the other.
(a) What is the probability of drawing a vowel (a,e,i,o,u) and a consonant in either order?
(b) Write a MATLAB / Python program to verify your answer in part (a). Randomly
draw two letters without replacement and check whether one is a vowel and the other
is a consonant. Compute the probability by repeating the experiment 10000 times.
Exercise 10.
There are 50 students in a classroom.
(a) What is the probability that there is at least one pair of students having the same
birthday? Show your steps.
(b) Write a MATLAB / Python program to simulate the event and verify your answer
in (a). Hint: You probably need to repeat the simulation many times to obtain a
probability. Submit your code and result.
You may assume that a year only has 365 days. You may also assume that all days have an
equal likelihood of being taken.
Chapter 2
Probability
Data and probability are inseparable. Data is the computational side of the story, whereas
probability is the theoretical side of the story. Any data science practice must be built on
the foundation of probability, and probability needs to address practical problems. However,
what exactly is “probability”? Mathematicians have been debating this for centuries. The
frequentists argue that probability is the relative frequency of an outcome. For example,
flipping a fair coin has a 1/2 probability of getting a head because if you flip the coin
infinitely many times, you will have half of the time getting a head. The Bayesians argue
that probability is a subjective belief. For example, the probability of getting an A in a
class is subjective because no one would want to take a class infinitely many times to obtain
the relative frequency. Both the frequentists and Bayesians have valid points. However, the
differentiation is often non-essential because the context of your problem will force you
to align with one or the other. For example, when you have a shortage of data, then the
subjectivity of the Bayesians allows you to use prior knowledge, whereas the frequentists
tell us how to compute the confidence interval of an estimate.
No matter whether you prefer the frequentist’s view or the Bayesian’s view, there is
something more fundamental thanks to Andrey Kolmogorov (1903-1987). The development
of this fundamental definition will take some effort on our part, but if we distill the essence,
we can summarize it as follows:
Probability is a measure of the size of a set.
This sentence is not a formal definition; instead, it summarizes what we believe to be the
essence of probability. We need to clarify some puzzles later in this chapter, but if you can
understand what this sentence means, you are halfway done with this book. To spell out the
details, we will describe an elementary problem that everyone knows how to solve. As we
discuss this problem, we will highlight a few key concepts that will give you some intuitive
insights into our definition of probability, after which we will explain the sequence of topics
to be covered in this chapter.
CHAPTER 2. PROBABILITY
ward problem. You probably have already found the answer, which is 2/6 because “less than
5” and “an even number” means {⚁, ⚃}. However, let’s go through the thinking process
slowly by explicitly writing down the steps.
First of all, how do we know that the denominator in 2/6 is 6? Well, because there are six
faces. These six faces form a set called the sample space. A sample space is the set containing
all possible outcomes, which in our case is Ω = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}. The denominator 6 is the
size of the sample space.
How do we know that the numerator is 2? Again, implicitly in our minds, we have
constructed two events: E1 = “less than 5” = {⚀, ⚁, ⚂, ⚃}, and E2 = “an even number”
= {⚁, ⚃, ⚅}. Then we take the intersection between these two events to conclude the event
E = {⚁, ⚃}. The numerical value “2” is the size of this event E.
So, when we say that “the probability is 2/6,” we are saying that the size of the event
E relative to the sample space Ω is the ratio 2/6. This process involves measuring the size
of E and Ω. In this particular example, the measure we use is a “counter” that counts the
number of elements.
This example shows us all the necessary components of probability: (i) There is a
sample space, which is the set that contains all the possible outcomes. (ii) There is an event,
which is a subset inside the sample space. (iii) Two events E1 and E2 can be combined to
construct another event E that is still a subset inside the sample space. (iv) Probability is
a number assigned by certain rules such that it describes the relative size of the event E
compared with the sample space Ω. So, when we say that probability is a measure of the
size of a set, we create a mapping that takes in a set and outputs the size of that set.
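The die example maps directly onto Python's built-in set type. The sketch below is illustrative: it represents the six faces by the integers 1 to 6 and uses the counting measure described above.

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                  # sample space: the six faces

E1 = {x for x in omega if x < 5}            # event "less than 5"
E2 = {x for x in omega if x % 2 == 0}       # event "an even number"
E = E1 & E2                                 # intersection of the two events

# Probability as the size of E relative to the size of the sample space
prob = Fraction(len(E), len(omega))
print(E == {2, 4})    # True
print(prob)           # 1/3, i.e., 2/6 in lowest terms
```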
2.1. SET THEORY
Why do we start the chapter by describing set theory? Because probability is a measure
of the size of a set. Yes, probability is not just a number telling us the relative frequency of
events; it is an operator that takes a set and tells us how large the set is. Using the example
we showed in the prelude, the event “even number” of a die is a set containing numbers
{⚁, ⚃, ⚅}. When we apply probability to this set, we obtain the number 3/6, as shown in
Figure 2.1. Thus sets are the foundation of the study of probability.
Figure 2.1: Probability is a measure of the size of a set. Whenever we talk about probability, it has to
be the probability of a set.
A = {ξ1 , ξ2 , . . . , ξn } (2.1)
To say that an element ξ is drawn from A, we write ξ ∈ A. For example, the number 1
is an element in the set {1, 2, 3}. We write 1 ∈ {1, 2, 3}. There are a few common sets that
we will encounter. For example,
Figure 2.2: From left to right: a closed interval, a semi-closed (or semi-open) interval, and an open
interval.
Sets are not limited to numbers. A set can be used to describe a collection of functions.
Example 2.3. A = {f : R → R | f (x) = ax+b, a, b ∈ R}. This is the set of all straight
lines in 2D. The notation f : R → R means that the function f takes an argument
from R and sends it to another real number in R. The definition f (x) = ax + b says
that f is taking the specific form of ax + b. Since the constants a and b can be any
real number, the equation f (x) = ax + b enumerates all possible straight lines in 2D.
See Figure 2.3(a).
[Two plots of f(t) versus t: panel (a) with t from −2 to 2, panel (b) with t from −1 to 1.]
Figure 2.3: (a) The set of straight lines A = {f : R → R | f (x) = ax + b, a, b ∈ R}. (b) The set of
phase-shifted cosines A = {f : R → [−1, 1] | f (t) = cos(ω0 t + θ), θ ∈ [0, 2π]}.
A set can also be used to describe a collection of sets. Let A and B be two sets. Then
C = {A, B} is a set of sets.
is a collection of sets. Note that here we are not saying C is the union of two sets. We
are only saying that C is a collection of two sets. See the next example.
Example 2.6. Let A = {1, 2} and B = {3}, then C = {A, B} means that C = {{1, 2}, {3}}.
Therefore C contains only two elements. One is the set {1, 2} and the other is the set
{3}. Note that {{1, 2}, {3}} ≠ {1, 2, 3}. The former is a set of two sets. The latter is a
set of three elements.
2.1.3 Subsets
Given a set, we often want to specify a portion of the set, which is called a subset.
Example 2.7.
If A = {1, 2, 3, 4, 5, 6}, then B = {1, 3, 5} is a proper subset of A.
If A = {1, 2}, then B = {1, 2} is an improper subset of A.
If A = {t | t ≥ 0}, then B = {t | t > 0} is a proper subset of A.
Practice Exercise 2.1. Let A = {1, 2, 3}. List all the subsets of A.
Solution. The subsets of A are:
A = {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}.
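The list in Practice Exercise 2.1 can be generated programmatically. The sketch below uses itertools.combinations to pick subsets of every size from 0 to |A|; the function name is hypothetical.

```python
from itertools import combinations

def all_subsets(A):
    """Return every subset of the finite set A as a list of Python sets."""
    elements = sorted(A)
    return [set(c)
            for r in range(len(elements) + 1)
            for c in combinations(elements, r)]

subsets = all_subsets({1, 2, 3})
print(len(subsets))   # 8, which is 2**3
for s in subsets:
    print(s)
```

The empty set (size 0) and A itself (size 3) both appear in the output, matching the solution above.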
Practice Exercise 2.2. Prove that two sets A and B are equal if and only if A ⊆ B
and B ⊆ A.
Solution. Suppose A ⊆ B and B ⊆ A. Assume by contradiction that A ≠ B. Then
necessarily there must exist an x such that x ∈ A but x ∉ B (or vice versa). But
A ⊆ B means that x ∈ A will necessarily be in B. So it is impossible to have x ∉ B.
Conversely, suppose that A = B. Then any x ∈ A will necessarily be in B. Therefore,
we have A ⊆ B. Similarly, if A = B then any x ∈ B will be in A, and so B ⊆ A.
A set containing an element 0 is not an empty set. It is a set of one element, {0}. The
number of elements of the empty set is 0. The empty set is a subset of any set, i.e., ∅ ⊆ A
for any A. We use ⊆ because A could also be an empty set.
Example 2.8(a). The set A = {x | sin x > 1} is empty because no x ∈ R can make
sin x > 1.
Example 2.8(b). The set A = {x | x > 5 and x < 1} is empty because the two
conditions x > 5 and x < 1 are contradictory.
Definition 2.4 (Universal Set). The universal set is the set containing all elements
under consideration. We denote a universal set as
A = Ω. (2.4)
The universal set Ω is a subset of itself, i.e., Ω ⊆ Ω. The universal set is a relative concept.
Usually, we first define a universal set Ω before referring to subsets of Ω. For example, we
can define Ω = R and refer to intervals in R. We can also define Ω = [0, 1] and refer to
subintervals inside [0, 1].
2.1.5 Union
We now discuss basic set operations. By operations, we mean functions of two or more sets
whose output value is a set. We use these operations to combine and separate sets. Let us
first consider the union of two sets. See Figure 2.4 for a graphical depiction.
Definition 2.5 (Finite Union). The union of two sets A and B contains all elements
in A or in B. That is,
A ∪ B = {ξ | ξ ∈ A or ξ ∈ B}. (2.5)
As the definition suggests, the union of two sets connects the sets using the logical operator
“or”. Therefore, the union of two sets always contains each of the individual sets.
Example 2.9(a). If A = {1, 2}, B = {1, 5}, then A ∪ B = {1, 2, 5}. The overlapping
element 1 is absorbed. Also, note that A ∪ B ≠ {{1, 2}, {1, 5}}. The latter is a set of
sets.
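For finite sets, the distinction in Example 2.9(a) is easy to check on a computer. The following is a minimal sketch using Python's built-in set type; the variable names are our own. Because the elements of a Python set must be hashable, the inner sets of a set of sets are written as frozenset.

```python
A = {1, 2}
B = {1, 5}

union = A | B                                   # elements in A or in B
set_of_sets = {frozenset(A), frozenset(B)}      # a collection whose elements are sets

print(union)                                    # the overlapping element 1 appears once
print(len(union), len(set_of_sets))             # 3 elements versus 2 elements
print(union == set_of_sets)                     # False: the two objects are different
```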
Example 2.9(b). If A = (3, 4], B = (3.5, ∞), then A ∪ B = (3, ∞).
Example 2.9(c). If A = {f : R → R | f (x) = ax} and B = {f : R → R | f (x) = b},
then A ∪ B = a set of sloped lines with a slope a plus a set of constant lines with
Figure 2.4: The union of two sets contains elements that are either in A or B or both.
The previous example can be generalized in the following exercise. What it says is that
if A is a subset of another set B, then the union of A and B is just B. Intuitively, this should
be straightforward because whatever you have in A is already in B, so the union will just
be B. Below is a formal proof that illustrates how to state the arguments clearly. You may
like to draw a picture to convince yourself that the proof is correct.
What should we do if we want to take the union of an infinite number of sets? First,
we need to define the concept of an infinite union.
Definition 2.6 (Infinite Union). For an infinite sequence of sets A_1, A_2, . . ., the infinite union is defined as
⋃_{n=1}^{∞} A_n = {ξ | ξ ∈ A_n for at least one n that is finite}. (2.6)
An infinite union is a natural extension of a finite union. It is not difficult to see that
ξ ∈ A or ξ ∈ B ⇐⇒ ξ is in at least one of A and B.
To take the infinite union of the sets A_n = [−1, 1 − 1/n], we know that the set [−1, 1) is always included, because the
right-hand limit 1 − 1/n approaches 1 as n approaches ∞. So the only question concerns the
number 1. Should 1 be included? According to the definition above, we ask: Is 1 an element
of at least one of the sets A_1, A_2, . . .? Clearly it is not: 1 ∉ A_1, 1 ∉ A_2, . . .. In fact,
1 ∉ A_n for any finite n. Therefore 1 is not an element of the infinite union, and we conclude
that
⋃_{n=1}^{∞} A_n = ⋃_{n=1}^{∞} [−1, 1 − 1/n] = [−1, 1).
Practice Exercise 2.4. Find the infinite union of the sequences where (a) A_n = [−1, 1 − 1/n], (b) A_n = (−1, 1 − 1/n].
Solution. (a) ⋃_{n=1}^{∞} A_n = [−1, 1). (b) ⋃_{n=1}^{∞} A_n = (−1, 1).
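A computer cannot form an infinite union, but it can illustrate the membership rule in Definition 2.6: a point belongs to the union if it lies in A_n for some finite n. The sketch below assumes the closed intervals A_n = [−1, 1 − 1/n]; the function names and the search limit n_max are our own choices. A finite search that fails proves nothing by itself, so treat the last line as an illustration of the argument rather than a proof.

```python
from fractions import Fraction

def in_A(x, n):
    """Membership test for A_n = [-1, 1 - 1/n] (closed interval, assumed)."""
    return -1 <= x <= 1 - Fraction(1, n)

def first_n_containing(x, n_max=10_000):
    """Smallest n <= n_max such that x is in A_n, or None if no such n is found."""
    for n in range(1, n_max + 1):
        if in_A(x, n):
            return n
    return None

print(first_n_containing(Fraction(-1)))          # already in A_1 = [-1, 0]
print(first_n_containing(Fraction(999, 1000)))   # needs 1 - 1/n >= 0.999
print(first_n_containing(Fraction(1)))           # None: 1 is in no A_n that we tried
```

Any x < 1 is eventually captured because 1 − 1/n gets arbitrarily close to 1, whereas x = 1 is never captured. This is why the union is [−1, 1).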
2.1.6 Intersection
The union of two sets is based on the logical operator or. If we use the logical operator and,
then the result is the intersection of two sets.
Definition 2.7 (Finite Intersection). The intersection of two sets A and B contains
all elements in A and in B. That is,
A ∩ B = {ξ | ξ ∈ A and ξ ∈ B}. (2.7)
Figure 2.6 portrays intersection graphically. Intersection finds the common elements of the
two sets. It is not difficult to show that A ∩ B ⊆ A and A ∩ B ⊆ B.
Figure 2.6: The intersection of two sets contains elements in both A and B.
Example 2.12. If A = {{1}, {2}} and B = {{2, 3}, {4}}, then A ∩ B = ∅. This is
because A is a set containing two sets, and B is a set containing two sets. The two sets
{2} and {2, 3} are not the same. Thus, A and B have no elements in common, and so
A ∩ B = ∅.
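Example 2.12 can be checked with a few lines of Python. This is a sketch with our own variable names. The inner sets are frozenset objects so that they can be elements of another set. For contrast, the code also intersects the underlying numbers, which answers a different question.

```python
A = {frozenset({1}), frozenset({2})}
B = {frozenset({2, 3}), frozenset({4})}

common = A & B                        # elements (here, sets) shared by A and B
print(common == set())                # True: {2} and {2, 3} are different elements

flat_A = set().union(*A)              # the numbers appearing inside A: {1, 2}
flat_B = set().union(*B)              # the numbers appearing inside B: {2, 3, 4}
print(flat_A & flat_B)                # {2}: a different question, a different answer
```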
Similarly to the infinite union, we can define the concept of infinite intersection.
We note that the sequence of sets is [0, 2], [0, 1.5], [0, 1.33], . . . . As n → ∞, we note that
the limit is either [0, 1) or [0, 1]. Should the right-hand limit 1 be included in the infinite
intersection? According to the definition above, we know that 1 ∈ A_1, 1 ∈ A_2, . . . , 1 ∈ A_n
for any finite n. Therefore, 1 is included and so
⋂_{n=1}^{∞} A_n = ⋂_{n=1}^{∞} [0, 1 + 1/n] = [0, 1].
Practice Exercise 2.5. Find the infinite intersection of the sequences where (a)
A_n = [0, 1 + 1/n), (b) A_n = (0, 1 + 1/n], (c) A_n = [0, 1 − 1/n), (d) A_n = [0, 1 − 1/n].
Solution.
(a) ⋂_{n=1}^{∞} A_n = [0, 1].
(b) ⋂_{n=1}^{∞} A_n = (0, 1].
(c) ⋂_{n=1}^{∞} A_n = [0, 0) = ∅.
(d) ⋂_{n=1}^{∞} A_n = [0, 0] = {0}.
Definition 2.9 (Complement). The complement of a set A is the set containing all
elements that are in Ω but not in A. That is,
A^c = {ξ | ξ ∈ Ω and ξ ∉ A}. (2.9)
Figure 2.8 graphically portrays the idea of a complement. The complement is a set that
contains everything in the universal set that is not in A. Thus the complement of a set is
always relative to a specified universal set.
Figure 2.8: [Left] The complement of a set A contains all elements that are not in A. [Right] The
difference A\B contains elements that are in A but not in B.
The concept of the complement will help us understand the concept of difference.
Definition 2.10 (Difference). The difference A\B is the set containing all elements
in A but not in B.
A\B = {ξ | ξ ∈ A and ξ ∉ B}. (2.10)
Figure 2.8 portrays the concept of difference graphically. Note that A\B ≠ B\A. The former
removes the elements in B whereas the latter removes the elements in A.
Example 2.14(a). Let A = {1, 3, 5, 6} and B = {2, 3, 4}. Then A\B = {1, 5, 6} and
B\A = {2, 4}.
Example 2.14(b). Let A = [0, 1], B = [2, 3], then A\B = [0, 1], and B\A = [2, 3].
This example shows that if the two sets do not overlap, there is nothing to subtract.
Example 2.14(c). Let A = [0, 1], B = R, then A\B = ∅, and B\A = (−∞, 0)∪(1, ∞).
This example shows that if one of the sets is the universal set, then the difference will
either return the empty set or the complement.
Figure 2.9: [Left] A and B are overlapping. [Right] A and B are disjoint.
Practice Exercise 2.6. Show that for any two sets A and B, the differences A\B
and B\A never overlap, i.e., (A\B) ∩ (B\A) = ∅.
Solution. Suppose, for the sake of contradiction, that the intersection is not empty so that there
exists a ξ ∈ (A\B) ∩ (B\A). Then, by the definition of intersection, ξ is an element
of (A\B) and (B\A). But if ξ is an element of (A\B), it cannot be an element of B.
This implies that ξ cannot be an element of (B\A) since it is a subset of B. This is a
contradiction because we just assumed that ξ lies in both (A\B) and (B\A).
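The proof settles the question for all sets, but a numerical spot check is a useful habit. The sketch below reproduces Example 2.14(a) and then tests the claim of Practice Exercise 2.6 on randomly drawn finite sets. The function name, the seed, and the size of the universe are our own choices.

```python
import random

def differences_are_disjoint(A, B):
    """Return True when A minus B and B minus A share no element."""
    return ((A - B) & (B - A)) == set()

# Example 2.14(a)
A = {1, 3, 5, 6}
B = {2, 3, 4}
a_minus_b = A - B                     # {1, 5, 6}
b_minus_a = B - A                     # {2, 4}
print(a_minus_b, b_minus_a)

# Random spot check on subsets of {0, 1, ..., 19}
random.seed(0)                        # fixed seed so the run is repeatable
universe = list(range(20))
trials = []
for _ in range(1000):
    X = set(random.sample(universe, random.randint(0, len(universe))))
    Y = set(random.sample(universe, random.randint(0, len(universe))))
    trials.append(differences_are_disjoint(X, Y))
all_disjoint = all(trials)
print(all_disjoint)
```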
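The difference can also be written using the complement:
A\B = A ∩ B^c. (2.11)
Two sets A and B are said to be disjoint if
A ∩ B = ∅. (2.12)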
For a collection of sets {A_1, A_2, . . . , A_n}, we say that the collection is disjoint if, for
any pair i ≠ j,
A_i ∩ A_j = ∅. (2.13)
Example 2.15(a). Let A = {x > 1} and B = {x < 0}. Then A and B are disjoint.
Example 2.15(b). Let A = {1, 2, 3} and B = ∅. Then A and B are disjoint.
Example 2.15(c). Let A = (0, 1) and B = [1, 2). Then A and B are disjoint.
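With the definition of disjoint, we can now define the powerful concept of partition. A collection of sets {A_1, . . . , A_n} is a partition of Ω if the sets are disjoint, i.e., for any pair i ≠ j,
A_i ∩ A_j = ∅, (2.14)
and if their union is Ω.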
Figure 2.10: A partition of Ω contains disjoint subsets of which the union gives us Ω.
Example 2.16. Let Ω = {1, 2, 3, 4, 5, 6}. The following sets form a partition:
Thus we can see that B ∩ A1 , B ∩ A2 and B ∩ A3 are disjoint. Furthermore, the union
of these three sets gives B.
Commutative property:
A ∩ B = B ∩ A, and A ∪ B = B ∪ A. (2.16)
Associative property:
A ∪ (B ∪ C) = (A ∪ B) ∪ C,
A ∩ (B ∩ C) = (A ∩ B) ∩ C. (2.17)
Distributive property:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C),
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C). (2.18)
Theorem 2.5 (De Morgan’s Law). (How to complement over intersection and union)
(A ∩ B)^c = A^c ∪ B^c,
(A ∪ B)^c = A^c ∩ B^c. (2.19)
Example 2.19. Consider [1, 4] ∩ ([0, 2] ∪ [3, 5]). By the distributive property we can
simplify the set as
[1, 4] ∩ ([0, 2] ∪ [3, 5]) = ([1, 4] ∩ [0, 2]) ∪ ([1, 4] ∩ [3, 5])
= [1, 2] ∪ [3, 4].
Example 2.20. Consider ([0, 1] ∪ [2, 3])^c. By De Morgan’s Law we can rewrite the set
as
([0, 1] ∪ [2, 3])^c = [0, 1]^c ∩ [2, 3]^c.
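The distributive property and De Morgan's Law hold for arbitrary sets, so they must in particular hold for small finite sets, where they can be verified directly. The sketch below uses a hypothetical universal set Ω = {0, 1, . . . , 9} and three arbitrary subsets. The helper complement is our own and takes the complement relative to that Ω.

```python
omega = set(range(10))                # hypothetical universal set
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
C = {3, 5, 7}

def complement(S, universe=omega):
    """Complement of S relative to the universal set."""
    return universe - S

distributive_1 = (A & (B | C)) == ((A & B) | (A & C))
distributive_2 = (A | (B & C)) == ((A | B) & (A | C))
de_morgan_1 = complement(A & B) == (complement(A) | complement(B))
de_morgan_2 = complement(A | B) == (complement(A) & complement(B))

print(distributive_1, distributive_2, de_morgan_1, de_morgan_2)
```

A check on one example is not a proof, but a single failing example would be enough to expose a misremembered identity.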
Figure 2.11: When there are two events A and B, the probability of A ∩ B is determined by first taking
the intersection of the two sets and then evaluating its probability.
Universal sets and empty sets are useful too. Universal sets cover all the possible
outcomes of an experiment, so we should expect P[Ω] = 1. Empty sets contain nothing,
and so we should expect P[∅] = 0. These two properties are essential to define a probability
because no probability can be greater than 1, and no probability can be less than 0.
We now formally define probability. Our discussion will be based on the slogan probability
is a measure of the size of a set. Three elements constitute a probability space:
Sample Space Ω: The set of all possible outcomes from an experiment.
Event Space F: The collection of all possible events. An event E is a subset in Ω that
defines an outcome or a combination of outcomes.
Probability Law P: A mapping from an event E to a number P[E] which, ideally,
measures the size of the event.
Therefore, whenever you talk about “probability,” you need to specify the triplet (Ω, F, P)
to define the probability space.
The necessity of the three elements is illustrated in Figure 2.12. The sample space
is the interface with the physical world. It is the collection of all possible states that can
result from an experiment. Some outcomes are more likely to happen, and some are less
likely, but this does not matter because the sample space contains every possible outcome.
The probability law is the interface with the data analysis. It is this law that defines the
likelihood of each of the outcomes. However, since the probability law measures the size of
a set, the probability law itself must be a function, a function whose argument is a set and
whose value is a number. An outcome in the sample space is not a set. Instead, a subset in
the sample space is a set. Therefore, the probability should input a subset and map it to a
number. The collection of all possible subsets is the event space.
Figure 2.12: Given an experiment, we define the collection of all outcomes as the sample space. A
subset in the sample space is called an event. The probability law is a mapping that maps an event to
a number that denotes the size of the event.
A perceptive reader like you may be wondering why we want to complicate things to
this degree when calculating probability is trivial, e.g., throwing a die gives us a probability
1/6 per face. In a simple world where problems are that easy, you can surely ignore all these
complications and proceed to the answer 1/6. However, modern data analysis is not so easy.
If we are given an image of size 64 × 64 pixels, how do we tell whether this image is of a cat
or a dog? We need to construct a probability model that tells us the likelihood of having a
2.2. PROBABILITY SPACE
Definition 2.13. A sample space Ω is the set of all possible outcomes from an experiment. We denote ξ as an element in Ω.
Figure 2.13 also shows a functional example of the sample space. In this case, the
sample space contains functions. For example,
Set of all straight lines in 2D:
Ω = {f | f (x) = ax + b, a, b ∈ R}.
As we see from the above examples, the sample space is nothing but a universal set.
The elements inside the sample space are the outcomes of the experiment. If you change
the experiment, the possible outcomes will be different so that the sample space will be
different. For example, flipping a coin has different possible outcomes from throwing a die.
Figure 2.13: The sample space can take various forms: it can contain discrete numbers, or continuous
intervals, or even functions.
What if we want to describe a composite experiment where we flip a coin and throw a
die? Here is the sample space:
Example 2.23: If the experiment contains flipping a coin and throwing a die, then
the sample space is
{(H, ⚀), (H, ⚁), (H, ⚂), (H, ⚃), (H, ⚄), (H, ⚅),
(T, ⚀), (T, ⚁), (T, ⚂), (T, ⚃), (T, ⚄), (T, ⚅)}.
Practice Exercise 2.8. There are 8 processors on a computer. A computer job scheduler
chooses one processor randomly. What is the sample space? If the computer job
scheduler can choose two processors at once, what is the sample space then?
Solution. The sample space of the first case is Ω = {1, 2, 3, 4, 5, 6, 7, 8}. The sample
space of the second case is Ω = {(1, 2), (1, 3), (1, 4), . . . , (7, 8)}.
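Sample spaces of composite experiments can be generated rather than typed out by hand. The sketch below builds the sample space of Example 2.23 as a Cartesian product and the two sample spaces of Practice Exercise 2.8. It assumes, as the solution's listing (1, 2), (1, 3), . . . , (7, 8) suggests, that choosing two processors means choosing an unordered pair of distinct processors. The variable names are our own, and die faces are written as the numbers 1 to 6.

```python
from itertools import combinations, product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]
omega_coin_die = list(product(coin, die))        # every (coin, die) pair

processors = range(1, 9)
omega_one = list(processors)                     # one processor chosen
omega_two = list(combinations(processors, 2))    # unordered pairs of distinct processors

print(len(omega_coin_die), omega_coin_die[:3])
print(len(omega_one), len(omega_two))
print(omega_two[0], omega_two[-1])
```

If the order of the two processors mattered, itertools.permutations would be the matching choice and the sample space would be twice as large.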
Practice Exercise 2.9. A cell phone tower has a circular average coverage area of
radius 10 km. We observe the source locations of calls received by the tower. What
is the sample space of all possible source locations?
Solution. Assume that the center of the tower is located at (x0, y0). The sample space
is the set
Ω = {(x, y) | √((x − x0)^2 + (y − y0)^2) ≤ 10}.
Not every set can be a sample space. A sample space must be exhaustive and exclusive.
The term “exhaustive” means that the sample space has to cover all possible outcomes. If
there is one possible outcome that is left out, then the set is no longer a sample space. The
term “exclusive” means that the sample space contains unique elements so that there is no
repetition of elements.
Definition 2.14. An event E is a subset in the sample space Ω. The set of all possible
events is denoted as F.
While this definition is extremely simple, we need to keep in mind a few facts about events.
First, an outcome ξ is an element in Ω but an event E is a subset contained in Ω, i.e.,
E ⊆ Ω. Thus, an event can contain one outcome but it can also contain many outcomes.
The following example shows a few cases of events:
Example 2.25. Throw a die. Let Ω = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}. The following are two possible events, as illustrated in Figure 2.14.
E1 = {even numbers} = {⚁, ⚃, ⚅}.
E2 = {less than 3} = {⚀, ⚁}.
Figure 2.14: Two examples of events: The first event contains numbers {2, 4, 6}, and the second
event contains numbers {1, 2}.
Practice Exercise 2.10. The “ping” command is used to measure round-trip times
for Internet packets. What is the sample space of all possible round-trip times? What
is the event that a round-trip time is between 10 ms and 20 ms?
Solution. The sample space is Ω = [0, ∞). The event is E = [10, 20].
Practice Exercise 2.11. A cell phone tower has a circular average coverage area of
radius 10 km. We observe the source locations of calls received by the tower. What is
the event when the source location of a call is between 2 km and 5 km from the tower?
Solution. Assume that the center of the tower is located at (x0, y0). The event is
E = {(x, y) | 2 ≤ √((x − x0)^2 + (y − y0)^2) ≤ 5}.
The second point we should remember is the cardinality of Ω and that of F. A sample
space containing n elements has a cardinality n. However, the event space constructed from
Ω will contain 2^n events. To see why this is so, let’s consider the following example.
Example 2.26. Consider an experiment with 3 outcomes Ω = {♣, ♡, ✠}. We can list
out all the possible events: ∅, {♣}, {♡}, {✠}, {♣, ♡}, {♣, ✠}, {♡, ✠}, {♣, ♡, ✠}. So
in total there are 2^3 = 8 possible events. Figure 2.15 depicts the situation. What is
the difference between ♣ and {♣}? The former is an element, whereas the latter is a
set. Thus, {♣} is an event but ♣ is not an event. Why is ∅ an event? Because we can
ask “What is the probability that we get an odd number and an even number?” The
probability is obviously zero, but the reason it is zero is that the event is an empty set.
Figure 2.15: The event space contains all the possible subsets inside the sample space.
In general, if there are n elements in the sample space, then the number of events
is 2^n. To see why this is true, we can assign to each element a binary value: either 0 or 1.
For example, in Table 2.1 we consider throwing a die. For each of the six faces, we assign a
binary code. This will give us a binary string for each event. For example, the event {⚀, ⚄}
is encoded as the binary string 100010 because only ⚀ and ⚄ are activated. We can count
the total number of unique strings, which is the number of strings that can be constructed
from n bits. It is easily seen that this number is 2^n.
Table 2.1: An event space contains 2^n events, where n is the number of elements in the sample space.
To see this, we encode each outcome with a binary code. The resulting binary string then forms a unique
index of the event. Counting the total number of events gives us the cardinality of the event space.
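The binary encoding described above translates directly into code. In the sketch below, the die faces are written as the numbers 1 to 6, and the helper names encode and decode are our own. Bit i of the string is 1 exactly when face i belongs to the event, so listing all 6-bit strings lists all events.

```python
faces = [1, 2, 3, 4, 5, 6]

def encode(event):
    """Binary string with a 1 in position i when face i belongs to the event."""
    return "".join("1" if face in event else "0" for face in faces)

def decode(bits):
    """Inverse of encode: recover the event from its binary string."""
    return {face for face, bit in zip(faces, bits) if bit == "1"}

print(encode({1, 5}))                              # the event with faces 1 and 5

all_codes = [format(k, "06b") for k in range(2 ** len(faces))]
events = [decode(code) for code in all_codes]
print(len(events))                                 # number of events for n = 6
```

Because encode and decode are inverses of each other, distinct strings give distinct events, which is the counting argument behind 2^n.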
The box below summarizes what you need to know about event spaces.
The following discussions can be skipped if it is your first time reading the book.
What else do we need to take care of in order to ensure that an event is well defined? A
few set operations seem to be necessary. For example, if E1 = {⚀} and E2 = {⚁} are events,
it is necessary that E = E1 ∪ E2 = {⚀, ⚁} is an event too. Another example: if E1 = {⚀, ⚁}
and E2 = {⚁, ⚂} are events, then it is necessary that E = E1 ∩ E2 = {⚁} is also an event.
The third example: if E1 = {⚂, ⚃, ⚄, ⚅} is an event, then E = E1^c = {⚀, ⚁} should be
an event. As you can see, there is nothing sophisticated in these examples. They are just
some basic set operations. We want to ensure that the event space is closed under these
set operations. That is, we do not want to be surprised by finding that a set constructed
from two events is not an event. However, since all set operations can be constructed from
union, intersection and complement, ensuring that the event space is closed under these
three operations effectively ensures that it is closed under all set operations.
The formal way to guarantee these is the notion of a field. This term may seem to be
abstract, but it is indeed quite useful:
For a finite set, i.e., a set that contains n elements, the collection of all possible subsets
is indeed a field. This is not difficult to see if you consider rolling a die. For example,
if E = {⚂, ⚃, ⚄, ⚅} is inside F, then E^c = {⚀, ⚁} is also inside F. This is because F
consists of 2^n subsets each being encoded by a unique binary string. So if E = 001111, then
E^c = 110000, which is also in F. Similar reasoning applies to intersection and union.
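For a die, the closure claims can be verified exhaustively because there are only 64 events. The sketch below builds the collection of all subsets of Ω = {1, . . . , 6} and checks closure under complement, union, and intersection. It then shows a smaller, hypothetical collection that fails the complement test. All names are our own.

```python
from itertools import combinations

omega = frozenset(range(1, 7))

# Event space: all subsets of omega.
event_space = {
    frozenset(c)
    for k in range(len(omega) + 1)
    for c in combinations(sorted(omega), k)
}

closed_under_complement = all((omega - E) in event_space for E in event_space)
closed_under_union = all((E1 | E2) in event_space
                         for E1 in event_space for E2 in event_space)
closed_under_intersection = all((E1 & E2) in event_space
                                for E1 in event_space for E2 in event_space)
print(closed_under_complement, closed_under_union, closed_under_intersection)

# A hypothetical collection that is not a field: the complement of {1} is missing.
not_a_field = {frozenset(), frozenset({1}), omega}
complement_of_one = omega - frozenset({1})
print(complement_of_one in not_a_field)
```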
At this point, you may ask:
Why bother constructing a field? The answer is that probability is a measure of the
size of a set, so we must input a set to a probability measure P to get a number. The
set being input to P must be a subset inside the sample space; otherwise, it will be
undefined. If we regard P as a mapping, we need to specify the collection of all its
inputs, which is the set of all subsets, i.e., the event space. So if we do not define the
field, there is no way to define the measure P.
What if the event space is not a field? If the event space is not a field, then we can
easily construct pathological cases where we cannot assign a probability. For example,
if the event space is not a field, then it would be possible that the complement of
E = {⚂, ⚃, ⚄, ⚅} (which is E^c = {⚀, ⚁}) is not an event. This just does not make
sense.
The concept of a field is sufficient for finite sample spaces. However, there are two
other types of sample spaces where the concept of a field is inadequate. The first type of
sets consists of the countably infinite sets, and the second type consists of the sets defined
on the real line. There are other types of sets, but these two have important practical
applications. Therefore, we need to have a basic understanding of these two types.
Sigma-field
The difficulty of a countably infinite set is that there are infinitely many subsets in the field
of a countably infinite set. Having a finite union and a finite intersection is insufficient to
ensure the closedness of all intersections and unions. In particular, having F_1 ∪ F_2 ∈ F does
not automatically give us ⋃_{n=1}^{∞} F_n ∈ F because the latter is an infinite union. Therefore,
for countably infinite sets, the requirements are more restrictive: we need
to ensure closure under infinite intersection and union. The resulting field is called the σ-field.
When do we need a σ-field? When the sample space is countable and has infinitely
many elements. For example, if the sample space contains all integers, then the collection
of all possible subsets is a σ-field. For another, if E_1 = {2}, E_2 = {4}, E_3 = {6}, . . . , then
⋃_{n=1}^{∞} E_n = {2, 4, 6, 8, . . .} = {positive even numbers}. Clearly, we want ⋃_{n=1}^{∞} E_n to live in
the event space.
Borel sigma-field
While a sigma-field allows us to consider countable sets of events, it is still insufficient for
considering events defined on the real line, e.g., time, as these events are not countable.
So how do we define an event on the real line? It turns out that we need a different way
to define the smallest unit. For finite sets and countable sets, the smallest units are the
elements themselves because we can count them. For the real line, we cannot count the
elements because any interval of positive length contains uncountably many points.
The smallest unit we use to construct a field for the real line is a semi-closed interval
(−∞, b] ≝ {x | −∞ < x ≤ b}.
The Borel σ-field is defined as the sigma-field generated by the semi-closed intervals.
Definition 2.17. The Borel σ-field B is a σ-field generated from semi-closed intervals:
(−∞, b] ≝ {x | −∞ < x ≤ b}.
The difference between the Borel σ-field B and a regular σ-field is how we measure the
subsets. In a σ-field, we count the elements in the subsets, whereas, in a Borel σ-field, we
use the semi-closed intervals to measure the subsets.
Being a field, the Borel σ-field is closed under complement, union, and intersection. In
particular, subsets of the following forms are also in the Borel σ-field B:
(a, b), [a, b], (a, b], [a, b), [a, ∞), (a, ∞), (−∞, b], {b}.
For example, (a, ∞) can be constructed from (−∞, a]^c, and (a, b] can be constructed by
taking the intersection of (−∞, b] and (a, ∞).
Example 2.27: Waiting for a bus. Let Ω = {0 ≤ t ≤ 30}. The Borel σ-field contains
all semi-closed intervals (a, b], where 0 ≤ a ≤ b ≤ 30. Here are two possible events:
F1 = {less than 10 minutes} = {0 ≤ t < 10} = {0} ∪ ({0 < t ≤ 10} ∩ {10}^c).
F2 = {more than 20 minutes} = {20 < t ≤ 30}.
Further discussion of the Borel σ-field can be found in Leon-Garcia (3rd Edition),
Chapter 2.9.
The probability law is thus a function, and therefore we must specify the input and
the output. The input to P is an event E, which is a subset in Ω and an element in F. The
output of P is a number between 0 and 1, which we call the probability.
The definition above does not specify how an event is being mapped to a number.
However, since probability is a measure of the size of a set, a meaningful P should be
consistent for all events in F. This requires some rules, known as the axioms of probability,
when we define P. Any probability law P must satisfy these axioms; otherwise, we will
see contradictions. We will discuss the axioms in the next section. For now, let us look at
two examples to make sure we understand the functional nature of P.
Example 2.28. Consider flipping a coin. The event space is F = {∅, {H}, {T}, Ω}.
We can define the probability law as
P[∅] = 0, P[{H}] = 1/2, P[{T}] = 1/2, P[Ω] = 1,
as shown in Figure 2.16. This P is clearly consistent for all the events in F.
Is it possible to construct an invalid P? Certainly. Consider the following probability law:
P[∅] = 0, P[{H}] = 1/3, P[{T}] = 1/3, P[Ω] = 1.
This law is invalid because the individual events are P[{H}] = 1/3 and P[{T}] = 1/3
but the union is P[Ω] = 1: the two disjoint events {H} and {T} together make up Ω, yet their probabilities add to only 2/3. To fix this problem, one possible solution is to define the
probability law as
P[∅] = 0, P[{H}] = 1/3, P[{T}] = 2/3, P[Ω] = 1.
Then, the probabilities for all the events are well defined and consistent.
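Since a probability law is a function from events to numbers, it can be stored as a table and checked mechanically. The sketch below represents each law of Example 2.28 as a dictionary keyed by events. The function is_consistent is our own and checks only a few necessary conditions for the coin: values between 0 and 1, P[∅] = 0, P[Ω] = 1, and P[{H}] + P[{T}] = P[Ω]. It is not a general test of the axioms, which are discussed in the next section.

```python
from fractions import Fraction

omega = frozenset({"H", "T"})
H, T, empty = frozenset({"H"}), frozenset({"T"}), frozenset()

def coin_law(p_heads, p_tails):
    """Build a probability table on the event space {empty, {H}, {T}, omega}."""
    return {empty: Fraction(0), H: p_heads, T: p_tails, omega: Fraction(1)}

def is_consistent(P):
    """Check a few necessary conditions for a probability law on a coin."""
    in_range = all(0 <= p <= 1 for p in P.values())
    return (in_range and P[empty] == 0 and P[omega] == 1
            and P[H] + P[T] == P[omega])

fair = coin_law(Fraction(1, 2), Fraction(1, 2))
invalid = coin_law(Fraction(1, 3), Fraction(1, 3))
fixed = coin_law(Fraction(1, 3), Fraction(2, 3))

print(is_consistent(fair), is_consistent(invalid), is_consistent(fixed))
```

Exact fractions are used instead of floating-point numbers so that the equality test on 1/3 + 2/3 is not affected by rounding.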
Figure 2.16: A probability law is a mapping from an event to a number. A probability law cannot be
arbitrarily assigned; it must satisfy the axioms of probability.
Figure 2.17: Probability is a measure of the size of a set. The probability can be a counter that counts
the number of elements, a ruler that measures the length of an interval, or an integration that measures
the area of a region.
Ω = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}.
Then the probability measure is a counter that reports the number of elements. If
the die is fair, i.e., all the 6 faces have equal probability of happening, then an event
containing two faces, such as E = {⚀, ⚁}, will have a probability P[E] = 2/6.
Example 2.31 (Intervals). Suppose that the sample space is a unit interval Ω = [0, 1].
Let E be an event such that E = [a, b] where a ≤ b are numbers in [0, 1]. Then the
probability measure is a ruler that measures the length of the intervals. If all the
numbers in Ω have equal probability of appearing, then P[E] = b − a.
Example 2.32 (Regions). Suppose that the sample space is the square Ω = [−1, 1] ×
[−1, 1]. Let E be a disk such that E = {(x, y) | x^2 + y^2 < r^2}, where r < 1. Then the
probability measure is an area measure that returns us the area of E relative to the area of Ω. If we assume
that all coordinates in Ω are equally probable, then, since Ω has area 4, P[E] = πr^2/4, for r < 1.
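The three measures in these examples (a counter, a ruler, and an area) can each be written as a one-line function. The sketch below assumes that all outcomes are equally probable in every case, so each probability is the size of the event divided by the size of Ω. For the disk inside the square [−1, 1] × [−1, 1], the size of Ω is its area 4. The function names and default arguments are our own.

```python
import math
from fractions import Fraction

def prob_counter(event, omega):
    """Fair finite sample space: count of the event over count of omega."""
    return Fraction(len(event), len(omega))

def prob_ruler(a, b, lo=0.0, hi=1.0):
    """Uniform interval [lo, hi]: length of [a, b] over length of the interval."""
    return (b - a) / (hi - lo)

def prob_area_disk(r, side=2.0):
    """Uniform square of the given side: disk area over square area."""
    return math.pi * r ** 2 / side ** 2

print(prob_counter({1, 2}, {1, 2, 3, 4, 5, 6}))   # two faces of a fair die
print(prob_ruler(0.25, 0.75))                     # an interval of length 0.5
print(prob_area_disk(0.5))                        # pi * 0.25 / 4
```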
Because probability is a measure of the size of a set, two sets can be compared according
to their probability measures. For example, if Ω = {♣, ♡, ✠}, and if E1 = {♣} and E2 =
{♣, ♡}, then one possible P is to assign P[E1] = P[{♣}] = 1/3 and P[E2] = P[{♣, ♡}] = 2/3.
With this assignment,
P[E1] ≤ P[E2].
Let’s now consider the term “size.” Notice that the concept of the size of a set is not
limited to the number of elements. A better way to think about size is to imagine that it is
the weight of the set. This may seem fanciful at first, but it is quite natural. Consider
the following example.
Example 2.33. (Discrete events with different weights) Suppose we have a sample
space Ω = {♣, ♡, ✠}. Let us assign a different probability to each outcome:
P[{♣}] = 2/6, P[{♡}] = 1/6, P[{✠}] = 3/6.
As illustrated in Figure 2.18, since each outcome has a different weight, when determining
the probability of a set of outcomes we can add these weights (instead of
counting the number of outcomes). For example, when reporting P[{♣}] we find its
weight P[{♣}] = 2/6, whereas when reporting P[{♡, ✠}] we find the sum of their weights
P[{♡, ✠}] = 1/6 + 3/6 = 4/6. Therefore, the notion of size does not refer to the number of
elements but to the total weight of these elements.
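The weighted view of Example 2.33 amounts to a lookup table plus a sum. In the sketch below, the three outcomes are given the hypothetical string labels "club", "heart", and "cross", and the function prob is our own. Python's Fraction type reduces fractions automatically, so 4/6 is displayed as 2/3.

```python
from fractions import Fraction

# Weights of the three outcomes, taken from Example 2.33.
weights = {
    "club": Fraction(2, 6),
    "heart": Fraction(1, 6),
    "cross": Fraction(3, 6),
}

def prob(event):
    """Probability of an event: the sum of the weights of its outcomes."""
    return sum((weights[outcome] for outcome in event), Fraction(0))

print(prob({"club"}))                  # 2/6, displayed in lowest terms
print(prob({"heart", "cross"}))        # 1/6 + 3/6
print(prob(set(weights)))              # the whole sample space
```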
Figure 2.18: This example shows the “weights” of three elements in a set. The weights are numbers
between 0 and 1 such that the sum is 1. When applying a probability measure to this set, we sum the
weights for the elements in the events being considered. For example, P[{♡, ✠}] = yellow + green, and
P[{♣}] = purple.
Example 2.34. (Continuous events with different weights) Suppose that the sample
space is an interval, say Ω = [−1, 1]. On this interval we define a weighting function
f(x) where f(x0) specifies the weight for x0. Because Ω is an interval, events defined
on this Ω must also be intervals. For example, we can consider two events E1 = [a, b]
and E2 = [c, d]. The probabilities of these events are P[E1] = ∫_a^b f(x) dx and P[E2] = ∫_c^d f(x) dx, as shown in Figure 2.19.
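To see the integral at work, we need a concrete weighting function. The sketch below assumes the hypothetical choice f(x) = (3/4)(1 − x^2) on Ω = [−1, 1], which is nonnegative and integrates to 1 over Ω. This particular f is our own illustration and is not taken from Figure 2.19. The integral is approximated with a simple midpoint rule so that no external library is needed.

```python
def f(x):
    """Hypothetical weighting function on [-1, 1]; integrates to 1 over the interval."""
    return 0.75 * (1.0 - x * x)

def prob_interval(a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

print(prob_interval(-1.0, 1.0))        # the whole sample space, close to 1
print(prob_interval(0.0, 0.5))         # exact value is 0.34375
print(prob_interval(0.5, 1.0))         # same length as [0, 0.5] but less weight
```

The last two events have the same length, yet their probabilities differ because the weighting function is larger near the center of the interval.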
Figure 2.19: If the sample space is an interval on the real line, then the probability of an event is the
area under the curve of the weighting function.
narrowly defined concept that is largely limited to discrete events, e.g., flipping a coin. While
we can assign weights to coin-toss events to deal with those biased coins, the extension to
continuous events becomes problematic. By thinking of probability as a measure, we can
generalize the notion to apply to intervals, areas, volumes, and so on.
Second, viewing probability as a measure forces us to disentangle an event from mea-
sures. An event is a subset in the sample space. It has nothing to do with the measure
(e.g., a ruler) you use to measure the event. The measure, on the other hand, specifies the
weighting function you apply to measure the event when computing the probability. For
example, let Ω = [−1, 1] be an interval, and let E = [a, b] be an event. We can define two
weighting functions f (x) and g(x). Correspondingly, we will have two different probability
measures F and G such that
F([a, b]) = ∫_E dF = ∫_a^b f(x) dx,
G([a, b]) = ∫_E dG = ∫_a^b g(x) dx. (2.20)
To make sense of these notations, consider only P[[a, b]] and not F([a, b]) and G([a, b]). As you
can see, the event for both measures is E = [a, b] but the measures are different. Therefore,
the values of the probability are different.
Example 2.35. (Two probability laws are different if their weighting functions are
different.) Consider two different weighting functions for throwing a die. The first one
assigns probability as the following:
P[{⚀}] = 1/12, P[{⚁}] = 2/12, P[{⚂}] = 3/12,
P[{⚃}] = 4/12, P[{⚄}] = 1/12, P[{⚅}] = 1/12,
whereas the second function assigns the probability like this:
P[{⚀}] = 2/12, P[{⚁}] = 2/12, P[{⚂}] = 2/12,
P[{⚃}] = 2/12, P[{⚄}] = 2/12, P[{⚅}] = 2/12.
Let an event E = {⚀, ⚁}. Let F be the measure using the first set of probabilities, and
let G be the measure of the second set of probabilities. Then,
F(E) = F({⚀, ⚁}) = 1/12 + 2/12 = 3/12,
G(E) = G({⚀, ⚁}) = 2/12 + 2/12 = 4/12.
Therefore, although the events are the same, the two different measures will give us
two different probability values.
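The separation between an event and the measure applied to it is easy to mirror in code: the event is one object, and each measure is a different table of weights. The sketch below follows Example 2.35 with die faces written as the numbers 1 to 6. It assumes that the event is E = {1, 2}, which is the pair of faces consistent with the sums 1/12 + 2/12 and 2/12 + 2/12. The names first, second, and measure are our own.

```python
from fractions import Fraction

faces = range(1, 7)
first = {face: Fraction(w, 12) for face, w in zip(faces, [1, 2, 3, 4, 1, 1])}
second = {face: Fraction(2, 12) for face in faces}

def measure(weights, event):
    """Apply a weight table to an event by summing the weights of its outcomes."""
    return sum((weights[face] for face in event), Fraction(0))

E = {1, 2}                             # the same event for both measures (assumed faces)
F_E = measure(first, E)                # 1/12 + 2/12
G_E = measure(second, E)               # 2/12 + 2/12
print(F_E, G_E)                        # displayed in lowest terms: 1/4 and 1/3
```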
Remark. The notation ∫_E dF in Equation (2.20) is known as the Lebesgue integral. You
should be aware of this notation, but the theory of Lebesgue measure is beyond the scope
of this book.
In fact, for any weighting function the integral will be zero because the length of the set
E is zero.¹ An event that gives us zero probability is known as an event with measure 0.
Figure 2.20 shows an example.
Figure 2.20: The probability of obtaining a single point in a continuous interval is zero.
¹ We assume that f is continuous throughout [0, 1]. If f is discontinuous at x = 0.5, some additional
considerations will apply.
The following discussion of the formal definitions of measure zero sets is optional for the
first reading of this book.
Definition 2.19. Let Ω be the sample space. A set A ⊆ Ω is said to have measure
zero if for any given ϵ > 0, there exists a countable number of subsets A_n such that
A ⊆ ⋃_{n=1}^{∞} A_n, and
∑_{n=1}^{∞} P[A_n] < ϵ.
You may need to read this definition carefully. Suppose we have an event A. We construct
a set of neighbors A_1, A_2, . . . such that A is included in the union ⋃_{n=1}^{∞} A_n. If, for every ϵ > 0,
the neighbors can be chosen so that the sum of all the P[A_n] is less than ϵ, then the set A has measure zero.
To understand the difference between a measure for a continuous set and a countable
set, consider Figure 2.21. On the left side of Figure 2.21 we show an interval Ω in which there
is an isolated point x0. The measure for this Ω is the length of the interval (relative to whatever
weighting function you use). We define a small neighborhood A0 = (x0 − ϵ/2, x0 + ϵ/2)
surrounding x0 . The length of this interval is not more than ϵ. We then shrink ϵ. How-
ever, regardless of how small ϵ is, since x0 is an isolated point, it is always included in the
neighborhood. Therefore, the definition is satisfied, and so {x0 } has measure zero.
Example 2.37. Let Ω = [0, 1]. The set {0.5} ⊂ Ω has measure zero, i.e., P[{0.5}] = 0.
To see this, we draw a small interval around 0.5, say [0.5 − ϵ/3, 0.5 + ϵ/3]. Inside this
interval, there is really nothing to measure besides the point 0.5. Thus we have found
an interval that contains 0.5, and, taking the measure to be the length of the interval, its
probability is P[[0.5 − ϵ/3, 0.5 + ϵ/3]] = 2ϵ/3 < ϵ.
The situation is very different for the right-hand side of Figure 2.21. Here, the measure
is not the length but a counter. So if we create a neighborhood surrounding the isolated
point x0 , we can always make a count. As a result, if you shrink ϵ to become a very small
number (in this case less than 1/4), then P[{x0}] < ϵ will no longer be true. Therefore, the
set {x0 } has a non-zero measure when we use the counter as the measure.
Figure 2.21: [Left] For a continuous sample space, a single point event {x0 } can always be surrounded
by a neighborhood A0 whose size P[A0 ] < ϵ. [Right] If you change the sample space to discrete
elements, then a single point event {x0 } can still be surrounded by a neighborhood A0 . However, the
size P[A0 ] = 1/4 is a fixed number and will not work for any ϵ.
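The contrast in Figure 2.21 can be sketched numerically. The snippet below is an illustration under stated assumptions: the continuous case uses a uniform weighting on [0, 1], and the discrete case uses four equally likely points whose locations are hypothetical. The function names are also made up for this sketch.

```python
from fractions import Fraction

def length_measure(x0, eps):
    """P[(x0 - eps/2, x0 + eps/2)] under a uniform weighting on [0, 1]."""
    lo = max(Fraction(0), x0 - eps / 2)
    hi = min(Fraction(1), x0 + eps / 2)
    return hi - lo

def counting_measure(points, x0, eps):
    """Fraction of equally likely points that fall inside the neighborhood."""
    inside = [p for p in points if x0 - eps / 2 < p < x0 + eps / 2]
    return Fraction(len(inside), len(points))

x0 = Fraction(1, 2)
# Assumed discrete sample space with four elements, one of which is x0.
points = [Fraction(1, 8), Fraction(3, 8), Fraction(1, 2), Fraction(7, 8)]

shrinking = [Fraction(1, 10**k) for k in range(1, 5)]
continuous = [length_measure(x0, eps) for eps in shrinking]
discrete = [counting_measure(points, x0, eps) for eps in shrinking]
print(continuous)  # shrinks together with eps
print(discrete)    # stays at 1/4 no matter how small eps is
```

With the length measure the neighborhood can be made as small as we like, whereas the counter never drops below the weight of the single point it contains.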
When we make probabilistic claims without considering the measure zero sets, we say
that an event happens almost surely (a.s.). That is, an event A happens almost surely if
P[A] = 1. (2.21)
Therefore, if a set A contains measure zero subsets, we can simply ignore them because they
do not affect the probability of events. In this book, we will omit “a.s.” if the context is
clear.
Example 2.38(a). Let Ω = [0, 1]. Then P[(0, 1)] = 1 almost surely because the points
0 and 1 have measure zero in Ω.
Example 2.38(b). Let Ω = {x | ‖x‖₂ ≤ 1} and let A = {x | ‖x‖₂ < 1}. Then P[A] = 1
almost surely because the circumference has measure zero in Ω.
Recall that a probability space consists of the triplet (Ω, F, P),
where Ω is the sample space that defines all possible outcomes, F is the event space generated
from Ω, and P is the probability law that maps an event to a number in [0, 1]. Can we drop
one or more of the three components? We cannot! If we do not specify the sample space Ω,
then there is no way to define the events. If we do not have a complete event space F,
then some events will become undefined, and further, if the probability law is applied only
to outcomes, we will not be able to define the probability for events. Finally, if we do not
specify the probability law, then we do not have a way to assign probabilities.
We now turn to a deeper examination of the properties. Our motivation is simple. While
the definition of probability law has achieved its goal of assigning a probability to an event,
there must be restrictions on how the assignment can be made. For example, if we set
P[{H}] = 1/3, then P[{T}] must be 2/3; otherwise, the probabilities of having a head and a tail
will not sum to 1. The necessary restrictions on assigning a probability to an event are
collectively known as the axioms of probability.
2.3. AXIOMS OF PROBABILITY
III. Additivity: For any disjoint sets {A1, A2, . . .}, it must be true that
$$\mathbb{P}\left[\bigcup_{i=1}^{\infty} A_i\right] = \sum_{i=1}^{\infty} \mathbb{P}[A_i]. \tag{2.23}$$
Axiom I is called the non-negativity axiom. It ensures that a probability value cannot
be negative. Non-negativity is a must for probability. It is meaningless to say that the
probability of getting an event is a negative number.
Axiom II is called the normalization axiom. It ensures that the probability of observing
all possible outcomes is 1. This gives the upper limit of the probability. The upper limit
does not have to be 1. It could be 10 or 100. As long as we are consistent about this upper
limit, we are good. However, for historical reasons and convenience, we choose 1 to be the
upper limit.
Axiom III is called the additivity axiom and is the most critical one among the three.
The additivity axiom defines how set operations can be translated into probability oper-
ations. In a nutshell, it says that if we have a set of disjoint events, the probabilities can
be added. From the measure perspective, Axiom III makes sense because if P measures the
size of an event, then two disjoint events should have their probabilities added. If two dis-
joint events do not allow their probabilities to be added, then there is no way to measure
a combined event. Similarly, if the probabilities can somehow be added even for overlap-
ping events, there will be inconsistencies because there is no systematic way to handle the
overlapping regions.
The countable additivity stated in Axiom III applies to either a finite number
or an infinite number of sets. The finite case states that for any two disjoint sets A and B,
we have
P[A ∪ B] = P[A] + P[B]. (2.24)
In other words, if A and B are disjoint, then the probability of observing either A or B is
the sum of the two individual probabilities. Figure 2.22 illustrates this idea.
Example 2.39. Let’s see why Axiom III is critical. Consider throwing a fair die with
Ω = { , , , , , }. The probability of getting { , } is
P[{ , }] = P[{ } ∪ { }] = P[{ }] + P[{ }] = 1/6 + 1/6 = 2/6.
In this equation, the second equality holds because the events { } and { } are disjoint.
If we do not have Axiom III, then we cannot add probabilities.
Example 2.40. Consider a sample space with Ω = {♣, ♡, ✠}. The probability for
each outcome is
$$\mathbb{P}[\{\clubsuit\}] = \frac{2}{6}, \quad \mathbb{P}[\{\heartsuit\}] = \frac{1}{6}, \quad \mathbb{P}[\{\maltese\}] = \frac{3}{6}.$$
Suppose we construct two disjoint events E1 = {♣, ♡} and E2 = {✠}. Then Axiom
III says
$$\mathbb{P}[E_1 \cup E_2] = \mathbb{P}[E_1] + \mathbb{P}[E_2] = \left(\frac{2}{6} + \frac{1}{6}\right) + \frac{3}{6} = 1.$$
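Example 2.40 can be checked with a few lines of Python. This is a minimal sketch: the string labels stand in for the three symbols, and the overlapping event `E3` is an extra, hypothetical event added to show what goes wrong when probabilities of non-disjoint events are simply summed.

```python
from fractions import Fraction

# Weights from Example 2.40; the string keys are stand-ins for the symbols.
P = {"club": Fraction(2, 6), "heart": Fraction(1, 6), "cross": Fraction(3, 6)}

def prob(event):
    """Probability of an event as the sum of the weights of its outcomes."""
    return sum(P[outcome] for outcome in event)

E1 = {"club", "heart"}
E2 = {"cross"}
disjoint_lhs = prob(E1 | E2)          # probability of the union
disjoint_rhs = prob(E1) + prob(E2)    # sum of the two probabilities

# Hypothetical overlapping event: it shares "heart" with E1.
E3 = {"heart", "cross"}
overlap_lhs = prob(E1 | E3)
overlap_rhs = prob(E1) + prob(E3)     # counts "heart" twice
print(disjoint_lhs, disjoint_rhs, overlap_lhs, overlap_rhs)
```

For the disjoint pair the two sides agree. For the overlapping pair the naive sum exceeds 1, which is why Axiom III is stated only for disjoint sets.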
Example 2.41. Suppose the sample space is an interval Ω = [0, 1]. The two events
are E1 = [a, b] and E2 = [c, d]. Assume that the measure P uses a weighting function
f (x). Then, by Axiom III, we know that
As you can see, there is no conflict between the axioms and the measure. Figure 2.24
illustrates this example.
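As a numerical sketch of Example 2.41, assume the weighting function is f(x) = 2x on [0, 1] and take the disjoint intervals E1 = [0.1, 0.3] and E2 = [0.5, 0.9]. Both the function and the endpoints are hypothetical choices; the integration routine is a plain midpoint rule written for this illustration.

```python
def f(x):
    """Hypothetical weighting function on [0, 1]; it integrates to 1."""
    return 2.0 * x

def integrate(g, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of g over [lo, hi]."""
    width = (hi - lo) / n
    return sum(g(lo + (k + 0.5) * width) for k in range(n)) * width

a, b = 0.1, 0.3   # E1 = [a, b]
c, d = 0.5, 0.9   # E2 = [c, d], disjoint from E1
p_e1 = integrate(f, a, b)
p_e2 = integrate(f, c, d)
p_union = p_e1 + p_e2   # Axiom III applied to two disjoint intervals
print(round(p_e1, 6), round(p_e2, 6), round(p_union, 6))
```

Because the intervals do not overlap, integrating f over their union is the same as adding the two separate integrals, so the weighted measure respects additivity.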
Figure 2.23: Applying weighting functions to the measures: Suppose we have three elements in the set.
To compute the probability P[{♡, ✠} ∪ {♣}], we can write it as the sum of P[{♡, ✠}] and P[{♣}].
Figure 2.24: The axioms are compatible with the measure, even if we use a weighting function.
Corollary. Let A be an event. Then
(a) P[A^c] = 1 − P[A].
(b) P[A] ≤ 1.
(c) P[∅] = 0.
Proof. (a) Since Ω = A ∪ A^c, by finite additivity we have P[Ω] = P[A ∪ A^c] = P[A] + P[A^c].
By the normalization axiom, we have P[Ω] = 1. Therefore, P[A^c] = 1 − P[A].
(b) We prove by contradiction. Assume P[A] > 1. Consider the complement A^c where
A ∪ A^c = Ω. Since P[A^c] = 1 − P[A], we must have P[A^c] < 0 because by hypothesis P[A] > 1.
But P[A^c] < 0 violates the non-negativity axiom. So we must have P[A] ≤ 1.
(c) Since ∅ = Ω^c, part (a) gives P[∅] = 1 − P[Ω] = 0.
This statement is different from Axiom III because A and B are not necessarily disjoint.
Figure 2.25: For any A and B, P[A ∪ B] = P[A] + P[B] − P[A ∩ B].
Proof. First, observe that A ∪ B can be partitioned into three disjoint subsets as A ∪ B =
(A\B) ∪ (A ∩ B) ∪ (B\A). Since A\B = A ∩ B^c and B\A = B ∩ A^c, by finite additivity we
have that
where in (a) we added and subtracted a term P[A ∩ B], and in (b) we used finite additivity
so that P[A ∩ B^c] + P[A ∩ B] = P[(A ∩ B^c) ∪ (A ∩ B)] = P[A ∩ (B^c ∪ B)].
□
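The display that steps (a) and (b) of this proof refer to does not appear in the text here. One reconstruction that is consistent with the three-way partition and with the two justifications given (a sketch, not necessarily the original layout) is:

```latex
\begin{aligned}
\mathbb{P}[A \cup B]
  &= \mathbb{P}[A \cap B^c] + \mathbb{P}[A \cap B] + \mathbb{P}[A^c \cap B] \\
  &\overset{(a)}{=} \mathbb{P}[A \cap B^c] + \mathbb{P}[A \cap B]
     + \mathbb{P}[A^c \cap B] + \mathbb{P}[A \cap B] - \mathbb{P}[A \cap B] \\
  &\overset{(b)}{=} \mathbb{P}[A \cap (B^c \cup B)]
     + \mathbb{P}[(A^c \cup A) \cap B] - \mathbb{P}[A \cap B] \\
  &= \mathbb{P}[A] + \mathbb{P}[B] - \mathbb{P}[A \cap B].
\end{aligned}
```

The last line uses B^c ∪ B = Ω and A^c ∪ A = Ω, so that A ∩ Ω = A and Ω ∩ B = B.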
Example 2.42. The corollary is easy to understand if we consider the following ex-
ample. Let Ω = { , , , , , } be the sample space of a fair die. Let A = { , , }
and B = { , , }. Then
P[A ∪ B] = P[{ , , , , }] = 5/6.
Practice Exercise 2.13. Let the events A and B have P[A] = x, P[B] = y and
P[A ∪ B] = z. Find the following probabilities: P[A ∩ B], P[A^c ∪ B^c], and P[A ∩ B^c].
Solution.
(a) Note that z = P[A ∪ B] = P[A] + P[B] − P[A ∩ B]. Thus, P[A ∩ B] = x + y − z.
(b) We can take the complement to obtain the result:
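The remaining two parts follow from De Morgan's law and finite additivity. The following is a sketch of one route, not necessarily the book's own wording:

```latex
\begin{aligned}
\text{(b)}\quad \mathbb{P}[A^c \cup B^c]
  &= \mathbb{P}[(A \cap B)^c] = 1 - \mathbb{P}[A \cap B] = 1 - x - y + z, \\
\text{(c)}\quad \mathbb{P}[A \cap B^c]
  &= \mathbb{P}[A] - \mathbb{P}[A \cap B] = x - (x + y - z) = z - y.
\end{aligned}
```

Part (c) uses the fact that A is the disjoint union of A ∩ B and A ∩ B^c.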
There are two events: A = {f | f (x) = ax, a ≥ 0}, and B = {f | f (x) = ax, a ≤ 0}.
So, basically, A is the set of all straight lines through the origin with nonnegative slope,
and B is the set of straight lines through the origin with nonpositive slope. Show that the
union bound is tight.
The intersection is
P[A ∩ B] = P[{f | f (x) = 0}].
Since this is a point set in the real line, it has measure zero. Thus, P[A ∩ B] = 0 and
hence P[A ∪ B] = P[A] + P[B]. So the union bound is tight.
Table 2.2: Kolmogorov’s summary of set theory results and random events.
In many practical data science problems, we are interested in the relationship between two
or more events. For example, an event A may cause B to happen, and B may cause C
to happen. A legitimate question in probability is then: If A has happened, what is the
probability that B also happens? Of course, if A and B are correlated events, then knowing
one event can tell us something about the other event. If the two events have no relationship,
knowing one event will not tell us anything about the other.
In this section, we study the concept of conditional probability. There are three sub-
topics in this section. We summarize the key points below.
2.4. CONDITIONAL PROBABILITY
Definition 2.22. Consider two events A and B. Assume P[B] ̸= 0. The conditional
probability of A given B is
$$\mathbb{P}[A \mid B] \overset{\text{def}}{=} \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]}. \tag{2.26}$$
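$$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]} \quad \text{and} \quad \mathbb{P}[A \cap B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[\Omega]}. \tag{2.27}$$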
The difference is illustrated in Figure 2.26: The intersection P[A ∩ B] calculates the overlap-
ping area of the two events. We make no assumptions about the cause-effect relationship.
Figure 2.26: Illustration of conditional probability and its comparison with P[A ∩ B].
What justifies this ratio? Suppose that B has already happened. Then, anything out-
side B will immediately become irrelevant as far as the relationship between A and B is
concerned. So when we ask: “What is the probability that A happens given that B has
happened?”, we are effectively asking for the probability that A ∩ B happens under the
condition that B has happened. Note that we need to consider A ∩ B because we know
that B has already happened. If we take A only, then there exists a region A\B which
does not contain anything about B. However, since we know that B has happened, A\B is
impossible. In other words, among the elements of A, only those that appear in A ∩ B are
meaningful.
In this example,
If we know nothing about how the season is going, then it is unlikely that Purdue will get
the championship because the sample space of all possible competition results is large.
However, if we have already won 15 games consecutively, then the denominator of the
probability becomes much smaller. In this case, the conditional probability is high.
In other words, if we know that we have an odd number, then the probability of
obtaining a 3 has to be computed over { , , }, which gives us a probability of 1/3. If we
do not know that we have an odd number, then the probability of obtaining a 3 has
to be computed from the sample space { , , , , , }, which will give us 1/6.
The other conditional probability is
$$\mathbb{P}[B \mid A] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[A]} = 1.$$
Therefore, if we know that we have rolled a 3, then the probability for this number
being an odd number is 1.
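The die example can be reproduced by counting. In this minimal sketch the faces are labeled 1 to 6, A is the event of rolling a 3, and B is the event of rolling an odd number; the helper names are made up for illustration.

```python
from fractions import Fraction

omega = set(range(1, 7))   # fair six-sided die
A = {3}                    # the roll is a 3
B = {1, 3, 5}              # the roll is odd

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(event, given):
    """P[event | given] = P[event ∩ given] / P[given], assuming P[given] > 0."""
    return prob(event & given) / prob(given)

p_A = prob(A)
p_A_given_B = cond_prob(A, B)
p_B_given_A = cond_prob(B, A)
print(p_A, p_A_given_B, p_B_given_A)
```

Conditioning on B shrinks the denominator from six outcomes to three, which is why the probability of a 3 rises from 1/6 to 1/3.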
Example 2.45. Consider the situation shown in Figure 2.27. There are 12 points
with equal probabilities of happening. Find the probabilities P[A|B] and P[B|A].
Solution. In this example, we can first calculate the individual probabilities:
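$$\mathbb{P}[A] = \frac{5}{12}, \quad \mathbb{P}[B] = \frac{6}{12}, \quad \text{and} \quad \mathbb{P}[A \cap B] = \frac{2}{12}.$$
Then the conditional probabilities are
$$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]} = \frac{2/12}{6/12} = \frac{1}{3},$$
$$\mathbb{P}[B \mid A] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[A]} = \frac{2/12}{5/12} = \frac{2}{5}.$$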
Figure 2.27: Visualization of Example 2.45: [Left] All the sets. [Middle] P[A|B] is the ratio between
dots inside the light yellow region over those in yellow, which is 2/6. [Right] P[B|A] is the ratio between
dots inside the light pink region over those in pink, which is 2/5.
Example 2.46. Consider a tetrahedral (4-sided) die. Let X be the first roll and Y
be the second roll. Let B be the event that min(X, Y ) = 2 and M be the event that
max(X, Y ) = 3. Find P[M |B].
Solution. As shown in Figure 2.28, the event B is highlighted in green. (Why?)
Similarly, the event M is highlighted in blue. (Again, why?) Therefore, the probability
is
$$\mathbb{P}[M \mid B] = \frac{\mathbb{P}[M \cap B]}{\mathbb{P}[B]} = \frac{2/16}{5/16} = \frac{2}{5}.$$
Figure 2.28: Visualization of Example 2.46. [Left] Event B. [Middle] Event M. [Right] P[M|B] is the
ratio of the number of blue squares inside the green region to the total number of green squares, which
is 2/5.
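The two "Why?" questions in Example 2.46 can be answered by enumerating all 16 outcomes. The following Python sketch lists the pairs that make up B and M and then forms the ratio; the variable names are chosen for this illustration.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely (X, Y) pairs from two rolls of a 4-sided die.
omega = list(product(range(1, 5), repeat=2))

B = [w for w in omega if min(w) == 2]   # min(X, Y) = 2
M = [w for w in omega if max(w) == 3]   # max(X, Y) = 3
M_and_B = [w for w in B if w in M]      # both conditions hold

p_B = Fraction(len(B), len(omega))
p_M_and_B = Fraction(len(M_and_B), len(omega))
p_M_given_B = p_M_and_B / p_B
print(B)
print(M_and_B)
print(p_M_given_B)
```

Printing `B` shows the five green squares, and `M_and_B` shows the two of them that are also blue.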
Remark. Notice that if P[B] ≤ P[Ω], then P[A | B] is always larger than or equal to P[A∩B],
i.e.,
P[A|B] ≥ P[A ∩ B].
Theorem 2.6. Let P[B] > 0. The conditional probability P[A | B] satisfies Axioms I,
II, and III.
Since P[B] > 0 and Axiom I requires P[A ∩ B] ≥ 0, we therefore have P[A | B] ≥ 0.
Axiom II:
$$\mathbb{P}[\Omega \mid B] = \frac{\mathbb{P}[\Omega \cap B]}{\mathbb{P}[B]} = \frac{\mathbb{P}[B]}{\mathbb{P}[B]} = 1.$$
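$$\begin{aligned}
\mathbb{P}[A \cup C \mid B] &= \frac{\mathbb{P}[(A \cup C) \cap B]}{\mathbb{P}[B]} \\
&= \frac{\mathbb{P}[(A \cap B) \cup (C \cap B)]}{\mathbb{P}[B]} \\
&\overset{(a)}{=} \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]} + \frac{\mathbb{P}[C \cap B]}{\mathbb{P}[B]} \\
&= \mathbb{P}[A \mid B] + \mathbb{P}[C \mid B],
\end{aligned}$$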
□
To summarize this subsection, we highlight the essence of conditional probability.
2.4.2 Independence
Conditional probability deals with situations where two events A and B are related. What
if the two events are unrelated? In probability, we have a technical term for this situation:
statistical independence.
Why define independence in this way? Recall that P[A | B] = P[A ∩ B]/P[B]. If A and B are
independent, then P[A ∩ B] = P[A] P[B] and so P[A | B] = P[A] P[B]/P[B] = P[A].
Definition 2.24. Let A and B be two events such that P[A] > 0 and P[B] > 0. Then
A and B are independent if P[A | B] = P[A] or P[B | A] = P[B].
The two statements are equivalent as long as P[A] > 0 and P[B] > 0. This is because
P[A|B] = P[A ∩ B]/P[B]. If P[A|B] = P[A] then P[A ∩ B] = P[A]P[B], which implies that
P[B|A] = P[A ∩ B]/P[A] = P[B].
A pictorial illustration of independence is given in Figure 2.29. The key message is that
if two events A and B are independent, then P[A|B] = P[A]. The conditional probability
P[A|B] is the ratio of P[A ∩ B] over P[B], which is the intersection over B (the blue set).
The probability P[A] is the yellow set over the sample space Ω.
Figure 2.29: Independence means that the conditional probability P[A|B] is the same as P[A]. This
implies that the ratio of P[A ∩ B] over P[B], and the ratio of P[A ∩ Ω] over P[Ω] are the same.
The statement says that disjoint and independent are two completely different concepts.
If A and B are disjoint, then A ∩ B = ∅. This only implies that P[A ∩ B] = 0.
However, it says nothing about whether P[A ∩ B] can be factorized into P[A] P[B]. If A
and B are independent, then we have P[A ∩ B] = P[A] P[B]. But this does not imply that
P[A ∩ B] = 0. The only condition under which Disjoint ⇔ Independence is when P[A] = 0 or
P[B] = 0. Figure 2.30 depicts the situation. When two sets are independent, the conditional
probability (which is a ratio) remains unchanged compared to unconditioned probability.
When two sets are disjoint, they simply do not overlap.
Practice Exercise 2.15. Throw a die twice. Are A and B independent, where
$$\mathbb{P}[A \cap B] = \mathbb{P}[(3, 4)] = \frac{1}{36}, \quad \mathbb{P}[A] = \frac{1}{6}, \quad \text{and} \quad \mathbb{P}[B] = \frac{1}{6}.$$
Figure 2.30: Independent means that the conditional probability, which is a ratio, is the same as the
unconditioned probability. Disjoint means that the two sets do not overlap.
Figure 2.31: The two events A and B are independent because P[A] = 1/6 and P[A|B] = 1/6.
A pictorial illustration of this example is shown in Figure 2.31. The two events are
independent because A is one row in the 2D space, which yields a probability of 1/6. The
conditional probability P[A|B] is the coordinate (3, 4) over the event B, which is a column.
It happens that P[A|B] = 1/6. Thus, the two events are independent.
$$\mathbb{P}[A \cap B] = \mathbb{P}[(3, 4)] = \frac{1}{36}, \quad \mathbb{P}[A] = \frac{1}{6},$$
$$\mathbb{P}[B] = \mathbb{P}[\{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)\}] = \frac{1}{6}.$$
A pictorial illustration of this example is shown in Figure 2.32. Notice that whether the
two events intersect is not how we determine independence (that only determines disjoint or
not). The key is whether the conditional probability (which is the ratio) remains unchanged
compared to the unconditioned probability.
Figure 2.32: The two events A and B are independent because P[A] = 1/6 and P[A|B] = 1/6.
If we let B = {sum is 8}, then the situation is different. The intersection A ∩ B has a
probability 1/5 relative to B, and therefore P[A|B] = 1/5. Hence, the two events A and B are
dependent. If you like a more intuitive argument, you can imagine that B has happened,
i.e., the sum is 8. Then the probability for the first die to be 1 is 0 because there is no way
to construct 8 when the first die is 1. As a result, we have eliminated one choice for the first
die, leaving only five options. Therefore, since B has influenced the probability of A, they
are dependent.
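The three cases discussed here can be checked by brute-force enumeration of the 36 outcomes. In the sketch below, A is assumed to be the event that the first die shows 3, which is consistent with the outcome (3, 4) and with A being one row of the grid; the predicate and helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))   # two rolls of a fair die

def prob(event):
    """Probability of the set of outcomes for which the predicate is true."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def is_independent(e1, e2):
    """Check the product rule P[e1 and e2] = P[e1] P[e2]."""
    both = prob(lambda w: e1(w) and e2(w))
    return both == prob(e1) * prob(e2)

first_is_3 = lambda w: w[0] == 3    # assumed definition of A
second_is_4 = lambda w: w[1] == 4
sum_is_7 = lambda w: sum(w) == 7
sum_is_8 = lambda w: sum(w) == 8

print(is_independent(first_is_3, second_is_4))
print(is_independent(first_is_3, sum_is_7))
print(is_independent(first_is_3, sum_is_8))
```

All three pairs of events intersect in exactly one outcome, yet only the first two satisfy the product rule. This is the sense in which overlap decides disjointness, not independence.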
What is independence?
Two events are independent when the ratio P[A ∩ B]/P[B] remains unchanged
compared to P[A].
Independence ̸= disjoint.
Theorem 2.7 (Bayes’ theorem). For any two events A and B such that P[A] > 0
and P[B] > 0,
$$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[B \mid A]\, \mathbb{P}[A]}{\mathbb{P}[B]}.$$
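$$\mathbb{P}[A \mid B] = \frac{\mathbb{P}[A \cap B]}{\mathbb{P}[B]} \quad \text{and} \quad \mathbb{P}[B \mid A] = \frac{\mathbb{P}[B \cap A]}{\mathbb{P}[A]}.$$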
where (a) follows from the definition of conditional probability, (b) is due to Axiom III, (c)
holds because of the distributive property of sets, and (d) results from the partition property
of {A1 , A2 , . . . , An }.
□
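The chain of equalities labeled (a) to (d) is not shown in the text here. Assuming the theorem states that P[B] equals the sum of P[B | Ai] P[Ai] over a partition {A1, . . . , An} of Ω, one reconstruction that matches the four justifications is:

```latex
\begin{aligned}
\sum_{i=1}^{n} \mathbb{P}[B \mid A_i]\, \mathbb{P}[A_i]
  &\overset{(a)}{=} \sum_{i=1}^{n} \mathbb{P}[B \cap A_i]
   \overset{(b)}{=} \mathbb{P}\left[\bigcup_{i=1}^{n} (B \cap A_i)\right] \\
  &\overset{(c)}{=} \mathbb{P}\left[B \cap \left(\bigcup_{i=1}^{n} A_i\right)\right]
   \overset{(d)}{=} \mathbb{P}[B \cap \Omega] = \mathbb{P}[B].
\end{aligned}
```

Step (b) needs the sets B ∩ Ai to be disjoint, which holds because the Ai themselves are disjoint.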
Interpretation. The law of total probability can be understood as follows. If the sample
space Ω consists of disjoint subsets A1 , . . . , An , we can compute the probability P[B] by
summing over its portion P[B ∩A1 ], . . . , P[B ∩An ]. However, each intersection can be written
as
P[B ∩ Ai ] = P[B | Ai ]P[Ai ]. (2.33)
In other words, we write P[B ∩ Ai ] as the conditional probability P[B | Ai ] times the prior
probability P[Ai ]. When we sum all these intersections, we obtain the overall probability.
See Figure 2.33 for a graphical portrayal.
Figure 2.33: The law of total probability decomposes the probability P[B] into multiple conditional
probabilities P[B | Ai ]. The probability of obtaining each P[B | Ai ] is P[Ai ].
$$\mathbb{P}[A_j \mid B] = \frac{\mathbb{P}[B \mid A_j]\, \mathbb{P}[A_j]}{\sum_{i=1}^{n} \mathbb{P}[B \mid A_i]\, \mathbb{P}[A_i]}. \tag{2.34}$$
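Equation (2.34) translates directly into code. The following is a minimal sketch with hypothetical numbers: a partition into three events with assumed priors and assumed conditional probabilities. The function names are made up for this illustration.

```python
def total_probability(priors, likelihoods):
    """P[B] = sum over i of P[B | A_i] P[A_i] for a partition {A_i}."""
    return sum(likelihoods[k] * priors[k] for k in priors)

def posterior(priors, likelihoods):
    """P[A_j | B] for every j, following Equation (2.34)."""
    p_b = total_probability(priors, likelihoods)
    return {k: likelihoods[k] * priors[k] / p_b for k in priors}

# Hypothetical numbers, chosen only for illustration.
priors = {"A1": 0.5, "A2": 0.25, "A3": 0.25}        # P[A_i], must sum to 1
likelihoods = {"A1": 0.2, "A2": 0.4, "A3": 0.8}     # P[B | A_i]

p_b = total_probability(priors, likelihoods)
post = posterior(priors, likelihoods)
print(p_b)
print(post)
```

The denominator is computed once by the law of total probability and reused for every posterior, so the posteriors automatically sum to 1.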
Example 2.47. Suppose there are three types of players in a tennis tournament: A,
B, and C. Fifty percent of the contestants in the tournament are A players, 25% are
B players, and 25% are C players. Your chance of beating the contestants depends on
the class of the player, as follows:
If you play a match in this tournament, what is the probability of your winning the
match? Supposing that you have won a match, what is the probability that you played
against an A player?
Solution. We first list all the known probabilities. We know from the percentage
of players that
Now, let W be the event that you win the match. Then the conditional probabilities
are defined as follows:
Therefore, by the law of total probability, we can show that the probability of
winning the match is
Given that you have won the match, the probability of A given W is
Example 2.48. Consider the communication channel shown below. The probability
of sending a 1 is p and the probability of sending a 0 is 1 − p. Given that 1 is sent, the
probability of receiving 1 is 1 − η. Given that 0 is sent, the probability of receiving 0
is 1 − ε. Find the probability that a 1 has been correctly received.
is the conditional probability that 1 is received given that 1 is sent. It is possible that
we receive 1 as a result of an error when 0 is sent. Therefore, we need to consider
both cases, S0 and S1. Using the law of total probability we have
Now, suppose that we have received 1. What is the probability that 1 was origi-
nally sent? This is asking for the posterior probability P[S1 | R1 ], which can be found
using Bayes’ theorem
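The two quantities in Example 2.48 can be written as short functions. This sketch follows the channel description: a 1 is sent with probability p, a sent 1 is received as 1 with probability 1 − η, and a sent 0 is received as 1 with probability ε. The function names and the numerical values of `p`, `eta`, and `eps` are hypothetical.

```python
def prob_receive_1(p, eta, eps):
    """P[R1] = P[R1 | S1] P[S1] + P[R1 | S0] P[S0]."""
    return (1 - eta) * p + eps * (1 - p)

def prob_sent_1_given_received_1(p, eta, eps):
    """P[S1 | R1] by Bayes' theorem."""
    return (1 - eta) * p / prob_receive_1(p, eta, eps)

# Hypothetical channel parameters, chosen only for illustration.
p, eta, eps = 0.5, 0.1, 0.2
p_r1 = prob_receive_1(p, eta, eps)
p_s1_given_r1 = prob_sent_1_given_received_1(p, eta, eps)
print(p_r1, p_s1_given_r1)
```

The first function is the law of total probability over the two things that could have been sent; the second divides the "1 sent and 1 received" term by that total.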
When do we need to use Bayes’ theorem and the law of total probability?
Bayes’ theorem switches the role of the conditioning, from P[A|B] to P[B|A].
Example:
P[win the game | play with A] and P[play with A | win the game].
Figure 2.34: The Three Prisoners problem: The king says that he will pardon two prisoners and sentence
one.
One of the prisoners, prisoner A, heard the news and wanted to ask a friendly guard
about his situation. The guard was honest. He was allowed to tell prisoner A that prisoner B
would be pardoned or that prisoner C would be pardoned, but he could not tell A whether
he would be pardoned. Prisoner A thought about the problem, and he began to hesitate to
ask the guard. Based on his present state of knowledge, his probability of being pardoned
is 2/3. However, if he asks the guard, this probability will be reduced to 1/2 because the guard
would tell him that one of the two other prisoners would be pardoned, and would tell him
which one it would be. Prisoner A reasons that his chance of being pardoned would then
drop because there are now only two prisoners left who may be pardoned, as illustrated in
Figure 2.35:
Figure 2.35: The Three Prisoners problem: If you do not ask the guard, your chance of being released
is 2/3. If you ask the guard, the guard will tell you which one of the other prisoners will be released.
Your chance of being released apparently drops to 1/2.
Should prisoner A ask the guard? What has gone wrong with his reasoning? This
problem is tricky in the sense that the verbal argument of prisoner A seems flawless. If
he asked the guard, indeed, the game would be reduced to two people. However, this does
not seem correct, because regardless of what the guard says, the probability for A to be
pardoned should remain unchanged. Let’s see how we can solve this puzzle.
Let XA , XB , XC be the events of sentencing prisoners A, B, C, respectively. Let GB
be the event that the guard says that the prisoner B is released. Without doing anything,
we know that
$$\mathbb{P}[X_A] = \frac{1}{3}, \quad \mathbb{P}[X_B] = \frac{1}{3}, \quad \mathbb{P}[X_C] = \frac{1}{3}.$$
Conditioned on these events, we can compute the following conditional probabilities that
the guard says B is pardoned:
$$\mathbb{P}[G_B \mid X_A] = \frac{1}{2}, \quad \mathbb{P}[G_B \mid X_B] = 0, \quad \mathbb{P}[G_B \mid X_C] = 1.$$
Why are these the conditional probabilities? P[GB | XB] = 0 is quite straightforward. If the king
decides to sentence B, the guard has no way of saying that B will be pardoned. Therefore,
P[GB | XB] must be zero. P[GB | XC] = 1 is also not difficult. If the king decides to
sentence C, then the guard has no choice but to tell you that B will be pardoned, because the
guard cannot say anything about prisoner A. Finally, P[GB | XA] = 1/2 can be understood
as follows: If the king decides to sentence A, the guard can tell you either B or C. In other
words, the guard flips a coin.
With these conditional probabilities ready, we can determine the probability. This is the
conditional probability P[XA | GB ]. That is, supposing that the guard says B is pardoned,
what is the probability that A will be sentenced? This is the actual scenario that A is facing.
Solving for this conditional probability is not difficult. By Bayes’ theorem we know that
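$$\mathbb{P}[X_A \mid G_B] = \frac{\mathbb{P}[G_B \mid X_A]\, \mathbb{P}[X_A]}{\mathbb{P}[G_B]},$$
and P[GB] = P[GB|XA]P[XA] + P[GB|XB]P[XB] + P[GB|XC]P[XC] according to the law of
total probability. Substituting the numbers into these equations, we have that
$$\begin{aligned}
\mathbb{P}[G_B] &= \mathbb{P}[G_B \mid X_A]\mathbb{P}[X_A] + \mathbb{P}[G_B \mid X_B]\mathbb{P}[X_B] + \mathbb{P}[G_B \mid X_C]\mathbb{P}[X_C] \\
&= \frac{1}{2} \times \frac{1}{3} + 0 \times \frac{1}{3} + 1 \times \frac{1}{3} = \frac{1}{2}, \\
\mathbb{P}[X_A \mid G_B] &= \frac{\mathbb{P}[G_B \mid X_A]\, \mathbb{P}[X_A]}{\mathbb{P}[G_B]} = \frac{\frac{1}{2} \times \frac{1}{3}}{\frac{1}{2}} = \frac{1}{3}.
\end{aligned}$$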
Therefore, given that the guard says B is pardoned, the probability that A will be sentenced
remains 1/3. In fact, what you can show in this example is that P[XA | GB] = 1/3 = P[XA].
Therefore, the presence or absence of the guard does not alter the probability. This is because
what the guard says is independent of whether prisoner A will be pardoned. The lesson
we learn from this problem is not to rely on verbal arguments. We need to write down the
conditional probabilities and spell out the steps.
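The calculation above can be repeated with exact fractions. The sketch below uses the priors and the guard's conditional probabilities exactly as listed; the dictionary names are made up for this illustration. It also computes the posterior for prisoner C, which is not discussed above but follows from the same formula.

```python
from fractions import Fraction

# P[X_A], P[X_B], P[X_C]: who is sentenced.
prior = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}
# P[G_B | X_A], P[G_B | X_B], P[G_B | X_C]: the guard says B is pardoned.
says_B = {"A": Fraction(1, 2), "B": Fraction(0), "C": Fraction(1)}

# Law of total probability, then Bayes' theorem.
p_GB = sum(says_B[k] * prior[k] for k in prior)
p_XA_given_GB = says_B["A"] * prior["A"] / p_GB
p_XC_given_GB = says_B["C"] * prior["C"] / p_GB
print(p_GB, p_XA_given_GB, p_XC_given_GB)
```

The probability that A is sentenced stays at 1/3, while the probability that C is sentenced rises to 2/3. The guard's answer carries information about C, not about A.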
Figure 2.36: The Three Prisoners problem is resolved by noting that P[XA |GB ] = P[XA ]. Therefore,
the events XA and GB are independent.
2.5 Summary
By now, we hope that you have become familiar with our slogan probability is a measure
of the size of a set. Let us summarize:
Probability = a probability law P. You can also view it as the value returned by P.
Measure = a ruler, a scale, a stopwatch, or another measuring device. It is a tool that
tells you how large or small a set is. The measure has to be compatible with the set.
If a set is finite, then the measure can be a counter. If a set is a continuous interval,
then the measure can be the length of the interval.
Size = the relative weight of the set for the sample space. Measuring the size is done
by using a weighting function. Think of a fair coin versus a biased coin. The former
has a uniform weight, whereas the latter has a nonuniform weight.
Set = an event. An event is a subset in the sample space. A probability law P always
maps a set to a number. This is different from a typical function that maps a number
to another number.
If you understand what this slogan means, you will understand why probability can be
applied to discrete events, continuous events, events in n-D spaces, etc. You will also under-
stand the notion of measure zero and the notion of almost sure. These concepts lie at the
foundation of modern data science, in particular, theoretical machine learning.
The second half of this chapter discusses the concept of conditional probability. Con-
ditional probability is a metaconcept that can be applied to any measure you use. The
motivation of conditional probability is to restrict the probability to a subevent happening
in the sample space. If B has happened, the probability for A to also happen is P[A∩B]/P[B].
If two events are not influencing each other, then we say that A and B are independent.
According to Bayes' theorem, we can also switch the order of A given B and B given A.
Finally, the law of total probability gives us a way to decompose
events into subevents.
We end this chapter by mentioning a few terms related to conditional probabilities
that will become useful later. Let us use the tennis tournament as an example:
P[W | A] = conditional probability = Given that you played with player A, what is
the probability that you will win?
P[A] = prior probability = Without even entering the game, what is the chance that
you will face player A?
P[A | W ] = posterior probability = After you have won the game, what is the proba-
bility that you have actually played with A?
In many practical engineering problems, the question of interest is often the last one. That
is, supposing that you have observed something, what is the most likely cause of that event?
For example, supposing we have observed this particular dataset, what is the best Gaussian
model that would fit the dataset? Questions like these require some analysis of conditional
probability, prior probability, and posterior probability.
2.6 References
Introduction to Probability
2-1 Dimitri P. Bertsekas and John N. Tsitsiklis, Introduction to Probability, Athena Sci-
entific, 2nd Edition, 2008. Chapter 1.
2-2 Mark D. Ward and Ellen Gundlach, Introduction to Probability, W.H. Freeman and
Company, 2016. Chapter 1 – Chapter 6.
2-3 Roy D. Yates and David J. Goodman, Probability and Stochastic Processes, 3rd Edi-
tion, Wiley 2013, Chapter 1.
2-4 John A. Gubner, Probability and Random Processes for Electrical and Computer En-
gineers, Cambridge University Press, 2006. Chapter 2.
2-5 Sheldon Ross, A First Course in Probability, Prentice Hall, 8th Edition, 2010. Chapter
2 and Chapter 3.
2-6 Ani Adhikari and Jim Pitman, Probability for Data Science, http://prob140.org/
textbook/content/README.html. Chapters 1 and 2.
2-7 Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical En-
gineering, Prentice Hall, 3rd Edition, 2008. Chapter 2.1 – 2.7.
2-8 Athanasios Papoulis and S. Unnikrishna Pillai, Probability, Random Variables and
Stochastic Processes, McGraw-Hill, 4th Edition, 2001. Chapter 2.
2-9 Henry Stark and John Woods, Probability and Random Processes With Applications
to Signal Processing, Prentice Hall, 3rd Edition, 2001. Chapter 1.
Measure-Theoretic Probability
2-10 Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical En-
gineering, Prentice Hall, 3rd Edition, 2008. Chapter 2.8 and 2.9.
2-11 Henry Stark and John Woods, Probability and Random Processes With Applications
to Signal Processing, Prentice Hall, 3rd Edition, 2001. Appendix D.
2-12 William Feller, An Introduction to Probability Theory and Its Applications, Wiley and
Sons, 3rd Edition, 1950.
2-13 Andrey Kolmogorov, Foundations of the Theory of Probability, 2nd English Edition,
Dover 2018. (Translated from Russian to English. Originally published in 1950 by
Chelsea Publishing Company New York.)
2-14 Patrick Billingsley, Probability and Measure, Wiley, 3rd Edition, 1995.
Real Analysis
2-15 Tom M. Apostol, Mathematical Analysis, Pearson, 1974.
2-16 Walter Rudin, Principles of Mathematical Analysis, McGraw Hill, 1976.
2.7 Problems
Exercise 1.
A space S and three of its subsets are given by S = {1, 3, 5, 7, 9, 11}, A = {1, 3, 5}, B =
{7, 9, 11}, and C = {1, 3, 9, 11}. Find A ∩ B ∩ C, A^c ∩ B, A − C, and (A − B) ∪ B.
Exercise 2.
Let A = (−∞, r] and B = (−∞, s] where r ≤ s. Find an expression for C = (r, s] in terms
of A and B. Show that B = A ∪ C, and A ∩ C = ∅.
Exercise 4.
We will sometimes deal with the relationship between two sets. We say that A implies B
when A is a subset of B (why?). Show the following results.
(a) Show that if A implies B, and B implies C, then A implies C.
(b) Show that if A implies B, then B^c implies A^c.
Exercise 5.
Show that if A ∪ B = A and A ∩ B = A, then A = B.
Exercise 6.
A space S is defined as S = {1, 3, 5, 7, 9, 11}, and three subsets as A = {1, 3, 5}, B =
{7, 9, 11}, C = {1, 3, 9, 11}. Assume that each element has probability 1/6. Find the following
probabilities:
(a) P[A]
(b) P[B]
(c) P[C]
(d) P[A ∪ B]
(e) P[A ∪ C]
(f) P[(A\C) ∪ B]
Exercise 8.
Consider an experiment consisting of rolling a die twice. The outcome of this experiment is
an ordered pair whose first element is the first value rolled and whose second element is the
second value rolled.
(a) Find the sample space.
(b) Find the set A representing the event that the value on the first roll is greater than
or equal to the value on the second roll.
(c) Find the set B corresponding to the event that the first roll is a six.
(d) Let C correspond to the event that the first value rolled and the second value rolled
differ by two. Find A ∩ C.
Note that A, B, and C should be subsets of the sample space specified in Part (a).
Exercise 9.
A pair of dice is rolled.
(a) Find the sample space Ω.
(b) Find the probabilities of the events: (i) the sum is even, (ii) the first roll is equal to
the second, (iii) the first roll is larger than the second.
Exercise 10.
Let A, B and C be events in an event space. Find expressions for the following:
(a) Exactly one of the three events occurs.
Exercise 11.
A system is composed of five components, each of which is either working or failed. Consider
an experiment that consists of observing the status of each component, and let the outcomes
of the experiment be given by all vectors (x1 , x2 , x3 , x4 , x5 ), where xi is 1 if component i is
working and 0 if component i is not working.
(a) How many outcomes are in the sample space of this experiment?
(b) Suppose that the system will work if components 1 and 2 are both working, or if
components 3 and 4 are both working, or if components 1, 3, and 5 are all working.
Let W be the event that the system will work. Specify all of the outcomes in W .
2.7. PROBLEMS
(c) Let A be the event that components 4 and 5 have both failed. How many outcomes
are in the event A?
(d) Write out all outcomes in the event A ∩ W .
Exercise 14.
(a) By using the fact that P[A ∪ B] ≤ P[A] + P[B], show that P[A ∪ B ∪ C] ≤ P[A] + P[B] +
P[C].
(b) By using the fact that $P\left[\bigcup_{k=1}^{n} A_k\right] \le \sum_{k=1}^{n} P[A_k]$, show that
$$P\left[\bigcap_{k=1}^{n} A_k\right] \ge 1 - \sum_{k=1}^{n} P[A_k^c].$$
Exercise 15.
Use the distributive property of set operations to prove the following generalized distributive
law:
$$A \cup \left(\bigcap_{i=1}^{n} B_i\right) = \bigcap_{i=1}^{n} (A \cup B_i).$$
Hint: Use mathematical induction. That is, show that the above is true for n = 2 and that
it is also true for n = k + 1 when it is true for n = k.
Exercise 16.
The following result is known as Bonferroni’s inequality.
(a) Prove that for any two events A and B, we have
(a) If the block has 1 or fewer errors, then the receiver accepts the block. Find the prob-
ability that the block is accepted.
(b) If the block has more than 1 error, then the block is retransmitted. What is the
probability that 4 blocks are transmitted?
(d) In part (c), which coin is more probable when 2 heads have been observed?
(b) Find the probability of receiving 1011 conditioned on that 1011 was sent, i.e.,
(c) To improve reliability, each symbol is transmitted three times, and the received
string is decoded by the majority rule. In other words, a 0 (or 1) is transmitted as
000 (or 111, respectively), and it is decoded at the receiver as a 0 (or 1) if and only if
the received three-symbol string contains at least two 0s (or 1s, respectively). What
is the probability that the symbol is correctly decoded, given that we send a 0?
(d) Suppose that the scheme of part (c) is used. What is the probability that a 0 was
sent if the string 101 was received?
(e) Suppose the scheme of part (c) is used and given that a 0 was sent. For what value of
ε0 is there an improvement in the probability of correct decoding? Assume that
ε0 ≠ 0.
Chapter 3
When working on a data analysis problem, one of the biggest challenges is the disparity
between the theoretical tools we learn in school and the actual data our boss hands to us.
By actual data, we mean a collection of numbers, perhaps organized or perhaps not. When
we are given the dataset, the first thing we do would certainly not be to define the Borel
σ-field and then define the measure. Instead, we would normally compute the mean, the
standard deviation, and perhaps some scores about the skewness.
The situation is best explained by the landscape shown in Figure 3.1. On the one hand,
we have well-defined probability tools, but on the other hand, we have a set of practical
“battle skills” for processing data. Often we view them as two separate entities. As long as
we can pull the statistics from the dataset, why bother about the theory? Alternatively, we
have a set of theories, but we will never verify them using the actual datasets. How can we
bridge the two? What are the missing steps in the probability theory we have learned so
far? The goal of this chapter (and the next) is to fill this gap.
Figure 3.1: The landscape of probability and data. Often we view probability and data analysis as two
different entities. However, probability and data analysis are inseparable. The goal of this chapter is to
link the two.
CHAPTER 3. DISCRETE RANDOM VARIABLES
implement those theories. How do we make the abstract probability space more convenient
so that we can model practical scenarios?
The first step is to recognize that the sample space and the event space are all based
on statements, for example, “getting a head when flipping a coin” or “winning the game.”
These statements are not numbers, but we (engineers) love numbers. Therefore, we should
ask a very basic question: How do we convert a statement to a number? The answer is the
concept of random variables.
Now, suppose that we have constructed a random variable that translates statements to
numbers. The next task is to endow the random variable with probabilities. More precisely,
we need to assign probabilities to the random variable so that we can perform computations.
This is done using the concept called probability mass function (PMF).
The best way to think about a PMF is as a histogram, something we are familiar with.
A histogram has two axes: The x-axis denotes the set of states and the y-axis denotes
the probability. For each of the states that the random variable possesses, the histogram
tells us the probability of getting a particular state. The PMF is the ideal histogram of a
random variable. It provides a complete characterization of the random variable. If you have
a random variable, you must specify its PMF. Vice versa, if you tell us the PMF, you have
specified a random variable.
The third question concerns extracting information, such as the mean and standard
deviation, from the probability mass function. How do we obtain these numbers from the
PMF? We are also interested in operations on the mean and standard deviation. For example,
if a professor offers ten bonus points to the entire class, how will it affect the mean
and standard deviation? If a store provides 20% off on all its products, what will happen to
its mean retail price and standard deviation? However, the biggest question is perhaps the
difference between the mean we obtain from a PMF and the mean we obtain from a his-
togram. Understanding this difference will immediately help us build a bridge from theory
to practice.
3.1. RANDOM VARIABLES
the abstract probability space aside and focus on the random variables. In Section 3.2
we will define the probability mass function (PMF) of a random variable, which tells us
the probability of obtaining a state of the random variable. PMF is closely related to the
histogram of a dataset. We will explain the connection. In Section 3.3 we take a small detour
to consider the cumulative distribution functions (CDF). Then, we discuss the mean and
standard deviation in Section 3.4. Section 3.5 details a few commonly used random variables,
including Bernoulli, binomial, geometric, and Poisson variables.
There is nothing new here: we have merely converted the symbols to numbers, with the help
of a function X. However, with X defined, the probabilities can be written as
P[X = 1] = 1/6, P[X = 2] = 2/6, P[X = 3] = 2/6, P[X = 4] = 1/6.
This is much more convenient, and so the data scientist is happy.
This definition may be puzzling at first glance. Why should we overcomplicate things by
defining a function and calling it a variable?
If you recall the story above, we can map the notations of the story to the notations
of the definition as follows.
Symbol   Meaning
Ω        sample space = the set containing ♣, ♢, ♡, ♠
ξ        an element in the sample space, which is one of ♣, ♢, ♡, ♠
X        a function that maps ♣ to the number 1, ♢ to the number 2, etc.
X(ξ)     a number on the real line, e.g., X(♣) = 1
The random variable X is a function. The input to the function is an outcome of the sample
space, whereas the output is a number on the real line. This type of function is somewhat
different from an ordinary function that often translates a number to another number.
Nevertheless, X is a function.
Figure 3.2: A random variable is a mapping from the outcomes in the sample space to numbers on the
real line. We can think of a random variable X as a translator that translates a statement to a number.
Example 3.1. Suppose we flip a fair coin so that Ω = {head, tail}. We can define the
random variable X : Ω → R as
Therefore, when we write P[X = 1] we actually mean P[{head}]. Is there any difference
between P[{head}] and P[X = 1]? No, because they describe the same event.
Note that the assignment of the value is totally up to you. You can say “head” is equal
to the value 102. This is allowed and legitimate, but it isn’t very convenient.
A pictorial illustration of this random variable is shown in Figure 3.3. This example
shows that the mapping defined by the random variable is not necessarily a one-to-one
mapping because multiple outcomes can be mapped to the same number.
Figure 3.3: A random variable that maps a pair of coins to a number, where the number represents the
number of heads.
This question appears difficult but is actually quite easy to answer. Since the prob-
ability law P(·) is always applied to an event, we need to define an event for the random
variable X. If we write the sets clearly, we note that “X = a” is equivalent to the set
E = {ξ ∈ Ω | X(ξ) = a}.
This is the set that contains all possible ξ’s such that X(ξ) = a. Therefore, when we say
“find the probability of X = a,” we are effectively asking for the size of the set E = {ξ ∈
Ω | X(ξ) = a}.
How then do we measure the size of E? Since E is a subset in the sample space, E is
measurable by P. All we need to do is to determine what E is for a given a. This, in turn,
requires us to find the pre-image $X^{-1}(a)$, which is defined as
$$X^{-1}(a) \overset{\text{def}}{=} \{\xi \in \Omega \mid X(\xi) = a\}.$$
Wait a minute, is this set just equal to E? Yes, the event E we are seeking is exactly the
pre-image $X^{-1}(a)$. As such, the probability measure of E is
Figure 3.4 illustrates a situation where two outcomes ξ1 and ξ2 are mapped to the same
value a on the real line. The corresponding event is the set $X^{-1}(a) = \{\xi_1, \xi_2\}$.
Figure 3.4: When computing the probability of P[{ξ ∈ Ω | X(ξ) = a}], we effectively take the inverse
mapping $X^{-1}(a)$ and compute the probability of the event $P[\{\xi \in X^{-1}(a)\}] = P[\{\xi_1, \xi_2\}]$.
Ω = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅}.
$$\begin{aligned}
P[X \le 3] &\overset{(a)}{=} P[X = 1] + P[X = 2] + P[X = 3] \\
&\overset{(b)}{=} P[X^{-1}(1)] + P[X^{-1}(2)] + P[X^{-1}(3)] \\
&\overset{(c)}{=} P[\{⚀\}] + P[\{⚁\}] + P[\{⚂\}] = \frac{3}{6}.
\end{aligned}$$
In this derivation, step (a) is based on Axiom III, where the three events are disjoint.
Step (b) is the pre-image due to the random variable X. Step (c) is the list of ac-
tual events in the event space. Note that there is no hand-waving argument in this
derivation. Every step is justified by the concepts and theorems we have learned so
far.
Ω = {(⚀, ⚀), (⚀, ⚁), . . . , (⚅, ⚅)}.
ξ1 = (⚀, ⚀), ξ2 = (⚀, ⚁), . . . , ξ36 = (⚅, ⚅).
Let
X = sum of two numbers.
Then, if we want to find the probability of getting X = 7, we can trace back and ask:
Among the 36 outcomes, which of those ξi’s will give us X(ξ) = 7? Or, what is the set
$X^{-1}(7)$? To this end, we can write
Closing remark. In practice, when the problem is clearly defined, we can skip the inverse
mapping $X^{-1}(a)$. However, this does not mean that the probability triplet (Ω, F, P) is gone;
it is still present. The triplet is now just the background of the problem.
The set of all possible values returned by X is denoted as X(Ω). Since X is not
necessarily a bijection, the size of X(Ω) is not necessarily the same as the size of Ω. The
elements in X(Ω) are often denoted as a or x. We call a or x one of the states of X. Be
careful not to confuse x and X. The variable X is the random variable; it is a function.
The variable x is a state assigned by X. A random variable X has multiple states. When
we write P[X = x], we describe the probability of a random variable X taking a particular
state x. It is exactly the same as P[{ξ ∈ Ω | X(ξ) = x}].
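The pre-image idea can be sketched in a few lines of Python for the sum of two dice. This is only an illustration under stated assumptions: the two dice are taken to be fair (each of the 36 pairs has probability 1/36), and the names omega, prob, preimage, and P_X_equals are hypothetical helpers introduced here, not notation from the text.

```python
from fractions import Fraction
from itertools import product

# Sample space: 36 ordered pairs from two dice (assumed fair and equiprobable).
omega = list(product(range(1, 7), repeat=2))
prob = {xi: Fraction(1, 36) for xi in omega}

def X(xi):
    """Random variable: maps an outcome (a pair) to the sum of the two numbers."""
    return xi[0] + xi[1]

def preimage(a):
    """The event X^{-1}(a) = {xi in Omega | X(xi) = a}."""
    return [xi for xi in omega if X(xi) == a]

def P_X_equals(a):
    """P[X = a], computed as the probability of the pre-image."""
    return sum((prob[xi] for xi in preimage(a)), Fraction(0))

states = sorted({X(xi) for xi in omega})  # the set of states X(Omega)

print(preimage(7))            # the six pairs that sum to 7
print(P_X_equals(7))          # 1/6
print(len(omega), len(states))  # 36 outcomes, 11 states
```

Because several outcomes map to the same sum, X(Ω) has fewer elements than Ω, which is the non-bijection point made above.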
Random variables are mappings that translate events to numbers. After the translation,
we have a set of numbers denoting the states of the random variables. Each state has a
different probability of occurring. The probabilities are summarized by a function known as
the probability mass function (PMF).
Do not get confused by the sample space Ω and the set of states X(Ω). The sample space Ω
contains all the possible outcomes of the experiments, whereas X(Ω) is the translation by
the mapping X. The event X = a is the set $X^{-1}(a) \subseteq \Omega$. Therefore, when we say P[X = x]
we really mean $P[X^{-1}(x)]$.
The probability mass function is a histogram summarizing the probability of each of
the states X takes. Since it is a histogram, a PMF can be easily drawn as a bar chart.
Example 3.5. Flip a coin twice. The sample space is Ω = {HH, HT, TH, TT}. We
can assign a random variable X = number of heads. Therefore,
3.2. PROBABILITY MASS FUNCTION
To illustrate the idea, suppose there are two dice. They each have probability masses
as follows.
P[{⚀}] = 1/12, P[{⚁}] = 2/12, P[{⚂}] = 3/12, P[{⚃}] = 4/12, P[{⚄}] = 1/12, P[{⚅}] = 1/12,
P[{⚀}] = 2/12, P[{⚁}] = 2/12, P[{⚂}] = 2/12, P[{⚃}] = 2/12, P[{⚄}] = 2/12, P[{⚅}] = 2/12.
Let us define two random variables, X and Y , for the two dice. Then, the PMFs pX and pY
can be defined as
pX(1) = 1/12, pX(2) = 2/12, pX(3) = 3/12, pX(4) = 4/12, pX(5) = 1/12, pX(6) = 1/12,
pY(1) = 2/12, pY(2) = 2/12, pY(3) = 2/12, pY(4) = 2/12, pY(5) = 2/12, pY(6) = 2/12.
These two probability mass functions correspond to two different probability measures, let’s
say F and G. Define the event E = {between 2 and 3}. Then, F(E) and G(E) will lead to
two different results:
F(E) = P[2 ≤ X ≤ 3] = pX(2) + pX(3) = 2/12 + 3/12 = 5/12,
G(E) = P[2 ≤ Y ≤ 3] = pY(2) + pY(3) = 2/12 + 2/12 = 4/12.
Note that even though for some particular events the two final results could be the same (e.g.,
1 ≤ X ≤ 3 and 1 ≤ Y ≤ 3, both of which have probability 6/12), the underlying measures are completely different.
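The comparison of the two measures can be checked numerically. The sketch below uses the PMF values pX and pY listed above; the helper name measure is hypothetical, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# PMFs of the two dice, as listed in the text (numerators over 12).
p_X = {k: Fraction(n, 12) for k, n in zip(range(1, 7), [1, 2, 3, 4, 1, 1])}
p_Y = {k: Fraction(2, 12) for k in range(1, 7)}

def measure(pmf, event):
    """Probability of an event (a set of states) under the given PMF."""
    return sum((pmf[k] for k in event), Fraction(0))

E = {2, 3}
F_E = measure(p_X, E)   # measure induced by p_X
G_E = measure(p_Y, E)   # measure induced by p_Y
print(F_E, G_E)         # 5/12 1/3

# An event on which the two measures happen to agree.
E2 = {1, 2, 3}
print(measure(p_X, E2), measure(p_Y, E2))  # 1/2 1/2
```

Agreement on one event does not make the measures equal; they must agree on every event.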
Figure 3.5 shows another example of two different measures F and G on the same
sample space Ω = {♣, ♢, ♡, ♠}. Since the PMFs of the two measures are different, even
when given the same event E, the resulting probabilities will be different.
Figure 3.5: If we want to measure the size of a set E, using two different PMFs is equivalent to using
two different measures. Therefore, the probabilities will be different.
Does pX = pY imply X = Y ? If two random variables X and Y have the same PMF,
does it mean that the random variables are the same? The answer is no. Consider a random
variable with a symmetric PMF, e.g.,
pX(−1) = 1/4, pX(0) = 1/2, pX(1) = 1/4. (3.2)
Suppose Y = −X. Then, pY(−1) = 1/4, pY(0) = 1/2, and pY(1) = 1/4, which is the same as pX.
However, X and Y are two different random variables. If the sample space is {♣, ♢, ♡}, we
can define the mappings X(·) and Y (·) as
X(♣) = −1, X(♢) = 0, X(♡) = +1,
Y (♣) = +1, Y (♢) = 0, Y (♡) = −1.
Therefore, when we say pX(−1) = 1/4, the underlying event is ♣. But when we say pY(−1) = 1/4,
the underlying event is ♡. The two random variables are different, although their PMFs have
exactly the same shape.
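A short Python sketch makes the distinction concrete. It assumes the outcome probabilities P[♣] = 1/4, P[♢] = 1/2, P[♡] = 1/4 implied by the PMF in (3.2); the string labels and the helper pmf are illustrative names only.

```python
from fractions import Fraction
from collections import defaultdict

# Outcome probabilities implied by p_X in (3.2) (an assumption of this sketch).
prob = {"club": Fraction(1, 4), "diamond": Fraction(1, 2), "heart": Fraction(1, 4)}

# Two different mappings on the same sample space, with Y = -X.
X = {"club": -1, "diamond": 0, "heart": +1}
Y = {"club": +1, "diamond": 0, "heart": -1}

def pmf(rv):
    """Accumulate outcome probabilities by the value the random variable assigns."""
    out = defaultdict(Fraction)
    for xi, p in prob.items():
        out[rv[xi]] += p
    return dict(out)

same_pmf = pmf(X) == pmf(Y)   # True: identical PMFs
same_mapping = X == Y         # False: different functions on the sample space
print(same_pmf, same_mapping)
```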
Proof. The proof follows directly from Axiom II, which states that P[Ω] = 1. Since x covers
all numerical values X can take, and since each x is distinct, by Axiom III we have
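$$\sum_{x \in X(\Omega)} P[X = x] = \sum_{x \in X(\Omega)} P\left[\{\xi \in \Omega \mid X(\xi) = x\}\right] = P\left[\bigcup_{x \in X(\Omega)} \{\xi \in \Omega \mid X(\xi) = x\}\right] = P[\Omega] = 1.$$
□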
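Practice Exercise 3.1. Let $p_X(k) = c\left(\frac{1}{2}\right)^k$, where k = 1, 2, . . .. Find c.
Solution. Since $\sum_{k \in X(\Omega)} p_X(k) = 1$, we must have
$$\sum_{k=1}^{\infty} c\left(\frac{1}{2}\right)^k = 1.$$
Evaluating the geometric series on the left-hand side, we can show that
$$\sum_{k=1}^{\infty} c\left(\frac{1}{2}\right)^k = \frac{c}{2}\sum_{k=0}^{\infty}\left(\frac{1}{2}\right)^k = \frac{c}{2}\cdot\frac{1}{1-\frac{1}{2}} = c \quad\Longrightarrow\quad c = 1.$$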
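The normalization can also be checked numerically. The following sketch sums the first few terms of the series with c = 1; partial_sum is a hypothetical helper, and the infinite sum is only approximated by truncation.

```python
def partial_sum(c, n_terms):
    """Sum of c * (1/2)^k for k = 1, ..., n_terms."""
    return sum(c * 0.5 ** k for k in range(1, n_terms + 1))

# With c = 1 the partial sums approach 1 as more terms are included.
for n in (5, 10, 50):
    print(n, partial_sum(1.0, n))
```

The partial sum with n terms equals 1 − (1/2)^n, so the gap to 1 halves with every additional term.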
Practice Exercise 3.2. Let $p_X(k) = c \cdot \sin\left(\frac{\pi}{2}k\right)$, where k = 1, 2, . . .. Find c.
Solution. The reader might be tempted to sum pX(k) over all the possible k’s:
$$\sum_{k=1}^{\infty} \sin\left(\frac{\pi}{2}k\right) \overset{?}{=} 1 + 0 - 1 + 0 + \cdots = 0.$$
However, a more careful inspection reveals that pX(k) is actually negative when k =
3, 7, 11, . . .. This cannot happen because a probability mass function must be non-negative.
Therefore, the problem is not well defined, and so there is no solution.
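A quick numerical check of the sign pattern is sketched below. The tolerance tol is an assumption of this sketch: floating-point evaluation of sin at multiples of π returns tiny nonzero values, so values within the tolerance of zero are treated as zero.

```python
import math

tol = 1e-9  # treat |value| below this as zero (floating-point round-off)

ks = range(1, 13)
sin_values = [math.sin(math.pi / 2 * k) for k in ks]

# Indices k at which sin(pi/2 * k) is genuinely negative.
negative_ks = [k for k, v in zip(ks, sin_values) if v < -tol]
print(negative_ks)  # [3, 7, 11]

def is_nonnegative(values):
    """Necessary condition for a PMF: no value may be negative."""
    return all(v >= -tol for v in values)

print(is_nonnegative(sin_values))                        # False
print(is_nonnegative([0.5 ** k for k in range(1, 13)]))  # True
```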
Figure 3.6: (a) The PMF of $p_X(k) = c\left(\frac{1}{2}\right)^k$, for k = 1, 2, . . .. (b) The PMF of $p_X(k) = \sin\left(\frac{\pi}{2}k\right)$,
where k = 1, 2, . . .. Note that this is not a valid PMF because probability cannot have negative values.
Example. There are 26 English letters, but the frequencies of the letters in writing are
different. If we define a random variable X as a letter we randomly draw from an English
text, we can think of X as an object with 26 different states. The mapping associated with the
random variable is straightforward: X(“a”) = 1, X(“b”) = 2, etc. The probability of landing
on a particular state approximately follows a histogram shown in Figure 3.7. The histogram
provides meaningful values of the probabilities, e.g., pX (1) = 0.0847, pX (2) = 0.0149, etc.
The true probability of the states may not be exactly these values. However, when we have
enough samples, we generally expect the histogram to approach the theoretical PMF. The
MATLAB and Python codes used to generate this histogram are shown below.
Figure 3.7: The frequency of the 26 English letters. Data source: Wikipedia.
% MATLAB code to generate the histogram
load('ch3_data_English');
bar(f/100,'FaceColor',[0.9,0.6,0.0]);
xticklabels({'a','b','c','d','e','f','g','h','i','j','k','l',...
'm','n','o','p','q','r','s','t','u','v','w','x','y','z'});
xticks(1:26);
yticks(0:0.02:0.2);
axis([1 26 0 0.13]);
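As a complementary illustration, an empirical letter histogram can be computed directly from a piece of text in Python. This is a sketch only: it does not use the book's data file, the sample sentence is an arbitrary stand-in, and letter_frequencies is a hypothetical helper name.

```python
from collections import Counter
import string

def letter_frequencies(text):
    """Empirical frequency of each of the 26 letters in a non-empty text."""
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    return {ch: counts[ch] / total for ch in string.ascii_lowercase}

# Arbitrary short sample; a long English text would give frequencies
# closer to the values shown in the figure.
sample = "the quick brown fox jumps over the lazy dog"
freq = letter_frequencies(sample)
print(freq["e"], freq["o"])
```

With so few letters the histogram is far from the tabulated frequencies, which is exactly the gap between an empirical histogram and the underlying PMF.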
Figure 3.8: Histogram and PMF, when throwing a fair die N times: (a) N = 100, (b) N = 1000, (c) N = 10000, (d) PMF. As N increases, the histograms are
becoming more similar to the PMF.
histogram becomes more like the PMF. You can imagine that when N goes to infinity, the
histogram will eventually become the PMF. Therefore, when given a dataset, one way to
think of it is to treat the data as random realizations drawn from a certain PMF. The more
data points you have, the closer the histogram will become to the PMF.
The MATLAB and Python codes used to generate Figure 3.8 are shown below. The
two commands we use here are randi (in MATLAB), which generates random integers,
and hist, which computes the heights and bin centers of a histogram. In Python,
the corresponding commands are np.random.randint and plt.hist. Note that
np.random.randint excludes its upper bound, so we set the upper bound to 7 instead of 6
(with a lower bound of 1) to obtain the values 1 to 6. Also, we shift the x-axes so that the bars are centered
at the integers.
% MATLAB code to generate the histogram
x = [1 2 3 4 5 6];
q = randi(6,100,1);
figure;
[num,val] = hist(q,x-0.5);
bar(num/100,'FaceColor',[0.8, 0.8,0.8]);
axis([0 7 0 0.24]);
plt.hist(q+0.5,bins=6)
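The convergence of the histogram to the PMF can also be sketched without plotting. The example below uses only the Python standard library (random.randint, whose bounds are both inclusive, unlike np.random.randint); the function name empirical_pmf and the fixed seeds are choices made for this sketch.

```python
import random
from collections import Counter

def empirical_pmf(n_rolls, seed=0):
    """Normalized histogram of n_rolls throws of a fair six-sided die."""
    rng = random.Random(seed)
    rolls = [rng.randint(1, 6) for _ in range(n_rolls)]  # bounds are inclusive
    counts = Counter(rolls)
    return {face: counts[face] / n_rolls for face in range(1, 7)}

# Largest deviation of the histogram from the true PMF value 1/6.
for n in (100, 1000, 10000):
    est = empirical_pmf(n)
    max_err = max(abs(est[face] - 1 / 6) for face in est)
    print(n, round(max_err, 4))
```

The deviation is random, so it need not shrink at every step, but it tends to decrease as N grows.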
This generative perspective is illustrated in Figure 3.9. We assume that the underlying
latent random variable has some PMF that can be described by a few parameters, e.g., the
mean and variance. Given the data points, if we can infer these parameters, we might retrieve
the entire PMF (up to the uncertainty level intrinsic to the dataset). We refer to this inverse
process as statistical inference.
Figure 3.9: When analyzing a dataset, one can treat the data points as samples drawn according to a
latent random variable with a certain PMF. The dataset we observe is often finite, and so the histogram
we obtain is empirical. A major task in data analysis is statistical inference, which tries to retrieve the
model information from the available measurements.
Returning to the question of why we need to understand the PMFs, the second part
of the answer is the difference between synthesis and analysis. In synthesis, we start with
a known random variable and generate samples according to the PMF underlying the ran-
dom variable. For example, on a computer, we often start with a Gaussian random variable
and generate random numbers according to the histogram specified by the Gaussian ran-
dom variable. Synthesis is useful because we can predict what will happen. We can, for
example, create millions of training samples to train a deep neural network. We can also
evaluate algorithms used to estimate statistical quantities such as mean, variance, moments,
etc., because the synthesis approach provides us with ground truth. In supervised learning
scenarios, synthesis is vital to ensuring sufficient training data.
The opposite direction of synthesis is analysis. The goal is to start with a dataset and
deduce the statistical properties of the dataset. For example, suppose we want to know
whether the underlying model is indeed a Gaussian model. If we know that it is a Gaussian
(or if we choose to use a Gaussian), we want to know the parameters that define this
Gaussian. The analysis direction addresses this model selection and parameter estimation
problem. Moving forward, once we know the model and the parameters, we can make a
prediction or do recovery, both of which are ubiquitous in machine learning.
We summarize our discussions below, which is Key Concept 2 of this chapter.
The following discussions about histogram estimation can be skipped if it is your first
time reading the book.
If you have a dataset, how would you plot the histogram? Certainly, if you have access
to MATLAB or Python, you can call standard functions such as hist (in MATLAB) or
np.histogram (in Python). However, when plotting a histogram, you need to specify the
number of bins (or equivalently the width of bins). If you use larger bins, then you will have
fewer bins with many elements in each bin. Conversely, if the bin width is too small, you
may not have enough samples to fill the histogram. Figure 3.10 illustrates two histograms
in which the bins are respectively too large and too small.
Figure 3.10: The width of the histogram has substantial influence on the information that can be
extracted from the histogram. (a) 5 bins (K = 5). (b) 200 bins (K = 200).
The MATLAB and Python codes used to generate Figure 3.10 are shown below. Note
that here we are using an exponential random variable (to be discussed in Chapter 4). In
MATLAB, calling an exponential random variable is done using exprnd, whereas in Python
the command is np.random.exponential. For this experiment, we can specify the number
of bins, which can be set to 200 or 5; in the code below, k denotes the number of samples.
To suppress the Python output of the array, we can add a semicolon ;. A final note is that
lambda is a reserved keyword in Python. Use a different name, such as lambd.
% MATLAB code used to generate the plots
lambda = 1;
k = 1000;
X = exprnd(1/lambda,[k,1]);
[num,val] = hist(X,200);
bar(val,num,'FaceColor',[1, 0.5,0.5]);
import numpy as np
import matplotlib.pyplot as plt
lambd = 1
k = 1000
X = np.random.exponential(1/lambd, size=k)
plt.hist(X,bins=200);
In statistics, there are various rules to determine the bin width of a histogram. We
mention a few of them here. Let K be the number of bins and N the number of samples.
Square-root: $K = \sqrt{N}$.
Sturges’ formula: $K = \log_2 N + 1$.
Rice Rule: $K = 2\sqrt[3]{N}$.
Scott’s normal reference rule: $K = \frac{\max X - \min X}{h}$, where $h = \frac{3.5\sqrt{\mathrm{Var}[X]}}{\sqrt[3]{N}}$ is the bin
width.
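The four rules above can be sketched in Python as follows. Several choices here are assumptions of this sketch rather than part of the rules: the results are rounded up to an integer, the population variance (statistics.pvariance) is used for Var[X], and the data are exponential samples with rate 1 drawn from a fixed seed.

```python
import math
import random
import statistics

def bin_counts(data):
    """Number of bins K suggested by each rule (rounded up; an assumption)."""
    n = len(data)
    # Scott's bin width h = 3.5 * sqrt(Var[X]) / N^(1/3)
    h = 3.5 * math.sqrt(statistics.pvariance(data)) / n ** (1 / 3)
    return {
        "square-root": math.ceil(math.sqrt(n)),
        "sturges": math.ceil(math.log2(n) + 1),
        "rice": math.ceil(2 * n ** (1 / 3)),
        "scott": math.ceil((max(data) - min(data)) / h),
    }

rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(1000)]  # exponential, rate 1
print(bin_counts(data))
```

For N = 1000 the first three rules depend only on N and give 32, 11, and 20 bins; Scott's rule depends on the spread of the particular dataset.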
For the example data shown in Figure 3.10, the histograms obtained using the above rules
are given in Figure 3.11. As you can see, different rules have different suggested bin widths.
Some are more conservative, e.g., using fewer bins, whereas some are less conservative. In
any case, the suggested bin widths do seem to provide better histograms than the original
ones in Figure 3.10. However, no bin width is the best for all purposes.
Figure 3.11: Histograms of the dataset in Figure 3.10 using the four rules. Panels: Square-root, K = 32; Sturges Rule, K = 11; Rice Rule, K = 20; Scott Rule, K = 22.
Beyond these predefined rules, there are also algorithmic tools to determine the bin
width. One such tool is known as cross-validation. Cross-validation means defining some
kind of cross-validation score that measures the statistical risk associated with the his-
togram. A histogram having a lower score has a lower risk, and thus it is a better histogram.
Note that the word “better” is relative to the optimality criteria associated with the cross-
validation score. If you do not agree with our cross-validation score, our optimal bin width is
not necessarily the one you want. In this case, you need to specify your optimality criteria.
Theoretically, deriving a meaningful cross-validation score is beyond the scope of this
book. However, it is still possible to understand the principle. Let h be the bin width of the
histogram, K the number of bins, and N the number of samples. Given a dataset, we follow
this procedure:
Step 1: Choose a bin width h.
Step 2: Construct a histogram from the data, using the bin width h. The histogram will
have the empirical PMF values $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_K$, which are the heights of the histogram
normalized so that the sum is 1.
Step 3: Compute the cross-validation score (see Wasserman, All of Statistics, Section
20.2):
$$J(h) = \frac{2}{(N-1)h} - \frac{N+1}{(N-1)h}\left(\hat{p}_1^2 + \hat{p}_2^2 + \cdots + \hat{p}_K^2\right) \tag{3.4}$$
Note that when we use a different h, the PMF values $\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_K$ will change, and the
number of bins K will also change. Therefore, when changing h, we are changing not only
the terms in J(h) that explicitly contain h but also terms that are implicitly influenced.
Figure 3.12: Cross-validation score for the histogram, plotted against the number of bins. (a) The score of one particular dataset. (b) The
scores for many different datasets generated by the same model.
For the dataset we showed in Figure 3.10, the cross-validation score J(h) is shown in
Figure 3.12. We can see that although the curve is noisy, there is indeed a reasonably clear
minimum happening around 20 ≤ K ≤ 30, which is consistent with some of the rules.
The MATLAB and Python codes we used to generate Figure 3.12 are shown below.
The key step is to implement Equation (3.4) inside a for-loop, where the loop goes through
the range of bins we are interested in. To obtain the PMF values $\hat{p}_1, \ldots, \hat{p}_K$, we call hist
in MATLAB and np.histogram in Python. The bin width h is the number of samples n
divided by the number of bins m.
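The loop described above can be sketched in pure Python as follows. This is not the book's listing: the histogram is computed by a small hand-written helper (histogram_counts, a hypothetical stand-in for np.histogram), the data are exponential samples from a fixed seed, and the bin width follows the description in the text, h = n/m. A range-based width would be a different convention.

```python
import random

def histogram_counts(data, m):
    """Counts in m equal-width bins spanning [min(data), max(data)]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / m
    counts = [0] * m
    for x in data:
        idx = min(int((x - lo) / width), m - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts

def cv_score(data, m):
    """Equation (3.4) with m bins and h = n / m, as described in the text."""
    n = len(data)
    h = n / m
    p_hat = [c / n for c in histogram_counts(data, m)]  # empirical PMF values
    return 2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum(p * p for p in p_hat)

rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(1000)]

bins = range(5, 201)                       # range of bin counts to scan
scores = [cv_score(data, m) for m in bins]
best_m = bins[scores.index(min(scores))]   # number of bins with the lowest score
print(best_m)
```

Because the curve for a single dataset is noisy, the minimizing number of bins can move from one realization to the next.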
In Figure 3.12(b), we show another set of curves from the same experiment. The
difference here is that we assume access to the true generative model so that we can generate
many datasets of the same distribution. In this experiment we generated T = 1000
datasets. We compute the cross-validation score J(h) for each of the datasets, yielding T
score functions $J^{(1)}(h), \ldots, J^{(T)}(h)$. We subtract the minimum because different realizations
have different offsets. Then we compute the average:
$$\bar{J}(h) = \frac{1}{T}\sum_{t=1}^{T}\left\{ J^{(t)}(h) - \min_{h} J^{(t)}(h) \right\}. \tag{3.5}$$
This gives us a smooth red curve as shown in Figure 3.12(b). The minimum appears to be
at K = 25. This is the optimal K, with respect to the cross-validation score, on the average of
all datasets.
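The averaging in Equation (3.5) can be sketched as below. To keep the run short, this sketch uses T = 20 datasets and a coarse grid of bin counts instead of the T = 1000 used in the text; the exponential model with rate 1, the seed, and the helper names are likewise assumptions, and h = n/m follows the description given earlier.

```python
import random

def cv_score(data, m):
    """Cross-validation score of Equation (3.4) with m bins and h = n / m."""
    n = len(data)
    lo, hi = min(data), max(data)
    width = (hi - lo) / m
    counts = [0] * m
    for x in data:
        counts[min(int((x - lo) / width), m - 1)] += 1
    h = n / m
    return 2 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * sum((c / n) ** 2 for c in counts)

def averaged_score(bins, n_samples=1000, T=20, seed=0):
    """Average over T datasets of J^(t) minus its own minimum, as in (3.5)."""
    rng = random.Random(seed)
    avg = [0.0] * len(bins)
    for _ in range(T):
        data = [rng.expovariate(1.0) for _ in range(n_samples)]
        scores = [cv_score(data, m) for m in bins]
        offset = min(scores)  # remove the realization-specific offset
        for i, s in enumerate(scores):
            avg[i] += (s - offset) / T
    return avg

bins = list(range(5, 201, 5))
avg = averaged_score(bins)
best_K = bins[avg.index(min(avg))]
print(best_K)
```

Every term in the average is non-negative by construction, so the averaged curve is non-negative as well.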
All rules, including cross-validation, are based on optimizing for a certain objective.
Your objective could be different from our objective, and so our optimum is not necessarily
your optimum. Therefore, cross-validation may not be the best. It depends on your problem.
3.3. CUMULATIVE DISTRIBUTION FUNCTIONS (DISCRETE)
While the probability mass function (PMF) provides a complete characterization of a dis-
crete random variable, the PMFs themselves are technically not “functions” because the
impulses in the histogram are essentially delta functions. More formally, a PMF pX (k)
should actually be written as
$$p_X(x) = \sum_{k \in X(\Omega)} \underbrace{p_X(k)}_{\text{PMF values}} \cdot \underbrace{\delta(x - k)}_{\text{delta function}}.$$
This is a train of delta functions, where the height is specified by the probability mass pX(k).
For example, a random variable with PMF values
pX(0) = 1/4, pX(1) = 1/2, pX(2) = 1/4
will be expressed as
pX(x) = (1/4) δ(x) + (1/2) δ(x − 1) + (1/4) δ(x − 2).
Since delta functions need to be integrated to generate values, the typical things we want to
do, e.g., integration and differentiation, are not as straightforward in the sense of Riemann-
Stieltjes.
The way to handle the unfriendliness of the delta functions is to consider mild modifications
of the PMF. This notion of “cumulative” distribution functions will allow us to
resolve the delta function problems. We will defer the technical details to the next chapter.
For the time being, we will briefly introduce the idea to prepare you for the technical
discussion later.
Definition 3.3. Let X be a discrete random variable with Ω = {x1 , x2 , . . .}. The
cumulative distribution function (CDF) of X is
$$F_X(x_k) \overset{\text{def}}{=} P[X \le x_k] = \sum_{\ell=1}^{k} p_X(x_\ell). \tag{3.6}$$
A CDF is essentially the cumulative sum of a PMF from −∞ to x, where the variable x′ in
the sum is a dummy variable.
FX(0) = P[X ≤ 0] = pX(0) = 1/4,
FX(1) = P[X ≤ 1] = pX(0) + pX(1) = 3/4,
FX(4) = P[X ≤ 4] = pX(0) + pX(1) + pX(4) = 1.
As shown in Figure 3.13, the CDF of a discrete random variable is a staircase function.
Figure 3.13: (a) PMF pX(k). (b) CDF FX(k).
The MATLAB code and the Python code used to generate Figure 3.13 are shown
below. The CDF is computed using the command cumsum in MATLAB and np.cumsum in
Python.
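% MATLAB code to generate a PMF and a CDF
p = [0.25 0.5 0.25];   % PMF values
x = [0 1 4];           % states
F = cumsum(p);         % CDF values
figure(1);
stem(x,p,'.','LineWidth',4,'MarkerSize',50);
figure(2);
stairs([-4 x 10],[0 F 1],'.-','LineWidth',4,'MarkerSize',50);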
# Python code (assumes numpy as np, matplotlib.pyplot as plt, arrays x and p, and F = np.cumsum(p))
plt.stem(x,p); plt.show()
plt.step(x,F,where='post'); plt.show()
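For readers who want the numbers without the plots, the cumulative sum can be sketched with the Python standard library; itertools.accumulate plays the role of np.cumsum here. The states and PMF values are those of the example above, and the function name cdf is a hypothetical helper.

```python
from bisect import bisect_right
from itertools import accumulate

x = [0, 1, 4]            # states of X
p = [0.25, 0.5, 0.25]    # PMF values
F = list(accumulate(p))  # cumulative sums: [0.25, 0.75, 1.0]

def cdf(t):
    """F_X(t) = P[X <= t], a right-continuous staircase."""
    k = bisect_right(x, t)  # number of states that are <= t
    return F[k - 1] if k > 0 else 0.0

print(F)
print(cdf(-1), cdf(0), cdf(0.5), cdf(1), cdf(4))
```

Evaluating cdf exactly at a state returns the value after the jump, which reflects the "≤" in the definition.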
Why is CDF a better-defined function than PMF? There are technical reasons associ-
ated with whether a function is integrable. Without going into the details of these discus-
sions, a short answer is that delta functions are defined through integrations; they are not
functions. A delta function is defined as a function such that δ(x) = 0 everywhere except at
x = 0, and $\int_{\Omega} \delta(x)\,dx = 1$. On the other hand, a staircase function is always well-defined.
The discontinuous points of a staircase can be well defined if we specify the gap between
two consecutive steps. For example, in Figure 3.13, as soon as we specify the gap 1/4, 1/2,
and 1/4, the staircase function is completely defined.
Example. Figure 3.14 shows the empirical histogram of the English letters and the corresponding
empirical CDF. We want to distinguish the PMF from the histogram and the CDF from the
empirical CDF. The empirical CDF is the CDF computed from a finite dataset.
Theorem 3.2. If X is a discrete random variable, then the CDF of X has the following
properties:
Since the probability mass function is non-negative, the value of FX is larger when the value
of the argument is larger. That is, x ≤ y implies FX (x) ≤ FX (y). The second statement (ii)
is true because the summation includes all possible states. So we have
$$F_X(+\infty) = \sum_{x' = -\infty}^{\infty} p_X(x') = 1.$$
The summation is taken over an empty set, and so FX (−∞) = 0. Statement (iv) is true
because the cumulative sum changes only when there is a non-zero mass in the PMF. □
As we can see in the proof, the basic argument of the CDF is the cumulative sum of
the PMF. By definition, a cumulative sum always adds mass. This is why the CDF is always
non-decreasing, has 0 at −∞, and has 1 at +∞. This last statement deserves more attention. It
implies that the unit step always has a solid dot on the left-hand side and an empty dot
on the right-hand side, because when the CDF jumps, the final value is specified by the
“≤” sign in Equation (3.6). The technical term for this property is right continuous.
Theorem 3.3. If X is a discrete random variable, then the PMF of X can be obtained
from the CDF by
pX (xk ) = FX (xk ) − FX (xk−1 ), (3.8)
where we assumed that X has a countable set of states {x1 , x2 , . . .}. If the sample space
of the random variable X contains integers from −∞ to +∞, then the PMF can be
defined as
pX (k) = FX (k) − FX (k − 1). (3.9)
Example 3.7. Continuing with the example in Figure 3.13, if we are given the CDF
\[ F_X(0) = \frac{1}{4}, \quad F_X(1) = \frac{3}{4}, \quad F_X(4) = 1, \]
how do we find the PMF? We know that the PMF will have non-zero values only
at x = 0, 1, 4. For each of these x, we can show that
\begin{align*}
p_X(0) &= F_X(0) - F_X(-\infty) = \frac{1}{4} - 0 = \frac{1}{4}, \\
p_X(1) &= F_X(1) - F_X(0) = \frac{3}{4} - \frac{1}{4} = \frac{1}{2}, \\
p_X(4) &= F_X(4) - F_X(1) = 1 - \frac{3}{4} = \frac{1}{4}.
\end{align*}
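Equation (3.8) is a first difference of the CDF, so it can be sketched in a few lines of Python. The snippet below is a minimal illustration using the numbers of Example 3.7; the variable names are our own, and np.diff with the prepend argument is one convenient way (not the only one) to form FX (xk ) − FX (xk−1 ).

```python
import numpy as np

# States and CDF values from Example 3.7.
x = np.array([0, 1, 4])
F = np.array([1/4, 3/4, 1.0])

# p_X(x_k) = F_X(x_k) - F_X(x_{k-1}), with F_X = 0 to the left of the first state.
p = np.diff(F, prepend=0.0)

# Going the other way, the cumulative sum of the PMF recovers the CDF.
F_back = np.cumsum(p)
print(p)
```

The cumulative sum in the last step is the discrete counterpart of the definition of the CDF, so the two operations undo each other.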
3.4 Expectation
When analyzing data, it is often useful to extract certain key parameters such as the mean
and the standard deviation. The mean and the standard deviation can be seen from the lens
of random variables. In this section, we will formalize the idea using expectation.
Expectation is the mean of the random variable X. Intuitively, we can think of pX (x) as the
percentage of times that the random variable X attains the value x. When this percentage
is multiplied by x, we obtain the contribution of each x. Summing over all possible values
of x then yields the mean. To see this more clearly, we can write the definition as
\[ E[X] = \underbrace{\sum_{x \in X(\Omega)}}_{\text{sum over all states}} \; \underbrace{x}_{\text{a state } X \text{ takes}} \; \underbrace{p_X(x)}_{\text{the percentage}}. \]
Figure 3.15 illustrates a PMF that contains five states x1 , . . . , x5 . Corresponding to each
state are pX (x1 ), . . . , pX (x5 ). For this PMF to make sense, we must assume that pX (x1 ) +
· · · + pX (x5 ) = 1. To simplify notation, let us define \( p_i \overset{\text{def}}{=} p_X(x_i) \). Then the expectation
of X is just the sum of the products: value (xi ) times height (pi ). This gives
\( E[X] = \sum_{i=1}^{5} x_i \, p_X(x_i) \).
We emphasize that the definition of the expectation is exactly the same as the usual
way we calculate the average of a dataset. When we calculate the average of a dataset
\( D = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\} \), we sum up these N samples and divide by the number of samples.
This is what we called the empirical average or the sample average:
\[ \text{average} = \frac{1}{N} \sum_{n=1}^{N} x^{(n)}. \tag{3.11} \]
Of course, in a typical dataset, these N samples often take distinct values. But suppose
that among these N samples there are only K different values. For example, if we throw a
die a million times, every sample we record will be one of the six numbers. This situation
is illustrated in Figure 3.16, where we put the samples into the correct bin storing these
values. In this case, to calculate the average we are effectively doing a binning:
\[ \text{average} = \frac{1}{N} \sum_{k=1}^{K} \text{value } x_k \times \text{number of samples with value } x_k. \tag{3.12} \]
Equation (3.12) is exactly the same as Equation (3.11), as long as the samples can be grouped
into K different values. With a little calculation, we can rewrite Equation (3.12) as
\[ \text{average} = \underbrace{\sum_{k=1}^{K}}_{\text{sum of all states}} \; \underbrace{\text{value } x_k}_{\text{a state } X \text{ takes}} \times \underbrace{\frac{\text{number of samples with value } x_k}{N}}_{\text{the percentage}}, \]
Figure 3.16: If we have a dataset D containing N samples, and if there are only K distinct values, we
can effectively put these N samples into K bins. Thus, the “average” (which is the sum divided by the
number N ) is exactly the same as our definition of expectation.
The difference between E[X] and the average is that E[X] is computed from the ideal
histogram, whereas average is computed from the empirical histogram. When the number of
samples N approaches infinity, we expect the average to approximate E[X]. However, when
N is small, the empirical average will have random fluctuations around E[X]. Every time
we experiment, the empirical average may be slightly different. Therefore, we can regard
E[X] as the true average of a certain random variable, and the empirical average as a finite-
sample average based on the particular experiment we are working with. This summarizes
Key Concept 3 of this chapter.
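The relationship between Equations (3.11), (3.12), and E[X] can be checked numerically. The following is a minimal sketch, assuming a fair six-sided die; the seed, the sample size, and the variable names are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, chosen only for repeatability
N = 100000
samples = rng.integers(1, 7, size=N)  # N rolls of a fair six-sided die

# Equation (3.11): sum the samples and divide by N.
avg_direct = samples.sum() / N

# Equation (3.12): group the samples into K = 6 bins, then weight each value
# by the fraction of samples that landed in its bin.
values, counts = np.unique(samples, return_counts=True)
avg_binned = np.sum(values * counts / N)

# Expectation computed from the ideal PMF of a fair die.
EX = np.sum(np.arange(1, 7) * (1 / 6))

print(avg_direct, avg_binned, EX)
```

The two averages agree up to floating-point rounding because they are the same sum in a different order, whereas both differ slightly from E[X] = 3.5 because N is finite.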
If we are given a dataset on a computer, computing the mean can be done by calling
the command mean in MATLAB and np.mean in Python. The example below shows the
case of finding the mean of 10000 uniformly distributed random numbers.
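A minimal Python sketch of this computation is given below. It assumes NumPy's default_rng generator with an arbitrary seed; np.random.rand would serve equally well.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed is arbitrary
X = rng.random(10000)            # 10000 samples, uniform on [0, 1)
mX = np.mean(X)                  # empirical average; the true mean is 0.5
print(mX)
```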
Example 3.8. Let X be a random variable with PMF pX (0) = 1/4, pX (1) = 1/2 and
pX (2) = 1/4. We can show that the expectation is
\[ E[X] = (0)\underbrace{\frac{1}{4}}_{p_X(0)} + (1)\underbrace{\frac{1}{2}}_{p_X(1)} + (2)\underbrace{\frac{1}{4}}_{p_X(2)} = 1. \]
In MATLAB and Python, if we know the PMF then computing the expectation is
straightforward. Here is the code to compute the above example.
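A minimal Python sketch for Example 3.8, with the PMF and the states stored as arrays (the variable names are our own):

```python
import numpy as np

p = np.array([0.25, 0.5, 0.25])   # PMF values p_X(0), p_X(1), p_X(2)
x = np.array([0, 1, 2])           # the corresponding states
EX = np.sum(p * x)                # E[X] = sum over states of x * p_X(x)
print(EX)                         # 1.0
```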
Example 3.9. Flip an unfair coin, where the probability of getting a head is 3/4. Let
X be a random variable such that X = 1 means getting a head. Then we can show
that pX (1) = 3/4 and pX (0) = 1/4. The expectation of X is therefore
\[ E[X] = (1)p_X(1) + (0)p_X(0) = (1)\frac{3}{4} + (0)\frac{1}{4} = \frac{3}{4}. \]
Center of mass. How would you interpret the result of this example? Does it mean
that, on average, we will get 3/4 heads? But there is no such thing as 3/4 of a head! Recall
the definition of a random variable: it is a translator that translates a descriptive state
to a number on the real line. Thus the expectation, which is an operation defined on the
real line, can only tell us what is happening on the real line, not in the original sample
space. On the real line, the expectation can be regarded as the center of mass, which is the
point where the “forces” between the two states are “balanced”. In Figure 3.17 we depict a
random variable with two states x1 and x2 . The state x1 has less influence (because pX (x1 )
is smaller) than x2 . Therefore the center of mass is shifted towards x2 . This result shows us
that the value E[X] is not necessarily in the sample space. E[X] is a deterministic number
with nothing to do with the sample space.
Figure 3.17: Center of mass. If a state x2 is more influential than another state x1 , the center of mass
E[X] will lean towards x2 .
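Example 3.10. Let X be a random variable with PMF \( p_X(k) = \frac{1}{2^k} \), for k = 1, 2, 3, . . ..
The expectation is
\begin{align*}
E[X] &= \sum_{k=1}^{\infty} k \, p_X(k) = \sum_{k=1}^{\infty} k \cdot \frac{1}{2^k} \\
&= \frac{1}{2} \sum_{k=1}^{\infty} k \cdot \frac{1}{2^{k-1}} = \frac{1}{2} \cdot \frac{1}{(1 - \frac{1}{2})^2} = 2.
\end{align*}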
In MATLAB and Python, if you want to verify this answer you can use the following
code. Here, we approximate the infinite sum by a finite sum of k = 1, . . . , 100.
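A minimal Python sketch of this check, truncating the sum at k = 100 as described (the variable names are our own):

```python
import numpy as np

k = np.arange(1, 101)     # k = 1, ..., 100
p = 0.5 ** k              # p_X(k) = 1 / 2^k
EX = np.sum(p * k)        # finite-sum approximation of E[X]
print(EX)
```

The terms beyond k = 100 are of order 100/2^100, so the truncated sum agrees with 2 to within floating-point precision.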
Example 3.11. Roll a die twice. Let X be the first roll and Y be the second roll.
Let Z = max(X, Y ). To compute the expectation E[Z], we first construct the sample
space. Since there are two rolls, we can construct a table listing all possible pairs of
outcomes. This will give us {(1, 1), (1, 2), . . . , (6, 6)}. Now, we calculate Z, which is the
max of the two rolls. So if we have (1, 3), then the max will be 3, whereas if we have
(5, 2), then the max will be 5. We can complete a table as shown below.
X \ Y |  1  2  3  4  5  6
------+------------------
  1   |  1  2  3  4  5  6
  2   |  2  2  3  4  5  6
  3   |  3  3  3  4  5  6
  4   |  4  4  4  4  5  6
  5   |  5  5  5  5  5  6
  6   |  6  6  6  6  6  6
This table tells us that Z has 6 states. The PMF of Z can be determined by
counting the number of times a state shows up in the table. Thus, we can show that
\[ p_Z(1) = \frac{1}{36}, \quad p_Z(2) = \frac{3}{36}, \quad p_Z(3) = \frac{5}{36}, \quad p_Z(4) = \frac{7}{36}, \quad p_Z(5) = \frac{9}{36}, \quad p_Z(6) = \frac{11}{36}. \]
The expectation of Z is therefore
\begin{align*}
E[Z] &= (1)\frac{1}{36} + (2)\frac{3}{36} + (3)\frac{5}{36} + (4)\frac{7}{36} + (5)\frac{9}{36} + (6)\frac{11}{36} \\
&= \frac{161}{36}.
\end{align*}
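The counting argument of Example 3.11 can be reproduced by brute-force enumeration. The sketch below uses exact fractions so that the result can be compared with 161/36 directly; the dictionary pmf and the other names are our own.

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes (x, y) of two die rolls.
pmf = {}
for x, y in product(range(1, 7), repeat=2):
    z = max(x, y)
    pmf[z] = pmf.get(z, Fraction(0)) + Fraction(1, 36)

# E[Z] = sum over states of z * p_Z(z).
EZ = sum(z * p for z, p in pmf.items())
print(pmf[6], EZ)   # 11/36 161/36
```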
Example 3.12. Consider a game in which we flip a coin 3 times. The reward of the
game is
• $1 if there are 2 heads
• $8 if there are 3 heads
• $0 if there are 0 or 1 head
There is a cost associated with the game. To enter the game, the player has to pay
$1.50. We want to compute the net gain, on average.
To answer this question, we first note that the sample space contains 8 elements:
HHH, HHT, HTH, THH, THT, TTH, HTT, TTT. Let X be the number of heads.
Then the PMF of X is
\[ p_X(0) = \frac{1}{8}, \quad p_X(1) = \frac{3}{8}, \quad p_X(2) = \frac{3}{8}, \quad p_X(3) = \frac{1}{8}. \]
We then let Y be the reward. The PMF of Y can be found by “adding” the probabilities
of X. This yields
\[ p_Y(0) = p_X(0) + p_X(1) = \frac{4}{8}, \quad p_Y(1) = p_X(2) = \frac{3}{8}, \quad p_Y(8) = p_X(3) = \frac{1}{8}. \]
The expectation of Y is
\[ E[Y] = (0)\frac{4}{8} + (1)\frac{3}{8} + (8)\frac{1}{8} = \frac{11}{8}. \]
Since the cost of the game is 12/8, the net gain (on average) is −1/8.
where the limit is due to the harmonic series (footnote a): \( 1 + \frac{1}{2} + \frac{1}{3} + \cdots = \infty \).
(a) https://en.wikipedia.org/wiki/Harmonic_series_(mathematics)
This definition tells us that not all random variables have a finite expectation. This
is a very important mathematical result, but its practical implication is arguably limited.
Most of the random variables we use in practice are absolutely summable. Also, note that
the property of absolute summability applies to discrete random variables. For continuous
random variables, we have a parallel concept called absolute integrability, which will be
discussed in the next chapter.
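Theorem 3.4. The expectation of a random variable X has the following properties:
(i) Function. For any function g,
\[ E[g(X)] = \sum_{x \in X(\Omega)} g(x) \, p_X(x). \]
(ii) For any functions g and h, E[g(X) + h(X)] = E[g(X)] + E[h(X)].
(iii) For any constant c, E[cX] = cE[X].
(iv) For any constant c, E[X + c] = E[X] + c.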
Proof of (i): A pictorial proof of (i) is shown in Figure 3.18. The key idea is a change of
variable.
Figure 3.18: By letting g(X) = Y , the PMFs are not changed. What changes are the states.
When we have a function Y = g(X), the PMF of Y will have impulses moved from x
(the horizontal axis) to g(x) (the vertical axis). The PMF values (i.e., the probabilities or
the height of the stems), however, are not changed. If the mapping g(X) is many-to-one,
multiple PMF values will add to the same position. Therefore, when we compute E[g(X)],
we compute the expectation along the vertical axis.
Practice Exercise 3.3. Prove statement (iii): For any constant c, E[cX] = cE[X].
Statement (iii) is illustrated in Figure 3.19. Here, we assume that the original PMF has 3
Figure 3.19: Pictorial representation of E[cX] = cE[X]. When we multiply X by c, we fix the probabil-
ities but make the spacing between states wider/narrower.
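Practice Exercise 3.4. Prove statement (ii): For any functions g and h, E[g(X) +
h(X)] = E[g(X)] + E[h(X)].
Solution. Applying statement (i) to the function g + h and splitting the sum gives
\begin{align*}
E[g(X) + h(X)] &= \sum_{x \in X(\Omega)} [g(x) + h(x)] \, p_X(x) \\
&= \sum_{x \in X(\Omega)} g(x) \, p_X(x) + \sum_{x \in X(\Omega)} h(x) \, p_X(x) \\
&= E[g(X)] + E[h(X)].
\end{align*}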
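Practice Exercise 3.5. Prove statement (iv): For any constant c, E[X +c] = E[X]+c.
Solution. Applying statement (i) with g(x) = x + c and using the fact that the PMF sums to one,
\begin{align*}
E[X + c] &= \sum_{x \in X(\Omega)} (x + c) \, p_X(x) \\
&= \sum_{x \in X(\Omega)} x \, p_X(x) + c \sum_{x \in X(\Omega)} p_X(x) \\
&= E[X] + c.
\end{align*}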
This result is illustrated in Figure 3.20. As we add a constant to the random variable,
its PMF values remain the same but their positions are shifted. Therefore, when computing
the mean, the mean will be shifted accordingly.
Figure 3.20: Pictorial representation of E[X +c] = E[X]+c. When we add c to X, we fix the probabilities
and shift the entire PMF to the left or to the right.
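Example 3.14. Let X be a random variable with four equally probable states 0, 1, 2, 3.
We want to compute the expectation E[cos(πX/2)]. To do so, we note that
\begin{align*}
E[\cos(\pi X/2)] &= \sum_{x \in X(\Omega)} \cos\left(\frac{\pi x}{2}\right) p_X(x) \\
&= (\cos 0)\frac{1}{4} + \left(\cos\frac{\pi}{2}\right)\frac{1}{4} + \left(\cos\frac{2\pi}{2}\right)\frac{1}{4} + \left(\cos\frac{3\pi}{2}\right)\frac{1}{4} \\
&= \frac{1 + 0 + (-1) + 0}{4} = 0.
\end{align*}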
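Statement (i) of Theorem 3.4 translates directly into code: apply g to each state and weight by the PMF. The sketch below repeats Example 3.14 numerically; the helper g and the variable names are our own.

```python
import numpy as np

x = np.arange(4)                  # states 0, 1, 2, 3
p = np.full(4, 1 / 4)             # equally probable

def g(t):
    return np.cos(np.pi * t / 2)

Eg = np.sum(g(x) * p)             # E[g(X)] = sum of g(x) p_X(x)
print(np.isclose(Eg, 0.0))        # True, up to floating-point rounding
```

Note that the states of X are never changed; only the values g(x) attached to them enter the sum.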
Example 3.15. Let X be a random variable with E[X] = 1 and E[X^2] = 3. We want
to find the expectation E[(aX + b)^2]. To do so, we realize that
\[ E[(aX + b)^2] \overset{(a)}{=} E[a^2 X^2 + 2abX + b^2] \overset{(b)}{=} a^2 E[X^2] + 2ab\,E[X] + b^2 = 3a^2 + 2ab + b^2, \]
where (a) is due to expansion of the square, and (b) holds in two steps. The first step
is to apply statement (ii) to split the expectation of the sum into a sum of expectations,
and the second step is to apply statement (iii) to pull out the scalar multiples.
Essentially, the kth moment is the expectation applied to X^k. The definition follows from
statement (i) of the expectation’s properties. Using this definition, we note that E[X] is the
first moment and E[X^2] is the second moment. Higher-order moments can be defined, but
in practice they are less commonly used.
Example 3.16. Flip a coin 3 times. Let X be the number of heads. Then
\[ p_X(0) = \frac{1}{8}, \quad p_X(1) = \frac{3}{8}, \quad p_X(2) = \frac{3}{8}, \quad p_X(3) = \frac{1}{8}. \]
The second moment E[X^2] is
\[ E[X^2] = (0)^2 \frac{1}{8} + (1)^2 \frac{3}{8} + (2)^2 \frac{3}{8} + (3)^2 \frac{1}{8} = 3. \]
Using the second moment, we can define the variance of a random variable.
We denote σ^2 by Var[X]. The square root of the variance, σ, is called the standard deviation
of X. Like the expectation E[X], the variance Var[X] is computed using the ideal histogram
(the PMF). It is the limiting object of the usual sample variance, i.e., the square of the
standard deviation we calculate from a dataset.
On a computer, computing the variance of a dataset is done by calling built-in com-
mands such as var in MATLAB and np.var in Python. The standard deviation is computed
using std and np.std, respectively.
% MATLAB code to compute the variance
X = rand(10000,1);
vX = var(X);
sX = std(X);
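# Python code to compute the variance
import numpy as np
X = np.random.rand(10000)
vX = np.var(X)   # divides by N by default (ddof=0); MATLAB's var divides by N-1
sX = np.std(X)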
What does the variance mean? It is a measure of the deviation of the random variable
X relative to its mean. This deviation is quantified by the squared difference (X − µ)^2. The
expectation operator takes the average of the deviation, giving us a deterministic number
E[(X − µ)^2].
Theorem 3.5. The variance of a random variable X has the following properties:
(i) Moment.
\[ \mathrm{Var}[X] = E[X^2] - E[X]^2. \]
(ii) For any constant c,
\[ \mathrm{Var}[cX] = c^2 \, \mathrm{Var}[X]. \]
(iii) For any constant c,
\[ \mathrm{Var}[X + c] = \mathrm{Var}[X]. \]
The properties above are useful in various ways. The first statement provides a link connect-
ing variance and the second moment. Statement (ii) implies that when X is scaled by c, the
variance should be scaled by c2 because of the square in the second moment. Statement (iii)
says that when X is shifted by a scalar c, the variance is unchanged. This is true because
no matter how we shift the mean, the fluctuation of the random variable remains the same.
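The three statements of Theorem 3.5 can be checked numerically on a small PMF. The following sketch reuses the PMF of Example 3.16; the constant c = 5 and the helper functions mean and var are our own illustrative choices, and var implements the definition E[(X − µ)^2] directly on the PMF.

```python
import numpy as np

x = np.array([0, 1, 2, 3])          # number of heads in 3 fair flips
p = np.array([1, 3, 3, 1]) / 8      # PMF from Example 3.16

def mean(vals):
    return np.sum(vals * p)

def var(vals):
    # Definition: E[(X - mu)^2], evaluated on the PMF.
    return np.sum((vals - mean(vals)) ** 2 * p)

c = 5.0                              # arbitrary constant for the check
var_X = var(x)
print(var_X, mean(x**2) - mean(x)**2)   # (i)   both 0.75
print(var(c * x), c**2 * var_X)         # (ii)  both 18.75
print(var(x + c), var_X)                # (iii) both 0.75
```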
Practice Exercise 3.7. Flip a coin with probability p to get a head. Let X be a
random variable denoting the outcome. The PMF of X is
pX (0) = 1 − p, pX (1) = p.
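Since E[X] = (0)(1 − p) + (1)p = p and E[X^2] = (0)^2 (1 − p) + (1)^2 p = p, the variance is
\[ \mathrm{Var}[X] = E[X^2] - E[X]^2 = p - p^2 = p(1 - p). \]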
3.5 Common Discrete Random Variables
In the previous sections, we have conveyed three key concepts: one about the random variable,
one about the PMF, and one about the mean. The next step is to introduce a few
commonly used discrete random variables so that you have something concrete in your “toolbox.”
As we have mentioned before, these predefined random variables should be studied
from a synthesis perspective (sometimes called generative). The plan for this section is to
introduce several models, derive their theoretical properties, and discuss examples.
Note that some extra effort will be required to understand the origins of the random
variables. The origins of random variables are usually overlooked, but they are more important
than the equations. For example, we will shortly discuss the Poisson random variable
and its PMF \( p_X(k) = \frac{\lambda^k}{k!} e^{-\lambda} \). Why is the Poisson random variable defined in this way? If
you know how the Poisson PMF was originally derived, you will understand the assumptions
made during the derivation. Consequently, you will know why Poisson is a good model for
internet traffic, recommendation scores, and image sensors for computer vision applications.
You will also know under what situation the Poisson model will fail. Understanding the
physics behind the probability models is the focus of this section.
Figure 3.22: A Bernoulli random variable has two states with probability p and 1 − p.
pX (0) = 1 − p, pX (1) = p,
X ∼ Bernoulli(p)
In this definition, the parameter p controls the probability of obtaining 1. In a coin-flip event,
p is usually 1/2, meaning that the coin is fair. However, for biased coins p is not necessarily 1/2.
For other situations such as binary bits (0 or 1), the probability of obtaining 1 could be very
different from the probability of obtaining 0.
In MATLAB and Python, generating Bernoulli random variables can be done by call-
ing the binomial random number generator np.random.binomial (Python) and binornd
(MATLAB). When the parameter n is equal to 1, the binomial random variable is equiv-
alent to a Bernoulli random variable. The MATLAB and Python codes to synthesize a
Bernoulli random variable are shown below.
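A minimal Python sketch of this synthesis step is given here. The value p = 0.5 and the sample size 1000 are illustrative assumptions.

```python
import numpy as np

p = 0.5   # probability of obtaining 1
n = 1     # with n = 1 the binomial generator produces Bernoulli samples
X = np.random.binomial(n, p, size=1000)   # 1000 draws, each 0 or 1
print(X[:10], X.mean())
```

The sample mean printed at the end should be close to p, although it fluctuates from run to run because no seed is fixed.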
In both MATLAB and Python, we can plot the PMF of a Bernoulli random variable,
such as the one shown in Figure 3.23. To do this in MATLAB, we call the function binopdf,
with the evaluation points specified by x.
Figure 3.23: An example of a theoretical PMF (not the empirical histogram) plotted by MATLAB.
In Python, we construct a random variable rv. With rv, we can call its PMF rv.pmf:
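A minimal sketch using scipy.stats is shown below. The value p = 0.3 is an illustrative assumption and is not taken from Figure 3.23.

```python
import numpy as np
import scipy.stats as stats

p = 0.3                      # illustrative value, not taken from the figure
rv = stats.bernoulli(p)      # "frozen" Bernoulli random variable
x = np.array([0, 1])         # evaluation points
f = rv.pmf(x)                # f[0] = 1 - p, f[1] = p
print(f)
```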
Does this result make sense? Why is the variance maximized at p = 1/2? If we think
about this problem more carefully, we realize that a Bernoulli random variable represents a
coin-flip experiment. If the coin is biased such that it always gives heads, on the one hand,
it is certainly a bad coin. However, on the other hand, the variance is zero because there
is nothing to vary; you will certainly get heads. The same situation happens if the coin is
biased towards tails. However, if the coin is fair, i.e., p = 1/2, then the variance is large
because we only have a 50% chance of getting a head or a tail whenever we flip a coin.
Nothing is certain in this case. Therefore, the maximum variance happening at p = 1/2
matches our intuition.
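For example, a graph with the adjacency matrix
\[ A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \]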
will have edges for node pairs (1, 2), (1, 3), and (3, 4). Note that in this example we assume
that the adjacency matrix is symmetric, meaning that the graph is undirected. The “1” in
the adjacency matrix indicates there is an edge, and “0” indicates there is no edge. So A
represents a binary graph.
The Erdős-Rényi graph model says that the presence of each edge is determined by an
independent Bernoulli random variable. That is,
Aij ∼ Bernoulli(p),
for i < j. If we model the graph in this way, then the parameter p will control the density
of the graph. High values of p mean that there is a higher chance for an edge to be present.
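The model can be synthesized in a few lines. The sketch below is one possible way to draw an Erdős-Rényi adjacency matrix, assuming 40 nodes and p = 0.3 (both illustrative); only the entries with i < j are sampled, and the matrix is then mirrored so that the graph is undirected.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
n_nodes, p = 40, 0.3             # illustrative size and edge probability

# Draw A_ij ~ Bernoulli(p) for i < j only, then mirror to make A symmetric.
upper = np.triu(rng.binomial(1, p, size=(n_nodes, n_nodes)), k=1)
A = upper + upper.T

# Fraction of the n(n-1)/2 possible edges that are present; close to p.
density = upper.sum() / (n_nodes * (n_nodes - 1) / 2)
print(density)
```

Increasing p in this sketch makes A denser, which is the effect described in the paragraph above.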
[Figure 3.25 panels, left to right: p = 0.3, p = 0.5, p = 0.7, p = 0.9.]
Figure 3.25: The Erdős-Rényi graph. [Top] The graphs. [Bottom] The adjacency matrices.
Figure 3.26: A stochastic block model containing three communities. [Left] The graph. [Right] The
adjacency matrix.
In network analysis, one of the biggest problems is determining the community struc-
ture and recovering the underlying probabilities. The former task is about grouping the
nodes into blocks. This is a nontrivial problem because in practice the nodes are never
arranged as nicely as they are in Figure 3.26. For example, why should Alice be node 1 and
Bob be node 2? Since we never know the correct ordering of the nodes, partitioning the
nodes into blocks requires various estimation techniques such as clustering or iterative esti-
mation. Recovering the underlying probability is also not easy. Given an adjacency matrix,
why can we assume that the underlying network is a stochastic block model? Even if the
model is correct, there will be imperfect grouping in the previous step. As such, estimat-
ing the underlying probability in the presence of these uncertainties would pose additional
challenges.
Today, network analysis remains one of the hottest areas in data science. Its importance
derives from its broad scope and impact. It can be used to analyze social networks, opinion
polls, marketing, or even genome analysis. Nevertheless, the starting point of these advanced
subjects is the Bernoulli random variable, the random variable of a coin flip!
where 0 < p < 1 is the binomial parameter, and n is the total number of trials. We
write
X ∼ Binomial(n, p)
to say that X is drawn from a binomial distribution with parameter p and size n.
\begin{align*}
p_X(3) &= P[\{HHH\}] \\
&= P[\{H\} \cap \{H\} \cap \{H\}] \\
&\overset{(a)}{=} P[\{H\}] \, P[\{H\}] \, P[\{H\}] \\
&\overset{(b)}{=} p^3,
\end{align*}
where (a) holds because the three events are independent. (Recall that if A and B are
independent then P[A ∩ B] = P[A]P[B].) (b) holds because each P[{H}] = p by definition.
With exactly the same argument, we can show that pX (0) = P[{TTT}] = (1 − p)^3.
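Now, let us look at pX (2), i.e., 2 heads. This probability can be calculated as follows:
\begin{align*}
p_X(2) &= P[\{HHT\} \cup \{HTH\} \cup \{THH\}] \\
&\overset{(c)}{=} P[\{HHT\}] + P[\{HTH\}] + P[\{THH\}] \\
&\overset{(d)}{=} p^2(1-p) + p^2(1-p) + p^2(1-p) = 3p^2(1-p),
\end{align*}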
where (c) holds because the three events HHT, HTH and THH are disjoint in the sample
space. Note that we are not using the independence argument in (c) but the disjoint argu-
ment. We should not confuse the two. The step in (d) uses independence, because each coin
flip is independent.
The above calculation shows an interesting phenomenon: Although the three events
HHT, HTH, and THH are different (in fact, disjoint), the number of heads in all the cases
is the same. This happens because when counting the number of heads, the ordering of the
heads and tails does not matter. So the same problem can be formulated as finding the
number of combinations of { 2 heads and 1 tail }, which in our case is \( \binom{3}{2} = 3 \).
To complete the story, let us also try pX (1). This probability is
\[ p_X(1) = P[\{HTT\} \cup \{THT\} \cup \{TTH\}] = 3p(1-p)^2. \]
The running index k should go with 0, 1, . . . , n. It starts with 0 because there could be zero
heads in the sample space. Furthermore, we note that in this definition, two parameters are
driving a binomial random variable: the number of Bernoulli trials n and the underlying
probability for each coin flip p. As such, the notation for a binomial random variable is
Binomial(n, p), with two arguments.
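The counting argument can be verified by enumerating every sequence of flips. The sketch below does this for n = 3; the value p = 0.6 is an arbitrary illustrative choice, and the comparison uses the standard binomial formula C(n, k) p^k (1 − p)^(n−k).

```python
from itertools import product
from math import comb, isclose

n, p = 3, 0.6   # three flips; p = 0.6 is an arbitrary illustrative value

# Add up P[sequence] over all 2^n sequences, grouped by the number of heads.
# Disjoint sequences add; independent flips within a sequence multiply.
pmf = [0.0] * (n + 1)
for seq in product("HT", repeat=n):
    k = seq.count("H")
    pmf[k] += p**k * (1 - p)**(n - k)

# Compare with the counting formula C(n, k) p^k (1 - p)^(n - k).
formula = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(all(isclose(a, b) for a, b in zip(pmf, formula)))   # True
```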
The histogram of a binomial random variable is shown in Figure 3.28(a). Here, we con-
sider the example where n = 10 and p = 0.5. To generate the histogram, we use 5000 samples.
In MATLAB and Python, generating binomial random variables as in Figure 3.28(a) can
be done by calling binornd and np.random.binomial.
% MATLAB code to generate 5000 Binomial random variables
p = 0.5;
n = 10;
X = binornd(n,p,[5000,1]);
[num, ~] = hist(X, 10);
bar(num, 'FaceColor', [0.4, 0.4, 0.8]);
[Figure 3.28: (a) Histogram of 5000 samples of a binomial random variable with n = 10 and p = 0.5. (b) The corresponding PMF.]
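# Python code to generate 5000 Binomial random variables
import numpy as np
import matplotlib.pyplot as plt
p = 0.5
n = 10
X = np.random.binomial(n,p,size=5000)
plt.hist(X,bins='auto')
plt.show()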
Generating the ideal PMF of a binomial random variable as shown in Figure 3.28(b)
can be done by calling binopdf in MATLAB. In Python, we can define a random variable
rv through stats.binom, and call the PMF using rv.pmf.
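A minimal Python sketch of the PMF computation, using the same n and p as the histogram example (plotting is omitted; the values in f could be passed to a stem or bar plot):

```python
import numpy as np
import scipy.stats as stats

n, p = 10, 0.5               # same parameters as the histogram example
rv = stats.binom(n, p)       # frozen binomial random variable
x = np.arange(0, n + 1)      # evaluation points 0, 1, ..., n
f = rv.pmf(x)                # ideal PMF values
print(f.sum(), f[5])
```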
The shape of the binomial PMF is shown in Figure 3.29. In this set of figures, we vary
one of the two parameters n and p while keeping the other fixed. In Figure 3.29(a), we fix
n = 60 and plot three sets of p = 0.1, 0.5, 0.9. For small p the PMF is concentrated towards the
left, and for large p the PMF is concentrated towards the right. Figure 3.29(b) shows the PMF
for a fixed p = 0.5. As we increase n, the centroid of the PMF moves towards the right.
Thus we should expect the mean of a binomial random variable to increase with both n and p. Another
interesting observation is that as n increases, the shape of the PMF approaches the Gaussian
function (the bell-shaped curve). We will explain the reason for this when we discuss the
Central Limit Theorem.
[Figure 3.29 panels: (a) n = 60 with p = 0.1, 0.5, 0.9; (b) p = 0.5 with n = 5, 50, 100.]
Figure 3.29: PMFs of a binomial random variable X ∼ Binomial(n, p). (a) We assume that n = 60. By
varying the probability p, we see that the PMF shifts from the left to the right, and the shape changes.
(b) We assume that p = 0.5. By varying the number of trials, the PMF shifts and the shape becomes
more “bell-shaped.”
The expectation, second moment, and variance of a binomial random variable are
summarized in Theorem 3.7.
\begin{align*}
E[X] &= np, \\
E[X^2] &= np(np + (1 - p)), \\
\mathrm{Var}[X] &= np(1 - p).
\end{align*}
We will prove that E[X] = np from first principles. For E[X^2] and Var[X], we will skip
the proofs here and will introduce a “shortcut” later.
Note that we have shifted the index from k = 0 to k = 1. Now let us apply a trick:
\begin{align*}
E[X] &= \sum_{k=1}^{n} \frac{n!}{(k-1)!\,(n-k)!} \, p^k (1-p)^{n-k} \\
&= \sum_{k=1}^{n} \frac{n!}{(k-1)!\,(n-k-1+1)!} \, p^k (1-p)^{n-k}.
\end{align*}
□
In MATLAB, the mean and variance of a binomial random variable can be found by
calling the command binostat(n,p).
In Python, the command is rv = stats.binom(n,p) followed by calling rv.stats.
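A minimal Python sketch, with n = 10 and p = 0.5 as illustrative parameters. The moments='mv' argument asks scipy.stats for the mean and the variance.

```python
import scipy.stats as stats

n, p = 10, 0.5                       # illustrative parameters
rv = stats.binom(n, p)
M, V = rv.stats(moments='mv')        # mean and variance
print(M, V)                          # 5.0 2.5
```

The two numbers agree with E[X] = np and Var[X] = np(1 − p).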
The CDF of the binomial random variable is not very informative. It is basically the
cumulative sum of the PMF:
\[ F_X(k) = \sum_{\ell=0}^{k} \binom{n}{\ell} p^{\ell} (1-p)^{n-\ell}. \]
Figure 3.30: PMF and CDF of a binomial random variable X ∼ Binomial(n, p).
The shapes of the PMF and the CDF are shown in Figure 3.30.
In MATLAB, plotting the CDF of a binomial can be done by calling the function
binocdf. You may also call f = binopdf(x,n,p), and define F = cumsum(f) as the cumu-
lative sum of the PMF. In Python, the corresponding command is rv = stats.binom(n,p)
followed by rv.cdf.
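Both routes to the CDF can be compared in Python. In the sketch below, n = 30 and p = 0.5 are illustrative values (the parameters used in Figure 3.30 are not stated in the text).

```python
import numpy as np
import scipy.stats as stats

n, p = 30, 0.5                    # illustrative values
rv = stats.binom(n, p)
x = np.arange(0, n + 1)

F = rv.cdf(x)                     # CDF evaluated directly
F_cumsum = np.cumsum(rv.pmf(x))   # cumulative sum of the PMF
print(np.allclose(F, F_cumsum))   # True
```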
\[ p_X(k) = (1-p)^{k-1} p, \quad k = 1, 2, \ldots, \]
X ∼ Geometric(p)
consecutive failures before one success. There is no alternative combination of the sequence.
The histogram and PMF of a geometric random variable are illustrated in Figure 3.31.
Here, we assume that p = 0.5.
[Figure 3.31: (a) Histogram based on 5000 samples. (b) PMF.]
In MATLAB, generating geometric random variables can be done by calling the com-
mands geornd. In Python, it is np.random.geometric.
To generate the PMF plots, in MATLAB we call geopdf and in Python we call
rv = stats.geom followed by rv.pmf.
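A minimal Python sketch covering both steps, with p = 0.5 as in Figure 3.31. The seed and the range of k are our own choices; note that NumPy's geometric generator uses the same convention as the PMF above, with values starting at 1.

```python
import numpy as np
import scipy.stats as stats

p = 0.5
rng = np.random.default_rng(0)            # arbitrary seed
X = rng.geometric(p, size=5000)           # samples take values 1, 2, 3, ...

rv = stats.geom(p)
k = np.arange(1, 11)
f = rv.pmf(k)                             # (1 - p)^(k - 1) * p
print(X.mean(), f[:3])
```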
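Practice Exercise 3.10. Show that the geometric PMF sums to one.
Solution. We can apply infinite series to show the result:
\begin{align*}
\sum_{k=1}^{\infty} p_X(k) &= \sum_{k=1}^{\infty} (1-p)^{k-1} p \\
&= p \cdot \sum_{k=1}^{\infty} (1-p)^{k-1}, \qquad \text{let } \ell = k - 1 \\
&= p \cdot \sum_{\ell=0}^{\infty} (1-p)^{\ell} \\
&= p \cdot \frac{1}{1 - (1-p)} = 1.
\end{align*}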
It is interesting to compare the shape of the PMFs for various values of p. In Figure 3.32
we show the PMFs. We vary the parameter p = 0.25, 0.5, 0.9. For small p, the PMF starts
with a low value and decays at a slow speed. The opposite happens for a large p, where the
PMF starts with a high value and decays rapidly.
Furthermore, we can derive the following properties of the geometric random variable.
Proof. We will prove that the mean is 1/p and leave the second moment and variance as
an exercise.
\[ E[X] = \sum_{k=1}^{\infty} k \, p (1-p)^{k-1} = p \left( \sum_{k=1}^{\infty} k (1-p)^{k-1} \right) \overset{(a)}{=} p \cdot \frac{1}{(1 - (1-p))^2} = \frac{1}{p}, \]
[Figure 3.32: PMFs of a geometric random variable for p = 0.25, p = 0.5, and p = 0.9.]
\[ p_X(k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, 2, \ldots, \]
where λ > 0 is the Poisson rate. We write
X ∼ Poisson(λ)
In this definition, the parameter λ determines the rate of the arrival. The histogram and
PMF of a Poisson random variable are illustrated in Figure 3.33. Here, we assume that
λ = 1.
The MATLAB code and Python code used to generate the histogram are shown below.
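A minimal Python sketch of the sampling step follows, with λ = 1 and 5000 samples as in Figure 3.33. The seed is arbitrary, and plotting is omitted: the array counts holds the histogram heights that a bar plot would display.

```python
import numpy as np

lam = 1                                   # Poisson rate, as in Figure 3.33
rng = np.random.default_rng(0)            # arbitrary seed
X = rng.poisson(lam, size=5000)           # 5000 Poisson samples

# Counts for k = 0, 1, ..., max(X).
counts = np.bincount(X)
print(counts, X.mean(), X.var())
```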
[Figure 3.33: (a) Histogram based on 5000 samples. (b) PMF.]
For the PMF, in MATLAB we can call poisspdf, and in Python we can call rv.pmf
with rv = stats.poisson.
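A minimal Python sketch of the PMF evaluation, with λ = 1 (the range of k is an illustrative choice):

```python
import numpy as np
import scipy.stats as stats

lam = 1
rv = stats.poisson(lam)
k = np.arange(0, 11)
f = rv.pmf(k)                 # lam^k / k! * exp(-lam)
print(f[:3])
```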
The shape of the Poisson PMF changes with λ. As illustrated in Figure 3.34, pX (k) is
more concentrated at lower values for smaller λ and becomes spread out for larger λ. Thus,
we should expect that the mean and variance of a Poisson random variable will change
together as a function of λ. In the same figure, we show the CDF of a Poisson random
variable. The CDF of a Poisson is
\[ F_X(k) = P[X \le k] = \sum_{\ell=0}^{k} \frac{\lambda^{\ell}}{\ell!} e^{-\lambda}. \tag{3.17} \]
[Figure 3.34 legend: λ = 1, λ = 4, λ = 10.]
Figure 3.34: A Poisson random variable using different λ’s. [Left] Probability mass function pX (k).
[Right] Cumulative distribution function FX (k).
Example 3.18. Let X be a Poisson random variable with parameter λ. Find P[X > 4]
and P[X ≤ 5].
Solution.
\begin{align*}
P[X > 4] &= 1 - P[X \le 4] = 1 - \sum_{k=0}^{4} \frac{\lambda^k}{k!} e^{-\lambda}, \\
P[X \le 5] &= \sum_{k=0}^{5} \frac{\lambda^k}{k!} e^{-\lambda}.
\end{align*}
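For a specific rate, the two probabilities can be evaluated numerically. The sketch below assumes λ = 2 purely for illustration (the example leaves λ general) and compares the CDF from scipy.stats with the explicit sums above.

```python
from math import exp, factorial
import scipy.stats as stats

lam = 2.0   # illustrative rate; Example 3.18 leaves lambda general
rv = stats.poisson(lam)

p_gt_4 = 1 - rv.cdf(4)    # P[X > 4] = 1 - P[X <= 4]
p_le_5 = rv.cdf(5)        # P[X <= 5]

# The same quantities from the explicit sums in the example.
def pmf(k):
    return lam**k / factorial(k) * exp(-lam)

p_gt_4_sum = 1 - sum(pmf(k) for k in range(5))
p_le_5_sum = sum(pmf(k) for k in range(6))
print(p_gt_4, p_gt_4_sum)
print(p_le_5, p_le_5_sum)
```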
origin of the Poisson random variable, which we will discuss shortly. When photons are emit-
ted from the source, they travel through the medium as a sequence of independent events.
During the integration period of the camera, the photons are accumulated to generate a
voltage that is then translated to digital bits.
Figure 3.35: The Poisson random variable can be used to model photon arrivals.
If we assume that the photon arrival rate is α (photons per second), and suppose that
the total amount of integration time is t, then the average number of photons that the sensor
can see is αt. Let X be the number of photons seen during the integration time. Then if we
follow the Poisson model, we can write down the PMF of X:
\[ P[X = k] = \frac{(\alpha t)^k}{k!} e^{-\alpha t}. \]
Therefore, if a pixel is bright, meaning that α is large, then X will have a higher likelihood
of landing on a large number.
(2) Traffic model. The Poisson random variable can be used in many other problems. For
example, we can use it to model the number of passengers on a bus or the number of spam
phone calls. The required modification to Figure 3.35 is almost trivial: merely replace the
photons with your favorite cartoons, e.g., a person or a phone, as shown in Figure 3.36. In
the United States, shared-ride services such as Uber and Lyft need to model the vacant cars
and the passengers. As long as they have an arrival rate and certain degrees of independence
between events, the Poisson random variable will be a good model.
As you can see from these examples, the Poisson random variable has broad applica-
bility. Before we continue our discussion of its applications, let us introduce a few concepts
related to the Poisson random variable.
Figure 3.36: The Poisson random variable can be used to model passenger arrivals and the number of
phone calls, and can be used by Uber or Lyft to provide shared rides.
The Poisson random variable is special in the sense that the mean and the variance are
equal. That is, if the mean arrival number is higher, the variance is also higher. This is very
different from some other random variables, e.g., the normal random variable, where the mean
and the variance are separate parameters that can be chosen independently of each other. For
certain engineering applications such as photography, this plays an important role in defining
the signal-to-noise ratio. We will come back to this point later.
This is a linearity assumption, which typically holds for a short duration of time.
For sufficiently small ∆t, the probability that more than one impulse falls in ∆t is
negligible. Thus, we have that P[X(t + ∆t) − X(t) = 0] = 1 − λ∆t.
The number of impulses in non-overlapping time intervals is independent.
The significance of these three hypotheses is that if the underlying photon arrival process
violates any of these assumptions, then the Poisson PMF will not hold. One example is the
presence of scattering effects, where a photon has a certain probability of going off due to
the scattering medium and a certain probability of coming back. In this case, the events will
no longer be independent.
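The role of the hypotheses can be illustrated with a small simulation. The sketch below is not from the text and rests on stated assumptions: the interval [0, t] is cut into slots of width ∆t, each slot holds one impulse with probability λ∆t and none otherwise, and slots are drawn independently. The values of lam, t, num_slots, and num_trials are arbitrary choices. If the hypotheses hold, the number of impulses per trial should follow the Poisson PMF with mean λt.

```python
import math
import numpy as np

lam, t = 3.0, 1.0        # arrival rate and observation time (illustrative values)
num_slots = 400          # number of slots, each of width dt
dt = t / num_slots
num_trials = 5000

rng = np.random.default_rng(0)
# Each slot contains one impulse with probability lam*dt and zero impulses otherwise;
# all slots are drawn independently (non-overlapping intervals are independent).
impulses = rng.random((num_trials, num_slots)) < lam * dt
counts = impulses.sum(axis=1)                    # X(t) for each trial

empirical = np.bincount(counts) / num_trials     # estimate of P[X(t) = k]
poisson = np.array([
    math.exp(k * math.log(lam * t) - lam * t - math.lgamma(k + 1))
    for k in range(empirical.size)
])
max_gap = float(np.max(np.abs(empirical - poisson)))
print("largest gap between empirical frequencies and Poisson PMF:", max_gap)
```

Making the slots dependent on each other, for example by forcing an impulse to be followed by another, is a simple way to see the agreement break down.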
Assuming that these hypotheses hold, then at time t + ∆t, the probability of observing
X(t + ∆t) = k can be computed as
\[ P[X(t + \Delta t) = k] = P[X(t) = k] \cdot \underbrace{(1 - \lambda \Delta t)}_{= P[X(t+\Delta t) - X(t) = 0]} + P[X(t) = k - 1] \cdot \underbrace{(\lambda \Delta t)}_{= P[X(t+\Delta t) - X(t) = 1]} \]
which is the right-hand side of the equation. To retrieve the basic form of Poisson, we can
just set t = 1 in the PMF so that
\[ P[X(1) = k] = \frac{\lambda^k}{k!} e^{-\lambda}. \]
There is an alternative approach to deriving the Poisson PMF. The idea is to drive
the parameter n in the binomial random variable to infinity while pushing p to zero. In this
limit, the binomial PMF will converge to the Poisson PMF. We will discuss this shortly.
However, we recommend the physics approach we have just described because it has a rich
meaning and allows us to validate our assumptions.
\[ \binom{n}{k} p^k (1 - p)^{n-k} \approx \frac{\lambda^k}{k!} e^{-\lambda}, \]
where $\lambda \overset{\text{def}}{=} np$.
Before we prove the result, let us see how close the approximation can be. In Figure 3.38,
we show a binomial distribution and a Poisson approximation. The closeness of the approx-
imation can easily be seen.
In MATLAB, the code to approximate a binomial distribution with a Poisson formula
is shown below. Here, we draw 10,000 random binomial numbers and plot their histogram.
On top of the plot, we use poisspdf to compute the Poisson PMF. This gives us Figure 3.38.
A similar set of commands can be called in Python.
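A possible Python counterpart is sketched below. It is an illustration rather than the book's listing: it uses NumPy to draw the 10,000 binomial numbers with n = 5000 and p = 0.01, and it evaluates both PMFs directly with math.lgamma instead of calling a statistics library. Plotting is omitted so that the block has no graphics dependency; the variable names are assumptions.

```python
import math
import numpy as np

n, p = 5000, 0.01
lam = n * p                                    # lambda = np = 50

rng = np.random.default_rng(0)
samples = rng.binomial(n, p, size=10000)       # 10,000 random binomial numbers
k = np.arange(0, 121)
# Normalized histogram of the samples over k = 0, ..., 120
hist = np.bincount(samples, minlength=k.size)[:k.size] / samples.size

def log_binom_pmf(i, n, p):
    """log of C(n, i) p^i (1 - p)^(n - i), using lgamma for the factorials."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1 - p))

binom_pmf = np.array([math.exp(log_binom_pmf(int(i), n, p)) for i in k])
poisson_pmf = np.array([
    math.exp(int(i) * math.log(lam) - lam - math.lgamma(int(i) + 1)) for i in k
])

print("max |binomial - Poisson|  :", float(np.max(np.abs(binom_pmf - poisson_pmf))))
print("max |histogram - Poisson| :", float(np.max(np.abs(hist - poisson_pmf))))
```

The first number measures the approximation error itself, while the second also includes the sampling noise of the histogram.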
Figure 3.38: Poisson approximation of binomial distribution. (Plot of probability versus k for
the binomial distribution with n = 5000, p = 0.01 and the Poisson distribution with λ = 50.)
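\[ \log(1 + x) \approx x, \qquad x \ll 1. \]
It then follows that $\log\left(1 - \frac{\lambda}{n}\right) \approx -\frac{\lambda}{n}$. Hence, $\left(1 - \frac{\lambda}{n}\right)^n \approx e^{-\lambda}$.
□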
Example 3.19. Consider an optical communication system. The bit arrival rate is $10^9$
bits/sec, and the probability of having one error bit is $10^{-9}$. Suppose we want to find
the probability of having five error bits in one second.
Let X be the number of error bits. In one second there are $10^9$ bits. Since we
do not know the location of these 5 bits, we have to enumerate all possibilities. This
leads to a binomial distribution. Using the binomial distribution, we know that the
probability of having k error bits is
\[ P[X = k] = \binom{n}{k} p^k (1 - p)^{n-k} = \binom{10^9}{k} (10^{-9})^k (1 - 10^{-9})^{10^9 - k}. \]
Using the Poisson approximation to the binomial, we can see that the probability can
be approximated by
\[ P[X = k] \approx \frac{\lambda^k}{k!} e^{-\lambda}, \]
where $\lambda = np = 10^9 (10^{-9}) = 1$. Setting k = 5 yields P[X = 5] ≈ 0.003.
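The numbers in Example 3.19 can be reproduced with the Python standard library. This is a minimal sketch; the variable names are assumptions. The exact binomial probability is evaluated in log space because n is very large, and it is compared with the Poisson approximation for k = 5.

```python
import math

n = 10**9      # number of bits in one second
p = 1e-9       # probability that a single bit is in error
k = 5          # number of error bits of interest
lam = n * p    # lambda = np = 1

# Exact binomial probability C(n, k) p^k (1 - p)^(n - k), computed via logarithms.
log_binom = (math.log(math.comb(n, k)) + k * math.log(p)
             + (n - k) * math.log1p(-p))
binom_prob = math.exp(log_binom)

# Poisson approximation lambda^k e^{-lambda} / k!
poisson_prob = lam**k / math.factorial(k) * math.exp(-lam)

print(f"binomial: {binom_prob:.6f}")
print(f"Poisson : {poisson_prob:.6f}")
```

Both values round to 0.003, in agreement with the example.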
Poisson random variables are useful in computer vision, but you may skip this discussion
if it is your first reading of the book.
The strong connection between Poisson statistics and physics makes the Poisson ran-
dom variable a very good fit for many physical experiments. Here we demonstrate an appli-
cation in modeling photon shot noise.
An image sensor is a photon-sensitive device which is used to detect incoming photons.
In the simplest setting, we can model a pixel in the object plane as $X_{m,n}$, for some 2D
coordinate $[m, n] \in \mathbb{R}^2$. Written as an array, an M × N image in the object plane can be
visualized as
\[ \mathbf{X} = \text{object} = \begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M,1} & X_{M,2} & \cdots & X_{M,N} \end{bmatrix}. \]
Without loss of generality, we assume that $X_{m,n}$ is normalized so that $0 \le X_{m,n} \le 1$ for
every coordinate [m, n]. To model the brightness, we multiply $X_{m,n}$ by a scalar α > 0. If
a pixel $\alpha X_{m,n}$ has a large value, then it is a bright pixel; conversely, if $\alpha X_{m,n}$ has a small
value, then it is a dark pixel. At a particular pixel location $[m, n] \in \mathbb{R}^2$, the observed pixel
value $Y_{m,n}$ is a random variable following the Poisson statistics. This situation is illustrated
in Figure 3.39, where we see that an object-plane pixel will generate an observed pixel
through the Poisson PMF.
Figure 3.39: The image formation process is governed by the Poisson random variable. Given a pixel
in the object plane $X_{m,n}$, the observed pixel $Y_{m,n}$ is a Poisson random variable with mean $\alpha X_{m,n}$.
Therefore, a brighter pixel will have a higher Poisson mean, whereas a darker pixel will have a lower
Poisson mean.
Here, by Poisson{$\alpha X_{m,n}$} we mean that $Y_{m,n}$ is a random integer with probability mass
\[ P[Y_{m,n} = k] = \frac{[\alpha X_{m,n}]^k}{k!} e^{-\alpha X_{m,n}}. \]
Note that this model implies that the images seen by our cameras are more or less
an array of Poisson random variables. (We say “more or less” because of other sources of
uncertainties such as read noise, dark current, etc.) Because the observed pixels Ym,n are
random variables, they fluctuate about the mean values, and hence they are noisy. We refer
to this type of random fluctuation as the shot noise. The impact of the shot noise can be
seen in Figure 3.40. Here, we vary the sensor gain level α. We see that for small α the image
is dark and has much random fluctuation. As α increases, the image becomes brighter and
the fluctuation becomes smaller.
In MATLAB, simulating the Poisson photon arrival process for an image requires the
image-processing toolbox. The command to read an image is imread. Depending on the data
type, the input array could be uint8 integers. To convert them to floating-point numbers
between 0 and 1, we use the command im2double. Drawing Poisson measurements from the
clean image is done using poissrnd. Finally, we can use imshow to display the image.
Figure 3.40: Illustration of the Poisson random variable in photographing images. Here, α denotes the
gain level of the sensor: Larger α means that there are more photons coming to the sensor.
(Panels: α = 10, α = 100, α = 1000.)
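% MATLAB code to simulate Poisson photon arrivals on an image
x0 = im2double(imread('cameraman.tif'));  % file name is an assumed example; use any grayscale image
X = poissrnd(10*x0);                      % Poisson measurements with alpha = 10
figure(1); imshow(x0, []);
figure(2); imshow(X, []);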
Similar commands can be found in Python with the help of the cv2 library. When
reading an image, we call cv2.imread. The option 0 is used to read a gray-scale image;
otherwise, we will have a 3-channel color image. The division /255 ensures that the input
array ranges between 0 and 1. Generating the Poisson random numbers can be done using
np.random.poisson, or by calling the statistics library with stats.poisson.rvs(10*x0).
To display the images, we call plt.imshow, with the color map option set to cmap = 'gray'.
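A minimal Python sketch of this simulation is given below. To keep it self-contained, it replaces the image file with a synthetic horizontal ramp, which is an assumption of this example and not part of the text; the array names x0 and X follow the commands mentioned above, and alpha = 10 matches the factor in 10*x0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a gray-scale image scaled to [0, 1]: a horizontal ramp (assumed test image).
# With a real file, the text's approach would be x0 = cv2.imread(path, 0) / 255,
# where path is a hypothetical file name.
height, width = 128, 128
x0 = np.tile(np.linspace(0.0, 1.0, width), (height, 1))

alpha = 10                        # brightness level, as in 10*x0
X = rng.poisson(alpha * x0)       # one Poisson draw per pixel, with mean alpha * x0

print("clean image range :", x0.min(), x0.max())
print("noisy image range :", X.min(), X.max())
print("mean of alpha*x0  :", float((alpha * x0).mean()))
print("mean of X         :", float(X.mean()))

# To display the result one could call, for example,
#   import matplotlib.pyplot as plt
#   plt.imshow(X, cmap='gray'); plt.show()
```

The observed array X contains nonnegative integers, and its average is close to the average of alpha * x0, as expected from the Poisson mean.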
The answer to this question lies in the signal-to-noise ratio (SNR) of the Poisson
random variable. The SNR of an image defines its quality. The higher the SNR, the better
the image. The mathematical definition of SNR is the ratio between the signal power and
the noise power. In our case, the SNR is
\[ \text{SNR} = \frac{E[Y]}{\sqrt{\text{Var}[Y]}} \overset{(a)}{=} \frac{\lambda}{\sqrt{\lambda}} = \sqrt{\lambda}, \]
where $Y = Y_{m,n}$ is one of the observed pixels and $\lambda = \alpha X_{m,n}$ is the corresponding object
pixel. In this equation, step (a) uses the properties of the Poisson random variable Y,
where E[Y] = Var[Y] = λ. The result SNR = $\sqrt{\lambda}$ is very informative. It says that if the
underlying mean photon flux (which is λ) increases, the SNR increases at a rate of $\sqrt{\lambda}$.
So, yes, the variance becomes larger when the scene is brighter. However, the gain in signal
E[Y] overrides the gain in noise $\sqrt{\text{Var}[Y]}$. As a result, the big fluctuation in bright images
is compensated by the strong signal. Thus, to minimize the shot noise one has to use a
longer exposure to increase the mean photon flux. When the scene is dark and the aperture
is small, shot noise is unavoidable.
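The square-root behavior of the SNR can be checked empirically. The sketch below is an illustration with assumed values: it draws Poisson samples for three mean photon levels, estimates E[Y] and the standard deviation of Y from the samples, and compares their ratio with the square root of λ.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 200_000           # number of simulated pixels per level (assumed)

snr = {}
for lam in (10, 100, 1000):     # illustrative mean photon levels
    y = rng.poisson(lam, size=num_samples)
    snr[lam] = float(y.mean() / y.std())     # empirical E[Y] / sqrt(Var[Y])
    print(f"lambda = {lam:4d}: empirical SNR = {snr[lam]:.3f}, "
          f"sqrt(lambda) = {np.sqrt(lam):.3f}")
```

Increasing λ by a factor of 100 raises the noise standard deviation by a factor of 10, but the SNR also grows by a factor of 10.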
Poisson modeling is useful for describing the problem. However, the actual engineering
question is: given a noisy observation $Y_{m,n}$, how would you reconstruct the clean image
$X_{m,n}$? This is a very difficult inverse problem. The typical strategy is to exploit the spatial
correlations between nearby pixels, e.g., images are usually smooth except along some sharp edges.
Other information about the image, e.g., the likelihood of obtaining texture patterns, can
also be leveraged. Modern image-processing methods are rich, ranging from classical filtering
techniques to deep neural networks. Static images are easier to recover because we can often
leverage multiple measurements of the same scene to boost the SNR. Dynamic scenes are
substantially harder when we need to track the motion of any underlying objects. There are
also newer image sensors with better photon sensitivity. The problem of imaging in the dark
is an important research topic in computational imaging. New solutions are developed at
the intersection of optics, signal processing, and machine learning.
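As one concrete illustration of using multiple measurements of a static scene, the sketch below averages several independent Poisson frames. Plain averaging is an assumption chosen for simplicity and is not a method prescribed by the text; the values of lam, num_frames, and num_pixels are arbitrary. Since the average of N independent Poisson frames has mean λ and variance λ/N, its SNR is the square root of Nλ.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 10.0             # mean photon count per pixel per frame (illustrative)
num_frames = 16        # number of measurements of the same static scene
num_pixels = 100_000   # pixels sharing the same underlying brightness (assumed)

frames = rng.poisson(lam, size=(num_frames, num_pixels))
single = frames[0]                 # one noisy frame
averaged = frames.mean(axis=0)     # average over all frames

snr_single = float(single.mean() / single.std())
snr_averaged = float(averaged.mean() / averaged.std())

print(f"single frame  : SNR = {snr_single:.2f}, sqrt(lam)   = {np.sqrt(lam):.2f}")
print(f"{num_frames} frames avg : SNR = {snr_averaged:.2f}, "
      f"sqrt(N*lam) = {np.sqrt(num_frames * lam):.2f}")
```

For a dynamic scene the frames no longer share the same mean, so this simple averaging would blur moving objects.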
3.6 Summary
A random variable is so called because it can take more than one state. The probability mass
function specifies the probability for it to land on a particular state. Therefore, whenever
you think of a random variable you should immediately think of its PMF (or histogram
if you prefer). The PMF is a unique characterization of a random variable. Two random
variables with the same PMF are effectively the same random variables. (They are not
identical because there could be measure-zero sets where the two differ.) Once you have the
PMF, you can derive the CDF, expectation, moments, variance, and so on.
When your boss hands a dataset to you, which random variable (which model) should
you use? This is a very practical and deep question. We highlight three steps for you to
consider:
(i) Model selection: Which random variable is the best fit for our problem? Some-
times we know by physics that, for example, photon arrivals or internet traffic follow a
Poisson random variable. However, not all datasets can be easily described by simple
models. The models we have learned in this chapter are called the parametric mod-
els because they are characterized by one or two parameters. Some datasets require
nonparametric models, e.g., natural images, because they are just too complex. Some
data scientists refer to deep neural networks as parametric models because the net-
work weights are essentially the parameters. Some do not because when the number
of parameters is on the order of millions, sometimes even more than the number of
training samples, it seems more reasonable to call these models nonparametric. How-
ever, putting this debate aside, shortlisting a few candidate models based on prior
knowledge is essential. Even if you use deep neural networks, selecting between con-
volutional structures versus long short-term memory models is still a legitimate task
that requires an understanding of your problem.
(ii) Parameter estimation: Suppose that you now have a candidate model; the next
task is to estimate the model parameter using the available training data. For example,
for Poisson we need to determine λ, and for binomial we need to determine (n, p). The
estimation problem is an inverse problem. Often we need to use the PMF to construct
certain optimization problems. By solving the optimization problem we will find the
best parameter (for that particular candidate model). Modern machine learning is
doing significantly better now than in the old days because optimization methods
have advanced greatly.
(iii) Validation: When each candidate model has been optimized to best fit the data,
we still need to select the best model. This is done by running various tests. For
example, we can construct a validation set and check which model gives us the best
performance (such as classification rate or regression error). However, a model with
the best validation score is not necessarily the best model. Your goal should be to seek
a good model and not the best model because determining the best requires access to
the testing data, which we do not have. Everything being equal, the common wisdom
is to go with a simpler model because it is generally less susceptible to overfitting.
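The three steps can be sketched on synthetic data. Everything specific in the block below is an assumption made for illustration: the data are drawn from a Poisson distribution with λ = 4, the candidate models are a Poisson and a geometric random variable shifted to start at 0, the parameters are estimated by maximum likelihood, and the validation score is the average log-likelihood on held-out samples.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
data = rng.poisson(4.0, size=2000)          # synthetic counts (assumed ground truth)
train, valid = data[:1000], data[1000:]     # training set and validation set

# (i) Model selection: shortlist two candidate models for count data.
def poisson_loglik(x, lam):
    """Average log P[X = k] under Poisson(lam)."""
    log_fact = np.array([math.lgamma(int(k) + 1) for k in x])
    return float(np.mean(x * np.log(lam) - lam - log_fact))

def geometric_loglik(x, p):
    """Average log P[X = k] under a geometric shifted to k = 0, 1, ...: (1 - p)^k p."""
    return float(np.mean(np.log(p) + x * np.log(1.0 - p)))

# (ii) Parameter estimation: maximum-likelihood estimates from the training set.
lam_hat = float(train.mean())               # Poisson MLE is the sample mean
p_hat = 1.0 / (1.0 + float(train.mean()))   # MLE for the shifted geometric

# (iii) Validation: compare the candidates on held-out data.
scores = {
    "poisson": poisson_loglik(valid, lam_hat),
    "geometric": geometric_loglik(valid, p_hat),
}
best = max(scores, key=scores.get)
print("estimated lambda:", lam_hat, " estimated p:", p_hat)
print("validation scores:", scores, " selected model:", best)
```

Here both candidates have a single parameter, so the comparison is not affected by differences in model complexity.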
3.7 References
Probability textbooks
3-1 Dimitri P. Bertsekas and John N. Tsitsiklis, Introduction to Probability, Athena Sci-
entific, 2nd Edition, 2008. Chapter 2.
3-2 Alberto Leon-Garcia, Probability, Statistics, and Random Processes for Electrical En-
gineering, Prentice Hall, 3rd Edition, 2008. Chapter 3.
3-3 Athanasios Papoulis and S. Unnikrishna Pillai, Probability, Random Variables and
Stochastic Processes, McGraw-Hill, 4th Edition, 2001. Chapters 3 and 4.
3-4 John A. Gubner, Probability and Random Processes for Electrical and Computer Engineers,
Cambridge University Press, 2006. Chapters 2 and 3.
3-5 Sheldon Ross, A First Course in Probability, Prentice Hall, 8th Edition, 2010. Chap-
ter 4.
3-6 Henry Stark and John Woods, Probability and Random Processes With Applications
to Signal Processing, Prentice Hall, 3rd Edition, 2001. Chapters 2 and 4.
Cross-validation
3-9 Larry Wasserman, All of Statistics, Springer 2004. Chapter 20.
3-10 Mats Rudemo, “Empirical Choice of Histograms and Kernel Density Estimators,”
Scandinavian Journal of Statistics, Vol. 9, No. 2 (1982), pp. 65-78.
3-11 David W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization,
Wiley, 1992.
Poisson statistics
3-12 Joseph Goodman, Statistical Optics, Wiley, 2015. Chapter 3.
3-13 Henry Stark and John Woods, Probability and Random Processes With Applications
to Signal Processing, Prentice Hall, 3rd edition, 2001. Section 1.10.
3.8 Problems