Probability & Statistics
with Applications to Computing
Alex Tsun
Acknowledgements
This textbook would not have been possible without the following people:
• Mitchell Estberg (Head TA CSE 312 at UW): For helping organize the effort to put this together,
formatting, and revising a vast majority of this content. The countless hours you put in for the course
and for me are much appreciated. Neither this class nor this book would have been possible without
your contributions!
• Pemi Nguyen (TA CSE 312 at UW): For helping to add examples, motivation, and contributing
to a lot of this content. For constantly going above and beyond to ensure a great experience for the
students, the staff, and me. Your bar for quality and dedication to these notes, the course, and the
students is unrivaled.
• Cooper Chia, William Howard-Snyder, Shreya Jayaraman, Aleks Jovcic, Muxi (Scott)
Ni, Luxi Wang (TAs, CSE 312 at UW): For each typesetting several sections and adding your own
thoughts and intuition. Thank you for being the best teaching staff I could have asked for!
• Joshua Fan (UW/Cornell): You are an amazing co-TA who is extremely dedicated to your students.
Thank you for your help in developing this content, and for recording several videos!
• Matthew Taing (UW): You are an extremely caring TA, dedicated to making learning an enjoyable
experience. Thank you for your help and suggestions throughout the development of this content.
• Martin Tompa (Professor at UW Allen School): Thank you for taking a chance on me to give me my
first TA experience, and for supporting me through my career to graduate school and beyond. Thank
you especially for helping me attain my first teaching position, and for your advice and mentorship.
• Anna Karlin (Professor at UW Allen School): Thank you for making my CSE 312 TA experiences at
UW amazing, and for giving me much freedom and flexibility to create content and lead during those
times. Thank you also for your significant help and guidance during my first teaching position.
• Lisa Yan, David Varodayan, Chris Piech (Instructors at Stanford): I learned a lot from TAing
for each of you, especially getting to compare and contrast this course at two different universities. I’d like
to think I took the “best of both worlds” at Stanford and the University of Washington. Thank you
for your help, guidance, and inspiration!
• My Family: Thank you for supporting me and encouraging me to pursue my passions. I would not
be where I am or the person I am without you.
Notes
Information
This book was written in Summer 2020 during an offering of “CSE 312: Foundations of Computing
II”, which is essentially probability and statistics for computer scientists. The curriculum was based off of
this course as well as Stanford University’s “CS 109: Probability for Computer Scientists”. I strongly believe
coding applications (which are included in Chapter 9) are essential to teach to show why this class is a core
CS requirement, but also because they help keep the students engaged and excited. This textbook is currently
being used at the University of Washington (Autumn 2020).
Resources
• Course Videos (YouTube Playlist): Mostly under 5 minutes long, serves generally as a quick review
of each section.
• Course Slides (Google Drive): Contains Google Slides presentations for each section, used in the videos.
• Course Website (UW CSE 312): Taught at the University of Washington during Summer 2020 and
Autumn 2020 quarters by Alex Tsun and Professor Anna Karlin.
https://courses.cs.washington.edu/courses/cse312/20su/
• This Textbook: Available online free here.
• Key Theorems and Definitions: At the end of this book.
• Distributions (2 pages): At the end of this book.
Assumed Prerequisites
We assume the student has experience in the following topics:
• Multivariable calculus (at least up to partial derivatives and double integrals). We won’t really use
much calculus beyond taking derivatives and integrals, so a surface-level knowledge is fine.
• Discrete mathematics (introduction to logic and proofs). We’ll especially use set theory, but this will
be covered in Chapter 0: Prerequisites of this book.
• Programming experience (at least one or two introductory classes, in any language). We will teach
Python, but assume knowledge of fundamental ideas such as variables, conditionals, loops, and
arrays. This will be crucial in studying and coding up the CS applications of Chapter 9.
About the Author
Alex Tsun grew up in the Bay Area, with a family full of software engineers (parents and older brother). He
completed Bachelor’s degrees in computer science, statistics, and theoretical mathematics at the University
of Washington in 2018, before attending Stanford University for his Master’s degree in AI and Theoretical
CS. During his six years as a student, he served as a TA for this course a total of 13 times. After graduating
in June 2020, he returned to UW to be the instructor for the course CSE 312 during Summer 2020.
Contents
0. Prerequisites
0.1 Intro to Set Theory
0.2 Set Operations
0.3 Sum and Product Notation
1. Combinatorial Theory
1.1 So You Think You Can Count?
1.2 More Counting
1.3 No More Counting Please
2. Discrete Probability
2.1 Intro to Discrete Probability
2.2 Conditional Probability
2.3 Independence
Chapter 0. Prerequisites
This chapter focuses on set theory, which makes up the building blocks of probability. To even define a
probability space, we need this notion of a set. While it is assumed that a discrete mathematics course was
taken, we will focus on reviewing this particular topic. We also cover summation and product notation,
which we will use frequently for compactness and conciseness of notation.
Chapter 0. Prerequisites
0.1: Intro to Set Theory
Slides (Google Drive) Video (YouTube)
There is only one set of cardinality 0 (containing no elements), the empty set, denoted by ∅ = {}.
Example(s)
Find the cardinality of each of the following sets: {apple, orange, watermelon}; {1, 1, 1, 1, 1}; [0, 1];
{1, 2, 3, ...}; {∅, {1}, {2}, {1, 2}}; and {∅, {1}, {1, 1}, {1, 1, 1}, ...}.
Solution To calculate the cardinality of a set, we have to determine the number of elements in the set.
1. For the set {apple, orange, watermelon}, we have three distinct elements, so the cardinality is 3. That
is, |{apple, orange, watermelon}| = 3.
2. For {1, 1, 1, 1, 1}, there are five 1’s, but recall that sets don’t contain duplicates, so actually this set only
contains 1, and is equal to the set {1}. This means that its cardinality is 1; that is, |{1, 1, 1, 1, 1}| = 1.
3. For the set [0, 1], all the values between 0 and 1 (inclusive), we have an infinite number of elements.
This means that the cardinality of this set is infinity; that is, |[0, 1]| = ∞.
4. For the set {1, 2, 3, ...}, the set of all positive integers, we have an infinite number of elements. This
means that the cardinality of this set is infinity; that is, |{1, 2, 3, ...}| = ∞.
5. For the set {∅, {1}, {2}, {1, 2}} (a set of sets), there are four distinct elements that are each a different
set. This means that the cardinality is 4; that is, |{∅, {1}, {2}, {1, 2}}| = 4.
6. Finally, for the set {∅, {1}, {1, 1}, {1, 1, 1}, ...}, we do have an infinite number of sets, each of which
is an element. But are these distinct? Upon further consideration, all the sets containing various
numbers of 1’s are equivalent, as duplicates don’t matter. So the only distinct elements are the empty
set and the set {1}. So the cardinality is 2; that is, |{∅, {1}, {1, 1}, {1, 1, 1}, ...}| = |{∅, {1}}| = 2.
Example(s)
Let us define A = {1, 3}, B = {3, 1}, C = {1, 2} and D = {∅, {1}, {2}, {1, 2}, 1, 2}.
Determine whether the following are true or false:
• 1 ∈ A
• 1 ⊆ A
• {1} ⊆ A
• {1} ∈ A
• 3 ∉ C
• A ∈ B
• A ⊆ B
• C ∈ D
• C ⊆ D
• ∅ ∈ D
• ∅ ⊆ D
• A = B
• ∅ ⊆ ∅
• ∅ ∈ ∅
Solution
• 1 ∈ A. True, because 1 is an element in A.
• 1 ⊆ A. False, because 1 is a value, not a set, so it cannot be a subset of a set.
• {1} ⊆ A. True, because every element of the set {1} is an element of A.
• {1} ∈ A. False, because {1} is a set, and A contains no sets as elements.
• 3 ∉ C. True, because the value 3 is not one of the elements of C.
• A ∈ B. False, because A is a set, and there are no elements of B which are sets, so A ∉ B.
• A ⊆ B. True, because every element of A is an element of B.
• C ∈ D. True, because C is an element of D.
• C ⊆ D. True, because each of the elements of C is also an element of D.
• ∅ ∈ D. True, because the empty set is an element of D.
• ∅ ⊆ D. True: by definition, the empty set is a subset of any set. This is because if this were not the
case, there would have to be an element of ∅ which was not in D. But there are no elements in ∅, so
the statement is true (vacuously).
• A = B. True: A ⊆ B, as every element of A is an element of B, and B ⊆ A, as every element of B is an
element of A. Since this relationship holds in both directions, we have A = B.
• ∅ ⊆ ∅. True, because the empty set is a subset of every set (vacuously).
• ∅ ∈ ∅. False, because the empty set contains no elements, so the empty set cannot be an element of it.
Chapter 0. Prerequisites
0.2: Set Operations
Slides (Google Drive) Video (YouTube)
Example(s)
1. If we were talking about the set of fruits a supermarket might sell S, we might have S =
{apple, watermelon, pear, strawberry} and U = {all fruits}. We might want to know which
fruits the supermarket doesn’t sell, which would be denoted S^C (defined later). This requires a
universal set of all fruits that we can check with to see which are missing from S.
2. If we were talking about the set of kinds of cars Bill Gates owns, that might be the set T. There
must be a universal set U of possible kinds of cars that exist, if we wanted to list out which
ones he was missing, T^C.
The union of A and B is denoted A ∪ B. It contains elements in A or B, or both (without duplicates).
So x ∈ A ∪ B if and only if x ∈ A or x ∈ B.
The image below shows in red the union of A and B: A ∪ B. The outer rectangle is the universal set U.
11
12 Probability & Statistics with Applications to Computing 0.2
The intersection of A and B is denoted A ∩ B. It contains the elements that are in both A and B. So
x ∈ A ∩ B if and only if x ∈ A and x ∈ B.
The image below shows in red the intersection of A and B: A ∩ B. The outer rectangle is the universal set U.
The set difference of A with B is denoted A \ B. It contains elements of A which are not in B. So
x ∈ A \ B if and only if x ∈ A and x ∉ B.
The image below shows in red the set difference of A with B: A \ B. The outer rectangle is the universal set
U.
Example(s)
Let A = {1, 3}, B = {2, 3, 4}, and U = {1, 2, 3, 4, 5}. Solve for: A ∩ B, A ∪ B, B \ A, A \ B, (A ∪ B)^C,
A^C, B^C, and A^C ∩ B^C.
Solution
• A ∩ B = {3}, since 3 is the only element in both A and B.
• A ∪ B = {1, 2, 3, 4}, as these are all the elements in either A or B. Note we dropped the duplicate 3,
since sets cannot contain duplicates.
• B \ A = {2, 4}, as these are the elements of B which are not in A.
• A \ B = {1}, as this is the only element of A which is not an element of B.
• (A ∪ B)^C = {5}, as by definition (A ∪ B)^C = U \ (A ∪ B), and 5 is the only element of U which is not
an element of A ∪ B.
• A^C = {2, 4, 5}, as by definition A^C = U \ A, and these are the elements of U which are not elements
of A.
• B^C = {1, 5}, as by definition B^C = U \ B, and these are the elements of U which are not elements of
B.
• A^C ∩ B^C = {5}, because the only element in both A^C and B^C is 5 (see the above).
Chapter 0. Prerequisites
0.3: Sum and Product Notation
Slides (Google Drive) Video (YouTube)
1 + 2 + 3 + ··· + 10 = Σ_{i=1}^{10} i
Note that i is just a dummy variable. We could have also used j, k, or any other letter. What if we wanted
to sum numbers that weren’t consecutive integers?
As long as there is some pattern, we can write it compactly! For example, how could we write 16 + 25 +
36 + · · · + 81? In the first equation below (0.3.1), j takes on the values from 4 to 9, and the square of each of
these values will be summed together. Note that this is equivalent to k taking on the values of 1 to 6, and
adding 3 to each of the values before squaring and summing them up (0.3.2).
16 + 25 + 36 + ··· + 81 = Σ_{j=4}^{9} j^2    (0.3.1)
                        = Σ_{k=1}^{6} (k + 3)^2    (0.3.2)
If you know what a for-loop is (from computer science), this is exactly the following (in Java or C++).
This first loop represents the first sum with dummy variable j.
int sum = 0;
for (int j = 4; j <= 9; j++) {
    sum += (j * j);
}
This second loop represents the second sum with dummy variable k, and is equivalent to the first.
int sum = 0;
for (int k = 1; k <= 6; k++) {
    sum += ((k + 3) * (k + 3));
}
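Since the coding applications in Chapter 9 use Python, here is the same computation as a quick Python sketch (our own translation of the loops above, not from the original notes); note that range(4, 10) runs j from 4 through 9.

total = sum(j * j for j in range(4, 10))                 # 16 + 25 + ... + 81
total_shifted = sum((k + 3) ** 2 for k in range(1, 7))   # k = 1, ..., 6
print(total, total_shifted)                              # both print 271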
Furthermore, if S is a set, and f : S → R is a function defined on S, then the following notation sums
f(x) over all elements x ∈ S:

Σ_{x∈S} f(x)
Note that the sum over no terms (the empty set) is defined as 0.
Example(s)
Write out the following sums:
• Σ_{k=3}^{7} k^10
• Σ_{y∈S} (2y + 5), for S = {3, 6, 8, 11}
• Σ_{t=6}^{8} 4
• Σ_{z=2}^{1} sin(z)
• Σ_{x∈T} 13x, for T = {−1, −3, 5}.
Solution
• For Σ_{k=3}^{7} k^10, we raise each value of k from 3 to 7 to the power of 10 and sum them together. That
is:

Σ_{k=3}^{7} k^10 = 3^10 + 4^10 + 5^10 + 6^10 + 7^10

• Then, if we let S = {3, 6, 8, 11}, for Σ_{y∈S} (2y + 5), we multiply each value y in S by 2, add 5, and
then sum the results together. That is:

Σ_{y∈S} (2y + 5) = (2 · 3 + 5) + (2 · 6 + 5) + (2 · 8 + 5) + (2 · 11 + 5)

• For the sum of a constant, Σ_{t=6}^{8} 4, we add the constant 4 for each value t = 6, 7, 8. This is equivalent
to just adding 4 together three times:

Σ_{t=6}^{8} 4 = 4 + 4 + 4

• Then, for a range with no values, the sum is defined as 0. For Σ_{z=2}^{1} sin(z), because there are no
values from 2 to 1, we have:

Σ_{z=2}^{1} sin(z) = 0
• Finally, if we let T = {−1, −3, 5}, for Σ_{x∈T} 13x, we multiply each value of x in T by 13 and then sum
them up:

Σ_{x∈T} 13x = 13(−1) + 13(−3) + 13(5)
            = 13(−1 + (−3) + 5)
            = 13 Σ_{x∈T} x
Notice that we can actually factor out the 13; that is, we could sum all values of x 2 T first, and then
multiply by 13. This is one of a few properties of summations we can see below!
Further, the associative and distributive properties hold for sums. If you squint hard enough, you can kind
of see why they’re true! We’ll also see some examples below too, since the notation can be confusing at first.
We have the associative property (0.3.3) and distributive properties (0.3.4, 0.3.5) for sums.

Σ_{x∈A} f(x) + Σ_{x∈A} g(x) = Σ_{x∈A} (f(x) + g(x))    (0.3.3)

Σ_{x∈A} αf(x) = α Σ_{x∈A} f(x)    (0.3.4)

(Σ_{x∈A} f(x)) (Σ_{y∈B} g(y)) = Σ_{x∈A} Σ_{y∈B} f(x)g(y)    (0.3.5)
The last property is like FOIL - if you multiply (x + x^2 + x^3)(1/y + 1/y^2) (left-hand side), for example, you
would have to sum over every possible combination x/y + x/y^2 + x^2/y + x^2/y^2 + x^3/y + x^3/y^2 (right-hand
side).
The proof of these are left to the reader, but see the examples below for some intuition!
Example(s)
“Prove” the following by writing out the sums:
• Σ_{i=5}^{7} i + Σ_{i=5}^{7} i^2 = Σ_{i=5}^{7} (i + i^2)
• Σ_{j=3}^{5} 2j = 2 Σ_{j=3}^{5} j
• (Σ_{i=1}^{2} f(a_i)) (Σ_{j=1}^{3} g(b_j)) = Σ_{i=1}^{2} Σ_{j=1}^{3} f(a_i)g(b_j)
Solution
Σ_{i=5}^{7} i + Σ_{i=5}^{7} i^2 = (5 + 6 + 7) + (5^2 + 6^2 + 7^2) = (5 + 5^2) + (6 + 6^2) + (7 + 7^2) = Σ_{i=5}^{7} (i + i^2)

Σ_{j=3}^{5} 2j = 2 · 3 + 2 · 4 + 2 · 5 = 2(3 + 4 + 5) = 2 Σ_{j=3}^{5} j
(Σ_{i=1}^{2} f(a_i)) (Σ_{j=1}^{3} g(b_j)) = (f(a_1) + f(a_2)) (g(b_1) + g(b_2) + g(b_3))
= f(a_1)g(b_1) + f(a_1)g(b_2) + f(a_1)g(b_3) + f(a_2)g(b_1) + f(a_2)g(b_2) + f(a_2)g(b_3)
= Σ_{i=1}^{2} Σ_{j=1}^{3} f(a_i)g(b_j)
Further, if S is a set, and f : S → R is a function defined on S, then the following notation multiplies
f(x) over all elements x ∈ S:

Π_{x∈S} f(x)
Note that the product over no terms is defined as 1 (not 0 like it was for sums).
Example(s)
Write out the following products:
• Π_{a=4}^{7} a
• Π_{x∈S} 8, for S = {3, 6, 8, 11}
• Π_{z=2}^{1} sin(z)
• Π_{b=2}^{5} 9^(1/b)
Solution
• For Π_{a=4}^{7} a, we multiply each value a in the range 4 to 7 and have:

Π_{a=4}^{7} a = 4 · 5 · 6 · 7
• Then, if we let S = {3, 6, 8, 11}, for Π_{x∈S} 8, we multiply 8 for each value in the set S and have:

Π_{x∈S} 8 = 8 · 8 · 8 · 8

• Then for Π_{z=2}^{1} sin(z), we have the empty product, because there are no values in the range 2 to 1,
so we have:

Π_{z=2}^{1} sin(z) = 1

• Finally, for Π_{b=2}^{5} 9^(1/b), we multiply the values of 9^(1/b) for each b from 2 to 5, to get:

Π_{b=2}^{5} 9^(1/b) = 9^(1/2) · 9^(1/3) · 9^(1/4) · 9^(1/5)
                    = 9^(1/2 + 1/3 + 1/4 + 1/5)
                    = 9^(Σ_{b=2}^{5} 1/b)
Also, if you were to do the same examples as we did for sums, replacing Σ with Π, you just multiply instead
of add! They are almost identical, except the empty sum is 0 and the empty product is 1.
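Python’s built-ins happen to follow the same conventions, which makes for a quick sanity check (a small sketch of ours; math.prod requires Python 3.8+):

import math

print(sum([]))        # 0 -- the empty sum
print(math.prod([]))  # 1 -- the empty product

# The 9^(1/b) example above: a product of powers is 9 raised to the sum of the exponents.
lhs = math.prod(9 ** (1 / b) for b in range(2, 6))
rhs = 9 ** sum(1 / b for b in range(2, 6))
print(math.isclose(lhs, rhs))  # True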
Chapter 1. Combinatorial Theory
1.1: So You Think You Can Count?
Slides (Google Drive) Video (YouTube)
Before we jump into probability, we must first learn a little bit of combinatorics, or more informally, counting.
You might wonder how this is relevant to probability, and we’ll see how very soon. You might also think
that counting is for kindergarteners, but it is actually a lot harder than you think!
To motivate us, let’s consider how easy or difficult it is for a robber to randomly guess your PIN code. Every
debit card has a PIN code that their owners use to withdraw cash from ATMs or to complete transactions.
How secure are these PINs, and how safe can we feel?
If an experiment can either end up being one of N outcomes, or one of M outcomes (where there is
no overlap), then the number of possible outcomes of the experiment is:
N +M
We’ll see some examples of the Sum Rule combined with the Product Rule (next), which can be a bit
more complex!
For example, suppose we have 3 different tops and 4 different bottoms, and an outfit consists of one top
and one bottom. Well, we can consider this by first picking out a top. Once we have our top, we have 4
choices for our bottom. This means we have 4 choices of bottom for each top, of which we have 3. So, we
have a total of 4 + 4 + 4 = 3 × 4 = 12 outfit choices.
We could also do this in reverse and first pick out a bottom. Once we have our bottom, we have 3 choices
for our top. This means we have 3 choices of top for each bottom, which we have 4 of. So, we still have a
total of 3 + 3 + 3 + 3 = 4 × 3 = 12 outfit choices. (This makes sense - the number of outfits should be the
same no matter how I count!)
What if we also wanted to add socks to the outfit, and we had 2 different pairs of socks? Then, for each of
the 12 choices outlined above, we now have 2 choices of sock. This brings us to a total of 24 possible outfits.
This could be calculated more directly rather than drawing out each of these unique outfits, by multiplying
our choices: 3 tops × 4 bottoms × 2 socks = 24 outfits.
If this still sounds “simple” to you or you just want to practice, see the examples below! There are some
pretty interesting scenarios we can count, and they are more difficult than you might expect.
Example(s)
Flamingos Fanny and Freddy have three offspring: Happy, Glee, and Joy. These five flamingos are
to be distributed to seven different zoos so that no zoo gets both a parent and a child :(. It is not
required that every zoo gets a flamingo. In how many different ways can this be done?
Solution There are two disjoint (mutually exclusive) cases we can consider that cover every possibility. We
can use the sum rule to add them up since they don’t overlap!
1. Case 1: The parents end up in the same zoo. There are 7 choices of zoo they could end up
at. Then, the three offspring can go to any of the 6 other zoos, for a total of 7 × 6 × 6 × 6 = 7 × 6^3
possibilities (by the product rule).
2. Case 2: The parents end up in different zoos. There are 7 choices for Fanny and 6 for Freddy.
Then, the three offspring can go to any of the 5 other zoos, for a total of 7 × 6 × 5^3 possibilities.
The result, by the sum rule, is 7 × 6^3 + 7 × 6 × 5^3. (Note: This may not be the only way to solve this problem.
Often, counting problems have two or more approaches, and it is instructive to try different methods to get
the same answer. If they differ, at least one of them is wrong, so try to find out which one and why!)
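Counting arguments like this are also easy to sanity-check by brute force when the numbers are small. Here is a sketch of ours that enumerates all 7^5 assignments of the five flamingos to zoos (the index assignments are our own choice, not from the text):

from itertools import product

parents = [0, 1]        # Fanny, Freddy
children = [2, 3, 4]    # Happy, Glee, Joy
count = 0
for zoos in product(range(7), repeat=5):   # zoos[i] = zoo assigned to flamingo i
    # valid iff no zoo holds both a parent and a child
    if not any(zoos[p] == zoos[c] for p in parents for c in children):
        count += 1
print(count, 7 * 6**3 + 7 * 6 * 5**3)      # both print 6762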
1.1.3 Permutations
Back to the example of the debit card. There are 10 possible digits for each of the 4 digits of a PIN. So
how many possible 4-digit PINs are there? This can be solved as 10 × 10 × 10 × 10 = 10^4 = 10,000. So,
there is a one in ten thousand chance that a robber can guess your PIN code (randomly).
Let’s consider a stronger case where you must use each digit exactly once, so the PIN is exactly 10 digits
long. How many such PINs exist?
Well, we have 10 choices for the first digit, 9 choices for the second digit, and so forth, until we only have 2
choices for the ninth digit, and 1 choice for the tenth digit. This means there are 3,628,800 possible PINs in
this scenario, as follows:

10 × 9 × ··· × 2 × 1 = Π_{i=1}^{10} i = 3,628,800
This formula/pattern seems like it would appear often! Wouldn’t it be great if there were a shorthand for
this?
Definition 1.1.3: Permutation
The number of orderings of N distinct objects is called a permutation, and is mathematically defined
as:

N! = N × (N − 1) × (N − 2) × ... × 3 × 2 × 1 = Π_{j=1}^{N} j

N! is read as “N factorial”. It is important to note that 0! = 1, since there is one way to arrange 0
objects.
Example(s)
A standard 52-card deck consists of one of each combination of: 13 different ranks (Ace, 2, 3, ..., 10,
Jack, Queen, King) and 4 different suits (clubs, diamonds, hearts, spades), since 13 × 4 = 52. In how
many ways can a 52-card deck be dealt to thirteen players, four cards to each, so that every player has
one card of each suit?
Solution This is a great example where we can try two equivalent approaches. Each person usually has
different preferences, and sometimes one way is significantly easier to understand than another. Read them
both, understand why they both make sense and are equal, and figure out which approach is more intuitive
for you!
Let’s assign each player one at a time. The first player has 13 choices for the club, 13 for the heart, 13 for
the diamond, and 13 for the spade, for a total of 13^4 ways. The second player has 12^4 choices (since there
are only 12 of each suit remaining). And so on, so the answer is 13^4 × 12^4 × 11^4 × ... × 2^4 × 1^4.
Alternatively, we can assign each suit one at a time. For the clubs suit, there are 13! ways to distribute them
to the 13 different players. Then, the diamonds suit can be assigned in 13! ways as well, and same for the
other two suits. By the product rule, the total number of ways is (13!)^4. Check that this different order of
assigning cards gave the same answer as earlier! (Expand the factorials.)
Example(s)
A group of n families, each with m members, are to be lined up for a photograph. In how many ways
can the nm people be arranged if members of a family must stay together?
Solution We first choose the ordering of the families, of which there are n!. Then, in the first family, we have
m! ways to arrange them. The second family also has m! ways to be arranged. And so on. By the product
rule, the number of orderings is n! × (m!)^n.
Now consider this question: how many 10-digit PINs have at least one digit repeated? One approach might
be to count how many PINs don’t satisfy this property, and subtract it from the total number of PINs. This
strategy is called complementary counting, as we are counting the size of the complement of the set of
interest. The number of possible 10-digit PINs, with no stipulations, is 10^10 (from the product rule,
multiplying 10 choices with itself for each of 10 positions). Then, we found above that the 10-digit PINs
with no repeats have 10! possibilities (each digit used exactly once). Well, consider that the 10-digit PINs
with at least one repeat will be all other possibilities (they could have one, two, or more repeats but
certainly won’t have none). This means that we can count this by taking the difference of all the possible
10-digit PINs and those with no repeats. That is:

10^10 − 10!
Let U be a (finite) universal set, and S a subset of interest. Let S^C = U \ S denote the set difference
(complement of S). Then,

|S| = |U| − |S^C|

Informally, to find the number of ways to do something, we could count the number of ways NOT
to do that thing, and subtract it from the total. That is, the complement of the subset of interest is
also of interest!
1.1.5 Exercises
1. Suppose we have 6 people who want to line up in a row for a picture, but two of them, A and B, refuse
to sit next to each other. How many ways can they sit in a row?
Solution: There are two equivalent approaches. The first approach is to solve it directly. However,
depending on where A sits, B has a different number of options (whether A sits at the end or in the
middle). So we have two disjoint (non-overlapping) cases:
(a) Case 1: A sits at one of the two end seats. Then, A has 2 choices for where to sit, and B has 4.
(See this diagram where A sits at the right end: _ _ _ _ _ A.) Then, there are 4! ways for the
remaining people to sit, for a total of 2 × 4 × 4! ways.
(b) Case 2: A sits in one of the middle 4 seats. Then, A has 4 choices of seat, but B only has three
choices for where to sit. (See this diagram where A sits in a middle seat: _ _ A _ _ _.) Again,
there are 4! ways to seat the rest, for a total of 4 × 3 × 4! ways.
The alternative approach is complementary counting. We can count the total orderings, of which
there are 6!, and subtract the cases where A and B do sit next to each other. There’s a trick we can
do to guarantee this: let’s treat A and B as a single entity. Then, along with the remaining 4 people,
there are only 5 entities. We order the entities in 5! ways, but also multiply by 2! since we could have
the seating AB or BA. Hence, the number of ways they do sit together is 2 × 5! = 240, and the ways
they do not sit together is 6! − 240 = 720 − 240 = 480.
Decide which approach you liked better - oftentimes, one method will be easier than another!
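Both approaches are easy to verify by brute force; here is a small sketch of ours that enumerates all 6! seatings:

from itertools import permutations

count = 0
for order in permutations("ABCDEF"):
    if abs(order.index("A") - order.index("B")) > 1:   # A and B are not adjacent
        count += 1
print(count)   # 480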
Chapter 1. Combinatorial Theory
1.2: More Counting
Slides (Google Drive) Video (YouTube)
1.2.1 k-Permutations
Last time, we learned the foundational techniques for counting (the sum and product rule), and the factorial
notation which arises frequently. Now, we’ll learn even more “shortcuts”/“notations” for common counting
situations, and tackle more complex problems.
We’ll start with a simpler situation than most of the exercises from last time. How many 3-color mini
rainbows can be made out of 7 available colors, with all 3 being different colors?
We choose an outer color, then a middle color and then an inner color. There are 7 possibilities for the outer
layer, 6 for the middle and 5 for the inner (since we cannot have duplicates). Since order matters, we find
that the total number of possibilities is 210, from the following calculation:
7 · 6 · 5 = (7 · 6 · 5)/1 · (4 · 3 · 2 · 1)/(4 · 3 · 2 · 1)    [multiply numerator and denominator by 4! = 4 · 3 · 2 · 1]
         = 7!/4!    [def of factorial]
         = 7!/(7 − 3)!
Notice that we are “picking” 3 out of 7 available colors - so order matters. This may not seem useful, but
imagine if there were 835 colors and we wanted a rainbow with 135 di↵erent colors. You would have to
multiply 135 numbers, rather than just three!
If we want to arrange only k out of n distinct objects, the number of ways to do so is P(n, k) (read
as “n pick k”), where

P(n, k) = n · (n − 1) · (n − 2) · ... · (n − k + 1) = n!/(n − k)!
A permutation of n objects is an arrangement of each object (where order matters), so a k-
permutation is an arrangement of k members of a set of n members (where order matters). The
number of k-permutations of n objects is just P(n, k).
Example(s)
Suppose we have 13 chairs (in a row) with 9 TAs, and Professors Sunny, Rainy, Windy, and Cloudy
to be seated. What is the number of seatings where every professor has a TA to his/her immediate
left and right?
Solution This is quite a tricky problem if we don’t choose the right setup. Imagine we first seat the 9 TAs
- there are 9! ways to do this. Then, there are 8 spots between them, so that if we place a professor there,
they’re guaranteed to have a TA to their immediate left and right. We can’t place more than one professor
in a spot. Out of the 8 spots, we pick 4 of them for the professors to sit in (order matters, since the professors
are different people). So the answer by the product rule is 9! · P(8, 4).
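As a side note, Python’s math module (version 3.8+) computes “n pick k” directly, which makes answers like these quick to evaluate; a small sketch:

import math

print(math.perm(7, 3))                        # 210, the number of 3-color mini-rainbows
print(math.factorial(9) * math.perm(8, 4))    # the TA/professor seating count above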
Recall that there were P(7, 3) = 210 possible mini-rainbows. But as we see from these rainbows, each
“smeared” color is counted 3! = 6 times. So, to get our answer, we take the 210 mini-rainbows and divide
by 6 to account for the overcounting since in this case, order doesn’t matter.
The answer is,

210/6 = P(7, 3)/3! = 7!/(3!(7 − 3)!)
If we want to choose (order doesn’t matter) only k out of n distinct objects, the number of ways to
do so is C(n, k) (read as “n choose k”), where

C(n, k) = P(n, k)/k! = n!/(k!(n − k)!)

A k-combination is a selection of k objects from a collection of n objects, in which the order does
not matter. The number of k-combinations of n objects is just C(n, k). C(n, k) is also called a binomial
coefficient - we’ll see why in the next section.
Notice, we can show from this that there is symmetry in the definition of binomial coefficients:

C(n, k) = n!/(k!(n − k)!) = n!/((n − k)!k!) = C(n, n − k)
Looking at these, we can see that the color choices in each row are complementary. Intuitively, choosing
1 color is the same as choosing 4 − 1 = 3 colors that we don’t want - and vice versa. This explains the
symmetry in binomial coefficients!
Example(s)
There are 6 AI professors and 7 theory professors taking part in an escape room. If 4 AI
professors and 4 theory professors are to be chosen and divided into 4 pairs (one AI professor with
one theory professor per pair), how many pairings are possible?
Solution We first choose 4 out of 6 AI professors, with order not mattering, and 4 out of 7 theory professors,
again with order not mattering. There are C(6, 4) · C(7, 4) ways to do this by the product rule. Then, for the
first theory professor, we have 4 choices of AI professor to match with; for the second theory professor, we
only have 3 choices, and so on. So we multiply by 4! to pair them off, and we get C(6, 4) · C(7, 4) · 4!. You
may have counted it differently, but check if your answer matches!
But if we want to rearrange the letters in “POOPOO”, we have indistinct letters (two types - P and O).
How do we approach this?
One approach is to choose where the 2 P’s go, and then the O’s have to go in the remaining 4 spots
(C(4, 4) = 1 way). Or, we can choose where the 4 O’s go, and then the remaining P’s are set (C(2, 2) = 1 way).
Either way, we get,

C(6, 2) · C(4, 4) = C(6, 4) · C(2, 2) = 6!/(2!4!)

Another interpretation of this formula is that we are first arranging the 6 letters as if they were distinct:
P1 O1 O2 P2 O3 O4. Then, we divide by 4! and 2! to account for 4 duplicate O’s and 2 duplicate P’s.
What if we got even more complex, let’s say three different letters? For example, rearranging the word
“BABYYYBAY”. There are 3 B’s, 2 A’s, and 4 Y’s, for a total of 9 letters. We can choose where the 3 B’s
should go out of the 9 spots: C(9, 3) (order doesn’t matter since all the B’s are identical). Then out of the
remaining 6 spots, we should choose 2 for the A’s: C(6, 2). Finally, out of the 4 remaining spots, we put the
4 Y’s there: C(4, 4) = 1. By the product rule, our answer is

C(9, 3) · C(6, 2) · C(4, 4) = (9!/(3!6!)) · (6!/(2!4!)) · (4!/(4!0!)) = 9!/(3!2!4!)

Note that we could have chosen to assign the Y’s first instead: out of 9 positions, we choose 4 to be Y:
C(9, 4). Then from the 5 remaining spots, choose where the 2 A’s go: C(5, 2), and the last three spots must
be B’s: C(3, 3) = 1. This gives us the equivalent answer

C(9, 4) · C(5, 2) · C(3, 3) = (9!/(4!5!)) · (5!/(2!3!)) · (3!/(3!0!)) = 9!/(3!2!4!)
This shows once again that there are many correct ways to count something. This type of problem also
frequently appears, and so we have a special notation (called a multinomial coefficient)
(9 choose 3, 2, 4) = 9!/(3!2!4!)
Note the order of the bottom three numbers does not matter (since the multiplication in the denominator is
commutative), and that the bottom numbers must add up to the top number.
If we have k types of objects (n total), with n1 of the first type, n2 of the second, ..., and nk of the
k-th type, then the number of distinct arrangements of the n objects is the multinomial coefficient:

(n choose n1, n2, ..., nk) = n!/(n1! n2! ··· nk!)

Above, we had k = 3 types of objects (B, A, Y) with n1 = 3 (number of B’s), n2 = 2 (number of A’s), and
n3 = 4 (number of Y’s), for an answer of (9 choose n1, n2, n3) = 9!/(3!2!4!).
Example(s)
How many ways can we arrange the letters of a 7-letter word which has 3 G’s, 2 O’s, 1 D, and 1 Y?
Solution There are n = 7 letters. There are only k = 4 distinct letters - {G, O, D, Y}.
n1 = 3 - there are 3 G’s.
n2 = 2 - there are 2 O’s.
n3 = 1 - there is 1 D.
n4 = 1 - there is 1 Y.
This gives us the number of possible arrangements:

(7 choose 3, 2, 1, 1) = 7!/(3!2!1!1!)
It is important to note that even though the 1’s are “useless” since 1! = 1, we still must write every number
on the bottom since they have to add to the top number.
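Multinomial counts are simple to verify by collecting the distinct permutations of a small word. In this sketch we use “GGGOODY” as a hypothetical stand-in (any word with 3 G’s, 2 O’s, 1 D, 1 Y works):

from itertools import permutations
from math import factorial

word = "GGGOODY"   # hypothetical word with the right letter counts
distinct = len(set(permutations(word)))
formula = factorial(7) // (factorial(3) * factorial(2) * factorial(1) * factorial(1))
print(distinct, formula)   # both print 420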
How many ways can we give 5 (indistinguishable) candies to these 3 (distinguishable) kids? Here are three
possible distributions of candy:
Notice that the second and third pictures show different possible distributions, since the kids are
distinguishable (different). Any idea on how we can tackle this problem?
The idea here is that we will count something equivalent. Let’s say there are 5 “stars” for the 5 candies
and 2 “bars” for the dividers (dividing 3 kids). For instance, this distribution of candies corresponds to this
arrangement of 5 stars and 2 bars:
Here is another example of the correspondence between a distribution of candies and the arrangement of
stars and bars:
For each candy distribution, there is exactly one corresponding way to arrange the stars and bars. Conversely,
for each arrangement of stars and bars, there is exactly one candy distribution it represents.
Hence, the number of ways to distribute 5 candies to the 3 kids is the number of arrangements of 5 stars
and 2 bars.
This is simply

C(7, 2) = C(7, 5) = 7!/(2!5!)
Amazing right? We just reduced this candy distribution problem to reordering letters!
In general, the number of ways to distribute n indistinguishable balls into k distinguishable bins is

C(n + k − 1, k − 1) = C(n + k − 1, n)

since we set up n stars for the n balls, and k − 1 bars dividing the k bins.
Example(s)
How many ways can we assign 20 students to 4 different professors? Assume the students are
indistinguishable to the professors, who only care how many students they have, and not which ones.
Solution This is actually the perfect setup for stars and bars. We have 20 stars (students) and 3 bars
(dividing the 4 professors), and so our answer is C(23, 3) = C(23, 20).
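Here is a brute-force check of ours: count the nonnegative solutions to x1 + x2 + x3 + x4 = 20 directly and compare against C(23, 3) (math.comb requires Python 3.8+).

from math import comb

count = 0
for x1 in range(21):                      # students for professor 1
    for x2 in range(21 - x1):             # professor 2
        for x3 in range(21 - x1 - x2):    # professor 3; professor 4 gets the rest
            count += 1
print(count, comb(23, 3))                 # both print 1771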
1.2.5 Exercises
1. There are 40 seats and 40 students in a classroom. Suppose that the front row contains 10 seats, and
there are 5 students who must sit in the front row in order to see the board clearly. How many seating
arrangements are possible with this restriction?
Solution: Again, there may be many correct approaches. We can first choose which 5 out of the
10 seats in the front row we want to use, and we have C(10, 5) ways of doing this. Then, assign those 5
students to these seats, for which there are 5! ways. Finally, assign the other 35 students in any way,
for 35! ways. By the product rule, there are C(10, 5) · 5! · 35! ways to do so.
2. If we roll a fair 3-sided die 11 times, what is the number of ways that we can get 4 1’s, 5 2’s, and 2 3’s?
Solution: We can write the outcomes as a sequence of length 11, each digit of which is 1, 2 or
3. Hence, the number of ways to get 4 1’s, 5 2’s, and 2 3’s, is the number of orderings of 11112222233,
which is (11 choose 4, 5, 2) = 11!/(4!5!2!).
3. These two problems are almost identical, but have drastically different approaches to them. These are
both extremely hard/tricky problems, though they may look deceivingly simple. These are probably
the two coolest problems I’ve encountered in counting, as they do have elegant solutions!
(a) How many 7-digit phone numbers are such that the numbers are strictly increasing (digits must
go up)? (e.g., 014-5689, 134-6789, etc.)
(b) How many 7-digit phone numbers are such that the numbers are monotone increasing (digits can
stay the same or go up)? (e.g., 011-5566, 134-6789, etc.) Hint: Reduce this to stars and bars.
Solution:
(a) We choose 7 out of 10 digits, which has C(10, 7) possibilities, and then once we do, there is only 1 valid
ordering (must put them in increasing order). Hence, the answer is simply C(10, 7). This question
has a deceivingly simple solution, as many students (including myself at one point) would have
started by choosing the first digit. But the choices for the next digit depend on the first digit.
And so on for the third. This leads to a complicated, nearly unsolvable mess!
(b) This is a very difficult problem to frame in terms of stars and bars. We need to map one phone
number to exactly one ordering of stars and bars, and vice versa. Consider letting the 9 bars be
an increase from one digit to the next, and 7 stars for the 7 digits. This is extremely complicated,
so we’ll give 3 examples of what we mean.
i. The phone number 011-5566 is represented as *|**||||**|**|||. We start a counter at 0; we
see a digit first (a star), so we mark down 0. Then we see a bar, which tells us to increase
our counter to 1. Then, two more digits (stars), which say to mark down two 1’s. Then, 4 bars,
which tell us to increase the count from 1 to 5. Then two stars for the next two 5’s, and a bar to
increase to 6. Then, two stars indicate to put down two 6’s. Then, we increment the count to 9 but
don’t put down any more digits.
ii. The phone number 134-6789 is represented as |*||*|*||*|*|*|*. We start a counter at 0,
and we see a bar first, so we increase the count to 1. Then a star tells us to actually write down
1 as our first digit. The two bars tell us to increase the count from 1 to 3. The star says to mark a
3 down now. Then, a bar to increase to 4. Then a star to write down 4. Two bars to increase
to 6. And so on.
iii. The stars and bars ordering ||||*|****||*||* represents the phone number 455-5579. We
start a counter at 0. We see 4 bars, so we increment to 4. The star says to mark down a 4.
Then we increment the count by 1 to 5 due to the next bar. Then, we mark 5 down 4 times (4 stars).
Then we increment the count by 2, put down a 7, and repeat to put down a 9.
Hence there is a bijection between these phone numbers and arrangements of 7 stars and 9 bars.
So the number of satisfying phone numbers is C(16, 7) = C(16, 9).
Chapter 1. Combinatorial Theory
1.3: No More Counting Please
Slides (Google Drive) Video (YouTube)
In this section, we don’t really have a nice successive ordering where one topic leads to the next as we did
earlier. This section serves as a place to put all the final miscellaneous but useful concepts in counting.
(x + y)^2 = (x + y)(x + y)
          = xx + xy + yx + yy    [FOIL]
          = x^2 + 2xy + y^2
But, let’s say that we wanted to do this for a binomial raised to some higher power, say (x + y)4 . There
would be a lot more terms, but we could use a similar approach.
But what are the terms exactly that are included in this expression? And how could we combine the
like-terms?
Notice that each term will be a mixture of x’s and y’s. In fact, each term will be of the form x^k y^(n−k) (in
this case n = 4). This is because there will be exactly n x’s or y’s in each term, so if there are k x’s, then
there must be n − k y’s. That is, we will have terms of the form x^4, x^3 y, x^2 y^2, x y^3, y^4, with most
appearing more than once.
For a specific k though, how many times does x^k y^(n−k) appear? For example, in the above case, take k = 1;
then note that xyyy = yxyy = yyxy = yyyx = x y^3, so x y^3 will appear with a coefficient of 4 in the final
simplified form (just like for (x + y)^2 the term xy appears with a coefficient of 2). Does this look familiar? It
should remind you yet again of rearranging words with duplicate letters!
Now, we can generalize this: the number of terms that simplify to x^k y^(n−k) will be equal to the number
of ways to choose exactly k of the binomials to give us x (and let the remaining n − k give us y). Alternatively,
we need to arrange k x’s and n − k y’s. To think of this in the above example with k = 1 and n = 4, we
considered which of the four binomials would give us the single x (the first, second, third, or fourth), for
a total of C(4, 1) = 4.
Let’s consider k = 2 in the above example. We want to know how many terms are equivalent to x^2 y^2. Well,
we then have xxyy = yxxy = yyxx = xyxy = yxyx = xyyx = x^2 y^2, so there are six ways, and the coefficient
on the simplified term x^2 y^2 will be C(4, 2) = 6.
Notice that we are essentially choosing which of the binomials gives us an x such that k of the n binomials
do. That is, the coefficient for x^k y^(n−k), where k ranges from 0 to n, is simply C(n, k). This is why it is also
called a binomial coefficient.
The Binomial Theorem states that for any numbers x, y and nonnegative integer n:

(x + y)^n = Σ_{k=0}^{n} C(n, k) x^k y^(n−k)

This essentially states that in the expansion of the left side, the coefficient of the term with x raised to the
power of k and y raised to the power of n − k will be C(n, k), and we know this because we are considering
the number of ways to choose k of the n binomials in the expression to give us x.
This can also be proved by induction, but this is left as an exercise for the reader.
Example(s)
Calculate the coefficient of a^45 b^14 in the expansion of (4a^3 − 5b^2)^22.
Solution Let x = 4a^3 and y = −5b^2. Then, we are looking for the coefficient of x^15 y^7 (because x^15
gives us a^45 and y^7 gives us b^14), which is C(22, 15). So we have the term

C(22, 15) x^15 y^7 = C(22, 15) (4a^3)^15 (−5b^2)^7 = (−C(22, 15) · 4^15 · 5^7) a^45 b^14

and our answer is −C(22, 15) · 4^15 · 5^7.
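A short sketch of ours to evaluate this coefficient numerically (math.comb requires Python 3.8+):

from math import comb

coeff = comb(22, 15) * 4**15 * (-5) ** 7   # C(22,15) * 4^15 * (-5)^7
print(coeff)                               # a large negative integer, since (-5)^7 < 0
print(coeff == -comb(22, 15) * 4**15 * 5**7)   # True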
1.3.2 Inclusion-Exclusion
Say we did an anonymous survey where we asked whether students in CSE312 like ice cream, and found
that 43 people liked ice cream. Then we did another anonymous survey where we asked whether students in
CSE312 liked donuts, and found that 20 people liked donuts. With this information can we determine how
many people like ice cream or donuts (or both)?
Let A be the set of people who like ice cream, and B the set of people who like donuts. The sum rule from 1.1
said that, if A, B were mutually exclusive (it wasn’t possible to like both donuts and ice cream: A ∩ B = ∅),
then we could just add them up: |A ∪ B| = |A| + |B| = 43 + 20 = 63. But this is not the case, since it is
possible to like both. We can’t quite figure this out yet without knowing how many people overlapped:
the size of A ∩ B.
So, we did another anonymous survey in which we asked whether students in CSE312 like both ice cream
and donuts, and found that only 7 people like both. Now, do we have enough information to determine how
many students like ice cream or donuts?
Yes! Knowing that 43 people like ice cream and 7 people like both ice cream and donuts, we can conclude
that 36 people like ice cream but don’t like donuts. Similarly, knowing that 20 people like donuts and 7
people like both ice cream and donuts, we can conclude that 13 people like donuts but don’t like ice cream.
This leaves us with the following picture, where A is the students who like ice cream. B is the students who
like donuts (this implies |A \ B| = 7 is the number of students who like both):
|A| = 43
|B| = 20
|A ∩ B| = 7
Now, to go back to the question of how many students like either ice cream or donuts, we can just add up
the 36 people that just like ice cream, the 7 people that like both ice cream and donuts, and the 13 people
that just like donuts, and get 36 + 7 + 13 = 56. Alternatively, we could consider this as adding up the 43
people who like ice cream (including both the 36 who just like ice cream and the 7 who like both)
and the 20 people who like donuts (including the 13 who just like donuts and the 7 who like both), and then
subtracting the 7 who like both since they were counted twice. That is, 43 + 20 − 7 = 56. That leaves us
with:

|A ∪ B| = 36 + 7 + 13 = 56 = 43 + 20 − 7 = |A| + |B| − |A ∩ B|

Recall that |A ∪ B| is the students who like donuts or ice cream (the union of the two sets). In general, for
any two sets A and B:

|A ∪ B| = |A| + |B| − |A ∩ B|

For n sets, the size of the union A1 ∪ A2 ∪ ··· ∪ An is: singles − doubles + triples − quads + ...,
where singles are the sizes of all the single sets (C(n, 1) terms), doubles are the sizes of all the intersections
of two sets (C(n, 2) terms), triples are the sizes of all the intersections of three sets (C(n, 3) terms), quads are
all the intersections of four sets, and so forth.
Example(s)
How many numbers in the set [360] = {1, 2, . . . , 360} are divisible by:
1. 4, 6, and 9.
2. 4, 6 or 9.
3. neither 4, 6, nor 9.
Solution
1. This is just the multiples of lcm(4, 6, 9) = 36, of which there are 360/36 = 10.
2. Let Di be the set of numbers in [360] which are divisible by i, for i = 4, 6, 9. Hence, the number of
numbers which are divisible by 4, 6, or 9 is |D4 ∪ D6 ∪ D9|. We can apply inclusion-exclusion (singles
− doubles + triples), noting that each intersection contains exactly the multiples of the corresponding lcm:
|D4 ∪ D6 ∪ D9| = 90 + 60 + 40 − 30 − 10 − 20 + 10 = 140
(multiples of 4, 6, 9; minus multiples of 12, 36, 18; plus multiples of 36).
3. By complementary counting, the answer is 360 − |D4 ∪ D6 ∪ D9| = 360 − 140 = 220.
Many times it may be possible to avoid this ugly mess using complementary counting, but sometimes it isn’t.
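A one-pass brute-force check of ours over [360] confirms all three counts:

div_all = sum(1 for n in range(1, 361) if n % 36 == 0)   # divisible by 4, 6, AND 9
div_any = sum(1 for n in range(1, 361) if n % 4 == 0 or n % 6 == 0 or n % 9 == 0)
print(div_all, div_any, 360 - div_any)                   # 10 140 220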
The floor function ⌊x⌋ returns the largest integer ≤ x (i.e., it rounds down).
The ceiling function ⌈x⌉ returns the smallest integer ≥ x (i.e., it rounds up). Note the difference is just
whether the bracket is on top (ceiling) or bottom (floor).
Example(s)
What are ⌊2.5⌋, ⌈2.5⌉, ⌊−2.5⌋, and ⌈−2.5⌉?
Solution
⌊2.5⌋ = 2 and ⌈2.5⌉ = 3. Be careful with negative numbers: ⌊−2.5⌋ = −3 (the largest integer ≤ −2.5),
and ⌈−2.5⌉ = −2.
If there are n pigeons we want to put into k pigeonholes (where n > k), then at least one pigeonhole must
contain at least 2 pigeons.
More generally, if there are n pigeons we want to put into k pigeonholes, then at least one pigeonhole
must contain at least ⌈n/k⌉ pigeons.
This fact or rule may seem trivial to you, but the hard part of pigeonhole problems is knowing how to apply
it. See the examples below!
Example(s)
Show that there exists a number made up of only 1’s (e.g., 1111 or 11) which is divisible by 333.
Solution Consider the sequence of 334 numbers x1, x2, x3, ..., x334 where xi is the number made of exactly
i 1’s (e.g., x2 = 11, x5 = 11,111, etc.). We’ll use the notation xi = 1^i to mean i 1’s concatenated together.
The number of possible remainders when dividing by 333 is 333: {0, 1, 2, ..., 332}, so by the pigeonhole
principle, since 334 > 333, two numbers xi and xj have the same remainder (suppose i < j without loss
of generality) when divided by 333. The number xj − xi is of the form 1^(j−i) 0^i; that is, j − i 1’s followed
by i 0’s (e.g., x5 − x2 = 11111 − 11 = 11100 = 1^3 0^2). This number must be divisible by 333 because
xi ≡ xj (mod 333) implies (xj − xi) ≡ 0 (mod 333).
Now, keep deleting zeros (by dividing by 10) until there aren’t any more left - this doesn’t affect whether or
not 333 goes in, since neither 2 nor 5 divides 333. Now we’re left with a number divisible by 333 made up of
all ones (1^(j−i) to be exact)!
Note that 333 was not special - we could have used any number that wasn’t divisible by 2 nor 5.
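The pigeonhole argument only shows existence; a tiny search of ours finds the smallest such number directly:

x = 0
for i in range(1, 1000):
    x = 10 * x + 1          # x is now the number made of exactly i 1's
    if x % 333 == 0:
        print(i)            # prints 9: 111,111,111 = 333 * 333,667
        break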
Example(s)
Show that in a group of n people (who may be friends with any number of other people), two must
have the same number of friends.
Solution Each of the n people has some number of friends among the others, between 0 and n − 1. But 0
and n − 1 can’t both occur (if someone is friends with all n − 1 others, then no one can have 0 friends), so
at most n − 1 distinct values actually occur among the n people. By the pigeonhole principle, two people
must have the same number of friends.
1.3.4 Combinatorial Proofs
Suppose we want to prove the identity C(n − 1, k − 1) + C(n − 1, k) = C(n, k). One approach is algebraic:

C(n − 1, k − 1) + C(n − 1, k) = (n − 1)!/((k − 1)!(n − k)!) + (n − 1)!/(k!(n − 1 − k)!)    [def of binomial coef]
                             = ···    [lots of algebra]
                             = n!/(k!(n − k)!)
                             = C(n, k)
However, those · · · may be tedious and take a lot of algebra we don’t want to do.
So, let’s consider another approach. A combinatorial proof is one where you prove two quantities are equal
by imagining a situation/something to count. Then, you argue that the left side and right side are two
equivalent ways to count the same thing, and hence must be equal. We’ve seen earlier often how there are
multiple approaches to counting!
In this case, let’s consider the set of numbers [n] = {1, 2, ..., n}. We will argue that the LHS and RHS both
count the number of subsets of size k.
1. LHS: C(n, k) is literally the number of subsets of size k, since we just want to choose any k items out of n
(order doesn’t matter).
2. RHS: We take a slightly more convoluted approach, splitting on cases depending on whether or not
the number 1 was included in the subset.
Case 1: Our subset of size k includes the number 1. Then we need to choose k − 1 of the
remaining n − 1 numbers (n numbers excluding 1 is n − 1 numbers) to make a subset of size k which
includes 1. There are C(n − 1, k − 1) ways to do this.
Case 2: Our subset of size k does not include the number 1. Then we need to choose k
numbers from the remaining n − 1 numbers. There are C(n − 1, k) ways to do this. So, in total we have
C(n − 1, k − 1) + C(n − 1, k) possible subsets of size k.
Since the left side and right side count the same thing, they must be equal! Note that we dreamed up
this situation and you may wonder how we did - this just comes from practicing many types of counting
problems. You’ll get used to it!
Example(s)
Prove the following identities by a combinatorial argument (counting the same thing in two ways):
1. C(n, m) · C(m, k) = C(n, k) · C(n − k, m − k), for integers 0 ≤ k ≤ m ≤ n.
2. Σ_{k=0}^{n} C(n, k) = 2^n.
Solution
1. We’ll show that both sides count, from a group of n people, the number of committees of size m, and
within that committee a subcommittee of size k.
Left-hand side: We first choose m people to be on the committee from n total; there are C(n, m)
ways to do so. Then, within those m, we choose k to be on a specialized subcommittee; there are C(m, k)
ways to do so. By the product rule, the number of ways to assign these is C(n, m) · C(m, k).
Right-hand side: We first choose which k to be on the subcommittee of size k; there are C(n, k) ways
to do so. From the remaining n − k people, we choose m − k to be on the committee (but not the
subcommittee); there are C(n − k, m − k) ways to do so. By the product rule, the number of ways to
assign these is C(n, k) · C(n − k, m − k).
Since the LHS and RHS both count the same thing, they must be equal.
2. We’ll argue that both sides count the number of subsets of the set [n] = {1, 2, . . . , n}.
Left-hand side: Each element we can have in our subset or not. For the first element, we have
2 choices (in or out). For the second element, we also have 2 choices (in or out). And so on. So the
number of subsets is 2^n.
Right-hand side: The subset can be of any size ranging from 0 to n, so we have a sum. Now
how many subsets are there of size exactly k? There are C(n, k), because we choose k out of n to have in
our set (and order doesn’t matter in sets)! Hence, the number of subsets is Σ_{k=0}^{n} C(n, k).
Since the LHS and RHS both count the same thing, they must be equal.
It’s cool to note we can also prove this with the binomial theorem setting x = 1 and y = 1 - try
this out! It takes just one line!
1.3.5 Exercises
1. These problems involve using the pigeonhole principle. How many cards must you draw from a standard
52-card deck (4 suits and 13 cards of each suit) until you are guaranteed to have:
(a) A single pair? (e.g., AA, 99, JJ)
(b) Two (different) pairs? (e.g., AAKK, 9933, 44QQ)
(c) A full house (a triple and a pair)? (e.g., AAAKK, 99922, 555JJ)
(d) A straight (5 in a row, with the lowest being A,2,3,4,5 and the highest being 10,J,Q,K,A)?
(e) A flush (5 cards of the same suit)? (e.g., 5 hearts, 5 diamonds)
(f) A straight flush (5 cards which are both a straight and a flush)?
Solution:
(a) The worst that could happen is to draw 13 different cards, but the next is guaranteed to form a
pair. So the answer is 14.
(b) The worst that could happen is to draw 13 different cards, but the next is guaranteed to form a
pair. But then we could draw the other two of that pair as well to get 16 still without two pairs.
So the answer is 17.
(c) The worst that could happen is to draw all pairs (26 cards). Then the next is guaranteed to cause
a triple. So the answer is 27.
(d) The worst that could happen is to draw all the A - 4, 6 - 9, and J - K. After drawing these
11 · 4 = 44 cards, we could still fail to have a straight. Finally, getting a 5 or 10 would give us a
straight. So the answer is 45.
(e) The worst that could happen is to draw 4 of each suit (16 cards), and still not have a flush. So
the answer is 17.
(f) Same as straight, 45.
Chapter 2. Discrete Probability
2.1: Intro to Discrete Probability
Slides (Google Drive) Video (YouTube)
We’re just about to learn about the axioms (rules) of probability, and see how all that counting stuff from
chapter 1 was relevant at all. This should align with your current understanding of probability (I only assume
you might be able to tell me the probability I roll an even number on a fair six-sided die at this point), and
formalize it.
We’ll be using a lot of set theory from here on out, so review that in Chapter 0 if you need to!
2.1.1 Definitions
Definition 2.1.1: Sample Space
The sample space, denoted Ω, is the set of all possible outcomes of an experiment.
Example(s)
What is the sample space of each of the following experiments: a single coin flip, two coin flips, and the
roll of a fair six-sided die?
Solution
1. The sample space of a single coin flip is: Ω = {H, T} (heads or tails).
2. The sample space of two coin flips is: Ω = {HH, HT, TH, TT}.
3. The sample space of the roll of a die is: Ω = {1, 2, 3, 4, 5, 6}.
An event is any subset E ⊆ Ω of the sample space.
Example(s)
List out the set of outcomes making up each of the following events: getting at least one head in two coin
flips, and rolling an even number on a six-sided die.
Solution
1. Getting at least one head in two coin flips: E = {HH, HT, TH}
2. Rolling an even number: E = {2, 4, 6}
Events E and F are mutually exclusive if E ∩ F = ∅ (i.e., they can’t simultaneously happen).
Example(s)
Say E is the event of rolling an even number: E = {2, 4, 6}, and F is the event of rolling an odd
number: F = {1, 3, 5}. Are E and F mutually exclusive?
Solution Yes: E ∩ F = ∅, since no roll is both even and odd, so E and F are mutually exclusive.
Example(s)
Let’s consider another example in which our experiment is the rolling of two fair 4-sided dice,
one which is blue (D1) and one which is red (D2) (so they are distinguishable, or effectively, order
matters). We can represent each element in the sample set as an ordered pair (D1, D2), where
D1, D2 ∈ {1, 2, 3, 4} represent the respective values rolled by the blue and red dice.
The sample space Ω is the set of all possible ordered pairs of values that could be rolled by
the dice (|Ω| = 4 · 4 = 16 by the product rule). Let’s consider some events:
1. A = {(1, 1), (1, 2), (1, 3), (1, 4)}, the event that the blue die, D1, is a 1.
2. B = {(2, 4), (3, 3), (4, 2)}, the event that the sum of the two rolls is 6 (D1 + D2 = 6).
3. C = {(2, 1), (4, 2)}, the event that the value on the blue die is twice the value on the red die
(D1 = 2 · D2).
All of these events and the sample space are shown below:
Solution Now, let’s consider whether A and B are mutually exclusive. Well, they do not overlap, as we can
see that A ∩ B = ∅, so yes, they are mutually exclusive.
B and C are not mutually exclusive, since there is a case in which they can happen at the same time:
B ∩ C = {(4, 2)} ≠ ∅, so they are not mutually exclusive.
Again, to summarize, we learned that Ω was the sample space (set of all outcomes of an experiment), and
that an event E ⊆ Ω is any subset of outcomes we might care about.
The three axioms of probability are:
1. (Non-negativity) For any event E ⊆ Ω, P(E) ≥ 0.
2. (Normalization) P(Ω) = 1.
3. (Countable Additivity) If E and F are mutually exclusive, then P(E ∪ F) = P(E) + P(F).
The word “axiom” means: things that we take for granted and assume to be true without proof.
Corollaries:
1. (Complementation) P(E^C) = 1 − P(E).
2. (Monotonicity) If E ⊆ F, then P(E) ≤ P(F).
3. (Inclusion-Exclusion) P(E ∪ F) = P(E) + P(F) − P(E ∩ F).
The word “corollary” means: results that follow almost immediately from a previous result (in this
case, the axioms).
Explanation of Axioms
1. Non-negativity is simply because we cannot consider an event to have a negative probability. It just
wouldn’t make sense. A probability of 1/6 would mean that on average, something would happen 1
out of every 6 trials. What would a probability of −1/4 even mean?
2. Normalization is based on the fact that when we run an experiment, there must be some outcome, and
all possible outcomes are in the sample space. So, we say the probability of observing some outcome
from the sample space is 1.
3. Countable additivity is because if two events are mutually exclusive, they don’t overlap at all; that is,
they don’t share any outcomes. This means that the union of them will contain the same outcomes as
each together, so the probability of their union is the sum of their individual probabilities. (This
is like the sum rule of counting.)
Explanation of Corollaries
1. Complementation is based on the fact that the sample space is all the possible outcomes. This means
that E^C = Ω \ E, so P(E^C) = 1 − P(E). (This is like complementary counting.)
2. Monotonicity is because if E is a subset of F, then all outcomes in the event E are in the event F.
This means that all the outcomes that contribute to the probability of E contribute to the probability
of F, so the probability of F is greater than or equal to that of E (since probabilities are non-negative).
3. Inclusion-Exclusion follows because if E and F have some intersection, this would be counted twice
by adding their probabilities, so we have to subtract it once to only count it once and not overcount.
(This is like inclusion-exclusion for counting.)
Proof of Corollaries. The proofs of these corollaries only depend on the 3 axioms which we assume to be
true.
1. Since E and E^C = Ω \ E are mutually exclusive,
P(E) + P(E^C) = P(E ∪ E^C)   [axiom 3]
= P(Ω)   [E ∪ E^C = Ω]
= 1   [axiom 2]
Now just subtract P(E) from both sides.
2. Since E ⊆ F, consider the sets E and F \ E. Then,
P(F) = P(E ∪ (F \ E))   [draw a picture of E inside event F]
= P(E) + P(F \ E)   [mutually exclusive, axiom 3]
≥ P(E) + 0   [since P(F \ E) ≥ 0 by axiom 1]
If Ω is a sample space such that each of the unique outcomes in Ω is equally likely, then for any event E ⊆ Ω:
P(E) = |E| / |Ω|
Proof of Equally Likely Outcomes Formula. If outcomes are equally likely, then for any outcome in the sample space ω ∈ Ω, we have P(ω) = 1/|Ω| (since there are |Ω| total outcomes). Then, if we list the |E| outcomes that make up event E, we can write
E = {ω1, ω2, ..., ω|E|}
Every set is the union of the (mutually exclusive) singleton sets containing each element (e.g., {1, 2, 3} = {1} ∪ {2} ∪ {3}), and so by countable additivity, we get
P(E) = P(⋃_{i=1}^{|E|} {ωi}) = Σ_{i=1}^{|E|} P({ωi})   [countable additivity axiom]
= Σ_{i=1}^{|E|} 1/|Ω|   [equally likely outcomes]
= |E| / |Ω|   [sum a constant |E| times]
The notation in the first line is like summation or product notation: just union all the sets {ω1} ∪ {ω2} ∪ ... ∪ {ω|E|}.
Example(s)
If we flip two fair coins independently, what is the probability we get at least one head?
Solution Since the sample space Ω = {HH, HT, TH, TT} is such that all outcomes are equally likely, and the event of getting at least one head is E = {HH, HT, TH}, we can say that
P(E) = |E| / |Ω| = 3/4
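Because the outcomes are equally likely, this is just two counting problems, which we can sanity-check by brute-force enumeration. Here is a minimal Python sketch (not part of the original notes) that lists Ω and counts the outcomes in E:

from itertools import product

omega = list(product("HT", repeat=2))  # the 4 equally likely outcomes
E = [w for w in omega if "H" in w]     # event: at least one head
print(len(E) / len(omega))             # 0.75 = 3/4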
Example(s)
Consider the example of rolling the blue and red fair 4-sided dice again (above), a blue die D1 and a red die D2. What is the probability that the two dice's rolls sum to 6?
Solution We called that event B = {(2, 4), (3, 3), (4, 2)}. What is the probability of the event B happening? Well, the 16 possible outcomes that make up all the elements of Ω are each equally likely, because each die has an equal chance of landing on any of the 4 numbers. So P(B) = |B| / |Ω| = 3/16.
2.1.4 Exercises
1. If there are 5 people named A, B, C, D, and E, and they are randomly arranged in a row (with each
ordering equally likely), what is the probability that A and B are placed next to each other?
Solution: The size of the sample space is the number of ways to arrange 5 people in a row, which is |Ω| = 5! = 120. The size of the event E is the number of ways to have A and B sit next to each other. We did a similar problem in 1.1: the answer is 2! · 4! = 48 (why?). Hence, since the outcomes are equally likely,
P(E) = |E| / |Ω| = 48/120 = 2/5
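Since the orderings are equally likely, we can verify this answer by enumerating all 5! orderings; this is a quick Python sketch (not part of the original notes):

from itertools import permutations

omega = list(permutations("ABCDE"))  # all 5! = 120 equally likely orderings
# Event E: A and B are adjacent in the row.
E = [w for w in omega if abs(w.index("A") - w.index("B")) == 1]
print(len(E), len(E) / len(omega))   # 48, 0.4 = 48/120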
2. Suppose I draw 4 cards from a standard 52-card deck. What is the probability they are all aces (there
are exactly 4 aces in a deck)?
Solution: There are two ways to define our sample space, one where order matters, and one where
it doesn’t. These two approaches are equivalent.
(a) If order matters, then |Ω| = P(52, 4) = 52 · 51 · 50 · 49, the number of ways to pick 4 cards out of 52 in order. The size of the event E is the number of ways to pick all 4 aces (with order mattering), which is P(4, 4) = 4 · 3 · 2 · 1. Hence, since the outcomes are equally likely,
P(E) = |E| / |Ω| = P(4, 4) / P(52, 4) = (4 · 3 · 2 · 1) / (52 · 51 · 50 · 49)
(b) If order does not matter, then |Ω| = C(52, 4), since we just care which 4 out of 52 cards we get. Then, there is only C(4, 4) = 1 way to get all 4 aces, and, since the outcomes are equally likely,
P(E) = |E| / |Ω| = C(4, 4) / C(52, 4) = (P(4, 4)/4!) / (P(52, 4)/4!) = (4 · 3 · 2 · 1) / (52 · 51 · 50 · 49)
Notice how it did not matter whether order mattered or not, but we had to be consistent! The 4!
accounting for the ordering of the 4 cards gets cancelled out :).
3. Given 3 different spades (S) and 3 different hearts (H), shuffle them. Compute P(E), where E is the event that the suits of the shuffled cards are in alternating order (e.g., SHSHSH or HSHSHS).
Solution: The size of the sample space is the number of ways to order the 6 (distinct) cards: |Ω| = 6!. The number of ways to organize the three spades is 3!, and the same for the three hearts. Once we do that, we either lead with spades or hearts, so we get 2 · (3!)² for the size of our event E. Hence, since the outcomes are equally likely,
P(E) = |E| / |Ω| = 2 · (3!)² / 6! = 72/720 = 1/10
Note that all of these exercises are just counting two things! We count the size of the sample space, then
the event space and divide them. It is very important to acknowledge that we can only do this when the
outcomes are equally likely.
You can see how we can get even more fun and complicated problems - the three exercises above displayed
counting problems on the “easier side”. The reason we didn’t give “harder” problems is because computing
probability in the case of equally likely outcomes reduces to doing two counting problems (counting |E| and
|⌦|, where computing |⌦| is generally easier than computing |E|). Just use the techniques from Chapter 1
to do this!
Chapter 2. Discrete Probability
2.2: Conditional Probability
Slides (Google Drive) Video (YouTube)
Let's go back to the example of students in CSE312 liking donuts and ice cream. Recall we defined event A as liking ice cream and event B as liking donuts. Then, remember we had 36 students that only like ice cream (A ∩ B^C), 7 students that like donuts and ice cream (A ∩ B), and 13 students that only like donuts (B ∩ A^C). Let's also say that we have 14 students that don't like either (A^C ∩ B^C). Together, these four disjoint groups make up the whole sample space.
Now, what if we asked the question: what's the probability that someone likes ice cream, given that we know they like donuts? We can approach this with the knowledge that 20 of the students like donuts (13 who don't like ice cream and 7 who do). What this question is getting at is: given the knowledge that someone likes donuts, what is the chance that they also like ice cream? Well, 7 of the 20 who like donuts like ice cream, so we are left with the probability 7/20. We write this as P(A | B) (read "the probability of A given B") and in this case we have the following:
P(A | B) = 7/20
= |A ∩ B| / |B|   [|B| = 20 people like donuts, |A ∩ B| = 7 people like both]
= (|A ∩ B| / |Ω|) / (|B| / |Ω|)   [divide top and bottom by |Ω|, which is equivalent]
= P(A ∩ B) / P(B)   [if we have equally likely outcomes]
This intuition (which worked only in the special case of equally likely outcomes) leads us to the definition of conditional probability (defined whenever P(B) > 0):
P(A | B) = P(A ∩ B) / P(B)
An equivalent and useful formula we can derive (by multiplying both sides by the denominator P(B) and switching the sides of the equation) is:
P(A ∩ B) = P(A | B) P(B)
Note that, in general, P(A | B) ≠ P(B | A); this is a common misconception we can dispel with some examples. In the above example with ice cream, we showed already that P(A | B) = 7/20, but P(B | A) = 7/43 (of the 36 + 7 = 43 students who like ice cream, only 7 also like donuts), and these are not equal.
Consider another example where W is the event that you are wet and S is the event you are swimming. Then, the probability you are wet given you are swimming is P(W | S) = 1, as if you are swimming you are certainly wet. But the probability you are swimming given you are wet is P(S | W) ≠ 1, because there are numerous other reasons you could be wet that don't involve swimming (being in the rain, showering, etc.).
(Bayes' Theorem) Let A and B be events with nonzero probability. Then:
P(A | B) = P(B | A) P(A) / P(B)
Note that in the above P (A) is called the prior, which is our belief without knowing anything about
event B. P (A | B) is called the posterior, our belief after learning that event B occurred.
This theorem is important because it allows us to "reverse the conditioning"! Notice that both P(A | B) and P(B | A) appear in this equation on opposite sides. So if we know P(A) and P(B), and can more easily calculate one of P(A | B) or P(B | A), we can use Bayes' Theorem to derive the other.
Proof of Bayes Theorem. Recall the (alternate) definition of conditional probability from above:
P(A ∩ B) = P(A | B) P(B)   (2.2.6)
P(B ∩ A) = P(B | A) P(A)   (2.2.7)
But, because A ∩ B = B ∩ A (since these are the outcomes in both events A and B, and the order of intersection does not matter), P(A ∩ B) = P(B ∩ A), so (2.2.6) and (2.2.7) are equal and we have (by setting the right-hand sides equal):
P(A | B) P(B) = P(B | A) P(A)
and dividing both sides by P(B):
P(A | B) = P(B | A) P(A) / P(B)
Wow, I wish I was alive back then and had this important (and easy to prove) theorem named after me!
Example(s)
We'll investigate two slightly different questions whose answers don't seem like they should be different, but are. Suppose a family has two children (who, at birth, were each equally likely to be male or female). Let's say a telemarketer calls home and one of the two children picks up.
1. If the child who responded was male, and says “Let me get my older sibling”, what is the
probability that both children are male?
2. If the child who responded was male, and says “Let me get my other sibling”, what is the
probability that both children are male?
Solution There are four equally likely outcomes, MM, MF, FM, and FF (where M represents male and F
represents female). Let A be the event both children are male.
1. In this part, we’re given that the younger sibling is male. So we can rule out 2 of the 4 outcomes
above and we’re left with MF and MM. Out of these two, in one of these cases we get MM, and so our
desired probability is 1/2.
More formally, let this event be B, which happens with probability 2/4 (2 out of 4 equally likely outcomes). Then,
P(A | B) = P(A ∩ B) / P(B) = (1/4) / (2/4) = 1/2
since P(A ∩ B) is the probability both children are male, which happens in 1 out of 4 equally likely scenarios. This is because the older sibling's sex is independent of the younger sibling's, so knowing the younger sibling is male doesn't change the probability of the older sibling being male (which is what we computed just now).
2. In this part, we’re given that at least one sibling is male. That is, out of the 4 outcomes, we can only
rule out the FF option. Out of the remaining options MM, MF, and FM, only one has both siblings
being male. Hence, the probability desired is 1/3. You can do a similar more formal argument like we
did above!
See how a slight wording change changed the answer?
We’ll see a disease testing example later, which requires the next section first. If you test positive for a
disease, how concerned should you be? The result may surprise you!
(Partition) We say events E1, ..., En partition the sample space Ω if they are mutually exclusive (Ei ∩ Ej = ∅ whenever i ≠ j) and together cover the sample space (E1 ∪ E2 ∪ ... ∪ En = Ω).
You can see that partition is a very appropriate word here! Picture four events E1, ..., E4 that don't overlap and cover the sample space; the two events E and E^C do the same thing! This is useful when you know exactly one of a few things will happen. For example, in the chemistry example below, there might be only three teachers, and you will be assigned to exactly one of them: at most one because you can't have two teachers (mutually exclusive), and at least one because everyone is assigned some teacher (together they cover the sample space).
Now, suppose we have some event F which intersects with various events that form a partition of Ω; say F overlaps E1, E2, and E3, but not E4. Then F is composed of its intersection with each of E1, E2, and E3, and so we can split F up into smaller pieces. This means that we can write the following (the chunk F ∩ E1, plus the chunk F ∩ E2, plus the chunk F ∩ E3):
P(F) = P(F ∩ E1) + P(F ∩ E2) + P(F ∩ E3)
Note that F and E4 do not intersect, so F ∩ E4 = ∅. For completeness, we can include E4 in the above equation, because P(F ∩ E4) = 0. So, in all we have:
P(F) = P(F ∩ E1) + P(F ∩ E2) + P(F ∩ E3) + P(F ∩ E4)
(Law of Total Probability (LTP)) In general, if E1, ..., En partition Ω, then for any event F:
P(F) = P(F ∩ E1) + ... + P(F ∩ En) = Σ_{i=1}^n P(F ∩ Ei)
and, writing each term via the definition of conditional probability,
P(F) = P(F | E1) P(E1) + ... + P(F | En) P(En) = Σ_{i=1}^n P(F | Ei) P(Ei)
That is, to compute the probability of an event F overall, suppose we have n disjoint cases E1, ..., En for which we can (easily) compute the probability of F in each of these cases (P(F | Ei)). Then, take the weighted average of these probabilities, using the probabilities P(Ei) as weights (the probability of being in each case).
Example(s)
Let's consider an example in which we are trying to determine the probability that we fail chemistry. Let's call the event F failing, and consider the three events E1 for getting the Mean Teacher, E2 for getting the Nice Teacher, and E3 for getting the Hard Teacher, which partition the sample space. The following table gives the relevant probabilities:

                                         Mean Teacher E1   Nice Teacher E2   Hard Teacher E3
Probability of Teaching You, P(Ei):           6/8               1/8               1/8
Probability of Failing You, P(F | Ei):         1                 0                1/2

Solve for the probability of failing.
Solution Before doing anything, how are you liking your chances? There is a high probability (6/8) of getting
the Mean Teacher, and she will certainly fail you. Therefore, you should be pretty sad.
Now let’s do the computation. Notice that the first row sums to 1, as it must, since events E1 , E2 , E3
partition the sample space (you have exactly one of the three teachers). Using the Law of Total Probability
(LTP), we have the following:
P(F) = Σ_{i=1}^3 P(F | Ei) P(Ei) = P(F | E1) P(E1) + P(F | E2) P(E2) + P(F | E3) P(E3)
= 1 · (6/8) + 0 · (1/8) + (1/2) · (1/8) = 13/16
Notice that to get the probability of failing, what we did was: consider the probability of failing in each of the 3 cases, and take a weighted average using the probability of each case. This is exactly what the law of total probability lets us do! You might consider using the LTP when you know the probability of your desired event in each of several disjoint cases that cover the sample space.
Example(s)
Misfortune struck us and we ended up failing chemistry class. What is the probability that we had
the Hard Teacher given that we failed?
Solution First, this probability should be low intuitively, because if you failed, it was most likely due to the Mean Teacher (you are much more likely to get her, AND she has a fail rate of 100%). Start by writing out in a formula what you want to compute; in our case, it is P(E3 | F) (getting the Hard Teacher given that we failed). We know P(F | E3) and we want to solve for P(E3 | F). This is a hint to use Bayes' Theorem, since we can reverse the conditioning! Using that with the numbers from the table and the previous question:
P(E3 | F) = P(F | E3) P(E3) / P(F)   [Bayes' theorem]
= ((1/2) · (1/8)) / (13/16)
= 1/13
Let events E1, ..., En partition the sample space Ω, and let F be another event. Then:
P(E1 | F) = P(F | E1) P(E1) / P(F)   [by Bayes' theorem]
= P(F | E1) P(E1) / Σ_{i=1}^n P(F | Ei) P(Ei)   [by the law of total probability]
In particular, in the case of a simple partition of Ω into E and E^C, if E is an event with nonzero probability, then:
P(E | F) = P(F | E) P(E) / P(F)   [by Bayes' theorem]
= P(F | E) P(E) / (P(F | E) P(E) + P(F | E^C) P(E^C))   [by the law of total probability]
2.2.5 Exercises
1. Suppose the llama flu disease has become increasingly common, and now 0.1% of the population has it (1 in 1000 people). Suppose there is a test for it which is 98% accurate (e.g., 2% of the time it will give the wrong answer). Given that you tested positive, what is the probability you have the disease?
Before any computation, think about what you think the answer might be.
Solution: Let L be the event you have the llama flu, and T be the event you test positive (T^C is the event you test negative). You are asked for P(L | T). We do know P(T | L) = 0.98, because if you have the llama flu, the probability that you test positive is 98%. This gives us the hint to use Bayes' Theorem!
We get that
We get that
P(L | T) = P(T | L) P(L) / P(T)
We are given P(T | L) = 0.98 and P(L) = 0.001, but how can we get P(T), the probability of testing positive? Well, that depends on whether you have the disease or not. When you have two or more cases (L and L^C), that's a hint to use the LTP! So we can write
P(T) = P(T | L) P(L) + P(T | L^C) P(L^C)
Again, interpret this as a weighted average of the probability of testing positive whether you had llama flu, P(T | L), or not, P(T | L^C), weighting by the probability you are in each of these cases, P(L) and P(L^C). We know P(L^C) = 0.999 since P(L^C) = 1 − P(L) (complementation). But what about P(T | L^C)? This is the probability of testing positive given that you don't have llama flu, which is 0.02 or 2% (due to the 98% accuracy). Putting this all together, we get:
P(L | T) = P(T | L) P(L) / P(T)   [Bayes' theorem]
= P(T | L) P(L) / (P(T | L) P(L) + P(T | L^C) P(L^C))   [LTP]
= (0.98 · 0.001) / (0.98 · 0.001 + 0.02 · 0.999)
≈ 0.046756
Not even a 5% chance we have the disease, what a relief! But wait, how can that be? The test is so accurate, and it said you were positive? This is because the prior probability of having the disease, P(L), was so low at 0.1% (actually this is pretty high for a disease rate). If you think about it, the posterior probability we computed, P(L | T), is 47× larger than the prior probability P(L) (P(L | T)/P(L) ≈ 0.047/0.001 = 47), so the test did make it a lot more likely we had the disease after all!
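The arithmetic above is easy to get wrong by hand, so here is a small Python sketch (not part of the original notes) that mirrors the Bayes-plus-LTP computation exactly:

p_L = 0.001            # prior P(L)
p_T_given_L = 0.98     # P(T | L)
p_T_given_notL = 0.02  # P(T | L^C)

p_T = p_T_given_L * p_L + p_T_given_notL * (1 - p_L)  # LTP
print(p_T_given_L * p_L / p_T)                        # Bayes: ~0.046756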
2. Suppose we have four fair dice: one with three sides, one with four sides, one with five sides, and one with six sides (the numbering of an n-sided die is 1, 2, ..., n). We pick one of the four dice, each with equal probability, and roll the same die three times. We get all 4's. What is the probability we chose the 5-sided die to begin with?
Solution: Let Di be the event we chose the i-sided die, for i = 3, 4, 5, 6. Notice that these four events partition the sample space of which die we picked, each having probability 1/4. Let 444 denote the event that all three rolls come up 4. Then:
P(D5 | 444) = P(444 | D5) P(D5) / P(444)   [by Bayes' theorem]
= P(444 | D5) P(D5) / (P(444 | D3) P(D3) + P(444 | D4) P(D4) + P(444 | D5) P(D5) + P(444 | D6) P(D6))   [by LTP]
= ((1/5³) · (1/4)) / (0 · (1/4) + (1/4³) · (1/4) + (1/5³) · (1/4) + (1/6³) · (1/4))
= (1/125) / (1/64 + 1/125 + 1/216)
= 1728/6103 ≈ 0.2831
Note that we compute P(444 | Di) by noting there's only one outcome where we get (4, 4, 4) out of the i³ equally likely outcomes. This is true except when i = 3, where it's not possible to roll all 4's.
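We can also check this answer by simulation. The following Python sketch (not part of the original notes) repeats the experiment many times and estimates the conditional probability as a ratio of counts:

import random

trials, all_fours, from_d5 = 1_000_000, 0, 0
for _ in range(trials):
    sides = random.choice([3, 4, 5, 6])                  # pick a die uniformly
    rolls = [random.randint(1, sides) for _ in range(3)]
    if rolls == [4, 4, 4]:                               # condition on the event 444
        all_fours += 1
        from_d5 += (sides == 5)
print(from_d5 / all_fours)  # ~0.2831 = 1728/6103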
Chapter 2. Discrete Probability
2.3: Independence
Slides (Google Drive) Video (YouTube)
Now, suppose that we shuffle a standard 52-card deck and draw the top three cards. Let's define:
1. A to be the event that we get the Ace of spades as our first card.
2. B to be the event that we get the 10 of clubs as our second card.
3. C to be the event that we get the 4 of diamonds as our third card.
What is the probability that all three of these events happen? We can write this as P(A, B, C) (sometimes we use commas as an alternative to using the intersection symbol, so this is equivalent to P(A ∩ B ∩ C)). Note that this is equivalent to P(C, B, A) or P(B, C, A), since the order of intersection does not matter.
Intuitively, you might say that this probability is (1/52) · (1/51) · (1/50), and you would be correct.
1. The first factor comes from the fact that there are 52 cards that could be drawn, and only one ace of
spades. That is, we computed P (A).
2. The second factor comes from the fact that there are 51 cards after we draw the first card and only
one 10 of clubs. That is, we computed P (B | A).
3. The final factor comes from the fact that there are 50 cards left after we draw the first two and only
one 4 of diamonds. That is, we computed P (C | A, B).
To summarize, we said that
P(A, B, C) = P(A) · P(B | A) · P(C | A, B) = (1/52) · (1/51) · (1/50)
(Chain Rule) Let A1, ..., An be events with nonzero probabilities. Then:
P(A1, ..., An) = P(A1) P(A2 | A1) P(A3 | A1, A2) ··· P(An | A1, ..., A_{n−1})
In the case of two events, A, B (this is just the alternate form of the definition of conditional probability from 2.2):
P(A, B) = P(A) P(B | A)
An easy way to remember this, is if we want to observe n events, we can observe one event at a
time, and condition on those that we’ve done thus far. And most importantly, since the order of
intersection doesn’t matter, you can actually decompose this into any of n! orderings. Make sure
you “do” one event at a time, conditioning on the intersection of ALL past events like we did above.
Proof of Chain Rule. Remember that the definition of conditional probability says P(A ∩ B) = P(A) P(B | A). We'll use this repeatedly to break down P(A1, ..., An). Sometimes it is easier to use commas, and sometimes it is easier to use the intersection sign ∩; for this proof, we'll use the intersection sign. We'll prove this for four events, and you'll see how it can be easily extended to any number of events!
P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1 ∩ A2 ∩ A3) P(A4 | A1 ∩ A2 ∩ A3)
= P(A1 ∩ A2) P(A3 | A1 ∩ A2) P(A4 | A1 ∩ A2 ∩ A3)
= P(A1) P(A2 | A1) P(A3 | A1 ∩ A2) P(A4 | A1 ∩ A2 ∩ A3)
Note how we keep "chaining" and applying the definition of conditional probability repeatedly!
Example(s)
Consider the following 3-stage process. We roll a 6-sided die (numbered 1-6) and call the outcome X. Then, we roll an X-sided die (numbered 1-X) and call the outcome Y. Finally, we roll a Y-sided die (numbered 1-Y) and call the outcome Z. What is P(Z = 5)?
Solution There are only three values the triplet (X, Y, Z) could have taken on so that Z takes on the value 5: (6, 6, 5), (6, 5, 5), and (5, 5, 5). So
P(Z = 5) = P(X = 6, Y = 6, Z = 5) + P(X = 6, Y = 5, Z = 5) + P(X = 5, Y = 5, Z = 5)   [cases]
= (1/6)(1/6)(1/6) + (1/6)(1/6)(1/5) + (1/6)(1/5)(1/5)   [chain rule 3x]
How did we use the chain rule? Let's see, for example, the last term:
P(X = 5, Y = 5, Z = 5) = P(X = 5) P(Y = 5 | X = 5) P(Z = 5 | X = 5, Y = 5)
Here P(X = 5) = 1/6 because we rolled a 6-sided die, and P(Y = 5 | X = 5) = 1/5 since we rolled an X = 5-sided die. Finally, P(Z = 5 | X = 5, Y = 5) = P(Z = 5 | Y = 5) = 1/5 since we rolled a Y = 5-sided die. Note we didn't need to know X = 5 once we knew Y = 5!
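Chain-rule computations like this one are easy to verify with a quick Monte Carlo simulation; here is a sketch (not part of the original notes):

import random

def z_value():
    x = random.randint(1, 6)     # roll a 6-sided die
    y = random.randint(1, x)     # roll an X-sided die
    return random.randint(1, y)  # roll a Y-sided die, return Z

trials = 1_000_000
print(sum(z_value() == 5 for _ in range(trials)) / trials)
# ~0.0169 = 1/216 + 1/180 + 1/150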
2.3.2 Independence
Let's say we flip a fair coin 3 times independently (whatever that means): what is the probability of getting all heads? You may be inclined to say (1/2)³ = 1/8, because the probability of getting heads each time is just 1/2. However, we haven't learned such a rule to compute the joint probability P(H1 ∩ H2 ∩ H3), except the chain rule.
Using only what we've learned, we could consider equally likely outcomes. There are 2³ = 8 possible outcomes when flipping a coin three times (by the product rule), and only one of those (HHH) makes up the event we care about: H1 ∩ H2 ∩ H3. Since the outcomes are equally likely,
P(H1 ∩ H2 ∩ H3) = |H1 ∩ H2 ∩ H3| / |Ω| = |{HHH}| / 2³ = 1/8
We'd love a rule to say P(H1 ∩ H2 ∩ H3) = P(H1) · P(H2) · P(H3) = (1/2) · (1/2) · (1/2) = 1/8, and it turns out this is true when the events are independent!
But first, let’s consider the smaller case: does P (A, B) = P (A) P (B) in general? No! How do we know this
though? Well recall that by the chain rule, we know that:
P (A, B) = P (A) P (B | A)
So, unless P (B | A) = P (B) the equality does not hold. However, when this equality does hold, it is a special
case, which brings us to independence.
Events A and B are independent if any of the following equivalent statements hold:
1. P (A | B) = P (A)
2. P (B | A) = P (B)
3. P (A, B) = P (A) P (B)
Intuitively, what it means for P(A | B) = P(A) is that, given that we know B happened, the probability of observing A is the same as if we didn't know anything. So, event B has no influence on whether A happens, and vice versa.
What about independence of more than just two events? We call this concept “mutual independence” (but
most of the time we don’t even say the word “mutual”). You might think that for events A1 , A2 , A3 , A4 to
be (mutually) independent, by extension of the definition for two events, we would just need
P(A1 ∩ A2 ∩ A3 ∩ A4) = P(A1) P(A2) P(A3) P(A4)
But it turns out, we need this property to hold for any subset of the 4 events. For example, the following must also be true (among others):
P(A1 ∩ A3) = P(A1) P(A3),   P(A2 ∩ A3 ∩ A4) = P(A2) P(A3) P(A4)
For all 2^n subsets of the n events (2^4 = 16 in our case), the probability of the intersection must simply be the product of the individual probabilities.
As you can see, it would be quite annoying to check even if three events were (mutually) independent.
Luckily, most of the time we are told to assume that several events are (mutually) independent and we get
all of those statements to be true for free. We are rarely asked to demonstrate/prove mutual independence.
We say n events A1, A2, ..., An are (mutually) independent if, for any subset I ⊆ [n] = {1, 2, ..., n}, we have
P(⋂_{i∈I} Ai) = ∏_{i∈I} P(Ai)
This is very similar to the last formula, P(A, B) = P(A) P(B), in the definition of independence for two events, just extended to multiple events. It must hold for any subset of the n events, and so this equation is actually saying 2^n equations are true!
Example(s)
Suppose we have the following network, in which each circle represents a node in the network (A, B, C, and D), and the links AB, BD, AC, and CD successfully work with probabilities p, q, r, and s, respectively. That is, for example, the probability of successful communication from A to B is p. Each link is independent of the others, though.
Now, let’s consider the question, what is the probability that A and D can successfully communicate?
Solution There are two ways in which it can communicate: (1) in the top path via B or (2) in the bottom
path via C. Let’s define the event top to be successful communication in the top path and the event bottom
to be successful communication in the bottom path. Let’s first consider the probabilities of each of these
being successful communication. For the top to be a valid path, both links AB and BD must work, so:
P(top) = P(AB ∩ BD) = P(AB) P(BD) = pq   [by independence]
Similarly:
P(bottom) = P(AC ∩ CD) = P(AC) P(CD) = rs   [by independence]
So, to calculate the probability of successful communication between A and D, we can take the union of top
and bottom (we just need at least one of the two to work), and so we have:
P(top ∪ bottom) = P(top) + P(bottom) − P(top ∩ bottom)   [by inclusion-exclusion]
= P(top) + P(bottom) − P(top) P(bottom)   [by independence]
= pq + rs − pqrs
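To double-check the formula, we can simulate the network. The link reliabilities below are arbitrary values chosen for illustration (they are not from the original notes); the sketch compares the Monte Carlo estimate against pq + rs − pqrs:

import random

p, q, r, s = 0.9, 0.8, 0.7, 0.6  # assumed link reliabilities, for illustration
trials, success = 1_000_000, 0
for _ in range(trials):
    top = random.random() < p and random.random() < q     # links AB, BD
    bottom = random.random() < r and random.random() < s  # links AC, CD
    success += (top or bottom)
print(success / trials, p*q + r*s - p*q*r*s)  # both ~0.8376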
Recall from the chain rule example earlier that P(Z = 5 | X = 5, Y = 5) = P(Z = 5 | Y = 5). This is actually another form of independence, called conditional independence! That is, given that Y = 5, the events X = 5 and Z = 5 are independent (the equation looks exactly like P(Z = 5 | X = 5) = P(Z = 5), except with extra conditioning on Y = 5 on both sides).
Events A and B are conditionally independent given an event C if any of the following equivalent
statements hold:
1. P (A | B, C) = P (A | C)
2. P (B | A, C) = P (B | C)
3. P (A, B | C) = P (A | C) P (B | C)
Recall the definition of A and B being (unconditionally) independent below:
1. P (A | B) = P (A)
2. P (B | A) = P (B)
3. P (A, B) = P (A) P (B)
Notice that this is very similar to the definition of independence. There is no di↵erence, except we
have just added in conditioning on C to every probability.
Example(s)
Suppose there is a coin C1 with P (head) = 0.3 and a coin C2 with P (head) = 0.9. We pick one
randomly with equal probability and will flip that coin 3 times independently. What is the probability
we get all heads?
Solution Let us call HHH the event of getting three heads, C1 the event of picking the first coin, and C2
the event of getting the second coin. Then we have the following:
P(HHH) = P(HHH | C1) P(C1) + P(HHH | C2) P(C2)   [by the law of total probability]
= (P(H | C1))³ P(C1) + (P(H | C2))³ P(C2)   [by conditional independence]
= (0.3)³ · (1/2) + (0.9)³ · (1/2) = 0.378
It is important to note that getting heads on the first and second flip are NOT independent. The probability
of heads on the second, given that we got heads on the first flip, is much higher since we are more likely
to have chosen coin C2 . However, given which coin we are flipping, the flips are conditionally independent.
Hence, we can write P(HHH | C1) = P(H | C1)³.
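A quick simulation of this two-stage experiment (a sketch, not part of the original notes) confirms the answer:

import random

def three_heads():
    p_head = random.choice([0.3, 0.9])  # pick C1 or C2 with equal probability
    return all(random.random() < p_head for _ in range(3))

trials = 1_000_000
print(sum(three_heads() for _ in range(trials)) / trials)  # ~0.378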
2.3.4 Exercises
1. Corrupted by their power, the judges running the popular game show America's Next Top Mathematician have been taking bribes from many of the contestants. During each of two episodes, a given contestant is either allowed to stay on the show or is kicked off. If the contestant has been bribing
the judges, she will be allowed to stay with probability 1. If the contestant has not been bribing
the judges, she will be allowed to stay with probability 1/3, independent of what happens in earlier
episodes. Suppose that 1/4 of the contestants have been bribing the judges. The same contestants
bribe the judges in both rounds.
(a) If you pick a random contestant, what is the probability that she is allowed to stay during the
first episode?
(b) If you pick a random contestant, what is the probability that she is allowed to stay during both
episodes?
(c) If you pick a random contestant who was allowed to stay during the first episode, what is the
probability that she gets kicked off during the second episode?
(d) If you pick a random contestant who was allowed to stay during the first episode, what is the
probability that she was bribing the judge?
Solution:
(a) Let Si be the event a contestant stays in the ith episode, and B be the event a contestant is
bribing the judges. Then, by the law of total probability,
P(S1) = P(S1 | B) P(B) + P(S1 | B^C) P(B^C) = 1 · (1/4) + (1/3) · (3/4) = 1/2
(b) By the LTP again, conditioning on whether or not the contestant bribes the judges (and using the fact that, given B or given B^C, the events S1 and S2 are conditionally independent):
P(S1 ∩ S2) = P(S1 ∩ S2 | B) P(B) + P(S1 ∩ S2 | B^C) P(B^C) = 1 · 1 · (1/4) + (1/3) · (1/3) · (3/4) = 1/3
Again, it's important to note that staying on the first and second episode are NOT independent. If we know she stayed on the first episode, then it is more likely she stays on the second (since she's more likely to be bribing the judges). However, conditioned on whether or not she is bribing the judges, S1 and S2 are independent.
(c) By the definition of conditional probability,
P(S2^C | S1) = P(S1 ∩ S2^C) / P(S1)
The denominator is our answer to (a), and the numerator can be computed in the same way as (b):
P(S1 ∩ S2^C) = 1 · 0 · (1/4) + (1/3) · (2/3) · (3/4) = 1/6
so P(S2^C | S1) = (1/6) / (1/2) = 1/3.
(d) By Bayes' theorem, P(B | S1) = P(S1 | B) P(B) / P(S1) = (1 · (1/4)) / (1/2) = 1/2.
2. A parallel system functions whenever at least one of its components works. Consider a parallel system
of n components, and suppose that each component works with probability p, independently of the others.
(a) What is the probability that the system functions?
(b) If the system is functioning, what is the probability that component 1 is working?
(c) If the system is functioning and component 2 is working, what is the probability that component
1 is working?
Solution:
(a) Let Ci be the event component i is functioning, for i = 1, . . . , n. Let F be the event the system
functions. Then,
P(F) = 1 − P(F^C)
= 1 − P(⋂_{i=1}^n Ci^C)   [def of parallel system: it fails only if every component fails]
= 1 − ∏_{i=1}^n P(Ci^C)   [independence]
= 1 − (1 − p)^n   [the probability any one component fails is 1 − p]
(b) By Bayes' theorem, and since the system surely functions when component 1 works (P(F | C1) = 1):
P(C1 | F) = P(F | C1) P(C1) / P(F) = 1 · p / (1 − (1 − p)^n)
(c) Given that component 2 is working, the system is guaranteed to function, so conditioning on F adds no information:
P(C1 | F, C2) = P(C1 | C2) = P(C1) = p
by independence of the components.
Chapter 3. Discrete Random Variables
3.1: Random Variables
Slides (Google Drive) Video (YouTube)
Suppose we flip a fair coin twice, independently. The sample space of this experiment is:
Ω = {HH, HT, TH, TT}
Sometimes, though, we don't care about the order (HT vs TH), but just the fact that we got one head and one tail. So we can define a random variable as a numeric function of the outcome.
For example, we can define X to be the number of heads in the two independent flips of a fair coin. Then X is a function, X : Ω → ℝ, which takes outcomes ω ∈ Ω and maps them to a number. For example, for the outcome HH, we have X(HH) = 2, since there are two heads. See the rest below!
X(HH) = 2
X(HT) = 1
X(TH) = 1
X(TT) = 0
Suppose we conduct an experiment with sample space Ω. A random variable (rv) is a numeric function of the outcome, X : Ω → ℝ. That is, it maps outcomes ω ∈ Ω to numbers: ω ↦ X(ω). The set of possible values X can take on is its range/support, denoted ΩX.
If ΩX is finite or countably infinite (typically integers or a subset), X is a discrete random variable (drv). Else, if ΩX is uncountably large (the size of the real numbers), X is a continuous random variable.
Example(s)
Below are some descriptions of random variables. Find their ranges and classify each as a discrete random variable (DRV) or continuous random variable (CRV):
• X, the number of heads in n independent flips of a fair coin.
• N, the number of people born.
• F, the number of independent flips of a fair coin up to and including the first head.
• B, the number of seconds you wait for the next bus.
• C, the temperature (in degrees Celsius) of a glass of liquid water.
• The range of X is ΩX = {0, 1, ..., n}, because there could be anywhere from 0 to n heads flipped. It is a discrete random variable because there are finitely many (n + 1) values that it takes on.
• The range of N is ⌦N = {0, 1, 2 . . . } because there is no upper bound on the number of people that
can be born. This is countably infinite as it is a subset of all the integers, so it is a discrete random
variable.
• The range of F is ⌦F = {1, 2, . . . } because it will take at least 1 flip to flip a head or it could always
be tails and never flip a head (although the chance is low). This is still countable as a subset of all the
integers, so it is a discrete random variable.
• The range of B is ΩB = [0, ∞), as there could be partial seconds waited, and it could be anywhere from 0 seconds to a bus never coming. This is a continuous random variable because there are uncountably many values in this range.
• The range of C is ⌦C = (0, 100) because the temperature can be any real number in this range. It
cannot be 0 or below because that would be frozen (ice), nor can it be 100 or above because this would
be boiling (steam). This is a continuous random variable.
Returning to the two-coin-flip example, where X is the number of heads, each value of X has a probability:
pX(0) = P(X = 0) = 1/4,   pX(1) = P(X = 1) = 2/4,   pX(2) = P(X = 2) = 1/4
this is because the number of outcomes for X = 0 is 1 of the 4, the number of outcomes for X = 1 is 2 of the 4, and the number of outcomes for X = 2 is 1 of the 4.
The probability mass function (PMF) of a discrete random variable X assigns probabilities to the possible values of the random variable. That is, pX : ΩX → [0, 1], where:
pX(k) = P(X = k)
Note that the events {X = k} for k ∈ ΩX form a partition of Ω, since each outcome ω ∈ Ω is mapped to exactly one number. Hence,
Σ_{z∈ΩX} pX(z) = 1
Notice here the only thing consistent is pX, as it's the PMF of X. The value inside is a dummy variable, just like we can write f(x) = x² or f(t) = t². To reinforce this, I will constantly use different letters for dummy variables.
3.1.3 Expectation
We have this idea of a random variable, which is actually neither random nor a variable (it's a deterministic function X : Ω → ΩX). However, the way I like to think about it is: it is a random quantity which we do not know the value of yet. You might want to know what you might expect it to equal on average. For example, X could be the random variable which represents the number of babies born in Seattle per day. On average, X might be equal to 250, and we would write that its average/mean/expectation/expected value is E[X] = 250.
Let’s go back to the coin example though to define expectation. Your intuition might tell you that the
expected number of heads in 2 flips of a fair coin would be 1 (you would be correct).
Since X was the random variable defined to be the number of heads in 2 flips of a fair coin, we denote this
E [X]. Think of this as the average value of X.
More specifically, imagine if we repeated the two-coin-flip experiment 4 times. Then we would "expect" to get HH, HT, TH, and TT each once. Then, we can divide the total number of heads by the number of trials (4) to get 1:
(2 + 1 + 1 + 0)/4 = 2 · (1/4) + 1 · (1/4) + 1 · (1/4) + 0 · (1/4) = 1
Notice that:
2 · (1/4) + 1 · (1/4) + 1 · (1/4) + 0 · (1/4) = X(HH)P(HH) + X(HT)P(HT) + X(TH)P(TH) + X(TT)P(TT)
= Σ_{ω∈Ω} X(ω) P(ω)
This is the sum of the random variable's value for each outcome multiplied by the probability of that outcome (a weighted average).
Another way of writing this is by multiplying every value that X takes on (in its range) with the probability of that value occurring (the PMF). Notice that below is the same exact sum, but it groups the common values together (since X(HT) = X(TH) = 1). That is:
2 · (1/4) + 1 · (1/4 + 1/4) + 0 · (1/4) = 2 · (1/4) + 1 · (2/4) + 0 · (1/4) = Σ_{k∈ΩX} k · pX(k)
This leads us to the definition. The expected value of a discrete random variable X is
E[X] = Σ_{ω∈Ω} X(ω) P(ω)
or equivalently,
E[X] = Σ_{k∈ΩX} k · pX(k)
The interpretation is that we take an average of the possible values, but weighted by their probabilities.
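Since a PMF is just a finite (or countable) table of values and probabilities, both formulas are one-liners in code. Here is a small Python sketch (not part of the original notes) for the two-coin-flip example:

from itertools import product
from collections import Counter

omega = list(product("HT", repeat=2))          # 4 equally likely outcomes
counts = Counter(w.count("H") for w in omega)  # group outcomes by value of X
p_X = {k: c / len(omega) for k, c in counts.items()}
print(p_X)                                     # e.g. {2: 0.25, 1: 0.5, 0: 0.25}
print(sum(k * p for k, p in p_X.items()))      # E[X] = 1.0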
3.1.4 Exercises
1. Let X be the value of a single roll of a fair six-sided die. What is the range ΩX, the PMF pX(k), and the expectation E[X]?
Solution: The range is ΩX = {1, 2, 3, 4, 5, 6}, and each value is equally likely, so pX(k) = 1/6 for k ∈ ΩX. The expectation is
E[X] = Σ_{k∈ΩX} k · pX(k) = (1 + 2 + 3 + 4 + 5 + 6) · (1/6) = 3.5
This kind of makes sense right? You expect the "middle number" between 1 and 6, which is 3.5.
2. Suppose at time t = 0, a frog starts on a 1-dimensional number line at the origin 0. At each step, the frog moves independently: left with probability 1/10, and right with probability 9/10. Let X be the position of the frog after 2 time steps. What is the range ΩX, the PMF pX(k), and the expectation E[X]?
Solution: The range is ΩX = {−2, 0, 2}. To find the PMF, we find the probabilities of X taking on each of those three values.
(a) For X to equal −2, we have to move left both times, which happens with probability (1/10) · (1/10) = 1/100 by independence of the moves.
(b) For X to equal 2, we have to move right both times, which happens with probability (9/10) · (9/10) = 81/100 by independence of the moves.
(c) Finally, for X to equal 0, we have to take opposite moves: either LR or RL, which happens with probability 2 · (1/10) · (9/10) = 18/100. Alternatively, the easier way is to note that these three probabilities sum to 1, so P(X = 0) = 1 − P(X = −2) − P(X = 2) = 1 − 1/100 − 81/100 = 18/100.
The expectation is
E[X] = Σ_{k∈ΩX} k · pX(k) = −2 · (1/100) + 0 · (18/100) + 2 · (81/100) = 1.6
You might have been able to guess this, but how? At each time step you "expect" to move to the right by 9/10 − 1/10 = 0.8. So after two steps, you would expect to be at 1.6. We'll formalize this approach more in the next chapter!
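For intuition, here is a quick simulation of the frog's two steps (a sketch, not part of the original notes); the empirical average lands near 1.6:

import random

def frog_after_two_steps():
    # each step: -1 with probability 1/10, +1 with probability 9/10
    return sum(-1 if random.random() < 0.1 else 1 for _ in range(2))

trials = 1_000_000
print(sum(frog_after_two_steps() for _ in range(trials)) / trials)  # ~1.6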
3. Let X be the number of independent coin flips up to and including our first head, where P (head) = p.
What is the range ⌦X , the PMF pX (k), and the expectation E [X]?
Solution: The range is ΩX = {1, 2, 3, ...}, since it could theoretically take any number of flips. The PMF is
pX(k) = (1 − p)^{k−1} p,   k ∈ ΩX
(a) P(X = 1) is the probability we get heads (for the first time) on our first try, which is just p.
(b) P(X = 2) is the probability we get heads (for the first time) on our second try, which is (1 − p)p, since we had to get a tails first.
(c) P(X = k) is the probability we get heads (for the first time) on our k-th try, which is (1 − p)^{k−1} p, since we had to get all tails on the first k − 1 tries (otherwise, our first head would have been earlier).
The expectation is pretty complicated and uses a calculus trick, so don't worry about it too much. Just understand the first two lines, which are the setup! But before that, what do you think it should be? For example, if p = 1/10, how many flips do you think it would take until our first head? Possibly 10? And if p = 1/7, maybe 7? So it seems like our guess will be E[X] = 1/p. It turns out this intuition is actually correct!
E[X] = Σ_{k∈ΩX} k · pX(k)   [def of expectation]
= Σ_{k=1}^∞ k(1 − p)^{k−1} p
= p Σ_{k=1}^∞ k(1 − p)^{k−1}   [p is a constant with respect to k]
= p Σ_{k=1}^∞ (− d/dp (1 − p)^k)   [since d/dy y^k = k y^{k−1}, we have d/dp (1 − p)^k = −k(1 − p)^{k−1}]
= −p d/dp (Σ_{k=1}^∞ (1 − p)^k)   [swap sum and derivative]
= −p d/dp (1 / (1 − (1 − p)))   [geometric series formula: Σ_{i=0}^∞ r^i = 1/(1 − r); the missing i = 0 term is a constant, which vanishes under the derivative]
= −p d/dp (1/p)
= −p · (−1/p²)
= 1/p
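If the calculus is unconvincing, a simulation (a sketch, not part of the original notes) agrees with E[X] = 1/p:

import random

def flips_until_first_head(p):
    count = 1
    while random.random() >= p:  # tails, so flip again
        count += 1
    return count

p, trials = 0.1, 100_000
print(sum(flips_until_first_head(p) for _ in range(trials)) / trials)  # ~10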
Chapter 3. Discrete Random Variables
3.2: More on Expectation
Slides (Google Drive) Video (YouTube)
Let’s say that you and your friend sell fish for a living. Every day, you catch X fish, with E [X] = 3 and
your friend catches Y fish, with E [Y ] = 7. How many fish do the two of you bring in (Z = X + Y ) on an
average day? You might guess 3 + 7 = 10. This is the formula you just guessed:
E [Z] = E [X + Y ] = E [X] + E [Y ] = 3 + 7 = 10
This property turns out to be true! Furthermore, let’s say that you can sell each fish for $5 at a store, but
you need to pay $20 in rent for the storefront. How much profit do you expect to make? The profit formula
would be 5Z − 20: $5 times the number of total fish, minus $20. You might guess 5 · 10 − 20 = 30, and you
would be right once again! This is the formula you just guessed:
E[5Z − 20] = 5 E[Z] − 20 = 5 · 10 − 20 = 30
These guesses are exactly linearity of expectation. Let X, Y be random variables and a, b, c be scalars. Then:
E[X + Y] = E[X] + E[Y]
and
E[aX + b] = a E[X] + b
Combining these, it also follows that:
E[aX + bY + c] = a E[X] + b E[Y] + c
Proof of Linearity of Expectation. Note that X and Y are functions (since random variables are functions),
so X + Y is a function that is the sum of the outputs of each of the functions. We have the following (in the first equation, (X + Y)(ω) is the function X + Y applied to ω, which is equal to X(ω) + Y(ω); it is not a product):
E[X + Y] = Σ_{ω∈Ω} (X + Y)(ω) · P(ω)   [def of expectation for the rv X + Y]
= Σ_{ω∈Ω} (X(ω) + Y(ω)) · P(ω)   [def of sum of functions]
= Σ_{ω∈Ω} X(ω) · P(ω) + Σ_{ω∈Ω} Y(ω) · P(ω)   [property of summation]
= E[X] + E[Y]   [def of expectation of X and Y]
For the second property, note that aX + b is also a random variable and hence a function (e.g., if f(x) = sin(1/x), then (2f − 5)(x) = 2f(x) − 5 = 2 sin(1/x) − 5).
E[aX + b] = Σ_{ω∈Ω} (aX + b)(ω) · P(ω)   [def of expectation]
= Σ_{ω∈Ω} (aX(ω) + b) · P(ω)   [def of the function aX + b]
= Σ_{ω∈Ω} aX(ω) · P(ω) + Σ_{ω∈Ω} b · P(ω)   [property of summation]
= a Σ_{ω∈Ω} X(ω) · P(ω) + b Σ_{ω∈Ω} P(ω)   [property of summation]
= a E[X] + b   [def of E[X], and Σ_{ω∈Ω} P(ω) = 1]
For the last property, we get to assume the first two that we proved already:
E[aX + bY + c] = E[aX] + E[bY + c]   [first property]
= a E[X] + b E[Y] + c   [second property, applied twice]
Again, you may think a result like this is “trivial” or “obvious”, but we’ll see the true power of linearity
of expectation through examples. It is one of the most important ideas that you will continue to use (and
probably take for granted), even when studying some of the most complex topics in probability theory.
Example(s)
Suppose a frog starts at position 0 on a number line. At each time step, independently of the others, it moves 1 unit left with probability pL, stays put with probability pS, and moves 1 unit right with probability pR (where pL + pS + pR = 1). Let X be the frog's position after 2 time steps. What is E[X]?
Brute Force Solution: When dealing with any random variable, the first thing you should do is identify its range. The frog must end up in one of these positions, since it can move at most 1 to the left and 1 to the right at each step:
ΩX = {−2, −1, 0, +1, +2}
So we need to compute 5 values: the probability of each of these. Let's start with the easier ones. The only way to end up at −2 is if the frog moves left at both steps, which happens with probability pL · pL = pL², so pX(−2) = pL². The only reason we can multiply them is because of our independence assumption. Similarly, pX(2) = pR · pR = pR².
To get to −1, there are two possibilities: first going left and staying (pL · pS), or first staying and then going left (pS · pL). Adding these disjoint cases gives pX(−1) = 2 pL pS. Again, we can only multiply due to independence. Similarly, pX(1) = 2 pR pS.
Finally, to compute pX(0), we have two options. One is considering all the possibilities (there are three: left right, right left, or stay stay) and adding them up, which gives 2 pL pR + pS². Alternatively and equivalently, since you know the probabilities of the other four values (pX(−2), pX(2), pX(−1), pX(1)), the last one pX(0) must be 1 minus the other four, since probabilities have to sum to 1! This is an often useful and clever trick: solving for all but one of the probabilities actually gives you the last one!
In summary, we would write the PMF as:
pX(k) =
  pL²,             k = −2   (left, left)
  2 pL pS,         k = −1   (left and stay, or stay and left)
  2 pL pR + pS²,   k = 0    (right left, or left right, or stay stay)
  2 pR pS,         k = +1   (right and stay, or stay and right)
  pR²,             k = +2   (right, right)
Then to solve for the expectation, we just multiply each value by its probability mass and take the sum:
E[X] = Σ_{k∈ΩX} k · pX(k)   [def of expectation]
= (−2) · pL² + (−1) · 2 pL pS + 0 · (2 pL pR + pS²) + 1 · 2 pR pS + 2 · pR²   [plug in our values]
= 2(pR − pL)   [lots of messy algebra]
The last step of algebra is not important - once you get to more advanced mathematics (like this
text), getting the second-to-last formula is sufficient. Everything else is algebra which you could do,
or use a computer to do, and so we will omit the useless calculations.
This was quite tedious already; what if instead you were to find the expected location after 100 steps?
Then, this method would be completely ridiculous: finding ΩX = {−100, −99, ..., +99, +100} and
their 201 probabilities. Since you know the frog always moves with the same probabilities though,
maybe we can do something more clever!
Linearity Solution:
Let X1 , X2 be the distance the frog travels at time steps 1,2 respectively.
Important Observation: X = X1 + X2 , since your location after 2 time steps is the sum of the
displacement of the first time step and the second time step. Therefore, ⌦X1 = ⌦X2 = { 1, 0, +1}.
They have the same simple PMF of:
8
>
< pL k = 1
pXi (k) = pS k = 0
>
:
pR k = 1
Which method is easier? Maybe in this case it is debatable, but if we change the time steps from 2 to
100 or 1000, the brute force solution is entirely infeasible, and the linearity solution will basically be
the same amount of work! You could say that X1 , . . . , X100 is the displacement at each of 100 time
steps, and hence by linearity:
" 100 # 100 100
X X X
E [X] = E Xi = E [Xi ] = (pR pL ) = 100(pR pL )
i=1 i=1 i=1
Hopefully now you can come to appreciate more how powerful LoE truly is! We’ll see more examples
in the next section as well as at the end of this section.
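As a sanity check on the 100-step claim, here is a short simulation (a sketch, not part of the original notes), with arbitrary illustrative values of pL, pS, pR:

import random

pL, pS, pR = 0.2, 0.3, 0.5  # assumed move probabilities, for illustration

def position_after(steps):
    pos = 0
    for _ in range(steps):
        u = random.random()
        pos += -1 if u < pL else (0 if u < pL + pS else 1)
    return pos

trials = 100_000
avg = sum(position_after(100) for _ in range(trials)) / trials
print(avg, 100 * (pR - pL))  # both ~30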
You might hope that E[g(X)] = g(E[X]) for a function g of a random variable, but this is actually almost never true! Let's see if we can't derive a nice formula for E[g(X)] for any function g, linear or not.
Suppose we are flipping 2 coins again. Let X be the number of heads in two independent flips of a fair coin. Recall the range, PMF, and expectation (again, I'm using the dummy letter d to emphasize that pX is the PMF of X):
ΩX = {0, 1, 2},
pX(d) =
  1/4,   d = 0
  1/2,   d = 1
  1/4,   d = 2
E[X] = 0 · (1/4) + 1 · (1/2) + 2 · (1/4) = 1
Let g be the cubing function; i.e., g(t) = t³. Let Y = g(X) = X³; what does this mean? It literally means the cubed number of heads! Let's try to compute E[Y] = E[X³], the expected cubed number of heads. We first find its range and PMF. Based on the range of X, we can calculate the range of Y to be:
ΩY = {0, 1, 8}
since if we get 0 heads, the cubed number of heads is 0³ = 0; if we get 1 head, the cubed number of heads is 1³ = 1; and if we get 2 heads, the cubed number of heads is 2³ = 8.
Now to find the PMF of Y = X³. (Again, below I use the notation pY to denote the probability mass function of Y = X³; z is a dummy variable which could be any letter.)
pY(z) =
  1/4,   z = 0
  1/2,   z = 1
  1/4,   z = 8
since there is a 1/4 chance of getting 0 cubed heads (the outcome TT), 1/2 chance of getting 1 cubed heads
(the outcomes HT or TH), and a 1/4 chance of getting 8 cubed heads (the outcome HH).
E[X³] = E[Y] = 0 · (1/4) + 1 · (1/2) + 8 · (1/4) = 2.5
Is there an easier way to compute E[X³] = E[Y] without going through the trouble of writing out pY? Yes! Since we know X's PMF already, why should we have to find the PMF of Y = g(X)? Note this formula below is the same formula as above, rewritten so you can observe something:
E[X³] = 0³ · (1/4) + 1³ · (1/2) + 2³ · (1/4) = 2.5
In fact:
E[X³] = Σ_{b∈ΩX} b³ pX(b)
That is, we can apply the function to each value in ΩX, and then take the weighted average! We can generalize such that for any function g : ΩX → ℝ, we have:
E[g(X)] = Σ_{b∈ΩX} g(b) pX(b)
Caveat: It is worth noting that 2.5 = E[X³] ≠ (E[X])³ = 1. You cannot just say E[g(X)] = g(E[X]), as we just showed!
(Law of the Unconscious Statistician (LOTUS)) Let X be a discrete random variable with range ΩX, and let g : D → ℝ be a function defined at least over ΩX (ΩX ⊆ D). Then
E[g(X)] = Σ_{b∈ΩX} g(b) pX(b)
Note that in general, E[g(X)] ≠ g(E[X]). For example, E[X²] ≠ (E[X])², and E[log(X)] ≠ log(E[X]).
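LOTUS is mechanical enough to state in two lines of code. This sketch (not part of the original notes) computes E[X³] directly from pX and contrasts it with (E[X])³:

p_X = {0: 1/4, 1: 1/2, 2: 1/4}  # PMF of the number of heads in two flips

E_X = sum(k * p for k, p in p_X.items())           # 1.0
E_X_cubed = sum(k**3 * p for k, p in p_X.items())  # 2.5, by LOTUS
print(E_X_cubed, E_X**3)                           # 2.5 vs 1.0: not equal!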
Before we formally prove this, it will help if we have some intuition for each step. As an example, let X have range ΩX = {−1, 0, 1} and PMF
pX(k) =
  3/12,   k = −1
  5/12,   k = 0
  4/12,   k = 1
Notice that Y = X² has range ΩY = {g(x) : x ∈ ΩX} = {(−1)², 0², 1²} = {0, 1} and the following PMF:
pY(k) =
  3/12 + 4/12,   k = 1
  5/12,          k = 0
Note that pY(1) = P(X = −1) + P(X = 1), because {−1, 1} = {x : x² = 1}. The crux of the LOTUS proof depends on this fact. We just group things together and sum!
Proof of LOTUS. The proof isn't too complicated, but the notation is pretty tricky and may be an impediment to your understanding, so focus on understanding the setup in the next few lines. Let Y = g(X); then, for any y ∈ ΩY,
pY(y) = Σ_{x∈ΩX : g(x)=y} pX(x)
That is, the total probability that Y = y is the sum of the probabilities over all x ∈ ΩX where g(x) = y (this is like saying P(Y = 1) = P(X = −1) + P(X = 1) because {x ∈ ΩX : x² = 1} = {−1, 1}).
E[g(X)] = E[Y]   [Y = g(X)]
= Σ_{y∈ΩY} y · pY(y)   [def of expectation]
= Σ_{y∈ΩY} y Σ_{x∈ΩX : g(x)=y} pX(x)   [above substitution]
= Σ_{y∈ΩY} Σ_{x∈ΩX : g(x)=y} y · pX(x)   [move y into the inner sum]
= Σ_{y∈ΩY} Σ_{x∈ΩX : g(x)=y} g(x) pX(x)   [y = g(x) in the inner sum]
= Σ_{x∈ΩX} g(x) pX(x)   [the double sum is the same as summing over all x]
3.2.3 Exercises
1. Let S be the sum of three rolls of a fair 6-sided die. What is E[S]?
Solution: Let X, Y, Z be the first, second, and third roll, respectively. Then S = X + Y + Z. We showed in the first exercise of 3.1 that E[X] = E[Y] = E[Z] = 3.5, so by LoE,
E[S] = E[X] + E[Y] + E[Z] = 3 · 3.5 = 10.5
Alternatively, imagine if we didn't have this theorem. We would find the range of S, which is ΩS = {3, 4, ..., 18}, and find its PMF. What a nightmare!
2. Blind LOTUS Practice: This will all seem useless, but I promise we'll need this in the future. Let X have PMF
pX(k) =
  3/12,   k = 5
  5/12,   k = 2
  4/12,   k = 1
(a) Compute E[X²].
(b) Compute E[log(X)].
(c) Compute E[e^{sin(X)}].
Solution: LOTUS says that E[g(X)] = Σ_{k∈ΩX} g(k) pX(k). That is,
(a) E[X²] = Σ_{k∈ΩX} k² pX(k) = 5² · (3/12) + 2² · (5/12) + 1² · (4/12)
(b) E[log X] = Σ_{k∈ΩX} log(k) · pX(k) = log(5) · (3/12) + log(2) · (5/12) + log(1) · (4/12)
(c) E[e^{sin(X)}] = Σ_{k∈ΩX} e^{sin(k)} pX(k) = e^{sin(5)} · (3/12) + e^{sin(2)} · (5/12) + e^{sin(1)} · (4/12)
Chapter 3. Discrete Random Variables
3.3: Variance
Slides (Google Drive) Video (YouTube)
Suppose there are 7 mermaids in the sea, and imagine a table that lists each mermaid and the color of her hair. Each column in the third row of the table is a variable, Xi, that is 1 if the i-th mermaid has red hair and 0 otherwise. We call these sorts of variables indicator variables because they are either 1 or 0, and their values indicate the truth of a boolean (red hair or not).
Let the variable X represent how many of the 7 mermaids have red hair. If I only gave you this third row
(X1 , X2 , . . . , X7 of 1’s and 0’s), how could you compute X?
Well, you would add them all up! X = X1 + X2 + ... + X7 = 3. So, there are 3 mermaids in the sea that have red hair. This might seem like a trivial result, but let's go over a more complicated example to illustrate the usefulness of indicator random variables!
Example(s)
Suppose n people go to a party and leave their hat with the hat-check person. At the end of the
party, she returns hats randomly and uniformly because she does not care about her job. Let X be
the number of people who get their original hat back. What is E [X]?
Solution Your first instinct might be to approach this problem with brute force. Such an approach would involve enumerating the range, ΩX = {0, 1, 2, ..., n − 2, n} (all the integers from 0 to n except n − 1, since if n − 1 people have their own hats, so does the last person), and computing the probability mass function for each of its elements. However, this approach will get very complicated (give it a shot). So, let's use our new friend, linearity of expectation.
Since the hats are returned uniformly at random, any particular person gets their own hat back with probability 1/n.
Let's use linearity with indicator random variables! For i = 1, ..., n, let
Xi = 1 if the i-th person got their hat back, and Xi = 0 otherwise.
Then the total number of people who get their hat back is X = Σ_{i=1}^n Xi. (Why?)
The expected value of each individual indicator random variable can be found as follows, since it can only take on the values 0 and 1:
E[Xi] = 1 · P(Xi = 1) + 0 · P(Xi = 0) = P(Xi = 1) = P(i-th person got their hat back) = 1/n
From here, we will use linearity of expectation:
E[X] = E[Σ_{i=1}^n Xi]
= Σ_{i=1}^n E[Xi]   [linearity of expectation]
= Σ_{i=1}^n 1/n
= n · (1/n)
= 1
So, the expected number of people to get their hats back is 1 (doesn’t even depend on n)! It is worth noting
that these indicator random variables are not “independent” (we’ll define this formally later). One of the
reasons why is because if we know that a particular person did not get their own hat back, then the original
owner of that hat will have a probability of 0 that they get that hat back.
If asked only about the expectation of a random variable X (and not its PMF), then you may be
able to write X as the sum of possibly dependent indicator random variables, and apply linearity
of expectation. This technique is used when X is counting something (the number of people
who get their hat back). Finding the PMF for this random variable is extremely complicated,
and linearity makes computing the expectation easy (or at least easier than directly finding the PMF).
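For the hat-check problem, a simulation (a sketch, not part of the original notes) shows the average number of fixed points of a random permutation hovering around 1, for any n:

import random

def people_with_own_hat(n):
    hats = list(range(n))
    random.shuffle(hats)  # hats returned uniformly at random
    return sum(i == hat for i, hat in enumerate(hats))

n, trials = 10, 100_000
print(sum(people_with_own_hat(n) for _ in range(trials)) / trials)  # ~1.0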
Example(s)
Suppose we flip a coin n = 100 times independently, where the probability of getting a head on each
flip is p = 0.23. What is the expected number of heads we get? Before doing any computation, what
do you think it might be?
Solution You might expect np = 100 · 0.23 = 23 heads, and you would be absolutely correct! But we do need
to prove/show this.
Let X be the number of heads total, so ΩX = {0, 1, 2, ..., 100}. The "normal" approach might be to try to
find this PMF, which could be a bit complicated (we’ll actually see this in the next section)! But let’s try
to use what we just learned instead, and define indicators.
For i = 1, 2, ..., 100, let Xi = 1 if the i-th flip is heads, and Xi = 0 otherwise. Then, X = Σ_{i=1}^{100} Xi is the total number of heads (why?). To use linearity, we need to find E[Xi].
We showed earlier that
E[Xi] = P(Xi = 1) = p = 0.23
and so
E[X] = E[Σ_{i=1}^{100} Xi]   [def of X]
= Σ_{i=1}^{100} E[Xi]   [linearity of expectation]
= Σ_{i=1}^{100} 0.23
= 100 · 0.23 = 23
3.3.2 Variance
We’ve talked about the expectation (average/mean) of a random variable, and some approaches to computing
this quantity. This provides a nice “summarization” of a random variable, as something we often want to
know about it (sometimes even in place of its PMF). But we might want to know another summary quantity:
how “variable” the random variable is, or how much it deviates from its mean. This is called the variance
of a random variable, and we’ll start with a motivating example below!
Consider the following two games. In both games we flip a fair coin. In Game 1, if a heads is flipped you
pay me $1, and if a tails is flipped I pay you $1. In Game 2, if a heads is flipped you pay me $1000, and if a
tails is flipped I pay you $1000.
Both games are fair, in the sense that the expected value of playing either game is 0:
E[G1] = −1 · (1/2) + 1 · (1/2) = 0 = −1000 · (1/2) + 1000 · (1/2) = E[G2]
Which game would you rather play? Maybe the adrenaline junkies among us would be willing to risk it all
on Game 2, but I think most of us would feel better playing Game 1. As shown above, there is no difference in the expected value of playing these two games, so we need another metric to explain why Game 1 feels safer than Game 2.
We can measure this by calculating how far away a random variable is from its mean, on average. The quantity X − E[X] is the difference between an rv and its mean, but we want a distance, a positive value. So we will look at the squared difference (X − E[X])² instead (another option would have been the absolute difference |X − E[X]|, but someone chose the squared one instead). This is still a random variable (a nonnegative one, since it is squared), and so to get a number (the average distance from the mean), we take the expectation of this new rv, E[(X − E[X])²]. This is called the variance of the original random variable. The definition goes as follows:
(Variance) The variance of a random variable X is
Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²
The variance is always nonnegative, since we take the expectation of a nonnegative random variable (X − E[X])². The first equality is the definition of variance, and the second equality is a more useful identity for doing computation, which follows from linearity (E[X] is just a constant):
E[(X − E[X])²] = E[X² − 2X E[X] + (E[X])²] = E[X²] − 2 E[X] E[X] + (E[X])² = E[X²] − (E[X])²
There is one problem though - if X is the height of someone in feet for example, then the average E[X] is also in units of feet, but the variance is in terms of square feet (since we square X). We'd like to say something like: the height of adults is generally 5.5 feet plus or minus 0.3 feet. To correct for this, we define the standard deviation to be the square root of the variance, σX = √Var(X), which "undoes" the squaring and returns our units to those of X.
We had something nice happen for the random variable aX + b when computing its expectation: E[aX + b] = aE[X] + b, called linearity of expectation. Is there a similar nice property for the variance as well? It turns out that Var(aX + b) = a²Var(X).
Before proving this, let's think about and try to understand why a came out squared, and what happened
to the b. The reason a is squared is because variance involved squaring the random variable, so the a had
to come out squared. It might not be a great intuitive reason, but we’ll prove it below algebraically. The
second (b disappearing) has a nice intuition behind it. Which of the two distributions (random variables)
below do you think should have higher variance?
You might agree with me that they have the same variance! Why?
The idea behind variance is that it measures the “spread” of the values that a random variable can take on.
The two graphs of random variables (distributions) above have the same “spread”, but one is shifted slightly
to the right. Since these graphs have the same “spread”, we want their variance to reflect this similarity.
Thus, shifting a random variable by some constant does not change the variance of that random variable.
That is, Var (X + b) = Var (X): that’s why the b got lost!
Proof of Variance Property: Var(aX + b) = a²Var(X).
First, we show variance is unaffected by shifts; that is, Var(X + b) = Var(X) for any scalar b. We use the original definition that Var(Y) = E[(Y − E[Y])²], with Y = X + b.

Var(X + b) = E[((X + b) − E[X + b])²]    [def of variance]
           = E[(X + b − E[X] − b)²]      [linearity of expectation]
           = E[(X − E[X])²]
           = Var(X)                      [def of variance]

The scaling part is similar: Var(aX) = E[(aX − E[aX])²] = E[a²(X − E[X])²] = a²E[(X − E[X])²] = a²Var(X). Combining the two gives Var(aX + b) = Var(aX) = a²Var(X).
Example(s)
Let X be the outcome of a fair 6-sided die roll. Recall that E[X] = 3.5. What is Var(X)?
Let's say you play a casino game, where you must pay $10 to roll this die once, but earn twice the value of the roll. What are the expected value and variance of your earnings?

Solution First, E[X²] = ∑_{k=1}^{6} k² · (1/6) = 91/6, so Var(X) = E[X²] − E[X]² = 91/6 − 3.5² = 35/12 ≈ 2.92. Your earnings are Y = 2X − 10, so E[Y] = 2E[X] − 10 = −3 and, by the property above, Var(Y) = 2²Var(X) = 35/3.
Now you might wonder, what about the variance of a sum Var (X + Y )? You might hope that Var (X + Y ) =
Var (X)+Var (Y ), but this unfortunately is only true when the random variables are independent (we’ll define
this in the next section, but you can kind of guess what it means)! It is so important to remember that we
made no independence assumptions for linearity of expectation - it’s always true!
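If you want to see these two summary quantities in action, here is a minimal simulation sketch of the two games above (assuming Python 3 with only its standard random module; the play function and the seed are ours, purely for illustration):

import random

def play(stakes, trials=100_000):
    # Play `trials` rounds of a fair-coin game where you win or lose
    # `stakes` dollars with probability 1/2 each; return the sample
    # mean and sample variance of your earnings.
    outcomes = [stakes if random.random() < 0.5 else -stakes
                for _ in range(trials)]
    mean = sum(outcomes) / trials
    var = sum((x - mean) ** 2 for x in outcomes) / trials
    return mean, var

random.seed(312)
print(play(1))     # mean near 0, variance near 1       (Game 1)
print(play(1000))  # mean near 0, variance near 10**6   (Game 2)

Both sample means hover near 0, while the sample variances differ by a factor of a million - exactly the distinction the expectation alone couldn't capture.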
3.3.3 Exercises
1. Suppose you studied hard for a 100-question multiple-choice exam (with 4 choices per question) so
that you believe you know the answer to about 80% of the questions, and you guess the answer to the
remaining 20%. What is the expected number of questions you answer correctly?
Solution: For i = 1, ..., 100, let Xi be the indicator rv which is 1 if you got the ith question correct, and 0 otherwise. Then, the total number of questions correct is X = ∑_{i=1}^{100} Xi. To compute E[X], we need E[Xi] for each i = 1, ..., 100.

E[Xi] = 1 · P(Xi = 1) + 0 · P(Xi = 0) = P(Xi = 1) = P(correct on question i) = 1 · 0.8 + 0.25 · 0.2 = 0.85
where the second last step was using the law of total probability, conditioning on whether we know the
answer to a question or not. Hence,
" 100 # 100 100
X X X
E [X] = E Xi = E [Xi ] = 0.85 = 85
i=1 i=1 i=1
This kind of makes sense - I should be guaranteed 80 out of 100, and if I guess on the other 20, I would
get about 5 (a quarter of them) right, for a total of 85.
2. Recall exercise 2 from 3.1, where we had a random variable X with PMF

pX(k) =
    1/100,   k = −2
    18/100,  k = 0
    81/100,  k = 2

There, we computed E[X] = −2 · (1/100) + 0 · (18/100) + 2 · (81/100) = 1.6 and, by LOTUS, E[X²] = (−2)² · (1/100) + 0² · (18/100) + 2² · (81/100) = 3.28. Hence,

Var(X) = E[X²] − E[X]² = 3.28 − 1.6² = 0.72
Chapter 3. Discrete Random Variables
3.4: Zoo of Discrete Random Variables Part I
Slides (Google Drive) Video (YouTube)
In this section, we'll define formally what it means for random variables to be independent. Then, for the rest of the chapter (3.4, 3.5, 3.6), we'll discuss commonly appearing random variables whose properties, like the PMF, mean, and variance, we can just cite without doing any work! These situations are so common that we name them, and can refer to them and related quantities easily!
Random variables X and Y are independent, denoted X ⊥ Y, if for all x ∈ ΩX and all y ∈ ΩY, any of the following three equivalent properties holds:
1. P(X = x | Y = y) = P(X = x)
2. P(Y = y | X = x) = P(Y = y)
3. P(X = x ∩ Y = y) = P(X = x) · P(Y = y)
Note that this is the same as the event definition of independence, but it must hold for all events {X = x} and {Y = y}.
If X ⊥ Y, then

Var(X + Y) = Var(X) + Var(Y)

This will be proved a bit later, but we can start using this fact now! It is important to remember that you cannot use this formula if the random variables are not independent (unlike linearity).
A common misconception is that Var(X − Y) = Var(X) − Var(Y), but this actually isn't true, since otherwise we could get a negative number. In fact, if X ⊥ Y, then

Var(X − Y) = Var(X + (−Y)) = Var(X) + Var(−Y) = Var(X) + (−1)²Var(Y) = Var(X) + Var(Y)
Before diving into the random variables themselves, let’s look at a situation that arises often...
Let's illustrate how this might be useful with an example. Suppose we independently flip 8 coins that land heads with probability p, and record the resulting sequence of coin flips. This series of flips is a Bernoulli process. We call each of these coin flips a Bernoulli random variable (or indicator rv).
A random variable X is Bernoulli (or indicator), denoted X ~ Ber(p), if and only if X has the following PMF:

pX(k) =
    p,      k = 1
    1 − p,  k = 0
Each Xi in the Bernoulli process with parameter p is a Bernoulli/indicator random variable with parameter p. It simply represents a binary outcome, like a coin flip.
Additionally,

E[X] = p and Var(X) = p(1 − p)

since E[X] = 1 · p + 0 · (1 − p) = p, and because X² = X (as 0² = 0 and 1² = 1), E[X²] = p as well, so

Var(X) = E[X²] − E[X]² = p − p² = p(1 − p)
Notice how we found a situation whose general form comes up quite often, and derived a random variable that models that situation well. Now, anytime we need a Bernoulli/indicator random variable, we can denote it as follows: X ~ Ber(p).
We can generalize this as follows to get the PMF of a binomial random variable:

pX(k) = P(X = k) = P(exactly k heads in n Bernoulli trials) = \binom{n}{k} p^k (1 − p)^{n−k},  k ∈ ΩX
This hopefully sheds some light on why \binom{n}{k} is called a binomial coefficient and X a binomial random variable. Before computing its expectation, let's make sure we didn't make a mistake, and check that our probabilities sum to 1. This will finally use the binomial theorem we learned in chapter 1: (x + y)^n = ∑_{k=0}^{n} \binom{n}{k} x^k y^{n−k}.
∑_{k=0}^{n} pX(k) = ∑_{k=0}^{n} \binom{n}{k} p^k (1 − p)^{n−k}    [PMF of Binomial RV]
                  = (p + (1 − p))^n                               [binomial theorem]
                  = 1^n = 1
A random variable X has a Binomial distribution, denoted X ~ Bin(n, p), if and only if X has the following PMF for k ∈ ΩX = {0, 1, 2, ..., n}:

pX(k) = \binom{n}{k} p^k (1 − p)^{n−k}
X is the sum of n independent Ber(p) random variables, and represents the number of heads in n
independent coin flips where P (head) = p.
Additionally,
E[X] = np and Var(X) = np(1 − p)
Proof of Expectation and Variance of Binomial. We can use linearity of expectation to compute the expected value of a particular binomial variable (i.e., the expected number of successes in n Bernoulli trials). Let X ~ Bin(n, p), so that X = ∑_{i=1}^{n} Xi, where the Xi ~ Ber(p) are independent.

E[X] = E[∑_{i=1}^{n} Xi]
     = ∑_{i=1}^{n} E[Xi]    [linearity of expectation]
     = ∑_{i=1}^{n} p        [expectation of Bernoulli]
     = np
This makes sense! If X ~ Bin(100, 0.5) (number of heads in 100 independent flips of a fair coin), you expect 50 heads, which is just np = 100 · 0.5 = 50. Variance can be found in a similar manner:

Var(X) = Var(∑_{i=1}^{n} Xi)
       = ∑_{i=1}^{n} Var(Xi)      [variance adds for independent rvs]
       = ∑_{i=1}^{n} p(1 − p)     [variance of Bernoulli]
       = np(1 − p)
Like Bernoulli rvs, Binomial random variables have a special place in our zoo. Arguably, the Binomial is the most important discrete random variable, so make sure to understand everything above and be ready to use it!
It is important to note for the hat check example in 3.3 that we had the sum of n Bernoulli/indicator rvs BUT that they were NOT independent. This is because if we know one person gets their hat back, someone else is more likely to (since there are n − 1 possibilities instead of n). However, linearity of expectation works regardless of independence, so we were still able to add their expectations like so:

E[X] = ∑_{i=1}^{n} E[Xi] = n · (1/n) = 1

It would be incorrect to say that X ~ Bin(n, 1/n), because the indicator rvs were NOT independent.
Example(s)
A factory produces 100 cars per day, but a car is defective with probability 0.02. What's the probability that the factory produces 2 or more defective cars on a given day?
Solution Let X be the number of defective cars that the factory produces. X ~ Bin(100, 0.02), so

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1)    [complement]
         = 1 − \binom{100}{0}(0.02)^0(1 − 0.02)^{100} − \binom{100}{1}(0.02)^1(1 − 0.02)^{99}    [plug in binomial PMF]
         ≈ 0.5967
So, there is about a 60% chance that 2 or more cars produced on a given day will be defective.
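As a sanity check on the arithmetic, here is a small Python sketch (assuming Python 3.8+ for math.comb; the helper name binom_pmf is ours, purely for illustration):

from math import comb

def binom_pmf(n, p, k):
    # P(X = k) for X ~ Bin(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.02
prob = 1 - binom_pmf(n, p, 0) - binom_pmf(n, p, 1)  # complement rule
print(round(prob, 4))  # 0.5967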
3.4.4 Exercises
1. An elementary school wants to keep track of how many of their 200 students have acceptable attendance.
Each student shows up to school on a particular day with probability 0.85, independently of other days
and students.
(a) A student has acceptable attendance if they show up to class at least 4 out of 5 times in a school
week. What is the probability a student has acceptable attendance?
(b) What is the probability that at least 170 out of the 200 students have acceptable attendance?
Assume students’ attendance are independent since they live separately.
(c) What is the expected number of students with acceptable attendance?
Solution: Actually, this is a great question because it has nested binomials!
(a) Let X be the number of school days a student shows up in a school week. Then, X ~ Bin(n = 5, p = 0.85), since a student's attendance on different days is independent as mentioned earlier. We want X ≥ 4:

P(X ≥ 4) = P(X = 4) + P(X = 5) = \binom{5}{4} 0.85^4 · 0.15^1 + \binom{5}{5} 0.85^5 · 0.15^0 = 0.83521
(b) Let Y be the number of students who have acceptable attendance. Then, Y ~ Bin(n = 200, p = 0.83521), since each student's attendance is independent of the rest. So,

P(Y ≥ 170) = ∑_{k=170}^{200} \binom{200}{k} 0.83521^k (1 − 0.83521)^{200−k} ≈ 0.3258
(c) We have E [Y ] = np = 200 · 0.83521 = 167.04 as the expected number of students! We can just
cite it now that we’ve identified Y as being Binomial!
2. [From Stanford CS109] When sending binary data to satellites (or really over any noisy channel) the
bits can be flipped with high probabilities. In 1947 Richard Hamming developed a system to more
reliably send data. By using Error Correcting Hamming Codes, you can send a stream of 4 bits with 3
(additional) redundant bits. If zero or one of the seven bits are corrupted, using error correcting codes,
a receiver can identify the original 4 bits. Let’s consider the case of sending a signal to a satellite where
each bit is independently flipped with probability p = 0.1. (Hamming codes are super interesting. It's worth looking up if you haven't seen them before! All these problems could be approached using a binomial distribution (or from first principles).)
(a) If you send 4 bits, what is the probability that the correct message was received (i.e. none of the
bits are flipped).
(b) If you send 4 bits, with 3 (additional) Hamming error correcting bits, what is the probability that
a correctable message was received?
(c) Instead of using Hamming codes, you decide to send 100 copies of each of the four bits. If for
every single bit, more than 50 of the copies are not flipped, the signal will be correctable. What
is the probability that a correctable message was received?
Solution:
(a) We have X ~ Bin(n = 4, p = 0.9) to be the number of correct (unflipped) bits. So the binomial PMF says:

P(X = 4) = \binom{4}{4} 0.9^4 (0.1)^{4−4} = 0.9^4 ≈ 0.6561
Note we could have also approached this by letting Y ~ Bin(4, 0.1) be the number of corrupted (flipped) bits, and computing P(Y = 0). This is the same result!
(b) Let Z be the number of corrupted bits, then Z ~ Bin(n = 7, p = 0.1), so we can use its PMF. A message is correctable if Z = 0 or Z = 1 (mentioned above), so

P(Z = 0) + P(Z = 1) = \binom{7}{0} 0.1^0 · 0.9^7 + \binom{7}{1} 0.1^1 · 0.9^6 ≈ 0.850
This is a 30% (relative) improvement compared to above by just using 3 extra bits!
(c) For i = 1, ..., 4, let Xi ~ Bin(n = 100, p = 0.9) be the number of unflipped copies of the ith bit. We need X1 > 50, X2 > 50, X3 > 50, and X4 > 50 for us to get a correctable message. For Xi > 50, we just sum the binomial PMF from 51 to 100:

P(Xi > 50) = ∑_{k=51}^{100} \binom{100}{k} 0.9^k (1 − 0.9)^{100−k}

By independence,

P(X1 > 50, X2 > 50, X3 > 50, X4 > 50) = P(X1 > 50) P(X2 > 50) P(X3 > 50) P(X4 > 50)
= (∑_{k=51}^{100} \binom{100}{k} 0.9^k (1 − 0.9)^{100−k})^4
> 0.999
But this required 400 bits instead of just the 7 required by Hamming codes! This is well worth the tradeoff.
3. Suppose A and B are random, independent (possibly empty) subsets of {1, 2, ..., n}, where each subset is equally likely to be chosen. Consider A ∩ B, i.e., the set containing elements that are in both A and B. Let X be the random variable that is the size of A ∩ B. What is E[X]?
Solution: Choosing a random subset of {1, ..., n} can be thought of as follows: for each element i = 1, ..., n, with probability 1/2 take the element (and with probability 1/2 don't take it), independently of other elements. This is a crucial observation.
For each element i = 1, ..., n, the element is either in A ∩ B or not. So let Xi be the indicator/Bernoulli rv of whether i ∈ A ∩ B or not. Then, P(Xi = 1) = P(i ∈ A, i ∈ B) = P(i ∈ A) P(i ∈ B) = (1/2) · (1/2) = 1/4, because A, B are chosen independently, and each element is in A or B with probability 1/2. Note that these Xi's are independent, because one element being in the set does not affect another element being in the set. Hence, X = ∑_{i=1}^{n} Xi is the number of elements in our intersection, so X ~ Bin(n, 1/4) and E[X] = np = n/4. (A quick simulation sketch checking this appears below.)
Note that it was not necessary that these variables were independent; we could have still applied linearity of expectation anyway to get n/4. We just wouldn't have been able to say X ~ Bin(n, 1/4).
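Here is the quick simulation sketch mentioned above (assuming Python 3 and its standard random module; the names are illustrative):

import random

def intersection_size(n):
    # Build A and B by including each element of {1, ..., n}
    # independently with probability 1/2, then return |A ∩ B|.
    A = {i for i in range(1, n + 1) if random.random() < 0.5}
    B = {i for i in range(1, n + 1) if random.random() < 0.5}
    return len(A & B)

random.seed(312)
n, trials = 20, 100_000
avg = sum(intersection_size(n) for _ in range(trials)) / trials
print(avg, n / 4)  # the sample mean should be close to n/4 = 5.0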
Chapter 3. Discrete Random Variables
3.5: Zoo of Discrete Random Variables Part II
Slides (Google Drive) Video (YouTube)
X is a uniform random variable, denoted X ~ Unif(a, b), where a < b are integers, if and only if X has the following probability mass function:

pX(k) =
    1/(b − a + 1),  k ∈ {a, a + 1, ..., b}
    0,              otherwise

X is equally likely to take on any value in ΩX = {a, a + 1, ..., b}. This set contains b − a + 1 integers, which is why P(X = k) is always 1/(b − a + 1).
Additionally,

E[X] = (a + b)/2  and  Var(X) = (b − a)(b − a + 2)/12
As you might expect, the expected value is just the average of the endpoints that the uniform random
variable is defined over.
E[X²] = ∑_{k=a}^{b} k² · pX(k) = ∑_{k=a}^{b} k² · 1/(b − a + 1) = (1/(b − a + 1)) ∑_{k=a}^{b} k² = ...

Var(X) = E[X²] − E[X]² = (b − a)(b − a + 2)/12
This variable models situations like rolling a fair six-sided die. Let X be the random variable whose value is the number face up on a die roll. Since the die is fair, each outcome is equally likely, which means that X ~ Unif(1, 6), so

pX(k) =
    1/6,  k ∈ {1, 2, ..., 6}
    0,    otherwise
This is fairly intuitive, but it is nice to have these formulas in our zoo so we can make computations quickly, and think about random processes in an organized fashion. Using the equations above we can find that

E[X] = (1 + 6)/2 = 3.5  and  Var(X) = (6 − 1)(6 − 1 + 2)/12 = 35/12
Let X be the random variable that represents the number of independent coin flips up to and including your first head. Let's compute P(X = 4). X = 4 occurs exactly when there are 3 tails followed by a head. So,

P(X = 4) = P(TTTH) = (1 − p)(1 − p)(1 − p)p = (1 − p)³p

In general,

pX(k) = (1 − p)^{k−1} p

This is because there must be k − 1 tails in a row, followed by a head occurring on the kth trial.
Let's also verify that the probabilities sum to 1.

∑_{k=1}^{∞} pX(k) = ∑_{k=1}^{∞} (1 − p)^{k−1} p    [Geometric PMF]
= p ∑_{k=1}^{∞} (1 − p)^{k−1}                      [take out constant]
= p ∑_{k=0}^{∞} (1 − p)^k                          [reindex to 0]
= p · 1/(1 − (1 − p))                              [geometric series formula: ∑_{i=0}^{∞} r^i = 1/(1 − r)]
= p · (1/p) = 1
The second last step used the geometric series formula - this may be why this random variable is called
Geometric!
X is a Geometric random variable, denoted X ~ Geo(p), if and only if X has the following probability mass function (and range ΩX = {1, 2, ...}):

pX(k) = (1 − p)^{k−1} p,  k = 1, 2, 3, ...

Additionally,

E[X] = 1/p  and  Var(X) = (1 − p)/p²
Example(s)
Let’s say you buy lottery tickets every day, and the probability you win on a given day is 0.01,
independently of other days. What is the probability that after a year (365 days), you still haven’t
won? What is the expected number of days until you win your first lottery?
Solution If X is the number of days until the first win, then X ~ Geo(p = 0.01). We still haven't won after a year exactly when X > 365, so the probability we don't win after a year is (using the PMF)

P(X > 365) = 1 − P(X ≤ 365) = 1 − ∑_{k=1}^{365} P(X = k) = 1 − ∑_{k=1}^{365} (1 − 0.01)^{k−1} · 0.01

This is great, but for the geometric, we can actually get a closed-form formula by thinking of what it means that X > 365 in English. X > 365 happens if and only if we lose on each of the first 365 days, which happens with probability 0.99^{365}. If you evaluated that nasty sum above and this quantity, you would find that they are equal!
Finally, we can just cite the expectation of the Geometric RV:

E[X] = 1/p = 1/0.01 = 100
This is the point of the zoo! We do all these generic calculations so we can use them later anytime.
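To see the zoo in action, here is a small Python sketch (standard library only; purely illustrative) that checks the closed form against the PMF sum:

p = 0.01

# Closed form: lose on each of the first 365 days.
closed_form = (1 - p) ** 365

# PMF sum: P(X > 365) = 1 - sum_{k=1}^{365} (1 - p)^(k-1) * p
pmf_sum = 1 - sum((1 - p) ** (k - 1) * p for k in range(1, 366))

print(closed_form, pmf_sum)  # both approximately 0.0255
print(1 / p)                 # expected days until the first win: 100.0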
Example(s)
You gamble by flipping a fair coin independently up to and including the first head. If it takes k tries, you earn $2^k (i.e., if your first head was the third flip, you would earn $2³ = $8). How much would you pay to play this game?
Solution Let X be the number of flips to the first head. Then, X ~ Geo(1/2) because it's a fair coin, and

pX(k) = (1 − 1/2)^{k−1} (1/2) = 1/2^k,  k = 1, 2, 3, ...

It is usually unwise to gamble, especially if your expected earnings are lower than the price to play. So, let Y be your earnings. Note that Y = 2^X, because the amount you win depends on the number of flips it takes to get a head. We will use LOTUS to compute E[Y] = E[2^X]. Recall E[2^X] ≠ 2^{E[X]} = 2² = 4, as we've seen many times now.

E[Y] = E[2^X] = ∑_{k=1}^{∞} 2^k pX(k) = ∑_{k=1}^{∞} 2^k · (1/2^k) = ∑_{k=1}^{∞} 1 = ∞
Some might say they would be willing to pay any finite amount of money to play this game. Think about
why that would be unwise, and what this means regarding the modeling tools we have provided you so far.
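One way to build intuition for why infinite expected earnings are suspicious: simulate the game. A minimal sketch (assuming Python 3; the winnings helper is ours, purely for illustration):

import random

def winnings():
    # Flip a fair coin until the first head; if it took k flips
    # total, the payout is 2**k dollars.
    k = 1
    while random.random() < 0.5:  # tails: keep flipping
        k += 1
    return 2 ** k

random.seed(312)
for trials in (10**3, 10**5, 10**6):
    avg = sum(winnings() for _ in range(trials)) / trials
    print(trials, avg)

The sample mean never settles down as you add trials - a hint that E[Y] = ∞, and that the expectation alone is a poor guide for pricing this game.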
Now suppose we keep flipping our p-coin until we see the rth head, and let X be the total number of flips needed. We can write

X = ∑_{i=1}^{r} Xi

where Xi is a geometric random variable that represents the number of flips it takes to get the ith head after i − 1 heads have already occurred. Since all the flips are independent, so are the rvs X1, ..., Xr. For example, if r = 3 we might observe the sequence of flips T T H H T T T H. In this case, X1 = 3 and represents the number of trials between the 0th and the 1st head; X2 = 1 and represents the number of trials between the 1st and the 2nd head; X3 = 4 and represents the number of trials between the 2nd and the 3rd head. Remember this fact for later!
How do we find P(X = 8) (with r = 3)? There must be exactly 3 heads and 5 tails, so it is reasonable to expect (1 − p)^5 p^3 to come up somewhere in our final formula, but how many ways can we get a valid sequence of flips? Note that the last coin flip must be a heads, otherwise we would've gotten our r heads earlier than our 8th flip. From here, any 2 of the first 7 flips can be heads, and the other 5 must be tails. Thus, there are \binom{7}{2} valid sequences of coin flips.
Each of these 7-flip sub-sequences (of the 8 total flips) occurs with probability (1 − p)^5 p^2, and there is no overlap. However, we need to include the probability that the last coin flip is a heads. So,

pX(8) = P(X = 8) = \binom{7}{2} (1 − p)^5 p^2 · p = \binom{7}{2} (1 − p)^5 p^3
Again, the interpretation is that our rth head must come at the kth trial exactly; so in the first k − 1 flips we can get our r − 1 heads anywhere (hence the binomial coefficient), and overall we have r heads and k − r tails.
If we are interested in finding the expected value of X, we might try the brute force approach directly from the definition of expected value:

E[X] = ∑_{k∈ΩX} k pX(k) = ∑_{k∈ΩX} k \binom{k−1}{r−1} (1 − p)^{k−r} p^r
but this approach is overly complicated, and there is a much simpler way using linearity of expectation! Suppose X1, ..., Xr ~ Geo(p) are independent. As we showed earlier, X = ∑_{i=1}^{r} Xi, and we showed that each E[Xi] = 1/p. Using linearity of expectation, we can derive the following:

E[X] = E[∑_{i=1}^{r} Xi] = ∑_{i=1}^{r} E[Xi] = ∑_{i=1}^{r} 1/p = r/p
Using a similar technique and the (yet unproven) fact that Var(X + Y) = Var(X) + Var(Y) for independent rvs, we can find the variance of X from the sum of the variances of the independent geometric random variables:

Var(X) = Var(∑_{i=1}^{r} Xi) = ∑_{i=1}^{r} Var(Xi) = ∑_{i=1}^{r} (1 − p)/p² = r(1 − p)/p²
This random variable is called the negative binomial random variable. It is quite common so it too deserves
a special place in our zoo.
X is a negative binomial random variable, denoted X ~ NegBin(r, p), if and only if X has the following probability mass function (and range ΩX = {r, r + 1, ...}):

pX(k) = \binom{k−1}{r−1} p^r (1 − p)^{k−r},  k = r, r + 1, ...

Additionally,

E[X] = r/p  and  Var(X) = r(1 − p)/p²

Also, note that Geo(p) ≡ NegBin(1, p), and that if X, Y are independent such that X ~ NegBin(r, p) and Y ~ NegBin(s, p), then X + Y ~ NegBin(r + s, p) (waiting for r + s heads).
3.5.4 Exercises
1. You are a hardworking boxer. Your coach tells you that the probability of your winning a boxing
match is 0.25, independently of every other match.
(a) How many matches do you expect to fight until you win once?
(b) How many matches do you expect to fight until you win ten times?
(c) You only get to play 12 matches every year. To win a spot in the Annual Boxing Championship,
a boxer needs to win at least 10 matches in a year. What is the probability that you will go to
the Championship this year?
(d) Let q be your answer from the previous part. How many times can you expect to go to the
Championship in your 20 year career?
Solution:
(a) Let X be the number of matches you have to fight until you win once. Then, X ~ Geo(p = 0.25), so E[X] = 1/p = 1/0.25 = 4.
(b) Let Y be the number of matches you have to fight until you win ten times. Then, Y ~ NegBin(r = 10, p = 0.25), so E[Y] = r/p = 10/0.25 = 40.
(c) Let Z be the number of matches you win out of 12. Then, Z ~ Bin(n = 12, p = 0.25), and we want

P(Z ≥ 10) = ∑_{k=10}^{12} \binom{12}{k} 0.25^k (1 − 0.25)^{12−k}
(d) Let W be the number of times we make it to the Championship in 20 years. Then, W ~ Bin(n = 20, p = q), and

E[W] = np = 20q
2. You are in music class, and your cruel teacher says you cannot leave until you play the 1000-note
song Fur Elise correctly 5 times. You start playing the song, and if you play an incorrect note,
you immediately start the song over from scratch. You play each note correctly independently with
probability 0.999.
(a) What is the probability you play the 1000-note song Fur Elise correctly immediately? (i.e., the
first 1000 notes are all correct).
(b) What is the probability you take exactly 20 attempts to correctly play the song 5 times?
(c) What is the probability you take at least 20 attempts to correctly play the song 5 times?
(d) (Challenge) What is the expected number of notes you play until you finish playing Fur Elise
correctly 5 times?
Solution:
(a) Let X be the number of correct notes we play in Fur Elise in one attempt, so X ~ Bin(1000, 0.999). We need P(X = 1000) = 0.999^{1000} ≈ 0.3677.
(b) If Y is the number of attempts until we play the song correctly 5 times, then Y ~ NegBin(5, 0.3677), and so

P(Y = 20) = \binom{20−1}{5−1} 0.3677^5 (1 − 0.3677)^{15} ≈ 0.0269
(c) We can actually take two approaches to this. We can either take our Y from earlier, and compute

P(Y ≥ 20) = 1 − P(Y < 20) = 1 − ∑_{k=5}^{19} \binom{k−1}{4} 0.3677^5 (1 − 0.3677)^{k−5} ≈ 0.1161
Notice the sum starts at 5 since that's the lowest possible value of Y. This would be exactly the probability of the statement asked. We could alternatively rephrase the question as: what is the probability we play the song correctly at most 4 times in the first 19 attempts? Check that these questions are equivalent! Then, we can let Z ~ Bin(19, 0.3677) and instead compute

P(Z ≤ 4) = ∑_{k=0}^{4} \binom{19}{k} 0.3677^k (1 − 0.3677)^{19−k} ≈ 0.1161
(d) We will have to revisit this question later in the course! Note that we could have computed the expected number of attempts to finish playing Fur Elise though, as it would follow a NegBin(5, 0.3677) distribution with expectation 5/0.3677 ≈ 13.598.
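A short sketch to verify parts (b) and (c) numerically (assuming Python 3.8+ for math.comb; the negbin_pmf helper name is ours):

from math import comb

p = 0.999 ** 1000   # probability of one perfect run-through, about 0.3677

def negbin_pmf(k, r, p):
    # P(Y = k) for Y ~ NegBin(r, p): the r-th success occurs on trial k.
    return comb(k - 1, r - 1) * p**r * (1 - p)**(k - r)

print(negbin_pmf(20, 5, p))                                # (b): about 0.0269
print(1 - sum(negbin_pmf(k, 5, p) for k in range(5, 20)))  # (c): about 0.1161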
Chapter 3. Discrete Random Variables
3.6: Zoo of Discrete RV’s Part III
Slides (Google Drive) Video (YouTube)
We start by breaking one unit of time into 5 parts, and we say in each of the five chunks, either a baby is born or not. That means we'll be using a binomial rv with n = 5. The choice of p that will keep our average at 2 is 2/5, because the expected value of a binomial RV is np = 2.
Similarly, if we break the time into even smaller chunks such as n = 10 or n = 70, we can get the corresponding p to be 2/10 or 2/70 respectively (either a baby is born or not in 1/70 of a second).
And we keep increasing n so that it gets down to the smallest fraction of a second; we have n → ∞ and p → 0 in this fashion, while maintaining the condition that np = 2.
Let λ be the historical average number of events per unit of time. Send n → ∞ and p → 0 in such a way that np = λ is fixed (i.e., p = λ/n).
Let Xn ~ Bin(n, λ/n) and Y ~ lim_{n→∞} Xn be the limit of this sequence of Binomial rvs. Then, we say Y ~ Poi(λ), and Y measures the number of events in a unit of time, where the historical average is λ. We'll derive its PMF by taking the limit of the binomial PMF.
We'll need to recall how we defined the base of the natural logarithm e. There are two equivalent formulations:

e^x = lim_{n→∞} (1 + x/n)^n
e^x = ∑_{k=0}^{∞} x^k/k!
pY(k) = lim_{n→∞} pXn(k) = lim_{n→∞} \binom{n}{k} (λ/n)^k (1 − λ/n)^{n−k}    [Binomial PMF with p = λ/n]

= (λ^k/k!) · lim_{n→∞} (n!/((n − k)! n^k)) · (1 − λ/n)^n/(1 − λ/n)^k    [algebra]

= (λ^k/k!) · lim_{n→∞} (n/n · (n − 1)/n · ... · (n − k + 1)/n) · (1 − λ/n)^n/(1 − λ/n)^k    [n!/(n − k)! = n(n − 1)...(n − k + 1)]

= (λ^k/k!) · lim_{n→∞} (1 − λ/n)^n    [lim_{n→∞} n/n · ... · (n − k + 1)/n = 1, and lim_{n→∞} (1 − λ/n)^k = 1 since k is finite]

= (λ^k/k!) · e^{−λ}    [e^x = lim_{n→∞} (1 + x/n)^n]
We'll now verify that the Poisson PMF does sum to 1, and is valid. Recall the Taylor series e^x = ∑_{k=0}^{∞} x^k/k!, so

∑_{k=0}^{∞} pY(k) = ∑_{k=0}^{∞} e^{−λ} λ^k/k! = e^{−λ} ∑_{k=0}^{∞} λ^k/k! = e^{−λ} e^{λ} = 1
X ~ Poi(λ) if and only if X has the following probability mass function (and range ΩX = {0, 1, 2, ...}):

pX(k) = e^{−λ} λ^k/k!,  k = 0, 1, 2, ...

If λ is the historical average number of events per unit of time, then X is the number of events that occur in a unit of time.
Additionally,

E[X] = λ  and  Var(X) = λ
Proof of Expectation and Variance of Poisson. Let Xn ~ Bin(n, λ/n) and Y ~ lim_{n→∞} Xn = Poi(λ). By the properties of the binomial random variable, the mean and variance are as follows for any n (plug in λ = np or, equivalently, p = λ/n):

E[Xn] = np = λ
Var(Xn) = np(1 − p) = λ(1 − λ/n)

Therefore:

E[Y] = E[lim_{n→∞} Xn] = lim_{n→∞} E[Xn] = lim_{n→∞} λ = λ

Var(Y) = Var(lim_{n→∞} Xn) = lim_{n→∞} Var(Xn) = lim_{n→∞} λ(1 − λ/n) = λ
Example(s)
Suppose the average number of babies born in Seattle historically is 2 babies every 15 minutes.
1. What is the probability no babies are born in the next hour in Seattle?
2. What is the expected number of babies born in the next hour?
3. What is the probability no babies are born in the next 5 minutes in Seattle?
Solution
1. Since Poi(λ) is the number of events in a single unit of time (matching units with λ), we must convert our rate to hours (since we are interested in one hour). So the number of babies born in the next hour can be modelled as X ~ Poi(λ = 8/hr), and so the probability no babies are born is

P(X = 0) = e^{−8} · 8^0/0! = e^{−8}

2. The expected number of babies born in the next hour is just E[X] = λ = 8.
3. Now our unit of time is 5 minutes, so the rate is λ = 2/3 babies per 5 minutes. If Y ~ Poi(2/3) is the number of babies born in the next 5 minutes, then P(Y = 0) = e^{−2/3}.
Before doing the next example, let's talk about the sum of two independent Poisson rvs. Almost by definition, if X, Y are independent with X ~ Poi(λ) and Y ~ Poi(µ), then X + Y ~ Poi(λ + µ). (If the average number of babies born per minute in the USA is 5 and in Canada is 2, then the total number of babies born in the next minute combined is Poi(5 + 2), since the average combined rate is 7.) We'll prove this fact, that the sum of independent Poisson rvs is Poisson with the sum of their rates, in a future chapter!
Example(s)
Suppose Lookbook gets on average 120 new users per hour, and Quickgram gets 180 new users per
hour, independently. What is the probability that, combined, less than 2 users sign up in the next
minute?
Solution Convert the λ's to the same unit of interest. For us, it's a minute. We can always change the rate (e.g., 120 per hour is the same as 2 per minute), but we can't change the unit of time we're interested in. The combined number of new users in the next minute is then Z ~ Poi(2 + 3 = 5), so

P(Z < 2) = pZ(0) + pZ(1) = e^{−5} · 5^0/0! + e^{−5} · 5^1/1! = 6e^{−5} ≈ 0.04
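A quick numerical check of this example (assuming Python 3 standard library; poisson_pmf is our helper name):

from math import exp, factorial

def poisson_pmf(lam, k):
    # P(X = k) for X ~ Poi(lam)
    return exp(-lam) * lam**k / factorial(k)

lam = 120 / 60 + 180 / 60   # combined rate: 2 + 3 = 5 new users per minute
print(poisson_pmf(lam, 0) + poisson_pmf(lam, 1))   # 6e^{-5}, about 0.0404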
Suppose there is a candy bag of N = 9 total candies, K = 4 of which are lollipops. Our parents allow us to grab n = 3 of them. Let X be the number of lollipops we grab. What is the probability that we get exactly 2 lollipops?
The number of ways to grab three candies is just \binom{9}{3}, and we need to get exactly 2 lollipops out of 4, which is \binom{4}{2}. Out of the other 5 candies, we only need one of them, which yields \binom{5}{1} ways.

pX(2) = P(X = 2) = \binom{4}{2}\binom{5}{1} / \binom{9}{3}
We say the number of successes we draw is X ~ HypGeo(N, K, n), where K out of N items in a bag are successes, and we draw n without replacement.
X is the number of successes when drawing n items without replacement from a bag containing N items, K of which are successes (hence N − K failures). Generalizing the counting argument above, its PMF is

pX(k) = \binom{K}{k}\binom{N−K}{n−k} / \binom{N}{n}

Additionally,

E[X] = n(K/N)  and  Var(X) = n · K(N − K)(N − n)/(N²(N − 1))
Note that if we drew with replacement, then we would model this situation using Bin(n, K/N), as each draw would be an independent trial.
To see where the expectation comes from, write X = ∑_{i=1}^{n} Xi, where Xi indicates whether the ith draw is a success. Then, each Xi is Bernoulli, but with what parameter? The probability of getting a lollipop on the first draw (X1 being equal to 1) is just K/N:

P(X1 = 1) = K/N

What about P(X2 = 1), the probability we get a lollipop on our second draw? Well, it depends on whether or not we got one on the first draw! So we can use the LTP, conditioning on whether we got one (X1 = 1) or we didn't (X1 = 0):

P(X2 = 1) = P(X2 = 1 | X1 = 1)P(X1 = 1) + P(X2 = 1 | X1 = 0)P(X1 = 0) = ((K − 1)/(N − 1)) · (K/N) + (K/(N − 1)) · ((N − K)/N) = K/N
Actually, each Xi ~ Ber(K/N), at every draw i! You could continue the above logic for X3 and so on. This makes sense, because if you just think about the ith draw and you didn't know anything about the first i − 1, the probability you get a lollipop would just be K/N. Hence

E[Xi] = K/N

and so

E[X] = E[∑_{i=1}^{n} Xi] = ∑_{i=1}^{n} E[Xi] = ∑_{i=1}^{n} K/N = n(K/N)
Note again it would be wrong to say X ~ Bin(n, K/N), because the trials are NOT independent, but we are still able to use linearity of expectation. If we did this experiment with replacement though (take one and put it back), then the draws would be independent, and X would be modelled as Bin(n, K/N). Note the expectation with or without replacement is the same, because linearity of expectation doesn't care about independence!
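Here's a small sketch checking both the PMF computation and E[X] = n(K/N) for the candy example (assuming Python 3.8+; names are illustrative):

import random
from math import comb

N, K, n = 9, 4, 3   # 9 candies, 4 lollipops, grab 3 without replacement

# Exact PMF from the counting argument above:
print(comb(K, 2) * comb(N - K, 1) / comb(N, n))   # P(X = 2), about 0.357

# Simulation check of E[X] = nK/N = 4/3:
random.seed(312)
bag = ["lollipop"] * K + ["other"] * (N - K)
trials = 100_000
avg = sum(random.sample(bag, n).count("lollipop")
          for _ in range(trials)) / trials
print(avg, n * K / N)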
The variance is a nightmare, and will be proven in 5.4, when we figure out how to compute the variance of the sum of these dependent indicator variables.
The Zoo of Discrete RV’s: Here are all the distributions in our zoo of discrete random variables!
• The Bernoulli RV
• The Binomial RV
• The Uniform (Discrete) RV
• The Geometric RV
• The Negative Binomial RV
• The Poisson RV
• The Hypergeometric RV
Congratulations on making it through this chapter on all these wonderful discrete random variables! There are several practice problems below which require using a lot of these zoo elements. It will definitely take some time to get used to all of these - you'll need to practice! See our handy reference sheet for one place to see all of them while doing problems.
3.6.4 Exercises
1. Suppose that on average, 40 babies are born per hour in Seattle.
(a) What is the probability that over 1000 babies are born in a single day in Seattle?
(b) What is the probability that in a 365-day year, over 1000 babies are born on exactly 200 days?
Solution:
(a) The number of babies born in a single average day is 40 · 24 = 960, so X ~ Poi(λ = 960). Then,

P(X > 1000) = 1 − P(X ≤ 1000) = 1 − ∑_{k=0}^{1000} e^{−960} 960^k/k!
(b) Let q be the answer from part (a). The number of days where over 1000 babies are born is Y ~ Bin(n = 365, p = q), so

P(Y = 200) = \binom{365}{200} q^{200} (1 − q)^{165}
2. Suppose the Senate consists of 53 Republicans and 47 Democrats. Suppose we were to create a
bipartisan committee of 20 senators by randomly choosing from the 100 total.
(a) What is the probability we end up with exactly 9 Republicans and 11 Democrats?
(b) What is the expected number of Democrats on the committee?
Solution:
(a) Let X be the number of Republican senators chosen. Then X ~ HypGeo(N = 100, K = 53, n = 20), and the desired probability is

P(X = 9) = \binom{53}{9}\binom{47}{11} / \binom{100}{20}
since choosing 9 Republicans (out of the 20 committee members) immediately implies we have 11 Democrats. Note we could have flipped the roles of Democrats and Republicans: if Y is the number of Democratic senators chosen, then Y ~ HypGeo(N = 100, K = 47, n = 20), and

P(Y = 11) = \binom{47}{11}\binom{53}{9} / \binom{100}{20}
(b) The number of Democrats as mentioned earlier is Y ~ HypGeo(N = 100, K = 47, n = 20), and so

E[Y] = n(K/N) = 20 · (47/100) = 9.4
3. (Poisson Approximation to Binomial) Suppose the famous chip company “Bayes” produces
n = 10000 bags per day. They need to do a quality check, and they know that 0.1% of their bags
independently have “bad” chips in them.
(a) What is the exact probability that at most 5 bags contain “bad” chips?
(b) Recall the Poisson was derived from the Binomial with n → ∞ and p → 0, so it suggests that a Poisson distribution would be a good approximation to a Binomial with large n and small p. Use a Poisson rv instead to compute the same probability as in part (a). How close are the answers?
Note: The reason we sometimes use a Poisson approximation is because the binomial PMF is hard to compute. Imagine X ~ Bin(10000, 0.256): computing P(X = 2000) = \binom{10000}{2000} 0.256^{2000} (1 − 0.256)^{8000} has at least 10000 multiplication operations for the probabilities. Furthermore, \binom{10000}{2000} = 10000!/(2000! · 8000!) - good luck avoiding overflow on your computer!
Solution:
(a) If X is the number of bags with "bad" chips, then X ~ Bin(n = 10000, p = 0.001), so

P(X ≤ 5) = ∑_{k=0}^{5} \binom{10000}{k} 0.001^k (1 − 0.001)^{10000−k} ≈ 0.06699
(b) Since n is large and p is small, we might approximate X as a Poisson rv, with λ = np = 10000 · 0.001 = 10. Then, since X ≈ Poi(10), we have

P(X ≤ 5) = ∑_{k=0}^{5} e^{−10} 10^k/k! ≈ 0.06709

(A code sketch comparing the two answers appears after these exercises.)
4. Suppose typos in a 250-page book occur according to a Poisson process, at an average rate of one typo per two pages.
(a) What is the probability that a given page has at least one typo?
(b) What is the expected total number of typos in the book?
(c) What is the probability that at most 50 of the 250 pages have at least one typo?
(d) What is the expected number of pages until the first page that contains a typo?
(e) Suppose exactly 50 of the 250 pages contain at least one typo. If we choose 20 pages at random, what is the probability that exactly 5 of them contain a typo?
Solution:
(a) The average rate of typos is one per two pages, or equivalently, 1/2 per one page. Hence, if X is the number of typos on a page, then X ~ Poi(λ = 1/2), and

P(X ≥ 1) = 1 − P(X = 0) = 1 − e^{−1/2} (1/2)^0/0! = 1 − e^{−1/2} ≈ 0.39347
(b) Since we are interested in a 250-page "time period", the average rate of typos is 125 per 250 pages. If Y is the number of typos in total, then Y ~ Poi(λ = 125), and E[Y] = λ = 125.
(c) We can consider each page as Poi(1/2), like in part (a). Let Z be the number of pages with at least one typo. Then, Z ~ Bin(n = 250, p = 0.39347), and

P(Z ≤ 50) = ∑_{k=0}^{50} \binom{250}{k} 0.39347^k (1 − 0.39347)^{250−k}
(d) Let V be the first page that contains (at least) one typo. Then, V ~ Geo(0.39347), so

E[V] = 1/0.39347 ≈ 2.5415
(e) If W is the number of pages out of 20 that have a typo, then W ~ HypGeo(N = 250, K = 50, n = 20), and

P(W = 5) = \binom{50}{5}\binom{200}{15} / \binom{250}{20}
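Here is the code sketch mentioned in exercise 3, comparing the exact Binomial probability with the Poisson approximation (assuming Python 3.8+ for math.comb):

from math import comb, exp, factorial

n, p = 10000, 0.001
lam = n * p   # lambda = 10

exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(6))
approx = sum(exp(-lam) * lam**k / factorial(k) for k in range(6))
print(exact)    # about 0.06699 (Binomial)
print(approx)   # about 0.06709 (Poisson approximation)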
Chapter 4. Continuous Random Variables
We learned about how to model things like the number of car crashes in a year or the number of lottery tickets I must buy until I win. What about quantities like the time until the next earthquake, or the height of human beings? These latter quantities are a completely different beast, since they can take on
uncountably infinitely many values (infinite decimal precision). Some of the ideas from the previous chapter
will stay, but we’ll have to develop new tools to handle this new challenge. We’ll also learn about the most
important continuous distribution: the Normal distribution.
Up to this point, we have only been talking about discrete random variables - ones that only take values
in a countable (finite or countably infinite) set like the integers or a subset. What if we wanted to model
quantities that were continuous - that could take on uncountably infinitely many values? If you haven’t
studied or seen cardinality (or types of infinities) before, you can think of this as being intervals of the real
line, which take decimal values. Our tools from the previous chapter were not suitable for modelling these situations, and so we need a new type of random variable.
For continuous random variables, we can no longer define a PMF: the probability of X equalling any particular value will turn out to be 0, and there is no way to make the sum of the probabilities equal 1 (assuming we could even sum over uncountably many values; we can't). Instead, we have the idea of a probability density function, where the x-axis has values in the random variable's range (usually an interval), and the y-axis has the probability density (not mass), which is explained below.
The probability density function fX has some characteristic properties (denoted with fX to distinguish from PMFs pX). Notice again I will use different dummy variables inside the function like fX(z) or fX(t) to ensure you get the idea that the density is fX (subscript indicates for rv X) and the dummy variable can be anything.
• fX(z) ≥ 0 for all z ∈ R; i.e., it is always non-negative, just like a probability mass function.
• ∫_{−∞}^{∞} fX(t)dt = 1; i.e., the area under the entire curve is equal to 1, just like the sum of all the probabilities of a discrete random variable equals 1.
• P(a ≤ X ≤ b) = ∫_a^b fX(w)dw; i.e., the probability that X lies in the interval a to b is the area under the curve from a to b. This is key - integrating fX gives us probabilities.
• P(X = y) = P(y ≤ X ≤ y) = ∫_y^y fX(w)dw = 0. The probability of being a particular value is 0, and NOT equal to the density fX(y), which is nonzero. This is particularly confusing at first.
• P(X ≈ q) = P(q − ε/2 ≤ X ≤ q + ε/2) ≈ ε fX(q); i.e., with a small epsilon value, we can obtain a good rectangle approximation of the area under the curve. The width of the rectangle is ε (from the difference between q + ε/2 and q − ε/2). The height of the rectangle is fX(q), the value of the probability density function fX at q. So, the area of the rectangle is ε fX(q). This is similar to the idea of Riemann integration.
• P(X ≈ u)/P(X ≈ v) ≈ ε fX(u)/(ε fX(v)) = fX(u)/fX(v); i.e., the PDF tells us ratios of probabilities of being "near" a point. From the previous point, we know the probabilities of X being approximately u and v, and through algebra, we see their ratios. If the density is twice as high at u as it is at v, it means we are twice as likely to get a point "near" u as we are to get one "near" v.
Let X be a continuous random variable (one whose range is typically an interval or union of intervals). The probability density function (PDF) of X is the function fX : R → R, such that the following properties hold:
• fX(z) ≥ 0 for all z ∈ R
• ∫_{−∞}^{∞} fX(t) dt = 1
• P(a ≤ X ≤ b) = ∫_a^b fX(w) dw
• P(X = y) = 0 for any y ∈ R
• The probability that X is close to q is proportional to its density fX(q):
  P(X ≈ q) = P(q − ε/2 ≤ X ≤ q + ε/2) ≈ ε fX(q)
• Ratios of probabilities of being "near points" are maintained:
  P(X ≈ u)/P(X ≈ v) ≈ fX(u)/fX(v)
For example, consider X ~ Unif(0, 1) (continuous), whose density is fX(x) = 1 for x ∈ [0, 1] (and 0 otherwise). We know this is a valid density, because the area under the curve is the area of a square with side lengths 1, which is 1 · 1 = 1.
We define the cumulative distribution function (CDF) of X to be FX(w) = P(X ≤ w). That is, all the area to the left of w in the density function. Note we also have CDFs for discrete random variables; they are defined exactly the same way (the probability of being less than or equal to a certain value)! They just don't usually have a nice closed form like they do for continuous RVs. Note for continuous random variables, the CDF at w is just the cumulative area to the left of w, which can be found by an integral (the dummy variable of integration should be different than the input variable w):

FX(w) = P(X ≤ w) = ∫_{−∞}^{w} fX(y)dy
Let’s try to compute the CDF of this uniform random variable on [0, 1]. There are three cases to consider
here.
• If w < 0, FX(w) = 0, since ΩX = [0, 1]. For example, if w = −1, then FX(w) = P(X ≤ −1) = 0, since there is no chance that X ≤ −1. Formally, there is also no area to the left of w = −1, as you can see from the PDF above, so the integral evaluates to 0!
• If 0 ≤ w ≤ 1, the area up to w is a rectangle of height 1 and width w, so FX(w) = w. That is, P(X ≤ w) = w. For example, if w = 0.5, then the probability X ≤ 0.5 is actually just 0.5, since X is equally likely to be anywhere in ΩX = [0, 1]! Note here we didn't do an integral since there are nice shapes, and we sometimes don't have to! We just looked at the area to the left of w.
• If w > 1, all the area is to the left of w, so FX(w) = 1. Again, since ΩX = [0, 1], suppose w = 2; then FX(w) = P(X ≤ 2) = 1, since X is always between 0 and 1 (X must be less than or equal to 2). Formally, the cumulative area to the left of w = 2 is 1 (just the area of the square)!

FX(w) =
    0,  w < 0
    w,  0 ≤ w ≤ 1
    1,  w > 1
Let X be a continuous random variable (one whose range is typically an interval or union of intervals). The cumulative distribution function (CDF) of X is the function FX : R → R such that:
• FX(t) = P(X ≤ t) = ∫_{−∞}^{t} fX(w) dw for all t ∈ R
• (d/du) FX(u) = fX(u)
• P(a ≤ X ≤ b) = FX(b) − FX(a)
• FX is monotone increasing, since fX ≥ 0. That is, FX(c) ≤ FX(d) for c ≤ d.
• lim_{v→−∞} FX(v) = P(X ≤ −∞) = 0
• lim_{v→+∞} FX(v) = P(X ≤ +∞) = 1
Example(s)
Suppose the number of hours X that a package gets delivered past noon is modelled by the following PDF:

fX(x) =
    x/10,  0 ≤ x ≤ 2
    c,     2 < x ≤ 6
    0,     otherwise

(The graph of this PDF is a triangular ramp from x = 0 to x = 2, followed by a flat segment at height c from x = 2 to x = 6.)
1. What is the range of X, ΩX?
2. What must c be so that fX is a valid PDF?
3. Find the CDF of X, FX.
4. Compute P(2 ≤ X ≤ 6).
5. Set up (but do not evaluate) an expression for E[X].
Solution
1. The range is all values where the density is nonzero; in our case, that is ΩX = [0, 6] (or (0, 6)), but we don't care about single points or endpoints because the probability of being exactly any particular value is 0.
2. Formally, we need the density function to integrate to 1; that is,

∫_{−∞}^{∞} fX(x)dx = 1

Since the density function is split into three parts, we can split our integral into three. However, anywhere the density is zero, we will get an integral of zero, so we'll only set up the two integrals that are nontrivial:

∫_0^2 (x/10)dx + ∫_2^6 c dx = 1

Solving this equation for c would definitely work. But let's try to use geometry instead, as we do know how to compute the area of a triangle and a rectangle. The left integral is the area of the triangle with base from 0 to 2 and height 2/10 = 1/5, so that area is (2 · 1/5)/2 = 1/5 (the area of a triangle is b · h/2). The area of the rectangle with base from 2 to 6 and height c is 4c. We need the total area 1/5 + 4c = 1, so c = 1/5.
3. Our CDF needs four cases: when x < 0, when 0 ≤ x ≤ 2, when 2 < x ≤ 6, and when x > 6.
(a) The outer cases are usually the easiest ones: if x < 0, then FX(x) = P(X ≤ x) = 0, since X cannot be less than zero.
(b) If x > 6, then FX(x) = P(X ≤ x) = 1, since X is guaranteed to be at most 6.
(c) For 0 ≤ x ≤ 2, we need the cumulative area to the left of x, which happens to be a triangle with base x and height x/10, so the area is x²/20. Alternatively, evaluate the integral

FX(x) = ∫_{−∞}^x fX(t)dt = ∫_0^x (t/10)dt = x²/20
(d) For 2 < x ≤ 6, we have the entire triangle of area 2 · (1/5) · 0.5 = 1/5, but also a rectangle of base x − 2 and height 1/5, for a total area of 1/5 + (1/5)(x − 2) = x/5 − 1/5. Alternatively, the integral would be

FX(x) = ∫_{−∞}^x fX(t)dt = ∫_0^2 (t/10)dt + ∫_2^x (1/5)dt = x/5 − 1/5

Again, I skipped all the integral evaluation steps as they are purely computational, but feel free to verify!
Finally, putting this together gives

FX(x) =
    0,          x < 0
    x²/20,      0 ≤ x ≤ 2
    x/5 − 1/5,  2 < x ≤ 6
    1,          x > 6
4. Using the formula, we find the area between 2 and 6 to get P(2 ≤ X ≤ 6) = ∫_2^6 fX(t)dt = ∫_2^6 (1/5)dt = 4/5. Alternatively, we can just see that the area from 2 to 6 is a rectangle with base 4 and height 1/5, so the probability is just 4/5. We could also compute FX(6) − FX(2): this is just the area to the left of 6, minus the area to the left of 2, which gives us the area between 2 and 6.
5. We'll use the formula for expectation of a continuous RV, but split into three integrals again due to the piecewise definition of our density. However, the integral outside the range [0, 6] will evaluate to zero, so we won't include it.

E[X] = ∫_{−∞}^{∞} x fX(x)dx = ∫_0^2 x fX(x)dx + ∫_2^6 x fX(x)dx = ∫_0^2 x · (x/10)dx + ∫_2^6 x · (1/5)dx
We won’t do the computation because it’s not important, but hopefully you get the idea of how similar
this is to the discrete version!
Discrete: PMF pX(x) = P(X = x); CDF FX(x) = ∑_{t ≤ x} pX(t); E[X] = ∑_x x pX(x).
Continuous: PDF fX(x), with P(X = x) = 0; CDF FX(x) = ∫_{−∞}^x fX(t)dt; E[X] = ∫_{−∞}^{∞} x fX(x)dx.
4.1.4 Exercises
1. Suppose X is continuous with density

fX(x) =
    cx²,  0 ≤ x ≤ 9
    0,    otherwise
Write an expression for the value of c that makes X a valid pdf, and set up expressions (integrals) for
its mean and variance. Also, find the cdf of X, FX .
Solution: For fX to be a valid PDF, we need ∫_0^9 cx²dx = c · (9³/3) = 243c = 1, so c = 1/243. The mean is

E[X] = ∫_{−∞}^{∞} z fX(z)dz = ∫_0^9 z · (1/243)z² dz = (1/243) ∫_0^9 z³ dz

Similarly, by LOTUS,

E[X²] = ∫_{−∞}^{∞} z² fX(z)dz = ∫_0^9 z² · (1/243)z² dz = (1/243) ∫_0^9 z⁴ dz

and Var(X) = E[X²] − E[X]². Finally, the CDF is FX(t) = 0 for t < 0, FX(t) = ∫_0^t (1/243)s² ds = t³/729 for 0 ≤ t ≤ 9, and FX(t) = 1 for t > 9.
2. Suppose X is continuous with density

fX(x) =
    c/x²,  x ≥ 1
    0,     otherwise

Write an expression for the value of c that makes X a valid pdf, and set up an expression (integral) for its mean. Also, find the cdf of X, FX.
Solution: We need ∫_1^∞ (c/x²)dx = c[−1/x]_1^∞ = c · 1 = 1. Hence, c = 1. The expected value is the weighted average of each point weighted by its density, so

E[X] = ∫_{−∞}^{∞} z fX(z)dz = ∫_1^∞ z · (1/z²)dz = ∫_1^∞ (1/z)dz = [ln(z)]_1^∞ = ∞
For the CDF, we actually have two cases. If t < 1, FX(t) = 0, since there's no way to get a number less than 1 (the range is ΩX = [1, ∞)). For t > 1, we just do a normal integral to get that

FX(t) = P(X ≤ t) = ∫_{−∞}^t fX(s)ds = ∫_1^t (1/s²)ds = [−1/s]_1^t = 1 − 1/t
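Since all of these continuous-RV facts are just integrals, they're easy to sanity-check numerically. A crude sketch for exercise 1 (plain Python; the integrate helper is ours - a simple midpoint rule rather than any library routine):

def integrate(f, a, b, steps=100_000):
    # Crude midpoint-rule numerical integration; plenty accurate here.
    dx = (b - a) / steps
    return sum(f(a + (i + 0.5) * dx) for i in range(steps)) * dx

c = 1 / 243
f = lambda x: c * x**2                        # density from exercise 1

print(integrate(f, 0, 9))                     # about 1: fX is a valid density
print(integrate(lambda x: x * f(x), 0, 9))    # E[X] = 6.75
print(integrate(f, 0, 3), 3**3 / 729)         # FX(3) numerically vs. t^3/729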
Now that we’ve learned about the properties of continuous random variables, we’ll discover some frequently
used RVs just like we did for discrete RVs! In this section, we’ll learn the continuous Uniform distribution,
the Exponential distribution, and Gamma distribution. In the next section, we’ll finally learn about the
Normal/Gaussian (bell-shaped) distribution which you all may have heard of before!
4.2.1 The (Continuous) Uniform RV
The continuous uniform random variable models a situation where there is no preference for any particular
value over a bounded interval. This is very similar to the discrete uniform random variable (e.g., roll of a fair
die), except extended to include decimal values. The probability of equalling any particular value is again 0
since we are dealing with a continuous RV.
X ~ Unif(a, b) (continuous), where a < b are real numbers, if and only if X has the following pdf:

fX(x) =
    1/(b − a),  x ∈ [a, b]
    0,          otherwise

X is equally likely to take on any value in [a, b]. Note the similarities and differences it has with the discrete uniform! The value of the density function is constant at 1/(b − a) for any input x ∈ [a, b], which makes the graph a rectangle whose area integrates to 1.

E[X] = (a + b)/2,  Var(X) = (b − a)²/12

The cdf is

FX(x) =
    0,                x < a
    (x − a)/(b − a),  a ≤ x ≤ b
    1,                x > b
Proof of Expectation and Variance of Uniform. I'm setting up the integrals but omitting the steps that are not relevant to your understanding of probability theory (computing integrals):

E[X] = ∫_{−∞}^{∞} x fX(x)dx = ∫_a^b x · 1/(b − a) dx = (a + b)/2

E[X²] = ∫_{−∞}^{∞} x² fX(x)dx = ∫_a^b x² · 1/(b − a) dx = (a² + ab + b²)/3

Var(X) = E[X²] − E[X]² = (a² + ab + b²)/3 − ((a + b)/2)² = (b − a)²/12
Example(s)
Suppose we think that a Hollywood movie’s overall rating is equally likely to be any decimal value in
the interval [1, 5] (this may not be realistic). You may be able to do these questions “in your head”,
but I encourage you to formalize the questions and solutions to practice the notation and concepts
we’ve learned. (You probably wouldn’t be able to do them “in your head” if the movie rating wasn’t
uniformly distributed!)
1. A movie is considered average if its overall rating is between 1.5 and 4.5. What is the probability that it is average?
2. A movie is considered a huge success if its overall rating is at least 4.5. What is the probability
that it is a huge success?
3. A movie is considered legendary if its overall rating is at least 4.95. Given that a movie is a
huge success, what is the probability it is legendary?
Solution Before starting, we can write that the overall rating of a movie is X ~ Unif(1, 5). Hence, its density function is fX(x) = 1/(5 − 1) = 1/4 for x ∈ [1, 5] (and 0 otherwise).
1. We know the probability of being in the range [1.5, 4.5] is the area under the density function from 1.5 to 4.5, so

P(1.5 ≤ X ≤ 4.5) = ∫_{1.5}^{4.5} fX(x)dx = ∫_{1.5}^{4.5} (1/4)dx = 3/4

You could have also drawn a picture of this density function (which is flat at 1/4), and exploited geometry to figure that the base of the rectangle is 3 and the height is 1/4.
2. Similarly,

P(X ≥ 4.5) = ∫_{4.5}^{∞} fX(x)dx = ∫_{4.5}^{5} (1/4)dx = 1/8

Note that the density function for values x ≥ 5 is zero, so that's why the integral changed its upper bound from ∞ to 5 when replacing the density!
3. We'll use Bayes Theorem:

P(X ≥ 4.95 | X ≥ 4.5) = P(X ≥ 4.5 | X ≥ 4.95) P(X ≥ 4.95)/P(X ≥ 4.5) = 1 · (0.05/4)/(0.5/4) = 1/10

4.2.2 The Exponential RV
The exponential random variable models the waiting time until an event occurs, which can be any nonnegative decimal value (e.g., 2.13 minutes or 9.9324 seconds). This is like the continuous extension of the Geometric (discrete) RV, which is the number of trials until a success occurs.
Recall the Poisson Process with parameter λ > 0 has events happening at an average rate of λ per unit time, forever. The exponential RV measures the time (e.g., 4.33212 seconds, 9.382 hours, etc.) until the first occurrence of an event, so it is a continuous RV with range [0, ∞) (unlike the Poisson RV, which counts the number of occurrences in a unit of time, with range {0, 1, 2, ...}, and is a discrete RV).
Let Y ~ Exp(λ) be the time until the first event. We'll first compute its CDF FY(t) and then differentiate it to find its PDF fY(t).
Let X(t) ~ Poi(λt) be the number of events in the first t units of time, for t ≥ 0 (if the average is λ per unit of time, then it is λt per t units of time). Then, Y > t (wait longer than t units of time until the first event) if and only if X(t) = 0 (no events happened in the first t units of time). This allows us to relate the Exponential CDF to the Poisson PMF.
P(Y > t) = P(no events in the first t units) = P(X(t) = 0) = e^{−λt}(λt)^0/0! = e^{−λt}

Note that we plugged in the Poi(λt) PMF at 0 in the second last equality. Now, the CDF is just the complement of the probability we computed:

FY(t) = P(Y ≤ t) = 1 − P(Y > t) = 1 − e^{−λt}

Remember, since the CDF was the integral of the PDF, the PDF is the derivative of the CDF by the fundamental theorem of calculus:

fY(t) = (d/dt) FY(t) = λe^{−λt}
X ~ Exp(λ) if and only if X has the following pdf (and range ΩX = [0, ∞)):

fX(x) =
    λe^{−λx},  x ≥ 0
    0,         otherwise

X is the waiting time until the first occurrence of an event in a Poisson Process with parameter λ.

E[X] = 1/λ,  Var(X) = 1/λ²

The cdf is

FX(x) =
    1 − e^{−λx},  x ≥ 0
    0,            otherwise
Proof of Expectation and Variance of Exponential. You can use integration by parts if you want to solve these integrals, or you can use WolframAlpha. Again, I'm omitting the steps that are not relevant to your understanding of probability theory (computing integrals):

E[X] = ∫_{−∞}^{∞} x fX(x)dx = ∫_0^∞ x · λe^{−λx}dx = 1/λ

E[X²] = ∫_{−∞}^{∞} x² fX(x)dx = ∫_0^∞ x² · λe^{−λx}dx = 2/λ²

Var(X) = E[X²] − E[X]² = 2/λ² − (1/λ)² = 1/λ²
If you usually skip examples, please don’t skip the next two. The first example here highlights the relationship
between the Poisson and Exponential RVs, and the second highlights the memoryless property!
Example(s)
Suppose that, on average, 13 car crashes occur each day on Highway 101. What is the probability
that no car crashes occur in the next hour? Be careful of units of time!
Solution We will solve this problem with three equivalent approaches! Take the time to understand why each of them works.
1. On average there are 13/24 car crashes per hour, so the number of crashes in the next hour is X ~ Poi(λ = 13/24).

P(X = 0) = e^{−13/24}(13/24)^0/0! = e^{−13/24}

2. Similar to above, the time (in hours) until the first car crash is Y ~ Exp(λ = 13/24), since on average 13/24 car crashes happen per hour. Then, the probability no car crashes happen in the next hour is

P(Y > 1 (hour)) = 1 − P(Y ≤ 1) = 1 − FY(1) = 1 − (1 − e^{−(13/24)·1}) = e^{−13/24}

3. If we don't want to change the units, then we can say the waiting time until the next car crash (in days) is Z ~ Exp(λ = 13), since on average 13 car crashes happen per day. Then, the probability no car crashes occur in the next hour (1/24 of a day) is the probability that we wait longer than 1/24 of a day:

P(Z > 1/24) = 1 − P(Z ≤ 1/24) = 1 − FZ(1/24) = 1 − (1 − e^{−13·(1/24)}) = e^{−13/24}
Hopefully the first and second solutions show you the relationship between the Poisson and Exponential RVs
(they both come from the Poisson process), and the second and third solution show you how to be careful
with units and that you’ll get the same answer as long as you are consistent.
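A numerical check of this example (assuming Python 3; random.expovariate(λ) draws Exponential samples; the seed and trial count are arbitrary):

import random
from math import exp

lam = 13 / 24      # crashes per hour

print(exp(-lam))   # the closed form shared by all three approaches

# Simulation: sample the waiting time (in hours) until the first crash.
random.seed(312)
trials = 100_000
frac = sum(random.expovariate(lam) > 1 for _ in range(trials)) / trials
print(frac)        # about e^{-13/24} ≈ 0.5817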
Example(s)
Suppose a laptop battery's lifetime (in hours) is exponentially distributed, with an average lifetime of 50 hours.
1. What is the probability the battery lasts more than 60 hours?
2. What is the probability the battery lasts more than 40 hours?
3. Given that the battery has already lasted at least 60 hours, what is the probability it lasts at least 100 hours total (i.e., at least 40 more)?
Solution Since we want to model battery life, we should use an Exponential distribution. Since we know the average battery life is 50 hours, and that the expected value of an exponential RV is 1/λ (see above), we should say that the battery life is X ~ Exp(λ = 1/50 = 0.02).
1. If we want the probability the battery lasts more than 60 hours, then we want

P(X ≥ 60) = ∫_{60}^{∞} fX(t)dt = ∫_{60}^{∞} 0.02e^{−0.02t}dt = e^{−1.2}

But continuous distributions have a CDF, which we can and should take advantage of! We can look up the CDF above as well:

P(X ≥ 60) = 1 − P(X < 60) = 1 − FX(60) = 1 − (1 − e^{−0.02·60}) = e^{−1.2}

We made a step above that said P(X < 60) = FX(60), but FX(60) = P(X ≤ 60). It turns out they are the same for continuous RVs, since the probability X = 60 exactly is zero!
2. Similarly,

P(X ≥ 40) = 1 − P(X < 40) = 1 − FX(40) = 1 − (1 − e^{−0.02·40}) = e^{−0.8}
3. By Bayes Theorem,

P(X ≥ 100 | X ≥ 60) = P(X ≥ 60 | X ≥ 100) P(X ≥ 100) / P(X ≥ 60) = (1 · e^{−2}) / e^{−1.2} = e^{−0.8}
Note that this is exactly the same as P(X ≥ 40) above, the probability the battery lasts at least 40 hours. This says that the previous 60 hours don't matter: P(X ≥ 40 + 60 | X ≥ 60) = P(X ≥ 40). This property is called memorylessness, since the battery essentially forgets that it was alive for 60 hours! We'll discuss this more formally below and prove it.
4.2.3 Memorylessness
Definition 4.2.3: Memorylessness
A random variable X is memoryless if, for all s, t ≥ 0,

P(X > s + t | X > s) = P(X > t)
We just saw a concrete example above, but let’s see another. Let s = 7, t = 2. So P (X > 9 | X > 7) =
P (X > 2).
This memoryless property says that, given we’ve waited (at least) 7 minutes, the probability we wait (at
least) 2 more minutes, is the same as the probability we waited (at least 2) more from the beginning. That
is, the random variable “forgot” how long we’ve already been waiting.
The only memoryless RVs are the Geometric (discrete) and Exponential (Continuous)! This is because
events happen independently over time/trials, and so the past doesn’t matter.
We've seen it algebraically and intuitively, but let's see it pictorially as well. Here is a picture of the probability that X is greater than 1 for an exponential RV: it is the area to the right of 1 under the density function λe^{−λx} for x ≥ 0 (shaded in blue).
Below is a picture of the probability X > 2.5 given X > 1.5 (shaded in orange and blue). If you hide the
area to the left of 1.5, you can see the ratio of the orange area (right of 2.5) to the entire shaded region (right
of 1.5) is the same as P (X > 1) above. So this exponential density function has memorylessness built in!
Algebraically, the key fact for an Exponential RV is its tail probability:

P(X > x) = 1 − FX(x) = 1 − (1 − e^{−λx}) = e^{−λx}
Then, I'll leave it to you to do the same computation as above (using Bayes' Theorem). You'll see it work out almost exactly the same way!
4.2.4 The Gamma Random Variable
If X ∼ Gamma(r, λ), then X is the waiting time until the rth occurrence of an event in a Poisson process with parameter λ, with E[X] = r/λ and Var(X) = r/λ².
Notice that Gamma(1, λ) ≡ Exp(λ). By definition, if X, Y are independent with X ∼ Gamma(r, λ) and Y ∼ Gamma(s, λ), then X + Y ∼ Gamma(r + s, λ).
Proof of Expectation and Variance of Gamma. The PDF of the Gamma looks very ugly and hard to deal with, so let's use our favorite trick: Linearity of Expectation! As mentioned earlier, if X ∼ Gamma(r, λ), then X = Σ_{i=1}^{r} X_i, where each X_i ∼ Exp(λ) is independent with E[X_i] = 1/λ and Var(X_i) = 1/λ². So by LoE,
E[X] = E[Σ_{i=1}^{r} X_i] = Σ_{i=1}^{r} E[X_i] = Σ_{i=1}^{r} 1/λ = r/λ
Now, we can use the fact that the variance of a sum of independent RVs is the sum of the variances (we have yet to prove this fact):
Var(X) = Var(Σ_{i=1}^{r} X_i) = Σ_{i=1}^{r} Var(X_i) = Σ_{i=1}^{r} 1/λ² = r/λ²
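We can sanity-check this sum-of-exponentials view by simulation. A minimal sketch (not from the text), using r = 5 and λ = 2 as arbitrary choices:

```python
import random

random.seed(0)
r, lam, trials = 5, 2.0, 100_000

# A Gamma(r, lam) sample is the sum of r independent Exp(lam) waiting times
samples = [sum(random.expovariate(lam) for _ in range(r)) for _ in range(trials)]

mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / trials
print(mean, r / lam)        # both ~2.5
print(var, r / lam ** 2)    # both ~1.25
```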
4.2.5 Exercises
1. Suppose that on average, 40 babies are born every hour in Seattle.
(a) What is the probability that no babies are born in the next minute? Try solving this in two different but equivalent ways - using a Poisson and Exponential RV.
(b) What is the probability that it takes more than 20 minutes for the first 10 babies to be born? Again, try solving this in two different but equivalent ways - using a Poisson and Gamma RV.
(c) What is the expected time until the 5th baby is born?
Solution:
(a) The number of babies born in the next minute is X ∼ Poi(40/60), so P(X = 0) = e^{−40/60} ≈ 0.5134. Alternatively, the time in minutes until the next baby is born is Y ∼ Exp(40/60), and we want the probability that no babies are born in the next minute; i.e., it takes at least one minute for the first baby to be born. Hence,
P(Y > 1) = 1 − F_Y(1) = 1 − (1 − e^{−2/3·1}) = e^{−2/3}
(b) The number of babies born in the next 20 minutes is W ∼ Poi(20 · 40/60), and the 10th baby takes more than 20 minutes exactly when fewer than 10 babies are born in those 20 minutes, so the answer is P(W ≤ 9) = Σ_{k=0}^{9} e^{−40/3} (40/3)^k / k!. Alternatively, the time in minutes until the tenth baby is born is Z ∼ Gamma(10, 40/60), and we are asking for the probability this is over 20 minutes:
P(Z > 20) = 1 − F_Z(20) = 1 − ∫_0^{20} ((40/60)^{10} / (10 − 1)!) x^{10−1} e^{−(40/60)x} dx
Unfortunately, there isn't a nice closed form for the Gamma CDF, but this would evaluate to the same result!
(c) The time in minutes until the 5th baby is born is V ∼ Gamma(5, 40/60), so E[V] = r/λ = 5/(40/60) = 7.5 minutes.
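Since the Gamma CDF has no closed form, the Poisson view from part (b) is also the easiest way to compute the number. A short sketch (my own) evaluating the equivalent Poisson sum:

```python
import math

lam = 40 / 60        # births per minute
mu = 20 * lam        # expected births in 20 minutes = 40/3

# P(10th baby takes more than 20 minutes) = P(fewer than 10 births in 20 minutes)
p = sum(math.exp(-mu) * mu ** k / math.factorial(k) for k in range(10))
print(p)             # equals P(Z > 20) for Z ~ Gamma(10, 2/3)
```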
2. You are waiting for a bus to take you home from CSE. You can either take the E-Line, U-Line, or C-Line. The distribution of the waiting time in minutes for each is the following:
• E-Line: E ∼ Exp(λ = 0.1).
• U-Line: U ∼ Unif(0, 20) (continuous).
• C-Line: Has range (1, ∞) and PDF f_C(x) = 1/x².
Assume the three bus arrival times are independent. You take the first bus that arrives.
(a) Find the CDFs of E, U , and C, FE (t), FU (t) and FC (t). Hint: The first two can be looked up in
our distributions handout!
(b) What is the probability you wait more than 5 minutes for a bus?
(c) What is the probability you wait more than 30 minutes for a bus?
Solution:
(a) The CDF of E for t > 0 is F_E(t) = 1 − e^{−0.1t} (see above).
The CDF of U for 0 < t < 20 is F_U(t) = t/20.
The CDF of C for t > 1 is F_C(t) = ∫_1^t f_C(x) dx = 1 − 1/t.
(b) Let B = min{E, U, C} be the time until the first bus. Then, the probability we wait more than 5 minutes is the probability that all of them take longer than 5 minutes to arrive. We can then multiply the individual probabilities due to independence:
P(B > 5) = (1 − F_E(5))(1 − F_U(5))(1 − F_C(5)) = e^{−0.5} · (15/20) · (1/5) = (3/20) e^{−0.5}
(c) The same exact logic applies here! But be careful of the range of U when plugging in the CDF. It is true that
P(B > 30) = P(E > 30) P(U > 30) P(C > 30)
But when plugging in P(U > 30) = 1 − F_U(30), we have to remember that F_U(30) = 1, because U must be in [0, 20]. That's why it is so important to define the piecewise function! This probability is indeed 0, since bus U will always come within 20 minutes.
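A Monte Carlo check of part (b) is a nice exercise: E and U can be sampled directly, and C can be sampled by inverse-transform sampling from its CDF (solving u = 1 − 1/t gives t = 1/(1 − u)). A sketch (my own, not from the text):

```python
import math
import random

random.seed(42)
trials = 200_000
over_5 = 0
for _ in range(trials):
    e = random.expovariate(0.1)          # E-Line ~ Exp(0.1)
    u = random.uniform(0, 20)            # U-Line ~ Unif(0, 20)
    c = 1 / (1 - random.random())        # C-Line via inverse CDF: F_C(t) = 1 - 1/t
    if min(e, u, c) > 5:
        over_5 += 1

print(over_5 / trials)                   # Monte Carlo estimate of P(B > 5)
print((3 / 20) * math.exp(-0.5))         # exact answer, ~0.0910
```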
Chapter 4. Continuous Random Variables
4.3: The Normal/Gaussian Random Variable
Slides (Google Drive) Video (YouTube)
We begin by standardizing random variables, in order to calculate the number of standard deviations above the mean a random variable's value is. (Note how we are using standard deviation instead of variance here, so the units are the same!)
Recall that in general, if X is any random variable (discrete or continuous) with E[X] = µ and Var(X) = σ², and a, b ∈ R, then
E[aX + b] = aE[X] + b = aµ + b
Var(aX + b) = a²Var(X) = a²σ²
In particular, we call (X − µ)/σ a standardized version of X, as it measures how many standard deviations above the mean a point is. We standardize random variables for fair comparison. Applying linearity of expectation and properties of variance to standardized random variables, we get their expectation and variance:
E[(X − µ)/σ] = (1/σ)(E[X] − µ) = 0
Var((X − µ)/σ) = (1/σ²) Var(X − µ) = (1/σ²) Var(X) = σ²/σ² = 1 ⟹ σ_{(X−µ)/σ} = √1 = 1
It turns out the mean is 0 and the standard deviation (and variance) is 1! This makes sense because on average, someone is average (0 standard deviations above the mean), and the standard deviation is 1.
If X ∼ N(µ, σ²), then X is a Normal (Gaussian) random variable with density
f_X(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}, for x ∈ R
where E[X] = µ and Var(X) = σ². Unfortunately, there is no closed-form formula for the CDF (there wasn't one for the Gamma RV either). We'll see how to compute these probabilities anyway soon, using a lookup table!
Normal distributions produce bell-shaped curves. Here are some visualizations of the density function for varying µ and σ².
For instance, a normal distribution with µ = 0 and σ = 1 produces the following bell curve:
If the standard deviation increases, it becomes more likely for the variable to be farther away from the mean, so the distribution becomes flatter. For instance, a curve with the same µ = 0 but higher σ = 2 (σ² = 4) looks like this:
If you change the mean, the distribution will shift left or right. For instance, increasing the mean so µ = 4
shifts the distribution 4 to the right. The shape of the curve remains unchanged:
If you change the mean AND standard deviation, the curve's shape changes and shifts. For instance, changing the mean so µ = 4 and the standard deviation so σ = 2 gives us a flatter, shifted curve:
However, scaling and shifting a random variable often does not keep it in the same family. Continuous uniform RVs are the only ones we've learned so far for which it does: if X ∼ Unif(0, 1), then 3X + 2 ∼ Unif(2, 5); we'll learn how to prove this in the next section! However, this is not true for the others; for example, the range of a Poi(λ) is {0, 1, 2, . . .} as it is the number of events in a unit of time, but 2X has range {0, 2, 4, 6, . . .}, so 2X cannot be Poisson (it can never be odd)! We'll see that Normal random variables do have these closure properties.
Theorem (Closure of the Normal under scale and shift): If X ∼ N(µ, σ²) and a, b ∈ R, then
aX + b ∼ N(aµ + b, a²σ²)
We will prove this theorem later in section 5.6 using Moment Generating Functions! This is really amazing - the mean and variance are no surprise. The fact that scaling and shifting a Normal random variable results in another Normal random variable is very interesting!
Theorem (Closure of the Normal under addition): Let X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) be ANY independent Normal random variables, and a, b, c ∈ R. Then,
aX + bY + c ∼ N(aµ_X + bµ_Y + c, a²σ_X² + b²σ_Y²)
Again, this is really amazing. The mean and variance aren’t a surprise again, but the fact that adding two
independent Normals results in another Normal distribution is not trivial, and we will prove this later as
well!
Example(s)
Suppose you believe temperatures in Vancouver, Canada each day are approximately normally distributed with mean 25 degrees Celsius and standard deviation 5 degrees Celsius. However, your American friend only understands Fahrenheit.
1. What is the distribution of temperatures each day in Vancouver in Fahrenheit? To convert Celsius (C) to Fahrenheit (F), the formula is F = (9/5)C + 32.
2. What is the distribution of the average temperature over a week in Vancouver, in Fahrenheit? That is, if you were to sample a random week's average temperature, what is its distribution? Assume the temperature each day is independent of the rest (this may not be a realistic assumption).
Solution
1. The degrees in Celsius are C ∼ N(µ_C = 25, σ_C² = 5²). Since F = (9/5)C + 32, we know by linearity of expectation and properties of variance:
µ_F = E[F] = E[(9/5)C + 32] = (9/5)E[C] + 32 = (9/5)·25 + 32 = 77
σ_F² = Var(F) = Var((9/5)C + 32) = (9/5)² Var(C) = (9/5)² · 5² = 9² = 81
These values are no surprise, but by closure of the Normal distribution, we can say that F ∼ N(µ_F = 77, σ_F² = 81).
2. Let F_1, F_2, . . . , F_7 be independent temperatures over a week, so each F_i ∼ N(µ_F = 77, σ_F² = 81). Let F̄ = (1/7) Σ_{i=1}^{7} F_i denote the average temperature over this week. Then, by linearity of expectation and properties of variance (requiring independence),
E[(1/7) Σ_{i=1}^{7} F_i] = (1/7) Σ_{i=1}^{7} E[F_i] = (1/7) · 7 · 77 = 77
Var((1/7) Σ_{i=1}^{7} F_i) = (1/7²) Σ_{i=1}^{7} Var(F_i) = (1/49) · 7 · 81 = 81/7
Note that the mean is the same, but the variance is smaller. This might make sense because we expect the average temperature over a week should match that of a single day, but it is more stable (has lower variance). By closure properties of the Normal distribution, since we take a sum of independent Normal RVs and then divide it by 7, F̄ = (1/7) Σ_{i=1}^{7} F_i ∼ N(µ = 77, σ² = 81/7).
If Z ∼ N(0, 1) is the standard normal (the normal RV with mean 0 and variance/standard deviation 1), we denote its CDF by Φ(a) = F_Z(a) = P(Z ≤ a), since it is so commonly used. There is no closed-form formula, so this CDF is stored in a table (called a "Phi Table"). Remember, Φ(a) is just the area to the left of a.
Since the normal distribution curve is symmetric, the area to the left of −a is the same as the area to the right of a. The picture below shows that Φ(−a) = 1 − Φ(a).
To get the CDF Φ(1.09) = P(Z ≤ 1.09) from the table, we look at the row with value 1.0 and the column with value 0.09, as marked here:
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.53983 0.5438 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2.0 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999
From this, we see that P(Z ≤ 1.39) = Φ(1.39) ≈ 0.91774. (Look at the gray row 1.3, and the column 0.09.) This table usually only has positive numbers, so if you want to look up negative numbers, it's necessary to use the fact that Φ(−a) = 1 − Φ(a). For example, if we want P(Z ≤ −2.13) = Φ(−2.13), we compute 1 − Φ(2.13) = 1 − 0.9834 = 0.0166 (try to find Φ(2.13) yourself above).
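In code, Φ can be computed exactly (no table rounding) from the error function, via the identity Φ(a) = (1 + erf(a/√2))/2. A small sketch (my own) reproducing the lookups above:

```python
from math import erf, sqrt

def Phi(a: float) -> float:
    """Standard normal CDF: Phi(a) = (1 + erf(a / sqrt(2))) / 2."""
    return (1 + erf(a / sqrt(2))) / 2

print(Phi(1.39))      # ~0.91774, matching the table entry
print(Phi(-2.13))     # ~0.0166
print(1 - Phi(2.13))  # same value, via the symmetry Phi(-a) = 1 - Phi(a)
```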
How does this help though when X is Normal but not the standard normal? In general, for X ∼ N(µ, σ²), we standardize: Z = (X − µ)/σ ∼ N(0, 1) by the closure properties, and so
P(X ≤ a) = P((X − µ)/σ ≤ (a − µ)/σ) = Φ((a − µ)/σ)
See some examples below of how we can use the table to calculate probabilities associated with the Normal
distributions! Again, the table gives the CDF of the standard Normal since it doesn’t have a closed form
like Uniform/Exponential. Also, any Normal RV can be standardized so we can look up probabilities in the
table!
Example(s)
Suppose the age of a random adult in the United States is (approximately) normally distributed with
mean 50 and standard deviation 15.
1. What is the probability that a randomly selected adult in the US is over 70 years old?
2. What is the probability that a randomly selected adult in the US is under 25 years old?
3. What is the probability that a randomly selected adult in the US is between 40 and 45 years
old?
Solution
1. The age of a random adult is X ∼ N(µ = 50, σ² = 15²), so remember we standardize to use the standard Gaussian:
P(X > 70) = P((X − 50)/15 > (70 − 50)/15) [standardize]
= P(Z > 1.33) [Z = (X − µ)/σ ∼ N(0, 1)]
= 1 − P(Z ≤ 1.33) [complement]
= 1 − Φ(1.33) [def of Φ]
= 1 − 0.9082 [look up table from earlier]
= 0.0918
2. We do a similar calculation:
P(X < 25) = P((X − 50)/15 < (25 − 50)/15) [standardize]
= P(Z < −5/3) [Z = (X − µ)/σ ∼ N(0, 1)]
= Φ(−5/3) = 1 − Φ(5/3) ≈ 1 − 0.95254 ≈ 0.0475 [symmetry; table, rounding 5/3 to 1.67]
3. We do a similar calculation:
P(40 < X < 45) = P((40 − 50)/15 < (X − 50)/15 < (45 − 50)/15) [standardize]
= P(−2/3 < Z < −1/3) [Z = (X − µ)/σ ∼ N(0, 1)]
= Φ(−1/3) − Φ(−2/3) = Φ(2/3) − Φ(1/3) ≈ 0.74857 − 0.62930 ≈ 0.1193 [symmetry; table, rounding to 0.67 and 0.33]
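The same three answers can be computed exactly via the erf-based Φ from earlier (small differences from the table come from rounding z to two decimal places). A sketch (my own):

```python
from math import erf, sqrt

mu, sigma = 50, 15

def Phi(a):
    return (1 + erf(a / sqrt(2))) / 2

print(1 - Phi((70 - mu) / sigma))                       # 1. P(X > 70)  ~ 0.0912
print(Phi((25 - mu) / sigma))                           # 2. P(X < 25)  ~ 0.0478
print(Phi((45 - mu) / sigma) - Phi((40 - mu) / sigma))  # 3. P(40 < X < 45) ~ 0.117
```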
4.3.5 Exercises
1. Suppose the time (in hours) it takes for you to finish pset i is approximately X_i ∼ N(µ = 10, σ² = 9) (for i = 1, . . . , 5), and the time (in hours) it takes for you to finish a project is approximately Y ∼ N(µ = 20, σ² = 10), with all of these times independent. Let W = X_1 + X_2 + X_3 + X_4 + X_5 + Y be the time it takes to complete all 5 psets and the project.
(a) What are the mean and variance of W ?
(b) What is the distribution of W and what are its parameter(s)?
(c) What is the probability that you complete all the homework in under 60 hours?
Solution:
(a) The mean by linearity of expectation is E [W ] = E [X1 ] + · · · + E [X5 ] + E [Y ] = 50 + 20 = 70.
Variance adds for independent RVs, so Var (W ) = Var (X1 )+· · ·+Var (X5 )+Var (Y ) = 45+10 = 55.
(b) Since W is the sum of independent Normal random variables, W is also Normal with the parameters we calculated above. So W ∼ N(µ = 70, σ² = 55).
(c)
P(W < 60) = P((W − 70)/√55 < (60 − 70)/√55) ≈ P(Z < −1.35) = Φ(−1.35) = 1 − Φ(1.35) = 1 − 0.9115 = 0.0885
Chapter 4. Continuous Random Variables
4.4: Transforming Continuous RVs
Slides (Google Drive) Video (YouTube)
Suppose the amount of gold a company can mine is X tons per year, and you have some (continuous)
distribution to model this. However, your earning is not simply X - it is actually a function of the amount
of product, some Y = g(X). What is the distribution of Y ?
Since we know the distribution of X, this will help us model the distribution of Y by transforming random
variables.
For discrete random variables, transforming is straightforward. For example, if X takes values in {−1, 0, 1} and Y = X², then p_Y(1) = p_X(−1) + p_X(1). This is because Y = 1 if and only if X ∈ {−1, 1}, so to find P(Y = 1), we sum the probabilities of all values x such that x² = 1. That's all this formula below says (the ":" means "such that"):
p_Y(y) = Σ_{x ∈ Ω_X : g(x) = y} p_X(x)
But for continuous random variables, we have density functions instead of mass functions. That means f_X is not actually a probability, and so we can't use this same technique. We want to work with the CDF F_X(x) = P(X ≤ x) instead, because it actually does represent a probability! It's best to see this idea through an example.
Example(s)
Suppose you know X ∼ Unif(0, 9) (continuous). What is the PDF of Y = √X?
The CDF of X is derived by taking the integral of the PDF, giving us (can also cite this):
F_X(x) =
  0,    if x < 0
  x/9,  if 0 ≤ x ≤ 9
  1,    if x > 9
Now, we determine the range of Y. The smallest value that Y can take is √0 = 0, and the largest value that Y can take is √9 = 3, from the range of X. Since the square root function is monotone increasing, this gives us
Ω_Y = [0, 3]
But can we assume that, because X has a uniform distribution, Y does too?
This is not the case! Notice that values of X in the range [0, 1] will map to Y values in the range [0, 1]. But,
X values in the range [1, 4] map to Y values in the range [1, 2] and X values in the range [4, 9] map to Y
values in the range [2, 3].
So, there is a much larger range of values of X that map to [2, 3] than to [0, 1] (since [4, 9] is a larger range
than [0, 1]). Therefore, Y ’s distribution shouldn’t be uniform. So, we cannot define the PDF of Y using the
assumption that Y is uniform.
Instead, we will first compute the CDF F_Y and then differentiate it to get the PDF f_Y, for y ∈ [0, 3]:
F_Y(y) = P(Y ≤ y) = P(√X ≤ y) = P(X ≤ y²) = F_X(y²) = y²/9
Be very careful when squaring both sides of an inequality - it may not keep the inequality true. In this case we didn't have to worry, since X and Y were both guaranteed nonnegative. Differentiating,
f_Y(y) = d/dy F_Y(y) = 2y/9
Here is an image of the original and transformed PDFs! Remember that X ∼ Unif(0, 9) and Y = √X.
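A quick empirical check that Y = √X is not uniform, and that its CDF really is y²/9 (a sketch of my own, not from the text):

```python
import random

random.seed(0)
n = 100_000
ys = [random.uniform(0, 9) ** 0.5 for _ in range(n)]   # Y = sqrt(X), X ~ Unif(0, 9)

# The empirical CDF of Y should match F_Y(y) = y^2 / 9 at any test point
for y in [0.5, 1.0, 2.0, 2.9]:
    print(y, sum(v <= y for v in ys) / n, y ** 2 / 9)
```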
This is the general strategy for transforming continuous RVs! We’ll summarize the steps below.
Example(s)
Suppose X has range Ω_X = [−1, +1] and CDF F_X(x) = (2 + 3x − x³)/4 (you can verify this is a valid CDF). What is the PDF of Y = X⁴?
Solution
1. The CDF of X is already given, so we don't need to derive it.
2. The range of Y = X⁴ is Ω_Y = {x⁴ : x ∈ [−1, +1]} = [0, 1], since x⁴ is always nonnegative and between 0 and 1 for x ∈ [−1, +1].
3. Be careful in the third equation below to include both lower and upper bounds (draw the function y = x⁴ to see why). For y ∈ Ω_Y = [0, 1], we will compute the CDF:
F_Y(y) = P(Y ≤ y) [def of CDF]
= P(X⁴ ≤ y) [def of Y]
= P(−y^{1/4} ≤ X ≤ y^{1/4}) [don't forget the negative side]
= P(X ≤ y^{1/4}) − P(X ≤ −y^{1/4})
= F_X(y^{1/4}) − F_X(−y^{1/4}) [def of CDF of X]
= (1/4)(2 + 3y^{1/4} − (y^{1/4})³) − (1/4)(2 + 3(−y^{1/4}) − (−y^{1/4})³) [plug in CDF]
4. The last step is to differentiate the CDF to get the PDF, which is just computational, so I'll skip it!
If g is strictly monotone and invertible with inverse X = g^{−1}(Y) = h(Y), then
f_Y(y) = f_X(h(y)) · |h′(y)|
That is, the PDF of Y at y is the PDF of X evaluated at h(y) (the value of x that maps to y), multiplied by the absolute value of the derivative of h(y).
Note that the formula method is not as general as the previous method (using CDF), since g must satisfy
monotonicity and invertibility. So transforming via CDF always works, but transforming may not work with
this explicit formula all the time.
Proof of Formula to get PDF of Y = g(X) from X.
Suppose Y = g(X) and g is strictly monotone and invertible with inverse X = g^{−1}(Y) = h(Y). We'll assume g is strictly monotone increasing and leave it to you to prove it for the case when g is strictly monotone decreasing (it's very similar). For y ∈ Ω_Y, since h is also increasing and applying it to both sides preserves the inequality,
F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = P(X ≤ h(y)) = F_X(h(y))
Differentiating with the chain rule gives
f_Y(y) = d/dy F_X(h(y)) = f_X(h(y)) h′(y)
A similar proof would hold if g were monotone decreasing, except in the third line we would flip the sign of the inequality and make the h′(y) become an absolute value: |h′(y)|.
Now let’s try the same example as we did earlier, but using this new method instead.
Example(s)
Suppose you know X ∼ Unif(0, 9) (continuous). What is the PDF of Y = √X?
Our goal is to use the formula f_Y(y) = f_X(h(y)) · |h′(y)|, after verifying some conditions on g.
Let g(t) = √t. This is strictly monotone increasing on Ω_X = [0, 9]. This means that as t increases, √t also increases - therefore, g(t) is an increasing function.
What is the inverse of this function g? The inverse of the square root function is just the squaring function:
h(y) = g^{−1}(y) = y²
h′(y) = 2y
Hence, for y ∈ [0, 3], since f_X(x) = 1/9 on [0, 9],
f_Y(y) = f_X(y²) · |2y| = (1/9) · 2y = 2y/9
Note that we dropped the absolute value because we already assume y ∈ [0, 3], and hence 2y is always nonnegative. This gives the same formula as earlier, as it should!
Let X = (X_1, ..., X_n), Y = (Y_1, ..., Y_n) be continuous random vectors (each component is a continuous RV) with the same dimension n (so Ω_X, Ω_Y ⊆ R^n), and Y = g(X), where g : Ω_X → Ω_Y is invertible and differentiable, with differentiable inverse X = g^{−1}(y) = h(y). Then,
f_Y(y) = f_X(h(y)) |det(∂h(y)/∂y)|
where ∂h(y)/∂y ∈ R^{n×n} is the Jacobian matrix of partial derivatives of h, with
(∂h(y)/∂y)_{ij} = ∂(h(y))_i / ∂y_j
Hopefully this formula looks very similar to the one for the single-dimensional case! This formula is just for
your information and you’ll never have to use it in this class.
4.4.4 Exercises
1. Suppose X has range Ω_X = (1, ∞) and density function
f_X(x) =
  2/x³,  if x > 1
  0,     otherwise
Let Y = (e^X − 1)/2.
(a) Compute the density function of Y via the CDF transformation method.
(b) Compute the density function of Y using the formula, but explicitly verify the monotonicity and
invertibility conditions.
Solution:
(a) The range of Y is Ω_Y = ((e − 1)/2, ∞). For y ∈ Ω_Y, note that Y ≤ y exactly when X ≤ ln(2y + 1), and F_X(x) = ∫_1^x 2/t³ dt = 1 − 1/x² for x > 1, so
F_Y(y) = P(Y ≤ y) = P(X ≤ ln(2y + 1)) = F_X(ln(2y + 1)) = 1 − 1/[ln(2y + 1)]²
Hence,
f_Y(y) = d/dy F_Y(y) = (2/[ln(2y + 1)]³) · (1/(2y + 1)) · 2 = 4 / ((2y + 1)[ln(2y + 1)]³)
Chapter 5. Multiple Random Variables
5.1: Joint Discrete Distributions
Slides (Google Drive) Video (YouTube)
This chapter, especially Sections 5.1-5.6, is arguably the most difficult in this entire text. It might take more time to fully absorb, but you'll get it, so don't give up!
We are finally going to talk about what happens when we want the probability distribution of more than one random variable. This will be called the joint distribution of two or more random variables. In this section, we'll focus on joint discrete distributions, and in the next, joint continuous distributions. We'll also finally prove that the variance of the sum of independent RVs is the sum of the variances, an important fact that we've been using without proof! But first, we need to review what a Cartesian product of sets is.
Definition: The Cartesian product of sets A and B is
A × B = {(a, b) : a ∈ A, b ∈ B}
Further, if A, B are finite sets, then |A × B| = |A| · |B| by the product rule of counting.
Example(s)
Write each of the following in a notation that does not involve a Cartesian product:
1. {1, 2, 3} × {4, 5}
2. R² = R × R
Solution
1. Here, we have:
{1, 2, 3} × {4, 5} = {(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5)}
We have each of the elements of the first set paired with each of the elements of the second set. Note that |{1, 2, 3}| = 3, |{4, 5}| = 2, and |{1, 2, 3} × {4, 5}| = 6.
2. This is the xy-plane (2D space), which is denoted:
R² = R × R = {(x, y) : x ∈ R, y ∈ R}
Suppose we roll two fair 4-sided dice independently, one blue and one red. Let X be the value of the blue die and Y be the value of the red die. Note:
Ω_X = {1, 2, 3, 4}
Ω_Y = {1, 2, 3, 4}
Then we can also consider Ω_{X,Y}, the joint range of X and Y. The joint range happens to be any combination of {1, 2, 3, 4} for both rolls. This can be written as:
Ω_{X,Y} = Ω_X × Ω_Y
Further, each of these will be equally likely (as shown in the table below):
X \ Y 1 2 3 4
1 1/16 1/16 1/16 1/16
2 1/16 1/16 1/16 1/16
3 1/16 1/16 1/16 1/16
4 1/16 1/16 1/16 1/16
Above is a suitable way to write the joint probability mass function of X and Y, as it enumerates the probability of every pair of values. If we wanted to write it as a formula, p_{X,Y}(x, y) = P(X = x, Y = y) for x, y ∈ Ω_{X,Y}, we have:
p_{X,Y}(x, y) =
  1/16,  if (x, y) ∈ Ω_{X,Y}
  0,     otherwise
Note that either this piecewise function or the table above are valid ways to express the joint PMF.
p_{X,Y}(a, b) = P(X = a, Y = b)
The joint range is the set of pairs (c, d) that have nonzero probability:
Ω_{X,Y} = {(c, d) ∈ Ω_X × Ω_Y : p_{X,Y}(c, d) > 0}
Further, note that if g : R² → R is a function, then LOTUS extends to the multidimensional case:
E[g(X, Y)] = Σ_{x ∈ Ω_X} Σ_{y ∈ Ω_Y} g(x, y) p_{X,Y}(x, y)
A lot of things are just the same as what we learned in Chapter 3, but extended! Note that the joint range Ω_{X,Y} above is always a subset of Ω_X × Ω_Y, and they're not necessarily equal. Let's see an example of this.
Back to our example of the blue and red die rolls. Again, let X be the value of the blue die and Y be the value of the red die. Now, let U = min{X, Y} (the smaller of the two die rolls) and V = max{X, Y} (the larger of the two die rolls). Then:
Ω_U = {1, 2, 3, 4}
Ω_V = {1, 2, 3, 4}
because both random variables can take on any of the four values that appear on the dice (e.g., it is possible for the minimum to be 4 if we roll (4, 4), and for the maximum to be 1 if we roll (1, 1)).
However, there is the constraint that the minimum value U is always at most the maximum value V. That is, the joint range would not include the pair (u, v) = (4, 1), for example, since the probability that the minimum is 4 and the maximum is 1 is zero. We can write this formally as the subset of the Cartesian product subject to u ≤ v:
Ω_{U,V} = {(u, v) ∈ Ω_U × Ω_V : u ≤ v} ≠ Ω_U × Ω_V
This will just be all the ordered pairs of the values that can appear as U and V. Now, however, these are not equally likely, as shown in the table below. Notice that any pair (u, v) with u > v has zero probability, as promised. We'll explain how we got the other numbers under the table.
U \ V 1 2 3 4
1 1/16 2/16 2/16 2/16
2 0 1/16 2/16 2/16
3 0 0 1/16 2/16
4 0 0 0 1/16
As discussed earlier, we can't have the case where U > V, so these are all 0. The case U = V occurs when the blue and red die have the same value, each of which occurs with probability 1/16 as shown earlier. For example, p_{U,V}(2, 2) = P(U = 2, V = 2) = 1/16, since only one of the 16 equally likely outcomes, (2, 2), gives this result. The others, in which U < V, each occur with probability 2/16, because it could be the red die with the max and the blue die with the min, or the reverse. For example, p_{U,V}(1, 3) = P(U = 1, V = 3) = 2/16, because two of the 16 outcomes, (1, 3) and (3, 1), would result in the min being 1 and the max being 3.
So for the joint PMF as a formula, p_{U,V}(u, v) = P(U = u, V = v) for u, v ∈ Ω_{U,V}, we have:
p_{U,V}(u, v) =
  2/16,  if (u, v) ∈ Ω_U × Ω_V and v > u
  1/16,  if (u, v) ∈ Ω_U × Ω_V and v = u
  0,     otherwise
Again, the piecewise function and the table are both valid ways to express the joint PMF, and you may choose whichever is easier for you. When the joint range is larger, it might be infeasible to use a table though!
Now, what is P(U = 1)? You might think the answer is 7/16, but how did you get that? Well, P(U = 1) would be the sum of the first row, since that is all the cases where U = 1. You computed
P(U = 1) = P(U = 1, V = 1) + P(U = 1, V = 2) + P(U = 1, V = 3) + P(U = 1, V = 4) = 1/16 + 2/16 + 2/16 + 2/16 = 7/16
Mathematically, we have
P(U = u) = Σ_{v ∈ Ω_V} P(U = u, V = v)
Does this look like anything we learned before? It's just the law of total probability (intersection version) that we derived in 2.2, as the events {V = v}_{v ∈ Ω_V} partition the sample space (V takes on exactly one value)! We can refer to the table above and sum each row (each row corresponds to a value of u) to find the probability of that value of u occurring. That gives us the following:
p_U(u) =
  7/16,  if u = 1
  5/16,  if u = 2
  3/16,  if u = 3
  1/16,  if u = 4
For example,
P(U = 4) = P(U = 4, V = 1) + P(U = 4, V = 2) + P(U = 4, V = 3) + P(U = 4, V = 4) = 0 + 0 + 0 + 1/16 = 1/16
This brings us to the definition of marginal PMFs. The idea of these is: given a joint probability distribution,
what is the distribution of just one of them (or a subset)? We get this by marginalizing (summing) out the
other variables.
(Extension) If Z is also a discrete random variable, then the marginal PMF of Z is:
p_Z(z) = Σ_{x ∈ Ω_X} Σ_{y ∈ Ω_Y} p_{X,Y,Z}(x, y, z)
This follows from the law of total probability, and is just like taking the sum of a row in the example above.
Now if asked for E[U], for example, we actually don't need the joint PMF anymore. We've extracted the pertinent information in the form of p_U(u), and compute E[U] = Σ_u u p_U(u) normally.
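The whole pipeline - enumerate outcomes, build the joint PMF, marginalize, compute an expectation - is only a few lines of code for this example. A sketch (my own, not from the text), using exact fractions:

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Enumerate the 16 equally likely (blue, red) outcomes of two fair 4-sided dice
joint = defaultdict(Fraction)
for x, y in product(range(1, 5), repeat=2):
    joint[(min(x, y), max(x, y))] += Fraction(1, 16)

print(joint[(1, 3)], joint[(2, 2)], joint[(4, 1)])   # 1/8, 1/16, 0

# Marginalize out V to recover p_U, then compute E[U]
p_U = defaultdict(Fraction)
for (u, v), p in joint.items():
    p_U[u] += p
print(dict(p_U))                            # {1: 7/16, 2: 5/16, 3: 3/16, 4: 1/16}
print(sum(u * p for u, p in p_U.items()))   # E[U] = 15/8
```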
5.1.4 Independence
We'll now redefine independence of RVs in terms of the joint PMF. This is completely the same as the definition we gave earlier, just with the new notation we learned.
Definition: Discrete RVs X and Y are independent, written X ⊥ Y, if for all x ∈ Ω_X and all y ∈ Ω_Y:
p_{X,Y}(x, y) = p_X(x) p_Y(y)
Recall the joint range Ω_{X,Y} = {(x, y) : p_{X,Y}(x, y) > 0} ⊆ Ω_X × Ω_Y is always a subset of the Cartesian product of the individual ranges. A necessary but not sufficient condition for independence is that Ω_{X,Y} = Ω_X × Ω_Y. That is, if Ω_{X,Y} ≠ Ω_X × Ω_Y, then X and Y cannot be independent; but if Ω_{X,Y} = Ω_X × Ω_Y, then we have to check the condition above.
This is because if there is some (a, b) ∈ Ω_X × Ω_Y but not in Ω_{X,Y}, then p_{X,Y}(a, b) = 0 but p_X(a) > 0 and p_Y(b) > 0, violating independence. For example, suppose the joint PMF looks like:
X \Y 8 9 Row Total pX (x)
3 1/3 1/2 5/6
7 1/6 0 1/6
Col Total pY (y) 1/2 1/2 1
Also, side note: the marginal distributions are named what they are since we often write the row and column totals in the margins. Here the joint range Ω_{X,Y} ≠ Ω_X × Ω_Y, since one of the entries is 0: (7, 9) ∉ Ω_{X,Y} but (7, 9) ∈ Ω_X × Ω_Y. This immediately tells us they cannot be independent - p_X(7) > 0 and p_Y(9) > 0, yet p_{X,Y}(7, 9) = 0.
Example(s)
Suppose the joint PMF of X and Y is given by the table below (with the row and column totals not yet filled in):
X \ Y 6 9
0 3/12 5/12
2 1/12 2/12
3 0 1/12
1. Find the marginal PMFs p_X and p_Y.
2. Find E[Y].
3. Are X and Y independent?
Solution
1. Actually, these can be found by filling in the row and column totals, since
p_X(x) = Σ_y p_{X,Y}(x, y),   p_Y(y) = Σ_x p_{X,Y}(x, y)
For example, P(X = 0) = p_X(0) = Σ_y p_{X,Y}(0, y) = p_{X,Y}(0, 6) + p_{X,Y}(0, 9) = 3/12 + 5/12 = 8/12 is the sum of the first row.
X \Y 6 9 Row Total pX (x)
0 3/12 5/12 8/12
2 1/12 2/12 3/12
3 0 1/12 1/12
Col Total pY (y) 4/12 8/12 1
Hence,
p_X(x) =
  8/12,  if x = 0
  3/12,  if x = 2
  1/12,  if x = 3
p_Y(y) =
  4/12,  if y = 6
  8/12,  if y = 9
2. We can actually compute E[Y] just using p_Y, now that we've eliminated/marginalized out X - we don't need the joint PMF anymore. We go back to the definition:
E[Y] = Σ_y y p_Y(y) = 6 · (4/12) + 9 · (8/12) = 8
3. X, Y are independent if, for every table entry (x, y), we have p_{X,Y}(x, y) = p_X(x) p_Y(y). However, notice p_{X,Y}(3, 6) = 0 but p_X(3) > 0 and p_Y(6) > 0. Hence we found an entry where this condition isn't true, so they cannot be independent. This is like the comment mentioned earlier: if Ω_{X,Y} ≠ Ω_X × Ω_Y, they have no chance of being independent.
Lemma: If X ⊥ Y, then E[XY] = E[X] E[Y]. The quantity E[XY] just sums over all the entries in the table (x, y) and takes a weighted average of all values xy, weighted by p_{X,Y}(x, y) = P(X = x, Y = y).
Note this property relies on the fact that they are independent, whereas linearity of expectation always holds, regardless.
Proof of Lemma.
E[XY] = Σ_{x ∈ Ω_X} Σ_{y ∈ Ω_Y} xy p_{X,Y}(x, y) [LOTUS]
= Σ_{x ∈ Ω_X} Σ_{y ∈ Ω_Y} xy p_X(x) p_Y(y) [X ⊥ Y, so p_{X,Y}(x, y) = p_X(x) p_Y(y)]
= (Σ_{x ∈ Ω_X} x p_X(x)) (Σ_{y ∈ Ω_Y} y p_Y(y))
= E[X] E[Y]
Proof of Variance Adds for Independent RVs. Now we have the following:
Var(X + Y) = E[(X + Y)²] − (E[X + Y])² [def of variance]
= E[X² + 2XY + Y²] − (E[X] + E[Y])² [linearity of expectation]
= E[X²] + 2E[XY] + E[Y²] − (E[X])² − 2E[X]E[Y] − (E[Y])² [linearity of expectation]
= (E[X²] − (E[X])²) + (E[Y²] − (E[Y])²) + 2(E[XY] − E[X]E[Y]) [rearranging]
= Var(X) + Var(Y) + 2(E[X]E[Y] − E[X]E[Y]) [lemma, since X ⊥ Y]
= Var(X) + Var(Y) + 0 [def of variance]
And here is the proof that linearity of expectation holds even without independence:
E[X + Y] = Σ_x Σ_y (x + y) p_{X,Y}(x, y) [LOTUS]
= Σ_x Σ_y x p_{X,Y}(x, y) + Σ_x Σ_y y p_{X,Y}(x, y) [split sum]
= Σ_x x Σ_y p_{X,Y}(x, y) + Σ_y y Σ_x p_{X,Y}(x, y) [algebra]
= Σ_x x p_X(x) + Σ_y y p_Y(y) [def of marginal PMF]
= E[X] + E[Y]
5.1.7 Exercises
1. Suppose we flip a fair coin three times independently. Let X be the number of heads in the first two
flips, and Y be the number of heads in the last two flips (there is overlap).
(a) What distribution do X and Y have marginally, and what are their ranges?
(b) What is pX,Y (x, y)? Fill in this table below. You may want to fill in the marginal distributions
first!
(c) What is ⌦X,Y , using your answer to (b)?
(d) Write a formula for E [cos(XY )].
(e) Are X, Y independent?
Solution:
(a) Since X counts the number of heads in two independent flips of a fair coin, then X ⇠ Bin(n =
2, p = 0.5). Y also has this distribution! Their ranges are ⌦X = ⌦Y = {0, 1, 2}.
(b) First, fill in the marginal distributions, which should be 1/4, 1/2, 1/4 for the probability that
X = 0, X = 1, and X = 2 respectively (same for Y ).
First let’s start with pX,Y (2, 2) = P (X = 2, Y = 2). If X = 2, that means the first two flips
must’ve been heads. If Y = 2, that means the last two flips must’ve been heads. So the probabil-
ity that X = 2, Y = 2 is the probability of the single outcome HHH, which is 1/8. Apply similar
logic for pX,Y (0, 0) = P (X = 0, Y = 0) which is the probability of TTT.
Then, pX,Y (0, 2) = P (X = 0, Y = 2). If X = 0 then the first two flips are tails. If Y = 2, the last
two flips are heads. This is impossible, so P (X = 0, Y = 2) = 0. Similarly, P (X = 2, Y = 0) = 0
as well. Now use the constraints (the row totals and col totals) to fill in the rest! For example, the first row must sum to 1/4, and we have two of its three entries, p_{X,Y}(0, 0) and p_{X,Y}(0, 2), so p_{X,Y}(0, 1) = 1/4 − 1/8 − 0 = 1/8.
X \Y 0 1 2 Row Total pX (x)
0 1/8 1/8 0 1/4
1 1/8 1/4 1/8 1/2
2 0 1/8 1/8 1/4
Col Total pY (y) 1/4 1/2 1/4 1
(c) From the previous part, we can see that the joint range is everything in the Cartesian product except (0, 2) and (2, 0), so Ω_{X,Y} = (Ω_X × Ω_Y) \ {(0, 2), (2, 0)}.
(d) By LOTUS extended to multiple variables,
E[cos(XY)] = Σ_x Σ_y cos(xy) p_{X,Y}(x, y)
(e) No, the joint range is not equal to the Cartesian product. This immediately makes independence impossible. The intuitive reason is that, since (0, 2) ∉ Ω_{X,Y} for example, if we know X = 0, then Y cannot be 2. Formally, there exists a pair (x, y) ∈ Ω_X × Ω_Y (namely (x, y) = (0, 2)) such that p_{X,Y}(0, 2) = 0 but p_X(0) > 0 and p_Y(2) > 0. Hence, p_{X,Y}(0, 2) ≠ p_X(0) p_Y(2), which violates independence.
2. Suppose radioactive particles at Area 51 are emitted at an average rate of λ per second. You want to
measure how many particles are emitted, but your geiger-counter (device that measures radioactivity)
fails to record each particle independently with some small probability p. Let X be the number of
particles emitted, and Y be the number of particles observed (by your geiger-counter).
(a) Describe the joint range ⌦X,Y using set notation.
(b) Write a formula (not a table) for pX,Y (x, y).
The marginal PMF of Y, found by marginalizing out X, is then:
p_Y(y) = Σ_{x ∈ Ω_X} p_{X,Y}(x, y) = Σ_{x=y}^{∞} e^{−λ} (λ^x / x!) · C(x, y) (1 − p)^y p^{x−y}
3. (Multivariate Hypergeometric.) Suppose a bag contains N marbles, K_i of each color i = 1, . . . , r (so Σ_{i=1}^{r} K_i = N). We draw n marbles without replacement; let X_i be the number of marbles of color i drawn. Write a formula for the joint PMF of (X_1, . . . , X_r).
Solution:
p_{X_1,...,X_r}(k_1, . . . , k_r) = [C(K_1, k_1) · · · C(K_r, k_r)] / C(N, n) = [∏_{i=1}^{r} C(K_i, k_i)] / C(N, n)
Chapter 5. Multiple Random Variables
5.2: Joint Continuous Distributions
Slides (Google Drive) Video (YouTube)
The joint PDF of continuous random variables X and Y is a function f_{X,Y} satisfying
f_{X,Y}(a, b) ≥ 0
The joint range is the set of pairs (c, d) that have nonzero density:
Ω_{X,Y} = {(c, d) ∈ Ω_X × Ω_Y : f_{X,Y}(c, d) > 0}
Further, note that if g : R² → R is a function, then LOTUS extends to the multidimensional case:
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(s, t) f_{X,Y}(s, t) ds dt
The joint PDF must satisfy the following (similar to univariate PDFs):
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f_{X,Y}(x, y) dy dx
Example(s)
Let X and Y be two jointly continuous random variables with the following joint PDF:
f_{X,Y}(x, y) =
  x + cy²,  if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
  0,        otherwise
(a) Find the joint range Ω_{X,Y}.
(b) Find the value of c that makes this a valid joint PDF.
(c) Find P(0 ≤ X ≤ 1/2, 0 ≤ Y ≤ 1/2).
Solution
(a)
Ω_{X,Y} = {(x, y) ∈ R² : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1}
(b) For f_{X,Y} to be a valid joint PDF, it must integrate to 1:
1 = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy
= ∫_0^1 ∫_0^1 (x + cy²) dx dy
= ∫_0^1 [x²/2 + cy²x]_{x=0}^{x=1} dy
= ∫_0^1 (1/2 + cy²) dy
= [y/2 + cy³/3]_{y=0}^{y=1}
= 1/2 + c/3
Thus, c = 3/2.
(c)
P(0 ≤ X ≤ 1/2, 0 ≤ Y ≤ 1/2) = ∫_0^{1/2} ∫_0^{1/2} (x + (3/2)y²) dx dy
= ∫_0^{1/2} [x²/2 + (3/2)y²x]_{x=0}^{x=1/2} dy
= ∫_0^{1/2} (1/8 + (3/4)y²) dy
= 3/32
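Double integrals like these are easy to sanity-check numerically with a midpoint-rule Riemann sum. A sketch (my own, not from the text) verifying both the normalization with c = 3/2 and the probability just computed:

```python
# Midpoint-rule check of the normalization and of P(0<=X<=1/2, 0<=Y<=1/2)
n = 1000
h = 1 / n

def f(x, y):
    return x + 1.5 * y * y        # the joint PDF with c = 3/2

total = sum(f((i + 0.5) * h, (j + 0.5) * h)
            for i in range(n) for j in range(n)) * h * h
corner = sum(f((i + 0.5) * h, (j + 0.5) * h)
             for i in range(n // 2) for j in range(n // 2)) * h * h
print(total)    # ~1.0  (valid density)
print(corner)   # ~0.09375 = 3/32
```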
Example(s)
Let X and Y be two jointly continuous random variables with the following PDF:
f_{X,Y}(x, y) =
  x + y,  if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
  0,      otherwise
Find E[XY²].
Solution By LOTUS,
E[XY²] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} xy² f_{X,Y}(x, y) dx dy
= ∫_0^1 ∫_0^1 xy²(x + y) dx dy
= ∫_0^1 ((1/3)y² + (1/2)y³) dy
= 17/72
Suppose that X and Y are jointly distributed continuous random variables with joint PDF f_{X,Y}(x, y). The marginal PDFs of X and Y are respectively given by the following:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx
Note this is exactly like for joint discrete random variables, with integrals instead of sums.
(Extension): If Z is also a continuous random variable, then the marginal PDF of Z is:
f_Z(z) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y,Z}(x, y, z) dx dy
Example(s)
Find the marginal PDFs f_X(x) and f_Y(y) given the joint PDF:
f_{X,Y}(x, y) =
  x + (3/2)y²,  if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1
  0,            otherwise
Then, compute E[X]. (This is the same joint density as the first example, plugging in c = 3/2.)
Solution For 0 ≤ x ≤ 1:
f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy
= ∫_0^1 (x + (3/2)y²) dy
= [xy + y³/2]_{y=0}^{y=1}
= x + 1/2
For 0 ≤ y ≤ 1:
f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx
= ∫_0^1 (x + (3/2)y²) dx
= [x²/2 + (3/2)y²x]_{x=0}^{x=1}
= (3/2)y² + 1/2
Note that to compute E[X], for example, we can either use LOTUS or just the marginal PDF f_X(x). These methods are equivalent. By LOTUS (taking g(X, Y) = X),
E[X] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x f_{X,Y}(x, y) dx dy = ∫_0^1 ∫_0^1 x (x + (3/2)y²) dx dy
Alternatively, by definition of expectation for a single RV,
E[X] = ∫_{−∞}^{∞} x f_X(x) dx = ∫_0^1 x (x + 1/2) dx
It only takes two lines or so of algebra to show they are equal!
Definition: Continuous RVs X and Y are independent if f_{X,Y}(x, y) = f_X(x) f_Y(y) for all x, y.
Recall Ω_{X,Y} = {(x, y) : f_{X,Y}(x, y) > 0} ⊆ Ω_X × Ω_Y. A necessary but not sufficient condition for independence is that Ω_{X,Y} = Ω_X × Ω_Y. That is, if Ω_{X,Y} = Ω_X × Ω_Y, then we have to check the condition above, but if not, then we know they are not independent.
This is because if there is some (a, b) ∈ Ω_X × Ω_Y but not in Ω_{X,Y}, then f_{X,Y}(a, b) = 0 but f_X(a) > 0 and f_Y(b) > 0, which violates independence. (This is very similar to independence for discrete RVs.)
Example(s)
Let's return to our dart example. Suppose (X, Y) is jointly and uniformly distributed on the circle of radius R centered at the origin (e.g., a dart throw).
1. First find and sketch the joint range Ω_{X,Y}.
2. Now, write an expression for the joint PDF f_{X,Y}(x, y) and carefully define it for all x, y ∈ R.
3. Now, solve for the range of X and write an expression we can evaluate to find fX (x), the
marginal PDF for X.
4. Now, let Z be the distance from the center that the dart falls. Find ⌦Z and write an expression
for E [Z].
5. Finally, determine using the definition of independence whether X and Y are independent.
Solution
1. The joint range is Ω_{X,Y} = {(x, y) ∈ R² : x² + y² ≤ R²}, since the values must be within the circle of radius R. We can sketch the range as follows, with the semi-circles below and above the x-axis labeled with their respective equations.
2. The height of the density function is constant, say h, since it is uniform. The double integral over all x and y must equal one (∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1), meaning the volume of this cylinder must be 1. The volume is base times height, which is πR² · h, and setting it equal to 1 gives h = 1/(πR²). This gives:
f_{X,Y}(x, y) =
  1/(πR²),  if x² + y² ≤ R²
  0,        otherwise
3. Well, X can range from −R to R, since there are points on the circle with x values in this range. So the range of X is:
Ω_X = [−R, R]
Setting up this integral will be trickier than in the earlier examples, because when finding f_X(x) and integrating out the y, the limits of integration actually depend on x. Imagine making a tick mark at some x ∈ [−R, R] (on the x-axis) and drawing a vertical line through x: where does y enter and leave (like summing a column in a joint PMF)? Based on the equations we had earlier for y in terms of x (see the sketch above), this gives us:
f_X(x) = ∫_{−√(R²−x²)}^{√(R²−x²)} f_{X,Y}(x, y) dy
Again, this is different from the previous examples, and you MUST sketch/plot the joint range to figure this out. If you learned how to do double integrals, this is exactly the same idea.
4. Well, the distance will be given by Z = √(X² + Y²), which is the definition of distance. We can further see that Z will take on any value from 0 to R, since the point could be at the origin or as far away as R. This gives Ω_Z = [0, R].
Then, to solve for the expected value of Z, we can use LOTUS, and only integrate over the joint range of X and Y (since the joint PDF is 0 elsewhere). We have to be careful in setting up the bounds of our integral. X will range from −R to R as we discussed earlier. But as X ranges across these values, Y will range from −√(R²−x²) to √(R²−x²). We had Z = √(X² + Y²), so for the expected value we have:
E[Z] = E[√(X² + Y²)] = ∫_{−R}^{R} ∫_{−√(R²−x²)}^{√(R²−x²)} √(x² + y²) f_{X,Y}(x, y) dy dx
Note that we could've set up this integral dx dy instead - what would the limits of integration have been? It would've been
E[Z] = E[√(X² + Y²)] = ∫_{−R}^{R} ∫_{−√(R²−y²)}^{√(R²−y²)} √(x² + y²) f_{X,Y}(x, y) dx dy
Your outer limits must be just the range of Y (both constants), and your inner limits may depend on the outer variable of integration.
5. No, they are not independent. We can see this with the test: Ω_{X,Y} ≠ Ω_X × Ω_Y. This is because X and Y both have marginal range from −R to R, but the joint range is not the rectangle [−R, R] × [−R, R] (it is a circle). More explicitly, take a point (0.99R, 0.99R), which is basically the top right corner of the square. We get 0 = f_{X,Y}(0.99R, 0.99R) ≠ f_X(0.99R) f_Y(0.99R) > 0. This is because the joint PDF is defined to be 0 at (0.99R, 0.99R) (not in the circle), but the marginal PDFs of both X and Y are nonzero at 0.99R (since 0.99R is in the marginal range of both).
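Rather than evaluating that double integral by hand, we can estimate E[Z] by rejection sampling: draw uniform points in the bounding square and keep those inside the circle. A sketch (my own, not from the text); the double integral above works out to 2R/3, which the simulation should match:

```python
import random

random.seed(7)
R, n = 1.0, 200_000
total, kept = 0.0, 0

# Rejection sampling: uniform in the square, keep only points inside the circle
while kept < n:
    x, y = random.uniform(-R, R), random.uniform(-R, R)
    if x * x + y * y <= R * R:
        total += (x * x + y * y) ** 0.5    # distance from the center
        kept += 1

print(total / n)    # ~2R/3 ~ 0.667
```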
Example(s)
Now let's consider another example where we have a continuous joint distribution (X, Y), where X ∈ [0, 1] is the proportion of the time until the midterm that you actually spend studying for it, and Y ∈ [0, 1] is your percentage score on the exam.
Suppose the joint PDF is:
f_{X,Y}(x, y) =
  c e^{−(y−x)},  if x, y ∈ [0, 1] and y ≥ x
  0,             otherwise
1. First, consider the joint range and sketch it. Then, interpret it in English in the context of the
problem.
2. Now, write an expression for c in the PDF above.
3. Now, find Ω_Y and write an expression that we could evaluate to find f_Y(y).
4. Now, write an expression that we could evaluate to find P (Y 0.9).
5. Now, write an expression that we can evaluate to find E [Y ], the expected score on the exam.
6. Finally, consider whether X and Y are independent.
Solution
1. X can take any value in [0, 1] without conditions. Then Y is only bounded in that it must be greater than or equal to X. We can first draw the line y = x; then the region above this line for which x, y are at most 1 will be our range. That gives us the following:
In English, this means that your score is at least the percentage of time that you studied, as your score
will be that proportion or more.
2. To solve for c, we should find the volume above this triangle on the x-y plane and invert it, since ∫_{−∞}^{∞} ∫_{−∞}^{∞} f_{X,Y}(x, y) dx dy = 1. To find the volume we can integrate in terms of x or y first, which gives us the following two equivalent expressions:
c = 1 / (∫_0^1 ∫_x^1 e^{−(y−x)} dy dx) = 1 / (∫_0^1 ∫_0^y e^{−(y−x)} dx dy)
We’ll explain the first equality using the dydx ordering. Since dx is the outer integral, the limits must
be just the range of X, which is [0, 1]. For each value of x (draw a vertical line through x on the
x-axis), y goes between x and 1, so those are the inner limits of integration.
Now, for the second equality using dxdy ordering, the outer integral is dy, so the limits are the range
of Y , also [0, 1]. Then, for each value of y (draw a horizontal line through y on the y-axis), x goes
between 0 and y, and so those are the inner limits of integration.
3. Well, Ω_Y = [0, 1], as we can see in our graph above that Y takes on values in this range. For the marginal PDF we have to integrate with respect to x, which will take on values in the range 0 to y based on our graph. So, we have:
f_Y(y) = ∫_0^y c e^{−(y−x)} dx
4. We can integrate from 0.9 to 1 to solve for this, using the marginal PDF that we solved for above. This takes us back to the univariate case essentially, and gives us the following:
P(Y ≥ 0.9) = ∫_{0.9}^1 f_Y(y) dy = ∫_{0.9}^1 ∫_0^y c e^{−(y−x)} dx dy
5. Similarly, using the marginal PDF of Y:
E[Y] = ∫_0^1 y f_Y(y) dy = ∫_0^1 ∫_0^y c y e^{−(y−x)} dx dy
6. Ω_{X,Y} ≠ Ω_X × Ω_Y, since the sketch of the range is not a rectangle: the joint range is not equal to the Cartesian product of the marginal ranges. To be concrete, consider the point (x = 0.99, y = 0.01) (basically the corner (1, 0)). I chose this point because it is in the Cartesian product Ω_X × Ω_Y = [0, 1] × [0, 1], but not in the joint range (see the picture from the first part). Since it's not in the joint range (shaded region), we have f_{X,Y}(0.99, 0.01) = 0, but since 0.99 ∈ Ω_X and 0.01 ∈ Ω_Y, f_X(0.99) > 0 and f_Y(0.01) > 0. Hence, I've found a pair of points (x, y) where the joint density isn't equal to the product of the marginal densities, violating independence.
Chapter 5. Multiple Random Variables
5.3: Conditional Distributions
Slides (Google Drive) Video (YouTube)
Now that we’ve finished talking about joint distributions (whew), we can move on to conditional distribu-
tions and conditional expectation. This is actually just applying the concepts from 2.2 about conditional
probability, generalizing to random variables (instead of events)!
Definition: If X and Y are discrete random variables, the conditional PMF of X given Y is:
p_{X|Y}(a | b) = P(X = a | Y = b) = p_{X,Y}(a, b) / p_Y(b)
Note that this should remind you of Bayes' Theorem (because that's what it is)!
If X, Y are continuous random variables, then the conditional PDF of X given Y is:
f_{X|Y}(a | b) = f_{X,Y}(a, b) / f_Y(b)
Again, this is just a generalization from discrete to continuous, as we've been doing!
It's important to note that, for each fixed value of b, the probabilities that X = a must sum to 1:
Σ_{a ∈ Ω_X} p_{X|Y}(a | b) = 1
If X and Y are mixed (one discrete, one continuous), then a similar extension can be made, where any discrete random variable has a p (a probability mass function) and any continuous random variable has an f (a probability density function).
Example(s)
Back to our example of the blue and red die rolls from 5.1. Suppose we roll a fair blue 4-sided die and a fair red 4-sided die independently. Recall that U = min{X, Y} (the smaller of the two die rolls) and V = max{X, Y} (the larger of the two die rolls), whose joint PMF we derived in 5.1. What is the conditional PMF of U given V = 3?
Solution By definition,
p_{U|V}(u | 3) = p_{U,V}(u, 3) / p_V(3)
We need to compute the denominator, which is the marginal PMF of V (the sum of the third column):
p_V(3) = Σ_{a ∈ Ω_U} p_{U,V}(a, 3) = 2/16 + 2/16 + 1/16 + 0 = 5/16
Hence, p_{U|V}(1 | 3) = (2/16)/(5/16) = 2/5, p_{U|V}(2 | 3) = 2/5, p_{U|V}(3 | 3) = 1/5, and p_{U|V}(4 | 3) = 0. (Notice these sum to 1, as they must!)
Remember that the expectation of a discrete RV X is E[X] = Σ_{x ∈ Ω_X} x P(X = x). So it's only fair that the conditional expectation of X, given knowledge that some other RV Y is equal to y, is the same exact thing, EXCEPT the probabilities should be conditioned on Y = y now:
E[X | Y = y] = Σ_{x ∈ Ω_X} x P(X = x | Y = y) = Σ_{x ∈ Ω_X} x p_{X|Y}(x | y)
Most notably, we are still summing over x and NOT y, since this expression should depend on y, right? Given that Y = y, what is the expectation of X?
If X is continuous (and Y is either discrete or continuous), then we define the conditional expectation of g(X) given (the event that) Y = y as:
E[g(X) | Y = y] = ∫_{−∞}^{∞} g(x) f_{X|Y}(x | y) dx
Notice that these sums and integrals are over x (not y), since E [g(X) | Y = y] is a function of y.
These formulas are exactly the same as E [g(X)], except the PMF/PDF of X is replaced with the
conditional PMF/PDF of X | Y = y.
Example(s)
Suppose X ∼ Unif(0, 1) (continuous). We repeatedly draw independent Y_1, Y_2, Y_3, · · · ∼ Unif(0, 1) (continuous) until the first random time T such that Y_T < X. What is E[T]?
The question is basically asking the following: we get some uniformly random decimal number X from [0, 1]. We keep drawing uniform random numbers until we get a value less than our initial value. What is the expected number of draws until this happens?
Solution We'll do this problem in a "bad" way (the only way we know how, for now), and then learn the Law of Total Expectation next to see how this solution could be much simpler!
To find E [T ], since T is discrete with range ⌦T = {1, 2, 3, . . . }, we can find its PMF pT (t) = P (T = t) for
any value t and use the usual formula for expectation. However, T depends on the value of the initial number
X right? If X = 0.1 it would take longer to get a number less than this than if X = 0.99. Let’s try to find
the probability T = t given that X = x first:
P (T = t | X = x) = (1 x)t 1
x
because the probability we get a number smaller than x is just x (Uniform CDF), and so we need to get
t 1 failures first before our first success. Actually, (T |X = x) ⇠ Geo(x) so that’s another way we could’ve
computed this conditional PMF. Then, let’s use the LTP to find P (T = t) (we need to integrate over all
values of t because T is continuous, not discrete):
Z 1 Z 1
1
P (T = t) = P (T = t | X = x) fX (x)dx = (1 x)t 1
x · 1dx = · · · =
0 0 t(t + 1)
after skipping some purely computational steps. Finally, since we have the PMF of T, we can compute the expectation in the normal way:
E[T] = Σ_{t=1}^{∞} t p_T(t) = Σ_{t=1}^{∞} t · 1/(t(t + 1)) = Σ_{t=1}^{∞} 1/(t + 1) = ∞
The reason this is ∞ is because this is like the harmonic series 1 + 1/2 + 1/3 + 1/4 + . . ., which is known to diverge to ∞. This is surprising, right? The expected time until you get a number smaller than your first is infinite!
Law of Total Expectation (LTE): if Y is discrete, then E[g(X)] = Σ_{y ∈ Ω_Y} E[g(X) | Y = y] p_Y(y); if Y is continuous, then E[g(X)] = ∫_{−∞}^{∞} E[g(X) | Y = y] f_Y(y) dy.
This looks exactly like the law of total probability we are used to. Basically, to solve for E[g(X)], we need to take a weighted average of E[g(X) | Y = y] over all possible values of y.
Example(s)
(This is the same example as earlier): Suppose X ⇠ Unif(0, 1) (continuous). We repeatedly draw
independent Y1 , Y2 , Y3 , · · · ⇠ Unif(0, 1) (continuous) until the first random time T such that YT < X.
What is E [T ]?
Solution Using the LTE now, we can solve this in a much simpler fashion. We know that (T | X = x) ∼ Geo(x) as stated earlier. By citing the expectation of a Geometric RV, we know that E[T | X = x] = 1/x. By the LTE, conditioning on X:
E[T] = ∫_0^1 E[T | X = x] f_X(x) dx = ∫_0^1 (1/x) · 1 dx = [ln(x)]_0^1 = ∞
This was a much faster way of getting to the answer than before!
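An infinite expectation has a distinctive empirical signature: the sample mean never settles down as you draw more samples. A small simulation sketch (my own, not from the text):

```python
import random

random.seed(3)

def draw_T():
    x = random.random()              # X ~ Unif(0, 1)
    t = 1
    while random.random() >= x:      # keep drawing Y_t until Y_t < x
        t += 1
    return t

for n in [1_000, 10_000, 100_000]:
    print(n, sum(draw_T() for _ in range(n)) / n)
# The sample mean keeps drifting upward rather than converging: E[T] is infinite.
```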
Example(s)
Let's finally prove that if X ∼ Geo(p), then µ = E[X] = 1/p. Recall that the Geometric random variable is the number of independent Bernoulli trials with parameter p up to and including the first success.
Solution First, let's condition on whether our first flip was heads (H) or tails (T) (these events partition the sample space):
E[X] = E[X | H] P(H) + E[X | T] P(T)
What are those four values on the right though? We know P(H) = p and P(T) = 1 − p, so that's out of the way.
way.
What is E [X | H]? If we got heads on the first try, then E [X | H] = 1 since we are immediately done
(i.e., the number of trials it took to get our first heads, given we got heads on the first trial, is 1).
What is E [X | T ]? This is a bit trickier: because the trials are independent, and we got a tail on the
first try, we basically have to restart (memorylessness), and so our conditional expectation is just E [1 + X],
since we are back to square one except with one additional trial!
Plugging these four values in gives a recursive formula (E[X] appears on both sides):
E[X] = 1 · p + (1 + E[X]) · (1 − p)
Writing µ = E[X] and solving:
µ = p + (1 + µ)(1 − p)
µ = p + 1 − p + µ − µp
µ = 1 + µ − µp
0 = 1 − µp
µp = 1
µ = 1/p
This is a really “cute” proof of the expectation of a Geometric RV! See the notes in 3.5 to see the “ugly”
calculus proof.
5.3.4 Exercises
1. What happens to linearity of expectation when you sum a random number of random variables? We know it holds for fixed values of n, but let's see what happens if we sum a random number N of them. It turns out, you get something very nice!
Let X_1, X_2, X_3, . . . be a sequence of independent and identically distributed (iid) RVs, with common mean E[X_1] = E[X_2] = . . . . Let N be a random variable which has range Ω_N ⊆ {0, 1, 2, . . . } (nonnegative integers), independent of all the X_i's. Show that E[Σ_{i=1}^{N} X_i] = E[X_1] E[N]. That is, the expected sum of a random number of random variables is the expected number of random variables times the expected value of each (which you might think is intuitively true, but we have to prove it!).
Solution: We have the following:
E[Σ_{i=1}^{N} X_i] = Σ_{n ∈ Ω_N} E[Σ_{i=1}^{N} X_i | N = n] p_N(n) [Law of Total Expectation]
= Σ_{n ∈ Ω_N} E[Σ_{i=1}^{n} X_i | N = n] p_N(n) [given N = n: substitute in the upper limit]
= Σ_{n ∈ Ω_N} E[Σ_{i=1}^{n} X_i] p_N(n) [N independent of the X_i's]
= Σ_{n ∈ Ω_N} n E[X_1] p_N(n) [Linearity of Expectation]
= E[X_1] Σ_{n ∈ Ω_N} n p_N(n)
= E[X_1] E[N] [def of E[N]]
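This identity is easy to see in simulation. A sketch (my own, not from the text); the particular distributions of N and the X_i below are arbitrary choices just for illustration:

```python
import random

random.seed(5)
trials = 100_000
total = 0.0
for _ in range(trials):
    n = random.randint(0, 10)        # N uniform on {0, ..., 10}, so E[N] = 5
    total += sum(random.expovariate(0.5) for _ in range(n))  # X_i ~ Exp(0.5), E[X_i] = 2

print(total / trials)                # ~ E[X_1] * E[N] = 2 * 5 = 10
```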
Chapter 5. Multiple Random Variables
5.4: Covariance and Correlation
Slides (Google Drive) Video (YouTube)
In this section, we'll learn about covariance, which, as you might guess, is related to variance. It is a function of two random variables that tells us whether they have a positive or negative linear relationship. It also helps us finally compute the variance of a sum of dependent random variables, which we have not yet been able to do.
6. If X and Y are independent, then Cov(X, Y) = 0, and so Var(X + Y) = Var(X) + Var(Y) (as we discussed earlier).
7. Cov(Σ_{i=1}^{n} X_i, Σ_{j=1}^{m} Y_j) = Σ_{i=1}^{n} Σ_{j=1}^{m} Cov(X_i, Y_j). That is, covariance works like FOIL (first, outer, inner, last) for multiplication of sums ((a + b + c)(d + e) = ad + ae + bd + be + cd + ce).
Proof of Covariance Alternate Formula. We will prove that Cov(X, Y) = E[XY] − E[X] E[Y]:
Cov(X, Y) = E[(X − E[X])(Y − E[Y])] [def of covariance]
= E[XY − X E[Y] − E[X] Y + E[X] E[Y]] [expand]
= E[XY] − E[X] E[Y] − E[X] E[Y] + E[X] E[Y] [linearity; E[X], E[Y] are constants]
= E[XY] − E[X] E[Y]
We actually proved in 5.1 already that E[XY] = E[X] E[Y] when X, Y are independent. Hence, if X and Y are independent, Cov(X, Y) = E[XY] − E[X] E[Y] = 0.
Example(s)
Suppose X and Y are independent random variables, each with mean 0 and variance 1. Let
Z = 1 + X + XY²
W = 1 + X
Find Cov(Z, W).
Solution First note that E[X²] = Var(X) + (E[X])² = 1 + 0² = 1 (rearrange the variance formula and solve for E[X²]). Similarly, E[Y²] = 1. Then, since constants don't affect covariance and covariance FOILs,
Cov(Z, W) = Cov(1 + X + XY², 1 + X) = Cov(X, X) + Cov(XY², X)
Now Cov(X, X) = Var(X) = 1, and by the alternate formula and the independence of X and Y,
Cov(XY², X) = E[X²Y²] − E[XY²] E[X] = E[X²] E[Y²] − E[X] E[Y²] · E[X] = 1 · 1 − 0 = 1
so Cov(Z, W) = 1 + 1 = 2.
Covariance has a "problem" in measuring linear relationships: Cov(X, Y) will be positive when there is a positive linear relationship and negative when there is a negative linear relationship, but Cov(2X, Y) = 2Cov(X, Y). Scaling one of the random variables should not affect the strength of their relationship, which covariance seems to do. It would be great if we defined some metric that was normalized (had a maximum and minimum) and was invariant to scale. This metric is called correlation!
ρ(X, Y) = Cov(X, Y) / (√Var(X) √Var(Y))
We can prove by the Cauchy-Schwarz inequality (from linear algebra) that −1 ≤ ρ(X, Y) ≤ 1. That is, correlation is just a normalized version of covariance. Most notably, ρ(X, Y) = ±1 if and only if Y = aX + b for some constants a, b ∈ R, and then the sign of ρ is the same as that of a.
In linear regression ("line-fitting") from high school science class, you may have calculated some R², with 0 ≤ R² ≤ 1; this is actually ρ², and it measures how well a linear relationship exists between X and Y. R² is the percentage of variance in Y which can be explained by X.
Let's take a look at some example graphs which show a sample of data and their (Pearson) correlations, to get some intuition.
The 1st (purple) plot has a perfect negative linear relationship, and so the correlation is −1.
The 2nd (green) plot has a positive relationship, but it is not perfect, so the correlation is around +0.9.
The 3rd (orange) plot is a perfectly linear positive relationship, so the correlation is +1.
The 4th (red) plot appears to have data that is independent, so the correlation is 0.
The 5th (blue) plot has a negative trend that isn't strongly linear, so the correlation is around −0.6.
Example(s)
Suppose X and Y are random variables, where Y = −5X + 2. Show that, since there is a perfect negative linear relationship, ρ(X, Y) = −1.
Solution To find the correlation, we need the covariance and the two individual variances. Let's write them in terms of Var(X):
Cov(X, Y) = Cov(X, −5X + 2) = −5 Cov(X, X) = −5 Var(X)
Var(Y) = Var(−5X + 2) = (−5)² Var(X) = 25 Var(X)
Finally,
ρ(X, Y) = Cov(X, Y) / (√Var(X) √Var(Y)) = −5 Var(X) / (√Var(X) · 5 √Var(X)) = −5 Var(X) / (5 Var(X)) = −1
Note that the −5 and 2 did not matter at all (except that −5 was negative, which made the correlation negative)!
Proof of Variance of Sums of RVs. We'll first do something unintuitive - making our expression more complicated. The variance of the sum X_1 + X_2 + · · · + X_n is its covariance with itself! We'll use i to index one of the sums Σ_{i=1}^{n} X_i and j for the other Σ_{j=1}^{n} X_j. Keep in mind these both represent the same quantity; you'll see why we used different dummy variables soon!
Var(Σ_{i=1}^{n} X_i) = Cov(Σ_{i=1}^{n} X_i, Σ_{j=1}^{n} X_j) [covariance with self = variance]
= Σ_{i=1}^{n} Σ_{j=1}^{n} Cov(X_i, X_j) [by FOIL]
= Σ_{i=1}^{n} Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j) [by symmetry (see image below)]
The final step comes from the definition of the covariance of a variable with itself and the symmetry of covariance. It is illustrated below, where the red diagonal is the covariance of a variable with itself (which is its variance), and the green off-diagonal entries are the symmetric pairs of covariances. We used the fact that Cov(Xᵢ, Xⱼ) = Cov(Xⱼ, Xᵢ), so we only need to sum the lower triangle (where i < j), and multiply by 2 to account for the upper triangle.
It is important to remember that if all the RVs were independent, all the Cov(Xᵢ, Xⱼ) terms (for i ≠ j) would be zero, and so we would just be left with the sum of the variances as we showed earlier!
Example(s)
Recall in the hat check problem in 3.3, we had n people who go to a party and leave their hats with a hat check person. At the end of the party, the hats are returned randomly. We let X be the number of people who get their original hat back. We solved for E[X] with indicator random variables X₁, ..., Xₙ for whether the i-th person got their hat back.
We showed that:
$$E[X_i] = P(X_i = 1) = P(i\text{th person gets their hat back}) = \frac{1}{n}$$
So,
$$E[X] = E\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n E[X_i] = \sum_{i=1}^n \frac{1}{n} = n \cdot \frac{1}{n} = 1$$
Now find Var(X).
Solution Recall that each $X_i \sim \text{Ber}\left(\frac{1}{n}\right)$ (1 with probability $\frac{1}{n}$, and 0 otherwise). (Remember these were NOT independent RVs, but we still could apply linearity of expectation.) In our previous proof, we showed that
$$\text{Var}(X) = \text{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j)$$
Recall that Xᵢ, Xⱼ are indicator random variables which are in {0, 1}, so their product XᵢXⱼ ∈ {0, 1} as well. This allows us to calculate:
$$E[X_iX_j] = P(X_iX_j = 1) = P(X_i = 1, X_j = 1) = \frac{1}{n} \cdot \frac{1}{n-1}$$
This is because we need both person i and person j to get their hat back: person i gets theirs back with probability $\frac{1}{n}$, and given this is true, person j gets theirs back with probability $\frac{1}{n-1}$.
So, by definition of covariance (recall each $E[X_i] = \frac{1}{n}$):
$$\text{Cov}(X_i, X_j) = E[X_iX_j] - E[X_i]E[X_j] = \frac{1}{n(n-1)} - \frac{1}{n^2} = \frac{1}{n^2(n-1)}$$
Finally, we have
$$\text{Var}(X) = \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j) \qquad \text{[formula for variance of sum]}$$
$$= \sum_{i=1}^n \frac{1}{n}\left(1 - \frac{1}{n}\right) + 2\sum_{i<j} \frac{1}{n^2(n-1)} \qquad \text{[plug in]}$$
$$= n \cdot \frac{1}{n}\left(1 - \frac{1}{n}\right) + 2\binom{n}{2}\frac{1}{n^2(n-1)} \qquad \left[\text{there are } \binom{n}{2} \text{ pairs with } i < j\right]$$
$$= \left(1 - \frac{1}{n}\right) + 2 \cdot \frac{n(n-1)}{2} \cdot \frac{1}{n^2(n-1)}$$
$$= \left(1 - \frac{1}{n}\right) + \frac{1}{n} = 1$$
How many pairs are there with i < j? This is just $\binom{n}{2} = \frac{n(n-1)}{2}$, since we just choose two different elements. Another way to see this is that there was an n × n square, and we removed the diagonal of n elements, so we are left with n² - n = n(n - 1). Divide by two to get just the lower half.
This is very surprising and interesting! When returning n hats randomly and uniformly, the expected number of people who get their hat back is 1, and so is the variance! These don't even depend on n at all! It takes practice to get used to these formulas, so let's do one more problem.
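If you'd like to see this surprising result empirically, here is a small Python simulation sketch (my addition, with n = 10 and the trial count chosen arbitrarily): hats are returned according to a uniformly random permutation, and both the sample mean and sample variance of the number of matches should be close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000

# Each permutation is one random return of hats; a "match" at position i
# means person i got their own hat back.
matches = np.array([(rng.permutation(n) == np.arange(n)).sum()
                    for _ in range(trials)])
print(matches.mean(), matches.var())  # both approximately 1
```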
Example(s)
Suppose we throw 12 balls independently and uniformly into 7 bins. What are the mean and variance of the number of empty bins after this process? (Hint: Indicators.)
Solution Let Xᵢ be the indicator that bin i is empty, so that the number of empty bins is $X = \sum_{i=1}^7 X_i$. The probability that a particular bin is empty is $(6/7)^{12}$, since we need to avoid this bin (with probability 6/7) 12 times independently. That is,
$$X_i \sim \text{Ber}\left(p = \left(\frac{6}{7}\right)^{12}\right)$$
Hence, E[Xᵢ] = p ≈ 0.1573 and Var(Xᵢ) = p(1 - p) ≈ 0.1325. These random variables are surely dependent, since knowing one bin is empty means the 12 balls had to go to the other 6 bins, making it less likely that another bin is empty.
However, dependence doesn't bother us for computing the expectation; by linearity of expectation, we get
$$E[X] = E\left[\sum_{i=1}^7 X_i\right] = \sum_{i=1}^7 E[X_i] = \sum_{i=1}^7 \left(\frac{6}{7}\right)^{12} = 7\left(\frac{6}{7}\right)^{12} \approx 1.1009$$
Now for the variance, we need to find Cov(Xᵢ, Xⱼ) = E[XᵢXⱼ] - E[Xᵢ]E[Xⱼ] for i ≠ j. Well, XᵢXⱼ ∈ {0, 1} since both Xᵢ, Xⱼ ∈ {0, 1}, so XᵢXⱼ is indicator/Bernoulli as well, with
$$E[X_iX_j] = P(X_iX_j = 1) = P(X_i = 1, X_j = 1) = P(\text{both bin } i \text{ and } j \text{ are empty}) = \left(\frac{5}{7}\right)^{12}$$
since all the balls must go into the other 5 bins during each of the 12 independent throws. Finally,
$$\text{Cov}(X_i, X_j) = E[X_iX_j] - E[X_i]E[X_j] = \left(\frac{5}{7}\right)^{12} - \left(\frac{6}{7}\right)^{12}\left(\frac{6}{7}\right)^{12} \approx -0.0071$$
Recall that Var(Xᵢ) = p(1 - p) ≈ 0.1325, and so putting this all together gives:
$$\text{Var}(X) = \sum_{i=1}^7 \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j) \qquad \text{[formula for variance of sum]}$$
$$\approx \sum_{i=1}^7 0.1325 + 2\sum_{i<j} (-0.0071) \qquad \text{[plug in approximate decimal values]}$$
$$= 7 \cdot 0.1325 + 2\binom{7}{2}(-0.0071)$$
$$\approx 0.62954$$
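Again, a simulation sketch (my addition, not from the text) can corroborate these numbers; the trial count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
balls, bins, trials = 12, 7, 100_000

throws = rng.integers(0, bins, size=(trials, balls))  # bin of each ball
# Empty bins = total bins minus the number of distinct bins that were hit.
empty = np.array([bins - len(np.unique(row)) for row in throws])
print(empty.mean(), empty.var())  # approximately 1.1009 and 0.6295
```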
Recall the hypergeometric RV X ∼ HypGeo(N, K, n), which was the number of lollipops we get when we draw n candies from a bag of N total candies (K ≤ N of which are lollipops). We stated without proof that
$$\text{Var}(X) = n \cdot \frac{K(N-K)(N-n)}{N^2(N-1)}$$
You have the tools now to prove this if you like, using indicators and covariances, but we'll prove this later in 5.8 as well!
Chapter 5. Multiple Random Variables
5.5: Convolution
Slides (Google Drive) Video (YouTube)
In section 4.4, we explained how to transform random variables (finding the density function of g(X)). In this section, we'll talk about how to find the distribution of the sum of two independent random variables, X + Y, using a technique called convolution. It will allow us to prove some statements we made earlier without proof (like sums of independent Binomials being Binomial, and sums of independent Poissons being Poisson), and also derive the density function of the Gamma distribution, which we just stated.
This should just remind you of the LTP we learned in section 2.2, or the definition of marginal PMF/PDFs from earlier in the chapter! We'll use this LTP to help us derive the formulae for convolution.
5.5.2 Convolution
Convolution is a mathematical operation that allows us to derive the distribution of a sum of two independent random variables. For example, suppose the amount of gold a company can mine is X tons per year in country A, and the amount of gold the company can mine is Y tons per year in country B, independently. You have some distribution to model each. What is the distribution of the total amount of gold you mine, Z = X + Y? Combining this with 4.4, if you know your profit is some function of the total amount of gold, say $g(Z) = \sqrt{X + Y}$, you can now find the density function of your profit!
Example(s)
Let X, Y ∼ Unif(1, 4) be independent rolls of a fair 4-sided die. What is the PMF of Z = X + Y?
Solution We know that for the range of Z we have the following, since it is the sum of two values each in the range {1, 2, 3, 4}:
$$\Omega_Z = \{2, 3, 4, 5, 6, 7, 8\}$$
Should the probabilities be uniform? That is, would you be equally likely to roll a 2 as a 5? No, because there is only one way to get a 2 (rolling (1, 1)), but many ways to get a 5.
If I wanted to compute the probability that Z = 3 for example, I could just sum over all possible values of X in $\Omega_X = \{1, 2, 3, 4\}$ to get:
$$P(Z = 3) = P(X = 1, Y = 2) + P(X = 2, Y = 1) + P(X = 3, Y = 0) + P(X = 4, Y = -1)$$
$$= P(X = 1)P(Y = 2) + P(X = 2)P(Y = 1) + P(X = 3)P(Y = 0) + P(X = 4)P(Y = -1)$$
$$= \frac{1}{4} \cdot \frac{1}{4} + \frac{1}{4} \cdot \frac{1}{4} + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot 0 = \frac{2}{16}$$
where the first line is all ways to get a 3, and the second line uses independence. Note that it is not possible that Y = 0 or Y = -1, but we write this for completeness. More generally, to find p_Z(z) = P(Z = z) for any value of z, we just write
value of z, we just write
pZ (z) = P (Z = z)
X
= P (X = x, Y = z x)
x2⌦X
X
= P (X = x) P (Y = z x)
x2⌦X
X
= pX (x)pY (z x)
x2⌦X
The intuition is that if we want Z = z, we sum over all possibilities of X = x but require that Y = z x so
that we get the desired sum of z. It is very possible that pY (z x) = 0 as we saw above.
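The discrete convolution formula is just a double loop over the two ranges, so it translates directly into code. Below is a minimal Python sketch (my addition; convolve_pmfs is a hypothetical helper name), applied to the two 4-sided dice above.

```python
def convolve_pmfs(p_x: dict, p_y: dict) -> dict:
    """PMF of Z = X + Y for independent X, Y, each given as {value: prob}."""
    p_z = {}
    for x, px in p_x.items():
        for y, py in p_y.items():
            p_z[x + y] = p_z.get(x + y, 0.0) + px * py
    return p_z

die = {k: 1 / 4 for k in [1, 2, 3, 4]}
print(convolve_pmfs(die, die))  # e.g., P(Z = 3) = 2/16, P(Z = 5) = 4/16
```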
It turns out that the formula at the bottom was extremely general, and works for any sum of two independent discrete RVs. Now let's consider the continuous case. What if X and Y are continuous RVs and we define Z = X + Y; how can we solve for the probability density function of Z, f_Z(z)? It turns out the formula is extremely similar, just replacing p with f! If X and Y are independent and Z = X + Y, then:
$$\text{(discrete)} \quad p_Z(z) = \sum_{x \in \Omega_X} p_X(x)p_Y(z - x) \qquad \text{(continuous)} \quad f_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z - x)\,dx$$
Note: You can swap the roles of X and Y. Note the similarity between the cases!
Proof of Convolution.
• Discrete case: Even though we proved this earlier, we'll do it again a different way (using the LTP/def of marginal):
$$p_Z(z) = P(Z = z)$$
$$= \sum_{x \in \Omega_X} P(X = x, Z = z) \qquad \text{[LTP/marginal]}$$
$$= \sum_{x \in \Omega_X} P(X = x, Y = z - x) \qquad [(X = x, Z = z) \text{ equivalent to } (X = x, Y = z - x)]$$
$$= \sum_{x \in \Omega_X} P(X = x)P(Y = z - x) \qquad [X \text{ and } Y \text{ are independent}]$$
$$= \sum_{x \in \Omega_X} p_X(x)p_Y(z - x)$$
• Continuous case: Since we should never work with densities as probabilities, let's start with the CDF and differentiate:
$$F_Z(z) = P(Z \le z)$$
$$= P(X + Y \le z) \qquad \text{[def of } Z]$$
$$= \int_{x \in \Omega_X} P(X + Y \le z \mid X = x)f_X(x)\,dx \qquad \text{[LTP, conditioning on } X]$$
$$= \int_{x \in \Omega_X} P(x + Y \le z \mid X = x)f_X(x)\,dx \qquad \text{[given } X = x]$$
$$= \int_{x \in \Omega_X} P(Y \le z - x \mid X = x)f_X(x)\,dx \qquad \text{[algebra]}$$
$$= \int_{x \in \Omega_X} P(Y \le z - x)f_X(x)\,dx \qquad [X \text{ and } Y \text{ are independent}]$$
$$= \int_{x \in \Omega_X} F_Y(z - x)f_X(x)\,dx \qquad \text{[def of CDF of } Y]$$
Now we can take the derivative (with respect to z) of the CDF to get the density (F_Y becomes f_Y):
$$f_Z(z) = \frac{d}{dz}F_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z - x)\,dx$$
Example(s)
Suppose X and Y are two independent random variables such that X ∼ Poi(λ₁) and Y ∼ Poi(λ₂), and let Z = X + Y. Prove that Z ∼ Poi(λ₁ + λ₂).
Solution The ranges of X and Y are $\Omega_X = \Omega_Y = \{0, 1, 2, \dots\}$, and so $\Omega_Z = \{0, 1, 2, \dots\}$ as well. For n ∈ Ω_Z, the convolution formula says:
$$p_Z(n) = \sum_{k \in \Omega_X} p_X(k)p_Y(n - k) = \sum_{k=0}^{\infty} p_X(k)p_Y(n - k)$$
However, if you blindly plug in the PMFs p_X and p_Y, you will get the wrong answer, and here's why. We only want to sum things that are non-zero (otherwise what's the point?), and if we want p_X(k)p_Y(n - k) > 0, we need BOTH to be nonzero. That means k must be in the range of X, AND n - k must be in the range of Y. Remember the dice example (we had p_Y(-1) at some point, which would be 0 and not 1/4). We are guaranteed p_X(k) > 0 because we are only summing over valid k ∈ Ω_X, but we must have n - k be a nonnegative integer (in the range Ω_Y = {0, 1, 2, ...}), so actually, we must have k ≤ n. Now, we can just plug and chug:
$$p_Z(n) = \sum_{k=0}^{n} p_X(k)p_Y(n - k) \qquad \text{[convolution formula]}$$
$$= \sum_{k=0}^{n} e^{-\lambda_1}\frac{\lambda_1^k}{k!} \cdot e^{-\lambda_2}\frac{\lambda_2^{n-k}}{(n-k)!} \qquad \text{[plug in Poisson PMFs]}$$
$$= e^{-(\lambda_1+\lambda_2)} \sum_{k=0}^{n} \frac{1}{k!(n-k)!}\lambda_1^k\lambda_2^{n-k} \qquad \text{[algebra]}$$
$$= e^{-(\lambda_1+\lambda_2)} \frac{1}{n!}\sum_{k=0}^{n} \frac{n!}{k!(n-k)!}\lambda_1^k\lambda_2^{n-k} \qquad \text{[multiply and divide by } n!]$$
$$= e^{-(\lambda_1+\lambda_2)} \frac{1}{n!}\sum_{k=0}^{n} \binom{n}{k}\lambda_1^k\lambda_2^{n-k} \qquad \left[\binom{n}{k} = \frac{n!}{k!(n-k)!}\right]$$
$$= e^{-(\lambda_1+\lambda_2)} \frac{(\lambda_1+\lambda_2)^n}{n!} \qquad \text{[binomial theorem]}$$
Thus, Z ∼ Poi(λ₁ + λ₂), as its PMF matches that of a Poisson distribution! Note we wouldn't have been able to do that last step if our sum still ran from k = 0 to ∞. You MUST watch out for this at the beginning, and after that, it's just algebra.
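Here's a tiny numeric check of this result (my addition; the rates λ₁ = 2, λ₂ = 3 and the point n = 4 are arbitrary choices): the convolution sum matches the Poi(λ₁ + λ₂) PMF.

```python
import math

def poi_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam1, lam2, n = 2.0, 3.0, 4
# Finite sum k = 0..n, exactly as in the derivation above
conv = sum(poi_pmf(lam1, k) * poi_pmf(lam2, n - k) for k in range(n + 1))
print(conv, poi_pmf(lam1 + lam2, n))  # both approximately 0.17547
```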
Example(s)
Suppose X, Y are independent and identically distributed (iid) continuous Unif(0, 1) random variables. Let Z = X + Y. What is f_Z(z)?
Solution We always begin by calculating the range: we have Ω_Z = [0, 2]. Again, we shouldn't expect Z to be uniform, since we should expect a number around 1, but not 0 or 2.
For a U ∼ Unif(0, 1) (continuous) random variable, we know Ω_U = [0, 1], and that
$$f_U(u) = \begin{cases} 1 & 0 \le u \le 1 \\ 0 & \text{otherwise} \end{cases}$$
$$f_Z(z) = \int_{x \in \Omega_X} f_X(x)f_Y(z - x)\,dx = \int_0^1 f_X(x)f_Y(z - x)\,dx = \int_0^1 f_Y(z - x)\,dx$$
where the last equality holds since f_X(x) = 1 for all 0 ≤ x ≤ 1 as we saw above. Remember, we need to make sure z - x ∈ Ω_Y = [0, 1], otherwise the density will be 0.
For f_Y(z - x) > 0, we need 0 ≤ z - x ≤ 1. We'll split into two cases depending on whether z ∈ [0, 1] or z ∈ [1, 2], which compose its range Ω_Z = [0, 2].
• If z ∈ [0, 1], we already have z - x ≤ 1 since z ≤ 1 (and x ∈ [0, 1]). We also need z - x ≥ 0 for the density to be nonzero: x ≤ z. Hence, our integral becomes:
$$f_Z(z) = \int_0^z f_Y(z - x)\,dx + \int_z^1 f_Y(z - x)\,dx = \int_0^z 1\,dx + 0 = [x]_0^z = z$$
• If z ∈ [1, 2], we already have z - x ≥ 0 since z ≥ 1 (and x ∈ [0, 1]). We now need the other condition, z - x ≤ 1, for the density to be nonzero: x ≥ z - 1. Hence, our integral becomes:
$$f_Z(z) = \int_0^{z-1} f_Y(z - x)\,dx + \int_{z-1}^1 f_Y(z - x)\,dx = 0 + \int_{z-1}^1 1\,dx = [x]_{z-1}^1 = 2 - z$$
This makes sense because there are “more ways” to get a value of 1 for example than any other point.
Whereas to get a value of 2, there’s only one way - we need both X, Y to be equal to 1.
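A histogram of simulated sums makes the triangle visible; here's a sketch (my addition, with arbitrary sample size and bin count). Note that the density we derived is f_Z(z) = min(z, 2 - z) on [0, 2].

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.uniform(0, 1, 500_000) + rng.uniform(0, 1, 500_000)

# density=True normalizes the histogram to estimate the PDF
hist, edges = np.histogram(z, bins=10, range=(0, 2), density=True)
for left, right, h in zip(edges[:-1], edges[1:], hist):
    c = (left + right) / 2
    print(f"z={c:.1f}  empirical={h:.3f}  exact={min(c, 2 - c):.3f}")
```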
Example(s)
Mitchell and Alex are competing together in a 2-mile relay race. The time Mitchell takes to finish his mile (in hours) is X ∼ Exp(2), and the time Alex takes to finish his mile (in hours) is continuous Y ∼ Unif(0, 1). Alex starts immediately after Mitchell finishes his mile, and their performances are independent. What is the distribution of Z = X + Y, the total time they take to finish the race?
Solution First, we know that Ω_X = [0, ∞) and Ω_Y = [0, 1], so Ω_Z = [0, ∞). We know from our distribution chart that
$$f_X(x) = 2e^{-2x},\ x \ge 0 \qquad \text{and} \qquad f_Y(y) = 1,\ 0 \le y \le 1$$
Let z ∈ Ω_Z. We'll use the convolution formula, but this time over the range of Y (you could also do it over X!). We can do this because X + Y = Y + X, and there was no reason why we had to condition on X first.
$$f_Z(z) = \int_{\Omega_Y} f_Y(y)f_X(z - y)\,dy = \int_0^1 f_Y(y)f_X(z - y)\,dy$$
Since we are integrating over y, we don't need to worry about f_Y(y) being 0, but we do need to make sure f_X(z - y) > 0. There are two cases again:
• If z ∈ [0, 1], then since we need z - y ≥ 0, we need y ≤ z:
$$f_Z(z) = \int_0^z f_Y(y)f_X(z - y)\,dy = \int_0^z 1 \cdot 2e^{-2(z-y)}\,dy = 1 - e^{-2z}$$
• If z ∈ (1, ∞), then z - y ≥ 0 always holds (since y ≤ 1 ≤ z), so we integrate over all of [0, 1]:
$$f_Z(z) = \int_0^1 1 \cdot 2e^{-2(z-y)}\,dy = (e^2 - 1)e^{-2z}$$
Note this tiny difference in the upper limit of the integral made a huge difference! Our final result is
$$f_Z(z) = \begin{cases} 1 - e^{-2z} & z \in [0, 1] \\ (e^2 - 1)e^{-2z} & z \in (1, \infty) \\ 0 & \text{otherwise} \end{cases}$$
The moral of the story is: always watch out for the ranges, otherwise you might not get what you expect!
The range of the random variable exists for a reason, so be careful!
Chapter 5. Multiple Random Variables
5.6: Moment Generating Functions
Slides (Google Drive) Video (YouTube)
Last time, we talked about how to find the distribution of the sum of two independent random variables. Some of the most important use cases are to prove the results we've been using for so long: the sum of independent Binomials is Binomial, the sum of independent Poissons is Poisson (we proved this in 5.5 using convolution), etc. We'll now talk about Moment Generating Functions, which allow us to do these in a different (and arguably easier) way. These will also be used to prove the Central Limit Theorem (next section), probably the most important result in all of statistics, and to derive the Chernoff bound (6.2). The point is, these are used to prove a lot of important results. They might not be as directly applicable to problems, though.
5.6.1 Moments
First, we need to define what a moment is: the n-th moment of a random variable X is E[Xⁿ].
The first four moments of a distribution/RV are commonly used, though we have only talked about the first two of them. I'll briefly explain each, but we won't talk about the latter two much.
1. The first moment of X is the mean of the distribution, µ = E[X]. This describes the center or average value.
2. The second moment of X about µ is the variance of the distribution, σ² = Var(X) = E[(X - µ)²]. This describes the spread of a distribution (how much it varies).
3. The third standardized moment is called skewness, $E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right]$, and typically tells us about the asymmetry of a distribution about its peak. If skewness is positive, then the mean is larger than the median and there are a lot of extreme high values. If skewness is negative, then the median is larger than the mean and there are a lot of extreme low values.
4. The fourth standardized moment is called kurtosis, $E\left[\left(\frac{X-\mu}{\sigma}\right)^4\right] = \frac{E[(X-\mu)^4]}{\sigma^4}$, which measures how peaked a distribution is. If the kurtosis is positive, then the distribution is thin and pointy, and if the kurtosis is negative, the distribution is flat and wide.
The moment generating function (MGF) of a random variable X is $M_X(t) = E\left[e^{tX}\right]$ (a function of t).
If X is discrete, by LOTUS:
$$M_X(t) = \sum_{x \in \Omega_X} e^{tx}p_X(x)$$
If X is continuous, by LOTUS:
$$M_X(t) = \int_{-\infty}^{\infty} e^{tx}f_X(x)\,dx$$
We say that the MGF of X exists if there is an ε > 0 such that the MGF is finite for all t ∈ (-ε, ε), since it is possible that the sum or integral diverges.
Example(s)
(a) Let X be a discrete random variable with PMF p_X(1) = 1/3 and p_X(2) = 2/3. Find M_X(t).
(b) Let Y ∼ Unif(0, 1) be continuous. Find M_Y(t).
Solution
(a)
$$M_X(t) = E\left[e^{tX}\right] = \sum_x e^{tx}p_X(x) = \frac{1}{3}e^t + \frac{2}{3}e^{2t} \qquad \text{[LOTUS]}$$
(b)
$$M_Y(t) = E\left[e^{tY}\right] = \int_0^1 e^{ty}f_Y(y)\,dy = \int_0^1 e^{ty} \cdot 1\,dy = \frac{e^t - 1}{t} \qquad [\text{LOTUS}; f_Y(y) = 1 \text{ for } 0 \le y \le 1]$$
1. Computing MGFs of Linear Transformations: For scalars a, b:
$$M_{aX+b}(t) = E\left[e^{t(aX+b)}\right] = e^{tb}E\left[e^{(at)X}\right] = e^{tb}M_X(at)$$
2. Computing MGFs of Sums: We can also compute the MGF of the sum of independent RVs X and Y given their individual MGFs (the third step is due to independence):
$$M_{X+Y}(t) = E\left[e^{t(X+Y)}\right] = E\left[e^{tX}e^{tY}\right] = E\left[e^{tX}\right]E\left[e^{tY}\right] = M_X(t)M_Y(t)$$
3. Generating Moments with MGFs: The reason why MGFs are named the way they are is because they generate moments of X. That means they can be used to compute E[X], E[X²], E[X³], and so on. How? Let's take the derivative of an MGF (with respect to t):
$$M_X'(t) = \frac{d}{dt}E\left[e^{tX}\right] = \frac{d}{dt}\sum_{x \in \Omega_X} e^{tx}p_X(x) = \sum_{x \in \Omega_X} \frac{d}{dt}e^{tx}p_X(x) = \sum_{x \in \Omega_X} xe^{tx}p_X(x)$$
Note in the last step that x is a constant with respect to t, and so $\frac{d}{dt}e^{tx} = xe^{tx}$.
Note that if we evaluate the derivative at t = 0, we get E[X], since e⁰ = 1:
$$M_X'(0) = \sum_{x \in \Omega_X} xe^{0x}p_X(x) = \sum_{x \in \Omega_X} xp_X(x) = E[X]$$
Taking the derivative again:
$$M_X''(t) = \frac{d}{dt}M_X'(t) = \frac{d}{dt}\sum_{x \in \Omega_X} xe^{tx}p_X(x) = \sum_{x \in \Omega_X} x\frac{d}{dt}e^{tx}p_X(x) = \sum_{x \in \Omega_X} x^2e^{tx}p_X(x)$$
If we evaluate the second derivative at t = 0, we get E[X²]:
$$M_X''(0) = \sum_{x \in \Omega_X} x^2e^{0x}p_X(x) = \sum_{x \in \Omega_X} x^2p_X(x) = E[X^2]$$
Seems like there's a pattern - if we take the n-th derivative of M_X(t) and evaluate it at t = 0, then we will generate the n-th moment E[Xⁿ]!
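You can even check this numerically: below is a sketch (my addition) that differentiates the Unif(0, 1) MGF we computed earlier, M_Y(t) = (eᵗ - 1)/t, using finite differences; the step size h is an arbitrary choice.

```python
import math

def M(t):  # MGF of Y ~ Unif(0,1); M(0) = 1 by continuity
    return 1.0 if t == 0 else (math.exp(t) - 1) / t

h = 1e-5
first = (M(h) - M(-h)) / (2 * h)           # ~ M'(0)  = E[Y]   = 1/2
second = (M(h) - 2 * M(0) + M(-h)) / h**2  # ~ M''(0) = E[Y^2] = 1/3
print(first, second)
```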
For a function f : R → R, we will denote f⁽ⁿ⁾(x) to be the n-th derivative of f(x). Let X, Y be independent random variables, and a, b ∈ R be scalars. Then MGFs satisfy the following properties:
1. M_X'(0) = E[X], M_X''(0) = E[X²], and in general M_X⁽ⁿ⁾(0) = E[Xⁿ]. This is why we call M_X a moment generating function, as we can use it to generate the moments of X.
2. M_{aX+b}(t) = e^{tb}M_X(at).
3. If X ⊥ Y, then M_{X+Y}(t) = M_X(t)M_Y(t).
4. (Uniqueness) The following are equivalent:
(a) X and Y have the same distribution.
(b) f_X(z) = f_Y(z) for all z ∈ R.
(c) F_X(z) = F_Y(z) for all z ∈ R.
(d) There is an ε > 0 such that M_X(t) = M_Y(t) for all t ∈ (-ε, ε) (they match on a small interval around t = 0).
That is, M_X uniquely identifies a distribution, just like PDFs or CDFs do.
We proved the first three properties before stating all the theorems, so all that's left is property 4. This is a very complex proof (out of the scope of this course), but we can prove it for a special case.
Proof of Property 4 for a Special Case. We'll prove that if X, Y are discrete rvs with range Ω = {0, 1, 2, ..., m} and whose MGFs are equal everywhere, then p_X(k) = p_Y(k) for all k ∈ Ω. That is, if two distributions have the same MGF, they have the same distribution (PMF).
Let a_k = p_X(k) - p_Y(k) for k = 0, ..., m, and write e^{tk} as (eᵗ)ᵏ. Then, subtracting the two (equal) MGFs, we get
$$\sum_{k=0}^{m} a_k(e^t)^k = 0 \quad \text{for all } t$$
Note that this is an m-th degree polynomial in eᵗ, and remember that this equation holds for (uncountably) infinitely many t. An m-th degree polynomial can only have m roots, unless all the coefficients are 0. Hence a_k = 0 for all k, and so p_X(k) = p_Y(k) for all k.
Now we’ll see how to use MGFs to prove some results we’ve been using.
Example(s)
If X ∼ Poi(λ), compute M_X(t).
Solution
$$M_X(t) = E\left[e^{tX}\right] = \sum_{k=0}^{\infty} e^{tk}p_X(k) = \sum_{k=0}^{\infty} e^{tk} \cdot e^{-\lambda}\frac{\lambda^k}{k!} = e^{-\lambda}\sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!}$$
$$= e^{-\lambda}e^{\lambda e^t} \qquad \text{[Taylor series } e^x = \textstyle\sum_{k=0}^{\infty} x^k/k! \text{ with } x = \lambda e^t]$$
$$= e^{\lambda(e^t - 1)}$$
Example(s)
If X ∼ Poi(λ), compute E[X] using its MGF we computed earlier, $M_X(t) = e^{\lambda(e^t - 1)}$.
Solution By the chain rule, $M_X'(t) = \lambda e^te^{\lambda(e^t - 1)}$, so $E[X] = M_X'(0) = \lambda e^0e^{\lambda(e^0 - 1)} = \lambda$, as expected!
Example(s)
If Y ∼ Poi(λ) and Z ∼ Poi(µ) and Y ⊥ Z, show that Y + Z ∼ Poi(λ + µ) using the uniqueness property of MGFs. (Recall we did this exact problem using convolution in 5.5.)
Solution First note that a Poi(λ + µ) RV has MGF $e^{(\lambda+\mu)(e^t - 1)}$ (just plugging in λ + µ as the parameter). Since Y and Z are independent, by property 3,
$$M_{Y+Z}(t) = M_Y(t)M_Z(t) = e^{\lambda(e^t - 1)}e^{\mu(e^t - 1)} = e^{(\lambda+\mu)(e^t - 1)}$$
The MGF of Y + Z which we computed is the same as that of a Poi(λ + µ) distribution. So, by the uniqueness of MGFs (which implies that an MGF uniquely describes a distribution), Y + Z ∼ Poi(λ + µ).
Which way was easier for you - this approach or using convolution? MGFs have a limitation, though, that convolution doesn't (besides independence): we not only need to compute the MGF of Y + Z, but we also need to know the MGF of the distribution we are trying to "get".
Example(s)
Now, use MGFs to prove the closure properties of Gaussian RVs (which we've been using without proof).
• If V ∼ N(µ, σ²) and W ∼ N(ν, τ²) are independent, show that V + W ∼ N(µ + ν, σ² + τ²).
• If a, b ∈ R are constants and X ∼ N(µ, σ²), show that aX + b ∼ N(aµ + b, a²σ²).
You may use the fact that if Y ∼ N(µ, σ²), then
$$M_Y(t) = \int_{-\infty}^{\infty} e^{ty}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y-\mu)^2}{2\sigma^2}}\,dy = e^{\mu t + \frac{\sigma^2t^2}{2}}$$
Solution
• If V ∼ N(µ, σ²) and W ∼ N(ν, τ²) are independent, we have the following:
$$M_{V+W}(t) = M_V(t)M_W(t) = e^{\mu t + \frac{\sigma^2t^2}{2}}e^{\nu t + \frac{\tau^2t^2}{2}} = e^{(\mu+\nu)t + \frac{(\sigma^2+\tau^2)t^2}{2}}$$
This is the MGF of a Normal distribution with mean µ + ν and variance σ² + τ². So, by uniqueness of MGFs, V + W ∼ N(µ + ν, σ² + τ²).
• Let us examine the moment generating function of aX + b. (We'll use the notation exp(z) = eᶻ so that we can actually see what's in the exponent clearly):
$$M_{aX+b}(t) = e^{bt}M_X(at) = \exp(bt)\exp\left(\mu(at) + \frac{\sigma^2(at)^2}{2}\right) = \exp\left((a\mu + b)t + \frac{(a^2\sigma^2)t^2}{2}\right)$$
This is the MGF of a Normal distribution with mean aµ + b and variance a²σ², so by uniqueness of MGFs, aX + b ∼ N(aµ + b, a²σ²).
Chapter 5. Multiple Random Variables
5.7: Limit Theorems
Slides (Google Drive) Video (YouTube)
This is definitely one of the most important sections in the entire text! The Central Limit Theorem is used everywhere in statistics (hypothesis testing), and it also has its applications in computing probabilities. We'll see three results here, each getting more powerful and surprising.
If X₁, ..., Xₙ are iid random variables with mean µ and variance σ², then we define the sample mean to be $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$. We'll see the following results:
• The expectation of the sample mean E[X̄ₙ] is exactly the true mean µ, and the variance Var(X̄ₙ) = σ²/n goes to 0 as you get more samples.
• (Law of Large Numbers) As n → ∞, the sample mean X̄ₙ converges (in probability) to the true mean µ. That is, as you get more samples, you will be able to get an excellent estimate of µ.
• (Central Limit Theorem) In fact, X̄ₙ follows a Normal distribution as n → ∞ (in practice, n as low as 30 is good enough for this to be true). When we talk about the distribution of X̄ₙ, this means: if we take n samples and take the sample mean, another n samples and take the sample mean, and so on, how will these sample means look in a histogram? This is crazy - regardless of what the distribution of the Xᵢ's was (discrete, continuous), their average will be approximately Normal! We'll see pictures and describe this more soon!
Further:
$$E[\bar{X}_n] = E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \frac{1}{n}n\mu = \mu$$
$$\text{Var}(\bar{X}_n) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2}n\sigma^2 = \frac{\sigma^2}{n}$$
Again, none of this is "mind-blowing" to prove: we just used linearity of expectation and properties of variance (the variances add since the Xᵢ's are independent).
What is this saying? Basically, if you wanted to estimate the mean height of the U.S. population by sampling n people uniformly at random:
• In expectation, your sample average will be "on point" at E[X̄ₙ] = µ. This even includes the case n = 1: if you just sample one person, on average, you will be correct. However, the variance is high.
• The variance of your estimate (the sample mean) of the true mean goes down (σ²/n) as your sample size n gets larger. This makes sense, right? If you have more samples, you have more confidence in your estimate because you are more "sure" (less variance).
In fact, as n → ∞, the variance of the sample mean approaches 0. A distribution with mean µ and variance 0 is essentially the degenerate random variable that takes on µ with probability 1. We'll actually see that the Law of Large Numbers argues exactly that!
(Weak Law of Large Numbers) For any ε > 0, $\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \varepsilon) = 0$.
(Strong Law of Large Numbers) $P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$.
The SLLN implies the WLLN, but not vice versa. The difference is subtle, and is basically swapping the limit and probability operations.
The proof of the WLLN will be given in 6.1 when we prove Chebyshev's inequality, but the proof of the SLLN is out of the scope of this class and much harder.
(The Central Limit Theorem) Let X₁, ..., Xₙ be a sequence of independent and identically distributed random variables with mean µ and (finite) variance σ². We've seen that the sample mean X̄ₙ has mean µ and variance σ²/n. Then as n → ∞, the following equivalent statements hold:
1. $\bar{X}_n \to N\left(\mu, \frac{\sigma^2}{n}\right)$
2. $\frac{\bar{X}_n - \mu}{\sqrt{\sigma^2/n}} \to N(0, 1)$
3. $\sum_{i=1}^n X_i \sim N(n\mu, n\sigma^2)$. This is not "technically" correct, but is useful for applications.
4. $\frac{\sum_{i=1}^n X_i - n\mu}{\sqrt{n\sigma^2}} \to N(0, 1)$
The mean and variance are not a surprise (we computed these at the beginning of these notes for any sample mean); the importance of the CLT is that, regardless of the distribution of the Xᵢ's, the sample mean approaches a Normal distribution as n → ∞.
We will prove the Central Limit Theorem in 5.11 using MGFs, but take a second to appreciate this crazy result! The LLN says that as n → ∞, the sample mean of iid variables X̄ₙ converges to µ. The CLT says that, as n → ∞, the sample mean actually converges to a Normal distribution! For any original distribution of the Xᵢ's (discrete or continuous), the average/sum will become approximately normally distributed.
If you're still having trouble with figuring out what "the distribution of the sample mean" means, that's completely normal (double pun!). Let's consider n = 2, so we just take the average of X₁ and X₂, which is (X₁ + X₂)/2. The distribution of X₁ + X₂ means: if we repeatedly sample X₁, X₂ and add them, what might the density look like? For example, if X₁, X₂ ∼ Unif(0, 1) (continuous), we showed the density of X₁ + X₂ looked like a triangle. We figured out how to compute the PMF/PDF of the sum using convolution in 5.5, and the average is just dividing this by 2: (X₁ + X₂)/2, whose PMF/PDF you can find by transforming RVs as in 4.4. On the next page, you'll see exactly the CLT applied to these Uniform distributions. With n = 1, it looks (and is) Uniform. When n = 2, you get the triangular shape. And as n gets larger, it starts looking more and more like a Normal!
You'll see some examples below of how we start with some arbitrary distributions and how the density function of their mean becomes shaped like a Gaussian (you know how to compute the pdf of the mean now, using convolution from 5.5 and transforming RVs from 4.4)!
On the next two pages, we’ll see some visual “proof” of this surprising result!
• The first (n = 1) of the four graphs below shows a discrete $\frac{1}{29} \cdot \text{Unif}(0, 29)$ PMF in the dots (and a blue line with the curve of the normal distribution with the same mean and variance). That is, $P(X = k) = \frac{1}{30}$ for each value in the range $\left\{0, \frac{1}{29}, \frac{2}{29}, \dots, \frac{28}{29}, 1\right\}$.
• The second graph (n = 2) has the average of two of these distributions, again with a blue line with the curve of the normal distribution with the same mean and variance. Remember, we expected this triangular distribution when summing either discrete or continuous Uniforms. (E.g., when summing two fair 6-sided die rolls, you're most likely to get a 7, and the probability goes down linearly as you approach 2 or 12. See the example in 5.5 if you forgot how we got this!)
• The third (n = 3) and fourth (n = 4) have the average of 3 and 4 identically distributed random variables respectively, each with the distribution shown in the first graph. We can see that as we average more, the average approaches a normal distribution.
Again, if you don't believe me, you can compute the PMF yourself using convolution: first add two of these discrete Uniforms, then convolve the result with a third, and a fourth!
Despite this being a discrete random variable, when we take an average of many, there become increasingly many values we can get between 0 and 1. The average of these iid discrete rv's approaches a continuous Normal random variable even after just averaging 4 of them!
Image Credit: Larry Ruzzo (a previous University of Washington CSE 312 instructor).
You might still be skeptical, because the Uniform distribution is “nice” and already looked pretty “Normal”
even with n = 2 samples. We now illustrate the same idea with a strange distribution shown in the first
(n = 1) of the four graphs below, illustrated with the dots (instead of a “nice” uniform distribution). Even
this crazy distribution nearly looks Normal after just averaging 4 of them. This is the power of the CLT!
What we are getting at here is that, regardless of the distribution, as we have more independent and
identically distributed random variables, the average follows a Normal distribution (with the same mean and
variance as the sample mean).
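You can reproduce these pictures yourself with a few lines of Python (a sketch I'm adding, with arbitrary sample sizes): sample many sample means of iid Unif(0, 1) RVs and watch the histogram tighten around 1/2 with variance (1/12)/n.

```python
import numpy as np

rng = np.random.default_rng(3)
for n in [1, 2, 4, 32]:
    means = rng.uniform(0, 1, size=(100_000, n)).mean(axis=1)
    # A histogram of `means` looks increasingly Normal as n grows.
    print(n, means.mean(), means.var(), 1 / 12 / n)  # Var(Unif(0,1)) = 1/12
```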
Now let's see how we can apply the CLT to problems! There were four different equivalent forms (just scaling/shifting) stated, but I find it easier to just look at the problem and decide what's best. Seeing examples is the best way to understand!
Example(s)
Let's consider the example of flipping a fair coin 40 times independently. What's the probability of getting between 15 and 25 heads? First compute this exactly, and then give an approximation using the CLT.
Solution Define X to be the number of heads in the 40 flips. Then we have X ∼ Bin(n = 40, p = 1/2), so we just sum the Binomial PMF:
$$P(15 \le X \le 25) = \sum_{k=15}^{25} \binom{40}{k}\left(\frac{1}{2}\right)^k\left(1 - \frac{1}{2}\right)^{40-k} \approx 0.9193$$
Now, let's use the CLT. Since X can be thought of as the sum of 40 iid Ber(1/2) RVs, we can apply the CLT. We have E[X] = np = 40(1/2) = 20 and Var(X) = np(1 - p) = 40(1/2)(1 - 1/2) = 10. So we can use the approximation X ≈ N(µ = 20, σ² = 10), and standardizing gives
$$P(15 \le X \le 25) \approx P\left(\frac{15 - 20}{\sqrt{10}} \le Z \le \frac{25 - 20}{\sqrt{10}}\right) = 2\Phi(1.58) - 1 \approx 0.886$$
Example(s)
Use the continuity correction to get a better estimate than we did earlier for the coin problem.
Solution We'll apply the exact same steps, except changing the bounds from 15 and 25 to 14.5 and 25.5:
$$P(14.5 \le X \le 25.5) \approx P\left(\frac{14.5 - 20}{\sqrt{10}} \le Z \le \frac{25.5 - 20}{\sqrt{10}}\right) = 2\Phi(1.74) - 1 \approx 0.918$$
Notice that this is much closer to the exact answer from the first part of the prior example (0.9193) than approximating with the central limit theorem without the continuity correction!
Note: If you are applying the CLT to sums/averages of continuous RVs instead, you should not apply the continuity correction.
See the additional exercises below to get more practice with the CLT!
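If you have scipy available (an assumption on my part; this sketch is not from the text), you can compare the exact Binomial answer against the CLT approximations with and without the continuity correction:

```python
from scipy import stats

mu, sd = 20, 10**0.5  # X ~ Bin(40, 1/2) has mean 20, variance 10
exact = sum(stats.binom.pmf(k, 40, 0.5) for k in range(15, 26))
naive = stats.norm.cdf(25, mu, sd) - stats.norm.cdf(15, mu, sd)
corrected = stats.norm.cdf(25.5, mu, sd) - stats.norm.cdf(14.5, mu, sd)
print(exact, naive, corrected)  # ~0.9193, ~0.886, ~0.918
```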
5.7.5 Exercises
1. Each day, the number of customers who come to the CSE 312 probability gift shop is approximately Poi(11). Approximate the probability that, after the quarter ends (9 × 7 = 63 days), we had over 700 customers.
Solution: The total number of customers that come is X = X₁ + ⋯ + X₆₃, where each Xᵢ ∼ Poi(11) has E[Xᵢ] = Var(Xᵢ) = λ = 11 from the chart. By the CLT, X ≈ N(µ = 63 · 11 = 693, σ² = 63 · 11 = 693) (sum of the means and sum of the variances). Hence, using the continuity correction (the Xᵢ's are discrete),
$$P(X > 700) \approx P\left(Z \ge \frac{700.5 - 693}{\sqrt{693}}\right) \approx P(Z \ge 0.285) = 1 - \Phi(0.285) \approx 0.39$$
Note that you could compute this exactly as well, since you know the sum of iid Poissons is Poisson. In fact, X ∼ Poi(693) (the average rate in 63 days is 693 per 63 days), and you could do a sum, which would be very annoying.
2. Suppose I have a flashlight which requires one battery to operate, and I have 18 identical batteries. I want to go camping for a week (24 × 7 = 168 hours). If the lifetime of a single battery is Exp(0.1), what's the probability my flashlight can operate for the entirety of my trip?
Solution: The total lifetime of the batteries is X = X₁ + ⋯ + X₁₈, where each Xᵢ ∼ Exp(0.1) has $E[X_i] = \frac{1}{0.1} = 10$ and $\text{Var}(X_i) = \frac{1}{0.1^2} = 100$. Hence, E[X] = 180 and Var(X) = 1800 by linearity of expectation and since variance adds for independent rvs. In fact, X ∼ Gamma(r = 18, λ = 0.1), but we don't have a closed form for its CDF. By the CLT, X ≈ N(µ = 180, σ² = 1800), so
$$P(X \ge 168) \approx P\left(Z \ge \frac{168 - 180}{\sqrt{1800}}\right) = P(Z \ge -0.28) = \Phi(0.28) \approx 0.61$$
Note that we don't use the continuity correction here because the RVs we are summing are already continuous RVs.
Chapter 5. Multiple Random Variables
5.8: The Multinomial Distribution
Slides (Google Drive) Video (YouTube)
As you’ve seen, the Binomial distribution is extremely commonly used, and probably the most important
discrete distribution. The Normal distribution is certainly the most important continuous distribution. In
this section, we’ll see how to generalize the Binomial, and in the next, the Normal.
Why do we need to generalize the Binomial distribution? Sometimes, we don’t just have two outcomes
(success and failure), but we have r > 2 outcomes. In this case, we need to maintain counts of how many
times each of the r outcomes appeared. A single random variable is no longer sufficient; we need a vector of
counts!
Actually, the example problems at the end could have been solved in Chapter 1. We will just formalize
this situation so that we can use it later!
What about the variance? We cannot just say or compute a single scalar Var(X), because what does that mean for a random vector? Actually, we need to define an n × n covariance matrix, which stores all pairwise covariances. It is often denoted in one of three ways: Σ = Var(X) = Cov(X).
The covariance matrix of a random vector X ∈ Rⁿ with E[X] = µ is the matrix denoted Σ = Var(X) = Cov(X) whose entries Σᵢⱼ = Cov(Xᵢ, Xⱼ). The formula for this is:
$$\Sigma = \text{Var}(\mathbf{X}) = \text{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] = E\left[\mathbf{X}\mathbf{X}^T\right] - \boldsymbol{\mu}\boldsymbol{\mu}^T$$
$$= \begin{bmatrix} \text{Cov}(X_1, X_1) & \text{Cov}(X_1, X_2) & \dots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \text{Cov}(X_2, X_2) & \dots & \text{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \dots & \text{Cov}(X_n, X_n) \end{bmatrix} = \begin{bmatrix} \text{Var}(X_1) & \text{Cov}(X_1, X_2) & \dots & \text{Cov}(X_1, X_n) \\ \text{Cov}(X_2, X_1) & \text{Var}(X_2) & \dots & \text{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(X_n, X_1) & \text{Cov}(X_n, X_2) & \dots & \text{Var}(X_n) \end{bmatrix}$$
Notice that the covariance matrix is symmetric (Σᵢⱼ = Σⱼᵢ), and contains variances along the diagonal.
Note: If you know a bit of linear algebra, you might like to know that covariance matrices are always symmetric positive semi-definite.
We will not be doing any linear algebra in this class - think of the covariance matrix as just a place to store all the pairwise covariances. Now let us look at an example of a covariance matrix.
Example(s)
If X₁, X₂, ..., Xₙ are iid with mean µ and variance σ², then find the mean vector and covariance matrix of the random vector X = (X₁, ..., Xₙ).
Solution The mean vector is
$$E[\mathbf{X}] = (E[X_1], \dots, E[X_n]) = (\mu, \dots, \mu) = \mu\mathbf{1}_n$$
where 1ₙ denotes the n-dimensional vector of all 1's. The covariance matrix is (since the diagonal is just the individual variances σ², and the off-diagonals (i ≠ j) are all Cov(Xᵢ, Xⱼ) = 0 due to independence)
$$\Sigma = \begin{bmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{bmatrix} = \sigma^2I_n$$
An important theorem is that properties of expectation and variance still hold for RVTRs. For a constant matrix A and constant vector b:
$$E[A\mathbf{X} + \mathbf{b}] = AE[\mathbf{X}] + \mathbf{b} \qquad \text{Var}(A\mathbf{X} + \mathbf{b}) = A\,\text{Var}(\mathbf{X})\,A^T$$
Since we aren't expecting any linear algebra background, we won't prove this.
Suppose we have n = 7 independent trials, each resulting in one of r = 3 outcomes (with probabilities p₁, p₂, p₃), and let Yᵢ count how many trials resulted in outcome i. Now, what is the probability of this outcome (two of outcome 1, one of outcome 2, and four of outcome 3) - that is, (Y₁ = 2, Y₂ = 1, Y₃ = 4)? We get the following:
$$p_{Y_1,Y_2,Y_3}(2, 1, 4) = \frac{7!}{2!1!4!} \cdot p_1^2 \cdot p_2^1 \cdot p_3^4 = \binom{7}{2, 1, 4} \cdot p_1^2 \cdot p_2^1 \cdot p_3^4 \qquad \text{[recall from counting]}$$
This describes the joint distribution of the random vector Y = (Y₁, Y₂, Y₃), and its PMF should remind you of the binomial PMF. We just count the number of ways $\binom{7}{2,1,4}$ to get these counts (multinomial coefficient), and make sure we get each outcome that many times: $p_1^2p_2^1p_3^4$.
We write
$$\mathbf{Y} \sim \text{Mult}_r(n, \mathbf{p})$$
Notice that each Yᵢ is marginally Bin(n, pᵢ). Hence, E[Yᵢ] = npᵢ and Var(Yᵢ) = npᵢ(1 - pᵢ). Then, we can specify the entire mean vector E[Y] and covariance matrix:
$$E[\mathbf{Y}] = n\mathbf{p} = \begin{bmatrix} np_1 \\ \vdots \\ np_r \end{bmatrix} \qquad \text{Var}(Y_i) = np_i(1 - p_i) \qquad \text{Cov}(Y_i, Y_j) = -np_ip_j \ (\text{for } i \ne j)$$
Notice the covariance is negative, which makes sense because as the number of occurrences of outcome i increases, the number of occurrences of outcome j should decrease, since they cannot occur simultaneously on the same trial.
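Here's a quick empirical check of the negative covariance (my addition; n, p, and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 7, np.array([0.2, 0.3, 0.5])
samples = rng.multinomial(n, p, size=200_000)

print(samples.mean(axis=0))           # ~ n * p
print(np.cov(samples, rowvar=False))  # off-diagonals ~ -n * p_i * p_j
print(-n * np.outer(p, p))            # theory (ignore its diagonal;
                                      # the true diagonal is n*p_i*(1-p_i))
```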
Proof of Multinomial Covariance. Recall that marginally, Xᵢ and Xⱼ are binomial random variables; let's decompose them into their Bernoulli trials. We'll use different dummy indices, as we're dealing with covariances.
Let X_{ik} for k = 1, ..., n be indicator/Bernoulli rvs of whether the k-th trial resulted in outcome i, so that $X_i = \sum_{k=1}^n X_{ik}$.
Similarly, let X_{jℓ} for ℓ = 1, ..., n be indicators of whether the ℓ-th trial resulted in outcome j, so that $X_j = \sum_{\ell=1}^n X_{j\ell}$.
Before we begin, we should argue that Cov(X_{ik}, X_{jℓ}) = 0 when k ≠ ℓ, since k and ℓ are different trials and are independent.
Furthermore, E[X_{ik}X_{jk}] = 0, since it's not possible that both outcome i and outcome j occur at trial k.
$$\text{Cov}(X_i, X_j) = \text{Cov}\left(\sum_{k=1}^n X_{ik}, \sum_{\ell=1}^n X_{j\ell}\right) \qquad \text{[indicators]}$$
$$= \sum_{k=1}^n\sum_{\ell=1}^n \text{Cov}(X_{ik}, X_{j\ell}) \qquad \text{[covariance works like FOIL]}$$
$$= \sum_{k=1}^n \text{Cov}(X_{ik}, X_{jk}) \qquad \text{[independent trials, cross terms are 0]}$$
$$= \sum_{k=1}^n \left(E[X_{ik}X_{jk}] - E[X_{ik}]E[X_{jk}]\right) \qquad \text{[def of covariance]}$$
$$= \sum_{k=1}^n (0 - p_ip_j) \qquad \text{[first expectation is 0]}$$
$$= -np_ip_j$$
Note that in the third line, we dropped one of the sums because the indicators across different trials k, ℓ are independent (zero covariance). Hence, we just need to sum the terms where k = ℓ.
Suppose a committee of 10 is formed uniformly at random from a senate of 100 members: 45 Green party members, 20 Democrats, and 35 Republicans. Let Y = (Y₁, Y₂, Y₃) be the number of each party's members in the committee (G, D, R in that order). What is the probability we get 1 Green party member, 6 Democrats, and 3 Republicans? It turns out it is just the following:
$$p_{Y_1,Y_2,Y_3}(1, 6, 3) = \frac{\binom{45}{1}\binom{20}{6}\binom{35}{3}}{\binom{100}{10}}$$
This is very similar to the univariate Hypergeometric distribution! For the denominator, there are $\binom{100}{10}$ ways to choose 10 senators. For the numerator, we need 1 from the 45 Green party members, 6 from the 20 Democrats, and 3 from the 35 Republicans.
Suppose there are r different colors of balls in a bag, having K = (K₁, ..., Kᵣ) balls of each color i, 1 ≤ i ≤ r. Let $N = \sum_{i=1}^r K_i$ be the total number of balls in the bag, and suppose we draw n without replacement. Let Y = (Y₁, ..., Yᵣ) be the rvtr such that Yᵢ is the number of balls of color i we drew. We write that:
$$\mathbf{Y} \sim \text{MVHG}_r(N, \mathbf{K}, n)$$
Then, we can specify the entire mean vector E[Y] and covariance matrix:
$$E[\mathbf{Y}] = n\frac{\mathbf{K}}{N} = \begin{bmatrix} n\frac{K_1}{N} \\ \vdots \\ n\frac{K_r}{N} \end{bmatrix} \qquad \text{Var}(Y_i) = n \cdot \frac{K_i}{N} \cdot \frac{N - K_i}{N} \cdot \frac{N - n}{N - 1} \qquad \text{Cov}(Y_i, Y_j) = -n \cdot \frac{K_i}{N} \cdot \frac{K_j}{N} \cdot \frac{N - n}{N - 1}$$
Proof of Hypergeometric Variance. We'll finally prove the variance of a univariate Hypergeometric (the variance of Yᵢ), but leave the covariance matrix to you (you can approach it similarly to the multinomial covariance matrix). Let X ∼ HypGeo(N, K, n), and write $X = \sum_{i=1}^n X_i$, where Xᵢ is the indicator that the i-th draw is a lollipop.
First, we have that since $X_i \sim \text{Ber}\left(\frac{K}{N}\right)$:
$$\text{Var}(X_i) = p(1 - p) = \frac{K}{N}\left(1 - \frac{K}{N}\right)$$
Second, for i ≠ j, $E[X_iX_j] = P(X_iX_j = 1) = P(X_i = 1)P(X_j = 1 \mid X_i = 1) = \frac{K}{N} \cdot \frac{K - 1}{N - 1}$, so
$$\text{Cov}(X_i, X_j) = E[X_iX_j] - E[X_i]E[X_j] = \frac{K}{N} \cdot \frac{K - 1}{N - 1} - \frac{K^2}{N^2}$$
Finally,
$$\text{Var}(X) = \text{Var}\left(\sum_{i=1}^n X_i\right) \qquad \text{[def of } X]$$
$$= \text{Cov}\left(\sum_{i=1}^n X_i, \sum_{j=1}^n X_j\right) \qquad \text{[covariance with self is variance]}$$
$$= \sum_{i=1}^n\sum_{j=1}^n \text{Cov}(X_i, X_j) \qquad \text{[bilinearity of covariance]}$$
$$= \sum_{i=1}^n \text{Var}(X_i) + 2\sum_{i<j} \text{Cov}(X_i, X_j) \qquad \text{[split diagonal]}$$
$$= n\frac{K}{N}\left(1 - \frac{K}{N}\right) + 2\binom{n}{2}\left(\frac{K}{N} \cdot \frac{K - 1}{N - 1} - \frac{K^2}{N^2}\right) \qquad \text{[plug in]}$$
$$= n \cdot \frac{K}{N} \cdot \frac{N - K}{N} \cdot \frac{N - n}{N - 1} \qquad \text{[algebra]}$$
5.8.4 Exercises
These won't be very interesting, since this could've been done in Chapters 1 and 2!
1. Suppose you are fishing in a pond with 3 red fish, 4 green fish, and 5 blue fish.
(a) You use a net to scoop up 6 of them. What is the probability you scooped up 2 of each?
(b) You "catch and release" until you've caught 6 fish (catch 1, throw it back, catch another, throw it back, etc.). What is the probability you caught 2 of each?
Solution:
(a) Let (X₁, X₂, X₃) be how many red, green, and blue fish I caught, respectively. Then X ∼ MVHG₃(N = 12, K = (3, 4, 5), n = 6), and
$$P(X_1 = 2, X_2 = 2, X_3 = 2) = \frac{\binom{3}{2}\binom{4}{2}\binom{5}{2}}{\binom{12}{6}}$$
(b) Let (X₁, X₂, X₃) be how many red, green, and blue fish I caught, respectively. Then X ∼ Mult₃(n = 6, p = (3/12, 4/12, 5/12)), and
$$P(X_1 = 2, X_2 = 2, X_3 = 2) = \binom{6}{2, 2, 2}\left(\frac{3}{12}\right)^2\left(\frac{4}{12}\right)^2\left(\frac{5}{12}\right)^2$$
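Both answers are easy to evaluate in Python (a sketch I'm adding, using only the standard library):

```python
from math import comb

# (a) scoop 6 at once: multivariate hypergeometric
p_a = comb(3, 2) * comb(4, 2) * comb(5, 2) / comb(12, 6)

# (b) catch and release: multinomial; 6!/(2!2!2!) = comb(6,2)*comb(4,2) = 90
p_b = comb(6, 2) * comb(4, 2) * (3/12)**2 * (4/12)**2 * (5/12)**2
print(p_a, p_b)  # approximately 0.1948 and 0.1085
```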
Chapter 5. Multiple Random Variables
5.9: The Multivariate Normal Distribution
Slides (Google Drive) Video (YouTube)
In this section, we will generalize the Normal random variable, the most important continuous distribution! We were able to find the joint PMF for the Multinomial random vector using a counting argument, but how can we find the Multivariate Normal density function? We'll start with the simplest case, and work from there.
Suppose X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) are independent, and stack them into a vector (X, Y) with mean vector µ = (µ_X, µ_Y) and (diagonal) covariance matrix Σ. Then, we say that (X, Y) has a bivariate Normal distribution, which we will denote:
$$(X, Y) \sim N_2(\boldsymbol{\mu}, \Sigma)$$
This is nice and all, if we have two independent Normals. But what if they aren't independent? Start with two independent standard Normals Z₁, Z₂ ∼ N(0, 1) and a desired correlation ρ:
1. We construct X from Z₁ alone:
$$X = \sigma_XZ_1 + \mu_X$$
2. We construct Y from both Z₁ and Z₂, as shown below:
$$Y = \sigma_Y\left(\rho Z_1 + \sqrt{1 - \rho^2}Z_2\right) + \mu_Y$$
From this transformation, we get that marginally (show this by computing the mean and variance of X, Y and using closure properties of Normal RVs),
$$X \sim N(\mu_X, \sigma_X^2) \qquad Y \sim N(\mu_Y, \sigma_Y^2)$$
Additionally, ρ(X, Y) = ρ. By using the multivariate change-of-variables formula from 4.4, we can turn the "simple" product of standard normal PDFs into the PDF of the bivariate Normal:
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1 - \rho^2}}\exp\left(-\frac{z}{2(1 - \rho^2)}\right), \quad x, y \in \mathbb{R}$$
where
$$z = \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho(x - \mu_X)(y - \mu_Y)}{\sigma_X\sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2}$$
Finally, we write:
$$(X, Y) \sim N_2(\boldsymbol{\mu}, \Sigma)$$
The visualization below shows the density of a bivariate Normal distribution. On the xy-plane, we have the actual two Normals, and on the z-axis, we have the density. Marginally, both variables are Normals!
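The two-step construction above is exactly how you can sample correlated Normals in code. Here's a Python sketch (my addition; all parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
mu_x, mu_y, sig_x, sig_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7

z1 = rng.standard_normal(200_000)  # independent standard Normals
z2 = rng.standard_normal(200_000)
x = sig_x * z1 + mu_x
y = sig_y * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu_y

print(x.mean(), x.std(), y.mean(), y.std())  # ~ the mu's and sigma's
print(np.corrcoef(x, y)[0, 1])               # ~ rho
```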
Now let's take a look at the effect of different covariance matrices Σ on the distribution of a bivariate normal, all with mean vector (0, 0). Each row below modifies one entry in the covariance matrix; see the pictures to explore graphically how the parameters change the shape!
A random vector X = (X₁, ..., Xₙ) has a multivariate Normal distribution with mean vector µ ∈ Rⁿ and (symmetric and positive-definite) covariance matrix Σ ∈ Rⁿˣⁿ, written X ∼ Nₙ(µ, Σ), if it has the following joint PDF:
$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right), \quad \mathbf{x} \in \mathbb{R}^n$$
While this PDF may look intimidating, if we recall the PDF of a univariate Normal W ∼ N(µ, σ²):
$$f_W(w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(w - \mu)^2\right)$$
we can note that the two formulae are quite similar; we simply extend scalars to vectors and matrices!
One special fact about the multivariate Normal: within a multivariate Normal random vector, uncorrelated components are actually independent; that is, Cov(Xᵢ, Xⱼ) = 0 → Xᵢ ⊥ Xⱼ.
Unfortunately, we cannot do example problems as they would require a deeper knowledge of linear algebra,
which we do not assume.
Chapter 5. Multiple Random Variables
5.10: Order Statistics
Slides (Google Drive) Video (YouTube)
We’ve talked a lot about the distribution of the sum of random variables, but what about the maximum,
minimum, or median? For example, if there are 4 possible buses you could take, and the time until each
arrives is independent with an exponential distribution, what is the expected time until the first one arrives?
Mathematically, this would be E [min{X1 , X2 , X3 , X4 }] if the arrival times were X1 , X2 , X3 , X4 .
In this section, we'll figure out how to find the density function (and hence expectation/variance) of the minimum, maximum, median, and more!
Let Y₁, ..., Yₙ be iid continuous random variables, and sort them so that Y₍₁₎ < Y₍₂₎ < ⋯ < Y₍ₙ₎. Y₍₁₎ is the smallest value (the minimum), and Y₍ₙ₎ is the largest value (the maximum); since they are so commonly used, they have the special names Y_min and Y_max respectively.
Notice that we can't have equality because, with continuous random variables, the probability that any two are equal is 0. So, we don't have to worry about any of these random variables being "less than or equal to" another.
Notice that each Y₍ᵢ₎ is a random variable as well! We call Y₍ᵢ₎ the i-th order statistic, i.e., the i-th smallest in a sample of size n. For example, if we had n = 9 samples, Y₍₅₎ would be the median value. We are interested in finding the distribution of each order statistic, and properties such as expectation and variance as well.
Why are order statistics important? Usually, we take the min, max, or median of a set of random variables and do computations with them - so, it would be useful if we had a general formula for the PDF and CDF of the min or max.
We start with an example to find the distribution of Y(n) = Ymax , the largest order statistic. We’ll then
extend this to any of the order statistics (not just the max). Again, this means, if we were to repeatedly
take the maximum of n iid RVs, what would the samples look like?
Example(s)
Let Y1 , Y2 , . . . , Yn be iid continuous random variables with the same CDF FY and PDF fY . What is
the distribution of Y(n) = Ymax = max{Y1 , Y2 , . . . , Yn } the largest order statistic?
Solution
We'll employ our typical strategy and work with probabilities instead of densities, so we'll start with the CDF:
$$F_{Y_{max}}(y) = P(Y_{max} \le y) = P(Y_1 \le y, \dots, Y_n \le y) = \prod_{i=1}^n P(Y_i \le y) = [F_Y(y)]^n$$
where we used the fact that the maximum is at most y if and only if every sample is at most y, and then independence. Differentiating with respect to y gives the density:
$$f_{Y_{max}}(y) = \frac{d}{dy}[F_Y(y)]^n = n[F_Y(y)]^{n-1}f_Y(y)$$
Let's take a step back and see what we just did here. We just computed the density function of the maximum of n iid random variables, denoted Y_max = Y₍ₙ₎. We now need to find the density of any arbitrary ranked Y₍ᵢ₎.
Now, using the same intuition as before, we'll use an informal argument to find the density of a general Y₍ᵢ₎, f_{Y₍ᵢ₎}(y). For example, this might help find the distribution of the minimum f_{Y₍₁₎} or the median. The result is:
$$f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot f_Y(y)$$
Proof of Density of Order Statistics. The formula above may remind you of a multinomial distribution, and you would be correct! Let's consider what it means for Y₍ᵢ₎ = y (the i-th smallest value in the sample of n to equal a particular value y):
• One of the values needs to be exactly y.
• i - 1 of the values need to be smaller than y (this happens for each with probability F_Y(y)).
• The other n - i values need to be greater than y (this happens for each with probability 1 - F_Y(y)).
Now, we have 3 distinct types of objects: 1 that is exactly y, i - 1 which are less than y, and n - i which are greater. Using multinomial coefficients and the above, we see that
$$f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot f_Y(y)$$
Note that this isn't a probability; it is a density, so there is something flawed with how we approached this problem. For a more rigorous approach, we just have to make a slight modification, but use the same idea.
Re-Proof (Rigorous). This time, we'll find $P\left(y - \frac{\varepsilon}{2} \le Y_{(i)} \le y + \frac{\varepsilon}{2}\right)$ and use the fact that this is approximately equal to $\varepsilon f_{Y_{(i)}}(y)$ for small ε > 0 (Riemann integral (rectangle) approximation from 4.1).
We have very similar cases:
• One of the values needs to be between y - ε/2 and y + ε/2 (this happens with probability approximately εf_Y(y), again by Riemann approximation).
• i - 1 of the values need to be smaller than y - ε/2 (this happens for each with probability F_Y(y - ε/2)).
• The other n - i values need to be greater than y + ε/2 (this happens for each with probability 1 - F_Y(y + ε/2)).
Now these are actually probabilities (not densities), so we get
$$P\left(y - \frac{\varepsilon}{2} \le Y_{(i)} \le y + \frac{\varepsilon}{2}\right) \approx \varepsilon f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot (\varepsilon f_Y(y))$$
Dividing both sides by ε > 0 gives the same result as earlier!
Let's verify this formula with our maximum that we derived earlier by plugging in n for i:
$$f_{Y_{max}}(y) = f_{Y_{(n)}}(y) = \binom{n}{n-1,\ 1,\ 0} \cdot [F_Y(y)]^{n-1} \cdot [1 - F_Y(y)]^0 \cdot f_Y(y) = nF_Y^{n-1}(y)f_Y(y)$$
Example(s)
If Y₁, ..., Yₙ are iid Unif(0, 1), where do we "expect" the points to end up? That is, find E[Y₍ᵢ₎] for any i. You may find this picture with different values of n useful for intuition.
Solution
Intuitively, from the picture, if n = 1, we expect the single point to end up at 1/2. If n = 2, we expect the two points to end up at 1/3 and 2/3. If n = 4, we expect the four points to end up at 1/5, 2/5, 3/5 and 4/5.
Let's prove this formally. Recall, if Y ∼ Unif(0, 1) (continuous), then f_Y(y) = 1 for y ∈ [0, 1] and F_Y(y) = y for y ∈ [0, 1]. By the order statistics formula,
$$f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot f_Y(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot y^{i-1} \cdot (1 - y)^{n-i} \cdot 1$$
This is exactly the density of a Beta(α = i, β = n - i + 1) random variable, whose expectation is $\frac{\alpha}{\alpha + \beta} = \frac{i}{n+1}$ - matching our intuition above!
Here is a picture which may help you figure out what the formulae you just computed mean!
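A short simulation (my addition, with n = 4 and an arbitrary trial count) confirms that the i-th smallest of n iid Unif(0, 1) samples averages to i/(n + 1):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
# Sort each row: column i holds the (i+1)-th order statistic of that trial
sorted_samples = np.sort(rng.uniform(0, 1, size=(200_000, n)), axis=1)
print(sorted_samples.mean(axis=0))             # ~ [0.2, 0.4, 0.6, 0.8]
print([i / (n + 1) for i in range(1, n + 1)])  # exact: i/(n+1)
```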
Example(s)
At 5pm each day, four buses make their way to the HUB bus stop. Each bus would be acceptable to take you home. The time in hours (after 5pm) that each arrives at the stop is independent, with Y₁, Y₂, Y₃, Y₄ ∼ Exp(λ = 6) (on average, it takes 1/6 of an hour (10 minutes) for each bus to arrive).
1. On Mondays, you want to get home ASAP, so you arrive at the bus stop at 5pm sharp. What is the expected time until the first one arrives?
2. On Tuesdays, you have a lab meeting that runs until 5:15 and are worried you may not catch any bus. What is the probability you miss all the buses?
Solution The first question asks about the smallest order statistic Y₍₁₎ = Y_min, since we care about the first bus. The second question asks about the largest order statistic Y₍₄₎, since we care about the last bus. Let's compute the general formula for order statistics first so we can apply it to both parts of the problem.
Recall, if Y ∼ Exp(λ = 6) (continuous), then $f_Y(y) = 6e^{-6y}$ for y ∈ [0, ∞) and $F_Y(y) = 1 - e^{-6y}$ for y ∈ [0, ∞). By the order statistics formula,
$$f_{Y_{(i)}}(y) = \binom{n}{i-1,\ 1,\ n-i} \cdot [F_Y(y)]^{i-1} \cdot [1 - F_Y(y)]^{n-i} \cdot f_Y(y) = \binom{4}{i-1,\ 1,\ 4-i} \cdot [1 - e^{-6y}]^{i-1} \cdot [e^{-6y}]^{4-i} \cdot 6e^{-6y}$$
1. For the first part, we want E[Y₍₁₎], so we plug in i = 1 (and n = 4) to the above formula to get:
$$f_{Y_{(1)}}(y) = \binom{4}{0,\ 1,\ 3} \cdot [1 - e^{-6y}]^0 \cdot [e^{-6y}]^3 \cdot 6e^{-6y} = 4[e^{-18y}]6e^{-6y} = 24e^{-24y}$$
Now we can use the PDF to find the expectation normally. However, notice that the PDF is that of an Exp(λ = 24) distribution, so it has expectation 1/24. That is, the expected time until the first bus arrives is 1/24 of an hour, or 2.5 minutes.
Let's talk about something amazing here. We found that min{Y₁, Y₂, Y₃, Y₄} ∼ Exp(λ = 4 · 6); the minimum of exponentials is distributed as an exponential with the sum of the rates! Why might this be true? If we have Y₁, Y₂, Y₃, Y₄ ∼ Exp(6), that means on average, 6 buses of each type arrive each hour, for a total of 24. That just means we can model our waiting time in this regime with an average of 24 buses per hour, to get that the time until the first bus has an Exp(6 + 6 + 6 + 6) distribution!
2. For finding the maximum, we just plug in i = n = 4 (and n = 4), to get
$$f_{Y_{(4)}}(y) = \binom{4}{3,\ 1,\ 0} \cdot [1 - e^{-6y}]^3 \cdot [e^{-6y}]^0 \cdot 6e^{-6y} = 4[1 - e^{-6y}]^36e^{-6y}$$
Unfortunately, this is as simplified as it gets, and we don't get the nice result that the maximum of exponentials is exponential. To find the desired quantity, we just need to compute the probability the last bus comes before 5:15 (which is 0.25 hours - be careful of units!):
$$P(Y_{max} \le 0.25) = \int_0^{0.25} f_{Y_{max}}(y)\,dy = \int_0^{0.25} 4[1 - e^{-6y}]^36e^{-6y}\,dy = \left[(1 - e^{-6y})^4\right]_0^{0.25} = (1 - e^{-1.5})^4 \approx 0.364$$
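Both answers can be checked by simulation; here's a sketch (my addition; note numpy parameterizes the Exponential by its scale 1/λ):

```python
import numpy as np

rng = np.random.default_rng(7)
times = rng.exponential(scale=1 / 6, size=(500_000, 4))  # 4 iid Exp(6) buses

print(times.min(axis=1).mean())           # ~ 1/24 hours until the first bus
print((times.max(axis=1) < 0.25).mean())  # ~ 0.364, P(miss all buses)
```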
Chapter 5. Multiple Random Variables
5.11: Proof of the CLT
Slides (Google Drive) Video (YouTube)
In this optional section, we’ll prove the Central Limit Theorem, one of the most fundamental and amazing
results in all of statistics, using MGFs!
For a function f : R → R, we will denote f⁽ⁿ⁾(x) to be the n-th derivative of f(x). Let X, Y be independent random variables, and a, b ∈ R be scalars. Then MGFs satisfy the following properties:
1. M_X'(0) = E[X], M_X''(0) = E[X²], and in general M_X⁽ⁿ⁾(0) = E[Xⁿ]. This is why we call M_X a moment generating function, as we can use it to generate the moments of X.
2. M_{aX+b}(t) = e^{tb}M_X(at).
3. If X ⊥ Y, then M_{X+Y}(t) = M_X(t)M_Y(t).
4. (Uniqueness) The following are equivalent:
(a) X and Y have the same distribution.
(b) f_X(z) = f_Y(z) for all z ∈ R.
(c) F_X(z) = F_Y(z) for all z ∈ R.
(d) There is an ε > 0 such that M_X(t) = M_Y(t) for all t ∈ (-ε, ε) (they match on a small interval around t = 0).
That is, M_X uniquely identifies a distribution, just like PDFs or CDFs do.
Let X₁, ..., Xₙ be a sequence of independent and identically distributed random variables with mean µ and (finite) variance σ². Then, the standardized sample mean approaches the standard Normal distribution:
$$\text{As } n \to \infty, \quad Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \to N(0, 1)$$
Proof of The Central Limit Theorem. Our strategy will be to compute the MGF of Zₙ and exploit properties of the MGF (especially uniqueness) to show that it must have a standard Normal distribution! Assume, without loss of generality, that µ = 0 (otherwise, replace each Xᵢ with Xᵢ - µ).
Now, let:
$$Z_n = \frac{\bar{X}_n}{\sigma/\sqrt{n}} = \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i$$
Note there is no typo above: the $\frac{1}{n}$ from X̄ₙ changes the division by $\sqrt{n}$ to a multiplication.
We will show $M_{Z_n}(t) \to e^{t^2/2}$ (the standard normal MGF), and hence Zₙ → N(0, 1) by uniqueness of the MGF.
1. First, for an arbitrary random variable Y, since the MGF exists in (-ε, ε) under "most" conditions, we can use the 2nd order Taylor series expansion around 0 (quadratic approximation to a function):
$$M_Y(s) \approx M_Y(0) \cdot \frac{s^0}{0!} + M_Y'(0) \cdot \frac{s^1}{1!} + M_Y''(0) \cdot \frac{s^2}{2!}$$
$$= E[Y^0] + E[Y]s + E[Y^2]\frac{s^2}{2} \qquad [\text{Since } M_Y^{(n)}(0) = E[Y^n]]$$
$$= 1 + E[Y]s + E[Y^2]\frac{s^2}{2} \qquad [\text{Since } Y^0 = 1]$$
2. Now, let M_X denote the common MGF of all the Xᵢ's (since they are iid).
$$M_{Z_n}(t) = M_{\frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n X_i}(t) \qquad [\text{Definition of } Z_n]$$
$$= M_{\sum_{i=1}^n X_i}\left(\frac{t}{\sigma\sqrt{n}}\right) \qquad \left[\text{By Property 2 of MGFs above, where } a = \frac{1}{\sigma\sqrt{n}},\ b = 0\right]$$
$$= \left[M_X\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n \qquad [\text{By Property 3 of MGFs above}]$$
t
3. Recall Step 1, and now let Y = X and s = p
n
so we get a Taylor approximation of MX . Then:
⇣ ⌘2
✓ ◆ t
p
t t ⇥ ⇤ n
MX p ⇡ 1 + E [X] p + E X 2 [Step 1]
n n 2
t 2 ⇥ ⇤
=1+0+ 2 2 [Since E [X] = 0 and E X 2 = 2
]
2 n
t2 /2
=1+
n
4. Now we combine Steps 2 and 3:
$$M_{Z_n}(t) = \left[M_X\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n \qquad [\text{step 2}]$$
$$\approx \left(1 + \frac{t^2/2}{n}\right)^n \qquad [\text{step 3}]$$
$$\to e^{t^2/2} \qquad \left[\text{Since } \left(1 + \frac{x}{n}\right)^n \to e^x\right]$$
Hence, Zₙ has the same MGF as that of a standard normal, so it must follow that distribution!
Chapter 6. Concentration Inequalities
It seems like we must have learned everything there possibly is - what else could go wrong? Sometimes we know only certain properties about a random variable (its mean and/or variance), but not its entire distribution. For example, the expected running time (number of comparisons) of the randomized QuickSort algorithm can be found using linearity of expectation and indicators. But what are the strongest guarantees we can make about a random variable without full knowledge of its distribution, if any?
6.1: Markov and Chebyshev Inequalities
Slides (Google Drive) Video (YouTube)
When reasoning about some random variable X, it's not always easy or possible to calculate/know its exact PMF/PDF. We might not know much about X (maybe just its mean and variance), but we can still provide concentration inequalities to get a bound on how likely it is for X to be far from its mean µ (of the form P(|X - µ| > α)), or how likely it is for this random variable to be very large (of the form P(X ≥ k)).
You might ask when we would only know the mean/variance but not the PMF/PDF? Some of the distributions that we use (like Exponential for bus waiting time) are just modelling assumptions and are probably incorrect. If we measured how long it took for the bus to arrive over many days, we could estimate its mean and variance! That is, we have no idea the true distribution of daily bus waiting times, but can get good estimates for the mean and variance. We can use these concentration inequalities to bound the probability that we wait too long for a bus, knowing just those two quantities and nothing else!
Example(s)
The score distribution of an exam is modelled by a random variable X with range Ω_X = [0, 110] (with 10 points for extra credit). Give an upper bound on the proportion of students who score at least 100 when the average is 50, and when the average is 25.
Solution What would you guess? If the average is E[X] = 50, an upper bound on the proportion of students who score at least 100 should be 50%, right? If more than 50% of students scored a 100 (or higher), the average would already be 50, since all scores must be nonnegative (≥ 0). Mathematically, we just argued that:
$$P(X \ge 100) \le \frac{E[X]}{100} = \frac{50}{100} = \frac{1}{2}$$
This sounds reasonable - if, say, 70% of the class were to get 100 or higher, the average would already be at least 70, even if everyone else got a zero. The best bound we can get is 50% - and that requires everyone else to get a zero.
If the average is E[X] = 25, an upper bound on the proportion of students who score at least 100 is:
$$P(X \ge 100) \le \frac{E[X]}{100} = \frac{25}{100} = \frac{1}{4}$$
Similarly, if we had more than 30% of students get 100 or higher, the average would already be at least 30, even if everyone else got a zero.
(Markov's Inequality) Let X ≥ 0 be a non-negative random variable (discrete or continuous), and let k > 0. Then:
$$P(X \ge k) \le \frac{E[X]}{k}$$
Equivalently (plugging in kE[X] for k above):
$$P(X \ge kE[X]) \le \frac{1}{k}$$
Proof of Markov’s Inequality. Below is the proof when X is continuous. The proof for discrete RVs is similar
(just change all the integrals into summations).
Z 1
E [X] = xfX (x)dx [because X 0]
0
Z k Z 1
= xfX (x)dx + xfX (x)dx [split integral at some 0 k 1]
0 k
Z "Z #
1 k
xfX (x)dx xfX (x)dx 0 because k 0, x 0 and fX (x) 0
k 0
Z 1
kfX (x)dx [because x k in the integral]
k
Z 1
=k fX (x)dx
k
= kP (X k)
So just knowing that the random variable is non-negative and knowing its expectation, we can bound the probability that it is "very large". We know nothing else about the exam distribution! Note there is no bound we can derive if X could be negative. Always check that X is indeed nonnegative before applying this bound!
The following example demonstrates how to use Markov’s inequality, and how loose it can be in some cases.
Example(s)
A coin is weighted so that its probability of landing on heads is 20%, independently of other flips.
Suppose the coin is flipped 20 times. Use Markov’s inequality to bound the probability it lands on
heads at least 16 times.
Solution We actually do know this distribution; the number of heads is X ∼ Bin(n = 20, p = 0.2). Thus, E[X] = np = 20 · 0.2 = 4. By Markov's inequality:

P(X ≥ 16) ≤ E[X]/16 = 4/16 = 1/4
Let’s compare this to the actual probability that this happens:
X20 ✓ ◆
20
P (X 16) = 0.2k · 0.820 k ⇡ 1.38 · 10 8
k
k=16
This is not a good bound, since we only assumed knowledge of the expected value. Again, we knew the exact distribution, but chose not to use any of that information (the variance, the PMF, etc.).
Example(s)
Suppose the expected runtime of QuickSort is 2n log(n) operations/comparisons (we can show this
using linearity of expectation with dependent indicator variables). Use Markov’s inequality to bound
the probability that QuickSort runs for longer than 20n log(n) time.
Solution Let X be the runtime of QuickSort, with E [X] = 2n log(n). Then, since X is non-negative, we can
use Markov’s inequality:
P(X ≥ 20n log(n)) ≤ E[X]/(20n log(n))   [Markov's inequality]
                  = 2n log(n)/(20n log(n))
                  = 1/10
So we know there’s at most 10% probability that QuickSort takes this long to run. Again, we can get this
bound despite not knowing anything except its expectation!
Theorem (Chebyshev's Inequality): Let X be any random variable with expected value µ = E[X] and finite variance Var(X). Then, for any real number α > 0:

P(|X − µ| ≥ α) ≤ Var(X)/α²

Equivalently (plugging in kσ for α above, where σ = √Var(X)):

P(|X − µ| ≥ kσ) ≤ 1/k²
This is used to bound the probability of being in the tails. Here is a picture of Chebyshev's inequality bounding the probability that a Gaussian X ∼ N(µ, σ²) is more than k = 2 standard deviations from its mean:
While in principle Chebyshev’s inequality asks about distance from the mean in either direction, it can still
be used to give a bound on how often a random variable can take large values, and will usually give much
better bounds than Markov’s inequality. This is expected, since we also assume to know the variance - and
if the variance is small, we know the RV can’t deviate too far from its mean.
Example(s)
Let's revisit the example from the Markov's inequality section earlier, in which we toss a weighted coin independently with probability of landing heads p = 0.2. Upper bound the probability it lands on heads at least 16 times out of 20 flips using Chebyshev's inequality.
Solution Again X ∼ Bin(20, 0.2) is the number of heads, so:

E[X] = np = 20 · 0.2 = 4
Var(X) = np(1 − p) = 20 · 0.2 · (1 − 0.2) = 3.2
Note that since Chebyshev's asks about the difference of the RV from its mean in either direction, we must weaken our statement first to include the probability that X ≤ −8. The reason we chose −8 is because Chebyshev's inequality is symmetric about the mean (difference of 12; 4 ± 12 gives the interval [−8, 16]):

P(X ≥ 16) ≤ P(X ≥ 16) + P(X ≤ −8) = P(|X − 4| ≥ 12) ≤ Var(X)/12² = 3.2/144 ≈ 0.0222

This is a much better bound than the one given by Markov's inequality, but still far from the actual probability. This is because Chebyshev's inequality only takes the mean and variance into account. There is so much more information about a RV than just these two quantities!
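Again, a quick numerical check never hurts; a minimal sketch in Python (same setup as before, standard library only):

    from math import comb

    n, p = 20, 0.2
    mu, var = n * p, n * p * (1 - p)          # 4 and 3.2
    chebyshev = var / (16 - mu) ** 2          # P(|X - 4| >= 12) <= 3.2/144
    exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(16, n + 1))
    print(chebyshev)  # ~0.0222
    print(exact)      # ~1.38e-08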
We can actually use Chebyshev's inequality to prove an important result from 5.7: the Weak Law of Large Numbers. The proof is so short!

Theorem (Weak Law of Large Numbers): Let X̄_n be the sample mean of n iid random variables, each with mean µ and finite variance σ². Then, for any ε > 0:

lim_{n→∞} P(|X̄_n − µ| > ε) = 0
Proof. By the properties of the expectation and variance of a sample mean of n iid variables: E[X̄_n] = µ and Var(X̄_n) = σ²/n (from 5.7). By Chebyshev's inequality:

lim_{n→∞} P(|X̄_n − µ| > ε) ≤ lim_{n→∞} Var(X̄_n)/ε² = lim_{n→∞} σ²/(nε²) = 0
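The bound σ²/(nε²) also tells you roughly how fast the convergence happens. Here is a minimal simulation sketch (not from the book; it assumes numpy, and the Unif(0, 1) choice is just for illustration):

    import numpy as np

    rng = np.random.default_rng(seed=1)
    eps = 0.1
    for n in [10, 100, 1000]:
        # 10,000 sample means, each over n iid Unif(0,1) rvs (true mean 0.5)
        means = rng.random((10_000, n)).mean(axis=1)
        print(n, (np.abs(means - 0.5) > eps).mean())
    # the fraction of sample means farther than eps from 0.5 shrinks toward 0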
Chapter 6. Concentration Inequalities
6.2: The Chernoff Bound
Slides (Google Drive) Video (YouTube)
The more we know about a distribution, the stronger the concentration inequality we can derive. We know that Markov's inequality is weak, since we only use the expectation of a random variable to get the probability bound. Chebyshev's inequality is a bit stronger, because we incorporate the variance into the probability bound. However, as we showed in the example in 6.1, these bounds are still pretty "loose" (they are tight in some cases though).
What if we know even more - in particular, the PMF/PDF and hence the MGF? That will allow us to derive an even stronger bound. The Chernoff bound is derived using a combination of Markov's inequality and moment generating functions.
Let X be any random variable. Then e^{tX} is always a non-negative random variable. Thus, for any t > 0, using Markov's inequality and the definition of MGF:

P(X ≥ k) = P(e^{tX} ≥ e^{tk}) ≤ E[e^{tX}]/e^{tk} = M_X(t)/e^{tk}

(Note that the first step requires t > 0, otherwise it would change to P(e^{tX} ≤ e^{tk}). This is because e^t > 1 for t > 0, so we get something like 2^X, which is monotone increasing. If t < 0, then e^t < 1, so we get something like 0.3^X, which is monotone decreasing.)
Now the right-hand side holds for (uncountably) infinitely many t. For example, if we plugged in t = 0.5 we might get M_X(t)/e^{tk} = 0.53, and if we plugged in t = 3.26 we might get 0.21. Since P(X ≥ k) has to be less than all the possible values we get by plugging in different t > 0, it in particular must be less than the minimum of all the values.
P(X ≥ k) ≤ min_{t>0} M_X(t)/e^{tk}
This is good - if we can minimize the right-hand side, we can get a very tight/strong bound. We'll now focus our attention on deriving the Chernoff bound when X has a Binomial distribution. Everything above applies generally though.
The Chernoff bound will allow us to bound the probability that X is larger than some multiple of its mean, or smaller than some fraction of it. These are the tails of a distribution as you go farther in either direction from the mean. For example, we might want to bound the probability that X ≥ 1.5µ or X ≤ 0.1µ.
I think it's completely acceptable if you'd like not to read the proof, as it is very involved algebraically. You can still use the result regardless!
If X = Σ_{i=1}^n X_i where X₁, X₂, ..., X_n are iid variables, then the MGF of the (independent) sum equals the product of the MGFs. Taking our general result from above and using this fact, we get:

P(X ≥ k) ≤ min_{t>0} M_X(t)/e^{tk} = min_{t>0} (∏_{i=1}^n M_{X_i}(t))/e^{tk}
Let’s derive a Cherno↵ bound for X ⇠ Bin(n, p), which has the form P (X (1 + )µ) for > 0. For example
with = 4, you may want to bound P (X 5E [X]).
Pn
Recall X = i=1 Xi where Xi ⇠ Ber(p) are iid, with µ = E[X] = np.
M_{X_i}(t) = E[e^{tX_i}]   [def of MGF]
           = e^{t·1}·p_{X_i}(1) + e^{t·0}·p_{X_i}(0)   [LOTUS]
           = pe^t + (1 − p)   [X_i ∼ Ber(p)]
           = 1 + p(e^t − 1)
           ≤ e^{p(e^t − 1)}   [1 + x ≤ e^x with x = p(e^t − 1)]
See here for a pictorial proof that 1 + x ≤ e^x for any real number x (just plot the two functions). Alternatively, use the Taylor series for e^x to argue this. We use this bound for algebraic convenience coming up soon.
Now using the result from earlier and plugging in the MGF of the Ber(p) distribution, we get:

P(X ≥ k) ≤ min_{t>0} (∏_{i=1}^n M_{X_i}(t))/e^{tk}   [from earlier]
         ≤ min_{t>0} (e^{p(e^t − 1)})ⁿ/e^{tk}   [MGF bound of Ber(p), n times]
         = min_{t>0} e^{np(e^t − 1)}/e^{tk}   [algebra]
         = min_{t>0} e^{µ(e^t − 1)}/e^{tk}   [µ = np]
For our bound, we want something like P(X ≥ (1 + δ)µ), so our k = (1 + δ)µ. To minimize the RHS and get the tightest bound, the best choice turns out to be t = ln(1 + δ), after some terrible algebra (take the derivative and set it to 0). We simply plug in k and our optimal value of t into the above equation:

P(X ≥ (1 + δ)µ) ≤ e^{µ(e^{ln(1+δ)} − 1)}/e^{(1+δ)µ·ln(1+δ)} = e^{µ((1+δ) − 1)}/(e^{ln(1+δ)})^{(1+δ)µ} = e^{δµ}/(1 + δ)^{(1+δ)µ} = (e^δ/(1 + δ)^{(1+δ)})^µ
Again, we wanted to choose t that minimizes our upper bound for the tail probability. Taking the derivative
with respect to t tells us we should plug in t = ln(1 + ) to minimize that quantity. This would actually be
pretty annoying to plug into a calculator.
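If you'd rather let the computer do the minimization, a crude grid search over t recovers the same optimizer; here is a minimal sketch in Python (not from the book; µ = 4.8 and δ = 2/3 are borrowed from an example later in this section):

    import math

    mu, delta = 4.8, 2/3
    k = (1 + delta) * mu

    def bound(t):
        return math.exp(mu * (math.exp(t) - 1)) / math.exp(t * k)

    ts = [i / 10_000 for i in range(1, 30_000)]   # grid over t in (0, 3)
    best_t = min(ts, key=bound)
    print(best_t, math.log(1 + delta))  # both ~0.5108
    print(bound(best_t))                # the optimized bound, ~0.412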
We actually can show that the final RHS is exp(−δ²µ/(2 + δ)) with some more messy algebra. Additionally, if we restrict 0 < δ < 1, we can simplify this even more to the bound provided earlier:

P(X ≥ (1 + δ)µ) ≤ exp(−δ²µ/3)
The proof of the lower tail is entirely analogous, except optimizing over t < 0, where the inequality flips. It proceeds by taking t = ln(1 − δ). We also get a lower tail bound:

P(X ≤ (1 − δ)µ) ≤ (e^{−δ}/(1 − δ)^{(1−δ)})^µ ≤ (e^{−δ}/e^{−δ+δ²/2})^µ = exp(−δ²µ/2)
You may wonder: why are we bounding P(X ≥ (1 + δ)µ) when we can just sum the PMF of a binomial to get an exact answer? The reason is that it is very computationally expensive to compute the binomial PMF! For example, if X ∼ Bin(n = 20000, p = 0.1), then by plugging into the PMF, we get

P(X = 13333) = C(20000, 13333) · 0.1^{13333} · 0.9^{20000−13333} = (20000!/(13333!·(20000 − 13333)!)) · 0.1^{13333} · 0.9^{20000−13333}

(Actually, n = 20000 isn't even that large.) You have to multiply 20,000 numbers for the last two terms, and they multiply to a number that is infinitesimally small. For the first term (the binomial coefficient), computing 20000! directly is hopeless - in fact, it is so large you can't even imagine. You would have to cleverly interleave multiplying the factorial terms with the probabilities to keep the value in an acceptable range for the computer. Then, sum up a bunch of these....
This is why we have/need the Poisson approximation, the Normal approximation (CLT), and the Chernoff bound for the Binomial!
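For what it's worth, the standard computational trick is to do everything in log space; here is a minimal sketch in Python (not the book's method, just an illustration; math.lgamma(x) computes ln Γ(x), and Γ(n + 1) = n!):

    from math import lgamma, log

    def log_binom_pmf(n, p, k):
        # ln C(n, k) + k ln p + (n - k) ln(1 - p), all safely in log space
        log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        return log_coeff + k * log(p) + (n - k) * log(1 - p)

    # about -18700: the probability itself would underflow any float,
    # but its logarithm is perfectly manageable
    print(log_binom_pmf(20000, 0.1, 13333))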
Example(s)
Suppose X ∼ Bin(500, 0.2). Use Markov's inequality and the Chernoff bound to bound P(X ≥ 150), and compare the results.
Solution We have:

E[X] = np = 500 · 0.2 = 100
Var(X) = np(1 − p) = 500 · 0.2 · 0.8 = 80

By Markov's inequality, P(X ≥ 150) ≤ E[X]/150 = 100/150 = 2/3. For the Chernoff bound, we want 150 = (1 + δ)µ = (1 + δ) · 100, so δ = 0.5, and

P(X ≥ 150) = P(X ≥ (1 + 0.5) · 100) ≤ exp(−(0.5)² · 100/3) ≈ 0.0002

The Chernoff bound is much stronger! It isn't necessarily a fair comparison, because the Chernoff bound required knowing the MGF (and hence the distribution), whereas Markov only required knowing the mean (and that X is non-negative).
These examples give you an overall comparison of all three inequalities we learned so far!
Example(s)
Suppose the number of red lights Alex encounters each day on the way to work is on average 4.8 (according to historical trips to work). Alex will really be late if he encounters 8 or more red lights. Let X be the number of red lights he gets on a given day.
1. Give a bound for P(X ≥ 8) using Markov's inequality.
2. Give a bound for P(X ≥ 8) using Chebyshev's inequality, if we also assume Var(X) = 2.88.
3. Give a bound for P(X ≥ 8) using the Chernoff bound. Assume that X ∼ Bin(12, 0.4) - that there are 12 traffic lights, and each is independently red with probability 0.4.
4. Compute P(X ≥ 8) exactly using the assumption from the previous part.
5. Compare the three bounds and their assumptions.
1. Since X is nonnegative and we know its expectation, we can apply Markov's inequality:

P(X ≥ 8) ≤ E[X]/8 = 4.8/8 = 0.6
2. Since we know X’s variance, we can apply Chebyshevs inequality after some manipulation. We have
to do this to match the form required:
The reason we chose 1.6 is so it looks like P (|X µ| ↵). Now, applying Chebyshev’s gives:
3. Actually, X ∼ Bin(12, 0.4) also has E[X] = np = 4.8 and Var(X) = np(1 − p) = 2.88 (what a coincidence). The Chernoff bound requires something of the form P(X ≥ (1 + δ)µ), so we first need to solve for δ: (1 + δ)4.8 = 8, so δ = 2/3. Now,

P(X ≥ 8) = P(X ≥ (1 + 2/3) · 4.8) ≤ exp(−(2/3)² · 4.8/3) ≈ 0.4911

4. Using the Binomial PMF directly,

P(X ≥ 8) = Σ_{k=8}^{12} C(12, k) · 0.4^k · 0.6^{12−k} ≈ 0.0573
5. Actually it’s usually the case that the bounds are tighter/better as we move down the list Markov,
Chebyshev, Cherno↵. But in this case Chebyshev’s gave us the tightest bound, even after being
weakened by including some additional P (X 1.6). Cherno↵ bounds will typically be better for
farther tails - 8 isn’t considered too far from the mean 4.8.
It’s also important to note that we found out more information progressively - we can’t blindly apply
all these inequalities every time. We need to make sure the conditions for the bound being valid are
satisfied.
Even our best bound of 0.28125 was 5-6x larger than the true probability of 0.0573.
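To tie the whole example together, here is a minimal sketch in Python computing all three bounds and the exact probability (standard library only; not from the book itself):

    from math import comb, exp

    n, p, k = 12, 0.4, 8
    mu, var = n * p, n * p * (1 - p)            # 4.8 and 2.88
    markov = mu / k                             # 0.6
    chebyshev = var / (k - mu) ** 2             # 2.88 / 3.2^2 = 0.28125
    delta = k / mu - 1                          # 2/3
    chernoff = exp(-delta**2 * mu / 3)          # ~0.4911
    exact = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
    print(markov, chebyshev, chernoff, exact)   # 0.6, 0.28125, ~0.4911, ~0.0573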
Chapter 6. Concentration Inequalities
6.3: Even More Inequalities
Slides (Google Drive) Video (YouTube)
In this section, we will talk about a potpourri of remaining concentration bounds. More specifically, the union bound, Jensen's inequality for convex functions, and Hoeffding's inequality.
The intuition for the union bound is fairly simple. Suppose we have two events A and B. Then P(A ∪ B) ≤ P(A) + P(B), since the event spaces of A and B may overlap:
The union bound, though seemingly trivial, can actually be quite useful.
Example(s)
This will relate to the earlier question of bounding the probability of at least one bad event happening.
Suppose the probability Alex is late to teaching class on a given day is at most 0.01. Bound
the probability that Alex is late at least once over a 30-class quarter. Do not make any independence
assumptions.
Solution
Let A_i be the event that Alex is late to class on day i, for i = 1, ..., 30. Then, by the union bound,

P(late at least once) = P(∪_{i=1}^{30} A_i)
                      ≤ Σ_{i=1}^{30} P(A_i)   [union bound]
                      ≤ Σ_{i=1}^{30} 0.01   [P(A_i) ≤ 0.01]
                      = 0.30
Sometimes it may be useless though; imagine I asked instead about a 200-day period. Then the union bound would've given me a bound of 2.0, which is not helpful since probabilities are at most 1 already...
Let’s look at some examples of convex (left) and non-convex (right) sets:
The sets on the left hand side are said to be convex because if you take any two points in the set and
draw the line segment between them, it is always contained in the set. The sets on the right hand side are
non-convex because I found two endpoints in the set, but the line segment connecting them is not completely
contained in the set.
How can we describe this mathematically? Well, for any two points x, y ∈ S, the set of points between them must be entirely contained in S. The set of points making up the line segment between two points x, y can be described as a weighted average (1 − p)x + py for p ∈ [0, 1]. If p = 0, we just get x; if p = 1, we just get y; and if p = 1/2, we get the midpoint (x + y)/2. So p controls the fraction of the way we are from x to y.
Equivalently, for any points x₁, ..., x_m ∈ S, the convex polyhedron formed by the "corners" is contained in S. (This sounds complicated, but if m = 3, it just says the triangle formed by the 3 corners completely lies in the set S. If m = 4, the quadrilateral formed by the 4 corners completely lies in the set S.) The points in the convex polyhedron are described by taking weighted averages of the points, where the weights are non-negative and sum to 1. (This should remind you of a probability distribution!)

{Σ_{i=1}^m p_i·x_i : p₁, ..., p_m ≥ 0 and Σ_{i=1}^m p_i = 1} ⊆ S
Now, onto convex functions. Let’s take a look at some convex (top) and non-convex (bottom) functions:
The functions on the top (convex) have the property that, for any two points on the function curve, the
line segment connecting them lies above the function always. The functions on the bottom don’t have this
property: you can see that some or all of the line segment is below the function.
Let’s try to formalize what this means. For the convex function g(t) = t2 below, we can see that any line
drawn connecting 2 points of the function clearly lies above the function itself and so it is convex. Look at
any two points on the curve g(x) and g(y). Pick a point on the x-axis between x and y, call it (1 p)x + py
where p 2 [0, 1]. The function value at this point is g((1 p)x + py). The corresponding point above it on
the line segment connecting g(x) and g(y) is actually the weighted average (1 p)g(x) + pg(y). Hence, a
function g is convex if it satisfies the following for any x, y and p 2 [0, 1]: g((1 p)x+py) (1 p)g(x)+pg(y)
Let S ⊆ Rⁿ be a convex set (a convex function must have a convex set as its domain). A function g : S → R is a convex function if for any line segment connecting g(x) and g(y), the function g lies entirely below the line. Mathematically, for any p ∈ [0, 1] and x, y ∈ S,

g((1 − p)x + py) ≤ (1 − p)g(x) + pg(y)

Equivalently, for any m points x₁, ..., x_m ∈ S, and p₁, ..., p_m ≥ 0 such that Σ_{i=1}^m p_i = 1,

g(Σ_{i=1}^m p_i·x_i) ≤ Σ_{i=1}^m p_i·g(x_i)
Proof of Jensen’s Inequality. We will only prove it in the case X is a discrete random variable (not a random
vector), and with finite range (not countably infinite). However, this inequality does hold for any random
variable.
The proof follows immediately from the definition of a convex function. Since X has finite range, let
⌦X = {x1 , ..., xn } and pX (xi ) = pi . By definition of a convex function (see above),
n
!
X
g(E [X]) = g pi x i [def of expectation]
i=1
n
X
pi g(xi ) [def of convex function]
i=1
= E [g(X)] [LOTUS]
Example(s)
Show that the variance of any random variable X is always non-negative using Jensen's inequality.

Solution We already know that Var(X) = E[(X − µ)²] ≥ 0 since (X − µ)² is a non-negative RV, but let's prove it a different way. The function g(t) = t² is convex, so by Jensen's inequality, E[X²] = E[g(X)] ≥ g(E[X]) = (E[X])². Hence Var(X) = E[X²] − (E[X])² ≥ 0.
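As a numerical sanity check of Jensen's inequality with g(t) = t², here is a minimal sketch (not from the book; it assumes numpy, and the Exponential choice is arbitrary - any rv and any convex g would do):

    import numpy as np

    rng = np.random.default_rng(seed=0)
    x = rng.exponential(scale=2.0, size=100_000)  # samples with E[X] = 2, E[X^2] = 8
    g = lambda t: t ** 2                          # a convex function
    print(g(x.mean()), g(x).mean())               # g(E[X]) <= E[g(X)]: ~4 vs ~8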
Theorem (Hoeffding's Inequality): Let X₁, ..., X_n be independent random variables, where each X_i is bounded: a_i ≤ X_i ≤ b_i, and let X̄_n be their sample mean. Then,

P(|X̄_n − E[X̄_n]| ≥ t) ≤ 2·exp(−2n²t²/Σ_{i=1}^n (b_i − a_i)²)

where exp(x) = e^x.
In the case that X₁, ..., X_n are iid (so a ≤ X_i ≤ b for all i) with mean µ, then

P(|X̄_n − µ| ≥ t) ≤ 2·exp(−2n²t²/(n(b − a)²)) = 2·exp(−2nt²/(b − a)²)
Example(s)
Suppose an email company ColdMail is responsible for delivering 100 emails per day. ColdMail has
a bad day if it takes longer than 190 seconds to deliver all 100 emails, and a bad week if there is
even one bad day in the week.
The time it takes to send an email is on average 1 second, with a worst-case time of 5 seconds, independently of other emails. (Note we don't know anything else, like its PDF.)
1. Give an upper bound for the probability that ColdMail has a bad day.
2. Give an upper bound for the probability that ColdMail has a bad week.
Solution
1. In this scenario, we may use Hoeffding's inequality, since we have X₁, ..., X₁₀₀, the (independent) times to send each email, bounded in the interval [0, 5] seconds, with E[X̄₁₀₀] = 1. Asking for the total time to be at least 190 seconds is the same as asking for the mean time to be at least 1.9 seconds. Like we did for Chebyshev, we have to massage (and weaken) the statement a little bit to get it in the form required for Hoeffding's:

P(X̄₁₀₀ ≥ 1.9) ≤ P(|X̄₁₀₀ − 1| ≥ 0.9) ≤ 2·exp(−2 · 100 · 0.9²/(5 − 0)²) = 2·exp(−6.48) ≈ 0.00307

2. A bad week means at least one bad day, so by the union bound over the 7 days of the week:

P(bad week) ≤ 7 · 0.00307 ≈ 0.0215
You might be tempted to use the CLT (and you should when you can), as it would probably give a better bound than Hoeffding's. But we don't know the variances here, so we wouldn't know which Normal to use. Hoeffding's gives us a way!
Chapter 7. Statistical Estimation
Now we’ve hit a real turning point in the course. What we’ve been doing so far is “probability”, and
the remaining two chapters of the course will be about “statistics”. In the real world, we’re often not given
the true probability of heads p, or average rate of babies being born per minute . In today’s world, data
is being collected faster than ever! How can we use data to estimate these quantities of interest? We’ll
start with more mundane examples, such as: If I flip a coin (with unknown probability of heads) ten times
independently and I observe seven heads, why is 7/10 the “best” estimate for the probability of heads? We’ll
learn several techniques for estimating quantities, and talk about several properties that allow us to compare
them for “goodness”.
Chapter 7. Statistical Estimation
7.1: Maximum Likelihood Estimation
Slides (Google Drive) Video (YouTube)
What we’re going to focus now is going the opposite way. Given a coin with unknown probability of heads
is, I flip it a few times and I get THHTHH. How can I use this data to predict/estimate this value of p?
7.1.2 Likelihood
Let’s say I give you and your classmates each 5 minutes with a coin with unknown probability of heads p.
Whoever has the closest estimate will get an A+ in the class. What do you do in your precious 5 minutes,
and what do you give as your estimate?
I don’t know about you, but I would flip the coin as many times as I can, and return the total number of
heads over the total number of flips, or
Heads
Heads + Tails
which actually turns out to be a really good estimate.
To make things concrete, let's say you saw 4 heads and 1 tail. You tell me that p̂ = 4/5 (the hat above the p just means it is an estimate). How can you argue, objectively, that this is the "best" estimate?
Is there some objective function that it maximizes? It turns out yes: 4/5 maximizes this blue curve, which is called the likelihood of the data. The x-axis has the different possible values of p, and the y-axis has the probability of seeing the data if the coin had probability of heads p.
You assume a model (Bernoulli in our case) with unknown parameter θ (the probability of heads), and receive iid samples x = (x₁, ..., x_n) ∼ Ber(θ) (in this example, each x_i is either 1 or 0). The likelihood of the data given a parameter θ is defined as the probability of seeing the data, given θ.
A realization/sample x of a random variable X is the value that is actually observed (it will always be in Ω_X).
For example, for a Bernoulli, a realization is either 0 or 1, and for a Geometric, it is some positive integer ≥ 1.
Let x = (x₁, ..., x_n) be iid samples from probability mass function p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define the likelihood of x given θ to be the "probability" of observing x if the true parameter is θ.
If X is discrete,

L(x | θ) = ∏_{i=1}^n p_X(x_i | θ)

If X is continuous,

L(x | θ) = ∏_{i=1}^n f_X(x_i | θ)
In the continuous case, we have to multiply densities, because the probability of seeing any particular value of a continuous random variable is always 0. We can do this because the density preserves relative probabilities; i.e., P(X ≈ u)/P(X ≈ v) ≈ f_X(u)/f_X(v). For example, if X ∼ N(µ = 3, σ² = 5), the realization x = 503.22 has much lower density/likelihood than x = 3.12.
Example(s)
Give the likelihoods for each of the samples, and take a guess at which value of θ maximizes the likelihood!
1. Suppose x = (x₁, x₂, x₃) = (1, 0, 1) are iid samples from Ber(θ) (recall θ is the probability of a success).
2. Suppose x = (x₁, x₂, x₃, x₄) = (3, 0, 2, 7) are iid samples from Poi(θ) (recall θ is the historical average number of events in a unit of time).
3. Suppose x = (x₁, x₂, x₃) = (3.22, 1.81, 2.47) are iid samples from Exp(θ) (recall θ is the historical average number of events in a unit of time).
Solution
1. The samples mean we got a success, then a failure, then a success. The likelihood is the "probability" of observing the data.

L(x | θ) = ∏_{i=1}^3 p_X(x_i | θ) = p_X(1 | θ) · p_X(0 | θ) · p_X(1 | θ) = θ(1 − θ)θ = θ²(1 − θ)

Since we observed two successes out of three trials, my guess for the maximum likelihood estimate would be θ̂ = 2/3.
2. The samples mean we observed 3 events in the first unit of time, then 0 in the second, then 2 in the third, then 7 in the fourth. The likelihood is the "probability" of observing the data (just multiplying Poisson PMFs p_X(k | λ) = e^{−λ}·λ^k/k!).

L(x | θ) = ∏_{i=1}^4 p_X(x_i | θ) = p_X(3 | θ) · p_X(0 | θ) · p_X(2 | θ) · p_X(7 | θ)
         = (e^{−θ}θ³/3!) · (e^{−θ}θ⁰/0!) · (e^{−θ}θ²/2!) · (e^{−θ}θ⁷/7!)

Since there were a total of 3 + 0 + 2 + 7 = 12 events over 4 units of time (samples), my guess for the maximum likelihood estimate would be θ̂ = 12/4 = 3 events per unit time.
3. The samples mean we waited until three events happened, and it took 3.22 units of time until the first event, 1.81 until the second, and 2.47 until the third. The likelihood is the "probability" of observing the data (just multiplying Exponential PDFs f_X(y | λ) = λe^{−λy}).

L(x | θ) = ∏_{i=1}^3 f_X(x_i | θ) = f_X(x₁ | θ) · f_X(x₂ | θ) · f_X(x₃ | θ) = θe^{−3.22θ} · θe^{−1.81θ} · θe^{−2.47θ}
In the previous three scenarios, we set up the likelihood of the data. Now, the only thing left to do is find out which value of θ maximizes the likelihood. Everything else in this section is just explaining how to use calculus to optimize this likelihood! There is no more "probability" or "statistics" involved in the remaining pages.
Before we move on, we have to go back and review calculus really quickly. How do we optimize a function? Each of these three points is a local optimum; what do they have in common? Their derivative is 0. We're going to try to set the derivative of our likelihood to 0, so we can solve for the optimal value.
Example(s)
Suppose x = (x₁, x₂, x₃, x₄, x₅) = (1, 1, 1, 1, 0) are iid samples from the Ber(θ) distribution with unknown parameter θ. Find the maximum likelihood estimator θ̂ of θ.
Solution The data (1, 1, 1, 1, 0) can be thought of as the sequence HHHHT, which has likelihood (assuming independent flips):

L(HHHHT | θ) = θ⁴(1 − θ) = θ⁴ − θ⁵
The plot of the likelihood, with θ on the x-axis and L(HHHHT | θ) on the y-axis, is (copied from above): we can actually see that the θ which maximizes the likelihood is θ̂ = 4/5. But sometimes we can't plot the likelihood, so we will now solve for this analytically.
We want to find the θ which maximizes this likelihood, so we take the derivative with respect to θ and set it to 0:

∂/∂θ L(x | θ) = 4θ³ − 5θ⁴ = θ³(4 − 5θ)

Now, when we set the derivative to 0 (remember the optimum points occur when the derivative is 0), we replace θ with θ̂ because we are now estimating θ. After solving for θ̂, we end up with

θ̂³(4 − 5θ̂) = 0 → θ̂ = 4/5 or 0
We switch θ to θ̂ when we set the derivative to 0, as that is when we start estimating. To see which is the maximizer, you can just plug in the candidates (0 and 4/5) and the endpoints (0 and 1: the min and max possible values of θ)! That is, compute the likelihood at 0, 4/5, and 1, and see which is largest.
To summarize, we defined θ̂_MLE = arg max_θ L(x | θ), the argument (input) θ that maximizes the likelihood function. The difference between max and argmax is as follows. Here is a function,

f(x) = 1 − x²

where the maximum value is 1; it's the highest value this function could ever achieve. The argmax, on the other hand, is 0, because argmax just means the argument (input) that maximizes the function. So which x actually achieved f(x) = 1? Well, that was x = 0. And so, in MLE, we're trying to find the θ that maximizes the likelihood, and we don't care what the maximum value of the likelihood is. We didn't even compute it! We just care that the argmax is 4/5.
Let x = (x₁, ..., x_n) be iid realizations from probability mass function p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define the maximum likelihood estimator θ̂_MLE of θ to be the parameter which maximizes the likelihood (or equivalently, the log-likelihood) of the data.
Taking the log of a product (such as the likelihood) results in the sum of logs because of log properties:
log(a · b · c) = log(a) + log(b) + log(c)
We see now why we might want to take the log of the likelihood before differentiating it, but why can we? Below there are two images: the left image is a function, and the right image is the log of that function. The values are different (see the y-axis), but if you look at the x-axis, both functions happen to be maximized at 1 (the argmaxes are the same). Log is a monotone increasing function, so it preserves order; whatever was the maximizer (argmax) of the original function will also be the maximizer of the log function. See below for what happens when you apply the natural log (ln) to a product in our likelihood scenario, and see the next section 7.2 for examples of maximum likelihood estimation in action.
Let x = (x₁, ..., x_n) be iid realizations from probability mass function p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define the log-likelihood of x given θ to be the log of the likelihood.
If X is discrete,

ln L(x | θ) = Σ_{i=1}^n ln p_X(x_i | θ)

If X is continuous,

ln L(x | θ) = Σ_{i=1}^n ln f_X(x_i | θ)
Chapter 7. Statistical Estimation
7.2: Maximum Likelihood Examples
Slides (Google Drive) Video (YouTube)
We spend an entire section just doing examples because maximum likelihood is such a fundamental concept used everywhere (especially in machine learning). I promise that the idea is simple: find the θ that maximizes the likelihood of the data. The computation and notation can be confusing at first though.
Let’s say x1 , x2 , ..., xn are iid samples from Poi(✓). (These values might look like x1 = 13, x2 = 5, x3 =
6, etc...) What is the MLE of ✓?
Solution Remember that we discussed that the sample mean might be a good estimate of ✓. If we observed
20 events over 5 units of time, a good estimate for , the average number of events per unit of time, would
be 20
5 = 4. This turns out to be the maximum likelihood estimate!
Let’s follow the recipe provided in 7.1.
1. Compute the likelihood and log-likelihood of the data. To do this, we take the following product of the Poisson PMFs at each sample x_i, over all the data points:

L(x | θ) = ∏_{i=1}^n p_X(x_i | θ) = ∏_{i=1}^n e^{−θ}·θ^{x_i}/x_i!
Again, this is the probability of seeing x₁, then x₂, and so on. This function is pretty hard to differentiate, so to make it easier, let's compute the log-likelihood instead, using log properties. In most cases, we'll want to optimize the log-likelihood instead of the likelihood (since we don't want to use the product rule of calculus)!
ln L(x | θ) = ln ∏_{i=1}^n e^{−θ}·θ^{x_i}/x_i!   [def of likelihood]
            = Σ_{i=1}^n ln(e^{−θ}·θ^{x_i}/x_i!)   [log of product is sum of logs]
            = Σ_{i=1}^n [ln(e^{−θ}) + ln(θ^{x_i}) − ln(x_i!)]   [log of product is sum of logs]
            = Σ_{i=1}^n [−θ + x_i·ln(θ) − ln(x_i!)]   [other log properties]
2. Take the partial derivative(s) with respect to θ and set to 0. Solve the equation(s).
Now we want to take the derivative of the log-likelihood with respect to θ. The derivative of −θ is just −1, and the derivative of x_i·ln(θ) is just x_i/θ, because remember x_i is a constant with respect to θ.

∂/∂θ ln L(x | θ) = Σ_{i=1}^n [−1 + x_i/θ]

Setting this to 0 (replacing θ with θ̂):

Σ_{i=1}^n [−1 + x_i/θ̂] = 0 → −n + (1/θ̂)·Σ_{i=1}^n x_i = 0 → θ̂ = (1/n)·Σ_{i=1}^n x_i
3. Optionally, verify θ̂_MLE is indeed a (local) maximizer by checking that the second derivative at θ̂_MLE is negative (if θ is a single parameter), or that the Hessian (matrix of second partial derivatives) is negative semi-definite (if θ is a vector of parameters).
We want to take the second derivative as well, because otherwise we don't know if this is a maximum or a minimum. We differentiate the first derivative Σ_{i=1}^n [−1 + x_i/θ] again with respect to θ, and we notice that because θ² is always positive, the negative of x_i/θ² is always negative, so the second derivative is always less than 0, which means the function is concave down everywhere. This means that anywhere the derivative is zero is a global maximum, so we've successfully found the global maximum of our likelihood equation.

∂²/∂θ² ln L(x | θ) = Σ_{i=1}^n [−x_i/θ²] < 0 → concave down everywhere
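To see the calculus and the intuition agree, here is a minimal sketch in Python (not from the book; the sample values are made up for illustration, and math.lgamma(x + 1) = ln(x!)):

    import math

    xs = [13, 5, 6, 8, 2]   # hypothetical Poisson samples

    def log_likelihood(theta):
        # sum of ln(e^{-theta} theta^x / x!)
        return sum(-theta + x * math.log(theta) - math.lgamma(x + 1) for x in xs)

    thetas = [i / 100 for i in range(1, 2001)]   # grid search over (0, 20]
    theta_hat = max(thetas, key=log_likelihood)
    print(theta_hat, sum(xs) / len(xs))          # both 6.8: the sample mean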
Let’s say x1 , x2 , ..., xn are iid samples from Exp(✓). (These values might look like x1 = 1.354, x2 =
3.198, x3 = 4.312, etc...) What is the MLE of ✓?
Solution Now that we’ve seen one example, we’ll just follow the procedure given in the previous section.
1. Compute the likelihood and log-likelihood of the data.
Since we have a continuous distribution, our likelihood is the product of the PDFs:

L(x | θ) = ∏_{i=1}^n f_X(x_i | θ) = ∏_{i=1}^n θe^{−θx_i}

The log-likelihood is

ln L(x | θ) = Σ_{i=1}^n ln(θe^{−θx_i}) = Σ_{i=1}^n [ln(θ) − θx_i]
2. Take the partial derivative(s) with respect to θ and set to 0. Solve the equation(s).

∂/∂θ ln L(x | θ) = Σ_{i=1}^n [1/θ − x_i]

Now, we set the derivative to 0 and solve (here we replace θ with θ̂):

Σ_{i=1}^n [1/θ̂ − x_i] = 0 → n/θ̂ − Σ_{i=1}^n x_i = 0 → θ̂ = n/Σ_{i=1}^n x_i

This is just the inverse of the sample mean! This makes sense because if the average waiting time was 1/2 an hour, then the average rate per unit of time should be 1/(1/2) = 2 per hour!
3. Optionally, verify θ̂_MLE is indeed a (local) maximizer by checking that the second derivative at θ̂_MLE is negative (if θ is a single parameter), or that the Hessian (matrix of second partial derivatives) is negative semi-definite (if θ is a vector of parameters). The second derivative of the log-likelihood just requires us to take one more derivative:

∂²/∂θ² ln L(x | θ) = Σ_{i=1}^n [−1/θ²] < 0
Since the second derivative is negative everywhere, the function is concave down, and any critical point
is a global maximum!
Let’s say x1 , x2 , ..., xn are iid samples from (continuous) Unif(0, ✓). (These values might look like
x1 = 2.325, x2 = 1.1242, x3 = 9.262, etc...) What is the MLE of ✓?
Solution It turns out our usual procedure won’t work on this example, unfortunately. We’ll explain why
once we run into the problem!
To compute the likelihood, we first need the individual density functions. Recall

f_X(x | θ) = 1/θ if 0 ≤ x ≤ θ, and 0 otherwise

Let's actually define an indicator function for whether or not some boolean condition A is true or false:

I_A = 1 if A is true, and 0 if A is false

This way, we can rewrite the uniform density in one line (1/θ for 0 ≤ x ≤ θ and 0 otherwise):

f_X(x | θ) = (1/θ)·I_{0≤x≤θ}
First, we take the product over all data points of the density at that data point, and plug in the density of the uniform distribution. How do we simplify this? First, we notice that every term in the product contains a 1/θ, so multiplying it by itself n times gives 1/θⁿ. How do we multiply indicators? For a product of 1's and 0's to be 1, they ALL have to be 1. So,

L(x | θ) = ∏_{i=1}^n (1/θ)·I_{0≤x_i≤θ} = (1/θⁿ)·I_{0≤x₁,...,x_n≤θ}

We could take the log-likelihood before differentiating, but this function isn't too bad-looking, so let's take the derivative of it directly. The indicator I_{0≤x₁,...,x_n≤θ} just says the function is 1/θⁿ when the condition is true and 0 otherwise. So our derivative will just be the derivative of 1/θⁿ when that condition is true, and 0 otherwise.

d/dθ L(x | θ) = (−n/θ^{n+1})·I_{0≤x₁,...,x_n≤θ}

−n/θ^{n+1} = 0 → θ = ???
There seems to be no value of θ that solves this - what's going on? Let's plot the likelihood. First, we plot just 1/θⁿ (not quite the likelihood) with θ on the x-axis: if we wanted to maximize this function alone, we should choose θ as close to 0 as possible. But remember that the likelihood was (1/θⁿ)·I_{0≤x₁,...,x_n≤θ}, which can also be written as (1/θⁿ)·I_{x_max≤θ}, because all the samples are ≤ θ if and only if the maximum is. Below is the graph of the actual likelihood:
Notice that multiplying by the indicator function kept the function as-is when the condition x_max ≤ θ was true, but zeroed it out otherwise. So now we can see that our maximum likelihood estimator should be θ̂_MLE = x_max = max{x₁, x₂, ..., x_n}, since that is where the likelihood achieves its highest value.
Why? Remember x₁, ..., x_n ∼ Unif(0, θ), so θ has to be at least as large as the biggest x_i, because otherwise it would have been impossible for that uniform to produce the largest x_i. For example, if our samples were x₁ = 2.53, x₂ = 8.55, x₃ = 4.12, our θ has to be at least 8.55 (the maximum sample), because if it were 7, for example, then Unif(0, 7) could not possibly generate the sample 8.55.
So our likelihood (remember, 1/θⁿ) would have preferred as small a θ as possible, but subject to θ ≥ x_max. Therefore the "compromise" was reached by making them equal!
I’d like to point out this is a special case because the range of the uniform distribution depends on its
parameter(s) a, b (the range of Unif(a, b) is [a, b]). On the other hand, most of our distributions like Poisson
or Exponential have the same range no matter what value the value of their parameters. For example, the
range of Poi( ) is always {0, 1, 2, . . . } and the range of Exp( ) is always [0, 1), independent of .
Therefore, most MLE problems will be similar to the first two examples rather than this complicated one!
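Here is a minimal sketch of this uniform likelihood in Python (not from the book; it uses the three samples from the discussion above):

    xs = [2.53, 8.55, 4.12]

    def likelihood(theta):
        # (1/theta^n) * indicator(all samples <= theta)
        return theta ** (-len(xs)) if all(x <= theta for x in xs) else 0.0

    for theta in [7.0, 8.55, 10.0]:
        print(theta, likelihood(theta))
    # theta = 7.0 gives likelihood 0 (it couldn't have generated 8.55);
    # theta = 8.55 = max(xs) beats any larger theta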
Chapter 7. Statistical Estimation
7.3: Method of Moments Estimation
Slides (Google Drive) Video (YouTube)
Usually, we are interested in the first moment of X, µ = E[X], and the second moment of X about µ, Var(X) = E[(X − µ)²].
Now since we are in the statistics portion of the class, we will define a sample moment.
Let X be a random variable, and c ∈ R a scalar. Let x₁, ..., x_n be iid realizations (samples) from X. The kth sample moment of X is

(1/n)·Σ_{i=1}^n x_i^k

For example, the first sample moment is just the sample mean, and the second sample moment about the sample mean is the sample variance.
Suppose we only need to estimate one parameter θ (you might have to estimate two, for example θ = (µ, σ²) for the N(µ, σ²) distribution). The idea behind Method of Moments (MoM) estimation is that, to find a good estimator, we should have the true and sample moments match as well as possible. That is, I should choose the parameter θ such that the first true moment E[X] is equal to the first sample moment x̄. Examples always make things clearer!
Example(s)
Let's say x₁, x₂, ..., x_n are iid samples from X ∼ Unif(0, θ) (continuous). (These values might look like x₁ = 3.21, x₂ = 5.11, x₃ = 4.33, etc.) What is the MoM estimator of θ?
Solution We set the first true moment equal to the first sample moment (recall that E[Unif(a, b)] = (a + b)/2):

E[X] = θ/2 = (1/n)·Σ_{i=1}^n x_i

θ̂_MoM = (2/n)·Σ_{i=1}^n x_i
This estimator makes sense intuitively once you think about it for a bit: if we take the sample mean of a bunch of Unif(0, θ) rvs, we expect to get close to the true mean: (1/n)·Σ_{i=1}^n x_i → θ/2 (by the Law of Large Numbers). Hence, a good estimator for θ would just be twice the sample mean!
Notice that in this case, the MoM estimator disagrees with the MLE we derived in 7.2!

(2/n)·Σ_{i=1}^n x_i = θ̂_MoM ≠ θ̂_MLE = x_max
What if you had two parameters instead of just one? Well, then you would set the first true moment equal
to the first sample moment (as we just did), but also the second true moment equal to the second sample
moment! We’ll see an example of this below. But basically, if we have k parameters to estimate, we need k
equations to solve for these k unknowns!
Let x = (x₁, ..., x_n) be iid realizations (samples) from probability mass function p_X(t; θ) (if X is discrete), or from density f_X(t; θ) (if X is continuous), where θ is a parameter (or vector of k parameters). The method of moments estimator θ̂_MoM of θ is obtained by setting the first k true moments equal to the first k sample moments and solving the resulting system of equations:

E[X^j] = (1/n)·Σ_{i=1}^n x_i^j,  for j = 1, 2, ..., k
Example(s)
Let's say x₁, x₂, ..., x_n are iid samples from X ∼ Exp(θ). (These values might look like x₁ = 3.21, x₂ = 5.11, x₃ = 4.33, etc.) What is the MoM estimator of θ?
Solution We have k = 1 (since there is only one parameter). We set the first true moment equal to the first sample moment (recall that E[Exp(λ)] = 1/λ):

E[X] = 1/θ = (1/n)·Σ_{i=1}^n x_i

θ̂_MoM = 1/((1/n)·Σ_{i=1}^n x_i)

Notice that in this case, the MoM estimator agrees with the MLE (Maximum Likelihood Estimator), hooray!

θ̂_MoM = θ̂_MLE = 1/((1/n)·Σ_{i=1}^n x_i)
Example(s)
Let's say x₁, x₂, ..., x_n are iid samples from X ∼ Poi(θ). (These values might look like x₁ = 13, x₂ = 5, x₃ = 4, etc.) What is the MoM estimator of θ?

Solution We have k = 1 (since there is only one parameter). We set the first true moment equal to the first sample moment (recall that E[Poi(λ)] = λ):

E[X] = θ = (1/n)·Σ_{i=1}^n x_i

θ̂_MoM = (1/n)·Σ_{i=1}^n x_i
In this case, again, the MoM estimator agrees with the MLE! Again, much easier than MLE :).
Now, we’ll do an example where there is more than one parameter.
Example(s)
Let's say x₁, x₂, ..., x_n are iid samples from X ∼ N(θ₁, θ₂). (These values might look like x₁ = 2.321, x₂ = 1.112, x₃ = 5.221, etc.) What is the MoM estimator of the vector θ = (θ₁, θ₂) (θ₁ is the mean, and θ₂ is the variance)?
Solution We have k = 2 (since now we have two parameters, θ₁ = µ and θ₂ = σ²). Notice Var(X) = E[X²] − E[X]², so rearranging, we get E[X²] = Var(X) + E[X]². Let's solve for θ₁ first.
Again, we set the first true moment equal to the first sample moment:

E[X] = θ₁ = (1/n)·Σ_{i=1}^n x_i

θ̂₁ = (1/n)·Σ_{i=1}^n x_i
Now let's use our result for θ̂₁ to solve for θ̂₂ (recall that E[X²] = Var(X) + E[X]² = θ₂ + θ₁²):

E[X²] = θ₂ + θ₁² = (1/n)·Σ_{i=1}^n x_i²

θ̂₂ = (1/n)·Σ_{i=1}^n x_i² − ((1/n)·Σ_{i=1}^n x_i)²
If you were to use maximum likelihood to estimate the mean and variance of a Normal distribution, you
would get the same result!
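As a quick sanity check of these two formulas, here is a minimal sketch in Python (not from the book; it assumes numpy, and the true parameters are made up):

    import numpy as np

    rng = np.random.default_rng(seed=42)
    xs = rng.normal(loc=2.0, scale=3.0, size=100_000)   # theta1 = 2, theta2 = 9

    theta1_hat = np.mean(xs)                        # matching the first moment
    theta2_hat = np.mean(xs**2) - theta1_hat**2     # matching the second moment
    print(theta1_hat, theta2_hat)                   # close to 2 and 9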
Chapter 7. Statistical Estimation
7.4: The Beta and Dirichlet Distributions
Slides (Google Drive) Video (YouTube)
We’ll take a quick break after learning two ways (MLE and MoM) to estimate unknown parameters! In the
next section, we’ll learn yet another approach. But that approach requires us to learn at least one other
distribution, the Beta distribution, which will be the focus of this section.
Suppose you want to model your belief on the unknown probability X of heads. You could assign, for
example, a probability distribution as follows:
This figure below shows that you believe that X = P (head) is most likely to be 0.5, somewhat likely to be
0.8, and least likely to be 0.37. That is, X is a discrete random variable with range ⌦X = {0.37, 0.5, 0.8}
and pX (0.37) + pX (0.5) + pX (0.8) = 1. This is a probability distribution on a probability of heads!
Now what if we want P(head) to be open to any value in [0, 1] (which we should want; having it be just one of three values is arbitrary and unrealistic)? The answer is that we need a continuous random variable (with range [0, 1], because probabilities can be any number within this range)! Let's try to see how we might define a new distribution which does a good job modelling this belief. Let's see which of the following shapes might be appropriate (or not).
Example(s)
Suppose you flipped the coin n times and observed k heads. Which of the above density functions
have a “shape” which would be reasonable to model your belief?
It’s important to note that Distributions 2 and 4 are invalid, because there is no possible sequence of flips
that could result in the belief that is ”bi-modal” (have two peaks in the graph of the distribution). Your
belief should have a single peak at your highest belief, and go down on both sides from there.
For instance, if you believe that the probability of (getting heads) is most likely around 0.25, we have Distri-
bution 1 in the figure above. Similarly, if you think that it’s most likely around 0.85, we have Distribution
3. Or, more interestingly, if you have NO idea what the probability might be and you want to make every
probability equally likely, you could use a Uniform distribution like in Distribution 5.
Example(s)
If you flip a coin with unknown probability of heads X, what does your belief distribution look like
if:
• You didn’t observe anything?
• You observed 8 heads and 2 tails?
• You observed 80 heads and 20 tails?
• You observed 2 heads and 3 tails?
Match the four distributions below to the four scenarios above. Note the vertical bar in each distribution represents where the mode (the point with highest density) is, as that's probably what we want to estimate as our probability of heads!
Solution
Explanation: Since we haven't observed anything yet, we shouldn't have a preference for any particular value. This is encoded as a continuous Unif(0, 1) distribution.
There is a continuous distribution/rv with range [0, 1] that parametrizes probability distributions over a probability just like this, based on two parameters α and β, which allow you to account for how many heads and tails you've seen!
X ∼ Beta(α, β) if and only if X has the following density function (and range Ω_X = [0, 1]):

f_X(x) = (1/B(α, β))·x^{α−1}·(1 − x)^{β−1} if 0 ≤ x ≤ 1, and 0 otherwise

X is typically the belief distribution about some unknown probability of success, where we pretend we've seen α − 1 successes and β − 1 failures. Hence the mode (the most likely value of the probability / the point with highest density), arg max_{x∈[0,1]} f_X(x), is

mode[X] = (α − 1)/((α − 1) + (β − 1))
If you flip a coin with unknown probability of heads X, identify the parameters of the most appropriate Beta distribution to model your belief:
• You didn't observe anything?
• You observed 8 heads and 2 tails?
• You observed 80 heads and 20 tails?
• You observed 2 heads and 3 tails?
Solution
• You didn't observe anything? Beta(0 + 1, 0 + 1) ≡ Beta(1, 1), which is just Unif(0, 1).
• You observed 8 heads and 2 tails? Beta(8 + 1, 2 + 1) ≡ Beta(9, 3) → mode = (9 − 1)/((9 − 1) + (3 − 1)) = 8/10
• You observed 80 heads and 20 tails? Beta(80 + 1, 20 + 1) ≡ Beta(81, 21) → mode = (81 − 1)/((81 − 1) + (21 − 1)) = 80/100
• You observed 2 heads and 3 tails? Beta(2 + 1, 3 + 1) ≡ Beta(3, 4) → mode = (3 − 1)/((3 − 1) + (4 − 1)) = 2/5
X = (X₁, ..., X_r) ∼ Dir(α₁, ..., α_r) if and only if X has the following density function:

f_X(x) = (1/B(α))·∏_{i=1}^r x_i^{α_i − 1} if each x_i ∈ (0, 1) and Σ_{i=1}^r x_i = 1, and 0 otherwise

This is a generalization of the Beta random variable from 2 outcomes to r. The random vector X is typically the belief distribution about the unknown probabilities of the different outcomes, where we pretend we saw α₁ − 1 outcomes of type 1, α₂ − 1 outcomes of type 2, ..., and α_r − 1 outcomes of type r. Hence, the mode of the distribution, arg max f_X(x) (over x ∈ [0, 1]^r with Σ x_i = 1), is the vector

mode[X] = ((α₁ − 1)/Σ_{i=1}^r (α_i − 1), (α₂ − 1)/Σ_{i=1}^r (α_i − 1), ..., (α_r − 1)/Σ_{i=1}^r (α_i − 1))
We’ve seen two ways now to estimate unknown parameters of a distribution. Maximum likelihood estimation
(MLE) says that we should find the parameter ✓ that maximizes the likelihood (“probability”) of seeing the
data, whereas the method of moments (MoM) says that we should match as many moments as possible
(mean, variance, etc.). Now, we learn yet another (and final) technique for estimation that will cover (there
are many more...).
7.5.1.1 Intuition
In Maximum Likelihood Estimation (MLE), we used iid samples x = (x₁, ..., x_n) from some distribution with unknown parameter(s) θ in order to estimate θ:

θ̂_MLE = arg max_θ L(x | θ) = arg max_θ ∏_{i=1}^n f_X(x_i | θ)
Note: Recall the English description of how we found θ̂_MLE: we computed the likelihood, which is the probability of seeing the data given the parameter θ, and we chose the "best" θ, the one that maximized this likelihood.
You might have been thinking: shouldn't we be trying to maximize "P(θ | x)" instead? Well, this doesn't make sense unless Θ is a random variable! And this is where Maximum A Posteriori (MAP) Estimation comes in.
So far, for MLE and MoM estimation, we assumed θ was fixed but unknown. This is called the Frequentist framework, where we estimate our parameter based on data alone, and θ is not a random variable. Now, we move to the Bayesian framework, meaning that our unknown parameter is a random variable Θ. This means we will have some belief distribution π_Θ(θ) (think of this as a density function over all possible values of the parameter), and after observing data x, we will have a new/updated belief distribution π_Θ(θ | x). Let's see a picture of what MAP is going to do first, before getting more into the math and formalism.
Example(s)
We'll see the idea of MAP being applied to our typical coin example. Suppose we are trying to estimate the unknown parameter for the probability of heads on a coin: that is, θ in Ber(θ). We are going to treat the parameter as a random variable (before, in MLE/MoM, we treated it as a fixed unknown quantity), so we'll call it Θ (a capitalized θ).
1. We must have a prior belief distribution π_Θ(θ) over possible values that Θ could take on.
The range of Θ in our case is Ω_Θ = [0, 1], because the probability of heads must be in this interval. Hence, when we plot the density function of Θ, the x-axis will range from 0 to 1. On a piece of paper, please sketch a density function that you might have for this probability of heads without yet seeing any data (coin flips). There are two reasonable shapes for this PDF:
• The Unif(0, 1) = Beta(1, 1) distribution (left picture below).
• Some Beta distribution where α = β, since most coins in this world are fair. Let's say Beta(11, 11), meaning we pretend we've seen 10 heads and 10 tails (right picture below).
2. Then, we observe our iid samples x = (x₁, ..., x_n).
Again, for the Bernoulli distribution, these will be a sequence of n 1's and 0's representing heads or tails. Suppose we observed n = 30 samples, in which Σ_{i=1}^n x_i = 25 were heads and n − Σ_{i=1}^n x_i = 5 were tails.
3. We will combine our prior knowledge and the data to create a posterior belief distribution π_Θ(θ | x).
Sketch two density functions for this posterior: one using the Beta(1, 1) prior above, and one using the Beta(11, 11) prior above. We'll compare these.
• If our prior distribution was Θ ∼ Beta(1, 1) (meaning we pretend we didn't see anything yet), then our posterior distribution should be Θ | x ∼ Beta(26, 6) (meaning we saw 25 heads and 5 tails total).
• If our prior distribution was Θ ∼ Beta(11, 11) (meaning we pretend we saw 10 heads and 10 tails beforehand), then our posterior distribution should be Θ | x ∼ Beta(36, 16) (meaning we saw 35 heads and 15 tails total).
4. We’ll give our MAP estimate as the mode of this posterior distribution. Hence,
the name “Maximum a Posteriori”.
• If we used the ⇥ ⇠ Beta(1, 1) prior, we ended up with the ⇥ | x ⇠ Beta(26, 6) posterior,
and our MAP estimate is defined to be the mode of the distribution, which occurs at
✓ˆM AP = 25
30 ⇡ 0.833 (left picture above). You may notice that this would give the same as
the MLE: we’ll examine this more later!
• If we used the ⇥ ⇠ Beta(11, 11) prior, we ended up with the ⇥ | x ⇠ Beta(36, 16) posterior,
our MAP estimate is defined to be the mode of the distribution, which occurs at ✓ˆM AP =
35
50 = 0.70 (right picture above).
Hopefully you now see the process and idea behind MAP: we have a prior belief on our unknown parameter, and after observing data, we update our belief distribution and take the mode (the most likely value)! Our estimate definitely depends on the prior distribution we choose (which is often arbitrary); the short sketch below makes this concrete.
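Here is a minimal sketch in Python of this update (not from the book; it's just arithmetic on the Beta parameters, nothing outside the standard library):

    heads, tails = 25, 5                       # the observed data from the example
    for alpha, beta in [(1, 1), (11, 11)]:     # the two priors considered above
        post_a, post_b = alpha + heads, beta + tails              # posterior Beta parameters
        theta_map = (post_a - 1) / ((post_a - 1) + (post_b - 1))  # posterior mode
        print(alpha, beta, "->", post_a, post_b, "MAP =", round(theta_map, 3))
    # Beta(1,1)   -> Beta(26,6),  MAP = 0.833
    # Beta(11,11) -> Beta(36,16), MAP = 0.7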
7.5.1.2 Derivation
We chose a Beta prior and ended up with a Beta posterior, which made sense intuitively given our definition of the Beta distribution. But how do we prove this? We'll see the math behind MAP now (it's quite short), and then see the same example again, this time mathematically rigorous.
MAP Idea: Actually, the unknown parameter(s) is a random variable Θ. We have a prior distribution (our belief on Θ before seeing data) π_Θ(θ) and a posterior distribution (our updated belief on Θ after observing some data x) π_Θ(θ | x).
By Bayes’ Theorem,
Recall that ⇡⇥ is just a PDF or PMF over possible values of ⇥. In other words, now we are maximizing
the posterior distribution ⇡⇥ (✓ | x), where ⇥ has a PMF/PDF. That is, we are finding the mode of the
density/mass function. Note that since the denominator P (x) in the expression above does not depend on
✓, we can just maximize the numerator L(x | ✓)⇡⇥ (✓)! Therefore:
Let x = (x₁, ..., x_n) be iid realizations from probability mass function p_X(t; Θ = θ) (if X is discrete), or from density f_X(t; Θ = θ) (if X is continuous), where Θ is the random variable representing the parameter (or a vector of parameters). We define the Maximum A Posteriori (MAP) estimator θ̂_MAP of Θ to be the parameter which maximizes the posterior distribution of Θ given the data:

θ̂_MAP = arg max_θ π_Θ(θ | x) = arg max_θ L(x | θ)·π_Θ(θ)

That is, it's exactly the same as maximum likelihood, except instead of just maximizing the likelihood, we maximize the likelihood multiplied by the prior!
Now we’ll see a similar coin-flipping example, but deriving the MAP estimate mathematically and building
even more intuition. I encourage you to try each part out before reading the answers!
7.5.1.3 Example
Example(s)
(a) Suppose our samples are x = (0, 0, 1, 1, 0), from Ber(θ), where θ is unknown. Assume θ is unrestricted; that is, θ ∈ (0, 1). What is the MLE for θ?
(b) Suppose we impose the restriction that θ ∈ {0.2, 0.5, 0.7}. What is the MLE for θ?
(c) Assume Θ is restricted as in part (b) (but is now a random variable for MAP). Suppose we have a (discrete) prior π_Θ(0.2) = 0.1, π_Θ(0.5) = 0.01, and π_Θ(0.7) = 0.89. What is the MAP for θ?
(d) Show that we can make the MAP whatever we like, by finding a prior over {0.2, 0.5, 0.7} so that the MAP is 0.2, another so that it is 0.5, and another so that it is 0.7.
(e) Typically, for the Bernoulli/Binomial distribution, if we use MAP, we want to be able to get any value in (0, 1), not just ones in a finite set such as {0.2, 0.5, 0.7}. So we need a (continuous) prior distribution with range (0, 1) instead of our discrete one. We assign Θ ∼ Beta(α, β) with parameters α, β > 0 and density π_Θ(θ) = (1/B(α, β))·θ^{α−1}(1 − θ)^{β−1} for θ ∈ (0, 1). Recall the mode of a W ∼ Beta(α, β) random variable is (α − 1)/((α − 1) + (β − 1)) (the mode is the value with highest density, arg max_w f_W(w)).
Suppose x₁, ..., x_n are iid from a Bernoulli distribution with unknown parameter. Recall the MLE is k/n, where k = Σ x_i (the total number of successes). Show that the posterior π_Θ(θ | x) has a Beta(k + α, n − k + β) distribution, and find the MAP estimator.
(f) Recall that Beta(1, 1) ≡ Unif(0, 1) (pretend we saw 1 − 1 = 0 heads and 1 − 1 = 0 tails ahead of time). If we used this as the prior, how would the MLE and MAP compare?
(g) Since the posterior is also a Beta distribution, we call the Beta the conjugate prior to the Bernoulli/Binomial distribution's parameter p. Interpret α, β as to how they affect our estimate. This is a really special property: if the prior distribution multiplied by the likelihood results in a posterior distribution in the same family (with different parameters), then we say that distribution is the conjugate prior to the distribution we are estimating.
(h) As the number of samples goes to infinity, what is the relationship between the MLE and MAP? What does this say about our prior when n is small, or n is large?
(i) Which do you think is "better", MLE or MAP?
Solution
(a) Suppose our samples are x = (0, 0, 1, 1, 0), from Ber(θ), where θ is unknown. Assume θ is unrestricted; that is, θ ∈ (0, 1). What is the MLE for θ?
• Answer: 2/5. We just find the likelihood of the data, which is the probability of observing 2 heads and 3 tails, and find the θ that maximizes it:

L(x | θ) = θ²(1 − θ)³
θ̂_MLE = arg max_{θ∈[0,1]} θ²(1 − θ)³ = 2/5

(b) Suppose we impose the restriction that θ ∈ {0.2, 0.5, 0.7}. What is the MLE for θ?
• Answer: 0.5. We need to find which of the three acceptable θ values maximizes the likelihood, and since there are only finitely many, we can just plug them all in and compare!
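The plug-and-compare computations for parts (b)-(d) are mechanical; here is a minimal sketch of them in Python (not from the book; standard library only):

    likelihood = lambda th: th**2 * (1 - th)**3        # 2 heads, 3 tails
    prior = {0.2: 0.1, 0.5: 0.01, 0.7: 0.89}           # the prior from part (c)

    mle = max(prior, key=likelihood)                   # ignores the prior
    map_est = max(prior, key=lambda th: likelihood(th) * prior[th])
    print(mle, map_est)                                # 0.5 and 0.7
    # for part (d): putting almost all prior mass on any one value makes it the MAP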
(e) Suppose $x_1, \dots, x_n$ are iid from a Bernoulli distribution with unknown parameter. Recall the MLE is $\frac{k}{n}$, where $k = \sum x_i$ (the total number of successes). Show that the posterior $\pi_\Theta(\theta \mid x)$ has a $\text{Beta}(k+\alpha,\ n-k+\beta)$ distribution, and find the MAP estimator.
\begin{align*}
\pi_\Theta(\theta \mid x) &\propto L(x \mid \theta) \cdot \pi_\Theta(\theta) \\
&= \left(\binom{n}{k} \theta^k (1-\theta)^{n-k}\right) \cdot \left(\frac{1}{B(\alpha,\beta)} \theta^{\alpha-1}(1-\theta)^{\beta-1}\right) \\
&\propto \theta^{(k+\alpha)-1} (1-\theta)^{(n-k+\beta)-1}
\end{align*}
The first to second line comes from noticing $L(x \mid \theta)$ is just the probability of seeing exactly $k$ successes out of $n$ (binomial PMF), and plugging in our equation for $\pi_\Theta$ (Beta density). The second to third line comes from dropping the normalizing constants (which don't depend on $\theta$), which we can do because we only care to maximize this over $\theta$. If you stare closely at that last equation, it is actually proportional to the PDF of a Beta distribution with different parameters! Our posterior is hence $\text{Beta}(k+\alpha,\ n-k+\beta)$ since PDFs uniquely define a distribution (there is only one normalizing constant that would make it integrate to 1). The MAP estimator is the mode of this posterior Beta distribution, which is given by the formula:
\[ \hat\theta_{MAP} = \frac{(k+\alpha)-1}{((k+\alpha)-1) + ((n-k+\beta)-1)} = \frac{k + (\alpha-1)}{n + (\alpha-1) + (\beta-1)} \]
Try staring at this to see why this might make sense. We'll explain it more in part (g)!
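To sanity-check this result numerically, here is a minimal sketch (my own illustration with hypothetical numbers; assumes numpy and scipy are installed) comparing the closed-form MAP against the numerical mode of the posterior density:

import numpy as np
from scipy import stats

alpha, beta_param, n, k = 7, 3, 12, 11     # hypothetical prior parameters and data
theta = np.linspace(0.001, 0.999, 100000)
posterior = stats.beta.pdf(theta, k + alpha, n - k + beta_param)
print(theta[np.argmax(posterior)])                         # numerical mode of the posterior
print((k + alpha - 1) / (n + alpha - 1 + beta_param - 1))  # closed-form MAP: 17/20 = 0.85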
(f) Recall that $\text{Beta}(1, 1) \equiv \text{Unif}(0, 1)$ (pretend we saw $1-1 = 0$ heads and $1-1 = 0$ tails ahead of time). If we used this as the prior, how would the MLE and MAP compare?
• Answer: They would be the same! From our previous question, if $\alpha = \beta = 1$, then
\[ \hat\theta_{MAP} = \frac{k + (\alpha-1)}{n + (\alpha-1) + (\beta-1)} = \frac{k}{n} = \hat\theta_{MLE} \]
This is because we essentially don't have any prior information: we're saying each value is equally likely!
(g) Since the posterior is also a Beta distribution, we call Beta the conjugate prior to the Bernoulli/Binomial distribution's parameter $p$. Interpret $\alpha, \beta$ as to how they affect our estimate. This is a really special property: if the prior distribution multiplied by the likelihood results in a posterior distribution in the same family (with different parameters), then we say that distribution is the conjugate prior to the distribution we are estimating.
• Answer: The interpretation is: pretend we saw $\alpha - 1$ heads ahead of time, and $\beta - 1$ tails ahead of time. Then our total number of heads is $k + (\alpha - 1)$ (real + fake) and our total number of trials is $n + (\alpha - 1) + (\beta - 1)$ (real + fake), so that's our estimate! That's how prior information was factored in to our estimator, rather than just using what we actually saw in the data.
(h) As the number of samples goes to infinity, what is the relationship between the MLE and MAP? What
does this say about our prior when n is small, or n is large?
• Answer: They become equal! The prior is important if we don’t have much data, but as we get
more, the evidence overwhelms the prior. You can imagine that if we only flipped the coin 5
times, the prior would play a huge role in our estimate. But if we flipped the coin 10,000 times,
any (small) prior wouldn’t really change our estimate.
(i) Which do you think is “better”, MLE or MAP?
• Answer: There is no right answer. There are two main schools in statistics: Bayesians and
Frequentists.
• Frequentists prefer MLE since they don’t believe you should be putting a prior belief on anything,
and you should only make judgment based on what you’ve seen. They believe the parameter
being estimated is a fixed quantity.
• On the other hand, Bayesians prefer MAP, since they can incorporate their prior knowledge into
the estimation. Hence the parameter being estimated is a random variable, and we seek the
mode - the value with the highest probability or density. An example would be estimating the
probability of heads of a coin - is it reasonable to assume it is more likely fair than not? If so,
what distribution should we put on the parameter space?
• Anyway, in the long run, the prior “washes out”, and the only thing that matters is the likelihood;
the observed data. For small sample sizes like this, the prior significantly influences the MAP
estimate. However, as the number of samples goes to infinity, the MAP and MLE are equal.
7.5.2 Exercises
1. Let $x = (x_1, \dots, x_n)$ be iid samples from $\text{Exp}(\Theta)$ where $\Theta$ is a random variable (not fixed). Note that the range of $\Theta$ should be $\Omega_\Theta = [0, \infty)$ (the average rate of events per unit time), so any prior we choose should have this range.
(a) Using the prior $\Theta \sim \text{Gamma}(r, \lambda)$ (for some arbitrary but known parameters $r, \lambda > 0$), show that the posterior distribution $\Theta \mid x$ also follows a Gamma distribution and identify its parameters (by computing $\pi_\Theta(\theta \mid x)$). Then, explain this sentence: "The Gamma distribution is the conjugate prior for the rate parameter of the Exponential distribution". Hint: This can be done in just a few lines!
(b) Now derive the MAP estimate for $\Theta$. The mode of a $\text{Gamma}(s, \nu)$ distribution is $\frac{s-1}{\nu}$. Hint: This should be just one line using your answer to part (a).
(c) Explain how this MAP estimate differs from the MLE estimate (recall for the Exponential distribution it was just the inverse sample mean $\frac{n}{\sum_{i=1}^n x_i}$), and provide an interpretation of $r$ and $\lambda$ as to how they affect the estimate.
Solution:
(a) Remember that the posterior is proportional to likelihood times prior, and the density of $Y \sim \text{Exp}(\theta)$ is $f_Y(y \mid \theta) = \theta e^{-\theta y}$:
\begin{align*}
\pi_\Theta(\theta \mid x) &\propto L(x \mid \theta)\,\pi_\Theta(\theta) = \left(\prod_{i=1}^n \theta e^{-\theta x_i}\right) \cdot \frac{\lambda^r}{\Gamma(r)}\theta^{r-1}e^{-\lambda\theta} \\
&= \frac{\lambda^r}{\Gamma(r)}\,\theta^n e^{-\theta \sum x_i}\,\theta^{r-1}e^{-\lambda\theta} \propto \theta^{(n+r)-1} e^{-(\lambda + \sum x_i)\theta}
\end{align*}
Therefore $\Theta \mid x \sim \text{Gamma}(n + r,\ \lambda + \sum x_i)$, since the final line above is proportional to the PDF of the Gamma distribution (minus normalizing constant).
It is the conjugate prior because, assuming a Gamma prior for the Exponential likelihood, we end up with a Gamma posterior. That is, the prior and posterior are in the same family of distributions (Gamma) with different parameters.
(b) Just citing the mode of a Gamma given above, we get
\[ \hat\theta_{MAP} = \frac{n + r - 1}{\lambda + \sum x_i} \]
(c) We see how the estimate changes from the MLE of $\hat\theta_{MLE} = \frac{n}{\sum x_i}$: pretend we saw $r - 1$ extra events over $\lambda$ units of time. (Instead of waiting for $n$ events, we waited for $n + r - 1$, and instead of $\sum x_i$ as our total time, we now have $\lambda + \sum x_i$ units of time.)
Chapter 7. Statistical Estimation
7.6: Properties of Estimators I
Slides (Google Drive) Video (YouTube)
Now that we have all these techniques to compute estimators, you might be wondering which one is the "best". Actually, a better question would be: how can we determine which estimator is "better" (rather than which technique)? There are even more ways to estimate besides MLE/MoM/MAP, and in different scenarios, different techniques may work better. In these notes, we will consider some properties of estimators that allow us to compare their "goodness".
7.6.1 Bias
The first estimator property we'll cover is bias. The bias of an estimator measures whether or not, in expectation, the estimator will be equal to the true parameter. It is defined as $\text{Bias}(\hat\theta, \theta) = E[\hat\theta] - \theta$.
• If $\text{Bias}(\hat\theta, \theta) = 0$, or equivalently $E[\hat\theta] = \theta$, then we say $\hat\theta$ is an unbiased estimator of $\theta$.
• If $\text{Bias}(\hat\theta, \theta) > 0$, then $\hat\theta$ typically overestimates $\theta$.
• If $\text{Bias}(\hat\theta, \theta) < 0$, then $\hat\theta$ typically underestimates $\theta$.
Example(s)
First, recall that, if $x_1, \dots, x_n$ are iid realizations from $\text{Poi}(\theta)$, then the MLE and MoM were both the sample mean:
\[ \hat\theta = \hat\theta_{MLE} = \hat\theta_{MoM} = \frac{1}{n}\sum_{i=1}^n x_i \]
What is the bias of this estimator?
Solution
\begin{align*}
E[\hat\theta] &= E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] \\
&= \frac{1}{n}\sum_{i=1}^n E[x_i] & \text{[LoE]} \\
&= \frac{1}{n}\sum_{i=1}^n \theta & [E[\text{Poi}(\theta)] = \theta] \\
&= \frac{1}{n} \cdot n\theta \\
&= \theta
\end{align*}
This makes sense: the average of your samples should be "on-target" for the true average!
Example(s)
Next, recall that, if $x_1, \dots, x_n$ are iid realizations from (continuous) $\text{Unif}(0, \theta)$, then
\[ \hat\theta_{MLE} = x_{max} \qquad \hat\theta_{MoM} = 2 \cdot \frac{1}{n}\sum_{i=1}^n x_i \]
Sure, $\hat\theta_{MLE}$ maximizes the likelihood, so in a way $\hat\theta_{MLE}$ is better than $\hat\theta_{MoM}$. But what are the biases of these estimators? Before doing any computation: do you think $\hat\theta_{MLE}$ and $\hat\theta_{MoM}$ are overestimates, underestimates, or unbiased?
Solution I actually think $\hat\theta_{MoM}$ is spot-on, since the average of the samples should be close to $\theta/2$, and multiplying by 2 would seem to give the true $\theta$. On the other hand, $\hat\theta_{MLE}$ might be a bit of an underestimate, since the largest sample is probably a bit less than $\theta$ (the true $\theta$ is likely a little larger than the max we observed).
\[ E[\hat\theta_{MLE}] = E[X_{max}] = \int_0^\theta y \cdot n\left(\frac{y}{\theta}\right)^{n-1}\frac{1}{\theta}\, dy = \frac{n}{\theta^n}\int_0^\theta y^n \, dy = \frac{n}{\theta^n}\left[\frac{y^{n+1}}{n+1}\right]_0^\theta = \frac{n}{n+1}\theta \]
(Here $n(y/\theta)^{n-1}\frac{1}{\theta}$ is the density of the max, obtained by differentiating its CDF $(y/\theta)^n$.) This makes sense because if I had 3 samples from $\text{Unif}(0, 1)$ for example, I would expect them at $1/4, 2/4, 3/4$, and so my expected max would be $\frac{n}{n+1} = \frac{3}{4}$. Similarly, if I had 4 samples, then I would expect them at $1/5, 2/5, 3/5, 4/5$, and so it would again be $\frac{n}{n+1} = \frac{4}{5}$ as my expected max.
Finally,
\[ \text{Bias}(\hat\theta_{MLE}, \theta) = E[\hat\theta_{MLE}] - \theta = \frac{n}{n+1}\theta - \theta = -\frac{1}{n+1}\theta \]
\[ E[\hat\theta_{MoM}] = E\left[2 \cdot \frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{2}{n}\sum_{i=1}^n E[x_i] = \frac{2}{n} \cdot n \cdot \frac{\theta}{2} = \theta \]
\[ \text{Bias}(\hat\theta_{MoM}, \theta) = E[\hat\theta_{MoM}] - \theta = \theta - \theta = 0 \]
• Analysis of Results
This means that $\hat\theta_{MLE}$ typically underestimates $\theta$ and $\hat\theta_{MoM}$ is an unbiased estimator of $\theta$. But something isn't quite right... Suppose, for example, our sample was $x = (1, 9, 2)$. Then
\[ \hat\theta_{MLE} = \max\{1, 9, 2\} = 9 \qquad \hat\theta_{MoM} = 2 \cdot \frac{1}{3}(1 + 9 + 2) = 8 \]
However, based on our sample, the MoM estimate is impossible. If the actual parameter were 8, then that means that the distribution we pulled the sample from is $\text{Unif}(0, 8)$, in which case the likelihood that we get a 9 is 0. But we did see a 9 in our sample. So, even though $\hat\theta_{MoM}$ is unbiased, it still yields an impossible estimate. This just goes to show that finding the right estimator is actually quite tricky.
A good solution would be to "de-bias" the MLE by scaling it appropriately. If you decided to have a new estimator based on the MLE:
\[ \hat\theta = \frac{n+1}{n}\hat\theta_{MLE} \]
you would now get an unbiased estimator that can't be wrong! But now it does not maximize the likelihood anymore...
Actually, the MLE is what we say to be "asymptotically unbiased", meaning unbiased in the limit. This is because
\[ \text{Bias}(\hat\theta_{MLE}, \theta) = -\frac{1}{n+1}\theta \to 0 \]
as $n \to \infty$. So usually we might just leave it because we can't seem to win...
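A quick simulation sketch (my own, not from the text) makes both biases visible; with $\theta = 10$ and $n = 5$, the MLE's average lands about $-\theta/(n+1) \approx -1.67$ below the target while the MoM is on target:

import numpy as np

rng = np.random.default_rng(0)
theta, n, ntrials = 10.0, 5, 100000
samples = rng.uniform(0, theta, size=(ntrials, n))
print(samples.max(axis=1).mean() - theta)         # bias of MLE: about -theta/(n+1) = -1.67
print((2 * samples.mean(axis=1)).mean() - theta)  # bias of MoM: about 0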
Example(s)
Recall that if $x_1, \dots, x_n \sim \text{Exp}(\theta)$ are iid, our MLE and MoM estimates were both the inverse sample mean:
\[ \hat\theta = \hat\theta_{MLE} = \hat\theta_{MoM} = \frac{1}{\bar x} = \frac{n}{\sum_{i=1}^n x_i} \]
What can you say about the bias of this estimator?
Solution
\begin{align*}
E[\hat\theta] &= E\left[\frac{n}{\sum_{i=1}^n x_i}\right] \\
&\ge \frac{n}{\sum_{i=1}^n E[x_i]} & \text{[Jensen's inequality]} \\
&= \frac{n}{n \cdot \frac{1}{\theta}} & \left[E[\text{Exp}(\theta)] = \frac{1}{\theta}\right] \\
&= \theta
\end{align*}
The inequality comes from Jensen's (section 6.3): since $g(x_1, \dots, x_n) = \frac{1}{\sum_{i=1}^n x_i}$ is convex (at least in the positive octant when all $x_i \ge 0$), we have that $E[g(x_1, \dots, x_n)] \ge g(E[x_1], E[x_2], \dots, E[x_n])$. It is convex for a reason similar to why $\frac{1}{x}$ is a convex function (for $x > 0$). So $E[\hat\theta] \ge \theta$ systematically, and we typically have an overestimate.
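A quick simulation sketch (mine) confirms the systematic overestimate; with $\theta = 2$ and $n = 5$, the average of $n/\sum x_i$ over many trials lands noticeably above 2:

import numpy as np

rng = np.random.default_rng(1)
theta, n, ntrials = 2.0, 5, 200000
samples = rng.exponential(scale=1/theta, size=(ntrials, n))  # Exp(theta) has mean 1/theta
print((n / samples.sum(axis=1)).mean())   # noticeably above theta = 2 (about 2.5 here)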
We also care about the variance of an estimator, $\text{Var}(\hat\theta)$. This is just the definition of variance applied to the random variable $\hat\theta$ and isn't actually a new definition. But maybe instead of just computing the variance, we want a slightly different metric, one which measures the expected squared difference of the estimator from the true parameter, and not just from the estimator's own expectation. This is the mean squared error (MSE):
\[ \text{MSE}(\hat\theta, \theta) = E\left[(\hat\theta - \theta)^2\right] \]
This leads to what is known as the "Bias-Variance Tradeoff" in machine learning and statistics. Usually, we want to minimize MSE, and these two quantities (bias and variance) are often inversely related. That is, decreasing one leads to an increase in the other, and finding the balance will minimize the MSE. It's hard to see why that might be the case here, since we aren't working with very complex estimators (we're just learning the basics!).
Proof of Alternate MSE Formula. We will prove that $\text{MSE}(\hat\theta, \theta) = \text{Var}(\hat\theta) + \text{Bias}(\hat\theta, \theta)^2$.
\begin{align*}
\text{MSE}(\hat\theta, \theta) &= E\left[(\hat\theta - \theta)^2\right] & \text{[def of MSE]} \\
&= E\left[\left((\hat\theta - E[\hat\theta]) + (E[\hat\theta] - \theta)\right)^2\right] & [\text{add and subtract } E[\hat\theta]] \\
&= E\left[(\hat\theta - E[\hat\theta])^2\right] + 2E\left[(\hat\theta - E[\hat\theta])(E[\hat\theta] - \theta)\right] + E\left[(E[\hat\theta] - \theta)^2\right] & [(a+b)^2 = a^2 + 2ab + b^2] \\
&= \text{Var}(\hat\theta) + 0 + \text{Bias}(\hat\theta, \theta)^2 & [\text{def of var, bias, } E[\hat\theta - E[\hat\theta]] = 0] \\
&= \text{Var}(\hat\theta) + \text{Bias}(\hat\theta, \theta)^2
\end{align*}
Example(s)
Once more, recall that, if $x_1, \dots, x_n$ are iid realizations from $\text{Poi}(\theta)$, then the MLE and MoM were both the sample mean:
\[ \hat\theta = \hat\theta_{MLE} = \hat\theta_{MoM} = \frac{1}{n}\sum_{i=1}^n x_i \]
What is the MSE of this estimator?
Solution To compute the MSE, let's compute the bias and variance separately. Earlier, we showed that
\[ \text{Bias}(\hat\theta, \theta) = E[\hat\theta] - \theta = \theta - \theta = 0 \]
For the variance, using $\text{Var}(\text{Poi}(\theta)) = \theta$ and independence,
\[ \text{Var}\left(\hat\theta\right) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(x_i) = \frac{\theta}{n} \]
so $\text{MSE}(\hat\theta, \theta) = \text{Var}(\hat\theta) + \text{Bias}(\hat\theta, \theta)^2 = \frac{\theta}{n}$.
Chapter 7. Statistical Estimation
7.7: Properties of Estimators II
Slides (Google Drive) Video (YouTube)
We'll discuss even more desirable properties of estimators. Last time we talked about bias, variance, and MSE. Bias measured whether or not, in expectation, our estimator was equal to the true value of $\theta$. MSE measured the expected squared difference between our estimator and the true value of $\theta$. If our estimator was unbiased, then the MSE of our estimator was precisely the variance.
7.7.1 Consistency
Definition 7.7.1: Consistency
An estimator $\hat\theta_n$ of $\theta$ (based on $n$ samples) is consistent if, for every $\varepsilon > 0$,
\[ \lim_{n\to\infty} P\left(|\hat\theta_n - \theta| > \varepsilon\right) = 0 \]
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from (continuous) $\text{Unif}(0, \theta)$, then
\[ \hat\theta_n = \hat\theta_{n, MoM} = 2 \cdot \frac{1}{n}\sum_{i=1}^n x_i \]
Show that this estimator is consistent.
Solution
Since $\hat\theta_n$ is unbiased, we have that
\[ P\left(|\hat\theta_n - \theta| > \varepsilon\right) = P\left(|\hat\theta_n - E[\hat\theta_n]| > \varepsilon\right) \]
because we can replace $\theta$ with the expected value of the estimator. Now, we can apply Chebyshev's inequality (6.1) to see that
\[ P\left(|\hat\theta_n - E[\hat\theta_n]| > \varepsilon\right) \le \frac{\text{Var}\left(\hat\theta_n\right)}{\varepsilon^2} \]
Now, we can take the $2^2 = 4$ out of the variance and are left only with the variance of the sample mean, which is always just $\frac{\sigma^2}{n} = \frac{\text{Var}(x_i)}{n}$:
\[ P\left(|\hat\theta_n - E[\hat\theta_n]| > \varepsilon\right) \le \frac{\text{Var}\left(2 \cdot \frac{1}{n}\sum_{i=1}^n x_i\right)}{\varepsilon^2} = \frac{4 \cdot \text{Var}(x_i)/n}{\varepsilon^2} \to 0 \]
as $n \to \infty$, since $\text{Var}(x_i) = \theta^2/12$ is a constant. Hence $\hat\theta_n$ is consistent.
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from (continuous) $\text{Unif}(0, \theta)$, then
\[ \hat\theta_n = \hat\theta_{n, MLE} = \max\{x_1, \dots, x_n\} \]
Show that this estimator is also consistent.
Solution
In this case, we unfortunately cannot use Chebyshev's inequality, because the maximum likelihood estimator is not unbiased. The CDF for $\hat\theta_n$ is
\[ F_{\hat\theta_n}(t) = P\left(\hat\theta_n \le t\right) \]
which is the probability that each individual sample is less than $t$, because only in that case will the max be less than $t$; and we have independence, so we can say
\[ P\left(\hat\theta_n \le t\right) = P(X_1 \le t)\,P(X_2 \le t)\cdots P(X_n \le t) \]
This is just the CDF of $X_i$ to the $n$-th power, where the CDF of $\text{Unif}(0, \theta)$ is just $\frac{t}{\theta}$ (see the distribution sheet):
\[ F_{\hat\theta_n}(t) = F_X^n(t) = \begin{cases} 0, & t < 0 \\ (t/\theta)^n, & 0 \le t \le \theta \\ 1, & t > \theta \end{cases} \]
There are two ways we can have the absolute value from before be greater than epsilon:
\[ P\left(|\hat\theta_n - \theta| > \varepsilon\right) = P\left(\hat\theta_n > \theta + \varepsilon\right) + P\left(\hat\theta_n < \theta - \varepsilon\right) \]
The first term is 0, because there's no way our estimator is greater than $\theta + \varepsilon$: it's never going to be greater than $\theta$ by definition (the samples are between 0 and $\theta$, so there's no way the max of the samples is greater than $\theta$). So, now we can just use the CDF on the right term, plugging in $\theta - \varepsilon$ for $t$:
\[ P\left(\hat\theta_n > \theta + \varepsilon\right) + P\left(\hat\theta_n < \theta - \varepsilon\right) = P\left(\hat\theta_n < \theta - \varepsilon\right) = \begin{cases} \left(\frac{\theta - \varepsilon}{\theta}\right)^n, & \varepsilon < \theta \\ 0, & \varepsilon \ge \theta \end{cases} \]
We can assume that $\varepsilon$ is less than $\theta$, because we really only care about when $\varepsilon$ is very very small, so we have that
\[ P\left(|\hat\theta_n - \theta| > \varepsilon\right) = \left(\frac{\theta - \varepsilon}{\theta}\right)^n \]
Thus, when we take the limit as $n$ approaches infinity, we see that in the parentheses we have a number less than 1, and we raise it to the $n$-th power, so it goes to 0:
\[ \lim_{n\to\infty} P\left(|\hat\theta_n - \theta| > \varepsilon\right) = 0 \]
Now we've seen that, even though the MLE and MoM estimators of $\theta$ given iid samples from $\text{Unif}(0, \theta)$ are different, they are both consistent! That means, as $n \to \infty$, they will both converge to the true parameter $\theta$. This is clearly a good property of an estimator.
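Here is a small simulation sketch (mine, not from the text) illustrating consistency: as $n$ grows, both estimators home in on $\theta = 10$.

import numpy as np

rng = np.random.default_rng(3)
theta = 10.0
for n in [10, 100, 10000]:
    x = rng.uniform(0, theta, size=n)
    print(n, x.max(), 2 * x.mean())   # both columns approach theta = 10 as n grows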
Estimators can fall into any combination of (un)biasedness and (in)consistency:
1. For instance, an unbiased and consistent estimator was the MoM for the uniform distribution: $\hat\theta_{n,MoM} = 2\bar x$. We proved it was unbiased in 7.6, meaning it is correct in expectation. It converges to the true parameter (consistent) since the variance goes to 0.
2. However, if you ignore all the samples and just take the first one and multiply it by 2, $\hat\theta = 2X_1$, it is unbiased (as $E[2X_1] = 2 \cdot \frac{\theta}{2} = \theta$), but it's not consistent; our estimator doesn't get better and better with more $n$ because we're not using all $n$ samples. Consistency requires that as we get more samples, we approach the true parameter.
3. Biased but consistent, on the other hand, was the MLE estimator. We showed its expectation was $\frac{n}{n+1}\theta$, which is actually "asymptotically unbiased" since $E[\hat\theta_{n,MLE}] = \frac{n}{n+1}\theta \to \theta$ as $n \to \infty$. It does get better and better as $n \to \infty$.
4. Neither unbiased nor consistent would just be some random expression, such as $\hat\theta = \frac{1}{X_1^2}$.
7.7.3 Efficiency
To take about our last topic, efficiency, we first have to define Fisher Information. Efficiency says that our
estimator has as low variance as possible. This property combined with consistency and unbiasedness mean
that our estimator is on target (unbiased), converges to the true parameter (consistent), and does so as fast
as possible (efficient).
Let $x = (x_1, \dots, x_n)$ be iid realizations from probability mass function $p_X(t \mid \theta)$ (if $X$ is discrete), or from density function $f_X(t \mid \theta)$ (if $X$ is continuous), where $\theta$ is a parameter (or vector of parameters). The Fisher information of the parameter $\theta$ is defined to be:
\[ I(\theta) = E\left[\left(\frac{\partial \ln L(x \mid \theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \ln L(x \mid \theta)}{\partial \theta^2}\right] \]
where $L(x \mid \theta)$ denotes the likelihood of the data given parameter $\theta$ (defined in 7.1). From Wikipedia, it "is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends".
That written definition is definitely a mouthful, but if you stop and parse it, you'll see it's not too bad to compute. We always take the second derivative of the log-likelihood anyway to confirm that our MLE was a maximizer; now all you have to do is take the expectation (and negate it) to get the Fisher information. I won't try to give an intuitive interpretation of the negative expected value of the second derivative of the log-likelihood though; it's just too gross and messy.
Theorem (Cramér-Rao Lower Bound): Let $x = (x_1, \dots, x_n)$ be iid realizations from probability mass function $p_X(t \mid \theta)$ (if $X$ is discrete), or from density function $f_X(t \mid \theta)$ (if $X$ is continuous), where $\theta$ is a parameter (or vector of parameters). If $\hat\theta$ is an unbiased estimator for $\theta$, then
\[ \text{MSE}(\hat\theta, \theta) = \text{Var}\left(\hat\theta\right) \ge \frac{1}{I(\theta)} \]
where $I(\theta)$ is the Fisher information defined earlier. What this is saying is: for any unbiased estimator $\hat\theta$ for $\theta$, the variance (which equals the MSE here) is at least $\frac{1}{I(\theta)}$; this bound is called the Cramér-Rao Lower Bound (CRLB). If we achieve this lower bound, meaning our variance is exactly equal to $\frac{1}{I(\theta)}$, then we have the best variance possible for our estimate. That is, we have the minimum variance unbiased estimator (MVUE) for $\theta$.
Since we want to find the lowest variance possible, we can look at this through the frame of finding the estimator's efficiency:
\[ e(\hat\theta, \theta) = \frac{I(\theta)^{-1}}{\text{Var}\left(\hat\theta\right)} \]
This will always be between 0 and 1: if your variance is equal to the CRLB, then it equals 1, and any variance greater than the CRLB will result in a smaller value. A larger variance results in a smaller efficiency, and we want our efficiency to be as high as possible (1).
An unbiased estimator is said to be efficient if it achieves the CRLB, meaning $e(\hat\theta, \theta) = 1$. That is, it could not possibly have a lower variance. Again, the CRLB is not guaranteed for biased estimators.
That was super complicated - let's see how to verify the MLE of $\text{Poi}(\theta)$ is efficient. It looks scary - but it's just messy algebra!
Example(s)
Recall that, if $x_1, \dots, x_n$ are iid realizations from $X \sim \text{Poi}(\theta)$ (recall $E[X] = \text{Var}(X) = \theta$), then
\[ \hat\theta = \hat\theta_{MLE} = \hat\theta_{MoM} = \frac{1}{n}\sum_{i=1}^n x_i \]
Is $\hat\theta$ efficient?
Solution
First, you have to check that it's unbiased, as the CRLB only holds for unbiased estimators...
\[ E[\hat\theta] = E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n E[x_i] = \theta \]
...which it is! Otherwise, we wouldn't be able to use this bound. We also need to compute the variance. The variance of the sample mean (the estimator) is just $\frac{\sigma^2}{n}$, and the variance of a Poisson is just $\theta$:
\[ \text{Var}\left(\hat\theta\right) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n x_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(x_i) = \frac{\theta}{n} \]
Then, we're going to compute that weird Fisher information, which gives us the CRLB, and see if our variance matches. Remember, we take the second derivative of the log-likelihood, which we did earlier in 7.2, so we're just going to copy over the answer.
\[ \frac{\partial^2}{\partial\theta^2} \ln L(x \mid \theta) = -\sum_{i=1}^n \frac{x_i}{\theta^2} \]
Then, we need to take the expected value of this. It turns out, with some algebra, you get $-\frac{n}{\theta}$:
\[ E\left[\frac{\partial^2 \ln L(x \mid \theta)}{\partial\theta^2}\right] = E\left[-\sum_{i=1}^n \frac{x_i}{\theta^2}\right] = -\frac{1}{\theta^2}\sum_{i=1}^n E[x_i] = -\frac{1}{\theta^2} \cdot n\theta = -\frac{n}{\theta} \]
Our Fisher information was the negative expected value of the second derivative of the log-likelihood, so we just flip the sign to get $\frac{n}{\theta}$:
\[ I(\theta) = -E\left[\frac{\partial^2 \ln L(x \mid \theta)}{\partial\theta^2}\right] = \frac{n}{\theta} \]
Finally, our efficiency is the inverse of the Fisher information over the variance:
\[ e(\hat\theta, \theta) = \frac{I(\theta)^{-1}}{\text{Var}\left(\hat\theta\right)} = \frac{\theta/n}{\theta/n} = 1 \]
Thus, we've shown that, since our efficiency is 1, our estimator is efficient. That is, it has the best possible variance among all unbiased estimators of $\theta$. This, again, is a really good property that we want to have.
To reiterate, this means we cannot possibly do better in terms of mean squared error. Our bias is 0, and our variance is as low as it can possibly go. The sample mean is unequivocally the best estimator for a Poisson distribution, in terms of efficiency, bias, and MSE (it also happens to be consistent, so there are a lot of good things).
As you can see, showing efficiency is just a bunch of tedious calculations!
Chapter 7. Statistical Estimation
7.8: Properties of Estimators III
Slides (Google Drive) Video (YouTube)
The final property of estimators we will discuss is called sufficiency. Just like we want our estimators to be consistent and efficient, we also want them to be sufficient.
7.8.1 Sufficiency
We first must define what a statistic is.
Definition 7.8.1: Statistic
A statistic is any function $T = T(x_1, \dots, x_n)$ of the samples.
All estimators are statistics because they take in our $n$ data points and produce a single number. We'll see an example which intuitively explains what it means for a statistic to be sufficient.
Suppose we have iid samples $x = (x_1, \dots, x_n)$ from a known distribution with unknown parameter $\theta$. Imagine we have two people:
• Statistician A: Knows the entire sample, gets $n$ quantities: $x = (x_1, \dots, x_n)$.
• Statistician B: Knows $T(x_1, \dots, x_n) = t$, a single number which is a function of the samples. For example, the sum or the maximum of the samples.
Heuristically, $T(x_1, \dots, x_n)$ is a sufficient statistic if Statistician B can do just as good a job as Statistician A, given "less information". For example, if the samples are from the Bernoulli distribution, knowing $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ (the number of heads) is just as good as knowing all the individual outcomes, since a good estimate would be the number of heads over the number of total trials! Hence, we don't actually care about the ORDER of the outcomes, just how many heads occurred! The word "sufficient" in English roughly means "enough", and so this terminology was well-chosen.
To motivate the definition, we'll go back to the previous example. Again, Statistician A has all the samples $x_1, \dots, x_n$, but Statistician B only has the single number $t = T(x_1, \dots, x_n)$. The idea is: Statistician B only knows $T = t$, but since $T$ is sufficient, she doesn't need $\theta$ to generate new samples $X_1', \dots, X_n'$ from the distribution. This is because
\[ P(X_1 = x_1, \dots, X_n = x_n \mid T = t, \theta) = P(X_1 = x_1, \dots, X_n = x_n \mid T = t) \]
and since she knows $T = t$, she knows the conditional distribution (and can generate samples)! Now Statistician B has $n$ iid samples from the distribution, just like Statistician A. So using these samples $X_1', \dots, X_n'$, Statistician B can do just as good a job as Statistician A with samples $X_1, \dots, X_n$ (on average). So no one is at any disadvantage. :)
This definition is hard to check, but it turns out that there is a criterion that helps us determine whether a statistic is sufficient:
Theorem 7.8.35: Neyman-Fisher Factorization Criterion
Let $x_1, \dots, x_n$ be iid random samples with likelihood $L(x_1, \dots, x_n \mid \theta)$. A statistic $T = T(x_1, \dots, x_n)$ is sufficient if and only if there exist non-negative functions $g$ and $h$ such that:
\[ L(x_1, \dots, x_n \mid \theta) = g(x_1, \dots, x_n) \cdot h(T(x_1, \dots, x_n), \theta) \]
That is, the likelihood of the data can be split into a product of two terms: the first term $g$ can depend on the entire data, but not $\theta$; and the second term $h$ can depend on $\theta$, but only on the data through the sufficient statistic $T$. (In other words, $T$ is the only thing that allows the data $x_1, \dots, x_n$ and $\theta$ to interact!) That is, inside $h$ we don't have access to the $n$ individual quantities $x_1, \dots, x_n$; just the single number ($T$, the sufficient statistic).
If you are reading this for the first time, you might not think this is any better... You may be very confused right now, but let's see some examples to clear things up!
But basically, you want to split the likelihood into a product of two terms/functions:
1. For the first term $g$, you are allowed to know each individual sample if you want, but NOT $\theta$.
2. For the second term $h$, you can only know the sufficient statistic (single number) $T(x_1, \dots, x_n)$ and $\theta$. You may not know each individual $x_i$.
Example(s)
Let $x_1, \dots, x_n$ be iid random samples from $\text{Unif}(0, \theta)$ (continuous). Show that the MLE $\hat\theta = T(x_1, \dots, x_n) = \max\{x_1, \dots, x_n\}$ is a sufficient statistic. (The reason this is true is because we don't need to know each individual sample to have a good estimate for $\theta$; we just need to know the largest!)
Solution We saw the likelihood of this continuous uniform in 7.2, which we'll just rewrite:
\[ L(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n \frac{1}{\theta} I_{\{x_i \le \theta\}} = \frac{1}{\theta^n} I_{\{x_1, \dots, x_n \le \theta\}} = \frac{1}{\theta^n} I_{\{\max\{x_1, \dots, x_n\} \le \theta\}} = \frac{1}{\theta^n} I_{\{T(x_1, \dots, x_n) \le \theta\}} \]
Choose
\[ g(x_1, \dots, x_n) = 1 \]
and
\[ h(T(x_1, \dots, x_n), \theta) = \frac{1}{\theta^n} I_{\{T(x_1, \dots, x_n) \le \theta\}} \]
Notice there is no need for a $g$ term (that's why it is $= 1$), because there is no term in the likelihood which just has the data (without $\theta$).
For the $h$ term, notice that we just need to know the max of the samples $T(x_1, \dots, x_n)$ to compute $h$: we don't actually need to know each individual $x_i$.
Notice that here the only interaction between the data and parameter $\theta$ happens through the sufficient statistic (the max of all the values).
Example(s)
Let $x_1, \dots, x_n$ be iid random samples from $\text{Poi}(\theta)$. Show that $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ is a sufficient statistic, and hence the MLE $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$ is sufficient as well. (The reason this is true is because we don't need to know each individual sample to have a good estimate for $\theta$; we just need to know how many events happened total!)
Solution We take our Poisson likelihood and split it into smaller terms:
\[ L(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n e^{-\theta}\frac{\theta^{x_i}}{x_i!} = \left(\prod_{i=1}^n e^{-\theta}\right)\left(\prod_{i=1}^n \theta^{x_i}\right)\left(\prod_{i=1}^n \frac{1}{x_i!}\right) = \frac{e^{-n\theta}\,\theta^{\sum_{i=1}^n x_i}}{\prod_{i=1}^n x_i!} = \frac{1}{\prod_{i=1}^n x_i!} \cdot e^{-n\theta}\,\theta^{T(x_1, \dots, x_n)} \]
Choose
\[ g(x_1, \dots, x_n) = \frac{1}{\prod_{i=1}^n x_i!} \]
and
\[ h(T(x_1, \dots, x_n), \theta) = e^{-n\theta}\,\theta^{T(x_1, \dots, x_n)} \]
By the Neyman-Fisher Factorization Criterion, $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ is sufficient. The mean $\hat\theta_{MLE} = \frac{\sum_{i=1}^n x_i}{n} = \frac{T(x_1, \dots, x_n)}{n}$ is as well, since knowing the total number of events and the average number of events is equivalent (since we know $n$)!
Notice here we had the $g$ term handle some function of only $x_1, \dots, x_n$ but not $\theta$.
For the $h$ term though, we do have $\theta$ but don't need the individual samples $x_1, \dots, x_n$ to compute $h$. Imagine being just given $T(x_1, \dots, x_n)$: now you have enough information to compute $h$!
Notice that here the only interaction between the data and parameter $\theta$ happens through the sufficient statistic (the sum/mean of all the values). We don't actually need to know each individual $x_i$.
Example(s)
Let $x_1, \dots, x_n$ be iid random samples from $\text{Ber}(\theta)$. Show that $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ is a sufficient statistic, and hence the MLE $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$ is sufficient as well. (The reason this is true is because we don't need to know each individual sample to have a good estimate for $\theta$; we just need to know how many heads happened total!)
Solution The Bernoulli likelihood comes by using the PMF $p_X(k) = \theta^k(1-\theta)^{1-k}$ for $k \in \{0, 1\}$. We get this by observing that $\text{Ber}(\theta) = \text{Bin}(1, \theta)$.
\begin{align*}
L(x_1, \dots, x_n \mid \theta) &= \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \left(\prod_{i=1}^n \theta^{x_i}\right)\left(\prod_{i=1}^n (1-\theta)^{1-x_i}\right) \\
&= \theta^{\sum_{i=1}^n x_i}(1-\theta)^{n - \sum_{i=1}^n x_i} = \theta^{T(x_1, \dots, x_n)}(1-\theta)^{n - T(x_1, \dots, x_n)}
\end{align*}
Choose
\[ g(x_1, \dots, x_n) = 1 \]
and
\[ h(T(x_1, \dots, x_n), \theta) = \theta^{T(x_1, \dots, x_n)}(1-\theta)^{n - T(x_1, \dots, x_n)} \]
By the Neyman-Fisher Factorization Criterion, $T(x_1, \dots, x_n) = \sum_{i=1}^n x_i$ is sufficient. The mean $\hat\theta_{MLE} = \frac{\sum_{i=1}^n x_i}{n} = \frac{T(x_1, \dots, x_n)}{n}$ is as well, since knowing the total number of heads and the sample proportion of heads is equivalent (since we know $n$)!
Notice that here the only interaction between the data and parameter $\theta$ happens through the sufficient statistic (the sum/mean of all the values). We don't actually need to know each individual $x_i$.
The mean squared error of an estimator $\hat\theta$ of $\theta$ measures the expected squared error from the true value $\theta$, and decomposes into a bias term and a variance term. This decomposition gives rise to the phrase "Bias-Variance Tradeoff": sometimes these are opposing forces, and minimizing MSE is a result of choosing the right balance.
\[ \text{MSE}(\hat\theta, \theta) = E\left[(\hat\theta - \theta)^2\right] = \text{Var}\left(\hat\theta\right) + \text{Bias}^2(\hat\theta, \theta) \]
If $\hat\theta$ is an unbiased estimator of $\theta$, then the MSE reduces to just: $\text{MSE}(\hat\theta, \theta) = \text{Var}\left(\hat\theta\right)$.
An unbiased estimator $\hat\theta$ is efficient if it achieves the Cramér-Rao Lower Bound, meaning it has the lowest variance possible:
\[ e(\hat\theta, \theta) = \frac{I(\theta)^{-1}}{\text{Var}\left(\hat\theta\right)} = 1 \iff \text{Var}\left(\hat\theta\right) = \frac{1}{I(\theta)} = \frac{1}{-E\left[\frac{\partial^2 \ln L(x \mid \theta)}{\partial\theta^2}\right]} \]
A statistic $T = T(x_1, \dots, x_n)$ is sufficient if and only if the likelihood factors as
\[ L(x_1, \dots, x_n \mid \theta) = g(x_1, \dots, x_n) \cdot h(T(x_1, \dots, x_n), \theta) \]
Chapter 8. Statistical Inference
In this last chapter, we talk about how to draw conclusions about a population using only a subset (hypoth-
esis testing). This is something we commonly want to do to answer questions like: who will win the next
U.S. presidential election? We can’t possibly poll everyone in the U.S. to see who they prefer, but we can
sample a few thousand and get their opinion. We will then make predictions for the election result with
some margin of error. What about drug testing? How can a drug company use clinical trials to “prove”
that their drug increases life expectancy or reduces risk of disease? These types of important questions will
be addressed in this chapter!
8.1: Confidence Intervals
Slides (Google Drive) Video (YouTube)
We've talked about several ways to estimate unknown parameters, and desirable properties. But there is just one problem now: even if our estimator had all the good properties, the probability that our estimator for $\theta$ is exactly correct is 0, since $\theta$ is continuous (a decimal number)! We'll see how we can construct confidence intervals around our estimator, so that we can argue that $\hat\theta$ is close to $\theta$ with high probability.
The confidence interval for $\theta$ can be illustrated in the below picture. We will explain how to interpret a confidence interval at a specific confidence level soon.
Note that we can write this in any of the following three equivalent ways, as they all represent the probability that $\hat\theta$ and $\theta$ differ by no more than some amount $\Delta$:
\[ P\left(\theta \in [\hat\theta - \Delta, \hat\theta + \Delta]\right) = P\left(|\hat\theta - \theta| \le \Delta\right) = P\left(\hat\theta \in [\theta - \Delta, \theta + \Delta]\right) = 0.95 \]
Note the first and third equivalent statements especially (swapping $\hat\theta$ and $\theta$).
We have learned about the CDF of the normal distribution. If $Z \sim N(0, 1)$, we denote the CDF $\Phi(a) = F_Z(a) = P(Z \le a)$, since it's so commonly used. There is no closed-form formula, so one way to find a z-score associated with a percentage is to look it up in a z-table.
Suppose we want a (centered) interval, where the probability of being in that interval is 95%.
Left bound: the probability of being less than the left bound is 2.5%.
Right bound: the probability of being greater than the right bound is 2.5%. Thus, the probability of being less than the right bound should be 97.5%.
Note the following two equivalent statements that say that $P(Z \le 1.96) = 0.975$ (where $\Phi^{-1}$ is the inverse CDF of the standard normal):
\[ \Phi(1.96) = 0.975 \qquad \Phi^{-1}(0.975) = 1.96 \]
Example(s)
Suppose $x_1, \dots, x_n$ are iid samples from $\text{Poi}(\theta)$ where $\theta$ is unknown. Our MLE and MoM estimates agreed at the sample mean: $\hat\theta = \bar x = \frac{1}{n}\sum_{i=1}^n x_i$. Create an interval centered at $\hat\theta$ which contains $\theta$ with probability 95%.
Solution Recall that if $W \sim \text{Poi}(\theta)$, then $E[W] = \text{Var}(W) = \theta$, and so our estimator (the sample mean) $\hat\theta = \bar x$ has $E[\hat\theta] = \theta$ and $\text{Var}(\hat\theta) = \frac{\text{Var}(x_i)}{n} = \frac{\theta}{n}$. Thus, by the Central Limit Theorem, $\hat\theta$ is approximately Normally distributed:
\[ \hat\theta = \frac{1}{n}\sum_{i=1}^n x_i \approx N\left(\theta, \frac{\theta}{n}\right) \]
If we standardize, we get that
\[ \frac{\hat\theta - \theta}{\sqrt{\theta/n}} \approx N(0, 1) \]
To construct our 95% confidence interval, we want $P\left(\theta \in [\hat\theta - \Delta, \hat\theta + \Delta]\right) = 0.95$:
\begin{align*}
P\left(\theta \in [\hat\theta - \Delta, \hat\theta + \Delta]\right) &= P\left(\hat\theta - \Delta \le \theta \le \hat\theta + \Delta\right) & \text{[one of 3 equivalent statements]} \\
&= P\left(-\Delta \le \hat\theta - \theta \le \Delta\right) \\
&= P\left(-\frac{\Delta}{\sqrt{\theta/n}} \le \frac{\hat\theta - \theta}{\sqrt{\theta/n}} \le \frac{\Delta}{\sqrt{\theta/n}}\right) \\
&= P\left(-\frac{\Delta}{\sqrt{\theta/n}} \le Z \le \frac{\Delta}{\sqrt{\theta/n}}\right) & \text{[CLT]} \\
&= 0.95
\end{align*}
Because $\frac{\Delta}{\sqrt{\theta/n}}$ represents the right bound, and the probability of being less than the right bound is 97.5% for a 95% interval (see the above picture again), we have:
\[ \frac{\Delta}{\sqrt{\theta/n}} = \Phi^{-1}(0.975) = 1.96 \implies \Delta = 1.96\sqrt{\frac{\theta}{n}} \]
Since we don't know $\theta$, we plug in our estimator $\hat\theta$, and get
\[ [\hat\theta - \Delta, \hat\theta + \Delta] = \left[\hat\theta - 1.96\sqrt{\frac{\hat\theta}{n}},\ \hat\theta + 1.96\sqrt{\frac{\hat\theta}{n}}\right] \]
That is, since $\hat\theta$ is normally distributed with mean $\theta$, we just need to find the $\Delta$ so that $\hat\theta \pm \Delta$ contains 95% of the area in a Normal distribution. The way to do so is to find $\Phi^{-1}(0.975) = 1.96$, and go $\pm 1.96$ standard deviations of $\hat\theta$ in each direction!
Definition 8.1.1: Confidence Interval
Suppose you have iid samples $x_1, \dots, x_n$ from some distribution with unknown parameter $\theta$, and you have some estimator $\hat\theta$ for $\theta$.
A $100(1-\alpha)\%$ confidence interval for $\theta$ is an interval (typically but not always) centered at $\hat\theta$, $[\hat\theta - \Delta, \hat\theta + \Delta]$, such that the probability (over the randomness in the samples $x_1, \dots, x_n$) that $\theta$ lies in the interval is $1 - \alpha$:
\[ P\left(\theta \in [\hat\theta - \Delta, \hat\theta + \Delta]\right) = 1 - \alpha \]
If $\hat\theta = \frac{1}{n}\sum_{i=1}^n x_i$ is the sample mean, then $\hat\theta$ is approximately normal by the CLT, and a $100(1-\alpha)\%$ confidence interval is given by the formula:
\[ \left[\hat\theta - z_{1-\alpha/2}\frac{\sigma}{\sqrt n},\ \hat\theta + z_{1-\alpha/2}\frac{\sigma}{\sqrt n}\right] \]
where $z_{1-\alpha/2} = \Phi^{-1}\left(1 - \frac{\alpha}{2}\right)$ and $\sigma$ is the true standard deviation of a single sample (which may need to be estimated).
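Here is a minimal sketch of this formula in code (my own, with made-up Poisson-like data):

import numpy as np
from scipy import stats

x = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])  # hypothetical samples
theta_hat = x.mean()
sigma_hat = np.sqrt(theta_hat)                # Poisson: variance = mean (estimated)
z = stats.norm.ppf(1 - 0.05 / 2)              # z_{1-alpha/2} = Phi^{-1}(0.975) ≈ 1.96
delta = z * sigma_hat / np.sqrt(len(x))
print(theta_hat - delta, theta_hat + delta)   # the 95% confidence interval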
It is important to note that this last formula ONLY works when $\hat\theta$ is the sample mean (otherwise we can't use the CLT); you'll need to find some other strategy if it isn't.
If we wanted a 95% interval, then that corresponds to $\alpha = 0.05$, since $100(1-\alpha) = 95$. We were then looking up the inverse $\Phi$ table at $1 - \alpha/2 = 1 - 0.05/2 = 0.975$ to get our desired number of standard deviations in each direction, 1.96.
If we wanted a 98% interval, then that corresponds to $\alpha = 0.02$, since $100(1-\alpha) = 98$. We then would look up $\Phi^{-1}(0.99)$, since $1 - \alpha/2 = 0.99$, because if there is to be 98% of the area in the middle, there is 1% to the left and right!
Example(s)
Suppose you have $n = 400$ iid samples $x_1, \dots, x_n$ from $\text{Ber}(\theta)$ with $\sum_{i=1}^n x_i = 136$. Construct a 99% confidence interval for $\theta$, and interpret it.
Solution
Recall for the Bernoulli distribution $\text{Ber}(\theta)$, our MLE/MoM estimator was the sample mean:
\[ \hat\theta = \frac{1}{n}\sum_{i=1}^n x_i = \frac{136}{400} = 0.34 \]
Since $\alpha = 0.01$, we look up $z_{1-\alpha/2} = \Phi^{-1}(0.995) \approx 2.576$, estimate the standard deviation of a single sample as $\sqrt{\hat\theta(1-\hat\theta)} \approx 0.474$, and get the interval
\[ \hat\theta \pm 2.576 \cdot \frac{0.474}{\sqrt{400}} \approx [0.279, 0.401] \]
You might be tempted to interpret this as: "there is a 99% probability that $\theta$ is in $[0.279, 0.401]$". This is incorrect because there is no randomness here: $\theta$ is a fixed parameter. $\theta$ is either in the interval or out of it; there's nothing probabilistic about it.
Correct: If we repeat this process several times (getting $n$ samples each time and constructing different confidence intervals), about 99% of the confidence intervals we construct will contain $\theta$.
Notice the subtle difference! Alternatively, before you receive samples, you can say that there is a 99% probability (over the randomness in the samples) that $\theta$ will fall into our to-be-constructed confidence interval $[\hat\theta - \Delta, \hat\theta + \Delta]$. Once you plug in the numbers, though, you cannot say that anymore.
Chapter 8. Statistical Inference
8.2: Credible Intervals
Slides (Google Drive) Video (YouTube)
Example(s)
Construct an 80% credible interval for $\Theta$ (the unknown probability of success) in $\text{Ber}(\Theta)$, given $n = 12$ iid samples $x = (x_1, x_2, \dots, x_{12})$ where $\sum_{i=1}^n x_i = 11$ (observed 11 successes out of 12). Suppose our prior is $\Theta \sim \text{Beta}(\alpha = 7, \beta = 3)$ (i.e., pretend we saw 6 successes and 2 failures ahead of time).
Solution From section 7.5 (MAP), we showed that choosing a Beta prior for $\Theta$ leads to a Beta posterior of $\Theta \mid x \sim \text{Beta}(11 + 7, 1 + 3) = \text{Beta}(18, 4)$, and our MAP was then $\frac{18-1}{(18-1)+(4-1)} = \frac{17}{20}$ (since we saw 17 total successes, and 3 total failures).
We want an interval $[a, b]$ such that $P(a \le \Theta \le b) = 0.8$.
If we look at the Beta PDF, we are looking for an interval such that the probability that we fall in this area is 80%. If the area is centered, then the area to the left of the interval should have probability 10%, and the area to the right of it should also have probability 10%.
This is equivalent to looking for $P(\Theta \le a) = 0.1$ and $P(\Theta \le b) = 0.9$. This information is given by the CDF of the Beta distribution. Note that on the x-axis we have the range of the Beta distribution $[0, 1]$, and on the y-axis we have the cumulative probability of being to the left, obtained by integrating the PDF from above.
Let $F_{\text{Beta}}$ denote the CDF of this $\text{Beta}(18, 4)$ distribution. Then, choose $a = F_{\text{Beta}}^{-1}(0.1) \approx 0.7089$ and $b = F_{\text{Beta}}^{-1}(0.9) \approx 0.9142$, so our credible interval is $[0.7089, 0.9142]$.
In order to compute the inverse CDF, we can use the scipy.stats library as follows:
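# one possible version of this lookup (a sketch): inverse CDF (ppf) of Beta(18, 4)
from scipy import stats

a = stats.beta.ppf(0.1, 18, 4)   # approximately 0.7089
b = stats.beta.ppf(0.9, 18, 4)   # approximately 0.9142
print(a, b)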
That's all there is to it! Just find the PDF/CDF of your posterior distribution (hopefully you chose a conjugate prior), and look up the inverse CDF at two probabilities whose difference is the desired confidence level of your credible interval.
Suppose you have iid samples $x = (x_1, \dots, x_n)$ from some distribution with unknown parameter $\Theta$. You are in the Bayesian setting, so you have chosen a prior distribution for the RV $\Theta$.
A $100(1-\alpha)\%$ credible interval for $\Theta$ is an interval $[a, b]$ such that the probability (over the randomness in $\Theta$) that $\Theta$ lies in the interval is $1 - \alpha$:
\[ P(\Theta \in [a, b]) = 1 - \alpha \]
If we've chosen the appropriate conjugate prior for the sampling distribution (like Beta for Bernoulli), the posterior is easy to compute. Say the CDF of the posterior is $F_Y$. Then, a $100(1-\alpha)\%$ credible interval is given by
\[ \left[F_Y^{-1}\left(\frac{\alpha}{2}\right),\ F_Y^{-1}\left(1 - \frac{\alpha}{2}\right)\right] \]
Again, this is the one which has equal area to the left and right of the interval, but there are infinitely many possible credible intervals you can create.
Interpretation: "there is an 80% probability that $\Theta$ lies in $[0.7089, 0.9142]$". This is correct because now $\Theta$ is a random variable, so the statement makes sense to say!
Contrast this with the interpretation of a confidence interval, where $\theta$ is a fixed number.
8.2.3 Exercises
1. Let $x = (x_1, \dots, x_n)$ be iid samples from $\text{Exp}(\Theta)$ where $\Theta$ is a random variable (not fixed). Recall from section 7.5 Exercise 1 that if we choose the prior distribution $\Theta \sim \text{Gamma}(r, \lambda)$, then the posterior distribution is $\Theta \mid x \sim \text{Gamma}(n + r, \lambda + \sum x_i)$.
Suppose $n = 13$, $\bar x = 0.21$, $r = 7$, $\lambda = 12$. Construct a 96% credible interval for $\Theta$. To find the point $t$ such that $F_T(t) = y$ for $T \sim \text{Gamma}(u, v)$, call the following function which gets the inverse CDF:
scipy.stats.gamma.ppf(y, u, 0, 1/v)
Then, verify that the MAP estimate is actually contained in your credible interval.
Solution: Before we call the function, we have to identify what $u$ and $v$ are. Plugging in the numbers above to the general posterior we computed earlier, we find
\[ \Theta \mid x \sim \text{Gamma}\left(13 + 7,\ 12 + 13 \cdot 0.21\right) = \text{Gamma}(20, 14.73) \]
Since we want a 96% interval, we must look up the inverse CDF at 0.02 and 0.98 (why?).
We write a few lines of code, calling the provided function twice:

>>> from scipy.stats import gamma
>>> gamma.ppf(0.02, 20, 0, 1/14.73)  # inverse cdf of Gamma(20, 14.73)
0.809150510196322
>>> gamma.ppf(0.98, 20, 0, 1/14.73)  # inverse cdf of Gamma(20, 14.73)
2.0514641398722735

Our 96% credible interval is hence $[0.809, 2.051]$. The MAP estimate is the mode of the posterior, $\frac{(n+r)-1}{\lambda + \sum x_i} = \frac{19}{14.73} \approx 1.290$, which is indeed contained in the interval.
Chapter 8. Statistical Inference
8.3: Hypothesis Testing
Slides (Google Drive) Video (YouTube)
Hypothesis testing allows us to "statistically prove" claims. For example, if a drug company wants to claim that their new drug reduces the risk of cancer, they might perform a hypothesis test. Or a company might want to argue that their academic prep program leads to a higher SAT score. A lot of business decisions are reliant on this statistical method of hypothesis testing, and we'll see how to conduct tests properly below.
Suppose your friend Mark, a magician, claims his coin is fair, but when he flips it 100 times, it comes up heads 99 times. Let's give Mark the benefit of the doubt: we'll compute the probability that we observed an outcome at least as extreme as this, given that Mark isn't lying.
If Mark isn't lying, then the coin is fair, so the number of heads observed should be $X \sim \text{Bin}(100, 0.5)$, because there are 100 independent trials and a 50% chance of heads since it's fair. So, the probability that we observe at least 99 heads (because we're looking for something at least as extreme) is the sum of the probability of 99 heads and the probability of 100 heads. You just sum the Binomial PMF and you get:
\[ P(X \ge 99) = \binom{100}{99}(0.5)^{99}(1 - 0.5)^1 + \binom{100}{100}(0.5)^{100} = \frac{101}{2^{100}} \approx 7.96 \times 10^{-29} \approx 0 \]
Basically, if the coin were fair, the probability of what we just observed (99 heads or more) is basically 0.
This is strong statistical evidence that the coin is NOT fair. Our assumption was that the coin is fair, but if
this were the case, observing such an extreme outcome would be extremely unlikely. Hence, our assumption
is probably wrong.
So, this is like a "Probabilistic Proof by Contradiction"!
1. Make a claim (like "Airplane food is good", "Pineapples belong on pizza", etc.)
• Our example will be that SuperSAT Prep claims that their program helps students perform better on the SAT. (The average SAT score as of June 2020 was 1059 out of 1600, and the standard deviation of SAT scores was 210.)
2. Set up a null hypothesis $H_0$ and alternative hypothesis $H_A$.
(a) The alternative hypothesis can be one-sided or two-sided.
• Let $\mu$ be the true mean of the SAT scores of students of SuperSAT Prep.
• Our null hypothesis is $H_0: \mu = 1059$, which is our "baseline", "no effect", "benefit of the doubt". We're going to assume that the true mean of our scores is the same as the nationwide scores (for the sake of contradiction).
• Our alternative hypothesis is what we want to show, which is $H_A: \mu > 1059$, or that SuperSAT Prep is good and that their test takers are (strictly) better off. So, our alternative will assert that $\mu > 1059$.
• This is called a one-sided hypothesis. The other one-sided hypothesis would be $\mu < 1059$ (if we wanted to argue that SuperSAT Prep makes students worse off).
• A two-sided hypothesis would be $\mu \ne 1059$, because it covers both sides (less than or greater than). This is if we wanted to argue that SuperSAT Prep makes some difference, for better or worse.
3. Choose a significance level $\alpha$ (usually $\alpha = 0.05$ or $0.01$).
• Let's choose $\alpha = 0.05$ and explain this more later!
4. Collect data.
• We observe 100 students from SuperSAT Prep, $x_1, \dots, x_{100}$. It turns out the sample mean of the scores is $\bar x = 1104$.
5. Compute a p-value, $p = P(\text{observing data at least as extreme as ours} \mid H_0 \text{ is true})$.
• Again, since we're assuming $H_0$ is true (that SuperSAT has no effect), our true mean $\mu$ is 1059 (again, we do this in hopes of reaching a "probabilistic contradiction"). By the CLT, since $n = 100$ is large, the distribution of the sample mean of 100 samples is approximately normal with mean 1059 and variance $\frac{210^2}{100}$ (because the variance of a single test taker was given to be $\sigma^2 = 210^2$, and so the variance of the sample mean is $\frac{\sigma^2}{n}$):
\[ \bar X \approx N\left(\mu = 1059,\ \sigma^2 = \frac{210^2}{100}\right) \]
So, then, the p-value is the probability that an arbitrary sample mean would be at least as extreme as the one we computed, which was 1104. So, we can just standardize and look up a table like always, which is a procedure you know how to do:
\[ p = P\left(\bar X \ge \bar x\right) = P\left(\frac{\bar X - \mu}{\sigma/\sqrt n} \ge \frac{\bar x - \mu}{\sigma/\sqrt n}\right) = P\left(Z \ge \frac{1104 - 1059}{210/\sqrt{100}}\right) = P(Z \ge 2.14) \approx 0.0162 \]
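A quick sketch of this computation in code (my own; scipy's norm.cdf does the z-table lookup):

import numpy as np
from scipy import stats

mu0, sigma, n, xbar = 1059, 210, 100, 1104
z = (xbar - mu0) / (sigma / np.sqrt(n))
print(z, 1 - stats.norm.cdf(z))   # about 2.14 and 0.016 (one-sided p-value)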
6. State your conclusion.
(a) If $p < \alpha$, "reject" the null hypothesis $H_0$ in favor of the alternative $H_A$. (Because, given the null hypothesis is true, the probability of what we saw happening (or something more extreme) is $p$, which is less than some small number $\alpha$.)
(b) Otherwise, "fail to reject" the null hypothesis $H_0$.
• Since $p = 0.0162 < 0.05 = \alpha$, we'll reject the null hypothesis $H_0$ at the $\alpha = 0.05$ significance level. We can say that there is strong statistical evidence to suggest that SuperSAT Prep actually helps students perform better on the SAT.
Notice that if we had chosen $\alpha = 0.01$ earlier instead of 0.05, we would have a different conclusion: since $p = 0.0162 > 0.01 = \alpha$, we fail to reject the null hypothesis at the $\alpha = 0.01$ significance level. There is insufficient evidence to prove that SuperSAT Prep actually helps students perform better.
Note that we'll NEVER say we "accept" the null hypothesis. If you recall the coin example, if we had observed 55 heads instead of 99, that wouldn't have been improbable. We wouldn't have called the magician a liar, but it does NOT imply that $p = 0.5$. It could have been 0.54 or 0.58, for example.
8.3.4 Exercises
1. You want to determine whether or not more than 3/4 of Americans would vote for George Washington for President in 2020 (if he were still alive). In a random poll sampling $n = 137$ Americans, we collected responses $x_1, \dots, x_n$ (each is 1 or 0, according to whether they would vote for him or not). We observe 131 "yes" responses: $\sum_{i=1}^n x_i = 131$. Perform a hypothesis test and state your conclusion.
Solution: We have our claim that "Over 3/4 of Americans would vote for George Washington for President in 2020 (if he were still alive)."
Let $p$ denote the true proportion of Americans that would vote for Washington. Then our null and alternative hypotheses are:
\[ H_0: p = 0.75 \qquad H_A: p > 0.75 \]
Our sample proportion is $\hat p = \frac{131}{137} \approx 0.956$. Under $H_0$, by the CLT, $\hat p \approx N\left(0.75, \frac{0.75 \cdot 0.25}{137}\right)$, so the p-value is
\[ P(\hat p \ge 0.956) = P\left(Z \ge \frac{0.956 - 0.75}{\sqrt{0.75 \cdot 0.25 / 137}}\right) \approx P(Z \ge 5.57) \approx 1.3 \times 10^{-8} \approx 0 \]
With a p-value so close to 0 (and certainly $< \alpha = 0.01$), we reject the null hypothesis that (only) 75% of Americans would vote for Washington. There is strong evidence that this proportion is actually larger.
Note: Again, what we did was: assume $p = 0.75$ (null hypothesis), then note that the probability of observing data so extreme (in fact, very close to 100% of people) was nearly 0. Hence, we reject this null hypothesis because what we observed would've been so unlikely if it were true.
Chapter 9: Applications to Computing
9.1: Intro to Python Programming
Slides (Google Drive) Video (YouTube)
9.1.1 Python
For this section only, I’ll ask you to use the slides linked above. There are a lot of great animations and
visualizations! We assume you know some programming language (such as Java or C++) beforehand, and
are merely teaching you the new syntax and libraries.
Python is the language of choice for anything related to scientific computing, data science, and machine
learning. It is also sometimes used for website development among many other things! It has extremely
powerful syntax and libraries - I came from Java and was adamant on having that be my main language.
But once I saw the elegance of Python, I never went back! I’m not saying that Python is “absolutely better”
than Java, but for our applications involving probability and math, it definitely is!
Chapter 9: Applications to Computing
9.2: Probability via Simulation
Slides (Google Drive) Video (YouTube)
9.2.1 Motivation
Even though we have learned several techniques for computing probabilities, and have more to go, it is still hard sometimes. Imagine I asked the question: "Suppose I randomly shuffle an array of the first 100 integers in order: [1, 2, ..., 100]. What is the probability that exactly 13 end up in their original position?" I'm not even sure I could solve this problem, and if so, it wouldn't be pretty to set up nor actually type into a calculator.
But since you are a computer scientist, you can actually avoid computing hard probabilities! You could even verify that your hand-computed answers are correct using this technique of "Probability via Simulation".
Suppose a weighted coin comes up heads with probability 1/3. How many flips do you think it will
take for the first head to appear? Use code to estimate this average!
Solution You may think it is just 3, and you would be correct! We’ll see how to prove this mathematically
in chapter 3 actually. But for now, since we don’t have the tools to compute it, let’s use our programming
skills!
The first thing we need to do is simulate a single coin flip. Recall that to generate a random number, we use the numpy library in Python:

np.random.rand()  # returns a single float in the range [0, 1)
This might be a bit tricky: since np.random.rand() returns a random float in [0, 1), the expression np.random.rand() < p is True with probability exactly p! For example, if p = 1/2, then np.random.rand() < 1/2 happens with probability 1/2, right? In our case we'll want p = 1/3, so the condition holds with probability 1/3.
This allows us to simulate the event in question: the first "Heads" appears whenever np.random.rand() returns a value < p. And, if the value is >= p, the coin flip turned up "Tails".
The following function allows us to simulate ONCE how long it took to get heads.
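One possible reconstruction of that function (a sketch; the slides' version may differ slightly):

import numpy as np

def sim_one_game(p=1/3) -> int:
    flips = 0
    while True:
        flips += 1
        if np.random.rand() < p:   # this flip came up heads
            return flips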
We start with our number of flips being 0. And we keep incrementing flips until we get a head. So this
should return an integer! We just need to simulate this game many times (call this function many times),
and take the average of our samples! Then, this should give us a good approximation of the true average
time (which happens to be 3)!
The code above is duplicated below, as a helper function. Python is great because you can define functions
inside other functions, only visible to the parent function!
import numpy as np

def coin_flips(p, ntrials=50000) -> float:

    def sim_one_game() -> int:   # internal helper function
        flips = 0
        while True:
            flips += 1
            if np.random.rand() < p:
                return flips

    total_flips = 0
    for i in range(ntrials):
        total_flips += sim_one_game()
    return total_flips / ntrials

print(coin_flips(p=1/3))
Notice the helper function is the exact same as above! All we did was call it ntrials times and return the
average number of flips per trial. This is it! The number 50000 is arbitrary: any large number of trials is
good!
Now to tackle the original problem:
Example(s)
Suppose I randomly shuffle an array of the first 100 integers in order: [1, 2, ..., 100]. What is the probability that exactly 13 end up in their original position? Use code to estimate this probability!
Hint: Use np.random.shuffle to shuffle an array randomly.
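One possible solution sketch (my own version; the course slides contain the original):

import numpy as np

def prob_exactly_k_fixed(n=100, k=13, ntrials=50000) -> float:
    original = np.arange(1, n + 1)
    count = 0
    for _ in range(ntrials):
        shuffled = original.copy()
        np.random.shuffle(shuffled)                       # random permutation, in place
        if np.count_nonzero(shuffled == original) == k:   # exactly k fixed points?
            count += 1
    return count / ntrials

print(prob_exactly_k_fixed())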
Take a look and see how similar this was to the previous example!
Chapter 9: Applications to Computing
9.3: The Naive Bayes Classifier
Slides (Google Drive) Video (YouTube)
9.3.1 Motivation
Have you ever wondered how Gmail knows whether or not an email should be marked as spam? Or how Alexa/Google Home can answer your free-form questions? How self-driving cars actually work? How social media platforms recommend friends and people you should follow? How the computer program Deep Blue beat the chess champion Garry Kasparov? The answer to all of these questions is: machine learning (ML)!
After learning just a tiny bit of probability, we are ready to discover one way to solve one extremely important type of ML task: classification. In particular, we'll learn how to take in an email (a string/text), and predict whether it is "Spam" or "Ham". We will discuss this further shortly!
It's okay if you didn't see the pattern, but we should predict 16; can you figure out why? It seems that the pattern is to take the number and multiply it by the number of sides in the shape! So for our last row, we take 4 and multiply by 4 (the number of sides of the square) to get 16. Sure, there is a possibility that this isn't the right function: this is only the simplest explanation we could give. The function could be some complex polynomial, in which case we would be completely wrong.
This is the idea of (supervised) machine learning (ML): given some training examples, we want to learn the pattern between the input features and output label and be able to have a computer predict the label on new/unseen examples. Above, our input features were number and shape. We want the computer to "learn" just like how we do: with several examples.
Within supervised ML, two of the largest subcategories are regression and classification. Regression refers to predicting a continuous (decimal) value. For example, predicting house price given features of the house, or predicting weight from height. Classification on the other hand refers to predicting one of a finite number of classes. For example, predicting whether an email is spam or ham, or whether an image contains a cat or a dog.
Example(s)
For each of the situations below with a desired output label, identify whether it would be a classifi-
cation or regression task. Then, describe what input features may be useful in making a prediction.
1. Predicting the price of a house.
2. Predicting whether or not a PhD applicant will be admitted.
3. Predicting which of 50 menu items someone will order.
Solution
1. This is a regression task, since we are predicting a continuous number like $310,321.55 or $1,235,998.23.
Some features which would be useful for prediction include: square footage, age, location, number of
bedrooms/bathrooms, number of stories, etc.
2. This is a classification task, since we are predicting one of two outcomes: admitted or not. Features which
may be important are: GPA, SAT score, recommendation letter quality, number of papers published,
number of internships, etc.
3. This is a classification task since we are choosing from one of 50 classes. Important features may
include: past order history, favorite cuisine, dietary restrictions, income, etc.
So how do we write the code to make the decision for us? In the past, people tried writing these classifiers with a set of rules that they came up with themselves. For example: if it is over 1000 words, predict "SPAM". Or if it contains the word 'Viagra', predict that it is "SPAM". This leads to code which looks like a ton of if-else statements, and is also not very accurate. In machine learning, we come up with a model that learns a decision-making rule for us! This may not make sense now, but I promise it will soon.
That is, we will reduce an email into a Set of lowercase words and nothing else! We’ll see a potential
drawback to this later, but despite these strong assumptions, the classifier still does a really good job!
Here are some examples of how we take the input string (email) to a Set of standardized words.
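For instance, a tiny preprocessing sketch (my own; the book's actual helper may differ):

import re

def wordset(email: str) -> set:
    # lowercase the email and keep only the set of distinct words
    return set(re.findall(r"[a-z']+", email.lower()))

print(wordset("Buy VIAGRA now!! You buy."))   # {'buy', 'viagra', 'now', 'you'}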
That all sounds nice, but how do we even begin to compute such a quantity? Let's try Bayes' Theorem with the Law of Total Probability and see where that gets us:
\[ P(\text{spam} \mid \{\text{you, buy, viagra}\}) = \frac{P(\{\text{you, buy, viagra}\} \mid \text{spam})\,P(\text{spam})}{P(\{\text{you, buy, viagra}\} \mid \text{spam})\,P(\text{spam}) + P(\{\text{you, buy, viagra}\} \mid \text{ham})\,P(\text{ham})} \]
How does this even help?? This looks way worse than before... Let's see if we can't start by figuring out the "easier" terms, like $P(\text{spam})$. Remember, we haven't even touched our data yet. Let's assume we were given five examples of emails with their labels to learn from:
Based on the data only, what would you estimate $P(\text{spam})$ to be? I might guess 3/5, and hope that you matched that! That is,
\[ P(\text{spam}) \approx \frac{\#\text{ of spam emails}}{\#\text{ of total emails}} \]
Similarly, we might estimate
\[ P(\text{ham}) \approx \frac{\#\text{ of ham emails}}{\#\text{ of total emails}} \]
to be 2/5 in our case. Great, so we've figured out two out of the four terms we needed after using Bayes/LTP. Now, we might try to similarly guess that
\[ P(\{\text{you, buy, viagra}\} \mid \text{spam}) \approx \frac{\#\text{ of spam emails containing all three words}}{\#\text{ of spam emails}} \]
because our definition of conditional probability came intuitively with equally likely outcomes in 2.1 as
\[ P(A \mid B) = \frac{|A \cap B|}{|B|} = \frac{P(A \cap B)}{P(B)} \]
But how many spam emails are we going to get that contain all three words? Probably none, or very few. In general, most emails will be much longer, so there's almost no chance that an email you are given to learn from has ALL of the words. This is a problem because it makes this probability 0, which isn't good for our model.
The Naive Bayes name comes from two parts. We've seen the Bayes part because we used Bayes' Theorem to (attempt to) compute our desired probability. We are at a roadblock now, and so we will make the "naive" assumption that words are conditionally independent GIVEN the label; in general,
\[ P(A, B, C \mid D) = P(A \mid D)\,P(B \mid D)\,P(C \mid D) \]
so that, in our case,
\[ P(\{\text{you, buy, viagra}\} \mid \text{spam}) \approx P(\text{you} \mid \text{spam})\,P(\text{buy} \mid \text{spam})\,P(\text{viagra} \mid \text{spam}) \]
which is most likely nonzero if we have a lot of emails! What should $P(\text{you} \mid \text{spam})$ be? It is 1/3: there is just one spam email out of three which contains the word "you". In general,
\[ P(\text{word} \mid \text{spam}) \approx \frac{\#\text{ of spam emails containing the word}}{\#\text{ of spam emails}} \]
Example(s)
Make a prediction as to whether this email is SPAM or HAM, using the Naive Bayes classifier! Do
this by computing P (spam | {you, buy, viagra}) and comparing it to 0.5. Don’t forget to use the
conditional independence assumption!
Solution Combining what we had earlier (Bayes + LTP) with the (naive) conditional independence assumption, we get
\[ P(\text{spam} \mid \{\text{you, buy, viagra}\}) = \frac{P(\{\text{you, buy, viagra}\} \mid \text{spam})\,P(\text{spam})}{P(\{\text{you, buy, viagra}\} \mid \text{spam})\,P(\text{spam}) + P(\{\text{you, buy, viagra}\} \mid \text{ham})\,P(\text{ham})} \]
\[ = \frac{P(\text{you} \mid \text{spam})P(\text{buy} \mid \text{spam})P(\text{viagra} \mid \text{spam})P(\text{spam})}{P(\text{you} \mid \text{spam})P(\text{buy} \mid \text{spam})P(\text{viagra} \mid \text{spam})P(\text{spam}) + P(\text{you} \mid \text{ham})P(\text{buy} \mid \text{ham})P(\text{viagra} \mid \text{ham})P(\text{ham})} \]
We need to compute a bunch of quantities, but notice the left side of the denominator is the same as the numerator, so we need to compute 8 quantities, 3 of which we did earlier! I'll just skip to the solution:
\[ P(\text{spam}) = \frac{3}{5} \qquad P(\text{ham}) = \frac{2}{5} \]
\[ P(\text{you} \mid \text{spam}) = \frac{1}{3} \qquad P(\text{you} \mid \text{ham}) = \frac{1}{2} \]
\[ P(\text{buy} \mid \text{spam}) = \frac{1}{3} \qquad P(\text{buy} \mid \text{ham}) = \frac{0}{2} \]
\[ P(\text{viagra} \mid \text{spam}) = \frac{3}{3} \qquad P(\text{viagra} \mid \text{ham}) = \frac{1}{2} \]
Once we plug in all these quantities, we end up with a probability of 1, because $P(\text{buy} \mid \text{ham}) = 0$ killed the entire right side of the denominator! It turns out then that we should predict spam because $P(\text{spam} \mid \{\text{you, buy, viagra}\}) = 1 > 0.5$, and this is correct! We still don't ever want zeros though, so we'll see how we can fix that soon!
Notice how the data (example emails) completely dictated our decision rule, along with Bayes Theorem and
Conditional Independence. That is, we learned from our data, and used it to make conclusions on new
data!
One last final thing, to avoid zeros, we will apply the following trick called “Laplace Smoothing”. Before,
we had said that
$$P(\text{word} \mid \text{spam}) \approx \frac{\text{\# of spam emails with word}}{\text{\# of spam emails}}$$
We will now pretend we saw TWO additional spam emails: one which contained the word, and one which
did not. This means instead that we have
$$P(\text{word} \mid \text{spam}) \approx \frac{\text{\# of spam emails with word} + 1}{\text{\# of spam emails} + 2}$$
This will ensure that we don't get any zeros! For example, P(buy | ham) was previously 0/2 (none of the two
ham emails contained the word "buy"), but now it is (0 + 1)/(2 + 2) = 1/4.
We do not usually apply Laplace smoothing to the label probabilities P(spam) and P(ham) since these will
never be zero anyway (and it wouldn't make much difference if we did).
Example(s)
Redo the example from earlier, but now apply Laplace smoothing to ensure no zero probabilities. Do
not apply it to the label probabilities.
Solution Basically, we just take the same numbers from above and add 1 to the numerator and 2 to the
denominator!
$$P(\text{spam}) = \frac{3}{5} \qquad P(\text{ham}) = \frac{2}{5}$$
$$P(\text{you} \mid \text{spam}) = \frac{1+1}{3+2} = \frac{2}{5} \qquad P(\text{you} \mid \text{ham}) = \frac{1+1}{2+2} = \frac{2}{4}$$
$$P(\text{buy} \mid \text{spam}) = \frac{1+1}{3+2} = \frac{2}{5} \qquad P(\text{buy} \mid \text{ham}) = \frac{0+1}{2+2} = \frac{1}{4}$$
$$P(\text{viagra} \mid \text{spam}) = \frac{3+1}{3+2} = \frac{4}{5} \qquad P(\text{viagra} \mid \text{ham}) = \frac{1+1}{2+2} = \frac{2}{4}$$
Plugging these in gives P(spam | {you, buy, viagra}) ≈ 0.7544 > 0.5, so our prediction is unchanged! But it
is better for probabilities never to be exactly one or zero, so this solution is preferred!
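To make all of this concrete, here is a minimal sketch in Python of the estimates above. The five training emails themselves aren't shown here (only their word counts), so the word sets below are hypothetical, chosen only to reproduce the counts we used: 3 spam and 2 ham emails, with one spam containing "you", one containing "buy", and all three containing "viagra".

```python
# A minimal sketch of Naive Bayes with Laplace smoothing on a tiny
# hypothetical dataset matching the counts above. Each email is
# represented as its set of standardized words.
spam_emails = [{"you", "viagra"}, {"buy", "viagra"}, {"viagra"}]  # hypothetical
ham_emails = [{"you"}, {"viagra"}]                                # hypothetical

def smoothed_word_prob(word, emails):
    """P(word | label) with Laplace smoothing: pretend we saw two extra
    emails with this label, one containing the word and one without it."""
    count = sum(1 for email in emails if word in email)
    return (count + 1) / (len(emails) + 2)

p_spam = len(spam_emails) / (len(spam_emails) + len(ham_emails))  # 3/5, unsmoothed
p_ham = 1 - p_spam                                                # 2/5

def predict_spam_prob(words):
    """P(spam | words) via Bayes + LTP + the conditional independence assumption."""
    spam_term, ham_term = p_spam, p_ham
    for w in words:
        spam_term *= smoothed_word_prob(w, spam_emails)
        ham_term *= smoothed_word_prob(w, ham_emails)
    return spam_term / (spam_term + ham_term)

print(predict_spam_prob({"you", "buy", "viagra"}))  # approximately 0.7544
```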
That's it for the main idea! We're almost there now, just some logistics.

Suppose we learned from a dataset of 1000 labeled emails. If we measured the accuracy on those same 1000
emails, surely it will be very good, right? It's like taking a practice test and then using that as your actual test - of course you'll do well! What
we care about is how well the spam filter works on NEW or UNSEEN emails. Emails that the spam filter
was not allowed to see/use when estimating those probabilities. This is fair and more realistic now right?
You get practice exams, as many as you want, but you are only evaluated once on an exam you (hopefully)
haven’t seen before!
Where do we get these new/unseen emails? We actually take our initial 1000 emails and do a train/test
split (usually around 80/20 split). That means, we will use 800 emails to estimate those quantities, and
measure the accuracy on the remaining 200 emails. The 800 emails we learn from are collectively called the
training set, and the 200 emails we test on are collectively called the test set.
This is good because we care how our classifier does on new examples, and so when doing machine learning,
we ALWAYS split our data into separate training/testing sets!
Disclaimer: Accuracy is typically not a good measure of performance for classification. Look into F1-Score
and AUROC instead if you are interested! Since this isn’t a ML class, we will stick with plain accuracy for
simplicity.
9.3.3.5 Summary
Here’s a summary of everything we just learned:
Suppose we are given a set of emails WITH their labels (of spam or ham). We split into a training
set with around 80% of the data, and a test set with the remaining 20%.
Suppose we are given an email with wordset {w1, ..., wk} and want to make a prediction. We
compute, using Bayes' Theorem, the law of total probability, and our naive assumption that words are
conditionally independent given their label:

$$P(\text{spam} \mid \{w_1, \dots, w_k\}) = \frac{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam})}{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam}) + P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})}$$

and predict spam if this probability is greater than 0.5 (and ham otherwise).
To get a fair measure of performance, make predictions using the above procedure on all the
TEST emails and return the overall test accuracy.
One last computational issue: if an email contains many distinct words (large k), we are multiplying a bunch
of numbers between 0 and 1, and so we will get some very very small number
(close to zero). When numbers get too large on a computer (exceeding 2^63 or so), it is called
overflow, and results in weird and wrong arithmetic. Our problem is the appropriately named underflow, as
we can't handle the precision of numbers so close to zero.
This is the last thing we need to figure out (I promise). Remember that our two probabilities P(spam | {w1, ..., wk})
and P(ham | {w1, ..., wk}) summed to 1, so we only needed to compute one of them. Let's go back to computing both, and just comparing which is larger:
$$P(\text{spam} \mid \{w_1, \dots, w_k\}) = \frac{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam})}{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam}) + P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})}$$
$$P(\text{ham} \mid \{w_1, \dots, w_k\}) = \frac{P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})}{P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam}) + P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})}$$
Notice the denominators are equal: they are both just P({w1, ..., wk}). So, P(spam | {w1, ..., wk}) >
P(ham | {w1, ..., wk}) if and only if the corresponding numerator is greater:

$$P(\text{spam}) \prod_{i=1}^{k} P(w_i \mid \text{spam}) > P(\text{ham}) \prod_{i=1}^{k} P(w_i \mid \text{ham})$$
Each side is still a tiny product though, so the final trick is to take logs of both sides. Since log is monotone
increasing, the comparison above is equivalent to

$$\log P(\text{spam}) + \sum_{i=1}^{k} \log P(w_i \mid \text{spam}) > \log P(\text{ham}) + \sum_{i=1}^{k} \log P(w_i \mid \text{ham})$$

and a sum of logs stays at a very manageable magnitude. And that's it, problem solved! If our initial quantity (after multiplying 50 word probabilities) was something
like P(spam | {w1, ..., wk}) ≈ 10^{−81}, then log P(spam | {w1, ..., wk}) ≈ −186.51. There is no chance of
underflow anymore!
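In code, the log-space comparison might look like the following sketch, reusing the hypothetical `smoothed_word_prob`, `p_spam`, `p_ham`, and email sets from the earlier snippet:

```python
import math

def predict_label(words):
    """Compare log-numerators instead of raw products to avoid underflow."""
    log_spam = math.log(p_spam) + sum(
        math.log(smoothed_word_prob(w, spam_emails)) for w in words)
    log_ham = math.log(p_ham) + sum(
        math.log(smoothed_word_prob(w, ham_emails)) for w in words)
    return "spam" if log_spam > log_ham else "ham"

print(predict_label({"you", "buy", "viagra"}))  # spam
```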
After reading Chapter 7: do you see how MLE/MAP were used here? We used MLE to estimate P(spam)
and P(ham). We also used MAP to estimate all the P(wi | spam) as well, with a Beta(2, 2) prior: pretending
we saw 1 of each success and failure. Naive Bayes actually required us to estimate all these different Bernoulli
parameters, and it's great to come back and see!
Chapter 9: Applications to Computing
9.4: Bloom Filters
Slides (Google Drive) Video (YouTube)
9.4.1 Motivation
Google Chrome has a huge database of malicious URLs, but it takes a long time to do a database lookup
(think of this as a typical Set, but on a different computer than yours). As you may know, Sets have
desirable constant-time lookup, but due to the fact it isn’t on your computer, the time bottleneck comes
from the communication between the database and your computer. They want to have a quick check in the
web browser itself (on your computer), so a space-efficient data structure must be used.
That is, we want to save both time (not in the typical big-Oh sense) and space. But what will we trade
for it? It turns out we will have limited operations (fewer than a Set), and some probability of error which
turns out to be fine.
9.4.2 Definition
A bloom filter is a probabilistic data structure which only supports the following two operations:
I. add(x): Add an element x to the structure.
II. contains(x): Check if an element x is in the structure. It either returns "definitely not in the set" or
"could be in the set".
It does not support the following two operations:
I. Delete an element from the structure.
II. Give a collection of elements that are in the structure.
The idea is that we can check our bloom filter if a URL is in the set. The bloom filter is always correct in
saying a URL definitely isn’t in the set, but may have false positives (it may say a URL is in the set when it
isn’t). So most of the time, we get instant time, and only in these rare cases does Chrome have to perform
an expensive database lookup to know for sure.
Suppose we have k bit arrays t1 , . . . , tk each of length m (all entries are 0 or 1), so the total space required
is only km bits or km/8 bytes (as a byte is 8 bits). See below for one with k = 3 arrays of length m = 5:
So regardless of the number of elements n that we want to store in our bloom filter, we use the same
amount of memory! That being said, the higher n is for a fixed k and m, the higher your error rate will be.
Suppose the universe of URL’s is the set U (think of this as all strings with less than 100 characters),
and we have k independent and uniform hash functions h1, ..., hk : U → {0, 1, ..., m − 1}. That is, for
an element x and hash function hi, pretend hi(x) is a discrete Unif(0, m − 1) random variable. Basically,
when we see a new URL, we will add it to one random entry per row of our bloom filter.
See the image below to see how we add the URL “thisisavirus.com” into our bloom filter.
For each of our k = 3 hash functions (corresponding to each row), we hash our URL x as hi(x) to get a
random integer from {0, 1, ..., 4} (0 to m − 1). It happened that h1(x) = 2, h2(x) = 1, and h3(x) = 4 in this
example: each hash function is independent of the others and chooses a position uniformly at random.
But if we hash the same URL, we will get the same hash. In other words, if I tried to add this URL
one more time, nothing would change because all the entries were already set to 1. Notice we never “unset”
an entry: once a URL sets an entry to 1, it will stay 1 forever.
Now let’s see how the contains function is implemented. When we check whether the URL we just added
is contained in the bloom filter, we should definitely return yes.
We say that a URL x is contained in the bloom filter, if when we apply each hash function hi (x), the
corresponding entries are already set to 1. We added this URL “thisisavirus.com” right before this, so we
are guaranteed that t1 [2] == 1, t2 [1] == 1, and t3 [4] == 1, and so we return TRUE overall! You might now
see how this could lead to false positives: returning TRUE even though the URL was never added! Don’t
worry if not, we’ll see some examples below.
That’s all there is for bloom filters!
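If you imagine coding this up, a sketch might look as follows. The "uniform hash functions" here are a stand-in built by salting a cryptographic hash with the row index i; treat the specific hashing scheme as an assumption, not part of the definition.

```python
import hashlib

class BloomFilter:
    def __init__(self, k, m):
        self.k, self.m = k, m
        self.t = [[0] * m for _ in range(k)]  # k bit arrays, each of length m

    def _hash(self, i, x):
        # Stand-in for the i-th uniform hash function h_i: U -> {0, ..., m-1}.
        digest = hashlib.sha256(f"{i}|{x}".encode()).hexdigest()
        return int(digest, 16) % self.m

    def add(self, x):
        for i in range(self.k):
            self.t[i][self._hash(i, x)] = 1  # set one bit per row; never unset

    def contains(self, x):
        # "Could be in the set" only if every row's corresponding bit is 1.
        return all(self.t[i][self._hash(i, x)] == 1 for i in range(self.k))

bf = BloomFilter(k=3, m=5)
bf.add("thisisavirus.com")
print(bf.contains("thisisavirus.com"))  # always True once added
print(bf.contains("somesafeurl.com"))   # usually False, but could be a false positive
```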
Example(s)
Suppose we add a second URL to the bloom filter above (one whose hashes collide with an entry that
is already set), and then call contains on a third URL that was never added. What can happen?

Solution
Notice that t3[4] was already set to 1 by the previous entry, and that's okay! We just leave it set to 1.
Notice here we got a false positive: that means, saying a URL is in the bloom filter when it wasn't. This
is a tradeoff we make in exchange for using much less space.
9.4.3 Analysis
You might be dying to know, what is the false positive rate (FPR) for a bloom filter, and how should I
choose k and m? These are great questions, and we actually have the tools to figure this out already.
After inserting n distinct URLs into a k × m bloom filter (k hash functions/rows, m columns), suppose
we had a new URL and wanted to check whether it was contained in the bloom filter. The false
positive rate (the probability the bloom filter returns True incorrectly) is

$$\left(1 - \left(1 - \frac{1}{m}\right)^{n}\right)^{k}$$
Proof of Bloom Filter FPR. We get a match for a new URL x if, in each row, the bit chosen by the hash
function hi(x) is set to 1.

For i = 1, ..., k, let Ei be the event that bit hi(x) in row i is already set to 1. Then,

$$P(\text{false positive}) = P(E_1 \cap E_2 \cap \dots \cap E_k) = \prod_{i=1}^{k} P(E_i)$$
where the last equality is because each hash function is assumed to be independent of the others.
Now, let's focus on a single row i (all the rows are the "same"). The probability P(Ei) that the bit is set to 1
is the probability that at least one of the n URLs hashed to that entry. Seeing "at least one" should
tell you: try the complement instead (otherwise, use inclusion-exclusion)!
So the probability a bit remains at 0 after n entries are added (the event $E_i^C$) is

$$P\left(E_i^C\right) = \left(1 - \frac{1}{m}\right)^{n}$$

because the probability of missing this bit for a single URL is 1 − 1/m. Hence,

$$P(E_i) = 1 - P\left(E_i^C\right) = 1 - \left(1 - \frac{1}{m}\right)^{n}$$
Finally, combining this result with the previous gives our final answer, since each row has the same probability:

$$P(\text{false positive}) = \prod_{i=1}^{k} P(E_i) = \left(1 - \left(1 - \frac{1}{m}\right)^{n}\right)^{k}$$
So n, the number of malicious URLs Google Chrome would like to store, should definitely play a
part in how large they choose k and m to be.
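Since the FPR formula is so simple, it's worth a one-line function to play with the parameters (a sketch; the example values below are just illustrative):

```python
def bloom_fpr(n, k, m):
    """False positive rate after inserting n distinct items into a bloom
    filter with k hash functions (rows) and m columns per row."""
    return (1 - (1 - 1 / m) ** n) ** k

print(bloom_fpr(n=5_000_000, k=30, m=900_000))     # rows too crowded: about 0.89
print(bloom_fpr(n=5_000_000, k=30, m=50_000_000))  # roomier rows: essentially 0
```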
Let’s now see (by example) the kind of time and space improvement we can get.
Example(s)
1. Let’s compare this approach to using a typical Set data structure. Google wants to store
5 million URLs, with each URL taking (on average) 40 bytes. How much space (in MB, 1
MB = 1 million bytes) is required if we store all the elements in a set? How much space (in
MB) is required if we store all the elements in a bloom filter with k = 30 hash functions and
m = 900, 000 buckets? Recall that 1 byte = 8 bits.
2. Let’s analyze the time improvement as well. Let’s say an average Chrome user attempts to
visit 102,000 URLs in a year, only 2,000 of which are actually malicious. Suppose it takes
half a second for Chrome to make a call to the database (the Set), and only 1 millisecond for
Chrome to check containment in the bloom filter. Suppose the false positive rate on the bloom
filter is 3%; that is, if a website is not malicious, the bloom filter will incorrectly report
it as malicious with probability 0.03. What is the time (in seconds) taken if we only use the
database, and what is the expected time taken (in seconds) to check all 102,000 strings if we
used the bloom filter + database combination described earlier?
Solution
1. For the set, we would require 5 million times 40 bytes, for a total of 200 MB.
For the bloom filter, we need just km/8 = 27/8 million bytes, or 3.375 MB, wow! Note how this doesn’t
depend (directly) at all on how many URLs, or the size of each one as we just hash it to a few bits.
Of course, k and m should increase with n though :) to keep the FPR low.
2. If we only use the database, it will take 102,000 · 0.5 = 51,000 seconds.
If we use the bloom filter + database combination, we will definitely call the bloom filter 102,000 times
at 0.001 seconds each, for a total of 102 seconds. Then for about 3% of the 100,000 non-malicious URLs (3,000
of them), we'll have to do a database lookup, costing 3,000 · 0.5 = 1,500 seconds. For the 2,000 actually
malicious URLs, we also have to do a database lookup, costing 2,000 · 0.5 = 1,000 seconds. So in total,
102 + 1,500 + 1,000 = 2,602 seconds.
Just take a second to stare at how much memory savings we had (the first part), and the time savings we
had (the second part)!
9.4.4 Summary
Hopefully now you see the pros and cons of bloom filters. We cannot delete from the bloom filter (why?)
nor list out which elements are in it because we never stored the string! Below summarizes the operations
of a bloom filter.
If you imagine coding this up, it’s so short, only a few lines of code! We just saw how probability and
randomness can be used to save space and time, in exchange for accuracy! In our application, we didn’t even
mind the accuracy part because we would just do the lookup in that case just to be certain anyway! We saw
it being used for a data structure, and in our next application, we’ll see it being used for an algorithm.
Randomness just makes our lives (as computer scientists) better, and can lead to elegant and beautiful data
structures and algorithms which often outperform their deterministic counterparts.
Chapter 9: Applications to Computing
9.5: Distinct Elements
Slides (Google Drive) Video (YouTube)
9.5.1 Motivation
YouTube wants to count the number of distinct views for a video, but doesn’t want to store all the user
ID’s. How can they get an accurate count of users without doing so? Note: A user can view their favorite
video several times, but should only be counted as one distinct view.
Before we attempt to solve this problem, you should wonder: why should we even care? For one of the
most popular videos on YouTube, let’s say there are N = 2 billion views, with n = 900 million of them being
distinct views. How much space is required to accurately track this number? Well, let’s assume a user ID is
an 8-byte integer. Then, we need 900,000,000 × 8 bytes total if we use a Set to track the user IDs, which
requires 7.2 gigabytes of memory for ONE video. Granted, not too many videos have this many views, but
imagine now how many videos there are on YouTube: I'm not sure of the exact number, but I wouldn't be
surprised if it was in the tens or hundreds of millions, or even higher!
It would be great if we could get the number of distinct views with constant space O(1) instead of linear
space O(n) required by storing all the IDs (let's say a single 8-byte floating point number instead of 7.2
GB). It turns out we (approximately) can! There is no free lunch of course - we can’t solve this problem
exactly with constant memory. But we can trade this space for some error in accuracy, using the contin-
uous Uniform random variable! That is, we will potentially have huge memory savings, but are okay with
accepting a distinct view count which has some margin of error.
9.5.2 Intuition
This seemingly unrelated calculation will be crucial in tying our algorithm together - I’ll ask for your patience
as we do this. Let U1 , . . . , Um be m iid (independent and identically distributed) RVs from the continuous
Unif(0, 1) distribution. If we take the minimum of these m random variables, what do we “expect” it to be?
That is, if X = min{U1 , . . . , Um }, what is E [X]? Before actually doing the computation, let’s think about
this intuitively and see some pictures.
What these examples are getting at is that the expected value of the smallest of m Unif(0, 1) RVs is

$$E[X] = E[\min\{U_1, \dots, U_m\}] = \frac{1}{m+1}$$
I promise this will be the key observation in making this clever algorithm work. If you believed the intuition
above, that’s great! If not, that’s also fine, so I’ll have to prove it to you formally below. Whether you
believe me or not at this point, you are definitely encouraged to read through the strategy as it may come
up many times in your future.
We can compute the CDF of X = min{U1, ..., Um} for x ∈ [0, 1] as follows:

$$F_X(x) = P(X \le x) = 1 - P(\min\{U_1, \dots, U_m\} > x) = 1 - P(U_1 > x, \dots, U_m > x) = 1 - \prod_{i=1}^{m} P(U_i > x) = 1 - (1-x)^m$$

Some of these steps need more justification. For the second equation, we use the fact that the minimum of
numbers is greater than a value if and only if all of them are (think about this). For the next equation, the
probability that all of the Ui > x is just the product of the m probabilities, by our independence assumption.
And finally, for Ui ~ Unif(0, 1), we know its CDF (look it up in our table) is P(Ui ≤ x) = (x − 0)/(1 − 0) = x, and
so P(Ui > x) = 1 − P(Ui ≤ x) = 1 − x.
I'll leave it to you to compute the density fX(x) by differentiating the CDF we just computed, and then
using our standard expectation formula (the minimum of numbers in [0, 1] is also in [0, 1]):

$$E[X] = \int_0^1 x f_X(x)\, dx$$

and you should get E[X] = 1/(m + 1) after all this work!
If you are thinking of giving up now, I promise this was the hardest part! The rest of the section should be
(generally) smooth sailing.
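If you'd rather convince yourself by simulation than by calculus, a quick sketch with numpy agrees with the 1/(m + 1) formula:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
m = 9
# 100,000 trials: in each trial, take the min of m iid Unif(0,1) samples.
mins = rng.uniform(0, 1, size=(100_000, m)).min(axis=1)
print(mins.mean())  # should be close to 1/(m+1) = 0.1
```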
Suppose the universe of user IDs is the set U (think of this as all 8-byte integers), and we have a single
uniform hash function h : U → [0, 1] (i.e., for a user ID y, pretend h(y) is a continuous Unif(0, 1) random
variable). That is, h(y1 ), h(y2 ), ..., h(yk ) for any k distinct elements are iid continuous Unif(0, 1) random
variables, but since the hash function always gives the same output for some given input, h(y1 ) and h(y1 )
are the “same” Unif(0, 1) random variable.
To parse that mess, let’s see two examples. These will also hopefully give us the lightbulb moment!
Example(s)
This is a stream of user IDs. From this, there are 3 distinct views (13,25,19) out of 6 total views.
The uniform hash function h might give us the following stream of hashes:
Note that all of these numbers are between 0 and 1 as they should be, as they are supposedly
Unif(0, 1). Note also that for the same user ID, we get the same hash! That is, h(19) will always
return 0.79, h(25) is always 0.26, and so on. Now go back and reread the previous paragraph and see
if it makes more sense.
Example(s)
Consider the same stream of N = 6 elements as the previous example, with n = 3 distinct elements.
1. How many independent Unif(0, 1) RVs are there total: N or n?
2. If we only stored the minimum value every time we received a view, we would store the single
floating point number 0.26, as it is the smallest hash of the six. If we didn't know n, how
might we exploit 0.26 to get the value of n = 3? Hint: Use the fact we proved earlier that
E[min{U1, ..., Um}] = 1/(m + 1), where U1, ..., Um are iid.
Solution
1. As you can see, we only have three iid Uniform RVs: 0.26, 0.51, 0.79. So in general, we'll be
taking the minimum of n (and not N) RVs.
2. Actually, remember that the expected minimum of n distinct/independent values is approximately
1/(n + 1), as we showed earlier. Our 0.26 isn't exactly equal to E[X], but it is an estimate
for it! So if we solve

$$0.26 \approx E[X] = \frac{1}{n+1}$$

we would get that n ≈ (1/0.26) − 1 ≈ 2.846. Rounding this to the nearest integer of 3 actually
gives us the correct answer!
So our strategy is: keep a running minimum (a single floating point number which ONLY takes 8 bytes).
As we get a stream of user IDs x1, ..., xN, hash each one and update the running minimum
if necessary. When we want to estimate n, we just reverse-solve n = round(1/E[X] − 1), and
that's it! Take a minute to reread this example if necessary, as this is the entire idea!
This is known as the Distinct Elements algorithm! We start our single floating point minimum (called val
below) at 1, and repeatedly update it. The key observation is that we are only taking the minimum of n iid
Uniform RVs, and NOT N, because h always returns the same value given the same input. Reverse-solving
E[X] = 1/(n + 1) for n gives us an estimate of the number of distinct elements, since val is only an
approximation of E[X]. Note we want to round to the nearest integer because n should be an integer.
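Converting that pseudocode into Python might look like the sketch below; the hash function is a hypothetical stand-in that maps each ID to a reproducible "uniform" value in [0, 1).

```python
import hashlib

def h(user_id):
    # Stand-in uniform hash: the same input always maps to the same
    # pseudo-uniform value in [0, 1).
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) / 16 ** len(digest)

def distinct_elements(stream):
    val = 1.0                       # running minimum of the hashes
    for user_id in stream:
        val = min(val, h(user_id))  # update if we see a smaller hash
    # val approximates E[X] = 1/(n+1), so reverse-solve for n and round.
    return round(1 / val - 1)

print(distinct_elements([13, 25, 19, 25, 19, 19]))  # estimate of n (true n = 3)
```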
This algorithm sounds great right? One pass over the data (which is the best we can do in time complexity),
and one single float (which is the best we can do in space complexity)! But you have to remember the
tradeoff is in the accuracy, which we haven't seen yet.
The reason the previous example was spot-on is because I cheated a little bit. I ensured the three values
0.26, 0.51, 0.79 were close to where they were supposed to be: 0.25, 0.50, 0.75. Actually, it’s most important
that just the minimum is on-target. See the following example for an unfortunate situation.
Example(s)
The uniform hash function h might give us the following stream of N = 7 hashes:
Trace the distinct elements algorithm above by hand and report the value that it will return for our
estimate. Compare it to the true value of n = 4 which is unknown to the algorithm.
Solution
At the end of all the updates, val will be equal to the minimum hash of 0.1. So the estimated number of
distinct elements is
$$\text{round}\left(\frac{1}{0.1} - 1\right) = 9$$

There are only n = 4 distinct elements though! The reason it didn't work out well for us this time is that
the minimum value was supposed to be around 1/5 = 0.2, but was actually 0.1. This is not necessarily a
huge difference until we take its reciprocal...
That’s it! The code for this algorithm is actually pretty short and sweet (imagine converting the pseudocode
above into code). If you take a step back and think about what machinery we needed, we needed continuous
RVs: the idea of PDF/CDF, and the Uniform RV. The mathematical/statistical tools we learn have many
applications to computer science; we have several more to go!
If X1, ..., Xn are iid RVs with mean µ and variance σ², we'll show that the sample mean
X̄n = (1/n) Σ_{i=1}^{n} Xi has the same mean as each Xi, but lower variance.

$$E[\bar{X}_n] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu$$

$$\operatorname{Var}(\bar{X}_n) = \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}$$
That is, the sample mean will have the same expectation, but the variance will go down linearly! Why might
this make sense? Well, imagine you wanted to estimate the height of American adults: would you rather
have a sample of 1, 10, or 100 adults? All would be correct in expectation, but the size of 100 gives us more
confidence in our answer!
So if we instead estimate the minimum E[X] = 1/(n + 1) with the average of k minimums instead of just one,
we should get a more accurate estimate for E[X], and hence for n, the number of distinct elements, as well!
So, imagine we had k independent hash functions instead of just one: h1 , . . . , hk , and k minimums val1 , val2 , . . . , valk .
Stream →   13     25     19     25     19     19    | val_i
h1         0.51   0.26   0.79   0.26   0.79   0.79  | 0.26
h2         0.22   0.83   0.53   0.83   0.53   0.53  | 0.22
...        ...    ...    ...    ...    ...    ...   | ...
hk         0.27   0.44   0.72   0.44   0.72   0.72  | 0.27
Each row represents one hash function hi , and the last column in each row is the minimum for that hash
function. Again, we’re only keeping track of the k floating point minimums in the final column. Now, for
improved accuracy, we just take the average of the k minimums first, before reverse-solving. Imagine k = 3
(so there were no rows in . . . above). Then, a good estimate for the true minimum E [X] is
0.26 + 0.22 + 0.27
E [X] ⇡ = 0.25
3
✓ ◆
1
So our estimate for n is round 1 = 3, which is perfect! Note that we basically combined 3 distinct
0.25
elements instances with h1 , h2 , h3 individually from earlier, in a way that reduced the variance! The indi-
vidual estimates 0.26, 0.22, 0.27 were varying around 0.25, but their average was even closer!
Now our memory is just O(k) instead of O(1), but we get a better estimate as a result. It is up to you to
determine how you want to trade off these two opposing quantities.
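A sketch of the improved version is below, reusing the hypothetical stand-in hash `h` from the earlier snippet; salting the input with the row index i simulates k independent hash functions.

```python
def distinct_elements_avg(stream, k=50):
    vals = [1.0] * k  # one running minimum per hash function
    for user_id in stream:
        for i in range(k):
            vals[i] = min(vals[i], h(f"{i}|{user_id}"))  # salted copy of h
    avg_min = sum(vals) / k  # lower-variance estimate of E[X] = 1/(n+1)
    return round(1 / avg_min - 1)
```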
9.5.5 Summary
We just saw today an extremely clever use of continuous RVs, applied to computing! In general, randomness
(the use of a random number generator (RNG)) in algorithms and data structures often can help improve
either the time or space (or both)! We saw earlier with the bloom filter how adding a RNG can save a ton
of space in a data structure. Even if you don’t go on to study machine learning or theoretical CS, you can
see what we’re learning can be applied to algorithms and data structures, arguably the core knowledge of
every computer scientist.
Chapter 9: Applications to Computing
9.6: Markov Chain Monte Carlo (MCMC)
Slides (Google Drive) Video (YouTube)
9.6.1 Motivation
Markov Chain Monte Carlo (MCMC) is a technique which can be used to solve hard optimization
problems (among other things). In this section, we’ll design MCMC algorithms to solve the following two
problems, and you will be able to solve many more yourself!
• The Knapsack Problem: Suppose you have a knapsack which has some maximum weight capacity.
There are n items with weights w1 , . . . , wn > 0 and values v1 , . . . , vn > 0, and we want to choose the
subset of them that maximizes the total value subject to the weight constraint of the knapsack. How
can we do this?
• The Travelling Salesman Problem (TSP): Suppose you want to find the best route (minimizing
total distance travelled) between the 50 U.S. state capitals that we want to visit! A valid route starts
and ends in the same state capital, and visits each capital exactly once (this is known as the TSP,
and is known to be NP-Hard). We will design an MCMC algorithm for this as well!
As the name suggests, this technique depends a bit on the idea of Markov Chains. Most of this section
then will actually be building up the foundations of Markov Chains, and MCMC will follow soon after. In
fact, you could definitely understand and code up the algorithm without learning this math, but if you care
to know how and why it works (you should), then it is important to learn first!
A discrete-time stochastic process (DTSP) is a sequence of random variables X0, X1, X2, ..., where Xt
represents some quantity at time t. For example:
• The temperature in Seattle each day. X0 can be the temperature today, X1 tomorrow, and so on.
• The price of Google stock at the end of each year. X0 can be the final price at the end of the year it
IPO’d, X1 the next, and so on.
• The number of people who come to my store each day. X0 is the number of people who came on the
first day, X1 on the second, and so on.
Consider the following random walk on the graph below. You’ll see what that means through an example!
Suppose we start at node 1, and at each time step, independently step to a neighboring node with equal
probability.
For example, X0 = 1 since at time t = 0, we are at node 1. Then, X1 can be either 2 or 3 (but not 4 or 5
since not neighbors of node 1). And so on. So each Xt just tells us the position we are at at time t, and is
always in the set {1, 2, 3, 4, 5} (for our example anyway).
This DTSP actually has a lot of structure, and is an example of a special type of DTSP called a
Markov Chain: can you think about how this particular setup provides a lot of additional constraints over
a normal DTSP?
Here are three key properties of a Markov Chain, which we will formalize immediately after:
1. We only have finitely many states (5 in our example: {1, 2, 3, 4, 5}). (The stock price or temperature
example earlier could be any real number).
2. We don’t care about the past, given the present. That is, the distribution of where we go next
ONLY depends on where we are currently, and not any past history.
3. The transition probabilities are the same at each step (stationary). That is, whether we are at node 1 at
time t = 0 or t = 152, we are always equally likely to go to node 2 or 3.
3. Has stationary transition probabilities. That is, we always transition from state si to sj with
probability independent of the current time. Hence, due to this property and the previous one, the
transitions are governed by n² probabilities: the probability of transitioning from one of n current
states to one of n next states. These are stored in a square n × n transition probability
matrix (TPM) P, where Pij = P(Xt+1 = sj | Xt = si) is the probability of transitioning
from si → sj, for any and every time t.
If you’re a bit confused right now, especially with that last bullet point, this is totally normal and means you
are paying attention! Let’s construct the TPM for the graph example earlier to see what it means exactly.
For example, the second entry of the first row is: given that Xt = 1 (we are in state 1 at some time t), what
is the probability of going to state 2 next Xt+1 = 2? It’s 1/2 because from state 1, we are equally likely to
go to state 2 or 3. It isn’t possible to go to states 1, 4, and 5, and that’s why their respective entries are 0.
From state 2, we can only go to states 1 and 4 as you can see from the graph and the TPM. Try filling out
the remaining three rows yourself! These images may help:
Note that in the last row, from state 5, we MUST go to state 4, and so P54 = 1 and the rest of the row
has zero probability. Also note that each ROW sums to 1, but there is no such constraint on the columns.
That's because each row is secretly a (conditional) PMF, right? Given we are in some state si (Xt = si), the probabilities
of going to the next state Xt+1 must sum to 1.
Example(s)
Now let’s talk about how to compute some probabilities we may be interested in. Nothing here is
“new”: it is all based on your core probability knowledge from the previous chapters! Let’s say we
want to find out the probability we end up at state 5 after two time steps, starting from state 3. That
is, compute P(X2 = 5 | X0 = 3). Try to come up with an "intuitive" answer first, and then show
your work formally.
Solution You might be able to hack your way around to a solution since it is only two time steps: something
like $\frac{1}{2} \cdot \frac{1}{3}$.
Intuitively, we can either go to state 4 or 1 from state 3 with equal probability. If we went to state 1, there’s
no chance we make it to state 5. If we went to state 4, there’s a 1/3 chance we go to state 5. So our answer
is 1/2 · 1/3 = 1/6. This is just the LTP conditioning on possible middle states!
Now we'll write this out more generally. The LTP we need will be a conditional form though: the LTP says that if
the Bi's partition the sample space,

$$P(A) = \sum_i P(A \mid B_i)\, P(B_i)$$

But what if we wanted P(A | C)? We just condition everything on C as well to get:

$$P(A \mid C) = \sum_i P(A \mid B_i, C)\, P(B_i \mid C)$$
Applying this with A = {X2 = 5}, C = {X0 = 3}, and Bi = {X1 = i} for i = 1, ..., 5 gives

$$P(X_2 = 5 \mid X_0 = 3) = \sum_{i=1}^{5} P(X_2 = 5 \mid X_1 = i, X_0 = 3)\, P(X_1 = i \mid X_0 = 3) = \sum_{i=1}^{5} P(X_2 = 5 \mid X_1 = i)\, P(X_1 = i \mid X_0 = 3)$$

The second equation comes because the probability of X2 given both the positions X0 and X1 only depends
on X1, right? Once we know where we are currently, we can forget about the past. But now, we can zero
out several of these terms because P(X1 = i | X0 = 3) = 0 for i = 2, 3, 5. So we are left with just 2 of the 5 terms:

$$= P(X_2 = 5 \mid X_1 = 1)\, P(X_1 = 1 \mid X_0 = 3) + P(X_2 = 5 \mid X_1 = 4)\, P(X_1 = 4 \mid X_0 = 3)$$
If you have the TPM P (we have this above), try looking up the entries to see if you get the same answer!

$$= P_{15} P_{31} + P_{45} P_{34} = 0 \cdot \frac{1}{2} + \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$$
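Here's a quick numerical check of this computation (a sketch; the transition matrix below is reconstructed from the text's description of the graph, so treat its exact entries as an assumption):

```python
import numpy as np

# Assumed TPM for the 5-node random walk: P[i][j] = P(X_{t+1} = j+1 | X_t = i+1).
P = np.array([
    [0,   1/2, 1/2, 0,   0  ],  # from 1: to 2 or 3
    [1/2, 0,   0,   1/2, 0  ],  # from 2: to 1 or 4
    [1/2, 0,   0,   1/2, 0  ],  # from 3: to 1 or 4
    [0,   1/3, 1/3, 0,   1/3],  # from 4: to 2, 3, or 5
    [0,   0,   0,   1,   0  ],  # from 5: must go to 4
])
assert np.allclose(P.sum(axis=1), 1)  # each row is a conditional PMF

two_step = np.linalg.matrix_power(P, 2)  # entry (i, j) is P(X_2 = j+1 | X_0 = i+1)
print(two_step[2, 4])  # P(X_2 = 5 | X_0 = 3) = 1/6
```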
Back to our random walk example: suppose we weren't sure where we started. That is, let the vector v
be such that P(X0 = i) = vi, where vi is the ith element of v (these probabilities sum to 1, because we must
start in one of these 5 positions). Think of this vector v as our belief distribution of where we are at time
t = 0. Let’s compute vP , the matrix-product of v and P , the transition probability matrix. We’ll see what
comes out of it after computing and interpreting it! If you haven’t taken linear algebra yet, don’t worry: vP
is the following 5-dimensional row vector:
$$vP = \left(\sum_{i=1}^{5} v_i P_{i1},\ \sum_{i=1}^{5} v_i P_{i2},\ \sum_{i=1}^{5} v_i P_{i3},\ \sum_{i=1}^{5} v_i P_{i4},\ \sum_{i=1}^{5} v_i P_{i5}\right)$$
What does vP represent? Let's focus on the first entry, and substitute vi = P(X0 = i) and Pi1 =
P(X1 = 1 | X0 = i) (the probability of going from i → 1). We actually get (by LTP over initial states):

$$\sum_{i=1}^{5} P_{i1} v_i = \sum_{i=1}^{5} P(X_1 = 1 \mid X_0 = i)\, P(X_0 = i) = P(X_1 = 1)$$
This is an interesting pattern that holds for the remaining entries as well! In fact, the i-th entry of vP is
just P(X1 = i), so overall, the vector vP represents your belief distribution at the next time step!
That is, right-multiplying by the transition matrix P literally transitions your belief distribution from one
time step to the next.
We can also see that, for example, vP² = (vP)P is your belief of where you are after 2 time steps, and by
induction, vP^n is your belief of where you are after n time steps.
A natural question might then be: does vP^n have a limit as n → ∞? That is, after a long time, is there a belief
distribution (5-dimensional row vector) π such that it never changes again? The answer is unfortunately:
it depends. We won't go into the technical details of when it does and doesn't exist (search "Fundamental
Theorem of Markov Chains" if you are interested), but this leads us to the following definition:
The stationary distribution of a Markov Chain with n states (if one exists) is the n-dimensional
row vector π (representing a probability distribution: entries which are nonnegative and sum to 1)
such that

$$\pi P = \pi$$

Intuitively, it means that the belief distribution at the next time step is the same as the distribution
at the current one. This typically happens after a "long time" (called the mixing time) in the process,
meaning after lots of transitions were taken.
We’re going to see an example of this visually, which will also help us build our final piece of intuition for
MCMC. Consider the Markov Chain we’ve been using throughout this section:
Here is the distribution v that we’ll start with. Our Markov Chain happens to have a stationary distribution,
so we’ll see what happens as we take vP n for n ! 1 visually.
v = (0.25, 0.45, 0.15, 0.05, 0.10)
Here is a heatmap of it visually:
You can see from the key that darker values mean lower probabilities (hence 4 and 5 are very dark), and
that 2 is the lightest value since it has the highest probability.
We’ll then show the distribution after 1 step, 5 steps, 10 steps, and 100 steps. Before we continue, what
do you think the fifth entry will look like after one time step, the probability of being in node 5? Actually,
there is only one way to get to node 5, and that’s from node 4, which we start in with probability only 0.05.
From there, only a 1/3 chance to get to node 5, so node 5 will only have 0.05/3 = 1/60 probability at time
step 1 and hence be super dark.
It turns out that after just n = 100 time steps, we start getting the same distribution over and over again
(see t = 10 and t = 100: there’s already almost no di↵erence)! This limiting value of vP n is the stationary
distribution!
$$\pi = \lim_{n \to \infty} vP^n = (0.12, 0.28, 0.28, 0.18, 0.14)$$

Suppose π = vP^100 above. Once we find π such that πP = π for the first time, that means that if we
transition again, we get

$$\pi P^2 = (\pi P) P = \pi P = \pi$$
(applying the equality ⇡P = ⇡ twice). That means, by just running the Markov Chain for several
time steps, we actually reached our stationary distribution! This is the most crucial observation
for MCMC.
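In code, running the chain on belief distributions is just repeated vector-matrix multiplication (a sketch reusing the assumed transition matrix P from the earlier snippet; whether the beliefs settle down depends on the chain, as noted above):

```python
v = np.array([0.25, 0.45, 0.15, 0.05, 0.10])  # belief distribution at time 0
for t in range(100):
    v = v @ P                                 # belief at the next time step
print(v)       # belief after 100 steps
print(v @ P)   # if stationary, this matches v (i.e., pi P = pi)
```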
Markov Chain Monte Carlo (MCMC) is a technique which can be used to solve hard optimization
problems (though generally it is used to sample from a distribution). The general strategy is as
follows:
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) transition
probabilities that result in the stationary distribution ⇡ having higher probabilities on “good”
solutions to our problem. We don’t actually compute ⇡, but we just want to define the Markov
Chain such that the stationary distribution would have higher probabilities on more desirable
solutions.
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a "good" state/solution).
This means: start at some initial state, and transition according to the transition
probability matrix (TPM) for a long time. This will eventually take us to our stationary
distribution which has high probability on "good" solutions!
Again, if this doesn’t make sense yet, that’s totally fine. We will apply this two-step procedure to two
examples below so you can understand better how it works!
Note that our total value is the sum of the values of the items we take: think about why $\sum_i v_i x_i$ is the total
value (remember that xi is either 0 or 1). This problem has 2^n possible solutions (either take each item or
don't), and so is combinatorially hard (exponentially many solutions). If I asked you to write a program to
do this, would you even know where to begin, except by writing the brute-force solution?
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) tran-
sition probabilities that result in the stationary distribution ⇡ having higher probabilities
on “good” solutions to our problem.
We’ll define a Markov Chain with 2n states (that’s huge!). The states will be all possible solutions:
binary vectors x of length n (only having 0/1 entries). We’ll then define our transitions to go to “good”
states (ones that satisfy our weight constraint), while keeping track of the best solution so far. This
way, our stationary distribution has higher probabilities on good solutions than bad ones. Hence, when
we sample from the distribution (simulating the Markov chain), we are likely to get a good solution!
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a “good”
state/solution). This means: start at some initial state, and transition according to the
transition probability matrix (TPM) for a long time. This will eventually take us to our
stationary distribution which has high probability on “good” solutions!
Basically, this algorithm starts with the guess of x being all zeros (no items). Then, for NUM_ITER
steps, we simulate the Markov Chain. Again, what this does is give us a sample from our stationary
distribution. Inside the loop, we literally just choose a random object and flip whether or not we have
it. We keep track of the best solution so far and return it. A sketch in code appears below.
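Here's roughly what that might look like in Python (a sketch; the weights, values, and capacity at the bottom are a made-up instance):

```python
import random

def mcmc_knapsack(weights, values, capacity, num_iter=100_000):
    n = len(weights)
    x = [0] * n                       # start with no items (always feasible)
    best_x, best_value = list(x), 0
    for _ in range(num_iter):
        i = random.randrange(n)       # choose a random item...
        x[i] = 1 - x[i]               # ...and flip whether or not we take it
        if sum(w * xi for w, xi in zip(weights, x)) > capacity:
            x[i] = 1 - x[i]           # infeasible: revert, stay at current state
            continue
        value = sum(v * xi for v, xi in zip(values, x))
        if value > best_value:        # keep track of the best solution so far
            best_x, best_value = list(x), value
    return best_x, best_value

# Hypothetical instance with 5 items.
print(mcmc_knapsack(weights=[3, 5, 7, 4, 2], values=[4, 8, 10, 5, 3], capacity=10))
```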
That’s all there is to it! This is such a “dumb” solution right? We just start somewhere and randomly
transition for a long time and hope our answer is good. So MCMC definitely won’t guarantee us to get the
best solution, but it leads to “dumb” solutions that actually work quite well in practice. We are guaranteed
though (provided we take enough transitions), to sample from the stationary distribution which has higher
probabilities on good solutions. This is because we only transition to solutions that maintain feasibility.
Note: This is just one version of MCMC for the knapsack problem; there are definitely better versions.
It would be better to transition to solutions which have higher value, not just feasible solutions
like we did. The next example does a better job of this!
Given n locations and distances between each pair, we want to find an ordering of them that:
• Starts and ends in the same location.
• Visits each location exactly once (except the starting location twice).
• Minimizes the total distance travelled.
You can imagine an instantiation of this problem for the US Postal Service. A mail delivery person wants to
start and end at the post office, and find the most efficient route which delivers all the mail to the residents.
Again, where would you even begin on trying to solve this, other than brute-force? MCMC to the rescue
again! This time, our algorithm will be more clever than the previous.
I. Define a Markov Chain with states being possible solutions, and (implicitly defined) tran-
sition probabilities that result in the stationary distribution ⇡ having higher probabilities
on “good” solutions to our problem.
We’ll define a Markov Chain with n! states (that’s huge!). The states will be all possible solutions
(state=route): all orderings of the n locations. We’ll then define our transitions to go to “good” states
(ones that go to lower-distance routes), while keeping track of the best solution so far. This way,
our stationary distribution has higher probabilities on good solutions than bad ones. Hence, when we
sample from the distribution (simulating the Markov chain), we are likely to get a good solution!
II. Run MCMC (simulate the Markov Chain for many iterations until we reach a “good”
state/solution). This means: start at some initial state, and transition according to the
transition probability matrix (TPM) for a long time. This will eventually take us to our
stationary distribution which has high probability on “good” solutions!
We will start with a random state (route). At each iteration, propose a new state (route) as follows:
choose a random index from {1, 2, ..., n}, and swap that location with the successive (next) location
in the route, possibly with wraparound if the last index is chosen. If the proposed route has lower total
distance (is better) than the current route, we will always transition to it (exploitation). Otherwise, if
T > 0, with probability e^{−Δ/T}, update the current route to the proposed route, where Δ > 0 is the
increase in total distance. This allows us to transition to a "worse" route occasionally (exploration),
and get out of local optima! Repeat this for NUM_ITER transitions from the initial state (route), and
output the shortest route seen during the entire process (which may not be the last route).
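A sketch of this in Python is below (the 4-location distance matrix at the bottom is a made-up instance; a route is an ordering of location indices, with the return leg to the start included in the distance):

```python
import math
import random

def tour_distance(route, dist):
    # Total distance, including the leg back to the starting location.
    n = len(route)
    return sum(dist[route[i]][route[(i + 1) % n]] for i in range(n))

def mcmc_tsp(dist, num_iter=100_000, T=1.0):
    n = len(dist)
    route = list(range(n))
    random.shuffle(route)                        # random initial state (route)
    cur_dist = tour_distance(route, dist)
    best_route, best_dist = list(route), cur_dist
    for _ in range(num_iter):
        i = random.randrange(n)
        j = (i + 1) % n                          # next location, with wraparound
        route[i], route[j] = route[j], route[i]  # propose the swapped route
        delta = tour_distance(route, dist) - cur_dist  # increase in distance
        if delta <= 0 or random.random() < math.exp(-delta / T):
            cur_dist += delta                    # accept: exploit, or explore worse
            if cur_dist < best_dist:
                best_route, best_dist = list(route), cur_dist
        else:
            route[i], route[j] = route[j], route[i]  # reject: undo the swap
    return best_route, best_dist

dist = [[0, 2, 9, 4],
        [2, 0, 6, 3],
        [9, 6, 0, 5],
        [4, 3, 5, 0]]
print(mcmc_tsp(dist))
```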
Again, this is such a “dumb” solution right? But also very clever! We just start somewhere and randomly
transition for a long time and hope our answer is good. And it should be: after a long time, our route
distance increasingly gets better and better, so we should expect a rather good solution!
9.6.4 Summary
Once again, we’ve used probability to make our lives easier. There are definitely papers and research on
how to solve these problems deterministically, but this is one of the simplest algorithms you can get, and
it uses randomness! Again, the idea of MCMC for optimization is: define the state space to be all possible
solutions, define transitions to go to better states, and just run it and wait!
Chapter 9: Applications to Computing
9.7: Bootstrapping (for Hypothesis Testing)
Slides (Google Drive) Video (YouTube)
9.7.1 Motivation
We've just learned how to perform a generic hypothesis test, where in our examples we were often able
to use the Normal distribution and its CDF due to the CLT. But actually, there are tons of other specialized
hypothesis tests which won't allow this. For example:
• The t-test for equality of means when variance is unknown.
• The χ²-test of independence (testing whether two quantities are independent or not).
• The F-test for equality of variances (testing whether or not the variances of two populations are equal).
There are many more that I haven't even listed, because I probably haven't even heard of them myself! These
three above though involve three distributions we haven't learned yet: the t, χ², and F distributions. But
because you are a computer scientist, we'll actually learn a way now to completely erase the need to learn
each specific procedure, called bootstrapping!
Example(s)
Main Idea: We have some (not enough) data and want more. How can we “get more”?
Imagine: You have 1000 iid coin flip samples, x1 ,...,x1000 which are all 1’s and 0’s. Your boss wants
you to somehow get/generate 500 more (independent) samples.
How can you "get more (iid) data" without actually having access to the coin? There are
two proposed solutions, both of which you could theoretically come up with, but only one of
which I expect most of you to guess. The first: use the 1000 flips to estimate p (the sample
proportion of heads, the MLE), and generate 500 fresh samples from Ber(p̂). The second, which
is the bootstrap: simply resample, with replacement, 500 values from the 1000 samples you
already have. Both treat the data we have as our best stand-in for the true distribution!
Example(s)
A colleague has collected samples of weights of labradoodles that live on two different islands:
CatIsland and DogIsland. The colleague collects 48 samples from CatIsland, and 43 samples from
the DogIsland. The colleague notes ahead of time that she thinks the labradoodles on DogIsland
have a higher spread of weights than CatIsland. You are skeptical. You and your colleague do
however agree to assume that their true means are equal. Here is the data:
CatIsland Labradoodle Weights (48 samples): 13, 12, 7, 16, 9, 11, 7, 10, 9, 8, 9, 7, 16, 7, 9,
8, 13, 10, 11, 9, 13, 13, 10, 10, 9, 7, 7, 6, 7, 8, 12, 13, 9, 6, 9, 11, 10, 8, 12, 10, 9, 10, 8, 14, 13, 13, 10, 11
DogIsland Labradoodle Weights (43 samples): 8, 8, 16, 16, 9, 13, 14, 13, 10, 12, 10, 6, 14, 8,
13, 14, 7, 13, 7, 8, 4, 11, 7, 12, 8, 9, 12, 8, 11, 10, 12, 6, 10, 15, 11, 12, 3, 8, 11, 10, 10, 8, 12
Solution Step 5 is the only part where bootstrapping is involved. Everything else is the same as we learned
in 8.3!
1. Make a claim.
The spread of labradoodle weights on DogIsland is (significantly) larger than that on CatIsland.
2. Set up a null hypothesis H0 and alternative hypothesis HA.

$$H_0: \sigma_C^2 = \sigma_D^2 \qquad H_A: \sigma_C^2 < \sigma_D^2$$
Our null hypothesis is that the spreads are the same, and our alternative is what we want to show.
Here, spread is taken to mean “variance”.
3. Choose a significance level α (usually α = 0.05 or 0.01).
Next, we compute the p-value. Under the null hypothesis, the two populations have the same variance.
Because of this, we can combine the two samples into a single one of size 48 + 43 = 91 (in our
case, we've also assumed the means are the same, so this is okay). Then, we repeatedly bootstrap
this combined sample (let's say 50,000 times): we sample with replacement a sample of size 48, and one of
size 43, and compute the sample variances of these two samples. Then, we compute the sample proportion
of times the difference in variances was at least as extreme as the one we observed, and that's it! See the
pseudocode below, and reread these two paragraphs.
Algorithm 5 Bootstrapping for p-value for H0: σ²_C = σ²_D vs HA: σ²_C < σ²_D
1: Given: Two samples x = [x1, ..., xn] and y = [y1, ..., ym].
2: obs_diff ← s²_y − s²_x (the difference in sample variances).
3: combined ← concat(x, y) = [x1, x2, ..., xn, y1, y2, ..., ym] (of size n + m).
4: count ← 0.
5: for i = 1, 2, ..., 50000 do        ▷ Any large number is fine.
6:     x' ← resample(combined, n) with replacement.        ▷ Sample of size n from combined.
7:     y' ← resample(combined, m) with replacement.        ▷ Sample of size m from combined.
8:     diff ← s²_{y'} − s²_{x'}.        ▷ Compute the difference in sample variances.
9:     if diff ≥ obs_diff then        ▷ This line changes depending on the alternative hypothesis.
10:         count ← count + 1.
11: p-val ← count/50000.
Again, what we're doing is: assuming there was this master island that split into two (same variance),
what is the probability we observed a sample of size 48 and a sample of size 43 with variances at least
as extreme as we did? That is, if we were to repeat this "separation" process many times, how often
would we get a difference so large? We don't have the other labradoodles from the master island, so
we bootstrap (reuse our current samples). It turns out this method leads to a good approximation to
the true p-value!
It's important to note that the alternative hypothesis is EXTREMELY IMPORTANT. If instead we
wanted to assert HA: σ²_C ≠ σ²_D, we would have used absolute values for diff and obs_diff. Also, for
example, if we wanted to make a statement about the means µC and µD instead, we would have
computed and compared the sample means instead of the sample variances.
It turns out we get a p-value of approximately 0.07. (Try coding this up yourself!)
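Here's one way to code it up with numpy, following Algorithm 5 directly (a sketch; `ddof=1` gives the sample variance, and the data is copied from the problem statement above):

```python
import numpy as np

def bootstrap_pvalue(x, y, num_boot=50_000, seed=0):
    """Approximate p-value for H0: var_x = var_y vs HA: var_x < var_y."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    obs_diff = y.var(ddof=1) - x.var(ddof=1)  # observed difference in sample variances
    combined = np.concatenate([x, y])
    count = 0
    for _ in range(num_boot):
        x_star = rng.choice(combined, size=len(x), replace=True)
        y_star = rng.choice(combined, size=len(y), replace=True)
        if y_star.var(ddof=1) - x_star.var(ddof=1) >= obs_diff:
            count += 1
    return count / num_boot

cat = [13, 12, 7, 16, 9, 11, 7, 10, 9, 8, 9, 7, 16, 7, 9, 8, 13, 10, 11, 9, 13, 13,
       10, 10, 9, 7, 7, 6, 7, 8, 12, 13, 9, 6, 9, 11, 10, 8, 12, 10, 9, 10, 8, 14,
       13, 13, 10, 11]
dog = [8, 8, 16, 16, 9, 13, 14, 13, 10, 12, 10, 6, 14, 8, 13, 14, 7, 13, 7, 8, 4,
       11, 7, 12, 8, 9, 12, 8, 11, 10, 12, 6, 10, 15, 11, 12, 3, 8, 11, 10, 10, 8, 12]
print(bootstrap_pvalue(cat, dog))  # should land near 0.07
```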
Since our p-value of 0.07 was greater than α = 0.05, we fail to reject the null hypothesis. There
is insufficient evidence to show that the labradoodle spreads are different across the two islands.
Actually, this two-sample test for difference in variances is usually done by an "F-Test of Equality of Variances" (see
Wikipedia). But because we know how to code, we don't need to know that!
You can imagine bootstrapping for other types of hypothesis tests as well! Actually, bootstrapping is a
powerful tool which also has other applications.
Chapter 9: Applications to Computing
9.8: Multi-Armed Bandits
Slides (Google Drive) Video (YouTube)
Actually, for this application of bandits, we will do the problem setup before the motivation. This is because
modelling problems in this bandit framework may be a bit tricky, so we’ll kill two birds with one stone. We’ll
also see how to do “Modern Hypothesis Testing” using bandits!
You bought some credits and can pull any slot machines, but only a total of T = 100 times. At each time
step t = 1, ..., T, you pull arm at ∈ {1, 2, ..., K} and observe a random reward. Your goal is to maximize
your total (expected) reward after T = 100 pulls! The problem is: at each time step (pull), how do I decide
which arm to pull based on the past history of rewards?
We make a simplifying assumption that each arm is independent of the rest, and has some reward
distribution which does NOT change over time.
Here is an example you may be able to do: don’t overthink it!
Example(s)
If the reward distributions are given in the image below for the K = 3 arms, what is the best strategy
to maximize your expected reward?
Solution We can just compute the expectations of each arm from its given distribution. The first machine
has expectation 1.36, the second has expectation np = 4, and the third has expectation µ = 1. So to
maximize our total reward, we should just always pull arm 2 because it has the best expected reward! There
would be no benefit in pulling other arms at all.
So we’re done right? Well actually, we DON’T KNOW the reward distributions at all! We must estimate all
K expectations (one per arm), WHILE simultaneously maximizing reward! This is a hard problem because
we know nothing about the K reward distributions. Which arm should we pull then at each time step? Do
we pull arms we know to be “good” (probably), or try other arms?
• Exploitation: Pulling the arms we currently believe to be "good" based on the rewards observed so far.
• Exploration: Pulling less-frequently pulled arms in the hopes they are also "good" or even better.
In this section, we will only handle the case of Bernoulli bandits. That is, the reward of each arm
a ∈ {1, ..., K} is Ber(pa) (i.e., we either get a reward of 1 or 0 from each machine, with possibly different
probabilities). Observe that the expected reward of arm a is just pa (the expectation of a Bernoulli).
The last thing we need to talk about when talking about bandits is regret. Regret is the difference between
• The best possible expected reward (if you always pulled the best arm), and
• The reward you actually accumulated with your strategy.
Let p* = max_{i∈{1,2,...,K}} pi denote the highest expected reward from one of the K arms. Then, the regret
at time T is

$$\text{Regret}(T) = T p^* - \text{Reward}(T)$$

where Tp* is the reward from the best arm if you pull it T times, and Reward(T) is your actual reward after
T pulls. Sometimes it's easier to think about this in terms of average regret (divide everything by T):

$$\text{Avg-Regret}(T) = p^* - \frac{\text{Reward}(T)}{T}$$
The below summarizes and formalizes everything above into this so-called “Bernoulli Bandit Framework”.
The focus for the rest of the entire section is: “how do we choose which arm”?
9.8.2 Motivation
Before we talk about that though, we’ll discuss the motivation as promised.
As you can see above, we can model a lot of real-life problems as a bandit problem. We will learn two
popular algorithms: Upper Confidence Bound (UCB) and Thompson Sampling. This is after we discuss
some “intuitive” or “naive” strategies you may have yourself!
We’ll actually call on a lot of our knowledge from Chapters 7 and 8! We will discuss maximum likelihood,
maximum a posteriori, confidence intervals, and hypothesis testing, so you may need to brush up on those!
One strategy may be: pull each arm M times in the beginning, and then forever pull the best arm! This is
described formally below:
Actually, this strategy is no good, because if we choose the wrong best arm, we will regret it for the rest
of time! You might then say, why don't we increase M? If you do that, then you are pulling sub-optimal
arms more than you should, which would not help us in maximizing reward... The problem is: we did all of
our exploration FIRST, and then exploited our best arm (possibly incorrect) for the rest of time. Why don't
we try to blend in exploration more? Do you have any ideas on how we might do that?
The following algorithm is called the ε-Greedy algorithm, because it explores with probability ε at each
time step! It has the same initial setup: pull each arm M times to begin. But it does two things better than
the previous algorithm:
1. It continuously updates an arm's estimated expected reward when it is pulled (even after the KM
steps).
2. It explores with some probability ε (you choose). This allows you to choose in some quantitative way
how to balance exploration and exploitation.
See below!
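Here is a sketch of ε-greedy for Bernoulli bandits. The true pi's and the simulated `pull` are of course hypothetical: the algorithm itself only ever sees the rewards.

```python
import random

def epsilon_greedy(true_ps, T=10_000, M=5, eps=0.1, seed=0):
    rng = random.Random(seed)
    K = len(true_ps)
    pull = lambda a: 1 if rng.random() < true_ps[a] else 0  # Ber(p_a) reward
    counts, sums = [0] * K, [0] * K
    total_reward = 0
    for t in range(T):
        if t < K * M:
            a = t % K                  # initial phase: pull each arm M times
        elif rng.random() < eps:
            a = rng.randrange(K)       # explore: uniformly random arm
        else:                          # exploit: arm with best estimated reward
            a = max(range(K), key=lambda i: sums[i] / counts[i])
        r = pull(a)
        counts[a] += 1                 # continuously update the arm's estimate
        sums[a] += r
        total_reward += r
    return total_reward

print(epsilon_greedy([0.5, 0.2, 0.9]))  # hopefully not far below 0.9 * 10,000
```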
However, we can do much much better! Why should we explore each arm uniformly at random, when we
have a past history of rewards? Let's explore more the arms that have the potential to be really good! In an
extreme case, if there is an arm with average reward 0.01 after 100 pulls and an arm with average reward
0.6 after only 5 pulls, should we really explore each equally?
Suppose at some point our three arms have the following estimates and confidence intervals:
• Arm 1: Estimate is p̂1 = 0.75. Confidence interval is [0.75 − 0.10, 0.75 + 0.10] = [0.65, 0.85].
• Arm 2: Estimate is p̂2 = 0.33. Confidence interval is [0.33 − 0.25, 0.33 + 0.25] = [0.08, 0.58].
• Arm 3: Estimate is p̂3 = 0.60. Confidence interval is [0.60 − 0.29, 0.60 + 0.29] = [0.31, 0.89].
Notice all the intervals are centered at the MLE. Remember the intervals may have different widths, because
the width of a confidence interval depends on how many times the arm has been pulled (more pulls means more
confidence and hence a narrower interval). Review 8.1 if you need to recall how we construct them.
The greedy algorithm from earlier at this point in time would choose arm 1 because it has the highest
estimate (0.75 is greater than 0.33 and 0.60). But our new Upper Confidence Bound (UCB) algorithm
will choose arm 3 instead, as it has the highest possibility of being the best (0.89 is greater than 0.85 and
0.58).
See how exploration is “baked in” now? As we pull an arm more and more, the upper confidence bound
decreases. The less frequently pulled arms have a chance to have a higher UCB, despite having a lower
point estimate! After the next algorithm we examine, we will visually compare and contrast the results. But
before we move on, let’s take a look at this visually.
Suppose we have K = 5 arms. The following picture depicts at time t = 10 what the confidence intervals
may look like. The horizontal lines at the top of each arm represent the upper confidence bound, and the red
dots represent the TRUE (unknown) means. The center of each confidence interval are the ESTIMATED
means.
Pretty inaccurate at first, right? Because it's so early on, our estimates are expected to be bad.
Notice how the interval for the best arm (arm 5) keeps shrinking, and is the smallest one because it was pulled
(exploited) so much! Clearly, arm 1 was terrible, and so our estimate isn't perfect; it has the widest interval
since we almost never pulled it. This is the idea of UCB: basically just greedy, but using upper confidence
bounds!
You can go to the slides linked at the top of the section if you would like to see a step-by-step of the first
few iterations of this algorithm (slides 64-86).
Note that if we just deleted the +√(2 ln(t) / N_t(i)) term in the 5th line of the algorithm, it would reduce
to the greedy algorithm!
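A minimal Python sketch of UCB (again assuming the same hypothetical pull(i) returning 0/1 rewards):

    import math

    def ucb(K, T, pull):
        """UCB: pull the arm maximizing (empirical mean) + sqrt(2 ln(t) / N_t(i))."""
        counts = [0] * K
        totals = [0.0] * K
        for i in range(K):                   # pull each arm once so N_t(i) > 0
            counts[i] = 1
            totals[i] = pull(i)
        for t in range(K + 1, T + 1):
            # Dropping the "+ math.sqrt(...)" bonus below recovers plain greedy.
            i = max(range(K), key=lambda j: totals[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
            totals[i] += pull(i)
            counts[i] += 1
        return sum(totals)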
So as I mentioned earlier, each p_i is a RV which starts with a Beta(1, 1) distribution. For each arm i, we
keep track of α_i and β_i, where α_i − 1 is the number of successes (number of times we got a reward of 1),
and β_i − 1 is the number of failures (number of times we got a reward of 0).
For this algorithm, I would highly recommend going to the slides linked at the top of the section to see a
step-by-step of the first few iterations (slides 94-112). If you don't want to, we'll still walk through it below!
Let's again suppose we have K = 3 arms. At time t = 1, we sample once from each arm's Beta distribution.
We suppose the true p_i's are 0.5, 0.2, and 0.9 for arms 1, 2, and 3 respectively (see the table). Each arm
has α_i and β_i, initially 1. We get a sample from each arm's Beta distribution and just pull the arm with
the largest sample! In our first step, each arm has the same distribution Beta(1, 1) = Unif(0, 1), so each
is equally likely to be pulled. Then, because arm 2 has the highest sample (of 0.75), we pull arm 2. The
algorithm doesn't know this, but there is only a 0.2 chance of getting a 1 from arm 2 (see the table), and so
let's say we happen to observe our first reward to be zero: r_1 = 0.
Consistent with our Beta random variable intuition and MAP, we increment our number of failures by 1 for
arm 2 only.
At the next time step, we do the same! Sample from each arm’s Beta and choose the arm with the highest
sample. We’ll see it for a more interesting example below after skipping a few time steps.
Now let's say we're at time step 4, and we see the following chart. It depicts the current Beta densities
for each arm, and the sample we got from each.
We can see from the α_i's and β_i's that we still haven't pulled arm 1 (both parameters are still at 1), we pulled
arm 2 and got a reward of 0 (β_2 = 2), and we pulled arm 3 twice and got one 1 and one 0 (α_3 = β_3 = 2).
See the density functions below: arm 1 is equally likely to be any number in [0, 1], whereas arm 2 is more
likely to give a low number. Arm 3 is more certain of being in the center.
You can see that Thompson Sampling just uses this ingenious idea of sampling rather than just taking the
MAP, and it works great! We'll see some comparisons below between UCB and Thompson sampling.
Note that with a single-line change, instead of sampling in line 3, if we just took the MAP (which equals the
MLE because of our uniform prior), we would revert back to the greedy algorithm! The exploration
comes from the sampling, which works out great for us!
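Here's a minimal Python sketch of Thompson sampling for Bernoulli rewards (pull is the same hypothetical 0/1 reward function as in the earlier sketches):

    import random

    def thompson_sampling(K, T, pull):
        """Thompson sampling for Bernoulli rewards: Beta(1, 1) prior on each p_i,
        sample from each posterior, pull the arm with the largest sample."""
        alpha = [1] * K      # alpha_i - 1 = observed successes for arm i
        beta = [1] * K       # beta_i  - 1 = observed failures for arm i
        reward = 0
        for _ in range(T):
            samples = [random.betavariate(alpha[i], beta[i]) for i in range(K)]
            i = max(range(K), key=lambda j: samples[j])
            r = pull(i)          # observe a 0/1 reward
            reward += r
            if r == 1:
                alpha[i] += 1    # conjugate Beta posterior update: one more success
            else:
                beta[i] += 1     # one more failure
        return reward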
It might be a bit hard to see, but notice Thompson sampling's regret got close to 0 a lot faster than UCB's:
UCB's leveled off around time 5000, while Thompson sampling's did around time 2000. The reason why
Thompson sampling might be "better" is unfortunately out of scope.
Below is my favorite visualization of all. On the x-axis we have time, and on the y-axis, we have the
proportion of time each arm was pulled (there were K = 5 arms). Notice how arm 2 (green) has the highest
true expected reward at 0.89, and how quickly Thompson sampling discovered it and started exploiting it.
Here are the benefits and drawbacks of using Traditional A/B Testing vs Multi-Armed Bandits. Each has
its own advantages, and you should carefully consider which approach to take before arbitrarily deciding!
When to use Traditional A/B Testing:
• Need to collect data for critical business decisions.
• Need statistical confidence in all your results and impact. Want to learn even about treatments that
didn’t perform well.
• The reward is not immediate (e.g., in drug testing, you can't wait for each patient's outcome before
treating the next patient).
• Optimize/measure multiple metrics, not just one.
When to use Multi-Armed Bandits:
• No need for interpreting results; just maximize reward (typically revenue/engagement).
• The opportunity cost is high (if advertising a car, losing a conversion is $20,000).
• Can add/remove arms in the middle of an experiment! This cannot be done with A/B tests.
The study of Multi-Armed Bandits can be categorized as:
• Statistics
• Optimization
• “Reinforcement Learning” (subfield of Machine Learning)
Standard Normal CDF Table: entries give Φ(z) = P(Z ≤ z); row gives z to one decimal, column gives the second decimal.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5 0.50399 0.50798 0.51197 0.51595 0.51994 0.52392 0.5279 0.53188 0.53586
0.1 0.53983 0.5438 0.54776 0.55172 0.55567 0.55962 0.56356 0.56749 0.57142 0.57535
0.2 0.57926 0.58317 0.58706 0.59095 0.59483 0.59871 0.60257 0.60642 0.61026 0.61409
0.3 0.61791 0.62172 0.62552 0.6293 0.63307 0.63683 0.64058 0.64431 0.64803 0.65173
0.4 0.65542 0.6591 0.66276 0.6664 0.67003 0.67364 0.67724 0.68082 0.68439 0.68793
0.5 0.69146 0.69497 0.69847 0.70194 0.7054 0.70884 0.71226 0.71566 0.71904 0.7224
0.6 0.72575 0.72907 0.73237 0.73565 0.73891 0.74215 0.74537 0.74857 0.75175 0.7549
0.7 0.75804 0.76115 0.76424 0.7673 0.77035 0.77337 0.77637 0.77935 0.7823 0.78524
0.8 0.78814 0.79103 0.79389 0.79673 0.79955 0.80234 0.80511 0.80785 0.81057 0.81327
0.9 0.81594 0.81859 0.82121 0.82381 0.82639 0.82894 0.83147 0.83398 0.83646 0.83891
1.0 0.84134 0.84375 0.84614 0.84849 0.85083 0.85314 0.85543 0.85769 0.85993 0.86214
1.1 0.86433 0.8665 0.86864 0.87076 0.87286 0.87493 0.87698 0.879 0.881 0.88298
1.2 0.88493 0.88686 0.88877 0.89065 0.89251 0.89435 0.89617 0.89796 0.89973 0.90147
1.3 0.9032 0.9049 0.90658 0.90824 0.90988 0.91149 0.91309 0.91466 0.91621 0.91774
1.4 0.91924 0.92073 0.9222 0.92364 0.92507 0.92647 0.92785 0.92922 0.93056 0.93189
1.5 0.93319 0.93448 0.93574 0.93699 0.93822 0.93943 0.94062 0.94179 0.94295 0.94408
1.6 0.9452 0.9463 0.94738 0.94845 0.9495 0.95053 0.95154 0.95254 0.95352 0.95449
1.7 0.95543 0.95637 0.95728 0.95818 0.95907 0.95994 0.9608 0.96164 0.96246 0.96327
1.8 0.96407 0.96485 0.96562 0.96638 0.96712 0.96784 0.96856 0.96926 0.96995 0.97062
1.9 0.97128 0.97193 0.97257 0.9732 0.97381 0.97441 0.975 0.97558 0.97615 0.9767
2.0 0.97725 0.97778 0.97831 0.97882 0.97932 0.97982 0.9803 0.98077 0.98124 0.98169
2.1 0.98214 0.98257 0.983 0.98341 0.98382 0.98422 0.98461 0.985 0.98537 0.98574
2.2 0.9861 0.98645 0.98679 0.98713 0.98745 0.98778 0.98809 0.9884 0.9887 0.98899
2.3 0.98928 0.98956 0.98983 0.9901 0.99036 0.99061 0.99086 0.99111 0.99134 0.99158
2.4 0.9918 0.99202 0.99224 0.99245 0.99266 0.99286 0.99305 0.99324 0.99343 0.99361
2.5 0.99379 0.99396 0.99413 0.9943 0.99446 0.99461 0.99477 0.99492 0.99506 0.9952
2.6 0.99534 0.99547 0.9956 0.99573 0.99585 0.99598 0.99609 0.99621 0.99632 0.99643
2.7 0.99653 0.99664 0.99674 0.99683 0.99693 0.99702 0.99711 0.9972 0.99728 0.99736
2.8 0.99744 0.99752 0.9976 0.99767 0.99774 0.99781 0.99788 0.99795 0.99801 0.99807
2.9 0.99813 0.99819 0.99825 0.99831 0.99836 0.99841 0.99846 0.99851 0.99856 0.99861
3.0 0.99865 0.99869 0.99874 0.99878 0.99882 0.99886 0.99889 0.99893 0.99896 0.999
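If you'd rather compute Φ(z) than look it up: Φ has no closed form, but it can be written via the error function, which Python's math module provides. A quick sketch:

    import math

    def phi(z):
        """Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2."""
        return (1 + math.erf(z / math.sqrt(2))) / 2

    print(phi(1.00))   # ~0.84134, matching the table row z = 1.0, column 0.00
    print(phi(-1.00))  # ~0.15866 = 1 - Phi(1.00), by symmetry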
Discrete Distributions

Uniform (discrete): X ∼ Unif(a, b) for a, b ∈ Z and a ≤ b. Equally likely to be any integer in [a, b].
Range Ω_X = {a, . . . , b}. E[X] = (a + b)/2. Var(X) = (b − a)(b − a + 2)/12. PMF: p_X(k) = 1/(b − a + 1).

Bernoulli: X ∼ Ber(p) for p ∈ [0, 1]. Takes value 1 with prob p and 0 with prob 1 − p.
Range Ω_X = {0, 1}. E[X] = p. Var(X) = p(1 − p). PMF: p_X(k) = p^k (1 − p)^(1−k).

Binomial: X ∼ Bin(n, p) for n ∈ N and p ∈ [0, 1]. Sum of n iid Ber(p) rvs; # of heads in n independent coin flips with P(head) = p.
Range Ω_X = {0, 1, . . . , n}. E[X] = np. Var(X) = np(1 − p). PMF: p_X(k) = C(n, k) p^k (1 − p)^(n−k).

Poisson: X ∼ Poi(λ) for λ > 0. # of events that occur in one unit of time, independently with rate λ per unit time.
Range Ω_X = {0, 1, . . .}. E[X] = λ. Var(X) = λ. PMF: p_X(k) = e^(−λ) λ^k / k!.

Geometric: X ∼ Geo(p) for p ∈ [0, 1]. # of independent Bernoulli trials with parameter p up to and including the first success.
Range Ω_X = {1, 2, . . .}. E[X] = 1/p. Var(X) = (1 − p)/p². PMF: p_X(k) = (1 − p)^(k−1) p.

Hypergeometric: X ∼ HypGeo(N, K, n) for n, K ≤ N and n, K, N ∈ N. # of successes in n draws (w/o replacement) from N items that contain K successes in total.
Range Ω_X = {max(0, n + K − N), . . . , min(n, K)}. E[X] = nK/N. Var(X) = n · K(N − K)(N − n)/(N²(N − 1)). PMF: p_X(k) = C(K, k) C(N − K, n − k) / C(N, n).

Negative Binomial: X ∼ NegBin(r, p) for r ∈ N and p ∈ [0, 1]. Sum of r iid Geo(p) rvs; # of independent flips until the rth head with P(head) = p.
Range Ω_X = {r, r + 1, . . .}. E[X] = r/p. Var(X) = r(1 − p)/p². PMF: p_X(k) = C(k − 1, r − 1) p^r (1 − p)^(k−r).

Multinomial: X ∼ Mult_r(n, p) for r, n ∈ N and p = (p_1, p_2, ..., p_r) with Σ_{i=1}^r p_i = 1. Generalization of the Binomial distribution: n trials with r categories, each with probability p_i.
Range: k_i ∈ {0, . . . , n} for i ∈ {1, . . . , r} with Σ k_i = n. E[X] = np = (np_1, . . . , np_r)^T. Var(X_i) = np_i(1 − p_i). Cov(X_i, X_j) = −np_i p_j for i ≠ j. PMF: p_X(k_1, . . . , k_r) = (n choose k_1, . . . , k_r) Π_{i=1}^r p_i^(k_i).

Multivariate Hypergeometric: X ∼ MVHG_r(N, K, n) for r, n ∈ N, K ∈ N^r and N = Σ_{i=1}^r K_i. Generalization of the Hypergeometric distribution: n draws (w/out replacement) from r categories, each with K_i successes.
Range: k_i ∈ {0, . . . , K_i} for i ∈ {1, . . . , r} with Σ k_i = n. E[X] = nK/N = (nK_1/N, . . . , nK_r/N)^T. Var(X_i) = n · (K_i/N) · ((N − K_i)/N) · ((N − n)/(N − 1)). Cov(X_i, X_j) = −n · (K_i/N) · (K_j/N) · ((N − n)/(N − 1)) for i ≠ j. PMF: p_X(k_1, . . . , k_r) = [Π_{i=1}^r C(K_i, k_i)] / C(N, n).
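If you want to sanity-check any row of this table numerically, scipy.stats implements all of these distributions. A quick sketch (the parameters are arbitrary examples):

    from scipy import stats

    X = stats.binom(n=10, p=0.3)          # Binomial(10, 0.3)
    print(X.pmf(4))                        # P(X = 4) = C(10,4) 0.3^4 0.7^6 ~ 0.2001
    print(X.mean(), X.var())               # np = 3.0, np(1-p) = 2.1

    Y = stats.poisson(mu=2)                # Poisson(lambda = 2)
    print(Y.pmf(0))                        # e^{-2} ~ 0.1353

    Z = stats.geom(p=0.5)                  # Geometric(0.5) on {1, 2, ...}
    print(Z.mean(), Z.var())               # 1/p = 2.0, (1-p)/p^2 = 2.0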
Continuous Distributions

Uniform (continuous): X ∼ Unif(a, b) for a < b. Equally likely to be any real number in [a, b].
Range Ω_X = [a, b]. E[X] = (a + b)/2. Var(X) = (b − a)²/12. PDF: f_X(x) = 1/(b − a) for x ∈ [a, b]. CDF: F_X(x) = 0 if x < a; (x − a)/(b − a) if a ≤ x < b; 1 if x ≥ b.

Exponential: X ∼ Exp(λ) for λ > 0. Time until the first event in a Poisson process.
Range Ω_X = [0, ∞). E[X] = 1/λ. Var(X) = 1/λ². PDF: f_X(x) = λe^(−λx). CDF: F_X(x) = 0 if x < 0; 1 − e^(−λx) if x ≥ 0.

Normal: X ∼ N(µ, σ²) for µ ∈ R and σ² > 0. Standard bell curve.
Range Ω_X = (−∞, ∞). E[X] = µ. Var(X) = σ². PDF: f_X(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)). CDF: F_X(x) = Φ((x − µ)/σ).

Gamma: X ∼ Gam(r, λ) for r, λ > 0. Sum of r iid Exp(λ) rvs; time to the rth event in a Poisson process. Conjugate prior for the Exp and Poi parameter λ.
Range Ω_X = [0, ∞). E[X] = r/λ. Var(X) = r/λ². PDF: f_X(x) = (λ^r / Γ(r)) x^(r−1) e^(−λx). Note: Γ(r) = (r − 1)! for integers r.

Beta: X ∼ Beta(α, β) for α, β > 0. Conjugate prior for the Ber, Bin, Geo, NegBin parameter p.
Range Ω_X = (0, 1). E[X] = α/(α + β). Var(X) = αβ/((α + β)²(α + β + 1)). PDF: f_X(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1).

Dirichlet: X ∼ Dir(α_1, α_2, . . . , α_r) for r ∈ N and α_i > 0. Generalization of the Beta distribution; conjugate prior for the Multinomial parameter p.
Range: x_i ∈ (0, 1) with Σ_{i=1}^r x_i = 1. E[X_i] = α_i / Σ_{j=1}^r α_j. PDF: f_X(x) = (1/B(α)) Π_{i=1}^r x_i^(α_i − 1).

Multivariate Normal: X ∼ N_n(µ, Σ) for µ ∈ R^n and Σ ∈ R^(n×n). Generalization of the Normal distribution.
Range: R^n. E[X] = µ. Var(X) = Σ. PDF: f_X(x) = (1/((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2)(x − µ)^T Σ^(−1) (x − µ)).
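The same numerical sanity checks work here with scipy.stats; one caution is that scipy parameterizes the Exponential and Gamma by scale = 1/λ rather than by the rate λ:

    from scipy import stats

    print(stats.norm.cdf(1.0))             # Phi(1.0) ~ 0.84134, as in the z-table
    print(stats.expon(scale=1/2).mean())   # Exp(lambda=2): mean = 1/lambda = 0.5
    print(stats.beta(3, 5).mean())         # Beta(3, 5): alpha/(alpha+beta) = 0.375
    print(stats.gamma(a=3, scale=1/2).var())  # Gam(r=3, lambda=2): r/lambda^2 = 0.75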
Probability & Statistics with Applications to Computing
Key Definitions and Theorems
1 Combinatorial Theory
1.1 So You Think You Can Count?
The Sum Rule: If an experiment can either end up being one of N outcomes, or one of M outcomes (where there is no
overlap), then the total number of possible outcomes is: N + M .
The Product Rule: If an experiment has N_1 outcomes for the first stage, N_2 outcomes for the second stage, . . . , and N_m
outcomes for the mth stage, then the total number of outcomes of the experiment is N_1 × N_2 × · · · × N_m = Π_{i=1}^m N_i.
Complementary Counting: Let U be a (finite) universal set, and S a subset of interest. Then, |S| = |U| − |U \ S|.
k-Permutations: If we want to pick (order matters) only k out of n distinct objects, the number of ways to do so is:
P(n, k) = n · (n − 1) · (n − 2) · . . . · (n − k + 1) = n!/(n − k)!
k-Combinations/Binomial Coefficients: If we want to choose (order doesn't matter) only k out of n distinct objects,
the number of ways to do so is:
C(n, k) = (n choose k) = P(n, k)/k! = n!/(k!(n − k)!)
Multinomial Coefficients: If we have k distinct types of objects (n total), with n_1 of the first type, n_2 of the second, ...,
and n_k of the kth, then the number of arrangements possible is
(n choose n_1, n_2, ..., n_k) = n!/(n_1! n_2! . . . n_k!)
Stars and Bars/Divider Method: The number of ways to distribute n indistinguishable balls into k distinguishable bins
is
(n + (k − 1) choose k − 1) = (n + (k − 1) choose n)
Pigeonhole Principle: If there are n pigeons we want to put into k holes (where n > k), then at least one pigeonhole must
contain at least 2 (or to be precise, ⌈n/k⌉) pigeons.
Combinatorial Proofs: To prove two quantities are equal, you can come up with a combinatorial situation, and show that
both in fact count the same thing, and hence must be equal.
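If you ever want to check these counts numerically, Python's math module (3.8+) has perm and comb built in; the multinomial helper below is my own small utility, not a standard library function:

    import math

    print(math.perm(5, 2))    # P(5, 2) = 5!/(5-2)! = 20
    print(math.comb(5, 2))    # C(5, 2) = 5!/(2! 3!) = 10

    def multinomial(*ks):
        """Multinomial coefficient (n choose k1, ..., kr) where n = k1 + ... + kr."""
        n, out = sum(ks), 1
        for k in ks:
            out *= math.comb(n, k)
            n -= k
        return out

    print(multinomial(2, 1, 1))            # 4!/(2! 1! 1!) = 12
    print(math.comb(5 + (3 - 1), 3 - 1))   # stars and bars: 5 balls, 3 bins -> C(7, 2) = 21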
2 Discrete Probability
2.1 Discrete Probability
Key Probability Definitions: The sample space is the set Ω of all possible outcomes of an experiment. An event is
any subset E ⊆ Ω. Events E and F are mutually exclusive if E ∩ F = ∅.
Axioms of Probability:
1. (Axiom: Nonnegativity) P(E) ≥ 0 for any event E.
2. (Axiom: Normalization) P(Ω) = 1.
3. (Axiom: Countable Additivity) If E and F are mutually exclusive, then P(E ∪ F) = P(E) + P(F).
Equally Likely Outcomes: If Ω is a sample space such that each of the unique outcome elements in Ω is equally likely,
then for any event E ⊆ Ω: P(E) = |E|/|Ω|.
Conditional Probability: P(A | B) = P(A ∩ B) / P(B)
Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
Partition: Non-empty events E_1, . . . , E_n partition the sample space Ω if they are both:
• (Exhaustive) E_1 ∪ E_2 ∪ · · · ∪ E_n = ⋃_{i=1}^n E_i = Ω (they cover the entire sample space).
• (Pairwise Mutually Exclusive) For all i ≠ j, E_i ∩ E_j = ∅ (none of them overlap).
Law of Total Probability (LTP): If events E_1, . . . , E_n partition Ω, then for any event F:
P(F) = Σ_{i=1}^n P(F ∩ E_i) = Σ_{i=1}^n P(F | E_i) P(E_i)
Bayes' Theorem with LTP: Let events E_1, . . . , E_n partition the sample space Ω, and let F be another event. Then:
P(E_1 | F) = P(F | E_1) P(E_1) / Σ_{i=1}^n P(F | E_i) P(E_i)
2.3 Independence
Independence: A and B are independent if any of the following equivalent statements hold:
1. P(A | B) = P(A)
2. P(B | A) = P(B)
3. P(A, B) = P(A) P(B)
Mutual Independence: We say n events A_1, A_2, . . . , A_n are (mutually) independent if, for any subset I ⊆ [n] =
{1, 2, . . . , n}, we have
P(⋂_{i∈I} A_i) = Π_{i∈I} P(A_i)
This equation is actually representing 2^n equations since there are 2^n subsets of [n].
Conditional Independence: A and B are conditionally independent given an event C if any of the following
equivalent statements hold:
1. P (A | B, C) = P (A | C)
2. P (B | A, C) = P (B | C)
3. P (A, B | C) = P (A | C) P (B | C)
Random Variable (RV): A random variable (RV) X is a numeric function of the outcome, X : Ω → R. The set of possible
values X can take on is its range/support, denoted Ω_X.
If Ω_X is finite or countably infinite (typically integers or a subset), X is a discrete RV. Else if Ω_X is uncountably large (the
size of the real numbers), X is a continuous RV.
Probability Mass Function (PMF): For a discrete RV X, assigns probabilities to values in its range. That is, p_X : Ω_X →
[0, 1] where: p_X(k) = P(X = k).
Expectation: The expectation of a discrete RV X is: E[X] = Σ_{k∈Ω_X} k · p_X(k).
Linearity of Expectation (LoE): E[aX + bY + c] = aE[X] + bE[Y] + c
Law of the Unconscious Statistician (LOTUS): For a discrete RV X and function g, E[g(X)] = Σ_{b∈Ω_X} g(b) · p_X(b).
3.3 Variance
Linearity of Expectation with Indicators: If asked only about the expectation of a RV X which is some sort of “count”
(and not its PMF), then you may be able to write X as the sum of possibly dependent indicator RVs X1 , . . . , Xn , and apply
LoE, where for an indicator RV Xi , E [Xi ] = 1 · P (Xi = 1) + 0 · P (Xi = 0) = P (Xi = 1).
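As a quick sanity check of this trick (a sketch; the fixed-points-of-a-random-permutation example is my own choice of illustration), the simulation below matches the answer that indicators give, E[X] = n · (1/n) = 1:

    import random

    # X = # fixed points of a random permutation = X_1 + ... + X_n, where X_i
    # indicates position i is fixed. E[X_i] = P(X_i = 1) = 1/n, so E[X] = 1 by
    # LoE, even though the X_i are dependent.
    n, trials, total = 10, 100_000, 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        total += sum(1 for i in range(n) if perm[i] == i)
    print(total / trials)   # ~1.0, regardless of n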
Variance: Var(X) = E[(X − E[X])²] = E[X²] − E[X]².
Standard Deviation (SD): σ_X = √Var(X).
Independence: Random variables X and Y are independent, denoted X ⊥ Y, if for all x ∈ Ω_X and all y ∈ Ω_Y:
P(X = x ∩ Y = y) = P(X = x) · P(Y = y).
Independent and Identically Distributed (iid): We say X1 , . . . , Xn are said to be independent and identically
distributed (iid) if all the Xi ’s are independent of each other, and have the same distribution (PMF for discrete RVs, or
CDF for continuous RVs).
Variance Adds for Independent RVs: If X ⊥ Y, then Var(X + Y) = Var(X) + Var(Y).
Bernoulli Process: A Bernoulli process with parameter p is a sequence of independent coin flips X1 , X2 , X3 , ... where
P (head) = p. If flip i is heads, then we encode Xi = 1; otherwise, Xi = 0.
E[X] = p and Var(X) = p(1 − p). An example of a Bernoulli/indicator RV is one flip of a coin with P(head) = p. By a
clever trick, we can write
p_X(k) = p^k (1 − p)^(1−k), k = 0, 1
Poisson Approximation of the Binomial: As n → ∞ and p → 0, with np = λ fixed, Bin(n, p) → Poi(λ). If X_1, . . . , X_n are independent Binomial RVs, where X_i ∼ Bin(N_i, p), then
X = X_1 + . . . + X_n ∼ Bin(N_1 + . . . + N_n, p).
Uniform Random Variable (Discrete): X ∼ Uniform(a, b) (Unif(a, b) for short), for integers a ≤ b, iff X has PMF:
p_X(k) = 1/(b − a + 1), k ∈ Ω_X = {a, a + 1, . . . , b}
E[X] = (a + b)/2 and Var(X) = (b − a)(b − a + 2)/12. This represents each integer in [a, b] to be equally likely. For example, a single roll
of a fair die is Unif(1, 6).
Geometric Random Variable: X ∼ Geometric(p) (Geo(p) for short) iff X has PMF:
p_X(k) = (1 − p)^(k−1) p, k ∈ Ω_X = {1, 2, . . .}
E[X] = 1/p and Var(X) = (1 − p)/p². An example of a Geometric RV is the number of independent coin flips up to and including
the first head, where P(head) = p.
Negative Binomial Random Variable: X ∼ NegativeBinomial(r, p) (NegBin(r, p) for short) iff X has PMF:
p_X(k) = C(k − 1, r − 1) p^r (1 − p)^(k−r), k ∈ Ω_X = {r, r + 1, r + 2, . . .}
E[X] = r/p and Var(X) = r(1 − p)/p². X is the sum of r iid Geo(p) random variables. An example of a Negative Binomial RV is
the number of independent coin flips up to and including the rth head, where P(head) = p. If X_1, . . . , X_n are independent
Negative Binomial RVs, where X_i ∼ NegBin(r_i, p), then X = X_1 + . . . + X_n ∼ NegBin(r_1 + . . . + r_n, p).
Hypergeometric Random Variable: X ∼ HypGeo(N, K, n) iff X has PMF:
p_X(k) = C(K, k) C(N − K, n − k) / C(N, n), k ∈ Ω_X = {max(0, n + K − N), . . . , min(n, K)}
E[X] = nK/N and Var(X) = n · K(N − K)(N − n)/(N²(N − 1)). This represents the number of successes drawn, when n items are drawn from
a bag with N items (K of which are successes, and N − K failures) without replacement. If we did this with replacement,
then this scenario would be represented as Bin(n, K/N).
Probability Density Function (PDF): The probability density function (PDF) of a continuous RV X is the function
f_X : R → R, such that the following properties hold:
• f_X(z) ≥ 0 for all z ∈ R
• ∫_{−∞}^{∞} f_X(t) dt = 1
• P(a ≤ X ≤ b) = ∫_a^b f_X(w) dw
Cumulative Distribution Function (CDF): The cumulative distribution function (CDF) of ANY random variable
(discrete or continuous) is defined to be the function F_X : R → R with F_X(t) = P(X ≤ t). If X is a continuous RV, we have:
• F_X(t) = P(X ≤ t) = ∫_{−∞}^t f_X(w) dw for all t ∈ R
• d/du F_X(u) = f_X(u)
Uniform Random Variable (Continuous): X ∼ Uniform(a, b) (Unif(a, b) for short) iff X has PDF:
f_X(x) = 1/(b − a) if x ∈ Ω_X = [a, b], and 0 otherwise
E[X] = (a + b)/2 and Var(X) = (b − a)²/12. This represents each real number from [a, b] to be equally likely. Do NOT confuse this
with its discrete counterpart!
Exponential Random Variable: X ∼ Exponential(λ) (Exp(λ) for short) iff X has PDF:
f_X(x) = λe^(−λx), x ∈ Ω_X = [0, ∞)
E[X] = 1/λ and Var(X) = 1/λ². F_X(x) = 1 − e^(−λx) for x ≥ 0. The exponential RV is the continuous analog of the geometric
RV: it represents the waiting time to the next event, where λ > 0 is the average number of events per unit time. Note that
the exponential measures how much time passes until the next event (any real number, continuous), whereas the Poisson
measures how many events occur in a unit of time (nonnegative integer, discrete). The exponential RV is also memoryless:
for any s, t ≥ 0, P(X > s + t | X > s) = P(X > t).
Gamma Random Variable: X ∼ Gamma(r, λ) (Gam(r, λ) for short) iff X has PDF:
f_X(x) = (λ^r / Γ(r)) x^(r−1) e^(−λx), x ∈ Ω_X = [0, ∞)
E[X] = r/λ and Var(X) = r/λ². X is the sum of r iid Exp(λ) random variables. In the above PDF, for positive integers r,
Γ(r) = (r − 1)! (a normalizing constant). An example of a Gamma RV is the waiting time until the rth event in a Poisson
process. If X_1, . . . , X_n are independent Gamma RVs, where X_i ∼ Gam(r_i, λ), then X = X_1 + . . . + X_n ∼ Gam(r_1 + . . . + r_n, λ).
It also serves as a conjugate prior for λ in the Poisson and Exponential distributions.
Normal (Gaussian, "bell curve") Random Variable: X ∼ N(µ, σ²) iff X has PDF:
f_X(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)), x ∈ Ω_X = R
E[X] = µ and Var(X) = σ². The "standard normal" random variable is typically denoted Z and has mean 0 and variance 1:
if X ∼ N(µ, σ²), then Z = (X − µ)/σ ∼ N(0, 1). The CDF has no closed form, but we denote the CDF of the standard normal as
Φ(z) = F_Z(z) = P(Z ≤ z). Note from symmetry of the probability density function about z = 0 that: Φ(−z) = 1 − Φ(z).
Closure of the Normal Under Scale and Shift: If X ∼ N(µ, σ²), then aX + b ∼ N(aµ + b, a²σ²). In particular, we
can always scale/shift to get the standard Normal: (X − µ)/σ ∼ N(0, 1).
Closure of the Normal Under Addition: If X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²) are independent, then
aX + bY + c ∼ N(aµ_X + bµ_Y + c, a²σ_X² + b²σ_Y²)
Steps to compute PDF of Y = g(X) from X (via CDF): Suppose X is a continuous RV. Write down the range Ω_Y, compute the CDF F_Y(y) = P(g(X) ≤ y) by reducing it to a probability statement about X, then differentiate: f_Y(y) = d/dy F_Y(y).
Explicit Formula to compute PDF of Y = g(X) from X (Univariate Case): Suppose X is a continuous RV. If Y =
g(X) and g : Ω_X → Ω_Y is strictly monotone and invertible with inverse X = g^(−1)(Y) = h(Y), then
f_Y(y) = f_X(h(y)) · |h′(y)| if y ∈ Ω_Y, and 0 otherwise
Explicit Formula to compute PDF of Y = g(X) from X (Multivariate Case): Let X = (X_1, ..., X_n), Y =
(Y_1, ..., Y_n) be continuous random vectors (each component is a continuous rv) with the same dimension n (so Ω_X, Ω_Y ⊆ R^n),
and Y = g(X) where g : Ω_X → Ω_Y is invertible and differentiable, with differentiable inverse X = g^(−1)(y) = h(y). Then,
f_Y(y) = f_X(h(y)) |det(∂h(y)/∂y)|
where ∂h(y)/∂y ∈ R^(n×n) is the Jacobian matrix of partial derivatives of h, with
(∂h(y)/∂y)_ij = ∂(h(y))_i / ∂y_j
Cartesian Product of Sets: The Cartesian product of sets A and B is denoted: A × B = {(a, b) : a ∈ A, b ∈ B}.
Joint PMFs: Let X, Y be discrete random variables. The joint PMF of X and Y is:
pX,Y (a, b) = P (X = a, Y = b)
The joint range is the set of pairs (c, d) that have nonzero probability: Ω_{X,Y} = {(c, d) : p_{X,Y}(c, d) > 0} ⊆ Ω_X × Ω_Y
Further, note that if g : R² → R is a function, then LOTUS extends to the multidimensional case:
E[g(X, Y)] = Σ_{x∈Ω_X} Σ_{y∈Ω_Y} g(x, y) p_{X,Y}(x, y)
Marginal PMFs: Let X, Y be discrete random variables. The marginal PMF of X is: p_X(a) = Σ_{b∈Ω_Y} p_{X,Y}(a, b).
Independence (DRVs): Discrete RVs X, Y are independent, written X ⊥ Y, if for all x ∈ Ω_X and y ∈ Ω_Y: p_{X,Y}(x, y) =
p_X(x) p_Y(y).
Variance Adds for Independent RVs: If X ⊥ Y, then: Var(X + Y) = Var(X) + Var(Y).
Joint PDFs: Let X, Y be continuous random variables. The joint PDF of X and Y is a function f_{X,Y}(a, b) ≥ 0.
The joint range is the set of pairs (c, d) that have nonzero density: Ω_{X,Y} = {(c, d) : f_{X,Y}(c, d) > 0} ⊆ Ω_X × Ω_Y
Further, note that if g : R² → R is a function, then LOTUS extends to the multidimensional case:
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(s, t) f_{X,Y}(s, t) ds dt
The joint PDF must satisfy the following (similar to univariate PDFs):
P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_a^b ∫_c^d f_{X,Y}(x, y) dy dx
Marginal PDFs: Let X, Y be continuous random variables. The marginal PDF of X is: f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy.
Independence of Continuous Random Variables: Continuous RVs X, Y are independent, written X ⊥ Y, if for all
x ∈ Ω_X and y ∈ Ω_Y: f_{X,Y}(x, y) = f_X(x) f_Y(y).
Conditional PMFs and PDFs: If X, Y are discrete, the conditional PMF of X given Y is: p_{X|Y}(x | y) = P(X = x | Y = y) = p_{X,Y}(x, y) / p_Y(y).
Similarly for continuous RVs, but with f's instead of p's (PDFs instead of PMFs).
Conditional Expectation: If X is discrete (and Y is either discrete or continuous), then we define the conditional expectation of g(X) given (the event that) Y = y as:
E[g(X) | Y = y] = Σ_{x∈Ω_X} g(x) p_{X|Y}(x | y)
If X is continuous, replace the sum with an integral and the conditional PMF with the conditional PDF.
Notice that these sums and integrals are over x (not y), since E [g(X) | Y = y] is a function of y.
Law of Total Expectation (LTE): Basically, for E[g(X)], we take a weighted average of E[g(X) | Y = y] over all possible values of y: if Y is discrete, E[g(X)] = Σ_{y∈Ω_Y} E[g(X) | Y = y] p_Y(y), and if Y is continuous, replace the sum with an integral against f_Y(y).
4. Cov(X + c, Y) = Cov(X, Y). (Shifting doesn't and shouldn't affect the covariance.)
5. Cov(aX + bY, Z) = a · Cov(X, Z) + b · Cov(Y, Z). This can be easily remembered like the distributive property of scalars:
(aX + bY)Z = a(XZ) + b(Y Z).
6. Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y), and hence if X ⊥ Y, then Var(X + Y) = Var(X) + Var(Y).
7. Cov(Σ_{i=1}^n X_i, Σ_{j=1}^m Y_j) = Σ_{i=1}^n Σ_{j=1}^m Cov(X_i, Y_j). That is, covariance works like FOIL (first, outer, inner, last) for
multiplication of sums ((a + b + c)(d + e) = ad + ae + bd + be + cd + ce).
(Pearson) Correlation: The (Pearson) correlation of X and Y is: ρ(X, Y) = Cov(X, Y) / (√Var(X) · √Var(Y)).
It is always true that −1 ≤ ρ(X, Y) ≤ 1. That is, correlation is just a normalized version of covariance. Most notably,
ρ(X, Y) = ±1 if and only if Y = aX + b for some constants a, b ∈ R, and then the sign of ρ is the same as that of a.
5.5 Convolution
Convolution: If X ⊥ Y are independent, the distribution of Z = X + Y can be found by conditioning on one of them (LTP):
if X, Y are discrete, p_Z(z) = Σ_{x∈Ω_X} p_X(x) p_Y(z − x); if X, Y are continuous, f_Z(z) = ∫_{−∞}^{∞} f_X(x) f_Y(z − x) dx.
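A quick numerical illustration (my own example, two fair dice): numpy's convolve computes exactly this sum for finitely-supported PMFs:

    import numpy as np

    die = np.ones(6) / 6            # PMF of one fair die, supported on {1, ..., 6}
    total = np.convolve(die, die)   # PMF of the sum of two dice, supported on {2, ..., 12}
    print(total[10 - 2])            # P(sum = 10) = 3/36 ~ 0.0833
    print(total.sum())              # 1.0: still a valid PMF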
Moment Generating Functions (MGFs): The moment generating function (MGF) of X is a function of a dummy
variable t (use LOTUS to compute this): M_X(t) = E[e^(tX)].
Properties and Uniqueness of Moment Generating Functions: For a function f : R → R, we will denote f^(n)(x) to
be the nth derivative of f(x). Let X, Y be independent random variables, and a, b ∈ R be scalars. Then MGFs satisfy the
following properties:
1. M′_X(0) = E[X], M″_X(0) = E[X²], and in general M_X^(n)(0) = E[X^n]. This is why we call M_X a moment generating
function, as we can use it to generate the moments of X.
2. M_{aX+b}(t) = e^(tb) M_X(at).
3. If X ⊥ Y, then M_{X+Y}(t) = M_X(t) M_Y(t).
The Sample Mean + Properties: Let X_1, X_2, . . . , X_n be a sequence of iid RVs with mean µ and variance σ². The
sample mean is: X̄_n = (1/n) Σ_{i=1}^n X_i. Further, E[X̄_n] = µ and Var(X̄_n) = σ²/n.
The Law of Large Numbers (LLN): Let X_1, . . . , X_n be iid RVs with the same mean µ. As n → ∞, the sample mean
X̄_n converges to the true mean µ.
The Central Limit Theorem (CLT): Let X_1, . . . , X_n be a sequence of iid RVs with mean µ and (finite) variance σ².
Then as n → ∞,
X̄_n → N(µ, σ²/n)
The mean and variance are not a surprise; the importance of the CLT is that, regardless of the distribution of the X_i's, the sample
mean approaches a Normal distribution as n → ∞.
The Continuity Correction: When approximating an integer-valued (discrete) random variable X with a continuous one
Y (such as in the CLT), if asked to find P(a ≤ X ≤ b) for integers a ≤ b, you should use P(a − 0.5 ≤ Y ≤ b + 0.5) so that
the width of the interval being integrated is the same as the number of terms summed over (b − a + 1).
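A quick numerical check with scipy (the Bin(100, 0.5) numbers are an arbitrary example): the continuity-corrected Normal approximation is much closer to the exact Binomial probability:

    from scipy import stats

    n, p = 100, 0.5
    mu, sigma = n * p, (n * p * (1 - p)) ** 0.5     # CLT: X is roughly N(50, 25)

    exact = stats.binom.cdf(55, n, p) - stats.binom.cdf(44, n, p)       # P(45 <= X <= 55)
    naive = stats.norm.cdf(55, mu, sigma) - stats.norm.cdf(45, mu, sigma)
    corrected = stats.norm.cdf(55.5, mu, sigma) - stats.norm.cdf(44.5, mu, sigma)
    print(exact, naive, corrected)   # ~0.729 vs ~0.683 vs ~0.729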
Random Vectors (RVTRs): Let X1 , ..., Xn be random variables. We say X = (X1 , . . . , Xn )T is a random vector.
Expectation is defined pointwise: E [X] = (E [X1 ] , . . . , E [Xn ])T .
Covariance Matrices: The covariance matrix of a random vector X ∈ R^n with E[X] = µ is the matrix Σ = Var(X) =
Cov(X) whose entries are Σ_ij = Cov(X_i, X_j). The formula for this is:
Σ = Var(X) = Cov(X) = E[(X − µ)(X − µ)^T] = E[XX^T] − µµ^T
so Σ has Var(X_1), . . . , Var(X_n) down the diagonal and Cov(X_i, X_j) in off-diagonal entry (i, j).
Notice that the covariance matrix is symmetric (Σ_ij = Σ_ji), and has variances on the diagonal.
The Multinomial Distribution: Suppose there are r outcomes, with probabilities p = (p_1, p_2, ..., p_r) respectively, such
that Σ_{i=1}^r p_i = 1. Suppose we have n independent trials, and let Y = (Y_1, Y_2, ..., Y_r) be the rvtr of counts of each outcome.
Then, we say Y ∼ Mult_r(n, p).
The joint PMF of Y is:
p_{Y_1,...,Y_r}(k_1, ..., k_r) = (n choose k_1, ..., k_r) Π_{i=1}^r p_i^(k_i),   k_1, ..., k_r ≥ 0 and Σ_{i=1}^r k_i = n
Notice that each Y_i is marginally Bin(n, p_i). Hence, E[Y_i] = np_i and Var(Y_i) = np_i(1 − p_i).
Then, we can specify the entire mean vector E[Y] and covariance matrix:
E[Y] = np = (np_1, . . . , np_r)^T   Var(Y_i) = np_i(1 − p_i)   Cov(Y_i, Y_j) = −np_i p_j for i ≠ j
The Multivariate Hypergeometric (MVHG) Distribution: Suppose there are r different colors of balls in a bag,
having K = (K_1, ..., K_r) balls of each color i, 1 ≤ i ≤ r. Let N = Σ_{i=1}^r K_i be the total number of balls in the bag, and suppose
we draw n without replacement. Let Y = (Y_1, ..., Y_r) be the rvtr such that Y_i is the number of balls of color i we drew. We
write that Y ∼ MVHG_r(N, K, n). The joint PMF of Y is:
p_{Y_1,...,Y_r}(k_1, ..., k_r) = [Π_{i=1}^r C(K_i, k_i)] / C(N, n),   0 ≤ k_i ≤ K_i for all 1 ≤ i ≤ r and Σ_{i=1}^r k_i = n
The mean vector E[Y] and covariance matrix are:
E[Y] = nK/N = (nK_1/N, . . . , nK_r/N)^T
Var(Y_i) = n · (K_i/N) · ((N − K_i)/N) · ((N − n)/(N − 1))
Cov(Y_i, Y_j) = −n · (K_i/N) · (K_j/N) · ((N − n)/(N − 1)) for i ≠ j
Properties of Expectation and Variance Hold for RVTRs: Let X be an n-dimensional RVTR, A ∈ R^(n×n) be a constant matrix, and b ∈ R^n be a constant vector. Then: E[AX + b] = AE[X] + b and Var(AX + b) = A Var(X) A^T.
The Multivariate Normal Distribution: A random vector X = (X_1, ..., X_n) has a multivariate Normal distribution
with mean vector µ ∈ R^n and (symmetric and positive-definite) covariance matrix Σ ∈ R^(n×n), written X ∼ N_n(µ, Σ), if it
has the following joint PDF:
f_X(x) = (1/((2π)^(n/2) |Σ|^(1/2))) exp(−(1/2)(x − µ)^T Σ^(−1) (x − µ)),   x ∈ R^n
Additionally, let us recall that for any RVs X and Y: X ⊥ Y → Cov(X, Y) = 0. If X = (X_1, . . . , X_n) is Multivariate Normal,
the converse also holds: Cov(X_i, X_j) = 0 → X_i ⊥ X_j.
Order Statistics: Suppose Y_1, ..., Y_n are iid continuous random variables with common PDF f_Y and common CDF F_Y.
We sort the Y_i's such that Y_min ≡ Y_(1) < Y_(2) < ... < Y_(n) ≡ Y_max.
Notice that we can't have equality because with continuous random variables, the probability that any two are equal is 0.
Notice that each Y_(i) is a random variable as well! We call Y_(i) the ith order statistic, i.e. the ith smallest in a sample of
size n. The density function of each Y_(i) is
f_{Y_(i)}(y) = (n choose i − 1, 1, n − i) · [F_Y(y)]^(i−1) · [1 − F_Y(y)]^(n−i) · f_Y(y),   y ∈ Ω_Y
6 Concentration Inequalities
6.1 Markov and Chebyshev Inequalities
Markov's Inequality: Let X ≥ 0 be a non-negative RV, and let k > 0. Then: P(X ≥ k) ≤ E[X]/k.
Chebyshev's Inequality: Let X be any RV with expected value µ = E[X] and finite variance Var(X). Then, for any real
number α > 0: P(|X − µ| ≥ α) ≤ Var(X)/α².
Chernoff Bound for Binomial: Let X ∼ Bin(n, p) and let µ = E[X]. For any 0 < δ < 1:
P(X ≥ (1 + δ)µ) ≤ exp(−δ²µ/3)   and   P(X ≤ (1 − δ)µ) ≤ exp(−δ²µ/2)
Convex Functions: Let S ⊆ R^n be a convex set. A function g : S → R is a convex function if for any x_1, ..., x_m ∈ S
and p_1, ..., p_m ≥ 0 such that Σ_{i=1}^m p_i = 1,
g(Σ_{i=1}^m p_i x_i) ≤ Σ_{i=1}^m p_i g(x_i)
Jensen's Inequality: Let X be any RV, and g : R → R be convex. Then, g(E[X]) ≤ E[g(X)].
Hoeffding's Inequality: Let X_1, ..., X_n be independent random variables, where each X_i is bounded: a_i ≤ X_i ≤ b_i, and
let X̄_n be their sample mean. Then,
P(|X̄_n − E[X̄_n]| ≥ t) ≤ 2 exp(−2n²t² / Σ_{i=1}^n (b_i − a_i)²)
In the case X_1, ..., X_n are iid (so a ≤ X_i ≤ b for all i) with mean µ, then
P(|X̄_n − µ| ≥ t) ≤ 2 exp(−2n²t² / (n(b − a)²)) = 2 exp(−2nt² / (b − a)²)
7 Statistical Estimation
7.1 Maximum Likelihood Estimation
Realization / Sample: A realization/sample x of a random variable X is the value that is actually observed (it will always
be in Ω_X).
Likelihood: Let x = (x_1, ..., x_n) be iid realizations from PMF p_X(t | θ) (if X is discrete), or from density f_X(t | θ) (if X is
continuous), where θ is a parameter (or vector of parameters). We define the likelihood of x given θ to be the "probability"
of observing x if the true parameter is θ. The log-likelihood is just the log of the likelihood, which is typically easier to
optimize.
If X is discrete,
L(x | θ) = Π_{i=1}^n p_X(x_i | θ)   ln L(x | θ) = Σ_{i=1}^n ln p_X(x_i | θ)
If X is continuous,
L(x | θ) = Π_{i=1}^n f_X(x_i | θ)   ln L(x | θ) = Σ_{i=1}^n ln f_X(x_i | θ)
Maximum Likelihood Estimator (MLE): Let x = (x_1, ..., x_n) be iid realizations from probability mass function p_X(t | θ)
(if X is discrete), or from density f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). We define
the maximum likelihood estimator (MLE) θ̂_MLE of θ to be the parameter which maximizes the likelihood/log-likelihood:
θ̂_MLE = argmax_θ L(x | θ) = argmax_θ ln L(x | θ)
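A minimal numerical sketch (made-up data, assumed iid Poisson): minimizing the negative log-likelihood with scipy lands on the sample mean, which is the known closed-form Poisson MLE:

    import numpy as np
    from scipy import optimize, stats

    x = np.array([3, 1, 4, 1, 5, 9, 2, 6])   # made-up samples, assumed iid Poi(lambda)

    def neg_log_likelihood(lam):
        # ln L(x | lambda) = sum_i ln p_X(x_i | lambda); we minimize its negation.
        return -np.sum(stats.poisson.logpmf(x, lam))

    res = optimize.minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20), method="bounded")
    print(res.x, x.mean())   # both ~3.875: the optimizer recovers the sample mean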
Sample Moments: Let X be a random variable, and c ∈ R a scalar. Let x_1, . . . , x_n be iid realizations (samples) from X.
The kth sample moment of X is: (1/n) Σ_{i=1}^n x_i^k.
The kth sample moment of X (about c) is: (1/n) Σ_{i=1}^n (x_i − c)^k.
Method of Moments Estimation: Let x = (x_1, . . . , x_n) be iid realizations (samples) from PMF p_X(t; θ) (if X is discrete),
or from density f_X(t; θ) (if X is continuous), where θ is a parameter (or vector of parameters).
We then define the Method of Moments (MoM) estimator θ̂_MoM of θ = (θ_1, . . . , θ_k) to be a solution (if it exists)
to the k simultaneous equations where, for j = 1, . . . , k, we set the jth true and sample moments equal:
E[X] = (1/n) Σ_{i=1}^n x_i   · · ·   E[X^k] = (1/n) Σ_{i=1}^n x_i^k
Beta Random Variable: X ∼ Beta(α, β), if and only if X has the following PDF:
f_X(x) = (1/B(α, β)) x^(α−1) (1 − x)^(β−1) if x ∈ Ω_X = [0, 1], and 0 otherwise
X is typically the belief distribution about some unknown probability of success, where we pretend we've seen α − 1 successes
and β − 1 failures. Hence the mode (most likely value of the probability/point with highest density), argmax_{x∈[0,1]} f_X(x), is
mode[X] = (α − 1) / ((α − 1) + (β − 1))
Also note that there is an annoying "off-by-1" issue (α − 1 heads and β − 1 tails), so when choosing these parameters, be
careful! It also serves as a conjugate prior for p in the Bernoulli and Geometric distributions.
Dirichlet RV: X ∼ Dir(α_1, α_2, . . . , α_r), if and only if X has the following density function:
f_X(x) = (1/B(α)) Π_{i=1}^r x_i^(α_i − 1) if x_i ∈ (0, 1) and Σ_{i=1}^r x_i = 1, and 0 otherwise
This is a generalization of the Beta random variable from 2 outcomes to r. The random vector X is typically the belief
distribution about some unknown probabilities of the different outcomes, where we pretend we saw α_1 − 1 outcomes of type
1, α_2 − 1 outcomes of type 2, . . . , and α_r − 1 outcomes of type r. Hence, the mode of the distribution, argmax f_X(x) over
x ∈ (0, 1)^r with Σ x_i = 1, is the vector
mode[X] = ((α_1 − 1)/Σ_{i=1}^r (α_i − 1), (α_2 − 1)/Σ_{i=1}^r (α_i − 1), . . . , (α_r − 1)/Σ_{i=1}^r (α_i − 1))
Maximum A Posteriori (MAP) Estimation: Let x = (x_1, . . . , x_n) be iid realizations from PMF p_X(t ; Θ = θ) (if X is
discrete), or from density f_X(t ; Θ = θ) (if X is continuous), where Θ is the random variable representing the parameter (or
vector of parameters). We define the Maximum A Posteriori (MAP) estimator θ̂_MAP of Θ to be the parameter which
maximizes the posterior distribution of Θ given the data (the mode).
Mean Squared Error (MSE): The mean squared error (MSE) of an estimator θ̂ of θ is MSE(θ̂, θ) = E[(θ̂ − θ)²].
If θ̂ is an unbiased estimator of θ (i.e. E[θ̂] = θ), then you can see that MSE(θ̂, θ) = Var(θ̂). In fact, in general,
MSE(θ̂, θ) = Var(θ̂) + Bias(θ̂, θ)²
Consistency: An estimator θ̂_n (depending on n iid samples) of θ is said to be consistent if it converges (in probability) to
θ. That is, for any ε > 0, lim_{n→∞} P(|θ̂_n − θ| > ε) = 0.
Fisher Information: Let x = (x_1, ..., x_n) be iid realizations from PMF p_X(t | θ) (if X is discrete), or from density function
f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). The Fisher Information of a parameter θ
is defined to be
I(θ) = E[(∂ ln L(x | θ)/∂θ)²] = −E[∂² ln L(x | θ)/∂θ²]
Cramer-Rao Lower Bound (CRLB): Let x = (x_1, ..., x_n) be iid realizations from PMF p_X(t | θ) (if X is discrete), or
from density function f_X(t | θ) (if X is continuous), where θ is a parameter (or vector of parameters). If θ̂ is an unbiased
estimator for θ, then
MSE(θ̂, θ) = Var(θ̂) ≥ 1/I(θ)
That is, for any unbiased estimator θ̂ for θ, the variance (= MSE) is at least 1/I(θ). If we achieve this lower bound, meaning
our variance is exactly equal to 1/I(θ), then we have the best variance possible for our estimate. Hence, it is the minimum
variance unbiased estimator (MVUE) for θ.
Efficiency: Let θ̂ be an unbiased estimator of θ. The efficiency of θ̂ is e(θ̂, θ) = I(θ)^(−1) / Var(θ̂) ≤ 1.
An estimator is said to be efficient if it achieves the CRLB, meaning e(θ̂, θ) = 1.
Sufficient Statistics: A statistic T = T(X_1, . . . , X_n) is sufficient for θ if the conditional distribution of the data given T
does not depend on θ:
P(X_1 = x_1, . . . , X_n = x_n | T = t, θ) = P(X_1 = x_1, . . . , X_n = x_n | T = t)
Neyman-Fisher Factorization Criterion (NFFC): Let x_1, . . . , x_n be iid random samples with likelihood
L(x_1, . . . , x_n | θ). A statistic T = T(x_1, . . . , x_n) is sufficient if and only if there exist non-negative functions g and h such that:
L(x_1, . . . , x_n | θ) = g(T(x_1, . . . , x_n), θ) · h(x_1, . . . , x_n)
8 Statistical Inference
8.1 Confidence Intervals
Confidence Interval: Suppose you have iid samples x_1, ..., x_n from some distribution with unknown parameter θ, and you
have some estimator θ̂ for θ.
A 100(1 − α)% confidence interval for θ is an interval (typically but not always) centered at θ̂, [θ̂ − Δ, θ̂ + Δ], such that
the probability (over the randomness in the samples x_1, ..., x_n) that θ lies in the interval is 1 − α:
P(θ ∈ [θ̂ − Δ, θ̂ + Δ]) = 1 − α
If θ̂ = (1/n) Σ_{i=1}^n x_i is the sample mean, then θ̂ is approximately normal by the CLT, and a 100(1 − α)% confidence interval is
given by the formula:
[θ̂ − z_{1−α/2} σ/√n, θ̂ + z_{1−α/2} σ/√n]
where z_{1−α/2} = Φ^(−1)(1 − α/2) and σ is the true standard deviation of a single sample (which may need to be estimated).
Credible Intervals: Suppose you have iid samples x = (x_1, ..., x_n) from some distribution with unknown parameter Θ.
You are in the Bayesian setting, so you have chosen a prior distribution for the RV Θ.
A 100(1 − α)% credible interval for Θ is an interval [a, b] such that the probability (over the randomness in Θ) that Θ lies
in the interval is 1 − α:
P(Θ ∈ [a, b]) = 1 − α
If we've chosen the appropriate conjugate prior for the sampling distribution (like Beta for Bernoulli), the posterior is easy
to compute. Say the CDF of the posterior is F_Y. Then, a 100(1 − α)% credible interval is given by
[F_Y^(−1)(α/2), F_Y^(−1)(1 − α/2)]
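A minimal sketch with scipy, assuming a Beta(1, 1) prior on a Bernoulli parameter p and made-up counts of 60 successes and 40 failures:

    from scipy import stats

    # Posterior is Beta(1 + 60, 1 + 40) by conjugacy.
    posterior = stats.beta(61, 41)
    alpha = 0.05
    # ppf is the inverse CDF F_Y^{-1}, so these are the credible interval endpoints.
    print(posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2))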
8.2 Hypothesis Testing
1. Make a claim (like "Airplane food is good", "Pineapples belong on pizza", etc.)