
Probability & Statistics 2

Robert Šámal
January 29, 2024

Contents

1 Markov Chains
1.1 Introduction, basic properties
1.2 TODAY
1.3 Classification of states
1.4 Reducing the MC to smaller ones
1.5 Convergence to stationary distribution
1.6 Probability of absorption, time to absorption
1.7 Application: algorithm for 2-SAT, 3-SAT

2 Bayesian statistics
2.1 Two approaches to statistics
2.2 Preliminaries – conditional pmf, cdf, etc.
2.3 Bayesian method – basic description
2.4 Bayes theorem
2.5 Bayesian point estimates – MAP and LMS
2.6 Bayesian inference – examples
2.6.1 Naive Bayes classifier – both Θ and X are discrete
2.6.2 Estimating bias of a coin – Θ is continuous, X is discrete
2.6.3 Estimating normal random variables – both Θ and X are continuous

3 Conditional expectation

4 Stochastic processes
4.1 Bernoulli process
4.2 Poisson process

5 Balls and bins

6 Permutation test

7 Moment Generating Functions and their applications
7.1 Proof of CLT
7.2 Chernoff inequality
7.3 Applications of Chernoff

This is work in progress, likely full of typos and small (or larger) imprecisions. If
you find something that looks wrong, feel free to reach out to me. Thanks to Kryštof
Višňák for helping me with spotting typos in a previous version.

1 Markov Chains
1.1 Introduction, basic properties
Two examples to start with A machine can be in two states: working or broken. For
simplicity, we assume that the state stays the same for the whole day. Then, during
the night, the state changes at random according to the figure below: for instance, if
the machine is working one day, it will work the next day with probability 0.99, and
with probability 0.01 it breaks overnight. Crucially, we assume that this probability does
not depend on the age of the machine, nor on the previous states.
A fly is moving in a corridor that we consider as a collection of four spaces, labeled
0, 1, 2, 3. If the fly is in space 1 or 2, it stays at the same space with probability 0.4.
Otherwise, it moves equally likely one step left or right. At positions 0 and 3 there is a
spider, and the fly can never leave. Again, we assume that “the fly has no memory”, so
the probabilities do not depend on the past trajectory of the fly.
TODO: add figures
What are the common features of these examples? We consider a sequence of
random variables, a so-called random process. We do not care about the numerical value
of these variables, as we consider them as mere labels – so we will not ask about the
expected value of the position of the fly, for instance. We may assume that all the random
variables have range contained in a set S of labels. For simplicity we assume S to be
finite or countable (and frequently we will assume that S = {1, . . . , s} or S = N). We
also want to prescribe transition probabilities pi,j such that P (Xt+1 = j | Xt = i) =
pi,j . However, there is more subtlety to this: we want to explicitly forbid the history
(values of X0 , . . . , Xt−1 ) to have an influence on Xt+1 . (See also the section on stochastic
processes.)
Definition 1 (Markov chain). Let S be any finite or countably infinite set. A sequence
(Xt )_{t=0}^{∞} of random variables with range S is a (discrete time, discrete space, time-
homogeneous) Markov chain if for every t ≥ 0 and every a0 , . . . , at+1 ∈ S we have

P (Xt+1 = at+1 | Xt = at & . . . & X0 = a0 ) = P (Xt+1 = at+1 | Xt = at )    (1)

whenever the conditional probabilities are defined, that is, when P (Xt = at & . . . & X0 = a0 ) > 0.
We call the numbers pi,j (t) = P (Xt+1 = j | Xt = i) the transition probabilities.
We will only study cases where pi,j (t) does not depend on t and will omit the
t in the notation.

The transition matrix is a matrix P such that Pi,j = pi,j , that is, the entry at the i-th row
and j-th column is the probability of transition from state i to state j. As a consequence
of the definition, all entries in the transition matrix are nonnegative, and each row sums
to 1. We can describe this succinctly by writing P 1 = 1, with 1 denoting the column
vector of all 1’s.
Let P denote the transition matrix for the machine example and Q for the fly example. We have

P = ( 0.99  0.01 )        Q = ( 1    0    0    0   )
    ( 0.90  0.10 )            ( 0.3  0.4  0.3  0   )
                              ( 0    0.3  0.4  0.3 )
                              ( 0    0    0    1   )
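To make the two examples concrete, here is a minimal Python sketch (using numpy – our choice, not part of the notes) that stores the two transition matrices and checks the row-sum property P 1 = 1:

    import numpy as np

    # Machine: state 0 = working, state 1 = broken.
    P = np.array([[0.99, 0.01],
                  [0.90, 0.10]])

    # Fly: states 0..3; 0 and 3 are the spider (absorbing).
    Q = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.3, 0.4, 0.3, 0.0],
                  [0.0, 0.3, 0.4, 0.3],
                  [0.0, 0.0, 0.0, 1.0]])

    # Every row of a transition matrix sums to 1.
    assert np.allclose(P.sum(axis=1), 1.0)
    assert np.allclose(Q.sum(axis=1), 1.0)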

Transition graph/diagram is a directed graph with vertex set S and arcs (directed
edges) (i, j) for every i, j ∈ S such that pi,j > 0. We label arc (i, j) by pi,j . In other
words, the figures above (TODO) are transition graphs.

Describing the distribution. We will again use the basic tool to describe a random
variable, namely a PMF (probability mass function), that is, giving the probability of each
state (element of S). A common notation is

π^{(t)}_i = P (Xt = i).

For any t ≥ 0 we also consider π^{(t)} as a row vector with coordinates π^{(t)}_i for i ∈ S.

Transition of the distribution Suppose we know π^{(0)} ; what can we say about π^{(1)} ,
and π^{(t)} in general? By the law of total probability we have

P (X1 = j) = Σ_{i=1}^{s} P (X0 = i) · P (X1 = j | X0 = i),    so, in other notation,

π^{(1)}_j = Σ_{i=1}^{s} π^{(0)}_i · Pi,j ,    and using matrix multiplication:

π^{(1)} = π^{(0)} P

From this we easily get the following theorem:

Theorem 2. For any Markov chain and any k ≥ 0 we have

π^{(k)} = π^{(0)} P^k

and, more generally, π^{(t+k)} = π^{(t)} P^k .


Proof. By induction. TODO
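A quick numerical illustration of Theorem 2 (a sketch assuming numpy; the chain is the machine example above): evolve π^{(0)} step by step and compare with π^{(0)} P^k.

    import numpy as np

    P = np.array([[0.99, 0.01],
                  [0.90, 0.10]])
    pi0 = np.array([1.0, 0.0])      # start with a working machine

    # Evolve step by step: pi^{(t+1)} = pi^{(t)} P ...
    pi = pi0.copy()
    for _ in range(50):
        pi = pi @ P

    # ... and compare with the closed form pi^{(50)} = pi^{(0)} P^50.
    assert np.allclose(pi, pi0 @ np.linalg.matrix_power(P, 50))
    print(pi)                       # approx. [0.989, 0.011]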

k-step transition To look at the above theorem in a different way, we define the
following:

ri,j (k) = P (we get from i to j in k steps) = P (Xk = j | X0 = i).

As we will see, we also have ri,j (k) = P (Xt+k = j | Xt = i) for any t > 0, but this
remains to be proved – a priori there may be a dependency on t.

Theorem 3 (Chapman–Kolmogorov). For any Markov chain and any k, ℓ ≥ 0 we have
• ri,j (k) = (P^k )i,j
• ri,j (k + ℓ) = Σ_{u=1}^{s} ri,u (k) ru,j (ℓ)
• ri,j (k + 1) = Σ_{u=1}^{s} ri,u (k) pu,j

1.2 TODAY
Theorem 4.

P (X0 = a0 & X1 = a1 & . . . & Xt = at ) = π^{(0)}_{a0} · p_{a0,a1} p_{a1,a2} · · · p_{a_{t−1},a_t}

Proof. Chain rule for conditional probability plus the Markov property. TODO: write
more details
The next theorem shows a general version of the Markov property (1): “the future
does not depend on the history, only on the present”.
Theorem 5 (General Markov property). For any Markov chain and any t ≥ 0, any
i ∈ S and any
• H – event depending only on values of X0 , . . . , Xt−1 ,
• F – event depending only on values of Xt+1 , Xt+2 , . . .
we have
P (F | H & Xt = i) = P (F | Xt = i).
Proof. (skipped)

1.3 Classification of states


Definition 6 (Accessible states). For states i, j of a Markov chain we say that j is
accessible from i if, starting at i, we have nonzero probability of reaching j in the
future. For short we write j ∈ A(i) or i → j. In formula:

j ∈ A(i) ⇔ P ((∃t ≥ 0) Xt = j | X0 = i) > 0.

It is easy to observe (TODO) that j ∈ A(i) is equivalent to the existence of a directed
walk from i to j in the transition graph.
Definition 7 (Communicating states). We say that states i, j of a Markov chain com-
municate if i ∈ A(j) and j ∈ A(i). For short we write i ↔ j.
Theorem 8. For any Markov chain the relation ↔ is an equivalence on the set of
states.
Proof. To show reflexivity, just observe that i → i holds for any state i, as we are
allowed to choose time t = 0. Therefore, i ↔ i as well. Symmetry is immediate from
the definition.
To show transitivity, assume i ↔ j ↔ k. Thus we have i → j → k. Therefore
there is a directed walk in the transition digraph from i to j and another one from j
to k. Thus, their concatenation is a walk from i to k, showing i → k. By a symmetric
argument we also have k → i, which finishes the proof.

1.4 Reducing the MC to smaller ones
Let us consider a MC together with its decomposition into communicating classes.
TODO
• think long-term
• the MC moves within one equivalence class for a while; when it leaves the class, it
never comes back (because . . . )
• this process goes on until we get to an equivalence class that we cannot leave.

transient vs. recurrent states Let fi,j be the probability that we get to j at some time
t ≥ 1, if we start from i; for i = j this means we return to i. In formula,

fi,j = P (∃t ≥ 1 : Xt = j | X0 = i)

We call fi,i the recurrence probability, and states i such that fi,i = 1 are called recurrent
(sometimes also persistent). The other states (those with fi,i < 1) are called transient.
To explore this idea a bit further, let Ti = min{t ≥ 1 : Xt = i} (or Ti = ∞, if
there is no such t). Recurrent states are exactly those for which Ti is finite with
probability 1 (given X0 = i).
Also let Vi be the number of visits to i, that is, Vi := |{t ≥ 0 : Xt = i}|. It
is clear that if i is a recurrent state and X0 = i then Vi = ∞. (After each visit we
have probability fi,i = 1 that we visit again.) If i is transient, then after each visit to i
we have probability 1 − fi,i > 0 that this is the last visit. Thus (by definition of the
geometric distribution)
Vi | X0 = i ∼ Geo(1 − fi,i ).
TODO: is the next theorem trivial for finite MC?

Theorem 9. Let C be a communicating class. Then either all states in C are recurrent
or all are transient.
Proof. Suppose i ↔ j; in particular ri,j (t) > 0 for some t. Assume that i is recurrent.
Then fi,i = 1, thus we visit i infinitely often. During each of these visits we have
probability r = ri,j (t) > 0 that we visit j within t units of time. Thus the probability that
this never happens is limn→∞ (1 − r)^n = 0. So we will visit j, starting from i; in
symbols, fi,j = 1.
Suppose fj,i < 1. Then with positive probability we never visit i again, a contradiction,
as we know we will visit i infinitely often. Thus fj,i = 1. This implies that
fj,j = 1 as well, as fj,j ≥ fj,i fi,j . So, j is recurrent as well.
Suppose a Markov chain is irreducible, that is, there is a positive probability of moving
from i to j (in some number of steps) for every pair of states. It is tempting to conclude
that all states must be recurrent. Indeed, this is true for finite Markov chains:
Theorem 10. Let (Xt ) be a finite irreducible Markov chain. Then all states are recurrent.

Proof. In view of the last theorem, the alternative is that all states are transient. This
means that no state is visited infinitely often. So there is a time M1 such that for t > M1
we have Xt ̸= 1. Similarly, we define Mi for every state i. But what is the value of Xt
for t > max{M1 , . . . , Ms }?
However, this is not necessarily the case if the Markov chain is infinite. TODO:
simple example
In fact, a random walk in Z3 (or in higher dimensions) has all states transient,
while a random walk in Z or in Z2 has all states recurrent. TODO: more details?

1.5 Convergence to stationary distribution

The Chapman–Kolmogorov theorem gives us a way to describe the behaviour of a
Markov chain over a short time: if we start with known π^{(0)} (the distribution of X0 , the state
at time 0), we can compute π^{(k)} . Next, we turn to describing the long-term behaviour.

Convergence to stationary distribution Given a Markov chain with transition matrix
P ∈ R^{n×n} , we say a probability row vector π ∈ R^n is a stationary distribution if πP = π. In other words,
if π^{(0)} = π then π^{(t)} = π for all t, which explains the term. For some Markov chains
we are guaranteed that the distribution π^{(t)} will approach the stationary distribution no
matter what π^{(0)} is.

Theorem 11. If a Markov chain is finite, aperiodic and irreducible, then
1. there is a unique stationary distribution π, and
2. limn→∞ (P^n )i,j = πj .

In other words, regardless of π^{(0)} we know what π^{(n)} will (approximately) be, if n
is large enough.
TODO: def. of irreducible, aperiodic TODO: what for finite TODO: examples,
when it fails: periodic states, two components of ↔, infinite.
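In practice one can find π by solving the linear system πP = π, Σ_i πi = 1; a sketch (assuming numpy, with the machine example):

    import numpy as np

    P = np.array([[0.99, 0.01],
                  [0.90, 0.10]])
    n = P.shape[0]

    # pi P = pi together with sum(pi) = 1, solved by least squares:
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    print(pi)                                 # approx. [0.989, 0.011]

    # Theorem 11: every row of P^k converges to pi.
    print(np.linalg.matrix_power(P, 200))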

1.6 Probability of absorption, time to absorption

Yet another way to look at the long-term behaviour of a Markov chain is to study absorbing
states, states that can never be left. Formally, a ∈ S is an absorbing state if pa,a = 1. Not
every Markov chain has such a state, but for those that do, two natural questions arise:
how long (on average) will it take till we reach an absorbing state? And if there is more
than one such state, what is the probability of reaching each of them? Both questions
are easily answered, if one approaches them right: it is significantly easier to compute these
times and probabilities for all states at the same time than to do it just for one state.
In the following, assume A ⊆ S is a nonempty set of absorbing states; also assume
0 ∈ A. For every i ∈ S we define µi to be the expected time to absorption starting
from i, formally

µi = E(T | X0 = i),  where T = min{t : Xt ∈ A}.

Further, we let ai be the probability that we end at state 0, starting from i:

ai = P (∃t : Xt = 0 | X0 = i).

Here we tacitly assume that A contains more absorbing states than just 0, otherwise
ai = 1.
Theorem 12. The probabilities ai are the unique solution to the following system of
equations:

a0 = 1
ai = 0    for 0 ̸= i ∈ A
ai = Σ_{j∈S} pi,j aj    otherwise.

TODO: proof simple by law of total probability.


Theorem 13. The expected times µi are the unique solution to the following system of
equations:

µi = 0    for i ∈ A
µi = 1 + Σ_{j∈S} pi,j µj    otherwise.

TODO: proof simple by law of total expectation.


Example: random walk on a path
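For the fly example both systems are small enough to solve directly. A sketch (assuming numpy; the absorbing states are 0 and 3, and state 0 plays the role of the distinguished state):

    import numpy as np

    Q = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.3, 0.4, 0.3, 0.0],
                  [0.0, 0.3, 0.4, 0.3],
                  [0.0, 0.0, 0.0, 1.0]])
    T = [1, 2]                          # transient states
    QT = Q[np.ix_(T, T)]                # transitions among transient states
    I = np.eye(len(T))

    # Theorem 12 restricted to transient states: (I - QT) a = p_{., 0}.
    a = np.linalg.solve(I - QT, Q[np.ix_(T, [0])]).ravel()
    print(a)                            # approx. [2/3, 1/3]

    # Theorem 13 restricted to transient states: (I - QT) mu = 1.
    mu = np.linalg.solve(I - QT, np.ones(len(T)))
    print(mu)                           # approx. [10/3, 10/3]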

1.7 Application: algorithm for 2-SAT, 3-SAT

A formula is in conjunctive normal form if it is a conjunction of a list of clauses, each
of which is a disjunction of literals (a variable or its negation). If each of the clauses
has at most k literals, we say the formula is a k-CNF. It is well known (and discussed
in other classes) that 2-SAT has a polynomial-time solution, while 3-SAT is an NP-complete
problem. Here, we show how we can apply our knowledge of Markov chains
to get a randomized algorithm: an algorithm that is allowed to give a wrong answer, but
only with a small probability.

2-SAT algorithm
• Input: 2-CNF φ with variables x1 , . . . , xn
• Output: satisfying assignment or statement that none exists
• arbitrarily initialize x1 , . . . , xn
• If φ(x1 , . . . , xn ) is true, return (x1 , . . . , xn ). Otherwise, let C be an unsatisfied
clause and flip a random variable in C.
• repeat the previous step at most 2mn^2 times
• say that no solution exists
(A code sketch follows the list.)
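A direct Python transcription of the algorithm (a sketch; the representation of a clause as a pair of signed indices, +i for xi and −i for its negation, is our choice, not fixed by the notes):

    import random

    def solve_2sat(clauses, n, m):
        # clauses: list of pairs of nonzero ints; literal +i / -i means
        # variable x_i plain / negated.  m controls the error bound 2^-m.
        x = [random.random() < 0.5 for _ in range(n + 1)]   # x[0] unused

        def sat(lit):
            return x[abs(lit)] if lit > 0 else not x[abs(lit)]

        for _ in range(2 * m * n * n):
            unsat = [c for c in clauses if not any(sat(l) for l in c)]
            if not unsat:
                return x[1:]                   # satisfying assignment found
            lit = random.choice(random.choice(unsat))
            x[abs(lit)] = not x[abs(lit)]      # flip a random variable of C
        return None                            # probably unsatisfiable

    # (x1 or x2) and (not x1 or x2): any assignment with x2 = True works.
    print(solve_2sat([(1, 2), (-1, 2)], n=2, m=10))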
Theorem 14. The above algorithm gives a wrong answer with probability at most 2^{−m} .
The running time is O(m · n^4 ).
Proof. For the running time estimate we just notice that φ has O(n^2 ) clauses of two
literals, ignoring the possibility of a faster search for an unsatisfied clause. If φ is not satisfiable,
the algorithm never finds a satisfying assignment and thus gives a correct answer.
So suppose there is a solution and let (x∗1 , . . . , x∗n ) be one of the (possibly many) solutions.
Let Dt be the distance from the solution at time t. Explicitly, Dt is the number of i
such that xi ̸= x∗i , where the value of xi is considered at time t.
The algorithm does not know about Dt , we only use it for the analysis. Clearly,
if Dt = 0 at some time t, we have found the solution (and we see that φ is being
satisfied, so the algorithm ends).1 Otherwise let C be the unsatisfied clause. To simplify
notation, assume C = x1 ∨ x2 . As C is unsatisfied, we have x1 = x2 = F . As x∗ is
a satisfying assignment, either x∗1 or x∗2 (or both) are true. Suppose x∗1 = F , x∗2 = T .
Then the algorithm has probability 1/2 of increasing Dt by one (if we change x1 )
and probability 1/2 of decreasing it. This is certainly independent of anything else, in
particular of the values of D0 , . . . , Dt−1 .
The issue is that there are two other cases: x∗1 = T , x∗2 = F is another good case,
it works in the same way. However, if x∗1 = x∗2 = T then this step of the algorithm will
certainly decrease the distance to the solution: Dt+1 = Dt − 1. While this looks like a good
thing, we call this a bad case: we were hoping to use our knowledge of Markov chains
to analyse the behaviour of (Dt ).
To solve this problem, we (as people analyzing the algorithm) create an auxiliary
sequence D′_t . Like Dt , this is a quantity the algorithm does not know about.
• We define D′_0 = D0 .
• When the choice of C makes a good case, we make sure that D′_{t+1} − D′_t = D_{t+1} − D_t .
• In the bad case we toss a coin to ensure that

P (D′_{t+1} = D′_t + x) = 1/2

for both x = +1 and x = −1.
• If D′_t = n then D′_{t+1} = n − 1.

It is easy to see that 0 ≤ Dt ≤ D′_t (actually D′_t − Dt is an even number, so if
it is not zero, it is at least 2). Crucially, D′_t is a Markov chain given by the following
transition digraph TODO.
Let T ≥ 0 be the first time such that DT = 0 and, similarly, T ′ ≥ 0 the first time
such that D′_{T ′} = 0. Using Theorem 13 we find TODO that E(T ′ ) ≤ n^2 . Clearly,
E(T ) ≤ E(T ′ ), so E(T ) ≤ n^2 . This starts to look useful. But we want to understand
how likely it is that T is much larger. For this we use the Markov inequality from the first
semester. Using it, we get

P (T ≥ 2n^2 ) ≤ E(T ) / (2n^2) ≤ n^2 / (2n^2) = 1/2.

1 It would be satisfied also for a different satisfying assignment. Think whether it changes anything about
the analysis.
To wrap it up: we divide the m · 2n^2 steps into m blocks, each of size 2n^2 . By what
we just did, in each block we fail with probability at most 1/2 – failure meaning that the
algorithm runs without finding a solution, and T is the time till we find the solution.2
The blocks are not quite independent; rather, the probability of failure is ≤ 1/2 no matter
how the previous block has ended. Thus, the probability of failure in all m blocks is at
most (1/2)^m .
TODO: what if there is a clause with just one variable?

3-SAT algorithm
• Input: 3-CNF φ with variables x1 , . . . , xn
• Output: satisfying assignment or statement that none exists
• arbitrarily initialize x1 , . . . , xn
• If φ(x1 , . . . , xn ) is true, return (x1 , . . . , xn ). Otherwise, let C be an unsatisfied
clause and flip a random variable in C.
• repeat the previous step at most ???-times
• say that no solution exists


This is the obvious attempt. If we run along with it, again using a Markov chain
to analyze the distance from one particular solution, we face a new issue: the Markov
chain is skewed: in the typical case Dt+1 = Dt + 1 with probability 2/3, while
Dt+1 = Dt − 1 only with probability 1/3. We can again use Theorem 13, but it gives only
E(T ) ≤ 2^n TODO. And this is no good, as in 2^n steps we can try all possible values
of the n variables.
To solve this issue, we take into account the initialization phase. For every i we
have xi = x∗i with probability 1/2, so D0 ∼ Bin(n, 1/2). In particular, we have
P (D0 ≤ n/2) ≥ 1/2. In such a case we have a decent chance of direct success: P (Dk =
0 | D0 = k) ≥ 3^{−k} : in each step we have probability ≥ 1/3 that we choose the correct
variable to change (in an unsatisfied clause) and thus decrease the distance to the solution.

2 Or a bound on it, as there may be other solutions than x∗ .

From this we get that

P (T ≤ n/2) = Σ_{k=0}^{n/2} P (D0 = k) P (T ≤ n/2 | D0 = k)
            ≥ Σ_{k=0}^{n/2} P (D0 = k) · 3^{−k}
            ≥ P (D0 ≤ n/2) · 3^{−n/2} ≥ 1 / (2 · 3^{n/2}).
This leads to a modified algorithm:

Faster 3-SAT algorithm

• Input: 3-CNF φ with variables x1 , . . . , xn
• Output: satisfying assignment or statement that none exists
• arbitrarily initialize x1 , . . . , xn
• If φ(x1 , . . . , xn ) is true, return (x1 , . . . , xn ). Otherwise, let C be an unsatisfied
clause and flip a random variable in C.
• repeat the previous step at most n/2 times, then reinitialize randomly
• repeat the previous two steps at most m times
• say that no solution exists
(A sketch of the restart structure follows the list.)
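The restart structure can be sketched as follows (our illustration, under the same clause representation as in the 2-SAT sketch; nothing here is pinned down by the notes):

    import random

    def solve_3sat(clauses, n, blocks):
        # 'blocks' outer attempts (Theorem 15 takes blocks = t * 2 * 3**(n/2)),
        # each doing at most n/2 local flips before a random restart.
        def sat(x, lit):
            return x[abs(lit)] if lit > 0 else not x[abs(lit)]

        for _ in range(blocks):
            x = [random.random() < 0.5 for _ in range(n + 1)]  # reinitialize
            for _ in range(n // 2):
                unsat = [c for c in clauses if not any(sat(x, l) for l in c)]
                if not unsat:
                    return x[1:]
                lit = random.choice(random.choice(unsat))
                x[abs(lit)] = not x[abs(lit)]
        return None                              # probably unsatisfiable

    print(solve_3sat([(1, 2, 3), (-1, -2, 3)], n=3, blocks=20))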


Theorem 15. The above algorithm gives a wrong answer with probability at most e^{−t} ,
where m = t · 2 · 3^{n/2} . The running time is O(n^4 · 3^{n/2} ).
Proof. We found already that one block of n/2 steps succeeds with probability at least
q = 1/(2 · 3^{n/2}). Then we do a new attempt; thus the blocks are independent and the
probability that all of them fail is at most

(1 − q)^m ≤ e^{−qm} = e^{−t} .

Here we used the inequality 1 − q ≤ e^{−q} that is valid for any real q.
TODO: what if there is a clause with just one variable?
Note that the above algorithm is not optimal. If we let each block run for 3n steps,
and if we analyze it smarter, we get the running time down to O(n^3 (4/3)^n ). But even
3^{n/2} = (√3)^n is much better than the trivial solution O(2^n ).

2 Bayesian statistics
2.1 Two approaches to statistics
In the first semester we looked at the classical (frequentists’) approach to statistics. In
this approach:
• Probability is a long-term frequency (out of 6000 rolls of the dice, a six was
rolled 1026 times, the ratio converges to the true probability). It is an objective
property of the real world.
• Parameters are fixed, unknown constants. We can’t make meaningful probabilis-
tic statements about them.
• We design statistical procedures to have desirable long-run properties. E.g. 95
% of our interval estimates will cover the unknown parameter.
Now we are going to look at an alternative, so called Bayesian approach:
• Probability describes how much we believe in a phenomenon, how much we are
willing to bet:
(Prob. that T. Bayes had a cup of tea on December 18, 1760 is 90 %.)
(Prob. that COVID-19 virus did leak from a lab is ?50? %.)
• We can make probabilistic statements about parameters (even though they are
fixed constants): the “choice of universe” is the underlying elementary event.
• We compute the (posterior) distribution of Θ and form point and interval estimates
from it, etc.

2.2 Preliminaries – conditional pmf, cdf, etc.

Before we get to the meat of the matter, let us first define/recall the needed definitions.
TODO: improve

• pdf: fX is a function such that P (X ≤ x) = ∫_{−∞}^{x} fX (t) dt.
• Intuition: fX (x) = lim_{t→0} P (x ≤ X ≤ x + t)/t – so indeed, it is a “density of probability”.
• joint pmf
• conditional pmf: pX|Y (x|y) = pX,Y (x, y)/pY (y)
• joint pdf: fX,Y is a function such that P (X ≤ x & Y ≤ y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y (s, t) dt ds.
• Intuition: fX,Y (x, y) = lim_{t→0} P (x ≤ X ≤ x + t & y ≤ Y ≤ y + t)/t^2 – so indeed, it is a “density of probability”.
• conditional pdf: fX|Y (x|y) = fX,Y (x, y)/fY (y)
• marginal pdf: fX (x) = ∫_{−∞}^{∞} fX,Y (x, y) dy
• marginal pdf: fY (y) = ∫_{−∞}^{∞} fX,Y (x, y) dx

2.3 Bayesian method – basic description
• The unknown parameter is treated as a random variable Θ.
• We choose a prior distribution, the pmf pΘ (ϑ) or the pdf fΘ (ϑ), independent of
the data.
• We choose a statistical model pX|Θ (x|ϑ) or fX|Θ (x|ϑ) that describes what we
measure (and with what probability), depending on the value of the parameter.
• After we observe X = x, we compute the posterior distribution fΘ|X (ϑ|x)
• and then derive what we need, e.g. find a, b so that P (a ≤ Θ ≤ b | X = x) =
∫_{a}^{b} fΘ|X (ϑ|x) dϑ ≥ 1 − α.

2.4 Bayes theorem

Theorem 16 (Bayes theorem for discrete r.v.’s). Let X, Θ be discrete r.v.’s. Then

pΘ|X (ϑ|x) = pX|Θ (x|ϑ) pΘ (ϑ) / Σ_{ϑ′∈Im Θ} pX|Θ (x|ϑ′ ) pΘ (ϑ′ )

(terms with pΘ (ϑ′ ) = 0 are considered to be 0).

Theorem 17 (Bayes theorem for continuous r.v.’s). Let X, Θ be continuous r.v.’s with
pdf’s fX , fΘ and joint pdf fX,Θ . Then

fΘ|X (ϑ|x) = fX|Θ (x|ϑ) fΘ (ϑ) / ∫_{ϑ′∈R} fX|Θ (x|ϑ′ ) fΘ (ϑ′ ) dϑ′

(terms with fΘ (ϑ′ ) = 0 are considered to be 0).

Theorem 18 (Bayes theorem, mixed case). Let X be a discrete and Θ a continuous r.v.
Then

fΘ|X (ϑ|x) = pX|Θ (x|ϑ) fΘ (ϑ) / ∫_{ϑ′∈R} pX|Θ (x|ϑ′ ) fΘ (ϑ′ ) dϑ′

(terms with fΘ (ϑ′ ) = 0 are considered to be 0).

2.5 Bayesian point estimates – MAP and LMS

Even when we know the distribution of a random variable, it is unclear what is the best
numerical value to represent it. Is it the mean (expected value)? Or the mode (most
probable value)? Or the median? It turns out all choices have their justification. In the
context of Bayesian statistics, we are interested in a random variable Θ conditioned on
the event X = x. (You may concentrate on the discrete case, where the conditioning is
easy to understand.)

MAP – Maximum A-Posteriori We choose ϑ̂ to maximize
• pΘ|X (ϑ|x) in the discrete case
• fΘ|X (ϑ|x) in the continuous case

• Essentially, we are replacing the random variable by its mode.


• Similar to the ML method in the classical approach if we choose a “flat prior” –
Θ is supposed to be uniform/discrete uniform.

LMS – Least Mean Square Also called the conditional mean method.

• We choose ϑ̂ = E(Θ | X = x), so we replace the random variable by its mean.
• What we get is an unbiased point estimate that has the smallest possible LMS
(least mean square) error:

E((Θ − ϑ̂)^2 | X = x)

• (we will show this later.)

Similarly, if we take the median (a number m such that P (Θ ≤ m | X = x) = 1/2)
then we minimize the mean absolute error E(|Θ − ϑ̂| | X = x). But we will not
pursue this approach further.

2.6 Bayesian inference – examples

2.6.1 Naive Bayes classifier – both Θ and X are discrete
This technique can be used for any classification of objects into a finite number of categories,
using a finite number of discrete features. For concreteness, we will explain it
as a way to test whether some email is spam or ham (that is, not spam). We let Ω
be the set of all emails (together with the probability of receiving each of them). We
can’t possibly list all elements of Ω, but we consider the emails delivered to our inbox
as sampling from this probability space.
Our interest lies in the random variable Θ that is equal to 1 for spams and to 2 for hams.
(Recall Θ is a function from Ω to R, so for each email ω ∈ Ω we need to define the value of
Θ(ω).) In order to estimate the value of Θ, we measure data: a list of Bernoulli variables
X1 , . . . , Xn , where Xi (ω) = 1 if ω contains the word wi (and Xi (ω) = 0 otherwise). So
we imagine w1 , . . . , wn is a list of all words that are useful to detect spams.
By the Bayes theorem we have

pΘ|X (ϑ|x) = pX|Θ (x|ϑ) pΘ (ϑ) / Σ_{t=1}^{2} pX|Θ (x|t) pΘ (t).

TODO finish it
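The notes stop at the TODO above; the usual “naive” step, which we add here as a hedged reconstruction of the standard technique, is to assume the Xi are conditionally independent given Θ, so that pX|Θ (x|ϑ) = Π_i pXi|Θ (xi |ϑ). A minimal sketch of the resulting classifier (all numbers invented for illustration):

    # Naive Bayes: prior over Theta (1 = spam, 2 = ham) and, for each
    # word w_i, the conditional probability P(X_i = 1 | Theta).
    prior = {1: 0.4, 2: 0.6}
    p_word = {1: [0.8, 0.6, 0.1],     # P(X_i = 1 | spam)
              2: [0.1, 0.2, 0.4]}     # P(X_i = 1 | ham)

    def posterior(x):
        # x: observed 0/1 vector of word indicators
        score = {}
        for theta in (1, 2):
            s = prior[theta]
            for xi, pi in zip(x, p_word[theta]):
                s *= pi if xi == 1 else 1 - pi   # naive independence
            score[theta] = s
        total = sum(score.values())              # the Bayes denominator
        return {theta: s / total for theta, s in score.items()}

    print(posterior([1, 1, 0]))   # words w1, w2 present, w3 absent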

2.6.2 Estimating bias of a coin – Θ is continuous, X is discrete
Consider a loaded coin with probability of heads being ϑ (which we assume to be an
evaluation of a random variable Θ). Btw, everything applies to any procedure generating
a Bernoulli random variable, but we stick with a coin for concreteness. Our goal
is to find out the value of ϑ. In tune with the Bayesian methodology, we start with a
prior distribution, that is, a pdf fΘ . (As we want to allow any real number in [0, 1] as
the value of ϑ, we must take Θ to be a continuous random variable.) Then we take
measurements: we choose a number n of coin tosses and check how many heads we
get. If we know the value of ϑ, the distribution of this number (call it X) is clearly
Bin(n, ϑ). So we get

pX|Θ (k|ϑ) = (n choose k) ϑ^k (1 − ϑ)^{n−k} .

It remains to apply Theorem 18. We still haven’t decided what prior to choose though.
If we don’t know anything (say it is not a real coin but a digital generator), we may
take the flat prior Θ ∼ U (0, 1). However, we need something more versatile to allow us
to encode some prior knowledge.

Beta distribution It is convenient to use the following type of distribution for Θ:

fΘ (ϑ) = c ϑ^{α−1} (1 − ϑ)^{β−1}  for 0 < ϑ < 1,   and fΘ (ϑ) = 0 otherwise.

Here c is the normalizing constant that makes this function a pdf. It is typically
written as 1/B(α, β), the reciprocal of the beta function. The r.v. Θ is said to have the beta
distribution. We will collect some useful properties of this distribution. All are easy to
verify using basic knowledge of calculus; details are omitted though.
• fΘ (ϑ) is maximal for ϑ = (α − 1)/(α + β − 2) (the mode of the distribution). This can be verified
by a simple differentiation.
• E(Θ) = α/(α + β) (the mean of the distribution). This follows from the next part and
an easy calculation.
• For integer α, β we have B(α, β) = 1 / ((α + β − 1) · (α+β−2 choose α−1)). This can be
shown by integration by parts and induction over α + β.
Now we have all set up to apply Theorem 18. Fortunately, we don’t need to compute
the integral in the denominator:

fΘ|X (ϑ|k) = c1 pX|Θ (k|ϑ) fΘ (ϑ)
           = c2 ϑ^k (1 − ϑ)^{n−k} ϑ^{α−1} (1 − ϑ)^{β−1}
           = c2 ϑ^{α+k−1} (1 − ϑ)^{β+n−k−1}

The calculation is only valid for ϑ ∈ [0, 1]; otherwise fΘ (ϑ) = 0, so the updated
(posterior) pdf is also 0. How to find out c2 , if we need to? We use the fact that
after conditioning on the event {X = k} the random variable Θ still only attains
values in [0, 1]. Thus, c2 takes the value that makes fΘ|X (ϑ|k) a pdf, a function with
integral 1. Based on what we learned about the Beta distribution, c2 = 1/B(α′ , β ′ ) and
Θ|X = k follows the Beta distribution with parameters α′ = α + k and β ′ = β + n − k.
TODO: wrap up
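A sketch of the whole update in code (assuming scipy for the Beta distribution; the numbers are illustrative): prior Beta(α, β), observe k heads in n tosses, posterior Beta(α + k, β + n − k).

    from scipy.stats import beta

    a, b = 2, 2                # prior Beta(2, 2): mildly favors fairness
    n, k = 10, 7               # observed: 7 heads in 10 tosses

    post = beta(a + k, b + n - k)            # posterior Beta(9, 5)
    print(post.mean())                       # LMS estimate: 9/14
    print((a + k - 1) / (a + b + n - 2))     # MAP estimate: 8/12 = 2/3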

2.6.3 Estimating normal random variables – both Θ and X are continuous

3 Conditional expectation
We have already learned about expectation E(Y ) of a random variable Y — average
value over the whole probability space — and about conditional expectation E(Y | A)
— average over a set A ⊆ Ω. In this section we will learn about a related topic,
where we will take averages of Y over sets defined by another random variable, X.
We will restrict the discussion to the case of a discrete random variable X; the case of
continuous X is more subtle. The variable Y can be discrete or continuous.
For any x ∈ R we let
g(x) := E(Y | X = x).
This is obviously some real function of a real variable. Next, we plug the random variable
X into the function g and we define

E(Y | X) := g(X).

Thus, E(Y | X) is a random variable. In the case of discrete X, it is easy to understand
what is going on: on each set of the form Ax = {X = x} we define E(Y | X) as
E(Y | Ax ). This leads to the following important observation:
Theorem 19 (Law of Iterated Expectation).

E(E(Y | X)) = E(Y )

Proof. By the law of total expectation we have

E(Y ) = Σ_x P (Ax ) E(Y | Ax ),

while LOTUS says that

E(E(Y | X)) = E(g(X)) = Σ_x g(x) P (X = x),

which is the same as the expression above.


Example 1: coin TODO
Example 2: stick TODO
Example 3: group of students TODO

Ŷ = E(Y | X), Ỹ = Y − Ŷ ; expression of var(Y ) – Eve’s rule:

var(Y ) = E(var(Y | X)) + var(E(Y | X))
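A quick Monte Carlo check of both identities (a sketch assuming numpy; the toy model – X a die roll and Y | X = x ∼ Bin(x, 1/2) – is our own):

    import numpy as np
    rng = np.random.default_rng(0)

    N = 10**6
    X = rng.integers(1, 7, size=N)        # a die roll
    Y = rng.binomial(X, 0.5)              # Y | X = x  ~  Bin(x, 1/2)

    # Law of iterated expectation: E(E(Y|X)) = E(Y), with g(x) = x/2.
    print(Y.mean(), (X / 2).mean())

    # Eve's rule: var(Y) = E(var(Y|X)) + var(E(Y|X)) = E(X/4) + var(X/2).
    print(Y.var(), (X / 4).mean() + (X / 2).var())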

Theorem 20 (Conditional expectation gives the LMS estimate). We switch to the statistical
practice of using Θ for the parameter we care about, and X for the measured data. Let
Θ̂ be any estimator (a function of X that we use to estimate Θ). The mean square error

E((Θ̂(X) − Θ)^2 | X)

is minimal if we put Θ̂(X) = E(Θ | X).

4 Stochastic processes
A stochastic process is a name for a sequence (or more generally, a collection) of
random variables. We have already seen an important case of that when we looked at
Markov chains (where we added an important condition, “independence on the past”).
Another important example (that we just mention in passing) is the Wiener process Wt
(here t ∈ R, so it is not a sequence but a process parametrized by continuous time, in
other words a random function of a real variable). These processes are used to model
Brownian motion and stock prices, to name a few applications.
Next, we will look at two models of arrival times – times till some random event
occurs; you can imagine the next email arrival, or the next person walking into a store. The first
model will consider discrete time, the second one continuous time.

4.1 Bernoulli process

A Bernoulli process is an infinite sequence of independent identically distributed Bernoulli
trials, i.e., a sequence of RVs X1 , X2 , . . . that are independent and each follows the Ber(p)
distribution. As such, it is a very simple object. We will look at it from different angles
though. As for terminology, we will call the fact Xk = 1 a success at time k, or an
arrival (of a person/email/. . . ) at time k.

Number of successes We let Nt = X1 + · · · + Xt be the number of successes/arrivals
up to time t. We know from the first semester that Nt ∼ Bin(t, p), so E(Nt ) = tp,
var(Nt ) = tp(1 − p) and P (Nt = k) = (t choose k) p^k (1 − p)^{t−k} .

Arrival times We let T = T1 be the time of the first success/arrival, that is, the minimal t
such that Xt = 1. It is easy to see that T ∼ Geom(p), thus E(T ) = 1/p, var(T ) =
(1 − p)/p^2 and P (T = t) = (1 − p)^{t−1} p. More generally, we let Tk be the time of the
k-th success/arrival. In other words, it is the minimal t such that Nt = k. We discuss
its properties in a while.

Waiting times/Interarrival times Put Lk = Tk − Tk−1 (we put T0 = 0 to simplify
notation). In words, it is the time we wait for the k-th success after the previous one. TODO memoryless
property; thus Lk ∼ Geom(p), that is, E(Lk ) = 1/p, var(Lk ) = (1 − p)/p^2 and
P (Lk = t) = (1 − p)^{t−1} p. Moreover, L1 , L2 , . . . are independent.

Properties of Tk We can see that Tk = L1 + · · · + Lk . Thus by linearity, we have
E(Tk ) = k/p. As the interarrival times are independent, we have var(Tk ) = k(1 −
p)/p^2 . The PMF of Tk can be obtained by a careful application of the convolution
formula. However, it is more convenient to derive it from first principles: the fact Tk = t
means that Xt = 1 and there are exactly k − 1 times τ < t where Xτ = 1. This,
together with the independence of all the Bernoulli variables, implies

P (Tk = t) = (t−1 choose k−1) p^k (1 − p)^{t−k} ,  for t ≥ k,

and clearly P (Tk = t) = 0 otherwise. This distribution is called the Pascal distribution of
order k. A related distribution is that of the random variable Tk − k, the number of failures
before the k-th success. This is called the negative binomial distribution due to its pmf, which
can be written with a generalized binomial coefficient:
P (Tk − k = j) = (−1)^j (−k choose j) p^k (1 − p)^j = (k+j−1 choose j) p^k (1 − p)^j .

Alternative description Note that we can equivalently describe the situation by the
interarrival times, that is, by a sequence of i.i.d. random variables L1 , L2 , · · · ∼
Geom(p). Then we put Tk = L1 + · · · + Lk and set Xk = 1 whenever Tj = k for some j.
It is easy to see that this is an equivalent description; in other words, the sequence
X1 , X2 , . . . is a Bernoulli process.
Example: The number of days till the next rain follows the Geom(p) distribution.
(We assume each day is either rainy or not rainy, that is, we have no finer distinction.)
What is the probability that it will rain on days 10 and 20?
This seems very complicated and tedious. However, by the indicated description of
a Bernoulli process by interarrival times, the indicator variables of rain form a Bernoulli
process. And the probability of rain on days 10 and 20 is
P (X10 = 1 & X20 = 1) = P (X10 = 1) · P (X20 = 1) = p · p = p^2 .

Merging of Bernoulli processes Consider two independent Bernoulli processes (Xi ) ∼
Bp(p) and (Yi ) ∼ Bp(q). Put Zi = Xi ∨ Yi . Then (Zi ) ∼ Bp(p + q − pq).
This is obvious, as P (Zi = 1) = P (Xi = 1 ∨ Yi = 1) and we use the basic
formula for the probability of a union. However, from the point of view of arrival and/or
waiting times this is nontrivial: suppose the time to a rainy day follows Geom(p) and the time
to a snowy day follows Geom(q). Then the time to a day when it rains or snows follows
Geom(p + q − pq).

Splitting of Bernoulli processes Let (Zi ) ∼ Bp(p). If Zi = 0, we put Xi = Yi = 0.
If Zi = 1, then with probability q we put (Xi , Yi ) = (1, 0), and otherwise (Xi , Yi ) = (0, 1).
(Example to imagine: Zi = 1 means we get a message, Xi = 1 that it was from Ann,
Yi = 1 that it was from Bob.) Then (Xi ) ∼ Bp(pq) and (Yi ) ∼ Bp(p(1 − q)).
(Question: are the Xi s and Yi s independent?)
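A simulation sketch of merging and splitting (assuming numpy; p and q are arbitrary illustrative values):

    import numpy as np
    rng = np.random.default_rng(1)

    N, p, q = 10**6, 0.3, 0.5
    X = rng.random(N) < p              # Bp(p)
    Y = rng.random(N) < q              # Bp(q), independent of X

    Z = X | Y                          # merging
    print(Z.mean(), p + q - p * q)     # both approx. 0.65

    coin = rng.random(N) < q           # splitting a Bp(p) process
    A, B = X & coin, X & ~coin
    print(A.mean(), p * q)             # approx. 0.15
    print(B.mean(), p * (1 - q))       # approx. 0.15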

4.2 Poisson process
This is a continuous-time version of the Bernoulli process. Assume we want to deal with events
that occur more often than once per day. We can stay with discrete time and measure
it in hours, seconds, or nanoseconds. But instead of that, we will give a more elegant
description that allows arbitrary real values of the arrival times.
As for the Bernoulli process, we will describe the process by several random variables:

• T1 , T2 , T3 , . . . are the times of individual arrivals (events we want to describe), or arrival times for short
• N ((a, b]) is the number of arrivals at times in (a, b]
• Nt = N ((0, t])
• Lk = Tk − Tk−1 – waiting times between consecutive arrivals


What is different now is that we do not have the underlying sequence of “coin
tosses”, that we denoted X1 , X2 , . . . above. To derive the properties of this process
we start with three axioms – that we pose as a natural “limit” version of the Bernoulli
process for a very small time intervals.

(a) We are describing times of “arrival” in interval [0, ∞). For any time interval
(a, b] we let N ((a, b]) be the number of arrivals in this intervals. We postulate
that the pmf of this random variable only depends on τ = b − a. We denote
P (N ((a, b]) = k) as P (k, τ ).
(b) N ((a, b]) and N ((0, a]) are independent.
(c) For small values of τ we have the following approximation, for some λ > 0

• P (0, τ ) = 1 − λτ + o(1)
• P (1, τ ) = λτ + o(1)
• P (k, τ ) = o(1) for k > 1
TODO: explain how this follows from approximating a Bernoulli process

From these axioms we derive (TODO) the following properties:

• Nt ∼ Pois(λt)
• Lk ∼ Exp(λ)
• For any sequence 0 ≤ t0 < t1 < · · · < tk the random variables N ((ti−1 , ti ])
for i = 1, . . . , k are independent and the i-th of them follows Pois(λ(ti − ti−1 ))

From the distribution of the Lk s we get information about the distribution of the Tk s:

• First, E(Tk ) = E(L1 ) + · · · + E(Lk ) = k/λ by linearity of expectation.

• Next, var(Tk ) = var(L1 ) + · · · + var(Lk ) = k/λ^2 by the formula for the variance of
independent RVs.
• Finally, we can find the pdf of Tk , the so-called Erlang distribution of order k:

fTk (t) = λ^k t^{k−1} e^{−λt} / (k − 1)! .

One possible proof: by induction on k. For k = 1 this is the pdf of Exp(λ), as it
should be. Then we use the convolution formula to get from the pdf of Tk−1 to that of Tk :

fTk (t) = ∫_0^t fT_{k−1} (s) fLk (t − s) ds

Another proof: use the formula P (t ≤ Tk ≤ t + δ) ≈ δ fTk (t) (TODO: more
precisely) together with the expression

P (t ≤ Tk ≤ t + δ) = P (k − 1 arrivals in [0, t]) · P (at least 1 arrival in [t, t + δ])
                   = e^{−λt} (λt)^{k−1}/(k − 1)! · (1 − e^{−λδ})
                   ≈ e^{−λt} (λt)^{k−1}/(k − 1)! · λδ

Splitting of Poisson processes Consider a sequence of arrival times T1 , T2 , . . . of
a Poisson process of intensity λ. For each i we independently toss a coin and with
probability p classify the arrival as type 1, with probability 1 − p as type 2. We let
T1′ , T2′ , . . . be the times of the arrivals of type 1 and T1′′ , T2′′ , . . . the times of the arrivals
of type 2. Then (Tk′ )k are arrival times of a Poisson process of intensity λp and
(Tk′′ )k are arrival times of a Poisson process of intensity λ(1 − p). Moreover, these
processes are independent. (TODO: explain precise meaning)
TODO: explain why it is so.
Example: customers buying a book; emails arriving, important or not.

Merging of Poisson processes Consider two Poisson processes: one with intensity λ,
the other with intensity λ′ . Then their merging is a Poisson process of intensity λ + λ′ .
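A sketch (assuming numpy) that generates a Poisson process on [0, 1] from i.i.d. Exp(λ) interarrival times and checks that N1 ∼ Pois(λ):

    import numpy as np
    rng = np.random.default_rng(2)

    lam, runs = 5.0, 10**5
    counts = []
    for _ in range(runs):
        t, n = 0.0, 0
        while True:
            t += rng.exponential(1 / lam)   # L_k ~ Exp(lambda)
            if t > 1.0:
                break
            n += 1                          # arrival at T_k = L_1 + ... + L_k
        counts.append(n)

    counts = np.array(counts)
    print(counts.mean(), counts.var())      # both approx. lam = 5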

5 Balls and bins

Birthday paradox We start with a simple but illustrative example: in a group of k
people, what is the probability that two celebrate their birthdays on the same day? (Ignore
leap years, twins, and irregularities of the birthrate during the year.) Obviously, the
probability that no such coincidence happens is exactly

(1 − 1/365)(1 − 2/365) · · · (1 − (k − 1)/365).

For small values of k we may use the well-known approximation e^{−x} ≈ 1 − x – but,
surprisingly, we use it in the reverse direction, to approximate 1 − x by e^{−x}. The expression above is close to

e^{−1/365} · · · e^{−(k−1)/365} = e^{−Σ_{i=1}^{k−1} i/365} = e^{−k(k−1)/(2·365)} .

We can conclude that for k^2 ≈ 2 · 365 the probability of no birthday “collision” is
approximately 1/e. Btw √730 ≈ 27 and the exact formula for k = 27 gives XXX, so
we see our approximations were pretty good. TODO When is the prob one half.

Balls and bins model Next, we describe an abstract model that not only generalizes
the above exercise, but mainly is used to analyze many random algorithms, some of
which we will see below. We will be throwing m balls randomly into n bins. Each
ball is thrown independently, and each bin has the same probability of being hit (also,
no ball ends up outside the bins). In this setting, if we put n = 365 and m = k, we
have our birthday paradox problem again; now it is equivalent to asking what is the
probability that some bin will end up with at least two balls.
Let us look at some more easy questions to ask about this model:
• What is the number of balls in the first bin? (Or any fixed bin, really.)
Obviously, it is a random variable. By recalling the definitions, we see that it
follows the binomial distribution Bin(m, 1/n). This is all we can say about this
number – and all we need to answer further questions: e.g., the probability that the
first bin is empty is (1 − 1/n)^m ≈ e^{−m/n} (using again the approximation
1 − x ≈ e^{−x} ).
• How many bins are empty (on average)?
Using the previous item and linearity of expectation, this number is equal to
n(1 − 1/n)^m ≈ n e^{−m/n} .
• What is the maximal number of balls in a bin?
This is a harder problem, so let us first show why to care about it.

Application 1: hashing We seek a data structure to store strings and later answer
membership questions (has ’Cat’ been stored?). We use a hash function h that assigns
to every string an integer in [n] = {1, . . . , n}. We assume h is “sufficiently random”.
This may be confusing, as h is a deterministic function (and we need it to give the
same answer to each string when running the next time). What we want is that for typical
input strings s1 , . . . , sn the hashes h(s1 ), . . . , h(sn ) are independent and uniformly
distributed in [n].
To proceed with our data structure: we have n linked lists B1 , . . . , Bn , initially
empty. We store string s to Bh(s) in constant time. We look for s in Bh(s) , which takes
time proportional to the length of this list. Our goal is to estimate the worst seek-time,
which is (proportional to) the maximum size of a bin, the so-called max-load. We must be
careful what we ask for though: the worst time in the worst case is n, as we may get
all balls in the same bin (i.e., all words can have the same hash). This is very unlikely
though, happening with probability n^{1−n}. A more precise result in this direction is the following
upper bound.

Theorem 21. For large enough n (we throw m = n balls) we have

P (maxload ≥ 3 log n / log log n) ≤ 1/n .

Proof. Claim: For any i, P (|Bi | ≥ M ) ≤ (n choose M) · 1/n^M .
To prove this, we use the union bound: we first write the event “|Bi | ≥ M ” as a union: for
every set S ⊆ [n] of size M we consider the event AS = “all balls in S end in bin Bi ”.
Obviously, P (AS ) = 1/n^M . Also, “|Bi | ≥ M ” is simply ∪_S AS (we take the union over
all sets S of size M ). So we get

P (|Bi | ≥ M ) = P (∪_S AS ) ≤ Σ_S P (AS ) = (n choose M) · 1/n^M .

Claim: (n choose M) · 1/n^M ≤ 1/M ! ≤ (e/M )^M .
This follows from the definition of the binomial coefficient and a Stirling-type estimate of the factorial.

Claim: P (maxload ≥ M ) ≤ Σ_{i=1}^{n} P (|Bi | ≥ M ) ≤ n (e/M )^M .
We use again the union bound: the event “maxload ≥ M ” is the same as
“∃i : |Bi | ≥ M ”, which is a union of events: ∪_{i=1}^{n} {ω : |Bi | ≥ M }.
(To test your understanding: what does ω stand for here?)
The rest is a straightforward estimate: TODO

Later we will see that the bound for maxload we just got is best possible, up to a
multiplicative factor.
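A simulation sketch (assuming numpy and natural logarithms) comparing the observed max-load for m = n balls with the bound of Theorem 21:

    import numpy as np
    rng = np.random.default_rng(3)

    n = 10**5
    bins = rng.integers(0, n, size=n)         # throw n balls into n bins
    loads = np.bincount(bins, minlength=n)
    print(loads.max())                         # typically a small number, 6-8
    print(3 * np.log(n) / np.log(np.log(n)))   # the bound: approx. 14.1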

Application 2: Bucketsort We want to sort n = 2^k input numbers as fast as possible.
We will be assuming the inputs are ℓ-bit integers, thus elements of I = {0, . . . , 2^ℓ − 1}.
Crucially, we will also assume the inputs are uniformly random in this set and mutually
independent.
For x ∈ I we let b(x) be the top k bits, thus b(x) = ⌊x/2^{ℓ−k} ⌋ (in bit-shift notation,
b(x) = x >> (ℓ − k)).
Bucketsort algorithm proceeds as follows:

1. Initialize n empty buckets – linked lists. For i = 1, . . . , n: put input xi to bucket Bb(xi ) .
2. For j = 0, . . . , n − 1: sort bucket Bj by bubblesort.
3. Join buckets B0 , . . . , Bn−1 .

Obviously, steps 1 and 3 take linear time (and we cannot do better). The interesting
part is to analyze how long step 2 takes. We let Xj = |Bj |. For each input to the
algorithm, this will be a particular integer. However, we analyze the running time on
average, on a random input. Thus we treat Xj as a random variable. As we saw before,
Xj ∼ Bin(n, 1/n). The running time of bubblesort is quadratic, so the expected total
running time of step 2 is proportional to

Σ_{j=0}^{n−1} E(Xj^2 ).

The easiest way to compute the expectation is by using the formula

var(Xj ) = E(Xj^2 ) − E(Xj )^2

in reverse: we already know that E(Xj ) = n · (1/n) = 1 and var(Xj ) = n · (1/n) · (1 − 1/n) ≤ 1.
Thus E(Xj^2 ) = var(Xj ) + E(Xj )^2 ≤ 1 + 1 = 2. So the expected running time of
step 2 is at most 2n.
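A sketch of the algorithm in code (Python; we use the built-in sort inside a bucket as a stand-in for bubblesort – the analysis above only needs that sorting a bucket costs at most quadratic time):

    import random

    def bucketsort(xs, ell, k):
        # Sort n = 2**k uniformly random ell-bit integers.
        n = 1 << k
        buckets = [[] for _ in range(n)]
        for x in xs:
            buckets[x >> (ell - k)].append(x)   # b(x) = top k bits
        out = []
        for b in buckets:
            out.extend(sorted(b))               # stand-in for bubblesort
        return out

    k, ell = 10, 32
    xs = [random.getrandbits(ell) for _ in range(1 << k)]
    assert bucketsort(xs, ell, k) == sorted(xs)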

Poisson approximation Next, we will prove the “likely lower bound” for maxload:
Theorem 22. For large enough n we have

P (maxload ≤ log n / log log n) ≤ 1/n .

In contrast to the upper bound, this will require quite a bit of prep work. We
will invoke the magic of Poisson random variables to help us with this estimate. We
will also need to set up some notation. We will use Xi (or X_i^{(m)} ) for the number of
balls in bin i when m balls are being thrown. We already know that each Xi follows
the Bin(m, 1/n) distribution, which is well approximated by Pois(m/n). Thus, with
a leap of faith we let Y1 , . . . , Yn be i.i.d. random variables, each with distribution
Pois(m/n). We will call the variables X1 , . . . , Xn the exact case and their approximation
Y1 , . . . , Yn the Poisson case. Note that while Y1 , . . . , Yn are independent, the
X1 , . . . , Xn are definitely not! In fact, we have already met their distribution; it is the
multinomial distribution and satisfies

P (X⃗ = x⃗ ) = P (X1 = x1 , . . . , Xn = xn ) = (m choose x1 , . . . , xn ) · 1/n^m ,

where the multinomial coefficient is defined by (m choose x1 , . . . , xn ) = m! / (x1 ! . . . xn !). The formula
above is only true if Σ_i xi = m; otherwise P (X⃗ = x⃗ ) = 0. Thus, the distribution
of X⃗ is definitely distinct from that of Y⃗ : for instance we can have Yi = 0 for each i
with nonzero probability, while the probability of this is 0 in the exact case. However,
this is in some sense the only thing distinguishing the exact and Poisson cases.
Observation 23. The distribution of X⃗ is the same as that of Y⃗ , given that Σ_i Yi = m.
Formally,

P (X⃗ = x⃗ ) = P (Y⃗ = x⃗ | Σ_{i=1}^{n} Yi = m).

Proof. Both probabilities are clearly 0 if Σ_i xi ̸= m. Otherwise, we compute TODO

Thus, we can simulate the balls&bins process just by using independent Poisson
variables. However, the conditioning makes computations complicated. The real magic
comes in the next theorem, where we study what happens when we truly embrace the
Poisson case of independent Poisson variables.
Theorem 24. Let f : Z^n → [0, ∞) be any function. With the notation as above, we
have

E(f (X⃗ )) ≤ e√m · E(f (Y⃗ )).

Moreover, if the left-hand side is monotone in m (the number of balls), we can replace
e√m by 2.
Corollary 25. Let A be any event expressed in terms of sizes of the bins, so A ⊆
Z^n . Then the probability that A happens in the exact case is less or equal to the
probability it happens in the Poisson case times a factor e√m. Formally,

P (X⃗ ∈ A) ≤ e√m · P (Y⃗ ∈ A).

Proof. (Theorem implies Corollary) It is enough to let f (a) = 1 if a ∈ A and f (a) = 0
otherwise (f is the characteristic function of A). Then E(f (X⃗ )) = P (X⃗ ∈ A) and
similarly for Y⃗ . Thus the corollary follows.

Proof. (of Theorem)
Let Y = Σ_{i=1}^{n} Yi (do not confuse with Y⃗ !). By the law of total expectation (using
the decomposition Ω = ∪_{y=0}^{∞} {Y = y}) we get the following:

E(f (Y⃗ )) = Σ_{y=0}^{∞} P (Y = y) E(f (Y⃗ ) | Y = y)
          ≥ P (Y = m) E(f (Y⃗ ) | Y = m)
          = P (Y = m) E(f (X⃗ ))

The inequality is clear (all terms in the sum are nonnegative) and the equality in the
last row follows from the Observation above. It remains to recall that a sum of independent Poisson
random variables is again Poisson and thus Y ∼ Pois(m). So P (Y = m) = e^{−m} m^m / m!.
Now we use an estimate for the factorial: m! ≤ (m/e)^m e√m, and we are done.
Note: for the extended version with monotone left-hand side we replace the Σ_{y=0}^{∞}
by Σ_{y=0}^{m} or Σ_{y=m}^{∞} (based on whether the LHS is decreasing or increasing).
Now we test our technique by estimating the probability that maxload is low – Theorem
22. We let M = log n / log log n and m = n. The probability in the Poisson case can be
estimated as follows:

P (max_i Yi < M ) = P (∀i : Yi < M )          (property of max)
                  = Π_i P (Yi < M )            (independence of the Yi s)
                  ≤ Π_i (1 − P (Yi = M ))      (we give away a lot here)
                  = (1 − e^{−1} 1^M / M !)^n    (def. of Poisson; here Yi ∼ Pois(1))
                  = (1 − 1/(e M !))^n
                  ≤ e^{−n/(e M !)}              (1 − t ≤ e^{−t} again)

By Corollary 25, the probability in the exact case can be estimated as

P (max_i Xi < M ) ≤ e √n · e^{−n/(e M !)} .

So it remains to show that e √n · e^{−n/(e M !)} < 1/n (for large enough n). To do this, we will
show that e^{−n/(e M !)} < 1/n^2 . TODO

6 Permutation test
How to compare two random variables, if their distribution can be arbitrary? For a
concrete example, suppose we want to compare two gadgets by looking at their ratings.
If the gadgets have similar features and price, perhaps this is the best way to decide –
so if one gadget has average rating 4.1, then it surely is better than another one with rating
3.9, right? But wait, what about randomness – even if the gadgets are exactly equal,
they still won’t receive exactly the same rating, with high probability. So how
to decide what deviation we can attribute to randomness, and what is a mark of a true
difference?
–> compare other possibilities
–> describe permutation test
Wilcoxon signed rank test: https://stats.stackexchange.com/questions/348057/wilcoxon-signed-rank-symmetry-assumption
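A sketch of the permutation test itself (assuming numpy; the ratings are invented): under the null hypothesis that both samples come from the same distribution, every relabeling of the pooled data is equally likely, so the observed difference of means should not be an outlier among differences computed from random relabelings.

    import numpy as np
    rng = np.random.default_rng(4)

    a = np.array([4.5, 4.0, 5.0, 3.5, 4.5, 4.0])   # ratings of gadget A
    b = np.array([4.0, 3.5, 4.0, 3.0, 4.5])        # ratings of gadget B

    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    reps, count = 10**4, 0
    for _ in range(reps):
        rng.shuffle(pooled)                        # random relabeling
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    print(count / reps)   # p-value: small => deviation marks a true difference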

7 Moment Generating Functions and their applications


In this section we will meet an old friend from combinatorics – generating functions.
We will see how they can be applied to study random variables and help us prove
two important results from the first semester – the Central Limit Theorem and the Chernoff
bound.

Definition 26 (MGF). The moment generating function of a random variable X is the
function
MX (s) := E(e^{sX} ).
Observation 27. • MX (0) = 1
• lim_{s→−∞} MX (s) = P (X = 0) (for a random variable X ≥ 0)
Example 28. Let X ∼ Ber(p). Then

MX (s) = E(e^{sX} ) = (1 − p) e^{s·0} + p e^{s·1} = p e^s + 1 − p.

Theorem 29.

MX (s) = Σ_{k≥0} E(X^k ) s^k / k!

Proof.

MX (s) = E(e^{sX} )                     by definition
       = E( Σ_{k≥0} (sX)^k / k! )       by the Taylor expansion of the exponential
       = E( Σ_{k≥0} X^k s^k / k! )
       = Σ_{k≥0} E(X^k ) s^k / k!       by linearity of expectation

TODO: care must be taken, as we are using the linearity for an infinite sum.
The theorem above explains the name of MX : the coefficient of s^k is the k-th
moment, that is, the value E(X^k ) (divided by k!), so MX can be thought of as a GF
for this sequence of numbers of interest. We can sometimes use this to compute the
moments easily:
Example 30. Let X ∼ Exp(λ). Then

MX (s) = E(e^{sX} ) = ∫_0^∞ e^{sx} λ e^{−λx} dx = λ ∫_0^∞ e^{(s−λ)x} dx = λ/(λ − s)

if s < λ, while MX (s) = ∞ otherwise. We can now expand this function as a power
series:

MX (s) = 1/(1 − s/λ) = Σ_{k≥0} s^k / λ^k .

This and Theorem 29 show that E(X^k ) = k!/λ^k .
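A quick numerical check of this example (a sketch assuming numpy): estimate MX (s) = E(e^{sX}) by simulation and compare with λ/(λ − s).

    import numpy as np
    rng = np.random.default_rng(5)

    lam, s = 2.0, 0.5                        # need s < lambda
    X = rng.exponential(1 / lam, size=10**6)
    print(np.exp(s * X).mean())              # Monte Carlo: approx. 4/3
    print(lam / (lam - s))                   # exact: 4/3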


Theorem 31. M_{aX+b} (s) = e^{sb} MX (as)

Proof. TODO
Theorem 32. If X, Y are independent, then MX+Y = MX MY .
Proof. TODO
MGFs uniquely determine the distribution of the corresponding random variable:
Theorem 33. Suppose that for some ε > 0 two MGFs are equal on (−ε, ε), that is, for
random variables X, Y we have

MX (s) = MY (s) for all s ∈ (−ε, ε).

Then FX = FY .
(No proof.) (Note: we cannot hope for any stronger result than equality of CDFs;
we certainly cannot expect X = Y , for instance!) We also have the following limit
version that will be used later.
Theorem 34. Suppose for some ε > 0 and for random variables Z, Y1 , Y2 , . . . we have

MZ (s) = lim_{n→∞} MYn (s) for all s ∈ (−ε, ε).

Then FZ (x) = lim_{n→∞} FYn (x) for every x at which FZ is continuous.


Example 35. If X ∼ Pois(λ) then by definition

MX (s) = Σ_{k≥0} e^{sk} (λ^k / k!) e^{−λ} = e^{λ(e^s −1)} .

If Y ∼ Pois(µ) is independent of X, we get MY (s) = e^{µ(e^s −1)} . By Theorem 32 we get MX+Y (s) =
e^{λ(e^s −1)} e^{µ(e^s −1)} = e^{(λ+µ)(e^s −1)} and by Theorem 33 we see that X + Y ∼ Pois(λ + µ).
Example 36. Let X = Xn ∼ Bin(n, p). We know that X is a sum of n independent
Ber(p) variables, thus MX (s) = (1 − p + p e^s )^n . (This can also be verified independently
from the definition.) Let p = λ/n, so MXn (s) = (1 + λ(e^s − 1)/n)^n and we can see that
lim_{n→∞} MXn (s) = e^{λ(e^s −1)} . By Theorem 34 this shows that Bin(n, λ/n) converges
in distribution to Pois(λ).
Example 37. Let X ∼ N (0, 1). Then

MX (s) = E(e^{sX} ) = (1/√(2π)) ∫_{−∞}^{∞} e^{sx} e^{−x^2 /2} dx
       = e^{s^2 /2} · (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−s)^2 /2} dx
       = e^{s^2 /2} .

As e^{s^2 /2} = 1 + s^2 /2 + (s^2 /2)^2 /2! + . . . , this gives a formula for all moments of the standard
normal distribution.

7.1 Proof of CLT
First, we recall the statement of the CLT:
Theorem 38. Let X1 , X2 , . . . be i.i.d. RVs with mean µ and variance σ^2 > 0. Define

Yn = (X1 + · · · + Xn − nµ) / (σ√n).

Then Yn →d N (0, 1).
Proof. We may assume that µ = 0, otherwise put Xn′ = Xn − µ; this does not change
σ. We compute the first few terms of the MGF of the Xi : MXi (s) = 1 + σ^2 s^2 /2 + O(s^3 ).
This gives the formula for the MGF of Yn :

MYn (s) = MX (s/(σ√n))^n = ( 1 + (1/2)(s/√n)^2 + O(n^{−3/2}) )^n → e^{s^2 /2} .

TODO: add more details

7.2 Chernoff inequality

Theorem 39. Suppose X1 , . . . , Xn are i.i.d., each equal to ±1 with probability 1/2. Put
X = Σ_i Xi ; we have σ^2 = var(X) = n. Then for any t > 0 we have

P (X ≥ t) = P (X ≤ −t) ≤ e^{−t^2 /(2σ^2)} = e^{−t^2 /(2n)} .

Proof. For any s > 0 we have

P (X ≥ t) = P (e^{sX} ≥ e^{st} ) ≤ E(e^{sX} ) / e^{st}

by the Markov inequality. The numerator of the last term is MX (s), so let us estimate this.
By Theorem 32 we have
MX (s) = MX1 (s) MX2 (s) . . . MXn (s) = MX1 (s)^n .
By definition,

MX1 (s) = (e^s + e^{−s}) / 2 = Σ_{k≥0} s^{2k} / (2k)! .

This can be estimated from above by e^{s^2 /2} by looking at each term separately: for
every k we have

s^{2k} / (2k)! ≤ (s^2 /2)^k / k!

and the terms on the right hand side sum up to e^{s^2 /2} .
Thus we have MX (s) ≤ e^{n s^2 /2} and we have the bound

P (X ≥ t) ≤ e^{n s^2 /2 − st} .

We optimize this bound by choosing s = t/n (which is positive, so we can do that) and
we get the desired bound.
TODO: compare Chernoff and CLT.

7.3 Applications of Chernoff
Fair coin Let H be the number of heads we get in n throws of a fair coin, and let X =
2H − n = 2(H − n/2). We expect H to be rather close to n/2, thus X to be rather
small, but how small exactly? The exact answer is given by the CDF of Bin(n, 1/2), but
Chernoff gives a convenient estimate:

P (|X| > t) ≤ 2 e^{−t^2 /(2n)} .

So if t = √(2n ln n), the above probability is at most 2/n.

Set Balancing Consider sets S1 , . . . , Sn ⊆ [m]. We want to find a set T ⊆ [m] that
divides each of the Si as fairly as possible. (Application: design of statistical experiments.)
Specifically, we want to minimize the discrepancy max_i | |Si ∩ T | − |Si \ T | |. We can
design various algorithms to do that, but a simple argument gives us a solution with
discrepancy at most √(4m ln n): we choose T as a random subset of [m], with each
element having probability 1/2 of being selected. We will show that the probability of
discrepancy larger than d = √(4m ln n) is at most 2/n.
We can ignore sets for which |Si | ≤ d. (TODO: is this needed?) If |Si | = k ≥ d,
we express X = |Si ∩ T | − |Si \ T | as a sum X = Σ_j Xj where Xj = +1 if the
j-th element of Si is selected to T , and Xj = −1 otherwise. By the Chernoff bound
(Theorem 39) we have

P (X ≥ d) ≤ e^{−d^2 /(2k)} ≤ e^{−4m ln n/(2m)} = 1/n^2 .

Thus P (|X| ≥ d) ≤ 2/n^2 , and by the union bound over the n sets the probability
that some Si has discrepancy larger than d is at most 2/n.
For our next application we will need a modified version of the Chernoff bound:
Theorem 40. Suppose X1 , . . . , Xn are independent random variables taking values
in {0, 1} (not necessarily identically distributed). Let X = Σ_i Xi and µ = E(X). Then

Pr(X ≥ (1 + δ)µ) ≤ e^{−δ^2 µ/(2+δ)} ,    for 0 ≤ δ,
Pr(X ≤ (1 − δ)µ) ≤ e^{−δ^2 µ/2} ,        for 0 < δ < 1,
Pr(|X − µ| ≥ δµ) ≤ 2 e^{−δ^2 µ/3} ,      for 0 < δ < 1.

Balls-and-bins revisited Recall our model of putting m balls into n bins, now with m ≫
n. We will be again interested in the variable maxload = max_i Xi , where Xi is the number
of balls in the i-th bin. Obviously, E(Xi ) = m/n. We know that Xi ∼ Bin(m, 1/n),
but we will not use this. Put δ = 1. Then P (Xi ≥ 2m/n) ≤ e^{−m/(3n)} . Thus, if m ≫ n log n,
this probability is o(1/n), thus also

P (maxload ≥ 2m/n) ≤ Σ_i P (Xi ≥ 2m/n) = o(1).

Note that this is in strong contrast to our analysis in the case m = n.


TODO: more versions of Chernoff: https://en.wikipedia.org/wiki/Chernoff_bound
