Probability & Statistics 2
Robert Šámal
January 29, 2024
Contents

1 Markov Chains
  1.1 Introduction, basic properties
  1.2 TODAY
  1.3 Classification of states
  1.4 Reducing the MC to smaller ones
  1.5 Convergence to stationary distribution
  1.6 Probability of absorption, time to absorption
  1.7 Application: algorithm for 2-SAT, 3-SAT

2 Bayesian statistics
  2.1 Two approaches to statistics
  2.2 Preliminaries – conditional pmf, cdf, etc.
  2.3 Bayesian method – basic description
  2.4 Bayes theorem
  2.5 Bayesian point estimates – MAP and LMS
  2.6 Bayesian inference – examples
      2.6.1 Naive Bayes classifier – both Θ and X are discrete
      2.6.2 Estimating bias of a coin – Θ is continuous, X is discrete
      2.6.3 Estimating normal random variables – both Θ and X are continuous

3 Conditional expectation

4 Stochastic processes
  4.1 Bernoulli process
  4.2 Poisson process

6 Permutation test

7 Moment Generating Functions and their applications
  7.1 Proof of CLT
  7.2 Chernoff inequality
  7.3 Applications of Chernoff
This is a work in progress, likely full of typos and small (or larger) imprecisions. If you find something that looks wrong, feel free to reach out to me. Thanks to Kryštof Višňák for helping me with spotting typos in a previous version.
1 Markov Chains
1.1 Introduction, basic properties
Two examples to start with A machine can be in two states: working or broken. For simplicity, we assume that the state stays the same for the whole day. Then, during the night, the state changes at random according to the figure below: for instance, if the machine is working one day, it will work the next day with probability 0.99, and with probability 0.01 it breaks overnight. Crucially, we assume that this probability depends neither on the age of the machine nor on the previous states.
A fly is moving in a corridor, which we model as a collection of four spaces, labeled 0, 1, 2, 3. If the fly is in space 1 or 2, it stays at the same space with probability 0.4; otherwise, it moves one step left or right, equally likely. At positions 0 and 3 there is a spider and the fly can never leave. Again, we assume that “the fly has no memory”, so the probabilities do not depend on the past trajectory of the fly.
TODO: add figures
What are the common features of these examples? We consider a sequence of random variables, a so-called random process. We do not care about the numerical value of these variables, as we consider them as mere labels – so we will not ask about the expected value of the position of the fly, for instance. We may assume that all the random variables have range contained in a set S of labels. For simplicity we assume S to be finite or countable (and frequently we will assume that S = {1, . . . , s} or S = N). We also want to prescribe transition probabilities p_{i,j} such that P(X_{t+1} = j | X_t = i) = p_{i,j}. However, there is more subtlety to this: we want to explicitly forbid the history (the values of X_0, . . . , X_{t−1}) to have an influence on X_{t+1}. (See also the section on stochastic processes.)
Definition 1 (Markov chain). Let S be any finite or countably infinite set. A sequence (X_t)_{t=0}^{∞} of random variables with range S is a (discrete time, discrete space, time-homogeneous) Markov chain if for every t ≥ 0 and every a_0, . . . , a_{t+1} ∈ S we have

P(X_{t+1} = a_{t+1} | X_t = a_t & . . . & X_0 = a_0) = P(X_{t+1} = a_{t+1} | X_t = a_t),    (1)

whenever the conditional probabilities are defined, that is, when P(X_t = a_t & . . . & X_0 = a_0) > 0. We call the numbers p_{i,j}(t) = P(X_{t+1} = j | X_t = i) the transition probabilities. We will only study cases where p_{i,j}(t) does not depend on t, and will omit the t in the notation.
The transition matrix is the matrix P such that P_{i,j} = p_{i,j}; that is, the entry at the i-th row and j-th column is the probability of transition from state i to state j. As a consequence of the definition, all entries of the transition matrix are nonnegative, and each row sums to 1. We can describe this succinctly by writing P j = j, with j denoting the column vector of all 1's.
Let P denote the transition matrix for the machine example and Q for the fly example. We have

    P = ( 0.99  0.01 )        Q = ( 1    0    0    0   )
        ( 0.9   0.1  )            ( 0.3  0.4  0.3  0   )
                                  ( 0    0.3  0.4  0.3 )
                                  ( 0    0    0    1   )
The transition graph/diagram is a directed graph with vertex set S and arcs (directed edges) (i, j) for every i, j ∈ S such that p_{i,j} > 0. We label the arc (i, j) by p_{i,j}. In other words, the figures above (TODO) are transition graphs.
Describing the distribution. We will again use the basic tool to describe a random variable, namely a PMF (probability mass function), that is, giving the probability of each state (element of S). A common notation is

π_i^{(t)} = P(X_t = i).

For any t ≥ 0 we also consider π^{(t)} as a row vector with coordinates π_i^{(t)} for i ∈ S.
Transition of the distribution Suppose we know π^{(0)}; what can we say about π^{(1)}, and about π^{(t)} in general? By the law of total probability we have

P(X_1 = j) = Σ_{i=1}^{s} P(X_0 = i) · P(X_1 = j | X_0 = i),

so, in other notation,

π_j^{(1)} = Σ_{i=1}^{s} π_i^{(0)} · P_{i,j},

and using matrix multiplication, π^{(1)} = π^{(0)} P. Iterating this, we get:

Theorem 2. π^{(k)} = π^{(0)} P^k.
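To make the matrix-power formula concrete, here is a minimal sketch (Python with NumPy; the starting distribution is our own choice for illustration) evolving the distribution of the machine chain:

    import numpy as np

    # Transition matrix of the machine example: state 0 = working, 1 = broken.
    P = np.array([[0.99, 0.01],
                  [0.90, 0.10]])

    pi = np.array([1.0, 0.0])  # pi^(0): the machine starts out working

    # pi^(k) = pi^(0) P^k; each row of P sums to 1, so pi stays a distribution.
    for k in (1, 10, 100):
        print(k, pi @ np.linalg.matrix_power(P, k))

Already for moderate k the output stops changing, foreshadowing the convergence to a stationary distribution discussed in Section 1.5.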
k-step transition To look at the above theorem in a different way, we define the k-step transition probabilities

r_{i,j}(k) := P(X_k = j | X_0 = i).

As we will see, we also have r_{i,j}(k) = P(X_{t+k} = j | X_t = i) for any t > 0, but this remains to be seen; a priori there may be a dependency on t.
Theorem 3 (Chapman–Kolmogorov). For any Markov chain and any k, ℓ ≥ 0 we have

• r_{i,j}(k) = (P^k)_{i,j}

• r_{i,j}(k + ℓ) = Σ_{u=1}^{s} r_{i,u}(k) r_{u,j}(ℓ)

• r_{i,j}(k + 1) = Σ_{u=1}^{s} r_{i,u}(k) p_{u,j}
1.2 TODAY
Theorem 4.
P(X_0 = a_0 & X_1 = a_1 & . . . & X_t = a_t) = π_{a_0}^{(0)} · p_{a_0,a_1} p_{a_1,a_2} · · · p_{a_{t−1},a_t}

Proof. By the chain rule for conditional probability, the left-hand side equals P(X_0 = a_0) · P(X_1 = a_1 | X_0 = a_0) · P(X_2 = a_2 | X_1 = a_1 & X_0 = a_0) · · ·, and by the Markov property (1) each conditional factor reduces to the corresponding one-step transition probability.
The next theorem shows a general version of the Markov property (1): “the future does not depend on the history, only on the present”.
Theorem 5 (General Markov property). For any Markov chain and any t ≥ 0, any
i ∈ S and any
• H – event depending only on values of X0 , . . . , Xt−1 ,
• F – event depending only on values of Xt+1 , Xt+2 , . . .
we have
P (F | H & Xt = i) = P (F | Xt = i).
Proof. (skipped)
1.4 Reducing the MC to smaller ones
Let us consider a MC together with its decomposition into communicating classes.
TODO
• think long-term
• The MC moves within one equivalence class for a while; when it leaves the class, it never comes back. (Because otherwise the two classes would communicate and hence be one class.)
• This process goes on until we get to an equivalence class that we cannot leave.
transient vs. recurrent states Let f_{i,j} be the probability that we ever get to j if we start from i; for i = j we require that we get there again at some time t ≥ 1. In formula,

f_{i,j} = P(∃t ≥ 1 : X_t = j | X_0 = i).

We call f_{i,i} the recurrence probability, and states i such that f_{i,i} = 1 are called recurrent (sometimes also persistent). The other states (those with f_{i,i} < 1) are called transient.
To explore this idea a bit further, let T_i = min{t ≥ 1 : X_t = i} (or T_i = ∞ if there is no such t). Recurrent states are exactly those for which T_i is finite with probability 1, given X_0 = i.
Also let V_i be the number of visits to i, that is, V_i := |{t ≥ 0 : X_t = i}|. It is clear that if i is a recurrent state and X_0 = i, then V_i = ∞. (After each visit we have probability f_{i,i} = 1 that we visit again.) If i is transient, then after each visit to i we have probability 1 − f_{i,i} > 0 that this is the last visit. Thus (by definition of the geometric distribution)

V_i | X_0 = i ∼ Geom(1 − f_{i,i}).
TODO: is the next theorem trivial for a finite MC?
Theorem 9. Let C be a communicating class. Then either all states in C are recurrent
or all are transient.
Proof. Suppose i ↔ j; in particular, r_{i,j}(t) > 0 for some t. Assume that i is recurrent.
Then f_{i,i} = 1, thus we visit i infinitely often. After each of these visits we have probability r = r_{i,j}(t) that we visit j within t units of time. Thus the probability that this never happens is lim_{n→∞} (1 − r)^n = 0. So we will visit j, starting from i; in symbols, f_{i,j} = 1.
Suppose f_{j,i} < 1. Then with positive probability we never visit i again, a contradiction, as we know we will visit i infinitely often. Thus f_{j,i} = 1. This implies that f_{j,j} = 1 as well, as f_{j,j} ≥ f_{j,i} f_{i,j}. So j is recurrent as well.
Suppose a Markov chain is irreducible, so there is a positive probability of moving from i to j for every pair of states. It is tempting to conclude that all states must be recurrent. Indeed, this is true for finite Markov chains:
Theorem 10. Let (X_t) be a finite irreducible Markov chain. Then all states are recurrent.
Proof. In view of the last theorem, the alternative is that all states are transient. This means no state is visited infinitely often. So there is a time M_1 such that for t > M_1 we have X_t ≠ 1. Similarly, we define M_i for every state i. But then what is the value of X_t for t > max{M_1, . . . , M_s}? There is no state left for it to take – a contradiction.
However, this is not necessarily the case if the Markov chain is infinite. TODO: simple example
In fact, a random walk in Z³ (or in higher dimensions) has all states transient, while a random walk in Z or in Z² has all states recurrent. TODO: more details?
1.5 Convergence to stationary distribution
In other words, regardless of π^{(0)}, we know what π^{(n)} will (approximately) be if n is large enough.
TODO: def. of irreducible, aperiodic TODO: what for finite TODO: examples when it fails: periodic states, two components of ↔, infinite.
1.6 Probability of absorption, time to absorption
Further, we let a_i be the probability that we end at state 0, starting from i:

a_i = P(∃t : X_t = 0 | X_0 = i).

Here we tacitly assume that the set A of absorbing states contains more states than just 0; otherwise a_i = 1 for every i.

Theorem 12. The probabilities a_i are the unique solution to the following system of equations:

a_0 = 1,
a_i = 0                          for 0 ≠ i ∈ A,
a_i = Σ_{j∈S} p_{i,j} a_j        otherwise.

Similarly, the expected times to absorption µ_i are the unique solution of

µ_i = 0                          for i ∈ A,
µ_i = 1 + Σ_{j∈S} p_{i,j} µ_j    otherwise.
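As an illustration, the following sketch (Python/NumPy; rearranging the equations into the linear system (I − T)a = b is our own presentation) solves both systems for the fly chain, where states 0 and 3 are absorbing:

    import numpy as np

    # Transient-to-transient part of the fly chain (states 1 and 2).
    T = np.array([[0.4, 0.3],
                  [0.3, 0.4]])
    I = np.eye(2)

    # a_i = sum_j p_{i,j} a_j rearranges to (I - T) a = b, where b holds the
    # one-step probabilities of jumping directly to the absorbing state 0.
    b = np.array([0.3, 0.0])
    a = np.linalg.solve(I - T, b)
    print(a)    # [2/3, 1/3]: the fly at space 1 ends at spider 0 w.p. 2/3

    # mu_i = 1 + sum_j p_{i,j} mu_j rearranges to (I - T) mu = 1.
    mu = np.linalg.solve(I - T, np.ones(2))
    print(mu)   # [10/3, 10/3]: expected number of steps until absorption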
1.7 Application: algorithm for 2-SAT, 3-SAT
2-SAT algorithm
• Input: 2-CNF φ with variables x_1, . . . , x_n
• Output: a satisfying assignment, or a statement that none exists
• arbitrarily initialize x_1, . . . , x_n
• If φ(x_1, . . . , x_n) is true, return (x_1, . . . , x_n). Otherwise, let C be an unsatisfied clause and flip the value of a random variable in C.
• repeat the previous step at most 2mn² times
• report that no solution exists
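A compact sketch of the algorithm (Python; the clause encoding and function name are our own choices, and m here is the error parameter from Theorem 14 below, not the number of clauses):

    import random

    def random_walk_2sat(clauses, n, m):
        # clauses: list of pairs of nonzero ints; literal k means x_|k|,
        # negated iff k < 0 (a DIMACS-like convention, our choice).
        x = {i: random.random() < 0.5 for i in range(1, n + 1)}  # arbitrary init
        satisfied = lambda lit: x[abs(lit)] == (lit > 0)
        for _ in range(2 * m * n * n):
            unsat = [c for c in clauses if not any(satisfied(l) for l in c)]
            if not unsat:
                return x                               # satisfying assignment
            lit = random.choice(random.choice(unsat))  # unsatisfied clause C,
            x[abs(lit)] = not x[abs(lit)]              # flip a random var of C
        return None  # "no solution" -- wrong with probability at most 2^-m

    # Example: (x1 or x2) & (-x1 or x2) & (x1 or -x2); satisfied by x1 = x2 = True.
    print(random_walk_2sat([(1, 2), (-1, 2), (1, -2)], n=2, m=3))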
Theorem 14. The above algorithm gives a wrong answer with probability at most 2^{−m}. The running time is O(m · n⁴).
Proof. For the running time estimate we just notice that φ has O(n²) clauses of two literals, ignoring the possibility of a faster search for an unsatisfied clause. If φ is not satisfiable, the algorithm never finds a satisfying assignment and thus gives a correct answer. So suppose there is a solution and let (x*_1, . . . , x*_n) be one of the (possibly many) solutions. Let D_t be the distance from the solution at time t. Explicitly, D_t is the number of i such that x_i ≠ x*_i, where the value of x_i is taken at time t.
The algorithm does not know about D_t; we only use it for the analysis. Clearly, if D_t = 0 at some time t, we have found the solution (and we see that φ is satisfied, so the algorithm ends).¹ Otherwise let C be the unsatisfied clause. To simplify notation, assume C = x_1 ∨ x_2. As C is unsatisfied, we have x_1 = x_2 = F. As x* is a satisfying assignment, either x*_1 or x*_2 (or both) are true. Suppose x*_1 = F, x*_2 = T. Then the algorithm has probability 1/2 of increasing D_t by one (if we change x_1) and probability 1/2 of decreasing it. This is certainly independent of anything else, in particular of the values of D_0, . . . , D_{t−1}.
The issue is that there are two other cases: x*_1 = T, x*_2 = F is another good case; it works in the same way. However, if x*_1 = x*_2 = T then this step of the algorithm will certainly decrease the distance from the solution: D_{t+1} = D_t − 1. While this looks like a good thing, we call this a bad case: we were hoping to use our knowledge of Markov chains to analyse the behaviour of (D_t), and in the bad case the transition probabilities are different.
To solve this problem, we (as people analyzing the algorithm) create an auxiliary sequence D′_t. Like D_t, this is a quantity the algorithm does not know about.
• We define D′_0 = D_0.
• When the choice of C makes a good case, we make sure that D′_{t+1} − D′_t = D_{t+1} − D_t.
• In the bad case we toss a coin to ensure that P(D′_{t+1} = D′_t + 1) = P(D′_{t+1} = D′_t − 1) = 1/2.
This way D_t ≤ D′_t always holds, and (D′_t) behaves as the unbiased random walk on {0, . . . , n}, which is the Markov chain we use in the analysis.
how likely it is that T is much larger. For this we use the Markov inequality from the first semester. Using it, we get

P(T ≥ 2n²) ≤ E(T)/(2n²) ≤ n²/(2n²) = 1/2.
To wrap it up: we divide the m · 2n² steps into m blocks, each of size 2n². By what we just did, in each block we fail with probability at most 1/2, where failure means that the algorithm runs without finding a solution, and T is the time till we find the solution.² The blocks are not literally independent; rather, the probability of failure in a block is ≤ 1/2 no matter how the previous block has ended. Thus, the probability of failure in all m blocks is at most (1/2)^m.
TODO: what if there is a clause with just one variable?
3-SAT algorithm
• Input: 3-CNF φ with variables x1 , . . . , xn
From this we get that

P(T ≤ n/2) = Σ_{k=0}^{n/2} P(D_0 = k) · P(T ≤ n/2 | D_0 = k)
           ≥ Σ_{k=0}^{n/2} P(D_0 = k) · 3^{−k}
           ≥ P(D_0 ≤ n/2) · 3^{−n/2} ≥ 1/(2 · 3^{n/2}).
This leads to a modified algorithm:
2 Bayesian statistics
2.1 Two approaches to statistics
In the first semester we looked at the classical (frequentists’) approach to statistics. In
this approach:
• Probability is a long-term frequency (out of 6000 rolls of a die, a six was rolled 1026 times; the ratio converges to the true probability). It is an objective property of the real world.
• Parameters are fixed, unknown constants. We can't make meaningful probabilistic statements about them.
• We design statistical procedures to have desirable long-run properties. E.g., 95% of our interval estimates will cover the unknown parameter.
Now we are going to look at an alternative, the so-called Bayesian approach:
• Probability describes how much we believe in a phenomenon, how much we are willing to bet:
(Prob. that T. Bayes had a cup of tea on December 18, 1760 is 90 %.)
(Prob. that the COVID-19 virus did leak from a lab is ?50? %.)
• We can make probabilistic statements about parameters (even though they are fixed constants): the “choice of universe” is the underlying elementary event.
• We compute the distribution of ϑ and form point and interval estimates from it, etc.
2.3 Bayesian method – basic description
• The unknown parameter is treated as a random variable Θ.
• We choose a prior distribution, the pmf p_Θ(ϑ) or the pdf f_Θ(ϑ), independent of the data.
• We choose a statistical model p_{X|Θ}(x|ϑ) or f_{X|Θ}(x|ϑ) that describes what we measure (and with what probability), depending on the value of the parameter.
2.5 Bayesian point estimates – MAP and LMS
MAP – Maximum A-Posteriori We choose ϑ̂ to maximize
• p_{Θ|X}(ϑ|x) in the discrete case
• f_{Θ|X}(ϑ|x) in the continuous case
LMS – Least Mean Squares We choose ϑ̂ to minimize the mean squared error E((Θ − ϑ̂)² | X = x); a short calculation shows that the minimizer is the posterior mean, ϑ̂ = E(Θ | X = x). TODO finish it
2.6.2 Estimating bias of a coin – Θ is continuous, X is discrete
Consider a loaded coin with probability of heads being ϑ (which we assume to be an evaluation of a random variable Θ). By the way, everything applies to any procedure generating a Bernoulli random variable, but we stick with a coin for concreteness. Our goal is to find out the value of ϑ. In tune with the Bayesian methodology, we start with a prior distribution, that is, a pdf f_Θ. (As we want to allow any real number in [0, 1] as the value of ϑ, we must take Θ to be a continuous random variable.) Then we take measurements: we choose a number n of coin tosses and check how many heads we get. If we know the value of ϑ, the distribution of this number (call it X) is clearly Bin(n, ϑ). So we get

p_{X|Θ}(k|ϑ) = (n choose k) ϑ^k (1 − ϑ)^{n−k}.

It remains to apply Theorem 18. We still haven't decided what prior to choose though. If we don't know anything (say it is not a real coin but a digital generator), we may take the flat prior Θ ∼ U(0, 1). However, we need something more versatile to allow us to encode some prior knowledge.
A versatile choice is the following two-parameter family: for parameters α, β > 0, put

f_Θ(ϑ) = c · ϑ^{α−1} (1 − ϑ)^{β−1}   for ϑ ∈ [0, 1]   (and f_Θ(ϑ) = 0 otherwise).

Here c is a normalizing constant that makes this function a pdf. It is typically written as 1/B(α, β), the reciprocal of the beta function. The r.v. Θ is said to have a beta distribution. We will collect some useful properties of this distribution. All are easy to verify using basic knowledge of calculus; details are omitted though.
• f_Θ(ϑ) is maximal for ϑ = (α−1)/(α+β−2) (the mode of the distribution). This can be verified by a simple differentiation.
• E(Θ) = α/(α+β) (the mean of the distribution). This follows from the next part and an easy calculation.
• For integer α, β we have B(α, β) = 1/((α+β−1) · (α+β−2 choose α−1)). This can be shown by integration by parts (per partes) and induction on α + β.
Now we have everything set up to apply Theorem 18. Fortunately, we don't need to compute the integral in the denominator.
The calculation is only valid for ϑ ∈ [0, 1]; otherwise f_Θ(ϑ) = 0, so the updated (posterior) pdf is also 0. How do we find c_2, if we need to? We use the fact that after conditioning on the event {X = k} the random variable Θ still only attains values in [0, 1]. Thus, c_2 takes the value that makes f_{Θ|X}(ϑ|k) a pdf, a function with integral 1. Based on what we learned about the beta distribution, c_2 = 1/B(α′, β′), and Θ|X = k follows the beta distribution with parameters α′ = α + k and β′ = β + n − k.
TODO: wrap up
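To make the update concrete, here is a small sketch (Python with SciPy; the prior parameters and the data are made up for illustration) of the conjugate Beta–Binomial update:

    from scipy import stats

    alpha, beta = 2, 2      # prior Beta(2, 2): mild belief the coin is fair
    n, k = 20, 14           # made-up data: 14 heads in 20 tosses

    # Posterior is Beta(alpha', beta') with alpha' = alpha + k and
    # beta' = beta + n - k; no integral needs to be computed.
    posterior = stats.beta(alpha + k, beta + n - k)

    print("LMS estimate (posterior mean):", posterior.mean())
    print("MAP estimate (posterior mode):", (alpha + k - 1) / (alpha + beta + n - 2))
    print("95% credible interval:", posterior.interval(0.95))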
3 Conditional expectation
We have already learned about expectation E(Y ) of a random variable Y — average
value over the whole probability space — and about conditional expectation E(Y | A)
— average over a set A ⊆ Ω. In this section we will learn about a related topic,
where we will take averages of Y over sets defined by another random variable, X. We will restrict the discussion to the case of a discrete random variable X; the case of continuous X is more subtle. The variable Y can be discrete or continuous.
For any x ∈ R we let

g(x) := E(Y | X = x).

This is obviously some real function of a real variable. Next, we plug the random variable X into the function g and define

E(Y | X) := g(X).
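A quick worked example (ours, for illustration): roll two fair dice, let X be the first roll and Y the sum of both. Then g(x) = E(Y | X = x) = x + 3.5, so E(Y | X) = X + 3.5 is itself a random variable, and indeed E(E(Y | X)) = E(X) + 3.5 = 7 = E(Y), in accordance with the law of iterated expectation.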
We further define Ŷ = E(Y | X) and Ỹ = Y − Ŷ, and obtain an expression for var(Y) – Eve's rule:

var(Y) = E(var(Y | X)) + var(E(Y | X)).
4 Stochastic processes
A stochastic process is a name for a sequence (or, more generally, a collection) of random variables. We have already seen an important case of that when we looked at Markov chains (where we added an important condition, “independence of the past”). Another important example (that we just mention in passing) is the Wiener process W_t (here t ∈ R, so it is not a sequence but a process with a continuous time parameter; in other words, a random function of a real variable). These processes are used to model Brownian motion and stock prices, to name a few.
Next, we will look at two models of arrival times – times till some random event occurs; you can imagine the next email arrival, or the next person walking into a store. The first model will consider discrete time, the second one continuous time.
4.1 Bernoulli process
Waiting times/Interarrival times Put L_k = T_k − T_{k−1} (we put T_0 = 0 to simplify notation). In words, it is the time we wait for the k-th success after the previous one. TODO: memoryless property. Thus L_k ∼ Geom(p), that is, E(L_k) = 1/p, var(L_k) = (1 − p)/p², and P(L_k = t) = (1 − p)^{t−1} p. Moreover, L_1, L_2, . . . are independent.
Alternative description Note that we can equivalently describe the situation by the interarrival times, that is, by the sequence of i.i.d. random variables L_1, L_2, . . . ∼ Geom(p). Then we put T_k = L_1 + · · · + L_k and set X_k = 1 whenever T_t = k for some t. It is easy to see that this is an equivalent description; in other words, the sequence X_1, X_2, . . . is a Bernoulli process.
Example: The number of days till the next rain follows the Geom(p) distribution. (We assume each day is either rainy or not rainy, that is, we have no finer distinction.) What is the probability that it will rain on days 10 and 20?
Computing this directly from the geometric waiting times seems very complicated and tedious. However, by the indicated description of a Bernoulli process by interarrival times, the indicator variables of rain form a Bernoulli process. And the probability of rain on days 10 and 20 is

P(X_10 = 1 & X_20 = 1) = P(X_10 = 1) · P(X_20 = 1) = p · p = p².
4.2 Poisson process
This is a continuous-time version of the Bernoulli process. Assume we want to deal with events that occur more often than once per day. We can stay with discrete time and measure it in hours, seconds, or nanoseconds. But instead of that, we will define a more elegant description that allows arbitrary real values of the arrival times.
As for the Bernoulli process, we will describe the process by several random variables:
(a) We are describing times of “arrivals” in the interval [0, ∞). For any time interval (a, b] we let N((a, b]) be the number of arrivals in this interval. We postulate that the pmf of this random variable only depends on τ = b − a. We denote P(N((a, b]) = k) by P(k, τ).
(b) N((a, b]) and N((0, a]) are independent.
(c) For small values of τ we have the following approximation, for some λ > 0 (the intensity of the process), as τ → 0:
• P(0, τ) = 1 − λτ + o(τ)
• P(1, τ) = λτ + o(τ)
• P(k, τ) = o(τ) for k > 1
TODO: explain how this follows from approximating a Bernoulli process
• N_t ∼ Pois(λt)
• L_k ∼ Exp(λ)
• For any sequence 0 ≤ t_0 < t_1 < · · · < t_k the random variables N((t_{i−1}, t_i]) for i = 1, . . . , k are independent, and the i-th of them follows Pois(λ(t_i − t_{i−1})).
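A simulation sketch (Python/NumPy; the parameter values are arbitrary choices) of the equivalence between the two descriptions: sum Exp(λ) interarrival times and count the arrivals that land in [0, t]:

    import numpy as np

    rng = np.random.default_rng(0)
    lam, t_max, trials = 2.0, 10.0, 10_000

    counts = []
    for _ in range(trials):
        # Arrival times T_k as partial sums of i.i.d. Exp(lam) interarrivals;
        # 150 terms is far more than enough to get past t_max here.
        arrivals = np.cumsum(rng.exponential(scale=1 / lam, size=150))
        counts.append(np.sum(arrivals <= t_max))

    # N_t should follow Pois(lam * t_max): mean and variance both about 20.
    print(np.mean(counts), np.var(counts))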
• Next, var(T_k) = var(L_1) + · · · + var(L_k) = k/λ² by the formula for the variance of a sum of independent RVs.
• Finally, we can find the pdf of T_k, the so-called Erlang distribution of order k:

f_{T_k}(t) = λ^k t^{k−1} e^{−λt} / (k − 1)!
Merging of Poisson processes Consider two independent Poisson processes: one with intensity λ, the other with intensity λ′. Then their merging is a Poisson process of intensity λ + λ′.
Birthday paradox As a warm-up, recall the birthday problem: the probability that k people (with uniformly random, independent birthdays among 365 days) all have distinct birthdays is

(1 − 1/365)(1 − 2/365) · · · (1 − (k−1)/365).

For small values of k we may use the well-known approximation e^{−x} ≈ 1 − x – but, surprisingly, we use it to approximate 1 − x by e^{−x}. The expression above is close to

e^{−1/365} · · · e^{−(k−1)/365} = e^{−Σ_{i=1}^{k−1} i/365} = e^{−k(k−1)/(2·365)}.

We can conclude that for k² ≈ 2 · 365 the probability of no birthday “collision” is approximately 1/e. By the way, √730 ≈ 27, and the exact formula for k = 27 gives XXX, so we see our approximations were pretty good. TODO When is the prob one half.
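A two-line check of the approximation (Python; we print the exact product, the exponential approximation, and 1/e side by side rather than asserting particular digits):

    import math

    def p_distinct(k, n=365):
        # Exact probability that k birthdays are pairwise distinct.
        p = 1.0
        for i in range(1, k):
            p *= 1 - i / n
        return p

    k = 27
    print(p_distinct(k), math.exp(-k * (k - 1) / (2 * 365)), 1 / math.e)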
Balls and bins model Next, we describe an abstract model that not only generalizes the above exercise, but mainly is used to analyze many randomized algorithms, some of which we will see below. We will be throwing m balls randomly into n bins. Each ball is thrown independently, and each bin has the same probability of being hit (and no ball ends up outside the bins). In this setting, if we put n = 365 and m = k, we have our birthday paradox problem again; now it is equivalent to asking what the probability is that some bin ends up with at least two balls.
Let us look at some easier questions to ask about this model (a small simulation sketch follows the list):
• What is the number of balls in the first bin? (Or any fixed bin, really.)
Obviously, it is a random variable. By recalling the definitions, we see that it follows the binomial distribution Bin(m, 1/n). This is all we can say about this number – and all we need to answer further questions: e.g., the probability that the first bin is empty is (m choose 0)(1 − 1/n)^m = (1 − 1/n)^m ≈ e^{−m/n} (using again the approximation 1 − x ≈ e^{−x}).
• How many bins are empty (on average)?
Using the previous item and linearity of expectation, this number is equal to n(1 − 1/n)^m ≈ n e^{−m/n}.
• What is the maximal number of balls in a bin?
This is a harder problem, so let us first show why we care about it.
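The promised simulation sketch (Python/NumPy; m and n are arbitrary choices) checks the first two items and measures the third:

    import numpy as np

    rng = np.random.default_rng(1)
    m, n, trials = 100, 50, 2000

    first_bin, empty, maxload = [], [], []
    for _ in range(trials):
        counts = np.bincount(rng.integers(0, n, size=m), minlength=n)
        first_bin.append(counts[0])        # load of a fixed bin ~ Bin(m, 1/n)
        empty.append(np.sum(counts == 0))  # number of empty bins
        maxload.append(counts.max())       # the harder quantity: max-load

    print(np.mean(first_bin), m / n)           # both about m/n = 2
    print(np.mean(empty), n * np.exp(-m / n))  # both about n e^{-m/n}
    print(np.mean(maxload))                    # no simple formula; see below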
Application 1: hashing We seek a data structure to store strings and later answer membership queries (has 'Cat' been stored?). We use a hash function h that assigns to every string an integer in [n] = {1, . . . , n}. We assume h is “sufficiently random”. This may be confusing, as h is a deterministic function (and we need it to give the same answer to each string when running next time). What we want is that for typical input strings s_1, . . . , s_n the hashes h(s_1), . . . , h(s_n) are independent and uniformly distributed in [n].
To proceed with our data structure: we have n linked lists B_1, . . . , B_n, initially empty. We store string s to B_{h(s)} in constant time. We look for s in B_{h(s)}, which takes time proportional to the length of this list. Our goal is to estimate the worst seek time, which is (proportional to) the maximum size of a bin, the so-called max-load. We must be careful what we ask for though: the worst time in the worst case is n, as we may get all balls in the same bin (i.e., all words can have the same hash). This is very unlikely though, with probability 1/n^{n−1}. A more precise result in this direction is the following upper bound.
Theorem 21. For large enough n we have

P(maxload ≥ 3 log n / log log n) ≤ 1/n.

Proof. Claim: For any i, P(|B_i| ≥ M) ≤ (n choose M) · 1/n^M.
To prove this, we use the union bound: we first write the event “|B_i| ≥ M” as a union: for every set S ⊆ [n] of size M we consider the event A_S = “all M balls in S end in bin B_i”. Obviously, P(A_S) = 1/n^M. Also, “|B_i| ≥ M” is simply ∪_S A_S (we take the union over all sets S of size M). So we get

P(|B_i| ≥ M) = P(∪_S A_S) ≤ Σ_S P(A_S) = (n choose M) · 1/n^M.

Claim: (n choose M) · 1/n^M ≤ 1/M! ≤ (e/M)^M.
Taking M = 3 log n / log log n, this bound is at most 1/n² for large enough n, so by another union bound over the n bins, P(maxload ≥ M) ≤ n · 1/n² = 1/n.
Later we will see that the bound for maxload we just got is best possible, up to a multiplicative factor.
Application 2: bucketsort We sort n numbers, assumed uniformly random, in three steps: (1) distribute them into n bins according to their leading digits, (2) sort each bin by bubblesort, (3) concatenate the bins.
Obviously, steps 1 and 3 take linear time (and we cannot do better). The interesting part is to analyze how long step 2 takes. We let X_j = |B_j|. For each input to the algorithm, this will be a particular integer. However, we analyze the running time on average, over a random input. Thus we treat X_j as a random variable. As we saw before, X_j ∼ Bin(n, 1/n). The running time of bubblesort is quadratic, so the expected total running time of step 2 is proportional to

Σ_{j=0}^{n−1} E(X_j²).
Poisson approximation Next, we will prove the “likely lower bound” for maxload:

Theorem 22. For large enough n we have

P(maxload ≤ log n / log log n) ≤ 1/n.
In contrast to the upper bound, this will require quite a bit of preparatory work. We will invoke the magic of Poisson random variables to help us with this estimate. We will also need to set up some notation. We will use X_i (or X_i^{(m)}) for the number of balls in bin i when m balls are being thrown. We already know that each X_i follows the Bin(m, 1/n) distribution, which is well approximated by Pois(m/n). Thus, with a leap of faith, we let Y_1, . . . , Y_n be i.i.d. random variables, each with distribution Pois(m/n). We will call the variables X_1, . . . , X_n the exact case and their approximation Y_1, . . . , Y_n the Poisson case. Note that while Y_1, . . . , Y_n are independent, the X_1, . . . , X_n are definitely not! In fact, we have already met their distribution; it is the multinomial distribution and satisfies

P(X⃗ = x⃗) = P(X_1 = x_1, . . . , X_n = x_n) = (m choose x_1, . . . , x_n) · 1/n^m,

where the multinomial coefficient is defined by (m choose x_1, . . . , x_n) = m!/(x_1! · · · x_n!). The formula above is only true if Σ_i x_i = m; otherwise P(X⃗ = x⃗) = 0. Thus, the distribution of X⃗ is definitely distinct from that of Y⃗: for instance, we can have Y_i = 0 for each i with nonzero probability, while the probability of this is 0 in the exact case. However, this is in some sense the only thing distinguishing the exact and Poisson cases.
Observation 23. The distribution of X⃗ is the same as that of Y⃗, conditioned on Σ_i Y_i = m. Formally,

P(X⃗ = x⃗) = P(Y⃗ = x⃗ | Σ_{i=1}^{n} Y_i = m).

Proof. Both probabilities are clearly 0 if Σ_i x_i ≠ m. Otherwise, we compute TODO
Thus, we can simulate the balls&bins process just by using independent Poisson variables. However, the conditioning makes computations complicated. The real magic comes in the next theorem, where we study what happens when we truly embrace the Poisson case of independent Poisson variables.

Theorem 24. Let f : Z^n → [0, ∞) be any function. With the notation as above, we have

E(f(X⃗)) ≤ e√m · E(f(Y⃗)).

Moreover, if the left-hand side is monotone in m (the number of balls), we can replace e√m by 2.
Corollary 25. Let A be any event expressed in terms of the sizes of the bins, so A ⊆ Z^n. Then the probability that A happens in the exact case is less than or equal to the probability that it happens in the Poisson case, times a factor e√m. Formally,

P(X⃗ ∈ A) ≤ e√m · P(Y⃗ ∈ A).
The inequality is clear (all terms in the sum are nonnegative) and the equality in the last row follows from the Observation above. It remains to recall that a sum of independent Poisson random variables is again Poisson, and thus Y := Σ_i Y_i ∼ Pois(m). So P(Y = m) = e^{−m} m^m / m!. Now we use an estimate for the factorial, m! ≤ (m/e)^m · e√m, and we are done.
Note: for the extended version with monotone left-hand side we replace Σ_{y=0}^{∞} by Σ_{y=0}^{m} or Σ_{y=m}^{∞} (based on whether the LHS is decreasing or increasing).
Now we test our technique by estimating the probability that maxload is low – Theorem 22. We let M = log n / log log n. The probability in the Poisson case can be estimated as follows: each Y_i ∼ Pois(1), so P(Y_i ≥ M) ≥ P(Y_i = M) = e^{−1}/M!, and by independence

P(all Y_i < M) ≤ (1 − e^{−1}/M!)^n ≤ e^{−n/(e·M!)}.

So it remains to show that e√n · e^{−n/(e·M!)} < 1/n (for large enough n). To do this, we will show that e^{−n/(e·M!)} < 1/n². TODO
6 Permutation test
How do we compare two random variables if their distribution can be arbitrary? For a concrete example, suppose we want to compare two gadgets by looking at their ratings. If the gadgets have similar features and price, perhaps this is the best way to decide – so if one gadget has an average rating of 4.1, then it surely is better than another one with rating 3.9, right? But wait, what about randomness – even if the gadgets are exactly equal, they still won't receive exactly the same rating, with high probability. So how do we decide what deviation we can attribute to randomness, and what is a mark of a true difference?
–> compare other possibilities
–> describe permutation test
Wilcoxon signed rank test: https://stats.stackexchange.com/questions/348057/wilcoxon-signed-rank-symmetry-assumption
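Since the description is still a TODO above, here is one common variant as a hedged sketch (Python/NumPy; the statistic, the ratings, and all names are our own choices): under the null hypothesis that the two rating distributions are equal, the pooled sample is exchangeable, so we compare the observed difference of means against random re-splits of the pooled data.

    import numpy as np

    def permutation_test(x, y, reps=10_000, seed=0):
        rng = np.random.default_rng(seed)
        observed = abs(np.mean(x) - np.mean(y))
        pooled = np.concatenate([x, y])
        count = 0
        for _ in range(reps):
            rng.shuffle(pooled)                  # random re-split of the data
            diff = abs(np.mean(pooled[:len(x)]) - np.mean(pooled[len(x):]))
            count += diff >= observed
        return (count + 1) / (reps + 1)          # p-value, with +1 smoothing

    # Made-up ratings of the two gadgets:
    x = np.array([5, 4, 4, 5, 3, 4, 5, 4])
    y = np.array([4, 3, 4, 4, 3, 5, 3, 4])
    print(permutation_test(x, y))  # large p-value: deviation looks like noise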
7 Moment Generating Functions and their applications

Definition 26 (MGF). The moment generating function of a random variable X is the function

M_X(s) := E(e^{sX}).

Observation 27.
• M_X(0) = 1
• lim_{s→−∞} M_X(s) = P(X = 0), provided X attains only nonnegative values

Example 28. Let X ∼ Ber(p). Then M_X(s) = E(e^{sX}) = (1 − p) + p·e^s.

Theorem 29.
M_X(s) = Σ_{k≥0} E(X^k) · s^k / k!
Proof. Expand e^{sX} = Σ_{k≥0} (sX)^k/k! and take expectations term by term. TODO: care must be taken, as we are using linearity for an infinite sum.
The theorem above explains the name of M_X: the coefficient of s^k is the k-th moment, that is, the value E(X^k) (divided by k!), so M_X can be thought of as a generating function for this sequence of numbers of interest. We can sometimes use this to compute the moments easily:
Example 30. Let X ∼ Exp(λ). Then

M_X(s) = E(e^{sX}) = ∫_0^∞ e^{sx} λ e^{−λx} dx = λ ∫_0^∞ e^{(s−λ)x} dx = λ/(λ − s)

if s < λ, while M_X(s) = ∞ otherwise. We can now expand this function as a power series:

M_X(s) = 1/(1 − s/λ) = Σ_{k≥0} s^k / λ^k.

Comparing with Theorem 29, we read off E(X^k) = k!/λ^k.
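A quick symbolic check of this example (Python with SymPy; a verification sketch, not part of the notes): differentiating the MGF k times at s = 0 should produce E(X^k) = k!/λ^k.

    import sympy as sp

    s = sp.Symbol('s')
    lam = sp.Symbol('lambda', positive=True)

    # MGF of Exp(lambda) from Example 30, valid for s < lambda.
    M = lam / (lam - s)

    for k in range(1, 5):
        moment = sp.diff(M, s, k).subs(s, 0)
        print(k, sp.factor(moment))  # 1/lambda, 2/lambda**2, 6/lambda**3, ...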
Proof. TODO
Theorem 32. If X, Y are independent, then M_{X+Y} = M_X · M_Y.
Proof. TODO
MGFs uniquely determine the distribution of the corresponding random variable:

Theorem 33. Suppose for some ε > 0 two MGFs are equal on (−ε, ε), that is, for random variables X, Y we have M_X(s) = M_Y(s) for all s ∈ (−ε, ε). Then F_X = F_Y.

(No proof.) (Note: we cannot hope for any stronger result than equality of CDFs; we certainly cannot expect X = Y, for instance!) We also have the following limit version that will be used later.
Theorem 34. Suppose for some ε > 0 and for random variables Z, Y_1, Y_2, . . . we have lim_{n→∞} M_{Y_n}(s) = M_Z(s) for every s ∈ (−ε, ε). Then Y_n converges to Z in distribution.

Example 35. Let X ∼ Pois(λ); a short computation gives M_X(s) = e^{λ(e^s−1)}. If Y ∼ Pois(µ) is independent of X, we likewise get M_Y(s) = e^{µ(e^s−1)}. By Theorem 32 we get M_{X+Y}(s) = e^{λ(e^s−1)} · e^{µ(e^s−1)} = e^{(λ+µ)(e^s−1)}, and by Theorem 33 we see that X + Y ∼ Pois(λ + µ).
Example 36. Let X = X_n ∼ Bin(n, p). We know that X is a sum of n independent Ber(p), thus M_X(s) = (1 − p + p e^s)^n. (This can also be verified directly from the definition.) Let p = λ/n, so M_{X_n}(s) = (1 + λ(e^s − 1)/n)^n, and we can see that lim_{n→∞} M_{X_n}(s) = e^{λ(e^s−1)}. By Theorem 34 this shows that Bin(n, λ/n) converges in distribution to Pois(λ).
Example 37. Let X ∼ N(0, 1). Then

M_X(s) = E(e^{sX}) = (1/√(2π)) ∫_{−∞}^{∞} e^{sx} e^{−x²/2} dx
       = (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−s)²/2} e^{s²/2} dx
       = e^{s²/2}.

As e^{s²/2} = 1 + s²/2 + (s²/2)²/2! + . . . , this gives a formula for all moments of the standard normal distribution.
7.1 Proof of CLT
First, we recall the statement of the CLT:

Theorem 38. Let X_1, X_2, . . . be i.i.d. RVs with mean µ and variance σ² > 0. Define

Y_n = (X_1 + · · · + X_n − nµ) / (√n · σ).

Then Y_n →d N(0, 1).

Proof. We may assume that µ = 0; otherwise put X′_n = X_n − µ, which does not change σ. We compute the first few terms of the MGF of X_i: M_{X_i}(s) = 1 + σ²s²/2 + O(s³). This gives a formula for the MGF of Y_n:

M_{Y_n}(s) = M_{X_1}(s/(σ√n))^n = (1 + s²/(2n) + O(s³/n^{3/2}))^n → e^{s²/2},

which is the MGF of N(0, 1) by Example 37; Theorem 34 then gives the convergence in distribution. TODO: add more details
7.3 Applications of Chernoff
Fair coin Let H be the number of heads we get in n throws of a fair coin, and let X = 2H − n = 2(H − n/2). We expect H to be rather close to n/2, thus X to be rather small, but how small exactly? The exact answer is given by the CDF of Bin(n, 1/2), but Chernoff gives a convenient estimate:

P(|X| > t) ≤ 2e^{−t²/(2n)}.

So if t = √(2n ln n), the above probability is at most 2/n.
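A quick empirical check (Python/NumPy; n and the trial count are arbitrary) that the bound holds, though it is far from tight:

    import numpy as np

    rng = np.random.default_rng(2)
    n, trials = 1000, 20_000

    H = rng.binomial(n, 0.5, size=trials)   # heads in n fair coin flips
    X = 2 * H - n
    t = np.sqrt(2 * n * np.log(n))

    print("empirical P(|X| > t):", np.mean(np.abs(X) > t))
    print("Chernoff bound 2/n  :", 2 * np.exp(-t**2 / (2 * n)))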
Set Balancing Consider sets S_1, . . . , S_n ⊆ [m]. We want to find a set T ⊆ [m] that divides each of the S_i as fairly as possible. (Application: design of statistical experiments.) Specifically, we want to minimize the discrepancy max_i ||S_i ∩ T| − |S_i \ T||. We can design various algorithms to do that, but a simple argument gives us a solution with discrepancy at most √(4m ln n): we choose T as a random subset of [m], with each element having probability 1/2 of being selected. We will show that the probability of discrepancy larger than d = √(4m ln n) is at most 2/n.
We can ignore sets for which |S_i| ≤ d. (TODO: is this needed?) If |S_i| = k ≥ d, we express X = |S_i ∩ T| − |S_i \ T| as a sum X = Σ_j X_j, where X_j = +1 if the j-th element of S_i is selected to T, and X_j = −1 otherwise. By the Chernoff bound (Theorem 39) we have

P(X ≥ d) ≤ e^{−d²/(2k)} ≤ e^{−4m ln n/(2m)} = 1/n².

Thus P(|X| ≥ d) ≤ 2/n², and a union bound over the n sets gives probability at most 2/n.
For our next application we will need a modified version of the Chernoff bound:

Theorem 40. Suppose X_1, . . . , X_n are independent random variables taking values in {0, 1} (not necessarily identically distributed). Let X = Σ_i X_i and µ = E(X). Then

Pr(X ≥ (1 + δ)µ) ≤ e^{−δ²µ/(2+δ)}   for 0 ≤ δ,
Pr(X ≤ (1 − δ)µ) ≤ e^{−δ²µ/2}       for 0 < δ < 1,
Pr(|X − µ| ≥ δµ) ≤ 2e^{−δ²µ/3}      for 0 < δ < 1.
Balls-and-bins revisited Recall our model of putting m balls into n bins, now with m ≫ n. We will again be interested in the variable maxload = max_i X_i, where X_i is the number of balls in the i-th bin. Obviously, E(X_i) = m/n. We know that X_i ∼ Bin(m, 1/n), but we will not use this. Put δ = 1. Then P(X_i ≥ 2m/n) ≤ e^{−m/(3n)}. Thus, if m ≫ n (more precisely, if m/n ≫ log n), this probability is o(1/n), and thus also

P(maxload ≥ 2m/n) ≤ Σ_i P(X_i ≥ 2m/n) = o(1).