Statistics 202C Study Guide

Part I: Sampling Basic Unstructured Distributions and Monte Carlo Basics
- $\sqrt{m}(\hat I_m - I) \xrightarrow{L} N(0, \sigma^2)$; the error is $\sigma/\sqrt{m}$, i.e. $O(m^{-1/2})$
- MC gives better approximations with fewer samples (than deterministic quadrature in high dimensions)
- Need to sample from $\pi(x)$
- The variance $\sigma^2$ could be large, and then the error $\sigma/\sqrt{m}$ is large
- Get i.i.d. samples from a $\pi(x)$ that puts more probability on the important parts of $D$
- $\hat I_m = \frac{1}{m}\sum_{j=1}^m \frac{g(x_j)}{\pi(x_j)}$, since $E_\pi\left(\frac{g(x)}{\pi(x)}\right) = \int_D g(x)\,dx = I$ (see the sketch below)
- $\sqrt{m}(\hat I_m - I) \xrightarrow{L} N(0, \sigma^2)$, with $\sigma^2 = \mathrm{Var}_\pi\left(\frac{g(x)}{\pi(x)}\right)$; if $\pi(x) \propto g(x)$, then $\sigma^2 = 0$: ideal
- The best distribution to sample from is proportional to $g(x)$
- If $g(x)$ and $\pi(x)$ are very different, $\sigma^2 = \mathrm{Var}_\pi\left(\frac{g(x)}{\pi(x)}\right)$ may be huge!
- Pick a sampling distribution $\pi(x)$ close to (proportional to) $g(x)$
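A minimal Python sketch of this estimator (not from the notes): the integrand $g(x) = e^{-x^2/2}$ and the trial $\pi = N(0, 2^2)$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    return np.exp(-0.5 * x**2)            # integrand; true integral is sqrt(2*pi)

m = 100_000
x = rng.normal(0.0, 2.0, size=m)          # i.i.d. samples x_j ~ pi = N(0, 2^2)
pi = np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2.0 * np.pi))   # pi(x_j)
ratios = g(x) / pi                        # g(x_j) / pi(x_j)
I_m = ratios.mean()                       # the estimator I_m
stderr = ratios.std(ddof=1) / np.sqrt(m)  # estimated sigma / sqrt(m)
print(f"I_m = {I_m:.4f} +/- {stderr:.4f}; truth = {np.sqrt(2.0 * np.pi):.4f}")
```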
Sampling from (almost) any distribution

Inversion Method
- Let $u \sim U[0, 1]$ and let $F$ be a cdf, $F(x) = \int_{-\infty}^x p(t)\,dt$ for some density $p(x)$; then $x = F^{-1}(u)$ has cdf $F(x)$ and is a sample from $p(x)$
- Get i.i.d. samples $u_1, \ldots, u_n$ from $U[0, 1]$
- Compute $F^{-1}(u_1), \ldots, F^{-1}(u_n)$; these are i.i.d. samples from the distribution with density $p(x)$ (see the sketch below)
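A quick sketch of the inversion method for the exponential distribution, where $F^{-1}$ has closed form (the rate 1.5 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.5
u = rng.uniform(size=50_000)      # u_1, ..., u_n ~ U[0, 1]
x = -np.log(1.0 - u) / lam        # F^{-1}(u) for F(x) = 1 - exp(-lam * x)
print(f"sample mean {x.mean():.3f} vs 1/lambda = {1.0 / lam:.3f}")
```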
Gaussian, Higher Dimension
- Sampling from $p(x; \mu, \Sigma)$
- Diagonalize $\Sigma^{-1}$ by solving the eigenvector equation $\Sigma^{-1} e_\alpha = \lambda_\alpha e_\alpha$, $\alpha = 1, \ldots, m$
- Re-express the multidimensional Gaussian as a product of one-dimensional Gaussians by changing to the coordinates defined by the eigenvectors
- Sample from each one-dimensional Gaussian by the Inversion Method (see the sketch below)
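A sketch of this diagonalization recipe in Python; mu and Sigma are illustrative assumptions, and the 1-D draws use numpy's normal generator rather than explicit inversion, for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

lam, E = np.linalg.eigh(Sigma)          # Sigma e_a = lam_a e_a
z = rng.standard_normal((10_000, 2))    # independent 1-D N(0,1) in eigen-coordinates
x = mu + (z * np.sqrt(lam)) @ E.T       # rotate back: x = mu + E diag(sqrt(lam)) z
print(np.cov(x, rowvar=False))          # should be close to Sigma
```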
Mixture of Gaussians
- $p(x) = \sum_{i=1}^n \alpha_i G(x; \mu_i, \Sigma_i)$
- Sample $i$ from $(\alpha_1, \ldots, \alpha_n)$, i.e. pick component $i$ with probability $\alpha_i$
- Sample from $G(x; \mu_i, \Sigma_i)$ by diagonalization and inversion (see the sketch below)
- Any distribution can be approximated by a mixture of Gaussians
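A sketch of the two-stage mixture sampler (a 1-D two-component mixture is assumed for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([0.3, 0.7])                    # mixture weights alpha_i
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])

m = 10_000
comp = rng.choice(len(alpha), size=m, p=alpha)  # pick component i w.p. alpha_i
x = rng.normal(mu[comp], sigma[comp])           # then sample from G(x; mu_i, sigma_i)
print(f"mean {x.mean():.3f} vs sum(alpha * mu) = {(alpha * mu).sum():.3f}")
```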
- The Beta distribution can be obtained from two Gammas:
- If $x_1 \sim \mathrm{Gamma}(\alpha)$ and $x_2 \sim \mathrm{Gamma}(\beta)$ are independent, then $\frac{x_1}{x_1 + x_2}$ is a random sample from $\mathrm{Beta}(\alpha, \beta)$ (see the check below)
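A quick numerical check of the Gamma-to-Beta construction ($\alpha = 2$, $\beta = 5$ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 5.0
x1 = rng.gamma(a, size=100_000)
x2 = rng.gamma(b, size=100_000)
y = x1 / (x1 + x2)                # should follow Beta(a, b)
print(f"mean {y.mean():.4f} vs a/(a+b) = {a / (a + b):.4f}")
```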
Sampling from a distribution $\pi(x)$ whose normalization constant is unknown

Rejection Sampling
- Want to sample from $\pi(x)$, but $Z$ is unknown; instead we have $l(x) = c\,\pi(x)$, where $l(x)$ is known but $c$ is not
- Find a sampling distribution $g(x)$ that is easy to sample from and normalized
- Find a covering constant $M$ such that $M g(x) \ge l(x)$ $\forall x$
- Draw a sample $x$ from $g(\cdot)$
- Compute $r = l(x)/(M g(x)) \le 1$
- Accept the sample with probability $r$, otherwise reject
- $P(\text{accept}) = c/M$; if this is small, the method is inefficient! (see the sketch below)
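A rejection-sampling sketch; the unnormalized target $l(x) = x(1-x)^3$ (a Beta(2, 4) up to $c = 1/20$), the trial $g = U[0,1]$, and $M = \max l$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def l(x):
    return x * (1.0 - x)**3            # unnormalized target on [0, 1]

M = l(0.25)                            # max of l, so M * g(x) >= l(x) with g = U[0,1]
n = 200_000
x = rng.uniform(size=n)                # draw x from g
accept = rng.uniform(size=n) < l(x) / M   # accept w.p. r = l(x) / (M g(x))
samples = x[accept]
print(f"acceptance rate {accept.mean():.3f} vs c/M = {(1.0 / 20.0) / M:.3f}")
```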
Rao-Blackwellization
- An initial estimator $\hat h(x)$ can be improved by conditioning on a statistic: $\hat h_1(x) = E(\hat h(x) \mid T(x))$
- $\hat h_1$ is always at least as good: $E((\hat h_1(x) - \theta)^2) \le E((\hat h(x) - \theta)^2)$, i.e. $\mathrm{Var}(\hat h_1(x)) \le \mathrm{Var}(\hat h(x))$
- $E[h(x) \mid x_2^{(i)}] = \sum_{x_1} h(x_1, x_2^{(i)})\,\pi(x_1 \mid x_2^{(i)})$, computed analytically, with $x_2^{(i)} \sim \pi_2(x_2)$
- When sampling, do as much analytically as possible
- $\hat I_m$ is purely sample based; $\tilde I_m$ is sampled and then conditioned upon
- Both estimators are unbiased, $E(\hat I_m) = E(\tilde I_m) = I$; $\tilde I_m$ has smaller variance, hence is more efficient (see the sketch below)
- $\sqrt{m}(\hat I_m - I) \xrightarrow{L} N(0, \sigma_1^2)$, $\sigma_1^2 = \mathrm{Var}(h(x))$
- $\sqrt{m}(\tilde I_m - I) \xrightarrow{L} N(0, \sigma_2^2)$, $\sigma_2^2 = \mathrm{Var}(E(h(x) \mid x_2))$
- $\sigma_2^2 \le \sigma_1^2$, since $\sigma_1^2 = \sigma_2^2 + E(\mathrm{Var}(h(x) \mid x_2))$
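A small Rao-Blackwellization demo under an assumed model: $x_2 \sim N(0,1)$, $x_1 \mid x_2 \sim N(x_2, 1)$, $h(x) = x_1^2$, so $E[h \mid x_2] = x_2^2 + 1$ analytically.

```python
import numpy as np

rng = np.random.default_rng(0)
m, reps = 1000, 2000
est_plain, est_rb = [], []
for _ in range(reps):
    x2 = rng.standard_normal(m)
    x1 = x2 + rng.standard_normal(m)
    est_plain.append((x1**2).mean())     # I^_m: pure sample average of h(x)
    est_rb.append((x2**2 + 1.0).mean())  # I~_m: averages E[h(x) | x2] instead
print(f"Var(plain) = {np.var(est_plain):.5f} > Var(RB) = {np.var(est_rb):.5f}")
```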
Importance Sampling
- Want to estimate $\mu = \int_D h(x)\pi(x)\,dx$; target: $\pi(x)$
- Draw i.i.d. samples from a trial distribution $g(x)$
- If $\pi(x)$ is known, including the normalization constant: (1) $\omega^{(i)} = \pi(x^{(i)})/g(x^{(i)})$
- If only $l(x) = c\,\pi(x)$ is known, with $c$ unknown: (2) $\omega^{(i)} = l(x^{(i)})/g(x^{(i)})$
- If $\pi(x)$ is known: $\hat\mu_m = \frac{1}{m}\sum_i \omega^{(i)} h(x^{(i)})$
- If only $l(x)$ is known: $\tilde\mu_m = \frac{\sum_i \omega^{(i)} h(x^{(i)})}{\sum_i \omega^{(i)}}$
- $\hat\mu_m$ is unbiased, $E(\hat\mu_m) = \mu$, but it can have large variance
- $\tilde\mu_m$ is biased, and $E(\tilde\mu_m)$ is essentially impossible to evaluate
- But $\tilde\mu_m$ is easy to compute (we only need $\pi$ up to proportionality)
- $g(x) \propto |h(x)|\,\pi(x)$ gives the most efficient estimate (smallest variance)
- Effective sample size: a measure of how different the sampling distribution is from the target
- $\mathrm{ESS}(m) = \frac{m}{1 + \mathrm{Var}_g(\omega(x))}$, with $\mathrm{Var}_g(\omega(x)) \approx \frac{\sum_i (\omega_i - \bar\omega)^2}{(m-1)\,\bar\omega^2}$, $\bar\omega = \frac{1}{m}\sum_i \omega_i$
- Importance sampling can be super efficient: smaller variance than i.i.d. sampling from $\pi$ itself
- Rao-Blackwellization applied to importance sampling = marginalization (see the sketch below)
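A self-normalized importance sampling sketch with the ESS diagnostic; the target $N(0,1)$ known only up to a constant, the trial $g = N(0, 2^2)$, and $h(x) = x^2$ are assumptions (truth: $E\,h = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50_000
x = rng.normal(0.0, 2.0, size=m)                    # x^(i) ~ g
g = np.exp(-0.5 * (x / 2.0)**2) / (2.0 * np.sqrt(2.0 * np.pi))
w = np.exp(-0.5 * x**2) / g                         # omega = l(x)/g(x), c unknown
mu_tilde = np.sum(w * x**2) / np.sum(w)             # self-normalized (ratio) estimate

wbar = w.mean()
cv2 = np.sum((w - wbar)**2) / ((m - 1) * wbar**2)   # estimate of Var_g(omega)
ess = m / (1.0 + cv2)
print(f"mu~ = {mu_tilde:.4f} (truth 1), ESS = {ess:.0f} of m = {m}")
```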
Adaptive Importance Sampling
- Use a $t$ distribution as the trial distribution $g_0(x)$
- Use weighted sampling to estimate $\hat\mu_1, \hat\Sigma_1$
- Use the new trial distribution $g_1(x) = t_\nu(x; \hat\mu_1, \hat\Sigma_1)$
- Estimate $E\,h(x)$ as $\frac{\frac{1}{m}\sum_i h(x_i)\,\omega(x_i)}{\frac{1}{m}\sum_i \omega(x_i)}$
- Can be quite unstable!
Weighted Sampling
- $g(\omega, x) = g(\omega \mid x)\, g(x)$
- Importance sampling is the special case where $\omega$ is a deterministic function of $x$: $\omega = \frac{\pi(x)}{g(x)}$
- Rejection sampling is the case $\omega = 1$ or $0$: $g(\omega = 1 \mid x) = \frac{l(x)}{M g(x)}$, $g(\omega = 0 \mid x) = 1 - \frac{l(x)}{M g(x)}$
- $\sum_{\omega, x} \omega\, h(x)\, g(\omega, x) = \sum_x h(x)\, g(\omega = 1 \mid x)\, g(x) = \frac{c}{M} \sum_x h(x)\,\pi(x)$
- Weighted sampling: draw $(\omega, x) \sim g(\omega, x)$; the weights are proper if $\frac{E_g(\omega \mid x)\, g(x)}{E_g(\omega)} = \pi(x)$
- Then $\tilde\mu_m = \frac{\sum_j \omega_j h(x_j)}{\sum_j \omega_j} \to \sum_x h(x)\,\pi(x)$
Problems with Rejection and Importance Sampling
- Rejection: enforcing $\frac{l(x)}{M g(x)} \le 1$ is problematic because it causes many rejections for some $x$'s
- Importance: samples with small weights contribute very little to the estimate but still require evaluating $h(x)$, which could be computationally expensive

Control Rejection Sampling
- Relax the requirement $\frac{l(x)}{M g(x)} \le 1$
- Reject samples with small weights, so we don't waste time evaluating $h(x)$
- Sample $x$ from $g(x)$
- Define $r(x) = \min\{1, l(x)/(c\,g(x))\}$
- Define $g(\omega \mid x) = \delta\!\left(\omega - \frac{l(x)\, q_c}{c\, g(x)\, r(x)}\right) r(x) + \delta(\omega - 0)\,\{1 - r(x)\}$, i.e. $\omega = 0$ or $\omega = \frac{l(x)\, q_c}{c\, g(x)\, r(x)}$
- $\frac{\frac{1}{m}\sum_i h(x_i)\,\omega_i}{\frac{1}{m}\sum_i \omega_i} \xrightarrow{p} \sum_x h(x)\,\pi(x)$ as $m \to \infty$
- Can be viewed as rejection sampling with a special $g^*(x)$, where $g^*(x) = \frac{r(x)}{q_c}\, g(x) = \frac{1}{q_c}\min\!\left(g(x), \frac{l(x)}{c}\right)$, and we accept with probability $r^*(x) = \frac{l(x)}{c\, g^*(x)}$
Part 2: Sampling with Structured Distributions

Considers distributions $\pi(x)$ with $x = (x_0, x_1, \ldots, x_N)$ where $N$ is large.
- Each $x_i$ takes values in $S = \{s_1, \ldots, s_k\}$
- How to sample $\pi(x)$ to get i.i.d. samples $x^1 = (x^1_0, x^1_1, \ldots, x^1_N)$, $x^2 = (x^2_0, x^2_1, \ldots, x^2_N)$, ...
- How to estimate $Z = \sum_x e^{-Q(x_0, \ldots, x_N)}$ in order $N k^2$ operations, as opposed to $k^N$
- Can find the marginal distributions $\pi_i(x_i)$ and draw exact random samples from $\pi(x)$ efficiently
- Maximizing $\pi(x)$ is equivalent to minimizing $E(x) = Q_1(x_0, x_1) + \cdots + Q_N(x_{N-1}, x_N)$
DP acts recursively
- Forward Pass
  - Define $m_1(x_1) = \min_{s_i \in S} Q_1(s_i, x_1)$ for $x_1 = s_1, \ldots, s_k$
  - Recursively compute $m_t(x_t) = \min_{s_i \in S}\{m_{t-1}(s_i) + Q_t(s_i, x_t)\}$ for $x_t = s_1, \ldots, s_k$
  - The optimal value of $E(\hat x)$ is obtained by minimizing $m_N$: $\min_x E(x) = \min_{x_N} m_N(x_N)$
- Backward Pass
  - Let $\hat x_N = \arg\min_{s_i \in S} m_N(s_i)$; then for $t = N-1, N-2, \ldots, 0$:
  - Let $\hat x_t = \arg\min_{s_i \in S}\{m_t(s_i) + Q_{t+1}(s_i, \hat x_{t+1})\}$
- $\hat x = (\hat x_0, \hat x_1, \ldots, \hat x_N)$ is the minimizer; NOTE, there may be many minimizers $\hat x$ (see the sketch below)
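A min-sum DP sketch on a chain with random $Q_t$ tables (an illustrative assumption): the forward pass builds $m_t$ in $O(Nk^2)$, the backward pass reads off one minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 10, 4
Q = rng.uniform(size=(N, k, k))         # Q[t-1][i, j] = Q_t(s_i, s_j)

m_val = Q[0].min(axis=0)                # m_1(x_1) = min_{s_i} Q_1(s_i, x_1)
arg = [Q[0].argmin(axis=0)]
for t in range(1, N):
    scores = m_val[:, None] + Q[t]      # m_{t-1}(s_i) + Q_t(s_i, x_t)
    arg.append(scores.argmin(axis=0))
    m_val = scores.min(axis=0)          # m_t(x_t)

x = [int(m_val.argmin())]               # x_N = argmin m_N
for t in range(N - 1, -1, -1):          # trace back x_{N-1}, ..., x_0
    x.append(int(arg[t][x[-1]]))
x.reverse()
print(f"min energy {m_val.min():.3f} at path {x}")
```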
James Molyneux is a sexy beast
DP2
- $\pi(x) = \pi_0(x_0)\,\pi_1(x_1 \mid x_0)\,\pi_2(x_2 \mid x_1) \cdots \pi_N(x_N \mid x_{N-1})$
- Then we can sample $x^{(1)}, x^{(2)}, \ldots$ as follows:
  - $x^{(1)}_0 \sim \pi_0(x_0)$, $x^{(1)}_1 \sim \pi_1(x_1 \mid x^{(1)}_0)$, ..., $x^{(1)}_N \sim \pi_N(x_N \mid x^{(1)}_{N-1})$, and so on for each sample
- Because of the graph structure $Q_i(x_{i-1}, x_i)$, the model has a local Markov structure, so $\pi(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_0) = \pi(x_i \mid x_{i-1})$
- $\pi(x) = \pi_N(x_N)\,\pi_{N-1}(x_{N-1} \mid x_N) \cdots \pi_0(x_0 \mid x_1)$
DP2 Algorithm
- Define $V_1(x_1) = \sum_{s_i \in S} e^{-Q_1(s_i, x_1)}$
- Recursively compute $V_i(x_i) = \sum_{y \in S} V_{i-1}(y)\, e^{-Q_i(y, x_i)}$
- $Z = \sum_{x_N \in S} V_N(x_N)$
- Marginal: $\pi_N(x_N) = V_N(x_N)/Z$
- Conditional: $\pi_i(x_i \mid x_{i+1}) = \frac{V_i(x_i)\, e^{-Q_{i+1}(x_i, x_{i+1})}}{\sum_{y \in S} V_i(y)\, e^{-Q_{i+1}(y, x_{i+1})}}$ (see the sketch below)
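A sketch of the DP2 recursion with random $Q_i$ tables (an assumption): forward sums give $V_i$ and $Z$; the backward pass draws one exact sample from $\pi(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 10, 4
Q = rng.uniform(size=(N, k, k))                 # Q[i-1][y, x] = Q_i(y, x)

Vs = [np.exp(-Q[0]).sum(axis=0)]                # V_1(x_1) = sum_s e^{-Q_1(s, x_1)}
for i in range(1, N):
    Vs.append((Vs[-1][:, None] * np.exp(-Q[i])).sum(axis=0))
Z = Vs[-1].sum()                                # Z = sum_{x_N} V_N(x_N)

x = [rng.choice(k, p=Vs[-1] / Z)]               # x_N ~ pi_N = V_N / Z
for i in range(N - 1, 0, -1):                   # sample x_i | x_{i+1}
    p = Vs[i - 1] * np.exp(-Q[i][:, x[-1]])
    x.append(rng.choice(k, p=p / p.sum()))
p0 = np.exp(-Q[0][:, x[-1]])                    # finally x_0 | x_1
x.append(rng.choice(k, p=p0 / p0.sum()))
x.reverse()
print(f"Z = {Z:.3f}, one exact sample: {x}")
```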
DP3
DP may also refer to:
Acronym for Display Picture. Can be any sort of profile picture on a social networking site (such as facebook) or an instant messaging system. E.g., Hollywood complained about the resolution he had to view KY's DP in.
Can also be meant as Dance party. A party where you will dance.. and where the
trance never sleeps!
Dummy Pack/Dummy Pacc/Dummy Pakc. Common slang used by Bay Area rappers
and/or people who rep the Yay Area. DP to da fullest! Yadadamean? We go hella
dumb. James
Bayes-Kalman
- Suppose we want to estimate the position of a target $x$ as it moves over time, $x_1, x_2, \ldots, x_{t-1}, x_t, x_{t+1}$, but we only observe $y_1, \ldots, y_t$
- We have an observation model $p(y_t \mid x_t)$, a target movement prior model $p(x_{t+1} \mid x_t)$, and a prior $p(x_1)$
- Goal is to estimate $P(x_t \mid y_{1:t})$, the probability of $x_t$ conditional on the observations so far
- Recursive algorithm: go from $p(x_t \mid y_{1:t})$ to $p(x_{t+1} \mid y_{1:t+1})$
- Prediction: $P(x_{t+1} \mid y_{1:t}) = \sum_{x_t} P(x_{t+1} \mid x_t)\, P(x_t \mid y_{1:t})$
- Correction: $P(x_{t+1} \mid y_{1:t+1}) = \frac{P(y_{t+1} \mid x_{t+1})\, P(x_{t+1} \mid y_{1:t})}{\sum_{x_{t+1}} P(y_{t+1} \mid x_{t+1})\, P(x_{t+1} \mid y_{1:t})}$
- The Kalman Filter is the special case where all distributions are Gaussian and the updates are done analytically (see the sketch below)
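A prediction-correction sketch for a 3-state discrete Bayes filter (the Kalman filter replaces these sums with Gaussian formulas); the transition model, observation model, and observation sequence are assumptions.

```python
import numpy as np

trans = np.array([[0.8, 0.2, 0.0],     # P(x_{t+1} | x_t), row = current state
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])
obs_lik = np.array([0.9, 0.5, 0.1])    # P(y = 1 | x) for each state

belief = np.ones(3) / 3.0              # prior p(x_0)
for y in [1, 1, 0, 1]:                 # assumed observation sequence
    pred = trans.T @ belief            # prediction: sum_x P(x'|x) P(x|y_1:t)
    lik = obs_lik if y == 1 else 1.0 - obs_lik
    belief = lik * pred                # correction: multiply by P(y'|x')
    belief /= belief.sum()             # and normalize
    print(np.round(belief, 3))
```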
Particle Filters
- Draw $x^{(1)}_t, \ldots, x^{(m)}_t$ from $P(x_t \mid y_{1:t})$: many particles where the probability is high
- Draw $x^{(i)}_{t+1}$ from $P(x_{t+1} \mid x^{(i)}_t)$ for each particle
- Weight each sample by $\omega^{(i)} \propto P(y_{t+1} \mid x^{(i)}_{t+1})$; this uses the new data
- Resample from $\{x^{(1)}_{t+1}, \ldots, x^{(m)}_{t+1}\}$ with probability weights $\omega^{(i)}$ to produce random samples with replacement
- $x^{(1)}_{t+1}, \ldots, x^{(m)}_{t+1}$ now (approximately) follow $P(x_{t+1} \mid y_{1:t+1})$
- Limitation: it doesn't use the current information $y_{t+1}$ to propose samples; resampling may cause inefficiency (see the sketch below)
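One propagate-weight-resample sweep of a particle filter for an assumed 1-D model (random-walk dynamics, Gaussian observation noise, new observation $y_{t+1} = 1.7$):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000
particles = rng.normal(0.0, 1.0, size=m)           # x_t^(i) ~ P(x_t | y_1:t)

y_next = 1.7                                        # new observation y_{t+1}
particles = particles + rng.normal(0.0, 0.5, m)     # draw from P(x_{t+1} | x_t^(i))
w = np.exp(-0.5 * ((y_next - particles) / 0.3)**2)  # weight by P(y_{t+1} | x_{t+1}^(i))
w /= w.sum()

idx = rng.choice(m, size=m, p=w)                    # resample with replacement
particles = particles[idx]                          # ~ P(x_{t+1} | y_1:t+1)
print(f"posterior mean approx {particles.mean():.3f}")
```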
Particle filters are more fun than no picnic
Sequential Monte Carlo
- Need a trial distribution $g(x) = g(x_1)\, g(x_2 \mid x_1) \cdots g(x_d \mid x_{d-1}, \ldots, x_1)$
- Need a sequence of distributions that approximate the marginals, $\pi_d(x_d) \approx \pi(x_d)$:
  - $\pi_1(x_1), \pi_2(x_1, x_2), \ldots, \pi_d(x_1, \ldots, x_d)$
- SIS algorithm:
  - Draw $x^{(j)}_t$ from $g(x_t \mid x^{(j)}_{1:t-1})$ and set $x^{(j)}_{1:t} = (x^{(j)}_{1:t-1}, x^{(j)}_t)$
  - Compute the incremental weight $u^{(j)}_t = \frac{\pi_t(x^{(j)}_{1:t})}{\pi_{t-1}(x^{(j)}_{1:t-1})\, g(x_t \mid x^{(j)}_{1:t-1})}$
  - Update $\omega^{(j)}_t = \omega^{(j)}_{t-1}\, u^{(j)}_t$
Self Avoiding Walk
- $\pi(x)$ is uniform over all self-avoiding paths of length $N$
- Trial distribution $g(x_{t+1} = (i', j') \mid x_1, \ldots, x_t) = \frac{1}{n_t}$, where $n_t$ is the number of unoccupied neighbors at step $t$; the weight is $\omega = n_1 \cdots n_{N-1}$
- Resample with probability proportional to the weight $\omega(x)$
- Importance sampling for SAW: the Rosenbluth Method (see the sketch below)
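A sketch of the Rosenbluth method on $\mathbb{Z}^2$: grow the walk one step at a time, pick uniformly among the $n_t$ unoccupied neighbors, and carry the weight $\omega = n_1 n_2 \cdots$ (trapped walks get weight 0). The mean weight then estimates the number of SAWs of length $N$ ($N = 20$ is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(0)

def rosenbluth_walk(N):
    path = [(0, 0)]
    occupied = {(0, 0)}
    w = 1.0
    for _ in range(N):
        x, y = path[-1]
        nbrs = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        free = [p for p in nbrs if p not in occupied]  # the n_t unoccupied neighbors
        if not free:
            return path, 0.0                           # trapped: weight 0
        w *= len(free)                                 # omega *= n_t
        step = free[rng.integers(len(free))]           # uniform choice, prob 1/n_t
        path.append(step)
        occupied.add(step)
    return path, w

weights = [rosenbluth_walk(20)[1] for _ in range(5000)]
print(f"estimated number of length-20 SAWs: {np.mean(weights):.3e}")
```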
Part 3: MCMC

Graphical models with closed loops.
- MCMC is a way to sample from any $\pi(x)$
- Does not sample from $\pi(x)$ directly; defines a Markov Chain that converges to samples from $\pi(x)$
- $P(x_{t+1} \mid x_t) = K(x_{t+1} \mid x_t)$, the transition kernel
- $\sum_{x_{t+1}} K(x_{t+1} \mid x_t) = 1$
- MCMC is a special MC chosen so that $K(x \mid y)$ satisfies:
  - Fixed point condition: $\sum_y K(x \mid y)\,\pi(y) = \pi(x)$
    - If we sample $y$ from $\pi(y)$, then sampling $x$ from $K(x \mid y)$ implies $x$ is drawn from $\pi(x)$
  - Detailed balance: $K(x \mid y)\,\pi(y) = K(y \mid x)\,\pi(x)$, which implies the fixed point condition
  - Irreducible: $K(x_1 \mid x)\, K(x_2 \mid x_1) \cdots K(y \mid x_n) > 0$ for some sequence $x_1, \ldots, x_n$; this means there is a move from any $x$ to any other point $y$
Basic Metropolis
- $\pi(x) = \frac{1}{Z} e^{-E(x)}$
- Propose a move from $x_t$ to $x_{t+1} \in N(x_t)$ with uniform probability $p(x_{t+1}) = \frac{1}{|N(x_t)|}$
- Accept the move with probability $r = \min\left\{1, \frac{\pi(x_{t+1})}{\pi(x_t)}\right\}$
- $x_{t+1}$ is a sample from $K(x_{t+1} \mid x_t) = \frac{1}{|N(x_t)|}\min\left\{1, \frac{\pi(x_{t+1})}{\pi(x_t)}\right\}$ (for $x_{t+1} \neq x_t$)
- Moves to lower energy / higher probability will always be accepted
- Moves to higher energy / lower probability have some probability of being accepted (see the sketch below)
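A basic Metropolis sketch on a ring of $k = 10$ states with an assumed energy $E(x) = (x - 3)^2/2$; the neighbors are $x \pm 1$, so the proposal is uniform over $N(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
E = (np.arange(k) - 3.0)**2 / 2.0            # assumed energy; pi(x) prop exp(-E(x))

x = 0
counts = np.zeros(k)
for _ in range(100_000):
    y = (x + rng.choice([-1, 1])) % k        # uniform proposal over the 2 neighbors
    if rng.uniform() < min(1.0, np.exp(E[x] - E[y])):   # r = min{1, pi(y)/pi(x)}
        x = y                                # downhill moves always accepted
    counts[x] += 1

target = np.exp(-E) / np.exp(-E).sum()
print(np.round(counts / counts.sum(), 3))
print(np.round(target, 3))
```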
Metropolis to Metropolis-Hastings
- Burn-in: the amount of time required for the MCMC to start producing samples that (approximately) follow $\pi(x)$
- Stickiness: a measure of how slowly the MCMC explores the space of $x$, e.g. the autocorrelation $\rho_k = \mathrm{corr}(x_t, x_{t+k})$

Metropolis-Hastings
- Improves basic Metropolis by adding a proposal probability $T(y \mid x)$ for $y \in N(x)$; if $T(y \mid x)$ is uniform, this is just basic Metropolis
- Accept the proposal with probability $r = \min\left\{1, \frac{\pi(y)\, T(x \mid y)}{\pi(x)\, T(y \mid x)}\right\}$
- $K(y \mid x) = T(y \mid x)\, \min\left\{1, \frac{\pi(y)\, T(x \mid y)}{\pi(x)\, T(y \mid x)}\right\}$ (for $y \neq x$)
Embree acts like he is 10 years older than he is
Gibbs Sampler
- $\pi(x_i \mid x_{-i})$ is the full conditional distribution; usually a Markov assumption makes it local; $K_i(y \mid x) = \pi(y_i \mid x_{-i})$ (with $y_{-i} = x_{-i}$)
- Each $K_i$ obeys detailed balance but not irreducibility; combine them (by linearity): $K(y \mid x) = \sum_i \alpha_i K_i(y \mid x)$
- Algorithm:
  - Select $i$ with probability $\alpha_i$ (random or systematic scan)
  - Select $y_i$ from $K_i(y \mid x_t) = \pi(y_i \mid x_{t,-i})$
- Note: a special case of M-H where $T(y \mid x) = K(y \mid x)$ and $r = 1$ (see the sketch below)
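A Gibbs sampler sketch for an assumed bivariate normal with correlation $\rho = 0.8$, where each full conditional is the 1-D Gaussian $N(\rho\, x_{\text{other}}, 1 - \rho^2)$; this is a systematic scan.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
sd = np.sqrt(1.0 - rho**2)
x1, x2 = 0.0, 0.0
samples = []
for _ in range(20_000):
    x1 = rng.normal(rho * x2, sd)     # draw x1 from pi(x1 | x2)
    x2 = rng.normal(rho * x1, sd)     # draw x2 from pi(x2 | x1)
    samples.append((x1, x2))

samples = np.array(samples[2000:])    # drop burn-in
print(f"empirical corr = {np.corrcoef(samples.T)[0, 1]:.3f} (target {rho})")
```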
Data Augmentation
- Stochastic alternative to EM
- $P(\theta \mid y_{obs}, y_{mis})$ and $P(y_{mis} \mid \theta, y_{obs})$ are known
- Given these, sample $\theta$ and $y_{mis}$ in turn
- Algorithm (a form of Gibbs):
  - Initialize $\theta^0$ and $y^0_{mis}$
  - Sample $\theta^t$ from $P(\theta \mid y^{t-1}_{mis}, y_{obs})$
  - Sample $y^t_{mis}$ from $P(y_{mis} \mid \theta^t, y_{obs})$
Multiple Try Metropolis
- Enables bigger jumps than M-H in one step
- Define $\omega(x, y) = \pi(x)\, T(y \mid x)\, \lambda(x, y)$, where $\lambda(x, y)$ is non-negative and symmetric
- From state $x_t$:
  - Draw $k$ independent trial proposals $y_1, \ldots, y_k$ from $T(y \mid x)$
  - Compute $\omega(y_j, x)$
  - Select $y$ from $(y_1, \ldots, y_k)$ with probability proportional to $\omega(y_j, x)$
  - Produce a reference set $x_1, \ldots, x_{k-1}$ drawn from $T(\cdot \mid y)$, and set $x_k = x_t$
  - Accept $y$ with probability $r = \min\left\{1, \frac{\omega(y_1, x) + \cdots + \omega(y_k, x)}{\omega(x_1, y) + \cdots + \omega(x_k, y)}\right\}$
Hybrid Monte Carlo
- Avoids the random walk behavior of Metropolis-type algorithms
- Motivated by physics
- Total energy $H(x, p) = U(x) + K(p)$, where $K(p) = \frac{p^2}{2m}$
- Hamilton's equations: $\dot x(t) = \frac{\partial H}{\partial p}$ and $\dot p(t) = -\frac{\partial H}{\partial x}$, so that $\frac{dH}{dt} = 0$
- Momentum allows particles to escape from local minima, and the gradient helps guide the particle in the right direction
- Uses Hamiltonian dynamics with Metropolis acceptance rules
- If we can sample from $\pi(x, p) \propto e^{-H(x, p)}$, then we can sample from the marginals: $x \sim \pi(x) \propto e^{-U(x)}$ and $p \sim \pi(p)$
- The Hamiltonian trajectory is time reversible, which is related to volume preservation
- Run the leapfrog for $t$ steps to reach $(x^*, p^*)$, then propose the state $(x^*, -p^*)$:
  - $x(t + \Delta t) = x(t) + \Delta t\, \frac{p(t + \frac{1}{2}\Delta t)}{m}$
  - $p(t + \frac{1}{2}\Delta t) = p(t - \frac{1}{2}\Delta t) - \Delta t\, \frac{\partial H}{\partial x}\Big|_{x(t)}$
- Accept the proposal state $(x^*, -p^*)$ with probability $\min\{1, \exp[-H(x^*, p^*) + H(x, p)]\}$ (see the sketch below)
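A hybrid Monte Carlo sketch for an assumed target $U(x) = x^2/2$ (so $\pi = N(0, 1)$), with unit mass, step size 0.2, and 20 leapfrog steps per proposal, followed by the Metropolis accept test on $H$:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_U(x):
    return x                              # dU/dx for U(x) = x^2 / 2

dt, L, n_iter = 0.2, 20, 5000
x = 0.0
samples = []
for _ in range(n_iter):
    p = rng.standard_normal()             # fresh momentum, m = 1
    x_new, p_new = x, p
    p_new -= 0.5 * dt * grad_U(x_new)     # half step for momentum
    for _ in range(L):
        x_new += dt * p_new               # full step for position
        p_new -= dt * grad_U(x_new)       # full step for momentum
    p_new += 0.5 * dt * grad_U(x_new)     # roll the last update back to a half step
    H_old = 0.5 * x**2 + 0.5 * p**2
    H_new = 0.5 * x_new**2 + 0.5 * p_new**2
    if rng.uniform() < np.exp(H_old - H_new):   # accept (x*, -p*) w.p. min{1, e^{-dH}}
        x = x_new
    samples.append(x)

print(f"mean {np.mean(samples):.3f}, var {np.var(samples):.3f} (target 0, 1)")
```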
Look up vampire-robot in the dictionary and you will find a picture of Katie.
Swendsen-Wang
- Metropolis is slow for the Ising model at low temperature
- Speed up sampling by dynamically grouping sites into clusters
- Is Data Augmentation for Ising/Potts models
- Define new bond variables $b \in \{0, 1\}$: if $x_i = x_{i+1}$, set $b = 1$ with probability $q_0 = 1 - e^{-\beta}$, otherwise $b = 0$
Schema Theorem (Genetic Algorithms)

$$M(H, t+1) = (1 - P_c)\, M(H, t)\, \frac{F(H, t)}{\bar F} + P_c \left[ M(H, t)\, \frac{F(H, t)}{\bar F}\, (1 - \text{losses}) + \text{gains} \right]$$

If we make a conservative assumption and ignore gains, and treat all disruptions as losses, we have:

$$M(H, t+1) \ge (1 - P_c)\, M(H, t)\, \frac{F(H, t)}{\bar F} + P_c \left[ M(H, t)\, \frac{F(H, t)}{\bar F}\, (1 - \text{disruptions}) \right]$$

where the disruption term is $\frac{\delta(H)}{L - 1}\,(1 - P(H, t))$, and $\delta(H)$ is the defining length associated with a 1-point crossover.

$P(H, t)$ is the proportional representation of $H$, obtained by dividing $M(H, t)$ by the population size.

Which implies the Schema Theorem:

$$P(H, t+1) \ge P(H, t)\, \frac{F(H, t)}{\bar F} \left[ 1 - P_c\, \frac{\delta(H)}{L - 1}\, (1 - P(H, t)) \right]$$

The Schema Theorem suggests that mutation is less important than crossover. But mutation is necessary to prevent the system from getting stuck.