9.66 Lecture 3
[Figure: explaining-away examples; labels include Luminance, Ignoring me, Strong athlete / Strong academics, Easy exam / Good student, and a colored Mach card.]
Explaining away in vision
The Mach card illusion…
[Figure: graphical model with node S0: Lighting]
Explaining away in social inference
Easy exam → A on exam ← Good student
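A minimal sketch of the explaining-away effect in this network, with made-up conditional probabilities (the numbers are illustrative, not from the lecture): conditioning on the A raises the probability of an easy exam, but additionally observing that the student is good pushes it back down.

    # Explaining away in a tiny Bayes net: Easy exam -> A on exam <- Good student.
    # All probabilities below are made up for illustration.
    P_easy, P_good = 0.3, 0.5

    def p_A(easy, good):
        # P(A on exam | easy exam, good student)
        return {(True, True): 0.95, (True, False): 0.8,
                (False, True): 0.7, (False, False): 0.1}[(easy, good)]

    def posterior_easy(observed_good=None):
        # Enumerate the joint, condition on A = True (and optionally on Good student)
        num = den = 0.0
        for easy in (True, False):
            for good in (True, False):
                if observed_good is not None and good != observed_good:
                    continue
                joint = (P_easy if easy else 1 - P_easy) * \
                        (P_good if good else 1 - P_good) * p_A(easy, good)
                den += joint
                if easy:
                    num += joint
        return num / den

    print(posterior_easy())                    # P(easy | A) ~ 0.48, up from the 0.3 prior
    print(posterior_easy(observed_good=True))  # P(easy | A, good) ~ 0.37: explained away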
Explaining away in social inference
[Figure: extended network; nodes include Easy exam 1, Easy exam 2, Easy exam 3, Good student, Good student 2]
Explaining away in social inference
[Figure: network extended with nodes Easy subject and Good student 2]
• Rejection sampling
• MCMC: Metropolis-Hastings (MH)
– MH with prior kernel
– MH with drift kernel
– MH with drift along the posterior gradient
Generate samples, re-weight them to approximate
posterior… but don’t start from scratch each time as in
rejection!
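As a concrete sketch of the re-weighting idea (a toy coin-weight model of my own choosing, not one from the slides): draw parameter values from the prior, then weight each one by the likelihood of the observed data.

    import random

    # Likelihood weighting for a coin-weight posterior: sample the weight from
    # its prior, then weight each sample by the likelihood of the observed flips.
    data = [1, 1, 1, 0, 1]  # observed flips, 1 = heads

    samples, weights = [], []
    for _ in range(10000):
        w = random.random()                 # weight ~ Uniform(0, 1) prior
        like = 1.0
        for flip in data:
            like *= w if flip else (1 - w)  # Bernoulli likelihood
        samples.append(w)
        weights.append(like)

    # The weighted average approximates the posterior mean E[w | data]
    post_mean = sum(s * l for s, l in zip(samples, weights)) / sum(weights)
    print(post_mean)   # ~5/7 ~ 0.714, the mean of the Beta(5, 2) posterior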
Markov chain
x → x → x → x → …
Transition matrix
T = P(x(t+1) | x(t))
https://www.youtube.com/watch?v=4I6TaYo9j_Y
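A minimal simulation of a Markov chain from its transition matrix (the two-state matrix below is illustrative): run the transition rule forward and the long-run state frequencies approach the chain's stationary distribution.

    import random

    # Two-state Markov chain with T[i][j] = P(x(t+1) = j | x(t) = i)
    T = [[0.9, 0.1],
         [0.5, 0.5]]

    x, counts = 0, [0, 0]
    for t in range(100000):
        x = 0 if random.random() < T[x][0] else 1
        counts[x] += 1

    # Long-run frequencies approach the stationary distribution pi = pi T,
    # which for this matrix is (5/6, 1/6).
    print([c / sum(counts) for c in counts])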
Why does MH work?
A_ji = min{ 1, (π_i Q_ij) / (π_j Q_ji) }    (see Russell and Norvig)
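A minimal Metropolis-Hastings sketch for a one-dimensional target (assumptions: a symmetric Gaussian drift proposal, so the Q terms in the acceptance ratio cancel; the standard-normal target is illustrative, not from the slides).

    import math, random

    def mh(log_p, x0, sigma, n):
        # Metropolis-Hastings with a Gaussian drift (random-walk) proposal.
        # The proposal is symmetric, so Q_ij / Q_ji cancels and the rule
        # A_ji = min(1, pi_i Q_ij / (pi_j Q_ji)) reduces to min(1, p(x')/p(x)).
        x, chain = x0, []
        for _ in range(n):
            xp = x + random.gauss(0.0, sigma)       # propose x' ~ N(x, sigma^2)
            if math.log(random.random()) < log_p(xp) - log_p(x):
                x = xp                              # accept; otherwise keep x
            chain.append(x)
        return chain

    log_p = lambda x: -0.5 * x * x                  # unnormalized standard normal
    chain = mh(log_p, x0=10.0, sigma=1.0, n=50000)
    print(sum(chain[5000:]) / len(chain[5000:]))    # ~0 after discarding burn-in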
Burn in
Early dependence on initial state, but chains very similar after enough samples…
Mixing
s*: std. dev. of Gaussian proposal distribution
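Continuing the mh() sketch above: the proposal's standard deviation trades acceptance rate against step size, and that tradeoff is what governs mixing (the widths below are illustrative).

    # Reusing mh() and log_p from the sketch above
    for s in (0.1, 1.0, 10.0):
        chain = mh(log_p, x0=0.0, sigma=s, n=20000)
        accept_rate = sum(a != b for a, b in zip(chain, chain[1:])) / (len(chain) - 1)
        print(s, round(accept_rate, 2))
    # Tiny steps are almost always accepted but explore slowly; huge steps are
    # mostly rejected; intermediate widths mix best.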
MH-drift versus Hamiltonian Monte Carlo
• Hamiltonian Monte Carlo makes the kernel sensitive to the
gradient of the posterior, giving more of a directed search
dynamic towards regions of greatest probability
(Duvenaud, Broderick)
https://www.youtube.com/watch?v=Vv3f0QNWvWQ
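A minimal sketch of the "drift along the posterior gradient" idea, written as a Langevin-style (MALA) kernel rather than HMC itself; HMC goes further by simulating Hamiltonian dynamics for several steps per proposal. The 1-D target and step size here are my own illustrative choices.

    import math, random

    def mala(log_p, grad_log_p, x0, eps, n):
        # MH whose proposal drifts along the gradient of the log posterior:
        # x' ~ N(x + eps^2/2 * grad log p(x), eps^2). The proposal is no
        # longer symmetric, so the Q terms in the acceptance ratio remain.
        def log_q(xp, x):   # log density (up to a shared constant) of proposing xp from x
            mu = x + 0.5 * eps * eps * grad_log_p(x)
            return -((xp - mu) ** 2) / (2 * eps * eps)
        x, chain = x0, []
        for _ in range(n):
            mu = x + 0.5 * eps * eps * grad_log_p(x)
            xp = random.gauss(mu, eps)
            log_a = (log_p(xp) + log_q(x, xp)) - (log_p(x) + log_q(xp, x))
            if math.log(random.random()) < log_a:
                x = xp
            chain.append(x)
        return chain

    # Standard-normal target: grad log p(x) = -x pulls proposals toward the mode
    chain = mala(lambda x: -0.5 * x * x, lambda x: -x, x0=10.0, eps=0.5, n=20000)
    print(sum(chain[2000:]) / len(chain[2000:]))    # ~0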
The magic of MCMC
• Since we only ever need to evaluate the relative
probabilities of two states, we can have very large state
spaces (much of which we rarely reach)
• In fact, our state spaces can be infinite
– common with nonparametric Bayesian models
• State spaces can be implicitly defined, with an infinite
number of states we’ve never seen or imagined …
– natural with probabilistic programs, and program induction (see the sketch below)
• But the guarantees it provides are asymptotic
– making algorithms that converge in practical amounts of time
is a significant challenge
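As a toy illustration of the first and third points: MH over the nonnegative integers, an infinite state space, driven only by an unnormalized weight function. The normalizer is never computed and almost all states are never visited. (The Poisson-shaped weight is my own choice.)

    import math, random

    # Unnormalized weight w(k) = 3.0**k / k!, the shape of a Poisson(3);
    # we never compute the normalizer or enumerate the states.
    def log_w(k):
        return k * math.log(3.0) - math.lgamma(k + 1)

    k, counts = 0, {}
    for _ in range(200000):
        kp = k + random.choice((-1, 1))       # symmetric +/-1 proposal
        if kp >= 0 and math.log(random.random()) < log_w(kp) - log_w(k):
            k = kp                            # moves off the support are rejected
        counts[k] = counts.get(k, 0) + 1

    print(sum(k * c for k, c in counts.items()) / sum(counts.values()))  # ~3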
MCMC and cognition
[Figure: posterior distributions P(t_total | t_past) plotted against t_total]
Is this a Fep?
[Figure: training set and test set of examples]
Rational rules model
(Goodman, Tenenbaum, Feldman, Griffiths, 2008)
Why sample?
• Cognition has to be extremely flexible.
– Posterior samples are a good target for general-
purpose probabilistic inference algorithms, maybe the
only good target.
// WebPPL: visualize the coin-weight posterior and print its expected value
viz(weightPosterior)
print('Expected heads weight: ' + Math.round(maxScale*expectation(weightPosterior)))
“One and done”: Optimal decisions from very
few samples
(Vul, Goodman, Griffiths, Tenenbaum, 2014)
Assume the cost of this policy is proportional to k, plus some fixed action cost.
Assume the "correct" action (Max Exp Util, under the true P(S|D)) succeeds with
probability uniform in [0.5, 1].
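A rough simulation of this tradeoff under the assumptions above (my own toy implementation, not Vul et al.'s exact analysis): each of k posterior samples votes for the correct action with probability p ~ Uniform(0.5, 1), the agent follows the majority vote, and pays a per-sample cost.

    import random

    def expected_net_utility(k, cost_per_sample=0.05, trials=20000):
        total = 0.0
        for _ in range(trials):
            p = random.uniform(0.5, 1.0)
            votes = sum(random.random() < p for _ in range(k))
            correct = votes * 2 > k or (votes * 2 == k and random.random() < 0.5)
            total += (1.0 if correct else 0.0) - cost_per_sample * k
        return total / trials

    for k in (1, 3, 5, 11, 25):
        print(k, round(expected_net_utility(k), 3))

With any appreciable per-sample cost, expected net utility peaks at very small k, hence "one and done."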
MCMC and cognitive science
Explaining anchoring and adjustment
(Lieder, Griffiths, Goodman 2012)
• Example: Half of you close your eyes, then the other half.
• Question 1
– What do you think the population is?
• Question 2
– Is the population of Cleveland bigger or smaller than 5,000,000?
– What do you think the population is?

From Lieder, Griffiths & Goodman (2012):

Figure 1: Bias of the mean of the approximation Q_t, i.e. |E[X̃_t] − E[X | Y = y]| where X̃_t ∼ Q_t, as a function of the number of iterations t of our Metropolis-Hastings algorithm. The five lines show this relationship for different posterior distributions whose means are located 1σ_p, …, 5σ_p away from the prior mean (σ_p is the standard deviation of the prior). As the plot shows, the bias decays geometrically with the number of iterations in all five cases.

Figure 2: Number of iterations i* that maximizes the agent's expected utility as a function of the ratio between the cost per iteration and the cost per unit error. The five lines correspond to the posterior distributions in Figure 1.

We simulated the decay of bias in the Normal-Normal case that we are focussing on (Equation 1) with cost_error(a, x) = ‖a − x‖. All Markov chains were initialized with the prior mean. Our results show that the mean of the sampling distribution converges geometrically as well (see Figure 1). Thus the reduction in bias tends to be largest in the first few iterations and decreases quickly from there on, suggesting a situation of diminishing returns for further iterations: an agent under time pressure may do well stopping after the initial reduction in bias.

This subsection combines the result reported in the previous subsection with time costs and computes the optimal bias-time tradeoffs depending on the ratio of time cost to error cost and on how large the initial bias is. It suggests that intelligent agents might use these results to choose the number of MCMC iterations according to their estimate of the initial bias. Formally, we define the optimal number of iterations i* and resulting bias b* as

i* = argmax_i E[u(a_i, x, t_0 + i/v) | Y = y], where a_i ∼ Q_i    (6)
b* = Bias[Q_{i*}; P]    (7)

using the variables defined above. If the upper bound in Equation 5 is tight, then the optimal number …
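A rough re-creation of the bias-decay simulation described in the excerpt (parameter values are my own choices, not Lieder et al.'s): MH chains on a Normal posterior, all initialized at the prior mean, with the bias of the mean iterate tracked across iterations.

    import math, random

    # Posterior 3 prior-SDs away from the prior mean; chains start at the
    # prior mean, which plays the role of the anchor.
    prior_mean, post_mean, post_sd = 0.0, 3.0, 1.0
    log_p = lambda x: -0.5 * ((x - post_mean) / post_sd) ** 2

    n_chains, n_iter, sigma = 2000, 30, 1.0
    sums = [0.0] * n_iter
    for _ in range(n_chains):
        x = prior_mean
        for t in range(n_iter):
            xp = x + random.gauss(0.0, sigma)
            if math.log(random.random()) < log_p(xp) - log_p(x):
                x = xp
            sums[t] += x

    for t in (0, 4, 9, 19, 29):
        bias = abs(sums[t] / n_chains - post_mean)
        print(t + 1, round(bias, 2))   # bias shrinks roughly geometrically with t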
MCMC and cognitive science