Lecture 3 (9.66)

Class announcements

• Pset 2 due next Monday (Oct 24)

• Recitation this week


– Help with Pset 2
Plan for today

• Patterns of inference in causal models
  – Explaining away in perception

• Introduction to MCMC and sampling-based inference in human cognition
Explaining away in cognition

[Diagrams of four common-effect causal graphs:
 Reflectance, Illumination → Luminance;
 Busy, Doesn't like me → Ignoring me;
 Strong athlete, Strong academics → College admission;
 Easy exam, Good student → A on exam]


Explaining away in vision

[Figure: the colored Mach card]

Explaining away in vision

[Figure: the Mach illusion; S0: Lighting]
Explaining away in social inference

[Graph: Easy exam, Good student → A on exam]
Explaining away in social inference

[Graph: Good student and Easy exam i → A on exam i, for exams 1, 2, 3]


Explaining away in social inference

[Hierarchical graph: Easy subject → Easy exam 1, 2, 3; Good student s and Easy exam i → A for student s on exam i, for students 1, 2 and exams 1, 2, 3]
Explaining away in social inference

[Same hierarchical graph as above]

• Abstract principles and systematic biases in attribution?


Plan for today

• Towards a probabilistic language of thought
  – Bayesian networks
  – Probabilistic programs

• Patterns of inference in causal models
  – Explaining away

• Introduction to MCMC and sampling-based inference in human cognition
Varieties of Monte Carlo

• Rejection sampling
• MCMC: Metropolis-Hastings (MH)
– MH with prior kernel
– MH with drift kernel
– MH with drift along the posterior gradient
Generate samples, re-weight them to approximate
posterior… but don’t start from scratch each time as in
rejection!
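As a concrete reference point, here is a minimal Python sketch of rejection sampling (my own illustration, not course code), using the same coin-weight setup as the probmods script later in the lecture: propose a weight from the prior, simulate data, and keep only proposals whose simulated data exactly match the observations.

import random

observed = ['h', 't', 'h', 'h', 't']              # observed coin flips

def rejection_sample(num_samples):
    """Posterior samples of the coin weight, by rejection."""
    kept = []
    while len(kept) < num_samples:
        weight = random.random()                  # prior: Uniform(0, 1)
        simulated = ['h' if random.random() < weight else 't' for _ in observed]
        if simulated == observed:                 # keep only exact matches
            kept.append(weight)
    return kept

samples = rejection_sample(200)
print(sum(samples) / len(samples))                # ~ 4/7, the posterior mean

The cost of this scheme is that almost every proposal gets thrown away as the data set grows, which is exactly the motivation for MCMC methods that reuse the current state instead of starting from scratch.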
Markov Chain

[Diagram: a chain of states x(1) → x(2) → … → x(T)]

Transition matrix
T = P(x(t+1) | x(t))

Variable x(t+1) is independent of all previous variables given its immediate predecessor x(t).

Systematic relationship between the transition matrix and the stationary (asymptotic) distribution.
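To make that relationship concrete, here is a tiny Python sketch (an assumed two-state example, not from the slides): repeatedly applying a fixed transition matrix drives any starting distribution to the chain's stationary distribution.

import numpy as np

# Rows are current states, columns are next states; each row sums to 1.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

p = np.array([1.0, 0.0])      # start with all probability mass on state 0
for _ in range(50):
    p = p @ T                 # one step of the chain
print(p)                      # ~ [0.833, 0.167], the stationary distribution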
Markov Chain Monte Carlo (MCMC)

[Diagram: Markov chain x(1) → x(2) → … → x(T)]

Transition matrix
T = P(x(t+1) | x(t))

• States of the chain = joint settings of the variables of interest (“possible worlds”).
• Transition matrix chosen to make some target conditional distribution (the Bayesian posterior) the stationary distribution.
When Metropolis-Hastings?

• Suppose we can compute P(data | h) and P(h), but not P(h | data):

  P(h | data) = P(data | h) P(h) / Σh' P(data | h') P(h')

• Or maybe we can only compute relative posteriors (or likelihood ratios and prior odds):

  P(hi | data) / P(hj | data) = [P(data | hi) P(hi)] / [P(data | hj) P(hj)]
Metropolis-Hastings algorithm

• Transitions have two parts:
  – proposal distribution: Q(h(t+1) | h(t))
  – acceptance: take proposals with probability

    A(h(t+1) | h(t)) = min{ 1, [P(h(t+1) | data) Q(h(t) | h(t+1))] / [P(h(t) | data) Q(h(t+1) | h(t))] }

  T(h(t+1) | h(t)) ∝ Q(h(t+1) | h(t)) A(h(t+1) | h(t))   (for h(t+1) ≠ h(t))
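A minimal Python sketch of these two parts (my illustration, with an assumed Gaussian target rather than anything from the course). With a symmetric Gaussian drift proposal the Q terms cancel, so only the ratio of unnormalized posterior values is needed:

import math
import random

def unnormalized_posterior(x):
    # assumed target: a standard normal prior times a likelihood peaked at 2
    return math.exp(-0.5 * x ** 2) * math.exp(-0.5 * (x - 2.0) ** 2)

def metropolis_hastings(num_steps, proposal_std=1.0, x0=0.0):
    samples, x = [], x0
    for _ in range(num_steps):
        proposal = x + random.gauss(0.0, proposal_std)   # symmetric drift proposal
        accept_prob = min(1.0, unnormalized_posterior(proposal)
                               / unnormalized_posterior(x))
        if random.random() < accept_prob:
            x = proposal                                  # accept the proposal
        samples.append(x)                                 # otherwise stay put
    return samples

chain = metropolis_hastings(20000)
kept = chain[2000:]                                       # drop burn-in
print(sum(kept) / len(kept))                              # ~ 1.0, the posterior mean

Shrinking or growing proposal_std here reproduces the mixing behavior shown on the next slides: tiny steps are almost always accepted but explore slowly, while huge steps are usually rejected.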
Metropolis-Hastings MCMC

https://www.youtube.com/watch?v=4I6TaYo9j_Y
Why does MH work?

• A Markov chain with transition probabilities Tij converges to the stationary distribution πi whenever detailed balance holds:

  Tij / Tji = πj / πi

  With Tij ∝ Qij Aij (Qij: proposal probability, Aij: acceptance probability), detailed balance requires

  (Qij Aij) / (Qji Aji) = πj / πi

  which is satisfied by choosing:

  Aij = min{ 1, (πj Qji) / (πi Qij) }
  Aji = min{ 1, (πi Qij) / (πj Qji) }

  (see Russell and Norvig)
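A quick numerical check of this argument (a made-up two-state example): with the MH acceptance rule, the resulting transition matrix satisfies detailed balance and leaves the target distribution invariant.

import numpy as np

pi = np.array([0.8, 0.2])            # target distribution
Q = np.array([[0.5, 0.5],            # arbitrary proposal probabilities
              [0.5, 0.5]])

T = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        if i != j:
            A = min(1.0, (pi[j] * Q[j, i]) / (pi[i] * Q[i, j]))   # MH acceptance
            T[i, j] = Q[i, j] * A
    T[i, i] = 1.0 - T[i].sum()       # rejected proposals stay put

print(pi[0] * T[0, 1], pi[1] * T[1, 0])   # equal, so detailed balance holds
print(pi @ T)                             # returns [0.8, 0.2]: pi is stationary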
Burn in

Early dependence on initial state, but chains very similar after enough samples…
Mixing
[Figure: chain traces for several values of s*, the standard deviation of the Gaussian proposal distribution]
MH-drift versus Hamiltonian Monte Carlo
• Hamiltonian Monte Carlo makes the kernel sensitive to the
gradient of the posterior, giving more of a directed search
dynamic towards regions of greatest probability

(Duvenaud
Broderick)

https://www.youtube.com/watch?v=Vv3f0QNWvWQ
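For reference, a compact sketch of the idea (my simplified 1D illustration with an assumed standard normal target, not production HMC): proposals follow the gradient of the log density via leapfrog steps, then are accepted or rejected using the change in total "energy".

import math
import random

def log_p(x):       return -0.5 * x * x      # assumed target: standard normal
def grad_log_p(x):  return -x

def hmc_step(x, step_size=0.1, num_leapfrog=20):
    p = random.gauss(0.0, 1.0)               # resample the momentum
    x_new, p_new = x, p
    for _ in range(num_leapfrog):            # leapfrog integration along the gradient
        p_new = p_new + 0.5 * step_size * grad_log_p(x_new)
        x_new = x_new + step_size * p_new
        p_new = p_new + 0.5 * step_size * grad_log_p(x_new)
    current_h = -log_p(x) + 0.5 * p * p      # total "energy" before and after
    proposed_h = -log_p(x_new) + 0.5 * p_new * p_new
    if random.random() < math.exp(current_h - proposed_h):
        return x_new                          # accept
    return x                                  # reject: stay where we are

x, samples = 0.0, []
for _ in range(5000):
    x = hmc_step(x)
    samples.append(x)
print(sum(s * s for s in samples) / len(samples))   # ~ 1.0, the target variance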
The magic of MCMC
• Since we only ever need to evaluate the relative
probabilities of two states, we can have very large state
spaces (much of which we rarely reach)
• In fact, our state spaces can be infinite
– common with nonparametric Bayesian models
• State spaces can be implicitly defined, with an infinite
number of states we’ve never seen or imagined …
– natural with probabilistic programs, and program induction
• But the guarantees it provides are asymptotic
– making algorithms that converge in practical amounts of time
is a significant challenge
MCMC and cognition

• MCMC provides one basis for “rational process models”: models of how the mind and brain implement Bayesian inference (Marr level 2, 3?) that can predict more detailed aspects of behavior than traditional rational analyses at the level of ideal computational theory (Marr level 1).
• MH may be a model for the dynamics of perception and
thinking.
• Sanborn, Griffiths, Shiffrin (2008; 2010) have shown how to
use MCMC algorithms as the basis for experiments with
people, mapping out mental representations.
• MH may be a metaphor for aspects of development.
Priors P(ttotal) based on empirically measured durations or magnitudes
for many real-world events in each class:

Median human judgments of the total duration or magnitude ttotal of events in each class, given one random observation at a duration or magnitude t, versus Bayesian predictions (median of P(ttotal|t)).
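A grid-based Python sketch of the Bayesian predictor being plotted here (my own example; the power-law prior and the numbers are assumptions for illustration, not the measured priors): compute the posterior median of ttotal given one observation t assumed to fall uniformly within [0, ttotal].

import numpy as np

def posterior_median(t, prior_exponent=1.5, grid_max=1000.0, n=20000):
    t_total = np.linspace(t, grid_max, n)     # the total must exceed what we saw
    prior = t_total ** (-prior_exponent)      # assumed power-law prior (illustration only)
    likelihood = 1.0 / t_total                # P(t | ttotal): t uniform in [0, ttotal]
    post = prior * likelihood
    post /= post.sum()
    cdf = np.cumsum(post)
    return t_total[np.searchsorted(cdf, 0.5)]

# e.g. something observed at t = 30 (arbitrary units):
print(posterior_median(30.0))                 # predicted total duration, ~ 47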
Individual judgments as samples from the posterior predictive
(Vul et al., Cog Sci 2009; Cognitive Science 2014)

[Figure: proportion of judgments below the predicted value vs. quantile of the Bayesian posterior distribution P(ttotal|tpast)]


Individual judgments as samples from the posterior predictive
(Vul et al., Cog Sci 2009; Cognitive Science 2014)

[Same plot of P(ttotal|tpast), averaged over all prediction tasks:]
• movie run times
• movie grosses
• poem lengths
• life spans
• terms in congress
• cake baking times
Posterior sampling in concept learning

These are Feps: [images of positive examples]

Is this a Fep? [test item image]

These are not Feps: [images of negative examples]
Posterior sampling in concept learning
Rational rules model
(Goodman, Tenenbaum, Feldman, Griffiths, 2008)

• Bayesian inference over disjunctions of conjunctions of features – prior favors simpler hypotheses (a toy sketch follows below):
  – “X is a Fep if X has round wings”
  – “… has round wings and a striped body”
  – “… has round wings or a striped body and pointy antenna”
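A toy Python sketch in the spirit of this model (a simplification I wrote for illustration: conjunctive rules only, made-up stimuli and noise level, not the published rational rules model): score each rule by a simplicity prior and a noisy-label likelihood, then average the rules' predictions for a test item, weighted by their posteriors.

import itertools
import math

features = ['round_wings', 'striped_body', 'pointy_antenna']
positives = [{'round_wings': 1, 'striped_body': 1, 'pointy_antenna': 0},
             {'round_wings': 1, 'striped_body': 0, 'pointy_antenna': 1}]
negatives = [{'round_wings': 0, 'striped_body': 1, 'pointy_antenna': 1}]
noise = 0.1                                    # probability a training label is flipped

def rule_predicts(rule, item):                 # rule = set of features that must be present
    return all(item[f] == 1 for f in rule)

score = {}
for size in range(1, len(features) + 1):
    for rule in itertools.combinations(features, size):
        prior = math.exp(-size)                # simplicity prior: shorter rules preferred
        likelihood = 1.0
        examples = [(x, True) for x in positives] + [(x, False) for x in negatives]
        for item, label in examples:
            match = rule_predicts(rule, item) == label
            likelihood *= (1 - noise) if match else noise
        score[rule] = prior * likelihood       # unnormalized posterior

total = sum(score.values())
test_item = {'round_wings': 1, 'striped_body': 1, 'pointy_antenna': 1}
p_fep = sum(w * rule_predicts(r, test_item) for r, w in score.items()) / total
print(round(p_fep, 3))                         # posterior probability the test item is a Fep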
Rational rules model
(Goodman, Tenenbaum, Feldman, Griffiths, 2008)

[Figure: training set and test set of Fep stimuli]
Rational rules model
(Goodman, Tenenbaum, Feldman, Griffiths, 2008)
Why sample?
• Cognition has to be extremely flexible.
  – Posterior samples are a good target for general-purpose probabilistic inference algorithms, maybe the only good target.

• Cognition has to be very fast.
  – Even just one or a few posterior samples are very useful in the settings that matter most for everyday cognition.
  – This is very different from the statistician's perspective on sampling, and inference more generally.

Try playing with this on probmods…


Probmods script

// Explore numSamples and numData; grain of judgment (7, 5, binary)

var observedData = ['h', 't', 'h', 'h', 't']
// var observedData = ['h', 't', 'h', 'h', 't', 't', 'h', 't', 't', 'h']

var maxScale = 7;      // 100, 10, 7, 5, 3, 1
var numSamples = 1000; // 100, 10, 5, 3, 1

var weightPosterior = Infer({method: 'rejection', samples: numSamples}, function() {
  var coinWeight = sample(Uniform({a: 0, b: 1}))
  var coin = Bernoulli({p: coinWeight})
  var obsFn = function(datum){ observe(coin, datum == 'h') }
  mapData({data: observedData}, obsFn)
  return coinWeight
})

viz(weightPosterior)
print('Expected heads weight: ' + Math.round(maxScale * expectation(weightPosterior)))
“One and done”: Optimal decisions from very few samples
(Vul, Goodman, Griffiths, Tenenbaum, 2014)

• How many samples to take?
  – Trade off increase in decision utility with opportunity cost of thinking more…

[Expected-utility setup: A: action to take; S: state of the world; D: data; U: utility function; P: Bayesian posterior]

How to choose the number of samples, k?

Assume two possible actions, each of which is correct in some states of the world but not others. Utility depends only on whether the action is “correct” or “incorrect”. A simple policy is to draw k samples from the posterior over world states, and choose the action that is best in more than half of the k samples (see the sketch below).
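A small simulation of that policy (my sketch, with an assumed posterior probability of 0.7 that action 1 is correct): accuracy grows with k, but slowly.

import random

def choose_action(p, k):
    """Sample k world states; pick action 1 if it is best in more than half."""
    votes = sum(random.random() < p for _ in range(k))
    return 1 if votes > k / 2 else 0

def prob_correct(p, k, trials=20000):
    best = 1 if p >= 0.5 else 0                          # truly best action
    hits = sum(choose_action(p, k) == best for _ in range(trials))
    return hits / trials

for k in (1, 3, 10, 100):
    print(k, round(prob_correct(0.7, k), 3))
# k = 1 already gives ~0.7 accuracy; k = 100 only raises it toward 1.0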
“One and done”: Optimal decisions from very
few samples
(Vul, Goodman, Griffiths, Tenenbaum, 2014)
How many samples to take?
– How does choice of k determine expected reward per unit time, when you have to make many small decisions over a lifetime (or an hour)?

Assume the cost of this policy is proportional to k, plus some fixed action cost. Assume the “correct” action (Max Exp Util, under true P(S|D)) succeeds with probability uniform in [0.5, 1].
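A rough simulation of this tradeoff (my reading of the setup, with assumed time constants): once thinking time is charged against the reward, the smallest k gives the best reward per unit time under these assumptions.

import random

def reward_rate(k, sample_cost=1.0, action_cost=10.0, trials=50000):
    """Average reward per unit time for the k-sample policy (assumed costs)."""
    total_reward, total_time = 0.0, 0.0
    for _ in range(trials):
        p = random.uniform(0.5, 1.0)          # prob. the truly best action succeeds
        votes = sum(random.random() < p for _ in range(k))
        chose_best = votes > k / 2
        success_prob = p if chose_best else 1 - p
        total_reward += 1.0 if random.random() < success_prob else 0.0
        total_time += action_cost + sample_cost * k
    return total_reward / total_time

for k in (1, 3, 10, 100):
    print(k, round(reward_rate(k), 4))
# with thinking time charged against reward, very small k wins per unit time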
MCMC and cognitive science

• MH may be a model for the dynamics of perception and thinking, e.g., bistability (Gershman, Vul, Tenenbaum 2009)
But maybe sometimes even one posterior sample is too costly?

Explaining anchoring and adjustment
(Lieder, Griffiths, Goodman 2012)

• Example: Half of you close your eyes, then the other half.
• Question 1
  – Is the population of Cleveland bigger or smaller than 200,000?
  – What do you think the population is?

Explaining anchoring and adjustment
(Lieder, Griffiths, Goodman 2012)

• Example: Half of you close your eyes, then the other half.
• Question 2
  – Is the population of Cleveland bigger or smaller than 5,000,000?
  – What do you think the population is?

Explaining anchoring and adjustment
(Lieder, Griffiths, Goodman 2012)

• Question 1
  – Is the population of Cleveland bigger or smaller than 200,000?
  – What do you think the population is?
• Question 2
  – Is the population of Cleveland bigger or smaller than 5,000,000?
  – What do you think the population is?

[Excerpts from Lieder, Griffiths & Goodman (2012) shown on these slides: in the Normal-Normal case, with error cost proportional to |a − x| and Markov chains initialized at the prior mean, the bias of the mean of the sampling approximation, |E[Xt] − E[X|Y = y]|, decays geometrically with the number of Metropolis-Hastings iterations (their Figure 1). The reduction in bias is largest in the first few iterations and shrinks quickly from there on, so an agent under time pressure may do well to stop after the initial reduction. Their Figure 2 shows the optimal number of iterations i* = arg max_i E[u(a_i, x, t0 + i/v) | Y = y], with a_i ~ Q_i, as a function of the ratio between the cost per iteration and the cost per unit error; the resulting bias b* = Bias[Q_i*; P] leaves estimates anchored toward the starting point.]
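A small simulation of the underlying idea (my sketch, not the paper's code, with an assumed Normal posterior): start a Metropolis-Hastings chain at an anchor far from the posterior mean and track how the bias of the chain's state shrinks with the number of iterations; stopping early leaves an anchoring-like bias.

import math
import random

posterior_mean, posterior_sd = 3.0, 1.0       # assumed Normal posterior
anchor = 0.0                                  # chains are initialized at the anchor

def log_post(x):
    return -0.5 * ((x - posterior_mean) / posterior_sd) ** 2

def mean_state(num_iters, chains=2000, proposal_sd=1.0):
    """Average chain state after num_iters MH steps, across many chains."""
    total = 0.0
    for _ in range(chains):
        x = anchor
        for _ in range(num_iters):
            prop = x + random.gauss(0.0, proposal_sd)
            if random.random() < min(1.0, math.exp(log_post(prop) - log_post(x))):
                x = prop
        total += x
    return total / chains

for iters in (0, 1, 2, 4, 8, 16, 32):
    print(iters, round(abs(mean_state(iters) - posterior_mean), 2))
# stopping after only a few iterations leaves the estimate biased toward the anchor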
MCMC and cognitive science

• Sanborn, Griffiths, Shiffrin (2008; 2010) have shown how to use MCMC algorithms as the basis for experiments with people, mapping out mental representations.
Measuring people’s priors
(Lewandowsky, Griffiths & Kalish, Cognitive Science 2009)
Iterated learning: (cf. human MCMC)
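A toy Python sketch of iterated learning as a Markov chain (my Beta-Bernoulli example, not the experiments in the paper): each simulated learner sees data generated from the previous learner's hypothesis and samples a new hypothesis from its posterior; the chain of hypotheses converges to the prior, which is what lets iterated learning be used to read out priors.

import random

ALPHA, BETA = 2.0, 5.0        # assumed Beta prior over the coin weight
N_FLIPS = 5                   # data passed from one learner to the next

def iterated_learning(num_generations):
    weight = 0.9              # the first teacher's hypothesis (arbitrary start)
    history = []
    for _ in range(num_generations):
        heads = sum(random.random() < weight for _ in range(N_FLIPS))
        # each Bayesian learner samples a hypothesis from its posterior
        weight = random.betavariate(ALPHA + heads, BETA + N_FLIPS - heads)
        history.append(weight)
    return history

chain = iterated_learning(20000)
print(sum(chain) / len(chain))     # ~ 2/7 ~ 0.29, the mean of the prior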
MCMC and cognitive science
• The Metropolis-Hastings algorithm seems like a good
metaphor for aspects of how children learn and reason
– an algorithm for what doesn’t seem very
“algorithmic”. (cf. “learning rule”, “learning algorithm”)
– Small, random, dumb, local steps
– Takes a long time
– Can get stuck in plateaus or stages
– “Two steps forward, one step back”
– Over time, intuitive theories get consistently better (more
veridical, more powerful, broader scope).
– Everyone reaches basically the same state (though some
take longer than others).
(Ullman & Tenenbaum, Annual Review of Dev Psych 2020)
