Lecture 14

9.66


Class announcements

• We will try to avoid an overflow room: Please move in to the center, to avoid empty seats!
– But if necessary, overflow is still in 3037...

• Thursday’s class will be virtual, due to a Picower event. Synchronous attendance still required!

• Next week we will also not have in-person classes: We will have a guest lecture from Vikash Mansinghka on Thursday that you can watch asynchronously, and I may give a virtual lecture on Tuesday (depending on where we end up this week).
Class announcements

• Recitations going forwards: Th, F 4 PM – 46-3189
– Review of Basic Bayes

• PSet 1 will be released shortly, and due Oct 3. Other psets due approximately every two weeks thereafter.

• Readings for this week
– “Basic Bayes” and “Bayesian concept learning”
Plan for today

Basic Bayesian cognition
– Finish coin flipping
– The number game
What makes one sequence more representative of a random process than another?

HHTHT
HHHHH
What process produced these sequences?
What are the plausible alternative hypotheses,
and their priors?
Comparing two simple hypotheses

• Contrast simple hypotheses:
– H1: “fair coin”, P(H) = 0.5
– H2: “always heads”, P(H) = 1.0

• Bayes’ rule:

P(H|D) = P(D|H) P(H) / P(D)
• With two hypotheses, use odds form
Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5 = 1/32    P(H1) = ?
P(D|H2) = 0               P(H2) = ?
Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHTHT
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5 = 1/32    P(H1) = 999/1000
P(D|H2) = 0               P(H2) = 1/1000

P(H1|D) / P(H2|D) = [(1/32) / 0] × [999 / 1] = infinity
Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^5 = 1/32    P(H1) = 999/1000
P(D|H2) = 1               P(H2) = 1/1000

P(H1|D) / P(H2|D) = [(1/32) / 1] × [999 / 1] ≈ 30
Comparing two simple hypotheses

• How many heads would you need to see in a row to actually become suspicious that the coin might be a trick “always heads” coin?

Don’t do a calculation – ask your intuition!

• … To actually be confident that it was a trick coin?
Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^10 = 1/1024    P(H1) = 999/1000
P(D|H2) = 1                  P(H2) = 1/1000

P(H1|D) / P(H2|D) = [(1/1024) / 1] × [999 / 1] ≈ 1
Comparing two simple hypotheses

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHHHHHHHHHHHHHH
H1, H2: “fair coin”, “always heads”
P(D|H1) = 1/2^15 = 1/32768    P(H1) = 999/1000
P(D|H2) = 1                   P(H2) = 1/1000

P(H1|D) / P(H2|D) = [(1/32768) / 1] × [999 / 1] ≈ 0.03
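The posterior-odds calculations above can be sketched in a few lines of Python (the 999:1 prior is the slides' illustrative choice):

```python
def posterior_odds(n_heads, prior_fair=999/1000, prior_trick=1/1000):
    """Posterior odds of 'fair coin' vs. 'always heads' after seeing
    n_heads heads in a row (and no tails)."""
    like_fair = (1 / 2) ** n_heads   # each specific sequence has prob (1/2)^n under a fair coin
    like_trick = 1.0                 # an all-heads sequence is certain under 'always heads'
    return (like_fair / like_trick) * (prior_fair / prior_trick)

for n in (5, 10, 15):
    print(n, round(posterior_odds(n), 2))   # ≈ 31.22, 0.98, 0.03
```

At five heads the fair coin is still favored about 30:1; at ten the odds are even; at fifteen the trick coin dominates, matching the slides.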
An alternative analysis

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHTHT
H1, H2: “fair coin”, “coin that always comes up HHTHT”
P(D|H1) = 1/2^5 = 1/32    P(H1) = 999/1000
P(D|H2) = 1               P(H2) = 1/1000

P(H1|D) / P(H2|D) = [(1/32) / 1] × [999 / 1] ≈ 30
An alternative analysis

P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)]

D: HHHHH
H1, H2: “fair coin”, “coin that always comes up HHTHT”
P(D|H1) = 1/2^5 = 1/32    P(H1) = 999/1000
P(D|H2) = 0               P(H2) = 1/1000

P(H1|D) / P(H2|D) = [(1/32) / 0] × [999 / 1] = infinity
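This alternative analysis can be sketched the same way, with a hypothetical coin that deterministically produces the pattern HHTHT:

```python
def odds_fair_vs_pattern(data, pattern="HHTHT",
                         prior_fair=999/1000, prior_pattern=1/1000):
    """Posterior odds of 'fair coin' vs. 'always produces `pattern`'."""
    like_fair = (1 / 2) ** len(data)               # any specific sequence under a fair coin
    like_pattern = 1.0 if data == pattern else 0.0
    if like_pattern == 0.0:
        return float("inf")   # the pattern coin is ruled out entirely
    return (like_fair / like_pattern) * (prior_fair / prior_pattern)

print(odds_fair_vs_pattern("HHTHT"))  # ≈ 31.2: fair coin still favored ~30:1
print(odds_fair_vs_pattern("HHHHH"))  # inf: the HHTHT coin cannot produce this
```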
The role of priors

What looks random, and what doesn’t, depends very much on our hypothesis space and prior probabilities.

Is that a bad thing, or a good thing?

What might be the true nature and origins of our hypotheses and priors? Are they just statistical summaries of prior experience, or something deeper?
The role of intuitive theories

The fact that HHTHT looks representative of a fair coin, and HHHHH does not, reflects our intuitive theories of how the world works.
– Easy to imagine how a trick all-heads coin could work: high prior probability.
– Hard to imagine how a trick “HHTHT” coin could work: low prior probability.

Key questions: how intuitive theories generate probabilities for inference, how theories might themselves be learned.
Low prior, or zero prior?

• Is there any evidence you could see that would make you suspect you had a trick coin that always comes up “HHTHT”?
How about ...

HHTHT
How about ...

HHTHTHHTHT
How about ...

HHTHTHHTHTHHTHT
How about ...

HHTHTHHTHTHHTHTHHTHT
How about ...

HHTHTHHTHTHHTHTHHTHTHHTHT
How about ...

HHTHTHHTHTHHTHTHHTHTHHTHT

... What does this tell us about our hypothesis spaces, our priors, our intuitive theories?
Plan for today

Basic Bayesian cognition
– Finish coin flipping
– The number game
The number game

• Program input: number between 1 and 100
• Program output: “yes” or “no”
The number game

• Your task:
– Observe one or more positive (“yes”) examples.
– Judge whether other numbers are “yes” or “no”.
Bayesian concept learning

[Figure: three examples labeled “horse” and three labeled “tufa” — learning a new concept from just a few positive examples]
The number game

One positive example: 60

What other numbers do you think are likely to be accepted?
The number game

[Figures: human generalization judgments after one positive example (60), after four examples (60, 80, 10, 30), and after four examples (60, 52, 57, 55)]
The number game

Examples of “yes” numbers, and generalization judgments (N = 20):

60: Diffuse similarity
60 80 10 30: Rule: “multiples of 10”
60 52 57 55: Focused similarity: numbers near 50-60
The number game

Examples of “yes” numbers, and generalization judgments (N = 20):

16: Diffuse similarity
16 8 2 64: Rule: “powers of 2”
16 23 19 20: Focused similarity: numbers near 20
The number game

60: Diffuse similarity
60 80 10 30: Rule: “multiples of 10”
60 52 57 55: Focused similarity: numbers near 50-60

Main phenomena to explain:
– Generalization can appear either similarity-based (graded) or rule-based (all-or-none).
– Learning from just a few positive examples.
A single unifying account of (number) concept learning?

• We’re going to use this to introduce Bayesian approaches, but first consider ...
– The “naïve programmer” approach?
– The “modern neural network” approach?
Traditional (algorithmic level) cognitive models

• Multiple representational systems: rules and similarity
– Categorization, language (past tense), reasoning

• Questions this leaves open:
– How does each system work? How far, and in what ways, to generalize as a function of the examples observed?
• Which rule to choose?
– E.g., X = {60, 80, 10, 30}: multiples of 10 vs. even numbers?
• Which similarity metric?
– E.g., X = {60, 53} vs. {60, 20}?
– Why these two systems?
– When and why does a learner switch between them?
Reverse-engineering a cognitive system:
Marr’s three levels

• Level 1: Computational theory
– What are the inputs and outputs to the computation, what is its goal, and what is the logic by which it is carried out?
• Level 2: Representation and algorithm
– How is information represented and processed to
achieve the computational goal?
• Level 3: Hardware implementation
– How is the computation realized in physical or
biological hardware?
Bayesian model
• H: Hypothesis space of possible concepts:
– h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”)
– h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”)
– h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”)
– h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”)
– ...

• Representational interpretations for H:
– Candidate rules
– Features for similarity
– “Consequential subsets” (Shepard, 1987)
Three hypothesis subspaces for number
concepts
• Mathematical properties (24 hypotheses):
– Odd, even, square, cube, prime numbers
– Multiples of small integers
– Powers of small integers
• Raw magnitude (5050 hypotheses):
– All intervals of integers with endpoints between 1 and
100.
• Approximate magnitude (10 hypotheses):
– Decades (1-10, 10-20, 20-30, …)
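The 5050 count for the raw-magnitude subspace can be checked directly, since there is one hypothesis per interval [a, b] with 1 ≤ a ≤ b ≤ 100:

```python
# One "raw magnitude" hypothesis per interval [a, b], 1 <= a <= b <= 100.
count = sum(1 for a in range(1, 101) for b in range(a, 101))
print(count)  # 5050, i.e. 100 * 101 / 2
```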
Bayesian model

• H: Hypothesis space of possible concepts:
– Mathematical properties: even, odd, square, prime, . . .
– Approximate magnitude: {1-10}, {10-20}, {20-30}, . . .
– Raw magnitude: all intervals between 1 and 100.

• X = {x1, . . . , xn}: n examples of a concept C.

• Evaluate hypotheses given data:

p(h|X) = p(X|h) p(h) / Σ_{h′ ∈ H} p(X|h′) p(h′)

– p(h) [“prior”]: domain knowledge, pre-existing biases
– p(X|h) [“likelihood”]: statistical information in examples.
– p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases.

p(X|h) = [1 / size(h)]^n   if x1, …, xn ∈ h
       = 0                 if any xi ∉ h

• Captures the intuition of a “representative” sample, versus a “suspicious coincidence”.
Illustrating the size principle

[Figure: the even numbers 2–100 laid out in a grid, with h1 = “even numbers” covering the whole grid and h2 = “multiples of 10” as a small nested subset]
Illustrating the size principle

[Figure: the same grid of even numbers, with observed examples marked]

Data slightly more of a coincidence under h1
Illustrating the size principle

[Figure: the same grid of even numbers, with more observed examples marked]

Data much more of a coincidence under h1
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases.

p(X|h) = [1 / size(h)]^n   if x1, …, xn ∈ h
       = 0                 if any xi ∉ h

• Captures the intuition of a “representative” sample, versus a “suspicious coincidence”.
• A special case of the law of “conservation of belief”:

Σ_x p(X = x | Y = y) = 1
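Conservation of belief can be checked numerically for the size-principle likelihood: summing [1/size(h)]^n over every possible n-tuple of examples drawn from h gives 1. (The extension below is an arbitrary toy choice.)

```python
from itertools import product

ext = [10, 20, 30, 40, 50]   # a toy hypothesis extension, size 5
n = 3

# Sum the size-principle likelihood over every possible sequence of
# n examples drawn from the extension: 5^3 tuples, each with prob (1/5)^3.
total = sum((1 / len(ext)) ** n for _ in product(ext, repeat=n))
print(total)  # 1.0 up to floating-point error
```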
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Do we need this? Why not allow all logically possible
hypotheses, with uniform priors, and let the data sort
them out (via the likelihood)?
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
effectively, p(h) ~ 0 for many logically possible but
conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”.

e.g., X = {60 80 10 30}:

p(X | multiples of 10) = [1/10]^4 = 0.0001
p(X | multiples of 10 except 50, 70) = [1/8]^4 ≈ 0.00024
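A quick check of these two likelihoods. Note that the likelihood alone favors the gerrymandered hypothesis, which is exactly why a low prior is needed to rule it out:

```python
like_mult10 = (1 / 10) ** 4   # "multiples of 10" has 10 members
like_except = (1 / 8) ** 4    # "multiples of 10 except 50, 70" has 8 members

# The smaller, unnatural hypothesis wins on likelihood alone.
print(like_mult10, like_except)
```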
Posterior:

p(h|X) = p(X|h) p(h) / Σ_{h′ ∈ H} p(X|h′) p(h′)

• X = {60, 80, 10, 30}
• Why prefer “multiples of 10” over “even numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of 10 except 50 and 70”? p(h).
• Why does a good generalization need both high prior and high likelihood? p(h|X) ∝ p(X|h) p(h)
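These points can be made concrete with a runnable sketch of the posterior over a toy hypothesis space. The extensions follow the slides; the specific prior values are illustrative assumptions (natural hypotheses get most of the mass, the gerrymandered one gets very little):

```python
# Toy number-game posterior for X = {60, 80, 10, 30}.
hypotheses = {
    "even numbers":              set(range(2, 101, 2)),
    "multiples of 10":           set(range(10, 101, 10)),
    "powers of 2":               {2, 4, 8, 16, 32, 64},
    "mult. of 10 except 50, 70": set(range(10, 101, 10)) - {50, 70},
}
# Illustrative prior: 0.33 each for the three natural hypotheses,
# 0.01 for the unnatural one.
prior = {h: 0.33 for h in hypotheses}
prior["mult. of 10 except 50, 70"] = 0.01

def likelihood(X, ext):
    # Size principle: [1/size(h)]^n if every example falls in h, else 0.
    return (1 / len(ext)) ** len(X) if all(x in ext for x in X) else 0.0

def posterior(X):
    scores = {h: likelihood(X, ext) * prior[h] for h, ext in hypotheses.items()}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

p = posterior([60, 80, 10, 30])
# "multiples of 10" beats "even numbers" on likelihood (size principle),
# and beats the gerrymandered hypothesis on prior.
for h, v in sorted(p.items(), key=lambda kv: -kv[1]):
    print(f"{h:28s} {v:.4f}")
```

With these numbers, “multiples of 10” takes nearly all the posterior mass, showing that a good generalization needs both terms: high likelihood rules out “even numbers”, and high prior rules out “multiples of 10 except 50 and 70”.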
