Lecture 12 (9.66)

Generalizing to new objects

Hypothesis averaging: compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X):

$$p(y \in C \mid X) = \sum_{h \in H} p(y \in C \mid h)\, p(h \mid X) = \sum_{h \supseteq \{y, X\}} p(h \mid X),$$

where $p(y \in C \mid h) = 1$ if $y \in h$ and $0$ if $y \notin h$.
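A minimal Python sketch of this computation for the number game, assuming a small illustrative hypothesis set, a uniform prior, and numbers 1–100 (the lecture's model uses a much richer hypothesis space and prior):

```python
# Minimal sketch of Bayesian generalization by hypothesis averaging in the
# number game. The hypothesis set and the uniform prior are illustrative
# assumptions, not the full space used in the lecture.

NUMBERS = range(1, 101)

hypotheses = {
    "multiples of 10": {n for n in NUMBERS if n % 10 == 0},
    "even numbers":    {n for n in NUMBERS if n % 2 == 0},
    "powers of 2":     {n for n in NUMBERS if (n & (n - 1)) == 0},
    "all numbers":     set(NUMBERS),
}
prior = {h: 1 / len(hypotheses) for h in hypotheses}

def posterior(X):
    """p(h|X) ∝ p(X|h) p(h), with the size-principle likelihood
    p(X|h) = (1/|h|)^n if all examples fall in h, else 0."""
    scores = {}
    for name, ext in hypotheses.items():
        if all(x in ext for x in X):
            scores[name] = prior[name] * (1 / len(ext)) ** len(X)
        else:
            scores[name] = 0.0
    Z = sum(scores.values())
    return {name: s / Z for name, s in scores.items()}

def p_in_concept(y, X):
    """p(y in C | X): sum p(h|X) over hypotheses whose extension contains y."""
    return sum(p for name, p in posterior(X).items() if y in hypotheses[name])

print(p_in_concept(20, [60, 80, 10, 30]))   # high: every consistent h contains 20
print(p_in_concept(87, [60, 80, 10, 30]))   # low: only "all numbers" contains 87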
[Figures: generalization gradients for the example sets {16}, {16, 8, 2, 64}, and {16, 23, 19, 20}; human generalization compared with the Bayesian model's predictions for the example sets {60}, {60, 80, 10, 30}, {60, 52, 57, 55}, {16}, {16, 8, 2, 64}, and {16, 23, 19, 20}.]
Summary of the Bayesian model

• How do the statistics of the examples interact with prior knowledge to guide generalization?
  posterior ∝ likelihood × prior

• Why does generalization appear rule-based or similarity-based?
  hypothesis averaging + size principle
  – broad p(h|X): similarity gradient
  – narrow p(h|X): all-or-none rule
Summary of the Bayesian model

• How do the statistics of the examples interact with prior knowledge to guide generalization?
  posterior ∝ likelihood × prior

• Why does generalization appear rule-based or similarity-based?
  hypothesis averaging + size principle
  – broad p(h|X): many h of similar size, or very few examples (e.g., 1)
  – narrow p(h|X): one h much smaller than the rest
Model variants

1. Bayes with weak sampling (keeps hypothesis averaging, drops the size principle)
   posterior ∝ likelihood × prior
   "Weak sampling": $p(X \mid h) \propto 1$ if $x_1, \ldots, x_n \in h$; $= 0$ if any $x_i \notin h$

2. Maximum a posteriori (MAP) / maximum likelihood / subset principle (keeps the size principle, drops hypothesis averaging)
   posterior ∝ likelihood × prior
   $p(y \in C \mid X) = 1$ if $y \in h^*$, where $h^* = \arg\max_{h \in H} p(h \mid X)$; $= 0$ if $y \notin h^*$
[Figure: generalization gradients compared across four panels — human generalization, the full Bayesian model, Bayes with weak sampling (no size principle), and MAP / subset principle (no hypothesis averaging).]
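A sketch of how these two variants change the computation, on the same kind of illustrative hypothesis space as above (the space and the uniform prior are assumptions for illustration):

```python
# Sketch of the two lesioned variants on an illustrative hypothesis space.

NUMBERS = range(1, 101)
hypotheses = {
    "multiples of 10": {n for n in NUMBERS if n % 10 == 0},
    "even numbers":    {n for n in NUMBERS if n % 2 == 0},
    "all numbers":     set(NUMBERS),
}
prior = {h: 1 / len(hypotheses) for h in hypotheses}

def consistent(X, ext):
    return all(x in ext for x in X)

def weak_sampling_posterior(X):
    """Variant 1: weak sampling. p(X|h) is the same constant for every
    consistent h, so the size principle drops out and only the prior
    discriminates among hypotheses consistent with the examples."""
    scores = {h: (prior[h] if consistent(X, ext) else 0.0)
              for h, ext in hypotheses.items()}
    Z = sum(scores.values())
    return {h: s / Z for h, s in scores.items()}

def map_generalization(y, X):
    """Variant 2: MAP / subset principle. Keep the size-principle likelihood
    but generalize from the single best hypothesis h* instead of averaging."""
    scores = {h: (prior[h] * (1 / len(ext)) ** len(X) if consistent(X, ext) else 0.0)
              for h, ext in hypotheses.items()}
    h_star = max(scores, key=scores.get)
    return 1.0 if y in hypotheses[h_star] else 0.0

print(weak_sampling_posterior([60, 80, 10, 30]))  # flat over consistent hypotheses
print(map_generalization(20, [60, 80, 10, 30]))   # 1.0: 20 is in h* = "multiples of 10"
print(map_generalization(22, [60, 80, 10, 30]))   # 0.0: all-or-none, no gradient
```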
Taking stock

• A model of high-level, knowledge-driven inductive reasoning that makes strong quantitative predictions with minimal free parameters.
  (r² > 0.9 for mean judgments on 180 generalization stimuli, with 3 free numerical parameters)

• Explains qualitatively different patterns of generalization (rules, similarity) as the output of a single general-purpose rational inference engine.
  – A Marr level 1 (Computational theory) explanation of phenomena that have traditionally been treated only at Marr level 2 (Representation and algorithm).
Looking forward
• Can we see these ideas at work in more natural cognitive functions, not just toy problems and games?
  – What differently structured hypothesis spaces, likelihood functions, or priors might be needed?
• Can we move from ‘weak rational analysis’ to ‘strong
rational analysis’ in the priors, as with the likelihood?
– “Weak”: behavior consistent with some reasonable prior.
– “Strong”: behavior consistent with the “correct” prior given the
structure of the world.
• Can we work with more flexible priors, not just restricted to
a small subset of all logically possible concepts?
– Would like to be able to learn any concept, even very complex ones,
given enough data (a non-dogmatic prior).
• Can we describe formally how these hypothesis spaces and
priors are generated by abstract knowledge or theories?
• Can we explain how people learn these rich priors?
Learning more natural concepts
[Figure: three pictures labeled "horse" and three novel objects labeled "tufa" — learning a new word from a few labeled examples.]
Learning rectangle concepts

Weighting different rectangle hypotheses based on the size principle:

$$p(X \mid h) = \left[\frac{1}{\text{size}(h)}\right]^n \text{ if } x_1, \ldots, x_n \in h; \qquad = 0 \text{ if any } x_i \notin h$$
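A sketch of this likelihood for an axis-aligned rectangle hypothesis, taking size(h) to be the rectangle's area (the 2D representation and helper below are illustrative assumptions):

```python
# Sketch of the size-principle likelihood for an axis-aligned rectangle
# hypothesis in a 2D feature space; size(h) is the rectangle's area.

def rectangle_likelihood(X, rect):
    """p(X|h) = (1/size(h))^n if every example lies inside rectangle h, else 0.
    X is a list of (x, y) points; rect is (x_min, x_max, y_min, y_max)."""
    x_min, x_max, y_min, y_max = rect
    if any(not (x_min <= x <= x_max and y_min <= y <= y_max) for x, y in X):
        return 0.0
    area = (x_max - x_min) * (y_max - y_min)
    return (1.0 / area) ** len(X)

# A tight rectangle that still covers the data is weighted far more heavily
# than a loose one -- the "suspicious coincidence" behind the size principle.
X = [(0.4, 0.4), (0.5, 0.6), (0.6, 0.5)]
print(rectangle_likelihood(X, (0.3, 0.7, 0.3, 0.7)))   # small consistent rectangle
print(rectangle_likelihood(X, (0.0, 1.0, 0.0, 1.0)))   # large consistent rectangle
```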
Generalization gradients

[Figure: generalization gradients for rectangle concepts under the full Bayesian model, the subset principle (MAP Bayes), and Bayes without the size principle (0/1 likelihoods).]
Modeling word learning (Xu & Tenenbaum, 2007)

[Figures: children's generalizations of novel words compared with Bayesian concept learning over a tree-structured hypothesis space.]
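A minimal sketch of the same machinery over a tree-structured hypothesis space, where each taxonomy node is a hypothesis whose extension is the set of objects below it; the toy taxonomy and uniform prior are invented for illustration and are not Xu & Tenenbaum's stimuli:

```python
# Minimal sketch of Bayesian word learning over a tree-structured hypothesis
# space: each taxonomy node is a hypothesis whose extension is the set of
# objects below it. The toy taxonomy and uniform prior are assumptions.

taxonomy = {
    "animal":    {"dalmatian", "terrier", "tabby", "siamese", "toucan"},
    "dog":       {"dalmatian", "terrier"},
    "cat":       {"tabby", "siamese"},
    "dalmatian": {"dalmatian"},
}
prior = {node: 1 / len(taxonomy) for node in taxonomy}

def posterior(examples):
    """Size-principle posterior over taxonomy nodes given labeled examples."""
    scores = {}
    for node, ext in taxonomy.items():
        if all(e in ext for e in examples):
            scores[node] = prior[node] * (1 / len(ext)) ** len(examples)
        else:
            scores[node] = 0.0
    Z = sum(scores.values())
    return {node: s / Z for node, s in scores.items()}

# One "dalmatian" example leaves the subordinate, basic, and superordinate
# nodes all in play; three dalmatian examples are a suspicious coincidence
# that concentrates the posterior on the subordinate node.
print(posterior(["dalmatian"]))
print(posterior(["dalmatian"] * 3))
```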
Exploring different models

• Priors and likelihoods so far were derived from simple assumptions. What about more complex cases?

• Different likelihoods?
  – Suppose the examples are sampled by a different process, such as active learning or active pedagogy.

• Different priors?
  – More complex, language-like hypothesis spaces, allowing exceptions, compound concepts, and much more…
Another word learning game

Hypothesis space: h1 (the basic-level shape concept) and h2 (a smaller hypothesis, with relative size 1/3 in the likelihoods below)
P(h): ~5-to-1 prior in favor of the basic-level shape hypothesis (h1)

Teacher-driven condition (3 random examples):
  P(d|h1) ~ 1/(1/1)^3 = 1
  P(d|h2) ~ 1/(1/3)^3 = 27

Active learning condition (1 random example, 2 non-random (weak) samples):
  P(d|h1) ~ 1/(1/1)^1 · k = k
  P(d|h2) ~ 1/(1/3)^1 · k = 3k
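Combining these likelihoods with the ~5-to-1 prior gives the posterior odds in each condition; a worked version of the numbers above, taking the stated relative sizes at face value:

$$\frac{P(h_2 \mid d)}{P(h_1 \mid d)} = \frac{P(d \mid h_2)}{P(d \mid h_1)} \cdot \frac{P(h_2)}{P(h_1)}$$

Teacher-driven: $\frac{27}{1} \cdot \frac{1}{5} = \frac{27}{5} \approx 5.4$, so the posterior favors the smaller hypothesis $h_2$.

Active learning: $\frac{3k}{k} \cdot \frac{1}{5} = \frac{3}{5}$, i.e. odds of roughly $5/3$ in favor of the basic-level hypothesis $h_1$.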
Looking forward (from Basic Bayes)
• Can we see these ideas at work in more natural cognitive functions, not just toy problems and games?
  – What differently structured hypothesis spaces, likelihood functions, or priors might be needed?
• Can we move from ‘weak rational analysis’ to ‘strong
rational analysis’ in the priors, as with the likelihood?
– “Weak”: behavior consistent with some reasonable prior.
– “Strong”: behavior consistent with the “correct” prior given the
structure of the world.
• Can we work with more flexible priors, not just restricted to
a small subset of all logically possible concepts?
– Would like to be able to learn any concept, even very complex ones,
given enough data (a non-dogmatic prior).
• Can we describe formally how these hypothesis spaces and
priors are generated by abstract knowledge or theories?
• Can we explain how people learn these rich priors?
The need for more flexible priors
• Suppose you see these examples in the number game:
– 60, 22, 80, 10, 25, 30, 24, 90, 27, 70, 26, 29
– 60, 80, 10, 30, 40, 20, 90, 80, 60, 40, 10, 20, 80, 30, 90, 60,
40, 30, 60, 80, 20, 90, 10, 30, 40, 90, 10, 60, 20, 80, 30
– 60, 80, 10, 30, 52
– 60, 80, 10, 30, 55
– 60, 80, 10, 30, 55, 40, 20, 80, 50, 10
Can we explain these preferences from different sets of examples in the number game?

– 60, 80, 10, 30
  • Why "multiples of 10"? (vs. "multiples of 10 except 50 and 70"?)

– 60, 80, 10, 30, 40, 20, 90, 80, 60, 40, 10, 20, 80, 30, 90, 60, 40, 30, 60, 80, 20, 90, 10, 30, 40, 90, 10, 60, 20, 80, 30
  • Why "multiples of 10 except 50 and 70" more likely here?

– 60, 80, 10, 30, 52
  • Why "multiples of 10, plus 52"? (vs. "even numbers"?)

– 60, 80, 10, 30, 55
  • Why "multiples of 5"? (vs. "multiples of 10, plus 55"?)

– 60, 80, 10, 30, 55, 40, 20, 80, 50, 10
  • Why "multiples of 10, plus 55" more likely here?
Constructing more flexible priors: A language for thought?

• Start with a base set of regularities R and combination operators C. Hypothesis space = closure of R under C.
  – C = {and, or}: H = unions and intersections of regularities in R (e.g., "multiples of 10 between 30 and 70").

• Start with a base set of regularities R and allow singleton (+ or –) exceptions.
  – e.g., "multiples of 10 except 50", "multiples of 10 except 50 and 70", "multiples of 10 and also 52".

• Defining a prior:
  – The Bayesian Occam's Razor:
    • Model classes defined by number of combinations.
    • More combinations → more hypotheses → lower prior
Prior: p(h)

[Diagram: "All hypotheses" is divided into classes, each receiving probability mass ∝ c:
  – simple math properties: ~20 hypotheses, each with prior ∝ c/20
  – math properties with 1 exception: ~20 × 100 hypotheses, each with prior ∝ c/2000
  – math properties with 2 exceptions: ~20 × 100² hypotheses, each with prior ∝ c/200000
  – …]

Or, a fancier version for + and – exceptions…
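A sketch of this prior-construction scheme, using the class sizes from the diagram (~20 base regularities, ~100 candidate exceptions per slot) and giving each class an equal share c of probability mass:

```python
# Sketch of the exception-based prior: hypotheses are grouped into classes by
# their number of exceptions, each class gets an equal share c of probability
# mass, and that share is split evenly within the class. Class sizes follow
# the diagram above (~20 base regularities, ~100 candidate exceptions per slot).

N_BASE = 20          # simple math properties
N_EXCEPTIONS = 100   # candidate exception numbers per exception slot
MAX_EXCEPTIONS = 2
c = 1 / (MAX_EXCEPTIONS + 1)   # equal mass per class (0, 1, or 2 exceptions)

def prior(num_exceptions):
    """p(h) for one base regularity plus the given number of singleton exceptions."""
    class_size = N_BASE * N_EXCEPTIONS ** num_exceptions
    return c / class_size

print(prior(0))   # ~ c/20
print(prior(1))   # ~ c/2000    (about 100x lower)
print(prior(2))   # ~ c/200000  (another 100x lower)
```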
Keeping in mind that each exception makes the prior ~100 times lower, can we now explain these sets of examples in the number game?

– 60, 80, 10, 30
  • Why "multiples of 10"? (vs. "multiples of 10 except 50 and 70"?)

– 60, 80, 10, 30, 40, 20, 90, 80, 60, 40, 10, 20, 80, 30, 90, 60, 40, 30, 60, 80, 20, 90, 10, 30, 40, 90, 10, 60, 20, 80, 30
  • Why "multiples of 10 except 50 and 70" more likely here?

– 60, 80, 10, 30, 52
  • Why "multiples of 10, plus 52"? (vs. "even numbers"?)

– 60, 80, 10, 30, 55
  • Why "multiples of 5"? (vs. "multiples of 10, plus 55"?)

– 60, 80, 10, 30, 55, 40, 20, 80, 50, 10
  • Why "multiples of 10, plus 55" more likely here?
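A worked check of two of these cases, assuming numbers 1–100, strong sampling, and the ~100× prior penalty per exception (the hypothesis sizes below are counted under that 1–100 assumption):

```python
# Worked check of two of the cases above, assuming numbers 1-100, strong
# sampling, and the ~100x prior penalty per exception. Under that assumption:
# "multiples of 10, plus 52" (or "plus 55") has 11 members, "even numbers"
# has 50, and "multiples of 5" has 20.

def size_likelihood(n_examples, size):
    return (1 / size) ** n_examples

# {60, 80, 10, 30, 52}: "multiples of 10, plus 52" (11) vs. "even numbers" (50)
like_ratio = size_likelihood(5, 11) / size_likelihood(5, 50)   # ~1940 for the exception h
print(like_ratio * (1 / 100))   # ~19: the exception hypothesis wins despite its lower prior

# {60, 80, 10, 30, 55}: "multiples of 10, plus 55" (11) vs. "multiples of 5" (20)
like_ratio = size_likelihood(5, 11) / size_likelihood(5, 20)   # ~20 for the exception h
print(like_ratio * (1 / 100))   # ~0.2: not enough; "multiples of 5" wins

# ...but with ten examples, all consistent with "multiples of 10, plus 55":
like_ratio = size_likelihood(10, 11) / size_likelihood(10, 20)  # ~400
print(like_ratio * (1 / 100))   # ~4: the exception hypothesis wins again
```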
Bayesian Occam’s Razor
Probabilities provide a common currency for
balancing model complexity with fit to the
data.
Constructing more flexible priors: A language for thought?

• Functional composition: Let x be a stream of natural numbers {1, 2, 3, …}

  (* 3 x): multiples of 3
  (* 2 x): even numbers
  (+ 1 (* 2 x)): odd numbers
  (power_of 2 x): powers of 2
  (power_to 2 x): square numbers
  (<= 10 x): numbers less than 10
  (= 80 x): exactly 80
  (!= 70 x): numbers except 70
  (<= 30 (>= 20 x)): numbers between 20 and 30
  (<= 15 (>= 10 x)): numbers between 10 and 15
  (<= 30 (>= 20 (* 2 x))): even numbers between 20 and 30
  (>= 20 (* 2 x)): even numbers greater than 20
  (power_of 2 (+ 1 (* 2 x))): odd-numbered powers of 2
  (<= 50 (* 10 x)): multiples of 10 up to 50
  (+ 3 (* 4 x)): multiples of 4, shifted by 3 (7, 11, 15, ...)
  (!= 50 (!= 70 (* 10 x))): multiples of 10 except 50 and 70

Could this approach be extended to all the learnable concepts? If not, what else is needed? A possible project topic.
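One way to prototype such a language is to treat each expression as an extension over the numbers 1–100 and build concepts by composing a few combinators; a minimal Python sketch mirroring some of the s-expressions above (the combinator names and encoding are illustrative assumptions, not a fixed specification):

```python
# Minimal sketch of a compositional concept language over the numbers 1-100,
# mirroring a few of the s-expressions above. Each combinator returns a
# concept's extension as a set; the names and encoding are illustrative.

NUMBERS = set(range(1, 101))

def times(k):            # (* k x): multiples of k
    return {n for n in NUMBERS if n % k == 0}

def plus(c, s):          # (+ c ...): shift every element by c
    return {n + c for n in s} & NUMBERS

def at_most(m, s):       # (<= m ...): keep elements <= m
    return {n for n in s if n <= m}

def at_least(m, s):      # (>= m ...): keep elements >= m
    return {n for n in s if n >= m}

def excluding(m, s):     # (!= m ...): remove a single exception
    return s - {m}

print(sorted(plus(1, times(2))))                        # odd numbers
print(sorted(at_most(30, at_least(20, times(2)))))      # even numbers between 20 and 30
print(sorted(excluding(50, excluding(70, times(10)))))  # multiples of 10 except 50 and 70
print(sorted(plus(3, times(4))))                        # 7, 11, 15, ... (multiples of 4, shifted by 3)
```

A prior over expressions could then penalize longer compositions, in the spirit of the Bayesian Occam's razor above.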