Active Inductive Inference in Children and Adults

Cognition — Original Articles
Keywords: Hypothesis generation; Active learning; Inductive inference; Developmental change; Concept learning; Program induction

Abstract: A defining aspect of being human is an ability to reason about the world by generating and adapting ideas and hypotheses. Here we explore how this ability develops by comparing children's and adults' active search and explicit hypothesis generation patterns in a task that mimics the open-ended process of scientific induction. In our experiment, 54 children (aged 8.97 ± 1.11) and 50 adults performed inductive inferences about a series of causal rules through active testing. Children were more elaborate in their testing behavior and generated substantially more complex guesses about the hidden rules. We take a 'computational constructivist' perspective to explaining these patterns, arguing that these inferences are driven by a combination of thinking (generating and modifying symbolic concepts) and exploring (discovering and investigating patterns in the physical world). We show how this framework and rich new dataset speak to questions about developmental differences in hypothesis generation, active learning and inductive generalization. In particular, we find children's learning is driven by less fine-tuned construction mechanisms than adults', resulting in a greater diversity of ideas but less reliable discovery of simple explanations.
"We think we understand the rules when we become adults but what we really experience is a narrowing of the imagination." —David Lynch

A central question in the study of both human development and reasoning is how learners come up with the ideas and hypotheses they use to explain the world around them. Children excel at forming new categories, concepts, and causal theories (Carey, 2009) and by maturity, this coalesces into a capacity for intelligent thought characterized by its domain generality and occasional moments of insight and innovation. Constructivism is an influential perspective in developmental psychology (Carey, 2009; Piaget, 2013; Xu, 2019) and philosophy of science (Fedyk & Xu, 2018; Phillips, 1995; Quine, 1969) that posits learners actively construct new ideas through a mixture of thinking – recombining and modifying ideas – and play—exploring and discovering patterns in the world (Bruner, Jolly, & Sylva, 1976; Piaget & Valsiner, 1930; Xu, 2019). While the tenets and promise of constructivist accounts are appealing, the perspective has historically lacked the formalization needed to distinguish it from alternative accounts of learning, limiting its testable predictions and detailed insights into cognition. We draw on recent methodological advances to formalize key aspects of constructivism and use these to analyze children's and adults' behavior in an open-ended inductive learning task. We show that a virtue of the constructivist account is that it captures the wide range of ideas and testing behaviors we observe, particularly in children. We use our account to examine developmental differences in hypothesis generation and active learning. To foreshadow, we show children's hypothesis generation and active learning are driven by less fine-tuned construction mechanisms than adults', resulting in a greater diversity of ideas but less reliable discovery of simple explanations and less systematic coverage of the data space.

Concept learning

Classic work in experimental psychology suggests symbol manipulation is required for humanlike reasoning and problem solving (Bruner, Goodnow, & Austin, 1956; Johnson-Laird, 1983; Wason, 1968). However, classic symbolic accounts struggled to explain how discrete representations could be learned or effectively applied to reasoning under uncertainty (Oaksford & Chater, 2007; Posner & Keele, 1968).
∗ Corresponding author. E-mail address: [email protected] (N.R. Bramley).
1 Developmental data was collected under IRB protocol (Ref No: 2019-10-12687). Adult data was collected under ethical approval granted by the Edinburgh University Psychology Research Ethics Committee (Ref No: 3231819/1). Supplementary material including all data and code is available at https://github.com/bramleyccslab/computational_constructivism. This study was not preregistered. Thanks to Gwyneth Heuser for developmental data collection. Thanks to Jan-Philipp Fränken for help with coding free text responses. This research was supported by an EPSRC New Investigator Grant (EP/T033967/1) to N.R. Bramley and an NSF Award SMA-1640816 to F. Xu.
https://doi.org/10.1016/j.cognition.2023.105471
Received 29 October 2021; Received in revised form 27 February 2023; Accepted 24 April 2023
Available online 24 May 2023
0010-0277/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
N.R. Bramley and F. Xu Cognition 238 (2023) 105471
Meanwhile, statistical accounts of concept learning have flourished by treating concepts as driven by "family resemblance" within a feature space—for instance, centered around a prototypical example or set of exemplars (Kruschke, 1992; Love, Medin, & Gureckis, 2004; Medin & Schaffer, 1978; Shepard & Chang, 1963). Such accounts help explain how people assign category membership fuzzily, and generalize effectively to novel stimuli (Shepard, 1987) but lack a core representation capable of capturing how people construct conceptual novelty (Komatsu, 1992).

Bayesian approaches have also played a major role in the study of concept learning, providing a principled way of modeling probabilistic inference over both sub-symbolic and symbolic hypothesis spaces (Howson & Urbach, 2006). On the symbolic side this includes inferences about particular causal structures (Bramley, Lagnado, & Speekenbrink, 2015; Coenen, Rehder, & Gureckis, 2015; Gopnik et al., 2004; Steyvers, Tenenbaum, Wagenmakers, & Blum, 2003) as well as more general causal theories (Goodman, Ullman, & Tenenbaum, 2011; Griffiths & Tenenbaum, 2009; Kemp & Tenenbaum, 2009; Lucas & Griffiths, 2010). Alongside Bayesian analyses, information theory has also featured frequently as a metric of idealized evidence acquisition (Gureckis & Markant, 2012), including choice of interventions and experiments that reveal causal structure (Bramley, Dayan & Griffiths, 2017; Bramley et al., 2015; Coenen et al., 2015; Steyvers et al., 2003). However, since idealized Bayesian and information theoretic accounts describe learning within a predefined hypothesis space, they do not directly explain how a learner explores or generates possibilities within an infinite latent space. That is, probabilistic accounts of induction are generally cast at Marr's computational level (Marr, 1982), showing people behave roughly as if they consider and average exhaustively over what is really an unbounded space of possible concepts. Thus, while these accounts provide a jumping off point for rational analysis of cognition, we should take their limitations seriously when seeking to reverse engineer humanlike inductive inference (Simon, 2013; Van Rooij, Blokpoel, Kwisthout, & Wareham, 2019).

The goal of this paper is to examine children's and adults' inductive learning in a rich open-ended task where the space of potential hypotheses and behaviors is effectively unbounded. In doing this, we will treat constructivism as a form of rational process framework (Lieder & Griffiths, 2020), capturing how people are shaped by Bayesian and information-theoretic norms but also why they diverge from and fall short of them outside of constrained scenarios. To do this, we focus on recent work in cognitive science that has attempted to marry symbolic and statistical perspectives. This work characterizes computational principles driving both human development and intelligence as resting on a capacity to flexibly generate, adapt, combine and repurpose symbolic representations when learning and reasoning, but crucially to do so in ways that approximate probabilistic principles of inference under uncertainty (Bramley, Dayan et al., 2017; Goodman, Tenenbaum, Feldman, & Griffiths, 2008; Piantadosi, 2021; Piantadosi, Tenenbaum, & Goodman, 2016).

Constructivism

Fundamentally, we take the constructivist account to depart from computational-level Bayesian accounts because it presumes representational incompleteness, and consequently stochasticity and path dependence in a given individual's learning trajectory. By this, we mean that the constructivist learner has not, and normally could not, consider and weigh all the possibilities in play when learning. Instead, they must have some mechanism for generating and comparing finite numbers of discrete possibilities (Sanborn & Chater, 2016; Stewart, Chater, & Brown, 2006). Eponymously, the construction mechanism needs to be capable of recursive construction: composing and recomposing symbolic elements so as to achieve the systematicity and productivity required for a finite system to cover an infinite space of ideas (Piantadosi & Jacobs, 2016). In this way, constructivist views treat algorithmic-level cognition as necessarily symbolic and at least somewhat language-like (Fodor, 1975) in its ability to make "infinite use of finite means" (von Humboldt, 1863/1988).

For example, a constructivist learner might stochastically combine elements from an underlying concept grammar to produce new ideas that can be tested against evidence. Alternatively, they might use their grammar to describe patterns in evidence or to adapt a previous hypothesis to fit some new evidence (Bonawitz, Denison, Gopnik, & Griffiths, 2014; Lewis, Perez, & Tenenbaum, 2014; Nosofsky & Palmeri, 1998; Nosofsky, Palmeri, & McKinley, 1994). Outside of narrow experimental settings, this modal incompleteness seems completely normal. A simple illustration is the gap between ease of evaluation versus generation of hypotheses (Gettys & Fisher, 1979). We can typically generate fewer explanations on the fly – i.e., reasons why our car will not start – than we would endorse if a list was presented to us. We would likely come up with more as we looked under the hood than we would sitting in the car thinking. Inference about any area of active scientific inquiry, like that reported in this journal, typically involves an enormous latent space of potential explanatory theories, only a fraction of which have ever been articulated or tested and many of which were discovered only serendipitously (Shackle, 2015). It is generally accepted that the ground truth is unlikely to be among the set of theories already on the table (Box, 1976) and that challenging results are as likely to lead to theory modification as complete abandonment (Lakatos, 1976).

The constructivist perspective thus departs from a Bayesian analysis by emphasizing that induction is as much about constructing candidate possibilities as optimizing within a set of candidates. This reframing demystifies a number of behavioral patterns that look like biases from the computational-level perspective. These include anchoring, order effects, probability matching and confirmation bias. For example, anchoring is a natural consequence of generating new hypotheses by making local adjustments to an earlier hypothesis or from a salient starting point such as a number mentioned in a prompt (Griffiths, Lieder, & Goodman, 2015; Lieder, Griffiths, Huys, & Goodman, 2018). Order effects, where the sequence of evidence encountered affects the final belief, are pervasive in human learning. If new hypotheses are arrived at through a limited local search starting from a previous hypothesis then we should expect path dependence and auto-correlation between a single learner's hypotheses over time (Bramley, Dayan et al., 2017; Dasgupta, Schulz, & Gershman, 2016; Fränken, Theodoropoulos, & Bramley, 2022; Thaker, Tenenbaum, & Gershman, 2017; Zhao, Lucas & Bramley, 2022). Probability matching is also natural under a constructivist perspective. In experiments, participants often choose options in proportion to their probability of being correct or optimal rather than reliably selecting the best action, as we might expect if they had the full posterior to hand (Shanks, Tunney, & McCarthy, 2002). However, it can be shown that rather than being a choice pathology, probability matching may be better seen as a best case scenario for a learner limited to using the endpoint of a local search as their guess (Bramley, Dayan et al., 2017). It has been argued that in a variety of plausible everyday settings, a single-sample-based decision can be the appropriate computation–accuracy tradeoff for a resource-limited learner (Vul, Goodman, Griffiths, & Tenenbaum, 2009). Confirmation bias is also pervasive in human reasoning and active learning (Klayman & Ha, 1989) and hard to explain in purely Bayesian terms. Wason (1960) famously asked participants to test and identify a hidden rule and initially simply told them that the sequence 2–4–6 followed the rule. The intended true rule was simply "ascending numbers" but participants frequently guessed more complex rules such as "numbers increasing by two". Analysis of participants' tests revealed that they frequently generated tests that would be rule-following under their hypothesis (such as 6–8–12), so failing to adequately challenge and disconfirm this hypothesis. On a constructivist perspective, learners can only base their exploration on testing hypotheses they have actually generated (or else behave randomly). To the extent that certain simpler hypotheses like "ascending numbers" were less likely to be generated on the basis of
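The claim that probability matching can emerge from reporting the endpoint of a local search can be conveyed with a small simulation. This is our own illustrative sketch, not the model of Bramley, Dayan et al. (2017): we assume a toy posterior over four hypothetical hypotheses and a Metropolis-style search that proposes alternatives at random and accepts them by posterior ratio.

```python
import random
from collections import Counter

# Assumed toy posterior over four hypothetical hypotheses.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.2, "h4": 0.1}
hyps = list(posterior)

def search_endpoint(steps=20, rng=random):
    """Metropolis-style search: propose an alternative hypothesis at
    random and accept it with probability min(1, p_new / p_current)."""
    current = rng.choice(hyps)
    for _ in range(steps):
        proposal = rng.choice(hyps)
        if rng.random() < min(1.0, posterior[proposal] / posterior[current]):
            current = proposal
    return current

random.seed(0)
n = 5000
counts = Counter(search_endpoint() for _ in range(n))
freqs = {h: counts[h] / n for h in hyps}
# Endpoint frequencies approximately match the posterior ("probability
# matching") rather than always returning the best hypothesis h1.
print(freqs)
```

Because the acceptance rule leaves the posterior invariant, the distribution over search endpoints approximates the posterior itself, so a learner who answers with wherever their search stopped will match probabilities rather than maximize.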
the provided example (cf. Oaksford & Chater, 1994; Tenenbaum, 1999), it is not surprising that participants failed to actively exclude these possibilities with their tests.

In the computational cognitive science literature, recent symbolic search ideas manifest under the label of "learning as program induction". Such models have begun to be applied to synthesizing humanlike problem solving and planning and tool use (Allen, Smith, & Tenenbaum, 2020; Ellis et al., 2020; Lai & Gershman, 2021; Lake, Ullman, Tenenbaum, & Gershman, 2017; Ruis, Andreas, Baroni, Bouchacourt, & Lake, 2020; Rule, Schulz, Piantadosi, & Tenenbaum, 2018). We will draw on these in examining children's and adults' hypothesis generation.

Constructivism in development

The "child as scientist" (Carey, 1985; Gopnik, 1996)—or, more recently, "child as hacker" (Rule, Tenenbaum, & Piantadosi, 2020)—perspective casts children's cognition as driven by broadly the same inductive processes as adults' but at an earlier stage in a journey of construction and discovery.

While children have been shown to be capable active learners (McCormack, Bramley, Frosch, Patrick, & Lagnado, 2016; Meng, Bramley, & Xu, 2018; Sobel & Kushnir, 2006) there is also evidence that children's ability to learn effectively from active learning data is more fragile than adults'. For example, children's play can look repetitive and inefficient when held to information theoretic norms (Lapidow & Walker, 2020; McCormack et al., 2016; Meng et al., 2018; Sim & Xu, 2017). Sobel and Kushnir (2006) also found children were much less accurate at causal structure identification in "yoked" conditions – where they had to use evidence generated by someone else to learn – while adults are less affected, sometimes able to learn about as well from others' data as their own (Lagnado & Sloman, 2006). This performance gap has been argued to stem from the mismatch between whatever idiosyncratic hypotheses are under consideration by the observer and those being tested by the active learner, making the yoked learner less able to use the data to progress their theories (Fränken et al., 2022; Markant & Gureckis, 2014). Relatedly, children have been argued to be more narrowly focused toward testing a single hypothesis at a time (Bramley, Jones, Gureckis, & Ruggeri, 2022; Ruggeri & Lombrozo, 2014; Ruggeri, Lombrozo, Griffiths, & Xu, 2016). This might reflect a less developed working memory, restricting the number of hypotheses children can keep track of and compare to evidence. An early emphasis on exploration has also been argued to be an effective solution to a lifelong explore–exploit tradeoff, since earlier discoveries can be exploited for longer (Gopnik, 2020). Program induction also provides a potential explanation for transitions between developmental "stages", characterized by occasional leaps forward in insight. For instance, Piantadosi, Tenenbaum, and Goodman (2012) demonstrate how a program induction model can reproduce a characteristic developmental transition from grasping a few small numbers to discovering a recursive concept of natural number. We note that an important part of constructivism is the idea that we cache the useful concepts we invent (cf. Zhao, Bramley & Lucas, 2022), meaning our conceptual library grows as we do, becoming richer and more powerful for solving the tasks we repeatedly face. We do not attempt to model this important aspect of constructivism in this paper but return to it in the General Discussion.

Differences between childlike and adultlike inductive inference might also be captured by parameterizable differences in search, potentially reflecting principles of stochastic optimization (Lucas, Bridgers, Griffiths, & Gopnik, 2014). For instance, young children have been found to be quick to make broad abductive generalizations from a small number of examples—e.g. readily imputing novel physical laws to explain surprising evidence (Schulz, Goodman, Tenenbaum, & Jenkins, 2008). Building on this finding, children's hypothesis generation and search has been framed as rationally "higher temperature" than adults'—producing more diversity of ideas at the cost of being noisier (Lucas et al., 2014). This is algorithmically sensible, as optimization over high dimensional spaces is known to be more effective when proposals are initially large leaps and decrease over time, as in simulated annealing (Van Laarhoven & Aarts, 1987). However, a high diversity of guesses might also reflect that children have a rationally flatter latent prior than adults, inherently entertaining a wider range of hypotheses at the cost of entertaining high probability ones less frequently. A third possibility is that children's hypothesis generation might be driven more by bottom-up processing than adults'. With less established expectations, or less powerful primitive concepts to work with, children's hypotheses might more directly describe encountered patterns, while adults might rely more on their existing knowledge hierarchy to constrain hypothesis generation in a top-down way (Clark, 2012). We will contrast children's and adults' hypothesis generation and active learning in a rich task setting that allows us to closely investigate these ideas.

Task

In order to study inductive learning, we use a rich open-ended task that extends on Wason (1960) and the logical rule-induction tasks studied by Nosofsky et al. (1994), Lewis et al. (2014), Goodman et al. (2008), and Piantadosi et al. (2016). Akin to the blicket-detector paradigm in developmental causal cognition (Gopnik et al., 2004; Lucas et al., 2014), our task has a causal framing, probing inductive inferences about what conditions make an effect occur in a minimally contextualized domain. However, departing from blicket-detector tasks, we include a large and physically rich set of features that learners can draw on in their inferences, allowing test scenes to vary in the number, nature and arrangement of objects. Our task is inspired by a tabletop game of scientific induction called "Zendo" (Heath, 2004) and builds on a pilot task examined in Bramley, Rothe, Tenenbaum, Xu, and Gureckis (2018). In it, learners both observe and create scenes, which are arrangements of 2D triangular objects called cones (Fig. 1), and test them to see if they produce a causal effect (which arrangements of blocks "make stars come out" in our minimal framing). The goal is to both predict which of a set of new scenes will produce the effect and describe the hidden rule that determines the general set of circumstances that produce the effect. Scenes could contain between 1 and 9 cones. Each cone has two immutable properties: size ∈ {small, medium, large} and color ∈ {red, green, blue}, and continuous scene-specific positions x ∈ (0, 8) and y ∈ (0, 6) and orientations ∈ (0, 2𝜋). In addition to cones' individual properties, scenes also admit many relational properties arising from the relative features and arrangement of different cones. For instance, subsets of cones might share a feature value (i.e., be the same color, or have the same orientation) or be ordered on another (i.e., be larger than, or above) and pairs of cones might have relational properties like pointing at one another or touching. This results in an extremely rich implicit space of potential concepts.

We note that, by design, the dimensionality of this task makes it extremely difficult. As with Wason's 2-4-6 example, and genuine questions of scientific induction, the hard part of this task is not evaluating whether a candidate hypothesis can explain the data but rather generating the right hypothesis in the first place. As with the 2-4-6 task, there are always infinite data-consistent possibilities and while the bulk of these may be outlandishly complex, many others may still be simpler or more salient than the ground truth. Without carefully gathered evidence with broad coverage of the space of possible scenes, a learner will frequently be unable to rule out simpler possibilities that more parsimoniously capture the data than the ground truth, essentially being left with evidence that would not lead even an unbounded Bayesian agent to the correct answer.2

2 In tabletop game form, Zendo typically takes dozens of rounds of tests and incorrect guesses by multiple guessers, as well as leading examples and clues from the rule-setter, for even simple hidden rules to be identified.
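For concreteness, the scene and hypothesis representations just described can be sketched in a few lines. This is an illustrative reconstruction, not the experiment's actual code; the class and function names are our own.

```python
import math
from dataclasses import dataclass

@dataclass
class Cone:
    size: str     # one of {"small", "medium", "large"}
    color: str    # one of {"red", "green", "blue"}
    x: float      # position, 0 < x < 8
    y: float      # position, 0 < y < 6
    angle: float  # orientation in radians, 0 <= angle < 2*pi

# A hypothesis is simply a predicate over scenes (lists of cones).
def there_is_a_red(scene):
    return any(c.color == "red" for c in scene)

def all_same_color(scene):
    return len({c.color for c in scene}) <= 1

scene = [Cone("small", "red", 1.0, 2.0, 0.0),
         Cone("large", "blue", 4.0, 3.0, math.pi / 2)]

print(there_is_a_red(scene))  # True: the first cone is red
print(all_same_color(scene))  # False: red and blue both present
```

Relational properties (touching, pointing at, being larger than) would likewise be predicates over pairs or subsets of cones, which is what makes the implicit concept space so large.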
Fig. 1. The experimental task: (a) Active learning phase. (b) An example sequence of 8 tests, the first is provided to all participants, and subsequent tests are constructed by
the learner using the interface in (a). Yellow stars indicate those that follow the hidden rule. (c) Generalization phase: Participants select which of a set of new scenes are rule
following by clicking on them. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
these construction weights to become tuned so as to favor certain elements or features that have proven useful in the past. A uniform-weighted PCFG hypothesis generator will thus tend to produce greater diversity than a more fine-tuned one. As such, it embodies the idea that more elaborately or implausibly structured, or "weird", concepts will come to the minds of children than adults.

What PCFG approaches have in common is a generative mechanism for sampling from an infinite latent prior, here over possible logical concepts. However, sampled "guesses" must also be tested against data. Unfortunately, in our task – and perhaps even more so outside of it – the vast majority of a priori generated concepts are likely to be inconsistent with whatever evidence a learner has already encountered.3 For this reason, the procedure is astronomically inefficient, requiring very large numbers of samples in order to reliably generate non-trivial rules. One can also use a PCFG to adapt existing hypotheses, for instance using a Markov Chain Monte Carlo scheme in which parts of a hypothesis are regrown and accepted according to their fit to evidence (cf. Fränken et al., 2022; Goodman et al., 2008). While we think this approach is promising we do not model it here, and simply return to it in the general discussion. However, we do additionally consider an alternative to the PCFG that provides a more sample efficient and, on the face of it, more cognitively plausible mechanism for initializing new hypotheses.

Context-based hypothesis generation

Instance Driven Generation (IDG) (Bramley et al., 2018) is a recent proposal related to the PCFG framework but with a key difference. Rather than generating initial hypotheses prior to, or blind to, the current evidence, the IDG generates ideas inspired by encountered patterns (cf. Michalski, 1969), thus incorporating bottom-up reactivity to evidence into its conceptualization process. Each IDG hypothesis starts with an observation of features of one or several objects in a scene and uses these to back out a true logical statement about the scene in a stochastic but truth-preserving way. If the scene is rule following, this statement constitutes a positive hypothesis about the hidden rule. Otherwise, it constitutes a negative hypothesis, i.e. about what must not be present. Thus, an IDG does not begin each learning problem with a prior over all possible concepts, but rather draws its initial ideas from a restricted space consistent with the extant patterns in a focal observation. Fig. 2b illustrates this approach. While a regular PCFG effectively starts at the top level (i.e. outermost nesting) of a compound concept and works downward and inward, the IDG starts from the central content (drawn from its observation) and works upward and outward to a quantified statement, ensuring at each step that the statement is true of the scene. The result is a mechanism that uses its concept grammar to describe features and patterns in evidence. This means that the IDG does not entertain hypotheses that are possible but never exemplified by a scene. For example, "at most five reds" would only be generated if a learner actually saw a rule-following scene containing five reds. A key prediction of the IDG is an interaction between the scenes generated by the participant and the hypotheses these subsequently inspire, with simpler scenes, embodying fewer extraneous or coincidental patterns, being more likely to inspire the learner to generate the true concepts.

Hypothesis-driven scene generation

Uncertainty-driven learning
Normatively, test scenes should serve to minimize expected uncertainty across the full hypothesis space. A direct way to approximate this here is to start with a prior sample of hypotheses (e.g. drawn context-free) and progressively create scenes that serve to minimize expected uncertainty over this sample by forking their predictions (Bramley et al., 2022; Nelson, Divjak, Gudmundsdottir, Martignon, & Meder, 2014). We visualize this in Fig. 3a, imagining three labeled scenes 𝑑1 … 𝑑3 that progressively divide a prior sample of hypotheses (ℎs) until a most-likely candidate emerges. The constructivist setting presents a challenge for this norm since the hypothesis space is latent and is initially unexplored.

Exploration-driven learning
An alternative, hypothesis-free, approach might be to explore the data space directly, for instance generating scenes that vary in the number and nature of objects they contain in the hope of naturally uncovering concept boundaries and inspiring hypothesis generation. We sketch this in Fig. 3b. Efficient uncertainty-driven and exploration-driven learning both predict generation of scenes that differ substantially from one another, ideally being anti-correlated so as to cover the space efficiently (Osborne et al., 2012). However this does not seem well matched to constructivism, where we rather think of the learner as entertaining a small but not completely empty set of possibilities and hence unable to capitalize on such diverse evidence.

A constructivist way to think of active learning is as acting in ways that challenge one's current hypotheses and so facilitate their refinement or the construction of better alternatives. We sketch two such approaches: Confirmatory testing and Sequential Contrastive testing.

Confirmatory testing
With a candidate hypothesis in mind, a learner can seek to challenge it through its generalizations (Nickerson, 1998; Popper, 1959). For example, after encountering the scene in row 1 of Table 1, a learner might generate the initial hypothesis that "there must be a small red" (since this describes one of the objects). To confirm this, they might try a positive generalization test, i.e. keep the small red but remove or randomize the other objects and predict the effect will still occur (e.g. 𝑑1 in Fig. 3c). Alternatively they might use it to predict a way to minimally alter 𝑑1 so it no longer produces the effect, removing the small red and keeping the rest (e.g. 𝑑2). So long as the learner gets the outcome they anticipate, they can stick with their hypothesis. When they do not they can either abandon or adapt it. For instance, 𝑑3 in Fig. 3c proves inconsistent with ℎ1, requiring a new hypothesis be generated that can explain why 𝑑1 and 𝑑3 produce the effect but not 𝑑2. A limitation of a one-hypothesis-at-a-time approach is that it is unclear how distinctive the hypothesis's generalization predictions are.4 For example, since the ground truth in this example is just "there is a red", producing new scenes containing small reds will fail to reveal that the redness but not the smallness is causative of the label. Another limitation is that it is unclear what to do when one's hypothesis is ruled out, especially if the disconfirming test scene differs dramatically from the ones with which it is consistent. For this reason, the education literature has long emphasized the utility of a "control of variables" strategy (Chen & Klahr, 1999; Klahr, Fay, & Dunbar, 1993; Klahr, Zimmerman, & Jirout, 2011). This amounts to manipulating exactly one design variable per test, such that any difference in the outcome is straightforwardly attributable to the change in the input, providing a route to adapting one's hypothesis when it fails.

3 In our task, many more are simply tautological (i.e., "All cones are red or not red"), contradictory (i.e., "There is a cone that is red and not red"), or physically impossible ("Two (different) objects have the same position"). Indeed, around 20% of the hypotheses generated by our PCFGs are tautologies, and 15% are contradictions. Many others combine a meaningful hypothesis with a tautological corollary (i.e., "There is a large red object that is larger than all medium sized objects").

4 A general finding is that positive confirmatory tests are valuable to the extent that the outcome of interest is rare, e.g. if most scenes are not rule following. This is not generally the case in this task.
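The flavor of PCFG-based generation can be conveyed with a toy grammar. This is our own simplification for illustration (the grammars actually used are specified in Appendix A): rules are sampled top-down by repeatedly expanding nonterminals according to their weights, so uniform weights readily yield long, nested, "weird" compounds.

```python
import random

# Toy probabilistic context-free grammar over logical rule descriptions
# (an assumed simplification). Each nonterminal maps to a list of
# (expansion, weight) pairs.
GRAMMAR = {
    "RULE":  [(["there is a", "OBJ"], 2.0),
              (["RULE", "and", "RULE"], 1.0),
              (["exactly one", "OBJ"], 1.0)],
    "OBJ":   [(["COLOR", "cone"], 2.0),
              (["SIZE", "COLOR", "cone"], 1.0)],
    "COLOR": [(["red"], 1.0), (["green"], 1.0), (["blue"], 1.0)],
    "SIZE":  [(["small"], 1.0), (["medium"], 1.0), (["large"], 1.0)],
}

def sample_rule(symbol="RULE", depth=0, max_depth=5, rng=random):
    """Expand nonterminals top-down, choosing expansions by weight.
    Beyond max_depth, directly recursive expansions are excluded so
    generation always terminates."""
    if symbol not in GRAMMAR:
        return symbol  # terminal string
    pairs = GRAMMAR[symbol]
    if depth >= max_depth:
        pairs = [p for p in pairs if symbol not in p[0]]
    expansions, weights = zip(*pairs)
    chosen = rng.choices(expansions, weights=weights)[0]
    return " ".join(sample_rule(s, depth + 1, max_depth, rng) for s in chosen)

random.seed(1)
for _ in range(5):
    print(sample_rule())  # e.g. nested conjunctions of object descriptions
```

Tuning the weights (here uniform within each nonterminal, but in principle favoring elements that proved useful before) is exactly the fine-tuning contrasted with children's more uniform generators in the text above; note also that nothing in this sampler consults evidence, which is why so many raw samples must be discarded.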
Fig. 2. (a) Example generation of hypotheses using the PCFG. (b) Examples of IDG hypothesis generation based on an observation of a scene that follows the rule. New additions
on each line are marked in blue. Full details in Appendix A. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this
article.)
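In the same spirit as Fig. 2b, here is a heavily pared-down sketch of the instance-driven idea (our own illustration; the full IDG works over a much richer logic and is truth-preserving at every construction step): the generator reads a feature off an object in an observed rule-following scene and promotes it to a quantified statement, so every hypothesis it produces is true of the inspiring scene by construction.

```python
import random

# Hypothetical pared-down scene representation: each object is a dict of
# discrete features (the real task also has positions and orientations).
scene = [{"size": "small", "color": "red"},
         {"size": "large", "color": "blue"}]

def idg_hypothesis(observed_scene, rng=random):
    """Read one feature off one object in the observed scene and promote
    it to an existential statement about scenes in general."""
    obj = rng.choice(observed_scene)
    feature, value = rng.choice(sorted(obj.items()))
    description = f"there is a {value} cone"
    predicate = lambda s: any(o[feature] == value for o in s)
    assert predicate(observed_scene)  # truth-preserving by construction
    return description, predicate

random.seed(0)
description, predicate = idg_hypothesis(scene)
print(description)  # an existential claim grounded in the observed scene
```

Unlike the context-free sampler, every candidate this produces is consistent with the focal observation, which is the source of the IDG's sample efficiency; the cost is that concepts never exemplified by an observed scene cannot be proposed.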
Fig. 3. Active learning strategies: 𝐻 = latent hypothesis space 𝐷 = data space. Arrows indicate direction of inferences. Stars indicate scenes that followed the rule. (a) Uncertainty-
driven tests over prior sample ℎ ∈ 𝐻. Dotted lines separate hypotheses by outcomes they predict for initial example 𝑒 and self-generated scenes 𝑑1 … 𝑑3 . Shading indicates which
ℎs mis-predict each outcome. (b) Exploration-driven testing. Scenes selected to explore 𝐷 without regard to 𝐻. Outcomes may then inspire hypotheses. (c) Confirmatory testing:
Example 𝑒 inspires hypothesis ℎ1 . Scenes then test its generalization predictions. Colored circles visualize space of scenes for which each hypothesis predicts outcome will be
produced. 𝑑1 and 𝑑2 are correctly predicted as rule following. 𝑑3 is mispredicted by ℎ1 in producing the outcome, leading to a new ℎ2 . (d) Sequential contrastive testing: 𝑒 inspires
ℎ1 and ℎ1 inspires ℎ2 , 𝑑1 contrasts these leading to rejection of ℎ1 . ℎ2 then inspires ℎ3 and 𝑑2 contrasts these, etc. (For interpretation of the references to color in this figure
legend, the reader is referred to the web version of this article.)
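The uncertainty-driven strategy in panel (a) can be sketched as scoring candidate scenes by how evenly a sample of hypotheses splits on their predicted outcome, and testing the scene with the most even split. This is a toy illustration under our own assumptions; the predicate encodings and function names are ours, not the paper's.

```python
# Toy sketch of uncertainty-driven test selection: prefer the candidate scene
# whose predicted label splits a sample of hypotheses closest to 50/50.
# Hypotheses are boolean predicates over scenes; encodings are illustrative.

def disagreement(hypotheses, scene):
    votes = sum(h(scene) for h in hypotheses)  # how many predict "rule-following"
    p = votes / len(hypotheses)
    return min(p, 1 - p)                       # 0 = consensus, 0.5 = maximal split

def choose_test(hypotheses, candidate_scenes):
    return max(candidate_scenes, key=lambda s: disagreement(hypotheses, s))

# Example: three hypotheses about lists of (color, size) objects
hs = [
    lambda s: any(c == "red" for c, _ in s),                   # "there is a red"
    lambda s: any(c == "red" and z == "small" for c, z in s),  # "there is a small red"
    lambda s: len(s) >= 2,                                     # "at least two objects"
]
scenes = [
    [("red", "small"), ("blue", "large")],  # all three hypotheses agree
    [("red", "large")],                     # splits the sample 1 vs 2
]
best = choose_test(hs, scenes)  # the second, more informative scene
```

Note that nothing in this sketch anchors the chosen scene to earlier scenes, which is exactly why uncertainty-driven testing predicts a different behavioral signature than the confirmatory and contrastive schemes in panels (c) and (d).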
Sequential contrastive testing

A related scheme that might allow a constructivist learner to escape some pathologies of confirmatory testing is the iterative counterfactual strategy described in Oaksford and Chater (1994). That is, learners might first generate an alternative hypothesis ℎ2 by inverting some feature of their initial hypothesis and then focus their next test on separating ℎ1 from ℎ2 (e.g., Fig. 3d).5 For example, starting with ℎ1: ''there is a small red'', one local alternative would be to drop the mention of size, leading to ℎ2: ''There is a red''. Now the learner has a pair of hypotheses and a recipe for distinguishing between them: testing a scene containing a red object that is not small (e.g. 𝑑1). This could again be easily achieved by adapting the original scene, so the small red is a different size (Chen & Klahr, 1999; Klahr et al., 1993, 2011). If 𝑑1 produces the effect, ℎ1 can be supplanted with ℎ2. Otherwise ℎ2 can be rejected and a new ℎ3 can be generated. Either way, this approach facilitates constructivism by providing a direction of travel however a test comes out, so allowing a constructivist learner to explore both the data and hypothesis spaces in parallel (Klahr & Dunbar, 1988).

5 In Oaksford and Chater's (1994) formulation, the complementary hypothesis is then inconsistent with the scene that inspired the original hypothesis, such as going from ''increasing by two'' (inspired by seeing 2-4-6) to ''decreasing by two'', such that its falsification may be mistaken for confirmation of the original hypothesis. Here there are many ways to flip the content of a hypothesis both with and without rendering it inconsistent with a scene that inspired it.

As illustrated in Fig. 3, what constructivism-compatible hypothesis-driven approaches have in common is a prediction of anchoring in data space: Each new scene shares features with the scene that inspired the earlier hypotheses that inspired it. This contrasts with the pattern we would expect if participants followed a normative uncertainty-driven approach or a model-free exploration-driven approach, since both tend to predict each scene should be as different as possible to earlier ones (although see Navarro & Perfors, 2011, for how this depends on the structure of the hypothesis space). While we do not collect the trial-by-trial guesses we would need to distinguish between all the accounts we mention, we will look for an empirical signature of constructivist active learning, in the form of anchored, incremental and systematic testing patterns, and assess whether these differ between children and adults.

Overview

In summary, the main goal of this paper is a close investigation of developmental differences in active open-ended hypothesis generation examined through the lens of a constructivism-inspired rational-process framework that puts stochastic generation and incremental search at the center of individuals' learning. To foreshadow, we find that children make more complex guesses about the hidden rule that are only a marginally worse fit to the evidence than adults' guesses. Children also create more complex learning data than adults but do so less systematically. We then show that both children's and adults' guesses reflect an evidence-inspired process of compositional concept formation as modeled by our Instance Driven Generation algorithm over a top-down–first PCFG norm, capturing that their guesses are inspired by discovery of patterns in their learning data. We show these behavioral patterns are a natural result of children having a less fine-tuned concept generation mechanism. Crucially, we also show that both children's and adults' symbolic guesses causally drive their generalizations, as opposed to these being driven by surface feature resemblance as emphasized in statistical views of concepts (cf. Medin & Schaffer, 1978; Posner & Keele, 1968). Finally, we show that both children and adults create scenes by adapting earlier scenes, which we argue is consistent with confirmatory or iterative counterfactual testing rather than uncertainty- or exploration-driven testing.

Experiment

Methods

Participants

We recruited 54 children in the lab (23 female, aged 8.97 ± 1.11) and 50 adults online (22 female, aged 38.6 ± 10.2). Forty children completed all five trials and the remaining 14 completed 2.71 ± 1.07 trials before indicating that they had had enough. For these children we simply include the trials that they completed. We collected participants until we reached our intended sample size of 50 per agegroup after exclusions. We chose this sample size simply to exceed our 2018 (N = 30) pilot with adults.6 Ten additional adult participants completed the task but were excluded before analysis for providing nonsensical or copy-pasted text responses. Adult participants were paid $1.50 and a performance related bonus of up to $4 ($1.96 ± 0.75). Children's sessions lasted between 30 minutes and an hour. For adults, the task took 27.49 ± 12.09 minutes, of which 9.8 ± 7.9 was spent on instructions. The children's and adults' versions of the task are available to try at https://github.com/bramleyccslab/computational_constructivism.

6 While we note that 104 is not a large sample by modern standards, our focus is on modeling inferences at the individual level. Each participant produces an exceptionally rich dataset and our analyses have unusually large storage and compute requirements, making a larger sample infeasible to analyze.

Design

All participants faced the same five learning problems in an independently randomized order (see Table 1). For each learning problem participants were given an initial positive example, as shown in the table, and then performed self tests of their own before making generalizations and free guesses as to the hidden rule.

Materials and procedure

Child sample. Instructions. Participants sat in front of a laptop with a mouse attached, with the experimenter sitting next to them, and interacted with the task through the browser. The experimenter read out the instructions for the participant. These explained how the game worked and showed the participant five examples of possible rules the blocks could have (relating to color, size, proximity, angle, or relation). The instructions also included videos showing the participant how to manipulate the blocks using the mouse and keyboard. After the instructions, the participant was given a comprehension check of five true or false questions. If they did not get them all right on their first try, the experimenter read through the instructions again and asked them again. All participants passed the comprehension check the second time.

Learning phase. The participant was then introduced to an initial example of a block type (''Here are some blocks called [name]s. We are going to click test to see if stars will come out of the [name]s.''). The initial example of each block type (i.e., each rule) was constant across participants. Since every initial example of a block type was a positive example, a star animation played when the ''Test'' button was clicked. The participant was encouraged to use either the trackpad or the mouse to click the ''Test'' button, whichever was comfortable for them.

After the initial positive example, the participant was shown a blank scene with blocks available to add to it, and was asked to test the blocks seven more times (Fig. 1a). The scene creation interface was subject to simulated gravity, meaning there were physical constraints on how the objects could be arranged. The experimenter told them they could now play with the blocks like they saw in the instructional video. The experimenter also reminded the participant of how to add, remove, move, and rotate blocks on the screen using the mouse and keyboard. Participants were encouraged to ask for help with moving the blocks if needed. If they seemed to be having trouble, the experimenter would ask if they needed help with setting up the blocks. The participants were told that when they had finished moving the blocks around, they should press the ''Test'' button to see if stars came out of them. For positive tests, the experimenter would neutrally say: ''Stars did come out of the [name]s that time'' and for negative tests: ''Stars did not come out of the [name]s that time''.

Question phase. After testing the blocks a total of eight times (Fig. 1b), participants were shown a selection of eight more pre-determined scenes containing blocks (Fig. 1c). The experimenter asked them to click on which pictures they thought the stars would come out of, reminding them that they could pick as many as they wanted, but they had to pick at least one. Unknown to participants, half of these scenes were always rule following but their positions on screen were independently counterbalanced. The test scenes and their labels remained visible on the screen throughout the Learning and Question phases.

Free responses. Participants were then presented with a blank text box and asked, ''What do you think the rule is for how the [name]s work?'' The experimenter typed the participant's verbal answer into the text box verbatim, or as close as possible.

The Testing, Question, and Free Response phases were repeated identically for each of the five block types. After the five trials were completed, the participant was shown the results, including each true rule and how well they did on each problem, and was thanked for playing the game. As compensation, participants were allowed to pick a small toy out of a prize box, and parents were given a paper ''diploma'' to commemorate their child's visit.
Adult sample. We recruited our adult sample from Amazon Mechanical Turk and adults completed the task on their own computers. They completed the same instructions as the children with an additional section about bonuses and had to successfully answer comprehension questions, including an additional two about the bonuses, before starting the main task. Specifically, adults were bonused 5 cents for each correct generalization (up to a possible 40 cents for each of the five trials) and an additional 40 cents for a correct guess as to the hidden rule, again for each of the five trials. Aside from having no experimenter in the room, and filling out the text fields themselves, the procedure was identical to the children's task. Full materials including experiment demos, data and code are available at the Online Repository.

Results

We first look at the qualitative characteristics of children's and adults' explicit rule guesses, then assess the relative accuracy of participants' rules and generalizations about new scenes, before comparing the features of the scenes produced by adults and children. We will then turn to a series of model-based analyses that attempt to reproduce participants' distributions of free guesses, generalizations and scenes within the constructivist framework.

Guess complexity and constituents

We had human coders translate participants' free text guesses about the hidden rule wherever possible into an equivalent logical expression using the grammatical elements available to our learning models. We were able to do this for 86% (n = 205) of children's trials and 88% (n = 219) of adults' trials. For example, if the participant wrote ''There must be one big red block'' this was converted into 𝑁 = (𝜆𝑥1 : ∧(=(𝑥1, large, size), =(𝑥1, red, color)), 1, ). This logical version can be automatically evaluated on the scenes and can be read literally as asserting ''There exists exactly one 𝑥1 in the set of objects such that 𝑥1 has the size 'large' and the color 'red'''. We had a primary coder, blind to the experimental hypotheses, code all responses, and a second blind coder spot check 15% of these (64). The two coders agreed in 95% of cases. We provide further details about the coding in Appendix B and full coding resources and full coding data in the Online Repository.

To explore structural differences in children's versus adults' hypotheses, we first break down these encoded rule guesses into their logical parts. This primarily reveals that children's encoded rules were substantially more complex than those generated by adults and that both were substantially more complex than the ground truth rules. Children's and adults' rules also differed in terms of the prevalence of particular elements and features (see Fig. 4). As an example, one child's rule for problem 1 was ''You must have two reds and one blue'', which was translated to 𝑁 = 𝜆𝑥1 : 𝑁 = (𝜆𝑥2 : (∧(=(𝑥1, red, color), =(𝑥2, blue, color)), 1, ), 2, ), requiring two quantifiers (𝑁=), one boolean (∧), 2 equalities (=()), and two references to the feature color. The typical child-generated rule used 2.25 quantifiers (4c), 2.06 booleans (4d), 1.55 equalities and inequalities (4e), referred to 1.39 different primary features (color, size, orientation, x- or y-position, groundedness, 4f) and 0.37 relational features (contact, stackedness, pointing, or insideness, 4g). In contrast, the average adult-generated rule required just 1.84 quantifiers, 1.20 booleans, 1.47 equalities and inequalities, and referred to 1.44 primary features but only 0.16 relational features. Children thus used significantly more quantification (i.e. referred to more separate entities) 𝑡(102) = 3.98, 𝑝 < .0001, more booleans 𝑡(102) = 3.59, 𝑝 < .0001 and relational features 𝑡(102) = 3.12, 𝑝 < .002 than adults, but the agegroups did not differ significantly in mentions of (in)equalities 𝑡(102) = −0.05, 𝑝 = 0.96 and references to the objects' basic features 𝑡(102) = −.91, 𝑝 = .36. When children posited that ''at least'', ''at most'' or ''exactly'' a certain number of objects must have certain features, the number they chose was substantially higher than that for adults (2.36 compared to 1.58, 𝑡(68) = 3.72, 𝑝 = 0.0004). In terms of features, adults frequently gave rules relating to color (58% compared to 39% of children's rules, 𝑡(102) = 2.27, 𝑝 = 0.025), while children were more likely to refer to positional properties (26% compared to 18% of adults' rules, 𝑡(102) = 2.15, 𝑝 = 0.034).

Accuracy

Having observed systematic differences in the content of children's and adults' hypotheses, we now ask if these manifest in children's and adults' inferential success; their ability to identify the ground truth and make accurate generalizations.

Guesses. Both children and adults were occasionally able to guess exactly the correct rules, doing so on a respective 11% and 28% of trials. Adults produced the correct rule more frequently than children 𝑡(102) = 4.0, 𝑝 < .001 and were more likely than children to guess correctly (at a corrected significance level of 0.01) for the ''All are the same size'', ''One is blue'' and ''There is a small blue'' rules (see Fig. 5a). The plot reveals that no child identified rule 4 ''One is blue'' exactly and only one identified rule 5 ''There is a small blue'', while a slightly greater proportion of children than adults identified the positional ''Nothing is upright'' rule. Note that the chance level baseline for these free guesses is essentially 0%: There are an unlimited number of wrong guesses and a small set of semantically correct guesses. It is also the nature of this inductive problem that there are an infinite number of wrong yet perfectly evidence-consistent rules for any evidence, and often there is a simpler evidence-consistent rule available than the ground truth.7 Thus, it is instructive to ask whether participants' rules, where not exactly correct, are nevertheless consistent with the evidence they gathered.

7 Although as more evidence arrives the ground truth is increasingly likely to be among the ''simplest'' rules in a posterior sample.

While a completely random rule would only be consistent with all 8 scenes around 0.5⁸ × 100 = 0.4% of the time, children's explicit rule guesses were perfectly consistent with the labels of the 8 training scenes 30% of the time and adults' guesses were fully consistent 54% of the time. There was a moderate difference in the average proportion of the learning data explained by children's compared to adults' rules, 71% ± 27% vs 87% ± 17%, 𝑡(98) = 5.6, 𝑝 < .001. Similarly there was a difference in the proportion of the participants' generalizations that were consistent with their rule guess, 72% ± 21% vs 84% ± 16%, 𝑡(98) = 4.1, 𝑝 < .001 (see Fig. 5c for a by-rule breakdown).

Generalizations. We now report participants' performance in predicting which of 8 new scenes will produce stars (i.e. follow each hidden rule). Across the five tasks, both children and adults guessed more accurately than chance (50%): children mean ± SD 59% ± 11%, 𝑡(53) = 5.9, 𝑝 < .001; adults 70% ± 14%, 𝑡(49) = 10.3, 𝑝 < .001. Adults' generalizations were significantly more accurate than children's 𝑡(102) = 4.6, 𝑝 < .001 and children's accuracy improved significantly with age 𝐹(1, 52) = 6.2, 𝜂² = .11, 𝑝 = 0.015. Indeed, adults' generalization accuracy was above a Bonferroni-corrected chance level of 𝑝 ≤ 0.01 for all five rules and children were similarly above chance except for rules 1. ''There is a red'' (𝑡(46) = 2.5, 𝑝 = .015) and 4. ''One is blue'' (𝑡(46) = .1, 𝑝 = .915; see Fig. 5b).

Scene generation

As well as generating more complex rules, children tended to create more complex test scenes than adults. The average child-generated scene contained 3.7 ± 0.88 objects (close to the average in the example scenes) compared to 2.8 ± 0.57 objects for adults (𝑡(102) = 5.8, 𝑝 < .001). The complexity of a learner's test scenes was inversely related to their performance overall (𝐹(1, 102) = 39.0, 𝛽 = −0.08, 𝜂² = .28, 𝑝 < .001) and also within both the children (𝐹(1, 52) =, 𝛽 = −0.056, 𝜂² = .20, 𝑝 < .001) and adults (𝐹(1, 49) = 9.1, 𝛽 = −0.096, 𝜂² = .16, 𝑝 < .001) taken individually (see Fig. 6a). Within the children, age was inversely associated with scene complexity, with an average of 0.35 fewer objects per scene for each additional year 𝐹(1, 52) = 12.6, 𝜂² = .19, 𝑝 < .001. Aside from this difference, we also assess whether children's or adults' scenes bear the hallmarks of being driven by confirming or distinguishing between a small set of possible rules.
Fig. 4. (a) Length of Children’s and Adults’ rule guesses. (b) Relative frequency of rule elements in logic coded versions of these rules, c–g with respect to quantifiers, booleans,
(in)equalities, basic and relational features respectively. Error bars show normal 95% confidence intervals. Yellow points in a show ground truth frequency. (For interpretation of
the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 5. (a) Percentage children and adults guessing correct rule. (b) Generalization accuracy. Bars show mean ± bootstrapped 95% CIs. In a–b, Black vertical lines denote chance
performance. Blue and red points show performance of simulated PCFG and IDG learners as described in Modeling section. Circles = guessing the MAP rule or MAP generalization
(after marginalizing over posterior). ‘‘ + ’’ shows accuracy of a single posterior sample. Both models here use agegroup-consistent production weights, CIs show bootstrapped 95%
confidence intervals. (c) Consistency between subjects’ rule guess and their (self-generated) learning data, and generalizations. (For interpretation of the references to color in this
figure legend, the reader is referred to the web version of this article.)
If participants do follow a control of variables, confirmatory, or iterative counterfactual approach, we would expect the scenes generated by participants to be more similar to the initial example or one of their own preceding scenes, than to a random scene or a scene drawn from a different learning problem. If they are rather maximizing information with respect to a larger set of hypotheses, or exploring the data space efficiently, we would expect the opposite pattern of independence or anticorrelation. To explore this, we constructed a distance metric that we used to measure the feature-dissimilarity between any pair of scenes. The metric is based on edit distance, encoding how much and how many of the features (positions, colors, shapes) of the objects in one scene would have to be changed to reproduce the other scene. This involved 𝑧-scoring and combining a ''minimal-edit set'' of feature differences and incorporating a proportional cost for additional or omitted objects and scaling by the number of objects in the scenes. We provide a detailed procedure and example of how we computed these edit distances and break them down into their separate components in Appendix C. The mean distance between any randomly selected pair of participant-generated scenes was M ± SD = 3.67 ± 0.94. Taken as a whole, the scenes generated by children were more diverse than adults'
Fig. 6. (a) Generalization accuracy by number of objects per test scene. (b) Average dissimilarity between self-generated scenes at different levels of aggregation. Error bars show
standard errors for subject means. (c) Average similarity matrices between initial example and self generated scenes 2 to 8. See Appendix C for detailed procedure and similarity
matrices separated by component.
with average dissimilarity of 3.70 ± 0.14 compared to 3.63 ± 0.08, 𝑡(102) = 2.9, 𝑝 = 0.0048.

However, this diversity seems to be primarily between rather than within subject for children's choices. Within subject but across trials, the average inter-scene dissimilarity for children was 3.60 ± .33, similar to that for adults' 3.65 ± .22, 𝑡(102) = .83, 𝑝 = .4. Focusing more narrowly, within the scenes produced by an individual subject while learning about a single rule, we see a reversal of the aggregate pattern. That is, within a learning task, children's scenes are marginally less diverse on average than adults' (children: 3.30 ± 0.459, adults: 3.44 ± 0.33, 𝑡(102) = 1.77, 𝑝 = 0.08, Fig. 6b&c).

Fig. 6c breaks down the within-trial scene dissimilarity by test position for the two agegroups. Adults' scenes are clearly anchored to the initial example (right hand facet), shown by the dark shading in the top row indicating high similarity decreasing from left to right for later tests. Adults' scenes also look sequentially self-similar, shown by the relatively darker shading along the diagonal compared to the off-diagonal. In contrast, children's similarity patterns look more uniform. However, for both adults and children, the first self-generated scene is more similar to the initial example than any other scene.

Interim discussion

In sum, in our experiment we found children were only moderately less able to come up with rules that fit the evidence than adults, and there were only moderate differences in the compatibility between children's and adults' rules and their subsequent generalizations. Most striking was the fact that children's guesses appeared to overfit the evidence more, producing more complex, perhaps more naïve, characterizations of the rule-following scenes than did adults. This can be seen in the larger number of quantifiers and relations mentioned in children's rules than in adults', essentially referring to more different objects and more complex properties of the learning scenes that were actually irrelevant to their label. As well as generating more complex concepts, children created more complex test scenes that appeared to be more repetitive overall, yet also appeared to be varied less systematically than adults'.

Model comparison

To explore the basis for the diversity of guesses and generalizations, and of the differences between children's and adults' learning, we now turn to model-based characterization of the behavioral data. We focus first on the guesses, then the generalizations, and finally the scene creation. We will assess whether participants' guess and generalization patterns are better captured by Bayesian inference over samples from an expressive latent prior – Probabilistic Context Free Generation (PCFG) – or rather by the partially bottom-up generation – Instance Driven Generation (IDG) – limited to hypotheses inspired by patterns in scenes (Bramley et al., 2018). We then assess whether new scenes are better captured as independently generated – consistent with uncertainty-driven or exploration-driven testing – or as adaptations of earlier scenes, consistent with confirmatory or iterative contrastive testing.

To foreshadow, we find convergent evidence that both children's and adults' guesses are better accounted for by Instance Driven Generation (IDG) of hypotheses than by an approximately normative Probabilistic Context Free Grammar (PCFG) norm. We then demonstrate that neither children's nor adults' generalizations can be explained
by surface similarity between rule-following and generalization probe scenes, but that they are well predicted by the learners' own symbolic guess. Finally, we show that almost all children's and adults' scenes are more likely to have been created by making edits to either the previous or the initial scene – in line with hypothesis-driven confirmatory or contrastive testing – rather than being generated independently from scratch, consistent with uncertainty-driven or direct exploration of the data space.

Guesses

Participants produced a huge variety of guesses but, despite this, these guesses were consistent with the majority of their evidence. Children's guesses were more complex and a little less data-consistent on average than adults'. We now explore using PCFG and IDG sampling to produce similar guesses.

We first assume a PCFG as a computational level framework and reverse engineer what production weights it requires to generate the kinds of guesses we see adults and children make. Next, we contrast the prior sample-based PCFG approach to rule generation with our proposed data-inspired IDG, showing that the IDG does a better job of capturing participants' accuracy by problem type and agegroup and is also better able to produce the specific guesses made by the participants.

Reverse engineering childlike and adultlike production weights

Having encoded all the rule guesses from adults and children (in the section on Rule complexity and constituents), we created PCFG production weights that produce similar guesses as adults and children. To do this, we worked back from the observed counts for each rule element, doing this separately for children's and for adults' guesses (see Appendix A). Of course, the guesses are samples from a range of different participants' posteriors, since guesses were always based on some evidence. However, since this evidence differs dramatically between trials and across the rules we considered and scenes participants created, and since the structural elements of the grammar (booleans, quantifiers etc) are not tightly tied to scene-specifics, this still provides a helpful elucidation of generation differences behind child-like and adult-like guesses. A full set of fitted prior weights for both adults and children are visualized in Fig. 7. This analysis simply demonstrates that a natural way to understand children's guesses is as emanating from a less fine-tuned generation mechanism than adults', with flatter, more entropic branching at 12 of the 14 forking production steps we assumed in our PCFG model. Indeed, the entropy of the probability distribution over productions at each stage averaged 1.28 ± 0.50 bits for children compared to 1.03 ± 0.59 bits for adults, 𝑡(13) = 3.2, 𝑝 = 0.007.

Modeling accuracy by participant and rule

We now compare participants' patterns of accuracy to simulated approximately normative inference over a PCFG-generated sample and IDG hypothesis generation algorithms provided with the active learning data generated by the human participants. We generated a sample of 10,000 hypotheses based on uniform production weights 𝐻̂_PCFGu, and similarly for the IDG generated a sample based on uniform productions for each task 𝐻̂_IDGu^{𝑝,𝑡}. Additionally, for each participant 𝑝 – and separately for each learning task 𝑡 in the case of the IDG – we generated another 10,000 possible rules using the age-consistent prior production weights derived above, 𝐻̂_PCFGh^{𝑝} and 𝐻̂_IDGh^{𝑝,𝑡}, that have statistics matched⁸ to those in Fig. 4a–f. The PCFG samples act as an approximation to an infinite latent prior over rules 𝑃(ℎ) before seeing any data. The uniform-weight PCFG samples capture a generic inductive bias for simpler hypotheses while fitted held-out child- and adult-like weights additionally attempt to capture ''learned'' inductive biases common to the requisite age-group (but not specific to the participant). The IDG samples are additionally idiosyncratically constrained in the sense of only reflecting rules referring to features or relations actually present in at least one of the learning scenes. We split the IDG sample evenly across tests such that 1250 were ''inspired'' by each learning scene, necessarily repeating this procedure for each trial for each participant since each generates different evidence. In order to approximate a posterior over rules given self-generated learning scenes 𝐝, we then weighted these samples by their likelihood of producing all eight scene labels 𝐥 observed during the learning phase

𝑃(ℎ|𝐥; 𝐝) ∝ 𝑃(𝐥|ℎ; 𝐝)𝑃(ℎ)   (1)
          ≈ 𝑃(𝐥|ℎ; 𝐝) ∑_{ℎ̂∈𝐻̂} I(ℎ = ℎ̂)   (2)

and combined this with their prior weight – given by counting how often they appear in the prior sample, with indicator function I(⋅) denoting exact or semantic equivalence. To test for semantic equivalence, we computed predictions for the first 1000 participant-generated scenes for each rule and clustered together those that made identical predictions. We rounded positional features to one decimal place in evaluating rules to accommodate perceptual uncertainty. Concretely, we assumed the following likelihood function

𝑃(𝐥|ℎ; 𝐝) ∝ exp(−𝑏 × 𝑁_mispredictions)   (3)

embodying the idea that the more learning scene labels a rule cannot explain, the less likely it is to have produced them. For a large 𝑏, the likelihood function approaches the true deterministic behavior of the rules. However, in our analyses we simply assume 𝑏 = 2 to allow for some noise while maintaining computational tractability. This corresponds to a likelihood function that decays rapidly from ∝ 1 for rules that predict all 8 scenes' labels, to ∝ .13 for a single misprediction, and ∝ .02 for 2 mispredictions, and so on.

8 For these, we held out the subjects' own guesses when setting the weights to avoid double dipping the data.

To generate IDG predictions, we merged the production probabilities from the PCFG into the Instance Driven Generation procedure detailed in Appendix A. For scenes that did not follow the rule we followed the same procedure as for scenes that did, but wrapped the rule in a negation. For example, observing a non-rule-following scene in which there are objects in contact might inspire the rule that ''no cones are touching''.

The resulting model guess accuracy is visualized in Fig. 5a. We distinguish between two possible decision mechanisms: (1) taking the maximum a posteriori (MAP) estimate from a large posterior sample (guessing in the event of ties), which we take as closer to a normative ideal, and (2) taking the accuracy of a single posterior sample, which we take to be more consistent with the best-case-scenario output of a process in which a given learner searches over hypotheses driven by a combination of prior complexity and fit. Under all models, the MAP lines up with the correct hypothesis more often than participants do (15–37% based on children's active learning and 20–51% based on adults', recalling that children guessed correctly on 11% of trials and adults on 28% of trials). For instance, under a uniform-weighted prior sample, the PCFG MAP is correct on 15% of all children's trials and 20% of all adults' trials. Note that since these simulations use the same prior sample, the small differences we see are due to the different learning data generated by children and adults. However, accuracy improves substantially and better reproduces the empirical child–adult accuracy difference when we use samples based on reverse-engineered weights that reproduce the qualitative properties of other participants in the same agegroup (see Appendix A and Fig. 7). For age-appropriate prior samples, the PCFG guesses correctly on 18% of children's trials and 32% of adults' trials. Using an age-inappropriate ''flipped'' prior sample (i.e. child-like weights for adults and adult-like weights for children) obliterates this difference, resulting in 23% for children and 22% for adults. We see a similar pattern for the IDG algorithm, but higher
N.R. Bramley and F. Xu Cognition 238 (2023) 105471
Fig. 7. Visualization of (a) child-like and (b) adult-like PCFGs, reverse engineered to produce rules with empirical frequencies matched to children’s and adults’ guesses. A rule
is produced by following arrows from ‘‘Start’’ according to their probabilities (line weights and annotation), replacing the capital letters with the syntax fragment at the arrow’s
target and repeating until termination.
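The generative process described in the Fig. 7 caption can be sketched as a small weighted-production sampler. The grammar below is a toy invented for illustration (the paper's grammars are far larger); it also shows the entropy of a single production choice, since flatter weights yield the higher-entropy branching attributed to the child-like PCFG.

```python
import random
from math import log2

# Toy PCFG in the spirit of Fig. 7: each nonterminal maps to weighted
# expansions. All productions and weights here are invented for
# illustration -- they are not the paper's actual grammar.
TOY_PCFG = {
    "S": [(["there is a", "D"], 0.6), (["all cones are", "P"], 0.4)],
    "D": [(["red"], 0.5), (["blue"], 0.3), (["small", "D"], 0.2)],
    "P": [(["upright"], 0.7), (["touching"], 0.3)],
}

def sample_rule(grammar, symbol="S", rng=random):
    """Produce a rule top-down, picking each production by its weight."""
    expansions, weights = zip(*grammar[symbol])
    chosen = rng.choices(expansions, weights=weights, k=1)[0]
    parts = [sample_rule(grammar, tok, rng) if tok in grammar else tok
             for tok in chosen]
    return " ".join(parts)

def production_entropy(grammar, symbol):
    """Entropy (bits) of the production choice at one forking point.
    Flatter weights -> higher entropy -> more diverse sampled rules."""
    ws = [w for _, w in grammar[symbol]]
    z = sum(ws)
    return -sum((w / z) * log2(w / z) for w in ws)

rng = random.Random(0)
print(sample_rule(TOY_PCFG, rng=rng))
print(round(production_entropy(TOY_PCFG, "P"), 3))  # 0.7/0.3 split -> 0.881 bits
```

A uniform two-way split would give the maximum 1 bit; the paper's reverse-engineered child-like weights average higher per-fork entropy than the adult-like ones in just this sense.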
Table 2
Accuracy of rule guesses by simulation models.

| Algorithm | Prior | MAP: Children (%) | MAP: Adults (%) | MAP: Fit | Sample: Children (%) | Sample: Adults (%) | Sample: Fit |
|---|---|---|---|---|---|---|---|
| PCFG | Uniform | 14 ± 16 | 20 ± 14 | −229 | 9 ± 5 | 12 ± 5 | −226 |
| PCFG | Agegroup | 17 ± 17 | 32 ± 15 | −230 | 11 ± 7 | 20 ± 7 | −225 |
| PCFG | Flipped | 22 ± 20 | 22 ± 15 | −231 | 15 ± 9 | 15 ± 6 | −229 |
| IDG | Uniform | 26 ± 22 | 39 ± 21 | −226 | 9 ± 5 | 14 ± 6 | −217 |
| IDG | Agegroup | 36 ± 25 | 51 ± 18 | −226 | 14 ± 8 | 24 ± 8 | −212 |
| IDG | Flipped | 26 ± 20 | 52 ± 18 | −230 | 13 ± 8 | 23 ± 8 | −223 |

"Children" and "Adults" columns show the M ± SD% by-subject accuracy of the requisite algorithm ("MAP" = maximum a posteriori choice; "Sample" = single posterior sample). "Fit" shows the log likelihood for a logistic mixed-effects regression using model accuracy to predict whether the participant guesses correctly on each trial.
accuracy across the board. The IDG achieves the best accuracy on both children's and adults' trials, guessing over half of the hidden rules correctly (51%) in the case of adults' trials. However, achieving this level requires maximizing over the full sample, while we have argued that process-level accounts are more likely to yield behavior closer to posterior sampling (Table 2, right-hand columns). Indeed, posterior samples provide a visually closer fit to the by-rule guess rates (Fig. 5a).

To check what provides the better account of participants' trial-by-trial accuracy patterns, we fit logistic mixed-effects regression models using the response under each algorithm and prior combination to predict each participant's by-task probability of guessing correctly, including random effects for both rule type and participant. For the maximization models, we softmaxed the posterior with a low "temperature" parameter (τ = 1/500, Luce, 1959), meaning predictions were close to 1 or 0 except where multiple hypotheses were tied, where they were close to 1/N for the N tied hypotheses. The "Fit" columns of Table 2 show the log likelihood for each of these models, revealing that participants' correct judgments were most in line with posterior sampling under an IDG prior with age-appropriate production weights (log likelihood = −211.5, β = 5.44 ± 1.74, Z = 5.99, p < .001), improving over a baseline fit of −234.3 for a model with only intercept and random effects.

Modeling rule guesses

As a more direct test of the constructivist PCFG and IDG models' ability to explain participants' free response guesses, we also attempted to estimate the probability of each approach generating exactly the participant's encoded guess based on their active learning data.

By definition, all 87% of trials on which a participant gave an unambiguous rule could be encoded in our concept grammar, so all have nonzero support under a PCFG prior. Due to the stochasticity we assumed in our likelihood function, all possibilities also have nonzero posterior probability, meaning they are guaranteed to appear in a sufficiently large PCFG sample. (They would not necessarily appear in an infinitely large IDG sample because many of the more complex concepts are merely possible without being positively present. For example, "there is a red and fewer than five small blues" is consistent with Fig. 1b but would never be generated by the IDG procedure inspired by these scenes.) However, in practice it is impossible to cover an infinite space of discrete possibilities with a finite set of samples, meaning there are a substantial number of cases in which we did not generate the participant's guess. The proportion of rules that were generated at least once in 10,000 samples was highest for the IDG with agegroup-fitted weights (69% for children, 76% for adults), decreasing to 49% and 62% using uniform weights. This was still higher than for the PCFG, which generated 42% of children's and 53% of adults' guesses with the fitted prior weights, and 45% of children's and 50% of adults' rules from a uniform prior.

Table 3 details model fits to participants' guesses. The IDG is again the stronger hypothesis generation candidate, assigning higher probabilities on average to the rules that participants provided. As expected,
10. Note that these prior generation probabilities are a lower bound on the chance of generating a particular semantic rule since many syntactic forms can express the same semantic content (Fränken et al., 2022). This captures why some relatively frequently generated semantic classes of guess nevertheless had a low probability for each specific syntactic expression.

11. It is likely that other approximate inference methods, such as an MCMC or greedy posterior search approach, could improve on this sample efficiency. However, they also introduce other challenges for the learner (i.e. escaping local minima) and the modeler (getting good coverage of the response space and aggregating auto-correlated samples).
Table 4
Example guesses.

| Agegroup | Rule | Example syntax | log Prior (Uniform) | log Prior (Agegroup) | log(Likelihood) | N/10k |
|---|---|---|---|---|---|---|
| Children | "One is on top of the other" | ∃(λx1: ∃(λx2: Γ(x1, x2, stacked), ), ) | −9.5 | −8.4 | 0 | 117 |
| Children | "Only different colors" | ∀(λx1: ∀(λx2: ∨(=(x1, x2, ID), ¬(=(x1, x2, color))), ), ) | −9.8 | −8.0 | 0 | 260 |
| Adults | "If there are multiple small blocks." | N≥(λx1: =(x1, 1, size), 2, ) | −9.9 | −19.6 | 0 | 609 |
| Adults | "There is at least one small green triangle." | ∃(λx1: ∧(=(x1, green, color), =(x1, 1, size)), ) | −13.8 | −21.3 | 0 | 532 |
| Children | "They have to be with all three different colors" | ∃(λx1: ∃(λx2: ∃(λx3: ∧(∧(=(x1, red, color), =(x2, green, color)), =(x3, blue, color)), ), ), ) | −22.3 | −16.6 | −2.0 | 0 |
| Children | "There has to be one small blue piece and there has to be more than one piece" | ∃(λx1: N≥(λx2: ∧(=(x1, 1, size), =(x1, blue, color)), 2, ), ) | −12.5 | −11.3 | 0 | 0 |
| Adults | "When there is a cone from each color of the same size" | ∃(λx1: ∃(λx2: ∃(λx3: ∧(∧(∧(∧(=(x1, red, color), =(x2, green, color)), =(x3, blue, color)), =(x1, x2, size)), =(x1, x3, size)), ), ), ) | −20.5 | −11.11 | −2.0 | 0 |
| Adults | "one piece has to be leaning on another" | ∃(λx1: ∃(λx2: ∧(Γ(x1, x2, contact), ¬(=(x2, upright, orientation))), ), ) | −18.5 | −21.3 | −3.9 | 0 |

Note: N/10k shows how many times we generated this rule in 10,000 samples assuming agegroup-specific weights and counting any semantically equivalent expressions.
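Per guess, Table 4 combines a log prior (under the generative grammar) with a log likelihood (the rule's fit to the learner's data); up to a normalizing constant, their sum is the guess's log posterior. A minimal sketch with invented names and values in the spirit of the table:

```python
# Hypothetical candidate rules: each carries a log prior from the generative
# grammar and a log likelihood given the learning data. All names and numbers
# below are invented for illustration, not taken from the experiment.
candidates = {
    "one is on top of the other": {"log_prior": -8.4, "log_lik": 0.0},
    "only different colors":      {"log_prior": -8.0, "log_lik": 0.0},
    "all three different colors": {"log_prior": -16.6, "log_lik": -2.0},
}

def log_posterior(c):
    # Unnormalized: log P(h | d) = log P(h) + log P(d | h) + const.
    return c["log_prior"] + c["log_lik"]

# The MAP guess is the candidate with the highest unnormalized log posterior.
map_rule = max(candidates, key=lambda r: log_posterior(candidates[r]))
print(map_rule)  # -> only different colors
```

This is why a complex rule can fit the data perfectly yet almost never be produced: its low log prior (and hence low N/10k count) dominates the comparison.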
Fig. 8. (a) Posterior probability of participants' guesses under PCFG and IDG samples with agegroup weights. The full black line compares with posterior samples, the dashed line with selection of the maximum a posteriori hypothesis (or sampling among them if more than one are tied), and the dotted line compares with samples from the prior. (b) Individual generalization model fits showing BIC improvement over baseline per trial (higher is better). Opaque points show mean ± SE, faint points show individual fits, with triangles used to mark where the model (of the 17 blind to the symbolic guess) is the best fit for that participant.
We fit a total of 18 models to the generalization data. All models had between 0 and 2 parameters. For each model, we fit the parameter(s) by maximizing the model's likelihood of producing the participant data, using R's optim function. We compared models using the Bayesian Information Criterion (Schwarz, 1978) to accommodate their different numbers of fitted parameters.

The models we fit were:

• 1. Baseline. Simply assigns a likelihood of .5 to each generalization ∈ {rule following, not rule following} for each of the 8 generalization probes for each of the 5 learning trials.
• 2. Bias. Acts as a stronger baseline by allowing participants to have an overall bias toward or against selecting generalization scenes as rule following. For this model, b = 1 if > 50% of generalizations predict the scene is rule following and 0 otherwise. The model is fit using a mixture parameter λ to mix this modal prediction with the baseline prediction of .5: P(choice) = λb + (1 − λ) · .5.
• 3–8. PCFG {Uniform, Flipped, Agegroup} × {No Bias, Bias}. These models base their generalizations on the marginal likelihood that each generalization scene is rule following under the probabilistic context-free grammar (PCFG) posterior, r = P_PCFG(l* | l; d, d*). "Uniform" uses a prior with uniform production weights. "Flipped" uses a prior generated with mismatched weights, that is, adult-like weights for children's generalizations and child-like weights for adults' generalizations. "Agegroup" uses a sample based on weights derived from other participants in the same agegroup, holding out the participant's own guesses. In each case, these predictions are then softmaxed using P(choice) = e^{r/τ} / Σ_{r∈R} e^{r/τ}, with temperature parameter τ ∈ (0, ∞) (Luce, 1959) optimized to maximize model likelihood. Large positive τ indicates random selection; τ → 0 indicates hard maximization. Variants with a bias term also mix this prediction with the subject's modal
12. We selected these by generating 10,000 sets of seven scenes for each rule, and selecting the set that best reduced entropy.

13. We do not attempt to predict the relational features or absolute positions in this analysis.
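The Luce-choice softmax used by the generalization models above can be sketched directly. The scores and temperatures below are illustrative only, not fitted values:

```python
from math import exp

# Softmax choice rule: scores r are mapped to choice probabilities with
# temperature tau. Small tau -> near-maximization; large tau -> near-random.
def softmax(scores, tau):
    exps = [exp(r / tau) for r in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [0.9, 0.1]  # e.g. marginal likelihood: rule-following vs not
print([round(p, 3) for p in softmax(scores, tau=0.1)])    # near-deterministic
print([round(p, 3) for p in softmax(scores, tau=100.0)])  # near-uniform
```

The same rule with τ = 1/500 is what turns the maximization models' posteriors into near-0/1 predictions earlier in the accuracy analysis.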
Fig. 10. Example sequences for the ‘‘There is a red’’ problem. (a) A child’s scenes (b) An adult’s scenes (c) Random selection from all participant generated scenes (d) Uncertainty
driven selection from all participant scenes (e) Optimal scene selection for communicating the concept. (f) Expected Information Gain and (g) achieved uncertainty reduction for
sequences in a–e.
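Fig. 10f reports Expected Information Gain (EIG) for test sequences. Generically, EIG is the prior entropy over the hypothesis set minus the expected posterior entropy after seeing the test's outcome. A minimal sketch with an invented two-hypothesis example:

```python
from math import log2

def entropy(ps):
    """Shannon entropy (bits) of a probability vector; ignores zero entries."""
    return -sum(p * log2(p) for p in ps if p > 0)

def expected_information_gain(prior, likelihood):
    """prior[h] = P(h); likelihood[h][y] = P(y | h, test).
    EIG = H(prior) - E_y[ H(posterior | y) ]."""
    outcomes = next(iter(likelihood.values())).keys()
    eig = entropy(prior.values())
    for y in outcomes:
        p_y = sum(prior[h] * likelihood[h][y] for h in prior)
        if p_y == 0:
            continue
        posterior = [prior[h] * likelihood[h][y] / p_y for h in prior]
        eig -= p_y * entropy(posterior)
    return eig

# Two equiprobable hypotheses; the test's outcome fully discriminates
# them, so EIG equals the full 1 bit of prior entropy.
prior = {"h1": 0.5, "h2": 0.5}
likelihood = {"h1": {"follows": 1.0, "violates": 0.0},
              "h2": {"follows": 0.0, "violates": 1.0}}
print(expected_information_gain(prior, likelihood))  # -> 1.0
```

A redundant test, whose outcome is the same under every surviving hypothesis, would score an EIG of zero, which is the sense in which repetitive scene sequences make slower progress.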
5. Adapt Mixed {Simple}: This model simply mixes the predictions of Adapt Initial and Adapt Previous to capture the behavior of a learner who sometimes adapts the initial scene (with probability θ) and sometimes their own preceding scene (with probability 1 − θ).

Adults were more likely than children to reduce the number of objects and had more tendency to adapt sequentially, gradually traveling further away from the initial example.

General discussion
Table 5
Models of scene generation.

Children

| Model | BIC/scene | N Best | λ (simplicity) | η (fidelity) | θ (mixture) |
|---|---|---|---|---|---|
| Generate Uniform | 40.2 | 0 | | | |
| Generate | 34.9 | 0 | | | |
| Generate Simple | 30.7 | 0 | 0.34 ± 0.1 | | |
| Adapt Initial | 30.4 | 2 | | 0.29 ± 0.19 | |
| Adapt Previous | 30.1 | 8 | | 0.25 ± 0.18 | |
| Adapt Mixed | 30.0 | 1 | | 0.27 ± 0.19 | 0.40 ± 0.29 |
| Adapt Initial Simple | 29.3 | 7 | 0.33 ± 0.11 | 0.34 ± 0.16 | |
| Adapt Previous Simple | 29.0 | 10 | 0.34 ± 0.13 | 0.31 ± 0.17 | |
| Adapt Mixed Simple | 28.7 | 26 | 0.34 ± 0.12 | 0.33 ± 0.17 | 0.40 ± 0.24 |

Adults

| Model | BIC/scene | N Best | λ (simplicity) | η (fidelity) | θ (mixture) |
|---|---|---|---|---|---|
| Generate Uniform | 32.8 | 0 | | | |
| Generate | 27.8 | 0 | | | |
| Generate Simple | 23.1 | 0 | 0.50 ± 0.18 | | |
| Adapt Initial | 23.6 | 0 | | 0.23 ± 0.14 | |
| Adapt Previous | 23.4 | 1 | | 0.21 ± 0.13 | |
| Adapt Mixed | 23.3 | 1 | | 0.21 ± 0.13 | 0.35 ± 0.26 |
| Adapt Initial Simple | 22.4 | 5 | 0.50 ± 0.20 | 0.29 ± 0.12 | |
| Adapt Previous Simple | 21.9 | 24 | 0.54 ± 0.30 | 0.23 ± 0.13 | |
| Adapt Mixed Simple | 21.8 | 19 | 0.54 ± 0.27 | 0.24 ± 0.13 | 0.32 ± 0.25 |

Note: BIC/scene shows the fit of the model at the agegroup level divided by the number of scenes for easier comparison. λ (simplicity), η (fidelity) and θ (mixture) show M ± SD of the best-fitting parameters across subjects. The best-fitting model in each agegroup (lowest BIC/scene, originally shown in boldface) is Adapt Mixed Simple.
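The Adapt Mixed {Simple} model compared in Table 5 mixes two edit sources with weight θ. The sketch below is a highly simplified stand-in: the edit-likelihood function and all scene representations are invented for illustration, and the paper's simplicity term λ is omitted.

```python
# A scene is explained as an edit of either the trial's initial example
# (probability theta) or the learner's own previous scene (1 - theta).
def edit_likelihood(scene, source, eta):
    """Toy fidelity model: probability decays with the number of changed
    objects; eta controls how faithful copies tend to be. Invented form."""
    n_changes = len(set(scene) ^ set(source))  # symmetric difference
    return eta * (1 - eta) ** n_changes

def adapt_mixed_likelihood(scene, initial, previous, eta, theta):
    return (theta * edit_likelihood(scene, initial, eta)
            + (1 - theta) * edit_likelihood(scene, previous, eta))

initial, previous = ["red cone"], ["red cone", "blue cone"]
scene = ["red cone", "blue cone", "green cone"]
p = adapt_mixed_likelihood(scene, initial, previous, eta=0.3, theta=0.4)
print(round(p, 4))  # -> 0.1848
```

Because the scene here is one edit from the previous scene but two edits from the initial example, the previous-scene branch dominates the mixture, mirroring how sequential adaptation shows up in the fits.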
5. The logical form of both children's and adults' symbolic guesses predicted their generalizations to new scenes far better than feature similarity.
6. Both children's and adults' scene generation seemed to involve modifying previous scenes, with adults doing so more systematically and with more tendency to simplify them.

We now discuss these results more broadly, first highlighting some limitations, then expanding on what we see as the implications of this work for theories of concepts and of development, and finally pointing to some future directions.

Limitations

Experimental control

While this task and new dataset provide an exceptionally rich window on inductive inference, some of what is gained in open-endedness is lost in experimental control. There is considerable residual ambiguity about the extent to which differences in active learning shaped differences in hypothesis generation and vice versa. One way to try and partial this out could be to run more experiments that fix the evidence and probe the hypotheses generated, or that fix the hypotheses in play and probe what evidence is sought. However, we have argued that such constrained tasks run the risk of short-circuiting natural cognition: Learners may struggle to test hypotheses they did not conceive themselves, and are known to struggle to use data they have not generated to evaluate their hypotheses (Markant & Gureckis, 2014; Sobel & Kushnir, 2006). Sole focus on scenarios that fix one or other aspect of the inductive inference loop may provide a misleading perspective on end-to-end active inference in the wild. We feel our open-ended task provides a valuable complementary perspective. In future work, we plan to elicit more fine-grained online measures of learners' thought processes, e.g. asking them to list their hypotheses after each guess or describe how they construct test scenes. This would support comparison of process-level accounts of both hypothesis adaptation and active search and allow identification of individual differences.

Theoretical expressivity

There are many ways we could have set up the primitives, parameters and productions of our PCFG and IDG models. This makes for a dangerously expressive set of theories of cognition. We do not claim to have explored this space exhaustively here, but rather that our modeling lends support to the idea that some symbolic and compositional process drives children's and adults' active inductive inferences about the world. That is, we can explain the variability and productivity of human hypothesis formation in symbolic terms. Identifying the computational primitives of thought may not be a realistic empirical goal since a feature of constructivist accounts is their flexibility. Learners can grow their concept grammar over time, caching new primitives that prove useful (Piantadosi, 2021). Moreover, it is well known that many different symbol systems can mimic one another (Turing, 1937), meaning that expressivity alone cannot distinguish between them. Since we expect different learners to take different paths in an inherently stochastic learning process, this limits universal claims about representational content.

Feature selection

We assumed our scenes had directly observable features and cued these to participants in our instructions. However, a number of recent models in machine learning combine neural network methods for feature extraction with compositional engines for symbolic inference, creating hybrid systems that can learn rules and solve problems from raw inputs like natural images (cf. Nye, Solar-Lezama, Tenenbaum, & Lake, 2020; Valkov, Chaudhari, Srivastava, Sutton, & Chaudhuri, 2018). We see these approaches as having promise to bridge the gap between subsymbolic and symbolic cognitive processing.

Elicitation differences between children and adults

One potential concern is that the complexity of children's guesses relative to adults' stems partly from their being collected verbally and in the presence of an experimenter rather than typed during an online experiment. Speaking carries different cognitive demands than typing and may lead to children simply responding in a more verbose way than adults. While we cannot rule this out, we do not think this is a major concern. Adults were well compensated for accuracy, meaning their motivation was primarily to be correct rather than brief. The semantic content of both children's and adults' rules was extracted through our coding of them into lambda calculus, meaning that surface differences in concise expression can be separated from logical complexity. Furthermore, children's guesses were not the only thing that was more elaborate about their behavior. They were also more elaborate in their active testing choices, producing more complex scenes despite having to create these in the same manner as adults. Since the testing interface
was reset on each trial, this complexity took more effort, with children's scenes requiring substantially more clicks and more time to produce than adults'.

Use of verbal protocols

Another worry about our use of free responses is that they rely on a capacity for precise linguistic expression, not to mention the assumption that learners have insight into the structure of their own concepts. It is known that children's vocabularies differ from adults', raising the concern that some of our results reflect language use rather than the concepts being articulated. While our artificial environment contains only simple objects and basic features that are familiar to even young children, there is evidence that children's speech does not distinguish as well among quantifier usage (e.g., all, each, every) until late in childhood (Brooks & Braine, 1996; Inhelder & Piaget, 1958). Thus, it could be that linguistic imprecision is behind some of the differences between children's and adults' guesses. For instance, this seems like a potential explanation for the lack of any exactly correct guesses from children about the quantifier-dependent rule 4, "exactly one is blue". However, a closer look at responses reveals that only 11/47 children guessed a rule that mentioned blue at all. Meanwhile, 37/50 of adults' rules mentioned blue, but all but seven of these were wrong about the particulars of the quantification. In many cases other potential quantifications were not ruled out by adults' testing. For instance, several subjects never tried adding more than one blue object to a scene and later responded that at least one object must be blue. Thus, it seems that children's rules simply picked out different features of the scenes than adults'. An interesting question is whether, in the cases where a child's guess is logically inconsistent with some of their learning data, this is because their representation itself is imprecise, or because their verbal description imprecisely describes their representation. Another possibility could be that adults are better introspectors than children, better able to "read out" the structure of their own representations (Morris, 2021). While these are intriguing possibilities, our current experiment cannot fully resolve these explanations.

Implications for theories of concepts

Psychological theories of concepts have oscillated between symbolic accounts, which seek to explain conceptual productivity and creativity, and similarity accounts, which seek to explain how concepts drive probabilistic generalization. The constructivist framework is based in the symbolic camp; however, it inherits many of the advantages of similarity accounts by maintaining a relationship with probabilistic inference, embodied by the stochastic mechanisms of generation and search. Thus, we see our findings as support for recent claims that higher level cognition utilizes some form of stochastic generative sampling to approximate rational inference (Bramley, Dayan et al., 2017; Sanborn et al., 2021; Zhu, Sanborn, & Chater, 2020) and that this might also explain aspects of human cultural and technological development that take place over populations and multiple generations (Krafft, Shmueli, Griffiths, Tenenbaum, et al., 2021).

While neither the PCFG nor the IDG is an oven-ready process model of human concept formation, they provide a useful starting point for thinking about process accounts. The PCFG framework describes normative inference in the limit of infinite sampling, but also provides a mechanism for both generating and adapting samples. The IDG is a hybrid that seeds hypotheses by trying to describe patterns that are present in observations rather than merely possible, making it more sample-efficient as a brute force approach to inference in situations where a learner already has some positive or demonstrative evidence of a concept. However, its success is dependent on the learner generating or encountering scenes that exemplify and isolate causally relevant features. With enough evidence both approaches should favor the ground truth, but with little evidence the PCFG will tend to entertain many concepts that the IDG does not.

While the IDG captured the data better here, it is not a complete account because, even with an instance-inspired starting point, we still need to explain how a learner adapts in light of new evidence. Following a number of recent research lines (Bramley, Mayrhofer, Gerstenberg, & Lagnado, 2017; Dasgupta, Schulz, & Gershman, 2017; Ullman, Goodman, & Tenenbaum, 2012), we see incremental mutation of one or a few focal hypotheses in the light of evidence as a promising approach. For instance, a learner might use an observation to generate an initial idea akin to our IDG, but then explore permutations of it to generate new scenes to test (Oaksford & Chater, 1994), and to account for these tests (Fränken et al., 2022). While older models like RULEX (Nosofsky & Palmeri, 1998; Nosofsky et al., 1994) provide candidate heuristics for achieving such a search over theories, their long-run behavior lacks a clear relationship with computational-level rationality (Navarro, 2005). However, if a learner's adaptations approximate a valid approximation scheme, for instance accepting proposed permutations with the Metropolis–Hastings probability min(1, P(h')/P(h_t)) (Bramley, Dayan et al., 2017; Dasgupta et al., 2016; Hastings, 1970; Thaker et al., 2017), this can start to explain why more probable hypotheses are discovered more often, as well as why probability matching and order effects are inevitable consequences of approximation (see Fränken et al., 2022). Since the endpoint of an MCMC search approaches an independent posterior sample, we would expect a population of such searchers to end up with a set of hypotheses that look like posterior samples. Moreover, since individual searchers have finite time to search, we would expect order effects and dependence in their ideas over time. To the extent that participants deviate from a probabilistically valid approximation scheme, for instance by "hill climbing" or accepting only strictly better-fitting ideas, we might also explain how they can get stuck in local optima and exhibit mal-adaptive order effects like garden paths (Gelpi, Prystawski, Lucas, & Buchsbaum, 2020). Taking the idea that earlier hypotheses carry information about older evidence and inference, we might also think of a population of such hypotheses as a kind of particle filter (Bramley, Dayan et al., 2017; Daw & Courville, 2008). While acting primarily as a computational-level norm, the PCFG prior provides useful infrastructure for hypothesis search. For example, prior production weights can be used to adapt an existing hypothesis by partially "regrowing" it (Goodman et al., 2008). Furthermore, the production weights implied by a generative prior mechanism, combined with data likelihoods, allow for the principled acceptance or rejection of new proposals in an MCMC-like search scheme. This could result in much greater sample efficiency than either the PCFG or IDG presented here, and it would be interesting to consider combinations of prior- or instance-driven initializations with permutation-based search. For this to become a fully satisfying account of constructivist inference, it would need to be paired with a mechanism for scene generation in line with those we sketch in Fig. 3c&d, so explaining anchoring, order effects, probability matching and confirmation bias in a unified account (Klahr & Dunbar, 1988).

Our modeling of generalizations revealed that there is no straightforward family resemblance between the features of rule-following training scenes (generated by the participant) and rule-following generalization scenes (as pre-selected for the experiment). This resulted in the Similarity model performing at chance and also being completely uncorrelated with participants, while all our symbolic model variants received support. While this is far from an exhaustive comparison with sub-symbolic concept models, even a successful similarity-driven account of generalizations would only account for half of the behavior in this task. As well as generalizing systematically, participants gave detailed natural language descriptions of their ideas. The majority of these we could convert into logical statements (86%) that predicted most generalizations (72%: children, 84%: adults) and were consistent with the majority of their learning data (71%: children, 87%: adults). Any subsymbolic account of concepts would essentially need to be paired with an explanation for how people generate these verbal descriptions of their non-symbolic concepts that nonetheless reflect their
use (cf. Dennett, 1988). Arguably, this task is no easier than the one of generating a symbolic hypothesis about the nature of the world in the first place. Thus we feel that our results are more straightforwardly explained by our symbolic account, whereby the logical structure of the hypotheses participants describe is actually the causal mechanism driving their generalizations, rather than some form of computationally expensive but behaviorally impotent retrospective confabulation (cf. Johansson et al., 2008). Our generalization analysis also showcases the difficulty of predicting human behavior in a setting where there is such a large and long-tailed space of similarly plausible rules an individual might be using to drive their generalizations. Modeling symbolic inference directly from the learning input had some predictive power for adults' generalizations, but simply by asking participants for their best guess, we could immediately get a far better handle on how they would generalize.

While we did not provide a fully satisfying model of scene generation, we did show that participant-generated scenes were better understood as adaptations of earlier scenes than as creations from scratch. We argued that this is consistent with testing driven by one or a couple of conceptually neighboring hypotheses, either generalizing their predictions or contrasting them. This is in some ways a return to pre-Bayesian ideas in philosophy of science, in that testing permits falsification but not confirmation. Even when a hypothesis h survives repeated confirmatory tests, or repeated head-to-head challenges from local alternatives, we might think of it as gaining a degree of confirmation, but there always remains the spectre of potential future falsification (cf. Popper, 1959). We think this better reflects the state of a constructivist learner who cannot know, until discovering it, whether some better hypothesis is waiting in the wings.

For a learner limited to a few hypotheses at a time, the approach has clear virtues: It links the process of adapting a hypothesis with that of coming up with new scenes to test, and links the outcome of tests to the subsequent inferential step of supplanting or reinforcing the currently favored hypothesis. Since learners are always reusing at least some feature or other, it allows the learner's two tasks to support each other, with reuse of modified previous tests and minimal positive examples minimizing the cognitive and physical costs of generating both new tests and new hypotheses (Gershman & Niv, 2010).

Implications for theories of development

Our analyses revealed a variety of developmental differences. Children's guesses were more complex than adults', and consequently we could capture them with a significantly "flatter" generation process that inherently produced a wider diversity of hypotheses. This is potentially normative: Having been exposed to less evidence, with less idea what conceptual compositions and fragments will be useful in understanding their environment, we should expect children's construction process to be less fine-tuned. In other words, children are justified in entertaining a wider set of ideas than adults. However, we noted there are several algorithmic stories that could underpin this diversity: (1) children might simply have a hypothesis generation mechanism that embodies more "high temperature" exploration than adults' (Gopnik, 2020), over and above differences in the flatness of their latent prior. Importantly, while the endpoints of children's theorizing were more diverse than adults', the cognition required to produce their hypotheses is still highly systematic. Children were able to implement a stable-enough symbolic generation or adaptation mechanism to produce meaningful symbolic hypotheses on the large majority of trials, referring to the features and relations they encountered. Even when their hypotheses did a poor job of explaining all the learning data, the hypothesis construction process did not break down entirely, as it would if childlike brain activity were simply random and disorganized. However, the issue remains whether there is just more noise in children's behavior (e.g., they are just a bit more easily distracted compared to adults) as opposed to something like a greater inclination to explore.

Another aspect of constructivism that we did not focus on here, but that is critical to understanding development, is the idea that over time, learners can chunk, cache and recursively reuse concepts to build ever richer ones (cf. Zhao, Bramley et al., 2022). As such, the conceptual library of an adult ought to be more advanced, containing more powerful and complex concepts that can be readily reused to build new concepts. This might lead to a prediction of a different pattern of guesses than we found here. That is, we might have expected adults' concepts to look more complex than children's, not because they are built from more parts, but because the parts they are built from are, themselves, more complex. We suspect that the reason we did not find this sort of pattern here is that our task used very basic abstract features. Presumably our shape and geometric relation concepts are fairly established by around the age of 10. We predict that this would not hold in more applied domains where adults are able to draw on advanced concepts. For instance, when theorizing about economic conditions an adult might refer to advanced primitives like "power laws", "compound growth" or "arbitrage" that we would not expect to exist yet in the conceptual repertoire of many 9–11 year olds.

As well as producing more complex guesses, children also produced more elaborate scenes during learning. One possible characterization is that children's active scene construction was more exploration-driven and less hypothesis-driven than adults' (Wu et al., 2018), perhaps mixing more exploration-driven actions in with hypothesis-driven ones (Meder, Wu, Schulz, & Ruggeri, 2021). Indeed, differences in active exploration are the other side of the coin of the high temperature search idea (Friston, FitzGerald, Rigoli, Schwartenbeck, Pezzulo, et al., 2016; Gopnik, 2020; Klahr & Dunbar, 1988; Schulz, Klenske, Bramley, & Speekenbrink, 2017). However, within each trial, children's testing was more repetitive than adults', suggesting that they made slower progress in exploring the problem space, or were generally less able to keep track of what they had done. The problem of generating informative tests is not quite the same as that of finding the right hypothesis. It is important that tests avoid redundancy and, in combination, serve to probe a wide variety of salient hypotheses. In this sense, adults' testing behavior was more systematic, better reducing global measures of uncertainty and potentially reflecting more metacognitive control over learning (Kuhn & Brannock, 1977; Oaksford & Chater, 1994).
a rationally flatter latent prior, (2) they might additionally explore Curiously, children were more likely to refer to relational and posi-
theory space more radically, over and above differences in the relative tional properties in their guesses, while adults were most likely to make
credibility their latent prior actually attaches to different possibili- guesses that pertained to the primary object features (color and size).
ties (Gopnik, 2020; Lucas et al., 2014; Wu, Schulz, Speekenbrink, This is an independently interesting finding. Since relational features
Nelson, & Meder, 2018) or (3) we also considered that children’s gener- are structurally more complex than primitive features, we might have
ation mechanisms might be more dominated by ‘‘bottom-up’’ processes. predicted they would be more readily evoked by adults. It could be
We take our comparison of PCFG and IDG to speak against option 3. that children bought in more to the scientific reasoning cover story,
Adults’ hypotheses were, as far as we could tell, at least as anchored treating mechanistic explanations, such as that objects must touch or
to idiosyncratic patterns of their learning data as children’s. However, be positioned in particular ways to produce stars, as credible (Gelman,
these data do not distinguish clearly between options (1) and (2). To do 2004). Conversely, adults may have been more likely to expect Gricean
this, one would need to measure children and adults’ prior distributions considerations to apply, e.g. that experimenters would likely set simple
directly. If children’s guesses shift within a problem in a way that is rules using salient but abstract features like color over perceptually
less sensitive to their own relative subjective probabilities than adults, ambiguous properties like position (Szollosi & Newell, 2020). However,
this would support the idea that children’s hypothesis generation is it could also be the case that there are deeper differences between the
20
N.R. Bramley and F. Xu Cognition 238 (2023) 105471
experiences of children and adults that render structural features more relevant to children and surface features more relevant to adults.

Children's guesses were also less consistent with their evidence than adults'. This might be because they were less able to extract common features across all eight learning scenes (Ruggeri & Feufel, 2015; Ruggeri & Lombrozo, 2015). However, it could also be a consequence of a more generalized limitation in ability to generate, store and compare hypotheses. With a flatter prior and limited sampling, one has a lower chance of ever generating a hypothesis that can explain all the evidence. Children also under-generalized, often selecting only 1 or 2 of the 8 test scenes (there were actually always 4), doing so even when their symbolic guesses predicted more should be selected. It could be that children found this part of the task overwhelming, perhaps tending to stop after identifying one or two hypothesis-consistent scenes rather than evaluating all of them. In sum, it seems children were less able to come up with a concise description of all the evidence generated, reflecting both a less developed metacognitive awareness and the skills needed (both verbal and conceptual) to extract patterns.

We created a grammar (specifically a probabilistic context free grammar or PCFG; Ginsburg, 1966) that can be used to produce any rule that can be expressed with first-order logic and lambda abstraction referring to the features participants referred to in our task. The grammatical primitives we assumed are detailed in Table A.1.

There are multiple ways to implement a PCFG. Here we adopt a common approach and set up a set of string-rewrite rules (Goodman et al., 2008). Thus, each hypothesis begins life as a string containing a single non-terminal symbol (here, 𝑆) that is replaced using rewrite rules, or productions. These productions are repeatedly applied to the string, replacing non-terminal symbols with a mixture of other non-terminal symbols and terminal fragments of first order logic, until no non-terminal symbols remain. The productions are so designed that the resulting string is guaranteed to be a valid grammatical expression and all grammatical expressions have a nonzero chance of being produced. In addition, by having the productions tie the expression to bound variables and truth statements, our PCFG serves as an automatic concept generator. Table A.2 details the PCFG we used in the paper. We use capital letters as non-terminal symbols and each rewrite is sampled from the available productions for a given symbol.14 Because some of the productions involve branching (e.g., 𝐵 → 𝐻(𝐵, 𝐵)), the resultant string can become arbitrarily long and complex, involving multiple boolean functions and complex relationships between bound variables.

We include a variant that samples uniformly from the set of possible replacements in each case, but we also reverse engineer a set of productions that produce exactly the statistics of the empirical samples, as described in the main text.

We used the process described in Table A.2 to produce a sample of 10,000 with a uniform generation prior and an additional 10,000 for each participant with a ''held out'' age-consistent prior based on the rule guesses of other participants in the requisite agegroup. For the flipped prior analyses, we used the sample generated for the chronologically first participant from the other agegroup. We chose 10,000 simply because this provided reasonable coverage of the task without exhausting our storage and computational capacity.

(a) For a statement involving an unordered feature there is only one possibility—e.g., {#3}: ''= (𝑥1, red, color)'', or for {#1, #2}: ''= (𝑥1, 𝑥2, color)''.15
(b) For a single cone and an ordered feature, this could also be a nonstrict inequality (≥ or ≤). We assume a learner only samples an inequality if it expands the number of cones picked out from the scene relative to an equality—e.g., in Fig. 2b in the main text, there is also a large cone

14 The grammar is not strictly context free because the bound variables (𝑥1, 𝑥2, etc.) are automatically shared across contexts (e.g. 𝑥1 is evoked twice in both expressions generated in Fig. 2a). We also draw feature-value pairs together and conditional on the type of function they inhabit, to make our process more concise; however, the same sampling is achievable in a context free way by having a separate function for every feature value, i.e. ''isRed()'', and sampling these directly (c.f. Rothe, Lake, & Gureckis, 2017).
15 Numbers prepended with # refer to the labels on the cones in the example observation in Fig. 2b.
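The string-rewriting sampling scheme described in this appendix can be sketched as follows. This is a minimal toy illustration: the production set and weights below are placeholders, not the actual grammar in Table A.2.

```python
import random

# Toy production set: each non-terminal maps to weighted candidate
# replacements. These productions are illustrative placeholders only,
# not the grammar reported in Table A.2.
PRODUCTIONS = {
    "S": [("exists(lambda x1: B)", 0.5), ("forall(lambda x1: B)", 0.5)],
    "B": [("H(B, B)", 0.2),                       # branching production
          ("equal(x1, 'red', 'color')", 0.4),
          ("equal(x1, 3, 'size')", 0.4)],
    "H": [("and_", 0.5), ("or_", 0.5)],
}

def sample_expression(rng, start="S", max_rewrites=100):
    """Rewrite non-terminals until none remain, yielding a terminal
    expression string. Picks the first non-terminal type still present
    and replaces its leftmost occurrence with a weighted sample."""
    string = start
    for _ in range(max_rewrites):
        nonterminals = [s for s in PRODUCTIONS if s in string]
        if not nonterminals:
            return string                  # fully terminal expression
        symbol = nonterminals[0]
        options, weights = zip(*PRODUCTIONS[symbol])
        replacement = rng.choices(options, weights=weights)[0]
        string = string.replace(symbol, replacement, 1)
    return string

rng = random.Random(0)
print(sample_expression(rng))
```

Because the branching production has low weight, the rewriting process terminates almost surely; the `max_rewrites` cap is only a safeguard.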
Table A.1
A concept grammar for the task.

Meaning Expression
There exists an 𝑥𝑖 such that... ∃(𝜆𝑥𝑖 ∶ ., )
For all 𝑥𝑖 ... ∀(𝜆𝑥𝑖 ∶ ., )
There exists {at least, at most, exactly} 𝑁 objects in 𝑥𝑖 such that... 𝑁{<,>,=}(𝜆𝑥𝑖 ∶ ., 𝑁, )
Feature 𝑓 of 𝑥𝑖 has value {larger, smaller, (or) equal} to 𝑣 {<, >, ≤, ≥, =}(𝑥𝑖 , 𝑣, 𝑓 )
Feature 𝑓 of 𝑥𝑖 is {larger, smaller, (or) equal} to feature 𝑓 of 𝑥𝑗 {<, >, ≤, ≥, =}(𝑥𝑖 , 𝑥𝑗 , 𝑓 )
Relation 𝑟 between 𝑥𝑖 and 𝑥𝑗 holds 𝛤 (𝑥𝑖 , 𝑥𝑗 , 𝑟)
Booleans {and, or, not} {∧, ∨, ¬}(𝑥)

Object feature Levels
Color {red, green, blue}
Size {1:small, 2:medium, 3:large}
𝑥-position (0, 8)
𝑦-position (0, 8)
Orientation {upright, left hand side, right hand side, strange}
Grounded true if touching the ground

Pairwise feature Condition
Contact true if 𝑥1 touches 𝑥2
Stacked true if 𝑥1 is above and touching 𝑥2 and 𝑥2 is grounded
Pointing true if 𝑥1 is oriented {left/right} and 𝑥2 is to 𝑥1's {left/right}
Inside true if 𝑥1 is smaller than 𝑥2 and has the same 𝑥 and 𝑦 position (±0.3), false otherwise

Note that {<, >, ≥, ≤} comparisons only apply to numeric features (e.g., size).
Fig. A.1. Three example scenes. Object indices link the most similar set of objects in (b) to those in (a). Numbers below indicate the edit distance for each object (i.e. the sum of scaled dimension adjustments).
(𝜆𝑥1 ∶ ∃(𝜆𝑥2 ∶ ∧(= (𝑥2 , green, color), ≤ (𝑥1 , 𝑥2 , size)), ), )''. The inner quantifier ∃ is selected (three of the four cones are green {#1, #2, #4}), and the outer quantifier ∀ is selected (all cones are less than or equal in size to a green cone).

Note that a procedure like the one laid out above is, in principle, capable of generating any rule generated by the PCFG in Fig. 7(a) & 7(b), but will only do so when exposed to an observation that exemplifies that rule, and will do so more often when the observation is inconsistent with as many other rules as possible (i.e., a minimal positive example). Step 4 allows that non-rule-following scenes can be used to inspire rules involving a negation, for instance that ''something is not upright'', which is semantically equivalent to saying that ''nothing is upright''. Basing hypotheses on instances may improve the quality of the effective sample of hypotheses that the learner generates.

One way to think of the IDG procedure is as a partial inversion of a PCFG, as illustrated by the blue text in the examples in Fig. 2b in the main text. While the PCFG starts at the outside and works inward, the IDG starts from the central content and works outward to a quantified statement, ensuring at each step that this final statement is true of the scene.

We note that it is possible, in principle, to calculate a lower bound on the prior probability of the PCFG or IDG generating a hypothesis that a participant reported, even if it does not occur in our sample. This can be achieved by reverse engineering the production steps that would be needed to produce the precise encoded syntax. This is a lower bound because it does not count semantically equivalent ''phrasings'' of the hypothesis that e.g. mention features in different orders or use logically equivalent combinations of booleans. We found that complex expressions tend to have a large number of ''phrasings''. In our sample-based approximation we implicitly treat semantically equivalent expressions as constituting the same hypothesis, but note that determining semantic equivalence is a nontrivial aspect of constructivist inference that we do not fully address here.

Reverse engineering child-like and adult-like production weights

To roughly accommodate the fact that each guess is based on different learning data, we regularized these counts by including a prior pseudo-count of 5 on all productions. This value was not fit to the data, and simply serves to smooth the predictions a little. For example, children's rules involved ∃ 263 times, ∀ 108 times and 𝑁 297 times, so we assumed prior production weights of {263 + 5, 108 + 5, 297 + 5}∕(263 + 108 + 297 + 15) = {.39, .17, .44}. To avoid double counting the data in modeling subjects' specific guesses, we created a separate agegroup-appropriate prior production weighting for each participant based on the guesses of the other participants from the same agegroup, omitting their own guesses.

Appendix B. Model fitting details

Full generalization model fits

As described in the main text, we fit 18 model variants to participants' data. All models have between 0 and 2 parameters. For each model, we fit the parameter(s) by maximizing the model's likelihood of producing the participant data, using R's optim function. We compare models using the Bayesian Information Criterion (Schwarz, 1978) to accommodate their different numbers of fitted parameters.17 Full results are in Table A.3.

Scene generation model fits

We used a grid search in increments of 0.05 to optimize 𝜂 and 𝜃 and directly optimized 𝜆 for each setting of 𝜂 and 𝜃.

Appendix C. Free response coding

To analyze the free responses, we first had two coders go through all responses and categorize them as either:

1. Correct: The subject gives exactly the correct rule or something logically equivalent.
2. Overcomplicated: The subject gives a rule that over-specifies the criteria needed to produce stars relative to the ground truth. This means the rule they give is logically sufficient but not necessary. For example, stipulating that ''there must be a small red'' is overcomplicated if the true rule is ''there must be a red'' because a scene could contain a medium or large red and emit stars.
3. Overliberal: The opposite of overcomplicated. The subject gives a rule that under-specifies what must happen for the scene to produce stars. For example, stipulating that ''there must be a blue'' if the true rule is that ''exactly one is blue''. This is logically necessary but not sufficient because a scene could contain blue objects but not produce stars because there is not exactly one of them.

17 From one perspective, our derivation of the child-like and adult-like productions constitutes fitting an additional 39 parameters (𝑚 − 1 for each production step), evoking an additional BIC parameter penalty of 39 × log(3940) = 323 for PCFG Agegroup over PCFG Uniform, and similarly for the IDG. If we were to apply this penalty, the uniform weighted variants would be clearly preferred under the BIC criterion at the aggregate level. It is less clear how to apply this penalty at the individual level since the held-out priors are fit to different data than that being modeled. We chose to include the fitted versions alongside the uniform versions here without penalty, as demonstrations of the differences that arise from different generation probabilities.
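The pseudo-count regularization of production weights described in Appendix A can be reproduced in a few lines; the worked example uses the children's quantifier counts reported there (∃ 263, ∀ 108, 𝑁 297):

```python
def smoothed_weights(counts, pseudo=5):
    """Add a fixed pseudo-count to each raw production count, then
    normalize so the weights sum to 1."""
    total = sum(c + pseudo for c in counts)
    return [(c + pseudo) / total for c in counts]

# Children's quantifier production counts from Appendix A.
weights = smoothed_weights([263, 108, 297], pseudo=5)
print([round(w, 2) for w in weights])  # → [0.39, 0.17, 0.44]
```

This matches the reported prior production weights {.39, .17, .44}; the pseudo-count of 5 only nudges the empirical proportions toward uniform.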
Table A.3
Models of participants’ generalizations.
Model Group log(Likelihood) BIC 𝜆 𝜏 N N blind Accuracy
1. Baseline children −1319.75 2639.50 7 13 50%
2. Bias children −1218.96 2445.47 0.32 16 25 50%
3. PCFG Uniform children −1319.72 2647.00 58.17 0 1 61%
4. PCFG Uniform + Bias children −1208.93 2432.97 0.35 2.18 0 0
5. PCFG Flipped children −1318.46 2644.47 8.97 1 1 66%
6. PCFG Flipped + Bias children −1207.28 2429.67 0.34 2.07 0 0
7. PCFG Agegroup children −1319.58 2646.71 24.17 1 1 63%
8. PCFG Agegroup + Bias children −1208.63 2432.36 0.35 2.15 0 0
9. IDG Uniform children −1298.73 2605.02 1.78 1 2 65%
10. IDG Uniform + Bias children −1193.90 2402.90 0.32 1.19 0 0
11. IDG Flipped children −1315.49 2638.54 4.35 1 4 66%
12. IDG Flipped + Bias children −1199.22 2413.54 0.35 1.38 0 0
13. IDG Agegroup children −1308.05 2623.65 2.51 2 5 69%
14. IDG Agegroup + Bias children −1193.41 2401.93 0.34 1.19 0 0
15. Similarity children −1316.44 2640.42 −1.99 0 1 41%
16. Similarity + Bias children −1214.71 2444.52 0.32 −1.30 1 1
17. Symbolic Guess children −1143.69 2294.92 1.02 15 62%
18. Symbolic Guess + Bias children −1067.18 2149.47 0.26 0.80 9
1. Baseline adults −1386.29 2772.59 2 5 50%
2. Bias adults −1364.90 2737.40 0.15 6 6 50%
3. PCFG Uniform adults −1320.64 2648.89 1.27 0 0 63%
4. PCFG Uniform + Bias adults −1253.52 2522.25 0.26 0.68 0 0
5. PCFG Flipped adults −1294.91 2597.42 1.06 1 1 66%
6. PCFG Flipped + Bias adults −1229.18 2473.55 0.24 0.63 0 0
7. PCFG Agegroup adults −1266.96 2541.51 0.94 1 5 69%
8. PCFG Agegroup + Bias adults −1203.64 2422.47 0.23 0.59 0 0
9. IDG Uniform adults −1228.21 2464.02 0.67 2 8 69%
10. IDG Uniform + Bias adults −1179.12 2373.44 0.20 0.48 0 0
11. IDG Flipped adults −1245.56 2498.72 0.76 0 5 73%
12. IDG Flipped + Bias adults −1179.23 2373.65 0.24 0.48 0 0
13. IDG Agegroup adults −1188.28 2384.17 0.62 2 15 74%
14. IDG Agegroup + Bias adults −1134.58 2284.37 0.20 0.44 0 0
15. Similarity adults −1359.05 2725.70 −0.73 0 1 37%
16. Similarity + Bias adults −1337.55 2690.30 0.14 −0.61 0 4
17. Symbolic Guess adults −893.49 1794.58 0.56 32 70%
18. Symbolic Guess + Bias adults −880.59 1776.38 0.08 0.50 4
Note: Boldface indicates best fitting model overall. N blind restricts comparisons to models blind to the symbolic guess. Underlines indicate
best fitting blind model. Accuracy column shows performance of the requisite model on 100 simulated runs through the task using participants’
active learning data with 𝜏 set to 1/100 (i.e. hard maximizing over the model predictions). Biased models perform strictly worse so are not
included in this column.
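The model comparison in Table A.3 uses the standard Bayesian Information Criterion, BIC = k ln n − 2 ln L̂, where k is the number of fitted parameters, n the number of observations, and L̂ the maximized likelihood. A minimal sketch (for the 0-parameter Baseline rows the observation count drops out, so those entries can be checked directly):

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L-hat)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# With 0 parameters BIC reduces to -2 * log-likelihood, matching e.g.
# the children's Baseline row in Table A.3 (-1319.75 -> 2639.50).
print(bic(-1319.75, 0, 1))  # → 2639.5
```

For the 1- and 2-parameter rows the exact BIC values additionally depend on the observation count used in the fits, which is not restated in the table.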
Table A.4
Agreement matrix for independent coders’ free response classifications.
correct overliberal overspecific different vague no rule multiple
correct 93 1 5 0 0 0 0
overliberal 5 13 1 8 0 1 0
overspecific 1 2 42 12 0 0 0
different 0 5 3 224 15 3 0
vague 0 1 2 3 11 6 0
no rule 0 0 0 0 0 31 0
multiple 0 1 0 2 0 0 0
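Table A.4 above gives the two coders' classifications side by side; as a check, Cohen's Kappa for this matrix can be recomputed with a small self-contained sketch, reproducing the value of .77 reported in Appendix C:

```python
def cohens_kappa(matrix):
    """Cohen's Kappa from a square confusion matrix of coder counts:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(len(matrix))]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Counts from Table A.4 (rows: coder 1; columns: coder 2).
table_a4 = [
    [93, 1, 5, 0, 0, 0, 0],    # correct
    [5, 13, 1, 8, 0, 1, 0],    # overliberal
    [1, 2, 42, 12, 0, 0, 0],   # overspecific
    [0, 5, 3, 224, 15, 3, 0],  # different
    [0, 1, 2, 3, 11, 6, 0],    # vague
    [0, 0, 0, 0, 0, 31, 0],    # no rule
    [0, 1, 0, 2, 0, 0, 0],     # multiple
]
print(round(cohens_kappa(table_a4), 2))  # → 0.77
```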
4. Different: The subject gives a rule that is intelligible but different from the ground truth in that it is neither necessary nor sufficient for determining whether a scene will produce stars.
5. Vague or multiple: Nuisance category.
6. No rule: The subject says they cannot think of a rule.

We were able to encode 205/238 (86%) of the children's responses and 219/250 (87%) of the adults' responses as correct, overcomplicated, overliberal or different. Table A.4 shows the complete confusion matrix. The two coders agreed 85% of the time, resulting in a Cohen's Kappa of .77, indicating a good level of agreement (Krippendorff, 2012).

We then had one coder familiar with the grammar go through each free response that was not assigned vague or no rule, and encode it as a function in our grammar. The second coder then blind spot-checked 15% of these rules (64) and agreed in 95% of cases (61/64). The 6 cases of disagreement were discussed and resolved. In 5/6 cases, this was in favor of the primary coder. The full set of free text responses, along with the requisite classifications and encoded rules, is available in the Online Repository.

Appendix D. Scene similarity measurement

To establish the overall similarity between two scenes, we need to map the objects in a given scene to the objects in another scene (for example between the scenes in Fig. A.1a and b) and establish a reasonable cost for the differences between objects across dimensions. We also need a procedure for cases where there are objects in one scene that have no analogue in the other. We approach the calculation of similarity via the principle of minimum edit distance (Levenshtein, 1966). This means summing up the elementary operations required to convert scene (a) into scene (b) or vice versa. We assume objects can be adjusted in one dimension at a time (i.e. moving them on the 𝑥 axis, rotating them, or changing their color, and so on).
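A minimal sketch of this minimum-edit-distance calculation follows, assuming objects are already represented as vectors of Z-scored feature values. The two-dimensional feature vectors and the exhaustive-mapping helper below are illustrative (the real objects have more dimensions, and this sketch omits the final normalization by the number of shared objects):

```python
from itertools import permutations

# Default penalty for an object with no analogue in the other scene:
# the average inter-object divergence reported in Appendix D.
DEFAULT_PENALTY = 3.57

def object_cost(a, b):
    """L1 (city block) cost of converting object a into object b,
    summed over already Z-scored feature dimensions."""
    return sum(abs(u - v) for u, v in zip(a, b))

def scene_distance(scene_a, scene_b):
    """Charitable mapping: try every assignment of the smaller scene's
    objects to the larger scene's objects, keep the cheapest total, and
    charge the default penalty for each unmatched object."""
    small, large = sorted([scene_a, scene_b], key=len)
    best = float("inf")
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(object_cost(small[i], large[j])
                   for i, j in enumerate(perm))
        best = min(best, cost)
    unmatched = len(large) - len(small)
    return best + unmatched * DEFAULT_PENALTY

# Hypothetical 2-D scenes: two objects each, nearly aligned.
a = [(0.0, 0.0), (1.0, 1.0)]
b = [(0.1, 0.0), (1.0, 0.5)]
print(round(scene_distance(a, b), 2))  # → 0.6
```

Exhaustive search over mappings is factorial in scene size, which is fine for the small scenes used here but would need a Hungarian-algorithm style solver for larger scenes.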
24
N.R. Bramley and F. Xu Cognition 238 (2023) 105471
Fig. A.2. (a) The average minimum edit distance summed across shared objects. (b) Rescaling (a) by dividing by the number of objects. (c) The penalty for additional or omitted objects. (d) Combined distance as in the main text.
Fig. A.3. Generalization accuracy by number of objects per test scene, compared with the 10-rule adult pilot from Bramley et al. (2018).
Before focusing on how to map the objects between the scenes, we must decide how to measure the adjustment distance for a particular object in scene (a) to its supposed analogue in scene (b). As a simple way to combine the edit costs across dimensions, we first 𝑍-score each dimension, such that the average distance between any two values across all objects and all scenes and dimensions is 1. We then take the L1-norm (or city block distance) as the cost for converting an object in scene (a) to an object in scene (b), or vice versa. Note this is sensitive to the size of the adjustment, penalizing larger changes in position, orientation or size more severely than smaller changes, while
changes in color are all considered equally large, since color is taken as categorical. Note also that for orientation differences we always assume the shortest distance around the circle.

If scene (a) has an object that does not exist in scene (b), we assume a default adjustment penalty equal to the average divergence between two objects across all comparisons (3.57 in the current dataset). We do the same for any object that exists in (b) but not (a).

Calculating the overall similarity between two scenes involves solving a mapping problem of identifying which objects in scene (a) are ''the same'' as those in scene (b). We resolve this ''charitably'', by searching exhaustively for the mapping of objects in scene (a) to those in scene (b) that minimizes the total edit distance. Having selected this mapping, and computed the final edit distance including any costs for additional or removed objects, we divide by the number of shared cones, so as to avoid the dissimilarities increasing with the number of objects involved.

Fig. A.2 computes the inter-scene similarity components that go into Fig. 6c in the main text. Summing up the edit distances across all objects, children's scenes seem much more diverse than adults' (Fig. A.2a). However, this is primarily due to their containing a greater average number of objects. Scaling the edit distance by the number of objects in the target scene gives a more balanced perspective (Fig. A.2b) but does not account for the fact that the compared scene may contain more or fewer objects in total. Fig. A.2c visualizes just the object difference, showing that children's scenes contain roughly as many objects on average as the initial example while adults' scenes contain around 0.75 fewer objects than are present in the initial example (dark shading in top row). Thus, we opted to combine (b) and (c) by weighting the unsigned cone difference by the mean inter-object distance across all comparisons to give our combined distance measure (Fig. A.2d and Fig. 6c in the main text).

Appendix E. Comparison with Bramley et al. (2018)

Finally, for interest and to demonstrate replication of our core results, we provide a direct comparison between the generalization accuracies in the current sample of children and adults and those in the sample of 30 adults modeled in Bramley et al. (2018). Bramley et al. (2018) included 10 ground truth concepts, and the current paper uses just the first five of these. Fig. A.3 shows these accuracy patterns side by side, revealing that the adults in the current experiment performed approximately as well as those in the original conference paper.

References

Allen, K. R., Smith, K. A., & Tenenbaum, J. B. (2020). Rapid trial-and-error learning with simulation supports flexible tool use and physical reasoning. Proceedings of the National Academy of Sciences, 117(47), 29302–29310.
Bonawitz, E. B., Denison, S., Gopnik, A., & Griffiths, T. L. (2014). Win-Stay, Lose-Sample: A simple sequential algorithm for approximating Bayesian inference. Cognitive Psychology, 74, 35–65.
Box, G. E. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.
Bramley, N. R., Dayan, P., Griffiths, T. L., & Lagnado, D. A. (2017). Formalizing Neurath's ship: Approximate algorithms for online causal learning. Psychological Review, 124(3), 301–338.
Bramley, N. R., Jones, A., Gureckis, T. M., & Ruggeri, A. (2022). Children's failure to control variables may reflect adaptive decision making. Psychonomic Bulletin & Review, 29, 2314–2324.
Bramley, N. R., Lagnado, D. A., & Speekenbrink, M. (2015). Conservative forgetful scholars: How people learn causal structure through interventions. Journal of Experimental Psychology: Learning, Memory & Cognition, 41(3), 708–731.
Bramley, N. R., Mayrhofer, R., Gerstenberg, T., & Lagnado, D. A. (2017). Causal learning from interventions and dynamics in continuous time. In Proceedings of the 39th annual meeting of the cognitive science society. Austin, TX: Cognitive Science Society.
Bramley, N. R., Rothe, A., Tenenbaum, J. B., Xu, F., & Gureckis, T. M. (2018). Grounding compositional hypothesis generation in specific instances. In Proceedings of the 40th annual meeting of the cognitive science society. Austin, TX: Cognitive Science Society.
Brooks, P. J., & Braine, M. D. (1996). What do children know about the universal quantifiers all and each? Cognition, 60(3), 235–268.
Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. Routledge.
Bruner, J. S., Jolly, A., & Sylva, K. (1976). Play: Its role in development and evolution. Penguin.
Campbell, D. T. (1960). Blind variation and selective retention in creative thought as in other knowledge processes. Psychological Review, 67, 380–400.
Carey, S. (1985). Are children fundamentally different kinds of thinkers and learners than adults? Thinking and Learning Skills, 2, 485–517.
Carey, S. (2009). The origin of concepts: Oxford series in cognitive development. England: Oxford University Press.
Chen, Z., & Klahr, D. (1999). All other things being equal: Acquisition and transfer of the control of variables strategy. Child Development, 70(5), 1098–1120.
Church, A. (1932). A set of postulates for the foundation of logic. Annals of Mathematics, 346–366.
Clark, A. (2012). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral Brain Sciences, 1–86.
Coenen, A., Rehder, R., & Gureckis, T. M. (2015). Strategies to intervene on causal systems are adaptively selected. Cognitive Psychology, 79, 102–133.
Dasgupta, I., Schulz, E., & Gershman, S. J. (2016). Where do hypotheses come from? Center for Brains, Minds and Machines (Preprint).
Dasgupta, I., Schulz, E., & Gershman, S. J. (2017). Where do hypotheses come from? Cognitive Psychology, 96, 1–25.
Daw, N., & Courville, A. (2008). The pigeon as particle filter. Advances in Neural Information Processing Systems, 20, 369–376.
Dennett, D. C. (1988). The intentional stance in theory and practice. In R. Byrne, & A. Whiten (Eds.), Machiavellian intelligence (pp. 180–202). Oxford, UK: Oxford University Press.
Dennett, D. C. (1991). Consciousness explained. London, UK: Penguin.
Ellis, K., Wong, C., Nye, M., Sable-Meyer, M., Cary, L., Morales, L., et al. (2020). Dreamcoder: Growing generalizable, interpretable knowledge with wake-sleep bayesian program learning. arXiv preprint arXiv:2006.08381.
Fedyk, M., & Xu, F. (2018). The epistemology of rational constructivism. Review of Philosophy and Psychology, 9(2), 343–362.
Feldman, J. (2000). Minimization of Boolean complexity in human concept learning. Nature, 407(6804), 630.
Fodor, J. A. (1975). vol. 5, The language of thought. Harvard University Press.
Fränken, J.-P., Theodoropoulos, N. C., & Bramley, N. R. (2022). Algorithms for adaptation in inductive inference. Cognitive Psychology.
Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., Pezzulo, G., et al. (2016). Active inference and learning. Neuroscience & Biobehavioral Reviews, 68, 862–879.
Gelman, S. A. (2004). Psychological essentialism in children. Trends in Cognitive Sciences, 8(9), 404–409.
Gelpi, R., Prystawski, B., Lucas, C. G., & Buchsbaum, D. (2020). Incremental hypothesis revision in causal reasoning across development.
Gershman, S. J., & Niv, Y. (2010). Learning latent structure: carving nature at its joints. Current Opinion in Neurobiology, 20(2), 251–256.
Gettys, C. F., & Fisher, S. D. (1979). Hypothesis plausibility and hypothesis generation. Organizational Behavior and Human Performance, 24(1), 93–110.
Ginsburg, S. (1966). The mathematical theory of context free languages. McGraw-Hill Book Company.
Goodman, N. D., Tenenbaum, J. B., Feldman, J., & Griffiths, T. L. (2008). A rational analysis of rule-based concept learning. Cognitive Science, 32(1), 108–154.
Goodman, N. D., Ullman, T. D., & Tenenbaum, J. B. (2011). Learning a theory of causality. Psychological Review, 118(1), 110–119.
Gopnik, A. (1996). The scientist as child. Philosophy of Science, 63(4), 485–514.
Gopnik, A. (2020). Childhood as a solution to explore–exploit tensions. Philosophical Transactions of the Royal Society B, 375(1803), Article 20190502.
Gopnik, A., Glymour, C., Sobel, D., Schulz, L. E., Kushnir, T., & Danks, D. (2004). A theory of causal learning in children: Causal maps and Bayes nets. Psychological Review, 111, 1–31.
Griffiths, T. L., Lieder, F., & Goodman, N. D. (2015). Rational use of cognitive resources: Levels of analysis between the computational and the algorithmic. Topics in Cognitive Science, 7, 217–229.
Griffiths, T. L., & Tenenbaum, J. B. (2009). Theory-based causal induction. Psychological Review, 116, 661–716.
Gureckis, T. M., & Markant, D. B. (2012). Self-directed learning: A cognitive and computational perspective. Perspectives on Psychological Science, 7(5), 464–481.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications.
Heath, C. (2004). Zendo–Design history. Retrieved from http://www.koryheath.com/zendo/design-history/.
Howson, C., & Urbach, P. (2006). Scientific reasoning: The Bayesian approach. Open Court Publishing.
von Humboldt, W. (1863/1988). On language. New York: Cambridge University Press.
Inhelder, B., & Piaget, J. (1958). vol. 22, The growth of logical thinking from childhood to adolescence: An essay on the construction of formal operational structures. Psychology Press.
Johansson, P., Hall, L., & Sikström, S. (2008). From change blindness to choice blindness. Psychologia, 51(2), 142–155.
Johnson, R. B., Onwuegbuzie, A. J., & Turner, L. A. (2007). Toward a definition of mixed methods research. Journal of Mixed Methods Research, 1(2), 112–133.
Johnson-Laird, P. N. (1983). Mental models: Towards a cognitive science of language, inference, and consciousness. Cambridge: Cambridge University Press.
Kemp, C., & Tenenbaum, J. B. (2009). Structured statistical models of inductive reasoning. Psychological Review, 116(1), 20.
Klahr, D., & Dunbar, K. (1988). Dual space search during scientific reasoning. Cognitive Science, 12(1), 1–48.
Klahr, D., Fay, A. L., & Dunbar, K. (1993). Heuristics for scientific experimentation: A developmental study. Cognitive Psychology, 25(1), 111–146.
Klahr, D., Zimmerman, C., & Jirout, J. (2011). Educational interventions to advance children’s scientific thinking. Science, 333(6045), 971–975.
Klayman, J., & Ha, Y.-W. (1989). Hypothesis testing in rule discovery: Strategy, structure, and content. Journal of Experimental Psychology: Learning, Memory & Cognition, 15(4), 596.
Komatsu, L. K. (1992). Recent views of conceptual structure. Psychological Bulletin, 112(3), 500.
Krafft, P. M., Shmueli, E., Griffiths, T. L., Tenenbaum, J. B., et al. (2021). Bayesian collective learning emerges from heuristic social learning. Cognition, 212, Article 104469.
Krippendorff, K. (2012). Content analysis: An introduction to its methodology. Sage.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99(1), 22.
Kuhn, D., & Brannock, J. (1977). Development of the isolation of variables scheme in experimental and ‘‘natural experiment’’ contexts. Developmental Psychology, 13(1), 9.
Lagnado, D. A., & Sloman, S. A. (2006). Time as a guide to cause. Journal of Experimental Psychology: Learning, Memory & Cognition, 32(3), 451–460.
Lai, L., & Gershman, S. J. (2021). Policy compression: An information bottleneck in action selection.
Lakatos, I. (1976). Falsification and the methodology of scientific research programmes. In Can theories be refuted? (pp. 205–259). Springer.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.
Lapidow, E., & Walker, C. M. (2020). The search for invariance: Repeated positive testing serves the goals of causal learning. Language and Concept Acquisition from Infancy Through Childhood, 197–219.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
Lewis, O., Perez, S., & Tenenbaum, J. (2014). Error-driven stochastic search for theories and concepts. In Proceedings of the 36th annual meeting of the cognitive science society.
Lieder, F., & Griffiths, T. L. (2020). Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources. Behavioral and Brain Sciences, 43.
Lieder, F., Griffiths, T. L., Huys, Q. J., & Goodman, N. D. (2018). The anchoring bias reflects rational use of cognitive resources. Psychonomic Bulletin & Review, 25(1), 322–349.
Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of category learning. Psychological Review, 111(2), 309.
Lucas, C. G., Bridgers, S., Griffiths, T. L., & Gopnik, A. (2014). When children are better (or at least more open-minded) learners than adults: Developmental differences in learning the forms of causal relationships. Cognition, 131(2), 284–299.
Lucas, C. G., & Griffiths, T. L. (2010). Learning the form of causal relationships using hierarchical Bayesian models. Cognitive Science, 34(1), 113–147.
Luce, D. R. (1959). Individual choice behavior. New York: Wiley.
Markant, D. B., & Gureckis, T. M. (2014). Is it better to select or to receive? Learning via active and passive hypothesis testing. Journal of Experimental Psychology: General, 143(1), 94.
Marr, D. (1982). Vision. New York: Freeman & Co.
McCormack, T., Bramley, N. R., Frosch, C., Patrick, F., & Lagnado, D. A. (2016). Children’s use of interventions to learn causal structure. Journal of Experimental Child Psychology, 141, 1–22.
Meder, B., Wu, C. M., Schulz, E., & Ruggeri, A. (2021). Development of directed and random exploration in children. Developmental Science, 24(4), Article e13095.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85(3), 207.
Meng, Y., Bramley, N., & Xu, F. (2018). Children’s causal interventions combine discrimination and confirmation. In Proceedings of the 40th annual meeting of the cognitive science society.
Michalski, R. S. (1969). On the quasi-minimal solution of the general covering problem. In Proceedings of the 5th annual symposium on information processing (Vol. A3, pp. 125–128).
Morris, A. (2021). Invisible gorillas in the mind: Internal inattentional blindness and the prospect of introspection training.
Navarro, D. J. (2005). Analyzing the RULEX model of category learning. Journal of Mathematical Psychology, 49(4), 259–275.
Navarro, D. J., & Perfors, A. F. (2011). Hypothesis generation, sparse categories, and the positive test strategy. Psychological Review, 118(1), 120.
Nelson, J. D., Divjak, B., Gudmundsdottir, G., Martignon, L. F., & Meder, B. (2014). Children’s sequential information search is sensitive to environmental probabilities. Cognition, 130(1), 74–80.
Nickerson, R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2), 175.
Nosofsky, R. M., & Palmeri, T. J. (1998). A rule-plus-exception model for classifying objects in continuous-dimension spaces. Psychonomic Bulletin & Review, 5(3), 345–369.
Nosofsky, R. M., Palmeri, T. J., & McKinley, S. C. (1994). Rule-plus-exception model of classification learning. Psychological Review, 101(1), 53.
Nye, M. I., Solar-Lezama, A., Tenenbaum, J. B., & Lake, B. M. (2020). Learning compositional rules via neural program synthesis. arXiv preprint arXiv:2003.05562.
Oaksford, M., & Chater, N. (1994). Another look at eliminative and enumerative behaviour in a conceptual task. European Journal of Cognitive Psychology, 6(2), 149–169.
Oaksford, M., & Chater, N. (2007). Bayesian rationality: The probabilistic approach to human reasoning. Oxford: Oxford University Press.
Osborne, M., Garnett, R., Ghahramani, Z., Duvenaud, D. K., Roberts, S. J., & Rasmussen, C. (2012). Active learning of model evidence using Bayesian quadrature. Advances in Neural Information Processing Systems, 25.
Phillips, D. C. (1995). The good, the bad, and the ugly: The many faces of constructivism. Educational Researcher, 24(7), 5–12.
Piaget, J. (2013). The construction of reality in the child (Vol. 82). Routledge.
Piaget, J., & Valsiner, J. (1930). The child’s conception of physical causality. Transaction Pub.
Piantadosi, S. T. (2021). The computational origin of representation. Minds and Machines, 31(1), 1–58.
Piantadosi, S. T., & Jacobs, R. A. (2016). Four problems solved by the probabilistic language of thought. Current Directions in Psychological Science, 25(1), 54–59.
Piantadosi, S. T., Tenenbaum, J. B., & Goodman, N. D. (2012). Bootstrapping in a language of thought: A formal model of numerical concept learning. Cognition, 123(2), 199–217.
Piantadosi, S. T., Tenenbaum, J. B., & Goodman, N. D. (2016). The logical primitives of thought: Empirical foundations for compositional cognitive models. Psychological Review, 123(4), 392.
Popper, K. (1959). The logic of scientific discovery. Routledge.
Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77(3, Pt. 1), 353.
Quine, W. V. O. (1969). Word and object. MIT Press.
Rothe, A., Lake, B. M., & Gureckis, T. M. (2017). Question asking as program generation. In Neural information processing systems.
Ruggeri, A., & Feufel, M. (2015). How basic-level objects facilitate question-asking in a categorization task. Frontiers in Psychology, 6, 918.
Ruggeri, A., & Lombrozo, T. (2014). Learning by asking: How children ask questions to achieve efficient search. In Proceedings of the 36th annual meeting of the cognitive science society (pp. 1335–1340). Austin, TX: Cognitive Science Society.
Ruggeri, A., & Lombrozo, T. (2015). Children adapt their questions to achieve efficient search. Cognition, 143, 203–216.
Ruggeri, A., Lombrozo, T., Griffiths, T. L., & Xu, F. (2016). Sources of developmental change in the efficiency of information search. Developmental Psychology, 52(12), 2159.
Ruis, L., Andreas, J., Baroni, M., Bouchacourt, D., & Lake, B. M. (2020). A benchmark for systematic generalization in grounded language understanding. arXiv preprint arXiv:2003.05161.
Rule, J. S., Schulz, E., Piantadosi, S. T., & Tenenbaum, J. B. (2018). Learning list concepts through program induction. BioRxiv, Article 321505.
Rule, J. S., Tenenbaum, J. B., & Piantadosi, S. T. (2020). The child as hacker. Trends in Cognitive Sciences.
Sanborn, A. N., & Chater, N. (2016). Bayesian brains without probabilities. Trends in Cognitive Sciences.
Sanborn, A. N., Zhu, J., Spicer, J., Sundh, J., León-Villagrá, P., & Chater, N. (2021). Sampling as the human approximation to probabilistic inference.
Schulz, L. E., Goodman, N. D., Tenenbaum, J. B., & Jenkins, A. C. (2008). Going beyond the evidence: Abstract laws and preschoolers’ responses to anomalous data. Cognition, 109(2), 211–223.
Schulz, E., Klenske, E. D., Bramley, N. R., & Speekenbrink, M. (2017). Strategic exploration in human adaptive control. In Proceedings of the 39th annual meeting of the cognitive science society. The Cognitive Science Society.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.
Shackle, S. (2015). Science and serendipity: Famous accidental discoveries: Most scientific breakthroughs take years of research–but often, serendipity provides the final push, as these historic discoveries show. New Humanist, 2.
Shanks, D. R., Tunney, R. J., & McCarthy, J. D. (2002). A re-examination of probability matching and rational choice. Journal of Behavioral Decision Making, 15(3), 233–250.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237(4820), 1317–1323.
Shepard, R. N., & Chang, J.-J. (1963). Stimulus generalization in the learning of classifications. Journal of Experimental Psychology, 65(1), 94.
Sim, Z. L., & Xu, F. (2017). Learning higher-order generalizations through free play: Evidence from 2- and 3-year-old children. Developmental Psychology, 53(4), 642.
Simon, H. A. (2013). Administrative behavior. Simon and Schuster.
Sobel, D. M., & Kushnir, T. (2006). The importance of decision making in causal learning from interventions. Memory & Cognition, 34(2), 411–419.
Stewart, N., Chater, N., & Brown, G. D. A. (2006). Decision by sampling. Cognitive Psychology, 53(1), 1–26.
Steyvers, M., Tenenbaum, J. B., Wagenmakers, E., & Blum, B. (2003). Inferring causal networks from observations and interventions. Cognitive Science, 27, 453–489.
Szollosi, A., & Newell, B. R. (2020). People as intuitive scientists: Reconsidering statistical explanations of decision making. Trends in Cognitive Sciences.
Tenenbaum, J. B. (1999). A Bayesian framework for concept learning (Ph.D. thesis). Massachusetts Institute of Technology.
Thaker, P., Tenenbaum, J. B., & Gershman, S. J. (2017). Online learning of symbolic concepts. Journal of Mathematical Psychology, 77, 10–20.
Turing, A. M. (1937). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 2(1), 230–265.
Turing, A. M. (2009). Computing machinery and intelligence. In Parsing the Turing test (pp. 23–65). Springer.
Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327.
Ullman, T. D., Goodman, N. D., & Tenenbaum, J. B. (2012). Theory learning as stochastic search in the language of thought. Cognitive Development, 27(4), 455–480.
Valkov, L., Chaudhari, D., Srivastava, A., Sutton, C., & Chaudhuri, S. (2018). HOUDINI: Lifelong learning as program synthesis. In Advances in neural information processing systems (pp. 8687–8698).
Van Laarhoven, P. J., & Aarts, E. H. (1987). Simulated annealing. In Simulated annealing: Theory and applications (pp. 7–15). Springer.
Van Rooij, I., Blokpoel, M., Kwisthout, J., & Wareham, T. (2019). Cognition and intractability: A guide to classical and parameterized complexity analysis. Cambridge University Press.
Vul, E., Goodman, N. D., Griffiths, T. L., & Tenenbaum, J. B. (2009). One and done? Optimal decisions from very few samples. In Proceedings of the 31st annual meeting of the cognitive science society (Vol. 1, pp. 66–72). Austin, TX: Cognitive Science Society.
Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12(3), 129–140.
Wason, P. C. (1968). Reasoning about a rule. The Quarterly Journal of Experimental Psychology, 20(3), 273–281.
Wu, C. M., Schulz, E., Speekenbrink, M., Nelson, J. D., & Meder, B. (2018). Generalization guides human exploration in vast decision spaces. Nature Human Behaviour, 2(12), 915–924.
Xu, F. (2019). Towards a rational constructivist theory of cognitive development. Psychological Review, 126(6), 841.
Zhao, B., Bramley, N. R., & Lucas, C. (2022). Powering up causal generalization: A model of human conceptual bootstrapping with adaptor grammars. In Proceedings of the 44th annual meeting of the cognitive science society.
Zhao, B., Lucas, C. G., & Bramley, N. R. (2022). How do people generalize causal relations over objects? A non-parametric Bayesian account. Computational Brain & Behavior, 5(1), 22–44.
Zhu, J.-Q., Sanborn, A. N., & Chater, N. (2020). The Bayesian sampler: Generic Bayesian inference causes incoherence in human probability judgments. Psychological Review, 127(5), 719.