
Chapter 23

COMPUTATIONAL MODELING

Adele Diederich

This chapter is a revised version of Diederich and Busemeyer (2012).
https://doi.org/10.1037/0000319-023
APA Handbook of Research Methods in Psychology, Second Edition: Vol. 2. Research Designs: Quantitative, Qualitative, Neuropsychological, and Biological, H. Cooper (Editor-in-Chief). Copyright © 2023 by the American Psychological Association. All rights reserved.

Over the last 50 years or so, computational modeling has become a rapidly growing approach in many scientific disciplines. In psychology, computational modeling is more recent but has grown exponentially. It has predominantly been applied in cognitive psychology, in particular in learning, memory, categorization, pattern recognition, psycholinguistics, vision, and decision making; lately, developmental psychology, social psychology, and even clinical psychology have taken an interest in the approach. It is also bridging the neurosciences and the behavioral sciences. In the last 15 years, several (introductory) books have been written on how to do computational modeling in psychology and in neuroscience. For instance, Sun (2008a) devoted an entire handbook to computational modeling in psychology; Busemeyer and Diederich (2010) provided a tutorial that reviews the methods and steps used in cognitive modeling; in the same vein, Lewandowsky and Farrell (2011) outlined the principles and practice of computational modeling in cognition; and Farrell and Lewandowsky (2018) updated their textbook, extended it to newer modeling practice, and added examples of models in psychology and neuroscience to demonstrate the steps typically involved in computational modeling. Besides many journal articles on specific techniques of computational modeling, a psychological journal—Computational Brain and Behavior—has recently been founded as another outlet for the growing work in this field.

Given that mathematical psychology (Chapter 22, this volume) has been an established field since the middle of the last century, the questions arise whether computational modeling is different from mathematical modeling; whether there is something like computational psychology; and, if so, whether it is different from mathematical psychology.

To answer these questions, it is helpful to distinguish between mathematical/computational models and mathematical/computational modeling. In general, a model is an abstraction representing some aspects of a complex phenomenon or situation in the real world. A formal model applies mathematical and statistical methods, formal logic, or computer simulation to establish a correspondence between a particular set of empirical phenomena and a formal system reflecting the assumptions about specific entities or objects and their relations to each other. The model may thereby take on various structures, such as algebraic or geometric, and forms, such as axiomatic systems, systems of equations, algorithms, networks, and so on.
Models are static or dynamic, deterministic or probabilistic, linear or nonlinear in nature, and so forth. They are designed to describe internal representations, processes, functions, mechanisms, and structures. The goal of a formal model is to derive predictions that can be connected to data observed in the real world. These data are often obtained in experiments. Interpreting data in light of the model's predictions is crucial for the modeling process and often leads to a modification of the model. This entire process is referred to as mathematical modeling. When the phenomena of interest stem from psychological questions, the area is called mathematical psychology (Chapter 22, this volume), and the process is often called cognitive modeling.

Sometimes, a mathematical model representing the phenomena or situation is so complex that an analytical solution is not readily available or a closed-form solution does not even exist. Other times, experiments are too expensive or may not even be feasible to conduct. For those cases, a computational approach is considered a valuable alternative. Computational approaches develop computer algorithms and programs and implement them on a computer, and the computer simulations derive the predictions of the model and also generate data. Instead of deriving an analytical solution to the problem, a computer simulation—that is, changing the parameters of the system in the computer—provides the basis for studying and evaluating the model by comparing the simulated data to the outcome of the experiments.

Many computational models are computer simulations of complex mathematical equations that cannot be solved in a simple form and must be simulated by computer. Some other computational models are simply based on if–then production rules and are harder to see as mathematical models. Technically, they are finite-state machines, or, if a probabilistic selection of rules is applied, they become Markov models in a large state space. However, these types of computational models are never expressed mathematically, even though they could be. Models such as Act-R (Lebière & Anderson, 1993) or Clarion (Sun, 2009) use some mathematical equations, such as power function decay, linear combinations of inputs to produce activation values, and mathematical reinforcement learning rules. But they are also heavily reliant on production rules. That is, technically, they could be formalized as mathematical models (dynamic systems, stochastic dynamic systems, finite-state machines, Markov processes); however, in practice, it is difficult to see the mathematics in some of them, such as pure production rule systems.

With this description of a computational approach in mind, are mathematical models then a subset of computational models, as Sun (2008b) claimed? Or are they merely a tool in mathematical psychology, one of many methods that can be applied in the modeling process, with the advantage that data can be simulated easily and extremely quickly? After all, computational models require a mathematically and logically formal representation of the problem and could therefore be considered a subset of mathematical models.

In light of the many disciplines involved in computational modeling research, it seems impossible to come up with an agreed-upon definition. There are interesting discussions going on in some disciplines outside of psychology. Indeed, an entire subfield of computer science studies the relationships and differences between computational and mathematical models. I am not dwelling on it here further (Diederich & Busemeyer, 2012) and agree with Wilson and Collins (2019) that computational modeling in behavioral science uses precise mathematical models.

Many formal models in (cognitive) psychology are functional in nature. They are built to describe and explain psychological mechanisms and processes and to predict (human) performance—that is, what can be expected to happen under various conditions. The (cognitive) architecture of a model reflects its structure and computational rules, that is, its functional organization. Cognitive architectures are classified as symbolic, subsymbolic (connectionist), or hybrid, depending on the assumed properties of the models and the rules applied to the system.
Symbolic modeling has its origin in computer science, in particular artificial intelligence, which focuses on building enhanced intelligence into computer systems. Patterns of reasoning, such as logical operations or processing systems with their respective specified rules, operate by using symbols. From early on, these models have attracted a broad range of disciplines from philosophy to engineering. They have been applied to semantic representation, language processing, knowledge representation, reasoning, speech recognition, computer vision, robotics, various expert systems, and many more (see Nilsson, 2010, for a history and the achievements of artificial intelligence with a focus on engineering, and Boden, 2006, for a comprehensive work on the history of cognitive science, which also includes the philosophical perspective of computational approaches).

Subsymbolic or connectionist models use the analogy to neural networks as observed in the human brain. Activation patterns among large numbers of processing units (neurons) encode knowledge and specific content. Connectionist models are developed in many scientific disciplines, from computer science to neuroscience, natural science, cognitive science, and the behavioral sciences. They have been applied to speech recognition and speech generation, prediction of financial indices, identification of cancerous cells, automatic recognition of handwritten characters, sexing of faces, and many more. Hybrid architectures combine both types of processing and are becoming more interesting for cognitive modelers (Diederich & Busemeyer, 2012). Two simple connectionist types of models for categorical learning serve as examples of the modeling process later.

THE ADVANTAGE OF FORMAL MODELS AND MODELING

Modeling is part of the scientific method (e.g., Hepburn & Andersen, 2021; van Rooij & Baggio, 2021; Voit, 2019). Roughly, starting with an observation or a general theory, we formulate assumptions (or hypotheses), derive predictions, test them in the laboratory or in the field, and possibly modify our assumptions to give a better account of the data. Formal modeling has several advantages over a merely verbal modeling approach. First, it forces the researcher to give precise definitions and clear statements. This requires a high degree of abstraction: Assumptions about underlying processes, relations between entities, interactions between variables, and so on all need to be mapped onto mathematical objects and operations. The language of mathematics minimizes the risk of making contradictory statements in the theory. Second, formal modeling allows deriving precise predictions from the underlying assumptions, thereby enabling empirical falsification of these assumptions. Furthermore, deriving predictions is particularly important and useful when they are not obvious. Testable predictions may be useful for deciding between competing theories of a given phenomenon. They are not necessarily quantitative but can also reflect qualitative patterns, which can be observable in the data. Third, mathematical modeling brings together theory and data; it facilitates the analysis and interpretation of complex data and helps generate new hypotheses. Fourth, even rather simple mathematical models often describe data better and are more informative than a statistical test of a verbally phrased hypothesis. Finally, formal models can provide a unifying language and methodology that can be used across disciplines ranging from experimental psychology to cognitive science, computer science, and neuroscience.

USAGE OF COMPUTATIONAL MODELS

Wilson and Collins (2019) identified the application of computational models predominantly in four different fields: simulation, parameter estimation, model comparison, and latent variable inference. These are parts of the modeling process. How to design a good computational/mathematical model is a different topic and goes beyond the scope of this chapter.
Simulation

Simulation, in this context, refers to generating (a) artificial data by inserting specific parameter values into the model, (b) qualitative and quantitative predictions of the model by inserting a wide range of different parameter values (combinations), or (c) a combination of both. Simulation is particularly important in the model-building phase (Fan, 2012). Wilson and Collins (2019) suggested also including qualitative properties of the model; it may give the modeler a better intuition for the model's behavior and may allow for specific predictions. For instance, take the popular diffusion model for binary choice options, which accounts simultaneously for (mean) choice response times and choice frequencies (Ratcliff, 2012). It assumes that all the information of the stimuli (provided in attributes or dimensions) is mapped onto one so-called drift rate, µ. It reflects the mean tendency to choose one alternative over the other. The better the two stimuli can be discriminated, the larger is µ. To account for specific choice patterns, additional (ad hoc) assumptions are necessary (Ratcliff, 2012). The multiattribute diffusion model for binary choice options (Diederich, 1997), on the other hand, links attribute information sequentially to separate stages—that is, one piece of information is delivered at t0, a second piece of information at t1, and so on. Diederich and Oswald (2014) showed that for two stages, only the relationship (larger, smaller) between the drift rate parameters of the first stage and the second stage is crucial for predicting a specific pattern between mean choice response time and choice probability, regardless of the parameter values. In particular, when drift rate µ1 of the first stage of the process is larger than drift rate µ2 of the second stage, the predicted mean choice response time is always smaller for the more frequently chosen alternative than for the less frequently chosen alternative. On the other hand, when µ1 < µ2, the predicted mean choice response time is always smaller for the less frequently chosen alternative than for the more frequently chosen alternative (a fast error). Thus, when in an experimental setup more stimulus information is provided in the first stage than in the second while the data show fast errors, there is no rescue for this model.

Palminteri et al. (2017) argued that simulation, including visualization of the outcome, is a decisive step in model falsification, showing that a computational cognitive model is unable to account for a specific (behavioral) phenomenon. This becomes even more important when several model candidates are compared (see the Model Comparison section). Navarro (2019, p. 234) argued that showing how the qualitative patterns in the empirical data emerge from a computational model is often more scientifically useful than presenting a quantified measure of its performance.

Simulation also includes parameter recovery; that is, when artificial data are produced with a specific set of parameters and the model is fitted to those data (see below), the estimated parameters should be close to the ones that generated the artificial data.
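To give a flavor of what such a simulation can look like, the following sketch generates artificial choices and response times from a discrete-time random-walk approximation of a diffusion process. It is only a minimal illustration, not the code of the models cited above; the parameter values (drift mu, boundary theta, step size dt) are arbitrary.

% Minimal sketch: simulate choices and response times from a random-walk
% (discrete-time diffusion) model; all parameter values are illustrative.
mu = 0.3; sigma = 1; theta = 1; dt = 0.001; nTrials = 1000;
choice = zeros(nTrials,1); rt = zeros(nTrials,1);
for i = 1:nTrials
    x = 0; t = 0;
    while abs(x) < theta
        x = x + mu*dt + sigma*sqrt(dt)*randn;   % accumulate noisy evidence
        t = t + dt;
    end
    choice(i) = x >= theta;                     % 1 = upper boundary, 0 = lower boundary
    rt(i) = t;
end
% mean(choice) approximates the predicted choice probability;
% mean(rt(choice==1)) and mean(rt(choice==0)) give the mean response times.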
Parameter Estimation

Starting with a particular model, the major interest is to find the set of parameters that best accounts for observed data. This is also referred to as fitting the model to the data. We distinguish between free (to be estimated) parameters and fixed parameters. Ideally, free parameters are related to psychological concepts and are interpreted accordingly. They may also serve to summarize differences between experimental conditions and experimental groups. Some other parameters may be added to the model to improve the fit. They sometimes appear ad hoc and are not related to psychological concepts. Fixed parameters are set and are often part of the model's genuine features—for instance, in a diffusion process, the diffusion coefficient (randomness). Note that in the cognitive sciences, the number of parameters is kept as small as possible so as not to have too flexible a model. There are many different ways to estimate the parameters from data. Indeed, Farrell and Lewandowsky (2018) devoted roughly half of their book to this topic.

Before selecting a particular method to estimate the free parameters from the data, we need to decide what kind of measure to take to minimize the deviance between the model's predictions and the data. In other words, we seek those parameters that maximize the similarity between the data and the model predictions. Clearly, this becomes an optimization problem, and the function that maximizes or minimizes an output value by systematically choosing input values—for example, selecting parameter values from a (predefined) set—is called an objective function.

Here the objective function itself may include a discrepancy measure such as the mean of the squared deviations between the data and the model prediction, the square root of it, the root mean square error of approximation (RMSEA), variations of chi-squared test formulas, or any goodness-of-fit measure that is appropriate (e.g., Browne & Cudeck, 1992; Chechile, 1999; Smith & Vickers, 1988). For instance, fitting the dynamic dual process model to a risky decision situation with two frames (gain and loss), two deadlines for making a decision (short and long), and nine gambles (probability and value combinations), Diederich and Trueblood (2018) minimized the function

\chi^2 = \sum_{i=1}^{72} \left( \frac{RT_i^{obs} - RT_i^{pred}}{SE_{RT_i^{obs}}} \right)^2 + \sum_{i=1}^{36} \left( \frac{Pr_i^{obs} - Pr_i^{pred}}{SE_{Pr_i^{obs}}} \right)^2    (23.1)

for estimating the parameters from 72 mean choice response times (choice between a sure option and a lottery) and 36 choice frequencies for the lottery (or the sure option). RT_i^{obs} and RT_i^{pred} are the observed and the model's predicted mean response times for condition i, respectively, and SE_{RT_i^{obs}} is the observed standard error for condition i. Similar notation holds for the choice probabilities (Pr). Note that for statistical testing, not rejecting the null hypothesis is the desired outcome for a fitted model.

Another approach to estimating parameters is maximum likelihood estimation (MLE). It estimates the parameters of a probability distribution (rather than means, for instance) by maximizing a likelihood function. Assume that we observed a sample of n response times y = (y_1, y_2, \ldots, y_n), that is, realizations from random variables Y_i, and the model accounting for the observations includes k parameters u = (\theta_1, \theta_2, \ldots, \theta_k). The real-valued function

L(u \mid y) = f(y \mid u)    (23.2)

is called the likelihood function. That is, as a function of y with fixed u, f(y|u) is a probability (density/mass) function; as a function of u with fixed y, f(y|u) is a likelihood function. For a random sample (independent and identically distributed variables), it is the product of n individual functions,

f(y \mid u) = \prod_{i=1}^{n} f(y_i \mid u),

and taking the logarithm, it becomes the log-likelihood

LL(u \mid y) = \ln f(y \mid u) = \sum_{i=1}^{n} \ln f(y_i \mid u).    (23.3)

Maximum likelihood parameter estimation seeks those parameters that maximize the likelihood of the chosen model for the given set of data. This method also plays a major role in Bayesian parameter estimation (Feinberg & Gonzales, 2007; Farrell & Lewandowsky, 2018, devote several chapters to it).

Parameter estimation techniques are computer algorithms, often available as part of a computing environment such as MATLAB, Python, or R. That is, there is no need to develop a computer program for minimizing or maximizing the objective function. For instance, the MATLAB Optimization Toolbox offers a family of algorithms for solving optimization problems, including primal and dual simplex algorithms, originally based on the Nelder-Mead simplex method. The Python optimization methods and the R package "optimization" also include simplex methods (e.g., George & Raimond, 2013; Sun et al., 2019; for a survey of methods, Beasley & Rodgers, 2012).
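As a concrete, if simplified, illustration of Equation 23.3 and of handing an objective function to a built-in optimizer, the following sketch fits an exponential response-time model by minimizing the negative log-likelihood with MATLAB's fminsearch. The data are simulated here, and the exponential model with mean theta is chosen only because its likelihood is easy to write down; it is not a model advocated in this chapter.

% Minimal sketch: maximum likelihood estimation with fminsearch.
% Illustrative assumption: RTs are exponentially distributed with mean theta.
y = -0.8*log(rand(100,1));                            % simulated RTs with true mean 0.8 s
negLL = @(theta) numel(y)*log(theta) + sum(y)/theta;  % negative log-likelihood (cf. Equation 23.3)
phat = fminsearch(@(p) negLL(exp(p)), 0);             % optimize on the log scale so theta stays positive
thetaHat = exp(phat)                                  % estimate; should be close to 0.8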
Closely related to the previous two topics are model complexity and parameter recovery. Myung and Pitt (1997) defined model complexity, or model flexibility, as the model's ability to fit diverse patterns of data. They identified three dimensions of a model that contribute to its complexity: the number of parameters, the model's functional form, and the extension of the parameter space.

Adding more parameters to the model typically gives a better fit to the data. To give a simple and intuitive example: The higher the degree of a polynomial model, the better it can be fitted to even very noisy data. Several goodness-of-fit measures take the number of parameters into account. They penalize models with many parameters as compared to models with fewer parameters. Examples of such measures are listed in the next section on model comparison.

The functional form is defined as the way in which parameters are combined in the model equation. For instance, for psychophysical models mapping the intensity of a physical stimulus, I, onto the perceived magnitude, Stevens's power law, ψ(I) = k · I^a, is more complex than Fechner's law with the same number of parameters, ψ(I) = k · ln(I + a), because the parameter a in Stevens's law can take on any value in R and therefore allows for convex, concave, and linear functional curvatures, whereas Fechner's law is only defined for a ∈ R+, restricting it to concave functions (cf. Myung & Pitt, 1997; Townsend, 1975). Thus, Stevens's law can account for a much broader pattern of data and, therefore, is more difficult to falsify. The third dimension of model complexity, the extension of the parameter space, is demonstrated by the previous example: The parameter space may cover the entire real numbers or be restricted to a subset of them. Obviously, the more restrictions there are, the less flexible the model is.

Computational and statistical models are used as explanations or as predictions (Shmueli, 2010), in psychology also referred to as measurements (Farrell & Lewandowsky, 2018). In the context of statistical modeling, Shmueli (2010) defined the relationship between theory and data to be explained as causal explanation and explanatory modeling as the use of models for testing causal explanations. The hypotheses are given in terms of theoretical constructs (p. 290). When a model is used as a measurement model (predictive model), conclusions are drawn from parameter estimates; the purpose here is to predict future observations. According to Shmueli (2010), predictions (measurements) include point or interval predictions, predictive regions, and predictive distributions (p. 291).

When a model is used as a measurement model, it is important to know whether the parameters are identifiable—that is, whether the set of parameter values can be determined from the set of data—and whether the model can recover the parameters. Parameter recovery includes the following steps: The computational model simulates data with known parameter values. Then the model is fitted to these simulated data. Typically, this procedure is repeated for several sets of parameters. Ideally, the estimated parameters are identical (or close) to the ones that generated the data (see, e.g., Hübner & Pelzer, 2020; Kandil et al., 2014; van Ravenzwaaij & Oberauer, 2009, for examples of parameter recovery studies for different models).
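A parameter recovery check of this kind can be sketched in a few lines. The example below reuses the illustrative exponential response-time model and the fminsearch call shown earlier, so the model, sample sizes, and parameter values are again only placeholders.

% Minimal sketch of a parameter recovery study for the illustrative exponential model.
trueTheta = linspace(0.3, 1.5, 10);                % generating parameter values
estTheta  = zeros(size(trueTheta));
for k = 1:numel(trueTheta)
    y = -trueTheta(k)*log(rand(200,1));            % simulate data with a known parameter
    negLL = @(th) numel(y)*log(th) + sum(y)/th;    % same likelihood as before
    estTheta(k) = exp(fminsearch(@(p) negLL(exp(p)), 0));  % fit the model to the simulated data
end
% plot(trueTheta, estTheta, 'o'); recovery is good when the points fall near the identity line.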
Model Comparison

Science relies on accumulated knowledge, and the scientific method (see above) requires new empirical evidence to be built into the model. A new model may replace a previous one and/or compete with other models. Before answering the question of which model is best supported by the data, some of the issues addressed before need to be checked. Is the model falsifiable—that is, does the model make predictions that can be tested empirically? Are the model parameters identifiable—that is, is there a unique mapping between a specific set of parameters and a specific set of data? Does the model make qualitative predictions that reflect the patterns in the data? When the models meet these criteria, they are compared by goodness-of-fit measures that take the number of parameters into account. Some measures also deal with model flexibility. Most, if not all, goodness-of-fit measures have been developed in the context of testing structural equation models (see Cole & Ciesla, 2012; Flaherty & Kiff, 2012; Steyer et al., 2012).

Some widely used measures for computational models involving MLE (Equations 23.2 and 23.3) are the Akaike information criterion (AIC; Akaike, 1983) and the Bayesian information criterion (BIC; Schwarz, 1978). The AIC for model i with k_i free parameters is

AIC_i = -2\widehat{LL}_i + 2k_i,    (23.4)

where \widehat{LL}_i is the maximized value of the log-likelihood (Equation 23.3) for the ith model with estimated parameter values \hat{u}_i that maximized the likelihood function. The model with the smallest AIC should be chosen. The BIC is similar to the AIC except that it also includes the number of observations n:

BIC_i = -2\widehat{LL}_i + k_i \ln n.    (23.5)

For fewer than eight observations (exp(2) = 7.39), the BIC penalty is smaller than the AIC penalty, and it grows only slowly as n increases (ln(1000) = 6.91). For a detailed discussion of both measures with respect to their philosophical background (information theory versus Bayesian theory) and statistical properties, see Burnham and Anderson (2004) and Vrieze (2012).

Another goodness-of-fit measure is the RMSEA (R*; Browne & Cudeck, 1992; Steiger, 1990):

R^* = \sqrt{F^*/\upsilon}.    (23.6)

F* is the Population Noncentrality Index (PNI), a measure of badness of fit (Steiger, 1990). As Steiger (n.d., p. 5) pointed out, model complexity is directly related to the number of free parameters and inversely related to the number of degrees of freedom (υ). Therefore, dividing the PNI by the degrees of freedom accounts for model complexity, and taking the square root of the ratio returns the index to the same metric as the original standardized parameters (Steiger, n.d., p. 5). For example, Schubert et al. (2017) used this measure when fitting quantiles of a diffusion model to empirical RT quantiles (Ratcliff, 2012). Using a χ² statistic (the squared differences between the observed and predicted quantile values, divided by the predicted quantile values and then summed, slightly different from Equation 23.1) as the objective function, the PNI is defined as F* = χ² − df. The degrees of freedom are defined in terms of the number of experimental conditions, c, the number of response bins, b, and the number of free parameters, k—that is, υ = c(2b − 1) − k. The number of participants, N, is also included in the denominator:

R^* = \sqrt{\frac{\max(F^*, 0)}{\upsilon(N - 1)}}.    (23.7)

Obviously, the smaller R*, the better the model fit. If χ² is smaller than υ, zero is taken. Summaries and evaluations of goodness-of-fit measures are provided by, for instance, Evren and Tuna (2012), based on statistical entropy, Steiger (n.d.), and many more. Farrell and Lewandowsky (2018) described Bayesian model comparison using Bayes factors in Chapter 11 of their book. Besides comparing competing models on the basis of goodness-of-fit measures, several other methods have been proposed, such as cross-validation procedures and likelihood ratio tests, in particular for nested models, and many more (Nezlek, 2012; Rindskopf, 2012).
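Once the maximized log-likelihoods (or the chi-square statistic) are available, these indices reduce to a few arithmetic operations. The numbers in the following sketch are invented solely to show the computation of Equations 23.4, 23.5, and 23.7.

% Minimal sketch: comparing two models by AIC and BIC (Equations 23.4 and 23.5).
LL = [-530.2 -524.9];        % maximized log-likelihoods of model 1 and model 2 (illustrative)
k  = [3 5];                  % numbers of free parameters
n  = 200;                    % number of observations
AIC = -2*LL + 2*k;           % smaller is better
BIC = -2*LL + k*log(n);

% RMSEA-type index of Equation 23.7 from a chi-square fit statistic (illustrative values).
chi2 = 61.3; upsilon = 48; N = 40;
Rstar = sqrt(max(chi2 - upsilon, 0) / (upsilon*(N - 1)));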
EXAMPLE OF THE MODELING PROCESS WITH TWO COMPETING CONNECTIONIST MODELS

The following simple example demonstrates, with two competing connectionist models for categorical learning, how the modeling process is performed. Categorizing (perceptual) objects or patterns into distinct classes is important in many real-life situations. Does the functional magnetic resonance imaging scan indicate disease A or B or none? Are these cells cancerous or not? Is this a female or male face? What ethnic group do these people belong to? Often, people are very successful in performing these and similar tasks, but sometimes they are not. How do we learn to categorize objects, and how do we generalize our knowledge to objects we have not seen before?

Step 1
The first step in the modeling process to answer these questions is to come up with a conceptual theoretical framework. This requires creativity on the part of the researcher and involves hard work. For the current demonstration, two existing and competing theoretical frameworks for categorization are taken: a prototype model and an exemplar model. According to the prototype model, some members of a category are more central than others. The person extracts the central tendency (sometimes referred to as characteristic features) of the examples presented during a learning phase and uses these characteristic features to form a prototype, which serves as the basis for categorizing new objects. That is, when a new target object is presented, it is compared with the prototypical object of each category, and the category with the most similar prototype is chosen. According to the exemplar model, the learner stores specific instances (exemplars) for each category. When a new target stimulus is presented, the similarity of the target to each stored example is computed for each category, and the category with the greatest total similarity is chosen.

Step 2
The second step is to describe the objects in an abstract formal way, translate the assumptions into equations, and describe the response also in a rigorous form. Here, we take a connectionist version of a prototype model and a connectionist version of an exemplar model (e.g., Nosofsky et al., 1992). There has been a long debate about which of these two models (prototype versus exemplar) best represents category learning, and some question whether it is possible to distinguish them empirically.

Digression. Before describing the models in more detail, we define the general framework of key features for connectionist processing (Rumelhart et al., 1986; Thomas & McClelland, 2008) to fix some notation.

1. A set of processing units u_i, organized in layers and often divided into input units, which receive the information to be processed; output units, which provide the results of the processing; and hidden units, in which specific computations necessary for the entire process may be performed. The units can be interpreted as natural or artificial neurons and groups of neurons. For cognitive models, the input units may represent perceptual features, letters, faces, and so on, and output units may represent words, phonemes, or ethnic groups. All the processing is carried out by these units. The system is parallel, as many simple operations are carried out simultaneously. Units in a computer simulation are virtual entities and are usually represented by circles.

2. A state of activation for each unit, a_i, at a given time t, a_i(t). The states of a set of units at time t are organized in a vector, a(t) = (a_1(t), a_2(t), . . . , a_i(t), . . . , a_n(t)). The activation values can be any numeric value, but often they are real numbers bounded between 0 and 1. The analogy to neural activity is the neuron's firing rate (rate of action potentials). A zero would indicate the least and a one the most possible activity of a neuron.

3. The pattern of connectivity. To make a network, units need to be connected. If units are analogous to neurons, connections are analogous to synapses. Connections are represented with lines, and arrows indicate the flow of information from one unit to the next. In a standard three-layer feedforward network (described later), activation is sent from all input units to all hidden units to all output units in a single direction—that is, a directed graph with nodes and internodal connections. The strength or weakness of a connection between any two units determines the extent to which the activation state of one unit can affect the activation state of another unit and can be measured by a connection weight, w. The connection weights of all units are organized in a matrix W = ||w_ij||. Often connection weights are real numbers between −1 and 1. High connection weights represent a strong connection while low weights represent a weak connection, analogous to excitatory and inhibitory synapses.
4. The propagation rule. This rule determines how activation is propagated through the network. The activation values of the sending units are combined with the connection weights to produce the net input into the receiving units, usually by a linear function. That is, the inputs from all sending units are multiplied by the connection weights and summed to get the overall input of the receiving units; that is, the net input for the receiving units is net(t) = W · a(t). The net input for a specific unit, i, is therefore net_i(t) = Σ_j w_ij a_j(t).

5. The activation rule. This rule specifies how the combined or net input of a given unit is transformed to produce its new activation state a_i(t + 1), and it is expressed in terms of a function F, such that a_i(t + 1) = F(net_i(t)). Typically, the activation function F is chosen from a small selection of functions, including F(x) = sgn(x), producing binary (±1) output; F(x) = (sgn(x) + 1)/2, producing binary (0/1) output; F(x) = (1 + e^{−x})^{−1}, the sigmoidal (logistic) nonlinearity, producing output between 0 and 1; and F(x) = tanh(x), producing output between −1 and 1; some other forms are also possible. The net input can take on any value, and the function F ensures that the new activation state does not exceed the maximum or minimum activation values (e.g., above 1 or below 0).

6. The learning rule. Learning in a neural network involves modifying the connection weights, and finding the right weights is at the heart of connectionist models. The learning rule is an algorithm that specifies how the connectivity changes over time as a function of experience, that is, data. For instance, the simplest learning rule assumes that the weight w_ij between two units u_i and u_j changes proportionally to the respective activation values—that is, ∆w_ij = w_ij(t + 1) − w_ij(t) = η a_i a_j, where the constant η is called the learning rate. Basically, it determines the step size at each iteration. There is a variety of learning algorithms. They differ from each other in the way in which the adjustment of the connection weights of a unit is formulated (see, e.g., Haykin, 1999, for a detailed description of several algorithms). Specific learning rules depend on the architecture of the neural network. In addition, various learning paradigms refer to models of the environment in which the neural network operates. Any given network architecture can usually be employed in any given learning paradigm.
7. Network architectures. There are three fundamentally different classes of network architectures:
■ Single-layer feedforward networks. The input layer of source nodes projects onto an output layer of computational nodes but not vice versa. It is strictly feedforward. The notion "single-layer" refers to the output layer of computational nodes; the input layer is not counted since no computation takes place there.
■ Multilayer feedforward networks. One or more hidden layers of hidden units are sandwiched between the input layer and the output layer. The hidden units intervene between the external input and the network output in some way. Typically, the units in each layer of the network have as their inputs the output signals of the preceding layer. If every node in each layer of the network is connected to every node in the adjacent forward layer, the neural network is fully connected. If a connection is missing, the neural network is partially connected.
■ Recurrent networks. The network has at least one feedback loop. The feedback loops may be self-feedback loops—that is, the output of a neuron is fed back into its own input—or there are no self-feedback loops, for instance, when the output is fed back to the inputs of all the other neurons.

8. Learning paradigms. There are three major learning paradigms: supervised learning, unsupervised learning, and reinforcement learning.
■ Supervised learning, also referred to as learning with a teacher. In supervised learning, a set of data is given, the training set, which consists of the input (e.g., object patterns) and the desired output (e.g., classification). That is, the input is given together with the correct output, also called the target. The parameters of the network are gradually adjusted to match the input and desired output by going through the training set many times. The aim of the supervised neural network is to predict the correct answer to new data that were not included in the training set. That is, the network is expected to learn certain aspects of the input–output pairs in the training set and to apply them to new data.
■ Unsupervised learning, also referred to as self-organized learning or learning without a teacher. In unsupervised learning, a correct or desired output is not known—that is, there is no input–output pair. This type of learning is often used to form natural groups or clusters of objects based on the similarity between objects.
■ Reinforcement learning. In reinforcement learning, the only information given for each input–output pair is whether the neural network produced the desired result or not, or the total reward given for an output response. The weights are updated based solely on this global feedback (that is, the Boolean values true or false, or the reward value) (for details, see, e.g., Rojas, 1996).

As an example of a learning rule, consider the single-unit perceptron (Figure 23.1), which is the simplest version of a feedforward neural network and classifies inputs into two distinct categories. That is, it maps a real-valued input vector a (e.g., describing visual objects) to a binary output value y (e.g., category A or B). Furthermore, the neural network applies a reinforcement paradigm. The architecture of this model, in the notation introduced above, is

y = F\left( \sum_{j=1}^{n} w_j a_j(t) + w_0 \right) = \mathrm{sgn}\left( \sum_{j=1}^{n} w_j a_j(t) + w_0 \right),    (23.8)

where w_0 is a bias factor. The bias has the effect of increasing or lowering the net input of the activation function, depending on whether it is positive or negative, respectively. Setting a_0 ≡ 1, the above equation can be written as

y = \mathrm{sgn}\left( \sum_{j=0}^{n} w_j a_j(t) \right) = \mathrm{sgn}(w' a(t)).    (23.9)

FIGURE 23.1. A simple single-unit perceptron, also known as the McCulloch-Pitts neuron (McCulloch & Pitts, 1943).

Denote the set of training or learning data as D = {[a(t), z(t)], t = 1, . . . , m}, where {z(t)} contains the binary classification variables ±1, the desired activation state at time t, and a(t) = (a_0(t), a_1(t), . . . , a_n(t)) is the state activation vector for the observation at time t, as before. Learning is modeled by updating the weight vector w during m iterations for all training examples. That is, for each pair in D and for each iteration t, the weight w is updated according to

\Delta w_j = w_j(t + 1) - w_j(t) = \eta (z - y) a_j,    (23.10)

where the constant η (> 0) is the learning rate, and the learning rule is called the delta rule. The delta rule is related to a gradient descent type of method for optimization.
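A minimal implementation of this learning scheme (Equations 23.9 and 23.10) takes only a few lines. The four training patterns below are made up for illustration and are linearly separable, so the delta rule settles on weights that classify them correctly.

% Minimal sketch of delta-rule learning in a single-unit perceptron.
A = [1 0.2 0.9; 1 0.8 0.3; 1 0.1 0.7; 1 0.9 0.2];  % rows: a0 = 1 (bias) plus two stimulus features
z = [1; -1; 1; -1];                                 % desired outputs (+1 or -1)
w = zeros(3,1); eta = 0.1;                          % initial weights and learning rate
for iter = 1:50                                     % repeated sweeps through the training set
    for t = 1:size(A,1)
        y = sign(A(t,:)*w);                         % output of the unit (Equation 23.9)
        if y == 0, y = 1; end                       % treat a net input of zero as +1
        w = w + eta*(z(t) - y)*A(t,:)';             % delta rule update (Equation 23.10)
    end
end
% After training, sign(A*w) reproduces z for these patterns.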
End of digression. For the two connectionist models for categorization, we assume very simple objects characterized by only two dimensions. These could be saturation and brightness, length and orientation, distance between the eyes and length of the mouth, and so on. The stimuli are conveniently described in the form of vectors. A stimulus is denoted S = (s_1, s_2), where s_1 represents the value of the stimulus on the first dimension and s_2 represents the value of the stimulus on the second dimension. Consider the connectionist version of the prototype model first.

Connectionist version of the prototype model. The model assumes that a stimulus is represented by two sets of input units: One set, u_1, is activated by the value of the stimulus on the first dimension, s_1; and the other set, u_2, is activated by the value of the stimulus on the second dimension, s_2. The number of units in each set is p, u_i = {u_i1, u_i2, . . . , u_ip}, i = 1, 2, and so there are a total of 2 · p units. Obviously, this can be written as u_1 ∪ u_2 = {u_1, . . . , u_p, u_{p+1}, . . . , u_{2p}}. Each unit within a set is designed to detect a particular stimulus value, which is called the ideal point of the unit. This may be considered in analogy to neuronal tuning. Neurons responding best to a specific orientation, movement direction, disparity, frequency, and the like are said to be tuned to that orientation, movement direction, disparity, or frequency. The ideal point value of each unit is not naturally given but needs to be defined. These additional detailed assumptions (called ad hoc assumptions) are necessary in order to complete the model. That is, for the prototype model, assumptions about what features should be used to represent the stimuli to be categorized need to be added and also formulated in an abstract way.

The jth unit in the first set is designed to detect a stimulus value, z_1j, that is, the ideal point of that unit, and the activation of this unit, denoted a_1j(t), is determined by the similarity of s_1 presented at trial t to the ideal point z_1j. Analogously, the jth unit in the second set is designed to detect a stimulus value, z_2j, and the activation of this unit, denoted a_2j(t), is determined by the similarity of the ideal point z_2j to s_2 presented at trial t.

How large the set of units is depends on how many specific features are to be coded. For instance, Le Cun et al. (1989) developed a network for zip-code recognition in which 7,291 handwritten zip-code digits were processed to fit into a 16 × 16 pixel image with grey levels in the range of −1 to +1. Thus, the dimensionality of each input is 256.

The similarity between the current stimulus value s_i and the ideal point z_ij, for each unit j, is determined by the following function:

f_{sim}(z_{ij}, s_i) = \exp\left[ -\left( \frac{z_{ij} - s_i}{\sigma} \right)^2 \right], \quad i = 1, 2; \; j = 1, \ldots, p.    (23.11)

The choice of this function is another ad hoc assumption. Indeed, there is a variety of similarity functions one can choose from. The proposed one is related to Shepard's (1987) so-called universal law, stating that the perceived similarity between two entities is an exponential decay function of their distance. In the present example, the similarity function is Gaussian (the exponent is 2).

The parameter σ is called the discriminability parameter, and it determines the width or spread of the activation around the ideal point. A low discriminability parameter (large σ) makes it hard to discriminate differences between the stimulus value and the ideal point, and a high discriminability parameter (small σ) makes it easy to discriminate differences between the stimulus value and the ideal point. That is, it determines the rate at which similarity declines with distance. The values of the function range between 0 and 1. If the stimulus value s_i and the ideal point z_ij are identical, the function takes on the value 1. If the stimulus value s_i is far from the ideal point z_ij, the function approaches 0.

The input activation a_ij(t) generated at the jth unit is determined by the similarity of that unit relative to the sum of the similarities of all the units:

a_{ij}(t) = \frac{f_{sim}(z_{ij}, s_i)}{\sum_{j=1}^{p} f_{sim}(z_{ij}, s_i)}, \quad i = 1, 2; \; j = 1, \ldots, p.    (23.12)
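Numerically, Equations 23.11 and 23.12 amount to evaluating a Gaussian tuning curve at each ideal point and normalizing. The ideal points, stimulus value, and σ in the following sketch are arbitrary illustration values.

% Minimal sketch: input activations of one set of units (Equations 23.11 and 23.12).
z1    = linspace(0, 1, 7);                  % ideal points of the p = 7 units for dimension 1
s1    = 0.35;                               % stimulus value on dimension 1
sigma = 0.2;                                % discriminability parameter
fsim  = exp(-((z1 - s1)./sigma).^2);        % similarity to each ideal point (Equation 23.11)
a1    = fsim/sum(fsim);                     % normalized input activations (Equation 23.12)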
The input units are connected to two output units, one for each category. The activation of the two category units is denoted c_1(t) and c_2(t) for category C_1 and C_2, respectively. The connection weight, w_ijk, connects the input activation a_ij(t) to the kth output unit, k = 1, 2. The propagation rule—that is, the function that combines the input activation with the connection weights to produce the input to the output units—is

c_k(t) = \sum_{j=1}^{p} w_{1jk} a_{1j}(t) + \sum_{j=1}^{p} w_{2jk} a_{2j}(t), \quad k = 1, 2.    (23.13)

This is the net input for unit k, such that c_k(t) = net_k(t). The set of weights {w_{11k}, . . . , w_{1pk}; w_{21k}, . . . , w_{2pk}} connecting the inputs to the output for category C_k forms a representation of the prototype pattern for category C_k. The more similar the input activation pattern is to these weights, the more likely the stimulus matches the prototype for category C_k.

The connection weights are updated according to the delta learning rule (Equation 23.10):

\Delta w_{ijk} = w_{ijk}(t + 1) - w_{ijk}(t) = \eta \cdot [h_k(t) - c_k(t)] \cdot a_{ij},    (23.14)

where h_k(t) is the indicator function with h_k(t) = 1 for the desired category and 0 otherwise.

The whole learning process begins with some initial weights, and usually these are randomly assigned to represent a state of ignorance at the beginning of the learning process. Alternatively, if some prior knowledge or training exists, then the initial weights can be set to values that represent this prior knowledge or training.

The architecture of the connectionist version of the prototype model is presented in Figure 23.2.

FIGURE 23.2. Architecture of the connectionist version of the prototype model.

One essential step in computational modeling is to implement the model on the computer—that is, writing code and algorithms for training the model and estimating the parameters. To do so, it is convenient to rewrite the equations in matrix format. Computer languages such as MATLAB, Mathematica, Python, and R have built-in matrix operators, which allow effective programming leading to fast computations.

The deviation between the p ideal points and the stimulus value s_i in Equation 23.11 can be written as a p-dimensional vector d_i, i = 1, 2, with

d_i = \frac{1}{\sigma}(z_i - s_i O),    (23.15)

where z_i is a vector of length p with the p ideal points for dimension i, and O is a p-dimensional vector containing ones. The similarity function is

f_{sim}(z_i, s_i) = \exp(-d_i^{\bullet 2}),    (23.16)

where the exponent •2 means the elementwise square of the vector. The activation function in Equation 23.12 is then a p-dimensional vector,

a_i(t) = \frac{1}{O^T \cdot f_{sim}(z_i, s_i)} f_{sim}(z_i, s_i), \quad i = 1, 2,    (23.17)

where O^T is the transposed vector of ones, that is, a row vector. Note that O^T · f_sim(z_i, s_i) is an inner product, which produces a scalar. Obviously, the activation for both stimulus dimensions can be expressed in one vector of length 2p, a(t) = [a_1(t); a_2(t)], by stacking the two dimension vectors.

 w1 
T
TABLE 23.1
W =   . The propagation rule in Equation 23.6
 w2 
T
MATLAB Codes for Determining the Similarity
can be written as
Function and Weights for the Prototype Model
c (t ) = Wa (t ) , (23.18)
Program command Comment
function[a] = protoac(Z,S, sigma,j); Input and output of function
 c1 (t )  fsim=exp(-((Z-S(j))./sigma).^2); Calculating similarity
where c (t ) =   is a vector of length 2 with a=fsim/sum(fsim); Calculation activation
 c2 (t )
the activation of the two category units for function[W] = prototype(eta,p)
category C1 and C2, respectively. W = zeros(2,p) Initial connection weights
for j=1:p Loop for stimuli
Finally, the delta rule (Equation 23.14) can be [a1]=protoac(Z1,S1,sigma,j); Activation for category 1
written as [a2]=protoac(Z2,S2,sigma,j); Activation for category 2
c=W*[a1;a2]; Propagation
∆W = W (t + 1) − W (t ) = η • [h (t ) − c (t )] • a, W1=eta*([1;0]-c)*a1; Adjusting weights for c1
W2=eta*([0;1]-c)*a2; Adjusting weights for c2
(23.19) W=W+W1+W2; Updating the weights
End End of loop
 h1 (t ) 
Copyright American Psychological Association. Not for further distribution.

where h (t ) =   is the 2-dimensional vector


 h2 (t )
of the indicator function with 1 for the desired
that is, each cell of the matrix represents a single
category and 0 otherwise.
input unit. Each unit on the grid is designed to
There are numerous ways to write the algorithms
detect a pair of stimulus values. In particular, the
and programs for the model. A convenient one
unit on the grid point corresponding to row i and
is to divide the program in several subprograms
column j is designed to detect the value zij = [zi, zj],
and address only those necessary for a specific
which is the ideal point for this unit. The difference
purpose. For instance, estimating parameters
between the assumptions of the prototype model
from data require different steps than simulating
and the exemplar model is obvious. For the
predictions of the model over a set of predefined
prototype model a unit is tuned to exactly one
parameters. However, both routines involve the
feature, but the exemplar model assumes that as
computational model. An example of the model
unit is tuned to two (or possibly more) features.
written in MATLAB codes is found in Table 23.1.
The stimulus S = (s1, s2) activates a circular
Subprograms in MATLAB are called “functions”
receptive field of grid points. Note that the analogy
and received input from, and provide output to,
to neural structures in the brain is drawn, including
other parts of the program. For demonstrational
the same terminology. A receptive field of a neuron
reasons, the expressions are taken apart. For expe-
in the visual system, for instance, is a restricted
rienced programmers, they can be written more
area of the retina that influences the firing rate
compactly. These functions are embedded in a larger
of that neuron because of light. Receptive fields
frame where parameters such as the delta and sigma
of ganglion cells have a concentric form with a
are defined, stimuli vectors are read, ideal points are
center-surround organization. A receptive field
defined, and so forth. For an example of a complete
in the context of artificial neural networks is a
program, see Busemeyer and Diederich (2010).
restricted network with local connections which
Connectionist version of the exemplar model.   may or may not be concentric. The centroid of
The model assumes that the inputs to the network the receptive field is located at the pair of stimulus
form a square grid with p rows and p columns— values (s1, s2). The amount of activation of a
that is, a p × p matrix. Each point on the grid, nearby input unit declines as a function of the

527
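To show how such a surrounding frame might look, the following sketch defines illustrative ideal points, training stimuli, and targets and then calls the training routine of Table 23.1 (with the argument list shown there). All names and values here are placeholders for illustration, not the program of Busemeyer and Diederich (2010).

% Minimal sketch of a frame around the functions in Table 23.1 (illustrative values only).
p = 7; m = 40; sigma = 0.2; eta = 0.5;
Z1 = linspace(0,1,p)'; Z2 = linspace(0,1,p)';     % ideal points on both dimensions
S1 = rand(m,1); S2 = rand(m,1);                   % training stimuli (dimension values per trial)
H  = [S1' < 0.5; S1' >= 0.5];                     % targets: here, the category depends on dimension 1
W  = prototype(Z1,Z2,S1,S2,H,sigma,eta,p,m);      % learned weights after m training trials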
Connectionist version of the exemplar model. The model assumes that the inputs to the network form a square grid with p rows and p columns—that is, a p × p matrix. Each point on the grid—that is, each cell of the matrix—represents a single input unit. Each unit on the grid is designed to detect a pair of stimulus values. In particular, the unit on the grid point corresponding to row i and column j is designed to detect the value z_ij = [z_i, z_j], which is the ideal point for this unit. The difference between the assumptions of the prototype model and the exemplar model is obvious: For the prototype model, a unit is tuned to exactly one feature, but the exemplar model assumes that a unit is tuned to two (or possibly more) features.

The stimulus S = (s_1, s_2) activates a circular receptive field of grid points. Note that the analogy to neural structures in the brain is drawn, including the same terminology. A receptive field of a neuron in the visual system, for instance, is a restricted area of the retina that influences the firing rate of that neuron in response to light. Receptive fields of ganglion cells have a concentric form with a center-surround organization. A receptive field in the context of artificial neural networks is a restricted network with local connections, which may or may not be concentric. The centroid of the receptive field is located at the pair of stimulus values (s_1, s_2). The amount of activation of a nearby input unit declines as a function of the distance of the unit from this center. The activation of this unit, a_ij(t), is determined by the similarity of the stimulus S to the ideal point z_ij, denoted f_sim(z_ij, S), and defined as

f_{sim}(z_{ij}, S) = \exp\left[ -\left( \frac{z_i - s_1}{\sigma} \right)^2 \right] \cdot \exp\left[ -\left( \frac{z_j - s_2}{\sigma} \right)^2 \right], \quad i = 1, \ldots, p; \; j = 1, \ldots, p.    (23.20)

This is a type of bivariate Gaussian function and is used to form the receptive field. As for the prototype model, the values of the function range between 0 and 1. If the stimulus values of both dimensions (s_1, s_2) and the ideal point (z_i, z_j) are identical, the function takes on the value 1. If the stimulus value of at least one dimension, s_i, i = 1, 2, is far from its ideal point, z_i, i = 1, 2, the function approaches 0, regardless of the difference between the stimulus value and its ideal point on the other dimension.

The parameter σ is interpreted as the discriminability parameter and has the same effect as it had before for the prototype model. Low discriminability (large values of σ) produces a large receptive field, which makes it hard to detect differences among stimuli. High discriminability (small values of σ) produces a small receptive field, which makes it easy to detect differences among stimuli.

The input activation a_ij(t) generated at the unit in the ith row and jth column is determined by the similarity of that unit relative to the sum of the similarities of all the units:

a_{ij}(t) = \frac{f_{sim}(z_{ij}, S)}{\sum_{i=1}^{p} \sum_{j=1}^{p} f_{sim}(z_{ij}, S)}, \quad i = 1, \ldots, p; \; j = 1, \ldots, p.    (23.21)

A stimulus produces a bivariate distribution of input activations on the grid, which is centered around the pair of stimulus values. As before, the input units are connected to two category units, one for each category, and the activations of the two category units are c_k(t), k = 1, 2, for category C_k, k = 1, 2. Each unit on the grid has a connection weight, w_ijk, which connects the input activation a_ij(t) to the kth output unit, k = 1, 2. The propagation rule for the exemplar model is

c_k(t) = \sum_{i=1}^{p} \sum_{j=1}^{p} w_{ijk} a_{ij}(t), \quad k = 1, 2.    (23.22)

This model is called an exemplar model because each receptive field of a training stimulus is associated with the output category units through a separate set of connection weights. Thus, the model simply associates each region of the stimulus space with a response, and similar examples get mapped to similar responses.

As for the prototype model, the connection weights are updated according to the delta learning rule:

\Delta w_{ijk} = w_{ijk}(t + 1) - w_{ijk}(t) = \eta \cdot [h_k(t) - c_k(t)] \cdot a_{ij},    (23.23)

where h_k(t) is the indicator function with h_k(t) = 1 for the desired category and 0 otherwise.

The architecture of the connectionist version of the exemplar model is presented in Figure 23.3.

FIGURE 23.3. Architecture of the connectionist version of the exemplar model.

The matrix form of the exemplar model is derived as follows. The deviations between the ideal points and the stimulus values on each dimension are the same as in Equation 23.15. The similarities for these deviations are the same as in Equation 23.16. The function in Equation 23.20 gives one point on the grid (one input unit); all the elements on the p × p grid of input units can be computed with the Kronecker product ⊗:

f_{sim}(Z, S) = f_{sim}(z_1, s_1) \otimes f_{sim}(z_2, s_2).    (23.24)

Note that f_sim(Z, S) is a p²-dimensional vector with elements as defined in Equation 23.20. This could have been arranged differently, as a p × p matrix, by setting f_sim(Z, S) = f_sim(z_1, s_1) · f_sim(z_2, s_2)^T. It depends on what is more convenient for the remaining steps in the calculation, but it is also a matter of taste of the individual researcher.

The input activations for all the input units on the p × p grid are

a(t) = \frac{1}{O^T \cdot f_{sim}(Z, S)} f_{sim}(Z, S).    (23.25)
Computational Modeling

s1

s2 a12 a1p
W1p1

a21 W1p2 c1

Wpp1
c2
ap1 app Wpp2

FIGURE 23.3.   Architecture of the connectionist version of


Copyright American Psychological Association. Not for further distribution.

the exemplar model.

a(t) is a p² vector with elements as defined in Equation 23.22. The propagation rule and the delta rule are analogous to Equation 23.18 and Equation 23.19, respectively. The algorithm for this part of the model can be found in Table 23.2.

TABLE 23.2
MATLAB Codes for Determining the Similarity Function and Weights for the Exemplar Model

Program command                                    Comment
function [a] = exempac(Z1,Z2,S1,S2,sigma,j)        Input and output of function
fsim1 = exp(-((Z1-S1(j))./sigma).^2);              Calculating similarity
fsim2 = exp(-((Z2-S2(j))./sigma).^2);
fsim  = kron(fsim1,fsim2);
a     = fsim/sum(fsim);                            Calculating activation
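As a usage illustration (not part of the chapter's code), the following sketch applies exempac to one training stimulus and then updates the connection weights with the delta rule of Equation 23.23. The grid range, the learning rate, and the variable names W, eta, and h are assumptions made for this sketch; the propagation rule is taken to be the weighted sum of the input activations, analogous to the prototype model.

    p     = 20;                        % grid resolution (assumed)
    Z1    = linspace(-2, 13, p);       % ideal points on dimension 1 (assumed range)
    Z2    = linspace(-2, 13, p);       % ideal points on dimension 2
    sigma = 5;                         % discriminability parameter
    eta   = 0.5;                       % learning rate
    S1 = [1 10];  S2 = [1 9];          % two example training stimuli from category C1
    W  = zeros(p^2, 2);                % connection weights w_ijk, one column per category

    j = 1;                             % present the first training stimulus
    a = exempac(Z1, Z2, S1, S2, sigma, j);   % input activations (1 x p^2), Equation 23.21
    c = a * W;                         % propagation rule: category activations c_k(t)
    h = [1 0];                         % indicator function: desired category is C1
    W = W + eta * (a' * (h - c));      % delta rule, Equation 23.23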
Step 3
The third step of the modeling process is to derive the predictions of the models. The predictions can be qualitative and quantitative. Although qualitative predictions do not require specific parameter values of the model (the predictions hold for all possible parameter values), quantitative predictions do require specific values for the free parameters of the model. Probing the model includes both qualitative and quantitative tests. For a qualitative test, the model predicts characteristic patterns, which are compared to patterns observed in data. For the quantitative test, the free parameters of the model are estimated from data, and a goodness-of-fit measure (see above) provides information about how well the model describes the data in a statistical sense. (For a broader discussion on qualitative versus quantitative tests, see Busemeyer and Diederich, 2010.)

Both models make predictions with respect to two different transfer tests: a generalization test and a recognition test. For the generalization test, new stimuli, not previously presented in the training set, are classified. For the recognition test, new and old stimuli—that is, those presented in the training set—are mixed and classified as new and old.

For the generalization test, the models assume that the probability of choosing category Ck for a new stimulus Snew (i.e., not an element of the training set) is based on a ratio of strength of the output activations.

After t trials of training, the output for category k = 1 is c1(t), and the probability for choosing C1 is

\Pr[C_1 \mid S_{\mathrm{new}}] = \frac{\exp(\beta \, c_1(t))}{\exp(\beta \, c_1(t)) + \exp(\beta \, c_2(t))} = \frac{1}{1 + \exp(-\beta\,(c_1(t) - c_2(t)))}   (23.26)

and the probability for choosing C2 is

\Pr[C_2 \mid S_{\mathrm{new}}] = 1 - \Pr[C_1 \mid S_{\mathrm{new}}].   (23.27)

That is, the activation rule, which specifies how the net input of a given unit produces its new activation state, is the logistic function F(x) = (1 + e^{-x})^{-1}, where x = β(c1(t) − c2(t)). The coefficient, β, is called a sensitivity parameter. Increasing the sensitivity parameter decreases the value of exp(−x) whenever c1(t) > c2(t) and therefore increases F(x). That is, increasing the sensitivity increases the slope of the function that relates the choice probability to the activation of a category; here it increases the probability of choosing C1 with activation c1(t).
The predictions of both models over a range of parameters are presented in Figure 23.4.

FIGURE 23.4.   Prediction of the prototype model (a) and the exemplar model (b) with respect to a generalization task. (Panels: a. Prototype: Generalization predictions; b. Exemplar: Generalization predictions. Each panel plots proportion correct, from about 0.5 to 0.9, as a function of the learning rate η from 0 to 1 and the sensitivity β from 0 to 15.)

In particular, the sensitivity parameter β ranged from 0 to 15 in unit steps, the learning parameter η from 0 to 1 in steps of 0.04, and σ in Equation 23.11 and Equation 23.20 is set to 5. Suppose the stimuli are defined by two dimensions, and let H and L be sets containing all possible values within the described dimension. Stimuli belonging to category C1 have either low values, L, on both dimensions, (S1 ∈ L, S2 ∈ L) = (l, l), or high values, H, on both dimensions, (S1 ∈ H, S2 ∈ H) = (h, h). Stimuli belonging to category C2 have low values on the first dimension and high values on the second dimension, (S1 ∈ L, S2 ∈ H) = (l, h), or high values on the first dimension and low values on the second dimension, (S1 ∈ H, S2 ∈ L) = (h, l). For the simulation, the stimuli are realizations from Gaussian distributions, N(µ, φ²). In particular, stimuli belonging to category C1 have low values on both dimensions, with mean µ1 = 1 for the first and µ2 = 1 for the second dimension, or high values on both dimensions, with mean µ1 = 10 for the first and µ2 = 9 for the second dimension; stimuli belonging to category C2 have either a low value on the first dimension and a high value on the second, with µ1 = 2 for the first and µ2 = 10 for the second dimension, or a high value on the first dimension and a low value on the second, with µ1 = 9 for the first and µ2 = 1 for the second dimension. For all conditions, φ² is set to 1.
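A sketch of this stimulus construction in MATLAB (the number of stimuli per cluster and the variable names are assumptions made for illustration):

    nPerCluster = 50;  phi = 1;                       % phi^2 = 1, as in the text
    mu  = [1 1; 10 9; 2 10; 9 1];                     % cluster means; rows 1-2 belong to C1, rows 3-4 to C2
    catLabel = [1 1 2 2];                             % category label of each cluster
    S = [];  label = [];
    for k = 1:4
        S     = [S; mu(k,1) + phi*randn(nPerCluster,1), mu(k,2) + phi*randn(nPerCluster,1)];
        label = [label; repmat(catLabel(k), nPerCluster, 1)];
    end
    % S(:,1) and S(:,2) hold the stimulus values on the two dimensions; label holds the category.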
For a recognition task, the models assume that the probability of classifying a stimulus as old, that is, as previously presented in the training set, is an increasing function of the total amount of activation produced by the stimulus at both output units. Again, a logistic function is used to relate total activation to the old–new recognition response probability:

\Pr[\mathrm{old} \mid S_{\mathrm{new}}] = \frac{\exp(\gamma \, (c_1(t) + c_2(t)))}{\delta + \exp(\gamma \, (c_1(t) + c_2(t)))} = \frac{1}{1 + \exp(-(\gamma\,(c_1(t) + c_2(t)) - \ln \delta))}   (23.28)

and the probability for choosing new is

\Pr[\mathrm{new} \mid S_{\mathrm{new}}] = 1 - \Pr[\mathrm{old} \mid S_{\mathrm{new}}].   (23.29)
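As with the choice rule, the two forms of Equation 23.28 can be checked numerically (gamma, delta, and the activations below are illustrative assumptions):

    gamma = 3;  delta = 2;  c1 = 0.62;  c2 = 0.35;    % assumed values
    pOld_ratio    = exp(gamma*(c1 + c2)) / (delta + exp(gamma*(c1 + c2)));
    pOld_logistic = 1 / (1 + exp(-(gamma*(c1 + c2) - log(delta))));   % equivalent logistic form
    % both are approximately 0.90; Pr[new | S_new] = 1 - pOld_ratio (Equation 23.29)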
The sensitivity parameter γ relates the recognition probability to the category activations. Increasing the sensitivity parameter causes the recognition probability to be more strongly influenced by the category activations. The parameter δ is a background-noise constant (Nosofsky et al., 1992). Here, it can be interpreted as a response bias parameter representing the tendency to say new to any stimulus; increasing δ increases the tendency to respond new.

Both models have five model parameters: the discriminability parameter σ, which determines the width of the generalization gradients; the learning rate parameter η for the delta learning rule; the sensitivity parameter β for the categorization choice rule; and two parameters for the recognition response rule, the sensitivity parameter γ and the response bias parameter δ. The main difference between the two models is in terms of the input representation. The prototype model uses two univariate sets of input units, whereas the exemplar model uses a single bivariate grid of input units. The latter is plausible, as many neurons are tuned to more than one feature. For instance, neurons in MT are tuned both to direction and spatial frequency, and neurons in V1 and V2 are tuned both to orientation and spatial frequency (e.g., De Valois & De Valois, 1990; Mazer et al., 2002).

Step 4
The fourth step is to test the predictions of the model with data and to compare the predictions of competing models with respect to their ability to explain the empirical results. However, as Roberts and Pashler (2000) pointed out, showing that a model fits the data is not enough. As pointed out before, a major concern is that if a model is too flexible (fits too much) and does not constrain possible outcomes, then the fit is meaningless; if it is too flexible, it is necessary to penalize it for its complexity (Myung, 2000).

All models are an abstraction from a real-world phenomenon, and they focus only on essential aspects of a complex system. To be tractable and useful, models only reflect a simple and limited representation of the complex phenomenon.

That is, a priori, all models are wrong in some details, and a sufficient amount of data will always prove that a model is not true. The question is which among the competing models provides a better representation of the phenomenon under question. Within the present context, the question is which of the two models, the prototype model or the exemplar model, provides a better explanation of how objects are categorized.

To empirically test competing models, it is crucial to design experiments that challenge the models. For instance, designing experimental conditions that lead to opposite qualitative predictions (categorical or ordinal) is an essential step in the model testing process. For example, the prototype model predicts that stimulus S is categorized in category C1 most often, but the exemplar model predicts that stimulus S is categorized in category C2 most often. Qualitative tests are parameter free in the sense that the models are forced to make these predictions for any value of the free parameters. The following briefly describes a design and shows a qualitative test for the two competing models.

Experiments in categorical learning typically are divided into two phases: a learning or training phase followed by a transfer test phase. During the training phase, the participants categorize objects in distinct classes and receive feedback about their performance. The transfer test is either a generalization test or a recognition test, both without feedback (see Step 3).

Assume that the participants accurately learned the category assignments for each of the four clusters of stimuli, (S1 ∈ L, S2 ∈ L) = (l, l), (S1 ∈ H, S2 ∈ H) = (h, h), (S1 ∈ L, S2 ∈ H) = (l, h), (S1 ∈ H, S2 ∈ L) = (h, l), during the training phase. According to the exemplar model, for the transfer test, when fixing the first dimension at a high value (h, •), the probability of choosing category C2 decreases as the value of the second dimension increases, (h, l → h); however, fixing the first dimension at a low value (l, •), the probability of choosing category C2 increases as the value of the second dimension increases (l, l → h). Thus, according to the exemplar model, the value of the second dimension has opposite effects on response probability, depending on the value of the first dimension (Nosofsky et al., 1992). This crossover interaction effect is critical for a qualitative test of the two competing models. As it turns out, the prototype model cannot predict the crossover effect when fixing one dimension and varying only the second; the exemplar model, however, predicts this crossover for a wide range of parameter values (see Busemeyer & Diederich, 2010). For demonstration, let us take the same parameters for β, η, σ as in Step 3. The means for the stimulus values, however, are set to µ1 = 1 and µ2 = 1 or µ1 = 10 and µ2 = 10 for category C1, and µ1 = 1 and µ2 = 10 or µ1 = 10 and µ2 = 1 for category C2. Figure 23.5 shows the simulation results. A dot indicates a combination of parameters that successfully reproduced the crossover. If the parameters are sufficiently large, the exemplar model predicts the crossover, here in 337 out of 375 possible cases.

FIGURE 23.5.   Predicted patterns of results for the exemplar model. Each dot indicates a combination of parameters that correctly reproduced the crossover interaction. (Axes: learning rate η from 0.04 to 1; sensitivity β from 1 to 15.)

When we take the previous parameters µ1 = 1 and µ2 = 1 or µ1 = 10 and µ2 = 9 for category C1 and µ1 = 2 and µ2 = 10 or µ1 = 9 and µ2 = 1 for category C2, the simulation reproduces the crossover for both models, as shown in Figure 23.6.
FIGURE 23.6.   Predicted patterns of results for the exemplar model (A) and the prototype model (B). Each dot indicates a combination of parameters that correctly reproduced the crossover interaction. (Axes of both panels: learning rate η from 0.04 to 1; sensitivity β from 1 to 15.)

The exemplar model (Fig. 23.6A) reproduces the correct pattern in 325 out of 375 possible cases; the prototype model reproduces it in 191 cases. This example shows how crucial it is to design a proper experiment to distinguish between two competing models that make, in general, very similar predictions. It is not always possible to construct qualitative tests for deciding between competing models. For instance, a model may predict increasing or decreasing functions depending on the specific parameter values. Furthermore, a model might be so complex that it is impossible to identify general qualitative patterns. For example, an arbitrarily large, hidden-unit, nonlinear neural network model can approximate a wide range of continuous functions, and thus it is not constrained to predict a general pattern that can be tested. Sometimes it is also important to interpret specific parameter values—for instance, when comparing patients and healthy adults or younger and older participants. For those cases, a quantitative test of the model or a quantitative comparison of competing models is appropriate. It is also necessary for a model to make quantitative predictions that are more accurate than its competitors. Quantitative predictions of a model are evaluated on the basis of an optimal selection of parameters. Otherwise, a perfectly good model could be rejected due to a poor selection of parameters.
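One common way to implement the complexity penalty mentioned above, although it is not part of the chapter's example, is to compare fitted models with information criteria such as AIC and BIC (Akaike, 1983; Schwarz, 1978). The log-likelihood values below are made-up illustrative numbers:

    logL = [-512.3, -508.9];      % maximized log-likelihoods: prototype, exemplar (assumed)
    k    = [5, 5];                % number of free parameters of each model
    n    = 400;                   % number of observations
    AIC  = -2*logL + 2*k;
    BIC  = -2*logL + k*log(n);
    % smaller AIC/BIC is better; the difference reflects fit penalized by model complexity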

Step 5
The last step in the modeling process is to modify the model in light of the data. Sometimes it is sufficient to make some adjustments to account for the observed data. Sometimes it is necessary to reformulate the theoretical framework—for instance, by modifying assumptions or by adding new assumptions; sometimes it is inevitable to abandon the model and construct a completely new model based on the feedback obtained from new experimental results. That is, new experimental findings pose new challenges to previous models. New models trigger new experiments. Modeling is a cyclic process, and progress in the empirical and experimental sciences is made via this cycle: theorizing about the phenomenon and developing a model, deriving predictions from the model, testing the model, revising the model in light of empirical findings, testing the model again, and so on. Thus, the modeling process produces an evolution of models that improve and become more powerful over time as the science in a field progresses.

CONCLUSION

What are the advantages of having a computational model, and what can it offer to the modeling cycle? D'Mello and Franklin (2011) pointed to two benefits. First, the process of model development is highly instrumental in obtaining a deep understanding of the phenomenon under consideration. It involves deciding on the functional requirements and goal of the model; separating the various individual components of the model; and inventing schemes that bring all this together in order to obtain the desired behavior. Second, insights can be obtained from basic computational principles that underlie the model. That is, any design decision made in the model-building process can be interpreted as a hypothesis that can be tested empirically.

References

Akaike, H. (1983). Information measures and model selection. Bulletin of the International Statistical Institute, 50, 277–290.

Beasley, W. H., & Rodgers, J. L. (2012). Bootstrapping and Monte Carlo methods. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology: Vol. 2. Research designs: Quantitative, qualitative, neuropsychological, and biological (pp. 407–425). American Psychological Association. https://doi.org/10.1037/13620-022

Boden, M. A. (2006). Mind as machine: A history of cognitive science. The Clarendon Press.

Browne, M. W., & Cudeck, R. (1992). Alternative ways of assessing model fit. Sociological Methods & Research, 21(2), 230–258. https://doi.org/10.1177/0049124192021002005

Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2), 261–304. https://doi.org/10.1177/0049124104268644

Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. SAGE Publishing.

Chechile, R. A. (1999). A vector-based goodness-of-fit metric for interval-scaled data. Communications in Statistics: Theory and Methods, 28(2), 277–296. https://doi.org/10.1080/03610929908832298

Cole, D. A., & Ciesla, J. A. (2012). Latent variable modeling of continuous growth. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. Sher (Eds.), APA handbook of research methods in psychology: Vol. 3. Data analysis and research publication (pp. 309–322). American Psychological Association. https://doi.org/10.1037/13621-015

D'Mello, S., & Franklin, S. (2011). Computational modeling/cognitive robotics complements functional modeling/experimental psychology. New Ideas in Psychology, 29(3), 217–227. https://doi.org/10.1016/j.newideapsych.2009.07.003

De Valois, R. L., & De Valois, K. K. (1990). Spatial vision. Oxford University Press.

Diederich, A. (1997). Dynamic stochastic models for decision making with time constraints. Journal of Mathematical Psychology, 41(3), 260–274. https://doi.org/10.1006/jmps.1997.1167

Diederich, A., & Busemeyer, J. R. (2012). Computational modeling. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology: Vol. 2. Research designs: Quantitative, qualitative, neuropsychological, and biological (pp. 387–405). American Psychological Association. https://doi.org/10.1037/13620-021

Diederich, A., & Oswald, P. (2014). Sequential sampling model for multiattribute choice alternatives with random attention time and processing order. Frontiers in Human Neuroscience, 8, 697. https://doi.org/10.3389/fnhum.2014.00697

Diederich, A., & Trueblood, J. S. (2018). A dynamic dual process model of risky decision making. Psychological Review, 125(2), 270–292. https://doi.org/10.1037/rev0000087

Evren, A., & Tuna, E. (2012). On some properties of goodness of fit measures based on statistical entropy. International Journal of Research and Reviews in Applied Sciences, 13(1), 192–205.

Fan, X. (2012). Designing simulation studies. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology: Vol. 2. Research designs: Quantitative, qualitative, neuropsychological, and biological (pp. 427–444). American Psychological Association. https://doi.org/10.1037/13620-021

Feinberg, F. M., & Gonzalez, R. (2007, March). Bayesian modeling for psychologists: An applied approach [Paper presentation]. Tutorial Workshop on Bayesian Techniques, University of Michigan, Ann Arbor, MI, United States.

Flaherty, B. P., & Kiff, C. J. (2012). Latent class and latent profile models. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. Sher (Eds.), APA handbook of research methods in psychology: Vol. 3. Data analysis and research publication (pp. 391–404). American Psychological Association. https://doi.org/10.1037/13621-019

George, G., & Raimond, K. (2013). A survey on optimization algorithms for optimizing the numerical functions. International Journal of Computers and Applications, 61(6), 41–46. https://doi.org/10.5120/9935-4570

Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Prentice Hall.

Hepburn, B., & Andersen, H. (2021). Scientific method. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. https://plato.stanford.edu/archives/sum2021/entries/scientific-method/

Hübner, R., & Pelzer, T. (2020). Improving parameter recovery for conflict drift-diffusion models. Behavior Research Methods, 52(5), 1848–1866. https://doi.org/10.3758/s13428-020-01366-8

Kandil, F. I., Diederich, A., & Colonius, H. (2014). Parameter recovery for the time-window-of-integration (TWIN) model of multisensory integration in focused attention. Journal of Vision, 14(11), 1–20. https://doi.org/10.1167/14.11.14

Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99(1), 22–44. https://doi.org/10.1037/0033-295X.99.1.22

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten Zip code recognition. Neural Computation, 1(4), 541–551. https://doi.org/10.1162/neco.1989.1.4.541

Lebière, C., & Anderson, J. R. (1993). A connectionist implementation of the ACT-R production system. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society (pp. 635–640). Lawrence Erlbaum Associates.

Lewandowsky, S., & Farrell, S. (2011). Computational modeling in cognition: Principles and practice. SAGE Publications. https://doi.org/10.4135/9781483349428

Mazer, J. A., Vinje, W. E., McDermott, J., Schiller, P. H., & Gallant, J. L. (2002). Spatial frequency and orientation tuning dynamics in area V1. Proceedings of the National Academy of Sciences of the United States of America, 99(3), 1645–1650. https://doi.org/10.1073/pnas.022638499

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133. https://doi.org/10.1007/BF02478259

Myung, I. J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44(1), 190–204. https://doi.org/10.1006/jmps.1999.1283

Myung, I. J., & Pitt, M. A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4(1), 79–95. https://doi.org/10.3758/BF03210778

Navarro, D. J. (2019). Between the devil and the deep blue sea: Tensions between scientific judgment and statistical model selection. Computational Brain & Behavior, 2(1), 28–34. https://doi.org/10.1007/s42113-018-0019-z

Nezlek, J. B. (2012). Multilevel modeling for psychologists. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology: Vol. 3. Data analysis and research publication (pp. 219–241). American Psychological Association. https://doi.org/10.1037/13621-011

Nilsson, N. J. (2010). The quest for artificial intelligence: A history of ideas and achievements. Cambridge University Press.

Nosofsky, R. M., Kruschke, J. K., & McKinley, S. C. (1992). Combining exemplar-based category representations and connectionist learning rules. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(2), 211–233. https://doi.org/10.1037/0278-7393.18.2.211

Palminteri, S., Wyart, V., & Koechlin, E. (2017). The importance of falsification in computational cognitive modeling. Trends in Cognitive Sciences, 21(6), 425–433. https://doi.org/10.1016/j.tics.2017.03.011

Ratcliff, R. (2012). Response time distributions. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology: Vol. 2. Research designs: Quantitative, qualitative, neuropsychological, and biological (pp. 429–443). American Psychological Association. https://doi.org/10.1037/13620-021

Rindskopf, D. (2012). Generalized linear models. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. J. Sher (Eds.), APA handbook of research methods in psychology: Vol. 3. Data analysis and research publication (pp. 191–206). American Psychological Association. https://doi.org/10.1037/13621-009

Roberts, S., & Pashler, H. (2000). How persuasive is a good fit? A comment on theory testing. Psychological Review, 107(2), 358–367. https://doi.org/10.1037/0033-295X.107.2.358

Rojas, R. (1996). Neural networks: A systematic introduction. Springer.

Rumelhart, D. E., McClelland, J. L., & the PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. MIT Press.

Schubert, A.-L., Hagemann, D., Voss, A., & Bergmann, K. (2017). Evaluating the model fit of diffusion models with the root mean square error of approximation. Journal of Mathematical Psychology, 77, 29–45. https://doi.org/10.1016/j.jmp.2016.08.004

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464. https://doi.org/10.1214/aos/1176344136

Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237(4820), 1317–1323. https://doi.org/10.1126/science.3629243

Shmueli, G. (2010). To explain or to predict? Statistical Science, 25(3), 289–310. https://doi.org/10.1214/10-STS330

Smith, P. L., & Vickers, D. (1988). The accumulator model of two-choice discrimination. Journal of Mathematical Psychology, 32(2), 135–168. https://doi.org/10.1016/0022-2496(88)90043-0

Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173–180.

Steiger, J. H. (n.d.). Measures of fit in structural equation modeling: An introduction. https://www.statpower.net/Content/312/Handout/Measures%20of%20Fit%20in%20Structural%20Equation%20Modeling.pdf

Steyer, R., Geiser, C., & Fiege, C. (2012). Latent state–trait models. In H. Cooper, P. M. Camic, D. L. Long, A. T. Panter, D. Rindskopf, & K. Sher (Eds.), APA handbook of research methods in psychology: Vol. 3. Data analysis and research publication (pp. 291–308). American Psychological Association. https://doi.org/10.1037/13621-014

Sun, R. (2008a). The Cambridge handbook of computational psychology. Cambridge University Press.

Sun, R. (2008b). Introduction to computational cognitive modeling. In R. Sun (Ed.), The Cambridge handbook of computational psychology (pp. 3–19). Cambridge University Press.

Sun, R. (2009). Theoretical status of computational cognitive modeling. Cognitive Systems Research, 10(2), 124–140. https://doi.org/10.1016/j.cogsys.2008.07.002

Sun, S., Cao, Z., Zhu, H., & Zhao, J. (2019). A survey of optimization methods from a machine learning perspective. arXiv:1906.06821v2. https://doi.org/10.48550/arXiv.1906.06821

Thomas, M. S. C., & McClelland, J. L. (2008). Connectionist models of cognition. In R. Sun (Ed.), The Cambridge handbook of computational psychology (pp. 23–58). Cambridge University Press.

Townsend, J. T. (1975). The mind-body equation revisited. In C. Cheng (Ed.), Philosophical aspects of the mind-body problem. The University Press of Hawaii.

van Ravenzwaaij, D., & Oberauer, K. (2009). How to use the diffusion model: Parameter recovery of three methods: EZ, fast-dm, and DMAT. Journal of Mathematical Psychology, 53(6), 463–473. https://doi.org/10.1016/j.jmp.2009.09.004

van Rooij, I., & Baggio, G. (2021). Theory before the test: How to build high-verisimilitude explanatory theories in psychological science. Perspectives on Psychological Science, 16(4), 682–697. https://doi.org/10.1177/1745691620970604

Voit, E. O. (2019). Perspective: Dimensions of the scientific method. PLOS Computational Biology, 15(9), e1007279. https://doi.org/10.1371/journal.pcbi.1007279

Vrieze, S. I. (2012). Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods, 17(2), 228–243. https://doi.org/10.1037/a0027127

Wilson, R. C., & Collins, A. G. E. (2019). Ten simple rules for the computational modeling of behavioral data. eLife, 8, e49547. https://doi.org/10.7554/eLife.49547
