
Neuron

Primer

Bayesian Decision Models: A Primer

Wei Ji Ma
Center for Neural Science and Department of Psychology, New York University, New York, NY, USA
Correspondence: [email protected]
https://doi.org/10.1016/j.neuron.2019.09.037

To understand decision-making behavior in simple, controlled environments, Bayesian models are often useful. First, optimal behavior is always Bayesian. Second, even when behavior deviates from optimality, the Bayesian approach offers candidate models to account for suboptimalities. Third, a realist interpretation of Bayesian models opens the door to studying the neural representation of uncertainty. In this tutorial, we review the principles of Bayesian models of decision making and then focus on five case studies with exercises. We conclude with reflections and future directions.

1. Introduction

1.1. What Are Bayesian Decision Models?
Good computational modeling of decision making goes beyond a mere description of the data (curve fitting). Good models help to break down perceptual, cognitive, or motor processes into interpretable and generalizable stages. This, in turn, may allow for conceptual connections across experiments or domains, for a characterization of individual differences, or for establishing correlations between model variables and aspects of neural activity. Across domains of application, Bayesian models of decision making are based on the same small set of principles, thereby promising high interpretability and generalizability. Bayesian models aspire to account for an organism's decision process when the task-relevant states of the world are not exactly known to the organism. A state of the world can be a physical variable, such as the reflectance of a surface or the location where a ball will land, or a more abstract one, such as whether two parts belong to the same object or the intent of another person. Lack of exact knowledge can arise from noise (Faisal et al., 2008) or from missing information, such as in the case of occlusion (Kersten et al., 2004).

Bayesian decision models have two key components (Figure 1). The first is Bayes' rule, which formalizes how the decision maker assigns probabilities (degrees of belief) to hypothesized states of the world given a particular set of observations. The second is a cost function, which is the quantity that the decision maker would like to minimize; an example would be the proportion of errors in a task. The cost function dictates how the decision maker should transform beliefs about states of the world into a decision. Combining the components, we end up with a mapping from observations to decision. While the form of that mapping depends on the task, the Bayesian recipe for deriving it is general.

1.2. Why Bayesian Decision Models?
The Bayesian modeling framework for decision making holds appeal for various reasons. The first reason has an evolutionary or ecological flavor: Bayesian inference optimizes behavioral performance, and one might postulate that the mind applies a near-optimal algorithm in decision tasks that are common or important in the natural world (or daily life). This argument is more plausible for perceptual than for cognitive decision making. Second, Bayesian models are general because their two key components are; the recipe for constructing a Bayesian model applies across a wide range of tasks. Third, in Bayesian models, the decision model is largely dictated by the generative model, which, in turn, is often largely dictated by the statistics of the experiment. As a result, many Bayesian models have few free parameters. Fourth, Bayesian models have a good empirical track record of accounting for behavior, both in humans and in other animals. Fifth, sensible models can easily be constructed by modifying the assumptions of an optimal Bayesian model. Thus, the Bayesian model is a good starting point for model generation.

1.3. What Math Does One Need?
Bayesian modeling may seem intimidating to beginners, but the math involved rarely goes beyond standard calculus. Because of the recipe-based methodology, many learners already feel in control after several tens of hours of practice. Importantly, when building Bayesian models, it is easy to supplement math and intuitions with simple simulations.

1.4. Areas of Application
Bayesian modeling is most straightforward if the task-relevant world states are categorical (e.g., binary) or low dimensional, if their distributions are simple, if the task allows for parametric variation of world state variables, and if the decision maker has an unambiguously defined objective. This makes Bayesian modeling particularly suitable for simple perceptual tasks. However, Bayesian decision models appear also in studies of eye movements in natural scenes (Itti and Baldi, 2009), reaching movements (Körding and Wolpert, 2004), physical scene understanding (Battaglia et al., 2013), speech understanding (Goodman and Frank, 2016), inductive reasoning (Tenenbaum et al., 2006), and economic decisions (Cogley and Sargent, 2008).

1.5. Disclaimer
This primer is about Bayesian decision models in psychology and neuroscience, not about Bayesian data analysis. We will also not discuss how to fit Bayesian decision models and compare them to other decision models, because that involves methods that are not specific to Bayesian modeling and are well described elsewhere.

164 Neuron 104, October 9, 2019 © 2019 Published by Elsevier Inc.

2. Recipe for Bayesian Modeling
The recipe for Bayesian modeling consists of the following steps: specifying the "generative model" (step 1), calculating the


"posterior distribution" (inference; step 2a), turning the posterior distribution into an "action or response" (step 2b), and calculating the "action/response distribution" for comparison with experimental data (step 3).

[Figure 1. Schematic of Bayesian Decision Making. Elements shown: observations, likelihood function, prior distribution, posterior distribution, cost function, action/response.]

2.1. Step 1: Generative Model
Consider a decision maker who has to take an action that requires inferring a state of the world s from an observation x. The variables s and x can be discrete or continuous and one-dimensional or multidimensional. The observation x can be a physical stimulus generated from an underlying unknown s (for example, when s is a category of stimuli), an abstract "measurement" (s plus noise), or a pattern of neural activity.

The frequencies of occurrence of each value of s in the environment are captured by a probability distribution p(s). The distribution of the observation is specified conditioned on s and denoted by p(x | s). Often, p(x | s) is not directly known but has to be derived from other conditional distributions. Together, the distributions p(s) and p(x | s) define a "generative model," a statistical description of how the observations come about.

2.2. Step 2a: Inference
The key assumption of Bayesian models of behavior is that the decision maker has learned the distributions in the generative model and puts this knowledge to full use when inferring states of the world. We now turn to this inference process.

On a given trial, the decision maker makes a specific observation x_trial. They then calculate the probabilities of possible world states s given that observation. They do this using Bayes' rule,

    p(s | x_trial) = p(x_trial | s) p(s) / p(x_trial).   (Equation 1)

The numerator of the right-hand side involves two probabilities that we recognize from the generative model. Indeed, because we defined them in the generative model, they can be calculated here. However, their interpretation is different from the generative model. To understand this, we first need to realize that, in Equation 1, the world state variable s should be considered a hypothesis entertained by the decision maker. Each probability involving s should be interpreted as a degree of belief in a value of s. As such, these probabilities exist only in the head of the decision maker and are not directly observable.

Specifically, in the context of Equation 1, p(s) is called the "prior distribution." At first glance, it might be confusing to give p(s) a new name; was it not already the distribution of the state of the world? The reason for the new name is that the prior distribution reflects to what extent the decision maker expects different values of s; in other words, it formalizes a belief. If the decision maker's beliefs are wrong, the distribution p(s) in Equation 1 will be different from the p(s) in the generative model; we will discuss this in Section 4.5.

The factor p(x_trial | s) in Equation 1 is the likelihood function of s. The likelihood of a hypothesized world state s given an observation is the probability of that observation if the hypothesis were true. It is important that the likelihood function is a function of the hypothesized world state s, not of x_trial (which is a given value). To make this dependence explicit, it could be helpful to use the notation L(s) = p(x_trial | s). The likelihood of s is numerically the same as p(x_trial | s) in the generative model, but what is the argument and what is given is switched. (Side note: Bayesian experts would not use the phrase "likelihood of the observation.")

The left-hand side distribution, p(s | x_trial), is called the posterior distribution of s. It captures the answer to the question to what extent each possible world state value s is supported by the observed measurement x_trial and prior beliefs. Finally, the probability in the denominator of Equation 1, p(x_trial), does not depend on s and is therefore a numerical constant; it acts as the normalization factor of the numerator.

2.3. Step 2b: Taking an Action (Making a Response)
Bayesian decision making does not end with the computation of a posterior distribution. The decision maker has to take an action, which we will denote by a. An action could be a natural movement or a response in an experiment. In perceptual tasks in the laboratory, the response is typically an estimate ŝ of the state of the world s, and such an estimate is Bayesian if it is derived from the posterior. In perceptual tasks, one could even go as far as postulating that ŝ represents the contents of perception (i.e., a percept).

In general, the appropriate action a can be chosen in a principled manner by defining a "cost function" C(s, a) (also called a "loss function" or an "objective function"), which the decision maker strives to minimize. This function depends on the world state s and the action a. On a trial when the observation is x_trial, the "expected cost" is the expected value of C(s, a) with respect to the posterior distribution calculated in Equation 1:

    EC(a) = Σ_s p(s | x_trial) C(s, a),   (Equation 2)

where the sum applies to discrete s; for continuous s, the sum is replaced by an integral. The Bayesian decision maker chooses the action that minimizes EC(a). In this sense, the action is optimal and the Bayesian approach is normative. The process of choosing an action given a posterior is, in its basic form, deterministic.

When a is an estimate of s, we can be a bit more specific. First, we consider the case that s is discrete, and the objective is to estimate correctly (i.e., C(s, a) = −1 when s = a, 0 otherwise). Then, the optimal action is to report the mode of the posterior: the state for which p(s | x_trial) is highest. Next, we consider the case that s is real valued, and the objective is to minimize expected squared error (i.e., C(s, a) = (s − a)²). Then, the optimal readout is the mean of the posterior distribution.

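Equations 1 and 2 can be checked with a few lines of code. The sketch below is a toy example with made-up numbers (it is not from the primer): it applies Bayes' rule on a discrete set of world states and then picks the action that minimizes the expected cost of Equation 2. With the 0/−1 "correctness" cost it recovers the posterior mode; with the squared-error cost it recovers the posterior mean.

```python
import numpy as np

# Toy discrete world: three possible states and made-up probabilities.
states = np.array([0.0, 1.0, 2.0])
prior = np.array([0.5, 0.3, 0.2])   # p(s)
lik = np.array([0.1, 0.6, 0.3])     # p(x_trial | s), evaluated at the observed x_trial

# Equation 1: posterior is proportional to likelihood times prior.
posterior = lik * prior
posterior /= posterior.sum()        # dividing by p(x_trial) = normalization

# Equation 2: expected cost of each candidate action a under the posterior.
def expected_cost(cost):
    # cost[i, j] = C(s_i, a_j); returns the vector EC(a) over actions
    return posterior @ cost

# 0/−1 cost for estimating correctly -> optimal action is the posterior mode.
cost_hit = -np.eye(len(states))
a_mode = states[np.argmin(expected_cost(cost_hit))]

# Squared-error cost -> optimal action is (the grid point nearest) the posterior mean.
a_grid = np.linspace(0.0, 2.0, 201)
cost_sq = (states[:, None] - a_grid[None, :]) ** 2
a_mean = a_grid[np.argmin(expected_cost(cost_sq))]

print(a_mode, a_mean, posterior @ states)
```

Minimizing expected cost over a fine grid of actions is a brute-force stand-in for the analytical results stated above; for these numbers the mode is the middle state and the grid minimizer lands next to the exact posterior mean.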



2.4. Step 3: Response (Action) Distribution
The generative model, the computation of the posterior (inference), and a mapping from posterior to action together complete a Bayesian model. To test the model against experimental data, the modeler needs to derive or simulate the distribution of actions a given a state of the world s, that is, p(a | s). When the action is an estimate ŝ, the difference ŝ − s is the estimation error, and the distribution p(ŝ | s) will characterize estimation errors.

Finally, the parameters of the model need to be fitted to the data. Examples of parameters in Bayesian models are sensory noise level (Section 3.5), lapse rate (Section 4.4), and wrong belief parameters (Section 4.5). For fitting, maximum-likelihood estimation is the most standard, where now "likelihood" refers to the likelihood of the parameters given the experimental data (Myung, 2003). To implement the maximization, one can use any number of standard algorithms; make sure to initialize with multiple starting points to reduce the chance of getting stuck in a local optimum (Martí et al., 2016).

3. Case Studies
We will now go through five case studies that illustrate different aspects of Bayesian decision models. We encourage the reader to try to do the exercises; solutions are included in Methods S1.

3.1. Case 1: Unequal Likelihoods and Gestalt Laws
You observe the five dots below all moving downward, as indicated by the arrows.

[Figure: five dots, each with a downward arrow.]

According to Gestalt psychology (Wertheimer, 1938), the mind has a tendency to group the dots together because of their common motion and perceive them as a single object. This is captured by the "Gestalt law of common fate." Gestalt laws, however, are merely narrative summaries of phenomenology. A Bayesian model has the potential to provide a true explanation of the percept and, in some cases, make quantitative predictions (Wagemans et al., 2012). In this case, the Bayesian decision model takes the form of an "observer model" or "perception model."

Step 1: Generative Model
We first formulate our generative model. The retinal image of each dot serves as a sensory observation. We will denote these five retinal images by I1, I2, I3, I4, and I5, each specifying the direction of movement of the corresponding dot's image on the retina (up or down). For didactic purposes, let's say that there exist only two scenarios in the world.

- Scenario 1: all dots are part of the same object, and they therefore always move together. They move together either up or down, each with probability 0.5.
- Scenario 2: each dot is an object by itself. Each dot independently moves either up or down, each with probability 0.5.

(Dots are only allowed to move up and down, and speed and position do not play a role in this problem.) The world state s from Section 2 is now a binary scenario.

[Figure 2. Generative Model Diagram for the Two Scenarios in Case 1.]

a. The generative model diagram in Figure 2 shows each scenario in a big box. Inside each box, the bubbles contain the variables and the arrows represent dependencies between variables. In other words, an arrow can be understood to represent the influence of one variable on another; it can be read as "produces" or "generates" or "gives rise to." The sensory observations should always be at the bottom of the diagram. Put the following variable names in the correct boxes: retinal images I1, I2, I3, I4, and I5 and motion directions s (a single motion direction), s1, s2, s3, s4, and s5. The same variable might appear more than once.

Step 2: Inference
In inference, the two scenarios become hypothesized scenarios. Inference involves likelihoods and priors. The likelihood of a scenario is the probability of the sensory observations under the scenario.

b. What is the likelihood of scenario 1?
c. What is the likelihood of scenario 2?
d. Do the likelihoods of the scenarios sum to 1? Explain why or why not.
e. What is wrong with the phrase "the likelihood of the observations"?

Let's say scenario 1 occurs twice as often in the world as scenario 2. The observer can use these frequencies of occurrence as prior probabilities, reflecting expectations in the absence of specific sensory observations.

f. What are the prior probabilities of scenarios 1 and 2?
g. What is the product of the likelihood and the prior probability for scenario 1?
h. What is this product for scenario 2?
i. Do these products of the scenarios sum to 1?
j. Posterior probabilities have to sum to 1. To achieve that, divide each of the products above by their sum. Calculate the posterior probabilities of scenarios 1 and 2. You have just applied Bayes' rule.

The default Bayesian perception model for discrete hypotheses holds that the percept is the scenario with the highest posterior probability (maximum-a-posteriori or MAP estimation).

k. Would that be consistent with the law of common fate? Explain.

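If you prefer to check your answers to exercises (b)-(j) by simulation rather than by peeking at Methods S1, the sketch below (my own code, not part of the primer) estimates each scenario's likelihood by Monte Carlo simulation of the generative model and then applies Bayes' rule with the prior stated above. Running it reveals approximate answers, so attempt the exercises first.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim = 200_000
observed = np.array([0, 0, 0, 0, 0])   # 0 = down, 1 = up; all five dots move down

# Scenario 1: one shared motion direction for all five dots.
dirs1 = rng.integers(0, 2, size=n_sim)
obs1 = np.repeat(dirs1[:, None], 5, axis=1)

# Scenario 2: each dot picks its direction independently.
obs2 = rng.integers(0, 2, size=(n_sim, 5))

# Monte Carlo likelihood: fraction of simulated trials that reproduce the observation.
lik1 = np.mean((obs1 == observed).all(axis=1))
lik2 = np.mean((obs2 == observed).all(axis=1))

# Prior: scenario 1 occurs twice as often as scenario 2.
prior = np.array([2 / 3, 1 / 3])
unnorm = np.array([lik1, lik2]) * prior
posterior = unnorm / unnorm.sum()      # Bayes' rule (exercise j)
print(lik1, lik2, posterior)
```

The simulated likelihoods approximate the exact values you should obtain analytically in (b) and (c), and the resulting posterior strongly favors the common-object scenario, in line with MAP perception and the law of common fate.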



l. How does this Bayesian observer model complement, or go beyond, the traditional Gestalt account of this phenomenon?

In this case, as is often the case, the action is in the likelihood and the prior is relatively unimportant.

3.2. Case 2: Competing Likelihoods and Priors in Motion Sickness
Michel Treisman has tried to explain motion sickness in the context of evolution (Treisman, 1977). During the millions of years over which the human brain evolved, accidentally eating toxic food was a real possibility, and that could cause hallucinations. Perhaps our modern brain still uses prior probabilities passed on from those days; those would not be based on our personal experience, but on our ancestors'! This is a fascinating, though only weakly tested, theory. Here, we don't delve into the merits of the theory but try to cast it in Bayesian form.

Suppose you are in a windowless room on a ship at sea. Your brain has two sets of sensory observations: visual observations and vestibular observations. Let's say that the brain considers three scenarios for what caused these observations:

- Scenario 1: the room is not moving, and your motion in the room causes both sets of observations.
- Scenario 2: your motion in the room causes your visual observations, whereas your motion in the room and the room's motion in the world together cause the vestibular observations.
- Scenario 3: you are hallucinating; your motion in the room and ingested toxins together cause both sets of observations.

Step 1: Generative Model
a. Draw a diagram of the generative model. It should contain one box for each scenario and all of the italicized variables in the previous paragraph. Some variables might appear more than once.

Step 2: Inference
No numbers are needed except in part (e).

b. In prehistory, people would, of course, move around in the world, but surroundings would almost never move. Once in a while, a person might accidentally ingest toxins. Assuming that your innate prior probabilities are based on these prehistoric frequencies of events, draw a bar diagram to represent your prior probabilities of the three scenarios above.
c. In the windowless room on the ship, there is a big discrepancy between your visual and vestibular observations. Draw a bar diagram that illustrates the likelihoods of the three scenarios in that situation (i.e., how probable these particular sensory observations are under each scenario).
d. Draw a bar diagram that illustrates the posterior probabilities of the three scenarios.
e. Use numbers to illustrate the calculations in (b)-(d).
f. Using the posterior probabilities, explain why you might vomit in this situation.

3.3. Case 3: Ambiguity Due to a Nuisance Parameter in Color Vision
We switch domains once again and apply a Bayesian approach to the central problem of color vision (Brainard and Freeman, 1997), simplified to a problem for grayscale surfaces. We see a surface when there is a light source. The surface absorbs some proportion of the incident photons and reflects the rest. Some of the reflected photons reach our retina.

Step 1: Generative Model
The diagram of the generative model is:

[Diagram: surface shade and light intensity jointly generate retinal intensity.]

The "shade" of a surface is the grayscale in which a surface has been painted. Technically, shade is "reflectance," the proportion of incident light that is reflected. Black paper might have a reflectance of 0.10, while white paper might have a reflectance of 0.90. The "intensity of a light source" (illuminant) is the amount of light it emits. Surface shade and light intensity are the world state variables relevant to this problem.

The sensory observation is the amount of light measured by the retina, which we will also refer to as retinal intensity. The retinal intensity can be calculated as follows:

    retinal intensity = surface shade × light intensity.   (Equation 3)

In other words, if you make a surface twice as reflective, it has the same effect on your retina as doubling the intensity of the light source.

Step 2: Inference
Let's take each of these numbers to be between 0 (representing black) and 1 (representing white). For example, if the surface shade is 0.5 (mid-level gray) and the light intensity is 0.2 (very dim light), then the retinal intensity is 0.5 × 0.2 = 0.1.

a. Suppose your retinal intensity is 0.2. Suppose further that you hypothesize the light intensity to be 1 (very bright light). Under that hypothesis, calculate what the surface shade must have been.
b. Suppose your retinal intensity is the same 0.2. Suppose further that you hypothesize the light intensity to be 0.4. Under that hypothesis, calculate what the surface shade must have been.
c. Explain why the retinal intensity provides ambiguous information about surface shade.
d. Suppose your retinal intensity is again 0.2. By going through a few more examples like the ones in (a) and (b), draw in the two-variable likelihood diagram in Figure 3 all combinations of hypothesized surface shade and hypothesized light intensity that could have produced your retinal intensity of 0.2. Think of this plot as a 3D plot (surface plot)!

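Exercise (d) can also be explored numerically. The sketch below (illustrative code with an assumed grid resolution, not from the primer) scans hypothesized (shade, light intensity) pairs and keeps those approximately consistent with Equation 3 and a retinal intensity of 0.2; the surviving pairs trace out the curve shade = 0.2 / light intensity, the high-likelihood ridge of the two-variable diagram.

```python
import numpy as np

retinal = 0.2                               # observed retinal intensity
shades = np.linspace(0.01, 1.0, 100)        # hypothesized surface shade
intensities = np.linspace(0.01, 1.0, 100)   # hypothesized light intensity

# Equation 3 on a grid: predicted retinal intensity for every hypothesis pair.
S, I = np.meshgrid(shades, intensities, indexing="ij")
predicted = S * I

# Hypothesis pairs (approximately) consistent with the observation form a ridge.
consistent = np.abs(predicted - retinal) < 0.005
pairs = np.column_stack([S[consistent], I[consistent]])
print(pairs[:5])
```

Plotting `consistent` as an image (e.g., with matplotlib) gives the likelihood diagram of Figure 3 (left); note that no consistent hypothesis has a light intensity much below 0.2, because the shade cannot exceed 1.
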



2
1 ðxsÞ
pðx j sÞ = pffiffiffiffiffiffiffiffiffiffiffie 2s2 ; (Equation 5)
2ps 2

where s is the standard deviation of the measurement noise, also


called ‘‘measurement noise level’’ or ‘‘sensory noise level.’’ This
Gaussian distribution is shown in Figure 4B. The higher s, the
noisier the measurement and the wider its distribution. The
Gaussian assumption can be justified using the Central Limit
Theorem.
Figure 3. Two-Variable Diagrams for Case 3 Step 2a: Inference
On a given trial, the observer makes a measurement xtrial . The
e. Explain the statement: ‘‘The curve that we just drew repre- inference problem is: what stimulus estimate should the
sents the combinations of surface shade and light intensity observer make?
that have a high likelihood.’’ We introduced the stimulus distribution pðsÞ, which reflects
f. Suppose you have a strong prior that light intensity was be- how often each stimulus value tends to occur in the experi-
tween 0.2 and 0.4 and definitely nothing else. In the two- ment. Suppose that the observer has learned this distribution
variable prior diagram in Figure 3 (center), shade the area through training. Then, the observer will already have an
corresponding to this prior. expectation about the stimulus before it even appears. This
g. In the two-variable posterior diagram in Figure 3 (right), expectation constitutes prior knowledge, and, therefore, in
indicate where the posterior probability is high. the inference process, pðsÞ is referred to as the ‘‘prior dis-
h. What would you perceive according to the Bayesian tribution’’ (Figure 5A). Unlike the stimulus distribution in
theory? the generative model, the prior distribution reflects the
observer’s beliefs. The likelihood function represents the ob-
3.4. Case 4: Inference under Measurement Noise in server’s belief about the stimulus based on the measurement
Sound Localization only—absent any prior knowledge. Formally, the likelihood is
The previous cases featured categorically distinct scenarios. We the probability of the observed measurement under a hypoth-
now consider a continuous estimation task—for example, esized stimulus:
locating a sound on a line. This will allow us to introduce the
LðsÞ = pðxtrial j sÞ: (Equation 6)
concept of noise in the internal measurement of the stimulus.
This case would be uninteresting without such noise.
Step 1: Generative Model As stated in Section 2.2, the likelihood function is a function of
The stimulus is the location of the sound. The sensory obser- s, not of x. The x variable is now fixed to the observed value xtrial .
vations generated by the sound location consist of a complex Under our assumption for the measurement distribution pðx j sÞ,
pattern of auditory neural activity, but for the purpose of our the likelihood function over the stimulus is
model, and reflecting common practice, we reduce the sen-
2
1 ðsxtrial Þ
sory observations to a single scalar, namely a noisy internal
LðsÞ = pffiffiffiffiffiffiffiffiffiffiffie 2s2 : (Equation 7)
measurement x. The measurement lives in the same space 2ps2
as the stimulus itself—in this case, the real line. For example,
if the true location s of the sound is 3+ to the right of straight
(Although this particular likelihood is normalized over s, that is
ahead, then its measurement x could be 2:7+ or 3:1+ .
not generally true. This is why the likelihood function is called a
Thus, the problem contains two variables: the stimulus s and
function and not a distribution.) The width of the likelihood
the observer’s measurement x. Each node in the graph is
function is interpreted as the observer’s level of uncertainty
associated with a probability distribution: the stimulus node
based on the measurements alone.
with a stimulus distribution pðsÞ and the measurement node
The posterior distribution is pðs j xtrial Þ, the probability density
with a measurement distribution pðx j sÞ that depends on the
function over the stimulus variable s given the measurement
value of the stimulus. In our example, say that the experi-
xtrial . We rewrite Bayes’ rule, Equation 1, as
menter has programmed pðsÞ to be Gaussian with a mean m
and variance s2s . pðsjxtrial Þ f pðxtrial j sÞpðsÞ = LðsÞpðsÞ: (Equation 8)

2
ðsmÞ
1 
pðsÞ = pffiffiffiffiffiffiffiffiffiffiffie 2ss :
2
(Equation 4)
2pss 2
a. Why can we get away with the proportionality sign?

(See Figure 4A.) The ‘‘measurement distribution’’ is the distribu- Equation 8 assigns a probability to each possible hypothe-
tion of the measurement x for a given stimulus value s. We make sized value of the unknown stimulus s. We will now compute
the common assumption that the measurement distribution is the posterior distributions under the assumptions we made in
Gaussian: step 1. Upon substituting the expressions for LðsÞ and pðsÞ into

168 Neuron 104, October 9, 2019


Neuron

Primer

A B Figure 4. The Probability Distributions that


Belong to the Two Variables in the
Generative Model
(A) A Gaussian distribution over the stimulus, pðsÞ,
reflecting the frequency of occurrence of each
stimulus value in the world.
(B) Suppose we now fix a particular value
of s (the dotted line). Then, the measurements
x will follow a Gaussian distribution around
that s. The diagram at the bottom shows
a few samples of x, which are scattered
around the true sound location s, indicated by
the arrow.

The mean of the posterior, Equation 10,


is of the form axtrial + bm; in other words, it
is a linear combination of xtrial and
the mean of the prior, m. The coeffici-
ents a and b in this linear combination
are ðð1 =s2 Þ =ð1 =s2 + 1 =s2s ÞÞ and
Equation 8, we see that in order to compute the posterior, we ðð1 =ss Þ =ð1 =s + 1 =ss ÞÞ, respectively. These sum to 1, and,
2 2 2

need to compute the product of two Gaussian functions. An therefore, the linear combination is a ‘‘weighted average,’’ where
example is shown in Figure 6. the coefficients act as weights. This weighted average, mposterior ,
will always lie somewhere in between xtrial and m.
b. Create a figure similar to Figure 6 through numerical
computation of the posterior. Numerically normalize prior, d. In the special case that s = ss , compute the mean of the
likelihood, and posterior. posterior.

Beyond plotting the posterior, our assumptions in this case The intuition behind the weighted average is that the prior
actually allow us to characterize the posterior mathematically. ‘‘pulls the posterior away’’ from the measurement xtrial and to-
ward its own mean m, but its ability to pull depends on how nar-
c. Show that the posterior is a new Gaussian distribution row it is compared to the likelihood function. If the likelihood
function is narrow, which happens when the noise level s is
low, the posterior won’t budge much: it will be centered close
ðsmposterior Þ
2
to the mean of the likelihood function. This intuition is still valid
1 2s2posterior
pðs j xtrial Þ = qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi e ; (Equation 9) if the likelihood function and the prior are not Gaussian but are
2ps2posterior roughly bell shaped.
The variance of the posterior is given by Equation 11. It is inter-
preted as the overall level of uncertainty the observer has about
with mean
the stimulus after combining the measurement with the prior. It is
xtrial
+ sm2 different from both the variance of the likelihood function and the
s2
mposterior = s
(Equation 10) variance of the prior distribution.
1
s2
+ s12
s
e. Show that the variance of the posterior can also be written
and variance as s2posterior = s2 s2s =ðs2 + s2s Þ.
f. Show that the variance of the posterior is smaller than both
1 the variance of the likelihood function and the variance of
s2posterior = : (Equation 11)
1
s2
+ s12 the prior distribution. This shows that combining a mea-
s

surement with prior knowledge makes an observer less un-


certain about the stimulus.
You may use the following auxiliary calculation: g. What is the variance of the posterior in the special case
that s = ss ?
! m !2
ðs  mÞ
2
ðs  xtrial Þ 1 1
2
1 s2s
+ xstrial
2
  = + s + junk Step 2b: The Stimulus Estimate (response)
2s2s 2s2 2 s2s s2 ss
1
2 + 1
s2 We now estimate s on the trial under consideration. We
denote the estimate by sb. As mentioned in Section 2.3, for a
where ‘‘junk’’ refers to terms that don’t depend on s (see part a real-valued variable and a squared error loss function, the
to understand why we can ignore these when calculating the observer should use the mean of the posterior as the estimate.
new distribution). Thus,

Neuron 104, October 9, 2019 169


Neuron

Primer
Figure 5. Prior and Likelihood under
Sensory Noise
Consider a single trial on which the observed
measurement is xtrial . The observer is trying to infer
which stimulus s produced this measurement. The
two functions that play a role in the observer’s
inference process (on a single trial) are the prior
and the likelihood. The argument of both the prior
and the likelihood function is s, the hypothesized
stimulus.
(A) Prior distribution with m = 0. This distribution
reflects the observer’s beliefs about different
possible values the stimulus can take.
(B) The likelihood function over the stimulus based
on the measurement xtrial . Under our assumptions,
the likelihood function is a Gaussian centered
at xtrial .

xtrial
s2
+ sm2 We see that the variance of the estimate can be different
sb = mposterior = s
(Equation 12) from the variance of the posterior. Intuitively, the response
1
s2
+ s12
s distribution (the distribution of the observer’s posterior mean
estimate) for a given true stimulus s reflects the variability of
behavioral responses we would find when repeatedly pre-
This would be a Bayesian observer’s response in this localiza-
senting the same stimulus s many times. This is conceptually
tion task.
distinct from the internal uncertainty of the observer on a
Step 3: Response Distribution
single given trial, which is not directly measurable. Because
We now like to use this model to predict subjects’ behavior in this
of a strong prior, a Bayesian observer could have consistent
experiment. To do so, we’d like to compare our predicted re-
responses from trial to trial despite being very internally
sponses, sb, to the subject’s actual responses. Looking at
uncertain on any particular trial. This completes the model:
Equation 12 for sb, we note that, to compute a predicted response
the distribution pð sb j sÞ can now be compared to human
on a given trial, we need to know xtrial . But this is something we
behavior.
don’t know! xtrial is the noisy measurement made by the ob-
3.5. Case 5: Hierarchical Inference in Change Point
server’s sensory system, an internal variable to which an exper-
Detection
imenter has no access.
Our last case is change point detection, the task of inferring from
A common mistake in Bayesian modeling is to discuss
a time series of noisy observations whether or when an underly-
the likelihood function (or the posterior distribution) as if it were
ing variable changed. There are two main reasons for consid-
a single, specific function in a given experimental condition. In
ering this task. First, it is a common form of inference. A new
the presence of noise in the observation/measurement, this is
chef or owner might cause the quality of the food in a restaurant
incorrect. Both the likelihood and the posterior depend on the
to suddenly change, or a neurologist may want to detect a
measurement xtrial , which itself is randomly generated on each
seizure on an EEG in a comatose patient. Second, this case is
trial, and, therefore, the likelihood and posterior will ‘‘wiggle
representative of a type of inference that involves multiple layers
around’’ from trial to trial (Figure 7). This variability propagates
of world state variables—in our case, not only the stimulus at
to the estimate: the estimate sb also depends on the noisy mea-
each time point but also the ‘‘higher-level’’ variable of when the
surement xtrial via Equation 12. Since xtrial varies from trial to trial,
stimulus changed. In spite of this complication, the
so does the estimate: the stochasticity in the estimate is inherited
problem lends itself well to a Bayesian treatment (Wilson et al.,
from the stochasticity in the measurement xtrial . Hence, in
2013; Norton et al., 2019).
response to repeated presentations of the same stimulus, the
Step 1: Generative Model
estimate will be a random variable with a probability distribution,
The generative model is shown in Figure 8. Time is discrete and
which we will denote by pð sb j sÞ.
goes from 1 to T. A change occurs at exactly one time point
So rather than comparing our model’s predicted responses to
tchange , chosen with equal probability:
subjects’ actual responses on individual trials, we’ll instead use
our model to predict the distribution over subjects’ responses for   1
a given value of the stimulus. The predicted distribution is pre- p tchange = : (Equation 13)
T
cisely pð sb j sÞ. To compare our Bayesian model with an ob-
server’s behavior, we thus need to calculate this distribution.
The stimulus s = ðs1 ; .; sT Þ is a sequence that starts with rep-
h. From step 1, we know that when the true stimulus is s, xtrial etitions of the value 1 and at some point changes to repetitions
follows a Gaussian distribution with mean s and variance of the value 1. The change point is when the value 1 changes
s2. Show that when the true stimulus is s, the estimate’s to 1. Formally,
distribution pð sb j sÞ is a Gaussian distribution with mean
.  . 2 
m 1 if t < tchange
s
s2
+ s2
1
s2
+ 1
s2
and variance 1
s2
1
+ 1
s2 s2
. st = (Equation 14)
s s s 1 if tRtchange
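Before moving on, the case 4 predictions above (Equations 10–12 and exercise h) can be checked numerically. The following is a minimal sketch; all parameter values are illustrative, not from the text. It simulates many repeated presentations of one stimulus and compares the simulated response distribution against the predicted mean and variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (not from the text)
mu, sigma_s = 0.0, 2.0    # mean and s.d. of the Gaussian prior over s
sigma = 1.0               # s.d. of the measurement noise
s_true = 1.5              # true stimulus, presented repeatedly

# Posterior precision (the common denominator in Equations 10-12)
w = 1 / sigma**2 + 1 / sigma_s**2

# Equation 11: posterior variance; exercise f says it is smaller than both
# the likelihood variance and the prior variance
posterior_var = 1 / w

# Simulate many trials: noisy measurements, then Equation 12 for the estimate
x_trial = s_true + sigma * rng.standard_normal(100_000)
s_hat = (x_trial / sigma**2 + mu / sigma_s**2) / w

# Exercise h: predicted mean and variance of the response distribution p(s_hat | s)
pred_mean = (s_true / sigma**2 + mu / sigma_s**2) / w
pred_var = (1 / sigma**2) / w**2
```

Note that pred_var (the spread of responses across trials) differs from posterior_var (the observer's uncertainty on a single trial), which is exactly the conceptual distinction emphasized above.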


Figure 6. The Posterior Distribution Is Obtained by Multiplying the Prior by the Likelihood Function
(The figure plots the prior p(s), the likelihood L(s) = p(x_trial | s), and the posterior p(s | x_trial) against the hypothesized stimulus s, with x_trial marked on the axis.)

Finally, we assume that the observer makes measurements x = (x_1, …, x_T), whose noise is independent across time points. Mathematically, this means that we can write the conditional probability of the vector as the product of the conditional probabilities of the components:

p(x | s) = Π_{t=1}^{T} p(x_t | s_t).

We next assume that each measurement follows a Gaussian distribution with mean equal to the stimulus at the corresponding time and with a fixed variance:

p(x_t | s_t) = N(x_t; s_t, σ) = (1/√(2πσ²)) e^{−(x_t − s_t)²/(2σ²)}.

Step 2: Inference
The observer would like to infer when the change occurred. What can make this problem challenging is that the noise in the measurements can create "apparent changes" that are not due to a change in the underlying state s_t. The Bayesian observer solves this problem in an optimal manner.

The world state of interest is change point t_change. Therefore, the Bayesian observer computes the posterior over t_change given a sequence of measurements, x (we leave out the subscript "obs" for ease of notation). We first apply Bayes' rule:

p(t_change | x) ∝ p(t_change) p(x | t_change).   (Equation 15)

Since the prior is constant, p(t_change) = 1/T, this simplifies to

p(t_change | x) ∝ p(x | t_change).   (Equation 16)

Thus, our challenge is to calculate the likelihood function

L(t_change) = p(x | t_change).   (Equation 17)

For each hypothesized value of t_change, this function tells us how expected the observations are under that hypothesis. The problem is that, unlike in case 4, the generative model does not give us this likelihood function right away; this is a consequence of the "stacked" or hierarchical nature of the generative model. We do have the distributions p(x | s) and p(s | t_change). To make the link, we have to "average" over all possible values of s. This is called "marginalization." Marginalization is extremely common in Bayesian models except for the very simplest ones, the reason being that there are almost always unknown states of the world that affect the observations but that the observer is not primarily interested in. The box describes the relevant probability calculus.

Marginalization of probabilities. If A and B are two discrete random variables, their "joint distribution" is p(A, B). From the joint distribution, we can obtain the distribution of one of the variables by summing over the other one, for example:

p(A) = Σ_B p(A, B).   (Equation 18)

This is called the "marginal distribution" of A. Making use of the definition of conditional probability, p(A | B) = p(A, B)/p(B), we further write

p(A) = Σ_B p(A | B) p(B).   (Equation 19)

We can obtain a variant of this equation by conditioning each probability on a third random variable C:

p(A | C) = Σ_B p(A | B, C) p(B | C).   (Equation 20)

This is the equation we use to obtain Equation 21.

We compute L(t_change) by marginalizing over s:

L(t_change) = Σ_s p(x | s) p(s | t_change).   (Equation 21)

a. Besides using Equation 20, we used a property that is specific to our generative model. Which one?
b. For a given t_change, how many sequences s are possible?

Based on (b), we understand that the probability p(s | t_change) is zero for all s except for one. Thus, the sum in Equation 21 reduces to a single term:

L(t_change) = p(x | the one s in which the change occurs at t_change)   (Equation 22)

= (Π_{t=1}^{t_change−1} p(x_t | s_t = −1)) (Π_{t=t_change}^{T} p(x_t | s_t = 1)).   (Equation 23)

This looks complicated, but we are not out of ideas!

c. Show that this can be written more simply as


L(t_change) ∝ Π_{t=t_change}^{T} [p(x_t | s_t = 1) / p(x_t | s_t = −1)].   (Equation 24)

d. Now substitute p(x_t | s_t) = N(x_t; s_t, σ) to find

L(t_change) ∝ e^{(2/σ²) Σ_{t=t_change}^{T} x_t}.   (Equation 25)

e. Does this equation make intuitive sense?

Combining Equations 16, 17, and 25, we find for the posterior probability of a change point at t_change:

p(t_change | x) ∝ e^{(2/σ²) Σ_{t=t_change}^{T} x_t}.   (Equation 26)

To obtain the actual posterior probabilities, the right-hand side has to be normalized (divided by the sum over all t_change).

f. The data in Figure 8 are x = (−0.46, 0.83, −3.26, −0.14, −0.68, −2.31, 0.57, 1.34, 4.58, 3.77), with σ = 1. Plot the posterior distribution over change point.

However, if the goal is just to pick the most probable change point (MAP estimate), normalizing is not needed.

g. Why not?
h. If that is the goal, the decision rule becomes "Cumulatively add up the elements of x, going backwards from t = T to t = 1. The time at which this cumulative sum peaks is the MAP estimate of t_change." Explain.

Step 3: Response Distribution
In case 4, we were able to obtain an equation for the predicted response distribution. Here, however, and in many other cases, that is not possible. Nevertheless, we can still simulate the model's responses to obtain an approximate prediction.

i. Assume σ = 1 and T = 10. Vary the true change point t_change from 1 to T. For each value of t_change, we simulate 10,000 trials (or more if possible). On each simulated trial,
   - Based on t_change, specify the stimulus sequence s.
   - Simulate a measurement sequence x from s.
   - Apply the decision rule to each measurement sequence. The output is the simulated observer's response, namely an estimate of t_change.
   - Determine whether the response was correct.
   Plot proportion correct as a function of t_change. Interpret the plot.
j. What is overall proportion correct (averaged across all t_change)?
k. Vary σ = 1, 2, 3 and T from 2 to 16 in steps of 2. Plot overall proportion correct as a function of T for the three values of σ (color coded). Interpret the plot.
l. How could our simple example be extended to cover more realistic cases of change point detection?

Figure 7. Trial-to-Trial Variability in Likelihoods and Posteriors
Likelihood functions (blue) and corresponding posterior distributions (purple) on three example trials on which the true stimulus could have been the same. The key point is that the likelihood function, the posterior distribution, and any posterior-derived estimate are not fixed objects: they move around from trial to trial because the measurement x_trial does.

4. Extensions
We have concluded our case studies of Bayesian decision models. These cases were basic, and many extensions can be made. We discuss a few such extensions here.
4.1. Multiple Observations
In case 4, if the decision maker makes two conditionally independent observations, x_1 and x_2, then the likelihood function L(s) becomes the product p(x_1 | s) p(x_2 | s). This is the premise of many Bayesian cue combination studies, in particular, multisensory ones (Trommershauser et al., 2011). If multiple observations are made over time, then Bayesian inference becomes a form of evidence accumulation. In case 5, multiple observations were also made over time, but, in addition, the stimulus changed as measurements were made. The most common generative model that describes time-varying stimuli with measurements at each time point is called a "Hidden Markov Model."
4.2. More Sophisticated Encoding Models
The distribution of an observation given a world state, p(x | s), was deliberately kept very simple in cases 4 and 5. Instead, with x still being a noisy measurement of s, the noise level could depend on s, an instance of "heteroskedasticity"; this would make Bayesian inference substantially harder (Girshick et al., 2011; Acerbi et al., 2018). Furthermore, x could be a pattern of neural activity, for example, in early sensory cortex; this could be the starting point for asking how Bayesian computations could be implemented in neural populations (Ma et al., 2006).
4.3. More Realistic Cost Functions
Outside of purely perceptual tasks, an action is rarely an estimate of the state of the world. In fact, the action a can be of a very different nature than s. For example, if s represents whether milk is spoiled or still good, the action a could be to toss or drink the milk. Similarly, in an estimation experiment, a correct response could be rewarded differently depending on the true state of the world (Whiteley and Sahani, 2008). These scenarios can be captured by suitably choosing C(a, s) in Equation 2.
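As a minimal numerical illustration of such a cost function (a sketch; the posterior probabilities and cost values below are made-up numbers, not from the text), an observer can score each possible action by its expected cost under the posterior and pick the action with the lowest expected cost:

```python
import numpy as np

# Hypothetical posterior over world states s (numbers are made up)
states = ["spoiled", "good"]
posterior = np.array([0.2, 0.8])          # p(s | observation)

# Hypothetical cost matrix C(a, s): rows = actions, columns = states
actions = ["toss", "drink"]
C = np.array([[1.0, 1.0],                 # tossing wastes the milk either way
              [10.0, 0.0]])               # drinking spoiled milk is very costly

# Expected cost of each action: sum over s of C(a, s) * p(s | observation)
expected_cost = C @ posterior
best_action = actions[int(np.argmin(expected_cost))]
```

Even though "good" is the more probable state, the asymmetric costs make tossing the milk the lower-cost action in this example.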


Figure 8. Case 5: Change Point Detection
(A) Generative model.
(B) Example of a true stimulus sequence with t_change = 7.
(C) Example sequence of measurements. When did the change from −1 to 1 occur?
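The posterior of Equation 26, the MAP rule of exercise h, and the simulation of exercises i and j can be sketched as follows (the data vector is the one from exercise f; everything else follows the generative model of Equations 13 and 14):

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_over_change_point(x, sigma):
    # Equation 26: p(t_change | x) is proportional to
    # exp((2 / sigma^2) * sum_{t = t_change}^T x_t)
    backward_sums = np.cumsum(x[::-1])[::-1]   # exercise h's backward cumulative sum
    log_post = (2.0 / sigma**2) * backward_sums
    post = np.exp(log_post - log_post.max())   # subtract max for numerical stability
    return post / post.sum()                   # normalize over t_change = 1..T

# Exercise f: posterior for the Figure 8 data (sigma = 1)
x_fig8 = np.array([-0.46, 0.83, -3.26, -0.14, -0.68, -2.31,
                   0.57, 1.34, 4.58, 3.77])
post = posterior_over_change_point(x_fig8, sigma=1.0)
map_estimate = int(np.argmax(post)) + 1        # MAP change point (1-indexed)

# Exercises i-j: overall proportion correct of the MAP rule by simulation
sigma, T, n_trials = 1.0, 10, 10_000
correct = 0
for _ in range(n_trials):
    t_change = rng.integers(1, T + 1)                                # Equation 13
    s_seq = np.where(np.arange(1, T + 1) < t_change, -1.0, 1.0)      # Equation 14
    x = s_seq + sigma * rng.standard_normal(T)                       # measurements
    estimate = int(np.argmax(posterior_over_change_point(x, sigma))) + 1
    correct += (estimate == t_change)
prop_correct = correct / n_trials
```

For these data the posterior peaks at t_change = 7, matching the example sequence in Figure 8B, and the simulated proportion correct is well above chance (1/T = 0.1).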

4.4. Ad Hoc Sources of Error
Ad hoc sources of errors can be incorporated into Bayesian models as into any behavioral model. Those sources could include some proportion of random responses (lapse rate). In addition, the readout of the posterior could be stochastic rather than deterministic. In other words, decision noise could be added in step 2b. Such noise could alternatively reflect inevitable stochasticity, a level of imprecision that is strategically chosen to save effort when more effort would not be worth the task gains, or it could be a proxy for systematic but unmodeled suboptimalities in the decision process.
4.5. Model Mismatch
The decision maker might have wrong beliefs about the generative model. For example, if the true distributions of s and x are p(s) and p(x | s), the decision maker might instead believe that these variables follow different distributions, say q(s) and q(x | s). Then, they can still compute a posterior distribution q(s | x_trial), but it would be different from the correct one, p(s | x_trial). This is a case of "model mismatch": the distributions used in inference are not the same as in the true generative model (Beck et al., 2012).

5. Remarks
5.1. Persistent Myths
We address two common misunderstandings about Bayesian models of behavior. First, it is a myth that all Bayesian models are characterized by the presence of a non-uniform prior distribution. While the prior is important in many applications, the calculation of the likelihood is sometimes much more central. Examples include case 1 and most forms of sensory cue integration (Trommershauser et al., 2011). Second, it is a myth that all Bayesian models have so many degrees of freedom that any dataset can be fitted. On the contrary, as we saw in cases 4 and 5, Bayesian models have very few parameters if we assume that the observer uses the true generative model of the task. That being said, models with mismatch (Section 4.5) can have many more free parameters. However, adding parameters solely to obtain better fits violates the spirit of Bayesian modeling.
5.2. Bayesian versus Optimal Decision Making
Bayesian decision models are normative or optimal in the sense that, if the decision maker strives to minimize any form of total cost (or maximize any form of total reward) in the long run, then they should use the posterior distribution to determine their action. However, if the Bayesian decision maker suffers from model mismatch, then they are still Bayesian, but not necessarily optimal (Ma, 2012). In addition, a lot of recent work has focused on including in the decision maker's objective function not only task performance or task rewards, but also the cost of representation or computation. This gives rise to a class of modified Bayesian models known as "resource-rational" models (Griffiths et al., 2015).

6. Criticisms of Bayesian Models
Criticisms of Bayesian models of decision making fall into several broad categories. First, it has been alleged that Bayesian modelers insufficiently consider alternative models (Bowers and Davis, 2012). Indeed, many Bayesian modelers can do much better in testing Bayesian models against alternatives. This criticism, however, applies to many modeling studies, Bayesian or not. Second, it has been alleged that Bayesian models are overly flexible and can "fit anything" (Bowers and Davis, 2012). I consider this largely an unfair criticism, as Bayesian models are often highly constrained by either experimental or natural statistics (examples of the latter: Girshick et al., 2011; Geisler and Perry, 2009). It is true that some Bayesian studies use many parameters to agnostically estimate a prior distribution (e.g., Stocker and Simoncelli, 2006); those models need strong external validation (e.g., Houlsby et al., 2013) or to be appropriately penalized in model comparison. I discuss two more criticisms below and conclude by pointing out two major challenges for Bayesian models.
6.1. Bayesian Inference without and with Probabilities
It has been alleged that empirical findings cited in support of decision making being Bayesian are equally consistent with models that predict the same input-output mapping as the Bayesian model but without any representation of probabilities (Howe et al., 2006; Block, 2018). Indeed, in their weak form, Bayesian models are simply mappings from states to actions (policies) without regard to internal constructs such as likelihoods and posteriors (Ma and Jazayeri, 2014). In their strong form, however, Bayesian models claim that the brain represents those constructs. Such representations naturally give rise to notions of


uncertainty and decision confidence. For example, the standard deviation of a posterior distribution could be a measure of uncertainty. Evidence for the strong form could be obtained from Bayesian transfer tests (Maloney and Mamassian, 2009). Examples of Bayesian transfer tests involve varying sensory reliability (Trommershauser et al., 2011; Qamar et al., 2013), priors (Acerbi et al., 2014), or rewards (Whiteley and Sahani, 2008) from trial to trial without giving the subject performance feedback. However, when no transfer tests are done, it is difficult to argue that the brain has done more than learn a fixed policy that produces as-if Bayesian behavior. While Bayesian modelers need to be more explicit about the epistemological status of their models, and while some Bayesian studies only provide evidence for the weak form, the collective evidence for the strong form is by now plentiful (Ma and Jazayeri, 2014).
6.2. The Neural Implementation of Bayesian Inference
Bayesian modelers have been accused of not paying sufficient attention to the implementational level (Jones and Love, 2011). Bayesian models are primarily computational-level models of behavior. However, the evidence that decision-making behavior is approximately Bayesian in many tasks raises the question of how neurons implement Bayesian decisions. This question is most interesting for the strong form of Bayesian models because answering it then requires a theoretical commitment to the neural representations of likelihoods, priors, and cost functions. Consider the neural representation of a sensory likelihood function as an example. A straightforward theoretical postulate—but by no means the only one (Hoyer and Hyvärinen, 2003; Fiser et al., 2010; Deneve, 2008; Haefner et al., 2016)—is that the activity (e.g., firing rates) in a specific population of sensory neurons, collectively denoted by r, takes the role of the observation, so that the sensory component of the generative model—the analog of Equation 5—becomes a stimulus-conditioned activity distribution p(r | s) (Sanger, 1996; Pouget et al., 2003). One could then proceed by hypothesizing that this distribution serves as the basis of the likelihood function. In other words, for given activity r_trial, the likelihood of stimulus s would be

L(s) = p(r_trial | s),   (Equation 27)

in analogy to Equation 6. This equation is the cornerstone of the theory of "probabilistic population coding" (Ma et al., 2006). The hypothesis would be untestable if it were not for the brain's use of a likelihood function over s in subsequent computation. Therefore, one needs a behavioral task in which evidence for the strong form of a Bayesian model has been obtained. Then, one could use Equation 27 to decode on each trial a full neural likelihood function from a sensory population and plug it into the Bayesian model to predict behavior (van Bergen et al., 2015; Walker et al., 2019).

Equipped with putative neural representations of the building blocks of Bayesian inference, one can ask the further question of how the computation that puts the pieces together is implemented. In a handful of cases, this question can be approached analytically. For example, in cue combination, the Bayesian computation consists of a multiplication of likelihood functions. Under a certain assumption about p(r_trial | s), the neural implementation of this computation is a simple addition of patterns of neural activity (Ma et al., 2006). However, this case might be an exception because, under the same distributional assumption, the neural implementation of many other Bayesian computations is not exact and can get very complex (e.g., Ma et al., 2011). Instead, a simple trained neural network can perform strong-form Bayesian inference across a wide variety of tasks, even when the distributional assumption is violated (Orhan and Ma, 2017). The neural implementation of Bayesian computation remains an active area of research.

Finally, resource-rational theories take implementation-level costs and constraints seriously (Griffiths et al., 2015). A resource-rational decision maker is one who not only maximizes task performance, but also simultaneously minimizes an ecologically meaningful cost, such as total firing rate, total number of neurons, amount of effort, or time spent. Resource-rational models provide accounts of the nature of representations as well as of apparent suboptimalities in decision making.
6.3. Other Challenges
We briefly mention two other major challenges to Bayesian models of decision making. The first is that they often do not scale up well. For example, if one were to infer the depth ordering of N image patches, there are N! possible orderings, and a Bayesian decision maker would have to consider every one of them. For large N, this would be computationally prohibitive and not realistic as a model of human vision. A similar combinatorial explosion would arise in case 5 if the number of change points were not known to be 1. A second challenge is how a Bayesian decision maker learns a natural generative model from scratch using only a small number of training examples (Tenenbaum et al., 2011; Lake et al., 2015).

SUPPLEMENTAL INFORMATION

Supplemental Information can be found online at https://doi.org/10.1016/j.neuron.2019.09.037.

ACKNOWLEDGMENTS

This primer is based on a Bayesian modeling tutorial that I have taught in several places. Thanks to all students who actively participated in these tutorials. Special thanks to my teaching assistants of the Bayesian tutorial at the Computational and Systems Neuroscience conference in 2019, who not only taught but also greatly improved the five case studies and wrote solutions: Anna Kutschireiter, Anne-Lene Sax, Jennifer Laura Lee, Jorge Menéndez, Julie Lee, Lucy Lai, and Sashank Pisupati. A much more detailed didactic introduction to Bayesian decision models will appear in 2020 in book form; many thanks to my co-authors of that book, Konrad Körding and Daniel Goldreich. My research is funded by grants R01EY020958, R01EY027925, R01MH118925, and R01EY026927 from the National Institutes of Health.

REFERENCES

Acerbi, L., Vijayakumar, S., and Wolpert, D.M. (2014). On the origins of suboptimality in human probabilistic inference. PLoS Comput. Biol. 10, e1003661.

Acerbi, L., Dokka, K., Angelaki, D.E., and Ma, W.J. (2018). Bayesian comparison of explicit and implicit causal inference strategies in multisensory heading perception. PLoS Comput. Biol. 14, e1006110.

Battaglia, P.W., Hamrick, J.B., and Tenenbaum, J.B. (2013). Simulation as an engine of physical scene understanding. Proc. Natl. Acad. Sci. USA 110, 18327–18332.

Beck, J.M., Ma, W.J., Pitkow, X., Latham, P.E., and Pouget, A. (2012). Not noisy, just wrong: the role of suboptimal inference in behavioral variability. Neuron 74, 30–39.

Block, N. (2018). If perception is probabilistic, why does it not seem probabilistic? Philos. Trans. R. Soc. Lond. B Biol. Sci. 373, 20170341.

Bowers, J.S., and Davis, C.J. (2012). Bayesian just-so stories in psychology and neuroscience. Psychol. Bull. 138, 389–414.

Brainard, D.H., and Freeman, W.T. (1997). Bayesian color constancy. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 14, 1393–1411.

Cogley, T., and Sargent, T.J. (2008). Anticipated utility and rational expectations as approximations of Bayesian decision making. Int. Econ. Rev. 49, 185–221.

Deneve, S. (2008). Bayesian spiking neurons I: inference. Neural Comput. 20, 91–117.

Faisal, A.A., Selen, L.P., and Wolpert, D.M. (2008). Noise in the nervous system. Nat. Rev. Neurosci. 9, 292–303.

Fiser, J., Berkes, P., Orbán, G., and Lengyel, M. (2010). Statistically optimal perception and learning: from behavior to neural representations. Trends Cogn. Sci. 14, 119–130.

Geisler, W.S., and Perry, J.S. (2009). Contour statistics in natural images: grouping across occlusions. Vis. Neurosci. 26, 109–121.

Girshick, A.R., Landy, M.S., and Simoncelli, E.P. (2011). Cardinal rules: visual orientation perception reflects knowledge of environmental statistics. Nat. Neurosci. 14, 926–932.

Goodman, N.D., and Frank, M.C. (2016). Pragmatic language interpretation as probabilistic inference. Trends Cogn. Sci. 20, 818–829.

Griffiths, T.L., Lieder, F., and Goodman, N.D. (2015). Rational use of cognitive resources: levels of analysis between the computational and the algorithmic. Top. Cogn. Sci. 7, 217–229.

Haefner, R.M., Berkes, P., and Fiser, J. (2016). Perceptual decision-making as probabilistic inference by neural sampling. Neuron 90, 649–660.

Houlsby, N.M., Huszár, F., Ghassemi, M.M., Orbán, G., Wolpert, D.M., and Lengyel, M. (2013). Cognitive tomography reveals complex, task-independent mental representations. Curr. Biol. 23, 2169–2175.

Howe, C.Q., Beau Lotto, R., and Purves, D. (2006). Comparison of Bayesian and empirical ranking approaches to visual perception. J. Theor. Biol. 241, 866–875.

Hoyer, P.O., and Hyvärinen, A. (2003). Interpreting neural response variability as Monte Carlo sampling of the posterior. In Advances in Neural Information Processing Systems (NIPS), 293–300.

Itti, L., and Baldi, P. (2009). Bayesian surprise attracts human attention. Vision Res. 49, 1295–1306.

Jones, M., and Love, B.C. (2011). Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behav. Brain Sci. 34, 169–188.

Kersten, D., Mamassian, P., and Yuille, A. (2004). Object perception as Bayesian inference. Annu. Rev. Psychol. 55, 271–304.

Körding, K.P., and Wolpert, D.M. (2004). Bayesian integration in sensorimotor learning. Nature 427, 244–247.

Lake, B.M., Salakhutdinov, R., and Tenenbaum, J.B. (2015). Human-level concept learning through probabilistic program induction. Science 350, 1332–1338.

Ma, W.J. (2012). Organizing probabilistic models of perception. Trends Cogn. Sci. 16, 511–518.

Ma, W.J., and Jazayeri, M. (2014). Neural coding of uncertainty and probability. Annu. Rev. Neurosci. 37, 205–220.

Ma, W.J., Beck, J.M., Latham, P.E., and Pouget, A. (2006). Bayesian inference with probabilistic population codes. Nat. Neurosci. 9, 1432–1438.

Ma, W.J., Navalpakkam, V., Beck, J.M., van den Berg, R., and Pouget, A. (2011). Behavior and neural basis of near-optimal visual search. Nat. Neurosci. 14, 783–790.

Maloney, L.T., and Mamassian, P. (2009). Bayesian decision theory as a model of human visual perception: testing Bayesian transfer. Vis. Neurosci. 26, 147–155.

Martí, R., Lozano, J.A., Mendiburu, A., and Hernando, L. (2016). Multi-start methods. In Handbook of Heuristics, R. Martí, P. Pardalos, and M. Resende, eds. (Springer), pp. 155–175.

Myung, I.J. (2003). Tutorial on maximum likelihood estimation. J. Math. Psychol. 47, 90–100.

Norton, E.H., Acerbi, L., Ma, W.J., and Landy, M.S. (2019). Human online adaptation to changes in prior probability. PLoS Comput. Biol. 15, e1006681.

Orhan, A.E., and Ma, W.J. (2017). Efficient probabilistic inference in generic neural networks trained with non-probabilistic feedback. Nat. Commun. 8, 138.

Pouget, A., Dayan, P., and Zemel, R.S. (2003). Inference and computation with population codes. Annu. Rev. Neurosci. 26, 381–410.

Qamar, A.T., Cotton, R.J., George, R.G., Beck, J.M., Prezhdo, E., Laudano, A., Tolias, A.S., and Ma, W.J. (2013). Trial-to-trial, uncertainty-based adjustment of decision boundaries in visual categorization. PNAS 110, 20332–20337.

Sanger, T.D. (1996). Probability density estimation for the interpretation of neural population codes. J. Neurophysiol. 76, 2790–2793.

Stocker, A.A., and Simoncelli, E.P. (2006). Noise characteristics and prior expectations in human visual speed perception. Nat. Neurosci. 9, 578–585.

Tenenbaum, J.B., Griffiths, T.L., and Kemp, C. (2006). Theory-based Bayesian models of inductive learning and reasoning. Trends Cogn. Sci. 10, 309–318.

Tenenbaum, J.B., Kemp, C., Griffiths, T.L., and Goodman, N.D. (2011). How to grow a mind: statistics, structure, and abstraction. Science 331, 1279–1285.

Treisman, M. (1977). Motion sickness: an evolutionary hypothesis. Science 197, 493–495.

Trommershauser, J., Kording, K., and Landy, M.S. (2011). Sensory Cue Integration (Oxford University Press).

van Bergen, R.S., Ma, W.J., Pratte, M.S., and Jehee, J.F. (2015). Sensory uncertainty decoded from visual cortex predicts behavior. Nat. Neurosci. 18, 1728–1730.

Wagemans, J., Feldman, J., Gepshtein, S., Kimchi, R., Pomerantz, J.R., van der Helm, P.A., and van Leeuwen, C. (2012). A century of Gestalt psychology in visual perception: II. Conceptual and theoretical foundations. Psychol. Bull. 138, 1218–1252.

Walker, E.Y., Cotton, R.J., Ma, W.J., and Tolias, A.S. (2019). A neural basis of probabilistic computation in visual cortex. bioRxiv. https://doi.org/10.1101/365973.

Wertheimer, M. (1938). Gestalt Theory. In A Source Book of Gestalt Psychology, W.D. Ellis, ed. (Kegan Paul, Trench, Trubner & Company), pp. 1–11.

Whiteley, L., and Sahani, M. (2008). Implicit knowledge of visual uncertainty guides decisions with asymmetric outcomes. J. Vis. 8, 1–15.

Wilson, R.C., Nassar, M.R., and Gold, J.I. (2013). A mixture of delta-rules approximation to bayesian inference in change-point problems. PLoS Comput. Biol. 9, e1003150.
