Object Perception As Bayesian Inference

Daniel Kersten
Department of Psychology, University of Minnesota, Minneapolis, Minnesota 55455

Pascal Mamassian
Department of Psychology, University of Glasgow, Glasgow G12 8QB, Scotland
Alan Yuille
Departments of Statistics and Psychology, University of California, Los Angeles,
Los Angeles, California 90095-1554
CONTENTS
OBJECT PERCEPTION: GEOMETRY AND MATERIAL
INTRODUCTION TO BAYES
  How to Resolve Ambiguity in Object Perception?
  How Does Vision Deal with the Complexity of Images?
PSYCHOPHYSICS
  Ideal Observers
  Basic Bayes: The Trade-Off Between Feature Reliability and Priors
  Discounting and Task Dependence
  Integration of Image Measurements and Cues
  Perceptual Explaining Away
THEORETICAL AND COMPUTATIONAL ADVANCES
  Bayes Decision Theory and Machine Learning
  Learning Probability Distributions
  Visual Inference
NEURAL IMPLICATIONS
CONCLUSIONS
Computer vision has shown how the problems of complexity and ambiguity can
be handled using Bayesian inference, which provides a common framework for
modeling artificial and biological vision. In addition, studies of natural images have
shown statistical regularities that can be used for designing theories of Bayesian
inference. The goal of understanding biological vision also requires using the tools
of psychophysics and neurophysiology to investigate how the visual pathways of
the brain transform image information into percepts and actions.
In the next section, we provide an overview of object perception as Bayesian
inference. In subsequent sections, we review psychophysical (Psychophysics,
below), computational (Theoretical and Computational Advances, below), and
neurophysiological (Neural Implications, below) results on the nature of the com-
putations and mechanisms that support visual inference. Psychophysically, a major
challenge for vision research is to obtain quantitative models of human perceptual
performance given natural image input for the various functional tasks of vision.
These models should be extensible in the sense that one should be able to build on
simple models, rather than having a different model for each set of psychophysical
results. Meeting this challenge will require further theoretical advances; Theo-
retical and Computational Advances (below) highlights recent progress in learning
classifiers, learning probability distributions, and visual inference. Psychophysics con-
strains neural models but can only go so far, and neural experiments are required
to further determine theories of object perception. Neural Implications (below)
describes some of the neural implications of Bayesian theories.
INTRODUCTION TO BAYES
1. Recent reviews include Knill et al. (1996), Yuille & Bülthoff (1996), Kersten & Schrater
(2002), Kersten & Yuille (2003), Maloney (2002), Pizlo (2001), and Mamassian et al.
(2002).
Observers that make optimal interpretations are called ideal observers. We first
consider one type of ideal observer that computes the most probable interpretation
given the retinal image stimulus. Technically, this observer is called a maximum
a posteriori (MAP) observer.
The ideal observer bases its decision on the posterior probability distribution—
the probability of each possible true state of the scene given the retinal stimulus.
According to Bayes’ theorem, the posterior probability is proportional to the prod-
uct of the prior probability—the probability of each possible state of the scene
prior to receiving the stimulus—and the likelihood—the probability of the stimu-
lus given each possible state of the scene. In many applications, prior probability
distributions represent knowledge of the regularities of possible object shapes, ma-
terials, and illumination, and likelihood distributions represent knowledge of how
images are formed through projection onto the retina. Figure 2 illustrates how a
symmetric likelihood [a function of the stimulus representing the curvature of the
two-dimensional (2-D) line] can lead to an asymmetric posterior owing to a prior
toward convex objects. The ideal (MAP) observer then picks the most probable
interpretation for that stimulus—i.e., the state of the scene (3-D surface curvature
and viewpoint slant) for which the posterior distribution peaks in panel D of Fig-
ure 2. An ideal observer does not necessarily get the right answer for each input
stimulus, but it makes the best guesses and so achieves the best performance averaged
over all the stimuli. In this sense, an ideal observer may “see” illusions.
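For concreteness, here is a minimal numerical sketch of a MAP observer over one
discretized scene parameter. It is our illustration, not the article's model: the
Gaussian likelihood, the convexity prior, and all numbers are assumptions in the
spirit of Figure 2.

```python
import numpy as np

# Discretize one scene parameter: 3-D surface curvature
# (negative = concave, positive = convex). Values are illustrative.
curvature = np.linspace(-2.0, 2.0, 401)

# Likelihood p(I|S): compatibility of each curvature with the observed
# image measurement (here a noisy 2-D curvature measurement of 0.0).
image_measurement = 0.0
likelihood = np.exp(-0.5 * ((curvature - image_measurement) / 0.5) ** 2)

# Prior p(S): a preference for convex objects (mass shifted toward
# positive curvature), mimicking panel C of Figure 2.
prior = np.exp(-0.5 * ((curvature - 0.8) / 0.7) ** 2)

# Posterior p(S|I) is proportional to p(I|S) p(S); normalize it.
posterior = likelihood * prior
posterior /= posterior.sum()

# The MAP observer reports the scene value where the posterior peaks.
# A symmetric likelihood plus a convexity prior gives an asymmetric
# posterior, so the estimate is pulled toward convex.
map_estimate = curvature[np.argmax(posterior)]
print(f"MAP curvature estimate: {map_estimate:.2f}")
```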
We now take a closer look at three key aspects of Bayesian modeling: the
generative model, the task specification, and the inference solution.
Figure 2 The ideal observer. (A) The image of a cylinder is consistent with multiple
objects and viewpoints, including convex cylinders viewed from above and concave
cylinders viewed from below. Therefore, different scene interpretations for this image
vary in the estimated surface curvature and slant (the viewing angle). (B) The likelihood
represents the compatibility of different scene interpretations with the image (1/R is
the curvature in the image). The hatched region represents those surface curvatures
that are never compatible with the image, indicating that, for instance, a plane will
never project as a curved patch in the image. (C) The prior probability describes an
observer preference for convex objects viewed from above. (D) A Bayesian observer
then combines likelihood and prior to estimate a posterior probability for each inter-
pretation given the original image. The maximum a posteriori (MAP) is the set of scene
parameters for which the posterior is the largest. (E) The utility function represents
the costs associated with errors in the estimation of the surface slant and curvature and
is dependent on the task. (F) Finally, the posterior probability is convolved with the
utility function to give the expected utility associated with each interpretation.
THE GENERATIVE MODEL The likelihood of the image features, p(I|S), and the
prior probability of the scene description, p(S), determine an external generative
model.2 As illustrated later in Figure 5, a strong generative model allows one to
draw image samples.
THE TASK SPECIFICATION There is a limitation with the MAP ideal observer de-
scribed above. Finding the most probable interpretation of the scene does not allow
for the fact that some tasks may require more accurate estimates of some aspects of
the scene than others. For example, it may be critical to get the exact object shape,
but not the exact viewpoint (represented in the utility function in Figure 2E). The
task specifies the costs and benefits associated with the different possible errors in
the perceptual decision. Generally, an optimal perceptual decision is a function of
the task as well as the posterior.
Often, we can simplify the task requirements by splitting S into components
(S1, S2) that specify which scene properties are important to estimate (S1, e.g.,
surface shape) and which confound the measurements and are not worth estimating
(S2, e.g., viewpoint, illumination).
2. When not qualified, we use the term generative model to mean an external model that
describes the causal relationship in terms of variables in the scene. Models of inference may
also use an internal generative model to test perceptual hypotheses against the incoming data
(e.g., image features). We use the term strong generative model to mean one that produces
consistent image samples in terms of intensities.
3. Optimists maximize the utility or gain, whereas pessimists minimize their loss.
Recent work (Kersten & Yuille 2003) has shown how specifying generative models
in terms of influence graphs (or Bayesian networks) (Pearl 1988), together with a
description of visual tasks, allows us to break problems down into categories (see
the example in Figure 3A).
The idea is to decompose the description of a scene S into n components S1, ...,
Sn; the image into m features I1, ..., Im; and express the ensemble distribution
as p(S1, ..., Sn; I1, ..., Im). We represent this distribution by a graph where the
nodes correspond to the variables S1, ..., Sn and I1, ..., Im, and links are drawn
between nodes that directly influence each other. There is a direct correspondence
between graphical structure and the factorization (and thus simplification) of the
joint probability. In the most complex case, every random variable influences every
other one, and the joint probability cannot be factored into simpler parts. In order
to build models for inference, it is useful to first build quantitative models of image
formation—external generative models based on real-world statistics. As we have
noted above, the requirements of the task split S into variables that are important
to estimate accurately for the task (disks) and those which are not (hexagons)
(Figure 3B). The consequences of task specification are described in more detail
in Discounting and Task Dependence (below).

Figure 3 Influence graphs and the visual task. (A) An example of an influence graph.
Arrows indicate causal links between nodes that represent random variables. The scene class
(indoor or outdoor) influences the kind of illumination and background that determine the
environment. The scene also determines the kinds of objects one may find (artifactual or
not). Object reflectivity (paint or pigment) and shape are influenced by the object class.
The model of lighting specified by the environment interacts with reflectivity and shape to
determine the image measurements or features. The environment can determine large-scale
global features (e.g., overall contrast and color, spatial frequency spectrum) that may be
relatively unaffected by smaller-scale objects of interest. Global features can serve to set a
context. (B) The inference problem depends on the task. The task specifies which variables
are important to estimate accurately (disks) and which are not (hexagons). Triangles represent
image measurements or features. For the purpose of later illustration of explaining away, we
also distinguish auxiliary features (upside-down triangles) that are available or actively sought
and which can modulate the probability of object variables that do not directly influence the
auxiliary variable. Note that perceptual inference goes against the arrows.
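To make the factorization above concrete, here is a toy sketch (our illustration,
not the article's model) of a three-node influence graph in which illumination L
and shape S independently cause an image feature I, so that
p(L, S, I) = p(L) p(S) p(I|L, S); all probabilities are invented.

```python
import numpy as np

# Toy influence graph: illumination L and shape S are independent causes
# of an image feature I, so p(L, S, I) = p(L) p(S) p(I | L, S).
# All variables are binary and the numbers are made up for illustration.
p_L = np.array([0.7, 0.3])            # p(L): light from above vs. below
p_S = np.array([0.6, 0.4])            # p(S): convex vs. concave shape
# p(I=1 | L, S): probability of a "bright-top" feature for each cause pair.
p_I1_given_LS = np.array([[0.9, 0.2],
                          [0.3, 0.8]])

# The joint distribution follows directly from the graph's factorization.
p_joint_I1 = p_L[:, None] * p_S[None, :] * p_I1_given_LS   # p(L, S, I=1)

# Task: estimate shape S, discounting illumination L. Perceptual inference
# runs against the arrows: condition on I=1 and marginalize out L.
p_S_given_I1 = p_joint_I1.sum(axis=0)
p_S_given_I1 /= p_S_given_I1.sum()
print("p(S | I=1):", p_S_given_I1)
```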
In the next section, Psychophysics, we review psychophysical results supporting
the Bayesian approach to object perception. The discussion is organized around
the four simple influence graphs of Figure 4.
PSYCHOPHYSICS
Ideal Observers
Ideal observers provide the strongest psychophysical tests because they are com-
plete models of visual performance based on both the posterior and the task.
A Bayesian ideal observer is designed to maximize a performance measure for a
visual task (e.g., proportion of correct decisions) and, as a result, serves as a bench-
mark with which to compare human performance for the same task (Barlow 1962,
Green & Swets 1974, Parish & Sperling 1991, Tjan et al. 1995, Pelli et al. 2003).
Characterizing the visual information for a task can be critical for proper interpre-
tation of psychophysical results (Eckstein et al. 2000) as well as for the analysis
of neural information transmission (Geisler & Albrecht 1995, Oram et al. 1998).
For example, when deciding whether human object recognition uses 3-D shape
cues, it is necessary to characterize whether these cues add objectively useful
information for the recognition task (independent of whether the task is being
performed by a person or by a computer). Liu & Kersten (2003) show that human
thresholds for discriminating 3-D asymmetric objects are less than for symmetric
objects (the image projections were not symmetric); however, when one compares
human performance to the ideal observer for the task, which takes into account
the redundancy in symmetric objects, human discrimination is more efficient for
symmetric objects.
Because human vision is limited by both the nature of the computations and its
physiological hardware, we might expect significant departures from optimality.
Nevertheless, Bayesian ideal observer models provide a first approximation to
human performance that has been surprisingly effective (cf. Knill 1998, Schrater
& Kersten 2002, Legge et al. 2002, Ernst & Banks 2002, Schrater et al. 2000).
A major theoretical challenge for ideal observer analysis of human vision is
the requirement for strong generative models so that human observers can be
tested with image samples I drawn from p(I |S). We discuss (Basic Bayes: The
Trade-Off Between Feature Reliability and Priors, below) results from statistical
models of natural image features that constrain, but are not sufficient to specify,
the likelihood distribution. In Theoretical and Computational Advances (below),
we discuss relevant work on machine learning of probability distributions. But we
first present a preview of one aspect of the problem.
The need for strong generative models is an extensibility requirement that rules
out classes of models for which the samples are image features. The distinction is
sometimes subtle. The key point is that image features may either be insufficient to
uniquely determine an image or they may sometimes overconstrain it. For example,
suppose that a system has learned probability models for airplane parts. Then
sampling from these models is highly unlikely to produce an airplane—the samples
will be images, but additional constraints are needed to make sure they correspond
to airplanes (see Figure 5A). M.C. Escher’s pictures and other impossible figures,
such as the impossible trident, give examples of images that are not globally
consistent. In addition, local feature responses can sometimes overconstrain the
image locally. For example, consider the binary image in Figure 5B and suppose
our features ΔL are defined to be the difference in image intensity L at neighboring
pixels. The nature of binary images puts restrictions on the values that ΔL can
take at neighboring pixels. It is impossible, for example, that neighboring ΔLs can
both take the value +1. So, sampling from these features will not give a consistent
image unless we impose additional constraints. Additional consistency conditions
are also needed in 2-D images (see Figure 5C), where the local image differences
must be mutually consistent (for instance, the differences around any closed loop
of pixels must sum to zero) for the features to correspond to an actual image.
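The binary-image restriction in Figure 5B is easy to verify directly. The following
sketch (our illustration, not from the article) enumerates the difference features of
all 3-pixel binary images:

```python
from itertools import product

# For a 1-D binary image L (pixel values 0 or 1), define the feature
# dL[i] = L[i+1] - L[i]. Enumerate all 3-pixel images and record which
# pairs of neighboring differences (dL[0], dL[1]) actually occur.
realizable = set()
for L in product((0, 1), repeat=3):
    realizable.add((L[1] - L[0], L[2] - L[1]))

print(sorted(realizable))
# (+1, +1) is absent: two successive increments would need L = 0, 1, 2,
# which is not a binary image. Sampling dL values independently from
# their marginal statistics can therefore propose feature configurations
# that no image realizes, which is why extra consistency constraints are
# needed before features define a strong generative model.
```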
Figure 5 The challenge of building strong generative models. The samples of strong
generative models are images that can be used as stimuli for ideal observer analysis of
human vision. But models specified by image features may be unable to generate images
because the image features may either underconstrain or overconstrain the image.
(A) A model whose features are airplane parts must impose additional constraints
to ensure that the samples form a plane (left and center panels). Constraints must
prevent an image from being globally inconsistent (right panel). (B) The nature of
binary images (left panel) imposes constraints on the feature values (center panel) and
means that some feature configurations are inconsistent with any image (right panel).
(C) Measurements of natural image statistics (Simoncelli & Olshausen 2001) have
shown that the probability distribution of intensity differences between pixels has a
characteristic distribution (left panel), but to produce natural image samples requires an
additional consistency constraint on neighboring filter values (center and right panels).
(D) Samples from a strong generative model learned from image features (Zhu et al.
1997). Panel (a) shows the original picture of fur. Panels (b)–(e) show image samples
drawn from several models p(I | fur) with increasing numbers of spatial features and hence
increased realism. (E) Samples drawn from a generative model for closed curves (Zhu
1999) learned from spatial features.
Weiss et al. (2002) describe a Bayesian model of motion perception whose
prior is that motions tend to be slow and which integrates local measurements
according to their reliabilities (see Yuille & Grzywacz 1988 for the same prior
applied to other motion stimuli). With this model (using a single free parameter),
the authors showed that a wide range of motion results in human perception could
be accounted for in terms of the trade-off between the prior and the likelihood.
The Bayesian models give a simple unified explanation for phenomena that had
previously been used to argue for a bag of tricks theory requiring many different
mechanisms (Ramachandran 1985).
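As an illustrative sketch of the trade-off just described (the Gaussian forms and
all numbers are our assumptions for exposition, not the published model), the MAP
estimate under a slow-motion prior has a closed form for a single local velocity
measurement:

```python
# One local velocity measurement v_obs with measurement noise sigma_m
# (low contrast -> large sigma_m -> unreliable likelihood), and a prior
# that motions tend to be slow: v ~ N(0, sigma_p^2).
def map_velocity(v_obs, sigma_m, sigma_p=1.0):
    """MAP velocity under a Gaussian likelihood and zero-mean slowness prior.

    The posterior precision is the sum of the two precisions, so the MAP
    estimate shrinks the measurement toward zero by the prior's weight.
    """
    w = (1 / sigma_m**2) / (1 / sigma_m**2 + 1 / sigma_p**2)
    return w * v_obs

# High-contrast (reliable) measurement: percept stays close to the data.
print(map_velocity(2.0, sigma_m=0.2))   # ~1.92
# Low-contrast (unreliable) measurement: percept is biased toward slow.
print(map_velocity(2.0, sigma_m=2.0))   # ~0.4
```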
Geometry and shape Some prior regularities refer to the geometry of objects that
humans interact with. For instance, the perception of a solid shape is consistently
biased toward convexity rather than concavity (Kanizsa & Gerbino 1976, Hill
& Bruce 1993, Mamassian & Landy 1998, Bertamini 2001, Langer & Bülthoff
2001). This convexity prior is robust over a range of object shapes, sizes, and
tasks.
More specific tests and ideal observer models will necessitate developing proba-
bility models for the high-dimensional spaces of realistic objects. Some of the most
highly developed work is on human facial surfaces (cf. Vetter & Troje 1997). Relat-
ing image intensity to such measured surface depth statistics has yielded computer
vision solutions for face recognition (Atick et al. 1996) and has provided objec-
tive prior models for face recognition experiments, suggesting that human vision
may represent facial characteristics along principal component dimensions in an
opponent fashion (Leopold et al. 2001).
Material The classical problems of color and lightness constancy are directly
tied to the nature of the materials of objects and environmental lighting. However,
most past work has implicitly or explicitly assumed a special case: that surfaces are
Lambertian (matte). Here, the computer graphics communities have been instru-
mental in going beyond the Lambertian model by measuring and characterizing
the reflectivities of natural smooth homogeneous surfaces in terms of the bidirec-
tional reflectance distribution function (BRDF) (cf. Marschner et al. 2000), with
important extensions to more complicated textured surfaces (Dana et al. 1999),
including human skin (Jensen et al. 2001). Real images are, of course, intimately
tied to the structure of illumination, and below we review psychophysical results
on realistic material perception.
Lighting Studies of human object perception (as well as computer vision) have
traditionally assumed simple lighting models, such as a single-point light source
with a directionally nonspecific ambient term. One of the best-known examples
of a prior is the assumption that light is coming from above. This assumption
is particularly useful to disambiguate convex from concave shapes from shading
information. The light-from-above prior is natural when one considers that the
sun and most artificial light sources are located above our heads. However, two
different studies have now shown that humans prefer the light source to be located
above-left rather than straight above (Sun & Perona 1998, Mamassian & Goutcher
2001). A convincing explanation for this leftward bias remains to be advanced.
Apparent motion in depth of an object is strongly influenced by the movement
of its cast shadow (Kersten et al. 1997, Mamassian et al. 1998). This result can be
interpreted in terms of a stationary light source prior—the visual system is more
likely to interpret change in the image as being due to a movement of the object
or a change in viewpoint, rather than a movement of the light source.
Do we need more complex lighting models? The answer is surely yes, especially
in the context of recent results on the perception of specular surfaces (Fleming
et al. 2003) and color given indirect lighting (Bloj et al. 1999), both discussed in
Discounting and Task Dependence (below). Dror et al. (2001) have shown that
spatial maps of natural illumination (Debevec 1998) show statistical regularities
similar to those found in natural images (cf. Simoncelli & Olshausen 2001).
IMAGE REGULARITIES p(I) = Σ_S p(I|S) p(S) Image regularities arise from the sim-
ilarity between natural scenes. They cover geometric properties, such as the statis-
tics of edges, and photometric properties, such as the distribution of contrast as a
function of spatial frequency in the image.
Geometric regularities Geisler et al. (2001) used spatial filtering to extract local
edge fragments from natural images. They measured statistics on the distance
between the element centers, the orientation difference between the elements, and
the direction of the second element relative to the orientation of the first (reference)
element. A model derived from the measured statistics and from a simple rule
that integrated local contours together could quantitatively predict human contour
detection performance. More detailed rules to perceptually organize a chain of dot
elements into a subjective curve with a corner or not, or to be split into one versus
two groups, have also been given a Bayesian interpretation (Feldman 2001).
Elder & Goldberg (2002) measured statistics of contours from images hand
segmented into sets of local tangents. These statistics were used to put probability
distributions on three Gestalt principles of perceptual organization: proximity, con-
tinuation, and luminance similarity. The authors found that these three grouping
cues were independent and that the proximity cue was by far the most powerful.
Moreover, the contour likelihood distribution (the probability of a gap length be-
tween two tangents of a contour) follows a power law with an exponent very close
to that determined psychophysically in dot-lattice experiments.
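When grouping cues are independent given the grouping hypothesis, their likelihood
ratios multiply, so their logarithms add. A hedged sketch (our invented numbers,
not Elder & Goldberg's measured statistics):

```python
import numpy as np

# If cues are statistically independent given the hypothesis ("the two
# tangents belong to the same contour" vs. not), their likelihood ratios
# multiply -- conveniently, their logs add. The values are illustrative.
def same_contour_posterior(cue_log_LRs, log_prior_odds):
    """Posterior probability of 'same contour' from independent cues."""
    log_odds = log_prior_odds + sum(cue_log_LRs)
    return 1 / (1 + np.exp(-log_odds))

cues = [np.log(6.0),   # proximity: a small gap strongly favors grouping
        np.log(1.8),   # good continuation: mildly favors grouping
        np.log(1.2)]   # luminance similarity: weakly favors grouping
print(same_contour_posterior(cues, log_prior_odds=np.log(0.1)))
```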
The work of Geisler et al. (2001) deals with distributions on image features,
namely edge pairs. To devise a distribution p(I ) from which one can draw true
contour samples, one needs to also take into account the consistency condition that
edge pairs have to lie in an image (see Figure 5A for an airplane analogy). Zhu
(1999) describes a procedure that learns a distribution on the image itself, and thus
samples drawn from it produce contours (see Figure 5E; Elder & Goldberg 2002).
IMAGE LIKELIHOOD p(I |S) The likelihood characterizes how image regularities
result from projection and rendering of objects as a function of view and lighting,
and it is related to what is sometimes called the forward optics or computer graphics
problem.
WHERE DO THE PRIORS COME FROM? Without direct input, how does image-
independent knowledge of the world get built into the visual system? One pat
answer is that the priors are in the genes. Observations of stereotyped periods in
the development of human depth perception do in fact suggest a genetic compo-
nent (Yonas 2003). In another domain, the strikingly rapid development of object
concepts in children is still a major mystery that suggests predispositions to cer-
tain kinds of grouping rules. Adults, too, are quick to accurately generalize from
a relatively small set of positive examples (in many domains, including objects)
to a whole category. Tenenbaum (2000; Tenenbaum & Xu 2000, Tenenbaum &
Griffiths 2001) provides a Bayesian synthesis of two theories of generalization
(similarity-like and rule-like) and provides a computational framework that helps
to explain rapid category generalization.
The accurate segmentation of objects such as the kayaks in Figure 1 likely
requires high-level prior knowledge regarding the nature of the forms of possi-
ble object classes. Certain kinds of priors, such as learning the shapes of specific
objects, may develop through what Brady & Kersten (2003) have called oppor-
tunistic learning and bootstrapped learning. In opportunistic learning, the visual
system seizes the opportunity to learn object structure during those relatively rare
occasions when an object is seen under conditions of low ambiguity, such as when
motion breaks camouflage of an object in plain view. Bootstrapped learning op-
erates under intermediate or high conditions of ambiguity (e.g., in which none of
the training images provide a clear view of an object’s boundaries). Then later, the
visual system can apply the prior knowledge gained to high (objective) ambigu-
ity situations more typical of everyday vision. The mechanisms of bootstrapped
learning are not well understood, although there has been some computer vision
work (Weber et al. 2000) (see Theoretical and Computational Advances, below).
General purpose cortical learning mechanisms have been proposed (Dayan
et al. 1995, Hinton & Ghahramani 1997); however, it is not clear whether these are
workable with complex natural image input. We discuss computer vision methods
for learning priors in Theoretical and Computational Advances (below).
Figure 6 Traditional visual noise versus illumination variation. Example of face
recognition given confounding variables. The first row and second rows of the top and
bottom panels show images of faces s1 and s2, respectively. (A) Face recognition is con-
founded by additive Gaussian contrast noise. Bayesian ideal discriminators for this task
are well understood. However, the Gaussian assumption leads to a least squares metric
for measuring the similarities between faces. But the least squares distance between
two images of the same face under different lighting can be bigger than the distance
between two different faces. (B) Face recognition is confounded by illumination variation. This
type of uncertainty is more typical of the type of variation encountered during natural
visual tasks. The human visual system seems competent at discounting illumination,
but Bayesian theories for general illumination variation are more difficult to formulate
(cf. Yuille et al. 2001). The columns show different illumination conditions of the two
faces. Light direction gradually varies from right (left-most column) to left (right-most
column). In this example, illumination changes are relatively large compared to the
illumination-invariant features corresponding to facial identity. The illumination di-
rection changes are ordered for clarity. In actual samples, the illumination direction
may not be predictable.
This is equivalent to spreading out the loss function completely in one of the
directions (e.g., extending the utility function vertically in Figure 2E). As noted
above, the choice of which variables to discount will depend on the task.
Viewpoint variation When interpreting 2-D projections of 3-D shapes, the hu-
man visual system favors interpretations that assume that the object is being
viewed from a general (or generic), rather than accidental, viewpoint (Nakayama &
Shimojo 1992, Albert 2000). Freeman (1994) showed that a Bayesian observer that
integrates out viewpoint can account for the generic view assumption.
How does human vision recognize 3-D objects as the same despite changes in
viewpoint? Shape-based object recognition models range between two extremes—
those that predict a strong dependence of recognition performance on viewpoint
(e.g., as a function of familiarity with particular views) and those that, as a result
of using view-invariant features together with structural descriptions of objects,
do not (Poggio & Edelman 1990, Ullman & Basri 1991, Tarr & Bülthoff 1995,
Ullman 1996, Biederman 2000, Riesenhuber & Poggio 2002). By comparing hu-
man performance to several types of ideal observers that integrate out viewpoint
variations, Liu et al. (1995) showed that models that allow rigid rotations in the
image plane of independent 2-D templates could not account for human perfor-
mance in discriminating novel object views. More recent work by Liu & Kersten
(1998) showed that the performance of human observers relative to Bayesian affine
models (which allow stretching, translations, rotations, and shears in the image)
is better for the novel views than for the template views, suggesting that humans
have a better means to generalize to novel views than could be accomplished with
affine warps. Other related work describes the role of structural knowledge (Liu
et al. 1999), the importance of view-frequency (Troje & Kersten 1999), and shape
constancy under perspective (Pizlo 1994).

4. Bayesian decision theory (see Theoretical and Computational Advances, below) provides
a precise language to model the costs of errors determined by the choice of visual task (Yuille
& Bülthoff 1996). The cost or risk R(α; I) of guessing α when the image measurement is
I is defined as the expected loss (or negative utility)

R(α; I) = Σ_S L(α, S) p(S|I),

with respect to the posterior probability p(S|I). The best interpretation of the image can
then be made by finding the α that minimizes the risk function. The loss function L(α, S)
specifies the cost of guessing α when the scene variable is S. One possible loss function is
−δ(α − S). In this case, the risk becomes R(α; I) = −p(α|I), and then the best strategy is
to pick the most likely interpretation. This is MAP estimation and is the optimal strategy for
the task requiring an observer to maximize the proportion of correct decisions. A second
kind of loss function assumes that costs are constant over all guesses of a variable. This is
equivalent to integrating out, summing over, or marginalizing the posterior with respect
to that variable.
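A small numerical sketch of footnote 4 (with an invented posterior table, not data
from any of the studies above) shows how the choice of loss function changes the
optimal decision: a 0-1 loss over the whole scene yields the joint MAP, while a
loss that is constant over viewpoint reduces to deciding from the marginal
posterior over shape.

```python
import numpy as np

# Discrete posterior p(S1, S2 | I) over shape S1 (rows) and viewpoint S2
# (columns); the numbers are invented for illustration.
posterior = np.array([[0.30, 0.05, 0.05],
                      [0.10, 0.25, 0.25]])

# 0-1 loss over the full scene (L = -delta): the risk of guessing (s1, s2)
# is -p(s1, s2 | I), so the best guess is the joint MAP: here (0, 0).
print(np.unravel_index(posterior.argmax(), posterior.shape))

# Task-dependent loss: shape matters, viewpoint does not. A loss that is
# constant over S2 is equivalent to summing the posterior over S2 ...
shape_marginal = posterior.sum(axis=1)   # p(S1 | I)
# ... and the optimal shape guess minimizes risk under that marginal.
print(shape_marginal.argmax())
# Discounting viewpoint flips the decision: shape 1 carries more total
# posterior mass (0.6) even though the joint MAP falls on shape 0.
```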
Bayesian task dependence may account for studies where different operational-
izations for measuring perceived depth lead to inconsistent, though related, esti-
mates. For instance, the information provided by a single image of an object is
insufficient for an observer to infer a unique shape (see Figure 1B). Not surprisingly,
these ambiguities will lead different observers to report different depth percepts
for the same picture, and the same observer to report different percepts when using
different depth probes. Koenderink et al. (2001) show that most of this variability
could be accounted for by affine transformations of the perceived depth, in par-
ticular, scalings and shears. These affine transformations correspond to looking at
the picture from a different viewpoint. The authors call the perceived depth of a
surface, once the viewpoint has been discounted, pictorial relief.
Integration of Image Measurements and Cues

When two image measurements conflict, one may be treated as an outlier and not
be integrated with the other measurement (Landy et al. 1995, Bülthoff & Mallot
1988). Ruling out outliers is possible within a single modality (e.g., when inte-
grating disparity and texture gradients in vision) but may not be possible between
modalities (e.g., between vision and touch) because in the latter case, single cue
information appears to be preserved (Hillis et al. 2002). The visual system is of-
ten more sophisticated and combines image measurements weighted according to
their reliability (Jacobs 2002, Weiss et al. 2002), which we discuss next.
Figure 4C illustrates the influence graph for cue integration with an example
of illumination direction estimation. From a Bayes net perspective, the two cues
are conditionally independent given the shared explanation. An important case is
when the nodes represent Gaussian variables that are independent when condi-
tioned on the common cause and we have estimates for each cue alone (i.e., Ŝi
is the best estimate of Si from p(S|Ii)). Then optimal integration (i.e., the most
probable value) of the two estimates takes into account the uncertainty owing to
measurement noise (the variance) and is given by the weighted average

Ŝ = (r1 / (r1 + r2)) Ŝ1 + (r2 / (r1 + r2)) Ŝ2,

where ri, the reliability, is the reciprocal of the variance. This model has been used
to study whether the human visual system combines cues optimally (cf. Jacobs
2002 for a review and a discussion of integration in the context of Kalman filtering,
which is a special case of Bayesian estimation).
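The weighted average above is a one-liner; here is a sketch with illustrative
numbers (not data from Ernst & Banks):

```python
def combine_cues(s_hat_1, var_1, s_hat_2, var_2):
    """Optimal combination of two conditionally independent Gaussian cues.

    Each cue's weight is its reliability r_i = 1 / variance, normalized,
    as in the weighted average above. Returns the combined estimate and
    its (reduced) variance.
    """
    r1, r2 = 1 / var_1, 1 / var_2
    s_hat = (r1 * s_hat_1 + r2 * s_hat_2) / (r1 + r2)
    return s_hat, 1 / (r1 + r2)

# Visual size estimate (reliable) vs. haptic size estimate (noisy); the
# numbers are made up. "Visual dominance" is simply the larger weight,
# and the combined variance is smaller than either cue's alone.
print(combine_cues(10.0, 0.5, 12.0, 2.0))   # -> (10.4, 0.4)
```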
For instance, visual and haptic information about object size are combined and
weighted according to the reliability of the source (Ernst & Banks 2002, Gepshtein
& Banks 2003). Object size can be perceived both visually and by touch. When
the information from vision and touch disagree, vision usually dominates. The
authors showed that when one takes into account the reliability of the sensory
measurements, information from vision and touch are integrated optimally. Visual
dominance occurs when the reliability of the visual estimation is greater than that
of the haptic one.
Integration is also important for grouping local image elements likely to be-
long to the same surface. The human visual system combines spatial frequency
and orientation information optimally when detecting the boundary between two
regions (Landy & Kojima 2001). Human observers also behave like an optimal
observer when integrating information from skew-symmetry and disparity in per-
ceiving the orientation of a planar object (Saunders & Knill 2001). The projection
of a symmetric flat object has a distorted or skewed symmetry that provides partial
information about the object’s orientation in depth. Saunders & Knill show that hu-
man observers integrate symmetry information with stereo information weighted
according to the reliability of the source.
Prior probabilities can also combine like weighted cues. We discussed above that
human observers interpret the shape of an object assuming that both the viewpoint
and the light source are above the object (Mamassian et al. 1998, Mamassian
& Goutcher 2001). Mamassian & Landy (2001) manipulated the reliability of
each of the two constraints by changing the contrast of different parts of the
stimuli. For instance, increasing the shading contrast increased the reliability of
the light-source prior and biased the observers’ percept toward the shape most
consistent with the light-source prior alone. Their interpretation of the results was
that observers modulated the strength of their priors based on the stimulus contrast.
As a consequence, prior constraints behaved just like depth cue integration: Cues
with more reliable information have higher weight attributed to their corresponding
prior constraint.
Not all kinds of cue integrating are consistent with the simple graph of Figure 4C.
Yuille & Bülthoff (1996) have argued that strong coupling of visual cues (Clark &
Yuille 1990) is required to model a range of visual phenomena (Bülthoff & Mallot
1988). The graph for explaining away (Figure 4D) provides one useful simple
extension to the graphs discussed so far, and we discuss this next.
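Before the perceptual examples, here is a toy numerical version of explaining
away (our illustration; all probabilities are invented), anticipating the lightness
example below:

```python
import numpy as np

# Toy "explaining away" net (Figure 4D style): two independent binary
# causes -- a reflectance change R and a surface-curvature change C --
# can each produce a luminance gradient G. All numbers are invented.
p_R, p_C = 0.3, 0.3                      # prior probability of each cause
p_G1 = {(0, 0): 0.01, (0, 1): 0.90,      # p(G=1 | R, C)
        (1, 0): 0.90, (1, 1): 0.95}

# Joint p(R, C, G=1) from the factorization p(R) p(C) p(G | R, C).
joint = np.array([[(p_R if r else 1 - p_R) * (p_C if c else 1 - p_C)
                   * p_G1[(r, c)] for c in (0, 1)] for r in (0, 1)])
post = joint / joint.sum()               # p(R, C | G=1)

print("p(R=1 | G=1)      =", round(post[1].sum(), 2))              # ~0.59
# Auxiliary evidence (an occluding contour, disparity) establishes C=1:
print("p(R=1 | G=1, C=1) =", round(post[1, 1] / post[:, 1].sum(), 2))
# ~0.31: curvature explains away the gradient, so the reflectance-change
# hypothesis falls back toward its prior even though G itself is unchanged.
```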
Material In the Land & McCann (1971) version of the classic Craik-O'Brien-
Cornsweet illusion, two abutting regions that have the same gradual change of
luminance appear to have different reflectances.5 Knill & Kersten (1991) found
that the illusion is weakened when a curved occluding contour (auxiliary evidence)
bounding the regions above and below suggests that the variation of luminance
is due to a change in surface orientation (Figure 4D). The lightness gradients are
explained away by the gradual surface curvature. Buckley et al. (1994) extended
this result when binocular disparities were used to suggest a 3-D surface. These
results indicate that some scene attributes (in this case surface curvature) influence
the inference of the major scene attributes (material reflectance) that are set by
the task.

5. Some related perceptual phenomena Rock described as “perceptual interactions” (Rock
1983).
Another study asked whether human vision could discount the effects of the
color of indirect lighting (Bloj et al. 1999). Imagine a concave folded card con-
sisting of a red half facing a white half. With white direct illumination, pinkish
light radiates from the white card because of the indirect additional illumination
from the red card. Does vision use shape knowledge to discount the red illuminant
in order to perceive the true material color, which is white? A change in retinal
disparities (an auxiliary measurement) can cause the concave card to appear as
convex, without any change in the chromatic content of the stimulus. When the
card appears convex, the white card appears more pinkish, as if perception has lost
its original explanation for the pinkish tinge in the image and now attributes it to
reddish pigment rather than reddish light.
Geometry and shape Explaining away occurs in discounting the effects of oc-
clusion, and when simple high-level object descriptions override more complex
interpretations of line arrangements or moving dots (Lorenceau & Shiffrar 1992,
McDermott et al. 2001, Murray et al. 2002). We describe this in more detail in
Neural Implications (below).
Explaining away is closely related to previous work on competitive models,
where two alternative models compete to explain the same data. It has been argued
that this accounts for a range of visual phenomena (Yuille & Bülthoff 1996),
including the estimation of material properties (Blake & Bülthoff 1990). This
approach has been used successfully in computer vision systems by Tu & Zhu
(2002), including recent work (Tu et al. 2003) in which a whole class of generative
models, including faces, text, generic shading, and texture models compete and
cooperate to explain the entire image (Theoretical and Computational Advances,
below). In particular, the generic shading models help detect faces by explaining
away shadows and glasses.
Failures to explain away Visual perception can also unexpectedly fail to explain
away. In one simple demonstration, an ambiguous Mach folded card can be inter-
preted as a concave horizontal edge or a convex vertical edge. A shadow cast over
the edge by an object (e.g., pencil) placed in front provides enough information to
disambiguate the percept, and yet humans fail to use this information (Mamassian
et al. 1998). There has yet to be a good explanation for this failure.
THEORETICAL AND COMPUTATIONAL ADVANCES

Vision science has traditionally relied on simplifications of both stimuli and
tasks. This simplification, however, must be exten-
sible to the visual input experienced during natural perceptual functioning. In the
previous sections, several psychophysical studies used models of natural image
statistics, as well as models of prior object structure, such as shape. Future advances
in understanding perception will increasingly depend on the efficient characteri-
zation (and simulation) of realistic images to identify informative image statistics,
models of scene properties, and a theoretical understanding of inference for natural
perceptual functions. In this section, we discuss relating Bayesian decision the-
ory to current theories of machine learning, learning the probability distributions
relevant for vision, and determining algorithms for Bayesian inference.
Suppose we have a set of N training samples {Si, Ii} of scenes and their images,
and a decision rule α(·). The empirical risk of the rule is its average loss on the
samples:

Remp(α) = (1/N) Σ_{i=1}^{N} L(α(Ii), Si).   (1)
The best decision rule α*(·) is selected to minimize Remp(α). For example,
the decision rule is chosen to minimize the number of misclassifications. Neural
networks and machine learning models select rules to minimize Remp(α) (Vapnik
1998, Evgeniou et al. 2000, Schölkopf & Smola 2002).
Now suppose that the samples {Si , Ii } come from a distribution p(S, I ) over
the set of problem instances. Then, if we have a sufficient number of samples6, we
can replace the empirical risk by the true risk:
6. The number of samples required is a complicated issue (Vapnik 1998, Schölkopf & Smola
2002).
R(α) = Σ_I Σ_S L(α(I), S) p(S, I).   (2)

Minimizing R(α) leads to a decision rule that depends on the posterior distri-
bution p(S|I) obtained by Bayes' rule, p(S|I) = p(I|S) p(S)/p(I). To see this, we
rewrite Equation 2 as R(α) = Σ_I p(I) {Σ_S p(S|I) L(α(I), S)}, where we have
expressed p(S, I) = p(S|I) p(I). So the best decision α*(I) for a specific image I
is given by

α*(I) = arg min_α Σ_S p(S|I) L(α, S),   (3)

and depends on the posterior distribution p(S|I). Hence, Bayes arises naturally
when you start from the risk function specified by Equation 2.
There are two points to be made here. First, the use of the Bayes posterior
p(S|I ) follows logically from trying to minimize the number of misclassifications
in the empirical risk (provided there are a sufficient number of samples). Second,
it is possible to have an algorithm, or a network, that computes α(·) and minimizes
the Bayes risk but that does not explicitly represent the probability distributions
p(I |S) and p(S). For example, Liu et al. (1995) compared the performance of
ideal observers for object recognition with networks using radial basis functions
(Poggio & Girosi 1990). It is possible that the radial basis networks, given sufficient
examples and having sufficient degrees of freedom, are effectively doing Bayesian
inference.
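The point can be illustrated with a sketch (not Liu et al.'s radial basis
networks): a simple discriminative rule trained only on samples, never shown
p(I|S) or p(S), recovers the Bayes decision boundary of the generative model that
produced the data. The toy generative model and the training details below are
assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: scene S is binary, image I is one noisy number.
# Generative model (known here, but never shown to the classifier):
# I | S=0 ~ N(-1, 1), I | S=1 ~ N(+1, 1), p(S=1) = 0.5.
S = rng.integers(0, 2, 5000)
I = rng.normal(2.0 * S - 1.0, 1.0)

# Fit a logistic classifier by gradient descent on the (smoothed)
# empirical risk; it never represents p(I|S) or p(S) explicitly.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(w * I + b)))
    w -= 0.1 * np.mean((p - S) * I)
    b -= 0.1 * np.mean(p - S)

# The Bayes-optimal boundary for equal priors and variances is I = 0;
# the learned rule should land close to it (boundary at I = -b/w).
print("learned boundary:", -b / w)   # approximately 0
```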
More advanced models of this type can learn distributions with parameters representing
hidden states (Weber et al. 2000).
For certain problems it is also possible to learn the posterior distribution p(S|I )
directly, which relates to directly learning a classifier α(I ). For example, the Ad-
aBoost learning algorithm (Freund & Schapire 1999) has been applied very suc-
cessfully to build a decision rule α(I ) for classifying between faces (seen from
front-on) and nonfaces (Viola & Jones 2001). But the AdaBoost theory shows that
the algorithm can also learn the posterior distributions p(face|I) and p(nonface|I)
(Hastie et al. 2001, Tu et al. 2003). Other workers have learned posterior proba-
bilities p(edge|φ(I )) and p(not-edge|φ(I )), where φ(I ) are local image features
(Konishi et al. 2003). Similarly, Oliva and colleagues have learned a decision rule
α(I ) to determine the type of scene (urban, mountain, etc.) from feature measure-
ments (Oliva & Schyns 2000, Oliva & Torralba 2001). Fine et al. (2003) used the
statistics of the spatio-chromatic structure of natural scenes to segment natural
images into regions likely to be part of the same surface. They computed the prob-
ability of whether or not two points within an image fall on the same surface given
measurements of luminance and color differences.
Visual Inference
It is necessary to have algorithms to perform Bayesian inference after the prob-
ability distributions have been learned. The complexity of vision makes it very
unlikely that we can directly learn a classifier α(I) to solve all visual tasks. (The
brain may be able to do this, but we do not know how.) Recently, however, there
have been some promising new algorithms for Bayesian inference. Particle fil-
ters have been shown to be very useful for tracking objects over time (Isard &
Blake 1998). Message passing algorithms, such as belief propagation, have had
some success (Freeman et al. 2000). Tu & Zhu (2002) have developed a general
purpose algorithm for Bayesian inference known as the Data Driven Markov Chain
Monte Carlo (DDMCMC). This algorithm has been very successful at segmenting
images when evaluated on datasets with specified ground truth. It works, loosely
speaking, by using low level cues to propose high-level models (scene descrip-
tions), which are validated or rejected by generative models. It therefore combines
bottom-up and top-down processing in a way suggestive of the feedforward and
feedback pathways in the brain, described in the next section. The algorithm has
been extended to combine segmentation with the detection and recognition of faces
and text (Tu et al. 2003).
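To give a flavor of the particle filter idea mentioned above, here is a minimal
bootstrap filter for 1-D tracking. It is a generic condensation-style sketch with
invented dynamics and noise levels, not Isard & Blake's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Minimal bootstrap particle filter for tracking a 1-D position: a cloud
# of weighted samples represents the posterior, which need not be
# Gaussian (unlike a Kalman filter's).
n, x_true = 500, 0.0
particles = rng.normal(0.0, 1.0, n)
for t in range(50):
    x_true += 0.1 + rng.normal(0, 0.05)              # hidden object drifts
    z = x_true + rng.normal(0, 0.3)                  # noisy measurement

    particles += 0.1 + rng.normal(0, 0.05, n)        # predict: dynamics prior
    w = np.exp(-0.5 * ((z - particles) / 0.3) ** 2)  # weight: likelihood
    w /= w.sum()
    particles = rng.choice(particles, size=n, p=w)   # resample posterior

print("true:", round(x_true, 2), "estimate:", round(particles.mean(), 2))
```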
NEURAL IMPLICATIONS
What are the neural implications of Bayesian models? The graphical structure
of these models often makes it straightforward to map them onto networks and
suggests neural implementations. The notion of incorporating prior probabilities
In predictive coding schemes, feedback from higher areas carries a prediction of
lower-level activity, so that only the unexplained residual is propagated (Rao &
Ballard 1999). Thus, low activity at an early level would mean a good fit or ex-
planation of the image measurements. Experimental support for this possibility
comes from fMRI data (Murray et al. 2002, Humphrey et al. 1997). Earlier fMRI
work by a number of groups has shown that the human lateral occipital complex
(LOC) has increased activity during object perception. Murray et al. use fMRI to
show that when local visual information is perceptually organized into whole ob-
jects, activity in human primary visual cortex (V1) decreases over the same period
that activity in higher, lateral occipital areas (LOC) increases. The authors inter-
pret the activity changes in terms of high-level hypotheses that compete to explain
away the incoming sensory data.
There are two alternative theoretical possibilities for why early visual activity
is reduced. High-level areas may explain away the image and cause the early
areas to be completely suppressed—high-level areas tell lower levels to “shut up.”
Such a mechanism would be consistent with the high metabolic cost of neuronal
spikes (Lennie 2003). Alternatively, high-level areas might sharpen the responses
of the early areas by reducing activity that is inconsistent with the high level
interpretation—high-level areas tell lower levels to “stop gossiping.” The second
possibility seems more consistent with some single-unit recording experiments
(Lee et al. 2002). Lee et al. have shown that cells in V1 and V2 of macaque monkeys
respond to the apparently high-level task of detecting stimuli that pop-out owing
to shape-from-shading. These responses changed with the animal’s behavioral
adaptation to contingencies, suggesting dependence on experience and utility.
Lee & Mumford (2003) review a number of neurophysiological studies consis-
tent with a model of the interactions between cortical areas based on particle filter
methods, which are non-Gaussian extensions of Kalman filters that use Monte
Carlo methods. Their model is consistent with the “stop gossiping” idea. In other
work, Yu & Dayan (2002) raise the intriguing possibility that acetylcholine levels
may be associated with the certainty of top-down information in visual inference.
Gold & Shadlen (2001) interpret neurophysiological results as showing that
decisions between two alternatives (e.g., whether a field of random dots is moving
one way or the other) are made by accumulating information over time in a single
quantity: the logarithm of the likelihood ratio favoring one alternative over the
other.
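A sketch of this accumulation scheme (with illustrative parameters; this is not a
model of the recorded neurons):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sequential decision by accumulating the log likelihood ratio. Each time
# step yields a noisy motion sample x ~ N(+mu, 1) for "rightward" or
# N(-mu, 1) for "leftward"; mu and the decision bound are illustrative.
mu, bound, logLR, t = 0.2, 3.0, 0.0, 0
while abs(logLR) < bound:
    x = rng.normal(mu, 1.0)        # true direction: rightward
    # log[p(x | right) / p(x | left)] for unit-variance Gaussians is 2*mu*x.
    logLR += 2 * mu * x
    t += 1

print("decision:", "right" if logLR > 0 else "left", "after", t, "samples")
```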
CONCLUSIONS
The Bayesian perspective yields a uniform framework for studying object percep-
tion. We have reviewed work that highlights several advantages of this perspective.
First, Bayesian theories explicitly model uncertainty. This is important in account-
ing for how the visual system combines large amounts of objectively ambiguous
information to yield percepts that are rarely ambiguous. Second, in the context
of specific experiments, Bayesian theories are optimal, and thus define ideal ob-
servers. Ideal observers characterize visual information for a task and can thus be
critical for interpreting psychophysical and neural results. Third, Bayesian methods
allow the development of quantitative theories at the information processing level,
avoiding premature commitment to specific neural mechanisms. This is closely
related to the importance of extensibility in theories. Bayesian models provide
for extensions to more complicated problems involving natural images and func-
tional tasks as illustrated in recent advances in computer vision. Fourth, Bayesian
theories emphasize the role of the generative model and thus tie naturally to the
growing body of work on graphical models and Bayesian networks in other areas,
such as language, speech, concepts, and reasoning. The generative models also
suggest top-down feedback models of information processing in the cortex.
ACKNOWLEDGMENTS
Supported by NIH RO1 EY11507-001, EY02587, EY12691 and EY013875; NSF
SBR-9631682, 0240148, HFSP RG00109/1999-B; and EPSRC GR/R57157/01.
We thank Zili Liu for helpful comments.
LITERATURE CITED
Albert MK. 2000. The generic viewpoint assumption and Bayesian inference. Perception 29:601–8
Albright TD, Stoner GR. 2002. Contextual influences on visual processing. Annu. Rev. Neurosci. 25:339–79
Atick JJ, Griffin PA, Redlich AN. 1996. Statistical approach to shape from shading: reconstruction of three-dimensional face surfaces from single two-dimensional images. Neural Comput. 8:1321–40
Barlow HB. 1962. A method of determining the overall quantum efficiency of visual discriminations. J. Physiol. 160:155–68
Berger J. 1985. Statistical Decision Theory and Bayesian Analysis. New York: Springer-Verlag
Bertamini M. 2001. The importance of being convex: an advantage for convexity when judging position. Perception 30:1295–310
Biederman I. 2000. Recognizing depth-rotated objects: a review of recent research and theory. Spat. Vis. 13:241–53
Blake A, Bülthoff HH. 1990. Does the brain know the physics of specular reflection? Nature 343:165–69
Bloj MG, Kersten D, Hurlbert AC. 1999. Perception of three-dimensional shape influences colour perception through mutual illumination. Nature 402:877–79
Brady MJ, Kersten D. 2003. Bootstrapped learning of novel objects. J. Vis. 3:413–22
Brainard DH, Freeman WT. 1997. Bayesian color constancy. J. Opt. Soc. Am. A 14:1393–411
Buckley D, Frisby JP, Freeman J. 1994. Lightness perception can be affected by surface curvature from stereopsis. Perception 23:869–81
Bullier J. 2001. Integrated model of visual processing. Brain Res. Brain Res. Rev. 36:96–107
Bülthoff HH, Mallot HA. 1988. Integration of depth modules: stereo and shading. J. Opt. Soc. Am. A 5:1749–58
Bülthoff HH, Yuille A. 1991. Bayesian models for seeing surfaces and depth. Comments Theor. Biol. 2:283–314
Burgi PY, Yuille AL, Grzywacz NM. 2000. Probabilistic motion estimation based on temporal coherence. Neural Comput. 12:1839–67
Clark JJ, Yuille AL. 1990. Data Fusion for Sensory Information Processing. Boston: Kluwer Acad.
Dana KJ, van Ginneken B, Nayar SK, Koenderink JJ. 1999. Reflectance and texture of real world surfaces. ACM Trans. Graph. 18:1–34
Dayan P, Hinton GE, Neal RM, Zemel RS. 1995. The Helmholtz machine. Neural Comput. 7:889–904
Debevec PE. 1998. Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography. Presented at SIGGRAPH
Dror RO, Leung TK, Adelson EH, Willsky AS. 2001. Statistics of real-world illumination. Presented at Proc. CVPR, Hawaii
Eckstein MP, Thomas JP, Palmer J, Shimozaki SS. 2000. A signal detection model predicts the effects of set size on visual search accuracy for feature, conjunction, triple conjunction, and disjunction displays. Percept. Psychophys. 62:425–51
Elder JH, Goldberg RM. 2002. Ecological statistics of gestalt laws for the perceptual organization of contours. J. Vis. 2:324–53
Ernst MO, Banks MS. 2002. Humans integrate visual and haptic information in a statistically optimal fashion. Nature 415:429–33
Evgeniou T, Pontil M, Poggio T. 2000. Regularization networks and support vector machines. Adv. Comput. Math. 13:1–50
Feldman J. 2001. Bayesian contour integration. Percept. Psychophys. 63:1171–82
Field DJ. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4:2379–94
Fine I, MacLeod DIA, Boynton GM. 2003. Visual segmentation based on the luminance and chromaticity statistics of natural scenes. Special issue on Bayesian and statistical approaches to vision. J. Opt. Soc. Am. A 20:1283–91
Fleming RW, Dror RO, Adelson EH. 2003. Real-world illumination and the perception of surface reflectance properties. J. Vis. 3:347–68
Freeman WT. 1994. The generic viewpoint assumption in a framework for visual perception. Nature 368:542–45
Freeman WT, Pasztor EC, Carmichael OT. 2000. Learning low-level vision. Int. J. Comput. Vis. 40:25–47
Freund Y, Schapire R. 1999. A short introduction to boosting. J. Jpn. Soc. Artif. Intell. 14:771–80
Geisler WS, Albrecht DG. 1995. Bayesian analysis of identification performance in monkey visual cortex: nonlinear mechanisms and stimulus certainty. Vis. Res. 35:2723–30
Geisler WS, Kersten D. 2002. Illusions, perception and Bayes. Nat. Neurosci. 5:508–10
Geisler WS, Perry JS, Super BJ, Gallogly DP. 2001. Edge co-occurrence in natural images predicts contour grouping performance. Vis. Res. 41:711–24
Gepshtein S, Banks MS. 2003. Viewing geometry determines how vision and haptics combine in size perception. Curr. Biol. 13:483–88
Gold JI, Shadlen MN. 2001. Neural computations that underlie decisions about sensory stimuli. Trends Cogn. Sci. 5:10–16
Green DM, Swets JA. 1974. Signal Detection Theory and Psychophysics. Huntington, NY: Krieger
Grenander U. 1996. Elements of Pattern Theory. Baltimore, MD: Johns Hopkins Univ. Press
Grill-Spector K. 2003. The neural basis of object perception. Curr. Opin. Neurobiol. 13:1–8
Grill-Spector K, Kourtzi Z, Kanwisher N. 2001. The lateral occipital complex and its role in object recognition. Vis. Res. 41:1409–22
Hastie T, Tibshirani R, Friedman J. 2001. The Elements of Statistical Learning. New York: Springer
Helmholtz H. 1867. Handbuch der Physiologischen Optik. Leipzig: Voss. (English transl. 1924 JPC Southall as Treatise on Physiological Optics)
Hill H, Bruce V. 1993. Independent effects of lighting, orientation, and stereopsis on the hollow-face illusion. Perception 22:887–97
Hillis JM, Ernst MO, Banks MS, Landy MS. 2002. Combining sensory information: mandatory fusion within, but not between, senses. Science 298:1627–30
Hinton GE, Ghahramani Z. 1997. Generative models for discovering sparse distributed representations. Philos. Trans. R. Soc. London Ser. B 352(1358):1177–90
Howe CQ, Purves D. 2002. Range image statistics can explain the anomalous perception of length. Proc. Natl. Acad. Sci. USA 99:13184–88
Humphrey GK, Goodale MA, Bowen CV, Gati JS, Vilis T, et al. 1997. Differences in perceived shape from shading correlate with activity in early visual areas. Curr. Biol. 7:144–47
Isard M, Blake A. 1998. Condensation—conditional density propagation for visual tracking. Int. J. Comput. Vis. 29:5–28
Jacobs RA. 2002. What determines visual cue reliability? Trends Cogn. Sci. 6:345–50
Jensen HW, Marschner SR, Levoy M, Hanrahan P. 2001. A practical model for subsurface light transport. Presented at Computer Graphics (SIGGRAPH)
Kanizsa G, Gerbino W. 1976. Convexity and symmetry in figure-ground organisation. In Vision and Artifact, ed. M Henle. New York: Springer
Kersten D. 1999. High-level vision as statistical inference. In The New Cognitive Neurosciences, ed. MS Gazzaniga, pp. 353–63. Cambridge, MA: MIT Press. 2nd ed.
Kersten D, Mamassian P, Knill DC. 1997. Moving cast shadows induce apparent motion in depth. Perception 26:171–92
Kersten D, Schrater PW. 2002. Pattern inference theory: a probabilistic approach to vision. In Perception and the Physical World, ed. R Mausfeld, D Heyer. Chichester: Wiley
Kersten D, Yuille A. 2003. Bayesian models of object perception. Curr. Opin. Neurobiol. 13:1–9
Knill DC. 1998. Discrimination of planar surface slant from texture: human and ideal observers compared. Vis. Res. 38:1683–711
Knill DC, Field D, Kersten D. 1990. Human discrimination of fractal images. J. Opt. Soc. Am. A 7:1113–23
Knill DC, Kersten D. 1991. Apparent surface curvature affects lightness perception. Nature 351:228–30
Knill DC, Kersten D, Yuille A. 1996. Introduction: a Bayesian formulation of visual perception. See Knill & Richards 1996, pp. 1–21
Knill DC, Richards W. 1996. Perception as Bayesian Inference. Cambridge, UK: Cambridge Univ. Press
Koechlin E, Anton JL, Burnod Y. 1999. Bayesian inference in populations of corti-
Malach R. 2001. A hierarchical axis of object processing stages in the human visual cortex.
cal neurons: a model of motion integration Cereb. Cortex 11:287–97
Downloaded from www.annualreviews.org. University of Hong Kong (ar-184631) IP: 147.8.31.43 On: Wed, 12 Feb 2025 09:57:08
and segmentation in area MT. Biol. Cybern. Liu Z, Kersten D. 1998. 2D observers for human
80:25–44 3D object recognition? Vis. Res. 38:2507–19
Koenderink JJ, van Doorn AJ, Kappers AM, Liu Z, Kersten D. 2003. 3D symmetric shapes
Todd JT. 2001. Ambiguity and the ‘mental are discriminated more efficiently than asym-
eye’ in pictorial relief. Perception 30:431–48 metric ones. J. Opt. Soc. Am. A 20:1331–
Konishi SM, Yuille AL, Coughlan JM, Zhu SC. 40
2003. Statistical edge detection: learning and Liu Z, Kersten D, Knill DC. 1999. Stimulus
evaluating edge cues. Pattern Anal. Mach. information or internal representation?—A
Intell. 1:37–48 case study in human object recognition. Vis.
Lamme VA, Roelfsema PR. 2000. The dis- Res. 39:603–12
tinct modes of vision offered by feedforward Liu Z, Knill DC, Kersten D. 1995. Object clas-
and recurrent processing. Trends Neurosci. sification for human and ideal observers. Vis.
23:571–79 Res. 35:549–68
Land EH, McCann JJ. 1971. Lightness and the Lorenceau J, Shiffrar M. 1992. The influence
retinex theory. J. Opt. Soc. Am. 61:1–11 of terminators on motion integration across
Landy MS, Kojima H. 2001. Ideal cue combi- space. Vis. Res. 32:263–73
nation for localizing texture-defined edges. MacKay DM. 1956. The epistemological prob-
J. Opt. Soc. Am. A Opt. Image Sci. Vis. lem for automata. In Automata Studies,
18:2307–20 ed. CE Shannon, J McCarthy, pp. 235–50.
Landy MS, Maloney LT, Johnston EB, Young Princeton: Princeton Univ. Press
M. 1995. Measurement and modeling of Maloney LT. 2002. Statistical decision theory
depth cue combination: in defense of weak and biological vision. In Perception and the
fusion. Vis. Res. 35:389–412 Physical World, ed. D Heyer, R Mausfeld,
Langer MS, Bülthoff HH. 2001. A prior pp. 145–89. Chichester, UK: Wiley
for global convexity in local shape-from- Mamassian P, Goutcher R. 2001. Prior knowl-
shading. Perception 30:403–10 edge on the illumination position. Cognition
Lee TS, Mumford D. 2003. Hierarchical 81:B1–9
Bayesian inference in the visual cortex. J. Mamassian P, Knill DC, Kersten D. 1998. The
Opt. Soc. Am. A 20:1434–48 perception of cast shadows. Trends Cogn.
Lee TS, Yang CF, Romero RD, Mumford D. Sci. 2:288–95
2002. Neural activity in early visual cortex Mamassian P, Landy MS. 1998. Observer bi-
reflects behavioral experience and higher- ases in the 3D interpretation of line drawings.
order perceptual saliency. Nat. Neurosci. Vis. Res. 38:2817–32
5:589–97 Mamassian P, Landy MS. 2001. Interaction of
Legge GE, Hooven TA, Klitz TS, Mansfield JS, visual prior constraints. Vis. Res. 41:2653–
Tjan BS. 2002. Mr. Chips 2002: new insights 68
from an ideal-observer model of reading. Vis. Mamassian P, Landy MS, Maloney LT. 2002.
Res. 42:2219–34 Bayesian modelling of visual perception. See
Lennie P. 2003. The cost of cortical computa- Rao et al. 2002, pp. 13–36
tion. Curr. Biol. 13:493–97 Marschner SR, Westin SH, Lafortune EPF, Tor-
Leopold DA, O’Toole AJ, Vetter T, Blanz V. rance KE. 2000. Image-based measurement
2001. Prototype-referenced shape encoding of the bidirectional reflectance distribution
revealed by high-level aftereffects. Nat. Neu- function. Appl. Opt. 39:2592–600
rosci. 4:89–94 McDermott J, Weiss Y, Adelson EH. 2001.
Lerner Y, Hendler T, Ben-Bashat D, Harel M, Beyond junctions: nonlocal form constraints
17 Dec 2003 9:0 AR AR207-PS55-10.tex AR207-PS55-10.sgm LaTeX2e(2002/01/18) P1: GCE
chitecture of the neocortex. II. The role of learns to recognize three-dimensional ob-
cortico-cortical loops. Biol. Cybern. 66:241– jects. Nature 343:263–66
51 Poggio T, Girosi F. 1990. Regularization algo-
Murray SO, Kersten D, Olshausen BA, Schrater rithms for learning that are equivalent to mul-
P, Woods DL. 2002. Shape perception re- tilayer networks. Science 247:978–82
duces activity in human primary visual cor- Portilla J, Simoncelli EP. 2000. A paramet-
tex. Proc. Natl. Acad. Sci. USA 99:15164– ric texture model based on joint statistics of
69 complex wavelet coefficients. Int. J. Comput.
Nakayama K, Shimojo S. 1992. Experienc- Vis. 40:9–71
ing and perceiving visual surfaces. Science Pouget A, Dayan P, Zemel R. 2000. Informa-
257:1357–63 tion processing with population codes. Nat.
Oliva A, Schyns PG. 2000. Diagnostic colors Rev. Neurosci. 1:125–32
mediate scene recognition. Cogn. Psychol. Ramachandran VS. 1985. The neurobiology of
41:176–210 perception. Perception 14:97–103
Oliva A, Torralba A. 2001. Modeling the shape Rao RP, Ballard DH. 1999. Predictive coding
of the scene: a holistic representation of the in the visual cortex: a functional interpre-
spatial envelope. Int. J. Comput. Vis. 42:145– tation of some extra-classical receptive-field
75 effects. Nat. Neurosci. 2:79–87
Olshausen BA, Field DJ. 2000. Vision and the Rao RPN, Olshausen BA, Lewicki MS, eds.
coding of natural images. Am. Sci. 88:238– 2002. Probabilistic Models of the Brain: Per-
45 ception and Neural Function. Cambridge,
Oram MW, Foldiak P, Perrett DI, Sengpiel MA: MIT Press
F. 1998. The ‘ideal Homunculus’: decoding Read JCA. 2002. A Bayesian model of stere-
neural population signals. Trends Neurosci. opsis depth and motion direction discrimina-
21:259–65 tion. Biol. Cybern. 86:117–36
Parish DH, Sperling G. 1991. Object spatial fre- Riesenhuber M, Poggio T. 2002. Neural mecha-
quencies, retinal spatial frequencies, noise, nisms of object recognition. Curr. Opin. Neu-
and the efficiency of letter discrimination. robiol. 12:162–68
Vis. Res. 31:1399–415 Rock I. 1983. The Logic of Perception. Cam-
Parraga CA, Troscianko T, Tolhurst DJ. 2000. bridge, MA: MIT Press
The human visual system is optimised for Sanger TD. 1996. Probability density estima-
processing the spatial information in natural tion for the interpretation of neural pop-
visual images. Curr. Biol. 10:35–38 ulation codes. J. Neurophysiol. 76:2790–
Pearl J. 1988. Probabilistic Reasoning in Intel- 93
ligent Systems: Networks of Plausible Infer- Saunders JA, Knill DC. 2001. Perception of 3D
ence. San Mateo, CA: Morgan Kaufmann surface orientation from skew symmetry. Vis.
Pelli DG, Farell B, Moore DC. 2003. The re- Res. 41:3163–83
markable inefficiency of word recognition. Schölkopf B, Smola AJ. 2002. Learning with
Nature 243:752–56 Kernels: Support Vector Machines, Regu-
Pizlo Z. 1994. A theory of shape constancy larization, Optimization, and Beyond. Cam-
based on perspective invariants. Vis. Res. bridge, MA: MIT Press
34:1637–58 Schrater PR, Kersten D. 2000. How optimal
Pizlo Z. 2001. Perception viewed as an inverse depth cue integration depends on the task.
problem. Vis. Res. 41:3145–61 Int. J. Comput. Vis. 40:73–91
Platt ML, Glimcher PW. 1999. Neural corre- Schrater PR, Kersten D. 2002. Vision,
17 Dec 2003 9:0 AR AR207-PS55-10.tex AR207-PS55-10.sgm LaTeX2e(2002/01/18) P1: GCE
psychophysics, and Bayes. See Rao et al. Ullman S. 1996. High-Level Vision: Object
2002b, pp. 39–64 Recognition and Visual Cognition. Cam-
Schrater PR, Knill DC, Simoncelli EP. 2000. bridge, MA: MIT Press
Downloaded from www.annualreviews.org. University of Hong Kong (ar-184631) IP: 147.8.31.43 On: Wed, 12 Feb 2025 09:57:08
Mechanisms of visual motion detection. Nat. Ullman S, Basri R. 1991. Recognition by linear
Neurosci. 1:64–68 combination of models. IEEE Trans. Pattern
Simoncelli EP, Olshausen BA. 2001. Natural Anal. Mach. Intell. 13:992–1006
image statistics and neural representation. VanRullen R, Thorpe SJ. 2001. Is it a bird? Is
Annu. Rev. Neurosci. 24:1193–216 it a plane? Ultra-rapid visual categorisation
Sinha P, Adelson E. 1993. Recovering reflec- of natural and artifactual objects. Perception
tance and illumination in a world of painted 30:655–68
polyhedra. Presented at Proc. Int. Conf. Vapnik VN. 1998. Statistical Learning Theory.
Comput. Vis., 4th, Berlin New York: Wiley
Sun J, Perona P. 1998. Where is the sun? Nat. Vetter T, Troje NF. 1997. Separation of texture
Neurosci. 1:183–84 and shape in images of faces for image coding
Tarr MJ, Bülthoff HH. 1995. Is human ob- and synthesis. J. Opt. Soc. Am. A 14:2152–
ject recognition better described by geon 61
structural descriptions or by multiple views? Viola P, Jones MJ. 2001. Robust real-time ob-
Comment on Biederman and Gerhardstein ject detection. Proc. IEEE Workshop Stat.
(1993). J. Exp. Psychol. Hum. Percept. Per- Comput. Theor. Vis., Vancouver, Can.
form. 21:1494–505 Weber M, Welling M, Perona P. 2000. Unsuper-
Tenenbaum JB. 2000. Bayesian modeling of hu- vised Learning of Models for Recognition.
man concept learning. Advances in Neural Presented at Proc. Eur. Conf. Comp. Vis., 6th,
Information Processing Systems, ed. Solla S, Dublin, Ireland
Leen T, Muller KR, 12:59–65. Cambridge, Weiss Y, Simoncelli EP, Adelson EH. 2002.
MA: MIT Press Motion illusions as optimal percepts. Nat.
Tenenbaum JB, Griffiths TL. 2001. Generaliza- Neurosci. 5:598–604
tion, similarity, and Bayesian inference. Be- Yonas A, ed. 2003. Development of space per-
hav. Brain Sci. 24:629–40; discussion 652– ception. In Encyclopedia of Cognitive Sci-
791 ence, ed. R Anand, pp. 96–100. London, UK:
Tenenbaum JB, Xu F. 2000. Word learning Macmillan
as Bayesian inference. Proc. Ann. Conf. Yu AJ, Dayan P. 2002. Acetylcholine in cortical
Cogn. Sci. Soc., 22nd, eds. Gleitman LR, inference. Neural Netw. 15:719–30
Joshi AK. Mahwah, NJ: Lawerence Erlbaum Yuille AL, Bülthoff HH. 1996. Bayesian deci-
Assoc. sion theory and psychophysics. See Knill &
Tjan BS, Braje WL, Legge GE, Kersten D. Richards 1996, pp. 123–61
1995. Human efficiency for recognizing 3- Yuille AL, Coughlan JM, Konishi S. 2001. The
D objects in luminance noise. Vis. Res. KGBR viewpoint-lighting ambiguity and its
35:3053–69 resolution by generic constraints. Presented
Troje NF, Kersten D. 1999. Viewpoint depen- at Proc. Int. Conf. Comput. Vis., Vancouver,
dent recognition of familiar faces. Perception Canada
28:483–87 Yuille A, Coughlan JM, Konishi S. 2003. The
Tu Z, Chen A, Yuille AL, Zhu SC. 2003. KGBR viewpoint-lighting ambiguity. J. Opt.
Image parsing. Proc. Int. Conf. Comput. Vis., Soc. Am. A Opt. Image Sci. Vis. 20:24–
Cannes, France 31
Tu Z, Zhu S-C. 2002. Image segmentation Yuille A, Grzywacz N. 1988. A computational
by data-driven Markov chain Monte Carlo. theory for the perception of coherent visual
IEEE Trans. Pattern Anal. Mach. Intell. motion. Nature 333:71–74
24:657–73 Zhu SC. 1999. Embedding gestalt laws in
17 Dec 2003 9:0 AR AR207-PS55-10.tex AR207-PS55-10.sgm LaTeX2e(2002/01/18) P1: GCE
Markov random fields. IEEE Trans. Pattern max entropy principle and its applications to
Anal. Mach. Intell. 21:1170–87 texture modeling. Neural Comput. 9:1627–
Zhu SC, Mumford D. 1997. Prior learning and 60
Downloaded from www.annualreviews.org. University of Hong Kong (ar-184631) IP: 147.8.31.43 On: Wed, 12 Feb 2025 09:57:08
Gibbs reaction-diffusion. IEEE Trans. PAMI Zipser K, Lamme VA, Schiller PH. 1996. Con-
19:1236–50 textual modulation in primary visual cortex.
Zhu SC, Wu Y, Mumford D. 1997. Mini- J. Neurosci. 16:7376–89
Figure 1 Visual complexity and ambiguity. (A) The same object, a kayak, can produce different images. The plots below the images show surface plots of the intensity as a function of position, illustrating the complex variations typical of natural image data that result from a change in view and environment. (B) Different shapes (three-quarter views of two different facial surfaces in the top panel) can produce the same image (frontal view of the two faces in the bottom panel) with an appropriate change of illumination direction. (C) The same material can produce different images. A shiny silver pot reflects completely different patterns depending on its illumination environment. (D) Different materials can produce the same images. The image of a silver pot could be the result of paint (right-most panel). The silver pot renderings were produced by Bruce Hartung using illumination maps made available at: http://www.debevec.org/Probes/ (Debevec 1998).
Figure 4 Four simple categories of influence graphs. (A) Basic Bayes. The four curved line segments are consistent with a spherical and a saddle-shaped surface patch (and an infinite family of other 3-D interpretations). Human observers prefer the convex spherical interpretation (Mamassian & Landy 1998). (B) Discounting. A given object or object class can give rise to an infinite variety of images because of variations introduced by confounding variables, such as pose, viewpoint, illumination, and background clutter. Robust object recognition requires some degree of object invariance or constancy, which in turn requires vision to discount confounding variables, such as the pose variation illustrated here. There is a one-to-one relationship between graph structure and the factorization of the joint probability: if Ii and Sj indicate the ith image and jth object variables, respectively, then p(..., Sj, ..., Ii, ...) is the joint probability. For this graph, p(S1, S2, I) = p(I|S1, S2)p(S1)p(S2). (C) Cue integration. A single cause in a scene can give rise to more than one effect in the image. Illumination position affects both the shading on the ball and the relationship between the ball and shadow positions in the image. Both kinds of image measurement can, in principle, be combined to yield a more reliable estimate of illumination direction than either alone; it is an empirical question whether human vision combines such cues and, if so, how optimally. For this graph, p(S1, I1, I2) = p(I1|S1)p(I2|S1)p(S1). (D) Explaining away. An image measurement (an ambiguous horizontal shading gradient) can be caused by a change in reflectance or by a change in 3-D shape. A change in the probability of one of them being the true cause of the shading (e.g., from an auxiliary contour measurement) can change the probability of the other putative cause (from different to same apparent reflectance). For this graph, p(S1, S2, I1, I2) = p(I2|S2)p(I1|S1, S2)p(S1)p(S2).
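To make the last factorization concrete, the following minimal sketch (ours, not part of the original figure) enumerates the explaining-away graph in (D) with binary variables. All probability values, and names such as p_I1 and posterior_S1, are invented for illustration; the only element taken from the caption is the factorization itself.

# Minimal sketch of the "explaining away" graph in Figure 4D.
# All probability values are invented for illustration only.
# S1 = reflectance change, S2 = 3-D shape change (both binary);
# I1 = ambiguous shading gradient, I2 = auxiliary contour measurement.

p_S1 = {0: 0.5, 1: 0.5}  # prior p(S1)
p_S2 = {0: 0.5, 1: 0.5}  # prior p(S2)

# p(I1 observed | S1, S2): either cause (or both) can produce the gradient.
p_I1 = {(0, 0): 0.05, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.9}

# p(I2 observed | S2): the contour measurement depends only on shape.
p_I2 = {0: 0.1, 1: 0.8}

def posterior_S1(observe_I2):
    """p(S1 | I1) or p(S1 | I1, I2), by summing the factorization
    p(S1, S2, I1, I2) = p(I2|S2) p(I1|S1, S2) p(S1) p(S2) over S2."""
    scores = {}
    for s1 in (0, 1):
        total = 0.0
        for s2 in (0, 1):
            w = p_I1[(s1, s2)] * p_S1[s1] * p_S2[s2]
            if observe_I2:
                w *= p_I2[s2]
            total += w
        scores[s1] = total
    z = sum(scores.values())  # normalize over S1
    return {s1: v / z for s1, v in scores.items()}

print("p(S1 | I1)     =", posterior_S1(observe_I2=False))  # ~0.65 for S1 = 1
print("p(S1 | I1, I2) =", posterior_S1(observe_I2=True))   # ~0.53 for S1 = 1

With these numbers, p(S1 = 1 | I1) is roughly 0.65 but falls to roughly 0.53 once I2 is also observed: the contour evidence raises the probability of the shape cause S2, which in turn explains away the reflectance cause S1, as the caption describes.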