Adaptive Thinking
Gerd Gigerenzer
OXFORD
UNIVERSITY PRESS
For Raine and Thalia
PREFACE
Some years ago, I had lunch with a motley group of colleagues at Stanford,
mostly psychologists and economists, who were interested in decision making
in an uncertain world. We chewed our way through our sandwiches and
through the latest embellishments of the prisoner's dilemma, trading stories of
this or that paradox or stubborn irrationality. Finally, one economist concluded
the discussion with the following dictum: "Look," he said with conviction,
"either reasoning is rational or it's psychological."
This supposed opposition between the rational and the psychological has
haunted me ever since. For the economists and psychologists seated at the
picnic table with me that afternoon, it meant a division of labor. The heavenly
laws of logic and probability rule the realm of sound reasoning; psychology is
assumed to be irrelevant. Only if mistakes are made are psychologists called
in to explain how wrong-wired human minds deviate from these laws. Cher-
nobyl, U.S. foreign policy, and human disasters of many kinds have been as-
sociated with failures in logical thinking. Adopting this opposition, many text-
books present first the laws of logic and probability as the standard by which
to measure human thinking, then data about how people actually think. The
discrepancy between the two makes people appear to be irrational.
Adaptive Thinking offers a different story. I view the mind in relation to its
environment rather than in opposition to the laws of logic or probability. In a
complex and uncertain world, psychology is indispensable for sound reason-
ing; it is rationality's fuel rather than its brake. This book is about rethinking
rationality as adaptive thinking: to understand how minds cope with specific
environments, ecological and social. The chapters in this book elaborate the
idea that human thinking—from scientific creativity to simply understanding
what a positive HIV test means—"happens" partly outside of the mind. For
instance, new laboratory instruments can inspire scientists to create new meta-
phors and theories, and new ways of representing uncertainties can either
cloud or facilitate physicians' understanding of risks. In this sense, insight can
come from outside the mind.
The chapters provide both research programs and case studies. For instance,
the program of ecological rationality studies the mind in relation to its envi-
ronment, past and present. Bounded rationality stresses that sound reasoning
can be achieved by simple heuristics that do not follow the prescriptions of
logic and probability. Social rationality is a form of ecological rationality in
which the environment consists of conspecifics and that highlights the impor-
tance of domain-specific behavior and cognition in social environments.
Adaptive Thinking is a collection of what I consider the most important of
my papers on rationality, reasoning and rituals in the 1990s. I have rewritten,
updated, and shortened them to bring out the coherent story they tell as a
whole. The papers were originally addressed to different scientific communi-
ties. This book affords readers the opportunity, for the first time, to see how
the various theoretical endeavors and practical applications fit together.
Berlin G. G.
July 1999
ACKNOWLEDGMENTS
People
Institutions
The research reported in this book was supported by fellowships and grants
from the Center for Advanced Study in the Behavioral Sciences, Stanford; the
Center for Interdisciplinary Research, Bielefeld, Germany; Deutsche For-
schungsgemeinschaft, Germany; Fonds zur Förderung der Wissenschaften,
Austria; the National Science Foundation; the Spencer Foundation; and the
University of Chicago School Mathematics Project Fund for Research in Math-
ematics Education. I am particularly grateful to the Max Planck Society, which
has provided outstanding research support since 1995.
Publishers
II Ecological Rationality 57
4. Ecological intelligence 59
5. AIDS counseling for low-risk clients 77
6. How to improve Bayesian reasoning without instruction 92
References 297
Name Index 329
Subject Index 337
I
WHERE DO NEW IDEAS
COME FROM?
I wrote "From Tools to Theories" in one of the cabinlike offices at the Cen-
ter for Advanced Study in Palo Alto in 1990. That was in the good old days
when the offices had no telephones, e-mail, or other communication facilita-
tors to interrupt one's thoughts. In the meantime, the Center, like you and I,
has surrendered to technology. Chapter 1 is about the impact of new tech-
nologies on creative thinking—an impact of a productive rather than a dis-
ruptive kind. New tools can suggest new scientific ideas and metaphors
about nature, society, and the mind. When this happens, we can trace dis-
coveries back to the changing technological environment in which they
evolved rather than attributing them to some mystical process inside a scien-
tist's head. In this sense, new insights can come from outside the mind.
Two influential tools fueled the cognitive revolution: new statistical tech-
niques and the computer. Both started as tools for data processing and ended
up as theories of mind. The power of tools to inspire new theories derives
from changes both in the technological environment (new tools) and in the
social environment in which a scientist works (the community of tool users).
The social environment is influential in several ways. First, it affects the
pragmatic use of a tool (of which there are many), which then leaves its
mark on the new theories of mind. Second, entrenchment of the tool in the
research community is an important precondition for its final acceptance as
a model of mind. Finally, new social organizations can inspire the creation
of tools in the first place, as evidenced by the invention of the machine com-
puter. Babbage's computer was modeled after a new social organization of
work, namely, the division of labor in large-scale manufacturing. The social
origin of the computer illustrates how a metaphor can cut both ways: First
computers were modeled after minds, and later minds were modeled after
computers.
Computers and statistics have both been used to fulfill the timeless long-
ing to replace judgment by the application of content-blind, mechanical
rules. Such mechanization has become an ideal in many professions, includ-
tween the two. Furthermore, I attempt to demonstrate that this discovery heu-
ristic may be of interest not only for an a posteriori understanding of theory
development but also for understanding limitations of present-day theories and
research programs and for the further development of alternatives and new
possibilities. The discovery heuristic that I call the tools-to-theories heuristic
(see Gigerenzer & Murray, 1987) postulates a close connection between the
light and the dark parts of Leibniz's ocean: Scientists' tools for justification
provide the metaphors and concepts for their theories.
The power of tools to shape, or even to become, theoretical concepts is an
issue largely ignored in both the history and philosophy of science. Inductivist
accounts of discovery, from Bacon to Reichenbach and the Vienna Circle, focus
on the role of data but do not consider how the data are generated or processed.
Nor do the numerous anecdotes about discoveries—Newton watching an apple
fall in his mother's orchard while pondering the mystery of gravitation; Galton
taking shelter from a rainstorm during a country outing when discovering cor-
relation and regression toward mediocrity; and the stories about Fechner, Ke-
kulé, Poincaré, and others that link discovery to the three B's: beds, bicycles,
and bathrooms. What unites these anecdotes is the focus on the vivid but
prosaic circumstances; they report the setting in which a discovery occurs,
rather than analyzing the process of discovery.
The question Is there a logic of discovery? and Popper's (1935/1959) con-
jecture that there is none have misled many into assuming that the issue is
whether there exists a logic of discovery or only idiosyncratic personal and
accidental reasons that explain the "flash of insight" of a particular scientist
(Nickles, 1980). I do not think that formal logic and individual personality are
the only alternatives, nor do I believe that either of these is a central issue for
understanding discovery.
The process of discovery can be shown, according to my argument, to pos-
sess more structure than thunderbolt guesses but less definite structure than a
monolithic logic of discovery, of the sort Hanson (1958) searched for, or a
general inductive hypothesis-generation logic (e.g., Reichenbach, 1938). The
present approach lies between these two extremes; it looks for structure be-
yond the insight of a genius but does not claim that the tools-to-theories heu-
ristic is (or should be) the only account of scientific discovery. The tools-to-
theories heuristic applies neither to all theories in science nor to all cognitive
theories; it applies to a specific group of cognitive theories developed during
the last three or four decades, after the so-called cognitive revolution.
Nevertheless, similar heuristics have promoted discovery in physics, phys-
iology, and other areas. For instance, it has been argued that once the me-
chanical clock became the indispensable tool for astronomical research, the
universe itself came to be understood as a kind of mechanical clock, and God
as a divine watchmaker. Lenoir (1986) showed how Faraday's instruments for
recording electric currents shaped the understanding of electrophysiological
processes by promoting concepts such as "muscle current" and "nerve cur-
rent."
Thus, this discovery heuristic boasts some generality both within cognitive
psychology and within science, but this generality is not unrestricted. Because
there has been little research in how tools of justification influence theory
development, the tools-to-theories heuristic may be more broadly applicable
than I am able to show in this chapter. If my view of heuristics of discovery
as a heterogeneous bundle of search strategies is correct, however, this implies
that generalizability is, in principle, bounded.
What follows has been inspired by Herbert Simon's notion of heuristics of
discovery but goes beyond his attempt to model discovery with programs such
as BACON that attempt to induce scientific laws from data (discussed later).
My focus is on the role of the tools that process and produce data, not the data
themselves, in the discovery and acceptance of theories.
What has been called the "cognitive revolution" is more than the overthrow
of behaviorism by mentalist concepts. These concepts have been continuously
part of scientific psychology since its emergence in the late 19th century, even
coexisting with American behaviorism during its heyday (Lovie, 1983). The
cognitive revolution did more than revive the mental; it has changed what the
mental means, often dramatically. One source of this change is the tools-to-
theories heuristic, with its new analogy of the mind as an intuitive statistician.
To show the discontinuity within cognitive theories, I briefly discuss two areas
in which an entire statistical technique, not only a few statistical concepts,
became a model of mental processes: (a) stimulus detection and discrimination
and (b) causal attribution.
What intensity must a 440-Hz tone have to be perceived? How much heavier
than a standard stimulus of 100 g must a comparison stimulus be in order for
a perceiver to notice a difference? How can the elementary cognitive processes
involved in those tasks, known today as stimulus detection and stimulus dis-
crimination, be explained? Since Herbart (1834), such processes have been
explained by using a threshold metaphor: Detection occurs only if the effect
an object has on the nervous system exceeds an absolute threshold, and dis-
crimination between two objects occurs if the excitation from one exceeds that
from another by an amount greater than a differential threshold. E. H. Weber
and G. T. Fechner's laws refer to the concept of fixed thresholds; Titchener
(1896) saw in differential thresholds the long-sought-after elements of mind
(he counted approximately 44,000); and classic textbooks, such as Brown and
Thomson's (1921) and Guilford's (1954), document methods and research.
Around 1955, the psychophysics of absolute and differential thresholds was
revolutionized by the new analogy between the mind and the statistician. W. P.
Tanner and others proposed a "theory of signal detectability" (TSD), which
assumes that the Neyman-Pearson technique of hypothesis testing describes
the processes involved in detection and discrimination. Recall that in Neyman-
Pearson statistics, two sampling distributions (hypotheses H0 and H1) and a
decision criterion (which is a likelihood ratio) are defined, and then the data
observed are transformed into a likelihood ratio and compared with the de-
cision criterion. Depending on which side of the criterion the data fall, the
decision "reject H0 and accept H," or "accept H0 and reject H," is made. In
straight analogy, TSD assumes that the mind calculates two sampling distri-
butions for noise and signal plus noise (in the detection situation) and sets a
decision criterion after weighing the cost of the two possible decision errors
(Type I and Type II errors in Neyman-Pearson theory, now called false alarms
and misses). The sensory input is transduced into a form that allows the brain
to calculate its likelihood ratio, and depending on whether this ratio is smaller
or larger than the criterion, the subject says "no, there is no signal" or "yes,
there is a signal." Tanner (1965) explicitly referred to his new model of the
mind as a "Neyman-Pearson" detector, and, in unpublished work, his flow-
charts included a drawing of a homunculus statistician performing the uncon-
scious statistics in the brain (Gigerenzer & Murray, 1987, pp. 49-53).
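To make the analogy concrete, the following is a minimal sketch of such a Neyman-Pearson detector; the separation of the distributions, the prior, and the error costs are illustrative values, not parameters from Tanner's work.

```python
# A minimal illustrative sketch of a Neyman-Pearson detector of the kind TSD
# attributes to the mind; all numerical values below are assumptions.
from scipy.stats import norm

d_prime = 1.5            # separation of the signal-plus-noise distribution
p_signal = 0.5           # assumed prior probability of a signal trial
cost_false_alarm = 1.0   # cost of saying "yes" on a noise trial
cost_miss = 1.0          # cost of saying "no" on a signal trial

# The decision criterion (a likelihood ratio) reflects the weighing of the
# two possible errors, false alarms and misses.
beta = ((1 - p_signal) * cost_false_alarm) / (p_signal * cost_miss)

def respond(x):
    """Compare the likelihood ratio of the sensory input x with the criterion."""
    likelihood_ratio = norm.pdf(x, loc=d_prime) / norm.pdf(x, loc=0.0)
    return "yes, there is a signal" if likelihood_ratio > beta else "no, there is no signal"

print(respond(0.2), "|", respond(1.4))
```

Raising either cost or lowering the prior shifts the criterion, which is exactly the kind of manipulation the new research questions addressed.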
The new analogy between mind and statistician replaced the century-old
concept of a fixed threshold by the twin notions of observer's attitudes and
observer's sensitivity. Just as the Neyman-Pearson technique distinguishes be-
tween a subjective part (e.g., selection of a criterion dependent on cost-benefit
considerations) and a mathematical part, detection and discrimination became
understood as involving both subjective processes, such as attitudes and cost-
benefit considerations, and sensory processes. Swets, Tanner, and Birdsall
(1964, p. 52) considered this link between attitudes and sensory processes to
be the main thrust of their theory. The analogy between technique and mind
made new research questions thinkable, such as How can the mind's decision
criterion be manipulated? A new kind of data even emerged: Two types of
error were generated in the experiments, false alarms and misses, just as the
statistical theory distinguishes two types of error.
As far as I can tell, the idea of generating these two kinds of data was not
common before the institutionalization of inferential statistics. The discovery
of TSD was not motivated by new data; rather, the new theory motivated a
new kind of data. In fact, in their seminal article, Tanner and Swets (1954,
p. 401) explicitly admitted that their theory "appears to be inconsistent with
the large quantity of existing data on this subject" and proceeded to criticize
the "form of these data."
The Neyman-Pearsonian technique of hypothesis testing was subsequently
transformed into a theory of a broad range of cognitive processes, ranging from
recognition in memory (e.g., Murdock, 1982; Wickelgren & Norman, 1966) to
eyewitness testimony (e.g., Birnbaum, 1983) to discrimination between ran-
dom and nonrandom patterns (e.g., Lopes, 1982).
My second example concerns theories of causal reasoning. In Europe, Albert
Michotte (1946/1963), Jean Piaget (1930), the gestalt psychologists, and others
had investigated how certain temporospatial relationships between two or
more visual objects, such as moving dots, produced phenomenal causality. For
instance, the participants were made to perceive that one dot launches, pushes,
or chases another. After the institutionalization of inferential statistics, Harold
H. Kelley (1967) proposed in his "attribution theory" that the long-sought laws
of causal reasoning are in fact the tools of the behavioral scientist: R. A.
Fisher's ANOVA. Just as the experimenter has come to infer a causal relation-
ship between two variables from calculating an ANOVA and performing an F
test, the person-in-the-street infers the cause of an effect by unconsciously
doing the same calculations. By the time Kelley discovered the new meta-
phor for causal inference, about 70% of all experimental articles already used
ANOVA (Edgington, 1974).
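To see the experimenter's half of this analogy, consider a hedged sketch of the tool itself; the response frequencies are invented for illustration, and the library call simply runs a one-way ANOVA and F test.

```python
# An illustration of the experimenter's side of Kelley's analogy: a one-way
# ANOVA (F test) on invented response frequencies observed when a candidate
# cause is present versus absent.
from scipy.stats import f_oneway

effect_with_cause = [7, 9, 8, 10, 9]      # frequency of the effect, cause present
effect_without_cause = [3, 4, 2, 5, 3]    # frequency of the effect, cause absent

f_statistic, p_value = f_oneway(effect_with_cause, effect_without_cause)
print(f_statistic, p_value)
# A "significant" F value licenses the experimenter's causal inference;
# attribution theory claims the person-in-the-street unconsciously performs
# an analogous computation over consensus, distinctiveness, and consistency
# information.
```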
The theory was accepted quickly in social psychology; Kelley and Michela
(1980) reported there were more than 900 references in one decade. The vision
of the Fisherian mind radically changed the understanding of causal reasoning,
the problems posed to participants, and the explanations looked for. I list a
few discontinuities that reveal the "fingerprints" of the tool: (a) ANOVA needs
repetitions or numbers as data in order to estimate variances and covariances.
Consequently, the information presented to the participants in studies of causal
attribution consists of information about the frequency of events (e.g., Mc-
Arthur, 1972), which played no role in either Michotte's or Piaget's work. (b)
Whereas Michotte's work still reflects the broad Aristotelian conception of four
causes (see Gavin, 1972), and Piaget (1930) distinguished 17 kinds of causality
in children's minds, the Fisherian mind concentrates on the one kind of causes
for which ANOVA is used as a tool (similar to Aristotle's "material cause"). (c)
In Michotte's view, causal perception is direct and spontaneous and needs no
inferences" as Fisherian significance testing: "We may account for the stability
of perceptual forms by suggesting that there is something akin to statistical
significance which must be exceeded by the rival interpretation and the rival
hypothesis before they are allowed to supersede the present perceptual hy-
pothesis" (p. 528). In his theory of how perception works, Gregory also ex-
plained other perceptual phenomena, using Bayesian and Neyman-Pearsonian
statistics as analogies, thus reflecting the actual heterogeneous practice in the
social sciences. Here, a new perspective, but no quantitative model, is gener-
ated. On the other hand, there are cognitive theories that propose quantitative
models of statistical inference that profoundly transform qualitative concepts
and research practice. Examples are the various TSDs of cognition mentioned
earlier and the theory of adaptive memory as statistical optimization by An-
derson and Milson (1989).
To summarize: The tools-to-theories heuristic can account for the discovery
and acceptance of a group of cognitive theories in apparently unrelated sub-
fields of psychology, all of them sharing the view that cognitive processes can
be modeled by statistical hypothesis testing. Among these are several highly
innovative and influential theories that have radically changed our under-
standing of what cognitive means.
There is an important test case for the present hypotheses (a) that familiarity
with the statistical tool is crucial to the discovery of corresponding theories of
mind and (b) that the institutionalization of the tool within a scientific com-
munity is crucial for the broad acceptance of those theories. That test case is
the era before the institutionalization of inferential statistics. Theories that con-
ceive of the mind as an intuitive statistician should have a very small likeli-
hood of being discovered and even less likelihood of being accepted. The two
strongest tests are cases in which (a) someone proposed a similar conceptual
analogy and (b) someone proposed a similar probabilistic (formal) model. The
chances of theories of the first kind being accepted should be small, and the
chances of a probabilistic model being interpreted as "intuitive statistics"
should be similarly small. I know of only one case each, which I analyze after
defining first what I mean by the phrase "institutionalization of inferential
statistics."
Statistical inference has been known for a long time but not used as theories
of mind. In 1710, John Arbuthnot proved the existence of God using a signif-
icance test; as mentioned earlier, astronomers used significance tests in the
19th century; G. T. Fechner's (1897) statistical text Kollektivmasslehre included
tests of hypotheses; W. S. Gosset (using the pseudonym Student) published the
t test in 1908; and Fisher's significance testing techniques, such as ANOVA, as
well as Neyman-Pearsonian hypothesis-testing methods, have been available
since the 1920s (see Gigerenzer et al., 1989). Bayes's rule has been known
since 1763. Nonetheless, there was little interest in these techniques in exper-
imental psychology before 1940 (Rucci & Tweney, 1980).
The statisticians' conquest of new territory in psychology started in the
1940s. By 1942, Maurice Kendall could comment on the statisticians' expan-
sion: "They have already overrun every branch of science with a rapidity of
conquest rivalled only by Attila, Mohammed, and the Colorado beetle" (p. 69).
By the early 1950s, half of the psychology departments in leading American
universities offered courses on Fisherian methods and had made inferential sta-
tistics a graduate program requirement. By 1955, more than 80% of the experi-
mental articles in leading journals used inferential statistics to justify conclu-
sions from the data (Sterling, 1959). Editors of major journals made significance
testing a requirement for articles submitted and used the level of significance as
a yardstick for evaluating the quality of an article (e.g., Melton, 1962).
I therefore use 1955 as a rough date for the institutionalization of the tool
in curricula, textbooks, and editorials. What became institutionalized as the
logic of statistical inference was a mixture of ideas from two opposing camps,
those of R. A. Fisher on the one hand and Jerzy Neyman and Egon S. Pearson
(the son of Karl Pearson) on the other (see Chapter 13).
The analogy between the mind and the statistician was first proposed before
the institutionalization of inferential statistics, in the early 1940s, by Egon
Brunswik at Berkeley (e.g., Brunswik, 1943). As Leary (1987) has shown, Brun-
swik's probabilistic functionalism was based on a very unusual blending of
scientific traditions, including the probabilistic world view of Hans Reichen-
bach and members of the Vienna Circle and Karl Pearson's correlational sta-
tistics.
The important point here is that in the late 1930s, Brunswik changed his
techniques for measuring perceptual constancies, from calculating (nonstatis-
tical) "Brunswik ratios" to calculating Pearson correlations, such as functional
and ecological validities. In the 1940s, he also began to think of the organism
as "an intuitive statistician," but it took him several years to spell out the
analogy in a clear and consistent way.
The analogy is this: The perceptual system infers its environment from un-
certain cues by (unconsciously) calculating correlation and regression statis-
tics, just as the Brunswikian researcher does when (consciously) calculating
the degree of adaptation of a perceptual system to a given environment. Brun-
swik's intuitive statistician was a statistician of the Karl Pearson school, like
the Brunswikian researcher. Brunswik's intuitive statistician was not well
adapted to the psychological science of the time, however, and the analogy
was poorly understood and generally rejected.
Brunswik's analogy came too early to be comprehended and accepted by
his colleagues of the experimental community; it came before the institution-
alization of statistics as the indispensable method of scientific inference, and
telligence), all probabilistic terms could be eliminated from the theory. This
does not hold for a probabilistic model that is based on the metaphor. Here,
the probabilistic terms model the ignorance of the mind rather than that of the
experimenter. That is, they model how the homunculus statistician in the brain
comes to terms with a fundamentally uncertain world. Even if the experi-
menter had complete knowledge, the theories would remain probabilistic be-
cause it is the mind that is ignorant and needs statistics.
The key example is represented by L. L. Thurstone, who in 1927 formulated
a model for perceptual judgment that was formally equivalent to the present-
day TSD. But neither Thurstone nor his followers recognized the possibility
of interpreting the formal structure of their model in terms of the intuitive
statistician. Like TSD, Thurstone's model had two overlapping normal distri-
butions, which represented the internal values of two stimuli and which spec-
ified the corresponding likelihood ratios, but it never occurred to Thurstone
to include in his model the conscious activities of a statistician, such as the
weighing of the costs of the two errors and the setting of a decision criterion.
Thus neither Thurstone nor his followers took the—with hindsight—small step
to develop the "law of comparative judgment" into TSD. When Duncan Luce
(1977) reviewed Thurstone's model 50 years later, he found it hard to believe
that nothing in Thurstone's writings showed the least awareness of this small
but crucial step. Thurstone's perceptual model remained a mechanical, albeit
probabilistic, stimulus-response theory without a homunculus statistician in
the brain. The small conceptual step was never taken, and TSD entered psy-
chology by an independent route.
To summarize: There are several kinds of evidence for a close link between
the institutionalization of inferential statistics in the 1950s and the subsequent
broad acceptance of the metaphor of the mind as an intuitive statistician: (a)
the general failure to accept, and even to understand, Brunswik's intuitive
statistician before the institutionalization of the tool and (b) the case of Thur-
stone, who proposed a probabilistic model that was formally equivalent to one
important present-day theory of intuitive statistics but was never interpreted
in this way; the analogy was not yet seen. Brunswik's case illustrates that tools
may act on two levels: First, new tools may suggest new cognitive theories to
a scientist. Second, the degree to which these tools are institutionalized within
the scientific community to which the scientist belongs can prepare (or hinder)
the acceptance of the new theory. This close link between tools for justification
on the one hand and discovery and acceptance on the other reveals the arti-
ficiality of the discovery-justification distinction. Discovery does not come first
and justification afterward. Discovery is inspired by justification.
1988; Gruber, 1981; Tweney, Doherty, & Mynatt, 1981) but also for the eval-
uation and further development of current cognitive theories. The general
point is that institutionalized tools like statistics do not come as pure mathe-
matical (or physical) systems but with a practical context attached. Features
of this context in which a tool has been used may be smuggled Trojan-horse
fashion into the new cognitive theories and research programs. One example
was mentioned earlier: The formal tools of significance testing have been used
in psychology as tools for rejecting hypotheses, with the assumption that the
data are correct, whereas in other fields and at other times the same tools were
used as tools for rejecting data (outliers), with the assumption that the hy-
potheses were correct. The latter use of statistics is practically extinct in ex-
perimental psychology (although the problem of outliers routinely emerges)
and therefore also absent in theories that liken cognitive processes to signifi-
cance testing. In cases like these, analysis of discovery may help to reveal blind
spots associated with the tool and, as a consequence, new possibilities for
cognitive theorizing.
I illustrate this potential in more detail using examples from the "judgment
under uncertainty" program of Daniel Kahneman, Amos Tversky, and others
(see Kahneman & Tversky, 1982). This stimulating research program emerged
from the earlier research on human information processing by Ward Edwards
and his coworkers. In Edwards's work, the dual role of statistics as a tool and
a model of mind is again evident: Edwards, Lindman, and Savage (1963) pro-
posed Bayesian statistics for scientific hypothesis evaluation and considered
the mind as a reasonably good, albeit conservative, Bayesian statistician (e.g.,
Edwards, 1966). The judgment-under-uncertainty program also investigates
reasoning as intuitive statistics but focuses on so-called errors in probabilistic
reasoning. In most of the theories based on the metaphor of the intuitive stat-
istician, statistics or probability theory is used both as normative and as de-
scriptive of a cognitive process (e.g., both as the optimal and the actual mech-
anism for speech perception and human memory; see Massaro, 1987, and
Anderson & Milson, 1989, respectively). This is not the case in the judgment-
under-uncertainty program; here, statistics and probability theory are used
only in the normative function, whereas actual human reasoning has been
described as "biased," "fallacious," or "indefensible" (on the rhetoric, see
Lopes, 1991).
In the following, I first point out three features of the practical use of the
statistical tool (as opposed to the mathematics). Then I show that these features
reemerge in the judgment-under-uncertainty program, resulting in severe lim-
itations on that program. Finally, I suggest how this hidden legacy of the tool
could be eliminated to provide new impulses and possibilities for the research
program.
The first feature is an assumption that can be called "There is only one
statistics." Textbooks on statistics for psychologists (usually written by non-
mathematicians) generally teach statistical inference as if there existed only
one logic of inference. Since the 1950s and 1960s, almost all texts teach a
mishmash of R. A. Fisher's ideas tangled with those of Jerzy Neyman and Egon
soning problem, and their answers are compared with the so-called normative
or correct answer, supplied by statistics and probability theory. Second, the
deviation between the participant's answer and the so-called normative an-
swer, also called a bias of reasoning, is attributed to some heuristic of reason-
ing.
One implicit assumption at the heart of this research program says that
statistical theory provides exactly one answer to the real-world problems pre-
sented to the participants. If this were not true, the deviation between partic-
ipants' judgments and the "normative" answer would be an inappropriate ex-
planandum, because there are as many different deviations as there are
statistical answers. Consider the following problem:
A cab was involved in a hit-and-run accident at night. Two companies,
the Green and the Blue, operate in the city. You are given the following
data:
(i) 85% of the cabs in the city are Green and 15% are Blue. (ii) A witness
identified the cab as a Blue cab. The court tested his ability to identify
cabs under the appropriate visibility conditions. When presented with a
sample of cabs (half of which were Blue and half of which were Green),
the witness made correct identifications in 80% of the cases and erred
in 20% of the cases.
Question: What is the probability that the cab involved in the accident
was Blue rather than Green? (Tversky & Kahneman, 1980, p. 62)
The authors inserted the values specified in this problem into Bayes's rule
and calculated a probability of .41 as the "correct" answer, and, despite criti-
cism, they have never retreated from that claim. They saw in the difference
between this value and the participants' median answer of .80 an instance of
a reasoning error, known as neglect of base rates. But alternative statistical
solutions to the problem exist.
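For the record, the .41 figure follows from inserting the stated values into Bayes's rule, with the base rates serving as prior probabilities; the short calculation below reproduces it (the variable names are mine).

```python
# Reproducing Tversky and Kahneman's .41: Bayes's rule with the base rates
# as priors and the witness's accuracy as the likelihoods.
p_blue, p_green = 0.15, 0.85          # base rates of cabs in the city
p_say_blue_given_blue = 0.80          # witness identifies correctly
p_say_blue_given_green = 0.20         # witness errs

posterior_blue = (p_say_blue_given_blue * p_blue) / (
    p_say_blue_given_blue * p_blue + p_say_blue_given_green * p_green)
print(round(posterior_blue, 2))       # 0.41
```

Replacing the base rates by uniform priors, as the principle of indifference suggests, turns the same calculation into .80, which is one way of seeing why the "correct" answer is not unique.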
Tversky and Kahneman's reasoning is based on one among many possible
Bayesian views—which the statistician I. J. Good (1971), not all too seriously,
once counted up to 46,656. For instance, using the classical principle of in-
difference to determine the Bayesian prior probabilities can be as defensible
as Tversky and Kahneman's use of base rates of "cabs in the city" for the
relevant priors, but it leads to a probability of .80 instead of .41 (Levi, 1983).
Or, if Neyman-Pearson theory is applied to the cab problem, solutions range
between .28 and .82, depending on the psychological theory about the wit-
ness's criterion shift—the shift from witness testimony at the time of the ac-
cident to witness testimony at the time of the court's test (Birnbaum, 1983;
Gigerenzer & Murray, 1987, pp. 167-174).
There may be more arguable answers to the cab problem, depending on
what statistical or philosophical theory of inference one uses and what as-
sumptions one makes. Indeed, the range of possible statistical solutions is
about the range of participants' actual answers. The point is that none of these
statistical solutions is the only correct answer to the problem, and therefore it
makes little sense to use the deviation between a participant's judgment and
one of these statistical answers as the psychological explanandum.
Statistics is an indispensable tool for scientific inference, but, as Neyman
and Pearson (1928, p. 176) pointed out, in "many cases there is probably no
single best method of solution." Rather, several such theories are legitimate,
just as "Euclidean and non-Euclidean geometries are equally legitimate" (Ney-
man, 1937, p. 336). My point is this: The idée fixe that statistics speaks with
one voice has reappeared in research on intuitive statistics. The highly inter-
esting judgment-under-uncertainty program could progress beyond the present
point if (a) participants' judgments rather than deviations between judgments
and a so-called normative solution are considered as the data to be explained
and if (b) various statistical models are proposed as competing hypotheses of
problem-solving strategies rather than one model being proposed as the general
norm for rational reasoning. The willingness of many researchers to accept the
claim that statistics speaks with one voice is the legacy of the institutionalized
tool, not of statistics per se.
Note the resulting double standard: Many researchers on intuitive statistics
argue that their participants should draw inferences from data to hypotheses
by using Bayes's rule, although they themselves do not. Rather, the researchers
use the institutionalized mixture of Fisherian and Neyman-Pearsonian statis-
tics to draw their inferences from data to hypotheses.
Just as there are alternative logics of inference, there are alternative interpre-
tations of probability that have been part of the mathematical theory since its
inception in the mid-17th century (Daston, 1988; Hacking, 1975). Again, both
the institutionalized tool and the recent cognitive research on probabilistic
reasoning exhibit the same blind spot concerning the existence of alternative
interpretations of probability. For instance, Lichtenstein, Fischhoff, and Phil-
lips (1982) have reported and summarized research on a phenomenon called
overconfidence. Briefly, participants were given questions such as "Absinthe
is (a) a precious stone or (b) a liqueur"; they chose what they believed was the
correct answer and then were asked for a confidence rating in their answer,
for example, 90% certain. When people said they were 100% certain about
individual answers, they had in the long run only about 80% correct answers;
when they were 90% certain, they had in the long run only 75% correct an-
swers; and so on. This discrepancy was called overconfidence bias and was
explained by general heuristics in memory search, such as confirmation biases,
or general motivational tendencies, such as a so-called illusion of validity.
My point is that two different interpretations of probability are compared:
degrees of belief in single events (i.e., that this answer is correct) and relative
frequencies of correct answers in the long run. Although 18th-century math-
ematicians, like many of today's cognitive psychologists, would have had no
problem in equating the two, most mathematicians and philosophers since
then have. For instance, according to the frequentist point of view, the term
probability, when it refers to a single event, "has no meaning at all" (Mises,
1928/1957, p. 11) because probability theory is about relative frequencies in
the long run. Thus, for a frequentist, probability theory does not apply to
single-event confidences, and therefore no confidence judgment can violate
probability theory. To call a discrepancy between confidence and relative fre-
quency a bias in probabilistic reasoning would mean comparing apples and
oranges. Moreover, even subjectivists would not generally think of a discrep-
ancy between confidence and relative frequency as a bias (see Kadane & Lich-
tenstein, 1982, for a discussion of conditions). For a subjectivist such as Bruno
de Finetti, probability is about single events, but rationality is identified with
the internal consistency of probability judgments. As de Finetti (1931/1989,
p. 174) emphasized: "However an individual evaluates the probability of a
particular event, no experience can prove him right, or wrong; nor in general,
could any conceivable criterion give any objective sense to the distinction one
would like to draw, here, between right and wrong."
Nonetheless, the literature on overconfidence is largely silent on even the
possibility of this conceptual problem (but see Keren, 1987). The question
about research strategy is whether to use the deviation between degrees of
belief and relative frequencies (again considered as a bias) as the explanandum
or to accept the existence of several meanings of probability and to investigate
the kind of conceptual distinctions that untutored people make. Almost all
research has been done within the former research strategy. And, indeed, if
the issue were a general tendency to overestimate one's knowledge, as the term
overconfidence suggests—for instance, as a result of general strategies of mem-
ory search or motivational tendencies—then asking people for degrees of belief
or for frequencies should not matter.
But it does. In a series of experiments (Gigerenzer, Hoffrage, & Kleinbölting,
1991; see also May, 1987), participants were given several hundred questions
of the absinthe type and were asked for confidence judgments after every ques-
tion was answered (as usual). In addition, after each 50 (or 10, 5, and 2) ques-
tions, they were asked how many of those questions they believed they had
answered correctly; that is, frequency judgments were requested. This design
allowed comparison both between their confidence in their individual answers
and true relative frequencies of correct answers, and between judgments of
relative frequencies and true relative frequencies. Comparing frequency judg-
ments with the true frequency of correct answers showed that overestimation
or overconfidence disappeared in 80% to 90% of the participants, depending
on experimental conditions. Frequency judgments were precise or even
showed underestimation. Ironically, after each frequency judgment, partici-
pants went on to give confidence judgments (degrees of belief) that exhibited
what has been called overconfidence.
As in the preceding example, a so-called bias of reasoning disappears if a
controversial norm is dropped and replaced by several descriptive alternatives,
statistical models, and meanings of probability, respectively. Thus probabilities
for single events and relative frequencies seem to refer to different meanings
of confidence in the minds of the participants. This result is inconsistent with
previous explanations of the alleged bias by deeper cognitive deficiencies (e.g.,
confirmation biases) and has led to the theory of probabilistic mental models,
which describes mechanisms that generate different confidence and frequency
judgments (see Chapter 7). Untutored intuition seems to be capable of making
conceptual distinctions of the sort statisticians and philosophers make (e.g.,
Cohen, 1986; Lopes, 1981; Teigen, 1983). And it suggests that the important
research questions to be investigated are How are different meanings of prob-
ability cued in everyday language? and How does this affect judgment?, rather
than How can the alleged bias of overconfidence be explained by some general
deficits in memory, cognition, or personality?
The same conceptual distinction can help to explain other kinds of judg-
ments under uncertainty. For instance, Tversky and Kahneman (1982a, 1983)
used a personality sketch of a character named Linda that suggested she was
a feminist. Participants were asked which is more probable: (a) that Linda is
a bank teller or (b) that Linda is a bank teller and active in the feminist move-
ment. Most participants chose Alternative b, which Tversky and Kahneman
(1982a) called a "fallacious" belief, to be explained by their hypothesis that
people use a limited number of heuristics—in the present case, representa-
tiveness (the similarity between the description of Linda and the alternatives
a and b). Participants' judgments were called a conjunction fallacy because the
probability of a conjunction of events (bank teller and active in the feminist
movement) cannot be greater than the probability of one of its components.
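Stated in one line, the rule being invoked is P(A and B) = P(A) · P(B | A) ≤ P(A), since P(B | A) cannot exceed 1: the feminist bank tellers form a subset of the bank tellers, so Alternative b can never be the more probable one under the calculus of probability.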
As in the example just given, this normative interpretation neglects two
facts. First, in everyday language, words like probable legitimately have several
meanings, just as "if . . . then" and "or" constructions do. The particular mean-
ing seems to be automatically cued by content and context. Second, statisti-
cians similarly have alternative views of what probability is about. In the con-
text of some subjectivist theories, choosing Alternative b truly violates the
rules of probability; but for a frequentist, judgments of single events such as
in the Linda problem have nothing to do with probability theory: As the stat-
istician G. A. Barnard (1979, p. 171) objected, they should be treated in the
context of psychoanalysis, not probability.
Again, the normative evaluation explicit in the term conjunction fallacy
is far from being uncontroversial, and progress in understanding reasoning
may be expected by focusing on people's judgments as explanandum rather
than on their deviations from a so-called norm. As in the previous example,
if problems of the Linda type are rephrased as involving frequency judgments
(e.g., "How many out of 100 cases that fit the description of Linda are [a]
bank tellers and [b] bank tellers and active in the feminist movement?"), then
the so-called conjunction fallacy decreases from 77% to 27%, as Fiedler (1988)
showed. "Which alternative is more probable?" is not the same as "Which
alternative is more frequent?" in the Linda context. Tversky and Kahneman
(1983) found similar results, but they maintained their normative claims and
to deal with how the mind analyzes the structure of a problem (or environ-
ment) and how it infers the presence or absence of crucial statistical assump-
tions—just as the practicing statistician has to first check the structure of a
problem in order to decide whether a particular statistical model can be ap-
plied. Checking structural assumptions precedes statistical calculations (see
also Cohen, 1982; Einhorn & Hogarth, 1981; Ginossar & Trope, 1987).
My intention here is not to criticize this or that specific experiment, but
rather to draw attention to the hidden legacy that tools bequeath to theories.
The general theme is that some features of the practical context in which a
tool has been used (to be distinguished from its mathematics) have reemerged
and been accepted in a research program that investigates intuitive statistics,
impeding progress. Specifically, the key problem is a simplistic conception of
normativeness that confounds one view about probability with the criterion
for rationality.
Although I have dwelt on the dangerous legacy that tools hand on to the-
ories, I do not mean to imply that a theory that originates in a tool is ipso facto
a bad theory. The history of science, not just the history of psychology, is
replete with examples to the contrary. Good ideas are hard to come by, and
one should be grateful for those few that one has, whatever their lineage. But
knowing that lineage can help to refine and criticize the new ideas. In those
cases in which the tools-to-theories heuristic operates, this means taking a
long, hard look at the tools—and the statistical tools of social scientists are
overdue for such a skeptical inspection.
Discussion
Herbert A. Simon (1973) and his coworkers (e.g., Langley, Simon, Bradshaw,
& Zytkow, 1987) explicitly reconsidered the possibility of a logic of discovery.
For example, a series of programs called BACON has "rediscovered" quanti-
tative empirical laws, such as Kepler's third law of planetary motion. How
does BACON discover a law? Basically, BACON starts from data and analyzes
them by applying a group of heuristics until a simple quantitative law can be
fitted to the data. Kepler's law, for instance, can be rediscovered by using heu-
ristics such as "If the values of two numerical terms increase together, then
consider their ratio" (Langley et al., 1987, p. 66). Such heuristics are imple-
mented as production rules.
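A minimal sketch of how such heuristics can operate follows; it is not Langley et al.'s program, and the planetary distances (in astronomical units), periods (in years), and the tolerance are illustrative choices.

```python
# A minimal sketch, not Langley et al.'s implementation, of how BACON-style
# heuristics can rediscover Kepler's third law from data.

def is_constant(values, tol=0.02):
    """Treat a term as invariant if its relative spread is below tol."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean < tol

D = [0.387, 0.723, 1.000, 1.524]   # Mercury, Venus, Earth, Mars: distance
T = [0.241, 0.615, 1.000, 1.881]   # orbital period

# D and T increase together -> consider their ratio (the heuristic quoted above).
ratio = [d / t for d, t in zip(D, T)]           # D/T: decreases as D increases
# One term rises while the other falls -> consider their product
# (a companion heuristic in Langley et al., 1987).
prod1 = [d * r for d, r in zip(D, ratio)]       # D^2/T: increases as D increases
prod2 = [r * p for r, p in zip(ratio, prod1)]   # D^3/T^2

print(is_constant(ratio), is_constant(prod1), is_constant(prod2))
# -> False False True: the invariant D^3/T^2 is Kepler's third law.
```

Two applications of the ratio and product heuristics suffice in this toy example; BACON's actual search is, of course, more general.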
What is the relation between heuristics used in programs like BACON and
the tools-to-theories heuristics? First, the research on BACON was concerned
mainly with the ways in which laws could be induced from data. BACON's
heuristics work on extant data, whereas the tools-to-theories heuristic works
on extant tools for data generation and processing and describes an aspect of
discovery (and acceptance) that goes beyond data. As I argued earlier, new
data can be a consequence of the tools-to-theories heuristic, rather than the
starting point to which it is applied. Second, what can be discovered seems to
have little overlap. For Langley et al. (1987), discoveries are of two major
kinds: quantitative laws such as Kepler's law and qualitative laws such as
taxonomies using clustering methods. In fact, the heuristics of discovery pro-
posed in that work are similar to the statistical methods of exploratory data
analysis (Tukey, 1977). It is this kind of intuitive statistics that serves as the
analogy to discovery in Simon's approach. In contrast, the tools-to-theories
heuristic can discover new conceptual analogies, new research programs, and
new data. It cannot—at least not directly—derive quantitative laws by sum-
marizing data, as BACON's heuristics can.
The second issue, What can be discovered?, is related to the first, that is,
to Simon's approach to discovery as induction from data, as "recording in a
parsimonious fashion, sets of empirical data" (Simon, 1973, p. 475). More re-
cently, Simon and Kulkarni (1988) went beyond that data-centered view of
discovery and made a first step toward characterizing the heuristics used by
scientists for planning and guiding experimental research. Although Simon
and Kulkarni did not explore the potential of scientists' tools for suggesting
theoretical concepts (and their particular case study may not invite this), the
tools-to-theories heuristic can complement this recent, broader program to un-
derstand discovery. Both Simon's heuristics and the tools-to-theories heuristic
go beyond the inductive probability approach to discovery (such as Reichen-
bach's). The approaches are complementary in their focus on aspects of dis-
covery, but both emphasize the possibility of understanding discovery by ref-
erence to heuristics of creative reasoning, which go beyond the merely
personal and accidental.
The examples of discovery I give in this chapter are modest instances com-
pared with the classical literature in the history of science treating the contri-
bution of a Copernicus or a Darwin. But in the narrower context of recent
cognitive psychology, the theories I have discussed count as among the most
influential. In this more prosaic context of discovery, the tools-to-theories heu-
ristic can account for a group of significant theoretical innovations. And, as I
have argued, this discovery heuristic can both open and foreclose new avenues
of research, depending on the interpretations attached to the statistical tool.
My focus is on analytical tools of justification, and I have not dealt with phys-
ical tools of experimentation and data processing. Physical tools, once familiar
and considered indispensable, also may become the stuff of theories. This
holds not only for the hardware (like the software) of the computer, but also
for theory innovation beyond recent cognitive psychology. Smith (1986) argued
that Edward C. Tolman's use of the maze as an experimental apparatus trans-
formed Tolman's conception of purpose and cognition into spatial character-
istics, such as cognitive maps. Similarly, he argued that Clark L. Hull's fasci-
nation with conditioning machines has shaped Hull's thinking of behavior as
if it were machine design. With the exception of Danziger's (1985, 1987) work
on changing methodological practices in psychology and their impact on the
kind of knowledge produced, however, there seems to exist no systematic re-
search program on the power of familiar tools to shape new theories in psy-
chology.
But the history of science beyond psychology provides some striking in-
stances of scientists' tools, both analytical and physical, that ended up as the-
ories of nature. Hackmann (1979), Lenoir (1986), and Wise (1988) have ex-
plored how scientific instruments shaped the theoretical concepts of, among
others, Emil DuBois-Reymond and William Thomson (Lord Kelvin).
The case of Adolphe Quetelet illustrates nicely how the tools-to-theories
heuristic can combine with an interdisciplinary exchange of theories. The sta-
tistical error law (normal distribution) was used by astronomers to handle ob-
servational errors around the true position of a star. Quetelet (1842/1969), who
began as an astronomer, transformed the astronomer's tool for taming error into
a theory about society: The true position of a star turned into l'homme moyen,
or the ideal average person within a society, and observational errors turned
into the distribution of actual persons (with respect to any variable) around
l'homme moyen—actual persons now being viewed as nature's errors. Quete-
let's social error theory was in turn seminal in the development of statistical
mechanics; Ludwig Boltzmann and James Clerk Maxwell in the 1860s and
1870s reasoned that gas molecules might behave as Quetelet's humans do;
erratic and unpredictable as individuals, but regular and predictable when
considered as a collective (Porter, 1986). By this strange route of discovery—
from astronomer's tool to a theory of society, and from a theory of society to
Discovery Reconsidered
Let me conclude with some reflections on how the present view stands in
relation to major themes in scientific discovery.
scientific discovery, and the tools-to-theories heuristic explores the field be-
yond.
Have philosophers of science spent too little time inside the laboratories to
be drawn in by the glamour of technology? Tools, after all, fascinate scientists.
New tools can directly, rather than through new data, inspire new theories.
MIND AS COMPUTER
This chapter extends the thesis of a tools-to-theories heuristic from statistical
tools to the computer.1 Recall that the thesis is twofold:
This chapter is divided into two parts. In the first part, we argue that a
conceptual divorce between intelligence and calculation circa 1800, motivated
by a new social organization of work, made mechanical computation (and ul-
timately the computer) conceivable. The tools-to-theories heuristic comes into
play in the second part. When computers finally became standard laboratory
tools in the 20th century, the computer was proposed, and with some delay
accepted, as a model of mind. Thus we travel in a full circle from mind to
computer and back.
The work on which this chapter is based was coauthored with D. G. Goldstein.
1. Although we are only dealing with theories of mind, this does not imply that the
tools-to-theories heuristic is not applicable in the analysis of other scientific domains.
Schaffer (1992) provided several examples from the history of electromagnetism, in
which theories stemmed from tools. For instance, in 1600, the court physician William
Gilbert described the Earth as a vast spherical magnet. This new idea stemmed from the
tool he had invented (a magnet, the small terrella) and subsequently used as an analogy
to understand the world. This projection had consequences. Gilbert inferred that, be-
cause his terrella rotated, so did the Earth. The tool proved Copernicanism.
"Well, Babbage, what are you dreaming about?" to which I replied, "I
am thinking that all these tables (pointing to the logarithms) might be
calculated by machinery." (Charles Babbage, 1812/1994, p. 31)
2. The Jacquard loom, a general-purpose device loaded with a set of punched cards,
could be used to weave infinite varieties of patterns. Factories in England were equipped
with hundreds of these machines, and Babbage was one of the "factory tourists" of the
1830s and 1840s.
the mental life. New ideas and insights were assumed to be the product of the
novel combinations and permutations of ideas and sensations. In the first de-
cades of the nineteenth century, numerical calculation was separated from the
rest of intelligence and demoted to one of the lowest operations of the human
mind. After calculation became the repetitive task of an army of unskilled
workers, Babbage could envision mechanical computers replacing human
computers. Pools of human computers and Babbage's mechanical computer
manufactured numbers in the same way as the factories of the day manufac-
tured their goods.3
3. Calculation became dissociated and opposed not only to the human intellect but
also to moral impulse. Madame de Staël, for instance, used the term calcul only in
connection with the "egoism and vanity" of those opportunists who exploited the
French Revolution for their own advantage and selfishness (Daston, 1994).
Turing anticipated much of the new conceptual language and even the very
problems Allen Newell and Herbert Simon later attempted to address, as we
see in the second part of this chapter. With amazing prophecy, Turing sug-
gested that many intellectual issues can be translated into the form "Find a
number n such that. . ."; that is, he suggested that searching is the key concept
for problem solving and that Whitehead and Russell's (1935) Principia Math-
ematica might be a good start for demonstrating the power of the machine
(McCorduck, 1979, p. 57).
Not only did Turing's life end early and under tragic circumstances, but his
work had practically no influence on artificial intelligence in Britain until the
mid-1960s (McCorduck, 1979, p. 68). Neither von Neumann nor his friends
were persuaded to look beyond similarities between cells and diodes to func-
tional similarities between humans and computers.
To summarize, we have looked at two groups who compared humans and
computers before the cognitive revolution. One of these groups, represented
by von Neumann, spoke tentatively about the computer as a brain but warned
about taking the analogy too far. The other group, represented by Turing, asked
whether the computer has features of the human mind but not vice versa—
that is, this group did not attempt to design theories of mind through the
analogy of the tool.
Before the second half of the century, the mind was not yet a computer.
However, a new incarnation of the Enlightenment view of intelligence as a
combinatorial calculus was on the horizon.
What has been called in retrospect the cognitive revolution in American psy-
chology of the 1960s is more than an overthrow of behaviorism by mentalist
concepts. The cognitive revolution did more than revive the mental; it changed
its meaning. One source of this change is the projection of new tools (i.e.,
statistics and computers) into the mind. We refer to this heuristic of discovery
as the tools-to-theories heuristic. The two new classes of theories that emerged
and that partially overlap pictured the new mind as an "intuitive statistician"
or a "computer program."
In this section, we see how a tools-to-theories explanation accounts for the
new conception of the mind as a computer, focusing on the discovery and
acceptance of Simon and Newell's brand of information-processing psychol-
ogy. We try to reconstruct the discovery of Newell and Simon's (1972)
information-processing model of mind and its (delayed) acceptance by the psy-
chological community in terms of the tools-to-theories heuristic.
Discovery
4. The Manhattan Project at Los Alamos, where the atomic bomb was constructed,
housed another human computer. Although the project could draw on the best technol-
ogy available, in the early 1940s mechanical calculators (e.g., the typewriter-sized Mar-
chant calculator) could only add, subtract, multiply, and, with some difficulty, divide.
Richard Feynman and Nicholas Metropolis arranged a pool of people (mostly scientists'
wives, who were getting paid three-eighths of the scientists' salary), each of whom re-
petitively performed a small calculation (e.g., cubing a number) and passed the result
on to another person, who incorporated it into yet another computation (Gleick, 1992).
The metaphor I'd been using, of a mind as something that took some
premises and ground them up and processed them into conclusions, be-
gan to transform itself into a notion that a mind was something which
took some program inputs and data and had some processes which op-
erated on the data and produced output. (cited in McCorduck, 1979,
p. 127)
others is slight. New discoveries, by definition, clash with what has come be-
fore, but it is often a useful strategy to hide the amount of novelty and to claim
historical continuity. When Tanner and Swets (1954) proposed (in the Psycho-
logical Review four years earlier) that another scientific tool (i.e., the Neyman-
Pearsonian techniques of hypothesis testing) would model the cognitive pro-
cesses of stimulus detection and discrimination, their signal-detection model
also clashed with earlier notions, such as the notion of a sensory threshold.
Tanner and Swets, however, chose not to conceal this schism between the old
and the new theories, explicitly stating that their new theory "appears to be
inconsistent with the large quantity of existing data on this subject" (p. 401).
As we argued before, there is a different historical continuity in which Newell
and Simon's ideas stand—the earlier Enlightenment view of intelligence as a
combinatorial calculus.
Conceptual Change
Newell et al. (1958) tried to emphasize the historical continuity of what was
to become their new information-processing model of problem solving, as did
Miller, Galanter, and Pribram (1960) in their Plans and the Structure of Be-
havior when they linked their version of Newell and Simon's theory to many
great names such as William James, Frederic Bartlett, and Edward Tolman. We
believe that these early claims for historical continuity served as protection:
George Miller, who was accused by Newell and Simon of having stolen their
ideas and gotten them all wrong, said, "I had to put the scholarship into the
book, so they would no longer claim that those were their ideas. As far as I
was concerned they were old familiar ideas" (Baars, 1986, p. 213). In contrast
to this rhetoric, here we emphasize the discontinuity introduced by the trans-
formation of the new tool into a theory of mind.
What was later called the "new mental chemistry" pictured the mind as a
computer program:
The atoms of this mental chemistry are symbols, which are combinable
into larger and more complex associational structures called lists and list
structures. The fundamental "reactions" of the mental chemistry employ
elementary information processes that operate upon symbols and symbol
structures: copying symbols, storing symbols, retrieving symbols, input-
ting and outputting symbols, and comparing symbols. (Simon, 1979,
p. 363)
This atomic view is certainly a major conceptual change in the views about
problem solving compared to the theories of Köhler, Wertheimer, and Duncker,
Here we do not attempt to probe the depths of how Newell and Simon's
ideas of information processing changed theories of mind; the commonplace
usage of computer terminology in the cognitive psychological literature since
1972 is a reflection of this. How natural it seems for present-day psychologists
to speak of cognition in terms of encoding, storage, retrieval, executive pro-
cesses, algorithms, and computational cost.
5. In fact, the new view was directly inspired by 19th-century mathematician George
Boole (1854/1958), who, in the very spirit of the Enlightenment mathematicians such as
the Bernoullis and Laplace, set out to derive the laws of logic, algebra, and probability
from what he believed to be the laws of human thought. Boole's algebra culminated in
Whitehead and Russell's (1935) Principia Mathematica, describing the relation between
mathematics and logic, and in Claude Shannon's seminal work (his master's thesis at
MIT in 1940), which used Boolean algebra to describe the behavior of relay and switch-
ing circuits (McCorduck, 1979, p. 41).
Acceptance
and Simon's ideas. The "acceptance" part of the tools-to-theories thesis can
explain this: Computers were not yet entrenched in the daily routine of psy-
chologists, as we show here.
We take two institutions as case studies to demonstrate the part of the tools-
to-theories hypothesis that concerns acceptance—the Harvard University Cen-
ter for Cognitive Studies and Carnegie-Mellon University (CMU). The former
never came to embrace fully the new information-processing psychology; the
latter did but after a considerable delay. Tools-to-theories might explain both
phenomena.
George Miller, the cofounder of the Center for Cognitive Studies, was cer-
tainly a proponent of the new information-processing psychology. As we said,
Miller et al.'s (1960) Plans and the Structure of Behavior was so near to Newell
et al.'s (1958) ideas that it was at first considered a form of theft, although the
version of the book that did see the presses is filled with citations recognizing
Newell et al. Given Miller's enthusiasm, one might expect the center, partially
under Miller's leadership, to blossom into information-processing research. It
never did. Looking at the 1963-1969 annual reports (Harvard University Cen-
ter for Cognitive Studies, 1963, 1964, 1966, 1968, 1969), we found only a few
symposia or papers dealing with computer simulation.
Although the center had a PDP-4C computer and the reports anticipated
the possibility of using it for cognitive simulation, as late as 1969 this had
still not happened. The reports mention that the computer served to run experiments,
demonstrate the feasibility of computer research, and draw visitors to the lab-
oratory. However, difficulties involved in using the tool were considerable. The
PDF saw 83 hours of use on an average week in 1965-1966, but 56 of these
were spent on debugging and maintenance. In the annual reports are several
remarks of the type, "It is difficult to program computers. . . . Getting a program
to work may take months." The center even turned out a 1966 technical report
entitled Programmanship, or How to Be One-Up on a Computer without Ac-
tually Ripping out Its Wires.
What might have kept the Harvard computer from becoming a metaphor
of the mind was that the researchers could not integrate this tool into their
everyday laboratory routine. The tool even turned out to be a steady source
of frustration. As tools-to-theories suggests, this lack of entrenchment in every-
day practice accounted for the lack of acceptance of the new information-
processing psychology. Simon (1979) took notice of this:
Perhaps the most important factors that impeded the diffusion of the new
ideas, however, were the unfamiliarity of psychologists with computers
and the unavailability on most campuses of machines and associated
software (list processing programming languages) that were well adapted
to cognitive simulation. The 1958 RAND Summer Workshop, mentioned
earlier, and similar workshops held in 1962 and 1963, did a good deal
to solve the first problem for the 50 or 60 psychologists who participated
At CMU in the late 1950s, the first doctoral theses involving computer simu-
lation of cognitive processes were being written (H. A. Simon, personal com-
munication, 1994). But this was not representative of the national state of af-
fairs. In the mid-1960s, a small number of psychological laboratories were built
around computers, including those of CMU, Harvard, Michigan, Indiana, MIT,
and Stanford (Aaronson, Grupsmith, & Aaronson, 1976, p. 130). As indicated
by the funding history of NIMH grants for cognitive research, the amount of
computer-using research tripled over the next decade. In 1967, only 15% of
the grants being funded had budget items related to computers (e.g., program-
mer salaries, hardware, supplies); by 1975, this figure had increased to 46%.
The late 1960s saw a turn toward mainframe computers that lasted until the
late 1970s, when the microcomputer started its invasion of the laboratory. In
the 1978 Behavioral Research Methods & Instrumentation conference, micro-
computers were the issue of the day (Castellan, 1981, p. 93). By 1984, the
journal Behavioral Research Methods & Instrumentation appended the word
Computers to its title to reflect the broad interest in the new tool. By 1980, the
cost of computers had dropped an order of magnitude from what it was in
1970 (Castellan, 1981, 1991). During the last two decades, computers have
become the indispensable research tool of the psychologist.
After the tool became entrenched in everyday laboratory routine, a broad
acceptance of the view of the mind as a computer followed. In the early 1970s,
information-processing psychology finally caught on at CMU. Every CMU-
authored article in the proceedings of the 1973 Carnegie Symposium on Cog-
nition mentions some sort of computer simulation. For the rest of the psycho-
logical community, which was not as familiar with the tool, the date of broad
acceptance was years later. Simon (1979) estimated that, from about 1973 to
1979, the number of active research scientists working in the information-
processing vein had "probably doubled or tripled" (p. 390).
This does not mean that the associated methodology became accepted as
well. It clashed too strongly with the methodological ritual that was institu-
tionalized during the 1940s and 1950s in experimental psychology. We use the
term ritual here for the mechanical practice of a curious mishmash between
Fisher's and Neyman-Pearson's statistical techniques, which was taught to psy-
chologists as the sine qua non of scientific method (see Chapter 13). Most
psychologists assumed, as the textbooks told them, that there is only one way
to do good science. But their own heroes—Fechner, Wundt, Pavlov, Köhler,
Bartlett, Piaget, Skinner, and Luce, to name a few—had never used this "rit-
ual." Some had used experimental practices that resembled the newly pro-
posed methods used to study the mind as computer.
Pragmatics
The same process of projecting pragmatic aspects of the use of a tool into
a theory can be shown for the view of the mind as a computer. One example
is Levelt's (1989) model of speaking. The basic unit in Levelt's model, which
he called the "processing component," corresponds to the computer program-
mer's concept of a subroutine. We argue that Levelt's model not only borrowed
the subroutine as a tool but also borrowed the practical aspects of how sub-
routines are used and constructed in computer programming.
A subroutine (or "subprocess") is a group of computer instructions (usually
serving a specific function) that are separated from the main routine of a com-
puter program. It is common for subroutines to perform often needed func-
tions, such as extracting cube roots or rounding numbers. There is a major
pragmatic issue involved in writing subroutines that centers on the "principle
of isolation" (Simon & Newell, 1986). The issue is whether subroutines should
be black boxes or not. According to the principle of isolation, the internal
workings of the subroutine should remain a mystery to the main program, and
the outside program should remain a mystery to the subroutine. Black-box
subroutines have become known as program modules, perfect for the divide-
and-conquer strategy programmers often use to tackle large problems. To the
computer, however, it makes no difference whether subroutines are isolated or
not. Subroutines that are not isolated work just as well as those that are. The
only real difference between the two types of subroutine is psychological. Sub-
routines that violate the principle of isolation are more difficult for the pro-
grammer to read, write, debug, maintain, and reuse. For this reason, introduc-
tory texts on computer programming stress the principle of isolation as the
very essence of good programming style.
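To make the distinction concrete, here is a minimal Python sketch (the function names and the running total are invented for illustration): an isolated subroutine communicates with the rest of the program only through its arguments and return value, whereas a non-isolated one reaches into state owned by the main routine.

def cube_root(x):
    # Isolated ("black-box") subroutine: everything it needs arrives as an
    # argument, and everything it produces leaves as the return value.
    return x ** (1.0 / 3.0)

running_total = 0.0

def add_cube_root(x):
    # Non-isolated subroutine: it reads and rewrites a variable owned by the
    # main program. The machine runs it just as well, but a programmer must
    # now inspect the whole program to understand, debug, or reuse it.
    global running_total
    running_total += x ** (1.0 / 3.0)

print(cube_root(27.0))   # approximately 3.0, independent of the rest of the program
add_cube_root(27.0)      # its effect depends on whatever running_total currently holds
print(running_total)     # approximately 3.0 only because running_total started at 0.0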
The principle of isolation—a pragmatic feature of using subroutines as a
programming tool—has a central place in Levelt's model, in which the pro-
cessing components are "black boxes" that exemplify Fodor's notion of infor-
mational encapsulation (Levelt, 1989, p. 15). In this way, Levelt's psychological
model embodies a maxim of good computer programming—the principle of
isolation. That this practical aspect of the use of the tool shaped a theory of
speaking is not an evaluation of the quality of the theory. Our point concerns
origins, not validity. However, this pragmatic feature of subroutines has not
always served the model well. Kita (1993) and Levinson (1992) have attacked
Levelt's model at its Achilles' heel—its insistence on isolated processing com-
ponents.
To summarize the second part of this chapter, we started with the separation
between intelligence and calculation and argued that the realization that com-
puters can do more than arithmetic was an important precondition for the view
of the mind as a computer. Newell and Simon seem to have been the first who
tried to understand the mind in terms of a computer program, but the accep-
tance of their information-processing view was delayed until the psychologists
became used to computers in their daily laboratory routine. We have argued
that, along with the tool, its pragmatic use has been projected into theories of
mind. Now that the metaphor is in place, many find it difficult to see how the
mind could be anything else: To quote Philip Johnson-Laird (1983): "The com-
puter is the last metaphor; it need never be supplanted" (p. 10).
Social Computers
The tools-to-theories heuristic can reverse the commonly assumed fixed tem-
poral order between discovery and justification—discovery first, justification
second. New tools for justification enter the laboratory first, new theories fol-
low. In the case of Babbage's computer, the tool itself was modeled after a new
social system, the organization of work in large-scale manufacturing. The
model for the machine computer was a social computer.
The argument was that economic changes—the large-scale division of labor
in manufacturing and in the "bureaux de calculs"—went along with the break-
down of the Enlightenment conception of the mind, in which calculation was
the distinctive essence of intelligence. Once calculation was separated from
the rest of intelligence and relegated to the status of a dull and repetitive task,
Babbage could envision replacing human computers with mechanical ones.
Both human and mechanical computers manufactured numbers as the factories
of the day manufactured goods. In the twentieth century, the technology be-
came available to make Babbage's dream a reality. Computers became indis-
pensable scientific tools for everything from number crunching to simulation.
Our focus was on the work by Herbert Simon and Allen Newell and their
colleagues, who proposed the tool as a theory of mind. Their proposal reunited
mere calculation with what was now called "symbol processing," returning to
the Enlightenment conception of mind. After personal computers found a
place in nearly every psychological laboratory, broad acceptance of the meta-
phor of the mind as computer followed.6
A question remains: Why was the digital computer used as a model of the
individual mind rather than of the social organization of many minds? As the
social roots of the idea behind Babbage's computer show, there is nothing
inherently individualistic about the business of computation. We can only
speculate that it was the traditional focus of psychological research on indi-
viduals that suggested the analogy between the computer and the individual
mind and that in less individualistic disciplines the computer would have had
a better chance of becoming a model of social organization. In fact, anthro-
pologist Ed Hutchins (1995) has proposed using the digital computer as a
model of how social groups make decisions, for instance, how a crew on a
large ship solves the problem of navigation. Here the computer is used to
6. The reconstruction of the path "from mind to computer and back" also provides
an explanation for one widespread type of resistance to the computer metaphor of mind.
The post-Enlightenment divorce between intelligence and calculation still holds to this
day, and, for those who still associate the computer with mere calculation (as opposed
to symbol processing), the mind-as-a-computer is a contradiction in itself.
model the division of labor in the storing, processing, and exchange of infor-
mation among members of a social group. This notion of distributed intelli-
gence completes the circle traveled by the computer metaphor. Once modeled
after the social organization of human work, the computer has now become a
model of the social organization of human work.
3
Ideas in Exile
The Struggles of an Upright Man
Intellectual Integrity
Refinancing Determinism
Brunswik found himself and his ideas exiled from his discipline. Ernest
Hilgard (1955), an eminent experimental psychologist, put his lack of regard
for Brunswik's methods in no uncertain terms: "Correlation is an instrument
of the devil" (p. 228). But methods per se are neither good nor bad; the ques-
tion is whether they match a theory or not. Brunswik's intellectual integrity
demanded that he think for himself, deciding what the proper method was,
rather than just climbing on the bandwagon. The tragedy is that he found
himself in a no-man's-land between the two newly established disciplines.
Fisher's Straitjacket
B. F. Skinner once told me that he had thought of dedicating one of his books
to "the statisticians and scientific methodologists with whose help this book
would have never been completed." He had second thoughts, and, in fact,
dedicated the book to those who actually were helpful, "to the pigeon staff."
Skinner had had in mind those statisticians who imposed Sir Ronald Fisher's
doctrine that the design of an experiment must match the statistical method,
such as analysis of variance.
Fisher's randomized control group experiments were tailor-made to Wood-
worth's ideal of experimentation, and analysis of variance allowed one to study
more than one independent variable. Skinner's resistance arose when research-
ers started to use Fisher's method compulsively rather than in a thoughtful
way, that is, as a tool, which is—like all tools—useful only in specific situa-
tions. Editors began to make what they believed was good scientific method a
sine qua non for publication: factorial designs, large numbers of participants,
and small p values.
Statistical thinking became replaced by a mindless ritual performed in the
presence of any set of data (see Chapter 13). Skinner confessed to me that he
once tried a factorial design with some two dozen animals. But only once. He
lost experimental control because he could not keep so many animals at the
same level of deprivation, and the magnitude of error in his data increased.
Why increase error just to have a method that measures error?
The Skinnerians escaped the emerging pressure of editors to publish studies
with large numbers of animals by founding a new journal in 1958, the Journal
of the Experimental Analysis of Behavior. Brunswik, however, had no follow-
ing with which he could found his own journal. Like Skinner, he remarked
drolly that "our ignorance of Fisher's work on factorial design and its mathe-
matical evaluation . . . paid off" (1956, p. 102). As almost all great psycholo-
gists did, he analyzed individuals rather than comparing group means, and he
continued to employ his own nonfactorial representative designs. But he also
sometimes felt that he should make concessions, for instance, when he per-
formed "a routine analysis of variance for the factorially orthodox part of our
experiment" (1956, p. 106).
In Brunswik's struggle with Fisher's ideas, unlike Skinner's, a classic con-
troversy repeated itself. Karl Pearson, who, with Francis Galton, founded cor-
relation methods, was involved in a terrible intellectual and personal feud with
Fisher. This fight between these towering statisticians repeated itself in psy-
chology between the proponents of their respective tools. Just at the time when
Brunswik adopted Pearson's correlation methods around 1940, Fisherian meth-
ods began to spread. By 1955, when Brunswik died, Fisherian methods had
overrun, conquered, and redefined every branch of experimental psychology.
Then the newly institutionalized tools evolved into new theories of mind.
When Brunswik's vision of the mind as an intuitive statistician finally became
a great success in experimental psychology, the mind's intuitive statistician
was not of the Karl Pearson school, as Brunswik had imagined. Rather, the
homunculus statistician used the new laboratory tools, such as analysis of
variance. For instance, according to Harold Kelley's (1967) causal attribution
theory, the mind attributes a cause to an effect in the same way as researchers
have come to do—by calculating an intuitive version of analysis of variance
(see Chapter 1). Brunswik had never been able to persuade his colleagues from
experimental psychology that the mind would use the techniques of the com-
peting discipline of correlational psychology.
American psychology would hardly remember Brunswik's ideas had not one
of his students, Ken Hammond, kept his memory alive to the present day. But
is the memory of Egon Brunswik of more than historical interest? Are his ideas
still exiled, and if so, does it matter?
Representative Sampling
Brunswik (1956) sadly reported that his success in persuading fellow research-
ers to shift to representative sampling of stimuli is "very slow going and hard
to maintain" (p. 39). He complained that his colleagues practiced "double stan-
dards" by being concerned with the sampling of participants but not of stim-
ulus objects. Representative sampling of stimuli is one aspect of the more gen-
eral notion of representative design.
It would be an error to introduce representative sampling as a new dogma
to replace current methodological dogmas. The point is to choose the appro-
priate sampling method for the problem under discussion. For instance, rep-
resentative sampling of objects from a class is indispensable if one wants to
make general statements about the degree of "achievement," or its flip side,
the fallacies of perception and judgment concerning this class of objects. But
if the purpose is testing competing models of cognitive strategies and flat max-
ima obscure the discriminability of strategies, then using selected stimuli that
discriminate between the strategies may be the only choice (see Rieskamp &
Hoffrage, 1999).
Is the idea of representative sampling of any relevance for present-day re-
search? Imagine Brunswik browsing through recent textbooks on cognitive psy-
chology and looking for what we have discovered about achievement in judg-
ment—now more fashionably labeled fallacies and cognitive illusions. It
would catch his eye that the stimuli used in the demonstrations of fallacies
were typically selected rather than representative: the five letters in Tversky
and Kahneman's (1973) study from which the availability heuristic was con-
cluded; the personality sketches in Kahneman and Tversky's (1973) engineer-
lawyer study from which base-rate neglect was concluded; and the general-
knowledge questions from which the overconfidence bias was concluded
(Lichtenstein, Fischhoff, & Phillips, 1982), among others. Brunswik would
have objected that if one wants to measure achievement or demonstrate fal-
lacies in a reference class of objects, one needs to take a representative (or
random) sample of these objects. If not, one can "demonstrate" almost any
level of performance by selecting those objects for which performance is at its
worst (or at its best). In fact, when one uses representative (rather than se-
lected) samples in these three studies, performance greatly improves: The
errors in estimating the frequency of letters largely disappear (Sedlmeier,
Hertwig, & Gigerenzer, 1998); the estimated probabilities that a person is an
engineer approach Bayes's rule (Gigerenzer et al., 1988); and the over-
confidence bias completely disappears (Chapter 7; Juslin, Olsson, & Winman,
1998). These celebrated cognitive illusions, attributed to the participants, are
in part due to the selected sampling done by the experimenters.
These examples illustrate that representative sampling of stimuli is still a
blind spot in some areas of research. In survey research, it would be a mistake
to present the odd views of a few selected citizens as public opinion; that the
same applies to stimulus objects is still not commonly acknowledged. Unre-
flectively selected samples can produce apparently general phenomena that
occupy us for years and then finally dissolve into an issue of mere sampling.
Natural Sampling
Imagine Brunswik looking at the studies on Bayesian reasoning, which
emerged about 10 years after his death. Had he learned that people neglect
base rates, he might have been surprised, because his rats did not (Brunswik,
1939). His rats were not perfect, but they were sensitive to the difference of
the base rates of reinforcement in the two sides of a T-maze and to the ratio
as well. Sensitized by the frequentist Reichenbach, Brunswik's eye would have
caught an essential difference between his study and the base-rate studies of
the 1970s and 1980s: His rats learned the base rates from actual experienced
frequencies, whereas the humans in almost all studies that reported base-rate
neglect could not; they were presented with summary information in terms of prob-
abilities or percentages. Rats would not understand probabilities, and humans
have only recently in their evolution begun to struggle with this representation
of uncertainty. Does representation matter? Christensen-Szalanski and Beach
(1982) presented base rates in terms of actual frequencies, sequentially en-
countered, and reported that base-rate neglect largely disappeared. This pro-
cess of sampling instances from a population sequentially is known as natural
sampling. Natural sampling is the everyday equivalent—for rats and humans
alike—of the representative sampling done by scientific experimenters. When
observed frequencies are based on natural sampling—that is, on raw (rather
than normalized) counts of events made in an ecological (rather than experi-
mental) setting—then one can show that Bayesian computations become sim-
pler than with probabilities, and people have more insight into Bayesian prob-
lems (Chapter 4).
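To give the flavor of the simplification (a sketch only; Chapter 4 works it out in full): if d&h denotes the raw count of cases in which both the data and the hypothesis were observed, and d&-h the count of cases with the data but without the hypothesis, then the Bayesian posterior reduces to

p(hypothesis | data) = d&h / (d&h + d&-h),

with no base rates, hit rates, or false positive rates to be multiplied and recombined.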
Structure of Environments
A most important insight I gained from Brunswik's writings is the relevance
of the structure of information in environments to the study of judgment. Brun-
swik tentatively proposed measuring environmental structure by ecological va-
lidities and measuring these in turn by correlation coefficients. Brunswik,
though, almost as much as Skinner, hesitated to look into the black box, and
so he failed to see the important connection between the structure of environ-
ments and that of mediation. Adaptive mental strategies can exploit certain
structures. For instance, if there is a match between the structure of the en-
vironment and that of a strategy, a simple heuristic that processes much less
information than multiple regression can nevertheless make as many (or more)
accurate inferences about its environment (Martignon & Hoffrage, 1999). Her-
bert Simon had emphasized the link between cognitive processes and envi-
ronmental structure in his famous 1956 Psychological Review article on
the unobservable process of mediation, and even in 1937 still declared that
psychology is a science of "what" rather than of "how." The question of how
mediation works should be studied only insofar as it throws light on the ques-
tion of what an organism achieves. Only later did Brunswik (e.g., 1957) grant
a place, though only a second place, to the study of cognitive processes.
Given his reluctance to open the black box, I am not sure how Brunswik
would look at the process models of vicarious functions that were inspired by
his ideas: multiple regression models on the one hand (e.g., Hammond,
Hursch, & Todd, 1964) and the theory of probabilistic mental models (PMM
theory) and the fast and frugal lens model on the other (Chapter 8). When
Brunswik coined the metaphor of the "intuitive statistician," he tentatively
suggested that the process of vicarious functioning might be like multiple re-
gression (Doherty & Kurz, 1996). Brunswik's measurement tool turned into a
theory of cognitive processes. In the neo-Brunswikian revival, multiple regres-
sion became the model of vicarious functioning, and, unfortunately, it remains
so. Ken Hammond, like Brunswik, has had second thoughts, but by and large,
the tool has become part of the message. It structures our thinking about Bruns-
wik.
Brunswik's reluctance to think about processes may explain why his ex-
amples for vicarious functioning vacillated back and forth between two differ-
ent processes, substitution and combination. Some of his examples—such as
Hull's habit family hierarchy and the psychoanalytic substitution mechanism
in which one cause can manifest itself as various symptoms—referred to sub-
stitution without combination, others to the combination of cues. The fast and
frugal lens model, based on PMM theory, assumes substitution without com-
bination, emphasizing that judgments need to be made quickly and on the
basis of limited knowledge (see Gigerenzer & Kurz, in press). Here Egon Bruns-
wik meets Herbert Simon, creating models of bounded rationality in which
simple cognitive heuristics exploit environmental structures.
Just as the human species has a history, so do our theories and methods. Not
knowing where they come from can blind one to understanding why one pro-
pounds a particular theory or uses a specific method. Nevertheless, looking
down at history is symptomatic of much of current psychology. Brunswik had
written about the history of his field and had published in philosophical jour-
nals; possibly it is just that background that helped him to see that there are
differences between methodologies and that one actually needs to make in-
formed choices. Many researchers do not seem to make these choices; rather,
they take on the methodological practice of their field and then defend it as if
it were religious dogma. If one reads Brunswik, one finds a constant stream of
thought about methodology, from preferring matching tasks over numerical
response tasks in order to minimize the confounding of perception with judg-
ment to the larger program of representative design. In contrast, the enthusiasm
with which some methods have been mechanically applied as general-purpose
John Locke (1690/1959) remarked that "God . . . has afforded us only with the
twilight of probability; suitable, I presume, to that state of mediocrity and pro-
bationership he has been pleased to place us in here. . . . " Bühler's psychology
opened the door for Brunswik to the twilight of uncertainty, and the Vienna
Circle inspired him to search for objective knowledge behind that door. What
Brunswik found there: that we know. What he was looking for is more: not
answers, but the right questions. From him, one can learn to rethink that which
is taken for granted. I have.
Yet there is another, deeper message in the work of Egon Brunswik: the
value of the struggle for intellectual integrity—daring to think ideas through,
with all the consequences, and remaining true to them even if they are con-
demned to exile. Kant's final two words in his lovely essay on the Enlighten-
ment capture the essence of this struggle: sapere aude, that is, have the courage
to know.
II
ECOLOGICAL RATIONALITY
The first two chapters in this section illustrate the practical relevance of
this argument for criminal law, medical diagnosis, AIDS counseling, and
other professions concerned with uncertainties and risks. Should evidence of
wife battering be admissible in the trial of a man accused of murdering his
wife? How many 40-year-old women with a positive mammogram in routine
screening actually have breast cancer? How likely is it that a man with a
positive HIV test actually has the virus? Earlier studies have documented
that many experts—and most patients and jurors—do not understand how to
answer these questions, possibly because they neglect base rates or are con-
fused by probabilities. I show that the notion of ecological rationality leads
to a simple method for helping experts and laypersons alike. One can restore
the representation of uncertainty that humans have encountered throughout
their evolution by translating probabilities back into natural frequencies—the
outcome of natural sampling. This change can turn innumeracy into insight.
The final chapter in this section is theoretical and experimental rather
than applied. It defines the concepts of natural sampling and natural frequen-
cies and reports experimental evidence for the impact of various external
representations on statistical thinking. The mental strategies or shortcuts
people use, not only their numerical estimates of risks, turn out to be a func-
tion of the external representation of numbers we choose.
Ecological rationality can refer to the adaptation of mental processes to
the representation of information, as in this section. It also can refer to the
adaptation of mental processes to the structure of information in an environ-
ment, as illustrated in the section on bounded rationality and, in more de-
tail, in Simple Heuristics that Make Us Smart (Gigerenzer, Todd, & the ABC
Research Group, 1999). In both cases, it is important to distinguish between
past and present environments, particularly when we are studying humans,
who change their environments rapidly. Studying how past environments
differ from present environments reminds us that an ecological perspective
has an evolutionary and historical dimension. Here we go beyond Brun-
swik's metaphor of the married couple, which focuses on the adaptation be-
tween the mind and its current spouse while forgetting its previous mar-
riages. The program of ecological rationality is a research heuristic, not a
foolproof recipe—just as new laboratory tools do not always lead to good
theories for mental processes.
4
Ecological Intelligence
Bayesian Inference
David Eddy (1982) asked physicians to estimate the probability that a woman
has breast cancer given that she has a positive mammogram on the basis of the
following information:
The probability that a patient has breast cancer is 1% (the physician's
prior probability).
If the patient has breast cancer, the probability that the radiologist will
correctly diagnose it is 80% (sensitivity or hit rate).
If the patient has a benign lesion (no breast cancer), the probability that
the radiologist will incorrectly diagnose it as cancer is 9.6% (false pos-
itive rate).
QUESTION: What is the probability that a patient with a positive mam-
mogram actually has breast cancer?
Eddy reported that 95 out of 100 physicians estimated the probability of
breast cancer after a positive mammogram to be about 75%. The inference from
an observation (positive test) to a disease, or more generally, from data D to a
hypothesis H, is often referred to as "Bayesian inference," because it can be
modeled by Bayes's rule:

p(H|D) = p(H)p(D|H) / [p(H)p(D|H) + p(-H)p(D|-H)].    (1)

Equation 1 shows how the probability p(H|D) that the woman has breast
cancer (H) after a positive mammogram (D) is computed from the prior prob-
ability p(H) that the patient has breast cancer, the sensitivity p(D|H), and the
false positive rate p(D|-H) of the mammography test. The probability p(H|D)
is called the "posterior probability." The symbol -H stands for "the patient
does not have breast cancer." Equation 1 is Bayes's rule for binary hypotheses
and data. The rule is named after Thomas Bayes (1702 [?]-1761), an English
dissenting minister, to whom this solution of the problem of how to make an
inference from data to hypothesis (the so-called inverse problem; see Daston,
1988) is attributed.1 The important point is that Equation 1 results in a prob-
ability of 7.8%, not 75% as estimated by the majority of physicians. In other
words, the probability that the woman has breast cancer is one order of mag-
nitude smaller than estimated.
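As a check on the arithmetic, a few lines of Python reproduce the 7.8% from the three probabilities stated in the problem (nothing beyond Equation 1 is involved):

prior = 0.01              # p(H): base rate of breast cancer
sensitivity = 0.80        # p(D|H): positive mammogram given cancer
false_positive = 0.096    # p(D|-H): positive mammogram given no cancer

posterior = (prior * sensitivity) / (
    prior * sensitivity + (1 - prior) * false_positive)
print(round(posterior, 3))  # 0.078, i.e., about 7.8 percent rather than 75 percent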
This result, together with an avalanche of studies reporting that laypeople's
reasoning does not follow Bayes's rule either, has (mis-)led many to believe
that Homo sapiens is inept at reasoning the Bayesian way. Listen to some
influential voices: "In his evaluation of evidence, man is apparently not a con-
servative Bayesian: he is not Bayesian at all" (Kahneman & Tversky, 1972,
p. 450). "Tversky and Kahneman argue, correctly, I think, that our minds are
not built (for whatever reason) to work by the rules of probability" (Gould,
1992, p. 469).2 The literature of the last 25 years has reiterated again and again
the message that people are bad reasoners, neglect base rates most of the time,
neglect false positive rates, and are unable to integrate base rate, hit rate, and
false positive rate the Bayesian way (for a review see Koehler, 1996). Proba-
bility problems such as the mammography problem have become the stock-in-
trade of textbooks, lectures, and party entertainment. It is guaranteed fun to
point out how dumb others are. And aren't they? There seem to be many
customers eager to buy the message of "inevitable illusions" wired into our
brains (Piattelli-Palmarini, 1994).
mans making irrational judgments. What was the format of the numerical in-
formation humans encountered during their evolution? We know too little
about these environments, for instance, about the historically normal condi-
tions of childbirth, or how strong a factor religious doctrines were, and most
likely, these varied considerably between societies. But concerning the format
of numerical information, I believe we can be as certain as we ever can be—
probabilities and percentages were not the way organisms encountered infor-
mation. Probabilities and percentages are quite recent forms of representations
of uncertainty. Mathematical probability emerged in the mid-seventeenth cen-
tury (Hacking, 1975), and the concept of probability itself did not gain prom-
inence over the primitive notion of "expectation" before the mid-eighteenth
century (Daston, 1988). Percentages became common notations only during the
nineteenth century, after the metric system was introduced during the French
Revolution (mainly, though, for interest and taxes rather than for representing
uncertainty). Only in the second half of the twentieth century did probabilities
and percentages become entrenched in the everyday language of Western coun-
tries as representations of uncertainty. To summarize, probabilities and per-
centages took millennia of literacy and numeracy to evolve as a format to
represent degrees of uncertainty. In what format did humans acquire numerical
information before that time?
I propose that the original format was natural frequencies, acquired by nat-
ural sampling. Let me explain what this means by a parallel to the mammog-
raphy problem, using the same numbers. Think about a physician in an illit-
erate society. Her people have been afflicted by a new, severe disease. She has
no books or statistical surveys; she must rely solely on her experience. For-
tunately, she discovered a symptom that signals the disease, although not with
certainty. In her lifetime, she has seen 1,000 people, 10 of whom had the dis-
ease. Of those 10, eight showed the symptom; of the 990 not afflicted, 95 did.
Thus there were 8 + 95 = 103 people who showed the symptom, and only 8
of these had the disease. Now a new patient appears. He has the symptom.
What is the probability that he actually has the disease?
The physician in the illiterate society does not need a pocket calculator to
estimate the Bayesian posterior probability. All she needs to do is to keep track
of the number of symptom and disease cases (8) and the number of symptom
and no-disease cases (95). The probability that the new patient actually has
the disease can be "seen" easily from these frequencies: 8/(8 + 95) = 8/103, which is about 7.8%.
3. However, there is a price to be paid if one replaces the actual with a convenient
sample size. One can no longer compute second-order probabilities (Kleiter, 1994).
tic inferences on a daily basis should, despite their experience, show the same
effect. Third, the "inevitable illusions" (Piattelli-Palmarini, 1994), such as
base-rate neglect, should become evitable by using natural frequencies. Finally,
natural frequencies should provide a superior vehicle for teaching Bayesian
inference. In what follows, I report tests of these predictions and several ex-
amples drawn from a broad variety of everyday situations.
This is not to say that probabilities are useless or perverse. In mathematics,
they play their role independent of whether or not they suit human reasoning,
just as Riemannian and other non-Euclidean geometries play their roles in-
dependent of the fact that human spatial reasoning is Euclidean.
Breast Cancer
When Dr. Average saw the first problem in a frequency format, his ner-
vousness subsided. "That's so easy," he remarked with relief, and came up
with the Bayesian answer, as he did with the second problem in a frequency
format. Dr. Average's reasoning turned Bayesian the moment the information
was in frequencies, despite his never having heard of, or at least not remem-
bering, Bayes's rule. In the words of a 38-year-old gynecologist faced with the
mammography problem in a frequency format: "A first grader could do that.
Wow, if someone can't solve this . . . !"
Consider now all the physicians' diagnostic inferences concerning breast
cancer. Do natural frequencies foster insight in them?
In the probability format, only 2 out of 24 physicians (8%) came up with
the Bayesian answer. The median estimate of the probability of breast cancer
after a positive mammogram was 70%, consistent with Eddy's findings. With
natural frequencies, however, 11 out of 24 physicians (46%) responded with
the Bayesian answer. Across all four diagnostic problems, similar results were
obtained—10% Bayesian responses in the probability format and 46% with
natural frequencies (Figure 4.1).
Probability format
The probability that one of these women has breast cancer is 1%.
If a woman has breast cancer, the probability is 80% that she will have a positive mammogram.
If a woman does not have breast cancer, the probability is 10% that she will still have a positive mammogram.
Imagine a woman (age 40 to 50, no symptoms) who has a positive mammogram in your breast cancer screening. What is the probability that she actually has breast cancer? ____%

Natural frequencies
Ten out of every 1,000 women have breast cancer.
Of these 10 women with breast cancer, 8 will have a positive mammogram.
Of the remaining 990 women without breast cancer, 99 will still have a positive mammogram.
Imagine a sample of women (age 40 to 50, no symptoms) who have positive mammograms in your breast cancer screening. How many of these women do actually have breast cancer? ____ out of ____
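From the frequency version as stated, the Bayesian answer can be read off directly: of the 8 + 99 = 107 women with a positive mammogram, 8 actually have breast cancer, that is, 8/107 or roughly 7.5%. The probability version, with its 10% false positive rate, yields the same value by Equation 1.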
Colorectal Cancer
The hemoccult test is a widely used and well-known test for colorectal cancer.
Windeler and Köbberling (1986) report that just as physicians overestimated
the (posterior) probability that a patient has colorectal cancer if the hemoccult
test is positive, they also overestimated the base rate of colorectal cancer, the
sensitivity (hit rate), and the false positive rate of the test. Windeler and
Köbberling asked these physicians about probabilities and percentages. Would nat-
ural frequencies improve physicians' estimates of what a positive test tells
about the presence of colorectal cancer? The 48 physicians in the study re-
ported previously were given the best available estimates for the base rate,
sensitivity, and false positive rate, as published in Windeler and Köbberling
(1986). The following is a shortened version of the full text (structured like
the mammography problem in Table 4.1) given to the physicians. In the prob-
ability format, the information was:
The probability that a person has colorectal cancer is 0.3%.
If a person has colorectal cancer, the probability that the test is positive
is 50%.
If a person does not have colorectal cancer, the probability that the test
is positive is 3%.
What is the probability that a person who tests positive actually has
colorectal cancer?
When one inserts these values in Bayes's rule (Equation 1), the resulting
probability is 4.8%. In natural frequencies, the information was:
30 out of every 10,000 people have colorectal cancer.
Of these 30 people with colorectal cancer, 15 will test positive.
Of the remaining 9,970 people without colorectal cancer, 300 will still
test positive.
Imagine a group of people who test positive. How many of these will
actually have colorectal cancer?
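Again the answer is nearly visible in the frequencies as stated: of the 15 + 300 = 315 people who test positive, 15 have colorectal cancer, that is, 15/315 or about 4.8%, the same value that Bayes's rule yields from the probability format.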
When the information was in the probability format, only 1 out of 24 phy-
sicians (4%) could find the Bayesian answer, or anything close to it. The me-
dian estimate was one order of magnitude higher, namely 47%. When the
information was presented in natural frequencies, 16 out of 24 physicians
(67%) came up with the Bayesian answer (details are in Gigerenzer, 1996b;
Hoffrage & Gigerenzer, 1998).
Wife Battering
Alan Dershowitz, the Harvard law professor who advised the defense in the
first O. J. Simpson trial, claimed repeatedly that evidence of abuse and batter-
ing should not be admissible in a murder trial. In his best-seller, Reasonable
Doubts: The Criminal Justice System and the O. J. Simpson Case (1996), Der-
showitz says: "The reality is that a majority of women who are killed are killed
by men with whom they have a relationship, regardless of whether their men
previously battered them. Battery, as such, is not a good independent predictor
of murder" (p. 105). Dershowitz stated on U.S. television in March 1995 that
only about one-tenth of 1% of wife batterers actually murder their wives. In
response to Dershowitz, I. J. Good, a distinguished professor emeritus of sta-
tistics at the Virginia Polytechnic Institute, published an article in Nature to
correct for the possible misunderstandings of what that statement implies for
the probability that O. J. Simpson actually murdered his wife in 1994 (Good,
1995). Good's argument is that the relevant probability is not the probability
that a husband murders his wife if he batters her. Instead, the relevant prob-
ability is the probability that a husband has murdered his wife if he battered
her and if she was actually murdered by someone. More precisely, the relevant
probability is not p(G|Bat) but p(G|Bat and M), in which G stands for "the
husband is guilty" (that is, committed the murder in 1994), Bat means that "the hus-
band battered his wife," and M means that "the wife was actually murdered
by somebody in 1994."
My point concerns the way Good presents his argument, not the argument
itself. Good presented the information in single-event probabilities and odds
(rather than in natural frequencies). I will first summarize Good's argument as
he made it. I hope I can demonstrate that you the reader—unless you are a
trained statistician or exceptionally smart with probabilities—will be confused
and have some difficulty following it. Thereafter, I will present the same ar-
gument in natural frequencies, and confusion should turn into insight. Let's
see.
Good bases his calculations of p(G|Bat and M) on the odds version of Bayes's
rule:
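In schematic form (a generic statement of the odds version, not a reproduction of Good's numbered equations), the posterior odds are the prior odds multiplied by the likelihood ratio:

odds(G | Bat and M) = odds(G | Bat) × [p(M | G and Bat) / p(M | -G and Bat)].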
4. Good possibly assumed that the average wife batterer is married less than 10 years.
Good also made a second calculation assuming a value of p(G\Bat) that is half as large.
Because there are about 25,000 murders per year in the U.S. population of
about 250,000,000, Good estimates the probability of a battered woman being
murdered, but not by her husband, as roughly 25,000/250,000,000, that is, about 1 in 10,000 in a given year.
From Equations Good-3 and Good-4, it follows that the likelihood ratio is about
10,000/1; therefore, the posterior odds can be calculated:
That is, the probability that a murdered, battered wife was killed by her hus-
band is:
Good's point is that "most members of a jury or of the public, not being
familiar with elementary probability, would readily confuse this with
P(G|Bat), and would thus be badly misled by Dershowitz's comment" (Good,
1995, p. 541). He adds that he sent a copy of this note to both Dershowitz and
the Los Angeles Police Department, reminding us that Bayesian reasoning
should be taught at the precollege level.
Good's persuasive argument, I believe, could have been understood more
easily by his readers and the Los Angeles Police Department if the information
had been presented in natural frequencies rather than in the single-event prob-
abilities and odds in the six equations. As with breast cancer and colorectal
cancer, one way to represent information in natural frequencies is to start with
a concrete sample of individuals divided into subclasses, in the same way it
would be experienced by natural sampling. Here is a frequency version of
Good's argument.
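One way to lay the frequency version out in round numbers (a sketch only, built from the figures quoted above; taking Dershowitz's 1 in 1,000 over roughly 10 years, as note 4 suggests, gives about 1 in 10,000 per year): think of 10,000 battered women followed for one year. About 1 of them will be murdered by her husband. Because roughly 25,000 out of 250,000,000 Americans, about 1 in 10,000, are murdered in a year, about 1 of the 10,000 women will be murdered by someone else. Of the approximately 2 battered women who are murdered, then, about 1 was killed by her husband. Seen this way, battering plus murder points to the husband in roughly half of the cases, a far cry from one-tenth of 1%.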
AIDS Counseling
Under the headline, "A False HIV Test Caused 18 Months of Hell," the Chicago
Tribune (3/5/93) published the following letter and response:
Dear Ann Landers: In March 1991, I went to an anonymous testing center
for a routine HIV test. In two weeks, the results came back positive.
I was devastated. I was 20 years old and doomed. I became severely
depressed and contemplated a variety of ways to commit suicide. After
encouragement from family and friends, I decided to fight back.
My doctors in Dallas told me that California had the best care for HIV
patients, so I packed everything and headed west. It took three months
to find a doctor I trusted. Before this physician would treat me, he in-
sisted on running more tests. Imagine my shock when the new results
came back negative. The doctor tested me again, and the results were
clearly negative.
I'm grateful to be healthy, but the 18 months I thought I had the virus
changed my life forever. I'm begging doctors to be more careful. I also
want to tell your readers to be sure and get a second opinion. I will
continue to be tested for HIV every six months, but I am no longer ter-
rified.
David in Dallas
Dear Dallas: Yours is truly a nightmare with a happy ending, but don't
blame the doctor. It's the lab that needs to shape up. The moral of your
story is this: Get a second opinion. And a third. Never trust a single test.
Ever.
Ann Landers
David does not mention what his Dallas doctors told him about the chances
that he actually had the virus after the positive test, but he seems to have
inferred that a positive test meant that he had the virus, period. In fact, when
we studied AIDS counselors in Germany, we found that many doctors and
social workers (erroneously) tell their low-risk clients that a positive HIV test
implies that the virus is present (see Chapter 5). These counselors know that
a single ELISA (enzyme-linked immunoabsorbent assay) test can produce a
false positive, but they erroneously assume that the whole series of ELISA and
Western blot tests would wipe out every false positive. How could a doctor
have explained the actual risk to David and spared him the nightmare?
I do not have HIV statistics for Dallas, so I will use German figures for
illustration. (The specific numbers are not the point here.) In Germany, the
Expert Witnesses
who tried to understand the arguments. I will explain this point with the case
of a chimney sweep who was accused of having committed a murder in Wup-
pertal, Germany (Schrage, n.d.).
The Rheinischer Merkur (No. 39, 1974) reported:
On the evening of July 20, 1972, the 40-year-old Wuppertal painter Wil-
helm Fink and his 37-year-old wife Ingeborg took a walk in the woods
and were attacked by a stranger. The husband was hit by three bullets
in the throat and the chest, and fell down. Then the stranger attempted
to rape his wife. When she defended herself and, unexpectedly, the shot-
down husband got back on his feet to help her, the stranger shot two
bullets into the wife's head and fled.
Three days later, 20 kilometers from the scene of the crime, a forest ranger
discovered the car of Werner Wiegand, a 25-year-old chimney sweep who used
to spend his weekends in the vicinity. The husband, who had survived, at first
thought he recognized the chimney sweep in a photo. Later, he grew less cer-
tain and began to think that another suspect was the murderer. When the other
suspect was found innocent, however, the prosecution came back to the chim-
ney sweep and put him on trial. The chimney sweep had no previous convic-
tions and denied being the murderer. The Rheinischer Merkur described the
trial:
After the experts had testified and explained their "probability theories,"
the case seemed to be clear: Wiegand, despite his denial, must have been
the murderer. Dr. Christian Rittner, a lecturer at the University of Bonn,
evaluated the traces of blood as follows: 17.29% of German citizens share
Wiegand's blood group, traces of which have been found underneath the
fingernails of the murdered woman; 15.69% of Germans share [her] blood
group that was also found on Wiegand's boots; based on a so-called
"cross-combination" the expert subsequently calculated an overall prob-
ability of 97.3% that Wiegand "can be considered the murderer." And
concerning the textile fiber traces which were found both on Wiegand's
clothes and on those of the victim. . . . Dr. Ernst Rohm from the Munich
branch of the State Crime Department explained: "The probability that
textile microfibers of this kind are transmitted from a human to another
human who was not in contact with the victim is at most 0.06%. From
this results a 99.94% certainty for Wiegand being the murderer."
Both expert witnesses agreed that, with a high probability, the chimney
sweep was the murderer. These expert calculations, however, collapsed when
the court discovered that the defendant was in his hometown, 100 kilometers
away from the scene of the crime at the time of the crime.
So what was wrong with the expert calculations? One can dispel the con-
fusion in court by representing the uncertainties in natural frequencies. Let
us assume that the blood underneath the fingernails of the victim was indeed
the blood of the murderer, that the murderer carried traces of the victim's
blood (as the expert witnesses assumed), and that there were 10 million men
in Germany who could have committed the crime (these and the following
figures are from Schrage, n.d., but the specific figures do not matter for my
argument). Let us assume further that on one of every 100 of these men a close
examination would find microscopic traces of foreign blood, that is, on 100,000
men. Of these, some 15,690 men (15.69%) will carry traces from blood that
is of the victim's blood type. Of these 15,690 men, some 2,710 (17.29%) will
also have the blood type that was found underneath the victim's fingernails
(here, I assume independence between the two pieces of evidence). Thus there
are some 2,710 men (including the murderer) who might appear guilty based
on the two pieces of blood evidence. The chimney sweep is one of these men.
Therefore, given the two pieces of blood evidence, the probability that the
chimney sweep is the murderer is about 1 in 2,710, and not 97.3%, as the
first expert witness testified.
The same frequency method can be applied to the textile traces. Let us
assume that the second expert witness was correct when he said that the prob-
ability of the chimney sweep carrying the textile trace, if he were not the
murderer, would be at most 0.06%. Let us assume as well that the murderer
actually carries that trace. Then some 6,000 of the 10 million would carry this
textile trace, and only one of them would be the murderer. Thus the probability
that the chimney sweep was the murderer, given the textile fiber evidence, was
about 1 in 6,000, and not 99.94%, as the second expert witness testified.
What if one combines the blood and the textile evidence,
which seems not to have happened at the trial? In this case, one of the 2,710
men who satisfy both pieces of blood type evidence would be the murderer,
and he would show the textile traces. Of the remaining innocent men, we
expect one or two (0.06%) to also show the textile traces (assuming mutual
independence of the three pieces of evidence). Thus there would be two or
three men who satisfy all three types of evidence. One of them is the murderer.
Therefore, the probability that the chimney sweep was the murderer, given the
two pieces of blood sample evidence and the textile evidence, would be be-
tween .3 and .5. This probability would not be beyond reasonable doubt.
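For readers who wish to check the arithmetic, here is a minimal Python sketch of the natural frequency argument. It uses the figures cited above from Schrage (n.d.) and, like the text, assumes mutual independence of the pieces of evidence; the variable names are mine.

population = 10_000_000      # men who could have committed the crime
foreign_blood = 0.01         # 1 in 100 carries microscopic traces of foreign blood
victim_type = 0.1569         # 15.69% share the victim's blood group
nail_type = 0.1729           # 17.29% share the blood group found under the fingernails
textile = 0.0006             # 0.06% carry the textile microfibers without contact

blood_matches = population * foreign_blood * victim_type * nail_type
print(f"blood evidence: 1 in {blood_matches:,.0f}")          # about 1 in 2,710

textile_matches = population * textile
print(f"textile evidence: 1 in {textile_matches:,.0f}")      # about 1 in 6,000

# Combined: the murderer matches everything; of the remaining innocent
# blood matches, about 0.06% also carry the fibers.
innocent_with_fibers = (blood_matches - 1) * textile
print(f"all evidence: {1 / (1 + innocent_with_fibers):.2f}")  # roughly .3 to .5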
The teaching of statistical reasoning is, like that of reading and writing, part
of forming an educated citizenship. Our technological world, with its abun-
dance of statistical information, makes the art of dealing with uncertain infor-
mation particularly relevant. Reading and writing is taught to every child in
modern Western democracies, but statistical thinking is not (Shaughnessy,
1992). The result has been termed "innumeracy" (Paulos, 1988). But can sta-
tistical reasoning be taught? Previous studies that attempted to teach Bayesian
inference, mostly by corrective feedback, had little or no training effect (e.g.,
Peterson, DuCharme, & Edwards, 1968; Schaefer, 1976). This result seems to
be consistent with the view that the mind does not naturally reason the Bay-
esian way. However, the argument developed in this chapter suggests a "nat-
Conclusions
The work on which this chapter is based was coauthored with U. Hoffrage and
A. Ebert.
5
AIDS Counseling for Low-Risk Clients
government, for instance, has encouraged voluntary testing to the point that
"people who are unlikely to be infected are the ones who take the test, in
droves" (Mansson, 1990). Involuntary testing is a legal possibility in several
countries, one that insurers exploit to protect themselves against losses. For
instance, in 1990, Bill Clinton (then governor of Arkansas) had to take an HIV
test to get his life insurance renewed. People with low-risk behavior may be
subjected to HIV tests not only involuntarily but also unknowingly. For in-
stance, large companies in Bombay have reportedly subjected their employees
to blood tests without telling them that they were being tested for AIDS; when
a test was positive, the employee was fired.
Counseling people at low risk requires paying particular attention to false
positives, that is, to the possibility that the client has a positive HIV test even
though he or she is not infected with the virus. The lower the prevalence of
HIV in a group, the larger the proportion of false positives among those who
test positive. In other words, if a client with high-risk behavior tests positive,
the probability that he actually is infected with HIV is very high, but if some-
one with low-risk behavior tests positive, this probability may be as low as
50%, as indicated previously. If clients are not informed about this fact, they
tend to believe that a positive test means that they are infected with absolute
certainty. The case of the young man from Dallas described in the previous
chapter is one example. If he had committed suicide, as the blood donors in
the Florida case did, we might never have found out that his test was a false
positive. Emotional pain and lives can be saved if counselors inform the clients
about the possibility of false positives.1
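A small Python sketch can make this relationship visible. The test characteristics below (sensitivity 99.8%, false positive rate 0.01%) are the estimates used later in this chapter; they serve here only to illustrate how the share of false positives among positive tests grows as prevalence falls.

def ppv(prevalence, sensitivity=0.998, false_positive_rate=0.0001):
    """Probability of infection given a positive test (Bayes's rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

for prevalence in (0.10, 0.015, 0.001, 0.0001):
    share_false = 1 - ppv(prevalence)
    print(f"prevalence {prevalence:>7.2%}: "
          f"false positives are {share_false:.1%} of all positive tests")
# The share ranges from a fraction of a percent in high-prevalence groups
# to about 50% when the prevalence is 1 in 10,000.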
We do not know of any study that has investigated what AIDS counselors
tell their clients about the meaning of a positive test. We pondered long over
the proper methodology, such as sending questionnaires to counselors or ask-
ing them to participate in paper-and-pencil tests. However, we decided against
questionnaires and similar methods because they are open to the criticism that
they tell us little about actual counseling sessions. For instance, these methods
have been criticized for not allowing physicians to pose their own questions
to get further information, to use their own estimates of the relevant statistical
information rather than those provided by the experimenter, and for removing
the element of actual concern for the patient, because either the patient is
fictional or the case was resolved years ago.
In the end, we decided to take a direct route. One of us went as a client to
20 counseling sites and took a series of counseling sessions and HIV tests. We
1. In their review of suicidal behavior and HIV infection, Catalan and Pugh (1995)
conclude that "suicidal ideas, completed suicide and deliberate self-harm are not un-
common in people with HIV-infection" (p. 119). However, as they themselves point out,
the evidence is far from conclusive: Many reports are anecdotal or involve few cases,
results vary between countries, and methodological problems make matching with com-
parison groups difficult (e.g., Marzuk & Perry, 1993; Pugh et al., 1993). A recent pro-
spective cohort study that controlled for several factors found a 1.35-fold increase in
suicides in HIV-positives relative to HIV-negatives, whereas earlier studies reported a 7-
to 36-fold increase in risk for HIV-positives (Dannenberg et al., 1996).
were interested in one important issue that AIDS counselors have to explain
to the client: What does a positive test result mean? To answer this question,
it is useful to know: (a) the base rate of HIV in heterosexual men with low
risk, which is referred to as the prevalence, (b) the probability that the test is
positive if the client is infected, which is referred to as the sensitivity (or hit
rate) of the test, and (c) the probability that the test is positive if the client is
not infected, which is known as the false positive rate (or 1 — specificity).
From this information, one can estimate what a positive test actually means,
that is, the probability of being infected if one tests positive, also known as
the positive predictive value (PPV). Let us first get the best estimates for these
values from the literature.
Prevalence
Germany has a relatively small number of reported AIDS cases. The cumula-
tive number by the end of 1995 was 13,665, as compared with some 30,000
in Italy, 38,000 in France, and more than 500,000 in the USA (World Health
Organization, 1996). Thus one can assume that the prevalence of HIV is also
comparatively low. The client in our study was 27 years old, a German het-
erosexual male who did not engage in risky behavior. What is the prevalence
of the HIV virus in 20- to 30-year-old heterosexual men in Germany who do
not engage in risky behavior? A reasonable estimate is about one in 10,000 or
0.01%.2 This figure is in the range of the prevalence of HIV in blood donors
in the United States (a group with low prevalence within the United States),
2. This is a crude estimate, given that there seem to be no published figures for the
prevalence of HIV for 20- to 30-year-old men with low-risk behavior in Germany. This
value is based on two approximations. One is to estimate the unknown prevalence by
the known prevalence in first-time male blood donors (as opposed to repeat donors, who
are a highly selected group). The proportion of HIV-positives in some 130,000 first-time
male blood donors (1986-1991, state of Baden-Württemberg) was 1.5 in 10,000 (Maurer
et al., 1993). For comparison, the proportion among repeat donors was one order of
magnitude smaller, about 1.2 in 100,000 (Maurer et al., 1993). Because false positives
occur, the proportion of men actually infected is smaller than 1.5 in 10,000. This esti-
mate is crude in several respects; for instance, it does not differentiate by age group and
assumes that men with low-risk behavior are comparable to first-time blood donors.
A second way to estimate the unknown prevalence is by the proportion of HIV-
positives who report infection through heterosexual contact. Dietz et al. (1994, p. 1998)
found that 3.8% of HIV-positives reported that they were infected by heterosexual con-
tact, as opposed to homosexual/bisexual behavior, injecting drug use, and other risks
(for similar figures see Glück et al., 1990; Hoffman-Valentin, 1991; Schering, 1992). In
1994, when our study was begun, the number of HIV-positives in Germany was about
65,000, of which some 29% were in the 20- to 30-year-old age group. If one assumes
that the figure of 3.8% also holds for this age group, this results in an estimated 700
HIV-positives in this age group reporting infection through heterosexual contact. Be-
cause in 1994 there were an estimated 6 million German men between 20 and 30 who
were in no known risk group (a total of 6,718,500 men minus an estimated 11% who
belong to one or more of the known risk groups; see Statistisches Bundesamt, 1994), the
proportion of HIV-positives who report infection through heterosexual contact can be
estimated as 1.2 in 10,000.
which has been estimated at one in 10,000 (Busch, 1994, p. 229) or two in
10,000 (George & Schochetman, 1994, p. 90).
HIV testing typically involves the following sequence. If the first test, ELISA,
is negative, the client is notified that he or she is HIV-negative. If positive, at
least one more ELISA (preferably from a different manufacturer) is conducted.
If the result is again positive, then the more expensive and time-consuming
Western blot test is performed. If the Western blot is also positive, then the
client is notified of being HIV-positive, and sometimes a second blood sample
is also tested. Thus two errors can occur. First, a client who is infected is
notified that he is HIV-negative. The probability of this error (false negative)
is the complement of the sensitivity of the ELISA test. The estimates for the
sensitivity typically range between 98% and 99.8% (Eberle et al., 1988; George
& Schochetman, 1994; Schwartz et al., 1990; Spielberg et al., 1989; Tu et al.,
1992; Wilber, 1991). Second, a client who is not infected is notified of being
HIV-positive. The probability of this second error (false positive) is the com-
plement of the combined specificity of the ELISA and Western blot tests. Al-
though all surveys agree that false positives do occur, the quantitative estimates
vary widely.3 This is in part due to the fact that what constitutes a positive
Western blot test has not been standardized (various agencies use different
reagents, testing methods, and test-interpretation criteria), that the ELISAs and
the Western blot tests are not independent (that is, one cannot simply multiply
the individual false positive rates of the tests to calculate the combined false
positive rate), and that the higher the prevalence in a group, the lower the
specificity seems to be for this group (Wittkowski, 1989). For instance, 20 sam-
ples—half with HIV antibodies and half without (the laboratories were not
informed which samples were which)—were sent in 1990 to each of 103 lab-
oratories in six World Health Organization (WHO) regions (Snell et al., 1992).
Both ways to estimate the unknown prevalence give consistent numbers; neverthe-
less, they should only be taken as rough approximations. Because not all of these HIV-
positives have the virus (due to false positives), we need to correct these numbers down-
ward. A prevalence of about 1 in 10,000 seems to be a reasonable estimate for the
unknown prevalence of the HIV virus in 20- to 30-year-old heterosexual German men
with low-risk behavior.
3. Among the reasons for false positives are the presence of cross-reacting antibodies
(Stine, 1996); false positive reactions with nonspecifically "sticky" IgM antibodies (Ep-
stein, 1994, p. 56); false positives from samples placed in the wrong wells; and contam-
ination of wells containing negative specimens by positive samples from adjacent wells.
In addition, heat-treated, lipemic, and hemolyzed sera may cause false positives; false
positive results have been reported to occur in 19% of hemophilia patients and in 13%
of alcoholic patients with hepatitis (George & Schochetman, 1994, p. 69). People who
have liver disease, have received a blood transfusion or gamma globulin within six
weeks of the test, or have received vaccines for influenza and hepatitis B may test false
positive as well (Stine, 1996, p. 333).
What the client needs to understand is the probability of being infected with
HIV if he tests positive. The predictive value of a positive test (PPV) can be
calculated from the prevalence p(HIV), the sensitivity p(pos|HIV), and the
false positive rate p(pos|no HIV):

PPV = p(HIV) p(pos|HIV) / [p(HIV) p(pos|HIV) + p(no HIV) p(pos|no HIV)],     (1)

where p(no HIV) equals 1 - p(HIV). Equation 1 is known as Bayes's rule. This
rule expresses the important fact that the smaller the prevalence, the smaller
the probability that a client is infected if the test is positive. What is the pre-
dictive value of a positive test (repeated ELISA and Western blot, one blood
sample) for a 20- to 30-year-old heterosexual German man who does not engage
in risky behavior? Inserting the previous estimates—a prevalence of 0.01%, a
sensitivity of 99.8%, and a specificity of 99.99%—into Bayes's rule, the PPV
results in 0.50, or 50%.
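The same calculation in a few lines of Python, plugging the estimates from the text into Bayes's rule:

prevalence = 0.0001      # 1 in 10,000
sensitivity = 0.998      # p(pos | HIV)
specificity = 0.9999     # p(neg | no HIV), so the false positive rate is 0.0001

ppv = (prevalence * sensitivity) / (
    prevalence * sensitivity + (1 - prevalence) * (1 - specificity)
)
print(round(ppv, 2))     # 0.5 -- about a 50% chance of infection after a positive test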
An estimated PPV of about 50% for heterosexual men who do not engage
in risky behavior is consistent with the report of the Enquete Committee of the
German Bundestag, which estimated the PPV for low-risk people as "less than
50%" (Deutscher Bundestag, 1990, p. 121).
Even if a counselor understands this formula, ordinary people rarely do. More-
over, we know from paper-and-pencil studies in the United States and in Ger-
many that even experienced physicians have great difficulties when asked to
infer the PPV from probability information. But we also have seen in Chapter
4 that physicians' performance can be substantially improved, by a factor of
more than four, if the information is presented in natural frequencies rather
than in terms of probabilities or percentages.
How would a counselor communicate information in natural frequencies?
She might explain to the patient the meaning of a positive test in the following
way: "Imagine 10,000 heterosexual men like you being tested. One has the
virus, and he will with practical certainty test positive. Of the remaining non-
infected men, one will also test positive (the false positive rate of 0.01%). Thus
we expect that two men will test positive, and only one of them has HIV. This
is the situation you are in if you test positive; the chance of having the virus
is one out of two, or 50%."
This simple method can be applied whatever the relevant numbers are as-
sumed to be. If the prevalence is two in 10,000, the PPV would be two out of
three, or 67%. The numbers can be adjusted; the point is that clients can
understand more easily if the counselor communicates in natural frequencies
than in probabilities. With a frequency representation the client can "see" how
the PPV depends on the prevalence. If the prevalence of HIV among German
homosexuals is about 1.5%, then the counselor might explain: "Think of
10,000 homosexual men like you. About 150 have the virus, and they all will
likely test positive. Of the remaining noninfected men, one will also test pos-
itive. Thus we expect that 151 men will test positive, and 150 of them have
HIV. This is the situation you are in if you test positive; the chance of having
the virus is 150 out of 151, or 99.3%."
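Because the translation into natural frequencies is purely mechanical, it can even be automated. A sketch of such a helper (hypothetical; the function and its wording are mine, and the test characteristics are the estimates used above):

def frequency_explanation(prevalence, sensitivity=0.998, false_positive_rate=0.0001,
                          group=10_000):
    """Phrase the meaning of a positive test as expected counts in a reference group."""
    infected = prevalence * group
    true_pos = round(infected * sensitivity)
    false_pos = round((group - infected) * false_positive_rate)
    total = true_pos + false_pos
    return (f"Think of {group:,} people like you. About {round(infected)} have the virus, "
            f"and they will almost certainly test positive. Of the remaining noninfected, "
            f"about {false_pos} will also test positive. Thus we expect {total} positive "
            f"tests, of which {true_pos} ({true_pos / total:.0%}) reflect a true infection.")

print(frequency_explanation(0.0001))   # low-risk client: 1 of 2 positives is infected
print(frequency_explanation(0.015))    # prevalence 1.5%: about 150 of 151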
In general, the PPV is the number of true positives (TP) divided by the
number of true positives plus false positives (FP):

PPV = TP / (TP + FP).
Some 300 German public health centers ("Gesundheitsämter") offer free HIV
tests and AIDS counseling for the general public. By 1990, these centers had
hired 315 counselors, 43% of whom were physicians, 22% social workers, and
7% psychologists. The rest had various professional training (Fischer, 1990).
As in other countries, counseling before testing is designed to make sure that
the client understands the testing procedure, the risks for HIV infection, and
the meaning of either a positive or negative test (Ward, 1994). The report
of the Enquete Committee of the German Bundestag (Deutscher Bundestag,
1990, p. 122) directs the counselor explicitly to perform a "quantitative and
qualitative assessment of the individual risk" and to "explain the reliability of
the test result" before a test is taken. If the client decides to take a test, ano-
nymity is guaranteed in all German states (unlike in the United States, where
in 25 states the patient's name is reported; Stine, 1996, p. 346). Counseling
requires both social tact and knowledge about the uncertainties involved in
testing, and the fact that in 1990 about 37% of clients tested at publicly funded
clinics in the United States failed to return for their test results suggests that
counseling is not always successful (Doll & Kennedy, 1994).
Method
Counseling Centers
The Interview
The client asked the counselor the following questions (unless the counselor
provided the information unprompted):
1. Sensitivity of the HIV test. If one is infected with HIV, is it possible
to have a negative test result? How reliably does the test identify the
virus if the virus is present?
2. False positives. If one is not infected with HIV, is it possible to have
a positive test result? How reliable is the test with respect to a false
positive result?
3. Prevalence of HIV in heterosexual men. How frequent is the virus in
my risk group, that is, heterosexual men, 20 to 30 years old, with no
known risk such as drug use?
4. Predictive value of a positive test. What is the probability that men in
my risk group actually do have HIV after a positive test?
5. Window period. How much time has to pass between infection and
test, so that antibodies can be detected?
The pilot study indicated a tendency in counselors to provide vague and
noninformative answers, such as, "Don't worry; the test is very reliable; trust
me." It also indicated that if the client asked for clarification more than twice,
the counselors were likely to become upset and angry, experiencing the client's
insistence on clarification as a violation of social norms of communication.
Based on these pilot sessions, the interview included the following scheme for
clarifying questions: If the counselor's answer was a quantitative estimate (a
number or a range) or if the counselor said that he or she could not (or did
not want to) give a more precise answer, then the client went on to the next
question. If the answer was qualitative (e.g., "fairly certain") or if the counselor
misunderstood or avoided answering the question, then the client asked for
further clarification and, if necessary, repeated this request for clarification one
more time. If, after the third attempt, there was still no success, the client did
not push further and went on to the next question. When the client needed
to ask for clarification concerning the prevalence of HIV (Question 3), he always
repeated his specific risk group; when asking for clarification concerning the
PPV (Question 4), he always referred to the specific prevalence in his risk group.
As mentioned previously, when the client asked for the prevalence of HIV
in his risk group, he specified this group as "heterosexual men, 20 to 30 years
old, with no known risk such as drug use." When counselors asked for more
information, which happened in only 11 of the sessions, the client explained
that he was 27 years old, monogamous, and that neither his current nor his
(few) previous sexual partners used drugs or engaged in risky behavior. In two
of these 11 cases, the client was given a detailed questionnaire to determine
his risk; in one of these cases the counselor did not look at the questionnaire,
and the client still had it in his hands when he left the site.
The client was trained in simulated sessions to use a coding system (ques-
tion number; number of repetitions of a question; the counselor's answer at
each repetition; e.g., "1; 2; 99.9%") that allowed him to write down the rele-
vant information in shorthand during the counseling or, if the session was very
brief, to rehearse the code in memory and write it down immediately after the
counseling session.
After the counseling session, the client took the HIV test, except for three
cases (in two he would have had to wait several hours to take the test, and in
one case the counselor suggested that the client might first consider it over-
night before making the decision of whether or not to take the test).
Results
Four counseling sessions are shown, for illustration, in Table 5.1. The client's
questions are abbreviated (e.g., Sensitivity?), and the information provided by the
the counselor directly follows the question. The counselors' answers to the
client's clarifying questions are preceded by a dash in subsequent lines.
Table 5.1 Four sample counseling sessions
Session 1. The counselor is a female social worker
Sensitivity? False negatives really never occur. Although, if I think about the literature,
there were reports about such cases.
—I don't know exactly how many.
—It happened only once or twice.
False positives? No, because the test is repeated; it is absolutely certain.
—If there are antibodies, the test identifies them unambiguously and with absolute
certainty.
—No, it is absolutely impossible that there are false positives; by repeating the test
it is absolutely certain.
Prevalence? I can't tell you this exactly.
—Between about 1 of 500 to 1 of 1,000.
Positive predictive value? As I have now told you repeatedly, the test is absolutely cer-
tain.
False Positives
Prevalence
The question concerning the prevalence of HIV in heterosexual men with low-
risk behavior produced the most uncertainty among the counselors. Sixteen of
20 (all counselors responded) expressed uncertainty or ignorance or argued that
the prevalence for heterosexual men with low-risk behavior cannot be deter-
mined (e.g., because of unreported cases) or that it would be of no use for the
individual case (e.g., Session 2). Several counselors searched for publications
in response to the client's question but found only irrelevant statistics, such as
the large number of HIV-positives in West Berlin: "The Wall was the best con-
dom for East Berlin," one counselor answered. Twelve counselors provided
numerical estimates, with a median of 0.1%. The variability of the estimates
was considerable (Table 5.2), including the extreme estimate that in people
such as the client an HIV infection is "less probable than winning the lottery
three times" (we have not included this value in Table 5.2). Four counselors
asserted that information concerning prevalence is of little or no use: "But
statistics don't help us in the individual case—and we also have no precise
data" (see also Sessions 2 and 3). Two counselors said that they have problems
remembering numbers or reasoning with numbers; for instance: "I have diffi-
culties reasoning with statistical information. It's about groups and the transfer
is problematic. It reminds me of playing the lottery. The probability of getting
all six correct is very small; nevertheless, every week someone wins."
Recall that under the currently available estimates, only some 50% of heter-
osexual German men with low-risk behavior actually have HIV if they test
positive. The information provided by the counselors was quite different. Half
of the counselors (10 of 18; two repeatedly ignored this question) told the client
that if he tested positive it was absolutely certain (100%) that he has HIV (Table
5.2 and Session 1). Five told him that the probability is 99.9% or higher (e.g.,
Session 3). Thus, if the client had tested positive and trusted the information
provided by these 15 counselors, he might indeed have contemplated suicide,
as many have before (Stine, 1996).
How did the counselors arrive at this inflated estimate of the predictive
value? They seemed to have two lines of thought. A total of eight counselors
confused the sensitivity with the PPV (a confusion also reported by Eddy, 1982,
and Elstein, 1988); that is, they gave the same number for the sensitivity and
the PPV (e.g., Sessions 2 and 3). Three of these eight counselors explained
that, except for the window period, the sensitivity is 100% and therefore the
PPV was also 100%. Another five counselors reasoned by the second strategy.
They (erroneously) assumed that false positives would be eliminated through
repeated testing and concluded from this (consistently) that the PPV is 100%.
For both groups, the client's question concerning the PPV must have appeared
as one they had already answered. In fact, more than half of the counselors
(11 of 18) explicitly introduced their answers with a phrase such as, "As I
have already said . . ." (e.g., Sessions 1-3). Consistent with this observation,
the answers to the question concerning the PPV came rather quickly, and the
client did not need to ask for clarification as often as before. The average num-
ber of questions asked by the client on the PPV was only 1.8, compared to 2.4,
2.4, and 2.5 for sensitivity, specificity, and prevalence, respectively.
Table 5.2 lists two counselors who provided estimates of the PPV in the
correct direction (between 99% and 90%). Only one of these (Session 4), how-
ever, arrived at this estimate by reasoning that the proportion of false positives
among all positives increases when the prevalence decreases. She was also the
only one who explained to the client that there are reasons for false positives
that cannot be eliminated by repeated testing, such as the test reacting to
antibodies that it mistakes for HIV antibodies. The second counselor first
asserted that after a positive test an HIV infection is "completely certain," but
when the client asked what "completely certain" meant, the physician had
second thoughts and said that the PPV is "at least in the upper 90s" and "I
can't be more exact."
There was not a single counselor who communicated the information in nat-
ural frequencies, the representation physicians and laypeople can understand
best. Except for the prevalence of HIV, all numerical information was com-
municated to the client in terms of percentages. The four sessions in Table 5.1
illustrate this fact. As a consequence, clients will most likely not understand,
and several counselors also seemed not to understand the numbers they were
communicating. This can be inferred from the fact that several counselors gave
the client inconsistent pieces of information but seemed not to notice.
Two examples illustrate this disturbing fact. One physician told the client
that the prevalence of HIV in men such as the client is 0.1% or slightly higher,
and the sensitivity, specificity, and the PPV are each 99.9%. To see that this
information is contradictory, we represent it in natural frequencies. Imagine
1,000 men taking an HIV test. One of these men (0.1%) is infected, and he will
test positive with practical certainty. Of the remaining uninfected men, one
will also test positive (because the specificity is assumed to be 99.9%, which
implies a false positive rate of 0.1%). Thus two test positive, and one of them
is infected. Therefore, the odds of being infected with HIV are 1 to 1 (50%),
and not 999 to 1 (99.9%). (Even if the physician assumed a prevalence of 0.5%,
the odds are 5 to 1 rather than 999 to 1.)
Next consider the information the client received in Session 2. Assume for
the prevalence (which the counselor did not provide) the median estimate of
the other counselors, namely 0.1%. Again imagine 1,000 men. One has the
virus, and he will test positive with practical certainty (the counselor's esti-
mated sensitivity: 99.8%). Of the remaining uninfected men, three will also
test positive (the counselor's estimated specificity: 99.7%). Thus we expect
four to test positive, one of whom actually has the virus. Therefore, the prob-
ability of being infected if the test is positive is 25% (one in four), not 99.8%
as the counselor told the client.
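Translating a counselor's own numbers into expected frequencies is also an easy consistency check. A minimal sketch using the two sets of figures just discussed:

def implied_ppv(prevalence, sensitivity, specificity, group=1_000):
    """PPV implied by the stated figures, via expected counts in a group of 1,000."""
    infected = prevalence * group
    true_pos = infected * sensitivity
    false_pos = (group - infected) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# Physician 1: prevalence 0.1%, sensitivity and specificity each 99.9%,
# but a claimed PPV of 99.9%. The stated numbers imply odds of about 1 to 1.
print(round(implied_ppv(0.001, 0.999, 0.999), 2))   # 0.5

# Session 2: sensitivity 99.8%, specificity 99.7%, claimed PPV 99.8%
# (no prevalence given; the other counselors' median of 0.1% is assumed).
print(round(implied_ppv(0.001, 0.998, 0.997), 2))   # 0.25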
If the counselors had been trained to represent information in natural fre-
quencies, these inconsistencies could have been easily detected. But the coun-
selors seem to have had no training in how to represent and communicate
information concerning risk. A hypothetical session in which an "ideal" coun-
selor uses natural frequencies is given here. Because the client did not find
such a counselor, the following session is fictional:
Sensitivity? The test will be positive in about 998 of 1,000 persons with
an HIV infection. Depending on circumstances, such as the specific tests
used, this estimate can vary slightly.
False positives? About 1 in 10,000. False positives can be largely reduced
by repeated testing (ELISA and Western blot), but not completely elim-
inated. Among the reasons for false positives are . . .
Prevalence? About 1 in 10,000 German heterosexual men with low-risk
behavior is HIV infected.
Positive predictive value? Think about 10,000 heterosexual men like you.
One is infected, and he will test positive with practical certainty. Of the
remaining noninfected men, one will also test positive. Thus we expect
that two men will test positive, and only one of them has HIV. This is
the situation you are in if you test positive. Your chance of having the
virus is about 1 in 2.
Do the brochures available in AIDS centers, a source from which the coun-
selors might draw, provide help in understanding what a positive test means?
Conclusions
This study shows, for a sample of public AIDS counseling centers in Germany,
that counselors were not prepared to explain to a man with low-risk behavior
what it would mean if he tested positive for HIV. This is not to say that the
counselors were generally ignorant; on the contrary, several counselors gave
long and sophisticated lectures concerning immunodiagnostic techniques, the
nature of proteins, and the pathways of infection. But when it came to ex-
plaining to the client the risk of being infected if he tests positive, there was
a lack of information as well as a lack of knowledge of how to communicate
risks.
The key problems identified in this study are:
2. Only one of 20 counselors (Session 4) explained the fact that the lower
the prevalence, the higher the proportion of false positives among
positive tests.
3. A majority of counselors incorrectly assured the client that false pos-
itives would never occur. Counselors had a simple, deterministic ex-
planation: False positives would be eliminated through repeated test-
ing (and similarly, false negatives would be eliminated after the
window period).
4. Half of the counselors asserted incorrectly that if a low-risk person
tests positive, it is absolutely certain (100%) that he is infected with
the virus. Counselors arrived at this erroneous judgment by one of
two strategies. One group confused the sensitivity of the test with the
PPV. A second group assumed that there are no false positives be-
cause of repeated tests, which implies that a positive test indicates an
infection with absolute certainty.
We do not know how representative these results are for AIDS counseling
of low-risk client groups in other centers in Germany or in other countries.
This study seems to be the first one of this kind, but there is no reason to
believe that the sample of counseling centers visited is not representative of
Germany (more precisely, the former West Germany). The lesson of this study is the
importance of teaching counselors how to explain to clients in simple terms
the risks involved. The counselors need rough estimates of false positives,
sensitivity, and the prevalence of HIV in various risk groups. Then they can
be taught to communicate this information in an understandable way. Exper-
imental evidence suggests that the most efficient and simple method is to train
counselors to represent the relevant information in natural frequencies and to
communicate it to the client in the same way.4 Such training takes little time,
is cost-effective, and participants do not show the usual decay over time of
what they have learned (Chapter 4).
The competence to explain in simple language what a positive result means
is certainly not all that a counselor needs to be able to do, but it is an important
part. Proper information may prevent self-destructive reactions in clients.
These reactions are avoidable, an unnecessary toll on top of the one the disease
itself takes from humankind.
4. There is also experimental evidence that the error made most often by AIDS coun-
selors in this study, confusing the sensitivity with the PPV of the test, is markedly re-
duced (from 19% to 5% of all diagnostic inferences) when information is represented
in terms of frequencies rather than probabilities (Hoffrage & Gigerenzer, 1998).
6
How to Improve Bayesian Reasoning
without Instruction
cally neglect base rates in Bayesian inference problems. "The genuineness, the
robustness, and the generality of the base-rate fallacy are matters of established
fact" (Bar-Hillel, 1980, p. 215). Bayes's rule, like Bernoulli's theorem, was no
longer thought to describe the workings of the mind. But passion and desire
were no longer blamed as the causes of the disturbances. The new claim was
stronger. The discrepancies were taken as tentative evidence that "people do
not appear to follow the calculus of chance or the statistical theory of predic-
tion" (Kahneman & Tversky, 1973, p. 237). It was proposed that as a result of
"limited information-processing abilities" (Lichtenstein, Fischhoff, & Phillips,
1982, p. 333), people are doomed to compute the probability of an event by
crude, nonstatistical rules such as the "representativeness heuristic."
Here is the problem. There are contradictory claims as to whether people
naturally reason according to Bayesian inference. The two extremes are rep-
resented by the Enlightenment probabilists and by proponents of the
heuristics-and-biases program. Their conflict cannot be resolved by finding fur-
ther examples of good or bad reasoning; text problems generating one or the
other can always be designed. Our particular difficulty is that after more than
two decades of research, we still know little about the cognitive processes
underlying human inference, Bayesian or otherwise. This is not to say that
there have been no attempts to specify these processes. For instance, it is un-
derstandable that when the "representativeness heuristic" was first proposed
in the early 1970s to explain base-rate neglect, it was only loosely defined. Yet
at present, representativeness remains a vague and ill-defined notion. For some
time it was hoped that factors such as "concreteness," "vividness," "causal-
ity," "salience," "specificity," "extremeness," and "relevance" of base-rate in-
formation would be adequate to explain why base-rate neglect seemed to come
and go (e.g., Ajzen, 1977; Bar-Hillel, 1980; Borgida & Brekke, 1981). However,
these factors have led neither to an integrative theory nor even to specific
models of underlying processes (Hammond, 1990; Koehler, 1996; Lopes, 1991;
Scholz, 1987).
Some have suggested that there is perhaps something to be said for both
sides, that the truth lies somewhere in the middle: Maybe the mind does a
little of both Bayesian computation and quick-and-dirty inference. This com-
promise avoids the polarization of views but makes no progress on the theo-
retical front.
Both views, however, are based on an incomplete analysis: They focus on
cognitive processes, Bayesian or otherwise, without making the connection
between what we will call a cognitive algorithm and an information format.
We (a) provide a theoretical framework that specifies why frequency formats
should improve Bayesian reasoning and (b) present two studies that test
whether they do. Our goal is to lead research on Bayesian inference out of the
present conceptual cul-de-sac and to shift the focus from human errors to hu-
man engineering (see Edwards & von Winterfeldt, 1986): how to help people
reason the Bayesian way without even teaching them.
ters illustrate how one can help experts turn innumeracy into insight. In this
chapter, we will explore in more depth the differences between natural fre-
quencies and probabilities and extend this analysis to other representations.
Natural Sampling
Figure 6.1 (a) Natural frequencies; (b) absolute frequencies that are not natu-
ral frequencies (obtained by systematic sampling or by normalizing natural
frequencies with respect to base rates); (c) relative frequencies or probabili-
ties. H = hypothesis; D = data.
The result is 0.078. However, physicians, college students, and staff at Harvard
Medical School all have equally great difficulties with this and similar medical
problems and typically estimate the posterior probability p(cancer|positive)
to be between 70% and 80%, rather than 7.8% (Chapter 4).
The experimenters who have amassed the apparently damning body of ev-
idence that humans fail to meet the norms of Bayesian inference have usually
given their research participants information in the standard probability format
(or its variant, in which one or more of the three percentages are relative fre-
quencies; see below). Studies on the cab problem (Bar-Hillel, 1980), the light-
bulb problem (Lyon & Slovic, 1976), and various disease problems (Casscells
et al., 1978; Eddy, 1982; Hammerton, 1973) are examples. Results from these
and other studies have generally been taken as evidence that the human mind
does not reason with Bayesian algorithms. Yet this conclusion is not war-
ranted, as explained before. One would be unable to detect a Bayesian algo-
rithm within a system by feeding it information in a representation that does
not match the representation with which the algorithm works.
In the last few decades, the standard probability format has become a com-
mon way to communicate information ranging from medical and statistical
textbooks to psychological experiments. But we should keep in mind that it is
only one of many mathematically equivalent ways of representing information;
it is, moreover, a recently invented notation. Neither the standard probability
format nor Equation 1 was used in Bayes's (1763) original essay. As Figure 6.2
shows, with natural frequencies one does not need a pocket calculator to es-
timate the Bayesian posterior. All one needs is the number of cases that had
both the symptom and the disease (here, 8) and the number of symptom cases
(here, 8 + 95). A Bayesian algorithm for computing the posterior probability
p(H|D) from the frequency format (see Figure 6.2, left side) requires solving
the following equation:

p(H|D) = d&h / (d&h + d&¬h),     (2)
where d&h (data and hypothesis) is the number of cases with symptom and
disease, and d&¬h is the number of cases having the symptom but lacking the
disease. One does not even need to keep track of the base rate of the disease.
A medical student who struggles with single-event probabilities presented in
medical textbooks may on the other hand have to rely on a calculator and end
up with little understanding of the result (see Figure 6.2, right side).1 Hence-
forth, when we use the term frequency format, we always refer to natural fre-
quencies as defined by the natural sampling tree in Figure 6.2.
Comparison of Equations 1 and 2 leads to our first theoretical result:
Result 1: Computational demands. Bayesian algorithms are computa-
tionally simpler when information is encoded in a frequency format
rather than a standard probability format.
By "computationally simpler" we mean that (a) fewer operations (multiplica-
tion, addition, or division) need to be performed in Equation 2 than Equation
1, and (b) the operations can be performed on natural numbers (absolute fre-
quencies) rather than fractions (such as percentages).
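A short Python comparison of the two algorithms, using the mammography figures referred to in the text (1% base rate, 80% hit rate, 9.6% false alarm rate; 8 and 95 cases in the frequency tree of Figure 6.2):

# Equation 1: standard probability format (three pieces of information, fractions)
p_h, p_d_given_h, p_d_given_not_h = 0.01, 0.80, 0.096
posterior_probabilities = (p_d_given_h * p_h) / (
    p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
)

# Equation 2: natural frequencies (two natural numbers, one addition, one division)
d_and_h, d_and_not_h = 8, 95
posterior_frequencies = d_and_h / (d_and_h + d_and_not_h)

print(round(posterior_probabilities, 3))   # 0.078
print(round(posterior_frequencies, 3))     # 0.078 -- same result, simpler arithmetic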
Equations 1 and 2 are mathematically equivalent formulations of Bayes's
rule. Both produce the same result, p(H|D) = .078. Equation 1 is a standard
version of Bayes's rule in today's textbooks in the social sciences, whereas
Equation 2 expresses Bayes's rule in terms of natural frequencies.
1. This clinical example illustrates that the standard probability format is a conven-
tion rather than a necessity. Clinical studies often collect data that have the structure of
frequency trees as in Figure 6.2. Such information can always be represented in fre-
quencies as well as probabilities.
sampling yields a more parsimonious menu with only two pieces of informa-
tion, d&h and d&¬h (or alternatively, d&h and d). We call this the short menu.
So far we have introduced the probability format with a standard menu and
the frequency format with a short menu. However, information formats and
menus can be completely crossed. For instance, if we replace the probabilities
in the standard probability format with frequencies, we get a standard menu
with a frequency format, or the standard frequency format. Table 6.1 uses the
mammography problem to illustrate the four versions that result from crossing
the two menus with the two formats. All four displays are mathematically
equivalent in the sense that they lead to the same Bayesian posterior proba-
bility. In general, within the same format information can be divided into var-
ious menus; within the same menu, it can be represented in a range of formats.
To transform the standard probability format into the standard frequency
format, we simply replaced 1% with "10 out of 1,000," "80%" with "8 out of
10," and so on (following the tree in Figure 6.2) and phrased the task in terms
of a frequency estimate. All else went unchanged. Note that whether the fre-
quency format actually carries information about the sample size (e.g., that
there were exactly 1,000 women) or not (as in Table 6.1, where it is said "in
every 1,000 women") makes no difference for Results 1 to 3 because these
relate to single-point estimates only (unlike Result 4).
What are the Bayesian algorithms needed to draw inferences from the two
new format-menu combinations? The complete crossing of formats and menus
leads to two important results. A Bayesian algorithm for the short probability
format, that is, the probability format with a short menu (as in Table 6.1),
amounts to solving the following equation:

p(H|D) = p(D&H) / [p(D&H) + p(D&¬H)].     (3)
If the two pieces of information in the short menu are d&h and d, as in
Table 6.1, rather than d&h and d&¬h, then the Bayesian computations are even
simpler because the sum in the denominator is already computed.
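In code, the short-menu computation is equally brief. The joint and marginal values below (0.8% and 10.3%) are read off the frequency tree of Figure 6.2 (8 and 103 out of 1,000); Table 6.1 itself is not reproduced here, so treat them as illustrative:

# Equation 3: short menu with p(D & H) and p(D & not-H)
p_d_and_h, p_d_and_not_h = 0.008, 0.095
print(round(p_d_and_h / (p_d_and_h + p_d_and_not_h), 3))   # 0.078

# If the menu provides p(D & H) and p(D) instead, the denominator
# is already summed:
p_d = 0.103
print(round(p_d_and_h / p_d, 3))                            # 0.078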
Relative Frequencies
Several studies of Bayesian inference have used standard probability formats
in which one, two, or all three pieces of information were presented as relative
frequencies rather than as single-event probabilities—although the task still
was to estimate a single-event probability (e.g., Tversky & Kahneman's, 1982b,
cab problem). For instance, in the following version of the mammography
problem, all information is represented in relative frequencies (in %).
Relative frequency version (standard menu)
1% of women at age forty who participate in routine screening have
breast cancer. 80% of women with breast cancer will get positive mam-
mograms. 9.6% of women without breast cancer will also get positive
mammograms. A woman in this age group had a positive mammogram
in a routine screening. What is the probability that she actually has breast
cancer? ____ %
Is the algorithm needed for relative frequencies computationally equivalent to
the algorithm for natural frequencies? The relative frequency format does not
display the natural frequencies needed for Equation 2. Rather, the numbers are
the same as in the probability format, making the Bayesian computation the
same as in Equation 1. This yields the following result:
Result 7: Algorithms for relative frequency versions are computationally
equivalent to those for the standard probability format.
We tested several implications of Results 1 through 7 (except Result 4) in
the studies reported below.
frequency for a new sample. In the experimental research of the past two de-
cades, participants were almost always required to estimate a single-event
probability. But this need not be. In the experiments reported herein, we asked
people both for single-event probability and frequency estimates.
To summarize, mathematically equivalent information need not be com-
putationally and psychologically equivalent. We have shown that Bayesian
algorithms can depend on information format and menu, and we derived sev-
eral specific results for when algorithms are computationally equivalent and
when they are not.
How might the mind draw inferences that follow Bayes's rule? Surprisingly,
this question seems rarely to have been posed. Psychological explanations typ-
ically were directed at "irrational" deviations between human inference and
the laws of probability; the "rational" seems not to have demanded an expla-
nation in terms of cognitive processes. The cognitive account of probabilistic
reasoning by Piaget and Inhelder (1951/1975), as one example, stops at the
precise moment the adolescent turns "rational," that is, reaches the level of
formal operations.
We propose three classes of cognitive strategies for Bayesian inference: first,
the algorithms corresponding to Equations 1 through 3; second, physical an-
alogs of Bayes's rule, as anticipated by Bayes's (1763) billiard table; and third,
shortcuts that simplify the Bayesian computations in Equations 1 through 3.
Physical Analogs
The "beam analysis" (see Figure 6.3) is a physical analog of Bayes's rule de-
veloped by one of our research participants. This student represented the class
of all possible outcomes (child has severe prenatal damage and child does not
have severe prenatal damage) by a beam. He drew inferences (here, about the
probability that the child has severe prenatal damage) by cutting off two pieces
from each end of the beam and comparing their size. His algorithm was as
follows:
Step 1: Base rate cut. Cut off a piece the size of the base rate from the
right end of the beam.
Step 2: Hit rate cut. From the right part of the beam (base rate piece),
cut off a proportion p(D|H).
Step 3: False alarm cut. From the left part of the beam, cut off a propor-
tion p(D|¬H).
Step 4: Comparison. The ratio of the right piece to both pieces is the
posterior probability.
This algorithm amounts to Bayes's rule in the form of Equation 1.
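A literal transcription of the four cutting steps into Python, assuming a beam of unit length (the function is my reconstruction of the participant's procedure, not his own notation):

def beam_analysis(base_rate, hit_rate, false_alarm_rate):
    """Posterior probability computed with the 'beam' procedure."""
    # Step 1: base rate cut -- the right piece of the beam represents H.
    right = base_rate
    left = 1.0 - base_rate
    # Step 2: hit rate cut -- keep the proportion p(D|H) of the right piece.
    right_piece = right * hit_rate
    # Step 3: false alarm cut -- keep the proportion p(D|not-H) of the left piece.
    left_piece = left * false_alarm_rate
    # Step 4: comparison -- ratio of the right piece to both pieces.
    return right_piece / (right_piece + left_piece)

print(round(beam_analysis(0.01, 0.80, 0.096), 3))   # 0.078, identical to Bayes's rule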
Rare-Event Shortcut Rare events—that is, outcomes with small base rates, such
as severe prenatal damage—enable simplification of the Bayesian inference
with little reduction in accuracy. If an event is rare, that is, if p(H) is very
small, and p(¬H) is therefore close to 1.0, then p(D|¬H)p(¬H) can be
approximated by p(D|¬H). That is, instead of cutting the proportion p(D|¬H)
off the left part of the beam (Step 3), it is sufficient to cut a piece of absolute
size p(D|¬H). The rare-event shortcut (see Figure 6.3) is as follows:
IF the event is rare,
THEN simplify Step 3: Cut a piece of absolute size p(D|¬H).
This shortcut corresponds to the approximation

p(H|D) ≈ p(D|H)p(H) / [p(D|H)p(H) + p(D|¬H)].
The shortcut works well for the German measles problem, where the base rate
of severe prenatal damage is very small, p(H) = .005. The shortcut estimates
p(H|D) as .9524, whereas Bayes's rule gives .9526. It also works with the mam-
mography problem, where it generates an estimate of .077, compared with .078
from Bayes's rule.
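The rare-event shortcut is easy to verify numerically. The German measles figures are not all reproduced in this chapter; a base rate of .005, a hit rate of .40, and a false alarm rate of .0001 are assumed below because they yield the values reported in the text.

def bayes(base_rate, hit_rate, false_alarm_rate):
    return (hit_rate * base_rate) / (
        hit_rate * base_rate + false_alarm_rate * (1 - base_rate)
    )

def rare_event_shortcut(base_rate, hit_rate, false_alarm_rate):
    # Approximate p(D|not-H)p(not-H) by p(D|not-H), since p(not-H) is close to 1.
    return (hit_rate * base_rate) / (hit_rate * base_rate + false_alarm_rate)

# German measles problem (assumed figures, see above).
print(round(rare_event_shortcut(0.005, 0.40, 0.0001), 4))   # 0.9524
print(round(bayes(0.005, 0.40, 0.0001), 4))                 # 0.9526
# Mammography problem (figures from the text).
print(round(rare_event_shortcut(0.01, 0.80, 0.096), 3))     # 0.077
print(round(bayes(0.01, 0.80, 0.096), 3))                   # 0.078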
The big hit-rate shortcut would not work as well as the rare-event shortcut in
the German measles problem because p(D|H) is only .40. Nevertheless, the
shortcut estimate is only a few percentage points removed from that obtained
with Bayes's rule (.980 instead of .953). The big hit-rate shortcut works well,
to offer one instance, in medical diagnosis tasks where the hit rate of a test is
high (e.g., around .99 as in HIV tests).
Note that the right side of this approximation is equivalent to the posterior
odds ratio p(H|D)/p(¬H|D). Thus the comparison shortcut estimates the
posterior probability by the posterior odds ratio.
IF D&H occurs much more often than D&¬H,
THEN simplify Step 4: Take the ratio of D&¬H (left piece) to D&H (right
piece) as the complement of the posterior probability.
This shortcut corresponds to the approximation

1 - p(H|D) ≈ p(D&¬H) / p(D&H).
Does the standard frequency format invite the same shortcuts? Consider the
inference about breast cancer from a positive mammogram, as illustrated in
Figure 6.2. Would the rare-event shortcut facilitate the Bayesian computations?
In the probability format, the rare-event shortcut uses p(D|¬H) to approximate
p(¬H)p(D|¬H); in the frequency format, the latter corresponds to the absolute
frequency 95 (i.e., d&¬h) and no approximation is needed. Thus a rare-event
shortcut is of no use and would not simplify the Bayesian computation in
frequency formats. The same can be shown for the big hit-rate shortcut for the
same reason. The comparison shortcut, however, can be applied in the fre-
quency format:
IF d&¬h occurs much more often than d&h,
THEN compute d&h/d&¬h.
The condition and the rationale are the same as in the probability format.
To summarize, we proposed three classes of cognitive strategies underlying
Bayesian inference: (a) algorithms that satisfy Equations 1 through 3; (b) phys-
ical analogs that work with operations such as "cutting" instead of multiplying
(Figure 6.3); and (c) three shortcuts that can exploit environmental structures.
Predictions
We now derive several predictions from the theoretical results obtained. The
predictions specify conditions that do and do not make people reason the
Bayesian way. The predictions should hold independently of whether the cog-
nitive strategies follow Equations 1 through 3, whether they are physical an-
alogs of Bayes's rule, or whether they include shortcuts.
Prediction 1: Frequency formats elicit a substantially higher proportion
of Bayesian inferences than probability formats.
This prediction is derived from Result 1, which states that the Bayesian algo-
rithm is computationally simpler in frequency formats.2
Prediction 2: Probability formats elicit a larger proportion of Bayesian
inferences for the short menu than for the standard menu.
This prediction is deduced from Result 5, which states that with a probability
format, the Bayesian computations are simpler in the short menu than in the
standard menu.
Prediction 3: Frequency formats elicit the same proportion of Bayesian
inferences for the two menus.
This prediction is derived from Result 6, which states that with a frequency
format, the Bayesian computations are the same for the two menus.
Prediction 4: Relative frequency formats elicit the same (small) propor-
tion of Bayesian inferences as probability formats.
This prediction is derived from Result 7, which states that the Bayesian algo-
rithms are computationally equivalent in both formats.
The data we obtained for each of several thousand problem solutions were
composed of a participant's (a) probability or frequency estimate and (b) on-
line protocol ("write aloud" protocol) of his or her reasoning. Data type (a)
allowed for an outcome analysis, as used exclusively in most earlier studies
on Bayesian inference, whereas data type (b) allowed additionally for a process
analysis.
2. At the point when we introduced Result 1, we had dealt solely with the standard
probability format and the short frequency format. However, Prediction 1 also holds
when we compare formats across both menus. This is the case because (a) the short
menu is computationally simpler in the frequency than in the probability format, be-
cause the frequency format involves calculations with natural numbers and the proba-
bility format with fractions, and (b) with a frequency format, the Bayesian computations
are the same for the two menus (Result 6).
lated from applying Bayes's rule to the information given (outcome criterion),
and (b) the on-line protocol specified that one of the Bayesian computations
defined by Equations 1 through 3 or one (or several) of the shortcuts was used,
either by means of calculation or physical representation (process criterion).
We applied the same strict criteria to identify non-Bayesian cognitive strate-
gies.
Outcome: Strict Rounding Criterion By the phrase "exactly the same" in the
outcome criterion, we mean the exact probability or frequency, with exceptions
made for rounding up or down to the next full percentage point (e.g., in the
German measles problem, where rounding the probability of 95.3% down or
up to a full percentage point results in 95% or 96%). If, for example, the on-
line protocol showed that a participant in the German measles problem had
used the rare-event shortcut and the answer was 95% or 96% (by rounding),
this inferential process was classified as a Bayesian inference. Estimates below
or above were not classified as Bayesian inferences: If, for example, another
participant in the same problem used the big hit-rate shortcut (where the con-
dition for this shortcut is not optimally satisfied) and accordingly estimated
98%, this was not classified as a Bayesian inference. Cases of the latter type
ended up in the category of "less frequent strategies." This example illustrates
the strictness of the joint criteria. The strict rounding criterion was applied to
the frequency format in the same way as to the probability format.
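A sketch of the outcome criterion as a small Python predicate (a reconstruction of the rule just described, not the authors' scoring code):

import math

def meets_strict_rounding_criterion(answer_pct, bayesian_pct):
    """True if the answer equals the Bayesian value exactly, or equals it
    rounded down or up to the next full percentage point."""
    return answer_pct in {bayesian_pct, math.floor(bayesian_pct), math.ceil(bayesian_pct)}

# German measles problem: the Bayesian posterior is 95.3%.
print(meets_strict_rounding_criterion(95, 95.3))   # True  (rounded down)
print(meets_strict_rounding_criterion(96, 95.3))   # True  (rounded up)
print(meets_strict_rounding_criterion(98, 95.3))   # False (the big hit-rate estimate)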
When a participant answered with a fraction—such as that resulting from
Equation 3—without performing the division, this was treated as if he or she
had performed the division. We did not want to evaluate basic arithmetic
skills. Similarly, if a participant arrived at a Bayesian equation but made a
calculation error in the division, we ignored the calculation error.
all tasks. This happened only a few times. If a participant could not immedi-
ately identify what his or her notes meant, we did not inquire further.
The "write aloud" method avoids two problems associated with retrospec-
tive verbal reports: that memory of the cognitive strategies used may have
faded by the time of a retrospective report (Ericsson & Simon, 1984) and that
participants may have reported how they believe they ought to have thought
rather than how they actually thought (Nisbett & Wilson, 1977).
We used the twin criteria of outcome and process to cross-check outcome
by process and vice versa. The outcome criterion prevents a shortcut from
being classified as a Bayesian inference when the precondition for the shortcut
is not optimally satisfied. The process criterion protects against the opposite
error, that of inferring from a probability judgment that a person actually used
Bayesian reasoning when he or she did not.
We designed two studies to identify the cognitive strategies and test the
predictions. Study 1 was designed to test Predictions 1, 2, and 3.
Method
Sixty students, 21 men and 39 women from ten disciplines (predominantly
psychology) from the University of Salzburg, Austria, were paid for their par-
ticipation. The median age was 21 years. None of the participants was familiar
with Bayes's rule. Participants were studied individually or in small groups
of 2 or 3 (in two cases, 5). On the average, students worked 73 min in the
first session (range = 25-180 min) and 53 min in the second (range = 30-120
min).
We used two formats, probability and frequency, and two menus, standard
and short. The two formats were crossed with the two menus, so four versions
were constructed for each problem. There were 15 problems, including the
mammography problem (Eddy, 1982; see Table 6.1), the cab problem (Tversky
& Kahneman, 1982b), and a short version of Ajzen's (1977) economics problem.
The four versions of each problem were constructed in the same way as ex-
plained before with the mammography problem (see Table 6.1).3 In the fre-
quency format, participants were always asked to estimate the frequency of "h
out of d"; in the probability format, they were always asked to estimate the
probability p(H|D). Table 6.2 shows for each of the 15 problems the infor-
mation given in the standard frequency format; the information specified in
the other three versions can be derived from that.
3. If the Y number in "X out of Y" was large and odd, such as 9,950, we rounded
the number to a close, more simple number, such as 10,000. The German measles prob-
lem is an example. This made practically no difference for the Bayesian calculation and
was meant to prevent participants from being puzzled by odd Y numbers.
Results
Bayesian Reasoning
Prediction 1: Frequency formats elicit a substantially higher proportion of
Bayesian inferences than probability formats.
Do frequency formats foster Bayesian reasoning? Yes. Frequency formats elic-
ited a substantially higher proportion of Bayesian inferences than probability
formats: 46% in the standard menu and 50% in the short menu. Probability
formats, in contrast, elicited 16% and 28%, for the standard menu and the
short menu, respectively. These proportions of Bayesian inferences were ob-
tained by the strict joint criteria of process and outcome and held fairly stable
across 15 different inference problems. Note that 50% Bayesian inferences
means 50% of all answers, and not just of those answers where a cognitive
strategy could be identified. The percentage of identifiable cognitive strategies
across all formats and menus was 84%.
Figure 6.4 shows the proportions of Bayesian inferences for each of the 15
problems. The individual problems mirror the general result. For each prob-
lem, the standard probability format elicited the smallest proportion of Baye-
sian inferences. Across formats and menus, in every problem Bayesian infer-
ences were the most frequent.
The comparison shortcut was used quite aptly in the standard frequency
format, that is, only when the precondition of this shortcut was satisfied to a
high degree. It was most often used in the suicide problem, in which the ratio
between D&H cases and D&¬H cases was smallest (Table 6.2), that is, in which
the precondition was best satisfied. Here, 9 out of 30 participants used the
comparison shortcut (and 5 participants used the Bayesian algorithm without
a shortcut). Of the 20 instances in which the shortcut was used, 17 satisfied the
strict outcome criterion, and the remaining 3 were accurate to within 4 per-
centage points.
Because of the strict rounding criterion, the numerical estimates of the par-
ticipants using Bayesian reasoning can be directly read from Table 6.2. For
instance, in the short frequency version of the mammography problem, 43.3%
of participants (see Figure 6.4) came up with a frequency estimate of 8 out of
103 (or another value equivalent to 7.8%, or between 7% and 8%).
The empirical result in Figure 6.4 is consistent with the theoretical result
that frequency formats can be handled by Bayesian algorithms that are com-
putationally simpler than those required by probability formats.
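To make the computational difference concrete, here is a minimal sketch (not part of the original study materials) contrasting the two routes to the posterior for the mammography problem; the probability-format figures (base rate about 1%, hit rate 80%, false-alarm rate about 9.6%) are reconstructed from the natural frequencies cited in the text (10 of 1,000; 8 of 10; 95 of 990) and are stated here only for illustration.

```python
# Two mathematically equivalent routes to p(H|D) for the mammography problem.

# Probability format (standard menu): Bayes's rule with fractions.
p_h = 0.01                  # base rate p(H), reconstructed from "10 of 1,000"
p_d_given_h = 0.80          # hit rate p(D|H), reconstructed from "8 of 10"
p_d_given_not_h = 95 / 990  # false-alarm rate p(D|not-H), roughly 9.6%

posterior_prob = (p_d_given_h * p_h) / (
    p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
)

# Frequency format (natural frequencies): only natural numbers are needed.
d_and_h = 8       # positive tests among the 10 women with cancer
d_and_not_h = 95  # positive tests among the 990 women without cancer

posterior_freq = d_and_h / (d_and_h + d_and_not_h)

print(round(posterior_prob, 3), round(posterior_freq, 3))  # both 0.078
```

The frequency route needs one division of natural numbers, whereas the probability route requires multiplying and adding fractions first, which is the computational asymmetry the chapter's predictions rest on.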
Prediction 2: Probability formats elicit a larger proportion of Bayesian
inferences for the short menu than for the standard menu.
The percentages of Bayesian inferences in probability formats were 16% and
28% for the standard menu and the short menu, respectively. Prediction 2
holds for each of the 15 problems (Figure 6.4).
Prediction 3: The proportion of Bayesian inferences elicited by the fre-
quency format is independent of the menu.
The effect of the menu largely, but not completely, disappeared in the fre-
quency format. The short menu elicited 3.7 percentage points more Bayesian
strategies than the standard menu. The residual superiority of the short menu
could have the following cause: Result 2 (attentional demands) states that in
natural sampling it is sufficient for an organism to monitor either the frequen-
cies d&h and d or d&h and d&¬h. We have chosen the former pair for the
short menus in our studies and thus reduced the Bayesian computation by one
step, that of adding up d&h and d&¬h to d, which was part of the Bayesian
computation in the standard but not the short menu. This additional compu-
tational step is consistent with the small difference in the proportions of Baye-
sian inferences found between the two menus in the frequency formats.
How does the impact of format on Bayesian reasoning compare with that
of menu? The effect of the format was about three times larger than that of the
menu (29.9 and 21.6 percentage points difference compared with 12.1 and 3.7).
Equally striking, the largest percentage of Bayesian inferences in the two prob-
ability menus (28%) was considerably smaller than the smallest in the two
frequency menus (46%).
[Table: cognitive strategies in Study 1 and their formal equivalents, with frequencies of use broken down by format (probability vs. frequency) and menu (standard vs. short), plus totals; the table entries are not reproduced here.]
a. The sum of total answers is 1,774 rather than 1,800 (60 participants times 30 tasks) because of
some participants' refusals to answer and a few missing data.
the short menu. Joint occurrence always underestimates the Bayesian poste-
rior unless p(D) = 1. From participants' "write aloud" protocols, we learned
about a variant, which we call adjusted joint occurrence, in which the partic-
ipant starts with joint occurrence and adjusts it slightly (5 or fewer percentage
points).
Fisherian. Not all statisticians are Bayesians. Ronald A. Fisher, who invented
the analysis of variance and promoted significance testing, certainly was not.
In Fisher's (1955) theory of significance testing, an inference from data D to a
null hypothesis H0 is based solely on p(D|H0), which is known as the "exact
level of significance." The exact level of significance ignores base rates and
false alarm rates. With some reluctance, we labeled the second most frequent
non-Bayesian strategy—picking p(D|H) and ignoring everything else—"Fish-
erian." Our hesitation lay in the fact that it is one thing to ignore everything
else besides p(D|H), as Fisher's significance testing method does, and quite
another thing to confuse p(D|H) with p(H|D). For instance, a p value of 1%
is often erroneously believed to mean, by both researchers and some statistical
textbook authors (see Chapter 13), that the probability of the null hypothesis
being true is 1%. Thus the term Fisherian refers to this widespread misinter-
pretation rather than to Fisher's actual ideas (we hope that Sir Ronald would
forgive us).
There exist several related accounts of the strategy for inferring p(H|D)
solely on the basis of p(D|H). Included in these are the tendency to infer "cue
validity" from "category validity" (Medin, Wattenmaker, & Michalski, 1987)
and the related thesis that people have spontaneous access to sample spaces
that correspond to categories (e.g., cancer) rather than to features associated
with categories (Gavanski & Hui, 1992). Unlike the Bayesian algorithms and
joint occurrence, the Fisherian strategy is menu specific: It cannot be elicited
from the short menu. We observed from participants' "write aloud" protocols
the use of a variant, which we call adjusted Fisherian, in which the participant
started with p(D|H) and then adjusted this value slightly (5 or fewer percent-
age points) in the direction of some other information.
Likelihood Subtraction. Jerzy Neyman and Egon S. Pearson challenged
Fisher's null-hypothesis testing. They argued that hypothesis testing is a de-
cision between (at least) two hypotheses that is based on a comparison of the
probability of the observed data under both, which they construed as the like-
lihood ratio p(D|H)/p(D|¬H). We observed a version of the Neyman-Pearson
method, the likelihood subtraction strategy, which computes p(D|H) −
p(D|¬H). As in Neyman-Pearson hypothesis testing, this strategy makes no
use of prior probabilities and thus neglects base-rate information. The cogni-
tive strategy is menu specific (it can only be elicited by the standard menu)
and occurred predominantly in the probability format. In Robert Nozick's ac-
count, likelihood subtraction, also known as ΔP, is considered a measure of
evidential support (see Schum, 1994), and McKenzie (1994) has simulated the
performance of this and other non-Bayesian strategies.
Others. There were cases of multiply all in the short menu (the logic of which
escaped us) and a few cases of base rate only in the standard menu (a pro-
portion similar to that reported in Gigerenzer, Hell, & Blank, 1988). We iden-
tified a total of 10.8% other strategies; these are not described here because
each was used in fewer than 1% of the solutions.
Summary of Study 1
The two formats and the three menus were mathematically interchangeable
and always entailed the same posterior probability. However, the Bayesian al-
gorithm for the short menu is computationally simpler than that for the stan-
dard menu, and the hybrid menu is in between; therefore the proportion of
Bayesian inferences should increase from the standard to the hybrid to the
short menu (extended Prediction 2). In contrast, the Bayesian algorithms for
the probability and relative frequency formats are computationally equivalent;
thus there should be no difference between these two formats (Prediction 4).
Method
Fifteen students from the fields of biology, linguistics, English studies, German
studies, philosophy, political science, and management at the University of
Results
We could identify cognitive strategies in 67% of 1,080 probability judgments.
Table 6.4 shows the distribution of the cognitive strategies for the two formats
as well as for the three menus.
4. Study 2 was performed before Study 1 but is presented here second because it
builds on the central Study 1. In a few cases the numerical information in the problems
(e.g., German measles problem) was different in the two studies.
Table 6.4 Cognitive strategies in Study 2
[Columns: cognitive strategy by information format and information menu; table entries not reproduced here.]
The reasoning of two participants, Rudiger and Oliver, illustrates this de-
pendence of thought on menu. They try to solve the German measles problem,
in which the task is to estimate the probability p(H|D) of severe prenatal dam-
age in the child (H) if the mother had German measles during pregnancy (D).
In the standard menu, the information (probability expressed in percentages)
was p(H) = 0.5%, p(D|H) = 40%, and p(D|¬H) = 0.01%; in the hybrid
menu, p(H) = 0.5%, p(D|H) = 40%, and p(D) = 0.21%; and in the short
menu, p(D) = 0.21% and p(D&H) = 0.2%.
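As a check on these numbers (a worked calculation, not part of the experimental materials), the standard menu yields

$$p(H \mid D) = \frac{p(D \mid H)\,p(H)}{p(D \mid H)\,p(H) + p(D \mid \neg H)\,p(\neg H)} = \frac{0.40 \times 0.005}{0.40 \times 0.005 + 0.0001 \times 0.995} \approx 0.953,$$

and the short menu yields p(H|D) = p(D&H)/p(D) = 0.2%/0.21% ≈ 95%; the small difference from 95.3% reflects the rounding of the menu values shown above.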
Rüdiger, age 22, management. In the standard menu, Rüdiger focused
on p(D|H), explaining that because a child of an infected mother is at
such high risk (40%), his estimate would accordingly be high. He ad-
justed p(D|H) by 5% and estimated the posterior probability of severe
prenatal damage as 35% (adjusted Fisherian). In the hybrid menu, he
picked the base rate and estimated the same probability as 0.5% with
the argument that p(D|H) and p(D) "are without significance" (base rate
only). In the short menu, he picked p(H&D) and estimated 0.2% because
"this is the information that specifies the probability of severe damage
in the child. The percentage of infected mothers, however, is irrelevant"
(joint occurrence).
Oliver, age 22, German literature. In the standard menu, Oliver stated
that the "correlation between not having damage and nevertheless
having measles," as he paraphrased p(D|¬H), was the only relevant in-
formation. He calculated 1 − p(D|¬H) = 99.99% and rounded to 100%,
which was his estimate (false alarm complement). In the hybrid menu,
he concluded that the only relevant information was the base rate of
severe prenatal damage, and his estimate consequently dropped to 0.5%
(base rate only). In the short menu, he determined the proportion of
severe damage and measles in all cases with German measles, which led
him to the Bayesian answer of 95.3%.
The thinking of Rudiger and Oliver illustrates how strongly cognitive strategies
can depend on the representation of information, resulting in estimates that
may vary as much as from 0.5% to 100% (as in Oliver's case). These cases also
reveal how helpless and inconsistent participants were when information was
represented in a probability or relative frequency format.
With 72 inference problems per participant, Study 2 can answer the question
of whether mere practice (without feedback or instruction) increased the pro-
portion of Bayesian inferences. There was virtually no increase during the first
three sessions, which comprised 36 tasks. Only thereafter did the proportion
increase—from .04, .07, and .14 (standard, hybrid, and short menus, respec-
tively) in the first three sessions to .08, .14, and .21 in Sessions 4 through 6.
Thus extensive practice seems to be needed to increase the number of Bayesian
responses. In Study 1, with "only" 30 problems per participant, the proportion
increased slightly from .30 in the first session to .38 in the second. More gen-
erally, with respect to all cognitive strategies, we found that when information
was presented in a frequency format, our participants became more consistent
in their use of strategies with time and practice, whereas there was little if any
improvement over time with probability formats.
Summary of Study 2
General Discussion
Bayes's rule than those from the standard probability version; however, only
means—and not individual judgments or processes—were analyzed. Cosmides
and Tooby (1996) constructed a dozen or so versions of the medical problem
presented by Casscells et al. (1978). They converted, piece by piece, probability
information into frequencies and showed how this increases, at the same pace,
the proportion of Bayesian answers. They reported that when the frequency
format was mixed—that is, when the information was represented in frequen-
cies but the single-point estimate was a single-event probability or vice versa—
the effect of the frequency format was reduced by roughly half. Their results
are consistent with our theoretical framework.
At the beginning of this chapter, we contrasted the belief of the Enlighten-
ment probabilists that the laws of probability theory were the laws of the mind
(at least for hommes éclairés) with the belief of the proponents of the
heuristics-and-biases program that the laws of probability are not the laws of
the mind. We side with neither view, nor with those who have settled some-
where in between the two extremes. Both views are based on an incomplete
analysis: They focus on cognitive strategies, good or bad, without making the
connection between a strategy and the information format it has been designed
for. Through exploration of the computational consequences of an evolutionary
argument, a novel theoretical framework for understanding intuitive Bayesian
inference has emerged.
Why have so many experimental studies used the standard probability for-
mat? Part of the reason may be historical accident. There is nothing in Bayes's
rule that dictates whether the mathematical probabilities pertain to single
events or to frequencies, nor is the choice of format and menus specified by
the formal rules of probability. Thomas Bayes himself seemed not to have sided
with either single-event probabilities or frequencies. Like his fellow Enlight-
enment probabilists, he blurred the distinction between warranted degrees of
belief and objective frequencies by trying to combine the two (Earman, 1992).
Thus the experimental research on Bayesian inference could as well have
started with frequency representations, if not for the historical accident that it
became tied to Savage's (1954) agenda of bringing singular events back into
the domain of probability theory. For instance, if psychological research had
been inspired by behavioral ecology, foraging theory, or other ecological ap-
proaches to animal behavior in which Bayes's rule figures prominently (e.g.,
Stephens & Krebs, 1986), then the information format used in human studies
might have been frequencies from the very beginning.
We would like to emphasize that our results hold for an elementary form
of Bayesian inference, with binary hypotheses and data. Pregnancy tests, mam-
mograms, HIV tests, and the like are everyday examples where this elementary
form of inference is of direct relevance. However, there exist other situations
in which hypotheses, data, or both are multinomial or continuous and where
there is not only one datum, but several. Massaro (1998), for instance, conjec-
tured that when there are two or more pieces of evidence or cues—such as
two medical test results—a representation in natural frequencies would no
longer help to improve Bayesian reasoning. Krauss, Martignon, and Hoffrage
I have been asked, How did you discover these fast and frugal heuristics?
In fact, most of the credit goes to my students and colleagues in our interdis-
ciplinary research group, the Center for Adaptive Behavior and Cognition.
We benefited from a healthy mixture of persistence and luck. The discovery
of the recognition heuristic and the less-is-more effect illustrates how insight
can emerge from failure to accomplish something else. In an article on prob-
abilistic mental model (PMM) theory (Chapter 7), Ulrich Hoffrage, Heinz
Kleinbölting, and I derived a bold prediction about the "hard-easy" effect,
in which people are overconfident about their ability to solve hard but not
easy questions. The prediction was that the hard-easy effect would disap-
pear when both kinds of questions are representatively sampled.
After the article appeared, we designed an experiment to test the predic-
tion. We needed a hard and an easy set of questions and a domain from
which to draw representative samples. At that time, I was teaching at the
University of Salzburg, a cheerful architectural blend of postmodern Bau-
haus and Austrian Empire-style marble and gold. Common sense dictated
that German cities, about which students there knew a lot, would be an easy
set, and American cities, about which they knew comparatively little, would
be a hard set. How could it be any other way? We drew 100 random pairs of
cities from the 75 largest German cities, such as Bielefeld and Heidelberg,
and another 100 random pairs of the 75 largest American cities, such as San
Diego and San Antonio. The task was to judge which of two cities has the
larger population. When we saw the results, we could not believe our eyes.
These German-speaking students gave slightly more accurate answers for the
American cities (76.0%) than for the German cities (75.6%). How could they
have made as many correct judgments in a domain about which they knew
little as in one about which they knew a lot?
Salzburg has excellent restaurants. One night our research group had din-
ner at one of them to mourn the failed experiment—we could not test the
prediction because we failed to generate a hard and an easy set of questions.
As we tried in vain to make sense of the counterintuitive result, our col-
league Anton Kühberger politely remarked: "Why don't you look in your
PMM paper? The answer is there." What an embarrassing moment: He was
right. Our paper said that having heard of one city and not of the other is a
cue that the first city is larger—a fast and frugal strategy that we later named
the "recognition heuristic." How could this heuristic explain our puzzling
result? Because most of the students in Salzburg had heard of all the largest
German cities, they could not use the recognition heuristic in that set. But
when it came to American cities, many of which they had never heard, they
could use it. By exploiting the wisdom in missing knowledge, the recogni-
tion heuristic can lead to highly accurate judgments when lack of recogni-
tion is not random but systematically correlated with the criterion (here,
population size). Lack of recognition can be highly informative.
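A minimal sketch of this decision rule, purely for illustration (the function, the recognized set, and the city pair are hypothetical; real applications also require that recognition be correlated with the criterion):

```python
import random

def recognition_heuristic(a, b, recognized):
    """Infer which of two cities is larger from recognition alone.

    If exactly one city is recognized, pick it; if both or neither are
    recognized, recognition cannot discriminate and we guess.
    """
    if (a in recognized) and (b not in recognized):
        return a
    if (b in recognized) and (a not in recognized):
        return b
    return random.choice([a, b])  # recognition does not discriminate

# Hypothetical example: a student who has heard of San Diego but not Chula Vista.
recognized = {"San Diego", "Dallas"}
print(recognition_heuristic("San Diego", "Chula Vista", recognized))  # San Diego
```

The sketch also makes the less-is-more logic visible: someone who recognizes every city in a set can never use the rule, whereas partial ignorance lets it do useful work.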
So the less-is-more effect was discovered accidentally when it ruined an
experiment. After I left the architecturally playful University of Salzburg for
the stolidly gothic University of Chicago, I met Daniel Goldstein, with whom
I began to study the recognition heuristic and the less-is-more effect syste-
matically (Goldstein & Gigerenzer, 1999). Meanwhile, others succeeded in
testing and confirming our original prediction about the hard-easy effect
(Juslin, 1993; Juslin, Winman, & Olsson, 2000; Klayman et al., 1999). This
story illustrates how scientific discovery can come by failing to do one thing
and yet achieving another.
7
Probabilistic Mental Models
Do people think they know more than they really do? In the last 20 years,
cognitive psychologists have amassed a large and apparently damning body of
experimental evidence on overconfidence in knowledge, evidence that is in
turn part of an even larger and more damning literature on so-called cognitive
biases. The cognitive bias research claims that people are naturally prone to
making mistakes in reasoning and memory, including the mistake of over-
estimating their knowledge. In this chapter, we propose a new theoretical
model for confidence in knowledge based on the more charitable assumption
that people are good judges of the reliability of their knowledge, provided that
the knowledge is representatively sampled from a specified reference class. We
claim that this model both predicts new experimental results (that we have
tested) and explains a wide range of extant experimental findings on confi-
dence, including some perplexing inconsistencies.
Moreover, it is the first theoretical framework to integrate the two most striking
and stable effects that have emerged from confidence studies—the overconfidence
effect and the hard—easy effect—and to specify the conditions under which these
effects can be made to appear, disappear, and even invert. In most recent studies
(including our own, reported herein), participants are asked to choose between
two alternatives for each of a series of general-knowledge questions. Here is a typ-
ical example: "Which city has more inhabitants? (a) Hyderabad or (b) Islamabad."
Participants choose what they believe to be the correct answer and then are di-
rected to specify their degree of confidence (usually on a 50%-100% scale) that
their answer is indeed correct. After the participants answer many questions of
this sort, the responses are sorted by confidence level, and the relative frequencies
of correct answers in each confidence category are calculated. The overconfidence
effect occurs when the confidence judgments are larger than the relative frequen-
cies of the correct answers; the hard-easy effect occurs when the degree of over-
confidence increases with the difficulty of the questions, where the difficulty is
measured by the percentage of correct answers.
The work on which this chapter is based was coauthored with U. Hoffrage and
H. Kleinbölting.
PMM Theory
Local MM
We assume that the mind first attempts a direct solution that could generate
certain knowledge by constructing a local MM. For instance, a participant may
recall from memory that Heidelberg has a population between 100,000 and
200,000, whereas Bonn has more than 290,000 inhabitants. This is already
sufficient for the answer "Bonn" and a confidence judgment of 100%. In gen-
eral, a local MM can be successfully constructed if (a) precise figures can be
retrieved from memory for both alternatives, (b) intervals that do not overlap
can be retrieved, or (c) elementary logical operations, such as the method of
exclusion, can compensate for missing knowledge. Figure 7.2 illustrates a suc-
cessful local MM for the previous example.

Figure 7.1 Cognitive processes in solving a two-alternative general-knowledge task. MM = mental model; PMM = probabilistic mental model.

Now consider a task in which the
target variable is not quantitative (such as the number of inhabitants) but is
qualitative: "If you see the nationality letter P on a car, is it from Poland or
Portugal?" Here, either direct memory about the correct answer or the method
of exclusion is sufficient to construct a local MM. The latter is illustrated by
a participant reasoning "Since I know that Poland has PL it must be Portugal"
(Allwood & Montgomery, 1987, p. 370).
The structure of the task must be examined to define more generally what
is referred to as a local MM. The task consists of two objects, a and b (alter-
natives), and a target variable t. First, a local MM of this task is local; that is,
only the two alternatives are taken into account, and no reference class of
objects is constructed (see the following discussion). Second, it is direct; that
is, it contains only the target variable (e.g., number of inhabitants), and no
probability cues are used. Third, no inferences besides elementary operations
of deductive logic (such as exclusion) occur. Finally, if the search is successful,
the confidence in the knowledge produced is evaluated as certain. In these
respects, our concept of a local MM is similar to what Johnson-Laird (1983,
pp. 134-142) called a "mental model" in syllogistic inference.
A local MM simply matches the structure of the task; there is no use of the
probability structure of an environment and, consequently, no frame for in-
ductive inference as in a PMM. Because memory can fail, the "certain" knowl-
edge produced can sometimes be incorrect. These failures contribute to the
amount of overconfidence to be found in 100%-confident judgments.
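A minimal sketch of this direct-solution step, under the simplifying assumption that what memory returns are population intervals (the bounds used here are invented for illustration):

```python
def local_mm(interval_a, interval_b):
    """Try to answer 'which city has more inhabitants?' from memory alone.

    interval_a, interval_b: (low, high) population bounds recalled for each
    alternative. If the intervals do not overlap, the answer is produced with
    certainty; otherwise the local MM fails and a PMM must be constructed.
    """
    if interval_a[0] > interval_b[1]:
        return "a", 1.0   # a's lower bound exceeds b's upper bound
    if interval_b[0] > interval_a[1]:
        return "b", 1.0
    return None           # no certain answer; fall back on probability cues

# Heidelberg vs. Bonn with the figures recalled in the text
# (Bonn's upper bound is an assumption added only to form an interval):
print(local_mm((100_000, 200_000), (290_000, 310_000)))  # ('b', 1.0)
```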
PMM
Local MMs are of limited success in general-knowledge tasks and in most nat-
ural environments, although they seem to be sufficient for solving some syl-
logisms and other problems of deductive logic (see Johnson-Laird, 1983). If no
local MM can be activated, it is assumed that a PMM is constructed next. A
PMM solves the task by inductive inference, and it does so by putting the
specific task into a larger context. A PMM connects the specific structure of
the task with a probability structure of a corresponding natural environment
(stored in long-term memory). In our example, a natural environment could
be the class of all cities in Germany with a set of variables defined on this
class, such as the number of inhabitants. This task selects the number of in-
habitants as the target and the variables that covary with this target as the cues.
A PMM is different from a local MM in several respects. First, it contains
a reference class of objects that includes the objects a and b. Second, it uses
a network of variables in addition to the target variable for indirect inference.
Thus it is neither local nor direct. These two features also change the third
and fourth aspects of a local MM. Probabilistic inference is part of the cognitive
process, and uncertainty is part of the outcome.
Reference Class
The term reference class refers to the class of objects or events that a PMM
contains. In our example, the reference class "all cities in Germany" may be
generated. To generate a reference class means to generate a set of objects
known from a person's natural environment that contains objects a and b.
The reference class determines which cues can function as probability cues
for the target variable and what their cue validities are. For instance, a valid
cue in the reference class "all cities in Germany" would be the soccer-team
cue; that is, whether a city's soccer team plays in the German soccer Bundes-
liga, in which the 18 best teams compete. Cities with more inhabitants are
more likely to have a team in the Bundesliga. The soccer-team cue would not
help in the Hyderabad-Islamabad task, which must be solved by a PMM con-
taining a different reference class with different cues and cue validities.
Probability Cues
A PMM for a given task contains a reference class, a target variable, probability
cues, and cue validities. A variable is a probability cue Ci (for a target variable
in a reference class R) if the probability p(a) of a being correct is different from
the conditional probability of a being correct, given that the values of a and b
differ on Ci. If the cue is a binary variable such as the soccer-team cue, this
condition can be stated as follows:

p(a | aCib; R) ≠ p(a),

where aCib signifies the relation of a and b on the cue Ci (e.g., a has a soccer
team in the Bundesliga, but b does not) and p(a | aCib; R) is the cue validity of
Ci in R.
Thus cue validities are thought of as conditional probabilities, following
Rosch (1978) rather than Brunswik (1955), who defined his "cue utilizations"
as Pearson correlations. Conditional probabilities need not be symmetric as
correlations are. This allows the cue to be a better predictor for the target than
the target is for the cue, or vice versa. Cue validity is a concept in the PMM,
whereas the corresponding concept in the environment is ecological validity
(Brunswik, 1955), which is the true relative frequency of any city having more
inhabitants than any other one in R if aCib. For example, consider the reference
class all cities in Germany with more than 100,000 inhabitants. The ecological
validity of the soccer-team cue here is .91 (calculated for 1988/1989 for what
then was West Germany). That is, if one checked all pairs in which one city
a has a team in the Bundesliga but the other city b does not, one would find
that in 91% of these cases city a has more inhabitants.
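A minimal sketch of how an ecological validity of this kind can be computed from pairwise comparisons (the city data below are invented toy values; the .91 in the text refers to the actual 1988/1989 figures for West German cities over 100,000):

```python
from itertools import combinations

# Toy reference class: (population, has_bundesliga_team) for each city.
cities = [
    (1_800_000, True), (600_000, True), (560_000, True),
    (240_000, False), (130_000, False), (110_000, True),
]

def ecological_validity(cities):
    """Relative frequency, over all pairs the cue discriminates, that the
    city with a Bundesliga team is the more populous one."""
    relevant, correct = 0, 0
    for (pop_a, cue_a), (pop_b, cue_b) in combinations(cities, 2):
        if cue_a == cue_b:
            continue                 # cue cannot be activated for this pair
        relevant += 1
        larger_has_team = cue_a if pop_a > pop_b else cue_b
        if larger_has_team:
            correct += 1
    return correct / relevant

print(round(ecological_validity(cities), 2))  # 0.75 for the toy data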
Vicarious Functioning
and tested. In the Heidelberg-Bonn task, none of the five cues cited earlier can
in fact be activated. Finally, one cue may be generated that can be activated,
such as whether one city is the capital of the country and the other is not
(capital cue). This cue has a small probability of being activated—a small ac-
tivation rate in R (because it applies only to pairs that include Bonn)—and it
does not have a particularly high cue validity in R because it is well known
that Bonn is not exactly London or Paris.
The Heidelberg-Bonn problem illustrates that probability cues may have
small activation rates in R, and as a consequence, several cues may have to be
generated and tested before one is found that can be activated. The capital cue
that can be activated for the Heidelberg-Bonn comparison may fail for the next
problem, for instance a Heidelberg-Gottingen comparison. Cues can substitute
for one another from problem to problem, a process that Brunswik (1955)
called "vicarious functioning."
If (a) the number of problems is large or other kinds of time pressure apply
and (b) the activation rate of cues is rather small, then one can assume that
the cue generation and testing cycle ends after the first cue that can be acti-
vated has been found. Both conditions seem to be typical for general-
knowledge questions. For instance, even when participants were explicitly in-
structed to produce all possible reasons for and against each alternative, they
generated only about three on the average and four at most (Koriat, Lichten-
stein, & Fischhoff, 1980). If no cue can be activated, we assume that choice is
made randomly, and "confidence 50%" is chosen.
Note that the assumption that confidence equals cue validity is not arbitrary;
it is both rational and simple in the sense that good calibration is to be ex-
pected if cue validities correspond to ecological validities. This holds true even
if only one cue is activated.
Thus choice and confidence are inferred from the same activated cue. Both
are expressions of the same conditional probability. Therefore, they need not
be generated in the temporal sequence choice followed by confidence. The
latter is, of course, typical for actual judgments and often enforced by the
instructions in confidence studies.
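A minimal sketch of this cue-testing cycle as described here; the particular cues, their validities, and the city facts encoded below are illustrative assumptions, not the chapter's parameters:

```python
import random

def pmm_choice_and_confidence(a, b, cues):
    """Choose between objects a and b and report a confidence.

    `cues` is a list of (cue_function, cue_validity) pairs; a cue function
    maps an object to True/False/None (None = value not retrievable). The
    first cue that can be activated and discriminates settles both the
    choice and the confidence; otherwise the answer is a guess at 50%.
    """
    for cue, validity in cues:
        va, vb = cue(a), cue(b)
        if va is None or vb is None or va == vb:
            continue                         # cue cannot be activated; try next
        return (a, validity) if va else (b, validity)
    return random.choice([a, b]), 0.5        # no cue activated: guess

# Illustrative cues for the German-city reference class (values assumed):
soccer_team = {"Dortmund": True, "Heidelberg": False}
is_capital = {"Bonn": True}

cues = [
    (lambda city: soccer_team.get(city), 0.91),         # soccer-team cue
    (lambda city: is_capital.get(city, False), 0.70),   # capital cue, validity assumed
]

print(pmm_choice_and_confidence("Dortmund", "Heidelberg", cues))  # ('Dortmund', 0.91)
```

Because choice and confidence both come from the validity of the single activated cue, good calibration follows whenever cue validities track ecological validities, which is exactly the adaptiveness assumption stated above.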
Table 7.1 Probabilistic mental models for confidence task versus frequency
task: Differences between target variables, reference classes, and probability
cues
[Columns: PMM component, confidence task, frequency task; table entries not reproduced here.]
quency task will also contain different cues and cue validities. For instance,
base rates of performance in earlier general knowledge or similar testing sit-
uations could serve as a probability cue for the target variable. Again, our basic
assumption is that a PMM connects the structure of the task with a known
structure of the participant's environment.
Table 7.1 summarizes the differences between PMMs that are implied by
the two different tasks. Note that in our account, both confidences in a single
event and judgments of frequency are explained by reference to experienced
frequencies. However, these frequencies relate to different target variables and
reference classes. We use this assumption to predict systematic differences
between these kinds of judgments.
A PMM is an inductive device that uses the "normal" life conditions in known
environments as the basis for induction. How well does the structure of prob-
ability cues defined on R in a PMM represent the actual structure of probability
cues in the environment? This question is also known as that of "proper cog-
nitive adjustment" (Brunswik, 1964, p. 22). If the order of cue validities
roughly corresponds to that of the ecological validities, then the PMM is well
adapted to a known environment. In Brunswik's view, cue validities are
learned by observing the frequencies of co-occurrences in an environment.
A large literature exists that suggests that (a) memory is often (but not al-
ways) excellent in storing frequency information from various environments
and (b) the registering of event occurrences for frequency judgments is a fairly
automatic cognitive process requiring very little attention or conscious effort
(e.g., Gigerenzer, 1984; Hasher, Goldstein, & Toppino, 1977; Howell & Burnett,
1978; Sedlmeier, Hertwig, & Gigerenzer, 1998; Zacks, Hasher, & Sanft, 1982).
Hasher and Zacks (1979) concluded that frequency of occurrence, spatial lo-
cation, time, and word meaning are among the few aspects of the environment
that are encoded automatically and that encoding of frequency information is
"automatic at least in part because of innate factors" (p. 360). In addition,
Hintzman, Nozawa, and Irmscher (1982) proposed that frequencies are stored
in memory in a nonnumerical analog mode.
Whatever the mechanism of frequency encoding, we use the following as-
sumption for deriving our predictions: If participants had repeated experience
with a reference class, a target variable, and cues in their environment, we
assume that cue validities correspond well to ecological validities. (This holds
true for the average in a group of participants, but individual idiosyncrasies
in learning the frequency structure of the environment may occur.) This is a
bold assumption made in ignorance of potential deviations between specific
cue validities and ecological validities. If such deviations existed and were
known, predictions by PMM theory could be improved. The assumption, how-
ever, derives support from both the literature on automatic frequency process-
ing and a large body of neo-Brunswikian research on the correspondence be-
tween ecological validities and cue utilization (the latter of which corresponds
to our cue validities; e.g., Arkes & Hammond, 1986; Armelius, 1979; Brehmer
& Joyce, 1988; MacGregor & Slovic, 1986).
Note that this adaptiveness assumption does not preclude that individuals
(as well as the average participant) err. Errors can occur even if a PMM is
highly adapted to a given environment. For instance, if an environment is
changing or is changed in the laboratory by an experimenter, an otherwise
well-adapted PMM may be suboptimal in a predictable way.
Brunswik's notion of "representative sampling" is important here. If a per-
son experienced a representative sample of objects from a reference class, one
can expect his or her PMM to be better adapted to an environment than if he
or she happened to experience a skewed, unrepresentative sample.
Representative sampling is also important in understanding the relation be-
tween a PMM and the task. If a PMM is well adapted, but the set of objects
used in the task (questions) is not representative of the reference class in the
environment, performance in tasks will be systematically suboptimal.
To avoid confusion with terms such as calibration, we will use the term
adaptation only when we are referring to the relation between a PMM and a
corresponding environment—not, however, for the relation between a PMM
and a task.
Predictions
A concrete example can help motivate our first prediction. Two of our col-
leagues, K and O, are eminent wine tasters. K likes to make a gift of a bottle
of wine from his cellar to Friend O, on the condition that O guesses what
country or region the grapes were grown in. Because O knows the relevant
cues, O can usually pick a region with some confidence. O also knows that K
sometimes selects a quite untypical exemplar from his ample wine cellar to
test Friend O's limits. Thus, for each individual wine, O can infer the proba-
bility that the grapes ripened in, say, Portugal as opposed to South Africa with
considerable confidence from his knowledge about cues. In the long run, how-
ever, O nevertheless expects the relative frequency of correct answers to be
lower because K occasionally selects unusual items.
Consider tests of general knowledge, which share an important feature with
the wine-tasting situation: Questions are selected to be somewhat difficult and
sometimes misleading. This practice is common and quite reasonable for test-
ing people's limits, as in the wine-tasting situation. Indeed, there is apparently
not a single study on confidence in knowledge in which a reference class has
been defined and a representative (or random) sample of general-knowledge
questions has been drawn from this population. For instance, consider the
reference class "metropolis" and the geographical north-south location as the
target variable. A question like "Which city is farther north? (a) New York or
(b) Rome" is likely to appear in a general-knowledge test (almost everyone gets
it wrong), whereas a comparison between Berlin and Rome is not.
The crucial point is that confidence and frequency judgments refer to dif-
ferent kinds of reference classes. A set of questions can be representative with
respect to one reference class and, at the same time, selected with respect to
the other class. Thus, a set of 50 general-knowledge questions of the city type
may be representative for the reference class "sets of general-knowledge ques-
tions" but not for the reference class "cities in Germany" (because city pairs
have been selected for being difficult or misleading). Asking for a confidence
judgment summons up a PMM on the basis of the reference class "cities in
Germany"; asking for a frequency judgment summons up a PMM on the basis
of the reference class "sets of general-knowledge questions." The first predic-
tion can now be stated.
1. Typical general-knowledge tasks elicit both overconfidence and ac-
curate frequency judgments.
By "typical" general-knowledge tasks we refer to a set of questions that is
representative for the reference class "sets of general-knowledge questions."
This prediction is derived in the following way: If (a) PMMs for confidence
tasks are well adapted to an environment containing a reference class R (e.g.,
all cities in Germany) and (b) the actual set of questions is not representative
for R but selected for difficult pairs of cities, then confidence judgments exhibit
overconfidence. Condition A is part of our theory (the simplifying assumption
we just made), and Condition B is typical for the general-knowledge questions
used in studies on confidence as well as in other testing situations.
If (a) PMMs for frequency-of-correct-answer tasks are well adapted with
respect to an environment containing a reference class R (e.g., the set of all
general-knowledge tests experienced earlier), and (b) the actual set of questions
is representative for R, then frequency judgments are expected to be accurate.
Again, Condition A is part of our theory, and Condition B will be realized in
our experiments by using a typical set of general-knowledge questions.
Taken together, the prediction is that the same person will exhibit over-
confidence when asked for her confidence that a particular answer is correct
and accurate estimates when asked for a judgment of the frequency of correct
answers. This prediction is shown by the two points on the left side of Figure
7.4. This prediction cannot be derived from any of the previous accounts of
overconfidence.
To introduce the second prediction, we return to the wine-tasting story.
Assume that K changes his habit of selecting unusual wines from his wine
cellar and instead buys a representative sample of French red wines and lets
O guess from what region they come. However, K does not tell O about the
new sampling technique. O's average confidence judgments will now be close
to the proportion of correct answers. In the long run, O nevertheless expects
the proportion of correct answers to be smaller, still assuming the familiar
testing situation in which wines were selected, not randomly sampled. Thus
O's frequency judgments will show underestimation.
Consider now a set of general-knowledge questions that is a random sample
from a defined reference class in the participant's natural environment. We use
If sampling deviates in both the hard and the easy set equally from represen-
tative sampling, points will lie on a horizontal line parallel to the zero-
overconfidence line.
Now consider the case that the easy set is selected from a corresponding
reference class (e.g., general-knowledge questions), but the hard set is a rep-
resentative sample from another reference class (denoted as H' in Figure 7.5).
One then would predict a reversal of the hard-easy effect, as illustrated in
Figure 7.5 by the double line from E to H'.
Experiment 1
Method
Two sets of questions were used, which we refer to as the representative and
the selected set. The representative set was determined in the following way.
We used as a reference class in a natural environment (an environment known
to our participants) the set of all cities in West Germany with more than
100,000 inhabitants. There were 65 cities (Statistisches Bundesamt, 1986).
From this reference class, a random sample of 25 cities was drawn, and all
pairs of cities in the random sample were used in a complete paired compar-
ison to give 300 pairs. No selection occurred. The target variable was the num-
ber of inhabitants, and the 300 questions were of the following kind: "Which
city has more inhabitants? (a) Solingen or (b) Heidelberg." We chose city ques-
tions for two reasons. First, and most important, this content domain allowed
for a precise definition of a reference class in a natural environment and for
random sampling from this reference class. The second reason was for com-
parability. City questions have been used in earlier studies on overconfidence
(e.g., Keren, 1988; May, 1987).
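A minimal sketch of how such a representative item set is generated (the city names are placeholders; the actual list came from the 1986 federal statistics cited above):

```python
import random
from itertools import combinations

# Placeholder reference class standing in for the 65 West German cities
# with more than 100,000 inhabitants (Statistisches Bundesamt, 1986).
reference_class = [f"city_{i}" for i in range(1, 66)]

random.seed(1)                                 # any seed; fixed for reproducibility
sample = random.sample(reference_class, 25)    # random sample of 25 cities
pairs = list(combinations(sample, 2))          # complete paired comparison

print(len(pairs))  # 300 questions, with no further selection
```

The crucial design feature is the absence of any selection step after sampling: every one of the 25 × 24 / 2 = 300 pairs becomes a question, difficult or not.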
In addition to the representative set, a typical set of 50 general-knowledge
questions, as in previous studies, was used. Two examples are "Who was born
first? (a) Buddha or (b) Aristotle" and "When was the zipper invented? (a)
before 1920 or (b) after 1920."
After each answer, the participant gave a confidence judgment (that this
particular answer was correct). Two kinds of frequency judgments were used.
First, after each block of 50 questions, the participant estimated the number
of correct answers among the 50 answers given. Because there were 350 ques-
tions, every participant gave seven estimates of the number of correct answers.
Second, after the participants answered all questions, they were given an en-
larged copy of the confidence scale used throughout the experiment and were
asked for the following frequency judgment: "How many of the answers that
you classified into a certain confidence category are correct? Please indicate
for every category your estimated relative frequency of correct answers."
In Experiment 1, we also introduced two of the standard manipulations in
the literature. The first was to inform and warn half of our participants of the
overconfidence effect, and the second was to offer half of each group a mon-
etary incentive for good performance. Both are on a list of "debiasing" methods
known as being relatively ineffective (Fischhoff, 1982), and both contributed
to the view that overconfidence is a robust phenomenon. If PMM theory is
correct, the magnitude of effects resulting from the manipulations in this
chapter—confidence versus frequency judgment and selected versus represen-
tative sampling—should be much larger than those resulting from the two stan-
dard "debiasing" manipulations.
Over-/underconfidence was measured as

over-/underconfidence = (1/n) Σ_i n_i (p_i − f_i),

which equals the mean confidence judgment minus the overall proportion correct. Here n is the total number of answers, n_i is the number of times the confi-
dence judgment p_i was used, and f_i is the relative frequency of correct answers
for all answers assigned confidence p_i; the sum runs over the different confidence
categories used (7 in this experiment). A positive difference is called overcon-
fidence. For convenience, we report over- and underconfidence in percentages
(× 100).
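A minimal sketch of this measure (the judgments below are invented toy data): group answers by confidence category, compute the relative frequency correct in each category, and take the weighted difference, which equals mean confidence minus overall percentage correct.

```python
from collections import defaultdict

def over_underconfidence(judgments):
    """judgments: list of (confidence, correct) pairs, confidence in [0.5, 1.0],
    correct a bool. Returns the over-/underconfidence score and the
    per-category relative frequencies of correct answers."""
    by_category = defaultdict(list)
    for p, correct in judgments:
        by_category[p].append(correct)
    n = len(judgments)
    calibration = {p: sum(v) / len(v) for p, v in by_category.items()}
    score = sum(len(v) * (p - calibration[p]) for p, v in by_category.items()) / n
    return score, calibration

# Invented toy data: overconfidence shows up as a positive score.
toy = [(1.0, True), (1.0, False), (0.95, True), (0.95, False),
       (0.85, True), (0.85, False), (0.65, True), (0.65, False)]
score, calibration = over_underconfidence(toy)
print(round(score, 3))  # 0.363: mean confidence 0.863 versus 50% correct
```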
Results
Prediction 1 PMM theory predicts that in the selected set (general-knowledge
questions), people show overestimation in confidence judgments (overconfi-
dence) and, simultaneously, accurate frequency judgments.
The open-circle curve in Figure 7.6 shows the relation between judgments
of confidence and the true relative frequency of correct answers in the selected
set—that is, the set of mixed general-knowledge questions. The relative fre-
quency of correct answers (averaged over all participants) was 72.4% in the
100%-confidence category, 66.3% in the 95% category, 58.0% in the 85% cat-
egory, and so on. The curve is far below the diagonal (calibration curve) and
similar to the curves reported by Lichtenstein, Fischhoff, and Phillips (1982,
Figure 2). It replicates and demonstrates the well-known overconfidence effect.
Percentage correct was 52.9, mean confidence was 66.7, and overconfidence
was 13.8.
Participants' frequency judgments, however, are fairly accurate, as Table 7.2
(last row) shows. Each entry is averaged over the 20 participants in each con-
dition. For instance, the figure −1.8 means that, on average, participants in
this condition underestimated the true number of correct answers by 1.8. Av-
eraged across the four conditions, we get −1.2, which means that participants
missed the true frequency by an average of only about 1 correct answer in the
set of 50 questions. Quite accurate frequency judgments coexist with over-
confidence. The magnitude of this confidence—frequency effect found is shown
in Figure 7.7 (left side). PMM theory predicts this systematic difference be-
tween confidence and frequency judgments, within the same person and the
same general-knowledge questions.
Prediction 2 PMM theory predicts that in the representative set (city ques-
tions) people show zero overconfidence and, at the same time, underestimation
in frequency judgments.
Figure 7.6 Calibration curves for three sets. Overconfidence appears when
questions are selected (open circles) but disappears when questions are rep-
resentative (black squares). The matched set controls for the different content
of the two sets. Here, questions are selected from the representative set to
match the difficulty of the selected set, and overconfidence is again pro-
duced.
Table 7.2 Mean differences between estimated and true frequencies of correct
answers
Set                No warning-    Incentive    Warning    Warning and
                   no incentive   only         only       incentive
Representative
  1-50                 -9.9          -9.4        -8.8        -8.7
  51-100               -9.5         -10.4       -12.0       -11.3
  101-150              -9.9         -10.9       -10.9        -9.9
  151-200              -6.7          -6.7        -9.4        -5.9
  201-250              -9.8          -9.8        -8.0        -5.3
  251-300              -9.5         -10.8        -9.4        -9.1
  Average              -9.2          -9.7        -9.7        -8.4
Selected               -1.8          -0.6        -2.7         0.3
The solid-square curve in Figure 7.6 shows the relation between confidence
and percentage correct in the representative set—that is, the city questions.
For instance, percentage correct in the 100%-confidence category was 90.8%,
instead of 72.4%. Overconfidence disappeared (−0.9%). Percentage correct
and mean confidence were 71.7 and 70.8, respectively.
The confidence curve for the representative set is similar to a regression
curve for the estimation of relative frequencies by confidence, resulting in un-
derconfidence in the left part of the confidence scale, overconfidence in the
right, and zero overconfidence on the average.
Table 7.2 shows the differences between estimated and true frequencies for
each block of 50 items and each of the conditions, respectively. Again, each
entry is averaged over the 20 participants in each condition. For instance,
participants who were given neither information nor incentive underestimated
their true number of correct answers by 9.9 (on the average) in the first 50
items of the representative set. Table 7.2 shows that the values of the mean
differences were fairly stable over the six subsets, and, most important, they
are, without exception, negative (i.e., underestimation).
with confidence. For instance, in the 100%-confidence category, true and es-
timated percentage correct were 88.8% and 93.0%, respectively.
Averaged across experimental conditions, the ratio between estimated fre-
quency in the long run and confidence value is fairly constant, around .87, for
confidence ratings between 65% and 95%. It is highest in the extreme cate-
gories (see Table 7.3).
To summarize, participants explicitly distinguished between confidence in
single answers and the relative frequency of correct answers associated with
Table 7.3 Estimated and true percentage correct in each confidence category
(summarized over the representative and the selected sets)
[Columns: confidence category; number of confidence judgments; percentage correct (estimated and true); over-/underestimation. Table entries not reproduced here.]
Order of Presentation and Sex Which set (representative vs. selected) was
given first had no effect on confidences, neither in Experiment 1 nor in Ex-
periment 2. Arkes, Christensen, Lai, and Blumer (1987) found an effect of the
difficulty of one set of items on the confidence judgments for a second set
when participants received feedback for their performance in the first set. In
our experiment, however, no feedback was given. Thus, participants had no
reason to correct their confidence judgments, such as by subtracting a constant
value. Sex differences in degree of overconfidence in knowledge have been
claimed by both philosophy and folklore. Our study, however, showed no sig-
Experiment 2
We tried to replicate the results and test several objections. First, to strengthen
the case against PMM theory, we instructed the participants both verbally and
in written form that confidence is subjective probability, and that among all
cases where a subjective probability of X% was chosen, X% of the answers
should be correct. Several authors have argued that such a frequentist instruc-
tion could enhance external calibration or internal consistency (e.g., Kahne-
man & Tversky, 1982; May, 1987). According to PMM theory, however, confi-
dence is already inferred from frequency (with or without this instruction)—
but from frequencies of co-occurrences between, say, number of inhabitants
and several cues, and not from base rates of correct answers in similar testing
situations (see Table 7.1). Thus, in our view, the preceding caution will be
ineffective because the base rate of correct answers is not a probability cue
that is defined on a reference class such as cities in Germany.
Second, consider the confidence-frequency effect. We have shown that this
new effect is implied by PMM theory. One objection might be that the differ-
ence between confidence and frequency judgments is an artifact of the re-
sponse function, just as overconfidence has sometimes been thought to be.
Consider the following interpretation of overconfidence. If (a) confidence is
well calibrated but (b) the response function that transforms confidence into a
confidence judgment differs from an identity function, then (c) overconfidence
or underconfidence "occurs" on the response scale. Because an identity func-
tion has not been proven, Anderson (1986), for instance, denoted the over-
confidence effect and the hard—easy effect as "largely meaningless" (p. 91):
They might just as well be response function artifacts.
A similar objection could be made against the interpretation of the confi-
dence-frequency effect within PMM theory. Despite the effect's stability across
selected and representative sets, it may just reflect a systematic difference be-
tween response functions for confidence and frequency judgments. This con-
jecture can be rephrased as follows: If (a) the difference between "internal"
confidence and frequency impression is zero, but (b) the response functions
that transform both into judgments differ systematically, then (c) a confidence-
frequency effect occurs on the response scales. We call this the response-
function conjecture.
How can this conjecture be tested? According to PMM theory, the essential
basis on which both confidence and frequency judgments are formed is the
probability cues, not response functions. We assumed earlier that frequency
judgments are based mainly on base rates of correct answers in a reference
class of similar general-knowledge test situations. If we make another cue
available, then frequency judgments should change. In particular, if we make
the confidence judgments more easily retrievable from memory, these can be
used as additional probability cues, and the confidence—frequency effect
should decrease. This was done in Experiment 2 by introducing frequency
judgments in the short run, that is, frequency judgments for a very small num-
ber of questions. Here, confidence judgments can be more easily retrieved from
memory than they could in the long run. Thus, if PMM theory is correct, the
confidence-frequency effect should decrease in the short run. If the issue were,
however, different response functions, then the availability of confidence judg-
ments should not matter because confidence and frequency impression are
assumed to be identical in the first place. Thus, if the conjecture is correct, the
confidence-frequency effect should be stable.
In Experiment 2, we varied the length N of a series of questions from the
long-run condition N = 50 in Experiment 1 to the smallest possible short run
of N = 2.
Third, in Experiment 1 we used a response scale ranging from 50% to 100%
for confidence judgments but a full-range response scale for frequency judg-
ments ranging from 0 to 50 correct answers (which corresponds to 0% to
100%). Therefore one could argue that the confidence—frequency effect is an
artifact of the different ranges of the two response scales. Assume that (a) there
is no difference between internal confidence and frequency, but (b) because
confidence judgments are limited to the upper half of the response scale,
whereas frequency judgments are not, (c) the confidence-frequency effect re-
sults as an artifact of the half-range response scale in confidence judgments
We refer to this as the response-range conjecture. It can be backed up by at
least two hypotheses.
1. Assume that PMM theory is wrong and participants indeed use base rates
of correct answers as a probability cue for confidence in single answers.
Then confidence should be considerably lower. If participants anticipate
misleading questions, even confidences lower than 50% are reasonable to
expect on this conjecture. Confidences below 50%, however, cannot be
expressed on a scale with a lower boundary at 50%, whereas they can at
the frequency scale. Effects of response range such as those postulated in
range-frequency theory (Parducci, 1965) or by Schonemann (1983) may
enforce the distorting effect of the half-range format. In this account, both
the overconfidence effect and the confidence-frequency effect are gener-
ated by a response-scale effect. With respect to overconfidence, this con-
jecture has been made and has claimed some support (e.g., May, 1986,
1987; Ronis & Yates, 1987). We call this the base rate hypothesis.
2. Assume that PMM theory is wrong in postulating that choice and confi-
dence are essentially one process and that the true process is a temporal
sequence: choice, followed by search for evidence, followed by confi-
dence judgment. Koriat et al. (1980), for instance, proposed this se-
quence. Assume further, contrary to Koriat, that the mind is "Popper-
ian," searching for disconfirming rather than for confirming evidence to
determine the degree of "corroboration" of an answer. If the participant
is successful in retrieving disconfirming evidence from memory but is
not allowed to change the original answer, confidence judgments less
than 50% will result. Such disconfirmation strategies, however, can
hardly be detected using a 50%-100% format, whereas they could in a
full-scale format. We call this the disconfirmation strategy hypothesis.
To test the response-range conjecture, half of the participants in Experiment
2 were given full-range response scales, whereas the other half received the
response scales used in Experiment 1.
Method
Design and procedure This was a 4 X 2 X 2 design, with length of series (50,
10, 5, and 2) and response scale (half range vs. full range) varied between
participants and type of knowledge questions (selected vs. representative set)
varied within participants.
The procedure and the materials were the same as in Experiment 1, except for
the following. We used a new random sample of 21 (instead of 25) cities. This
change decreased the number of questions in the representative set from 300
to 210. As mentioned earlier, we explicitly instructed the participants to in-
terpret confidences as frequencies of correct answers: "We are interested in
how well you can estimate subjective probabilities. This means, among all the
answers where you give a subjective probability of X%, there should be X%
of the answers correct." This calibration instruction was orally repeated and
emphasized to the participants.
The response scale contained the means (50%, 55%, 65%, . . . , 95%, 100%)
of the intervals used in Experiment 1 rather than the intervals themselves to
avoid the problematic assumption that means would represent intervals. End-
points were marked absolutely certain that the alternative chosen is correct
(100%), both alternatives equally probable (50%), and, for the full-range scale,
absolutely certain that the alternative chosen is incorrect (0%). In the full-
range scale, one reason for using confidences between 0% and 45% was ex-
plained in the following illustration: "If you think after you have made your
choice that you would have better chosen the other alternative, do not change
your choice, but answer with a probability smaller than 50%."
Results
Response-Range Conjecture We tested the conjecture that the systematic dif-
ference in confidence and frequency judgments stated in Predictions 1 and 2
(confidence-frequency effect) and shown in Experiment 1 resulted from the
availability of only a limited response scale for confidence judgments (50% to
100%).
Forty-seven participants were given the full-range response scale for con-
fidence judgments. Twenty-two of these never chose confidences below 50%;
the others did. The number of confidence judgments below 50% was small.
Eleven participants used them only once (in altogether 260 judgments), 5 did so
twice, and the others 3 to 7 times. There was one outlier, a participant who
used them 67 times. In total, participants gave a confidence judgment smaller
than 50% for only 1.1% of their answers (excluding the outlier: 0.6%). If the
response-range conjecture had been correct, participants would have used con-
fidence judgments below 50% much more frequently.
In the representative set, overconfidence was 3.7% (SEM = 1.23) in the full-
range scale condition and 1.8% (SEM = 1.15) in the half-range condition. In
the selected set, the corresponding values were 14.4 (SEM = 1.54) and 16.4
(SEM = 1.43). Averaging all questions, we got slightly larger overconfidence in
the full-range condition (mean difference = 1.2). The response-range conjec-
ture, however, predicted a strong effect in the opposite direction. Frequency
judgments were essentially the same in both conditions. Hence, the confi-
dence-frequency effect can also be demonstrated when both confidence and
frequency judgments are made on a full-range response scale.
To summarize, there was (a) little use of confidences below 50% and (b) no
decrease of overconfidence in the full-range condition. These results contradict
the response-range conjecture.
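For readers who wish to check such values, the scoring itself is straightforward. The following minimal sketch (in Python; the function name and example judgments are hypothetical) computes overconfidence as mean confidence minus the relative frequency of correct answers, the standard definition used in this literature, expressed in percentage points.

```python
def overconfidence(confidences, correct):
    """Mean confidence minus proportion correct, in percentage points."""
    mean_conf = sum(confidences) / len(confidences)
    prop_correct = sum(correct) / len(correct)
    return 100 * (mean_conf - prop_correct)

# Hypothetical judgments: mean confidence .78, 60% correct -> 18 percentage points.
print(round(overconfidence([0.9, 0.8, 0.7, 0.6, 0.9], [1, 1, 0, 1, 0]), 1))
```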
A study by Ronis and Yates (1987) seems to be the only other study that
has compared the full-range and the half-range format in two-alternative choice
tasks, but it did not deal with frequency judgments. These authors also re-
ported that only about half their participants used confidence judgments below
50%, although they did so more frequently than our participants. Ronis and
Yates concluded that confidences below 50% had only a negligible effect on
overconfidence and calibration (pp. 209-211). Thus results in both studies are
consistent. The main difference is that Ronis and Yates seem to consider only
"failure to follow the instructions" and "misusing the probability scale"
(p. 207) as possible explanations for confidence judgments below 50%. In con-
trast, we argue that there are indeed plausible cognitive mechanisms—the base
rate and disconfirmation strategy hypotheses—that imply these kinds of judg-
ments, although they would contradict PMM theory.
Neither Experiment 2 nor the Ronis and Yates (1987) study rules out,
however, a more fundamental conjecture that is difficult to test. This argument
is that internal confidence (not frequency) takes a verbal rather than a numer-
ical form and that it is distorted on any numerical probability rating scale, not
just on a 50%-100% response scale. Zimmer (1983, 1986) argued that verbal
expressions of uncertainty (such as "highly improbable" and "very likely") are
more realistic, more precise, and less prone to overconfidence and other so-
called judgmental biases than are numerical judgments of probability. Zim-
mer's fuzzy-set modeling of verbal expressions, like models of probabilistic
reasoning that dispense with the Kolmogoroff axioms (e.g., Cohen, 1989; Ky-
burg, 1983; Shafer, 1978), remains a largely unexplored source of alternative
accounts of confidence.
For the remaining analysis, we do not distinguish between the full-range
and the half-range response format. For combining the data, we recoded an-
swers like "alternative a, 40% confident" as "alternative b, 60% confident,"
following Ronis and Yates (1987).
The estimated percentage correct differed again from confidence and was close to the
actual percentage correct.
Despite the instruction not to do so, our participants still distinguished be-
tween a specific confidence value and the corresponding percentage of correct
responses. Therefore confidence and hypothesized percentage correct should
not be used as synonyms. As suggested by this experiment, an instruction
alone cannot override the cognitive processes at work.
In the 100%-confidence category, for instance, 67 participants gave esti-
mates below 100%. In a postexperimental interview, we pointed out to them
that these judgments imply that they assumed they had not followed the cal-
ibration instruction. Most explained that in each single case, they were in fact
100% confident. But they also knew that, in the long run, some answers would
nonetheless be wrong, and they did not know which ones. Thus they did not
know which of the 100% answers they should correct. When asked how they
made the confidence judgments, most answered by giving examples of prob-
ability cues, such as "I know that this city is located in the Ruhrgebiet (in-
dustrial belt), and most cities there are rather large." Interviews provided ev-
idence for several probability cues, but no evidence that base rate expectations,
as reported in frequency judgments, were also used in confidence judgments.
Discussion
Our starting point was the overconfidence effect, reported in the literature as
a fairly stable cognitive illusion in evaluating one's general knowledge and
attributed to general principles of memory search, such as confirmation bias
(Koriat et al., 1980), to general motivational tendencies such as fear of invalid-
ity (Mayseless & Kruglanski, 1987), to insensitivity to task difficulty (see von
Winterfeldt & Edwards, 1986, p. 128), and to wishful thinking and other "def-
icits" in cognition, motivation, and personality. Our view, in contrast, proposes
that one evaluates one's knowledge by probabilistic mental models. In our
account, the main deficit of most cognitive and motivational explanations is
that they neglect the structure of the task and its relation to the structure of a
corresponding environment known to the participants. If people want to search
for confirming evidence or to believe that their answers are more correct than
they are because of some need, wish, or fear, then overestimation of accuracy
should express itself independently of whether they judge single answers or
frequencies, a selected or representative sample of questions, and hard or easy
questions.
Our experiments also do not support the explanation of overconfidence and
the hard-easy effect by assuming that participants are insensitive to task dif-
ficulty: In frequency tasks we have shown that participants' judgments of their
percentage correct in the long run are in fact close to actual percentage correct,
although confidences are not. Overconfidence does not imply that participants
are not aware of task difficulty. At least two more studies have shown that
estimated percentage correct can correspond closely to true percentage correct
in general-knowledge tasks. Allwood and Montgomery (1987) asked their par-
ticipants to estimate how difficult each of 80 questions was for their peers and
found that difficulty ratings (M = 57%) were more realistic (percentage correct
= 61%) than confidence judgments (M = 74%). May (1987) asked her partic-
ipants to estimate their percentage of correct answers after they completed an
Ronis and Yates (1987) We have mentioned that the Ronis and Yates study is
the only other study that tested a full-range response scale for two-alternative
tasks. The second purpose of that study was to compare confidence judgments
in situations in which the participant knows that the answers are known to
the experimenter (general-knowledge questions) with outcomes of upcoming
basketball games, in which answers are not yet known. In all three (response-
scale) groups, percentage correct was larger for general-knowledge questions
than for basketball predictions. Given this result, what would current theories
predict about overconfidence? The insensitivity hypothesis proposes that peo-
ple are largely insensitive to percentage correct (see von Winterfeldt & Ed-
wards, 1986, p. 128). This implies that overconfidence will be larger in the
more difficult (hard) set: the hard-easy effect. (The confirmation bias and mo-
tivational explanations are largely mute on the difficulty issue.) PMM theory,
in contrast, predicts that overconfidence will be larger in the easier set (hard-
easy effect reversal; see Prediction 5) because general-knowledge questions
(the easy set) were selected and basketball predictions were not; only with
clairvoyance could one select these predictions for percentage correct.
In fact, Ronis and Yates (1987) reported an apparent anomaly: three hard-
easy effect reversals. In all groups, overconfidence was larger for the easy
general-knowledge questions than for the hard basketball predictions (Figure
7.10). Prediction 5 accounts for these observed reversals of the hard-easy ef-
fect.

Figure 7.10 Reversal of the hard-easy effect in Ronis and Yates (1987) and
Keren (1988).
Koriat et al. (1980) Experiment 2 of Koriat et al.'s study provided a direct test
of the confirmation bias explanation of overconfidence. The explanation is this:
(a) participants first choose an answer based on their knowledge, then (b) they
selectively search for confirming memory (or for evidence disconfirming the
alternative not chosen), and (c) this confirming evidence generates over-
confidence. Between the participants' choice of an answer and their confidence
judgment, the authors asked them to give reasons for the alternative chosen.
Three groups of participants were asked to write down one confirming reason,
one disconfirming reason, or one of each, respectively. Reasons were given for
half of the general-knowledge questions; otherwise, no reasons were given
(control condition). If the confirmation bias explanation is correct, then asking
for a contradicting reason (or both reasons) should decrease overconfidence
and improve calibration. Asking for a confirming reason, however, should
make no difference "since those instructions roughly simulate what people
normally do" (Koriat et al, 1980, p. 111).
What does PMM theory predict? According to PMM theory, choice and con-
fidence are inferred from the same activated cue. This cue is by definition a
confirming reason. Therefore, the confirming-reason and the no-reason (con-
trol) tasks engage the same cognitive processes. The difference is only that in
the former the supporting reason is written down. Similarly, the disconfirming-
reason and both-reason tasks involve the same cognitive processes. Further-
more, PMM theory implies that there is no difference between the two pairs
of tasks.
This result is shown in Table 7.4. In the first row we have the no-reason
and confirming-reason tasks, which are equivalent. Here, only one cue is ac-
tivated, which is confirming. There is no disconfirming cue. Now consider the
second row, the disconfirming-reason and both-reason tasks, which are again
equivalent. Both tasks are solved if one additional cue, which is disconfirming,
can be activated. Thus, for PMM theory, the cue generation and testing cycle
is started again, and cues are generated according to the hierarchy of cue va-
lidities and tested as to whether they can be activated for the problem at hand.
The point is that the next cue that can be activated may turn out to be either
confirming or disconfirming.
For simplicity, assume that the probability that the next activated cue turns
out to be confirming or disconfirming is the same. If it is disconfirming, the
cycle is stopped, and two cues in total have been activated, one confirming
Table 7.4 Predictions of PMM theory for the effects of asking for a
disconfirming reason (columns: task, number of cues activated, confirming
[CON] and disconfirming [DIS] cues activated, probability, and predicted
change in confidence).
and one disconfirming. This stopping happens with probability .5, and it de-
creases both confidence and overconfidence. (Because the second cue activated
has a lower cue validity, however, confidence is not decreased below 50%.) If
the second cue activated is again confirming, a third has to be activated, and
the cue generation and testing cycle is entered again. If the third cue is dis-
confirming, the cycle stops with two confirming cues and one disconfirming
cue activated, as shown in the third row of Table 7.4. This stopping is to be
expected with probability .25. Because the second cue has higher cue validity
than the third, disconfirming, cue, overall an increase in confidence and over-
confidence is to be expected. If the third cue is again confirming, the same
procedure is repeated. Here and in all subsequent cases confidence will
increase. As shown in Table 7.4, the probabilities of an increase sum up to
.5 (.25 + .125 + .125), which is the same as the probability of a decrease.
Thus PMM theory leads to the prediction that, overall, asking for a discon-
firming reason will not change confidence or overconfidence. As just shown,
the confirmation-bias hypothesis, in contrast, predicts that asking for a discon-
firming reason should decrease confidence and overconfidence.
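The arithmetic behind this prediction is easy to verify. The short sketch below (Python; a minimal illustration, not part of the original analysis) assumes, as in the text, that each newly activated cue is confirming or disconfirming with probability .5 and that Table 7.4 truncates the series after four cues.

```python
# Probability that the extra search triggered by the disconfirming-reason task
# ends with a net decrease in confidence (second cue disconfirming) versus a
# net increase (stopping at the third cue, the fourth cue, or later).
p_decrease = 0.5
p_increase = 0.25 + 0.125 + 0.125
assert p_decrease == p_increase == 0.5
print(p_decrease, p_increase)
```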
What were the results of the Koriat study? In both crucial conditions,
disconfirming reason and both reasons, the authors found only small and
nonsignificant decreases of overconfidence (2% and 1%, respectively) and
similar small improvements in calibration (.006 each, significant only in the
disconfirming-reason task). These largely insignificant differences are consis-
tent with the prediction by PMM theory that asking for a disconfirming reason
makes no difference and are inconsistent with the confirmation-bias ex-
planation. Further evidence comes from a replication of the Koriat study by
Fischhoff and MacGregor (1982), who reported zero effects of disconfirming
reasons.
To summarize, the effects on confidence of giving confirming and discon-
firming reasons in the Koriat study can be both explained by and integrated
into PMM theory. There is no need to postulate a confirmation bias.
Dawes (1980) In the first perceptual task, where no selection for stimuli that generated perceptual illusions took place, over-
confidence was close to zero (perception of areas of squares is quite well
adapted in adults; see Gigerenzer & Richter, 1990). This result is predicted by
both accounts. The anomaly arises with the second perceptual task used—
judging which of two subsequent tones is longer. If the second tone was longer,
Dawes reported almost perfect calibration, but if the first tone was longer, par-
ticipants exhibited large overconfidence.
PMM theory predicts that in the inconsistent acoustic task, perceptual stim-
uli have been selected (albeit unwittingly) for a perceptual illusion. This is in
fact the case. From the literature on time perception, we know that of two
subsequently presented tones, the tone more recently heard appears to be
longer. This perceptual illusion is known as the negative presentation effect
(e.g., Fraisse, 1964; Sivyer & Finlay, 1982). It implies a smaller percentage of
correct answers in the condition in which the tone presented first was longer,
because this tone is perceived to be shorter. A decrease in percentage correct
in turn increases overconfidence. In Dawes's (1980) experiments, this is exactly
the inconsistent condition in which overconfidence occurred. Thus, from the
perspective we propose, this inconsistent result can be reconciled.
theory cannot account for the latter, nor can the notion of degree of perception-
likeness.
A second perceptual task was letter identification. In Experiment 3, Keren
(1988) used two letter-identification tasks, which were identical except that
the exposure time of the letters to be recognized was either short or long. Mean
percentages correct were 63.5 for short and 77.2 for long exposures. According
to earlier explanations, such as participants' insensitivity to task difficulty, a
hard-easy effect should result. According to PMM theory, however, the hard-
easy effect should be zero, because both tasks were generated by the same
sampling process (Prediction 4). In fact, Keren (1988, p. 112) reported that in
both tasks, overconfidence was not significantly different from zero. Prediction
4 accounts for this disappearance of the hard-easy effect in a situation in
which differences in percentage correct were large.
ing (e.g., Duncker, 1935/1945), and this issue has regained favor (e.g., Brehmer,
1988; Hammond, Stewart, Brehmer, & Steinmann, 1975). In their review, Ein-
horn and Hogarth (1981) emphasized that "the cognitive approach has been
concerned primarily with how tasks are represented. The issue of why tasks
are represented in particular ways has not yet been addressed" (p. 57). PMM
theory addresses this issue. Different tasks, such as confidence and frequency
tasks, cue different reference classes and different probability cues from known
environments. It is these environments that provide the particular represen-
tation, the PMM, of a task.
Many parts of PMM theory need further expansion, development, and test-
ing. Open issues include the following: (a) What reference class is activated?
For city comparisons, this question has a relatively clear answer, but in gen-
eral, more than one reference class can be constructed to solve a problem. (b)
Are cues always generated according to their rank in the cue validity hierar-
chy? Alternative models of cue generation could relax this strong assumption,
assuming, for instance, that the first cue generated is the cue activated in the
last problem. The latter would, however, decrease the percentage of correct
answers. (c) What are the conditions under which we may expect PMMs to be
well adapted? There exists a large body of neo-Brunswikian research that, in
general, indicates good adaptation but also points out exceptions (e.g., Arme-
lius, 1979; Bjorkman, 1987; Brehmer & Joyce, 1988; Hammond & Wascoe,
1980). (d) What are the conditions under which cue substitution without cue
integration is superior to multiple cue integration? PMM theory assumes a pure
cue substitution model—a cue that cannot be activated can be replaced by any
other cue—without integration of two or more cues. We focused on the sub-
stitution and not the integration aspect of Brunswik's vicarious functioning
(see Gigerenzer & Murray, 1987, pp. 66-81), in contrast to the multiple regres-
sion metaphor of judgment. Despite its simplicity, the substitution model pro-
duces zero overconfidence and a large number of correct answers, if the PMM
is well adapted. There may be more reasons for simple substitution models.
Armelius and Armelius (1974), for instance, reported that participants were
well able to use ecological validities, but not the correlations between cues. If
the latter is the case, then multiple cue integration may not work well.
Conclusions
REASONING THE FAST AND FRUGAL WAY

The work on which this chapter is based was coauthored with D. G. Goldstein.
quick-and-dirty heuristics and not the laws of probability (Tversky & Kahne-
man, 1974). This second perspective appears diametrically opposed to the clas-
sical rationality of the Enlightenment, but this appearance is misleading. It has
retained the normative kernel of the classical view. For example, a discrepancy
between the dictates of classical rationality and actual reasoning is what de-
fines a reasoning error in this program. Both views accept the laws of proba-
bility and statistics as normative, but they disagree about whether humans can
live up to these norms.
Many experiments have been conducted to test the validity of these two
views, identifying a host of conditions under which the human mind appears
more rational or irrational. But most of this work has dealt with simple situ-
ations, such as Bayesian inference with binary hypotheses, one single piece of
binary data, and all the necessary information conveniently laid out for the
participant (Chapter 6). In many real-world situations, however, there are mul-
tiple pieces of information, which are not independent, but redundant. Here,
Bayes's rule and other "rational" algorithms quickly become mathematically
complex and computationally intractable, at least for ordinary human minds.
These situations make neither of the two views look promising. If one were to
apply the classical view to such complex real-world environments, this would
suggest that the mind is a supercalculator like a Laplacean demon (Wimsatt,
1976)—carrying around the collected works of Kolmogoroff, Fisher, or Ney-
man—and simply needs a memory jog, like the slave in Plato's Meno. On the
other hand, the heuristics-and-biases view of human irrationality would lead
us to believe that humans are hopelessly lost in the face of real-world com-
plexity, given their supposed inability to reason according to the canon of
classical rationality, even in simple laboratory experiments.
There is a third way to look at inference, focusing on the psychological
and ecological rather than on logic and probability theory. This view ques-
tions classical rationality as a universal norm and thereby questions the very
definition of "good" reasoning on which both the Enlightenment and the
heuristics-and-biases views were built. Herbert Simon, possibly the best
known proponent of this third view, proposed looking for models of bounded
rationality instead of classical rationality. Simon (1956, 1982) argued that in-
formation-processing systems typically need to satisfice rather than to opti-
mize. Satisficing, a blend of sufficing and satisfying, is a word of Scottish or-
igin, which Simon uses to characterize strategies that successfully deal with
conditions of limited time, knowledge, or computational capacities. His con-
cept of satisficing postulates, for instance, that an organism would choose the
first object (a mate, perhaps) that satisfies its aspiration level—instead of the
intractable sequence of taking the time to survey all possible alternatives, es-
timating probabilities and utilities for the possible outcomes associated with
each alternative, calculating expected utilities, and choosing the alternative
that scores highest.
Let us stress that Simon's notion of bounded rationality has two sides,
one cognitive and one ecological. As early as in Administrative Behavior
(1945), he emphasized the cognitive limitations of real minds as opposed to
The Task
We deal with inferential tasks in which a choice must be made between two
alternatives on a quantitative dimension. Consider the following example:
Which city has a larger population? (a) Hamburg (b) Cologne.
Two-alternative-choice tasks occur in various contexts in which inferences
need to be made with limited time and knowledge, such as in decision making
and risk assessment during driving (e.g., exit the highway now or stay on);
treatment-allocation decisions (e.g., who to treat first in the emergency room:
the 80-year-old heart attack victim or the 16-year-old car accident victim); and
financial decisions (e.g., whether to buy or sell in the trading pit). Inference
concerning population demographics, such as city populations of the past,
present, and future (e.g., Brown & Siegler, 1993), is of importance to people
working in urban planning, industrial development, and marketing. Popula-
tion demographics, which is better understood than, say, the stock market, will
serve us later as a "drosophila" environment that allows us to analyze the
behavior of heuristics.
We study two-alternative-choice tasks in situations in which a person has
to make an inference based solely on knowledge retrieved from memory. We
refer to this as inference from memory, as opposed to inference from givens.
Inference from memory involves search in declarative knowledge and has been
investigated in studies of, inter alia, confidence in general knowledge (e.g.,
Juslin, 1994; Sniezek & Buckley, 1993); the effect of repetition on belief (e.g.,
Hertwig, Gigerenzer, & Hoffrage, 1997); hindsight bias (e.g., Fischhoff, 1977);
quantitative estimates of area and population of nations (Brown & Siegler,
1993); and autobiographic memory of time (Huttenlocher, Hedges, & Prohaska,
1988). Studies of inference from givens, on the other hand, involve making
inferences from information presented by an experimenter (e.g., Hammond,
Hursch, & Todd, 1964). In the tradition of Ebbinghaus's nonsense syllables,
attempts are often made here to prevent individual knowledge from having an
impact on the results by using problems about hypothetical referents instead
of actual ones. For instance, in celebrated judgment and decision-making tasks,
such as the "cab" problem and the "Linda" problem, all the relevant infor-
mation is provided by the experimenter, and individual knowledge about cabs
and hit-and-run accidents, or feminist bank tellers, is considered of no rele-
vance (Gigerenzer & Murray, 1987). As a consequence, limited knowledge or
individual differences in knowledge play a small role in inference from givens.
In contrast, the heuristics proposed in this chapter perform inference from
memory, they use limited knowledge as input, and as we will show, they can
actually profit from a lack of knowledge.
Assume that a person does not know or cannot deduce the answer to the
Hamburg-Cologne question but needs to make an inductive inference from
related real-world knowledge. How is this inference derived? How can we
predict choice (Hamburg or Cologne) from a person's state of knowledge?
Theory
Limited Knowledge
A PMM is an inductive device that uses limited knowledge to make fast in-
ferences. Different from mental models of syllogisms and deductive inference
(Johnson-Laird, 1983), which focus on the logical task of truth preservation
and where knowledge is irrelevant (except for the meanings of connectives
and other logical terms), PMMs perform intelligent guesses about unknown
features of the world, based on uncertain indicators. To make an inference
about which of two objects, a or b, has a higher value, knowledge about a
reference class R is searched, with a, b € R. In our example, knowledge about
the reference class "cities in Germany" could be searched. The knowledge
consists of probability cues Ci (i = 1, . . . , n) and the cue values ai and bi of
the objects for the ith cue. For instance, when making inferences about pop-
ulations of German cities, the fact that a city has a professional soccer team in
the major league (Bundesliga) may come to a person's mind as a potential cue.
That is, when considering pairs of German cities, if one city has a soccer team
in the major league and the other does not, then the city with the team is
likely, but not certain, to have the larger population.
Limited knowledge means that the matrix of objects by cues has missing
entries (i.e., objects, cues, or cue values may be unknown). Figure 8.1 models
the limited knowledge of a person. She has heard of three German cities, a, b,
and c, but not of d (represented by three positive and one negative recognition
values). She knows some facts (cue values) about these cities with respect to
five binary cues. For a binary cue, there are two cue values, positive (e.g., the
city has a soccer team) or negative (it does not). Positive refers to a cue value
that signals a higher value on the target variable (e.g., having a soccer team is
correlated with a large population). Unknown cue values are shown by a ques-
tion mark. Because she has never heard of d, all cue values for object d are,
by definition, unknown.
People rarely know all the information on which an inference could be
based, that is, knowledge is limited. We model limited knowledge in two re-
spects: A person can have (a) incomplete knowledge of the objects in the ref-
erence class (e.g., she recognizes only some of the cities), (b) limited knowledge
of the cue values (facts about cities), or (c) both. For instance, a person who
does not know all of the cities with soccer teams may know some cities with
positive cue values (e.g., Munich and Hamburg certainly have teams), many
with negative cue values (e.g., Heidelberg and Potsdam certainly do not have
teams), and several cities for which cue values will not be known.
The first fast and frugal heuristic presented is called Take The Best, because
its policy is "take the best, ignore the rest." It is the basic heuristic in the
PMM framework. Variants that work faster or with less knowledge are de-
scribed later. We explain the steps of Take The Best for binary cues (the heu-
ristic can be easily generalized to many valued cues), using Figure 8.1 for
illustration.
Take The Best assumes a rank order of cues according to their subjective
validities (as in Figure 8.1). We call the highest ranking cue (that discriminates
between the two alternatives) the best cue. The heuristic is shown in the form
of a flow diagram in Figure 8.2.
Step 1. Search Rule Choose the cue with the highest validity that has not yet
been tried for this choice task. Look up the cue values of the two objects.
Step 2. Stopping Rule If one object has a positive cue value and the other does
not (i.e., either a negative or an unknown value; see Figure 8.3), then stop
search and go on to Step 3. Otherwise go back to Step 1 and search for another
cue. If no further cue is found, then guess.
Step 3. Decision Rule (one-reason decision making) Predict that the object with
the positive cue value has the higher value on the criterion.
Examples: Suppose the task is judging which of the cities a and b is larger (Figure
8.1). Both cities are recognized (Step 0), and search for the best cue results in
a positive and a negative cue value for Cue 1 (Step 1). The cue discriminates,
and search is terminated (Step 2). The person makes the inference that city a
is larger (Step 3).
Suppose now the task is judging which of the cities b and c is larger. Both cities are
recognized (Step 0), and search for the cue values results in a negative cue
value on object b for Cue 1, but the corresponding cue value for object c is
unknown (Step 1). The cue does not discriminate, so search is continued (Step
2). Search for the next cue results in positive and negative cue values for Cue
2 (Step 1). This cue discriminates and search is terminated (Step 2). The person
makes the inference that city b is larger (Step 3).
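The three steps translate almost directly into code. The sketch below is one possible Python rendering; the data layout (dictionaries with cue values coded 1, 0, or None), the function name, and the city profiles are illustrative assumptions reconstructed from the worked examples, not the actual entries of Figure 8.1.

```python
import random

def take_the_best(a, b, recognized, cue_values, ordered_cues):
    """Infer which of two objects has the larger value on the criterion."""
    # Step 0 (recognition): if only one object is recognized, choose it;
    # if neither is recognized, guess.
    if recognized[a] != recognized[b]:
        return a if recognized[a] else b
    if not recognized[a]:
        return random.choice([a, b])
    # Step 1 (search rule): try cues in order of subjective validity.
    for cue in ordered_cues:
        va = cue_values[a].get(cue)  # 1 = positive, 0 = negative, None = unknown
        vb = cue_values[b].get(cue)
        # Step 2 (stopping rule) and Step 3 (one-reason decision):
        # stop at the first cue for which exactly one object is positive.
        if va == 1 and vb != 1:
            return a
        if vb == 1 and va != 1:
            return b
    return random.choice([a, b])  # no cue discriminates: guess

# Hypothetical profiles consistent with the two examples above:
cue_values = {"a": {"c1": 1, "c3": 0},
              "b": {"c1": 0, "c2": 1, "c3": 1, "c4": 0},
              "c": {"c2": 0}}
recognized = {"a": True, "b": True, "c": True}
cues = ["c1", "c2", "c3", "c4", "c5"]
print(take_the_best("a", "b", recognized, cue_values, cues))  # a (Cue 1 decides)
print(take_the_best("b", "c", recognized, cue_values, cues))  # b (Cue 2 decides)
```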
The features of this heuristic are (a) search extends through only a portion
of the total knowledge in memory (as shown by the shaded and dotted parts
of Figure 8.1) and is stopped immediately when the first discriminating cue is
found, (b) the algorithm does not attempt to integrate information but uses
one-reason decision making instead, and (c) the total amount of information
processed is contingent on each task (pair of objects) and varies in a predict-
able way among individuals with different knowledge. This fast and compu-
tationally simple heuristic is a model of bounded rationality rather than of
classical rationality. There is a close parallel with Simon's concept of "satis-
ficing": Take The Best stops search after the first discriminating cue is found,
just as Simon's satisficing algorithm stops search after the first option that
meets an aspiration level.
The heuristic is hardly a standard statistical tool for inductive inference: It
does not use all available information, it is non-compensatory and nonlinear,
Figure 8.3 Stopping rule. A cue discriminates between two alternatives if one
has a positive cue value and the other does not. The four discriminating
cases are shaded. If a cue discriminates, search is stopped.
and variants of it can violate transitivity. Thus it differs from standard linear
tools for inference such as multiple regression, as well as from nonlinear neu-
ral networks that are compensatory in nature. Take The Best is noncompen-
satory because only the best discriminating cue determines the inference or
decision; no combination of other cue values can override this decision. In this
way, the heuristic does not conform to the classical economic view of human
behavior (e.g., Becker, 1976), where, on the assumption that all aspects can be
reduced to one dimension (e.g., money), there always exists a trade-off between
commodities or pieces of information. That is, the heuristic violates the Ar-
chimedean axiom, which implies that for any multidimensional object a (a1,
a2, . . . , an) preferred to b (b1, b2, . . . , bn), where a1 dominates b1, this preference
can be reversed by taking multiples of any one or a combination of b2, b3, . . . ,
bn. As we discuss, variants of this heuristic also violate transitivity, one of the
cornerstones of classical rationality (McClennen, 1990).
Empirical Evidence
Despite their flagrant violation of traditional standards of rationality, Take The
Best and PMM theory have been successful in integrating various extant phe-
nomena in inference from memory and predicting novel phenomena. These
include conditions under which overconfidence occurs, disappears, and in-
verts to underestimation (Gigerenzer, 1993b; Juslin, 1993, 1994; Juslin, Win-
man, & Persson, 1995; but see Griffin & Tversky, 1992), and those in which the
hard-easy effect occurs, disappears, and inverts—predictions that have been
experimentally confirmed by Hoffrage (1994) and by Juslin (1993).
Fast and frugal heuristics allow for predictions of individual choices, in-
cluding individual differences based on each person's knowledge. Broder (in
press) reported that when search for information is costly, about 65% of the
participants' choices were consistent with Take The Best, compared to fewer
than 10% with a linear strategy. (For similar results, see Rieskamp & Hoffrage,
1999.) Hoffrage and Hertwig (1999) showed that a memory updating model
with Take The Best could correctly predict some 75% of all individual occur-
rences of hindsight bias. Goldstein and Gigerenzer (1999) showed that the rec-
ognition heuristic predicted individual participants' choices in about 90% to
100% of all cases, even when participants were taught information that sug-
gested doing otherwise (negative cue values for the recognized objects). Among
the evidence for the empirical validity of Take The Best are the tests of a bold
prediction, the less-is-more effect, which postulates conditions under which
people with little knowledge make better inferences than those who know
more. This surprising prediction has been experimentally confirmed. For in-
stance, U.S. students make slightly more correct inferences about German city
populations (about which they know little) than about U.S. cities, and vice
versa for German students (Gigerenzer, 1993b; Goldstein & Gigerenzer, 1999;
Hoffrage, 1994). The recognition heuristic has been successfully applied to
stock investment (Borges et al., 1999); on rumor-based stock market trading,
see DiFonzo (1994). Other species also practice one-reason decision making
closely resembling Take The Best, such as when female guppies choose be-
tween males on the basis of an order of cues (Dugatkin, 1996). For general
reviews, see Gigerenzer et al. (1999) and McClelland and Bolger (1994).
The reader familiar with the original heuristic presented in Gigerenzer et
al. (1991, see Chapter 7) will have noticed that we simplified the stopping
rule.1 In the present version, search is already terminated if one object has a
positive cue value and the other does not, whereas in the earlier version,
search was terminated only when one object had a positive value and the other
a negative one (cf. Figure 7.3 in Chapter 7 with Figure 8.3 in this chapter).
This change follows empirical evidence that participants tend to use this faster,
simpler stopping rule (Hoffrage, 1994).
This chapter does not attempt to provide further empirical evidence. For
the moment, we assume that the model is descriptively valid and investigate
how accurate this fast and frugal heuristic is in drawing inferences about un-
known aspects of a real-world environment. Can a heuristic based on simple
psychological principles that violate the norms of classical rationality make a
fair number of accurate inferences?
The Environment
We tested the performance of Take The Best on how accurately it made infer-
ences about a real-world environment. The environment was the set of all
cities in Germany with more than 100,000 inhabitants (83 cities after German
reunification), with population as the target variable. The model of the envi-
ronment consisted of 9 binary ecological cues and the actual 9 X 83 cue values.
The full model of the environment is shown in Gigerenzer and Goldstein
(1996a).
Each cue has an associated validity that is indicative of its predictive power.
The ecological validity of a cue is the relative frequency with which the cue
correctly predicts the target, defined with respect to the reference class (e.g.,
all German cities with more than 100,000 inhabitants). For instance, if one
checks all pairs in which one city has a soccer team but the other city does
not, one finds that in 87% of these cases, the city with the team also has the
higher population. This value is the ecological validity of the soccer team cue.
The validity vi of the ith cue is

vi = p[t(a) > t(b) | ai is positive and bi is negative],

where t(a) and t(b) are the values of objects a and b on the target variable t
and p is a probability measured as a relative frequency in R.
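A brute-force computation of such a validity over a fully known reference class might look as follows; the function, the variable names, and the miniature data set are illustrative assumptions, not material from the original study.

```python
from itertools import combinations

def ecological_validity(objects, target, cue):
    """Relative frequency with which the object with the positive cue value
    has the larger target value, among all pairs the cue discriminates."""
    hits = pairs = 0
    for a, b in combinations(objects, 2):
        if cue[a] == cue[b]:
            continue  # cue does not discriminate this pair
        pairs += 1
        pos, neg = (a, b) if cue[a] == 1 else (b, a)
        hits += target[pos] > target[neg]
    return hits / pairs if pairs else None

# Tiny made-up example (populations and a soccer-team cue):
population = {"X": 1_700_000, "Y": 960_000, "Z": 140_000}
soccer_team = {"X": 1, "Y": 1, "Z": 0}
print(ecological_validity(list(population), population, soccer_team))  # 1.0
```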
The ecological validity of the nine cues ranged over the whole spectrum:
from .51 (only slightly better than chance) to 1.0 (certainty), as shown in Table
8.1. A cue with a high ecological validity, however, is often not useful if its
discrimination rate is small.

1. Also, we now use the term stopping rule instead of activation rule.
Table 8.1 also shows the discrimination rates for each cue. The discrimi-
nation rate of a cue is the relative frequency with which the cue discriminates
between any two objects from the reference class. The discrimination rate is a
function of the distribution of the cue values and the number N of objects in
the reference class. Let the relative frequencies of the positive and negative
cue values be xi and yi, respectively. Then the discrimination rate di of the ith
cue is

di = 2xiyi / (1 - 1/N).

2. For instance, if N = 2 and one cue value is positive and the other negative (xi =
yi = .5), di = 1.0. If N increases, with xi and yi held constant, then di decreases and
converges to 2xiyi.
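The formula and the footnote's example can be checked in a few lines; the sketch below is merely a numerical illustration.

```python
def discrimination_rate(x, y, n):
    """d = 2xy / (1 - 1/N) for a binary cue with relative frequencies x and y
    of positive and negative values in a reference class of N objects."""
    return (2 * x * y) / (1 - 1 / n)

print(discrimination_rate(0.5, 0.5, 2))                   # 1.0, as in footnote 2
print(round(discrimination_rate(0.5, 0.5, 10_000), 3))    # approaches 2xy = .5
```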
The Competition
The question of how well a fast and frugal heuristic performs in a real-world
environment has rarely been posed in research on inductive inference. The
present simulations seem to be the first to test how well one-reason decision
making does compared with standard integration strategies, which require
more knowledge, time, and computational power. This question is important
for Simon's postulated link between the cognitive and the ecological: If the
simple psychological principles in fast and frugal heuristics are tuned to ec-
ological structures, these heuristics should not fail outright. We propose a com-
petition between various inferential strategies. The contest will go to the strat-
egy that scores the highest proportion of correct inferences (accuracy) using
the smallest number of cues (frugality).
3. There are various other measures of redundancy besides pairwise correlation. The
important point is that whatever measure of redundancy one uses, the resultant value
does not have the same meaning for all strategies. For instance, all that counts for Take
The Best is what proportion of correct inferences the second cue adds to the first in the
cases where the first cue does not discriminate, how much the third cue adds to the first
two in the cases where they do not discriminate, and so on. If a cue discriminates, search
is terminated, and the degree of redundancy in the cues that were not included in the
search is irrelevant. Integration strategies, in contrast, integrate all information and, thus,
always work with the total redundancy in the environment (or knowledge base). For
instance, when deciding among objects a, b, c, and d in Figure 8.1, the cue values of
Cues 3, 4, and 5 do not matter from the point of view of Take The Best (because search
is terminated before reaching Cue 3). However, the values of Cues 3, 4, and 5 affect the
redundancy of the ecological system, from the point of view of all integration algorithms.
The lesson is that the degree of redundancy in an environment depends on the kind of
strategy that operates on the environment. One needs to be cautious in interpreting
measures of redundancy without reference to a strategy.
For each state of limited knowledge, we simulated 500 in-
dividuals, who differed randomly from one another in the particular objects
and cue values they knew. All objects and cue values known were determined
randomly within the appropriate constraints, that is, a certain number of ob-
jects known, a certain total percentage of cue values known, and the validity
of the recognition heuristic (as explained in the following paragraph).
The simulation needed to be realistic in the sense that the simulated people
could invoke the recognition heuristic. Therefore, the sets of cities the simu-
lated people knew had to be carefully chosen so that the recognized cities were
larger than the unrecognized ones a certain percentage of the time. We per-
formed a survey to get an empirical estimate of the actual covariation between
recognition of cities and city populations. Let us define the recognition validity
α to be the probability, in a reference class, that one object has a greater value
on the target variable than another, in the cases where one object is recognized
and the other is not:
α = p[t(a) > t(b) | ar is positive and br is negative],
where t(a) and t(b) are the values of objects a and b on the target variable t, ar
and br are the recognition values of a and b, and p is a probability measured
as a relative frequency in R.
In a pilot study of 26 undergraduates at the University of Chicago, we found
that the cities they recognized (within the 83 largest in Germany) were larger
than the cities they did not recognize in about 80% of all possible comparisons.
We incorporated this value into our simulations by choosing sets of cities (for
each knowledge state, i.e., for each number of cities recognized) where the
known cities were larger than the unknown cities in about 80% of all cases.
Thus the cities known by the simulated individuals had the same relationship
between recognition and population as did those of the human individuals.
Let us first look at the performance of Take The Best.
Figure 8.4 Correct inferences about the population of German cities (two-
alternative-choice tasks) by Take The Best. Inferences are based on actual in-
formation about the 83 largest cities and nine cues for population (see text).
Limited knowledge of the simulated individuals is varied across two dimen-
sions: (a) the number of cities recognized (x axis) and (b) the percentage of
cue values known (the six curves).
A striking result was that the maximum proportion of correct inferences was not achieved when individuals knew all cue
values of all cities, but rather when they knew less. This result shows the
ability of the heuristic to exploit limited knowledge, that is, to do best when
not everything is known. Thus, Take The Best produces the less-is-more effect.
At any level of limited knowledge of cue values, learning more German cities
will eventually cause a decrease in the proportion correct. Take, for instance,
the curve where 75% of the cue values were known and the point where the
simulated participants recognized about 60 German cities. If these individuals
learned about the remaining German cities, their proportion correct would de-
crease. The rationale behind the less-is-more effect is the recognition heuristic,
and it can be understood best from the curve that reflects 0% of total cue values
known. Here, all decisions are made on the basis of the recognition heuristic,
or by guessing. On this curve, the recognition heuristic comes into play most
when half of the cities are known, so it takes on an inverted-U shape. When
half the cities are known, the recognition heuristic can be activated most often,
that is, for roughly 50% of the questions. Because we set the recognition va-
lidity in advance, 80% of these inferences will be correct. In the remaining
half of the questions, when recognition cannot be used (either both cities are
recognized or both cities are unrecognized), then the organism is forced to
guess and only 50% of the guesses will be correct. Using the 80% effective
recognition validity half of the time and guessing the other half of the time,
the organism scores 65% correct, which is the peak of the bottom curve. The
mode of this curve moves to the right with increasing knowledge about cue
values. Note that even when a person knows everything, all cue values of all
cities, there are states of limited knowledge in which the person would make
more accurate inferences. We are not going to discuss the conditions of this
counterintuitive effect and the supporting experimental evidence here (see
Goldstein & Gigerenzer, 1999). Our focus is on how much better integration
strategies can do in making inferences.
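The 65% peak of the bottom curve follows from a little arithmetic. The sketch below assumes, as stated above, a recognition validity of .80 and pure guessing whenever recognition does not discriminate; the function name is illustrative.

```python
def expected_accuracy(n_cities, n_recognized, alpha=0.80):
    """Expected proportion correct on the 0%-cue-knowledge curve."""
    pairs = n_cities * (n_cities - 1) / 2
    recognition_pairs = n_recognized * (n_cities - n_recognized)
    p_recognition = recognition_pairs / pairs     # recognition discriminates
    return p_recognition * alpha + (1 - p_recognition) * 0.5

# About half of the 83 cities recognized: roughly .5 * .8 + .5 * .5 = .65 correct.
print(round(expected_accuracy(83, 42), 2))        # 0.65
```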
Integration Strategies
Contestant 1: Tallying A simple integration strategy is to tally the positive cue
values (including the recognition value) for each object and to choose the
object with the larger tally. If both objects have the same number of positive
cue values, then guess.

Contestant 2: Weighted Tallying This strategy weights each positive cue value
by the ecological validity of the cue and sums these weights for each object;
the object with the larger sum is chosen. If both sums are equal, then guess.
Note that weighted tallying needs more information than either tallying or Take
The Best, namely, quantitative information about ecological validities. In the
simulation, we provided the real ecological validities to give this strategy a
good chance.
Calling again on the comparison of objects a and b from Figure 8.1, let us
assume that the validities would be .8 for recognition and .9, .8, .7, .6, .51 for
Cues 1 through 5. Weighted tallying would thus assign 1.7 points to a and 2.3
points to b. Thus weighted tallying would also choose b to be the larger.
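A minimal sketch of the two tallying contestants follows. The cue profiles are hypothetical reconstructions chosen to reproduce the sums 1.7 and 2.3 in the example above; they are not the actual entries of Figure 8.1.

```python
def tally(values):
    """Number of positive cue values; negatives and unknowns are ignored."""
    return sum(1 for v in values if v == 1)

def weighted_tally(values, validities):
    """Sum of the validities of the positive cue values."""
    return sum(w for v, w in zip(values, validities) if v == 1)

# Cue order: recognition, then Cues 1-5; 1 = positive, 0 = negative, None = unknown.
validities = [.8, .9, .8, .7, .6, .51]
a = [1, 1, None, 0, None, None]
b = [1, 0, 1, 1, 0, None]
print(tally(a), tally(b))                          # 2 3 -> tallying chooses b
print(round(weighted_tally(a, validities), 2),
      round(weighted_tally(b, validities), 2))     # 1.7 2.3 -> also chooses b
```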
Both tallying strategies treat negative information and missing information
identically. That is, they consider only positive evidence. The following strat-
egies distinguish between negative and missing information and integrate both
positive and negative information.
Contestant 3: Unit-Weight Linear Model This strategy adds +1 for every posi-
tive cue value and -1 for every negative cue value of an object (unknown cue
values contribute nothing) and chooses the object with the higher score.
Comparing objects a and b from Figure 8.1 would involve assigning 1.0
points to a and 1.0 points to b and, thus, choosing randomly. This simple linear
model corresponds to Model 2 in Einhorn and Hogarth (1975, p. 177) with the
weight parameter set equal to 1.
Contestant 4: Weighted Linear Model This model is like the unit-weight linear
model except that the values of ai and bi are multiplied by their respective
ecological validities. The decision criterion is the same as with weighted tal-
lying. The weighted linear model (or some variant of it) is often viewed as an
optimal rule for preferential choice, under the idealization of independent di-
mensions or cues (e.g., Keeney & Raiffa, 1993; Payne et al., 1993). Comparing
objects a and b from Figure 8.1 would involve assigning 1.0 points to a and
0.8 points to b and, thus, choosing a to be the larger.
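The two linear contestants differ from the tallying strategies in that negative cue values count against an object. The sketch below reuses the hypothetical profiles and validities from the tallying sketch and reproduces the scores mentioned in the text; it is an illustration, not the original simulation code.

```python
def unit_weight_score(values):
    """+1 per positive and -1 per negative cue value; unknowns contribute 0."""
    return sum(1 if v == 1 else -1 for v in values if v is not None)

def weighted_linear_score(values, validities):
    """Signed cue values multiplied by their ecological validities."""
    return sum(w if v == 1 else -w
               for v, w in zip(values, validities) if v is not None)

validities = [.8, .9, .8, .7, .6, .51]
a = [1, 1, None, 0, None, None]
b = [1, 0, 1, 1, 0, None]
print(unit_weight_score(a), unit_weight_score(b))        # 1 1 -> guess
print(round(weighted_linear_score(a, validities), 1),
      round(weighted_linear_score(b, validities), 1))    # 1.0 0.8 -> choose a
```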
Contestant 5: Multiple Regression The weighted linear model reflects the dif-
ferent validities of the cues but not the dependencies between cues. Multiple
regression creates weights that reflect the covariances between predictors or
cues and is commonly seen as an "optimal" linear way to integrate various
pieces of information into an estimate (e.g., Brunswik, 1955; Hammond, 1966).
Neural networks using the delta rule determine their "optimal" weights by the
same principles as multiple regression does (Stone, 1986). The delta rule car-
ries out the equivalent of a multiple linear regression from the input patterns
to the targets.
The weights for the multiple regression could simply be calculated from
the full information about the nine ecological cues. To make multiple regres-
sion an even stronger competitor, we also provided information about which
cities the simulated individuals recognized. Thus the multiple regression used
nine ecological cues and the recognition cue to generate its weights. Because
the weights for the recognition cue depend on which cities are recognized, we
calculated 6 X 500 X 84 sets of weights: one for each simulated individual.
Unlike any of the other strategies, regression had access to the actual city pop-
ulations (even for those cities not recognized by the hypothetical person) in
the calculation of the weights.4 During the quiz, each simulated person used
the set of weights provided to it by multiple regression to estimate the popu-
lations of the cities in the comparison.
There was a missing-values problem in computing these 6 X 84 X 500 sets
of regression coefficients, because most simulated individuals did not know
certain cue values, for instance, the cue values of the cities they did not rec-
ognize. We strengthened the performance of multiple regression by substituting
unknown cue values with the average of the cue values the person knew for
the given cue.5 This was done both in creating the weights and in using these
weights to estimate populations. Unlike cross-validation procedures in which
weights are estimated from one half of the data and inferences based on these
weights are made for the other half, the regression strategy had access to all
the information (except, of course, the unknown cue values)—more informa-
tion than was given to any of the competitors. In the competition, multiple
regression and, to a lesser degree, the weighted linear model approximate the
ideal of the Laplacean demon.

4. We cannot claim that these integration strategies are the best ones, nor can we
know a priori which small variations will succeed in our bumpy real-world environ-
ment. An example: After we had completed the simulations, we learned that regressing
on the ranks of the cities does slightly better than regressing on the city populations.
The key issue is what are the structures of environments in which particular strategies
and variants thrive.
Results
Frugality Take The Best is designed to enable quick decision making. Com-
pared with the integration strategies, how frugal is it, measured by the amount
of information searched in memory? For instance, in Figure 8.1, Take The Best
would look up four cue values (including the recognition cue values) to infer
that a is larger than b. None of the integration strategies use limited search;
thus they always look up all cue values.
Figure 8.5 shows the number of cue values retrieved from memory by Take
The Best for various levels of limited knowledge. Take The Best reduces search
in memory considerably. Depending on the knowledge state, this heuristic
needed to search for between 2 (the number of recognition values) and 20 (the
maximum possible cue values: Each city has nine cue values and one recog-
nition value). For instance, when a person recognized half of the cities and
knew 50% of their cue values, then, on average, only about 4 cue values (that
is, one fifth of all possible) were searched for. The average across all simulated
participants was 5.9, which was less than a third of all available cue values.
Accuracy Given that it searches only for a limited amount of information, how
accurate is Take The Best, compared with the integration strategies? We ran
the competition for all states of limited knowledge shown in Figure 8.4. We
first report the results of the competition in the case where each strategy
achieved its best performance: when 100% of the cue values were known.
Figure 8.6 shows the results of the simulations, carried out in the same way
as those in Figure 8.4.
To our surprise, Take The Best drew as many correct inferences as any of
the other strategies, and more than some. The curves for Take The Best, mul-
tiple regression, weighted tallying, and tallying are so similar that there are
only slight differences among them. Weighted tallying performed about as well
as tallying, and the unit-weight linear model performed about as well as the
weighted linear model—demonstrating that the previous finding that weights
may be chosen in a fairly arbitrary manner, as long as they have the correct
sign (Dawes, 1979), is generalizable to tallying. The two integration strategies
that make use of both positive and negative information, unit-weight and
weighted linear models, made considerably fewer correct inferences. By look-
ing at the lower-left and upper-right corners of Figure 8.6, one can see that all
competitors do equally well with a complete lack of knowledge or with com-
plete knowledge. They differ when knowledge is limited. Note that some strat-
egies can make more correct inferences when they do not have complete
knowledge: a demonstration of the less-is-more effect mentioned earlier.

5. If no single cue value was known for a given cue, the missing values were sub-
stituted by .5. This value was chosen because it is the midpoint of 0 and 1, which are
the values used to stand for negative and positive cue values, respectively.

Figure 8.5 Frugality: Number of cue values looked up by Take The Best and
by the competing integration strategies (see text), depending on the number
of objects recognized (0-83) and the percentage of cue values known.
What was the result of the competition across all levels of limited knowl-
edge? Table 8.2 shows the result for each level of limited knowledge of cue val-
ues, averaged across all levels of recognition knowledge. (Table 8.2 reports also
the performance of two variants of Take The Best, which we discuss later: the
Minimalist and Take The Last.) The values in the 100% column of Table 8.2 are
the values in Figure 8.6 averaged across all levels of recognition. Take The Best
made as many correct inferences as one of the competitors (weighted tallying)
and more than the others. Because it was also the most frugal, we judged Take The Best the overall winner of the competition.
To our knowledge, this is the first time that it has been demonstrated that
a fast and frugal heuristic, that is, Take The Best, can draw as many correct
inferences about a real-world environment as integration strategies, across all levels of limited knowledge.
Figure 8.6 Results of the competition. The curve for Take The Best is identi-
cal with the 100% curve in Figure 8.4. The results for proportion correct
have been smoothed by a running median smoother, to lessen visual noise
between the lines.
(Note to Table 8.2: Values are rounded; averages are computed from the unrounded values. The bottom two heuristics listed in the table are variants of Take The Best.)
6. The proof for this is as follows. The tallying score t for a given object is the number n+ of positive cue values, as defined above. The score u for the unit-weight linear model is u = n+ − n−, where n− is the number of negative cue values. Under complete knowledge, n = n+ + n−, where n is the number of cues. Thus t = n+ and u = n+ − n−. Because n− = n − n+, substitution into the formula for u gives u = n+ − (n − n+) = 2t − n.
The tally score of an unrecognized object, thus, is always smaller than the tally for a recognized object, which is at least
1 (for tallying, or .8 for weighted tallying, due to the positive value on the
recognition cue). Thus tallying always arrives at the inference that a recognized
object is larger than an unrecognized one.
Note that this explanation of the different performances puts the full weight on a psychological principle (the recognition heuristic), made explicit in Take The Best, rather than on the statistical issue of how to find optimal weights in a
linear function. To test this explanation, we reran the simulations for the unit-
weight and weighted linear models under the same conditions but replaced
the recognition cue with the recognition heuristic. The simulation showed that
the recognition heuristic accounts for all the difference.
Take The Best has two simpler variants that require even less knowledge. One of them, Take The Last, first tries the cue that discriminated the last time. If this cue does not discriminate, the heuristic then tries the cue that discriminated the time before last, and so on. The algorithm differs from Take The Best in Step 1, which is now reformulated as Step 1'.
Step 1'. Search Rule If there is a record of which cues stopped search in pre-
vious problems, choose the cue that stopped search in the most recent problem
and that has not yet been tried. Look up the cue values of the two objects.
Otherwise try a random cue and build up such a record.
Thus, in Step 2, the algorithm goes back to Step 1'. Variants of this search
principle have been studied as the "Einstellung effect" in the water jar exper-
iments (Luchins & Luchins, 1994), in which the solution strategy of the most
recently solved problem is tried first on the subsequent problem. This effect
has also been noted in physicians' generation of diagnoses for clinical cases
(Weber, Bockenholt, Hilton, & Wallace, 1993).
This heuristic does not need a rank order of cues according to their valid-
ities; all that needs to be estimated is the direction in which a cue points. The
rank order of cue validities is replaced by a memory of which cues were last
used. Note that such a record can be built up independently of any knowledge
about the structure of an environment and neither needs, nor uses, any feed-
back about whether inferences are right or wrong.
The second variant, the Minimalist, dispenses even with this memory of which cues stopped search; it replaces Step 1 with Step 1''.

Step 1''. Search Rule Draw a cue randomly (without replacement) and look up the cue values of the two objects.
The Minimalist does not necessarily speed up search, but it tries to get by
with even less knowledge than any other strategy.
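A minimal sketch of the two search rules (Step 1' and Step 1'' only; all names are illustrative assumptions):

    import random

    # Step 1' (Take The Last): try the cue that stopped search most recently
    # and has not yet been tried in this problem; otherwise try a random cue.
    def next_cue_take_the_last(record, tried, all_cues):
        for cue in record:                          # record is ordered, most recent first
            if cue not in tried:
                return cue
        untried = [c for c in all_cues if c not in tried]
        return random.choice(untried) if untried else None

    # Step 1'' (Minimalist): draw a cue at random, without replacement.
    def next_cue_minimalist(tried, all_cues):
        untried = [c for c in all_cues if c not in tried]
        return random.choice(untried) if untried else None

    # Whenever a cue stops search, Take The Last moves it to the front of its record.
    def update_record(record, cue):
        if cue in record:
            record.remove(cue)
        record.insert(0, cue)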
Results
Frugality How frugal are the heuristics? The simulations showed that for each
of the two variant heuristics, the relationship between amount of knowledge
and the number of cue values looked up had the same form as for Take The
Best (Figure 8.5). That is, unlike the integration strategies, the curves are con-
cave and the number of cues searched for is maximal when knowledge of cue
values is lowest. The average number of cue values looked up was lowest for
Take The Last (5.3) followed by the Minimalist (5.6) and Take The Best (5.9).
As knowledge becomes more and more limited (on both dimensions: recog-
nition and cue values known), the difference in frugality becomes smaller and
smaller. The reason why the Minimalist looks up fewer cue values than Take
The Best is that cue validities and cue discrimination rates are negatively cor-
related (Table 8.1); therefore, randomly chosen cues tend to have larger dis-
crimination rates than cues chosen by cue validity.
Accuracy What is the price to be paid for speeding up search or reducing the
knowledge of cue orderings and discrimination histories to nothing? We tested
the performance of the two heuristics in the same environment as all other
strategies. Figure 8.7 shows the proportion of correct inferences that the Min-
imalist achieved. For comparison, the performance of Take The Best with
100% of cue values known is indicated by a dotted line. Note that the Minimalist lost comparatively little accuracy.
first increases the chances of finding a discriminating cue that points in the
right direction (toward the larger city). We learned our lesson and reran the
whole competition with randomly ordered pairs of cities.
Discussion
The competition showed a surprising result: The Take The Best heuristic drew
as many correct inferences about unknown features of a real-world environ-
ment as any of the integration strategies, and more than some of them. Two
further simplifications of the heuristic—Take The Last (replacing knowledge
about the rank orders of cue validities with a memory of the discrimination
history of cues) and Minimalist (dispensing with both)—showed a compara-
tively small loss in correct inferences, and only when knowledge about cue
values was high.
To the best of our knowledge, this is the first inference competition between
fast and frugal heuristics and "rational" strategies in a real-world environment.
The result is of importance for encouraging research that focuses on the power
of simple psychological mechanisms, that is, on the design and testing of mod-
els of bounded rationality. The result is also of importance as an existence
proof that cognitive strategies capable of successful performance in a real-
world environment do not need to satisfy the classical norms of rational in-
ference. The classical norms may be sufficient but are not necessary for good
inference in real environments.
Bounded Rationality
In this section, we discuss the fundamental psychological mechanism postu-
lated by the PMM family of heuristics: one-reason decision making. We discuss
how this mechanism exploits the structure of environments in making fast
inferences that differ from those arising from standard models of rational rea-
soning.
Limited Search Both one-reason decision making and the recognition heuristic
realize limited search by defining stopping rules. Integration strategies, in con-
trast, do not provide any model of stopping rules and implicitly assume ex-
haustive search (although they may provide rules for tossing out some of the
variables in a lengthy regression equation). Stopping rules are crucial for mod-
eling inference under limited time, as in Simon's examples of satisficing, where
search among alternatives terminates when a certain aspiration level is met.
Figure 8.8 Limited knowledge and a stricter stopping rule can produce intran-
sitive inferences.
Figure 8.8 illustrates a state of knowledge in which this stricter stopping rule (one that stops search only when one object has a positive and the other a negative value on the same cue) gives the result that a dominates b, b dominates c, and c dominates a.7
Biological systems, for instance, can exhibit systematic intransitivities
based on incommensurability between two systems on one dimension (Gilpin,
1975; Lewontin, 1968). Imagine three species: a, b, and c. Species a inhabits
both water and land; species b inhabits both water and air. Therefore, the two
only compete in water, where species a defeats species b. Species c inhabits
land and air, so it only competes with b in the air, where it is defeated by b.
Finally, when a and c meet, it is only on land, and here, c is in its element
and defeats a. A linear model that estimates some value for the combative
strength of each species independently of the species with which it is com-
peting would fail to capture this nontransitive cycle.
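A few lines suffice to check that no single strength score can reproduce the cycle (a minimal illustration; the labels are arbitrary):

    from itertools import permutations

    # Pairwise winners; each pair of species meets only in the one habitat it shares.
    beats = {("a", "b"): "a",   # water: a defeats b
             ("b", "c"): "b",   # air:   b defeats c
             ("c", "a"): "c"}   # land:  c defeats a

    consistent = []
    for order in permutations("abc"):               # every strict ranking by "strength"
        rank = {s: i for i, s in enumerate(order)}  # smaller rank = stronger
        if all(rank[w] < rank[x if w == y else y] for (x, y), w in beats.items()):
            consistent.append(order)
    print(consistent)   # [] -- no linear ordering captures the intransitive cycle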
Inferences without Estimation Einhorn and Hogarth (1975) noted that in the
unit-weight model "there is essentially no estimation involved in its use"
(p. 177), except for the sign of the unit weight. A similar result holds for the
heuristics reported here. Take The Best does not need to estimate regression
weights; it only needs to estimate a rank ordering of cue validities. Take The
Last and the Minimalist involve essentially no estimation (except for the sign
of the cues). The fact that there is no estimation problem has an important
consequence: An organism can use as many cues as it has experienced, without
being concerned about whether the size of the sample experienced is suffi-
ciently large to generate reliable estimates of weights.
7. Note that missing knowledge is necessary for intransitivities to occur. If all cue
values are known, no intransitive inferences can possibly result. Take The Best with the
stricter stopping rule allows precise predictions about the occurrence of intransitivities
over the course of knowledge acquisition. For instance, imagine a person whose knowl-
edge is described by Figure 8.8, except that she does not know the value of Cue 2 for
object c. This person would make no intransitive judgments comparing objects a, b, and
c. If she were to learn that object c had a negative cue value for Cue 2, she would produce
an intransitive judgment. If she learned one piece more, namely, the value of Cue 1 for
object c, then she would no longer produce an intransitive judgment. The prediction is
that transitive judgments should turn into intransitive ones and back during learning.
Thus intransitivities do not simply depend on the amount of limited knowledge but also
on what knowledge is missing.
Cue Redundancy and Performance Einhorn and Hogarth (1975) suggested that
unit-weight models can be expected to perform approximately as well as
proper linear models when (a) R2 from the regression model is in the moderate
or low range (around .5 or smaller) and (b) predictors (cues) are correlated.
Are these two criteria necessary, sufficient, or both to explain the performance
of Take The Best? Take The Best and its variants certainly can exploit cue
redundancy: If cues are highly correlated, one cue can do the job.
We have already seen that in the present environment, R2 = .87, which is
in the high rather than the moderate or low range. As mentioned earlier, the
pairwise correlations between the nine ecological cues ranged between -.25
and .54, with an absolute average value of .19. Thus, despite a high R2 and
only moderate-to-small correlation between cues, the heuristics performed
quite successfully. Their excellent performance in the competition can be ex-
plained only partially by cue redundancy, because the cues were only mod-
erately correlated. High cue redundancy, thus, does seem sufficient but is not
necessary for the successful performance of the heuristics.
A New Perspective on the Lens Model Ecological theorists such as Brunswik
(1955) emphasized that the cognitive system is designed to find many path-
ways to the world, substituting missing cues with whatever cues happen to be
available. Brunswik labeled this ability vicarious functioning, in which he saw
the most fundamental principle of a science of perception and cognition. His
proposal to model this adaptive process by linear multiple regression has in-
spired a long tradition of neo-Brunswikian research (Brehmer, 1994; Ham-
mond, 1990), although the empirical evidence for mental multiple regression
is still controversial (e.g., Brehmer & Brehmer, 1988). However, vicarious func-
tioning need not be equated with linear regression. The PMM family of heu-
ristics provides an alternative, nonadditive model of vicarious functioning, in
which cue substitution operates without integration. This offers a new per-
spective on Brunswik's lens model. In a fast and frugal lens model, the first
discriminating cue that passes through inhibits any other rays passing through
and determines judgment (Gigerenzer & Kurz, in press). Noncompensatory vi-
carious functioning is consistent with some of Brunswik's original examples,
such as the substitution of behaviors in Hull's habit-family hierarchy, and the
alternative manifestation of symptoms according to the psychoanalytic writ-
ings of Frenkel-Brunswik (see Gigerenzer & Murray, 1987, chap. 3).
It has been reported sometimes that teachers, physicians, and other profes-
sionals claim that they use seven or so criteria to make judgments (e.g., when
grading papers or making a differential diagnosis) but that experimental tests
showed that they in fact often used only one criterion (Shepard, 1967). At first
glance, this seems to indicate that those professionals make outrageous claims.
But it need not be. If experts' vicarious functioning works according to the
PMM heuristics, then they are correct in saying that they use many predictors,
but the decision is made by only one at any time.
What Counts as Good Reasoning? Much of the research on reasoning in the
last decades has assumed that sound reasoning can be reduced to principles of logic and probability.
At the beginning of this chapter, we pointed out the common opposition be-
tween the rational and the psychological, which emerged in the nineteenth
century after the breakdown of the classical interpretation of probability. Since
then, rational inference has commonly been reduced to logic and probability theory, and psychological explanations have been called on when things go wrong. This di-
vision of labor is, in a nutshell, the basis on which much of the current re-
search on reasoning and decision making under uncertainty is built.
We believe that after 40 years of toying with the notion of bounded ration-
ality, it is time to overcome the opposition between the rational and the
psychological and to reunite the two. The PMM family of heuristics provides
precise computational models that attempt to do so. They differ from the En-
lightenment's unified view of the rational and psychological, in that they focus
on simple psychological mechanisms that operate under constraints of limited
time and knowledge and are supported by empirical evidence. The single most
important result in this chapter is that simple psychological mechanisms can
yield about as many (or more) correct inferences, more quickly and with less information, than integration strategies that embody classical properties of ra-
tional inference. The demonstration that a fast and frugal heuristic won the
competition defeats the widespread view that only "rational" strategies can be
accurate. Models of inference do not have to forsake accuracy for simplicity.
The mind can have it both ways.
IV
SOCIAL RATIONALITY
There are several ways in which the mind might implement such a division of labor.
For instance, some have proposed mental modules for intuitive physics,
mathematics, and biology—a view that turns academic subjects into do-
mains. An evolutionary perspective suggests that a different division of labor
has evolved, one directed at solving important adaptive problems, such as
attachment development, mate search, parenting, social exchange, coalition
formation, and maintaining and upsetting dominance hierarchies. The mod-
ule dedicated to solving each of these problems needs to integrate motiva-
tion, perception, thinking, emotion, and behavior into a functional unit. This
is not to say that domain-specific modules are encapsulated and dissociated;
they are probably as coordinated as the violins, violas, cellos, oboes, and
French horns in an orchestra, or the liver, kidneys, lungs, and heart in a hu-
man body.
The idea of modules specialized for certain adaptive problems conflicts
with the compartmentalization of psychology. Today's areas of specialization
are defined in terms of faculties, such as memory, thinking, decision making,
intelligence, motivation, and emotion. These faculties have become institu-
tionalized in modern university curricula and grant agencies. They deter-
mine the professional self-perception of our colleagues, what they read and
what they ignore, their departmental alliances, and the hiring of professors.
If you ask a psychologist at a conference what she is doing, you will proba-
bly get an answer such as "I am a cognitive psychologist," "I do emotions,"
"I am a judgment and decision-making person," or "My field is motivation."
Evolutionary thinking is an antidote to this faculty view of the mind. Adap-
tive problems and their modern equivalents, such as foraging and dieting
and social exchange and markets, demand the orchestration of these facul-
ties, not their segregation.
In my opinion, the partitioning of psychological research into faculties is
one of the greatest barriers to progress. Research on modularity forces us to
reconsider the borders that have gone unquestioned for many decades. Re-
thinking rationality means rethinking the organization of the fields that study
it. Most interesting problems do not respect today's disciplinary boundaries.
Nor should we.
9
Rationality
Why Social Context Matters
I want to argue against an old and beautiful dream. It was Leibniz's dream,
but not his alone. Leibniz (1677/1951) hoped to reduce rational reasoning to
a universal calculus, which he termed the Universal Characteristic. The plan
was simple: to establish characteristic numbers for all ideas, which would
reduce every question to calculation. Such a rational calculus would put an
end to scholarly bickering; if a dispute arose, the contending parties could
settle it quickly and peacefully by sitting down and calculating. For some time,
the Enlightenment probabilists believed that the mathematical theory of prob-
ability had made this dream a reality. Probability theory rather than logic be-
came the flip side of the newly coined rationality of the Enlightenment, which
acknowledged that humankind lives in the twilight of probability rather than
the noontime sun of certainty, as John Locke expressed it. Leibniz guessed
optimistically of the Universal Characteristic that "a few selected persons
might be able to do the whole thing in five years" (Leibniz, 1677/1951, p. 22).
By around 1840, however, mathematicians had given up as thankless and even
antimathematical the task of reducing rationality to a calculus (Daston, 1988).
Psychologists and economists have not.
Contemporary theories embody Leibniz's dream in various forms. Piaget
and Inhelder's (1951/1975) theory of cognitive development holds that, by
roughly age 12, human beings begin to reason according to the laws of prob-
ability theory; Piaget and Inhelder thus echo the Enlightenment conviction that
human rationality and probability theory are two sides of the same coin. Neo-
classical economic theories center on the assumption that Jacob Bernoulli's
expected utility maximization principle or its modern variants, such as sub-
jective expected utility, define rationality in all contexts. Similarly, neo-
Bayesians tend to claim that the formal machinery of Bayesian statistics defines
rational inferences in all contexts. In cognitive psychology, formal axioms and
rules—consistency, transitivity, and Bayes's rule, for example, as well as entire
statistical techniques—figure prominently in recent theories of mind and war-
rant the rationality of cognition (Chapter 1).
Mr. Sociable has been invited by a young lady to come back to her place for tea. He chooses to have tea with her (x) over returning home (y). The young lady then offers him a third choice—to share some cocaine at her apartment (z). This extension of the choice set may quite reasonably affect Mr. Sociable's ranking of x and y. Depending on his objectives and values, he may consequently choose to go home (y).
All three examples seek to illustrate the same point: Property Alpha—the requirement that an alternative chosen from a set should also be chosen from any subset that contains it—will or will not be entailed depending on the social objectives, values, and expectations of the individual making the choice. To impose Property Alpha as a general yardstick of rational behavior independent of social objectives or other factors external to choice behavior seems fundamentally flawed.
The conclusion is not that consistency is an invalid principle; rather, con-
sistency, as defined by Property Alpha or similar principles, is indeterminate.
The preceding examples illustrate different kinds of indeterminateness. With
respect to the last apple, social values define what the alternatives in the choice
set are and, thereby, what consistency is about. If there are many apples in the
basket, the choice is between "apple" and "nothing." If a single apple remains
and one does not share the values of Mr. Polite, the alternatives are the same;
for Mr. Polite, however, they become "last apple" and "nothing." In the dinner
and tea examples, one learns something new about the old alternatives when
a new choice is introduced. The fresh option provides new information—that
is, it reduces uncertainty about the old alternatives.
To summarize the argument: Consistency, as defined by Property Alpha,
cannot be imposed on human behavior independent of something external to
choice behavior, such as social objectives and expectations. Social concerns
and moral views (e.g., politeness), as well as inferences from the menu offered
(learning from one option as to what others may involve), determine whether
internal consistency is or is not entailed.
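The point can be made concrete with a small sketch; the choice function below is an illustrative stand-in for Mr. Sociable's reasoning, not a model proposed in this chapter:

    # Property Alpha: an alternative chosen from a set should also be chosen
    # from any subset that contains it.
    def chooses(menu):
        # Mr. Sociable: the mere presence of the third offer changes what
        # accepting "tea" would mean, so he prefers to go home.
        if "cocaine" in menu:
            return "home"
        return "tea" if "tea" in menu else "home"

    print(chooses({"tea", "home", "cocaine"}))   # home
    print(chooses({"tea", "home"}))              # tea
    # "home" is chosen from the full menu but not from the subset {tea, home}:
    # a violation of Property Alpha as a formal rule, yet reasonable once the
    # new option is read as information about the old alternatives.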
The probability of getting at least one six in four tosses of a die is 1 − (5/6)⁴, which is .518. In the same way, the answer to his puzzle can be calculated. The probability of not getting a double six in one toss of a pair of dice is 35/36; therefore the probability of not getting a double six in 24 tosses is (35/36)²⁴, which is .509. Thus de Méré was unlucky because his reasoning was wrong.
In general terms, maximizing expected utility means maximizing the prod-
uct between probabilities and utilities. In the simple case of a choice between
two options, x and y (e.g., no six and at least one six), with probabilities p(x)
and p(y), and utilities U(x) and U(y), one can maximize the gain according to
the rule:
Maximizing expected utility:
Choose x if p(x)U(x) > p(y)U(y).
For de Méré, the utilities were equal, because he and his gambling partner
bet the same amount of money on x and y, respectively. All this seems to be
straightforward once the mathematical theory is in place.
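Both probabilities and the decision rule can be checked in a few lines (a sketch; the equal stakes are simply set to 1):

    # Chevalier de Mere's two gambles.
    p_at_least_one_six = 1 - (5 / 6) ** 4          # at least one six in 4 tosses of a die
    p_no_double_six = (35 / 36) ** 24              # no double six in 24 tosses of two dice
    print(round(p_at_least_one_six, 3))            # 0.518
    print(round(p_no_double_six, 3))               # 0.509 -- betting on a double six loses

    # Maximizing expected utility for a choice between x and y.
    def choose(p_x, u_x, p_y, u_y):
        return "x" if p_x * u_x > p_y * u_y else "y"

    # With equal utilities (equal stakes), the rule reduces to comparing probabilities.
    print(choose(p_no_double_six, 1, 1 - p_no_double_six, 1))   # "x": bet against the double six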
Now consider the following situation, which seems to be, formally, essen-
tially equivalent. The choice set is again {x, y}. Choosing x will lead to a re-
inforcement with a probability p(x) = .80, whereas choosing y will only lead to the same reinforcement with a probability p(y) = .20. That is, the utilities
of the outcomes (reinforcements) are the same, but their probabilities differ. It
is easy to see that when the choice is repeated n times, the expected number
of reinforcements will be maximized if an organism always chooses x:
Maximizing with equal utilities:
Always choose x if p(x) > p(y).
Consider a hungry rat in a T-maze where reinforcement is obtained at the left
end in 80% of cases and at the right end in 20% of cases. The rat will maximize
reinforcement if it always turns left. Imagine students who watch the rat run-
ning and predict on which side the reinforcement will appear in each trial.
They will also maximize their number of correct predictions by always saying
"left." But neither rats nor students seem to maximize. Under a variety of
experimental conditions, organisms choose both alternatives with relative fre-
quencies that roughly match the probabilities (Gallistel, 1990):
Probability matching:
Choose x with probability p(x);
choose y with probability p(y).
In the preceding example, the expected rate of reinforcements is 80% for max-
imizing, but only 68% for probability matching (this value is calculated by
.802 + .202 = .68). The conditions of the seemingly irrational behavior of prob-
ability matching are discussed in the literature (e.g., Brunswik, 1939; Estes,
1976; Gallistel, 1990).
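The two expected rates can be computed directly (a minimal sketch using the probabilities from the text):

    p_x, p_y = 0.80, 0.20          # reinforcement probabilities of the two alternatives

    maximizing = max(p_x, p_y)                 # always choose the more frequent side
    matching = p_x * p_x + p_y * p_y           # choose each side with its own probability

    print(maximizing)   # 0.8
    print(matching)     # 0.68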
Violations of maximizing by probability matching pose a problem for a
context-independent account of rational behavior in animals and humans.
What looks irrational for an individual, however, can be optimal for a group.
Again, the maximizing principle does not capture the distinction between the
individual in social isolation and in social interaction. Under natural condi-
tions of foraging, there will not be just one rat but many who compete to
exploit food resources. If all choose to forage in the spot where previous ex-
perience suggests food is to be found in greatest abundance, then each may
get only a small share. The one mutant organism that sometimes chooses the
spot with less food would be better off. Natural selection will favor those ex-
ceptional individuals who sometimes choose the less attractive alternative.
Thus maximizing is not always an evolutionarily stable strategy in situations
of competition among individuals. Given certain assumptions, probability
matching may in fact be an evolutionarily stable strategy, one that does not
tend to create conditions that select against it (Fretwell, 1972; Gallistel, 1990).
To summarize the argument: The maximization rule cannot be imposed on
behavior independent of social context. Whether an organism performs in iso-
lation or in the context of other organisms can determine, among other things,
whether maximization is entailed as an optimal choice rule.
Mr. Smart would like to invest the $10,000 in his savings account in the hope
of increasing his capital. After some consideration, he opts to risk the amount
in a gamble with two possible outcomes, x and y. The outcomes are determined
by a fair roulette wheel with 10 equal sections, 6 of them white (x) and 4 black
(y). Thus the probability p(x) of obtaining white is .6, and the probability p(y)
of obtaining black is .4. The rules of the game are that he has to bet all his
money ($10,000) either on black or on white. If Mr. Smart guesses the outcome
correctly, his money will be doubled; otherwise, he will lose three-quarters of
his investment. Could it ever be advantageous for Mr. Smart to bet on black?
If Mr. Smart bets on white, his expectation is $20,000 with a probability of
.6, and $2,500 with a probability of .4. The expected value E(x) is (.6 × $20,000) + (.4 × $2,500) = $13,000. But if he bets on black, the expected value E(y) is only (.4 × $20,000) + (.6 × $2,500) = $9,500. Betting on white would
give him an expectation larger than the sum he invests. Betting on black, on
the other hand, would result in an expectation lower than the sum he invests.
A maximization of the expected value implies betting on white: choose x if E(x) > E(y), where E(x) = Σ p(x)V(x). The principle of maximizing the expected value (or
subjective variants such as expected utility) is one of the cornerstones of clas-
sical definitions of rationality. Mr. Smart would be a fool to bet on black,
wouldn't he?
Let me apply the same argument again. The principle of maximizing the
expected value does not distinguish between the individual in social isolation
and in social interaction. If many individuals face the same choice, could it
be to the benefit of the whole group that some sacrifice themselves and bet on
black? Let us first look at an example from biology.
Cooper (1989; Cooper & Kaplan, 1982) discussed conditions under which
it is essential for the survival of the group that some individuals bet against
the probabilities and do not, at the individual level, maximize their expected
value. Consider a hypothetical population of organisms whose evolutionary
fitness (measured simply by the finite rate of increase in their population)
depends highly on protective coloration. Each winter predators pass through
the region, decimating those within the population that can be spotted against
the background terrain. If the black soil of the organisms' habitat happens to
be covered with snow at the time, the best protective coloration is white; oth-
erwise, it is black. The probability of snow when predators pass through is .6,
and protectively colored individuals can expect to survive the winter in num-
bers sufficient to leave an average of two surviving offspring each, whereas the
conspicuous ones can expect an average of only 0.25 offspring each. This ex-
ample assumes a simple evolutionary model with asexual breeding (each off-
spring is genetically identical to its parent), seasonal breeding (offspring are
produced only in spring), and semelparous breeding (each individual produces
offspring only once in a lifetime at the age of exactly one year).
Adaptive Coin-Flipping
Suppose two genotypes, W and WB, are in competition within a large popu-
lation. Individuals of genotype W always have white winter coloration; that is,
W is a genotype with a uniquely determined phenotypic expression. Genotype
WB, in contrast, gives rise to both white and black individuals, with a ratio of
5 to 3. Thus, 3 out of 8 individuals with genotype WB are "betting" on the
low probability of no snow. Each of these individuals has a smaller expectation of surviving and reproducing than any other individual of either genotype, W or WB.
How will these two genotypes fare after 1,000 generations (1,000 years)? We
can expect that there was snow cover in about 600 winters, exposed black soil
in about 400 winters. Then the number of individuals with genotype W will
be doubled 600 times and reduced to one-fourth 400 times. If n is the original
population size, the population size after 1,000 years is n × 2⁶⁰⁰ × (1/4)⁴⁰⁰ = n × 2⁻²⁰⁰, a number vanishingly close to zero.
That is, genotype W will have been wiped out with practical certainty after
1,000 years. How does genotype WB do? In the 600 snowy winters, 5/8 of the population will double in number and three-eighths will be reduced to 25%, with corresponding proportions for the 400 winters without snow. The number of individuals after 1,000 years is then n × (5/8 × 2 + 3/8 × 1/4)⁶⁰⁰ × (3/8 × 2 + 5/8 × 1/4)⁴⁰⁰, an astronomically large number (on the order of 10⁶⁰ × n).
Thus genotype WB is likely to win the evolutionary race easily.1 (The large
estimated number is certainly an overestimation, however, because it does not
take account of such other constraints as food resources.) The reason why WB
has so much better a chance of survival than W is that a considerable propor-
tion of the WB individuals do not maximize their individual expectations but
"bet" on small probabilities.
This violation of individual maximization has been termed "adaptive coin-
flipping" (Cooper & Kaplan, 1982), meaning that individuals are genetically
programmed to "flip coins" to adopt phenotypic traits. Thus the phenotype is
ultimately determined by the nature of the coin-flipping process, rather than
uniquely specified by the genotype.2
Back to Mr. Smart. Assume he won and wants to try again. So do his nu-
merous brothers, sisters, and cousins, who all are willing to commit their entire
investment capital to this gamble. The game is offered every week, and the
rules are as before: Each person's choice is every week to bet all his or her
investment capital either on black or on white (no hedging of bets). If everyone
wanted solely to maximize his or her individual good, his or her money would
be better invested in white than in black, because the chances to double one's
assets are 60% for white compared with only 40% for black. Investing in black
would appear irrational. But we know from our previous calculations that
someone who invests all his or her money every week in white will, with a
high probability, lose every dollar of assets in the long run.
If Mr. Smart and his extended family, however, acted as one community rather than as independent individuals—that is, created one investment capital fund in which they all share equally—they could quickly increase their capital with a high probability. Every week they would need to instruct three-eighths of
their members to invest in black, the rest in white. This social sharing is es-
sentially the same situation as the "adaptive coin-flipping" example. Thus Mr.
Smart's betting on black needs to be judged against his motivation: If he is
1. I have reported the numbers only for the most likely event (i.e., 600 snowy winters
out of 1,000 winters). If one looks at all possible events, one finds that those in which
W would result in a larger population size than WB are extremely rare (Cooper, 1989).
Nevertheless, the expected value is larger for W than for WB, due to the fact that in
those very few cases in which W results in a larger population size, this number is
astronomically large. The reader who is familiar with the St. Petersburg paradox will
see a parallel (Wolfgang Hell has drawn my attention to this fact). The parallel is best
illustrated in Lopes's (1981) simulations of businesses selling the St. Petersburg gamble.
Although these businesses sold the gamble far below its expected value, most nonethe-
less survived with great profits.
2. Adaptive coin-flipping is a special case of a general phenomenon: In variable en-
vironments (in which the time scale of variation is greater than the generation time of
the organism, as in the example given), natural selection does not maximize expected
individual fitness but geometric mean fitness (Gillespie, 1977).
cooperating with others for their common interest, then betting on the wrong
side of a known probability is part of an optimal strategy. If he is not cooper-
ating but, rather, investing for his own immediate benefit, then betting on black
is the fastest way to ruin.
This example, like the preceding ones, attempts to illustrate that a rule such
as maximizing the expected value cannot be imposed on behavior without
consideration of the social context. Is this context a single individual wagering
all his or her assets at once or a population that risks their collective assets or
offspring at regular intervals? It makes all the difference, because individual
maximization can lead to the extinction of the genotype.
Conclusions
These examples show that general principles such as consistency and maximiz-
ing are insufficient for capturing rationality. I have argued that there is no way of
determining whether a behavioral pattern is consistent or maximizes without
first referring to something external to choice behavior (Sen, 1993). The external
factor investigated in this chapter is the social context of choice behavior, in-
cluding objectives, motivations, and values. I am not arguing against consis-
tency, maximization, or any given rule per se but against the a priori imposition
of a rule or axiom as a requirement for rationality, independent of the social con-
text of judgment and decision and, likewise, of whether the individual operates
in isolation or within a social context (see Chapter 12; Elster, 1990).
One way to defend general principles against this argument would be to
say that maximization poses no restrictions on what individuals maximize, be
it their own good (utilities) or the fitness of their genotype. Switching from
individual goals to genotypic fitness can save the concept of maximization.
Such a defense would imply, however, that maximization cannot be imposed
independent of the motivations and goals built into living systems, which is
precisely the point I have asserted. By the same token, to claim that consis-
tency poses no restrictions on whatever consistency is about would destroy
the very idea of behavioral consistency, because Property Alpha would as a
result be open to any external interpretation and would no longer impose any
constraint on choice.
More generally, the formal principles of logic, probability theory, rational
choice theory, and other context-independent principles of rationality are often
rescued and defended by post hoc justifications. Post hoc reasoning typically
uses the social objectives, values, and motivations of organisms to make room
for exceptions or to reinterpret the alternatives in axioms or rules until they
are compatible with the observed result. Contemporary neoclassical econom-
ics, for instance, provides little theoretical basis for specifying the content and
shape of the utility function; it thus affords many degrees of freedom for fitting
any phenomenon to the theory (Simon, 1986). In Elster's (1990) formulation,
a theory of rationality can fail through indeterminacy (rather than through
inadequacy) to the extent that it fails to yield unique predictions.
10
Domain-Specific Reasoning
Social Contracts and Cheating Detection
When an electric shock rather than nausea follows the tasting, rats have great difficulty learning to avoid the flavored water. Yet in
just one trial the rat can learn to avoid the flavored water when it is followed
by experimentally induced nausea, even when the nausea occurs 2 hours later:
From the evolutionary view, the rat is a biased learning machine de-
signed by natural selection to form certain CS—US [conditioned stimu-
lus-unconditioned stimulus] associations rapidly but not others. From a
traditional learning viewpoint, the rat was an unbiased learner able to
make any association in accordance with the general principles of con-
tiguity, effect, and similarity. (Garcia y Robertson & Garcia, 1985, p. 25)
One feature that sets humans and some other primates apart from almost all
animal species is the existence of cooperation among genetically unrelated
individuals within the same species, known as reciprocal altruism or coop-
eration. The thesis that such cooperation has been practiced by our ancestors
since ancient times, possibly for at least several million years, is supported by
evidence from several sources. First, our nearest relatives in the hominid line,
chimpanzees, also engage in certain forms of sophisticated cooperation (de
Waal & Luttrell, 1988), and in more distant relatives, such as macaques and
baboons, cooperation can still be found (e.g., Packer, 1977). Second, coopera-
tion is both universal and highly elaborated across human cultures, from
hunter-gatherers to technologically advanced societies. Finally, paleoanthro-
pological evidence also suggests that cooperation is extremely ancient (e.g.,
Tooby & DeVore, 1987).
Why altruism? Kin-related helping behavior, such as that by the sterile
worker castes in insects, which so troubled Darwin, has been accounted for by
generalizing "Darwinian fitness" to "inclusive fitness"—that is, to the number
of surviving offspring an individual has plus the individual's effect on the
number of offspring produced by its relatives (Hamilton, 1964). But why re-
ciprocal altruism, which involves cooperation among two or more nonrelated
individuals? The now-classic answer draws on the economic concept of trade
and its analogy to game theory (Axelrod, 1984; Williams, 1966). If the repro-
ductive benefit of being helped is greater than the cost of helping, then indi-
viduals who engage in reciprocal helping can outreproduce those who do not,
causing the helping design to spread. A vampire bat, for instance, will die if
it fails to find food for two consecutive nights, and there is high variance in
food-gathering success. Food sharing allows the bats to reduce this variance,
and the best predictor of whether a bat, having foraged successfully, will share
its food with a hungry nonrelative is whether the nonrelative has shared food
with the bat in the past (Wilkinson, 1990).
But "always cooperate" would not be an evolutionarily stable strategy. This
can be seen using the analogy of the prisoner's dilemma. If a group of individ-
uals always cooperates, then individuals who always defect—that is, who take
the benefit but do not reciprocate—can invade and outreproduce the cooper-
ators. Where the opportunity for defecting (or cheating) exists, indiscriminate
cooperation would eventually be selected out. "Always defect" would not be
an evolutionarily stable strategy, either. A group of individuals who always
defect can be invaded by individuals who cooperate in a selective (rather than
indiscriminate) way. A simple heuristic for selective cooperation is "Cooperate
on the first move; for subsequent moves, do whatever your partner did on the
preceding move" (a strategy known as Tit For Tat). There are several rules in
addition to Tit For Tat that lead to cooperation with other "selective cooper-
ators" and exclude or retaliate against cheaters (Axelrod, 1984).
The important point is that selective cooperation would not work without
a cognitive heuristic for detecting cheaters—or, more precisely, a heuristic for
directing an organism's attention to information that could reveal that it (or its
group) is being cheated (Cosmides & Tooby, 1992). Neither indiscriminate co-
operation nor indiscriminate cheating demands such a heuristic. In vampire
bats, who exchange only one thing—regurgitated blood—such a heuristic can
be restricted to a sole commodity. Cheating, or more generally noncooperation,
would mean, "That other bat took my blood when it had nothing, but it did
not share blood with me when I had nothing." In humans, who exchange many
goods (including such abstract forms as money), a cheating-detection heuristic
needs to work on a more general level of representation—in terms, for exam-
ple, of "benefits" and "costs."
To summarize, cooperation between two or more individuals for their mu-
tual benefit is a solution to a class of important adaptive problems, such as the
sharing of scarce food when foraging success is highly variable. Rather than
being indiscriminate, cooperation needs to be selective, requiring a cognitive
heuristic that directs attention to information that can reveal cheating. This
evolutionary account of cooperation, albeit still general, can be applied to a
specific puzzle in the psychology of reasoning.
In 1966, Peter Wason invented the "selection task" to study reasoning about
conditionals. This was to become one of the most extensively researched sub-
jects in cognitive psychology during the following decades. The selection task
involves four cards and a conditional statement in the form "If P then Q." One
example is, "If there is a 'D' on one side of the card, then there is a '3' on the
other side." The four cards are placed on a table so that the participant can read
only the information on the side facing upward. For instance, the four cards
may read "D," "E," "3," and "4." The participant's task is to indicate which of
the four cards need(s) to be turned over to find out whether the statement has
been violated. Table 10.1 shows two examples of selection tasks, each with a
different content: a numbers-and-letters rule and a transportation rule.
The dominant approach has been to impose propositional logic as a general-purpose standard of rational reasoning in the selection task, independent of content.
Transportation rule
If a person goes into Boston, then he takes the subway.
The cards below have information about four Cambridge residents. Each card repre-
sents one person. One side of the card tells where the person went and the other
side tells how the person got there. Indicate only the card(s) you definitely need to
turn over to see if the rule has been violated.
SUBWAY ARLINGTON CAB BOSTON
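Under the material conditional—the general-purpose standard that the following sections contrast with cheating detection—only cards that could show P together with not-Q can reveal a violation. A minimal sketch for the transportation rule (card labels as above):

    # "If P then Q" with P = "went to Boston" and Q = "took the subway".
    cards = ["SUBWAY", "ARLINGTON", "CAB", "BOSTON"]

    def could_reveal_violation(visible_side):
        # A visible "BOSTON" (P) may hide "CAB" (not-Q) on its back;
        # a visible "CAB" (not-Q) may hide "BOSTON" (P) on its back.
        return visible_side in ("BOSTON", "CAB")

    print([c for c in cards if could_reveal_violation(c)])   # ['CAB', 'BOSTON']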
The numbers-and-letters rule and the transportation rule are not social con-
tracts. There are not two partners who have engaged in a contract; the rules
are descriptive ones rather than obligations, permissions, or other contracts
with mutual costs and benefits. Therefore, social contract theory is mute on
these problems. Cosmides (1989), however, showed that if a rule expressed a
social contract, then the percentage of "benefit taken" and the "costs not paid"
selections was very high. But this result can also be consistent with competing
accounts that do not invoke reciprocal altruism, so we need to look more
closely at tests that differentiate between competing accounts. Below is a sam-
ple of tests with that aim.
The major account of the content effect in the 1970s and 1980s was variously
called "familiarity" and "availability" (Manktelow & Evans, 1979; Pollard,
1982), without ever being precisely defined. The underlying idea is that the
more familiar a statement is, the more often a participant may have experi-
enced associations between the two propositions in a conditional statement,
including those that are violations ("benefit taken" and "cost not paid") of the
conditional statement. In this view, familiarity makes violations more "avail-
able" in memory, and selections may simply reflect availability. According to
this conjecture, therefore, familiarity and not social contracts accounts for se-
lecting the "benefit taken" and "cost not paid" cards. If familiarity were indeed
the guiding cognitive principle, then unfamiliar social contracts should not
elicit the same results. However, Cosmides (1989), Gigerenzer and Hug (1992),
and Platt and Griggs (1993) showed that social contracts with unfamiliar prop-
ositions elicit the same high number of "benefit taken" and "cost not paid"
selections, in contradiction to the availability account.
the benefits of cooperation. The second conjecture, however, rejects any role
of cheating detection in the selection task, claiming that people are, for some
reason, better at reasoning about social contracts than about numbers-and-
letters problems. Social contracts may be more "interesting" or "motivating,"
or people may have some "mental model" for social contracts that affords
"clear" thinking. Although this alternative is nebulous, it needs to be taken
into account; in her tests, Cosmides (1989) never distinguished between social
contracts and cheating detection.
But one can experimentally disentangle social contracts from cheating de-
tection. Klaus Hug and I also used social contracts but varied whether the
search for violations constituted looking for cheaters or not (Gigerenzer & Hug,
1992). For instance, consider the following social contract: "If someone stays
overnight in the cabin, then that person must bring along a bundle of wood
from the valley" (Table 10.2). This was presented in one of two context stories.
The "cheating" version explained that a cabin high in the Swiss Alps serves
as an overnight shelter for hikers. Because it is cold and firewood is not oth-
erwise available at this altitude, the Swiss Alpine Club has made the rule that
each hiker who stays overnight in the cabin must bring along a bundle of
firewood from the valley. The participants were cued to the perspective of a
guard who checks whether any of four hikers has violated the rule. The four
hikers were represented by four cards (Table 10.2).
In the "no-cheating" version, the participants were cued to the perspective
of a member of the German Alpine Association, visiting the same cabin in the
Swiss Alps to find out how it is managed by the local Alpine Club. He observes
people carrying firewood into the cabin, and a friend accompanying him sug-
gests that the Swiss may have the same overnight rule as the Germans, namely,
"If someone stays overnight in the cabin, then that person must bring along a
bundle of wood from the valley." That this is also the Swiss Alpine Club's rule
is not the only possible explanation; alternatively, only its members (who do
not stay overnight in the cabin), and not the hikers, might bring firewood. The
participants were now in the position of an observer who checks information
to find out whether the social contract suggested by his friend actually holds.
This observer does not represent a party in a social contract. The participants' instruction was the same as in the cheating version.
Thus, in the cheating scenario, the observation "benefit taken and cost not
paid" means that the party represented by the guard is being cheated; in the
no-cheating scenario, the same observation suggests only that the Swiss Alpine
Club never made the supposed rule in the first place.
Assume as true the conjecture that what matters is only that the rule is a
social contract, making the game-theoretic model (which requires a cheating
mechanism) irrelevant. Because in both versions the rule is always the same
social contract, such a conjecture implies that there should be no difference
in the selections observed. In the overnight problem, however, 89% of the
participants selected "benefit taken" and "cost not paid" when cheating was
at stake, compared with 53% in the no-cheating version (Figure 10.1). Simi-
larly, the averages across all four test problems used were 83% and 45%, re-
spectively, consistent with the game-theoretic account of cooperation (Giger-
enzer & Hug, 1992).
Figure 10.1 The absence of the possibility of being cheated reduces the "bene-
fit taken and cost not paid" selections (which coincide with the P & not-Q
selections), even when all rules are social contracts. From Gigerenzer and
Hug (1992).
Perspective Change
get a day off" cards, which correspond to the "not-P" and "Q" cards. (Note
that "not-P & Q" selections have rarely been observed in selection tasks.) Thus
perspective change can play cheating detection against general-purpose logic.
The two competing predictions are: If the cognitive system attempts to detect
instances of "benefit taken and cost not paid" in the other party's behavior,
then a perspective switch implies switching card selections; if the cognitive
system reasons according to propositional logic, however, pragmatic perspec-
tives are irrelevant and there should be no switch in card selections.
The results showed that when the perspective was changed, the cards se-
lected also changed in the predicted direction (Figure 10.2). The effects were
strong and robust across the three rules tested (Gigerenzer & Hug, 1992). For
instance, in the employee perspective of the day-off problem, 75% of the par-
ticipants had selected "worked on the weekend" and "did not get a day off,"
but only 2% had selected the other pair of cards. In the employer perspective,
this 2% (who had selected "did not work on the weekend" and "did get a day
off") rose to more than 60%. The result is consistent with the thesis that at-
tention is directed toward information that could reveal oneself (or one's
group) as being cheated in a social contract but is inconsistent with the claim
that reasoning is directed by propositional logic independent of content.
Thus social contracts do not simply facilitate logical reasoning. I believe
that the program of reducing context merely to an instrument for "facilitating"
logical reasoning is misguided. My point is the same as for Property Alpha
(Chapter 9). Reasoning consistent with propositional logic is entailed by some
perspectives (e.g., the employee's), but is not entailed by other perspectives
(e.g., the employer's).
Two additional conjectures can be dealt with briefly. First, several authors
have argued that the cheating-detection thesis can be invalidated because "log-
ical facilitation" (large proportions of "P & not-Q" selections) has also been
found in some conditional statements that were not social contracts (e.g.,
Cheng & Holyoak, 1989; Politzer & Nguyen-Xuan, 1992). This conjecture mis-
construes the thesis in two respects. The thesis is not about "logical facilita-
tion"; the conjunction "benefit taken and cost not paid" is not the same as the
logical conjunction "P & not-Q," as we have seen. Furthermore, a domain-
specific theory makes, by definition, no prediction about performance outside
its own domain; it can only be refuted within that domain.
The second conjecture also tries to reduce the findings to propositional
logic, pointing out that a conditional that states a social contract is generally
understood as a biconditional "if and only if." In this case all four cards can
reveal logical violations and need to be turned over. However, it is not true
that four-card selections are frequent when cheating detection is at stake. We
found in about half of the social contract problems (12 problems, each an-
swered by 93 students) that not a single participant had selected all four cards;
for the remaining problems, the number was very small. Only when cheating
detection was excluded (the no-cheating versions) did four-card selections in-
crease to a proportion of about 10% (Gigerenzer & Hug, 1992). There is, then, little support for the conjecture that participants interpreted the social contract rules as biconditionals.
Figure 10.2 Social contracts in which both parties have the option to cheat
allow us to test whether reasoning about social contracts follows aperspecti-
val propositional logic (that is, the hypothesis that the conditional rule is in-
terpreted as a material conditional) or the pragmatic and domain-specific
goal of cheating detection. Results show that in both perspectives (Party A
and Party B; e.g., employer and employee), participants search for informa-
tion that could reveal that their party is being cheated, whether this informa-
tion corresponds to P & not-Q (as the material conditional would suggest) or
to not-P & Q. This result also does not support the hypothesis that partici-
pants interpret the rule as a biconditional, which implies that they would
have to check all four cards. Four-card selections were rare. From Gigerenzer
and Hug (1992).
Conclusions
11
The Modularity of Social Intelligence
Intelligence is often assumed to be of one kind, one general ability that helps
an organism cope with all situations—such as Francis Galton's "natural abil-
ity," Spearman's general intelligence factor g, and the numberless definitions
that start with "intelligence is the general ability to . . ." The thesis that intel-
ligence is a unified general ability was created only recently, in the mid-
19th century, by Francis Galton, Herbert Spencer, and Hippolyte Taine, among
others (Daston, 1992). The idea of one general intelligence was motivated by
Darwin's theory of evolution (Galton was Darwin's cousin) and seemed to pro-
vide the missing continuum between animals and humans, as well as between
human races, and last but not least, between men and women.
Such a unified general ability was alien to the earlier faculty psychology,
which dated back to Aristotle. Faculty psychology posited a collection of fac-
ulties and talents in the mind, such as imagination, memory, and judgment.
These faculties organized an intricate division of mental labor, and no single
one nor their sum coincided with our concept of intelligence (Daston, 1992).
Faculty psychology was revived, in the language of factor analysis, in the late
1930s when L. L. Thurstone posited about seven primary mental abilities. In
the second half of the 20th century, the mind has become again a crowded
place. Evidence has been announced for dozens of factors of intelligence, and
Guilford and Hoepfner (1971) even claimed the confirmation of some 98 factors
of cognitive ability (see Carroll, 1982). Cognitive psychologists who use ex-
periments rather than IQ tests also divide up cognition in terms of faculties
(but you will not catch one using that term): deductive reasoning, inductive reasoning, and so on.
Intelligence modules, however, are not like Thurstone's primary mental abili-
ties and faculties such as reasoning. I distinguish between two types of fac-
ulties: domain specific and domain general. Faculties such as deductive rea-
soning, memory, and numerical ability (as well as such factors as "fluid" and
"crystallized" intelligence) are assumed to treat any content identically, that
is, to operate in a domain-general way. The laws of memory, for instance, in
this view, are not about what is memorized; they are formulated without ref-
erence to content. Fodor (1983) called these domain-general faculties "hori-
zontal" as opposed to "vertical" domain-specific faculties. The modularity of
social intelligence, I propose, is vertical.
The doctrine of domain-general mechanisms flourished in Skinner's behav-
iorism, before it was generally rejected following experimental work by John
Garcia and others (see Chapter 10). Learning through imitation (rather than
reinforcement) is also reported to be domain specific. Rhesus monkeys, for
instance, reared in the laboratory show no fear toward venomous snakes. How-
ever, one will show fear if it sees another monkey exhibiting a fear reaction
toward snakes. Yet the monkey does not become afraid of just any stimulus:
If it sees another monkey emit a fear reaction toward a flower, it does not
acquire a fear of flowers (Cosmides & Tooby, 1994b; Mineka & Cook, 1988).
Learning by imitation of others, like learning by association, is simultaneously
enabled and constrained by specific "expectations" of what to avoid, what to
fear, or more generally, what causal connections to establish. Without domain-
specific mechanisms, an organism would not "know" what to look for, nor
which of the infinite possible causal connections to check. Such an organism
would be paralyzed by data analysis like the quantophrenic researcher who
measures everything one can think of, computes correlation matrices of di-
nosaurian dimensions, and blindly searches for significant correlations. De-
spite the available evidence to the contrary, Skinner's ideal of domain gener-
ality has survived the cognitive revolution and is flourishing in present-day
conceptions of the mind.
Domain generality is possibly the most influential and suspect idea in 20th-
century psychology. Psychologists love to organize their field by horizontal
faculties such as attention, memory, perception, problem solving, and judg-
ment and decision making. Terms such as these organize the chapter structure
of textbooks, the specialties of scientific journals, the divisional structure in
grant agencies, and the self-identity of numerous colleagues. Psychologists
tend to identify with horizontal faculties, not with domains.
I propose, in contrast, that modules for social intelligence are domain spe-
cific. How should we think about these modules? Fodor (1983), a vehement
proponent of modularity, has argued that modularity is restricted to input sys-
tems (the senses) and language, whereas central processes such as reasoning
are domain general. I term this the "weak modularity thesis." In his view,
modules are specifically designed mechanisms for voice recognition in con-
specifics, for face recognition in conspecifics, and for color perception, among
others. I disagree with Fodor's opposition between modular sensory processes
(and language) and general-purpose central processes. Social intelligence in-
volves both perceptual processes and mechanisms for reasoning and inductive
inference. For instance, assume there is a module for social contracts, that is,
a module that enables cooperation between unrelated conspecifics for their
mutual benefit. Such a module would need to incorporate both "central" pro-
cesses, such as cost-benefit computations and search algorithms for informa-
tion that could reveal that one is being cheated, and sensory processes such
as face recognition. Without both "peripheral" and "central" mechanisms, nei-
ther social contracts nor cheating detection would be possible.
What I call the "strong modularity thesis" postulates that modules include
central processes as well as sensory mechanisms (and language). The function
of modules is not tied to "peripheral" as opposed to "central" processes.
Rather, their function is to solve specific problems of adaptive significance and
to do this quickly. A problem of adaptive significance can be described as an
evolutionarily recurrent problem whose solution promoted reproduction (Cos-
mides & Tooby, 1994a, b; Miller & Todd, 1995, 1998). Candidates include co-
alition forming and cooperation, foraging, predator avoidance, navigation,
mate selection, and rearing children. To solve such problems, modules need
to combine "peripheral" and "central" processes. Thus the domains (more
precisely, the "proper" domains; see next section) of modules are important
adaptive problems and not just perceptual (plus language) tasks.
Assume there is a social intelligence module designed for handling social con-
tracts in a hunter-gatherer society. The proper domain of the module may have
been the exchange of food for the mutual benefit of both parties involved in
the contract (because food sharing is not too common among animals, an al-
ternative hypothesis would be that the proper domain concerned social ser-
vices such as alliance formation; see Harcourt & de Waal, 1992). Generations
later, currency has been developed, and the module's representation of possi-
Assume there is a simple social organism with two modules for social intel-
ligence: One deals with social contracts, the other with threats. Thus this or-
ganism knows only two ways to deal with conspecifics: to engage with them
in the exchange of certain goods to their mutual benefit and to threaten indi-
viduals to get one's way (and react when others do so). As simple as the social
intelligence of this organism is, the organism needs to decide when to activate
the social contract module and when the threat module. Not all modules can
be activated at the same time because the very advantage of modularity is to
focus attention and to prevent combinatorial explosion. For instance, the social
contract module focuses attention on information that can reveal that the or-
ganism is being cheated, whereas this information is of no relevance for a
threat module. A threat module needs to attend to information that can reveal,
for instance, whether the other side is bluffing or whether high-status individ-
uals are present who could be used for "protected threat" (Kummer, 1988).
How is one of the two modules activated? I assume that there is a triggering
algorithm that attends to a small set of cues whose presence signals either
threat or social contract. These signals can include facial expressions, gestures,
body movements, and verbal statements. Assume that the organisms do have
language. A simple algorithm can quickly recognize whether a verbal state-
ment of the type "if you do X, then I do Y" is a threat or a social contract. If
Y is a negative consequence for me, and follows X in time, then I am being
threatened. If Y is a benefit for me and the temporal sequence can be either
way, then I am being offered a social contract. I call such simple heuristics "trig-
gering algorithms" because their function is to activate a module that can focus
attention, emotion, and behavioral responses so that fast reaction is possible.
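A minimal sketch of such a triggering algorithm, in Python, may make the idea concrete. The representation of a statement and the two cues (whether Y is a benefit to the addressee and whether Y must follow X in time) are hypothetical simplifications for illustration, not a model taken from the text.

from dataclasses import dataclass

@dataclass
class Statement:
    y_is_benefit_to_me: bool   # is the promised consequence Y a benefit for me?
    y_follows_x_in_time: bool  # must Y come after X in time?

def triggering_algorithm(s: Statement) -> str:
    """Decide which module to activate for a statement of the form
    'if you do X, then I do Y' (hypothetical cue coding)."""
    if not s.y_is_benefit_to_me and s.y_follows_x_in_time:
        return "threat module"            # negative Y that follows X: a threat
    if s.y_is_benefit_to_me:
        return "social contract module"   # beneficial Y, temporal order unrestricted
    return "no module activated"          # cue pattern not covered by this sketch

# "If you don't pay, I'll break your windows" -> threat
print(triggering_algorithm(Statement(y_is_benefit_to_me=False, y_follows_x_in_time=True)))
# "If you give me meat, I'll give you berries" -> social contract
print(triggering_algorithm(Statement(y_is_benefit_to_me=True, y_follows_x_in_time=False)))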
Triggering algorithms can err, that is, not activate the appropriate module,
such as mistaking a serious threat for pretend play. The likelihood of triggering
the wrong module may increase when there are more than two modules, but
redundancy in cues, such as verbal cues, facial cues, and gestures, may reduce
errors.
When a mind has not just two but a large number of modules, a single trig-
gering algorithm may be too slow to discriminate between all possibilities si-
multaneously. In such a socially more intelligent mind, modules can be hier-
archically connected by triggering algorithms, as in a sequential decision tree.
Hierarchical organization corresponds to the idea that species started out with
a few modules, to which more specialized modules were added later in phy-
logeny, and to Wimsatt's (1986) notion of generative entrenchment.
Assume I march through a forest at night. Visibility is poor, a storm is com-
ing up, and I suddenly see the contours of a large dark object that seems to
move slowly. A triggering algorithm needs to decide quickly whether the object
is "self-propelled" (animal or human) or not (plant or physical object; Premack
& Premack, 1994). According to the Premacks, this decision is based on the
object's motion pattern. Recall the demonstrations by Fritz Heider in which
the motion patterns of two points in two-dimensional space make us "see" the
points as animate or inanimate, chasing, hunting, hurting, or supporting one
another (e.g., Heider & Simmel, 1944). These are beautiful demonstrations, but
they include no descriptions of the algorithms that make us see all these social
behaviors. How could the first triggering algorithm work? A simple algorithm
would analyze only external movements (such as the direction and accelera-
tion of the object) and not internal movements (the relative movement of the
body parts). For instance, if a motion pattern centers on my own position, such
as an object that circles around me or speeds up toward me, the algorithm
infers a self-propelled object. Moreover, it infers a self-propelled object that
takes some interest in me. Motion patterns that center around the object's own
center of gravity, in contrast, indicate that the object is a plant (e.g., a tree).
Now, if the motion pattern indicates that the object is self-propelled, the trig-
gering algorithm may activate a module for unrecognized self-propelled ob-
jects. This module will immediately set the organism into a state of physio-
logical and emotional arousal, initiate behavioral routines such as stopping
and preparing to run away, and activate a second, more specialized triggering
algorithm whose task is to decide whether the self-propelled object is animal
or human. Assume that this second triggering algorithm infers from shape and
motion information that the object is human. A module for social encounters
with unknown humans is subsequently activated, which initiates a search for
individual recognition in memory and may initiate an appeal for voice contact
in order to find out whether the other is friend or enemy, is going to threaten
or help me, and so on. This is pure speculation, but one might work out the
mechanisms of a hierarchical organization along these lines.
Modules that are hierarchically organized can act quickly, as only a few
branches of the combinatorial tree need to be traveled. For instance, if the first
triggering algorithm had indicated that the unknown object was not self-
propelled, then all subsequent information concerning whether it is human or
animal, friend or enemy, or predator or prey could have been ignored.
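The following sketch, again speculative, illustrates how two triggering algorithms connected like a sequential decision tree prune whole branches: if the object is classified as not self-propelled, the human-versus-animal question is never asked. Cue names and module labels are invented for illustration.

def first_trigger(motion_centers_on_me: bool, motion_centers_on_own_gravity: bool) -> str:
    # hypothetical cues: an object that circles or accelerates toward me is self-propelled;
    # one that moves around its own center of gravity is treated as a plant or physical object
    if motion_centers_on_me:
        return "self-propelled"
    if motion_centers_on_own_gravity:
        return "not self-propelled"
    return "unknown"

def second_trigger(shape_is_human: bool) -> str:
    return "human" if shape_is_human else "animal"

def classify_dark_object(motion_centers_on_me, motion_centers_on_own_gravity, shape_is_human):
    if first_trigger(motion_centers_on_me, motion_centers_on_own_gravity) != "self-propelled":
        return "ignore further social cues"  # whole branch of the tree is pruned
    # self-propelled: arousal, stopping, preparing to flee; then discriminate further
    if second_trigger(shape_is_human) == "human":
        return "activate module for encounters with unknown humans"
    return "activate module for unknown animals"

print(classify_dark_object(True, False, True))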
There are two views about the machinery of intelligent behavior. The classical
view is that the laws of probability theory or logic define intelligent processes:
Intelligent agents have rules such as Bayes's rule, the law of large numbers,
transitive inference, and consistency built in. This was the view of the Enlight-
enment mathematicians, to which Jean Piaget added an ontogenetic dimension,
and it still is a dominant view in economics, cognitive psychology, artificial
intelligence, and optimal foraging theory. For instance, Bayes's rule has been
single reason, namely on the first good reason on which two alternatives differ.
The first good reason can be simply that the individual does not recognize (has
never heard of) one of the two alternatives. This "recognition heuristic" seems
to operate in domains in which recognition is correlated with the variable that
needs to be inferred. For instance, rats who can choose between food that they
recognize and food that is new to them do not accept the new food unless they
have smelled it on the breath of a fellow rat (Gallistel et al., 1991). The sur-
prising result is that simple heuristics such as Take The Best can make inferences
about real-world environments as accurately as costly statistical algorithms
of the Laplacean demon type (Martignon & Laskey, 1999).
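A minimal sketch of one-reason inference may clarify how the recognition heuristic and Take The Best work together. The cue values and their order are hypothetical; this is an illustration of the idea, not the implementation used in the cited studies.

def take_the_best(a, b, recognized, cues):
    """a, b: object names; recognized: set of recognized names;
    cues: list of dicts mapping object -> 0/1, ordered by cue validity."""
    if (a in recognized) != (b in recognized):
        return a if a in recognized else b   # recognition heuristic decides
    if a not in recognized and b not in recognized:
        return None                          # neither recognized: guess
    for cue in cues:                         # search cues in order of validity
        if cue.get(a, 0) != cue.get(b, 0):   # stop at the first discriminating cue
            return a if cue.get(a, 0) > cue.get(b, 0) else b
    return None                              # no cue discriminates: guess

# Hypothetical example: which city is larger?
cues = [{"CityA": 1, "CityB": 0},   # e.g., has a major-league team
        {"CityA": 0, "CityB": 1}]   # e.g., is a state capital
print(take_the_best("CityA", "CityB", {"CityA", "CityB"}, cues))  # -> CityA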
If short-sighted evolution has equipped us with adaptive heuristics rather
than with the collected works of logic and probability theory, this result in-
dicates that we need to rethink human rationality. The challenges are to un-
derstand what these heuristics are and to describe the structure of the envi-
ronments in which they can perform well and in which they cannot. My
proposal is that both the triggering algorithms and the mechanisms of the mod-
ule can be modeled as fast and frugal heuristics (Gigerenzer et al., 1999).
Summary
1. The notion of test intelligence, left undefined in its content, has had many faces,
including the moral and the social (Daston, 1992). For instance, the creators of the first
intelligence tests, Binet and Simon (1914), asked questions about social skills, such as
"Why should one judge a person by his acts rather than by his words?" In the first
edition of the Stanford-Binet Test, Louis Terman (1916) expressed the intimate link
between lack of intelligence and morally inappropriate behavior in no uncertain terms:
"Every feeble-minded woman is a prostitute." In the 1937 revision of the text (with M. A.
Merrill), this sentence was deleted. Piece by piece, IQ tests became pure and puritan.
V
COGNITIVE ILLUSIONS AND
STATISTICAL RITUALS
The "discovery" of cognitive illusions was not the first assault on human
rationality. Sigmund Freud's attack is probably the best known: According to
him, the unconscious wishes and desires of the human id are a steady
source of intrapsychical conflict that manifests itself in all kinds of irrational
fears, beliefs, and behavior. But the cognitive-illusion assault is stronger than
the psychoanalytic one. It does not need to invoke a conflict between ra-
tional judgment and unconscious wishes and desires to explain humans' ap-
parent irrationality: Judgment is itself fundamentally deficient. Homo sap-
iens appears to be a misnomer. During the last few decades, cognitive
illusions have become fodder for classroom demonstrations and textbooks.
Isn't it fun to show how dumb everyone else is, and after all, aren't they?
In the spring of 1990, I gave a talk in the Department of Psychology at
Stanford University entitled "Beyond heuristics and biases: How to make
cognitive illusions disappear." The first chapter in this section is an updated
version of this talk. It became the fountainhead of an ongoing, heated debate
over the nature of human rationality and the litany of sins people seem to
commit routinely against reason (e.g., Gigerenzer, 1996a; Kahneman & Tver-
sky, 1996), the so-called "rationality wars" (Samuels, Stich, & Bishop, in
press). Cognitive illusions have been linked to perceptual illusions, suggest-
ing that they are "inevitable illusions" (Piattelli-Palmarini, 1994). The politi-
cal implications of this view are not hard to see. Given the message that or-
dinary citizens are unable to estimate uncertainties and risks, one might
conclude that a government would be well advised to keep these nitwits out
of important decisions regarding new technologies and environmental risks.
In Chapter 12, I criticize the narrow norms that make humans look irrational
and show how to make inevitable illusions "evitable."
Research on cognitive illusions is but one example of the more general phe-
nomenon of replacing statistical thinking with narrow, simplistic norms. The
statistical practices institutionalized in many social and medical sciences are
another case in point; they have little to do with statistical thinking and instead
promote statistical rituals. Textbooks teach our students the equivalent of com-
pulsive hand washing, the result being confusion and anxiety.
I once asked a well-known author who was busy preparing the latest edi-
tion of his best-selling statistical text for psychologists why he promoted the
usual incoherent mishmash of Fisherian and Neyman-Pearsonian prescrip-
tions for testing hypotheses (see Chapter 13). He did not try to deny the
problem, but he told me whom to blame for it. First, there was his publisher,
who had insisted that he supply a statistical cookbook and take out anything
that hinted at the existence of alternative tools for statistical inference,
which he did. Next, there were his fellow researchers, who did not aim to
truly understand statistics but to get their papers published. Finally, he
passed the blame on to the editors who demanded a statistical ritual and to
his university administration, which determined salary increases by counting
the number of papers published. When I asked him in what statistical meth-
ods he himself believed, he said that deep in his heart he was a Bayesian. I
was shocked. What a Faustian pact—an author successfully sells a method
in which he does not believe, which students and researchers then naively
mistake for the moral guidelines of doing science.
This is not to say that the many textbook authors who borrowed from
him are as aware of the confusion behind the ritual of null hypothesis test-
ing as he was. In my experience, many authors are innocent because igno-
rant, which is one way to maintain one's intellectual integrity. I wrote Chap-
ter 13 as an antidote to mindless statistics for both students and future
textbook writers.
The larger social and intellectual background for this chapter can be
found in The empire of chance: How probability changed science and every-
day life (Gigerenzer et al., 1989). This book was written by an interdiscipli-
nary group of scholars who studied the probabilistic revolution in the sci-
ences at the Center for Interdisciplinary Research in Bielefeld, a place where
nothing could distract us from work. Geoffrey Loftus reviewed the book in
Contemporary Psychology in 1991. When he became the editor of Memory &
Cognition in 1993, in his editorial statement he asked researchers to stop
submitting manuscripts with legions of p-values, t-values, and F-values and
instead present sound descriptive statistics, such as figures with error bars
around means. I admire him for having the courage to stand up against
mindless null hypothesis testing. A few years later, I asked Geoffrey how his
crusade was going. To Geoffrey's surprise, the resistance was coming from
the researchers. Most of them insisted on going through with the ritual. As
this case illustrates, editors alone cannot be blamed for psychologists' con-
tinued reliance on misguided statistical procedures. I might add that I have
never had a problem publishing experimental papers without null hypothe-
sis tests.
The story of null hypothesis testing in psychology is reminiscent of Hans
Christian Andersen's tale of the emperor's new clothes. In a sense, the proce-
dure has no clothes: Its outcome, the p-value, does not inform the reader
about the size of the effect, the probability that the null hypothesis is true,
the probability that the alternative hypothesis is true, or the probability that
the result is replicable. Nevertheless, studies in the United States, Great Brit-
ain, and Germany indicate that some 80% to 90% of academic psychologists
"see" one or more of these attractive garments on the p-value.
I have been asked what we should do instead of null hypothesis testing.
The answer is: not a new ritual. Chapter 6 illustrated one alternative. First,
the data is tested against multiple hypotheses—alternative models of cogni-
tive strategies—rather than against one null hypothesis. Second, multiple hy-
potheses are tested against the judgments of each individual participant
rather than against the average across individuals, thus enabling detection of
multiple strategies. Third, hypotheses are tested against both outcome and
process data rather than outcome data only.
The most important thing is to define candidate hypotheses before start-
ing the business of hypothesis testing. Null hypothesis testing encourages
theoretical laziness. To use it, one does not need to specify one's research
hypothesis or a substantive alternative except "chance." This scant require-
ment allows surrogates for theories to grow like weeds (Chapter 14). We
need statistical thinking, not statistical rituals.
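As an illustration only, the following sketch classifies a single hypothetical participant by the candidate strategy whose predictions best match his or her responses; the data and strategy names are invented, and a real application would also use process data and guard against overfitting.

def classify_participant(responses, strategy_predictions):
    """responses: list of observed choices; strategy_predictions: dict
    mapping strategy name -> list of predicted choices (same length)."""
    scores = {name: sum(r == p for r, p in zip(responses, preds))
              for name, preds in strategy_predictions.items()}
    return max(scores, key=scores.get), scores

strategies = {"Take The Best": ["A", "A", "B", "A"],
              "Weighted sum":  ["A", "B", "B", "A"]}
print(classify_participant(["A", "A", "B", "A"], strategies))
# -> ('Take The Best', {'Take The Best': 4, 'Weighted sum': 3})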
12
How to Make Cognitive Illusions Disappear
The "heuristics and biases" program of Kahneman, Tversky, and others has
generated two main results concerning judgment under uncertainty: (1) a list
of so-called biases, fallacies, or errors in probabilistic reasoning, such as the
base-rate fallacy and the conjunction fallacy, and (2) explanations of these bi-
ases in terms of cognitive heuristics such as representativeness. Table 12.1
gives a taste of the conclusions drawn from this program.
Kahneman and Tversky (1982) see the study of systematic errors in proba-
bilistic reasoning, also called "cognitive illusions," as similar to that of visual
illusions. "The presence of an error of judgment is demonstrated by comparing
people's responses either with an established fact (e.g., that the two lines are
equal in length) or with an accepted rule of arithmetic, logic, or statistics"
(p. 493). Their distinction between "correct" and "erroneous" judgments un-
der uncertainty has been echoed by many social psychologists: "We follow
conventional practice by using the term 'normative' to describe the use of a
rule when there is a consensus among formal scientists that the rule is appro-
priate for the particular problem" (Nisbett & Ross, 1980, p. 13).
Social psychology is not the only area in which the "heuristics and biases"
program has made strong inroads. Experimental demonstrations of "fallacious"
judgments have entered law (e.g., Saks & Kidd, 1980), economics (e.g., Frey,
1990), management science (e.g., Bazerman, 1990), medical diagnosis (e.g.,
Casscells, Schoenberger, & Grayboys, 1978), behavioral auditing (see Shanteau,
1989), philosophy (e.g., Stich, 1990), and many other fields. There is no doubt
that understanding judgment under uncertainty is essential in all these fields.
It is the achievement of the "heuristics and biases" program to have finally
established this insight as a central topic of psychology. Earlier pioneers who
studied intuitive statistics (Hofstätter, 1939; Peirce & Jastrow, 1884; Wendt,
1966) had little impact. Even Ward Edwards and his colleagues (e.g., Edwards,
1968), who started the research from which Kahneman and Tversky's "heuris-
Overconfidence Bias
between the Fisherians, the Neyman-Pearsonians, and the Bayesians are evidence of this
unresolved rivalry. For the reader who is not familiar with the fundamental issues, two
basic themes may help introduce the debate (for more, see Hacking, 1965). The first
issue relevant for our topic is whether probability is additive (that is, satisfies the Kol-
mogorov axioms, e.g., that the probabilities of all possible events sum up to 1) or not.
The above-mentioned points of view (including that of the heuristics-and-biases pro-
gram) subscribe to additivity, whereas L. J. Cohen's (e.g., 1982) Baconian probabilities
are nonadditive (for more on nonadditive theories, see Shafer, 1976). In my opinion,
Cohen correctly criticizes the normative claims in the heuristics-and-biases program
insofar as not all uses of "probability" that refer to single events must be additive—but
this does not imply that Baconian probability is the only alternative, nor that one should
assume, as Cohen did, that all minds reason rationally (or at least are competent to do
so) in all situations. I do not deal with this issue in this chapter (but see Gigerenzer,
1991d). The second fundamental issue is whether probability theory is about relative
frequencies in the long run or (also) about single events. For instance, the question "What
is the relative frequency of women over 60 who have breast cancer?" refers to frequen-
cies, whereas "What is the probability that Ms. Young has breast cancer?" refers to a
single event. Bayesians usually assume that (additive) probability theory is about single
events, whereas frequentists hold that statements about single cases have nothing to do
with probability theory (they may be dealt with by cognitive psychology, but not by
probability theory).
given, but not for information that could falsify it. This selective information
search artificially increases confidence. The key idea in this explanation is that
the mind is not a Popperian. Despite the popularity of the confirmation bias
explanation in social psychology, there is little or no support for this hypoth-
esis in the case of confidence judgments (see Chapter 7).
As with many "cognitive illusions," overconfidence bias seems to be a ro-
bust fact waiting for a theory. This "fact" was quickly generalized to account
for human disasters of many kinds, such as deadly accidents in industry (Spet-
tell & Liebert, 1986), confidence in clinical diagnosis (Arkes, 1981), and short-
comings in management and negotiation (Bazerman, 1990) and in the legal
process (Saks & Kidd, 1980), among others.
view. It only looks like it from a narrow interpretation of probability that blurs
the distinction between single events and frequencies fundamental to proba-
bility theory. (The choice of the word "overconfidence" for the discrepancy
put the "fallacy" message into the term itself.)
How to Make the Cognitive Illusion Disappear
If there are any robust cognitive
biases at all, overconfidence in one's knowledge would seem to be a good
candidate. "Overconfidence is a reliable, reproducible finding" (von Winter-
feldt & Edwards, 1986, p. 539). "Can anything be done? Not much" (Edwards
& von Winterfeldt, 1986, p. 656). "Debiasing" methods, such as warning the
participants of the overconfidence phenomenon before the experiment and of-
fering them money to avoid it, have had little or no effect (Fischhoff, 1982).
Setting the normative issue straight has important consequences for under-
standing confidence judgments. Let us go back to the metaphor of the mind as
an intuitive statistician. I now take the term "statistician" to refer to a statis-
tician of the dominant school in this (and in the last) century, not one adopting
the narrow perspective some psychologists and economists have suggested.
Assume that the mind is a frequentist. Like a frequentist, the mind should be
able to distinguish between single-event confidences and frequencies in the
long run.
This view has testable consequences. Ask people for their estimated fre-
quencies of correct answers and compare them with true frequencies of correct
answers, instead of comparing the latter frequencies with confidences. We are
now comparing apples with apples. Ulrich Hoffrage, Heinz Kleinbölting, and
I carried out such experiments. Participants answered several hundred ques-
tions of the Islamabad-Hyderabad type (see above), and, in addition, estimated
their frequencies of correct answers.
Table 12.2 (top row) shows the usual "overconfidence bias" when single-
event confidences are compared with actual relative frequencies of correct
answers. In both experiments, the difference was around 13 to 15 percentage
points, which is a large discrepancy. After each set of 50 general knowledge
Note: To make values for frequency and confidence judgments comparable, all frequen-
cies were transformed to relative frequencies. Values shown are differences multiplied
by a factor of 100. Positive values denote "overconfidence" (Gigerenzer, Hoffrage, &
Kleinbölting, 1991).
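A small numerical sketch, with invented data, shows the two comparisons: mean confidence versus relative frequency of correct answers (apples with oranges) and the participant's estimated frequency versus the actual frequency (apples with apples). Positive differences correspond to "overconfidence."

confidences = [0.9, 0.8, 0.95, 0.7, 0.85]   # hypothetical confidence in each single answer
correct     = [1, 0, 1, 1, 0]               # whether each answer was in fact correct
estimated_frequency = 3 / 5                 # "How many of these 5 did you answer correctly?"

proportion_correct = sum(correct) / len(correct)
mean_confidence = sum(confidences) / len(confidences)

print(round((mean_confidence - proportion_correct) * 100, 1))      # confidence vs. frequency
print(round((estimated_frequency - proportion_correct) * 100, 1))  # frequency vs. frequency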
Conjunction Fallacy
Linda is 31 years old, single, outspoken and very bright. She majored in
philosophy. As a student, she was deeply concerned with issues of dis-
crimination and social justice, and also participated in antinuclear dem-
onstrations.
Eighty-five percent of the participants chose T&F ("Linda is a bank teller and
is active in the feminist movement") rather than T ("Linda is a bank teller") in
the Linda problem (see Table 12.3). Tversky and Kahneman, however, argued that the "correct" an-
swer is T, because the probability of a conjunction of two events, such as T&F,
can never be greater than that of one of its constituents. They explained this
"fallacy" as induced by the representativeness heuristic. They assumed that
judgments were based on the match (similarity, representativeness) between
the description of Linda and the two alternatives T and T&F. That is, since
Table 12.3 (excerpt). Violations of the conjunction rule (in %):
Fiedler (1988), Exp. 1: 91 (probability judgments), 22 (frequency judgments)
Fiedler (1988), Exp. 2: 83 (probability judgments), 17 (frequency judgments)
Note: Numbers are violations (in %) of the conjunction rule. The various versions
of the Linda problem are (i) which is more probable (see text), (ii) probability
ratings on a 9-point scale, (iii) probability ratings using the alternative "Linda is
a bank teller whether or not she is active in the feminist movement" (T*) instead
of "Linda is a bank teller" (T), (iv) hypothetical betting, that is, participants were
asked "If you could win $10 by betting on an event, which of the following would
you choose to bet on?" Fiedler asked participants to rank order T, T&F, and other
alternatives with respect to their probability. In his first frequency version the
population size was always 100, in the second it varied. Hertwig and Gigerenzer
asked participants to rank order T, T&F, and F with respect to their probability,
or estimate their frequency. Tversky and Kahneman (1983, p. 309) had reported
a facilitating effect of frequency judgments for a different problem.
Linda was described as if she were a feminist and T&F contains the term "fem-
inist," people believe that T&F is more probable.
This alleged demonstration of human irrationality in the Linda problem has
been widely publicized in psychology, philosophy, economics, and beyond.
Stephen J. Gould (1992, p. 469) put the message clearly:
I am particularly fond of [the Linda] example, because I know that the
[conjunction] is least probable, yet a little homunculus in my head con-
tinues to jump up and down, shouting at me, "but she can't just be a
bank teller; read the description." . . . Why do we consistently make this
simple logical error? Tversky and Kahneman argue, correctly I think, that
our minds are not built (for whatever reason) to work by the rules of
probability.
I suggest that Gould should have had more trust in the rationality of his ho-
munculus.
is "no." Choosing T&F is not a violation of probability theory, and for the same
reason given previously. For a frequentist, this problem has nothing to do with
probability theory. Participants were asked for the probability of a single event
(that Linda is a bank teller), not for frequencies. For instance, the statistician
Barnard (1979) commented thus on subjective probabilities for single events:
"If we accept it as important that a person's subjective probability assessments
should be made coherent, our reading should concentrate on the works of
Freud and perhaps Jung rather than Fisher and Neyman" (p. 171).
Note that problems that are claimed to demonstrate the "conjunction fal-
lacy" are structurally different from "confidence" problems. In the former, sub-
jective probabilities (that Linda is a bank teller or a bank teller and a feminist)
are compared with one another; in the latter, they are compared with frequen-
cies.
To summarize the normative issue, what is called the "conjunction fallacy"
looks like a violation of some subjective theories of probability, including Bay-
esian theory. It is not, however, a violation of a major view of probability, the
frequentist conception.
How to Make the Cognitive Illusion Disappear
What if the mind were a fre-
quentist? If the untutored mind is as sensitive to the distinction between single
cases and frequencies as a statistician of the frequentist school is, then we
should expect dramatically different judgments if we pose the above problem
in a frequentist mode, such as the following:
There are 100 persons who fit the description above (i.e., Linda's).
How many of them are:
(a) bank tellers
(b) bank tellers and active in the feminist movement.
Participants are now asked for frequency judgments rather than for single-
event probabilities. If the mind solves the Linda problem by using a represen-
tativeness heuristic, changes in information representation should not matter
because they do not change the degree of similarity. The description of Linda
is still more representative of (or similar to) the conjunction "teller and femi-
nist" than of "teller." Participants therefore should still exhibit the conjunction
fallacy. Table 12.3, however, shows that with frequency judgments, the "con-
junction fallacy" largely disappears. The effect is dramatic, from some 80% to
90% conjunction violations in probability judgments to 10% to 20% in fre-
quency judgments, with one study even reporting 0%.
What accounts for this striking effect of frequency judgments? Hertwig and
Gigerenzer (1999) analyzed how participants understood the phrase "which is
more probable?", for instance, by asking them to paraphrase the problem to
another person who is not a native speaker of the language in which the prob-
lem was presented. The results indicate that most participants did not under-
stand "probability" in the sense of mathematical probability but as one of the
many other legitimate meanings that are listed in, for example, the Oxford
English Dictionary (e.g., meaning credibility, typicality, or that there is evi-
dence). The term frequency, unlike probability, narrows down the spectrum of
possible interpretations to meanings that follow mathematical probability.
The results in Table 12.3 are consistent with the earlier work by Inhelder
and Piaget (1969), who showed children a box containing wooden beads, most
of them brown, but a few white. They asked the children, "Are there more
wooden beads or more brown beads in this box?" By the age of eight, a majority
of children responded that there were more wooden beads, indicating that they
understand conjunctions (class inclusions). Note that Inhelder and Piaget
asked children for frequency judgments, not probability judgments.
Base-Rate Fallacy
Among all cognitive illusions, the "base-rate fallacy" has probably received
the most attention. The neglect of base rates seems in direct contradiction to
the widespread belief that judgments are unduly affected by stereotypes (Land-
man & Manis, 1983), and for this and other reasons it has generated a great
deal of interesting research on the limiting conditions for the "base-rate fal-
lacy" in attribution and judgment (e.g., Ajzen, 1977; Borgida & Brekke, 1981).
For instance, in their review, Borgida and Brekke argue for the pervasiveness
of the "base-rate fallacy" in everyday reasoning about social behavior, ask the
question "Why are people susceptible to the base-rate fallacy?" (1981, p. 65),
and present a list of conditions under which the "fallacy" is somewhat re-
duced, such as "vividness," "salience," and "causality" of base-rate informa-
tion.
My analysis is different. Again I first address the normative claims that
people's judgments are "fallacies" using two examples that reveal two different
aspects of the narrow understanding of good probabilistic reasoning in much
of this research.
The first is from Casscells, Schoenberger, and Grayboys (1978, p. 999) and
presented by Tversky and Kahneman (1982b, p. 154) to demonstrate the gen-
erality of the phenomenon:
If a test to detect a disease whose prevalence is 1/1000 has a false positive
rate of 5%, what is the chance that a person found to have a positive
result actually has the disease, assuming you know nothing about the
person's symptoms or signs?
Sixty students and staff at Harvard Medical School answered this medical
diagnosis problem. Almost half of them judged the probability that the person
actually had the disease to be 0.95 (modal answer), the average answer was
0.56, and only 18% of participants responded 0.02. The latter was considered
to be the correct answer. Note the enormous variability in judgments. Little
has been achieved in explaining how people make these judgments and why
the judgments are so strikingly variable.
The Normative Issue
But do statistics and probability give one and only one
"correct" answer to that problem? The answer is again "no." And for the same
reason, as the reader will already have guessed. As in the case of confidence
and conjunction judgments, participants were asked for the probability of a
single event, that is, that "a person found to have a positive result actually has
the disease." If the mind is an intuitive statistician of the frequentist school,
such a question has no necessary connection to probability theory. Further-
more, even for a Bayesian, the medical diagnosis problem has several possible
answers. One piece of information necessary for a Bayesian calculation is miss-
ing: the test's long-run frequency of correctly diagnosing persons who have the
disease (admittedly a minor problem if we can assume a high "true positive
rate"). A more serious difficulty is that the problem does not specify whether
or not the person was randomly drawn from the population to which the base
rate refers. Clinicians, however, know that patients are usually not randomly
selected—except in screening and large survey studies—but rather "select"
themselves by exhibiting symptoms of the disease. In the absence of random
sampling, it is unclear what to do with the base rates specified. The modal
response, 0.95, would follow from applying the Bayesian principle of indif-
ference (i.e., same prior probabilities for each hypothesis), whereas the answer
0.02 would follow from using the specified base rates and assuming random
sampling. In fact, the range of actual answers corresponds quite well to the
range of possible solutions.
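A short calculation, assuming a true positive rate of 1.0 (the piece of information missing from the original problem), shows how the two prominent answers follow from different structural assumptions: uniform priors (the principle of indifference) yield about 0.95, the specified base rate of 1/1000 together with random sampling yields about 0.02.

def posterior(prior, true_positive_rate=1.0, false_positive_rate=0.05):
    # probability of disease given a positive test, by Bayes's rule
    p_positive = true_positive_rate * prior + false_positive_rate * (1 - prior)
    return true_positive_rate * prior / p_positive

print(round(posterior(prior=0.5), 2))      # principle of indifference: ~0.95
print(round(posterior(prior=0.001), 2))    # specified base rate, random sampling: ~0.02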
How to Make the Cognitive Illusion Disappear
The literature overflows with
assertions of the generality and robustness of the "base-rate fallacy," such as:
"the base-rate effect appears to be a fairly robust phenomenon that often results
from automatic or unintentional cognitive processes" (Landman & Manis,
1983, p. 87); and "many (possibly most) subjects generally ignore base rates
completely" (Pollard & Evans, 1983, p. 124; see also Table 12.1). Not only are
the normative claims often simplistic and, therefore, misleading, but so too are
the robustness assertions.
What happens if we do something similar as for the "overconfidence bias"
and the "conjunction fallacy," that is, rephrase the medical diagnosis problem
in a frequency format? Cosmides and Tooby (1996) did so. They compared the
original problem (above) with a frequency format, in which the same infor-
mation was given:
One out of 1000 Americans has disease X. A test has been developed to
detect when a person has disease X. Every time the test is given to a
person who has the disease, the test comes out positive. But sometimes
the test also comes out positive when it is given to a person who is
completely healthy. Specifically, out of every 1000 people who are per-
fectly healthy, 50 of them test positive for the disease.
In this frequentist version of the medical diagnosis problem, both the in-
formation and the question are phrased in terms of frequencies. (In addition,
the two pieces of information missing in the original version [see above] are
supplied. In numerous other versions of the medical diagnosis problem, Cos-
mides and Tooby showed that the striking effect [see Table 12.4] on partici-
pants' reasoning is mainly due to the transition from a single-event problem
to a frequency format, and only to a lesser degree to the missing information.)
Participants were Stanford University undergraduates.
If the question was rephrased in natural frequencies, as shown above, then
the Bayesian answer of 0.02—that is, the answer "one out of 50 (or 51)"—was
given by 76% of the participants. The "base-rate fallacy" disappeared. By com-
parison, the original single-event version elicited only 12% Bayesian answers
in Cosmides and Tooby's study. Chapter 6 provides an explanation for this
effect.
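In natural frequencies, the same answer requires no formula, only counting; the numbers below are those of the frequency version quoted above (1 person in 1,000 has the disease and tests positive, and about 50 healthy people per 1,000 also test positive).

true_positives = 1
false_positives = 50
print(round(true_positives / (true_positives + false_positives), 2))  # ~0.02, "one out of 51"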
Cosmides and Tooby identified one condition in which almost every par-
ticipant found the Bayesian answer of 0.02. Participants received the frequen-
tist version of the medical diagnosis problem (except that it reported a random
sample of "100 Americans" instead of "1000 Americans"), and in addition a
page with 100 squares (10 X 10). Each of these squares represented one Amer-
ican. Before the frequentist question "How many people who test positive . . . "
was put, participants were asked to (i) circle the number of people who will
have the disease and (ii) to fill in squares to represent people who will test
positive. After that, 23 out of 25 participants came up with the Bayesian an-
swer (see frequency format, pictorial, in Table 12.4).
All three examples point in the same direction: The mind acts as if it were
a frequentist; it distinguishes between single events and frequencies in the long
run—just as probabilists and statisticians do. Despite the fact that researchers
in the "heuristics and biases" program routinely ignore this distinction fun-
damental to probability theory when they claim to have identified "errors," it
would be foolish to label these judgments "fallacies." These results not only
point to a truly new understanding of judgment under uncertainty, but they
also seem to be relevant for teaching statistical reasoning.
tion was that participants use a representativeness heuristic, that is, they judge
the probability by the similarity (representativeness) between a description and
their stereotype of an engineer. Kahneman and Tversky believed that their
participants were violating "one of the basic principles of statistical predic-
tion," the integration of prior probability with specific evidence by Bayes's
rule. The result was given much weight: "The failure to appreciate the rele-
vance of prior probability in the presence of specific evidence is perhaps one
of the most significant departures of intuition from the normative theory of
prediction" (p. 243).2
2. The terms "prior probabilities" and "base rates" are frequently used interchange-
ably in the psychological literature. But these concepts are not identical. It is the prior
probabilities that are fed into Bayes's rule, and these priors may be informed by base
rates. Base rates are just one piece of information among several that a person can con-
sider relevant for making up her prior probabilities. Equating prior probabilities with
one particular kind of base-rate information would be a narrow understanding of Bayes-
ian reasoning. Such reasoning might be defensible in those situations in which one
knows very little, but not in real-life situations in which one can base judgments on rich
knowledge.
Random Sampling Increases Use of Base Rates
One way to understand partic-
ipants' judgments is to assume that the engineer-lawyer problem activates ear-
lier knowledge associated with profession guessing, which can be used as an
inferential framework—a "mental model"—to solve the problem.3 But, as I
have argued, we cannot expect random sampling to be part of this mental
model. If my analysis is correct, then base-rate use can be increased if we take
care to commit the participants to the crucial property of random sampling—
that is, break apart their mental models and insert the new structural assump-
tion. In contrast, if the true explanation is that participants rely on the repre-
sentativeness heuristic, then the participants should continue to neglect base
rates.
There is a simple method of making people aware of random sampling in
the engineer-lawyer problem, which we used in a replication of the original
study (Gigerenzer, Hell, & Blank, 1988). The participants themselves drew each
description (blindly) out of an urn, unfolded the description, and gave their
probability judgments. There was no need to tell them about random sampling
because they did it themselves. This condition increased the use of base rates.
Participants' judgments were closer to Bayesian predictions than to base-rate
neglect. When we used, for comparison, the original study's version of the
crucial assumption—as a one-word assertion—neglect of base rates appeared
again (although less intensely than in Kahneman and Tversky's study).
3. I use the term "mental model" in a sense that goes beyond Johnson-Laird's (1983).
As in the theory of probabilistic mental models (Chapter 7), a mental model is an infer-
ential framework that generalizes the specific task to a reference class (and probability
cues defined on it) that a person knows from his or her environment.
I show that even with Kahneman and Tversky's original "verbal assertion
method," that is, a one-word assertion of random sampling, there is in fact no
support for the claim that judgments about an uninformative description are
guided by a general representativeness heuristic—contrary to assertions in the
literature.
Table 12.5 lists all studies of the uninformative description "Dick" that I
am aware of—all replications of Kahneman and Tversky's (1973) verbal asser-
tion method. The two base-rate groups were always 30% and 70% engineers.
According to Kahneman and Tversky's argument, the difference between the
two base-rate groups should approach the difference between the two base
rates, that is, 40% (or somewhat less, if the description of Dick was not per-
ceived as totally uninformative by the participants). The last column shows
their result mentioned above, a zero difference, which we (Gigerenzer, Hell, &
Blank, 1988) could closely replicate. Table 12.5 also shows, however, that sev-
eral studies found substantial mean differences up to 37%, which comes very
close to the actual difference between base-rate groups.
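A minimal sketch of the Bayesian prediction for an uninformative description: if the description does not discriminate between engineers and lawyers, its likelihood ratio is 1, Bayes's rule returns the base rate unchanged, and the predicted difference between the 70% and 30% groups is 40 percentage points.

def posterior_engineer(base_rate, likelihood_ratio=1.0):
    # likelihood_ratio = p(description | engineer) / p(description | lawyer)
    odds = likelihood_ratio * base_rate / (1 - base_rate)
    return odds / (1 + odds)

diff = (posterior_engineer(0.70) - posterior_engineer(0.30)) * 100
print(round(diff, 1))  # 40.0 percentage points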
Seen together, the studies seem to be as inconsistent as it is possible to be:
Every result between zero difference (base-rate neglect) and the actual base-
rate difference has been obtained. This clearly contradicts the rhetoric of ro-
bustness and generality of the base-rate fallacy, such as: "Regardless of what
kind of information is presented, subjects pay virtually no attention to the base
rate in guessing the profession of the target" (Holland et al., 1986, p. 217). And
it contradicts the explanation of the so-called fallacy: the proposed general
representativeness heuristic.
Table 12.5 How to make the "base-rate fallacy" disappear: The uninformative
description "Dick" in the engineer—lawyer problem
"Dick" Mean
encountered difference
No. of first (relative between base
Study descriptions frequency) rate groups8
a. Entries are [p70(E | D) - p30(E | D)] × 100, where p70(E | D) is the mean probability judgment that
"Dick" is an engineer, given the description and the 70% base rate.
b. Order of descriptions systematically varied.
c. Medians (no means reported).
d. Three descriptions were used, but "Dick" was always encountered first.
e. Separate analysis for all participants who encountered "Dick" first.
standing judgments under uncertainty (for similar results see Ginossar & Trope,
1987; Grether, 1980; Hansen & Donoghue, 1977; Wells & Harvey, 1977; but see
Nisbett & Borgida, 1975).
Note that the critical variable here is the content of a problem. There seems
to be a class of contents for which participants know from their environment
that base rates are relevant (as do birds; see Caraco, Martindale, & Whittam,
1980) or that random sampling is common (though they need not represent
these concepts explicitly), whereas in other contents this is not the case. Pro-
fession guessing seems to belong to the latter category. In contrast, predictions
of sports results, such as those of soccer games, seem to belong to the former.
For instance, we found that participants revised information about the pre-
vious performance of soccer teams (base rates) in light of new information
(half-time results) in a way that is indistinguishable from Bayesian statistics
(Gigerenzer, Hell, & Blank, 1988). Here verbal assertion of random drawing
was sufficient—there was no need for strong measures to break apart mental
models.
Heuristics
cized (e.g., Jungermann, 1983; Wallsten, 1983), but to no avail. Why is this? I
believe that particular features of the use of the term "heuristic" have led to
the present conceptual dead end, and more research in a cul-de-sac will not
help. In my opinion, these features are the following.
In artificial intelligence research one hopes that heuristics can make computers
smart; in the "heuristics and biases" program one hopes that heuristics can
tell why humans are not smart. The fundamental problem with the latter is
that most "errors" in probabilistic reasoning that one wants to explain by heu-
ristics are in fact not errors, as I have argued above. Thus heuristics are meant
to explain what does not exist. Rather than explaining a deviation between
human judgment and allegedly "correct" probabilistic reasoning, future re-
search has to get rid of simplistic norms that evaluate human judgment instead
of explaining it.
Simon, and earlier Egon Brunswik, have emphasized that cognitive functions
are adaptations to a given environment and that we have to study the structure
of environments in order to infer the constraints they impose on reasoning.
Heuristics such as representativeness have little to say about how the mind
adapts to the structure of a given environment.
Several of the explanations using heuristics are hardly more than redescrip-
tions of the phenomena reported. Take, for instance, the explanation of base-
rate neglect in the engineer—lawyer problem (and similar base-rate problems)
by the representativeness heuristic. Representativeness here means the per-
ceived similarity between a personality description and the participants'
stereotype of an engineer. In the vocabulary of Bayes's rule, this similarity is
a likelihood: that is, the probability of a description given that the person is
an engineer. Now we can see that Bayes's rule, in particular its concepts of
base rates (prior probabilities) and likelihoods, provides the vocabulary for
both the phenomenon and its purported explanation. The phenomenon is ne-
glect of base rates and use of likelihoods. The "explanation" is that participants
use representativeness (likelihoods) and do not use base rates. What is called
a representativeness heuristic here is nothing more than a redescription of the
phenomenon (Gigerenzer & Murray, 1987, pp. 153-155).
I have argued that what have been widely accepted to be the "normative prin-
ciples of statistical prediction" (e.g., Ajzen, 1977, p. 304), against which hu-
man judgment has been evaluated as "fallacious," are a caricature of the pres-
ent state of probability theory and statistics. I have shown that several so-called
fallacies are in fact not violations of probability theory. Conceptual distinctions
routinely used by probabilists and statisticians were just as routinely ignored
in the normative claims of "fallacies." Most strikingly, in the experimental
research reviewed, "fallacies" and "cognitive illusions" tend to disappear if
we pay attention to these fundamental distinctions. I am certainly not the first
to criticize the notion of "robust fallacies." The only novelty in my research
is that the variables that bring "cognitive illusions" under experimental control
are those important from the viewpoint of probability and statistics (as op-
posed to, say, whether participants were given more or less "vivid" or "caus-
ally relevant" information).
Together, these results point to several ways to develop an understanding
of judgment under uncertainty that goes beyond the narrow notion of a "bias"
and the largely undefined notion of a "heuristic."
For instance, despite the quantity of empirical data that has been gathered
on the cab problem, the lack of a theory of the cognitive processes involved
in solving it is possibly the most striking result. Tversky and Kahneman
claimed that the cab problem has one "correct answer" (1980, p. 62). They
attempted to explain the extent to which people's judgments deviated from
that "norm" by largely undefined terms such as "causal base rates." But sta-
tistics gives several interesting answers to the cab problem, rather than just
one "correct" answer (e.g., Birnbaum, 1983; Gigerenzer, 1998c; Levi, 1983). If
progress is to be made and people's cognitive processes are to be understood,
one should no longer try to explain the difference between people's judgments
and Tversky and Kahneman's "normative" Bayesian calculations. People's
judgments have to be explained. Statistical theories can provide highly inter-
esting models of these judgments. The only theoretically rich account of the
cognitive processes involved in solving the cab problem (or similar "eyewit-
ness testimony" problems) was in fact derived from a frequentist framework:
Birnbaum (1983) combined Neyman-Pearson theory with psychological mod-
els of judgments such as range-frequency theory.
Future research should use competing statistical theories as competing ex-
planatory models, rather than pretending that statistics speaks with one voice
(see also Cohen, 1982; Wallendael & Hastie, 1990).
changed and that the important event (being eaten or not) can no longer be
considered as an independent random drawing from the same reference class.
Updating "old" base rates may be fatal for the child.
The question of whether some part of the world is stable enough to use
statistics has been posed by probabilists and statisticians since the inception
of probability theory in the mid-seventeenth century—and the answers have
varied and will vary, as is well documented by the history of insurance (Das-
ton, 1987). Like the underwriter, the layperson has to check structural as-
sumptions before entering into calculations. For instance, the following struc-
tural assumptions are all relevant for the successful application of Bayes's rule:
independence of successive drawings, random sampling, an exhaustive and
mutually exclusive set of hypotheses, and independence between prior prob-
abilities and likelihoods.
How can the intuitive statistician judge whether these assumptions hold?
One possibility is that the mind generalizes the specific content to a broader
mental model that uses implicit domain-dependent knowledge about these
structural assumptions. If so, then the content of problems is of central im-
portance for understanding judgment—it embodies implicit knowledge about
the structure of an environment.
Conclusion
Those good old days have gone, although the eighteenth-century link be-
tween probability and rationality is back in vogue in cognitive and social psy-
chology. If, in studies on social cognition, researchers find a discrepancy be-
tween human judgment and what probability theory seems to dictate, the
blame is now put on the human mind, not on the statistical model.
I have used classical demonstrations of overconfidence bias, conjunction
fallacy, and base-rate neglect to show that what have been called "errors" in
probabilistic reasoning are in fact not violations of probability theory. They
only look so from a narrow understanding of good probabilistic reasoning that
ignores conceptual distinctions fundamental to probability and statistics.
These so-called cognitive illusions largely disappear when one pays attention
to these conceptual distinctions. The intuitive statistician seems to be highly
sensitive to them—a result unexpected from the view that "mental illusions
should be considered the rule rather than the exception" (see Table 12.1).
Why do cognitive illusions largely disappear? The examples in this chapter
have illustrated three reasons:
1. Polysemy: not all probabilities are mathematical probabilities. Asking
a frequency as opposed to a probability question can reduce the mul-
tiple meanings (polysemy) of the English terms "probable" and
"likely." Frequency questions clarify that the question is actually
about mathematical probability and not about one of the other legit-
imate meanings (see the Oxford English Dictionary), which are often
suggested by the cover story of a problem. Reducing polysemy seems
to be the major reason the conjunction fallacy in the Linda problem
largely disappears (Hertwig & Gigerenzer, 1999).
2. A mathematical probability refers to a reference class, which may
differ depending on the task. Asking a frequency (as opposed to a
probability) question can systematically cue different reference classes
(and therefore, different probabilistic mental models). Changing ref-
erence classes seems to be the reason overconfidence bias appears in
probability judgments and disappears in frequency judgments (Chap-
ter 7).
3. Natural frequencies facilitate Bayesian reasoning. When information
is represented in natural frequencies rather than in conditional prob-
abilities (or relative frequencies), Bayesian computations become sim-
pler. Using natural frequencies is a powerful tool to reduce people's
mental confusion and foster Bayesian reasoning (Chapter 6); a short
sketch follows this list.
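As an illustration of point 3, here is a minimal sketch (the numbers are hypothetical, not taken from Chapter 6) that computes the same diagnostic inference twice: once from conditional probabilities via Bayes's rule, and once from natural frequencies, where the computation reduces to counting cases.

    # Conditional-probability format: Bayes's rule with three probabilities.
    base_rate, sensitivity, false_positive_rate = 0.01, 0.80, 0.05
    posterior = (sensitivity * base_rate) / (
        sensitivity * base_rate + false_positive_rate * (1 - base_rate)
    )

    # Natural-frequency format: think of 1,000 concrete cases instead.
    n = 1000
    sick = round(n * base_rate)                                   # 10 people
    sick_positive = round(sick * sensitivity)                     # 8 test positive
    healthy_positive = round((n - sick) * false_positive_rate)    # 50 false alarms
    posterior_nf = sick_positive / (sick_positive + healthy_positive)

    print(round(posterior, 3), round(posterior_nf, 3))            # 0.139 vs. 0.138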
This is not to say that frequencies always improve judgment. For instance,
the theory of probabilistic mental models specifies conditions under which
frequency judgments systematically underestimate actual frequencies, and
Chapter 6 explains why natural frequencies but not other kinds of frequencies
facilitate Bayesian reasoning. The question is not whether or not, or how often,
"cognitive illusions" disappear, but why. We need precise models of heuristics
that make surprising (and falsifiable) predictions, not vague terms that, post
hoc, explain everything and nothing. Future progress will be in understanding,
not debunking, human thinking.
13
The Superego, the Ego, and the Id
in Statistical Reasoning
Piaget worked out his logical theory of cognitive development, Köhler the Ge-
stalt laws of perception, Pavlov the principles of classical conditioning, Skin-
ner those of operant conditioning, and Bartlett his theory of remembering and
schemata—all without rejecting null hypotheses. But by the time I took my
first course in psychology at the University of Munich in 1969, null hypothesis
tests were presented as the indispensable tool, as the sine qua non of scientific
research. Post-World War II German psychology mimicked a revolution of re-
search practice that had occurred between 1940 and 1955 in American psy-
chology.
What I learned in my courses and textbooks about the logic of scientific
inference was not without a touch of moralizing, a scientific version of the Ten
Commandments: Thou shalt not draw inferences from a nonsignificant result.
Thou shalt always specify the level of significance before the experiment; those
who specify it afterward (by rounding up obtained p values) are cheating. Thou
shalt always design thy experiments so that thou canst perform significance
testing.
What happened between the time of Piaget, Kohler, Pavlov, Skinner, and Bart-
lett and the time I was trained? In Kendall's (1942) words, statisticians "have
already overrun every branch of science with a rapidity of conquest rivalled
only by Attila, Mohammed, and the Colorado beetle" (p. 69).
What has been termed the probabilistic revolution in science (Gigerenzer et
al., 1989) reveals how profoundly our understanding of nature changed when
concepts such as chance and probability were introduced as fundamental the-
oretical concepts. The work of Mendel in genetics, that of Maxwell and Boltz-
mann on statistical mechanics, and the quantum mechanics of Schrödinger
and Heisenberg that built indeterminism into its very model of nature are key
examples of that revolution in thought.
1. R. Duncan Luce, personal communication, April 4, 1990. See also Luce's (1989)
autobiography, p. 270 and pp. 281-282.
To understand the structure of the hybrid logic that has been taught in psy-
chology for some 50 years, I briefly sketch those ideas of Fisher on the one
hand and Neyman and Pearson on the other that are relevant to understanding
the hybrid structure of the logic of inference.
Fisher's first book, Statistical Methods for Research Workers, published in
1925, was successful in introducing biologists and agronomists to the new
techniques. However, it had the agricultural odor of issues like the weight of
pigs and the effect of manure, and, such alien topics aside, it was technically
far too difficult to be understood by most psychologists.
the Natural Sciences" (p. 69). And he maintained his epistemic view: "From
a test of significance . . . we have a genuine measure of the confidence with
which any particular opinion may be held, in view of our particular data"
(p. 74). For all his anti-Bayesian talk, Fisher adopted a very similar-sounding
line of argument (Johnstone, 1987).
2. On the distinction between statistical and substantive hypotheses, see Hager and
Westermann (1983) and Meehl (1978).
Power
In null hypothesis testing, only one kind of error is defined: rejecting the null
hypothesis when it is in fact true. In their attempt to supply a logical basis for
Fisher's ideas and make them consistent, Neyman and Pearson replaced
Fisher's single null hypothesis by a set of rival hypotheses. In the simplest
case, two hypotheses, H0 and H1, are specified, and it is assumed that one of
them is true. This assumption allows us to determine the probability of both
Type I errors and Type II errors, indicated in Neyman-Pearson theory by α
and β, respectively. If H1 is rejected although H1 is true, a Type II error has
occurred. α is also called the size of a test, and 1 − β is called its power. The
power of a test is the long-run frequency of accepting H1 if it is true. The
concept of power makes explicit what Fisher referred to as "sensitivity."
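To make these definitions concrete, here is a minimal sketch (my own illustration, not part of the chapter) that computes power for a one-sided z-test of H0: μ = 0 against H1: μ = d with known unit variance; the effect size, sample size, and α level are arbitrary choices.

    from statistics import NormalDist

    def power_of_z_test(d, n, alpha=0.05):
        """Long-run frequency of rejecting H0 when H1 (true mean = d) holds."""
        z = NormalDist()
        z_crit = z.inv_cdf(1 - alpha)          # rejection threshold; alpha is the size
        beta = z.cdf(z_crit - d * n ** 0.5)    # probability of a Type II error
        return 1 - beta

    print(round(power_of_z_test(d=0.5, n=20), 2))   # about 0.72, not 1 - alpha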
Fisher (1935) pointed out two ways to make an experiment more sensitive:
by enlarging the number of repetitions and by qualitative methods, such as
experimental refinements that minimize the error in the measurements
(pp. 21-25). Nevertheless, he rejected the concept of Type II error and calcu-
lations of power on the grounds that they are inappropriate for scientific in-
duction. In his view, calculations of power, although they look harmless, reflect
the "mental confusion" between technology and scientific inference (Fisher,
1955, p. 73). If someone designs a test for acceptance procedures in quality
control, the goal of which is to minimize costs due to decision errors, calcu-
lations of power based on cost-benefit considerations in situations of repetitive
tests are quite appropriate. But scientific inference and discovery, in Fisher's
view, are about gaining knowledge, not saving money.
Fisher always rejected the concept of power. Neyman, for his part, pointed
out that some of Fisher's tests "are in a mathematical sense 'worse than use-
less,' " because their power is less than their size (see Hacking, 1965, p. 99).
Even in the Tea-Tasting Experiment, used by Fisher to introduce the logic of
null hypothesis testing in The Design, the power is only a little higher than
the level of significance (.05) or cannot be calculated at all, depending on the
conditions (see Neyman, 1950).
(every day a random sample may be taken). Recall that Neyman and Pearson
based their theory on the concept of repeated random sampling, which defined
the probability of Type I and Type II errors as long-run frequencies of wrong
decisions in repeated experiments.
Fisher, in contrast, held that in scientific applications there is no known
population from which repeated sampling can be done. There are always many
populations to which a sample may belong. "The phrase 'repeated sampling
from the same population' does not enable us to determine which population
is to be used to define the probability level, for no one of them has objective
reality, all being products of the statistician's imagination" (Fisher, 1955,
p. 71). Fisher proposed to view any sample (such as the sample of participants
in a typical psychological experiment, which is not drawn randomly from a
known population) as a random sample from an unknown hypothetical infinite
population. "The postulate of randomness thus resolves into the question, 'Of
what population is this a random sample?' which must frequently be asked by
every practical statistician" (Fisher, 1922, p. 313). But how can the practical
statistician find out? The concept of an unknown hypothetical infinite popu-
lation has puzzled many: "This is, to me at all events, a most baffling concep-
tion" (Kendall, 1943, p. 17).
One way of reading The Design suggests that null hypothesis testing is a fairly
mechanical procedure: Set up a null hypothesis, use a conventional level of
significance, calculate a test statistic, and disprove the null hypothesis, if you
can. Fisher later made clear that he did not mean it to be so. For instance, he
pointed out that the choice of the test statistic and deciding which null hy-
potheses are worth testing cannot be reduced to a mechanical process. You
need constructive imagination and much knowledge based on experience
(Fisher, 1933). Statistical inference has two components: informed judgment
and mathematical rigor.
Similarly, Neyman and Pearson always emphasized that the statistical part
has to be supplemented by a subjective part. As Pearson (1962) put it: "We left
in our mathematical model a gap for the exercise of a more intuitive process
of personal judgment in such matters—to use our terminology—as the choice
of the most likely class of admissible hypotheses, the appropriate significance
level, the magnitude of worthwhile effects and the balance of utilities"
(pp. 395-396).
In Neyman and Pearson's theory, once all judgments are made, the decision
(reject or accept) results mechanically from the mathematics. In his later writ-
ings, Fisher opposed these mechanical accept/reject decisions, which he be-
lieved to be inadequate in science, in which one looks forward to further data.
Science is concerned with the communication of information, such as exact
levels of significance. Again, Fisher saw a broader context, the freedom of the
Western world. Communication of information (but not mechanical decisions)
recognizes "the right of other free minds to utilize them in making their own
decisions" (Fisher, 1955, p. 77).
But Neyman reproached Fisher with the same sin—mechanical statistical
inference. As a statistical behaviorist, Neyman (1957) looked at what Fisher
actually did in his own research in genetics, biology, and agriculture, rather
than at what he said one should do. He found Fisher using .01 as a conven-
tional level of significance, without giving any thought to the choice of a par-
ticular level dependent on the particular problem or the probability of an error
of the second kind; he accused Fisher of drawing mechanical conclusions,
depending on whether or not the result was significant. Neyman urged a
thoughtful choice of the level of significance, not using .01 for all problems
and contexts.
Both camps in the controversy accused the other party of mechanical,
thoughtless statistical inference; thus I conclude that here at least they
agreed—statistical inference should not be automatic.
These differences between what Fisher proposed as the logic of significance
testing and what Neyman and Pearson proposed as the logic of hypothesis
testing suffice for the purpose of this chapter. Both have developed further
tools for inductive inference, and so have others, resulting in a large toolbox
that contains maximum likelihood, fiducial probability, confidence interval ap-
proaches, point estimation, Bayesian statistics, sequential analysis, and ex-
ploratory data analysis, to mention only a few. But it is null hypothesis testing
and Neyman-Pearson hypothesis-testing theory that have transformed exper-
imental psychology and part of the social sciences.
The conflicting views presented earlier are those of the parents of the hybrid
logic. Not everyone can tolerate unresolved conflicts easily and engage in a
free market of competing ideas. Some long for the single truth or search for a
compromise that could at least suppress the conflicts. Kendall (1949) com-
mented on the desire for peace negotiations among statisticians:
If some people asserted that the earth rotated from east to west and others
that it rotated from west to east, there would always be a few well-
meaning citizens to suggest that perhaps there was something to be said
for both sides, and maybe it did a little of one and a little of the other;
or that the truth probably lay between the extremes and perhaps it did
not rotate at all. (p. 115)
The denial of the existing conflicts and the pretense that there is only one
statistical solution to inductive inference were carried to an extreme in psy-
chology and several neighboring sciences. This one solution was the hybrid
logic of scientific inference, the offspring of the shotgun marriage between
Fisher and Neyman and Pearson. The hybrid logic became institutionalized in
the curricula, textbooks, and editorial practices of experimental psychology.
Before World War II, psychologists drew their inferences about the validity of
hypotheses by many means—ranging from eyeballing to critical ratios. The
issue of statistical inference was not of primary importance. Note that this was
not because techniques were not yet available. On the contrary: already in
1710, John Arbuthnot proved the existence of God by a kind of significance
test, astronomers had used them during the 19th century for rejecting outliers
(Swijtink, 1987), and Fechner (1897) wrote a book on statistics including in-
ference techniques, to give just a few examples. Techniques of statistical in-
ference were known and sometimes used, but experimental method was not
yet dominated by and almost equated with statistical inference.
Through the work of the statisticians George W. Snedecor at Iowa State
College, Harold Hotelling at Columbia University, and Palmer Johnson at the
University of Minnesota, Fisher's ideas spread in the United States. Psychol-
ogists began to cleanse the Fisherian message of its agricultural odor and its
mathematical complexity and to write a new genre of textbooks featuring null
hypothesis testing. Guilford's Fundamental Statistics in Psychology and Edu-
cation, first published in 1942, was probably the most widely read textbook
in the 1940s and 1950s. In the preface, Guilford credited Fisher for the logic
of hypothesis testing taught in a chapter that was "quite new to this type of
text" (p. viii). The book does not mention Neyman, E. S. Pearson, or Bayes.
What Guilford teaches as the logic of hypothesis testing is Fisher's null hy-
pothesis testing, deeply colored by "Bayesian" thinking: Null hypothesis test-
ing is about the probability that the null hypothesis is true. "If the result comes
out one way, the hypothesis is probably correct, if it comes out another way,
the hypothesis is probably wrong" (p. 156). Null hypothesis testing is said to
give degrees of doubt such as "probable" or "very likely" a "more exact mean-
ing" (p. 156). Its logic is explained via surprising headings such as "Probability
of hypotheses estimated from the normal curve" (p. 160).
Guilford's logic is not consistently Fisherian, nor does it consistently use
"Bayesian" language of probabilities of hypotheses. It wavers back and forth
and beyond. Phrases such as "we obtained directly the probabilities that the
null hypothesis was plausible" and "the probability of extreme deviations from
chance" are used interchangeably for the same thing: the level of significance.
And when he proposes his own "somewhat new terms," his intuitive Bayesian
thinking becomes crystal clear. A p value of .015 for a hypothesis of zero
difference in the population "gives us the probability that the true difference
is a negative one, and the remainder of the area below the point, or .985, gives
us the probability that the true difference is positive. The odds are therefore
.985 to .015 that the true difference is positive" (p. 166). In Guilford's hands,
p values that specify probabilities p(D|H) of some data (or test statistic) D
given a hypothesis H turn miraculously into Bayesian posterior probabilities
p(H|D) of a hypothesis given data.
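The arithmetic that separates the two is short. The following sketch (hypothetical numbers, not Guilford's reasoning) shows that a level of significance of .015, which is p(D|H0), does not translate into a posterior probability of .015 for the null hypothesis; the posterior also depends on a prior and on the likelihood of the data under the alternative.

    # All numbers are hypothetical illustrations.
    p_D_given_H0 = 0.015   # level of significance: probability of the data if H0 is true
    p_D_given_H1 = 0.60    # probability of such data under the alternative hypothesis
    prior_H0 = 0.5         # prior probability of the null hypothesis

    posterior_H0 = (p_D_given_H0 * prior_H0) / (
        p_D_given_H0 * prior_H0 + p_D_given_H1 * (1 - prior_H0)
    )
    print(round(posterior_H0, 3))   # 0.024, not 0.015: p(H0 | D) is not p(D | H0)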
Guilford's confusion is not an exception. It marks the beginning of a genre
of statistical texts that vacillate between the researcher's "Bayesian" desire for
probabilities of hypotheses and what Fisher is willing to give them.
This first phase of teaching Fisher's logic soon ran into a serious compli-
cation. In the 1950s and 1960s, the theory of Neyman and E. S. Pearson also
became known. How were the textbook writers to cope with two logics of
scientific inference? How should the ideological differences and personal in-
sults be dealt with? Their solution to this conflict was striking. The textbook
writers did not side with Fisher. That is, they did not go on to present null
hypothesis testing as scientific inference and add a chapter on hypothesis test-
ing outside science, introducing the Neyman-Pearson theory as a logic for
quality control and related technological problems. Nor did they side with
Neyman and Pearson, teaching their logic as a consistent and improved version
of Fisher's and dispensing entirely with Fisherian null hypothesis testing.
Instead, textbook writers started to add Neyman-Pearsonian concepts on
top of the skeleton of Fisher's logic. But acting as if they feared Fisher's re-
venge, they did it without mentioning the names of Neyman and Pearson. A
hybrid logic of statistical inference was created in the 1950s and 1960s. Neither
Fisher nor Neyman and Pearson would have accepted this hybrid as a theory
of statistical inference. The hybrid logic is inconsistent from both perspectives
and burdened with conceptual confusion. Its two most striking features are (a)
it hides its hybrid origin and (b) it is presented as the monolithic logic of
scientific inference. Silence about its origin means that the respective parts of
the logic are not identified as part of two competing and partly inconsistent
theoretical frameworks. For instance, the idea of testing null hypotheses with-
out specifying alternative hypotheses is not identified as part of the Fisherian
framework, and the definition of the level of significance and the power of a
test as long-run frequencies of false and correct decisions, respectively, in re-
peated experiments is not identified as part of the Neyman-Pearson frame-
work. And, as a consequence, there is no mention of the fact that each of these
parts of the hybrid logic were rejected by the other party, and why, and what
the unresolved controversial issues are.
To capture the emotional tensions associated with the hybrid logic, I use a
Freudian analogy.3
Editors and textbook writers alike have institutionalized the level of signifi-
cance as a measure of the quality of research. As mentioned earlier, Melton,
after 12 years editing one of the most prestigious journals in psychology, said
in print that he was reluctant to publish research with significance levels below
.05 but above .01, whereas p < .01 made him confident that the results would
be repeatable and deserved publication (1962, pp. 553-554). In Nunnally's In-
troduction to Statistics for Psychology and Education (1975), the student is
taught similar values and informed that the standard has been raised: "Up until
20 years ago, it was not uncommon to see major research reports in which
most of the differences were significant only at the 0.05 level. Now, such re-
sults are not taken very seriously, and it is more customary today to see results
reported only if they reach the 0.01 or even lower probability levels" (p. 195).
Not accidentally, both Melton and Nunnally show the same weak understand-
ing of the logic of inference and share the same erroneous belief that the level
of significance specifies the probability that a result can be replicated (dis-
cussed later). The believers in the divinatory power of the level of significance
set the standards.
The researcher's Ego knows that these publish-or-perish standards exist in
the outside world and knows that the best way to adapt is to round up the
obtained p value after the experiment to the nearest conventional level, say to
round up the value p = .006 and publish p < .01. But the Superego has higher
moral standards: If you set alpha to 5% before the experiment, then you must
report the same finding (p = .006) as "significant at the 5% level." Mostly, the
Ego gets its way but is left with feelings of dishonesty and of guilt at having
violated the rules. Conscientious experimenters have experienced these feel-
ings, and statisticians have taken notice. The following comment was made in
a panel discussion among statisticians; Savage remarked on the statisticians'
reluctance to take responsibility for once having built up the Superego in the
minds of the experimenters:
I don't imagine that anyone in this room will admit ever having taught
that the way to do an experiment is first carefully to record the signifi-
cance level then do the experiment, see if the significance level is at-
tained, and if so, publish, and otherwise, perish. Yet, at one time we
must have taught that; at any rate it has been extremely well learned in
some quarters. And there is many a course outside of statistics depart-
ments today where the modern statistics of twenty or thirty years ago is
taught in that rigid way. People think that's what they're supposed to do
and are horribly embarrassed if they do something else, such as do the
experiment, see what significance level would have been attained, and
let other people know it. They do the better thing out of their good in-
stincts, but think they're sinning. (Barnard, Kiefer, LeCam, & Savage,
1968, p. 147)
Statistics has become more tolerant than its offspring, the hybrid logic.
The hybrid logic attempts to solve the conflict between its parents by denying
its parents. It is remarkable that textbooks typically teach hybrid logic without
mentioning Neyman, E. S. Pearson, and Fisher—except in the context of tech-
nical details, such as specific tables, that are incidental to the logic. In 25 out
of 30 textbooks I have examined, Neyman and E. S. Pearson do not appear to
exist. For instance, in the introduction to his Statistical Principles of Experi-
mental Design (1971), Winer credits Fisher with inspiring the "standard work-
ing equipment" (p. 3) in this field, but a few pages later he presents the Ney-
man-Pearson terminology of Type I error, Type II error, power, two precise
statistical hypotheses, cost-benefit considerations, and rejecting and accepting
hypotheses. Yet nowhere in the book do the names of Neyman and E. S. Pear-
son appear (except in a "thank you" note to Pearson for permission to repro-
duce tables), although quite a few other names can be found in the index. No
hint is given to the reader that there are different ways to think about the logic
of inference. Even in the exceptional case of Hays's textbook (1963), in which
all parents are mentioned by their names, the relationship of their ideas is
presented (in a single sentence) as one of cumulative progress, from Fisher to
Neyman and Pearson (p. 287).4 Both Winer's and Hays's are among the best
texts, without the confusions that abound in Guilford's, Nunnally's, and a mass
of other textbooks. Nevertheless, even in these texts the parents' different ways
of thinking about statistical inference and the controversial issues are not
pointed out.
4. In the third edition (1981), however, Hays's otherwise excellent text falls back to
common standards: J. Neyman and E. S. Pearson no longer appear in the book.
tations by declaring that everything is the same. The price for this is conceptual
confusion, false assertions, and an illusory belief in the omnipotence of the
level of significance. Nunnally is a pronounced but not an atypical case.
The institutionalization of the hybrid logic as the sine qua non of scientific
method is the environment that encourages mechanical hypothesis testing. The
Publication Manual of the American Psychological Association (APA, 1974),
for instance, called "rejecting the null hypothesis" a "basic" assumption
(p. 19) and presupposed the hybrid logic. The researcher was explicitly told to
make mechanical decisions: "Caution: Do not infer trends from data that fail
by a small margin to meet the usual levels of significance. Such results are
best interpreted as caused by chance and are best reported as such. Treat the
result section like an income tax return. Take what's coming to you, but no
more" (p. 19; this passage was deleted in the third edition in 1983). This pre-
scription sounds like a Neyman-Pearson accept-reject logic, by which it mat-
ters for a decision only on which side of the criterion the data fall, not how
far. Fisher would have rejected such mechanical behavior (e.g., Fisher, 1955,
1956). Nevertheless, the examples in the manual that tell the experimenter
how to report results use p values that were obviously determined after the
experiment and rounded up to the next conventional level, such as p < .05,
p < .01, and p < .001 (pp. 39, 43, 48, 49, 70, 96). Neyman and Pearson would
have rejected this practice: These p values are not the probability of Type I
errors—and determining levels of significance after the experiment prevents
determining power and sample size in advance. Fisher (e.g., 1955, 1956) would
have preferred that the exact level of significance, say p = .03, be reported,
not upper limits, such as p < .05, which look like probabilities of Type I errors
but aren't.
Replication Fallacy Suppose α is set at .05 and the null hypothesis is rejected
in favor of a given alternative hypothesis. What if we replicate the experiment?
In what percentage of exact replications will the result again turn out signifi-
cant? Although this question arises from the frequentist conception of repeated
experiments, the answer is unknown. The α we choose does not tell us, nor
does the exact level of significance.
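A small simulation makes the point (this is my own illustration, not the chapter's): the proportion of exact replications that come out significant is the power of the test under the true effect, which depends on that effect and the sample size, not on the chosen α.

    import random
    from statistics import NormalDist, mean

    def significant_replication(true_d=0.5, n=20, alpha=0.05):
        """One exact replication of a one-sided z-test with known unit variance."""
        sample = [random.gauss(true_d, 1.0) for _ in range(n)]
        z = mean(sample) * n ** 0.5
        return z > NormalDist().inv_cdf(1 - alpha)

    runs = [significant_replication() for _ in range(10_000)]
    print(sum(runs) / len(runs))   # roughly 0.72 with these numbers, not 0.95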
The replication fallacy is the belief that the level of significance provides
an answer to the question. Here are some examples: In an editorial in the
Journal of Experimental Psychology, the editor stated that he used the level of
significance reported in submitted papers as the measure of the "confidence
that the results of the experiment would be repeatable under the conditions
described" (Melton, 1962, p. 553). Many textbooks fail to mention that the
level of significance does not specify the probability of a replication, and some
explicitly teach the replication fallacy. For instance, "The question of statis-
tical significance refers primarily to the extent to which similar results would
be expected if an investigation were to be repeated" (Anastasi, 1958, p. 9). Or,
"If the statistical significance is at the 0.05 level . . . the investigator can be
confident with odds of 95 out of 100 that the observed difference will hold up
in future investigations" (Nunnally, 1975, p. 195). Oakes (1986, p. 80) asked
70 university lecturers, research fellows, and postgraduate students with at
least two years' research experience what a significant result (t = 2.7, df = 18,
p = .01) means. Sixty percent of these academic psychologists erroneously
believed that these figures mean that if the experiment is repeated many times,
a significant result would be obtained 99% of the time.
In Neyman and Pearson's theory the level of significance (alpha) is defined
as the relative frequency of rejections of H0 if H0 is true. In the minds of many,
1 − alpha erroneously turned into the relative frequency of rejections of H0,
that is, into the probability that significant results could be replicated.
The Bayesian Id's Wishful Thinking I mentioned earlier that Fisher both re-
jected the Bayesian cake and wanted to eat it, too: He spoke of the level of
significance as a measure of the degree of confidence in a hypothesis. In the
minds of many researchers and textbook writers, however, the level of signif-
icance virtually turned into a Bayesian posterior probability.
What I call the Bayesian Id's wishful thinking is the belief that the level of
significance, say .01, is the probability that the null hypothesis is correct, or
that 1 − .01 is the probability that the alternative hypothesis is correct. In
various linguistic versions, this wishful thinking was taught in textbooks from
the very beginning. Early examples are Anastasi (1958, p. 11), Ferguson (1959,
p. 133), Guilford (1942, pp. 156-166), and Lindquist (1940, p. 14). But the be-
lief has persisted over decades of teaching hybrid logic, for instance in Miller
and Buckhout (1973; statistical appendix by F. L. Brown, p. 523), Nunnally
(1975, pp. 194-196), and the examples collected by Bakan (1966) and Pollard
and Richardson (1987). Oakes (1986, p. 82) reported that 96% of academic
psychologists erroneously believed that the level of significance specifies the
probability that the hypothesis under question is true or false.
The Bayesian Id has its share. Textbook writers have sometimes explicitly
taught this misinterpretation but have more often invited it by not specifying
the difference between a Bayesian posterior probability, a Neyman-Pearsonian
probability of a Type I error, and a Fisherian exact level of significance.
Dogmatism
themselves has lasted for half a century. This is far too long. We need a knowl-
edgeable use of statistics, not a collective compulsive obsession. It seems to
have gone almost unnoticed that this dogmatism has created a strange double
standard. Many researchers believe that their participants must use Bayes's
rule to test hypotheses, but the researchers themselves use the hybrid logic to
test their hypotheses—and thus themselves ignore base rates. There is the il-
lusion that one kind of statistics normatively defines objectivity in scientific
inference and another kind, rationality in everyday inference. The price is a
kind of "split brain," where Neyman-Pearson logic is the Superego for exper-
imenters' hypothesis testing and Bayesian statistics is the Superego for partic-
ipants' hypothesis testing.
Here are a few first principles: Do not replace the dogmatism of the hybrid
logic of scientific inference with a new, although different one (e.g., Bayesian
dogmatism). Remember the obvious: The problem of inductive inference has
no universal mathematical solution. Use informed judgment and statistical
knowledge. Here are several specific suggestions:
1. Stop teaching hybrid logic as the sine qua non of scientific inference.
Teach researchers and students alternative theories of statistical in-
ference, give examples of typical applications, and teach the students
how to use these theories in a constructive (not mechanical) way.
Point out the confused logic of the hybrid, the emotional, behavioral,
and cognitive distortions associated with it, and insist on clarity (Co-
hen, 1990). This will lead to recognizing the second point.
2. Statistical inference (Fisherian, Neyman-Pearsonian, or Bayesian) is
rarely the most important part of data analysis. Teach researchers and
students to look at the data, not just at p values. Computer-aided
graphical methods of data display and exploratory data analysis are
means toward this end (Diaconis, 1985; Tukey, 1977). The calculation
of descriptive statistics such as effect sizes is a part of data analysis
that cannot be substituted by statistical inference (Rosnow & Rosen-
thal, 1989). A good theory predicts particular curves or effect sizes,
but not levels of significance.
3. Good data analysis is pointless without good data. The measurement
error should be controlled and minimized before and during the ex-
periment; instead one tends to control it after the experiment by in-
serting the error term in the F ratio. Teach researchers and students
that the important thing is to have a small real error in the data. With-
out that, a significant result at any level is, by itself, worthless—as
Gosset, who developed the t test in 1908, emphasized
(see Pearson, 1939). Minimizing the real error in measurements may
be achieved by an iterative method: First, obtain measurements and
look at the error variance, then try methods to minimize the error (e.g.,
stronger experimental control, investigating each participant carefully
Conclusions
I enjoy conference dinners. At such a dinner several years ago, I was crammed
in with four graduate students and four professors around a table laden with
Chinese food. The graduate students were eager to learn first-hand how to com-
plete a dissertation and begin a research career, and the professors were keen
to give advice. With authority, one colleague advised them: "Don't think big.
Just do four or five experiments, clip them together, and hand them in." The
graduate students nodded gratefully. They continued to nod when I added:
"Don't follow this advice unless you are mediocre or unimaginative. Try to think
in a deep, bold, and precise way. Take risks and be courageous." What a di-
lemma. How could these students follow these contradictory bits of advice?
Based on an analysis of articles in two major social psychology journals,
the Journal of Personality and Social Psychology and the Journal of Experi-
mental Social Psychology, Wallach and Wallach (1994, 1998) concluded that
the theoretical argument in almost half of the studies borders on tautology. If
an argument is a "near-tautology," there is no point in spending time and
money trying to experimentally confirm it. "Don't think big" seems to be a
prescription followed by many professional researchers, not merely conser-
vative advice for graduate students. Complaints about the lack of serious the-
ory in social psychology have been voiced before (e.g., Fiedler, 1991, 1996).
Atheoretical research is not specific to social psychology, however, although
some parts of psychology do better than others (Brandtstädter, 1987).
In this chapter, I address two questions: What are the surrogates for theory
in psychology? and What institutional forces perpetuate reliance on these sur-
rogates? This chapter is not intended to be exhaustive, only illustrative. The
examples I use are drawn from the best work in the areas discussed: the psy-
chology of reasoning, judgment, and decision making.
Surrogates
The problem is not that a majority of researchers would say that theory is
irrelevant; the problem is that almost anything passes as a theory. I identify
four such surrogates in what follows.
One-Word Explanations
The first species of theory surrogate is the one-word explanation. Such a word
is a noun, broad in its meaning and chosen to relate to the phenomenon. At
the same time, it specifies no underlying mechanism or theoretical structure.
The one-word explanation is a label with the virtue of a Rorschach inkblot: A
researcher can read into it whatever he or she wishes to see.
Examples of one-word explanations are representativeness, availability, and
anchoring and adjustment, which are treated as the cognitive heuristics people
use to make judgments and decisions. These terms supposedly explain "cog-
nitive illusions" such as base-rate neglect. These "explanations" figure prom-
inently in current textbooks in cognitive psychology, social psychology, and
decision making. It is understandable that when these three terms were first
proposed as cognitive processes in the early 1970s, they were only loosely
characterized (Tversky & Kahneman, 1974). Yet 30 years and many experi-
ments later, these three "heuristics" remain vague and undefined, unspecified
both with respect to the antecedent conditions that elicit (or suppress) them
and also to the cognitive processes that underlie them (Gigerenzer, 1996a). I
fear that in another 30 years we will still be stuck with plausible yet nebulous
proposals of the same type: that judgments of probability or frequency are
sometimes influenced by what is similar (representativeness), comes easily to
mind (availability), and comes first (anchoring).
The problem with these heuristics is that, post hoc, at least one of them can
be fitted to almost any experimental result. For example, base-rate neglect is
commonly attributed to representativeness. But the opposite result, over-
weighting of base rates ("conservatism"), is just as easily "explained" by in-
voking anchoring (on the base rate) and adjustment. One-word explanations
derive their seductive power from the fact that almost every observation can
be called upon as an example.
Even better, one-word explanations can be so parsimonious that a single
one can explain both a phenomenon and its opposite (Ayton & Fisher, 1999).
For instance, Laplace (1814/1951) had described a phenomenon that is known
today as the gambler's fallacy: when in a random sequence a run is observed
(e.g., a series of red on the roulette wheel), players tend to believe that the
opposite result (black) will come up next. Tversky and Kahneman (1974) pro-
posed that this intuition is due to "representativeness," because "the occur-
rence of black will result in a more representative sequence than the occur-
rence of an additional red" (p. 1125). Gilovich, Vallone, and Tversky (1985)
have described another phenomenon known as the belief in the "hot hand":
after a run of successful shots, a basketball player is expected to score on
the next attempt as well. Here the opposite intuition, that the streak will
continue rather than reverse, has likewise been attributed to representa-
tiveness.
Redescription
Recall Molière's parody of the Aristotelian doctrine of substantial forms: Why
does opium make you sleepy? Because of its dormitive properties. Redescrip-
tion has a long tradition in trait psychology, for instance, when an aggressive
behavior is attributed to an aggressive disposition or intelligent behavior to
high intelligence. But redescription in psychology is not limited to attributing
behaviors to traits and other essences.
Muddy Dichotomies
Torn between being distressed over and content with the state of research on
information processing, Allen Newell (1973) entitled a commentary "You Can't
Play 20 Questions with Nature and Win." What distressed Newell was that
when behavior is explained in terms of dichotomies—nature versus nurture,
serial versus parallel, grammars versus associations, and so on—"clarity is
never achieved" and "matters simply become muddier and muddier as we go
down through time" (pp. 288-289). There is nothing wrong with making dis-
tinctions in terms of dichotomies per se; what concerned Newell were situa-
tions in which theoretical thinking gets stuck in binary oppositions beyond
which it never seems to move.
Let us consider a case in which false dichotomies have hindered precise
theorizing. Some arguments against evolutionary psychology are based on the
presumed dichotomy between biology and culture, or genes and environment
(Tooby & Cosmides, 1992). One such argument goes: Because cognition is
bound to culture, evolution must be irrelevant. But biology and culture are not
opposites. For instance, our ability to cooperate with conspecifics to whom we
are genetically unrelated—which distinguishes us humans from most other
species—is based on mechanisms of both biological and cultural origin. Sim-
ply to ask about the relative importance of each in terms of explained variance,
such as that 80% of intelligence is genetically inherited, is, however, not al-
ways an interesting question. The real theoretical question concerns the mech-
anism that combines what is termed the "biological" and the "cultural." For
biologists, the nature/nurture or biological/cultural dichotomy is a nonstarter:
Genes are influenced by their environment, which can include other genes,
and culture can change gene pools (coevolution).
Cognitive psychology is also muddied by vague dichotomies. For instance,
a popular opposition is between associations and rules. Sloman (1996) has
Data Fitting
There are other surrogates for theories in psychology, one of which is the use
of powerful mathematical tools for data fitting in the absence of theoretical
underpinnings. Psychologists have historically embraced such new tools,
which they then propose as new theories. When factor analysis became a com-
mon tool for data processing in psychological research, humans were modeled
as a bundle of personality factors. When multidimensional scaling came along
in the 1960s and 1970s, human categorization and other mental processes were
proposed to be based on distances between points in multidimensional space.
More recently, the advent of the serial computer was followed by that of neural
networks as a model of cognitive function. There is nothing wrong with using
these mathematical tools per se. The important point with respect to surrogate
theories is whether the tool is used for modeling or for data fitting (this is itself
a false dichotomy, there being a continuum between these poles). Charles
Spearman originally designed factor analysis as a theory of intelligence, but
(in the form of principal component analysis) it ended up as a fitting tool for
all kinds of psychological phenomena. Likewise, Roger Shepard (e.g., 1962,
1974) interpreted the various Minkowski metrics that can be used in multi-
dimensional scaling as psychological theories of similarity, such as in color
perception, but multidimensional scaling ended up as a largely atheoretical
tool for fitting any similarity data, with the Euclidean metric as a conventional
routine. Similarly, neural networks can be used as constrained or structured
networks into which theoretical, domain-specific assumptions are built (e.g.,
Regier, 1996), but many applications of neural networks to modeling psycho-
logical phenomena seem to amount to data fitting with numerous free param-
eters. Neural networks with hidden units and other free parameters can be too
powerful to be meaningful—in the sense that they can fit different types of
results that were generated with different process models (Geman, Bienen-
stock, & Doursat, 1992; Massaro, 1988).
In general, mathematical structures can be used to test theories (with pa-
rameters determined by theoretical considerations, e.g., the metric in multi-
dimensional scaling) or as a fitting tool (with parameters chosen post hoc so
as to maximize the fit). Fitting per se is not objectionable. The danger is that
enthusiasm for a mathematical tool can lead one to get stuck in data fitting
and to use a good fit as a surrogate for a theory.
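A toy example of the danger (my own sketch, using numpy and purely random data, not results from the literature): as the number of free parameters grows, the fit to the data improves even when there is nothing to explain.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 12)
    y = rng.normal(size=x.size)          # pure noise: no structure to model

    def r_squared(degree):
        coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
        fitted = np.polyval(coeffs, x)
        return 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

    for degree in (1, 2, 4, 6, 8):
        print(degree, round(float(r_squared(degree)), 2))   # fit rises with parameters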
There is one obvious reason why surrogates for theories come to mind more
quickly than real theories: demonstrating how a one-word explanation, a re-
description, a dichotomy, or an exercise in data fitting "explains" a phenom-
enon demands less mental strain than developing a bold and precise theory.
It takes imagination to conceive the idea that heat is caused by motion, but
only little mental effort to propose that heat is caused by specific particles that
have the propensity to be hot. In what follows, I identify two institutions that
may maintain (rather than cause) the abundant use of surrogates for theories
in some areas of psychology.
surrogates for thinking big are sufficient. The result has been called "null sci-
ence" (Bower, 1997). It reminds me of a mechanical maxim regarding the crit-
ical ratio (the difference between the means divided by the standard deviation
of the differences), the predecessor of the significance level: "A critical ratio
of three, or no Ph.D."
Disciplinary Isolation
Over the course of the 20th century, academic psychology has become more
and more compartmentalized into subdisciplines such as social psychology,
cognitive psychology, developmental psychology, and so on. Each subdisci-
pline has its own journals, reviewers, and grant programs, and one can have
a career in one of them without ever reading the journals of neighboring
subdisciplines. In addition, job searches are often organized according to
these categories. This territorial organization of psychology discourages re-
searchers from engaging with psychological knowledge and colleagues out-
side of their territory, not to mention with other disciplines. As Jerry Fodor
(1995) put it:
Unfortunately, cognitive psychology as people are trained to practice it,
at least in this country, has been traditionally committed to methodolog-
ical empiricism and to disciplinary isolationism, in which it was, for
example, perfectly possible to study language without knowing anything
about linguistics. (pp. 85-86)
This isolationism is by no means restricted to the study of language. For
instance, the experimental study of logical thinking in arguably the most re-
searched problem, the Wason selection task, has been carried out with little
reference to modern logic, and the study of statistical reasoning has been con-
ducted with little attention to the relevant issues in statistics (see Gigerenzer,
1994a; Oaksford & Chater, 1994).
Intellectual inbreeding can block the flow of positive metaphors from one
discipline to another. Neither disciplines nor subdisciplines are natural cate-
gories. Interdisciplinary exchange has fueled the development of some of the
most influential new metaphors and theories in the sciences, such as when
Ludwig Boltzmann and James Clerk Maxwell developed statistical mechanics
by borrowing from sociology. Boltzmann and Maxwell modeled the behavior
of gas molecules on the behavior of humans as Adolphe Quetelet had portrayed
it: erratic and unpredictable at the individual level but exhibiting orderly sta-
tistical laws at the level of collectives (Gigerenzer et al., 1989, ch. 2). Territorial
science, in contrast, blocks the flow of metaphors and the development of new
theories. Distrust and disinterest in anything outside one's subdiscipline sup-
ports surrogates for theory.
In this chapter, I have specified four surrogates for theory and two possible
institutional reasons why some of these surrogates flourish like weeds. These
two reasons certainly cannot explain the whole story.
Several years ago, I spent a day and a night in a library reading through
issues of the Journal of Experimental Psychology from the 1920s and 1930s.
This was professionally a most depressing experience, but not because these
articles were methodologically mediocre. On the contrary, many of them make
today's research pale in comparison with their diversity of methods and sta-
tistics, their detailed reporting of single-case data rather than mere averages,
and their careful selection of trained participants. And many topics—such as
the influence of the gender of the experimenter on the performance of the
participants—were of interest then as now. What depressed me was that almost
all of this work is forgotten; it does not seem to have left a trace in the collec-
tive memory of our profession. It struck me that most of it involved collecting
data without substantive theory. Data without theory are like babies without
parents: Their life expectancy is low.
REFERENCES
Aaronson, D., Grupsmith, E., & Aaronson, M. (1976). The impact of computers
on cognitive psychology. Behavioral Research Methods and Instrumenta-
tion, 8, 129-138.
Acree, M. C. (1978). Theories of statistical inference in psychological research:
A historicocritical study. Dissertation Abstracts International, 39 (10),
5037B. (University Microfilms No. H790 H7000.)
Adler, J. E. (1984). Abstraction is uncooperative. Journal for the Theory of So-
cial Behavior, 14, 165-181.
Adler, J. E. (1991). An optimist's pessimism: Conversation and conjunction. In
E. Eells & T. Maruszewski (Eds.), Probability and rationality: Studies on
L. Jonathan Cohen's philosophy of science (pp. 251-282). Amsterdam:
North-Holland Rodopi.
Ajzen, I. (1977). Intuitive theories of events and the effects of base-rate infor-
mation on predictions. Journal of Personality and Social Psychology, 35,
303-314.
Ajzen, I., & Fishbein, M. (1975). A Bayesian analysis of attribution processes.
Psychological Bulletin, 82, 261-277.
Alba, J. W., & Marmorstein, H. (1987). The effects of frequency knowledge on
consumer decision making. Journal of Consumer Research, 14, 14-26.
Alibi des Schornsteinfegers: Unwahrscheinliche Wahrscheinlichkeitsrech-
nungen in einem Mordprozess [Alibi of a chimney sweep: Improbable
probability calculations in a murder trial]. Rheinischer Merkur, No. 39.
Allport, D. A. (1975). The state of cognitive psychology. Quarterly Journal of
Experimental Psychology, 27, 141-152.
Allwood, C. M., & Montgomery, H. (1987). Response selection strategies and
realism of confidence judgments. Organizational Behavior and Human De-
cision Processes, 39, 365-383.
American Psychological Association (1974). Publication Manual of the Amer-
ican Psychological Association (2nd ed.). Baltimore: Garamond/Pridemark
Press.
American Psychological Association (1983). Publication Manual of the Amer-
ican Psychological Association (3rd ed.). Menasha, WI: Banta.
Anastasi, A. (1958). Differential psychology (3rd ed.) New York: Macmillan.
sky and Dr. Brown. Journal of the Royal Statistical Society, Series A, 142,
171-172.
Barnard, G. A., Kiefer, J. C., LeCam, L. M., & Savage, L. J. (1968). Statistical
inference. In D. G. Watts (Ed.), The future of statistics (p. 147). New York:
Academic Press.
Baron-Cohen, S. (1995). Mindblindness. Cambridge, MA: MIT Press.
Barsalou, L. W., & Ross, B. H. (1986). The roles of automatic and strategic pro-
cessing in sensitivity to superordinate and property frequency. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 12, 116-134.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances.
Philosophical Transactions of the Royal Society of London, 53, 370-418.
Bazerman, M. H. (1990). Judgment in managerial decision making. New York:
Wiley.
Bazerman, M. H., & Neale, M. A. (1986). Heuristics in negotiation: Limitations
to effective dispute resolution. In H. R. Arkes & R. R. Hammond (Eds.),
Judgment and decision making: An interdisciplinary reader (pp. 311-321).
Cambridge, England: Cambridge University Press.
Becker, G. (1976). The economic approach to human behavior. Chicago: Uni-
versity of Chicago Press.
Berretty, P. M., Todd, P. M., & Martignon, L. (1999). Categorization by elimina-
tion: Using few cues to choose. In G. Gigerenzer, P. M. Todd, & the ABC
Group, Simple heuristics that make us smart (pp. 235-254). New York:
Oxford University Press.
Binet, A., & Simon, T. (1914). Mentally defective children. London: Edward
Arnold.
Birnbaum, M. H. (1983). Base rates in Bayesian inference: Signal detection
analysis of the cab problem. American Journal of Psychology, 96, 85-
94.
Bjorkman, M. (1984). Decision making, risk taking and psychological time:
Review of empirical findings and psychological theory. Scandinavian
Journal of Psychology, 25, 31-49.
Bjorkman, M. (1987). A note on cue probability learning: What conditioning
data reveal about cue contrast. Scandinavian Journal of Psychology, 28,
226-232.
Blackwell, R. J. (1983). Scientific discovery: The search for new categories.
New Ideas in Psychology, 1, 111-115.
Boesch, C., & Boesch, H. (1984). Mental map in wild chimpanzees: An analysis
of hammer transports for nut cracking. Primates, 25, 160-170.
Boole, G. (1958). An investigation of the laws of thought on which are founded
the mathematical theories of logic and probabilities. New York: Dover.
(Original work published 1854)
Borges, B., Goldstein, D. G., Ortmann, A., & Gigerenzer, G. (1999). Can igno-
rance beat the stock market? In G. Gigerenzer, P. M. Todd, & the ABC
Group, Simple heuristics that make us smart (pp. 59-72). New York: Ox-
ford University Press.
Borgida, E., & Brekke, N. (1981). The base rate fallacy in attribution and pre-
diction. In J. H. Harvey, W. J. Ickes, & R. F. Kidd (Eds.), New directions in
attribution research (Vol. 3, pp. 63-95). Hillsdale, NJ: Erlbaum.
Boring, E. G. (1942). Sensation and perception in the history of experimental
psychology. New York: Appleton-Century-Crofts.
the Enquete Committee of the 11th German Bundestag, 13/90. Bonn, Ger-
many: Bonner Universitäts-Buchdruckerei.
Devereux, G. (1967). From anxiety to method in the behavioral sciences. Paris:
Mouton.
de Waal, F. B. M., & Luttrell, L. M. (1988). Mechanisms of social reciprocity in
three primate species: Symmetrical relationship characteristics or cogni-
tion? Ethology and Sociobiology, 9, 101-118.
Diaconis, P. (1985). Theories of data analysis: From magical thinking through
classical statistics. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Ex-
ploring data tables, trends and shapes (pp. 1-36). New York: Wiley.
Dietz, K., Seydel, J., & Schwartlander, B. (1994). Back-projection of German
AIDS data using information on dates of tests. Statistics in Medicine, 13,
1991-2008.
DiFonzo, N. (1994). Piggybacked syllogisms for investor behavior: Probabilistic
mental modeling in rumor-based stock market trading. Ph.D. diss., Temple
University, Philadelphia.
Doherty, M. E., & Kurz, E. M. (1996). Social judgment theory. Thinking and
Reasoning, 2, 109-140.
Doll, L. S., & Kennedy, M. B. (1994). HIV counseling and testing: What is it
and how well does it work? In G. Schochetman & J. R. George (Eds.), AIDS
testing: A comprehensive guide to technical, medical, social, legal, and
management issues (pp. 302-319). New York: Springer.
Dugatkin, L. A. (1996). Interface between culturally based preferences and ge-
netic preferences: Female mate choice in Poecilia reticulata. Proceedings
of the National Academy of Sciences, 93, 2770-2773.
Duncker, K. (1945). On problem solving (Trans. T. L. S. Lees). Psychological
Monographs, 58 (5, Whole No. 270). (Original work published 1935.)
Earman, J. (1992). Bayes or bust? A critical examination of Bayesian confir-
mation theory. Cambridge, MA: MIT Press.
Eberle, J. F., Deinhardt, K. O., & Habermehl, M. A. (1988). Die Zuverlässigkeit
des HIV-Antikörpertests [The reliability of the HIV antibody test]. Deutsches
Ärzteblatt, 85, 1512-1514.
Eddy, D. M. (1982). Probabilistic reasoning in clinical medicine: Problems and
opportunities. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment
under uncertainty: Heuristics and biases (pp. 249-267). Cambridge, En-
gland: Cambridge University Press.
Edgington, E. S. (1974). A new tabulation of statistical procedures used in APA
journals. American Psychologist, 29, 25-26.
Edwards, W. (1954). The theory of decision making. Psychological Bulletin, 51,
380-417.
Edwards, W. (1962). Dynamic decision theory and probabilistic information
processing. Human Factors, 4, 59-73.
Edwards, W. (1966). Nonconservative information processing systems. Rep.
No. 5893-22-F. Ann Arbor: University of Michigan, Institute of Science
and Technology.
Edwards, W. (1968). Conservatism in human information processing. In B.
Kleinmuntz (Ed.), Formal representation of human judgment (pp. 17-52).
New York: Wiley.
Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference
for psychological research. Psychological Review, 70, 193-242.
Edwards, W., & von Winterfeldt, D. (1986). On cognitive illusions and their
Gallistel, C. R., & Gelman, R. (1992). Preverbal and verbal counting and com-
putation. Cognition, 44, 43-74.
Garcia, J., & Koelling, R. A. (1966). The relation of cue to consequence in avoid-
ance learning. Psychonomic Science, 4, 123-124.
Garcia y Robertson, R., & Garcia, J. (1985). X-rays and learned taste aversions:
Historical and psychological ramifications. In T. G. Burish, S. M. Levy, &
B. E. Meyerowitz (Eds.), Cancer, nutrition and eating behavior: A biobe-
havioral perspective (pp. 11-41). Hillsdale, NJ: Erlbaum.
Gardner, H. (1988). Creative lives and creative works: A synthetic scientific
approach. In R. J. Sternberg (Ed.), The nature of creativity (pp. 298-321).
Cambridge, England: Cambridge University Press.
Gavanski, I., & Hui, C. (1992). Natural sample spaces and uncertain belief.
Journal of Personality and Social Psychology, 63 (5), 766-780.
Gavin, E. A. (1972). The causal issue in empirical psychology from Hume to
the present with emphasis upon the work of Michotte. Journal of the His-
tory of the Behavioral Sciences, 8, 302-320.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/
variance dilemma. Neural Computation, 4, 1-58.
George, J. R., & Schochetman, G. (1994). Detection of HIV infection using se-
rologic techniques. In G. Schochetman & J. R. George (Eds.), AIDS testing:
A comprehensive guide to technical, medical, social, legal, and manage-
ment issues (pp. 62-102). New York: Springer.
Gigerenzer, G. (1981). Messung und Modellbildung in der Psychologie [Mea-
surement and modeling in psychology]. Munich: Reinhardt.
Gigerenzer, G. (1984). External validity of laboratory experiments: The
frequency-validity relationship. American Journal of Psychology, 97 (2),
185-195.
Gigerenzer, G. (1987a). Probabilistic thinking and the fight against subjectivity.
In L. Krüger, G. Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revo-
lution: Vol. 2. Ideas in the sciences (pp. 11-33). Cambridge, MA: MIT Press.
Gigerenzer, G. (1987b). Survival of the fittest probabilist: Brunswik, Thurstone,
and the two disciplines of psychology. In L. Krüger, G. Gigerenzer, & M.
Morgan (Eds.), The probabilistic revolution: Vol. 2. Ideas in the sciences
(pp. 49-72). Cambridge, MA: MIT Press.
Gigerenzer, G. (1990). Strong AI and the problem of "second-order" algorithms.
Behavioral and Brain Sciences, 13 (4), 663-664.
Gigerenzer, G. (1991a). From tools to theories: A heuristic of discovery in cog-
nitive psychology. Psychological Review, 98, 254-267.
Gigerenzer, G. (1991b). How to make cognitive illusions disappear: Beyond
"heuristics and biases." In W. Stroebe & M. Hewstone (Eds.), European
Review of Social Psychology, 2, 83-115.
Gigerenzer, G. (1991c). Does the environment have the same structure as Bayes'
theorem? Behavioral and Brain Sciences, 14, 495.
Gigerenzer, G. (1991d). On cognitive illusions and rationality. In E. Eells & T.
Maruszewski (Eds.), Reasoning and rationality: Essays in honour of L. J.
Cohen (pp. 225-249). Amsterdam: Rodopi.
Gigerenzer, G. (1993a). The superego, the ego, and the id in statistical reason-
ing. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the
behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ:
Erlbaum.
Gould, S. J., & Vrba, E. S. (1982). Exaptation—a missing term in the science of
form. Paleobiology, 8, 4-15.
Gregory, R. L. (1974). Concepts and mechanisms of perception. New York:
Scribner.
Grether, D. M. (1980). Bayes rule as a descriptive model: The representative-
ness heuristic. Quarterly Journal of Economics, 95, 537-557.
Grice, H. P. (1975). Logic and conversation. In P. Cole & J. L. Morgan (Eds.),
Syntax and semantics: Vol. 3. Speech acts (pp. 41-58). New York: Aca-
demic Press.
Griffin, D., & Tversky, A. (1992). The weighing of evidence and the determi-
nants of confidence. Cognitive Psychology, 24, 411-435.
Griggs, R. A., & Cox, J. R. (1982). The elusive thematic-materials effect in Wa-
son's selection task. British Journal of Psychology, 73, 407-420.
Groner, M., Groner, R., & Bischof, W. F. (1983). Approaches to heuristics: A
historical review. In R. Groner, M. Groner, & W. F. Bischof (Eds.), Methods
of heuristics (pp. 1-18). Hillsdale, NJ: Erlbaum.
Gruber, H. E. (1977). The fortunes of a basic Darwinian idea: Chance. In R. W.
Rieber & K. Salzinger (Eds.), The roots of American psychology: Historical
influences and implications for the future (pp. 233-245). New York: New
York Academy of Sciences.
Gruber, H. E. (1981). Darwin on man: A psychological study of scientific cre-
ativity (2nd ed.). Chicago: University of Chicago Press.
Gruber, H. E., & Voneche, J. J. (Eds.). (1977). The essential Piaget. New York:
Basic Books.
Guilford, J. P. (1942). Fundamental statistics in psychology and education.
New York: McGraw-Hill.
Guilford, J. P. (1954). Psychometric methods (2nd ed.). New York: McGraw-Hill.
Guilford, J. P., & Hoepfner, R. (1971). The analysis of intelligence. New York:
McGraw-Hill.
Guttman, L. (1977). What is not what in statistics. Statistician, 26, 81-107.
Guttman, L. (1985). The illogic of statistical inference for cumulative science.
Applied Stochastic Models and Data Analysis, 1, 3-10.
Hacking, I. (1965). Logic of statistical inference. Cambridge, England: Cam-
bridge University Press.
Hacking, I. (1975). The emergence of probability. Cambridge, England: Cam-
bridge University Press.
Hacking, I. (1983). Representing and intervening. Cambridge, England: Cam-
bridge University Press.
Hackmann, W. D. (1979). The relationship between concept and instrument
design in eighteenth-century experimental science. Annals of Science, 36,
205-224.
Hager, W., & Westermann, R. (1983). Zur Wahl und Prüfung statistischer Hy-
pothesen in psychologischen Untersuchungen [On the choice and testing of
statistical hypotheses in psychological studies]. Zeitschrift für experimen-
telle und angewandte Psychologie, 30, 67-94.
Hamilton, W. D. (1964). The genetic evolution of social behavior: Parts 1 and
2. Journal of Theoretical Biology, 7, 1-52.
Hammerton, M. (1973). A case of radical probability estimation. Journal of
Experimental Psychology, 101, 252-254.
Hammond, K. R. (1966). The psychology of Egon Brunswik. New York: Holt,
Rinehart & Winston.
Hertwig, R., Hoffrage, U., & Martignon, L. (1999). Quick estimation: Letting the
environment do the work. In G. Gigerenzer, P. M. Todd, & the ABC Group,
Simple heuristics that make us smart (pp. 209-234). New York: Oxford
University Press.
Hilgard, E. R. (1955). Discussion of probabilistic functionalism. Psychological
Review, 62, 226-228.
Hilton, D. J. (1995). The social context of reasoning: Conversational inference
and rational judgment. Psychological Bulletin, 118, 248-271.
Hintzman, D. L., & Block, R. A. (1972). Repetition and memory: Evidence for
multiple trace hypothesis. Journal of Experimental Psychology, 88, 297-
306.
Hintzman, D. L., Nozawa, G., & Irmscher, M. (1982). Frequency as a nonpro-
positional attribute of memory. Journal of Verbal Learning and Verbal Be-
havior, 21, 127-141.
Hirschfeld, L. A., & Gelman, S. A. (Eds.). (1994a). Mapping the mind: Domain
specificity in cognition and culture. Cambridge, England: Cambridge Uni-
versity Press.
Hirschfeld, L. A., & Gelman, S. A. (1994b). Toward a topography of mind: An
introduction to domain specificity. In L. A. Hirschfeld & S. A. Gelman
(Eds.), Mapping the mind: Domain specificity in cognition and culture
(pp. 3-35). Cambridge, England: Cambridge University Press.
Hoffman-Valentin, F. (1991). AIDS: Gefahren, Schutz, Vorsorge, Behandlungs-
möglichkeiten [AIDS: Dangers, protection, prevention, and treatment options]. Landsberg, Germany: Ecomed.
Hoffrage, U. (1994). Zur Angemessenheit subjektiver Sicherheits-Urteile: Eine
Exploration der Theorie der probabilistischen mentalen Modelle [On the
validity of confidence judgments: A study of the theory of probabilistic
mental models]. Ph.D. diss., University of Salzburg.
Hoffrage, U., & Gigerenzer, G. (1996). The impact of information representation
on Bayesian reasoning. In G. Cottrell (Ed.), Proceedings of the Eighteenth
Annual Conference of the Cognitive Science Society (pp. 126-130). Mah-
wah, NJ: Erlbaum.
Hoffrage, U., & Gigerenzer, G. (1998). Using natural frequencies to improve
diagnostic inferences. Academic Medicine, 73, 538-540.
Hoffrage, U., & Hertwig, R. (1999). Hindsight bias: A price worth paying for
fast and frugal memory. In G. Gigerenzer, P. M. Todd, & the ABC Group.
Simple heuristics that make us smart (pp. 191-208). New York: Oxford
University Press.
Hofstätter, P. R. (1939). Über die Schätzung von Gruppeneigenschaften [On the
estimation of group characteristics]. Zeitschrift für Psychologie, 145, 1-44.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction:
Processes of inference, learning and discovery. Cambridge, MA: MIT Press.
Holton, G. (1988). Thematic origins of scientific thought (2nd ed.). Cambridge,
MA: Harvard University Press.
Howell, W. C., & Burnett, S. (1978). Uncertainty measurement: A cognitive
taxonomy. Organizational Behavior and Human Performance, 22, 45-68.
Huber, O. (1989). Information-processing operators in decision making. In H.
Montgomery & O. Svenson (Eds.), Process and structure in human deci-
sion making (pp. 3-21). New York: Wiley.
Hull, C. L. (1943). The uniformity point of view. Psychological Review, 50,
203-216.
Nisbett, R. E., & Wilson, T. D. (1977). Telling more than we can know: Verbal
reports on mental processes. Psychological Review, 84, 231-259.
Nunnally, J. C. (1975). Introduction to statistics for psychology and education.
New York: McGraw-Hill.
Oakes, M. (1986). Statistical inference: A commentary for the social and the
behavioral sciences. Chichester, England: Wiley.
Oaksford, M., & Chater, N. (1994). A rational analysis of the selection task as
optimal data selection. Psychological Review, 101, 608-631.
Over, D. E., & Manktelow, K. I. (1993). Rationality, utility and deontic reason-
ing. In K. I. Manktelow & D. E. Over (Eds.), Rationality: Psychological and
philosophical perspectives (pp. 231-259). London: Routledge.
Packer, C. (1977). Reciprocal altruism in Papio anubis. Nature, 265, 441-443.
Parducci, A. (1965). Category judgment: A range-frequency model. Psycholog-
ical Review, 72, 407-418.
Paulos, J. A. (1988). Innumeracy: Mathematical illiteracy and its consequences.
New York: Vintage Books.
Payne, J. W., Bettman, J. R., & Johnson, E. J. (1993). The adaptive decision
maker. New York: Cambridge University Press.
Pearson, E. S. (1939). "Student" as statistician. Biometrika, 30, 210-250.
Pearson, E. S. (1962). Some thoughts on statistical inference. Annals of Math-
ematical Statistics, 33, 394-403.
Peichl-Hoffman, G. (1991). Spezifitätsprobleme bei der Testung auf Anti-HIV1
bzw. Anti-HIV2 in der Routineuntersuchung von Blutspendern [Specificity
problems in testing for anti-HIV1 and anti-HIV2 in routine screening of
blood donors]. Klinisches Labor, 37, 320-328.
Peirce, C. S., & Jastrow, J. (1884). On small differences of sensation. Memoirs
of the National Academy of Sciences, 3, 75-83.
Peterson, C. R., DuCharme, W. M., & Edwards, W. (1968). Sampling distribu-
tions and probability revision. Journal of Experimental Psychology, 76,
236-243.
Phelps, R. H., & Shanteau, J. (1978). Livestock judges: How much information
can an expert use? Organizational Behavior and Human Performance, 21,
209-219.
Phillips, L. D., & Edwards, W. (1966). Conservatism in a simple probability
inference task. Journal of Experimental Psychology, 72, 346-354.
Piaget, J. (1930). The child's conception of causality. London: Kegan Paul,
Trench, & Trubner.
Piaget, J., & Inhelder, B. (1975). The origin of the idea of chance in children.
New York: Norton. (Original work published 1951)
Piattelli-Palmarini, M. (1991, March/April). Probability blindness: Neither ra-
tional nor capricious. Bostonia, 28-35.
Piattelli-Palmarini, M. (1994). Inevitable illusions: How mistakes of reason rule
our minds. New York: Wiley.
Pinker, S. (1994). The language instinct. London: Penguin Press.
Platt, R., & Griggs, R. (1993). Darwinian algorithms and the Wason selection
task: A factorial analysis of social contract selection task problems. Cog-
nition, 48, 163-192.
Politzer, G., & Nguyen-Xuan, A. (1992). Reasoning about promises and warn-
ings: Darwinian algorithms, mental models, relevance judgments or prag-
matic schemas? Quarterly Journal of Experimental Psychology, 44A, 402-
421.
& M. Reder (Eds.), Rational choice: The contrast between economics and
psychology (pp. 25-40). Chicago: University of Chicago Press.
Simon, H. A. (1987). Bounded rationality. In J. Eatwell, M. Milgate, & P. New-
man (Eds.), The New Palgrave: A dictionary of economics (pp. 266-268).
London: Macmillan.
Simon, H. A. (1990). Invariants of human behavior. Annual Review of Psy-
chology, 41, 1-19.
Simon, H. A. (1991). Models of my life. New York: Basic Books.
Simon, H. A. (1992a). What is an "explanation" of behavior? Psychological
Science, 3, 150-161.
Simon, H. A. (1992b). Economics, bounded rationality, and the cognitive rev-
olution. Aldershot Hants, England: Elgar.
Simon, H. A., & Kulkarni, D. (1988). The processes of scientific discovery: The
strategy of experimentation. Cognitive Science, 12, 139-175.
Simon, H. A., & Newell, A. (1986). Information processing language V on the
IBM 650. Annals of the History of Computing, 8, 47-49.
Sivyer, M., & Finlay, D. (1982). Perceived duration of auditory sequences. Jour-
nal of General Psychology, 107, 209-217.
Skinner, B. F. (1972). Cumulative record. New York: Appleton-Century-Crofts.
Skinner, B. F. (1984). A matter of consequences. New York: New York Univer-
sity Press.
Sloman, S. A. (1996). The empirical case for two systems of reasoning. Psy-
chological Bulletin, 119, 3-22.
Slovic, P., Fischhoff, B., & Lichtenstein, S. (1976). Cognitive processes and
societal risk taking. In J. S. Carroll & J. W. Payne (Eds.), Cognition and
social behavior (pp. 165-184). Hillsdale, NJ: Erlbaum.
Smith, E. E., & Osherson, D. N. (1989). Similarity and decision making. In S.
Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning
(pp. 60-75). Cambridge, England: Cambridge University Press.
Smith, L. D. (1986). Behaviorism and logical positivism. Stanford, CA: Stanford
University Press.
Snell, J. J. S., Supran, E. M., & Tamashiro, H. (1992). WHO international qual-
ity assessment scheme for HIV antibody testing: Results from the second
distribution of sera. Bulletin of the World Health Organization, 70, 605-
613.
Sniezek, J. A., & Buckley, T. (1993). Becoming more or less uncertain. In N. J.
Castellan (Ed.), Individual and group decision making (pp. 87-108). Hills-
dale, NJ: Erlbaum.
Sperber, D. (1994). The modularity of thought and the epidemiology of rep-
resentations. In L. A. Hirschfeld & S. A. Gelman (Eds.), Mapping the mind:
Domain specificity in cognition and culture (pp. 39-67). Cambridge, En-
gland: Cambridge University Press.
Sperber, D., Cara, F., & Girotto, V. (1995). Relevance theory explains the selec-
tion task. Cognition, 57, 31-95.
Sperber, D., & Wilson, D. (1986). Relevance: Communication and cognition.
Oxford, England: Blackwell.
Spettell, C. M., & Liebert, R. M. (1986). Training for safety in automated person-
machine systems. American Psychologist, 41, 545-550.
Spielberg, F., Kabeya, C. M., Ryder, R. W., Kifuani, N. K., Harris, J., Bender,
T. R., Heyward, W. L., & Quinn, T. C. (1989). Field testing and comparative
Thaler, R. H. (1991). Quasi rational economics. New York: Russell Sage Foun-
dation.
Thines, G., Costall, A., & Butterworth, G. (Eds.). (1991). Michotte's experimen-
tal phenomenology of perception. Hillsdale, NJ: Erlbaum.
Thomas, D. H. (1978). The awful truth about statistics in archaeology. Ameri-
can Antiquity, 43, 231-244.
Thorndike, R. L. (1954). The psychological value system of psychologists.
American Psychologist, 9, 787-789.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review,
34, 273-286.
Titchener, E. B. (1896). An outline of psychology. New York: Macmillan.
Todd, P. M. (2000a). Fast and frugal heuristics for environmentally bounded
minds. In G. Gigerenzer & R. Selten (Eds.), Bounded rationality: The adap-
tive toolbox (pp. 51-70). Cambridge, MA: MIT Press.
Todd, P. M. (2000b). The ecological rationality of mechanisms evolved to make
up minds. American Behavioral Scientist, 43, 940-956.
Tooby, J., & Cosmides, L. (1992). The psychological foundations of culture. In
J. Barkow, L. Cosmides, & J. Tooby (Eds.), The adapted mind: Evolutionary
psychology and the generation of culture (pp. 19-136). New York: Oxford
University Press.
Tooby, J., & DeVore, I. (1987). The reconstruction of hominid behavioral evo-
lution through strategic modeling. In W. G. Kinzey (Ed.), The evolution of
human behavior: Primate models (pp. 183-237). Albany, NY: State Uni-
versity of New York Press.
Toulmin, S., & Leary, D. E. (1985). The cult of empiricism in psychology, and
beyond. In S. Koch & D. E. Leary (Eds.), A century of psychology as science
(pp. 594-617). New York: McGraw-Hill.
Tu, X. T., Litvak, E., & Pagano, M. (1992). Issues in human immunodeficiency
virus (HIV) screening programs. American Journal of Epidemiology, 136,
244-255.
Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59, 433-
460.
Turing, A. M. (1969). Intelligent machinery. In B. Meltzer & D. Michie (Eds.),
Machine intelligence (Vol. 5, pp. 3-23). Edinburgh, Scotland: Edinburgh
University Press. (Original work published 1947)
Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327-352.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psy-
chological Bulletin, 76, 105-110.
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging fre-
quency and probability. Cognitive Psychology, 5, 207-232.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics
and biases. Science, 185, 1124-1131.
Tversky, A., & Kahneman, D. (1980). Causal schemata in judgments under un-
certainty. In M. Fishbein (Ed.), Progress in social psychology (Vol. 1,
pp. 49-72). Hillsdale, NJ: Erlbaum.
Tversky, A., & Kahneman, D. (1982a). Judgments of and by representativeness.
In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncer-
tainty: Heuristics and biases (pp. 84-98). Cambridge, England: Cambridge
University Press.
NAME INDEX
SUBJECT INDEX