EXPERTS IN UNCERTAINTY
Opinion and Subjective Probability in Science
Roger M. Cooke
To Pam, Tessa, and Dorian
Introduction
Broadly speaking, this book is about the speculations, guesses, and estimates of
people who are considered experts, in so far as these serve as "cognitive input" in
some decision process. The questions studied include how expert opinion is in fact
being used today, how an expert's uncertainty is or should be represented, how
people do or should reason with uncertainty, how the quality and usefulness of
expert opinion can be assessed, and how the views of several experts might be
combined. The subject matter of this book therefore overlaps a large number of
disciplines, ranging from philosophy through policy analysis up to some rather
technical mathematics.
Most important, we are interested in developing practical models with a
transparent mathematical foundation for using expert opinion in science. The
models presented have been operationalized and evaluated in practice in the course
of research sponsored by the Dutch government, by the European Space Agency,
by the European Community, and by several research laboratories, including The
Netherlands Organization for Applied Scientific Research, Delft Hydraulics
Laboratory, DSM Research, and Shell Research.
This book presupposes a "general science background," some background in
college level mathematics, a nodding acquaintance with probability theory that the
reader is willing to extend, and an interest in policy analysis, particularly risk
analysis. The reader must be willing to assimilate concepts outside his or her
specific field.
The material is organized into three parts. Part I, "Experts and Opinions,"
studies the various ways in which expert opinion is being used today. It is
important to have a good feel for the problems and promises of expert opinion
before embarking on mathematical modeling, and this is the purpose of Part I.
Readers not interested in using expert opinion, but simply wishing to see how the
scientific and engineering community is dealing with it, will benefit most from Part
I, perhaps supplemented with Chapters 8 and 10 from Part II, and Part III.
Part II, "Subjective Probability," focuses on the representation of uncertainty
for an individual rational subject. There are many discussions of subjective
probability, but none of them is oriented toward the use of expert subjective probabilities.
The present discussion is distinguished by an informal exposition of Savage's
theory of rational decision and the representation of preference in terms of expected
utility. New proofs of Savage's representation theorem suitable for undergraduate
students in science and engineering are included in a mathematical supplement to
1
Think Tanks and Oracles
To the Greek philosopher Plato, the title "Experts in Uncertainty" would have
been a contradiction. With his famous "divided line" he partitioned knowledge into
four categories. The lowest category was "eikasia" which is best translated as
"conjecture." After this comes "pistis" (belief), followed by "dianoia" (correct
reasoning from hypotheses, as in mathematics), and "episteme" (knowledge). A line
divides the lower two categories, belonging to the realm of appearances and
deception, from the upper two, for which rigorous intellectual training is required.
"Uncertainty," whatever it may be, certainly belongs beneath the line, whereas
"expert" denotes a result of rigorous intellectual training. What conceivable
purpose could be served by studying the uncertainties of experts?
The purpose is really twofold. First, people in general and decision makers in
particular do in fact tend to place great weight on the uncertain opinions of experts.
This is presently done in a rather unmethodological way. Given that this is taking
place, it is relevant to ask if there are "better" or "worse" ways of using expert
opinion. Second, and more importantly, there is a growing body of evidence that
expert opinion can, under certain circumstances, be a very useful source of data.
Indeed, expert opinion is cheap, plentiful, and virtually inexhaustible. However, the
proper use of this source may well require new techniques. Expert opinion is not
the same as expert knowledge, and the consequences of ignoring this distinction
may be grave.
This chapter encircles the subject of expert opinion with a wide compass.
Selected topics from the 1950s and 1960s are treated at some length. These have
receded far enough into the past to offer us the advantage of hindsight, but they
have not yet lost their relevance. The era of think tanks and expert oracles is
roughly the period between the Second World War and the Vietnam war. This
period produced two principal techniques for using expert opinion in science,
scenario analysis and the Delphi method. Before setting off in search of new
techniques we must review these. This is done below. A final section philosophizes
on the role of expert opinion in science.
BACKGROUND
The phenomenon of experts is not new; however, the notion that the musings,
brainstorms, guesses, and speculations of experts can be significant input in a
structured decision process is relatively recent. We may effectively date the
inception of this phenomenon with the establishment of the RAND Corporation
after World War II. A period of unbridled growth and almost unlimited faith in
expert opinion came to a close in the United States sometime in the early 1970s, as
suggested by Figure 1.1.
The period between World War II and the Vietnam War witnessed a rapid
growth in research and development. A few facts and figures serve to illustrate this
development. Whereas the U.S. federal budget between 1940 and 1963 grew by a
factor of 11, the federal outlays for R&D (research and development) in this period
grew by a factor of 200 (Kevles, 1978, p. 394). Research contracts to universities and
research laboratories soared. In 1964 MIT had a banner year and bagged
$47,000,000 in research contracts, the highest for any academic institution in that
year. The National Science Foundation reported that by the late 1960s there were
Figure 1.1 Percentage of the American public expressing "great confidence" in the leaders of
various institutions. (Mazur, 1981).
upwards of 11,000 independent "think tanks" advising the U.S. government. The
advice ranged from strategic planning, through the war on poverty, the role of birds
in warfare, diarrhea in horses, psychological differences between tattooed and
nontattooed sailors, up to and including a classified study of the vulnerabilities of
communists with respect to music (Toekomstonderzoek, 1974, pp. 6.6.1.02, 6.6.302).
As the war in Vietnam intensified, the honeymoon between science and
government in the United States came to an end. In 1966 the R&D growth started
to decelerate (Kevles, 1978, p. 411). Most of the President's science advisers
opposed the war. Many talked of resigning, but only Los Alamos veteran George
Kistiakowsky did. Steven Weinberg later resigned from the elite and secretive
Jason Division because of its Vietnam involvement. In 1969, following a debate by
the American Physical Society on the proposed Anti-Ballistic Missile system, 250
physicists staged a march on Washington protesting its deployment (Kevles, 1978,
p. 405). Scientists opposed to the war were blacklisted from advisory committees
(Science, vol. 171, March 5, 1971). In 1973 Nixon abolished the Science Advisory
Committee because of its opposition to the war in Vietnam and to the
Supersonic Transport project. Jerome Wiesner, president of MIT and member of
Kennedy's Science Advisory Committee, was even placed on Nixon's "enemies list"
and the Nixon administration considered cutting all research grants to MIT
(Kevles, 1978, p. 413).
In the halcyon years between World War II and the Vietnam War two methodologies
dictated the form in which structured expert opinion was conveyed to
decision makers, namely the Delphi method and scenario analysis. As both were
developed at the RAND Corporation, and as RAND (for R-and-D) represents the
think tank par excellence, it is appropriate to focus this chapter on the two pink (no
pun intended) buildings in Santa Monica, California, which are home to "Mother
RAND," as she is called by her employees.
The RAND Corporation originated in a joint project, called "Project RAND,"
of the U.S. Air Force and Douglas Aircraft in 1946. RAND was set up as a 300-man
department of Douglas. In its first year, RAND produced a report that predicted
that the first space satellite would be launched in the middle of 1957. The first
Russian Sputnik was launched on October 4, 1957.
In 1948 RAND split off from Douglas and became an independent corporation,
the first think tank. In 20 years its yearly budget grew to $30,000,000, with
over a thousand employees. It is reported that in some years as many as 70% of the
yearly U.S. crop of mathematics Ph.D.s applied for jobs at RAND (Toekomstonderzoek,
p. 6.6.4.06). Throughout the 1950s and into the 1960s RAND worked almost
exclusively for the Air Force, but has diversified since then.
The diversification was aimed at demilitarizing its public image, attracting new
clients, and deflecting a swelling wave of criticism. As one example of the latter, John
Gofman and Arthur Tamplin had revealed that a 1966 RAND study advised post-
nuclear-war leaders to forego all measures to help the elderly, the weak and the
mentally and physically handicapped, as these groups would only present a further
burden and impede the reconstruction of the country. The country would be better
off without them (Gofman and Tamplin, 1970). Gofman and Tamplin pointed out that RAND had
never done a study on care for the elderly, except for this one on caring for the
elderly after nuclear war.
RAND's research has been concentrated in four broad areas: methodology,
strategic and tactical planning, international relations, and new technology. Our
interest is primarily in the first. However, as method without matter is empty, it is
useful to consider an example of strategic planning at RAND.
HERMAN KAHN
We have chosen for this purpose the extraordinary researches into civil defense and
strategic planning, as related by RAND project leader Herman Kahn. During the
1950s, Kahn led several large RAND studies into the effects of thermonuclear war
and civil defense for the U.S. Air Force. In 1960 he was the top ranking authority
on strategic planning and civil defense. Kahn's research became available to the
general public in several books, of which the most important is On Thermonuclear
War published in 1960. Many of the concepts and themes of this book have entered
the folklore through the movie Dr. Strangelove. The tragic but distinguishable
postwar states, the doomsday machine, the flexible response, these and many
more come from On Thermonuclear War.
It is not clear to what extent the conclusions and recommendations in On
Thermonuclear War are representative of thinking at RAND. In any case, Kahn
had a falling out with the director over fallout shelters (Kahn wanted them, the
director did not), and he became the first RAND alumnus to start his own think
tank. In 1961 he founded the Hudson Institute, located at Croton-on-Hudson,
New York. People soon began speaking of Herman-on-Hudson. Herman Kahn died
in 1983, and The Futurist of October 1983 contains several reminiscences by friends
and colleagues.
On Thermonuclear War opens with a discussion of possible strategic postures.
The official posture in 1960 was called "finite deterrence." It involves the ability to
inflict unacceptable damage on an enemy after absorbing a surprise nuclear attack.
Finite deterrence can be upgraded by adopting various "counterforce" measures. A
counterforce measure is one designed to limit the damage inflicted by an enemy.
Building fallout shelters and removing attractive targets from population centers
are counterforce measures. So is an antiballistic missile system that destroys enemy
missiles in flight. So is a highly accurate attack weapon capable of knocking out
enemy missiles before they are launched. When enough counterforce measures are
adopted such that one is willing to absorb a retaliatory blow from the other side,
one possesses a "credible first strike capability." On Thermonuclear War is an
extended argument for abandoning finite deterrence and striving for a credible first
strike capability:
Since we wish to be able to limit Soviet behavior to deter them from acts which cannot
be met by limited means, we must have a credible first strike capability... The only time
in which the credible first strike capability would be destabilizing would be when we ...
Dead             Economic recuperation
2,000,000        1 year
5,000,000        2 years
10,000,000       5 years
20,000,000       10 years
40,000,000       20 years
80,000,000       50 years
160,000,000      100 years

Will the survivors envy the dead?
assumption means that the United States has not lost the war. The second
assumption entails that society starts to function immediately after the attack:
"debris has been cleared up, minimum communications restored, the most urgent
repairs made, credits and markets reestablished, a basic transportation system
provided, minimum utilities either set up or restored, the basic necessities of life
made available, and so on" (Kahn, 1960, p. 84).
All these assumptions are uncertain. We shall not go into the details of his
argument, but just give an impression how Kahn assesses and deals with the
uncertainties inherent in his assumptions. For example, he claims that the
destruction of the 53 largest metropolitan areas would not be an economic
catastrophe, but "may simply set the nation's productive capacity back a decade or
two.... This statement strikes most people as being very naive" (Kahn, 1960, p. 77).
After all, destroying only a small percentage of the cells of an organism can result in
the organism's death and hence the subsequent destruction of all cells. However,
the analogy with an organism "seems to be completely wrong.... The creating (or
re-creating) of a society is an art rather than a science; even though empirical and
analytic 'laws' have been worked out, we do not really know how it is done, but
almost everybody (Ph.D. or savage) can do it" (Kahn, 1960, p. 77).
The fifth optimistic assumption appears bold to many people. Kahn asks
"would not the shock of the catastrophe so derange people that every man's hand
would be against every other man's?" Again: "This seems entirely wrong....
Nations have taken equivalent shocks even without special preparations and
have survived with their prewar virtues intact" (Kahn, 1960, p. 89). The fact
that the destruction would happen quickly only strengthens his case: "While many
normal personalities would disintegrate under hardships spread over a period of
many years, the habits of a lifetime cannot be changed for most people in a few
days.... It is my belief that if the government has made at least moderate
preparations, so that most people whose lives have been saved will give some credit
to the government's foresight, then people probably will rally round..." (Kahn,
1960, pp. 89-90).
The following passage puts the foregoing in perspective:
In spite of the many uncertainties of our study, we do have a great deal of confidence in
some partial conclusions—such as, that a nation like the United States or the Soviet
Union could handle each of the problems of radioactivity, physical destruction, or
likely levels of casualties, if they occurred by themselves (Kahn, 1960, p. 91).
Kahn explains that in analyzing each problem, he and his RAND colleagues
assumed that this was the only problem. The RAND scientists did not consider the
eventuality that radiation, destruction, and death might happen together.
A review in Scientific American conjectured that the whole book was a staff
joke in poor taste (Newman, 1961). The book was not meant as a joke, and we must
try to understand how advice to initiate nuclear war could be given, and
presumably taken seriously, on the basis of such reasoning. Part of the explanation
may lie in the unfamiliarity of reasoning with uncertainty. The following recon-
struction of Kahn's reasoning suggests itself:
Kahn wants to establish the conclusion that we will probably restore our gross
national product (GNP) quickly after a small attack in which 53 metropolitan
areas are lost. It is sufficient, he thinks, if it is likely that we can handle the
problems of radiation, destruction, and death. Hence, he considers in isolation
the three propositions: "It is likely that we can handle radiation," "it is likely
that we can handle death," and "It is likely that we can handle destruction."
Having convinced himself of each of these, he concludes "It is likely that we
can handle radiation, destruction, and death." QED
Of course, this is probabilistically invalid; if proposition A is likely, and
proposition B is also likely, it does not follow that proposition "A and B" is likely.
However, Kahn seems to be reasoning with uncertainty in the same way that most
of us do, if we don't stop to think about it. That is, his normal pattern of reasoning
is adapted to "cope with uncertainty" by simply adding a number of qualifiers such
as probably, likely, seems, or maybe. There is no real attempt to say how uncertain
a given statement is, and no attempt to propagate uncertainty through a chain of
reasoning. The subtleties involved in probabilistic reasoning will become apparent
in Chapter 3.
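A worked illustration of the fallacy (the example and numbers are ours, not Kahn's): probability theory guarantees only the lower bound

$$P(A \cap B) \;\ge\; \max\{0,\ P(A) + P(B) - 1\},$$

so two premises that each hold with probability 0.7 allow their conjunction to have probability as low as 0.4, and with three such premises the conjunction may have probability as low as

$$\max\{0,\ 0.7 + 0.7 + 0.7 - 2\} = 0.1.$$

"Likely" premises therefore need not yield a likely conjunction, and the erosion worsens as the chain of reasoning lengthens.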
This gives a sufficient impression of Kahn's reasoning under uncertainty.
Before turning to methodology, we add one final detail to this picture of expert
advice. At the end of the book, Kahn considers a number of possible objections to
his proposals. One of these is
7. Objection: 'It is pointless to fight a war even if victory is achieved, since the country
will be so debilitated that any third- or fourth-rate power could take it over...'
Answer: Saving lives is in itself a reasonable objective, but the program is intended to
be much more than a reduction in the number of immediate casualties. If the U.S. does
win the war promptly, it will have had an infuriating experience that is likely to create a
problem quite the opposite of that envisaged by this objection. The country will be very
well sheltered, having lost all its soft spots. It will be in a 'mean temper,' having just
fought a war to save the world. It will be in no mood to be pushed around by other
countries. It will have a large strategic air force because a properly defended air force
cannot be thrown away in a single strike or two. Far from being weak or vulnerable, the
U.S. might be able to 'take over the world'—even though such a goal is utterly
inconsistent with our political institutions and values. The least the U.S. could expect,
and insist on, is that the rest of the world help, not hinder, her reconstruction effort and
cooperate in organizing the peace. (Kahn, 1960, p. 646)
SCENARIO ANALYSIS
Herman Kahn is regarded as the father of scenario analysis. This method underlies
the type of "systems analysis" represented in On Thermonuclear War, though it
became more explicit as a methodology in Kahn's later futurological research. The
most prominent example of the latter is the book he coauthored with Anthony
Wiener in 1967, The Year 2000. (The maker of the film Dr. Strangelove, Stanley
Kubrick, also made the film 2001.) In discussing scenario analysis directly after On
Thermonuclear War we do not mean to suggest that practitioners of this method are
implicitly responsible for the views expressed in that book. On the other hand, the
choice of methodology must be made in awareness of the consequences to which
such choices might lead.
In On Thermonuclear War Kahn speaks of a "technological breakthrough in
the art of doing Systems Analysis and Military Studies" (Kahn, 1960, p. 119).
Previously, analysts would have tried to optimize a single objective, or perhaps
attempt to weigh different objectives and assess probabilities of outcomes in order
to arrive at a best expected outcome. Kahn, who understood the basis of
mathematical decision theory quite well, describes these methodologies as
inadequate.
The reason is rooted in the foundations of statistical decision theory. The
assessment of probabilities is necessarily subjective. Moreover, programs and
proposals must eventually be decided on by committees, whose members will
generally have conflicting goals and differing assessments of probabilities. Within
classical decision theory there is no way of getting groups to arrive at a rational
consensus on subjective probabilities or on objectives. It is a fundamental feature of
statistical decision theory, which is usually forgotten in casual discussions, that the
concepts of subjective probability and utility are only meaningful for an individual.
In general, they cannot be meaningfully defined for a group, and it is impossible for
a committee to come to a decision on the basis of maximal expected utility.
It follows that if the analyst seeks to develop proposals that will be approved
by some committee, he should not attempt to maximize his (or someone else's)
expected utility, but should try to develop proposals on which a sufficient number
of committee members can agree.
Kahn acknowledges that this is little more than common political sense, but it
does have far-reaching consequences. It entails, for example, that the scientist doing
systems analysis should think, not as a scientist traditionally thinks, but rather as a
politician. Knowing that Kahn's strategic planning studies were contracted by the
Air Force, one might charge that he was merely telling the generals what they
wanted to hear. If we take his methodological statements seriously, this is exactly
what he was trying to do. In any event, RAND's "technological breakthrough"
entails that probability can play at best a subordinate role in the scenario-analytic
approach to policy analysis. Although our focus is on scenario analysis as practiced
in the 1950s and 1960s, this latter feature is characteristic of most scenario studies
up to the present. The probabilities of the scenarios are not explicitly taken into
account.
What exactly is a scenario, and what role can probability play in scenario
analysis? For answers to these questions we must turn to The Year 2000.
Scenarios are hypothetical sequences of events constructed for the purpose of focusing
attention on causal processes and decision-points. They answer two kinds of questions:
(1) Precisely how might some hypothetical situation come about, step by step? and (2)
What alternatives exist, for each actor, at each step, for preventing, diverting, or
facilitating the process? (Kahn and Wiener, 1967, p. 6)
The method as applied in projecting the year 2000 works basically as follows.
The analyst first identifies what he takes to be the set of basic long-term trends.
These trends are then extrapolated into the future, taking account of any
theoretical or empirical knowledge that might impinge on such extrapolations. The
result is termed the surprise-free scenario. The surprise-free scenario serves as a foil
for defining alternative futures or canonical variations. Roughly speaking, these are
generated by varying key parameters in the surprise-free scenario.
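A minimal sketch of this recipe in code (purely illustrative; the trend model, parameter names, and numbers below are our assumptions, not Kahn and Wiener's): extrapolate a baseline trend and then generate canonical variations by perturbing a key parameter.

```python
import numpy as np

def surprise_free(years, level0, growth):
    """Extrapolate a basic long-term trend; here, simple exponential growth."""
    return level0 * (1.0 + growth) ** years

def canonical_variations(years, level0, growth, deltas):
    """Alternative futures obtained by varying a key parameter of the baseline."""
    return {f"growth {growth + d:+.1%}": surprise_free(years, level0, growth + d)
            for d in deltas}

years = np.arange(0, 34)                    # e.g., 1967 through 2000
baseline = surprise_free(years, level0=100.0, growth=0.03)
variations = canonical_variations(years, 100.0, 0.03, deltas=[-0.01, 0.01])
# Nothing here assigns probabilities to the scenarios; that omission is precisely
# the feature of scenario analysis discussed in the text.
```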
What about the probability of the various scenarios? Kahn is very explicit
about this. In doing long-range projections the trouble is, he says, that no
particular scenario is much more likely than all the others:
The subjective curve of probabilities often seems flat.... In order to avoid the dilemma
of Buridan's ass, who starved midway between two bales of hay because he could not
decide which one he preferred, we must then make arbitrary choices among almost
equally interesting, important, or plausible possibilities. That is, if we are to explore any
predictions at all, we must to some extent 'make them up.' (Kahn and Wiener, 1967,
p. 8)
The surprise-free scenario is salient because of its relation to the basic long-
term trends, not because it is probable, or more probable than its alternatives. In
fact, the surprise-free scenario may be judged to have a very low probability.
One may well ask, what is the point of devoting an extensive analysis to one
scenario, plus a few canonical variations, if all the scenarios have a very low
inherent probability?
Consider a very simple mathematical analogy. Suppose we have an uncertain
quantity, which might take any integer value between zero and one million.
Suppose that no particular number is much more probable than any other. Would
it make sense to pick one particular number, say 465,985, affirm that it was not
more probable than a host of others, and then study extensively the properties of
465,985 plus a few "canonical variations"? Of course not. Is Kahn's method of
scenario analysis less absurd? Well, the surprise-free scenario is salient, not because
of its probability but because of its relation to basic trends. If we bear in mind that
saliency has nothing to do with prediction, then focusing on a few scenarios might
help study the basic trends. However, it is easy to be misled into believing that
scenario analysis yields predictions. Proof of the latter comes from Kahn's own
blurb on the book jacket: "Demonstrating the new techniques of the think tanks,
this book projects what our own world most probably will be like a generation
from now—and gives alternatives."
If the value of a quantity is uncertain, then probability theory presents us with
a variety of methods for representing this uncertainty and making predictions.
Ideally, we should state the probability distribution of the uncertain quantity.
However, we need not describe the distribution completely in order to make
meaningful predictions. In the above example we might say that the probability
was 90% that the quantity's value would be between 200,000 and 800,000. What
sort of prediction is this? Under certain circumstances, which we shall study
extensively in Part II, this entails the following: If we give such "90% subjective
confidence intervals" for a large number of uncertain quantities, then we expect
90% of the actual values to fall within their respective confidence intervals. We
cannot say that a given prediction is right or wrong, but if all the values were to fall
outside their respective confidence intervals, then one could rightly conclude that
the person giving the predictions was not a very good predictor. In the language of
succeeding chapters, he would be called poorly calibrated.
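A minimal sketch of such a check (the interval and the realizations below are invented for illustration): count how often the realized values fall inside the assessor's stated 90% intervals.

```python
def coverage(intervals, realizations):
    """Fraction of realized values that fall inside the stated intervals."""
    hits = sum(lo <= x <= hi for (lo, hi), x in zip(intervals, realizations))
    return hits / len(realizations)

# Hypothetical example: ten 90% intervals for ten uncertain quantities,
# together with the values that were eventually observed.
intervals = [(200_000, 800_000)] * 10
realizations = [465_985, 150_000, 700_000, 310_000, 820_000,
                550_000, 400_000, 250_000, 610_000, 330_000]
print(coverage(intervals, realizations))   # a well-calibrated assessor scores near 0.90
```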
In their book, Kahn and Wiener do give a number of probabilistic predictions.
Table 1.3 lists 25 technological innovations that, with characteristic clarity, are
described as "even money bets, give or take a factor of five" before the year 2000. If
Kahn and Wiener are "well-calibrated assessors" we should expect roughly one-
half of these predictions to come true by the year 2000.
THE DELPHI METHOD
The Delphi method was developed at the RAND Corporation in the early 1950s as
a spin-off of an Air Force-sponsored research project, "Project Delphi." The
original project was designed to anticipate an optimal targeting of U.S. industries
by a hypothetical Soviet strategic planner. Delphi was first brought before a wider
audience in a 1963 RAND study "Report on a Long-Range Forecasting Study," by
Olaf Helmer and T. J. Gordon (Rand Paper P-2982, later incorporated in Helmer,
1966). It is undoubtedly the best-known method of eliciting and synthesizing expert
opinion.
In the middle 1960s and early 1970s the Delphi method found a wide variety of
applications, and by 1974 the number of Delphi studies had exceeded 10,000
(Linstone and Turoff, 1975, p. 3). Although most applications are concerned with
technology forecasting, the method has also been applied to many types of policy
analysis.
Policy Delphis differ from technology forecasting Delphis with respect to
both purpose and method. In technology forecasting, the team conducting the
Delphi study seeks experts who are most knowledgeable on the issues in question,
The revised predictions are then processed in the same way as the first
responses, and arguments for outliers are summarized. This information is then
sent back to the respondents, and the whole process is iterated. A Delphi exercise
typically involves three or four rounds.
The responses on the final round generally show a smaller spread than the
responses on the first round, and this is taken to indicate that the experts have
reached a degree of consensus. The median values on the final round are taken as
the best predictions.
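A minimal sketch of the bookkeeping in such a round (the forecasts are invented; nothing here reproduces an actual Delphi exercise): compute the median and the interquartile band of the responses and flag the respondents who must supply arguments.

```python
import numpy as np

def delphi_round(estimates):
    """Summarize one Delphi round: median answer, interquartile band, and outliers."""
    q1, median, q3 = np.percentile(estimates, [25, 50, 75])
    outliers = [e for e in estimates if e < q1 or e > q3]   # asked to justify their answers
    return median, (q1, q3), outliers

# Hypothetical forecasts (year of some development) from eight respondents.
estimates = [1985, 1990, 1992, 1995, 1995, 2000, 2010, 2030]
median, band, outliers = delphi_round(estimates)
print(median, band, outliers)   # the final-round median is reported as the "best" prediction
```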
Questionnaire # 1
This is the first in a series of four questionnaires intended to demonstrate the use of the Delphi
Technique in obtaining reasoned opinions from a group of respondents.
Each of the following seven questions is concerned with developments in the United States within the
next few decades.
In addition to giving your answer to each question, you are also being asked to rank the questions
from 1 to 7. Here "1" means that in comparing your own ability to answer this question with what you
expect the ability of the other participants to be, you feel that you have the relatively best chance of
coming closer to the truth than most of the others, while a "7" means that you regard that chance as
relatively least.
Rank Question Answer*
1. In your opinion, in what year will the median family income (in
1967 dollars) reach twice its present amount?
2. In what year will the percentage of electric automobiles among all
automobiles in use reach 50 percent?
3. In what year will the percentage of households that are
equipped with computer consoles tied to a central computer and
data bank reach 50 percent?
4. By what year will the per-capita amount of personal cash transactions
(in 1967 dollars) be reduced to one-tenth of what it is now?
5. In what year will power generation by thermonuclear fusion become
commercially competitive with hydroelectric power?
6. By what year will it be possible by commercial carriers to get from
New York's Times Square to San Francisco's Union Square in half
the time that is now required to make that trip?
7. In what year will a man for the first time travel to the Moon, stay
for at least 1 month, and return to Earth?
*"Never" is also an acceptable answer.
Please also answer the following question, and give your name (this is for identification purposes during
the exercise only; no opinions will be attributed to a particular person).
Check one (regarding the three remaining questionnaires):
____ I would like to participate
____ I am willing but not anxious to participate
____ I would prefer not to participate
Name (block letters please):
Group      Present Conference Forecast Pretest            RAND 1963 LRF Study
size       1      2      3      4      5                  1      2      3      4      5
5          0.20   0.19   0.18   0.23   0.27               0.28   0.27   0.28   0.35   0.35
7          0.13   0.09   0.07   0.10   0.11               0.44   0.42   0.43   0.36   0.36
9          0.22   0.14   0.09   0.17   0.16               0.25   0.20   0.16   0.16   0.16
11         0.24   0.29   0.23   0.28   0.22               0.12   0.09   0.10   0.10   0.10
corporate planners in the late 1960s and early 1970s. By the middle 1970s
psychometricians, people trained in conducting controlled experiments with
humans, began taking a serious look at the Delphi methods and results.
Perhaps the most significant study in this regard is Sackman's Delphi Critique
(1975). Sackman approaches the Delphi exercises as psychometric experiments
(which, strictly speaking, they are not) and concludes that the Delphi method
violates several essential methodological rules of sound experimental science. For
one thing, he notes that the questionnaire items are often so vague that it would be
impossible to determine when, if ever, they occurred. Furthermore, the respondents
are not treated equally. People whose predictions fall inside the interquartile band
are "rewarded" with a reduced workload in returning the questionnaires, whereas
those whose predictions fall outside this band are "punished" and must produce
arguments. Moreover, in many Delphi exercises there seem to be a significant
number of dropouts—people who simply don't return their forms. Delphi exercises
do not publish the number of dropouts, they make no attempt to discover the
reason for dropping out, and do not assess the influence of this negative selection
on the results. This all raises the possibility that the Delphi convergence may owe
more to boredom than to consensus. Finally, Sackman argues that experts and
nonexperts generally produce comparable results in Delphi exercises.
Of course, the proof of the pudding is in the eating. Methodological critiques
like that of Sackman would be of limited impact if the predictions from Delphi
exercises were good. This latter question has been examined, so far as possible in a
laboratory setting, in a number of psychometric studies. Delphi is compared with a
"nominal group technique" (Delbecq, Van de Ven, and Gusstafson, 1975) in which
participants confront each other directly in a controlled environment after giving
initial assessments, and with a "no interaction" model in which initial assessments
are simply aggregated mathematically. Gustafson et al. (1973) found the nominal
group technique superior to the others, and found Delphi to be the worst of the
three. Gough (1975) found a similar pattern of results. Fischer (1975) found no
significant differences between the various techniques. Finally Seaver (1977)
compared different group techniques as well as different techniques of mathemat-
ical aggregation. He found a general lack of difference in scores, regardless of group
technique and regardless of method of mathematical aggregation. One significant
conclusion was the following. In estimating probabilities, group interaction (either
Delphi or nominal group technique) tended to result in more extreme probability
estimates (i.e., estimates closer to zero or one). In this sense, the group interaction
tended to make the participants more confident. However, this increased con-
fidence did not correspond to an increased relative frequency of correctness. In
other words, the group interaction tended to make the participants overconfident.
We shall return to the question of overconfidence in discussing measures of quality
in probabilistic assessments.
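A minimal sketch of how such overconfidence can be measured (the assessments and outcomes are invented): compare the average stated probability with the observed relative frequency of correctness, before and after group interaction.

```python
def overconfidence(stated_probs, outcomes):
    """Mean stated probability minus the observed fraction of correct statements."""
    mean_confidence = sum(stated_probs) / len(stated_probs)
    hit_rate = sum(outcomes) / len(outcomes)
    return mean_confidence - hit_rate

# Hypothetical assessments of the same five statements before and after interaction;
# the outcomes (1 = true, 0 = false) do not change, only the stated probabilities do.
outcomes = [1, 0, 1, 1, 0]
before = overconfidence([0.60, 0.70, 0.60, 0.80, 0.70], outcomes)   # 0.68 - 0.60 = 0.08
after = overconfidence([0.80, 0.90, 0.80, 0.95, 0.90], outcomes)    # 0.87 - 0.60 = 0.27
print(before, after)   # more extreme estimates without more correctness: overconfidence grows
```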
The results cited above have put the oracles of Delphi on the defensive. The
blank check extended to experts in the 1960s has generally been rescinded and the
burden of proof has been placed on the experts and their monitors. As a result, the
whole question of evaluating expert opinion and developing methodological
guidelines for its use has moved into the foreground. The Delphi exercises seem to
have disappeared, and play almost no role in contemporary discussions of expert
opinion (for a recent survey see Parente and Parente, 1987). However, one
important legacy of the Delphi age does survive in the "benchmark exercises" used
to gauge the "reproducibility" of risk analyses. Examples are discussed in Chapter 2.
Plato's theory of knowledge, expressed in the divided line, provided the basis for his
ideal state, The Republic. The Republic was governed by "guardians" who had
undergone a lifetime of rigorous training in science, statecraft, and ethics. The most
important conclusion emerging from this brief review is this: Experts are not the
guardians Plato had in mind. They are ordinary mortals with ordinary foibles.
They may have interests, they may have biases and predilections, and they may be
unable adequately to assess their own uncertainty. An expert who knows more
about a particular subject than anyone else may be able to talk an opponent under
the table. This does not mean that his opinions are more likely to be true.
The "received view" of science, as articulated by philosophers shortly before
and after World War II, paints a picture of science in which the scientific method
plays a central role. Science aims at rational consent, and the method of science
must serve to protect and further this aim. Hence, the scientific method shields
science from idiosyncrasies of individual scientists. Concomitantly, any individual
who scrupulously follows this method can arrive at results that have a legitimate
claim to rational consent, regardless of the person actually performing the research.
In this sense science is conceived as impersonal.
Within the received view of science, something like expert opinion can play at
most a marginal role. It is useful to recall in this regard a distinction first drawn by
the philosopher Hans Reichenbach (1951). Reichenbach distinguished a "context of
discovery" from the "context of justification" in science. Discovery in science is
often nonmethodological. It is frequently driven by factors that are highly
subjective and nonrational: hunches, predilections, biases, and flights of imagination.
Everyone has seen the moon move across the night sky and seen apples fall to
earth. However, it took a Newton to realize that the same physical laws underlay
both phenomena. Once this idea was proposed, it was subjected to tests that could
be verified by everyone. Having the idea is not the same as justifying the idea. The
former belongs to the context of discovery, the latter to the context of justification.
According to this view, the scientific method operates primarily within the context
of justification. Expert opinion, if it plays any role at all, can only figure in the
context of discovery, according to the received view.
This received view came under attack in the 1960s as people realized that the
very features mentioned above (interests, biases, predilections, uncertainty) play a
larger role in science than many philosophers cared to acknowledge. Against this
background, the introduction of expert opinion into the scientific method may
have seemed like an eminently logical step. Since this time, expert opinion has
become increasingly visible within the context of justification in science. An expert's
opinion, whether sprinkled with "uncertainty modifiers" or cast in the form of
quasi-structured input, is increasingly advanced as an argument. In the next chapter
we shall see to what lengths this has been taken in certain areas of "hard science."
It is appropriate to pause over the question of whether this development is really
consistent with the fundamental aim of science, rational consensus. Within the
received view, rational consent is pursued via the twin principles of reproducibility
and empirical control. Scientific results must in principle be susceptible to empirical
control, by anyone, hence they must be reproducible by anyone. A result gained by
expert opinion cannot in general be reproduced by other experts, and cannot be
directly submitted to empirical control. This is surely why opinion plays no explicit
role in the methodology of the received view.
This is not to say that expert opinion cannot be used in science outside the
context of discovery. In fact this has been going on to some degree since the
inception of science. However, it is fair to say that methodological guidelines for the
use of expert opinion outside the context of discovery are unknown within the
received canons of science. Much recent research in the field of expert opinion has
been directed, at least implicitly, toward articulating such guidelines, and this
research is the principal focus of the succeeding chapters.
The central theme of this book is that expert opinion can find a place within
the context of justification. The "emerging technologies" for expert opinion provide
us with the tools for rationally analyzing and evaluating heretofore nonrational
portions of scientific activity, and bringing them within the purview of
methodology.
The most important tool in rationally incorporating expert opinion in science
is the representation of uncertainty. Opinion is by its very nature uncertain. Hence,
when expert opinion is used as input in a scientific inquiry or report, the question to
be addressed is simply this: Is the uncertainty adequately represented? But what
does it mean to "adequately represent uncertainty"? When is uncertainty adequately
represented and when is it not? These are the questions which a methodology for
expert opinion must answer.
2
Expert Opinion in Practice
Expert opinion has been used in a more or less systematic way in many fields. This
chapter discusses uses in the aerospace program, in military intelligence, in nuclear
energy, and in policy analysis. The latter field is extremely broad and contains
many methodologies that are only marginally related to the central theme of this
book, namely the representation and use of expert uncertainty. Techniques such as
time series analysis and regression analysis have been extensively used in economic
forecasting, for example, but cannot be called methodologies for using expert
opinion. Hence, the items chosen from the general field of policy analysis will focus
on expert opinion and will not constitute a representative sample from this very
wide field.
THE AEROSPACE SECTOR
As in the nuclear sector, expert opinion entered the aerospace sector because of the
desire to assess safety. In particular, managers and politicians needed to assess the
risks associated with rare or unobserved catastrophic events. The likelihood of
such events could obviously not be assessed via the traditional scientific method of
repeated independent experiments.
The problems of assessing such likelihoods were dramatically brought out on
January 28, 1986, by the tragic accident of the Challenger space shuttle. In
1983, E. W. Colglazier and R. K. Weatherwax (Colglazier and Weatherwax, 1986)
brought out a report sponsored by the U.S. Air Force, which reviewed an earlier
National Aeronautics and Space Administration (NASA)-sponsored estimate of
shuttle failure modes and failure probabilities. [In this earlier report, certain critical
failure probabilities were simply dictated by the sponsor, in disregard of available
data (Bell and Esch, 1989).] Their estimate of the solid rocket booster failure
probability per launch, based on subjective probabilities and operating experience,
was roughly 1 in 35. The NASA management rejected this estimate and elected to
rely on their own engineering judgment, which led to a figure of 1 in 100,000. As the
report of Colglazier and Weatherwax is not available to the public at this writing,
the only published documentation is to be found in the abstracts for a conference
on risk analysis in 1986 at which their results were presented. We quote their
abstract in full:
We estimated in 1983 that the probability of a solid rocket booster (SRB) failure
destroying the shuttle was roughly 1 in 35 based on prior experience with this
technology. Sponsored by Teledyne Energy Systems for the Air Force, our report was a
review of earlier NASA-sponsored estimates of shuttle failure modes and their
probabilities. These estimates were to be used in assessing the risk of carrying
radioisotope thermoelectric generators (RTGs) aboard the shuttle for the Galileo and
Ulysses missions scheduled for 1986. Our estimates of SRB failure were based on a
Bayesian analysis utilizing the prior experience of 32 confirmed failures from 1902
launches of various solid rocket motors. We also found that failure probabilities for
other accident modes were likely to have been underestimated by as much as a factor of
1000. A congressional hearing on March 4, 1986, reviewed our report and critiques by
Sandia National Laboratory and NASA. NASA had decided to rely upon its
engineering judgment and to use 1 in 100,000 as the SRB failure probability estimate for
nuclear risk assessments. We have recently reviewed the critiques and stand by our
original conclusions. We believe that in formulating space policy, as well as in assessing
the risk of carrying RTGs on the shuttle, the prudent approach is to rely upon
conservative failure estimates based upon prior experience and probabilistic analysis.
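As an illustration of the kind of Bayesian updating the abstract describes (a minimal Beta-Binomial sketch using only the 32 failures in 1902 launches; it is not a reconstruction of the actual analysis, which evidently incorporated further shuttle-specific judgments to arrive at roughly 1 in 35):

```python
# Beta-Binomial update: a uniform Beta(1, 1) prior on the per-launch failure
# probability, updated on 32 failures observed in 1902 solid rocket motor launches.
failures, launches = 32, 1902
alpha, beta = 1 + failures, 1 + (launches - failures)       # posterior Beta(33, 1871)
posterior_mean = alpha / (alpha + beta)
print(posterior_mean)        # about 0.017, i.e., roughly 1 failure in 58 launches
# Both this crude figure and the published 1-in-35 estimate are orders of magnitude
# larger than the 1-in-100,000 figure NASA management chose to use.
```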
The Challenger accident occurred on the twenty-fifth launch. A presidential
commission under former Secretary of State Rogers investigated the accident. The
personal experiences of one commission member, Richard Feynman, colorfully
depict some of the problems the commissioners found at NASA. One passage
deserves to be quoted at length.
Suddenly I got an idea. I said "All right, I'll tell you what. In order to save time, the
main question I want to know is this: is there the same misunderstanding, or difference
of understanding, between the engineers and the management associated with the
engines, as we have discovered associated with the solid rocket boosters?"
Mr. Lovingood says, "No, of course not. Although I'm now a manager, I was
trained as an engineer."
I gave each person a piece of paper. I said, "Now, each of you please write down
what you think the probability of failure for a flight is, due to a failure in the engines."
I got four answers—three from the engineers and one from Mr. Lovingood, the
manager. The answers from the engineers all said,... almost exactly the same thing: 1 in
200. Mr. Lovingood's answer said, "Cannot quantify. Reliability is determined by
studies of this, checks on that, experience here—blah, blah, blah, blah, blah."
"Well," I said, "I've got four answers. One of them weaseled." I turned to Mr.
Lovingood and said, "I think you weaseled." He says, "I don't think I weaseled." "Well,
look," I said, "you didn't tell me what your confidence was; you told me how you
determined it. What I want to know is: After you determined it, what was it?"
He says, "100 percent." The engineers' jaws drop. My jaw drops. I look at him,
everybody looks at him—and he says "Uh ... uh, minus epsilon!"
"OK. Now the only problem left is, what is epsilon?"
He says, "1 in 100,000." So I showed Mr. Lovingood the other answers and said, "I
see there is a difference between engineers and management in their information and
knowledge here ... ". (Feynman, 1987)
A systematic concern with risk assessment methodology began after the fire on
the Apollo flight AS-204 on January 27, 1967, in which three astronauts were killed.
This one event set the NASA planning back 18 months, involved considerable loss
of public support, cost NASA salaries and expenses for 1500 people involved in the
subsequent investigation, and ran up $410 million in additional costs (Wiggins,
1985). This was reason enough to subject the erstwhile safety policy to an extensive
review.
Prior to the Apollo accident NASA relied on its contractors to apply "good
engineering practices" to provide quality assurance and quality control. Although
there was some systematic review of these policies undertaken at the Johnson
Space Center in 1965, there was no formal attempt to define "safety" or
"reliability."
The problems of a "contractor-driven" approach to safety were brought out in
a letter to Senator Gravel from the Comptroller General of the United States
(U.S.N.R.C., 1975, pp. XI 3-15, 3-21), which quoted an Air Force report: "... where a
manufacturer is interested in having his equipment look good he can, and will,
select some of the more optimistic data he can find or generate to use in his
reliability predictions. Thus reliability predictions, for several reasons, tend to be
generally optimistic by a factor of two to six, but sometimes for substantially
greater factors."
Data on mean times between failures (MTBF) for aircraft radar subsystems
supporting this contention are given in Table 2.1.
On April 5, 1969, the Space Shuttle Task Group was formed in the Office of
Manned Space Flight of NASA. The task group developed "suggested criteria" for
evaluating the safety policy of the shuttle program, which contained quantitative
safety goals. The probability of mission completion was to be at least 95% and the
probability of death or injury per mission was not to exceed 1%. These numerical
safety goals were not adopted in the subsequent shuttle program (Wiggins, 1985,
p. 9).
In an attempt to structure their safety policy, and protect themselves from self-
serving safety assessments, NASA, following a lead from the military, adopted so-
called risk assessment matrix tables to "quantify and prioritize" risks. An example
of such tables is given as Table 2.2.
Table 2.1 Mean times between failures (MTBF) for aircraft radar subsystems (hours)

Aircraft      Predicted MTBF    Achieved MTBF
F-4B          10                4
A-6A          75                8
F-4C          10                9
F-111 A/E     140               35
F-4D          10                10
A-7 A/B       90                30
A-7 D/E       250               12
F-4E          18                10
F-111D        193               less than 1
F-4J          20                5
Table 2.2 Risk assessment matrix

                                  Hazard Categories
Frequency of          I               II          III         IV
Occurrence            Catastrophic    Critical    Marginal    Negligible
(A) Frequent          1               3           7           13
(B) Probable          2               5           9           16
(C) Occasional        4               6           11          18
(D) Remote            8               10          14          19
(E) Improbable        12              15          17          20
Of course, these matrix tables are "quantitative" only in the sense that they use the roman numerals I, II, etc. No attempt is made to
quantify probabilities. This reflects NASA's pervasive distrust of subjective
numerical representations of uncertainty. Although numerical risk assessment
including quantification of accident probabilities was in fact pioneered in the
aerospace industry, NASA dropped it early on.
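A minimal sketch of how such a matrix is applied (the category labels follow Table 2.2; the code itself is our illustration): the analyst assigns a frequency class and a hazard category, and the matrix returns an ordinal priority index rather than a probability.

```python
# Risk assessment matrix of Table 2.2: (frequency class, hazard category) -> priority index.
MATRIX = {
    "A": {"I": 1,  "II": 3,  "III": 7,  "IV": 13},   # Frequent
    "B": {"I": 2,  "II": 5,  "III": 9,  "IV": 16},   # Probable
    "C": {"I": 4,  "II": 6,  "III": 11, "IV": 18},   # Occasional
    "D": {"I": 8,  "II": 10, "III": 14, "IV": 19},   # Remote
    "E": {"I": 12, "II": 15, "III": 17, "IV": 20},   # Improbable
}

def priority(frequency_class, hazard_category):
    """Ordinal priority (1 = most urgent); no probability is quantified anywhere."""
    return MATRIX[frequency_class][hazard_category]

print(priority("C", "I"))   # an occasional catastrophic hazard gets priority index 4
```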
The published reason for abandoning quantitative techniques was that low
numerical assessments of accident probability do not guarantee safety. Accordingly,
NASA has always been suspicious of "absolute reliability numbers." A recent
report describing the NASA safety program, contracted by the European Space
Agency and prepared by American aerospace experts, puts the matter this way:
"... the problem with quantifying risk assessment is that when managers are given
numbers, the numbers are treated as absolute judgments, regardless of warnings
against doing so. These numbers are then taken as fact, instead of what they really
are: subjective evaluations of hazard level and probability" (Wiggins, 1985, p. 85).
This type of statement recurs frequently in the public expositions of NASA
safety policy and is sharply at odds with the experiences reported by Colglazier and
Weatherwax cited at the beginning of this chapter. Here we do not see managers
treating risk assessments as absolute numbers, alas. Instead, they relied on their
own judgment, which, in hindsight, must be regarded as tragically self-serving.
In the corridors of the aerospace world there are loud and persistent rumors to
the effect that the primary motive for abandoning quantitative risk assessment in
the U.S. aerospace program was not distrust of overoptimistic reliability assessments.
Rather, initial estimates of catastrophic failure probabilities were so high
that their publication would have threatened the political viability of the entire
space program. For example, a General Electric "full numerical probabilistic risk
assessment" on the likelihood of successfully landing a man on the moon indicated
that the chance of success was "less than 5%." When the NASA administrator was
presented with the results, he "felt that the numbers could do irreparable harm, and
disbanded the effort" (Bell and Esch, 1989).
As a result of the extensive investigation following the shuttle accident, there
are strong signs that NASA will make increasing use of quantitative risk
assessment in the future.
MILITARY INTELLIGENCE
Communication
The problem of communicating uncertainty to consumers in such a way that they
make proper use of it, is formidable, and several systems have been implemented
and subsequently discarded. One widely used system, combining so-called
reliability-accuracy ratings, is reproduced as Table 2.3. Research (Samet, 1975) has
indicated that this system was not adequate. In evaluating reports, intelligence
officers were much more heavily influenced by the assessed accuracy of reports and
did not take account of the reliability of the source. Moreover, there was a wide
disparity in the absolute interpretation of the qualitative ratings.
The DE had made use of so-called Kent charts, which provide a quantitative
interpretation of natural language expressions of uncertainty. A Kent chart is
shown in Table 2.4.
Evaluation
Kent charts have been abandoned in favor of direct numeric estimations of
probability. Such estimations are not without problems, however. The most
important question regarding numerical probabilities is that of "validity" or
"calibration" (as it will be called here). A subjective probability assessor is said to be
Table 2.4 A Kent Chart for Estimating Terms and Degrees of Probability
This table explains the terms most frequently used to describe the range of likelihood in the
key judgment of DIA estimates.
Note: Words such as "perhaps," "may," and "might" will be used to describe situations in the lower ranges of
likelihood. The word "possible," when used without further modification, will generally be used only when a judgment
is important but cannot be given an order of likelihood with any degree of precision.
Source: Morris and D'Amore (1980, p. 5-21).
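As a purely illustrative sketch of what a Kent chart does (the terms and numerical ranges below are loosely modeled on Sherman Kent's original proposal, not taken from the DIA chart of Table 2.4), such a chart ties each verbal expression of likelihood to an explicit probability range:

```python
# Illustrative Kent-chart-style mapping from verbal likelihood terms to probability ranges.
KENT_CHART = {
    "almost certain":       (0.87, 0.99),
    "probable":             (0.63, 0.87),
    "chances about even":   (0.40, 0.60),
    "probably not":         (0.20, 0.40),
    "almost certainly not": (0.02, 0.13),
}

def numeric_range(term):
    """Translate a verbal expression of uncertainty into an explicit probability range."""
    return KENT_CHART[term.lower()]

print(numeric_range("Probable"))   # (0.63, 0.87)
```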
NUCLEAR ENERGY
Probabilistic risk analysis was the first "hard science" to introduce subjective
probabilities on a large scale, and it has been at the forefront of developments in
using expert opinion. For this reason, our treatment of this area will be somewhat
more extended. After examining the historical background we focus on four
problems that have emerged in connection with "subjective data," namely, the
spread or divergence of expert opinion, the dependencies between the opinions of
different experts, the reproducibility of the results of risk studies, and the
calibration of expert probability assessments.
Historical Background
Probabilistic risk assessment as such is not new. Covello and Mumpower (1985)
identify explicit risk assessment in the writings of Arnobius the Elder in the fourth
century. More recently, the American Atomic Energy Commission [AEC, the
forerunner of the Nuclear Regulatory Commission (NRC)] pursued a philosophy
of risk assessment through the 1950s based on the "maximum credible accident."
Because credible accidents were covered by plant design, residual risk was
estimated by studying the hypothetical consequences of "incredible accidents." An
example of such a study is WASH-740 (U.S. AEC, 1957), released in 1957. It
focused on three scenarios of radioactive releases from a 200-MWe nuclear power
plant operating 30 miles from a large population center.
Regarding the probability of such releases the study concluded that "no one knows
now or will ever know the exact magnitude of this low probability."
Design improvements introduced as a result of WASH-740 were intended to
reduce the probability of a catastrophic release of the reactor core inventory. Such
improvements could have no visible impact on the risk as studied by the WASH-
740 methodology. On the other hand, plans were being drawn for reactors in the
1000-MWe range located close to population centers, and these developments
would certainly have a negative impact on the consequences of the incredible
accident.
The desire to quantify and evaluate the effects of these improvements led to the
introduction of probabilistic risk analysis (PRA). As mentioned previously, the
conservatively assumed that a "degraded" core would melt entirely. The proba-
bilities associated with that sequence, particularly those concerning human error,
do not appear realistic in hindsight.
Two influential independent analyses of the Three Mile Island accident, the
Report of the President's Commission on the Accident at Three Mile Island
(Kemeny et al., 1979) and the Rogovin Report (Rogovin and Frampton, 1980),
recommended that greater use be made of probabilistic analyses in assessing
nuclear plant risks.
Shortly thereafter a new generation of PRAs appeared in which some of the
methodological defects of the Reactor Safety Study were avoided. The Zion
Probabilistic Safety Study in particular has served as a model for subsequent
PRAs, and its methodology has been canonized in a series of articles in the Journal
of Risk Analysis (see, e.g., Kaplan and Garrick, 1981). In 1983 the U.S. NRC
released the PRA Procedures Guide, which shored up and standardized much of the
risk assessment methodology. The Zion Probabilistic Safety Study uses subjective
probabilities. It is interesting to note that in the first article in the first issue of the
first journal devoted to risk analysis, we find the following definition of
"probability":
"... 'probability' as we shall use it is a numerical measure of a state of knowledge, a
degree of belief, a state of confidence." (Kaplan and Garrick, 1981, p. 17)
Since the Lewis report, expert opinion has been used in a structured form as a
source of data in several large studies. Among these are the risk study of the fast
breeder reactor at Kalkar, Germany (Hofer, Javeri, and Loffler, 1985); the studies of
seismic risk by Okrent (1975), and by Bernreuter et al. (1984) and a study of fire
hazards in nuclear power plants (Sui and Apostolakis, 1985). Finally, the Draft
Reactor Risk Reference Document (NUREG-1150, 1987) makes massive (and
poorly documented) use of expert opinion to assess the risks of five nuclear power
plants. The final draft is still under review at this writing, but a report describing a
more concerted effort with regard to expert judgment is now available (Wheeler et
al., 1989). The improved documentation reveals a laudable, though time consuming
and costly, attempt to counter the various problems with expert opinion via an
intensive elicitation process. This approach is extended in Bonano, Hora, Keeney,
and von Winterfeldt (1990), with particular attention to potential applications in
the field of radioactive waste management. In the United Kingdom, Her Majesty's
Inspectorate of Pollution has sponsored similar exploratory research in applying
expert opinion to the risk analysis of radioactive waste repositories (Dalrymple and
Willows, 1990). Characteristic of these approaches is a heavy investment on the
elicitation side, with no evaluation of performance and no proposal for com-
bination other than simple arithmetic averaging. The possible cost ineffectiveness
of this approach is underscored in Woo (1990). Perhaps the most prominent area
for applying expert opinion in risk analysis is the assessment of human error
probabilities. The main reference in this area is the Handbook of Human Reliability
(Swain and Guttmann, 1983). The PRA Procedures Guide (U.S.N.R.C., 1983, chap.
4) and Humphreys (1988) give a good review of the literature and methods.
There is no comprehensive review of expert judgment applications. However,
an entire issue of Nuclear Engineering and Design (vol. 93, 1986) was devoted to the
role of data and judgment in risk analysis. A recent report by Mosleh, Bier, and
Apostolakis (1987) gives an overview of some of these studies, and a compressed
version of this report appeared in Reliability Engineering and System Safety (Vol.
20, no. 1, December 1988).
The best way to appreciate the problems of substituting experts' degrees of
belief for data is to look hard at some examples. We shall look at subjective data
from four viewpoints, namely the spread of expert opinion, the dependency
between experts, the reproducibility of the results, and finally the calibration of the
results. We shall not review the above-mentioned literature. This chapter is devoted
to examples. Later chapters return to methodological issues raised in this chapter.
Table 2.5 Estimates of the Failure Probability of High-Quality Steel Pipe (per section-hour)

Source                     Value
1.  LMEC                   5 × 10⁻⁶
2.  Holmes                 1 × 10⁻⁶
3.  G.E.                   7 × 10⁻⁸
4.  Shopsky                1 × 10⁻⁸
5.  IEEE, a                1 × 10⁻⁸
6.  IEE, b                 1 × 10⁻⁸
7.  NRTS Idaho             1 × 10⁻⁸
8.  Otway                  6 × 10⁻⁹
9.  Davies                 3 × 10⁻⁹
10. SRS                    2 × 10⁻⁹
11. IKWS Germany           2 × 10⁻¹⁰
12. Collins                1 × 10⁻¹⁰
13. React. Incd.           1 × 10⁻¹⁰
RSS estimate               1 × 10⁻¹⁰
90% confidence bounds      3 × 10⁻⁹ – 3 × 10⁻¹²
The Reactor Safety Study needed a large number of component failure probabilities that could not be
reliably estimated from available data. Thirty experts (the experts were sometimes
consulting firms or data banks) were asked to estimate failure probabilities for 60
components. Not every expert chosen estimated each component, but the total
matrix gives a good impression of the divergence of expert opinion.
The results for one component, the failure probability of high-quality steel
pipe of diameter at least 7.6 cm per section-hour, are given in Table 2.5 (a section is
a piece of pipe about 10 meters long). The thirteen responses range from 5E-6 to 1E-10. The Reactor Safety Study used the value 1E-10 in its calculations, with 90%
confidence bounds of 3E-9 to 3E-12. Eight of the thirteen responses fall above the
upper confidence bound. Calling the spread of expert opinion for a given
component the ratio of the largest to the smallest estimate, the average spread over
the 60 components was 167,820. If one outlier (with a spread of 1E7) is omitted, then the average of the remaining components' spreads was 1173. By comparison, the
average ratio between the upper and lower confidence bounds used by the Reactor
Safety study for these 60 components was 126. Hence we see that the confidence
bounds given in the study tend to be smaller than the spread of expert opinion. We
shall return to this feature in another context shortly.
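As an illustration of how such spread figures are computed, the following sketch (in Python; the values are those of Table 2.5, transcribed by hand) derives the spread for the steel-pipe item and counts the responses falling above the RSS upper confidence bound.

    # Thirteen expert estimates of the failure probability of high-quality
    # steel pipe (per section-hour), taken from Table 2.5.
    estimates = [5e-6, 1e-6, 7e-8, 1e-8, 1e-8, 1e-8, 1e-8,
                 6e-9, 3e-9, 2e-9, 2e-10, 1e-10, 1e-10]

    spread = max(estimates) / min(estimates)       # ratio of largest to smallest estimate
    print("spread for this item:", spread)         # 50,000

    rss_upper = 3e-9                               # RSS upper 90% confidence bound
    above = sum(1 for x in estimates if x > rss_upper)
    print("responses above the RSS upper bound:", above)   # 8 of 13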
Table 2.6 Results of a Rank-Independence Test for Expert Estimates in the Reactor Safety Study
[Table not reproduced here.] The test covers the items estimated by at least four experts. Owing to ties at the median values, the relative frequency of pessimism, that is, of answers strictly above the median value, was 36.4%. The results concern only those experts who estimated at least 14 of the 39 items.
Source: Cooke (1986b)
or at least as many O's (if he tends toward optimism) in the associated coin-tossing
experiment. We see that rank independence would be rejected at the 5% level for
five of the nine experts.
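One way to carry out such a rank-independence test is as a one-sided binomial (sign) test: under rank independence, each answer lies strictly above the median item value with probability 0.364 (the figure quoted with Table 2.6), and an expert's count of "pessimistic" answers is compared with the corresponding binomial tail. A minimal sketch in Python; the counts used here are hypothetical and are not taken from Table 2.6.

    from math import comb

    def binomial_tail(k, n, p):
        """P(X >= k) for X ~ Binomial(n, p)."""
        return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

    p_pessimism = 0.364        # chance of an answer strictly above the median, given ties
    n_items, k_above = 20, 13  # hypothetical expert: 13 of 20 answers above the median

    p_value = binomial_tail(k_above, n_items, p_pessimism)
    print(f"P(at least {k_above} of {n_items} answers above the median) = {p_value:.4f}")
    # A tail probability below 0.05 would lead to rejection of rank independence
    # at the 5% level for this (hypothetical) expert.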
Nuclear energy is not the only area in which experts disagree dramatically.
Shooman and Sinkar (1977) conducted a study into the risks associated with
grounded-type electric lawn mowers. They collected opinions from 12 experts
regarding the probabilities for events which might lead to an electrical accident,
with the intention of combining these probabilities to derive the probability for a
shock accident. A description of the events estimated together with the results are
reproduced as Table 2.7. One of the matrices shows a cluster analysis similar to
that described above. The degree of clustering is quite extreme.
analysis." A fault tree analysis decomposes the event of interest (the "top
event") into combinations of "component events." The probability of the
top event is then computed as a function of the probabilities of the
component events (and their interactions). The spread in results after this stage was rather large, and the teams were unable to agree on a common fault tree analysis.
3. In the third stage, the project leaders wanted to separate the effects of
different fault tree modeling from the effects of different failure data. The
teams agreed to proceed with a common fault tree, although this fault tree
did not represent a consensus. This led to the "common fault tree with
participants' data" results.
4. In the final stage, the goal was to determine whether different methods of
calculation also played a significant role. These are the "common fault tree
and common data" results. The results are shown in Table 2.8.
Table 2.7 Description of the Events Estimated in the Lawn Mower Study

Ea  The event that a person is grounded or has a low resistance while operating the lawn mower. (Low resistance is that resistance that is sufficiently small to cause an electric shock for the specified operating voltage.)
E0  The event that the wire plug of the mower is in while the mower is being repaired or adjusted.
E1  The event that a person touches a "live" part of the lawn mower. The live part considered here is a part that is normally live when a mower is connected to the supply, and not a part that has become live due to a fault.
E2  The event that a person touches cut (E3) or damaged (E4) cord when the mower is connected to the supply.
E3  The event that the cord is cut by some sharp object, either while the mower is in operation or stored.
E4  The event that the insulation of the cord does not work (damage due to environment, prolonged and/or abusive use, etc.).
E5  The event that a person touches the conductive part of the body (includes handle, blade, etc.) when the mower is connected to the supply.
E6  The event that grounding does not work. This event comprises one or more of the following:
    (1) Consumer does not have a grounded outlet box.
    (2) Ground wire is either broken or disconnected at the grounding terminal.
    (3) High-impedance grounding circuit exists.
Estimated Probability of the Event (First Round, Grounded Type)

         Expert
Event    1      2      3      4      5        6         7      8      9       10        11       12
Ea       0.5    0.5    0.75   0.5    0.5      0.005     1.0    0.9    0.005   0.005     0.3      0.1
E0       0.25   0.1    0.1    0.01   0.0001   0.00028   0.66   0.8    0.0001  0.0045    0.0225   0.878
E1       0.05   0.01   0.001  0.001  0.5      0         0.1    0.2    0.0001  0.0001    0.00028  0.001
E2       1.0    1.0    0.9    0.3    0.001    0.007     1.0    1.0    0.005   0.0055    0.01     0.2
E3       0.1    0.05   0.01   0.01   0.05     0.0007    0.1    0.7    0.0001  0.0001    0.0125   0.0078
E4       0.05   0.1    0.005  0.05   0.002    0.00007   0.05   0.05   0.0001  0.00015   0.01     0.01
E5       1.0    0.7    0.8    1.0    0.001    0.007     1.0    1.0    0.005   0.006     1.0      0.022
E6       0.35   0.2    0.2    0.2    0.65     0.0035    0.6    0.75   0.005   0.0015    0.425    0.6
E7       0.05   0.005  0.001  0.1    0.00002  0.000003  0.05   0.05   0.0001  0.000002  0.002    0.001
E8       0.05   0.001  0.01   0.05   0.001    0.000003  0.02   0.05   0.0001  0.000002  0.002    0.00035
Table 2.8 Results of the Benchmark Exercise
[Table body not reproduced here.] The table gives the point values of the mission failure probability obtained by the different participants at each stage: the first "blind" evaluation, the intermediate values after comparison of the qualitative analyses, the common fault tree with participants' data, and the common fault tree with common data. As evaluated on the basis of the common fault tree and common data, the mission failure probability is P_F = 1.4 × 10⁻³. All probability values are conditional on the occurrence of the initiating event; excluding some extreme values according to certain evaluation criteria may give rise to somewhat different tables.
Source: Amendola (1986)
Problem            3a         3b         3bu        4a         5a         5b         6a
                   N = 29     N = 27     N = 28     N = 28     N = 27     N = 27     N = 25
Sandia solution    0.005      0.0025     0.05       0.039      0.02       0.032      0.000011
Lower limit        0.0025     0.00125    0.025      0.0195     0.01       0.016      0.0000055
Upper limit        0.025      0.0125     0.25       0.195      0.1        0.15       0.000055

These limits underestimate by a factor of two or more the usual uncertainty bounds that would be calculated by a Monte Carlo procedure. Were the wider bounds used, considerably more of the peers' responses would lie within the usual uncertainty bounds of the Sandia solution.
Source: Brune, Weinstein, and Fitzwater.
Comparison of observed values with the uncertainty bands of the Reactor Safety Study

                          Observed value    RSS bands
PWR
  Small LOCA              8.3E-3            3E-4 – 3E-3*
  AFW (failure/demand)    1.1E-3            7E-6 – 3E-4
  HPI (failure/demand)    1.3E-3            4.4E-3 – 2.7E-2
  LTCC (failure/demand)   1.2E-3            4.4E-3 – 3.2E-2
BWR
  Small LOCA              2.1E-2            3E-4 – 3E-3*
  ADS (failure/demand)    2.7E-2            3.3E-3 – 7.5E-3
  HPCI (failure/demand)   5.7E-2            3E-3 – 5.5E-2

*These bands are derived from the bands for core melt resulting from a small LOCA. RSS says that the latter uncertainty is principally a result of uncertainty with respect to the initiating event (i.e., the small LOCA).
The Oak Ridge value does not include unavailability due to test and maintenance. The RSS median value for this contribution is 1.3E-2 per demand. The RSS uncertainty bands including the test and maintenance contribution are 6.8E-2 – 1.4E-1. The above bands are derived by assuming the largest possible "error factor" consistent with the last-mentioned upper bound, under the assumption of independence.

Abbreviations:
PWR   Pressurized water reactor
BWR   Boiling water reactor
LOCA  Loss of coolant accident
AFW   Auxiliary feedwater system
HPI   High-pressure injection
LTCC  Long-term core cooling
ADS   Automatic depressurization system
HPCI  High-pressure coolant injection
The news is not all bad. Snaith (1981) studied the correlation between
observed and predicted values of some 130 reliability parameters. The predicted
values include both expert assessments and results of analysis. The correlation was
generally good. Figure 2.1 plots the ratio R of the observed to predicted values as a function of cumulative frequency. We see that in 64% of the cases R ≤ 2, while in 93% of the cases R ≤ 4.
Unfortunately, the above data say nothing about how certain the experts were
of their assessments, hence it is impossible to determine whether, for example, a
value of R = 2 would be considered surprising, for a given expert.
A better perspective on the relation of observed and predicted values is given
by Mosleh, Bier, and Apostolakis (1987). The experts gave distributions for
maintenance times used in the risk assessments of the Seabrook and Midland
nuclear power plants, and these were compared with observed mean values. The
results, shown in Table 2.10, indicate a ratio of observed to predicted mean values
generally in agreement with the results of Snaith's study. However, Mosleh, Bier,
and Apostolakis also looked at the degree of confidence in the predicted results and
compared this with the observed spread in the observed values. Degree of
confidence is indicated by "range factors," where the range factor associated with
the probability distribution of a given quantity is the square root of the ratio of the
95th and 5th percentiles of the distribution for this quantity. Hence, if an expert
reports a range factor of 3.2, his 95th percentile is a factor of 10 larger than his 5th
percentile. The range factors associated with the expert distributions are compared
with the range factors derived from the observed values. Table 2.11 shows that
experts' range factors are consistently too small. If the range factor is increased by a
factor of 3, this means the ratio of the 95th to the 5th percentile has been increased
by a factor of 10. From Table 2.11 we may conclude that the expert assessments
reflect significant overconfidence.
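Since the range factor is just the square root of the ratio of the 95th to the 5th percentile, overconfidence shows up directly as assessed range factors that are much smaller than the range factors of the observed values. A minimal sketch in Python with invented numbers (not the Seabrook or Midland data):

    from math import sqrt

    def range_factor(p5, p95):
        """Square root of the ratio of the 95th to the 5th percentile."""
        return sqrt(p95 / p5)

    # Hypothetical maintenance-time distribution assessed by an expert (hours)
    expert_rf = range_factor(p5=2.0, p95=20.0)     # ratio 10, range factor about 3.2
    # Hypothetical range factor derived from the observed maintenance times
    observed_rf = range_factor(p5=0.5, p95=50.0)   # ratio 100, range factor 10

    print(f"assessed range factor: {expert_rf:.2f}")
    print(f"observed range factor: {observed_rf:.2f}")
    # An observed range factor several times the assessed one indicates overconfidence.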
Words of Caution
One must be cautious in drawing general conclusions from examples, as bad news
always travels further and faster than good news. In fact, Christensen-Szalanski
and Beach (1984) reviewed more than 3500 abstracts on judgment and reasoning.
Of the 84 empirical studies, 47 reported poor performance, while 37 reported good
performance. However, the Social Sciences Citation Index showed that citations of
poor performance outnumbered citations of good performance by a factor of 6.
Regarding the issues of spread, dependence, reproducibility, and calibration, the foregoing pretty much captures all that is presently available in the applications literature (new benchmark studies are in progress at Ispra). The conclusion from
the above examples is clear and not too encouraging: Expert opinions in
probabilistic risk analysis have exhibited extreme spreads, have shown clustering,
and have led to results with low reproducibility and poor calibration.
In general, we may affirm that the use of expert opinion to date has been rather
ad hoc and bereft of methodological guidance. Indeed, there exists no body of rules
for how an analyst should use expert opinion. Different studies have done this in
very different ways, and some divergencies in this respect were cited in this chapter.
Improvement in this respect is not only urgently needed, but also feasible.
To illustrate the problems created by this methodological vacuum, suppose we
are performing a risk analysis and we are confronted with a spread of expert
opinion such as that shown in Table 2.5. What are we to do? Are we to take the
average value? some weighted average? the median value? the most popular value?
Should we try to get the experts to agree more? If so, can we do this without
perturbing the result with various more or less overt forms of social pressure?
Indeed, such pressures are always present, and there is little reason to believe that
they systematically push a group of experts in the direction of the truth.
The presence of clustering into pessimists and optimists makes these problems
only more acute. A difference of opinion at the component level in a risk analysis
might not be so serious if no one particular component was especially important
for the end result. One might hope that the differences of opinion would interfere
destructively (i.e., everyone is sometimes a pessimist and sometimes an optimist) so
that the uncertainty of the final result would not explode. If the experts cluster, then
this hope will be vain.
This leads to a final cautionary note for concluding the discussion of
probabilistic risk analysis. After doing a risk analysis, one is expected to perform an
"uncertainty analysis," that is, to give some indication of the confidence bounds in
which the results are said to lie. In the best case the following procedure is invoked.
Probability distributions are introduced reflecting the uncertainty regarding the
values of input parameters (i.e., failure frequencies, weather conditions, population
densities, etc.). Computer simulations are then used to sample from the distribu-
tions of all the input parameters. When a value for each variable is chosen, the
whole risk calculation is repeated and the result recorded. This process is then
repeated a large number of times to build up a distribution of results, from which
confidence bounds can be extracted.
This procedure is perfectly satisfactory, if the underlying distributions are
independent. If however, these distributions are positively correlated, the above
procedure can underestimate the resulting uncertainty. A simple example will make
this clear.
Suppose we have three parameters that are uniformly distributed over the
range (0,100). With 90% certainty each variable will be found in the range (5,95).
Suppose now we are interested in the product of these variables. If the variables are
independent, the value of their product will lie with 90% probability in the range
(1600,440,000). If the variables are completely positively correlated, then this range
would be (125, 857,375), and intermediate correlations will result in intermediate
ranges. Further, as we shall see later on, the distinctive feature of subjective
probability distributions (in particular, those of the risk analyst) is exactly that they
are positively correlated, even when the physical processes being modeled are not.
Hence, uncritical uncertainty analysis (and it is always uncritical) may give the
spurious impression of knowing more than one actually knows. At the European
Space Agency, a program of uncertainty analysis is under development that
accounts for subjective dependence (Preyssl and Cooke, 1989, Cooke and Waij,
1987).
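The effect of dependence on such an uncertainty analysis is easy to check by simulation. The sketch below (Python) reruns the three-variable product example: in the independent case each factor is sampled separately, in the completely positively correlated case a single sample is reused for all three factors, and the resulting 90% intervals differ in just the way described above.

    import random

    def product_interval(correlated, n=200_000):
        """Central 90% interval of the product of three U(0,100) variables."""
        products = []
        for _ in range(n):
            if correlated:
                u = random.uniform(0, 100)
                x, y, z = u, u, u                       # completely positively correlated
            else:
                x, y, z = (random.uniform(0, 100) for _ in range(3))
            products.append(x * y * z)
        products.sort()
        return products[int(0.05 * n)], products[int(0.95 * n)]

    print("independent inputs:", product_interval(False))
    print("correlated inputs :", product_interval(True))   # approaches (125, 857375)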
POLICY ANALYSIS
"Policy analysis" is really a catchall for everything that does not fit into the areas
discussed above. It will be understood to include macroeconomic modeling,
economic forecasting, energy planning, project management, environmental
impact studies, etc. Some policy analyses have made explicit use of subjective expert
probability assessments. The most conspicuous examples in this regard are
Granger Morgan et al. (1984), Granger Morgan and Harrison (1988), and
Merkhofer and Keeney (1987). In general, however, explicit probabilistic assessment remains the exception. For this reason our discussion will be more restricted than in the foregoing. The main service of this section will be to highlight the differences between probabilistic and deterministic methods of using expert opinion.
The typical forecasting problem concerns the future value of some variable. An
electric utility may be interested in the peak demand for the coming years, in order
to plan its generating capacity. An investor may be interested in the price of some
securities, government planners may be interested in the price of oil on the
international market, or in the gross national product between now and election
time. All these situations involve predicting the values of quantities that become
known before all too long and would lend themselves very well to assessment via
probabilistic expert opinion. However, in each case there has been very little use of
probabilistic assessment.
In economic forecasting various mathematical models have been used,
sometimes in conjunction with expert forecasts. Experts' forecasts are combined to
produce the subsequent estimate on the basis of past performance and the observed
values of various exogenous variables. Such methods are deterministic in the sense
that they make no attempt to assess or communicate the uncertainty of the
estimate. The decision maker is given no guidance whatsoever as to how the
estimate should be factored into his decision problem. Such forecasting methods
are therefore more primitive than even the qualitative attempts to represent
uncertainty reviewed in the preceding pages. In practice this often means that the
decision maker treats the forecast values as certain.
The most important feature of such forecasting models was pointed out by
C. W. J. Granger (1980): "There is no really useful procedure for evaluating an
individual forecasting technique, as any evaluation cannot be separated from the
degree of forecastability of the series being predicted." Any forecasting technique
will perform well when estimating someone's yearly age. On the other hand, any
method for predicting the outcome of tosses with a fair coin will get about half of
the outcomes right. This is perhaps the single most important difference between
probabilistic and deterministic methods: as we shall see in succeeding chapters,
probabilistic forecasters can most certainly be evaluated independently of the
forecastability of the things they are forecasting.
Although it is impossible to evaluate forecasting models or forecasters as such,
it is possible to do clever things in given situations. Suppose we have a series x_1, x_2, ... of realizations of some variable, and suppose f_i, g_i are forecasts of x_i by forecasters f and g. Suppose f and g are "unbiased" in the sense that their expected errors f_i − x_i and g_i − x_i are zero (this does not mean that we expect no errors, but rather that the errors are expected to average out to zero). Then Granger (1980, p. 158) shows how to combine f and g in such a way as to make a better forecaster than either f or g alone.
This surprising fact is worth examining. Let r lie in the interval (0,1), and consider the combined forecaster

    c_i = r·f_i + (1 − r)·g_i

Let V_f, V_g, and V_c denote the variances (mean square errors) of the respective forecasters. Since the forecasters are assumed to be unbiased, the best forecaster is the one with the smallest variance. A simple calculation shows that

    V_c = r²·V_f + (1 − r)²·V_g + 2r(1 − r)·cov(f, g)

where cov(f, g) denotes the covariance of f and g. Elementary calculus shows that V_c is minimized when

    r = (V_g − cov(f, g)) / (V_f + V_g − 2·cov(f, g))

If we substitute this value of r into the definition of c_i, then V_c will generally be smaller than V_f or V_g.¹ For example, suppose that forecasters f and g are independent, so that cov(f, g) = 0. Then

    V_c = V_f·V_g / (V_f + V_g)

which is smaller than both V_f and V_g.
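A minimal numerical sketch of this combination rule, with simulated unbiased forecasters rather than data from any study, shows the variance reduction directly:

    import random

    random.seed(1)
    n = 10_000
    truth = [random.gauss(0, 1) for _ in range(n)]
    f = [x + random.gauss(0, 1.0) for x in truth]      # unbiased forecaster f
    g = [x + random.gauss(0, 2.0) for x in truth]      # unbiased forecaster g, noisier

    ef = [fi - xi for fi, xi in zip(f, truth)]         # forecast errors
    eg = [gi - xi for gi, xi in zip(g, truth)]

    def mse(errors):
        """Mean square error, i.e., the variance of an unbiased forecaster."""
        return sum(e * e for e in errors) / len(errors)

    vf, vg = mse(ef), mse(eg)
    cov = sum(a * b for a, b in zip(ef, eg)) / n
    r = (vg - cov) / (vf + vg - 2 * cov)               # variance-minimizing weight
    vc = mse([r * a + (1 - r) * b for a, b in zip(ef, eg)])

    print(f"Vf = {vf:.3f}, Vg = {vg:.3f}, Vc = {vc:.3f}, r = {r:.3f}")
    # Vc comes out below both Vf and Vg, close to Vf*Vg/(Vf+Vg) since cov is near zero.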
[Table: rankings of the Chicago Daily News staff forecasters and of their consensus. Source: "Here's How Our Staff Picks 'Em," Chicago Daily News, November 25, 1966, p. 43; November 24, 1967, p. 38; and November 29, 1968, p. 43. When all three years are combined, the consensus outperforms every one of the forecasters (i.e., ranks first).]
1 However, to apply this technique the variances and covariances must be estimated from the data. Estimates of the covariance on small data sets tend to be unstable, particularly when the correlation between f and g is high, and this can seriously degrade performance. Examples of this phenomenon can be found in Clemen and Winkler (1986). A similar variance reduction technique, facing a similar problem of assessing an empirical covariance matrix, is developed by the Electric Power Research Institute (1986).
For example, Figure 2.2 shows yearly projections of summer peak electric demand
of the North American Electric Reliability Council. These projections were used by
utility planners throughout the United States and Canada. The difference between
the summer peak demand in 1983 and that projected for 1983 a decade earlier was
equivalent to the output of 300 large nuclear plants, representing an investment of
about $750 billion at 1984 prices (Flavin, 1984).
Equally dramatic are the projections of world oil prices, compared with actual
prices. Figure 2.3 is an example, showing the oil price projections made by the
Dutch Ministry of Economic Affairs, a steering committee of experts advising the
ministry, and a political economist specializing in this field.
From these examples, it is easy to see the rub in Granger's combined expert.
The argument showing that the combination of expert forecasts is better than each
forecaster assumes that the forecasters are unbiased. In the real world, they are not
always unbiased. They are not "just as often too high as too low." Moreover, they
can be all biased in the same way. In such cases the combined expert will not
perform better than the best expert.
Nonetheless, there may be cases in which such biases are absent and in which
these techniques may work well. Beaver cites the example of compiling investment
portfolios. If the underlying market forces determining the prices are "felt" by each expert in a way that involves a great deal of random error, then a consensus forecast might well feel the underlying forces better than any single expert. Indeed, in the consensus forecast the random errors will tend to cancel out.
The market analogy is less plausible in the areas of science and policy analysis.
Science is not a question of majority rule. The voice in the wilderness is sometimes
right, and good new ideas usually come from the wilderness. In deciding matters of
policy one has to reckon with very powerful interests that can impose a significant bias.
Figure 2.3 Oil prices in 1985 dollars and projected prices from Dutch experts. (Kok, private
communication)
THINKING
In one sense thinking is as easy as breathing; we do them both all the time. We are
also inclined to believe that logical thinking is easy. In our daily life we seldom need
a paper and pencil to draw logical conclusions from premises. However, logical
thinking has been an explicit object of study at least since Aristotle. In spite of the
fact that we have a large and well-understood body of rules governing logical
thinking, there are a number of situations in which the majority of us can be
expected to commit elementary logical errors.
Consider the following argument:

premises      Only the fittest species for their biological niche survive.
              Cockroaches are one of the fittest species for their biological niche.
conclusion    Cockroaches survive.

Many people are inclined to accept this argument. Its first premise says: "For all species x, if x survives then x is one of the fittest species for its biological niche." If we throw the above argument into the logically equivalent form:
For all species x, if x survives then x is one of the fittest for its biological niche.
Cockroaches are one of the fittest for their biological niche.
Cockroaches survive.
then most people would recognize that the argument is invalid. It is an example of
the classical fallacy of "affirming the consequent." The difference is simply this: In
the first premise of the second argument the grammatical order in which the clauses
appear corresponds to the direction of inference—from "survive" to "fittest." In the
first premise of the first argument this correspondence does not hold and the reader
is tricked into mistaking the grammatical for the logical order.
Suppose this same elementary error is committed by experts in evolutionary
biology, and suppose we set out to represent expert reasoning among evolutionary
biologists. Should we distinguish between the above two, logically equivalent
arguments? Or should we "reconstruct" their reasoning in such a way that the two
arguments are the same? In the latter case we would allow the representation to
deviate sometimes from observed inferential behavior, in order to conform to the
rules of logic. Our choice would obviously depend on our purpose. If we were
interested in describing the inferential behavior of evolutionary biologists we might
opt for the first alternative. If we were writing a textbook on evolutionary biology,
or designing an expert system to do evolutionary biology, then we should choose
the second.
This same choice confronts us when we want to represent probabilistic
reasoning. In this case, however, the choice is much more difficult. Probabilistic
reasoning is much more subtle than nonprobabilistic reasoning. Whereas nonpro-
babilistic reasoning has been studied for more than 2000 years, the study of
probabilistic reasoning is of very recent origin. The first formal "probability logic"
was given in 1963 (Los, 1963; for a simple exposition see Cooke, 1986). Moreover
the rules for probabilistic reasoning cannot even be formulated in nonmathemat-
ical language, and they are not well understood, either by those whose reasoning is
to be represented, or by those doing the representing.
This can be illustrated with some simple examples. In one sense these are just
curiosities, but they do illustrate the subtleties involved in probabilistic reasoning.
For this purpose we introduce a few abbreviations:
x = an arbitrary species
F(x) = "x is one of the fittest species for its biological niche"
S(x) = "x survives"
c = cockroaches
The following argument is logically valid:

For all x, if F(x) then S(x)
F(c)
S(c)

So is the variant in which the first premise is replaced by its logically equivalent contrapositive:

For all x, if not-S(x) then not-F(x)
F(c)
S(c)
Now let us "probabilize" the above arguments, that is, we treat the premises as
reflecting highly probable, though not certain, knowledge. A rendering might read
as follows:
For all x, if F(x) then probably S(x)
Probably F(c)
Probably S(c)

For all x, if not-S(x) then probably not-F(x)
Probably F(c)
Probably S(c)
Are these arguments "probabilistically valid," and are they equivalent? These
questions cannot be answered without some serious reconstructing and some
pencil and paper calculations. A formalism adequate to this task is presented in
Cooke (1986), but we can get by for the present with the following notation. Let
p(S(x) | F(x)) ≥ r stand for "the conditional probability of S(x) given F(x) is greater than or equal to r."
Then for r and t close to 1, the above arguments could be reconstructed as

For all x, p(S(x) | F(x)) ≥ r
p(F(c)) ≥ t
p(S(c)) ≥ rt

For all x, p(not-F(x) | not-S(x)) ≥ r
p(F(c)) ≥ t
p(S(c)) ≥ (r + t − 1)/r
Both arguments are valid in the sense that the conclusions hold whenever their
respective premises hold.1 However, as the lower bounds for the conclusions are
not the same, and both bounds can be obtained, the arguments are not equivalent.
In fact, the second argument is generally better than the first in the sense that its
lower bound on the probability of the conclusion is higher, for r and t close to 1.
1 For the first argument, observe that p(S(c)) ≥ p(S(c) and F(c)) = p(S(c) | F(c))·p(F(c)) ≥ rt. For the second, observe that (1 − r)[1 − p(S(c))] + p(S(c)) ≥ p(F(c) | not-S(c))·p(not-S(c)) + p(F(c) | S(c))·p(S(c)) = p(F(c)) ≥ t. Solving for p(S(c)) yields the inequality in the conclusion of the second argument. Observe also that equality can hold in both arguments, though not simultaneously.
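To get a feeling for the difference between the two bounds, one may simply evaluate them for a few values of r and t. A minimal check in Python (the bound for the second argument follows the reconstruction given above):

    def bound_first(r, t):
        """Lower bound on p(S(c)) from the probabilized modus ponens form."""
        return r * t

    def bound_second(r, t):
        """Lower bound on p(S(c)) from the probabilized contrapositive form."""
        return (r + t - 1) / r

    for r, t in [(0.9, 0.9), (0.99, 0.99), (0.7, 0.9)]:
        print(f"r={r}, t={t}: first bound {bound_first(r, t):.3f}, "
              f"second bound {bound_second(r, t):.3f}")
    # For r and t close to 1 the second bound is the higher, i.e., the stronger, of the two.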
Things can get much worse. There are valid elementary arguments whose probabilizations are not valid. Let I(x) be another predicate, for example, I(x) = "x influences the future ecosystem," and consider:
For all x, if F(x) then S(x)
For all x, if S(x) then I(x)
If x has 30 letters in his/her name, then x probably has less than 25 letters in his/her last name.
These logical calisthenics make clear why probabilistic reasoning is hard. Put in
somewhat metaphorical mathematical language:
Logic is not continuous in certainty.
Arguments that are valid when the premises are known with certainty are not
"almost valid" when the premises are "almost certain." Premises that are equiva-
lent when known with certainty are not "almost equivalent" when the premises are
"almost certain." Rather, discontinuities arise, and just the slightest bit of
uncertainty can change a valid argument into an invalid one or can make
equivalent propositions inequivalent. This lies behind the problems with probabil-
istic reasoning noted in Chapter 1. This fact becomes especially important when
one attempts to represent experts' reasoning with uncertainty on a computer.
budgeted at about one billion dollars over 10 years (in comparison, Star Wars has
been projected to cost 30 billion in 5 years). The DARPA project in particular
envisions many futuristic applications of expert systems, including integrated battle
management systems with "speech input," a "pilot's associate" (a sort of R2D2 for
fighter pilots), and "autonomous land vehicles" for waging automated warfare.
[For a description of this project's goals see Stefik (1985).] In all these projects,
reasoning with uncertain information is essential.
Expert systems are being rapidly deployed in a wide variety of fields, and
many software houses now offer inexpensive "shells" allowing the user to build his
own expert system. The user fills in his own "rules" and supplies his own "certainty
factors," without being told what these might mean, and the shell is transformed
into an expert system that relieves the user of the task of reasoning with
uncertainty.
An entrance into the active areas of research regarding the representation of
uncertainty on intelligent machines can be gained from the February 1987 issue of
Statistical Science, in which some of the principal protagonists lock horns. It is not
the purpose of the ensuing discussion to pursue active areas of research in this
exciting field, but rather to review some of the standard literature and become
acquainted with the problems that researchers in this field are facing. Hence we
shall set our focus not on what is being contemplated and discussed, but what has
actually been implemented and evaluated. This entails, perhaps regrettably, that we
shall not discuss the theory of belief functions, as these are still in the discussion
phase. On the other hand, a final section of this chapter will look at the theory of
fuzzy sets, as these are being applied. For mathematical details the reader must
consult the original sources.
Computer-Aided Diagnosis
The first expert system was developed by E. Feigenbaum and B. Buchanan at the
Stanford Heuristic Programming Project, started in 1965. This system is called
DENDRAL and is used to identify organic compounds. The reasoning in this
system is largely nonprobabilistic. Probabilistic inference systems were initiated in
the early 1970s as an aid for diagnosis. The early systems were designed by decision
theorists with a background in probability, and the inference mechanisms used the
probability calculus. These early attempts were generally unsuccessful, and it is
essential to understand the reasons for this lack of success, before we can appreciate
the contribution which artificial intelligence techniques have made to the field. The
following discussion of these early systems follows Szolovits and Pauker (1978).
The generic decision problem in diagnosis can be described as follows. In a
given area we distinguish a number of possible diseases that a patient might have.
These are traditionally called "hypotheses," and will be denoted h_1, ..., h_n. In a
typical application there may be 15 such hypotheses. A patient may be given any of
m tests t_1, ..., t_m. We assume that the result of a test is either "yes" or "no." A test
may be anything from asking the patient's sex, to determining his/her response to a
treatment. In a typical application there may be 10 tests. A patient is given a
number of tests, in a particular order, and the results are recorded as a "case
history." Let Q denote a generic case history.
The generic decision problem is simply this: On the basis of a case history Q
determine which hypothesis is the most likely. For the decision theoretically
oriented analyst, this is simply a question of determining the "posterior proba-
bility" p(hi | Q). The theorem of Bayes (see Chap. 4) states that
for all i and j. Under this assumption, it can be shown that the number of
likelihoods in the typical problem is 300.4
2 There are m!/(m − j)! ordered sets of tests of length j, and each test can have one of two outcomes. This gives 2^j·m!/(m − j)! possible case histories of length j, for each hypothesis h_i. Summing over j and multiplying by n, we find that the number N of likelihoods is

    N = n · Σ 2^j·m!/(m − j)!   (the sum running over j = 1, ..., m)
3
The number of likelihoods N' under this assumption is now given by
4 Since, under conditional independence,

    p(Q | h_i) = Π p(tₖ | h_i)   (the product running over the test results tₖ recorded in Q)

it suffices to know the likelihoods for single tests. There are 10 tests, 15 diseases, and 2 possible outcomes of each test. This yields 300 likelihoods.
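The counts in these footnotes are easy to reproduce. A minimal sketch in Python, assuming n = 15 hypotheses, m = 10 tests, and two outcomes per test:

    from math import factorial

    n, m = 15, 10    # hypotheses and tests in the typical problem

    # Without any independence assumption: one likelihood per hypothesis per
    # ordered case history; case histories of length j number 2**j * m!/(m-j)!.
    N = n * sum(2**j * factorial(m) // factorial(m - j) for j in range(1, m + 1))

    # Under conditional independence only single-test likelihoods are needed:
    # one per (hypothesis, test, outcome) combination.
    N_ci = n * m * 2

    print("likelihoods without conditional independence:", N)
    print("likelihoods with conditional independence   :", N_ci)   # 300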
This is a workable number. The expert systems DENDRAL and MYCIN (to
be discussed presently) each contain about 400 "rules," and a rule in this context
may be roughly compared with a likelihood term. Everyone agrees that the
assumption of conditional independence is wrong. In fact, it has recently been
shown that if the hypotheses are mutually exclusive and jointly exhaustive, then
conditional independence implies that at most one case history can alter the prior
probabilities of the hypotheses (Johnson, 1986). The designers of the early systems
knew that conditional independence was a very strong assumption, but were
unaware of this particular problem. They were forced to adopt it because of
hardware constraints.
1. The matching score counts the number of expected findings of a hypothesis and its complementary hypotheses that are actually observed, and divides this by the total number of possible expected findings.
2. The binding score divides the number of observed expected findings of a
hypothesis and its complementary hypotheses by the total number of
observed findings.
The program calculates the likelihood scores of all hypotheses and inquires about
the status of not-yet-observed expected findings of all active hypotheses. After each
new finding the process is repeated. When all active hypotheses' expected findings
have been queryed, it starts with the expected findings of the complementary
hypotheses.
Regarding these likelihood scores, Szolovits and Pauker write "we must think
of them as an arbitrary numeric mechanism for combining information, somewhat
analogous to the static evaluation of a board in a chess-playing program"
(Szolovits and Pauker, 1978).
This will give an impression of the inference mechanism. The program has 38
completely developed hypotheses, of which 18 can be confirmed by "sufficient"
findings.
We see here a conscious attempt to mimic the reasoning of a doctor in a
diagnostic situation. The likelihood scores used to represent uncertainty bear no
relation to probabilities, and would not satisfy the axioms of probability theory. It
is plausible that this program more closely resembles the actual reasoning of
doctors than the decision theoretic programs discussed earlier.
This is not the place to evaluate this program as such. It is surely a very useful
tool for the task for which it was designed. However, it is appropriate to cast a
critical eye on the inference mechanism, abstracted from this particular application.
It is obvious that the scores are strongly influenced by the numbers of expected
findings. A hypothesis with few expected findings will have a hard time achieving a
high binding score, and a hypothesis with many expected findings may be placed at
a disadvantage with regard to the matching score. The number of findings may also
be rather ad hoc, reflecting the number of tests that have been devised for the
hypothesis in question.
More important, however, is the following. The program takes no account of
the prior probability of the hypotheses. Suppose there is a rare disease, say
Mongolian tongue-worm, whose initial symptoms are identical with those of the
common cold. In every session, the program would necessarily give these two
hypotheses the same likelihood. The designer could try to design this anomaly
away by adding the expected finding "recently visited Mongolia," but does this
solve the problem? Will the slight change in the matching score induced by this
addition compensate for neglecting the very low prior probability of Mongolian
tongue-worm, even among people who have recently visited Mongolia?
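The role of the neglected prior can be made concrete with Bayes' theorem. In the following Python sketch the priors and likelihoods are invented for illustration: even if the finding "recently visited Mongolia" strongly favors the rare disease in terms of likelihood, a sufficiently small prior keeps its posterior probability far below that of the common cold.

    # Hypothetical prior probabilities for a patient presenting these symptoms
    prior = {"common cold": 0.30, "Mongolian tongue-worm": 1e-7}
    # Hypothetical likelihoods of the observed findings (identical initial symptoms,
    # recent visit to Mongolia) under each hypothesis
    likelihood = {"common cold": 0.02, "Mongolian tongue-worm": 0.9}

    evidence = sum(prior[h] * likelihood[h] for h in prior)
    for h in prior:
        posterior = prior[h] * likelihood[h] / evidence
        print(f"{h:22s} posterior = {posterior:.2e}")
    # The rare disease remains more than four orders of magnitude less probable,
    # although a score based on likelihoods alone would favor it.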
Neglecting the prior probabilities is well known in probabilistic inference and
has been given the name "base rate fallacy" (prior probabilities are often termed
"base rates"). This is a typical feature of probabilistic thinking, which has no direct
analog in nonprobabilistic reasoning. In Chapter 4 we shall see that the base rate
fallacy is very common, in particular among doctors.
Knowledge Base
MYCIN operates with a "knowledge base" containing a dynamic data base and a
"rule base." The dynamic data base contains data of a patient fed in during a
consultation. The rule base contains rules of inference, similar to the inferences discussed earlier in this chapter. A typical rule has the following form:

Rule_i: If (i.1) the stain of the organism is gram-positive,
        and (i.2) the morphology of the organism is coccus,
        and (i.3) the growth conformation of the organism is chains;
Then: there is evidence (certainty factor = 0.7) that the identity of the organism is streptococcus.
The value of the certainty factor, 0.7, is obtained by giving an expert the following
prompt:
On a scale of 1 to 10 how much certainty do you affix to the conclusion based
on the evidence (i.1), (i.2), and (i.3)?
The answer to this prompt is divided by 10 to obtain the certainty factor.
The evidence (i.1), (i.2), and (i.3) may be present in the dynamic data base if a
culture has been taken from the patient. In this case, the conclusion "the organism
taken from the patient is streptococcus" would be added to the data base, with
certainty factor 0.7. Statements with certainty factors greater or equal to 0.2 may be
used as premises in new inferences.
The program will now be able to "fire" other rules, as the data base has been
enlarged. However, the certainty factor of the conclusions must be reduced to
reflect the uncertainty of the premises of the rules. MYCIN has a general
mechanism for compounding uncertainties in "if" clauses, which is derived from
the theory of fuzzy sets.
To illustrate this mechanism, let us consider a Rule_j with premises (j.1), (j.2), and (j.3), in the following layout:

Rule_j: If (j.1)
        and
        [(j.2) or (j.3)];
Then: conclusion, with certainty factor CF_j.
Let CF(j.1) denote the certainty factor attached to (j.1), etc. The fuzzy set rules applied in MYCIN calculate the certainty factors of conjunctions and disjunctions, as functions of the certainty factors of the conjuncts and disjuncts, as follows:

CF(P and Q) = min{CF(P), CF(Q)}
CF(P or Q) = max{CF(P), CF(Q)}

The above rules can be applied to calculate the certainty factor for the "if" clause of Rule_j. This latter number would then be multiplied by the number CF_j to give the certainty factor of the conclusion. This new conclusion is then entered in the data base, and the "inference engine" looks for new rules to fire.
One additional feature of MYCIN's inference mechanism deserves mention here. It may happen that the same conclusion can be drawn by firing two or more different rules. Suppose the same conclusion h can be drawn from Rule_i and Rule_j with certainty factors CF_i and CF_j, and suppose the premises of both rules are known with certainty. MYCIN would then attribute a "combined certainty factor" CF_{i+j} to the conclusion, defined (for positive certainty factors) as

    CF_{i+j} = CF_i + CF_j − CF_i·CF_j
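A minimal sketch of this propagation scheme in Python, assuming the min/max rules for "and" and "or", multiplication by the rule's own certainty factor, the 0.2 threshold for use as a premise, and the combination rule as reconstructed above (positive certainty factors only):

    def cf_and(*cfs):
        """Certainty factor of a conjunction (the fuzzy-set rule used by MYCIN)."""
        return min(cfs)

    def cf_or(*cfs):
        """Certainty factor of a disjunction."""
        return max(cfs)

    def fire_rule(rule_cf, premise_cf):
        """Certainty factor of a conclusion when a rule fires on an uncertain premise."""
        return rule_cf * premise_cf if premise_cf >= 0.2 else 0.0

    def combine(cf_i, cf_j):
        """Combined certainty factor when two rules yield the same conclusion."""
        return cf_i + cf_j - cf_i * cf_j

    # Rule_j: if (j.1) and [(j.2) or (j.3)] then conclusion with CF_j = 0.7
    premise_cf = cf_and(0.9, cf_or(0.4, 0.6))     # min(0.9, max(0.4, 0.6)) = 0.6
    cf1 = fire_rule(0.7, premise_cf)              # 0.42
    print("certainty factor from Rule_j:", cf1)
    print("combined with a second rule giving 0.7:", combine(0.7, cf1))   # 0.826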
Probabilistic Interpretation
In their initial publications, Buchanan and Shortliffe proposed a probabilistic interpretation for the certainty factors CF_i attaching to rules. Let E_i denote the "if" clause and h_i the conclusion of Rule_i. The CF_i was to be interpreted as

    CF_i = [p(h_i | E_i) − p(h_i)] / [1 − p(h_i)]

The right-hand side of the above expression is derived from the confirmation theory of the philosopher Rudolph Carnap (1950). However, Carnap emphasized that this could be used as a measure of increased belief in h_i on the basis of evidence E_i. MYCIN in effect takes this measure of increased belief as a measure of belief. Buchanan and Shortliffe note, however, that if p(h_i) is small, as one would expect at the beginning of a consultation, then CF_i is approximately equal to the conditional probability p(h_i | E_i).
Why not just use p(h_i | E_i) instead of CF_i? Buchanan and Shortliffe claim that doctors just don't reason in such a way that the "certainty" of an hypothesis and the "certainty" of its negation add to one, as is the case with (conditional) probabilities. Another reason may be found in the following fact. The conditional probability p(h | E_i and E_j) cannot be calculated as a function of p(h | E_i) and p(h | E_j) alone. Even assuming independence and conditional independence (under h) of the evidences, we cannot do better than5

    p(h | E_i and E_j) = p(h | E_i)·p(h | E_j) / p(h)

which still requires knowledge of the prior probability p(h).
Critique
It is hard to find anyone presently willing to defend the probabilistic interpretation
of certainty factors, or willing to specify any other interpretation. That a formalism
that no one is willing to defend is being commercialized on such a massive scale is a
fact worthy of contemplation.
The probabilistic interpretation came under sharp criticism rather quickly.
Adams (1976) proved that the combination rule for CF_{i+j} given above, in combination with the probabilistic interpretation, is equivalent to assumptions on the evidences E_i and E_j (h being the common conclusion of Rule_i and Rule_j) that are even stronger than the conditional independence assumed by the decision theoretic approaches discussed earlier. He also showed that the choice of certainty factors in this
situation is severely constrained. If the prior probability p(h) is , and if CFi = 0.7,
then CFj must be less than 0.035. Expert system shells contain no checks for
consistency in this sense, and it is doubtful if the numbers actually used in any
expert system satisfy this constraint. This sort of consistency check is not
incorporated into the production rule systems. Cendrowska and Bramer (1984)
further showed that in some cases the order in which the rules are fired can affect
the resulting certainty factors.
MYCIN's advice has been evaluated in a sort of "Turing test" (Yu et al., 1979)
("try to pick out the computer in a blind experiment"). Eight doctors and MYCIN
gave advice for treatment of 10 cases. The advice was evaluated blind by eight
specialists. Thus, in total there were 80 evaluations per "expert." Thirty-five percent
of MYCIN's evaluations were judged "unacceptable," and this was lower than the
corresponding percentage for the eight human experts.
Simulations described by Shortliffe and Buchanan (1984) indicate that
MYCIN's treatment recommendations are very insensitive to the numerical values
of the certainty factors. When the values of the certainty factors were coarse
grained, so that MYCIN distinguished only three different values, a difference in
diagnosis resulted in 5 of 10 cases, but a difference in treatment recommendation in
only 1 of 10 cases. Moreover, conclusions are often drawn on the basis of rules
whose "if" clauses were all certain, so that the propagation of uncertainty is not
really important for the conclusions.
It must be emphasized that this latter conclusion is peculiar to the specific
application for which MYCIN was designed. Reading Shortliffe and Buchanan
(1984), one gets the impression that the designers fiddled around with the certainty
factors and with the combination rules until the recommendations made sense. Of
course this is an eminently sensible thing to do. However, it suggests that the
MYCIN model is not a generic solution to the problem of reasoning under
uncertainty, but another ad hoc solution.
FUZZINESS
Many expert systems use the theory of fuzzy sets to propagate uncertainty, and it is
appropriate to make a few remarks on fuzzy sets. The theory of fuzzy sets was
introduced by Zadeh (1968). A recent exposition can be found in Zadeh (1986). A
well-received and frequently cited introduction to the literature is Dubois and
Prade (1980). For a pointed critique, see French (1984, 1987).
There is an immense literature on fuzzy sets, which will not be reviewed here. It
is scarcely possible to speak of a "theory" of fuzzy sets, as writers in this field have
been unable to agree on a definition of a fuzzy set, and unable to agree on rules for
operating with fuzzy sets. Indeed, Zadeh has said that it is not in the spirit of fuzzy
sets to ask for precise definitions. We shall confine ourselves to a brief discussion of
the philosophy behind this development and a few remarks on the combination
rules applied in MYCIN. These are the rules most frequently encountered in
applications. The exposition will draw on Zimmermann (1987).
To introduce fuzziness, we may distinguish between uncertainty and ambiguity.
Uncertainty is understood to denote a state of partial knowledge regarding (future)
observations; ambiguity denotes a state of partial knowledge regarding the
meaning of terms in the language. Put differently, uncertainty is that which is
removed by observation, ambiguity is that which is removed by linguistic
convention.
Many writers regard subjective probability as a representation of uncertainty
in the above sense. Many fuzzy set adherents would regard fuzziness as a
representation of ambiguity. Everyone would agree that observations can some-
times be ambiguous, hence the two notions overlap. Three questions will be briefly
addressed.
CONCLUSION
Logical thinking is not always as simple as it seems. There are many situations in
which most of us will make elementary logical mistakes if we do not pay close
attention. Probabilistic thinking is much more subtle and tricky than ordinary
logical thinking. Logical validity is not continuous in certainty; rather, uncertainty
forces us to reason in ways that are not simple transcriptions of logical reasoning.
Moreover, validity of probabilistic arguments cannot be assessed without perform-
ing calculations.
For deterministic reasoning we have a well-developed theory of logical
inference that helps us track down and repair errors in thinking. We also need such
a theory for reasoning under uncertainty.
This need is felt most acutely by designers of artificial intelligence systems that
model scientific reasoning under uncertainty. Such systems have tried to model
Many important issues remain unclarified. Most measures used to score calibra-
tion are mathematically dubious, at best. Failure to appreciate this fact has led to
inappropriate experimental designs whose results must remain ambiguous. This in
turn has clouded the relation between calibration and expertise or knowledge.
Finally, the emphasis on calibration has led to neglect of other important aspects of
probability assessments, in particular, the information contained in such
assessments.
The above issues are studied in detail in Part II, but it is important to mention
them before examining the experimental literature. The results reported in this
chapter do not suffer from the problems mentioned above, as far as one can tell
from their descriptions in the literature.
AVAILABILITY
When asked to estimate the size of a class (e.g., the number of automobile deaths in
a year) subjects tend to base their estimates on the ease with which members of the
class can be retrieved from memory. Hence, the frequency with which a given event
occurs is usually estimated by the ease with which instances can be recalled. In one
study, for example, subjects were told that a group of 10 people are forming
committees of different sizes. When asked to estimate the number of committees
with two members, the median estimate was 70, for eight members the median
estimate was 20 (Tversky and Kahneman 1982c). Of course, there are just as many
committees with two members as with eight members, as the eight people not
included in the committee of two, can form their own committee. The responses
indicate a bias. The correct number for two and eight is 45. The cause of this type of
estimation is presumably the following. It is much easier to imagine committees of
two and to imagine grouping the ten people into different groups of two, than it is
to imagine groups of eight. Committees of size two are "more available.' to the
subject.
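The correct count follows from the fact that choosing a committee of two is the same as choosing the eight people who are left out, so the two numbers of committees must coincide. A one-line check in Python:

    from math import comb

    print(comb(10, 2), comb(10, 8))   # 45 45: as many two-member as eight-member committees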
Another example along the same lines is given by the permutation experiment
(Tversky and Kahneman 1982a). Subjects are given the following text:
Consider the two structures A and B which are displayed below:
(A) (B)
XXXXXXXX XX
XXXXXXXX XX
XXXXXXXX XX
XX
XX
XX
XX
XX
XX
A path in a structure is a line that connects an element in the top row with an element in
the bottom row, and passes through one and only one element in each row. In which of
the two structures are there more paths? How many paths do you think there are in
each structure?
In a typical experiment, 46 of the 54 subjects thought there were more paths in (A). The median estimate for the number of paths in (A) was 40, and in (B), 18. In fact, there are 8³ paths in (A) and 2⁹ paths in (B), and 8³ = 2⁹ = 512.
Why do people see more paths in (A) than (B)? Kahneman and Tversky
speculate that it is much easier to "imagine" paths through three points than paths
through nine points, hence the paths in (A) are more available. Another question is
why the subjects underestimate the number of paths so severely.
Perhaps the best-known instance of the availability heuristic involves the
perception of risks. When asked to estimate the probabilities of death from various
causes, subjects typically overestimate the risks of "glamorous" and well-publicized
causes (botulism, snake bite) and typically underestimate "unglamorous" causes
(stomach cancer, heart disease). Figure 4.1 is typical of the subjects' responses to
these types of questions.
Figure 4.1 Relationship between judged frequency and the actual number of deaths per year for 41 causes of death. If judged and actual frequencies were equal, the data would fall on the straight line. The points, and the curved line fitted to them, represent the average responses of a large number of lay people. As an index of the variability across individuals, vertical bars are drawn to depict the 25th and 75th percentiles of the judgments for botulism, diabetes, and all accidents. The range of responses for the other 37 causes of death was similar. (Slovic, Fischhoff, and Lichtenstein, 1982)

ANCHORING
REPRESENTATIVENESS
When asked to judge the conditional probability p(A \ B) that event A occurs given
that B has occurred, subjects seem to rely on an assessment of the degree of
similarity between events A and B. It is easy to see that this heuristic can lead to
very serious biases. Indeed, similarity is symmetrical. The degree to which A
resembles B is also the degree to which B resembles A. However, conditional
probabilities are not symmetrical. Applying the definition of conditional proba-
bility (the first equality):
lifeless. In school, he was strong in mathematics but weak in social studies and
humanities.
Please rank in order the following statements by their probability, using 1 for the most
probable and 8 for the least probable:
A Bill is a physician who plays poker for a hobby.
B Bill is an architect.
C Bill is an accountant.
D Bill plays jazz for a hobby.
E Bill surfs for a hobby.
F Bill is a reporter.
G Bill is an accountant who plays jazz for a hobby.
H Bill climbs mountains for a hobby.
(Tversky and Kahneman 1982b)
The text is constructed such that C, being an accountant, is most representative of the description X. If the subject ranks C higher than E, this means that he regards the conditional probability p(C | X) as larger than p(E | X). In other words, he judges that

    p(C | X) / p(E | X) > 1

He has probably reached this conclusion by the same psychological procedure that he would follow were he asked to rank the conditional probabilities p(X | C) and p(X | E). However,

    p(C | X) / p(E | X) = [p(X | C)·p(C)] / [p(X | E)·p(E)]
To perform this task correctly, he must also weigh the base rates p(C) and p(E).
From experiments similar to the above, it is known that subjects generally neglect
the base rates. In fact, there are many more people in class E than in class C. Even if
the subjects are given this information in the experiment, they fail to use it.
Estimates of p(C | X) and p(E | X) are not influenced by information regarding the base rates. Clearly, if p(E) is much larger than p(C), and if p(X | C) is of the same order as p(X | E), then the above ratio will be less than one.
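A small numerical illustration (the numbers are invented) makes the role of the base rates explicit: even if the description fits class C several times better than it fits class E, a large enough difference in base rates reverses the ranking.

    # Hypothetical base rates and "fit" probabilities
    p_C, p_E = 0.01, 0.20                    # base rates of class C and class E
    p_X_given_C, p_X_given_E = 0.30, 0.05    # how well the description X fits each class

    ratio = (p_X_given_C * p_C) / (p_X_given_E * p_E)   # p(C|X) / p(E|X)
    print(f"p(C|X)/p(E|X) = {ratio:.2f}")               # 0.30: E is the more probable class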
Another interesting aspect of this experiment reveals the strength of the
representativeness heuristic. Event G is the conjunction of events C and D. Event C
is highly representative for X, and event D is highly unrepresentative. Event G is of
intermediate representativeness. Subjects using the representativeness heuristic will
therefore rank G between events C and D. However, p(D | X) ≥ p(G | X).
In the experiment cited above 87% of the "statistically naive" subjects ranked
G higher than D. The experiment was repeated with subjects taken from graduate
students in the Stanford Business School who had had several advanced courses in
probability and statistics. Eighty percent of these "statistically sophisticated"
subjects ranked G higher than D.
Similar results were found in the experiment of Thys (1987). Experienced
operators of sophisticated technical systems did not differ significantly from their
inexperienced colleagues in performance on this type of test. About 80% ranked the
subset (G) above the superset (D). Moreover, this same pattern was observed on
items relating to their technical expertise.
The representativeness heuristic also leads people to ignore effects due to
sample size. For example, if a subject is told that an urn contains 500 white and 500
red balls, and is asked to assess the probability of drawing at least six white and
at most four red balls in ten draws, then he will judge this probability by comparing the
ratios 4/10 and 500/1000. The same ratios will be compared if he is asked to assess
the probability of drawing at least 60 white and at most 40 red balls. For him the
10-ball sample and the 100-ball sample are equally representative of the population
being sampled, and he will tend to equate their probabilities. This, of course, is
wrong. The 60/40 sample is much less likely than the "6/4 sample."
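To see how misleading the matching of ratios is, the two probabilities can be computed directly. The short Python sketch below does so with the binomial distribution, treating the draws as independent with probability 1/2 of white (a reasonable approximation for a 1000-ball urn).

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# At least 6 white in 10 draws versus at least 60 white in 100 draws.
print(prob_at_least(6, 10))    # ~0.377
print(prob_at_least(60, 100))  # ~0.028
```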
The following experiment illustrates this effect (Tversky and Kahneman,
1982c):
A certain town is served by two hospitals. In the larger hospital about 45 babies are
born each day, and in the smaller hospital about 15 babies are born each day. As you
know, about 50% of all babies are boys. However, the exact percentage varies from day
to day. Sometimes it may be higher than 50%, sometimes lower.
For a period of 1 year, each hospital recorded the days on which more than 60% of
the babies born were boys. Which hospital do you think recorded more such days?
Of the 95 subjects responding to this question, 21 opted for the larger hospital, 21
for the smaller hospital, and 53 thought that both hospitals recorded about the
same number of such days. Of course, the smaller hospital is much more likely to
see more than 60% boys on any given day.1
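The footnote to this passage gives the variance argument; the following minimal sketch makes the same point numerically, computing for each hospital the probability that more than 60% of a day's births are boys (assuming independent births with probability 1/2 each, and taking 15 and 45 births per day as in the problem).

```python
from math import comb, floor

def prob_more_than_60_percent_boys(births, p=0.5):
    """P(number of boys > 0.6 * births) for Binomial(births, p)."""
    threshold = floor(0.6 * births)  # need strictly more than 60%
    return sum(comb(births, k) * p**k * (1 - p)**(births - k)
               for k in range(threshold + 1, births + 1))

print(prob_more_than_60_percent_boys(15))  # smaller hospital, ~0.15
print(prob_more_than_60_percent_boys(45))  # larger hospital,  ~0.07
```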
CONTROL
Subjects tend to act as if they can influence situations over which, objectively
speaking, they have no control whatsoever. This may lead to distorted probability
assessments.
In one experiment (Langer, 1975), 36 undergraduates at Yale University were
divided into two groups and each group was placed in the following betting
situation. Each participant was asked to play a simple betting game with one of the
two opponents. The game consisted of cutting a deck of cards. The subject
determined the stakes that would be wagered (even money) and whoever cut the
highest card would win. One opponent was instructed to be shy and insecure, and
the other opponent was instructed to be confident and self-possessed. It was
conjectured that the subjects would think they had a better chance of winning
against the insecure opponent, and that this would be reflected in the amount of
money they were willing to bet. The maximum bet was set at $.25, and each subject
played the game four times. The median bet against the insecure opponent was
$.16, whereas against the confident opponent this was $.11.
1
Regarding the births as independent tosses of a coin with probability p of "heads" (for "boy"), the
variance of the percentage Sn/n of heads in n tosses is p(1 − p)/n. Since this variance decreases with n, the
probability of a given deviation from the mean value of Sn/n decreases in n.
BASE RATE FALLACY
The base rate fallacy is one of the most common and most pernicious biases
involved in subjective probability assessments. It arises in a wide variety of contexts
involving expert opinion. The best way of learning to recognize this fallacy is to
study several examples.
Medicine
In the field of medicine, the base rate fallacy may be described as a veritable
epidemic. A study by Eddy (1982) discusses several instances of this fallacy in
medical textbooks. One example involves the diagnosis of breast cancer.
A definitive diagnosis for breast cancer is accomplished by means of a biopsy,
a surgical operation (usually under complete anesthesia) in which a portion of
suspicious breast mass is removed. This is not a trivial operation and is usually
performed on the basis of a mammogram, that is, an x-ray photograph of the breast
mass.
The symptoms that may lead a physician to order a mammogram may be
caused by any number of disorders. It is stated that the frequency of malignant
cancer of the breast among women complaining of a painful hardening in the
breast is about 1/100. Numerous medical sources estimate the accuracy of
mammography at about 90%. This means that the probability of a positive x-ray
result given a malignant cancer is about 90% and the probability of a negative x-
ray result given no malignant cancer is about 90%.
In deciding whether to order a biopsy given a positive x-ray result in a patient
complaining of a painful hardening of the breast, the physician must estimate the
conditional probability of cancer given a positive result, p(C | +). Using Bayes'
theorem:

p(C | +) = p(+ | C) p(C) / p(+)

The base rate p(C) is about 1/100. The base rate p(+) can be calculated as

p(+) = p(+ | C) p(C) + p(+ | C') p(C') = (0.9)(0.01) + (0.1)(0.99) = 0.108

This yields

p(C | +) = (0.9)(0.01)/0.108 ≈ 0.08
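A direct computation confirms how small this probability is. The sketch below simply evaluates Bayes' theorem with the numbers quoted in the text (base rate 1/100, sensitivity and specificity both 0.9).

```python
def posterior_cancer_given_positive(base_rate=0.01, sensitivity=0.9, specificity=0.9):
    """p(C | +) by Bayes' theorem."""
    p_positive = sensitivity * base_rate + (1 - specificity) * (1 - base_rate)
    return sensitivity * base_rate / p_positive

print(posterior_cancer_given_positive())  # ~0.083
```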
Simulator Training
The following example resulted from an analysis of a simulator training program in
The Netherlands performed by members of the Department of Mechanical
Engineering at the Delft University of Technology (Thys, 1987).
A simulator training program was developed at a research center to train
operators of large technical installations. A pilot group of trainees was selected
from operators currently employed at the type of installation for which the training
was designed. Part of the training program worked as follows. The simulator would
choose a particular malfunction from its library of malfunctions, and the approp-
riate instrument readings would appear on the simulator's control panel. The
trainee had to diagnose the malfunction from the instrument readings. For
example, the malfunction might be an oil leak and the symptoms on the instrument
panel would be an alarm indicating oil underpressurization and high temperature
readings for the coolant of a particular component. The same symptoms might also
be caused by failure in the oil pump. The engineering knowledge that the trainees
were being taught enabled them to assess the probabilities that various mal-
functions would cause various meter readings.
The frequencies of various malfunctions are determined by installation-
specific factors such as maintenance regime, load, weather, supplier, and design.
Each installation has its own particular signature of base rate frequencies. The oil
pump may frequently have given trouble in the past, or the oil lines may be of
inferior quality. The designers of the simulator program could not know this
signature of the trainee's particular installation, and they therefore chose to
confront trainees with malfunctions chosen at random.
An initial evaluation of the results of the training program indicated that the
trainees performed worse on particular points after completing the course, than
before. The problem was analyzed as follows. Let M be a particular malfunction,
and R a particular reading on the control panel. The operator's knowledge enabled
him to assess p(R \ M), whereas his diagnostic task required him to assess p(M \ R).
Let M' be another malfunction which might cause R. In choosing between M and
M', the operator has to compare p(M | R) and p(M' | R). It is not difficult to see that

p(M | R)/p(M' | R) = [p(R | M)/p(R | M')] × [p(M)/p(M')]

Since malfunctions are chosen randomly on the simulator, the base rates p(M) and
p(M') were equal on the simulator. Hence

p(M | R)/p(M' | R) = p(R | M)/p(R | M')    on the simulator
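The effect can be illustrated numerically. In the sketch below the likelihood ratio p(R | M)/p(R | M') is held fixed while the base rates are varied; the installation-specific frequencies are purely hypothetical, chosen only to show how the posterior ratio that is correct on the simulator (equal base rates) can point to the wrong malfunction in the field.

```python
def posterior_ratio(likelihood_M, likelihood_M_prime, base_M, base_M_prime):
    """p(M | R) / p(M' | R) = [p(R | M) / p(R | M')] * [p(M) / p(M')]."""
    return (likelihood_M / likelihood_M_prime) * (base_M / base_M_prime)

# Hypothetical numbers: the reading R is three times as likely under an oil leak (M)
# as under an oil-pump failure (M'), but at this installation pump failures are
# ten times as frequent as leaks.
print(posterior_ratio(0.9, 0.3, base_M=0.5, base_M_prime=0.5))   # simulator: 3.0, favors M
print(posterior_ratio(0.9, 0.3, base_M=0.01, base_M_prime=0.1))  # installation: 0.3, favors M'
```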
Empirical Bayesians
The final example of the base rate fallacy is a bit embarrassing, as it is committed by
Bayesians and seems to have been going on in risk analysis for several years (Martz
and Bryson, 1983). It has been used to estimate the spread of failure probabilities in
a given population on the basis of the spread in expert estimates of these
probabilities (Martz, 1984).
In risk analysis, experts are called upon to estimate failure rates of compo-
nents. Such failure rates are commonly assumed to be lognormally distributed, that
is, the log of the failure frequency per unit time is assumed to be normally
distributed over the population of similar components. For simplicity, we assume
that the experts are asked to estimate log failure frequencies.
Suppose an expert estimates a log failure frequency as y. As he is not certain
this estimate is correct, he gives some indication of his subjective uncertainty. The
usual way of doing this is to give the standard deviation σ of his distribution over
the possible values of the log failure frequency. The usual interpretation is this: The
expert's subjective probability distribution for the log failure frequency in question
is a normal distribution with mean y and standard deviation σ, notated N(y, σ).
The "empirical Bayes" methodology involves treating the expert's responses
themselves as random variables. In other words, we assume there is a probability
density p(x, y, σ) giving the probability density that the true value is x and that the
expert estimates this value as y with standard deviation σ. The expert may be
subjected to various biases, but in the simplest case he is unbiased. In this case the
empirical Bayes methodology assumes that if the true value is x and the expert
gives standard deviation σ, then the probability that he estimates x as y, p(y | x, σ), is
normal with mean x and standard deviation σ:

p(y | x, σ) = (1/(σ√(2π))) exp(−(y − x)²/2σ²)

Now what does it mean in this context to say that the expert is unbiased? The
empirical Bayesians have not given a definition; but it seems reasonable to say this:
If an unbiased expert's subjective distribution is normal with mean y and standard
deviation σ, with y and σ fixed, for a large number of log failure frequencies, then if
we examine the true values for these frequencies, they will indeed be normally
distributed with mean y and standard deviation σ.² For such an expert,

p(x | y, σ) = (1/(σ√(2π))) exp(−(x − y)²/2σ²)

The normal density depends only on the squared distance to the mean and the
standard deviation, hence

p(y | x, σ) = p(x | y, σ)

By Bayes' theorem, p(x | y, σ) = p(y | x, σ) p(x | σ)/p(y | σ), so this equality forces
p(x | σ) = p(y | σ) for all x and y, which in turn entails that both p(y | σ) and p(x | σ) are the improper uniform
density.³ As this holds for all values of σ, it follows that p(y) is also the improper
uniform density. In addition to being 'improper,' this is inconsistent with other
modeling assumptions made by the empirical Bayesians and highly implausible.
CALIBRATION
Calibration will be treated systematically in Part II. We content ourselves here with
a rough provisional definition: A subjective assessor is well-calibrated if for every
probability value r, in the class of all events to which the assessor assigns subjective
probability r, the relative frequency of occurrence is equal to r. Calibration
represents a form of empirical control on subjective probability assessments.
This concept was introduced in Chapters 1 and 2. In Chapter 2, for example,
2
Martz (1986) seems not to accept this interpretation of unbiasedness, but does not propose
another (see also Apostolakis 1985, and Cooke, 1986).
3
Improper probability densities are not 'real' probability densities, as the total probability is
infinite.
we saw that the precursor risk studies of nuclear plants could be used to calibrate
the probabilistic assessments in the Reactor Safety Study.
Calibration can be measured in two types of test, discrete tests and fractile or
quantile tests. In a discrete test the subject is presented with a number of events. For
each event, he is asked to state his probability that the event will occur. His
probabilities are discretized, either by himself or by the experimenter, such that
only a limited number of probability values are used. For example, his probabilities
may be rounded off to the nearest 10%. It is helpful to think of the subject as
throwing events into "probability bins," where, for example, the 20% probability
bin contains all those events to which the subject attributes (after discretization) a
probability of 20%. The subject is well calibrated if the relative frequency of
occurrence in each probability bin is equal to the corresponding bin probability.
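A minimal sketch of how such a discrete calibration check might be carried out: assessed probabilities are rounded to the nearest 10%, as in the example above, and the relative frequency of occurrence in each bin is compared with the bin probability. The data below are invented for illustration.

```python
from collections import defaultdict

def calibration_by_bin(assessments, outcomes):
    """Round each assessed probability to the nearest 10% ("probability bin")
    and return the observed relative frequency of occurrence per bin."""
    bins = defaultdict(list)
    for p, occurred in zip(assessments, outcomes):
        bins[round(p, 1)].append(occurred)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

# Invented example: events assessed at various probabilities, 1 = occurred.
probs    = [0.1, 0.1, 0.2, 0.2, 0.2, 0.8, 0.8, 0.9, 0.9, 0.9]
occurred = [0,   0,   1,   0,   0,   1,   1,   1,   1,   0]
print(calibration_by_bin(probs, occurred))
```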
Quantile tests are used when the subject is required to assess variables with a
continuous range (continuous for all practical purposes). For example, we may be
interested in the maximal efficiency of a new type of engine under certain operating
conditions. No one would predict this value with certainty, but an expert will
typically be able to give a subjective distribution over the real line, reflecting his
uncertainty. We can learn something about this distribution by asking the
following type of question: "For which x is your probability 25% that the engine's
efficiency is less than or equal to x?" The expert's answer is called his 25% quantile.
In a risk analysis we are typically interested in 5%, 50% and 95% quantiles, so these
are often elicited from experts. Another popular choice is 1%, 25%, 50%, 75%, 99%.
Suppose we ask an expert for his 1%, 25%, 50%, 75%, and 99% quantiles for a
large number of variables, for which the actual values later become known. If the
expert is well calibrated, then we should expect that approximately 1% of the true
values fall beneath the 1% quantiles of their respective distributions, roughly 24%
should fall between the 1% and the 25% quantiles, etc. The interquartile range is the
interval between the 25% and the 75% quantiles. We should expect 50% of the true
values to fall within the interquartile ranges. The surprise index is the percentage of
true values that fall below the lowest or above the highest quantile. For the
quantiles given above, we should expect this to occur in 2% of the cases.
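The bookkeeping for a quantile test is equally simple. The sketch below takes, for each variable, the elicited 1%, 25%, 75%, and 99% quantiles together with the realized value, and reports the fraction of realizations in the interquartile range and the surprise index (the 50% quantile is not needed for these two summary measures); the numbers are invented for illustration.

```python
def quantile_test(assessments, realizations):
    """assessments: list of dicts keyed by 0.01, 0.25, 0.75, 0.99 (elicited quantiles);
    realizations: the values later observed. Returns the fraction of realizations
    inside the interquartile range and the surprise index."""
    in_iqr = surprises = 0
    for q, x in zip(assessments, realizations):
        if q[0.25] <= x <= q[0.75]:
            in_iqr += 1
        if x < q[0.01] or x > q[0.99]:
            surprises += 1
    n = len(realizations)
    return in_iqr / n, surprises / n

# Invented example for three variables (well-calibrated answers: ~0.5 and ~0.02).
quantiles = [{0.01: 1, 0.25: 4, 0.75: 8, 0.99: 12},
             {0.01: 0, 0.25: 2, 0.75: 5, 0.99: 9},
             {0.01: 10, 0.25: 20, 0.75: 40, 0.99: 80}]
observed = [6, 11, 15]
print(quantile_test(quantiles, observed))
```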
Interest in calibrating subjective probabilities seems to have arisen in
meteorology. National Weather Service forecasters have been expressing their
predictions of rain in probabilistic terms since 1965. An enormous amount of data
has been collected and analyzed for calibration. For example, Murphy and Winkler
(1977) analyzed 24,859 precipitation forecasts in the Chicago area over 4 years. The
results shown in Figure 4.2 indicate excellent calibration. It is reported in
Lichtenstein, Fischhoff, and Phillips (1982) that the more recent calibration data is
even better.
The data from weather forecasters are highly significant for two reasons. First,
they show that excellent expert calibration is possible, and second, closer analysis
reveals that expert calibration improves as the results of calibration measurements
are made known. Murphy and Daan (1984) studied the learning effect in a 2-year
experiment in The Netherlands. Figure 4.3 compares the overall calibration of the
forecasters in the first and second years of the experiment.
The experiment on which Murphy and Daan report was continued up to 1987.
Figure 4.2 Calibration data for precipitation forecasts. The number of forecasts is shown for
each point. (Murphy and Winkler, 1977)
Figure 4.3 Comparison of calibration of weather forecasters before (1980-1981) and after
(1981-1982) receiving feedback on their own performance. The number of forecasts in each
year is shown as n. (Murphy and Daan, 1984)
Figure 4.4 Relation between physician's subjective probability of pneumonia and the actual
probability of pneumonia. (Christensen-Szalanski and Bushyhead, 1981}
The raw data have recently been obtained and analyzed using the models for
combining expert opinion developed in Part III. In Chapter 15 we shall see how
this combination leads to superior probabilistic forecasts.
Expert calibration is not always so good. Figure 4.4 shows calibration for
Army physicians for diagnoses of pneumonia (Christensen-Szalanski and Bushy-
head, 1981). The calibration is described as "abysmal." For subjective probability
of 90% the corresponding relative frequency is about 20%. It is natural to suppose
that the doctors are "erring on the side of safety" in their predictions in these cases.
The experimenters were aware of this possibility and instructed the doctors to try
to avoid this. Moreover, the doctors were asked to evaluate the effect of making
true and false diagnoses given that the patient does or does not have the disease.
The mean results, shown in Figure 4.5, indicate that failing to diagnose pneumonia
when the patient in fact had the disease was slightly preferable to diagnosing the
disease when the patient in fact was healthy.
Figure 4.5 Physicians' mean values for outcomes of pneumonia diagnosis (PnDx) decision.
(Christensen-Szalanski and Bushyhead, 1981)
Table 4.1 Calibration Summary for Quantile Tests (among the studies summarized is Pratt, 1975, whose surprise indices are characterized as "astonishingly high/low")
Figure 4.6 Predictors' interquartile ranges and best estimates of added height to failure.
(Hynes and Van Marcke, 1976)
CONCLUSION
There is good news and bad news. The good news is that the probabilistic
representation of uncertainty provides clear criteria for evaluating subjective
probability assessments. It is scarcely conceivable that a subject would wish to
persist in any of the biases discussed in this chapter once these were brought
to his/her attention. This means that subjective probabilities are not beyond the
pale of rational discourse.
The bad news is that people, even experts, do not handle subjective
probabilities with any great aplomb. There is ample room for improvement, and
Building Rational Consensus
The foregoing four chapters have explored a very wide range of subjects, all
connected with the use of expert opinion in science. Part II is devoted to assembling
the mathematical modeling tools that will be needed in Part III. Before embarking
on mathematical modeling, it is useful to gather some conclusions from the
previous chapters and to form a preliminary picture of what we would like to
accomplish in Part III. Chapter 1 ended with a question: what does it mean to
represent uncertainty adequately for purposes of science? We shall not answer this
question here. Rather, we return to the fundamental aim of science and derive some
general principles that a methodology for using expert opinion in science should
satisfy. At the same time, we review the existing practice, as it has evolved, and
determine whether science's fundamental aims are presently being served.
RATIONAL CONSENSUS
Science aims at rational consensus, and the methodology of science must serve to
further this aim. Were science to abandon its commitment to rational consensus,
then its potential contribution to rational decision making would be compromised.
Since the authoritative Lewis report (Lewis, 1979), the use of experts'
subjective probabilities in risk assessment has been officially sanctioned and
universally recognized. Nonetheless, a subjective probability is "just someone's
opinion." Traditional scientific methodology does not explicitly accommodate the
use of opinions as scientific data.
The Lewis report did not address the question of how subjective probabilities
should be used in science. Experience with expert opinion to date has demonstrated
both the potential value and the potential dangers of expert subjective probability
assessments. In short, the use of subjective probabilities could further the aim of
rational consensus, but could also obstruct this aim.
One conclusion from the foregoing chapters is overwhelmingly evident.
Expert opinion may, in certain circumstances, be a useful source of data, but it is
not a source of rational consensus. Given the extreme differences of opinion
PRINCIPLES
Reproducibility
It must be possible for scientific peers to review and if necessary reproduce all
calculations. This entails that the calculational models must be fully specified
and the ingredient data must be made available.
It goes without saying that reproducibility is an essential element of the scientific
method. Nevertheless, there is no existing study that fully respects this principle.
The Reactor Safety Study (1975) faithfully reproduced the data from experts,
including the names of the individuals and/or institutions from which the data
originated. However, these data were synthesized in a way that was entirely
inscrutable. Hence, it was impossible for reviewers to reproduce the resulting
assessments. The SNR-300 risk study (Hofer, Javeri, and Loffler, 1985) uses
weighted combinations of expert opinions. However, the weights are not reported.
Again, it is impossible to reproduce the calculations.
Accountability
The source of expert subjective probabilities must be identified.
The notion of accountability is essential to science. Articles for scientific journals
are not considered for publication unless the author is identified. In the present
context, accountability entails that the decision maker can trace every subjective
probability to the name of the person or institution from which it comes. In the
cases of public decision making, this information must be made public.
The controversial Reactor Risk Reference Document (Office of Nuclear
Regulatory Research, 1987) is cited as an example where expert anonymity resulted
in controversial assessments (the December 1988 issue of Reliability Engineering
and System Safety, devoted to expert opinion in risk analysis, was prompted by this
controversy). The Seismic Hazard Characterization of the Eastern United States
(Bernreuter et al., 1984) publishes the names of the experts consulted. However, the
individual assessments are associated with the number of an expert, and the reader
is not told which name corresponds to which number. On the other hand, the
benchmark studies from the European Joint Research Center at Ispra (Amendola,
1986, Poucet, Amendola, and Cacciabue, 1987) identify the experts by name and
give their individual assessments.
A potentially very serious breach of accountability occurs in the various
expert system methodologies reviewed in Chapter 3. The expert opinions output by
Empirical Control
Expert probability assessments must in principle be susceptible to empirical
control.
The requirement of empirical control through observations is universally re-
cognized as a cornerstone of scientific methodology. In contemporary metho-
dology this requirement is given the following expression: Scientific statements and
scientific theories should be falsifiable in principle. It is recognized that theories can
never be conclusively verified, but at least it should be possible in principle to
discover a reproducible conflict with observations, if the theory is in fact false. The
necessary experiments need not be feasible, but they must be physically possible.
In the same spirit, a methodology for using expert opinion should incorporate
some form of empirical control, at least in principle. In other words, it must be
possible in principle to evaluate expert probabilistic opinion on the basis of
possible observations. Moreover, this evaluation should be reflected in the degree
to which an expert's opinion influences the end results.
Empirical control ensures that the use of subjective probabilities cannot be
construed as a license for the expert to say anything whatever. Without empirical
control it is easy to argue that "one subjective probability is as good as another,"
and subjective probabilities would be of very limited value in reaching rational
consensus. Further, the expert would be deprived of a weapon for resisting the
institutional and psychological pressures that may well attend the assessments of
critical quantities in risk analysis.
There has been no application of expert opinion in risk analysis that utilizes
the possibilities of empirical control. Most notable in this respect is the use of
expert opinion in Wheeler et al. (1989). After extensive elicitation procedures,
expert opinions are combined by simple arithmetical averaging. However, the
published literature (Morris and D'Amore, 1981) indicates that such possibilities
are utilized in the field of military intelligence. Some of the Bayesian models used in
risk analysis (Apostolakis, 1985) incorporate empirical control via Bayesian
updating methods.
Neutrality
The method for combining/evaluating expert opinion should encourage
experts to state their true opinions.
A poorly chosen method of combining/evaluating expert opinion will encourage
experts to state an opinion at variance with their true opinion. Perhaps the best-
known examples of this are found in the Delphi techniques. As Sackman (1975)
demonstrates, these techniques "punish" experts who deviate strongly from a
median value and reward changes of opinion in the direction of the median.
Most methods for forming weighted combinations of expert probability
assessments must be criticized from the viewpoint of neutrality. For example, the
Seismic Hazard Characterization of the Eastern United States (Bernreuter et al.,
1984) and the SNR-300 risk study (Hofer, Javeri, and Loffler, 1985) both use
weighted combinations of expert opinion, as mentioned above. The former
determines these weights by asking the experts to weight themselves according to
how good an expert they think they are for the question at hand (these are termed
"self-weights"). The latter employs so called "De Groot weights," whereby each
expert weighs the expertise of each other expert (including himself).
The idea of using self-weights was first introduced by practitioners of the
Delphi method. Although an initial study (Dalkey, Brown, and Cochran, 1970)
indicated that self weights resulted in improved accuracy, a later and more
extended study challenged this conclusion (Brockhoff, 1975). Curiously, women
consistently rate themselves lower than men (Linstone and Turoff, 1975, p. 234).
There is no psychometric measurement underlying the notion of "good expert,"
hence the scale for self-weights cannot be given any operational meaning (does a
rating of 6 mean "twice as good an expert" as a rating of 3?).
A high rating is obviously a form of reward. The system of self-weights asks an
expert to punish and reward himself; De Groot weights empower the experts to
punish and reward each other. Without wishing to suggest that experts would
misuse such rating systems, it must be affirmed that these systems offer no incentive
for performing these tasks honestly. Indeed it is not clear what this latter notion
would mean, as the scales used in the ratings are not susceptible of operational
interpretation.
We note in addition that the use of self-weights or De Groot weights makes it
very difficult to satisfy the principles of reproducibility and accountability.
Fairness
All experts are treated equally, prior to processing the results of observations.
Most Bayesian models for combining expert probability assessments require the
analyst to "assess the reliability of a given expert." No guidance is given as to how
this should be done; the analyst is simply expected to quantify his trust in some
inscrutable way. This type of model is clearly ruled out by the above principle. It
would indeed be curious to subject the expert assessments to empirical control, but
to allow the analyst to nullify this by his own assessments of reliability of individual
experts.
Since empirical control is acknowledged as the means for evaluating expert
opinions, in the absence of any empirical data there is no reason for preferring one
expert to another.
Of course, the analyst must "prefer" one expert to another when he decides
which experts to consult. However, it is judged that these decisions must be made
initially on the basis of factors that cannot be meaningfully translated into
numerical input in the combination models.
CONCLUSION
status of the theory, it merely shows that rational choice is not so simple as we
might like to believe. Perhaps the most interesting work in this direction is that of
MacCrimmon (1968). MacCrimmon studied preference behavior in middle- and
upper-level executives enrolled in a management training course. He found
significant departures from the Savage axioms. However, when these departures
were pointed out to the persons involved, they usually admitted that they had
"made mistakes" and revised their preferences to conform with Savage's require-
ments. Moreover, the more experienced executives made fewer "mistakes" than the
less experienced executives.
Savage's ideas have been criticized on normative grounds (Allais, 1953, 1979;
Machina, 1981; Shafer, 1986) and on theoretical grounds (Luce and Krantz, 1971;
Balch and Fishburn, 1974; Cooke, 1983, 1986) as well. There have also been
attempts to refine and improve the axiomatic basis of rational decision (Jeffrey,
1966; Balch, McFadden, and Wu, 1974; Pfanzagl, 1968; Luce and Krantz, 1971;
Krantz, Luce, Suppes, and Tversky, 1971; Balch and Fishburn, 1974). Much
current research looks for generalizations and/or alternatives of the "expected
utility model" that capture more of observed empirical behavior. Although this
work has led to some generalizations and qualifications, it has not fundamentally
altered our thinking about rational decision. All theories of rational decision being
discussed today have their points of departure in Savage's work. A good
elementary discussion of some of these issues is found in Hogarth (1987).
Of course Savage's theory is well known among decision theorists and
econometricians, and there would be little point in describing it in a book
addressed to them. However, the problems surrounding expert opinion are
presently attracting attention from researchers from very diverse backgrounds, and
a contemporary review of Savage's model seems eminently appropriate for them.
As Savage's original proofs are not tailored for this audience, it seems appropriate to
include proofs of the main results suitable for undergraduates having taken a
course in probability. The first section of this chapter discusses Savage's decision
model. The second section sketches the important representation theorem. Proofs
are given in the supplement to this chapter. A third and final section outlines the
role of observation within the theory of rational preference.
off at some point. Let us agree to cut this analysis off with the above events and
their complements (denoted as A' etc.). This just means that we are not going to
distinguish types of accidents, or prices of gasoline, etc.
A state or possible world for this decision problem is a complete specification
of which events occur. Alternatively, it is an atom in the field generated by the
events of interest for this problem. There are eight states for this problem:
A and B and C
A and B and C'
A and B' and C
A and B' and C'
A' and B and C
A' and B and C'
A' and B' and C
A' and B' and C'
Now, in deciding whether to buy a car or a motorcycle, we have to consider the
outcomes of each act in all these eight states, evaluate the outcomes of each act in
each state, and evaluate the probability of each state. What is then the outcome of
an act? It is simply the degree of satisfaction associated with a given act in a given
state. An outcome understood in this sense is what Savage calls a consequence. A
consequence is simply a state of the subject. Regarded mathematically, actions are
functions taking states of the world into states of the subject.
Supposing we have done all this, which act do we choose? According to
Savage, we should choose the act with the best expectation. Where S denotes the
set of all states, the expectation of an act is
Expectation of act = Σs∈S p(s) × (consequence of act in s)
There is much yet to be explained before we can really understand the above
formula. For example, how do we determine the probability of state s and how do
we multiply a probability with a "state of the subject"? This will all be explained in
the next section; however, there is one feature that must be addressed before going
further.
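As a toy illustration of "choose the act with the best expectation," the sketch below assigns made-up probabilities to the eight states and made-up numerical consequences to the two acts, and picks the act with the larger expected value. All numbers are hypothetical (and, as the next paragraph notes, treating the accident probability as the same for both acts is itself questionable).

```python
states = ["ABC", "ABC'", "AB'C", "AB'C'", "A'BC", "A'BC'", "A'B'C", "A'B'C'"]
p = dict(zip(states, [0.02, 0.08, 0.10, 0.30, 0.01, 0.04, 0.09, 0.36]))  # hypothetical

# Hypothetical consequences (on some utility scale) of each act in each state.
consequence = {
    "buy car":        dict(zip(states, [-5, 2, -4, 4, -6, 1, -5, 3])),
    "buy motorcycle": dict(zip(states, [-8, 3, -7, 6, -9, 2, -8, 5])),
}

def expectation(act):
    return sum(p[s] * consequence[act][s] for s in states)

best = max(consequence, key=expectation)
print({act: round(expectation(act), 2) for act in consequence}, "->", best)
```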
There is something wrong with the above analysis of our decision problem. In
the formula for expectation of act, the probability of each state is independent of
the act under consideration. The p(A and B and C) used for determining the
expectation of the act "buy a car" will also be used to determine the expectation of
the act "buy a motorcycle." Now it is reasonable that the price of gasoline and
getting a new job do not depend on my choice of mode of transportation. But what
about B; "have an accident"? It seems plausible that the probability of this event
(and hence of the event A and B and C) does depend on my choice of
transportation.
The above analysis of the decision problem is not suitable for the decision
model which Savage puts forward. However, there is a simple "technical fix" that
will convert the above analysis into an analysis for which Savage's model is
suitable. We simply throw out the event "have an accident" and replace it with two
other events:
Savage answers both questions in the affirmative. In the "big decision problem of
life" Savage proves that a rational agent always behaves in such a way that he
prefers act f to act g if and only if the expected utility of f is greater than that of g.
The expected utility is calculated with a (subjective) probability that is uniquely
determined, and utility is calculated with a utility function on consequences that is
unique up to a positive affine transformation. The subjective probabilities and
utilities are peculiar to the agent, different agents will in general have different
probabilities and utilities. Savage's theory does not prescribe probabilities and
utilities; any probability in combination with any utility can lead to rational
behavior in his sense, provided the subject chooses according to the principle of
maximum expected utility. According to Savage, it is not possible to specify
rational behavior further.
The proof of this fact is rather technical, and Savage's original proof is quite
We can say more about rational preference. For example, if we define the act BAD:

BAD(s) = bad for all s

then we should expect that for the act f above, f ≥ BAD. Since BAD takes the consequence "good" on ∅, it follows that for any event
C, C ≥. ∅. If also ∅ ≥. C, then we say that C is null. Hence, we guarantee that for all
C, C ≥. ∅ by assuming that preference satisfies the

Principle of dominance. If for all s, the consequence h(s) is at least as preferable
as the consequence k(s), then h ≥ k, for any acts h, k. If in addition h(s) > k(s)
for all s ∈ C, with C nonnull, then h > k.
We shall want to assume more. Suppose there is a third event C disjunct from both
the events A and B above: C ∩ A = C ∩ B = ∅. Then we might expect that the
qualitative probability relation is additive, that is, that

A ≥. B if and only if A ∪ C ≥. B ∪ C

Why should we expect this? Well, what does this mean in terms of preference?
Consider again the acts f and g above. Since C is disjunct from both A and B, f and g agree on C; let f' and g' be the acts that agree with f and g outside C and take the consequence "good" on C.
From what we have just said, it follows that f' ≥ g'. From this it follows that
A ∪ C ≥. B ∪ C. In other words, the additivity of the qualitative probability
relation is implied by what Savage calls the
Sure thing principle. The preference between two acts is independent of the
values the acts take on any set on which the two acts agree.
Other authors refer to the sure thing principle as the principle of independence,
for obvious reasons. Letting S denote the set of all states, it is easy to show that for
any A, S ≥. A (start with A' ≥. ∅, and "add" A to both sides). It is the additivity
property, together with the fact that ∅ ≤. A ≤. S, that makes the qualitative
probability relation "look like" a probability. Any representation of uncertainty
that does not have these properties will not be representable as a quantitative
probability measure. However, it is not the case that every qualitative probability
can be represented by a genuine probability measure (Kraft, Pratt, and Seidenberg,
1959). For this we need more assumptions introduced in the following step of the
argument.
The critique of Savage's axioms has been largely directed at the sure thing
principle. To give one example, we describe what is known as the "Ellsberg
paradox" (Ellsberg, 1961). If subjects are given a choice between an even money
wager on "heads" with a coin of unknown composition and an even money wager
on "heads" with a fair coin (i.e., for which the probability of "heads" is known to be
), most subjects wil prefer the latter wager. Hence, they behave as if the probability
of heads on the first wager were less than . However, changing the first wager so
that the subject wins on "tails," they still prefer the second wager. This preference
behavior is not consistent with Savage's axioms (in particular, it contradicts (e) of
Lemma 6.1 in the supplement).
Step 3: Utility
Let C denote the set of consequences and S the set of states. In the simplest version
of utility theory, which is adequate for all practical applications, it is assumed that
C is finite. Let F denote the set of all possible acts, that is, F is the set of functions
from S to C. The simple version of utility requires the slightly
Strengthened principle of refinement. If f > g, then for all a sufficiently small,
a > 0, the preference between f and g is unaffected by altering f and/or g on a
set of probability less than a.
The foundation of utility theory is given in the following:
Theorem 6.2: Let C, F, and S be as above, let R denote the real numbers, let ≥
satisfy the principles set forth above, and let p be the unique quantitative
probability representing ≥. Finally, let U: C → R be a "utility" function, and let
Uf denote the expected utility of f with respect to p:

Uf = Σc∈C U(c) p({s ∈ S : f(s) = c})

Then
1. There exists a utility function U: C → R such that for all f, g ∈ F,

f ≥ g if and only if Uf ≥ Ug

2. If U' is another real-valued function on C satisfying (1), then there exist real
constants a, b with a > 0, such that

U'(c) = aU(c) + b for all c ∈ C
The second statement in this theorem says that the utility functions satisfying (1)
are related by positive affine transformations, or equivalently, that utility is positive
affine unique.
The mathematics behind this theorem are elementary, though Savage's
original proof was quite technical. A simplified proof is given in the supplement to
this chapter.
OBSERVATION
The theory of subjective probability involves the following paradox. Since one subjective probability "is as
good as another," why should one bother to change one's beliefs via observation?
For example, suppose I have some money to invest and I am interested in the
probability that the price of gold will go up next year. According to Savage's
theory, I already have some subjective probability that the price of gold will go up
next year. Why shouldn't I just act on my preferences? Most people in this situation
would try to get more data, but why should they? Isn't one subjective probability as
good as another?
The decision model presented in this chapter gives a satisfactory answer to this
question. We shall first examine how observations affect subjective probabilities,
and then examine how observations can influence a given decision problem. Again,
we shall report on a theorem whose proof is given in the supplement.
What is an observation? In the case of the price of gold next year, I could
imagine doing any number of things to get a better idea whether the price of gold
will rise. I could consult expert investors, I could look up the gold prices for
preceding years, I could examine the political situation in South Africa where much
of the world's gold is mined, etc. Whatever I do, I can always express the result as a
number. I do not know beforehand what this number will be; that's why I have to
look or ask. The value of the number that I find is obviously determined by the
actual state of the world (of which I am uncertain). In other words, an observation
can be mathematically represented as a real-valued function on the set S of possible
worlds.
Mathematically, observations are similar to actions, but as they have a
different interpretation we shall distinguish them in the notation. An observation
will be called a random variable and will be denoted with capital letters from the
end of the alphabet: X, Y, etc. The possible values of observation X will be assumed
to be finite in number, and will be denoted x1, . . . , xn.
We may consider a partition B1, . . . , Bm of S. For simplicity, we shall take
B1 = "the price of gold goes up next year," and B2 = "the price of gold does not go
up next year." Then S = B1 ∪ B2. We would like to know whether our actual world
belongs to B1 or B2, but unfortunately we don't know and there is no observation
that we can perform at present which will give us the right answer with certainty.
We now describe the way in which an observation X can give us more
information regarding next year's gold price. In this discussion we shall use some
elementary facts regarding conditional probability and conditional expectation set
forth in the mathematical appendix.
The event that the observation X yields the value xr is a subset of S that we
shall simply denote as xr. The theorem of Bayes says

p(B1 | xr) = p(xr | B1) p(B1) / [p(xr | B1) p(B1) + p(xr | B2) p(B2)]

Now before we make the observation of X, its value is uncertain. Hence, we may
consider p(B1 | X) as a random variable, and since the above equation holds for all
values xr of X we may write

p(B1 | X) = p(X | B1) p(B1) / [p(X | B1) p(B1) + p(X | B2) p(B2)]
Theorem 6.3 shows that if B1 is the case, and if R(X) is not unity given B1, then
after observing X we may expect to be more confident of B1 than we were before
performing the observation. The quantity R(X) is sometimes called the likelihood
ratio of B1 to B2 given X.
Theorem 6.3 does not use the fact that p(B1) + p(B2) = 1, and it can be
generalized straightaway to larger partitions. Defining R1j(X) analogously as the
likelihood ratio of B1 to Bj given X, one shows that E(R1j(X) | B1) ≥ 1 just as in the proof of Theorem 6.3.
Theorem 6.3 gives a nice account of learning from observations, but it does not
answer the question "why observe."
Why indeed? Before giving an answer, two deeply trivial remarks must be
made. First, in deriving the representation of preference in terms of expected utility
we assumed that the subject's preferences are defined with respect to all mathemat-
ically possible acts. If he can actually choose any of these acts, then he can always
choose the act yielding the maximum utility in every possible world. Such a person
has no decision problem and has no reason to observe anything. Observation
becomes interesting only if the set of accessible acts is strictly smaller than the set of
all acts.
Second, we have seen that Savage's decision theory provides for changing
subjective probabilities via conditionalization on the results of observation. It does
not provide for changing subjective utilities. For the absolute stoic, the decision
problem of life is not a question of choosing the accessible act with the highest
expectation, it is rather a question of changing one's utilities in such a way that the
act of doing nothing yields maximum pleasure. Such a person also has no reason to
observe anything.
Hence, in explaining why observation is rational under certain circumstances,
we must assume that not all acts are accessible, and that utilities cannot be
changed. As before, we let F denote the set of all acts. A decision problem may be
characterized as a subset F0 of F, where F0 denotes the set of accessible acts. We assume
that F0 is finite. In light of Theorem 6.2 we may assume that all acts are utility-
valued, and we write E(f) to denote the expected utility of f (E(f) corresponds to
Uf in Theorem 6.2). We define the value v(F0) of F0 as follows:

v(F0) = max{E(f) : f ∈ F0}

The point of performing an observation is that we can choose an act from F0 after
inspecting the result of the observation. For observation X with possible values
x1, . . . , xn, we define the value of F0 given X = xi, v(F0 | xi), as

v(F0 | xi) = max{E(f | xi) : f ∈ F0}

Here, the conditional expectation E(f | xi) is the expectation of f with respect to
the conditional probability p(· | X = xi). As this holds for every possible value of X,
we may define the value of F0 given X, v(F0 | X), as

v(F0 | X) = Σi p(X = xi) v(F0 | xi)
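A small numerical sketch of these definitions, with all numbers hypothetical: two accessible acts, two states B1 and B2, and a binary observation X. The code computes the value before observation and the value given X, and one can check that observing X never lowers the value.

```python
p_B1 = 0.5                                     # prior probability of B1 (hypothetical)
p_x_given_B = {"B1": {"x1": 0.8, "x2": 0.2},   # likelihoods of the observation
               "B2": {"x1": 0.3, "x2": 0.7}}
utility = {"f": {"B1": 10, "B2": 0},           # utilities of the accessible acts
           "g": {"B1": 2,  "B2": 6}}

def expected_utility(act, p1):
    return p1 * utility[act]["B1"] + (1 - p1) * utility[act]["B2"]

def value(p1):                                 # v(F0) for a given probability of B1
    return max(expected_utility(act, p1) for act in utility)

p_x = {x: p_B1 * p_x_given_B["B1"][x] + (1 - p_B1) * p_x_given_B["B2"][x]
       for x in ("x1", "x2")}
posterior = {x: p_B1 * p_x_given_B["B1"][x] / p_x[x] for x in ("x1", "x2")}

v_prior = value(p_B1)
v_given_X = sum(p_x[x] * value(posterior[x]) for x in ("x1", "x2"))
print(v_prior, v_given_X)   # v(F0 | X) >= v(F0)
```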
SUPPLEMENT2
In this supplement, proofs are given for the theorems mentioned in the text. The
proof of Theorem 6.2 is a combination of the proof of Savage and the proof of a
similar theorem by Ramsey (1931). Villegas (1964) has given a shorter proof of a
theorem resembling Theorem 6.1. However, he makes assumptions that are
substantially stronger than those of Savage, and that entail the existence of a
2
I am grateful to Peter Wakker and Bram Meima for reading this supplement and providing many
useful comments.
f ≈ g denotes "f ≥ g and g ≥ f". f > g denotes "f ≥ g and not g ≥ f". For c, d ∈ C
we write c ≥ d if the constant function on S with value c dominates the constant
function on S with value d. B ∈ 2^S is called null if for all f, g ∈ F, f ≈ g given B. For A,
B ∈ 2^S, we write A ≥. B if for all c, d ∈ C with c > d, and for all f, g ∈ F satisfying

f(s) = c for s ∈ A, f(s) = d for s ∈ A'; g(s) = c for s ∈ B, g(s) = d for s ∈ B',

we have f ≥ g.
Definition. A relation ≥ on F is a rational preference if
• ≥ is a weak order.
3
The reader should be especially cautious in applying theorem 4.1 of Villegas (1964), as the notion
of isomorphism used there is not the one common to probabilists.
4
2^S denotes the set of maps from S to {0, 1}; each such map corresponds to one subset of S.
Proof. We show only (a) and (c) to give the idea behind the proofs.
(a) Let D~ = D — B. Then by the definition of qualitative probability
The result now follows from the transitivity of the qualitative probability relation.
B E <. C. By the principle of refinement, there exists a partition {Ai} such that
B A <. C, for all i. Since B <. S, there must be an i such that B' Ai >. . Put
E = B' Ai.
Lemma 6.3. For all B ∈ 2^S there exist subsets B1 and B2 of B such that
B1 ∩ B2 = ∅, B1 =. B2, and B1 ∪ B2 = B.
Proof. The lemma is trivial if B =. ∅. Assume B >. ∅. We show that for all n there
exists a threefold partition of B, B = Dn ∪ Gn ∪ Cn, such that
With the refinement principle we can find a partition {Ai}, i = 1, . . . , n, of G1 such that
A1 = E1 and
Put
It is easily verified that G2, D2, and C2 satisfy the conditions on the threefold
partition for n = 2. Iterating the above construction we construct Cn, Gn, and Dn
for arbitrary n. We put
It is easy to verify that B1 and B2 partition B. We show that B1 =. B2 with the help
of the following:
Claim. For all H >. ∅, there exists an integer n' such that for all n > n', Gn <. H.
Proof. Let {Ai}, i = 1, . . . , m, be a partition such that Ai <. H for all i. Choose n' = m. For
n > n', suppose that Gn ≥. H; then Gn >. Ai for all i = 1, . . . , m. By the properties of
the threefold partitions and Lemma 6.1d we derive the contradiction:
Suppose B1 <. B2. From the claim it follows that Gm =. . For sufficiently large
m, n with n > m we should have
then B =. G. Suppose to the contrary that B <. G. Then C <. G =. H, and by
Lemma 6.1e, B ∪ C <. G ∪ H, a contradiction. Suppose we have partitioned S into
2^(n−1) elements that are qualitatively equally probable. Applying Lemma 6.3 to each
element, it follows that we can partition S into 2^n elements of equal qualitative
probability. The lemma now follows by induction. •
Definition. For all B ∈ 2^S, let B(n) denote the largest k such that
Theorem 6.1. There is a unique quantitative probability p that represents ≥., that is,
there is a unique probability p such that for all A, B ∈ 2^S,

A ≥. B if and only if p(A) ≥ p(B)
Proof. Let p be given by the above definition. The reader can easily verify that p is a
probability. If then hence Suppose that By
the refinement principle there exists a partition such that
for Choose i such that Again, by the refinement
principle there exists a partition such that
Hence, for there exists a uniform partition such that
It follows that
and consequently.
Finally, let p' be another probability that represents ≥.. Since p and p' must agree
on all elements of all uniform partitions, it follows that for all n and all B ∈ 2^S:
then
Proof. By the sure thing principle, (a) and (b) imply, respectively,
Lemma 6.6. For all there exists a set such that and
such that
if and only if
Lemma 6.6 follows directly from Remark 2. Lemma 6.7 is little more than the
principle of definition, in combination with Remark 1 and Theorem 6.1.
Lemma 6.8. For all with there exists a unique such that
If
then by the strengthened refinement principle we could alter the right-hand side
such that for some
contradicting the fact that s is the infimum. A similar argument holds for " >," and
we conclude
By Lemma 6.7, this equation depends only on s and not on A, hence we write
Proof. Suppose
which is a contradiction.
Lemma 6.10. For all integers n, Lemma 6.9 holds for B with
By Lemma 6.10
The set r.D may be chosen such that the principle of dominance then
entails that
Hence
Theorem 6.2. Let and be as above, and let R denote the real
numbers. Let and let Uf be defined as
where
Similarly,
with equality if and only if, with probability one, Y is linear on the range of f
Where B1 and B2 partition S, with we define
Relative Frequencies and Exchangeability

This chapter studies the relation between subjective probability and relative
frequency. Not only is this essential for understanding subjective probability, it is
also important for the Bayesian models for using expert opinion. In Theorem 6.3
we have seen how we expect to learn from experience, and in Theorem 6.4 how we
expect to profit from observations. However, these theorems do not explain why we
like to observe repeated events and like to use their relative frequencies as
probabilities.
Although the relation between subjective probability and relative frequency
has been well understood since the 1930s, the more popular literature has been very
slow in appreciating it, and there is still much confusion. Perhaps the most
important piece of disinformation concerns an alleged antagonism between
subjective probabilities and relative frequencies. There is to be sure a certain
opposition between the subjectivist and frequentist interpretation of probability, as
seen in the Appendix. Unfortunately, this often gets understood as an opposition
between subjective probabilities and relative frequencies per se, as if the latter were
different from and "more objective" than the former. This is nonsense. Subjective
probabilities can be, and often are, limiting relative frequencies. In particular, this
happens when a subject's belief state leads him to regard the past as relevant for the
future in a special way. Thanks to the work of Bruno De Finetti, we can give a
precise mathematical account of this type of relevance: past and future together
should form an exchangeable sequence, prior to observation.
Equally regrettable is the fact that most books on Bayesian decision theory
scarcely mention the word "exchangeability" and certainly do not accord it the
position it deserves. The reason for this may lie in some beguiling mathematics. De
Finetti's famous "representation theorem" shows that an exchangeable sequence of
events can be uniquely represented as lotteries over certain "archetypal" sequences:
A given exchangeable sequence corresponds to drawing an archetype from a given
distribution over all archetypes. In the case of an infinite sequence of exchangeable
events, these archetypal sequences are just the old Bernoulli independent coin-
tossing processes, which probabilists know and love. Thus, in the infinite case,
"metaphysical" character: One would be obliged to suppose that beyond the proba-
bility distributions corresponding to our judgment, there must be another, unknown,
corresponding to something real, and that the different hypotheses about the unknown
distribution—according to which the various trials would no longer be dependent, but
independent—would constitute events whose probability one could consider. From our
point of view these statements are completely devoid of sense, and no one has given
them a justification which seems satisfactory, even in relation to a different point of view
(De Finetti, 1937).
Someone flipping a coin repeatedly, prior to betting on its outcomes, may explain
his behavior by rehearsing the frequentist account. This does not prove that the
"thing" which he is looking for actually exists. Saying a prayer does not prove the
existence of God. Nonetheless, a subjectivist should give some account of what this
person is doing. If a state of partial belief is represented by a subjective probability
measure, under what partial beliefs would this person's behavior make sense?
EXPECTED FREQUENCY
Theorem 7.1 says that the average probability is equal to the expected relative
frequency. It makes no assumption on the probabilities of the Ai, and says nothing
about changing our beliefs on the basis of relative frequencies.
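Presumably Theorem 7.1 rests on nothing more than linearity of expectation; writing 1i for the indicator of Ai, a sketch of the identity is

(1/n) Σi p(Ai) = (1/n) Σi E(1i) = E[(1/n) Σi 1i],

that is, the average of the probabilities equals the expectation of the relative frequency of occurrence.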
EXCHANGEABILITY
Learning from experience can take many forms. We are interested in a particular
form, learning from observed frequencies. In what sort of belief states are we
inclined to learn from observed relative frequencies? Roughly speaking, we must
believe that the past is similar to the future. Let A1, A2, . . . denote the sequence of
events {"it rains on day i"}. Consider a particular event, say A100. Would
we be prepared, before observing any days, to say that after observing A1-A99, our
probability for A 100 should be approximately equal to the relative frequency of
occurrence of A1-A99? That depends on whether we know anything special about
A 100 , relative to the other events. For example suppose the sequence starts on June
1, and suppose we know that it hardly rains at all in the summer, but that
September is the wettest month of the year. Then p(A100) may change as a result of
observing A 1 -A 99 , but it will not incline toward the relative frequency of rain in the
summer.
Suppose we do not have any such knowledge. If we think about it, we can
express this lack of knowledge mathematically as follows: Let 1i denote the
indicator of Ai; then for any i, i = 1, . . . , 99, our probability for any outcome
sequence for 11, . . . , 1100 is unaffected by interchanging the outcomes of 1i and 1100.
Instead of interchanging the outcomes, we can interchange the variables, keeping
the outcomes fixed. Let Q denote an arbitrary sequence of 100 0s and 1s. This
means
Suppose we observe the values of 11, . . . , 1n and find the sequence Q with r 1s and
n − r 0s. After this observation our probability for An+1 is not p(An+1) but
p(An+1 | Q). We now calculate this latter quantity.
This is the famous "Laplace rule" for betting on "success on the next trial" given r
successes on n previous trials. As r and n get large, this approaches the observed
relative frequency of occurrence r/n. In general, the ratio determines
how fast we will "learn from experience." We record this as
Theorem 7.2. Let be exchangeable, and suppose for all
and all Suppose there is a constant K such that for all n:
The conditions of Theorem 7.2 do not always apply. If the events are independent
(a special case of exchangeability), then p(An+1 | Q) = p(An+1) and nothing is
learned from the observation Q.
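A short sketch of the Laplace rule quoted above and of its approach to the observed relative frequency; the particular (r, n) pairs below are just an illustration.

```python
from fractions import Fraction

def laplace_rule(r, n):
    """Probability of success on the next trial after r successes in n trials,
    under a uniform prior: (r + 1) / (n + 2)."""
    return Fraction(r + 1, n + 2)

for r, n in [(1, 2), (7, 10), (70, 100), (700, 1000)]:
    print(r, n, laplace_rule(r, n), float(laplace_rule(r, n)))
```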
There are n — r terms in the first summand, terms in the second summand, etc.
Multiplying through and using the linearity of expectation proves the claim.
From the claim, it follows that the probability of every event can be calculated
SUPPLEMENT
In this supplement we compare the weak laws of large numbers for the frequentist
and subjective interpretations, and give a proof of De Finetti's representation
theorem for infinite sequences of exchangeable events. Given the moment conver-
gence theorem of the Appendix, this proof is perhaps the simplest; however, there
are simple proofs that use only Helly's theorem (Heath and Sudderth, 1975). The
mathematics underlying the applications of De Finetti's theorem in Chapters 11
and 13 are developed here, and further generalizations are indicated.
If the events Ai form a Bernoulli sequence with parameter p, then the weak law of
large numbers (see the Appendix) says that Xn converge in probability to the
constant p. This theorem forms the foundation for the frequency interpretation of
probability. A corresponding role in the subjective interpretation is played by the
weak law of large numbers for exchangeable events, according to which the Xn
converge in probability, but not necessarily to a constant.
Theorem 7.4. If A1, . . . are exchangeable, then Xn converge in probability.
Proof. Fix , and let k > h. Since
it suffices to show
Hence
for any measurable set B. De Finetti's theorem says essentially that the measures ph
converge as h → ∞, and characterizes the limit measure. For the notion of
convergence of probability measures, we refer to the moment convergence theorem
of the Appendix.
Theorem 7.5 (De Finetti). Let A1, . . . be exchangeable with respect to p, with Xh
and ph defined as above, and with wn as in (7.4). Then as h → ∞, the measures ph
converge to a measure on the unit interval and wn is the nth moment of this
measure.
Proof. By the moment convergence theorem (A.6) of the Appendix, it suffices to
show that E(X_h^n) → w_n as h → ∞. Recall the multinomial expansion:
as
It follows that
We denote the probability measure on the unit interval given by Theorem 7.5
as p. Since w_n is the nth moment of p, we may write

w_n = ∫_0^1 x^n dp(x)
Let Q be an outcome sequence of length n containing exactly r 1s. Using the above
facts in combination with Equation (7.4), we see that

p(Q) = ∫_0^1 x^r (1 − x)^(n−r) dp(x)

We can interpret this as follows. If the events in question form a Bernoulli sequence
with parameter x, then the probability of Q would be x^r (1 − x)^(n−r). The limiting
relative frequency in this case is certain to be x. When the events are exchangeable,
the probability of Q is given by mixing Bernoulli sequences according to the
probability measure dp on [0,1]. This is the analog of Equation (7.6) for infinite
sequences of exchangeable events.
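The mixture representation can also be evaluated numerically. The sketch below assumes a uniform mixing measure on [0, 1], which is purely illustrative; with that choice the predictive probability reproduces the Laplace rule.

```python
# A sketch of the De Finetti mixture p(Q) = integral of x^r (1 - x)^(n - r) dp(x) for an
# exchangeable 0-1 sequence, evaluated numerically.  The mixing measure chosen here
# (uniform on [0, 1]) is only an illustration, not prescribed by the text.

from scipy.integrate import quad

def prob_of_sequence(r: int, n: int, density) -> float:
    """P(Q) for an outcome sequence of length n containing exactly r ones."""
    val, _ = quad(lambda x: x**r * (1 - x)**(n - r) * density(x), 0.0, 1.0)
    return val

uniform = lambda x: 1.0
n, r = 10, 7
p_q = prob_of_sequence(r, n, uniform)
# Predictive probability of a further success, P(A_{n+1} | Q), via the same mixture:
p_next = prob_of_sequence(r + 1, n + 1, uniform) / p_q
print(p_q, p_next)   # with the uniform mixture, p_next reproduces (r + 1)/(n + 2)
```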
Equivalent Observations
The key to using De Finetti's theorem in applications involves finding prior
probability measures dp for which the integral in (7.8) can be evaluated easily. For
exchangeable events, the beta densities are most useful in this respect. For positive
integers a and b define the beta density

f_{a,b}(x) = x^(a−1) (1 − x)^(b−1) / B(a, b),   0 ≤ x ≤ 1,   where B(a, b) = (a − 1)!(b − 1)!/(a + b − 1)!
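As a hedged illustration of why the beta densities are convenient: under a beta mixing density with integer parameters a and b, the predictive probability of a further success after r successes in n trials reduces to (r + a)/(n + a + b), as if a + b "equivalent observations" (a of them successes) had already been seen. The numbers in the sketch below are arbitrary.

```python
# A minimal sketch of the "equivalent observations" reading of the beta mixing density.
# Under a beta(a, b) mixing measure, the predictive probability of a success after r
# successes in n trials is (r + a) / (n + a + b).  The parameter choices are illustrative.

from scipy.stats import beta
from scipy.integrate import quad

def predictive(r: int, n: int, a: int, b: int) -> float:
    """P(success on trial n+1 | r successes in n trials), beta(a, b) mixing measure."""
    num, _ = quad(lambda x: x**(r + 1) * (1 - x)**(n - r) * beta.pdf(x, a, b), 0, 1)
    den, _ = quad(lambda x: x**r * (1 - x)**(n - r) * beta.pdf(x, a, b), 0, 1)
    return num / den

a, b, r, n = 3, 2, 7, 10
print(predictive(r, n, a, b), (r + a) / (n + a + b))   # the two agree
```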
Events may be thought of as random variables taking two possible values, 0 and 1.
The results of this chapter can be generalized for random variables taking a finite
number of possible values. Instead of urns with two colors of balls, we then must
consider drawings from urns whose balls can have any of a given finite number of
colors. Instead of Bernoulli, or coin-tossing processes, we have to consider
multinomial processes. These may be thought of as a sequences of rolls with a die
having a given finite number of faces. The approach to relative frequencies and the
representation of exchangeable sequences go through mutatis mutandis.
Let y_1, y_2, ... be an infinite sequence of exchangeable random variables taking
outcomes in the set {1, ..., k}, and let Q be a sequence of outcomes of length
n containing exactly r_i occurrences of outcome i, i = 1, ..., k. Let
, then the appropriate generalization of De
Finetti's theorem entails that
where and
Compare (7.18) with (7.12). The numerator is the number of equivalent ob-
servations of outcome j plus 1, whereas the denominator is the total number of
equivalent observations plus the number k of alternatives. If k > 2, then the
denominator exceeds (number of equivalent observations of outcome j) + (number
of equivalent observations of outcomes other than j) + 2.
The De Finetti theorem is less helpful in this case, as the set of probability measures
on the real line, over which the mixing measure is defined, is quite large. The problem is that exchangeability
is too weak for studying infinite sequences of continuous random variables.
Intuitively, exchangeability means that the probability of a sequence of outcomes
depends only on the relative frequencies with which the outcomes occur in the
sequence. If the outcomes can be chosen from the real line, then there are too many
possible outcomes to see repetitions in finite sequences.
Interesting extensions of De Finetti's theorem to the case of real-valued
random variables involve strengthening the assumptions. A probability is exchan-
geable if it is invariant under the action of the finite permutation group. We
strengthen the assumptions by requiring invariance under the action of a group
that includes the finite permutation group as a subgroup. Theorems can be proved
relating strengthened invariance properties with restrictions on the measure
above. We give one recent example. Let X_1, ... be a sequence of random variables,
and assume for every k, the density f_k of the random vector (X_1, ..., X_k) exists, is
continuous, and has the form

f_k(x_1, ..., x_k) = g_k(|x_1|^p + ··· + |x_k|^p)

In other words, f_k depends only on the l_p norm of the point (x_1, ..., x_k). Note that
the process X_1, ... is exchangeable, as the finite permutation group is a subgroup of
the group of "rotations under the l_p norm." Then for a unique measure on (0, ∞),
f_k can be written as (Berman, 1980; for a simpler proof see Cooke and Misiewicz,
1988)
fk can be written as (Berman, 1980; for a simpler proof see Cooke and Misiewicz,
1988)
ELICITATION
Direct Methods
The simplest method of measuring a person's degree of belief is simply to ask him
what his degree of belief is. While this method is surely the most common, it is
equally surely the worst, especially for persons who are not familiar with the notion
of probability. Most people have poor intuitions regarding numerical probabilities.
Evidence for this among experts may be gleaned from the "electric lawn mower"
data in Table 2.7. More than 8% of probabilities assessed by the experts were either
0 or 1. Given the events in question, these assessments are clearly absurd.
A better method has been proposed by Lindley (1970). The idea is that states
of uncertainty can be compared with regard to intensity. Hence, it makes sense to
assess the probability of an event A by comparing the intensity of one's uncertainty
regarding A to that of some other event B, to which numbers may easily be
assigned. Consider a lottery basket containing 1000 tickets, each with a number
between 1 and 1000. Suppose we ask a person:
For which number N, 0 < N < 1000, is your uncertainty regarding the
occurrence of A equal to your uncertainty for drawing a ticket from this basket
with a number less than or equal to N?
The number N, which he gives, divided by 1000, may be taken as a measure of his
degree of belief in the event A. A slight variation on this method involves spinning a
"probability wheel" (De Groot, 1970), similar to a roulette wheel.
Other direct methods are the discrete tests and quantile tests for calibration
discussed in Chapter 4. Although these are treated in the following section, it is
appropriate to recall them briefly here. In the discrete tests, a subject assigns an
uncertain event to one of a designated number of probability bins. The probability
associated with a probability bin is taken as an estimate of the subject's probability
for the event in question. In assessing the distributions of continuous variables, the
quantile test involves asking the subject to state certain fixed quantiles of his
distribution. Although these tests were designed to measure calibration, they also
serve to elicit probabilities.
Parametric Elicitation
In some applications, the nature of the quantities whose distributions are assessed
may be such as to suggest a particular class of probability distributions. In such
cases more elegant elicitation procedures can be derived. One such procedure is
described below. This specific implementation presupposes that the experts'
distributions are approximately lognormal; however, the idea may be applied to
any class of distributions determined by two parameters. This procedure was
developed for the European Space Agency (Preyssl and Cooke, 1989) for the
assessment of failure frequencies.
Preliminary discussions with experts at the European Space Agency indicated
a preference for breaking the elicitation down into two steps:
Step 1: Indicate a best estimate for the failure frequency in question.
Step 2: Indicate "how certain" one is of the best estimate.
Furthermore, given the traditions in the aerospace sector (see Chap. 2), there was a
preference for qualitative as well as quantitative elicitation procedures. The
elicitation was broken down into two steps as follows:
Step 1: The expert is asked for his median estimate of the failure frequency in
question. His answer is M.
Step 2: The expert is asked how surprised he/she would be if the true value
turned out to be a factor 10 or more higher. The answer is a number r,
0 < r < 1, reflecting the probability that the true value should exceed
the median by a factor of 10 or more. (For median estimates greater
than 0.1, the expert is simply asked to state his upper 95% confidence
bound directly.)
The numbers M and r determine a unique lognormal distribution. The techniques
described in the Appendix can be applied to find the experts' 5% and 95%
confidence bounds. The 5% and 95% confidence bounds are then M/k_95 and M·k_95,
respectively, where the constant k_95 is determined by r.
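The computation of k_95 is not reproduced above; the sketch below shows one way to recover the bounds directly from the lognormal assumption, and should be read as an illustration rather than as the Appendix procedure. It assumes r < 0.5, so that the implied spread of ln X is positive.

```python
# A sketch of recovering the 5% and 95% bounds from the two elicited numbers M and r,
# assuming only the lognormal model described above (ln X normal with median ln M and
# P(X >= 10*M) = r).  The routine is an illustration, not the Appendix formulas.

import math
from scipy.stats import norm

def lognormal_bounds(M: float, r: float):
    """5% and 95% bounds for a lognormal with median M and P(X >= 10*M) = r, r < 0.5."""
    sigma = math.log(10) / norm.ppf(1 - r)      # spread of ln X implied by r
    k95 = math.exp(norm.ppf(0.95) * sigma)      # 95% bound is M * k95, 5% bound is M / k95
    return M / k95, M * k95

print(lognormal_bounds(M=1e-4, r=0.2))
```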
SCORING
An Ideal Measurement
It is useful at first to discuss an ideal setup for measuring calibration and entropy.
Suppose that we are required to estimate the mean time to failure in days of a new
system component that cannot be subjected to destructive experimental tests. Our
only way of obtaining quantitative data is to ask the opinion of experts acquainted
with similar kinds of components. Given the amount of uncertainty inherent in
predictions of this sort, the experts may feel uncomfortable about giving point
predictions, and may prefer to communicate something about the range of their
uncertainty. The best they could do in this respect would be to give their subjective
probability mass functions (or density functions in the case of continuous variables)
for the quantity in question. In other words, they could provide a histogram over
the positive integers such that the mass above the integer i is proportional to their
subjective probability that the mean time to failure is i days.
The mean time to failure will eventually become known, and when it is known,
we may want to pose the question of how good this expert's assessment was.
Calibration and entropy are relevant to performing this kind of evaluation. We
assume that subjective probability mass functions over a finite number M of
integers can be solicited from each of several experts, for a large number of
uncertain quantities.
The entropy associated with a probability mass function P over the integers
i = 1, ..., M is:

H(P) = − Σ_i P(i) ln P(i)

where P(i) is the probability assigned to the integer i. H(P) is a good measure of the
degree to which the mass is "spread out." Its maximal value ln M is attained if
P(i) = 1/M for all i, and its minimal value 0 is attained if P(i) = 1 for some i.
Obviously, low entropy is a desideratum in expert probabilistic assessment. Other
things being equal, we should prefer the advice of the expert whose probability
functions have the lowest entropy. Other things are usually not equal.
To get an idea how a calibration score could be defined, suppose for the sake
of argument that an expert gives the same probability mass function P for a large
number n of physically unrelated uncertain quantities. By observing the true values
for all these quantities we generate a sample distribution S with S(i) equal to the
number of times the value i is observed, divided by n.
It might appear reasonable to say that the expert is miscalibrated if S ≠ P.
Upon reflection, however, this is easily seen to be quite unreasonable. Suppose the
true values represent independent samples from a random variable with distribu-
tion P. P certainly "corresponds to reality" (by assumption), but in general we will
not have S = P, as statistical fluctuations will cause P and S to differ. In line with
the intuitive definition of calibration, we might say that the expert was well-
calibrated if S and P agree in the long run. The problem with this, as Keynes was
fond of saying, is that in the long run we are all dead. This definition gives us no
way of measuring calibration for finite samples. We shall see shortly that "S = P in
the long run" is a necessary but not sufficient condition for calibration.
Still speaking roughly, we want to say that the expert is well-calibrated if the
true values of the uncertain quantities can be regarded as independent samples of a
random variable with distribution P. This entails that the discrepancy between S
and P should be no more than what one might expect in the case of independent
multinomial variables with distribution P. We therefore propose to interpret the
statement
The expert is well-calibrated.
as the statistical hypothesis:
Cal(P): = the uncertain quantities are independent and identically distributed
with distribution P
We want to define a calibration score as the degree to which the data supports the
hypothesis Cal(P). A procedure for doing this is described below.
This probability can be used to define statistical tests in the classical sense. Of
particular interest is the following fact (see Appendix): If P is concentrated on a
finite number M of integers that include all observed values, then as the number n
of observations gets large, 2nI(S, P) becomes chi-square distributed with M − 1 degrees of
freedom (see Hoel, 1971). The natural logarithm must be used in Equation (8.2).
Expanding the logarithms in (8.2) via a Taylor series and retaining the dominant
terms yields the familiar chi-square statistic for testing goodness of fit between the sample
distribution S and the "theoretical distribution" P.
We call the above conditional probability the expert's calibration score for the
n observations, and we propose to use this quantity to measure calibration in
expert assessments.
We can now understand why asymptotic convergence of S to P is not sufficient
for good calibration. Suppose that P is concentrated on six values so that the
number of degrees of freedom of the chi-square distribution is five, and suppose that as the
number of observations n goes to infinity, the expert's calibration score converges
to 1%. From a chi-square table we conclude that 2nI(S, P) converges to 15. This entails that
I(S, P) converges to 0, and hence that S converges to P. However, for all n greater
than some n0, the hypothesis Cal(P) would be rejected at the 5% significance level.
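As an illustration, the calibration score for the ideal measurement can be computed in a few lines; the distributions below are invented, and scipy is used only for the chi-square tail probability.

```python
# A minimal sketch of the calibration score described above: the relative information
# 2*n*I(S, P) between the sample distribution S and the assessed distribution P is
# referred to a chi-square distribution with M - 1 degrees of freedom.  Data are invented.

import math
from scipy.stats import chi2

def calibration_score(p, s, n):
    """P(chi2_{M-1} >= 2*n*I(s, p)); natural logarithms, bins with s_i = 0 contribute 0."""
    stat = 2 * n * sum(si * math.log(si / pi) for si, pi in zip(s, p) if si > 0)
    return chi2.sf(stat, df=len(p) - 1)

p = [1/6] * 6                                   # assessed mass function (six values)
s = [0.15, 0.20, 0.18, 0.12, 0.17, 0.18]        # observed sample distribution
print(calibration_score(p, s, n=100))
print(chi2.isf(0.01, df=5))                     # about 15, the critical value cited in the text
```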
The basic principle of a classical approach to expert evaluation can now be
outlined: Good experts should have good entropy scores and good calibration
scores. This theory is normative in the sense that it prescribes how experts should
perform. In Chapter 10 we present evidence that experienced probability assessors
do indeed perform better as a group with respect to both scores than inexperienced
assessors, for items relating to their field.
F_i(X_i), as these all have the uniform distribution on the unit interval.1 However, it
is seldom convenient to elicit an entire distribution function for every random
variable.
The quantile tests encountered in Chapter 4 may be seen as providing
practical implementations of the ideas in the preceding section. The implementa-
tion involves casting the assessments in a form that yields sets of variables with
identical distributions.
Instead of eliciting the entire mass function from an assessor, we elicit various
quantiles from his mass function. The rth quantile of a mass function P over the
integers is by definition the smallest value i such that

Σ_{j ≤ i} P(j) ≥ r

In the experiment described in Chapter 10, the 1%, 25%, 50%, 75%, and 99%
quantiles were elicited. For each uncertain item, a multinomial variable is
introduced with probability vector p = (p_1, ..., p_6). p_1 denotes the probability that
the true value of the original quantity is less than or equal to the 1% quantile, p_2 the
probability that this value falls between the 1% and the 25% quantiles, etc.
Obviously,

p_1 = 0.01,  p_2 = 0.24,  p_3 = 0.25,  p_4 = 0.25,  p_5 = 0.24,  p_6 = 0.01
Observing between which quantiles the true values fall, a sample distribution
s = (s_1, ..., s_6) is generated. s_1 represents the number of true values falling beneath
the 1% quantile, divided by the total number of observations, etc.
The method of scoring calibration for the ideal measurement applies equally
well to the probability vectors p and s. For n observations, the scoring variable for
calibration is

2n I(s, p) = 2n Σ_i s_i ln(s_i / p_i)
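A small sketch of the quantile-test version of the score follows; the quantile assessments and realizations are invented, and with so few items the chi-square approximation is of course only indicative.

```python
# A sketch of the quantile-test calibration score.  For each item the expert states the
# 1%, 25%, 50%, 75% and 99% quantiles; the interquantile probabilities p are fixed by
# construction, and s records the fractions of true values in each interquantile interval.
# Data are invented for illustration.

import bisect, math
from scipy.stats import chi2

P = [0.01, 0.24, 0.25, 0.25, 0.24, 0.01]

def sample_distribution(quantile_sets, realizations):
    counts = [0] * 6
    for qs, x in zip(quantile_sets, realizations):   # qs = (q1, q25, q50, q75, q99)
        counts[bisect.bisect_left(sorted(qs), x)] += 1
    n = len(realizations)
    return [c / n for c in counts], n

quantiles = [(2, 5, 9, 15, 30), (1, 3, 4, 8, 20), (10, 40, 55, 70, 120)]
truths = [11, 2, 90]
s, n = sample_distribution(quantiles, truths)
stat = 2 * n * sum(si * math.log(si / pi) for si, pi in zip(s, P) if si > 0)
print(s, chi2.sf(stat, df=5))
```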
Figure 8.1 A mass function P approximated by a mass function P' in which the mass
between the 0% quantile (0 days) and the 5% quantile (3 days) has been evenly smeared out,
and similarly for the 50% quantile (9 days), the 95% quantile (15 days), and the 100% quantile
(20 days). P' is the minimal information approximation (relative to the uniform measure) to
P agreeing with P at the 5%, 50% and 95% quantiles.
When quantile tests are constructed, the quantities involved may have different
physical units (e.g., days, kilograms, percentages). Moreover, the intrinsic ranges of
possible values may be quite different.
If the uncertain quantities all had the same intrinsic range, then we could
reasonably adopt the entropy in the joint distribution over all quantities as an
entropy score. If the quantities are independent, this joint entropy is simply the sum
of the entropies for each quantity.
If the uncertain quantities do not have the same intrinsic range, then there are
good arguments for not using the "joint entropy" as an entropy score. For example,
if one of the uncertain quantities can take a billion possible values, then the
maximal entropy for this quantity is ln 1,000,000,000 ≈ 20.7. If the other quantities
can take only one of a hundred possible values, the maximal entropy for these
quantities is ln 100 ≈ 4.6. Simply adding the scores may therefore give inordinate
weight to quantities with intrinsically larger ranges. In particular, if we rank the
experts according to "joint entropy," then we may well find that the entropy rank is
largely determined by the rank on the variable with the largest intrinsic range. If we
do not wish the intrinsic ranges of the uncertain quantities to influence the entropy
score, then the joint entropy score is not appropriate.
The natural way to eliminate the influence of different intrinsic ranges is
simply to normalize the intrinsic ranges to unity. We illustrate this with a
calculation. Calling the lower and upper limits of the intrinsic range the 0 and 100%
quantiles, respectively, let m_j denote the number of units between the jth and the
(j − 1)th quantiles. In the example described above, j runs from 1 to 6. Clearly,
P'(i) = p_j / m_j when i is greater than the (j − 1)th quantile and less than or equal to
the jth quantile. Letting Rg denote the intrinsic range,

H(P') = − Σ_{i in Rg} P'(i) ln P'(i) = − Σ_j p_j ln(p_j / m_j)

If we now rescale the intrinsic range by replacing m_j in the above expression with
m_j/m, where m = Σ_j m_j is the total number of units in the original range, then H(P') is
transformed into

− Σ_j p_j ln(p_j m / m_j)

Since ln m is the maximal value of H(P'), the above expression is always less than or
equal to 0. From the Appendix we recognize the above expression as the negative of
the relative information of P' with respect to the uniform distribution over the
intrinsic range. Letting U denote this uniform distribution, it follows that we can
represent entropy under rescaling by calculating the quantity I(P', U). I(P', U) is
always nonnegative, and low values of H(P') correspond, under rescaling, to high
values of I(P', U).
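The rescaled entropy score can be computed directly from the stated quantiles and the intrinsic range; the sketch below uses the quantities from Figure 8.1 as an example.

```python
# A sketch of the rescaled entropy score I(P', U): the expert's mass is smeared uniformly
# between successive stated quantiles, the intrinsic range is rescaled to unit length, and
# the relative information with respect to the uniform distribution is computed.

import math

def info_wrt_uniform(p, breakpoints):
    """I(P', U) for interquantile probabilities p and quantile breakpoints
    (including the 0% and 100% quantiles, i.e. the intrinsic range)."""
    total = breakpoints[-1] - breakpoints[0]
    widths = [b - a for a, b in zip(breakpoints, breakpoints[1:])]
    return sum(pj * math.log(pj * total / mj) for pj, mj in zip(p, widths) if pj > 0)

p = [0.05, 0.45, 0.45, 0.05]                     # mass between the 0-5-50-95-100% quantiles
print(info_wrt_uniform(p, [0, 3, 9, 15, 20]))    # the quantile values used in Figure 8.1
```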
is the sum of asymptotically independent asymptotic chi square variables with one
degree of freedom. It follows that RI itself is asymptotic chi square with B degrees
of freedom, where B is the number of bins. Hence, RI can be used to define a
likelihood function, which in turn can be used as a calibration score, just as in the
case of quantile tests.
Concerning the entropy score, we note that in a discrete test all variables have
the same intrinsic range, namely "correct" and "not-correct." The response entropy
is the entropy in the joint distribution for all items when the items are distributed
according to the calibration hypothesis. The term "response" serves to remind us
that H(P) measures the entropy in what the subject says. This is to be contrasted
with the sample entropy:
The sample entropy measures the entropy in the subject's performance, assuming
that all items are independent. Whereas the response entropy refers to the
distribution associated with the calibration hypothesis, the sample entropy does
not correspond to a distribution which anyone believes, or would like to believe,
barring the exceptional case that the joint distributions P and S coincide.
An example will illustrate the meaning of the response entropy. Suppose for 1
year two experts are asked each day to give their probability of rain on the next
day. The probability bins run from 10% to 90%. Both experts know that the yearly
"base rate" of rain is 20%. The first expert simply predicts rain with 20%
probability each day and may expect to be well-calibrated. The second expert
distinguishes between days in which he thinks rain is more or less likely. As he is
also aware of the base rate, he will assign days to probability bins such that the bin
probabilities, weighted by the fractions n_i/n of days assigned to each bin, average out
to the base rate:

Σ_i (n_i / n) p_i = 0.2

Let H(E_j) denote the entropy score of expert j, j = 1, 2. Considering H(p_i) as a
function of p_i, we note that H(·) is concave, so by Jensen's inequality (see
supplement to Chap. 6):

H(E_1) = H(0.2) = H(Σ_i (n_i/n) p_i) ≥ Σ_i (n_i/n) H(p_i) = H(E_2)
Under the calibration hypothesis, the entropy of the second expert's responses is
less than that of the first.
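A quick numerical check of this example is given below; the second expert's allocation of days to bins is invented, chosen only so that the bin probabilities average out to the 20% base rate.

```python
# Both experts match the 20% base rate, but the expert who sorts days into different
# probability bins has, by Jensen's inequality, a lower response entropy.  The second
# expert's bin allocation is an invented illustration.

import math

def H(p):                      # entropy of a binary forecast with success probability p
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

days = 365
expert1 = {0.2: days}                  # always forecasts the base rate
expert2 = {0.1: 292, 0.6: 73}          # 292 + 73 = 365; mean probability exactly 0.2
for expert in (expert1, expert2):
    mean_p = sum(n * p for p, n in expert.items()) / days
    response_entropy = sum(n * H(p) for p, n in expert.items()) / days
    print(round(mean_p, 3), round(response_entropy, 3))
```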
METHODOLOGICAL PROBLEMS
where, as before, ni is the number of events in bin i, n is the total number of items,
and Si is the relative frequency of occurrence in bin i. Murphy (1973) extracted this
score as the "calibration term" of the Brier (1950) score. There is no evident way of
extending (8.8) to quantile tests, thereby suggesting that calibration for such tests is
not meaningful. Equation (8.8) is very different from the calibration score derived in
Equation (8.5) of the previous section:
Equation (8.5) is related to the familiar chi-square variable for testing goodness of fit in
statistics. Indeed, taking the Taylor expansion of the logarithm in (8.5) and
retaining the dominant terms, we find that (8.5) can be approximated by

n Σ_i (s_i − p_i)² / p_i
PRACTICAL GUIDELINES
Once the questions have been selected and a format chosen, a dry run must be
performed on a small number of experts. The author has yet to perform a dry run
that did not result in significant improvements.
An analyst must be present during the elicitation.
This is absolutely essential. Some items will inevitably require clarification, as it is
impossible to anticipate all possible unintended interpretations. Moreover, the
presence of an analyst shows that the study has sufficient priority for the expert to
take it seriously.
Prepare a brief explanation of the elicitation format, and of the model for
processing the responses.
According to the principles formulated in Part I, experts have a right to know how
their answers will be processed. A short explanation should be given, and
supporting material should be made available if desired. A clear, concise explana-
tion of the format should be carefully prepared, and a few practice items should be
walked through before starting the elicitation.
Avoid coaching.
The analyst is not an expert; the expert is the expert. The expert must be convinced
that the analyst is interested in the expert's opinion.
The elicitation session should not exceed 1 hour.
The elicitation process is more taxing for the expert than for the analyst. If more
time is required a second session should be arranged.
9
Scoring Rules for Evaluating and
Weighing Assessments
in a proper reward structure. Although the scores introduced here are called
"weights," by way of anticipation, they are treated here simply as scoring variables.
The problem of combining expert assessments is deferred to Part III.
The developments in the later sections of this chapter are somewhat more
technical than the previous chapters, and the proofs appeal to graduate level
probability theory. Although the more technical proofs are placed in a supplement,
some sections will still be a bit rough on nonmathematicians. Such readers can skip
these sections, if they are willing to take the assertions in Chapter 12 at face value.
The first section initializes the discussion by considering improper scoring rules,
and the second section reviews the literature in this field.
The most natural way of scoring the expert in this situation is to assign a score
proportional to p_i if outcome i occurs. In other words, his score R(p, i) can be
written as a function of his assessment p and the observed outcome i as follows:

R(p, i) = K p_i
for some constant K. Suppose his true belief is represented by the probability
vector q. Let Eq denote expectation with respect to q. R(p, i) is strictly proper for q if
the maximum of Eq(R(p, i)) for all probability vectors p is attained if and only if
p = q. More generally,
R(p, i) is strictly proper (positive sensed) if for all probability vectors q,

argmax_p E_q(R(p, i)) is unique and equals q

The "argmax" operator returns the set of arguments that maximize the
function on which it operates; saying that the argmax is unique means that
the argmax is a singleton set. For negatively sensed scoring rules, "argmin"
replaces "argmax."
Let us see whether R(p, i) is strictly proper. We have E_q(R(p, i)) = K Σ_i q_i p_i, which is
linear in p and is therefore maximized by putting all probability mass on an outcome
with the largest q_i, not (in general) by taking p = q. Hence this rule is not strictly proper.
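The impropriety is easy to verify numerically; in the sketch below the expected score under a fixed belief q is evaluated over a grid of candidate reports p.

```python
# A numerical illustration of why the linear rule R(p, i) = K * p_i is not strictly proper:
# with true belief q, the expected score K * sum(q_i * p_i) is maximized not at p = q but
# by putting all mass on the most likely outcome.

K = 1.0
q = [0.5, 0.3, 0.2]                    # the expert's true belief

def expected_score(p):
    return K * sum(qi * pi for qi, pi in zip(q, p))

# search a grid of probability vectors for the best report
grid = [x / 20 for x in range(21)]
candidates = [(a, b, round(1 - a - b, 2)) for a in grid for b in grid if a + b <= 1]
best = max(candidates, key=expected_score)
print(best, expected_score(best), expected_score(q))   # best report is (1, 0, 0), not q
```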
perhaps the first paper on this subject (Roberts, 1965; this example was first
discussed in Winkler, 1969). Suppose a decision maker chooses his probability
measure p to be a weighted sum of the measures p1 and p2 of experts 1 and 2:
After observing event A, the decision maker's posterior probability for B is given by
Writing p_i(B and A) as p_i(B | A) p_i(A), and equating the right-hand sides of the last
two equations, it is easy to check that
Assuming that expert 1 experiences the ratio of updated weights on the left-
hand side as a score, and assuming that he can influence neither the original
weights nor expert 2's assessment, we see that expert 1's score is proportional to his
probability p1(A). The same holds for expert 2. DeGroot and Bayarri (1987) have
recently studied this phenomenon and found that the situation improves somewhat
if the experts incorporate beliefs regarding the assessments of other experts.
The scoring rule discussed above assigns a score to each individual assessment on
the basis of an observed realization. It is natural to focus on such rules when we are
interested in elicitation. However, if we wish to evaluate sets of assessments and
assign them weights, then we might consider rules that assign a score to a set of
assessments on the basis of a set of realizations. These scores need not be the result
of adding scores of individual variables. It turns out that scores of the latter type
are better suited for evaluating assessments. These are called scoring rules for
average probabilities for reasons that become clear in the following section. Before
turning to these, it is important to appreciate the limitations of scoring individual
variables.
Consider an uncertain quantity with possible outcomes o_1, ..., o_m. Let
p = (p_1, ..., p_m) be a probability vector for these outcomes, and let R(p, i) be the score
for assessment p upon observing i. The best known strictly proper scoring rules1
1
The strict propriety of the first two rules can be proved by writing and
setting the derivative with respect to pi of the expected score equal to 0. For the third score, the result
follows easily from the nonnegativeness of the relative information (see the Appendix).
The resolution term is simply the entropy in the joint sample distributions, when
these are considered independent. Calibration is measured by the relative in-
formation for each bin, weighted by the number of items in each bin. Equation (9.2)
will emerge as a special case of scoring rules for average probabilities. The
calibration term by itself will be shown to be strictly proper, and it will prove
possible to assign a multiplicative entropy penalty when entropy is associated with
the assessed distributions (propriety in the latter case is asymptotic).
The principal disadvantage in using a scoring variable like (9.1) or (9.2) to
derive weights (which De Groot and Fienberg do not propose) is the following. The
resulting scores cannot be meaningfully interpreted without knowing the number
of quantities involved and their overall sample distribution.
For example, suppose we score two experts assessing different sets of
quantities with (9.2). Suppose the first expert assesses only one quantity, assigns one
of the possible outcomes probability 1, and suppose this outcome is observed. His
score is then maximal, namely 0. Suppose the second expert does the same, but for
1000 quantities. His score will also be 0. On the basis of their respective scores, we
cannot distinguish the performance of these two experts, and would have to assign
them equal weights. Intuitively, however, the second expert should receive a greater
weight, as his performance is more convincing (the first expert might have gotten
lucky). Dividing R by the number N of uncertain quantities (which De Groot and
Fienberg in fact do) would not help.
The point is this: A scoring rule, being a function of the values of uncertain
quantities, is a random variable, and interpreting the values of the score requires
knowledge of the score's distribution.
Moreover, if S denotes the total sample distribution, then the maximal value of
the resolution term in (9.2) is NH(S). The resolution terms for different sets of
uncertain quantities with different sample distributions therefore cannot be
compared.
In practice we shall often want to pool the advice of different experts who have
assessed different quantities in the past. In light of the above remarks, there would
be no way of combining experts' opinions via scores derived from different sets of
test variables. Even if the scores did pertain to the same variables, it would be
impossible to assess the importance of the differences in scores without some
knowledge of the distribution of the scoring variable.
The theory developed below involves scoring variables that are not gotten by
summing scores for individual variables. This generalization provides considerably
more latitude in choosing (relevant) proper scoring rules. We shall find that the
calibration term in (9.2) is strictly proper in an appropriate sense of the word, and
has a known distribution under common assumptions. Weights can then be
derived on the basis of the significance level of the calibration term, and these
weights are shown to be asymptotically strictly proper. Similar remarks apply to
the entropy penalty.
It is significant and perhaps surprising that a strictly proper calibration score
exists whose sample distribution under customary assumptions is known. The
theory presented below generates a large class of strictly proper scores, but only
one has been found with this property.
In this section we develop a theory of strictly proper scoring rules for average
probabilities and prove a useful representation theorem. (Ω, F) will denote an
arbitrary measurable space. It is assumed that all probability measures are
countably additive and that all random variables on Ω are F-measurable. R and N
denote the real numbers and the integers, respectively. For A ∈ F, 1_A denotes the
indicator function of A. The following notation will be adopted:

O      Set of outcomes
X      A random variable taking values in O; X is F-measurable
M(O)   Set of nondegenerate probability vectors over O
The argmax (argmin) is taken over all nondegenerate probability vectors over the
outcome set O. R(p, N, s) is called strictly proper if it is M(X)-strictly proper. Strict
propriety is stronger for scoring rules for average probabilities than for scoring
rules for individual variables, as the set M(X) from which the assessor's probability
is taken is larger than the set M(O) from which the "response distribution" is drawn.
Theorem 9.1. With the notation as above, let R(p, N, s) be differentiable in p. Then
the following are equivalent:
for all
rule for average probabilities encourages him to respond with the probabilities
Remark 3. If the outcome set is {0,1}, then the variables X 1 ,..., XN are indicator
functions for uncertain events. In this case the terms , vanish and
(9.4) takes the form:
Remark 4. Under the conditions of Remark 3, it follows from Theorem 9.1 that if R
satisfies Equation (9.4a), then
G(p, x) is interpreted as the income of a subject who states price x when his true
price is p, for commodities of which the experimenter will buy f ( y ) units at price y,
Examples
The following proposition shows that the relative information score
is strictly proper.
Proposition 9.1. Let s,
(i) I(s, p) is a convex function of p; and putting
(ii) For m = 2, writing I(s,p) as a function of s1 p1; and dropping the subscripts,
where f(y) = o(y) means that f(y)/y → 0 as y approaches some limit (in this case 0).
The proof is found in the supplement.
The last statement in Proposition 9.1 shows that I(p,s) is not a strictly proper
scoring rule.
Another possible choice for R is

R(p, N, s) = Σ_i c_i (p_i − s_i)²
where the ci are positive constants. From Remark 1 and Theorem 9.1 it is easy to
verify that this is a strictly proper scoring rule. Moreover, it corresponds to a
"quadratic loss function," where R is the loss incurred when one takes "action" p
while s is the "true value." Indeed, scoring rules for average probabilities can be
regarded as loss functions for the random variable s. The set of negatively sensed
proper scoring rules for average probabilities may be regarded as the set of loss
functions for the random variable s that are minimized for the action p equal to the
expected value of s.
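This loss-function reading can be checked by simulation; in the sketch below the distribution generating the sample distributions is invented, and the expected quadratic loss is smallest when the reported p equals the expected sample distribution.

```python
# A small check of the claim that the quadratic rule, read as a loss function for the
# sample distribution s, is minimized in expectation by reporting p equal to the expected
# value of s.  The generating distribution q is invented.

import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.5, 0.3, 0.2])           # per-trial outcome probabilities
c = np.array([1.0, 1.0, 1.0])
N = 20                                  # number of variables behind each sample distribution

def expected_loss(p, trials=20000):
    outcomes = rng.multinomial(N, q, size=trials) / N      # simulated sample distributions s
    return np.mean([(c * (p - s) ** 2).sum() for s in outcomes])

for p in ([0.5, 0.3, 0.2], [0.6, 0.2, 0.2], [1.0, 0.0, 0.0]):
    print(p, round(expected_loss(np.array(p)), 4))          # smallest loss at p = q
```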
The following three propositions extend the formalism of Theorem 9.1 to
cover the case of an arbitrary finite number B of "probability bins." Σ_b denotes
summation over the bins b = 1, ..., B. We revive the notation of the previous
section:

p_b    Probability vector associated with bin b
In particular, this proposition applies if the Rb's are all the same scoring rule.
However, taking B = 1, then, in general,
This can readily be verified for the relative information score, taking N = 2.
The right-hand side is 2 ln 2, whereas the left-hand side is 0. This emphasizes the
difference between using scoring rules for
average probabilities, as against using scoring rules for individual variables and
adding the scores.
ASYMPTOTIC PROPERTIES
In this section we revert to the formalism of Theorem 9.1, involving one probability
bin, and we study asymptotic properties of scoring rules for average probabilities.
A strong and weak form of asymptotic propriety are distinguished. Results with the
weak form are easier to prove, and seem sufficient for applications. As before, M
denotes a subset of M(X). The definitions will be formulated for positively sensed
rules; for negatively sensed rules, "argmin" replaces "argmax."
for Q ∈ M, of the strictly proper scoring rules for average probabilities introduced in
the previous section. We define
Statisticians recognize 2NI(s, q) as the log likelihood ratio; it has an asymptotic chi-
square distribution with m − 1 degrees of freedom under Q. From Proposition 9.1, RI
is a strictly proper scoring rule for average probabilities, for all Q ∈ M(X).
Moreover, if we have B probability bins, B > 1, then we can simply add the scores
RI(q, N, s) for each bin. The resulting sum will have an asymptotic chi-square
distribution with B(m − 1) degrees of freedom (we must also assume the variables in
different bins are independent under Q).
If we expand the logarithm in RI and retain only the dominant term, we arrive
at

N Σ_i (s_i − p_i)² / p_i

which is the familiar chi-square variable for testing goodness of fit. This has the
same asymptotic properties as RI, but is not a strictly proper scoring rule for
average probabilities. The terms p_i in the denominators cause the gradient in
Equation (9.3) to have a term in (p_i − s_i)².
For B = 1, m = 2, the quadratic score has tractable properties. Put p = p_1,
s = s_1. Under P = p the variable
Proposition 9.4. For t ∈ (0, ∞), the score w_t is weakly asymptotic M-strictly proper.
Remark 2. Propositions 9.4, 9.5, and 9.6 will also go through if RI is replaced in the
definitions of w_t and W_t by the sum relative information score introduced in
Formula (8.5):
The number of degrees of freedom must be changed to B(m - 1), and the class M
must be altered appropriately.
The proofs of Propositions 9.4, 9.5, and 9.6 do not use the propriety of the score RI.
Had we used any other "goodness of fit" statistic to define w and W, these proofs
would still go through. The use of RI reflects a preference for a statistic which is
itself strictly proper.
A MENU OF WEIGHTS
Definition. For a given finite set of variables, a weight for an expert assessment p is a
nonnegative, positively sensed scoring rule for the average probabilities. A system
of weights for a finite set of experts (perhaps assessing different variables) is a
normalized set of weights for each expert, if one of the experts' weights is positive;
otherwise the system of weights is identically zero.
The above definition explicitly accounts for the eventuality that all experts might
receive zero weight. Based on the discussions of the previous sections, we formulate
four desiderata for a system of weights. Such weights should
1. Reward low entropy and good calibration
2. Be relevant
3. Be asymptotically strictly proper (under suitable assumptions)
4. Be meaningful, prior to normalization, independent of the specific indicator
variables from which they are derived
The last desideratum is somewhat vague, but is understood to entail that the
unnormalized weights for experts assessing different variables can be meaningfully
compared.
The requirement of asymptotic strict propriety requires explanation. Let us
assume that an expert experiences "influence on the beliefs of the decision maker"
as a form of reward. Let us represent the decision maker's beliefs as a distribution
P_dm, which can be expressed as some function G of experts' weights w_e and
assessments P_e, e = 1, ..., E:

P_dm = G(w_1, ..., w_E; P_1, ..., P_E)

Expert e's influence is found by taking the partial derivative of P_dm with respect to
that argument which e can control, namely P_e. Maximizing expected influence in
this sense is not always the same as maximizing the expected value of w_e. However,
if G is a weighted average:

P_dm = (1/K) Σ_e w_e P_e,   K = Σ_e w_e

then

∂P_dm / ∂P_e = w_e / K
When giving his assessment, expert e will not generally know the weights of the
other experts; he may not even know who they are or how many there are.
Therefore the normalization constant K is effectively independent of the variable P_e
that e can manipulate. Maximizing the expected influence ∂P_dm/∂P_e is effectively
equivalent to maximizing the (unnormalized) weight w_e. In Chapter 11 we shall see
that the above form for G is the only serious candidate.
Requirement 2 is satisfied by scoring rules for average probabilities in the
following sense: The scores do not depend on the probabilities of outcomes that
might have been observed but were not. Of course, the average of probabilities is
not itself the probability of an outcome that can be observed. However, if the
where
and similarly for H(sb ). In conjunction with Remark 2 to Theorem 9.1, we note
that the distinction between sample and response entropy is not meaningful for
quantile tests.
Proposition 9.3 allows us to add an arbitrary function of the sample
distribution to a proper scoring rule. The maximal value of the sample entropy
H(n, s) is ln m. We can define positive, positively sensed weights taking values in
[0,1] as follows:
, where S denotes the total sample distribution over all variables. If the
variables are independent and distributed according to S, the quantity
The weights (9.7) to (9.10) are easily seen to be asymptotically strictly proper. For
large n, 1 — xl(D) represents the significance level at which we would reject the
hypothesis that the expert had assigned test variables to bins randomly. These
weights still have the property that poorly calibrated experts can receive sub-
stantial weight.
Weights using a multiplicative entropy penalty based on the response entropy
can avoid this problem. A suitable form for such weights is
If the experts' weights are derived from assessment of different variables in the past,
then comparing their response entropies might not be meaningful. For example,
suppose one expert has assessed the probability of rain in The Netherlands, where
the weather is quite unpredictable, and another has assessed the probability of rain
in Saudi Arabia. The latter would have a lower response entropy simply because
his assessments would all be near 0. If these experts' assessments were to be
combined via weights derived from their past assessments, then their response
entropies should not be used. In such situations the term 1/H(n, p) should be
replaced by
where S is the overall sample distribution. Weight (9.13) is the average information
in the expert's assessments relative to the base rate. To satisfy the conditions of
Proposition 9.6, (9.13) must be bounded, and bounded away from 0, and the overall
sample distribution must be treated as a constant.
Weights (9.11), (9.12), and (9.13) are 0 whenever the expert's calibration score
exceeds the critical value. Moreover, it is possible to argue that the response
entropy is a more appropriate index for lack of information than the sample
entropy. The weighted combinations refer indeed to the assessed probabilities p_b
and not to the sample probabilities s_b. If s = p, then these two coincide. However,
an experiment discussed in Chapter 10 will illustrate that the sample and response
entropies can differ, even for well-calibrated assessors. The weak asymptotic
propriety of these weights is proved in Proposition 9.6.
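Since the explicit formulas (9.11) and (9.12) are not reproduced above, the sketch below should be read only as one plausible form consistent with the description: a calibration score divided by the response entropy, set to zero below a significance cutoff.

```python
# A sketch of a weight with a multiplicative entropy penalty, in the spirit of the weights
# discussed above.  This is one plausible reading of the description, not the book's
# definition of (9.11)-(9.12).

import math
from scipy.stats import chi2

def weight(p, s, n, response_entropy, alpha=0.05):
    stat = 2 * n * sum(si * math.log(si / pi) for si, pi in zip(s, p) if si > 0)
    cal = chi2.sf(stat, df=len(p) - 1)          # calibration score
    return 0.0 if cal < alpha else cal / response_entropy

p = [0.01, 0.24, 0.25, 0.25, 0.24, 0.01]
s = [0.00, 0.30, 0.25, 0.25, 0.20, 0.00]
print(weight(p, s, n=30, response_entropy=1.2))
```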
In the case of quantile tests, the term 1/H(n, p) would simply be a constant and
should be replaced in (9.11) and (9.12) by some other appropriately bounded
function of the assessment. Recalling the discussion in Chapter 8, the average
relative information of the assessor's (approximated) density functions with respect
to the uniform distribution can be used. The details of this will be postponed until
Chapter 12.
HEURISTICS OF WEIGHING
We conclude with some heuristic remarks on choosing a score for weighing expert
opinion. First, the unnormalized weights for each expert depend only on the
assessments of each expert and the realizations. When a decision maker calls
various experts together for advice on a particular variable, he could compute
weights based on prior assessments of different variables. When experts are
combined, their weights must be normalized to add to unity. However, the decision
maker must ensure that the experts are all calibrated on the same effective number
of variables. In the language of hypothesis testing, this ensures that the experts are
subjected to significance tests of equal power. The technique of equalizing the
power of the calibration tests will be discussed in Chapter 12.
It seems obvious that we do not want to assign a high weight to experts who
are very poorly calibrated, regardless of how low their entropy is, and regardless of
which information measure is used. Entropy should be used to distinguish between
experts who are more or less equally well-calibrated. The weight (9.12) does in fact
behave in this way. The calibration scores W_t will typically range over several
orders of magnitude, while the entropy scores typically remain within a factor of 3.
Because of the form of the weight (9.12), when weights of different experts are
normalized, the calibration term will dominate, unless all experts are more or less
equally calibrated. The weight (9.10) does not have this property. We also note,
referring to the discussion of the previous chapter, that there are other measures of
"lack of information" that can be substituted for the function f in Proposition 9.6.
Any other measure should be chosen in such a way that the calibration term will
dominate in weight (9.12).
In testing hypotheses, it is usual to say that a hypothesis is "rejected" when the
test statistic exceeds its critical value. This way of speaking, of course, is not
appropriate in dealing with expert probability assessments. In hypothesis testing,
choice of the significance level α means choosing to reject true hypotheses with
frequency α. In combining expert opinions, we are not trying to find the "true"
expert and "reject" all the others as "false." Indeed, if we collect enough data, then
we will surely succeed in rejecting all experts, as no one will be perfectly calibrated.
The significance level α is chosen so as to optimize the decision maker's "virtual
weight," as explained in Chapter 12.
SUPPLEMENT
This supplement gives the proofs for various results cited in the text.
The proof uses three lemmata; the first was introduced as Theorem 7.1.
(exactly occur)
Lemma 9.2. Let A be an L x n matrix with rank L,L<n, and let Let
or
The left-hand side does not depend on l, or on the choice of basis, hence both sides
must vanish, and y must be orthogonal to x^(0) and to all z ∈ Z. Since b ≠ 0, x^(0) ∉ Z,
and y is orthogonal to n − L + 1 linearly independent vectors. The proof is
completed by noting that any solution to the linear system Ax = b may be written
x = x^(0) + z for some z ∈ Z. Hence y ∈ Y if and only if y is orthogonal to x^(0) and to
and Dim
Lemma 9.3. Let R(p,N,s) satisfy Equation (9.3), and fix N. Let
For let
outcome i_j occurs k_{i_j} times in X_1, ..., X_N, j = 1, ..., L}
Use k ∈ W to index the coordinates of R^W. Then, by writing
The following consideration shows that these equations are independent: since
we can always find two probability vectors Q, Q' whose vectors of average
probabilities q^(N) and q'^(N) disagree in just one coordinate. By Lemma 9.2, the
dimension of A(p) equals |W| − (m − 1), hence the dimension of A(p)⊥ equals m − 1.
We show that Dim B(p, i) ≥ m − 1. Fix i. If m = 2, Dim A(p)⊥ = 1, the
functions g_i are all zero, and it is trivial to show that B(p, i) has dimension 1.
Assume m ≥ 3. It suffices to find m − 1 linearly independent vectors in B(p, i). In
fact, it suffices to find m − 1 vectors whose components on m − 1 coordinates
k^(1), ..., k^(m−1) are linearly independent, where
and otherwise
It suffices to find scoring rules R^(1), ..., R^(m−1) satisfying Equation (9.4) such that
the (m − 1) by (m − 1) matrix Y with
other derivatives being zero. This shows that the scores R^(j) satisfy (9.4), j = 1, ...,
m − 1. Multiplying row j by p_i(1 − p_i − p_j) for j ≠ i, the ith row of the resulting
matrix Y has the entries p_1, ..., p_{i−1}, p_i − 1, p_{i+1}, ..., p_{m−1}. That Y has full rank
can be seen by subtracting the ith row of the above matrix from each of the other
rows. The resulting rows are linearly dependent if and only if

p_1 + p_2 + ··· + p_{i−1} + p_{i+1} + ··· + p_{m−1} = 1 − p_i
However, this condition cannot hold if p_m > 0, which is the case if p ∈ M(O). It
follows that Y has full rank, and the proof is completed.
We now prove Theorem 9.1. We fix N and adopt the notation of Lemma 9.3.
Equation (9.4) implies (9.3):
It follows that for all q ∈ M(O) and i = 1, ..., m − 1, R(q, i) ∈ A(q)⊥. From Equation
(9.5) it follows that B(q, i) is contained in A(q)⊥. From Lemma 9.3 it now follows
that B(q, i) = A(q)⊥, hence R(q, i) ∈ B(q, i). Since this holds for all q ∈ M(O),
i = 1, ..., m − 1, and since R is differentiable, it follows that R has the form of
Equation (9.4).
The first term is the negative entropy of s and is always nonpositive. It suffices to
verify that
where
Proof. Let 1A denote the indicator function of the set A. Since F is bounded, F is
integrable with respect to dF and the Fubini theorem may be applied.
Corollary.
Hence,
Suppose r ∈ M(O) with r ≠ q. Since it follows from the proof of
Proposition 9.4 that
Remark. Note that the continuity of χ²_{m−1} is essential in the above proof. If we
replaced χ²_{m−1} in the definition of W_t by a noncontinuous distribution, for example,
Q^N, then Proposition 9.5 would yield only a crude estimate of E_Q W_t. This illustrates
the advantage of studying propriety from the asymptotic perspective. Note also
that t is essential.
We treat only the first score, as the argument for the second is similar. Choose
Q ∈ M. Suppose r ∈ M(O), r ≠ q. We must show that for all sufficiently large N

The right-hand side is bounded from above by b/a, and this bound does not depend
on N. From the proof of Proposition 9.4, the left-hand side goes to ∞ as N → ∞.
10
Two Experiments with Calibration
and Entropy
This chapter discusses two experiments recently carried out at the Delft University
of Technology. Both involve the measurement of calibration and entropy with
experts. One experiment (Cooke, Mendel, and Thys, 1988) was designed in such a
way that the performance of "experienced" and "inexperienced" experts could be
compared both on general knowledge items and on items relating to their common
field of expertise. The experts were all mechanical engineers, and "expertise" in this
test refers to technical expertise. The experiment used the quantile tests discussed in
Chapter 8. The second (Bhola et al., 1991) also used more and less experienced
experts, but the sense of expertise might be described as managerial rather than
technical. It involved assessments by project leaders of the probabilities that their
project proposals would be realized. In a follow-up experiment the evolution of
calibration scores can be tracked.
In the first test, the experienced subjects outperformed the inexperienced
subjects, while in the second test the reverse occurred. This warns against any
simple-minded conclusions relating performance to experience. However, tentative
conclusions can be drawn from each test.
The first section of this chapter reviews the psychometric literature on
calibration and knowledge. The subsequent sections describe and analyze the
experiments.
Several attempts have been made in the past to relate calibration to "knowledge."
Adams and Adams (1961), in one of the earliest studies, found no correlation
between knowledge and calibration for subjects taking a final examination. In this
case "knowledge" was determined by the number of exam questions answered
correctly. Sieber (1974) found similar results. Lichtenstein and Fischhoff (1977)
The Subjects
The experimental subjects fell into two groups. One group, the inexperienced
operators, was in the last year of a 5-year training program, roughly equivalent to a
bachelor of science program at an American university. Their field of study was
mechanical engineering. All these subjects were between 20 and 25 years of age, and
all had completed a course in statistics.
The second group, the experienced operators, had all completed the training
program. Their average age was 36 years, and they had on the average 15 years of
practical experience. Some of them were teachers at the training facility. Twenty-
two inexperienced and twelve experienced subjects took both general knowledge
and expertise-specific, or technical, calibration tests. Three additional experienced
subjects took only the general knowledge test. All subjects were male.
The Tests
The tests were modeled on the quantile calibration tests of Alpert and Raiffa (1982).
Some of the general knowledge items were taken literally from this test, and others
were adapted to the situation in Holland. The following are examples of the
uncertain quantities from the technical tests:
Table 10.1 Calibration and entropy ranks of the subjects on the general knowledge and technical tests
information I(P', U) of P' with respect to the uniform density on the intrinsic range
U. Low values of I(P', U) correspond to highly entropic, or highly uninformative,
distributions. The values for I(P', U) are then added for each subject and the
subjects are ranked. Rank 1 corresponds to the "best" entropy score [i.e., largest
value for the sum of the terms I(P', U)]. The results are presented in Table 10.1. The
Figure 10.1 Graphical representation of the correlations between calibration and entropy
ranks for the top three ranked subjects on the general knowledge and technical tests.
The Subjects
There were 14 project leaders for whom previous assessments and data regarding
eventual realizations were available. Table 10.2 shows the rank ordering of the
experts in terms of age and years of experience in the firm.
The Results
The subjective probability assessments were discretized into 10%, 20%,...,90%
probability bins. Response and sample entropy were computed for an expert
1. We are grateful to D. Roeleven for catching some errors in the codes used in a previous version of this analysis.
Table 10.2 Rank ordering of the experts by age and by years of experience in the firm

Expert   Age rank   Experience rank
1        4.5        6
2        9.5        9
3        14         14
4        11         11
5        9.5        12
6        12         7
7        1          2
8        7          10
9        6          1
10       13         13
11       3          4
12       2          3
13       8          8
14       4.5        5
assessing n items as H(P)/n and H(S)/n, respectively, where H(P) and H(S) are
defined in Formulae (8.6) and (8.7), respectively. If RIe denotes the statistic defined
in (8.5) for expert e, then the calibration score C(e) of e is defined as
C(e) = Prob{RI > RI_e | e is perfectly calibrated and all items are independent}
In addition, the base rate resolution index defined in Chapter 9 [see formula (9.9)]
is computed for each expert. This index measures the degree to which the sample
distributions in each bin, per expert, differ from the overall sample distribution, per
expert. Unnormalized weights are computed by dividing the calibration score by
the response entropy [compare formula (9.12)]. The results for each expert are
shown in Table 10.3. For all events whose assessments are shown in Table 10.3, the
fate of the project proposal is known. Table 10.4 shows the allocations and
realizations for each bin.
Several points regarding Tables 10.3 and 10.4 require comment. First of all,
most of the experts are quite well calibrated. For only two experts is the probability
of a relative information score [(8.5)] exceeding the observed value less than 5%.
The chi square approximation is unreliable for the number of items shown in Table
10.4, so the calibration scores were determined by direct calculation. Also notable
is the fact that the sample entropy is generally lower than the response entropy. In
part, this is explained by the fact that many bins contained a small number of items.
Hence, the sample distribution for true and untrue events was often (0, 100%), or
(100%, 0), and these extreme distributions, of course, have zero entropy. Neverthe-
less, inspecting group performance for the 20% and 80% bins indicates that these
experts may display some underconfidence. Unlike the experiment with mechanical
engineers, there is no significant rank correlation between response entropy and
calibration scores. Neither is there significant rank correlation between the
calibration scores and the base rate resolution index.
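The "direct calculation" of the calibration scores is not spelled out above; the sketch below shows one way such a calculation might proceed, by Monte Carlo simulation under the calibration hypothesis. The bin allocations are invented.

```python
# A Monte Carlo sketch of computing the calibration score without the chi-square
# approximation: C(e) is estimated as the fraction of simulated, perfectly calibrated
# experts whose relative-information statistic is at least as large as the observed one.

import math
import numpy as np

rng = np.random.default_rng(1)

def ri_statistic(bins):
    """Sum over probability bins of 2 * n_b * I(s_b, p_b) for binary outcomes."""
    total = 0.0
    for p_b, n_b, k_b in bins:                       # bin prob, items in bin, items realized
        for s, p in ((k_b / n_b, p_b), (1 - k_b / n_b, 1 - p_b)):
            if s > 0:
                total += 2 * n_b * s * math.log(s / p)
    return total

def calibration_direct(bins, sims=50_000):
    observed = ri_statistic(bins)
    exceed = 0
    for _ in range(sims):
        simulated = [(p_b, n_b, rng.binomial(n_b, p_b)) for p_b, n_b, _ in bins]
        if ri_statistic(simulated) >= observed:
            exceed += 1
    return exceed / sims

# invented allocations: (bin probability, items in bin, items that came true)
bins = [(0.2, 6, 2), (0.5, 4, 2), (0.8, 5, 3)]
print(calibration_direct(bins))
```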
Table 10.3 Calibration, response and sample entropy, base rate resolution scores, and unnormalized weights for the 14 experts

Table 10.5 shows the aggregate calibration scores, with projects grouped
according to potential contractor. Curiously, the score was worse for contractor A,
which is the contractor with which Delft Hydraulics has the most experience.
There is a correlation between the percentage of projects in the Netherlands of
each expert and his calibration score. The Spearman rank correlation is 0.58 (both
ranked from high to low), which is significant at the 5% level. In other words,
project leaders with more projects in the Netherlands tended to have higher
calibration scores.
Table 10.6 shows the aggregated scores for projects broken down into four
categories of monetary turnover. It is seen that the very large projects are assessed
better than others.
There is a negative correlation between calibration and age. The Spearman
rank correlation coefficient, —0.82, is significant at the 5% level. Figure 10.2 graphs
the calibration score against ranked ages. This negative correlation may indicate
that the younger project leaders are better probability assessors. However, the
younger leaders have a different mix of projects among the different contractors
than the older leaders. It is possible that younger leaders have proportionally more
projects in the Netherlands, and that this explains their better performance.
To decide whether age or percentage projects in the Netherlands,
%(A + B + C), best explains the data, Kendall's tau coefficient for partial rank
correlation is computed. This coefficient measures the rank correlation between
two variables, when the influence of a third variable is eliminated by keeping it
constant. The results are
Rank corr. (cal. score, %(A + B + C) | age constant) = 0.102
Rank corr. (cal. score, age | %(A + B + C) constant) = 0.577
Only the latter rank correlation is significant, hence the variable "age" better
explains the differences in calibration than the variable "percentage projects in the
Netherlands."
CONCLUSION
This chapter reviews various models for combining expert opinion found in the
literature, and motivates some of the choices underlying the models in the
following three chapters. We assume that the combination results in a probability
distribution for the "decision maker." Three broad classes of models are discussed.
The Bayesian models discussed in the second section all take their point of
departure from Bayes' theorem and require the decision maker to supply a prior
probability. The first section is devoted to weighted combinations without prior
distributions. It is useful to recall from Chapter 6 that subjective probabilities exist
for rational individuals. Most decision-making bodies are not individuals but
groups. There is no reason why groups should have a preference structure like that
of rational individuals, and no reason why groups, for example, the scientific
community, should have prior distributions. The third section is devoted to
psychological scaling models. These models are rather unlike the models in the first
two sections. They have received little attention from Bayesians studying the
"expert problem," but have demonstrated their appeal among people involved with
applications.
Mathematical Background
Much has been learned in recent years about the mathematical properties of
various rules for combining probabilities. Excellent summaries can be found in
French (1985) and Genest and Zidek (1986). In this section we restrict our attention
to the class of combination rules derived from "elementary weighted means." This
class is wide enough to include the most interesting possibilities, and its mathemat-
ical properties are well understood. Within this class we can derive many important
results very easily, using standard results in analysis. The locus classicus for these
results is Hardy, Littlewood, and Polya (1983), to which we refer for proofs.
To get the discussion started, we may recall the expert estimates of the
probability per section-hour of pipe failure taken from the Rasmussen Report (see
Table 2.5). The 13 estimates ranged from 5E-6 to E-10. The analyst performing the
risk analysis of a nuclear power plant must choose some estimate for this quantity.
How is he to choose? Should he take the largest or the smallest value? Should he
take the arithmetical average of the expert estimates (4.7E-7), or perhaps the
geometrical average (7.5E-9)? Should he perform some weighted combination of
the expert assessments? It is clear that the choice of combination rule will have a
great impact on his final result. Is there any principle to guide his choice?
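The two averages mentioned here are easy to compute; a minimal sketch follows (the 13 individual estimates from Table 2.5 are not listed in this chapter, so the values below are merely illustrative).

    import math

    def arithmetic_mean(estimates):
        return sum(estimates) / len(estimates)

    def geometric_mean(estimates):
        return math.exp(sum(math.log(x) for x in estimates) / len(estimates))

    estimates = [5e-6, 1e-7, 3e-8, 1e-9, 1e-10]      # illustrative values only
    print(arithmetic_mean(estimates))                # dominated by the largest estimates
    print(geometric_mean(estimates))                 # averages the orders of magnitude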
In fact there are general principles to which the analyst can appeal in choosing
a rule of combination. Suppose we have experts 1,..., E, and that each expert i
gives a probability vector Pi1,...,p i n for the elements A 1 . . . , An of some partition
of the set S of possible worlds. Further, let w l , . . , WE be nonnegative weights that
sum to unity. For any real number r we define the
Elementary r-norm
weighted mean:
r-norm probability:
For a fixed set of weights, choosing r larger gives more influence to the larger
assessments for each alternative. In the limit for r , the largest assessment for
each j is chosen and these are normalized to determine the probability P . Similar
remarks apply for choosing smaller r. All the probabilities Pr possess what is called
the zero preservation property; that is, if pij = 0 for all i, then Pr(j) = 0. P0 possesses
this property in the extreme; if any expert assigns alternative j probability 0, then so
will the decision maker who uses P0.
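A minimal sketch of the r-norm combination rules defined above (the function name is mine); it yields the arithmetic pool at r = 1, the geometric pool at r = 0, and approaches the normalized maxima as r grows.

    from math import prod

    def r_norm_pool(p, w, r):
        """Combine expert probability vectors p[i][j] with weights w[i] via the rule P_r."""
        E, n = len(p), len(p[0])
        if r == 0:
            m = [prod(p[i][j] ** w[i] for i in range(E)) for j in range(n)]   # M_0: geometric mean
        else:
            m = [sum(w[i] * p[i][j] ** r for i in range(E)) ** (1.0 / r) for j in range(n)]
        total = sum(m)
        return [mj / total for mj in m]

    p = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]   # two experts, three alternatives
    w = [0.5, 0.5]
    print(r_norm_pool(p, w, 1))              # arithmetic pool P_1
    print(r_norm_pool(p, w, 0))              # geometric pool P_0
    print(r_norm_pool(p, w, 50))             # close to the normalized maxima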
Marginalization
From (6) we see that P_1(j) = M_1(j); and for r ≠ 1 the probability P_r(j) is affected by the terms M_r(k), k ≠ j. This can lead to curious effects. Suppose r > 1, and suppose after deriving P_r, the decision maker subdivides the alternative A_n into two disjoint subalternatives A_n1 and A_n2. Let P'_r denote the probability derived after hearing the experts' probabilities for these subalternatives. From (5) we see that the denominator of P_r will be less than that of P'_r, hence for j ≠ n, P_r(j) > P'_r(j).
If the combination rule is such that the probabilities are unaffected by
refinements of the partition of alternatives A_1,..., A_n, then the rule is said to
possess the marginalization property. From the above it is clear that P1 is the only
probability derived from elementary weighted means that possesses the mar-
ginalization property. Marginalization in this context is equivalent to the so-called
strong setwise function property (McConway, 1981): the decision maker's proba-
bility for any event A depends only on the experts' assessments for event A.1
A combination rule that does not possess the marginalization property is
downright queer. Consider a simple example. Two experts whom I esteem equally
are consulted about the probability that my flashlight, which I forgot to unpack
after last year's vacation, still works. Both experts give the flashlight a probability
of 0.8 of not working. I decide to combine their opinions via a normalized
geometric mean (i.e., P_0). It is easy to check that my probability for the flashlight
not working is also 0.8. A discussion ensues whether failure is most likely due to a
dead battery or corroded contacts (only these possible failure modes are con-
sidered, and the eventuality that both occur is excluded). The first expert assigns
probabilities 0.7 and 0.1 to these events, whereas the second assigns probabilities
0.1 and 0.7, respectively. On receiving this information my probability that the
flashlight won't work drops to 0.73. Further disagreement about which contacts
might be corroded causes my probability to drop again. If the first expert decides
that a dead battery has probability 0.8 and corroded contacts probability 0, my
probability drops to 0.51. If the second expert then changes his probability for
corrosion and dead battery to 0.8 and 0, respectively, then my probability that the
flashlight fails drops to 0. All the while, the experts agree that the flashlight has a
probability 0.8 of not working.
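The first step of this example is easy to verify numerically. The sketch below pools the two experts over the refined partition {dead battery, corroded contacts, works} with a normalized geometric mean and recovers a failure probability of about 0.73.

    from math import prod

    def geometric_pool(p, w):
        m = [prod(pi[j] ** wi for pi, wi in zip(p, w)) for j in range(len(p[0]))]
        return [x / sum(m) for x in m]

    # Coarse partition {fails, works}: both experts give 0.8 / 0.2.
    print(geometric_pool([[0.8, 0.2], [0.8, 0.2]], [0.5, 0.5]))            # [0.8, 0.2]

    # Refined partition {dead battery, corroded contacts, works}.
    pooled = geometric_pool([[0.7, 0.1, 0.2], [0.1, 0.7, 0.2]], [0.5, 0.5])
    print(pooled[0] + pooled[1])                                           # about 0.73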
1. In a more general setting marginalization and the strong setwise function property are not
equivalent; however, marginalization and the zero preservation property imply the strong setwise
function property. McConway (1981) and Wagner (1982) prove that P1 is the only combination rule
satisfying the strong setwise function property when the number of alternatives is at least 3.
Independence Preservation
The combination rule P0 has one property that some people consider important
(Laddaga, 1977). Let A and B be unions of the alternatives A_1,..., A_n. Suppose we elicit only the experts' probabilities for A and B, and suppose that all experts regard these events as independent, that is, p_i(A ∩ B) = p_i(A)p_i(B) for all i. Writing a_i = p_i(A) and b_i = p_i(B), one easily checks that P_0(A ∩ B) = P_0(A)P_0(B), so that the decision maker who combines via P_0 also regards A and B as independent. This independence preservation property is not shared by P_1.
Determining Weights
Authors of the Bayesian persuasion point out that no substantive proposal has
been made for determining the weights to be used in weighted combination. French
(1985) summarizes the obstacles to any operational definition of the weights. Such a
The fourth proposal is not worked out, and from Chapter 9 we can anticipate the
problems encountered when the traditional notion of a scoring rule for individual
events is maintained.
An elegant suggestion was made by De Groot (1974) and independently by
Lehrer and Wagner (1981). We return to the matrix p = [p_ij] of assessments from expert i of alternative j. Each expert i is asked to report the weight w_ij that he would assign to the opinion of expert j (i also assigns a weight to himself). After learning the assessments from all the other experts, expert i should change his probability, on this view, to

    p'_ij = ∑_{k=1..E} w_ik p_kj,   that is,   p' = wp,

where the second expression is the matrix notation for the first. There being no
reason to stop at p', the experts should change again to p'' = w(wp) = w²p, and so on. Under certain general conditions which we shall not describe here, this process converges to a matrix w^∞ p, where the rows of w^∞ are all the same. In other words, by iteratively revising their own opinions in the above manner, the experts all converge toward the same probability vector. Practical methods exist for approximating the matrix w^∞.
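A minimal sketch of the iteration just described, repeated multiplication by the weight matrix until the experts' rows coincide (the weights and assessments are hypothetical):

    import numpy as np

    def degroot_consensus(w, p, tol=1e-12, max_iter=10_000):
        """Iterate p -> w p until all experts hold (nearly) the same probability vector."""
        w, p = np.asarray(w, float), np.asarray(p, float)
        for _ in range(max_iter):
            new_p = w @ p
            if np.allclose(new_p, p, atol=tol):
                break
            p = new_p
        return p

    w = [[0.5, 0.3, 0.2],          # weights expert i assigns to experts j
         [0.2, 0.6, 0.2],
         [0.3, 0.3, 0.4]]
    p = [[0.7, 0.2, 0.1],          # assessments p_ij
         [0.4, 0.4, 0.2],
         [0.1, 0.3, 0.6]]
    print(degroot_consensus(w, p))   # all rows converge to the same vector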
The "De Groot weights" mentioned in Chapter 5 are the weights given by any
row of w . Despite the evident mathematical appeal of this procedure there are
three serious drawbacks:
1. The method requires that the same experts assess all alternatives.
2. The notion of "honesty" in assigning weights to the opinions of one's
colleagues needs defining. There is no guarantee that honesty, whatever
that means, is encouraged by this method.
BAYESIAN COMBINATIONS
the latter problem reduces to determining the terms under the product on the right-
hand side. Apostolakis and Mosleh suggest two models for accomplishing this.
1. On the additive error model, expert i's estimate X_i is treated as the sum of two terms:

    X_i = x + e_i,

where x denotes the true value and e_i an additive error term. The model assumes that the errors e_i are normally distributed with mean m_i and standard deviation σ_i. The choice of m_i and σ_i reflects the decision maker's appraisal of expert i's bias and accuracy.
Under these assumptions the likelihood P(X_i | x) of observing estimate X_i, given that the true value is x, is simply the value of the normal density with mean x + m_i and standard deviation σ_i.
It is interesting to study the decision maker's posterior expectation for x, given
the advice X, under the above assumptions. As in all Bayesian models, the decision
maker must first provide his prior P(x). Let us assume that P(x) is the normal density with mean μ_0 and standard deviation σ_0. A standard calculation² shows

    E(x | X) = w_0 μ_0 + ∑_{i=1..E} w_i (X_i − m_i),

where

    w_i = (1/σ_i²) / (1/σ_0² + ∑_{k=1..E} 1/σ_k²),   i = 0, 1,..., E.

Hence, the decision maker's updated expectation is just a weighted sum of the experts' expectations, corrected to remove the appraised biases, and the decision maker's prior expectation. The weights are determined by the assessed accuracies σ_i, including the decision maker's assessment of his own accuracy, σ_0. The decision maker thus treats himself as the (E + 1)th expert.
2. On the multiplicative error model expert i's assessment is treated as the product

    X_i = x · e_i

of the true value x and an error term e_i. Taking logarithms on both sides of the above equation, we reduce this case to the additive error model for the observation ln X_i. The assumption of normality in that model entails that X_i is lognormally distributed. Assuming independence as above, the decision maker's posterior expectation for ln x in this case is the weighted sum obtained by applying the preceding formula to the observations ln X_i, with the biases and accuracies now referring to the errors ln e_i.
If the experts' assessments are not independent, then the likelihood function cannot
be written as a product of the likelihoods for the single-expert assessments.
However, under the normality assumption, we can specify a joint normal
likelihood function. Dependences between the experts are accounted for by
specifying the correlation coefficients between the experts' assessments.
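A minimal numerical sketch of the posterior expectation under the additive error model above, assuming independent normal errors; all numbers are hypothetical.

    def posterior_mean(prior_mean, prior_sd, estimates, biases, sds):
        """Precision-weighted combination of the prior and the bias-corrected expert estimates."""
        precisions = [1.0 / prior_sd**2] + [1.0 / s**2 for s in sds]
        corrected  = [prior_mean] + [x - m for x, m in zip(estimates, biases)]
        return sum(p * c for p, c in zip(precisions, corrected)) / sum(precisions)

    # Prior N(10, 4^2); three experts with appraised biases m_i and accuracies sigma_i.
    print(posterior_mean(10.0, 4.0, estimates=[12.0, 9.0, 15.0],
                         biases=[0.0, -1.0, 2.0], sds=[2.0, 3.0, 5.0]))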
The additive error model was introduced by Winkler (1981). Lindley and
Singpurwalla (1986) describe a more general model in which the experts give
2. This result is mentioned in Lindley (1983). One shows

    P(x | X) = C exp(−K(x − L/K)²/2),

where

    K = 1/σ_0² + ∑_{i=1..E} 1/σ_i²,   L = μ_0/σ_0² + ∑_{i=1..E} (X_i − m_i)/σ_i².

The constant C serves to normalize the density P(x | X). The mean of P(x | X) is E(x | X) = L/K.
Another variation has been developed for the case that experts assess a probability
of an uncertain event. If pi is the assessed probability from expert i that the event
occurs, then X_i = ln[p_i/(1 − p_i)] is the "log odds" for this assessment, and can be
treated as in the additive error model (Lindley, 1985). Clemen and Winkler (1987)
have applied this model to probabilistic weather forecasts with the prior distribu-
tion derived from the base rate frequencies and found relatively poor performance.
This model is attractive, conceptually simple, and has demonstrated its worth
in applications. However, there are two principal drawbacks:
1. These models place a heavy assessment burden on the decision maker. Not
only must he specify his prior distribution, he must also specify two
parameters for each expert plus a correlation coefficient for each pair of
experts. For the 13 experts assessing pipe failure probabilities that comes to
104 assessments, in addition to his prior. The model gives no guidance how
this is to be done, and there is no provision for updating the decision
maker's estimates of the experts' biases and accuracies on the basis of past
performance.
2. For more general classes of distributions, the correlation coefficients do not
determine the joint distribution, and this approach would not work.
    P(x | D) ∝ P(D | x)P(x),     (11.1)

where P is the decision maker's probability density for some unknown quantity x,
and D represents some observational data relevant to x. P(D\x) and P(x) are called
natural conjugates if P(x) and P(x\D) belong to the same class of distributions, that
is, if they can be described by the same parameters. Natural conjugates are very
convenient in Bayesian inference, since the updating can be expressed as an
updating of the parameters of the prior distribution.
Inserting (11.4), (11.3), and (11.2) in (11.1) and using (7.9), we see that P(x | D) is a beta density with parameters (r + r', n − r + n' − r'). Hence, knowing D and
knowing the parameters of the prior distribution allows us to write the posterior
immediately.
We see that the binomial likelihood function (11.2) and the beta prior (11.4) are natural conjugates, since the posterior is also a beta density. As described in the supplement to Chapter 7, the prior distribution admits a simple interpretation in terms of equivalent observations. We saw that observing r heads in n tosses induced a change in the parameters from (r', n' − r') to (r + r', n + n' − r − r'). Hence, it is natural to consider the original parameters r' and n' as equivalent to having already observed (r' − 1) heads in (n' − 2) tosses, starting with the uniform prior distribution, which is a beta density with parameters (1, 1). The expectation of (11.4) is r'/n' [Eq. (7.11)]. This is therefore the value we should use to estimate x if we have no new data. After observing D our belief state is equivalent to starting with the uniform prior and observing (r + r' − 1) heads in (n + n' − 2) tosses. The value n' is associated with the variance of (11.4) and indicates how confident we are in the prior assessment. Keeping r'/n' fixed, the variance decreases as n' gets larger, reflecting greater confidence in the assessment r'/n' [Eq. (7.12)].
Winkler's idea is simply this. When an expert gives his probability density
function for a variable of interest, we interpret this as an equivalent observation. Of
course, this is only possible if his density has the required form. Suppose experts 1,..., E give beta densities with parameters (r_i, n_i − r_i), i = 1,..., E, for x. We consider first the case where the experts are not weighted. The decision maker starts with the prior of the first expert in (11.1). Each successive expert is then treated as an observation D_i = "r_i heads in n_i tosses," and fed into (11.1). The result is a beta density with parameters (r, n − r) where r = r_1 + ... + r_E and n = n_1 + ... + n_E.
Notice that n and r can be regarded as weighted sums of the experts' parameters, with weights w_i = 1. If we wish to let the advice of some experts weigh more heavily than that of others, we simply choose different (nonnegative) weights w_i and adopt a beta posterior with parameters (∑_i w_i r_i, ∑_i w_i (n_i − r_i)).
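A minimal sketch of this pooling rule, weighted sums of the experts' beta parameters (the weights and parameters below are hypothetical):

    def pool_beta_experts(params, weights):
        """Combine expert beta densities with parameters (r_i, n_i - r_i) into one beta posterior."""
        r = sum(w * ri for w, (ri, _) in zip(weights, params))
        n_minus_r = sum(w * bi for w, (_, bi) in zip(weights, params))
        return r, n_minus_r                       # parameters (r, n - r) of the pooled density

    params  = [(2, 8), (5, 5), (1, 19)]           # (r_i, n_i - r_i) for three experts
    weights = [1.0, 2.0, 1.0]                     # the second expert counts twice as heavily
    r, b = pool_beta_experts(params, weights)
    print(r, b, r / (r + b))                      # pooled parameters and pooled expectation r/n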
Morris' Theory
Morris (1974, 1977) developed a Bayesian theory of expert use that is conceptually
important, even if the assumptions underlying the theory are prohibitively strong.
The theory attempts to account for past performance and attempts to show how a
decision maker can use Bayes' theorem to recalibrate an expert. The theory is of
Single Expert
We have one expert who gives a density function f(x), with cumulative distribution F(x), for a continuous random variable X. The updating technique requires that f be determined by its mean and variance. Morris assumes that f is a normal density.
We can therefore represent the data from this expert as D = (m, v), where m and v
are the mean and variance, respectively, of the expert's density. The problem is to
determine the decision maker's posterior density P(x\m, v). Once again, we begin
with Bayes' theorem,

    P(x | m, v) = P(m, v | x)P(x) / P(m, v).
As P(m, v) does not depend on x, we may absorb this term into a normalization
constant. The crucial term to be determined is P(m, v | x). To do this Morris makes two assumptions. The first is scale invariance:

    P(m, v | x) = P(m | v, x)P(v | x) = P(m | v, x)P(v).

(The first equality follows from the definition of conditional probability.) Since P(v) does not contain x, we can absorb this term into the proportionality constant as well. Scale invariance says the decision maker's probability that the expert gives variance v is independent of the unknown value of x. This assumption is quite gratuitous. In assessing log failure probabilities, for example, it is prima facie plausible that higher probabilities will be better known and hence have smaller variance; Cooke (1986) and Martz (1986) confirmed this conjecture.
We determine P(m | v, x) by introducing a change of variable. Instead of asking for the probability density for m, given x and v, we might just as well ask for the probability that the true value x corresponds to the rth quantile of the expert's distribution, given that x is the true value, and given that the expert has chosen variance v. Consider v fixed and introduce the so-called performance indicator function θ:

    θ(X) = F(X) = quantile of expert's distribution realized by X.

Note that θ is a function of the random variable X. As the mean gets larger with X = x fixed, the quantile corresponding to x gets smaller. In fact, writing F_{m_0,v} for the expert's cumulative distribution function when his mean is m_0 and his variance v,

    P(m ≤ m_0 | x, v) = P(θ ≥ F_{m_0,v}(x) | x, v).     (11.6)
We now take derivatives of both sides of (11.6) with respect to m_0. On the left-hand side we simply get P(m_0 | x, v), the term we want to determine. The right-hand side yields

    −P(θ = F_{m_0,v}(x) | x, v) · ∂F_{m_0,v}(x)/∂m_0.     (11.7)

Since f is normal,

    ∂F_{m_0,v}(x)/∂m_0 = −f_{m_0,v}(x).

Substituting this into (11.7), and the result into the derivative of (11.6) with respect to m = m_0, yields

    P(m_0 | x, v) = P(θ = F_{m_0,v}(x) | x, v) · f_{m_0,v}(x).

Morris's second assumption, shift invariance, requires that P(θ = r | x, v) should not depend on x. This means roughly the following: for any r ∈ [0,1], the probability density of seeing the true value of X correspond to the rth quantile of the expert's distribution is independent of X.
This is a very strong assumption, as a simple example makes clear. Suppose
the decision maker is the director of a company, who asks an adviser what the
competitor's price will be for the next year. The adviser states his variance v, and
retires to consider his mean value m. The decision maker thinks the price will be
about $20, but he isn't very sure. He is confident in his adviser, and thinks there is a probability of 1/2 that the true price will fall between the 25% and 75% quantiles of
whatever distribution the adviser gives. At this moment an industrial spy delivers a
memo recently purloined from the competitor in which the competitor's price is
revealed to be $5. The competitor is making a surprise move and starting a price
war. Now, if the decision maker's opinion of his adviser satisfies shift invariance, he
would still believe that $5 will fall between the adviser's 25% and 75% quantiles with probability 1/2. However, if hearing the price $5 leads him to think that the
adviser's 25% quantile will probably be greater than $5, then shift invariance is
violated.
The probability density P(θ = r | v) is the decision maker's probability density that the true value will realize the rth quantile of the expert's distribution. Morris calls this probability density the performance function, written φ(r) = P(θ = r | v).
3. Note that this step is not possible for other two-parameter families of densities, for example, the
beta or the lognormal families.
Hence, the decision maker's posterior after receiving the expert's advice is

    P(x | m, v) ∝ φ(F(x)) f(x) P(x),     (11.10)

where F and f are the expert's (normal) cumulative distribution function and density with mean m and variance v.
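A minimal numerical sketch of this recalibration on a grid. The expert density, the decision maker's prior, and the performance function below are all illustrative choices, not Morris's.

    import numpy as np
    from scipy.stats import norm

    def morris_posterior(x_grid, prior_pdf, expert_mean, expert_sd, performance):
        """Posterior over x_grid proportional to performance(F(x)) * f(x) * prior(x)."""
        f = norm.pdf(x_grid, expert_mean, expert_sd)
        F = norm.cdf(x_grid, expert_mean, expert_sd)
        post = performance(F) * f * prior_pdf(x_grid)
        return post / post.sum()                       # normalize on the grid

    x = np.linspace(-10.0, 30.0, 2001)
    prior = lambda t: norm.pdf(t, 10.0, 5.0)           # decision maker's prior
    perf  = lambda r: 6.0 * r * (1.0 - r)              # performance function peaked at r = 1/2
    posterior = morris_posterior(x, prior, expert_mean=14.0, expert_sd=2.0, performance=perf)
    print((x * posterior).sum())                       # posterior expectation of x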
Multiple Experts
Morris' theory for multiple experts proceeds along the same lines as for the single
expert, and we shall not give the derivations. Suppose for convenience that there
are two experts giving assessments F and G, characterized by mean values
m = (m_1, m_2) and variances v = (v_1, v_2). If the experts are judged to be independent by the decision maker then the generalization of (11.10) is straightforward:

    P(x | m, v) ∝ φ_1(F(x)) φ_2(G(x)) f(x) g(x) P(x),

where f and g denote the densities of F and G. If they are judged to be dependent, then the dependence is absorbed into the joint performance function φ(F(x), G(x)):

    P(x | m, v) ∝ φ(F(x), G(x)) f(x) g(x) P(x).
to the decision maker's distribution, and similarly for Gx and GY. Since exchan-
geable events have the same probability of occurrence, and since Fx and FY are
invertible, for all r ∈ [0,1],
PSYCHOLOGICAL SCALING
Figure 11.1 Distributions for scale values in population of experts for ages of Ron, Don,
Lon, and John, with mean values r, d, I, and j.
CONCLUSION
For a probability distribution p = (p_1,..., p_n), the entropy is H(p) = −∑_i p_i ln p_i. H(p) assumes its minimal value 0 if p_i = 1 for some i, and assumes its maximal value ln n if p_i = 1/n, i = 1,..., n. H(p) is commonly regarded as an index of the lack of information in the distribution p; high values of H indicate low informativeness. Let q = (q_1,..., q_m) be another distribution, and let pq denote the product distribution. pq is the joint distribution of two independent variables having marginal distributions p and q. Writing I(p, q) = ∑_i p_i ln(p_i/q_i) for the relative information of p with respect to q, it is easily verified that

    H(pq) = H(p) + H(q)   and   I(p, u) = ln n − H(p),

where u is the uniform distribution over 1,..., n; that is, u_i = 1/n, i = 1,..., n.
Let s denote a sample distribution generated by N independent samples from the distribution p. Let χ²_d denote the cumulative distribution function of a chi square variable with d degrees of freedom. Then, as N → ∞,

    P(2N I(s, p) ≤ x) → χ²_{n−1}(x);

that is, for large N the quantity 2N I(s, p) is approximately chi square distributed with n − 1 degrees of freedom.
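A minimal sketch of these quantities written directly from the definitions above: entropy, relative information, and the chi square approximation to the calibration probability (scipy's chi2 distribution is assumed).

    from math import log
    from scipy.stats import chi2

    def entropy(p):
        return -sum(pi * log(pi) for pi in p if pi > 0)

    def rel_info(s, p):
        """Relative information I(s, p) of the sample distribution s with respect to p."""
        return sum(si * log(si / pi) for si, pi in zip(s, p) if si > 0)

    def calibration(s, p, N):
        """Probability of exceeding the observed relative information, chi square approximation."""
        return 1.0 - chi2.cdf(2 * N * rel_info(s, p), df=len(p) - 1)

    p = [0.2, 0.5, 0.3]                      # theoretical distribution
    s = [0.3, 0.4, 0.3]                      # sample distribution from N = 20 observations
    print(entropy(p), rel_info(s, p), calibration(s, p, 20))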
defined such that within a reasonable time it is unambiguously clear whether the
events have occurred. We assume for purposes of exposition that this is the case for
all events.
A set of experts e = 1,..., E assesses the probability of each uncertain event by
assigning the corresponding indicator functions to one of B probability bins. Each
probability bin is associated with a distribution over the possible outcomes
"occurred" and "not-occurred." The bins will be characterized by the probability pb
of occurrence, 1 > pb > 0, b = 1,...,B. When no confusion can arise, pb is also used
to denote the distribution associated with the bth bin. p_e(m) denotes the probability associated with the bin to which expert e assigns variable 1_m.
On the basis of the observed values of 1_1,..., 1_N and the experts' assignments, weights w_e will be determined for each expert e = 1,..., E, satisfying w_e ≥ 0. These weights will then be used to determine the decision maker's probability P for subsequent, as yet unobserved uncertain events 1_{N+k}:

    P(1_{N+k} = 1) = ∑_{e=1..E} w_e p_e(N + k) / ∑_{e=1..E} w_e,

assuming that w_e is positive for some e. Modeling parameters can always be chosen to assure that this is the case. These weights are global in the sense that the same weights apply to all "unobserved variables" 1_{N+k}, k = 1,.... They are also
dynamic in the sense that with each new observation the weights can be recomputed.
To define the weights we we restrict attention for the moment to one expert e.
The following notation refers to the assignments of expert e for the variables
1_1,..., 1_N, and the dependence on e and N is suppressed in the notation. Let

    n_b = number of variables assigned to bin b; b = 1,..., B     (12.8)
    n = (n_1,..., n_B),   N = ∑_b n_b
    s_b = sample distribution of variables in bin b; b = 1,..., B
    s = (s_1,..., s_B)
The quantity C(e) takes values close to 1 if e's calibration is very good and takes
values close to 0 if e's calibration is very poor. The weights (12.12) reward good
calibration and low response entropy, and the calibration score dominates the
entropy score, as noted in Chapter 9. For all α ∈ (0,1) the weights w'_{e,α} are weakly asymptotically strictly proper, as discussed in Chapter 9 [see (9.12)]. This is the
basic weight recommended for use in the classical model with uncertain events. It
should be emphasized that the choice of the calibration term in (12.12) is largely
determined by the requirement that the weights should have desirable asymptotic
properties. The choice of the information term is determined by intuitions
regarding the representation of high informativeness (see Chapters 8 and 9).
Note that an expert can receive zero weight. This occurs if the probability of
seeing an overall deviation between the sample distributions and bin probabilities greater than I(s_1, p_1),..., I(s_B, p_B) is less than α, under the assumption that the items
are independent and distributed according to the probabilities of the bins to which
they are assigned. This is equivalent to saying that the user regards each expert as a
statistical hypothesis (namely that the variables are independent and that their
marginal probabilities correspond to the expert's assessments). The close relation
to classical hypothesis testing accounts for the designation "classical model." It is
essential for the propriety of the weights (12.12) that α > 0, that is, that zero weight
is a real possibility.
This having been said, it must be emphasized that weighing experts differs
from testing statistical hypotheses in two respects, as discussed in Chapter 9. First,
the decision maker is not only interested in the calibration of his experts, but also in
their informativeness. Second, the significance level α is not chosen in the same way
as in hypothesis testing. This aspect is explained under "Optimization; Virtual
Weights" below.
[x_i0, x_i,R+1] = intrinsic range for variable X_i; x_i0 and x_i,R+1 are called the lower and upper cutoff points for X_i respectively, and they must satisfy x_i0 < x_ire < x_i,R+1, for all r = 1,..., R, and all e = 1,..., E. We adopt the notation: x_i0e = x_i0; x_i,R+1,e = x_i,R+1.
p_r = f_r − f_{r−1}, r = 1,..., R + 1; "theoretical probability" associated with the event Q_ie(X_i) ∈ (f_{r−1}, f_r]; this event is termed "probability outcome r."
p = (p_1,..., p_{R+1})
χ²_R = cumulative chi square distribution function with R degrees of freedom
It will be observed that each variable must be supplied with an "intrinsic range,"
containing all the quantiles elicited from the experts. In some cases the choice of the
cutoff points might be motivated, for example, if Xi were a relative frequency or a
percentage, then 0 and 1 might provide suitable cutoff points. However, in many
cases the choice of cutoff points must be made ad hoc. This is one point at which the
analyst must simply make a decision. The choice affects only the measure of
information, and model performance is quite robust with respect to this choice. In
the computer implementation used to analyze the data in Chapter 15, a simple
"10% overshoot" above and below the interval generated by the set
{xire|e = 1,..., E; r = 1,..., R} is used. Because the intrinsic range depends on the
assessments of all experts, the information score of a given expert may change
slightly as experts are added or removed.
When the value xi is observed, then for each expert exactly one probability
outcome is "hit." In this way, observations of xi; i = 1,..., N, generate a sample
distribution over the probability outcomes for each expert. The sample distribution
depends on N, but we suppress this in the notation as N is considered fixed. For the
notation in (12.14) we restrict attention to a single expert e.
    s = (s_1,..., s_{R+1}) = sample distribution over the probability outcomes r = 1,..., R + 1.
I(e) is the average over i = 1,..., N of the relative information in the densities Qie
with respect to the uniform distribution over the intrinsic range for variable Xi (see
the discussion in Chapter 8 following footnote 2). The discussion of the weights
(9.12) applies to (12.15) above. I(e) must be bounded and bounded away from 0 for (12.15) to be weakly asymptotically strictly proper in the sense of Proposition 9.6.
The principles underlying the basic model sketched above have been discussed in
the foregoing chapters. They include the principles for expert opinion in science,
Savage's axioms for rational preference, the theory of proper scoring rules, and the
properties underlying the choice of the weighted arithmetic average combining rule
discussed in the previous chapter. However, we cannot derive a model for
combining expert opinions exclusively from "first principles." At several points in
the preceding chapters alternative modeling approaches were indicated. The
present section is not devoted to exploring all such possible alternatives; rather
three variations and/or enhancements are mentioned, which seem particularly
important and which can be evaluated in light of the experiences in Chapter 15.
Item Weights
The weights (12.12) and (12.15) above are global. It would be possible to replace the
average entropy or relative information with respect to the uniform distribution Ui
by terms which measure the (lack of) information for each variable Xi separately.
This is most natural with regard to continuous variables. In this case the term I(e)
in (12.15) would be replaced by the relative information of expert e's density for X_i with respect to the uniform distribution on the intrinsic range of X_i,

    I_i(e) = I(Q_ie, U_i),

for each X_i, i = 1,..., N. Normalization is performed per item in the obvious way.
This allows an expert to downweight or upweight himself on individual items,
according as his quantiles are further apart or, respectively, closer together.
It is hardly a foregone conclusion that the use of item weights would result in
better assessments for the decision maker. In the engineers experiment discussed in
Chapter 10 a negative correlation between calibration and information was found.
It may well be that experts are less well calibrated on just those items for which they
are more informative. If so, the decision maker's calibration could be degraded by
the use of item weights. The results presented in Chapter 15 fail to provide
convincing evidence in favor of item weights.
1. We may have to wait a long time. In the Dutch meteorological experiment described in Chapter
15, many calibration scores were greater than 10^-3 after some 2500 realizations. In quantile tests this
limit seems to be reached more quickly.
that on a modest set of realizations one expert achieves calibration score 10^-6
while all others receive the minimal score. The model would effectively assign all
weight to the best-calibrated expert, and would produce a poorly calibrated
decision maker. The point is, it may be imprudent to let a very poorly calibrated
expert dominate other experts who are even worse. In fixing the numerical
accuracy of the routines we effectively limit the ratio of calibration scores. For
example, in the present implementation the accuracy is 10^-4; hence a calibration score of 10^-3 can never dominate by a factor greater than 10, etc. If low scores are
being caused by a large number of realizations, then we might improve model
performance by reducing the effective power of the test; hence we should have to
choose an optimal power level. Examples will be discussed in Chapter 15.
ISSUES
We discuss three issues that may arise in the application of the basic model.
Seed Variables
It will often arise that the decision maker needs assessments for events, none of
which will be observed within a required time frame. This typically occurs in risk
analysis, where probabilities for unlikely and nonrepetitive events must be assessed.
In this case the model must be "seeded" with other events, whose outcomes are
known, or become known within a short time. These seed variables must be drawn
from the experts' area of expertise, but need not pertain to the problem at hand.
Weights are then determined on the basis of seed variables and used to define the
decision maker's distributions for the variables of interest. The choice of meaning-
ful seed variables is difficult, and critical.
The number of seed variables required depends on the number of bins and on
the bin probabilities p_b. According to standard statistical practice, the chi square approximation for 2n_b I(s_b, p_b) is acceptable if the expected number of items in each cell of each bin is at least five, that is, if n_b p_b ≥ 5 and n_b(1 − p_b) ≥ 5.
The decision maker is not testing hypotheses, but combining assessments. His
concern is that C(e) distinguish well- from less-well-calibrated experts, and that w'e
be asymptotically proper. The accuracy of the chi square approximation is
therefore less crucial than in hypothesis testing. Of course, one could forego the chi
square approximation and calculate the distribution of 2n_b I(s_b, p_b) explicitly. Even
when this is done, however, the number of seed variables required with uncertain
events compares unfavorably with that for continuous variables when three
quantiles are elicited. In this case, 8 to 10 seed items seems sufficient to observe
substantial differences in calibration. Of course, the fewer the number of re-
alizations, the less robust the calibration scores are likely to be.
In the more recent applications of the classical model, considerable effort has
been put into the identification of meaningful seed variables. This effort is felt to
pay off; not only can such seed variables be found, they greatly enhance confidence
in expert judgment generally, and in the classical model in particular.
Disaggregation
If the number of uncertain items is large, it may become possible to disaggregate
the variables into distinct sets and compute weights for each such set. This makes
sense whenever there are enough realizations, and when the scores substantially
differ between sets of variables. The Dutch meteorological experiment discussed in
Chapter 15 affords examples where disaggregation was worthwhile.
• The effective power of the calibration tests should be set equal to the smallest
number of items which any expert has assessed.
• For uncertain events, if the base rates over the seed variables are different for different experts, 1/H(e(n)) in (12.12) should be replaced by (1/N) ∑_b n_b I(p_b, S) (summation over b = 1,..., B, where n_b depends on the expert e), where S is the overall sample distribution.
• For continuous variables, the intrinsic range must be determined by the
analyst; if this cannot be done meaningfully, then either item weights must be
used, or information must not be considered at all.
Substituting the first equation above into the second, we see that P = P_dm. Hence adding the virtual expert to the set of experts would not result in a new distribution for the decision maker.
Moreover, since the decision maker's virtual weight depends on the significance level α, we can choose α so as to maximize the decision maker's virtual weight,² thereby defining a unique set of weights w_e = w_{e,α*}, where α* is the value of α for which the decision maker's virtual weight is maximal.
The optimal virtual weight of the decision maker can be compared with the global
weights of the experts, or with the virtual weight of other decision makers
generated by other combinations of the experts' opinions. For example, this allows
us to compare the performance of the basic model with the use of item weights.
The use of virtual weight optimization to determine a is a significant feature of
the classical model. It underscores the difference between forming weighted
combinations of expert opinions and classical hypothesis testing. Of course, we
could also simply choose weights for experts that optimize the calibration and
information scores of the decision maker. However, this optimization problem is
mathematically intractable on all but very small sets of experts. It is very
nonrobust, and experts' weights would, in general, have no relation to the quality of
their assessments. In the language of optimization, the restriction to weights that
are asymptotically strictly proper scoring rules introduces a constraint that renders
the optimization tractable.
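A minimal sketch of the search over α. The scoring of the combined decision maker on the seed variables is passed in as a callable, since the virtual-weight computation itself is not reproduced in this excerpt; the toy stand-in below is only meant to show the mechanics.

    def optimal_alpha(expert_scores, virtual_weight, alphas):
        """Grid search for the significance level maximizing the decision maker's virtual weight.

        expert_scores  : list of (calibration, information) pairs, one per expert
        virtual_weight : callable mapping normalized expert weights to the decision
                         maker's own weight on the seed variables
        """
        best = (None, float("-inf"))
        for alpha in alphas:
            raw = [c * i if c >= alpha else 0.0 for c, i in expert_scores]
            if sum(raw) == 0.0:
                continue
            weights = [r / sum(raw) for r in raw]
            value = virtual_weight(weights)
            if value > best[1]:
                best = (alpha, value)
        return best

    scores = [(0.60, 1.2), (0.04, 2.0), (0.30, 0.9)]
    toy = lambda w: sum(wi * c for wi, (c, _) in zip(w, scores))   # toy stand-in scorer
    print(optimal_alpha(scores, toy, alphas=[0.0, 0.01, 0.05, 0.1, 0.2, 0.5]))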
In all applications performed to date, the optimized decision maker under the
basic classical model has consistently outperformed the best expert, and also
outperformed the decision maker gotten by simple arithmetic averaging of the
experts' distributions. This, however, is a fact of experience and not a mathematical
theorem, as the following section illustrates.
CORRELATION
In the foregoing chapters attention has been drawn to the fact that expert
assessments are frequently correlated. It is therefore appropriate to address this
issue.
Expert assessments (either 'within' or between experts) will generally be
correlated, as they are based on common sources of information. Such correlation
is usually benign, and always unavoidable. Another sort of correlation arises when
experts conspire to influence the decision maker by giving assessments at variance
with their true opinions.
For example, if experts conspire to give the same distribution for all variables,
then they will obviously receive the same score, and the weight assigned to their
assessments will be unfairly multiplied by the number of experts in the conspiracy.
The weights encourage honesty, but if, in spite of this encouragement, the experts
give dishonest assessments, the model is powerless to redress this. However, the
2. For this idea I am indebted to Simon French.
optimization feature in choosing the significance level will assign these assessments
weight 0 if, indeed, the decision maker is better off without them on the seed
variables. In discussing correlation we therefore assume that the experts all respond
honestly.
A peculiar sort of correlation can actually degrade model performance, such
that the decision maker's virtual weight is lower than the experts' weights. This
phenomenon has never been observed in practice, but it is of theoretical interest.
To illustrate, assume there are two experts and four uncertain events. The
events with indicators x1 and x2 occur, and the events with indicators y1 and y2 do
not occur. The experts assign these events to 40% and 60% probability bins, as
shown in Table 12.1. Obviously, the experts' calibration and entropy scores will coincide, so they will each receive weight 1/2. The decision maker's assignment can be easily derived and is also shown in Table 12.1. It is clear that the decision maker's distribution is more entropic, and it is also clear that he will be less well calibrated than either expert.
[Table 12.1: Assignment of the indicator variables to the 40% and 60% probability bins for the two experts and for the decision maker. The x's occur and the y's do not occur. Each expert receives weight 1/2. The individual assignments are not reproduced here.]
There are two striking features of this example: (1) The experts are equally well
calibrated (so that the optimal choice of α cannot lead to one of them being
rejected), and (2) their assessments are positively correlated for the events that
occur and negatively correlated for those which do not occur.
CONCLUSIONS
The classical model is not a mathematically closed theory. One could hardly expect
to derive a theory of weights from first principles. It is a practical tool and must be
judged as such. Once the principles for applying expert opinion in science have
been satisfied, the only important question is "does it work?" Applications are
discussed in Chapter 15. We may conclude this chapter with a few general remarks.
A fundamental assumption of the classical (as well as the Bayesian) model is
that the future performance of experts can be judged on the basis of past
performance. The success of any implementation depends to a large measure on
defining relevant variables whose true values become known in a reasonable time
frame. This requires resourcefulness on the part of the analyst as well as the
sympathetic cooperation of the experts themselves. It is essential that the experts
understand the model and generally appreciate its potential usefulness.
Experts may have biases. Their biases may be expected to fall into two general
categories, probabilistic biases and domain biases. Probabilistic biases, such as the
base rate fallacy, anchoring, overconfidence, representativeness, involve the misper-
ception of probabilities. Domain biases are connected with individuals' preferences
relating to their specific fields. An expert may be "sold" on a particular design (e.g.,
his own), and may have a visceral distrust of other designs. In principle, the model
can deal with both types, though the latter is much more difficult to identify and
verify. Identification will generally require knowledge of the individuals involved,
and verification will require substantial specific data. Once identified, a domain
bias may be neutralized by a judicious disaggregation of the data set.
The whole question of domain biases should be approached gingerly. In a
sense, domain biases are the essence of expert opinion. One uses more than one
expert in the hope that such biases will interfere destructively. The user must not
embark on a crusade to eliminate domain biases. If he suspects a significant
domain bias but lacks the data to verify it, he must under no circumstances attempt
to neutralize the bias via a post hoc choice of model parameters. This would
compromise his objectivity. The user is not himself an expert and must not choose
sides in professional squabbles.
Finally, we emphasize that the classical (as well as the Bayesian) model
requires experts with some implicit insight into probability theory and some facility
in estimating numerical values. The assessment task may involve considerable
"cognitive depth." If the experts have little training or feeling for this type of task,
the psychological scaling models may be more appropriate.
13
The Bayesian Model
Bayesian models require that the user supply prior probability distributions and
process expert assessments by updating these distributions via Bayes' theorem. The
model of Mendel and Sheridan (1986, 1989; see also Mendel, 1989) uses Bayes' theorem to recalibrate and combine expert assessments of continuous variables.
The only other model (Morris, 1977) which putatively accomplishes the same
has serious computational and philosophical drawbacks (see Chap. 11). The
Mendel-Sheridan model compares favorably with other Bayesian models in the
following respects:
1. Default egalitarian prior distributions are given by the model itself.
2. The expert assessments are not restricted to a given class of distributions,
but rather the experts give quantiles of their distributions for continuous
variables (which makes this model compatible with the classical model).
3. Bayes' theorem automatically accounts for correlations in the expert
assessments—correlation coefficients need not be assessed by the decision
maker.
4. The computational algorithm has been streamlined for ease of im-
plementation.
5. The model has demonstrated its worth in experiment.
The Bayesian model is compatible with the continuous variable version of the
classical model.
Because of the features mentioned above, the Mendel-Sheridan model fully
conforms to the principles formulated in Chapter 5. There are features of this model
that restrict its practical application at this moment, and these will become
apparent in the following.
In addition to the model of Mendel and Sheridan, a variant involving the
notion of partial exchangeability will be presented below. The variant repairs a
theoretical shortcoming in the Mendel-Sheridan model, but is still under develop-
ment at present.
This chapter is written jointly with M. Mendel. Comments of Simon French on a previous draft of
this chapter are gratefully acknowledged.
BASIC THEORY
Figure 13.1
Figure 13.2
variables and is going to learn from experience. The question is, experience with
respect to what? Roughly speaking, the Mendel-Sheridan model considers the
above matrix as a random variable whose possible values are the cells that might be
hit. Before obtaining the experts' advice on the current variable, the decision maker
uses the theory of exchangeability to update his probability distribution for this
random variable. The variant proposed below introduces exchangeability as-
sumptions only after the experts' assessments for the current decision variable are
known, that is, only after a path has been chosen. In the following the notation and
definitions for treating the general case of E experts and R quantiles are introduced.
It will be convenient to illustrate the concepts with reference to the above figures.
The notation used here agrees with that in the classical model, and agrees as much
as possible with the notation in Mendel and Sheridan. It is convenient to introduce
an underlying sample space Ω with generic element ω. The σ-field of measurable events will not be specified, and it is assumed that all variables are measurable. The
sample point also determines the advice of the experts, as, indeed, the user is
uncertain what this advice will be before hearing it. The functional dependence of
advice on the sample point will generally be suppressed in the notation.
{1,..., E} = set of experts;     (13.1)
Ω = underlying sample space, ω a sample point;
X_i: Ω → R; i = 1,..., N; (practically) continuous variables;
X_0: Ω → R; (practically continuous) current decision variable;
Q_ie; e = 1,..., E, i = 0,..., N; expert e's cumulative distribution function for X_i;
f_1,..., f_R; quantiles elicited from each expert for each variable; 0 = f_0 < f_1 < ... < f_R < f_{R+1} = 1;
p_r = f_r − f_{r−1}; theoretical probability for the "probability outcome" Q_ie(X_i) ∈ (f_{r−1}, f_r], r = 1,..., R + 1.
a_ie = x_i1e,..., x_iRe; where Q_ie(x_ire) = f_r, r = 1,..., R; a_ie is expert e's advice for X_i;     (13.2)
a_i = a_i1,..., a_iE = advice from experts 1,..., E for X_i;
Λ = {(r_1,..., r_E) | r_e ∈ {1,..., R + 1}, e = 1,..., E}; corresponds to the set of cells in the matrix of Figure 13.2.     (13.3)
Z_i: Ω → Λ; i = 0,..., N, where Z_i(ω) = (r_1,..., r_E) if and only if X_i(ω) hits the r_e-th probability outcome of expert e, e = 1,..., E.
Let the elements of Λ be lexicographically¹ ordered; j = 1,..., (R + 1)^E. We shall write Z_i(ω) = j to mean Z_i(ω) = the jth element of Λ in the lexicographical ordering.
Y_j(ω) = |{i | Z_i(ω) = j, i = 1,..., N}|; Y_j counts the number of times that j has been observed for the variables preceding the current decision variable.     (13.4)
Y = (Y_1,..., Y_{(R+1)^E})
P = the decision maker's probability.
P(Z_0) is shorthand for P(Z_0 = j) for j arbitrary, etc. It must be borne in mind that
1. Ordered pairs of reals are lexicographically ordered when (x, y) > (u, v) if x > u, or x = u and y > v; otherwise (x, y) < (u, v). The generalization for E-tuples of reals is straightforward.
the user's probability in these expressions is not conditioned on the expert's advice
a0 for the current decision variable. These expressions reflect the user's probability
of seeing various outcomes "hit" without yet knowing what the experts' quantiles
are.
Put somewhat simply, (13.6) says that the user is only interested in the experts'
advice for the current decision variable, and the frequency with which the various
values of j have been hit on the previous variables.
Assumption (13.7) states that before receiving any advice and before performing
observations, the user is prepared to take the expert's advice at face value, and to
regard the experts as independent.
P(Z0| Y) and P(Z0 | a0, Y) are minimally informative, subject to the constraints
expressed in the foregoing assumptions. (13.8)
Determining P(Z0 | Y)
Let θ = (θ_1,..., θ_{(R+1)^E}), with θ_j ≥ 0 and ∑_j θ_j = 1. Assumption (13.5) and De Finetti's representation theorem entail that [see Eq. (7.13)]

    P(Z_1 = j_1,..., Z_N = j_N) = ∫ θ_{j_1} · ... · θ_{j_N} dF(θ)

for some suitable prior probability measure dF over the possible values of θ. If dF
can be determined from the constraints (13.5) to (13.8), then the first step in the
inference will in principle be solved. Mendel and Sheridan give the density f = dF,
which minimizes the information with respect to the uniform distribution subject to
the constraint (13.7):
The solution is
When = 0,
=0 otherwise.
Figure 13.4 Decision maker's prior corresponding to 5%, 50%, and 95% assessments in
Figure 13.1.
A bilinear loss function has the form

    L(x, θ) = K_1(x − θ) if x ≥ θ,   L(x, θ) = K_2(θ − x) if x < θ,

where K_1 > 0, K_2 > 0. The Bayesian estimate under a loss function is that value x'
of x for which the expected loss is minimum. It is not difficult to show that under
the above loss function the Bayesian estimate corresponds to the K_2/(K_1 + K_2)th quantile of the distribution for θ.
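This fact is easy to check by brute force: minimize the expected bilinear loss over a grid and compare with the K_2/(K_1 + K_2) quantile (the sample standing in for the distribution of θ below is hypothetical).

    import numpy as np

    def bilinear_loss(x, theta, k1, k2):
        return np.where(x >= theta, k1 * (x - theta), k2 * (theta - x))

    rng = np.random.default_rng(0)
    theta = rng.lognormal(mean=1.0, sigma=0.8, size=20_000)     # stand-in distribution for theta
    k1, k2 = 1.0, 3.0

    grid = np.linspace(theta.min(), theta.max(), 1000)
    expected = [bilinear_loss(x, theta, k1, k2).mean() for x in grid]
    print(grid[int(np.argmin(expected))], np.quantile(theta, k2 / (k1 + k2)))   # nearly equal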
Mendel and Sheridan (1986) have computed the cumulative losses for the
bilinear loss functions L(x, 0) shown in Figure 13.5. Two experts, A and B
(graduate students at MIT), assessed 53 items from the Guinness book of records.
Figure 13.6 (a and b) shows the cumulative losses for experts A and B, compared to
that of the recalibrated experts derived by applying the Bayesian model to these
experts singly. Figure 13.6(c) shows the loss when the prior distribution for both
experts is used, compared with the loss when this prior distribution is updated.
Figure 13.7 shows the losses of the model applied to the experts singly, compared to
the losses when the experts are combined.
Model performance may be called encouraging when the number of ob-
servations is fairly large. Note that the improvement wrought by the model is
especially marked for the loss function L3, which punishes underestimation. People
do tend to underestimate record items. In Chapter 15 the Mendel-Sheridan model
is applied in a case study involving only 13 observations, and model performance is
less satisfactory. Finally, note that loss has the dimension of the variable being
estimated. A change of unit, say from kilograms to grams, will cause the computed
loss to increase by a factor of 1000. Hence, the meaning of cumulative loss for
variables measured in different units (kilograms, dollars, etc.) must be sought in the
analyst's own utility function.
Figure 13.7 Cumulative losses for (calibrated) observer with subject A alone (dashed), B alone (dotted) and A and B together
for the loss functions in figure 13.5.
quantiles, the number of possible paths is 90.² For three experts estimating three quantiles, this number is 1680; for five experts estimating two quantiles, it is 113,400. Hence, we should need astronomical amounts of data to build up
frequency distributions as the number of experts or the number of quantiles
increases. If the experts assess an infinite number of quantiles (e.g., by giving entire
distributions) the model breaks down.
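A minimal sketch of this count, using the formula (MK)!/(K!)^M from footnote 2 for M experts and K quantiles:

    from math import factorial

    def n_paths(n_experts, n_quantiles):
        """Number of interleavings of the experts' quantile points on the real line."""
        return factorial(n_experts * n_quantiles) // factorial(n_quantiles) ** n_experts

    print(n_paths(3, 2), n_paths(3, 3), n_paths(5, 2))   # 90, 1680, 113400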
Undoubtedly, additional theoretical work could lead to more practical
models. For example, one could introduce the intensity of a cell as the number of
times the cell is hit divided by the number of times the cell is offered. Upon receiving
advice a0, the decision maker could adopt the renormalized intensities of the cells
on the path corresponding to a0 as his posterior for Z0. However, a satisfactory
Bayesian analysis for such a procedure remains to be given.3
CONCLUSIONS
As with the classical model, a successful implementation stands or falls with the
ability to define variables that are relevant and whose true values become
unambiguously known within a reasonable time frame. This requires resourceful-
ness on the part of the analyst and sympathetic cooperation from the experts
themselves. It is therefore essential that the experts understand and appreciate the
model, and that the analyst take the utmost care in defining the variables.
The Bayesian model is less flexible than the classical model. In its present form
it admits only quantile assessments and requires that all experts assess the same
variables. Moreover, because of its minimally informative default priors, the
Bayesian model will surely require more variables to warm up. Evidence of this is
presented in Chapter 15. Hence this model will be more dependent on "seed
variables" than the classical model. On the other hand, the Bayesian model is
unquestionably more powerful, and under ideal conditions may be expected to
yield better results.
2. Consider 6 points on the real line representing the 6 quantile assessments. The 2 corresponding to the first expert's quantiles can be chosen in 15 ways. Of the remaining 4 points, 2 can be assigned to the second expert in 6 ways, and the third expert's choice is fixed. For M experts with K quantiles the general formula is: # arrangements = C(MK, K) · C((M − 1)K, K) · ... · C(K, K) = (MK)!/(K!)^M.
3. Let O_j denote the event that cell j is offered, and let h_j denote the event that cell j is hit. Let a_0
generate a path containing cell j. If we assume that P(hj and a0 | hj and Oj) = P(a0| Oj), then an easy
calculation shows that P(hj | a0) = P(h j | Oj). If this assumption holds strictly, the expected intensities
along a path will automatically add to 1. If this assumption holds only approximately, renormalization
will be required.
14
Psychological Scaling
Models: Paired Comparisons
We now turn to the psychological scaling models. For background and general
discussion, refer to Chapter 11. This chapter provides the tools for applying three
psychological scaling models, the Thurstone model, the Bradley-Terry model and
the NEL (negative exponential lifetime) model. The latter two models are
computationally identical, but differ in interpretation. The NEL model is specifi-
cally designed for estimating constant failure rates. Derivations and proofs for
these models are readily available in the literature and will not be given here. In a
concluding section the relation to the principles for expert opinion in science is
discussed. Before turning to the models we first discuss generic issues.
GENERIC ISSUES
will sometimes be unwilling or unable to judge some pairs of objects. Hence the
data will contain void comparisons. In most cases, the formulas given below
generalize easily to the case where some experts are unable or unwilling to express
preferences for some pairs, and these generalizations will be indicated.
In analyzing paired comparison data, the first question to be addressed, before
choosing a model, is "Is there a significant difference in the objects with respect to
preference?" This question may be posed with regard to an individual expert, or
with regard to the set of experts as a whole.
Suppose for expert e, the values V(i, e) are barely distinguishable. For each pair, he
expresses some preference, but the preferences are "unstable." How might we detect
this from the data from this expert?
The following procedure suggests itself. Suppose A(1), A(2), and A(3) are barely
distinguishable in preference. If the expert says A(1) > A(2) and A(2) > A(3), then A(1) and A(3) will still be very close in preference, and the expert might well say A(3) > A(1). These three objects would then constitute a circular triad in the
expert's preference structure. In the language of Chapter 6, the expert's preferences
are intransitive. All intransitivities in the preferences of the objects can be reduced
to circular triads.
When an expert assesses a large number of pairs, we should not be surprised if
a few circular triads arise. This poses no particular problem, as each object is
compared many times with other objects, and the models will extract an underlying
trend, if there is one.
As the number of circular triads increases, we would begin to doubt whether
the expert has sharply defined underlying preferences. David (1963) provides a
procedure for testing the hypothesis that each preference is determined at random.
Let a(i, e) be the number of times that expert e prefers A(i) to some other object.
David shows that the number C(e) of circular triads in e's preferences is given by

    C(e) = t(t − 1)(2t − 1)/12 − (1/2) ∑_i a(i, e)².
Kendall (1962) has calculated the probabilities that various values of C(e) would be
exceeded, with 2 to 10 objects, under the hypothesis that the preferences are
randomly determined. These are shown in Table B.4, Appendix B. For more than
seven objects he shows that

    χ'² = [8/(t − 4)] · [(1/4) C(t, 3) − C(e) + 1/2]

is approximately chi square distributed, with

    ν = t(t − 1)(t − 2)/(t − 4)²

degrees of freedom.
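A minimal sketch that counts circular triads directly from a preference matrix and checks the closed-form expression above (the preferences are hypothetical):

    from itertools import combinations

    def circular_triads(prefers):
        """prefers[i][j] is True when object i is preferred to object j (i != j)."""
        t = len(prefers)
        direct = sum(1 for i, j, k in combinations(range(t), 3)
                     if (prefers[i][j] and prefers[j][k] and prefers[k][i])
                     or (prefers[j][i] and prefers[k][j] and prefers[i][k]))
        a = [sum(prefers[i][j] for j in range(t) if j != i) for i in range(t)]
        formula = t * (t - 1) * (2 * t - 1) / 12 - 0.5 * sum(ai ** 2 for ai in a)
        return direct, formula

    # One intransitivity among four objects: A(1) > A(2) > A(3) > A(1); A(4) is beaten by all.
    P = [[False, True,  False, True],
         [False, False, True,  True],
         [True,  False, False, True],
         [False, False, False, False]]
    print(circular_triads(P))   # (1, 1.0)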
Coefficient of Agreement u
Let a(ij) denote the number of times that some expert prefers object A(i) to A(j).
Then a(ji) = n — a(ij). If the experts agree completely, then half of the a(ij)s will be
equal to n, and the other half equal to 0. (Note that complete agreement does not
preclude circular triads.) Define

    Σ = ∑_{i ≠ j} C(a(ij), 2),

where the summation runs over t(t − 1) terms. Kendall (1962) defines the coefficient of agreement u as

    u = 2Σ / [C(t, 2) C(n, 2)] − 1.

u attains its maximum, 1, when there is complete agreement. Under the hypothesis
that all agreements of the experts are due to chance, the distribution of u can be
determined. Kendall (1962) tabulates the distribution of Σ for small values of n and t (Table B.5, Appendix B). Furthermore, for large values of n and t, a simple function of Σ is approximately chi square distributed with C(t, 2)·n(n − 1)/(n − 2)² degrees of freedom. The hypothesis that all agreements are due to chance should be rejected at the 5% level when this statistic exceeds the corresponding critical value.
Coefficient of Concordance W
The sum of the ranks R(i) is calculated by

    R(i) = ∑_{e=1..n} R(i, e),

where R(i, e) is the rank of A(i) obtained from the responses of expert e; the value of R(i, e) ranges from 1 to t. Siegel (1956) defines

    W = 12S / [n²(t³ − t)],

where S is the sum of squares of the observed deviations from the mean of R(i):

    S = ∑_{i=1..t} [ R(i) − (1/t) ∑_{k=1..t} R(k) ]².
THE THURSTONE MODEL
Assumptions
For i, j = 1, ... , t, it is assumed that

The values V(i, e) are normally distributed over the population of experts, with
mean μ(i) and standard deviation σ(i); (14.4)
μ(i) = V(i); (14.5)
σ(i) = σ, that is, σ(i) does not depend on i; (14.6)
ρ(ij) = 0 for i ≠ j; (14.7)
where ρ(ij) is the correlation coefficient of V(i, e) and V(j, e). (14.7) adds that the
distributions V(i, e) and V(j, e) are uncorrelated. Slightly weaker assumptions are
mathematically tractable; in particular, (14.7) can be replaced by the assumption
that ρ(ij) does not depend on i and j (Mosteller, 1951a). In the context of estimating
small probabilities, several authors (Comer et al., 1984; Hunns, 1982; Kirwan et al.,
1987) recommend replacing (14.4) with

The values ln V(i, e) are normally distributed over the population of experts,
with mean μ(i) and standard deviation σ(i). (14.4a)
It is plausible that (14.4) or some variant would reasonably describe the distribu-
tions V(i, e).1 Assumptions (14.5), (14.6), and (14.7) are very strong.
Solution
Let θ(ij) = a(ij)/n denote the observed proportion of experts preferring A(i) to A(j), and
set x(ij) = Φ⁻¹(θ(ij)), where Φ is the standard normal distribution function. Since the
standard deviation σ(ij) of V(i, e) - V(j, e) does not depend on i and j, by choosing a
suitable scaling constant we may arrange that σ(ij) = 1, and hence write

μ(i) - μ(j) = x(ij) (14.9)

There are t(t - 1)/2 equations of the form (14.9), one for each pair of objects, and t
unknowns μ(1),..., μ(t). If the estimation θ(ij) were perfect, and if the modeling
assumptions hold, then all equations of the form (14.9) would be satisfied, and
could be solved for μ(1),..., μ(t). However, the estimates will not be perfect, and the
1
It should be noted, however, that the least squares solution algorithm yields an optimum (unbiased
and minimum variance) estimate for the means of the normal distributions (see, e.g., Hanushek and
Jackson, 1977, chap. 2). If (14.4a) is used, these estimates are optimal for log V(i), but not for V(i).
system (14.9) will be overdetermined for t ≥ 4. If (14.7) were replaced by (14.8), the
correlation coefficients ρ(ij) would also be unknowns, and the resulting system
would be underdetermined.
We can solve the overdetermined system by finding estimates μ'(i) of μ(i) such
that the sum of the squared errors

Σi<j [x(ij) - μ'(i) + μ'(j)]² (14.10)

is minimal. Adding a constant to μ'(i), i = 1,..., t, would not affect the value of the
above sum; hence we may add a constant such that Σi μ'(i) = 0. The solution is then

μ'(i) = (1/t) Σj≠i x(ij) (14.11)
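The least squares solution is easy to compute. The following sketch (Python with numpy and scipy; the proportions are hypothetical, and proportions equal to 0 or 1 would require special treatment) returns the μ'(i) from a matrix of observed proportions θ(ij).

import numpy as np
from scipy.stats import norm

def thurstone_scale(theta):
    # theta[i][j] = observed proportion of experts preferring object i to object j
    theta = np.asarray(theta, dtype=float)
    x = norm.ppf(theta)          # x(ij) = inverse normal of theta(ij)
    np.fill_diagonal(x, 0.0)
    return x.mean(axis=1)        # mu'(i) = (1/t) * sum over j of x(ij); the mu'(i) sum to zero

# Hypothetical proportions for three objects (note theta(ji) = 1 - theta(ij))
theta = [[0.5, 0.8, 0.9],
         [0.2, 0.5, 0.7],
         [0.1, 0.3, 0.5]]
print(thurstone_scale(theta))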
Goodness of Fit
The solution (14.11) minimizes the squared error (14.10), but the error may still be
large. How large is too large? One way to answer this question is the following.
Assume that assumptions (14.4), (14.6), and (14.7) hold with μ'(i) replacing μ(i), that
is, assume that our solution is really correct. We now ask, what is the probability of
seeing a set of preferences θ(ij) that leads to an error at least as great as (14.10)? If
this probability is too small, then we should conclude that some of our modeling
assumptions are false. Mosteller (1951c) gives a chi square test (14.12) for this purpose.
Transformation
If our data survive all the foregoing tests, then the solution (14.11) can be used.
However, this solution involves an arbitrary positive scaling constant, used to
arrange that σ(ij) = 1, and an arbitrary shift parameter, used to arrange that
Σi μ'(i) = 0. Hence, if μ'(i) is a solution then so is

V(i) = a·μ'(i) + b (14.13)
for arbitrary constants a > 0 and b. If we know two values V(i) and V(j), for some i
and j, then we can solve (14.13) for a and b, and determine the remaining true values
via (14.13).
Confidence Bounds
Using assumptions (14.4) to (14.7) it is possible to determine confidence bounds for
the solution (14.11) via simulation. The procedure is as follows:
1. Solve the initial model for μ'(i), i = 1,..., t.
2. Sample values V'(i, e), e = 1,..., n, from the normal distribution with mean μ'(i)
and σ = 1.
3. Define θ'(ij) as the number of e such that V'(i, e) > V'(j, e), divided by n.
4. Solve the model again using θ'(ij) instead of θ(ij), and find solutions m(i),
i = 1,..., t.
If we repeat the above say 1000 times, always keeping the μ'(i) fixed, we get 1000
values m(i) for each i. Excluding the 50 smallest and 50 largest, the range spanned
by the remaining values defines a 90% confidence interval.
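A sketch of this simulation (Python with numpy and scipy; the clipping of extreme simulated proportions is our own device, added to keep the normal scores finite, and is not part of the procedure described above):

import numpy as np
from scipy.stats import norm

def thurstone_bounds(mu, n, n_sim=1000, seed=0):
    # mu: fitted scale values mu'(i); n: number of experts
    rng = np.random.default_rng(seed)
    t = len(mu)
    sims = np.empty((n_sim, t))
    for k in range(n_sim):
        V = rng.normal(loc=mu, scale=1.0, size=(n, t))          # V'(i, e)
        theta = (V[:, :, None] > V[:, None, :]).mean(axis=0)    # theta'(ij)
        theta = np.clip(theta, 0.5 / n, 1.0 - 0.5 / n)          # keep normal scores finite
        x = norm.ppf(theta)
        np.fill_diagonal(x, 0.0)
        sims[k] = x.mean(axis=1)                                # m(i) for this simulation
    lo, hi = np.percentile(sims, [5, 95], axis=0)               # drop 5% in each tail
    return lo, hi

print(thurstone_bounds([0.7, -0.1, -0.6], n=10))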
These confidence bands are derived from the modeling assumptions (14.4) to
(14.7). They do not reflect uncertainty in these assumptions, but rather reflect
sampling fluctuations in the choice of experts. As the number n of experts increases,
the bounds determined in this way shrink; modeling uncertainty does not.
Void Comparisons
The solution (14.11), the goodness of fit test (14.12), and the procedure for
determining confidence bounds can all be applied in the case of void comparisons.
The solution requires only the proportion of experts preferring one alternative to
another. The goodness of fit test can be applied by replacing n by the number n(ij)
of experts comparing A(i) to A(j).
THE BRADLEY-TERRY MODEL
The Bradley-Terry model was introduced by Bradley and Terry (1952) and further
developed in Bradley (1953). Ford (1957) contributed an important convergence
result, and a good mathematical exposition is found in David (1963).
Assumptions
The model assumes that each object A(i) is associated with a true scale value V(i),
and that the probability r(ij) that A(i) should be preferred over A(j) is given by

r(ij) = V(i) / [V(i) + V(j)] (14.14)

The V(i) are determined only up to a constant scale factor, hence we may assume
that Σi V(i) = 1.
When several experts judge each pair of objects once, it is assumed that r(ij) is
the same for all experts, and that the judgments are independent. In other words,
the number a(ij) of experts preferring A(i) to A(j) is binomially distributed with
parameters n and r(ij).2
Solution
The proportion θ(ij) is taken as an estimate of r(ij), and the overdetermined system
that results by substituting θ(ij) for r(ij) in (14.14) must be solved. The method of
least squares is not tractable, owing to the form of (14.14), and the modeling
assumptions suggest another strategy. One seeks values V(1),..., V(t) such that,
under the modeling assumptions, the probability of seeing the outcomes θ(ij) is as
large as possible.3 Finding such V(i) is equivalent to solving the following system
(see David, 1963):

V(i) = a(i) / Σ' n / [V(i) + V(j)],    i = 1,..., t (14.15)

where a(i) = the number of times A(i) is preferred by some expert to some other
object, and Σ' denotes summation over j, with j ≠ i.
The solution to (14.15) can be found by iteration. Begin with initial values
V0(1),..., V0(t) and define the first iteration by

V1(i) = a(i) / Σ' n / [V0(i) + V0(j)],    i = 1,..., t (14.16)

after which the V1(i) are normalized to sum to unity; higher iterations are defined
analogously.
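The iteration is easily programmed. The sketch below (hypothetical data; the stopping rule is simply a fixed number of iterations) assumes that every pair is judged by the same number n of experts and that Ford's convergence condition holds.

def bradley_terry(a_pair, n, iters=100):
    # a_pair[i][j] = number of experts preferring object i to object j; n = experts per pair
    t = len(a_pair)
    a = [sum(a_pair[i][j] for j in range(t) if j != i) for i in range(t)]   # wins of object i
    V = [1.0 / t] * t
    for _ in range(iters):
        V = [a[i] / sum(n / (V[i] + V[j]) for j in range(t) if j != i) for i in range(t)]
        s = sum(V)
        V = [v / s for v in V]          # renormalize so that the V(i) sum to 1
    return V

# 10 experts, 3 objects
a_pair = [[0, 8, 9],
          [2, 0, 7],
          [1, 3, 0]]
print(bradley_terry(a_pair, n=10))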
Goodness of Fit
The quantity (14.17), which compares the observed preference counts a(ij) with the
counts expected under the fitted model, provides a test of goodness of fit (David, 1963).
Void Comparisons
When not all experts compare all pairs, the iteration process is slightly altered and
n is replaced in (14.16) by the number n(ij) of experts who compare object i with
object j. Similarly, in testing for goodness of fit, n(ij) replaces n in (14.17).
2
With this interpretation, circular triads can arise from the probabilistic mechanism according to
which preference is determined. Nonetheless, the coefficient of agreement should be used to test and
hopefully reject the null hypothesis: all probabilities of preference are equal to 1/2.
3
This is known as the maximum likelihood solution. Ford (1957) has shown that the solution is
unique and that the iterative process converges to this solution if the following condition is met: it is not
possible to divide the set of objects into two nonempty subsets, such that no object in one subset is
preferred by any judge above some object in the second subset.
Confidence Intervals
Confidence intervals can be determined by simulation in a manner analogous to
the Thurstone model. Once a solution is found for the scale values V(1),..., V(t),
these are used to define preference probabilities via (14.14). Simulated preference
judgments are then performed by the computer for the n experts and the t(t - 1)/2
pairs of objects. From this simulated data a simulation solution is determined via
(14.16). After, say, 1000 simulation solutions have been found, always keeping the
original solution fixed, 90% confidence intervals may be determined as in the
Thurstone model.
THE NEL MODEL
The NEL model was developed to assess failure rates in mechanical components.
Assumptions
The objects are mechanical components. For each pair of objects the experts are
asked "Which of these two components, operating independently, will fail first?" It
is assumed that all components are as good as new at time T = 0, and "A(i) < T"
means that A(i) fails before time T. The following assumptions are made:

P(A(i) < T) = 1 - e^(-r(i)T),    r(i) > 0 (14.18)

In other words, A(i) is assumed to have an exponential life distribution with failure
rate r(i). Further, it is assumed that the probability that an expert says "A(i) fails
before A(j)" is given by

r(ij) = P(A(i) < A(j))

under the assumption that A(i) and A(j) are independent. An elementary cal-
culation4 shows that

r(ij) = r(i) / [r(i) + r(j)] (14.19)
In other words, the model assumes that the expert answers the question put to him
by performing a mental experiment, letting A(i) and A(j) operate independently
with life probability (14.18), and observing which fails first.
4
This can be seen by writing P(A(i) < A(j)) = ∫ r(i)e^(-r(i)t) e^(-r(j)t) dt (integral from 0 to ∞)
= r(i)/[r(i) + r(j)].
It will be observed that (14.19) has the same form as (14.14). This means that the
solution procedure described in the Bradley-Terry model will apply here, as well as
the goodness of fit test, the procedures for generating confidence intervals, and the
procedure for handling void comparisons.
Transformation
The solution (14.16) determines the values V(1),..., V(t) only up to a scaling
constant. If for component i the true failure rate r(i) is known or can be estimated,
then the failure rates for the other components can be found by setting

r(j) = r(i) V(j)/V(i),    j = 1,..., t
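In code this is a one-line rescaling (illustrative only; the numbers are made up):

def scale_to_failure_rates(V, i_ref, r_ref):
    # V: Bradley-Terry/NEL scale values; component i_ref has known failure rate r_ref
    c = r_ref / V[i_ref]
    return [c * v for v in V]

# Component 0 is known to fail about 0.5 times per year
print(scale_to_failure_rates([0.62, 0.25, 0.13], i_ref=0, r_ref=0.5))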
We conclude this chapter by noting that the psychological scaling models stand in
a rather different relation to the principles set forth in Part I for using expert
opinion in science than the models described in the previous two chapters.
Reproducibility, accountability, neutrality, and fairness pose no particular pro-
blem. Empirical control is another matter.
The tests of goodness of fit do not constitute empirical control. These tests
determine whether the paired comparison data are sufficiently consistent with the
modeling assumptions, but do not involve a comparison of model output with
empirical values.
If the confidence intervals are interpreted as subjective confidence intervals for
the decision maker, then empirical control is possible. If true values become
observable at some later point, then one should expect that 90% of the true values
fall within the 90% confidence bounds. The hypothesis that the model output is
well calibrated may then be tested in the manner described in Chapter 9.
However, this interpretation of the confidence intervals may not be warranted.
The confidence intervals reflect uncertainty due to the choice of experts, and do not
reflect uncertainty regarding the modeling assumptions themselves. As the number
of experts increases, the confidence intervals shrink, whereas the uncertainty in the
modeling assumptions need not decrease.
If the model output is in the form of probabilities for uncertain events, then
these can in principle be treated as input into the classical model and subjected to
the empirical control inherent in the classical model.
In either case, empirical control is in principle available for the model output
as a whole. Empirical control for the individual assessments of the experts is not
possible.
15
Applications
This chapter discusses applications of the models presented in Part III. Three of
these applications ("ESTEC-1," "DSM-1," and DSM paired comparisons) were
developed in the course of research supported by the Dutch Ministry of Housing,
Physical Planning and Environment, and are described in Cooke et al. (1989).1 The
studies "DSM-2" and DSM-3 were supported by the Dutch State Mines (Akker-
mans, 1989; Claessens 1990), and ESTEC-2, and -3 were supported by the
European Space Agency (Meima and Cooke, 1989; Offerman, 1990a). A study
involving weather forecasting data was supported by Delft Hydraulics (Roeleven,
1989).
The first section of this chapter reviews applications of the classical model, and
discusses the variations and enhancements proposed in Chapter 12. The second
section discusses Bayesian analyses of the data sets in the first section. This is
possible because the Bayesian model in Chapter 13 is compatible with the
continuous version of the classical model. The third section discusses a joint
application of the psychological scaling model and the classical model. The
supplement to this chapter contains extensive data from these applications.
Many (ex)students and colleagues have helped in gathering and analyzing the data recorded in this
chapter, in particular, D. Roeleven, M. Kok, D. Akkermans, M. Stobbelaar, D. Solomatine, J. van Steen,
C. Preyssl, B. Meima, F. Vogt, and M. Claessens.
1
The numerical results given in this report, and also in Preyssl and Cooke (1989), differ slightly from
those reported here due to refinements in the numerical routines.
ESTEC = European Space Technical Centre, DSM = Dutch State Mines, RDMI = Royal Dutch Meteorological
Institute.
(The DSM-3 study came too late for complete treatment here, and only summary data is
presented.) The RDMI study concerned uncertain events. Data supplied by the
Royal Dutch Meteorological Institute was analyzed to determine whether a
combination of expert opinions according to the classical model would yield better
forecasts. The first six studies required seed variables; in the RDMI study
realizations were retrievable for every assessment.
Summary Descriptions
The study DSM-1 is described in the last section of this chapter, and the RDMI
study is described in the following section. Summary remarks must suffice for the
others. The ESTEC-1 study grew out of discussions between design and reliability
engineers over a proposed design of a propulsion system. Four experts participated
in the study, two senior design engineers (experts 1 and 2), a junior design engineer
(expert 3) and a reliability engineer (expert 4). The elicitation was conducted with
each expert individually by a member of the analysis team. Nine of the seed items
concerned empirical failure frequencies of spaceflight systems;2 four concerned
"general reliability items." The results and expert scores were fed back informally to
the experts. The results and method were accepted by the project supervision, and
the European Space Agency is currently implementing an extended expert
judgment program building on these methods (Preyssl and Cooke, 1989; Cooke,
French, and van Steen, 1990).
The study ESTEC-2 applied expert judgment to assess critical variables used
in calculating the risks to manned spaceflight due to collision with space debris.
The current design base models of the debris environment stem from 1984, and in
some circles are felt to underestimate the risk. New data on the debris environment
are to become available in 1991, affording the opportunity to compare the
predictions of this study with realizations. The 26 seed variables concerned the
number of radar-tracked objects injected into orbit in the years 1961 to 1985 (for
future years this is a variable of interest). The manner of elicitation was somewhat
idiosyncratic and utilized graphic methods. Six of the seven experts participating in
2
The seed items are included in supplement A, but the names of the systems have been removed.
this study were specialists in space environment modeling and shielding. One
expert was a reliability specialist. The results were fed back to the experts and met
with general assent.
The ESTEC-3 study fed into a study of the effectiveness of various new
composite materials in reducing the risks of debris and meteorite collision in
manned space flight. Six experts participated in the study. Fourteen seed variables
were originally defined, but two were subsequently excluded when the values of the
realizations were thrown in doubt. Feedback to the experts led to a follow-up study
of on-line training to improve expert calibration (Offerman, 1990b).
The DSM-2 study was initiated by the chemical process concern DSM after
receiving indications that crane activities in the neighborhood of process facilities
could constitute a significant risk. Eight experts, including crane drivers, signallers,
a supervisor, a job planner, a technical inspector, and a department head
participated in the study. These experts had little or no knowledge of probability
and statistics. Elicitation was accomplished via personal interview, asking first for a
best estimate, then for an interval for which the expert was 90% sure that it
contained the true value. The results of this study confirmed the apprehensions
regarding crane risks.
In DSM-3, expert judgment was used to quantify uncertainty in important
parameters of a ground water transport model used to predict future con-
tamination with hazardous materials. The seven experts were all geohydrologists.
Data from permeameter and pump measurements were used to define seed
variables. Experts were asked to assess the transmissivity of soil in locations where
transmissivity had already been measured. They were given information about the
soil obtained from drilling. The assessment tasks for seed variables correspond
closely to the assessment tasks for the variables of interest.
Overall Evaluation
The classical model yields probabilistic assessments for an optimized decision
maker, as described in Chapter 12. These assessments can be evaluated by
considering their calibration and information scores for those variables whose
realizations are known, and comparing these with scores of other assessments.
Table 15.2 presents results comparing the basic model assessments with the
optimized decision maker gotten using "item weights" (see Chap. 12), the decision
maker gotten by assigning the experts equal weight and taking the arithmetic
average of their distributions, and the best expert, that is, the expert receiving the
greatest weight in the basic classical model.
The RDMI results presented in Table 15.2 represent aggregations over some
6000 assessments plus realizations. These data are analyzed according to the
"uncertain event" version of the classical model, and hence (lack of) information is
measured as mean relative entropy [see (12.10)]. In some cases the calibration score
C(e) [(12.9)] is very small, on the order of 10^-12; hence the values of the scoring
variable 2n_b I(s_b, p_b) are given. The RDMI case is described in greater detail below.
A full description is available in Dutch (Roeleven, 1989) and a description in
English is in preparation.
Table 15.2 Comparison of the Optimized Decision Maker Under the Basic Classical
Model with the Decision Maker Generated by Item Weights (See Chap. 12) by
Assigning Experts Equal Weight, and with the "Best Expert."
Performance Comparisons
Optimized DM Item Weight DM Equal Weight DM Best Expert
Continuous Variables
"Calibration" and "mean seed information" correspond to the quantities C(e) and I(e) of formula (12.14). For the
measurement of relative information, seed variables for which 0.001 < realization < 1000 are referred to a uniform
background measure, other seed variables are referred to a loguniform background measure.
*The optimal DM was obtained by scoring calibration at the 70% power level; the calibration scores shown refer to the
100% power level (see Chap. 12 for explanation).
Due to the very large number of items, the chi square variable (with 10 degrees of freedom) 2n_b I(s_b, p_b) is shown;
lower values correspond to better calibration [(12.9)].
Lower values indicate greater information [see (12.10)].
These results are aggregated over experts' predictions for periods 2 and 4, respectively; for periods 3 and 5, see text.
Items weights were not used in analyzing this data.
Supplement A to this chapter contains the expert and optimal DM scores, the
DM's assessments for the seed variables, and the realizations, for the continuous-
version applications (with the exception of DSM-3, which was finished too late for
inclusion). Supplement B contains the experts' assessments for the seed variables
for these applications.
In accordance with the discussion in Chapter 12, a uniform background
measure was used for all seed variables whose realization fell between 0.001 and
1000; otherwise a loguniform background measure was used. To get an impression
of the importance of the background measure, Table 15.3 gives the results for the
optimal DM in the five continuous-version studies after changing the background
measure. "U" and "L" indicate that all variables have been referred to the uniform
or loguniform background measure, respectively. In one study, ESTEC-1, this
made a substantial difference in the optimal DM. In the other studies the effect was
marginal.
The ESTEC-2 study illustrates the complications that can arise as the number
of seed variables becomes large. Table 15.4 shows the scores under the power levels
1.0, 0.70, and 0.60. At power 1.0 all experts receive the minimal calibration score of
10^-4. At power 0.60 expert 6 is a factor 10 better calibrated than the other experts.
The optimal DM is obtained at power level 0.70, at which expert 6's calibration
dominates by a factor 5. When the accuracy of the chi square distribution is
extended to six significant decimals, expert 6's calibration dominates by a factor 10,
and the DM in this case is identical to the DM with power 0.6.
Note to Table 15.3: L denotes log scale for all variables; U denotes uniform scale for all variables.
Table 15.4. Expert and Optimal DM Scores for ESTEC-2 at Power Levels 1.0, 0.70,
and 0.60
Figure 15.1 Data structure for combining assessments of morning and evening experts.
Vertically aligned assessments concern the same exceedence event. Assessments within one
block were combined.
Table 15.5. Unnormalized Weights for Annual, Cumulative DMs, Equal Weight DM,
Log odds DM, and 'Aggregate expert' (see text).
Unnormalized Response Weight for the Combination of Periods 2 and 4
Because of the high predictability of some of the exceedence events, some of the exceedence events were
excluded from the analysis. These were events that were very predictable, and the
expert assessments were strongly clustered about the extreme probability values.
For such events combination of expert opinions is irrelevant, as the assessments
agree in large measure. All assessments 6 hours into the future were excluded, as
these strongly clustered into the extreme probability bins (in most cases the realization
was defined as an average over the 6-hour interval, making it very easy to predict the
variables for the first period). Preliminary analysis
identified three exceedence events that were sufficiently unpredictable to make
combination of expert assessments meaningful. These were the events "wind speed
exceeds 22 knots" (coded FF22), "visibility is less than 4km" (coded VV4), and
"visibility is less than 10km" (coded VV10). For these three exceedence events the
assessments of periods 4 and 5 of the "morning expert" could be combined with the
assessments of periods 2 and 3 of the "evening expert." In Figure 15.1, vertically
aligned assessments concerned the same events, and assessments within one block
were combined.
For each individual expert and each separate period, weights were computed
according to the basic classical model. Since each event was predicted by only two
experts, and since a given pair of experts occurred relatively infrequently in the data
set, no optimization was applied. Rather the expert-period weights were simply
normalized to form the assessments of the decision maker. The weights were
calculated in two ways. The "annual DM" is computed using weights derived from
the experts' performance in the preceding year, and the "cumulative DM" is
computed using weights derived from all previous years. These weights determine
the DMs' assessments in the current year. Because of changes of personnel, new
experts entered the data set in 1983, hence there is no DM for 1983. Moreover,
because of sickness, leaves of absence, etc., the numbers of assessments of the
individual experts differed substantially, and equalization of calibration power was
applied.
Table 15.5 shows the unnormalized weight of the annual and cumulative DM,
per exceedence event, per year. The weights for the "aggregate expert" correspond-
ing to the predictions less far into the future are also shown. In combining the
assessments for period 3 and period 5, this is the "period 3 expert," and in
combining the assessments for period 2 and period 4, this is the "period 2 expert."
Of course, on different days the period 3 expert is actually a different individual.
The expert weights shown in Table 15.5 are derived from the aggregate period
performance, and are not the weights used in combining, as the latter depend on
the individual experts. The assessments of the period 3 aggregate expert are
consistently better than those of the period 5 aggregate expert (and similarly for
periods 2 and 4).
Clemen and Winkler (1987) found that averaging the log-odds of the experts'
assessments to form the log-odds of the decision maker yielded the best perfor-
mance in a study combining precipitation forecasts. This is equivalent to taking the
normalized geometric mean of the experts' probabilities. The second best perfor-
mance was gotten by taking the arithmetic average of the experts' probabilities,
that is, assigning the experts equal weight. The log-odds combination gives
relatively more weight to assessments near zero and one. If low entropy correlates
with good calibration, then the log-odds rule gives good performance. This was
indeed the case in the RDMI data set. Table 15.5 also shows the results of the "log-
odds DM" and the "equal DM." An asterisk indicates the best performance per
event-year. The total Calindex and Response Entropy scores are also shown (see
also Table 15.2).
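For concreteness, the two combination rules can be written as follows (a sketch with hypothetical probabilities; probabilities of exactly 0 or 1 would have to be truncated before taking log-odds):

import numpy as np

def equal_weight_dm(p):
    # arithmetic average of the experts' probabilities
    return float(np.mean(p))

def log_odds_dm(p):
    # average the experts' log-odds and transform back; this equals the geometric mean of
    # the p's divided by the sum of the geometric means of the p's and of the (1 - p)'s
    p = np.asarray(p, dtype=float)
    z = np.mean(np.log(p / (1.0 - p)))
    return float(1.0 / (1.0 + np.exp(-z)))

p = [0.6, 0.8, 0.9]
print(equal_weight_dm(p), log_odds_dm(p))   # the log-odds DM lies closer to the extreme assessments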
From Table 15.5 it is evident that the annual DM outperforms the other DMs
and outperforms the best expert. The annual DM dominates all others in 11 of the
24 cases. The log-odds DMs dominate in 9 cases, the equal weight DM dominates
three times, the cumulative DM once, and the best expert is never dominant.
Without a structured combination of expert assessments, the DM would perforce
adopt the assessments of the period 3 or the period 2 expert, depending on the time
at which the assessment was needed. Any of the DMs would have yielded better
performance, and the classical model computed on an annual basis yields the best
performance.
In total, the cumulative DM made 2000 assessments per variable, the yearly
DM made 2400 assessments per variable, and the "best expert" made 2560
assessments per variable. Interestingly, the yearly DM generally outperforms the
cumulative DM, indicating that fluctuations in expert performance are significant.
In Chapter 12 it was noted that correlation in expert assessments could
degrade model performance if the correlation on the set of events that actually
occurred differed greatly from the correlation on the set of events that did not in
fact occur. In the RDMI data set the correlation was about 0.5 on both sets.
The Royal Dutch Meteorological Institute, of course, did not have access to
the classical model for combining expert opinions while the experiment was
underway. This analysis is after the fact. However, had the model been on-line, it
would have improved the experts' probabilistic forecasts.
BAYESIAN ANALYSIS
The Bayesian model of Chapter 13 was applied to these data sets; summary results
for DSM-1, -2, ESTEC-2, and -3 are given in Tables 15.7 and 15.8.
From a Bayesian viewpoint, it is most natural to evaluate performance by
calculating cumulative loss for the seed variables. This involves choosing a loss
function for each seed item and computing the loss as a function of the realization
and the Bayesian estimate for that item. As emphasized in Chapter 13, adding
losses from different items presupposes that the items are measured in units of equal
utility. For example, if one item is measured in kilograms and another in meters,
then the disutility of one kilogram error on the first item must be equal to the
disutility of one meter error (in the same sense) for the second. Such judgments, of
course, are highly subjective and problem dependent, and do not normally form
part of a quantitative analysis.
In this analysis we simply assume that the units in which the items are
measured are units of equal utility, and we compute loss under three bilinear loss
functions for which the Bayesian estimate corresponds to the 5%, 50%, and 95%
quantiles (see Chap. 13). If loss is proportional to r times the distance between the
estimate and the true value when overestimating, and to s times this distance when
underestimating, then the expert's expected loss is minimized when his estimate is
the s/(r + s)th quantile of his distribution.
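A small numerical check of this fact (a sketch with an arbitrary lognormal sample; nothing here depends on the particular distribution):

import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)     # hypothetical uncertain quantity
r, s = 1.0, 9.0                                         # underestimation nine times as costly
grid = np.quantile(x, np.linspace(0.50, 0.99, 99))      # candidate estimates
expected_loss = [np.mean(np.where(d >= x, r * (d - x), s * (x - d))) for d in grid]
d_best = grid[int(np.argmin(expected_loss))]
print(d_best, np.quantile(x, s / (r + s)))              # both are close to the 90% quantile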
The Bayesian analysis is first performed for each of the four experts
individually. Figure 15.2 shows the cumulative losses for expert 1 and for the
"Bayesian updated expert 1" (the variables all concern relative frequencies, and the
loss is displayed on a log scale). The Bayesian updated expert 1 is the result of
applying the Bayesian model to expert 1 and updating his distribution after each
true value is learned. Three loss functions are used, for which the optimal estimates
are the 5%, 50%, and 95% quantiles, respectively. Figures 15.3 to 15.5 provide the
same information for experts 2 to 4. Figure 15.6 shows the cumulative losses of the
Figure 15.2 Bilinear cumulative loss for Bayesian estimates at 5%, 50%, and 95% quantiles
for expert 1, compared with expert 1 updated.
Figure 15.3 Bilinear cumulative loss for Bayesian estimates at 5%, 50%, and 95% quantiles,
for expert 2, compared with Bayesian updated expert 2.
Figure 15.4 Bilinear cumulative loss for Bayesian estimates at 5%, 50%, and 95% quantiles
for expert 3, compared with Bayesian updated expert 3.
Mendel and Sheridan combination of experts 3 and 4 (these are the "best" experts
in the classical model), and compares the losses with experts 3 and 4 individually.
The graphs in Figure 15.6 have been blown up, and we see that the Mendel and
Sheridan combination yields a cumulative loss that is between that of experts 3
and 4.
Table 15.6 shows the results of feeding the Bayesian updated experts back into
the classical model. For each variable, we extract from the Bayesian updated
Figure 15.5 Bilinear cumulative loss for Bayesian estimates at 5%, 50%, and 95% quantiles
for expert 4, compared with Bayesian updated expert 4.
distribution the 5%, 50%, and 95% quantiles, and determine the calibration and
information scores corresponding to these assessments. In each case we see that the
Bayesian updating improves calibration at the expense of information. Indeed, for
overconfident assessors the updating has the effect of driving the outer quantiles
further apart, making the updated assessor less overconfident.
Also shown are the optimized decision maker and the Mendel-Sheridan
combination of experts 3 and 4. It is remarkable that the Mendel-Sheridan
combination is quite poorly calibrated. This illustrates the features discussed at
length in Chapter 13, namely, that this combination involves a prior that is more
informative than the experts' distributions, and requires a large number of
variables to move significantly away from the prior. The information scores for the
original experts differ slightly from those in Table 15.3 because the additional
experts influence slightly the intrinsic ranges for each variable.
The Bayesian model seems to perform well on this data when applied to
individual experts. Whether analyzed in terms of loss functions or in terms of
unnormalized weight, the updated experts perform well; the unnormalized weight
of updated experts 3 and 4 dominates that of the classical DM. One important
caveat must be made here. As mentioned in Chapter 13, the Bayesian model is very
sensitive to the choice of cutoff points for each variable. In the classical model
applied to the original experts, the cutoff points for each variable were determined
as "a little larger/smaller" than the highest/lowest quantile for each variable. If we
had chosen the cutoff points much more spread out, this would affect all the
original experts' information scores in more or less the same way, and would not
influence the placement of their quantiles. However, in performing the Bayesian
updating the cutoff points had to be more spread out, to allow the model "room to
recalibrate." The cutoff points for all variables were set at 1E-13, 1 — 1E-13 (the
estimates all concerned relative frequencies). This choice directly influences the
Figure 15.6 Bilinear cumulative loss for Bayesian estimates at 5%, 50%, and 95% quantiles,
for optimized decision maker and updated optimized decision maker.
placement of the 5% and 95% updated quantiles. A different choice, say 1E-99 and
1 - 1E-99, would give the original experts much higher information relative to the
updated experts (computed on a logarithmic scale). In applying the Bayesian
model, even in the single-expert case, the user must be acutely aware of the effects of
choosing cutoff points.
Table 15.6 underscores the reservations in Chapter 13 regarding the Mendel-
Sheridan Bayesian combination of expert assessments. Practical guidelines for the
applicability of this model, or of possible variants, cannot be given at present.
Table 15.7 shows the cumulative bilinear losses for the experts and for the
classical DM on all five continuous-version applications. Interestingly, the total
losses do not correspond particularly well with the classical weights. In three of the
five studies the classical DM has the lowest total loss.
Given the problems with a Bayesian combination of experts, it is natural to
explore the possibility of updating the individual experts using Bayes' theorem, and
combining the updated experts according to the classical model. Such a method is
philosophically inconsistent, as the classical model is based on the theory of proper
scoring rules, whereas updating before combination destroys the propriety of the
weights. Nonetheless, curiosity cannot be suppressed, and the results are shown in
Table 15.8. Comparing the results with Table 15.2, in all five cases the DM's
unnormalized and normalized weight is worse in Table 15.8. The data give little
encouragement for this type of hybrid model.
Table 15.6. Results of Feeding the Bayesian Updated Experts and the Mendel-
Sheridan Combination of Experts 3 and 4 into the Classical Model and Comparing the
Calibration, Information, and Weights with the Four Original Experts and with the
Optimized Decision Maker (Class DM). Changes in the intrinsic ranges (see text) cause
the information scores to differ slightly from those in supplement A for the non-updated
experts.
Table 15.7. Cumulative Bilinear Losses for Experts and Classical DM (Estimates at the
5%, 50%, and 95% Quantiles, and Their Total); an Asterisk Marks the Lowest Total in
Each Study
                5%        50%       95%       Total
ESTEC-1
Exp. 1 0.31 2.37 1.49 4.17
Exp. 2 0.24 1.83 2.68 4.75
Exp. 3 0.24 1.71 1.38 3.33
Exp. 4 0.28 1.66 1.78 3.78
DM 0.25 1.48 1.59 3.32*
DSM-1
Exp. 1 4.4 E05 4.3 E06 8.1 E06 1.3 E07
Exp. 2 1.5 E05 9.5 E06 1.5 E05 9.8 E06
Exp. 3 4.3 E05 4.3 E06 8.0 E06 1.3 E07
Exp. 4 4.4 E05 4.4 E06 8.2 E06 1.3 E07
Exp. 5 1.2 E05 3.2 E05 8.6 E04 5.2 E05
Exp. 6 7.4 E04 5.9 E05 8.8 E04 7.5 E05
Exp. 7 9.1 E04 6.1 E05 3.0 E04 7.3 E05
Exp. 8 9.1 E04 3.1 E05 1.5 E05 5.5 E05
Exp. 9 2.1 E05 1.6 E06 2.9 E06 4.7 E06
Exp. 10 2.4 E05 2.1 E06 3.6 E06 5.9 E06
DM 1.5 E05 7.4 E04 1.5 E05 3.7 E05*
DSM-2
Exp. 1 2.7 E02 5.7 E02 1.5 E02 9.9 E02*
Exp. 2 4.3 E02 3.8 E03 6.7 E03 1.1 E04
Exp. 3 2.7 E03 3.1 E03 6.1 E02 6.4 E03
Exp. 4 1.3 E03 4.7 E03 6.1 E03 1.2 E04
Exp. 5 8.4 E01 4.7 E02 5.2 E02 1.1 E03
Exp. 6 5.2 E02 9.6 E02 7.1 E02 2.2 E03
Exp. 7 8.7 E02 1.6 E03 1.2 E03 3.7 E03
Exp. 8 4.2 E02 3.9 E03 6.4 E03 1.1 E04
DM 1.3 E02 1.0 E03 2.3 E02 1.4 E03
ESTEC-2
Exp. 1 2.3 E03 2.0 E04 3.2 E04 5.4 E04
Exp. 2 6.9 E03 1.7 E04 1.3 E04 3.7 E04
Exp. 3 1.9 E04 1.8 E04 6.8 E03 3.9 E04
Exp. 4 6.1 E03 1.9 E04 2.4 E04 4.9 E04
Exp. 5 1.6 E04 1.9 E04 5.7 E03 3.5 E04
Exp. 6 4.8 E03 1.6 E04 3.7 E03 2.5 E04
Exp. 7 1.3 E04 2.0 E04 3.5 E03 3.6 E04
DM 2.2 E03 1.7 E04 3.9 E03 2.3 E04*
ESTEC-3
Exp. 1 0.11 0.35 0.15 0.61
Exp. 2 0.02 0.07 0.11 0.20
Exp. 3 0.04 0.11 0.16 0.31
Exp. 4 0.007 0.14 0.13 0.28
Exp. 5 0.02 0.08 0.16 0.26
Exp. 6 0.04 0.06 0.04 0.14*
DM 0.02 0.11 0.04 0.17
The following two case studies were undertaken at DSM, a large chemical process
plant in the south of The Netherlands, to determine the probabilities of various
failure causes of flange connections. A previous attempt by plant engineers to assess
the probabilities of various failure modes via an internal questionnaire was
unsuccessful, and it was decided to apply structured expert opinion to this problem.
The present application utilized the psychological scaling model, using the method
of paired comparisons, and a subsequent study used the classical model. It was
hoped that the results of the classical model would generate reference probabilities
for transforming the scale values emerging from the paired comparison study. It
was decided to focus on one particular plant.
Group two subdivides the failure cause "improper mounting" into the following 10
specific procedural failure causes:
Group two: Procedural Failures:
7.1 Face of the flange not/insufficiently cleaned
7.2 Damaged face of the flange
7.3 Wrong gasket type
7.4 Bolts badly tightened due to difficult accessibility
7.5 Gasket stuck/greased while mounting
7.6 Bolts improperly tightened (flanges not parallel and pipe not in line; following
a wrong tightening pattern)
7.7 Gasket not centered
7.8 Bolts/thread ends handled improperly (not made suitable, not greased)
7.9 Gasket damaged
7.10 Warm bolt connection not retightened during startup.
The Experts
The plant engineers selected 14 experts to participate in the study. They were
operators, mechanics, maintenance engineers, mechanical engineers, and their
chiefs. Ten of the experts had the equivalent of a bachelor of science degree, and
these ten experts were selected for the classical model. All experts participated in
the paired comparison study. The set of all experts is referred to as subgroup A. For
each round a subgroup B of experts was selected, consisting of the experts having
the fewest circular triads, that is, inconsistencies in preference (see Chap. 14). Each
expert was interviewed individually. One plant representative and one member of
the analysis team were present during each interview. Each interview consisted of
the following parts:
• An introduction, intended to motivate the expert. The purpose of the
interview and the type of questions in the method of paired comparisons
were explained, and an overview of all failure causes was given.
• The actual paired comparisons data collection (rounds 1 and 2). The
questionnaire was filled in by the expert himself. Direct rankings of the
failure causes were elicited. For the 10 experts participating in the classical
model study, the interview also included the following parts:
—An introduction to the classical model, including a short training,
intended to teach the experts how to quantify their uncertainty
—The actual classical data collection
Results
The paired comparison data was analyzed with computer programs whose output
results are included in supplement C to this chapter. Comparison of the results of
the rankings based on paired comparisons with the direct rankings shows a high
rank correlation, although the rankings typically differ with respect to the highest
and lowest ranked items.
The experts' paired comparisons were analyzed for inconsistencies in pre-
ference, or circular triads. The average number of circular triads seems to be quite
low, indicating stable preferences. The maximum number of circular triads for
which the null hypothesis H0 that the preferences are at random is still rejected at
the significance level α = 0.05 is 4 in the case of 7 objects (round 1) and 21 in the
case of 10 objects (round 2). Only for expert e12 was H0 not rejected in round one.
The coefficient of agreement u and the coefficient of concordance W were
computed for subgroups A and B. Using the statistical tests described in Chapter
14, the null hypothesis H0 that the preferences were at random was rejected at the
5% level for both groups in both rounds.
In order to transform the scale values to absolute values, assessments from the
classical model were used as reference values. The following reference values were
available:

Item No.    Failure cause    Reference value (median)
9           1.0              1.13/year
10          5.0              0.64/year
11          7.0              2.24/year
12          7.1              0.55/year
13          7.6              1.62/year
14          7.10             0.45/year
Note that the value for failure number 7 should be greater than the sum of the
values for failure numbers 7.1, 7.6, and 7.10. This need not point to a real
inconsistency, as the reference values are medians of uncertainty distributions. On
the other hand, it may be that detailed consideration of all subdivisions of failure
cause 7 leads to higher assessments for "improper mounting" (number 7) than
simply considering this cause in isolation.
Since the Thurstone model requires two reference values, there are three
possible combinations of two reference values in each round. For all possible
combinations, the scale values were transformed into absolute values for subgroup
B. From the results it must be concluded that transformation is problematic in this
case. The ranking of the reference values contradicts that of the scale values. The
variation in the absolute values, given different reference values, is so large that
determination of confidence intervals was deemed meaningless. Modeling un-
certainty, that is, uncertainty with regard to the validity of the modeling
assumptions, clearly swamps the sampling fluctuations represented by the con-
fidence bounds.
Focusing on the untransformed values, the following results were communi-
cated as representing a "group opinion" on the ranking of failure causes. The
results of round one point to three important causes, cause 7 (improper mounting),
cause 5 (aging of gasket), and cause 1 (changes in temperature caused by process
cycles). Causes 2, 3, 4, and 6 are failure causes of lower order.
In the category of improper mounting, cause 7.4 (bolts badly tightened due to
difficult accessibility) is deemed most important, followed by 7.7 (gasket not
centered), 7.6 (bolts improperly tightened), and 7.10 (warm bolt connection not
retightened during startup) in this order. Other causes are of lower order.
The following conclusions were drawn with respect to the performance of the
psychological scaling models and the method of paired comparisons:
• The paired comparison exercise went quite smoothly, and experts generally
enjoyed having their expertise extracted in this manner. There was lively
interest in the results.
• The rankings emerging from the paired comparison exercises appear
meaningful and were accepted as such.
• The paired comparison scale values do not agree well with the results of the
classical model, for those values assessed by both methods. This may not be
surprising, as the paired comparison method treats all experts equally,
whereas the classical model concentrates the weight on 2 of the 10 experts
(see next case study).
• Three reference values for the Thurstone model are provided by the classical
model, but these values are not affine related to the scale values emerging
from the Thurstone model. Hence, the assumptions underlying the
Thurstone model are not satisfied when the classical model's values play the
role of true values in this case. The same conclusion applies to the Bradley-
Terry (NEL) model, which requires only one reference value. Of course, it is
not known whether these underlying assumptions would be satisfied with
the true values themselves. The experience with this application teaches that
it is advisable to have more than two reference values, in order to have a
check on the modeling assumptions, before the transformation to absolute
values is carried out.
As indicated above, the classical model was used to generate reference values for
the paired comparison study of flange connection failure causes. Ten experts each
assessed 14 items in total, of which the first 8 were seed variables. The elicitation
was accomplished by directly assessing the 5%, 50%, and 95% quantiles of the
uncertainty distributions for each variable. A brief training session preceded the
elicitation.
The scores for the decision maker are shown in Table 15.9 for significance
levels 0 and 0.14. Also shown are the decision maker's scores when the experts are
assigned equal weights a priori. The decision maker's unnormalized weight is
maximal at significance level 0.14. At this level, the decision maker is better than the
weightiest expert, expert number 2, who receives weight 0.84. At significance level 0
all experts contribute to the decision maker, though unequally. The decision
maker's calibration score is unaffected, but his information score is degraded.
Substantial degradation results from assigning all experts equal weight. It is
interesting that in this case the optimization is driven by the decision maker's
information score, and not by his calibration score.
At significance level 0, experts 2 and 8 receive jointly about 80% of the weight.
These experts are the best calibrated and among the least informative. At the
Table 15.9. Scores for Experts and DM at Significance Level α = 0.14 (Optimal) and
0, and for the DM Gotten by Assigning All Experts Equal Weight
optimal level of 0.14, their weights are, respectively, 0.74 and 0.26. "Range graphs"
presented in Figure 15.7 give a visual appreciation of the results. These graphs
show the 90% confidence bands and median assessments for all experts and for the
optimal decision maker, for all variables. When present, the realization is also
shown. From these graphs it is evident that the classical model produces very
meaningful results for the optimal decision maker.
The following conclusions emerge from this application of the classical model:
• The elicitation for the classical model went quite smoothly. Experts were
comfortable with the elicitation format.
• The numerical conclusions were accepted by the plant engineers as being
quite meaningful. The numerical weights were found quite meaningful by
the plant engineer who worked with the analysis team. In fact, he was one of
the experts contributing to the optimal decision maker, and he was able to
predict who the other contributor was.
• For reasons indicated in the previous section, the attempt to "bootstrap"
the paired comparison model with the classical model, by using the latter to
supply reference values, was not successful. Inherent differences in the two
models caused the classical model's output to violate the assumptions
underlying the psychological scaling models.
Figure 15.7 Bilinear loss for Bayesian estimates at 5%, 50%, and 95% quantiles, for experts 3,
4, Mendel-Sheridan combination of 3 and 4, and optimized decision maker.
SUPPLEMENT A
Case name: ESTEC-1 9.7.89 CLASS system
Quantiles of Solution
Case name: ESTEC-1 9.7.89 CLASS system
Case name: DSM-1 9.7.89 CLASS system
Quantiles of Solution
Case name: DSM-2 9.7.89 CLASS system
Quantiles of Solution
Case name: ESTEC-2 16.7.89 CLASS system
Case name: ESTEC-2 9.7.89 CLASS system
Quantiles of Solution
Case name: ESTEC-3 9.7.89 CLASS system
Quantiles of Solution
Case name: ESTEC-3 9.7.89 CLASS system
No.  Item      Background measure    Realization
1 Item 1 UNI
2 Item 2 LOG 2.50000E-0006
3 Item 3 UNI 0.95
4 Item 4 LOG 0.00077
5 Item 5 UNI
6 Item 6 LOG 0.0005
7 Item 7 LOG 9.80000E-0005
8 Item 8 UNI 0.0036
9 Item 9 UNI 0.0059
10 Item 10 LOG 1.00000E-0004
11 Item 11 LOG 0.00035
12 Item 12 UNI 0.0023
13 Item 13 LOG 2.30000E-0006
14 Item 14 UNI 0.014
SUPPLEMENT B
ESTEC-1
Expert Item
No. No. 5% 50% 95%
ESTEC-1. (Continued)
Expert Item
No. No. 5% 50% 95%
2 12 5.200E-0010 1.000E-0008 1.900E-0007
2 13 5.200E-0008 1.000E-0006 1.900E-0005
3 1 3.400E-0005 5.000E-0003 6.700E-0001
3 2 5.200E-0004 1.000E-0002 1.900E-0001
3 3 5.900E-0002 2.000E-0001 6.800E-0001
3 4 5.900E-0003 3.000E-0002 1.500E-0001
3 5 5.900E-0003 2.000E-0002 6.800E-0002
3 6 1.200E-0002 4.000E-0002 1.300E-0001
3 7 1.600E-0002 8.000E-0002 4.100E-0001
3 8 2.900E-0010 1.000E-0009 3.400E-0009
3 9 3.800E-0004 5.000E-0002 9.900E-0001
3 10 2.000E-0010 1.000E-0009 5.100E-0009
3 11 2.000E-0003 1.000E-0002 1.200E-0001
3 12 2.300E-0005 1.000E-0003 3.400E-0001
3 13 5.200E-0010 1.000E-0008 1.900E-0007
4 1 2.900E-0003 1.000E-0002 3.400E-0002
4 2 2.900E-0003 1.000E-0002 3.400E-0002
4 3 2.900E-0002 1.000E-0001 3.400E-0001
4 4 1.200E-0002 4.000E-0002 1.400E-0001
4 5 1.200E-0002 4.000E-0002 1.400E-0001
4 6 8.800E-0003 3.000E-0002 1.000E-0001
4 7 1.200E-0002 4.000E-0002 1.400E-0001
4 8 5.200E-0005 1.000E-0003 1.900E-0002
4 9 2.900E-0003 1.000E-0002 3.400E-0002
4 10 7.500E-0011 1.000E-0008 1.300E-0006
4 11 2.000E-0003 1.000E-0002 5.100E-0001
4 12 2.300E-0003 5.000E-0002 9.700E-0001
4 13 7.500E-0009 1.000E-0006 1.300E-0004
ESTEC-2
Expert Item
No. No. 5% 50% 95%
ESTEC-2. (Continued)
Expert Item
No. No. 5% 50% 95%
ESTEC-3
DSM-1
DSM-1. (Continued)
DSM-2
Expert No. Item No. 5% 50% 95%
DSM-2. (Continued)
SUPPLEMENT C
Elicitation Format
Round 1
Question: Which of the following two failure causes has led more often to the failure
of a flanged connection in the plant?
1. 1 2
2. 3 7
3. 4 6
4. 5 1
5. 2 3
6. 7 4
7. 6 5
8. 1 3
9. 4 2
10. 5 7
11. 6 1
12. 3 4
13. 2 5
14. 7 6
15. 1 4
16. 5 3
17. 6 2
18. 7 1
19. 4 5
20. 3 6
21. 2 7
(1) Answers of the experts:
Notation: 1 if Obj. A > Obj. B
2 Obj. A < Obj. B
0 Obj. A = Obj. B
9 no knowledge of
Obj. A or Obj. B
Expert 1 111122211112222112211
Expert 2 122111212221211121212
Expert 3 122111212221211111212
Expert 4 122221212221211111212
Expert 5 221111212121111121222
Expert 6 121111211221211111222
Expert 7 122121212221211121222
Expert 8 121211212222111121222
Expert 9 222111212121111122212
Expert 10 121211211222111121122
Expert 11 112222122111221112222
Expert 12 112212211121221211212
Expert 13 222111212121211121211
Expert 14 222111212221211121222
Output Paired Comparisons (Round 1)
Expert 1 1, 3, 6 C1
Total 1 C1 + 0 C2 + 0 C3 = 1
Expert 3 2, 3, 6 C1
Total 1 C1 + 0 C2 + 0 C3 = 1
Expert 5 2, 5, 7 C1
3, 4, 6 C1
Total 2 C1 + 0 C2 + 0 C3 = 2
Expert 6 2, 3, 4 C1
3, 4, 6 C1
Total 2 C1 + 0 C2 + 0 C3 = 2
Expert 7 2, 3, 6 C1
Total 1 C1 + 0 C2 + 0 C3 = 1
Expert 9 1, 2, 7 C1
2, 5, 7 C1
Total 2 C1 + 0 C2 + 0 C3 = 2
Expert 11 1, 3, 5 C1
2, 4, 7 C1
Total 2 C1 + 0 C2 + 0 C3 = 2
Expert 12 1, 3, 7 C1
1, 4, 7 C1
1, 5, 7 C1
1, 6, 7 C1
2, 3, 4 C1
2, 3, 5 C1
2, 3, 6 C1
2, 3, 7 C1
Total 8 C1 + 0 C2 + 0 C3 = 8
Data-Proportion Matrix
Obj 2 3 4 5 6 7
1 0.1734 0.4480
2 0.0784 -0.0704
3 0.0350 -0.5874
4 0.0208 -0.8055
5 0.3041 0.7666
6 0.0395 -0.4815
7 0.3487 0.7302
Transformed Object Values (1); reference values:
obj. 7: 1.13, obj. 5: 0.64
1 1.13E + 00 1.13E + 00
2 5.11E-01 1.93E + 00
3 2.28E-01 2.72E + 00
4 1.35E-01 3.06E + 00
5 1.98E + 00 6.40E-01
6 2.58E-01 2.56E + 00
7 2.27E + 00 6.96E-01
1 3.65E-01 1.46E + 01
2 1.65E-01 3.74E + 01
3 7.37E-02 6.02E + 01
4 4.37E-02 6.97E + 01
5 6.40E-01 6.40E-01
6 8.32E-02 5.55E + 01
7 7.34E-01 2.24E + 00
1 1.11E + 00 1.13E + 00
2 5.04E-01 0.00E + 00
3 2.25E-01 0.00E + 00
4 1.34E-01 0.00E + 00
5 1.95E + 00 2.38E + 00
6 2.54E-01 0.00E + 00
7 2.24E + 00 2.24E + 00
16
Conclusions
This final brief chapter attempts to draw together conclusions from the preceding
parts of this study, and indicates the important open problems.
PART I
Part I addressed the data on expert opinion and the methods for using expert
opinion currently found in practice. The overall conclusion to be drawn from this
material is twofold:
• Expert opinions can contain useful data for rational decision support.
• Considered as a source of data, expert opinion has certain characteristics
that require new techniques of data collection and processing.
The most experience to date with quantified expert opinion is found in the field of
risk analysis. Here we confront all the problems and potential payoffs of expert
opinion. Expert opinions typically show a very wide spread, they may be poorly
calibrated, and experts tend to cluster into optimists and pessimists. On the other
hand, there are dramatic examples of successful application of expert opinion in
this field.
The field of artificial intelligence has invited massive applications of expert
opinion, in the form of input into expert systems. The use of expert opinion in this
field is highly unstructured, and the representations of uncertainty tend to be ad
hoc, lacking a firm axiomatic basis. In particular, these representations contradict
the axioms of probability. When uncertainty is represented in an ad hoc manner, it
is impossible to evaluate and combine expert opinions in an intelligent way. On the
other hand, the most cursory acquaintance with expert data demonstrates the need
for evaluation and combination.
Psychometric studies have thrown some light on factors that may adversely
affect the quality of subjective probability assessments. However, the data relating
specifically to experts are sparse, and the analytical tools used to analyze these data
could be improved.
The most important problem emerging from the material reviewed in Part I is
the need for an overarching methodology for using expert opinion in science. The
methodological rules for collecting and processing "objective data" have been
developed over a great many years. "Subjective data" in the form of expert opinion
may be a useful new form of information, but rules for its collection and
processing must be developed. Such rules must serve the fundamental aim of
science, namely, building rational consensus, and they must also take account of
the peculiar features of subjective data. The conclusions of Part I outlined the
hesitant first steps in this direction, which have guided the development in the
remainder of this study.
PART II
theory will be "downloaded" onto the expert problem in the future. The techniques
for testing the calibration hypotheses can surely be improved and extended.
Perhaps more pressing is the need for more and refined techniques for eliciting
subjective probability distributions. These techniques must be user friendly and
should allow us to encode more information easily. Quantile assessment is somewhat
crude in this respect. Parametric elicitation techniques constitute a promising
direction for future work. Closely related to elicitation is the problem of
communicating probabilistic information to decision makers. This requires, as it
were, inverting the elicitation procedure. Little has been done in this area to date,
but no one will deny its importance. Finally, many interesting mathematical
problems remain regarding the asymptotic behavior of weights and scoring rules.
The results in Chapter 9 represent a "first pass."
For practical applications the most pressing need is to develop techniques for
identifying meaningful seed variables easily. This requires experience in ferreting
out such variables, but might also profit from the use of "laboratory seed variables"
if these could be demonstrated to predict performance on variables of interest.
PART III
In Part III, three models for combining expert opinions are developed. These
models are all operational, have been applied, and have proved to be of value. It is
appropriate to conclude by offering a summary assessment of these models.
The classical model is easy to understand, easy to apply, and can be applied
whenever a modest number of seed variables are available. The experience to date
indicates that its results are relatively robust—it has never yielded strange or
intuitively unacceptable results. The two most arbitrary features are the choice of
seed variables and the measure of informativeness. With regard to the latter,
experience to date indicates that the results are not highly sensitive to the choice of
information measures. With regard to the former, little can be said at present.
However, one can anticipate that variations of the classical model, with more or
less equal normative appeal, will yield somewhat differing results, and a choice
between these will be very difficult on normative grounds. In short, the classical
model is probably substantially better than doing nothing, but is probably not the
"best" solution to any given problem.
The Bayesian model described in Chapter 13 is probably most useful when
applied to a single expert. The results reported in Chapter 15 are encouraging in
this respect. Even in this case, however, the results are highly sensitive to decisions
of the analyst. For multiple experts, the solution presented in Chapter 13 is rather
speculative, and better solutions, particularly for a small number of seed variables,
may be anticipated. The results of the application in Chapter 15 confirm these
conclusions. Of the three models developed in Part III, the Bayesian models have
the strongest mathematical foundation. Translating this theoretical advantage into
a practical advantage, while at the same time reducing dependence on ad hoc prior
information, is a challenge for the future.
The psychological scaling models are quite different from the other two. The
primary virtues of these models lie in their user-friendliness and in their role as
MATHEMATICAL FRAMEWORK
We denote the real numbers as R and the integers as N. A set is indicated by curly
brackets, with the generic element separated from the defining property by a
vertical stroke:
A = {x | x satisfies the defining property of A}; for example,
Bachelor = {x | x is an unmarried man}
A′ denotes the complement of A, the set of elements not belonging to A. If A and B
are sets, A ∪ B denotes the union of A and B, and A ∩ B denotes the intersection of
A and B. A ⊇ B says that B is a subset of A, which may also be written B ⊆ A. x ∈ A
says that x is an element of A. ∅ denotes the empty set, the set without elements.
The mathematical framework and notation used in this study is the following.
practical situations, the descriptions of the possible worlds will be "truncated" and
the possible worlds will be described only up to a preassigned level. Such
truncation is merely a convenience of the analyst and it is implicitly assumed that
each state description could be indefinitely refined.
A set of subsets of S is called a field if it contains the empty set ∅ (which is a
subset of every set), and is closed under unions and complementation. More
exactly,
A set ℱ of subsets of S is a field of subsets of S if
(i) ∅ ∈ ℱ
(ii) If A ∈ ℱ and B ∈ ℱ, then A ∪ B ∈ ℱ
(iii) If A ∈ ℱ, then A′ ∈ ℱ (A.1)
If ℱ is a field of subsets of S, one also says that ℱ is a field over S. There are many
different fields over S. The smallest field over S is the trivial field containing only ∅
and S itself. The largest field over S is the set of all subsets of S. The set of all subsets
of S is denoted¹ 2^S. Any set of subsets of S generates a unique field, namely the
intersection of all fields that contain that set of subsets (which intersection is itself a
field).
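To make the definition concrete, here is a small Python sketch (illustrative only, not taken from the text) that constructs the field generated by a collection of subsets of a finite S by closing it under complementation and pairwise union; the function name generated_field is hypothetical.

```python
from itertools import combinations

def generated_field(S, subsets):
    """Return the smallest field over finite S containing the given subsets,
    obtained by closing under complementation and pairwise union."""
    field = {frozenset(), frozenset(S)}
    field.update(frozenset(A) for A in subsets)
    changed = True
    while changed:
        changed = False
        current = list(field)
        # close under complementation
        for A in current:
            comp = frozenset(S) - A
            if comp not in field:
                field.add(comp)
                changed = True
        # close under pairwise union
        for A, B in combinations(current, 2):
            union = A | B
            if union not in field:
                field.add(union)
                changed = True
    return field

S = {1, 2, 3, 4}
F = generated_field(S, [{1}, {2, 3}])
print(len(F))                      # 8: atoms are {1}, {2,3}, {4}
print(sorted(map(sorted, F)))
```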
The pair (S, ℱ) is called a measurable space; it is the sort of mathematical
object over which probabilities can be defined.
A (finitely additive) probability measure p over the measurable space (S, ℱ) is a
function p: ℱ → [0, 1] from ℱ to the unit interval [0, 1] satisfying
(i) For all A ∈ ℱ, 0 ≤ p(A) ≤ 1
(ii) p(S) = 1
(iii) If A ∈ ℱ and B ∈ ℱ and if A ∩ B = ∅, then p(A ∪ B) = p(A) + p(B)
The object (S, ℱ, p) is called a probability space. In most mathematical contexts it is
customary to assume as well that p is countably additive; that is, if Ai ∈ ℱ,
i = 1, 2, ..., with Ai ∩ Aj = ∅ when i ≠ j, then p(∪i Ai) = Σi p(Ai). It is then necessary
to assume that ℱ is closed under countable unions, in which case ℱ is called a
σ-field. We do not insist on this assumption, as subjectivists have strong
philosophical reservations against countable additivity (see below, and see De
Finetti, 1974, pp. 116 ff.).
Unless otherwise stated, the field ℱ will always be the field of all subsets of S.
For countably additive probability measures this would involve mathematical
indiscretions, but for finitely additive measures it is quite harmless.
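A minimal sketch, assuming a finite S, of how such a measure can be specified in practice: probabilities are assigned to the individual states, and the probability of any event is the sum of the probabilities of its states, so axioms (i)-(iii) hold automatically. The state names and values below are illustrative only.

```python
def make_probability(state_probs):
    """Given probabilities for the individual states (atoms), return a
    function p defined on all events (subsets of S)."""
    total = sum(state_probs.values())
    assert abs(total - 1.0) < 1e-12, "state probabilities must sum to 1"

    def p(event):
        return sum(state_probs[s] for s in event)
    return p

# four states: rain/no rain combined with car starts/does not start
state_probs = {"rs": 0.1, "rn": 0.2, "ns": 0.5, "nn": 0.2}
p = make_probability(state_probs)

A = {"rs", "rn"}     # "it rains"
B = {"ns", "nn"}     # "it does not rain"
print(p(A), p(B))    # 0.3 0.7
print(p(A | B))      # 1.0 = p(S)
# additivity for disjoint A and B:
print(abs(p(A | B) - (p(A) + p(B))) < 1e-12)
```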
the set of possible worlds in which it rains on February 9, 2010, that is, the set of
possible worlds in which the quoted proposition holds. The following statements
are therefore equivalent:
"Event A occurs"
"Proposition A holds"
"The real world belongs to A"
The boolean operations on events, union, intersection, and complement, correspond
to logical operations on propositions. Thus,
A ∩ B corresponds to "A and B"
A ∪ B corresponds to "A or B"
A′ corresponds to "not A"
The event/proposition S corresponds to the trivial proposition (the proposition
true in all possible worlds) and the empty set ∅ corresponds to the impossible
proposition (the proposition that is true in no possible world).
To be utterly precise we must distinguish between propositions and state-
ments. The statements "2 + 2 = 5" and "2 + 2 = 6" both express the same
proposition, namely, the proposition corresponding to 0. All true mathematical
statements express the same proposition, namely, the proposition corresponding to
the event S. Still being utterly precise, we should distinguish between a state s and
the event {s} whose single element is the state s. A degree of partial belief is not
assigned to s, but to {s}. These distinctions will be made only if necessary to avoid
ambiguity.
In any concrete application we obviously don't want to consider the set of all
possible worlds. It is customary to introduce a reduced set of states and a reduced
field of events. We can do this by isolating first a set of propositions in which we are
interested. Suppose our interest is confined to the following propositions:
A: It rains tomorrow.
B: My car won't start tomorrow.
C: I have a cold the day after tomorrow.
A reduced state, or equivalently a reduced possible world, is a complete description
in terms of the above propositions. In other words, it specifies whether each of the
above events occurs; that is, specifies the truth value of each of the above
propositions. In the present example there are 2^3 = 8 elements in the set of reduced
states.
If we start with n propositions, then we generate in this way 2^n reduced states,
assuming that the propositions are logically disjoint (i.e., that no proposition or its
negation is entailed by other propositions or their negations). An event is a set of
states. It follows that there are 2^(2^n) events to which probabilities must be assigned.
Hence, the number of events grows very quickly. If n = 5, then we shall have to
assign probabilities to more than 4 billion events. Of course, these assignments are
not independent. The probability
of an event is the sum of the probabilities of the states comprising the event. For
n = 5, we have to assign probabilities to each of the 2^5 = 32 states, and the
probabilities of all other events can be calculated.
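The bookkeeping just described is easily illustrated in code. The following Python sketch (illustrative only; the uniform assignment of state probabilities is an arbitrary choice) enumerates the 2^n reduced states for the three propositions above and computes event probabilities by summation.

```python
from itertools import product

propositions = ["A: it rains tomorrow",
                "B: my car won't start tomorrow",
                "C: I have a cold the day after tomorrow"]
n = len(propositions)

# A reduced state assigns a truth value to each proposition.
states = list(product([True, False], repeat=n))
print(len(states))            # 2**3 = 8 reduced states
print(2 ** (2 ** n))          # 256 events for n = 3; over 4 billion for n = 5

# Assign probabilities to the states (they must sum to 1); event
# probabilities then follow by summation.
state_prob = {s: 1.0 / len(states) for s in states}

def prob(event):
    """event: a set of reduced states."""
    return sum(state_prob[s] for s in event)

# The event "A occurs", i.e. the set of states in which the first
# proposition is true:
event_A = {s for s in states if s[0]}
print(prob(event_A))          # 0.5 under the uniform assignment
```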
It is sometimes convenient to speak of atoms of a field ℱ of events. A is an atom
of ℱ if A ∈ ℱ, A ≠ ∅, and if B ∈ ℱ, B ⊆ A, then B = A or B = ∅. States are atoms
of the field of all subsets of S. The trivial field {∅, S} has S as its only atom. The
atoms of a reduced field of events are the reduced states.
INTERPRETATION
The frequency interpretation has always stumbled over the definition of "independ-
ent trials for A." The definition of independence cannot appeal to the notion of
probability, as independence is used here in defining probability. Von Mises
attempted to characterize the sort of sequences of trials whose limiting relative
frequencies could be interpreted as probabilities, calling such sequences "collect-
ives." His attempts are now generally regarded as unsuccessful, although there have
been notable attempts to rehabilitate this notion (see Martin-Löf, 1970; Schnorr,
1970; van Lambalgen, 1987).
Even if frequencies cannot be used to define probability, they can be used to
measure probabilities.³ The main mechanism for doing this is the
Weak Law of Large Numbers. Let events {Ai}, i = 1, ..., be independent with
p(Ai) = p, i = 1, ...; then for all d > 0,
p(|#{i ≤ N: Ai occurs}/N - p| > d) → 0 as N → ∞ (A.3)
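A small simulation (illustrative only; the value p = 0.3 and the tolerance d = 0.05 are arbitrary) shows the mechanism at work: the fraction of N-trial experiments in which the relative frequency misses p by more than d shrinks as N grows.

```python
import random

random.seed(1)
p = 0.3          # probability of the event A (arbitrary, for illustration)
d = 0.05         # tolerance in the weak law
repetitions = 200

for N in (100, 1000, 10000):
    # Repeat the N-trial experiment and count how often the relative
    # frequency misses p by more than d.
    misses = 0
    for _ in range(repetitions):
        successes = sum(random.random() < p for _ in range(N))
        if abs(successes / N - p) > d:
            misses += 1
    print(N, misses / repetitions)   # fraction of misses shrinks toward 0
```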
Expectation is linear, that is, for all real numbers a and b, E(af + bg)
= aE(f) + bE(g).
If S is infinite the definition of the expectation of a function f on S is more
delicate, especially when the probability p is only finitely additive. When we write
the expectation of f as
E(f) = ∫ f dp
we shall assume that the measure p is countably additive, and that this expression is
defined in the usual way.
The conditional expectation of f given B [p(B) > 0], for S finite, is defined as
E(f | B) = (1/p(B)) Σ f(s)p({s}), where the sum runs over s ∈ B,
and a similar definition applies for S infinite. The indicator function for an event A,
1A, is defined as
1A(s) = 1 if s ∈ A and 0 otherwise
¹This expression is meaningful if {s} ∈ ℱ for all s ∈ S. This of course holds if ℱ = 2^S. It need not
hold in general, and then we should say that it was not measurable with respect to ℱ.
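For finite S these definitions reduce to weighted sums, as in the following sketch (illustrative values only).

```python
def expectation(f, p):
    """E(f) = sum over states of f(s) * p({s}), for finite S."""
    return sum(f(s) * p[s] for s in p)

def conditional_expectation(f, p, B):
    """E(f | B) = sum over s in B of f(s) p({s}) / p(B), assuming p(B) > 0."""
    pB = sum(p[s] for s in B)
    if pB == 0:
        raise ValueError("conditioning event has probability zero")
    return sum(f(s) * p[s] for s in B) / pB

def indicator(A):
    """Indicator function 1_A."""
    return lambda s: 1 if s in A else 0

S_probs = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}     # p({s}) for each state
f = lambda s: s ** 2
B = {3, 4}

print(expectation(f, S_probs))                 # 0.1 + 0.8 + 2.7 + 6.4 = 10.0
print(conditional_expectation(f, S_probs, B))  # (2.7 + 6.4) / 0.7 = 13.0
print(expectation(indicator(B), S_probs))      # E(1_B) = p(B) = 0.7
```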
DISTRIBUTIONS
Distribution functions are continuous from the right, and hence differentiable
except on a set that is at most countable. If F is everywhere differentiable, then
⁴Specifically, pn(A) → p(A) for every set A whose boundary has probability 0 under p. An equivalent
definition is that En(f) → E(f) for every real continuous function f on [0, 1].
The integrand is the normal density with mean μ and standard deviation σ. If X is a
normally distributed random variable with mean μ and standard deviation σ, then
Y = (X - μ)/σ is a standard normal variable having mean 0 and unit variance.
Values for the standard normal cumulative distribution function are given in Table
B.1. The inverse distribution is given in Table B.2. Any positive affine transformation
f(Y),
f(Y) = aY + b, a > 0,
is normal with mean b and standard deviation a. The constants a and b are
sometimes called the scale and location parameters of the transformation f. Affine
transformations (positive or otherwise) preserve the ratios of intervals, that is, for
all y1, y2, y3, y4, y3 ≠ y4,
(f(y1) - f(y2))/(f(y3) - f(y4)) = (y1 - y2)/(y3 - y4)
If Y is a standard normal variable, then the variable Z with
Z =d exp(μ + σY)
is lognormally distributed with parameters μ and σ, where "=d" means "has the same
distribution as." Let za and ya be the ath quantiles of the distributions of Z and Y; that
is, the unique numbers za, ya such that p(Z ≤ za) = a and p(Y ≤ ya) = a. Then
za = exp(μ + σya) = exp(μ)·ka, where ka = exp(σya) and exp(μ) is the median of Z.
Since the standard normal variable is symmetric about the origin, y1-a = -ya, and
hence k1-a = 1/ka. This means that the quantiles za and z1-a can be derived by
respectively multiplying and dividing the median by ka. It is therefore convenient to
give the median, 5%, and 95% quantiles of a lognormal distribution by giving the
median and the factor k95. k95 is called the error factor or range factor of the
distribution of Z. Moreover, given ka for one value of a, we can derive kb for b ≠ a
from the last equation, and easily compute other quantiles. Finally, given two
symmetric quantiles, za and z1-a, the median is just their geometric mean:
exp(μ) = (za · z1-a)^(1/2)
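These relations are easy to check numerically. The sketch below (illustrative median and error factor, and assuming scipy is available) derives the 5% and 95% quantiles of a lognormal variable from its median and k95, and recovers the median as the geometric mean of the two symmetric quantiles.

```python
from math import exp, log, sqrt
from scipy.stats import norm, lognorm

median = 2.0e-3          # illustrative median of a lognormal variable Z
k95 = 10.0               # illustrative error (range) factor

# Z = exp(mu + sigma * Y) with Y standard normal, so the median is exp(mu)
# and k_a = exp(sigma * y_a).
mu = log(median)
sigma = log(k95) / norm.ppf(0.95)

z05 = median / k95       # 5% quantile: divide the median by k95
z95 = median * k95       # 95% quantile: multiply the median by k95

# Check against scipy's lognormal quantiles.
dist = lognorm(s=sigma, scale=exp(mu))
print(z05, dist.ppf(0.05))
print(z95, dist.ppf(0.95))
print(sqrt(z05 * z95), median)   # geometric mean of symmetric quantiles = median
```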
Another important distribution is the chi square distribution, denoted χ². The chi
square distributions are parametrized by their "degrees of freedom." χB² denotes the
cumulative chi square distribution function with B degrees of freedom. Its mean is
B and its variance is 2B. If X is a standard normal variable, then X² is a chi square
variable with one degree of freedom. The sum of independent chi square variables is
chi square, whose number of degrees of freedom is the sum of the numbers of
degrees of freedom of the summands. Hence χB² is the distribution of the sum of B
independent squared standard normal variables. χB²(r) denotes the value of the chi
square cumulative distribution function with B degrees of freedom at point r. Table
B.3 gives quantiles of chi square distributions with the number of degrees of
freedom ranging from 1 to 200.
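Entries such as those in Table B.3 can be reproduced with a standard statistical library; a minimal check, assuming scipy is available:

```python
from scipy.stats import chi2

# Quantiles of the chi square distribution with B degrees of freedom,
# following the layout of Table B.3 (rows: degrees of freedom, columns: a).
for B in (1, 5, 10, 30):
    row = [round(chi2.ppf(a, df=B), 3) for a in (0.5, 0.9, 0.95, 0.99)]
    print(B, row)
# e.g. B = 10 gives approximately 9.342, 15.99, 18.31, 23.21;
# the mean and variance are B and 2B:
print(chi2.mean(df=10), chi2.var(df=10))   # 10.0 20.0
```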
The information of p, I(p), is the negative of the entropy; I(p) = -H(p). H(p) is
always nonnegative. Its minimal value, 0, is attained if and only if pi = 1 for some i.
Its maximal value, ln n, is attained if and only if pi = 1/n, i = 1, ..., n. H(p) is a
measure for the lack of information in the distribution p.
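A short numerical check of these properties, taking H(p) = -Σ pi ln pi as the standard definition of the entropy of a discrete distribution; the example distributions are arbitrary.

```python
from math import log

def entropy(p):
    """Shannon entropy H(p) = -sum p_i ln p_i (terms with p_i = 0 contribute 0)."""
    return -sum(pi * log(pi) for pi in p if pi > 0)

n = 4
degenerate = [1.0, 0.0, 0.0, 0.0]
uniform = [1.0 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(degenerate))               # 0.0, the minimum
print(entropy(uniform), log(n))          # both ln 4, the maximum
print(0 <= entropy(skewed) <= log(n))    # True
```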
For n = 2 we may graph H(p) as a function of p1 as shown in Figure A.1.
Let q = (q1, ..., qm), qi > 0, i = 1, ..., m, be a distribution over the outcomes
y1, ..., ym. If p and q are independent, then the joint distribution (p, q) is given by
(p, q)ij = pi·qj, i = 1, ..., n, j = 1, ..., m
By direct integration, we can calculate the entropies for the normal and lognormal
distributions with parameters μ, σ:
Entropy of normal = ln(σ√(2πe))
and
Entropy of lognormal = μ + ln(σ√(2πe))
We see that both these expressions can be negative and go to -∞ as σ → 0. In
general, probability densities, as opposed to mass functions, can have negative
entropy, as f(x) can be greater than 1.
These problems are caused by the fact that (A.11) is not really a continuous
analog of (A.10). In going from finite probability mass functions to continuous
density functions, we should replace summation by integration and replace "pi" by
f(x) dx. However, the expression ln[f(x) dx] would not make sense. The "missing
dx" in (A.11) causes H(f) to behave quite differently than H(p). For example,
suppose that a digital weight scale distinguishes 200 different kilogram readings,
and suppose we consider a distribution over possible scale readings for a given
object. Since there are 200 possible outcomes, the entropy of this distribution is a
dimensionless number reflecting the lack of information in the distribution. If we
convert the scale from kilograms to grams, the value of the entropy remains the
same. Consider now a normal random variable X with standard deviation σX
reflecting uncertainty in weight, measured in grams. If we now express weight in
kilograms instead of grams, we transform the variable X into the variable
Y = X/1000. This is a positive affine transformation with scale parameter 1/1000;
hence, Y is normally distributed with the same mean as X, but with standard
deviation σY = σX/1000. Consulting the above formula, we see that the entropy
drops when we change the scale from grams to kilograms.
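The scale dependence is easy to verify numerically. The sketch below uses the standard expression ½ ln(2πe σ²) for the entropy of a normal density; the uncertainty of 50 grams is an arbitrary illustrative value.

```python
from math import log, pi, e

def normal_entropy(sigma):
    """Entropy of a normal density with standard deviation sigma:
    0.5 * ln(2 * pi * e * sigma**2)."""
    return 0.5 * log(2 * pi * e * sigma ** 2)

sigma_grams = 50.0                       # illustrative uncertainty of 50 g
sigma_kilograms = sigma_grams / 1000.0   # the same uncertainty expressed in kg

h_g = normal_entropy(sigma_grams)
h_kg = normal_entropy(sigma_kilograms)   # negative: the density exceeds 1
print(h_g, h_kg)
print(h_g - h_kg, log(1000))             # the entropy drops by ln 1000
```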
There is no universally accepted measure for information for continuous
distributions. We discussed practical solutions for this problem in Chapter 8.
Let p be as above, and let s = (s1, ..., sn) be another distribution over
x1, ..., xn. The relative information of s with respect to p is
I(s, p) = Σ si ln(si/pi)
where the sum runs over i = 1, ..., n. I(s, p) is always nonnegative (Kullback, 1959,
p. 15), and its minimal value, 0, is attained if and only if s = p. Note that si = 0 is
allowed but pi > 0 must be assumed. If u denotes the uniform distribution over
{1, ..., n}, such that ui = 1/n, i = 1, ..., n, then an easy calculation shows that
I(p, u) = ln n - H(p)
from which it follows that the maximal value of I(p, u), ln n, is attained when
H(p) = 0.
Let s be the sample distribution generated by N independent samples from the
distribution p. Then as N goes to infinity, the quantity 2NI(s, p) becomes χ²
distributed with n - 1 degrees of freedom.
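The following sketch (illustrative only; the distribution p and sample size N are arbitrary) computes the relative information of a sample distribution with respect to p and compares 2N·I(s, p) with the corresponding chi-square quantile, assuming scipy is available.

```python
import random
from math import log
from scipy.stats import chi2

random.seed(2)

p = [0.2, 0.3, 0.5]          # illustrative "true" distribution
N = 1000                     # number of independent samples

def relative_information(s, p):
    """I(s, p) = sum s_i ln(s_i / p_i); terms with s_i = 0 contribute 0."""
    return sum(si * log(si / pi) for si, pi in zip(s, p) if si > 0)

# Draw N samples from p and form the sample distribution s.
counts = [0] * len(p)
for _ in range(N):
    u, cum = random.random(), 0.0
    for i, pi in enumerate(p):
        cum += pi
        if u <= cum:
            counts[i] += 1
            break
s = [c / N for c in counts]

stat = 2 * N * relative_information(s, p)
# Under p, the statistic is asymptotically chi square with n - 1 = 2 degrees
# of freedom; compare it with the 95% quantile.
print(stat, chi2.ppf(0.95, df=len(p) - 1))
```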
If again we set n = 2, and we fix p1, then we can graph I(s, p) as a function of s1,
as shown in Figure A.2.
Figure A.2 I(s, p) for n = 2, s = (s1, 1 - s1), p = (p1, 1 - p1), shown as a function of s1, for
p1 = 0.125, p1 = 0.25, and p1 = 0.5.
We note that the relative information, in contrast with the entropy, does have
a natural generalization for probability densities. Let f1 and f2 be two continuous
densities which are nowhere equal to zero. Then
I(f1, f2) = ∫ f1(x) ln[f1(x)/f2(x)] dx (A.13)
is always nonnegative, and equals zero if and only if f1 = f2 (Kullback, 1959, p. 15).
The "missing dx" of (A.11) drops out in the argument of the natural logarithm in
(A.13).
Appendix B
Tables
Table B.1. The Standard Normal Cumulative Distribution Function
z 0 1 2 3 4 5 6 7 8 9
-3.9 0.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
-3.8 0.0001 0001 0001 0001 0001 0001 0001 0001 0001 0001
-3.7 0.0001 0001 0001 0001 0001 0001 0001 0001 0001 0001
-3.6 0.0002 0002 0001 0001 0001 0001 0001 0001 0001 0001
-3.5 0.0002 0002 0002 0002 0002 0002 0002 0002 0002 0002
-3.4 0.0003 0003 0003 0003 0003 0003 0003 0003 0003 0002
-3.3 0.0005 0005 0005 0004 0004 0004 0004 0004 0004 0003
-3.2 0.0007 0007 0006 0006 0006 0006 0006 0005 0005 0005
-3.1 0.0010 0009 0009 0009 0008 0008 0008 0008 0007 0007
-3.0 0.0013 0013 0013 0012 0012 0011 0011 0011 0010 0010
-2.9 0.0019 0018 0018 0017 0016 0016 0015 0015 0014 0014
-2.8 0.0026 0025 0024 0023 0023 0022 0021 0021 0020 0019
-2.7 0.0035 0034 0033 0032 0031 0030 0029 0028 0027 0027
-2.6 0.0047 0045 0044 0043 0041 0040 0039 0038 0037 0036
-2.5 0.0062 0060 0059 0057 0055 0054 0052 0051 0049 0048
-2.4 0.0082 0080 0078 0075 0073 0071 0069 0068 0066 0064
-2.3 0.0107 0104 0102 0099 0096 0094 0091 0089 0087 0084
-2.2 0.0139 0136 0132 0129 0125 0122 0119 0116 0113 0110
-2.1 0.0179 0174 0170 0166 0162 0158 0154 0150 0146 0143
-2.0 0.0228 0222 0217 0212 0207 0202 0197 0192 0188 0183
-1.9 0.0287 0281 0274 0268 0262 0256 0250 0244 0239 0233
-1.8 0.0359 0351 0344 0336 0329 0322 0314 0307 0301 0294
-1.7 0.0446 0436 0427 0418 0409 0401 0392 0384 0375 0367
-1.6 0.0548 0537 0526 0516 0505 0495 0485 0475 0465 0455
-1.5 0.0668 0655 0643 0630 0618 0606 0594 0582 0571 0559
-1.4 0.0808 0793 0778 0764 0749 0735 0721 0708 0694 0681
-1.3 0.0968 0951 0934 0918 0901 0885 0869 0853 0838 0823
-1.2 0.1151 1131 1112 1093 1075 1056 1038 1020 1003 0985
Table B.1. (Continued)
z 0 1 2 3 4 5 6 7 8 9
-1.1 0.1357 1335 1314 1292 1271 1251 1230 1210 1190 1170
-1.0 0.1587 1562 1539 1515 1492 1469 1446 1423 1401 1379
-0.9 0.1841 1814 1788 1762 1736 1711 1685 1660 1635 1611
-0.8 0.2119 2090 2061 2033 2005 1977 1949 1922 1894 1867
-0.7 0.2420 2389 2358 2327 2296 2266 2236 2206 2177 2148
-0.6 0.2743 2709 2676 2643 2611 2578 2546 2514 2483 2451
-0.5 0.3085 3050 3015 2981 2946 2912 2877 2843 2810 2776
-0.4 0.3446 3409 3372 3336 3300 3264 3228 3192 3156 3121
-0.3 0.3821 3783 3745 3707 3669 3632 3594 3557 3520 3483
-0.2 0.4207 4168 4129 4090 4052 4013 3974 3936 3897 3859
-0.1 0.4602 4562 4522 4483 4443 4404 4364 4325 4286 4247
-0.0 0.5000 4960 4920 4880 4840 4801 4761 4721 4681 4641
0.0 0.5000 5040 5080 5120 5160 5199 5239 5279 5319 5359
0.1 0.5398 5438 5478 5517 5557 5596 5636 5675 5714 5753
0.2 0.5793 5832 5871 5910 5948 5987 6026 6064 6103 6141
0.3 0.6179 6217 6255 6293 6331 6368 6406 6443 6480 6517
0.4 0.6554 6591 6628 6664 6700 6736 6772 6808 6844 6879
0.5 0.6915 6950 6985 7019 7054 7088 7123 7157 7190 7224
0.6 0.7257 7291 7324 7357 7389 7422 7454 7486 7517 7549
0.7 0.7580 7611 7642 7673 7704 7734 7764 7794 7823 7852
0.8 0.7881 7910 7939 7967 7995 8023 8051 8078 8106 8133
0.9 0.8159 8186 8212 8238 8264 8289 8315 8340 8365 8389
1.0 0.8413 8438 8461 8485 8508 8531 8554 8577 8599 8621
1.1 0.8643 8665 8686 8708 8729 8749 8770 8790 8810 8830
1.2 0.8849 8869 8888 8907 8925 8944 8962 8980 8997 9015
1.3 0.9032 9049 9066 9082 9099 9115 9131 9147 9162 9177
1.4 0.9192 9207 9222 9236 9251 9265 9279 9292 9306 9319
1.5 0.9332 9345 9357 9370 9382 9394 9406 9418 9429 9441
1.6 0.9452 9463 9474 9484 9495 9505 9515 9525 9535 9545
1.7 0.9554 9564 9573 9582 9591 9599 9608 9616 9625 9633
1.8 0.9641 9649 9656 9664 9671 9678 9686 9693 9699 9706
1.9 0.9713 9719 9726 9732 9738 9744 9750 9756 9761 9767
2.0 0.9772 9778 9783 9788 9793 9798 9803 9808 9812 9817
2.1 0.9821 9826 9830 9834 9838 9842 9846 9850 9854 9857
2.2 0.9861 9864 9868 9871 9875 9878 9881 9884 9887 9890
2.3 0.9893 9896 9898 9901 9904 9906 9909 9911 9913 9916
2.4 0.9918 9920 9922 9925 9927 9929 9931 9932 9934 9936
2.5 0.9938 9940 9941 9943 9945 9946 9948 9949 9951 9952
2.6 0.9953 9955 9956 9957 9959 9960 9961 9962 9963 9964
2.7 0.9965 9966 9967 9968 9969 9970 9971 9972 9973 9974
2.8 0.9974 9975 9976 9977 9977 9978 9979 9979 9980 9981
2.9 0.9981 9982 9982 9983 9984 9984 9985 9985 9986 9986
3.0 0.998650 .998694 .998736 .998777 .998817 .998856 .998893 .998930 .998965 .998999
3.1 0.999032 .999065 .999096 .999126 .999155 .999184 .999211 .999238 .999264 .999289
3.2 0.999313 .999336 .999359 .999381 .999402 .999423 .999443 .999462 .999481 .999499
3.3 0.999517 .999534 .999550 .999566 .999581 .999596 .999610 .999624 .999638 .999651
3.4 0.999663 .999675 .999687 .999698 .999709 .999720 .999730 .999740 .999749 .999758
Table B.1. (Continued)
z 0 1 2 3 4 5 6 7 8 9
3.5 0.999767 .999776 .999784 .999792 .999800 .999807 .999815 .999822 .999828 .999835
3.6 0.999841 .999847 .999853 .999858 .999864 .999869 .999874 .999879 .999883 .999888
3.7 0.999892 .999896 .999900 .999904 .999908 .999912 .999915 .999918 .999922 .999925
3.8 0.999928 .999931 .999933 .999936 .999938 .999941 .999943 .999946 .999948 .999950
3.9 0.999952 .999954 .999956 .999958 .999959 .999961 .999963 .999964 .999966 .999967
4.0 0.999968 .999970 .999971 .999972 .999973 .999974 .999975 .999976 .999977 .999978
4.1 0.999979 .999980 .999981 .999982 .999983 .999983 .999984 .999985 .999985 .999986
4.2 0.999987 .999987 .999988 .999988 .999989 .999989 .999990 .999990 .999991 .999991
4.3 0.999991 .999992 .999992 .999993 .999993 .999993 .999993 .999994 .999994 .999994
4.4 0.999995 .999995 .999995 .999995 .999996 .999996 .999996 .999996 .999996 .999996
4.5 0.999997 .999997 .999997 .999997 .999997 .999997 .999997 .999998 .999998 .999998
4.6 0.999998 .999998 .999998 .999998 .999998 .999998 .999998 .999998 .999999 .999999
4.7 0.999999 .999999 .999999 .999999 .999999 .999999 .999999 .999999 .999999 .999999
4.8 0.999999 .999999 .999999 .999999 .999999 .999999 .999999 .999999 .999999 .999999
4.9 1.000000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000
Table B.2. The Inverse Normal Function
p .000 .001 .002 .003 .004 .005 .006 .007 .008 .009 .010
.00 3.0902 2.8782 2.7478 2.6521 2.5758 2.5121 2.4573 2.4089 2.3656 2.3263 .99
.01 2.3263 2.2904 2.2571 2.2262 2.1973 2.1701 2.1444 2.1201 2.0969 2.0749 2.0537 .98
.02 2.0537 2.0335 2.0141 1.9954 1.9774 1.9600 1.9431 1.9268 1.9110 1.8957 1.8808 .97
.03 1.8808 1.8663 1.8522 1.8384 1.8250 1.8119 1.7991 1.7866 1.7744 1.7624 1.7507 .96
.04 1.7507 1.7392 1.7279 1.7169 1.7060 1.6954 1.6849 1.6747 1.6646 1.6546 1.6449 .95
.05 1.6449 1.6352 1.6258 1.6164 1.6072 1.5982 1.5893 1.5805 1.5718 1.5632 1.5548 .94
.06 1.5548 1.5464 1.5382 1.5301 1.5220 1.5141 1.5063 1.4985 1.4909 1.4833 1.4758 .93
.07 1.4758 1.4684 1.4611 1.4538 1.4466 1.4395 1.4325 1.4255 1.4187 1.4118 1.4051 .92
.08 1.4051 1.3984 1.3917 1.3852 1.3787 1.3722 1.3658 1.3595 1.3532 1.3469 1.3408 .91
.09 1.3408 1.3346 1.3285 1.3225 1.3165 1.3106 1.3047 1.2988 1.2930 1.2873 1.2816 .90
.10 1.2816 1.2759 1.2702 1.2646 1.2591 1.2536 1.2481 1.2426 1.2372 1.2319 1.2265 .89
.11 1.2265 1.2212 1.2160 1.2107 1.2055 1.2004 1.1952 1.1901 1.1850 1.1800 1.1750 .88
.12 1.1750 1.1700 1.1650 1.1601 1.1552 1.1503 1.1455 1.1407 1.1359 1.1311 1.1264 .87
.13 1.1264 1.1217 1.1170 1.1123 1.1077 1.1031 1.0985 1.0939 1.0893 1.0848 1.0803 .86
.14 1.0803 1.0758 1.0714 1.0669 1.0625 1.0581 1.0537 1.0494 1.0450 1.0407 1.0364 .85
.15 1.0364 1.0322 1.0279 1.0237 1.0194 1.0152 1.0110 1.0069 1.0027 0.9986 0.9945 .84
.16 0.9945 0.9904 0.9863 0.9822 0.9782 0.9741 0.9701 0.9661 0.9621 0.9581 0.9542 .83
.17 0.9542 0.9502 0.9463 0.9424 0.9385 0.9346 0.9307 0.9269 0.9230 0.9192 0.9154 .82
.18 0.9154 0.9116 0.9078 0.9040 0.9002 0.8965 0.8927 0.8890 0.8853 0.8816 0.8779 .81
.19 0.8779 0.8742 0.8705 0.8669 0.8633 0.8596 0.8560 0.8524 0.8488 0.8452 0.8416 .80
.20 0.8416 0.8381 0.8345 0.8310 0.8274 0.8239 0.8204 0.8169 0.8134 0.8099 0.8064 .79
.21 0.8064 0.8030 0.7995 0.7961 0.7926 0.7892 0.7858 0.7824 0.7790 0.7756 0.7722 .78
.22 0.7722 0.7688 0.7655 0.7621 0.7588 0.7554 0.7521 0.7488 0.7454 0.7421 0.7388 .77
.23 0.7388 0.7356 0.7323 0.7290 0.7257 0.7225 0.7192 0.7160 0.7128 0.7095 0.7063 .76
.24 0.7063 0.7031 0.6999 0.6967 0.6935 0.6903 0.6871 0.6840 0.6808 0.6776 0.6745 .75
.25 0.6745 0.6713 0.6682 0.6651 0.6620 0.6588 0.6557 0.6526 0.6495 0.6464 0.6433 .74
Table B.2. (Continued)
p .000 .001 .002 .003 .004 .005 .006 .007 .008 .009 .010
.26 0.6433 0.6403 0.6372 0.6341 0.6311 0.6280 0.6250 0.6219 0.6189 0.6158 0.6128 .73
.27 0.6128 0.6098 0.6068 0.6038 0.6008 0.5978 0.5948 0.5918 0.5888 0.5858 0.5828 .72
.28 0.5828 0.5799 0.5769 0.5740 0.5710 0.5681 0.5651 0.5622 0.5592 0.5563 0.5534 .71
.29 0.5534 0.5505 0.5476 0.5446 0.5417 0.5388 0.5359 0.5330 0.5302 0.5273 0.5244 .70
.30 0.5244 0.5215 0.5187 0.5158 0.5129 0.5101 0.5072 0.5044 0.5015 0.4987 0.4959 .69
.31 0.4959 0.4930 0.4902 0.4874 0.4845 0.4817 0.4789 0.4761 0.4733 0.4705 0.4677 .68
.32 0.4677 0.4649 0.4621 0.4593 0.4565 0.4538 0.4510 0.4482 0.4454 0.4427 0.4399 .67
.33 0.4399 0.4372 0.4344 0.4316 0.4289 0.4261 0.4234 0.4207 0.4179 0.4152 0.4125 .66
.34 0.4125 0.4097 0.4070 0.4043 0.4016 0.3989 0.3961 0.3934 0.3907 0.3880 0.3853 .65
.35 0.3853 0.3826 0.3799 0.3772 0.3745 0.3719 0.3692 0.3665 0.3638 0.3611 0.3585 .64
.36 0.3585 0.3558 0.3531 0.3505 0.3478 0.3451 0.3425 0.3398 0.3372 0.3345 0.3319 .63
.37 0.3319 0.3292 0.3266 0.3239 0.3213 0.3186 0.3160 0.3134 0.3107 0.3081 0.3055 .62
.38 0.3055 0.3029 0.3002 0.2976 0.2950 0.2924 0.2898 0.2871 0.2845 0.2819 0.2793 .61
.39 0.2793 0.2767 0.2741 0.2715 0.2689 0.2663 0.2637 0.2611 0.2585 0.2559 0.2533 .60
.40 0.2533 0.2508 0.2482 0.2456 0.2430 0.2404 0.2378 0.2353 0.2327 0.2301 0.2275 .59
.41 0.2275 0.2250 0.2224 0.2198 0.2173 0.2147 0.2121 0.2096 0.2070 0.2045 0.2019 .58
.42 0.2019 0.1993 0.1968 0.1942 0.1917 0.1891 0.1866 0.1840 0.1815 0.1789 0.1764 .57
.43 0.1764 0.1738 0.1713 0.1687 0.1662 0.1637 0.1611 0.1586 0.1560 0.1535 0.1510 .56
.44 0.1510 0.1484 0.1459 0.1434 0.1408 0.1383 0.1358 0.1332 0.1307 0.1282 0.1257 .55
.45 0.1257 0.1231 0.1206 0.1181 0.1156 0.1130 0.1105 0.1080 0.1055 0.1030 0.1004 .54
.46 0.1004 0.0979 0.0954 0.0929 0.0904 0.0878 0.0853 0.0828 0.0803 0.0778 0.0753 .53
.47 0.0753 0.0728 0.0702 0.0677 0.0652 0.0627 0.0602 0.0577 0.0552 0.0527 0.0502 .52
.48 0.0502 0.0476 0.0451 0.0426 0.0401 0.0376 0.0351 0.0326 0.0301 0.0276 0.0251 .51
.49 0.0251 0.0226 0.0201 0.0175 0.0150 0.0125 0.0100 0.0075 0.0050 0.0025 0.0000 .50
.010 .009 .008 .007 .006 .005 .004 .003 .002 .001 .000 q
Table B.3. The Chi-Squared Distribution
ν\α 0.500 0.600 0.700 0.800 0.850 0.900 0.925 0.950 0.975 0.990 0.995 0.999 0.9995
1 0.455 0.708 1.074 1.642 2.072 2.706 3.170 3.841 5.024 6.635 7.879 10.83 12.12
2 1.386 1.833 2.408 3.219 3.794 4.605 5.181 5.991 7.378 9.210 10.60 13.82 15.20
3 2.366 2.946 3.665 4.642 5.317 6.251 6.905 7.815 9.348 11.34 12.84 16.27 17.73
4 3.357 4.045 4.878 5.989 6.745 7.779 8.496 9.488 11.14 13.28 14.86 18.47 20.00
5 4.351 5.132 6.064 7.289 8.115 9.236 10.01 11.07 12.83 15.09 16.75 20.52 22.11
6 5.348 6.211 7.231 8.558 9.446 10.64 11.47 12.59 14.45 16.81 18.55 22.46 24.10
7 6.346 7.283 8.383 9.803 10.75 12.02 12.88 14.07 16.01 18.48 20.28 24.32 26.02
8 7.344 8.351 9.524 11.03 12.03 13.36 14.27 15.51 17.53 20.09 21.95 26.12 27.87
9 8.343 9.414 10.66 12.24 13.29 14.68 15.63 16.92 19.02 21.67 23.59 27.88 29.67
10 9.342 10.47 11.78 13.44 14.53 15.99 16.97 18.31 20.48 23.21 25.19 29.59 31.42
11 10.34 11.53 12.90 14.63 15.77 17.28 18.29 19.68 21.92 24.72 26.76 31.26 33.14
12 11.34 12.58 14.01 15.81 16.99 18.55 19.60 21.03 23.34 26.22 28.30 32.91 34.82
13 12.34 13.64 15.12 16.98 18.20 19.81 20.90 22.36 24.74 27.69 29.82 34.53 36.48
14 13.34 14.69 16.22 18.15 19.41 21.06 22.18 23.68 26.12 29.14 31.32 36.12 38.11
15 14.34 15.73 17.32 19.31 20.60 22.31 23.45 25.00 27.49 30.58 32.80 37.70 39.72
16 15.34 16.78 18.42 20.47 21.79 23.54 24.72 26.30 28.85 32.00 34.27 39.25 41.31
17 16.34 17.82 19.51 21.61 22.98 24.77 25.97 27.59 30.19 33.41 35.72 40.79 42.88
18 17.34 18.87 20.60 22.76 24.16 25.99 27.22 28.87 31.53 34.81 37.16 42.31 44.43
19 18.34 19.91 21.69 23.90 25.33 27.20 28.46 30.14 32.85 36.19 38.58 43.82 45.97
20 19.34 20.95 22.77 25.04 26.50 28.41 29.69 31.41 34.17 37.57 40.00 45.31 47.50
21 20.34 21.99 23.86 26.17 27.66 29.62 30.92 32.67 35.48 38.93 41.40 46.80 49.01
22 21.34 23.03 24.94 27.30 28.82 30.81 32.14 33.92 36.78 40.29 42.80 48.27 50.51
23 22.34 24.07 26.02 28.43 29.98 32.01 33.36 35.17 38.08 41.64 44.18 49.73 52.00
24 23.34 25.11 27.10 29.55 31.13 33.20 34.57 36.42 39.36 42.98 45.56 51.18 53.48
25 24.34 26.14 28.17 30.68 32.28 34.38 35.78 37.65 40.65 44.31 46.93 52.62 54.95
26 25.34 27.18 29.25 31.79 33.43 35.56 36.98 38.89 41.92 45.64 48.29 54.05 56.41
27 26.34 28.21 30.32 32.91 34.57 36.74 38.18 40.11 43.19 46.96 49.64 55.48 57.86
28 27.34 29.25 31.39 34.03 35.71 37.92 39.38 41.34 44.46 48.28 50.99 56.89 59.30
29 28.34 30.28 32.46 35.14 36.85 39.09 40.57 42.56 45.72 49.59 52.34 58.30 60.73
30 29.34 31.32 33.53 36.25 37.99 40.26 41.76 43.77 46.98 50.89 53.67 59.70 62.16
Table B.3. (Continued)
ν\α 0.500 0.600 0.700 0.800 0.850 0.900 0.925 0.950 0.975 0.990 0.995 0.999 0.9995
31 30.34 32.35 34.60 37.36 39.12 41.42 42.95 44.99 48.23 52.19 55.00 61.10 63.58
32 31.34 33.38 35.66 38.47 40.26 42.58 44.13 46.19 49.48 53.49 56.33 62.49 65.00
33 32.34 34.41 36.73 39.57 41.39 43.75 45.31 47.40 50.73 54.78 57.65 63.87 66.40
34 33.34 35.44 37.80 40.68 42.51 44.90 46.49 48.60 51.97 56.06 58.96 65.25 67.80
35 34.34 36.47 38.86 41.78 43.64 46.06 47.66 49.80 53.20 57.34 60.27 66.62 69.20
36 35.34 37.50 39.92 42.88 44.76 47.21 48.84 51.00 54.44 58.62 61.58 67.99 70.59
37 36.34 38.53 40.98 43.98 45.89 48.36 50.01 52.19 55.67 59.89 62.88 69.35 71.97
38 37.34 39.56 42.05 45.08 47.01 49.51 51.17 53.38 56.90 61.16 64.18 70.70 73.35
39 38.34 40.59 43.11 46.17 48.13 50.66 52.34 54.57 58.12 62.43 65.48 72.05 74.73
40 39.34 41.62 44.16 47.27 49.24 51.81 53.50 55.76 59.34 63.69 66.77 73.40 76.09
45 44.34 46.76 49.45 52.73 54.81 57.51 59.29 61.66 65.41 69.96 73.17 80.08 82.88
50 49.33 51.89 54.72 58.16 60.35 63.17 65.03 67.50 71.42 76.15 79.49 86.66 89.56
60 59.33 62.13 65.23 68.97 71.34 74.40 76.41 79.08 83.30 88.38 91.95 99.61 102.7
70 69.33 72.36 75.69 79.71 82.26 85.53 87.68 90.53 95.02 100.4 104.2 112.3 115.6
80 79.33 82.57 86.12 90.41 93.11 96.58 98.86 101.9 106.6 112.3 116.3 124.8 128.3
90 89.33 92.76 96.52 101.1 103.9 107.6 110.0 113.1 118.1 124.1 128.3 137.2 140.8
100 99.33 102.9 106.9 111.7 114.7 118.5 121.0 124.3 129.6 135.8 140.2 149.4 153.2
120 119.3 123.3 127.6 132.8 136.1 140.2 143.0 146.6 152.2 159.0 163.6 173.6 177.6
150 149.3 153.8 158.6 164.3 168.0 172.6 175.6 179.6 185.8 193.2 198.4 209.3 213.6
200 199.3 204.4 210.0 216.6 220.7 226.0 229.5 234.0 241.1 249.4 255.3 267.5 272.4
Table B.4. Frequencies (f) for Values of the Number d of Circular Triads in Paired
Comparisons, and the Probability (P) That These Values Will Be Attained or Exceeded
Value of d | n = 8: f, P | n = 9: f, P | n = 10: f, P, P′
Table B.4. (Continued)
For the case when the number (n) of objects is ten, an approximate probability (P′) is given, based on the χ²
approximation.
Source: Kendall, M. G., Rank Correlation Methods, Charles Griffin & Co., London, 1962.
Notation: d = number of circular triads
n = number of objects
Table B.5. Agreement in Paired Comparisons. The Probability P That a Value of Σ (for
u) Will Be Attained or Exceeded, for m = 8, n = 2 to 8
Agreement in Paired Comparisons. The probability P that a value of Σ (for u) will be attained
or exceeded, for m = 4 and n = 2 to 6 (for n = 6 only values beyond the 1 per cent point are
given)
The probability P that a value of Σ (for u) will be attained or exceeded, for m = 5 and n = 2 to 5
Table B.5. (Continued)
The probability P that a value of Σ (for u) will be attained or exceeded, for m = 6 and n = 2
to 4
P 4 p P min P P
6 1.000 18 1.000 36 1.000 55 0.043 74 0.0412
7 0.688 19 0.969 37 0.999 56 0.029 75 0.0589
10 0.219 20 0.332 38 0.991 57 0.020 76 0.0549
15 0.031 21 0.626 39 0.959 58 0.016 77 0.0532
22 0.523 40 0.896 59 0.011 80 0.0658
23 0.468 41 0.822 60 0.0072 81 0.0417
24 0.303 42 0.755 61 0.0049 82 0.0612
26 0.180 43 0.669 62 0.0034 85 0.0734
27 0.147 44 0.556 63 0.0025 90 0.0893
28 0.088 45 0.466 64 0.0016
29 0.061 46 0.409 65 0.0383
30 0.040 47 0.337 66 0.0366
31 0.034 48 0.257 67 0.0348
32 0.023 49 0.209 68 0.0326
35 0.0062 50 0.175 69 0.0316
36 0.0029 51 0.133 70 0.0486
37 0.0020 52 0.097 71 0.0468
40 0.0358 53 0.073 72 0.0448
45 0.0431 54 0.057 73 0.0416
Table B.6. Between-Expert Agreement Table (Coefficient of Concordance)*
k   N = 3   N = 4   N = 5   N = 6   N = 7   Additional values for N = 3 (k, s)
*Adapted from M. Friedman, "A Comparison of Alternative Tests of Significance for the Problem of m Rankings," Ann.
Math. Statist., vol. 11, pp. 86-92, 1940, with the kind permission of the author and the publisher.
Notice that additional critical values of s for N = 3 are given in the right-hand column of this table.
Source: Siegel, S., Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill, New York, 1956.
Notation: N = number of objects
k = number of experts
References
CHAPTER 1
Brockhoff, K., "The Performance of Forecasting Groups in Computer Dialogue and Face to
Face Discussion," in H. A. Linstone and M. Turoff (eds.), The Delphi Method,
Techniques and Applications, Addison Wesley, Reading, Mass., 1975, pp. 291-321.
Dalkey, N., Brown, B., and Cochran, S., "Use of Self-Ratings to Improve Group Estimates,"
Technological Forecasting, vol. 1, no. 3, pp. 283-291, 1970.
Delbecq, A., Van de Ven, A., and Gustafson, D., Group Techniques for Program Planning,
Scott, Foresman, Glenview, Ill., 1975.
Federation of American Scientists, Public Interest Report, vol. 33, no. 8, October 1980.
Fischer, G., "An Experimental Study of Four Procedures for Aggregating Subjective
Probability Assessments," Technical Report 75-7, Decisions and Designs, Inc.,
McLean, Va., 1975.
Gofman, J., and Tamplin, A., Population Control Through Nuclear Pollution, Nelson-Hall
Co. Chicago, 1970.
Gough, R., "The Effect of Group Format on Aggregate Subjective Probability Distribu-
tions," in D. Wendt and C. Vlek (eds.), Utility, Probability and Human Decision
Making, Dordrecht, Reidel, 1975.
Gustafson, D., Shulka, R., Delbecq, A., and Walster, A., "A Comparative Study of
Differences in Subjective Likelihood Estimates Made by Individuals, Interacting
Groups, Delphi Groups, and Nominal Groups," Organizational Behaviour and
Human Performance, vol. 9, pp. 280-291, 1973.
Helmer, Olaf, "Analysis of the Future: The Delphi Method" and "The Delphi Method-An
Illustration," in J. Bright (ed.), Technological Forecasting for Industry and Government,
Prentice-Hall, Englewood Cliffs, N.J., 1968, pp. 116-134.
Helmer, Olaf, Social Technology, Basic Books, New York, 1966.
Kahn, Herman, On Thermonuclear War, Free Press, New York, 1960.
Kahn, Herman, and Wiener, Anthony J., The Year 2000, A Framework for Speculation,
Macmillan, New York, 1967.
Kevles, Daniel, The Physicists, Alfred Knopf, New York, 1978.
Linstone, H. A., and Turoff, M., The Delphi Method, Techniques and Applications, Addison
Wesley, Reading, Mass., 1975.
Mazur, Allen, "Opinion Pool Measurements of American Confidence in Science," Science,
Technology and Human Values, vol. 6, no. 36, pp. 16-19, 1981.
Newman, J. R., "Thermonuclear War," Scientific American, March 1961. See: Readings from
Sci. Amer. Science, Conflict and Society, W. H. Freeman, San Francisco, 1969, pp.
282-286.
Parente, F. J., and Anderson-Parente, J. K., "Delphi Inquiry Systems," in G. Wright and P.
Ayton (eds.), Judgmental Forecasting, Wiley, Chichester, 1987.
Reichenbach, H., The Rise of Scientific Philosophy, University of California Press, 1968; first
edition, 1951.
Sackman, H., Delphi Critique, Expert Opinion, Forecasting and Group Processes, Lexington
Books, Lexington, Mass., 1975.
Seaver, D., "How Groups Can Assess Uncertainty" Proc. Int. Conf. on Cybernetics and
Society, Wash. D.C., Sept. 19-21, 1977, pp. 185-190.
Science p. 171, March 5, 1971.
Toekomstonderzoek, suppl. 10, pp. 6.6.2-01-6.6.4-07, November 1974.
CHAPTER 2
Amendola, A., "Systems Reliability Benchmark Exercise Parts I and II," EUR-10696, EN/I,
1986.
American Physical Society, "Study Group on Light Water Reactor Safety; Report to the
American Physical Society," Review of Modern Physics, vol. 47, suppl. no. 1, 1975.
Beaver, W. M. Financial Reporting: An Accounting Revolution. Prentice Hall, Englewood
Cliffs, N. J., 1981.
Bell, T. E., and Esch, K., "The Space Shuttle: A Case of Subjective Engineering," IEEE
Spectrum, pp. 42-46, June 1989.
Bernreuter, D. L., Savy, J. B., Mensing, R. W., and Chung, D. H., "Seismic Hazard
Characterization of the Eastern United States: Methodology and Interim Results for
Ten Sites," NUREG/CR-3756, 1984.
Brune, R., Weinstein, M., and Fitzwater, M., "Peer Review Study of the Draft Handbook for
Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications,"
NUREG/CR-1278, Human Performance Technologies, Inc., Thousand Oaks, Calif.,
1983.
Christensen-Szalanski, J. J. J., and Beach, L. R., "The Citation Bias: Fad and Fashion in the
Judgment and Decision Literature," American Psychologist, vol. 39, pp. 75-78,1984.
Clemen, R., and Winkler, R., "Combining Economic Forecasts," J. of Business and Economic
Statistics, vol. 4, pp. 39-46, January 1986.
Colglazier, E. W., and Weatherwax, R. K., "Failure Estimates for the Space Shuttle,"
Abstracts for Society for Risk Analysis, Annual Meeting 1986, Boston, Mass., p. 80,
Nov. 9-12, 1986.
Cooke, R. "Problems with Empirical Bayes," Risk Analysis, vol. 6, no. 3, pp. 269-272,1986b.
Cooke, R., and Waij, R., "Monte Carlo Sampling for Generalized Knowledge Dependence
with Application to Human Reliability," Risk Analysis, vol. 6, no. 3, pp. 335-343,
1986.
Cottrell, W., and Minarick, C., "Precursors to Potential Severe Core Damage Accidents:
1980-1982, a Status Report," NUREG/CR-3591, 1984.
Covello, V. T., and Mumpower, J. "Risk Analysis and Risk Management: An Historical
Perspective," Risk Analysis, vol. 5, no. 2, pp. 103-120, 1985.
Dalrymple, G. J., and Willows, M., "DoE Disposal Assessments, vol. 4: Expert Elicitation,"
TR-DR3-4, Yard, London, July 1990.
Electric Power Research Institute, "Seismic Hazard Methodology for the Central and Eastern
United States," vol. 1, Methodology, NP-4/26, 1986.
Environmental Protection Agency, Reactor Safety Study Oversight Hearings Before the
Subcommittee on Energy and the Environment of the Committee on Interior and Insular
Affairs. House of Representatives, 94th Congress, second session, serial no. 94—61,
Washington, D.C., June 11, 1976.
Feynman, R. P., "Mr. Feynman Goes to Washington," Engineering and Science, pp. 6-22,
Fall 1987.
Flavin, C., "Electricity's Future: The Shift to Efficiency and Small Scale Power," WorldWatch
paper 61, 1984; reprinted in Bull. Sci. Tech. Soc., vol. 5, 55-103, 1985.
Granger, C. W. J., Forecasting in Business and Economics, Academic Press, New York, 1980.
Granger Morgan, M., and Henrion, M., Uncertainty; a Guide to Dealing with Uncertainty in
Quantitative Risk and Policy Analysis, Department of Engineering and Public Policy,
Carnegie Mellon University, 1988.
Granger Morgan, M., Amaral, D., Henrion, M., and Morris, S., "Technological Uncertainty
in Quantitative Policy Analysis—A Sulfur Pollution Example," Risk Analysis, vol.
3, pp. 201-220, 1984.
Hofer, E., Javeri, V., and Loffler, H., "A Survey of Expert Opinion and Its Probabilistic
Evaluation for Specific Aspects of the SNR-300 Risk Study," Nuclear Technology, vol.
68, pp. 180-225, 1985.
Bonano, E. J., Hora, S. C., Keeney, R. L., and Von Winterfeldt, D., Elicitation and Use of
Expert Judgment in Performance Assessment for High-level Radioactive Waste
Repositories, NUREG/CR-5411, Washington, D.C., May 1990.
Humphreys, P., Human Reliability Assessors Guide, Safety and Reliability Directorate,
United Kingdom Atomic Energy Authority, 1988.
IEEE, IEEE Guide to the Collection and Presentation of Electrical, Electronic and Sensing
Component Reliability Data of Nuclear Power Generation Stations, IEEE st-500,1977.
Jungermann, H., and Thuring, M., "The Use of Mental Models for Generating Scenarios," in
G. Wright and P. Ayton (eds.), Judgmental Forecasting, Wiley, New York, 1987.
Kaplan, S., and Garrick, B., "On the Quantitative Definition of Risk," Risk Analysis, vol. 1,
pp. 11-27, 1981.
Kemeny J., Report of the President's Commission on the Accident at Three Mile Island,
Washington, D.C., 1979.
Kok, M., "Multiple Objective Energy Modeling: Experimental Results with Interactive
Methods in The Netherlands," Ph.D. thesis, Department of Mathematics, Delft
University of Technology, report 85-49, 1985.
Lee, Y. T., Okrent, D., and Apostolakis, G., "A Comparison of Background Seismic Risks
and the Incremental Seismic Risk Due to Nuclear Power Plants," Nuclear Engineer-
ing & Design, vol. 53, pp. 141-154, 1979.
Levine, S., and Rasmussen, N, "Nuclear Plant PRA: How Far Has It Come," Risk Analysis,
vol. 4, 247-255, 1984.
Lewis, H. W., Budnitz, R. J., Kouts, H. J. C., Lowenstein, W. B., Rowe, W. D., Von Hippel, F.,
and Zachariasen, F., Risk Assessment Review Group Report to the U.S. Nuclear
Regulatory Commission, NUREG/CR-0400, 1979.
Merkhofer, M., and Keeney, R., "A Multiattribute Utility Analysis of Alternative Sites for
the Disposal of Nuclear Waste," Risk Analysis, vol. 7, no. 2, pp. 173-194, 1987.
Minarick, J., and Kukielka, C., "Precursors to Potential Severe Core Damage Accidents:
1969-1979," NUREG/CR-2497, 1982.
Morris, J. M., and D'Amore, R. J., "Aggregating and Communicating Uncertainty," Pattern
Analysis and Recognition Corp., 228 Liberty Plaza, Rome, New York, 1980.
Mosleh, A., Bier, V., and Apostolakis, G., "Critique of Current Practice for the Use of Expert
Opinions in Probabilistic Risk Assessment," Reliability Engineering and System
Safety, vol. 20, pp. 63-85, 1988.
Mosleh, A., Bier, V. M., and Apostolakis, G., "Methods for the Elicitation and Use of Expert
Opinion in Risk Assessment," NUREG/CR-4962, 1987.
Office of Nuclear Regulatory Research, "Reactor Risk Reference Document," NUREG-
1150, 1987.
Okrent, D., "A Survey of Expert Opinion on Low Probability Earthquakes," in Annals of
Nuclear Energy, Pergamon Press, pp. 601-614, 1975.
Poucet, A., "The European Benchmark Exercise on Human Reliability Analysis," Proceed-
ings PSA '89, International Topical Meeting on Probability, Reliability and Safety
Assessment, Pittsburgh, pp. 103-110, April 2-7, 1989.
Preyssl, C., and Cooke, R., "Expert Judgment; Subjective and Objective Data for Risk
Analysis of Spaceflight Systems," Proceedings PSA '89, International Topical Meeting
on Probability, Reliability and Safety Assessment, Pittsburgh, pp. 603-612, April 2-7,
1989.
Rogovin, M., and Frampton, G. T., Three Mile Island, a Report to the Commissioners and to
the Public, Government Printing Office, 1980.
Samet, M. G., "Quantitative Interpretation of Two Qualitative Scales Used to Rate Military
Intelligence," Human Factors, vol. 17, no. 2, pp. 192-202, 1975.
Shooman, M., and Sinkar, S., "Generation of Reliability and Safety Data by Analysis of
Expert Opinion," Proc. 1977 Annual Reliability and Maintainability Symposium, pp.
186-193, 1977.
Snaith, E. R., "The Correlation Between the Predicted and Observed Reliabilities of
Components, Equipment, and Systems," National Center of Systems Reliability,
U.K. Atomic Energy Authority, NCSR R18, 1981.
Sui, N., and Apostolakis, G., "Combining Data and Judgment in Fire Risk Analysis," 8th
International Conference on Structural Mechanics in Reactor Technology, Brussels,
Belgium, August 26-27, 1985.
Swain, A., and Guttman, H., Handbook of Human Reliability, Analysis with Emphasis on
Nuclear Power Plant Applications, NUREG/CR-1278, 1983.
Union of Concerned Scientists, "The Risks of Nuclear Power Reactors: A Review of the
NRC Reactor Safety Study," WASH-1400, 1977.
U.S. AEC, "Theoretical Possibilities and Consequences of Major Accident in Large Nuclear
Power Plants," U.S. Atomic Energy Commission, WASH-740, 1957.
U.S. NRC, PRA Procedures Guide, U.S. Nuclear Regulatory Commission, NUREG/CR-
2300, 1983.
U.S. NRC, "Nuclear Regulatory Commission Issues Policy Statement on Reactor Safety
Study and Review by the Lewis Panel," NRC Press Release, no. 79-19, January 19,
1979.
U.S. NRC, "Reactor Safety Study," U.S. Nuclear Regulatory Commission, WASH-1400,
NUREG-751014, 1975.
Vlek, C., "Rise, Decline and Aftermath of the Dutch 'Societal Discussion on (Nuclear)
Energy Policy (1981-1983)'," in H. A. Becker and A. Porter (eds.), Impact Assessment
Today, Van Arkel, Utrecht, 1986.
Vlek, C., and Otten, W., "Judgmental Handling of Energy Scenarios: A Psychological
Analysis and Experiment," in G. Wright and P. Ayton (eds.), Judgmental Forecasting,
Wiley, New York, 1987.
Wheeler, T. A., Hora, S. C., Cramond, W. R., and Unwin, S. D., "Analysis of Core Damage
Frequency from Internal Events: Expert Judgment Elicitation," NUREG/CR-4550,
vol. 2, Sandia National Laboratories, 1989.
Wiggins, J., "ESA Safety Optimization Study," Hernandez Engineering, HEI-685/1026,
Houston, Texas, 1985.
Woo, G., "The Use of Expert Judgment in Risk Assessment; Draft Report for Her Majesty's
Inspectorate of Pollution," Yard, London, October 1990.
CHAPTER 3
Adams, J. B., "A Probability Model of Medical Reasoning and the MYCIN Model,"
Mathematical Biosciences, vol. 32, pp. 177-186, 1976.
Carnap, R., Logical Foundations of Probability, Routledge and Kegan Paul, Chicago, 1950.
Cendrowska, J., and Bramer, M., "Inside an Expert System: A Rational Reconstruction of
the MYCIN Consultation System," in O'Shea T. and Eisenstadt, M. (eds.), Artificial
Intelligence, Tools Techniques and Applications, Harper and Row, New York, 1984,
pp. 453-497.
Cooke, R. M., "Probabilistic Reasoning in Expert Systems Reconstructed in Probability
Semantics," Philosophy of Science Association, vol. 1, pp. 409—421, 1986.
Dubois, D., and Prade, H., Fuzzy Sets and Systems: Theory and Applications, Academic
Press, New York, 1980.
French, S., "Fuzzy Sets: The Unanswered Questions," Manchester-Sheffield School of
Probability and Statistics Research Report, November 1987.
French, S., "Fuzzy Decision Analysis, Some Problems," in Zimmermann, H., Zadeh, L., and
Gains (eds.), Fuzzy Sets and Decision Analysis, Elsevier North-Holland, Amsterdam,
1984.
Gordon, J., and Shortliffe, E., "The Dempster-Shafer Theory of Evidence," in E. Shortliffe
and B. Buchanan (eds.), Rule Based Expert Systems, Reading, Mass., 1984.
Johnson, R. W., "Independence and Bayesian Updating Methods," in L. Kanal and J.
Lemmer (eds.), Uncertainty in Artificial Intelligence, Elsevier North Holland, Amster-
dam, 1986.
Los, J., "Semantic Representation of the Probability of Formulas in Formalized Theories,"
Studia Logica, vol. 14, pp. 183-196, 1963.
Shortliffe, E., and Buchanan, B., Rule Based Expert Systems, Reading, Mass., 1984.
Shortliffe, E., and Buchanan, B., "A Model of Inexact Reasoning in Medicine," Mathematical
Biosciences, vol. 23, pp. 351-379, 1975.
Stefik, M., "Strategic Computing at DARPA: Overview and Assessment," Comm. of the
ACM, July, vol. 28, no. 7, pp. 690-704, July 1985.
Szolovits, P., and Pauker, S., "Categorical and Probabilistic Reasoning in Medical
Diagnoses," Artificial Intelligence, vol. 11, pp. 115-144, 1978.
Yu, V. L., Fagan, L., Wraith, S., Clancey, W., Scott, A., Hannigan, J., Blum, R., Buchanan, B.,
Cohen, S., "Antimicrobial Selection by a Computer," JAMA, vol. 242, no. 12, pp.
1279-1282, Sept. 21, 1979.
Zadeh, L., "Is Probability Theory Sufficient for Dealing with Uncertainty in AI: A Negative
View," in Kanal, L. and Lemmer, J. (eds.), Uncertainty in Artificial Intelligence,
Elsevier North Holland, Amsterdam, 1986, pp. 103-116.
Zadeh, L., "Probability Measures of Fuzzy Events," J. Math. Anal. Appl., vol. 23, pp. 421-
427, 1968.
Zimmermann, H., "Fuzzy set theory and inference mechanism," in G. Mitra (eds),
Mathematical Models for Decision Support, Springer-Verlag, Berlin pp. 727-743,
1987.
CHAPTER 4
Apostolakis, G., "The Broadening of Failure Rate Distributions in Risk Analysis: How
Good Are the Experts?" Risk Analysis, vol. 5, no. 2, pp. 89-95, 1985.
Christensen-Szalanski, J., and Bushyhead, J., "Physicians; Use of Probabilistic Information
in a Real Clinical Setting," Journal of Experimental Psychology: Human Perception
and Performance, vol. 7, pp. 928-935, 1981.
Cooke, R., "Problems with Empirical Bayes," Risk Analysis, vol. 6, no. 3, pp. 269-272,1986.
Eddy, D., "Probabilistic Reasoning in Clinical Medicine," in D. Kahneman, P. Slovic, and A.
Tversky (eds.), Judgment under Uncertainty, Cambridge University Press, Cam-
bridge, 1982.
Hynes, M. E., and Vanmarcke, E. H., "Reliability of Embankment Performance Predictions,"
in Mechanics in Engineering, Proc. 1st ASCE-EMD Specialty Conf. University of
Waterloo, May 26-28, 1976, pp. 367-384.
Kahneman, D., Slovic, P., and Tversky, A., (eds.) Judgment under Uncertainty, Heuristics and
Biases, Cambridge University Press, Cambridge, 1982.
Langer, E., "The Illusion of Control," The Journal of Personality and Social Psychology, vol.
32 pp. 311-328, 1975.
Lichtenstein, S., Fischhoff, B., and Phillips, L., "Calibration of Probabilities: The State of the
Art to 1980", in D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under
Uncertainty, pp. 306-334.
Martz, H., "Response to 'Problems with Empirical Bayes'," Risk Analysis, vol. 6, no. 3, pp.
273-274, 1986.
Martz, H., "On Broadening Failure Rate Distributions in PRA Uncertainty Analysis," Risk
Analysis, vol. 4, no. 1, pp. 15-23, 1984.
Martz, H., and Bryson, M., "On Combining Data for Estimating the Frequency of Low-
Probability Events with Application to Sodium Value Failure Rates," Nuclear
Science and Engineering, vol. 83, pp. 267-280, 1983.
Murphy, A., and Daan, H., "Impacts of Feedback and Experience on the Quality of
Subjective Probability Forecasts: Comparison of Results from the First and Second
Years of the Zierikzee Experiment," Monthly Weather Review, vol. 112, pp. 413-423,
1984.
Murphy, A., and Winkler, R., "Can Weather Forecasters Formulate Reliable Probability
Forecasts of Precipitation and Temperature?" National Weather Digest, vol. 2, pp. 2-
9, 1977.
Oskamp, S., "Overconfidence in case-study judgments," in D. Kahneman, P. Slovic, and A.
Tversky (eds.), Judgment under Uncertainty, Cambridge University Press, Cam-
bridge, 1982, pp. 287-293.
Slovic, P., Fischhoff, B., and Lichtenstein, S., "Facts versus Fears: Understanding Perceived
Risk," in D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under Uncertainty,
Cambridge University Press, Cambridge, 1982.
Tversky, A., and Kahneman, D., "Availability: A Heuristic for Judging Frequency and
Probability," in D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under
Uncertainty, Cambridge University Press, Cambridge, 1982a.
Tversky, A., and Kahneman, D., "Causal Schemas in Judgments Under Uncertainty," in D.
Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under Uncertainty, Cambridge
University Press, Cambridge, 1982b.
Tversky, A., and Kahneman, D., "Judgment under Uncertainty: Heuristics and Biases," in D.
Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under Uncertainty, Cambridge
University Press, Cambridge, 1982c.
Thys, W., "Fault Management," Ph.D. dissertation, Delft University of Technology, Delft,
1987.
CHAPTER 5
Amendola, A., "Systems Reliability Benchmark Exercises Parts I and II," EUR-10696, EN/I,
1986.
Apostolakis, G., "The Broadening of Failure Rate Distributions in Risk Analysis: How
Good Are the Experts?" Risk Analysis, vol. 5, no. 2, 89-95, 1985.
Apostolakis, G. (ed.), Reliability Engineering and System Safety, vol. 23, no. 4, 1988.
Bernreuter, D. L., Savy, J. B., Mensing, R. W., and Chung, D. H., "Seismic Hazard
Characterization of the Eastern United States: Methodology and Interim Results for
Ten Sites," NUREG/CR-3756, 1984.
Brockhoff, K., "The Performance of Forecasting Groups in Computer Dialogue and Face to
Face Discussion," in H. A. Linstone and M. Turoff (eds.), The Delphi Method,
Techniques and Applications, Addison Wesley, Reading, Mass., 1975, pp. 291-321.
Dalkey, N., Brown, B., and Cochran, S., "Use of Self-Ratings to Improve Group Estimates,"
Technological Forecasting, vol. 1, no. 3, pp. 283-291, 1970.
Hofer, E., Javeri, V., and Loffler, H., "A Survey of Expert Opinion and Its Probabilistic
Evaluation for Specific Aspects of the SNR-300 Risk Study," Nuclear Technology, vol.
68, pp. 180-225, 1985.
Lewis, H. W. et al., Risk Assessment Review Group Report to the U.S. Nuclear Regulatory
Commission, NUREG/CR-0400, 1979.
Linstone, H. A., and Turoff, M., The Delphi Method, Techniques and Applications, Addison
Wesley, Reading, Mass., 1975.
Morris, J. M., and D'Amore, R. Y., "Aggregating and Communicating Uncertainty," Pattern
Analysis and Recognition Corp., 228 Liberty Plaza, Rome, New York, 1980.
Office of Nuclear Regulatory Research, "Reactor Risk Reference Document," NUREG-
1150, 1987.
Poucet, A., Amendola, A., and Cacciabue, P. C., "CCF-RBE Common Cause Failure
Reliability Benchmark Exercises," EUR-11054, EN, 1987.
Sackman, H., Delphi Critique, Expert Opinion, Forecasting and Group Processes, Lexington
Books, Lexington, Mass., 1975.
U.S. NRC, "Reactor Safety Study," U.S. Nuclear Regulatory Commission, WASH-1400,
NUREG-751014, 1975.
CHAPTER 6
Allais, M., "The So-Called Allais Paradox and Rational Decisions Under Uncertainty," in
M. Allais and O. Hagen (eds.), The Expected Utility Hypothesis and the Allais
Paradox, Reidel, Dordrecht, 1979, pp. 437-683.
Allais, M., "Le comportement de l'homme rationnel devant le risque," Econometrica, vol. 21,
pp. 503-546, 1953.
Balch, M., and Fishburn, P., "Subjective Expected Utility for Conditional Primitives," in M.
Balch, D. McFadden, and S. Wu (eds.), Essays on Economic Behavior Under
Uncertainty, North-Holland, Amsterdam, 1974, pp. 57-69.
Balch, M., McFadden, D., and Wu, S. (eds.), Essays on Economic Behavior Under Uncertainty,
North-Holland, Amsterdam, 1974.
Cooke, R., "Conceptual Fallacies in Subjective Probability," Topoi, vol. 5, pp. 21-27, 1986.
Cooke, R., "A Result in Renyi's Theory of Conditional Probability with Application to
Subjective Probability," Journal of Philosophical Logic, vol. 12, 1983.
Ellsberg, D., "Risk, Ambiguity and the Savage Axioms," Quarterly Journal of Economics, vol.
75, pp. 643-669, 1961.
Hogarth, R., Judgement and Choice, Wiley, New York, 1987.
Jeffrey, R., The Logic of Decision, McGraw-Hill, New York, 1966.
Kahneman, D., and Tversky, A., "Prospect Theory," Econometrica, vol. 47, no. 2, 1979.
Kraft, C. H., Pratt, J. W., and Seidenberg, A., "Intuitive Probability on Finite Sets," Annals of
Mathematical Statistics, vol. 30, pp. 408-419, 1959.
Krantz, D., Luce, R., Suppes, P., and Tversky, A., Foundations of Measurement, vol. 1,
Academic Press, New York, 1971.
Luce, R., and Krantz, D., "Conditional Expected Utility," Econometrica, vol. 39, no. 2, pp.
253-271, 1971.
MacCrimmon, K. R., "Descriptive and Normative Implications of the Decision-Theory
Postulates," in K. Borch and J. Mossin (eds.), Risk and Uncertainty, MacMillan,
London, 1968, pp. 3-24.
Machina, M., "'Rational' Decision Making versus 'Rational Decision Modelling'," Journal
of Mathematical Psychology, vol. 24, pp. 163-175, 1981.
Pfanzagl, J., Theory of Measurement, Physica-Verlag, Würzburg-Wien, 1968.
Ramsey, F., "Truth and Probability," in R. B. Braithwaite (ed.), The Foundations of
Mathematics, Kegan Paul, London, 1931, pp. 156-198.
Savage, L., The Foundations of Statistics, Dover, New York, 1972, first published by John
Wiley & Sons, 1954.
Shafer, G., "Savage Revisited," Statistical Science, vol. 1, no. 4, pp. 463-501, 1986.
Tversky, A., "Intransitivity of Preference," Psychological Review, vol. 76, no. 1, pp. 31-48,
1969.
Villegas, C., "On Qualitative Probability σ-Algebras," Annals of Math. Stat., vol. 35, no. 4,
pp. 1787-1796, 1964.
CHAPTER 7
Aldous, D., "Exchangeability and Related Topics," in Lecture Notes in Mathematics, vol.
1117, Springer-Verlag, Berlin, 1985.
Berman, S. M., "Stationarity, Isotropy and Sphericity in lp," Z. Wahr. ver. Geb., vol. 54, pp.
21-23, 1980.
Box, G., and Tiao, G., Bayesian Inference in Statistical Analysis, Addison-Wesley, Reading,
Mass., 1973.
Cooke, R., and Misiewicz, J., "lp-Invariant Probability Measures," Delft University of
Technology, Department of Mathematics, report 88-91, 1988.
Cooke, R., Misiewicz, J., and Mendel, M., "Applications of lp-Symmetric Measures to
Bayesian Inference," in W. Kasprzak and A. Weron (eds.), Stochastic Methods in
Experimental Sciences, World Scientific, Singapore, 1990.
De Finetti, B., Theory of Probability, Wiley, New York, 1974.
De Finetti, B., "La Prevision; ses lois logique, ses source subjectives," Annales de Elnstut
Henri Poincare, vol. 7, pp. 1-68, 1937. English translation in H. Kyburg and H.
Smokier (eds.), Studies in Subjective Probability, Wiley, New York, 1964.
Heath, D., and Sudderth, W., "De Finetti's Theorem on Exchangeable Variables," The
Amer. Statis., vol. 30, no. 4, pp. 188-189, 1975.
Hewitt, E., and Savage, L. J., "Symmetric Measures on Cartesian Products," Trans. Am. Math.
Soc., vol. 80, pp. 470-501, 1955.
Schoenberg, I. J., "Metric Spaces and Completely Monotonic Functions," Ann. Math., vol. 39,
pp. 811-841, 1938.
Tucker, H., A Graduate Course in Probability, Academic Press, New York, 1967.
CHAPTER 8
Alpert, M., and Raiffa, H., "A Progress Report on the Training of Probability Assessors," in
D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under Uncertainty,
Heuristics and Biases, Cambridge University Press, Cambridge, 1982, pp. 294-306.
Brier, G. "Verification of Forecasts Expressed in Terms of Probability," Man. Weath. Rev.,
vol. 75, pp. 1-3, 1950.
Cooke, R., Mendel, M., and Thys, W., "Calibration and Information in Expert Resolution; a
Classical Approach," Automatica, vol. 24, pp. 87-94, 1988.
De Finetti, B., "La prevision: ses lois logique, ses source subjectives," Annales de EInstut
Henri Poincare, vol. 7, pp. 1-68, 1937. English translation in H. Kyburg and H.
Smokier (eds.), Studies in Subjective Probability, Wiley, New York, 1964.
De Groot, M. H., Optimal Statistical Decisions, McGraw-Hill, New York, 1970.
ESRRDA, "Expert Judgment in Risk and Reliability Analysis; Experiences and
Perspective," ESRRDA project group "Expert Judgment," (draft report), 1989.
Galanter, E., "The Direct Measurement of Utility and Subjective Probability," Amer. J. of
Psych., vol. 75, pp. 208-220, 1962.
Hoel, P., Introduction to Mathematical Statistics, Wiley, New York, 1971.
Kullback, S., Information Theory and Statistics, Wiley, New York, 1959.
Lichtenstein, S., and Fischhoff, B., "Do Those Who Know More Also Know More About
How Much They Know?" Orgl. Behavior Human Perform., vol. 20, pp. 159-183, 1977.
Lichtenstein, S., Fischhoff, B., and Phillips, D., "Calibration of Probabilities: The State of the
Art to 1980," in D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under
Uncertainty, Heuristics and Biases, Cambridge University Press, Cambridge, 1982,
pp. 306-335.
Lindley, D., Introduction to Probability and Statistics from a Bayesian Viewpoint, Cambridge
University Press, Cambridge, 1970.
Murphy, A., "A New Vector Partition of the Probability Score," J. Appl. Met., vol. 12, pp.
595-600, 1973.
Preyssl, C., and Cooke, R., "Expert Judgment: Subjective and Objective Data for Risk
Analysis of Spaceflight Systems," Proceedings PSA '89 International Topical
Meeting Probability, Reliability and Safety Assessment, Pittsburgh, April 2-7, 1989.
Ramsey, F., "Truth and Probability," in Braithwaite (ed.), The Foundations of Mathematics,
Kegan Paul, London, 1931, pp. 156-198.
von Winterfeldt, D., "Eliciting and Communicating Expert Judgments: Methodology and
Application to Nuclear Safety," Joint Research Centre, Commission of the European
Communities, 1989.
Wheeler, T., Hora, S., Cramond, W., and Unwin, S., "Analysis of Core Damage Frequency
from Internal Events: Expert Judgment Solicitation," NUREG/CR-4550, vol. 2, U.S.
Nuclear Regulatory Commission, 1989.
Winkler, R., "Scoring Rules and the Evaluation of Probability Assessors," J. Amer. Statist.
Assoc., vol. 64, pp. 1073-1078, 1969.
CHAPTER 9
Bayarri, M., and De Groot, M., "Gaining Weight: A Bayesian Approach," Dept. of Statistics,
Carnegie Mellon University, Tech. Report 388, January 1987.
De Groot, M., and Fienberg, S., "Comparing Probability Forecasters: Basic Binary
Concepts and Multivariate Extensions," in P. Goel and A. Zellner (eds.), Bayesian
Inference and Decision Techniques, Elsevier, New York, 1986.
De Groot, M., and Fienberg, S., "The Comparison and Evaluation of Forecasters,"
The Statistician, vol. 32, pp. 12-22, 1983.
Friedman, D., "Effective Scoring Rules for Probabilistic Forecasts," Management Science,
vol. 29, no. 4, pp. 447-454, 1983.
Honglin, W., and Duo, D., "Reliability Calculation of Prestressing Cable System of the
PCPV Model," Trans. of the 8th Intern. Conf. on Structural Mech. in Reactor
Techn., Brussels, Aug. 19-23, 1985, pp. 41-44.
Lichtenstein, S., Fischhoff, B., and Phillips, D., "Calibration of Probabilities: The State of the
Art to 1980," in D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under
Uncertainty: Heuristics and Biases, Cambridge University Press, Cambridge, 1982,
pp. 306-335.
Matheson, J., and Winkler, R., "Scoring Rules for Continuous Probability Distributions,"
Management Science, vol. 22, no. 10, pp. 1087-1096, 1976.
McCarthy, J., "Measures of the Value of Information," Proc. of the National Academy of
Sciences, 1956, pp. 654-655.
Murphy, A., "A New Vector Partition of the Probability Score," J. of Applied Meteorology,
vol. 12, pp. 595-600, 1973.
Roberts, H., "Probabilistic Prediction," J. Amer. Statist. Assoc., vol. 60, pp. 50-62, 1965.
Savage, L., "Elicitation of Personal Probabilities and Expectations," J. Amer. Statis. Assoc.,
vol. 66, no. 336, pp. 783-801, 1971.
Shuford, E., Albert, A., and Massengill, H., "Admissible Probability Measurement
Procedures," Psychometrika, vol. 31, pp. 125-145, 1966.
Stael von Holstein, C., "Measurement of Subjective Probability," Acta Psychologica, vol. 34,
pp. 146-159, 1970.
Tucker, H., A Graduate Course in Probability, Academic Press, New York, 1967.
Wagner, C., and Lehrer, K., Rational Consensus in Science and Society, Reidel, Dordrecht,
1981.
Winkler, R., "On Good Probability Appraisers," in P. Goel and A. Zellner (eds.), Bayesian
Inference and Decision Techniques, Elsevier, New York, 1986.
Winkler, R., "Scoring Rules and the Evaluation of Probability Assessors," J. Amer.
Statist. Assoc., vol. 64, pp. 1073-1078, 1969.
Winkler, R., and Murphy, A., "Good Probability Assessors," J. of Applied Meteorology, vol.
7, pp. 751-758, 1968.
CHAPTER 10
Adams, J. K., and Adams, P. A., "Realism of Confidence Judgment," Psychol. Rev., vol. 68,
pp. 33-45, 1961.
Alpert, M., and Raiffa, H., "A Progress Report on the Training of Probability Assessors," in
D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under Uncertainty,
Heuristics and Biases, Cambridge University Press, Cambridge, 1982, pp. 294-306.
Bhola, B., Blaauw, H., Cooke, R., and Kok, M., "Expert Opinion in Project Management,"
appearing in European Journal of Operational Research, 1991.
Cooke, R., Mendel, M., and Thys, W., "Calibration and Information in Expert Resolution; a
Classical Approach," Automatica, vol. 24, pp. 87-94, 1988.
Hodges, J. L., and Lehmann, E. L., Basic Concepts of Probability and Statistics, Holden-Day,
San Francisco, 1970.
Lichtenstein, S., and Fischhoff, B., "Do Those Who Know More Also Know More About
How Much They Know?" Orgl. Behavior Human Perform., vol. 20, pp. 159-183, 1977.
Lichtenstein, S., Fischhoff, B., and Phillips, D., "Calibration of Probabilities: The State of the
Art to 1980," in D. Kahneman, P. Slovic, and A. Tversky (eds.), Judgment under
Uncertainty, Heuristics and Biases, Cambridge University Press, Cambridge, 1982,
pp. 306-335.
Sieber, J., "Effects of Decision Importance on Ability to Generate Warranted Subjective
Uncertainty," J. Personality Social Psychol., vol. 30, pp. 688-694, 1974.
Siegel, S., Nonparametric Statistics, McGraw-Hill, New York, 1956.
CHAPTER 11
Blanchard, P., Mitchell, M., and Smith, R., "Likelihood-of-Accomplishment Scale of Man-
Machine Activities," Dunlop and Associates, Inc., Santa Monica, CA, 1966.
Bernreuter, D. L., Savy, J. B., Mensing, R. W., and Chung, D. H., "Seismic Hazard
Characterization of the Eastern United States: Methodology and Interim Results for
Ten Sites," NUREG/CR-3756, 1984.
Bradley, R., "Some Statistical Methods in Taste Testing and Quality Evaluation,"
Biometrics, vol. 9, pp. 22-38, 1953.
Clemen, R. T., and Winkler, R. L., "Calibrating and Combining Precipitation
Probability Forecasts," in R. Viertl (ed.), Probability and Bayesian Statistics, Plenum
Press, New York, 1987, pp. 97-110.
Comer, K., Seaver, D., Stillwell, W., and Gaddy, C., "Generating Human Reliability
Estimates Using Expert Judgment, vols. I and II," NUREG/CR-3688, 1984.
Cooke, R., "Problems with Empirical Bayes," Risk Analysis, vol. 6, no. 3, pp. 269-272,1986.
David, H. A., The Method of Paired Comparisons, Charles Griffin, London, 1963.
De Groot, M., "Reaching a Consensus," J. Amer. Statist. Assoc., vol. 69, pp. 118-121, 1974.
Embrey, D., and Kirwan, B., "A Comparative Evaluation of Three Subjective Human
Reliability Quantification Techniques," Proceedings of the Ergonomics Society's
Conference, K. Coombes (ed.), Taylor and Francis, London, 1983.
French, S., "Group Consensus Probability Distributions: A Critical Survey," in J. M.
Bernardo, M. H. De Groot, D. V. Lindley, and A. F. M. Smith (eds.), Bayesian
Statistics, Elsevier, North Holland, 1985, pp. 183-201.
Genest, C., and Zidek, J., "Combining Probability Distributions: A Critique and an
Annotated Bibliography," Statistical Science, vol. 1, no. 1, pp. 114-148, 1986.
Gokhale, D., and Press, S., "Assessment of a Prior Distribution for the Correlation
Coefficient in a Bivariate Normal Distribution," J. R. Statist. Soc. A, vol. 145, Part 2, pp.
237-249, 1982.
Hardy, G. H., Littlewood, J. E., and Polya, G., Inequalities, Cambridge University Press,
Cambridge, 1983, (first edition 1934).
Hofer, E., Javeri, V., and Loffler, H., "A Survey of Expert Opinion and Its Probabilistic
Evaluation for Specific Aspects of the SNR-300 Risk Study," Nuclear Technology, vol.
68, pp. 180-225, 1985.
Humphreys, P., Human Reliability Assessor's Guide, United Kingdom Atomic Energy
Authority, Warrington WA3 4NE, 1988.
Hunns, D., "Discussions Around a Human Factors Data-Base. An Interim Solution: The
Method of Paired Comparisons," in A. E. Green (ed.), High Risk Safety Technology,
Wiley, New York, 1982.
Hunns, D., and Daniels, B., "The Method of Paired Comparisons and the Results of the
Paired Comparisons Consensus Exercise," Proceedings of the 6th Advances in
Reliability Technology Symposium, vol. 1, NCSR R23, Culcheth, Warrington, 1980,
pp. 31-71.
Huseby, A. B., "Combining Opinions in a Predictive Case," presented at Third Valencia
International Meeting on Bayesian Statistics, Altea, Spain, June 1-5, 1987.
IEEE, IEEE Guide to the Collection and Presentation of Electrical, Electronic, and Sensing
Component Reliability Data for Nuclear Power Generation Stations, IEEE Std 500,
1977.
Kadane, J., Dickey, J., Winkler, R., Smith, W., and Peters, S., "Interactive Elicitation of
Opinion for a Normal Linear Model," J. Amer. Statist. Assoc., vol. 75, pp. 845-854,
1980.
Kirwan, B., Human Reliability Assessor's Guide, vols. 1 and 2, (DRAFT) Human Reliability
Associates, Dalton, 1987.
Laddaga, R., "Lehrer and the Consensus Proposal," Synthese, vol. 36, pp. 473-477, 1977.
Lehrer, K., and Wagner, C. G., Rational Consensus in Science and Society, Reidel, Dordrecht,
1981.
Lindley, D., "Reconciliation of Discrete Probability Distributions," in J. Bernardo, M. De
Groot, D. Lindley, and A. Smith (eds.), Bayesian Statistics 2, North Holland,
Amsterdam, 1985, pp. 375-390.
Lindley, D., "Reconciliation of Probability Distributions," Operations Research, vol. 31, no.
5, pp. 866-880, 1983.
Lindley, D., and Singpurwalla, N., "Reliability (and Fault Tree) Analysis Using Expert
Opinions," J. Amer. Statist. Assoc., vol. 81, no. 393, pp. 87-90, 1986.
Martz, H., "Reaction to 'Problems with Empirical Bayes'," Risk Analysis, vol. 6, no. 3, pp.
272-273, 1986.
McConway, K. J., "Marginalization and Linear Opinion Pools," J. Amer. Statist. Assoc., vol.
76, pp. 410-414, 1981.
Mosleh, A., and Apostolakis, G., "The Assessment of Probability Distributions from Expert
Opinions with an Application to Seismic Fragility Curves," Risk Analysis, vol. 6, no.
4, pp. 447-461, 1986.
Mosleh, A., and Apostolakis, G., "Models for the Use of Expert Opinions," presented at the
workshop on low-probability high-consequence risk analysis, Society for Risk
Analysis, Arlington, Va., June 1982.
Morris, P., "An Axiomatic Approach to Expert Resolution," Management Science, vol. 29,
pp. 24-32, 1983.
Morris, P., "Combining Expert Judgments: A Bayesian Approach," Management Science,
vol. 23, pp. 679-693, 1977.
Morris, P., "Decision Analysis Expert Use," Management Science, vol. 20, pp. 1233-1241,
1974.
Pontecorvo, A., "A Method of Predicting Human Reliability," Reliability and Maintenance,
vol. 4, 4th Annual Reliability and Maintainability Conference, pp. 337-342, 1965.
Raiffa, H., and Schlaifer, R., Applied Statistical Decision Theory, Harvard University, 1961.
Seaver, D., and Stillwell, W., "Procedures for Using Expert Judgment to Estimate Human
Error Probabilities in Nuclear Power Plant Operations," NUREG/CR-2743, 1983.
Swain, A., and Guttmann, H., "Handbook of Human Reliability Analysis with Emphasis on
Nuclear Power Plant Applications," NUREG/CR-1278, 1983.
Thurstone, L., "A Law of Comparative Judgment," Psychl. Rev., vol. 34, pp. 273-286, 1927.
Torgerson, W., Theory and Methods of Scaling, Wiley, New York, 1958.
Wagner, C. G., "Allocation, Lehrer Models, and the Consensus of Probabilities," Theory and
Decision, vol. 14, pp. 207-220, 1982.
Williams, J. C., "Validation of Human Reliability Assessment Techniques," Proceedings of
the 4th National Reliability Conference, Birmingham NEC, 6-8 July, 1983.
Winkler, R. L., "Combining Probability Distributions from Dependent Information
Sources," Management Science, vol. 27, pp. 479-488, 1981.
Winkler, R. L., "The Consensus of Subjective Probability Distributions," Management
Science, vol. 15, pp. B61-B75, 1968.
CHAPTER 13
CHAPTER 14
Bradley, R., "Some Statistical Methods in Taste Testing and Quality Evaluation," Biome-
trica, vol. 9, pp. 22-38, 1953.
Bradley, R., and Terry, M., "Rank Analysis of Incomplete Block Designs," Biometrica, vol.
39, pp. 324-345, 1952.
Comer, K., Seaver, D., Stillwell, W., and Gaddy, C., "Generating Human Reliability
Estimates Using Expert Judgment, vols. I and II," NUREG/CR-3688, 1984.
David, H. A., The Method of Paired Comparisons, Charles Griffin, London, 1963.
Ford, L., "Solution of a Ranking Problem from Binary Comparisons," Amer. Math.
Monthly, vol. 64, pp. 28-33, 1957.
Hanushek, E., and Jackson, J., Statistical Methods for Social Scientists, Academic Press, New
York, 1977.
Hunns, D., "Discussions Around a Human Factors Data-Base. An Interim Solution: The
Method of Paired Comparisons," in A. E. Green (ed.), High Risk Safety Technology,
Wiley, New York, 1982.
Kendall, M., Rank Correlation Methods, Charles Griffin & Co. Limited, London, 1962.
Kirwan, B. et al., Human Reliability Assessor's Guide, vols. 1 and 2, Human Reliability
Associates, 1987.
Mosteller, F., "Remarks on the Method of Paired Comparisons: I The Least Squares
Solution Assuming Equal Standard Deviations and Equal Correlations," Psychometrika,
vol. 16, no. 1, pp. 3-9, 1951a.
Mosteller, F., "Remarks on the Method of Paired Comparisons: II The Effect of an Aberrant
Standard Deviation When Equal Standard Deviations and Equal Correlations Are
Assumed," Psychometrika, vol. 16, no. 2, pp. 203-206, 1951b.
Mosteller, F., "Remarks on the Method of Paired Comparisons: III A Test of Significance
for Paired Comparisons When Equal Standard Deviations and Equal Correlations
Are Assumed," Psychometrika, vol. 16, no. 2, pp. 207-218, 1951c.
Siegel, S., Nonparametric Statistics, McGraw-Hill, New York, 1956.
Thurstone, L., "A Law of Comparative Judgment," Psychl. Rev., vol. 34, pp. 273-286, 1927.
Torgerson, W., Theory and Methods of Scaling, Wiley, New York, 1958.
CHAPTER 15
APPENDIX A
Borel, E., "Valeur pratique et philosophie des probabilités," in E. Borel (ed.), Traité du
Calcul des Probabilités, Gauthier-Villars, Paris, 1924.
Carnap, R., Logical Foundations of Probability, Univ. of Chicago Press, Chicago, 1950.
De Finetti, B., "La Prévision: ses lois logiques, ses sources subjectives," Annales de
l'Institut Henri Poincaré, vol. 7, pp. 1-68, 1937. English translation in H. Kyburg and
H. Smokler (eds.), Studies in Subjective Probability, Wiley, New York, 1964.
De Finetti, B., Theory of Probability, vols. I and II, Wiley, New York, 1974.
Feller, W., An Introduction to Probability Theory and its Applications, vol. 2, Wiley, New
York, 1971.
Keynes, J. M., Treatise on Probability, MacMillan, London, 1973 (first edition 1921).
Kullback, S., Information Theory and Statistics, Wiley, New York, 1959.
Laplace, P. S., "Probability and its Principles," in E. H. Madden (ed.), The Structure of
Scientific Theories, Routledge Kegan Paul, London, 1960, pp. 250-255. Translated
from sixth French edition of A Philosophical Essay on Probabilities.
Martin-Löf, P., "On the Notion of Randomness," in A. Kino and R. E. Vesley (eds.),
Intuitionism and Proof Theory, North-Holland, 1970, pp. 73-78.
Ramsey, F., "Truth and Probability," in R. B. Braithwaite (ed.), The Foundations of
Mathematics, Kegan Paul, London, 1931, pp. 156-198.
Reichenbach, H., "Axiomatik der Wahrscheinlichkeitsrechnung," Math. Z., vol. 34, pp. 568-
619, 1932.
Savage, L. J., The Foundations of Statistics, Wiley, New York, 1954. Second edition, Dover,
New York, 1972.
Schnorr, C. P., "Zufälligkeit und Wahrscheinlichkeit," Lecture Notes in Mathematics, no.
218, Springer-Verlag, Berlin, 1970.
van Lambalgen, M., "Random Sequences," Ph.D. dissertation, University of Amsterdam,
1987.
von Mises, R., Probability, Statistics and Truth, Dover, New York, 1981. Translation of 3rd
edition of Wahrscheinlichkeit, Statistik und Wahrheit, Springer, 1936.
von Mises, R., "Grundlagen der Wahrscheinlichkeitsrechnung," Math Z., vol. 5, pp. 52-99,
1919.
von Neumann, J., and Morgenstern, O., Theory of Games and Economic Behavior, Wiley,
New York, 1944.
Author Index
Tamplin, A., 5, 6
Terry, M., 217
Thurstone, L., 184, 214
Thys, W., 67, 70, 158, 159
Zadeh, L., 59
Zidek, J., 171
Zimmermann, H., 59, 61
Subject Index