"This is Watson"
D. A. Ferrucci
Introduction
The open-domain question-answering (QA) problem is
one of the most challenging in the realm of computer science
and artificial intelligence (AI). QA has had a long history
[1] and has seen considerable advancement over the past
decade [2, 3].
Jeopardy!** is a well-known television quiz show that
has been on air in the United States for more than 25 years.
It pits three human contestants against one another in a
competition that requires rapidly understanding and
answering rich natural-language questions, which are called
clues, over a very broad domain of topics, with stiff penalties
for wrong answers [4]. On January 14, 2011, at IBM
Research in Yorktown Heights, New York, IBM Watson*,
a computer, beat the two best Jeopardy! champions in a
real-time two-game competition. The historic match was
conducted and taped by Jeopardy Productions, Inc. and was
nationally televised over three nights on February 14–16, 2011.
The fact that a computer beat the best human contestants
at Jeopardy! represents a major landmark in open-domain
QA, but in many ways, this is just the beginning. Research
in open-domain QA requires advances in many areas of
computer science and AI, including information retrieval
(IR), natural-language processing (NLP), knowledge
representation and reasoning (KR&R), machine learning, and human-computer interfaces (HCI).
History
Unstructured information
Much of human communication, whether it is in
natural-language text, speech, or images, is unstructured.
The semantics necessary to interpret unstructured
information to solve problems is often implicit and must
be derived by using background information and inference.
With structured information, such as traditional database
tables, the data is well-defined, and the semantics is explicit.
Queries are prepared to answer predetermined questions
on the basis of necessary and sufficient knowledge of the
meaning of the table headings (e.g., Name, Address, Item,
Price, and Date). However, what does an arbitrary string of
text or an image really mean? How can a computer program
act on the content of a "note" or a "comment" without
explicit semantics describing its intended meaning or usage?
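To make the contrast concrete, consider a minimal sketch in Python (the record fields echo the headings above, but the values and the note text are invented for illustration):

# Structured record: each field has explicit, agreed-upon semantics, so a
# predetermined question ("what did this item cost?") is a simple lookup.
purchase = {"Name": "A. Smith", "Address": "10 Main St.",
            "Item": "printer", "Price": 89.00, "Date": "2011-01-14"}
answer = purchase["Price"]

# Unstructured note: the same facts are only implicit in free text, and a
# program must infer the buyer, the item, and the amount from the language itself.
note = "Alice grabbed the new printer on the 14th; it set her back about ninety bucks."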
With the enormous proliferation of electronic content on
the web and within our enterprises, unstructured information
(e.g., text, images, and speech) is growing far faster than
structured information. Whether it is general reference
material, textbooks, journals, technical manuals, biographies,
or blogs, this content contains high-value knowledge
essential for informed decision making. The promise of
leveraging the knowledge latent in these large volumes of
unstructured text lies in deeper natural-language analysis that
can more directly infer answers to our questions.
NLP techniques, which are also referred to as text
analytics, infer the meaning of terms and phrases by
analyzing their syntax, context, and usage patterns. Human
language, however, is so complex, variable (there are
many different ways to express the same meaning), and
polysemous (the same word or phrase may mean many
things in different contexts) that this presents an enormous
technical challenge. Decades of research have led to many
specialized techniques, each operating on language at
different levels and on different isolated aspects of the
language understanding task. These techniques include, for
Figure 1. DeepQA architecture.
Right type
If candidate generation finds a correct answer for only 85%
of the questions, then Watson can, at most, get 85% of the
questions correct. It can only perform this well if, after
considering all of the candidates, it ranks the right answer in
the first position and with enough confidence to take the
risk and buzz in. Accurately computing a probability that a
candidate is correct is a critical and very challenging feature
of the system. Buzzing in with a wrong answer is costly.
A contestant will lose the dollar value of that clue and end up
helping his competitors earn the money on the rebound.
The entire remainder of Watson's processing pipeline and a
significant portion of its computational resources are spent
on finding the best answer and computing an accurate
probability that it might be correct.
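As a toy illustration of why that probability matters, consider the sketch below (a simplification only: the payoff model ignores rebounds and game state, and the threshold is arbitrary rather than Watson's actual buzz-in policy):

def expected_gain(p_correct, clue_value):
    # Expected score change from buzzing in: gain the clue's value if right,
    # lose it if wrong. Staying silent has an expected gain of 0.
    return p_correct * clue_value - (1.0 - p_correct) * clue_value

def should_buzz(p_correct, clue_value, threshold=0.5):
    # Buzz only when answering is worth more than staying silent and the
    # estimated probability clears the (adjustable) confidence threshold.
    return expected_gain(p_correct, clue_value) > 0 and p_correct >= threshold

print(should_buzz(0.85, 800))  # True: confident enough to risk the clue
print(should_buzz(0.40, 800))  # False: in expectation, silence is better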
To do that, Watson analyzes many different classes of
evidence for each answer. An important class of evidence
considered is whether the answer is of the right answer
type. It was clear from early experiments that the approach
we took in PIQUANT, where we anticipated all answer
types and built specialized algorithms for finding instances
of anticipated types (such as people, organizations, and
animals), would not be sufficient. Jeopardy! refers to far
too many types (on the order of thousands) for us to have
considered developing a priori information extraction
components for each one. Moreover, they are used
in ways where their meaning is highly contextual and
difficult to anticipate. For example, in the first clue below,
the lexical answer type (LAT) is "quality." How
many things might be correctly classified as a "quality"?
Similarly, the next two questions are both asking for a
"direction."
ART HISTORY: Unique quality of "First Communion
of Anemic Young Girls in the Snow," shown at the
1883 Arts Incoherents Exhibit.
(Answer: "all white")
DECORATING: If you're standing, it's the direction
you should look to check out the wainscoting.
(Answer: "down")
SEWING: The direction of threads in a fabric is called
this, like the pattern of fibers in wood.
(Answer: "grain")
Jeopardy! uses very open-ended types. We employ
dynamic and flexible techniques for classifying candidate
answers, heavily relying on the context in the question.
For this, we developed a technique we called type coercion
[24]. Type coercion radically differs from earlier systems
(such as PIQUANT) that statically classify a possible answer
on the basis of a preexisting set of types. Instead, it takes a
lexical answer type such as "direction" and poses a type
Missing links
Consider the following clues known in Jeopardy! as
"COMMON BONDS":
COMMON BONDS: feet, eyebrows, and
McDonald's. (Answer: "arches")
COMMON BONDS: trout, loose change in your
pocket, and compliments.
(Answer: "things that you fish for")
Realizing and resolving implicit relationships and using
them to interpret language and to answer questions is
generally useful and appears in different forms in Jeopardy!
clues. Although these examples are fun and may seem of
unique interest to Jeopardy!, the ability to find what different
concepts have in common can help in many areas, including
relating symptoms to diseases or generating hypotheses
that a chemical might be effective at treating a disease. For
example, a hypothesis of using fish oil as a treatment for
Raynaud's disease was made after text analytics discovered
that both had a relationship to blood viscosity [34].
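A naive version of the common-bond idea can be sketched in Python as a set intersection over concepts associated with each term (the association lists below are invented for illustration; the actual techniques are described in [35]):

# Invented association lists; a real system would mine these from large text corpora.
associations = {
    "feet":       {"arches", "toes", "shoes"},
    "eyebrows":   {"arches", "forehead", "plucking"},
    "McDonald's": {"arches", "fries", "golden"},
}

def common_bond(terms):
    # Intersect the concepts associated with each term to expose the shared link.
    return set.intersection(*(associations[t] for t in terms))

print(common_bond(["feet", "eyebrows", "McDonald's"]))  # {'arches'}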
In "COMMON BONDS" clues, the task of finding the
missing link is directly suggested by the well-known
Jeopardy! category. However, this is not always the case.
Final Jeopardy! questions, in particular, were uniquely
difficult, partly because they made implicit references to
unnamed entities or missing links. The missing link had to be
correctly resolved in order to evidence the right answer.
Consider the following Final Jeopardy! clue:
EXPLORERS: On hearing of the discovery of George
Mallory's body, he told reporters he still thinks he was
first. (Answer: "Sir Edmund Hillary")
To answer this question accurately, Watson had to first
make the connection to Mount Everest and realize that,
although not the answer, it is essential to confidently
getting the correct answer: in this case, Edmund Hillary, the
first person to reach the top of Mount Everest. Implicit
relationships and other types of tacit context that help in
interpreting language are certainly not unique to Jeopardy!
but are commonplace in ordinary language. In the next paper
in this journal issue, "Identifying Implicit Relationships,"
Chu-Carroll et al. [35] discuss the algorithmic techniques
used to solve "COMMON BONDS" questions, as well as
other questions, such as the Final Jeopardy! question above,
which require the discovery of missing links and implicit
relationships.
Breaking the question down
Another important technique generally useful for QA is
breaking a question down into logical subparts, so that the
subparts may be independently explored and the results
Figure 2. Incremental progress in answering precision on the Jeopardy! challenge: June 2007 to November 2011.
Summary of results
After four years of effort by an average of roughly 25 core
researchers and engineers, Watson was ready to play.
We relied on different ways to measure our results:
1) precision and confidence, 2) game-winning performance,
and 3) component performance.
The first two sets of metrics are end-to-end system metrics.
They consider the performance of the whole QA system
on the Jeopardy! task. They are best represented by plotting
QA precision against percentage answered to produce a
confidence curve. Figure 2 shows the confidence curves
for successive versions of the system, starting with the
PIQUANT-based baseline system and progressing through
the various versions of DeepQA. The x-axis gives the
percentage of questions answered, and the y-axis gives
precision (i.e., for those questions answered, the percentage
correctly answered). Each DeepQA curve is produced by
running DeepQA on 200 games' worth of blind data. These
games contain 12,000 questions that were never viewed
by the developers and never used to train the system.
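The construction of such a curve is straightforward given per-question results. The Python sketch below assumes each question is recorded as an (estimated confidence, correct?) pair; it is a generic illustration rather than DeepQA's evaluation code:

def confidence_curve(results):
    # results: list of (confidence, is_correct) pairs, one per question.
    # At each cutoff the system answers only its most confident questions;
    # precision is measured over exactly those questions.
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    points, correct = [], 0
    for i, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        points.append((100.0 * i / len(ranked), 100.0 * correct / i))
    return points  # (% answered, precision) pairs, from most to least selective

# Four questions: answering only the two most confident yields 100% precision.
print(confidence_curve([(0.9, True), (0.8, True), (0.6, False), (0.3, True)]))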
Table 1. DeepQA technology performance on public benchmark sets. (ACE: automatic content extraction; RTE: recognizing textual entailment.)
Ken Jennings and Brad Rutter and, on any given day, may
defeat either one of them. Watson played 55 real-time
previously unseen games against these players and won 71%
of them. To do this, Watson computed its confidences
and its best answer in approximately three seconds, on
average, and included a very competitive betting strategy.
The third set of metrics is distributed across the individual
component algorithms that populate the DeepQA processing
pipeline. Each paper in this special issue presents
individual component technologies and describes how they
affect end-to-end performance. However, it is important
to realize that the quantity and diversity of components used
to implement DeepQA make it extraordinarily robust and
flexible. The system does not heavily depend on any
one of them. We have run many ablation experiments
consistently showing that all but the most dramatic
of ablations have very little effect.
For example, when we run experiments in which we
ablate a single evidence scorer from the full Watson system,
we rarely see a statistically significant impact on a few
thousand questions, and we never see an impact of 1% or
greater. However, if we ablate all of the evidence scorers, this
heavily ablated version of Watson answers only 50%
correctly. It produces a confidence curve that is insufficient
to compete at Jeopardy!. Consequently, to illuminate and
measure the effectiveness of individual components, the
papers in this journal describe a variety of different
approaches.
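Schematically, the single-scorer ablations described above follow a simple recipe; the Python sketch below uses invented scorer names and a stand-in evaluation function in place of running the full pipeline on a blind question set:

import random

def accuracy(scorers, questions):
    # Stand-in for a real evaluation run: rebuild the pipeline with only the
    # given evidence scorers enabled and score it on the blind question set.
    random.seed(",".join(sorted(scorers)))
    return sum(random.random() < 0.7 for _ in questions) / len(questions)

def single_scorer_ablations(all_scorers, questions):
    # Remove each evidence scorer in turn and record the drop in accuracy
    # relative to the full configuration.
    baseline = accuracy(all_scorers, questions)
    return {s: baseline - accuracy([t for t in all_scorers if t != s], questions)
            for s in all_scorers}

scorers = ["passage-term-match", "type-coercion", "source-popularity"]  # illustrative names
print(single_scorer_ablations(scorers, range(3000)))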
One important method used in several of the papers relies
on a simpler configuration of Watson we have built called
the Watson answer-scoring baseline (WASB) system.
The WASB system includes all of Watson's question
analysis, search, and candidate generation components.
It includes only one evidence-scoring component: an
answer-typing component that uses a named-entity detector.
The use of named-entity detection, to determine whether a
candidate answer has the semantic type that the question
requires, is a very popular technique in QA (as discussed
in detail in [27]), which is why we included it in our baseline.
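In spirit, that single baseline scorer can be sketched as follows (an illustration only, with an invented toy entity lookup standing in for a real named-entity detector; the general technique is the one discussed in [27]):

# Toy "named-entity detector": maps a candidate string to a coarse type label.
# A real detector would be statistical and cover many more types.
ENTITY_TYPES = {
    "Edmund Hillary": "PERSON",
    "Mount Everest":  "GEOGRAPHIC_FEATURE",
    "IBM":            "ORGANIZATION",
}

def answer_type_score(candidate, required_type):
    # The baseline's single evidence scorer: 1.0 if the detected type of the
    # candidate matches the type the question requires, 0.0 otherwise.
    return 1.0 if ENTITY_TYPES.get(candidate) == required_type else 0.0

print(answer_type_score("Edmund Hillary", "PERSON"))  # 1.0
print(answer_type_score("Mount Everest", "PERSON"))   # 0.0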
We have evaluated the impact of adding various components
to the WASB system [27–29, 32] and found that we are
able to examine and compare the individual effectiveness
Future directions
Watson, as developed for Jeopardy!, attempts to provide a
single correct answer and associated confidence. We would
like to see applications of the DeepQA technology move
toward a broader range of capabilities that engage in
dialogues with users to provide decision support over large
volumes of unstructured content. The notion of a computer
system helping to produce and provide evidence for
alternative solutions has been around for decades. Such
knowledge-based decision support tools, however,
traditionally suffer from the requirement to manually craft
and encode formal logical models of the target domain,
such as medicine, where these models represent the concepts,
their relationships, and the rules that govern their interaction.
It can be prohibitively inefficient to do this for broad
bodies of knowledge. It is slow and expensive to maintain
these formal structures as the raw knowledge grows and as
Figure 3. Evidence profiles for differential diagnosis in medicine.
Acknowledgments
The author would like to thank the courageous, talented,
and uniquely committed team of researchers and engineers
that he had the distinct pleasure to work with and lead in the
creation of DeepQA and Watson. These are the people who
designed, built, and continue to advance Watson and its
underlying technologies. Their work and the work of many
university collaborators, including the University of
Massachusetts at Amherst; University of Texas at Austin;
University of Trento; MIT; University of Southern
California, Information Sciences Institute; and, most notably,
the team under Eric Nyberg at Carnegie Mellon University,
made Watson a success. The author would like to particularly
thank Bill Murdock, one of the principal researchers who
contributed to Watson's development, a coauthor of many
papers in this issue of the IBM Journal of Research and
Development, and the one who helped to compile and edit
the papers in this journal.
*Trademark, service mark, or registered trademark of International
Business Machines Corporation in the United States, other countries, or
both.
**Trademark, service mark, or registered trademark of Jeopardy
Productions, Inc., Trustees of Princeton University, or Wikimedia
Foundation in the United States, other countries, or both.
References
1. R. F. Simmons, "Natural language question-answering systems:
1969," Commun. ACM, vol. 13, no. 1, pp. 15–30, Jan. 1970.
2. M. Maybury, New Directions in Question-Answering. Menlo
Park, CA: AAAI Press, 2004.
3. T. Strzalkowski and S. Harabagiu, Advances in Open-Domain
Question-Answering. Berlin, Germany: Springer-Verlag, 2006.