1 Billion Pages = 1 Million Dollars?


Mining the Web to Play "Who Wants to be a Millionaire?"

Shyong (Tony) K. Lam, Computer Science Dept., University of Minnesota, Minneapolis, MN 55455, lam@cs.umn.edu
David M. Pennock*, Overture Services, Inc., 74 N. Pasadena Ave., 3rd floor, Pasadena, CA 91101, david.pennock@overture.com
Dan Cosley, Computer Science Dept., University of Minnesota, Minneapolis, MN 55455, cosley@cs.umn.edu
Steve Lawrence, NEC Laboratories America, Princeton, NJ 08540, lawrence@necmail.com

*This work conducted at NEC Laboratories America, Princeton, NJ.

Abstract

We exploit the redundancy and volume of information on the web to build a computerized player for the ABC TV game show "Who Wants To Be A Millionaire?". The player consists of a question-answering module and a decision-making module. The question-answering module utilizes question transformation techniques, natural language parsing, multiple information retrieval algorithms, and multiple search engines; results are combined in the spirit of ensemble learning using an adaptive weighting scheme. Empirically, the system correctly answers about 75% of questions from the Millionaire CD-ROM, 3rd edition - general-interest trivia questions often about popular culture and common knowledge. The decision-making module chooses from allowable actions in the game in order to maximize expected risk-adjusted winnings, where the estimated probability of answering correctly is a function of past performance and confidence in correctly answering the current question. When given a six-question head start (i.e., when starting from the $2,000 level), we find that the system performs about as well on average as humans starting at the beginning. Our system demonstrates the potential of simple but well-chosen techniques for mining answers from unstructured information such as the web.

1 INTRODUCTION

Machine competence in games has long served as a benchmark for progress in artificial intelligence (AI). While we seem hardly close to building systems capable of passing a full-blown Turing Test, machine excellence in a growing number of games signals incremental progress. Games such as chess [13], checkers [27], Othello [7, 18], and Go [5] are formal enough to be solvable in principle, though they are far from trivial to master in practice due to exponential-size search spaces. In chess, checkers, and backgammon, current machine players rival their best human competitors. Recently, attention has turned to less structured game environments, like crossword puzzles [16], video games [30], and soccer [29], where game states, actions, or both are not easily enumerable, making a pure search formulation unnatural or impractical.

"Who Wants to be a Millionaire?" is a trivia game where actions are enumerable, though competence depends on the ability to answer general-interest questions - often requiring common sense or knowledge of popular culture - and to make decisions based on confidence, expected reward, and risk attitude. True human-level competence at Millionaire will likely require excellence in natural language processing and common sense reasoning. We present a first-order system that exploits the breadth and redundancy of information available on the World Wide Web to answer questions and estimate confidence, and utilizes a decision-theoretic subsystem to choose actions to maximize expected risk-adjusted payoffs.

2 RELATED WORK

2.1 QUESTION ANSWERING

A large body of research exists on question answering. For example, see the Question-Answering Track [32] of the Text REtrieval Conference (TREC). Systems in this track compete against each other to retrieve short (50 or 250 byte long) answers to a set of test questions.

Question-answering systems typically decompose the problem into two main steps: retrieving documents that may contain answers, and extracting answers from these documents. For the first part of the task, retrieving a set of promising documents from a collection, the systems in the TREC QA track submitted the original questions to various information retrieval systems [32].

A number of systems aim to extract answers from documents. For example, Abney et al. [1] describe a system in
which documents returned by the SMART information retrieval system are processed to extract answers. Questions are classified into one of a set of known "question types" that identify the type of entity corresponding to the answer. Documents are tagged to recognize entities, and passages surrounding entities of the correct type for a given question are ranked using a set of heuristics. Two papers [3, 21] present systems that re-rank and post-process the results of regular information retrieval systems with the goal of returning the best passages. These systems use the general approach of retrieving documents or passages that are similar to the original question with variations of standard TF-IDF term-weighting schemes [25]. The most promising passages are chosen from the documents returned using heuristics and/or hand-crafted regular expressions.

Other systems modify queries in order to improve the chance of retrieving answers. Lawrence and Giles [17] introduced Specific Expressive Forms, where questions are transformed into specific phrases that may be contained in answers. For example, the question "what is x" may be transformed into phrases such as "x is" or "x refers to". Joho and Sanderson [15] use a set of hand-crafted query transformations in order to retrieve documents containing descriptive phrases of proper nouns. Agichtein et al. [2] describe a method for learning these transformations and apply their method to web search engines.

Clarke, Cormack, and Lynam [8] describe a system that exploits the redundancy present in their corpus by using the frequency of each candidate answer to "vote" for the answer most likely to be correct. This approach is similar to the base approach of our system.

Recent work has shown that the Web can effectively be used as a general knowledge database for question-answering [9, 11, 22] and other related tasks. Fallman [12] presents a spelling and grammar checking tool that uses the Google search engine as its source of information, allowing it to handle names as well as informal aspects of a language such as idioms and slang expressions.

In contrast to most previous research, where systems are designed to search for an unknown answer, we present a system that aims to select the correct answer from a number of possible answers.

2.2 DECISION MAKING

Decision theory formalizes optimal strategies for human decision making [23], justified on compelling axiomatic grounds [26, 31]. The likelihood of future states is encoded as a subjective probability distribution and the value of future state-action pairs is encoded as a utility function; the decision maker optimizes by choosing actions that maximize future expected utility. A growing subfield in AI employs decision theory as a framework for designing autonomous agents. When the agent's state space grows unmanageably large - as in many real-world settings - graphical models such as Bayesian networks [14] or influence diagrams [28] that can encode probabilities and utilities compactly are often used. In Millionaire, the space of possible outcomes is small enough that decision trees [23], which explicitly enumerate probabilities and utilities for all future possibilities, are sufficient.

2.3 GAME PLAYING

Board games have dominated much of the history of AI in game playing [5, 6, 13, 18, 27]. This paper follows instead in the tradition of the crossword-puzzle-solving program PROVERB [16]. Like PROVERB, our Millionaire player brings together technologies from several core areas of artificial intelligence (including information retrieval, natural language parsing, ensemble learning, and decision making) to solve a challenging problem that does not naturally conform to the game-tree method for solving board games. Other domains under recent and rapid investigation - that are also not easily amenable to tree enumeration - include video games such as Quake [30] and soccer [29].

The Millionaire game has been explored in some previous work. Vankov et al. presented an abstract decision-theoretic model of Millionaire that yields a strategy for a player to maximize expected utility; however, it is still left up to the player to actually answer questions and assess confidence.¹ Rump [24] uses Millionaire as an educational tool to present problems in decision analysis, including probability estimation and calculating expected utility, problems that our system must address. Clarke et al. [8] apply their general-purpose question-answering system to a set of questions asked on the Millionaire TV show (naturally composed of more early-round questions), answering 76 out of 108 questions (70.4%) correctly.

¹The paper describing the system, unfortunately, has been removed from the web.

3 PLAYING MILLIONAIRE

Millionaire, a game show on ABC TV in the United States, might be characterized as a cultural phenomenon, spawning catch phrases and even fashion trends. The show originated in the United Kingdom and has since been exported around the world. Computers are explicitly forbidden as contestants on the actual game show by the official rules, negating any dreams we had of showcasing our system alongside Regis on national TV. We wrote our player instead based on a home version of the game: the Millionaire CD-ROM, 3rd edition.

3.1 RULES OF THE GAME

In Millionaire, the player is asked a series of multiple-choice trivia questions. Each correct answer roughly doubles the current prize. An incorrect answer ends the game and reduces the prize to the amount associated with the last correctly-answered "milestone" question, or zero if
no milestones have been met. Milestones occur at the $1,000 and $32,000 stages, after questions five and ten, respectively. Answering fifteen questions correctly wins the grand prize of one million dollars. The difficulty of the questions (for people) rises along with the dollar value.

At any stage, after seeing the next question, the player may decline to answer and end the game with the current prize total. Alternatively, the player may opt to use any or all available lifelines to obtain help answering the question. Players are allotted three lifelines per game. The three lifelines allow the player to (1) poll the audience, (2) eliminate two incorrect choices, or (3) telephone a friend.

Our system does not address some aspects of Millionaire. In particular, we do not attempt to play the fastest finger round that determines the next player from a pool of candidate contestants. Winning this round entails being the first one to provide the proper ordering of four things by the criteria given in the question (e.g., "Place these states in geographic order from East to West: Wyoming, Illinois, Texas, Florida."). To be competitive, an answer generally must be provided within several seconds. Our question-answering system is neither designed to answer questions of this nature nor is it capable of answering most questions quickly. We also do not address extraneous tasks that people must perform in order to play the game, including speech recognition, speech synthesis, motor skills, etc.
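For concreteness, the prize structure and the milestone fall-back rule just described can be captured in a few lines of Python. The sketch below is our own illustration rather than code from the system; the dollar ladder is the standard one implied by the stages reported in Tables 3 and 4.

PRIZE_LADDER = [100, 200, 300, 500, 1_000, 2_000, 4_000, 8_000,
                16_000, 32_000, 64_000, 125_000, 250_000, 500_000, 1_000_000]
MILESTONES = {1_000, 32_000}  # amounts guaranteed after questions five and ten


def prize_if_wrong(questions_answered):
    # Prize kept after a wrong answer: the highest milestone already reached.
    banked = 0
    for prize in PRIZE_LADDER[:questions_answered]:
        if prize in MILESTONES:
            banked = prize
    return banked


def prize_if_walk(questions_answered):
    # Prize kept when the player declines to answer the next question.
    return PRIZE_LADDER[questions_answered - 1] if questions_answered else 0


assert prize_if_wrong(7) == 1_000   # missed the $8,000 question while holding $4,000
assert prize_if_walk(7) == 4_000    # walked away instead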
yielding the fewest results. This baseline strategy answers
3.2 CHARACTERISTICS OF QUESTIONS

The Millionaire CD-ROM game contains 635 questions that are roughly comparable in nature and difficulty to those on the TV show. The game places the questions into seven difficulty levels. The lower difficulty levels contain more common sense and common knowledge questions, while the difficult questions tend to be much more obscure. Lifeline information is also provided in the game data and is used in our game model.

For exploring algorithms and tuning parameters, we used three random 90-question samples and one random 180-question sample. Various results on these training samples are reported throughout Section 4. Final test results on all 635 questions are reported in Section 5.1.

3.3 OUR PLAYER

Our Millionaire player consists of two main components, a question-answering (QA) module for multiple-choice questions and a decision-making (DM) module. We describe each component in turn below.

4 THE QA MODULE

Our system exploits the redundancy present in text corpora to answer questions. More precisely, we use the idea that question words associated with the answer tend to appear and are more likely to be repeated in multiple documents that contain the answer. We use the World Wide Web as our data source and several search engines (most prominently Google) as our conduit to that data.

We bring together several AI techniques from information retrieval, natural language parsing, and ensemble machine learning, as well as some domain-specific heuristics, in order to select answers and generate confidence measures. This information is then fed into the decision-making module, described later, to actually play the game.

4.1 THE NAIVE APPROACH: COUNTING

Our basic approach was to query Google with the question along with each of the four answers. Google enforces a 10-term limit on searches, so we performed stopword filtering on the questions to shorten our queries. Because answers were entirely comprised of stopwords in some cases, we did not filter them. The program generated queries in the format answer filtered-question to help ensure that the answer words fit in under Google's 10-term limit.

The response to the question was normally the answer that produced the highest number of search results. However, a number of questions are "inverted" in the sense that the answer is the one that is unlike the other three. We are able to identify nearly all of these by the presence of the word "not" in the question. In such cases, we choose the answer yielding the fewest results. This baseline strategy answers about half of the questions correctly.

4.1.1 Simple Query Modifications

To improve on this strategy, we empirically found a small number of query transformations and modifications that increased the percentage of correct responses to 60%; a combined sketch follows the list.

• Multiple-word answers are enclosed in quotes to require that they appear as a phrase in any search results.

• "Complete a saying" questions, identified by the presence of one of the strings "According", "said to", or "asked to", were handled by constructing each possible saying from the choices and requiring that it appear in the search results.

• When a query returns no results for any of the answers, we use a series of "fallback" queries that progressively relax the query. Quotes and words were removed from each query until at least one answer produced a non-zero number of search results.

• Longer web pages tend to contain lists of links, essays, manifestos, and stories; in general, their content is less useful for answering questions. Since search engines typically do not provide query syntax for restrictions on page size, we used a first-order approximation where we excluded .pdf files from the results.

4.2 WORD PROXIMITY MEASURES

Our heuristics for finding phrases are a specific variation of the general strategy of using proximity. Our belief - and that of many of the teams working on the TREC question-answering track [32] - is that not only do answers appear in the same documents as questions, but that they usually appear near the question words. In order to test proximity measures, we downloaded the first 10 (or all, if there were fewer than 10) pages Google returned for each query. We score each document based on a heuristic named DistanceScore that gives more credit to question words that appear closer to answer words in the document. Each such question word contributes between 0 and 1 to the score, depending on how close the word is. A radius parameter controls what is considered near and how much a word adds to the score. We use the average score per answer word in the document to further penalize documents where answer words appear frequently but question words do not. Table 1 gives pseudo-code for DistanceScore.

Table 1: Pseudocode for DistanceScore, our proximity scoring method for favoring question words that appear near (within rad words of) answer words.

    // wordList is the document split at spaces
    DistanceScore(wordList, qWords, aWords, rad)
        score, answerWords = 0
        for i = 1 to |wordList| do
            if wordList[i] is in aWords then
                answerWords = answerWords + 1
                for j = (i - rad) to (i + rad) do
                    if wordList[j] is in qWords then
                        score += (rad - abs(i - j)) / rad
        if answerWords == 0 then return 0
        else return score / answerWords
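The pseudocode in Table 1 translates almost line for line into Python; the version below is our rendering, not the authors' code.

def distance_score(word_list, q_words, a_words, rad):
    # word_list: a document split at spaces; q_words / a_words: the question
    # and answer words; rad: the proximity radius.
    score = 0.0
    answer_words = 0
    for i, word in enumerate(word_list):
        if word in a_words:
            answer_words += 1
            lo = max(0, i - rad)
            hi = min(len(word_list) - 1, i + rad)
            for j in range(lo, hi + 1):
                if word_list[j] in q_words:
                    score += (rad - abs(i - j)) / rad
    return score / answer_words if answer_words else 0.0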
Figure 1 shows the performance of DistanceScore at various values of the radius on three 90-question samples, along with the performance of the naive method. Small random question samples were used to reduce the download and computation time required. DistanceScore performs reasonably well, doing worse than the naive method at low radius values but overtaking it at higher ones.

Figure 1: Question-answering accuracy versus proximity radius when using DistanceScore, as compared to the naive method on three 90-question samples. Each line represents performance on one sample.

4.2.1 A Third Expert: Noun-Phrase Proximity

We developed a third strategy, also based on proximity. Since requiring multi-word answers to appear as phrases in web pages improved the accuracy of the naive method, another plausible strategy is to do the same for each of the noun phrases contained in the question. Noun phrases were identified using simple heuristics based on Brill's part-of-speech tagger. We submitted each {noun-phrase, answer} pair to Google and scored the results the same way as before. The result-count method produced poor results; however, downloading the returned documents and using DistanceScore to score each document worked well and produced results comparable to the previous two strategies.

4.3 COMBINING STRATEGIES

Among the naive, DistanceScore based on naive, and DistanceScore based on noun phrase strategies, at least one has the correct answer for about 85% of the questions in a 180-question sample. To exploit this, we look to answer-combining ("ensemble") approaches used commonly in machine learning, as summarized in [10].

Using the following formula, we attempt to combine our three strategies, or "experts," and produce a single score for each possible answer:

    c_i = Σ w_S * (S_i / max{S_1..n})    (summed over all strategies S)

where c_i is the combined score for answer i, w_S is the weight for strategy S, S_i is the score for strategy S for answer i, and n is the number of candidate answers.

Using the above formula to score candidate answers, we were able to reach 70% performance on the question sample. The weights yielding this, found empirically, were around ±0.05 of w_n = 0.40, w_p = 0.15, w_pp = 0.45 for the naive, word proximity, and noun phrase proximity strategies, respectively.
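In code, the combination rule amounts to normalizing each strategy's scores by that strategy's best score and taking a weighted sum per answer. The sketch below is illustrative; only the hand-tuned weights shown are taken from the text.

def combine_scores(strategy_scores, weights):
    # strategy_scores: {strategy name: [score for each candidate answer]}
    # weights:         {strategy name: w_S}
    n = len(next(iter(strategy_scores.values())))
    combined = [0.0] * n
    for name, scores in strategy_scores.items():
        top = max(scores) or 1.0   # guard against an all-zero strategy
        for i, s in enumerate(scores):
            combined[i] += weights[name] * (s / top)
    return combined


# Hand-tuned weights reported above (naive, word proximity, noun-phrase proximity).
HAND_TUNED = {"naive": 0.40, "proximity": 0.15, "phrase_proximity": 0.45}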

4.3.1 Combining Search Engines

In addition to combining strategies, we investigated using multiple search engines to improve results. We modified each of the three strategies to submit queries to AllTheWeb, MSN Search, and AltaVista, using syntax appropriate for each engine. The scores are combined using the same formula as above. Table 2 shows the results for each strategy using each search engine.

Table 2: Performance of the three strategies using different search engines.

    engine       naive    proxim   phr prox   combined
    Google       55.6%    55.0%    68.9%      70%
    AllTheWeb    56.1%    51.7%    58.3%      66%
    MSN          44.4%    48.9%    47.2%      58%
    AltaVista    46.7%    55.6%    56.1%      68%

Google performs better than the other engines individually. However, we can combine the results from multiple engines, much as we combined the opinions of multiple strategies. Manually choosing a single set of weights for each {method, engine} pair showed that combining results across engines could result in better performance. For example, combining Google with AltaVista results in 75% of the 180-question sample being answered correctly.
4.3.2 Confidence-Based Weight Assignments

However, choosing the weights manually was difficult. The optimal weights are probably sample-dependent and prone to overfitting; minor changes often led to 2-4% drops in performance. We modified our formula to assign different weights to each scoring method on a question-by-question basis, using the "confidence" of each scoring strategy S:

• Let x_S be the "confidence ratio" defined by

    x_S = lowest score / second-lowest score,     if a "not" question;
    x_S = second-highest score / highest score,   otherwise.

• Let T = Σ (1 - x_S^2) over all strategies.

• The weight for strategy S is w_S = (1 - x_S^2) / T.

This assigns higher weights to more confident scoring methods. We chose the ratio between the second-best and best scores because we found a large difference in the ratio when the correct answer has the best score (mean ratio of 0.34) versus when the incorrect answer has the best score (mean of 0.58). Using these confidence-based weights generally results in slightly worse performance than hand-tuned weights, with Google falling to 69%, AltaVista to 65%, and the combination falling to 74% on the 180-question sample. Nonetheless, we believe that automatic confidence-based weights are more robust and less prone to overfitting than hand-tuned weights.
5 DISCUSSION: QA MODULE

Below we discuss several issues that came up in the course of building the question-answering subsystem, and ways in which it could be improved.

5.1 OVERALL PERFORMANCE

We used confidence-based weights with the three-strategy method on the entire set of 635 Millionaire questions. The Google-based question-answerer got 72.3% of the questions correct, while one that used Google and AltaVista got 76.4%. On a set of 50 non-Millionaire trivia questions obtained from the shareware trivia game "AZ Trivia," the Google- and AltaVista-based answerer answered 72% of the questions correctly.

We consider this to be good performance over the unstructured (and not necessarily correct!) data available from the web, supporting our claim that the web can be an effective knowledge base for multiple-choice question-answering.

5.2 CHOOSING GOOD WEIGHTS

In a few cases, using confidence scores to combine strategies caused the system's accuracy to fall below that of the best single strategy. This probably means that the "confidence ratio" is not a good heuristic for all scoring methods. The ratio is also difficult to compare between different engine-method combinations. For example, AllTheWeb's ratios with the proximity score are consistently low, which translates into high confidence for many questions - even though this strategy only answers about half the questions correctly. Conversely, Google's ratios with the noun-phrase proximity score (which performs excellently) are consistently high, leading to lower confidences. The PROVERB crossword puzzle solver [16], which utilizes a similar approach to consider candidate answers from multiple experts, avoids this problem by allowing each expert to supply its own estimated confidence explicitly rather than applying a single function to every expert.

5.3 SAMPLE "PROBLEM" QUESTIONS

It appears that we have run into another example of the 80-20 rule. About one-quarter of the Millionaire questions are "hard" for the program. Below are examples of such questions that suggest areas in which a program trying to use the web as a knowledge base would need to improve.

Common Sense. How many legs does a fish have? 0, 1, 2, or 4? This information may exist on the web, but is probably not spelled out.

Multiple Plausible Answers. What does the letter "I" stand for in the computer company name "IBM"? Information, International, Industrial, or Infrastructure? "Information" probably appears just as often as "international" in the context of IBM.

Polysemy. Which of these parts of a house shares its name with a viewing area on a computer screen? Wall, Root, Window, or Basement? The words "root" and "computer" often co-occur (e.g., the Unix superuser). This question also suggests that biases in the content of the web - originally by and for technical, computer-literate users - may hamper using the web as a general knowledge base in some instances.

Non-Textual Knowledge. Which of these cities is located in Russia? Kiev, Minsk, Odessa, or Omsk? The program doesn't know how to read maps.

Alternative Representations. Who is Flash Gordon's archenemy? Doctor Octopus, Sinestro, Ming the Merciless, or Lex Luthor? The word "archenemy" usually appears as two words ("arch enemy") on Flash Gordon (and other) pages.

6 THE DM MODULE

Answering questions is only half the battle. In order to actually play Millionaire, the system must also decide when to use a lifeline and when to "walk away". To compute its best next move, the decision-making module constructs a decision tree [23] that encodes the probabilities and utilities at every possible future state of the game. The full tree consists of decision forks for choosing whether to answer the question, use a lifeline, or walk away, and chance forks to encode the uncertainty of answering the questions correctly. The best choice for the program is the action that maximizes expected utility.

Utility is not necessarily synonymous with winnings in dollars. For example, suppose a contestant is at the $500,000 level. Even if he or she believes that by answering the final question his or her chances are fifty-fifty of winning either $1 million or $32,000 (expected value $516,000), the contestant will almost surely walk away with a guaranteed $500,000 instead. To model such risk-aversion we give the agent an exponential utility function u(x) = 1 - e^(-x/k). For any finite k > 0, the agent exhibits risk-averse behavior, though as k → ∞, the agent becomes risk neutral (i.e., maximizes expected dollar value). In general, after playing many games, more risk-averse agents will earn less prize money on average, though will have a smaller variation (standard deviation) of winnings.
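To make the risk-aversion argument concrete, here is a small sketch (ours, not the system's code) of the exponential utility and the $500,000 example above, using k = 250,000, the risk-averse setting used later in Section 6.2.

import math


def utility(x, k=250_000.0):
    # Exponential utility u(x) = 1 - exp(-x/k); risk-averse for any finite k > 0.
    return 1.0 - math.exp(-x / k)


def expected_utility(outcomes, k=250_000.0):
    # outcomes: list of (probability, dollar prize) pairs.
    return sum(p * utility(v, k) for p, v in outcomes)


# A 50/50 shot at $1,000,000 or $32,000 has expected value $516,000, but in
# utility terms it is worth far less than a sure $500,000.
walk = utility(500_000)                                       # ~0.86
answer = expected_utility([(0.5, 1_000_000), (0.5, 32_000)])  # ~0.55
assert walk > answer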

6.1 MODELING THE GAME

We use the following specifications to construct the decision tree and play the game (a sketch of the probability model follows the list):

• For all questions beyond the current question, chance nodes are assigned probabilities based on historical past performance on a sample of questions from the associated difficulty level.

• For the current question (i.e., after the question has been asked and analyzed), the current chance node probability is 1 - x^a, where x is the ratio between the second-highest score and the highest score obtained from the question-answering module (or the lowest score and the second-lowest score for "not" questions), and a is a tunable parameter that will be examined later. This lets us estimate confidence in our answer to the specific question being asked.

• The estimated future effect of lifelines on probability p is given by the function f(p) = -p^2 + 2p, or the lifeline's performance based on historical data, whichever is greater. This models the idea that using a lifeline should raise the estimated probability of getting the question correct.

• When lifelines are used by the player, a new response and confidence level are calculated based on the new information received. For the 50/50 lifeline, the new response is simply the remaining choice with the higher score. The phone-a-friend and poll-the-audience lifelines are taken as an additional "expert" with a weight based on historical data.
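A compact restatement of the probability model in the two middle bullets, again as our own sketch rather than the authors' code:

def answer_probability(confidence_ratio, a=4.0):
    # Chance-node probability for the current question: 1 - x**a, where x is
    # the QA module's confidence ratio and a is the tunable exponent.
    return 1.0 - confidence_ratio ** a


def lifeline_boost(p, historical=0.0):
    # Estimated effect of a lifeline: f(p) = -p**2 + 2*p, or the lifeline's
    # historical success rate, whichever is greater.
    return max(-p * p + 2.0 * p, historical)


# With a = 4 (the value settled on in Section 6.2), the mean confidence ratios
# reported in Section 4.3.2 map to roughly p = 0.99 when the top answer is
# correct (x = 0.34) and p = 0.89 when it is not (x = 0.58).
print(answer_probability(0.34), answer_probability(0.58))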

6.2 PLAYING THE GAME: RESULTS

Table 3 shows the results of a risk-averse player (k = 250,000) playing 10,000 games using the above model and the question-answerer that combines Google and AltaVista. Questions were selected randomly from all the available questions in the appropriate difficulty level for each stage.

Table 3: Results of playing 10,000 games with k = 250,000 and a = 4. The columns show the current prize level, the number of games ending at that level, the numbers of correctly- and incorrectly-answered questions, the number of times the player "walked away", the number of lifelines used, the number of lifelines that caused the player to change its answer to the correct one, and the number of lifelines that misled the player. On average the player answered 5.29 questions correctly and won $26,328.87 per game. (The per-level counts are not reproduced here.)

Figure 2 summarizes the relationship between k, average winnings, and standard deviation. The more risk-neutral the program is, the more it wins, and the more its winnings vary between games. Note that these points lie essentially along an efficient frontier (i.e., any gain in expected value necessitates an increase in risk [20]).

Figure 2: Standard deviation versus average winnings as k ranges from 5,000 to 400,000 and a is fixed at 4. The gray point is a risk-neutral player. As k increases, average winnings and standard deviation both increase.

We also explored the effects of changes in a, the exponent in the function used to convert confidence ratios into probabilities. Figure 3 graphs average winnings versus a and Figure 4 graphs the average number of correctly answered questions versus a. Using a higher a raises the program's estimated probability of answering a question correctly. Choosing a too low or too high hinders game performance since the program chooses to stop too soon or incorrectly answers questions that it is overconfident about. While high a values can produce high average winnings, this comes at the cost of many more games (65%) resulting in a $0 prize, as the player is too confident during early questions and saves its lifelines for later use. An a of 4 seems reasonable; about 47% of games result in $0 in that case, and the average winnings are relatively high.

Figure 3: Average winnings versus a. Black points are for a risk-averse player (k = 250,000); gray points are for a risk-neutral player.

Figure 4: Average number of questions correct versus a. Black points are for a risk-averse player (k = 250,000); gray points are for a risk-neutral player.

7 DISCUSSION: DM MODULE

7.1 HARD QUESTIONS EASY, EASY ONES HARD

Table 4 compares the program's winnings to humans' winnings based on data from the ABC website as of mid-July 2001. A striking feature of the program's performance is how often it wins nothing compared to people. Humans almost always answer the first several questions correctly; however, some are so obvious that the question-answerer cannot find the correct answer on the web. People generally do not encode common knowledge into their web documents. As a result, while the web seems to be a good knowledge repository for general knowledge, it is more difficult to use it as a common-sense database.

Table 4: Human performance on the television show as reported on ABC's website in July 2001, compared to the computer's performance when given no handicap, and a six-question handicap.

    Stage      human (pct)    computer (pct)   6-handicap (pct)
    0          14 (2.0%)      4676 (46.8%)     0 (0.0%)
    100        0 (0.0%)       0 (0.0%)         0 (0.0%)
    200        0 (0.0%)       0 (0.0%)         0 (0.0%)
    300        0 (0.0%)       5 (0.1%)         0 (0.0%)
    500        0 (0.0%)       7 (0.1%)         0 (0.0%)
    1000       195 (28.6%)    3700 (37.0%)     5447 (54.5%)
    2000       0 (0.0%)       42 (0.4%)        0 (0.0%)
    4000       4 (0.6%)       46 (0.5%)        61 (0.6%)
    8000       9 (1.3%)       97 (1.0%)        249 (2.5%)
    16000      40 (5.9%)      76 (0.8%)        231 (2.3%)
    32000      166 (24.3%)    815 (8.2%)       2337 (23.4%)
    64000      92 (13.5%)     37 (0.4%)        139 (1.4%)
    125000     89 (13.0%)     99 (1.0%)        370 (3.7%)
    250000     48 (7.0%)      156 (1.6%)       504 (5.0%)
    500000     18 (2.6%)      125 (1.3%)       311 (3.1%)
    1000000    8 (1.2%)       115 (1.2%)       351 (3.5%)
    Avg. winnings: $76,497 vs. $26,328.87 vs. $77,380.90

Observe that even if the question answerer could achieve a 95% success rate on early questions, it would still only have a 77% chance of achieving the $1,000 milestone (0.95^5 ≈ 0.77). Its actual performance is worse, correctly answering 86% at level 1 ($100, $200, and $300) and 75% at level 2 ($500 and $1,000). Table 3 shows that as a result the program often exhausts its lifelines early in the game. On the other hand, we believe our program would have the upper hand against most people in a one-question, level 7, winner-takes-all match.

7.2 SIX QUESTIONS TO HUMAN

We might ask how well the program fares when given a handicap - that is, assuming that the program is able to answer the first N questions correctly without using any lifelines. Figure 5 graphs the program's winnings versus its handicap. With a six-question head start (going for $4,000) and all lifelines remaining, a risk-averse computer player (k = 250,000) averages $77,381 with a standard deviation of $202,296.

Figure 5: Average winnings versus handicap. Black points are for a risk-averse player (k = 250,000); gray points are for a risk-neutral player.

Data from ABC's website as of mid-July 2001 indicates that people on the show won about $76,497 on average with a standard deviation of $140,441. This suggests that, given a six-question handicap, the program performs about as well as qualified human players (i.e., those who self-selected to play the game, passed stringent entrance tests,
and likely practiced for the game). Table 4 shows that even with the handicap, the program's performance is more variable than a human's, both winning big and losing early more often than people. However, its performance is still comparable, with "only" six "easy" questions separating the program from human-level performance.

8 OTHER APPLICATIONS

While designed to play Millionaire, our system has other, more practical applications. The most straightforward is simply as a general-purpose question-answering system that can answer questions, provided a small pool of candidate answers can be provided or generated by some other means.

Combining the question-answerer with the decision maker can be useful in domains where a non-trivial penalty exists for answering a question incorrectly. For example, our system could be adapted to take the Scholastic Aptitude Test (SAT), an exam where answering a question incorrectly results in a lower score than not answering.

The general strategy of using search engines to mine the web as a giant text corpus shows promise in a number of areas. For example, web sites which provide content in multiple languages could become a knowledge base for automatic translation. Natural language processing programs could use the web as a corpus to help disambiguate parsing, or to find commonly occurring close matches to ungrammatical sentences.

9 CONCLUSIONS AND FUTURE WORK

We find that the web is effective as a knowledge base for answering generic multiple-choice questions. Naive methods that simply count search engine results do surprisingly well; more sophisticated methods that employ simple query modifications, identify noun phrases, measure proximity between question and answer words, and combine results from multiple engines do even better, attaining about 75% accuracy on Millionaire questions. When coupled with a decision-making module and given a six-question handicap, our system plays the game about as well as people.

We believe that our system can be marginally improved in a variety of ways: for example, by employing better schemes for weighting multiple scoring methods, or by narrowing down the domain of a question and using domain-specific search strategies. We are also excited about the potential promised by approaches for structuring web data [4], although we believe that advances in automatic techniques for applying such structure (e.g., better natural language processing and common sense reasoning [19]) will be required for these approaches to succeed.

The call for such advances is a familiar one. From natural language processing to computer vision, a similar barrier exists across many subfields of AI: easy tasks (for people) are hard and hard tasks easy. While statistical and brute-force methods can go a long way toward matching human performance, an often difficult-to-bridge gap remains.

References

[1] S. Abney, M. Collins, and A. Singhal. Answer extraction. In ANLP, 2000.

[2] E. Agichtein, S. Lawrence, and L. Gravano. Learning search engine specific query transformations for question answering. In Proceedings of WWW, pages 169-178, 2001.

[3] D. Aliod, J. Berri, and M. Hess. A real world implementation of answer extraction. In Proceedings of the 9th International Workshop on Database and Expert Systems, Workshop: Natural Language and Information Systems (NLIS-98), 1998.

[4] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34-43, May 2001.

[5] B. Bouzy and T. Cazenave. Computer Go: An AI oriented survey. Artificial Intelligence, 132(1):39-103, 2001.

[6] M. Buro. Methods for the evaluation of game positions using examples. PhD thesis, University of Paderborn, Germany, 1994.

[7] M. Buro. The Othello match of the year: Takeshi Murakami vs. Logistello. ICCA Journal, 20(3):189-193, 1997.

[8] C. L. A. Clarke, G. V. Cormack, and T. R. Lynam. Exploiting redundancy in question answering. In Proceedings of SIGIR, New Orleans, September 2001.

[9] C. L. A. Clarke, G. V. Cormack, T. R. Lynam, C. M. Li, and G. L. McLearn. Web reinforced question answering. In Proceedings of TREC, 2001.

[10] T. G. Dietterich. Machine learning research: Four current directions. AI Magazine, 18(4):97-136, 1997.

[11] S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng. Web question answering: Is more always better? In Proceedings of SIGIR, 2002.

[12] D. Fallman. The penguin: Using the web as a database for descriptive and dynamic grammar and spell checking. In Proceedings of CHI, 2002.

[13] F. Hsu, M. S. Campbell, and A. J. Hoane. Deep Blue system overview. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 240-244, July 1995.

[14] F. V. Jensen. An Introduction to Bayesian Networks. Springer, New York, 1996.

[15] H. Joho and M. Sanderson. Retrieving descriptive phrases from large amounts of free text. In Proceedings of CIKM, 2000.

[16] G. A. Keim, N. M. Shazeer, M. L. Littman, S. Agarwal, C. M. Cheves, J. Fitzgerald, J. Grosland, F. Jiang, S. Pollard, and K. Weinmeister. Proverb: The probabilistic cruciverbalist. In Proc. 16th National Conference on Artificial Intelligence, pages 710-717, 1999.

[17] S. Lawrence and C. L. Giles. Context and page analysis for improved web search. IEEE Internet Computing, 2(4):38-46, 1998.

[18] K. F. Lee and S. Mahajan. The development of a world class Othello program. Artificial Intelligence, 43:21-36, 1990.

[19] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33-38, November 1995.

[20] H. M. Markowitz. Portfolio selection. Journal of Finance, 7(1):77-91, 1952.

[21] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, and V. Rus. Lasso: A tool for surfing the Answer Net. In Proceedings of TREC, 1999.

[22] D. Radev, W. Fan, H. Qi, H. Wu, and A. Grewal. Probabilistic question answering on the web. In Proceedings of WWW, 2002.

[23] H. Raiffa. Decision Analysis: Introductory Lectures on Choices under Uncertainty. Addison-Wesley, Reading, MA, 1968.

[24] C. M. Rump. Who wants to see a $million error? INFORMS Transactions on Education, 1(3), 2001.

[25] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.

[26] L. J. Savage. The Foundations of Statistics. Wiley, New York, 1954.

[27] J. Schaeffer, J. C. Culberson, N. Treloar, B. Knight, P. Lu, and D. Szafron. A world championship caliber checkers program. Artificial Intelligence, 53(2-3):273-289, 1992.

[28] R. D. Shachter. Probabilistic inference and influence diagrams. Operations Research, 36(4):589-604, 1988.

[29] P. Stone and R. Sutton. Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the 18th ICML, 2001.

[30] M. van Lent, J. Laird, J. Buckman, J. Hartford, S. Houchard, K. Steinkraus, and R. Tedrake. Intelligent agents in computer games. In Proc. 16th National Conference on Artificial Intelligence, pages 929-930, 1999.

[31] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 1953.

[32] E. Voorhees. The TREC-8 question answering track report. In Proceedings of TREC, 1999.
