A number of systems aim to extract answers from documents. For example, Abney et al. [1] describe a system in which documents returned by the SMART information retrieval system are processed to extract answers. Questions are classified into one of a set of known "question types" that identify the type of entity corresponding to the answer. Documents are tagged to recognize entities, and passages surrounding entities of the correct type for a given question are ranked using a set of heuristics. Two papers [3, 21] present systems that re-rank and post-process the results of regular information retrieval systems with the goal of returning the best passages. These systems use the general approach of retrieving documents or passages that are similar to the original question with variations of standard TF-IDF term-weighting schemes [25]. The most promising passages are chosen from the documents returned using heuristics and/or hand-crafted regular expressions.

Other systems modify queries in order to improve the chance of retrieving answers. Lawrence and Giles [17] introduced Specific Expressive Forms, where questions are transformed into specific phrases that may be contained in answers. For example, the question "what is x" may be transformed into phrases such as "x is" or "x refers to". Joho and Sanderson [15] use a set of hand-crafted query transformations in order to retrieve documents containing descriptive phrases of proper nouns. Agichtein et al. [2] describe a method for learning these transformations and apply their method to web search engines.

Clarke, Cormack, and Lynam [8] describe a system that exploits the redundancy present in their corpus by using the frequency of each candidate answer to "vote" for the answer most likely to be correct. This approach is similar to the base approach of our system.

Recent work has shown that the Web can effectively be used as a general knowledge database for question-answering [9, 11, 22] and other related tasks. Fallman [12] presents a spelling and grammar checking tool that uses the Google search engine as its source of information, allowing it to handle names as well as informal aspects of a language such as idioms and slang expressions.

In contrast to most previous research, where systems are designed to search for an unknown answer, we present a system that aims to select the correct answer from a number of possible answers.

2.2 DECISION MAKING

Decision theory formalizes optimal strategies for human decision making [23], justified on compelling axiomatic grounds [26, 31]. The likelihood of future states is encoded probabilistically; graphical models such as Bayesian networks [14] or influence diagrams [28], which can encode probabilities and utilities compactly, are often used. In Millionaire, the space of possible outcomes is small enough that decision trees [23], which explicitly enumerate probabilities and utilities for all future possibilities, are sufficient.

2.3 GAME PLAYING

Board games have dominated much of the history of AI in game playing [5, 6, 13, 18, 27]. This paper follows instead in the tradition of the crossword-puzzle-solving program PROVERB [16]. Like PROVERB, our Millionaire player brings together technologies from several core areas of artificial intelligence (including information retrieval, natural language parsing, ensemble learning, and decision making) to solve a challenging problem that does not naturally conform to the game-tree method for solving board games. Other domains under recent and rapid investigation that are also not easily amenable to tree enumeration include video games such as Quake [30] and soccer [29].

The Millionaire game has been explored in some previous work. Vankov et al. presented an abstract decision-theoretic model of Millionaire that yields a strategy for a player to maximize expected utility; however, it is still left up to the player to actually answer questions and assess confidence. Rump [24] uses Millionaire as an educational tool to present problems in decision analysis, including probability estimation and calculating expected utility, problems that our system must address. Clarke et al. [8] apply their general-purpose question-answering system to a set of questions asked on the Millionaire TV show (naturally composed of more early-round questions), answering 76 out of 108 questions (70.4%) correctly.

3 PLAYING MILLIONAIRE

Millionaire, a game show on ABC TV in the United States, might be characterized as a cultural phenomenon, spawning catch phrases and even fashion trends. The show originated in the United Kingdom and has since been exported around the world. Computers are explicitly forbidden as contestants on the actual game show by the official rules, negating any dreams we had of showcasing our system alongside Regis on national TV. We wrote our player instead based on a home version of the game: the Millionaire CD-ROM, 3rd edition.

3.1 RULES OF THE GAME
The player faces a sequence of up to fifteen multiple-choice questions of increasing value; an incorrect answer ends the game, and the prize falls back to the last milestone reached, or to nothing if no milestones have been met. Milestones occur at the $1,000 and $32,000 stages, after questions five and ten, respectively. Answering fifteen questions correctly wins the grand prize of one million dollars. The difficulty of the questions (for people) rises along with the dollar value.

At any stage, after seeing the next question, the player may decline to answer and end the game with the current prize total. Alternatively, the player may opt to use any or all available lifelines to obtain help answering the question. Players are allotted three lifelines per game. The three lifelines allow the player to (1) poll the audience, (2) eliminate two incorrect choices, or (3) telephone a friend.

Our system does not address some aspects of Millionaire. In particular, we do not attempt to play the fastest-finger round that determines the next player from a pool of candidate contestants. Winning this round entails being the first to provide the proper ordering of four things by the criteria given in the question (e.g., "Place these states in geographic order from East to West: Wyoming, Illinois, Texas, Florida."). To be competitive, an answer generally must be provided within several seconds. Our question-answering system is neither designed to answer questions of this nature nor capable of answering most questions quickly. We also do not address extraneous tasks that people must perform in order to play the game, including speech recognition, speech synthesis, motor skills, etc.

3.2 CHARACTERISTICS OF QUESTIONS

The Millionaire CD-ROM contains 635 questions like those asked on the TV show. The game places the questions into seven difficulty levels. The lower difficulty levels contain more common sense and common knowledge questions, while the difficult questions tend to be much more obscure. Lifeline information is also provided in the game data and is used in our game model.

For exploring algorithms and tuning parameters, we used three random 90-question samples and one random 150-question sample. Results on these training samples are reported throughout Section 4. Final test results on all 635 questions are reported in Section 5.1.

3.3 OUR PLAYER

Our Millionaire player consists of two main components: a question-answering (QA) module for multiple-choice questions and a decision-making (DM) module. We describe each component in turn below.

4 THE QA MODULE

Our system exploits the redundancy present in text corpora to answer questions. More precisely, we use the idea that question words associated with the answer tend to appear, and are more likely to be repeated, in multiple documents that contain the answer. We use the World Wide Web as our data source and several search engines (most prominently Google) as our conduit to that data.

We bring together several AI techniques from information retrieval, natural language parsing, and ensemble machine learning, as well as some domain-specific heuristics, in order to select answers and generate confidence measures. This information is then fed into the decision-making module, described later, to actually play the game.

4.1 THE NAIVE APPROACH: COUNTING

Our basic approach was to query Google with the question along with each of the four answers. Google enforces a 10-term limit on searches, so we performed stopword filtering on the questions to shorten our queries. Because answers in some cases consisted entirely of stopwords, we did not filter them. The program generated queries in the format answer filtered-question to help ensure that the answer words fit in under Google's 10-term limit.

The response to the question was normally the answer that produced the highest number of search results. However, a number of questions are "inverted" in the sense that the answer is the one that is unlike the other three. We are able to identify nearly all of these by the presence of the word "not" in the question. In such cases, we choose the answer yielding the fewest results. This baseline strategy answers about half of the questions correctly.

To improve on this strategy, we empirically found a small number of query transformations and modifications that increased the percentage of correct responses to 60% (a code sketch follows the list):

• Multiple-word answers are enclosed in quotes to require that they appear as a phrase in any search results.

• "Complete a saying" questions, identified by the presence of one of the strings "According", "said to", or "asked to", were handled by constructing each possible saying from the choices and requiring that it appear in the search results.

• When a query returns no results for any of the answers, we use a series of "fallback" queries that progressively relax the query. Quotes and words were removed from each query until at least one answer produced a non-zero number of search results.

• Longer web pages tend to contain lists of links, essays, manifestos, and stories; in general, their content is less useful for answering questions. Since search engines typically do not provide query syntax for restrictions on page size, we used a first-order approximation and excluded .pdf files from the results.
4.2 WORD PROXIMITY MEASURES

Our next scoring method rewards answers for which question words appear near (within rad words of) answer words. The scoring procedure is shown here as runnable Python reconstructed from the paper's pseudocode, with the proximity window clamped to the document bounds (which the pseudocode leaves implicit):

    def DistanceScore(wordList, qWords, aWords, rad):
        """Score an answer by how often question words (qWords) appear
        within rad words of answer words (aWords) in the document wordList."""
        score = 0.0
        answerWords = 0
        for i in range(len(wordList)):
            if wordList[i] in aWords:
                answerWords += 1
                for j in range(max(0, i - rad), min(len(wordList), i + rad + 1)):
                    if wordList[j] in qWords:
                        # Nearer question words contribute more (linear falloff).
                        score += (rad - abs(i - j)) / rad
        if answerWords == 0:
            return 0.0
        return score / answerWords

[Figure 1: Question-answering accuracy versus proximity radius (strategies p3 through p100, compared with the naive method) on three 90-question samples. Each line represents performance on one sample.]
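For illustration, a toy call with hypothetical inputs (not from the paper's data):

    doc = "mount everest is the highest mountain above sea level".split()
    score = DistanceScore(doc, qWords={"highest", "mountain"},
                          aWords={"everest"}, rad=5)
    # "highest" is 3 words from "everest" (0.4) and "mountain" is 4 away (0.2),
    # so this prints approximately 0.6.
    print(score)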
The table below reports the percentage of questions answered correctly by search engine and scoring method (naive counting, word proximity, noun-phrase proximity, and a combined score):

    engine       naive    proxim   phr prox   combined
    Google       55.6%    55.0%    68.9%      70%
    AllTheWeb    56.1%    51.7%    58.3%      66%
    MSN          44.4%    47.2%    48.9%      58%
    AltaVista    46.7%    55.6%    56.1%      68%

We consider this to be good performance over the unstructured (and not necessarily correct!) data available from the web, supporting our claim that the web can be an effective knowledge base for multiple-choice question-answering.

Combining results across engines could result in better performance. For example, combining Google with AltaVista results in 75% of the 180-question sample being answered correctly.

4.3.2 Confidence-Based Weight Assignments

However, choosing the weights manually was difficult. The optimal weights are probably sample-dependent and prone to overfitting; minor changes often led to 2-4% drops in performance. We modified our formula to assign different weights to each scoring method on a question-by-question basis, using the "confidence" of each scoring strategy S (sketched in code below):

• Let $x_S$ be the "confidence ratio" defined by
$$x_S = \begin{cases} \text{lowest score}/\text{second-lowest score} & \text{if a "not" question,} \\ \text{second-highest score}/\text{highest score} & \text{otherwise.} \end{cases}$$

• Let $T = \sum_S (1 - x_S)$ over all strategies.

5.2 CHOOSING GOOD WEIGHTS

In a few cases, using confidence scores to combine strategies caused the system's accuracy to fall below that of the best single strategy. This probably means that the "confidence ratio" is not a good heuristic for all scoring methods. The ratio is also difficult to compare between different engine-method combinations. For example, AllTheWeb's ratios with the proximity score are consistently low, which translates into high confidence for many questions, even though this strategy only answers about half the questions correctly. Conversely, Google's ratios with the noun-phrase proximity score (which performs excellently) are consistently high, leading to lower confidences. The PROVERB crossword puzzle solver [16], which uses a similar approach to consider candidate answers from multiple experts, avoids this problem by allowing each expert to supply its own estimated confidence explicitly rather than applying a single function to every expert.
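To make the weighting scheme of Section 4.3.2 concrete, the sketch below computes the confidence ratio x_S and combines strategies. The excerpt ends before the paper states how T is used; the normalization w_S = (1 - x_S)/T and the weighted-vote combination shown here are assumptions, and the strategy names and scores are hypothetical.

    def confidence_ratio(scores, not_question=False):
        """x_S as defined above: a ratio near 1 means the top two candidates
        are hard to separate, i.e., the strategy is not confident."""
        s = sorted(scores)  # ascending
        if not_question:    # "not" questions favor the lowest-scoring answer
            return s[0] / s[1] if s[1] else 1.0
        return s[-2] / s[-1] if s[-1] else 1.0

    def combine(strategy_scores, not_question=False):
        """strategy_scores maps strategy name -> one score per candidate answer.
        Assumed rule: each strategy votes for its best answer, weighted by
        w_S = (1 - x_S) / T."""
        x = {s: confidence_ratio(v, not_question)
             for s, v in strategy_scores.items()}
        T = sum(1 - xv for xv in x.values()) or 1.0
        n = len(next(iter(strategy_scores.values())))
        totals = [0.0] * n
        for s, scores in strategy_scores.items():
            key = scores.__getitem__
            best = min(range(n), key=key) if not_question else max(range(n), key=key)
            totals[best] += (1 - x[s]) / T
        return totals

    # Hypothetical example: two strategies scoring four candidate answers.
    print(combine({"count": [120, 40, 15, 9], "proximity": [2.5, 3.1, 0.4, 0.2]}))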
To actually play Millionaire, the system must also decide when to use a lifeline and when to "walk away". In order to compute its best next move, the decision-making module constructs a decision tree [23] that encodes the probabilities and utilities at every possible future state of the game. The full tree consists of decision forks, for choosing whether to answer the question, use a lifeline, or walk away, and chance forks, which encode the uncertainty of answering the questions correctly. The best choice for the program is the action that maximizes expected utility (a sketch of this computation follows below).

Utility is not necessarily synonymous with winnings in dollars.

[Table: distribution of final winnings, from $0 to $1,000,000, over simulated games for several player configurations. Avg. questions right: 5.29; avg. winnings: $26,328.87; avg. winnings comparison: $764.97 vs. $26,328.87 vs. $77,380.90.]
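As a concrete illustration of this computation (a sketch, not the authors' code), the following evaluates a simplified fork at each stage: walk away or answer, given per-question probabilities of being correct from the QA module. Lifelines are omitted, the prize ladder and milestones follow Section 3.1, and the exponential utility with parameter k is an assumed stand-in consistent with the risk-aversion parameter k used in Figures 2, 4, and 5 (k=None models a risk-neutral player).

    import math

    PRIZES = [100, 200, 300, 500, 1000, 2000, 4000, 8000, 16000,
              32000, 64000, 125000, 250000, 500000, 1000000]

    def milestone(stage):
        """Guaranteed winnings if question `stage` (0-based) is missed."""
        if stage >= 10:
            return 32000
        if stage >= 5:
            return 1000
        return 0

    def utility(x, k=250000):
        # Assumed risk-averse utility; pass k=None for a risk-neutral player.
        return x if k is None else 1 - math.exp(-x / k)

    def best_action(stage, p_correct, k=250000):
        """Maximize expected utility over the remaining game: at each stage,
        a decision fork (answer vs. walk away) followed by a chance fork
        (correct vs. incorrect)."""
        def value(s):
            if s == len(PRIZES):              # all fifteen questions answered
                return utility(1000000, k)
            walk = utility(PRIZES[s - 1] if s > 0 else 0, k)
            answer = (p_correct[s] * value(s + 1)
                      + (1 - p_correct[s]) * utility(milestone(s), k))
            return max(walk, answer)
        walk_now = utility(PRIZES[stage - 1] if stage > 0 else 0, k)
        return "answer" if value(stage) > walk_now else "walk away"

    # Hypothetical per-question accuracies, shaped loosely like Section 5's figures.
    p = [0.86] * 3 + [0.75] * 2 + [0.65] * 5 + [0.55] * 5
    print(best_action(10, p), best_action(10, p, k=None))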
[Figure 2: Standard deviation versus average winnings as k ranges from 5,000 to 400,000 and α is fixed at 4. The gray point is a risk-neutral player. As k increases, average winnings and standard deviation both increase.]

[Figure 4: Average number of questions correct versus α. Black points are for a risk-averse player (k = 250,000); gray points are for a risk-neutral player.]

A larger α comes at the cost of many more games (65%) resulting in a $0 prize, as the player is too confident during early questions and saves its lifelines for later use. An α of 4 seems reasonable; about 47% of games result in $0 in that case, and the average winnings are relatively high.

Observe that even if the question answerer could achieve a 95% success rate on early questions, it would still have only a 77% chance of reaching the $1,000 milestone (0.95^5 ≈ 0.77). Its actual performance is worse: it answers 86% correctly at level 1 ($100, $200, and $300) and 75% at level 2 ($500 and $1,000), so without lifelines it would reach the milestone only about 36% of the time (0.86^3 × 0.75^2 ≈ 0.36). Table 3 shows that, as a result, the program often exhausts its lifelines early in the game. On the other hand, we believe our program would have the upper hand against most people in a one-question, level-7, winner-takes-all match.
[Figure 5: Average winnings versus handicap (number of given questions). Black points are for a risk-averse player (k = 250,000); gray points are for a risk-neutral player.]

Human contestants on the show are a select group (and likely practiced for the game). Table 4 shows that even with the handicap, the program's performance is more variable than a human's, both winning big and losing early more often than people. However, its performance is still comparable, with "only" six "easy" questions separating the program from human-level performance.

Combined with the decision-making module and given a six-question handicap, our system plays the game about as well as people.

We believe that our system can be marginally improved in a variety of ways: for example, by employing better schemes for weighting multiple scoring methods, or by narrowing down the domain of a question and using domain-specific search strategies. We are also excited about the potential promised by approaches for structuring web data [4], although we believe that advances in automatic techniques for applying such structure (e.g., better natural language processing and common sense reasoning [19]) will be required for these approaches to succeed.

The call for such advances is a familiar one. From natural language processing to computer vision, a similar barrier exists across many subfields of AI: easy tasks (for people) are hard, and hard tasks easy. While statistical and brute-force methods can go a long way toward matching human performance, an often difficult-to-bridge gap remains.

References
[11] S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng. Web question answering: Is more always better? In Proceedings of SIGIR, 2002.

[12] D. Fallman. The Penguin: Using the web as a database for descriptive and dynamic grammar and spell checking. In Proceedings of CHI, 2002.

[13] F. Hsu, M. S. Campbell, and A. J. Hoane. Deep Blue system overview. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 240-244, July 1995.

[14] F. V. Jensen. An Introduction to Bayesian Networks. Springer, New York, 1996.

[15] H. Joho and M. Sanderson. Retrieving descriptive phrases from large amounts of free text. In Proceedings of CIKM, 2000.

[16] G. A. Keim, N. M. Shazeer, M. L. Littman, S. Agarwal, C. M. Cheves, J. Fitzgerald, J. Grosland, F. Jiang, S. Pollard, and K. Weinmeister. Proverb: The probabilistic cruciverbalist. In Proc. 16th National Conference on Artificial Intelligence, pages 710-717, 1999.

[17] S. Lawrence and C. L. Giles. Context and page analysis for improved web search. IEEE Internet Computing, 2(4):38-46, 1998.

[18] K. F. Lee and S. Mahajan. The development of a world class Othello program. Artificial Intelligence, 43:21-36, 1990.

[27] J. Schaeffer, J. C. Culberson, N. Treloar, B. Knight, P. Lu, and D. Szafron. A world championship caliber checkers program. Artificial Intelligence, 53(2-3):273-289, 1992.

[28] R. D. Shachter. Probabilistic inference and influence diagrams. Operations Research, 36(4):589-604, 1988.

[29] P. Stone and R. Sutton. Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the 18th ICML, 2001.

[30] M. van Lent, J. Laird, J. Buckman, J. Hartford, S. Houchard, K. Steinkraus, and R. Tedrake. Intelligent agents in computer games. In Proc. 16th National Conference on Artificial Intelligence, pages 929-930, 1999.

[31] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 1953.

[32] E. Voorhees. The TREC-8 question answering track report. In Proceedings of TREC, 1999.