1 Billion Pages = 1 Million Dollars?


Mining the Web to Play "Who Wants to be a Millionaire?"

Shyong (Tony) K. Lam, Computer Science Dept., University of Minnesota, Minneapolis, MN 55455, lam@cs.umn.edu
David M. Pennock*, Overture Services, Inc., 74 N. Pasadena Ave., 3rd floor, Pasadena, CA 91101, david.pennock@overture.com
Dan Cosley, Computer Science Dept., University of Minnesota, Minneapolis, MN 55455, cosley@cs.umn.edu
Steve Lawrence, NEC Laboratories America, Princeton, NJ 08540, lawrence@necmail.com

*This work conducted at NEC Laboratories America, Princeton, NJ.

Abstract

We exploit the redundancy and volume of information on the web to build a computerized player for the ABC TV game show "Who Wants To Be A Millionaire?". The player consists of a question-answering module and a decision-making module. The question-answering module utilizes question transformation techniques, natural language parsing, multiple information retrieval algorithms, and multiple search engines; results are combined in the spirit of ensemble learning using an adaptive weighting scheme. Empirically, the system correctly answers about 75% of questions from the Millionaire CD-ROM, 3rd edition - general-interest trivia questions often about popular culture and common knowledge. The decision-making module chooses from allowable actions in the game in order to maximize expected risk-adjusted winnings, where the estimated probability of answering correctly is a function of past performance and confidence in correctly answering the current question. When given a six-question head start (i.e., when starting from the $2,000 level), we find that the system performs about as well on average as humans starting at the beginning. Our system demonstrates the potential of simple but well-chosen techniques for mining answers from unstructured information such as the web.

1 INTRODUCTION

Machine competence in games has long served as a benchmark for progress in artificial intelligence (AI). While we seem hardly close to building systems capable of passing a full-blown Turing Test, machine excellence in a growing number of games signals incremental progress. Games such as chess [13], checkers [27], Othello [7, 18], and Go [5] are formal enough to be solvable in principle, though they are far from trivial to master in practice due to exponential-size search spaces. In chess, checkers, and backgammon, current machine players rival their best human competitors. Recently, attention has turned to less structured game environments, like crossword puzzles [16], video games [30], and soccer [29], where game states, actions, or both are not easily enumerable, making a pure search formulation unnatural or impractical.

"Who Wants to be a Millionaire?" is a trivia game where actions are enumerable, though competence depends on the ability to answer general-interest questions - often requiring common sense or knowledge of popular culture - and to make decisions based on confidence, expected reward, and risk attitude. True human-level competence at Millionaire will likely require excellence in natural language processing and common sense reasoning. We present a first-order system that exploits the breadth and redundancy of information available on the World Wide Web to answer questions and estimate confidence, and utilizes a decision-theoretic subsystem to choose actions to maximize expected risk-adjusted payoffs.

2 RELATED WORK

2.1 QUESTION ANSWERING

A large body of research exists on question answering. For example, see the Question-Answering Track [32] of the Text REtrieval Conference (TREC). Systems in this track compete against each other to retrieve short (50 or 250 byte long) answers to a set of test questions.

Question-answering systems typically decompose the problem into two main steps: retrieving documents that may contain answers, and extracting answers from these documents. For the first part of the task, retrieving a set of promising documents from a collection, the systems in the TREC QA track submitted the original questions to various information retrieval systems [32].

A number of systems aim to extract answers from documents. For example, Abney et al. [1] describe a system in
which documents returned by the SMART information retrieval system are processed to extract answers. Questions are classified into one of a set of known "question types" that identify the type of entity corresponding to the answer. Documents are tagged to recognize entities, and passages surrounding entities of the correct type for a given question are ranked using a set of heuristics. Two papers [3, 21] present systems that re-rank and post-process the results of regular information retrieval systems with the goal of returning the best passages. These systems use the general approach of retrieving documents or passages that are similar to the original question with variations of standard TF-IDF term-weighting schemes [25]. The most promising passages are chosen from the documents returned using heuristics and/or hand-crafted regular expressions.

Other systems modify queries in order to improve the chance of retrieving answers. Lawrence and Giles [17] introduced Specific Expressive Forms, where questions are transformed into specific phrases that may be contained in answers. For example, the question "what is x" may be transformed into phrases such as "x is" or "x refers to". Joho and Sanderson [15] use a set of hand-crafted query transformations in order to retrieve documents containing descriptive phrases of proper nouns. Agichtein et al. [2] describe a method for learning these transformations and apply their method to web search engines.

Clarke, Cormack, and Lynam [8] describe a system that exploits the redundancy present in their corpus by using the frequency of each candidate answer to "vote" for the answer most likely to be correct. This approach is similar to the base approach of our system.

Recent work has shown that the Web can effectively be used as a general knowledge database for question-answering [9, 11, 22] and other related tasks. Fallman [12] presents a spelling and grammar checking tool that uses the Google search engine as its source of information, allowing it to handle names as well as informal aspects of a language such as idioms and slang expressions.

In contrast to most previous research, where systems are designed to search for an unknown answer, we present a system that aims to select the correct answer from a number of possible answers.

2.2 DECISION MAKING

Decision theory formalizes optimal strategies for human decision making [23], justified on compelling axiomatic grounds [26, 31]. The likelihood of future states is encoded as a subjective probability distribution and the value of future state-action pairs is encoded as a utility function; the decision maker optimizes by choosing actions that maximize future expected utility. A growing subfield in AI employs decision theory as a framework for designing autonomous agents. When the agent's state space grows unmanageably large - as in many real-world settings - graphical models such as Bayesian networks [14] or influence diagrams [28] that can encode probabilities and utilities compactly are often used. In Millionaire, the space of possible outcomes is small enough that decision trees [23], which explicitly enumerate probabilities and utilities for all future possibilities, are sufficient.

2.3 GAME PLAYING

Board games have dominated much of the history of AI in game playing [5, 6, 13, 18, 27]. This paper follows instead in the tradition of the crossword-puzzle-solving program PROVERB [16]. Like PROVERB, our Millionaire player brings together technologies from several core areas of artificial intelligence (including information retrieval, natural language parsing, ensemble learning, and decision making) to solve a challenging problem that does not naturally conform to the game-tree method for solving board games. Other domains under recent and rapid investigation - that are also not easily amenable to tree enumeration - include video games such as Quake [30] and soccer [29].

The Millionaire game has been explored in some previous work. Vankov et al. presented an abstract decision-theoretic model of Millionaire that yields a strategy for a player to maximize expected utility; however, it is still left up to the player to actually answer questions and assess confidence.¹ Rump [24] uses Millionaire as an educational tool to present problems in decision analysis, including probability estimation and calculating expected utility, problems that our system must address. Clarke et al. [8] apply their general-purpose question-answering system to a set of questions asked on the Millionaire TV show (naturally composed of more early-round questions), answering 76 out of 108 questions (70.4%) correctly.

¹The paper describing the system, unfortunately, has been removed from the web.

3 PLAYING MILLIONAIRE

Millionaire, a game show on ABC TV in the United States, might be characterized as a cultural phenomenon, spawning catch phrases and even fashion trends. The show originated in the United Kingdom and has since been exported around the world. Computers are explicitly forbidden as contestants on the actual game show by the official rules, negating any dreams we had of showcasing our system alongside Regis on national TV. We wrote our player instead based on a home version of the game: the Millionaire CD-ROM, 3rd edition.

3.1 RULES OF THE GAME

In Millionaire, the player is asked a series of multiple-choice trivia questions. Each correct answer roughly doubles the current prize. An incorrect answer ends the game and reduces the prize to the amount associated with the last correctly-answered "milestone" question, or zero if
no milestones have been met. Milestones occur at the $1,000 and $32,000 stages, after questions five and ten, respectively. Answering fifteen questions correctly wins the grand prize of one million dollars. The difficulty of the questions (for people) rises along with the dollar value.

At any stage, after seeing the next question, the player may decline to answer and end the game with the current prize total. Alternatively, the player may opt to use any or all available lifelines to obtain help answering the question. Players are allotted three lifelines per game. The three lifelines allow the player to (1) poll the audience, (2) eliminate two incorrect choices, or (3) telephone a friend.

Our system does not address some aspects of Millionaire. In particular, we do not attempt to play the fastest finger round that determines the next player from a pool of candidate contestants. Winning this round entails being the first one to provide the proper ordering of four things by the criteria given in the question (e.g., "Place these states in geographic order from East to West: Wyoming, Illinois, Texas, Florida."). To be competitive, an answer generally must be provided within several seconds. Our question-answering system is neither designed to answer questions of this nature nor is it capable of answering most questions quickly. We also do not address extraneous tasks that people must perform in order to play the game, including speech recognition, speech synthesis, motor skills, etc.
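For concreteness, the prize structure and the milestone fall-back rule just described can be captured in a few lines of Python. The sketch below is our own illustration rather than code from the system; the dollar ladder is the standard one implied by the stages reported in Tables 3 and 4.

PRIZE_LADDER = [100, 200, 300, 500, 1_000, 2_000, 4_000, 8_000,
                16_000, 32_000, 64_000, 125_000, 250_000, 500_000, 1_000_000]
MILESTONES = {1_000, 32_000}  # amounts guaranteed after questions five and ten


def prize_if_wrong(questions_answered):
    # Prize kept after a wrong answer: the highest milestone already reached.
    banked = 0
    for prize in PRIZE_LADDER[:questions_answered]:
        if prize in MILESTONES:
            banked = prize
    return banked


def prize_if_walk(questions_answered):
    # Prize kept when the player declines to answer the next question.
    return PRIZE_LADDER[questions_answered - 1] if questions_answered else 0


assert prize_if_wrong(7) == 1_000   # missed the $8,000 question while holding $4,000
assert prize_if_walk(7) == 4_000    # walked away instead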
yielding the fewest results. This baseline strategy answers
3.2 CHARACTERISTICS OF QUESTIONS

The Millionaire CD-ROM game contains 635 questions that are roughly comparable in nature and difficulty to those on the TV show. The game places the questions into seven difficulty levels. The lower difficulty levels contain more common sense and common knowledge questions, while the difficult questions tend to be much more obscure. Lifeline information is also provided in the game data and is used in our game model.

For exploring algorithms and tuning parameters, we used three random 90-question samples and one random 180-question sample. Various results on these training samples are reported throughout Section 4. Final test results on all 635 questions are reported in Section 5.1.

3.3 OUR PLAYER

Our Millionaire player consists of two main components, a question-answering (QA) module for multiple-choice questions and a decision-making (DM) module. We describe each component in turn below.

4 THE QA MODULE

Our system exploits the redundancy present in text corpora to answer questions. More precisely, we use the idea that question words associated with the answer tend to appear and are more likely to be repeated in multiple documents that contain the answer. We use the World Wide Web as our data source and several search engines (most prominently Google) as our conduit to that data.

We bring together several AI techniques from information retrieval, natural language parsing, and ensemble machine learning, as well as some domain-specific heuristics, in order to select answers and generate confidence measures. This information is then fed into the decision-making module, described later, to actually play the game.

4.1 THE NAIVE APPROACH: COUNTING

Our basic approach was to query Google with the question along with each of the four answers. Google enforces a 10-term limit on searches, so we performed stopword filtering on the questions to shorten our queries. Because answers were entirely comprised of stopwords in some cases, we did not filter them. The program generated queries in the format answer filtered-question to help ensure that the answer words fit in under Google's 10-term limit.

The response to the question was normally the answer that produced the highest number of search results. However, a number of questions are "inverted" in the sense that the answer is the one that is unlike the other three. We are able to identify nearly all of these by the presence of the word "not" in the question. In such cases, we choose the answer yielding the fewest results. This baseline strategy answers about half of the questions correctly.

4.1.1 Simple Query Modifications

To improve on this strategy, we empirically found a small number of query transformations and modifications that increased the percentage of correct responses to 60%; a combined sketch follows the list.

• Multiple-word answers are enclosed in quotes to require that they appear as a phrase in any search results.

• "Complete a saying" questions, identified by the presence of one of the strings "According", "said to", or "asked to", were handled by constructing each possible saying from the choices and requiring that it appear in the search results.

• When a query returns no results for any of the answers, we use a series of "fallback" queries that progressively relax the query. Quotes and words were removed from each query until at least one answer produced a non-zero number of search results.

• Longer web pages tend to contain lists of links, essays, manifestos, and stories; in general, their content is less useful for answering questions. Since search engines typically do not provide query syntax for restrictions on page size, we used a first-order approximation where we excluded .pdf files from the results.

4.2 WORD PROXIMITY MEASURES

Our heuristics for finding phrases are a specific variation of the general strategy of using proximity. Our belief - and that of many of the teams working on the TREC question-answering track [32] - is that not only do answers appear in the same documents as questions, but that they usually appear near the question words. In order to test proximity measures, we downloaded the first 10 (or all, if there were fewer than 10) pages Google returned for each query. We score each document based on a heuristic named DistanceScore that gives more credit to question words that appear closer to answer words in the document. Each such question word contributes between 0 and 1 to the score, depending on how close the word is. A radius parameter controls what is considered near and how much a word adds to the score. We use the average score per answer word in the document to further penalize documents where answer words appear frequently but question words do not. Table 1 gives pseudo-code for DistanceScore.

Table 1: Pseudocode for DistanceScore, our proximity scoring method for favoring question words that appear near (within rad words of) answer words.

    // wordList is the document split at spaces
    DistanceScore(wordList, qWords, aWords, rad)
        score, answerWords = 0
        for i = 1 to |wordList| do
            if wordList[i] is in aWords then
                answerWords = answerWords + 1
                for j = (i - rad) to (i + rad) do
                    if wordList[j] is in qWords then
                        score += (rad - abs(i - j)) / rad
        if answerWords == 0 then return 0
        else return score / answerWords
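The pseudocode in Table 1 translates almost line for line into Python; the version below is our rendering, not the authors' code.

def distance_score(word_list, q_words, a_words, rad):
    # word_list: a document split at spaces; q_words / a_words: the question
    # and answer words; rad: the proximity radius.
    score = 0.0
    answer_words = 0
    for i, word in enumerate(word_list):
        if word in a_words:
            answer_words += 1
            lo = max(0, i - rad)
            hi = min(len(word_list) - 1, i + rad)
            for j in range(lo, hi + 1):
                if word_list[j] in q_words:
                    score += (rad - abs(i - j)) / rad
    return score / answer_words if answer_words else 0.0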
Figure 1 shows the performance of DistanceScore at various values of the radius on three 90-question samples, along with the performance of the naive method. Small random question samples were used to reduce the download and computation time required. DistanceScore performs reasonably well, doing worse than the naive method at low radius values but overtaking it at higher ones.

Figure 1: Question-answering accuracy versus proximity radius when using DistanceScore, as compared to the naive method on three 90-question samples. Each line represents performance on one sample.

4.2.1 A Third Expert: Noun-Phrase Proximity

We developed a third strategy, also based on proximity. Since requiring multi-word answers to appear as phrases in web pages improved the accuracy of the naive method, another plausible strategy is to do the same for each of the noun phrases contained in the question. Noun phrases were identified using simple heuristics based on Brill's part-of-speech tagger. We submitted each {noun-phrase, answer} pair to Google and scored the results the same way as before. The result-count method produced poor results; however, downloading the returned documents and using DistanceScore to score each document worked well and produced results comparable to the previous two strategies.

4.3 COMBINING STRATEGIES

Among the naive, DistanceScore based on naive, and DistanceScore based on noun phrase strategies, at least one has the correct answer for about 85% of the questions in a 180-question sample. To exploit this, we look to answer-combining ("ensemble") approaches used commonly in machine learning, as summarized in [10].

Using the following formula, we attempt to combine our three strategies, or "experts," and produce a single score for each possible answer:

    c_i = Σ w_S * (S_i / max{S_1..n})    (summed over all strategies S)

where c_i is the combined score for answer i, w_S is the weight for strategy S, S_i is the score for strategy S for answer i, and n is the number of candidate answers.

Using the above formula to score candidate answers, we were able to reach 70% performance on the question sample. The weights yielding this, found empirically, were around ±0.05 of w_n = 0.40, w_p = 0.15, w_pp = 0.45 for the naive, word proximity, and noun phrase proximity strategies, respectively.
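In code, the combination rule amounts to normalizing each strategy's scores by that strategy's best score and taking a weighted sum per answer. The sketch below is illustrative; only the hand-tuned weights shown are taken from the text.

def combine_scores(strategy_scores, weights):
    # strategy_scores: {strategy name: [score for each candidate answer]}
    # weights:         {strategy name: w_S}
    n = len(next(iter(strategy_scores.values())))
    combined = [0.0] * n
    for name, scores in strategy_scores.items():
        top = max(scores) or 1.0   # guard against an all-zero strategy
        for i, s in enumerate(scores):
            combined[i] += weights[name] * (s / top)
    return combined


# Hand-tuned weights reported above (naive, word proximity, noun-phrase proximity).
HAND_TUNED = {"naive": 0.40, "proximity": 0.15, "phrase_proximity": 0.45}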

4.3.1 Combining Search Engines

In addition to combining strategies, we investigated using multiple search engines to improve results. We modified each of the three strategies to submit queries to AllTheWeb, MSN Search, and AltaVista, using syntax appropriate for each engine. The scores are combined using the same formula as above. Table 2 shows the results for each strategy using each search engine.

Table 2: Performance of the three strategies using different search engines.

    engine       naive    proxim   phr prox   combined
    Google       55.6%    55.0%    68.9%      70%
    AllTheWeb    56.1%    51.7%    58.3%      66%
    MSN          44.4%    48.9%    47.2%      58%
    AltaVista    46.7%    55.6%    56.1%      68%

Google performs better than the other engines individually. However, we can combine the results from multiple engines, much as we combined the opinions of multiple strategies. Manually choosing a single set of weights for each {method, engine} pair showed that combining results across engines could result in better performance. For example, combining Google with AltaVista results in 75% of the 180-question sample being answered correctly.
4.3.2 Confidence-Based Weight Assignments

However, choosing the weights manually was difficult. The optimal weights are probably sample-dependent and prone to overfitting; minor changes often led to 2-4% drops in performance. We modified our formula to assign different weights to each scoring method on a question-by-question basis, using the "confidence" of each scoring strategy S:

• Let x_S be the "confidence ratio" defined by

    x_S = lowest score / second-lowest score,     if a "not" question;
    x_S = second-highest score / highest score,   otherwise.

• Let T = Σ (1 - x_S^2) over all strategies.

• The weight for strategy S is w_S = (1 - x_S^2) / T.

This assigns higher weights to more confident scoring methods. We chose the ratio between the second-best and best scores because we found a large difference in the ratio when the correct answer has the best score (mean ratio of 0.34) versus when the incorrect answer has the best score (mean of 0.58). Using these confidence-based weights generally results in slightly worse performance than hand-tuned weights, with Google falling to 69%, AltaVista to 65%, and the combination falling to 74% on the 180-question sample. Nonetheless, we believe that automatic confidence-based weights are more robust and less prone to overfitting than hand-tuned weights.
5 DISCUSSION: QA MODULE

Below we discuss several issues that came up in the course of building the question-answering subsystem, and ways in which it could be improved.

5.1 OVERALL PERFORMANCE

We used confidence-based weights with the three-strategy method on the entire set of 635 Millionaire questions. The Google-based question-answerer got 72.3% of the questions correct, while one that used Google and AltaVista got 76.4%. On a set of 50 non-Millionaire trivia questions obtained from the shareware trivia game "AZ Trivia," the Google- and AltaVista-based answerer answered 72% of the questions correctly.

We consider this to be good performance over the unstructured (and not necessarily correct!) data available from the web, supporting our claim that the web can be an effective knowledge base for multiple-choice question-answering.

5.2 CHOOSING GOOD WEIGHTS

In a few cases, using confidence scores to combine strategies caused the system's accuracy to fall below that of the best single strategy. This probably means that the "confidence ratio" is not a good heuristic for all scoring methods. The ratio is also difficult to compare between different engine-method combinations. For example, AllTheWeb's ratios with the proximity score are consistently low, which translates into high confidence for many questions - even though this strategy only answers about half the questions correctly. Conversely, Google's ratios with the noun-phrase proximity score (which performs excellently) are consistently high, leading to lower confidences. The PROVERB crossword puzzle solver [16], which utilizes a similar approach to consider candidate answers from multiple experts, avoids this problem by allowing each expert to supply its own estimated confidence explicitly rather than applying a single function to every expert.

5.3 SAMPLE "PROBLEM" QUESTIONS

It appears that we have run into another example of the 80-20 rule. About one-quarter of the Millionaire questions are "hard" for the program. Below are examples of such questions that suggest areas in which a program trying to use the web as a knowledge base would need to improve.

Common Sense. How many legs does a fish have? 0, 1, 2, or 4? This information may exist on the web, but is probably not spelled out.

Multiple Plausible Answers. What does the letter "I" stand for in the computer company name "IBM"? Information, International, Industrial, or Infrastructure? "Information" probably appears just as often as "international" in the context of IBM.

Polysemy. Which of these parts of a house shares its name with a viewing area on a computer screen? Wall, Root, Window, or Basement? The words "root" and "computer" often co-occur (e.g., the Unix superuser). This question also suggests that biases in the content of the web - originally by and for technical, computer-literate users - may hamper using the web as a general knowledge base in some instances.

Non-Textual Knowledge. Which of these cities is located in Russia? Kiev, Minsk, Odessa, or Omsk? The program doesn't know how to read maps.

Alternative Representations. Who is Flash Gordon's archenemy? Doctor Octopus, Sinestro, Ming the Merciless, or Lex Luthor? The word "archenemy" usually appears as two words ("arch enemy") on Flash Gordon (and other) pages.

6 THE DM MODULE

Answering questions is only half the battle. In order to actually play Millionaire, the system must also decide when to use a lifeline and when to "walk away". To compute its best next move, the decision-making module constructs a decision tree [23] that encodes the probabilities and utilities at every possible future state of the game. The full tree consists of decision forks for choosing whether to answer the question, use a lifeline, or walk away, and chance forks to encode the uncertainty of answering the questions correctly. The best choice for the program is the action that maximizes expected utility.

Utility is not necessarily synonymous with winnings in dollars. For example, suppose a contestant is at the $500,000 level. Even if he or she believes that by answering the final question his or her chances are fifty-fifty of winning either $1 million or $32,000 (expected value $516,000), the contestant will almost surely walk away with a guaranteed $500,000 instead. To model such risk-aversion we give the agent an exponential utility function u(x) = 1 - e^(-x/k). For any finite k > 0, the agent exhibits risk-averse behavior, though as k → ∞, the agent becomes risk neutral (i.e., maximizes expected dollar value). In general, after playing many games, more risk-averse agents will earn less prize money on average, though will have a smaller variation (standard deviation) of winnings.
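To make the risk-aversion argument concrete, here is a small sketch (ours, not the system's code) of the exponential utility and the $500,000 example above, using k = 250,000, the risk-averse setting used later in Section 6.2.

import math


def utility(x, k=250_000.0):
    # Exponential utility u(x) = 1 - exp(-x/k); risk-averse for any finite k > 0.
    return 1.0 - math.exp(-x / k)


def expected_utility(outcomes, k=250_000.0):
    # outcomes: list of (probability, dollar prize) pairs.
    return sum(p * utility(v, k) for p, v in outcomes)


# A 50/50 shot at $1,000,000 or $32,000 has expected value $516,000, but in
# utility terms it is worth far less than a sure $500,000.
walk = utility(500_000)                                       # ~0.86
answer = expected_utility([(0.5, 1_000_000), (0.5, 32_000)])  # ~0.55
assert walk > answer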

6.1 MODELING THE GAME

We use the following specifications to construct the decision tree and play the game (a sketch of the probability model follows the list):

• For all questions beyond the current question, chance nodes are assigned probabilities based on historical past performance on a sample of questions from the associated difficulty level.

• For the current question (i.e., after the question has been asked and analyzed), the current chance node probability is 1 - x^a, where x is the ratio between the second-highest score and the highest score obtained from the question-answering module (or the lowest score and the second-lowest score for "not" questions), and a is a tunable parameter that will be examined later. This lets us estimate confidence in our answer to the specific question being asked.

• The estimated future effect of lifelines on probability p is given by the function f(p) = -p^2 + 2p, or the lifeline's performance based on historical data, whichever is greater. This models the idea that using a lifeline should raise the estimated probability of getting the question correct.

• When lifelines are used by the player, a new response and confidence level are calculated based on the new information received. For the 50/50 lifeline, the new response is simply the remaining choice with the higher score. The phone-a-friend and poll-the-audience lifelines are taken as an additional "expert" with a weight based on historical data.
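A compact restatement of the probability model in the two middle bullets, again as our own sketch rather than the authors' code:

def answer_probability(confidence_ratio, a=4.0):
    # Chance-node probability for the current question: 1 - x**a, where x is
    # the QA module's confidence ratio and a is the tunable exponent.
    return 1.0 - confidence_ratio ** a


def lifeline_boost(p, historical=0.0):
    # Estimated effect of a lifeline: f(p) = -p**2 + 2*p, or the lifeline's
    # historical success rate, whichever is greater.
    return max(-p * p + 2.0 * p, historical)


# With a = 4 (the value settled on in Section 6.2), the mean confidence ratios
# reported in Section 4.3.2 map to roughly p = 0.99 when the top answer is
# correct (x = 0.34) and p = 0.89 when it is not (x = 0.58).
print(answer_probability(0.34), answer_probability(0.58))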

6.2 PLAYING THE GAME: RESULTS

Table 3 shows the results of a risk-averse player (k = 250,000) playing 10,000 games using the above model and the question-answerer that combines Google and AltaVista. Questions were selected randomly from all the available questions in the appropriate difficulty level for each stage.

Table 3: Results of playing 10,000 games with k = 250,000 and a = 4. The columns show the current prize level, the number of games ending at that level, the numbers of correctly- and incorrectly-answered questions, the number of times the player "walked away", the number of lifelines used, the number of lifelines that caused the player to change its answer to the correct one, and the number of lifelines that misled the player. On average the player answered 5.29 questions correctly and won $26,328.87 per game. (The per-level counts are not reproduced here.)

Figure 2 summarizes the relationship between k, average winnings, and standard deviation. The more risk-neutral the program is, the more it wins, and the more its winnings vary between games. Note that these points lie essentially along an efficient frontier (i.e., any gain in expected value necessitates an increase in risk [20]).

Figure 2: Standard deviation versus average winnings as k ranges from 5,000 to 400,000 and a is fixed at 4. The gray point is a risk-neutral player. As k increases, average winnings and standard deviation both increase.

We also explored the effects of changes in a, the exponent in the function used to convert confidence ratios into probabilities. Figure 3 graphs average winnings versus a and Figure 4 graphs the average number of correctly answered questions versus a. Using a higher a raises the program's estimated probability of answering a question correctly. Choosing a too low or too high hinders game performance since the program chooses to stop too soon or incorrectly answers questions that it is overconfident about. While high a values can produce high average winnings, this comes at the cost of many more games (65%) resulting in a $0 prize, as the player is too confident during early questions and saves its lifelines for later use. An a of 4 seems reasonable; about 47% of games result in $0 in that case, and the average winnings are relatively high.

Figure 3: Average winnings versus a. Black points are for a risk-averse player (k = 250,000); gray points are for a risk-neutral player.

Figure 4: Average number of questions correct versus a. Black points are for a risk-averse player (k = 250,000); gray points are for a risk-neutral player.

7 DISCUSSION: DM MODULE

7.1 HARD QUESTIONS EASY, EASY ONES HARD

Table 4 compares the program's winnings to humans' winnings based on data from the ABC website as of mid-July 2001. A striking feature of the program's performance is how often it wins nothing compared to people. Humans almost always answer the first several questions correctly; however, some are so obvious that the question-answerer cannot find the correct answer on the web. People generally do not encode common knowledge into their web documents. As a result, while the web seems to be a good knowledge repository for general knowledge, it is more difficult to use it as a common-sense database.

Table 4: Human performance on the television show as reported on ABC's website in July 2001, compared to the computer's performance when given no handicap, and a six-question handicap.

    Stage      human (pct)    computer (pct)   6-handicap (pct)
    0          14 (2.0%)      4676 (46.8%)     0 (0.0%)
    100        0 (0.0%)       0 (0.0%)         0 (0.0%)
    200        0 (0.0%)       0 (0.0%)         0 (0.0%)
    300        0 (0.0%)       5 (0.1%)         0 (0.0%)
    500        0 (0.0%)       7 (0.1%)         0 (0.0%)
    1000       195 (28.6%)    3700 (37.0%)     5447 (54.5%)
    2000       0 (0.0%)       42 (0.4%)        0 (0.0%)
    4000       4 (0.6%)       46 (0.5%)        61 (0.6%)
    8000       9 (1.3%)       97 (1.0%)        249 (2.5%)
    16000      40 (5.9%)      76 (0.8%)        231 (2.3%)
    32000      166 (24.3%)    815 (8.2%)       2337 (23.4%)
    64000      92 (13.5%)     37 (0.4%)        139 (1.4%)
    125000     89 (13.0%)     99 (1.0%)        370 (3.7%)
    250000     48 (7.0%)      156 (1.6%)       504 (5.0%)
    500000     18 (2.6%)      125 (1.3%)       311 (3.1%)
    1000000    8 (1.2%)       115 (1.2%)       351 (3.5%)
    Avg. winnings: $76,497 vs. $26,328.87 vs. $77,380.90

Observe that even if the question answerer could achieve a 95% success rate on early questions, it would still only have a 77% chance of achieving the $1,000 milestone (0.95^5 ≈ 0.77). Its actual performance is worse, correctly answering 86% at level 1 ($100, $200, and $300) and 75% at level 2 ($500 and $1,000). Table 3 shows that as a result the program often exhausts its lifelines early in the game. On the other hand, we believe our program would have the upper hand against most people in a one-question, level 7, winner-takes-all match.

7.2 SIX QUESTIONS TO HUMAN

We might ask how well the program fares when given a handicap - that is, assuming that the program is able to answer the first N questions correctly without using any lifelines. Figure 5 graphs the program's winnings versus its handicap. With a six-question head start (going for $4,000) and all lifelines remaining, a risk-averse computer player (k = 250,000) averages $77,381 with a standard deviation of $202,296.

Figure 5: Average winnings versus handicap. Black points are for a risk-averse player (k = 250,000); gray points are for a risk-neutral player.

Data from ABC's website as of mid-July 2001 indicates that people on the show won about $76,497 on average with a standard deviation of $140,441. This suggests that, given a six-question handicap, the program performs about as well as qualified human players (i.e., those who self-selected to play the game, passed stringent entrance tests,
and likely practiced for the game). Table 4 shows that even with the handicap, the program's performance is more variable than a human's, both winning big and losing early more often than people. However, its performance is still comparable, with "only" six "easy" questions separating the program from human-level performance.

8 OTHER APPLICATIONS

While designed to play Millionaire, our system has other, more practical applications. The most straightforward is simply as a general-purpose question-answering system that can answer questions, provided a small pool of candidate answers can be provided or generated by some other means.

Combining the question-answerer with the decision maker can be useful in domains where a non-trivial penalty exists for answering a question incorrectly. For example, our system could be adapted to take the Scholastic Aptitude Test (SAT), an exam where answering a question incorrectly results in a lower score than not answering.

The general strategy of using search engines to mine the web as a giant text corpus shows promise in a number of areas. For example, web sites which provide content in multiple languages could become a knowledge base for automatic translation. Natural language processing programs could use the web as a corpus to help disambiguate parsing, or to find commonly occurring close matches to ungrammatical sentences.

9 CONCLUSIONS AND FUTURE WORK

We find that the web is effective as a knowledge base for answering generic multiple-choice questions. Naive methods that simply count search engine results do surprisingly well; more sophisticated methods that employ simple query modifications, identify noun phrases, measure proximity between question and answer words, and combine results from multiple engines do even better, attaining about 75% accuracy on Millionaire questions. When coupled with a decision-making module and given a six-question handicap, our system plays the game about as well as people.

We believe that our system can be marginally improved in a variety of ways: for example, by employing better schemes for weighting multiple scoring methods, or by narrowing down the domain of a question and using domain-specific search strategies. We are also excited about the potential promised by approaches for structuring web data [4], although we believe that advances in automatic techniques for applying such structure (e.g., better natural language processing and common sense reasoning [19]) will be required for these approaches to succeed.

The call for such advances is a familiar one. From natural language processing to computer vision, a similar barrier exists across many subfields of AI: easy tasks (for people) are hard and hard tasks easy. While statistical and brute-force methods can go a long way toward matching human performance, an often difficult-to-bridge gap remains.

References

[1] S. Abney, M. Collins, and A. Singhal. Answer extraction. In ANLP, 2000.

[2] E. Agichtein, S. Lawrence, and L. Gravano. Learning search engine specific query transformations for question answering. In Proceedings of WWW, pages 169-178, 2001.

[3] D. Aliod, J. Berri, and M. Hess. A real world implementation of answer extraction. In Proceedings of the 9th International Workshop on Database and Expert Systems, Workshop: Natural Language and Information Systems (NLIS-98), 1998.

[4] T. Berners-Lee, J. Hendler, and O. Lassila. The semantic web. Scientific American, 284(5):34-43, May 2001.

[5] B. Bouzy and T. Cazenave. Computer Go: An AI oriented survey. Artificial Intelligence, 132(1):39-103, 2001.

[6] M. Buro. Methods for the evaluation of game positions using examples. PhD thesis, University of Paderborn, Germany, 1994.

[7] M. Buro. The Othello match of the year: Takeshi Murakami vs. Logistello. ICCA Journal, 20(3):189-193, 1997.

[8] C. L. A. Clarke, G. V. Cormack, and T. R. Lynam. Exploiting redundancy in question answering. In Proceedings of SIGIR, New Orleans, September 2001.

[9] C. L. A. Clarke, G. V. Cormack, T. R. Lynam, C. M. Li, and G. L. McLearn. Web reinforced question answering. In Proceedings of TREC, 2001.

[10] T. G. Dietterich. Machine learning research: Four current directions. AI Magazine, 18(4):97-136, 1997.

[11] S. Dumais, M. Banko, E. Brill, J. Lin, and A. Ng. Web question answering: Is more always better? In Proceedings of SIGIR, 2002.

[12] D. Fallman. The penguin: Using the web as a database for descriptive and dynamic grammar and spell checking. In Proceedings of CHI, 2002.

[13] F. Hsu, M. S. Campbell, and A. J. Hoane. Deep Blue system overview. In Proceedings of the 9th ACM International Conference on Supercomputing, pages 240-244, July 1995.

[14] F. V. Jensen. An Introduction to Bayesian Networks. Springer, New York, 1996.

[15] H. Joho and M. Sanderson. Retrieving descriptive phrases from large amounts of free text. In Proceedings of CIKM, 2000.

[16] G. A. Keim, N. M. Shazeer, M. L. Littman, S. Agarwal, C. M. Cheves, J. Fitzgerald, J. Grosland, F. Jiang, S. Pollard, and K. Weinmeister. Proverb: The probabilistic cruciverbalist. In Proc. 16th National Conference on Artificial Intelligence, pages 710-717, 1999.

[17] S. Lawrence and C. L. Giles. Context and page analysis for improved web search. IEEE Internet Computing, 2(4):38-46, 1998.

[18] K. F. Lee and S. Mahajan. The development of a world class Othello program. Artificial Intelligence, 43:21-36, 1990.

[19] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33-38, November 1995.

[20] H. M. Markowitz. Portfolio selection. Journal of Finance, 7(1):77-91, 1952.

[21] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, and V. Rus. Lasso: A tool for surfing the Answer Net. In Proceedings of TREC, 1999.

[22] D. Radev, W. Fan, H. Qi, H. Wu, and A. Grewal. Probabilistic question answering on the web. In Proceedings of WWW, 2002.

[23] H. Raiffa. Decision Analysis: Introductory Lectures on Choices under Uncertainty. Addison-Wesley, Reading, MA, 1968.

[24] C. M. Rump. Who wants to see a $million error? INFORMS Transactions on Education, 1(3), 2001.

[25] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.

[26] L. J. Savage. The Foundations of Statistics. Wiley, New York, 1954.

[27] J. Schaeffer, J. C. Culberson, N. Treloar, B. Knight, P. Lu, and D. Szafron. A world championship caliber checkers program. Artificial Intelligence, 53(2-3):273-289, 1992.

[28] R. D. Shachter. Probabilistic inference and influence diagrams. Operations Research, 36(4):589-604, 1988.

[29] P. Stone and R. Sutton. Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the 18th ICML, 2001.

[30] M. van Lent, J. Laird, J. Buckman, J. Hartford, S. Houchard, K. Steinkraus, and R. Tedrake. Intelligent agents in computer games. In Proc. 16th National Conference on Artificial Intelligence, pages 929-930, 1999.

[31] J. von Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, 1953.

[32] E. Voorhees. The TREC-8 question answering track report. In Proceedings of TREC, 1999.
