AIMag25 04 003

Articles
Project Halo
Towards a Digital Aristotle
Noah S. Friedland, Paul G. Allen, Gavin Matthews, Michael Witbrock,
David Baxter, Jon Curtis, Blake Shepard, Pierluigi Miraglia, Jürgen Angele,
Steffen Staab, Eddie Moench, Henrik Oppermann, Dirk Wenke, David Israel,
Vinay Chaudhri, Bruce Porter, Ken Barker, James Fan, Shaw Yi Chaw,
Peter Yeh, Dan Tecuci, and Peter Clark
■ Project Halo is a multistaged effort, sponsored us highly optimistic that the technical chal-
by Vulcan Inc, aimed at creating Digital Aristo- lenges facing this effort in the years to come can
tle, an application that will encompass much of be identified and overcome.
the world’s scientific knowledge and be capable This article presents the motivation and long-
of applying sophisticated problem solving to term goals of Project Halo, describes in detail
answer novel questions. Vulcan envisions two the six-month first phase of the project—the
primary roles for Digital Aristotle: as a tutor to Halo Pilot—its KR&R challenge, empirical eval-
instruct students in the sciences and as an inter- uation, results, and failure analysis. The pilot’s
disciplinary research assistant to help scientists outcome is used to define challenges for the
in their work. next phase of the project and beyond.
As a first step towards this goal, we have just
completed a six-month pilot phase designed to
A
ristotle (384–322 BCE) was remarkable
assess the state of the art in applied knowledge for the depth and scope of his knowl-
representation and reasoning (KR&R). Vulcan edge, which included mastery of a wide
selected three teams, each of which was to for-
range of topics from medicine and philosophy
mally represent 70 pages from the advanced
to physics and biology. Aristotle not only had
placement (AP) chemistry syllabus and deliver
knowledge-based systems capable of answering command over a significant portion of the
questions on that syllabus. The evaluation world’s knowledge, but he was also able to ex-
quantified each system’s coverage of the syl- plain this knowledge to others, most famously,
labus in terms of its ability to answer novel, pre- though briefly, to Alexander the Great.
viously unseen questions and to provide hu- Today, the knowledge available to hu-
man-readable answer justifications. These mankind is so extensive that it is not possible
justifications will play a critical role in building for a single person to assimilate it all. This is
user trust in the question-answering capabilities forcing us to become much more specialized,
of Digital Aristotle. further narrowing our worldview and making
Prior to the final evaluation, a “failure taxono- interdisciplinary collaboration increasingly
my” was collaboratively developed in an at- difficult. Thus, researchers in one narrow field
tempt to standardize failure analysis and to fa- may be completely unaware of relevant
cilitate cross-platform comparisons. Despite
progress being made in other neighboring dis-
differences in approach, all three systems did
ciplines. Even within a single discipline, re-
very well on the challenge, achieving perfor-
mance comparable to the human median. The searchers often find themselves drowning in
analysis also provided key insights into how the new results. MEDLINE,1 for example, is an
approaches might be scaled, while at the same archive of 4,600 medical publications in 30
time suggesting how the cost of producing such languages, containing over 12 million publica-
systems might be reduced. This outcome leaves tions, with 2,000 added daily.
Copyright © 2004, American Association for Artificial Intelligence. All rights reserved. 0738-4602-2002 / $2.00 WINTER 2004 29
Articles
Making the full range of scientific knowl- knowledge that will form the foundation for
edge accessible and intelligible might involve Digital Aristotle.
anything from simply retrieving facts to an-
swering a complex set of interdependent ques- The Halo Pilot
tions and providing appropriate justifications The pilot phase of Project Halo was a six-
for those answers. Retrieval of simple facts month effort to set the stage for a long-term re-
might be achieved by information-extraction search and development effort aimed at creat-
systems searching and extracting information ing Digital Aristotle. The primary objective was
from a large corpus of text, such as Voorheese to evaluate the state of the art in applied KR&R
(2003). But aside from the simplicity of the systems. Understanding the performance char-
types of questions such advanced retrieval sys- acteristics of these technologies was considered
tems are designed to answer, they are only ca- to be especially critical to DA, as they are ex-
pable of retrieving “answers”—and justifica- pected to form the basis of its reasoning capa-
tions for those answers—that already exist in bilities. The first objectives were to identify and
the corpus. Knowledge-based question–an- engage leaders in the field and to develop suit-
swering systems, by contrast, though generally able evaluation methodologies; the project was
more computationally intense, are capable of also designed to help in the determination of a
generating answers and appropriate justifica- research and development roadmap for KR&R
tions and explanations that are not found in systems. Finally, the project adopted principles
texts. This capability may be the only way to of scientific transparency aimed at producing
bridge some interdisciplinary gaps where little understandable, reproducible results.
or no documentation currently exists. Vulcan undertook a formal bidding process
Project Halo is a multistaged effort aimed at to identify teams to participate in the pilot.
creating Digital Aristotle (DA), an application Criteria for selection included a well-estab-
encompassing much of the world’s scientific lished and mature technology and a world-
knowledge and capable of answering novel class team with a track record of government
questions through advanced problem solving. and private funding. Three teams were con-
DA will act both as a tutor capable of instruct- tracted to participate in the evaluation: a team
ing students in the sciences and as a research led by SRI International with substantial con-
assistant with broad interdisciplinary skills, tributions from Boeing Phantom Works and
able to help scientists in their work. The final the University of Texas at Austin; a team from
DA will differ from classical expert systems in Cycorp; and a team from Ontoprise.
four important ways. Significant attention was given to selecting a
First, in speed and ease of knowledge formu- proper domain for the evaluation. It was im-
lation. Classical expert systems required years portant, given the limited scope of this phase
to perfect and highly skilled knowledge engi- of the project, to adapt an existing, well known
neers to craft them; Digital Aristotle will pro- evaluation methodology with easily under-
vide tools to facilitate rapid knowledge formu- stood and objective standards. First a decision
lation by domain experts with little or no help was made to focus on a “hard” science and,
from knowledge engineers. more specifically, on a textbook presentation of
Second, in coverage. Classical expert systems some part of that science. Several standardized
were narrowly focused on the single topic for test formats were also examined. In the end, a
which they were specifically designed; DA will 70-page subset of introductory college-level ad-
over time encompass much of the world’s sci- vanced placement (AP) chemistry was selected
entific knowledge. because it was reasonably self-contained and
Third, in reasoning techniques. Classical ex- did not require solutions to other hard AI prob-
pert systems mostly employed a single infer- lems, such as representing and reasoning with
ence technology; DA will employ multiple uncertainty, or understanding diagrams
technologies and problem solving methods. (Brown, LeMay, and Bursten 2003). This latter
Fourth, in explanations. Classical expert sys- consideration, for example, argued against se-
tems produced explanations derived directly lecting physics as a domain.
from inference proof trees; DA will produce Table 1 lists the topics in the chemistry syl-
concise explanations, appropriate to the do- labus. Topics included stoichiometry calcula-
main and the user’s level of expertise. tions with chemical formulas; aqueous reac-
Adoption by communities of subject matter tions and solution stoichiometry; and
experts of the Project Halo tools and method- chemical equilibrium. Background material
ologies is critical to the success of DA. These was also identified to make the selected chap-
tools will empower scientists and educators to ters more fully self-contained.2
build the peer-reviewed, machine-processable This scope was large enough to support a
30 AI MAGAZINE
Articles
large variety of novel, and hence unanticipat- Subject Chapters Sections Pages
ed, question types. One analysis of the syllabus Stoichiometry: Calculations 3 3.1 – 3.2 75 - 83
identified nearly 100 distinct chemistry laws, with Chemical Formulas
suggesting that it was rich enough to require Aqueous Reactions and Solution 4 4.1 – 4.4 113 - 133
complex inference. It was also small enough to Stoichiometry
be represented relatively quickly—which was Chemical Equilibrium 16 16.1 – 16.11 613 - 653
essential because the three Halo teams were al-
Table 1. Course Outline for the Halo Challenge.
located only four months to create formal en-
codings of the chemistry syllabus. This amount
of time was deemed sufficient to construct de- Domain-Driven Versus Question-
tailed solutions that leveraged the existing Driven Knowledge Formation
technologies, yet was too brief to allow signifi-
Recall that Vulcan released a course description
cant revisions to the teams’ platforms. Hence,
consisting of 70 pages of a chemistry textbook
by design, we were able to avoid undue cus-
and 50 sample questions. The teams had the
tomization to the task domain and thus to cre-
choice of building knowledge bases either start-
ate a true evaluation of the state of the art of
ing from the syllabus text or from the sample
KR&R technologies.
questions or working from both in parallel.
Nevertheless, at the outset of the project it
Ontoprise and Cycorp approached knowledge
was completely unclear whether competent formation in a target-text-driven approach,
systems could be built. In fact, Vulcan’s secret and SRI approached knowledge formation in a
intent was to set such a high bar for success that question-driven approach.
the experiment would expose the weaknesses The Ontoprise team encoded knowledge in
in KR&R technologies and determine whether three phases. During the first phase team mem-
these technologies could form the foundation bers encoded the knowledge within the corpus
of DA. The teams accepted the challenge with into the ontology and rules without consider-
trepidation caused by several factors, including ing any sample test questions. They then tested
the mystery of working in a new domain with this knowledge on test questions that appeared
the novel performance task of answering hard, in the textbook—which were different from
and highly varied, advanced placement ques- the sample set released by Vulcan. In the sec-
tions; and generating coherent explanations in ond phase, they tested the sample questions re-
English—all within four months. leased by Vulcan. The initial coverage they ob-
served was around 30 percent. During this
The Technology phase, they refined the knowledge base until
coverage of around 70 percent was reached. In
The three teams had to address the same set of the second phase, they also coded the explana-
issues: knowledge formation, question answer- tion rules. In the third phase, they refined the
ing, and explanation generation (Barker et al. encoding of the knowledge base and the expla-
2004; Angele et al. 2003; and Witbrock and nation rules.
Matthews 2003.) They all built knowledge Cycorp used a hybrid approach by first con-
bases in a formal language and relied on centrating on representing the basic concepts
knowledge engineers to encode the requisite and principles of the corpus and gradually
knowledge. Furthermore, all the teams used au- shifting over to a question-driven approach.
tomated deductive inference to answer ques- The intent behind this approach was to avoid
tions. Despite these high-level similarities, the overfitting the knowledge to the specifics of
teams’ approaches differed in some interesting the sample questions available. This strategy
ways, especially with respect to explanation met with mixed success: in the second phase,
generation. considerable reengineering of the knowledge
was required to meet the requirements of the
Knowledge Formation questions without compromising the strategy’s
Each system achieved significant coverage of generality. This was partly because the text-
the parts of the domain represented by the syl- book adopted an example-based approach with
labus and was able to use that coverage to an- somewhat varied depth, whereas the process of
swer substantial numbers of novel questions. knowledge formation would have benefited
All three systems used class taxonomies, such from a more systematic and uniform coverage.
as the one illustrated in figure 1, to organize The SRI team’s approach for knowledge for-
concepts such as acids, physical constants, and mation was highly question-driven. Starting
reactions; represented properties of classes us- from the 50 sample questions, team members
ing relations; and used rules to represent com- worked backwards to identify what pieces of
plex relationships. knowledge would be needed to solve them. In-
WINTER 2004 31
Articles
#$ChemicalSubstanceType
#$genis
#$AcidType-Negligible
#$BaseType-Negligible
#$AcidType-Lewis #$BaseType-Lewis
#$AcidType-Weak #$BaseType-Weak
#$AmphotericSubstanceType
#$AcidType-Bronsted-Lowry #$BaseType-Bronsted-Lowry
#$AcidType-Monoprotic
#$AcidType-Arhenius #$AcidType-Polyprotic #$BaseType-Arrhenius
#$AcidType-Strong #$AcidType-Diprotic $#BaseType-Strong
Figure 1. Extract from Cyc’s Ontology of Acids and Bases.

These nodes represent second-order collections for organizing specific substance types; edges represent subsuption relationships.
terestingly, the initial set of questions was building a library of representations of generic
found to require coverage of a substantial por- entities, events, and roles (Barker, Porter, and
tion of the syllabus. Once the coverage for the Clark 2001) and they were able to reuse parts of
sample set of questions was achieved, they this for the Project Halo pilot. In addition to
looked for additional sample questions from providing the types of information commonly
the available AP tests. Working with this addi- found in ontologies (class-subclass relations
tional set of sample questions, they ensured and instance-level predicates), their representa-
the robustness of their initial coverage. tions include sets of axioms for reasoning
about instances of these classes. The portion of
Reliance on Domain- the ontology dealing with properties and val-
Independent Ontologies ues was especially useful for the Halo pilot. It
Both Cycorp and SRI relied on their preexisting includes representations for numerous dimen-
knowledge base content. Ontoprise started sions (for example, capacity, density, duration,
from scratch. Not surprisingly, the top-level frequency, quantity) and values of three types:
classes in the Ontoprise knowledge base are scalars, cardinals, and categoricals. This ontol-
chemistry concepts such as elements, mixtures, ogy also includes methods for converting
and reactions. Interestingly, the Ontoprise among units of measurement (Novak 1995),
knowledge base did not draw on well-known which the SRI team’s system used to align the
ontological distinctions such as object type representation of questions with representa-
versus stuff type. We describe here in more de- tions of terms and laws, even if they are ex-
tail how SRI and Cycorp leveraged their prior pressed with different units of measurement.
knowledge base and the issues that arose in do- Cycorp publishes an open-source version of
ing so. Cyc3 that was used as a platform for the Open-
For several years the SRI team has been Halo system. Cyc’s knowledge consists of
32 AI MAGAZINE
Articles
terms, relations, and assertions. The assertions term goals, and their instincts of what might
are organized into a hierarchy of microtheories work.
that permit the isolation of specific assump- The Ontoprise System. OntoNova, the Onto-
tions into a specific context. OpenHalo utilized prise system, was based on the representation
OpenCyc’s 6,000 concepts but was augmented language F(rame)-Logic (Kifer, Lausen, and Wu
for Project Halo with 1,000 new concepts and 1995) and the logic programming-based infer-
8,000 existing concepts selected from the full encing system OntoBroker (Angele et al. 2003).
Cyc knowledge base. A significant fraction of For answer justification, OntoNova used
the latter formed part of the compositional metainferencing, as follows. While processing
explanation-generation system. a query OntoBroker produced a log file of the
proof tree for any given answer. This proof tree,
Reliance on Domain Experts which was represented in F-Logic and con-
Cycorp and Ontoprise relied on their knowl- tained the instantiated rules that were success-
edge engineers to do all the knowledge forma- fully applied to derive an answer, acted as in-
tion, while SRI relied on a combined team of put for a second inference run to produce
knowledge engineers and chemistry domain English answer justifications.
experts. We illustrate this approach with a sample
Team SRI used four chemists to help with the question. The question asks for the Ka value of
knowledge formation process, which was done a substance, given its quantity in moles and its
in the following steps. First, ontological engi- pH. The following is an extract from the log file
neers designed representations for chemistry of the proof tree:
content, including the basic structure for terms a15106:Instantiation[ofRule->>kavalueMPhKa;
and laws, chemical equations, reactions, and instantiatedVars->>{i(M,0.2),i(PH,3.0),…].
solutions. Second, chemists consolidated the This log file extract states that the rule kaval-
domain knowledge into a 35-page compendi- ueMPhKa was applied at the point in time
um of terms and laws summarizing the rele- logged here. Then, the variables M and PH were
vant material from 70 pages of a textbook. instantiated by 0.2 and 3.0 respectively. Rules
While doing this, the chemists were asked to important for justifying results, for example
start from the premise to be proven and trace “kavalueMPhKa,” were applied in the second,
the reasoning in a backward-chaining manner meta-inference run. Explanation rules were
to make it easy for knowledge engineers to en- specified by their reference to an inference rule
code this in the knowledge base. Third, knowl- used to derive the answer, the instantiations of
edge engineers implemented that knowledge the variables of that rule, and a human-au-
in the KM frame language, creating representa- thored explanation template referring to those
tions of about 150 laws and 65 terms. While variables. These explanation rules resembled
doing so, they compiled a large suite of test cas- the explanation templates of the SRI system.
es for individual terms and laws as well as com- The corresponding explanation rule for kaval-
binations of them. This test suite was run daily. ueMPhKa was:
Fourth, the “explanation engineer” augmented FORALL I,M1,PH1 explain(EX1,I) <-
the representation of terms and laws to gener- I:Instantiation[ofRule->>kavalueMPhKa;
ate English explanations. Finally, the domain instantiatedVars->>{i(M,M1),i(PH,PH1)}] and
experts reviewed the output of the system for EX1 is (“The equation for calculating the acid-
correctness and understandability. dissociation.”).
Ontoprise knowledge engineers learned the These explanation rules were applied to the
domain and built the knowledge base—mostly proof tree for the example to produce the fol-
starting with understanding and modeling the lowing justification output:
examples given in the textbook. They com- The equation for calculating the acid-dissocia-
piled a set of 41 domain concepts, 582 domain tion constant Ka for monoprotic acids is
instances, 47 domain relations, and 345 ax- Ka=[H+][A-]/[HA]. For monoprotic acids the
ioms used for answering the questions. In addi- concentrations for hydrogen [H+] and for the
tion, they added 138 rules in order to provide anion [A-] are the same: [H+]=[A-]. Thus, we get
explanations for the answers produced. Ka= 0.0010 * 0.0010 / 0.2 = 5.0E-6 for a solu-
tion concentration [HA] = 0.2 M.
Explanation Generation The equation for calculating the pH-value
The three teams took quite different approach- is ph=-log[H+]. Thus we get the pH-value,
es to explanation generation. These differences ph = 3, H+ concentration [H+] = 0.0010.
were based on the teams’ available technolo- This two-step process for creating explanations
gies (recall that the project allowed little time allowed the application of OntoBroker to gen-
to develop new technologies), their longer- erate explanations. For OntoNova, the Onto-
WINTER 2004 33
Articles
and displayed, followed by the nested explana-

ComputeConcentrationOfIons(C) tion of dependent facts, followed by the exit
if C is a strong electrolyte text. The explanation generated for the compu-
return(Max) [expl-tag-1] tation of concentration of ions in NaOH is as
else
… [expl-tag-2]
follows:
If a solute is a strong electrolyte, the concen-
[expl-tag-1] tration of ions is maximal
entry: "If a solute is a strong electrolyte, the Checking the electrolyte status of NaOH.
concentration of ions is maximal" Strong acids and bases are strong elec-
exit: "The concentration of ions in" C "is" Max trolytes.
dependencies: electrolyte-status(C)
NaOH is a strong base and is therefore a
strong electrolyte.
NaOH is thus a strong electrolyte.
Figure 2. The Law for Computing the Concentration of Ions in a Chemical. The concentration of ions in NaOH is 1.00
molar.
A more complete description of team SRI’s sys-
tem can be found in Barker et al. (2004).
prise team developed an entire knowledge base The Cycorp System. All the systems generated
for this purpose. Running short of time, they their explanations by appropriate filtering and
could not fully exploit the flexibility of this ap- transformation of the inference proof tree. The
proach and thus mostly restricted it to an ap- primary difference in Cycorp’s approach was
proach similar to template matching. In the fu- that Cyc was already capable of providing nat-
ture, however, they plan to apply additional ural language explanations for any detail, how-
inference rules to (1) integrate additional ever minor, whereas the other systems required
knowledge, (2) reduce redundancies of expla- the addition of template responses for each
nations, (3) abstract from fine-grained explana- rule and fact deemed important. The result of
tions, and (4) provide personalized explana- this was that much of the effort expended on
tions. explanations by Cycorp concerned judicious
Team SRI’s System. Team SRI’s system was strengthening of the filters, and Cyc’s output
based on KM, a frame language with some sim- consequently erred on the side of verbosity.
ilarities to KRL and KL-ONE systems.4 During Moreover, Cyc’s English, being built up compo-
reasoning, KM records which rules are used in sitionally by automatic techniques rather than
the derivation of ground facts. These proof tree being handcrafted for a specific project, ex-
fragments could be presented as an “explana- hibits a certain clumsiness of expression.
tion” of the derivation of a fact. Experience A specific example of where Cyc had a lot of
with expert systems, however, has taught us trouble generating readable explanations was
that proof trees and inference traces are not when the use of a mathematical equation re-
comprehensible explanations for most users. quired a lot of arithmetic. While a step-by-step
To provide better explanations, KM allows exposition of every mathematical step in-
the knowledge engineer to supply explanation volved is technically correct, it makes most
templates for each knowledge base rule. These readers recoil in horror. Cyc’s lower scores for
explanation templates provide control over explanations may therefore be ascribed not to
which proof tree fragments are presented in any errors contained within, nor to the com-
the explanation and what English text is used plete absence of an explanation, but to the fact
to describe them. In particular, the knowledge that the key chemistry principles involved
engineer specifies what text to display when a tended to be buried amid more trivial reason-
rule is invoked (“entry text”), what text is dis- ing. In the Halo pilot challenge, the graders ap-
played when the rule has been successfully appeared to value conciseness of expression over
plied (“exit text”), and a list of any other facts either correctness or completeness.
that should be explained in support of the cur- In the Halo pilot challenge, Cyc produced
rent rule (dependent facts). The three parts of explanations for every question it answered
the explanation template can contain arbitrary correctly, and it was rare for any of the graders
KM expressions, allowing the knowledge engi- to find any fault with the explanations’ correct-
neer considerable control over explanation ness. Comments like “calculations are correct
generation. but once again buried” and “not well focused”
Consider the law for computing the concen- were common. It was clear at the end of the pi-
tration of ions in a chemical solution, shown lot phase that Cyc required significant work in
in figure 2. When the explanation of a rule ap- explanation filtering; substantial progress has
plication is requested, the entry text is formed been made during subsequent projects.
34 AI MAGAZINE
Articles
Evaluation
At the end of four months, knowledge formu- Sodium azide is used in air bags to rapidly produce
gas to inflate the bag. The products of the decomposition
lation was stopped, even though the teams had reaction are:
not completed the task. All three systems were
sequestered on identical servers at Vulcan. (a) Na and water;
Then the challenge exam, consisting of 100 (b) Ammonia and sodium metal;
novel AP-style English questions, was released (c) N2 and O2;
(d) Sodium and nitrogen gas;
to the teams. The exam consisted of three sec- (e) Sodium oxide and nitrogen gas.
tions: 50 multiple choice questions and two
sets of 25 multipart questions—the detailed an-
swer and free-form sections. The detailed an- Figure 3. An Example of a Multiple Choice Section Question, MC3.
swer section consisted mainly of quantitative
questions requiring a “fill in the blank” (with
explanation) or short essay response. The free-
form section consisted of qualitative, compre- of the explanation. The exam encompassed
hension questions, which exercised additional 168 distinct gradable components consisting of
reasoning tasks such as metareasoning, and re-
questions and question subparts. Each of these
lied more, if only in a limited way, on com-
received marks, ranging from 0 to 1 point each
monsense knowledge and reasoning.
for correctness and separately for explanation
Due to the limited scope of the pilot, there
quality, for a maximum high score of 336. All
was no requirement that questions be input in
three experts graded all three exams. The scor-
their original, natural language form. Thus,
ing of all three chemistry experts was aggregat-
two weeks were allocated to the teams for the
ed for a maximum high score of 1008.
translation of the exam questions into their re-
spective formal languages. Upon completion of Empirical Results
the encoding effort, the formal question en-
Vulcan was able to run all of the applications
codings of each team were evaluated by a pro-
during the challenge despite minor problems
gramwide committee to guarantee high fidelity
associated with each of the three systems.5 Re-
to the original English. The criterion of fidelity
sults were compiled for the three exam sections
was as follows:
separately and then aggregated to form the to-
Assume that a student was fluent in both Eng- tal scores. Despite significant differences in ap-
lish and the formal language in question. If she
proach, all three systems performed remark-
is able to infer additional facts from the formal
encodings either through omission of detail or ably well, above 40 percent for correctness for
because new material details were provided that most of the graders—a score comparable to an
were not available in the English description of AP-3 (out of 5)—close to the mean human
the question, then a fidelity violation had oc- score of AP-2.82!
curred. The multiple choice (MC) section consisted
Once the encodings were evaluated, Vulcan of 50 questions, MC1 through MC50. Each of
personnel submitted them to the sequestered these questions featured five choices, lettered
systems. The evaluations ran in batch mode. “a” through “e.” The evaluation required both
The Ontoprise system completed its processing an answer and a justification for full credit,
in 2 hours, the SRI system in 5 hours and the even for MC questions. Figure 3 provides an ex-
Cycorp system in a little over 12 hours. Each of ample of one of the multiple choice questions,
the three systems produced an output file in ac- MC3.
cordance with a predefined specification. For Figure 4 depicts the correctness (on the left)
each question, the format required the specifi- and answer-justification (on the right) scores
cation of the question number; the full English for the multiple choice section as a percentage
text of the question; a clear answer, either in of the 50-point maximum. The Cycorp, Onto-
prose or letter form for multiple choice ques- prise, and SRI scores are depicted by the differ-
tions; and an explanation of how the answer ent gray bars. Bars are grouped by the grading
was derived—even for multiple choice ques- chemistry professors, SME1 through SME3,
tions. See the sidebar “Examples of System where SME stands for subject matter expert. SRI
Outputs and Grader Comments” for more de- and Ontoprise both scored about 70 percent
tails. correct in this section, while Cycorp scored
Vulcan engaged three chemistry professors slightly above 50 percent. Cycorp applied a
to evaluate the exams. Adopting an AP-style metareasoning technique to evaluate multiple
evaluation methodology, they graded each choice questions. First, Cycorp’s OpenHalo at-
question for both correctness and the quality tempted to find a correct answer among the
WINTER 2004 35
Articles
siderably lower than the correctness scores.

MC Answer Scores Note that these two measurements were not in-
dependent. Systems that were unable to pro-
80.00
duce an answer did not produce a justification,
70.00
and systems that produced incorrect answers
60.00 were rarely able to produce convincing answer
50.00 justifications. The answer justification scores
S c ore s (%)
CYCORP
40.00 ONTOPRISE were also far less uniform than the correctness
30.00
TEAM SRI scores,6 with the scoring for SRI appearing to be
the most consistent across the three evaluators.
20.00
All the evaluators found the SRI justifications
10.00
to be the best, while the Cycorp generative-
0.00 English was the least comprehensible to the
SME1 SME2 SME3
subject matter experts.
The detailed answer (DA) section had 25
multipart essay questions, DA1–DA25, repre-
MC Justification Scores
60.00
senting a total of 80 gradable answer compo-
nents. Figure 5 depicts an example of a DA sec-
50.00 tion question, DA1. Figure 6 depicts the
40.00
correctness and answer-justification scores for
the DA section. The correctness assessment
S core s (%)
CYCORP
30.00 ONTOPRISE shows a slight advantage to the Cycorp system
TEAM SRI
in this section. OpenHalo may have fared bet-
20.00
ter here because it was not penalized by its
10.00 multiple choice strategy in this section.
The free-form (FF) section also had 25 multi-
0.00
SME1 SME2 SME3 part essay questions, FF1–FF25, representing 38
gradable answer components. Figure 7 depicts
an example of a FF question, FF2. Figure 8
Figure 4. Correctness and Answer-Justification Scores for the Multiple shows the correctness and answer-justification
Choice Section as a Percentage of the Maximum Score of 50 Points. scores for the FF section. This section was de-
signed to include questions that were some-
what beyond the scope of the defined syllabus.
Some required metareasoning and, in some
cases, limited commonsense knowledge. The
Balance the following reactions, and indicate whether they objective was to see how well the systems per-
are examples of combustion, decomposition, or combination formed faced with such challenges and
(a)C4H10 + O2 → CO2 + H2O whether the additional knowledge constructs
(b)KClO3 → KCl + O2 available to SRI and Cycorp would translate in-
(c)CH3CH2OH + O2 → CO2 + H2O to better results. The outcome of this section
(d)P4 + O2 → P2O5 showed a marked advantage to the SRI system,
(e)N2O5 + H2O → HNO3 both for correctness and for justification. We
were surprised that the Cycorp system did not
do better, given its many thousands of con-
cepts and relations and the rich expressivity of
Figure 5. An Example of a Detailed Answer Section Question, DA1.
CycL. This result may reflect the inability of
their knowledge engineering team to leverage
knowledge in Cyc for this particular challenge.
Figure 9 provides the total challenge results,
five. If it failed to do so, it would attempt to de- as percentages of the 168-point maximum
termine which of the five options were prov- scores, for answer correctness and justifica-
ably wrong. This led to some questions return- tions. The correctness scores show a similar
ing more than one letter answer, none of trend for the three subject matter experts, with
which received credit from the subject matter team SRI slightly outperforming Ontoprise and
experts. In contrast, the other two teams hard- Ontoprise slightly outperforming Cycorp. By
coded the approach to be used—direct proof contrast, the justification scores display a sig-
versus elimination of obvious wrong an- nificant amount of variability. We are consider-
swers—and appeared to fare better. ing changes in our methodology to address this
The answer-justification scores were all con- issue, including training future subject matter
36 AI MAGAZINE
Articles
experts to produce more consistent scoring.

All subject matter experts found some an- DA Answer Scores
swer justifications that they liked. The subject 60.00
matter experts provided high-level comments, 50.00
mostly focused on the organization and con-

40.00
ciseness of the justifications. In some instances, CYCORP
S c ore s (%)
justifications were quite long. For example, Cy- 30.00 ONTOPRISE
TEAM SRI
corp’s generative English produced some justi- 20.00
fications in excess of 16 pages in length. The
10.00
subject matter experts also complained that
many arguments were used repetitively and 0.00
SME1 SME2 SME3
that proofs took a long time to “get to the
point.” In some multiple choice questions,
proofs involved invalidating all wrong answers
rather than proving the right one. All the DA Justification Scores
teams appeared to rely on instance-based solu- 60.00
tions to prove generalized comprehension-ori- 50.00

ented questions, indicating a limited ability to
reason with concepts. Gaps in knowledge cov- 40.00
S c ore s (%)
CYCORP
erage were also evident. For example, many of 30.00 ONTOPRISE
the teams had significant gaps in their knowl- TEAM SRI
20.00
edge of net ionic equations. Detailed question-
by-question scores are available on the project 10.00
Web site.
0.00
SME1 SME2 SME3
Problematic Questions
Despite the impressive overall performance of
the three systems, there were questions on
which each of them failed. Most interestingly,
Figure 6. Correctness and Answer-Justification Scores for the Detailed
there were questions on which all three sys-
Answer Section as a Percentage of the Maximum Score of 80 Points.
tems failed dramatically. Five prominent and
interesting cases—DA10, DA22, FF1, FF8, and
FF22—are shown in figure 10. We examine
these questions more closely.
The first issue to address is whether these
questions share properties that explain their
Pure water is a poor conductor of electricity, yet ordinary
difficulty. An initial hypothesis is that all five tap water is a good conductor. Account for this difference.
questions require that a system be able to rep-
resent and reason about its own problem-solv-
ing procedures and data structures—that is, Figure 7. An Example of a Free-Form Section Question, FF2.
that it be reflective or capable of metarepresen-
tation and metareasoning. That property
would explain the difficulty all three systems
had, at least to the extent that the systems can
be said to lack such reflective capabilities. FF22 seems similar in that it asks about the
DA10 seems to probe a system’s strategy for applicability of approximate solutions for a cer-
solving any of a general class of problems; in- tain class of problems and about the reasons
deed, it seems to ask for an explicit description for the limits to that applicability. On reflec-
of that strategy. DA22 implies that the pH cal- tion, though, it is really probing the system’s
culation that a problem solver is likely to use knowledge of certain methodological princi-
will generate an unacceptable result in this case ples used in chemistry rather than the system’s
(a value of greater than 7 for an acid) and then knowledge of its own inner workings. What
asks for an explanation of what went wrong, seems to be missing is knowledge about chem-
that is, of why the normal pH calculation leads istry—not about chemical compounds but
to an anomalous result here. These two ques- rather about methods used in chemistry, in
tions both seem to require the system to repre- particular about approximate methods and
sent and reason about, indeed to explain the their scope and limits. And, of course, these lat-
workings of, its own problem-solving proce- ter methods may or may not be built into the
dures. system’s own problem-solving routines.
WINTER 2004 37
Articles
systems. The Ontoprise system proved to be

the fastest and most reliable, taking approxi-
FF Answer Scores
45.00
mately 2 hours to complete its batch run. The
40.00
SRI system ran the challenge in approximately
35.00 5 hours, and the Cycorp system completed its
30.00 processing in over 12 hours. In this latter case,
a memory leak on the sequestered platform
S c ore s (%)
25.00 CYCORP
20.00
ONTOPRISE
TEAM SRI
caused the server to crash, and the system was
15.00 rebooted and run until the evaluation time
10.00 limit expired.
5.00 All three teams undertook modifications and
0.00 improvements to the sequestered systems and
SME SME2 SME3
ran the challenges again. In this case the Onto-
prise system was able to complete the chal-
lenge in 9 minutes, the SRI system took 30
FF Justification Scores minutes, and the Cycorp system took approxi-
30.00
mately 27 hours to process the challenge. Both
25.00
the sequestered and the improved systems are
freely available for download off the Project
20.00
Halo Web site.
S c ore s (%)
CYCORP
15.00 ONTOPRISE
TEAM SRI
10.00 Analysis
5.00 The three systems did well—better than Vulcan
0.00
expected. Nevertheless, their performance was
SME1 SME2 SME3 far from perfect, and the goal of the pilot pro-
ject was to go beyond evaluations of KR&R sys-
tems to an analysis of them. Therefore, we
wanted to understand why these systems failed
Figure 8. Correctness and Answer Justification Scores for the Free-Form when they did, the relative frequency of each
Section as a Percentage of the Maximum Score of 38 Points. type of failure, and the ways these failures
Note that SRI fared significantly better both in the correctness and justification might be avoided or mitigated.
scoring. Based on our collective experience building
KR&R systems, at the beginning of Project Halo
we designed a taxonomy of failures that fielded
systems might experience. Then, at the end of
FF1 and FF8 are similar in that one asks for the project, we analyzed every point lost on the
similarities and the other for differences, and evaluation in an attempt to identify the failure
in both cases, the systems did represent the and place it within the taxonomy. We studied
knowledge but did not support the reasoning the resulting data to draw lessons about the
method to compute them. taxonomy, the systems, and (by extrapolation)
FF1 is a question about the language of the current state of KR&R technologies for
chemistry, in particular about the abstract syn- building fielded systems. See Friedland et al.
tax or type conventions of terms for chemical (2004) for a comprehensive report of this
compounds. All three systems had some study.
knowledge encoded so that the differences In particular, our failure analysis suggests
could be computed but lacked the necessary three broad lessons that can be drawn across
reasoning method to compute them. the board for the three systems: modeling, an-
swer justification, and scalability for speed and
A Note on Performance reuse.
The Halo pilot challenge was run over the Modeling. A common theme in the modeling
course of a day and a half on sequestered sys- problems across the systems was that the incor-
tems at Vulcan, by Vulcan personnel. As noted rect knowledge was represented, or some do-
above, we did encounter minor problems with main assumption was not adequately factored-
all three systems that were resolved over this in, or the knowledge was not captured at the
period of time. Among other issues, the batch right level of abstraction. Addressing these
files containing the formal encodings of the problems requires us to have direct involve-
challenge questions needed to be broken into ment of the domain experts in the knowledge-
two to facilitate their processing on all three engineering process. The teams involved such
38 AI MAGAZINE
Articles
experts to different extents and at different

times during the course of the project. The SRI
team, which involved professional chemists 60.00
Challenge Answer Scores
from the beginning of the project, appeared to

50.00
benefit substantially. This presents a research
challenge, since it suggests that the expositions 40.00
S c ore s (%)
CYCORP
of chemistry in current texts are not sufficient 30.00 ONTOPRISE
TEAM SRI
for building or training knowledge-based sys- 20.00
tems. Instead, a high-level domain expert must

10.00
be involved in formulating the knowledge ap-
propriately for system use. Two approaches to 0.00
SME1 SME2 SME3
ameliorating this problem being pursued by
participants are (1) providing tools that sup-
port direct manipulation and testing of KR&R
systems by such experts, and (2) providing the
background knowledge required by a system to Challenge Justification Scores
make appropriate use of specialized knowledge 45.00
40.00
as it is presented in texts.
35.00
Answer Justification. Explanation, or, more 30.00
S c ore s (%)
CYCORP
generally, response interpretability, is funda- 25.00
ONTOPRISE
20.00
mental to the acceptance of a knowledge-based TEAM SRI
15.00
system, yet for all three state-of-the-art sys- 10.00
tems, it proved to be a substantial challenge. 5.00
Since the utility of the system will be evaluated 0.00

SME1 SME2 SME3
end to end, it is to a large degree immaterial
whether its answers are correct, if they cannot
be understood. Reaching the goals of projects
Figure 9. Total Correctness and Justification Scores
as a Percentage of the Maximum Score of 168 Points.
DA10: HCl, H2SO4, HClO4, and HNO3 are all examples of strong acids and are
100% ionized in water. This is known as the “leveling effect” of the solvent.
Explain how you would establish the relative strengths of these acids. That is,
how would you answer a question such as “which of these acids is the strongest?”
DA22. Phenol, C6H5OH, is a very weak acid with an acid equilibrium constant of
Ka = 1.3 x 10-10. Determine the pH of a very dilute, 1 x 10-5 M, solution of
phenol. Is the value acceptable? If not, give a possible explanation for the
unreasonable pH value.
FF1. What is the difference between the subscript 3 in HNO3 and a coefficient 3 in
front of HNO3?
FF8. Although nitric acid and phosphoric acid have very different properties as pure
substances, their aqueous solutions possess many common properties. List some
general properties of these solutions and explain their common behavior in terms of
the species present.
FF22. When we solve equilibrium expressions for the [H3O+] approximations are
often made to reduce the complexity of the equation thus making it easier to solve.
Why can we make these approximations? Would these approximations ever lead to
significant errors in the answer? If so give an example of an equilibrium problem that
would require use of the quadratic equation.
Figure 10. Examples of Chemistry Questions That Proved to Be Problematic for All Three Teams.
WINTER 2004 39
Articles
A Brief History of
Evaluating Knowledge Systems
O
ne unique aspect of the Halo pilot is its rigorous scheme signed to test the question-answering capability on huge bodies
of evaluation. It uses an independently defined and of text on widely ranging subjects using very limited reasoning
well-understood test, specifically, the advanced place- capabilities. In contrast, the Halo evaluation is focused on eval-
ment test for chemistry, on a well-defined scope, specifically, 70 uating deep reasoning in the field of sciences. The eventual goal
pages of a chemistry textbook. Though such rigorous evalua- of Halo is to do significant coverage of sciences, but the current
tion schemes have been developed in the areas of shallow in- phase was limited to only 70 pages of a chemistry textbook.
formation extraction (the MUC conferences) information re-
trieval and simple question answering (the TREC conferences) Building and Running
for quite a while, the corresponding task of evaluating the kind Knowledge-Based Systems
of knowledge-based systems deployed in the Halo Pilot had ap- In the area of knowledge-based systems, DARPA, AFOSR, NRI,
peared to be too difficult to be approached in one step. and NSF jointly funded the knowledge sharing effort in 1991
Thus, previous efforts at measuring the performance of (Neches et al. 1991). This was a three-year collaborative pro-
knowledge-based systems such as in high-performance knowl- gram to develop “knowledge sharing” technologies to facilitate
edge bases (HPKB) and rapid knowledge formation (RKF) con- the exchange and reuse of inference-capable knowledge bases
stituted important stepping stones towards rigorous evaluation among different groups. The aim was to help reduce costs and
of knowledge-based systems, but the Halo pilot represents a sig- promote development of knowledge-based applications. This
nificant advance. To substantiate this summary, we shall review was followed by DARPA’s High Performance Knowledge Base
some of the details of developments in these various areas. (HPKB) program (1996–2000), designed to push knowledge-
based technology further and demonstrate that very large
Retrieving Answers from Texts (100k+ axiom) systems could be built quickly and be usefully
Question-answering via information retrieval and extraction applied to question-answering tasks (Cohen et al. 1998). The
from texts has been an active area of research, with a progres- evaluation in HPKB was aimed simply at the hypothesis that
sion of annual competitions and conferences, especially the 7 large knowledge-based systems can be built at all, that they can
message-understanding conferences (MUCs) and the 12 text-re- accomplish interesting tasks, and that they do not break—as a
trieval conferences (TRECs) from 1992–2003, sponsored by toy system would and as many of the initial knowledge-based
NIST, IAD, DARPA, and ARDA. TRECs were initially aimed at re- systems did—when working with a realistically sized knowl-
trieving relevant texts from large collections and then at ex- edge base.
tracting relevant passages from texts (Voorheese 2003). The ear-
lier systems had virtually no need for inference-capable Evaluating Knowledge-Based Systems
knowledge bases and reasoning capabilities. In recent years the There have been few efforts so far at documenting and analyz-
question-answering tasks have become more challenging, for ing the quality of fielded KR&R systems (Brachman et al. 1999;
example, requiring a direct answer to a question rather than a Keyes 1989; and Batanov and Brezillon 1996). RKF made signif-
passage containing the answer. The evaluation schemes are icant efforts to analyze and document the quality of knowledge
very well defined, including well worked out definitions of the base performance (Pool et al. 2003). Specifically, an evaluation
tasks and answer keys that are used to compute evaluation mea- in DARPA’s Rapid Knowledge Formation project, which was
sures including precision and recall. roughly comparable to the one used in Project Halo, was based
Recently there has been a surge of interest in the use of do- on approximately 10 pages from a biology textbook and a set
main knowledge in question answering (for example, see of test questions—however, this was not an independently es-
Chaudhri and Fikes [1999]). ARDA’s current advanced question tablished test. The Halo pilot, reported on here, improves upon
and answering for intelligence program (AQUAINT), started in these evaluations by being more systematic and usable for
2001, is pushing text-based question-answering technology fur- cross-system comparisons. The Halo pilot has adopted an eval-
ther, seeking to address a typical intelligence-gathering scenario uation standard that is comparable to the rigor of the chal-
in which multiple, interrelated questions are used to fulfill an lenges for retrieving answers from texts. It provides an exact de-
overall information need rather than answer just single, isolat- finition of the scope of the domain—an AP chemistry test
ed, fact-based questions. AQUAINT has both adopted TREC’s setting that has proven its validity in many years with many
approach to the evaluation of question-answering and tried to students—as well as an objective evaluation by independent
extend it to encompass more complex question types, for ex- graders. We conjecture that the Halo evaluation scheme is ex-
ample biographical questions of the form “Tell me all the im- tensible enough to support a coherent long-term development
portant things you know about Osama bin Laden.” The funda- program.
mental difference between the Halo evaluation and the
AQUAINT evaluation is that the AQUAINT evaluations are de-
40 AI MAGAZINE
Articles
like Digital Aristotle will require an investment have conceded that, if they had the opportuni-
of considerably more resources into this aspect ty to revisit the challenge, they would adopt
of systems to realize robust gains in their com- the use of subject matter experts in knowledge
petence. Constructing explanations directly formation. Cycorp’s OpenHalo knowledge base
from the system’s proof strategy is neither was two orders of magnitude larger than the
straightforward nor particularly successful, es- other teams’. They were unable to demonstrate
pecially if that strategy has not been designed any measurable advantage in using this addi-
with explanation in mind. One alternative is to tional knowledge, even in example-based ques-
use explicit representations of problem-solving tions, where they exhibited metareasoning
methods (PSMs) so that explanations can in- brittleness similar to that observed in the other
clude statements of problem-solving strategy as systems. The size of Cycorp’s knowledge base
well as statements of facts and rules (Clancey does, however, explain some of the significant
1983). Another is to perform more metareason- run-time differences. They have also yet to
ing over the proof tree to construct a more demonstrate successful, effective reintegration
readable explanation. of Halo knowledge into the extended Cyc plat-
Scalability for Speed and Reuse. There has form. Reuse and integration appear to remain
been substantial work in the literature on the open questions for all three Halo teams.
trade-off between expressiveness and tractabil- The most novel aspect of the Halo pilot was
ity, yet managing this trade-off, or even pre- the great emphasis put on answer explana-
dicting its effect in the design of fielded sys- tions, which served two primary purposes: first,
tems over real domains, is still not at all to exhibit and thereby verify that deep reason-
straightforward. To move from a theoretical to ing was occurring, and second, to validate that
an engineering model of scalability, the KR appropriate domain explanations can be gener-
community would benefit from a more system- ated. This is an area that is still open to signifi-
atic exploration of this area driven by the em- cant improvement. SRI’s approach produced
pirical requirements of problems at a wide the best-quality results, but it leaves open
range of scales. For example, the three Halo many questions regarding how well it might be
systems, and more generally, the Halo develop- scaled, generalized, and reused. Cycorp’s gener-
ment and testing corpora, can provide an ex- ative approach may eventually scale and gener-
cellent test bed to enable KR&R researchers to alize, but the current results were extremely
pursue experimental research in the trade-off verbose and often unintelligible to domain ex-
between expressiveness and tractability. perts. Ontoprise’s approach of running a sec-
ond inference process appears to be very
Discussion promising in the near term.
All three logical languages, KM, F-Logic, and Vulcan Inc. and the pilot participants have
CycL, were expressive enough to represent invested considerable efforts in promoting the
most of the knowledge in this domain. F-Logic scientific transparency of the Halo pilot. The
was by far the most concise and easy to read, project Web site provides all the scientifically
with syntax most resembling an object-orient- relevant documentation and tutorials, includ-
ed language. F-Logic also yielded very high-fi- ing an interactive results browser and fully doc-
delity representations that appear to be easier umented downloads representing both the se-
and more intuitive to construct. Ontoprise was questered systems and improved Halo pilot
the only team to conduct a sensitivity study of chemistry knowledge bases. We eagerly antici-
the impact of different question encodings on pate comment from the AI community and
system performance. In the case of the two look forward to its use by universities and other
questions they examined, their system pro- researchers.
duced similar answers with slightly different Finally, the issue of cost must be considered.
justifications. For the most part, the encoding We estimate that the per page expense for each
process and its impact on question-answering of the three Halo teams was on the order of
stability remain an open research topic. $10,000 per page for the 70-page syllabus. This
SRI and Ontoprise yielded comparably sized cost must be significantly reduced before this
knowledge bases. OntoNova was built from technology can be considered viable for Digital
scratch using no predefined primitives, while Aristotle.
SRI’s system leveraged the Component Library, In summary, all of the Halo systems scored
though not as extensively as the team had ini- well on a very difficult challenge: extrapolating
tially hoped. SRI’s use of professional chemists the results of team SRI’s system on the limited
in the knowledge-formulation process was a 70-page syllabus to the entire AP syllabus yield-
huge advantage, and the quality of their out- ed the equivalent of an AP-3 score for answer
come is reflected by this fact. The other teams correctness—good enough to earn course cred-
WINTER 2004 41
Articles
Examples of System Outputs and

Grader Comments
The Halo pilot evaluation was intended to as-
sess deep-reasoning capabilities of the three
competing systems in the context of a well-
known evaluation methodology. Seventy
pages of the advance placement (AP) chem-
istry syllabus were selected. Systems were re-
quired to produce coherent answers and an-
swer justifications in English to each of the
100 AP-style questions posed. In the case of
multiple choice questions, a letter response
was required. The evaluation team consisted
of three chemistry professors, who were in-
structed to grade the exams using AP guide-
lines. Answer justifications were thus re-
quired to conform to AP guidelines to receive
full credit.
System outputs were required to conform
to strict formats. The question number need-
ed to be clearly indicated, followed by the
original English text of the question. This was
to be followed by the answer. Multiple choice
questions required a letter answer. Finally,
the answer justification was required. Justifi-
cation guidelines required that the answers
be clear, concise, and appropriate for AP ex- Question Example 1. Ontoprise’s OntoNova
ams. Application’s Output for Multiple Choice Question 20.
The system outputs were rendered into Note the output’s format: the question number is indicated at the top; fol-
hardcopy and distributed to three chemistry lowed by the full English text of the original question; next, the letter answer
professors for evaluation. These subject mat- is indicated; finally, the answer justification is presented. The graders written
ter experts were asked to apply AP grading remarks are included. OntoNova employed a second inference step to derive
guidelines to assess the system outputs and to the answer justification, using human-authored templates, the proof tree, and
provide lots of written comments. combination rules to assemble the English text.
Question examples 1 through 3 present
Ontoprise, SRI, and Cycorp system outputs,
respectively. Examples 1 and 2 depict re-
sponses to multiple choice question 20, while
example 3 depicts the response to multiple
choice question 34. These figures also contain
graders’ remarks.
Question Example 2. SRI’s SHAKEN Output.

Human-authored templates associated with chem-
ical methods were combined during the back-
chaining process to produce the English text. The
templates specify the salient subgoals to be elabo-
rated as indented “sub explanations.” This result-
ed in generally superior answer justifications.
42 AI MAGAZINE
Articles
Question Example 3. The Output of

Cycorp’s OpenHalo Application for
Multiple Choice Question 34.
Note the grader’s remarks. OpenHalo used the
proof tree and Cyc’s generative English capabil-
ities to produce the English answer. This exam-
ple illustrates one of the better outcomes—
some questions produced many pages of gener-
ative English, which were far less intelligible to
the graders.
Comments on SRI Output

Very good, brief presentation in logical order
Good use of selective elimination
This is a good first effort, but still looks a little like what would be
expected from a student taking this exam.
Generally, when a calculation was required, the program did not

follow what I would expect from a student: namely, a setup, Comments on Output.
substituted numbers, followed by a solution to the problem.
Comments on Ontoprise Output
This figure shows a small, if representative, sub-
Good logical flow, set up with substitutions shown
set of verbatim comments that subject matter
Well done, and to the point experts made on the answers to the questions
The logic is more readable (and shorter!) than the KB justifications, but it produced by the system. The subject matter ex-
often seems to be presented backwards — arguing from the answer as perts had well-defined expectations of system
opposed to arguing to the answer. behavior, for example setting up a problem be-
There was a common error in the use of significant figures. Answers are fore actually presenting its solution or the
often given in as many figures as the program could generate. This is also a
common problem with students so I guess we could claim that the com- number of significant digits used in the output.
puter is being more human-like in these responses These comments did not reflect, for the most
Comments on Cycorp Output part, on the correctness of the computation,
Main strength of the program is that it did a fairly good job of arriving at but rather were indicators of “how things were
the correct answers to be done on an AP chemistry exam.” This
A good approach for replacing the “reasoning” used in this program is highlights the importance of understanding
to use “Show Method of Calculation” (for all questions which involve the
domain-specific requirements in answer and
calculation of a numerical answer) and “Explain Reasoning” (for all
questions which do not involve a calculated answer) answer-justification formation and generation
In general, the program appears to take a quantitative approach when
by question-answering systems.
answering questions and does not know how to take a qualitative
approach. For example, when the activity series was part of a question,
the program would use cell potentials
The “reasoning” are much, much, much too long. A sufficient “reasoning”
to any of the questions on the exam never requires more than the 1/2 page.
WINTER 2004 43
Articles
it at many top universities. The Halo teams be- portions of the syllabi would be best suited to
lieve that with additional, limited effort they demonstrate the broadest proof of concept,
would be able to improve the scores to the AP- given our resources, and better understand the
4 level and beyond. Vulcan Inc. has developed ways in which the content requirements of the
two additional challenge question sets to vali- different disciplines can be mutually leveraged;
date these claims at a future date. and finally, (4) produce a coherent, well moti-
vated design.
A 15-month implementation stage will fol-
Conclusions and Next Steps low. Here, the detailed designs will be rendered
As we noted at the beginning of this article, into working systems, and these systems will
Project Halo is a multistaged effort. In the fore- be subject to a comprehensive user evaluation
going, we have described phase one, which as- to understand their viability and the degree to
sessed the capability of knowledge-based sys- which the empirical data from actual use by
tems to answer a wide variety of unanticipated domain experts fits the models developed dur-
questions with coherent explanations. Phase ing the design stage. Finally, a nine-month re-
two of Project Halo will examine whether tools finement stage will attempt to correct the
can be built to enable domain experts to build shortcomings detected in the implementation
such with an ever-decreasing reliance on stage evaluation, and a second evaluation will
knowledge engineers—a goal that was pursued be undertaken to validate the refinements.
in DARPA’s Rapid Knowledge Formation pro- Future work will focus on tactical research to
ject. Empowering domain experts to build ro- fill gaps identified in Halo 2 that will lead to
greater coverage of the scientific domains.
bust knowledge bases with little or no assis-
These efforts will investigate both automated
tance from knowledge engineers will (1)
and semiautomated methods to facilitate for-
dramatically decrease the cost of knowledge
mulation of knowledge and posing of ques-
formulation; (2) greatly reduce the type of er-
tions and provide better tools for evaluation
rors observed in the Halo Pilot that resulted
and inspection of the knowledge-formulation
from lack of understanding of the domain on
and question-answering processes. We will also
the part of knowledge engineers; and (3) facili-
be focusing on reducing brittleness and other
tate a growing, peer-reviewed body of ma-
systemic failures of the Halo phase-two systems
chine-processable knowledge that will form
that will be identified by a comprehensive fail-
the basis for Digital Aristotle. A critical measure
ure analysis of the sort we developed for phase
of success is the degree to which the relevant
one. We will be seeking the assistance of the
scientific communities are willing to adopt
KR&R community to standardize our extended
these tools, especially in their pedagogies.
failure taxonomy for use in a wide variety of
At the core of the knowledge formulation
knowledge-based applications.
approach envisioned in phase two is a docu- Throughout phase two, Project Halo will be
ment-rooted methodology in which the do- conducting an ongoing dialogue with domain
main expert uses an existing document such as experts and educators, especially those from
a textbook as the basis for the formulation of a the three target scientific disciplines. Our aims
knowledge module. Tying knowledge modules are to better understand their needs and to ex-
to documents in this way will help determine plain the potential benefits of the availability
the scope and context of each module, the of high-quality machine-processable knowl-
types of questions they can be expected to an- edge to both research and education. For exam-
swer, and the appropriate depth and resolution ple, once a proof of concept for our knowledge
of the answers. The 30-month phase-two effort formulation approach has been established,
will be undertaken in three stages. First, a six- Project Halo will consider how knowledge
month, analysis-driven design process will ex- modules might be integrated into interactive
amine the complete AP syllabi for chemistry, tutoring applications. We will also examine
biology, and physics (B). The objective of this how such modules might assist knowledge-dri-
analysis will be to determine requirements on ven discovery, as part of the functionality of a
effective use by domain experts of a range of digital research assistant.
knowledge-acquisition technologies. The re-
sults of this study should allow us to (1) deter- Acknowledgments
mine the gaps in “coverage” of current state-of- Full support for this research was provided by
the-art knowledge-acquisition techniques and Vulcan Inc. as part of Project Halo. For more in-
define targeted research to fill those gaps; (2) formation, see www.projecthalo.com.
understand and prioritize the methods, tech-
niques, and technologies that will be central to Notes
the Halo 2 application; (3) understand which 1. MEDLINE is the National Library of Medicine’s
44 AI MAGAZINE
Articles
premier bibliographic database covering the fields of American Association for Artificial Intelligence.
medicine, nursing, dentistry, veterinary medicine, Clancey, W. J. 1983. The Epistemology of a Rule-
the health care system, and the preclinical sciences. Based Expert System: A Framework for Explanation.
2. Sections 2.6–2.9 in chapter 2 provide detailed in- Artificial Intelligence Journal 20(3): 215–251.
formation. Chapter 16 also requires the definition of Cohen, P.; Schrag, R.; Jones, E.; Pease, A.; Lin, A.;
moles, which appears in section 3.4, pages 87–89, Starr, B.; Gunning, D.; and Burke, M. 1998. The
and molarity, which can be found on page 134. The DARPA High Performance Knowledge Bases Project.
form of the equilibrium expression can be found on AI Magazine 19(4): 5–49.
page 580, and buffer solutions can be found in sec- Friedland, N.; Allen, P. G.; Witbrock, W.; Matthews,
tion 17.2. G.; Salay, N.; Miraglia, P.; Angele, J.; Staab, S.; Israel,
3. Available from http://www.opencyc.org/. D.; Chaudhri, V.; Porter, B.; Barker, K.; and Clark, P.
4 The system code and documentation are available 2004. Towards a Quantitative Platform-independent
at http://www.cs.utexas.edu/users/mfkb/km.html. Quantitative Analysis of Knowledge Systems. In Pro-
5. After the challenge evaluation was complete, the ceedings of the Ninth International Conference of Knowl-
teams put in a considerable effort to make improved edge Representation and Reasoning. 2004. Menlo Park,
versions of their application for use by the general Calif.: AAAI Press.
public. These improved versions address many of the Keyes, J. 1989. Why Expert Systems Fail? IEEE Expert
problems encountered on the sequestered versions. 4(11): 50–53.
Vulcan Inc has made both the sequestered and im- Kifer, M.; Lausen, G.; and Wu, J. 1995. Logical
proved versions available for download on the pro- Foundations of Object Oriented and Frame Based
ject Halo Web site. Languages. Journal of the ACM, 42(4): 741–843.
6. One explanation for this is that, although agreed Neches, R.; Fikes, R.; Finin, T.; Gruber, T.; Patil, R.;
guidelines exist for marking human justifications, Senator, T.; and Swartout, W. R. 1991. Enabling Tech-
the Halo systems can create justifications unlike any nology for Knowledge Sharing. AI Magazine 12(3):
that the graders have seen before (for example, with 6–56.
extensive verbatim repetition), and for which no Novak, G. 1995. Conversion of Units of Measure-
agreed scoring protocol has been established. ment. IEEE Transactions on Software Engineering 21(8):
651–661.
References Pool, M.; Murray, K.; Mehrotra, M.; Schrag, R.;
Angele, J.; Mönch, E.; Oppermann, H.; Staab, S.; and Blythe, J.; Kim, J.; Miraglia, P.; Russ, T.; and Schnei-
Wenke, D. 2003. Ontology-based Query and Answer- der, D. 2003. Evaluation of Expert Knowledge Elicit-
ing in Chemistry: Ontonova @ Project Halo. In Pro- ed for Critiquing Military Courses of Action. In Pro-
ceedings of the Second International Semantic Web Con- ceedings of the Second International Conference on
ference (ISWC2003). Berlin: Springer Verlag. Knowledge Capture (KCAP-03). New York: Association
Barker, K.; Porter, B.; and Clark, P. A Library of Gener- for Computing Machinery.
ic Concepts for Composing Knowledge Bases. In Pro- Voorheese, E. M., ed. 2003. The Twelfth Text Retrieval
ceedings of the First International Conference on Knowl- Conference (TREC2003). Washington, DC: Depart-
edge Capture (K-Cap’01), 14–21. New York: ment of Commerce, National Institute of Standards
Association for Computing Machinery Special Inter- and Technology (http://trec.nist.gov/pubs/trec12/
est Group on Artificial Intelligence. t12_proceedings.html).
Barker, K.; Chaudhri, V. K.; Chaw, S. Y.; Clark, P. E.; Witbrock, M.; and Matthews, G. 2003. Cycorp Project
Fan, J.; Israel, D.; Mishra, S.; Porter, B.; Romero, P.; Halo Final Report. Austin, TX.: Cycorp.
Tecuci, D.; and Yeh, P. 2004. A Question-Answering
System for AP Chemistry: Assessing KR&R Technolo-
gies. In Proceedings of the Ninth International Confer-
ence of Knowledge Representation and Reasoning. Menlo
Park, Calif.: AAAI Press. Noah Friedland, the former Pro-
ject Halo program manager at Vul-
Batanov, D.; and Brezillon, P., eds. 1996. Proceedings
can Inc., led the development of
of the First International Conference on Successes and
Digital Aristotle—a knowledge ap-
Failures of Knowledge-based Systems in Real World Ap-
plication that will be capable of as-
plications. Bangkok, Thailand: Asian Institute of
sisting scientific educators and re-
Technology.
searchers in their work in a
Brachman, R. J.; McGuinness, D. L.; Patel-Schneider, growing number of scientific disci-
P. F.; and Borgida, A. 1999. Reducing CLASSIC to plines. Friedland holds a Ph.D. in
“Practice”: Knowledge Representation Theory Meets computer science from the University of Maryland,
Reality. Artificial Intelligence Journal 114(1–2): where he studied under the late Azriel Rosenfeld. He
203–237. also holds degrees in aeronautical and electrical en-
Brown, T. L.; LeMay, H. Eugene; and Bursten, B. 2003. gineering from the Technion in Haifa, Israel. Prior to
Chemistry: The Central Science. Englewood Cliffs, NJ: joining Vulcan in 2002, Friedland held a variety of
Prentice Hall. leadership roles in Seattle-area high-tech startups.
Chaudhri, V.; and Fikes, R., eds. 1999. Question An- His website is www.noahfriedland.com.
swering Systems: Papers from the AAAI Fall Symposium. Investor and philanthropist Paul G. Allen creates
Technical Report FS-99-02. Menlo Park, California:
WINTER 2004 45
Articles
and advances world-class

projects and high-im-
pact initiatives that
change and improve the
world. He cofounded Mi-
crosoft with Bill Gates in
1976, remained the com-
pany’s chief technologist
until he left Microsoft in
1983, and is the founder and chairman of
Vulcan Inc. and the chairman of Charter
Communications (a broadband communi-
cations company). Named one of the top
ten philanthropists in America, Allen gives
back to the community through his chari-
table foundations. Learn more about Allen
online at www.vulcan.com.
Bruce Porter ([email protected]) is a

professor in the Department of Computer
Science at the University of Texas at Austin
and the director of the Artificial Intelli-
gence Laboratory. His research and teach-
ing focuses on knowledge-based systems SRI Team. Front row (left to right): Lisa Ahlberg, Vinay Chaudhri, Irina Beylin, David Israel,
Tomas Uribe. Back row (left to right): Ken Murray, Pedro Romero, Anne Venturelli, and
and the contributing technologies of
David Wilkins.
KR&R, machine learning, explanation gen-
eration, and natural language understand-
ing.
Ken Barker ([email protected]) inves-

tigates knowledge-based systems as a re-
search associate in the Department of Com-
puter Sciences at the University of Texas at
Austin. He received a Ph.D. in computa-
tional linguistics from the University of Ot-
tawa in 1998.
James Fan ([email protected]) is a Ph.D.

student at the University of Texas at Austin.
He has an M.S. (2001) and a B.S. (1996) in
computer sciences from the University of
Texas. His research interests include knowl-
edge-based systems, ontologies, semantic
web, natural language processing, and ma-
chine learning.
Dan G. Tecuci ([email protected]) is

currently a Ph.D. student in computer sci-
ences at the University of Texas at Austin.
He received a M.Sc. in computer sciences Knowledge Based Systems Group at University of Texas at Austin. Sitting (left to right): Ken
Barker, Bruce Porter. Standing (left to right): Jason Chaw, Peter Yeh, Dan Tecuci, James Fan.
from the same university in 2001. His main
research interest is artificial intelligence
with a focus on knowledge-based systems.
Peter Z. Yeh ([email protected]) is cur-

research interests center on advancing ontologies, and inference for large shared
rently a Ph.D. student in computer sciences
knowledge-based systems as a feasible tool knowledge bases.
at the University of Texas at Austin. He re-
for knowledge management and evolution.
ceived his B.S. and M.S. in computer science Peter Clark (Peter.E.Clark
from the same institution. His research in- David Israel ([email protected]) is di- @boeing.com) is a research
terests are in the areas of inexact and se- rector of the natural language program in scientist in the Intelligent
mantic graph matching, knowledge-base the Artificial Intelligence Center at SRI In- Information Systems
systems, and knowledge representation. ternational. group, Boeing Phantom
Works, leading research in
Shaw Yi Chaw ([email protected]) is Vinay Chaudhri (Vinay.Chaudhri@s the areas of knowledge-
currently a Ph.D student in computer sci- ri.com) is the program manager for ontol- based systems, machine
ences at the University of Texas at Austin. ogy management in the Artificial Intelli- reasoning, and controlled language pro-
He has a B.Comp in computer science from gence Center at SRI International. His re- cessing.
the National University of Singapore. His search focuses on knowledge acquisition,
46 AI MAGAZINE
Articles
Cycorp Team. Front row (left to right): Michael Witbrock, Gavin Matthews.
Back row (left to right): David Baxter, John Curtis, Blake Shaperd.
Juergen Angele ([email protected]) is

At Left:
the chief executive officer and cofounder of
Ontoprise Team.
Ontoprise GmbH. He has published some Sitting (left to right):
75 papers on ontologies, frame-based log- Eddie Moench,
ics, reasoning, and semantic web technolo- Jurgen Angele.
gies. He is the inventor of the inference en- Standing (left to right):
gine, OntoBroker. Dirk Wenke,
Henrik Oppermann,
Steffen Staab, PD Dr. rer. nat. (staab@onto- Steffen Staab.
prise.de), is a senior lecturer at the Univer-
sity of Karlsruhe and cofounder of Onto-
prise GmbH. He is currently on the tronic Data Systems, specializing in mem- nior ontologist with Convera Corporation.
editorial board of IEEE Intelligent Systems, ory management and 3-D solid modeling. He lives in Austin, Texas.
International Journal of Human-Computer
Michael Witbrock ([email protected]) is David Baxter ([email protected]) holds a
Studies, Journal of Web Semantics, and Jour-
currently vice president for research at Cy- Ph.D. in linguistics from the University of
nal of IT&T.
corp. Previously Witbrock was a principal Illinois. He has worked for Cycorp since
Dirk Wenke, Dipl. Math. Oec. (wenke@on- scientist at Terra Lycos, a research scientist 1998 and specializes in natural language
toprise.de), is a senior software developer at Just Systems Pittsburgh Research Center, parsing and generation.
and knowledge engineer at Ontoprise and a system scientist at CMU. He has a Blake Shepard ([email protected]) is an on-
GmbH, Karlsruhe. He is responsible for the Ph.D. in computer science from Carnegie tologist at Cycorp. He received his Ph.D. in
development of the ontology editing tools. Mellon University and is the author of philosophy from the University of Texas at
Eddie Moench, Dipl. Wi.-Ing. (moench more than two dozen papers and two Austin in 2000 and has been with Cycorp
@ontoprise.de), is software engineer at On- granted patents. since 1999.
toprise GmbH. He is responsible for the de- John Curtis ([email protected]) received his
velopment of SemanticMiner, the ontol- Ph.D. in philosophy at the Ohio State Uni-
ogy-based knowledge-retrieval application. versity in 1999, where he specialized in the
Henrik Oppermann, Dipl.-Ing. (opper- history of analytic philosophy and the phi-
[email protected]), is vice president of losophy of logic. With Cycorp since 1999,
professional services at and shareholder of he has worked primarily at the intersection
Ontoprise GmbH. His focus is on industrial of NLP and ontology, designing semantic
application of ontologies and reasoning models for parsing, and overseeing the de-
technologies. velopment and implementation of strate-
gies for resolving intermediate-level repre-
Gavin Matthews ([email protected]) is a sentations (Cyc’s semantic layer between
project manager with Cycorp. He holds an natural language and inferentially salient
M.A. in computer science and mathematics CycL).
from the University of Cambridge and
graduated in 1991. He has been with Cy- Pierluigi Miraglia (plmiraglia@yahoo.
corp since 2002. He has experience in soft- com) earned his doctorate in philosophy at
ware engineering management at Geodesic the Ohio State University. After working
Systems, Harlequin Group plc, and Elec- for several years at Cycorp, he is now a se-
WINTER 2004 47
AAAI Press
www.aaai.org/Press/Books/
Data Mining: Next Generation New Directions in

Challenges and Future Directions Question Answering
Edited by Hillol Kargupta, Anupam Joshi, Edited by Mark Maybury
Krishnamoorthy Sivakumar and Yelena Yesha
Q
uestion answering systems, which provide natural lan-
guage responses to natural language queries, are the sub-
ject of rapidly advancing research encompassing both
D
ata mining, or knowledge discovery, has become an in-
dispensable technology for businesses and researchers in academic study and commercial applications, the most well-
many fields. Drawing on work in such areas as statistics, known of which is the search engine Ask Jeeves. Question answer-
machine learning, pattern recognition, databases, and high per- ing draws on different fields and technologies, including natural
formance computing, data mining extracts useful information language processing, information retrieval, explanation genera-
from the large data sets now available to industry and science. tion, and human computer interaction. Question answering cre-
This collection surveys the most recent advances in the field and ates an important new method of information access and can be
charts directions for future research. seen as the natural step beyond such standard Web search meth-
The first part looks at pervasive, distributed, and stream data ods as keyword query and document retrieval. This collection
mining, discussing topics that include distributed data mining al- charts significant new directions in the field, including temporal,
gorithms for new application areas, several aspects of next-gen- spatial, definitional, biographical, multimedia, and multilingual
eration data mining systems and applications, and detection of question answering. After an introduction that defines essential
recurrent patterns in digital media. The second part considers terminology and provides a roadmap to future trends, the book
data mining, counter-terrorism, and privacy concerns, examin- covers key areas of research and development. These include cur-
ing such topics as biosurveillance, marshalling evidence through rent methods, architecture requirements, and the history of ques-
data mining, and link discovery. The third part looks at scientific tion answering on the Web; the development of systems to ad-
data mining; topics include mining temporally-varying phenom- dress new types of questions; interactivity, which is often required
ena, data sets using graphs, and spatial data mining. The last part for clarification of questions or answers; reuse of answers; ad-
considers web, semantics, and data mining, examining advances vanced methods; and knowledge representation and reasoning
in text mining algorithms and software, semantic webs, and oth- used to support question answering. Each section contains an in-
er subjects. troduction that summarizes the chapters included and places
ISBN 0-262-61203-8 576 pp, index.
them in context, relating them to the other chapters in the book
$40.00 softcover Copublished by The MIT Press
as well as to the existing literature in the field and assessing the
problems and challenges that remain.
ISBN 0-262-63304-3 352 pp., index.
$40.00 softcover Copublished by The MIT Pres
48 AI MAGAZINE

AIMag25 04 003

Uploaded by

Copyright:

Available Formats

AIMag25 04 003

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

AIMag25 04 003

Uploaded by

Copyright:

Available Formats

Articles

#$AcidType-Strong #$AcidType-Diprotic $#BaseType-Strong

Figure 1. Extract from Cyc’s Ontology of Acids and Bases.

and displayed, followed by the nested explana-

siderably lower than the correctness scores.

experts to produce more consistent scoring.

matter experts provided high-level comments, 50.00

mostly focused on the organization and con-

tions to prove generalized comprehension-ori- 50.00

systems. The Ontoprise system proved to be

experts to different extents and at different

from the beginning of the project, appeared to

tems. Instead, a high-level domain expert must

Answer Justification. Explanation, or, more 30.00

tems, it proved to be a substantial challenge. 5.00

Since the utility of the system will be evaluated 0.00

Examples of System Outputs and

Question Example 2. SRI’s SHAKEN Output.

Question Example 3. The Output of

Comments on SRI Output

Generally, when a calculation was required, the program did not

and advances world-class

Bruce Porter ([email protected]) is a

Ken Barker ([email protected]) inves-

James Fan ([email protected]) is a Ph.D.

Dan G. Tecuci ([email protected]) is

Peter Z. Yeh ([email protected]) is cur-

Juergen Angele ([email protected]) is

Data Mining: Next Generation New Directions in

You might also like