Information Extraction
Jerry R. Hobbs, University of Southern California
Ellen Riloff, University of Utah
21.1 Introduction
21.2 Diversity of IE Tasks
21.3 IE with Cascaded Finite-State Transducers
21.4 Learning-based Approaches to IE
21.5 How Good is Information Extraction?
21.6 Acknowledgments
Bibliography
21.1 Introduction
the events and their roles. For joint ventures, the roles were such things as
the participating companies, the joint venture company that was formed, the
activity it would engage in, and the amount of money it was capitalized for.
The systems were then run on a previously unseen test corpus. A system's performance was measured on recall (what percentage of the correct answers did the system get), precision (what percentage of the system's answers were correct), and F-score.1 The F-score is a weighted harmonic mean of recall and precision, computed by the following formula:
F = (β² + 1)PR / (β²P + R)

where P is precision, R is recall, and β determines their relative weighting; β = 1 weights the two equally.
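For instance, with β = 1, a system that finds 55% of the correct answers (R = 0.55) and is correct in 70% of what it reports (P = 0.70) scores F = 2(0.70)(0.55)/(0.70 + 0.55) ≈ 0.62.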
[Example: templates extracted from the Bridgestone Sports joint-venture text]

Relationship: TIE-UP
Entities: Bridgestone Sports Co.
          a local concern
          a Japanese trading house
Joint Venture Company: Bridgestone Sports Taiwan Co.
Activity: ACTIVITY-1
Amount: NT$20000000

ACTIVITY-1:
Activity: PRODUCTION
Company: Bridgestone Sports Taiwan Co.
Product: iron and metal wood clubs
Start Date: DURING: January 1990

1 When in a courtroom you promise to tell the whole truth, you are promising 100% recall. When you promise to tell nothing but the truth, you are promising 100% precision.
21.2 Diversity of IE Tasks
21.2.1
2 http://www.itl.nist.gov/iad/mig/tests/ace/
Professor John Skvoretz, U. of South Carolina, Columbia, will present a seminar entitled "Embedded Commitment," on Thursday, May 4th from 4-5:30 in PH 223D.
FIGURE 21.1: Example of an unstructured seminar announcement
21.2.2
Originally, information extraction systems were designed to locate domain-specific information in individual documents. Given a document as input, the IE system identifies and extracts facts relevant to the domain that appear in the document. We will refer to this task as single-document information extraction.
The abundance of information available on the Web has led to the creation of new types of IE systems that seek to extract facts from the Web or other very large text collections (e.g., (Brin 1998; Fleischman, Hovy, and Echihabi 2003; Etzioni, Cafarella, Popescu, Shaked, Soderland, Weld, and Yates 2005; Pasca, Lin, Bigham, Lifchits, and Jain 2006; Pasca 2007; Banko, Cafarella, Soderland, Broadhead, and Etzioni 2007)). We will refer to this task as multi-document information extraction.
3 These text forms can include some structured information as well, such as publication dates and author by-lines. But most of the text in these genres is unstructured.

Single-document IE is fundamentally different from multi-document IE, although both types of systems may use similar techniques. One distinguishing issue is redundancy. A single-document IE system must extract domain-specific information from each document that it is given. If the system fails
to find relevant information in a document, then that is an error. This task
is challenging because many documents mention a fact only once, and the
fact may be expressed in an unusual or complex linguistic context (e.g., one
requiring inference). In contrast, multi-document IE systems can exploit the
redundancy of information in their large text collections. Many facts will appear
in a wide variety of contexts, so the system usually has multiple opportunities
to find each piece of information. The more often a fact appears, the greater
the chance that it will occur at least once in a linguistically simple context
that will be straightforward for the IE system to recognize.
Multi-document IE is sometimes referred to as open-domain IE because
the goal is usually to acquire broad-coverage factual information, which will
likely benefit many domains. In this paradigm, it doesn't matter where the information originated. Some open-domain IE systems, such as KnowItAll (Etzioni, Cafarella, Popescu, Shaked, Soderland, Weld, and Yates 2005) and TextRunner (Banko, Cafarella, Soderland, Broadhead, and Etzioni 2007), have
addressed issues of scale to acquire large amounts of information from the
Web. One of the major challenges in multi-document IE is cross-document
coreference resolution: when are two documents talking about the same entities? Some researchers have tackled this problem (e.g., (Bagga and Baldwin
1998; Mann and Yarowsky 2003; Gooi and Allan 2004; Niu, Li, and Srihari
2004; Mayfield, Alexander, Dorr, Eisner, Elsayed, Finin, Fink, Freedman,
Garera, McNamee, Mohammad, Oard, Piatko, Sayeed, Syed, Weischedel, Xu,
and Yarowsky 2009)), and in 2008 the ACE evaluation expanded its focus to
include cross-document entity disambiguation (Strassel, Przybocki, Peterson,
Song, and Maeda 2008).
21.2.3
5 Note that a document may mention multiple victims, so the IE system needs to determine whether an extracted victim refers to a previously mentioned victim or a new one.
21.3 IE with Cascaded Finite-State Transducers
Probably the most important idea that emerged in the course of the MUC
evaluations was the decomposition of the IE process into a series of subproblems that can be modeled with cascaded finite-state transducers (Lehnert,
Cardie, Fisher, Riloff, and Williams 1991; Hobbs, Appelt, Bear, Israel, and
Tyson 1992; Hobbs, Appelt, Bear, Israel, Kameyama, Stickel, and Tyson
1997; Joshi 1996; Cunningham, Maynard, Bontcheva, and Tablan 2002). A
finite-state automaton reads one element at a time of a sequence of elements;
each element transitions the automaton into a new state, based on the type
of element it is, e.g., the part of speech of a word. Some states are designated
as final, and a final state is reached when the sequence of elements matches
a valid pattern. In a finite-state transducer, an output entity is constructed
when final states are reached, e.g., a representation of the information in a
phrase. In a cascaded finite-state transducer, there are different finite-state
transducers at different stages. Earlier stages will package a string of elements
into something the next stage will view as a single element.
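As a rough illustration of the control structure involved, here is a minimal sketch in Python of a single transducer stage; the transition table, element types, and output representation are invented for exposition and are far simpler than anything in a real IE system.

```python
# A minimal sketch of one stage of a cascaded finite-state transducer.
# Transitions are keyed on the type of the incoming element; reaching a
# final state emits a chunk that the next stage treats as one element.
TRANSITIONS = {
    ("start", "DET"): "in_ng",
    ("start", "ADJ"): "in_ng",
    ("start", "NOUN"): "ng_done",
    ("in_ng", "ADJ"): "in_ng",
    ("in_ng", "NOUN"): "ng_done",
}
FINAL_STATES = {"ng_done"}

def transduce(elements):
    """Scan (word, type) pairs; emit a NounGroup chunk at each final state."""
    output, state, buffer = [], "start", []
    for word, etype in elements:
        nxt = TRANSITIONS.get((state, etype))
        if nxt is None:                      # no transition: pass element through
            if buffer:                       # flush a partial, unmatched chunk
                output.extend(buffer)
                buffer, state = [], "start"
            output.append((word, etype))
            continue
        buffer.append((word, etype))
        state = nxt
        if state in FINAL_STATES:            # pattern complete: package the chunk
            output.append(("NounGroup", [w for w, _ in buffer]))
            buffer, state = [], "start"
    if buffer:                               # flush any trailing partial chunk
        output.extend(buffer)
    return output

print(transduce([("the", "DET"), ("joint", "ADJ"), ("venture", "NOUN"),
                 ("formed", "VERB")]))
```

Each stage's output chunks become the single elements read by the next stage in the cascade.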
In the typical system, the earlier stages recognize smaller linguistic objects
and work in a largely domain-independent fashion. They use purely linguistic
knowledge to recognize portions of the syntactic structure of a sentence that
linguistic methods can determine reliably, requiring relatively little modification or augmentation as the system is moved from domain to domain. The
later stages take these linguistic objects as input and find domain-dependent
patterns within them. In a typical IE system, there are five levels of processing:
1. Complex Words: This includes the recognition of multiwords and proper
name entities, such as people, companies, and countries.
2. Basic Phrases: Sentences are segmented into noun groups, verb groups,
and particles.
3. Complex Phrases: Complex noun groups and complex verb groups are
identified.
4. Domain Events: The sequence of phrases produced at Level 3 is scanned
for patterns of interest to the application, and when they are found,
semantic structures are built that encode the information about entities
and events contained in the pattern.
5. Merging Structures: Semantic structures from different parts of the text
are merged if they provide information about the same entity or event.
This process is sometimes called template generation, and is a complex
process not done by a finite-state transducer.
As we progress through the five levels, larger segments of text are analyzed
and structured. In each of stages 2 through 4, the input to the finite-state
transducer is the sequence of chunks constructed in the previous stage. The
GATE project (Cunningham, Maynard, Bontcheva, and Tablan 2002) is a
widely used toolkit that provides many of the components needed for such an
IE pipeline.
This decomposition of the natural-language problem into levels is essential
to the approach. Many systems have been built to do pattern matching on
strings of words. The advances in information extraction have depended crucially on dividing that process into separate levels for recognizing phrases and
recognizing patterns among the phrases. Phrases can be recognized reliably
with purely syntactic information, and they provide precisely the elements
that are required for stating the patterns of interest.
In the next five sections we illustrate this process on the Bridgestone Sports
text.
21.3.1 Complex Words
The first level of processing identifies multiwords such as "set up," "trading house," "new Taiwan dollars," and "joint venture," and company names like "Bridgestone Sports Co." and "Bridgestone Sports Taiwan Co." The names of people and locations, dates, times, and other basic entities are also recognized at this level. Languages in general are very productive in the construction of short, multiword fixed phrases and proper names, and this is the level at which they are recognized, using specialized microgrammars.
Some names can be recognized by their internal structure. A common pattern for company names is ProperName ProductName, as in "Acme Widgets." Others can only be recognized by means of a table. Internal structure cannot tell us that IBM is a company and DNA is not. It is also sometimes possible to recognize the types of proper names by the context in which they occur. For example, in the sentences below:
(a) XYZ's sales
(b) Vaclav Havel, 53, president of the Czech Republic

we might not know that XYZ is a company and Vaclav Havel is a person, but the surrounding context tells us which is which.
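As a concrete but deliberately toy example of such a microgrammar, the following sketch assumes a small, invented inventory of product nouns and corporate designators; a real system would use much larger tables plus contextual rules of the kind just described.

```python
import re

# A toy microgrammar for company names following the "ProperName
# ProductName" pattern mentioned above. The word lists are invented.
PRODUCT_NOUNS = r"(?:Widgets|Systems|Sports|Semiconductors)"
CORP_SUFFIX = r"(?:Co\.|Corp\.|Inc\.|Ltd\.)"

COMPANY = re.compile(
    rf"\b(?:[A-Z][a-z]+\s+)+{PRODUCT_NOUNS}(?:\s+{CORP_SUFFIX})?"
)

for text in ["Acme Widgets announced a merger.",
             "Bridgestone Sports Co. set up a joint venture."]:
    match = COMPANY.search(text)
    print(match.group(0) if match else None)
```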
21.3.2 Basic Phrases
orderings and conjunctions of prenominal nouns and noun-like adjectives. Thus,
among the noun groups that can be recognized are:
approximately 5 kg
more than 30 people
the newly elected president
the largest leftist political force
a government and commercial project
The principal ambiguities that arise in this stage are due to noun-verb ambiguities. For example, "the company names" could be a single noun group with the head noun "names," or it could be a noun group "the company" followed by the verb "names." One can use a lattice representation to encode the two analyses and resolve the ambiguity in the stage for recognizing domain events.
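A minimal sketch of such a lattice, with an invented edge representation: both readings of "the company names" coexist as alternative paths until the domain-event stage chooses between them.

```python
# Tokens: 0:"the" 1:"company" 2:"names"
# Each edge is (start, end, label, words); both readings share the lattice.
lattice = [
    (0, 3, "NounGroup", ["the", "company", "names"]),   # one noun group
    (0, 2, "NounGroup", ["the", "company"]),            # alternative reading...
    (2, 3, "VerbGroup", ["names"]),                     # ...followed by a verb
]

def paths(lattice, start, end):
    """Enumerate every sequence of edges that tiles the span [start, end)."""
    if start == end:
        return [[]]
    return [[edge] + rest
            for edge in lattice if edge[0] == start
            for rest in paths(lattice, edge[1], end)]

for analysis in paths(lattice, 0, 3):
    print([(label, words) for _, _, label, words in analysis])
```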
Verb groups (and predicate adjective constructions) can be recognized by an
even simpler finite-state grammar that, in addition to chunking, also tags them
as Active Voice, Passive Voice, Gerund, and Infinitive. Verbs are sometimes
locally ambiguous between active and passive senses, as the verb kidnapped
in the following two sentences:
Several men kidnapped the mayor today.
Several men kidnapped yesterday were released today.
These cases can be tagged as Active/Passive, and the domain-event stage can
later resolve the ambiguity. Some work has also been done to train a classifier
to distinguish between active voice and reduced passive voice constructions
(Igo and Riloff 2008).
The breakdown of phrases into nominals, verbals, and particles is a linguistic
universal. Whereas the precise parts of speech that occur in any language can
vary widely, every language has elements that are fundamentally nominal in
character, elements that are fundamentally verbal or predicative, and particles
or inflectional affixes that encode relations among the other elements (Croft
1991).
21.3.3 Complex Phrases
Some complex noun groups and verb groups can be recognized reliably on the basis of domain-independent, syntactic information. For example:

the attachment of appositives to their head noun group
    "The joint venture, Bridgestone Sports Taiwan Co.,"

the construction of measure phrases
    "20,000 iron and metal wood clubs a month"

the attachment of "of" and "for" prepositional phrases to their head noun groups
Complex verb groups can also be recognized in this stage. Consider the
following variations:
GM formed a joint venture with Toyota.
GM announced it was forming a joint venture with Toyota.
GM signed an agreement forming a joint venture with Toyota.
GM announced it was signing an agreement to form a joint venture with Toyota.
Although these sentences may differ in significance for some applications, often
they would be considered equivalent in meaning. Rather than defining each
of these variations, with all their syntactic variants, at the domain event
level, the user should be able to define complex verb groups that share the
same significance. Thus, "formed," "announced it was forming," "signed an agreement forming," and "announced it was signing an agreement to form"
may all be equivalent, and once they are defined to be so, only one domain
event pattern needs to be expressed. Verb group conjunction, as in
Terrorists kidnapped and killed three people.
can be treated as a complex verb group as well.
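One simple way to realize such definitions, sketched below with invented names, is a normalization table that maps each surface variant onto a canonical predicate before domain-event patterns are matched.

```python
# Hypothetical normalization of complex verb groups: every surface
# variant maps to one canonical predicate, so a single domain-event
# pattern ("<Company> FORM <Joint-Venture> ...") covers all of them.
CANONICAL = {
    "formed": "FORM",
    "announced it was forming": "FORM",
    "signed an agreement forming": "FORM",
    "announced it was signing an agreement to form": "FORM",
    "kidnapped and killed": "KIDNAP+KILL",   # verb group conjunction
}

def normalize(verb_group: str) -> str:
    # Fall back to the surface form when no equivalence is defined.
    return CANONICAL.get(verb_group.lower(), verb_group.upper())

print(normalize("announced it was forming"))   # -> FORM
```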
21.3.4 Domain Events
The next stage is recognizing domain events, and its input is a list of the basic and complex phrases recognized in the earlier stages, in the order in which they occur. Anything that was not identified as a basic or complex phrase in a previous stage can be ignored in this stage; this can be a significant source of robustness.
Identifying domain events requires a set of domain-specific patterns, both to recognize phrases that correspond to an event of interest and to identify the syntactic constituents that correspond to the event's role fillers. In early information extraction systems, these domain-specific extraction patterns were defined manually. In Sections 21.4.1 and 21.4.3, we describe a variety of learning methods that have subsequently been developed to automatically generate domain-specific extraction patterns from training corpora.
The patterns for events of interest can be encoded as finite-state machines, where state transitions are effected by phrases. The state transitions are driven off the head words in the phrases. That is, each pair of relevant head word and phrase type, such as company-NounGroup and formed-PassiveVerbGroup, has an associated set of state transitions.
joint-venture text, the domain event patterns
<Company/ies> <Set-up> <Joint-Venture> with <Company/ies>
and
<Produce> <Product>
would be instantiated in the first sentence, and the patterns
<Company> <Capitalized> at <Currency>
and
<Company> <Start> <Activity> in/on <Date>
in the second. These four patterns would result in the following four structures
being built:
Relationship: TIE-UP
Entities: Bridgestone Sports Co.
          a local concern
          a Japanese trading house
Joint Venture Company: —
Activity: —
Amount: —

Activity: PRODUCTION
Company: —
Product: golf clubs
Start Date: —

Relationship: TIE-UP
Entities: —
Joint Venture Company: Bridgestone Sports Taiwan Co.
Activity: —
Amount: NT$20000000

Activity: PRODUCTION
Company: Bridgestone Sports Taiwan Co.
Product: —
Start Date: DURING: January 1990
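To make the mechanics concrete, here is a toy rendering of the first pattern above; the phrase representation, head-word classes, and matching procedure are invented simplifications of what a real domain-event stage would do.

```python
# Sketch: a domain-event pattern as a sequence of phrase constraints.
# Each constraint names a phrase type, a head-word class, and (optionally)
# the template slot it fills. All names here are invented.
PATTERN = [
    ("NounGroup", "company",       "Entities"),   # <Company/ies>
    ("VerbGroup", "set-up",        None),         # <Set-up>
    ("NounGroup", "joint-venture", None),         # <Joint-Venture>
    ("Particle",  "with",          None),         # with
    ("NounGroup", "company",       "Entities"),   # <Company/ies>
]

HEAD_CLASS = {
    "Bridgestone Sports Co.": "company",
    "a local concern": "company",
    "set up": "set-up",
    "a joint venture": "joint-venture",
    "with": "with",
}

def match(pattern, phrases):
    """Instantiate a TIE-UP structure if the phrase sequence fits the pattern."""
    structure = {"Relationship": "TIE-UP", "Entities": []}
    if len(phrases) != len(pattern):
        return None
    for (ptype, pclass, slot), (ftype, text) in zip(pattern, phrases):
        if ftype != ptype or HEAD_CLASS.get(text) != pclass:
            return None
        if slot:
            structure[slot].append(text)
    return structure

phrases = [("NounGroup", "Bridgestone Sports Co."),
           ("VerbGroup", "set up"),
           ("NounGroup", "a joint venture"),
           ("Particle", "with"),
           ("NounGroup", "a local concern")]
print(match(PATTERN, phrases))
```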
21.3.5 Merging Structures
The first four stages of processing all operate within the bounds of single
sentences. The final level of processing operates over the whole text. Its task
is to see that all the information collected about a single entity, relationship,
or event is combined into a unified whole. This is one of the primary ways that
the problem of coreference is dealt with in information extraction, including
both NP coreference (for entities) and event coreference. One event template
is generated for each event, which coalesces all of the information associated
with that event. If an input document discusses multiple events of interest,
then the IE system must generate multiple event templates. Generating multiple event templates requires additional discourse analysis to (a) correctly
determine how many distinct events are reported in the document, and (b)
correctly assign each entity and object to the appropriate event template.
Among the criteria that need to be taken into account in determining
whether two structures can be merged are the internal structure of the noun
groups, nearness along some metric, and the consistency, or more generally,
the compatibility of the two structures.
In the analysis of the sample joint-venture text, we have produced three activity structures. They are all consistent because they are all of type PRODUCTION and because "iron and metal wood clubs" is consistent with "golf clubs." Hence, they are merged, yielding:
Activity: PRODUCTION
Company: Bridgestone Sports Taiwan Co.
Product: iron and metal wood clubs
Start Date: DURING: January 1990
Similarly, the two relationship structures that have been generated are consistent with each other, so they can be merged, yielding:
Relationship: TIE-UP
Entities: Bridgestone Sports Co.
          a local concern
          a Japanese trading house
Joint Venture Company: Bridgestone Sports Taiwan Co.
Activity: ACTIVITY-1
Amount: NT$20000000
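The merging step itself can be sketched as a simple unification over slots; here the compatibility test is reduced to exact match plus a tiny invented subsumption table, standing in for the real internal-structure, nearness, and consistency criteria discussed above.

```python
# "golf clubs" subsumes "iron and metal wood clubs" in this toy hierarchy,
# so the two Product values are compatible and the more specific one wins.
SUBSUMES = {("golf clubs", "iron and metal wood clubs")}

def compatible(a, b):
    return (a is None or b is None or a == b
            or (a, b) in SUBSUMES or (b, a) in SUBSUMES)

def merge(s1, s2):
    """Merge two structures if every shared slot is compatible, else refuse."""
    keys = set(s1) | set(s2)
    if any(not compatible(s1.get(k), s2.get(k)) for k in keys):
        return None
    merged = {}
    for k in keys:
        a, b = s1.get(k), s2.get(k)
        # prefer the more specific value (the subsumed one), else whichever exists
        merged[k] = b if (a, b) in SUBSUMES else a if a is not None else b
    return merged

act1 = {"Activity": "PRODUCTION", "Product": "golf clubs"}
act2 = {"Activity": "PRODUCTION", "Company": "Bridgestone Sports Taiwan Co.",
        "Start Date": "DURING: January 1990"}
print(merge(act1, act2))
```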
The entity and event coreference problems are very hard, and constitute
active and important areas of research. Coreference resolution was a task in
the later MUC evaluations (MUC-6 Proceedings 1995; MUC-7 Proceedings
1998), and has been a focus of the ACE evaluations. Many recent research
efforts have applied machine learning techniques to the problem of coreference
resolution (e.g., (Dagan and Itai 1990; McCarthy and Lehnert 1995; Aone and
Bennett 1996; Kehler 1997; Cardie and Wagstaff 1999; Harabagiu, Bunescu,
and Maiorana 2001; Soon, Ng, and Lim 2001; Ng and Cardie 2002; Bean and
Riloff 2004; McCallum and Wellner 2004; Yang, Su, and Tan 2005; Haghighi
and Klein 2007)).
Some attempts to automate the template generation process will be discussed in Section 21.4.4.
21.4 Learning-based Approaches to IE
21.4.1
6 In contrast, creating IE patterns and rules by hand typically requires computational linguists who understand how the patterns or rules will be integrated into the NLP system.
systems. AutoSlog (Riloff 1993; Riloff 1996a) matches a small set of syntactic templates against the text surrounding a desired extraction and creates
one (or more) lexico-syntactic patterns by instantiating the templates with
the corresponding words in the sentence. A human in the loop must then
manually review the patterns to decide which ones are appropriate for the
IE task. PALKA (Kim and Moldovan 1993) uses manually defined frames
and keywords that are provided by a user and creates IE patterns by mapping clauses containing the keywords onto the frame's slots. The patterns are
generalized based on the semantic features of the words.
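For a feel of what instantiating a syntactic template looks like, here is a toy AutoSlog-style sketch; the template inventory and clause representation are invented, and real AutoSlog operates over the output of a shallow parser.

```python
# Toy AutoSlog-style template instantiation. Given a clause whose parse
# marks its verb and voice, and the syntactic position we want extracted,
# pick a matching template and fill it with the verb from the sentence.
TEMPLATES = {
    ("subject", "passive"): "<{slot}> was {verb}",
    ("subject", "active"):  "<{slot}> {verb}",
    ("object",  "active"):  "{verb} <{slot}>",
}

def instantiate(clause, target_role, slot):
    template = TEMPLATES.get((target_role, clause["voice"]))
    if template is None:
        return None
    return template.format(slot=slot, verb=clause["verb"])

# "the mayor was kidnapped" -> proposes the pattern "<victim> was kidnapped"
clause = {"verb": "kidnapped", "voice": "passive"}
print(instantiate(clause, "subject", "victim"))
```

A human reviewer would then decide whether each proposed pattern is appropriate for the IE task, as described above.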
Several systems use rule learning algorithms to automatically generate IE
patterns from annotated text corpora. LIEP (Huffman 1996) creates candidate patterns by identifying syntactic paths that relate the role fillers in
a sentence. The patterns that perform well on training examples are kept,
and as learning progresses they are generalized to accommodate new training
examples by creating disjunctions of terms. CRYSTAL (Soderland, Fisher,
Aseltine, and Lehnert 1995) learns extraction rules using a unification-based
covering algorithm. CRYSTAL's rules are concept node structures that include lexical, syntactic, and semantic constraints. WHISK (Soderland 1999)
was an early system that was specifically designed to be flexible enough to
handle structured, semi-structured, and unstructured texts. WHISK learns
regular expression rules that consist of words, semantic classes, and wildcards
that match any token. (LP)² (Ciravegna 2001) induces two different kinds of
IE rules: tagging rules to label instances as desired extractions, and correction
rules to correct mistakes made by the tagging rules. Freitag created a rule-learning system called SRV (Freitag 1998b) and later combined it with a rote
learning mechanism and a Naive Bayes classifier to explore a multi-strategy
approach to IE (Freitag 1998a).
Relational learning methods have also been used to learn rule-like structures for IE (e.g., (Roth and Yih 2001; Califf and Mooney 2003; Bunescu and
Mooney 2004; Bunescu and Mooney 2007)). RAPIER (Califf and Mooney
1999; Califf and Mooney 2003) uses relational learning methods to generate
IE rules, where each rule has a pre-filler, filler, and post-filler component.
Each component is a pattern that consists of words, POS tags, and semantic
classes. Roth and Yih (Roth and Yih 2001) propose a knowledge representation language for propositional relations and create a 2-stage classifier that
first identifies candidate extractions and then selects the best ones. Bunescu
and Mooney (Bunescu and Mooney 2004) use Relational Markov Networks to
represent dependencies and influences across entities and extractions.
IE pattern learning methods have also been developed for related applications such as question answering (Ravichandran and Hovy 2002), where the
goal is to learn patterns for specific types of questions that involve relations
between entities (e.g., identifying the birth year of a person).
21.4.2
producing extractions that otherwise would have been missed. Yu et al. (Yu,
Guan, and Zhou 2005) created a cascaded model of HMMs and SVMs. In
the first pass, an HMM segments resumes into blocks that represent different
types of information. In the second pass, HMMs and SVMs extract information from the blocks, with different classifiers trained to extract different
types of information.
The chapter on Fundamental Statistical Techniques in this book explains
how to create classifiers and sequential prediction models using supervised
learning techniques.
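As a quick illustration of the standard reduction used by such classifiers and sequential models, extraction can be cast as per-token BIO tagging, with labeled spans decoded back into extractions afterward. The example and tag set below follow the usual convention rather than any particular system above.

```python
# Casting extraction as sequence labeling: each token gets a BIO tag,
# and contiguous B/I spans are decoded back into extractions.
tokens = ["Bridgestone", "Sports", "Co.", "set", "up", "a", "joint", "venture"]
tags   = ["B-COMPANY", "I-COMPANY", "I-COMPANY", "O", "O", "O", "O", "O"]

def decode(tokens, tags):
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

print(decode(tokens, tags))   # -> ['Bridgestone Sports Co.']
```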
21.4.3
Supervised learning techniques substantially reduced the manual effort required to create an IE system for a new domain. However, annotating training
texts still requires a substantial investment of time, and annotating documents for information extraction can be deceptively complex (Riloff 1996b).
Furthermore, since IE systems are domain-specific, annotated corpora cannot
be reused: a new corpus must be annotated for each domain.
To further reduce the knowledge engineering required to create an IE system, several methods have been developed in recent years to learn extraction
patterns using weakly supervised and unsupervised techniques. AutoSlog-TS (Riloff 1996b) is a derivative of AutoSlog that requires as input only a
preclassified training corpus in which texts are identified as relevant or irrelevant with respect to the domain but are not annotated in any other way.
AutoSlog-TS's learning algorithm is a two-step process. In the first step, AutoSlog's syntactic templates are applied to the training corpus exhaustively,
which generates a large set of candidate extraction patterns. In the second
step, the candidate patterns are ranked based on the strength of their association with the relevant texts. Ex-Disco (Yangarber, Grishman, Tapanainen,
and Huttunen 2000) took this approach one step further by eliminating the
need for a preclassified text corpus. Ex-Disco uses a small set of manually
defined seed patterns to partition a collection of unannotated text into relevant and irrelevant sets. The pattern learning process is then embedded in
a bootstrapping loop where (1) patterns are ranked based on the strength of
their association with the relevant texts, (2) the best pattern(s) are selected
and added to the pattern set, and (3) the corpus is re-partitioned into new
relevant and irrelevant sets. Both AutoSlog-TS and Ex-Disco produced IE patterns that performed well in comparison to pattern sets used by previous IE
systems. However, the ranked pattern lists produced by these systems still
need to be manually reviewed.7
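As a sketch of the ranking step shared by these systems, the following uses an RlogF-style score (relevance rate times the log of a pattern's frequency in relevant texts), in the spirit of AutoSlog-TS; treat the exact scoring function and the counts as illustrative assumptions rather than a quotation from the original systems.

```python
import math

# counts[pattern] = (matches in relevant texts, total matches); toy values
counts = {
    "<subj> was kidnapped": (18, 20),
    "<subj> was reported":  (25, 100),
    "exploded in <np>":     (9, 10),
}

def rlogf(rel, total):
    # relevance rate * log2(frequency in relevant texts)
    return (rel / total) * math.log2(rel) if rel > 0 else 0.0

ranked = sorted(counts.items(), key=lambda kv: rlogf(*kv[1]), reverse=True)
for pattern, (rel, total) in ranked:
    print(f"{pattern:25s} RlogF = {rlogf(rel, total):.2f}")
```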
7 The human reviewer discards patterns that are not relevant to the IE task and assigns an event role to the patterns that are kept.

Stevenson and Greenwood (Stevenson and Greenwood 2005) also begin with
seed patterns and use semantic similarity measures to iteratively rank and
select new candidate patterns based on their similarity to the seeds. Stevenson
and Greenwood use predicate-argument structures as the representation for
their IE patterns, as did Surdeanu et al. (Surdeanu, Harabagiu, Williams,
and Aarseth 2003) and Yangarber (Yangarber 2003) in earlier work. Sudo et
al. (Sudo, Sekine, and Grishman 2003) created an even richer subtree model
representation for IE patterns, where an IE pattern can be an arbitrary subtree
of a dependency tree. The subtree patterns are learned from relevant and
irrelevant training documents. Bunescu and Mooney (Bunescu and Mooney
2007) developed a weakly supervised method for relation extraction that uses
Multiple Instance Learning (MIL) techniques with SVMs and string kernels.
Meta-bootstrapping (Riloff and Jones 1999) is a bootstrapping method that
learns information extraction patterns and also generates noun phrases that
belong to a semantic class at the same time. Given a few seed nouns that
belong to a targeted semantic class, the meta-bootstrapping algorithm iteratively learns a new extraction pattern and then uses the learned pattern to
hypothesize additional nouns that belong to the semantic class. The patterns
learned by meta-bootstrapping are more akin to named entity recognition patterns than event role patterns, however, because they identify noun phrases
that belong to general semantic classes, irrespective of any events.
Recently, Phillips and Riloff (Phillips and Riloff 2007) showed that bootstrapping methods can be used to learn event role patterns by exploiting role-identifying nouns as seeds. A role-identifying noun is a word that, by virtue
of its lexical semantics, identifies the role that the noun plays with respect
to an event. For example, the definition of the word kidnapper is the agent
of a kidnapping event. By using role-identifying nouns as seeds, the Basilisk
bootstrapping algorithm (Thelen and Riloff 2002) can be used to learn both
event extraction patterns as well as additional role-identifying nouns.
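The shared skeleton of these bootstrapping methods can be sketched as follows; the corpus statistics, scoring, and stopping rule are invented toy stand-ins for the much more careful scoring that meta-bootstrapping and Basilisk actually use.

```python
# A generic bootstrapping loop in the spirit of meta-bootstrapping /
# Basilisk: seed nouns select a pattern, the best pattern proposes new
# nouns, and the loop repeats.
CORPUS = {  # pattern -> set of noun phrases it extracts (toy data)
    "<subj> was kidnapped": {"the mayor", "three tourists", "a journalist"},
    "<subj> was repaired":  {"the bridge", "the road"},
    "murder of <np>":       {"the mayor", "a priest"},
}

def bootstrap(seeds, rounds=2):
    lexicon = set(seeds)
    for _ in range(rounds):
        # score each pattern by how many known category members it extracts
        best = max(CORPUS, key=lambda p: len(CORPUS[p] & lexicon))
        if not CORPUS[best] & lexicon:
            break                      # no pattern overlaps the lexicon
        lexicon |= CORPUS[best]        # admit everything the pattern extracts
    return lexicon

print(sorted(bootstrap({"the mayor"})))
```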
Finally, Shinyama and Sekine (Shinyama and Sekine 2006) have developed
an approach for completely unsupervised learning of information extraction
patterns. Given texts for a new domain, relation discovery methods are used
to preemptively learn the types of relations that appear in domain-specific
documents. The On-Demand Information Extraction (ODIE) system (Sekine
2006) accepts a user query for a topic, dynamically learns IE patterns for
salient relations associated with the topic, and then applies the patterns to
fill in a table with extracted information related to the topic.
21.4.4 Discourse-oriented Approaches to IE
Most of the IE systems that we have discussed thus far take a relatively
localized approach to information extraction. The IE patterns or classifiers
focus only on the local context surrounding a word or phrase when making
an extraction decision. Recently, some systems have begun to take a more
global view of the extraction process. Gu and Cercone (Gu and Cercone 2006)
and Patwardhan and Riloff (Patwardhan and Riloff 2007) use classifiers to first
identify the event-relevant sentences in a document and then apply an IE
system to extract information from those relevant sentences.
Finkel et al. (Finkel, Grenager, and Manning 2005) impose penalties in their
learning model to enforce label consistency among extractions from different
parts of a document. Maslennikov and Chua (Maslennikov and Chua 2007) use
dependency and RST-based discourse relations to connect entities in different
clauses and find long-distance dependency relations.
Finally, as we discussed in Section 21.3.5, IE systems that process multiple-event documents need to generate multiple templates. Template generation
for multiple events is extremely challenging, and only a few learning systems
have been developed to automate this process for new domains. Wrap-Up (Soderland and Lehnert 1994) was an early supervised learning system
that uses a collection of decision trees to make a series of discourse decisions
to automate the template generation process. More recently, Chieu et al.
(Chieu, Ng, and Lee 2003) developed a system called ALICE that generates
complete templates for the MUC-4 terrorism domain (MUC-4 Proceedings
1992). ALICE uses a set of classifiers that identify extractions for each type
of slot and a template manager to decide when to create a new template.
The template manager uses general-purpose rules (e.g., a conflicting date will
spawn a new template) as well as automatically derived seed words that
are associated with different incident types to distinguish between events.
21.5 How Good is Information Extraction?

[Figure: How Did the Field Progress? Approximate best F-scores across the evaluations: MUC-3 (1991) 60%; MUC-4 (1992) 60%; MUC-5 (1993) 60%; MUC-6 (1995) 60%; MUC-7 (1998) 60%.]
After the MUC evaluations ended, performance seemed stuck at an F-score of roughly 60%, and subsequent learning-based systems have shown little improvement in performance on the MUC data sets (e.g., (Soderland 1999; Chieu, Ng, and Lee 2003; Maslennikov and Chua 2007)).8

8 The one exception is that (Maslennikov and Chua 2007) report an F-score of 72% on a modified version of the MUC-6 corpus.
There are several possible explanations for this barrier. Detailed analysis
of the performance of some of the systems revealed that the biggest source
of mistakes was in entity and event coreference; more work certainly needs
to be done on this. Another possibility is that 60% is what the text wears
on its sleeve; the rest is implicit and requires inference and access to world
knowledge.
Another explanation is that there is a Zipf distribution of problems that
need to be solved. When we solve the more common problems, we get a big
boost in performance. But we have solved all the most common problems,
and now we are in the long tail of the distribution. We might take care of
a dozen new problems we find in the training data, only to find that none
of these problems occur in the test data, so there is no effect on measured
performance. One possible solution is active learning (e.g., (Lewis and Catlett
1994; Liere and Tadepalli 1997; McCallum and Nigam 1998; Thompson, Califf,
and Mooney 1999)) and the automated selection of rare training examples in
the tail for additional manual annotation. This could help to reduce the overall
amount of annotated training data that is required, while still adequately
covering the rare cases.
A final possibility is both simple and disconcerting. Good named entity
recognition systems typically recognize about 90% of the entities of interest
in a text, and this is near human performance. To recognize an event and
its arguments requires recognizing about four entities, and 0.9^4 is roughly 0.66, close to the observed ceiling.
If this is the reason for the 60% barrier, it is not clear what we can do to
overcome it, short of solving the general natural language problem in a way
that exploits the implicit relations among the elements of a text.
21.6 Acknowledgments
Bibliography
Ananiadou, S., C. Friedman, and J. Tsujii (2004). Introduction: Named
Entity Recognition in Biomedicine. Journal of Biomedical Informatics 37 (6).
Ananiadou, S. and J. McNaught (Eds.) (2006). Text Mining for Biology
and Biomedicine. Artech House, Inc.
Aone, C. and S. W. Bennett (1996). Applying machine learning to anaphora
resolution. In S. Wermter, E. Riloff, and G. Scheler (Eds.), Connectionist, Statistical, and Symbolic Approaches to Learning for Natural
Language Processing, pp. 302–314. Springer-Verlag, Berlin.
Bagga, A. and B. Baldwin (1998). Entity-based Cross-Document Coreferencing using the Vector Space Model. In Proceedings of the 17th International Conference on Computational Linguistics.
Collins, M. and Y. Singer (1999). Unsupervised Models for Named Entity
Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora
(EMNLP/VLC-99).
Croft, W. A. (1991). Syntactic Categories and Grammatical Relations.
Chicago, Illinois: University of Chicago Press.
Cucerzan, S. and D. Yarowsky (1999). Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-99).
Cunningham, H., D. Maynard, K. Bontcheva, and V. Tablan (2002). GATE:
A framework and graphical development environment for robust NLP
tools and applications. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics.
Dagan, I. and A. Itai (1990). Automatic Processing of Large Corpora for
the Resolution of Anaphora References. In Proceedings of the Thirteenth
International Conference on Computational Linguistics (COLING-90),
pp. 330–332.
Etzioni, O., M. Cafarella, A. Popescu, T. Shaked, S. Soderland, D. Weld,
and A. Yates (2005). Unsupervised Named-Entity Extraction from the
Web: An Experimental Study. Artificial Intelligence 165 (1), 91–134.
Finkel, J., T. Grenager, and C. Manning (2005, June). Incorporating Nonlocal Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for
Computational Linguistics, Ann Arbor, MI, pp. 363–370.
Finn, A. and N. Kushmerick (2004, September). Multi-level Boundary Classification for Information Extraction. In Proceedings of the 15th European Conference on Machine Learning, Pisa, Italy, pp. 111–122.
Fleischman, M. and E. Hovy (2002, August). Fine grained classification of
named entities. In Proceedings of the COLING conference.
Fleischman, M., E. Hovy, and A. Echihabi (2003). Offline strategies for online question answering: Answering questions before they are asked. In
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.
Freitag, D. (1998a). Multistrategy Learning for Information Extraction.
In Proceedings of the Fifteenth International Conference on Machine
Learning. Morgan Kaufmann Publishers.
Freitag, D. (1998b). Toward General-Purpose Learning for Information Extraction. In Proceedings of the 36th Annual Meeting of the Association
for Computational Linguistics.
Igo, S. and E. Riloff (2008). Learning to Identify Reduced Passive Verb
Phrases with a Shallow Parser. In Proceedings of the 23rd National Conference on Artificial Intelligence.
Joshi, A. K. (1996). A Parser from Antiquity: An Early Application of
Finite State Transducers to Natural Language Parsing. In European
Conference on Artificial Intelligence 96 Workshop on Extended Finite
State Models of Language, pp. 33–34.
Kehler, A. (1997). Probabilistic Coreference in Information Extraction. In
Proceedings of the Second Conference on Empirical Methods in Natural
Language Processing.
Kim, J. and D. Moldovan (1993). Acquisition of Semantic Patterns for Information Extraction from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, Los Alamitos, CA, pp.
171–176. IEEE Computer Society Press.
Lehnert, W., C. Cardie, D. Fisher, J. McCarthy, E. Riloff, and S. Soderland
(1992). University of Massachusetts: Description of the CIRCUS System
as Used for MUC-4. In Proceedings of the Fourth Message Understanding
Conference (MUC-4), San Mateo, CA, pp. 282–288. Morgan Kaufmann.
Lehnert, W., C. Cardie, D. Fisher, E. Riloff, and R. Williams (1991). University of Massachusetts: Description of the CIRCUS System as Used
for MUC-3. In Proceedings of the Third Message Understanding Conference (MUC-3), San Mateo, CA, pp. 223–233. Morgan Kaufmann.
Lewis, D. D. and J. Catlett (1994). Heterogeneous uncertainty sampling
for supervised learning. In Proceedings of the Eleventh International
Conference on Machine Learning.
Li, Y., K. Bontcheva, and H. Cunningham (2005, June). Using Uneven
Margins SVM and Perceptron for Information Extraction. In Proceedings of the Ninth Conference on Computational Natural Language Learning, Ann Arbor, MI, pp. 72–79.
Liere, R. and P. Tadepalli (1997). Active learning with committees for text
categorization. In Proceedings of the Fourteenth National Conference on
Artificial Intelligence.
Light, M., G. Mann, E. Riloff, and E. Breck (2001). Analyses for Elucidating
Current Question Answering Technology. Journal for Natural Language
Engineering 7 (4).
Mann, G. and D. Yarowsky (2003). Unsupervised Personal Name Disambiguation. In Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003).
Maslennikov, M. and T. Chua (2007). A Multi-Resolution Framework for
Information Extraction from Free Text. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics.
the Conference on Empirical Methods in Natural Language Processing
(EMNLP-2007).
Peng, F. and A. McCallum (2004). Accurate Information Extraction from
Research Papers using Conditional Random Fields. In Proceedings of
the Annual Meeting of the North American Chapter of the Association
for Computational Linguistics (HLT/NAACL 2004).
Phillips, W. and E. Riloff (2007). Exploiting Role-Identifying Nouns and
Expressions for Information Extraction. In Proceedings of the 2007 International Conference on Recent Advances in Natural Language Processing (RANLP-07), pp. 468–473.
Ravichandran, D. and E. Hovy (2002). Learning Surface Text Patterns for a
Question Answering System. In Proceedings of the 40th Annual Meeting
on Association for Computational Linguistics.
Riloff, E. (1993). Automatically Constructing a Dictionary for Information
Extraction Tasks. In Proceedings of the 11th National Conference on
Artificial Intelligence.
Riloff, E. (1996a). An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. Artificial Intelligence 85, 101–134.
Riloff, E. (1996b). Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on
Artificial Intelligence, pp. 1044–1049. The AAAI Press/MIT Press.
Riloff, E. and R. Jones (1999). Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth
National Conference on Artificial Intelligence.
Roth, D. and W. Yih (2001, August). Relational Learning via Propositional Algorithms: An Information Extraction Case Study. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Seattle, WA, pp. 1257–1263.
Sang, E. F. T. K. and F. D. Meulder (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL-2003, pp. 142–147.
Sekine, S. (2006). On-demand information extraction. In Proceedings
of Joint Conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics
(COLING/ACL-06).
Shinyama, Y. and S. Sekine (2006, June). Preemptive Information Extraction using Unrestricted Relation Discovery. In Proceedings of the Human Language Technology Conference of the North American Chapter
of the Association for Computational Linguistics, New York City, NY,
pp. 304–311.
Yakushiji, A., Y. Miyao, T. Ohta, Y. Tateisi, and J. Tsujii (2006). Automatic construction of predicate-argument structure patterns for biomedical information extraction. In Proceedings of the 2006 Conference on
Empirical Methods in Natural Language Processing.
Yang, X., J. Su, and C. L. Tan (2005). Improving Pronoun Resolution using
Statistics-Based Semantic Compatibility Information. In Proceedings of
the 43rd Annual Meeting of the Association for Computational Linguistics.
Yangarber, R. (2003). Counter-training in the discovery of semantic patterns. In Proceedings of the 41st Annual Meeting of the Association for
Computational Linguistics.
Yangarber, R., R. Grishman, P. Tapanainen, and S. Huttunen (2000). Automatic Acquisition of Domain Knowledge for Information Extraction.
In Proceedings of the Eighteenth International Conference on Computational Linguistics (COLING 2000).
Yu, K., G. Guan, and M. Zhou (2005, June). Resume Information Extraction with Cascaded Hybrid Model. In Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguistics, Ann Arbor,
MI, pp. 499–506.
Zelenko, D., C. Aone, and A. Richardella (2003). Kernel Methods for Relation Extraction. Journal of Machine Learning Research 3.
Zhao, S. and R. Grishman (2005). Extracting Relations with Integrated
Information Using Kernel Methods. In Proceedings of the 43rd Annual
Meeting of the Association for Computational Linguistics (ACL-05),
Ann Arbor, Michigan.