Lect 06
By Ivan Wong
Introduction
• When compared to structured information sources like
databases or tables or semi-structured sources such as
webpages (which have some markup), text is a form of
unstructured data.
• Information extraction (IE) refers to the NLP task of
extracting relevant information from text documents.
• Typical IE tasks include key phrase extraction, named
entity recognition, named entity disambiguation and
linking, and relationship extraction.
IE Applications
• Tagging news and other content
• Chatbots
• A chatbot needs to understand the user’s question in order to
generate/retrieve a correct response.
• Applications in social media
• An example use case is extracting time-sensitive, frequently
updated information, such as traffic updates and disaster
relief efforts, based on tweets.
• Extracting data from forms and receipts
• Google’s Document AI Custom Extractor
IE Tasks
• The overarching goal of IE is to
extract “knowledge” from text, and
each of these tasks provides different
information to do that.
• As human readers, we find several
useful pieces of information in this
blurb.
• For example, we know that the article
is about Apple, the company (and
not the fruit), and that it mentions a
person, Luca Maestri, who is the
finance chief of the company. The
article is about the buyback of
stock and other issues related to it.
IE Tasks
• Identifying that the article is about “buyback” or “stock
price” relates to the IE task of keyword or keyphrase
extraction (KPE).
• Identifying Apple as an organization and Luca Maestri
as a person comes under the IE task of named entity
recognition (NER).
• Recognizing that Apple is not a fruit, but a company,
and that it refers to Apple, Inc. and not some other
company with the word “apple” in its name is the IE
task of named entity disambiguation and linking.
• Extracting the information that Luca Maestri is the
finance chief of Apple refers to the IE task of relation
extraction.
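As a rough illustration of relation extraction, the sketch below pulls a (person, role, organization) triple out of a sentence with a single hand-written pattern. The pattern, the function name extract_role_relations, and the sample sentence are assumptions made for this example; practical systems usually build on NER and syntactic parsing rather than one regular expression.

```python
import re

# Hypothetical pattern: "<Person> is the <role> of <Organization>".
# It only covers this one phrasing and is not a general solution.
ROLE_PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) is the "
    r"(?P<role>[a-z ]+?) of (?P<org>[A-Z][A-Za-z]+)"
)


def extract_role_relations(text):
    """Return (person, role, organization) triples matched by the pattern."""
    return [
        (m.group("person"), m.group("role"), m.group("org"))
        for m in ROLE_PATTERN.finditer(text)
    ]


# Illustrative sentence written for this sketch, not the blurb from the slide.
sentence = "Luca Maestri is the finance chief of Apple."
relations = extract_role_relations(sentence)
print(relations)
```

A pattern like this is brittle: any rewording of the sentence breaks it, which is one reason relation extraction is normally treated as a learning problem.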
Advanced IE tasks
• Identifying that this article is about a single event (let’s
call it “Apple buys back stocks”) and being able to link it
to other articles talking about the same event over time
refers to the IE task of event extraction.
• Template filling: Many applications, such as
automatically generating weather reports or flight
announcements, follow a standard template with some
slots that need to be filled based on extracted data.
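Template filling can be sketched with Python's standard string.Template. The announcement wording, the slot names, and the slot values below are made up for illustration; in a real application the values would come from an upstream extraction step.

```python
from string import Template

# Hypothetical flight-announcement template; the slot names are assumptions.
ANNOUNCEMENT = Template(
    "Flight $flight to $destination departs from gate $gate at $time."
)


def fill_template(template, slots):
    """Fill every slot; raises KeyError if an extracted value is missing."""
    return template.substitute(slots)


# Slot values that an IE step would normally supply (invented here).
extracted = {
    "flight": "XY123",
    "destination": "Tokyo",
    "gate": "12",
    "time": "09:45",
}
announcement = fill_template(ANNOUNCEMENT, extracted)
print(announcement)
```

Using substitute rather than safe_substitute is a deliberate choice in this sketch: a missing slot fails loudly instead of producing an announcement with an unfilled placeholder.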
The General Pipeline for IE
• The general pipeline for IE requires more fine-grained NLP
processing than what we saw for text classification.
• For example, to identify named entities (persons,
organizations, etc.), we would need to know the
part-of-speech tags of words.
• For relating multiple references to the same entity (e.g.,
Albert Einstein, Einstein, the scientist, he, etc.), we would
need coreference resolution.
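To make the idea of relating multiple references concrete, here is a deliberately naive sketch that groups name mentions by token overlap, so that "Einstein" is attached to "Albert Einstein". This is only a small slice of coreference resolution: it does not handle pronouns ("he") or descriptions ("the scientist"), and the function name and the second example entity are assumptions made for illustration.

```python
def group_name_mentions(mentions):
    """Group each mention under a longer mention that contains all its tokens.

    Naive heuristic only: pronouns and descriptive phrases are not resolved.
    """
    groups = {}  # canonical (longest) mention -> list of grouped mentions
    # Visit longer mentions first so they become the canonical names.
    for mention in sorted(mentions, key=lambda m: -len(m.split())):
        tokens = set(mention.split())
        for canonical in groups:
            if tokens <= set(canonical.split()):
                groups[canonical].append(mention)
                break
        else:
            # No existing canonical name covers this mention; start a group.
            groups[mention] = [mention]
    return groups


# Illustrative mentions; "Marie Curie" is added only to show two groups.
mentions = ["Einstein", "Albert Einstein", "Curie", "Marie Curie"]
groups = group_name_mentions(mentions)
print(groups)
```

Real coreference resolvers combine signals such as part-of-speech tags, syntax, gender and number agreement, and context, which is why the pipeline needs the finer-grained processing described above.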
Keyphrase Extraction
• Amazon has a filtering feature: “Read reviews that mention.”
It presents keywords or phrases that several reviewers used;
selecting one filters the reviews to those that mention it.
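One simple way to find phrases that several reviewers used is to count, for each candidate word or two-word phrase, how many distinct reviews mention it. The sketch below does this with the standard library only. It is not how Amazon implements the feature; the stopword list, the thresholds, and the sample reviews are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "is", "and", "but", "it", "this",
    "was", "very", "i", "of", "to", "for", "my",
}


def candidate_phrases(review, max_len=2):
    """Yield unigrams and bigrams that contain no stopword."""
    tokens = re.findall(r"[a-z]+", review.lower())
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOPWORDS for tok in gram):
                yield " ".join(gram)


def frequent_phrases(reviews, min_reviews=2):
    """Return phrases mentioned in at least `min_reviews` distinct reviews."""
    doc_freq = Counter()
    for review in reviews:
        # A set ensures each phrase is counted once per review.
        doc_freq.update(set(candidate_phrases(review)))
    return sorted(p for p, count in doc_freq.items() if count >= min_reviews)


# Made-up reviews for the example.
reviews = [
    "The battery life is great.",
    "Great battery life but the screen is dim.",
    "Battery life was poor for my use.",
]
phrases = frequent_phrases(reviews)
print(phrases)
```

Counting reviews rather than raw occurrences keeps one very repetitive review from dominating the list. Note that the result still contains overlapping entries such as "battery" and "battery life"; a fuller approach would merge or rank these.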
Other Advanced IE Tasks
• Event Extraction
• Template Filling