Lect 06

Information Extraction

By Ivan Wong
Introduction
• When compared to structured information sources like
databases or tables or semi-structured sources such as
webpages (which have some markup), text is a form of
unstructured data.
• Information extraction (IE) refers to the NLP task of
extracting relevant information from text documents.
• Typical IE tasks include key phrase extraction, named
entity recognition, named entity disambiguation and
linking, and relationship extraction.
IE Applications
• Tagging news and other content
• Chatbots
• A chatbot needs to understand the user’s question in order to
generate/retrieve a correct response.
• Applications in social media
• An example use case is extracting time-sensitive, frequently
updated information, such as traffic updates and disaster
relief efforts, based on tweets.
• Extracting data from forms and receipts
• Google’s Document AI Custom Extractor
IE Tasks
• The overarching goal of IE is to
extract “knowledge” from text, and
each of these tasks provides different
information to do that.
• As human readers, we find several
useful pieces of information in this
blurb.
• For example, we know that the article
is about Apple, the company (and
not the fruit), and that it mentions a
person, Luca Maestri, who is the
finance chief of the company. The
article is about the buyback of
stock and other issues related to it.
IE Tasks
• Identifying that the article is about “buyback” or “stock
price” relates to the IE task of keyword or keyphrase
extraction (KPE).
• Identifying Apple as an organization and Luca Maestri
as a person comes under the IE task of named entity
recognition (NER).
• Recognizing that Apple is not a fruit, but a company,
and that it refers to Apple, Inc. and not some other
company with the word “apple” in its name is the IE
task of named entity disambiguation and linking.
• Extracting the information that Luca Maestri is the
finance chief of Apple refers to the IE task of relation
extraction (RE).
Advanced IE Tasks
• Identifying that this article is about a single event (let’s
call it “Apple buys back stocks”) and being able to link it
to other articles talking about the same event over time
refers to the IE task of event extraction.
• Template filling: Many applications, such as
automatically generating weather reports or flight
announcements, follow a standard template with some
slots that need to be filled based on extracted data.
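A minimal sketch of template filling, assuming the slot values have already been extracted upstream. The template text, the slot names, and the sample values are all hypothetical and only illustrate the idea.

```python
from string import Template

# Hypothetical flight-announcement template; the slot names are illustrative.
SLOT_NAMES = ("flight", "destination", "time", "gate")
ANNOUNCEMENT = Template(
    "Flight $flight to $destination departs at $time from gate $gate."
)

def fill_template(slots: dict) -> str:
    """Fill the template, failing loudly if a slot was not extracted."""
    missing = [name for name in SLOT_NAMES if name not in slots]
    if missing:
        raise ValueError(f"missing slots: {missing}")
    return ANNOUNCEMENT.substitute(slots)

# In a real system these values would come from earlier IE steps.
extracted = {"flight": "XY123", "destination": "Lisbon",
             "time": "14:05", "gate": "B7"}
print(fill_template(extracted))
```

Checking for missing slots before substitution makes extraction failures visible instead of producing a half-filled announcement.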
The General Pipeline for IE
• The general pipeline for IE requires more fine-grained NLP processing
than what we saw for text classification.
• For example, to identify named entities (persons, organizations, etc.),
we would need to know the part-of-speech tags of words.
• For relating multiple references to the same entity (e.g., Albert
Einstein, Einstein, the scientist, he, etc.), we would need coreference
resolution.
Keyphrase Extraction
• Amazon has a filtering feature, “Read reviews that mention,” which
presents keywords or phrases that several reviewers used and lets
shoppers filter the reviews by them.
• Keyword and phrase extraction, as the name indicates, is the IE task
concerned with extracting important words and phrases that capture the
gist of a given text document.
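As a rough illustration of the task, the sketch below ranks candidate n-grams by raw frequency. This is a deliberately simple baseline chosen for illustration, not the method used in the lecture; the stopword list and the sample text are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "or", "is",
             "are", "was", "its", "it", "for", "with", "that", "this", "will"}

def candidate_ngrams(text: str, max_n: int = 3):
    """Yield word n-grams that neither start nor end with a stopword."""
    for sentence in re.split(r"[.!?;:,]", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                yield " ".join(gram)

def top_keyphrases(text: str, k: int = 5):
    """Rank candidates by frequency; ties are broken alphabetically."""
    counts = Counter(candidate_ngrams(text))
    ranked = sorted(counts, key=lambda phrase: (-counts[phrase], phrase))
    return ranked[:k]

text = ("Apple will buy back stock this year. The stock buyback is large. "
        "Investors expect the buyback to lift the stock price.")
print(top_keyphrases(text, k=2))
```

Frequency alone tends to favor single words, which is one reason more elaborate ranking schemes are used in practice.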
Practical Advice
• In graph-based KPE, the process of extracting candidate n-grams and
building the graph with them is sensitive to document length,
which could be an issue in a production scenario.
• One approach to dealing with it is to not use the full text, but
instead try using the first M% and the last N% of the text.
• Since each keyphrase is independently ranked, we
sometimes end up seeing overlapping keyphrases (e.g.,
“buy back stock” and “buy back”).
• One solution for this could be to use some similarity measure
(e.g., cosine similarity) between the top-ranked keyphrases
and choose the ones that are most dissimilar to one another.
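Both workarounds above can be sketched in a few lines. The percentages, the similarity threshold, and the bag-of-words cosine similarity are assumptions made for illustration; any other similarity measure could be substituted.

```python
import math
from collections import Counter

def head_and_tail(text: str, first_pct: float = 20.0, last_pct: float = 10.0) -> str:
    """Keep only the first M% and the last N% of the words in a document."""
    words = text.split()
    head = math.ceil(len(words) * first_pct / 100)
    tail = math.ceil(len(words) * last_pct / 100)
    if head + tail >= len(words):
        return text  # short document: nothing to trim
    return " ".join(words[:head] + words[len(words) - tail:])

def cosine(a: str, b: str) -> float:
    """Cosine similarity between two phrases using word-count vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def drop_overlapping(ranked, threshold: float = 0.5):
    """Walk the ranked list and keep a phrase only if it is dissimilar to all kept ones."""
    kept = []
    for phrase in ranked:
        if all(cosine(phrase, other) < threshold for other in kept):
            kept.append(phrase)
    return kept

ranked = ["buy back stock", "buy back", "stock price", "finance chief"]
print(drop_overlapping(ranked))
```

Because the list is processed in rank order, the higher-ranked phrase of an overlapping pair is the one that survives.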
Practical Advice
• Seeing counterproductive patterns (e.g., a keyphrase that starts
with a preposition when you don’t want that) is another common
problem.
• This is relatively straightforward to handle by tweaking the
implementation code for the algorithm and explicitly encoding
information about such unwanted word patterns.
• Improper text extraction can affect the rest of the KPE process,
especially when dealing with formats such as PDF or scanned
images.
• This is primarily because KPE is sensitive to sentence structure in the
document.
• Hence, it’s always a good idea to add some post-processing to the
extracted keyphrase list to create a final, meaningful list without noise.
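A post-processing pass of the kind described above might look like the following sketch. The preposition list and the cleanup rules are assumptions; a real project would adapt them to the noise it actually observes.

```python
import re

# Illustrative list of prepositions that should not start a keyphrase.
PREPOSITIONS = {"of", "in", "on", "at", "by", "for", "with", "to", "from"}

def clean_keyphrases(phrases):
    """Normalize phrases, drop unwanted patterns, and remove duplicates."""
    cleaned, seen = [], set()
    for phrase in phrases:
        # Replace stray symbols (e.g., left over from PDF or OCR extraction).
        text = re.sub(r"[^\w\s'-]", " ", phrase)
        # Collapse repeated whitespace and lowercase.
        text = re.sub(r"\s+", " ", text).strip().lower()
        if not text:
            continue
        if text.split()[0] in PREPOSITIONS:
            continue  # unwanted pattern: phrase starts with a preposition
        if text in seen:
            continue  # duplicate after normalization
        seen.add(text)
        cleaned.append(text)
    return cleaned

raw = ["of the company", "Stock  price", "stock price",
       "\u25a0 buyback", "finance chief"]
print(clean_keyphrases(raw))
```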
Named Entity Recognition
• NER refers to the IE task of identifying the entities in a
document.
• Entities are typically names of persons, locations, and
organizations, and other specialized strings, such as
money expressions, dates, products, names/numbers of
laws or articles, and so on.
• NER is an important step in the pipeline of several NLP
applications involving information extraction.
• Explosion.ai. “displaCy Named Entity Visualizer”
(https://demos.explosion.ai/displacy-ent)
Building an NER System
• A simple approach to building an NER system is to maintain a large
collection of person/organization/location names that are the most
relevant to our company (e.g., names of all clients, cities in their
addresses, etc.); this is typically referred to as a gazetteer. Checking
whether a word is a named entity is then a simple lookup.
• An approach that goes beyond a lookup table is rule-based NER, which
can be based on a compiled list of patterns based on word tokens and
POS tags.
• For example, a pattern “NNP was born,” where “NNP” is the POS tag for a
proper noun, indicates that the word that was tagged “NNP” refers to a person.
• Such rules can be programmed to cover as many cases as possible to build a
rule-based NER system.
• Stanford NLP’s RegexNER (nlp.stanford.edu/software/regexner.html) and
spaCy’s EntityRuler (spacy.io/usage/rule-based-matching#entityruler) provide
functionalities to implement your own rule-based NER.
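The gazetteer lookup and the “NNP was born” rule can be sketched with the standard library alone. The gazetteer entries are hypothetical, and because no POS tagger is used here, a run of capitalized words stands in for the NNP tag; that substitution is an assumption of this sketch, not how RegexNER or EntityRuler work.

```python
import re

# Hypothetical gazetteer: entity string -> entity type.
GAZETTEER = {
    "Apple": "ORG",
    "Microsoft": "ORG",
    "Luca Maestri": "PER",
    "London": "LOC",
}

# Stand-in for the "NNP was born" rule: a run of capitalized words
# approximates a proper noun (assumption; a real system would use POS tags).
BORN_RULE = re.compile(r"\b([A-Z][a-z]+(?: [A-Z][a-z]+)*) was born\b")

def find_entities(text: str):
    """Return sorted, de-duplicated (start, end, surface, type) tuples."""
    found = set()
    # Gazetteer lookup: exact matches on word boundaries.
    for name, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.add((m.start(), m.end(), name, label))
    # Rule-based match: whoever "was born" is tagged as a person.
    for m in BORN_RULE.finditer(text):
        found.add((m.start(1), m.end(1), m.group(1), "PER"))
    return sorted(found)

text = "Luca Maestri works at Apple. Albert Einstein was born in Ulm."
for start, end, surface, label in find_entities(text):
    print(surface, label)
```

The rule picks up a name that is absent from the gazetteer, which is the main benefit of going beyond a lookup table.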
Machine Learning Approach
• A more practical approach to NER is to train an ML
model, which can predict the named entities in unseen
text. For each word, a decision has to be made whether
or not that word is an entity, and if it is, what type of
entity it is.
• NER is traditionally modeled as a sequence
classification problem, where the entity prediction for
the current word also depends on the context.
• For example, if the previous word was a person name, there’s
a higher probability that the current word is also a person
name if it’s a noun (e.g., first and last names).
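One common way to give a sequence model this context is to describe each word with features of its neighbors. The sketch below builds such feature dictionaries; the feature names are arbitrary, and the POS tags are assigned by hand for the example rather than produced by a tagger.

```python
def word_features(tokens, pos_tags, i):
    """Features for token i, including its left and right neighbors."""
    feats = {
        "word.lower": tokens[i].lower(),
        "word.istitle": tokens[i].istitle(),
        "pos": pos_tags[i],
    }
    if i > 0:
        # Context from the previous word.
        feats["prev.word.lower"] = tokens[i - 1].lower()
        feats["prev.istitle"] = tokens[i - 1].istitle()
        feats["prev.pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        # Context from the next word.
        feats["next.word.lower"] = tokens[i + 1].lower()
        feats["next.pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats

tokens = ["Luca", "Maestri", "is", "the", "finance", "chief"]
pos_tags = ["NNP", "NNP", "VBZ", "DT", "NN", "NN"]  # hand-assigned for illustration
features = [word_features(tokens, pos_tags, i) for i in range(len(tokens))]
print(features[1])
```

For “Maestri,” the features record that the previous word is a capitalized proper noun, which is the kind of evidence the bullet above describes.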
Sequence Classifier
• The labels in the figure follow what’s known
as BIO notation: B marks the beginning of an
entity; I (inside) marks the continuation of an
entity that spans more than one word; and
O (other) marks non-entities.
• “Peter” gets tagged as a B-PER, and “Such” gets
tagged as an I-PER to indicate that Such is a part
of the entity from the previous word.
• The remaining entities in this example, Essex,
Yorkshire, and Headingley, are all one-word
entities. So, we only see B-ORG and B-LOC as
their tags.
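Turning a BIO-tagged sequence back into entities is a small grouping step, sketched below. The token list is a shortened, made-up sentence built from the entities named above, and the ORG/LOC assignment of the one-word entities is an assumption.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the open entity.
            current.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Peter", "Such", "played", "for", "Essex",
          "against", "Yorkshire", "at", "Headingley"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-ORG", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```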
Named Entity Disambiguation and
Linking
• Named entity disambiguation (NED) refers to the NLP
task of assigning a unique identity to entities mentioned
in the text (e.g., deciding that “Apple” means Apple, Inc.).
• NER and NED together are known as named entity
linking (NEL).
• Some other NLP applications that would need NEL
include question answering and constructing large
knowledge bases of connected events and entities, such
as the Google Knowledge Graph.
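A toy version of disambiguation picks, for an ambiguous mention, the knowledge-base entry whose description best matches the surrounding words. The two-entry knowledge base and the word-overlap scoring below are assumptions for illustration; production systems rely on far richer knowledge bases and models.

```python
import re

# Hypothetical mini knowledge base: entity id -> descriptive words.
KB = {
    "Apple_Inc": "technology company iphone stock shares finance chief executive",
    "Apple_fruit": "fruit tree orchard sweet edible pie juice",
}

def disambiguate(context: str, kb: dict) -> str:
    """Return the id of the entry sharing the most words with the context."""
    context_words = set(re.findall(r"[a-z]+", context.lower()))

    def overlap(entity_id: str) -> int:
        return len(context_words & set(kb[entity_id].split()))

    # sorted() makes tie-breaking deterministic (alphabetical by id).
    return max(sorted(kb), key=overlap)

print(disambiguate("Apple's finance chief Luca Maestri announced a stock buyback.", KB))
print(disambiguate("She baked an apple pie with fruit from the orchard.", KB))
```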
Relationship Extraction
• Relationship extraction (RE) is the IE task that deals
with extracting entities and relationships between them
from text documents.
• It’s an important step in building a knowledge base, and
it’s also useful in improving search and developing
question-answering systems.
• Apart from identifying what entities there are and
disambiguating them, we need to model the process of
extracting the relationships between them by
considering the words connecting the entities in a
sentence, their sense of usage, and so on.
Relationship Extraction
• Satya Narayana Nadella is an Indian-American business executive. He
currently serves as the Chief Executive Officer (CEO) of Microsoft,
succeeding Steve Ballmer in 2014. Before becoming chief executive, he
was Executive Vice President of Microsoft’s Cloud and Enterprise Group,
responsible for building and running the company’s computing platforms.
Approaches to RE
• RE is often treated as a supervised classification
problem. The datasets used to train RE systems contain
a set of pre-defined relations, similar to classification
datasets. RE is then modeled as a two-step
classification problem:
• Whether two entities in a text are related (binary
classification).
• If they are related, what the relation between them is
(multiclass classification).
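The two-step structure can be sketched as two functions chained together. Here both steps are hand-written trigger-phrase rules standing in for trained classifiers; the relation labels, the trigger phrases, and the simplified example sentence are all hypothetical.

```python
import re

# Hypothetical trigger phrases standing in for a trained multiclass model.
RELATION_TRIGGERS = {
    "CEO_of": ["chief executive officer", "ceo"],
    "finance_chief_of": ["finance chief", "cfo"],
}

def text_between(sentence: str, e1: str, e2: str):
    """Lowercased text between the two mentions, or None if one is missing."""
    i, j = sentence.find(e1), sentence.find(e2)
    if i == -1 or j == -1:
        return None
    if i < j:
        return sentence[i + len(e1):j].lower()
    return sentence[j + len(e2):i].lower()

def has_trigger(between: str, trigger: str) -> bool:
    """True if the trigger phrase occurs as whole words in the text."""
    return re.search(r"\b" + re.escape(trigger) + r"\b", between) is not None

def are_related(sentence: str, e1: str, e2: str) -> bool:
    """Step 1 (binary): is there any relation between the two entities?"""
    between = text_between(sentence, e1, e2)
    if between is None:
        return False
    return any(has_trigger(between, t)
               for triggers in RELATION_TRIGGERS.values() for t in triggers)

def relation_type(sentence: str, e1: str, e2: str):
    """Step 2 (multiclass): which relation holds between them?"""
    between = text_between(sentence, e1, e2)
    for label, triggers in RELATION_TRIGGERS.items():
        if any(has_trigger(between, t) for t in triggers):
            return label
    return None

def extract_relation(sentence: str, e1: str, e2: str):
    """Run step 2 only when step 1 says the entities are related."""
    if not are_related(sentence, e1, e2):
        return None
    return (e1, relation_type(sentence, e1, e2), e2)

sentence = ("Satya Nadella currently serves as the Chief Executive Officer "
            "(CEO) of Microsoft.")
print(extract_relation(sentence, "Satya Nadella", "Microsoft"))
```

In a trained system, each step would be a classifier over features of the entity pair and the words connecting them, but the control flow stays the same.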
Other Advanced IE Tasks
• Temporal Information Extraction

• Event Extraction
Other Advanced IE Tasks
• Template Filling
