Lect 06


Information Extraction

By Ivan Wong
Introduction
• When compared to structured information sources like
databases or tables or semi-structured sources such as
webpages (which have some markup), text is a form of
unstructured data.
• Information extraction (IE) refers to the NLP task of
extracting relevant information from text documents.
• Typical IE tasks include key phrase extraction, named
entity recognition, named entity disambiguation and
linking, and relationship extraction.
IE Applications
• Tagging news and other content
• Chatbots
• A chatbot needs to understand the user’s question in order to
generate/retrieve a correct response.
• Applications in social media
• An example use case is extracting time-sensitive, frequently
updated information, such as traffic updates and disaster
relief efforts, based on tweets.
• Extracting data from forms and receipts
• Google’s Document AI Custom Extractor
IE Tasks
• The overarching goal of IE is to
extract “knowledge” from text, and
each of these tasks provides different
information to do that.
• As human readers, we find several
useful pieces of information in this
blurb.
• For example, we know that the article
is about Apple, the company (and
not the fruit), and that it mentions a
person, Luca Maestri, who is the
finance chief of the company. The
article is about the buyback of
stock and other issues related to it.
IE Tasks
• Identifying that the article is about “buyback” or “stock
price” relates to the IE task of keyword or keyphrase
extraction (KPE).
• Identifying Apple as an organization and Luca Maestri
as a person comes under the IE task of named entity
recognition (NER).
• Recognizing that Apple is not a fruit, but a company,
and that it refers to Apple, Inc. and not some other
company with the word “apple” in its name is the IE
task of named entity disambiguation and linking.
• Extracting the information that Luca Maestri is the
finance chief of Apple refers to the IE task of relation
extraction (RE).
Advanced IE tasks
• Identifying that this article is about a single event (let’s
call it “Apple buys back stocks”) and being able to link it
to other articles talking about the same event over time
refers to the IE task of event extraction.
• Template filling: Many applications, such as
automatically generating weather reports or flight
announcements, follow a standard template with some
slots that need to be filled based on extracted data.
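Template filling can be sketched in a few lines once the slot values have been extracted. The template text, the slot names, and the sample values below are hypothetical and chosen only for illustration; a real system would get the slot values from the upstream IE steps.

```python
from string import Template

# Hypothetical announcement template; the slot names are assumptions.
FLIGHT_TEMPLATE = Template(
    "Flight $flight_no to $destination departs at $time from gate $gate."
)


def fill_template(template, slots):
    """Fill every slot, failing loudly if an extracted value is missing."""
    try:
        return template.substitute(slots)
    except KeyError as missing:
        raise ValueError(f"slot {missing} was not extracted") from None


# Made-up values standing in for the output of an extraction step.
extracted = {
    "flight_no": "XY123",
    "destination": "Vancouver",
    "time": "14:05",
    "gate": "B7",
}
announcement = fill_template(FLIGHT_TEMPLATE, extracted)
print(announcement)
```

Raising an error on a missing slot, rather than emitting a half-filled sentence, is one possible design choice; an application could instead fall back to a default value.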
The General Pipeline for IE
• The general pipeline for IE requires more fine-grained NLP
processing than what we saw for text classification.
• For example, to identify named entities (persons, organizations,
etc.), we would need to know the part-of-speech tags of words.
• For relating multiple references to the same entity (e.g., Albert
Einstein, Einstein, the scientist, he, etc.), we would need
coreference resolution.
Keyphrase Extraction
• Amazon has a filtering feature: “Read reviews that mention.”
This presents a set of keywords or phrases that several
people used in their reviews, which can be used to filter the reviews.
• Keyword and phrase extraction, as the name indicates, is the
IE task concerned with extracting the important words and
phrases that capture the gist of a given text
document.
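As a rough illustration of the task, the sketch below ranks candidate n-grams by frequency. This is a deliberately simple stand-in, not the graph-based ranking used by common KPE algorithms; the stopword list, the length weighting, and the function names are assumptions made for this example.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "of", "to", "and", "is", "in", "it", "its",
    "on", "for", "this", "that", "was", "with",
}


def candidate_phrases(text, max_len=3):
    """Yield n-grams (1..max_len words) that contain no stopwords."""
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if not any(tok in STOPWORDS for tok in gram):
                    yield " ".join(gram)


def top_keyphrases(text, k=5, max_len=3):
    """Rank candidates by frequency times phrase length (an assumed score)."""
    counts = Counter(candidate_phrases(text, max_len))
    # Weighting by length lets a repeated multiword phrase outrank its parts.
    scored = {phrase: c * len(phrase.split()) for phrase, c in counts.items()}
    return sorted(scored, key=lambda p: (-scored[p], p))[:k]
```

Calling `top_keyphrases(review_text, k=5)` on a document returns up to five phrases, best first.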
Practical Advice
• The process of extracting potential n-grams and building
the graph with them is sensitive to document length,
which could be an issue in a production scenario.
• One approach to dealing with it is to not use the full text, but
instead try using the first M% and the last N% of the text.
• Since each keyphrase is independently ranked, we
sometimes end up seeing overlapping keyphrases (e.g.,
“buy back stock” and “buy back”).
• One solution for this could be to use some similarity measure
(e.g., cosine similarity) between the top-ranked keyphrases
and choose the ones that are most dissimilar to one another.
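Both pieces of advice above can be sketched with the standard library. The percentages, the similarity threshold, and the function names are assumptions for illustration; the greedy filter is one simple way to keep mutually dissimilar keyphrases, not the only one.

```python
import math
from collections import Counter


def head_and_tail(text, first_pct=20, last_pct=10):
    """Keep only the first `first_pct`% and the last `last_pct`% of the words."""
    if first_pct + last_pct >= 100:
        return text  # the two slices would cover the whole text anyway
    words = text.split()
    head_len = len(words) * first_pct // 100
    tail_len = len(words) * last_pct // 100
    tail = words[len(words) - tail_len:] if tail_len else []
    return " ".join(words[:head_len] + tail)


def cosine(a, b):
    """Cosine similarity between two phrases using bag-of-words counts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def drop_overlapping(ranked, threshold=0.6):
    """Walk the list best-first; keep a phrase only if it is dissimilar
    (cosine below `threshold`) to every phrase already kept."""
    kept = []
    for phrase in ranked:
        if all(cosine(phrase, k) < threshold for k in kept):
            kept.append(phrase)
    return kept
```

With the threshold assumed here, “buy back” is dropped when “buy back stock” is ranked above it, because the two share most of their words.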
Practical Advice
• Seeing counterproductive patterns (e.g., a keyphrase that starts
with a preposition when you don’t want that) is another common
problem.
• This is relatively straightforward to handle by tweaking the
implementation code for the algorithm and explicitly encoding
information about such unwanted word patterns.
• Improper text extraction can affect the rest of the KPE process,
especially when dealing with formats such as PDF or scanned
images.
• This is primarily because KPE is sensitive to sentence structure in the
document.
• Hence, it’s always a good idea to add some post-processing to the
extracted key phrases list to create a final, meaningful list without noise.
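A post-processing pass of this kind might look like the following sketch. The preposition list and the specific rules (drop phrases that start with a preposition, drop phrases with no letters, drop duplicates after normalizing case and whitespace) are assumptions for illustration and would be tuned to the noise seen in real data.

```python
# Hypothetical list of prepositions that a keyphrase should not start with.
PREPOSITIONS = {"of", "in", "on", "at", "for", "with", "by", "to", "from"}


def clean_keyphrases(phrases, min_chars=3):
    """Return a de-duplicated, normalized list with unwanted patterns removed."""
    cleaned, seen = [], set()
    for phrase in phrases:
        # Normalize case and collapse whitespace left over from extraction.
        normalized = " ".join(phrase.lower().split())
        words = normalized.split()
        if not words or len(normalized) < min_chars:
            continue  # empty or too short to be meaningful
        if words[0] in PREPOSITIONS:
            continue  # unwanted pattern: starts with a preposition
        if not any(ch.isalpha() for ch in normalized):
            continue  # page numbers and similar extraction noise
        if normalized in seen:
            continue  # duplicate after normalization
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned
```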
Named Entity Recognition
• NER refers to the IE task of identifying the entities in a
document.
• Entities are typically names of persons, locations, and
organizations, and other specialized strings, such as
money expressions, dates, products, names/numbers of
laws or articles, and so on.
• NER is an important step in the pipeline of several NLP
applications involving information extraction.
• Explosion.ai. “displaCy Named Entity Visualizer” (https://demos.explosion.ai/displacy-ent)
Building an NER System
• A simple approach to building an NER system is to maintain a large
collection of person/organization/location names that are the most
relevant to our company (e.g., names of all clients, cities in their
addresses, etc.); this is typically referred to as a gazetteer.
• An approach that goes beyond a lookup table is rule-based NER, which
uses a compiled list of patterns over word tokens and
POS tags.
• For example, a pattern “NNP was born,” where “NNP” is the POS tag for a
proper noun, indicates that the word that was tagged “NNP” refers to a person.
• Such rules can be programmed to cover as many cases as possible to build a
rule-based NER system.
• Stanford NLP’s RegexNER (nlp.stanford.edu/software/regexner.html) and
spaCy’s EntityRuler (spacy.io/usage/rule-based-matching#entityruler) provide
functionalities to implement your own rule-based NER.
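The two ideas on this slide, a gazetteer lookup and a pattern rule, can be sketched without any NLP library. The gazetteer entries are hypothetical, and the regular expression only approximates the “NNP was born” rule by treating capitalized words as proper nouns; a real rule-based system would use actual POS tags.

```python
import re

# Hypothetical gazetteer: names mapped to entity types.
GAZETTEER = {"Apple": "ORG", "Luca Maestri": "PER", "Microsoft": "ORG"}

# Regex stand-in for the rule "NNP was born": one or more capitalized
# words directly before "was born". Capitalization is only a rough proxy
# for the NNP tag (an assumption of this sketch).
BORN_PATTERN = re.compile(r"\b((?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*) was born\b")


def find_entities(text):
    """Return sorted (start_offset, entity_text, label) tuples."""
    found = set()
    # Gazetteer lookup: exact, whole-word matches of known names.
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.add((match.start(), match.group(), label))
    # Pattern rule: whatever precedes "was born" is tagged as a person.
    for match in BORN_PATTERN.finditer(text):
        found.add((match.start(1), match.group(1), "PER"))
    return sorted(found)
```

The gazetteer only finds names it already contains, while the pattern rule can find unseen names but only in one narrow context; that trade-off is why the two are often combined.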
Machine Learning Approach
• A more practical approach to NER is to train an ML
model, which can predict the named entities in unseen
text. For each word, a decision has to be made whether
or not that word is an entity, and if it is, what type of
entity it is.
• NER is traditionally modeled as a sequence
classification problem (also called sequence labeling), where
the entity prediction for the current word also depends on the context.
• For example, if the previous word was a person name, there’s
a higher probability that the current word is also a person
name if it’s a noun (e.g., first and last names).
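To make the role of context concrete, here is a sketch of the kind of per-token feature dictionary that could be fed to a sequence model. The feature names are made up for this example, and passing in the previous label explicitly is a simplification: a trained sequence model would handle label dependencies itself.

```python
def token_features(tokens, i, prev_label="O"):
    """Build features for tokens[i] from the word and its neighbors."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalized like a name?
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
        "prev_label": prev_label,        # label predicted for the previous word
        "BOS": i == 0,                   # beginning of sentence
        "EOS": i == len(tokens) - 1,     # end of sentence
    }
    if i > 0:
        feats["prev.lower"] = tokens[i - 1].lower()
        feats["prev.istitle"] = tokens[i - 1].istitle()
    if i < len(tokens) - 1:
        feats["next.lower"] = tokens[i + 1].lower()
    return feats
```

For the second token of a name, the features record both that the previous word was capitalized and that it was labeled as a person, which is the contextual signal described above.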
Sequence Classifier
• The labels in the figure follow what’s known
as a BIO notation: B indicates the beginning
of an entity; I, inside an entity, indicates
when entities comprise more than one word;
and O, other, indicates non-entities.
• “Peter” gets tagged as a B-PER, and “Such” gets
tagged as an I-PER to indicate that Such is a part
of the entity from the previous word.
• The remaining entities in this example, Essex,
Yorkshire, and Headingley, are all one-word
entities. So, we only see B-ORG and B-LOC as
their tags.
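Decoding BIO tags back into entity spans can be sketched as below. The example sentence is invented around the entity names mentioned above, and the assignment of ORG to Essex and Yorkshire and LOC to Headingley is an assumption. An I- tag that does not continue a matching entity is treated as O here, which is one of several possible conventions.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity still open.
            if current:
                spans.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # Continuation of the entity opened by the preceding B- tag.
            current.append(token)
        else:
            # "O" or a stray I- tag: close any open entity.
            if current:
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans


# Hypothetical sentence and tags, for illustration only.
tokens = ["Peter", "Such", "of", "Essex", "faced", "Yorkshire", "at", "Headingley"]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-ORG", "O", "B-LOC"]
entities = bio_to_spans(tokens, tags)
print(entities)
```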
Named Entity Disambiguation and Linking
• Named entity disambiguation (NED) refers to the NLP
task of assigning a unique identity to entities
mentioned in the text.
• NER and NED together are known as named entity
linking (NEL).
• Some other NLP applications that would need NEL
include question answering and constructing large
knowledge bases of connected events and entities, such
as the Google Knowledge Graph.
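A toy version of disambiguation is sketched below: each candidate identity carries a set of context words, and the candidate that shares the most words with the sentence wins. The knowledge base, its IDs, and the context words are all made up for this example; production systems link against large knowledge bases and use learned models rather than word overlap.

```python
import re

# Hypothetical mini knowledge base; IDs and context words are invented.
KB = {
    "apple": [
        {"id": "ENT:apple_inc", "type": "ORG",
         "context": {"company", "stock", "iphone", "shares", "buyback", "chief"}},
        {"id": "ENT:apple_fruit", "type": "FOOD",
         "context": {"fruit", "tree", "eat", "juice", "orchard", "pie"}},
    ]
}


def link_entity(mention, sentence):
    """Return the ID of the best-matching candidate, or None if undecidable."""
    candidates = KB.get(mention.lower(), [])
    if not candidates:
        return None  # mention is not in the knowledge base
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    # Score each candidate by how many of its context words appear.
    best = max(candidates, key=lambda c: len(c["context"] & words))
    if not best["context"] & words:
        return None  # no contextual evidence for any candidate
    return best["id"]
```

Returning `None` when there is no evidence, instead of guessing the first candidate, is a design choice of this sketch.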
Relationship Extraction
• Relationship extraction (RE) is the IE task that deals
with extracting entities and relationships between them
from text documents.
• It’s an important step in building a knowledge base, and
it’s also useful in improving search and developing
question-answering systems.
• Apart from identifying what entities there are and
disambiguating them, we need to model the process of
extracting the relationships between them by
considering the words connecting the entities in a
sentence, their sense of usage, and so on.
Relationship Extraction
• Satya Narayana Nadella is an Indian-American business executive.
He currently serves as the Chief Executive Officer (CEO) of
Microsoft, succeeding Steve Ballmer in 2014. Before becoming chief
executive, he was Executive Vice President of Microsoft’s Cloud and
Enterprise Group, responsible for building and running the
company’s computing platforms.
Approaches to RE
• RE is often treated as a supervised classification
problem. The datasets used to train RE systems contain
a set of pre-defined relations, similar to classification
datasets. The task is then modeled as a two-step
classification problem:
• Whether two entities in a text are related (binary
classification).
• If they are related, what the relation between them is
(multiclass classification).
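The two-step structure can be sketched as two functions chained together. Both steps below are hand-written rules that merely stand in for trained classifiers, and the relation labels and cue words are hypothetical; only the control flow (binary decision first, relation type second) reflects the approach described above.

```python
# Hypothetical relation labels with cue words (stand-ins for a trained model).
RELATION_CUES = {
    "ceo_of": ("chief executive officer", "ceo"),
    "founder_of": ("founded", "founder"),
}


def are_related(e1, e2, sentence):
    """Step 1 (binary): stand-in rule, both mentions occur in the sentence."""
    return e1 != e2 and e1 in sentence and e2 in sentence


def relation_type(e1, e2, sentence):
    """Step 2 (multiclass): pick the label whose cue appears between mentions."""
    i, j = sentence.index(e1), sentence.index(e2)
    lo, hi = (i + len(e1), j) if i < j else (j + len(e2), i)
    between = sentence[lo:hi].lower()
    for label, cues in RELATION_CUES.items():
        if any(cue in between for cue in cues):
            return label
    return "other"


def extract_relation(e1, e2, sentence):
    """Run step 1, and only if it succeeds run step 2."""
    if not are_related(e1, e2, sentence):
        return None
    return (e1, relation_type(e1, e2, sentence), e2)
```

In a real system, each function would be replaced by a classifier trained on a dataset annotated with the pre-defined relations.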
Other Advanced IE Tasks
• Temporal Information Extraction
• Event Extraction
• Template Filling