Unit 5 NLP RNP
Dr Ratna Patil
Syllabus – Unit 5
Information Extraction: Named Entity Recognition,
Information Retrieval (IR).
Question Answering Systems: IR based Factoid Question
Answering, Entity Linking, Knowledge Based
Question Answering, Classic QA Models, Evaluation of
Factoid Answers.
Summarization: Summarizing Single Documents, Multi-Document
Summarization.
Few Questions on Unit -5 :-
1. Explain Information extraction architecture with the help of neat diagram.
2. What is Information Retrieval (IR)? Explain the architecture of an Information Retrieval system with a
neat diagram.
3. Explain Named Entity Recognition (NER) and the problems faced while recognizing named entities.
4. Differentiate between Information Extraction & Information Retrieval.
5. What is the significance of TFIDF in Information Retrieval?
6. Compare & explain Boolean retrieval and Vector space model for the information retrieval.
7. Define the following with respect to Information Retrieval: a) Vector Space Model b) Term Frequency c)
Inverse Document Frequency d) Boolean Model
8. Explain text summarization and multiple document text summarization with neat diagram.
9. Explain the process of multi-document summarization.
10. Discuss in detail text summarization in NLP.
11. Compare text extraction and summarization.
12. Discuss various steps in a typical information extraction system.
13. Explain TF-IDF in detail with the help of a suitable example. What is the significance of TF-IDF in an IR
system?
14. Explain stages of IR based question answering model.
15. Explain evaluation metrics for automatic summarizer system.
Information Extraction (IE)
Information Extraction, which is an area of natural language
processing, deals with finding factual information in free text.
In formal terms, facts are structured objects, such as
database records.
Such a record may capture a real-world entity with its
attributes mentioned in text, or a real-world event, occurrence,
or state, with its arguments or actors: who did what to whom,
where and when.
Information Extraction (IE) …
The task of Information Extraction (IE) is to identify a
predefined set of concepts in a specific domain, ignoring
other irrelevant information, where a domain consists of a
corpus of texts together with a clearly specified information
need.
In other words, IE is about deriving structured factual
information from unstructured text.
For instance, consider as an example the extraction of
information on violent events from online news, where one
is interested in identifying the main actors of the event, its
location and number of people affected.
Information Extraction (IE) …
Example: the figure below shows an example of a text snippet from a
news article about a terrorist attack and a structured information
derived from that snippet.
"Three bombs have exploded in north-eastern Nigeria, killing 25
people and wounding 12 in an attack carried out by an Islamic sect.
Authorities said the bombs exploded on Sunday afternoon in the city of
Maiduguri."
Information Extraction (IE) …
➢ Information extraction (IE) systems:
Find and understand limited relevant parts of texts.
Gather information from many pieces of text.
Produce a structured representation of relevant information:
relations (in the database sense), a.k.a., a knowledge base.
➢ Goals:
1. Organize information so that it is useful to people.
2. Put information in a semantically precise form that allows
further inferences to be made by computer algorithms.
Information Extraction Architecture:
Information Extraction (IE) …
Tasks of information extraction:
○ Named Entity Recognition (NER) is the detection and classification of proper names (named entities) in text.
○ Co-reference Resolution requires the identification of multiple
(co-referring) mentions of the same entity in the text.
○ Relation Extraction (RE) is the task of detecting and classifying
predefined relationships between entities identified in text.
○ Event Extraction (EE) refers to the task of identifying events in
free text and deriving detailed and structured information
about them, ideally identifying who did what to whom, when,
where, through what methods (instruments), and why.
Information Extraction (IE) …
➢ Relation Extraction (RE) - Relation Extraction (RE) is the task of detecting and
classifying predefined relationships between entities identified in text.
For example:
• EmployeeOf(Steve Jobs, Apple) : a relation between a person and an
organization, extracted from 'Steve Jobs works for Apple’.
• LocatedIn(Smith, New York) : a relation between a person and location,
extracted from 'Mr. Smith gave a talk at the conference in New York’,
• SubsidiaryOf(TVN, ITI Holding) : a relation between two companies, extracted
from 'Listed broadcaster TVN said its parent company, ITI Holdings, is
considering various options for the potential sale’.
➢ Note, although in general the set of relations that may be of interest is unlimited, the
set of relations within a given task is predefined and fixed, as part of the specification
of the task.
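The relation patterns above can be sketched with hand-written rules. The regexes below are illustrative toys of my own (no real RE system works this simply); they recover the EmployeeOf and SubsidiaryOf examples from the slide:

```python
import re

# A minimal rule-based relation extractor: hand-written surface
# patterns map text spans to predefined relations. Real RE systems
# use parsers or classifiers; this only illustrates the task.
PATTERNS = [
    # EmployeeOf(person, org): "<Person> works for <Org>"
    (re.compile(r"([A-Z][a-z]+(?: [A-Z][a-z]+)*) works for ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
     "EmployeeOf"),
    # SubsidiaryOf(company, parent): "<Company> said its parent company, <Parent>,"
    (re.compile(r"([A-Z][A-Za-z]+) said its parent company, ([A-Z][A-Za-z ]+?),"),
     "SubsidiaryOf"),
]

def extract_relations(text):
    """Return (relation, arg1, arg2) triples found in the text."""
    triples = []
    for pattern, relation in PATTERNS:
        for arg1, arg2 in pattern.findall(text):
            triples.append((relation, arg1, arg2))
    return triples

print(extract_relations("Steve Jobs works for Apple."))
# [('EmployeeOf', 'Steve Jobs', 'Apple')]
print(extract_relations("Listed broadcaster TVN said its parent company, ITI Holdings, is considering a sale."))
# [('SubsidiaryOf', 'TVN', 'ITI Holdings')]
```

Note how the relation set is fixed in advance (the PATTERNS list), matching the slide's remark that relations within a task are predefined.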
Named Entity Recognition
● The starting point for most information extraction
applications is the detection and classification of the
named entities in a text.
● By named entity, we simply mean anything that can be
referred to with a proper name.
● Common nouns (mango, table, sky, etc.) and abstract nouns (age, smile, sadness, wish, etc.) are not named entities.
● Proper nouns (Ram, VIIT college, Mumbai, etc.) are the typical named entities.
Named Entity Recognition
Applications of NER……
• Coreference Resolution / Entity Linking
• Knowledge Base Construction
• Web Query Understanding
• Question Answering
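A toy illustration of detection plus classification of named entities, using a gazetteer (a lookup list I made up from the slide's examples). Real NER taggers use sequence models over contextual features, not lookup alone:

```python
# Gazetteer-based NER sketch: known proper names are mapped to
# entity types. This illustrates the NER task, not a real tagger.
GAZETTEER = {
    "Ram": "PERSON",
    "VIIT": "ORGANIZATION",
    "Mumbai": "LOCATION",
    "Nigeria": "LOCATION",
}

def tag_entities(tokens):
    """Return (token, entity_type) pairs for tokens found in the gazetteer."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

print(tag_entities("Ram moved from Mumbai to Nigeria".split()))
# [('Ram', 'PERSON'), ('Mumbai', 'LOCATION'), ('Nigeria', 'LOCATION')]
```

A lookup approach fails exactly where the slides note NER is hard: unseen names, ambiguous names ("Washington" as PERSON vs LOCATION), and multi-token entities.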
Information Retrieval
The process of accessing and retrieving the most appropriate information from text, based on a particular query given by the user, with the help of context-based indexing or metadata.
Google Search is the most famous example of information retrieval.
Information Retrieval …………… Basic Terms
Information Retrieval …………… Types of data
Example of IR problem
Example of IR problem - Term-Document Index Matrix
IR Model - 1) Boolean Retrieval Model
Advantages of the Boolean Retrieval Model
The advantages of the Boolean model are as follows:
• The simplest model, based on sets.
• Easy to understand and implement.
• Retrieves only exact matches.
• Gives the user a sense of control over the system.
Disadvantages of the Boolean Retrieval Model
• The model’s similarity function is Boolean, so there are no partial matches. This can be annoying for users.
• Boolean operator usage has much more influence than a critical word.
• The query language is expressive, but it is complicated too.
• No ranking of retrieved documents.
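The Boolean model's set-based behaviour can be shown directly: build an inverted index, then answer an AND query by set intersection. The three-document corpus below is a toy of my own:

```python
# Minimal Boolean retrieval sketch: inverted index + AND query.
# Exact-match and unranked, as described above.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents containing every query term (Boolean AND)."""
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(boolean_and("home", "sales", "july"))  # [2, 3]
print(boolean_and("home", "forecasts"))      # [1]
```

Note the output is an unordered match set (sorted only by document id): there is no notion of one hit being more relevant than another, which is precisely the "no ranking" disadvantage.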
Information Retrieval – 2. Vector Space Model
Each document (and each query) is represented as a vector of term weights.
Information Retrieval – 2. Vector Space Model
Term Frequency (TF): In document d, the term frequency is the number of occurrences of a given term t.
Term Weighting:
Approach 1: Term Frequency
• In its simplest form, the raw frequency of a term within a document (Luhn, 1957).
• It reflects the intuition that terms occurring frequently within a document may reflect its meaning more strongly than terms that occur less frequently, and should thus have higher weights.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Approach 2: Document Frequency
• Gives a higher weight to words that occur in only a few documents.
• Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection, while terms that occur frequently across the entire collection aren’t as helpful.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Inverse Document Frequency (IDF): IDF is one way of assigning higher weights to these more discriminative words. The fewer documents a term occurs in, the higher this weight; the lowest weight of 1 is assigned to terms that occur in all the documents.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Inverse Document Frequency (IDF): Due to the large number of documents in many collections, this measure is usually squashed with a log function: idf(t) = log10(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.
Information Retrieval – 2. Vector Space Model
TF-IDF: the tf-idf weight combines both measures: tf-idf(t, d) = tf(t, d) × idf(t).
Information Retrieval – TF-IDF Example
(worked example: slide figures omitted)
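The TF-IDF weighting described above can be computed directly. A minimal sketch, using the raw-count tf and log10-squashed idf from the slides, on a toy three-document corpus of my own (not the slide's example):

```python
import math

# TF-IDF: w(t, d) = tf(t, d) * log10(N / df(t))
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
N = len(docs)
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """Raw count of the term in one document."""
    return doc_tokens.count(term)

def idf(term):
    """Log-squashed inverse document frequency."""
    df = sum(1 for toks in tokenized if term in toks)
    return math.log10(N / df) if df else 0.0

def tf_idf(term, doc_id):
    return tf(term, tokenized[doc_id]) * idf(term)

# "cat" occurs in 1 of 3 docs -> high idf; "the" occurs in 2 of 3 -> low idf.
print(round(tf_idf("cat", 0), 3))  # 0.477
print(round(tf_idf("the", 0), 3))  # 0.352
```

Even though "the" appears twice in document 0 and "cat" only once, "cat" gets the higher weight: IDF is discounting the widely shared term, which is exactly the discriminative behaviour the slides describe.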
Example 2- TF-IDF
https://medium.com/nlplanet/two-minutes-nlp-learn-tf-idf-with-easy-examples-
7c15957b4cb3
Do not remove stopwords.
Difference between Information Retrieval and Information Extraction
1. IR: document retrieval.  IE: feature (fact) retrieval.
2. IR: the goal is to find documents relevant to the user’s information need.  IE: the goal is to extract pre-specified features from documents.
3. IR: real information is buried inside documents.  IE: extracts the information from within the documents.
4. IR: used in many search engines – Google is the best-known IR system for the web.  IE: used in database systems to enter extracted features automatically.
5. IR: typically uses a bag-of-words model of the source text.  IE: typically based on some form of semantic analysis of the source text.
Question Answering System (QA)
● There are many situations where the user wants a
particular piece of information rather than an entire
document or document set.
● QA system returns a particular piece of information to the
user in response to a question.
● It is called factoid QA if the information is a simple fact, particularly if this fact has to do with a named entity like a person, organization, or location.
Factoid question answering system (FQA)
IR-based Factoid QA
• QUESTION PROCESSING
• Detect question type, answer type, focus, relations
• Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
• Retrieve ranked documents
• Break into suitable passages and rerank
• ANSWER PROCESSING
• Extract candidate answers
• Rank candidates
• using evidence from the text and external sources
Dan Jurafsky
Question Processing
Things to extract from the question
• Answer Type Detection
• Decide the named entity type (person, place) of the answer
• Query Formulation
• Choose query keywords for the IR system
• Question Type classification
• Is this a definition question, a math question, a list question?
• Focus Detection
• Find the question words that are replaced by the answer
• Relation Extraction
• Find relations between entities in the question
Answer Type Taxonomy (Li & Roth, 2002)
• 6 coarse classes:
• ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC
• 50 finer classes:
• LOCATION: city, country, mountain…
• HUMAN: group, individual, title, description
• ENTITY: animal, body, color, currency…
Answer Types
Approaches to Answer Type Detection:
• Hand-written rules
• Machine Learning
• Hybrids
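The hand-written-rules approach can be sketched with a few trigger patterns. The RULES table below is illustrative only (any real system uses a far richer rule set or a trained classifier over the full taxonomy):

```python
# Rule-based answer type detection: question trigger words map to
# answer types. Illustrative patterns, not from a real taxonomy.
RULES = [
    ("who", "HUMAN"),
    ("where", "LOCATION"),
    ("when", "NUMERIC:date"),
    ("how many", "NUMERIC:count"),
    ("how tall", "NUMERIC:distance"),
]

def answer_type(question):
    """Return the predicted answer type for a question string."""
    q = question.lower()
    for trigger, atype in RULES:
        if q.startswith(trigger) or " " + trigger + " " in q:
            return atype
    return "ENTITY"  # fall back to a coarse default class

print(answer_type("Who is the prime minister of India?"))  # HUMAN
print(answer_type("How tall is Mt. Everest?"))             # NUMERIC:distance
```

The predicted type then constrains answer extraction: a HUMAN question only accepts PERSON-tagged strings as candidates.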
Query Formulation example – keyword selection priorities: cyberspace/1 Neuromancer/1 term/4 novel/4 coined/7
Passage Retrieval
• Step 1: IR engine retrieves documents using query terms
• Step 2: Segment the documents into shorter units
• something like paragraphs
• Step 3: Passage ranking
• Use answer type to help rerank passages
Answer Extraction
• Run an answer-type named-entity tagger on the passages
• Each answer type requires a named-entity tagger that detects it
• If answer type is CITY, tagger has to tag CITY
• Can be full NER, simple regular expressions, or hybrid
• Return the string with the right type:
• Who is the prime minister of India? (PERSON)
“Manmohan Singh, Prime Minister of India, had told left leaders that the deal would not be renegotiated.”
• How tall is Mt. Everest? (LENGTH)
“The official height of Mount Everest is 29035 feet.”
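The "simple regular expressions" option for answer extraction can be shown for the LENGTH answer type from the Mt. Everest example. The pattern below is a toy tagger of my own, not a production NER component:

```python
import re

# Regex "tagger" for the LENGTH answer type: a number followed by a
# length unit. Each answer type would get its own tagger like this.
LENGTH_RE = re.compile(r"\b([\d,]+)\s*(feet|metres|meters|ft)\b")

def extract_length(passage):
    """Return candidate LENGTH answers found in a retrieved passage."""
    return [" ".join(match) for match in LENGTH_RE.findall(passage)]

passage = "The official height of Mount Everest is 29035 feet"
print(extract_length(passage))  # ['29035 feet']
```

The returned strings are only candidates; in the full pipeline they would still be ranked using evidence from the text and external sources.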
Knowledge-based approaches
• Build a semantic representation of the query
• Times, dates, locations, entities, numeric quantities
• Map from this semantics to query structured data or resources
• Geospatial databases
• Ontologies (Wikipedia infoboxes, DBpedia, WordNet, Yago)
• Restaurant review sources and reservation services
• Scientific databases
Relation Extraction
• Answers: Databases of Relations
• born-in(“Emma Goldman”, “June 27 1869”)
• author-of(“Cao Xue Qin”, “Dream of the Red Chamber”)
• Draw from Wikipedia infoboxes, DBpedia, FreeBase, etc.
• Questions: Extracting Relations in Questions
Whose granddaughter starred in E.T.?
(acted-in ?x “E.T.”)
(granddaughter-of ?x ?y)
Text Summarization
• Summarization Applications
• outlines or abstracts of any document, article, etc
• summaries of email threads
• action items from a meeting
• simplifying text by compressing sentences
What to summarize?
Single vs. multiple documents
• Single-document summarization
• Given a single document, produce
• abstract
• outline
• headline
• Multiple-document summarization
• Given a group of documents, produce a gist of the content:
• a series of news stories on the same event
• a set of web pages about some topic or question
Query-focused Summarization & Generic Summarization
• Generic summarization:
• Summarize the content of a document
• Query-focused summarization:
• summarize a document with respect to an information need expressed
in a user query.
• a kind of complex question answering:
• Answer a question by summarizing a document that has the
information to construct the answer
Extractive vs. Abstractive Summarization
• Extractive summarization:
• create the summary from phrases or sentences in the source document(s)
• Abstractive summarization:
• express the ideas in the source documents using (at least in part) different words
Content Selection
Supervised content selection:
• Given: a labeled training set of good summaries for each document.
• Align: the sentences in the document with sentences in the summary.
• Problems: hard to get labeled training data; alignment is difficult; performance is not better than unsupervised algorithms.
Evaluating Summaries: ROUGE
ROUGE-N measures the N-gram overlap (recall-oriented) between a candidate summary and a set of human reference summaries.
A ROUGE example:
Q: “What is water spinach?”
• ROUGE-2 = (3 + 3 + 6) / (10 + 10 + 9) = 12/29 = 0.41
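ROUGE-2 recall as computed above (matched bigrams over total reference bigrams) is easy to implement. The texts below are made-up stand-ins, since the slide's water-spinach passages are not reproduced here:

```python
# ROUGE-2 recall: matched bigrams between candidate and each
# reference, divided by the total number of reference bigrams.
def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge_2(candidate, references):
    cand = bigrams(candidate.split())
    matched = total = 0
    for ref in references:
        ref_bi = bigrams(ref.split())
        remaining = list(cand)          # clip: a candidate bigram matches at most once per reference
        for bg in ref_bi:
            if bg in remaining:
                matched += 1
                remaining.remove(bg)
        total += len(ref_bi)
    return matched / total

refs = ["water spinach is a leafy green vegetable",
        "water spinach is a semi aquatic plant"]
cand = "water spinach is a green vegetable"
print(round(rouge_2(cand, refs), 2))  # 0.58
```

The numerator sums matched bigrams across all references and the denominator sums all reference bigrams, mirroring the (3 + 3 + 6) / (10 + 10 + 9) structure of the slide's computation.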
Basic three tasks of a summarizer
1. Content Selection – choose which sentences to extract.
2. Information Ordering – decide the order of the extracted sentences.
3. Sentence Realization – clean up the sentences (e.g., simplification).
Example pipeline: Input Docs → Sentence Segmentation → Sentence Simplification → (all sentences plus simplified versions) → Content Selection (LLR, MMR) → Extracted sentences.
Content Selection
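A bare-bones extractive content-selection sketch: score each sentence by the summed document-wide frequency of its non-stopword words, then extract the top-scoring sentences. This naive frequency scorer is my own stand-in; LLR or MMR would replace it in practice:

```python
from collections import Counter

# Naive extractive content selection: frequent content words signal
# central sentences; pick the k highest-scoring sentences.
STOPWORDS = {"the", "a", "of", "is", "in", "and", "to"}

def summarize(sentences, k=1):
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        return sum(freq[w] for w in sentence.lower().split() if w not in STOPWORDS)

    ranked = sorted(sentences, key=score, reverse=True)
    return ranked[:k]

doc = [
    "Three bombs exploded in north-eastern Nigeria.",
    "Authorities said the bombs exploded on Sunday afternoon.",
    "The weather was otherwise calm.",
]
print(summarize(doc, k=1))
```

On this toy input the sentence sharing the most repeated content words ("bombs", "exploded") wins; MMR would additionally penalize redundancy between already-selected sentences.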
1. Sentence Reordering
● Order of documents can be used
● Chronological ordering using dates associated with
documents/sentences
● Concept of coherence can also be used.