
Question Answering, Information Retrieval, and Summarization
UNIT - 5
Dr Ratna Patil
Syllabus – Unit 5
Information Extraction: Named Entity Recognition, Information Retrieval (IR).
Question Answering Systems: IR-based Factoid Question Answering, Entity Linking, Knowledge-Based Question Answering, Classic QA Models, Evaluation of Factoid Answers.
Summarization: Summarizing Single Documents, Multi-Document Summarization.
Few Questions on Unit 5:
1. Explain the information extraction architecture with the help of a neat diagram.
2. What is Information Retrieval (IR)? Explain the architecture of an Information Retrieval system with a neat diagram.
3. Explain Named Entity Recognition (NER) and the problems encountered while recognizing named entities.
4. Differentiate between Information Extraction and Information Retrieval.
5. What is the significance of TF-IDF in Information Retrieval?
6. Compare and explain the Boolean retrieval and Vector Space models for information retrieval.
7. Define the following with respect to Information Retrieval: a) Vector Space Model b) Term Frequency c) Inverse Document Frequency d) Boolean Model
8. Explain text summarization and multiple-document text summarization with neat diagrams.
9. Explain the process of multi-document summarization.
10. Discuss text summarization in NLP in detail.
11. Compare text extraction and summarization.
12. Discuss the various steps in a typical information extraction system.
13. Explain TF-IDF in detail with the help of a suitable example. What is the significance of TF-IDF in an IR system?
14. Explain the stages of an IR-based question answering model.
15. Explain evaluation metrics for an automatic summarizer system.
Information Extraction (IE)
Information Extraction, which is an area of natural language
processing, deals with finding factual information in free text.
In formal terms, facts are structured objects, such as
database records.
Such a record may capture a real-world entity with its
attributes mentioned in text, or a real-world event, occurrence,
or state, with its arguments or actors: who did what to whom,
where and when.
Information Extraction (IE) …
The task of Information Extraction (IE) is to identify a
predefined set of concepts in a specific domain, ignoring
other irrelevant information, where a domain consists of a
corpus of texts together with a clearly specified information
need.
In other words, IE is about deriving structured factual
information from unstructured text.
For instance, consider the extraction of information on violent events from
online news, where one is interested in identifying the main actors of the
event, its location, and the number of people affected.
Information Extraction (IE) …
Example: the figure below shows a text snippet from a news article about a
terrorist attack and the structured record derived from that snippet.
"Three bombs have exploded in north-eastern Nigeria, killing 25
people and wounding 12 in an attack carried out by an Islamic sect.
Authorities said the bombs exploded on Sunday afternoon in the city of
Maiduguri."
Information Extraction (IE) …
➢ Information extraction (IE) systems:
Find and understand limited relevant parts of texts.
Gather information from many pieces of text.
Produce a structured representation of relevant information:
relations (in the database sense), a.k.a., a knowledge base.
➢ Goals:
1. Organize information so that it is useful to people.
2. Put information in a semantically precise form that allows
further inferences to be made by computer algorithms.
Information Extraction Architecture: (diagram on slide; a typical pipeline runs raw text → sentence segmentation → tokenization → part-of-speech tagging → entity detection → relation detection → structured relations)
Information Extraction (IE) …
Tasks of information extraction:
○ Named entity recognition is recognition of entity names.
○ Co-reference Resolution requires the identification of multiple
(co-referring) mentions of the same entity in the text.
○ Relation Extraction (RE) is the task of detecting and classifying
predefined relationships between entities identified in text.
○ Event Extraction (EE) refers to the task of identifying events in
free text and deriving detailed and structured information
about them, ideally identifying who did what to whom, when,
where, through what methods (instruments), and why.
Information Extraction (IE) …
➢ Relation Extraction (RE) is the task of detecting and classifying predefined relationships between entities identified in text. For example:
• EmployeeOf(Steve Jobs, Apple): a relation between a person and an organization, extracted from "Steve Jobs works for Apple".
• LocatedIn(Smith, New York): a relation between a person and a location, extracted from "Mr. Smith gave a talk at the conference in New York".
• SubsidiaryOf(TVN, ITI Holding): a relation between two companies, extracted from "Listed broadcaster TVN said its parent company, ITI Holdings, is considering various options for the potential sale."
➢ Note: although in general the set of relations that may be of interest is unlimited, the set of relations within a given task is predefined and fixed, as part of the specification of the task.
Named Entity Recognition
● The starting point for most information extraction
applications is the detection and classification of the
named entities in a text.
● By named entity, we simply mean anything that can be
referred to with a proper name.
● Common nouns (mango, table, sky, etc.) are not named entities.
● Abstract nouns (age, smile, sadness, wish, etc.) are not named entities.
● Proper nouns (Ram, VIIT College, Mumbai, etc.) are the named entities.
Named Entity Recognition
NER involves identifying and classifying entities in text into predefined categories such as names of persons, organizations, locations, dates, and more.
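As a concrete illustration, here is a minimal sketch using the spaCy library (an assumption of this example, not something prescribed by the slides; it requires the en_core_web_sm model to be downloaded) to detect and classify named entities in the bombing snippet from earlier:

```python
# Minimal NER sketch with spaCy (assumes: pip install spacy,
# python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Three bombs exploded on Sunday afternoon in the city of "
          "Maiduguri, Nigeria, killing 25 people, authorities said.")

# Each detected entity carries its surface text and a category label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical labels here: DATE ("Sunday afternoon"), GPE ("Maiduguri",
# "Nigeria"), CARDINAL ("25"); exact output depends on the model version.
```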
Named Entity Recognition
Applications of NER:
• Coreference Resolution / Entity Linking
• Knowledge Base Construction
• Web Query Understanding
• Question Answering
Information Retrieval
The process of accessing and retrieving the most appropriate information from text, based on a particular query given by the user, with the help of context-based indexing or metadata.
Google Search is the most famous example of information retrieval.
Information Retrieval – Basic Terms (figure on slide)
Information Retrieval – Types of Data (figures on slides)
Example of an IR Problem (worked example on slides)
Example of an IR Problem – Term-Document Incidence Matrix (table on slides)
IR Model 1: Boolean Retrieval Model (table on slide)
Advantages of the Boolean Retrieval Model
The advantages of the Boolean model are as follows:
• The simplest model, based on sets.
• Easy to understand and implement.
• It retrieves exact matches only.
• It gives the user a sense of control over the system.
Disadvantages of the Boolean Retrieval Model
• The model's similarity function is Boolean, so there are no partial matches. This can be annoying for users.
• Boolean operator usage has much more influence than a critical word.
• The query language is expressive, but it is complicated too.
• No ranking of retrieved documents.
(A toy sketch of Boolean retrieval follows.)
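A toy sketch of Boolean retrieval over a made-up corpus (the document contents are illustrative), answering the query brutus AND caesar AND NOT calpurnia with exact set operations:

```python
# Boolean retrieval sketch: build an inverted index over a toy corpus,
# then answer a Boolean query with set operations (exact match, no ranking).
docs = {
    1: "antony and cleopatra brutus caesar",
    2: "julius caesar brutus calpurnia",
    3: "the tempest",
}

# Inverted index: term -> set of document IDs containing that term
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Query: brutus AND caesar AND NOT calpurnia
result = (index["brutus"] & index["caesar"]) - index.get("calpurnia", set())
print(sorted(result))  # [1] -- all-or-nothing matching, no partial matches
```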
Information Retrieval – 2. Vector Space Model
Each document (and each query) is represented as a vector of term weights.
Information Retrieval – 2. Vector Space Model
Term Frequency (TF): In document d, the frequency represents the number of instances of a given word t.
Consider a document containing 100 words in which the word "cat" appears 3 times. Then:
tf(cat) = 3 / 100 = 0.03
Information Retrieval – 2. Vector Space Model
Document Frequency (DF): The document frequency is the number of separate documents in which the term appears. It depends on the entire corpus.
Information Retrieval – 2. Vector Space Model
Inverse Document Frequency (IDF): the inverse of the fraction of documents in which the term appears, computed over the entire corpus. Terms that appear in fewer documents receive a higher IDF.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Approach 1: Term Frequency
• In its simplest form, the raw frequency of a term within a document (Luhn, 1957).
• It reflects the intuition that terms occurring frequently within a document may reflect its meaning more strongly than terms that occur less frequently, and should thus have higher weights.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Approach 2:
• Gives a higher weight to words that occur in only a few documents.
• Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection, while terms that occur frequently across the entire collection aren't as helpful.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Inverse Document Frequency (IDF): the IDF term weight is one way of assigning higher weights to these more discriminative words. The fewer documents a term occurs in, the higher this weight; the lowest weight of 1 is assigned to terms that occur in all the documents.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Inverse Document Frequency (IDF): because of the large number of documents in many collections, this measure is usually squashed with a log function:
idf(t) = log10(N / df(t)), where N is the total number of documents in the collection and df(t) is the number of documents containing term t.
Information Retrieval – 2. Vector Space Model
TF-IDF (term frequency-inverse document frequency): a statistical measure that evaluates how relevant a word is to a document in a collection of documents: tf-idf(t, d) = tf(t, d) × idf(t).
TF-IDF thus prefers words which are frequent in the current document but rare overall in the collection.
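A minimal from-scratch sketch of this computation on a made-up three-document corpus (the documents and resulting numbers are illustrative only):

```python
# From-scratch tf-idf sketch on a toy corpus.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dog ate my homework".split(),
]
N = len(docs)

# df[t] = number of documents containing term t
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)   # normalized term frequency
    idf = math.log10(N / df[term])    # rarer terms get a higher idf
    return tf * idf

# "the" occurs in every document, so idf = log10(3/3) = 0 -> weight 0;
# "cat" occurs in 2 of 3 documents, so it gets a small positive weight.
print(tf_idf("the", docs[0]))                # 0.0
print(round(tf_idf("cat", docs[0]), 4))      # ~0.0293
```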
Information Retrieval – Bag of Words (figure on slide)
Information Retrieval – TF-IDF Example (worked examples on slides)
Example 2 – TF-IDF
https://medium.com/nlplanet/two-minutes-nlp-learn-tf-idf-with-easy-examples-7c15957b4cb3
(Do not remove stopwords; solution worked out on slides.)
Difference between Information Retrieval (IR) and Information Extraction (IE)
1. IR: document retrieval. IE: feature retrieval.
2. IR returns a set of relevant documents. IE returns facts out of documents.
3. The goal of IR is to find documents that are relevant to the user's information need. The goal of IE is to extract pre-specified features from documents or display information.
4. In IR, the real information is buried inside documents. IE extracts information from within the documents.
5. IR produces a long listing of documents. IE aggregates over the entire set.
6. IR is used in many search engines (Google is the best IR system for the web). IE is used in database systems to enter extracted features automatically.
7. IR typically uses a bag-of-words model of the source text. IE is typically based on some form of semantic analysis of the source text.
8. IR mostly uses the theory of information, probability, and statistics. IE emerged from research into rule-based systems.
Factoid Questions
• What's the official language of Algeria? Arabic
• What is the telephone number for the University of Colorado, Boulder? (303) 492-1411
• How many pounds are there in a stone? 14
Question Answering System (QA)
● There are many situations where the user wants a
particular piece of information rather than an entire
document or document set.
● A QA system returns a particular piece of information to the user in
response to a question.
● This is called factoid QA if the information is a simple fact, particularly
if the fact has to do with a named entity like a person, organization, or
location.
Factoid Question Answering System (FQA)
● An FQA answers questions by finding, either from the Web or some other collection of documents, short text segments that are likely to contain answers to questions, reformatting them, and presenting them to the user.
Factoid Question Answering System (FQA): pipeline figure on slide.
(The following slides are adapted from Dan Jurafsky.)
IR-based Factoid QA
• QUESTION PROCESSING
• Detect question type, answer type, focus, relations
• Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
• Retrieve ranked documents
• Break into suitable passages and rerank
• ANSWER PROCESSING
• Extract candidate answers
• Rank candidates
• using evidence from the text and external sources
Question Processing
Things to extract from the question
• Answer Type Detection
• Decide the named entity type (person, place) of the answer
• Query Formulation
• Choose query keywords for the IR system
• Question Type classification
• Is this a definition question, a math question, a list question?
• Focus Detection
• Find the question words that are replaced by the answer
• Relation Extraction
• Find relations between entities in the question
Answer Type Taxonomy
Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02.
• 6 coarse classes: ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC
• 50 finer classes, e.g.:
• LOCATION: city, country, mountain, ...
• HUMAN: group, individual, title, description
• ENTITY: animal, body, color, currency, ...
Part of Li & Roth's Answer Type Taxonomy (figure on slide)
Answer Types (examples on slide)
Answer Type Detection
• Hand-written rules
• Machine learning
• Hybrids
Answer Type Detection
• Regular expression-based rules can get some cases:
• Who {is|was|are|were} PERSON
• PERSON (YEAR – YEAR)
• Other rules use the question headword (the headword of the first noun phrase after the wh-word):
• Which city in China has the largest number of foreign financial companies?
• What is the state flower of California?
(A small rule sketch follows.)
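A small sketch of what such hand-written rules can look like in code (the patterns and the fallback class here are illustrative, not a standard rule set):

```python
# Illustrative regular-expression rules for answer type detection.
import re

RULES = [
    (re.compile(r"^who\b", re.I), "PERSON"),
    (re.compile(r"^(where|which (city|country|state))\b", re.I), "LOCATION"),
    (re.compile(r"^(when|what year)\b", re.I), "DATE"),
    (re.compile(r"^how (many|much|tall|far)\b", re.I), "NUMERIC"),
]

def answer_type(question):
    for pattern, qtype in RULES:
        if pattern.search(question):
            return qtype
    return "ENTITY"  # illustrative fallback class

print(answer_type("Which city in China has the largest number of "
                  "foreign financial companies?"))            # LOCATION
print(answer_type("How many pounds are there in a stone?"))   # NUMERIC
```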
Answer Type Detection
• Most often, we treat the problem as machine learning classification:
• Define a taxonomy of question types
• Annotate training data for each question type
• Train classifiers for each question class using a rich set of features
• The features include those hand-written rules!
Features for Answer Type Detection
• Question words and phrases
• Part-of-speech tags
• Parse features (headwords)
• Named entities
• Semantically related words
Keyword Selection Algorithm
Dan Moldovan, Sanda Harabagiu, Marius Pasca, Rada Mihalcea, Richard Goodrum, Roxana Girju and Vasile Rus. 1999. Proceedings of TREC-8.
1. Select all non-stop words in quotations
2. Select all NNP words in recognized named entities
3. Select all complex nominals with their adjectival modifiers
4. Select all other complex nominals
5. Select all nouns with their adjectival modifiers
6. Select all other nouns
7. Select all verbs
8. Select all adverbs
9. Select the QFW word (skipped in all previous steps)
10. Select all other words
(A simplified code sketch follows.)
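A simplified sketch in the spirit of this algorithm, covering only quoted words, proper nouns, other nouns, and verbs; it assumes NLTK with the punkt and averaged_perceptron_tagger resources installed, and its exact output depends on the tagger:

```python
# Simplified keyword selection: quoted words first, then proper nouns,
# then other nouns, then verbs. Assumes NLTK data is installed.
import re
import nltk

def select_keywords(question):
    # Highest priority: words inside quotation marks
    quoted = [w for phrase in re.findall(r'"([^"]+)"', question)
              for w in phrase.split()]
    tagged = nltk.pos_tag(nltk.word_tokenize(question))
    proper = [w for w, t in tagged if t.startswith("NNP")]
    nouns = [w for w, t in tagged
             if t.startswith("NN") and not t.startswith("NNP")]
    verbs = [w for w, t in tagged if t.startswith("VB")]

    seen, keywords = set(), []
    for group in (quoted, proper, nouns, verbs):
        for w in group:                   # later classes only add words
            if w.lower() not in seen:     # not already chosen earlier
                seen.add(w.lower())
                keywords.append(w)
    return keywords

print(select_keywords('Who coined the term "cyberspace" '
                      'in his novel "Neuromancer"?'))
# Roughly: ['cyberspace', 'Neuromancer', 'term', 'novel', 'coined']
```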
Choosing Keywords from the Query (slide from Mihai Surdeanu)
Who coined the term "cyberspace" in his novel "Neuromancer"?
Selected keywords, annotated with their priority class: cyberspace/1, Neuromancer/1, term/4, novel/4, coined/7
Passage Retrieval
• Step 1: The IR engine retrieves documents using query terms.
• Step 2: Segment the documents into shorter units (something like paragraphs).
• Step 3: Passage ranking: use the answer type to help rerank passages.
Features for Passage Ranking
Used either in rule-based classifiers or with supervised machine learning:
• Number of named entities of the right type in the passage
• Number of query words in the passage
• Number of question N-grams also in the passage
• Proximity of query keywords to each other in the passage
• Longest sequence of question words
• Rank of the document containing the passage
(A toy scoring sketch follows.)
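A toy rule-based scoring function combining three of these features with hand-set weights (the weights are purely illustrative; a supervised ranker would learn them from data):

```python
# Toy passage scorer: more right-type entities and more query-word
# overlap raise the score; a worse document rank lowers it slightly.
def passage_score(passage_tokens, query_tokens,
                  n_right_type_entities, doc_rank):
    query = {t.lower() for t in query_tokens}
    overlap = sum(1 for t in passage_tokens if t.lower() in query)
    return (2.0 * n_right_type_entities   # hand-set, illustrative weights
            + 1.0 * overlap
            - 0.1 * doc_rank)

print(passage_score(
    "the official height of Mount Everest is 29035 feet".split(),
    "how tall is Mount Everest".split(),
    n_right_type_entities=1,   # one LENGTH entity: "29035 feet"
    doc_rank=2))               # -> 4.8
```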
Answer Extraction
• Run an answer-type named-entity tagger on the passages:
• Each answer type requires a named-entity tagger that detects it.
• If the answer type is CITY, the tagger has to tag CITY.
• Can be full NER, simple regular expressions, or a hybrid.
• Return the string with the right type:
• Who is the prime minister of India? (PERSON): "Manmohan Singh, Prime Minister of India, had told left leaders that the deal would not be renegotiated."
• How tall is Mt. Everest? (LENGTH): "The official height of Mount Everest is 29035 feet."
Ranking Candidate Answers
• But what if there are multiple candidate answers?
Q: Who was Queen Victoria's second son?
• Answer type: PERSON
• Passage: "The Marie biscuit is named after Marie Alexandrovna, the daughter of Czar Alexander II of Russia and wife of Alfred, the second son of Queen Victoria and Prince Albert."
Use machine learning: features for ranking candidate answers
• Answer type match: the candidate contains a phrase with the correct answer type.
• Pattern match: a regular expression pattern matches the candidate.
• Question keywords: the number of question keywords in the candidate.
• Keyword distance: the distance in words between the candidate and query keywords.
• Novelty factor: a word in the candidate is not in the query.
• Apposition features: the candidate is an appositive to question terms.
• Punctuation location: the candidate is immediately followed by a comma, period, quotation marks, semicolon, or exclamation mark.
• Sequences of question terms: the length of the longest sequence of question terms that occurs in the candidate answer.
Common Evaluation Metrics
1. Accuracy: does the answer match the gold-labeled answer?
2. Mean Reciprocal Rank (MRR):
• For each query, return a ranked list of M candidate answers.
• The query's score is 1/rank of the first correct answer:
• if the first answer is correct: 1
• else if the second answer is correct: 1/2
• else if the third answer is correct: 1/3, etc.
• The score is 0 if none of the M answers are correct.
• Take the mean over all N queries. (A minimal sketch follows.)
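A minimal sketch of the MRR computation, where each query is summarized by the 1-based rank of its first correct answer (or None if nothing in the returned list is correct):

```python
# Mean Reciprocal Rank over N queries.
def mean_reciprocal_rank(first_correct_ranks):
    # Each entry is the 1-based rank of the first correct answer,
    # or None if no answer in the returned list is correct.
    scores = [1.0 / r if r is not None else 0.0
              for r in first_correct_ranks]
    return sum(scores) / len(scores)

# Query 1: correct at rank 1; query 2: rank 3; query 3: never correct.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 = 0.444...
```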
Knowledge-Based Approaches
• Build a semantic representation of the query:
• times, dates, locations, entities, numeric quantities
• Map from this semantics to a query over structured data or resources:
• geospatial databases
• ontologies (Wikipedia infoboxes, DBpedia, WordNet, Yago)
• restaurant review sources and reservation services
• scientific databases
Relation Extraction
• Answers: databases of relations
• born-in("Emma Goldman", "June 27 1869")
• author-of("Cao Xue Qin", "Dream of the Red Chamber")
• drawn from Wikipedia infoboxes, DBpedia, Freebase, etc.
• Questions: extracting relations in questions
Whose granddaughter starred in E.T.?
(acted-in ?x "E.T.")
(granddaughter-of ?x ?y)
Text Summarization
• Goal: produce an abridged version of a text that contains information that is important or relevant to a user.
• Summarization applications:
• outlines or abstracts of any document, article, etc.
• summaries of email threads
• action items from a meeting
• simplifying text by compressing sentences
What to Summarize? Single vs. Multiple Documents
• Single-document summarization: given a single document, produce
• an abstract
• an outline
• a headline
• Multiple-document summarization: given a group of documents, produce a gist of the content:
• a series of news stories on the same event
• a set of web pages about some topic or question
Query-Focused Summarization & Generic Summarization
• Generic summarization: summarize the content of a document.
• Query-focused summarization: summarize a document with respect to an information need expressed in a user query.
• A kind of complex question answering: answer a question by summarizing a document that has the information to construct the answer.
Summarization for Question Answering: Snippets
• Create snippets summarizing a web page for a query.
• Google: 156 characters (about 26 words) plus title and link.
Summarization for Question Answering: Multiple Documents
• Create answers to complex questions by summarizing multiple documents.
• Instead of giving a snippet for each document, create a cohesive answer that combines information from each document.
Extractive Summarization & Abstractive Summarization
• Extractive summarization: create the summary from phrases or sentences in the source document(s).
• Abstractive summarization: express the ideas in the source documents using (at least in part) different words.
Summarization: Three Stages
1. Content selection: choose sentences to extract from the document.
2. Information ordering: choose an order to place them in the summary.
3. Sentence realization: clean up the sentences.
Pipeline (figure on slide): Document → Sentence Segmentation (all sentences from documents) → Sentence Simplification → Sentence Extraction [content selection] → extracted sentences → Information Ordering → Sentence Realization → Summary
Basic Summarization Algorithm
1. Content selection: choose sentences to extract from the document.
2. Information ordering: just use document order.
3. Sentence realization: keep the original sentences.
(The pipeline figure is the same as on the previous slide.)
Supervised Content Selection
• Given: a labeled training set of good summaries for each document.
• Align: the sentences in the document with sentences in the summary.
• Extract features: position (first sentence?), length of sentence, word informativeness, cue phrases, cohesion.
• Train: a binary classifier (put sentence in summary? yes or no).
• Problems:
• hard to get labeled training data
• alignment is difficult
• performance is not better than unsupervised algorithms
• So in practice: unsupervised content selection is more common.
Evaluating Summaries: ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation), Lin and Hovy 2003
• An intrinsic metric for automatically evaluating summaries, based on BLEU (a metric used for machine translation).
• Not as good as human evaluation ("Did this answer the user's question?"), but much more convenient.
• Given a document D and an automatic summary X:
1. Have N humans produce a set of reference summaries of D.
2. Run the system, giving automatic summary X.
3. Measure what percentage of the bigrams from the reference summaries appear in X (this bigram version is ROUGE-2).
A ROUGE Example: Q: "What is water spinach?"
Human 1: Water spinach is a green leafy vegetable grown in the tropics.
Human 2: Water spinach is a semi-aquatic tropical plant grown as a vegetable.
Human 3: Water spinach is a commonly eaten leaf vegetable of Asia.
• System answer: Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia.
• ROUGE-2 = (3 + 3 + 6) / (10 + 10 + 9) = 12/29 ≈ 0.41
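A simplified sketch of this bigram-recall computation (it ignores the stemming, stopword, and clipping details of the official ROUGE script) that reproduces the slide's arithmetic:

```python
# Simplified ROUGE-2: bigram recall of the system summary against
# a set of human reference summaries.
def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge_2(system, references):
    sys_bigrams = set(bigrams(system.lower().split()))
    matched = total = 0
    for ref in references:
        ref_bigrams = bigrams(ref.lower().split())
        total += len(ref_bigrams)                      # 10 + 10 + 9
        matched += sum(1 for b in ref_bigrams if b in sys_bigrams)
    return matched / total                             # 12 / 29

refs = [
    "water spinach is a green leafy vegetable grown in the tropics",
    "water spinach is a semi-aquatic tropical plant grown as a vegetable",
    "water spinach is a commonly eaten leaf vegetable of asia",
]
system = ("water spinach is a leaf vegetable commonly eaten "
          "in tropical areas of asia")
print(round(rouge_2(system, refs), 2))  # 0.41, as on the slide
```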
Basic Three Tasks of a Summarizer
1. Content Selection: what information to select from the document(s) we are summarizing?
● Generally sentences or phrases (not words) are extracted.
● Content selection thus mainly consists of choosing which sentences or clauses to extract into the summary.
2. Information Ordering: how to order and structure the extracted units.
Basic Three Tasks of a Summarizer
3. Sentence Realization: what kind of clean-up to perform on the extracted units so they are fluent in their new context. For example:
● removing nonessential phrases from each sentence,
● fusing multiple sentences into a single sentence,
● fixing problems in coherence.
Content Selection
● The content selection task is treated as classification of sentences as important or unimportant (i.e., extract-worthy or not extract-worthy).
● It can be performed in two ways:
● unsupervised content selection
● supervised content selection
Content Selection
● Unsupervised content selection
● Sentences that are more informative are selected based on some criterion.
● Informativeness (importance, extract-worthiness) is generally measured in terms of word frequency, but raw frequency alone can be misleading; therefore weighting schemes like tf-idf or the log-likelihood ratio are more often used.
Content Selection
● Supervised content selection
● Since these are extracts, each sentence in the summary is, by definition, taken from the document. That means we can assign a label to every sentence in the document: 1 if it appears in the extract, 0 if it doesn't.
● To build our classifier, then, we just need to choose features to extract which are predictive of being a good sentence.
● Key features used for this purpose are: position of the sentence, length of the sentence, cue phrases in the sentence, etc.
Multi-Document Summarization
● When we apply summarization techniques to groups of documents rather than a single document, we call the goal multi-document summarization.
● Multi-document summarization is particularly appropriate for web-based applications, for example for building summaries of a particular event in the news by combining information from different news stories, or finding answers to complex questions by including components extracted from multiple documents.
Multi-Document Summarization
● Basic architecture:
1. Content Selection
2. Sentence Reordering
3. Sentence Realization
Query-Focused Multi-Document Summarization
Pipeline (figure on slide): Query + Input Docs → Sentence Segmentation → Sentence Simplification (all sentences plus simplified versions) → Sentence Extraction: LLR, MMR [content selection] → extracted sentences → Information Ordering → Sentence Realization → Summary
Multi-Document Summarization
Content Selection
● Performed in a supervised or unsupervised way.
● The main problem here is redundancy among sentences.
● We need some way to make sure that the sentences extracted from the current document don't overlap too much with the already-extracted sentences.
● The concept of a redundancy factor is used: it is based on the similarity between a candidate sentence and the sentences that have already been extracted into the summary. (A sketch of one such penalty, MMR, follows.)
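A minimal sketch of one such penalty, Maximal Marginal Relevance (MMR, named on the pipeline slide above), using word-overlap Jaccard similarity as a crude stand-in for the tf-idf cosine or embedding similarity a real system would use:

```python
# Maximal Marginal Relevance: trade off query relevance against
# redundancy with sentences already selected for the summary.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(candidates, query, k, lam=0.7):
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(sent):
            relevance = jaccard(sent, query)
            redundancy = max((jaccard(sent, s) for s in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

sents = [
    "Three bombs exploded in Nigeria killing 25 people",
    "Bombs exploded in Nigeria killing 25 and wounding 12",
    "Authorities said the bombs exploded on Sunday in Maiduguri",
]
# Picks the most query-relevant sentence first, then skips its
# near-duplicate in favor of the less redundant third sentence.
print(mmr_select(sents, query="bomb explosions in Nigeria", k=2))
```

With lambda = 0.7 the selector still favors query relevance, but the redundancy term is enough to push out sentences that mostly repeat what has already been chosen.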
Multi-Document Summarization
2. Sentence Reordering
● The order of the documents can be used.
● Chronological ordering, using dates associated with documents/sentences.
● The concept of coherence can also be used.