
Question Answering, Information Retrieval, and Summarization
UNIT - 5
Dr Ratna Patil
Syllabus – Unit 5
Information Extraction: Named Entity Recognition, Information Retrieval (IR).
Question Answering Systems: IR-based Factoid Question Answering, Entity Linking, Knowledge-Based Question Answering, Classic QA Models, Evaluation of Factoid Answers.
Summarization: Summarizing Single Documents, Multi-Document Summarization.
Few Questions on Unit 5:
1. Explain the information extraction architecture with the help of a neat diagram.
2. What is Information Retrieval (IR)? Explain the architecture of an Information Retrieval system with a neat diagram.
3. Explain Named Entity Recognition (NER) and the problems encountered while recognizing named entities.
4. Differentiate between Information Extraction and Information Retrieval.
5. What is the significance of TF-IDF in Information Retrieval?
6. Compare and explain the Boolean retrieval and Vector Space models for information retrieval.
7. Define the following with respect to Information Retrieval: a) Vector Space Model b) Term Frequency c) Inverse Document Frequency d) Boolean Model
8. Explain text summarization and multiple-document text summarization with neat diagrams.
9. Explain the process of multi-document summarization.
10. Discuss text summarization in NLP in detail.
11. Compare text extraction and summarization.
12. Discuss the various steps in a typical information extraction system.
13. Explain TF-IDF in detail with the help of a suitable example. What is the significance of TF-IDF in an IR system?
14. Explain the stages of an IR-based question answering model.
15. Explain evaluation metrics for an automatic summarizer system.
Information Extraction (IE)
Information Extraction, which is an area of natural language
processing, deals with finding factual information in free text.
In formal terms, facts are structured objects, such as
database records.
Such a record may capture a real-world entity with its
attributes mentioned in text, or a real-world event, occurrence,
or state, with its arguments or actors: who did what to whom,
where and when.
Information Extraction (IE) …
The task of Information Extraction (IE) is to identify a
predefined set of concepts in a specific domain, ignoring
other irrelevant information, where a domain consists of a
corpus of texts together with a clearly specified information
need.
In other words, IE is about deriving structured factual
information from unstructured text.
For instance, consider the extraction of information on violent events from
online news, where one is interested in identifying the main actors of the
event, its location, and the number of people affected.
Information Extraction (IE) …
Example: the figure below shows a text snippet from a news article about a
terrorist attack and the structured record derived from that snippet.
"Three bombs have exploded in north-eastern Nigeria, killing 25
people and wounding 12 in an attack carried out by an Islamic sect.
Authorities said the bombs exploded on Sunday afternoon in the city of
Maiduguri."
Information Extraction (IE) …
➢ Information extraction (IE) systems:
Find and understand limited relevant parts of texts.
Gather information from many pieces of text.
Produce a structured representation of relevant information:
relations (in the database sense), a.k.a., a knowledge base.
➢ Goals:
1. Organize information so that it is useful to people.
2. Put information in a semantically precise form that allows
further inferences to be made by computer algorithms.
Information Extraction Architecture: (diagram on slide; a typical pipeline runs raw text → sentence segmentation → tokenization → part-of-speech tagging → entity detection → relation detection → structured relations)
Information Extraction (IE) …
Tasks of information extraction:
○ Named entity recognition is recognition of entity names.
○ Co-reference Resolution requires the identification of multiple
(co-referring) mentions of the same entity in the text.
○ Relation Extraction (RE) is the task of detecting and classifying
predefined relationships between entities identified in text.
○ Event Extraction (EE) refers to the task of identifying events in
free text and deriving detailed and structured information
about them, ideally identifying who did what to whom, when,
where, through what methods (instruments), and why.
Information Extraction (IE) …
➢ Relation Extraction (RE) is the task of detecting and classifying predefined relationships between entities identified in text. For example:
• EmployeeOf(Steve Jobs, Apple): a relation between a person and an organization, extracted from "Steve Jobs works for Apple".
• LocatedIn(Smith, New York): a relation between a person and a location, extracted from "Mr. Smith gave a talk at the conference in New York".
• SubsidiaryOf(TVN, ITI Holding): a relation between two companies, extracted from "Listed broadcaster TVN said its parent company, ITI Holdings, is considering various options for the potential sale."
➢ Note: although in general the set of relations that may be of interest is unlimited, the set of relations within a given task is predefined and fixed, as part of the specification of the task.
Named Entity Recognition
● The starting point for most information extraction
applications is the detection and classification of the
named entities in a text.
● By named entity, we simply mean anything that can be
referred to with a proper name.
● Common nouns (mango, table, sky, etc.) are not named entities.
● Abstract nouns (age, smile, sadness, wish, etc.) are not named entities.
● Proper nouns (Ram, VIIT College, Mumbai, etc.) are the named entities.
Named Entity Recognition
NER involves identifying and classifying entities in text into predefined categories such as names of persons, organizations, locations, dates, and more.
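As a concrete illustration, here is a minimal sketch using the spaCy library (an assumption of this example, not something prescribed by the slides; it requires the en_core_web_sm model to be downloaded) to detect and classify named entities in the bombing snippet from earlier:

```python
# Minimal NER sketch with spaCy (assumes: pip install spacy,
# python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Three bombs exploded on Sunday afternoon in the city of "
          "Maiduguri, Nigeria, killing 25 people, authorities said.")

# Each detected entity carries its surface text and a category label
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# Typical labels here: DATE ("Sunday afternoon"), GPE ("Maiduguri",
# "Nigeria"), CARDINAL ("25"); exact output depends on the model version.
```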
Named Entity Recognition
Applications of NER:
• Coreference Resolution / Entity Linking
• Knowledge Base Construction
• Web Query Understanding
• Question Answering
Information Retrieval
The process of accessing and retrieving the most appropriate information from text, based on a particular query given by the user, with the help of context-based indexing or metadata.
Google Search is the most famous example of information retrieval.
Information Retrieval – Basic Terms (figure on slide)
Information Retrieval – Types of Data (figures on slides)
Example of an IR Problem (worked example on slides)
Example of an IR Problem – Term-Document Incidence Matrix (table on slides)
IR Model 1: Boolean Retrieval Model (table on slide)
Advantages of the Boolean Retrieval Model
The advantages of the Boolean model are as follows:
• The simplest model, based on sets.
• Easy to understand and implement.
• It retrieves exact matches only.
• It gives the user a sense of control over the system.
Disadvantages of the Boolean Retrieval Model
• The model's similarity function is Boolean, so there are no partial matches. This can be annoying for users.
• Boolean operator usage has much more influence than a critical word.
• The query language is expressive, but it is complicated too.
• No ranking of retrieved documents.
(A toy sketch of Boolean retrieval follows.)
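A toy sketch of Boolean retrieval over a made-up corpus (the document contents are illustrative), answering the query brutus AND caesar AND NOT calpurnia with exact set operations:

```python
# Boolean retrieval sketch: build an inverted index over a toy corpus,
# then answer a Boolean query with set operations (exact match, no ranking).
docs = {
    1: "antony and cleopatra brutus caesar",
    2: "julius caesar brutus calpurnia",
    3: "the tempest",
}

# Inverted index: term -> set of document IDs containing that term
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Query: brutus AND caesar AND NOT calpurnia
result = (index["brutus"] & index["caesar"]) - index.get("calpurnia", set())
print(sorted(result))  # [1] -- all-or-nothing matching, no partial matches
```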
Information Retrieval – 2. Vector Space Model
Each document (and each query) is represented as a vector of term weights.
Information Retrieval – 2. Vector Space Model
Term Frequency (TF): In document d, the frequency represents the number of instances of a given word t.
Consider a document containing 100 words in which the word "cat" appears 3 times. Then:
tf(cat) = 3 / 100 = 0.03
Information Retrieval – 2. Vector Space Model
Document Frequency (DF): The document frequency is the number of separate documents in which the term appears. It depends on the entire corpus.
Information Retrieval – 2. Vector Space Model
Inverse Document Frequency (IDF): the inverse of the fraction of documents in which the term appears, computed over the entire corpus. Terms that appear in fewer documents receive a higher IDF.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Approach 1: Term Frequency
• In its simplest form, the raw frequency of a term within a document (Luhn, 1957).
• It reflects the intuition that terms occurring frequently within a document may reflect its meaning more strongly than terms that occur less frequently, and should thus have higher weights.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Approach 2:
• Gives a higher weight to words that occur in only a few documents.
• Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection, while terms that occur frequently across the entire collection aren't as helpful.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Inverse Document Frequency (IDF): the IDF term weight is one way of assigning higher weights to these more discriminative words. The fewer documents a term occurs in, the higher this weight; the lowest weight of 1 is assigned to terms that occur in all the documents.
Information Retrieval – 2. Vector Space Model
Term Weighting:
Inverse Document Frequency (IDF): because of the large number of documents in many collections, this measure is usually squashed with a log function:
idf(t) = log10(N / df(t)), where N is the total number of documents in the collection and df(t) is the number of documents containing term t.
Information Retrieval – 2. Vector Space Model
TF-IDF (term frequency-inverse document frequency): a statistical measure that evaluates how relevant a word is to a document in a collection of documents: tf-idf(t, d) = tf(t, d) × idf(t).
TF-IDF thus prefers words which are frequent in the current document but rare overall in the collection.
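A minimal from-scratch sketch of this computation on a made-up three-document corpus (the documents and resulting numbers are illustrative only):

```python
# From-scratch tf-idf sketch on a toy corpus.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dog ate my homework".split(),
]
N = len(docs)

# df[t] = number of documents containing term t
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)   # normalized term frequency
    idf = math.log10(N / df[term])    # rarer terms get a higher idf
    return tf * idf

# "the" occurs in every document, so idf = log10(3/3) = 0 -> weight 0;
# "cat" occurs in 2 of 3 documents, so it gets a small positive weight.
print(tf_idf("the", docs[0]))                # 0.0
print(round(tf_idf("cat", docs[0]), 4))      # ~0.0293
```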
Information Retrieval – Bag of Words (figure on slide)
Information Retrieval – TF-IDF Example (worked examples on slides)
Example 2 – TF-IDF
https://medium.com/nlplanet/two-minutes-nlp-learn-tf-idf-with-easy-examples-7c15957b4cb3
(Do not remove stopwords; solution worked out on slides.)
Difference between Information Retrieval (IR) and Information Extraction (IE)
1. IR: document retrieval. IE: feature retrieval.
2. IR returns a set of relevant documents. IE returns facts out of documents.
3. The goal of IR is to find documents that are relevant to the user's information need. The goal of IE is to extract pre-specified features from documents or display information.
4. In IR, the real information is buried inside documents. IE extracts information from within the documents.
5. IR produces a long listing of documents. IE aggregates over the entire set.
6. IR is used in many search engines (Google is the best IR system for the web). IE is used in database systems to enter extracted features automatically.
7. IR typically uses a bag-of-words model of the source text. IE is typically based on some form of semantic analysis of the source text.
8. IR mostly uses the theory of information, probability, and statistics. IE emerged from research into rule-based systems.
Factoid Questions
• What's the official language of Algeria? Arabic
• What is the telephone number for the University of Colorado, Boulder? (303) 492-1411
• How many pounds are there in a stone? 14
Question Answering System (QA)
● There are many situations where the user wants a
particular piece of information rather than an entire
document or document set.
● A QA system returns a particular piece of information to the user in
response to a question.
● This is called factoid QA if the information is a simple fact, particularly
if the fact has to do with a named entity like a person, organization, or
location.
Factoid Question Answering System (FQA)
● An FQA answers questions by finding, either from the Web or some other collection of documents, short text segments that are likely to contain answers to questions, reformatting them, and presenting them to the user.
Factoid Question Answering System (FQA): pipeline figure on slide.
(The following slides are adapted from Dan Jurafsky.)
IR-based Factoid QA
• QUESTION PROCESSING
• Detect question type, answer type, focus, relations
• Formulate queries to send to a search engine
• PASSAGE RETRIEVAL
• Retrieve ranked documents
• Break into suitable passages and rerank
• ANSWER PROCESSING
• Extract candidate answers
• Rank candidates
• using evidence from the text and external sources
Question Processing
Things to extract from the question
• Answer Type Detection
• Decide the named entity type (person, place) of the answer
• Query Formulation
• Choose query keywords for the IR system
• Question Type classification
• Is this a definition question, a math question, a list question?
• Focus Detection
• Find the question words that are replaced by the answer
• Relation Extraction
• Find relations between entities in the question
Answer Type Taxonomy
Xin Li, Dan Roth. 2002. Learning Question Classifiers. COLING'02.
• 6 coarse classes: ABBREVIATION, ENTITY, DESCRIPTION, HUMAN, LOCATION, NUMERIC
• 50 finer classes, e.g.:
• LOCATION: city, country, mountain, ...
• HUMAN: group, individual, title, description
• ENTITY: animal, body, color, currency, ...
Part of Li & Roth's Answer Type Taxonomy (figure on slide)
Answer Types (examples on slide)
Answer Type Detection
• Hand-written rules
• Machine learning
• Hybrids
Answer Type Detection
• Regular expression-based rules can get some cases:
• Who {is|was|are|were} PERSON
• PERSON (YEAR – YEAR)
• Other rules use the question headword (the headword of the first noun phrase after the wh-word):
• Which city in China has the largest number of foreign financial companies?
• What is the state flower of California?
(A small rule sketch follows.)
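A small sketch of what such hand-written rules can look like in code (the patterns and the fallback class here are illustrative, not a standard rule set):

```python
# Illustrative regular-expression rules for answer type detection.
import re

RULES = [
    (re.compile(r"^who\b", re.I), "PERSON"),
    (re.compile(r"^(where|which (city|country|state))\b", re.I), "LOCATION"),
    (re.compile(r"^(when|what year)\b", re.I), "DATE"),
    (re.compile(r"^how (many|much|tall|far)\b", re.I), "NUMERIC"),
]

def answer_type(question):
    for pattern, qtype in RULES:
        if pattern.search(question):
            return qtype
    return "ENTITY"  # illustrative fallback class

print(answer_type("Which city in China has the largest number of "
                  "foreign financial companies?"))            # LOCATION
print(answer_type("How many pounds are there in a stone?"))   # NUMERIC
```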
Answer Type Detection
• Most often, we treat the problem as machine learning classification:
• Define a taxonomy of question types
• Annotate training data for each question type
• Train classifiers for each question class using a rich set of features
• The features include those hand-written rules!
Features for Answer Type Detection
• Question words and phrases
• Part-of-speech tags
• Parse features (headwords)
• Named entities
• Semantically related words
Keyword Selection Algorithm
Dan Moldovan, Sanda Harabagiu, Marius Pasca, Rada Mihalcea, Richard Goodrum, Roxana Girju and Vasile Rus. 1999. Proceedings of TREC-8.
1. Select all non-stop words in quotations
2. Select all NNP words in recognized named entities
3. Select all complex nominals with their adjectival modifiers
4. Select all other complex nominals
5. Select all nouns with their adjectival modifiers
6. Select all other nouns
7. Select all verbs
8. Select all adverbs
9. Select the QFW word (skipped in all previous steps)
10. Select all other words
(A simplified code sketch follows.)
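A simplified sketch in the spirit of this algorithm, covering only quoted words, proper nouns, other nouns, and verbs; it assumes NLTK with the punkt and averaged_perceptron_tagger resources installed, and its exact output depends on the tagger:

```python
# Simplified keyword selection: quoted words first, then proper nouns,
# then other nouns, then verbs. Assumes NLTK data is installed.
import re
import nltk

def select_keywords(question):
    # Highest priority: words inside quotation marks
    quoted = [w for phrase in re.findall(r'"([^"]+)"', question)
              for w in phrase.split()]
    tagged = nltk.pos_tag(nltk.word_tokenize(question))
    proper = [w for w, t in tagged if t.startswith("NNP")]
    nouns = [w for w, t in tagged
             if t.startswith("NN") and not t.startswith("NNP")]
    verbs = [w for w, t in tagged if t.startswith("VB")]

    seen, keywords = set(), []
    for group in (quoted, proper, nouns, verbs):
        for w in group:                   # later classes only add words
            if w.lower() not in seen:     # not already chosen earlier
                seen.add(w.lower())
                keywords.append(w)
    return keywords

print(select_keywords('Who coined the term "cyberspace" '
                      'in his novel "Neuromancer"?'))
# Roughly: ['cyberspace', 'Neuromancer', 'term', 'novel', 'coined']
```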
Choosing Keywords from the Query (slide from Mihai Surdeanu)
Who coined the term "cyberspace" in his novel "Neuromancer"?
Selected keywords, annotated with their priority class: cyberspace/1, Neuromancer/1, term/4, novel/4, coined/7
Passage Retrieval
• Step 1: The IR engine retrieves documents using query terms.
• Step 2: Segment the documents into shorter units (something like paragraphs).
• Step 3: Passage ranking: use the answer type to help rerank passages.
Features for Passage Ranking
Used either in rule-based classifiers or with supervised machine learning:
• Number of named entities of the right type in the passage
• Number of query words in the passage
• Number of question N-grams also in the passage
• Proximity of query keywords to each other in the passage
• Longest sequence of question words
• Rank of the document containing the passage
(A toy scoring sketch follows.)
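A toy rule-based scoring function combining three of these features with hand-set weights (the weights are purely illustrative; a supervised ranker would learn them from data):

```python
# Toy passage scorer: more right-type entities and more query-word
# overlap raise the score; a worse document rank lowers it slightly.
def passage_score(passage_tokens, query_tokens,
                  n_right_type_entities, doc_rank):
    query = {t.lower() for t in query_tokens}
    overlap = sum(1 for t in passage_tokens if t.lower() in query)
    return (2.0 * n_right_type_entities   # hand-set, illustrative weights
            + 1.0 * overlap
            - 0.1 * doc_rank)

print(passage_score(
    "the official height of Mount Everest is 29035 feet".split(),
    "how tall is Mount Everest".split(),
    n_right_type_entities=1,   # one LENGTH entity: "29035 feet"
    doc_rank=2))               # -> 4.8
```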
Answer Extraction
• Run an answer-type named-entity tagger on the passages:
• Each answer type requires a named-entity tagger that detects it.
• If the answer type is CITY, the tagger has to tag CITY.
• Can be full NER, simple regular expressions, or a hybrid.
• Return the string with the right type:
• Who is the prime minister of India? (PERSON): "Manmohan Singh, Prime Minister of India, had told left leaders that the deal would not be renegotiated."
• How tall is Mt. Everest? (LENGTH): "The official height of Mount Everest is 29035 feet."
Ranking Candidate Answers
• But what if there are multiple candidate answers?
Q: Who was Queen Victoria's second son?
• Answer type: PERSON
• Passage: "The Marie biscuit is named after Marie Alexandrovna, the daughter of Czar Alexander II of Russia and wife of Alfred, the second son of Queen Victoria and Prince Albert."
Use machine learning: features for ranking candidate answers
• Answer type match: the candidate contains a phrase with the correct answer type.
• Pattern match: a regular expression pattern matches the candidate.
• Question keywords: the number of question keywords in the candidate.
• Keyword distance: the distance in words between the candidate and query keywords.
• Novelty factor: a word in the candidate is not in the query.
• Apposition features: the candidate is an appositive to question terms.
• Punctuation location: the candidate is immediately followed by a comma, period, quotation marks, semicolon, or exclamation mark.
• Sequences of question terms: the length of the longest sequence of question terms that occurs in the candidate answer.
Common Evaluation Metrics
1. Accuracy: does the answer match the gold-labeled answer?
2. Mean Reciprocal Rank (MRR):
• For each query, return a ranked list of M candidate answers.
• The query's score is 1/rank of the first correct answer:
• if the first answer is correct: 1
• else if the second answer is correct: 1/2
• else if the third answer is correct: 1/3, etc.
• The score is 0 if none of the M answers are correct.
• Take the mean over all N queries. (A minimal sketch follows.)
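A minimal sketch of the MRR computation, where each query is summarized by the 1-based rank of its first correct answer (or None if nothing in the returned list is correct):

```python
# Mean Reciprocal Rank over N queries.
def mean_reciprocal_rank(first_correct_ranks):
    # Each entry is the 1-based rank of the first correct answer,
    # or None if no answer in the returned list is correct.
    scores = [1.0 / r if r is not None else 0.0
              for r in first_correct_ranks]
    return sum(scores) / len(scores)

# Query 1: correct at rank 1; query 2: rank 3; query 3: never correct.
print(mean_reciprocal_rank([1, 3, None]))  # (1 + 1/3 + 0) / 3 = 0.444...
```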
Knowledge-Based Approaches
• Build a semantic representation of the query:
• times, dates, locations, entities, numeric quantities
• Map from this semantics to a query over structured data or resources:
• geospatial databases
• ontologies (Wikipedia infoboxes, DBpedia, WordNet, Yago)
• restaurant review sources and reservation services
• scientific databases
Relation Extraction
• Answers: databases of relations
• born-in("Emma Goldman", "June 27 1869")
• author-of("Cao Xue Qin", "Dream of the Red Chamber")
• drawn from Wikipedia infoboxes, DBpedia, Freebase, etc.
• Questions: extracting relations in questions
Whose granddaughter starred in E.T.?
(acted-in ?x "E.T.")
(granddaughter-of ?x ?y)
Text Summarization
• Goal: produce an abridged version of a text that contains information that is important or relevant to a user.
• Summarization applications:
• outlines or abstracts of any document, article, etc.
• summaries of email threads
• action items from a meeting
• simplifying text by compressing sentences
What to Summarize? Single vs. Multiple Documents
• Single-document summarization: given a single document, produce
• an abstract
• an outline
• a headline
• Multiple-document summarization: given a group of documents, produce a gist of the content:
• a series of news stories on the same event
• a set of web pages about some topic or question
Query-Focused Summarization & Generic Summarization
• Generic summarization: summarize the content of a document.
• Query-focused summarization: summarize a document with respect to an information need expressed in a user query.
• A kind of complex question answering: answer a question by summarizing a document that has the information to construct the answer.
Summarization for Question Answering: Snippets
• Create snippets summarizing a web page for a query.
• Google: 156 characters (about 26 words) plus title and link.
Summarization for Question Answering: Multiple Documents
• Create answers to complex questions by summarizing multiple documents.
• Instead of giving a snippet for each document, create a cohesive answer that combines information from each document.
Extractive Summarization & Abstractive Summarization
• Extractive summarization: create the summary from phrases or sentences in the source document(s).
• Abstractive summarization: express the ideas in the source documents using (at least in part) different words.
Summarization: Three Stages
1. Content selection: choose sentences to extract from the document.
2. Information ordering: choose an order to place them in the summary.
3. Sentence realization: clean up the sentences.
Pipeline (figure on slide): Document → Sentence Segmentation (all sentences from documents) → Sentence Simplification → Sentence Extraction [content selection] → extracted sentences → Information Ordering → Sentence Realization → Summary
Basic Summarization Algorithm
1. Content selection: choose sentences to extract from the document.
2. Information ordering: just use document order.
3. Sentence realization: keep the original sentences.
(The pipeline figure is the same as on the previous slide.)
Supervised Content Selection
• Given: a labeled training set of good summaries for each document.
• Align: the sentences in the document with sentences in the summary.
• Extract features: position (first sentence?), length of sentence, word informativeness, cue phrases, cohesion.
• Train: a binary classifier (put sentence in summary? yes or no).
• Problems:
• hard to get labeled training data
• alignment is difficult
• performance is not better than unsupervised algorithms
• So in practice: unsupervised content selection is more common.
Evaluating Summaries: ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation), Lin and Hovy 2003
• An intrinsic metric for automatically evaluating summaries, based on BLEU (a metric used for machine translation).
• Not as good as human evaluation ("Did this answer the user's question?"), but much more convenient.
• Given a document D and an automatic summary X:
1. Have N humans produce a set of reference summaries of D.
2. Run the system, giving automatic summary X.
3. Measure what percentage of the bigrams from the reference summaries appear in X (this bigram version is ROUGE-2).
A ROUGE Example: Q: "What is water spinach?"
Human 1: Water spinach is a green leafy vegetable grown in the tropics.
Human 2: Water spinach is a semi-aquatic tropical plant grown as a vegetable.
Human 3: Water spinach is a commonly eaten leaf vegetable of Asia.
• System answer: Water spinach is a leaf vegetable commonly eaten in tropical areas of Asia.
• ROUGE-2 = (3 + 3 + 6) / (10 + 10 + 9) = 12/29 ≈ 0.41
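A simplified sketch of this bigram-recall computation (it ignores the stemming, stopword, and clipping details of the official ROUGE script) that reproduces the slide's arithmetic:

```python
# Simplified ROUGE-2: bigram recall of the system summary against
# a set of human reference summaries.
def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge_2(system, references):
    sys_bigrams = set(bigrams(system.lower().split()))
    matched = total = 0
    for ref in references:
        ref_bigrams = bigrams(ref.lower().split())
        total += len(ref_bigrams)                      # 10 + 10 + 9
        matched += sum(1 for b in ref_bigrams if b in sys_bigrams)
    return matched / total                             # 12 / 29

refs = [
    "water spinach is a green leafy vegetable grown in the tropics",
    "water spinach is a semi-aquatic tropical plant grown as a vegetable",
    "water spinach is a commonly eaten leaf vegetable of asia",
]
system = ("water spinach is a leaf vegetable commonly eaten "
          "in tropical areas of asia")
print(round(rouge_2(system, refs), 2))  # 0.41, as on the slide
```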
Basic Three Tasks of a Summarizer
1. Content Selection: what information to select from the document(s) we are summarizing?
● Generally sentences or phrases (not words) are extracted.
● Content selection thus mainly consists of choosing which sentences or clauses to extract into the summary.
2. Information Ordering: how to order and structure the extracted units.
Basic Three Tasks of a Summarizer
3. Sentence Realization: what kind of clean-up to perform on the extracted units so they are fluent in their new context. For example:
● removing nonessential phrases from each sentence,
● fusing multiple sentences into a single sentence,
● fixing problems in coherence.
Content Selection
● The content selection task is treated as classification of sentences as important or unimportant (i.e., extract-worthy or not extract-worthy).
● It can be performed in two ways:
● unsupervised content selection
● supervised content selection
Content Selection
● Unsupervised content selection
● Sentences that are more informative are selected based on some criterion.
● Informativeness (importance, extract-worthiness) is generally measured in terms of word frequency, but raw frequency alone can be misleading; therefore weighting schemes like tf-idf or the log-likelihood ratio are more often used.
Content Selection
● Supervised content selection
● Since these are extracts, each sentence in the summary is, by definition, taken from the document. That means we can assign a label to every sentence in the document: 1 if it appears in the extract, 0 if it doesn't.
● To build our classifier, then, we just need to choose features to extract which are predictive of being a good sentence.
● Key features used for this purpose are: position of the sentence, length of the sentence, cue phrases in the sentence, etc.
Multi-Document Summarization
● When we apply summarization techniques to groups of documents rather than a single document, we call the goal multi-document summarization.
● Multi-document summarization is particularly appropriate for web-based applications, for example for building summaries of a particular event in the news by combining information from different news stories, or finding answers to complex questions by including components extracted from multiple documents.
Multi-Document Summarization
● Basic architecture:
1. Content Selection
2. Sentence Reordering
3. Sentence Realization
Query-Focused Multi-Document Summarization
Pipeline (figure on slide): Query + Input Docs → Sentence Segmentation → Sentence Simplification (all sentences plus simplified versions) → Sentence Extraction: LLR, MMR [content selection] → extracted sentences → Information Ordering → Sentence Realization → Summary
Multi-Document Summarization
Content Selection
● Performed in a supervised or unsupervised way.
● The main problem here is redundancy among sentences.
● We need some way to make sure that the sentences extracted from the current document don't overlap too much with the already-extracted sentences.
● The concept of a redundancy factor is used: it is based on the similarity between a candidate sentence and the sentences that have already been extracted into the summary. (A sketch of one such penalty, MMR, follows.)
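A minimal sketch of one such penalty, Maximal Marginal Relevance (MMR, named on the pipeline slide above), using word-overlap Jaccard similarity as a crude stand-in for the tf-idf cosine or embedding similarity a real system would use:

```python
# Maximal Marginal Relevance: trade off query relevance against
# redundancy with sentences already selected for the summary.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_select(candidates, query, k, lam=0.7):
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr(sent):
            relevance = jaccard(sent, query)
            redundancy = max((jaccard(sent, s) for s in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected

sents = [
    "Three bombs exploded in Nigeria killing 25 people",
    "Bombs exploded in Nigeria killing 25 and wounding 12",
    "Authorities said the bombs exploded on Sunday in Maiduguri",
]
# Picks the most query-relevant sentence first, then skips its
# near-duplicate in favor of the less redundant third sentence.
print(mmr_select(sents, query="bomb explosions in Nigeria", k=2))
```

With lambda = 0.7 the selector still favors query relevance, but the redundancy term is enough to push out sentences that mostly repeat what has already been chosen.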
Multi-Document Summarization
2. Sentence Reordering
● The order of the documents can be used.
● Chronological ordering, using dates associated with documents/sentences.
● The concept of coherence can also be used.