
Unit-1-Natural Language Processing Applications

The document outlines various applications of Natural Language Processing (NLP), focusing on massive management of textual information and person/machine interaction. Key areas include Machine Translation, Information Retrieval, Question Answering, and Automatic Summarization, each with specific techniques and methodologies. The document also discusses the importance of user needs and the different approaches to processing and extracting information from text.

Uploaded by

yashfinkhan977

NLP Applications

• Two main areas:
  • Massive management of textual information sources:
    • for human use
    • for automatic collection of linguistic resources
  • Person/Machine interaction

NLP Applications 1
NLP Applications

• Massive management of textual information sources
  • Machine Translation (MT)
  • Information Retrieval (IR)
  • Question Answering (Q&A)
  • Information Extraction (IE)
  • Document Classification and Clustering
Machine Translation 1

• The process of translating a text from a source language to a target language while preserving some properties
• The main property to preserve (but not the only one) is the meaning
• Textual vs. spoken MT
• Different degrees of human intervention
Machine Translation 3

• Some readings
  • General
    • Juan Alberto Alonso (2000) La Traducció automàtica, chapter 4 of Les tecnologies del llenguatge, M.A. Martí (ed.), UOC
  • SMT
    • Kevin Knight (1999)
      • http://www.isi.edu/natural-language/people/knight.html
    • Cristina España (2012) Introduction to Statistical Machine Translation
  • Software:
    • Giza++, Moses
  • Projects:
    • MOLTO, OpenMT
Machine Translation 4

• Basic approaches
• Direct MT
• Transfer-based
• Interlingua-based
• Translation Memories
• Statistical vs. symbolic approaches
Machine Translation 5

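[Figure: translation pyramid between source text and target text. From bottom to top: direct translation between the texts, transfer between lexical structures, syntactic transfer between syntactic structures, semantic transfer between semantic structures, and the interlingua at the apex.]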

Statistical Machine Translation 5

• Translation Model $P(f \mid e)$
  • source: $f = f_1 f_2 \dots f_m$
  • target: $e = e_1 e_2 \dots e_l$
  • alignment: $a = a_1 a_2 \dots a_m$
  • in general
    • $a \subseteq \{1,\dots,m\} \times \{1,\dots,l\}$
  • usually
    • $a: \{1,\dots,m\} \to \{0,\dots,l\}$
    • $a(j) \neq 0$: $f_j$ is mapped into $e_{a(j)}$
    • $a(j) = 0$: $f_j$ is not aligned
  • $A(f,e)$ is the set of possible alignments ($2^{lm}$ of them in the general case)
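The two alignment definitions above give very different search spaces. The following is a minimal Python sketch that counts both for small sentence lengths; the function names are illustrative and not taken from any SMT toolkit.

```python
from itertools import product

def function_alignments(m, l):
    """Enumerate every map a: {1..m} -> {0..l}.

    Position j-1 of each tuple holds a(j); the value 0 means f_j is unaligned.
    """
    return list(product(range(l + 1), repeat=m))

def count_general_alignments(m, l):
    """Count subsets of {1..m} x {1..l}: each of the l*m links is on or off."""
    return 2 ** (l * m)

m, l = 2, 3  # toy source and target lengths
restricted = function_alignments(m, l)
print(len(restricted))                 # (l + 1) ** m restricted alignments
print(count_general_alignments(m, l))  # 2 ** (l * m) general alignments
```

Restricting each source word to at most one target position is what makes the sum over alignments tractable in word-based models.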
Statistical Machine Translation 5

• Translation Model P(f|e)
  • One should at least model, for each word in the source language:
    • its translation,
    • the number of necessary words in the target language,
    • the position of the translation within the sentence,
    • and, besides, the number of words that need to be generated from scratch.
Statistical Machine Translation 6

• Word-based models: the IBM models
  • They characterise P(f|e) with 4 parameters: t, n, d, p1.
  • Lexical probability t
    • t(Quan|When): the prob. that Quan translates into When.
  • Fertility n
    • n(3|tornes): the prob. that tornes generates 3 words.
  • Distortion d
    • d(j|i; m, n): the prob. that the word in the j position generates a word in the i position. m and n are the lengths of the source and target sentences.
  • Probability p1
    • p(you|NULL): the prob. that the spurious word you is generated (from NULL).
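As a hedged illustration of how the lexical probability t enters P(f|e), the sketch below uses the simplest word-based model, IBM Model 1, which keeps only t and drops fertility, distortion and p1. The probability table holds made-up toy values, and `epsilon` is an assumed length-normalisation constant.

```python
from itertools import product
from math import prod

def model1_likelihood(f_words, e_words, t, epsilon=1.0):
    """IBM Model 1: P(f|e) = eps / (l+1)^m * prod_j sum_i t(f_j | e_i)."""
    targets = ["NULL"] + list(e_words)  # position 0 is the NULL word
    norm = epsilon / len(targets) ** len(f_words)
    return norm * prod(sum(t.get((f, e), 0.0) for e in targets) for f in f_words)

def brute_force_likelihood(f_words, e_words, t, epsilon=1.0):
    """Same quantity, summing explicitly over every alignment a: {1..m} -> {0..l}."""
    targets = ["NULL"] + list(e_words)
    norm = epsilon / len(targets) ** len(f_words)
    total = 0.0
    for a in product(range(len(targets)), repeat=len(f_words)):
        total += prod(t.get((f, targets[i]), 0.0) for f, i in zip(f_words, a))
    return norm * total

# Hypothetical toy table: t[(f, e)] stands for t(f|e).
t = {
    ("Quan", "When"): 0.8, ("Quan", "NULL"): 0.1,
    ("tornes", "return"): 0.7, ("tornes", "NULL"): 0.1,
}
f_sentence, e_sentence = ["Quan", "tornes"], ["When", "return"]
print(model1_likelihood(f_sentence, e_sentence, t))
print(brute_force_likelihood(f_sentence, e_sentence, t))
```

Both functions return the same value; the first exploits the fact that, with only lexical probabilities, the sum over alignments factorises per source word.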
Information Retrieval 1

• Input
  • A collection of documents
    • The Web
    • A corporate document collection
    • ...
  • A user need represented as a query
• Output
  • The documents of the collection that satisfy the user needs.
Information Retrieval 2

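[Figure (Oard, 1997): a query from the queries space Q and a document from the documents space D are mapped by the representation functions q and d into a common representation space R, where the comparison function c returns a value in {0,1}. A human judgement j over the query and the document also returns a value in {0,1}.]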


Information Retrieval 3

Ideal setting

$$c(q(\text{query}), d(\text{doc})) = j(\text{query}, \text{doc}) \qquad \forall\, \text{query} \in Q,\ \forall\, \text{doc} \in D$$
Information Retrieval 4

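[Figure: architecture of an IR system. Components: User Interface, Textual operations, Operations over the query, Indexing, DB manager, Searching, Classification, Indexes and Text DB. The query text and the document text go through the textual operations; the resulting representation feeds the operations over the query (with user feedback) and indexing; searching uses the query and the indexes to obtain the retrieved docs, which classification returns as classified docs.]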
Information Retrieval 5

IR types

• Type of information
  • Text, speech, structured information
• Query language
  • Exact, ambiguous
• Matching
  • Exact, approximate
• Kind of information needed
  • Loose, precise
• Relevance:
  • Usefulness of information according to user needs
Information Retrieval 6

Operations on texts & queries

• Preprocess
  • Lexical analysis, standardization
    • non-standard forms, dates, numbers, acronyms, abbreviations, idioms, ...
  • Lemmatization
    • Morphological analysis, stemming (Porter's stemmer)
  • Filtering
    • Stopwords
• Classification
  • Manual
  • Automatic
    • Classification vs clustering
• Compression
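The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration: the stopword list is a small hypothetical sample, and `crude_stem` is a naive suffix stripper standing in for a real stemmer such as Porter's, not an implementation of it.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}
# Suffixes tried in this order by the naive stemmer below.
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lexical analysis: lowercase and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, filter stopwords, then stem what is left."""
    return [crude_stem(tok) for tok in tokenize(text) if tok not in STOPWORDS]

print(preprocess("The indexing of documents is improved by stemming."))
# ['index', 'document', 'improv', 'by', 'stemm']
```

The output shows why stems are not always words: the same normalisation has to be applied to queries so that both sides meet on identical index terms.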
Information Retrieval 7

Indexing

• Manual vs automatic
• Indicators
  • objective: structural
  • subjective: textual (content)
• Pre-coordinate vs post-coordinate indexing
• Simple terms vs complex terms (multiwords)

Most frequent choice: bag of simple words
Information Retrieval 8

Representing documents

• Classical Models
  • Full text
  • Boolean
  • Vectorial
  • Probabilistic
• Variants of the Probabilistic Model
  • Bayesian
  • Statistical graphical models
• Other paradigms
  • Generalized vectorial model
  • Extended Boolean Model
  • Latent Semantic Indexing
  • Neural Nets
Information Retrieval 9

Simple Boolean Model

• Boolean expressions over terms occurring in the document (key words).
• Logical connectors: AND, OR, NOT, parentheses

Extensions

• Distance constraints (at paragraph or sentence level). Fixed or variable window
• Extended Boolean Model
  • Term weighting: term frequency in the document, in the collection, normalization
• Query expansion
  • Use of external knowledge sources (e.g. WN): extension with synonyms and/or hyponyms
  • Morphological generalization
  • Relevance feedback
Information Retrieval 10

IR quality measures

[Figure: Venn diagram of the retrieved and relevant sets. a: retrieved and relevant; b: retrieved but not relevant; d: relevant but not retrieved; c: neither retrieved nor relevant.]

• retrieved = a + b
• relevant = a + d
• recall = a / (a + d)
• precision = a / (a + b)
• F: weighted harmonic mean of precision and recall

$$F = \frac{(\beta^2 + 1) \cdot p \cdot r}{\beta^2 \cdot p + r}$$

When the result is not a Boolean set but a ranked list of documents with an associated relevance score, the measures can be vectors of precision at (usually) 3, 5, 7, 9 or 11 points of recall (e.g. at 0, 0.25, 0.5, 0.75, 1).
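The measures above translate directly into code. The following minimal sketch computes precision, recall and F from a retrieved set and a relevant set; the document identifiers are made up for the example.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F_beta) for two collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)  # retrieved and relevant
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * precision * recall / denominator
    return precision, recall, f

# Hypothetical result: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f1 = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d3", "d5"})
print(p, r, f1)
```

With beta = 1 precision and recall weigh the same; values above 1 give more weight to recall.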
Information Retrieval 11

Boolean Model
      t1  t2  t3  ...  ti  ...  tm
d1     0   1   0
d2     1   0   1   0
d3
...
dj
...

• attributes: all the terms (words, lemmas, multiwords, ...) occurring in the collection (except stopwords). Sometimes only the most frequent.
• rows: one per document (n documents), each a vector of Booleans (1 if the term occurs in the document, 0 otherwise).
• columns: one per term (m terms), each a vector of Booleans.
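A minimal sketch of the Boolean model follows, assuming a toy collection of three made-up documents. It builds the document-term incidence matrix and answers a query with AND and NOT through set operations; stopwords are not filtered, to keep the example short.

```python
# Hypothetical toy collection.
docs = {
    "d1": "machine translation of text",
    "d2": "information retrieval and question answering",
    "d3": "statistical machine translation models",
}

def build_incidence(collection):
    """Return (vocabulary, {doc: Boolean row}) for the collection."""
    vocab = sorted({term for text in collection.values() for term in text.split()})
    matrix = {
        name: [int(term in text.split()) for term in vocab]
        for name, text in collection.items()
    }
    return vocab, matrix

def postings(term, vocab, matrix):
    """Read one column: the set of documents that contain the term."""
    if term not in vocab:
        return set()
    column = vocab.index(term)
    return {name for name, row in matrix.items() if row[column]}

vocab, matrix = build_incidence(docs)

# Query: machine AND translation AND NOT statistical
result = (postings("machine", vocab, matrix)
          & postings("translation", vocab, matrix)) - postings("statistical", vocab, matrix)
print(result)  # {'d1'}
```

Real systems store the columns as inverted lists rather than a dense matrix, since most cells are 0.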
Information Retrieval 12

Vectorial Model
      t1  t2  t3  ...  tj  ...  tm
d1
d2
d3
...
di                     wij
...

• $w_{ij}$: weight (relevance) of term $j$ in document $i$
• Most used way of computing relevance: TF*IDF
  • $tf_{ij}$: frequency of term $t_j$ in document $d_i$
  • $df_j$: number of documents containing $t_j$
  • $idf_j = \log(N / df_j)$, where $N$ is the number of documents in the collection
  • $w_{ij} = tf_{ij} \cdot idf_j$
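The TF*IDF weighting above can be sketched as follows, assuming whitespace tokenization and a made-up three-document collection. The cosine function is one common choice of comparison function between weighted vectors, added here as an assumption rather than something the slide prescribes.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: w_ij} dict per document, with w_ij = tf_ij * log(N / df_j)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection.
collection = ["machine translation", "machine learning", "information retrieval"]
vectors = tfidf_vectors(collection)
print(vectors[0])
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

A term occurring in every document gets idf = log(1) = 0, so it contributes nothing to the comparison, which is the intended effect of the IDF factor.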
Information Retrieval 14

IR and NL

• NL Resources
• NL Processors
• Indexing
  • words, stems, lemmas, senses, multiterms
  • phrases, …
  • problems:
    • Named entities
    • Unknown words
    • Non-standard units
    • Polysemy
  • => Only slight improvement over using word forms
• Retrieval
  • Query expansion
Cross Language Information Retrieval 15

[Figure: taxonomy of CLIR approaches (Oard, 1997). CLIR splits into controlled vocabulary and free text. Free text approaches are corpus-based (parallel corpora, aligned at document, sentence or term level; comparable corpora; monolingual corpora) or knowledge-based (dictionary-based, ontology-based, thesaurus-based).]
Question Answering 1

• Natural extension of IR
• A QA system receives a query expressed in NL and tries to provide not a document containing the answer but the answer itself (usually a fact).
• QA systems need to use NLP techniques both for processing the question and for looking for the answer.
Question Answering 2

• Some QA systems that can be accessed through the Web:
  • Webclopedia
    • http://www.isi.edu/natural-language/projects/webclopedia/
  • AskJeeves
    • http://www.ask.com
  • LCC
    • http://www.languagecomputer.com/
Question Answering 3

• Started with the TREC challenges, from TREC-8 (1999)
• Later, CLEF challenges
• Related Disciplines
  • Answer Finding
    • Given a collection of questions and answers, the task consists of looking for the question(s) closest to the one formulated by the user in order to provide its answer.
    • FAQ Finder
  • NL Interfaces to databases
  • Information Integration, II
  • Information Extraction, IE
  • Answer Validation Exercise (AVE)
Question Answering 4

• Factual QA
  • Who? When? Where?
• List QA
  • Who were the last 10 presidents of the USA?
• Domain independent vs domain restricted QA
• QA with complex queries:
  • Which US presidents after World War II were Republicans?
• Linked queries
Question Answering 5

Some readings
• Horacio Rodriguez (2001)
  http://www.lsi.upc.es/~horacio/varios/qaBuenosAires.zip
• TREC conference proceedings
  http://trec.nist.gov/pubs/trec8/t8_proceedings.html
  http://trec.nist.gov/pubs/trec9/t9_proceedings.html
  http://trec.nist.gov/pubs/trec10/t10_proceedings.html

  http://www.isi.edu/natural-language/projects/webclopedia/
  http://www.languagecomputer.com/
  http://www.dlsi.ua.es/~vicedo/
Question Answering 7

Most QA systems consist of 4 processes

• Question Processing
• IR of relevant documents
• Segmentation in passages, IR of relevant passages
• Answer Extraction
Question Answering 9

Frequently performed sequentially

• Question Processing → relevant terms, question type, focus, ...
• IR of relevant documents → relevant documents
• Segmentation in passages, IR of relevant passages → relevant passages
• Answer Extraction → answer
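The four sequential processes can be illustrated with a deliberately naive Python sketch. Every component is a toy stand-in chosen for brevity (question typing by the first word, term overlap for retrieval, a year regex for extraction) and the two documents are made up; it is not how any of the systems cited in these slides work.

```python
import re

QUESTION_TYPES = {"who": "PERSON", "when": "DATE", "where": "LOCATION"}
STOPWORDS = {"was", "the", "a", "of", "in", "is"}  # hypothetical short list

def words(text):
    return re.findall(r"\w+", text.lower())

def process_question(question):
    """Step 1: return (question type, relevant terms)."""
    tokens = words(question)
    qtype = QUESTION_TYPES.get(tokens[0], "OTHER") if tokens else "OTHER"
    terms = {t for t in tokens if t not in STOPWORDS and t not in QUESTION_TYPES}
    return qtype, terms

def overlap(terms, text):
    return len(terms & set(words(text)))

def retrieve_documents(terms, docs, k=2):
    """Step 2: keep the k documents sharing most terms with the question."""
    ranked = sorted(docs, key=lambda d: overlap(terms, d), reverse=True)
    return [d for d in ranked[:k] if overlap(terms, d) > 0]

def retrieve_passages(terms, docs, k=1):
    """Step 3: split into sentences and keep the k best ones."""
    passages = [s for d in docs for s in re.split(r"(?<=[.!?])\s+", d) if s]
    return sorted(passages, key=lambda p: overlap(terms, p), reverse=True)[:k]

def extract_answer(qtype, passages):
    """Step 4: for DATE questions return the first year found."""
    if qtype == "DATE":
        for passage in passages:
            match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", passage)
            if match:
                return match.group(0)
    return passages[0] if passages else None

docs = [
    "TREC-8 was held in 1999. It introduced a question answering track.",
    "Machine translation maps a source text to a target text.",
]
qtype, terms = process_question("When was TREC-8 held?")
passages = retrieve_passages(terms, retrieve_documents(terms, docs))
answer = extract_answer(qtype, passages)
print(qtype, answer)  # DATE 1999
```

Because the steps run in sequence, an error in an early step (for example a wrong question type) cannot be repaired by the later ones.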
Automatic Summarization 1

• A summary is a reductive transformation of a source text into a summary text by extraction or generation
  • Sparck-Jones, 2001
Automatic Summarization 2

• Look for the relevant parts of a document and produce a summary of them
• Summarization vs IE
  • IE
    • What has to be extracted is defined a priori
    • “I am interested in this, look for it”
  • Summarization
    • What is relevant is not always defined a priori
Automatic Summarization 3

Some readings

• Tutorial
  • E. Hovy, D. Marcu (1998)
• Horacio Rodriguez (2001) Summarization
  http://www.lsi.upc.es/%7Ehoracio/varios/alicante2007.zip
Automatic Summarization 4

Types of summarization

• Type
• Indicative vs informative
• Extract vs Abstract
• Generic vs query based
• Background vs just-the-news
• Single-document vs multi-document
• general vs domain restricted
• textual vs multimedia
• Input
• domain, genre, form, size
Automatic Summarization 5

• Related disciplines
  • IE, IR, Q&A, Topic identification (TI), Document Classification (DC), Event (topic) detection and tracking (TDT)
• Evaluation
• Applications
  • Biographies
  • Medical reports
  • E-mails
  • Web pages
  • Word spotters
  • News
  • Headlines extraction
  • Automatic subtitle generation
  • IR enhancements
  • Meeting interventions
Automatic Summarization 5

Basic schema

[Figure: the Summarizer takes a single-document or multi-document input, plus an optional query and restrictions, and produces an extract, an abstract or a headline.]
Automatic Summarization 6
Techniques

• Lexical chains
• [Barzilay, 1997], [Fuentes, 2008]
• Coreference chains
• [Baldwin, Morton, 1998]
• [Bagga, Baldwin, 1998]
• Alignment techniques
• [Banko et al, 1999]
• Compression, reduction or simplification of sentences (cut & paste)
• [Jing, 2000]
• [Jing, McKeown, 1999]
Automatic Summarization 7
• Statistical models
• Statistical language models
• [Berger, 2001], [Berger, Mittal, 2000]
• Bayesian models
• [Kupiec et al, 1995], [Schlesinger et al, 2001]
• Hidden Markov models
• Logistic regression
• [Conroy et al, 2001]
• Machine Learning
• Decision trees
• ILP
• [Knight, Marcu, 2000], [Tzoukerman et al, 2001]
• Similarity (and distance) measures
• MMR
• [Carbonell, Goldstein, 1998]
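MMR (Maximal Marginal Relevance) picks sentences that are relevant to a query but not redundant with the ones already selected. The sketch below is a minimal greedy version; the Jaccard word-overlap similarity, the trade-off value `lam` and the example sentences are assumptions made for illustration.

```python
def jaccard(a, b):
    """Word-set overlap between two strings, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def mmr_select(candidates, query, k, lam=0.5, sim=jaccard):
    """Greedily pick k items maximising
    lam * sim(item, query) - (1 - lam) * max sim(item, already selected)."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(candidate):
            redundancy = max((sim(candidate, s) for s in selected), default=0.0)
            return lam * sim(candidate, query) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical candidate sentences; the second one nearly repeats the first.
sentences = [
    "machine translation preserves meaning",
    "machine translation preserves the meaning",
    "summaries reduce a source text",
]
summary = mmr_select(sentences, "machine translation meaning summaries", k=2)
print(summary)
```

With `lam=0.5` the near-duplicate second sentence is penalised and the third one is chosen instead; with `lam` close to 1 the selection degenerates into plain relevance ranking.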
Automatic Summarization 8

• IE
• [Kan, McKeown, 1999]
• Topic Detection
• [Hovy, Lin, 1999]
• [Hovy, 2000]
• Topic Signatures
• [Lin, Hovy, 2001]
• Document’s rhetorical structure
• [Marcu, 1997]
• Combination
• [Goldstein et al, 1999], [Kraaij et al, 2001],
• [Muresan et al, 2000], [White et al, 2001].
Multidocument Summarization (MDS) 1
Objectives

• Summary of the content of a collection
• Briefing
  • concise summary of the factual matter of a set of news articles on the same or related events (SUMMONS, Radev, 1999)
• Updating of already known information
MDS 2

SDS vs MDS

More challenging
• Compression
• Redundancy
• Temporal terms
• Coreference
MDS 3

Requirements

• Clustering of documents and passages
• Recall
• Anti-redundancy
• Summary cohesion
• Quality
  • readable
  • relevant
  • context
• Inconsistency of sources
• Updating
MDS 4

Approaches

• From the common sections of all the documents of the collection
• Common sections + unique sections
• Centroids
• Centroids + outliers
• Last document + outliers
• Common sections + unique sections + time weighting factor
MDS 6

[Figure: MULTIGEN architecture (McKeown et al., 1999). Articles 1 to n enter the Analysis Component (Feature Extraction, Feature Synthesis, Rule Induction), which produces Themes. The Generation Component (Theme Intersection, Sentence Planner, Sentence Generator) turns the themes into the Summary.]
Information Extraction 1

• Extracting useful information from free text
• MUC, ACE, TAC challenges
• Named Entity Recognition (NER)
• Named Entity Classification (NEC)
• Both tasks together (NERC)
• Slot Filling
• Relation Extraction
Information Extraction 2

NERC

Information Extraction 3

Slot Filling
• Set of relevant slots
• ML
  • Supervised Learning
  • Unsupervised Learning
  • Distant supervision
  • Semisupervised Learning
  • Active Learning
• Rule-based systems
Information Extraction 4

Relation Extraction

• Labeled vs unlabeled relations
• Binary vs n-ary relations
• Properties:
  • Symmetric, transitive, reflexive
• Constraints over source and target
  • NE types: PER, ORG, LOC, ...
Information Extraction 5

Relation Extraction

• ML
• Supervised Learning
• Unsupervised Learning
• Semisupervised Learning

Document Classification 1

• Classification vs. Clustering
• Assign each document to one or more class(es) belonging to a predefined tagset
• Examples:
  • Spam filtering
  • Language identification
  • Level of relevance, urgency, ...
  • Thematic domain
Document Classification 2

• Extensions:
  • Multiclass
    • A document can be assigned to more than one class
  • Rank
    • A document is assigned to different classes according to a probability distribution.
• Features
  • Textual content
  • Metadata
Document Classification 3

• Approaches
  • Vectorial
    • Characterize each class with a reference document (Topic Signature, Lexical Profile, ...)
    • Represent the document to classify with VSM (Vector Space Model)
    • Use a similarity measure to compare the vector associated with the document against the reference document of each of the classes.
    • Choose the best or rank them
    • e.g. k-means
  • ML
    • Naive Bayes, decision lists, decision trees, maximum entropy, SVM, boosting, ...
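The vectorial approach above can be sketched as a nearest-reference classifier. This is a minimal illustration under stated assumptions: raw term counts as the vector space model, cosine as the similarity measure, and two made-up reference documents standing in for the class profiles.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector with raw term counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two Counter vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def rank_classes(document, references):
    """Return (label, similarity) pairs, best class first."""
    doc_vec = vectorize(document)
    scores = {label: cosine(doc_vec, vectorize(ref)) for label, ref in references.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical reference documents, one per class.
references = {
    "spam": "free money win prize offer",
    "ham": "meeting project report schedule",
}
ranking = rank_classes("win a free prize", references)
print(ranking)
```

Taking only the first element of the ranking gives a single-class decision; keeping the whole list corresponds to the "rank" extension mentioned earlier.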
Document Classification 4

Precision vs. Recall of Good (non-spam) Email

• Precision = good messages kept / all messages kept
• Recall = good messages kept / all good messages

[Figure: precision-recall plot; both the Precision and the Recall axes run from 0% to 100%.]
