Unit-1-Natural Language Processing Applications
NLP Applications 1
NLP Applications
Machine Translation 1
Machine Translation 2
Machine Translation 3
• Some readings
• General
• Juan Alberto Alonso (2000) La Traducció automàtica chapter 4 of Les
tecnologies del llenguatge, M.A.Martí (ed) UOC
• SMT
• Kevin Knight (1999)
• http://www.isi.edu/natural-language/people/knight.html
• Cristina España (2012) Introduction to Statistical Machine Translation
• Software:
• Giza++, Moses
• Projects:
• MOLTO, OpenMT
Machine Translation 4
• Basic approaches
• Direct MT
• Transfer-based
• Interlingua-based
• Translation Memories
• Statistical vs symbolic approaches
Machine Translation 5
Interlingua
Machine Translation 6
Machine Translation 7
Statistical Machine Translation 1
Statistical Machine Translation 2
Statistical Machine Translation 3
Statistical Machine Translation 4
Statistical Machine Translation 5
Statistical Machine Translation 5
Statistical Machine Translation 6
Statistical Machine Translation 8
Statistical Machine Translation 9
Information Retrieval 1
• Input
• A collection of documents
• The Web
• A corporate document collection
• ...
• A user need represented as a query
• Output
• The documents in the collection that satisfy the user's need.
Information Retrieval 2
[Figure (Oard, 1997): ideal setting. A query q and a document d are mapped by representation functions 1 and 2 into a representation space R, where a comparison function c is applied. The human judgement j takes values in {0,1}.]
Information Retrieval 4
[Figure: architecture of an IR system. Labelled components: User Interface, Textual operations, Operations over the query, Searching, Classification, Indexing, DB manager, Indexes, Text DB. Labelled flows: text, query, representation, feedback, docs retrieved, docs classified.]
Information Retrieval 5
IR types
• Type of information
• Text, speech, structured information
• Query language
• Exact, ambiguous
• Matching
• Exact, approximate
• Kind of information needed
• Loose, precise
• Relevance:
• Usefulness of information according to user needs
Information Retrieval 6
• Preprocess
• Lexical analysis, standardization
• non-standard forms, dates, numbers, acronyms, abbreviations, idioms, ...
• lemmatization
• Morphological analysis, stemming (Porter’s stemmer)
• filtering
• Stopwords
• Classification
• manual
• Automatic
• Classification vs clustering
• Compression
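The first preprocessing steps listed above (lexical analysis, stopword filtering, stemming) can be sketched as follows. This is a minimal illustration only: the stopword list is a tiny hypothetical one, and `crude_stem` is a deliberately simple suffix stripper that stands in for, and is not, Porter's stemmer.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Strip a few common English suffixes. This is NOT Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the token remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, filter stopwords, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]

terms = preprocess("The documents are indexed and the queries parsed")
```

Stems such as "queri" or "pars" are not dictionary words; for indexing this is acceptable as long as queries are preprocessed in the same way as documents.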
Information Retrieval 7
Indexing
• manual vs automatic
• indicators
• objective: structural
• subjective: textual (content)
• pre-coordinate vs post-coordinate indexing
• Simple terms vs complex terms (multiwords)
Information Retrieval 8
Representing documents
• Classical Models
• Full text
• Boolean
• Vectorial
• Probabilistic
• Variants of the Probabilistic Model
• Bayesian
• Statistical Graphical Models
• Other paradigms
• Generalized vectorial model
• Extended Boolean Model
• Latent Semantic Indexing
• Neural Nets
Information Retrieval 9
IR quality measures
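[Figure: Venn diagram of the retrieved and relevant document sets, with regions a, b, c and d.]
a: retrieved and relevant; b: retrieved but not relevant; d: relevant but not retrieved; c: neither
retrieved = a + b
relevant = a + d
recall: $r = a / (a + d)$
precision: $p = a / (a + b)$
F: weighted harmonic mean of precision and recall
$F = \frac{(\beta^2 + 1) \cdot p \cdot r}{\beta^2 \cdot p + r}$
When the result is not a Boolean set but a ranked list of documents with associated relevance scores, quality can be reported as a vector of precision values at (usually) 3, 5, 7, 9 or 11 points of recall (e.g. at 0, 0.25, 0.5, 0.75 and 1).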
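As a worked example of the set-based measures, the sketch below computes precision, recall and F from a set of retrieved document ids and a set of relevant ones, using the a, b, d notation of this slide. The function name and the document ids are illustrative assumptions.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Return (precision, recall, F) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)  # retrieved and relevant
    precision = a / len(retrieved) if retrieved else 0.0  # a / (a + b)
    recall = a / len(relevant) if relevant else 0.0       # a / (a + d)
    denominator = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f

# Hypothetical run: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f = precision_recall_f({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
# p = 2/4, r = 2/3, F (beta = 1) = 4/7
```

With beta = 1 precision and recall weigh the same; beta > 1 gives more weight to recall and beta < 1 to precision.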
Information Retrieval 11
Boolean Model
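[Table: document-term matrix with one row per document (d1, ..., dj, ...) and one column per term (t1, t2, t3, ..., ti, ..., tm). Each cell is 0 or 1; for example, the row for d1 starts 0 1 0.]
Attributes: all the terms (words, lemmas, multiwords, ...) occurring in the collection (except stopwords). Sometimes only the most frequent ones.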
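A minimal sketch of the Boolean model follows, assuming whitespace tokenization and a toy three-document collection (both are assumptions for illustration). Each term is mapped to the set of documents that contain it, so Boolean queries become set operations, and a row of the 0/1 matrix can be read off for any document.

```python
def build_incidence(docs, stopwords=frozenset()):
    """Map each term to the set of ids of the documents that contain it."""
    incidence = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            if term not in stopwords:
                incidence.setdefault(term, set()).add(doc_id)
    return incidence

def binary_row(doc_id, incidence):
    """Row of the 0/1 document-term matrix, with columns in sorted term order."""
    return [1 if doc_id in incidence[term] else 0 for term in sorted(incidence)]

# Hypothetical toy collection.
docs = {
    "d1": "machine translation of text",
    "d2": "statistical machine translation",
    "d3": "information retrieval of text",
}
index = build_incidence(docs, stopwords={"of"})

# machine AND NOT statistical
machine_not_statistical = index["machine"] - index["statistical"]
# text OR retrieval
text_or_retrieval = index["text"] | index["retrieval"]
```

The result of a query is an unranked set: a document either satisfies the Boolean expression or it does not.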
Information Retrieval 12
Vectorial Model
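[Table: document-term matrix with one row per document dj and one column per term (t1, t2, t3, ..., ti, ..., tm). Cell (j, i) holds the weight wij of term ti in document dj.]
Most used way of computing relevance (the weights wij): TF*IDF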
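The sketch below illustrates one common TF*IDF variant, chosen here as an assumption since the slide does not fix one: raw term frequency multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term. Documents are then compared with the cosine of their weight vectors. The toy collection is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: dict doc_id -> list of terms. Returns doc_id -> {term: weight}."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        vectors[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy collection, already tokenized.
docs = {
    "d1": ["machine", "translation"],
    "d2": ["machine", "learning"],
    "d3": ["information", "retrieval"],
}
vectors = tfidf_vectors(docs)
```

With this variant a term that occurs in every document gets weight 0, which is the intended effect of the IDF factor: such a term does not help to discriminate between documents.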
Information Retrieval 14
IR and NL
• NL Resources
• NL Processors
• Indexing
• words, stems, lemmas, senses, multiterms
• phrases, …
• problems:
• Named entities
• Unknown words
• Non-standard units
• polysemy
• => Only slight improvement over using word forms
• Retrieval
• Query expansion
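Query expansion can be illustrated with a very small sketch: each query term is kept and followed by its related terms from a lexical resource. The `SYNONYMS` table below is a hypothetical stand-in for such a resource (for instance a thesaurus), not data from any real one.

```python
# Hypothetical synonym table standing in for an NL resource.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}

def expand_query(terms, synonyms=SYNONYMS):
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded = expand_query(["car", "rental"])
# expanded == ["car", "automobile", "vehicle", "rental"]
```

Expansion tends to raise recall, because documents using a synonym are now matched, at the risk of lowering precision when an added term has other senses (the polysemy problem listed above).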
Cross Language Information Retrieval 15
CLIR
[Figure: taxonomy of CLIR approaches. Node labels: Free text, Controlled vocabulary, Corpus-based, Knowledge-based.]
Question Answering 1
• Natural extension of IR
• A QA system receives a query expressed in NL and tries to provide not a document containing the answer but the answer itself (usually a fact).
• QA systems need to use NLP techniques for both processing the question and looking for the answer.
Question Answering 2
• Webclopedia
• http://www.isi.edu/natural-language/projects/webclopedia/
• AskJeeves
• http://www.ask.com
• LCC
• http://www.languagecomputer.com/
Question Answering 3
Question Answering 4
• Factual QA
• Who? When? Where?
• List QA
• Which are the last 10 presidents of the USA?
• Domain independent vs domain restricted QA
• QA with complex queries:
• Which are the Republican presidents of the USA after World War II?
• Linked queries
Question Answering 5
Some readings
• Horacio Rodriguez (2001)
http://www.lsi.upc.es/~horacio/varios/qaBuenosAires.zip
• TREC conference proceedings
http://trec.nist.gov/pubs/trec8/t8_proceedings.html
http://trec.nist.gov/pubs/trec9/t9_proceedings.html
http://trec.nist.gov/pubs/trec10/t10_proceedings.html
http://www.isi.edu/natural-language/projects/webclopedia/
http://www.languagecomputer.com/
http://www.dlsi.ua.es/~vicedo/
Question Answering 7
[Figure: QA pipeline, in order]
• Question processing
• IR of relevant documents
• Segmentation into passages, IR of relevant passages
• Answer extraction
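The stages of this pipeline can be sketched end to end with toy components. Everything below is an illustrative assumption rather than the method of any system cited here: the document retrieval step is taken as already done, passages are single sentences, passages are ranked by keyword overlap, and answer extraction only handles "when" questions by looking for a four-digit year.

```python
import re

# Hypothetical stopword list for questions (an assumption).
QUESTION_STOPWORDS = {"when", "who", "where", "was", "did", "is", "the", "a", "of"}

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def question_keywords(question):
    """Question processing: keep the content words of the question."""
    return {t for t in tokens(question) if t not in QUESTION_STOPWORDS}

def split_passages(document):
    """Segmentation into passages: here, one sentence per passage."""
    return [p.strip() for p in re.split(r"(?<=[.!?])\s+", document) if p.strip()]

def rank_passages(question, documents):
    """Passage retrieval: rank passages by keyword overlap with the question."""
    keywords = question_keywords(question)
    scored = []
    for document in documents:
        for passage in split_passages(document):
            overlap = len(keywords & set(tokens(passage)))
            if overlap:
                scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored]

def extract_answer(question, passages):
    """Toy answer extraction: a 'when' question expects a four-digit year."""
    if tokens(question)[:1] == ["when"]:
        for passage in passages:
            match = re.search(r"\b(?:1\d{3}|20\d{2})\b", passage)
            if match:
                return match.group(0)
    return passages[0] if passages else None

# Documents assumed to have been returned by the document retrieval step.
documents = ["The Berlin Wall was built in 1961. The Berlin Wall fell in 1989."]
question = "When was the Berlin Wall built?"
passages = rank_passages(question, documents)
answer = extract_answer(question, passages)
```

The sketch returns a short fact rather than a document, which is the difference from plain IR; a real system would replace each toy step with NLP components such as question classification and named entity recognition.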
Question Answering 9
[Figure: final stages of the QA pipeline. Segmentation into passages and IR of relevant passages yield the relevant passages; answer extraction then yields the answer.]
Automatic Summarization 1
Automatic Summarization 2
Automatic Summarization 3
Some readings
• Tutorial
• E.Hovy, D. Marcu (1998)
• Horacio Rodriguez (2001) Summarization
http://www.lsi.upc.es/%7Ehoracio/varios/alicante2007.zip
Automatic Summarization 4
Types of summarization
• Type
• Indicative vs informative
• Extract vs Abstract
• Generic vs query based
• Background vs just-the-news
• Single-document vs multi-document
• general vs domain restricted
• textual vs multimedia
• Input
• domain, genre, form, size
Automatic Summarization 5
• Related disciplines
• IE, IR, Q&A, Topic identification (TI), Document Classification
(DC), Event (topic) detection and tracking (TDT)
• Evaluation
• Applications
• Biographies
• Medical reports
• E-mails
• Web pages
• Word spotters
• News
• Headlines extraction
• Automatic subtitle generation
• IR enhancements
• Meeting interventions
Automatic Summarization 5
Basic schema
[Figure: a Summarizer box connected to the labels single-document, multi-document, query, restrictions, extract, abstract and headline.]
Automatic Summarization 6
Techniques
• Lexical chains
• [Barzilay, 1997], [Fuentes, 2008]
• Coreference chains
• [Baldwin, Morton, 1998]
• [Bagga, Baldwin, 1998]
• Alignment techniques
• [Banko et al, 1999]
• Compression, reduction or simplification of sentences
(cut & paste)
• [Jing, 2000]
• [Jing, McKeown, 1999]
Automatic Summarization 7
• Statistical models
• statistical language models
• [Berger, 2001], [Berger, Mittal, 2000]
• Bayesian models
• [Kupiec et al, 1995], [Schlesinger et al, 2001]
• hidden Markov models
• logistic regression
• [Conroy et al, 2001]
• Machine Learning
• Decision trees
• ILP
• [Knight, Marcu, 2000], [Tzoukerman et al, 2001]
• Similarity (and distance) measures
• MMR
• [Carbonell, Goldstein, 1998]
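MMR (Maximal Marginal Relevance) selects items that are relevant to the query but not redundant with what has already been selected. A minimal greedy sketch follows; the similarity values, sentence ids and function names are hypothetical, and the score used is lam * sim(item, query) - (1 - lam) * max sim(item, already selected).

```python
def mmr_select(candidates, query_sim, pair_sim, k, lam=0.7):
    """Greedily pick up to k candidates by maximal marginal relevance.

    candidates: list of item ids
    query_sim:  dict id -> similarity of the item to the query
    pair_sim:   function (id, id) -> similarity between two items
    lam:        trade-off between relevance (1.0) and diversity (0.0)
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            # Redundancy is the similarity to the closest item already chosen.
            redundancy = max((pair_sim(c, s) for s in selected), default=0.0)
            return lam * query_sim[c] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical similarities: s1 and s2 are near duplicates.
QUERY_SIM = {"s1": 0.9, "s2": 0.85, "s3": 0.5}
PAIR_SIM = {
    frozenset(("s1", "s2")): 0.95,
    frozenset(("s1", "s3")): 0.1,
    frozenset(("s2", "s3")): 0.1,
}

def pair_sim(a, b):
    return PAIR_SIM[frozenset((a, b))]

diverse = mmr_select(["s1", "s2", "s3"], QUERY_SIM, pair_sim, k=2, lam=0.5)
relevance_only = mmr_select(["s1", "s2", "s3"], QUERY_SIM, pair_sim, k=2, lam=1.0)
```

With lam = 0.5 the second pick is s3 rather than the near duplicate s2, which is the behaviour wanted in an extract: a second sentence that adds information instead of repeating the first.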
Automatic Summarization 8
• IE
• [Kan, McKeown, 1999]
• Topic Detection
• [Hovy, Lin, 1999]
• [Hovy, 2000]
• Topic Signatures
• [Lin, Hovy, 2001]
• Document’s rhetorical structure
• [Marcu, 1997]
• Combination
• [Goldstein et al, 1999], [Kraaij et al, 2001],
• [Muresan et al, 2000], [White et al, 2001].
Multidocument Summarization (MDS) 1
Objectives
MDS 2
SDS vs MDS
More challenging
• Compression
• Redundancy
• Temporal terms
• Coreference
MDS 3
Requirements
MDS 4
Approaches
MDS 6
Themes
Feature Synthesis Sentence Planner
Information Extraction 1
Information Extraction 2
NERC
Information Extraction 3
Slot Filling
• Set of relevant slots
• ML
• Supervised Learning
• Unsupervised Learning
• Distant supervision (distant learning)
• Semisupervised Learning
• Active Learning
• Rule-based systems
Information Extraction 4
Relation Extraction
Information Extraction 5
Relation Extraction
• ML
• Supervised Learning
• Unsupervised Learning
• Semisupervised Learning
Document Classification 1
Document Classification 2
• Extensions:
• Multiclass
• A document can be assigned to more than one class
• Rank
• A document is assigned to different classes according to a probability distribution.
• Features
• Textual content
• Metadata
Document Classification 3
• Approaches
• Vectorial
• Categorize each class with a reference document (Topic
Signature, Lexical Profile, ...)
• Represent the document to classify with VSM (Vector Space
Model)
• Use a similarity measure to compare the vector associated with the document against the reference document of each class.
• Choose the best or rank them
• e.g. k-means
• ML
• Naive Bayes, decision lists, decision trees, maximum entropy,
SVM, boosting, ...
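The vectorial approach above can be sketched as follows, under simplifying assumptions: the reference document (profile) of a class is the sum of the term counts of its training documents, raw counts are used instead of TF*IDF weights, and cosine is the similarity measure. The classes and training data are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def class_profiles(training):
    """training: dict label -> list of token lists. Returns label -> Counter."""
    return {
        label: sum((Counter(doc) for doc in docs), Counter())
        for label, docs in training.items()
    }

def rank_classes(doc_tokens, profiles):
    """Rank the classes by similarity between the document and each profile."""
    vector = Counter(doc_tokens)
    return sorted(profiles, key=lambda label: cosine(vector, profiles[label]), reverse=True)

# Hypothetical training data, already tokenized.
training = {
    "sports": [["match", "goal", "team"], ["team", "player", "goal"]],
    "finance": [["market", "stock", "price"], ["bank", "stock", "loan"]],
}
profiles = class_profiles(training)
ranking = rank_classes(["goal", "team", "win"], profiles)
best = ranking[0]
```

Taking `ranking[0]` gives the single best class, while the whole ranking supports the rank extension, where a document is associated with several classes in order of similarity.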
Document Classification 4
[Figure: precision (vertical axis) plotted against recall (horizontal axis), both from 0% to 100%.]
• Precision = good messages kept / all messages kept
• Recall = good messages kept / all good messages