Chapter 8 - Applications of NLP
Tessfu Geteye (PhD)
2020/2021 - Semester II
Information Retrieval
  The Retrieval Process
  Classic IR Models
  IR Performance Evaluation
  NLP in IR
Information Extraction
Machine Translation
Question-Answering and Dialogue Systems
Text Summarization

The Retrieval Process
[Figure: The retrieval process - a user need is expressed through the user interface, the text database is indexed, searching the index retrieves documents, and ranking returns ranked documents to the user.]
Depending on how index terms are treated, there are three classic IR models: Boolean,
Vector and Probabilistic models.
In the Vector model, non-binary weights are assigned to index terms in queries and in documents.
These term weights are used to compute the degree of similarity between each document stored in the system and the user query.
Retrieved documents can then be sorted in decreasing order of similarity to get a ranked list of documents.
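As a minimal sketch of the vector model (using scikit-learn purely for illustration; the chapter does not prescribe a library), documents and the query can be represented as TF-IDF term-weight vectors and ranked by cosine similarity:

    # Vector model sketch: TF-IDF term weights + cosine similarity ranking.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "information retrieval finds relevant documents",
        "machine translation converts text between languages",
        "retrieval models rank documents by similarity to a query",
    ]
    query = ["rank relevant documents for a query"]

    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)   # term weights per document
    query_vector = vectorizer.transform(query)     # same weighting for the query

    # Degree of similarity between each stored document and the user query.
    scores = cosine_similarity(query_vector, doc_vectors)[0]

    # Sort in decreasing order of similarity to get the ranked list.
    ranking = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
    for doc_id, score in ranking:
        print(f"doc {doc_id}: {score:.3f}")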
The Probabilistic model captures the IR problem with the assumption that for a given user query there is a set of documents which contains exactly the relevant documents and no others.
This set of documents is the ideal answer set.
Given the description of this ideal answer set, there would be no problem in
retrieving its documents.
The querying process can then be thought of as a process of specifying the properties
of an ideal answer set.
Initially, the properties of this set are guessed (we can start by using index terms).
User feedback is then taken to improve the estimate of the probability that the user will find document d relevant to query q.
Measure of document similarity to the query:
sim(d, q) = P(d relevant to q) / P(d non-relevant to q)
The main advantage is that documents are ranked in decreasing order of their
probability of being relevant.
The journey of an IR process begins with a user query sent to the IR system which
encodes the query, compares the query with the available resources, and returns the
most relevant pieces of information. Thus, the system is equipped with the ability to
store, retrieve and maintain information.
In the early era of IR, the whole process was completed using handcrafted features and
ad-hoc relevance measures.
Later, principled frameworks for relevance measure were developed with statistical
learning as a basis.
Recently, deep learning has proven essential to the introduction of more opportunities
to IR. This is because data-driven features combined with data-driven relevance
measures can effectively eliminate the human bias in either feature or relevance
measure design.
Deep learning is used in IR for all of its components:
Document Ranking
Document Indexing
Query Processing
Document Searching
The capability of neural ranking models to extract features directly from raw
text inputs overcomes many limitations of traditional IR models that rely on
handcrafted features.
The performance of IR systems can be evaluated by using two commonly used metrics: precision and recall.
Precision is the fraction of the retrieved documents which are relevant.
Precision = |relevant ∩ retrieved| / |retrieved|
Recall is the fraction of the relevant documents which have been retrieved.
Recall = |relevant ∩ retrieved| / |relevant|
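A small illustration of both metrics; the document IDs below are made up:

    # IR evaluation sketch: precision and recall over sets of document IDs.
    relevant = {1, 3, 5, 7}      # documents judged relevant
    retrieved = {1, 2, 3, 4}     # documents returned by the system

    hits = relevant & retrieved  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved)  # fraction of retrieved that are relevant
    recall = len(hits) / len(relevant)      # fraction of relevant that were retrieved

    print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.50, 0.50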
Stop word removal: filters out words with very low discrimination value for retrieval purposes.
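A toy sketch of stop word removal; the stop word list here is a tiny hand-picked sample rather than a standard list:

    # Stop word removal sketch: drop words with low discrimination value.
    STOP_WORDS = {"a", "an", "the", "is", "in", "and", "of", "for"}

    def remove_stop_words(text):
        # Keep only words with discriminating value for retrieval.
        return [w for w in text.lower().split() if w not in STOP_WORDS]

    print(remove_stop_words("The price of the house is more than 500 Birr"))
    # ['price', 'house', 'more', 'than', '500', 'birr']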
Information Extraction

Example:
Text: Firm XYZ is a full service advertising agency specializing in direct and interactive marketing. Located in Bole, Addis Ababa, Firm XYZ is looking for an Assistant Account Manager to help manage and coordinate interactive marketing initiatives. Experience in online marketing and/or the advertising field is a plus. Depending on the experiences of the applicants, the company pays an attractive salary of Birr 3,000 - Birr 5,000 per month.

Extracted Information:
INDUSTRY: Advertising
POSITION: Assistant Account Manager
LOCATION: Bole, Addis Ababa
COMPANY: Firm XYZ
SALARY: Birr 3,000 - Birr 5,000 per month
Components of Information Extraction
Named Entity Recognition
Relation Detection and Classification
Temporal and Event Processing
Template Filling
Named Entity Recognition is the process of recognizing entity names such as:
Quantities: three quintals of teff, 3000 Birr, ሶስት ኩንታል ጤፍ (three quintals of teff), 3ሺ ብር (3 thousand Birr), etc.
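As a hedged example, one common off-the-shelf way to run NER is spaCy (an assumed toolkit; the chapter does not name one). It requires the small English model, installable with: python -m spacy download en_core_web_sm

    # NER sketch with spaCy: recognize entity names and their types.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Firm XYZ pays Birr 3,000 per month in Addis Ababa.")

    for ent in doc.ents:
        # Prints entity text with labels such as ORG, MONEY, GPE,
        # depending on what the model detects.
        print(ent.text, ent.label_)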
Temporal and Event Processing recognizes and normalizes temporal expressions and
analyzes events.
Template Filling is the final task of information extraction systems where structured
data is to be filled in the template slots.
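A toy template-filling sketch over the job advertisement example above; the hand-written regular expressions stand in for the NER and relation components a real system would use, and the text is abridged for illustration:

    # Template filling sketch: fill slots of a structured template from text.
    import re

    text = ("Firm XYZ is a full service advertising agency specializing in "
            "direct and interactive marketing. Located in Bole, Addis Ababa, "
            "Firm XYZ is looking for an Assistant Account Manager. The company "
            "pays an attractive salary of Birr 3,000 - Birr 5,000 per month.")

    # One hand-written pattern per template slot; patterns are illustrative only.
    patterns = {
        "POSITION": r"looking for an? ([A-Z][A-Za-z ]+?)\.",
        "LOCATION": r"Located in ([A-Za-z ,]+?),\s*Firm",
        "SALARY":   r"salary of (Birr [\d,]+ - Birr [\d,]+ per month)",
    }

    template = {}
    for slot, pattern in patterns.items():
        match = re.search(pattern, text)
        template[slot] = match.group(1) if match else None

    print(template)
    # {'POSITION': 'Assistant Account Manager',
    #  'LOCATION': 'Bole, Addis Ababa',
    #  'SALARY': 'Birr 3,000 - Birr 5,000 per month'}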
Machine Translation (MT) refers to a translation of texts from one natural language to
another by means of a computerized system.
Direct (dictionary-based) translation uses a large bilingual dictionary and translates the
source language text word-by-word.
The process of direct translation involves morphological analysis, lexical transfer, local reordering, and morphological generation (a toy sketch follows the pros and cons below).
Pros:
Fast
Simple
Inexpensive
Cons:
Unreliable
Not powerful
Rule proliferation
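A toy sketch of direct translation on the chapter's own "Abebe ate besso" example; the dictionary entries and the naive reordering rule are illustrative assumptions:

    # Direct (dictionary-based) translation sketch: word-by-word lexical
    # transfer with a tiny bilingual dictionary, plus naive local reordering.
    BILINGUAL_DICT = {
        "abebe": "አበበ",
        "ate": "በላ",
        "besso": "በሶ",
    }

    def direct_translate(sentence):
        words = sentence.lower().split()
        # Lexical transfer: look each word up; fall back to the source word.
        translated = [BILINGUAL_DICT.get(w, w) for w in words]
        # Naive local reordering for the SOV target order (verb to the end).
        if len(translated) == 3:
            subject, verb, obj = translated
            translated = [subject, obj, verb]
        return " ".join(translated)

    print(direct_translate("Abebe ate besso"))  # አበበ በሶ በላ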
The Transfer Approach
[Figure: Syntactic transfer - the source-language parse tree of "Abebe broke the window" is mapped by transfer rules onto a target-language syntactic structure.]
Pros:
Offers the ability to deal with more complex source language phenomena than the direct approach.
Interlingua
[Figure: Interlingual translation - analysis maps the source text "Abebe broke the window" into a language-independent representation (EVENT: breaking, TENSE: past, AGENT: Abebe, PATIENT: window, DEFINITENESS: definite), from which generation produces the target text "አበበ መስኮቱን ሰበረው" ("Abebe broke the window").]
Interlingual translation is suitable for multilingual machine translation; its main drawback is that the definition of an interlingua is difficult, and may even be impossible, for a wider domain.
Statistical Machine Translation (SMT) finds the most probable target sentence given a
source text sentence.
Parameters of probabilistic models are derived from the analysis of bilingual text
corpora.
Language Model tries to ensure that words come in the right order.
Some notion of grammaticality
Given an English string e, the language model assigns a probability p(e): a good English string gets a high p(e), and a bad English string gets a low p(e).
Calculated with:
A statistical grammar such as a probabilistic context free grammar; or
An n-gram language model.
Trigram probabilities are estimated from counts, e.g.
p(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2) = 12/24 = 0.500
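A minimal sketch of estimating trigram probabilities from corpus counts; the tiny corpus is made up:

    # Trigram language model sketch: p(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2).
    from collections import Counter

    corpus = "abebe went to school and abebe went to market".split()

    trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p(w3, w1, w2):
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

    print(p("to", "abebe", "went"))   # count(abebe went to) / count(abebe went) = 2/2 = 1.0
    print(p("school", "went", "to"))  # count(went to school) / count(went to) = 1/2 = 0.5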
The job of the translation model is to assign the probability that a given source language sentence generates a target language sentence.
We can model the translation from a source language sentence S to a target language sentence T̂ as:
best-translation T̂ = argmax_T faithfulness(T, S) * fluency(T)
Suppose that we want to build a foreign-to-English machine
translation system.
Thus, in a probabilistic model, the best English sentence ê is the one whose probability p(e|f) is the highest:
ê = argmax_e p(e|f)
By Bayes' rule:
p(e|f) = p(f|e) * p(e) / p(f)
argmax_e p(e|f) = argmax_e p(f|e) * p(e) / p(f)
ê = argmax_e p(f|e) * p(e)   [p(f) is constant for a given f]
Noisy channel equation: ê = argmax_e p(f|e) * p(e), where p(f|e) is the translation model and p(e) is the language model.
A direct count-based estimate of the translation model would be:
p(f|e) = count(f, e) / count(e)
Impossible because sentences are novel, so we would never have enough
data to find values for all sentences.
For example:
p(አበበ ወደ ትምህርት ቤት ሄደ እና ድንጋዩን ወረወረው | Abebe went to school and threw the stone) = ?
[Figure: Word-level alignment between the source text f "አበበ በሶ በላ ።" and the target text e "Abebe ate besso".]
A decoder searches for the best sequence of transformations that translates a source
sentence.
Look up all translations of every source word or phrase, using a word or phrase translation table.
Recombine the target language phrases in a way that maximizes the translation model probability times the language model probability.
This search over all possible combinations can get very large so we need to find
ways of limiting the search space.
Decoding is, therefore, a searching problem that can be reformulated as a classic
Artificial Intelligence problem, i.e. searching for the shortest path in an implicit graph.
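A toy illustration of the quantity the decoder maximizes, translation model probability times language model probability; the three candidate strings and all probabilities are made up, and a real search space is an implicit graph rather than a hand-written list:

    # Noisy channel decoding sketch: pick e maximizing p(f|e) * p(e).
    translation_model = {            # p(f | e) for candidate English strings e
        "Abebe ate besso": 0.6,
        "Abebe besso ate": 0.6,
        "ate Abebe besso": 0.5,
    }
    language_model = {               # p(e): fluent word order scores high
        "Abebe ate besso": 0.4,
        "Abebe besso ate": 0.05,
        "ate Abebe besso": 0.01,
    }

    # The decoder searches for the candidate that maximizes the product.
    best = max(translation_model,
               key=lambda e: translation_model[e] * language_model[e])
    print(best)  # Abebe ate besso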
Pros and Cons
Pros:
Has a way of dealing with lexical ambiguity
Can deal with idioms that occur in the training data
Can be built for any language pair that has enough training data (language
independent)
No need of language experts (requires minimal human effort)
Cons:
Does not explicitly deal with syntax
Choosing SMT
Economic reasons:
Low cost
Rapid prototyping
Practical reasons:
Many language pairs don't have NLP resources, but do have parallel corpora
Quality reasons:
Uses chunks of human translated text as its building blocks
Produces state of the art results (when very large data sets are available)
Decoder
For example, Pharaoh: a phrase-based decoder that builds phrase tables from GIZA++ word alignments and produces the best translation for new input using the phrase table plus an SRILM language model.
Example-Based Machine Translation
Fundamental idea:
People do not translate by doing deep linguistic analysis of a sentence.
They translate by decomposing sentence into fragments, translating each of
those, and then composing those properly.
Uses the principle of analogy in translation
Example:
Given the following translation pair:
የቤቱ ዋጋ ከ500 ብር በላይ ነው → The price of the house is more than 500 Birr
Challenges
Locating similar sentences
Aligning sub-sentential fragments
Combining multiple fragments of example translations into a single sentence
Determining when it is appropriate to substitute one fragment for another
Selecting the best translation out of many candidates
Cons:
May have limited coverage depending on the size of the example database, and
flexibility of matching heuristics
Machine translation systems discussed so far have their own pros and cons.
Hybrid systems take the synergy effect of rule-based, statistical and example-based
machine translations.
Rules can be post-processed by statistics and/or examples
Statistics guided by rules and/or examples
Example:
[Figure: Hybrid machine translation - segment 1 of the input is translated by the rule-based engine and segment 2 by the example-based engine; a selection module chooses the final output.]
The application of deep learning approaches to machine translation is called Neural Machine Translation (NMT).
NMT is not a drastic step beyond what has been traditionally done in statistical
machine translation.
There is no separate language model, translation model, and reordering model, but
just a single sequence model that predicts one word at a time. However, this sequence
prediction is conditioned on the entire source sentence and the entire already produced
target sequence.
NMT has a simple architecture and the ability to capture long dependencies within a sentence, which indicates a huge potential for it to become the mainstream approach.
NMT needs less linguistic knowledge than other approaches but can produce a
competitive performance.
Motivation of NMT: The inspiration for neural machine translation comes from two
aspects: the success of Deep Learning in other NLP tasks as we mentioned, and the
unresolved problems in the development of MT itself.
End-to-End means the model processes source data to target data directly, without an interpretable intermediate result.
The common deep learning algorithms used for encoding and decoding in NMT are recurrent neural networks (conventional RNN, LSTM, GRU, BRNN, BLSTM, BGRU), these models augmented with an attention mechanism, and fully attention-based models.
All the above models are used to model the word sequence of the source and target
natural languages.
The encoder is the first model used by the neural network to encode a source sentence for a second model, known as the decoder.
Recurrent neural networks face difficulties in encoding long inputs into a single vector.
This can be compensated by an attention mechanism which allows the decoder to focus
on different parts of the input while generating each word of the output.
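A bare-bones sketch of the encoder-decoder idea in PyTorch (an assumed framework; attention is omitted for brevity, and the vocabulary and layer sizes are toy values): a GRU encoder compresses the source sentence into a hidden state, and a GRU decoder predicts target words one at a time conditioned on it:

    # NMT encoder-decoder sketch: GRU encoder + GRU decoder, no attention.
    import torch
    import torch.nn as nn

    SRC_VOCAB, TGT_VOCAB, EMB, HID = 100, 120, 32, 64   # toy sizes

    class Encoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(SRC_VOCAB, EMB)
            self.rnn = nn.GRU(EMB, HID, batch_first=True)

        def forward(self, src):                  # src: (batch, src_len)
            _, hidden = self.rnn(self.embed(src))
            return hidden                        # (1, batch, HID): the source encoding

    class Decoder(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(TGT_VOCAB, EMB)
            self.rnn = nn.GRU(EMB, HID, batch_first=True)
            self.out = nn.Linear(HID, TGT_VOCAB)

        def forward(self, tgt, hidden):          # tgt: (batch, tgt_len)
            output, hidden = self.rnn(self.embed(tgt), hidden)
            return self.out(output), hidden      # logits over target words

    src = torch.randint(0, SRC_VOCAB, (1, 5))    # a dummy 5-word source sentence
    tgt = torch.randint(0, TGT_VOCAB, (1, 4))    # a dummy 4-word target prefix
    hidden = Encoder()(src)                      # encode the whole source sentence
    logits, _ = Decoder()(tgt, hidden)           # predict next words, one at a time
    print(logits.shape)                          # torch.Size([1, 4, 120])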
QA Systems deal with a wide range of question types such as fact, list, “wh”-questions,
definition, hypothetical, and semantically-constrained questions.
Search engines do not speak natural language.
Human beings need to speak the language of search engines.
QA Systems attempt to let human beings ask their questions in the normal way
using natural languages.
QA Systems are important NLP applications especially for inexperienced users.
QA Systems are closer to human beings than search engines are.
QA Systems are viewed as natural language search engines.
QA Systems are considered as next step to current search engines.
Question answering can be approached from one of two existing NLP research areas:
Information Retrieval: QA can be viewed as short passage retrieval.
Information Extraction: QA can be viewed as open-domain information
extraction.
The performance of QA Systems is heavily dependent on good search corpus.
Major Components of QA Systems
Question Analysis: The natural language question input by the user needs to be
analyzed into whatever form or forms are needed by subsequent parts of the system.
The user could be asked to clarify his or her question before proceeding.
Candidate Document Selection: A subset of documents from the total document
collection (typically several orders of magnitude smaller) is selected, comprising those
documents deemed most likely to contain an answer to the question.
This collection may need to be processed before querying, in order to transform
it into a form which is appropriate for real-time question answering.
Candidate Document Analysis: If the preprocessing stage has only superficially
analyzed the documents in the document collection, then additional detailed analysis of
the candidates selected at the preceding stage may be carried out.
Answer Extraction: Using the appropriate representation of the question and of each
candidate document, candidate answers are extracted from the documents and ranked
in terms of probable correctness.
Response Generation: A response is returned to the user.
This may be affected by the clarification request, and may in turn lead to the
response being updated.
[Figure: QA system architecture - the user submits a question and receives a response, possibly after a clarification request; answers are drawn from a document collection.]
GRU and LSTM units allow recurrent neural networks to handle the longer texts
required for QA.
Further improvements – such as attention mechanisms and memory networks –
allow the network to focus on the most relevant facts. Such networks provide the
current state-of-the-art performance for deep-learning-based QA.
Common models developed for question answering using deep learning are:
Sequence-to-sequence models
Dynamic memory networks
End-to-end memory networks
Dialogue Systems differ in the degree to which the human or the computer takes the initiative.
Computer-Initiative: the computer maintains tight control and the human is highly restricted (e.g., dialogue boxes).
Human-Initiative: the human maintains tight control and the computer is highly restricted (e.g., ELIZA).
Mixed-Initiative: human and computer have the flexibility to specify constraints; mainly research prototypes.
[Figure: Dialogue system architecture - an I/O server connects the user to the dialogue manager, which consults a knowledge base.]
In the process of Natural Language Understanding, there are many ways to represent
the meaning of sentences.
For dialogue systems, the most common is the "frame and slot semantics" representation, illustrated below.
SHOW:
  FLIGHTS:
    ORIGIN:
      CITY: Addis Ababa
      DATE: Tuesday
      TIME: morning
    DESTINATION:
      CITY: London
      DATE:
      TIME:
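A sketch of this frame as a plain Python structure with a trivial slot-filling update; the follow-up utterance and the filled values are invented for illustration:

    # Frame-and-slot sketch: the flight frame above as nested dictionaries.
    frame = {
        "SHOW": {
            "FLIGHTS": {
                "ORIGIN": {"CITY": "Addis Ababa", "DATE": "Tuesday", "TIME": "morning"},
                "DESTINATION": {"CITY": "London", "DATE": None, "TIME": None},
            }
        }
    }

    # Slot filling: a later user utterance ("arriving Wednesday evening")
    # fills the still-empty destination slots.
    dest = frame["SHOW"]["FLIGHTS"]["DESTINATION"]
    dest["DATE"], dest["TIME"] = "Wednesday", "evening"

    # The dialogue manager can check which slots remain unfilled.
    unfilled = [slot for slot, value in dest.items() if value is None]
    print(unfilled)  # []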
Domain Identification
User Intent Detection
Slot Filling
Dialogue Manager
Deep learning algorithms are applied to all these components of the dialogue system and achieve state-of-the-art results.
Given the original text T and summary S, two measures are commonly used to
evaluate Text Summarization systems:
Compression Ratio: CR = Length(S) / Length(T)
Retention Ratio: RR = Information(S) / Information(T)
Measuring length:
Number of letters
Number of words
Number of sentences
Measuring information:
Shannon Game: quantify information content.
Question Game: test the reader's understanding.
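A small sketch of the compression ratio under the three length measures listed above; the texts are invented, and the information-based retention ratio is not computed here since the Shannon and Question games require human subjects:

    # Compression ratio sketch: CR = Length(S) / Length(T) for several measures.
    def lengths(text):
        return {
            "letters": len(text.replace(" ", "")),
            "words": len(text.split()),
            "sentences": max(1, text.count(".")),
        }

    original = "Abebe broke the window. The window was old. Abebe ran away."
    summary = "Abebe broke the old window and ran away."

    for measure in ("letters", "words", "sentences"):
        cr = lengths(summary)[measure] / lengths(original)[measure]
        print(f"CR by {measure}: {cr:.2f}")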
Variants of Recurrent Neural Networks (RNNs), i.e. the Gated Recurrent Unit (GRU) or Long Short-Term Memory (LSTM), are preferred as the encoder and decoder components. This is because they are capable of capturing long-term dependencies by overcoming the vanishing gradient problem.
Word embeddings are a type of word representation that allows words with similar
meaning to have a similar representation.
The attention mechanism is used to single out the individual parts of the input which are more important at a particular time.
It can be implemented by taking the inputs from each time step and assigning a weight to each time step.
The weight depends on the contextual importance of that particular time step.
It helps pay attention to the most relevant parts of the input data sequence so
that the decoder can optimally generate the next word in the output sequence.
BLEU - measures precision - how much the words (and/or n-grams) in the machine
generated summaries appeared in the human reference summaries.
ROUGE - measures recall - how much the words (and/or n-grams) in the human
reference summaries appeared in the machine generated summaries.
F1-score - F1 = 2 * (BLEU * ROUGE) / (BLEU + ROUGE)
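A toy unigram-level sketch of the three scores; real BLEU and ROUGE implementations add higher-order n-grams, clipping, and brevity penalties, all omitted here, and the two summaries are invented:

    # Summary evaluation sketch: unigram precision (BLEU-like),
    # unigram recall (ROUGE-like), and their harmonic mean (F1).
    def unigram_overlap(candidate, reference):
        cand, ref = candidate.split(), reference.split()
        overlap = sum(min(cand.count(w), ref.count(w)) for w in set(cand))
        return overlap, len(cand), len(ref)

    machine = "abebe broke the old window"            # machine-generated summary
    human = "abebe broke the window and ran away"     # human reference summary

    overlap, cand_len, ref_len = unigram_overlap(machine, human)
    bleu = overlap / cand_len    # precision: machine words found in the reference
    rouge = overlap / ref_len    # recall: reference words found in the machine output
    f1 = 2 * bleu * rouge / (bleu + rouge)
    print(f"BLEU~{bleu:.2f} ROUGE~{rouge:.2f} F1~{f1:.2f}")  # 0.80 0.57 0.67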