NLP Units IV & V
Unigram Model
A model that relies only on how often a word occurs, without looking at any previous words, is called a unigram model.
Equation: P(w1 w2 … wn) ≈ P(w1) P(w2) … P(wn), where P(wi) = C(wi)/N
E.g. P('The prime minister') ≈ P('The') P('prime') P('minister')
Bigram Model
If a model considers only the previous word to predict the current word, then it is called a bigram model.
Equation: P(wn | w1 w2 … wn-1) ≈ P(wn | wn-1)
Example: P('minister' | 'prime')
Trigram Model
If a model considers the two previous words to predict the current word, then it is called a trigram model.
Equation: P(wn | w1 w2 … wn-1) ≈ P(wn | wn-2 wn-1)
E.g. P('minister' | 'The prime')
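A minimal sketch of how unigrams, bigrams, and trigrams are read off a tokenized sentence; the example sentence is the one used later in this section.

# Extract unigrams, bigrams, and trigrams from a tokenized sentence.
sentence = "The prime minister of our country".split()

unigrams = sentence
bigrams  = list(zip(sentence, sentence[1:]))
trigrams = list(zip(sentence, sentence[1:], sentence[2:]))

print(unigrams)   # ['The', 'prime', 'minister', ...]
print(bigrams)    # [('The', 'prime'), ('prime', 'minister'), ...]
print(trigrams)   # [('The', 'prime', 'minister'), ...]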
N-Gram Model
If a model considers the previous N-1 words to predict the current word, then it is called an N-gram model.
An n-gram model for the above example would calculate the following probability:
P('The prime minister of our country') = P('The', 'prime', 'minister', 'of', 'our', 'country') =
P('The') P('prime'|'The') P('minister'|'The prime') P('of'|'The prime minister') P('our'|'The prime minister of') P('country'|'The prime minister of our')
Since it is impractical to calculate these conditional probabilities, we use the Markov assumption to approximate this with a bigram model:
P('The prime minister of our country') ≈
P('The') P('prime'|'The') P('minister'|'prime') P('of'|'minister') P('our'|'of') P('country'|'our').
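A minimal sketch of the bigram (Markov) approximation above as a running product of conditional probabilities; the probability values are made up purely for illustration, and in practice they would be estimated from a corpus as described later.

# Hypothetical probabilities, purely for illustration.
unigram_prob = {'The': 0.06}
bigram_prob = {
    ('The', 'prime'): 0.05, ('prime', 'minister'): 0.60, ('minister', 'of'): 0.40,
    ('of', 'our'): 0.10, ('our', 'country'): 0.30,
}

def sentence_probability(words):
    # P(w1 ... wn) ≈ P(w1) * product of P(wi | wi-1), per the Markov assumption.
    prob = unigram_prob.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_prob.get((prev, cur), 0.0)
    return prob

print(sentence_probability('The prime minister of our country'.split()))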
Formally, perplexity is a function of the probability that the probabilistic language model assigns to the test data. For a test set W = w1, w2, …, wN, the perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1 w2 … wN)^(-1/N)
Using the chain rule of probability, the equation can be expanded as follows:
PP(W) = [ Π(i=1..N) 1/P(wi | w1 … wi-1) ]^(1/N)
This equation can be modified to accommodate the language model that we use. For example, if we use a bigram language model, then the equation becomes:
PP(W) = [ Π(i=1..N) 1/P(wi | wi-1) ]^(1/N)
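A minimal sketch of the bigram-model perplexity formula above, reusing the same made-up probabilities as the earlier sketch; the product is computed in log space only to avoid numerical underflow on longer test sets.

import math

# Hypothetical probabilities, purely for illustration.
unigram_prob = {'The': 0.06}
bigram_prob = {
    ('The', 'prime'): 0.05, ('prime', 'minister'): 0.60, ('minister', 'of'): 0.40,
    ('of', 'our'): 0.10, ('our', 'country'): 0.30,
}

def perplexity(words):
    # PP(W) = (product of 1/P(wi | wi-1))^(1/N), computed via log probabilities.
    log_prob = math.log(unigram_prob[words[0]])
    for prev, cur in zip(words, words[1:]):
        log_prob += math.log(bigram_prob[(prev, cur)])
    return math.exp(-log_prob / len(words))

print(perplexity('The prime minister of our country'.split()))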
2.B) Explain about parameter estimation in Language Models
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is a method that determines values for the parameters of a model. It is the statistical method of estimating the parameters of a probability distribution by maximizing the likelihood function. The parameter value that maximizes the likelihood function is called the maximum likelihood estimate.
We can use maximum likelihood estimation to estimate bigram and trigram probabilities.
The bigram probability P('minister' | 'prime') is calculated by dividing the number of times the string "prime minister" appears in the given corpus by the total number of times the word "prime" appears in the same corpus.
Example: P('minister' | 'prime') = C('prime minister') / C('prime')
The trigram probability P('of' | 'prime minister') is calculated by dividing the number of times the string "prime minister of" appears in the given corpus by the total number of times the string "prime minister" appears in the same corpus.
Example: P('of' | 'prime minister') = C('prime minister of') / C('prime minister')
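A minimal sketch of maximum likelihood estimation of bigram and trigram probabilities from raw counts; the tiny corpus is invented purely for illustration.

from collections import Counter

# Invented toy corpus, purely for illustration.
corpus = "the prime minister of our country met the prime minister of france".split()

unigrams = Counter(corpus)
bigrams  = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))

def mle_bigram(w_prev, w):
    # P(w | w_prev) = C(w_prev w) / C(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def mle_trigram(w1, w2, w3):
    # P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2)
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

print(mle_bigram('prime', 'minister'))        # C('prime minister') / C('prime')
print(mle_trigram('prime', 'minister', 'of')) # C('prime minister of') / C('prime minister')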
We have used maximum likelihood estimation (MLE) for training the parameters of an N-gram model. The problem with MLE is that it assigns zero probability to unknown (unseen) words. This is because MLE relies entirely on the training corpus: if a word in the test set is not present in the training set, then the count of that word is zero, and that leads to a zero probability.
To eliminate these zero probabilities, we can apply smoothing.
Smoothing takes some probability mass from the events seen in training and assigns it to unseen events. Add-1 smoothing (also called Laplace smoothing) is a simple smoothing technique that adds 1 to the count of every n-gram in the training set before normalizing the counts into probabilities.
Example:
Recall that the unigram and bigram probabilities for a word w are calculated as follows:
P(w) = C(w)/N
P(wn|wn-1) = C(wn-1 wn)/C(wn-1)
where P(w) is the unigram probability, P(wn|wn-1) is the bigram probability, C(w) is the count of occurrences of w in the training set, C(wn-1 wn) is the count of the bigram (wn-1 wn) in the training set, and N is the total number of word tokens in the training set.
Add-1 smoothing for unigrams
PLaplace(w) = (C(w)+1)/(N+|V|)
Here, N is the total number of tokens in the training set and |V| is the size of the vocabulary, i.e. the number of unique words in the training set.
Since 1 is added to the count of every word in the numerator, the denominator must be increased by |V| so that the smoothed probabilities still sum to 1.
Add-1 smoothing for bigrams
PLaplace(wn|wn-1) = (C(wn-1 wn)+1)/(C(wn-1)+|V|)
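A minimal sketch of Add-1 (Laplace) smoothing over the same style of counts as the MLE sketch above; again the corpus is invented purely for illustration.

from collections import Counter

corpus = "the prime minister of our country met the prime minister of france".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)            # vocabulary size |V|
N = len(corpus)              # total number of tokens

def laplace_unigram(w):
    # PLaplace(w) = (C(w) + 1) / (N + |V|)
    return (unigrams[w] + 1) / (N + V)

def laplace_bigram(w_prev, w):
    # PLaplace(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + |V|)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(laplace_bigram('prime', 'minister'))   # seen bigram: high probability
print(laplace_bigram('prime', 'elephant'))   # unseen bigram: small but non-zero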
Explain about Information Retrieval (IR) Models
An Information Retrieval (IR) model is characterized by four components:
1. D − Document Representation.
2. Q − Query Representation.
3. F − A framework to match and establish a relationship between D and Q.
4. R (q, di) − A ranking function that determines the similarity between the query and the
document to display relevant information.
1. Classical IR Model — It is built on basic mathematical concepts and is the most widely used of the IR models. Classical Information Retrieval models can be implemented with ease. Examples include the Vector Space, Boolean and Probabilistic IR models. In this system, the retrieval of information depends on documents containing the defined set of query terms; there is no ranking or grading of any kind. The different classical IR models take Document Representation, Query Representation, and the Retrieval/Matching function into account in their modelling.
2. Non-Classical IR Model — They differ from classic models in that they are built upon
propositional logic. Examples of non-classical IR models include Information Logic, Situation
Theory, and Interaction models.
3. Alternative IR Model — These take the principles of the classical IR model and enhance them to create more functional models, such as the Cluster model, alternative set-theoretic models like the Fuzzy Set model, the Latent Semantic Indexing (LSI) model, and alternative algebraic models like the Generalized Vector Space Model.
1. Boolean Model — This model requires information to be translated into a Boolean expression and Boolean queries. A document is retrieved when the Boolean expression formed from the query evaluates to true for that document. It uses the Boolean operations AND, OR, and NOT to create a combination of multiple terms based on what the user asks. This is one of the most widely used information retrieval models. A small set-based sketch of Boolean retrieval is given after this list.
2. Vector Space Model — This model represents documents and queries as vectors and retrieves documents depending on how similar they are to the query. Documents are ranked by a similarity measure between the document vector and the query vector, for example the cosine of the angle between them.
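A minimal sketch of Boolean retrieval over a tiny inverted index; the documents are invented purely for illustration, and AND, OR, NOT map directly onto set intersection, union, and difference.

# Invented toy documents, purely for illustration.
docs = {
    1: "information retrieval with boolean queries",
    2: "vector space model for information retrieval",
    3: "neural machine translation",
}

# Inverted index: term -> set of document ids containing the term.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Boolean query: information AND retrieval AND NOT boolean
result = (index.get("information", set())
          & index.get("retrieval", set())
          - index.get("boolean", set()))
print(result)   # {2}

And a minimal sketch of the vector space model over the same toy documents, using raw term-frequency vectors and cosine similarity for ranking; real systems usually apply tf-idf weighting, which is omitted here for brevity.

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query_vec = Counter("information retrieval model".split())
doc_vecs = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Rank documents by similarity of their vectors to the query vector.
ranking = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
print(ranking)   # document ids ordered from most to least similar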
But RBMT requires extensive editing, and it depends substantially on the quality of its dictionaries.
NMT is a type of machine translation that relies on neural network models (inspired by the human brain) to build statistical models with the end goal of translation.
The main advantage of NMT is that it provides a single system that can be trained directly on the source and target text. As a result, it does not depend on the specialized pipelines that are common in other machine translation systems, particularly SMT.
Limitations of Machine Translation
Quality: Even the best AI translation tools are still far from matching the quality of professional translators.
Consistency: Quality varies greatly depending on the complexity of input language
and the linguistic distance between the source and target languages.
Word-for-word output: Despite improvements, algorithms still produce outputs
largely consisting of word-for-word translations.
Grammar: Although this is one of the biggest areas of improvement in recent years,
grammar remains a challenge for machine translation, especially between languages
with significantly different grammar systems.
Context: Again, AI technologies have dramatically improved contextual understanding, but the end results are still far from matching human capabilities.
Nuance: Algorithms struggle to determine and replicate the nuances of human
language.
Explain about Cross Lingual Information Retrieval (CLIR)
Cross-lingual Information Retrieval is the task of retrieving relevant information when the
document collection is written in a different language from the user query.
Translation Approaches
CLIR requires the ability to represent and match information in the same representation space even if the query and the document collection are in different languages. There are three broad approaches:
Document translation maps the document representation into the query representation space.
Query translation maps the query representation into the document representation space.
Interlingua-based approaches map both the query and the document representations into a third space.
Explain about Multilingual Information Retrieval
Multilingual Information Retrieval (MLIR) refers to the ability to process a query for
information in any language, search a collection of objects, including text, images,
sound files, etc., and return the most relevant objects, translated if necessary into the
user's language.
Explain about Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a natural language processing method that uses a statistical approach to identify the associations among the words in a document collection.
Singular Value Decomposition (SVD) is the statistical method that is used to find the latent (hidden) semantic structure of words spread across the documents.
Let
C = the collection of documents
d = the number of documents
n = the number of unique words in the whole collection
M = the n × d word-to-document (term-document) matrix
SVD decomposes the word-to-document matrix M into three matrices as follows:
M = U Σ V^T
where
U = the distribution of words across the different contexts
Σ = a diagonal matrix of the association strengths among the contexts
V^T = the distribution of contexts across the different documents
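A minimal sketch of LSA using numpy's SVD on a tiny, invented term-document count matrix, following the M = U Σ V^T factorization above; the words, documents and counts are made up for illustration.

import numpy as np

# Invented term-document count matrix, purely for illustration:
# rows are words, columns are documents (M is n x d, as above).
words = ["prime", "minister", "country", "translation", "neural"]
M = np.array([
    [2, 1, 0],
    [2, 1, 0],
    [1, 0, 0],
    [0, 0, 3],
    [0, 0, 2],
], dtype=float)

# SVD: M = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(M, full_matrices=False)

k = 2                                  # keep the top-k latent "contexts"
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

# Each row of U_k * S_k gives a word's coordinates in the latent context space;
# words that co-occur in similar documents end up close together.
word_vecs = U_k * S_k
for w, vec in zip(words, word_vecs):
    print(w, np.round(vec, 2))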
Explain about the TextRank algorithm
TextRank is a graph-based ranking model for text processing that can be used to find the most relevant sentences in a text and also to extract keywords.
Page Rank:
PageRank (PR) is an algorithm used by Google Search to rank websites in their search
engine results.
PageRank was named after Larry Page, one of the founders of Google.
PageRank is a way of measuring the importance of website pages.
The formula for calculating the PageRank of a node u is:
PR(u) = (1 - d)/N + d × Σ over v ∈ In(u) of [ PR(v) / Out(v) ]
where d is the damping factor, N is the number of nodes, In(u) is the set of nodes that link to u, and Out(v) is the number of outgoing links of node v.
1. Initialise a vector V in which every element is equal to 1 and whose size equals the number of nodes, and also define the number of iterations Iter.
2. Normalize the vector V.
3. Choose a damping factor value, e.g. 0.85.
4. Compute the PageRank of each node using the above formula.
5. Repeat step 4 for the given number of iterations Iter.
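A minimal sketch of the iteration described in the steps above, run on a small, invented link graph; the graph, damping factor, and number of iterations are illustrative choices.

import numpy as np

# Invented link graph, purely for illustration: graph[v] = nodes that v links to.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N = len(graph)
d = 0.85                     # damping factor (step 3)
Iter = 20                    # number of iterations (step 1)

# Steps 1-2: start with a vector of ones and normalize it.
V = np.ones(N) / N

# Steps 4-5: repeatedly apply PR(u) = (1 - d)/N + d * sum(PR(v)/Out(v)).
for _ in range(Iter):
    new_V = np.full(N, (1 - d) / N)
    for v, targets in graph.items():
        for u in targets:
            new_V[u] += d * V[v] / len(targets)
    V = new_V

print(np.round(V, 3))        # higher values = more important nodes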
Zipf’s Law:
Zipf's law is a relation between rank order and frequency of occurrence: it states that when observations (e.g., words) are ranked by their frequency, the frequency of a particular observation is inversely proportional to its rank: Frequency ∝ 1/Rank.
Application of zipf’s law:
Zipf's Law provides a distributional foundation for models of the language learner's exposure to segments, words and constructs, and permits evaluation of learning models.
Graph: on a log-log plot of word frequency against rank, Zipf's law appears as an approximately straight line with slope -1.
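A minimal sketch that checks Zipf's law empirically: count word frequencies, rank them, and inspect rank × frequency, which the law predicts to be roughly constant. The short text here is only a stand-in for a real corpus, where the effect is much clearer.

from collections import Counter

# Stand-in text, purely for illustration; in practice use a large corpus.
text = ("the prime minister of our country met the prime minister of france "
        "and the two ministers discussed the state of the country").split()

counts = Counter(text)
ranked = counts.most_common()          # [(word, frequency), ...] sorted by frequency

# Under Zipf's law, frequency ∝ 1/rank, so rank * frequency stays roughly constant.
for rank, (word, freq) in enumerate(ranked[:10], start=1):
    print(f"{rank:>3}  {word:<10} freq={freq}  rank*freq={rank * freq}")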