
Unit – II

Word Level Analysis


Unsmoothed N-Grams:

N-grams are contiguous sequences of n items extracted from a given sample of text or speech. An N-gram model is one type of Language Model (LM): a model that defines a probability distribution over word sequences. Depending on the application, the items can be letters, words, or base pairs. N-grams are typically collected from a text or speech corpus (usually a large body of text).

N-grams are classified into different types depending on the value of n (the number of consecutive words considered when constructing the n-grams). When n = 1, it is called a unigram; when n = 2, a bigram; when n = 3, a trigram; when n = 4, a 4-gram; and so on. Different orders of n-grams are suitable for different applications in NLP.

Example:

An N-gram model is built by counting how often word sequences occur in a corpus and then estimating probabilities from those counts.

Given a sequence of N-1 words, an N-gram model predicts the most probable word that might
follow this sequence. It's a probabilistic model that's trained on a corpus of text. Such a model is
useful in many NLP applications including speech recognition, machine translation and
predictive text input.

Consider two sentences: "There was heavy rain" vs. "There was heavy flood". From experience,
we know that the former sentence sounds better. An N-gram model will tell us that "heavy rain"
occurs much more often than "heavy flood" in the training corpus. Thus, the first sentence is
more probable and will be selected by the model.

A model that relies only on how often a word occurs, without looking at previous words, is called a unigram model. If a model considers only the previous word to predict the current word, it is called a bigram model. If the two previous words are considered, it is a trigram model.
An n-gram model for the above example would calculate the following probability:

P('There was heavy rain') = P('There', 'was', 'heavy', 'rain')
= P('There') P('was'|'There') P('heavy'|'There was') P('rain'|'There was heavy')

Since it is impractical to estimate these long conditional probabilities reliably, we apply the Markov assumption and approximate this with a bigram model:

P('There was heavy rain') ≈ P('There') P('was'|'There') P('heavy'|'was') P('rain'|'heavy')
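To make this concrete, here is a minimal Python sketch (illustrative, not part of the original notes) that builds unigram and bigram counts from a small whitespace-tokenized toy corpus and multiplies the MLE bigram probabilities together. It conditions the first word on a <S> start token and ends with an <E> token, the same convention used in the smoothing example later in this unit.

from collections import defaultdict

def train_bigram_counts(corpus):
    """Count unigrams and bigrams over sentences padded with <S> and <E>."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        tokens = ["<S>"] + sentence.split() + ["<E>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
        unigrams["<E>"] += 1                     # the final token is never a "prev", count it here
    return unigrams, bigrams

def sentence_prob(sentence, unigrams, bigrams):
    """P(sentence) under the bigram (Markov) approximation with MLE estimates."""
    tokens = ["<S>"] + sentence.split() + ["<E>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:                  # unseen history
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

corpus = ["There was heavy rain", "There was heavy flood"]   # toy corpus (assumed for illustration)
uni, bi = train_bigram_counts(corpus)
print(sentence_prob("There was heavy rain", uni, bi))        # 0.5 on this toy corpus

Run on the three-sentence training set from the smoothing example below, the same function returns 0 for "I like mathematics", which is exactly the sparsity problem that smoothing addresses.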

Smoothed N-Grams:
One major problem with standard N-gram models is that they must be trained from some corpus,
and because any particular training corpus is finite, some perfectly acceptable N-grams are
bound to be missing from it. We can see that bigram matrix for any given training corpus is
sparse. There are large number of cases with zero probabilty bigrams and that should really have
some non-zero probability. This method tend to underestimate the probability of strings that
happen not to have occurred nearby in their training corpus.

Why do we need smoothing in NLP?


We use smoothing for the following reasons:
 To improve the accuracy of our model.
 To handle data sparsity and out-of-vocabulary words, i.e., words that are absent from the training set.

For example, let us consider the following.

Training set: ["I like coding", "Prakriti likes mathematics", "She likes coding"]
Let's consider bigrams, i.e., groups of two consecutive words.
P(wi | w(i-1)) = count(w(i-1) wi) / count(w(i-1))
So, let's find the probability of "I like mathematics".
We insert a start token <S> and an end token <E> at the start and end of each sentence, respectively.
P("I like mathematics")
= P( I | <S>) * P( like | I) * P( mathematics | like) * P(<E> | mathematics)
= (count(<S> I) / count(<S>)) * (count(I like) / count(I)) * (count(like mathematics) / count(like)) * (count(mathematics <E>) / count(mathematics))
= (1/3) * (1/1) * (0/1) * (1/1)
= 0
As you can see, P("I like mathematics") comes out to be 0, even though it is a perfectly reasonable sentence; because of the limited training data, our model does not handle it well.

Hence, to overcome this problem, we adjust the probabilities used in the model so that it performs more accurately and can even handle words absent from the training set. This task of re-evaluating some of the zero-probability and low-probability N-grams and assigning them non-zero values is called smoothing.


In addition to these smoothing techniques, interpolation and back-off can also be used to handle data sparsity and improve the accuracy of the model; a simple interpolation sketch follows, and Katz back-off is covered in detail at the end of this unit.
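As a quick illustration of interpolation, the sketch below (an illustrative example, not from the original notes) mixes the bigram MLE with the unigram MLE using a single weight lam. It expects count dictionaries like the ones built in the earlier sketch, and in practice the weight would be tuned on held-out data rather than fixed at 0.7.

def interpolated_bigram_prob(prev, word, unigrams, bigrams, lam=0.7):
    """Linear interpolation of the bigram and unigram MLE estimates."""
    total_tokens = sum(unigrams.values())
    p_unigram = unigrams.get(word, 0) / total_tokens
    if unigrams.get(prev, 0) > 0:
        p_bigram = bigrams.get((prev, word), 0) / unigrams[prev]
    else:
        p_bigram = 0.0                         # unseen history: rely on the unigram term
    return lam * p_bigram + (1 - lam) * p_unigram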

Smoothing Techniques:

Types of Smoothing in NLP


 Laplace / Add-1 Smoothing
Here, we simply add 1 to every count so that we never get a zero value.
PLaplace(wi | w(i-1)) = (count(w(i-1) wi) + 1) / (count(w(i-1)) + V)
where V = the vocabulary size (the number of distinct word types, including <S> and <E>), which is 9 in our example.

So, P(“I like mathematics”)


= P( I | <S>) * P( like | I) * P( mathematics | like) * P(<E> | mathematics)
= ((1+1) / (3+9)) * ((1+1) / (1+9)) * ((0+1) / (1+9)) * ((1+1) / (1+9))
= 1 / 1500

 Additive Smoothing
It is very similar to Laplace smoothing: instead of 1, we add a value δ.
So, PAdditive(wi | w(i-1)) = (count(w(i-1) wi) + δ) / (count(w(i-1)) + δ|V|)
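Both Laplace and additive smoothing can be written as one small function. The sketch below is illustrative (the parameter names are my own, not from the notes); setting add_k = 1 gives Laplace smoothing, and any other positive add_k gives additive (add-δ) smoothing.

def add_k_bigram_prob(prev, word, unigrams, bigrams, vocab_size, add_k=1.0):
    """Add-k smoothed bigram probability (add_k == 1 is Laplace / add-1)."""
    bigram_count = bigrams.get((prev, word), 0)
    history_count = unigrams.get(prev, 0)
    return (bigram_count + add_k) / (history_count + add_k * vocab_size)

Applied to the three-sentence training set above with vocab_size = 9, this reproduces the hand-computed value of 1/1500 for "I like mathematics".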

 Good-Turing Smoothing

Good-Turing discounting is a useful smoothing technique that improves model accuracy by taking into account additional information about the data distribution, namely the frequencies of the observed frequencies.

The basic idea behind Good-Turing smoothing is to use the total frequency of events that occur only once to estimate how much probability mass to shift to unseen events.

 The Good-Turing estimator (GT), as described by Gale & Sampson (1995), assigns unseen events a total probability mass of n1 / N, where N is the total number of instances observed in the training data and n1 is the number of instances observed exactly once.
 The probability mass of all events that occur k times in the training data is reassigned to events that occur k − 1 times.
 It re-estimates the amount of probability mass to assign to N-grams with zero or low counts by looking at the number of N-grams with higher counts.

For example, as we saw above, P(mathematics | like) equals 0 without smoothing. For unseen bigrams, we use the number of bigrams that occurred exactly once and the total number of bigrams:
Punknown(wi | w(i-1)) = (number of bigrams that appeared exactly once) / (total number of bigrams)
For known bigrams such as "like coding", we use the number of bigrams whose frequency is one more than the current bigram's frequency (Nc+1), the number of bigrams whose frequency equals the current bigram's frequency (Nc), and the total number of bigrams (N):
Pknown(wi | w(i-1)) = c* / N
where c* = (c+1) * Nc+1 / Nc and c is the count of the input bigram ("like coding" in our example).

 Advantages of Good-Turing discounting: it works very well in practice.
o Usually, the Good-Turing discounted estimate is used only for unreliable counts, for example, counts less than 5.
 Disadvantages of Good-Turing discounting: there are two main problems.
o The reassignment may change the probability of the most frequent events, and we generally do not observe events for every possible count k.
o Both of these problems can be addressed by a variant called Simple Good-Turing discounting.
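The count adjustment above is mechanical enough to sketch in a few lines. The following Python snippet (an illustrative sketch, not from the original notes; it assumes a plain dictionary of bigram counts) computes c* = (c+1) * Nc+1 / Nc for observed bigrams and the n1 / N mass reserved for unseen ones; a production implementation such as Simple Good-Turing would first smooth the Nc values, since some frequencies of frequencies are zero.

from collections import Counter

def good_turing_adjusted_counts(bigram_counts):
    """Return c* = (c + 1) * N_{c+1} / N_c for every observed bigram count c."""
    freq_of_freq = Counter(bigram_counts.values())       # N_c: how many bigrams occur c times
    adjusted = {}
    for bigram, c in bigram_counts.items():
        if freq_of_freq.get(c + 1, 0) > 0:
            adjusted[bigram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            adjusted[bigram] = c                          # no N_{c+1} data: keep the raw count
    return adjusted

def unseen_mass(bigram_counts):
    """Total probability mass reserved for unseen bigrams: n1 / N."""
    n1 = sum(1 for c in bigram_counts.values() if c == 1)
    total = sum(bigram_counts.values())
    return n1 / total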

 Kneser-Ney Smoothing:

Kneser-Ney smoothing is an extension of absolute discounting that uses a more sophisticated scheme: it takes into account the number of distinct words that precede an n-gram in the training corpus. In absolute discounting, which is also a smoothing technique, a fixed discount value is subtracted from each n-gram count, which effectively redistributes some of the probability mass from higher-frequency n-grams to lower-frequency ones. The discounted count is then used to compute the conditional probability of the n-gram given its history. One issue with absolute discounting is that it may lead to negative probabilities when the discount value is too large, which is why the discounted count is usually clipped at zero. Kneser-Ney smoothing keeps the fixed discount but replaces the lower-order (back-off) distribution with a continuation probability based on how many distinct contexts a word appears in, rather than on its raw frequency. This approach tends to work well in practice and is widely used in language modeling. Here we subtract an absolute discount value d from the observed N-grams and redistribute that mass to unseen N-grams.
Example: Consider the sentence:
"The quick brown fox jumps over the lazy dog."
We want to estimate the probability of this sentence using a trigram language model, i.e., a model that estimates the probability of each word given the two preceding words.
One way to estimate a trigram probability is the maximum likelihood estimate (MLE), which is simply the count of the trigram divided by the count of its two-word history. For example, the MLE estimate of "jumps" given the history "brown fox" is:

P(jumps | brown fox) = Count(brown fox jumps) / Count(brown fox)

However, when the count of a particular trigram is zero, as is often the case in natural language, the MLE estimate becomes zero as well. This is not desirable, as it implies that the sentence has zero probability.

Smoothing techniques, such as absolute discounting and Kneser-Ney smoothing, address this problem by redistributing probability mass from high-frequency n-grams to low-frequency ones. Here is a simplified formula for absolute discounting:

P(w | h) = max(Count(w, h) - d, 0) / Count(h) + lambda(h) * P_lower(w)

where:

Count(w, h) is the count of the word w together with its history h.

d is a discount parameter that is subtracted from each non-zero n-gram count.

lambda(h) is a normalization weight that redistributes the discounted mass and ensures the conditional probabilities sum to one, and P_lower(w) is a lower-order (for example, unigram) estimate over which that mass is spread.

P(w | h) is the probability of the word w given its history h.


For example, suppose we use absolute discounting with a discount value of 0.5 for the same trigram, i.e., the word "jumps" with the history "brown fox". The discounted count is:

Count'(brown fox jumps) = max(Count(brown fox jumps) - 0.5, 0)

and the probability estimate is:

P(jumps | brown fox) = Count'(brown fox jumps) / Count(brown fox) + lambda(brown fox) * P_lower(jumps)

If this trigram never occurs in the training data, the first term is zero and the entire estimate comes from the lower-order term.

Kneser-Ney smoothing uses a more sophisticated scheme that takes into account the number of distinct words that precede an n-gram in the training corpus. The full formula is more involved, but it can be simplified as follows:

P(w | h) = max(Count(w, h) - d, 0) / Count(h) + alpha(h) * P_cont(w)

where:
alpha(h) is a normalization weight that ensures the conditional probabilities sum to one.
P_cont(w) is the continuation probability of the word w: the number of distinct words that precede w in the training data, divided by the total number of distinct bigram (word-pair) types. In other words, a word receives a high continuation probability if it appears after many different contexts, not merely if it is frequent.
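A minimal sketch of an interpolated bigram Kneser-Ney model follows (illustrative only, not from the original notes; the discount of 0.75 is a commonly used default, and the continuation probability is computed from the number of distinct words that precede each word, as described above).

from collections import defaultdict

def kneser_ney_bigram(bigram_counts, discount=0.75):
    """Build an interpolated Kneser-Ney bigram probability function.

    bigram_counts maps (prev, word) -> count; discount is the fixed absolute
    discount d subtracted from every observed bigram count.
    """
    context_total = defaultdict(int)   # C(prev): total count of bigrams starting with prev
    followers = defaultdict(set)       # distinct words seen after each prev
    preceders = defaultdict(set)       # distinct words seen before each word (continuation counts)
    for (prev, word), c in bigram_counts.items():
        context_total[prev] += c
        followers[prev].add(word)
        preceders[word].add(prev)
    total_bigram_types = len(bigram_counts)

    def prob(prev, word):
        p_cont = len(preceders[word]) / total_bigram_types
        if context_total[prev] == 0:               # unknown history: fall back to continuation prob
            return p_cont
        discounted = max(bigram_counts.get((prev, word), 0) - discount, 0) / context_total[prev]
        lam = discount * len(followers[prev]) / context_total[prev]
        return discounted + lam * p_cont

    return prob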

 Katz Smoothing
Here, the Good-Turing discounting technique is combined with back-off to lower-order models (Katz back-off is described in detail below).
 Church and Gale Smoothing
Here, the Good-Turing technique is combined with bucketing: every N-gram is assigned to a bucket according to its frequency, and Good-Turing estimation is then carried out within each bucket.

Back-off and Interpolation Techniques


Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this by backing off through progressively shorter-history models under certain conditions.[1] In this way, the model with the most reliable information about a given history is used to provide the best results.
The model was introduced in 1987 by Slava M. Katz. Prior to that, n-gram language models were
constructed by training individual models for different n-gram orders using maximum likelihood
estimation and then interpolating them together.

Method
The equation for Katz's back-off model is:[2]

P_bo(wi | w(i-n+1) ... w(i-1)) =
    d * C(w(i-n+1) ... w(i-1) wi) / C(w(i-n+1) ... w(i-1))        if C(w(i-n+1) ... w(i-1) wi) > k
    α(w(i-n+1) ... w(i-1)) * P_bo(wi | w(i-n+2) ... w(i-1))        otherwise

where
C(x) = number of times x appears in training
wi = ith word in the given context
Essentially, this means that if the n-gram has been seen more than k times in training, the conditional probability of a word given its history is proportional to the maximum likelihood estimate of that n-gram. Otherwise, the conditional probability is equal to the back-off conditional probability of the (n − 1)-gram.
The more difficult part is determining the values for k, d and α.

k is the least important of the parameters. It is usually chosen to be 0. However, empirical testing may find better values for k.

d is typically the amount of discounting found by Good–Turing estimation. In other words, if Good–Turing re-estimates a count c as c*, then d = c* / c.

To compute α, it is useful to first define a quantity β, which is the left-over probability mass for the (n − 1)-gram history:

β(w(i-n+1) ... w(i-1)) = 1 − Σ over {wi : C(w(i-n+1) ... wi) > k} of d * C(w(i-n+1) ... w(i-1) wi) / C(w(i-n+1) ... w(i-1))

Then the back-off weight, α, is computed as follows:

α(w(i-n+1) ... w(i-1)) = β(w(i-n+1) ... w(i-1)) / Σ over {wi : C(w(i-n+1) ... wi) ≤ k} of P_bo(wi | w(i-n+2) ... w(i-1))

The above formula only applies if there is data for the (n − 1)-gram. If not, the algorithm skips n − 1 entirely and uses the Katz estimate for n − 2 (and so on, until an n-gram with data is found).
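To tie the pieces together, here is a simplified bigram-level sketch of Katz back-off (an illustration, not the original derivation): it uses a single fixed discount ratio in place of the per-count Good-Turing ratio c*/c, backs off to the maximum-likelihood unigram model, and defaults to k = 0.

from collections import defaultdict

def katz_backoff_bigram(unigram_counts, bigram_counts, k=0, discount_ratio=0.9):
    """Build a Katz back-off bigram probability function.

    Simplifications (assumptions, not from the original text): a single fixed
    discount ratio is used instead of the Good-Turing ratio c*/c.
    """
    total_tokens = sum(unigram_counts.values())
    context_total = defaultdict(int)           # C(prev)
    seen_after = defaultdict(set)              # words observed after each prev
    for (prev, word), c in bigram_counts.items():
        context_total[prev] += c
        seen_after[prev].add(word)

    def p_ml_unigram(word):
        return unigram_counts.get(word, 0) / total_tokens

    def prob(prev, word):
        c = bigram_counts.get((prev, word), 0)
        if c > k:                              # seen often enough: discounted ML estimate
            return discount_ratio * c / context_total[prev]
        # otherwise back off to the unigram model, scaled so probabilities still sum to one
        kept = [w for w in seen_after[prev] if bigram_counts[(prev, w)] > k]
        beta = 1.0 - sum(discount_ratio * bigram_counts[(prev, w)] / context_total[prev]
                         for w in kept)
        unseen_unigram_mass = 1.0 - sum(p_ml_unigram(w) for w in kept)
        alpha = beta / unseen_unigram_mass if unseen_unigram_mass > 0 else 0.0
        return alpha * p_ml_unigram(word)

    return prob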
