Unit 2
N-grams are contiguous sequences of n items extracted from a given sample of text or speech. An N-gram model is one type of Language Model (LM), i.e., a model that defines a probability distribution over word sequences. The items can be letters, words, or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus (usually a large text dataset).
N-grams are classified into different types depending on the value of n ("n" is the number of words considered together when constructing the n-grams). When n=1, it is called a unigram; when n=2, a bigram; when n=3, a trigram; when n=4, a 4-gram; and so on. Different orders of n-grams suit different NLP applications.
Example: for the sentence "I like coding", the unigrams are "I", "like", and "coding"; the bigrams are "I like" and "like coding"; and the only trigram is "I like coding".
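A minimal Python sketch (our own illustration; the helper name extract_ngrams is hypothetical, not from the text) showing how n-grams of different orders can be extracted from a sentence:

def extract_ngrams(tokens, n):
    # Slide a window of length n over the token list and join each window into one n-gram string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like coding".split()
print(extract_ngrams(tokens, 1))  # unigrams: ['I', 'like', 'coding']
print(extract_ngrams(tokens, 2))  # bigrams: ['I like', 'like coding']
print(extract_ngrams(tokens, 3))  # trigram: ['I like coding']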
An N-gram model is built by counting how often word sequences occur in a text corpus and then estimating the probabilities from those counts.
Given a sequence of N-1 words, an N-gram model predicts the most probable word that might follow this sequence. It is a probabilistic model trained on a corpus of text. Such a model is useful in many NLP applications, including speech recognition, machine translation, and predictive text input.
Consider two sentences: "There was heavy rain" vs. "There was heavy flood". From experience,
we know that the former sentence sounds better. An N-gram model will tell us that "heavy rain"
occurs much more often than "heavy flood" in the training corpus. Thus, the first sentence is
more probable and will be selected by the model.
A model that simply relies on how often a word occurs, without looking at previous words, is called a unigram model. If a model considers only the previous word to predict the current word, it is called a bigram model. If the two previous words are considered, it is a trigram model.
An n-gram model for the above example would calculate the following probability using the chain rule:
P("There was heavy rain") = P(There) * P(was | There) * P(heavy | There was) * P(rain | There was heavy)
Since it is impractical to estimate such long conditional probabilities, we use the Markov assumption and approximate this with a bigram model:
P("There was heavy rain") ≈ P(There) * P(was | There) * P(heavy | was) * P(rain | heavy)
Smoothed N-Grams:
One major problem with standard N-gram models is that they must be trained on some corpus, and because any particular training corpus is finite, some perfectly acceptable N-grams are bound to be missing from it. The bigram matrix for any given training corpus is therefore sparse: there are a large number of bigrams with zero estimated probability that should really have some non-zero probability. As a result, the model tends to underestimate the probability of strings that happen not to occur in its training corpus.
Training set: ["I like coding", "Prakriti likes mathematics", "She likes coding"]
Let's consider bigrams, i.e., groups of two consecutive words.
P(wi | w(i-1)) = count(w(i-1) wi) / count(w(i-1))
So, let's find the probability of “I like mathematics”.
We insert a start token <S> and an end token <E> at the start and end of each sentence, respectively.
P(“I like mathematics”)
= P( I | <S>) * P( like | I) * P( mathematics | like) * P(<E> | mathematics)
= (count(<S> I) / count(<S>)) * (count(I like) / count(I)) * (count(like mathematics) /
count(like)) * (count(mathematics <E>) / count(mathematics))
= (1/3) * (1/1) * (0/1) * (1/1)
= 0
As you can see, P("I like mathematics") comes out to be 0, even though it is a perfectly reasonable sentence; because of the limited training data, our model does not handle it well.
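A small Python sketch (our own illustration of the calculation above; the helper names are hypothetical) that reproduces this unsmoothed bigram estimate on the toy training set:

from collections import Counter

training = ["I like coding", "Prakriti likes mathematics", "She likes coding"]

# Collect unigram and bigram counts, padding each sentence with <S> and <E>.
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in training:
    tokens = ["<S>"] + sentence.split() + ["<E>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: count(prev word) / count(prev).
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    tokens = ["<S>"] + sentence.split() + ["<E>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("I like mathematics"))  # 0.0, because "like mathematics" never occurs in training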
Hence, to overcome this problem, we adjust the probabilities used in the model so that it performs more accurately and can even handle N-grams absent from the training set. This task of re-evaluating some of the zero-probability and low-probability N-grams and assigning them non-zero values is called smoothing.
In addition to smoothing techniques, interpolation and back-off can also be used to handle data sparsity and improve the accuracy of the model, as sketched below.
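For instance, a minimal sketch of simple linear interpolation (our own illustration, not a formula from the text; it reuses the counts built in the earlier sketch and an arbitrary weight lam = 0.7) that mixes the bigram and unigram estimates:

def interpolated_prob(prev, word, lam=0.7):
    # Linear interpolation: lam * P_bigram + (1 - lam) * P_unigram.
    total_tokens = sum(unigram_counts.values())
    p_bigram = bigram_counts[(prev, word)] / unigram_counts[prev] if unigram_counts[prev] else 0.0
    p_unigram = unigram_counts[word] / total_tokens
    return lam * p_bigram + (1 - lam) * p_unigram

print(interpolated_prob("like", "mathematics"))  # non-zero, thanks to the unigram term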
Smoothing Techniques:
Additive Smoothing
It is very similar to Laplace smoothing. Instead of 1, we add a δ value.
So, PAdditive(wi | w(i-1)) = (count(wi w(i-1)) + δ) / (count(w(i-1)) + δ|V|)
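As a rough illustration (our own sketch, reusing the unigram_counts and bigram_counts built earlier, with an arbitrary δ = 0.5):

def additive_prob(prev, word, delta=0.5):
    # (count(prev word) + delta) / (count(prev) + delta * |V|),
    # where V is the vocabulary (including the <S>/<E> markers in this toy setup).
    vocab_size = len(unigram_counts)
    return (bigram_counts[(prev, word)] + delta) / (unigram_counts[prev] + delta * vocab_size)

print(additive_prob("like", "mathematics"))  # small but non-zero probability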
Good-Turing discounting is one of the useful smoothing techniques; it improves model accuracy by taking into account additional information about the data distribution.
The basic idea behind Good-Turing smoothing is to use the total frequency of events that occur only once to estimate how much probability mass to shift to unseen events.
The Good-Turing estimator (GT), as described by Gale & Sampson (1995), assumes the probability mass of unseen events to be n1 / N, where N is the total number of instances observed in the training data and n1 is the number of instances observed only once.
It reassigns the probability mass of all events that occur k times in the training data to all events that occur k−1 times.
In this way, it re-estimates the amount of probability mass to assign to N-grams with zero or low counts by looking at the number of N-grams with higher counts.
For example, as we saw above, P(mathematics | like) equals 0 without smoothing. For unknown bigrams we use the number of bigrams that occurred exactly once, divided by the total number of bigrams:
Punknown(wi | w(i-1)) = (count of bigrams that appeared once) / (total count of bigrams)
For known bigrams such as "like coding", we use the number of bigrams that occurred one time more than the current bigram's count (Nc+1), the number of bigrams that occurred the same number of times as the current bigram (Nc), and the total number of bigrams (N):
Pknown(wi | w(i-1)) = c* / N
where c* = (c+1) * Nc+1 / Nc and c is the count of the input bigram, "like coding" in our example.
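A rough Python sketch (our own reading of the formulas above, reusing bigram_counts from the earlier sketch) of the unseen-bigram mass and the adjusted count c*:

from collections import Counter

# N_c: how many distinct bigrams occur exactly c times; N: total number of bigram tokens.
count_of_counts = Counter(bigram_counts.values())
total_bigrams = sum(bigram_counts.values())

def good_turing_unknown():
    # Probability mass reserved for unseen bigrams: N1 / N.
    return count_of_counts[1] / total_bigrams

def good_turing_known(c):
    # Adjusted count c* = (c + 1) * N_{c+1} / N_c, then probability c* / N.
    if count_of_counts[c] == 0 or count_of_counts[c + 1] == 0:
        return c / total_bigrams  # fall back to the raw estimate when a count-of-counts is missing
    c_star = (c + 1) * count_of_counts[c + 1] / count_of_counts[c]
    return c_star / total_bigrams

print(good_turing_unknown())                                  # mass for unseen bigrams
print(good_turing_known(bigram_counts[("like", "coding")]))   # smoothed probability for "like coding"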
Advantages of Good-Turing Discounting: Good-Turing discounting works very well in
practice.
o Usually, the Good-Turing discounted estimate is used only for unreliable
counts, for example, for counts less than 5.
Disadvantages of Good-Turing Discounting: There are two main problems with
Good-Turing discounting.
o The reassignment may change the probability of the most frequent event, and
we generally do not observe events for every count k.
o Both of these problems can be addressed by a variant called Simple
Good-Turing discounting.
Kneser-Ney Smoothing:
The maximum likelihood estimate (MLE) for the next word given a trigram history is:
P(jumps | quick brown fox) = Count(quick brown fox jumps) / Count(quick brown fox)
However, when the count of a particular trigram is zero, as is often the case in
natural language, the MLE estimate becomes zero as well. This is not
desirable, as it implies that the sentence has zero probability.
Kneser-Ney smoothing addresses this by discounting the observed count and interpolating with a lower-order continuation probability:
P(jumps | quick brown fox) = (Count'(quick brown fox jumps) / Count(quick brown fox)) + lambda(quick brown fox) * Pcontinuation(jumps)
where Count' is the discounted count, lambda(quick brown fox) is the weight given to the lower-order model, and Pcontinuation(jumps) is proportional to the number of distinct contexts in which "jumps" appears.
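A compact Python sketch (our own simplification to the bigram case, reusing the counts built earlier; the discount d = 0.75 is a common default, not a value given in the text) of interpolated Kneser-Ney:

def kneser_ney_prob(prev, word, d=0.75):
    # Discounted bigram term: max(count(prev word) - d, 0) / count(prev).
    discounted = max(bigram_counts[(prev, word)] - d, 0) / unigram_counts[prev]
    # Back-off weight lambda(prev): proportional to how many distinct words follow prev.
    followers = len({w2 for (w1, w2) in bigram_counts if w1 == prev})
    lam = (d * followers) / unigram_counts[prev]
    # Continuation probability: how many distinct contexts precede `word`,
    # normalised by the total number of distinct bigram types.
    p_cont = len({w1 for (w1, w2) in bigram_counts if w2 == word}) / len(bigram_counts)
    return discounted + lam * p_cont

print(kneser_ney_prob("like", "mathematics"))  # non-zero, driven by the continuation term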
Katz Smoothing
Here the Good-Turing discounting technique is combined with back-off: discounted counts are used for observed N-grams, and the left-over probability mass is redistributed to lower-order estimates (see the back-off method below).
Church and Gale Smoothing
Here, the Good-Turing technique is combined with bucketing. Every N-gram is assigned to a
bucket according to its frequency, and Good-Turing estimation is then applied within each bucket.
Katz Back-off Method:
The equation for Katz's back-off model is:[2]
Pbo(wi | w(i-n+1) ... w(i-1)) =
    d * C(w(i-n+1) ... w(i-1) wi) / C(w(i-n+1) ... w(i-1))        if C(w(i-n+1) ... w(i-1) wi) > k
    α(w(i-n+1) ... w(i-1)) * Pbo(wi | w(i-n+2) ... w(i-1))        otherwise
where
C(x) = number of times x appears in training
wi = ith word in the given context
d = amount of discounting applied (typically found by Good-Turing estimation)
α = back-off weight that distributes the left-over probability mass among the backed-off words
k = count threshold below which we back off
Essentially, this means that if the n-gram has been seen more than k times in training,
the conditional probability of a word given its history is proportional to the maximum
likelihood estimate of that n-gram. Otherwise, the conditional probability is equal to the
back-off conditional probability of the (n − 1)-gram.
The more difficult part is determining the values for k, d and α.
The above formula only applies if there is data for the "(n − 1)-gram". If not, the
algorithm skips the (n − 1)-gram entirely and uses the Katz estimate for the (n − 2)-gram,
and so on, until an n-gram with data is found.
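A simplified Python sketch (our own, restricted to the bigram → unigram case, reusing the earlier counts and a fixed discount instead of a Good-Turing one) of the back-off logic described above:

def katz_bigram_prob(prev, word, d=0.5, k=0):
    # If the bigram was seen more than k times, use its discounted relative frequency.
    if bigram_counts[(prev, word)] > k:
        return (bigram_counts[(prev, word)] - d) / unigram_counts[prev]
    # Otherwise back off to the unigram estimate, scaled by the left-over mass alpha(prev).
    seen = [w2 for (w1, w2), c in bigram_counts.items() if w1 == prev and c > k]
    left_over = 1.0 - sum((bigram_counts[(prev, w2)] - d) / unigram_counts[prev] for w2 in seen)
    total_tokens = sum(unigram_counts.values())
    unseen_mass = 1.0 - sum(unigram_counts[w2] for w2 in seen) / total_tokens
    alpha = left_over / unseen_mass
    return alpha * unigram_counts[word] / total_tokens

print(katz_bigram_prob("like", "coding"))       # seen bigram: discounted MLE
print(katz_bigram_prob("like", "mathematics"))  # unseen bigram: backs off to the unigram model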