
Other Statistical Methods/Models
Unit 2b
Language Models
• Models that assign probabilities to sequences of words are called language models.
• Equivalently, they are models that assign a probability to each possible next word, or to an entire sentence.
• Why would you want to predict upcoming words, or assign probabilities to sentences?
• To identify words in noisy, ambiguous input, as in speech recognition.
• In writing tools such as spelling correction or grammatical error correction.
• Language models are also essential in machine translation.
N-gram Language Models
• The n-gram is the simplest model that assigns probabilities to sentences and sequences of words.
• It is based on computing P(w|h), the probability of a word w given some history h.

The intuition of the n-gram model is that instead of computing the probability of a
word given its entire history, we can approximate the history by just the last few
words.
bigram
• The bigram model, for example, approximates the probability of a word given all the previous words, P(wn|w1:n-1), by using only the conditional probability given the preceding word, P(wn|wn-1). In other words, instead of computing the probability P(the|Walden Pond’s water is so transparent that), we approximate it with the probability P(the|that).
• The assumption that the probability of a word depends only on the
previous word is called a Markov assumption.
n-gram
• The general n-gram approximation is P(wn|w1:n-1) ≈ P(wn|wn-N+1:n-1), where N is the order of the gram: bigram N=2, trigram N=3, …
• So if N=4, then P(wn|wn-4+1:n-1) = P(wn|wn-3:n-1) = P(wn|wn-3 wn-2 wn-1)
• For example, with N=4: P(the|Walden Pond’s water is so transparent that) ≈ P(the|so transparent that)
Markov models
• Markov models are the class of probabilistic models that assume we
can predict the probability of some future unit without looking too far
into the past.
• We can generalize the bigram (which looks one word into the past) to
the trigram (which looks two words into the past) and thus to the n-
gram (which looks n-1 words into the past).
maximum likelihood estimation
• How do we estimate these bigram or n-gram probabilities?
• We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus and normalizing the counts so that they lie between 0 and 1.
• For a bigram this gives P(wn|wn-1) = C(wn-1 wn) / C(wn-1): the count of the bigram divided by the count of the preceding word.
[Table: bigram counts for word pairs in the corpus, together with the unigram count of each word. Exercise: find the bigram probability for each cell.]
Example: P(lunch|eat) = C(eat lunch) / C(eat) = 42 / 746 ≈ 0.056
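A minimal sketch of this MLE bigram estimation in Python; the toy corpus, the <s>/</s> boundary markers, and the function name bigram_mle are illustrative assumptions, not the restaurant corpus behind the counts above:

```python
from collections import Counter

# Toy training corpus; each sentence is padded with <s> and </s> markers.
corpus = [
    "<s> i want english food </s>",
    "<s> i want chinese food </s>",
    "<s> i eat lunch </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_mle(w_prev, w):
    """MLE estimate: P(w | w_prev) = C(w_prev w) / C(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

print(bigram_mle("i", "want"))  # 2/3 on this toy corpus
```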
If we have the following bigram probabilities…
• Now we can compute the probability of sentences like “I want English food” or “I want Chinese food” by simply multiplying the appropriate bigram probabilities together, for example:
P(<s> i want english food </s>) = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
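A sketch of the same computation in code, chaining bigram probabilities in log space to avoid numerical underflow; it reuses the hypothetical bigram_mle helper from the sketch above:

```python
import math

def sentence_logprob(sentence, bigram_prob):
    """Sum log P(w_n | w_n-1) over the sentence padded with <s> and </s>."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log(bigram_prob(prev, w))
               for prev, w in zip(tokens, tokens[1:]))

# Probability of "i want english food" under the toy bigram model above.
logp = sentence_logprob("i want english food", bigram_mle)
print(math.exp(logp))  # product of the individual bigram probabilities
```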
Evaluating Language Models
• Extrinsic evaluation: embed the language model in an application and measure how much the application improves.
• Intrinsic evaluation: measure the quality of the model independently of any application, e.g. with perplexity.
• This requires training, development, and test sets.
• In practice, we often just divide our data into 80% training, 10% development, and 10% test (see the split sketch below).
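A minimal sketch of such a split, assuming the data is simply a list of sentences and that shuffling before splitting is acceptable:

```python
import random

def train_dev_test_split(sentences, seed=0):
    """Shuffle and split into 80% training, 10% development, 10% test."""
    data = list(sentences)
    random.Random(seed).shuffle(data)
    n = len(data)
    return data[:int(0.8 * n)], data[int(0.8 * n):int(0.9 * n)], data[int(0.9 * n):]
```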
Perplexity
In general, perplexity is a measurement of how well a probability model predicts a sample. In the context of Natural Language Processing, perplexity is one way to evaluate language models.
• The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words.
• For a test set W = w1 w2 … wN: PP(W) = P(w1 w2 … wN)^(-1/N)
• Minimizing perplexity is equivalent to maximizing the test set probability according to the language model.
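A sketch of perplexity computed for a bigram model in log space; bigram_prob is assumed to be any smoothed probability function that never returns zero on the test data:

```python
import math

def perplexity(test_sentences, bigram_prob):
    """PP(W) = P(w_1 ... w_N)^(-1/N), computed via log probabilities."""
    total_logprob = 0.0
    n_words = 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            total_logprob += math.log(bigram_prob(prev, w))
            n_words += 1
    return math.exp(-total_logprob / n_words)
```

Lower perplexity means the model assigns higher probability to the test set.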
Perplexity
The table below shows the perplexity of a 1.5 million word WSJ test set according to each of these grammars.
[Table: perplexity of the unigram, bigram, and trigram models on the WSJ test set.]
Smoothing
• Some words that are in our vocabulary appear in the test set in an unseen context:
• for example, they appear after a word they never appeared after in training.
• To prevent a language model from assigning zero probability to these unseen events, we apply smoothing. Common methods include:
• Laplace (add-one) smoothing,
• add-k smoothing,
• stupid backoff, and
• Kneser-Ney smoothing
Laplace Smoothing
• The simplest way to do smoothing is to add one to all the n-gram counts before we normalize them into probabilities.
• It does not perform well enough to be used in modern n-gram models.
• It is, however, still a practical smoothing algorithm for other tasks like text classification.
• The unsmoothed unigram probability of a word wi is its count ci normalized by the total number of word tokens N: P(wi) = ci / N
• After Laplace smoothing this becomes PLaplace(wi) = (ci + 1) / (N + V), where N is the number of tokens and V is the number of words in the vocabulary.


Add-k smoothing
• Instead of adding 1 to each count, we add a fractional count k (0.5? 0.05? 0.01?). This algorithm is therefore called add-k smoothing.
• For bigrams: Padd-k(wn|wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV), as in the sketch below.
• It is not that useful in language modelling.
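A sketch of add-k bigram estimation under the toy unigram_counts and bigram_counts from the earlier sketch (illustrative assumptions, not the slide's counts); setting k = 1 gives Laplace (add-one) smoothing:

```python
def add_k_bigram(w_prev, w, k=1.0):
    """P(w | w_prev) = (C(w_prev w) + k) / (C(w_prev) + k * V)."""
    V = len(unigram_counts)  # vocabulary size: number of distinct word types
    return (bigram_counts[(w_prev, w)] + k) / (unigram_counts[w_prev] + k * V)

# An unseen bigram now gets a small nonzero probability instead of zero.
print(add_k_bigram("eat", "english"))          # Laplace (k = 1)
print(add_k_bigram("eat", "english", k=0.05))  # add-k with a fractional k
```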


Backoff and Interpolation
• If we are trying to compute P(wn|wn-2wn-1) but we have no examples of a
particular trigram wn-2wn-1wn, we can instead estimate its probability by
using the bigram probability P(wn|wn-1). Similarly, if we don’t have counts
to compute P(wn|wn-1), we can look to the unigram P(wn)
• In backoff, we use the trigram if the evidence is sufficient, otherwise we
use the bigram, otherwise the unigram.
• By contrast, in interpolation, we always mix the probability estimates from all the n-gram estimators, weighting and combining the trigram, bigram, and unigram counts (see the sketch below):
• P̂(wn|wn-2 wn-1) = λ1 P(wn) + λ2 P(wn|wn-1) + λ3 P(wn|wn-2 wn-1), where the λs sum to 1.
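A sketch of simple linear interpolation; the λ weights are illustrative values that would normally be tuned on a held-out development set, and unigram_p, bigram_p, trigram_p are assumed to be existing probability functions:

```python
def interpolated_trigram(w2, w1, w, unigram_p, bigram_p, trigram_p,
                         lambdas=(0.1, 0.3, 0.6)):
    """P_hat(w | w2 w1) = l1*P(w) + l2*P(w | w1) + l3*P(w | w2 w1), l1+l2+l3 = 1."""
    l1, l2, l3 = lambdas
    return l1 * unigram_p(w) + l2 * bigram_p(w1, w) + l3 * trigram_p(w2, w1, w)
```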
Kneser-Ney Smoothing
• Absolute discounting formalizes this intuition by subtracting a fixed (absolute) discount d from each count of a seen n-gram:
• PAbsoluteDiscounting(wi|wi-1) = (C(wi-1 wi) - d) / Σv C(wi-1 v) + λ(wi-1) P(wi)
• The first term is the discounted bigram, and the second term is the unigram with an interpolation weight λ. We could just set all the d values to 0.75, or we could keep a separate discount value of 0.5 for the bigrams with counts of 1 (see the sketch below).
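A sketch of interpolated absolute discounting for bigrams, again reusing the hypothetical unigram_counts and bigram_counts from the earlier sketches; to keep the example simple it backs off to the plain unigram distribution rather than the Kneser-Ney continuation probability:

```python
def absolute_discount_bigram(w_prev, w, d=0.75):
    """max(C(w_prev w) - d, 0) / C(w_prev)  +  lambda(w_prev) * P(w)."""
    c_prev = unigram_counts[w_prev]
    # Number of distinct word types seen after w_prev in training.
    n_continuations = sum(1 for (a, _b) in bigram_counts if a == w_prev)
    discounted = max(bigram_counts[(w_prev, w)] - d, 0) / c_prev
    lam = (d / c_prev) * n_continuations  # probability mass freed by discounting
    p_unigram = unigram_counts[w] / sum(unigram_counts.values())
    return discounted + lam * p_unigram
```

Kneser-Ney smoothing keeps this same structure but replaces the unigram term with a continuation probability based on how many different contexts a word appears in.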
