
N Grams - Nptel Notes

The document provides an overview of N-gram language models, which predict upcoming words based on the probabilities derived from previous words. It discusses the limitations of N-gram models, such as their inability to handle long-distance dependencies and new sequences, while highlighting the advantages of large language models (LLMs) that can manage longer contexts and better model synonymy. Additionally, it covers methods for estimating N-gram probabilities and introduces various N-gram model toolkits.


N-gram Language Modeling: Introduction to N-gram Language Models
Predicting words
The water of Walden Pond is beautifully ...

blue
*refrigerator
green
*that
clear
Language Models
Systems that can predict upcoming words
• Can assign a probability to each potential next word
• Can assign a probability to a whole sentence
Why word prediction?
It's a helpful part of language tasks
• Grammar or spell checking
Their are two midterms → There are two midterms
Everything has improve → Everything has improved

• Speech recognition
I will be back soonish   vs.   I will be bassoon dish
Why word prediction?
It's how large language models (LLMs) work!
LLMs are trained to predict words
• Left-to-right (autoregressive) LMs learn to predict next word
LLMs generate text by predicting words
• By predicting the next word over and over again
Language Modeling (LM) more formally
Goal: compute the probability of a sentence or
sequence of words W:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4) or P(wn|w1,w2…wn-1)
An LM computes either of these:
P(W) or P(wn|w1,w2…wn-1)
P(blue | The water of Walden Pond is so beautifully)    (3.1)

How to estimate these probabilities
Could we just count and divide?

One way to estimate this probability is directly from relative frequency counts: take a very large corpus, count the number of times we see The water of Walden Pond is so beautifully, and count the number of times this is followed by blue. This would be answering the question "Out of the times we saw the history h, how many times was it followed by the word w?", as follows:

P(blue | The water of Walden Pond is so beautifully) =
    C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)    (3.2)

No! Too many possible sentences! We'll never see enough data for estimating these.

If we had a large enough corpus, we could compute these two counts and estimate the probability from Eq. 3.2. But even the entire web isn't big enough to give us good estimates for counts of entire sentences. This is because language is creative; new sentences are invented all the time, and we can't expect to get accurate counts for such large objects as entire sentences. For this reason, we'll need more clever ways of estimating these probabilities.
How to compute P(W) or P(wn|w1, …wn-1)

How to compute the joint probability P(W):

P(The, water, of, Walden, Pond, is, so, beautifully, blue)

Intuition: let’s rely on the Chain Rule of Probability


Reminder: The Chain Rule
Recall the definition of conditional probabilities
P(B|A) = P(A,B)/P(A) Rewriting: P(A,B) = P(A) P(B|A)

More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute the joint probability of words in a sentence

Applying the chain rule to words, we get:

P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) … P(wn|w1:n-1) = ∏k=1..n P(wk|w1:k-1)

P("The water of Walden Pond") =
    P(The) × P(water|The) × P(of|The water)
    × P(Walden|The water of) × P(Pond|The water of Walden)

The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words: we could estimate the joint probability of an entire sequence by multiplying together a number of conditional probabilities. But this doesn't really seem to help us, since we don't know any way to compute the exact probability of a word given a long string of preceding words.
Markov Assumption

The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words. The bigram model, for example, approximates the probability of a word given all the previous words P(wn|w1:n-1) by using only the conditional probability of the preceding word P(wn|wn-1).

Simplifying assumption (Andrei Markov):

P(blue|The water of Walden Pond is so beautifully) ≈ P(blue|beautifully)

When we use a bigram model to predict the conditional probability of the next word, we are making the following approximation:

P(wn|w1:n-1) ≈ P(wn|wn-1)
Bigram Markov Assumption

Instead of:
P(blue|The water of Walden Pond is so beautifully)
we approximate it with:
P(blue|beautifully)

Given the bigram assumption, we can predict the probability of a complete word sequence by substituting it into the chain rule:

P(w1:n) ≈ ∏k=1..n P(wk|wk-1)

We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram. We'll use N for the n-gram size, so N = 2 means bigrams and N = 3 means trigrams. More generally, we approximate each component in the product as follows:

P(wn|w1:n-1) ≈ P(wn|wn-N+1:n-1)
Simplest case: Unigram model

P(w1 w2 … wn) ≈ ∏i P(wi)
Some automatically generated sentences from two different unigram models
To him swallowed confess hear both . Which . Of save on trail
for are ay device and rote life have

Hill he late speaks ; or ! a more to leg less first you enter



Months the my and issue of year foreign new exchange’s September

were recession exchange new endorsed a acquire to six executives


Bigram model
P(wi | w1 w2 … wi−1) ≈ P(wi | wi−1)
Some automatically generated sentences from two different bigram models
Why dost stand forth thy canopy, forsooth; he is this palpable hit
the King Henry. Live king. Follow.

What means, sir. I confess she? then all sorts, he is trim, captain.

Last December through the way to preserve the Hudson corporation N.


B. E. C. Taylor would seem to complete the major central planners
one gram point five percent of U. S. E. has already old M. X.
corporation of living

on information such as more frequently fishing to keep her


Problems with N-gram models
• N-grams can't handle long-distance dependencies:
“The soups that I made from that new cookbook I
bought yesterday were amazingly delicious."
• N-grams don't do well at modeling new sequences
with similar meanings
The solution: Large language models
• can handle much longer contexts
• because of using embedding spaces, can model
synonymy better, and generate better novel strings
Why N-gram models?
A nice clear paradigm that lets us introduce many of
the important issues for large language models
• training and test sets
• the perplexity metric
• sampling to generate sentences
• ideas like interpolation and backoff
N-gram Language Modeling: Introduction to N-grams

N-gram Language Modeling: Estimating N-gram Probabilities
Estimating bigram probabilities: The Maximum Likelihood Estimate

To compute a particular bigram probability of a word wn given a previous word wn-1, we compute the count of the bigram C(wn-1 wn) and normalize by the sum of all the bigrams that share the same first word wn-1:

P(wn|wn-1) = C(wn-1 wn) / Σw C(wn-1 w)

We can simplify this equation, since the sum of all bigram counts that start with a given word wn-1 must be equal to the unigram count for that word wn-1 (take a moment to be convinced of this):

P(wn|wn-1) = C(wn-1 wn) / C(wn-1)

Let's work through an example using a mini-corpus of three sentences.
An example

P(wi | wi−1) = c(wi−1, wi) / c(wi−1)

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>


More examples:
Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
tell me about chez panisse
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize by unigrams:

Result:
Bigram estimates of sentence probabilities

P(<s> I want english food </s>) =


P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= .000031
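To make the arithmetic concrete, here is a small sketch of that product (my own code). Of these values, only P(I|<s>) = .25 and P(english|want) = .0011 appear explicitly in the excerpts below; the other three are illustrative assumptions chosen to be consistent with the .000031 result:

bigram_probs = {
    ("<s>", "I"): 0.25,           # shown on the next slide
    ("I", "want"): 0.33,          # illustrative assumption
    ("want", "english"): 0.0011,  # shown on the next slide
    ("english", "food"): 0.5,     # illustrative assumption
    ("food", "</s>"): 0.68,       # illustrative assumption
}

p = 1.0
for value in bigram_probs.values():
    p *= value
print(p)   # ≈ 0.000031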
What kinds of knowledge do N-grams represent?

P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = .25
Dealing with scale in large n-gram models

In practice, language models can be very large, leading to practical issues.

Log probabilities. Language model probabilities are always stored and computed in log format, i.e., as log probabilities. This is because probabilities are (by definition) less than or equal to 1, so the more probabilities we multiply together, the smaller the product becomes; multiplying enough n-grams together would result in numerical underflow. Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them, and the results do not underflow:

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4

We do all computation and storage in log space, and convert back into probabilities only if we need to report them at the end, by taking the exp of the logprob:

p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)
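A tiny sketch (my own) of why log space matters: the direct product underflows to zero, while the sum of logs stays well-behaved.

import math

probs = [0.1] * 400                       # many small per-word probabilities

print(math.prod(probs))                   # 0.0  (numerical underflow)
log_prob = sum(math.log(p) for p in probs)
print(log_prob)                           # ≈ -921.0, perfectly representable
# Convert back only if a probability is really needed; here math.exp(log_prob)
# would itself underflow, so we report the log probability instead.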


Larger ngrams
4-grams, 5-grams
Large datasets of large n-grams have been released
• N-grams from Corpus of Contemporary American English (COCA)
1 billion words (Davies 2020)
• Google Web 5-grams (Franz and Brants 2006), 1 trillion words
• Efficiency: quantize probabilities to 4-8 bits instead of 8-byte float
Newest model: infini-grams (∞-grams) (Liu et al 2024)
• No precomputing! Instead, store 5 trillion words of web text in
suffix arrays. Can compute n-gram probabilities with any n!
N-gram LM Toolkits

SRILM
◦ http://www.speech.sri.com/projects/srilm/
KenLM
◦ https://kheafield.com/code/kenlm/
N-gram Language Modeling: Estimating N-gram Probabilities

Language Modeling: Evaluation and Perplexity
How to evaluate N-gram models

"Extrinsic (in-vivo) Evaluation"


To compare models A and B
1. Put each model in a real task
• Machine Translation, speech recognition, etc.
2. Run the task, get a score for A and for B
• How many words translated correctly
• How many words transcribed correctly
3. Compare accuracy for A and B
Intrinsic (in-vitro) evaluation
Extrinsic evaluation not always possible
• Expensive, time-consuming
• Doesn't always generalize to other applications
Intrinsic evaluation: perplexity
• Directly measures language model performance at
predicting words.
• Doesn't necessarily correspond with real application
performance
• But gives us a single general metric for language models
• Useful for large language models (LLMs) as well as n-grams
Training sets and test sets
We train parameters of our model on a training set.
We test the model’s performance on data we
haven’t seen.
◦ A test set is an unseen dataset; different from training set.
◦ Intuition: we want to measure generalization to unseen data
◦ An evaluation metric (like perplexity) tells us how well
our model does on the test set.
Choosing training and test sets

• If we're building an LM for a specific task


• The test set should reflect the task language we
want to use the model for
• If we're building a general-purpose model
• We'll need lots of different kinds of training
data
• We don't want the training set or the test set to
be just from one domain or author or language.
Training on the test set
We can’t allow test sentences into the training set
• Or else the LM will assign that sentence an artificially
high probability when we see it in the test set
• And hence assign the whole test set a falsely high
probability.
• Making the LM look better than it really is
This is called “Training on the test set”
Bad science!
Dev sets
• If we test on the test set many times we might
implicitly tune to its characteristics
• Noticing which changes make the model better.
• So we run on the test set only once, or a few times
• That means we need a third dataset:
• A development test set, or devset.
• We test our LM on the devset until the very end
• And then test our LM on the test set once
Intuition of perplexity as evaluation metric:
How good is our language model?
Intuition: A good LM prefers "real" sentences
• Assigns higher probability to “real” or “frequently observed” sentences
• Assigns lower probability to “word salad” or “rarely observed” sentences
Intuition of perplexity 2: Predicting upcoming words

The Shannon Game (Claude Shannon): How well can we predict the next word?
• Once upon a ____   (time 0.9, dream 0.03, midnight 0.02, …, and 1e-100)
• That is a picture of a ____
• For breakfast I ate my usual ____
Unigrams are terrible at this game (Why?)

A good LM is one that assigns a higher probability to the next word that actually occurs.
Intuition of perplexity 3: The best language model
is one that best predicts the entire unseen test set
• We said: a good LM is one that assigns a higher
probability to the next word that actually occurs.
• Let's generalize to all the words!
• The best LM assigns high probability to the entire test
set.
• When comparing two LMs, A and B
• We compute PA(test set) and PB(test set)
• The better LM will give a higher probability to (=be less
surprised by) the test set than the other LM.
Intuition of perplexity 4: Use perplexity instead of
raw probability
• Probability depends on size of test set
• Probability gets smaller the longer the text
• Better: a metric that is per-word, normalized by length
• Perplexity is the inverse probability of the test set,
normalized by the number of words
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)
Intuition of perplexity 5: the inverse
Perplexity is the inverse probability of the test set,
normalized by the number of words
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

(The inverse comes from the original definition of perplexity


from cross-entropy rate in information theory)
Probability range is [0,1], perplexity range is [1,∞]
Minimizing perplexity is the same as maximizing probability
Intuition of perplexity 6: N-grams
PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

Expanding P(w1 … wN) with the chain rule:

PP(W) = ( ∏i=1..N 1 / P(wi | w1 … wi-1) )^(1/N)

For bigrams:

PP(W) = ( ∏i=1..N 1 / P(wi | wi-1) )^(1/N)
Intuition of perplexity 7:
Weighted average branching factor
Perplexity is also the weighted average branching factor of a language.
Branching factor: number of possible next words that can follow any word
Example: Deterministic language L = {red,blue, green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A, where each word follows any other word with equal probability ⅓.
Given a test set T = "red red red red blue":
PerplexityA(T) = PA(red red red red blue)^(-1/5) = ((⅓)^5)^(-1/5) = (⅓)^(-1) = 3
But now suppose red was very likely in the training set, such that for LM B:
◦ P(red) = .8, P(green) = .1, P(blue) = .1
We would expect the probability to be higher, and hence the perplexity to be smaller:
PerplexityB(T) = PB(red red red red blue)^(-1/5)
= (.8 × .8 × .8 × .8 × .1)^(-1/5) = .04096^(-1/5) ≈ 1.89
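A quick sketch (my own check, not from the notes) that reproduces these two perplexity numbers directly from the definition:

import math

def perplexity(word_probs):
    """Perplexity = inverse probability of the test set, normalized by length:
    exp(-(1/N) * sum(log P(w_i)))."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# Test set T = "red red red red blue"
print(perplexity([1/3, 1/3, 1/3, 1/3, 1/3]))   # LM A: 3.0
print(perplexity([0.8, 0.8, 0.8, 0.8, 0.1]))   # LM B: ≈ 1.89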
Holding test set constant:
Lower perplexity = better language model

Training 38 million words, test 1.5 million words, WSJ

N-gram order:   Unigram   Bigram   Trigram
Perplexity:     962       170      109
Language Modeling: Evaluation and Perplexity

Language Modeling: Sampling and Generalization
The Shannon (1948) Visualization Method
Sample words from an LM
Unigram:
Claude Shannon

REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME


CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO
OF TO EXPERT GRAY COME TO FURNISHES THE LINE
MESSAGE HAD BE THESE.

Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER
THAT THE CHARACTER OF THIS POINT IS THEREFORE
ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO
EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
How Shannon sampled those words in 1948

"Open a book at random and select a letter at random on the page.


This letter is recorded. The book is then opened to another page
and one reads until this letter is encountered. The succeeding
letter is then recorded. Turning to another page this second letter
is searched for and the succeeding letter recorded, etc."
Sampling a word from a distribution

[Figure: the interval from 0 to 1 is divided into segments whose widths are the unigram probabilities, e.g. the (.06), of (.03), a (.02), to (.02), in (.02), …, however (p = .0003), …, polyphonic (p = .0000018); the cumulative boundaries fall at .06, .09, .11, .13, .15, …, .66, …, .99. To sample a word, pick a random point between 0 and 1 and return the word whose segment contains it.]
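A minimal sketch of that sampling procedure (my own code, with a toy distribution in which "other" lumps together the rest of the vocabulary; the values are illustrative, not the figure's actual corpus):

import bisect
import random
from itertools import accumulate

words = ["the", "of", "a", "to", "in", "other"]
probs = [0.06, 0.03, 0.02, 0.02, 0.02, 0.85]

boundaries = list(accumulate(probs))    # 0.06, 0.09, 0.11, 0.13, 0.15, 1.0

def sample_word():
    """Pick a random point in [0, 1) and return the word whose segment contains it."""
    idx = bisect.bisect_left(boundaries, random.random())
    return words[min(idx, len(words) - 1)]   # guard against float rounding at the top end

print([sample_word() for _ in range(5)])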
Visualizing Bigrams the Shannon Way

Choose a random bigram (<s>, w) according to its probability P(w|<s>)
Then choose a random bigram (w, x) according to its probability P(x|w)
And so on until we choose </s>
Then string the words together:

<s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>

I want to eat Chinese food
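A rough sketch of this generation loop in Python (the toy corpus and names are my own; a real model would be trained on far more text and would use the smoothed probabilities discussed later):

import random
from collections import Counter, defaultdict

corpus = [
    "<s> I want to eat Chinese food </s>",
    "<s> I want to eat </s>",
    "<s> I like Chinese food </s>",
]

successors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1

def generate():
    """Start from <s>, repeatedly sample the next word according to P(x|w), stop at </s>."""
    word, output = "<s>", []
    while True:
        counts = successors[word]
        word = random.choices(list(counts), weights=list(counts.values()))[0]
        if word == "</s>":
            return " ".join(output)
        output.append(word)

print(generate())   # e.g. "I want to eat Chinese food"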
Note: there are other sampling methods
Used for neural language models
Many of them avoid generating words from the very
unlikely tail of the distribution
We'll discuss when we get to neural LM decoding:
◦ Temperature sampling
◦ Top-k sampling
◦ Top-p sampling
Approximating Shakespeare

We can use the sampling method from the prior section to visualize both of these facts! To give an intuition for the increasing power of higher-order n-grams, Fig. 3.4 shows random sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare's works.

–To him swallowed confess hear both. Which. Of save on trail for are ay device and
1gram rote life have
–Hill he late speaks; or! a more to leg less first you enter
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live
2gram king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say,
3gram ’tis done.
–This shall forbid it should be branded, if renown made it empty.
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A
4gram great banquet serv’d in;
–It cannot be but so.
Figure 3.4 Eight sentences randomly generated from four n-grams computed from Shakespeare's works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
Shakespeare as corpus

N=884,647 tokens, V=29,066


Shakespeare produced 300,000 bigram types out of
V² = 844 million possible bigrams.

◦ So 99.96% of the possible bigrams were never seen (have


zero entries in the table)
◦ That sparsity is even worse for 4-grams, explaining why
our sampling generated actual Shakespeare.
The Wall Street Journal is not Shakespeare

1gram Months the my and issue of year foreign new exchange’s september
were recession exchange new endorsed a acquire to six executives
Last December through the way to preserve the Hudson corporation N.
2gram B. E. C. Taylor would seem to complete the major central planners one
point five percent of U. S. E. has already old M. X. corporation of living
on information such as more frequently fishing to keep her
They also point to ninety nine point six billion dollars from two hundred
3gram four oh six three percent of the rates of interest stores as Mexico and
Brazil on market conditions
Figure 3.5 Three sentences randomly generated from three n-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words.
Can you guess the author? These 3-gram sentences
are sampled from an LM trained on who?
1) They also point to ninety nine point
six billion dollars from two hundred four
oh six three percent of the rates of
interest stores as Mexico and Brazil
on market conditions
2) This shall forbid it should be branded,
if renown made it empty.
3) “You are uniformly charming!” cried he,
with a smile of associating and now and
then I bowed and they perceived a chaise
and four to wish for.
Choosing training data
If task-specific, use a training corpus that has a similar
genre to your task.
• If legal or medical, need lots of special-purpose documents
Make sure to cover different kinds of dialects and
speaker/authors.
• Example: African-American Vernacular English (AAVE)
• One of many varieties that can be used by African Americans and others
• Can include the auxiliary verb finna that marks immediate future tense:
• "My phone finna die"
The perils of overfitting
N-grams only work well for word prediction if the
test corpus looks like the training corpus
• But even when we try to pick a good training
corpus, the test set will surprise us!
• We need to train robust models that generalize!
One kind of generalization: Zeros
• Things that don’t ever occur in the training set
• But occur in the test set
Zeros

Training set:            Test set:
… ate lunch              … ate lunch
… ate dinner             … ate breakfast
… ate a
… ate the

P(“breakfast” | ate) = 0
Zero probability bigrams
Bigrams with zero probability
◦ Will hurt our performance for texts where those words
appear!
◦ And mean that we will assign 0 probability to the test set!
And hence we cannot compute perplexity (can’t
divide by 0)!
Language Modeling: Sampling and Generalization

N-gram Language Modeling: Smoothing, Interpolation, and Backoff
The intuition of smoothing (from Dan Klein)

When we have sparse statistics:

P(w | denied the)
  3 allegations
  2 reports
  1 claims
  1 request
  7 total

Steal probability mass to generalize better:

P(w | denied the)
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2 other
  7 total
Add-one estimation

Also called Laplace smoothing: pretend we saw each word one more time than we did. Just add one to all the counts!

MLE estimate:

P_MLE(wn|wn-1) = C(wn-1 wn) / C(wn-1)

Add-1 estimate:

P_Laplace(wn|wn-1) = (C(wn-1 wn) + 1) / Σw (C(wn-1 w) + 1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)

Recall that normal bigram probabilities are computed by normalizing bigram counts by the unigram count of the first word. For add-one smoothed bigram probabilities, we also need to augment the unigram count in the denominator by the number of total word types in the vocabulary V.

Figure 3.6 Add-one smoothed bigram counts for eight of the words (out of V) in the Berkeley Restaurant Project corpus of 9332 sentences. Previously-zero counts are in gray.

Figure 3.7 shows the add-one smoothed probabilities for these bigrams.
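A minimal sketch of add-one smoothing in code (my own illustration, reusing the three-sentence mini-corpus from the estimation section rather than the restaurant corpus):

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

V = len(unigram_counts)   # number of word types (12 here)

def p_laplace(prev, word):
    """P_Laplace(word | prev) = (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(p_laplace("I", "am"))          # (2 + 1) / (3 + 12) = 0.2
print(p_laplace("am", "breakfast"))  # unseen bigram, no longer zero: 1 / (2 + 12)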
Maximum Likelihood Estimates
The maximum likelihood estimate
◦ of some parameter of a model M from a training set T
◦ maximizes the likelihood of the training set T given the model M
Suppose the word “bagel” occurs 400 times in a corpus of a million words
What is the probability that a random word from some other text will be
“bagel”?
MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus
◦ But it is the estimate that makes it most likely that “bagel” will occur 400 times
in a million word corpus.
Berkeley Restaurant Corpus: Laplace
smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
Add-1 estimation is a blunt instrument
So add-1 isn’t used for N-grams:
◦ Generally we use interpolation or backoff instead
But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.
Backoff and Interpolation
Sometimes it helps to use less context
◦ Condition on less context for contexts you know less about
Backoff:
◦ use trigram if you have good evidence,
◦ otherwise bigram, otherwise unigram
Interpolation:
◦ mix unigram, bigram, trigram

Interpolation works better


unigram counts. P̂(wn |wn 2 wn 1 ) = l1 P(wn |wn
Linear Interpolation
In simple linear interpolation, we combine different order N-grams by linearly
interpolating all the models. Thus, we estimate the trigram probability P(wn |wn 2 wn +l
1) 2
P(wn |
by mixing together the unigram, bigram, and trigram probabilities, each weighted
by a l : +l3 P(wn )
Simple interpolation
such that the l s sum to
P̂(w |w w ) = l P(w |w w
1: X
n n 2 n 1 1 n n 2 n 1)
+l2 P(wn |wn 1 ) li = 1
+l3 P(wn ) i (4.24)
Lambdas conditional on
a slightly
X context:
such that the l s sum to 1:In
more sophisticated version of linear i
computed in alimore =1 sophisticated way,(4.25) by condition
i
if we have particularly accurate counts for
In a slightly more sophisticated version of linear interpolation, each l weight is
a particul
computed in a morecounts of way,
sophisticated the bytrigrams
conditioningbased
on the on thisThis
context. bigram
way, will be
if we have particularly accurate counts for a particular bigram, we assume that the
make the l s for those trigrams higher
counts of the trigrams based on this bigram will be more trustworthy, so we can
and thus give
make the l s for those trigrams higher and thus give that trigram more weight in
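A small sketch of simple interpolation (my own illustration; p_trigram, p_bigram, and p_unigram stand in for MLE estimators you would have trained separately):

def p_interp(w, w_prev2, w_prev1, lambdas, p_trigram, p_bigram, p_unigram):
    """P-hat(w | w_prev2 w_prev1) = l1*P(w|w_prev2 w_prev1) + l2*P(w|w_prev1) + l3*P(w)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "the lambdas must sum to 1"
    return (l1 * p_trigram(w, w_prev2, w_prev1)
            + l2 * p_bigram(w, w_prev1)
            + l3 * p_unigram(w))

# e.g. p_interp("food", "want", "english", (0.6, 0.3, 0.1), p_tri, p_bi, p_uni)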
How to set λs for interpolation?
Use a held-out corpus
Training Data | Held-Out Data | Test Data

Choose λs to maximize probability of held-out data:


◦ Fix the N-gram probabilities (on the training data)
◦ Then search for λs that give the largest probability to the held-out set (a simple search is sketched below)
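A rough grid-search sketch of that idea (my own code; the helper estimators and the held-out data format are hypothetical, and real toolkits often learn the λs with EM rather than a grid):

import itertools
import math

def choose_lambdas(heldout_trigrams, p_trigram, p_bigram, p_unigram, step=0.1):
    """Grid-search the interpolation weights that maximize the log probability
    of held-out data, keeping the n-gram probabilities themselves fixed.
    heldout_trigrams: iterable of (w_prev2, w_prev1, w) tuples."""
    data = list(heldout_trigrams)
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    best, best_ll = None, float("-inf")
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        l3 = max(l3, 0.0)
        ll = 0.0
        for w2, w1, w in data:
            p = l1 * p_trigram(w, w2, w1) + l2 * p_bigram(w, w1) + l3 * p_unigram(w)
            ll += math.log(p) if p > 0 else float("-inf")
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best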
Backoff
Suppose you want:
P(pancakes| delicious soufflé)
If the trigram probability is 0, use the bigram
P(pancakes| soufflé)
If the bigram probability is 0, use the unigram
P(pancakes)
Complication: need to discount the higher-order ngram so
probabilities don't sum higher than 1 (e.g., Katz backoff)
Stupid Backoff
Backoff without discounting (not a true probability)
S(wi | wi-k+1 … wi-1) = count(wi-k+1 … wi) / count(wi-k+1 … wi-1)   if count(wi-k+1 … wi) > 0
                      = 0.4 · S(wi | wi-k+2 … wi-1)                 otherwise

S(wi) = count(wi) / N
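A compact sketch of stupid backoff (my own toy corpus and naming; the recursion bottoms out at the unigram relative frequency):

from collections import Counter

# Toy corpus (two sentences concatenated for simplicity)
tokens = "<s> I want to eat Chinese food </s> <s> I want to eat </s>".split()
N = len(tokens)

ngram_counts = Counter()
for n in range(1, 4):                             # unigrams, bigrams, trigrams
    for i in range(len(tokens) - n + 1):
        ngram_counts[tuple(tokens[i:i + n])] += 1

def stupid_backoff(word, context, alpha=0.4):
    """S(word | context): relative frequency if the full n-gram was seen,
    otherwise back off to a shorter context, scaled by alpha."""
    context = tuple(context)
    if not context:
        return ngram_counts[(word,)] / N          # unigram base case
    full = context + (word,)
    if ngram_counts[full] > 0:
        return ngram_counts[full] / ngram_counts[context]
    return alpha * stupid_backoff(word, context[1:], alpha)

print(stupid_backoff("eat", ("want", "to")))      # seen trigram: 2/2 = 1.0
print(stupid_backoff("food", ("want", "to")))     # backs off twice: 0.4 * 0.4 * (1/14)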
N-gram Language Modeling: Interpolation and Backoff
