N Grams - Nptel Notes
Example: predicting the next word in "The water of Walden Pond is so beautifully ___"
Likely continuations: blue, green, clear
Unlikely: *refrigerator, *that
Language Models
Systems that can predict upcoming words
• Can assign a probability to each potential next word
• Can assign a probability to a whole sentence
Why word prediction?
It's a helpful part of language tasks
• Grammar or spell checking
Their are two midterms → There are two midterms
Everything has improve → Everything has improved
• Speech recognition
I will be back soonish (not: I will be bassoon dish)
Why word prediction?
It's how large language models (LLMs) work!
LLMs are trained to predict words
• Left-to-right (autoregressive) LMs learn to predict next word
LLMs generate text by predicting words
• By predicting the next word over and over again
Language Modeling (LM) more formally
Goal: compute the probability of a sentence or
sequence of words W:
P(W) = P(w1,w2,w3,w4,w5…wn)
Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4) or P(wn|w1,w2…wn-1)
An LM computes either of these:
P(W) or P(wn|w1,w2…wn-1)
P(blue | The water of Walden Pond is so beautifully)   (3.1)

How to estimate these probabilities
Could we just count and divide?

One way to estimate this probability is directly from relative frequency counts: take a very large corpus, count the number of times we see The water of Walden Pond is so beautifully, and count the number of times this is followed by blue. This would be answering the question "Out of the times we saw the history h, how many times was it followed by the word w", as follows:

P(blue | The water of Walden Pond is so beautifully)
  = C(The water of Walden Pond is so beautifully blue) / C(The water of Walden Pond is so beautifully)   (3.2)

No! Too many possible sentences! We'll never see enough data for estimating these.

If we had a large enough corpus, we could compute these two counts and estimate the probability from Eq. 3.2. But even the entire web isn't big enough to give us good estimates for counts of entire sentences. This is because language is creative; new sentences are invented all the time, and we can't expect to get accurate counts for such large objects as entire sentences. For this reason, we'll need more clever ways of estimating the probability of a word w given a history h.
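Before moving on, here is a minimal Python sketch of the count-and-divide estimate from Eq. 3.2; the tiny corpus and the relative_frequency helper are made up for illustration, standing in for a very large corpus (and, as argued above, this approach does not scale to real histories):

def relative_frequency(history, word, tokens):
    """Estimate P(word | history) as C(history word) / C(history)."""
    h = history.split()
    n = len(h)
    history_count = 0
    followed_count = 0
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == h:
            history_count += 1
            if tokens[i + n] == word:
                followed_count += 1
    return followed_count / history_count if history_count else 0.0

# Toy stand-in for "a very large corpus"
tokens = ("the water of walden pond is so beautifully blue "
          "the water of walden pond is so beautifully clear").split()
print(relative_frequency("is so beautifully", "blue", tokens))  # 0.5 on this toy corpus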
How to compute P(W) or P(wn|w1, …wn-1)
More variables:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute joint
probability of words in sentence
Applying the chain rule to words, we get:

P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) … P(wn|w1:n-1) = ∏ (k = 1..n) P(wk | w1:k-1)

The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words. Equation 3.4 suggests that we could estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities. But the chain rule doesn't really seem to help us! We don't know any way to compute the exact probability of a word given a long sequence of preceding words.

More generally, we approximate each component in the product as follows:

P(wn | w1:n-1) ≈ P(wn | wn-N+1:n-1)
Simplest case: Unigram model
P(w1 w2 … wn) ≈ ∏i P(wi)
Some automatically generated sentences (the first from a unigram model, the second from a bigram model; cf. Figure 3.4 below)
To him swallowed confess hear both . Which . Of save on trail
for are ay device and rote life have
What means, sir. I confess she? then all sorts, he is trim, captain.
More examples:
Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
tell me about chez panisse
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences [bigram count table not shown]
Raw bigram probabilities
Normalize by unigrams (divide each bigram count by the unigram count of the first word) [probability table not shown]
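A minimal Python sketch of how such bigram counts and probabilities can be produced, using just the four example sentences above in place of the full corpus (the sentence list and function names here are illustrative):

from collections import Counter

sentences = [
    "can you tell me about any good cantonese restaurants close by",
    "tell me about chez panisse",
    "i'm looking for a good place to eat breakfast",
    "when is caffe venezia open during the day",
]

unigram_counts = Counter()
bigram_counts = Counter()
for s in sentences:
    tokens = ["<s>"] + s.split() + ["</s>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # Normalize the bigram count by the unigram count of the preceding word
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("tell", "me"))  # 1.0 on these four sentences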
Bigram estimates of sentence probabilities
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = .25
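A small sketch of how such bigram estimates combine into a sentence probability; the P(want | i) value below is not from the list above and is filled in purely for illustration:

bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,   # illustrative value, not listed above
    ("want", "to"): 0.66,
    ("to", "eat"): 0.28,
}

def sentence_prob(tokens):
    p = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        p *= bigram_p.get((prev, w), 0.0)  # unseen bigrams get probability 0
    return p

print(sentence_prob(["<s>", "i", "want", "to", "eat"]))  # about 0.0152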
3.1.3 Dealing with scale in large n-gram models
In practice, language models can be very large, leading to practical issues.

Log probabilities
LM probabilities are always stored and computed in log format, i.e. as log probabilities. This is because probabilities are (by definition) less than or equal to 1, and so the more probabilities we multiply together, the smaller the product becomes. Multiplying enough n-grams together would result in numerical underflow. Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them. By adding log probabilities instead of multiplying probabilities, we get results that are not as small:

log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4

We do all computation and storage in log space, and if we need to report probabilities at the end we can do one exp of the logprob to convert back.
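A quick sketch of why log space matters; the numbers are arbitrary small probabilities:

import math

probs = [0.0001] * 100

product = 1.0
for p in probs:
    product *= p
print(product)            # 0.0 -- the product has underflowed

logprob = sum(math.log(p) for p in probs)
print(logprob)            # about -921.03, easily representable
print(math.exp(logprob))  # converting back underflows again, so stay in log space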
SRILM
◦ http://www.speech.sri.com/projects/srilm/
KenLM
◦ https://kheafield.com/code/kenlm/
N-gram Language Modeling: Estimating N-gram Probabilities

Language Modeling: Evaluation and Perplexity
How to evaluate N-gram models
Intuition of perplexity 5: the inverse
Perplexity is the inverse probability of the test set, normalized by the number of words:

PP(W) = P(w1 w2 … wN)^(-1/N) = (1 / P(w1 w2 … wN))^(1/N)

Chain rule:
PP(W) = ( ∏ (i = 1..N) 1 / P(wi | w1 … wi-1) )^(1/N)

Bigrams:
PP(W) = ( ∏ (i = 1..N) 1 / P(wi | wi-1) )^(1/N)
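A minimal sketch of computing perplexity in log space for a bigram model; bigram_logprob is a placeholder for whatever trained model is available:

import math

def perplexity(tokens, bigram_logprob):
    # tokens includes a leading <s>; we predict the remaining N words
    N = len(tokens) - 1
    total_logprob = sum(bigram_logprob(prev, w) for prev, w in zip(tokens, tokens[1:]))
    return math.exp(-total_logprob / N)

# Sanity check: a uniform model over a 3-word vocabulary gives perplexity 3
uniform = lambda prev, w: math.log(1 / 3)
print(perplexity(["<s>", "red", "red", "blue"], uniform))  # about 3.0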
Intuition of perplexity 7:
Weighted average branching factor
Perplexity is also the weighted average branching factor of a language.
Branching factor: number of possible next words that can follow any word
Example: Deterministic language L = {red,blue, green}
Branching factor = 3 (any word can be followed by red, blue, green)
Now assume LM A where each word follows any other word with equal probability ⅓
Given a test set T = "red red red red blue"
PerplexityA(T) = PA(red red red red blue)^(-1/5) = ((1/3)^5)^(-1/5) = (1/3)^(-1) = 3
But now suppose red was very likely in the training set, so that for LM B:
◦ P(red) = .8, P(green) = .1, P(blue) = .1
We would expect the probability to be higher, and hence the perplexity to be smaller:
PerplexityB(T) = PB(red red red red blue)^(-1/5)
= (.8 × .8 × .8 × .8 × .1)^(-1/5) = .04096^(-1/5) = .527^(-1) = 1.89
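A one-line check of this arithmetic in Python:

print((0.8 ** 4 * 0.1) ** (-1 / 5))  # about 1.896, matching the 1.89 above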
Holding test set constant:
Lower perplexity = better language model
Language Modeling: Sampling and Generalization
The Shannon (1948) Visualization Method
Sample words from an LM
Unigram:
Claude Shannon
Bigram:
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER
THAT THE CHARACTER OF THIS POINT IS THEREFORE
ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO
EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
How Shannon sampled those words in 1948
[Figure: sampling a word at random according to its unigram probability, illustrated on a number line of word probabilities: "the of a to in" (p = .0003), … "however", … "polyphonic" (p = .0000018).]
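A small sketch of this sampling idea for a unigram model: lay the words along the 0-1 line in proportion to their probabilities and pick a random point (the tiny distribution here is a made-up stand-in):

import random

unigram_p = {"the": 0.4, "of": 0.2, "a": 0.2, "however": 0.15, "polyphonic": 0.05}

def sample_word(dist):
    r = random.random()
    cumulative = 0.0
    for word, p in dist.items():
        cumulative += p
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding at the top of the scale

print(" ".join(sample_word(unigram_p) for _ in range(10)))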
Approximating Shakespeare
To give an intuition for the increasing power of higher-order n-grams, Fig. 3.4 shows random sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare's works.
1gram:
–To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
–Hill he late speaks; or! a more to leg less first you enter
2gram:
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.
3gram:
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.
–This shall forbid it should be branded, if renown made it empty.
4gram:
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in;
–It cannot be but so.
Figure 3.4 Eight sentences randomly generated from four n-gram models computed from Shakespeare's works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
The Wall Street Journal is not Shakespeare
1gram: Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a acquire to six executives
2gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions
Figure 3.5 Three sentences randomly generated from three n-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words.
Can you guess the author? These 3-gram sentences
are sampled from an LM trained on who?
1) They also point to ninety nine point
six billion dollars from two hundred four
oh six three percent of the rates of
interest stores as Mexico and Brazil
on market conditions
2) This shall forbid it should be branded,
if renown made it empty.
3) “You are uniformly charming!” cried he,
with a smile of associating and now and
then I bowed and they perceived a chaise
and four to wish for.
Choosing training data
If task-specific, use a training corpus that has a similar
genre to your task.
• If legal or medical, need lots of special-purpose documents
Make sure to cover different kinds of dialects and
speakers/authors.
• Example: African-American Vernacular English (AAVE)
• One of many varieties that can be used by African Americans and others
• Can include the auxiliary verb finna that marks immediate future tense:
• "My phone finna die"
The perils of overfitting
N-grams only work well for word prediction if the
test corpus looks like the training corpus
• But even when we try to pick a good training
corpus, the test set will surprise us!
• We need to train robust models that generalize!
One kind of generalization: Zeros
• Things that don’t ever occur in the training set
• But occur in the test set
Zeros
Training set:
… ate lunch
… ate dinner
… ate a
… ate the

Test set:
… ate lunch
… ate breakfast

P("breakfast" | ate) = 0
Zero probability bigrams
Bigrams with zero probability
◦ Will hurt our performance for texts where those words
appear!
◦ And mean that we will assign 0 probability to the test set!
And hence we cannot compute perplexity (can’t
divide by 0)!
Language Modeling: Sampling and Generalization

N-gram Language Modeling: Smoothing, Interpolation, and Backoff
The intuition of smoothing (from Dan Klein)

Observed counts for P(w | denied the):
  3 allegations
  2 reports
  1 claims
  1 request
  (7 total)

Steal probability mass to generalize better
P(w | denied the) after smoothing:
  2.5 allegations
  1.5 reports
  0.5 claims
  0.5 request
  2 other
  (7 total)
Add-one estimation
Also called Laplace smoothing
Pretend we saw each word one more time than we did
Just add one to all the counts!

MLE estimate:
P_MLE(wn | wn-1) = C(wn-1 wn) / C(wn-1)

Add-1 estimate:
P_Laplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)

Figure 3.6 Add-one smoothed bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Previously-zero counts are in gray. [table not shown]

Figure 3.7 shows the add-one smoothed probabilities for the bigrams in Fig. 3.6. Recall that normal bigram probabilities are computed by normalizing counts by the unigram count:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types in the vocabulary V:

P_Laplace(wn | wn-1) = (C(wn-1 wn) + 1) / Σw (C(wn-1 w) + 1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
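A minimal Python sketch of the add-one estimate, assuming bigram_counts and unigram_counts are Counter-style dictionaries like the ones sketched earlier:

def laplace_bigram_prob(prev, word, bigram_counts, unigram_counts):
    V = len(unigram_counts)  # number of word types in the vocabulary
    return (bigram_counts.get((prev, word), 0) + 1) / (unigram_counts[prev] + V)

Previously-zero bigrams now get a small but nonzero probability of 1 / (C(wn-1) + V).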
Maximum Likelihood Estimates
The maximum likelihood estimate
◦ of some parameter of a model M from a training set T
◦ maximizes the likelihood of the training set T given the model M
Suppose the word “bagel” occurs 400 times in a corpus of a million words
What is the probability that a random word from some other text will be
“bagel”?
MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus
◦ But it is the estimate that makes it most likely that “bagel” will occur 400 times
in a million word corpus.
Berkeley Restaurant Corpus: Laplace
smoothed bigram counts
Laplace-smoothed bigrams
Reconstituted counts
Compare with raw bigram counts
Add-1 estimation is a blunt instrument
So add-1 isn’t used for N-grams:
◦ Generally we use interpolation or backoff instead
But add-1 is used to smooth other NLP models
◦ For text classification
◦ In domains where the number of zeros isn’t so huge.
Backoff and Interpolation
Sometimes it helps to use less context
◦ Condition on less context for contexts you know less about
Backoff:
◦ use trigram if you have good evidence,
◦ otherwise bigram, otherwise unigram
Interpolation:
◦ mix unigram, bigram, trigram (see the sketches below)

One simple backoff scheme, "stupid backoff", scores a word with a count ratio when the full n-gram has been seen, and otherwise backs off to a shorter context with a fixed weight of 0.4:

S(wi | wi-k+1 … wi-1) = count(wi-k+1 … wi) / count(wi-k+1 … wi-1)   if count(wi-k+1 … wi) > 0
                      = 0.4 · S(wi | wi-k+2 … wi-1)                 otherwise

S(wi) = count(wi) / N
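As a rough illustration, here is a minimal interpolation sketch; the three probability functions and the lambda weights are placeholders (in practice the lambdas are tuned on held-out data and should sum to 1):

def interpolated_prob(word, prev2, prev1, p_unigram, p_bigram, p_trigram,
                      lambdas=(0.1, 0.3, 0.6)):
    # Mix unigram, bigram, and trigram estimates with fixed weights.
    l1, l2, l3 = lambdas
    return (l1 * p_unigram(word)
            + l2 * p_bigram(word, prev1)
            + l3 * p_trigram(word, prev2, prev1))

And a minimal recursive sketch of the stupid-backoff score S above, assuming counts maps n-gram tuples of every order to their counts and N is the total number of tokens in the training corpus:

def stupid_backoff(word, context, counts, N, alpha=0.4):
    # context is a tuple of preceding words, e.g. (w_{i-2}, w_{i-1}).
    if context:
        ngram = context + (word,)
        if counts.get(ngram, 0) > 0:
            return counts[ngram] / counts[context]
        # Back off to a shorter context, discounted by alpha.
        return alpha * stupid_backoff(word, context[1:], counts, N, alpha)
    return counts.get((word,), 0) / N  # base case: S(w) = count(w) / N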
N-gram Language Modeling: Interpolation and Backoff