Chapter 3: Language Modeling
Introduction to N-grams
• More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute the joint probability of words in a sentence:
  P(w1 w2 … wn) = ∏_i P(wi | w1 w2 … wi-1)
For example:
  P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
Markov Assumption   (Andrei Markov)
• Simplifying assumption:
  P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe:
  P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
P(w1 w2 … wn) ≈ ∏_i P(wi)
Some automatically generated sentences from a unigram model
Bigram model
Condition on the previous word:
  P(wi | w1 w2 … wi-1) ≈ P(wi | wi-1)
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
• because language has long-distance dependencies:
“The computer which I had just put into the machine room on
the fifth floor crashed.”
The Maximum Likelihood Estimate for bigram probabilities:
  P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
               = c(wi-1, wi) / c(wi-1)
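As a concrete illustration of the MLE formula above, here is a minimal Python sketch (my own, not from the slides); the function name `mle_bigram_probs` is invented for the example.

```python
from collections import Counter

def mle_bigram_probs(sentences):
    """Estimate P(wi | wi-1) = c(wi-1, wi) / c(wi-1) by maximum likelihood."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigram_counts.update(tokens[:-1])              # context counts c(wi-1)
        bigram_counts.update(zip(tokens, tokens[1:]))   # bigram counts c(wi-1, wi)
    return {(prev, w): count / unigram_counts[prev]
            for (prev, w), count in bigram_counts.items()}

# Toy corpus; real models are trained on large text collections.
corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
probs = mle_bigram_probs(corpus)
print(probs[("<s>", "I")])   # 2/3: two of the three sentences start with "I"
print(probs[("I", "am")])    # 2/3: "I" occurs 3 times and is followed by "am" twice
```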
An example, and more examples: bigram counts and probabilities estimated from the Berkeley Restaurant Project sentences (the resulting count and probability tables are omitted here).
Practical Issues
• We do everything in log space
  • avoids underflow
  • (also, adding is faster than multiplying)
  log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
…
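A small sketch (my own illustration, using made-up probability values) of why log space matters:

```python
import math

# Multiplying many small per-word probabilities underflows to 0.0 in floating point.
probs = [1e-5] * 100            # hypothetical per-word probabilities
product = 1.0
for p in probs:
    product *= p
print(product)                  # 0.0 -- the true value, 1e-500, underflows

# Summing log probabilities stays comfortably within floating-point range.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)                 # about -1151.29
```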
Google N-Gram Release, August 2006:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Intuition of Perplexity
• The Shannon Game: How well can we predict the next word?
  I always order pizza with cheese and ____
  The 33rd President of the US was ____
  I saw a ____
  (e.g., for the pizza sentence: mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100)
• Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word that actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:
  PP(W) = P(w1 w2 ... wN)^(-1/N)
        = (1 / P(w1 w2 ... wN))^(1/N)
Chain rule:
  PP(W) = (∏_{i=1..N} 1 / P(wi | w1 ... wi-1))^(1/N)
For bigrams:
  PP(W) = (∏_{i=1..N} 1 / P(wi | wi-1))^(1/N)
Minimizing perplexity is the same as maximizing probability.
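A minimal sketch of perplexity for a bigram model, assuming a `bigram_prob(prev, word)` lookup such as the MLE table above (the function and variable names are mine, not the slides'):

```python
import math

def perplexity(test_sentences, bigram_prob):
    """PP(W) = exp(-(1/N) * sum_i log P(wi | wi-1)):
    the inverse probability of the test set, normalized by the number of words."""
    log_prob = 0.0
    n_words = 0
    for sentence in test_sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            # Unseen bigrams must be smoothed first, otherwise log(0) fails here.
            log_prob += math.log(bigram_prob(prev, w))
            n_words += 1
    return math.exp(-log_prob / n_words)
```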
Approximating Shakespeare
Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
• So 99.96% of the possible bigrams were never seen (have zero entries in the table)
• Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
Zeros
• Training set:                 • Test set:
  … denied the allegations        … denied the offer
  … denied the reports            … denied the loan
  … denied the claims
  … denied the request

• P(w | denied the) estimated from the training counts:
    3 allegations
    2 reports
    1 claims
    1 request
    7 total
  (words such as outcome, attack, man have count 0)
• Steal probability mass to generalize better; after smoothing, P(w | denied the) becomes roughly:
    2.5 allegations
    1.5 reports
    0.5 claims
    0.5 request
    2 other (outcome, attack, man, …)
    7 total
Add-one estimation: pretend we saw each n-gram one more time than we did, i.e. just add one to all the counts before normalizing.
Laplace-smoothed bigrams (smoothed count and probability tables omitted)
Reconstituted counts (table omitted)
Linear Interpolation
• Simple interpolation: mix the unigram, bigram, and trigram estimates
  P̂(wi | wi-2 wi-1) = λ1 P(wi | wi-2 wi-1) + λ2 P(wi | wi-1) + λ3 P(wi),   with λ1 + λ2 + λ3 = 1

Stupid backoff (score-based, no discounting):
  S(wi | wi-k+1 … wi-1) = count(wi-k+1 … wi) / count(wi-k+1 … wi-1)   if count(wi-k+1 … wi) > 0
                        = 0.4 · S(wi | wi-k+2 … wi-1)                  otherwise
  S(wi) = count(wi) / N
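A sketch of simple linear interpolation over MLE unigram, bigram, and trigram tables (dictionaries keyed by tuples). The λ values below are arbitrary placeholders; in practice they are tuned on held-out data.

```python
def interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri,
                      lambdas=(0.1, 0.3, 0.6)):
    """P-hat(w | prev2 prev1) = l1*P(w) + l2*P(w | prev1) + l3*P(w | prev2 prev1).
    The lambdas must sum to 1; these particular values are illustrative only."""
    l1, l2, l3 = lambdas
    return (l1 * p_uni.get(w, 0.0)
            + l2 * p_bi.get((prev1, w), 0.0)
            + l3 * p_tri.get((prev2, prev1, w), 0.0))
```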
Advanced: Good-Turing Smoothing
Add-1 estimation:
  P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
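A sketch of the add-1 estimate on top of raw bigram and context counts (dictionary-based; the function name is mine):

```python
def add_one_bigram_prob(prev, w, bigram_counts, context_counts, vocab_size):
    """P_Add-1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)."""
    return ((bigram_counts.get((prev, w), 0) + 1)
            / (context_counts.get(prev, 0) + vocab_size))
```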
Add-k estimation:
  P_Add-k(wi | wi-1) = (c(wi-1, wi) + k) / (c(wi-1) + kV)
Equivalently, with m = kV:
  P_Add-k(wi | wi-1) = (c(wi-1, wi) + m·(1/V)) / (c(wi-1) + m)
Unigram prior smoothing:
  P_UnigramPrior(wi | wi-1) = (c(wi-1, wi) + m·P(wi)) / (c(wi-1) + m)
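The same pattern covers add-k and the unigram prior; a sketch with illustrative default parameters (names mine):

```python
def add_k_bigram_prob(prev, w, bigram_counts, context_counts, vocab_size, k=0.05):
    """P_Add-k(w | prev) = (c(prev, w) + k) / (c(prev) + k*V)."""
    return ((bigram_counts.get((prev, w), 0) + k)
            / (context_counts.get(prev, 0) + k * vocab_size))

def unigram_prior_bigram_prob(prev, w, bigram_counts, context_counts, p_unigram, m=1.0):
    """P_UnigramPrior(w | prev) = (c(prev, w) + m*P(w)) / (c(prev) + m)."""
    return ((bigram_counts.get((prev, w), 0) + m * p_unigram.get(w, 0.0))
            / (context_counts.get(prev, 0) + m))
```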
Ney et al. Good-Turing Intuition (slide from Dan Klein)
(figure: held-out words vs. training counts, pairing N1→N0, N2→N1, N3→N2, …, N3511→N3510, N4417→N4416)
• Intuition from leave-one-out validation
  • Take each of the c training words out in turn
  • c training sets of size c-1, held-out set of size 1
• What fraction of held-out words are unseen in training?
  • N1/c
• What fraction of held-out words are seen k times in training?
  • (k+1)·N_{k+1}/c
• So in the future we expect (k+1)·N_{k+1}/c of the words to be those with training count k
• There are N_k words with training count k
• Each should occur with probability (k+1)·N_{k+1} / (c·N_k)
• …or expected count: k* = (k+1)·N_{k+1} / N_k
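A sketch of the adjusted count k* = (k+1)·N_{k+1}/N_k computed from a count-of-counts table, tried on the familiar fishing example (carp, perch, whitefish, trout, salmon, eel); a real implementation also smooths the N_k values for large, sparse k.

```python
from collections import Counter

def good_turing_adjusted_counts(word_counts):
    """Return k* = (k+1) * N_{k+1} / N_k for each observed count k
    (only where N_{k+1} > 0; high counts need smoothed N_k in practice)."""
    count_of_counts = Counter(word_counts.values())     # N_k
    adjusted = {}
    for k, n_k in count_of_counts.items():
        n_k_plus_1 = count_of_counts.get(k + 1, 0)
        if n_k_plus_1 > 0:
            adjusted[k] = (k + 1) * n_k_plus_1 / n_k
    return adjusted

catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
print(good_turing_adjusted_counts(catch))   # {2: 3.0, 1: 0.666...}: things seen once get k* = 2*N2/N1
```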
Good-Turing complications
(slide from Dan Klein)
Language Modeling
Advanced: Kneser-Ney Smoothing
Absolute discounting interpolation:
  P_AbsoluteDiscounting(wi | wi-1) = (c(wi-1, wi) - d) / c(wi-1) + λ(wi-1) · P(w)
  (interpolation weight λ times the unigram probability P(w))
Kneser-Ney Smoothing I
• Better estimate for probabilities of lower-order unigrams!
• Shannon game: I can't see without my reading ___________ ? (Francisco or glasses?)
• "Francisco" is more common than "glasses"
• … but "Francisco" always follows "San"
• The unigram is useful exactly when we haven't seen this bigram!
• Instead of P(w): "How likely is w?"
• Pcontinuation(w): "How likely is w to appear as a novel continuation?"
• For each word, count the number of bigram types it completes
• Every bigram type was a novel continuation the first time it was seen
  P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1, w) > 0}|
Kneser-Ney Smoothing II
• How many times does w appear as a novel continuation?
  P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1, w) > 0}|
• Normalized by the total number of bigram types:
  P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / |{(wj-1, wj) : c(wj-1, wj) > 0}|
• A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
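A sketch of the continuation probability: count the distinct left contexts each word completes and normalize by the total number of bigram types (function name mine):

```python
def continuation_probs(bigram_counts):
    """P_CONTINUATION(w) = |{v : c(v, w) > 0}| / |{(v, w') : c(v, w') > 0}|."""
    contexts_per_word = {}
    for (prev, w) in bigram_counts:          # keys are (prev, w) bigram types
        contexts_per_word.setdefault(w, set()).add(prev)
    total_bigram_types = len(bigram_counts)
    return {w: len(contexts) / total_bigram_types
            for w, contexts in contexts_per_word.items()}
```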
Kneser-Ney Smoothing IV
  P_KN(wi | wi-1) = max(c(wi-1, wi) - d, 0) / c(wi-1) + λ(wi-1) · P_CONTINUATION(wi)

λ is a normalizing constant: the probability mass we've discounted
  λ(wi-1) = (d / c(wi-1)) · |{w : c(wi-1, w) > 0}|
          = the normalized discount × the number of word types that can follow wi-1
            (= the number of word types we discounted = the number of times we applied the normalized discount)
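Putting the pieces together, a sketch of the interpolated Kneser-Ney bigram estimate. The discount d = 0.75 is a commonly used default, `p_cont` is the continuation-probability table sketched above, and all names are my own rather than an official implementation.

```python
def kneser_ney_bigram_prob(prev, w, bigram_counts, context_counts, p_cont, d=0.75):
    """P_KN(w | prev) = max(c(prev, w) - d, 0) / c(prev) + lambda(prev) * P_CONTINUATION(w)."""
    # Assumes `prev` was seen in training, so c(prev) > 0.
    c_prev = context_counts[prev]
    # Number of distinct word types observed after `prev` (each one was discounted by d).
    types_after_prev = sum(1 for (v, _) in bigram_counts if v == prev)
    lam = (d / c_prev) * types_after_prev
    discounted = max(bigram_counts.get((prev, w), 0) - d, 0) / c_prev
    return discounted + lam * p_cont.get(w, 0.0)
```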
The general recursive formulation:
  P_KN(wi | w_{i-n+1} … w_{i-1}) = max(c_KN(w_{i-n+1} … w_i) - d, 0) / c_KN(w_{i-n+1} … w_{i-1})
                                   + λ(w_{i-n+1} … w_{i-1}) · P_KN(wi | w_{i-n+2} … w_{i-1})
where c_KN(·) is the ordinary count for the highest-order n-gram and the continuation count for lower orders.