Session 2-3: Language Modeling
Introduction to N-grams
Dan Jurafsky
• More variables:
P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• The Chain Rule in General
P(x1,x2,x3,…,xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1,…,xn-1)
The Chain Rule applied to compute the joint probability of words in a sentence:
P(w1w2…wn) = ∏i P(wi | w1w2…wi-1)
P("its water is so transparent") = P(its) × P(water | its) × P(is | its water)
× P(so | its water is) × P(transparent | its water is so)
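To make the chain-rule factorization concrete, here is a minimal Python sketch that multiplies per-word conditional probabilities into a sentence probability; the numeric values are invented purely for illustration, not taken from any trained model.

```python
# A minimal sketch of the chain rule for a sentence, using made-up
# conditional probabilities purely for illustration.
sentence = ["its", "water", "is", "so", "transparent"]

# Hypothetical conditionals P(wi | w1 ... wi-1); not from real data.
cond_probs = [0.01, 0.2, 0.3, 0.1, 0.05]

p = 1.0
for prob in cond_probs:
    p *= prob  # P(w1..wn) = product of P(wi | w1..wi-1)
print(p)  # 3e-06
```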
Markov Assumption
• Simplifying assumption:
P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe
P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
• More formally, we approximate each component in the product:
P(wi | w1w2…wi-1) ≈ P(wi | wi-k…wi-1)
• Simplest case: the unigram model
P(w1w2…wn) ≈ ∏i P(wi)
Some automatically generated sentences from a unigram model
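As an illustration of what "generated from a unigram model" means, here is a minimal sketch that samples each word independently of context; the vocabulary and weights are invented for the example.

```python
import random

# A sketch of generation from a unigram model: each word is drawn
# independently of context. Vocabulary and weights are invented for
# illustration; random.choices treats them as relative weights.
unigram = {"the": 0.07, "of": 0.03, "to": 0.03, "shall": 0.01, "king": 0.005}
words = list(unigram)
weights = list(unigram.values())

sentence = random.choices(words, weights=weights, k=8)
print(" ".join(sentence))  # e.g. "the of the shall to the of king"
```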
Bigram model
Condition on the previous word:
P(wi | w1w2…wi-1) ≈ P(wi | wi-1)
N-gram models
• We can extend to trigrams, 4-grams, 5-grams
• In general this is an insufficient model of language
• because language has long-distance dependencies:
“The computer which I had just put into the machine room on
the fifth floor crashed.”
Estimating bigram probabilities: the Maximum Likelihood Estimate
P(wi | wi-1) = count(wi-1, wi) / count(wi-1) = c(wi-1, wi) / c(wi-1)
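A minimal sketch of this estimate in Python, counting bigrams over a toy three-sentence corpus; the corpus (lowercased, with sentence-boundary markers) is illustrative.

```python
from collections import Counter

# A sketch of the maximum likelihood estimate for bigrams:
# P(wi | wi-1) = c(wi-1, wi) / c(wi-1).
corpus = [
    ["<s>", "i", "am", "sam", "</s>"],
    ["<s>", "sam", "i", "am", "</s>"],
    ["<s>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"],
]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent, sent[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p_bigram(cur, prev):
    return bigram_counts[(prev, cur)] / context_counts[prev]

print(p_bigram("i", "<s>"))  # 2/3: "<s>" is followed by "i" in 2 of 3 sentences
print(p_bigram("am", "i"))   # 2/3
```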
An example
Mini-corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Bigram estimates:
P(I | <s>) = 2/3    P(Sam | <s>) = 1/3    P(am | I) = 2/3
P(</s> | Sam) = 1/2    P(Sam | am) = 1/2    P(do | I) = 1/3
More examples:
Berkeley Restaurant Project sentences
• Result: [table of raw bigram counts]
• Result: [table of raw bigram probabilities]
Practical Issues
• We do everything in log space
• Avoids underflow
• (also adding is faster than multiplying)
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
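A minimal sketch of the underflow problem and the log-space fix, with illustrative numbers:

```python
import math

# A sketch of why we work in log space: multiplying many small
# probabilities underflows to 0.0, while summing logs stays stable.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 -- the true value 1e-500 underflows

log_sum = sum(math.log(p) for p in probs)
print(log_sum)  # about -1151.3; still a usable score
```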
…
Google N-Gram Release, August 2006:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Intuition of Perplexity
• The Shannon Game: how well can we predict the next word?
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____
• Candidate continuations for the pizza example, with probabilities: mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100
• Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word that actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)
Perplexity is the inverse probability of the test set, normalized by the number of words:
PP(W) = P(w1w2…wN)^(-1/N) = (1 / P(w1w2…wN))^(1/N)
Chain rule:
PP(W) = (∏i 1 / P(wi | w1…wi-1))^(1/N)
For bigrams:
PP(W) = (∏i 1 / P(wi | wi-1))^(1/N)
Minimizing perplexity is the same as maximizing probability
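A minimal sketch computing bigram perplexity in log space; p_bigram is assumed to be a conditional-probability function like the MLE sketch earlier. Note that a single zero-probability bigram in the test set makes perplexity infinite, which motivates the smoothing methods below.

```python
import math

# A sketch of perplexity for a bigram model: the inverse probability of the
# test set, normalized by the number of words, computed in log space.
# p_bigram(cur, prev) is assumed to return P(cur | prev).
def perplexity(test_words, p_bigram):
    log_prob = 0.0
    n = 0
    for prev, cur in zip(test_words, test_words[1:]):
        log_prob += math.log(p_bigram(cur, prev))  # fails if P = 0 (see Zeros)
        n += 1
    # PP(W) = P(w1..wN)^(-1/N)  <=>  exp(-(1/N) * log P(w1..wN))
    return math.exp(-log_prob / n)
```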
Approximating Shakespeare
[sample sentences generated from unigram, bigram, trigram, and quadrigram models trained on Shakespeare]
Shakespeare as corpus
• N = 884,647 tokens, V = 29,066
• Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
• So 99.96% of the possible bigrams were never seen (have zero entries in the table)
• Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare
Zeros
• Training set:
… denied the allegations
… denied the reports
… denied the claims
… denied the request
• Test set:
… denied the offer
… denied the loan
P(offer | denied the) = 0
When we have sparse statistics, the observed counts for P(w | denied the) are:
3 allegations
2 reports
1 claims
1 request
7 total
and words like outcome, attack, and man are unseen in this context.
• Steal probability mass to generalize better. Discounted counts for P(w | denied the):
2.5 allegations
1.5 reports
0.5 claims
0.5 request
2 other (outcome, attack, man, …)
7 total
Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did (add one to all counts)
• MLE estimate: P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1)
• Add-1 estimate: P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
Laplace-smoothed bigrams
[table of Laplace-smoothed bigram probabilities]
Reconstituted counts
c*(wi-1, wi) = (c(wi-1, wi) + 1) × c(wi-1) / (c(wi-1) + V)
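A minimal sketch of add-one smoothing and reconstituted counts, reusing the toy bigram_counts, context_counts, and corpus from the MLE sketch above (assumed still in scope); V is the vocabulary size, here including the boundary markers.

```python
# A sketch of add-one (Laplace) smoothing and reconstituted counts.
V = len(set(w for sent in corpus for w in sent))  # 12 for the toy corpus

def p_laplace(cur, prev):
    # P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + V)

def reconstituted_count(prev, cur):
    # c*(wi-1, wi) = (c(wi-1, wi) + 1) * c(wi-1) / (c(wi-1) + V):
    # the effective count implied by the smoothed probability.
    return p_laplace(cur, prev) * context_counts[prev]

print(p_laplace("i", "<s>"))            # (2 + 1) / (3 + 12) = 0.2
print(reconstituted_count("<s>", "i"))  # 0.2 * 3 = 0.6, down from a raw count of 2
```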
Linear Interpolation
• Simple interpolation: mix trigram, bigram, and unigram estimates
P̂(wi | wi-2 wi-1) = λ1 P(wi | wi-2 wi-1) + λ2 P(wi | wi-1) + λ3 P(wi), with λ1 + λ2 + λ3 = 1
• Stupid backoff (no discounting, just relative frequencies with a fixed backoff weight):
S(wi | wi-k+1 … wi-1) = count(wi-k+1 … wi) / count(wi-k+1 … wi-1)  if count(wi-k+1 … wi) > 0
                      = 0.4 × S(wi | wi-k+2 … wi-1)                 otherwise
S(wi) = count(wi) / N
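A minimal sketch of simple linear interpolation; the component probability functions and the λ weights below are placeholders, since in practice the λ's are tuned on held-out data.

```python
# A sketch of simple linear interpolation of trigram, bigram, and unigram
# estimates. The weights are illustrative; real systems tune them on
# held-out data. p_trigram / p_bigram / p_unigram are assumed to exist.
LAMBDA_TRI, LAMBDA_BI, LAMBDA_UNI = 0.5, 0.3, 0.2  # must sum to 1

def p_interpolated(cur, prev2, prev1, p_trigram, p_bigram, p_unigram):
    return (LAMBDA_TRI * p_trigram(cur, prev2, prev1)
            + LAMBDA_BI * p_bigram(cur, prev1)
            + LAMBDA_UNI * p_unigram(cur))
```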