N-Grams and Smoothing: CSC 371: Spring 2012
• Conditional independence
– 2 random variables A and B are conditionally independent given C iff
P(a, b | c) = P(a | c) P(b | c) for all values a, b, c
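As a quick numeric check (a toy distribution of my own, not from the slides), the sketch below builds a joint distribution in which A and B are conditionally independent given C and verifies the identity above:

```python
# Toy check of conditional independence: P(a, b | c) = P(a | c) P(b | c).
# The joint distribution is constructed from the factorization
# P(c) P(a | c) P(b | c), so the identity should hold exactly.
from itertools import product

p_c = {0: 0.4, 1: 0.6}                                   # P(c)
p_a_given_c = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # P(a | c), keyed [c][a]
p_b_given_c = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # P(b | c), keyed [c][b]

joint = {(a, b, c): p_c[c] * p_a_given_c[c][a] * p_b_given_c[c][b]
         for a, b, c in product([0, 1], repeat=3)}

for a, b, c in product([0, 1], repeat=3):
    p_ab_c = joint[(a, b, c)] / p_c[c]                         # P(a, b | c)
    p_a_c = sum(joint[(a, bb, c)] for bb in [0, 1]) / p_c[c]   # P(a | c)
    p_b_c = sum(joint[(aa, b, c)] for aa in [0, 1]) / p_c[c]   # P(b | c)
    assert abs(p_ab_c - p_a_c * p_b_c) < 1e-12
```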
Today’s Lecture
• Use probability axioms to model word prediction
• N-grams
• Smoothing
• Reading:
– Sections 4.1, 4.2, 4.3, 4.5.1 in Speech and Language
Processing [Jurafsky & Martin]
Next Word Prediction
• From a NY Times story...
– Stocks ...
– Stocks plunged this ...
– Stocks plunged this morning, despite a cut in interest rates
– Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall ...
– Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began
– Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last ...
– Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday's terrorist attacks.
Human Word Prediction
• Clearly, at least some of us have the ability to
predict future words in an utterance.
• How?
– Domain knowledge: red blood vs. red hat
– Syntactic knowledge: the…<adj|noun>
– Lexical knowledge: baked <potato vs. chicken>
Claim
• A useful part of the knowledge needed to allow
Word Prediction can be captured using simple
statistical techniques
• In particular, we'll be interested in the notion of the
probability of a sequence (of letters, words,…)
Useful Applications
• Why do we want to predict a word, given some
preceding words?
– Rank the likelihood of sequences containing various
alternative hypotheses, e.g. for ASR
Theatre owners say popcorn/unicorn sales have doubled...
– Assess the likelihood/goodness of a sentence, e.g. for
text generation or machine translation
Applications of word prediction
• Spelling checkers
– They are leaving in about fifteen minuets to go to her house
– The study was conducted mainly be John Black.
– I need to notified the bank of this problem.
• Speech recognition
– Theatre owners say popcorn/unicorn sales have doubled...
• Handwriting recognition
• Disabled users
• Machine Translation
N-Gram Models of Language
• Use the previous N-1 words in a sequence to
predict the next word
• Language Model (LM)
– unigrams, bigrams, trigrams,…
• How do we train these models?
– Very large corpora
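To make the counting concrete, here is a minimal sketch (my own toy corpus, not from the slides) of training a bigram model by relative-frequency counting, then using it to predict a next word and to score a whole sequence:

```python
# Minimal bigram language model: count bigrams in a toy corpus, then use
# P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) to predict and score.
# Illustrative only; real models are trained on very large corpora.
from collections import Counter, defaultdict

corpus = [
    "i want to eat chinese food",
    "i want to eat lunch",
    "i want chinese food",
]

bigram_counts = defaultdict(Counter)
unigram_counts = Counter()

for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(words, words[1:]):
        bigram_counts[prev][curr] += 1
        unigram_counts[prev] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate P(curr | prev)."""
    return bigram_counts[prev][curr] / unigram_counts[prev]

def predict_next(prev):
    """Most likely next word after `prev` under the bigram model."""
    return bigram_counts[prev].most_common(1)[0][0]

def sentence_prob(sentence):
    """P(sentence): chain rule with a one-word history (bigram assumption)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, curr in zip(words, words[1:]):
        p *= bigram_prob(prev, curr)
    return p

print(bigram_prob("want", "to"))          # 2/3
print(predict_next("eat"))                # 'chinese' (tie broken by first-seen order)
print(sentence_prob("i want chinese food"))  # 1/3
```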
Corpora
• Corpora are online collections of text and speech
– Brown Corpus
– Wall Street Journal
– AP newswire
– Hansards
– Timit
– DARPA/NIST text/speech corpora (Call Home, Call
Friend, ATIS, Switchboard, Broadcast News, Broadcast
Conversation, TDT, Communicator)
– TRAINS, Boston Radio News Corpus
Terminology
BERP Bigram Counts
• Bigram counts C(wn-1 wn) from the BERP corpus (rows are wn-1, columns are wn):

            I    Want    To   Eat  Chinese  Food  Lunch
  I         8    1087     0    13        0     0      0
  Want      3       0   786     0        6     8      6
  To        3       0    10   860        3     0     12
  Eat       0       0     2     0       19     2     52
  Chinese   2       0     0     0        0   120      1
  Food     19       0    17     0        0     0      0
  Lunch     4       0     0     0        0     1      0

Early BERP Bigram Probabilities
• Normalization: divide each row's counts by the appropriate unigram count for wn-1
  (Slide from Dan Klein)
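A sketch of that normalization step, using the counts from the table above. The slide's actual unigram counts for wn-1 are not included in this excerpt, so row sums over these seven words stand in for them (an approximation for illustration only):

```python
# Turn bigram counts C(w_{n-1} w_n) into probabilities P(w_n | w_{n-1})
# by dividing each row by the count of its context word w_{n-1}.
# NOTE: the true BERP unigram counts are not shown in this excerpt,
# so each row sum is used as a stand-in denominator.
words = ["i", "want", "to", "eat", "chinese", "food", "lunch"]
counts = [
    [8, 1087, 0, 13, 0, 0, 0],    # row: C(i, w_n)
    [3, 0, 786, 0, 6, 8, 6],      # row: C(want, w_n)
    [3, 0, 10, 860, 3, 0, 12],
    [0, 0, 2, 0, 19, 2, 52],
    [2, 0, 0, 0, 0, 120, 1],
    [19, 0, 17, 0, 0, 0, 0],
    [4, 0, 0, 0, 0, 1, 0],
]

probs = {}
for prev, row in zip(words, counts):
    denom = sum(row)  # stand-in for the unigram count C(w_{n-1})
    probs[prev] = {w: c / denom for w, c in zip(words, row)}

print(round(probs["i"]["want"], 3))  # ~0.981 under row-sum normalization
```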
Laplace Smoothing
• For unigrams:
  – Add 1 to every word (type) count to get an adjusted count c*
  – Normalize by N (# tokens) + V (# types)
  – Original unigram probability:
      P(wi) = ci / N
  – New unigram probability:
      PLP(wi) = (ci + 1) / (N + V)
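A minimal sketch of the add-one estimate, assuming the input is just a dictionary of raw word counts:

```python
# Laplace (add-one) smoothing for unigrams:
#   P_LP(w) = (c(w) + 1) / (N + V)
# where N is the total number of tokens and V the number of word types.
def laplace_unigram_probs(counts):
    """counts: dict mapping word type -> raw count c(w)."""
    N = sum(counts.values())   # total tokens
    V = len(counts)            # vocabulary size (types)
    return {w: (c + 1) / (N + V) for w, c in counts.items()}
```

A word type with count 0 now gets probability 1 / (N + V) instead of 0.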
Unigram Smoothing Example
• PLP(wi) = (ci + 1) / (N + V)
• Tiny corpus: V = 4, N = 20

  Word      True Ct   Unigram Prob   New Ct   Adjusted Prob
  eat          10          .5          11         .46
  British       4          .2           5         .21
  food          6          .3           7         .29
  happily       0          .0           1         .04

• Laplace smoothing for bigrams:
    PLP(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
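The slide's numbers can be checked directly (N = 20 tokens, V = 4 types); the sketch below reproduces the table's adjusted probabilities:

```python
# Reproduce the unigram smoothing example: N = 20 tokens, V = 4 types.
counts = {"eat": 10, "British": 4, "food": 6, "happily": 0}
N = sum(counts.values())   # 20
V = len(counts)            # 4

for word, c in counts.items():
    mle = c / N                  # unsmoothed ("true") unigram probability
    laplace = (c + 1) / (N + V)  # Laplace-smoothed probability
    print(f"{word:8s}  c={c:2d}  MLE={mle:.2f}  c+1={c + 1:2d}  P_LP={laplace:.2f}")

# eat       c=10  MLE=0.50  c+1=11  P_LP=0.46
# British   c= 4  MLE=0.20  c+1= 5  P_LP=0.21
# food      c= 6  MLE=0.30  c+1= 7  P_LP=0.29
# happily   c= 0  MLE=0.00  c+1= 1  P_LP=0.04
```

The same recipe applies to the bigram formula above, except that the denominator becomes C(wn-1) + V.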