Lecture 4
Content
• Word prediction task
• Language modeling (N-grams)
– N-gram introduction
– The chain rule
– Model evaluation
– Smoothing
Word Prediction
Applications
• It turns out that being able to predict the next word
(or any linguistic unit) in a sequence is an
extremely useful thing to be able to do.
• As we’ll see, it lies at the core of the following
applications
– Automatic speech recognition
– Handwriting and character recognition
– Spelling correction
– Machine translation
– And many more.
Counting
• Simple counting lies at the core of any
probabilistic approach. So let’s first take a
look at what we’re counting.
– He stepped out into the hall, was delighted to
encounter a water brother.
• 13 tokens, 15 if we include “,” and “.” as separate
tokens.
• Assuming we include the comma and period, how
many bigrams are there?
Counting
• Not always that simple
– I do uh main- mainly business data processing
• Spoken language poses various challenges.
– Should we count “uh” and other fillers as tokens?
– What about the repetition of “mainly”? Should such do-overs
count twice or just once?
– The answers depend on the application.
• If we’re focusing on something like ASR to support indexing for
search, then “uh” isn’t helpful (it’s not likely to occur as a query).
• But filled pauses are very useful in dialog management, so we might
want them there.
Counting: Types and Tokens
• How about
– They picnicked by the pool, then lay back on
the grass and looked at the stars.
• 18 tokens (again counting punctuation)
• But we might also note that “the” is used 3
times, so there are only 16 unique types (as
opposed to tokens).
• Going forward, we’ll have occasion to
focus on counting both types and tokens of
both words and N-grams.
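To make the type/token distinction concrete, here is a minimal counting sketch in Python (the regex tokenizer that splits off punctuation is just one reasonable choice, not something prescribed by the lecture):

    import re

    sentence = ("They picnicked by the pool, then lay back on "
                "the grass and looked at the stars.")
    # Split off punctuation marks as separate tokens, as in the count above.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    types = set(tokens)

    print(len(tokens))  # 18 tokens (including "," and ".")
    print(len(types))   # 16 types ("the" occurs 3 times)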
Counting: Wordforms
• Should “cats” and “cat” count as the same
when we’re counting?
• How about “geese” and “goose”?
• Some terminology:
– Lemma: a set of lexical forms having the same
stem, major part of speech, and rough word
sense
– Wordform: fully inflected surface form
• Again, we’ll have occasion to count both
lemmas and wordforms
Counting: Corpora
• So what happens when we look at large bodies of text
instead of single utterances?
• Brown et al. (1992): a large corpus of English text
– 583 million wordform tokens
– 293,181 wordform types
• Google
– Crawl of 1,024,908,267,229 English tokens
– 13,588,391 wordform types
• That seems like a lot of types... After all, even large dictionaries of English have
only around 500k types. Why so many here?
• Numbers
• Misspellings
• Names
• Acronyms
• etc
Language Modeling
• Back to word prediction
• We can model the word prediction task as
the ability to assess the conditional
probability of a word given the previous
words in the sequence
– P(wn|w1,w2…wn-1)
• We’ll call a statistical model that can assess
this a Language Model
Language Modeling
• How might we go about calculating such a
conditional probability?
– One way is to use the definition of conditional
probabilities and look for counts. So to get
– P(the | its water is so transparent that)
• By definition that’s
P(its water is so transparent that the) / P(its water is so transparent that)
We can get each of those from counts in a large
corpus.
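A sketch of that counting idea, assuming the corpus is available as one big whitespace-normalized string (the function name is just illustrative, and raw substring matching is admittedly crude):

    def conditional_estimate(corpus, history, word):
        # Estimate P(word | history) as count(history word) / count(history),
        # using raw substring counts (crude: ignores word boundaries).
        denom = corpus.count(history)
        numer = corpus.count(history + " " + word)
        return numer / denom if denom else 0.0

    # usage: conditional_estimate(corpus, "its water is so transparent that", "the")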
Very Easy Estimate
• How to estimate?
– P(the | its water is so transparent that)
Very Easy Estimate
• According to Google, those counts give an estimate of 5/9.
– Unfortunately... 2 of those hits were from these
slides... So maybe it’s really
– 3/7
– In any case, that’s not terribly convincing due
to the small numbers involved.
Language Modeling
• Unfortunately, for most sequences and for
most text collections we won’t get good
estimates from this method.
– What we’re likely to get is 0. Or, worse, 0/0.
• Clearly, we’ll have to be a little more clever.
– Let’s use the chain rule of probability
– And a particularly useful independence
assumption.
The Chain Rule
P(A ∧ B) = P(A | B) P(B)
• For sequences...
– P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• In general
– P(x1,x2,x3,…xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1…xn-1)
The Chain Rule
• Applying it to our running example:
– P(its water is so transparent) =
P(its) × P(water | its) × P(is | its water) ×
P(so | its water is) × P(transparent | its water is so)
Unfortunately
• The chain rule alone doesn’t help much:
– The later terms condition on nearly the whole
sentence, so we’d still need counts for long
prefixes we’re unlikely to ever see.
Independence Assumption
• Make the simplifying assumption
– P(lizard | the,other,day,I,was,walking,along,and,saw,a) = P(lizard | a)
• Or maybe
– P(lizard | the,other,day,I,was,walking,along,and,saw,a) = P(lizard | saw,a)
• That is, we assume the probability in question
depends only on the most recent word or two, not
on the rest of the earlier history.
Independence Assumption
• This kind of independence assumption is called a
Markov assumption, after the Russian
mathematician Andrey Markov.
Markov Assumption
P(wn | w1…wn-1) ≈ P(wn | wn-N+1…wn-1)
Bigram version
P(wn | w1…wn-1) ≈ P(wn | wn-1)
Estimating Bigram Probabilities
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
An Example
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>
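A small sketch of the maximum-likelihood bigram estimates for this toy corpus (plain Python; the helper names are just for illustration):

    from collections import Counter

    corpus = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def p(word, prev):
        # MLE: count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev]

    print(p("I", "<s>"))     # 2/3
    print(p("am", "I"))      # 2/3
    print(p("Sam", "am"))    # 1/2
    print(p("</s>", "Sam"))  # 1/2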
Maximum Likelihood Estimates
• The maximum likelihood estimate of some parameter of a
model M from a training set T
– Is the estimate that maximizes the likelihood of the training set T given
the model M
• Suppose the word Chinese occurs 400 times in a corpus of a
million words (Brown corpus)
• What is the probability that a random word from some other
text from the same distribution will be “Chinese”?
• The MLE estimate is 400/1,000,000 = .0004
– This may be a bad estimate for some other corpus
• But it is the estimate that makes it most likely that “Chinese”
will occur 400 times in a million-word corpus.
Berkeley Restaurant Project Sentences
Bigram Counts
• Out of 9222 sentences
– E.g., “I want” occurred 827 times
Bigram Probabilities
• Divide bigram counts by prefix unigram
counts to get probabilities.
Bigram Estimates of Sentence Probabilities
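As a rough sketch of the computation: under a bigram model, a sentence’s probability is the product of its bigram probabilities, usually accumulated in log space to avoid underflow (the bigram_prob argument is assumed to be an estimator like the p function sketched above):

    import math

    def sentence_logprob(tokens, bigram_prob):
        # tokens should include the <s> and </s> boundary markers.
        logp = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            prob = bigram_prob(word, prev)
            if prob == 0.0:
                return float("-inf")  # an unseen bigram zeroes the whole sentence
            logp += math.log(prob)
        return logp

    # usage: sentence_logprob("<s> i want english food </s>".split(), bigram_prob)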
Kinds of Knowledge
As crude as they are, N-gram probabilities capture
a range of interesting facts about language.
• P(english|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P (i | <s>) = .25
Evaluation
• How do we know if our models are any
good?
– And in particular, how do we know if one
model is better than another?
• Well, Shannon’s game gives us an intuition.
– The generated texts from the higher order
models sure look better. That is, they sound
more like the text the model was obtained from.
– But what does that mean? Can we make that
notion operational?
Evaluation
• Standard method
– Train the parameters of our model on a training set.
– Look at the model’s performance on some new data
• This is exactly what happens in the real world; we want to know
how our model performs on data we haven’t seen
– So use a test set: a dataset that is different from our
training set but drawn from the same source
– Then we need an evaluation metric to tell us how well
our model is doing on the test set.
• One such metric is perplexity
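Perplexity is the inverse probability of the test set, normalized by the number of words: PP(W) = P(w1 w2 … wN)^(-1/N), so lower perplexity means a better model. A minimal sketch, reusing the sentence_logprob helper from the earlier example (the word-counting convention here is one common choice, not the only one):

    import math

    def perplexity(test_sentences, bigram_prob):
        # PP(W) = P(w1 ... wN)^(-1/N), computed via log probabilities.
        total_logprob = 0.0
        total_words = 0
        for tokens in test_sentences:
            total_logprob += sentence_logprob(tokens, bigram_prob)
            total_words += len(tokens) - 1  # count transitions, not the leading <s>
        return math.exp(-total_logprob / total_words)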
Unknown Words
• But once we start looking at test data, we’ll
run into words that we haven’t seen before
(pretty much regardless of how much
training data we have).
• With an Open Vocabulary task
– Create an unknown word token <UNK>
– Training of <UNK> probabilities
• Create a fixed lexicon L, of size V
– From a dictionary or
– A subset of terms from the training set
• At the text normalization phase, any training word not in L is
changed to <UNK>
• Now we count that like a normal word
– At test time
• Use UNK counts for any word not in training
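A minimal sketch of that normalization step, assuming the lexicon L is taken to be the most frequent training wordforms (the vocabulary-size cutoff is just illustrative):

    from collections import Counter

    def build_lexicon(training_tokens, vocab_size):
        # Fixed lexicon L: the vocab_size most frequent training wordforms.
        counts = Counter(training_tokens)
        return {word for word, _ in counts.most_common(vocab_size)}

    def normalize(tokens, lexicon):
        # Map anything outside L to <UNK>; applied to the training data
        # (so <UNK> gets counts) and again to the test data.
        return [tok if tok in lexicon else "<UNK>" for tok in tokens]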
Zero Counts
• Back to Shakespeare
– Recall that Shakespeare produced 300,000 bigram
types out of V² = 844 million possible bigrams...
– So, 99.96% of the possible bigrams were never seen
(have zero entries in the table)
– Does that mean that any sentence that contains one of
those bigrams should have a probability of 0?
Laplace-Smoothed Bigram Counts
Laplace-Smoothed Bigram Probabilities
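The recipe behind these two slides is add-one (Laplace) smoothing: pretend every bigram occurred once more than it actually did, giving P(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V), where V is the vocabulary size. A minimal sketch, reusing the counters from the toy example above:

    def p_laplace(word, prev, bigrams, unigrams):
        # Add-one smoothing: every bigram gets a pseudo-count of 1,
        # so the denominator grows by the vocabulary size V.
        V = len(unigrams)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

    # e.g. p_laplace("Sam", "am", bigrams, unigrams)
    #      -> (1 + 1) / (2 + 12) with the toy corpus above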
Backoff and Interpolation
• Another really useful source of knowledge
• If we are estimating:
– trigram p(z|x,y)
– but count(xyz) is zero
• Use info from:
– Bigram p(z|y)
• Or even:
– Unigram p(z)
• How to combine this trigram, bigram,
unigram info in a valid fashion?
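One standard answer is simple linear interpolation (backoff, which falls back to the shorter context only when the longer one has a zero count, is the other): mix the three estimates with weights that sum to 1, P_interp(z|x,y) = λ1·P(z|x,y) + λ2·P(z|y) + λ3·P(z), with λ1+λ2+λ3 = 1. A minimal sketch (the λ values shown are placeholders; in practice they’re tuned on held-out data):

    def p_interp(z, x, y, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        # Linear interpolation of trigram, bigram, and unigram estimates.
        # The lambda weights must sum to 1; these particular values are
        # placeholders, not tuned.
        l1, l2, l3 = lambdas
        return l1 * p_tri(z, x, y) + l2 * p_bi(z, y) + l3 * p_uni(z)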