Lecture 4


Language Model

Content
• Word prediction task
• Language modeling (N-grams)
– N-gram introduction
– The chain rule
– Model evaluation
– Smoothing

2
Word Prediction

• Guess the next word...


– ... I notice three guys standing on the ???
• There are many sources of knowledge that
can be used to inform this task, including
arbitrary world knowledge.
• But it turns out that you can do pretty well
by simply looking at the preceding words
and keeping track of some fairly simple
counts.
3
Word Prediction

• We can formalize this task using what are called N-gram models.
• N-grams are token sequences of length N.
• Our earlier example contains the following 2-
grams (bigrams)
– (I notice), (notice three), (three guys), (guys
standing), (standing on), (on the)
• Given knowledge of counts of N-grams such as
these, we can guess likely next words in a
sequence.
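
A minimal sketch (not from the slides) of pulling these bigrams out of the example sentence, assuming simple whitespace tokenization:

```python
# Sketch: extract bigrams from the example sentence.
# Whitespace tokenization is assumed for simplicity.
sentence = "I notice three guys standing on the"
tokens = sentence.split()

bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('I', 'notice'), ('notice', 'three'), ('three', 'guys'),
#  ('guys', 'standing'), ('standing', 'on'), ('on', 'the')]
```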
4
N-Gram Models
• More formally, we can use knowledge of
the counts of N-grams to assess the
conditional probability of candidate words
as the next word in a sequence.
• Or, we can use them to assess the
probability of an entire sequence of words.
– Pretty much the same thing as we’ll see...

5
Applications
• It turns out that being able to predict the next word
(or any linguistic unit) in a sequence is an
extremely useful thing to be able to do.
• As we’ll see, it lies at the core of the following
applications
– Automatic speech recognition
– Handwriting and character recognition
– Spelling correction
– Machine translation
– And many more.

6
Counting
• Simple counting lies at the core of any
probabilistic approach. So let’s first take a
look at what we’re counting.
– He stepped out into the hall, was delighted to
encounter a water brother.
• 13 tokens, 15 if we include “,” and “.” as separate
tokens.
• Assuming we include the comma and period, how
many bigrams are there?

7
Counting
• Not always that simple
– I do uh main- mainly business data processing
• Spoken language poses various challenges.
– Should we count “uh” and other fillers as tokens?
– What about the repetition of “mainly”? Should such do-overs
count twice or just once?
– The answers depend on the application.
• If we’re focusing on something like ASR to support indexing for
search, then “uh” isn’t helpful (it’s not likely to occur as a query).
• But filled pauses are very useful in dialog management, so we might
want them there.

8
Counting: Types and Tokens
• How about
– They picnicked by the pool, then lay back on
the grass and looked at the stars.
• 18 tokens (again counting punctuation)
• But we might also note that “the” is used 3
times, so there are only 16 unique types (as
opposed to tokens).
• Going forward, we’ll have occasion to
focus on counting both types and tokens of
both words and N-grams.
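
A small sketch of the type/token distinction on this example, assuming punctuation marks are split off as their own tokens:

```python
# Sketch: tokens vs. types for the example sentence.
# Assumes punctuation marks are split off as separate tokens.
import re

sentence = ("They picnicked by the pool, then lay back on "
            "the grass and looked at the stars.")
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(len(tokens))       # 18 tokens (counting ',' and '.')
print(len(set(tokens)))  # 16 types ('the' appears 3 times)
```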
9
Counting: Wordforms
• Should “cats” and “cat” count as the same
when we’re counting?
• How about “geese” and “goose”?
• Some terminology:
– Lemma: a set of lexical forms having the same
stem, major part of speech, and rough word
sense
– Wordform: fully inflected surface form
• Again, we’ll have occasion to count both
lemmas and wordforms
10
Counting: Corpora
• So what happens when we look at large bodies of text
instead of single utterances?
• Brown et al. (1992): a large corpus of English text
– 583 million wordform tokens
– 293,181 wordform types
• Google
– Crawl of 1,024,908,267,229 English tokens
– 13,588,391 wordform types
• That seems like a lot of types... After all, even large dictionaries of English have
only around 500k types. Why so many here?

• Numbers
• Misspellings
• Names
• Acronyms
• etc
11
Language Modeling
• Back to word prediction
• We can model the word prediction task as
the ability to assess the conditional
probability of a word given the previous
words in the sequence
– P(wn|w1,w2…wn-1)
• We’ll call a statistical model that can assess
this a Language Model

12
Language Modeling
• How might we go about calculating such a
conditional probability?
– One way is to use the definition of conditional
probabilities and look for counts. So to get
– P(the | its water is so transparent that)
• By definition that’s
P(its water is so transparent that the) / P(its water is so transparent that)
We can get each of those from counts in a large
corpus.
13
Very Easy Estimate
• How to estimate?
– P(the | its water is so transparent that)

P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)

14
Very Easy Estimate
• According to Google, those counts are 5 and 9, giving 5/9.
– Unfortunately... 2 of those hits were to these
slides... So maybe it’s really
– 3/7
– In any case, that’s not terribly convincing due
to the small numbers involved.

15
Language Modeling
• Unfortunately, for most sequences and for
most text collections we won’t get good
estimates from this method.
– What we’re likely to get is 0. Or worse 0/0.
• Clearly, we’ll have to be a little more clever.
– Let’s use the chain rule of probability
– And a particularly useful independence
assumption.

16
The Chain Rule

• Recall the definition of conditional probabilities


– P(A | B) = P(A ^ B) / P(B)
• Rewriting:
– P(A ^ B) = P(A | B) P(B)
• For sequences...
– P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• In general
– P(x1,x2,x3,…xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1…xn-1)

17
The Chain Rule

P(its water was so transparent)=


P(its)*
P(water|its)*
P(was|its water)*
P(so|its water was)*
P(transparent|its water was so)
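
As a sketch, the same decomposition in code; `cond_prob` here is a hypothetical stand-in for whatever conditional estimates we have available:

```python
# Sketch of the chain-rule decomposition P(w1..wn) = prod_i P(wi | w1..wi-1).
# cond_prob(w, history) is a hypothetical function returning P(w | history).
def sequence_probability(words, cond_prob):
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_prob(w, tuple(words[:i]))  # P(wi | w1 ... wi-1)
    return p
```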

18
Unfortunately

• There are still a lot of possible sentences


• In general, we’ll never be able to get enough
data to compute the statistics for those longer
prefixes
– Same problem we had for the strings themselves

19
Independence Assumption
• Make the simplifying assumption
– P(lizard|
the,other,day,I,was,walking,along,and,saw,a) =
P(lizard|a)
• Or maybe
– P(lizard|
the,other,day,I,was,walking,along,and,saw,a) =
P(lizard|saw,a)
• That is, we assume the probability in question
depends only on the most recent words, not the full
earlier history.

20
Independence Assumption

• This particular kind of independence assumption is
called a Markov assumption after the Russian
mathematician Andrei Markov.

21
Markov Assumption

So, for each component in the product, we substitute the
approximation (assuming a prefix of N):

P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1)

Bigram version

P(wn | w1 … wn-1) ≈ P(wn | wn-1)

22
Estimating Bigram Probabilities

• The Maximum Likelihood Estimate (MLE)

count(w i 1,w i )
P(w i | w i 1) 
count(w i 1 )

23
An Example
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>
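
A minimal sketch of computing the MLE bigram estimates for this toy corpus:

```python
# Sketch: MLE bigram estimates from the toy corpus above.
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = line.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """P(w | prev) = count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```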

24
Maximum Likelihood Estimates
• The maximum likelihood estimate of some parameter of a
model M from a training set T
– Is the estimate that maximizes the likelihood of the training set T given
the model M
• Suppose the word Chinese occurs 400 times in a corpus of a
million words (Brown corpus)
• What is the probability that a random word from some other
text from the same distribution will be “Chinese”
• MLE estimate is 400/1000000 = .0004
– This may be a bad estimate for some other corpus
• But it is the estimate that makes it most likely that “Chinese”
will occur 400 times in a million word corpus.

25
Berkeley Restaurant Project Sentences

• can you tell me about any good cantonese restaurants
close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are
available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day

26
Bigram Counts
• Out of 9222 sentences
– E.g., “I want” occurred 827 times

27
Bigram Probabilities
• Divide bigram counts by prefix unigram
counts to get probabilities.

28
Bigram Estimates of Sentence Probabilities

• P(<s> I want english food </s>) =


P(i|<s>)*
P(want|I)*
P(english|want)*
P(food|english)*
P(</s>|food)
=.000031
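
A sketch of the same product, computed in log space to avoid underflow. The factor values below are the standard Berkeley Restaurant Project estimates quoted in Jurafsky & Martin (only P(i|<s>) and P(english|want) appear on the next slide), so treat them as illustrative:

```python
# Sketch: multiply bigram estimates for the sentence, in log space.
# The individual probabilities are the standard textbook values; illustrative only.
import math

factors = {
    "P(i|<s>)":        0.25,
    "P(want|i)":       0.33,
    "P(english|want)": 0.0011,
    "P(food|english)": 0.5,
    "P(</s>|food)":    0.68,
}

log_p = sum(math.log(p) for p in factors.values())
print(math.exp(log_p))   # ~0.000031
```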

29
Kinds of Knowledge
• As crude as they are, N-gram probabilities capture
a range of interesting facts about language.
• P(english|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P (i | <s>) = .25

30
Evaluation
• How do we know if our models are any
good?
– And in particular, how do we know if one
model is better than another?
• Well Shannon’s game gives us an intuition.
– The generated texts from the higher order
models sure look better. That is, they sound
more like the text the model was obtained from.
– But what does that mean? Can we make that
notion operational?

31
Evaluation

• Standard method
– Train parameters of our model on a training set.
– Look at the model’s performance on some new data
• This is exactly what happens in the real world; we want to know
how our model performs on data we haven’t seen
– So use a test set: a dataset that is different from our
training set, but is drawn from the same source
– Then we need an evaluation metric to tell us how well
our model is doing on the test set.
• One such metric is perplexity
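
Perplexity is the inverse probability of the test set, normalized by the number of words: PP(W) = P(w1 … wN)^(-1/N). A minimal sketch for a bigram model, assuming a hypothetical log_prob(w, prev) function:

```python
# Sketch: perplexity of a test set under a bigram model.
# log_prob(w, prev) is a hypothetical function returning log P(w | prev).
import math

def perplexity(test_tokens, log_prob):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space."""
    log_p = 0.0
    for prev, w in zip(test_tokens, test_tokens[1:]):
        log_p += log_prob(w, prev)
    n = len(test_tokens) - 1          # number of predicted words
    return math.exp(-log_p / n)
```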

32
Unknown Words
• But once we start looking at test data, we’ll
run into words that we haven’t seen before
(pretty much regardless of how much
training data we have).
• With an Open Vocabulary task
– Create an unknown word token <UNK>
– Training of <UNK> probabilities
• Create a fixed lexicon L, of size V
– From a dictionary or
– A subset of terms from the training set
• At the text normalization phase, any training word not in L is changed to
<UNK>
• Now we count that like a normal word
– At test time
• Use UNK counts for any word not in training
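
A minimal sketch of this normalization step, assuming L is built from the most frequent training words (one common choice):

```python
# Sketch: map out-of-lexicon words to <UNK> before counting.
# Building L from the V most frequent training words is one common choice.
from collections import Counter

def normalize(train_tokens, vocab_size):
    counts = Counter(train_tokens)
    lexicon = {w for w, _ in counts.most_common(vocab_size)}
    return [w if w in lexicon else "<UNK>" for w in train_tokens], lexicon

# At test time, apply the same mapping with the training lexicon:
# test_tokens = [w if w in lexicon else "<UNK>" for w in test_tokens]
```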
33
Zero Counts
• Back to Shakespeare
– Recall that Shakespeare produced 300,000 bigram
types out of V² = 844 million possible bigrams...
– So, 99.96% of the possible bigrams were never seen
(have zero entries in the table)
– Does that mean that any sentence that contains one of
those bigrams should have a probability of 0?

34
Laplace-Smoothed Bigram Counts

35
Laplace-Smoothed Bigram Probabilities
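
With add-one (Laplace) smoothing, every bigram count is incremented by one, so the estimate becomes P(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V), where V is the vocabulary size. A minimal sketch, reusing the unigram and bigram Counters from the earlier toy-corpus example:

```python
# Sketch: add-one (Laplace) smoothed bigram estimate.
# Assumes `unigrams` and `bigrams` are Counters as built earlier; V = vocabulary size.
def p_laplace(w, prev, unigrams, bigrams, V):
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
```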

36
Backoff and Interpolation
• Another really useful source of knowledge
• If we are estimating:
– trigram p(z|x,y)
– but count(xyz) is zero
• Use info from:
– Bigram p(z|y)
• Or even:
– Unigram p(z)
• How to combine this trigram, bigram,
unigram info in a valid fashion?
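
One standard way to combine them is simple linear interpolation; a minimal sketch, with placeholder weights (in practice the lambdas are tuned on held-out data and must sum to 1):

```python
# Sketch: linear interpolation of unigram, bigram, and trigram estimates.
# p_uni, p_bi, p_tri are hypothetical estimators; the lambda weights are
# placeholders and should be tuned on held-out data (they must sum to 1).
def p_interp(z, y, x, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    l1, l2, l3 = lambdas
    return l1 * p_uni(z) + l2 * p_bi(z, y) + l3 * p_tri(z, y, x)
```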
37
