Lecture 4

Language Model

Content
• Word prediction task
• Language modeling (N-grams)
– N-gram introduction
– The chain rule
– Model evaluation
– Smoothing

2
Word Prediction

• Guess the next word...


– ... I notice three guys standing on the ???
• There are many sources of knowledge that
can be used to inform this task, including
arbitrary world knowledge.
• But it turns out that you can do pretty well
by simply looking at the preceding words
and keeping track of some fairly simple
counts.
3
Word Prediction

• We can formalize this task using what are called


N-gram models.
• N-grams are token sequences of length N.
• Our earlier example contains the following 2-
grams (bigrams)
– (I notice), (notice three), (three guys), (guys
standing), (standing on), (on the)
• Given knowledge of counts of N-grams such as
these, we can guess likely next words in a
sequence.
4
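A minimal sketch of how the bigrams above could be pulled out of a token list; the helper name extract_ngrams is just an illustrative choice, not a standard API.

    def extract_ngrams(tokens, n):
        # Return all token sequences of length n (N-grams) as tuples.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "I notice three guys standing on the".split()
    print(extract_ngrams(tokens, 2))
    # [('I', 'notice'), ('notice', 'three'), ('three', 'guys'),
    #  ('guys', 'standing'), ('standing', 'on'), ('on', 'the')]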
N-Gram Models
• More formally, we can use knowledge of
the counts of N-grams to assess the
conditional probability of candidate words
as the next word in a sequence.
• Or, we can use them to assess the
probability of an entire sequence of words.
– Pretty much the same thing as we’ll see...

5
Applications
• It turns out that being able to predict the next word
(or any linguistic unit) in a sequence is an
extremely useful thing to be able to do.
• As we’ll see, it lies at the core of the following
applications
– Automatic speech recognition
– Handwriting and character recognition
– Spelling correction
– Machine translation
– And many more.

6
Counting
• Simple counting lies at the core of any
probabilistic approach. So let’s first take a
look at what we’re counting.
– He stepped out into the hall, was delighted to
encounter a water brother.
• 13 tokens, 15 if we include “,” and “.” as separate
tokens.
• Assuming we include the comma and period, how
many bigrams are there?

7
Counting
• Not always that simple
– I do uh main- mainly business data processing
• Spoken language poses various challenges.
– Should we count “uh” and other fillers as tokens?
– What about the repetition of “mainly”? Should such do-overs
count twice or just once?
– The answers depend on the application.
• If we’re focusing on something like ASR to support indexing for
search, then “uh” isn’t helpful (it’s not likely to occur as a query).
• But filled pauses are very useful in dialog management, so we might
want them there.

8
Counting: Types and Tokens
• How about
– They picnicked by the pool, then lay back on
the grass and looked at the stars.
• 18 tokens (again counting punctuation)
• But we might also note that “the” is used 3
times, so there are only 16 unique types (as
opposed to tokens).
• In going forward, we’ll have occasion to
focus on counting both types and tokens of
both words and N-grams.
9
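A quick sketch checking the token/type counts above (with punctuation split off as separate tokens, as on the slide):

    tokens = ("They picnicked by the pool , then lay back on "
              "the grass and looked at the stars .").split()
    print(len(tokens))       # 18 tokens
    print(len(set(tokens)))  # 16 types, since "the" accounts for 3 of the tokens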
Counting: Wordforms
• Should “cats” and “cat” count as the same
when we’re counting?
• How about “geese” and “goose”?
• Some terminology:
– Lemma: a set of lexical forms having the same
stem, major part of speech, and rough word
sense
– Wordform: fully inflected surface form
• Again, we’ll have occasion to count both
lemmas and wordforms
10
Counting: Corpora
• So what happens when we look at large bodies of text
instead of single utterances?
• Brown et al (1992) large corpus of English text
– 583 million wordform tokens
– 293,181 wordform types
• Google
– Crawl of 1,024,908,267,229 English tokens
– 13,588,391 wordform types
• That seems like a lot of types... After all, even large dictionaries of English have
only around 500k types. Why so many here?

• Numbers
• Misspellings
• Names
• Acronyms
• etc
11
Language Modeling
• Back to word prediction
• We can model the word prediction task as
the ability to assess the conditional
probability of a word given the previous
words in the sequence
– P(wn|w1,w2…wn-1)
• We’ll call a statistical model that can assess
this a Language Model

12
Language Modeling
• How might we go about calculating such a
conditional probability?
– One way is to use the definition of conditional
probabilities and look for counts. So to get
– P(the | its water is so transparent that)
• By definition that’s
P(its water is so transparent that the) / P(its water is so transparent that)
We can get each of those from counts in a large
corpus.
13
Very Easy Estimate
• How to estimate?
– P(the | its water is so transparent that)

P(the | its water is so transparent that) =
Count(its water is so transparent that the) / Count(its water is so transparent that)

14
Very Easy Estimate
• According to Google, those counts are 5 and 9, giving an estimate of 5/9.
– Unfortunately... 2 of those hits were to these
slides... So maybe it’s really 3/7.
– In any case, that’s not terribly convincing due
to the small numbers involved.

15
Language Modeling
• Unfortunately, for most sequences and for
most text collections we won’t get good
estimates from this method.
– What we’re likely to get is 0. Or worse 0/0.
• Clearly, we’ll have to be a little more clever.
– Let’s use the chain rule of probability
– And a particularly useful independence
assumption.

16
The Chain Rule

• Recall the definition of conditional probabilities


P(A | B) = P(A ^ B) / P(B)
• Rewriting:
P(A ^ B) = P(A | B) P(B)
• For sequences...
– P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• In general
– P(x1,x2,x3,…xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1…xn-1)

17
The Chain Rule

P(its water was so transparent)=


P(its)*
P(water|its)*
P(was|its water)*
P(so|its water was)*
P(transparent|its water was so)

18
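A minimal sketch of the chain rule in code; cond_prob is a hypothetical callback standing in for whatever supplies P(w | history), since the slide does not prescribe one.

    def sequence_probability(tokens, cond_prob):
        # P(w1 ... wn) = product over i of P(wi | w1 ... wi-1)
        p = 1.0
        for i, w in enumerate(tokens):
            p *= cond_prob(w, tokens[:i])
        return p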
Unfortunately

• There are still a lot of possible sentences


• In general, we’ll never be able to get enough
data to compute the statistics for those longer
prefixes
– Same problem we had for the strings themselves

19
Independence Assumption
• Make the simplifying assumption
– P(lizard|
the,other,day,I,was,walking,along,and,saw,a) =
P(lizard|a)
• Or maybe
– P(lizard|
the,other,day,I,was,walking,along,and,saw,a) =
P(lizard|saw,a)
• That is, the probability in question is
independent of its earlier history.

20
Independence Assumption

• This particular kind of independence assumption is


called a Markov assumption after the Russian
mathematician Andrei Markov.

21
Markov Assumption

So for each component in the product, we replace it with
the approximation (assuming a prefix of N):

n 1 n 1
P(wn | w 1 ) P(wn | w n N 1 )
Bigram version

n 1
P(w n | w 1 ) P(w n | w n 1 )

22
Estimating Bigram Probabilities

• The Maximum Likelihood Estimate (MLE)

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

23
An Example
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>

24
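A sketch of the MLE bigram estimate applied to this tiny corpus; the helper names are illustrative, not a standard library API.

    from collections import Counter

    corpus = ["<s> I am Sam </s>",
              "<s> Sam I am </s>",
              "<s> I do not like green eggs and ham </s>"]

    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = sent.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def bigram_mle(w_prev, w):
        # P(w | w_prev) = count(w_prev, w) / count(w_prev)
        return bigrams[(w_prev, w)] / unigrams[w_prev]

    print(bigram_mle("<s>", "I"))     # 2/3
    print(bigram_mle("I", "am"))      # 2/3
    print(bigram_mle("Sam", "</s>"))  # 1/2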
Maximum Likelihood Estimates
• The maximum likelihood estimate of some parameter of a
model M from a training set T
– Is the estimate that maximizes the likelihood of the training set T given
the model M
• Suppose the word Chinese occurs 400 times in a corpus of a
million words (Brown corpus)
• What is the probability that a random word from some other
text from the same distribution will be “Chinese”?
• MLE estimate is 400/1,000,000 = .0004
– This may be a bad estimate for some other corpus
• But it is the estimate that makes it most likely that “Chinese”
will occur 400 times in a million word corpus.

25
Berkeley Restaurant Project Sentences

• can you tell me about any good cantonese restaurants


close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are
available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day

26
Bigram Counts
• Out of 9222 sentences
– E.g., “I want” occurred 827 times

27
Bigram Probabilities
• Divide bigram counts by prefix unigram
counts to get probabilities.

28
Bigram Estimates of Sentence Probabilities

• P(<s> I want english food </s>) =


P(i|<s>)*
P(want|I)*
P(english|want)*
P(food|english)*
P(</s>|food)
= .000031

29
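A sketch of the same computation. Only two of these bigram probabilities are quoted on the next slide (P(i|<s>) = .25 and P(english|want) = .0011); the remaining values are approximate Berkeley Restaurant Project estimates and should be read as assumptions here, though their product does reproduce .000031.

    bigram_p = {
        ("<s>", "i"): 0.25,           # quoted on the next slide
        ("i", "want"): 0.33,          # approximate table values (assumed)
        ("want", "english"): 0.0011,  # quoted on the next slide
        ("english", "food"): 0.5,     # assumed
        ("food", "</s>"): 0.68,       # assumed
    }

    sentence = ["<s>", "i", "want", "english", "food", "</s>"]
    p = 1.0
    for prev, w in zip(sentence, sentence[1:]):
        p *= bigram_p[(prev, w)]
    print(p)   # ~0.000031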
Kinds of Knowledge
• As crude as they are, N-gram probabilities capture
a range of interesting facts about language.
• P(english|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P(i | <s>) = .25

30
Evaluation
• How do we know if our models are any
good?
– And in particular, how do we know if one
model is better than another?
• Well, Shannon’s game gives us an intuition.
– The generated texts from the higher order
models sure look better. That is, they sound
more like the text the model was obtained from.
– But what does that mean? Can we make that
notion operational?

31
Evaluation

• Standard method
– Train parameters of our model on a training set.
– Look at the model’s performance on some new data
• This is exactly what happens in the real world; we want to know
how our model performs on data we haven’t seen
– So use a test set: a dataset that is different from our
training set, but drawn from the same source
– Then we need an evaluation metric to tell us how well
our model is doing on the test set.
• One such metric is perplexity

32
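Perplexity isn’t spelled out on the slide; a minimal sketch under the usual definition (the inverse probability of the test set, normalized by the number of tokens), with model_prob as a hypothetical per-token probability function:

    import math

    def perplexity(test_tokens, model_prob):
        # PP(W) = P(w1 ... wN) ** (-1/N), computed in log space for stability.
        log_p = sum(math.log(model_prob(w, test_tokens[:i]))
                    for i, w in enumerate(test_tokens))
        return math.exp(-log_p / len(test_tokens))

Lower perplexity on held-out data means the model finds the test text less surprising.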
Unknown Words
• But once we start looking at test data, we’ll
run into words that we haven’t seen before
(pretty much regardless of how much
training data we have).
• With an Open Vocabulary task
– Create an unknown word token <UNK>
– Training of <UNK> probabilities
• Create a fixed lexicon L, of size V
– From a dictionary or
– A subset of terms from the training set
• At text normalization phase, any training word not in L changed to
<UNK>
• Now we count that like a normal word
– At test time
• Use UNK counts for any word not in training
33
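A sketch of the <UNK> normalization step described above, using the training-set types as the fixed lexicon L (the second option listed on the slide):

    def normalize(tokens, lexicon):
        # Map any token outside the fixed lexicon L to <UNK>; the same mapping
        # is applied at training time (to learn <UNK> counts) and at test time.
        return [t if t in lexicon else "<UNK>" for t in tokens]

    lexicon = set("i want english food".split())
    print(normalize("i want thai food".split(), lexicon))
    # ['i', 'want', '<UNK>', 'food']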
Zero Counts
• Back to Shakespeare
– Recall that Shakespeare produced 300,000 bigram
types out of V² = 844 million possible bigrams...
– So, 99.96% of the possible bigrams were never seen
(have zero entries in the table)
– Does that mean that any sentence that contains one of
those bigrams should have a probability of 0?

34
Laplace-Smoothed Bigram Counts

35
Laplace-Smoothed Bigram Probabilities

36
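The smoothed tables themselves are not reproduced here, but a minimal sketch of the add-one (Laplace) estimate behind them, with the counts passed in as Counters so unseen bigrams simply look up as zero:

    from collections import Counter

    def laplace_bigram(w_prev, w, bigram_counts, unigram_counts, V):
        # Add-one smoothing: pretend every possible bigram was seen once more,
        # so the numerator gains 1 and the denominator gains V (the vocabulary
        # size); unseen bigrams now get a small non-zero probability.
        return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

    # Illustrative made-up counts:
    uni = Counter({"want": 10, "to": 6})
    bi = Counter({("want", "to"): 6})
    print(laplace_bigram("want", "to", bi, uni, V=2))    # (6+1)/(10+2) = 7/12
    print(laplace_bigram("want", "food", bi, uni, V=2))  # (0+1)/(10+2) = 1/12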
Backoff and Interpolation
• Another really useful source of knowledge
• If we are estimating:
– trigram p(z|x,y)
– but count(xyz) is zero
• Use info from:
– Bigram p(z|y)
• Or even:
– Unigram p(z)
• How to combine this trigram, bigram,
unigram info in a valid fashion?
37
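One standard answer to the question on the slide is simple linear interpolation; a minimal sketch, with p_uni, p_bi, p_tri as hypothetical model functions and the lambda weights as assumed hyperparameters (they must sum to 1 and would normally be tuned on held-out data):

    def interpolated_trigram(z, y, x, p_uni, p_bi, p_tri,
                             lambdas=(0.1, 0.3, 0.6)):
        # P_hat(z | x, y) = l1*P(z) + l2*P(z | y) + l3*P(z | x, y)
        l1, l2, l3 = lambdas
        return l1 * p_uni(z) + l2 * p_bi(z, y) + l3 * p_tri(z, x, y)

Backoff, by contrast, uses the trigram estimate when count(x y z) is non-zero and only falls back to the lower-order estimates when it is zero.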
