Unit 2b
Methods/Models
Language Models
• Models that assign probabilities to sequences of words are called
language models
• More precisely, they assign a probability to each possible next word, or to an
entire sentence.
• Why would you want to predict upcoming words, or assign
probabilities to sentences?
• To identify words in noisy, ambiguous input, as in speech recognition.
• For writing tools like spelling correction or grammatical error correction.
• Language models are also essential in machine translation.
N-gram Language Models
• The n-gram is the simplest model that assigns probabilities to sentences and
sequences of words.
• It works by computing P(w|h), the probability of a word w given some history h.
The intuition of the n-gram model is that instead of computing the probability of a
word given its entire history, we can approximate the history by just the last few
words.
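For example, the chain rule decomposes the probability of a sequence exactly as
P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) … P(wn|w1:n-1),
and an n-gram model approximates each history w1:k-1 by only its last few words.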
bigram
• A bigram model, for example, approximates the probability of a word
given all the previous words P(wn|w1:n-1) by using only the conditional
probability of the preceding word P(wn|wn-1). In other words, instead
of computing the probability P(the|Walden Pond’s water is so
transparent that) we approximate it with the probability P(the|that).
• The assumption that the probability of a word depends only on the
previous word is called a Markov assumption.
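• Under this assumption, the probability of a whole sentence factors into a
product of bigram probabilities (a sentence-start symbol <s> is commonly used
to give the first word a left context):
P(w1:n) ≈ P(w1|<s>) P(w2|w1) P(w3|w2) … P(wn|wn-1)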
n-gram
• In general, P(wn|wn-N+1:n-1), where N is the n-gram order: bigram N=2, trigram N=3, …
• So if N=4 then P(wn|wn-4+1:n-1) = P(wn|wn-3:n-1)
•   = P(wn|wn-3 wn-2 wn-1)
• P(the|Walden Pond’s water is so transparent that) ≈ P(the|so
transparent that)
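• A minimal Python sketch of this history truncation (the helper name
ngram_context and the example history are illustrative, not from the slides):

# Keep only the last N-1 words of the history, per the n-gram approximation.
def ngram_context(history, N):
    """Return the words an N-gram model actually conditions on."""
    if N <= 1:
        return []                # a unigram model ignores the history entirely
    return history[-(N - 1):]    # the last N-1 words

history = "Walden Pond's water is so transparent that".split()
print(ngram_context(history, 2))  # ['that']                        (bigram)
print(ngram_context(history, 4))  # ['so', 'transparent', 'that']   (4-gram)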
Markov models
• Markov models are the class of probabilistic models that assume we
can predict the probability of some future unit without looking too far
into the past.
• We can generalize the bigram (which looks one word into the past) to
the trigram (which looks two words into the past) and thus to the n-
gram (which looks n-1 words into the past).
maximum likelihood estimation
• How do we estimate these bigram or n-gram probabilities?
• We get the MLE estimate for the parameters of an n-gram model by
getting counts from a corpus, and normalizing the counts so that they
lie between 0 and 1.
• For bigrams, we divide the count of the bigram by the count of the preceding
word in the corpus:
P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
• Example: P(lunch|eat) = C(eat lunch) / C(eat) = 42 / 746 ≈ 0.056
• Exercise: find the bigram probability for each cell of the count table in the same way.
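• A minimal Python sketch of MLE bigram estimation (the toy corpus and the
function name bigram_mle are illustrative, not from the slides):

from collections import Counter

def bigram_mle(sentences):
    """Estimate P(wn|wn-1) by relative frequency: C(wn-1 wn) / C(wn-1)."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigram_counts.update(tokens[:-1])             # denominators: counts of the preceding word
        bigram_counts.update(zip(tokens, tokens[1:]))  # numerators: counts of each word pair
    return {(prev, w): count / unigram_counts[prev]
            for (prev, w), count in bigram_counts.items()}

# Toy corpus, illustrative only
corpus = ["i want to eat lunch", "i want chinese food", "i want to eat"]
probs = bigram_mle(corpus)
print(probs[("eat", "lunch")])   # C(eat lunch) / C(eat) = 1 / 2 = 0.5
print(probs[("i", "want")])      # C(i want) / C(i) = 3 / 3 = 1.0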
If we have the following probabilities…
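• Interpolated absolute discounting subtracts a fixed discount d from each
bigram count and interpolates with the unigram distribution; in its standard form:
P(wi|wi-1) = ( C(wi-1 wi) − d ) / C(wi-1) + λ(wi-1) P(wi)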
• The first term is the discounted bigram, and the second term is the
unigram with an interpolation weight λ. We could just set all the d
values to .75, or we could keep a separate discount value of 0.5 for
the bigrams with counts of 1.
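• A minimal Python sketch of interpolated absolute discounting (the function
name, the toy counts, and the common choice of λ(prev) = d/C(prev) × number of
distinct words seen after prev are illustrative assumptions, not from the slides):

from collections import Counter, defaultdict

def absolute_discounting(bigram_counts, unigram_counts, total_tokens, d=0.75):
    """P(w|prev) = max(C(prev w) - d, 0) / C(prev) + lambda(prev) * P_unigram(w)."""
    # Distinct continuations seen after each previous word (used to set lambda).
    continuations = defaultdict(set)
    for (prev, w) in bigram_counts:
        continuations[prev].add(w)

    def prob(w, prev):
        c_prev = unigram_counts[prev]
        if c_prev == 0:
            return unigram_counts[w] / total_tokens          # unseen history: fall back to the unigram
        discounted = max(bigram_counts[(prev, w)] - d, 0) / c_prev
        lam = (d / c_prev) * len(continuations[prev])        # interpolation weight lambda(prev)
        return discounted + lam * (unigram_counts[w] / total_tokens)

    return prob

# Tiny hand-made counts, illustrative only
unigrams = Counter({"i": 3, "want": 3, "to": 2, "eat": 2, "lunch": 1, "chinese": 1, "food": 1})
bigrams = Counter({("i", "want"): 3, ("want", "to"): 2, ("to", "eat"): 2,
                   ("eat", "lunch"): 1, ("want", "chinese"): 1, ("chinese", "food"): 1})
p = absolute_discounting(bigrams, unigrams, total_tokens=sum(unigrams.values()))
print(p("lunch", "eat"))   # discounted bigram mass plus lambda * unigram share
print(p("food", "eat"))    # unseen bigram: probability comes only from the lambda * unigram term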