NLP Week 03
Introduction to NLP
Faizad Ullah
Language Models
• A language model (LM) is a machine learning model that predicts upcoming words.
Probability of the next word
• Machine Translation
• Spell Correction
• Speech Recognition
• Text Generation …
Probability of a sentence
• P(all of a sudden I notice three guys standing on the sidewalk) = ?
• Suppose h = “The water of Walden Pond is so beautifully”; what is the probability of the next word given this history h?
• Why can’t we estimate these probabilities by simply counting occurrences in a corpus? Language is creative, so most long word sequences never occur and the counts are far too sparse.
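A minimal sketch (not from the slides; the toy corpus and history below are hypothetical) of the naive counting idea: estimate the next word by counting how often each continuation of the full history h appears. With any realistic history the counts are almost always zero, which is why the chain rule and n-gram approximations that follow are needed.

```python
from collections import Counter

# Hypothetical toy corpus; in practice this would be billions of tokens.
corpus = ("the water of walden pond is so beautifully blue "
          "the water of walden pond is so beautifully clear").split()

def next_word_counts(corpus, history):
    """Count every word that follows the full history h in the corpus."""
    h = history.split()
    counts = Counter()
    for i in range(len(corpus) - len(h)):
        if corpus[i:i + len(h)] == h:
            counts[corpus[i + len(h)]] += 1
    return counts

h = "the water of walden pond is so beautifully"
counts = next_word_counts(corpus, h)
total = sum(counts.values())
for word, c in counts.items():
    print(word, c / total)  # relative-frequency estimate of P(word | h)
```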
Chain rule of probability
• The chain rule shows the link between computing the joint
probability of a sequence and computing the conditional
probability of a word given previous words.
P(W) = P(w1) · P(w2 | w1) · P(w3 | w1 w2) · P(w4 | w1 w2 w3) · … · P(wn | w1 … wn-1)
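A tiny sketch of the chain rule in action (the sentence and the conditional probabilities below are hypothetical, not from the slides): the joint probability of the sentence is the product of the per-word conditional probabilities.

```python
# Hypothetical conditional probabilities P(word | all previous words).
conditionals = [
    ("I",        0.20),   # P(I)
    ("love",     0.05),   # P(love | I)
    ("natural",  0.10),   # P(natural | I love)
    ("language", 0.60),   # P(language | I love natural)
]

p_sentence = 1.0
for word, p in conditionals:
    p_sentence *= p       # chain rule: multiply the conditionals in order

print(p_sentence)         # P(I love natural language) = 0.2*0.05*0.1*0.6 = 0.0006
```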
The Markov assumption
• The assumption that the probability of a word depends only on the previous word is called a Markov assumption.
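In symbols, for the bigram case (this is the standard formulation from the cited Jurafsky & Martin chapter; the slide states it only in words):

```latex
% Bigram Markov assumption: condition only on the immediately preceding word.
P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1})
```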
The Markov assumption (n-gram)
• The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words (see the sketch after this list).
• N=1 → Unigram
• N=2 → Bigram
• N=3 → Trigram
• N=4 → 4-gram
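A quick illustrative sketch (my own example sentence, not from the slides) of what the different n-gram orders look like:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love natural language processing".split()
print(ngrams(tokens, 1))  # unigrams: ('I',), ('love',), ...
print(ngrams(tokens, 2))  # bigrams:  ('I', 'love'), ('love', 'natural'), ...
print(ngrams(tokens, 3))  # trigrams: ('I', 'love', 'natural'), ...
```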
How to estimate probabilities
• An intuitive way to estimate these probabilities is maximum likelihood estimation (MLE): get counts from a corpus and normalize them.
• For a bigram model:
1. Count the occurrences of each bigram wn-1 wn in the corpus.
2. Normalize by the sum of all the bigrams that share the same first word wn-1:
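Written out (the standard MLE bigram estimate matching step 2; the simplification in the denominator holds because the sum of the counts of all bigrams starting with wn-1 equals the unigram count C(wn-1)):

```latex
P(w_n \mid w_{n-1})
  = \frac{C(w_{n-1}\, w_n)}{\sum_{w} C(w_{n-1}\, w)}
  = \frac{C(w_{n-1}\, w_n)}{C(w_{n-1})}
```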
How to estimate probabilities: an example
• Let’s work through an example using a mini-corpus of three
sentences.
• Augment each sentence with the start and end symbols <s> and </s>.
• Estimate each n-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix.
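A runnable sketch of this procedure, assuming the three-sentence mini-corpus from the cited Jurafsky & Martin chapter (the slides’ own sentences are not included in the text, so substitute them if they differ):

```python
from collections import Counter

# Assumed mini-corpus of three sentences (from Jurafsky & Martin, slp3/3.pdf).
sentences = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

# Augment each sentence with the start and end symbols <s> and </s>.
corpus = [["<s>"] + s.split() + ["</s>"] for s in sentences]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1])
                  for sent in corpus for i in range(len(sent) - 1))

def p_bigram(w_prev, w):
    """MLE bigram probability: count of the bigram over count of its prefix."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("<s>", "I"))     # 2/3 ≈ 0.67
print(p_bigram("I", "am"))      # 2/3 ≈ 0.67
print(p_bigram("Sam", "</s>"))  # 1/2 = 0.5
```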
How to estimate probabilities: Unigram
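The unigram case (the slide’s worked content is not included in the text, so stated here for completeness): the MLE unigram probability is simply a word’s count divided by the total number of word tokens N in the corpus:

```latex
P(w_i) = \frac{C(w_i)}{N}
```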
Bayes’ Theorem
• Generally, we want the most probable hypothesis given the training data.
• P(N) = ?
• P(Vi) = ?
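Spelled out (the standard MAP formulation of “most probable hypothesis given the training data”, which the bullet describes; here h is a hypothesis and D the training data):

```latex
h_{MAP} = \arg\max_{h} P(h \mid D)
        = \arg\max_{h} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h} P(D \mid h)\, P(h)
```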
Sources
• https://web.stanford.edu/~jurafsky/slp3/3.pdf
• https://web.stanford.edu/~jurafsky/slp3/4.pdf