Adv. Natural Language Processing: Instructor: Dr. Muhammad Asfand-E-Yar
Natural Language
Processing
Lecture 5
Instructor: Dr. Muhammad Asfand-e-yar
• Introduction to N – Grams
• Estimating N-Grams Probabilities
• Evaluation and Perplexity
For example:
P (its, water, is, so, transparent, that)
If you were able to complete this word sequence, it was likely from
prior knowledge and exposure to the complete sentence.
Not all word sequences are obvious, but for any given word sequence,
it should be possible to compute the probability of the next word.
MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar
The Chain Rule
N-Grams
Word sequences are given formal names:
Unigram: a sequence of one word
WebSphere, Mobile, Coffee
Bigram: a sequence of two words
cannot stand, Lotus Notes
Trigram: a sequence of three words
Lazy yellow dog, friend to none, Rational Software Architect
4-Gram: a sequence of four words
Play it again Sam
5-Gram: a sequence of five words
6-Gram: a sequence of six words (and so on)
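The definitions above can be sketched in code. A minimal example (the `ngrams` helper is made up for illustration) that slices a token list into n-grams of any order:

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "play it again sam".split()
print(ngrams(tokens, 1))  # unigrams: one word each
print(ngrams(tokens, 2))  # bigrams: ('play','it'), ('it','again'), ('again','sam')
print(ngrams(tokens, 3))  # trigrams: two three-word windows
```

The same sliding-window idea works for any n; a sequence of N tokens yields N − n + 1 n-grams.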
The Chain Rule
What is the probability that "Sam" will occur after the trigram "Play it again"?
The word sequence might well be
1. "Play it again Sally",
2. "Play it again Louise",
3. or "Play it again and again",
4. and so on.
If we want to compute the probability of "Sam" occurring next, how do we do this?
The chain rule of probability:
P(w1, w2, w3, w4) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × P(w4 | w1, w2, w3)
Therefore, if we plug the words of "Play it again Sam" into the last factor, we
get
P(Sam | Play, it, again)
Hence given the word sequence { Play, it, again }, what is the probability of
"Sam" being the fourth word in this sequence?
We can answer a question with a question.
The probability of
P(A, B, C, D)
is
P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)
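This decomposition can be checked numerically. A small sketch with a made-up joint distribution over two variables, verifying that P(A, B) = P(A) × P(B | A) for every outcome:

```python
# Toy joint distribution over two variables A and B (made-up numbers).
joint = {("a0", "b0"): 0.1, ("a0", "b1"): 0.3,
         ("a1", "b0"): 0.2, ("a1", "b1"): 0.4}

def p_a(a):
    """Marginal P(A = a), obtained by summing over B."""
    return sum(p for (x, _), p in joint.items() if x == a)

def p_b_given_a(b, a):
    """Conditional P(B = b | A = a) = P(a, b) / P(a)."""
    return joint[(a, b)] / p_a(a)

# Chain rule: P(A, B) = P(A) * P(B | A), checked for every outcome.
for (a, b), p in joint.items():
    assert abs(p - p_a(a) * p_b_given_a(b, a)) < 1e-12
print("chain rule holds on this toy distribution")
```

With more variables the same identity chains: each factor conditions on everything to its left, exactly as in the four-word formula above.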
or with values in place:
P(Play) × P(it | Play) × P(again | Play, it) × P(Sam | Play, it, again)
or maybe the probability of a sequence such as:
texaco, rose, one, in, this, issue, is, pursuing, growth, in,
a, boiler, house, said, mr., gurria, mexico, 's, motion,
control, proposal, without, permission, from, five, hundred,
fifty, five, yen
“The computer(s) which I had just put into the machine room on the fifth
floor is (are) crashing.”
• can you give me a listing of the kinds of food that are available
It is:
P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
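A minimal sketch of this maximum-likelihood estimate, using a tiny made-up two-sentence corpus with <s>/</s> markers (the real estimates below come from a much larger corpus):

```python
from collections import Counter

# MLE bigram estimate: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1}).
corpus = [
    "<s> i want english food </s>".split(),
    "<s> i want chinese food </s>".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def p_bigram(w, prev):
    """Probability of w following prev, by relative frequency."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_bigram("i", "<s>"))         # 1.0: both sentences start with "i"
print(p_bigram("english", "want"))  # 0.5: "want" is followed by "english" once out of twice
```

Dividing each bigram count by the count of its first word is exactly the "normalize by unigrams" step on the slides.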
Normalize the bigram counts by the unigram counts to obtain probabilities.
From the resulting bigram probability table (count tables omitted in this text version):
P(i|<s>) = 0.25
P(want| i) = 0.33
P(english|want) = 0.0011
P(food|english) = 0.5
P(</s>|food) = 0.68
Bigram Estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= 0.25 × 0.33 × 0.0011 × 0.5 × 0.68
≈ 0.000031
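The product above can be reproduced directly; in practice the same computation is done by summing log probabilities, since multiplying many small numbers underflows on long sentences. A sketch using the five estimates from the slide:

```python
import math

# The five bigram estimates for "<s> I want english food </s>".
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

# Direct product.
p = 1.0
for x in probs:
    p *= x
print(p)  # ~0.000031

# Same value via log space: sum the logs, exponentiate once at the end.
log_p = sum(math.log(x) for x in probs)
print(math.exp(log_p))  # matches the direct product
```

For a five-word sentence the direct product is fine; the log-space form matters when sentences are long and each factor is small.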
KenLM
https://kheafield.com/code/kenlm/
Bad science!
Therefore,
• we sometimes use an intrinsic evaluation: perplexity
• it is a bad approximation of extrinsic performance
• unless the test data looks just like the training data
• so it is generally only useful in pilot experiments
• but it is still helpful to think about
For example, consider a model that assigns each of N words probability 1/10:
P(W) = 1/10 × 1/10 × … × 1/10 = (1/10)^N
The perplexity is then PP(W) = P(W)^(-1/N) = 10.
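A minimal sketch of this perplexity computation: with every word assigned probability 1/10, the perplexity comes out to 10 regardless of sequence length.

```python
import math

def perplexity(word_probs):
    """PP(W) = P(w1..wN)^(-1/N), computed in log space for stability."""
    n = len(word_probs)
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / n)

print(perplexity([0.1] * 30))  # ~10 for any length: uniform over 10 outcomes
```

Perplexity can be read as the weighted average branching factor: a model that is always choosing uniformly among 10 options has perplexity 10.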