Adv. Natural Language Processing: Instructor: Dr. Muhammad Asfand-E-Yar


Adv. Natural Language Processing
Lecture 5
Instructor: Dr. Muhammad Asfand-e-yar



Previous Lecture

Minimum Edit Distance



Today’s Lecture

• Introduction to N-Grams
• Estimating N-Grams Probabilities
• Evaluation and Perplexity



Language Modeling
Introduction to N-grams



Probabilistic Language Models
Today’s goal: assign a probability to a sentence
Why?
• Machine Translation:
P(high winds tonight) > P(large winds tonight)
• Spell Correction
The office is about fifteen minuets from my house
P(about fifteen minutes from) > P(about fifteen minuets from)
• Speech Recognition
P(I saw a van) > P(eyes awe of an)
• Summarization, Question-Answering, etc., etc.!!
Probabilistic Language Modeling
Goal: compute the probability of a sentence or sequence of words:
P(W) = P (w1, w2, w3, w4, w5 … wn)
Related task: probability of an upcoming word:
P (w5 | w1, w2, w3, w4)
A model that computes either of these:
P(W) or P(wn | w1, w2 … wn-1) is called a language model.

That is, we can compute either the probability of the complete word sequence, P(W),
or the conditional probability of the last word given all of its previous words,
P(wn | w1, w2 … wn-1).

A better name would be "the grammar", but "language model" (or LM) is the standard term.


How to compute P(W)
How to compute this joint probability:

For example:
P (its, water, is, so, transparent, that)

Intuition: let’s rely on the Chain Rule of Probability



The Chain Rule
Given any of these word sequences, what is the probability of the next word?

Premature optimization is the root of all ____          (evil - Donald Knuth)
A house divided against itself cannot ____              (stand - Abraham Lincoln)
The quick brown fox jumped over the lazy ____           (dog - Wm. Shakespeare)
A friend to all is a friend ____ ____                   (of none - Aristotle)

If you were able to complete these word sequences, it was likely from
prior knowledge and exposure to the complete sequence.
Not all word sequences are obvious, but for any given word sequence,
it should be possible to compute the probability of the next word.
The Chain Rule
N-Grams
Word sequences are given a formal name:
Unigram - a sequence of one word: WebSphere, Mobile, Coffee
Bigram - a sequence of two words: cannot stand, Lotus Notes
Trigram - a sequence of three words: lazy yellow dog, friend to none, Rational Software Architect
4-Gram - a sequence of four words: Play it again Sam
5-Gram - a sequence of five words
6-Gram - a sequence of six words (etc.)
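To make the terminology concrete, here is a minimal Python sketch (ours, not from the slides) that extracts unigrams, bigrams, and trigrams from a tokenized sentence; the function name ngrams is simply our own choice.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumped over the lazy dog".split()
print(ngrams(tokens, 1))  # unigrams: ('the',), ('quick',), ...
print(ngrams(tokens, 2))  # bigrams:  ('the', 'quick'), ('quick', 'brown'), ...
print(ngrams(tokens, 3))  # trigrams: ('the', 'quick', 'brown'), ...
```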
The Chain Rule
What is the probability that "Sam" will occur after the trigram "Play it again"?
The word sequence might well be
1. "Play it again Sally",
2. "Play it again Louise“,
3. or "Play it again and again",
4. and so on.
If we want to compute the probability of "Sam" occurring next, how do we do this?
The chain rule of probability: P(W) = P(w4 | w1, w2, w3). This can be stated:

P(W)                  "a sequence of words"
P(w4 | w1, w2, w3)    "the conditional probability of word w4 given the sequence w1, w2, w3"
The Chain Rule
P(W) "A sequence of words"
=
P(w4 | w1, w2, w3) "The conditional probability of word w4 given the sequence w1,w2,w3."

Therefore, if we plug the values for "Play it again Sam" into this formula, we
get
P(Sam | Play, it, again )

Hence given the word sequence { Play, it, again }, what is the probability of
"Sam" being the fourth word in this sequence?
We can answer a question with a question.



The Chain Rule

1. What is the probability that "it" will follow "play"?


2. What is the probability that "again" will follow "play it"?
3. What is the probability that "Sam" will follow "play it again"?



The Chain Rule

The probability of
P(A, B, C, D)
is
P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)

or with values in place, P(Play, it, again, Sam) is

P(Play) * P(it | Play) * P(again | Play, it) * P(Sam | Play, it, again)
The Chain Rule
Recall the definition of conditional probabilities:
P(B | A) = P(A, B) / P(A)      Rewriting: P(A, B) = P(A) * P(B | A)
More variables:
P(A, B, C, D) = P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C)

The Chain Rule in General


P(x1, x2, x3, … , xn) = P(x1)*P(x2|x1)*P(x3|x1,x2)*…*P(xn | x1, … , xn-1)



The Chain Rule
The Chain Rule applied to compute the joint probability of the words in a sentence:

P(“its water is so transparent”) =


P(its) × P(water | its) × P(is | its, water) ×
P(so | its, water, is) × P(transparent | its, water, is, so)

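As an illustrative sketch (not from the lecture), the chain-rule decomposition can be written directly in code. Here cond_prob is a hypothetical lookup for P(word | history) that a real model would have to supply, and the toy numbers are invented.

```python
def sentence_prob(words, cond_prob):
    """Chain rule: P(w1..wn) = product over i of P(wi | w1..wi-1).

    cond_prob(word, history) is assumed to return P(word | history);
    it stands in for whatever model supplies those values.
    """
    p = 1.0
    history = []
    for w in words:
        p *= cond_prob(w, tuple(history))
        history.append(w)
    return p

# Made-up conditional probabilities, for illustration only:
toy = {
    ("its", ()): 0.02,
    ("water", ("its",)): 0.1,
    ("is", ("its", "water")): 0.3,
}
print(sentence_prob(["its", "water", "is"],
                    lambda w, h: toy.get((w, h), 0.0)))  # 0.02 * 0.1 * 0.3
```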


How to estimate these probabilities?
Could we just count and divide?
P(the | its water is so transparent that) =
    Count(its water is so transparent that the) / Count(its water is so transparent that)
No! Too many possible sentences!

We’ll never see enough data for estimating these


…the longer the sequence, the less likely we are to find it in a training
corpus
Markov Assumption
Simplifying assumption:
Andrei Markov

P(the | its water is so transparent that) ≈ P(the | that)

or maybe

P(the | its water is so transparent that) ≈ P(the | transparent that)


Markov Assumption

In other words, we approximate each component in the product:

P(wi | w1 w2 … wi-1) ≈ P(wi | wi-k … wi-1)

That is, instead of conditioning each word on its entire history, we condition it
only on its previous few (k) words.

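A minimal sketch (our own illustration) of what the Markov assumption means operationally: the model only ever looks at the last k words of the history. The helper name markov_history is hypothetical.

```python
def markov_history(history, k=1):
    """Apply the Markov assumption: keep only the last k words of the history."""
    return tuple(history[-k:]) if k > 0 else ()

full_history = ["its", "water", "is", "so", "transparent", "that"]
print(markov_history(full_history, k=1))  # ('that',)                -> bigram context
print(markov_history(full_history, k=2))  # ('transparent', 'that')  -> trigram context
```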


Simplest case: Unigram model

Some automatically generated sentences from a unigram model

fifth, an, of, futures, the, an, incorporated, a, a,


the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the


Bigram model
Condition on the previous word:

texaco, rose, one, in, this, issue, is, pursuing, growth, in,
a, boiler, house, said, mr., gurria, mexico, 's, motion,
control, proposal, without, permission, from, five, hundred,
fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november


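The generated sentences above come from sampling. The sketch below is ours, with an invented toy distribution, and shows the idea for a bigram model: each next word is drawn conditioned only on the previous word.

```python
import random

# Toy bigram distributions P(next | previous); the words and numbers are invented
# purely for illustration.
bigram = {
    "<s>":    {"texaco": 0.4, "this": 0.6},
    "texaco": {"rose": 1.0},
    "rose":   {"one": 0.5, "</s>": 0.5},
    "one":    {"</s>": 1.0},
    "this":   {"would": 1.0},
    "would":  {"be": 1.0},
    "be":     {"</s>": 1.0},
}

def generate(bigram, max_len=20):
    """Sample a sentence one word at a time, conditioning on the previous word."""
    word, out = "<s>", []
    for _ in range(max_len):
        choices = bigram[word]
        nxt = random.choices(list(choices), weights=list(choices.values()))[0]
        if nxt == "</s>":
            break
        out.append(nxt)
        word = nxt
    return " ".join(out)

print(generate(bigram))
```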
N-Gram models
We can extend to trigrams, 4-grams, 5-grams

In general this is an insufficient model of language


because language has long-distance dependencies:

“The computer(s) which I had just put into the machine room on the fifth
floor is (are) crashing.”

But we can often get away with N-gram models



Language Modeling
Estimating N-gram Probabilities



Estimating Bigram probabilities
How do we estimate these bigram or N-gram probabilities?

The Maximum Likelihood Estimate (MLE)

Get the MLE estimate for the parameters of an N-gram model by


getting counts from a corpus, and normalizing the counts so that
they lie between 0 and 1.
count(w i1,w i )
P(w i | w i1 )  c(w i1,w i )
count(w i1 ) P(w i | w i1 ) 
c(w i1)
Estimating Bigram probabilities
Let’s work through an example using a mini-corpus of three
sentences.
We’ll first need to augment each sentence with a special symbol
<s> at the beginning of the sentence, to give us the bigram
context of the first word.
We’ll also need a special end-symbol </s>.

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
An example
c(w i1,w i ) <s> I am Sam </s>
P(w i | w i1 )  <s> Sam I am </s>
c(w i1) <s> I do not like green eggs and ham </s>

Here are the calculations form some of the bigram probabilities


from the above given corpus (i.e. example).

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar
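As a minimal sketch (ours, not part of the lecture), the same estimates can be reproduced by counting bigrams and unigrams in the three-sentence corpus and dividing:

```python
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def p_bigram(word, prev):
    """MLE estimate P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_bigram("I", "<s>"))    # 2/3
print(p_bigram("Sam", "<s>"))  # 1/3
print(p_bigram("am", "I"))     # 2/3
```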


More examples:
Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by

• mid priced thai food is what i’m looking for

• tell me about chez panisse

• can you give me a listing of the kinds of food that are available

• i’m looking for a good place to eat breakfast

• when is caffe venezia open during the day



Raw Bigram counts
Out of 9222 sentences



Raw Bigram counts
Now normalize the counts by dividing each bigram count by the unigram count of its
first word, which turns the counts into probabilities.

It is:
P(wi | wi-1) = c(wi-1, wi) / c(wi-1)



Raw bigram probabilities
Normalize by unigrams:

In the corpus of 9222 sentences, the count of each individual word (its unigram count)
is given in the table above.



Raw bigram probabilities
Normalize by unigrams:

Normalize by Bigrams:



Raw bigram probabilities
Normalize by unigrams:

Result:



Raw bigram probabilities
Normalize by unigrams:

Result:

Note that a bigram whose count is 0 also gets a probability of 0.



Bigram Estimates of sentence probabilities
Now we can compute the probability of sentences like
I want English food or I want Chinese food
by simply multiplying the appropriate bigram probabilities together, as
follows:

P(i|<s>) = 0.25
P(want| i) = 0.33
P(english|want) = 0.0011
P(food|english) = 0.5
P(</s>|food) = 0.68
Bigram Estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>)
× P(want|I)
× P(english|want)
× P(food|english)
× P(</s>|food)
= 0.25 x 0.33 x 0.0011 x 0.5 x 0.68
= .000031



What kinds of knowledge we get?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P (i | <s>) = .25



What kinds of knowledge we get?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25

Why is P(english | want) = 0.0011 less than P(chinese | want) = 0.0065?
It can be simply because people want chinese food more than english food;
the counts reflect what the world (i.e. people) wants.



What kinds of knowledge we get?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25

P(to | want) = 0.66 is high for a grammatical reason: "want" takes an infinitive,
so "to" very often comes right after it.



What kinds of knowledge we get?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25

P(want | spend) = 0 for a grammatical reason: "want" and "spend" are both verbs and
won't come one right after the other; the grammar disallows it. The "0" here is
therefore a structural zero.
What kinds of knowledge we get?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25

P(food | to) = 0 is most likely just because "to food" never happened to occur in the
training data. A zero of this kind is called a contingent zero.



Practical Issues
In practice we don't store or multiply raw probabilities; we work with log
probabilities and do everything in log space. There are two reasons to use log
probabilities:
1. Avoid underflow
2. Adding is also faster than multiplying
Therefore, instead of multiplying the probabilities, we take logs and add them:

log(p1 * p2 * p3 * p4) = log p1 + log p2 + log p3 + log p4


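A small sketch (ours) showing the same sentence probability computed in log space, using the bigram values from the earlier "I want english food" slide; exponentiating at the end only confirms that the result matches the direct product of about .000031.

```python
import math

# Bigram probabilities for "<s> I want english food </s>" from the earlier slide.
probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

log_p = sum(math.log(p) for p in probs)   # add log probabilities ...
print(log_p)                              # ... instead of multiplying raw ones
print(math.exp(log_p))                    # ~3.1e-05, i.e. about .000031
```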
Language Modeling Toolkits
SRILM
http://www.speech.sri.com/projects/srilm/

KenLM
https://kheafield.com/code/kenlm/

These are publicly available Language Modeling tools.



Google N-Gram Release, August 2006



Google N-Gram Release
These are examples from Google 4-grams counts

serve as the incoming 92


serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Google Book N-grams
http://ngrams.googlelabs.com/

Another Google corpus is also available, and you can download large corpora according
to your requirements.



Language Modeling
Evaluation and Perplexity



Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones?
Assign higher probability to “real” or “frequently observed” sentences
than to “ungrammatical” or “rarely observed” sentences.

We train parameters of our model on a training set.

We test the model’s performance on data we haven’t seen.


• A test set is an unseen dataset that is different from our training set,
totally unused.
• An evaluation metric tells us how well our model does on the test set.

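A minimal sketch (ours) of the train/test discipline described above: hold out part of the corpus before estimating any counts, and never let test sentences influence training. The 90/10 split is an arbitrary choice for illustration.

```python
import random

# Toy corpus; in practice this would be many thousands of sentences.
sentences = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

random.seed(0)
random.shuffle(sentences)
split = int(0.9 * len(sentences))          # 90/10 split, chosen arbitrarily here
train, test = sentences[:split], sentences[split:]

# Estimate all counts/probabilities on `train` only;
# report perplexity (or task accuracy) on `test` only.
print(len(train), "training sentences,", len(test), "test sentences")
```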


Training on the test set
We can't allow test sentences into the training set.

Otherwise we will assign them an artificially high probability when we see them in the
test set.

“Training on the test set”

Bad science!

And violates the honor code


Extrinsic evaluation of N-gram models
Best evaluation for comparing models A and B
Put each model in a task
spelling corrector, speech recognizer, MT system
Run the task, get an accuracy for A and for B
How many misspelled words corrected properly
How many words translated correctly
Compare accuracy for A and B

Therefore, it is called extrinsic evaluation: we use an external task to evaluate and
compare the models.



Difficulty of extrinsic (in-vivo) evaluation
of N-gram models
Extrinsic evaluation
Time-consuming; can take days or weeks

Therefore,
• Sometimes use intrinsic evaluation: perplexity
• Bad approximation
• unless the test data looks just like the training data
• So generally only useful in pilot experiments
• But is helpful to think about.



Intuition of Perplexity
The Shannon Game: how well can we predict the next word?

I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____

A model might assign, for the first blank:
mushrooms    0.1
pepperoni    0.1
anchovies    0.01
….
fried rice   0.0001
….
and          1e-100    (an unlikely word to choose)

Unigrams are terrible at this game. (Why?)

A better model of a text is one which assigns a higher probability to the word that
actually occurs.
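As a toy illustration (ours, with invented numbers) of "a better model assigns higher probability to the word that actually occurs", compare two hypothetical models on the true next word:

```python
# Hypothetical next-word probabilities for
# "I always order pizza with cheese and ____" under two models (numbers invented).
model_a = {"mushrooms": 0.10, "pepperoni": 0.10, "anchovies": 0.01, "and": 1e-100}
model_b = {"mushrooms": 0.02, "pepperoni": 0.03, "anchovies": 0.001, "and": 1e-100}

actual_next_word = "mushrooms"   # suppose this is what the text really says

# The better model is the one that gives the actually occurring word more probability.
better = "model A" if model_a[actual_next_word] > model_b[actual_next_word] else "model B"
print(better)   # model A
```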
Perplexity
The best language model is one that best predicts an unseen test set
• i.e. it gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of
words:

PP(W) = P(w1 w2 … wN)^(-1/N)
      = (1 / P(w1 w2 … wN))^(1/N)

Chain rule:

PP(W) = ( Π over i of 1 / P(wi | w1 … wi-1) )^(1/N)

For bigrams:

PP(W) = ( Π over i of 1 / P(wi | wi-1) )^(1/N)
Minimizing perplexity is the same as maximizing probability


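A hedged sketch (ours) of computing perplexity from per-word conditional probabilities in log space, matching the definition above; the list of probabilities stands in for whatever P(wi | history) a model returns on a test set.

```python
import math

def perplexity(cond_probs):
    """PP(W) = exp( -(1/N) * sum of log P(wi | history) ) over the N test words."""
    n = len(cond_probs)
    return math.exp(-sum(math.log(p) for p in cond_probs) / n)

# The five bigram probabilities of "<s> I want english food </s>" from the earlier slide:
print(perplexity([0.25, 0.33, 0.0011, 0.5, 0.68]))  # roughly 8
```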
Perplexity as branching factor
Let's suppose a sentence consisting of N random digits.
What is the perplexity of this sentence according to a model that assigns P = 1/10 to
each digit?

PP(W) = P(w1 w2 … wN)^(-1/N)
      = ((1/10) x (1/10) x … x (1/10))^(-1/N)
      = ((1/10)^N)^(-1/N)
      = 10

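A quick numeric check (ours) of the branching-factor intuition: under a model that assigns probability 1/10 to each of N digits, the perplexity comes out to 10 for any N.

```python
N = 12                      # any sentence length gives the same answer
prob = (1 / 10) ** N        # P(w1 ... wN) under the uniform-digit model
print(prob ** (-1 / N))     # ~10 (up to floating-point rounding)
```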


Lower perplexity = better model
Training 38 million words, test 1.5 million words, WSJ

N-gram Order    Unigram    Bigram    Trigram
Perplexity          962       170        109



Language Modeling
Generalization and zeros

