
Natural Language Processing:
N-Gram Language Models

Language Models
 Formal grammars (e.g. regular, context
free) give a hard “binary” model of the
legal sentences in a language.
 For NLP, a probabilistic model of a
language that gives a probability that a
string is a member of a language is more
useful.
 To specify a correct probability
distribution, the probability of all
sentences in a language must sum to 1.
Uses of Language Models
 Speech recognition
 “I ate a cherry” is a more likely sentence than “Eye eight
uh Jerry”
 OCR & Handwriting recognition
 More probable sentences are more likely correct readings.
 Machine translation
 More likely sentences are probably better translations.
 Generation
 More likely sentences are probably better NL generations.
 Context sensitive spelling correction
 “Their are problems wit this sentence.”
Completion Prediction

 A language model also supports predicting the completion of a sentence.
 Please turn off your cell _____
 Your program does not ______
 Predictive text input systems can guess
what you are typing and give choices on
how to complete it.
N-Gram Models
 Estimate probability of each word given prior context.
 P(phone | Please turn off your cell)
 Number of parameters required grows exponentially
with the number of words of prior context.
 An N-gram model uses only N−1 words of prior context.
 Unigram: P(phone)
 Bigram: P(phone | cell)
 Trigram: P(phone | your cell)
 The Markov assumption is the presumption that the
future behavior of a dynamical system depends only on
its recent history. In particular, in a k-th-order
Markov model, the next state depends only on the k
most recent states; therefore an N-gram model is an
(N−1)-th-order Markov model.
N-Gram Model Formulas
 Word sequences:
   w_1^n = w_1 ... w_n
 Chain rule of probability:
   P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})
            = Π_{k=1..n} P(w_k | w_1^{k-1})
 Bigram approximation:
   P(w_1^n) ≈ Π_{k=1..n} P(w_k | w_{k-1})
 N-gram approximation:
   P(w_1^n) ≈ Π_{k=1..n} P(w_k | w_{k-N+1}^{k-1})
Estimating Probabilities
 N-gram conditional probabilities can be
estimated from raw text based on the relative
frequency of word sequences.
   Bigram:  P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
   N-gram:  P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
 To have a consistent probabilistic model, append
a unique start (<s>) and end (</s>) symbol to
every sentence and treat these as additional
words.
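
A minimal sketch of this relative-frequency estimation for bigrams (Python; the function and variable names are illustrative, and sentences are assumed to be pre-tokenized):

    from collections import Counter

    def train_bigram_mle(sentences):
        # Estimate P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}) from tokenized sentences.
        unigram_counts, bigram_counts = Counter(), Counter()
        for sent in sentences:
            padded = ["<s>"] + sent + ["</s>"]
            unigram_counts.update(padded[:-1])             # every token serves as a context except </s>
            bigram_counts.update(zip(padded, padded[1:]))
        return {(prev, w): c / unigram_counts[prev]
                for (prev, w), c in bigram_counts.items()}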
Word Prediction
 Guess the next word...
 ... I notice three guys standing on the ???
 There are many sources of knowledge
that could be helpful, including world
knowledge.
 But we can do pretty well by simply
looking at the preceding words and
keeping track of simple counts.
Word Prediction
 Formalize this task using N-gram models.
 N-grams are token sequences of length N.
 Given N-grams counts, we can guess likely
next words in a sequence.

N-Gram Models
 More formally, we will assess the
conditional probability of candidate words.
 Also, we will assess the probability of an
entire sequence of words.
 Being able to predict the next word (or
any linguistic unit) in a sequence is very
common in many applications.

Applications
 Word prediction lies at the core of the following applications
 Automatic speech recognition
 Handwriting and character recognition
 Spelling correction
 Machine translation
 Augmentative communication
 Word similarity, generation, POS tagging, etc.

Counting
 He stepped out into the hall, was delighted to
encounter a water brother.
 13 tokens, 15 if we include “,” and “.” as separate
tokens.
 Assuming we include the comma and period, how
many bigrams are there?
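
A quick check of these counts, treating the comma and period as separate tokens (the pre-split string below stands in for a real tokenizer):

    tokens = ("He stepped out into the hall , was delighted to "
              "encounter a water brother .").split()
    print(len(tokens))                         # 15 tokens
    print(len(list(zip(tokens, tokens[1:]))))  # 14 bigrams (no <s>/</s> padding)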

Counting
 Not always that simple
 I do uh main- mainly business data processing
 Spoken language poses various challenges.
 Should we count “uh” and other fillers as tokens?
 What about the repetition of “mainly”? Should such do-
overs count twice or just once?
 The answers depend on the application.
 If we’re focusing on something like Automatic Speech
Recognition to support indexing for search, then “uh” isn’t
helpful (it’s not likely to occur as a query).
 But filled pauses are very useful in dialog management, so we
might want them there.

Counting: Types and Tokens
 They picnicked by the pool, then lay back
on the grass and looked at the stars.
 18 tokens (again counting punctuation)
 only 16 types
 In going forward, we’ll sometimes count
types and sometimes tokens.
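
The same kind of quick check for types versus tokens (again using a pre-split string in place of a tokenizer):

    tokens = ("They picnicked by the pool , then lay back on the "
              "grass and looked at the stars .").split()
    print(len(tokens))       # 18 tokens
    print(len(set(tokens)))  # 16 types ("the" occurs three times)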

Counting: Wordforms
 Should “cats” and “cat” count as the
same?
 How about “geese” and “goose”?
 Lemma: cats & cat and geese & goose have
the same lemmas
 Wordform: fully inflected surface form
 Again, we’ll have occasion to count both
lemmas and wordforms

Counting: Corpora
 Corpus: body of text
 Google
 Crawl of 1,024,908,267,229 English tokens
 13,588,391 wordform types
 That seems like a lot of types... Even large dictionaries of English have only
around 500k types. Why so many here?

•Numbers
•Misspellings
•Names
•Acronyms
•etc

Language Modeling
 To perform word prediction:
 Assess the conditional probability of a
word given the previous words in the
sequence
 P(w_n | w_1, w_2, ..., w_{n-1}).
 We’ll call a statistical model that can
assess this a Language Model.

Language Modeling
 One way to calculate them is to use the
definition of conditional probabilities, and
get the counts from a large corpus.
 Counts are used to estimate the probabilities.

Language Modeling
 Unfortunately, for most sequences and for
most text collections we won’t get good
estimates from this method.

 We’ll use the chain rule of probability


and an independence assumption.

The Chain Rule

P(its water was so transparent)=


P(its)*
P(water|its)*
P(was|its water)*
P(so|its water was)*
P(transparent|its water was so)

 So, that’s the chain rule
 We still have the problem of estimating
probabilities of word sequences from counts;
we’ll never have enough data.
 The other thing we mentioned (two slides
back) is an independence assumption
 What type of assumption would make sense?

Independence Assumption
 Make the simplifying assumption
 P(lizard|
the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|a)
 Or maybe
 P(lizard|
the,other,day,I,was,walking,along,and,saw,a)
= P(lizard|saw,a)

Independence Assumption
 This kind of independence assumption is called a
Markov assumption after the Russian
mathematician Andrei Markov.

Markov Assumption

For each component in the product, replace it with the approximation (assuming a prefix of N):

   P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

Bigram version:

   P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1})

Estimating Bigram
Probabilities
 The Maximum Likelihood Estimate (MLE)

count(w i1,w i )
P(w i | w i1 ) 
count(w i1 )

An Example
 <s> I am Sam </s>
 <s> Sam I am </s>
 <s> I do not like green eggs and ham </s>
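
Running a relative-frequency estimator like the train_bigram_mle sketch given earlier on this tiny corpus reproduces the familiar estimates, e.g. P(I|<s>) = 2/3, P(am|I) = 2/3, P(</s>|Sam) = 1/2:

    corpus = [
        "I am Sam".split(),
        "Sam I am".split(),
        "I do not like green eggs and ham".split(),
    ]
    probs = train_bigram_mle(corpus)   # estimator sketched on the earlier slide
    print(probs[("<s>", "I")])         # 2/3 ≈ 0.667
    print(probs[("I", "am")])          # 2/3 ≈ 0.667
    print(probs[("Sam", "</s>")])      # 1/2 = 0.5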

Maximum Likelihood Estimates
 The maximum likelihood estimate of some parameter of
a model M from a training set T
 Is the estimate that maximizes the likelihood of the training set
T given the model M
 Suppose the word Chinese occurs 400 times in a corpus
of a million words (Brown corpus)
 What is the probability that a random word from some
other text from the same distribution will be “Chinese”?
 MLE estimate is 400/1000000 = .0004
 This may be a bad estimate for some other corpus
 (We’ll return to MLEs later – we’ll do smoothing to get
estimates for low-frequency or 0 counts. Even with our
independence assumptions, we still have this problem.)

Berkeley Restaurant Project
Sentences
 can you tell me about any good cantonese restaurants
close by
 mid priced thai food is what i’m looking for
 tell me about chez panisse
 can you give me a listing of the kinds of food that are
available
 i’m looking for a good place to eat breakfast
 when is caffe venezia open during the day

Bigram Counts
 Out of 9222 sentences
 E.g., “I want” occurred 827 times

Bigram Probabilities
 Divide bigram counts by prefix unigram
counts to get probabilities.

Bigram Estimates of Sentence
Probabilities
 P(<s> I want english food </s>) =
P(i|<s>)*
P(want|I)*
P(english|want)*
P(food|english)*
P(</s>|food)
= .000031
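
As a rough check, plugging in the Berkeley Restaurant Project bigram estimates reported by Jurafsky and Martin (approximately P(i|<s>) = .25, P(want|i) = .33, P(english|want) = .0011, P(food|english) = .5, P(</s>|food) = .68):

    p = 0.25 * 0.33 * 0.0011 * 0.5 * 0.68
    print(p)   # ≈ 3.1e-05, i.e. .000031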

Kinds of Knowledge
 As crude as they are, N-gram probabilities capture a
range of interesting facts about language.

 P(english|want) = .0011
 P(chinese|want) = .0065         (world knowledge)
 P(to|want) = .66
 P(eat | to) = .28
 P(food | to) = 0
 P(want | spend) = 0             (syntax)
 P(i | <s>) = .25                (discourse)

Shannon’s Method
 Let’s turn the model around and use it to
generate random sentences that are like
the sentences from which the model was
derived.
 Illustrates:
 Dependence of N-gram models on the specific
training corpus
 Greater N → better model
 Generally attributed to
Claude Shannon.
Shannon’s Method
 Sample a random bigram (<s>, w) according to its probability
 Now sample a random bigram (w, x) according to its probability
 And so on until we randomly choose a (y, </s>)
 Then string the words together
 <s> I
I want
want to
to eat
eat Chinese
Chinese food
food </s>
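
A minimal sketch of this sampling procedure (Python; assumes a bigram probability dict keyed by (previous_word, word), such as the one estimated earlier):

    import random

    def generate(bigram_probs, max_len=20):
        # Sample a sentence word by word from a bigram model, starting at <s>.
        sentence, prev = [], "<s>"
        for _ in range(max_len):
            # Distribution over possible next words given the previous word.
            candidates = [(w, p) for (ctx, w), p in bigram_probs.items() if ctx == prev]
            words, weights = zip(*candidates)
            nxt = random.choices(words, weights=weights)[0]
            if nxt == "</s>":
                break
            sentence.append(nxt)
            prev = nxt
        return " ".join(sentence)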

Shakespeare
 [Figure omitted: sentences randomly generated from unigram, bigram, trigram, and quadrigram models trained on Shakespeare]

Shakespeare as a Corpus
 N=884,647 tokens, V=29,066
 Shakespeare produced 300,000 bigram types
out of V² ≈ 844 million possible bigrams...
 So, 99.96% of the possible bigrams were never seen
(have zero entries in the table)
 This is the biggest problem in language modeling;
we’ll come back to it.
 Quadrigrams are worse: What's coming out
looks like Shakespeare because it is
Shakespeare

The Wall Street Journal is Not Shakespeare
 [Figure omitted: sentences randomly generated from N-gram models trained on the Wall Street Journal]

A bad language model
(thanks to Joshua Goodman)

Evaluation
 How do we know if one model is better
than another?
 Shannon’s game gives us an intuition.
 The generated texts from the higher order
models sound more like the text the model
was obtained from.

Evaluation
 Standard method
 Train parameters of our model on a training set.
 Look at the model’s performance on some new
data: a test set, a dataset that is different from
our training set but is drawn from the same source.
 Then we need an evaluation metric to tell us how
well our model is doing on the test set.
 One such metric is perplexity (to be introduced below)

Perplexity
 Perplexity is the inverse probability of the test set (assigned by the language model), normalized by the number of words:

   PP(W) = P(w_1 w_2 ... w_N)^(-1/N)

 Chain rule:

   PP(W) = ( Π_{i=1..N} 1 / P(w_i | w_1 ... w_{i-1}) )^(1/N)

 For bigrams:

   PP(W) = ( Π_{i=1..N} 1 / P(w_i | w_{i-1}) )^(1/N)

 Minimizing perplexity is the same as maximizing probability
 The best language model is one that best predicts an unseen test set
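
A minimal sketch of computing bigram perplexity over a test set (Python; assumes a bigram probability dict like the one estimated earlier, and works in log space to avoid underflow). Any unseen bigram drives the perplexity to infinity under unsmoothed MLE, which is exactly the zero-count problem smoothing addresses:

    import math

    def bigram_perplexity(test_sentences, bigram_probs):
        # PP = exp( -(1/N) * sum of log P(w_i | w_{i-1}) ) over the test set.
        log_prob, n = 0.0, 0
        for sent in test_sentences:
            padded = ["<s>"] + sent + ["</s>"]
            for prev, w in zip(padded, padded[1:]):
                p = bigram_probs.get((prev, w), 0.0)
                if p == 0.0:
                    return float("inf")   # unseen bigram: unsmoothed MLE assigns zero probability
                log_prob += math.log(p)
                n += 1
        return math.exp(-log_prob / n)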
Lower perplexity means a
better model
 Training 38 million words, test 1.5 million words, WSJ
 Jurafsky and Martin report perplexities of roughly 962 for a unigram model, 170 for a bigram model, and 109 for a trigram model in this setup.

Evaluating N-Gram Models
 Best evaluation for a language model
 Put model A into an application
 For example, a speech recognizer
 Evaluate the performance of the
application with model A
 Put model B into the application and
evaluate
 Compare performance of the application
with the two models
 Extrinsic evaluation
Difficulty of extrinsic (in-vivo)
evaluation of N-gram models
 Extrinsic evaluation
 This is really time-consuming
 Can take days to run an experiment
 So
 As a temporary solution, in order to run experiments
 To evaluate N-grams we often use an intrinsic
evaluation, an approximation called perplexity
 But perplexity is a poor approximation unless the test
data looks just like the training data
 So is generally only useful in pilot experiments
(generally is not sufficient to publish)
 But is helpful to think about.
