N-Gram Language Models
Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next word someone is
going to say? What word, for example, is likely to follow
The water of Walden Pond is so beautifully ...
You might conclude that a likely word is blue, or green, or clear, but probably
not refrigerator or this. In this chapter we formalize this intuition by introducing language models or LMs. A language model is a machine learning model that predicts upcoming words. More formally, a language model assigns a probability to each possible next word, or equivalently gives a probability distribution over possible next words. Language models can also assign a probability to an entire
sentence. Thus an LM could tell us that the following sequence has a much higher probability of appearing in a text than the same words in some scrambled order:

all of a sudden I notice three guys standing on the sidewalk
Why would we want to predict upcoming words, or know the probability of a sen-
tence? One reason is for generation: choosing contextually better words. For ex-
ample we can correct grammar or spelling errors like Their are two midterms,
in which There was mistyped as Their, or Everything has improve, in which
improve should have been improved. The phrase There are is more probable
than Their are, and has improved than has improve, so a language model can
help users select the more grammatical variant. Or for a speech system to recognize
that you said I will be back soonish and not I will be bassoon dish, it
helps to know that back soonish is a more probable sequence. Language models
can also help in augmentative and alternative communication (AAC) (Trnka et al. 2007, Kane et al. 2017). People can use AAC systems if they are physically unable to
speak or sign but can instead use eye gaze or other movements to select words from
a menu. Word prediction can be used to suggest likely words for the menu.
Word prediction is also central to NLP for another reason: large language models are built just by training them to predict words! As we’ll see in chapters 7-9,
large language models learn an enormous amount about language solely from being
trained to predict upcoming words from neighboring words.
In this chapter we introduce the simplest kind of language model: the n-gram language model.
3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given
some history h. Suppose the history h is “The water of Walden Pond is so
beautifully” and we want to know the probability that the next word is blue:

\[ P(\text{blue} \mid \text{The water of Walden Pond is so beautifully}) \tag{3.1} \]

One way to estimate this probability is directly from relative frequency counts: take a
very large corpus, count the number of times we see The water of Walden Pond
is so beautifully, and count the number of times this is followed by blue. This
would be answering the question “Out of the times we saw the history h, how many
times was it followed by the word w”, as follows:

\[ P(\text{blue} \mid \text{The water of Walden Pond is so beautifully}) = \frac{C(\text{The water of Walden Pond is so beautifully blue})}{C(\text{The water of Walden Pond is so beautifully})} \tag{3.2} \]
If we had a large enough corpus, we could compute these two counts and estimate
the probability from Eq. 3.2. But even the entire web isn’t big enough to give us
good estimates for counts of entire sentences. This is because language is creative;
new sentences are invented all the time, and we can’t expect to get accurate counts
for such large objects as entire sentences. For this reason, we’ll need more clever
ways to estimate the probability of a word w given a history h, or the probability of
an entire word sequence W .
Let’s start with some notation. First, throughout this chapter we’ll continue to refer to words, although in practice we usually compute language models over tokens like the BPE tokens of page ??. To represent the probability of a particular random variable $X_i$ taking on the value “the”, or $P(X_i = \text{“the”})$, we will use the simplification $P(\text{the})$. We’ll represent a sequence of $n$ words either as $w_1 \ldots w_n$ or $w_{1:n}$. Thus the expression $w_{1:n-1}$ means the string $w_1, w_2, \ldots, w_{n-1}$, but we’ll also be using the equivalent notation $w_{<n}$, which can be read as “all the elements of $w$ from $w_1$ up to and including $w_{n-1}$”. For the joint probability of each word in a sequence having a particular value $P(X_1 = w_1, X_2 = w_2, X_3 = w_3, \ldots, X_n = w_n)$ we’ll use $P(w_1, w_2, \ldots, w_n)$.
Now, how can we compute probabilities of entire sequences like $P(w_1, w_2, \ldots, w_n)$? One thing we can do is decompose this probability using the chain rule of probability:

\[ P(X_1 \ldots X_n) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_{1:2}) \ldots P(X_n \mid X_{1:n-1}) = \prod_{k=1}^{n} P(X_k \mid X_{1:k-1}) \tag{3.3} \]

Applying the chain rule to words, we get

\[ P(w_{1:n}) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_{1:2}) \ldots P(w_n \mid w_{1:n-1}) = \prod_{k=1}^{n} P(w_k \mid w_{1:k-1}) \tag{3.4} \]
The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. Equa-
tion 3.4 suggests that we could estimate the joint probability of an entire sequence of
words by multiplying together a number of conditional probabilities. But using the
chain rule doesn’t really seem to help us! We don’t know any way to compute the
exact probability of a word given a long sequence of preceding words, $P(w_n \mid w_{1:n-1})$.
As we said above, we can’t just estimate by counting the number of times every word
occurs following every long string in some corpus, because language is creative and
any particular context might have never occurred before!

The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words. The bigram model, for example, approximates the probability of a word given all the previous words $P(w_n \mid w_{1:n-1})$ by using only the conditional probability given the preceding word, $P(w_n \mid w_{n-1})$. In other words, instead of computing the probability

\[ P(\text{blue} \mid \text{The water of Walden Pond is so beautifully}) \tag{3.5} \]

we approximate it with the probability

\[ P(\text{blue} \mid \text{beautifully}) \tag{3.6} \]
When we use a bigram model to predict the conditional probability of the next word,
we are thus making the following approximation:

\[ P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1}) \tag{3.7} \]
The assumption that the probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks $n-1$ words into the past).
Let’s see a general equation for this n-gram approximation to the conditional
probability of the next word in a sequence. We’ll use $N$ here to mean the n-gram size, so $N = 2$ means bigrams and $N = 3$ means trigrams. Then we approximate the probability of a word given its entire context as follows:

\[ P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1}) \tag{3.8} \]
Given the bigram assumption for the probability of an individual word, we can com-
pute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
\[ P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \tag{3.9} \]

How do we estimate these bigram or n-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE. We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1. For example, to compute a particular bigram probability of a word $w_n$ given a previous word $w_{n-1}$, we compute the count of the bigram $C(w_{n-1} w_n)$ and normalize by the sum of all the bigrams that share the same first word $w_{n-1}$:

\[ P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} \tag{3.10} \]
We can simplify this equation, since the sum of all bigram counts that start with a given word $w_{n-1}$ must be equal to the unigram count for that word $w_{n-1}$ (the reader should take a moment to be convinced of this):

\[ P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \tag{3.11} \]
Let’s work through an example using a mini-corpus of three sentences. We’ll
first need to augment each sentence with a special symbol <s> at the beginning
of the sentence, to give us the bigram context of the first word. We’ll also need a
special end-symbol </s>.1
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this corpus:

\[ P(\text{I} \mid \text{<s>}) = \tfrac{2}{3} = 0.67 \qquad P(\text{Sam} \mid \text{<s>}) = \tfrac{1}{3} = 0.33 \qquad P(\text{am} \mid \text{I}) = \tfrac{2}{3} = 0.67 \]
\[ P(\text{</s>} \mid \text{Sam}) = \tfrac{1}{2} = 0.5 \qquad P(\text{Sam} \mid \text{am}) = \tfrac{1}{2} = 0.5 \qquad P(\text{do} \mid \text{I}) = \tfrac{1}{3} = 0.33 \]
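The counting behind these estimates is simple enough to sketch in a few lines of Python. The snippet below is a minimal illustration of Eq. 3.11 on the three-sentence mini-corpus above (the function name bigram_prob is ours, not from any library):

```python
from collections import Counter

# The three-sentence mini-corpus, already padded with <s> and </s>.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev), as in Eq. 3.11."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("<s>", "I"))     # 2/3 ~ 0.67
print(bigram_prob("I", "am"))      # 2/3 ~ 0.67
print(bigram_prob("Sam", "</s>"))  # 1/2 = 0.5
```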
For the general case of MLE n-gram parameter estimation:
\[ P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})} \tag{3.12} \]
1 We need the end-symbol to make the bigram grammar a true probability distribution. Without an end-
symbol, instead of the sentence probabilities of all sentences summing to one, the sentence probabilities
for all sentences of a given length would sum to one. This model would define an infinite set of probability
distributions, with one distribution per sentence length. See Exercise 3.5.
Equation 3.12 (like Eq. 3.11) estimates the n-gram probability by dividing the
observed frequency of a particular sequence by the observed frequency of a prefix.
This ratio is called a relative frequency. We said above that this use of relative
frequencies as a way to estimate probabilities is an example of maximum likelihood
estimation or MLE. In MLE, the resulting parameter set maximizes the likelihood of
the training set T given the model M (i.e., P(T |M)). For example, suppose the word
Chinese occurs 400 times in a corpus of a million words. What is the probability
that a random word selected from some other text of, say, a million words will be the word Chinese? The MLE of its probability is $\frac{400}{1000000}$ or 0.0004. Now 0.0004 is not
the best possible estimate of the probability of Chinese occurring in all situations; it
might turn out that in some other corpus or context Chinese is a very unlikely word.
But it is the probability that makes it most likely that Chinese will occur 400 times
in a million-word corpus. We present ways to modify the MLE estimates slightly to
get better probability estimates in Section 3.6.
Let’s move on to some examples from a real but tiny corpus, drawn from the
now-defunct Berkeley Restaurant Project, a dialogue system from the last century
that answered questions about a database of restaurants in Berkeley, California (Ju-
rafsky et al., 1994). Here are some sample user queries (text-normalized by lowercasing and with punctuation stripped); a sample of 9332 sentences is on the website:
can you tell me about any good cantonese restaurants close by
tell me about chez panisse
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Figure 3.1 shows the bigram counts from part of a bigram grammar from text-
normalized Berkeley Restaurant Project sentences. Note that the majority of the
values are zero. In fact, we have chosen the sample words to cohere with each other;
a matrix selected from a random set of eight words would be even more sparse.
Figure 3.2 shows the bigram probabilities after normalization (dividing each cell in Fig. 3.1 by the unigram count for its row word).
In practice, we compute and store language model probabilities in log format as log probabilities, since multiplying many probabilities smaller than one would eventually cause numerical underflow; we add the log probabilities instead, and take the exp of the sum only if we need a raw probability at the end. In practice throughout this book, we’ll use log to mean natural log (ln) when the base is not specified.
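As a small illustration of working in log space, the sketch below scores one sentence with a bigram model by summing log probabilities; the probability values are made up for illustration and are not taken from any figure in this chapter:

```python
import math

# Toy bigram probabilities (made-up values, for illustration only).
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "chinese"): 0.0065,
    ("chinese", "food"): 0.52,
    ("food", "</s>"): 0.68,
}

sentence = ["<s>", "i", "want", "chinese", "food", "</s>"]

# Multiplying many small probabilities risks underflow, so we add log
# probabilities instead and exponentiate only at the end (if at all).
logprob = sum(math.log(bigram_prob[bg]) for bg in zip(sentence, sentence[1:]))
print(logprob)            # log probability of the sentence
print(math.exp(logprob))  # back to a raw probability if needed
```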
Longer context. Although for pedagogical purposes we have only described bigram models, when there is sufficient training data we use trigram models, which condition on the previous two words, or 4-gram or 5-gram models. For these larger n-grams, we’ll need to assume extra contexts to the left and right of the sentence end. For example, to compute trigram probabilities at the very beginning of the sentence, we use two pseudo-words for the first trigram (i.e., P(I|<s><s>)).
Some large n-gram datasets have been created, like the million most frequent
n-grams drawn from the Corpus of Contemporary American English (COCA), a
curated 1 billion word corpus of American English (Davies, 2020), Google’s Web
5-gram corpus from 1 trillion words of English web text (Franz and Brants, 2006),
or the Google Books Ngrams corpora (800 billion tokens from Chinese, English, French, German, Hebrew, Italian, Russian, and Spanish) (Lin et al., 2012).
It’s even possible to use extremely long-range n-gram context. The infini-gram
(∞-gram) project (Liu et al., 2024) allows n-grams of any length. Their idea is to
avoid the expensive (in space and time) pre-computation of huge n-gram count ta-
bles. Instead, n-gram probabilities with arbitrary n are computed quickly at inference
time by using an efficient representation called suffix arrays. This allows computing
of n-grams of every length for enormous corpora of 5 trillion tokens.
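The suffix-array trick can be sketched in a few lines. The toy code below is not the infini-gram implementation, just an illustration of the underlying idea under our own simplified assumptions: sort all suffixes of the token sequence once, then count any n-gram, of any length, by binary search over that sorted list.

```python
import bisect

tokens = ("the water of walden pond is so beautifully blue "
          "the water of walden pond is so cold").split()

# A toy suffix array: every suffix of the corpus, as a tuple, in sorted order.
# (Real systems store only positions and build the array far more efficiently.)
suffixes = sorted(tuple(tokens[i:]) for i in range(len(tokens)))

def count(ngram):
    """Count occurrences of an n-gram of any length via binary search."""
    ngram = tuple(ngram)
    lo = bisect.bisect_left(suffixes, ngram)
    hi = lo
    # All suffixes beginning with `ngram` sit in one contiguous block.
    while hi < len(suffixes) and suffixes[hi][:len(ngram)] == ngram:
        hi += 1
    return hi - lo

# A long-context conditional probability estimated directly from counts:
history = ("the", "water", "of", "walden", "pond", "is", "so")
print(count(history + ("beautifully",)) / count(history))  # 0.5 on this toy corpus
```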
Efficiency considerations are important when building large n-gram language
models. It is standard to quantize the probabilities using only 4-8 bits (instead of
8-byte floats), store the word strings on disk and represent them in memory only as
a 64-bit hash, and represent n-grams in special data structures like ‘reverse tries’.
It is also common to prune n-gram language models, for example by only keeping
n-grams with counts greater than some threshold or using entropy to prune less-
important n-grams (Stolcke, 1998). Efficient language model toolkits like KenLM
(Heafield 2011, Heafield et al. 2013) use sorted arrays and use merge sorts to effi-
ciently build the probability tables in a minimal number of passes through a large
corpus.
3.2 Evaluating Language Models: Training and Test Sets

The training set is the data we use to learn the parameters of our model; for
simple n-gram language models it’s the corpus from which we get the counts that
we normalize into the probabilities of the n-gram language model.
The test set is a different, held-out set of data, not overlapping with the training
set, that we use to evaluate the model. We need a separate test set to give us an
unbiased estimate of how well the model we trained can generalize when we apply
it to some new unknown dataset. A machine learning model that perfectly captured
the training data, but performed terribly on any other data, wouldn’t be much use
when it comes time to apply it to any new data or problem! We thus measure the
quality of an n-gram model by its performance on this unseen test set or test corpus.
How should we choose a training and test set? The test set should reflect the
language we want to use the model for. If we’re going to use our language model
for speech recognition of chemistry lectures, the test set should be text of chemistry
lectures. If we’re going to use it as part of a system for translating hotel booking re-
quests from Chinese to English, the test set should be text of hotel booking requests.
If we want our language model to be general purpose, then the test set should be
drawn from a wide variety of texts. In such cases we might collect a lot of texts
from different sources, and then divide it up into a training set and a test set. It’s
important to do the dividing carefully; if we’re building a general purpose model,
we don’t want the test set to consist of only text from one document, or one author,
since that wouldn’t be a good measure of general performance.
Thus if we are given a corpus of text and want to compare the performance of
two different n-gram models, we divide the data into training and test sets, and train
the parameters of both models on the training set. We can then compare how well
the two trained models fit the test set.
But what does it mean to “fit the test set”? The standard answer is simple:
whichever language model assigns a higher probability to the test set—which
means it more accurately predicts the test set—is a better model. Given two proba-
bilistic models, the better model is the one that better predicts the details of the test
data, and hence will assign a higher probability to the test data.
Since our evaluation metric is based on test set probability, it’s important not to
let the test sentences into the training set. Suppose we are trying to compute the
probability of a particular “test” sentence. If our test sentence is part of the training
corpus, we will mistakenly assign it an artificially high probability when it occurs
in the test set. We call this situation training on the test set. Training on the test
set introduces a bias that makes the probabilities all look too high, and causes huge
inaccuracies in perplexity, the probability-based metric we introduce below.
Even if we don’t train on the test set, if we test our language model on the
test set many times after making different changes, we might implicitly tune to its
characteristics, by noticing which changes seem to make the model better. For this
reason, we only want to run our model on the test set once, or at most a small number of times, once we are sure our model is ready.
For this reason we normally instead have a third dataset called a development test set, or devset. We do all our testing on this dataset until the very end, and then we test on the test set once to see how good our model is.
we test on the test set once to see how good our model is.
How do we divide our data into training, development, and test sets? We want
our test set to be as large as possible, since a small test set may be accidentally un-
representative, but we also want as much training data as possible. At the minimum,
we would want to pick the smallest test set that gives us enough statistical power
to measure a statistically significant difference between two potential models. It’s
important that the devset be drawn from the same kind of text as the test set, since
its goal is to measure how we would do on the test set.
3.3 Evaluating Language Models: Perplexity

In practice we don’t use raw probability as our metric for evaluating language models, but a function of probability called perplexity. The perplexity of a language model on a test set is the inverse probability of the test set (one over the probability of the test set), normalized by the number of words. For a test set $W = w_1 w_2 \ldots w_N$:

\[ \text{perplexity}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \tag{3.15} \]

Note that because of the inverse in Eq. 3.15, the higher the probability of the word sequence, the lower the perplexity. Thus the lower the perplexity of a model on
the data, the better the model. Minimizing perplexity is equivalent to maximizing
the test set probability according to the language model. Why does perplexity use
the inverse probability? It turns out the inverse arises from the original definition
of perplexity from cross-entropy rate in information theory; for those interested, the
explanation is in the advanced Section 3.7. Meanwhile, we just have to remember
that perplexity has an inverse relationship with probability.
The details of computing the perplexity of a test set W depend on which language model we use. Here’s the perplexity of W with a unigram language model
(just the geometric mean of the inverse of the unigram probabilities):
\[ \text{perplexity}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i)}} \tag{3.16} \]

The perplexity of $W$ computed with a bigram language model is still a geometric mean, but now of the inverse of the bigram probabilities:

\[ \text{perplexity}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \tag{3.17} \]
What we generally use for the word sequence in Eq. 3.15 or Eq. 3.17 is the entire
sequence of words in some test set. Since this sequence will cross many sentence
boundaries, if our vocabulary includes a between-sentence token <EOS> or separate
begin- and end-sentence markers <s> and </s> then we can include them in the
probability computation. If we do, then we also include one token per sentence in
the total count of word tokens N.
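Here is a sketch of the perplexity computation for a bigram model, done in log space for numerical safety; the probabilities and function name below are our own toy example, not from the WSJ experiments that follow:

```python
import math

def bigram_perplexity(test_tokens, bigram_prob):
    """Perplexity of a token sequence under a bigram model (Eq. 3.17),
    computed in log space to avoid underflow."""
    N = len(test_tokens) - 1                  # predicted tokens, including </s>
    log_sum = sum(math.log2(bigram_prob[bg])
                  for bg in zip(test_tokens, test_tokens[1:]))
    return 2 ** (-log_sum / N)

# Toy model and test sequence (made-up probabilities).
probs = {("<s>", "red"): 0.8, ("red", "red"): 0.7,
         ("red", "blue"): 0.1, ("blue", "</s>"): 0.5}
test = ["<s>", "red", "red", "blue", "</s>"]
print(bigram_perplexity(test, probs))   # ~2.4
```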
We mentioned above that perplexity is a function of both the text and the lan-
guage model: given a text W , different language models will have different perplex-
ities. Because of this, perplexity can be used to compare different language models.
For example, here we trained unigram, bigram, and trigram grammars on 38 million
words from the Wall Street Journal newspaper. We then computed the perplexity of
each of these models on a WSJ test set using Eq. 3.16 for unigrams, Eq. 3.17 for
bigrams, and the corresponding equation for trigrams. The table below shows the
perplexity of the 1.5 million word test set according to each of the language models.
Unigram Bigram Trigram
Perplexity 962 170 109
As we see above, the more information the n-gram gives us about the word
sequence, the higher the probability the n-gram will assign to the string. A trigram
model is less surprised than a unigram model because it has a better idea of what
words might come next, and so it assigns them a higher probability. And the higher
the probability, the lower the perplexity (since as Eq. 3.15 showed, perplexity is
related inversely to the probability of the test sequence according to the model). So
a lower perplexity tells us that a language model is a better predictor of the test set.
Note that in computing perplexities, the language model must be constructed
without any knowledge of the test set, or else the perplexity will be artificially low.
And the perplexity of two language models is only comparable if they use identical
vocabularies.
An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) im-
provement in the performance of a language processing task like speech recognition
or machine translation. Nonetheless, because perplexity usually correlates with task
improvements, it is commonly used as a convenient evaluation metric. Still, when
possible a model’s improvement in perplexity should be confirmed by an end-to-end
evaluation on a real task.
Perplexity can also be thought of as the weighted average branching factor of a language: the number of possible next words that can follow any word. To see this intuition, consider a mini artificial language that is deterministic (no probabilities), in which any word can follow any word, and whose vocabulary consists of only three colors:

L = {red, green, blue}

The branching factor of this language is 3, and a language model A that assigns each of the three colors an equal probability will have a perplexity of 3 on a test set drawn from this language. Now suppose that red is in fact much more frequent than green or blue, and that a language model B has learned these skewed probabilities. We should expect the perplexity of the same test set red red red red blue for
language model B to be lower since most of the time the next color will be red, which
is very predictable, i.e. has a high probability. So the probability of the test set will
be higher, and since perplexity is inversely related to probability, the perplexity will
be lower. Thus, although the branching factor is still 3, the perplexity or weighted branching factor is smaller.

A useful way to see what a language model has learned is to sample from it: to generate sentences by choosing words according to the probabilities the model assigns, as visualized in Fig. 3.3.
Figure 3.3 A visualization of the sampling distribution for sampling sentences by repeat-
edly sampling unigrams. The blue bar represents the relative frequency of each word (we’ve
ordered them from most frequent to least frequent, but the choice of order is arbitrary). The
number line shows the cumulative probabilities. If we choose a random number between 0
and 1, it will fall in an interval corresponding to some word. The expectation for the random
number to fall in the larger intervals of one of the frequent words (the, of, a) is much higher
than in the smaller interval of one of the rare words (polyphonic).
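The sampling procedure that Fig. 3.3 visualizes is easy to write down. The sketch below draws unigrams by the cumulative-interval method described in the caption; the toy distribution is invented for illustration.

```python
import random

def sample_unigram(vocab_probs):
    """Draw one word as in Fig. 3.3: pick r in [0,1) and return the word
    whose cumulative-probability interval contains r."""
    r = random.random()
    cumulative = 0.0
    for word, p in vocab_probs:
        cumulative += p
        if r < cumulative:
            return word
    return vocab_probs[-1][0]   # guard against floating-point round-off

# A toy unigram distribution (made-up probabilities that sum to 1).
unigram = [("the", 0.06), ("of", 0.03), ("a", 0.02), ("to", 0.02),
           ("in", 0.02), ("polyphonic", 0.0000018), ("other", 0.8499982)]
print(" ".join(sample_unigram(unigram) for _ in range(10)))
```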
1-gram:
–To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
–Hill he late speaks; or! a more to leg less first you enter

2-gram:
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.

3-gram:
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.
–This shall forbid it should be branded, if renown made it empty.

4-gram:
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
–It cannot be but so.

Figure 3.4 Eight sentences randomly generated from four n-gram models computed from Shakespeare’s works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
Shakespeare and the WSJ are both English, so we might have expected some overlap between our
n-grams for the two genres. Fig. 3.5 shows sentences generated by unigram, bigram,
and trigram grammars trained on 40 million words from WSJ.
Compare these examples to the pseudo-Shakespeare in Fig. 3.4. While they both
model “English-like sentences”, there is no overlap in the generated sentences, and
little overlap even in small phrases. Statistical models are pretty useless as predictors
if the training sets and the test sets are as different as Shakespeare and the WSJ.
How should we deal with this problem when we build n-gram models? One step
is to be sure to use a training corpus that has a similar genre to whatever task we are
trying to accomplish. To build a language model for translating legal documents,
we need a training corpus of legal documents. To build a language model for a
question-answering system, we need a training corpus of questions.
It is equally important to get training data in the appropriate dialect or variety,
especially when processing social media posts or spoken transcripts. For exam-
ple some tweets will use features of African American English (AAE)— the name
for the many variations of language used in African American communities (King,
2020). Such features can include words like finna—an auxiliary verb that marks
immediate future tense—that don’t occur in other varieties, or spellings like den for then (Blodgett and O’Connor, 2017).
3.6 Smoothing, Interpolation, and Backoff

To keep a language model from assigning zero probability to n-grams that never occurred in training, we shave off a bit of probability mass from more frequent events and give it to unseen events, a modification called smoothing or discounting. The simplest approach is Laplace smoothing: add one to all the n-gram counts before normalizing them into probabilities. Let’s first consider the unigram case. The unsmoothed maximum likelihood estimate of the probability of the word $w_i$ is its count $c_i$ normalized by the total number of word tokens $N$:

\[ P(w_i) = \frac{c_i}{N} \]
Laplace smoothing merely adds one to each count (hence its alternate name add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations. (What happens to our P values if we don’t increase the denominator?)

\[ P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V} \tag{3.24} \]
Now that we have the intuition for the unigram case, let’s smooth our Berkeley
Restaurant Project bigrams. Figure 3.6 shows the add-one smoothed counts for the
bigrams in Fig. 3.1.
Figure 3.7 shows the add-one smoothed probabilities for the bigrams in Fig. 3.2,
computed by Eq. 3.26 below. Recall that normal bigram probabilities are computed
by normalizing each row of counts by the unigram count:
\[ P_{\text{MLE}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \tag{3.25} \]
For add-one smoothed bigram counts, we need to augment the unigram count in the
denominator by the number of total word types in the vocabulary V . We can see
why this is in the following equation, which makes it explicit that the unigram count
in the denominator is really the sum over all the bigrams that start with wn−1 . Since
we add one to each of these, and there are V of them, we add a total of V to the
denominator:
\[ P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{\sum_{w} \left( C(w_{n-1} w) + 1 \right)} = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V} \tag{3.26} \]
Thus, each of the unigram counts given on page 5 will need to be augmented by V =
1446. The result, using Eq. 3.26, is the smoothed bigram probabilities in Fig. 3.7.
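Eq. 3.26 translates directly into code. The sketch below reuses the counting scheme from the I-am-Sam sketch earlier; the helper name laplace_bigram_prob is our own.

```python
from collections import Counter

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>",
          "<s> I do not like green eggs and ham </s>"]
bigram_counts, unigram_counts = Counter(), Counter()
for sentence in corpus:
    toks = sentence.split()
    unigram_counts.update(toks)
    bigram_counts.update(zip(toks, toks[1:]))
V = len(unigram_counts)   # 12 word types in this corpus, counting <s> and </s>

def laplace_bigram_prob(prev, word):
    """Add-one smoothed estimate (C(prev word) + 1) / (C(prev) + V), Eq. 3.26."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram_prob("I", "am"))   # seen bigram:   (2+1)/(3+12) = 0.2
print(laplace_bigram_prob("I", "Sam"))  # unseen bigram: (0+1)/(3+12) ~ 0.067
```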
One useful visualization technique is to reconstruct an adjusted count matrix
so we can see how much a smoothing algorithm has changed the original counts.
This adjusted count C∗ is the count that, if divided by C(wn−1 ), would result in
the smoothed probability. This adjusted count is easier to compare directly with
the MLE counts. That is, the Laplace probability can equally be expressed as the
adjusted count divided by the (non-smoothed) denominator from Eq. 3.25:
\[ P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V} = \frac{C^*(w_{n-1} w_n)}{C(w_{n-1})} \]
Solving for $C^*$ gives

\[ C^*(w_{n-1} w_n) = \frac{\left[ C(w_{n-1} w_n) + 1 \right] \times C(w_{n-1})}{C(w_{n-1}) + V} \tag{3.27} \]
Figure 3.8 shows the reconstructed counts, computed by Eq. 3.27.
i want to eat chinese food lunch spend
i 3.8 527 0.64 6.4 0.64 0.64 0.64 1.9
want 1.2 0.39 238 0.78 2.7 2.7 2.3 0.78
to 1.9 0.63 3.1 430 1.9 0.63 4.4 133
eat 0.34 0.34 1 0.34 5.8 1 15 0.34
chinese 0.2 0.098 0.098 0.098 0.098 8.2 0.2 0.098
food 6.9 0.43 6.9 0.43 0.86 2.2 0.43 0.43
lunch 0.57 0.19 0.19 0.19 0.19 0.38 0.19 0.19
spend 0.32 0.16 0.32 0.16 0.16 0.16 0.16 0.16
Figure 3.8 Add-one reconstituted counts for eight words (of V = 1446) in the BeRP corpus
of 9332 sentences, computed by Eq. 3.27. Previously-zero counts are in gray.
Note that add-one smoothing has made a very big change to the counts. Com-
paring Fig. 3.8 to the original counts in Fig. 3.1, we can see that C(want to) changed
from 608 to 238! We can see this in probability space as well: P(to|want) decreases
from 0.66 in the unsmoothed case to 0.26 in the smoothed case. Looking at the dis-
count d, defined as the ratio between new and old counts, shows us how strikingly
the counts for each prefix word have been reduced; the discount for the bigram want
to is 0.39, while the discount for Chinese food is 0.10, a factor of 10! The sharp
change occurs because too much probability mass is moved to all the zeros.
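We can check the want to numbers above with Eq. 3.27. The sketch below assumes the BeRP unigram count C(want) = 927, a value from the corpus behind Fig. 3.1 that is not reproduced in this excerpt, so treat it as an assumption.

```python
# Verifying the "want to" discount via the adjusted count of Eq. 3.27.
C_want_to = 608   # original bigram count C(want to), given in the text
C_want = 927      # ASSUMED unigram count C(want) from the BeRP corpus
V = 1446          # vocabulary size used in the text

c_star = (C_want_to + 1) * C_want / (C_want + V)
print(round(c_star))                  # ~238, the adjusted count discussed above
print(round(c_star / C_want_to, 2))   # discount d ~ 0.39
```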
One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k. This algorithm is therefore called add-k smoothing.

\[ P^*_{\text{Add-k}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV} \tag{3.28} \]
Add-k smoothing requires that we have a method for choosing k; this can be
done, for example, by optimizing on a devset. Although add-k is useful for some
tasks (including text classification), it turns out that it still doesn’t work well for
language modeling, generating counts with poor variances and often inappropriate
discounts (Gale and Church, 1994).
In simple linear interpolation, we combine different order n-grams by linearly interpolating them, weighting each n-gram estimate by a λ:

\[ \hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n \mid w_{n-2} w_{n-1}) \tag{3.29} \]

The λs must sum to 1, making Eq. 3.29 equivalent to a weighted average. In a slightly more sophisticated version, the λ weights can be conditioned on the context. How are these λ values set? Both the simple interpolation and conditional interpolation λs are learned from a held-out corpus. A held-out corpus is an additional training corpus, so-called because we hold it out from the training data, that we use
to set these λ values.4 We do so by choosing the λ values that maximize the likeli-
hood of the held-out corpus. That is, we fix the n-gram probabilities and then search
for the λ values that—when plugged into Eq. 3.29—give us the highest probability
of the held-out set. There are various ways to find this optimal set of λ s. One way
is to use the EM algorithm, an iterative learning algorithm that converges on locally
optimal λ s (Jelinek and Mercer, 1980).
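Here is a sketch of simple linear interpolation (Eq. 3.29), together with a coarse grid search over the λs on a held-out set. The grid search stands in for the EM algorithm mentioned above as a simpler (and cruder) alternative; the function names and the 1e-12 probability floor are our own choices.

```python
import itertools
import math

def interp_prob(trigram, p_uni, p_bi, p_tri, lambdas):
    """Interpolated trigram estimate, Eq. 3.29."""
    w1, w2, w3 = trigram
    l1, l2, l3 = lambdas
    return l1 * p_uni(w3) + l2 * p_bi(w2, w3) + l3 * p_tri(w1, w2, w3)

def choose_lambdas(heldout_trigrams, p_uni, p_bi, p_tri, step=0.1):
    """Pick lambdas (summing to 1) that maximize held-out log likelihood."""
    best, best_ll = None, -math.inf
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l1, l2 in itertools.product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue
        lams = (l1, l2, max(l3, 0.0))
        ll = sum(math.log(max(interp_prob(t, p_uni, p_bi, p_tri, lams), 1e-12))
                 for t in heldout_trigrams)
        if ll > best_ll:
            best, best_ll = lams, ll
    return best
```

Here p_uni, p_bi, and p_tri are any estimators of the unigram, bigram, and trigram probabilities, for example MLE estimators computed as in the earlier sketches.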
3 We won’t discuss the less-common alternative, called backoff, in which we use the trigram if the
evidence is sufficient for it, but if not we instead just use the bigram, otherwise the unigram. That is, we
only “back off” to a lower-order n-gram if we have zero evidence for a higher-order n-gram.
4 Held-out corpora are generally used to set hyperparameters, which are special parameters, unlike
regular counts that are learned from the training data; we’ll discuss hyperparameters in Chapter 7.
For very large language models, a much simpler scheme called stupid backoff (Brants et al., 2007) is often used instead. In stupid backoff, if a higher-order n-gram has a non-zero count we use its relative frequency directly; otherwise we back off to the next-lower-order n-gram, multiplying its score by a fixed weight λ. The resulting scores are not true probabilities (they are not normalized to sum to 1). The backoff terminates in the unigram, which has score $S(w) = \frac{\text{count}(w)}{N}$. Brants et al. (2007) find that a value of 0.4 worked well for λ.
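A sketch of the stupid backoff score just described (the function name is ours; counts is assumed to map n-gram tuples to corpus counts):

```python
def stupid_backoff_score(words, counts, total_tokens, lam=0.4):
    """Stupid backoff score S for `words` = context + predicted word.
    Scores are not true probabilities (they are not normalized)."""
    if len(words) == 1:                      # backoff terminates in the unigram
        return counts.get(words, 0) / total_tokens
    context = words[:-1]
    if counts.get(words, 0) > 0 and counts.get(context, 0) > 0:
        return counts[words] / counts[context]
    return lam * stupid_backoff_score(words[1:], counts, total_tokens, lam)

# Toy counts (made up): the trigram is unseen, so we back off to the bigram.
counts = {("i", "want", "food"): 0, ("want", "food"): 3, ("i", "want"): 5,
          ("want",): 10, ("food",): 4}
print(stupid_backoff_score(("i", "want", "food"), counts, total_tokens=100))
# 0.4 * C(want food)/C(want) = 0.4 * 0.3 = 0.12
```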
3.7 Advanced: Perplexity's Relation to Entropy

The perplexity measure arises from the information-theoretic notion of cross-entropy, which is in turn based on entropy. Entropy is a measure of information: given a random variable X with probability function p(x), the entropy H(X) is $-\sum_x p(x) \log_2 p(x)$, and if the log is taken base 2 the value is measured in bits. As a classic example, suppose we want to send a bookie a short message saying which of eight horses to bet on. One way to encode the message is the binary representation of the horse's number, which takes 3 bits per race. Can we do better? Suppose the bets are not spread evenly, but follow this distribution:

Horse 1: 1/2     Horse 5: 1/64
Horse 2: 1/4     Horse 6: 1/64
Horse 3: 1/8     Horse 7: 1/64
Horse 4: 1/16    Horse 8: 1/64
The entropy of the random variable X that ranges over horses gives us a lower
bound on the number of bits and is
\[ H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{8}\log_2\tfrac{1}{8} - \tfrac{1}{16}\log_2\tfrac{1}{16} - 4\left(\tfrac{1}{64}\log_2\tfrac{1}{64}\right) = 2 \text{ bits} \tag{3.33} \]
A code that averages 2 bits per race can be built with short encodings for more
probable horses, and longer encodings for less probable horses. For example, we
could encode the most likely horse with the code 0, and the remaining horses as 10,
then 110, 1110, 111100, 111101, 111110, and 111111.
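Both the entropy and the average length of this variable-length code can be checked in a couple of lines:

```python
import math

# The horse-race distribution from the text.
probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

entropy = -sum(p * math.log2(p) for p in probs)
print(entropy)   # 2.0 bits, matching Eq. 3.33

# Lengths of the code words 0, 10, 110, 1110, 111100, 111101, 111110, 111111:
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]
print(sum(p * l for p, l in zip(probs, code_lengths)))   # also 2.0 bits on average
```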
What if the horses are equally likely? We saw above that if we used an equal-
length binary code for the horse numbers, each horse took 3 bits to code, so the
average was 3. Is the entropy the same? In this case each horse would have a
probability of $\frac{1}{8}$. The entropy of the choice of horses is then

\[ H(X) = -\sum_{i=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = -\log_2 \tfrac{1}{8} = 3 \text{ bits} \tag{3.34} \]
Until now we have been computing the entropy of a single variable. But most of
what we will use entropy for involves sequences. For a grammar, for example, we
will be computing the entropy of some sequence of words W = {w1 , w2 , . . . , wn }.
One way to do this is to have a variable that ranges over sequences of words. For
example we can compute the entropy of a random variable that ranges over all se-
quences of words of length n in some language L as follows:
\[ H(w_1, w_2, \ldots, w_n) = -\sum_{w_{1:n} \in L} p(w_{1:n}) \log p(w_{1:n}) \tag{3.35} \]
We could define the entropy rate (we could also think of this as the per-word entropy) as the entropy of this sequence divided by the number of words:

\[ \frac{1}{n} H(w_{1:n}) = -\frac{1}{n} \sum_{w_{1:n} \in L} p(w_{1:n}) \log p(w_{1:n}) \tag{3.36} \]
But to measure the true entropy of a language, we need to consider sequences of infinite length. If we think of a language as a stochastic process L that produces a sequence of words, its entropy rate H(L) is defined as the limit of the per-word entropy as the sequence length grows:

\[ H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_{1:n}) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_{1:n}) \log p(w_{1:n}) \tag{3.37} \]
The Shannon-McMillan-Breiman theorem (Algoet and Cover 1988, Cover and Thomas
1991) states that if the language is regular in certain ways (to be exact, if it is both
stationary and ergodic),
\[ H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_{1:n}) \tag{3.38} \]
That is, we can take a single sequence that is long enough instead of summing over
all possible sequences. The intuition of the Shannon-McMillan-Breiman theorem
is that a long-enough sequence of words will contain in it many other shorter se-
quences and that each of these shorter sequences will reoccur in the longer sequence
according to their probabilities.
A stochastic process is said to be stationary if the probabilities it assigns to a
sequence are invariant with respect to shifts in the time index. In other words, the
probability distribution for words at time t is the same as the probability distribution
at time t + 1. Markov models, and hence n-grams, are stationary. For example, in
a bigram, $P_i$ is dependent only on $P_{i-1}$. So if we shift our time index by x, $P_{i+x}$ is still dependent on $P_{i+x-1}$. But natural language is not stationary, since as we show
in Appendix D, the probability of upcoming words can be dependent on events that
were arbitrarily distant and time dependent. Thus, our statistical models only give
an approximation to the correct distributions and entropies of natural language.
To summarize, by making some incorrect but convenient simplifying assump-
tions, we can compute the entropy of some stochastic process by taking a very long
sample of the output and computing its average log probability.
Now we are ready to introduce cross-entropy. The cross-entropy is useful when we don’t know the actual probability distribution p that generated some data. It allows us to use some m, which is a model of p (i.e., an approximation to p). The cross-entropy of m on p is defined by

\[ H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n) \tag{3.39} \]
That is, we draw sequences according to the probability distribution p, but sum the
log of their probabilities according to m.
Again, following the Shannon-McMillan-Breiman theorem, for a stationary er-
godic process:
\[ H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n) \tag{3.40} \]
This means that, as for entropy, we can estimate the cross-entropy of a model m
on some distribution p by taking a single sequence that is long enough instead of
summing over all possible sequences.
What makes the cross-entropy useful is that the cross-entropy H(p, m) is an up-
per bound on the entropy H(p). For any model m:

\[ H(p) \leq H(p, m) \tag{3.41} \]
This means that we can use some simplified model m to help estimate the true en-
tropy of a sequence of symbols drawn according to probability p. The more accurate
m is, the closer the cross-entropy H(p, m) will be to the true entropy H(p). Thus,
the difference between H(p, m) and H(p) is a measure of how accurate a model is.
Between two models m1 and m2 , the more accurate model will be the one with the
lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so
a model cannot err by underestimating the true entropy.)
We are finally ready to see the relation between perplexity and cross-entropy
as we saw it in Eq. 3.40. Cross-entropy is defined in the limit as the length of the
observed word sequence goes to infinity. We approximate this cross-entropy by
relying on a (sufficiently long) sequence of fixed length. This approximation to the
cross-entropy of a model $M = P(w_i \mid w_{i-N+1:i-1})$ on a sequence of words $W$ is

\[ H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N) \tag{3.42} \]
The perplexity of a model P on a sequence of words W is now formally defined as 2 raised to the power of this cross-entropy:

\[ \text{Perplexity}(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \]
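The equivalence between the two forms above is easy to confirm numerically; the sequence probability below is an arbitrary made-up value.

```python
import math

N, seq_prob = 5, 1.2e-4   # toy values: N words with joint probability seq_prob

H = -(1 / N) * math.log2(seq_prob)   # cross-entropy estimate, Eq. 3.42
print(2 ** H)                        # perplexity as 2 to the cross-entropy
print(seq_prob ** (-1 / N))          # identical: P(w_1..w_N)^(-1/N)
```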
3.8 Summary
This chapter introduced language modeling via the n-gram model, a classic model
that allows us to introduce many of the basic concepts in language modeling.
• Language models offer a way to assign a probability to a sentence or other
sequence of words or tokens, and to predict a word or token from preceding
words or tokens.
• N-grams are perhaps the simplest kind of language model. They are Markov
models that estimate words from a fixed window of previous words. N-gram
models can be trained by counting in a training corpus and normalizing the
counts (the maximum likelihood estimate).
• N-gram language models can be evaluated on a test set using perplexity.
• The perplexity of a test set according to a language model is a function of
the probability of the test set: the inverse test set probability according to the
model, normalized by the length.
• Sampling from a language model means to generate some sentences, choos-
ing each sentence according to its likelihood as defined by the model.
• Smoothing algorithms provide a way to estimate probabilities for events that
were unseen in training. Commonly used smoothing algorithms for n-grams
include add-1 smoothing, or rely on lower-order n-gram counts through inter-
polation.
Bibliographical and Historical Notes

The underlying mathematics of the n-gram was first proposed by Markov (1913), who used what are now called Markov chains (bigrams and trigrams) to model letter sequences in Pushkin’s Eugene Onegin, computing the bigram and trigram probability that a given letter would be a vowel given the previous one or
two letters. Shannon (1948) applied n-grams to compute approximations to English
word sequences. Based on Shannon’s work, Markov models were commonly used in
engineering, linguistic, and psychological work on modeling word sequences by the
1950s. In a series of extremely influential papers starting with Chomsky (1956) and
including Chomsky (1957) and Miller and Chomsky (1963), Noam Chomsky argued
that “finite-state Markov processes”, while a possibly useful engineering heuristic,
were incapable of being a complete cognitive model of human grammatical knowl-
edge. These arguments led many linguists and computational linguists to ignore
work in statistical modeling for decades.
The resurgence of n-gram language models came from Fred Jelinek and col-
leagues at the IBM Thomas J. Watson Research Center, who were influenced by
Shannon, and James Baker at CMU, who was influenced by the prior, classified
work of Leonard Baum and colleagues on these topics at labs like the US Institute
for Defense Analyses (IDA) after they were declassified. Independently these two
labs successfully used n-grams in their speech recognition systems at the same time
(Baker 1975b, Jelinek et al. 1975, Baker 1975a, Bahl et al. 1983, Jelinek 1990). The
terms “language model” and “perplexity” were first used for this technology by the
IBM group. Jelinek and his colleagues used the term language model in a pretty
modern way, to mean the entire set of linguistic influences on word sequence prob-
abilities, including grammar, semantics, discourse, and even speaker characteristics,
rather than just the particular n-gram model itself.
Add-one smoothing derives from Laplace’s 1812 law of succession and was first
applied as an engineering solution to the zero frequency problem by Jeffreys (1948)
based on an earlier Add-K suggestion by Johnson (1932). Problems with the add-
one algorithm are summarized in Gale and Church (1994).
A wide variety of different language modeling and smoothing techniques were
proposed in the 80s and 90s, including Good-Turing discounting—first applied to the
n-gram smoothing at IBM by Katz (Nádas 1984, Church and Gale 1991)—Witten-Bell discounting (Witten and Bell, 1991), and varieties of class-based n-gram models that used information about word classes. Starting in the late 1990s, Chen and
Goodman performed a number of carefully controlled experiments comparing dif-
ferent algorithms and parameters (Chen and Goodman 1999, Goodman 2006, inter
alia). They showed the advantages of Modified Interpolated Kneser-Ney, which
became the standard baseline for n-gram language modeling around the turn of the
century, especially because they showed that caches and class-based models pro-
vided only minor additional improvement. SRILM (Stolcke, 2002) and KenLM
(Heafield 2011, Heafield et al. 2013) are publicly available toolkits for building n-
gram language models.
Large language models are based on neural networks rather than n-grams, en-
abling them to solve the two major problems with n-grams: (1) the number of param-
eters increases exponentially as the n-gram order increases, and (2) n-grams have no
way to generalize from training examples to test set examples unless they use iden-
tical words. Neural language models instead project words into a continuous space
in which words with similar contexts have similar representations. We’ll introduce
transformer-based large language models in Chapter 9, along the way introducing
feedforward language models (Bengio et al. 2006, Schwenk 2007) in Chapter 7 and
recurrent language models (Mikolov, 2012) in Chapter 8.
Exercises
3.1 Write out the equation for trigram probability estimation (modifying Eq. 3.11).
Now write out all the non-zero trigram probabilities for the I am Sam corpus
on page 4.
3.2 Calculate the probability of the sentence i want chinese food. Give two
probabilities, one using Fig. 3.2 and the ‘useful probabilities’ just below it on
page 6, and another using the add-1 smoothed table in Fig. 3.7. Assume the
additional add-1 smoothed probabilities P(i|<s>) = 0.19 and P(</s>|food) =
0.40.
3.3 Which of the two probabilities you computed in the previous exercise is higher,
unsmoothed or smoothed? Explain why.
3.4 We are given the following corpus, modified from the one in the chapter:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
Using a bigram language model with add-one smoothing, what is P(Sam |
am)? Include <s> and </s> in your counts just like any other token.
3.5 Suppose we didn’t use the end-symbol </s>. Train an unsmoothed bigram
grammar on the following training corpus without using the end-symbol </s>:
<s> a b
<s> b b
<s> b a
<s> a a
Demonstrate that your bigram model does not assign a single probability dis-
tribution across all sentence lengths by showing that the sum of the probability
of the four possible 2 word sentences over the alphabet {a,b} is 1.0, and the
sum of the probability of all possible 3 word sentences over the alphabet {a,b}
is also 1.0.
3.6 Suppose we train a trigram language model with add-one smoothing on a
given corpus. The corpus contains V word types. Express a formula for esti-
mating P(w3|w1,w2), where w3 is a word which follows the bigram (w1,w2),
in terms of various n-gram counts and V. Use the notation c(w1,w2,w3) to
denote the number of times that trigram (w1,w2,w3) occurs in the corpus, and
so on for bigrams and unigrams.
3.7 We are given the following corpus, modified from the one in the chapter:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I am Sam </s>
<s> I do not like green eggs and Sam </s>
If we use linear interpolation smoothing between a maximum-likelihood bigram model and a maximum-likelihood unigram model with λ1 = 1/2 and λ2 = 1/2, what is P(Sam|am)? Include <s> and </s> in your counts just like any other token.
3.8 Write a program to compute unsmoothed unigrams and bigrams.
3.9 Run your n-gram program on two different small corpora of your choice (you
might use email text or newsgroups). Now compare the statistics of the two
corpora. What are the differences in the most common unigrams between the
two? How about interesting differences in bigrams?
3.10 Add an option to your program to generate random sentences.
3.11 Add an option to your program to compute the perplexity of a test set.
3.12 You are given a training set of 100 numbers that consists of 91 zeros and 1
each of the other digits 1-9. Now we see the following test set: 0 0 0 0 0 3 0 0
0 0. What is the unigram perplexity?
Algoet, P. H. and T. M. Cover. 1988. A sandwich proof of the Shannon-McMillan-Breiman theorem. The Annals of Probability, 16(2):899–909.
Bahl, L. R., F. Jelinek, and R. L. Mercer. 1983. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):179–190.
Baker, J. K. 1975a. The DRAGON system – An overview. IEEE Transactions on ASSP, ASSP-23(1):24–29.
Baker, J. K. 1975b. Stochastic modeling for automatic speech understanding. In D. R. Reddy, ed., Speech Recognition. Academic Press.
Bengio, Y., H. Schwenk, J.-S. Senécal, F. Morin, and J.-L. Gauvain. 2006. Neural probabilistic language models. In Innovations in Machine Learning, 137–186. Springer.
Blodgett, S. L. and B. O’Connor. 2017. Racial disparity in natural language processing: A case study of social media African-American English. FAT/ML Workshop, KDD.
Brants, T., A. C. Popat, P. Xu, F. J. Och, and J. Dean. 2007. Large language models in machine translation. EMNLP/CoNLL.
Chen, S. F. and J. Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13:359–394.
Chomsky, N. 1956. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124.
Chomsky, N. 1957. Syntactic Structures. Mouton.
Church, K. W. and W. A. Gale. 1991. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language, 5:19–54.
Cover, T. M. and J. A. Thomas. 1991. Elements of Information Theory. Wiley.
Davies, M. 2020. The Corpus of Contemporary American English (COCA): One billion words, 1990-2019. https://www.english-corpora.org/coca/.
Franz, A. and T. Brants. 2006. All our n-gram are belong to you. https://research.google/blog/all-our-n-gram-are-belong-to-you/.
Gale, W. A. and K. W. Church. 1994. What is wrong with adding one? In N. Oostdijk and P. de Haan, eds, Corpus-Based Research into Language, 189–198. Rodopi.
Goodman, J. 2006. A bit of progress in language modeling: Extended version. Technical Report MSR-TR-2001-72, Machine Learning and Applied Statistics Group, Microsoft Research, Redmond, WA.
Heafield, K. 2011. KenLM: Faster and smaller language model queries. Workshop on Statistical Machine Translation.
Heafield, K., I. Pouzyrevsky, J. H. Clark, and P. Koehn. 2013. Scalable modified Kneser-Ney language model estimation. ACL.
Jeffreys, H. 1948. Theory of Probability, 2nd edition. Clarendon Press. Section 3.23.
Jelinek, F. 1990. Self-organized language modeling for speech recognition. In A. Waibel and K.-F. Lee, eds, Readings in Speech Recognition, 450–506. Morgan Kaufmann. Originally distributed as IBM technical report in 1985.
Jelinek, F. and R. L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, eds, Proceedings, Workshop on Pattern Recognition in Practice, 381–397. North Holland.
Jelinek, F., R. L. Mercer, and L. R. Bahl. 1975. Design of a linguistic statistical decoder for the recognition of continuous speech. IEEE Transactions on Information Theory, IT-21(3):250–256.
Johnson, W. E. 1932. Probability: deductive and inductive problems (appendix to). Mind, 41(164):421–423.
Jurafsky, D., C. Wooters, G. Tajchman, J. Segal, A. Stolcke, E. Fosler, and N. Morgan. 1994. The Berkeley restaurant project. ICSLP.
Jurgens, D., Y. Tsvetkov, and D. Jurafsky. 2017. Incorporating dialectal variability for socially equitable language identification. ACL.
Kane, S. K., M. R. Morris, A. Paradiso, and J. Campbell. 2017. “at times avuncular and cantankerous, with the reflexes of a mongoose”: Understanding self-expression through augmentative and alternative communication devices. CSCW.
King, S. 2020. From African American Vernacular English to African American Language: Rethinking the study of race and language in African Americans’ speech. Annual Review of Linguistics, 6:285–300.
Lin, Y., J.-B. Michel, E. Aiden Lieberman, J. Orwant, W. Brockman, and S. Petrov. 2012. Syntactic annotations for the Google books NGram corpus. ACL.
Liu, J., S. Min, L. Zettlemoyer, Y. Choi, and H. Hajishirzi. 2024. Infini-gram: Scaling unbounded n-gram language models to a trillion tokens. ArXiv preprint.
Markov, A. A. 1913. Essai d’une recherche statistique sur le texte du roman “Eugene Onegin” illustrant la liaison des epreuve en chain (‘Example of a statistical investigation of the text of “Eugene Onegin” illustrating the dependence between samples in chain’). Izvistia Imperatorskoi Akademii Nauk (Bulletin de l’Académie Impériale des Sciences de St.-Pétersbourg), 7:153–162.
Mikolov, T. 2012. Statistical language models based on neural networks. Ph.D. thesis, Brno University of Technology.
Miller, G. A. and N. Chomsky. 1963. Finitary models of language users. In R. D. Luce, R. R. Bush, and E. Galanter, eds, Handbook of Mathematical Psychology, volume II, 419–491. John Wiley.
Miller, G. A. and J. A. Selfridge. 1950. Verbal context and the recall of meaningful material. American Journal of Psychology, 63:176–185.
Nádas, A. 1984. Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on ASSP, 32(4):859–861.
Schwenk, H. 2007. Continuous space language models. Computer Speech & Language, 21(3):492–518.
Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal, 27(3):379–423. Continued in the following volume.
Stolcke, A. 1998. Entropy-based pruning of backoff language models. Proc. DARPA Broadcast News Transcription and Understanding Workshop.