
Natural Language Processing

Probabilistic Language Models

Felipe Bravo-Marquez

March 26, 2024


Overview

• The language modeling problem


• Trigram models
• Evaluating language models: perplexity
• Estimation techniques:
1. Linear interpolation
2. Discounting methods
• These slides are based on the course material by Michael Collins:
http://www.cs.columbia.edu/~mcollins/cs4705-spring2019/slides/lmslides.pdf
The Language Modeling Problem

• We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, …}.
• We have an (infinite) set of strings, V*.
• For example:
• the STOP
• a STOP
• the fan STOP
• the fan saw Beckham STOP
• the fan saw saw STOP
• the fan saw Beckham play for Real Madrid STOP
• Where STOP is a special symbol indicating the end of a sentence.
The Language Modeling Problem (Continued)
• We have a training sample of example sentences in English.
• We need to "learn" a probability distribution p.
• p is a function that satisfies:

    ∑_{x ∈ V*} p(x) = 1,   and   p(x) ≥ 0 for all x ∈ V*

• Examples of probabilities assigned to sentences:

    p(the STOP) = 10^-12
    p(the fan STOP) = 10^-8
    p(the fan saw Beckham STOP) = 2 × 10^-8
    p(the fan saw saw STOP) = 10^-15
    ...
    p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^-9
The Language Modeling Problem (Continued)

• Idea 1: The model assigns a higher probability to fluent sentences (those that
make sense and are grammatically correct).
• Idea 2: Estimate this probability function from text (a corpus).
• The language model helps text generation models distinguish between good and
bad sentences.
Why would we want to do this?

• Speech recognition was the original motivation.


• Consider the sentences: 1) recognize speech and 2) wreck a nice beach.
• These two sentences sound very similar when pronounced, making it challenging
for automatic speech recognition systems to accurately transcribe them.
• When the speech recognition system analyzes the audio input and tries to
transcribe it, it takes into account the language model probabilities to determine
the most likely interpretation.
• The language model would favor p(recognize speech) over p(wreck a nice
beach).
• This is because the former is a more common sentence and should occur more
frequently in the training corpus.
Why on earth would we want to do this?

• By incorporating language models, speech recognition systems can improve


accuracy by selecting the sentence that aligns better with linguistic patterns and
context, even when faced with similar-sounding alternatives.
• Related problems include optical character recognition and handwriting recognition.
• In fact, language models are useful in any NLP task involving the generation
of language (e.g., machine translation, summarization, chatbots).
• The estimation techniques developed for this problem will be VERY useful for
other problems in NLP.
Language Models are Generative

• Language models can generate sentences by sequentially sampling from


probabilities.
• This is analogous to drawing balls (words) from an urn where their sizes are
proportional to their relative frequencies.
• Alternatively, one could always draw the most probable word, which is equivalent
to predicting the next word.
A Naive Method

• A very naive method for estimating the probability of a sentence is to count the
occurrences of the sentence in the training data and divide it by the total number
of training sentences (N) to estimate the probability.
• We have N training sentences.
• For any sentence x_1, x_2, …, x_n, c(x_1, x_2, …, x_n) is the number of times the
sentence is seen in our training data.
• A naive estimate:

    p(x_1, x_2, …, x_n) = c(x_1, x_2, …, x_n) / N
• Problem: As the number of possible sentences grows exponentially with
sentence length and vocabulary size, it becomes increasingly unlikely for a
specific sentence to appear in the training data.
• Consequently, many sentences will have a probability of zero according to the
naive model, leading to poor generalization.
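• A minimal sketch of this naive estimator (not part of the original slides; the toy corpus and function name are hypothetical):

from collections import Counter

# Naive estimator: p(sentence) = c(sentence) / N, where N is the number of
# training sentences.  The toy corpus below is purely illustrative.
training_sentences = [
    "the fan saw Beckham STOP",
    "the fan saw Beckham STOP",
    "the fan STOP",
    "the STOP",
]
sentence_counts = Counter(training_sentences)
N = len(training_sentences)

def naive_probability(sentence: str) -> float:
    """Estimate p(sentence) as its relative frequency in the training data."""
    return sentence_counts[sentence] / N

print(naive_probability("the fan STOP"))                   # 0.25
print(naive_probability("the fan saw Beckham play STOP"))  # 0.0 -- any unseen sentence gets zero mass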
Markov Processes

• Consider a sequence of random variables X_1, X_2, …, X_n.
• Each random variable can take any value in a finite set V.
• For now, we assume the length n is fixed (e.g., n = 100).
• Our goal: model P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
First-Order Markov Processes

P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1})
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})

The first-order Markov assumption: for any i ∈ {2, …, n} and any x_1, …, x_i,

    P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})


Second-Order Markov Processes

P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
    = P(X_1 = x_1) · P(X_2 = x_2 | X_1 = x_1) · ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
    = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

(For convenience, we assume x_0 = x_{-1} = ∗, where ∗ is a special "start" symbol.)
Modeling Variable Length Sequences

• We would like the length of the sequence, n, to also be a random variable.


• A simple solution: always define Xn = STOP, where STOP is a special symbol.
• Then use a Markov process as before:

    P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

• (For convenience, we assume x_0 = x_{-1} = ∗, where ∗ is a special "start" symbol.)


Trigram Language Models

• A trigram language model consists of:


1. A finite set V
2. A parameter q(w | u, v) for each trigram (u, v, w) such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {∗}
• For any sentence x_1 … x_n where x_i ∈ V for i = 1 … (n − 1) and x_n = STOP, the
probability of the sentence under the trigram language model is:

    p(x_1 … x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

• We define x_0 = x_{-1} = ∗ for convenience.


An Example

For the sentence the dog barks STOP, we would have:

    p(the dog barks STOP) = q(the | ∗, ∗) × q(dog | ∗, the) × q(barks | the, dog) × q(STOP | dog, barks)
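A minimal sketch of how this factorization is computed in code (not from the original slides; the dictionary of q parameters below is a hypothetical stand-in for an estimated model):

import math

def trigram_sentence_logprob(sentence, q):
    """Log-probability of a tokenized sentence (ending in 'STOP') under a
    trigram model stored as q[(u, v, w)] = q(w | u, v)."""
    padded = ["*", "*"] + sentence          # x_0 = x_{-1} = *
    logp = 0.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        logp += math.log(q[(u, v, w)])      # assumes q(w | u, v) > 0 for every trigram used
    return logp

# Hypothetical parameter values, just to make the example run:
q = {("*", "*", "the"): 0.5, ("*", "the", "dog"): 0.2,
     ("the", "dog", "barks"): 0.1, ("dog", "barks", "STOP"): 0.8}
print(math.exp(trigram_sentence_logprob(["the", "dog", "barks", "STOP"], q)))  # 0.5*0.2*0.1*0.8 = 0.008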


The Trigram Estimation Problem

Remaining estimation problem:

    q(w_i | w_{i-2}, w_{i-1})

For example:

    q(laughs | the, dog)

A natural estimate (the "maximum likelihood estimate"):

    q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

For instance,

    q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
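A small sketch of this maximum-likelihood estimation step (not from the original slides; the toy corpus is hypothetical):

from collections import Counter

def estimate_trigram_mle(corpus):
    """Maximum-likelihood trigram estimates from tokenized sentences (each
    ending in 'STOP'): q(w | u, v) = Count(u, v, w) / Count(u, v)."""
    trigram_counts, context_counts = Counter(), Counter()
    for sentence in corpus:
        padded = ["*", "*"] + sentence
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            trigram_counts[(u, v, w)] += 1
            context_counts[(u, v)] += 1

    def q(w, u, v):
        if context_counts[(u, v)] == 0:
            return 0.0                      # unseen context: the sparse-data problem
        return trigram_counts[(u, v, w)] / context_counts[(u, v)]

    return q

corpus = [["the", "dog", "laughs", "STOP"], ["the", "dog", "barks", "STOP"]]
q = estimate_trigram_mle(corpus)
print(q("laughs", "the", "dog"))   # Count(the, dog, laughs) / Count(the, dog) = 1/2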
Sparse Data Problems

A natural estimate (the "maximum likelihood estimate"):

    q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

    q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)

• If our vocabulary size is N = |V|, then there are N^3 parameters in the model.
• For example, N = 20,000 ⇒ 20,000^3 = 8 × 10^12 parameters.
Evaluating a Language Model: Perplexity

• We have some test data, m sentences: s1 , s2 , s3 , ..., sm


• We could look at the probability under our model, ∏_{i=1}^{m} p(s_i), or more
conveniently, the log probability:

    log ∏_{i=1}^{m} p(s_i) = ∑_{i=1}^{m} log p(s_i)

• In fact, the usual evaluation measure is perplexity:

    Perplexity = 2^{-l}   where   l = (1/M) ∑_{i=1}^{m} log p(s_i)

• M is the total number of words in the test data.
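• A minimal sketch of the perplexity computation (not from the original slides; the scorer passed in is a hypothetical stand-in for any model that returns log2 p(s)):

import math

def perplexity(test_sentences, sentence_logprob2):
    """Perplexity = 2^{-l}, where l is the average log2-probability per token.
    `sentence_logprob2(s)` must return log2 p(s) for a tokenized sentence s
    (e.g., a trigram scorer like the one sketched earlier, but in base 2)."""
    M = sum(len(s) for s in test_sentences)                  # total tokens, including STOP
    l = sum(sentence_logprob2(s) for s in test_sentences) / M
    return 2.0 ** (-l)

# A uniform model over N = 1000 events assigns each token probability 1/1000:
uniform_logprob2 = lambda s: len(s) * math.log2(1.0 / 1000)
print(perplexity([["the", "fan", "STOP"], ["the", "STOP"]], uniform_logprob2))  # ≈ 1000

• As the next slides show, this uniform model is exactly the case where the perplexity equals the effective vocabulary size N.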


Some Intuition about Perplexity

• Say we have a vocabulary V, and N = |V| + 1, and a model that predicts:

    q(w | u, v) = 1/N   for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {∗}

• It's easy to calculate the perplexity in this case:

    Perplexity = 2^{-l}   where   l = log(1/N)   ⇒   Perplexity = N

• Perplexity can be seen as a measure of the effective "branching factor"


Some Intuition about Perplexity
• Proof: Let's assume we have m sentences of length n in the corpus, and let M be the
total number of tokens in the corpus, M = m · n.
• Let's consider the log (base 2) probability of a sentence s = w_1 w_2 … w_n under
the model:

    log p(s) = log ∏_{i=1}^{n} q(w_i | w_{i-2}, w_{i-1}) = ∑_{i=1}^{n} log q(w_i | w_{i-2}, w_{i-1})

• Since each q(w_i | w_{i-2}, w_{i-1}) is equal to 1/N, we have:

    log p(s) = ∑_{i=1}^{n} log(1/N) = n · log(1/N) = -n · log N

    l = (1/M) ∑_{i=1}^{m} log p(s_i) = (1/M) ∑_{i=1}^{m} (-n · log N) = -(1/M) · m · n · log N = -log N

• Therefore, the perplexity is given by:

    Perplexity = 2^{-l} = 2^{-(-log N)} = 2^{log N} = N


Typical Values of Perplexity

• Results from Goodman ("A bit of progress in language modeling"), where
|V| = 50,000 [Goodman, 2001].
• A trigram model: p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74
• A bigram model: p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1}). Perplexity = 137
• A unigram model: p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i). Perplexity = 955
Some History
• Shannon conducted experiments on the entropy of English, specifically
investigating how well people perform in the perplexity game.
• Reference: C. Shannon. "Prediction and entropy of printed English." Bell
System Technical Journal, 30:50–64, 1951. [Shannon, 1951]
Some History

• Chomsky, in his book Syntactic Structures (1957), made several important points
regarding grammar. [Chomsky, 2009]
• According to Chomsky, the notion of ”grammatical” cannot be equated with
”meaningful” or ”significant” in a semantic sense.
• He illustrated this with two nonsensical sentences:
• (1) Colorless green ideas sleep furiously.
• (2) Furiously sleep ideas green colorless.
• While both sentences lack meaning, Chomsky argued that only the first one is
considered grammatical by English speakers.
Some History
• Chomsky also emphasized that grammaticality in English cannot be determined
solely based on statistical approximations.
• Even though neither sentence (1) nor (2) has likely occurred in English
discourse, a statistical model would consider them equally ”remote” from English.
• However, sentence (1) is grammatical, while sentence (2) is not, highlighting the
limitations of statistical approaches in capturing grammaticality.
The Bias-Variance Trade-Off

• Trigram maximum-likelihood estimate:

    q_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

• Bigram maximum-likelihood estimate:

    q_ML(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})

• Unigram maximum-likelihood estimate:

    q_ML(w_i) = Count(w_i) / Count()
Linear Interpolation

• Take our estimate q(w_i | w_{i-2}, w_{i-1}) to be

    q(w_i | w_{i-2}, w_{i-1}) = λ_1 · q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 · q_ML(w_i | w_{i-1}) + λ_3 · q_ML(w_i)

  where λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i.

• Our estimate correctly defines a distribution (define V′ = V ∪ {STOP}):

    ∑_{w ∈ V′} q(w | u, v)
      = ∑_{w ∈ V′} [λ_1 · q_ML(w | u, v) + λ_2 · q_ML(w | v) + λ_3 · q_ML(w)]
      = λ_1 ∑_w q_ML(w | u, v) + λ_2 ∑_w q_ML(w | v) + λ_3 ∑_w q_ML(w)
      = λ_1 + λ_2 + λ_3 = 1

• We can also show that q(w | u, v) ≥ 0 for all w ∈ V′.
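• A minimal sketch of the interpolated estimator (not from the original slides; the toy corpus and the default λ values are hypothetical, and in practice the λ values are tuned on held-out data, as described on the next slide):

from collections import Counter

def make_interpolated_q(corpus, lambdas=(0.5, 0.3, 0.2)):
    """Builds q(w | u, v) = l1*qML(w | u, v) + l2*qML(w | v) + l3*qML(w) from
    trigram, bigram, and unigram counts of a tokenized corpus whose sentences
    end in 'STOP'.  The default lambdas are illustrative only."""
    tri, tri_ctx = Counter(), Counter()     # Count(u, v, w) and Count(u, v)
    bi, bi_ctx = Counter(), Counter()       # Count(v, w) and Count(v)
    uni, total = Counter(), 0               # Count(w) and total token count
    for sentence in corpus:
        padded = ["*", "*"] + sentence
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            tri[(u, v, w)] += 1; tri_ctx[(u, v)] += 1
            bi[(v, w)] += 1;     bi_ctx[v] += 1
            uni[w] += 1;         total += 1
    l1, l2, l3 = lambdas

    def q(w, u, v):
        q3 = tri[(u, v, w)] / tri_ctx[(u, v)] if tri_ctx[(u, v)] else 0.0
        q2 = bi[(v, w)] / bi_ctx[v] if bi_ctx[v] else 0.0
        q1 = uni[w] / total if total else 0.0
        return l1 * q3 + l2 * q2 + l3 * q1

    return q

corpus = [["the", "dog", "laughs", "STOP"], ["the", "cat", "laughs", "STOP"]]
q = make_interpolated_q(corpus)
print(q("laughs", "the", "dog"))   # mixes trigram, bigram, and unigram evidence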


Estimating λ Values

• Hold out part of the training set as validation data.


• Define c′(w_1, w_2, w_3) to be the number of times the trigram (w_1, w_2, w_3) is seen
in the validation set.
• Choose λ_1, λ_2, λ_3 to maximize:

    L(λ_1, λ_2, λ_3) = ∑_{w_1, w_2, w_3} c′(w_1, w_2, w_3) log q(w_3 | w_1, w_2)

  such that λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i, and where

    q(w_i | w_{i-2}, w_{i-1}) = λ_1 · q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 · q_ML(w_i | w_{i-1}) + λ_3 · q_ML(w_i)
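• A brute-force sketch of this optimization (not from the original slides; in practice the λ values are usually found with the EM algorithm or numerical optimization, and the grid search below only illustrates the objective):

import math
from itertools import product

def choose_lambdas(validation_counts, q_ml_tri, q_ml_bi, q_ml_uni, step=0.1):
    """Grid search for (lambda1, lambda2, lambda3) maximizing
    L = sum over (w1, w2, w3) of c'(w1, w2, w3) * log q(w3 | w1, w2).
    `validation_counts` maps held-out trigrams to their counts c'; the three
    q_ml_* arguments are the trigram, bigram, and unigram MLE functions."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue                         # constraint violated: lambdas must sum to 1
        l3 = max(l3, 0.0)
        ll = 0.0
        for (w1, w2, w3), c in validation_counts.items():
            q = l1 * q_ml_tri(w3, w1, w2) + l2 * q_ml_bi(w3, w2) + l3 * q_ml_uni(w3)
            if q <= 0.0:                     # zero probability on held-out data: reject this setting
                ll = float("-inf")
                break
            ll += c * math.log(q)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best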
Discounting Methods

• Consider the following counts and maximum-likelihood estimates:

    x                 Count(x)    q_ML(w_i | w_{i-1})
    the               48
    the, dog          15          15/48
    the, woman        11          11/48
    the, man          10          10/48
    the, park         5           5/48
    the, job          2           2/48
    the, telescope    1           1/48
    the, manual       1           1/48
    the, afternoon    1           1/48
    the, country      1           1/48
    the, street       1           1/48

• The maximum-likelihood estimates are high, particularly for low-count items.
Discounting Methods

• Define "discounted" counts as follows:

    Count*(x) = Count(x) − 0.5

    x                 Count(x)    Count*(x)    Count*(x)/Count(the)
    the               48
    the, dog          15          14.5         14.5/48
    the, woman        11          10.5         10.5/48
    the, man          10          9.5          9.5/48
    the, park         5           4.5          4.5/48
    the, job          2           1.5          1.5/48
    the, telescope    1           0.5          0.5/48
    the, manual       1           0.5          0.5/48
    the, afternoon    1           0.5          0.5/48
    the, country      1           0.5          0.5/48
    the, street       1           0.5          0.5/48

• The new estimates are based on the discounted counts.


Discounting Methods (Continued)

• We now have some "missing probability mass":

    α(w_{i-1}) = 1 − ∑_w Count*(w_{i-1}, w) / Count(w_{i-1})

• For example, in our case:

    α(the) = (10 × 0.5) / 48 = 5/48
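• A small sketch reproducing the numbers above (not from the original slides; the counts are the ones in the table):

from collections import Counter

bigram_counts = Counter({
    ("the", "dog"): 15, ("the", "woman"): 11, ("the", "man"): 10,
    ("the", "park"): 5, ("the", "job"): 2, ("the", "telescope"): 1,
    ("the", "manual"): 1, ("the", "afternoon"): 1, ("the", "country"): 1,
    ("the", "street"): 1,
})
context_counts = Counter({"the": 48})
discount = 0.5

def alpha(context):
    """Missing probability mass left after discounting every seen bigram."""
    seen_mass = sum((c - discount) / context_counts[context]
                    for (u, w), c in bigram_counts.items() if u == context)
    return 1.0 - seen_mass

print(alpha("the"))   # 10 seen bigrams * 0.5 / 48 = 5/48 ≈ 0.104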
Katz Back-Off Models (Bigrams)
• For a bigram model, define two sets:

    A(w_{i-1}) = {w : Count(w_{i-1}, w) > 0}
    B(w_{i-1}) = {w : Count(w_{i-1}, w) = 0}

• A bigram model:

    q_BO(w_i | w_{i-1}) =
        Count*(w_{i-1}, w_i) / Count(w_{i-1})                        if w_i ∈ A(w_{i-1})
        α(w_{i-1}) · q_ML(w_i) / ∑_{w ∈ B(w_{i-1})} q_ML(w)          if w_i ∈ B(w_{i-1})

• Where:

    α(w_{i-1}) = 1 − ∑_{w ∈ A(w_{i-1})} Count*(w_{i-1}, w) / Count(w_{i-1})
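• A minimal sketch of the bigram back-off estimate (not from the original slides; for simplicity the unigram counts double as the context counts Count(w_{i-1}), which is an assumption of this sketch):

from collections import Counter

def make_katz_bigram(bigram_counts, unigram_counts, total_tokens, discount=0.5):
    """Katz back-off q_BO(w | v): seen bigrams get their discounted MLE; the
    missing mass alpha(v) is spread over unseen words in proportion to q_ML(w).
    Counts are Counter objects; unigram_counts also serves as Count(v)."""
    vocab = set(unigram_counts)

    def q_ml_unigram(w):
        return unigram_counts[w] / total_tokens

    def q_bo(w, v):
        seen = {x for x in vocab if bigram_counts[(v, x)] > 0}            # A(v)
        if w in seen:
            return (bigram_counts[(v, w)] - discount) / unigram_counts[v]
        alpha = 1.0 - sum((bigram_counts[(v, x)] - discount) / unigram_counts[v]
                          for x in seen)                                  # missing mass
        unseen_mass = sum(q_ml_unigram(x) for x in vocab - seen)          # normalizer over B(v)
        return alpha * q_ml_unigram(w) / unseen_mass if unseen_mass else 0.0

    return q_bo

# Hypothetical counts:
bi = Counter({("the", "dog"): 15, ("the", "woman"): 11})
uni = Counter({"the": 48, "dog": 20, "woman": 15, "man": 17})
q_bo = make_katz_bigram(bi, uni, total_tokens=100)
print(q_bo("dog", "the"))   # seen bigram: discounted MLE = 14.5/48
print(q_bo("man", "the"))   # unseen bigram: a share of alpha(the) proportional to q_ML(man)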
Katz Back-Off Models (Trigrams)

• For a trigram model, first define two sets:

    A(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) > 0}
    B(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) = 0}

• A trigram model is defined in terms of the bigram model:

    q_BO(w_i | w_{i-2}, w_{i-1}) =
        Count*(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})                                    if w_i ∈ A(w_{i-2}, w_{i-1})
        α(w_{i-2}, w_{i-1}) · q_BO(w_i | w_{i-1}) / ∑_{w ∈ B(w_{i-2}, w_{i-1})} q_BO(w | w_{i-1})  if w_i ∈ B(w_{i-2}, w_{i-1})

• Where:

    α(w_{i-2}, w_{i-1}) = 1 − ∑_{w ∈ A(w_{i-2}, w_{i-1})} Count*(w_{i-2}, w_{i-1}, w) / Count(w_{i-2}, w_{i-1})
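• The trigram case follows the same pattern, backing off to the bigram model; a minimal sketch (not from the original slides) that reuses a bigram back-off function such as the one above:

from collections import Counter

def make_katz_trigram(trigram_counts, bigram_counts, q_bo_bigram, vocab, discount=0.5):
    """Katz back-off q_BO(w | u, v), defined in terms of an already-built
    bigram back-off model q_bo_bigram(w, v).  Counts are Counter objects and
    `vocab` should include 'STOP'."""
    def q_bo(w, u, v):
        seen = {x for x in vocab if trigram_counts[(u, v, x)] > 0}        # A(u, v)
        if w in seen:
            return (trigram_counts[(u, v, w)] - discount) / bigram_counts[(u, v)]
        alpha = 1.0 - sum((trigram_counts[(u, v, x)] - discount) / bigram_counts[(u, v)]
                          for x in seen)                                  # missing mass
        unseen_mass = sum(q_bo_bigram(x, v) for x in vocab - seen)        # normalizer over B(u, v)
        return alpha * q_bo_bigram(w, v) / unseen_mass if unseen_mass else 0.0

    return q_bo

# Usage with hypothetical counts; the lambda is a uniform stand-in for a real bigram back-off model:
tri = Counter({("the", "dog", "barks"): 3})
bi_ctx = Counter({("the", "dog"): 4})
vocab = {"barks", "laughs", "STOP"}
q_bo_tri = make_katz_trigram(tri, bi_ctx, lambda w, v: 1.0 / len(vocab), vocab)
print(q_bo_tri("barks", "the", "dog"))    # seen: (3 - 0.5) / 4
print(q_bo_tri("laughs", "the", "dog"))   # unseen: share of alpha(the, dog) via the back-off model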
Summary

• Deriving probabilities in probabilistic language models involves three steps:


1. Expand p(w_1, w_2, …, w_n) using the chain rule.
2. Apply Markov independence assumptions:
p(w_i | w_1, w_2, …, w_{i-2}, w_{i-1}) = p(w_i | w_{i-2}, w_{i-1}).
3. Smooth the estimates using lower-order counts.
• Other methods for improving language models include:
• Introducing latent variables to represent topics, known as topic models.
[Blei et al., 2003]
• Replacing p(w_i | w_1, w_2, …, w_{i-2}, w_{i-1}) with a predictive neural network
and an "embedding layer" to better represent larger contexts and leverage
similarities between words in the context. [Bengio et al., 2000]
• Modern language models utilize deep neural networks in their backbone and
have a vast parameter space.
Questions?

Thanks for your Attention!


References I

Bengio, Y., Ducharme, R., and Vincent, P. (2000). A neural probabilistic language model. Advances in Neural Information Processing Systems, 13.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Chomsky, N. (2009). Syntactic structures. In Syntactic Structures. De Gruyter Mouton.

Goodman, J. T. (2001). A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64.
