
Natural Language Processing

Probabilistic Language Models

Felipe Bravo-Marquez

March 26, 2024


Overview

• The language modeling problem


• Trigram models
• Evaluating language models: perplexity
• Estimation techniques:
1. Linear interpolation
2. Discounting methods
• These slides are based on the course material by Michael Collins:
http://www.cs.columbia.edu/~mcollins/cs4705-spring2019/slides/lmslides.pdf
The Language Modeling Problem

• We have some (finite) vocabulary, say V = {the, a, man, telescope, Beckham, two, …}.
• We have an (infinite) set of strings, V*.
• For example:
• the STOP
• a STOP
• the fan STOP
• the fan saw Beckham STOP
• the fan saw saw STOP
• the fan saw Beckham play for Real Madrid STOP
• Where STOP is a special symbol indicating the end of a sentence.
The Language Modeling Problem (Continued)
• We have a training sample of example sentences in English.
• We need to "learn" a probability distribution p.
• p is a function that satisfies:

    ∑_{x ∈ V*} p(x) = 1,   and   p(x) ≥ 0 for all x ∈ V*

• Examples of probabilities assigned to sentences:

    p(the STOP) = 10^-12
    p(the fan STOP) = 10^-8
    p(the fan saw Beckham STOP) = 2 × 10^-8
    p(the fan saw saw STOP) = 10^-15
    ...
    p(the fan saw Beckham play for Real Madrid STOP) = 2 × 10^-9
The Language Modeling Problem (Continued)

• Idea 1: The model assigns a higher probability to fluent sentences (those that
make sense and are grammatically correct).
• Idea 2: Estimate this probability function from text (a corpus).
• The language model helps text generation models distinguish between good and
bad sentences.
Why would we want to do this?

• Speech recognition was the original motivation.


• Consider the sentences: 1) recognize speech and 2) wreck a nice beach.
• These two sentences sound very similar when pronounced, making it challenging
for automatic speech recognition systems to accurately transcribe them.
• When the speech recognition system analyzes the audio input and tries to
transcribe it, it takes into account the language model probabilities to determine
the most likely interpretation.
• The language model would favor p(recognize speech) over p(wreck a nice
beach).
• This is because the former is a more common sentence and should occur more
frequently in the training corpus.
Why on earth would we want to do this?

• By incorporating language models, speech recognition systems can improve


accuracy by selecting the sentence that aligns better with linguistic patterns and
context, even when faced with similar-sounding alternatives.
• Related problems include optical character recognition and handwriting recognition.
• In fact, language models are useful in any NLP task involving the generation
of language (e.g., machine translation, summarization, chatbots).
• The estimation techniques developed for this problem will be VERY useful for
other problems in NLP.
Language Models are Generative

• Language models can generate sentences by sequentially sampling from


probabilities.
• This is analogous to drawing balls (words) from an urn where their sizes are
proportional to their relative frequencies.
• Alternatively, one could always draw the most probable word, which is equivalent
to predicting the next word.
A Naive Method

• A very naive method for estimating the probability of a sentence is to count the
occurrences of the sentence in the training data and divide it by the total number
of training sentences (N) to estimate the probability.
• We have N training sentences.
• For any sentence x_1, x_2, …, x_n, c(x_1, x_2, …, x_n) is the number of times the
sentence is seen in our training data.
• A naive estimate:

    p(x_1, x_2, …, x_n) = c(x_1, x_2, …, x_n) / N
• Problem: As the number of possible sentences grows exponentially with
sentence length and vocabulary size, it becomes increasingly unlikely for a
specific sentence to appear in the training data.
• Consequently, many sentences will have a probability of zero according to the
naive model, leading to poor generalization.
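• A minimal sketch of this naive estimator (not part of the original slides; the toy corpus and function name are hypothetical):

from collections import Counter

# Naive estimator: p(sentence) = c(sentence) / N, where N is the number of
# training sentences.  The toy corpus below is purely illustrative.
training_sentences = [
    "the fan saw Beckham STOP",
    "the fan saw Beckham STOP",
    "the fan STOP",
    "the STOP",
]
sentence_counts = Counter(training_sentences)
N = len(training_sentences)

def naive_probability(sentence: str) -> float:
    """Estimate p(sentence) as its relative frequency in the training data."""
    return sentence_counts[sentence] / N

print(naive_probability("the fan STOP"))                   # 0.25
print(naive_probability("the fan saw Beckham play STOP"))  # 0.0 -- any unseen sentence gets zero mass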
Markov Processes

• Consider a sequence of random variables X_1, X_2, …, X_n.
• Each random variable can take any value in a finite set V.
• For now, we assume the length n is fixed (e.g., n = 100).
• Our goal: model P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
First-Order Markov Processes

P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1})
    = P(X_1 = x_1) ∏_{i=2}^{n} P(X_i = x_i | X_{i-1} = x_{i-1})

The first-order Markov assumption: for any i ∈ {2, …, n} and any x_1, …, x_i,

    P(X_i = x_i | X_1 = x_1, …, X_{i-1} = x_{i-1}) = P(X_i = x_i | X_{i-1} = x_{i-1})


Second-Order Markov Processes

P(X_1 = x_1, X_2 = x_2, …, X_n = x_n)
    = P(X_1 = x_1) · P(X_2 = x_2 | X_1 = x_1) · ∏_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
    = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

(For convenience, we assume x_0 = x_{-1} = ∗, where ∗ is a special "start" symbol.)
Modeling Variable Length Sequences

• We would like the length of the sequence, n, to also be a random variable.


• A simple solution: always define Xn = STOP, where STOP is a special symbol.
• Then use a Markov process as before:

    P(X_1 = x_1, X_2 = x_2, …, X_n = x_n) = ∏_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

• (For convenience, we assume x_0 = x_{-1} = ∗, where ∗ is a special "start" symbol.)


Trigram Language Models

• A trigram language model consists of:


1. A finite set V
2. A parameter q(w | u, v) for each trigram (u, v, w) such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {∗}
• For any sentence x_1 … x_n where x_i ∈ V for i = 1 … (n − 1) and x_n = STOP, the
probability of the sentence under the trigram language model is:

    p(x_1 … x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1})

• We define x_0 = x_{-1} = ∗ for convenience.


An Example

For the sentence the dog barks STOP, we would have:

    p(the dog barks STOP) = q(the | ∗, ∗) × q(dog | ∗, the) × q(barks | the, dog) × q(STOP | dog, barks)
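A minimal sketch of how this factorization is computed in code (not from the original slides; the dictionary of q parameters below is a hypothetical stand-in for an estimated model):

import math

def trigram_sentence_logprob(sentence, q):
    """Log-probability of a tokenized sentence (ending in 'STOP') under a
    trigram model stored as q[(u, v, w)] = q(w | u, v)."""
    padded = ["*", "*"] + sentence          # x_0 = x_{-1} = *
    logp = 0.0
    for i in range(2, len(padded)):
        u, v, w = padded[i - 2], padded[i - 1], padded[i]
        logp += math.log(q[(u, v, w)])      # assumes q(w | u, v) > 0 for every trigram used
    return logp

# Hypothetical parameter values, just to make the example run:
q = {("*", "*", "the"): 0.5, ("*", "the", "dog"): 0.2,
     ("the", "dog", "barks"): 0.1, ("dog", "barks", "STOP"): 0.8}
print(math.exp(trigram_sentence_logprob(["the", "dog", "barks", "STOP"], q)))  # 0.5*0.2*0.1*0.8 = 0.008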


The Trigram Estimation Problem

Remaining estimation problem:

    q(w_i | w_{i-2}, w_{i-1})

For example:

    q(laughs | the, dog)

A natural estimate (the "maximum likelihood estimate"):

    q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

For instance,

    q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)
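A small sketch of this maximum-likelihood estimation step (not from the original slides; the toy corpus is hypothetical):

from collections import Counter

def estimate_trigram_mle(corpus):
    """Maximum-likelihood trigram estimates from tokenized sentences (each
    ending in 'STOP'): q(w | u, v) = Count(u, v, w) / Count(u, v)."""
    trigram_counts, context_counts = Counter(), Counter()
    for sentence in corpus:
        padded = ["*", "*"] + sentence
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            trigram_counts[(u, v, w)] += 1
            context_counts[(u, v)] += 1

    def q(w, u, v):
        if context_counts[(u, v)] == 0:
            return 0.0                      # unseen context: the sparse-data problem
        return trigram_counts[(u, v, w)] / context_counts[(u, v)]

    return q

corpus = [["the", "dog", "laughs", "STOP"], ["the", "dog", "barks", "STOP"]]
q = estimate_trigram_mle(corpus)
print(q("laughs", "the", "dog"))   # Count(the, dog, laughs) / Count(the, dog) = 1/2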
Sparse Data Problems

A natural estimate (the "maximum likelihood estimate"):

    q(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

    q(laughs | the, dog) = Count(the, dog, laughs) / Count(the, dog)

• If our vocabulary size is N = |V|, then there are N^3 parameters in the model.
• For example, N = 20,000 ⇒ 20,000^3 = 8 × 10^12 parameters.
Evaluating a Language Model: Perplexity

• We have some test data, m sentences: s1 , s2 , s3 , ..., sm


• We could look at the probability under our model, ∏_{i=1}^{m} p(s_i), or more
conveniently, the log probability:

    log ∏_{i=1}^{m} p(s_i) = ∑_{i=1}^{m} log p(s_i)

• In fact, the usual evaluation measure is perplexity:

    Perplexity = 2^{-l}   where   l = (1/M) ∑_{i=1}^{m} log p(s_i)

• M is the total number of words in the test data.
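• A minimal sketch of the perplexity computation (not from the original slides; the scorer passed in is a hypothetical stand-in for any model that returns log2 p(s)):

import math

def perplexity(test_sentences, sentence_logprob2):
    """Perplexity = 2^{-l}, where l is the average log2-probability per token.
    `sentence_logprob2(s)` must return log2 p(s) for a tokenized sentence s
    (e.g., a trigram scorer like the one sketched earlier, but in base 2)."""
    M = sum(len(s) for s in test_sentences)                  # total tokens, including STOP
    l = sum(sentence_logprob2(s) for s in test_sentences) / M
    return 2.0 ** (-l)

# A uniform model over N = 1000 events assigns each token probability 1/1000:
uniform_logprob2 = lambda s: len(s) * math.log2(1.0 / 1000)
print(perplexity([["the", "fan", "STOP"], ["the", "STOP"]], uniform_logprob2))  # ≈ 1000

• As the next slides show, this uniform model is exactly the case where the perplexity equals the effective vocabulary size N.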


Some Intuition about Perplexity

• Say we have a vocabulary V, and N = |V| + 1, and a model that predicts:

    q(w | u, v) = 1/N   for all w ∈ V ∪ {STOP}, for all u, v ∈ V ∪ {∗}

• It's easy to calculate the perplexity in this case:

    Perplexity = 2^{-l}   where   l = log(1/N)   ⇒   Perplexity = N

• Perplexity can be seen as a measure of the effective "branching factor"


Some Intuition about Perplexity
• Proof: Let's assume we have m sentences of length n in the corpus, and let M be the
total number of tokens in the corpus, M = m · n.
• Let's consider the log (base 2) probability of a sentence s = w_1 w_2 … w_n under
the model:

    log p(s) = log ∏_{i=1}^{n} q(w_i | w_{i-2}, w_{i-1}) = ∑_{i=1}^{n} log q(w_i | w_{i-2}, w_{i-1})

• Since each q(w_i | w_{i-2}, w_{i-1}) is equal to 1/N, we have:

    log p(s) = ∑_{i=1}^{n} log(1/N) = n · log(1/N) = -n · log N

    l = (1/M) ∑_{i=1}^{m} log p(s_i) = (1/M) ∑_{i=1}^{m} (-n · log N) = -(1/M) · m · n · log N = -log N

• Therefore, the perplexity is given by:

    Perplexity = 2^{-l} = 2^{-(-log N)} = 2^{log N} = N


Typical Values of Perplexity

• Results from Goodman ("A bit of progress in language modeling"), where
|V| = 50,000 [Goodman, 2001].
• A trigram model: p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i | x_{i-2}, x_{i-1}). Perplexity = 74
• A bigram model: p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i | x_{i-1}). Perplexity = 137
• A unigram model: p(x_1, …, x_n) = ∏_{i=1}^{n} q(x_i). Perplexity = 955
Some History
• Shannon conducted experiments on the entropy of English, specifically
investigating how well people perform in the perplexity game.
• Reference: C. Shannon. "Prediction and entropy of printed English." Bell
System Technical Journal, 30:50–64, 1951. [Shannon, 1951]
Some History

• Chomsky, in his book Syntactic Structures (1957), made several important points
regarding grammar. [Chomsky, 2009]
• According to Chomsky, the notion of ”grammatical” cannot be equated with
”meaningful” or ”significant” in a semantic sense.
• He illustrated this with two nonsensical sentences:
• (1) Colorless green ideas sleep furiously.
• (2) Furiously sleep ideas green colorless.
• While both sentences lack meaning, Chomsky argued that only the first one is
considered grammatical by English speakers.
Some History
• Chomsky also emphasized that grammaticality in English cannot be determined
solely based on statistical approximations.
• Even though neither sentence (1) nor (2) has likely occurred in English
discourse, a statistical model would consider them equally ”remote” from English.
• However, sentence (1) is grammatical, while sentence (2) is not, highlighting the
limitations of statistical approaches in capturing grammaticality.
The Bias-Variance Trade-Off

• Trigram maximum-likelihood estimate:

    q_ML(w_i | w_{i-2}, w_{i-1}) = Count(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})

• Bigram maximum-likelihood estimate:

    q_ML(w_i | w_{i-1}) = Count(w_{i-1}, w_i) / Count(w_{i-1})

• Unigram maximum-likelihood estimate:

    q_ML(w_i) = Count(w_i) / Count()
Linear Interpolation

• Take our estimate q(w_i | w_{i-2}, w_{i-1}) to be

    q(w_i | w_{i-2}, w_{i-1}) = λ_1 · q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 · q_ML(w_i | w_{i-1}) + λ_3 · q_ML(w_i)

  where λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i.

• Our estimate correctly defines a distribution (define V′ = V ∪ {STOP}):

    ∑_{w ∈ V′} q(w | u, v)
      = ∑_{w ∈ V′} [λ_1 · q_ML(w | u, v) + λ_2 · q_ML(w | v) + λ_3 · q_ML(w)]
      = λ_1 ∑_w q_ML(w | u, v) + λ_2 ∑_w q_ML(w | v) + λ_3 ∑_w q_ML(w)
      = λ_1 + λ_2 + λ_3 = 1

• We can also show that q(w | u, v) ≥ 0 for all w ∈ V′.
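• A minimal sketch of the interpolated estimator (not from the original slides; the toy corpus and the default λ values are hypothetical, and in practice the λ values are tuned on held-out data, as described on the next slide):

from collections import Counter

def make_interpolated_q(corpus, lambdas=(0.5, 0.3, 0.2)):
    """Builds q(w | u, v) = l1*qML(w | u, v) + l2*qML(w | v) + l3*qML(w) from
    trigram, bigram, and unigram counts of a tokenized corpus whose sentences
    end in 'STOP'.  The default lambdas are illustrative only."""
    tri, tri_ctx = Counter(), Counter()     # Count(u, v, w) and Count(u, v)
    bi, bi_ctx = Counter(), Counter()       # Count(v, w) and Count(v)
    uni, total = Counter(), 0               # Count(w) and total token count
    for sentence in corpus:
        padded = ["*", "*"] + sentence
        for i in range(2, len(padded)):
            u, v, w = padded[i - 2], padded[i - 1], padded[i]
            tri[(u, v, w)] += 1; tri_ctx[(u, v)] += 1
            bi[(v, w)] += 1;     bi_ctx[v] += 1
            uni[w] += 1;         total += 1
    l1, l2, l3 = lambdas

    def q(w, u, v):
        q3 = tri[(u, v, w)] / tri_ctx[(u, v)] if tri_ctx[(u, v)] else 0.0
        q2 = bi[(v, w)] / bi_ctx[v] if bi_ctx[v] else 0.0
        q1 = uni[w] / total if total else 0.0
        return l1 * q3 + l2 * q2 + l3 * q1

    return q

corpus = [["the", "dog", "laughs", "STOP"], ["the", "cat", "laughs", "STOP"]]
q = make_interpolated_q(corpus)
print(q("laughs", "the", "dog"))   # mixes trigram, bigram, and unigram evidence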


Estimating λ Values

• Hold out part of the training set as validation data.


• Define c′(w_1, w_2, w_3) to be the number of times the trigram (w_1, w_2, w_3) is seen
in the validation set.
• Choose λ_1, λ_2, λ_3 to maximize:

    L(λ_1, λ_2, λ_3) = ∑_{w_1, w_2, w_3} c′(w_1, w_2, w_3) log q(w_3 | w_1, w_2)

  such that λ_1 + λ_2 + λ_3 = 1, and λ_i ≥ 0 for all i, and where

    q(w_i | w_{i-2}, w_{i-1}) = λ_1 · q_ML(w_i | w_{i-2}, w_{i-1}) + λ_2 · q_ML(w_i | w_{i-1}) + λ_3 · q_ML(w_i)
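• A brute-force sketch of this optimization (not from the original slides; in practice the λ values are usually found with the EM algorithm or numerical optimization, and the grid search below only illustrates the objective):

import math
from itertools import product

def choose_lambdas(validation_counts, q_ml_tri, q_ml_bi, q_ml_uni, step=0.1):
    """Grid search for (lambda1, lambda2, lambda3) maximizing
    L = sum over (w1, w2, w3) of c'(w1, w2, w3) * log q(w3 | w1, w2).
    `validation_counts` maps held-out trigrams to their counts c'; the three
    q_ml_* arguments are the trigram, bigram, and unigram MLE functions."""
    best, best_ll = None, float("-inf")
    grid = [i * step for i in range(int(round(1 / step)) + 1)]
    for l1, l2 in product(grid, repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue                         # constraint violated: lambdas must sum to 1
        l3 = max(l3, 0.0)
        ll = 0.0
        for (w1, w2, w3), c in validation_counts.items():
            q = l1 * q_ml_tri(w3, w1, w2) + l2 * q_ml_bi(w3, w2) + l3 * q_ml_uni(w3)
            if q <= 0.0:                     # zero probability on held-out data: reject this setting
                ll = float("-inf")
                break
            ll += c * math.log(q)
        if ll > best_ll:
            best, best_ll = (l1, l2, l3), ll
    return best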
Discounting Methods

• Consider the following counts and maximum-likelihood estimates:

    x                 Count(x)    q_ML(w_i | w_{i-1})
    the               48
    the, dog          15          15/48
    the, woman        11          11/48
    the, man          10          10/48
    the, park         5           5/48
    the, job          2           2/48
    the, telescope    1           1/48
    the, manual       1           1/48
    the, afternoon    1           1/48
    the, country      1           1/48
    the, street       1           1/48

• The maximum-likelihood estimates are high, particularly for low-count items.
Discounting Methods

• Define "discounted" counts as follows:

    Count*(x) = Count(x) − 0.5

    x                 Count(x)    Count*(x)    Count*(x)/Count(the)
    the               48
    the, dog          15          14.5         14.5/48
    the, woman        11          10.5         10.5/48
    the, man          10          9.5          9.5/48
    the, park         5           4.5          4.5/48
    the, job          2           1.5          1.5/48
    the, telescope    1           0.5          0.5/48
    the, manual       1           0.5          0.5/48
    the, afternoon    1           0.5          0.5/48
    the, country      1           0.5          0.5/48
    the, street       1           0.5          0.5/48

• The new estimates are based on the discounted counts.


Discounting Methods (Continued)

• We now have some "missing probability mass":

    α(w_{i-1}) = 1 − ∑_w Count*(w_{i-1}, w) / Count(w_{i-1})

• For example, in our case:

    α(the) = (10 × 0.5) / 48 = 5/48
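• A small sketch reproducing the numbers above (not from the original slides; the counts are the ones in the table):

from collections import Counter

bigram_counts = Counter({
    ("the", "dog"): 15, ("the", "woman"): 11, ("the", "man"): 10,
    ("the", "park"): 5, ("the", "job"): 2, ("the", "telescope"): 1,
    ("the", "manual"): 1, ("the", "afternoon"): 1, ("the", "country"): 1,
    ("the", "street"): 1,
})
context_counts = Counter({"the": 48})
discount = 0.5

def alpha(context):
    """Missing probability mass left after discounting every seen bigram."""
    seen_mass = sum((c - discount) / context_counts[context]
                    for (u, w), c in bigram_counts.items() if u == context)
    return 1.0 - seen_mass

print(alpha("the"))   # 10 seen bigrams * 0.5 / 48 = 5/48 ≈ 0.104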
Katz Back-Off Models (Bigrams)
• For a bigram model, define two sets:

    A(w_{i-1}) = {w : Count(w_{i-1}, w) > 0}
    B(w_{i-1}) = {w : Count(w_{i-1}, w) = 0}

• A bigram model:

    q_BO(w_i | w_{i-1}) =
        Count*(w_{i-1}, w_i) / Count(w_{i-1})                        if w_i ∈ A(w_{i-1})
        α(w_{i-1}) · q_ML(w_i) / ∑_{w ∈ B(w_{i-1})} q_ML(w)          if w_i ∈ B(w_{i-1})

• Where:

    α(w_{i-1}) = 1 − ∑_{w ∈ A(w_{i-1})} Count*(w_{i-1}, w) / Count(w_{i-1})
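• A minimal sketch of the bigram back-off estimate (not from the original slides; for simplicity the unigram counts double as the context counts Count(w_{i-1}), which is an assumption of this sketch):

from collections import Counter

def make_katz_bigram(bigram_counts, unigram_counts, total_tokens, discount=0.5):
    """Katz back-off q_BO(w | v): seen bigrams get their discounted MLE; the
    missing mass alpha(v) is spread over unseen words in proportion to q_ML(w).
    Counts are Counter objects; unigram_counts also serves as Count(v)."""
    vocab = set(unigram_counts)

    def q_ml_unigram(w):
        return unigram_counts[w] / total_tokens

    def q_bo(w, v):
        seen = {x for x in vocab if bigram_counts[(v, x)] > 0}            # A(v)
        if w in seen:
            return (bigram_counts[(v, w)] - discount) / unigram_counts[v]
        alpha = 1.0 - sum((bigram_counts[(v, x)] - discount) / unigram_counts[v]
                          for x in seen)                                  # missing mass
        unseen_mass = sum(q_ml_unigram(x) for x in vocab - seen)          # normalizer over B(v)
        return alpha * q_ml_unigram(w) / unseen_mass if unseen_mass else 0.0

    return q_bo

# Hypothetical counts:
bi = Counter({("the", "dog"): 15, ("the", "woman"): 11})
uni = Counter({"the": 48, "dog": 20, "woman": 15, "man": 17})
q_bo = make_katz_bigram(bi, uni, total_tokens=100)
print(q_bo("dog", "the"))   # seen bigram: discounted MLE = 14.5/48
print(q_bo("man", "the"))   # unseen bigram: a share of alpha(the) proportional to q_ML(man)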
Katz Back-Off Models (Trigrams)

• For a trigram model, first define two sets:

    A(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) > 0}
    B(w_{i-2}, w_{i-1}) = {w : Count(w_{i-2}, w_{i-1}, w) = 0}

• A trigram model is defined in terms of the bigram model:

    q_BO(w_i | w_{i-2}, w_{i-1}) =
        Count*(w_{i-2}, w_{i-1}, w_i) / Count(w_{i-2}, w_{i-1})                                    if w_i ∈ A(w_{i-2}, w_{i-1})
        α(w_{i-2}, w_{i-1}) · q_BO(w_i | w_{i-1}) / ∑_{w ∈ B(w_{i-2}, w_{i-1})} q_BO(w | w_{i-1})  if w_i ∈ B(w_{i-2}, w_{i-1})

• Where:

    α(w_{i-2}, w_{i-1}) = 1 − ∑_{w ∈ A(w_{i-2}, w_{i-1})} Count*(w_{i-2}, w_{i-1}, w) / Count(w_{i-2}, w_{i-1})
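• The trigram case follows the same pattern, backing off to the bigram model; a minimal sketch (not from the original slides) that reuses a bigram back-off function such as the one above:

from collections import Counter

def make_katz_trigram(trigram_counts, bigram_counts, q_bo_bigram, vocab, discount=0.5):
    """Katz back-off q_BO(w | u, v), defined in terms of an already-built
    bigram back-off model q_bo_bigram(w, v).  Counts are Counter objects and
    `vocab` should include 'STOP'."""
    def q_bo(w, u, v):
        seen = {x for x in vocab if trigram_counts[(u, v, x)] > 0}        # A(u, v)
        if w in seen:
            return (trigram_counts[(u, v, w)] - discount) / bigram_counts[(u, v)]
        alpha = 1.0 - sum((trigram_counts[(u, v, x)] - discount) / bigram_counts[(u, v)]
                          for x in seen)                                  # missing mass
        unseen_mass = sum(q_bo_bigram(x, v) for x in vocab - seen)        # normalizer over B(u, v)
        return alpha * q_bo_bigram(w, v) / unseen_mass if unseen_mass else 0.0

    return q_bo

# Usage with hypothetical counts; the lambda is a uniform stand-in for a real bigram back-off model:
tri = Counter({("the", "dog", "barks"): 3})
bi_ctx = Counter({("the", "dog"): 4})
vocab = {"barks", "laughs", "STOP"}
q_bo_tri = make_katz_trigram(tri, bi_ctx, lambda w, v: 1.0 / len(vocab), vocab)
print(q_bo_tri("barks", "the", "dog"))    # seen: (3 - 0.5) / 4
print(q_bo_tri("laughs", "the", "dog"))   # unseen: share of alpha(the, dog) via the back-off model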
Summary

• Deriving probabilities in probabilistic language models involves three steps:


1. Expand p(w_1, w_2, …, w_n) using the chain rule.
2. Apply Markov independence assumptions:
p(w_i | w_1, w_2, …, w_{i-2}, w_{i-1}) = p(w_i | w_{i-2}, w_{i-1}).
3. Smooth the estimates using lower-order counts.
• Other methods for improving language models include:
• Introducing latent variables to represent topics, known as topic models.
[Blei et al., 2003]
• Replacing p(w_i | w_1, w_2, …, w_{i-2}, w_{i-1}) with a predictive neural network
and an "embedding layer" to better represent larger contexts and leverage
similarities between words in the context. [Bengio et al., 2000]
• Modern language models utilize deep neural networks in their backbone and
have a vast parameter space.
Questions?

Thanks for your Attention!


References I

Bengio, Y., Ducharme, R., and Vincent, P. (2000). A neural probabilistic language model. Advances in Neural Information Processing Systems, 13.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Chomsky, N. (2009). Syntactic structures. In Syntactic Structures. De Gruyter Mouton.

Goodman, J. T. (2001). A bit of progress in language modeling. Computer Speech & Language, 15(4):403–434.

Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64.
