NLP PLM
Felipe Bravo-Marquez
• Idea 1: The model assigns a higher probability to fluent sentences (those that
make sense and are grammatically correct).
• Idea 2: Estimate this probability function from text (a corpus).
• The language model helps text generation models distinguish between good and
bad sentences.
Why would we want to do this?
• A very naive method for estimating the probability of a sentence is to count the
occurrences of the sentence in the training data and divide it by the total number
of training sentences (N) to estimate the probability.
• We have N training sentences.
• For any sentence x_1, x_2, \ldots, x_n, c(x_1, x_2, \ldots, x_n) is the number of times the sentence is seen in our training data.
• A naive estimate:

p(x_1, x_2, \ldots, x_n) = \frac{c(x_1, x_2, \ldots, x_n)}{N}
• Problem: As the number of possible sentences grows exponentially with
sentence length and vocabulary size, it becomes increasingly unlikely for a
specific sentence to appear in the training data.
• Consequently, many sentences will have a probability of zero according to the
naive model, leading to poor generalization.
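A minimal sketch of the naive estimator, assuming a hypothetical toy corpus of tokenized sentences; note how any unseen sentence gets probability zero:

from collections import Counter

# Hypothetical toy corpus: each training sentence is a tuple of tokens.
train_sentences = [
    ("the", "dog", "barks"),
    ("the", "dog", "laughs"),
    ("the", "cat", "sleeps"),
]

N = len(train_sentences)           # total number of training sentences
counts = Counter(train_sentences)  # c(x_1, ..., x_n) for every sentence seen in training

def naive_prob(sentence):
    """p(x_1, ..., x_n) = c(x_1, ..., x_n) / N."""
    return counts[tuple(sentence)] / N

print(naive_prob(["the", "dog", "barks"]))   # 1/3: seen in training
print(naive_prob(["the", "dog", "sleeps"]))  # 0.0: unseen, even though perfectly fluent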
Markov Processes
By the chain rule, and then applying a first-order Markov assumption:

P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})

= P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i \mid X_{i-1} = x_{i-1})
For a second-order Markov process:

P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_1 = x_1) \cdot P(X_2 = x_2 \mid X_1 = x_1) \cdot \prod_{i=3}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

= \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

(For convenience, we assume x_0 = x_{-1} = *, where * is a special "start" symbol.)
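A minimal sketch of how the second-order factorization is evaluated, assuming a hypothetical table q of trigram conditional probabilities (all values are made up):

# Hypothetical trigram conditional probabilities q(w | u, v).
q = {
    ("*", "*", "the"): 0.5,
    ("*", "the", "dog"): 0.4,
    ("the", "dog", "laughs"): 0.3,
}

def sentence_prob(words, q):
    """Compute p(x_1 ... x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}), padding with '*' at the start."""
    padded = ["*", "*"] + list(words)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= q.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    return prob

print(sentence_prob(["the", "dog", "laughs"], q))  # 0.5 * 0.4 * 0.3 = 0.06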
Modeling Variable Length Sequences
Under the second-order Markov assumption,

P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \prod_{i=1}^{n} P(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})

which, writing q(x_i \mid x_{i-2}, x_{i-1}) for the model's conditional probabilities, gives

p(x_1 \ldots x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})

To handle sequences of variable length, the last symbol x_n is taken to be a special STOP symbol.
For example:

q(\text{laughs} \mid \text{the, dog})

A natural estimate (the "maximum likelihood estimate"):

q(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}

For instance,

q(\text{laughs} \mid \text{the, dog}) = \frac{\text{Count}(\text{the, dog, laughs})}{\text{Count}(\text{the, dog})}
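A minimal sketch of the maximum likelihood estimate computed from raw counts, assuming a hypothetical toy corpus padded with * and STOP symbols:

from collections import Counter

# Hypothetical tokenized training corpus.
corpus = [
    ["the", "dog", "laughs"],
    ["the", "dog", "barks"],
    ["the", "cat", "sleeps"],
]

trigram_counts = Counter()
context_counts = Counter()  # Count(w_{i-2}, w_{i-1})
for sentence in corpus:
    padded = ["*", "*"] + sentence + ["STOP"]
    for i in range(2, len(padded)):
        trigram_counts[(padded[i - 2], padded[i - 1], padded[i])] += 1
        context_counts[(padded[i - 2], padded[i - 1])] += 1

def q_ml(w, u, v):
    """q_ML(w | u, v) = Count(u, v, w) / Count(u, v)."""
    if context_counts[(u, v)] == 0:
        return 0.0
    return trigram_counts[(u, v, w)] / context_counts[(u, v)]

print(q_ml("laughs", "the", "dog"))  # Count(the, dog, laughs) / Count(the, dog) = 1/2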
Sparse Data Problems
q(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}

• Say our vocabulary size is N = |V|; then there are N^3 parameters in the model.
• For example, N = 20,000 ⇒ 20,000^3 = 8 × 10^{12} parameters.
Evaluating a Language Model: Perplexity
We evaluate the model on m test sentences s_1, s_2, \ldots, s_m containing M words in total. The log-probability of the test corpus is

\log \left( \prod_{i=1}^{m} p(s_i) \right) = \sum_{i=1}^{m} \log p(s_i)

and the perplexity is

\text{Perplexity} = 2^{-l} \quad \text{where} \quad l = \frac{1}{M} \sum_{i=1}^{m} \log p(s_i)

(all logarithms are base 2; lower perplexity is better).

As a sanity check, consider the uniform model

q(w \mid u, v) = \frac{1}{N} \quad \text{for all } w \in V \cup \{\text{STOP}\}, \text{ for all } u, v \in V \cup \{*\}

Then

\text{Perplexity} = 2^{-l} \quad \text{where} \quad l = \log \frac{1}{N} \quad \Rightarrow \quad \text{Perplexity} = N

since, for a sentence s with n words,

\log p(s) = \sum_{i=1}^{n} \log \frac{1}{N} = n \cdot \log \frac{1}{N} = -n \cdot \log N

and therefore, assuming each test sentence has n words so that M = m \cdot n,

l = \frac{1}{M} \sum_{i=1}^{m} \log p(s_i) = \frac{1}{M} \sum_{i=1}^{m} (-n \cdot \log N) = \frac{1}{M} \cdot (-m \cdot n \cdot \log N) = -\log N
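A minimal sketch of the perplexity computation, assuming a hypothetical sentence_prob function that returns p(s) for a tokenized sentence; the uniform model is used to check that the perplexity comes out to exactly N:

import math

def perplexity(sentences, sentence_prob):
    """Perplexity = 2^{-l}, where l = (1/M) * sum_i log2 p(s_i) and M is the total word count."""
    M = sum(len(s) for s in sentences)  # total number of words in the test data
    l = sum(math.log2(sentence_prob(s)) for s in sentences) / M
    return 2 ** (-l)

# Uniform model over a vocabulary of size N: q(w | u, v) = 1/N for every word,
# so p(s) = (1/N)^len(s) and the perplexity should equal N.
N = 50
uniform_prob = lambda s: (1.0 / N) ** len(s)

test_sentences = [["the", "dog", "laughs", "STOP"], ["the", "cat", "sleeps", "STOP"]]
print(perplexity(test_sentences, uniform_prob))  # 50.0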
• Chomsky, in his book Syntactic Structures (1957), made several important points
regarding grammar. [Chomsky, 2009]
• According to Chomsky, the notion of "grammatical" cannot be equated with "meaningful" or "significant" in a semantic sense.
• He illustrated this with two nonsensical sentences:
• (1) Colorless green ideas sleep furiously.
• (2) Furiously sleep ideas green colorless.
• While both sentences lack meaning, Chomsky argued that only the first one is
considered grammatical by English speakers.
Some History
• Chomsky also emphasized that grammaticality in English cannot be determined
solely based on statistical approximations.
• Even though neither sentence (1) nor (2) has likely occurred in English
discourse, a statistical model would consider them equally "remote" from English.
• However, sentence (1) is grammatical, while sentence (2) is not, highlighting the
limitations of statistical approaches in capturing grammaticality.
The Bias-Variance Trade-Off
The trigram, bigram, and unigram maximum-likelihood estimates are

q_{ML}(w_i \mid w_{i-2}, w_{i-1}) = \frac{\text{Count}(w_{i-2}, w_{i-1}, w_i)}{\text{Count}(w_{i-2}, w_{i-1})}

q_{ML}(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}

q_{ML}(w_i) = \frac{\text{Count}(w_i)}{\text{Count}()}

The trigram estimate conditions on the most context and so has low bias, but its counts are sparse, giving high variance; the unigram estimate is the opposite: high bias but low variance.
Linear Interpolation
q(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \cdot q_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \cdot q_{ML}(w_i \mid w_{i-1}) + \lambda_3 \cdot q_{ML}(w_i)

where \lambda_1 + \lambda_2 + \lambda_3 = 1 and \lambda_i \geq 0 for all i.

The interpolated estimate defines a valid distribution over V' = V \cup \{\text{STOP}\}:

\sum_{w \in V'} q(w \mid u, v) = \sum_{w \in V'} \left[ \lambda_1 \cdot q_{ML}(w \mid u, v) + \lambda_2 \cdot q_{ML}(w \mid v) + \lambda_3 \cdot q_{ML}(w) \right]

= \lambda_1 \sum_{w} q_{ML}(w \mid u, v) + \lambda_2 \sum_{w} q_{ML}(w \mid v) + \lambda_3 \sum_{w} q_{ML}(w)

= \lambda_1 + \lambda_2 + \lambda_3 = 1
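A minimal sketch of linear interpolation over trigram, bigram, and unigram maximum-likelihood estimates, assuming hypothetical count tables and arbitrary placeholder λ values (in practice the λs are estimated, e.g. on held-out data):

from collections import Counter

# Hypothetical count tables; in practice these come from the training corpus.
trigram_counts = Counter({("the", "dog", "laughs"): 1, ("the", "dog", "barks"): 1})
bigram_counts  = Counter({("the", "dog"): 2, ("dog", "laughs"): 1, ("dog", "barks"): 1})
unigram_counts = Counter({"the": 2, "dog": 2, "laughs": 1, "barks": 1})
total_words = sum(unigram_counts.values())  # Count()

def q_ml_trigram(w, u, v):
    return trigram_counts[(u, v, w)] / bigram_counts[(u, v)] if bigram_counts[(u, v)] else 0.0

def q_ml_bigram(w, v):
    return bigram_counts[(v, w)] / unigram_counts[v] if unigram_counts[v] else 0.0

def q_ml_unigram(w):
    return unigram_counts[w] / total_words

def q_interpolated(w, u, v, lambdas=(0.5, 0.3, 0.2)):
    """q(w | u, v) = l1 * q_ML(w | u, v) + l2 * q_ML(w | v) + l3 * q_ML(w)."""
    l1, l2, l3 = lambdas  # must be non-negative and sum to 1
    return l1 * q_ml_trigram(w, u, v) + l2 * q_ml_bigram(w, v) + l3 * q_ml_unigram(w)

print(q_interpolated("laughs", "the", "dog"))  # 0.5*0.5 + 0.3*0.5 + 0.2*(1/6) ≈ 0.433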
Discounting Methods
• The maximum-likelihood estimates are systematically too high, particularly for low-count items, while unseen items receive probability zero.
• A common fix is to discount each observed count, e.g. Count*(w_{i-1}, w) = Count(w_{i-1}, w) − 0.5, and to set aside the resulting "missing" probability mass

\alpha(w_{i-1}) = 1 - \sum_{w : \text{Count}(w_{i-1}, w) > 0} \frac{\text{Count}^*(w_{i-1}, w)}{\text{Count}(w_{i-1})}

• For example, if Count(the) = 48 and "the" is followed by 10 distinct words in the training data, then

\alpha(\text{the}) = \frac{10 \times 0.5}{48} = \frac{5}{48}
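A minimal sketch of discounted counts and the missing probability mass, assuming a hypothetical bigram count table chosen so that Count(the) = 48 with 10 distinct following words:

from collections import Counter

DISCOUNT = 0.5

# Hypothetical bigram counts for the context "the" (values made up so that
# Count(the) = 48 and "the" is followed by 10 distinct words).
bigram_counts = Counter({("the", w): c for w, c in zip(
    ["dog", "cat", "man", "woman", "park", "job", "end", "car", "house", "idea"],
    [12, 8, 7, 6, 4, 3, 3, 2, 2, 1])})
context_count = sum(bigram_counts.values())  # Count(the) = 48

def discounted_count(u, w):
    """Count*(u, w) = Count(u, w) - DISCOUNT."""
    return bigram_counts[(u, w)] - DISCOUNT

# Missing probability mass: alpha(u) = 1 - sum_w Count*(u, w) / Count(u).
alpha_the = 1.0 - sum(discounted_count("the", w) for (_, w) in bigram_counts) / context_count
print(alpha_the)  # 10 * 0.5 / 48 = 5/48 ≈ 0.104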
Katz Back-Off Models (Bigrams)
• For a bigram model, define two sets:

A(w_{i-1}) = \{ w : \text{Count}(w_{i-1}, w) > 0 \}
B(w_{i-1}) = \{ w : \text{Count}(w_{i-1}, w) = 0 \}
• A bigram model:

q_{BO}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{\text{Count}^*(w_{i-1}, w_i)}{\text{Count}(w_{i-1})} & \text{if } w_i \in A(w_{i-1}) \\[1ex]
\alpha(w_{i-1}) \cdot \dfrac{q_{ML}(w_i)}{\sum_{w \in B(w_{i-1})} q_{ML}(w)} & \text{if } w_i \in B(w_{i-1})
\end{cases}
• Where:

\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{\text{Count}^*(w_{i-1}, w)}{\text{Count}(w_{i-1})}
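A minimal sketch of the bigram back-off estimate, assuming hypothetical bigram and unigram count tables and a discount of 0.5:

from collections import Counter

DISCOUNT = 0.5

# Hypothetical count tables (made up for illustration).
bigram_counts = Counter({("the", "dog"): 3, ("the", "cat"): 2})
unigram_counts = Counter({"the": 5, "dog": 3, "cat": 2, "laughs": 1, "sleeps": 1})
total_words = sum(unigram_counts.values())
vocab = set(unigram_counts)

def q_ml_unigram(w):
    return unigram_counts[w] / total_words

def q_bo(w, u):
    """Katz back-off for bigrams: discounted MLE for seen bigrams,
    otherwise the missing mass alpha(u) spread over unseen words in proportion to q_ML."""
    count_u = sum(c for (u2, _), c in bigram_counts.items() if u2 == u)  # Count(u)
    seen = {w2 for (u2, w2) in bigram_counts if u2 == u}                 # A(u)
    unseen = vocab - seen                                                # B(u)
    if w in seen:
        return (bigram_counts[(u, w)] - DISCOUNT) / count_u
    alpha = 1.0 - sum(bigram_counts[(u, w2)] - DISCOUNT for w2 in seen) / count_u
    return alpha * q_ml_unigram(w) / sum(q_ml_unigram(w2) for w2 in unseen)

print(q_bo("dog", "the"))     # seen bigram:   (3 - 0.5) / 5 = 0.5
print(q_bo("laughs", "the"))  # unseen bigram: alpha(the) * q_ML(laughs) / sum over B(the)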
Katz Back-Off Models (Trigrams)
• Where: