02 NLP LM
Natural Language Processing
Language models
Based on slides from Michael Collins, Chris Manning, Richard Socher, Dan Jurafsky
Plan
• Problem definition
• Trigram models
• Evaluation
• Estimation
• Interpolation
• Discounting
Motivations
• Define a probability distribution over sentences
• Spelling correction
• P(“The office is fifteen minutes from here”) > P(“The office is fifteen minuets
from here”)
• And more!
Motivations
• Philosophical: a model that is good at predicting
the next word, must know something about
language and the world
• paper 1
• paper 2
Motivations
Problem definition
• Given a finite vocabulary
  V = {the, a, man, telescope, Beckham, two, . . . }
• A sentence is a sequence of words from V followed by the STOP symbol, e.g.:
  the STOP
  a STOP
  the fan STOP
  the fan saw Beckham STOP
  the fan saw saw STOP
  the fan saw Beckham play for Real Madrid STOP
Problem definition
• Input: a training set of example sentences
• Output: a probability p(x_1, \ldots, x_n) for every sentence
A naive method
• Assume we have N training sentences
• Estimate a sentence's probability as its relative frequency in the training set:
  p(x_1, \ldots, x_n) = \frac{c(x_1, \ldots, x_n)}{N}
• No generalization!
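To make the estimator concrete, here is a minimal sketch in Python; the corpus format and function names are illustrative, not from the slides:

    from collections import Counter

    def naive_lm(training_sentences):
        """Estimate p(sentence) as its relative frequency in the training set."""
        N = len(training_sentences)
        counts = Counter(tuple(s) for s in training_sentences)
        def p(sentence):
            return counts[tuple(sentence)] / N
        return p

    p = naive_lm([["the", "fan", "saw", "Beckham", "STOP"],
                  ["the", "fan", "STOP"]])
    print(p(["the", "fan", "STOP"]))   # 0.5
    print(p(["the", "man", "STOP"]))   # 0.0: any unseen sentence gets zero probability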
Markov processes
• Markov processes: a sequence of random variables X_1, X_2, \ldots, X_n, with n fixed (e.g. n = 100) and X_i \in V
• Goal: model p(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)
• Chain rule:
  p(X_1 = x_1, \ldots, X_n = x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})
First-order Markov process
Chain rule:
  p(X_1 = x_1, \ldots, X_n = x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1})
Markov assumption:
  p(X_i = x_i \mid X_1 = x_1, \ldots, X_{i-1} = x_{i-1}) = p(X_i = x_i \mid X_{i-1} = x_{i-1})
Second-order Markov process
Relax the independence assumption:
  p(X_1 = x_1, \ldots, X_n = x_n) = p(X_1 = x_1) \times p(X_2 = x_2 \mid X_1 = x_1) \times \prod_{i=3}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})
Simplify notation:
  x_0 = *, \quad x_{-1} = *
Is this reasonable?
Detail: variable length
• We want a probability distribution over sequences of any length
• Achieved by treating the length n as a random variable and requiring x_n = STOP
Trigram language model
• A trigram language model contains:
  • A vocabulary V
  • A parameter q(w \mid u, v) for every trigram (u, v, w) such that w \in V \cup \{STOP\} and u, v \in V \cup \{*\}
• The probability of a sentence x_1, \ldots, x_n (with x_n = STOP) is
  p(x_1, \ldots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})
Example
p(the dog barks STOP) = q(the \mid *, *) \times q(dog \mid *, the) \times q(barks \mid the, dog) \times q(STOP \mid dog, barks)
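A small sketch of how a trigram model scores a sentence, assuming q is stored as a dictionary keyed by (u, v, w); the data structure and names are illustrative, not from the slides:

    def sentence_prob(sentence, q):
        """p(x_1..x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}), with x_0 = x_{-1} = '*'.
        `sentence` must end with 'STOP'; `q` maps (u, v, w) -> probability."""
        prob = 1.0
        u, v = "*", "*"
        for w in sentence:
            prob *= q.get((u, v, w), 0.0)
            u, v = v, w
        return prob

    q = {("*", "*", "the"): 0.5, ("*", "the", "dog"): 0.2,
         ("the", "dog", "barks"): 0.3, ("dog", "barks", "STOP"): 0.9}
    print(sentence_prob(["the", "dog", "barks", "STOP"], q))  # 0.5 * 0.2 * 0.3 * 0.9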
Limitation
• The Markovian assumption is false: language has long-range dependencies (agreement, topic, coreference) that two words of context cannot capture
Sparseness
• Maximum likelihood for estimating q:
  q_{ML}(w \mid u, v) = \frac{c(u, v, w)}{c(u, v)}
• Most trigrams never appear in the training data, so many estimates are zero, or undefined when c(u, v) = 0
Bigram counts
Bigram probabilities
What did we learn?
Evaluation: perplexity
• Test data: m sentences s_1, s_2, \ldots, s_m, containing M words in total
• Average log-probability and perplexity:
  l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i), \qquad \text{perplexity} = 2^{-l}
• Sanity check: for the uniform model q(w \mid u, v) = 1/N, where N = |V \cup \{STOP\}|,
  l = \frac{1}{M} \log_2 \frac{1}{N^M} = \log_2 \frac{1}{N}, \qquad \text{perplexity} = 2^{-l} = 2^{\log_2 N} = N
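A sketch of the perplexity computation, including the uniform-model sanity check from the slide; the corpus format is an assumption:

    import math

    def perplexity(test_sentences, sentence_log2_prob):
        """perplexity = 2^{-l}, l = (1/M) * sum_i log2 p(s_i),
        where M is the total number of words (including STOP)."""
        M = sum(len(s) for s in test_sentences)
        l = sum(sentence_log2_prob(s) for s in test_sentences) / M
        return 2 ** (-l)

    # Uniform model over N = |V ∪ {STOP}| words assigns each word probability 1/N.
    N = 10
    uniform_log2_prob = lambda s: len(s) * math.log2(1.0 / N)
    print(perplexity([["a"] * 4 + ["STOP"], ["b", "STOP"]], uniform_log2_prob))  # ≈ 10.0 (= N)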
Evaluation
History
Estimating parameters
• Recall that the number of parameters for a trigram model with |V| = 20,000 is about 20,000^3 = 8 \times 10^{12}, leading to zeros and undefined probabilities
• Maximum likelihood estimate:
  q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}
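A minimal maximum-likelihood trigram estimator, assuming sentences are given as token lists ending in STOP; names are illustrative:

    from collections import Counter

    def train_trigram_mle(sentences):
        """q(w | u, v) = c(u, v, w) / c(u, v), estimated from a tokenized corpus."""
        tri, bi = Counter(), Counter()
        for s in sentences:
            words = ["*", "*"] + s            # sentences are assumed to end with STOP
            for i in range(2, len(words)):
                u, v, w = words[i - 2], words[i - 1], words[i]
                tri[(u, v, w)] += 1
                bi[(u, v)] += 1
        def q(w, u, v):
            if bi[(u, v)] == 0:
                return 0.0                    # undefined under MLE; smoothing is needed
            return tri[(u, v, w)] / bi[(u, v)]
        return q

    q = train_trigram_mle([["the", "dog", "barks", "STOP"]])
    print(q("dog", "*", "the"))   # 1.0
    print(q("cat", "*", "the"))   # 0.0: the sparseness problem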
Over/under-fitting
• Given a corpus of length M:
• Trigram model:
  q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}
• Bigram model:
  q(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
• Unigram model:
  q(w_i) = \frac{c(w_i)}{M}
Radford et al, 2019
Linear interpolation
q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \times q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \times q(w_i \mid w_{i-1}) + \lambda_3 \times q(w_i)
\lambda_i \geq 0, \quad \lambda_1 + \lambda_2 + \lambda_3 = 1
Linear interpolation
• Need to verify the parameters define a probability distribution:
  \sum_{w \in V \cup \{STOP\}} q_{LI}(w \mid u, v)
  = \sum_{w} \left[ \lambda_1 \times q(w \mid u, v) + \lambda_2 \times q(w \mid v) + \lambda_3 \times q(w) \right]
  = \lambda_1 \sum_{w} q(w \mid u, v) + \lambda_2 \sum_{w} q(w \mid v) + \lambda_3 \sum_{w} q(w)
  = \lambda_1 + \lambda_2 + \lambda_3 = 1
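A direct translation of the interpolated estimator; the lambda values shown are placeholders, since the slides estimate them separately on held-out data:

    def q_li(w, u, v, q_tri, q_bi, q_uni, lambdas=(0.5, 0.3, 0.2)):
        """Linearly interpolate trigram, bigram and unigram estimates.
        lambdas must be non-negative and sum to 1 (values here are illustrative)."""
        l1, l2, l3 = lambdas
        return l1 * q_tri(w, u, v) + l2 * q_bi(w, v) + l3 * q_uni(w)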
Estimating coefficients
• Choose the \lambda values to maximize the log-likelihood of a held-out development set
Linear interpolation
q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \times q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \times q(w_i \mid w_{i-1}) + \lambda_3 \times q(w_i)
\lambda_i \geq 0, \quad \lambda_1 + \lambda_2 + \lambda_3 = 1

Linear interpolation
Make the coefficients depend on the history by bucketing the bigram count:
  \Pi(w_{i-2}, w_{i-1}) =
    \begin{cases}
      1 & \text{if } c(w_{i-2}, w_{i-1}) = 0 \\
      2 & \text{if } 1 \leq c(w_{i-2}, w_{i-1}) \leq 2 \\
      3 & \text{if } 3 \leq c(w_{i-2}, w_{i-1}) \leq 10 \\
      4 & \text{otherwise}
    \end{cases}
  q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1^{\Pi(w_{i-2}, w_{i-1})} \times q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} \times q(w_i \mid w_{i-1}) + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} \times q(w_i)
  \lambda_i^{\Pi(w_{i-2}, w_{i-1})} \geq 0, \quad \lambda_1^{\Pi(w_{i-2}, w_{i-1})} + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} = 1
Discounting methods
  x                c(x)   q(w_i | w_{i-1})
  the              48
  the, dog         15     15/48
  the, woman       11     11/48
  the, man         10     10/48
  the, park         5      5/48
  the, job          2      2/48
  the, telescope    1      1/48
  the, manual       1      1/48
  the, afternoon    1      1/48
  the, country      1      1/48
  the, street       1      1/48
• The ML estimates leave no probability for unseen bigrams; define discounted counts c^*(w_{i-1}, w_i) = c(w_{i-1}, w_i) - \beta (e.g. \beta = 0.5) and redistribute the freed mass to unseen words
• With A(w_{i-1}) = \{w : c(w_{i-1}, w) > 0\} and B(w_{i-1}) = \{w : c(w_{i-1}, w) = 0\}:
  q_{BO}(w_i \mid w_{i-1}) =
    \begin{cases}
      \frac{c^*(w_{i-1}, w_i)}{c(w_{i-1})} & \text{if } w_i \in A(w_{i-1}) \\
      \alpha(w_{i-1}) \frac{q(w_i)}{\sum_{w \in B(w_{i-1})} q(w)} & \text{if } w_i \in B(w_{i-1})
    \end{cases}
  \alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{c^*(w_{i-1}, w)}{c(w_{i-1})}
Katz back-off
In a trigram model:
  A(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) > 0\}
  B(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) = 0\}

  q_{BO}(w_i \mid w_{i-2}, w_{i-1}) =
    \begin{cases}
      \frac{c^*(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})} & \text{if } w_i \in A(w_{i-2}, w_{i-1}) \\
      \alpha(w_{i-2}, w_{i-1}) \frac{q_{BO}(w_i \mid w_{i-1})}{\sum_{w \in B(w_{i-2}, w_{i-1})} q_{BO}(w \mid w_{i-1})} & \text{if } w_i \in B(w_{i-2}, w_{i-1})
    \end{cases}

  \alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \frac{c^*(w_{i-2}, w_{i-1}, w)}{c(w_{i-2}, w_{i-1})}
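A sketch of Katz back-off for the simpler bigram case, backing off to the unigram MLE; the discount beta = 0.5 is an assumed value, since the slides leave it unspecified:

    from collections import Counter

    def katz_bigram(sentences, beta=0.5):
        """Katz back-off bigram model; beta is an assumed fixed discount."""
        uni, bi, ctx = Counter(), Counter(), Counter()
        for s in sentences:                    # each sentence ends with STOP
            uni.update(s)
            words = ["*"] + s
            for u, w in zip(words, words[1:]):
                bi[(u, w)] += 1
                ctx[u] += 1                    # count of u as a bigram context
        M = sum(uni.values())
        vocab = list(uni)                      # V ∪ {STOP}
        q_uni = lambda w: uni[w] / M

        def q_bo(w, u):
            A = [x for x in vocab if bi[(u, x)] > 0]
            if w in A:                         # seen bigram: discounted estimate
                return (bi[(u, w)] - beta) / ctx[u]
            alpha = 1.0 - sum((bi[(u, x)] - beta) / ctx[u] for x in A)
            B_mass = sum(q_uni(x) for x in vocab if bi[(u, x)] == 0)
            return alpha * q_uni(w) / B_mass if B_mass > 0 else 0.0
        return q_bo

    q = katz_bigram([["the", "dog", "barks", "STOP"], ["the", "man", "STOP"]])
    print(q("dog", "the"))   # (1 - 0.5) / 2 = 0.25
    print(q("STOP", "the"))  # unseen bigram: backs off to the unigram distribution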
Advanced Smoothing
• Good-Turing
• Kneser-Ney
Advanced Smoothing
• Principles
• Train
Problem
• Our estimator q(w | u, v) is based on one-hot representations of words: there is no notion of similarity between words, so counts for one word tell us nothing about related words
Neural networks
• Vanishing/exploding gradients
Neural networks: history
• Proposed in the mid-20th century (the perceptron)
• A single unit computes f(w^\top x + b), where
  • w: weights vector
  • b: bias
  • f: activation function
A single neuron
• If f is sigmoid, a neuron is logistic regression:
  p(y = 1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}} = \sigma(w^\top x + b)
• Provides a linear decision boundary
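A one-line version of the neuron, with the sigmoid making it logistic regression; numpy only, names illustrative:

    import numpy as np

    def neuron(x, w, b, f=lambda z: 1.0 / (1.0 + np.exp(-z))):
        """A single neuron: f(w^T x + b); with the sigmoid f this is logistic regression."""
        return f(w @ x + b)

    x = np.array([1.0, 2.0])
    w = np.array([0.5, -0.25])
    print(neuron(x, w, b=0.1))   # p(y = 1 | x)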
Single layer network
• Perform logistic regression in parallel multiple times (with different parameters)
  y = \sigma(\hat{w}^\top \hat{x} + b) = \sigma(w^\top x)
  x, w \in \mathbb{R}^{d+1}, \quad x_{d+1} = 1, \quad w_{d+1} = b
Single layer network
Multi-layer network
Repeat: feed the output of each layer as the input to the next layer
Matrix notation
  a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)
  a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)
  a_3 = f(W_{31} x_1 + W_{32} x_2 + W_{33} x_3 + b_3)
  z = W x + b, \quad a = f(z)
  x \in \mathbb{R}^4, \quad z \in \mathbb{R}^3, \quad W \in \mathbb{R}^{3 \times 4}
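The same layer in code, with f = tanh chosen arbitrarily as the activation (the slide keeps f generic); dimensions follow the slide:

    import numpy as np

    def layer(x, W, b, f=np.tanh):
        """One fully connected layer: z = W x + b, a = f(z)."""
        z = W @ x + b
        return f(z)

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)            # x in R^4
    W = rng.normal(size=(3, 4))       # W in R^{3x4}
    b = np.zeros(3)
    a = layer(x, W, b)                # a in R^3
    print(a.shape)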
Language modeling with NNs
Bengio et al., 2003: a feed-forward neural LM. The context words (the, dog) are mapped to a hidden representation h(the, dog), and f(h(the, dog)) gives the output distribution over the next word.
Loss function
• Minimize negative log-likelihood
  L(\theta) = -\sum_{i=1}^{T} \log p_\theta(w_i \mid w_{i-2}, w_{i-1})
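A sketch of the negative log-likelihood over a corpus, assuming the model exposes a function q(w | u, v); the interface is illustrative:

    import numpy as np

    def nll(model_probs, corpus):
        """L(theta) = -sum_i log p_theta(w_i | w_{i-2}, w_{i-1}).
        `model_probs(w, u, v)` returns q(w | u, v); `corpus` is a list of sentences."""
        loss = 0.0
        for s in corpus:
            words = ["*", "*"] + s            # sentences end with STOP
            for i in range(2, len(words)):
                loss -= np.log(model_probs(words[i], words[i - 2], words[i - 1]))
        return loss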
Advantages
• If we see "the cat is walking in the bedroom" in the training set, shared word representations let us generalize to sentences like "a dog was running in a room" (the example from Bengio et al., 2003)
• Compared to what?
Parameter estimation
• We train with SGD
Backpropagation
• Notation:
  W_t : weight matrix at the input of layer t
  z_t : output vector at layer t
  x = z_0 : input vector
  y : gold scalar
  \hat{y} = z_L : predicted scalar
  l(y, \hat{y}) : loss function
  v_t = W_t \cdot z_{t-1} : pre-activations
  \delta_t = \frac{\partial l(y, \hat{y})}{\partial z_t} : gradient vector
Backpropagation
• Run the network forward to obtain all values v_t, z_t
• Base:
  \delta_L = l'(y, z_L)
• Recursion:
  \delta_t = W_{t+1}^\top (\sigma'(v_{t+1}) \odot \delta_{t+1})
  \sigma'(v_{t+1}), \delta_{t+1} \in \mathbb{R}^{d_{t+1} \times 1}, \quad W_{t+1} \in \mathbb{R}^{d_{t+1} \times d_t}
• Gradients:
  \frac{\partial l}{\partial W_t} = (\delta_t \odot \sigma'(v_t)) z_{t-1}^\top
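A compact implementation of this recursion for a stack of sigmoid layers; the squared loss is an assumption, since the slide only writes a generic l(y, y_hat):

    import numpy as np

    sigma = lambda v: 1.0 / (1.0 + np.exp(-v))
    dsigma = lambda v: sigma(v) * (1.0 - sigma(v))

    def backprop(Ws, x, y):
        """Backprop for sigmoid layers with squared loss l = 0.5 * ||z_L - y||^2
        (the loss choice is an assumption). Returns dl/dW_t for every layer."""
        zs, vs = [x], []
        for W in Ws:                               # forward pass: store v_t, z_t
            vs.append(W @ zs[-1])
            zs.append(sigma(vs[-1]))
        delta = zs[-1] - y                         # base: delta_L = l'(y, z_L)
        grads = [None] * len(Ws)
        for t in reversed(range(len(Ws))):
            grads[t] = np.outer(delta * dsigma(vs[t]), zs[t])        # (delta_t ⊙ σ'(v_t)) z_{t-1}^T
            delta = Ws[t].T @ (dsigma(vs[t]) * delta)                # delta_{t-1}
        return grads

    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
    grads = backprop(Ws, x=rng.normal(size=3), y=np.zeros(2))
    print([g.shape for g in grads])   # [(4, 3), (2, 4)]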
Bigram LM example
• Forward pass:
  z_0 = \text{one-hot vector for } w_{i-1}
  v_1 = W_1 z_0, \quad z_1 = v_1 \quad \text{(linear embedding layer)}
  v_2 = W_2 z_1, \quad z_2 = \sigma(v_2)
  v_3 = W_3 z_2, \quad z_3 = \mathrm{softmax}(v_3)
  l = -\sum_k y_k \log z_{3,k} \quad \text{(y: one-hot gold next word)}
Bigram LM example
• Backward pass:
  \sigma'(v_3) \odot \delta_3 = (z_3 - y)
  \delta_2 = W_3^\top (\sigma'(v_3) \odot \delta_3) = W_3^\top (z_3 - y)
  \delta_1 = W_2^\top (\sigma'(v_2) \odot \delta_2) = W_2^\top (z_2 \odot (1 - z_2) \odot \delta_2)
  \frac{\partial l}{\partial W_3} = (\delta_3 \odot \sigma'(v_3)) z_2^\top = (z_3 - y) z_2^\top
  \frac{\partial l}{\partial W_2} = (\delta_2 \odot \sigma'(v_2)) z_1^\top = (\delta_2 \odot z_2 \odot (1 - z_2)) z_1^\top
  \frac{\partial l}{\partial W_1} = (\delta_1 \odot \sigma'(v_1)) z_0^\top = \delta_1 z_0^\top
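Putting the forward and backward passes together for this bigram example; the forward-pass structure (linear embedding, sigmoid hidden layer, softmax output with cross-entropy) is inferred from the backward-pass equations above, and dimensions are illustrative:

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    sigma = lambda v: 1.0 / (1.0 + np.exp(-v))

    def bigram_nn_step(W1, W2, W3, prev_word, next_word, V):
        """One forward/backward pass of the bigram neural LM sketched above."""
        z0 = np.eye(V)[prev_word]               # one-hot input
        z1 = W1 @ z0                            # v1 = W1 z0, z1 = v1 (linear)
        v2 = W2 @ z1; z2 = sigma(v2)
        v3 = W3 @ z2; z3 = softmax(v3)
        y = np.eye(V)[next_word]
        loss = -np.log(z3[next_word])
        # Backward pass, following the slide:
        d3 = z3 - y                             # sigma'(v3) ⊙ delta_3
        dW3 = np.outer(d3, z2)                  # (z3 - y) z2^T
        d2 = W3.T @ d3                          # delta_2
        dW2 = np.outer(d2 * z2 * (1 - z2), z1)  # (delta_2 ⊙ z2(1-z2)) z1^T
        d1 = W2.T @ (d2 * z2 * (1 - z2))        # delta_1
        dW1 = np.outer(d1, z0)                  # delta_1 z0^T
        return loss, (dW1, dW2, dW3)

    V, d1, d2 = 5, 4, 3
    rng = np.random.default_rng(0)
    W1, W2, W3 = rng.normal(size=(d1, V)), rng.normal(size=(d2, d1)), rng.normal(size=(V, d2))
    loss, grads = bigram_nn_step(W1, W2, W3, prev_word=1, next_word=3, V=V)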
Summary
• Neural nets can improve language models:
  Input: w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T, \quad w_i \in \mathbb{R}^{V}
  Model: x_t = W^{(e)} \cdot w_t, \quad W^{(e)} \in \mathbb{R}^{d \times V}
  h_t = \sigma(W^{(hh)} \cdot h_{t-1} + W^{(hx)} \cdot x_t), \quad W^{(hh)} \in \mathbb{R}^{D_h \times D_h}, \; W^{(hx)} \in \mathbb{R}^{D_h \times d}
  \hat{y}_t = \mathrm{softmax}(W^{(s)} \cdot h_t), \quad W^{(s)} \in \mathbb{R}^{V \times D_h}
Recurrent neural networks
• Can exploit long-range dependencies
• Optimization? SGD
• Recall the earlier smoothing approaches:
  • Linear interpolation
  • Katz back-off:
    \alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \frac{c^*(w_{i-2}, w_{i-1}, w)}{c(w_{i-2}, w_{i-1})}
Bengio et al., 2003 (recap): the feed-forward neural LM maps the fixed context (the, dog) to h(the, dog), and f(h(the, dog)) gives the next-word distribution.
Recurrent neural networks
  Input: w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T, \quad w_i \in \mathbb{R}^{V}
  Model: x_t = W^{(e)} \cdot w_t, \quad W^{(e)} \in \mathbb{R}^{d \times V}
  h_t = \sigma(W^{(hh)} \cdot h_{t-1} + W^{(hx)} \cdot x_t), \quad W^{(hh)} \in \mathbb{R}^{D_h \times D_h}, \; W^{(hx)} \in \mathbb{R}^{D_h \times d}
  \hat{y}_t = \mathrm{softmax}(W^{(s)} \cdot h_t), \quad W^{(s)} \in \mathbb{R}^{V \times D_h}
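One step of this RNN LM in numpy; the dimensions at the bottom are illustrative:

    import numpy as np

    def softmax(v):
        e = np.exp(v - v.max())
        return e / e.sum()

    def rnn_lm_step(w_t, h_prev, We, Whh, Whx, Ws):
        """x_t = W^(e) w_t,  h_t = sigmoid(W^(hh) h_{t-1} + W^(hx) x_t),
        y_hat_t = softmax(W^(s) h_t);  w_t is a one-hot vector in R^V."""
        x_t = We @ w_t
        h_t = 1.0 / (1.0 + np.exp(-(Whh @ h_prev + Whx @ x_t)))
        y_hat = softmax(Ws @ h_t)
        return h_t, y_hat

    V, d, Dh = 5, 4, 3
    rng = np.random.default_rng(0)
    We, Whh, Whx, Ws = (rng.normal(size=s) for s in [(d, V), (Dh, Dh), (Dh, d), (V, Dh)])
    h, y_hat = rnn_lm_step(np.eye(V)[2], np.zeros(Dh), We, Whh, Whx, Ws)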
Training
• Objective: the corpus-level cross-entropy loss
  J(\theta) = \sum_{t=1}^{T} CE(y_t, \hat{y}_t)
• Optimization: SGD
n-gram No No
FFNN LM Yes No
• Vanishing/exploding gradients: the gradient at step t with respect to an earlier step k contains the product of Jacobians
  \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}},
  whose norm can shrink (or grow) exponentially in t - k,
  so long-term influence vanishes to zero.
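A toy numerical illustration of the vanishing effect; the choice of a fixed recurrent Jacobian with norm 0.5 is an assumption made purely for clarity:

    import numpy as np

    Whh = 0.5 * np.eye(10)          # recurrent Jacobian with norm 0.5 < 1 (assumed)
    grad = np.ones(10)              # gradient arriving at time step t
    for steps_back in range(1, 21):
        grad = Whh.T @ grad         # one factor of the product of Jacobians
        if steps_back % 5 == 0:
            print(steps_back, np.linalg.norm(grad))
    # The norm decays geometrically (0.5 per step): long-term influence vanishes.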
Solutions
• Exploding gradients: gradient clipping
• Vanishing gradients: gated architectures (LSTMs and GRUs, next)
Illustration
LSTMs and GRUs
• Vanilla RNN:
  h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)
• GRU (update gate z_t, reset gate r_t):
  z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})
  r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})
  \tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})
  h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
• Use the gates to control information flow (a minimal code sketch follows the figure below)
• If z = 1, we simply copy the past and ignore the present (note that the gradient will be 1)
• If z = 0, we get an RNN-like update, but we are also free to reset some of the past units; if r = 0, we have no memory of the past
(GRU diagram: inputs h_{t-1} and x_t; gates r_t and z_t; candidate \tilde{h}_t)
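A minimal GRU step following the equations above; parameter names are illustrative and biases are omitted:

    import numpy as np

    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

    def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
        """One GRU update (biases omitted for brevity)."""
        z_t = sigmoid(Wz @ x_t + Uz @ h_prev)          # update gate
        r_t = sigmoid(Wr @ x_t + Ur @ h_prev)          # reset gate
        h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))
        return z_t * h_prev + (1.0 - z_t) * h_tilde    # h_t

    Dh, d = 3, 4
    rng = np.random.default_rng(0)
    h = gru_step(rng.normal(size=d), np.zeros(Dh),
                 *(rng.normal(size=s) for s in [(Dh, d), (Dh, Dh), (Dh, d), (Dh, Dh), (Dh, d), (Dh, Dh)]))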
Hochreiter and Schmidhuber, 1997
• h_t is the output (hidden state); c_t is the memory cell
Illustration
(LSTM cell: gates i_t, f_t, o_t and candidate \tilde{c}_t computed from c_{t-1}, h_{t-1}, x_t; outputs c_t and h_t)
Illustration
Results
Summary
• Language modeling is a fundamental NLP task used in machine
translation, spelling correction, speech recognition, etc.