02 NLP LM

This document covers natural language processing and language models. It introduces trigram models, which assign a probability to a sentence by conditioning each word on the two preceding words. Trigram models are limited by sparse data and their Markov assumption. The document also discusses evaluating language models with perplexity, which measures how well a model predicts a sample, along with smoothing, interpolation, back-off, and neural (feed-forward and recurrent) language models.


Natural Language

Processing
Language models

Based on slides from Michael Collins, Chris Manning, Richard Socher, Dan Jurafsky
Plan
• Problem definition

• Trigram models

• Evaluation

• Estimation

• Interpolation

• Discounting
2
Motivations
• Define a probability distribution over sentences

• Why? By Bayes' rule, $p(\text{text} \mid \text{observation}) \propto p(\text{observation} \mid \text{text}) \cdot p(\text{text})$, so we need a prior over text: a language model


• Machine translation

• P(“high winds”) > P(“large winds”)

• Spelling correction

• P(“The office is fifteen minutes from here”) > P(“The office is fifteen minuets
from here”)

• Speech recognition (that’s where it started!)

• P(“recognize speech”) > P(“wreck a nice beach”)

• And more!

3
Motivations
• Philosophical: a model that is good at predicting
the next word, must know something about
language and the world

• A good representation for any NLP task

• paper 1

• paper 2
Motivations

• Techniques will be useful later

5
Problem definition
• Given a finite vocabulary
V = {the, a, man, telescope, Beckham, two, . . . }

• We have an infinite language $\mathcal{L}$: every sequence in $V^*$ concatenated with the special symbol STOP

the STOP
a STOP
the fan STOP
the fan saw Beckham STOP
the fan saw saw STOP
the fan saw Beckham play for Real Madrid STOP

6
Problem definition
• Input: a training set of example sentences

• Currently: roughly hundreds of billions of words.

• Output: a probability distribution p over L


$$\sum_{x \in \mathcal{L}} p(x) = 1, \qquad p(x) \ge 0 \;\text{ for all } x \in \mathcal{L}$$

p(“the STOP”) = $10^{-12}$

p(“the fan saw Beckham STOP”) = $2 \times 10^{-8}$

p(“the fan saw saw STOP”) = $10^{-15}$

7
A naive method
• Assume we have N training sentences

• Let x1, x2, …, xn be a sentence, and c(x1, x2, …, xn) be the number of times it appeared in the training data

• Define a language model:

$$p(x_1, \dots, x_n) = \frac{c(x_1, \dots, x_n)}{N}$$

• No generalization!
8
Markov processes
• Markov processes:

• Given a sequence of n random variables: $X_1, X_2, \dots, X_n$, with $n = 100$ (say) and $X_i \in V$

• We want a sequence probability model $p(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)$

• There are $|V|^n$ possible sequences


9
First-order Markov process
Chain rule:

$$p(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = p(X_1 = x_1) \prod_{i=2}^{n} p(X_i = x_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1})$$

Markov assumption:

$$p(X_i = x_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1}) = p(X_i = x_i \mid X_{i-1} = x_{i-1})$$

10
Second-order Markov process
Relax the independence assumption:

$$p(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = p(X_1 = x_1) \times p(X_2 = x_2 \mid X_1 = x_1) \times \prod_{i=3}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$$

Simplify notation: $x_0 = *, \; x_{-1} = *$

Is this reasonable?
11
Detail: variable length
• Probability distribution over sequences of any length

• Always define $X_n = \text{STOP}$, and obtain a probability distribution over all sequences

• Intuition: at every step you have probability $\alpha_h$ of stopping (conditioned on the history) and $1 - \alpha_h$ of continuing

$$p(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = \prod_{i=1}^{n} p(X_i = x_i \mid X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})$$

12
Trigram language model
• A trigram language model contains

• A vocabulary V

• Non-negative parameters $q(w \mid u, v)$ for every trigram, such that
  $w \in V \cup \{\text{STOP}\}$ and $u, v \in V \cup \{*\}$

• The probability of a sentence $x_1, \dots, x_n$, where $x_n = \text{STOP}$, is

$$p(x_1, \dots, x_n) = \prod_{i=1}^{n} q(x_i \mid x_{i-2}, x_{i-1})$$

13
Example
$$p(\text{the dog barks STOP}) = q(\text{the} \mid *, *) \times q(\text{dog} \mid *, \text{the}) \times q(\text{barks} \mid \text{the}, \text{dog}) \times q(\text{STOP} \mid \text{dog}, \text{barks})$$

14
Limitation
• Markovian assumption is false

He is from France, so it makes sense that his first


language is…

• We would want to model longer dependencies

15
Sparseness
• Maximum likelihood estimation for q

• Let c(w_1, …, w_n) be the number of times that n-gram appears in a corpus:

$$q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$$

• If the vocabulary has 20,000 words, the number of parameters is $20{,}000^3 = 8 \times 10^{12}$!

• Most sentences will have zero or undefined probabilities

16
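To make the estimation concrete, here is a minimal Python sketch (not from the slides) that collects trigram and bigram counts from a toy corpus and computes the maximum-likelihood estimate q and a sentence probability; the function names and the toy corpus are illustrative.

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """Collect trigram and bigram (history) counts; return the MLE estimator q(w | u, v)."""
    tri, bi = defaultdict(int), defaultdict(int)
    for s in sentences:
        tokens = ["*", "*"] + s.split() + ["STOP"]
        for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    def q(w, u, v):
        # q(w | u, v) = c(u, v, w) / c(u, v); undefined (returned as 0.0) when c(u, v) = 0
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] > 0 else 0.0
    return q

def sentence_prob(q, sentence):
    """p(x_1, ..., x_n) = prod_i q(x_i | x_{i-2}, x_{i-1}), with x_n = STOP."""
    tokens = ["*", "*"] + sentence.split() + ["STOP"]
    p = 1.0
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        p *= q(w, u, v)
    return p

corpus = ["the fan saw Beckham", "the fan saw the telescope"]
q = train_trigram_mle(corpus)
print(sentence_prob(q, "the fan saw Beckham"))  # 0.5 on this toy corpus
```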
Berkeley restaurant project
sentences
• can you tell me about any good cantonese restaurants
close by

• mid priced thai food is what i’m looking for

• tell me about chez panisse

• can you give me a listing of the kinds of food that are


available

• i’m looking for a good place to eat breakfast

• when is caffe venezia open during the day

17
Bigram counts

18
Bigram probabilities

19
What did we learn

• p(English | want) < p(Chinese | want): in this corpus people ask about Chinese food more than English food (world/corpus knowledge)

• p(to | want) = 0.66: English grammar (“want to …”)

• p(eat | to) = 0.28: English grammar (“to eat …”)

20
Evaluation: perplexity
• Test data: $S = (s_1, s_2, \dots, s_m)$

• Parameters are not estimated from S

• A good language model has high p(S) and low perplexity

• Perplexity is the normalized inverse log-probability of S:

$$p(S) = \prod_{i=1}^{m} p(s_i \mid s_1, \dots, s_{i-1}), \qquad \log_2 p(S) = \sum_{i=1}^{m} \log_2 p(s_i \mid s_1, \dots, s_{i-1})$$

$$\text{perplexity} = 2^{-l}, \qquad l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i \mid s_1, \dots, s_{i-1})$$

• M is the number of words in the corpus


21
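As a concrete companion to the formula, the following Python sketch (mine, not the slides') computes perplexity from per-sentence log-probabilities; whether STOP symbols are counted in M is a convention left to the caller.

```python
import math

def perplexity(sentence_log2_probs, num_words):
    """perplexity = 2^{-l}, where l = (1/M) * sum_i log2 p(s_i) and M = num_words."""
    l = sum(sentence_log2_probs) / num_words
    return 2 ** (-l)

# Three test sentences with model probabilities 1e-12, 2e-8, 1e-15 and M = 12 words in total:
log2_probs = [math.log2(p) for p in (1e-12, 2e-8, 1e-15)]
print(perplexity(log2_probs, num_words=12))
```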
Evaluation: perplexity
• Say we have a vocabulary V, let N = |V| + 1, and take a trigram model with a uniform distribution: q(w | u, v) = 1/N

$$l = \frac{1}{M} \sum_{i=1}^{m} \log_2 p(s_i \mid s_1, \dots, s_{i-1}) = \frac{1}{M} \log_2 \frac{1}{N^M} = \log_2 \frac{1}{N}$$

$$\text{perplexity} = 2^{-l} = 2^{\log_2 N} = N$$

• Perplexity is the “effective” vocabulary size.


22
Typical values of perplexity

• When |V| = 50,000

• trigram model perplexity: 74 (<< 50000)

• bigram model: 137

• unigram model: 955

23
Evaluation

• Extrinsic evaluations: MT, speech, spelling


correction, …

24
History

• Shannon (1950) estimated the perplexity score that


humans get for printed English (we are good!)

• Test your perplexity

25
Estimating parameters
• Recall that the number of parameters for a trigram
model with |V| = 20,000 is 8 x 1012, leading to zeros
and undefined probabilities

$$q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$$

26
Over/under-fitting
• Given a corpus of length M:

• Trigram model:
$$q(w_i \mid w_{i-2}, w_{i-1}) = \frac{c(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})}$$
• Bigram model:
$$q(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
• Unigram model:
$$q(w_i) = \frac{c(w_i)}{M}$$
27
28
Radford et al, 2019
Linear interpolation
$$q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 \, q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 \, q(w_i \mid w_{i-1}) + \lambda_3 \, q(w_i)$$

$$\lambda_i \ge 0, \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1$$

• Combine the three models to get all benefits

30
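A minimal sketch of the interpolated estimate in Python, assuming trigram, bigram, and unigram estimators q3, q2, q1 (hypothetical names) are already defined:

```python
def q_li(w, u, v, q3, q2, q1, lambdas=(0.5, 0.3, 0.2)):
    """q_LI(w | u, v) = l1*q(w | u, v) + l2*q(w | v) + l3*q(w), with l_i >= 0 and l1+l2+l3 = 1."""
    l1, l2, l3 = lambdas
    assert min(lambdas) >= 0 and abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * q3(w, u, v) + l2 * q2(w, v) + l3 * q1(w)
```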
Linear interpolation
• Need to verify the parameters define a probability
distribution

$$\sum_{w \in V} q_{LI}(w \mid u, v) = \sum_{w \in V} \big[ \lambda_1 q(w \mid u, v) + \lambda_2 q(w \mid v) + \lambda_3 q(w) \big]$$

$$= \lambda_1 \sum_{w \in V} q(w \mid u, v) + \lambda_2 \sum_{w \in V} q(w \mid v) + \lambda_3 \sum_{w \in V} q(w) = \lambda_1 + \lambda_2 + \lambda_3 = 1$$
31
Estimating coefficients

• Use validation/development set (intro to ML!)

• Partition the training data into a training set (90%?) and a development set (10%?), and optimize the coefficients to minimize perplexity (the measure we care about!) on the development data

32
Linear interpolation
• The interpolation weights can depend on how often the history was seen. Partition histories by their count:

$$\Pi(w_{i-2}, w_{i-1}) = \begin{cases} 1 & \text{if } c(w_{i-2}, w_{i-1}) = 0 \\ 2 & \text{if } 1 \le c(w_{i-2}, w_{i-1}) \le 2 \\ 3 & \text{if } 3 \le c(w_{i-2}, w_{i-1}) \le 10 \\ 4 & \text{otherwise} \end{cases}$$

$$q_{LI}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1^{\Pi(w_{i-2}, w_{i-1})} q(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} q(w_i \mid w_{i-1}) + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} q(w_i)$$

$$\lambda_i^{\Pi(w_{i-2}, w_{i-1})} \ge 0, \qquad \lambda_1^{\Pi(w_{i-2}, w_{i-1})} + \lambda_2^{\Pi(w_{i-2}, w_{i-1})} + \lambda_3^{\Pi(w_{i-2}, w_{i-1})} = 1$$
Discounting methods
x c(x) q(wi | wi-1)
the 48
the, dog 15 15/48
the, woman 11 11/48
the, man 10 10/48
the, park 5 5/48
the, job 2 2/48
the, telescope 1 1/48
the, manual 1 1/48
the, afternoon 1 1/48
the, country 1 1/48
the, street 1 1/48

Low count bigrams have high estimates


35
Discounting methods
c*(x) = c(x) - 0.5

x c(x) c*(x) q(wi | wi-1)


the 48
the, dog 15 14.5 14.5/48
the, woman 11 10.5 10.5/48
the, man 10 9.5 9.5/48
the, park 5 4.5 4.5/48
the, job 2 1.5 1.5/48
the, telescope 1 0.5 0.5/48
the, manual 1 0.5 0.5/48
the, afternoon 1 0.5 0.5/48
the, country 1 0.5 0.5/48
the, street 1 0.5 0.5/48
36
Katz back-off
In a bigram model:

$$A(w_{i-1}) = \{w : c(w_{i-1}, w) > 0\}, \qquad B(w_{i-1}) = \{w : c(w_{i-1}, w) = 0\}$$

$$q_{BO}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{c^*(w_{i-1}, w_i)}{c(w_{i-1})} & \text{if } w_i \in A(w_{i-1}) \\[2ex] \alpha(w_{i-1}) \dfrac{q(w_i)}{\sum_{w \in B(w_{i-1})} q(w)} & \text{if } w_i \in B(w_{i-1}) \end{cases}$$

$$\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{c^*(w_{i-1}, w)}{c(w_{i-1})}$$
37
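Here is a rough Python sketch of the bigram back-off formula, assuming bigram counts and a unigram estimator q(w) are available; the discount of 0.5 and the data structures are illustrative, not the slides' code.

```python
def katz_bigram(w, prev, bigram_counts, unigram_q, vocab, discount=0.5):
    """q_BO(w | prev): discounted MLE for seen bigrams, scaled back-off to q(w) otherwise."""
    c_prev = sum(bigram_counts.get((prev, x), 0) for x in vocab)
    if c_prev == 0:            # unseen history: fall back to the unigram model
        return unigram_q(w)
    A = {x for x in vocab if bigram_counts.get((prev, x), 0) > 0}
    if w in A:                 # seen bigram: use the discounted count
        return (bigram_counts[(prev, w)] - discount) / c_prev
    # missing mass left by the discounts, spread over unseen words in proportion to q(w)
    alpha = 1.0 - sum((bigram_counts[(prev, x)] - discount) / c_prev for x in A)
    denom = sum(unigram_q(x) for x in vocab if x not in A)
    return alpha * unigram_q(w) / denom
```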
Katz back-off
In a trigram model:

$$A(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) > 0\}, \qquad B(w_{i-2}, w_{i-1}) = \{w : c(w_{i-2}, w_{i-1}, w) = 0\}$$

$$q_{BO}(w_i \mid w_{i-2}, w_{i-1}) = \begin{cases} \dfrac{c^*(w_{i-2}, w_{i-1}, w_i)}{c(w_{i-2}, w_{i-1})} & \text{if } w_i \in A(w_{i-2}, w_{i-1}) \\[2ex] \alpha(w_{i-2}, w_{i-1}) \dfrac{q_{BO}(w_i \mid w_{i-1})}{\sum_{w \in B(w_{i-2}, w_{i-1})} q_{BO}(w \mid w_{i-1})} & \text{if } w_i \in B(w_{i-2}, w_{i-1}) \end{cases}$$

$$\alpha(w_{i-2}, w_{i-1}) = 1 - \sum_{w \in A(w_{i-2}, w_{i-1})} \frac{c^*(w_{i-2}, w_{i-1}, w)}{c(w_{i-2}, w_{i-1})}$$
38
Advanced Smoothing

• Good-Turing

• Kneser-Ney
Advanced Smoothing
• Principles

• Good-Turing: take probability mass from things you have seen n times and spread it over things you have seen n−1 times (in particular, move mass from things you have seen once to things you have never seen)

• Kneser-Ney: the probability of a unigram is not its frequency in the data, but how frequently it appears after other things (“francisco” vs. “glasses”)
Unknown words
• What if we see a completely new word at test time?

• p(s) = p(S) = 0 —> infinite perplexity

• Solution: create a special token <unk>

• Fix a vocabulary V and replace any word in the training set that is not in V with <unk>

• Train

• At test time, use p(<unk>) for words not in V
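A small Python sketch of the <unk> preprocessing described above; the frequency threshold is an arbitrary illustrative choice.

```python
from collections import Counter

def build_vocab_and_unk(train_sentences, min_count=2):
    """Fix V from training frequencies and map everything else to <unk>."""
    counts = Counter(w for s in train_sentences for w in s.split())
    vocab = {w for w, c in counts.items() if c >= min_count}
    def preprocess(sentence):
        return [w if w in vocab else "<unk>" for w in sentence.split()]
    return vocab, preprocess

vocab, preprocess = build_vocab_and_unk(["the fan saw Beckham", "the fan saw the game"])
print(preprocess("the referee saw Beckham"))  # rare/unseen words become <unk>
```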
Summary and re-cap
• Sentence probability: decompose with the chain rule

• Use a Markov independence assumption

• Smooth estimates to do better on rare events

• Many attempts to improve LMs using syntax but not


easy

• More complex smoothing methods exist for handling


rare events (Kneser-Ney, Good-Turing…)

42
Problem
• Our estimator q(w | u, v) is based on a one-hot
representation. There is no relation between words.

• p(played | the, actress)

• p(played | the, actor)

• Can we use distributed representations?

Neural networks

43
Neural networks

• Feed-forward neural networks for language


modeling

• Recurrent neural networks for language modeling

• Vanishing/exploding gradients
Neural networks: history
• Proposed in the mid-20th century

• Criticism from Minsky (“Perceptrons”)

• Back-propagation appeared in the 1980s

• In the 1990s SVMs appeared and were more


successful

• Since the early 2010s they have shown great success in speech, vision, language, robotics…
45
Motivation 1

Can we learn representations from raw data?


46
Motivation 2

Can we learn non-linear decision boundaries


Actually very similar to motivation 1
47
A single neuron
• A neuron is a computational unit of the form $f_{w,b}(x) = f(w^\top x + b)$
• x: input vector

• w: weights vector

• b: bias

• f: activation function

48
intro to ml
A single neuron
• If f is sigmoid, a neuron is logistic regression

• Let x, y be a binary classification training example

$$p(y = 1 \mid x) = \frac{1}{1 + e^{-w^\top x - b}} = \sigma(w^\top x + b)$$

Provides a linear
decision boundary

49
intro to ml
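A minimal numpy sketch of such a neuron (logistic regression) with made-up weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """p(y = 1 | x) = sigmoid(w^T x + b)."""
    return sigmoid(w @ x + b)

x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.25, 1.0])
print(neuron(x, w, b=0.1))  # a probability between 0 and 1
```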
Single layer network
• Perform logistic regression
in parallel multiple times
(with different parameters)

$$y = \sigma(\hat{w}^\top \hat{x} + b) = \sigma(w^\top x), \qquad x, w \in \mathbb{R}^{d+1}, \; x_{d+1} = 1, \; w_{d+1} = b$$

50
Single layer network

• L1: input layer


• L2: hidden layer
• L3: output layer
• Output layer provides
the prediction
• Hidden layer is the
learned
representation

51
Multi-layer network
Repeat:

52
Matrix notation
$$a_1 = f(W_{11} x_1 + W_{12} x_2 + W_{13} x_3 + b_1)$$
$$a_2 = f(W_{21} x_1 + W_{22} x_2 + W_{23} x_3 + b_2)$$
$$a_3 = f(W_{31} x_1 + W_{32} x_2 + W_{33} x_3 + b_3)$$
$$z = W x + b, \qquad a = f(z)$$
$$x \in \mathbb{R}^4, \; z \in \mathbb{R}^3, \; W \in \mathbb{R}^{3 \times 4}$$

53
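The same layer computation in numpy (a sketch; the dimensions follow the slide and f is taken to be the sigmoid):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b, f=sigmoid):
    """a = f(z), z = W x + b."""
    return f(W @ x + b)

W = np.random.randn(3, 4)   # W in R^{3x4}
b = np.random.randn(3)
x = np.random.randn(4)      # x in R^4
a = layer(x, W, b)          # a, z in R^3
print(a.shape)
```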
Language modeling with
NNs

• Keep the Markov assumption

• Learn a probability q(u | v, w) with distributed


representations

54
Bengio et al, 2003

Language modeling with


NNs
$$e(w) = W_e \, w, \qquad W_e \in \mathbb{R}^{d \times |V|}, \; w \in \mathbb{R}^{|V| \times 1}$$
$$h(w_{i-2}, w_{i-1}) = \sigma(W_h [e(w_{i-2}); e(w_{i-1})]), \qquad W_h \in \mathbb{R}^{m \times 2d}$$
$$f(z) = \text{softmax}(W_o \, z), \qquad W_o \in \mathbb{R}^{|V| \times m}, \; z \in \mathbb{R}^{m \times 1}$$
$$p(w_i \mid w_{i-2}, w_{i-1}) = f(h(w_{i-2}, w_{i-1}))_i$$

(Figure: predicting p(laughed | the, dog) — the inputs “the” and “dog” are embedded as e(the), e(dog), combined into the hidden layer h(the, dog), and fed to the softmax output f(h(the, dog)).)

55
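A forward-pass sketch of this feed-forward LM in numpy; the parameter shapes follow the slide, σ is taken to be the logistic sigmoid, and the sizes and random initialization are placeholders.

```python
import numpy as np

V, d, m = 10, 8, 16                # vocabulary size, embedding size, hidden size
rng = np.random.default_rng(0)
W_e = rng.normal(size=(d, V))      # W_e in R^{d x |V|}
W_h = rng.normal(size=(m, 2 * d))  # W_h in R^{m x 2d}
W_o = rng.normal(size=(V, m))      # W_o in R^{|V| x m}

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ffnn_lm(i_prev2, i_prev1):
    """p(. | w_{i-2}, w_{i-1}) = softmax(W_o sigma(W_h [e(w_{i-2}); e(w_{i-1})]))."""
    e2, e1 = W_e @ one_hot(i_prev2, V), W_e @ one_hot(i_prev1, V)
    h = sigmoid(W_h @ np.concatenate([e2, e1]))
    return softmax(W_o @ h)        # a distribution over the vocabulary

p = ffnn_lm(3, 7)
print(p.shape, p.sum())            # (10,) 1.0
```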
Loss function
• Minimize negative log-likelihood

• When we learned word vectors, we were


basically training a language model

$$L(\theta) = -\sum_{i=1}^{T} \log p_\theta(w_i \mid w_{i-2}, w_{i-1})$$

56
Advantages
• If we see in the training set

The cat is walking in the bedroom

• We can hope to learn

A dog was running through a room

• We can use n-grams with n > 3 and pay only a


linear price

• Compared to what?
57
Parameter estimation
• We train with SGD

• How to efficiently compute the gradients?

• backpropagation (Rumelhart, Hinton and


Williams, 1986)

• Proof in intro to ML, will repeat algorithm only


and give an example here

58
Backpropagation
• Notation:
  $W_t$ : weight matrix at the input of layer t
  $z_t$ : output vector at layer t
  $x = z_0$ : input vector
  $y$ : gold scalar
  $\hat{y} = z_L$ : predicted scalar
  $l(y, \hat{y})$ : loss function
  $v_t = W_t \cdot z_{t-1}$ : pre-activations
  $\delta_t = \dfrac{\partial l(y, \hat{y})}{\partial z_t}$ : gradient vector
59
Backpropagation
• Run the network forward to obtain all values $v_t, z_t$

• Base:
$$\delta_L = l'(y, z_L)$$

• Recursion:
$$\delta_t = W_{t+1}^\top \left( \sigma'(v_{t+1}) \odot \delta_{t+1} \right), \qquad \sigma'(v_{t+1}), \delta_{t+1} \in \mathbb{R}^{d_{t+1} \times 1}, \; W_{t+1} \in \mathbb{R}^{d_{t+1} \times d_t}$$

• Gradients:
$$\frac{\partial l}{\partial W_t} = \left( \delta_t \odot \sigma'(v_t) \right) z_{t-1}^\top$$

60
Bigram LM example
• Forward pass:

$$z_0 \in \mathbb{R}^{|V| \times 1} \text{ : one-hot input vector}$$
$$z_1 = W_1 \cdot z_0, \qquad W_1 \in \mathbb{R}^{d_1 \times |V|}, \; z_1 \in \mathbb{R}^{d_1 \times 1}$$
$$z_2 = \sigma(W_2 \cdot z_1), \qquad W_2 \in \mathbb{R}^{d_2 \times d_1}, \; z_2 \in \mathbb{R}^{d_2 \times 1}$$
$$z_3 = \text{softmax}(W_3 \cdot z_2), \qquad W_3 \in \mathbb{R}^{|V| \times d_2}, \; z_3 \in \mathbb{R}^{|V| \times 1}$$
$$l(y, z_3) = -\sum_i y^{(i)} \log z_3^{(i)}$$
61
Bigram LM example
• Backward pass:

$$\sigma'(v_3) \odot \delta_3 = z_3 - y$$
$$\delta_2 = W_3^\top \left( \sigma'(v_3) \odot \delta_3 \right) = W_3^\top (z_3 - y)$$
$$\delta_1 = W_2^\top \left( \sigma'(v_2) \odot \delta_2 \right) = W_2^\top \left( z_2 \odot (1 - z_2) \odot \delta_2 \right)$$
$$\frac{\partial l}{\partial W_3} = \left( \delta_3 \odot \sigma'(v_3) \right) z_2^\top = (z_3 - y) z_2^\top$$
$$\frac{\partial l}{\partial W_2} = \left( \delta_2 \odot \sigma'(v_2) \right) z_1^\top = \left( \delta_2 \odot z_2 \odot (1 - z_2) \right) z_1^\top$$
$$\frac{\partial l}{\partial W_1} = \left( \delta_1 \odot \sigma'(v_1) \right) z_0^\top = \delta_1 z_0^\top$$
62
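The following numpy sketch implements the three-layer bigram LM above and its backward pass exactly as in the formulas; it is my illustration with small made-up dimensions, and σ is the logistic sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d1, d2 = 6, 5, 4
W1 = rng.normal(size=(d1, V))
W2 = rng.normal(size=(d2, d1))
W3 = rng.normal(size=(V, d2))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# forward pass
z0 = np.zeros(V); z0[2] = 1.0            # one-hot input word
y  = np.zeros(V); y[4] = 1.0             # one-hot gold next word
z1 = W1 @ z0
z2 = sigmoid(W2 @ z1)
z3 = softmax(W3 @ z2)
loss = -np.sum(y * np.log(z3))

# backward pass (the formulas from the slide)
delta3_term = z3 - y                     # sigma'(v3) ⊙ delta3 for softmax + cross-entropy
delta2 = W3.T @ delta3_term
delta1 = W2.T @ (z2 * (1 - z2) * delta2)
dW3 = np.outer(delta3_term, z2)          # (z3 - y) z2^T
dW2 = np.outer(delta2 * z2 * (1 - z2), z1)
dW1 = np.outer(delta1, z0)               # layer 1 is linear, so sigma'(v1) = 1
print(loss, dW1.shape, dW2.shape, dW3.shape)
```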
Summary
• Neural nets can improve language models:

  • better scalability for larger N

  • use of word similarity

  • complex decision boundaries

• Training through backpropagation

But we still have a Markov assumption:

He is from France, so it makes sense that his first language is…
63
Recurrent neural networks

(Figure: an RNN unrolled over the input “the dog”, predicting “the dog laughed”.)

Input: $w_1, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_T$, with $w_i \in \mathbb{R}^{|V|}$
Model:
$$x_t = W^{(e)} \cdot w_t, \qquad W^{(e)} \in \mathbb{R}^{d \times |V|}$$
$$h_t = \sigma(W^{(hh)} \cdot h_{t-1} + W^{(hx)} \cdot x_t), \qquad W^{(hh)} \in \mathbb{R}^{D_h \times D_h}, \; W^{(hx)} \in \mathbb{R}^{D_h \times d}$$
$$\hat{y}_t = \text{softmax}(W^{(s)} \cdot h_t), \qquad W^{(s)} \in \mathbb{R}^{|V| \times D_h}$$

64
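A minimal numpy sketch of one step of this RNN LM; the shapes follow the slide and the random parameters are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, Dh = 10, 6, 8
W_e  = rng.normal(size=(d, V))     # W^(e)  in R^{d x |V|}
W_hh = rng.normal(size=(Dh, Dh))   # W^(hh) in R^{Dh x Dh}
W_hx = rng.normal(size=(Dh, d))    # W^(hx) in R^{Dh x d}
W_s  = rng.normal(size=(V, Dh))    # W^(s)  in R^{|V| x Dh}

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def rnn_step(w_t, h_prev):
    """x_t = W^(e) w_t, h_t = sigma(W^(hh) h_{t-1} + W^(hx) x_t), y_hat_t = softmax(W^(s) h_t)."""
    x_t = W_e @ w_t
    h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)
    y_hat_t = softmax(W_s @ h_t)
    return h_t, y_hat_t

h = np.zeros(Dh)
for idx in [1, 4, 7]:              # a short word-index sequence
    w = np.zeros(V); w[idx] = 1.0
    h, y_hat = rnn_step(w, h)
print(y_hat.shape)                 # distribution over the next word
```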
Recurrent neural networks
• Can exploit long-range dependencies

• Each layer has the same weights (weight sharing/tying)

• What is the loss function?

$$J(\theta) = \sum_{t=1}^{T} CE(y_t, \hat{y}_t)$$
Recurrent neural networks
• Component:

• model? RNN (saw before)

• Loss? sum of cross-entropy over time

• Optimization? SGD

• Gradient computation? back-propagation


through time
Class 3 Recap
• N-gram based language models

• Linear interpolation

• Katz back-off

• Neural language models

• limited horizon - feed-forward neural network

• unlimited - Recurrent neural network


(The slides on bucketed linear interpolation, Katz back-off, the Bengio et al. (2003) feed-forward LM, and the recurrent neural LM are shown again here as a recap; see the earlier sections.)
Training
• Objective:
$$J(\theta) = \sum_{t=1}^{T} CE(y_t, \hat{y}_t)$$

• Optimization: SGD

• Intrinsic evaluation: perplexity

Comparison
Model Similarity Unlimited horizon

n-gram No No

FFNN LM Yes No

RNN LM Yes Yes


Class 4
Training RNNs

• Capturing long-range dependencies with RNNs is


difficult

• Vanishing/exploding gradients

• Small changes in hidden layer value in step k


cause huge/minuscule changes to values in
hidden layer t for t >> k
Explanation
Consider a simple linear RNN with no input:
$$h_t = W \cdot h_{t-1}$$
$$h_t = W^t \cdot h_0 = (Q \Lambda Q^{-1})^t \cdot h_0 = Q \Lambda^t Q^{-1} \cdot h_0$$
where $W = Q \Lambda Q^{-1}$ is an eigendecomposition

• Some eigenvalues will explode and some will shrink to zero
• Stretch the input in the direction of the eigenvector with the largest eigenvalue
Explanation
$$h_t = W^{(hh)} \sigma(h_{t-1}) + W^{(hx)} x_t + b$$
$$L = \sum_{t=1}^{T} L_t = \sum_{t=1}^{T} L(h_t)$$
Explanation
$$\frac{\partial L}{\partial \theta} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial \theta}$$
$$\frac{\partial L_t}{\partial \theta} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial^+ h_k}{\partial \theta}$$
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} W^{(hh)} \, \text{diag}\!\left( \sigma'(h_{i-1}) \right)$$
Explanation
Assume $\|\text{diag}(\sigma'(\cdot))\| \le \gamma$ and $\lambda_1 < \frac{1}{\gamma}$,
where $\lambda_1$ is the absolute value of the largest eigenvalue of $W^{(hh)}$.
$$\forall k \quad \left\| \frac{\partial h_k}{\partial h_{k-1}} \right\| \le \|W^{(hh)}\| \cdot \|\text{diag}(\sigma'(h_{k-1}))\| < \frac{1}{\gamma} \cdot \gamma = 1$$
Let $\eta$ be a constant such that $\forall k \; \left\| \frac{\partial h_k}{\partial h_{k-1}} \right\| \le \eta < 1$. Then
$$\left\| \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right\| \le \eta^{\,t-k} \left\| \frac{\partial L_t}{\partial h_t} \right\|$$
so long-term influence vanishes to zero.
Solutions
• Exploding gradients: gradient clipping (see the sketch below)

  • Re-normalize the gradient so its norm is at most C (this is no longer the true gradient in size, but it is in direction)

  • Exploding gradients are easy to detect

• Vanishing gradients: the problem is with the model!

  • Change it (LSTMs, GRUs)
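A sketch of gradient clipping by global norm in numpy (my illustration; the threshold C is a hyperparameter):

```python
import numpy as np

def clip_gradients(grads, C):
    """Rescale the full gradient so its global L2 norm is at most C (same direction, smaller size)."""
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if norm > C:
        grads = [g * (C / norm) for g in grads]
    return grads
```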


Pascanu et al, 2013

Illustration
LSTMs and GRUs

• Bottom line: use vector addition and not matrix-


vector multiplication. Allows for better propagation
of gradients to the past
Cho et al., 2014

Gated Recurrent Unit (GRU)


• Main insight: add learnable gates to the recurrent
unit that control the flow of information from the
past to the present

• Vanilla RNN:
$$h_t = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)$$

• Update and reset gates:
$$z_t = \sigma(W^{(z)} h_{t-1} + U^{(z)} x_t)$$
$$r_t = \sigma(W^{(r)} h_{t-1} + U^{(r)} x_t)$$
Gated Recurrent Unit (GRU)
$$z_t = \sigma(W^{(z)} x_t + U^{(z)} h_{t-1})$$
$$r_t = \sigma(W^{(r)} x_t + U^{(r)} h_{t-1})$$
$$\tilde{h}_t = \tanh(W x_t + r_t \odot U h_{t-1})$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$$
• Use the gates to control information flow

• If z = 1, we simply copy the past and ignore the present (note that the gradient will be 1)

• If z = 0, we have an RNN-like update, but we are also free to reset some of the past units; if r = 0, we have no memory of the past

• The + in the last equation is crucial
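A single GRU step in numpy, following the equations above; the randomly initialized matrices are placeholders and ⊙ is elementwise multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
d, Dh = 6, 8
Wz, Uz = rng.normal(size=(Dh, d)), rng.normal(size=(Dh, Dh))
Wr, Ur = rng.normal(size=(Dh, d)), rng.normal(size=(Dh, Dh))
W,  U  = rng.normal(size=(Dh, d)), rng.normal(size=(Dh, Dh))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev):
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_tilde = np.tanh(W @ x_t + r_t * (U @ h_prev))   # candidate state
    return z_t * h_prev + (1 - z_t) * h_tilde         # the crucial additive combination

h = gru_step(np.ones(d), np.zeros(Dh))
print(h.shape)
```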


Illustration

(Figure: GRU cell — inputs h_{t-1} and x_t, reset gate r_t, update gate z_t, candidate state h̃_t, output h_t.)
Hochreiter and Schmidhuber, 1997

Long short term memory


(LSTMs)
• z has been split into i and f

• There is no r

• There is a new gate o that distinguishes between the memory and the output

• c is like h in GRUs

• h is the output

$$i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1})$$
$$f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1})$$
$$o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1})$$
$$\tilde{c}_t = \tanh(W^{(c)} x_t + U^{(c)} h_{t-1})$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
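And the corresponding LSTM step, again as a numpy sketch with placeholder parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, Dh = 6, 8
gate = lambda: (rng.normal(size=(Dh, d)), rng.normal(size=(Dh, Dh)))
(Wi, Ui), (Wf, Uf), (Wo, Uo), (Wc, Uc) = gate(), gate(), gate(), gate()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev)       # input gate
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev)       # forget gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev)       # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)   # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde          # additive memory update
    h_t = o_t * np.tanh(c_t)                    # exposed output
    return h_t, c_t

h, c = lstm_step(np.ones(d), np.zeros(Dh), np.zeros(Dh))
print(h.shape, c.shape)
```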
Illustration
(Figure: LSTM cell — inputs c_{t-1}, h_{t-1}, x_t, gates i_t, f_t, o_t, candidate c̃_t, memory c_t, output h_t.)
Illustration

Chris Olah’s blog


More GRU intuition from
Stanford

• Go over sequence of slides from Chris Manning


Jozefowicz et al, 2016

Results
Summary
• Language modeling is a fundamental NLP task used in machine
translation, spelling correction, speech recognition, etc.

• Traditional models use n-gram counts and smoothing

• Feed-forward models take word similarity into account to generalize better

• Recurrent models can potentially learn to exploit long-range


interactions

• Neural models dramatically reduced perplexity

• Recurrent networks are now used in many other NLP tasks


(bidirectional RNNs, deep RNNs)

93
