Lecture 7 - Conditional Language Modeling
Language Modeling
Chris Dyer
Review: Unconditional LMs
A language model assigns probabilities to sequences of words, w = (w_1, w_2, ..., w_ℓ).

We saw that it is helpful to decompose this probability using the chain rule, as follows:

p(w) = p(w_1) × p(w_2 | w_1) × p(w_3 | w_1, w_2) × ... × p(w_ℓ | w_1, ..., w_{ℓ-1})
     = ∏_{t=1}^{|w|} p(w_t | w_1, ..., w_{t-1})
[Figure: an RNN language model. The observed context words w_1 ... w_4 (as word-embedding vectors) drive the RNN hidden states h_1 ... h_4 starting from h_0; a softmax produces a vector of length |vocab|, the distribution over the random variable: p(W_5 | w_1, w_2, w_3, w_4).]
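To make the figure concrete, here is a minimal numpy sketch of this computation (all names, sizes, and the tanh recurrence are illustrative assumptions, not the lecture's exact parameterisation): the hidden state summarises the history, and a softmax over the vocabulary gives p(W_t | w_1, ..., w_{t-1}).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8                      # toy vocabulary size and hidden size (assumed)
E = rng.normal(size=(V, d))       # word embeddings
W = rng.normal(size=(d, d))       # recurrent weights
U = rng.normal(size=(d, d))       # input weights
P = rng.normal(size=(V, d))       # output projection (length-|vocab| scores)
b = np.zeros(V)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def rnn_lm_step(h_prev, w_prev):
    """Consume the previous word, return the new state and p(W_t | w_1, ..., w_{t-1})."""
    h = np.tanh(W @ h_prev + U @ E[w_prev])
    p = softmax(P @ h + b)
    return h, p

h = np.zeros(d)                   # h_0
for w in [1, 4, 2, 7]:            # toy word ids standing in for w_1 ... w_4
    h, p = rnn_lm_step(h, w)
# p now plays the role of p(W_5 | w_1, w_2, w_3, w_4) from the figure
```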
Conditional LMs
A conditional language model assigns probabilities to sequences of words, w = (w_1, w_2, ..., w_ℓ), given some conditioning context, x:

p(w | x) = ∏_{t=1}^{|w|} p(w_t | x, w_1, ..., w_{t-1})
Evaluating conditional LMs
How good is our conditional language model?
• Human evaluation: hard to implement, easy to interpret.
Lecture overview
The rest of this lecture will look at “encoder-decoder” models that learn a function mapping x into a fixed-size vector and then use a language model to “decode” that vector into a sequence of words, w.
[Figure: the encoder reads the input tokens x_1 x_2 x_3 x_4 x_5 x_6 and maps them into a fixed-size vector.]
* Kalchbrenner et al. (2014). A convolutional neural network for modelling sentences. In Proc. ACL.
K&B 2013: RNN Decoder

Encoder (embed the source sentence x into a single vector):
  c = embed(x)
  s = Vc

Recurrent decoder (each update combines the recurrent connection from h_{t-1}, the embedding of the previous word w_{t-1}, the source encoding s, and a learnt bias b):
  h_t = g(W[h_{t-1}; w_{t-1}] + s + b)
  u_t = P h_t + b
  p(W_t | x, w_{<t}) = softmax(u_t)
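A minimal numpy sketch of these equations (sizes and names such as encode/decoder_step are illustrative assumptions, not the paper's implementation): the source embedding s = Vc enters every recurrent update, and the softmax turns u_t = P h_t + b into a distribution over the target vocabulary.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d, d_src = 12, 6, 6          # toy sizes (assumed)
E = rng.normal(size=(vocab, d))     # target word embeddings
W = rng.normal(size=(d, 2 * d))     # acts on the concatenation [h_{t-1}; w_{t-1}]
V_mat = rng.normal(size=(d, d_src)) # maps the source embedding c to s = Vc
b = np.zeros(d)                     # learnt bias of the recurrence
P = rng.normal(size=(vocab, d))     # output projection
b_out = np.zeros(vocab)

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def encode(c):
    """Encoder: s = Vc, where c = embed(x) is a fixed-size embedding of the source."""
    return V_mat @ c

def decoder_step(h_prev, w_prev_id, s):
    """h_t = g(W[h_{t-1}; w_{t-1}] + s + b);  p(W_t | x, w_{<t}) = softmax(P h_t + b_out)."""
    h = np.tanh(W @ np.concatenate([h_prev, E[w_prev_id]]) + s + b)
    return h, softmax(P @ h + b_out)

c = rng.normal(size=d_src)          # stand-in for embed(x)
s = encode(c)
h, p = decoder_step(np.zeros(d), 0, s)   # word id 0 plays the role of <s>
```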
K&B 2013: RNN Decoder

Generating one word at a time, sampling each word from the softmax and conditioning on the source encoding s and everything generated so far:

p(tom | s, <s>)
  × p(likes | s, <s>, tom)
  × p(beer | s, <s>, tom, likes)
  × p(</s> | s, <s>, tom, likes, beer)

[Figure: the decoder unrolled over four steps. Starting from h_0 and the start symbol <s>, each step computes h_t, applies a softmax to obtain p̂_t, and samples the next word (tom, likes, beer, </s>), which is fed back in as the next input.]
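The loop implied by these frames is ancestral sampling: draw w_t ~ p(W_t | x, w_{<t}), feed it back in, and accumulate the log probability of the whole sequence. A hedged sketch with a self-contained dummy step function (a real decoder step, like the one sketched above, would go in its place):

```python
import numpy as np

def sample_sequence(step_fn, h0, s, bos_id, eos_id, max_len=20, rng=None):
    """Ancestral sampling: w_t ~ p(W_t | x, w_{<t}); returns the words and the total log prob."""
    rng = rng or np.random.default_rng()
    h, w_prev, words, logprob = h0, bos_id, [], 0.0
    for _ in range(max_len):
        h, p = step_fn(h, w_prev, s)       # p is p(W_t | x, w_{<t})
        w = int(rng.choice(len(p), p=p))   # draw the next word
        logprob += float(np.log(p[w]))
        if w == eos_id:                    # stop once </s> is generated
            break
        words.append(w)
        w_prev = w
    return words, logprob

def dummy_step(h, w_prev, s):
    """Toy stand-in for the decoder: a fixed uniform distribution over 5 words."""
    return h, np.full(5, 0.2)

words, lp = sample_sequence(dummy_step, h0=None, s=None, bos_id=0, eos_id=4)
```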
Sutskever et al. (2014)

LSTM encoder ((c_0, h_0) are parameters):
  (c_i, h_i) = LSTM(x_i, c_{i-1}, h_{i-1})

LSTM decoder (w_0 = <s>):
  (c_{t+ℓ}, h_{t+ℓ}) = LSTM(w_{t-1}, c_{t+ℓ-1}, h_{t+ℓ-1})
  u_t = P h_{t+ℓ} + b
  p(W_t | x, w_{<t}) = softmax(u_t)
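A minimal numpy sketch of this encoder-decoder: a standard LSTM cell reads the source x_1 ... x_ℓ, its final (c, h) initialises the decoder, and each decoder step consumes the previous target word. For brevity the sketch shares one set of LSTM weights between encoder and decoder, whereas the paper uses separate ones; all sizes and names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, vocab = 8, 12                               # toy sizes (assumed)
E_src = rng.normal(size=(vocab, d))            # source embeddings
E_tgt = rng.normal(size=(vocab, d))            # target embeddings
Wx = 0.1 * rng.normal(size=(4 * d, d))         # input weights for the i, f, o, g gates
Wh = 0.1 * rng.normal(size=(4 * d, d))         # recurrent weights
bg = np.zeros(4 * d)
P = 0.1 * rng.normal(size=(vocab, d))          # output projection
b = np.zeros(vocab)
c0, h0 = np.zeros(d), np.zeros(d)              # (c_0, h_0): parameters in the paper, zeros here

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def lstm(x, c_prev, h_prev):
    """Standard LSTM cell: (c_i, h_i) = LSTM(x_i, c_{i-1}, h_{i-1})."""
    i, f, o, g = np.split(Wx @ x + Wh @ h_prev + bg, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    return c, sigmoid(o) * np.tanh(c)

# Encoder: read the source sentence x_1 ... x_l.
c, h = c0, h0
for i in [3, 5, 7]:                            # toy source word ids
    c, h = lstm(E_src[i], c, h)

# Decoder: continue the recurrence from the encoder's final state, with w_0 = <s>.
w_prev = 0                                     # id 0 plays the role of <s>
for _ in range(3):
    c, h = lstm(E_tgt[w_prev], c, h)
    p = softmax(P @ h + b)                     # p(W_t | x, w_{<t})
    w_prev = int(p.argmax())                   # greedy choice, just for the sketch
```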
Sutskever et al. (2014)

[Figure: the trained model decoding from the START symbol, extending the output one word at a time (“Beginnings”, then “Beginnings are”, ...).]
Decoder (an ensemble of J models run in parallel, averaging their scores):
  (c^{(j)}_{t+ℓ}, h^{(j)}_{t+ℓ}) = LSTM^{(j)}(w_{t-1}, c^{(j)}_{t+ℓ-1}, h^{(j)}_{t+ℓ-1})
  u^{(j)}_t = P^{(j)} h^{(j)}_t + b^{(j)}
  u_t = (1/J) Σ_{j'=1}^{J} u^{(j')}_t
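In code, this amounts to stepping all J decoders in lockstep and averaging their score vectors before the softmax. A small sketch (the two “models” below are fixed stand-in functions; a real ensemble would use J independently trained decoders):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def ensemble_step(models, states, w_prev):
    """Advance J decoders one step and average their scores: u_t = (1/J) sum_j u_t^{(j)}."""
    new_states, scores = [], []
    for model, state in zip(models, states):
        state, u = model(state, w_prev)   # each model returns (new state, score vector over vocab)
        new_states.append(state)
        scores.append(u)
    return new_states, softmax(np.mean(scores, axis=0))

# Toy stand-ins for J = 2 trained decoders over a 4-word vocabulary.
def model_a(state, w_prev):
    return state, np.array([1.0, 0.5, 0.1, 0.0])

def model_b(state, w_prev):
    return state, np.array([0.8, 0.9, 0.0, 0.1])

states, p = ensemble_step([model_a, model_b], [None, None], w_prev=0)
```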
A word about decoding

A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of the top b hypotheses.

[Figure: a beam search built up over decoding steps w_0 w_1 w_2 w_3. Starting from <s> (logprob = 0), the beam keeps “beer” (logprob = -1.82) and “I” (logprob = -2.11) at w_1; expanding these yields partial hypotheses such as “beer I” (logprob = -5.80), “I beer” (logprob = -8.66) and “I drink” (logprob = -2.87), and at w_3 “beer I like” (logprob = -7.31), “I drink beer” (logprob = -3.04) and “I drink wine” (logprob = -5.12).]
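A hedged sketch of that procedure: at each step, extend every hypothesis on the beam with every possible next word, score by total log probability, and keep only the b best. The scoring function below is a made-up stand-in; a real decoder would return log p(W_t | x, w_{<t}) from its current hidden state.

```python
import numpy as np

def beam_search(step_logprobs, bos_id, eos_id, beam_size=2, max_len=10):
    """Keep the top `beam_size` partial hypotheses (word ids, total log prob) at each step."""
    beam = [([bos_id], 0.0)]                      # start from <s> with logprob = 0
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, lp in beam:
            for w, w_lp in enumerate(step_logprobs(words)):   # extend with every next word
                candidates.append((words + [w], lp + float(w_lp)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for words, lp in candidates:
            (finished if words[-1] == eos_id else beam).append((words, lp))
            if len(beam) == beam_size:            # beam is full: prune the rest
                break
        if not beam:                              # every surviving hypothesis has ended
            break
    return max(finished + beam, key=lambda c: c[1])

def toy_logprobs(prefix):
    """Stand-in for the model: fixed log probabilities over a 4-word vocabulary."""
    p = np.array([0.1, 0.2, 0.3, 0.4])
    return np.log(p / p.sum())

best_words, best_lp = beam_search(toy_logprobs, bos_id=0, eos_id=3, beam_size=2)
```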
Sutskever et al. (2014): Tricks

Decoding tricks shown above: use an ensemble of J decoders, averaging their predicted scores, and replace greedy decoding with a beam search over the top b hypotheses.
Image caption generation