Lecture 7 - Conditional Language Modeling

This document discusses conditional language models. It explains that conditional LMs assign probabilities to word sequences given some context or input. It decomposes this probability using the chain rule and discusses how RNNs can be used to model the probability of the next word given the previous words and the conditioning context. It also covers challenges such as searching for the most likely output, the data available for training conditional LMs, and how to evaluate them.

Conditional Language Modeling

Chris Dyer
Review: Unconditional LMs
A language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$.

We saw that it is helpful to decompose this probability using the chain rule, as follows:

$$p(w) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times \cdots \times p(w_\ell \mid w_1, \ldots, w_{\ell-1}) = \prod_{t=1}^{|w|} p(w_t \mid w_1, \ldots, w_{t-1})$$

This reduces the language modeling problem to modeling the probability of the next word, given the history of preceding words.
Unconditional LMs with RNNs

[Figure: an unrolled RNN language model. Each hidden state $h_t$ is computed from the previous hidden state and the embedding of the observed context word $w_t$; a softmax over the final hidden state gives the distribution over the next-word random variable, $p(W_5 \mid w_1, w_2, w_3, w_4)$, a vector of length |vocab|.]
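To make the figure concrete, here is a minimal sketch of such an unconditional RNN LM in PyTorch; the class name `RNNLM`, the layer sizes, and the plain tanh RNN cell are illustrative assumptions, not details from the lecture.

```python
# A minimal sketch (illustrative, not the lecture's code) of an unconditional RNN LM:
# h_t is computed from h_{t-1} and the embedding of w_t; softmax(P h_t + b) gives
# the distribution over the next word.
import torch
import torch.nn as nn

class RNNLM(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)  # simple tanh cell
        self.proj = nn.Linear(hid_dim, vocab_size)              # length-|vocab| logits

    def forward(self, word_ids):
        # word_ids: (batch, time) observed context words w_1 .. w_T
        hidden, _ = self.rnn(self.embed(word_ids))   # h_1 .. h_T
        return self.proj(hidden)                     # logits for p(W_{t+1} | w_1 .. w_t)

# toy usage: the distribution p(W_5 | w_1, w_2, w_3, w_4)
model = RNNLM(vocab_size=100)
logits = model(torch.tensor([[7, 3, 42, 9]]))
p_next = torch.softmax(logits[0, -1], dim=-1)
```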
Conditional LMs
A conditional language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$, given some conditioning context, $x$.

As with unconditional models, it is again helpful to use the chain rule to decompose this probability:

$$p(w \mid x) = \prod_{t=1}^{\ell} p(w_t \mid x, w_1, w_2, \ldots, w_{t-1})$$

What is the probability of the next word, given the history of previously generated words and the conditioning context $x$?
Conditional LMs
x “input”                            w “text output”
An author                            A document written by that author
A topic label                        An article about that topic
{SPAM, NOT_SPAM}                     An email
A sentence in French                 Its English translation
A sentence in English                Its French translation
A sentence in English                Its Chinese translation
An image                             A text description of the image
A document                           Its summary
A document                           Its translation
Meteorological measurements          A weather report
An acoustic signal                   A transcription of the speech
Conversational history + database    Dialogue system response
A question + a document              Its answer
A question + an image                Its answer
[The table above is repeated with handwritten annotations, apparently marking which of these tasks are covered this week and which in the following weeks.]
Data for training conditional LMs
To train conditional language models, we need paired samples, $\{(x_i, w_i)\}_{i=1}^{N}$.

Data availability varies. It's easy to think of tasks that could be solved by conditional language models, but for which the data just doesn't exist.

Relatively large amounts of data are available for: translation, summarisation, caption generation, speech recognition.
Algorithmic challenges
We often want to find the most likely $w$ given some $x$. This is unfortunately generally an intractable problem:

$$w^* = \arg\max_{w} p(w \mid x)$$

We therefore approximate it using a beam search, or with Monte Carlo methods, since sampling $w^{(i)} \sim p(w \mid x)$ is often computationally easy.

Improving search/inference is an open research question.

• How can we search more effectively?
• Can we get guarantees that we have found the max?
• Can we limit the model a bit to make search easier?
Evaluating conditional LMs
How good is our conditional language model?

These are language models, so we can use cross-entropy or perplexity. (okay to implement, hard to interpret)

Task-specific evaluation. Compare the model's most likely output to a human-generated reference output using a task-specific evaluation metric $L$:

$$w^* = \arg\max_{w} p(w \mid x), \qquad L(w^*, w_{\text{ref}})$$

Examples of $L$: BLEU, METEOR, WER, ROUGE. (easy to implement, okay to interpret)

Human evaluation. (hard to implement, easy to interpret)
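As a concrete illustration of the intrinsic metrics above, here is a minimal sketch of computing per-word cross-entropy and perplexity from a model's token log-probabilities; the function name and the toy numbers are my own.

```python
# A minimal sketch: perplexity is the exponentiated average negative log-probability
# that the conditional LM assigns to the reference tokens of a held-out set.
import math

def cross_entropy_and_perplexity(token_logprobs):
    """token_logprobs: per-token log p(w_t | x, w_<t) values (natural log)."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)   # cross-entropy, nats per word
    return avg_nll, math.exp(avg_nll)                      # perplexity = exp(cross-entropy)

# toy usage with made-up log-probabilities
ce, ppl = cross_entropy_and_perplexity([-2.3, -0.7, -1.2, -0.1])
print(f"cross-entropy = {ce:.3f} nats/word, perplexity = {ppl:.2f}")
```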
Lecture overview
The rest of this lecture will look at “encoder-decoder” models that learn a function that maps $x$ into a fixed-size vector and then use a language model to “decode” that vector into a sequence of words, $w$.

Example (translation):
  x: Kunst kann nicht gelehrt werden…
  w: Artistry can’t be taught…

Example (caption generation; $x$ is an image, not shown here):
  w: A dog is playing on the beach.
Lecture overview
• Two questions

  • How do we encode $x$ as a fixed-size vector, $c$?
    - Problem (or at least modality) specific
    - Think about assumptions

  • How do we condition on $c$ in the decoding model?
    - Less problem specific
    - We will review solutions/architectures
Kalchbrenner and Blunsom 2013
Encoder:
$$c = \text{embed}(x), \qquad s = Vc$$

Recurrent decoder:
$$h_t = g(W[h_{t-1}; w_{t-1}] + s + b)$$
$$u_t = P h_t + b'$$
$$p(W_t \mid x, w_{<t}) = \text{softmax}(u_t)$$

Here $[h_{t-1}; w_{t-1}]$ concatenates the recurrent connection with the embedding of $w_{t-1}$, $s$ injects the source sentence, and $b$, $b'$ are learnt biases.

Recall the unconditional RNN:
$$h_t = g(W[h_{t-1}; w_{t-1}] + b)$$
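A minimal sketch of this conditioned recurrent decoder follows, assuming my own layer sizes and the illustrative name `ConditionedRNNDecoder`: the point is simply that the transformed source vector $s = Vc$ is added into every hidden-state update.

```python
# A minimal sketch (not the paper's code) of the K&B-style decoder:
# h_t = g(W[h_{t-1}; w_{t-1}] + s + b), with s = Vc injected at every step.
import torch
import torch.nn as nn

class ConditionedRNNDecoder(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64, ctx_dim=48):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.W = nn.Linear(hid_dim + emb_dim, hid_dim)    # W[h_{t-1}; w_{t-1}] + b
        self.V = nn.Linear(ctx_dim, hid_dim, bias=False)  # s = V c
        self.P = nn.Linear(hid_dim, vocab_size)           # u_t = P h_t + b'

    def step(self, prev_word, h_prev, s):
        w_emb = self.embed(prev_word)                                  # embedding of w_{t-1}
        h = torch.tanh(self.W(torch.cat([h_prev, w_emb], dim=-1)) + s)
        return h, self.P(h)                                            # new state, logits u_t

# toy usage: condition on a (here random) source embedding c = embed(x)
dec = ConditionedRNNDecoder(vocab_size=100)
s = dec.V(torch.randn(1, 48))                  # injected at every decoder step
h = torch.zeros(1, 64)
h, logits = dec.step(torch.tensor([1]), h, s)  # 1 stands in for <s>
p_w1 = torch.softmax(logits, dim=-1)
```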
K&B 2013: Encoder
How should we define $c = \text{embed}(x)$?

The simplest model possible:
$$c = \sum_i x_i$$

[Figure: the embeddings $x_1, \ldots, x_6$ of the six source words are summed into a single vector $c$.]

What do you think of this model?
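For reference, a tiny sketch of this additive encoder (the toy vocabulary size and word ids are my own): it is a bag of word embeddings and discards word order entirely.

```python
# A minimal sketch of the simplest encoder: c = sum_i x_i over source word embeddings.
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=100, embedding_dim=48)
source_ids = torch.tensor([5, 17, 9, 23, 3, 61])   # x_1 .. x_6 (toy ids)
c = embed(source_ids).sum(dim=0)                   # fixed-size vector, shape (48,)
```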


K&B 2013: CSM Encoder
How should we define c = embed(x)?
Convolutional sentence model (CSM)
K&B 2013: CSM Encoder

• Good
  • Convolutions learn interactions among features in a local context
  • By stacking them, longer range dependencies can be learnt
  • Deep ConvNets have a branching structure similar to trees, but no parser is required

• Bad
  • Sentences have different lengths, and need different-depth trees; convnets are not usually so dynamic, but see*

* Kalchbrenner et al. (2014). A Convolutional Neural Network for Modelling Sentences. In Proc. ACL.
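Below is a minimal sketch of a convolutional sentence encoder in the same spirit; it is not the exact CSM architecture. The stack of two 1-D convolutions and the global max-pool used to obtain a fixed-size $c$ are my own simplifications.

```python
# A minimal sketch (not the exact CSM): stacked 1-D convolutions over word embeddings,
# pooled over positions to give a fixed-size sentence vector c.
import torch
import torch.nn as nn

class ConvSentenceEncoder(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim=48, ctx_dim=48, width=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv1 = nn.Conv1d(emb_dim, ctx_dim, kernel_size=width, padding=1)
        self.conv2 = nn.Conv1d(ctx_dim, ctx_dim, kernel_size=width, padding=1)

    def forward(self, source_ids):
        # source_ids: (batch, length); stacking convolutions widens the receptive field
        h = self.embed(source_ids).transpose(1, 2)   # (batch, emb_dim, length)
        h = torch.relu(self.conv1(h))
        h = torch.relu(self.conv2(h))
        return h.max(dim=2).values                   # pool over positions -> (batch, ctx_dim)

c = ConvSentenceEncoder(vocab_size=100)(torch.tensor([[5, 17, 9, 23, 3]]))
```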
K&B 2013: RNN Decoder

[Figure: the decoder RNN unrolled over the output "tom likes beer". Starting from $h_0$ and the start symbol <s>, each hidden state $h_t$ feeds a softmax producing $\hat p_t$, and the generated word is fed back in as the next input. The probability of the full output accumulates as]

$$p(\text{tom} \mid s, \langle s \rangle) \times p(\text{likes} \mid s, \langle s \rangle, \text{tom}) \times p(\text{beer} \mid s, \langle s \rangle, \text{tom}, \text{likes}) \times p(\langle /s \rangle \mid s, \langle s \rangle, \text{tom}, \text{likes}, \text{beer})$$
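To spell out what this product of conditionals means operationally, here is a minimal sketch of scoring a complete output with such a decoder. `next_word_logprobs` is a hypothetical stand-in for one decoder step (a real system would run the RNN and softmax above).

```python
# A minimal sketch: log p(w | x) is the sum of per-step conditional log-probabilities.
import math

def next_word_logprobs(source, history):
    # hypothetical stand-in: a real system would run the RNN decoder step here
    vocab = ["tom", "likes", "beer", "</s>"]
    return {w: math.log(1.0 / len(vocab)) for w in vocab}   # uniform toy distribution

def sequence_logprob(source, words):
    history, total = ["<s>"], 0.0
    for w in words:
        total += next_word_logprobs(source, history)[w]     # log p(w_t | x, w_<t)
        history.append(w)
    return total

print(sequence_logprob("s", ["tom", "likes", "beer", "</s>"]))   # 4 * log(1/4)
```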
Sutskever et al. (2014)
LSTM encoder ($(c_0, h_0)$ are parameters):
$$(c_i, h_i) = \text{LSTM}(x_i, c_{i-1}, h_{i-1})$$

The encoding is $(c_\ell, h_\ell)$, where $\ell = |x|$.

LSTM decoder:
$$w_0 = \langle s \rangle$$
$$(c_{t+\ell}, h_{t+\ell}) = \text{LSTM}(w_{t-1}, c_{t+\ell-1}, h_{t+\ell-1})$$
$$u_t = P h_{t+\ell} + b$$
$$p(W_t \mid x, w_{<t}) = \text{softmax}(u_t)$$
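Here is a minimal seq2seq sketch in this style, under my own assumptions (single-layer LSTMs, illustrative names and sizes): the encoder's final (hidden, cell) state is handed to the decoder as its initial state, which is exactly the "one vector carries the whole sentence" idea.

```python
# A minimal sketch (my own sizes/names) of a Sutskever-style seq2seq model.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):  # hypothetical name
    def __init__(self, src_vocab, tgt_vocab, emb_dim=32, hid_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # encode: keep only the final (h_l, c_l); this single state must summarise x
        _, state = self.encoder(self.src_embed(src_ids))
        # decode: tgt_ids starts with <s>; each output position predicts the next word
        out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.proj(out)                         # u_t at every target position

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.tensor([[4, 8, 15, 16]]), torch.tensor([[1, 23, 42]]))  # 1 = <s>
```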
Sutskever et al. (2014)

[Figure: the encoder LSTM reads the source sentence "Aller Anfang ist schwer STOP"; conditioned on its final state, the decoder then generates "Beginnings are difficult STOP" one word at a time, starting from START and feeding each generated word back in.]
Sutskever et al. (2014)

• Good
  • RNNs deal naturally with sequences of various lengths
  • LSTMs can in principle propagate gradients over a long distance
  • Very simple architecture!

• Bad
  • The hidden state has to remember a lot of information! (We will return to this problem on Thursday.)
Sutskever et al. (2014): Tricks

Read the input sequence "backwards": +4 BLEU

[Figure: the same encoder-decoder as above, but with the source sentence fed to the encoder in reverse order.]
Sutskever et al. (2014): Tricks

Use an ensemble of J independently trained models.
  Ensemble of 2 models: +3 BLEU
  Ensemble of 5 models: +4.5 BLEU

Decoder:
$$(c^{(j)}_{t+\ell}, h^{(j)}_{t+\ell}) = \text{LSTM}^{(j)}(w_{t-1}, c^{(j)}_{t+\ell-1}, h^{(j)}_{t+\ell-1})$$
$$u^{(j)}_t = P^{(j)} h^{(j)}_{t+\ell} + b^{(j)}$$
$$u_t = \frac{1}{J} \sum_{j'=1}^{J} u^{(j')}_t$$
$$p(W_t \mid x, w_{<t}) = \text{softmax}(u_t)$$
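Operationally, ensembling at decoding time just means averaging the J models' logits at every step before applying the softmax; a minimal sketch (with random stand-in logits) is below.

```python
# A minimal sketch of step-wise ensembling: u_t = (1/J) * sum_j u_t^(j), then softmax.
import torch

J, vocab_size = 5, 100
per_model_logits = [torch.randn(vocab_size) for _ in range(J)]   # stand-ins for u_t^(j)
u_t = torch.stack(per_model_logits).mean(dim=0)
p_next = torch.softmax(u_t, dim=-1)                              # p(W_t | x, w_<t)
```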


A word about decoding
In general, we want to find the most probable (MAP) output given the input, i.e.

$$w^* = \arg\max_{w} p(w \mid x) = \arg\max_{w} \sum_{t=1}^{|w|} \log p(w_t \mid x, w_{<t})$$
This is, for general RNNs, a hard problem (in fact undecidable). We therefore approximate it with a greedy search:

$$w_1^* = \arg\max_{w_1} p(w_1 \mid x)$$
$$w_2^* = \arg\max_{w_2} p(w_2 \mid x, w_1^*)$$
$$\vdots$$
$$w_t^* = \arg\max_{w_t} p(w_t \mid x, w^*_{<t})$$
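A minimal sketch of greedy decoding follows; `next_word_logprobs` is a hypothetical stand-in for one step of any trained conditional LM.

```python
# A minimal sketch of greedy search: commit to the single best word at every step.
def next_word_logprobs(x, history):
    # hypothetical stand-in: a real system would run the trained decoder here
    return {"I": -2.1, "beer": -1.8, "drink": -2.9, "</s>": -0.5}

def greedy_decode(x, max_len=20):
    history = ["<s>"]
    while len(history) <= max_len:
        logprobs = next_word_logprobs(x, history)
        best = max(logprobs, key=logprobs.get)   # w_t* = argmax_w p(w | x, w*_<t)
        history.append(best)
        if best == "</s>":
            break
    return history[1:]

print(greedy_decode("Bier trinke ich"))   # ['</s>'] under this toy distribution
```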
A word about decoding
A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of the top b hypotheses.

E.g., for b = 2, with x = "Bier trinke ich" (word for word: beer, drink, I):

[Figure: the beam search tree. From <s> (logprob 0), the top two first words are kept: "beer" (logprob -1.82) and "I" (-2.11). Expanding both gives "beer drink" (-6.93), "beer I" (-5.80), "I beer" (-8.66) and "I drink" (-2.87); the two best, "I drink" and "beer I", are kept. Expanding again gives "beer I drink" (-6.28), "beer I like" (-7.31), "I drink beer" (-3.04) and "I drink wine" (-5.12), so the best hypothesis so far is "I drink beer".]
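Here is a minimal sketch of the beam search just illustrated; again `next_word_logprobs` is a hypothetical stand-in for a decoder step, and the pruning keeps the b highest-scoring hypotheses after every expansion.

```python
# A minimal sketch of beam search with beam size b.
def next_word_logprobs(x, history):
    # hypothetical stand-in: a real system would run the trained conditional LM here
    return {"I": -1.0, "drink": -1.2, "beer": -1.5, "</s>": -2.0}

def beam_search(x, b=2, max_len=5):
    beam = [(["<s>"], 0.0)]                                # (hypothesis, logprob)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beam:
            if hyp[-1] == "</s>":                          # finished hypotheses carry over
                candidates.append((hyp, score))
                continue
            for w, lp in next_word_logprobs(x, hyp).items():
                candidates.append((hyp + [w], score + lp))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]   # keep top b
    return beam

print(beam_search("Bier trinke ich"))
```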
Sutskever et al. (2014): Tricks

Use beam search: +1 BLEU

[Figure: the same beam search example as above.]
Image caption generation

• Neural networks are great for working with multiple modalities: everything is a vector!

• Image caption generation can therefore use the same techniques as translation modeling

• A word about data
  • Relatively few captioned images are available
  • Pre-train an image embedding model using another task, like image identification (e.g., ImageNet)
Kiros et al. (2013)

• Looks a lot like Kalchbrenner and Blunsom (2013)
  • convolutional network on the input
  • n-gram language model on the output

• Innovation: multiplicative interactions in the decoder n-gram model
Kiros et al. (2013)
Encoder: $x = \text{embed}(x)$

Unconditional n-gram LM (each $w_{t-i}$ below is the embedding of that context word):
$$h_t = W[w_{t-n+1}; w_{t-n+2}; \ldots; w_{t-1}]$$
$$u_t = P h_t + b$$
$$p(W_t \mid w_{t-n+1}^{t-1}) = \text{softmax}(u_t)$$

Simple conditional n-gram LM:
$$h_t = W[w_{t-n+1}; w_{t-n+2}; \ldots; w_{t-1}] + Cx$$
$$u_t = P h_t + b$$
$$p(W_t \mid x, w_{t-n+1}^{t-1}) = \text{softmax}(u_t)$$

Multiplicative n-gram LM: instead of a single, context-independent word embedding $w_i = r_{i,w}$, let the word representation depend on the image, $w_i = r_{i,j,w}\, x_j$. (How big is this tensor? What's the intuition here?)

Factorising the tensor, $w_i = u_{w,i} v_{i,j}$ with $U \in \mathbb{R}^{|V| \times d}$ and $V \in \mathbb{R}^{d \times k}$, gives the model:
$$r_t = W[w_{t-n+1}; w_{t-n+2}; \ldots; w_{t-1}] + Cx$$
$$h_t = (W_{fr}\, r_t) \odot (W_{fx}\, x)$$
$$u_t = P h_t + b$$
$$p(W_t \mid x, w_{<t}) = \text{softmax}(u_t)$$
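A minimal sketch of the multiplicative decoder idea, under my own shapes and names: the n-gram context vector is gated elementwise by a projection of the image vector, mirroring $h_t = (W_{fr} r_t) \odot (W_{fx} x)$.

```python
# A minimal sketch (my own shapes/names) of a multiplicative conditional n-gram LM.
import torch
import torch.nn as nn

class MultiplicativeNgramLM(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim=32, img_dim=48, hid_dim=64, n=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.W = nn.Linear((n - 1) * emb_dim, hid_dim)    # W[w_{t-n+1}; ...; w_{t-1}]
        self.C = nn.Linear(img_dim, hid_dim, bias=False)  # + C x
        self.W_fr = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_fx = nn.Linear(img_dim, hid_dim, bias=False)
        self.P = nn.Linear(hid_dim, vocab_size)

    def forward(self, context_ids, x):
        # context_ids: (batch, n-1) previous words; x: (batch, img_dim) image embedding
        r = self.W(self.embed(context_ids).flatten(1)) + self.C(x)   # r_t
        h = self.W_fr(r) * self.W_fx(x)                              # elementwise gating
        return self.P(h)                                             # u_t

model = MultiplicativeNgramLM(vocab_size=100)
logits = model(torch.tensor([[4, 8]]), torch.randn(1, 48))
p_next = torch.softmax(logits, dim=-1)
```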
Kiros et al. (2013)

• Two take-home messages:

  • Feed-forward n-gram models can be used in place of RNNs in conditional models

  • Modeling interactions between input modalities holds a lot of promise
    • Although MLP-type models can approximate higher-order tensors, multiplicative models appear to make learning interactions easier
Questions?
