Lecture 7 - Conditional Language Modeling

This document discusses conditional language models. It explains that conditional LMs assign probabilities to word sequences given some context or input. It decomposes this probability using the chain rule and discusses how RNNs can be used to model the probability of the next word given the previous words and the conditioning context. It also covers challenges such as searching for the most likely output, the data available for training conditional LMs, and how to evaluate them.

Conditional Language Modeling

Chris Dyer
Review: Unconditional LMs
A language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$.

We saw that it is helpful to decompose this probability using the chain rule, as follows:

$$p(w) = p(w_1) \times p(w_2 \mid w_1) \times p(w_3 \mid w_1, w_2) \times \cdots \times p(w_\ell \mid w_1, \ldots, w_{\ell-1}) = \prod_{t=1}^{|w|} p(w_t \mid w_1, \ldots, w_{t-1})$$

This reduces the language modeling problem to modeling the probability of the next word, given the history of preceding words.
Unconditional LMs with RNNs

[Figure: an unrolled RNN language model. Each hidden state $h_t$ is computed from the previous hidden state and the embedding of the observed context word $w_t$; a softmax over the final hidden state gives the distribution over the next-word random variable, $p(W_5 \mid w_1, w_2, w_3, w_4)$, a vector of length |vocab|.]
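To make the figure concrete, here is a minimal sketch of such an unconditional RNN LM in PyTorch; the class name `RNNLM`, the layer sizes, and the plain tanh RNN cell are illustrative assumptions, not details from the lecture.

```python
# A minimal sketch (illustrative, not the lecture's code) of an unconditional RNN LM:
# h_t is computed from h_{t-1} and the embedding of w_t; softmax(P h_t + b) gives
# the distribution over the next word.
import torch
import torch.nn as nn

class RNNLM(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hid_dim, batch_first=True)  # simple tanh cell
        self.proj = nn.Linear(hid_dim, vocab_size)              # length-|vocab| logits

    def forward(self, word_ids):
        # word_ids: (batch, time) observed context words w_1 .. w_T
        hidden, _ = self.rnn(self.embed(word_ids))   # h_1 .. h_T
        return self.proj(hidden)                     # logits for p(W_{t+1} | w_1 .. w_t)

# toy usage: the distribution p(W_5 | w_1, w_2, w_3, w_4)
model = RNNLM(vocab_size=100)
logits = model(torch.tensor([[7, 3, 42, 9]]))
p_next = torch.softmax(logits[0, -1], dim=-1)
```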
Conditional LMs
A conditional language model assigns probabilities to sequences of words, $w = (w_1, w_2, \ldots, w_\ell)$, given some conditioning context, $x$.

As with unconditional models, it is again helpful to use the chain rule to decompose this probability:

$$p(w \mid x) = \prod_{t=1}^{\ell} p(w_t \mid x, w_1, w_2, \ldots, w_{t-1})$$

What is the probability of the next word, given the history of previously generated words and the conditioning context $x$?
Conditional LMs
x “input”                            w “text output”
An author                            A document written by that author
A topic label                        An article about that topic
{SPAM, NOT_SPAM}                     An email
A sentence in French                 Its English translation
A sentence in English                Its French translation
A sentence in English                Its Chinese translation
An image                             A text description of the image
A document                           Its summary
A document                           Its translation
Meteorological measurements          A weather report
An acoustic signal                   A transcription of the speech
Conversational history + database    Dialogue system response
A question + a document              Its answer
A question + an image                Its answer
[The table above is repeated with handwritten annotations, apparently marking which of these tasks are covered this week and which in the following weeks.]
Data for training conditional LMs
To train conditional language models, we need paired samples, $\{(x_i, w_i)\}_{i=1}^{N}$.

Data availability varies. It's easy to think of tasks that could be solved by conditional language models, but for which the data just doesn't exist.

Relatively large amounts of data are available for: translation, summarisation, caption generation, speech recognition.
Algorithmic challenges
We often want to find the most likely $w$ given some $x$. This is unfortunately generally an intractable problem:

$$w^* = \arg\max_{w} p(w \mid x)$$

We therefore approximate it using a beam search, or with Monte Carlo methods, since sampling $w^{(i)} \sim p(w \mid x)$ is often computationally easy.

Improving search/inference is an open research question.

• How can we search more effectively?
• Can we get guarantees that we have found the max?
• Can we limit the model a bit to make search easier?
Evaluating conditional LMs
How good is our conditional language model?

These are language models, so we can use cross-entropy or perplexity. (okay to implement, hard to interpret)

Task-specific evaluation. Compare the model's most likely output to a human-generated reference output using a task-specific evaluation metric $L$:

$$w^* = \arg\max_{w} p(w \mid x), \qquad L(w^*, w_{\text{ref}})$$

Examples of $L$: BLEU, METEOR, WER, ROUGE. (easy to implement, okay to interpret)

Human evaluation. (hard to implement, easy to interpret)
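As a concrete illustration of the intrinsic metrics above, here is a minimal sketch of computing per-word cross-entropy and perplexity from a model's token log-probabilities; the function name and the toy numbers are my own.

```python
# A minimal sketch: perplexity is the exponentiated average negative log-probability
# that the conditional LM assigns to the reference tokens of a held-out set.
import math

def cross_entropy_and_perplexity(token_logprobs):
    """token_logprobs: per-token log p(w_t | x, w_<t) values (natural log)."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)   # cross-entropy, nats per word
    return avg_nll, math.exp(avg_nll)                      # perplexity = exp(cross-entropy)

# toy usage with made-up log-probabilities
ce, ppl = cross_entropy_and_perplexity([-2.3, -0.7, -1.2, -0.1])
print(f"cross-entropy = {ce:.3f} nats/word, perplexity = {ppl:.2f}")
```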
Lecture overview
The rest of this lecture will look at “encoder-decoder” models that learn a function that maps $x$ into a fixed-size vector and then use a language model to “decode” that vector into a sequence of words, $w$.

Example (translation):
  x: Kunst kann nicht gelehrt werden…
  w: Artistry can’t be taught…

Example (caption generation; $x$ is an image, not shown here):
  w: A dog is playing on the beach.
Lecture overview
• Two questions

  • How do we encode $x$ as a fixed-size vector, $c$?
    - Problem (or at least modality) specific
    - Think about assumptions

  • How do we condition on $c$ in the decoding model?
    - Less problem specific
    - We will review solutions/architectures
Kalchbrenner and Blunsom 2013
Encoder:
$$c = \text{embed}(x), \qquad s = Vc$$

Recurrent decoder:
$$h_t = g(W[h_{t-1}; w_{t-1}] + s + b)$$
$$u_t = P h_t + b'$$
$$p(W_t \mid x, w_{<t}) = \text{softmax}(u_t)$$

Here $[h_{t-1}; w_{t-1}]$ concatenates the recurrent connection with the embedding of $w_{t-1}$, $s$ injects the source sentence, and $b$, $b'$ are learnt biases.

Recall the unconditional RNN:
$$h_t = g(W[h_{t-1}; w_{t-1}] + b)$$
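A minimal sketch of this conditioned recurrent decoder follows, assuming my own layer sizes and the illustrative name `ConditionedRNNDecoder`: the point is simply that the transformed source vector $s = Vc$ is added into every hidden-state update.

```python
# A minimal sketch (not the paper's code) of the K&B-style decoder:
# h_t = g(W[h_{t-1}; w_{t-1}] + s + b), with s = Vc injected at every step.
import torch
import torch.nn as nn

class ConditionedRNNDecoder(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64, ctx_dim=48):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.W = nn.Linear(hid_dim + emb_dim, hid_dim)    # W[h_{t-1}; w_{t-1}] + b
        self.V = nn.Linear(ctx_dim, hid_dim, bias=False)  # s = V c
        self.P = nn.Linear(hid_dim, vocab_size)           # u_t = P h_t + b'

    def step(self, prev_word, h_prev, s):
        w_emb = self.embed(prev_word)                                  # embedding of w_{t-1}
        h = torch.tanh(self.W(torch.cat([h_prev, w_emb], dim=-1)) + s)
        return h, self.P(h)                                            # new state, logits u_t

# toy usage: condition on a (here random) source embedding c = embed(x)
dec = ConditionedRNNDecoder(vocab_size=100)
s = dec.V(torch.randn(1, 48))                  # injected at every decoder step
h = torch.zeros(1, 64)
h, logits = dec.step(torch.tensor([1]), h, s)  # 1 stands in for <s>
p_w1 = torch.softmax(logits, dim=-1)
```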
K&B 2013: Encoder
How should we define $c = \text{embed}(x)$?

The simplest model possible:
$$c = \sum_i x_i$$

[Figure: the embeddings $x_1, \ldots, x_6$ of the six source words are summed into a single vector $c$.]

What do you think of this model?
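For reference, a tiny sketch of this additive encoder (the toy vocabulary size and word ids are my own): it is a bag of word embeddings and discards word order entirely.

```python
# A minimal sketch of the simplest encoder: c = sum_i x_i over source word embeddings.
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=100, embedding_dim=48)
source_ids = torch.tensor([5, 17, 9, 23, 3, 61])   # x_1 .. x_6 (toy ids)
c = embed(source_ids).sum(dim=0)                   # fixed-size vector, shape (48,)
```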


K&B 2013: CSM Encoder
How should we define c = embed(x)?
Convolutional sentence model (CSM)
K&B 2013: CSM Encoder

• Good
  • Convolutions learn interactions among features in a local context
  • By stacking them, longer range dependencies can be learnt
  • Deep ConvNets have a branching structure similar to trees, but no parser is required

• Bad
  • Sentences have different lengths, and need different-depth trees; convnets are not usually so dynamic, but see*

* Kalchbrenner et al. (2014). A Convolutional Neural Network for Modelling Sentences. In Proc. ACL.
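Below is a minimal sketch of a convolutional sentence encoder in the same spirit; it is not the exact CSM architecture. The stack of two 1-D convolutions and the global max-pool used to obtain a fixed-size $c$ are my own simplifications.

```python
# A minimal sketch (not the exact CSM): stacked 1-D convolutions over word embeddings,
# pooled over positions to give a fixed-size sentence vector c.
import torch
import torch.nn as nn

class ConvSentenceEncoder(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim=48, ctx_dim=48, width=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv1 = nn.Conv1d(emb_dim, ctx_dim, kernel_size=width, padding=1)
        self.conv2 = nn.Conv1d(ctx_dim, ctx_dim, kernel_size=width, padding=1)

    def forward(self, source_ids):
        # source_ids: (batch, length); stacking convolutions widens the receptive field
        h = self.embed(source_ids).transpose(1, 2)   # (batch, emb_dim, length)
        h = torch.relu(self.conv1(h))
        h = torch.relu(self.conv2(h))
        return h.max(dim=2).values                   # pool over positions -> (batch, ctx_dim)

c = ConvSentenceEncoder(vocab_size=100)(torch.tensor([[5, 17, 9, 23, 3]]))
```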
K&B 2013: RNN Decoder

[Figure: the decoder RNN unrolled over the output "tom likes beer". Starting from $h_0$ and the start symbol <s>, each hidden state $h_t$ feeds a softmax producing $\hat p_t$, and the generated word is fed back in as the next input. The probability of the full output accumulates as]

$$p(\text{tom} \mid s, \langle s \rangle) \times p(\text{likes} \mid s, \langle s \rangle, \text{tom}) \times p(\text{beer} \mid s, \langle s \rangle, \text{tom}, \text{likes}) \times p(\langle /s \rangle \mid s, \langle s \rangle, \text{tom}, \text{likes}, \text{beer})$$
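To spell out what this product of conditionals means operationally, here is a minimal sketch of scoring a complete output with such a decoder. `next_word_logprobs` is a hypothetical stand-in for one decoder step (a real system would run the RNN and softmax above).

```python
# A minimal sketch: log p(w | x) is the sum of per-step conditional log-probabilities.
import math

def next_word_logprobs(source, history):
    # hypothetical stand-in: a real system would run the RNN decoder step here
    vocab = ["tom", "likes", "beer", "</s>"]
    return {w: math.log(1.0 / len(vocab)) for w in vocab}   # uniform toy distribution

def sequence_logprob(source, words):
    history, total = ["<s>"], 0.0
    for w in words:
        total += next_word_logprobs(source, history)[w]     # log p(w_t | x, w_<t)
        history.append(w)
    return total

print(sequence_logprob("s", ["tom", "likes", "beer", "</s>"]))   # 4 * log(1/4)
```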
Sutskever et al. (2014)
LSTM encoder ($(c_0, h_0)$ are parameters):
$$(c_i, h_i) = \text{LSTM}(x_i, c_{i-1}, h_{i-1})$$

The encoding is $(c_\ell, h_\ell)$, where $\ell = |x|$.

LSTM decoder:
$$w_0 = \langle s \rangle$$
$$(c_{t+\ell}, h_{t+\ell}) = \text{LSTM}(w_{t-1}, c_{t+\ell-1}, h_{t+\ell-1})$$
$$u_t = P h_{t+\ell} + b$$
$$p(W_t \mid x, w_{<t}) = \text{softmax}(u_t)$$
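Here is a minimal seq2seq sketch in this style, under my own assumptions (single-layer LSTMs, illustrative names and sizes): the encoder's final (hidden, cell) state is handed to the decoder as its initial state, which is exactly the "one vector carries the whole sentence" idea.

```python
# A minimal sketch (my own sizes/names) of a Sutskever-style seq2seq model.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):  # hypothetical name
    def __init__(self, src_vocab, tgt_vocab, emb_dim=32, hid_dim=64):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # encode: keep only the final (h_l, c_l); this single state must summarise x
        _, state = self.encoder(self.src_embed(src_ids))
        # decode: tgt_ids starts with <s>; each output position predicts the next word
        out, _ = self.decoder(self.tgt_embed(tgt_ids), state)
        return self.proj(out)                         # u_t at every target position

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
logits = model(torch.tensor([[4, 8, 15, 16]]), torch.tensor([[1, 23, 42]]))  # 1 = <s>
```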
Sutskever et al. (2014)

[Figure: the encoder LSTM reads the source sentence "Aller Anfang ist schwer STOP"; conditioned on its final state, the decoder then generates "Beginnings are difficult STOP" one word at a time, starting from START and feeding each generated word back in.]
Sutskever et al. (2014)

• Good
  • RNNs deal naturally with sequences of various lengths
  • LSTMs can in principle propagate gradients over a long distance
  • Very simple architecture!

• Bad
  • The hidden state has to remember a lot of information! (We will return to this problem on Thursday.)
Sutskever et al. (2014): Tricks

Read the input sequence "backwards": +4 BLEU

[Figure: the same encoder-decoder as above, but with the source sentence fed to the encoder in reverse order.]
Sutskever et al. (2014): Tricks

Use an ensemble of J independently trained models.
  Ensemble of 2 models: +3 BLEU
  Ensemble of 5 models: +4.5 BLEU

Decoder:
$$(c^{(j)}_{t+\ell}, h^{(j)}_{t+\ell}) = \text{LSTM}^{(j)}(w_{t-1}, c^{(j)}_{t+\ell-1}, h^{(j)}_{t+\ell-1})$$
$$u^{(j)}_t = P^{(j)} h^{(j)}_{t+\ell} + b^{(j)}$$
$$u_t = \frac{1}{J} \sum_{j'=1}^{J} u^{(j')}_t$$
$$p(W_t \mid x, w_{<t}) = \text{softmax}(u_t)$$
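Operationally, ensembling at decoding time just means averaging the J models' logits at every step before applying the softmax; a minimal sketch (with random stand-in logits) is below.

```python
# A minimal sketch of step-wise ensembling: u_t = (1/J) * sum_j u_t^(j), then softmax.
import torch

J, vocab_size = 5, 100
per_model_logits = [torch.randn(vocab_size) for _ in range(J)]   # stand-ins for u_t^(j)
u_t = torch.stack(per_model_logits).mean(dim=0)
p_next = torch.softmax(u_t, dim=-1)                              # p(W_t | x, w_<t)
```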


A word about decoding
In general, we want to find the most probable (MAP) output given the input, i.e.

$$w^* = \arg\max_{w} p(w \mid x) = \arg\max_{w} \sum_{t=1}^{|w|} \log p(w_t \mid x, w_{<t})$$
This is, for general RNNs, a hard problem (in fact undecidable). We therefore approximate it with a greedy search:

$$w_1^* = \arg\max_{w_1} p(w_1 \mid x)$$
$$w_2^* = \arg\max_{w_2} p(w_2 \mid x, w_1^*)$$
$$\vdots$$
$$w_t^* = \arg\max_{w_t} p(w_t \mid x, w^*_{<t})$$
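A minimal sketch of greedy decoding follows; `next_word_logprobs` is a hypothetical stand-in for one step of any trained conditional LM.

```python
# A minimal sketch of greedy search: commit to the single best word at every step.
def next_word_logprobs(x, history):
    # hypothetical stand-in: a real system would run the trained decoder here
    return {"I": -2.1, "beer": -1.8, "drink": -2.9, "</s>": -0.5}

def greedy_decode(x, max_len=20):
    history = ["<s>"]
    while len(history) <= max_len:
        logprobs = next_word_logprobs(x, history)
        best = max(logprobs, key=logprobs.get)   # w_t* = argmax_w p(w | x, w*_<t)
        history.append(best)
        if best == "</s>":
            break
    return history[1:]

print(greedy_decode("Bier trinke ich"))   # ['</s>'] under this toy distribution
```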
A word about decoding
A slightly better approximation is to use a beam search with beam size b. Key idea: keep track of the top b hypotheses.

E.g., for b = 2, with x = "Bier trinke ich" (word for word: beer, drink, I):

[Figure: the beam search tree. From <s> (logprob 0), the top two first words are kept: "beer" (logprob -1.82) and "I" (-2.11). Expanding both gives "beer drink" (-6.93), "beer I" (-5.80), "I beer" (-8.66) and "I drink" (-2.87); the two best, "I drink" and "beer I", are kept. Expanding again gives "beer I drink" (-6.28), "beer I like" (-7.31), "I drink beer" (-3.04) and "I drink wine" (-5.12), so the best hypothesis so far is "I drink beer".]
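Here is a minimal sketch of the beam search just illustrated; again `next_word_logprobs` is a hypothetical stand-in for a decoder step, and the pruning keeps the b highest-scoring hypotheses after every expansion.

```python
# A minimal sketch of beam search with beam size b.
def next_word_logprobs(x, history):
    # hypothetical stand-in: a real system would run the trained conditional LM here
    return {"I": -1.0, "drink": -1.2, "beer": -1.5, "</s>": -2.0}

def beam_search(x, b=2, max_len=5):
    beam = [(["<s>"], 0.0)]                                # (hypothesis, logprob)
    for _ in range(max_len):
        candidates = []
        for hyp, score in beam:
            if hyp[-1] == "</s>":                          # finished hypotheses carry over
                candidates.append((hyp, score))
                continue
            for w, lp in next_word_logprobs(x, hyp).items():
                candidates.append((hyp + [w], score + lp))
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]   # keep top b
    return beam

print(beam_search("Bier trinke ich"))
```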
Sutskever et al. (2014): Tricks

Use beam search: +1 BLEU

[Figure: the same beam search example as above.]
Image caption generation

• Neural networks are great for working with multiple modalities: everything is a vector!

• Image caption generation can therefore use the same techniques as translation modeling

• A word about data
  • Relatively few captioned images are available
  • Pre-train an image embedding model using another task, like image identification (e.g., ImageNet)
Kiros et al. (2013)

• Looks a lot like Kalchbrenner and Blunsom (2013)
  • convolutional network on the input
  • n-gram language model on the output

• Innovation: multiplicative interactions in the decoder n-gram model
Kiros et al. (2013)
Encoder: $x = \text{embed}(x)$

Unconditional n-gram LM (each $w_{t-i}$ below is the embedding of that context word):
$$h_t = W[w_{t-n+1}; w_{t-n+2}; \ldots; w_{t-1}]$$
$$u_t = P h_t + b$$
$$p(W_t \mid w_{t-n+1}^{t-1}) = \text{softmax}(u_t)$$

Simple conditional n-gram LM:
$$h_t = W[w_{t-n+1}; w_{t-n+2}; \ldots; w_{t-1}] + Cx$$
$$u_t = P h_t + b$$
$$p(W_t \mid x, w_{t-n+1}^{t-1}) = \text{softmax}(u_t)$$

Multiplicative n-gram LM: instead of a single, context-independent word embedding $w_i = r_{i,w}$, let the word representation depend on the image, $w_i = r_{i,j,w}\, x_j$. (How big is this tensor? What's the intuition here?)

Factorising the tensor, $w_i = u_{w,i} v_{i,j}$ with $U \in \mathbb{R}^{|V| \times d}$ and $V \in \mathbb{R}^{d \times k}$, gives the model:
$$r_t = W[w_{t-n+1}; w_{t-n+2}; \ldots; w_{t-1}] + Cx$$
$$h_t = (W_{fr}\, r_t) \odot (W_{fx}\, x)$$
$$u_t = P h_t + b$$
$$p(W_t \mid x, w_{<t}) = \text{softmax}(u_t)$$
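A minimal sketch of the multiplicative decoder idea, under my own shapes and names: the n-gram context vector is gated elementwise by a projection of the image vector, mirroring $h_t = (W_{fr} r_t) \odot (W_{fx} x)$.

```python
# A minimal sketch (my own shapes/names) of a multiplicative conditional n-gram LM.
import torch
import torch.nn as nn

class MultiplicativeNgramLM(nn.Module):  # hypothetical name
    def __init__(self, vocab_size, emb_dim=32, img_dim=48, hid_dim=64, n=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.W = nn.Linear((n - 1) * emb_dim, hid_dim)    # W[w_{t-n+1}; ...; w_{t-1}]
        self.C = nn.Linear(img_dim, hid_dim, bias=False)  # + C x
        self.W_fr = nn.Linear(hid_dim, hid_dim, bias=False)
        self.W_fx = nn.Linear(img_dim, hid_dim, bias=False)
        self.P = nn.Linear(hid_dim, vocab_size)

    def forward(self, context_ids, x):
        # context_ids: (batch, n-1) previous words; x: (batch, img_dim) image embedding
        r = self.W(self.embed(context_ids).flatten(1)) + self.C(x)   # r_t
        h = self.W_fr(r) * self.W_fx(x)                              # elementwise gating
        return self.P(h)                                             # u_t

model = MultiplicativeNgramLM(vocab_size=100)
logits = model(torch.tensor([[4, 8]]), torch.randn(1, 48))
p_next = torch.softmax(logits, dim=-1)
```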
Kiros et al. (2013)

• Two take-home messages:

  • Feed-forward n-gram models can be used in place of RNNs in conditional models

  • Modeling interactions between input modalities holds a lot of promise
    • Although MLP-type models can approximate higher-order tensors, multiplicative models appear to make learning interactions easier
Questions?
