Week9 Seq2seq

Sequence-to-sequence models use two recurrent neural networks, an encoder and decoder, to map an input sequence to an output sequence. The encoder encodes the input sequence into a vector representation, and the decoder takes this vector to generate the output sequence. Attention mechanisms allow the decoder to focus on different parts of the input sequence at each step of generating the output. Variational sequence-to-sequence models introduce latent variables to model uncertainty and encourage the decoder to rely on the encoded meaning rather than copying from the input.

Sequence-to-sequence models

27 Jan 2016
Seq2seq (Sutskever et al., 2014)
[Figure: encoder RNN reads the source sequence; decoder RNN generates the target sequence]

Source: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns
Seq2seq overview and applications
• Encoder-decoder
• Two RNNs (typically LSTMs or GRUs)
• Can be deterministic or variational
• Applications:
• Machine translation
• Question answering
• Dialogue models (conversational agents)
• Summarization
• Etc.
LSTM cell
Seq2Seq
• Source sequence x = (x_1, x_2, ..., x_|x|), represented as word embedding vectors
• Target sequence y = (y_1, y_2, ..., y_|y|)
• At the end of the encoding process, we have the final hidden and cell states of the encoder, h_enc and c_enc
• Hidden state initialization:
• Set the initial states of the decoder to the encoder's final states: h_dec(0) = h_enc, c_dec(0) = c_enc
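A minimal PyTorch sketch of this encoder-to-decoder state handoff (the sizes and module names are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the slides
SRC_VOCAB, EMB_DIM, HID_DIM = 10_000, 256, 512

embedding = nn.Embedding(SRC_VOCAB, EMB_DIM)           # shared here only for brevity
encoder = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)
decoder = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

x = torch.randint(0, SRC_VOCAB, (1, 7))                # a source sequence of 7 token ids
_, (h_enc, c_enc) = encoder(embedding(x))              # final hidden and cell states

# Initialize the decoder with the encoder's final states
y_prev = torch.randint(0, SRC_VOCAB, (1, 1))           # previous target word (id)
out, (h_dec, c_dec) = decoder(embedding(y_prev), (h_enc, c_enc))
```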
Seq2seq (cont.)
• At each step of the decoder, compute the hidden state
  h_j = LSTM(h_(j-1), y_(j-1); θ)
• y_(j-1) – ground-truth previous word during training ("teacher forcing"),
  and previously predicted word at inference time
• θ – parameters (weights) of the network
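A sketch of the training-time decoding loop with teacher forcing, reusing the hypothetical `embedding` and `decoder` modules from the earlier sketch (`out_proj` is an assumed output projection layer):

```python
import torch

def decode_with_teacher_forcing(decoder, embedding, out_proj, y, state):
    """y: (1, T) ground-truth target ids; state: (h, c) handed over from the encoder."""
    logits = []
    for j in range(1, y.size(1)):
        inp = embedding(y[:, j - 1 : j])     # feed the ground-truth previous word y_(j-1)
        out, state = decoder(inp, state)     # one LSTM step updates the hidden/cell states
        logits.append(out_proj(out))         # unnormalized scores over the target vocabulary
    return torch.cat(logits, dim=1)

# At inference time the previously *predicted* word id is embedded and fed back instead.
```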
Seq2seq (cont.)
• Predicted word at time step j is given by a softmax layer:
  p(y_j | y_(<j), x) = softmax(W_out h_j)
• W_out is a weight matrix
• Softmax function:
  softmax(y_j)_k = exp(y_jk) / Σ_k' exp(y_jk')
• y_jk is the value of the kth dimension of the output vector at time step j
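The output layer in code, with an illustrative target vocabulary size and a placeholder decoder hidden state:

```python
import torch
import torch.nn as nn

TGT_VOCAB, HID_DIM = 10_000, 512              # illustrative sizes
W_out = nn.Linear(HID_DIM, TGT_VOCAB)         # the output weight matrix W_out

h_j = torch.randn(HID_DIM)                    # decoder hidden state at time step j
y_j = W_out(h_j)                              # output vector (one value y_jk per vocabulary word)
p_j = torch.softmax(y_j, dim=-1)              # p_jk = exp(y_jk) / sum_k' exp(y_jk')
predicted_word = p_j.argmax().item()          # greedy choice of the predicted word
```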
Softmax example

Source: (Bahuleyan, 2018)


Seq2seq model

Source: (Bahuleyan, 2018)


Selecting the word at each time step of the decoder
• Greedy search: select the word with the highest p(y_j) given by the softmax layer
• Beam search: at each time step, keep the k candidate sequences with the highest combined probability (a code sketch follows the beam search examples below)
• k – beam width (typically 5–10)
Beam search
• Multiple possible replies can be generated in response to "Who does John like?"

Image source: https://www.analyticsvidhya.com/blog/2018/03/essentials-of-deep-learning-sequence-to-sequence-modelling-with-attention-part-i/


Beam search (cont.)
• Choose the proposed path with the maximum combined probability

Image source: https://www.analyticsvidhya.com/blog/2018/03/essentials-of-deep-learning-sequence-to-sequence-modelling-with-attention-part-i/
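A compact sketch of beam search as described above; `step_fn` is an assumed callback that returns log-probabilities over the vocabulary for a given prefix (e.g. one decoder step plus the softmax layer):

```python
import torch

def beam_search(step_fn, start_id, eos_id, k=5, max_len=20):
    beams = [([start_id], 0.0)]                      # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:                 # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)              # (vocab_size,) log p(y_j | prefix, x)
            top_lp, top_ids = log_probs.topk(k)      # expand with the k best next words
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((prefix + [idx], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]   # keep k best paths
    return max(beams, key=lambda c: c[1])[0]         # path with the maximum combined probability

# Greedy search is the special case k = 1.
```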


Seq2seq resources
• https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
Attention mechanism in RNN encoder-decoder networks – Intuitions
• Dynamically align the target sequence with the source sequence in the decoder
• Pay different levels of attention to words in the input sequence at each time step in the decoder
• At each time step, the decoder is provided access to all encoded source tokens
• The decoder gives higher weights to certain source tokens and lower weights to others
Attention mechanism – Formal definition
• Compute a probability distribution over source positions at each decoding time step j:
  α_ji = exp(e_ji) / Σ_i' exp(e_ji')
  where α_ji is the weight given to source output i and e_ji is a pre-normalized score
Attention mechanism – Formal definition (cont.)
• Two methods to compute e_ji (h_j – decoder state at step j, h_i – source output i):
• Multiplicative (Luong et al., 2015):
  e_ji = h_j^T W h_i
• Additive (Bahdanau et al., 2014):
  e_ji = v^T tanh(W_1 h_j + W_2 h_i)

Attention mechanism – Formal definition (cont.)
• Take the sum of the source outputs weighted by α_ji to get the context vector:
  c_j = Σ_i α_ji h_i
• Compute the attention vector:
  a_j = tanh(W_a [c_j; h_j])
• Finally, feed the attention vector a_j to the softmax layer
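Putting the pieces together in a sketch of one decoding step's attention computation (it reuses the hypothetical score functions above; `W_a` is an assumed layer combining [c_j; h_j]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HID = 512
W_a = nn.Linear(2 * HID, HID, bias=False)             # combines [c_j; h_j]

def attention_step(h_j, h_enc, score_fn):
    e = score_fn(h_j, h_enc)                          # pre-normalized scores e_ji, shape (T,)
    alpha = F.softmax(e, dim=-1)                      # attention weights alpha_ji (sum to 1)
    c_j = alpha @ h_enc                               # context vector: weighted sum of source outputs
    a_j = torch.tanh(W_a(torch.cat([c_j, h_j])))      # attention vector fed to the softmax layer
    return a_j, alpha
```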


Seq2seq model with attention

Figure source: (Bahuleyan, 2018)


Visualizing Attention in Machine Translation (1)

Source: https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
Visualizing Attention in Machine Translation (2)

Source: https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
Variational Attention for
Sequence-to-Sequence Models
Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, Pascal Poupart
In Proc. COLING 2018
Deterministic Attention in Variational Encoder-Decoder (VED)
• The decoder LSTM has direct access to the source via c_j
• This may cause the decoder to ignore z – the "bypassing" phenomenon (Bahuleyan et al., 2018)

Figure source: (Bahuleyan, 2018)


Variational Attention
• The context vector c_j is modelled as a Gaussian random variable
• ELBO for the standard VAE:
  E_q(z|x)[log p(x|z)] − KL(q(z|x) || p(z))
• ELBO for the VAE with variational attention:
  E_q(z|x) q(c|x)[log p(y|z, c)] − KL(q(z|x) || p(z)) − Σ_j KL(q(c_j|x) || p(c_j))


Variational Attention (continued)
• Given x, we can assume conditional independence between z and c_j
• Hence, the posterior factorizes as
  q(z, c_j | x) = q(z | x) q(c_j | x)
• Assume separate priors for z and c_j:
  p(z, c_j) = p(z) p(c_j)
• Sampling is done separately, and the KL loss can be computed independently for each term
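Because the posterior factorizes, z and c_j can be drawn with two independent reparameterized samples; a minimal sketch (the tensor shapes and values are placeholders):

```python
import torch

def sample_gaussian(mu, logvar):
    """Reparameterization trick: a sample from N(mu, diag(exp(logvar)))."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

# Placeholder posterior parameters (in the model they come from the encoder / attention)
mu_z, logvar_z = torch.zeros(16), torch.zeros(16)
mu_c, logvar_c = torch.zeros(512), torch.zeros(512)

z = sample_gaussian(mu_z, logvar_z)        # sample for the sentence latent code
c_j = sample_gaussian(mu_c, logvar_c)      # independent sample for the context vector at step j
```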
Seq2Seq VED with Variational Attention

Figure source: (Bahuleyan, 2018)


Seq2Seq VED with Variational Attention
• Loss function:
  J = J_rec + λ_KL [ KL(q(z|x) || p(z)) + γ_a Σ_j KL(q(c_j|x) || p(c_j)) ]
• λ_KL – coefficient for both KL terms
• γ_a – coefficient for the context vector's KL term (kept constant)
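A sketch of how the weighted terms combine, assuming the reconstruction loss and the individual KL values have already been computed:

```python
def ved_loss(recon_nll, kl_z, kl_c_per_step, lambda_kl, gamma_a):
    """Reconstruction term plus both KL terms, weighted as in the slide:
    J = J_rec + lambda_kl * (KL_z + gamma_a * sum_j KL_cj)."""
    return recon_nll + lambda_kl * (kl_z + gamma_a * sum(kl_c_per_step))
```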
Seq2Seq VED with Variational Attention – Prior
• Sentence latent code z prior: p(z) = N(0, I) (same as in the VAE)
• Context vector c_j prior:
• Option 1: p(c_j) = N(0, I)
• Option 2: p(c_j) = N(h̄, I),
  where h̄ = (1/|x|) Σ_i h_i is the mean of the source hidden states
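Since both prior options are Gaussians with identity covariance, the KL term has a closed form; a sketch contrasting the two options (tensor values are placeholders):

```python
import torch

def kl_to_identity_cov_prior(mu, logvar, prior_mean):
    """KL( N(mu, diag(exp(logvar))) || N(prior_mean, I) ), summed over dimensions."""
    return 0.5 * torch.sum(torch.exp(logvar) + (mu - prior_mean) ** 2 - 1.0 - logvar)

h_enc_outputs = torch.randn(7, 512)                   # placeholder source hidden states (T, HID)
mu_c, logvar_c = torch.randn(512), torch.zeros(512)   # placeholder parameters of q(c_j | x)

kl_opt1 = kl_to_identity_cov_prior(mu_c, logvar_c, torch.zeros(512))        # Option 1: p(c_j) = N(0, I)
kl_opt2 = kl_to_identity_cov_prior(mu_c, logvar_c, h_enc_outputs.mean(0))   # Option 2: p(c_j) = N(h̄, I)
```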
Seq2Seq VED with Variational Attention – Posterior
• Both posterior distributions q(z|x) and q(c_j|x) are parameterized by the encoder LSTM
• For the sentence latent space (same as the VAE):
  q(z|x) = N(μ_z, diag(σ_z²))
• For the context vector c_j at time step j:
  q(c_j|x) = N(μ_cj, diag(σ_cj²)),
  where μ_cj is the context vector given by the deterministic attention mechanism,
  and σ_cj is computed using a feed-forward neural network
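A sketch of how the two posteriors could be parameterized; the layer shapes and the exact input to the variance network are assumptions for illustration:

```python
import torch
import torch.nn as nn

HID, Z_DIM = 512, 16                                   # illustrative sizes

# q(z | x): mean and log-variance computed from the encoder's final hidden state
mu_z_layer = nn.Linear(HID, Z_DIM)
logvar_z_layer = nn.Linear(HID, Z_DIM)

# q(c_j | x): mean is the deterministic context vector; the (log-)variance
# comes from a small feed-forward network (its input here is an assumption)
logvar_c_net = nn.Sequential(nn.Linear(HID, HID), nn.Tanh(), nn.Linear(HID, HID))

h_enc_final = torch.randn(HID)                         # placeholder encoder final state
c_j_det = torch.randn(HID)                             # placeholder deterministic context vector

mu_z, logvar_z = mu_z_layer(h_enc_final), logvar_z_layer(h_enc_final)
mu_c, logvar_c = c_j_det, logvar_c_net(c_j_det)
```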
Evaluation
• Tasks and datasets
• Question generation (SQuAD dataset) ~100K QA pairs
• Dialogue (Cornell Movie dialogs corpus) >200K conversational exchanges
• Evaluation measures:
• BLEU scores
• Entropy

• Distinct
Results on the question generation task

Source: Bahuleyan et al. (2018), https://arxiv.org/abs/1712.08207


Results on the conversational (dialogue) system experiment
Examples from the question generation task

Source: Bahuleyan et al. (2018), https://arxiv.org/abs/1712.08207
