Week9 Seq2seq

Sequence-to-sequence models use two recurrent neural networks, an encoder and decoder, to map an input sequence to an output sequence. The encoder encodes the input sequence into a vector representation, and the decoder takes this vector to generate the output sequence. Attention mechanisms allow the decoder to focus on different parts of the input sequence at each step of generating the output. Variational sequence-to-sequence models introduce latent variables to model uncertainty and encourage the decoder to rely on the encoded meaning rather than copying from the input.

Sequence-to-sequence models

27 Jan 2016
Seq2seq (Sutskever et al., 2014)
[Figure: encoder RNN reads the source sequence; decoder RNN generates the target sequence]

Source: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns
Seq2seq overview and applications
• Encoder-decoder
• Two RNNs (typically LSTMs or GRUs)
• Can be deterministic or variational
• Applications:
• Machine translation
• Question answering
• Dialogue models (conversational agents)
• Summarization
• Etc.
LSTM cell
Seq2Seq
• Source sequence x = (x_1, x_2, ..., x_|x|), represented as word embedding vectors
• Target sequence y = (y_1, y_2, ..., y_|y|)
• At the end of the encoding process, we have the final hidden and cell states of the encoder, h_enc and c_enc
• Hidden state initialization:
• Set the initial states of the decoder to the encoder's final states: h_dec(0) = h_enc, c_dec(0) = c_enc
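A minimal PyTorch sketch of this encoder-to-decoder state handoff (the sizes and module names are illustrative, not from the slides):

```python
import torch
import torch.nn as nn

# Illustrative sizes, not from the slides
SRC_VOCAB, EMB_DIM, HID_DIM = 10_000, 256, 512

embedding = nn.Embedding(SRC_VOCAB, EMB_DIM)           # shared here only for brevity
encoder = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)
decoder = nn.LSTM(EMB_DIM, HID_DIM, batch_first=True)

x = torch.randint(0, SRC_VOCAB, (1, 7))                # a source sequence of 7 token ids
_, (h_enc, c_enc) = encoder(embedding(x))              # final hidden and cell states

# Initialize the decoder with the encoder's final states
y_prev = torch.randint(0, SRC_VOCAB, (1, 1))           # previous target word (id)
out, (h_dec, c_dec) = decoder(embedding(y_prev), (h_enc, c_enc))
```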
Seq2seq (cont.)
• At each step of the decoder, compute the hidden state
  h_j = LSTM(h_(j-1), y_(j-1); θ)
• y_(j-1) – ground-truth previous word during training ("teacher forcing"),
  and previously predicted word at inference time
• θ – parameters (weights) of the network
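A sketch of the training-time decoding loop with teacher forcing, reusing the hypothetical `embedding` and `decoder` modules from the earlier sketch (`out_proj` is an assumed output projection layer):

```python
import torch

def decode_with_teacher_forcing(decoder, embedding, out_proj, y, state):
    """y: (1, T) ground-truth target ids; state: (h, c) handed over from the encoder."""
    logits = []
    for j in range(1, y.size(1)):
        inp = embedding(y[:, j - 1 : j])     # feed the ground-truth previous word y_(j-1)
        out, state = decoder(inp, state)     # one LSTM step updates the hidden/cell states
        logits.append(out_proj(out))         # unnormalized scores over the target vocabulary
    return torch.cat(logits, dim=1)

# At inference time the previously *predicted* word id is embedded and fed back instead.
```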
Seq2seq (cont.)
• Predicted word at time step j is given by a softmax layer:
  p(y_j | y_(<j), x) = softmax(W_out h_j)
• W_out is a weight matrix
• Softmax function:
  softmax(y_j)_k = exp(y_jk) / Σ_k' exp(y_jk')
• y_jk is the value of the kth dimension of the output vector at time step j
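The output layer in code, with an illustrative target vocabulary size and a placeholder decoder hidden state:

```python
import torch
import torch.nn as nn

TGT_VOCAB, HID_DIM = 10_000, 512              # illustrative sizes
W_out = nn.Linear(HID_DIM, TGT_VOCAB)         # the output weight matrix W_out

h_j = torch.randn(HID_DIM)                    # decoder hidden state at time step j
y_j = W_out(h_j)                              # output vector (one value y_jk per vocabulary word)
p_j = torch.softmax(y_j, dim=-1)              # p_jk = exp(y_jk) / sum_k' exp(y_jk')
predicted_word = p_j.argmax().item()          # greedy choice of the predicted word
```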
Softmax example

Source: (Bahuleyan, 2018)


Seq2seq model

Source: (Bahuleyan, 2018)


Selecting the word at each time step of the decoder
• Greedy search: select the word with the highest p(y_j) given by the softmax layer
• Beam search: at each time step, keep the k candidate sequences with the highest combined probability (a code sketch follows the beam search examples below)
• k – beam width (typically 5–10)
Beam search
• Multiple possible replies can be generated in response to "Who does John like?"

Image source: https://www.analyticsvidhya.com/blog/2018/03/essentials-of-deep-learning-sequence-to-sequence-modelling-with-attention-part-i/


Beam search (cont.)
• Choose the proposed path with the maximum combined probability

Image source: https://www.analyticsvidhya.com/blog/2018/03/essentials-of-deep-learning-sequence-to-sequence-modelling-with-attention-part-i/
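A compact sketch of beam search as described above; `step_fn` is an assumed callback that returns log-probabilities over the vocabulary for a given prefix (e.g. one decoder step plus the softmax layer):

```python
import torch

def beam_search(step_fn, start_id, eos_id, k=5, max_len=20):
    beams = [([start_id], 0.0)]                      # (prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:                 # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)              # (vocab_size,) log p(y_j | prefix, x)
            top_lp, top_ids = log_probs.topk(k)      # expand with the k best next words
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((prefix + [idx], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]   # keep k best paths
    return max(beams, key=lambda c: c[1])[0]         # path with the maximum combined probability

# Greedy search is the special case k = 1.
```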


Seq2seq resources
• https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
Attention mechanism in RNN encoder-decoder networks – Intuitions
• Dynamically align the target sequence with the source sequence in the decoder
• Pay different levels of attention to words in the input sequence at each time step in the decoder
• At each time step, the decoder is provided access to all encoded source tokens
• The decoder gives higher weights to certain source tokens and lower weights to others
Attention mechanism – Formal definition
• Compute a probability distribution over source positions at each decoding time step j:
  α_ji = exp(e_ji) / Σ_i' exp(e_ji')
  where α_ji is the weight given to source output i and e_ji is a pre-normalized score
Attention mechanism – Formal definition (cont.)
• Two methods to compute e_ji (h_j – decoder state at step j, h_i – source output i):
• Multiplicative (Luong et al., 2015):
  e_ji = h_j^T W h_i
• Additive (Bahdanau et al., 2014):
  e_ji = v^T tanh(W_1 h_j + W_2 h_i)

Attention mechanism – Formal definition (cont.)
• Take the sum of the source outputs weighted by α_ji to get the context vector:
  c_j = Σ_i α_ji h_i
• Compute the attention vector:
  a_j = tanh(W_a [c_j; h_j])
• Finally, feed the attention vector a_j to the softmax layer
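Putting the pieces together in a sketch of one decoding step's attention computation (it reuses the hypothetical score functions above; `W_a` is an assumed layer combining [c_j; h_j]):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HID = 512
W_a = nn.Linear(2 * HID, HID, bias=False)             # combines [c_j; h_j]

def attention_step(h_j, h_enc, score_fn):
    e = score_fn(h_j, h_enc)                          # pre-normalized scores e_ji, shape (T,)
    alpha = F.softmax(e, dim=-1)                      # attention weights alpha_ji (sum to 1)
    c_j = alpha @ h_enc                               # context vector: weighted sum of source outputs
    a_j = torch.tanh(W_a(torch.cat([c_j, h_j])))      # attention vector fed to the softmax layer
    return a_j, alpha
```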


Seq2seq model with attention

Figure source: (Bahuleyan, 2018)


Visualizing Attention in Machine Translation (1)

Source: https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
Visualizing Attention in Machine Translation (2)

Source: https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/
Variational Attention for
Sequence-to-Sequence Models
Hareesh Bahuleyan, Lili Mou, Olga Vechtomova, Pascal Poupart
In Proc. COLING 2018
Deterministic Attention in Variational Encoder-Decoder (VED)
• The decoder LSTM has direct access to the source via c_j
• This may cause the decoder to ignore z – the "bypassing" phenomenon (Bahuleyan et al., 2018)

Figure source: (Bahuleyan, 2018)


Variational Attention
• The context vector c_j is modelled as a Gaussian random variable
• ELBO for the standard VAE:
  E_q(z|x)[log p(x|z)] − KL(q(z|x) || p(z))
• ELBO for the VAE with variational attention:
  E_q(z|x) q(c|x)[log p(y|z, c)] − KL(q(z|x) || p(z)) − Σ_j KL(q(c_j|x) || p(c_j))


Variational Attention (continued)
• Given x, we can assume conditional independence between z and c_j
• Hence, the posterior factorizes as
  q(z, c_j | x) = q(z | x) q(c_j | x)
• Assume separate priors for z and c_j:
  p(z, c_j) = p(z) p(c_j)
• Sampling is done separately, and the KL loss can be computed independently for each term
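Because the posterior factorizes, z and c_j can be drawn with two independent reparameterized samples; a minimal sketch (the tensor shapes and values are placeholders):

```python
import torch

def sample_gaussian(mu, logvar):
    """Reparameterization trick: a sample from N(mu, diag(exp(logvar)))."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

# Placeholder posterior parameters (in the model they come from the encoder / attention)
mu_z, logvar_z = torch.zeros(16), torch.zeros(16)
mu_c, logvar_c = torch.zeros(512), torch.zeros(512)

z = sample_gaussian(mu_z, logvar_z)        # sample for the sentence latent code
c_j = sample_gaussian(mu_c, logvar_c)      # independent sample for the context vector at step j
```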
Seq2Seq VED with Variational Attention

Figure source: (Bahuleyan, 2018)


Seq2Seq VED with Variational Attention
• Loss function:
  J = J_rec + λ_KL [ KL(q(z|x) || p(z)) + γ_a Σ_j KL(q(c_j|x) || p(c_j)) ]
• λ_KL – coefficient for both KL terms
• γ_a – coefficient for the context vector's KL term (kept constant)
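A sketch of how the weighted terms combine, assuming the reconstruction loss and the individual KL values have already been computed:

```python
def ved_loss(recon_nll, kl_z, kl_c_per_step, lambda_kl, gamma_a):
    """Reconstruction term plus both KL terms, weighted as in the slide:
    J = J_rec + lambda_kl * (KL_z + gamma_a * sum_j KL_cj)."""
    return recon_nll + lambda_kl * (kl_z + gamma_a * sum(kl_c_per_step))
```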
Seq2Seq VED with Variational Attention – Prior
• Sentence latent code z prior: p(z) = N(0, I) (same as in the VAE)
• Context vector c_j prior:
• Option 1: p(c_j) = N(0, I)
• Option 2: p(c_j) = N(h̄, I),
  where h̄ = (1/|x|) Σ_i h_i is the mean of the source hidden states
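Since both prior options are Gaussians with identity covariance, the KL term has a closed form; a sketch contrasting the two options (tensor values are placeholders):

```python
import torch

def kl_to_identity_cov_prior(mu, logvar, prior_mean):
    """KL( N(mu, diag(exp(logvar))) || N(prior_mean, I) ), summed over dimensions."""
    return 0.5 * torch.sum(torch.exp(logvar) + (mu - prior_mean) ** 2 - 1.0 - logvar)

h_enc_outputs = torch.randn(7, 512)                   # placeholder source hidden states (T, HID)
mu_c, logvar_c = torch.randn(512), torch.zeros(512)   # placeholder parameters of q(c_j | x)

kl_opt1 = kl_to_identity_cov_prior(mu_c, logvar_c, torch.zeros(512))        # Option 1: p(c_j) = N(0, I)
kl_opt2 = kl_to_identity_cov_prior(mu_c, logvar_c, h_enc_outputs.mean(0))   # Option 2: p(c_j) = N(h̄, I)
```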
Seq2Seq VED with Variational Attention – Posterior
• Both posterior distributions q(z|x) and q(c_j|x) are parameterized by the encoder LSTM
• For the sentence latent space (same as the VAE):
  q(z|x) = N(μ_z, diag(σ_z²))
• For the context vector c_j at time step j:
  q(c_j|x) = N(μ_cj, diag(σ_cj²)),
  where μ_cj is the context vector given by the deterministic attention mechanism,
  and σ_cj is computed using a feed-forward neural network
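A sketch of how the two posteriors could be parameterized; the layer shapes and the exact input to the variance network are assumptions for illustration:

```python
import torch
import torch.nn as nn

HID, Z_DIM = 512, 16                                   # illustrative sizes

# q(z | x): mean and log-variance computed from the encoder's final hidden state
mu_z_layer = nn.Linear(HID, Z_DIM)
logvar_z_layer = nn.Linear(HID, Z_DIM)

# q(c_j | x): mean is the deterministic context vector; the (log-)variance
# comes from a small feed-forward network (its input here is an assumption)
logvar_c_net = nn.Sequential(nn.Linear(HID, HID), nn.Tanh(), nn.Linear(HID, HID))

h_enc_final = torch.randn(HID)                         # placeholder encoder final state
c_j_det = torch.randn(HID)                             # placeholder deterministic context vector

mu_z, logvar_z = mu_z_layer(h_enc_final), logvar_z_layer(h_enc_final)
mu_c, logvar_c = c_j_det, logvar_c_net(c_j_det)
```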
Evaluation
• Tasks and datasets
• Question generation (SQuAD dataset) ~100K QA pairs
• Dialogue (Cornell Movie dialogs corpus) >200K conversational exchanges
• Evaluation measures:
• BLEU scores
• Entropy

• Distinct
Results on the question generation task

Source: Bahuleyan et al. (2018), https://arxiv.org/abs/1712.08207


Results on the conversational (dialogue) system experiment
Examples from the question generation task

Source: Bahuleyan et al. (2018), https://arxiv.org/abs/1712.08207
