Attention-Based Models


CSC421/2516 Lecture 16:

Attention

Roger Grosse and Jimmy Ba

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 1 / 39


Overview

We have seen a few RNN-based sequence prediction models.


It is still challenging to generate long sequences when the decoder only has access to the final hidden states from the encoder.
Machine translation: it’s hard to summarize long sentences in a single vector, so let’s allow the decoder to peek at the input.
Vision: have a network glance at one part of an image at a time, so
that we can understand what information it’s using
This lecture will introduce attention, which drastically improves performance on long sequences.
We can also use attention to build differentiable computers
(e.g. Neural Turing Machines)

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 2 / 39


Attention-Based Machine Translation

Remember the encoder/decoder architecture for machine translation:

The network reads a sentence and stores all the information in its
hidden units.
Some sentences can be really long. Can we really store all the
information in a vector of hidden units?
Let’s make things easier by letting the decoder refer to the input
sentence.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 3 / 39


Attention-Based Machine Translation

We’ll look at the translation model from the classic paper:


Bahdanau et al., Neural machine translation by jointly learning to
align and translate. ICLR, 2015.

Basic idea: each output word comes from one word, or a handful of
words, from the input. Maybe we can learn to attend to only the
relevant ones as we produce the output.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 4 / 39


Attention-Based Machine Translation

The model has both an encoder and a decoder. The encoder


computes an annotation of each word in the input.
It takes the form of a bidirectional RNN. This just means we have an
RNN that runs forwards and an RNN that runs backwards, and we
concatenate their hidden vectors.
The idea: information earlier or later in the sentence can help
disambiguate a word, so we need both directions.
The RNN uses an LSTM-like architecture called gated recurrent units.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 5 / 39


Attention-Based Machine Translation
The decoder network is also an RNN. Like the encoder/decoder translation
model, it makes predictions one word at a time, and its predictions are fed
back in as inputs.
The difference is that it also receives a context vector c(t) at each time step,
which is computed by attending to the inputs.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 6 / 39


Attention-Based Machine Translation
The context vector is computed as a weighted average of the
encoder’s annotations.
$$c^{(i)} = \sum_j \alpha_{ij} h^{(j)}$$

The attention weights are computed as a softmax, where the inputs
depend on the annotation and the decoder’s state:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \qquad e_{ij} = a(s^{(i-1)}, h^{(j)})$$
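As an illustrative sketch (not the paper’s exact architecture), the context vector and attention weights could be computed as follows in NumPy; the additive scoring function a(·, ·) is written as a small one-layer MLP with hypothetical weights Wa, Ua, va:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

def bahdanau_context(s_prev, H, Wa, Ua, va):
    """Compute one context vector c^(i) and its attention weights.

    s_prev : (k,)   previous decoder state s^(i-1)
    H      : (T, k) encoder annotations h^(1), ..., h^(T)
    Wa, Ua : (k, k) weights of the (hypothetical) additive scoring MLP
    va     : (k,)   output projection of the scoring MLP
    """
    e = np.tanh(s_prev @ Wa + H @ Ua) @ va   # e_ij = a(s^(i-1), h^(j)), shape (T,)
    alpha = softmax(e)                       # attention weights, sum to 1
    c = alpha @ H                            # c^(i) = sum_j alpha_ij h^(j)
    return c, alpha

# toy usage with random annotations
k, T = 8, 5
rng = np.random.default_rng(0)
c, alpha = bahdanau_context(rng.normal(size=k), rng.normal(size=(T, k)),
                            rng.normal(size=(k, k)), rng.normal(size=(k, k)),
                            rng.normal(size=k))
```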


Note that the attention function depends on the annotation vector,
rather than the position in the sentence. This means it’s a form of
content-based addressing.
My language model tells me the next word should be an adjective.
Find me an adjective in the input.
Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 7 / 39
Attention-Based Machine Translation
Here’s a visualization of the attention maps at each time step.

Nothing forces the model to go linearly through the input sentence,


but somehow it learns to do it.
It’s not perfectly linear — e.g., French adjectives can come after the
nouns.
Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 8 / 39
Attention-Based Machine Translation

The attention-based translation model does much better than the


encoder/decoder model on long sentences.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 9 / 39


Attention-Based Caption Generation

Attention can also be used to understand images.


We humans can’t process a whole visual scene at once.
The fovea of the eye gives us high-acuity vision in only a tiny region of
our field of view.
Instead, we must integrate information from a series of glimpses.
The next few slides are based on this paper from the UofT machine
learning group:
Xu et al. Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention. ICML, 2015.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 10 / 39


Attention-Based Caption Generation

The caption generation task: take an image as input, and produce a


sentence describing the image.
Encoder: a classification conv net (VGGNet, similar to AlexNet).
This computes a bunch of feature maps over the image.
Decoder: an attention-based RNN, analogous to the decoder in the
translation model
At each time step, the decoder computes an attention map over the
entire image, effectively deciding which regions to focus on.
It receives a context vector, which is the weighted average of the conv
net features.
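A minimal sketch of that weighted average over conv-net features (shapes and names here are hypothetical, not the paper’s exact configuration):

```python
import numpy as np

def spatial_context(features, scores):
    """features : (L, D) conv-net features at L = H*W spatial locations
    scores   : (L,)   unnormalized attention scores produced from the decoder state
    Returns the context vector (D,) and the attention map (L,)."""
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                 # attention map over image locations
    return alpha @ features, alpha      # context = weighted average of features

# toy usage: a 14x14 grid of 512-dimensional features, e.g. from a VGG-style conv layer
rng = np.random.default_rng(0)
ctx, attn_map = spatial_context(rng.normal(size=(14 * 14, 512)), rng.normal(size=14 * 14))
```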

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 11 / 39


Attention-Based Caption Generation
This lets us understand where the network is looking as it generates a
sentence.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 12 / 39


Attention-Based Caption Generation

This can also help us understand the network’s mistakes.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 13 / 39


Computational Cost and Parallelism

There are a few things we should consider when designing an RNN.


Computational cost:
Number of connections. How many add-multiply operations the
forward and backward passes require.
Number of time steps. How many copies of the hidden units to store for
Backpropagation Through Time.
Number of sequential operations. The computations that cannot be
parallelized (the part of the model that requires a for loop).
Maximum path length across time: the shortest path length
between the first encoder input and the last decoder output.
It tells us how easy it is for the RNN to remember / retrieve
information from the input sequence.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 14 / 39


Computational Cost and Parallelism
Consider a standard d layer RNN from Lecture 13 with k hidden
units, training on a sequence of length t.

There are k² connections in each hidden-to-hidden weight matrix, for a
total of t × k² × d connections.
We need to store all t × k × d hidden units during training.
Only k × d hidden units need to be stored at test time.
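As an illustrative back-of-the-envelope example with hypothetical numbers (t = 50 time steps, k = 1000 hidden units, d = 4 layers):

$$t \times k^2 \times d = 50 \times 1000^2 \times 4 = 2 \times 10^8 \ \text{add-multiplies per pass}, \qquad t \times k \times d = 2 \times 10^5 \ \text{stored hidden units}.$$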
Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 15 / 39
Computational Cost and Parallelism

Consider a standard d layer RNN from Lecture 13 with k hidden


units, training on a sequence of length t.

Which hidden layers can be computed in parallel in this RNN?

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 16 / 39


Computational Cost and Parallelism

Consider a standard d layer RNN from Lecture 13 with k hidden


units, training on a sequence of length t.

Both the input embeddings and the outputs of an RNN can be


computed in parallel.
The blue hidden units are independent given the red.
The number of sequential operations is still proportional to t.
Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 17 / 39
Computational Cost and Parallelism
In the standard encoder-decoder RNN, the maximum path length
across time is proportional to the number of time steps.
Attention-based RNNs have a constant path length between the
encoder inputs and the decoder hidden states.
Learning becomes easier if all the information is present in the inputs.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 18 / 39


Computational Cost and Parallelism
Attention-based RNNs achieve efficient content-based addressing at
the cost of re-computing the context vector at each time step.
Bahdanau et al. compute the context vector over the entire input
sequence of length t using a neural network with k² connections.
Computing the context vectors adds a t × k² cost at each time step.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 19 / 39


Computational Cost and Parallelism

In summary:
t: sequence length, d: # layers, k: # neurons at each layer.

Model        training complexity   training memory   test complexity   test memory
RNN          t × k² × d            t × k × d         t × k² × d        k × d
RNN+attn.    t² × k² × d           t² × k × d        t² × k² × d       t × k × d

Attention needs to re-compute context vectors at every time step.
Attention has the benefit of reducing the maximum path length
between long-range dependencies of the input and the target sentences.

Model        sequential operations   maximum path length across time
RNN          t                       t
RNN+attn.    t                       1

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 20 / 39


Improve Parallelism

RNNs are sequential in the sequence length t due to the
hidden-to-hidden lateral connections.
The RNN architecture limits the potential for parallelism on longer sequences.
The idea: remove the lateral connections. We will have a deep
autoregressive model, where the hidden units depend on all the
previous time steps.

Benefit: the number of sequential operations is now independent of
the sequence length.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 21 / 39


Attention is All You Need

Autoregressive models like PixelCNN and WaveNet from Lecture 15


used a fixed context window with causal convolution.
We would like our model to have access to the entire history at each
hidden layer.
But the context has a different input length at each time step.
Max or average pooling is not very effective.
We can use attention to aggregate the context information by
attending to one or a few important tokens from the past history.

Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.

https://arxiv.org/pdf/1706.03762.pdf
Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 22 / 39
Attention is All You Need

In general, attention mappings can be described as a function of a
query and a set of key-value pairs.
Transformers use a “Scaled Dot-Product Attention” to obtain the
context vector:

$$c^{(t)} = \text{attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_K}}\right) V,$$

scaled by the square root of the key dimension d_K.
Invalid connections to the future inputs are masked out to preserve
the autoregressive property.
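A minimal NumPy sketch of scaled dot-product attention with an optional causal mask (dimensions are illustrative; a real Transformer also batches this and adds learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (T_q, T_k)
    if causal:
        # mask out connections to future positions (assumes T_q == T_k)
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                     # context vectors, (T_q, d_v)

# toy usage: causal self-attention over a sequence of 6 tokens
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
c = scaled_dot_product_attention(x, x, x, causal=True)
```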

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 23 / 39


Attention is All You Need

Transformer models attend to both the encoder annotations and its


previous hidden layers.
When attending to the encoder annotations, the model computes the
key-value pairs from linearly transformed encoder outputs.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 24 / 39


Attention is All You Need
Transformer models also use “self-attention” on their previous hidden
layers.
When applying attention to the previous hidden layers, the causal
structure is preserved.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 25 / 39


Attention is All You Need

The Scaled Dot-Product Attention attends to one or a few entries in
the input key-value pairs.
Humans can attend to many things simultaneously.
The idea: apply Scaled Dot-Product Attention multiple times on the
linearly transformed inputs.
$$\text{MultiHead}(Q, K, V) = \text{concat}(c_1, \cdots, c_h)\, W^O, \qquad c_i = \text{attention}(Q W_i^Q, K W_i^K, V W_i^V).$$
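A minimal sketch of multi-head attention under the same assumptions (head count and widths are hypothetical; each head applies its own linear maps before a shared scaled dot-product attention):

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention, as on the previous slide
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv: per-head projection matrices W_i^Q, W_i^K, W_i^V; Wo: (h * d_head, d_model)."""
    heads = [attention(Q @ WiQ, K @ WiK, V @ WiV)   # c_i = attention(Q W_i^Q, K W_i^K, V W_i^V)
             for WiQ, WiK, WiV in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo      # concat(c_1, ..., c_h) W^O

# toy usage: 4 heads of width 4 on a model width of 16
h, d_model, d_head, T = 4, 16, 4, 6
rng = np.random.default_rng(0)
x = rng.normal(size=(T, d_model))
Wq, Wk, Wv = ([rng.normal(size=(d_model, d_head)) for _ in range(h)] for _ in range(3))
Wo = rng.normal(size=(h * d_head, d_model))
out = multi_head_attention(x, x, x, Wq, Wk, Wv, Wo)   # (T, d_model)
```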
Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 26 / 39
Positional Encoding

Unlike RNN and CNN encoders, the attention encoder outputs do
not depend on the order of the inputs. (Why?)
The order of the sequence conveys important information for the
machine translation tasks and language modeling.
The idea: add the positional information of an input token in the sequence
into the input embedding vectors.

$$PE_{pos,2i} = \sin\!\left(pos / 10000^{2i/d_{emb}}\right), \qquad PE_{pos,2i+1} = \cos\!\left(pos / 10000^{2i/d_{emb}}\right)$$

The final input embeddings are the concatenation of the learnable


embedding and the positional encoding.
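A minimal NumPy sketch of this sinusoidal positional encoding (the sequence length and embedding width below are hypothetical):

```python
import numpy as np

def positional_encoding(seq_len, d_emb):
    """Return a (seq_len, d_emb) matrix of sinusoidal positional encodings (d_emb assumed even)."""
    pos = np.arange(seq_len)[:, None]                 # positions 0 .. seq_len - 1
    i = np.arange(d_emb // 2)[None, :]                # index of each sin/cos dimension pair
    angles = pos / np.power(10000.0, 2 * i / d_emb)   # pos / 10000^(2i / d_emb)
    pe = np.zeros((seq_len, d_emb))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_emb=512)       # e.g. 50 tokens, 512-dim embeddings
```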

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 27 / 39


Transformer Machine Translation

The Transformer has an encoder-decoder
architecture similar to the previous RNN models,
except all the recurrent connections
are replaced by attention modules.
The Transformer model uses N stacked
self-attention layers.
Skip connections help preserve the
positional and identity information from
the input sequences.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 28 / 39


Transformer Machine Translation

Self-attention layers learn that “it” can refer to different entities in
different contexts.

Visualization of the 5th to 6th self-attention layer in the encoder.


https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 29 / 39


Transformer Machine Translation

BLEU scores of state-of-the-art models on the WMT14


English-to-German translation task

Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems. 2017.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 30 / 39


Computational Cost and Parallelism
Self-attention allows the model to learn to access information from
the past hidden layer, but decoding is very expensive.
When generating sentences, the computation in the self-attention
decoder grows as the sequence gets longer.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 31 / 39


Computational Cost and Parallelism
t: sequence length, d: # layers, k: # neurons at each layer.

Model         training complexity   training memory   test complexity   test memory
RNN           t × k² × d            t × k × d         t × k² × d        k × d
RNN+attn.     t² × k² × d           t² × k × d        t² × k² × d       t × k × d
transformer   t² × k × d            t × k × d         t² × k × d        t × k × d

Transformer vs RNN: there is a trade-off between the number of sequential
operations and the decoding complexity.
The sequential operations in transformers are independent of the sequence
length, but transformers are very expensive to decode.
Transformers can learn faster than RNNs on parallel processing
hardware for longer sequences.

Model         sequential operations   maximum path length across time
RNN           t                       t
RNN+attn.     t                       1
transformer   d                       1

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 32 / 39


Transformer Language Pre-training
Similar to pre-training computer vision models on ImageNet, we can
pre-train a language model for NLP tasks.
The pre-trained model is then fine-tuned on textual entailment,
question answering, semantic similarity assessment, and document
classification.

Radford, Alec, et al. “Improving Language Understanding by Generative Pre-Training.” 2018.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 33 / 39


Transformer Language Pre-training
Increasing the training data set and the model size yields a noticeable
improvement in the transformer language model. Cherry-picked
generated samples from Radford et al., 2019:

For the full text samples see Radford, Alec, et al. “Language Models are Unsupervised Multitask Learners.” 2019.

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 34 / 39


Neural Turing Machines (optional)

We said earlier that multilayer perceptrons are like differentiable circuits.


Using an attention model, we can build differentiable computers.
We’ve seen hints that sparsity of memory accesses can be useful:

Computers have a huge memory, but they only access a handful of locations
at a time. Can we make neural nets more computer-like?
Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 35 / 39
Neural Turing Machines (optional)
Recall Turing machines:

You have an infinite tape, and a head, which transitions between various
states, and reads and writes to the tape.
“If in state A and the current symbol is 0, write a 0, transition to state B,
and move right.”
These simple machines are universal — they’re capable of doing any
computation that ordinary computers can.
Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 36 / 39
Neural Turing Machines (optional)

Neural Turing Machines are an analogue of Turing machines where all of the
computations are differentiable.
This means we can train the parameters by doing backprop through the
entire computation.
Each memory location stores a
vector.
The read and write heads interact
with a weighted average of memory
locations, just as in the attention
models.
The controller is an RNN (in
particular, an LSTM) which can
issue commands to the read/write
heads.
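A minimal sketch of the soft (differentiable) read and write this describes, using content-based addressing over a memory matrix; the cosine-similarity weighting and the erase/add write rule follow the general NTM structure, but the names and sizes here are illustrative, not a full implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_weights(M, key, beta):
    """Attend over memory rows by cosine similarity to a key; beta sharpens the focus."""
    sim = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sim)

def soft_read(M, w):
    return w @ M                                      # weighted average of memory rows

def soft_write(M, w, erase, add):
    # every location is modified a little, in proportion to its attention weight
    return M * (1 - np.outer(w, erase)) + np.outer(w, add)

# toy usage: 8 memory slots, each storing a 4-dimensional vector
rng = np.random.default_rng(0)
M = rng.normal(size=(8, 4))
w = content_weights(M, key=rng.normal(size=4), beta=5.0)
r = soft_read(M, w)
M = soft_write(M, w, erase=np.full(4, 0.5), add=rng.normal(size=4))
```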

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 37 / 39


Neural Turing Machines (optional)
Repeat copy task: receives a sequence of binary vectors, and has to
output several repetitions of the sequence.
Pattern of memory accesses for the read and write heads:

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 38 / 39


Neural Turing Machines (optional)
Priority sort: receives a sequence of (key, value) pairs, and has to
output the values in sorted order by key.

Sequence of memory accesses:

Roger Grosse and Jimmy Ba CSC421/2516 Lecture 16: Attention 39 / 39
