Attention Based Models
Attention
The network reads a sentence and stores all the information in its hidden units.
Some sentences can be really long. Can we really store all the information in a vector of hidden units?
Let’s make things easier by letting the decoder refer to the input sentence.
Basic idea: each output word comes from one word, or a handful of words, from the input. Maybe we can learn to attend to only the relevant ones as we produce the output.
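As a rough sketch of this idea, here is a minimal dot-product attention step in NumPy (the array names and the exact scoring function are illustrative assumptions, not necessarily the formulation used in the lecture):

    import numpy as np

    def attend(decoder_state, encoder_states):
        # decoder_state:  (k,)   current decoder hidden state (the query)
        # encoder_states: (t, k) one hidden vector per input word
        scores = encoder_states @ decoder_state      # (t,) similarity to each input word
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                     # softmax: weights are positive and sum to 1
        context = weights @ encoder_states           # (k,) weighted average of input vectors
        return context, weights

The decoder then conditions on this context vector (together with its own state) when producing the next output word.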
In summary (t: sequence length, d: number of layers, k: number of neurons per layer):

Model      | training complexity | training memory | test complexity | test memory
RNN        | t × k² × d          | t × k × d       | t × k² × d      | k × d
RNN+attn.  | t² × k² × d         | t² × k × d      | t² × k² × d     | t × k × d
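To make the gap concrete with purely illustrative numbers: for t = 40, k = 1000, d = 2, the plain RNN's training cost is about 40 × 1000² × 2 = 8 × 10⁷ operations, while the attention model's is 40² × 1000² × 2 ≈ 3.2 × 10⁹, a factor of t = 40 more.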
Attention needs to re-compute the context vector at every time step.
Attention has the benefit of reducing the maximum path length between long-range dependencies of the input and the target sentences: any output position can attend directly to any input position in a single step, rather than passing information through up to t recurrent steps.
Model      | sequential operations | maximum path length across time
RNN        | t                     | t
RNN+attn.  | t                     | 1
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems, 2017.
https://arxiv.org/pdf/1706.03762.pdf
Attention is All You Need
PE_{pos, 2i}   = sin(pos / 10000^{2i/d_emb}),
PE_{pos, 2i+1} = cos(pos / 10000^{2i/d_emb}).
Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems, 2017.
For the full text samples see Radford, Alec, et al. "Language Models are Unsupervised Multitask Learners." 2019.
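A minimal NumPy sketch of these sinusoidal positional encodings (the function and variable names are my own, and it assumes d_emb is even):

    import numpy as np

    def positional_encoding(max_len, d_emb):
        # Returns a (max_len, d_emb) array: one encoding vector per position.
        pos = np.arange(max_len)[:, None]                # positions 0 .. max_len-1
        i = np.arange(d_emb // 2)[None, :]               # index over dimension pairs
        angles = pos / np.power(10000.0, 2 * i / d_emb)  # pos / 10000^(2i/d_emb)
        pe = np.zeros((max_len, d_emb))
        pe[:, 0::2] = np.sin(angles)                     # even dimensions get the sine
        pe[:, 1::2] = np.cos(angles)                     # odd dimensions get the cosine
        return pe

Because each dimension is a sinusoid of a different wavelength, the encoding of position pos + k is a fixed linear function of the encoding of pos, which the authors hypothesize lets the model attend by relative position.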
Computers have a huge memory, but they only access a handful of locations
at a time. Can we make neural nets more computer-like?
Neural Turing Machines (optional)
Recall Turing machines:
You have an infinite tape, and a head, which transitions between various
states, and reads and writes to the tape.
“If in state A and the current symbol is 0, write a 0, transition to state B,
and move right.”
These simple machines are universal — they’re capable of doing any
computation that ordinary computers can.
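As a hypothetical illustration (not from the lecture), one step of such a machine can be written as a lookup in a transition table:

    # Transition table: (state, symbol) -> (symbol to write, head move, next state).
    delta = {
        ("A", 0): (0, +1, "B"),   # "If in state A and the current symbol is 0,
                                  #  write a 0, transition to state B, and move right."
        ("B", 0): (1, -1, "A"),
    }

    def step(state, tape, head):
        symbol = tape.get(head, 0)                        # unvisited cells read as blank (0)
        write, move, next_state = delta[(state, symbol)]
        tape[head] = write
        return next_state, tape, head + move

Note that the lookup, the write, and the head move are all discrete, which is exactly what Neural Turing Machines relax.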
Neural Turing Machines are an analogue of Turing machines where all of the
computations are differentiable.
This means we can train the parameters by doing backprop through the
entire computation.
Each memory location stores a vector.
The read and write heads interact with a weighted average of memory locations, just as in the attention models.
The controller is an RNN (in particular, an LSTM) which can issue commands to the read/write heads.
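A minimal sketch of the soft read operation (NumPy; this is simplified content-based addressing, not the full NTM addressing mechanism, and the names are my own):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def soft_read(memory, key, beta=1.0):
        # memory: (N, M) array, one M-dimensional vector per memory location
        # key:    (M,)   query vector emitted by the controller
        # beta:   sharpness of the addressing (higher = closer to a hard lookup)
        sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
        w = softmax(beta * sim)      # differentiable addressing weights over locations
        return w @ memory            # read vector: weighted average of all memory rows

Writes blend new content into memory using the same kind of weights, so the whole read/write loop stays differentiable and can be trained with backprop.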