Attention and Transformers
Fei-Fei Li, Jiajun Wu, Ruohan Gao. Lecture 11, May 03, 2022.
Administrative
- Project proposal grades released. Check feedback on Gradescope!
Last Time: Recurrent Neural Networks
A variable-length computation graph with shared weights W: the same cell fW is applied at every step, producing hidden states h1, ..., hT from inputs x1, ..., xT (starting from h0), with per-step outputs y1, ..., yT and losses L1, ..., LT.
Sequence to Sequence with RNNs
Input: Sequence x1, ..., xT
Output: Sequence y1, ..., yT'

Encoder: an RNN processes x1, ..., xT into hidden states h1, ..., hT; from the final hidden state we form the initial decoder state s0 and a single context vector c.

Decoder: st = gU(yt-1, st-1, c), emitting one output token per step starting from y0 = [START] until [STOP] (e.g. "estamos comiendo pan [STOP]" for the input "we are eating bread").

Sutskever et al, "Sequence to sequence learning with neural networks", NeurIPS 2014
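To make the two-stage structure concrete, here is a minimal NumPy sketch (dimensions, weight names, and the tanh cells are assumptions for illustration, not the lecture's code): the encoder compresses the whole input into c, and the decoder unrolls st = gU(yt-1, st-1, c).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                            # hidden size (assumption)
W_enc = rng.standard_normal((D, 2 * D)) * 0.1
W_dec = rng.standard_normal((D, 3 * D)) * 0.1    # decoder cell also sees c

def rnn_step(W, *inputs):
    """One tanh RNN cell applied to the concatenation of its inputs."""
    return np.tanh(W @ np.concatenate(inputs))

xs = [rng.standard_normal(D) for _ in range(4)]  # x_1 .. x_4

# Encoder: h_t = f_W(x_t, h_{t-1}); everything is squeezed into h_T.
h = np.zeros(D)
for x in xs:
    h = rnn_step(W_enc, x, h)
c, s = h, h          # context vector c and initial decoder state s_0

# Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c); y_t would come from a softmax
# over the vocabulary given s_t (omitted here).
y = np.zeros(D)      # embedding of [START]
for t in range(4):
    s = rnn_step(W_dec, y, s, c)
```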
Sequence to Sequence with RNNs and Attention
Input: Sequence x1, ..., xT
Output: Sequence y1, ..., yT'

From the final encoder hidden state, compute the initial decoder state s0.

Compute (scalar) alignment scores: et,i = fatt(st-1, hi)   (fatt is an MLP)

Normalize the alignment scores with a softmax to get attention weights: 0 < at,i < 1, ∑i at,i = 1

Compute the context vector as a linear combination of the encoder hidden states: ct = ∑i at,i hi

Use the context vector in the decoder: st = gU(yt-1, st-1, ct)

Intuition: the context vector attends to the relevant part of the input sequence. For "we are eating bread", "estamos" = "we are", so maybe a11 = a12 = 0.45 and a13 = a14 = 0.05.

This is all differentiable! There is no supervision on the attention weights; we simply backprop through everything.

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
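A minimal NumPy sketch of a single attention-augmented decoder step (shapes, the two-layer fatt, and the weight names are illustrative assumptions): it computes the alignment scores et,i, softmaxes them into attention weights, and forms the context vector ct.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 4, 8                                   # input length, hidden size (assumptions)
H = rng.standard_normal((T, D))               # encoder hidden states h_1..h_T
s_prev = rng.standard_normal(D)               # previous decoder state s_{t-1}

# f_att as a small MLP on [s_{t-1}; h_i] -> scalar (one possible choice)
W1 = rng.standard_normal((16, 2 * D)) * 0.1
w2 = rng.standard_normal(16) * 0.1

def f_att(s, h):
    return w2 @ np.tanh(W1 @ np.concatenate([s, h]))

e = np.array([f_att(s_prev, h_i) for h_i in H])   # alignment scores e_{t,i}
a = np.exp(e - e.max()); a /= a.sum()             # softmax -> attention weights a_{t,i}
c = a @ H                                         # context vector c_t = sum_i a_{t,i} h_i
# c is then fed to the decoder cell: s_t = g_U(y_{t-1}, s_{t-1}, c_t)
```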
Sequence to Sequence with RNNs and Attention
Repeat at every decoder timestep: use s1 to compute new alignment scores e2,i, attention weights a2,i, and a new context vector c2; then use c2 (together with y1) to compute s2 and y2, and so on.

Intuition: each context vector attends to the relevant part of the input sequence. For "we are eating bread", "comiendo" = "eating", so maybe a21 = a24 = 0.05, a22 = 0.1, a23 = 0.8.

Each timestep of the decoder uses a different context vector c1, c2, c3, c4.

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
Sequence to Sequence with RNNs and Attention
Visualize the attention weights at,i. Example: English to French translation of the input "The agreement on the European Economic Area was signed in August 1992."

Diagonal attention means the words correspond in order; attention also figures out the different word orders between the two languages.

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
Sequence to Sequence with RNNs and Attention
The decoder doesn't use the fact that the hi form an ordered sequence; it just treats them as an unordered set {hi}.
We can use a similar architecture given any set of input hidden vectors {hi}!
Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
Image Captioning using spatial features
Input: Image I
Output: Sequence y = y1, y2, ..., yT

Encoder: h0 = fW(z), where z is the grid of spatial CNN features and fW(.) is an MLP.

Decoder: yt = gV(yt-1, ht-1, c), where the context vector c is often just c = h0.

Problem: the input is "bottlenecked" through c. The model needs to encode everything it wants to say within c, which is a problem if we want to generate really long descriptions (100s of words).
Image Captioning with RNNs and Attention
Instead, attend over the H x W grid of CNN features at every decoder step:
- Compute alignment scores (scalars) et,i,j = fatt(ht-1, zi,j) over the grid.
- Normalize them with a softmax over all H x W positions to get attention weights at,i,j.
- Compute the context vector ct = ∑i,j at,i,j zi,j.

Decoder: yt = gV(yt-1, ht-1, ct), with a new context vector at every time step. Each timestep of the decoder uses a different context vector that looks at different parts of the input image while generating "person wearing hat [END]".

This entire process is differentiable: the model chooses its own attention weights, and no attention supervision is required.
Image Captioning with Attention
Soft attention (differentiable, as above) vs. hard attention, which selects discrete attention locations and requires reinforcement learning to train.
Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Attention we just saw in image captioning
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)

Operations:
- Alignment: ei,j = fatt(h, zi,j)
- Attention: a = softmax(e)
- Output: c = ∑i,j ai,j zi,j

Outputs:
- Context vector: c (shape: D)
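A NumPy sketch of exactly these three operations over an H x W x D feature grid (using a dot product for fatt, which is an assumption made for brevity; the lecture's fatt is an MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 3, 3, 8
z = rng.standard_normal((H, W, D))       # spatial CNN features
h = rng.standard_normal(D)               # query (decoder hidden state)

e = z @ h                                # alignment scores, shape (H, W)
a = np.exp(e - e.max()); a /= a.sum()    # softmax over all H*W positions
c = (a[..., None] * z).sum(axis=(0, 1))  # context vector, shape (D,)
```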
General attention layer
Stretch the H x W grid into N = H x W input vectors; the attention operation is permutation invariant and doesn't care about the ordering of the features.

Inputs:
- Input vectors: x (shape: N x D)
- Query: h (shape: D)

Operations:
- Alignment: ei = fatt(h, xi)
- Attention: a = softmax(e)
- Output: c = ∑i ai xi

Outputs:
- Context vector: c (shape: D)
General attention layer
Change fatt(.) to a simple dot product: ei = h ᐧ xi. This only works well together with the key and value transformation trick (introduced in a few slides).

Better: use a scaled dot product, ei = h ᐧ xi / √D.
- Larger dimensions mean more terms in the dot-product sum, so the variance of the logits is higher; large-magnitude vectors produce much larger logits.
- The post-softmax distribution then has lower entropy (it becomes too peaked), assuming the logits are IID. Dividing by √D keeps the logit scale under control; a small numerical check follows below.
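A small NumPy check of the claim above (the sizes are arbitrary assumptions): for unit-variance random vectors, the raw dot product has standard deviation around √D, while the scaled version stays near 1, which keeps the softmax from saturating.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 256
x = rng.standard_normal((N, D))          # input vectors
h = rng.standard_normal(D)               # query

raw = x @ h                              # logit scale grows like sqrt(D)
scaled = raw / np.sqrt(D)                # logit scale stays around 1

def softmax(v):
    v = v - v.max()
    p = np.exp(v)
    return p / p.sum()

print(raw.std(), scaled.std())           # ~sqrt(D) vs ~1
print(softmax(raw))                      # nearly one-hot (low entropy)
print(softmax(scaled))                   # much smoother attention weights
```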
General attention layer
Multiple query vectors: each query creates its own output context vector.

Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x D)

Operations:
- Alignment: ei,j = qj ᐧ xi / √D
- Attention: a = softmax(e)   (softmax over i, separately for each query j)
- Output: yj = ∑i ai,j xi

Outputs:
- Context vectors: y (one vector of shape D per query)
General attention layer
Notice that the input vectors are used for both the alignment and the attention calculations. We can add more expressivity to the layer by adding a different FC layer before each of the two steps:
- Key vectors: k = xWk (used for alignment)
- Value vectors: v = xWv (used to form the outputs)

The input and output dimensions can now change depending on the key and value FC layers.

Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x Dk)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Alignment: ei,j = qj ᐧ ki / √D
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi

Outputs:
- Context vectors: y (one vector of shape Dv per query)
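A NumPy sketch of the full general attention layer above, with learned key and value projections (Wk, Wv, and the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D, Dk, Dv = 5, 3, 8, 16, 32
x = rng.standard_normal((N, D))           # input vectors
q = rng.standard_normal((M, Dk))          # query vectors
Wk = rng.standard_normal((D, Dk)) * 0.1   # key projection
Wv = rng.standard_normal((D, Dv)) * 0.1   # value projection

k = x @ Wk                                # keys:   N x Dk
v = x @ Wv                                # values: N x Dv
e = (q @ k.T) / np.sqrt(Dk)               # e[j, i] = q_j . k_i / sqrt(Dk), shape M x N
a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)         # softmax over the N inputs, per query
y = a @ v                                 # outputs: M x Dv (one context vector per query)
```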
General attention layer
Recall the image captioning setup: the encoder computed h0 = fW(z), where z is the grid of spatial CNN features and fW(.) is an MLP. In the general attention layer, that same grid of CNN features (flattened into N vectors) supplies the input vectors x, from which the keys and values are computed.
Self attention layer
There are no separate input query vectors anymore; the queries are also computed from the input vectors themselves. One self-attention layer maps a set of input vectors x0, x1, x2 to a set of output vectors y0, y1, y2.

Inputs:
- Input vectors: x (shape: N x D)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj ᐧ ki / √D
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi

Outputs:
- Context vectors: y (one vector of shape Dv per input)
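A NumPy sketch of the self-attention layer (projection sizes are assumptions): the only difference from the general attention layer above is that the queries also come from x. The final assertion demonstrates the permutation property discussed on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Dk, Dv = 4, 8, 16, 16
x = rng.standard_normal((N, D))
Wq = rng.standard_normal((D, Dk)) * 0.1
Wk = rng.standard_normal((D, Dk)) * 0.1
Wv = rng.standard_normal((D, Dv)) * 0.1

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # all three come from x
    e = (q @ k.T) / np.sqrt(Dk)               # N x N alignment scores
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)         # softmax over the inputs, per query
    return a @ v                              # N x Dv outputs

y = self_attention(x)

# Permutation equivariance: permuting the inputs permutes the outputs the same way.
perm = np.array([2, 0, 3, 1])
assert np.allclose(self_attention(x[perm]), y[perm])
```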
Self attention layer - attends over sets of inputs
Self-attention is permutation equivariant: permuting the input vectors x0, x1, x2 permutes the output vectors y0, y1, y2 in exactly the same way.
Problem: How can we encode ordered sequences like language or spatially ordered image features?
Positional encoding
Concatenate (or add) a special positional encoding pj to each input vector xj before self-attention. We use a function pos: ℕ → ℝ^d to map the position j of the vector into a d-dimensional vector, so pj = pos(j).

Desiderata of pos(.):
1. It should output a unique encoding for each time-step (the word's position in a sentence).
2. Distance between any two time-steps should be consistent across sentences with different lengths.
3. Our model should generalize to longer sentences without any effort; its values should be bounded.
4. It must be deterministic.

Options for pos(.):
1. Learn a lookup table: learn the parameters to use for pos(t) for t ∈ [0, T); the lookup table contains T x d parameters.
2. Use a fixed function with these properties, e.g. the sinusoidal encoding from the cited Vaswani et al paper (a sketch follows below).

Vaswani et al, "Attention is all you need", NeurIPS 2017
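A sketch of the fixed sinusoidal encoding from Vaswani et al, "Attention is all you need" (NeurIPS 2017), which satisfies the desiderata above: deterministic, bounded in [-1, 1], unique per position, and defined for arbitrarily long sequences. The sizes here are assumptions.

```python
import numpy as np

def positional_encoding(T, d):
    """pos(j)[2i]   = sin(j / 10000^(2i/d))
       pos(j)[2i+1] = cos(j / 10000^(2i/d))   (Vaswani et al., 2017; d assumed even)"""
    pe = np.zeros((T, d))
    pos = np.arange(T)[:, None]                       # positions j
    div = np.power(10000.0, np.arange(0, d, 2) / d)   # per-frequency divisors
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe                                          # shape (T, d), values in [-1, 1]

# Add (or concatenate) to the input vectors before self-attention:
x = np.random.randn(16, 64)                 # 16 tokens, d = 64 (assumed sizes)
x = x + positional_encoding(16, 64)
```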
Masked self-attention layer
Prevent vectors from looking at future vectors: manually set the alignment scores ei,j to -∞ wherever key position i comes after query position j, so the corresponding attention weights become 0 after the softmax.

Inputs:
- Input vectors: x (shape: N x D)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj ᐧ ki / √D, masked to -∞ for i > j
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
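A NumPy sketch of the causal mask (sizes are assumptions): scores on future positions are set to -∞ before the softmax, so each output yj only mixes values v0, ..., vj.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Dk = 4, 8
q = rng.standard_normal((N, Dk))
k = rng.standard_normal((N, Dk))
v = rng.standard_normal((N, Dk))

e = (q @ k.T) / np.sqrt(Dk)                        # e[j, i] = q_j . k_i / sqrt(Dk)
mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True where i > j (future keys)
e[mask] = -np.inf                                  # forbid attending to the future

a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)                  # rows are lower-triangular distributions
y = a @ v                                          # y_j depends only on v_0..v_j
```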
Multi-head self attention layer
Run multiple self-attention heads in parallel: split each input vector into chunks, run an independent self-attention layer on each chunk, and concatenate the per-head outputs (a sketch follows below).
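A NumPy sketch of multi-head self-attention by splitting the feature dimension across heads (the head count, sizes, and per-head weight layout are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, heads = 4, 16, 4
Dh = D // heads                            # per-head dimension
x = rng.standard_normal((N, D))
# one (Wq, Wk, Wv) triple per head
Wq, Wk, Wv = (rng.standard_normal((heads, Dh, Dh)) * 0.1 for _ in range(3))

def head(xh, Wq_h, Wk_h, Wv_h):
    q, k, v = xh @ Wq_h, xh @ Wk_h, xh @ Wv_h
    e = (q @ k.T) / np.sqrt(Dh)
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ v                           # N x Dh

# Split x into `heads` chunks along the feature dim, attend, then concatenate.
chunks = np.split(x, heads, axis=1)
y = np.concatenate(
    [head(c, Wq[h], Wk[h], Wv[h]) for h, c in enumerate(chunks)], axis=1
)                                          # N x D
```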
General attention versus self-attention
A general attention layer takes separate key, value, and query vectors as inputs; a self-attention layer takes only the input vectors x and computes its own keys, values, and queries from them.
Example: CNN with Self-Attention
- An input image goes through a CNN to produce features of shape C x H x W.
- Queries, keys, and values: each C' x H x W, produced by separate 1x1 convolutions on the features.
- Attention weights: softmax over (transposed queries) times keys, giving an (H x W) x (H x W) matrix.
- Multiply the attention weights with the values to get attended features (C' x H x W), pass them through one more 1x1 convolution, and add the result back to the input features with a residual connection.
- Together this forms a self-attention module that can be dropped into a CNN (a sketch follows below).

Cat image is free to use under the Pixabay License.
Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018. Slide credit: Justin Johnson
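A NumPy sketch of such a module (channel sizes and weight names are assumptions; the 1x1 convolutions are implemented as per-pixel matrix multiplies over the channel dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
C, Cp, H, W = 32, 8, 7, 7                      # Cp plays the role of C' (assumed sizes)
feat = rng.standard_normal((C, H, W))          # CNN features, C x H x W

# 1x1 convolutions = matrix multiplies over the channel dimension
Wq, Wk, Wv = (rng.standard_normal((Cp, C)) * 0.1 for _ in range(3))
Wo = rng.standard_normal((C, Cp)) * 0.1        # final 1x1 conv back to C channels

f = feat.reshape(C, H * W)                     # flatten the spatial grid: C x (HW)
q, k, v = Wq @ f, Wk @ f, Wv @ f               # each C' x (HW)

e = (q.T @ k) / np.sqrt(Cp)                    # (HW) x (HW) attention logits
a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)              # each position attends over all positions

attended = v @ a.T                             # C' x (HW)
out = feat + (Wo @ attended).reshape(C, H, W)  # 1x1 conv + residual connection
```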
Comparing RNNs to Transformer
RNNs
(+) LSTMs work reasonably well for long sequences.
(-) Expect an ordered sequence of inputs.
(-) Sequential computation: subsequent hidden states can only be computed after the previous ones are done.

Transformer:
(+) Good at long sequences: each attention calculation looks at all inputs.
(+) Can operate over unordered sets, or over ordered sequences with positional encodings.
(+) Parallel computation: all alignment and attention scores for all inputs can be computed in parallel.
(-) Requires a lot of memory: N x M alignment and attention scalars need to be calculated and stored for a single self-attention head (but GPUs are getting bigger and better).
“ImageNet Moment for Natural Language Processing”
Pretraining:
Download a lot of text from the internet
Finetuning:
Fine-tune the Transformer on your own NLP task
Image Captioning using Transformers
Input: Image I
Output: Sequence y = y1, y2, ..., yT

Encoder: c = TW(z), where z is the grid of spatial CNN features and TW(.) is the transformer encoder.

Decoder: yt = TD(y0:t-1, c), where TD(.) is the transformer decoder.
The Transformer encoder block
Made up of N encoder blocks ("xN").

Inputs: a set of vectors x (plus positional encoding).
Outputs: a set of vectors y.

Inside each block:
- Multi-head self-attention
- Residual connection, then layer norm
- MLP (applied independently to each vector)
- Residual connection, then layer norm

Layer norm and the MLP operate independently per vector, so the only interaction between vectors happens inside self-attention (a sketch follows below).

Highly scalable, highly parallelizable, but high memory usage.
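A compact NumPy sketch of one post-norm encoder block under these assumptions: single-head self-attention for brevity, a simple two-layer MLP, and untrained random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Dff = 6, 32, 64                      # tokens, model dim, MLP hidden dim
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.05 for _ in range(4))
W1 = rng.standard_normal((D, Dff)) * 0.05
W2 = rng.standard_normal((Dff, D)) * 0.05

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)           # per-vector normalization

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    e = (q @ k.T) / np.sqrt(D)
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return (a @ v) @ Wo

def encoder_block(x):
    x = layer_norm(x + self_attention(x))           # attention + residual + layer norm
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)  # per-vector MLP + residual + layer norm
    return x

x = rng.standard_normal((N, D))            # token vectors (with positional encoding added)
for _ in range(3):                         # "xN": stack N blocks (here N = 3)
    x = encoder_block(x)
```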
The Transformer decoder block
Made up of N decoder blocks.

Inputs: the encoder outputs c0,0, ..., c2,2 and the shifted output tokens y0, ..., y3 ([START] person wearing hat).
Outputs: the next tokens (person wearing hat [END]).

Inside each block:
- Masked multi-head self-attention over the previously generated tokens
- Residual connection, then layer norm
- Multi-head (cross-)attention: queries q come from the decoder, keys k and values v come from the transformer encoder outputs. For image captioning, this is how we inject image features into the decoder (a sketch follows below).
- Residual connection, then layer norm
- MLP, residual connection, then layer norm
- A final FC layer produces the output tokens.

Most of the network is the same as in the transformer encoder; the differences are the masking and the extra attention block that interacts with the encoder outputs.

Highly scalable, highly parallelizable, but high memory usage.

Vaswani et al, "Attention is all you need", NeurIPS 2017
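A NumPy sketch of the cross-attention step inside the decoder block (sizes and weight names are assumptions): queries come from the decoder tokens, keys and values from the encoder outputs c.

```python
import numpy as np

rng = np.random.default_rng(0)
T_dec, N_enc, D = 4, 9, 32                   # decoder tokens, encoder outputs, model dim
dec = rng.standard_normal((T_dec, D))        # decoder token vectors (after masked self-attn)
c = rng.standard_normal((N_enc, D))          # transformer encoder outputs
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.05 for _ in range(3))

q = dec @ Wq                                 # queries from the decoder
k, v = c @ Wk, c @ Wv                        # keys and values from the encoder outputs
e = (q @ k.T) / np.sqrt(D)                   # T_dec x N_enc
a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)            # each decoder token attends over the image features
out = a @ v                                  # T_dec x D, fed into residual + layer norm
```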
Image Captioning using transformers
- No recurrence at all.
- Perhaps we don't need convolutions at all?
Image Captioning using ONLY transformers
- Transformers from pixels to language: split the image into patches, feed the patch embeddings to a transformer encoder, and decode the caption y1, ..., y4 from [START] with a transformer decoder.
Dosovitskiy et al, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv 2020
(Colab link to an implementation of vision transformers)

Vision Transformers vs. ResNets
Dosovitskiy et al, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv 2020
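A NumPy sketch of the ViT-style front end described in Dosovitskiy et al: cut the image into 16x16 patches, flatten each patch, and linearly project it into a token vector, with positional information added before the transformer encoder (the projection, the random positional placeholder, and the sizes here are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 64          # image size, channels, patch size, embedding dim
img = rng.standard_normal((H, W, C))
W_embed = rng.standard_normal((P * P * C, D)) * 0.02   # linear patch projection

# Cut the image into non-overlapping P x P patches and flatten each one.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)     # (num_patches, P*P*C) = (196, 768)

tokens = patches @ W_embed                   # (196, D) patch tokens
tokens += rng.standard_normal(tokens.shape) * 0.02  # placeholder positional encodings
# `tokens` would now be fed to the transformer encoder blocks sketched earlier.
```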
Vision Transformers
ConvNets strike back!
Summary
- Adding attention to RNNs allows them to "attend" to different parts of the input at every time step.
- The general attention layer is a new type of layer that can be used to design new neural network architectures.
- Transformers are a type of layer that combines self-attention and layer norm:
  ○ They are highly scalable and highly parallelizable.
  ○ Faster training, larger models, and better performance across vision and language tasks.
  ○ They are quickly replacing RNNs and LSTMs, and may(?) even replace convolutions.
Next time: Video Understanding