Attention and Transformers
Fei-Fei Li, Jiajun Wu, Ruohan Gao. Lecture 11, May 03, 2022.
Administrative
- Project proposal grades released. Check feedback on Gradescope!
Last Time: Recurrent Neural Networks
A variable-length computation graph with shared weights W: the same cell fW is applied at every step, producing hidden states h1, ..., hT from inputs x1, ..., xT (starting from h0), with per-step outputs y1, ..., yT and losses L1, ..., LT.
Sequence to Sequence with RNNs
Input: Sequence x1, ..., xT
Output: Sequence y1, ..., yT'

Encoder: an RNN processes x1, ..., xT into hidden states h1, ..., hT; from the final hidden state we form the initial decoder state s0 and a single context vector c.

Decoder: st = gU(yt-1, st-1, c), emitting one output token per step starting from y0 = [START] until [STOP] (e.g. "estamos comiendo pan [STOP]" for the input "we are eating bread").

Sutskever et al, "Sequence to sequence learning with neural networks", NeurIPS 2014
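To make the two-stage structure concrete, here is a minimal NumPy sketch (dimensions, weight names, and the tanh cells are assumptions for illustration, not the lecture's code): the encoder compresses the whole input into c, and the decoder unrolls st = gU(yt-1, st-1, c).

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                            # hidden size (assumption)
W_enc = rng.standard_normal((D, 2 * D)) * 0.1
W_dec = rng.standard_normal((D, 3 * D)) * 0.1    # decoder cell also sees c

def rnn_step(W, *inputs):
    """One tanh RNN cell applied to the concatenation of its inputs."""
    return np.tanh(W @ np.concatenate(inputs))

xs = [rng.standard_normal(D) for _ in range(4)]  # x_1 .. x_4

# Encoder: h_t = f_W(x_t, h_{t-1}); everything is squeezed into h_T.
h = np.zeros(D)
for x in xs:
    h = rnn_step(W_enc, x, h)
c, s = h, h          # context vector c and initial decoder state s_0

# Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c); y_t would come from a softmax
# over the vocabulary given s_t (omitted here).
y = np.zeros(D)      # embedding of [START]
for t in range(4):
    s = rnn_step(W_dec, y, s, c)
```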
Sequence to Sequence with RNNs and Attention
Input: Sequence x1, ..., xT
Output: Sequence y1, ..., yT'

From the final encoder hidden state, compute the initial decoder state s0.

Compute (scalar) alignment scores: et,i = fatt(st-1, hi)   (fatt is an MLP)

Normalize the alignment scores with a softmax to get attention weights: 0 < at,i < 1, ∑i at,i = 1

Compute the context vector as a linear combination of the encoder hidden states: ct = ∑i at,i hi

Use the context vector in the decoder: st = gU(yt-1, st-1, ct)

Intuition: the context vector attends to the relevant part of the input sequence. For "we are eating bread", "estamos" = "we are", so maybe a11 = a12 = 0.45 and a13 = a14 = 0.05.

This is all differentiable! There is no supervision on the attention weights; we simply backprop through everything.

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
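A minimal NumPy sketch of a single attention-augmented decoder step (shapes, the two-layer fatt, and the weight names are illustrative assumptions): it computes the alignment scores et,i, softmaxes them into attention weights, and forms the context vector ct.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 4, 8                                   # input length, hidden size (assumptions)
H = rng.standard_normal((T, D))               # encoder hidden states h_1..h_T
s_prev = rng.standard_normal(D)               # previous decoder state s_{t-1}

# f_att as a small MLP on [s_{t-1}; h_i] -> scalar (one possible choice)
W1 = rng.standard_normal((16, 2 * D)) * 0.1
w2 = rng.standard_normal(16) * 0.1

def f_att(s, h):
    return w2 @ np.tanh(W1 @ np.concatenate([s, h]))

e = np.array([f_att(s_prev, h_i) for h_i in H])   # alignment scores e_{t,i}
a = np.exp(e - e.max()); a /= a.sum()             # softmax -> attention weights a_{t,i}
c = a @ H                                         # context vector c_t = sum_i a_{t,i} h_i
# c is then fed to the decoder cell: s_t = g_U(y_{t-1}, s_{t-1}, c_t)
```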
Sequence to Sequence with RNNs and Attention
Repeat at every decoder timestep: use s1 to compute new alignment scores e2,i, attention weights a2,i, and a new context vector c2; then use c2 (together with y1) to compute s2 and y2, and so on.

Intuition: each context vector attends to the relevant part of the input sequence. For "we are eating bread", "comiendo" = "eating", so maybe a21 = a24 = 0.05, a22 = 0.1, a23 = 0.8.

Each timestep of the decoder uses a different context vector c1, c2, c3, c4.

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
Sequence to Sequence with RNNs and Attention
Visualize the attention weights at,i. Example: English to French translation of the input "The agreement on the European Economic Area was signed in August 1992."

Diagonal attention means the words correspond in order; attention also figures out the different word orders between the two languages.

Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
Sequence to Sequence with RNNs and Attention
The decoder doesn't use the fact that the hi form an ordered sequence; it just treats them as an unordered set {hi}.
We can use a similar architecture given any set of input hidden vectors {hi}!
Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
Image Captioning using spatial features
Input: Image I
Output: Sequence y = y1, y2, ..., yT

Encoder: h0 = fW(z), where z is the grid of spatial CNN features and fW(.) is an MLP.

Decoder: yt = gV(yt-1, ht-1, c), where the context vector c is often just c = h0.

Problem: the input is "bottlenecked" through c. The model needs to encode everything it wants to say within c, which is a problem if we want to generate really long descriptions (100s of words).
Image Captioning with RNNs and Attention
Instead, attend over the H x W grid of CNN features at every decoder step:
- Compute alignment scores (scalars) et,i,j = fatt(ht-1, zi,j) over the grid.
- Normalize them with a softmax over all H x W positions to get attention weights at,i,j.
- Compute the context vector ct = ∑i,j at,i,j zi,j.

Decoder: yt = gV(yt-1, ht-1, ct), with a new context vector at every time step. Each timestep of the decoder uses a different context vector that looks at different parts of the input image while generating "person wearing hat [END]".

This entire process is differentiable: the model chooses its own attention weights, and no attention supervision is required.
Image Captioning with Attention
Soft attention (differentiable, as above) vs. hard attention, which selects discrete attention locations and requires reinforcement learning to train.
Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Attention we just saw in image captioning
Inputs:
- Features: z (shape: H x W x D)
- Query: h (shape: D)

Operations:
- Alignment: ei,j = fatt(h, zi,j)
- Attention: a = softmax(e)
- Output: c = ∑i,j ai,j zi,j

Outputs:
- Context vector: c (shape: D)
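A NumPy sketch of exactly these three operations over an H x W x D feature grid (using a dot product for fatt, which is an assumption made for brevity; the lecture's fatt is an MLP):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 3, 3, 8
z = rng.standard_normal((H, W, D))       # spatial CNN features
h = rng.standard_normal(D)               # query (decoder hidden state)

e = z @ h                                # alignment scores, shape (H, W)
a = np.exp(e - e.max()); a /= a.sum()    # softmax over all H*W positions
c = (a[..., None] * z).sum(axis=(0, 1))  # context vector, shape (D,)
```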
General attention layer
Stretch the H x W grid into N = H x W input vectors; the attention operation is permutation invariant and doesn't care about the ordering of the features.

Inputs:
- Input vectors: x (shape: N x D)
- Query: h (shape: D)

Operations:
- Alignment: ei = fatt(h, xi)
- Attention: a = softmax(e)
- Output: c = ∑i ai xi

Outputs:
- Context vector: c (shape: D)
General attention layer
Change fatt(.) to a simple dot product: ei = h ᐧ xi. This only works well together with the key and value transformation trick (introduced in a few slides).

Better: use a scaled dot product, ei = h ᐧ xi / √D.
- Larger dimensions mean more terms in the dot-product sum, so the variance of the logits is higher; large-magnitude vectors produce much larger logits.
- The post-softmax distribution then has lower entropy (it becomes too peaked), assuming the logits are IID. Dividing by √D keeps the logit scale under control; a small numerical check follows below.
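A small NumPy check of the claim above (the sizes are arbitrary assumptions): for unit-variance random vectors, the raw dot product has standard deviation around √D, while the scaled version stays near 1, which keeps the softmax from saturating.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 256
x = rng.standard_normal((N, D))          # input vectors
h = rng.standard_normal(D)               # query

raw = x @ h                              # logit scale grows like sqrt(D)
scaled = raw / np.sqrt(D)                # logit scale stays around 1

def softmax(v):
    v = v - v.max()
    p = np.exp(v)
    return p / p.sum()

print(raw.std(), scaled.std())           # ~sqrt(D) vs ~1
print(softmax(raw))                      # nearly one-hot (low entropy)
print(softmax(scaled))                   # much smoother attention weights
```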
General attention layer
Multiple query vectors: each query creates its own output context vector.

Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x D)

Operations:
- Alignment: ei,j = qj ᐧ xi / √D
- Attention: a = softmax(e)   (softmax over i, separately for each query j)
- Output: yj = ∑i ai,j xi

Outputs:
- Context vectors: y (one vector of shape D per query)
General attention layer
Notice that the input vectors are used for both the alignment and the attention calculations. We can add more expressivity to the layer by adding a different FC layer before each of the two steps:
- Key vectors: k = xWk (used for alignment)
- Value vectors: v = xWv (used to form the outputs)

The input and output dimensions can now change depending on the key and value FC layers.

Inputs:
- Input vectors: x (shape: N x D)
- Queries: q (shape: M x Dk)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Alignment: ei,j = qj ᐧ ki / √D
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi

Outputs:
- Context vectors: y (one vector of shape Dv per query)
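A NumPy sketch of the full general attention layer above, with learned key and value projections (Wk, Wv, and the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D, Dk, Dv = 5, 3, 8, 16, 32
x = rng.standard_normal((N, D))           # input vectors
q = rng.standard_normal((M, Dk))          # query vectors
Wk = rng.standard_normal((D, Dk)) * 0.1   # key projection
Wv = rng.standard_normal((D, Dv)) * 0.1   # value projection

k = x @ Wk                                # keys:   N x Dk
v = x @ Wv                                # values: N x Dv
e = (q @ k.T) / np.sqrt(Dk)               # e[j, i] = q_j . k_i / sqrt(Dk), shape M x N
a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)         # softmax over the N inputs, per query
y = a @ v                                 # outputs: M x Dv (one context vector per query)
```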
General attention layer
Recall the image captioning setup: the encoder computed h0 = fW(z), where z is the grid of spatial CNN features and fW(.) is an MLP. In the general attention layer, that same grid of CNN features (flattened into N vectors) supplies the input vectors x, from which the keys and values are computed.
Self attention layer
There are no separate input query vectors anymore; the queries are also computed from the input vectors themselves. One self-attention layer maps a set of input vectors x0, x1, x2 to a set of output vectors y0, y1, y2.

Inputs:
- Input vectors: x (shape: N x D)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj ᐧ ki / √D
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi

Outputs:
- Context vectors: y (one vector of shape Dv per input)
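A NumPy sketch of the self-attention layer (projection sizes are assumptions): the only difference from the general attention layer above is that the queries also come from x. The final assertion demonstrates the permutation property discussed on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Dk, Dv = 4, 8, 16, 16
x = rng.standard_normal((N, D))
Wq = rng.standard_normal((D, Dk)) * 0.1
Wk = rng.standard_normal((D, Dk)) * 0.1
Wv = rng.standard_normal((D, Dv)) * 0.1

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # all three come from x
    e = (q @ k.T) / np.sqrt(Dk)               # N x N alignment scores
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)         # softmax over the inputs, per query
    return a @ v                              # N x Dv outputs

y = self_attention(x)

# Permutation equivariance: permuting the inputs permutes the outputs the same way.
perm = np.array([2, 0, 3, 1])
assert np.allclose(self_attention(x[perm]), y[perm])
```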
Self attention layer - attends over sets of inputs
Self-attention is permutation equivariant: permuting the input vectors x0, x1, x2 permutes the output vectors y0, y1, y2 in exactly the same way.
Problem: How can we encode ordered sequences like language or spatially ordered image features?
Positional encoding
Concatenate (or add) a special positional encoding pj to each input vector xj before self-attention. We use a function pos: ℕ → ℝ^d to map the position j of the vector into a d-dimensional vector, so pj = pos(j).

Desiderata of pos(.):
1. It should output a unique encoding for each time-step (the word's position in a sentence).
2. Distance between any two time-steps should be consistent across sentences with different lengths.
3. Our model should generalize to longer sentences without any effort; its values should be bounded.
4. It must be deterministic.

Options for pos(.):
1. Learn a lookup table: learn the parameters to use for pos(t) for t ∈ [0, T); the lookup table contains T x d parameters.
2. Use a fixed function with these properties, e.g. the sinusoidal encoding from the cited Vaswani et al paper (a sketch follows below).

Vaswani et al, "Attention is all you need", NeurIPS 2017
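A sketch of the fixed sinusoidal encoding from Vaswani et al, "Attention is all you need" (NeurIPS 2017), which satisfies the desiderata above: deterministic, bounded in [-1, 1], unique per position, and defined for arbitrarily long sequences. The sizes here are assumptions.

```python
import numpy as np

def positional_encoding(T, d):
    """pos(j)[2i]   = sin(j / 10000^(2i/d))
       pos(j)[2i+1] = cos(j / 10000^(2i/d))   (Vaswani et al., 2017; d assumed even)"""
    pe = np.zeros((T, d))
    pos = np.arange(T)[:, None]                       # positions j
    div = np.power(10000.0, np.arange(0, d, 2) / d)   # per-frequency divisors
    pe[:, 0::2] = np.sin(pos / div)
    pe[:, 1::2] = np.cos(pos / div)
    return pe                                          # shape (T, d), values in [-1, 1]

# Add (or concatenate) to the input vectors before self-attention:
x = np.random.randn(16, 64)                 # 16 tokens, d = 64 (assumed sizes)
x = x + positional_encoding(16, 64)
```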
Masked self-attention layer
Prevent vectors from looking at future vectors: manually set the alignment scores ei,j to -∞ wherever key position i comes after query position j, so the corresponding attention weights become 0 after the softmax.

Inputs:
- Input vectors: x (shape: N x D)

Operations:
- Key vectors: k = xWk
- Value vectors: v = xWv
- Query vectors: q = xWq
- Alignment: ei,j = qj ᐧ ki / √D, masked to -∞ for i > j
- Attention: a = softmax(e)
- Output: yj = ∑i ai,j vi
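A NumPy sketch of the causal mask (sizes are assumptions): scores on future positions are set to -∞ before the softmax, so each output yj only mixes values v0, ..., vj.

```python
import numpy as np

rng = np.random.default_rng(0)
N, Dk = 4, 8
q = rng.standard_normal((N, Dk))
k = rng.standard_normal((N, Dk))
v = rng.standard_normal((N, Dk))

e = (q @ k.T) / np.sqrt(Dk)                        # e[j, i] = q_j . k_i / sqrt(Dk)
mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True where i > j (future keys)
e[mask] = -np.inf                                  # forbid attending to the future

a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)                  # rows are lower-triangular distributions
y = a @ v                                          # y_j depends only on v_0..v_j
```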
Multi-head self attention layer
Run multiple self-attention heads in parallel: split each input vector into chunks, run an independent self-attention layer on each chunk, and concatenate the per-head outputs (a sketch follows below).
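A NumPy sketch of multi-head self-attention by splitting the feature dimension across heads (the head count, sizes, and per-head weight layout are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, heads = 4, 16, 4
Dh = D // heads                            # per-head dimension
x = rng.standard_normal((N, D))
# one (Wq, Wk, Wv) triple per head
Wq, Wk, Wv = (rng.standard_normal((heads, Dh, Dh)) * 0.1 for _ in range(3))

def head(xh, Wq_h, Wk_h, Wv_h):
    q, k, v = xh @ Wq_h, xh @ Wk_h, xh @ Wv_h
    e = (q @ k.T) / np.sqrt(Dh)
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ v                           # N x Dh

# Split x into `heads` chunks along the feature dim, attend, then concatenate.
chunks = np.split(x, heads, axis=1)
y = np.concatenate(
    [head(c, Wq[h], Wk[h], Wv[h]) for h, c in enumerate(chunks)], axis=1
)                                          # N x D
```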
General attention versus self-attention
A general attention layer takes separate key, value, and query vectors as inputs; a self-attention layer takes only the input vectors x and computes its own keys, values, and queries from them.
Example: CNN with Self-Attention
- An input image goes through a CNN to produce features of shape C x H x W.
- Queries, keys, and values: each C' x H x W, produced by separate 1x1 convolutions on the features.
- Attention weights: softmax over (transposed queries) times keys, giving an (H x W) x (H x W) matrix.
- Multiply the attention weights with the values to get attended features (C' x H x W), pass them through one more 1x1 convolution, and add the result back to the input features with a residual connection.
- Together this forms a self-attention module that can be dropped into a CNN (a sketch follows below).

Cat image is free to use under the Pixabay License.
Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018. Slide credit: Justin Johnson
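A NumPy sketch of such a module (channel sizes and weight names are assumptions; the 1x1 convolutions are implemented as per-pixel matrix multiplies over the channel dimension):

```python
import numpy as np

rng = np.random.default_rng(0)
C, Cp, H, W = 32, 8, 7, 7                      # Cp plays the role of C' (assumed sizes)
feat = rng.standard_normal((C, H, W))          # CNN features, C x H x W

# 1x1 convolutions = matrix multiplies over the channel dimension
Wq, Wk, Wv = (rng.standard_normal((Cp, C)) * 0.1 for _ in range(3))
Wo = rng.standard_normal((C, Cp)) * 0.1        # final 1x1 conv back to C channels

f = feat.reshape(C, H * W)                     # flatten the spatial grid: C x (HW)
q, k, v = Wq @ f, Wk @ f, Wv @ f               # each C' x (HW)

e = (q.T @ k) / np.sqrt(Cp)                    # (HW) x (HW) attention logits
a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)              # each position attends over all positions

attended = v @ a.T                             # C' x (HW)
out = feat + (Wo @ attended).reshape(C, H, W)  # 1x1 conv + residual connection
```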
Comparing RNNs to Transformer
RNNs
(+) LSTMs work reasonably well for long sequences.
(-) Expect an ordered sequence of inputs.
(-) Sequential computation: subsequent hidden states can only be computed after the previous ones are done.

Transformer:
(+) Good at long sequences: each attention calculation looks at all inputs.
(+) Can operate over unordered sets, or over ordered sequences with positional encodings.
(+) Parallel computation: all alignment and attention scores for all inputs can be computed in parallel.
(-) Requires a lot of memory: N x M alignment and attention scalars need to be calculated and stored for a single self-attention head (but GPUs are getting bigger and better).
“ImageNet Moment for Natural Language Processing”
Pretraining:
Download a lot of text from the internet
Finetuning:
Fine-tune the Transformer on your own NLP task
Image Captioning using Transformers
Input: Image I
Output: Sequence y = y1, y2, ..., yT

Encoder: c = TW(z), where z is the grid of spatial CNN features and TW(.) is the transformer encoder.

Decoder: yt = TD(y0:t-1, c), where TD(.) is the transformer decoder.
The Transformer encoder block
Made up of N encoder blocks ("xN").

Inputs: a set of vectors x (plus positional encoding).
Outputs: a set of vectors y.

Inside each block:
- Multi-head self-attention
- Residual connection, then layer norm
- MLP (applied independently to each vector)
- Residual connection, then layer norm

Layer norm and the MLP operate independently per vector, so the only interaction between vectors happens inside self-attention (a sketch follows below).

Highly scalable, highly parallelizable, but high memory usage.
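A compact NumPy sketch of one post-norm encoder block under these assumptions: single-head self-attention for brevity, a simple two-layer MLP, and untrained random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, Dff = 6, 32, 64                      # tokens, model dim, MLP hidden dim
Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.05 for _ in range(4))
W1 = rng.standard_normal((D, Dff)) * 0.05
W2 = rng.standard_normal((Dff, D)) * 0.05

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)           # per-vector normalization

def self_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    e = (q @ k.T) / np.sqrt(D)
    a = np.exp(e - e.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return (a @ v) @ Wo

def encoder_block(x):
    x = layer_norm(x + self_attention(x))           # attention + residual + layer norm
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)  # per-vector MLP + residual + layer norm
    return x

x = rng.standard_normal((N, D))            # token vectors (with positional encoding added)
for _ in range(3):                         # "xN": stack N blocks (here N = 3)
    x = encoder_block(x)
```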
The Transformer decoder block
Made up of N decoder blocks.

Inputs: the encoder outputs c0,0, ..., c2,2 and the shifted output tokens y0, ..., y3 ([START] person wearing hat).
Outputs: the next tokens (person wearing hat [END]).

Inside each block:
- Masked multi-head self-attention over the previously generated tokens
- Residual connection, then layer norm
- Multi-head (cross-)attention: queries q come from the decoder, keys k and values v come from the transformer encoder outputs. For image captioning, this is how we inject image features into the decoder (a sketch follows below).
- Residual connection, then layer norm
- MLP, residual connection, then layer norm
- A final FC layer produces the output tokens.

Most of the network is the same as in the transformer encoder; the differences are the masking and the extra attention block that interacts with the encoder outputs.

Highly scalable, highly parallelizable, but high memory usage.

Vaswani et al, "Attention is all you need", NeurIPS 2017
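A NumPy sketch of the cross-attention step inside the decoder block (sizes and weight names are assumptions): queries come from the decoder tokens, keys and values from the encoder outputs c.

```python
import numpy as np

rng = np.random.default_rng(0)
T_dec, N_enc, D = 4, 9, 32                   # decoder tokens, encoder outputs, model dim
dec = rng.standard_normal((T_dec, D))        # decoder token vectors (after masked self-attn)
c = rng.standard_normal((N_enc, D))          # transformer encoder outputs
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.05 for _ in range(3))

q = dec @ Wq                                 # queries from the decoder
k, v = c @ Wk, c @ Wv                        # keys and values from the encoder outputs
e = (q @ k.T) / np.sqrt(D)                   # T_dec x N_enc
a = np.exp(e - e.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)            # each decoder token attends over the image features
out = a @ v                                  # T_dec x D, fed into residual + layer norm
```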
Image Captioning using transformers
- No recurrence at all.
- Perhaps we don't need convolutions at all?
Image Captioning using ONLY transformers
- Transformers from pixels to language: split the image into patches, feed the patch embeddings to a transformer encoder, and decode the caption y1, ..., y4 from [START] with a transformer decoder.
Dosovitskiy et al, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv 2020
(Colab link to an implementation of vision transformers)

Vision Transformers vs. ResNets
Dosovitskiy et al, "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv 2020
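A NumPy sketch of the ViT-style front end described in Dosovitskiy et al: cut the image into 16x16 patches, flatten each patch, and linearly project it into a token vector, with positional information added before the transformer encoder (the projection, the random positional placeholder, and the sizes here are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, P, D = 224, 224, 3, 16, 64          # image size, channels, patch size, embedding dim
img = rng.standard_normal((H, W, C))
W_embed = rng.standard_normal((P * P * C, D)) * 0.02   # linear patch projection

# Cut the image into non-overlapping P x P patches and flatten each one.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)     # (num_patches, P*P*C) = (196, 768)

tokens = patches @ W_embed                   # (196, D) patch tokens
tokens += rng.standard_normal(tokens.shape) * 0.02  # placeholder positional encodings
# `tokens` would now be fed to the transformer encoder blocks sketched earlier.
```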
Vision Transformers
ConvNets strike back!
Summary
- Adding attention to RNNs allows them to "attend" to different parts of the input at every time step.
- The general attention layer is a new type of layer that can be used to design new neural network architectures.
- Transformers are a type of layer that combines self-attention and layer norm:
  ○ They are highly scalable and highly parallelizable.
  ○ Faster training, larger models, and better performance across vision and language tasks.
  ○ They are quickly replacing RNNs and LSTMs, and may(?) even replace convolutions.
Next time: Video Understanding