11. RNN and Transformers
• Conv -> Pool -> Conv -> Pool -> Conv -> FC
• As we go deeper: width and height decrease, the number of filters increases
• Transfer learning: train on a large dataset (task P1), then use what has been learned for another setting with only a small dataset (task P2)
Transfer Learning for Images
• Use the convolutional layers as a generic feature extractor: the first layers capture low-level features such as edges
[Donahue et al., ICML’14] DeCAF
[Razavian et al., CVPRW’14] CNN Features off-the-shelf
Transfer Learning
• Take a network trained on ImageNet, keep the pretrained layers FROZEN, and TRAIN only a new final layer on the new dataset with C classes (see the sketch below)
• Useful when you have more data for task T1 than for task T2
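A minimal PyTorch sketch of this frozen-backbone setup (torchvision’s ResNet-18 serves as an assumed example backbone; the class count C and the learning rate are illustrative):

```python
import torch
import torch.nn as nn
import torchvision.models as models

C = 10  # number of classes in the new, smaller dataset (illustrative)

# Backbone pretrained on ImageNet (weights API assumes torchvision >= 0.13)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# FROZEN: keep the pretrained feature extractor fixed
for param in model.parameters():
    param.requires_grad = False

# TRAIN: replace the final classifier with a fresh layer for C classes
model.fc = nn.Linear(model.fc.in_features, C)  # new parameters train by default

# Only the new layer's parameters are handed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```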
RNNs are Flexible
• The same recurrent structure covers many sequence tasks:
  – Image captioning
  – Language recognition
  – Machine translation
  – Event classification
Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Basic Structure of an RNN
• Multi-layer RNN: the inputs feed into a stack of hidden states, which produce the outputs
• Stacking hidden layers gives a more expressive model!
• We want to have a notion of “time” or “sequence”
• Hidden state update: the new hidden state 𝑨𝑡 is computed from the previous hidden state 𝑨𝑡−1 and the current input 𝒙𝑡, with parameters 𝜽𝑐 and 𝜽𝑥 to be learned:
  𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
• Output: 𝒉𝑡 = 𝜽ℎ 𝑨𝑡
• Note: non-linearities ignored for now (see the sketch below)
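A minimal NumPy sketch of this recurrence (sizes and initialization are illustrative; a full RNN would wrap the hidden-state update in a non-linearity such as tanh):

```python
import numpy as np

# Illustrative sizes: 5-dim inputs, 3-dim hidden state, 4-dim outputs
theta_x = np.random.randn(3, 5) * 0.1   # input-to-hidden parameters
theta_c = np.random.randn(3, 3) * 0.1   # hidden-to-hidden parameters
theta_h = np.random.randn(4, 3) * 0.1   # hidden-to-output parameters

A = np.zeros(3)                                    # initial hidden state A_0
inputs = [np.random.randn(5) for _ in range(10)]   # a toy sequence x_1, ..., x_10

outputs = []
for x_t in inputs:
    A = theta_c @ A + theta_x @ x_t    # A_t = theta_c A_{t-1} + theta_x x_t
    outputs.append(theta_h @ A)        # h_t = theta_h A_t
```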
• If 𝜽 admits the eigendecomposition 𝜽 = 𝑸𝚲𝑸^𝑇, then applying it over 𝑡 steps gives 𝜽^𝑡 = 𝑸𝚲^𝑡𝑸^𝑇: eigenvalues smaller than 1 shrink towards zero (vanishing gradients), eigenvalues larger than 1 blow up (exploding gradients)
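A tiny numeric illustration of this effect (the eigenvalues are chosen arbitrarily):

```python
import numpy as np

# One eigenvalue below 1 and one above 1
Lambda = np.diag([0.9, 1.1])

for t in [1, 10, 50]:
    # Repeated application of theta corresponds to Lambda^t in the eigenbasis
    print(t, np.linalg.matrix_power(Lambda, t).diagonal())
# After 50 steps: 0.9^50 ≈ 0.005 (vanishing), 1.1^50 ≈ 117 (exploding)
```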
• The LSTM uses gates to control its cell state:
  – Forget gate: a sigmoid produces outputs between 0 (forget) and 1 (keep)
  – Input gate: a sigmoid decides which values will be updated
  – Candidate values: the output of a tanh, in the range (−1, 1)
• Output: 𝒉𝑡 = 𝒐𝑡 ⊙ tanh(𝑪𝑡)
• Dimensions need to match: what operation do I need to apply to my input to get a 128-dimensional vector representation?
[Olah, https://colah.github.io ’15] Understanding LSTMs
LSTM in code
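A minimal sketch of how this might look with PyTorch’s built-in LSTM (the sizes, batch layout, and variable names are illustrative):

```python
import torch
import torch.nn as nn

# Illustrative sizes: 32-dim inputs, 128-dim hidden and cell state
input_size, hidden_size = 32, 128
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

x = torch.randn(4, 10, input_size)   # toy batch: 4 sequences of length 10

# Initial hidden and cell states default to zeros when not provided
outputs, (h_n, c_n) = lstm(x)

print(outputs.shape)  # (4, 10, 128): h_t at every time step
print(h_n.shape)      # (1, 4, 128): final hidden state
print(c_n.shape)      # (1, 4, 128): final cell state C_t
```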
Attention
• “Attention Is All You Need” [Vaswani et al., NeurIPS 2017]: ~62,000 citations in 5 years!
• Encoder–decoder architecture: Multi-Head Attention on the “encoder” and Masked Multi-Head Attention on the “decoder”, each followed by a fully connected layer; the encoder output provides the context for the decoder
Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾^𝑇 / √𝑑𝑘) 𝑉
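A minimal NumPy sketch of this formula (shapes are illustrative; masking and batching are omitted):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # rows sum to 1
    return weights @ V                  # weighted sum of values, (n_q, d_v)

# Toy example: 2 queries, 5 key/value pairs
Q = np.random.randn(2, 8)
K = np.random.randn(5, 8)
V = np.random.randn(5, 4)
print(attention(Q, K, V).shape)  # (2, 4)
```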
• Illustration: a query 𝑄 is compared against the keys 𝐾1, …, 𝐾5; the resulting attention weights form a weighted sum over the values 𝑉1, …, 𝑉5
• Multi-head attention: 𝐾 parallel attention heads, each attending over a reduced dimension of the representation (see the sketch below)
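A quick usage sketch with PyTorch’s built-in multi-head attention module (the embedding size and head count follow the paper’s defaults; the batch and sequence sizes are illustrative):

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8   # 8 parallel heads of size 512 / 8 = 64
mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Toy batch: 2 sequences of length 10
x = torch.randn(2, 10, d_model)

# Self-attention: queries, keys, and values all come from the same sequence
out, attn_weights = mha(x, x, x)
print(out.shape)           # (2, 10, 512)
print(attn_weights.shape)  # (2, 10, 10): averaged over heads by default
```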
• Positional encoding:
  𝑃𝐸(𝑝𝑜𝑠, 2𝑖) = sin(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑model))
  𝑃𝐸(𝑝𝑜𝑠, 2𝑖+1) = cos(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑model))
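A short NumPy sketch of these sinusoidal encodings (sequence length and model dimension are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model / 2)
    angles = pos / 10000 ** (2 * i / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=512)
print(pe.shape)  # (50, 512), added to the token embeddings
```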
Transformers – a final look
• Considering that most sentences are shorter (number of tokens 𝑛) than the representation dimension (𝑑model = 512 in the paper), self-attention is very efficient: its per-layer cost scales as 𝑛² · 𝑑, versus 𝑛 · 𝑑² for a recurrent layer.
Transformers – training tricks
• Adam optimizer with a learning-rate schedule that grows linearly during warm-up and then decays proportionally to the inverse square root of the step number (see the sketch after this list)
• Residual dropout
• Label smoothing
• Checkpoint averaging
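A minimal sketch of that warm-up schedule as given in the original paper (warmup_steps = 4000 there; the function name is illustrative):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warm-up, then decay with the inverse square root of the step
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps
print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```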