11.RNN and Transformers

This document provides an overview of recurrent neural networks (RNNs), including how they can process sequential data, their basic structure involving hidden states, and examples of applications like language processing, translation, and event classification. It also discusses limitations of basic RNNs and how more advanced RNN variants like LSTMs address them.


Lecture 10 Recap



LeNet
• Digit recognition: 10 classes, ~60k parameters
• Conv -> Pool -> Conv -> Pool -> Conv -> FC
• As we go deeper: width and height shrink, the number of filters grows


AlexNet
• Softmax for 1000 classes

[Krizhevsky et al., NIPS’12] AlexNet

VGGNet
• Striving for simplicity
  – Conv -> Pool -> Conv -> Pool -> Conv -> FC
  – Conv = 3x3, s = 1, same padding; Maxpool = 2x2, s = 2
• As we go deeper: width and height shrink, the number of filters grows
• Called VGG-16: 16 layers that have weights, 138M parameters
• Large, but its simplicity makes it appealing

[Simonyan et al., ICLR’15] VGGNet

Residual Block
• Two layers (Input -> Linear -> Linear), plain version:
  𝑥^(𝐿+1) = 𝑓(𝑊^(𝐿+1) 𝑥^𝐿 + 𝑏^(𝐿+1))
• With the skip connection added before the non-linearity (see the sketch below):
  𝑥^(𝐿+1) = 𝑓(𝑊^(𝐿+1) 𝑥^𝐿 + 𝑏^(𝐿+1) + 𝑥^(𝐿−1))

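The residual formulation above maps directly to code. Below is a minimal sketch, assuming a PyTorch-style setup (the lecture does not prescribe a framework); the layer width of 64 and the ReLU non-linearity are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two linear layers; the skip connection is added before the final non-linearity."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.fc1(x))   # x^L = f(W^L x^{L-1} + b^L)
        out = self.fc2(out) + x       # W^{L+1} x^L + b^{L+1} + x^{L-1}
        return self.act(out)          # x^{L+1} = f(...)

x = torch.randn(8, 64)
print(ResidualBlock()(x).shape)       # torch.Size([8, 64])
```
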
Inception Layer

[Szegedy et al., CVPR’15] GoogleNet


Lecture 11



Transfer Learning



Transfer Learning
• Training your own model can be difficult with limited data and other resources
  – e.g., it is a laborious task to manually annotate your own training dataset
• Why not reuse already pre-trained models?


Transfer Learning
• Distribution P1 (large dataset) → distribution P2 (small dataset): use what has been learned in one setting for another

Transfer Learning for Images

[Zeiler et al., ECCV’14] Visualizing and Understanding Convolutional Networks

Transfer Learning
• A network trained on ImageNet can be reused for feature extraction

[Donahue et al., ICML’14] DeCAF
[Razavian et al., CVPRW’14] CNN Features off-the-shelf

Transfer Learning
• A network trained on ImageNet learns a hierarchy of features, from low level to high level:
  – Edges
  – Simple geometrical shapes (circles, etc.)
  – Parts of an object (wheel, window)
  – Decision layers

[Donahue et al., ICML’14] DeCAF
[Razavian et al., CVPRW’14] CNN Features off-the-shelf

Transfer Learning
• Keep the ImageNet-pretrained layers FROZEN as a feature extractor and TRAIN only a new decision layer on the new dataset with C classes

[Donahue et al., ICML’14] DeCAF
[Razavian et al., CVPRW’14] CNN Features off-the-shelf

Transfer Learning
• If the new dataset is big enough, TRAIN more of the pretrained layers with a low learning rate and keep fewer of them FROZEN (see the sketch below)
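A minimal sketch of the two regimes above, assuming torchvision's ImageNet-pretrained ResNet-18 as the backbone (an illustrative choice, not the lecture's exact setup); C stands for the number of classes in the new dataset:

```python
import torch
import torch.nn as nn
from torchvision import models

C = 10  # number of classes in the new (small) dataset

# 1) Feature extraction: keep the pretrained backbone FROZEN, TRAIN only a new head
model = models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                        # freeze pretrained layers
model.fc = nn.Linear(model.fc.in_features, C)      # new decision layer (trainable)

# 2) Fine-tuning: if the dataset is big enough, also train later blocks with a low learning rate
for p in model.layer4.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD([
    {"params": model.layer4.parameters(), "lr": 1e-4},  # unfrozen block, low learning rate
    {"params": model.fc.parameters(),     "lr": 1e-2},  # new head, higher learning rate
], momentum=0.9)
```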


When Transfer Learning Makes Sense
• When tasks T1 and T2 have the same input (e.g., an RGB image)
• When you have more data for task T1 than for task T2
• When the low-level features learned for T1 could be useful for learning T2


Now you are:
• Ready to perform image classification on any dataset
• Ready to design your own architecture
• Ready to deal with other problems such as semantic segmentation (Fully Convolutional Network)


Representation Learning


Learning Good Features
• Good features are essential for successful machine learning
• (Supervised) deep learning depends on the training data used: inputs and target labels
• Changes in the inputs (noise, irregularities, etc.) can result in drastically different results


Representation Learning
• Allows for the discovery of representations required for various tasks
• Deep representation learning: the model maps input 𝑋 to output 𝑌


Deep Representation Learning
• Intuitively, deep networks learn multiple levels of abstraction


How to Learn Good Features?
• Determine desired feature invariances
• Teach machines to distinguish between similar and dissimilar things

Source: https://amitness.com/2020/03/illustrated-simclr/


How to Learn Good Features?

[Chen et al., ICML’20] SimCLR
Source: https://amitness.com/2020/03/illustrated-simclr/

Apply to Downstream Tasks

[Chen et al., ICML’20] SimCLR
Source: https://amitness.com/2020/03/illustrated-simclr/

Transfer & Representation Learning
• Transfer learning can be done via representation learning
• The effectiveness of representation learning is often demonstrated by transfer learning performance (but also by other factors, e.g., smoothness of the learned manifold)


Recurrent Neural Networks


Processing Sequences
• Recurrent neural networks process sequence data

• Input/output can be sequences



RNNs are Flexible
• Classical neural networks for image classification
• Image captioning
• Language recognition
• Machine translation
• Event classification

Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Basic Structure of an RNN
• Multi-layer RNN: inputs at the bottom, hidden states in the middle, outputs at the top

Basic Structure of an RNN
• Multi-layer RNN
• The hidden states have their own internal dynamics: a more expressive model!

Basic Structure of an RNN
• We want to have a notion of “time” or “sequence”

  𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡

  (𝑨𝑡: hidden state, 𝑨𝑡−1: previous hidden state, 𝒙𝑡: input)

[Olah, https://colah.github.io ’15] Understanding LSTMs


Basic Structure of an RNN
• In 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡, the weights 𝜽𝑐 and 𝜽𝑥 are the parameters to be learned

[Olah, https://colah.github.io ’15] Understanding LSTMs


Basic Structure of an RNN
• Hidden state: 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
• Output: 𝒉𝑡 = 𝜽ℎ 𝑨𝑡
• Note: non-linearities are ignored for now

[Olah, https://colah.github.io ’15] Understanding LSTMs


Basic Structure of an RNN
• Hidden state: 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
• Output: 𝒉𝑡 = 𝜽ℎ 𝑨𝑡
• The same parameters are used for each time step = generalization! (A minimal sketch follows.)

[Olah, https://colah.github.io ’15] Understanding LSTMs
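A minimal sketch of this recurrence, assuming PyTorch tensors for convenience (dimensions and initialization are illustrative); note how the same 𝜽𝑐, 𝜽𝑥, 𝜽ℎ are reused at every time step:

```python
import torch

input_dim, hidden_dim, output_dim = 16, 32, 8
theta_c = torch.randn(hidden_dim, hidden_dim) * 0.1   # recurrent weights
theta_x = torch.randn(hidden_dim, input_dim) * 0.1    # input weights
theta_h = torch.randn(output_dim, hidden_dim) * 0.1   # output weights

def rnn_forward(xs):
    """xs: sequence of input vectors x_t; returns all outputs h_t (non-linearities ignored, as on the slide)."""
    A = torch.zeros(hidden_dim)            # A_0
    outputs = []
    for x_t in xs:                         # the SAME parameters are reused at every step
        A = theta_c @ A + theta_x @ x_t    # A_t = θ_c A_{t-1} + θ_x x_t
        outputs.append(theta_h @ A)        # h_t = θ_h A_t
    return outputs

sequence = [torch.randn(input_dim) for _ in range(5)]
print(len(rnn_forward(sequence)))          # 5 outputs, one per time step
```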


Basic Structure of an RNN
• Unrolling RNNs: the same function is applied to the hidden state at every time step

[Olah, https://colah.github.io ’15] Understanding LSTMs




Basic Structure of an RNN
• Unrolling RNNs as feedforward nets: the weights are the same at every time step!


Backprop through an RNN
• Unroll the RNN as a feedforward net and apply the chain rule, all the way back to 𝑡 = 0
• Add up the derivatives from the different time steps for each (shared) weight

Long-term Dependencies

I moved to Germany … so I speak German fluently.

[Olah, https://colah.github.io ’15] Understanding LSTMs


Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
• Let us forget the input: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
  – The same weights are multiplied over and over again


Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
• What happens to small weights? Vanishing gradient
• What happens to large weights? Exploding gradient
  (A small numeric sketch follows.)
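A tiny numeric sketch of this effect (the weight values 0.9 and 1.1 are illustrative): repeatedly multiplying by the same scalar weight makes the contribution of 𝑨0 vanish or explode.

```python
for theta in (0.9, 1.1):                      # a "small" and a "large" weight
    for t in (10, 50, 100):
        print(f"theta={theta}, t={t:3d}: theta^t = {theta ** t:.3e}")
# theta=0.9 -> 3.487e-01, 5.154e-03, 2.656e-05  (vanishing)
# theta=1.1 -> 2.594e+00, 1.174e+02, 1.378e+04  (exploding)
```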


Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
• If 𝜽 admits an eigendecomposition
  𝜽 = 𝑸𝚲𝑸^𝑇
  where 𝑸 is the matrix of eigenvectors and the diagonal of 𝚲 contains the eigenvalues
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽^𝑡 𝑨0
• If 𝜽 admits an eigendecomposition 𝜽 = 𝑸𝚲𝑸^𝑇 with orthogonal 𝑸, we can simplify the recurrence:
  𝑨𝑡 = 𝑸𝚲^𝑡 𝑸^𝑇 𝑨0
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝑸𝚲^𝑡 𝑸^𝑇 𝑨0
• Eigenvalues with magnitude less than one → vanishing gradient
• Eigenvalues with magnitude larger than one → exploding gradient, mitigated by gradient clipping (sketch below)

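Gradient clipping is typically a one-line addition to the training loop. A minimal sketch using PyTorch's clip_grad_norm_ utility (the RNN, the dummy loss, and the threshold of 5.0 are placeholders, not the lecture's setup):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 20, 16)                 # (batch, time steps, features)
out, _ = model(x)
loss = out.pow(2).mean()                   # dummy loss, just to produce gradients

loss.backward()                                            # backprop through time
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)    # rescale gradients with norm > 5.0
optimizer.step()
optimizer.zero_grad()
```
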
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
• Idea: let us just make a matrix with eigenvalues = 1, allowing the cell to maintain its “state”


Vanishing Gradient
• 1. From the weights: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
• 2. From the activation functions (tanh)

[Olah, https://colah.github.io ’15] Understanding LSTMs




Long Short-Term Memory

[Hochreiter et al., Neural Computation’97] Long Short-Term Memory

Long Short-Term Memory Units
• A simple RNN has tanh as its non-linearity

[Olah, https://colah.github.io ’15] Understanding LSTMs


Long Short-Term Memory Units
• LSTM

[Olah, https://colah.github.io ’15] Understanding LSTMs




Long Short-Term Memory Units
• Key ingredients
  – Cell = transports the information through the unit
  – Gate = removes or adds information to the cell state, controlled by a sigmoid

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Step by Step
• Forget gate: 𝒇𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑓 𝒙𝑡 + 𝜽ℎ𝑓 𝒉𝑡−1 + 𝒃𝑓 )
  – Decides when to erase the cell state
  – Sigmoid = output between 0 (forget) and 1 (keep)

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Step by Step
• Input gate: 𝒊𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑖 𝒙𝑡 + 𝜽ℎ𝑖 𝒉𝑡−1 + 𝒃𝑖 )
  – Decides which values will be updated
  – The new candidate cell state is the output of a tanh, in (−1, 1)

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Step by Step
• Cell update (element-wise operations):
  𝑪𝑡 = 𝒇𝑡 ⊙ 𝑪𝑡−1 + 𝒊𝑡 ⊙ 𝒈𝑡
  – combines the previous state 𝑪𝑡−1 with the current candidate state 𝒈𝑡

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Step by Step
• Output: 𝒉𝑡 = 𝒐𝑡 ⊙ tanh 𝑪𝑡
  – The output gate 𝒐𝑡 decides which values will be output
  – tanh squashes the cell state to (−1, 1)

[Olah, https://colah.github.io ’15] Understanding LSTMs




LSTM: Step by Step
• Forget gate:  𝒇𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑓 𝒙𝑡 + 𝜽ℎ𝑓 𝒉𝑡−1 + 𝒃𝑓 )
• Input gate:   𝒊𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑖 𝒙𝑡 + 𝜽ℎ𝑖 𝒉𝑡−1 + 𝒃𝑖 )
• Output gate:  𝒐𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑜 𝒙𝑡 + 𝜽ℎ𝑜 𝒉𝑡−1 + 𝒃𝑜 )
• Cell update:  𝒈𝑡 = 𝑡𝑎𝑛ℎ(𝜽𝑥𝑔 𝒙𝑡 + 𝜽ℎ𝑔 𝒉𝑡−1 + 𝒃𝑔 )
• Cell:         𝑪𝑡 = 𝒇𝑡 ⊙ 𝑪𝑡−1 + 𝒊𝑡 ⊙ 𝒈𝑡
• Output:       𝒉𝑡 = 𝒐𝑡 ⊙ tanh 𝑪𝑡
• All parameters 𝜽 and 𝒃 are learned through backpropagation (see the sketch below)

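A minimal from-scratch sketch of one LSTM step, written directly from the six equations above (PyTorch is assumed; the dimensions and random initialization are illustrative, not the lecture's code):

```python
import torch

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step following the equations on the slide."""
    th_xf, th_hf, b_f, th_xi, th_hi, b_i, th_xo, th_ho, b_o, th_xg, th_hg, b_g = params
    f_t = torch.sigmoid(th_xf @ x_t + th_hf @ h_prev + b_f)   # forget gate
    i_t = torch.sigmoid(th_xi @ x_t + th_hi @ h_prev + b_i)   # input gate
    o_t = torch.sigmoid(th_xo @ x_t + th_ho @ h_prev + b_o)   # output gate
    g_t = torch.tanh(th_xg @ x_t + th_hg @ h_prev + b_g)      # cell update (candidate)
    C_t = f_t * C_prev + i_t * g_t                            # cell state (element-wise ⊙)
    h_t = o_t * torch.tanh(C_t)                               # output
    return h_t, C_t

d_in, d_h = 16, 128
params = [torch.randn(d_h, d_in) * 0.1 if k % 3 == 0 else      # θ_x* matrices
          torch.randn(d_h, d_h) * 0.1 if k % 3 == 1 else       # θ_h* matrices
          torch.zeros(d_h)                                     # biases
          for k in range(12)]
h, C = torch.zeros(d_h), torch.zeros(d_h)
for x_t in torch.randn(5, d_in):                               # unroll over a short sequence
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)                                        # torch.Size([128]) twice
```
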
LSTM
• The cell state provides a highway for the gradient to flow

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Dimensions
• Cell update: 𝒈𝑡 = 𝑡𝑎𝑛ℎ(𝜽𝑥𝑔 𝒙𝑡 + 𝜽ℎ𝑔 𝒉𝑡−1 + 𝒃𝑔 )
• When coding an LSTM, we have to define the size of the hidden state (here 128)
• Dimensions need to match: what operation do I need to apply to my input to get a 128-dimensional vector representation?

[Olah, https://colah.github.io ’15] Understanding LSTMs

LSTM in code
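The code for this slide is not included in the extracted text. Below is a minimal usage sketch instead, assuming PyTorch's nn.LSTM with the hidden size of 128 from the previous slide; a linear projection is one answer to how to get a 128-dimensional input representation:

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 20, 128

embed = nn.Linear(input_dim, hidden_dim)     # project raw inputs to 128-d vectors
lstm = nn.LSTM(input_size=hidden_dim, hidden_size=hidden_dim, batch_first=True)

x = torch.randn(4, 10, input_dim)            # (batch, sequence length, features)
out, (h_n, c_n) = lstm(embed(x))
print(out.shape, h_n.shape, c_n.shape)       # (4, 10, 128), (1, 4, 128), (1, 4, 128)
```
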
Attention



Attention is all you need

• ~62,000 citations in 5 years!



Attention vs convolution



Long-Term Dependencies

I moved to Germany … so I speak German fluently.


Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Attention: Intuition
• Build a context from the relevant parts of the input: “I moved to Germany … so I speak German fluently”

Attention: Architecture
• A decoder processes the information
• Decoders take as input:
  – Previous decoder hidden state
  – Previous output
  – Attention (the context)

Transformers



Deep Learning Revolution

                    Deep Learning        Deep Learning 2.0
Main idea           Convolution          Attention
Field invented      Computer vision      NLP
Started             NeurIPS 2012         NeurIPS 2017
Paper               AlexNet              Transformers
Conquered vision    Around 2014-2015     Around 2020-2021
Replaced            Traditional ML/CV    CNNs, RNNs (augmented)


Transformers
• Multi-Head Attention on the “encoder”
• Masked Multi-Head Attention on the “decoder”
• Fully connected layers


Multi-Head Attention
• Intuition: take the query Q, find the most similar key K, and then find the value V that corresponds to that key.
• In other words, learn V, K, Q where:
  – V: here is a bunch of interesting things.
  – K: here is how we can index some things.
  – Q: I would like to know this interesting thing.
• Loosely connected to Neural Turing Machines (Graves et al.).



Multi-Head Attention
• Index the values via a differentiable operator: multiply the queries with the keys, then get the values

  Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾^𝑇 / √𝑑𝑘) 𝑉

• To train this well, divide by √𝑑𝑘, “probably” because for large values of the key’s dimension the dot product grows large in magnitude, pushing the softmax function into regions where it has extremely small gradients. (A minimal sketch follows.)

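A minimal sketch of this formula (PyTorch assumed; the tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)             # distribution over the values
    return weights @ V                              # weighted sum of the values

Q = torch.randn(1, 5, 64)    # (batch, queries, d_k)
K = torch.randn(1, 7, 64)    # (batch, keys, d_k)
V = torch.randn(1, 7, 64)    # (batch, keys, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 5, 64])
```
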
Multi-Head Attention

Adapted from Y. Kilcher



Multi-Head Attention
• A query Q is compared against the keys K1 … K5



Multi-Head Attention
• Each key K1 … K5 has a corresponding value V1 … V5



Multi-Head Attention
• With keys K1 … K5, values V1 … V5, and query Q: essentially, compute the dot products <Q, K1>, <Q, K2>, <Q, K3>, <Q, K4>, <Q, K5>

Multi-Head Attention
• softmax(𝑄𝐾^𝑇 / √𝑑𝑘) is simply inducing a distribution over the values: the larger a score is, the higher its softmax weight.
• It can be interpreted as a differentiable soft indexing.

Multi-Head Attention
• softmax(𝑄𝐾^𝑇 / √𝑑𝑘) selects the values V where the network needs to attend.

Transformers – a closer look
• K parallel attention heads (a usage sketch follows)
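A short usage sketch of K parallel heads, assuming PyTorch's built-in nn.MultiheadAttention (the model size 512 and 8 heads follow the original paper, but the snippet itself is illustrative):

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8                      # 8 parallel heads of size 512 / 8 = 64
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 10, d_model)                  # (batch, sequence, d_model)
out, weights = attn(x, x, x)                     # self-attention: Q = K = V = x
print(out.shape, weights.shape)                  # (2, 10, 512), (2, 10, 10)
```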


Transformers – a closer look
• Good old fully-connected layers


Transformers – a closer look
• N layers of attention followed by fully-connected layers


Transformers – a closer look
• Same as multi-head attention, but masked: ensures that the predictions for position i can depend only on the known outputs at positions less than i (a small masking sketch follows)
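A minimal sketch of such a mask (illustrative): entries above the diagonal are set to −∞ before the softmax, so position i cannot attend to future outputs.

```python
import torch

n = 5
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)   # True strictly above the diagonal
scores = torch.randn(n, n)                                          # raw attention scores
masked = scores.masked_fill(mask, float("-inf"))                    # block attention to future positions
weights = torch.softmax(masked, dim=-1)
print(weights[0])    # row 0 puts all its weight on position 0
```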


Transformers – a closer look
• Multi-head attention between the encoder and the decoder


Transformers – a closer look
• Projection and prediction


What is missing from self-attention?
• Convolution: a different linear transformation for each relative position, which allows you to distinguish what information came from where.
• Self-attention: a weighted average.


Transformers – a closer look
• Uses a fixed positional encoding based on trigonometric series, so that the model can make use of the order of the sequence:

  𝑃𝐸(𝑝𝑜𝑠, 2𝑖)   = sin(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑_model))
  𝑃𝐸(𝑝𝑜𝑠, 2𝑖+1) = cos(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑_model))

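A minimal sketch computing this encoding (PyTorch assumed; the function name and arguments are illustrative):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)    # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)     # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    pe[:, 1::2] = torch.cos(pos * div)     # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)          # torch.Size([50, 512])
```
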
Transformers – a final look





Self-attention: complexity
• Per-layer complexities (table below), where n is the sequence length, d is the representation dimension, k is the convolutional kernel size, and r is the size of the neighborhood.
• Considering that most sentences are shorter than the representation dimension (512 in the paper), self-attention is very efficient.

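The complexity table itself is not in the extracted text; for reference, the per-layer figures from the cited paper ([Vaswani et al., NeurIPS’17] Attention Is All You Need) are:

                              Complexity per layer   Sequential operations   Maximum path length
Self-attention                O(n² · d)              O(1)                    O(1)
Recurrent                     O(n · d²)              O(n)                    O(n)
Convolutional                 O(k · n · d²)          O(1)                    O(log_k(n))
Self-attention (restricted)   O(r · n · d)           O(1)                    O(n/r)
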
Transformers – training tricks
• ADAM optimizer with a proportional learning-rate schedule (formula below)
• Residual dropout
• Label smoothing
• Checkpoint averaging
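The schedule itself is not shown in the extracted text; in the original paper it is

  lrate = d_model^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5))

i.e., the learning rate grows linearly during warm-up and then decays with the inverse square root of the step number.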


Transformers - results



Transformers - summary
• Significantly improved the SOTA in machine translation
• Launched a new deep-learning revolution in NLP
• Building block of NLP models like BERT (Google) or GPT/ChatGPT (OpenAI)
• BERT has been heavily used in Google Search
• And eventually made its way to computer vision (and other related fields)


See you next time!
