11.RNN and Transformers

This document provides an overview of recurrent neural networks (RNNs), including how they can process sequential data, their basic structure involving hidden states, and examples of applications like language processing, translation, and event classification. It also discusses limitations of basic RNNs and how more advanced RNN variants like LSTMs address them.


Lecture 10 Recap



LeNet
• Digit recognition: 10 classes, ~60k parameters
• Conv -> Pool -> Conv -> Pool -> Conv -> FC
• As we go deeper: width and height shrink, the number of filters grows


AlexNet
• Softmax for 1000 classes

[Krizhevsky et al., NIPS’12] AlexNet

VGGNet
• Striving for simplicity
  – Conv -> Pool -> Conv -> Pool -> Conv -> FC
  – Conv = 3x3, s = 1, same padding; Maxpool = 2x2, s = 2
• As we go deeper: width and height shrink, the number of filters grows
• Called VGG-16: 16 layers that have weights, 138M parameters
• Large, but its simplicity makes it appealing

[Simonyan et al., ICLR’15] VGGNet

Residual Block
• Two layers (Input -> Linear -> Linear), plain version:
  𝑥^(𝐿+1) = 𝑓(𝑊^(𝐿+1) 𝑥^𝐿 + 𝑏^(𝐿+1))
• With the skip connection added before the non-linearity (see the sketch below):
  𝑥^(𝐿+1) = 𝑓(𝑊^(𝐿+1) 𝑥^𝐿 + 𝑏^(𝐿+1) + 𝑥^(𝐿−1))

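The residual formulation above maps directly to code. Below is a minimal sketch, assuming a PyTorch-style setup (the lecture does not prescribe a framework); the layer width of 64 and the ReLU non-linearity are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two linear layers; the skip connection is added before the final non-linearity."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.fc1(x))   # x^L = f(W^L x^{L-1} + b^L)
        out = self.fc2(out) + x       # W^{L+1} x^L + b^{L+1} + x^{L-1}
        return self.act(out)          # x^{L+1} = f(...)

x = torch.randn(8, 64)
print(ResidualBlock()(x).shape)       # torch.Size([8, 64])
```
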
Inception Layer

[Szegedy et al., CVPR’15] GoogleNet


Lecture 11



Transfer Learning



Transfer Learning
• Training your own model can be difficult with limited data and other resources
  – e.g., it is a laborious task to manually annotate your own training dataset
• Why not reuse already pre-trained models?


Transfer Learning
• Distribution P1 (large dataset) → distribution P2 (small dataset): use what has been learned in one setting for another

Transfer Learning for Images

[Zeiler et al., ECCV’14] Visualizing and Understanding Convolutional Networks

Transfer Learning
• A network trained on ImageNet can be reused for feature extraction

[Donahue et al., ICML’14] DeCAF
[Razavian et al., CVPRW’14] CNN Features off-the-shelf

Transfer Learning
• A network trained on ImageNet learns a hierarchy of features, from low level to high level:
  – Edges
  – Simple geometrical shapes (circles, etc.)
  – Parts of an object (wheel, window)
  – Decision layers

[Donahue et al., ICML’14] DeCAF
[Razavian et al., CVPRW’14] CNN Features off-the-shelf

Transfer Learning
• Keep the ImageNet-pretrained layers FROZEN as a feature extractor and TRAIN only a new decision layer on the new dataset with C classes

[Donahue et al., ICML’14] DeCAF
[Razavian et al., CVPRW’14] CNN Features off-the-shelf

Transfer Learning
• If the new dataset is big enough, TRAIN more of the pretrained layers with a low learning rate and keep fewer of them FROZEN (see the sketch below)
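A minimal sketch of the two regimes above, assuming torchvision's ImageNet-pretrained ResNet-18 as the backbone (an illustrative choice, not the lecture's exact setup); C stands for the number of classes in the new dataset:

```python
import torch
import torch.nn as nn
from torchvision import models

C = 10  # number of classes in the new (small) dataset

# 1) Feature extraction: keep the pretrained backbone FROZEN, TRAIN only a new head
model = models.resnet18(weights="IMAGENET1K_V1")
for p in model.parameters():
    p.requires_grad = False                        # freeze pretrained layers
model.fc = nn.Linear(model.fc.in_features, C)      # new decision layer (trainable)

# 2) Fine-tuning: if the dataset is big enough, also train later blocks with a low learning rate
for p in model.layer4.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD([
    {"params": model.layer4.parameters(), "lr": 1e-4},  # unfrozen block, low learning rate
    {"params": model.fc.parameters(),     "lr": 1e-2},  # new head, higher learning rate
], momentum=0.9)
```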


When Transfer Learning Makes Sense
• When tasks T1 and T2 have the same input (e.g., an RGB image)
• When you have more data for task T1 than for task T2
• When the low-level features learned for T1 could be useful for learning T2


Now you are:
• Ready to perform image classification on any dataset
• Ready to design your own architecture
• Ready to deal with other problems such as semantic segmentation (Fully Convolutional Network)


Representation Learning


Learning Good Features
• Good features are essential for successful machine learning
• (Supervised) deep learning depends on the training data used: inputs and target labels
• Changes in the inputs (noise, irregularities, etc.) can result in drastically different results


Representation Learning
• Allows for the discovery of representations required for various tasks
• Deep representation learning: the model maps input 𝑋 to output 𝑌


Deep Representation Learning
• Intuitively, deep networks learn multiple levels of abstraction


How to Learn Good Features?
• Determine desired feature invariances
• Teach machines to distinguish between similar and dissimilar things

Source: https://amitness.com/2020/03/illustrated-simclr/


How to Learn Good Features?

[Chen et al., ICML’20] SimCLR
Source: https://amitness.com/2020/03/illustrated-simclr/

Apply to Downstream Tasks

[Chen et al., ICML’20] SimCLR
Source: https://amitness.com/2020/03/illustrated-simclr/

Transfer & Representation Learning
• Transfer learning can be done via representation learning
• The effectiveness of representation learning is often demonstrated by transfer learning performance (but also by other factors, e.g., smoothness of the learned manifold)


Recurrent Neural Networks


Processing Sequences
• Recurrent neural networks process sequence data

• Input/output can be sequences



RNNs are Flexible
• Classical neural networks for image classification
• Image captioning
• Language recognition
• Machine translation
• Event classification

Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Basic Structure of an RNN
• Multi-layer RNN: inputs at the bottom, hidden states in the middle, outputs at the top

Basic Structure of an RNN
• Multi-layer RNN
• The hidden states have their own internal dynamics: a more expressive model!

Basic Structure of an RNN
• We want to have a notion of “time” or “sequence”

  𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡

  (𝑨𝑡: hidden state, 𝑨𝑡−1: previous hidden state, 𝒙𝑡: input)

[Olah, https://colah.github.io ’15] Understanding LSTMs


Basic Structure of an RNN
• In 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡, the weights 𝜽𝑐 and 𝜽𝑥 are the parameters to be learned

[Olah, https://colah.github.io ’15] Understanding LSTMs


Basic Structure of an RNN
• Hidden state: 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
• Output: 𝒉𝑡 = 𝜽ℎ 𝑨𝑡
• Note: non-linearities are ignored for now

[Olah, https://colah.github.io ’15] Understanding LSTMs


Basic Structure of an RNN
• Hidden state: 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
• Output: 𝒉𝑡 = 𝜽ℎ 𝑨𝑡
• The same parameters are used for each time step = generalization! (A minimal sketch follows.)

[Olah, https://colah.github.io ’15] Understanding LSTMs
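A minimal sketch of this recurrence, assuming PyTorch tensors for convenience (dimensions and initialization are illustrative); note how the same 𝜽𝑐, 𝜽𝑥, 𝜽ℎ are reused at every time step:

```python
import torch

input_dim, hidden_dim, output_dim = 16, 32, 8
theta_c = torch.randn(hidden_dim, hidden_dim) * 0.1   # recurrent weights
theta_x = torch.randn(hidden_dim, input_dim) * 0.1    # input weights
theta_h = torch.randn(output_dim, hidden_dim) * 0.1   # output weights

def rnn_forward(xs):
    """xs: sequence of input vectors x_t; returns all outputs h_t (non-linearities ignored, as on the slide)."""
    A = torch.zeros(hidden_dim)            # A_0
    outputs = []
    for x_t in xs:                         # the SAME parameters are reused at every step
        A = theta_c @ A + theta_x @ x_t    # A_t = θ_c A_{t-1} + θ_x x_t
        outputs.append(theta_h @ A)        # h_t = θ_h A_t
    return outputs

sequence = [torch.randn(input_dim) for _ in range(5)]
print(len(rnn_forward(sequence)))          # 5 outputs, one per time step
```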


Basic Structure of an RNN
• Unrolling RNNs: the same function is applied to the hidden state at every time step

[Olah, https://colah.github.io ’15] Understanding LSTMs




Basic Structure of an RNN
• Unrolling RNNs as feedforward nets: the weights are the same at every time step!


Backprop through an RNN
• Unroll the RNN as a feedforward net and apply the chain rule, all the way back to 𝑡 = 0
• Add up the derivatives from the different time steps for each (shared) weight

Long-term Dependencies

I moved to Germany … so I speak German fluently.

[Olah, https://colah.github.io ’15] Understanding LSTMs


Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐 𝑨𝑡−1 + 𝜽𝑥 𝒙𝑡
• Let us forget the input: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
  – The same weights are multiplied over and over again


Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
• What happens to small weights? Vanishing gradient
• What happens to large weights? Exploding gradient
  (A small numeric sketch follows.)
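A tiny numeric sketch of this effect (the weight values 0.9 and 1.1 are illustrative): repeatedly multiplying by the same scalar weight makes the contribution of 𝑨0 vanish or explode.

```python
for theta in (0.9, 1.1):                      # a "small" and a "large" weight
    for t in (10, 50, 100):
        print(f"theta={theta}, t={t:3d}: theta^t = {theta ** t:.3e}")
# theta=0.9 -> 3.487e-01, 5.154e-03, 2.656e-05  (vanishing)
# theta=1.1 -> 2.594e+00, 1.174e+02, 1.378e+04  (exploding)
```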


Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
• If 𝜽 admits an eigendecomposition
  𝜽 = 𝑸𝚲𝑸^𝑇
  where 𝑸 is the matrix of eigenvectors and the diagonal of 𝚲 contains the eigenvalues
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽^𝑡 𝑨0
• If 𝜽 admits an eigendecomposition 𝜽 = 𝑸𝚲𝑸^𝑇 with orthogonal 𝑸, we can simplify the recurrence:
  𝑨𝑡 = 𝑸𝚲^𝑡 𝑸^𝑇 𝑨0
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝑸𝚲^𝑡 𝑸^𝑇 𝑨0
• Eigenvalues with magnitude less than one → vanishing gradient
• Eigenvalues with magnitude larger than one → exploding gradient, mitigated by gradient clipping (sketch below)

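Gradient clipping is typically a one-line addition to the training loop. A minimal sketch using PyTorch's clip_grad_norm_ utility (the RNN, the dummy loss, and the threshold of 5.0 are placeholders, not the lecture's setup):

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(4, 20, 16)                 # (batch, time steps, features)
out, _ = model(x)
loss = out.pow(2).mean()                   # dummy loss, just to produce gradients

loss.backward()                                            # backprop through time
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)    # rescale gradients with norm > 5.0
optimizer.step()
optimizer.zero_grad()
```
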
Long-term Dependencies
• Simple recurrence: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
• Idea: let us just make a matrix with eigenvalues = 1, allowing the cell to maintain its “state”


Vanishing Gradient
• 1. From the weights: 𝑨𝑡 = 𝜽𝑐^𝑡 𝑨0
• 2. From the activation functions (tanh)

[Olah, https://colah.github.io ’15] Understanding LSTMs




Long Short-Term Memory

[Hochreiter et al., Neural Computation’97] Long Short-Term Memory

Long Short-Term Memory Units
• A simple RNN has tanh as its non-linearity

[Olah, https://colah.github.io ’15] Understanding LSTMs


Long Short-Term Memory Units
• LSTM

[Olah, https://colah.github.io ’15] Understanding LSTMs




Long Short-Term Memory Units
• Key ingredients
  – Cell = transports the information through the unit
  – Gate = removes or adds information to the cell state, controlled by a sigmoid

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Step by Step
• Forget gate: 𝒇𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑓 𝒙𝑡 + 𝜽ℎ𝑓 𝒉𝑡−1 + 𝒃𝑓 )
  – Decides when to erase the cell state
  – Sigmoid = output between 0 (forget) and 1 (keep)

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Step by Step
• Input gate: 𝒊𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑖 𝒙𝑡 + 𝜽ℎ𝑖 𝒉𝑡−1 + 𝒃𝑖 )
  – Decides which values will be updated
  – The new candidate cell state is the output of a tanh, in (−1, 1)

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Step by Step
• Cell update (element-wise operations):
  𝑪𝑡 = 𝒇𝑡 ⊙ 𝑪𝑡−1 + 𝒊𝑡 ⊙ 𝒈𝑡
  – combines the previous state 𝑪𝑡−1 with the current candidate state 𝒈𝑡

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Step by Step
• Output: 𝒉𝑡 = 𝒐𝑡 ⊙ tanh 𝑪𝑡
  – The output gate 𝒐𝑡 decides which values will be output
  – tanh squashes the cell state to (−1, 1)

[Olah, https://colah.github.io ’15] Understanding LSTMs




LSTM: Step by Step
• Forget gate:  𝒇𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑓 𝒙𝑡 + 𝜽ℎ𝑓 𝒉𝑡−1 + 𝒃𝑓 )
• Input gate:   𝒊𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑖 𝒙𝑡 + 𝜽ℎ𝑖 𝒉𝑡−1 + 𝒃𝑖 )
• Output gate:  𝒐𝑡 = 𝑠𝑖𝑔𝑚(𝜽𝑥𝑜 𝒙𝑡 + 𝜽ℎ𝑜 𝒉𝑡−1 + 𝒃𝑜 )
• Cell update:  𝒈𝑡 = 𝑡𝑎𝑛ℎ(𝜽𝑥𝑔 𝒙𝑡 + 𝜽ℎ𝑔 𝒉𝑡−1 + 𝒃𝑔 )
• Cell:         𝑪𝑡 = 𝒇𝑡 ⊙ 𝑪𝑡−1 + 𝒊𝑡 ⊙ 𝒈𝑡
• Output:       𝒉𝑡 = 𝒐𝑡 ⊙ tanh 𝑪𝑡
• All parameters 𝜽 and 𝒃 are learned through backpropagation (see the sketch below)

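A minimal from-scratch sketch of one LSTM step, written directly from the six equations above (PyTorch is assumed; the dimensions and random initialization are illustrative, not the lecture's code):

```python
import torch

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step following the equations on the slide."""
    th_xf, th_hf, b_f, th_xi, th_hi, b_i, th_xo, th_ho, b_o, th_xg, th_hg, b_g = params
    f_t = torch.sigmoid(th_xf @ x_t + th_hf @ h_prev + b_f)   # forget gate
    i_t = torch.sigmoid(th_xi @ x_t + th_hi @ h_prev + b_i)   # input gate
    o_t = torch.sigmoid(th_xo @ x_t + th_ho @ h_prev + b_o)   # output gate
    g_t = torch.tanh(th_xg @ x_t + th_hg @ h_prev + b_g)      # cell update (candidate)
    C_t = f_t * C_prev + i_t * g_t                            # cell state (element-wise ⊙)
    h_t = o_t * torch.tanh(C_t)                               # output
    return h_t, C_t

d_in, d_h = 16, 128
params = [torch.randn(d_h, d_in) * 0.1 if k % 3 == 0 else      # θ_x* matrices
          torch.randn(d_h, d_h) * 0.1 if k % 3 == 1 else       # θ_h* matrices
          torch.zeros(d_h)                                     # biases
          for k in range(12)]
h, C = torch.zeros(d_h), torch.zeros(d_h)
for x_t in torch.randn(5, d_in):                               # unroll over a short sequence
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)                                        # torch.Size([128]) twice
```
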
LSTM
• The cell state provides a highway for the gradient to flow

[Olah, https://colah.github.io ’15] Understanding LSTMs


LSTM: Dimensions
• Cell update: 𝒈𝑡 = 𝑡𝑎𝑛ℎ(𝜽𝑥𝑔 𝒙𝑡 + 𝜽ℎ𝑔 𝒉𝑡−1 + 𝒃𝑔 )
• When coding an LSTM, we have to define the size of the hidden state (here 128)
• Dimensions need to match: what operation do I need to apply to my input to get a 128-dimensional vector representation?

[Olah, https://colah.github.io ’15] Understanding LSTMs

LSTM in code
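The code for this slide is not included in the extracted text. Below is a minimal usage sketch instead, assuming PyTorch's nn.LSTM with the hidden size of 128 from the previous slide; a linear projection is one answer to how to get a 128-dimensional input representation:

```python
import torch
import torch.nn as nn

input_dim, hidden_dim = 20, 128

embed = nn.Linear(input_dim, hidden_dim)     # project raw inputs to 128-d vectors
lstm = nn.LSTM(input_size=hidden_dim, hidden_size=hidden_dim, batch_first=True)

x = torch.randn(4, 10, input_dim)            # (batch, sequence length, features)
out, (h_n, c_n) = lstm(embed(x))
print(out.shape, h_n.shape, c_n.shape)       # (4, 10, 128), (1, 4, 128), (1, 4, 128)
```
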
Attention



Attention is all you need

• ~62,000 citations in 5 years!



Attention vs convolution



Long-Term Dependencies

I moved to Germany … so I speak German fluently.


Source: https://colah.github.io/posts/2015-08-Understanding-LSTMs/


Attention: Intuition
• Build a context from the relevant parts of the input: “I moved to Germany … so I speak German fluently”

Attention: Architecture
• A decoder processes the information
• Decoders take as input:
  – Previous decoder hidden state
  – Previous output
  – Attention (the context)

Transformers



Deep Learning Revolution

                    Deep Learning        Deep Learning 2.0
Main idea           Convolution          Attention
Field invented      Computer vision      NLP
Started             NeurIPS 2012         NeurIPS 2017
Paper               AlexNet              Transformers
Conquered vision    Around 2014-2015     Around 2020-2021
Replaced            Traditional ML/CV    CNNs, RNNs (augmented)


Transformers
• Multi-Head Attention on the “encoder”
• Masked Multi-Head Attention on the “decoder”
• Fully connected layers


Multi-Head Attention
• Intuition: take the query Q, find the most similar key K, and then find the value V that corresponds to that key.
• In other words, learn V, K, Q where:
  – V: here is a bunch of interesting things.
  – K: here is how we can index some things.
  – Q: I would like to know this interesting thing.
• Loosely connected to Neural Turing Machines (Graves et al.).



Multi-Head Attention
• Index the values via a differentiable operator: multiply the queries with the keys, then get the values

  Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾^𝑇 / √𝑑𝑘) 𝑉

• To train this well, divide by √𝑑𝑘, “probably” because for large values of the key’s dimension the dot product grows large in magnitude, pushing the softmax function into regions where it has extremely small gradients. (A minimal sketch follows.)

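A minimal sketch of this formula (PyTorch assumed; the tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)             # distribution over the values
    return weights @ V                              # weighted sum of the values

Q = torch.randn(1, 5, 64)    # (batch, queries, d_k)
K = torch.randn(1, 7, 64)    # (batch, keys, d_k)
V = torch.randn(1, 7, 64)    # (batch, keys, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 5, 64])
```
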
Multi-Head Attention

Adapted from Y. Kilcher



Multi-Head Attention
• A query Q is compared against the keys K1 … K5



Multi-Head Attention
• Each key K1 … K5 has a corresponding value V1 … V5



Multi-Head Attention
• With keys K1 … K5, values V1 … V5, and query Q: essentially, compute the dot products <Q, K1>, <Q, K2>, <Q, K3>, <Q, K4>, <Q, K5>

Multi-Head Attention
• softmax(𝑄𝐾^𝑇 / √𝑑𝑘) is simply inducing a distribution over the values: the larger a score is, the higher its softmax weight.
• It can be interpreted as a differentiable soft indexing.

Multi-Head Attention
• softmax(𝑄𝐾^𝑇 / √𝑑𝑘) selects the values V where the network needs to attend.

Transformers – a closer look
• K parallel attention heads (a usage sketch follows)
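A short usage sketch of K parallel heads, assuming PyTorch's built-in nn.MultiheadAttention (the model size 512 and 8 heads follow the original paper, but the snippet itself is illustrative):

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8                      # 8 parallel heads of size 512 / 8 = 64
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(2, 10, d_model)                  # (batch, sequence, d_model)
out, weights = attn(x, x, x)                     # self-attention: Q = K = V = x
print(out.shape, weights.shape)                  # (2, 10, 512), (2, 10, 10)
```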


Transformers – a closer look
• Good old fully-connected layers


Transformers – a closer look
• N layers of attention followed by fully-connected layers


Transformers – a closer look
• Same as multi-head attention, but masked: ensures that the predictions for position i can depend only on the known outputs at positions less than i (a small masking sketch follows)
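A minimal sketch of such a mask (illustrative): entries above the diagonal are set to −∞ before the softmax, so position i cannot attend to future outputs.

```python
import torch

n = 5
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)   # True strictly above the diagonal
scores = torch.randn(n, n)                                          # raw attention scores
masked = scores.masked_fill(mask, float("-inf"))                    # block attention to future positions
weights = torch.softmax(masked, dim=-1)
print(weights[0])    # row 0 puts all its weight on position 0
```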


Transformers – a closer look
• Multi-head attention between the encoder and the decoder


Transformers – a closer look
• Projection and prediction


What is missing from self-attention?
• Convolution: a different linear transformation for each relative position, which allows you to distinguish what information came from where.
• Self-attention: a weighted average.


Transformers – a closer look
• Uses a fixed positional encoding based on trigonometric series, so that the model can make use of the order of the sequence:

  𝑃𝐸(𝑝𝑜𝑠, 2𝑖)   = sin(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑_model))
  𝑃𝐸(𝑝𝑜𝑠, 2𝑖+1) = cos(𝑝𝑜𝑠 / 10000^(2𝑖/𝑑_model))

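A minimal sketch computing this encoding (PyTorch assumed; the function name and arguments are illustrative):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pe = torch.zeros(max_len, d_model)
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)    # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(pos * div)     # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    pe[:, 1::2] = torch.cos(pos * div)     # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    return pe

print(positional_encoding(max_len=50, d_model=512).shape)          # torch.Size([50, 512])
```
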
Transformers – a final look





Self-attention: complexity
• Per-layer complexities (table below), where n is the sequence length, d is the representation dimension, k is the convolutional kernel size, and r is the size of the neighborhood.
• Considering that most sentences are shorter than the representation dimension (512 in the paper), self-attention is very efficient.

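The complexity table itself is not in the extracted text; for reference, the per-layer figures from the cited paper ([Vaswani et al., NeurIPS’17] Attention Is All You Need) are:

                              Complexity per layer   Sequential operations   Maximum path length
Self-attention                O(n² · d)              O(1)                    O(1)
Recurrent                     O(n · d²)              O(n)                    O(n)
Convolutional                 O(k · n · d²)          O(1)                    O(log_k(n))
Self-attention (restricted)   O(r · n · d)           O(1)                    O(n/r)
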
Transformers – training tricks
• ADAM optimizer with a proportional learning-rate schedule (formula below)
• Residual dropout
• Label smoothing
• Checkpoint averaging
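The schedule itself is not shown in the extracted text; in the original paper it is

  lrate = d_model^(−0.5) · min(step^(−0.5), step · warmup_steps^(−1.5))

i.e., the learning rate grows linearly during warm-up and then decays with the inverse square root of the step number.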


Transformers - results



Transformers - summary
• Significantly improved the SOTA in machine translation
• Launched a new deep-learning revolution in NLP
• Building block of NLP models like BERT (Google) or GPT/ChatGPT (OpenAI)
• BERT has been heavily used in Google Search
• And eventually made its way to computer vision (and other related fields)


See you next time!
