Lecture 7: Training Neural Networks, Part 1
One person per group needs to submit, but tag all group members.
Ranjay:
- I am defending my PhD on Friday, April 23rd at 1pm PST.
- Stanford CS defenses are public events.
- Join CS 547's seminar if you want to watch it.
- If you want to watch but cannot find the Zoom link, send me an email by Thursday at 3pm.
Computational graphs
[Figure: computational graph for a linear classifier: x and W feed a multiply node producing the scores s; the hinge loss and the regularization term R(W) are added (+) to give the total loss L.]
[Figure: 2-layer neural network: x (3072) → W1 → h (100) → W2 → s (10).]
Convolution Layer
[Figure: convolving a 32x32x3 input with one filter produces a 28x28x1 activation map; using 6 filters gives 6 maps.]
We stack these up to get a “new image” of size 28x28x6!
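As a quick check of this arithmetic (assuming the usual 5x5 filters with stride 1 and no padding for this example, since the filter size is not shown in the extracted text):

$$\text{output size} = \frac{W - F}{S} + 1 = \frac{32 - 5}{1} + 1 = 28$$

Stacking the 6 resulting 28x28 activation maps gives the 28x28x6 output.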
Mini-batch SGD
Loop:
1. Sample a batch of data
2. Forward prop it through the graph
(network), get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient (see the PyTorch sketch below)
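The sketch below assumes a hypothetical two-layer model matching the earlier x → W1 → h → W2 → s diagram; the loss, optimizer, and data loader are stand-ins for whatever you actually use:

```python
import torch
import torch.nn as nn

# Hypothetical two-layer network: 3072 -> 100 -> 10, as in the diagram above.
model = nn.Sequential(nn.Linear(3072, 100), nn.ReLU(), nn.Linear(100, 10))
loss_fn = nn.MultiMarginLoss()  # multi-class hinge loss, matching the graph above
# weight_decay plays the role of the regularization term R(W)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

def train_one_epoch(loader):
    for x, y in loader:                        # 1. sample a batch of data
        scores = model(x.view(x.size(0), -1))  # 2. forward prop through the network
        loss = loss_fn(scores, y)              #    ... and get the loss
        optimizer.zero_grad()
        loss.backward()                        # 3. backprop to calculate the gradients
        optimizer.step()                       # 4. update the parameters using the gradient
```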
Hardware + Software
- PyTorch
- TensorFlow
Activation Functions
[Figure: plots of common activation functions: Sigmoid, tanh, ReLU, ELU, Maxout.]
Sigmoid problem: the outputs are not zero-centered. Consider what happens when the input to a neuron is always positive: the sign of the gradient for every w_i is the same as the sign of the upstream scalar gradient!
[Figure: zig-zag gradient-update path toward the hypothetical optimal w vector.]
What can we say about the gradients on w? Always all positive or all negative :(
(For a single element! Minibatches help)
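A tiny numpy check of this claim, for a single hypothetical neuron whose inputs are all positive (e.g. sigmoid outputs from the previous layer):

```python
import numpy as np

x = np.array([0.2, 0.7, 0.4])   # inputs are all positive (e.g. sigmoid outputs)
upstream = -0.8                 # dL/df, the upstream scalar gradient

# f = w.x + b, so df/dw_i = x_i and dL/dw_i = upstream * x_i
grad_w = upstream * x
print(grad_w)                   # [-0.16 -0.56 -0.32]: all share the sign of `upstream`
```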
Activation Functions: Sigmoid
σ(x) = 1 / (1 + e^-x)
- Squashes numbers to range [0,1]
- Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron
3 problems:
1. Saturated neurons “kill” the gradients (see the numerical sketch below)
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive
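To see the saturation problem numerically, here is a small sketch of the sigmoid and its local gradient σ(x)(1 - σ(x)):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    s = sigmoid(x)
    local_grad = s * (1 - s)  # d(sigmoid)/dx
    print(f"x={x:6.1f}  sigmoid={s:.4f}  local grad={local_grad:.6f}")
# At |x| = 10 the local gradient is ~0.000045: the gradient is effectively "killed".
```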
tanh(x): squashes numbers to range [-1,1]; zero-centered (nice); still kills gradients when saturated.
ReLU (Rectified Linear Unit) [Krizhevsky et al., 2012]
f(x) = max(0, x)
[Figure: a data cloud with two ReLU hyperplanes: an “active ReLU” that intersects the data and a “dead ReLU” that lies outside it.]
A dead ReLU will never activate => never update.
Leaky ReLU: f(x) = max(0.01x, x)
SELU (Scaled Exponential Linear Unit): α = 1.6733, λ = 1.0507
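A compact numpy sketch of these activations; the Leaky ReLU slope of 0.01 and the ELU α = 1.0 are common default choices assumed here:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def selu(x, alpha=1.6733, lam=1.0507):
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), elu(x), selu(x), sep="\n")
```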
(Recall the zig-zag figure above: when all inputs to a neuron are positive, the gradients on w are always all positive or all negative. This is also why you want zero-mean data!)
Data Preprocessing
TLDR: In practice for images: center only
e.g. consider the CIFAR-10 example with [32,32,3] images:
- Subtract the mean image (e.g. AlexNet) (mean image = [32,32,3] array)
- Subtract per-channel mean (e.g. VGGNet) (mean along each channel = 3 numbers)
- Subtract per-channel mean and divide by per-channel std (e.g. ResNet) (mean along each channel = 3 numbers)
Not common to do PCA or whitening. (A short numpy sketch of the per-channel variants follows.)
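A sketch of the per-channel variants on a hypothetical CIFAR-10-like array `train_images` of shape [N, 32, 32, 3]:

```python
import numpy as np

# Hypothetical stand-in for the real training set, shape [N, 32, 32, 3].
train_images = np.random.rand(50000, 32, 32, 3).astype(np.float32)

# Per-channel mean and std: 3 numbers each, computed over N, H, W.
mean = train_images.mean(axis=(0, 1, 2))   # shape (3,)
std = train_images.std(axis=(0, 1, 2))     # shape (3,)

# Center only (VGGNet-style) ...
centered = train_images - mean
# ... or center and scale (ResNet-style). Reuse the *training* statistics at test time.
normalized = (train_images - mean) / std
```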
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

Let: y = x_1 w_1 + x_2 w_2 + ... + x_Din w_Din
Assume: Var(x_1) = Var(x_2) = ... = Var(x_Din)
We want: Var(y) = Var(x_i)
For zero-mean, independent x and w this gives Var(y) = Din · Var(w_i) · Var(x_i), so choosing Var(w_i) = 1/Din (i.e. std = 1/sqrt(Din), “Xavier initialization”) keeps the variance of the activations constant from layer to layer.
- He et al, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”, ICCV 2015
- Saxe et al, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”, 2013
- Sussillo and Abbott, “Random walk initialization for training very deep feedforward networks”, 2014
- Frankle and Carbin, “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”, 2019
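A sketch of Xavier initialization as derived above (std = 1/sqrt(Din)), plus the ReLU correction from He et al. listed above (std = sqrt(2/Din)); the layer sizes here are made up:

```python
import numpy as np

din, dout = 3072, 100   # hypothetical layer sizes

# Xavier initialization: Var(w) = 1/Din keeps Var(y) = Var(x) for linear/tanh units.
w_xavier = np.random.randn(din, dout) / np.sqrt(din)

# Kaiming/He initialization: ReLU zeroes half its inputs, so use Var(w) = 2/Din instead.
w_he = np.random.randn(din, dout) * np.sqrt(2.0 / din)
```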
Batch Normalization [Ioffe and Szegedy, 2015]
(this is a vanilla differentiable function, so we can backprop through it)

Input x, shape is N x D
- Per-channel mean 𝞵, shape is D
- Per-channel var 𝝈², shape is D
- Normalized x̂ = (x - 𝞵) / √(𝝈² + ε), shape is N x D
- Output y = ɣ x̂ + β, shape is N x D, with learnable scale ɣ and shift β (each 1 x D)
Learning ɣ = 𝝈 and β = 𝞵 will recover the identity function!
Batch Normalization: Test-Time
- The estimates of 𝞵 and 𝝈 depend on the minibatch; we can’t do this at test-time!
- Instead, use fixed (running average) estimates of 𝞵 and 𝝈 collected during training.
- During testing batchnorm therefore becomes a linear operator! It can be fused with the previous fully-connected or conv layer.
- Normalized x and output still have shape N x D. (A numpy sketch of both modes follows.)
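A minimal numpy sketch of these two modes; the ε and momentum values are typical defaults, assumed here:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, eps=1e-5, momentum=0.1):
    """x: (N, D). gamma, beta, running_mean, running_var: (D,)."""
    if train:
        mu = x.mean(axis=0)    # per-channel mean, shape (D,)
        var = x.var(axis=0)    # per-channel variance, shape (D,)
        # keep running averages for use at test time
        running_mean *= (1 - momentum); running_mean += momentum * mu
        running_var *= (1 - momentum); running_var += momentum * var
    else:
        # Test time: use the running averages; BN is now a fixed linear operator.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized x, shape (N, D)
    return gamma * x_hat + beta             # output, shape (N, D)
```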
Batch Normalization [Ioffe and Szegedy, 2015]
Usually inserted after fully-connected (or convolutional) layers and before the nonlinearity:
... FC → BN → tanh → FC → BN → tanh → ... (a PyTorch sketch of this stacking follows the list)
- Makes deep networks much easier to train!
- Improves gradient flow
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
- Behaves differently during training and testing: this is a very common source of bugs!
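In PyTorch the FC → BN → tanh stacking above looks roughly like this (the layer widths are made up):

```python
import torch.nn as nn

# BN is inserted after the linear layer and before the nonlinearity.
model = nn.Sequential(
    nn.Linear(3072, 100),
    nn.BatchNorm1d(100),
    nn.Tanh(),
    nn.Linear(100, 10),
)
# model.train() uses per-batch statistics; model.eval() switches to the running
# averages. Forgetting to call eval() at test time is the classic bug mentioned above.
```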
Batch Normalization for fully-connected networks: x: N × D, normalize over the batch dimension N.
Batch Normalization for convolutional networks: x: N×C×H×W, normalize over N, H, W (one mean/var per channel).
Layer Normalization
Batch Normalization for fully-connected networks:
- x: N × D, normalize over N
- 𝞵,𝝈: 1 × D; ɣ,β: 1 × D
- y = ɣ(x-𝞵)/𝝈+β
Layer Normalization for fully-connected networks:
- x: N × D, normalize over D
- 𝞵,𝝈: N × 1; ɣ,β: 1 × D
- y = ɣ(x-𝞵)/𝝈+β
- Same behavior at train and test!
- Can be used in recurrent networks
Ba, Kiros, and Hinton, “Layer Normalization”, arXiv 2016
Instance Normalization
Batch Normalization for convolutional networks: x: N×C×H×W, normalize over N, H, W.
Instance Normalization for convolutional networks: x: N×C×H×W, normalize over H, W only. Same behavior at train / test!
Comparison of Normalization Layers
Group Normalization
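One way to keep BatchNorm, LayerNorm, InstanceNorm, and GroupNorm straight is which axes the statistics are computed over. A numpy sketch for a conv feature map x of shape (N, C, H, W); the group count G is a made-up example value that divides C:

```python
import numpy as np

N, C, H, W, G = 8, 6, 4, 4, 3
x = np.random.randn(N, C, H, W)

batch_mean    = x.mean(axis=(0, 2, 3), keepdims=True)   # BatchNorm: over N, H, W -> (1, C, 1, 1)
layer_mean    = x.mean(axis=(1, 2, 3), keepdims=True)   # LayerNorm: over C, H, W -> (N, 1, 1, 1)
instance_mean = x.mean(axis=(2, 3), keepdims=True)      # InstanceNorm: over H, W -> (N, C, 1, 1)
# GroupNorm: split C into G groups, normalize over each group's channels and H, W.
group_mean = x.reshape(N, G, C // G, H, W).mean(axis=(2, 3, 4), keepdims=True)
```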
Transfer learning
“You need a lot of data if you want to train/use CNNs”
BUSTED!
Transfer Learning with CNNs
AlexNet: 64 x 3 x 11 x 11
1. Train on ImageNet
[Architecture diagram, bottom to top: Image, Conv-64, Conv-64, MaxPool, Conv-128, Conv-128, MaxPool, Conv-256, Conv-256, MaxPool, Conv-512, Conv-512, MaxPool, Conv-512, Conv-512, MaxPool, FC-4096, FC-4096, FC-1000]
[Figure: the pretrained network shown side by side with a copy adapted to the new task, reusing the pretrained conv layers and retraining the top layers.]
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
The layers near the image (Conv-64, Conv-128, ...) are more generic; the layers near the classifier (..., FC-4096, FC-1000) are more specific.

                     | very similar dataset                | very different dataset
very little data     | Use Linear Classifier on top layer  | You’re in trouble… Try linear classifier from different stages
quite a lot of data  | Finetune a few layers               | Finetune a larger number of layers

(A PyTorch sketch of the “very similar dataset” column follows.)
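A sketch of the “very little data” and “quite a lot of data” rows of that column, using a hypothetical torchvision ResNet and a made-up number of classes (the older `pretrained=True` API is assumed here):

```python
import torch.nn as nn
import torchvision

num_classes = 20                                        # hypothetical new dataset
model = torchvision.models.resnet18(pretrained=True)   # 1. start from ImageNet weights

# Very little, similar data: freeze everything and train a linear classifier on top.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head is trainable by default

# Quite a lot of data: also unfreeze the last stage(s) and finetune them,
# typically with a smaller learning rate.
for p in model.layer4.parameters():
    p.requires_grad = True
```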
Transfer learning with CNNs is pervasive… (it’s the norm, not an exception)
- Object Detection (Fast R-CNN): CNN pretrained on ImageNet
- Image Captioning (CNN + RNN): CNN pretrained on ImageNet
Zhou et al, “Unified Vision-Language Pre-Training for Image Captioning and VQA”, CVPR 2020
Krishna et al, “Visual Genome: Connecting language and vision using crowdsourced dense image annotations”, IJCV 2017
Devlin et al, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, arXiv 2018
Transfer learning with CNNs -
Architecture matters
Girshick, “The Generalized R-CNN Framework for Object Detection”, ICCV 2017 Tutorial on Instance-Level Visual Recognition
Transfer learning with CNNs is pervasive…
But recent results show it might not always be necessary!
Training from scratch can work just as
well as training from a pretrained
ImageNet model for object detection
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images? Start from a model pretrained on a large, similar dataset (e.g. ImageNet) and finetune it on your data. Both major frameworks ship model zoos of pretrained models:
- TensorFlow: https://github.com/tensorflow/models
- PyTorch: https://github.com/pytorch/vision
Summary TLDRs
We looked in detail at:
- Activation functions (use ReLU)
- Data preprocessing (for images: subtract the mean)
- Weight initialization (use Xavier/He init)
- Batch normalization (use it!)
- Transfer learning (use it if you can!)
Next time:
Training Neural Networks, Part 2
- Parameter update schemes
- Learning rate schedules
- Gradient checking
- Regularization (Dropout etc.)
- Babysitting learning
- Evaluation (Ensembles etc.)
- Hyperparameter Optimization
- Transfer learning / fine-tuning