
Lecture 11:

Training Neural Networks


(Part 2)

Justin Johnson Lecture 11 - 1 October 9, 2019


Reminder: A3

• Due Monday, October 14

• Remember to run the validation script!

Justin Johnson Lecture 11 - 2 October 9, 2019


Reminder: Midterm
• Monday, October 21 (two weeks from today!)
• Location: Chrysler 220 (NOT HERE!)
• Format:
• True / False, Multiple choice, short answer
• Emphasize concepts – you don’t need to memorize AlexNet!
• Closed-book
• You can bring a 1-page “cheat sheet” of handwritten notes
(standard 8.5” x 11” paper)
• Alternate exam times: fill out this form: https://forms.gle/uiMpHdg9752p27bd7
• Conflict with EECS 551
• SSD accommodations
• Conference travel for Michigan
Justin Johnson Lecture 11 - 3 October 9, 2019
Reminder: No class on Monday 10/14!
(Fall Study Break)

Justin Johnson Lecture 11 - 4 October 9, 2019


Overview
1. One time setup (Last Time)
   Activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics (Today)
   Learning rate schedules; hyperparameter optimization
3. After training
   Model ensembles, transfer learning, large-batch training

Justin Johnson Lecture 11 - 5 October 9, 2019


Last Time: Activation Functions
Sigmoid, tanh, ReLU, Leaky ReLU, ELU, Maxout

Justin Johnson Lecture 11 - 6 October 9, 2019


Last Time: Data Preprocessing

Justin Johnson Lecture 11 - 7 October 9, 2019


Last Time: Weight Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

Justin Johnson Lecture 11 - 8 October 9, 2019
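A minimal NumPy sketch of this initialization rule (the layer sizes and the single tanh layer are hypothetical, purely for illustration):

import numpy as np

def xavier_init(d_in, d_out):
    # "Xavier" rule from the slide: scale standard Gaussian weights by 1/sqrt(Din)
    return np.random.randn(d_in, d_out) / np.sqrt(d_in)

# Hypothetical check: activations keep a reasonable scale through a tanh layer
x = np.random.randn(1000, 512)          # batch of 1000 inputs, Din = 512
W = xavier_init(512, 512)
h = np.tanh(x @ W)
print(x.std(), h.std())                 # both stay on the order of ~1 (tanh squashes slightly)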


Last Time: Data Augmentation

[Diagram: load image and label (“cat”) → transform image → CNN → compute loss]

Justin Johnson Lecture 11 - 9 October 9, 2019


Last Time: Regularization
Training: Add randomness
Testing: Marginalize out randomness
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional pooling, Stochastic Depth, Cutout, Mixup

Justin Johnson Lecture 11 - 10 October 9, 2019


Overview
1. One time setup
   Activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics (Today)
   Learning rate schedules; hyperparameter optimization
3. After training
   Model ensembles, transfer learning, large-batch training

Justin Johnson Lecture 11 - 11 October 9, 2019


Learning Rate Schedules

Justin Johnson Lecture 11 - 12 October 9, 2019


SGD, SGD+Momentum, Adagrad, RMSProp, Adam
all have learning rate as a hyperparameter.

Justin Johnson Lecture 11 - 13 October 9, 2019


SGD, SGD+Momentum, Adagrad, RMSProp, Adam
all have learning rate as a hyperparameter.

Q: Which one of these learning rates


is best to use?

Justin Johnson Lecture 11 - 14 October 9, 2019


SGD, SGD+Momentum, Adagrad, RMSProp, Adam
all have learning rate as a hyperparameter.

Q: Which one of these learning rates


is best to use?

A: All of them! Start with large


learning rate and decay over time

Justin Johnson Lecture 11 - 15 October 9, 2019


Learning Rate Decay: Step
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Justin Johnson Lecture 11 - 16 October 9, 2019


Learning Rate Decay: Cosine
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))

Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017
Radford et al, “Improving Language Understanding by Generative Pre-Training”, 2018
Feichtenhofer et al, “SlowFast Networks for Video Recognition”, ICCV 2019
Radosavovic et al, “On Network Design Spaces for Visual Recognition”, ICCV 2019
Child et al, “Generating Long Sequences with Sparse Transformers”, arXiv 2019

Justin Johnson Lecture 11 - 17 October 9, 2019


Learning Rate Decay: Linear
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))

Linear: α_t = α_0 (1 - t / T)

Devlin et al, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019
Liu et al, “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, arXiv 2019
Yang et al, “XLNet: Generalized Autoregressive Pretraining for Language Understanding”, NeurIPS 2019

Justin Johnson Lecture 11 - 18 October 9, 2019


Learning Rate Decay: Inverse Sqrt
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))

Linear: α_t = α_0 (1 - t / T)

Inverse sqrt: α_t = α_0 / √t

Vaswani et al, “Attention is all you need”, NIPS 2017

Justin Johnson Lecture 11 - 19 October 9, 2019


Learning Rate Decay: Constant!
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))

Linear: α_t = α_0 (1 - t / T)

Inverse sqrt: α_t = α_0 / √t

Constant: α_t = α_0

Brock et al, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, ICLR 2019
Donahue and Simonyan, “Large Scale Adversarial Representation Learning”, NeurIPS 2019

Justin Johnson Lecture 11 - 20 October 9, 2019
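A minimal sketch of these schedules as plain Python functions (the base LR, T, and the step milestones are illustrative values, not prescribed by the slides):

import math

def step_lr(t, lr0=0.1, milestones=(30, 60, 90), gamma=0.1):
    # Multiply the LR by gamma at each milestone epoch (ResNet-style step decay)
    return lr0 * gamma ** sum(t >= m for m in milestones)

def cosine_lr(t, lr0=0.1, T=100):
    # Half-cosine decay from lr0 down to 0 over T epochs
    return 0.5 * lr0 * (1 + math.cos(math.pi * t / T))

def linear_lr(t, lr0=0.1, T=100):
    # Linear decay from lr0 down to 0 over T epochs
    return lr0 * (1 - t / T)

def inv_sqrt_lr(t, lr0=0.1):
    # Inverse-square-root decay (as used for Transformer training)
    return lr0 / math.sqrt(max(t, 1))

for t in (0, 30, 60, 90):
    print(t, step_lr(t), round(cosine_lr(t), 4), linear_lr(t), round(inv_sqrt_lr(t), 4))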


How long to train? Early Stopping

[Plots: training loss vs. iteration; train and val accuracy vs. iteration, with “stop training here” marked where val accuracy peaks]

Stop training the model when accuracy on the validation set decreases.
Or train for a long time, but always keep track of the model snapshot that
worked best on val. Always a good idea to do this!

Justin Johnson Lecture 11 - 21 October 9, 2019
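A minimal sketch of the keep-the-best-snapshot idea for a PyTorch-style model (the train_one_epoch and evaluate callables and the patience value are hypothetical placeholders, not from the slides):

import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, num_epochs=100, patience=10):
    # model: PyTorch-style module; train_one_epoch / evaluate: user-supplied callables
    best_val_acc, best_state, epochs_since_best = 0.0, None, 0
    for epoch in range(num_epochs):
        train_one_epoch(model)                 # one pass over the training data
        val_acc = evaluate(model)              # accuracy on the validation set
        if val_acc > best_val_acc:
            best_val_acc, epochs_since_best = val_acc, 0
            best_state = copy.deepcopy(model.state_dict())   # snapshot that worked best on val
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # val accuracy stopped improving
                break
    model.load_state_dict(best_state)          # roll back to the best snapshot
    return model, best_val_acc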


Choosing Hyperparameters

Justin Johnson Lecture 11 - 22 October 9, 2019


Choosing Hyperparameters: Grid Search

Choose several values for each hyperparameter
(often space choices log-linearly)

Example:
Weight decay: [1e-4, 1e-3, 1e-2, 1e-1]
Learning rate: [1e-4, 1e-3, 1e-2, 1e-1]

Evaluate all possible choices on this hyperparameter grid

Justin Johnson Lecture 11 - 23 October 9, 2019


Choosing Hyperparameters: Random Search

Choose several values for each hyperparameter
(often space choices log-linearly)

Example:
Weight decay: log-uniform on [1e-4, 1e-1]
Learning rate: log-uniform on [1e-4, 1e-1]

Run many different trials

Justin Johnson Lecture 11 - 24 October 9, 2019
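A minimal sketch contrasting the two: enumerating a grid vs. drawing log-uniform random samples (the ranges match the examples above; the number of random trials is arbitrary):

import itertools
import math
import random

# Grid search: evaluate every combination of a few fixed values
grid_lr = [1e-4, 1e-3, 1e-2, 1e-1]
grid_wd = [1e-4, 1e-3, 1e-2, 1e-1]
grid_trials = list(itertools.product(grid_lr, grid_wd))   # 16 (lr, wd) pairs

def log_uniform(lo, hi):
    # Sample uniformly in log space, then exponentiate
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

# Random search: each trial gets fresh values for every hyperparameter
random_trials = [(log_uniform(1e-4, 1e-1), log_uniform(1e-4, 1e-1)) for _ in range(16)]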


Hyperparameters: Random vs Grid Search

[Figure: grid layout vs. random layout of trials over an important parameter (x-axis) and an unimportant parameter (y-axis); random search covers many more distinct values of the important parameter]

Bergstra and Bengio, “Random Search for Hyper-Parameter Optimization”, JMLR 2012

Justin Johnson Lecture 11 - 25 October 9, 2019


Choosing Hyperparameters: Random Search

Radosavovic et al, “On Network Design Spaces for Visual Recognition”, ICCV 2019

Justin Johnson Lecture 11 - 26 October 9, 2019


Choosing Hyperparameters
(without tons of GPUs)

Justin Johnson Lecture 11 - 27 October 9, 2019


Choosing Hyperparameters

Step 1: Check initial loss

Turn off weight decay, sanity check loss at initialization


e.g. log(C) for softmax with C classes

Justin Johnson Lecture 11 - 28 October 9, 2019
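A minimal sketch of this sanity check (the 10-class linear stand-in model, batch size, and input size are hypothetical; a PyTorch-style setup is assumed):

import math
import torch
import torch.nn as nn

C = 10                                        # number of classes (hypothetical)
model = nn.Linear(3 * 32 * 32, C)             # stand-in for a freshly initialized network
x = torch.randn(128, 3 * 32 * 32)             # a random minibatch
y = torch.randint(0, C, (128,))

loss = nn.CrossEntropyLoss()(model(x), y)
print(loss.item(), math.log(C))               # with weight decay off, these should be close (~2.3)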


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample

Try to train to 100% training accuracy on a small sample of training


data (~5-10 minibatches); fiddle with architecture, learning rate,
weight initialization. Turn off regularization.

Loss not going down? LR too low, bad initialization


Loss explodes to Inf or NaN? LR too high, bad initialization

Justin Johnson Lecture 11 - 29 October 9, 2019
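A minimal sketch of the overfit-a-small-sample check (the tiny MLP, the random stand-in minibatches, and the optimizer settings are all hypothetical; a PyTorch-style setup is assumed):

import torch
import torch.nn as nn

# Hypothetical stand-ins: a tiny MLP and ~10 random minibatches of 32 examples each
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100), nn.ReLU(), nn.Linear(100, 10))
small_sample = [(torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))) for _ in range(10)]

# Regularization off (weight_decay=0); fiddle with lr / init until this overfits
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=0.0)

for step in range(500):
    x, y = small_sample[step % len(small_sample)]
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Loss should head toward ~0 (100% accuracy on this tiny sample).
    # Not going down? LR too low or bad init. Exploding to Inf/NaN? LR too high.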


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down

Use the architecture from the previous step, use all training data,
turn on small weight decay, find a learning rate that makes the loss
drop significantly within ~100 iterations

Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4

Justin Johnson Lecture 11 - 30 October 9, 2019


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs

Choose a few values of learning rate and weight decay around what
worked from Step 3, train a few models for ~1-5 epochs.

Good weight decay to try: 1e-4, 1e-5, 0

Justin Johnson Lecture 11 - 31 October 9, 2019


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer

Pick best models from Step 4, train them for longer


(~10-20 epochs) without learning rate decay

Justin Johnson Lecture 11 - 32 October 9, 2019


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at learning curves

Justin Johnson Lecture 11 - 33 October 9, 2019


Look at Learning Curves!

Losses may be noisy: use a scatter plot, and also plot a moving
average to see trends better.

Justin Johnson Lecture 11 - 34 October 9, 2019
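A minimal matplotlib sketch of that plot (the loss history here is synthetic noise, purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic noisy loss curve, standing in for a real training log
iters = np.arange(5000)
loss = np.exp(-iters / 2000) + 0.1 * np.random.rand(5000)

window = 100
moving_avg = np.convolve(loss, np.ones(window) / window, mode="valid")

plt.scatter(iters, loss, s=2, alpha=0.3, label="raw loss")
plt.plot(iters[window - 1:], moving_avg, color="red", label="moving average (100 iters)")
plt.xlabel("iteration"); plt.ylabel("loss"); plt.legend(); plt.show()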


Loss
[Plot: loss vs. time stays flat for a while before starting to drop]
Bad initialization a prime suspect

Justin Johnson Lecture 11 - 35 October 9, 2019


Loss
[Plot: loss vs. time flattens out]
Loss plateaus: Try learning rate decay

Justin Johnson Lecture 11 - 36 October 9, 2019


Loss
[Plot: loss vs. time, with a learning rate step decay marked]
Loss was still going down when the learning rate dropped: you decayed too early!

Justin Johnson Lecture 11 - 37 October 9, 2019


Accuracy
[Plot: train and val accuracy vs. time, both still rising]
Accuracy still going up: you need to train longer

Justin Johnson Lecture 11 - 38 October 9, 2019


Accuracy
[Plot: train and val accuracy vs. time, with a large gap between them]
Huge train / val gap means overfitting! Increase regularization, get more data

Justin Johnson Lecture 11 - 39 October 9, 2019


Accuracy
[Plot: train and val accuracy vs. time, nearly overlapping]
No gap between train / val means underfitting: train longer, use a bigger model

Justin Johnson Lecture 11 - 40 October 9, 2019


Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss curves
Step 7: GOTO step 5

Justin Johnson Lecture 11 - 41 October 9, 2019


Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2 / Dropout strength)

[Image: a “neural networks practitioner” turning the knobs; music = loss function]

This image by Paolo Guereta is licensed under CC-BY 2.0

Justin Johnson Lecture 11 - 42 October 9, 2019


Cross-validation
“command center”

Justin Johnson Lecture 11 - 43 October 9, 2019


Track ratio of weight update / weight magnitude

ratio between the updates and values: ~ 0.0002 / 0.02 = 0.01 (about okay)
want this to be somewhere around 0.001 or so

Justin Johnson Lecture 11 - 44 October 9, 2019
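A minimal sketch of tracking this ratio for a plain SGD step (the parameter tensor, gradient, and learning rate are hypothetical; a PyTorch-style setup is assumed):

import torch

param = torch.randn(4096, 1024) * 0.02        # hypothetical weight matrix
grad = torch.randn_like(param)                # hypothetical gradient from backward()
lr = 1e-3

update = -lr * grad                           # the step a plain SGD update would take
update_scale = update.norm()
param_scale = param.norm()
print((update_scale / param_scale).item())    # want this ratio to be around ~1e-3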


Overview
1. One time setup
   Activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics
   Learning rate schedules; hyperparameter optimization
3. After training
   Model ensembles, transfer learning, large-batch training

Justin Johnson Lecture 11 - 45 October 9, 2019


Model Ensembles

1. Train multiple independent models


2. At test time average their results
(Take average of predicted probability distributions, then choose
argmax)

Enjoy 2% extra performance

Justin Johnson Lecture 11 - 46 October 9, 2019
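A minimal sketch of averaging predicted probability distributions across an ensemble (the models list and the input batch are hypothetical stand-ins; a PyTorch-style setup is assumed):

import torch
import torch.nn as nn

# Hypothetical ensemble: a few independently trained (here: randomly initialized) classifiers
models = [nn.Linear(128, 10) for _ in range(3)]
x = torch.randn(32, 128)                      # a test minibatch

with torch.no_grad():
    probs = torch.stack([m(x).softmax(dim=1) for m in models])     # (3, 32, 10)
    ensemble_pred = probs.mean(dim=0).argmax(dim=1)                # average probs, then argmax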


Model Ensembles: Tips and Tricks
Instead of training independent models, use multiple
snapshots of a single model during training!

Cyclic learning rate schedules can make this work even better!

Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.

Justin Johnson Lecture 11 - 47 October 9, 2019
Model Ensembles: Tips and Tricks

Instead of using the actual parameter vector, keep a moving
average of the parameter vector and use that at test time
(Polyak averaging)

Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.
Karras et al, “Progressive Growing of GANs for Improved Quality, Stability, and Variation”, ICLR 2018
Brock et al, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, ICLR 2019

Justin Johnson Lecture 11 - 48 October 9, 2019
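A minimal sketch of keeping such a running average of the parameters, here as an exponential moving average (the toy parameter dict and the 0.999 decay constant are hypothetical):

import copy
import torch

params = {"w": torch.randn(10, 10), "b": torch.zeros(10)}   # hypothetical model parameters
ema_params = copy.deepcopy(params)                          # averaged copy used at test time
decay = 0.999

def update_ema():
    # Call after every optimizer step: blend current params into the running average
    with torch.no_grad():
        for name, p in params.items():
            ema_params[name].mul_(decay).add_(p, alpha=1 - decay)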


Transfer Learning

Justin Johnson Lecture 11 - 49 October 9, 2019


Transfer Learning

“You need a lot of data if you
want to train/use CNNs”

Justin Johnson Lecture 11 - 50 October 9, 2019




Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor: remove the last (FC-1000) layer and freeze the remaining layers

[Diagram: VGG-style network (Conv-64 … Conv-512 blocks with MaxPool, then FC-4096, FC-4096, FC-1000), shown once as trained on ImageNet and once with the last layer removed and the rest frozen]

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014

Justin Johnson Lecture 11 - 52 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)

[Chart: classification accuracy on Caltech-101 using the frozen CNN features]

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014

Justin Johnson Lecture 11 - 53 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)

Bird Classification on Caltech-UCSD (accuracy, %):
DPD (Zhang et al, 2013): 50.98
POOF (Berg & Belhumeur, 2013): 56.78
AlexNet FC6 + logistic regression: 58.75
AlexNet FC6 + DPD: 64.96

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014

Justin Johnson Lecture 11 - 54 October 9, 2019




Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)

[Chart: image classification accuracy on Objects, Scenes, Birds, Flowers, Human Attributes, and Object Attributes, comparing prior state of the art vs. CNN + SVM vs. CNN + Augmentation + SVM; the off-the-shelf CNN features match or beat prior state of the art across tasks]

Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

Justin Johnson Lecture 11 - 56 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)

[Chart: nearest-neighbor image retrieval performance on Paris Buildings, Oxford Buildings, Sculptures, Scenes, and Object Instance benchmarks, comparing prior state of the art vs. CNN + SVM vs. CNN + Augmentation + SVM]

Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

Justin Johnson Lecture 11 - 57 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)
3. Bigger dataset: Fine-Tuning. Continue training the CNN for the new task!

[Diagram: the same VGG-style network in all three stages; in stage 3 the whole network keeps training with a new final layer]

Justin Johnson Lecture 11 - 58 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)
3. Bigger dataset: Fine-Tuning. Continue training the CNN for the new task!

Some tricks:
- Train with feature extraction first before fine-tuning
- Lower the learning rate: use ~1/10 of LR used in original training
- Sometimes freeze lower layers to save computation

Justin Johnson Lecture 11 - 59 October 9, 2019
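A minimal sketch of the feature-extraction-then-fine-tune recipe above, assuming torchvision's pretrained ResNet-18 (the 10-class target task and the specific learning rates are hypothetical, and the pretrained flag may differ across torchvision versions):

import torch
import torch.nn as nn
import torchvision

# Start from an ImageNet-pretrained backbone and swap in a new final layer
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)     # new task: 10 classes (hypothetical)

# Stage 1 (feature extraction): freeze everything except the new final layer
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

# ... train the new layer for a while, then ...

# Stage 2 (fine-tuning): unfreeze the whole network, use ~1/10 of the original LR
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)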


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)
3. Bigger dataset: Fine-Tuning. Continue training the CNN for the new task!

Object Detection (mAP):
VOC 2007: feature extraction 44.7, fine-tuning 54.2
ILSVRC 2013: feature extraction 24.1, fine-tuning 29.7

Justin Johnson Lecture 11 - 60 October 9, 2019


Transfer Learning with CNNs: Architecture Matters!

Improvements in CNN
architectures lead to
improvements in
many downstream
tasks thanks to
transfer learning!

Justin Johnson Lecture 11 - 61 October 9, 2019


Transfer Learning with CNNs: Architecture Matters!

[Chart: COCO object detection performance improves steadily across DPM (pre deep learning), Fast R-CNN (AlexNet), Fast R-CNN (VGG-16), Faster R-CNN (VGG-16), Faster R-CNN (ResNet-50), Faster R-CNN FPN (ResNet-101), and Mask R-CNN FPN (ResNeXt-152)]

Ross Girshick, “The Generalized R-CNN Framework for Object Detection”, ICCV 2017 Tutorial on Instance-Level Visual Recognition

Justin Johnson Lecture 11 - 62 October 9, 2019


Transfer Learning with CNNs

[Diagram: VGG-style network; lower conv layers learn more generic features, higher layers more specific]

Very little data (10s to 100s):
- Dataset similar to ImageNet: ?
- Dataset very different from ImageNet: ?
Quite a lot of data (100s to 1000s):
- Dataset similar to ImageNet: ?
- Dataset very different from ImageNet: ?

Justin Johnson Lecture 11 - 63 October 9, 2019


Transfer Learning with CNNs

[Diagram: VGG-style network; lower conv layers learn more generic features, higher layers more specific]

Very little data (10s to 100s):
- Dataset similar to ImageNet: Use Linear Classifier on top layer
- Dataset very different from ImageNet: ?
Quite a lot of data (100s to 1000s):
- Dataset similar to ImageNet: Finetune a few layers
- Dataset very different from ImageNet: ?

Justin Johnson Lecture 11 - 64 October 9, 2019


Transfer Learning with CNNs

[Diagram: VGG-style network; lower conv layers learn more generic features, higher layers more specific]

Very little data (10s to 100s):
- Dataset similar to ImageNet: Use Linear Classifier on top layer
- Dataset very different from ImageNet: ?
Quite a lot of data (100s to 1000s):
- Dataset similar to ImageNet: Finetune a few layers
- Dataset very different from ImageNet: Finetune a larger number of layers

Justin Johnson Lecture 11 - 65 October 9, 2019


Transfer Learning with CNNs

[Diagram: VGG-style network; lower conv layers learn more generic features, higher layers more specific]

Very little data (10s to 100s):
- Dataset similar to ImageNet: Use Linear Classifier on top layer
- Dataset very different from ImageNet: You’re in trouble… Try linear classifier from different stages
Quite a lot of data (100s to 1000s):
- Dataset similar to ImageNet: Finetune a few layers
- Dataset very different from ImageNet: Finetune a larger number of layers

Justin Johnson Lecture 11 - 66 October 9, 2019


Transfer learning is pervasive!
It’s the norm, not the exception

Object Detection (Fast R-CNN)

Girshick, “Fast R-CNN”, ICCV 2015
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015
Figure copyright Ross Girshick, 2015. Reproduced with permission.

Justin Johnson Lecture 11 - 67 October 9, 2019


Transfer learning is pervasive!
It’s the norm, not the exception

Object Detection (Fast R-CNN): CNN pretrained on ImageNet

Girshick, “Fast R-CNN”, ICCV 2015
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015
Figure copyright Ross Girshick, 2015. Reproduced with permission.

Justin Johnson Lecture 11 - 68 October 9, 2019


Transfer learning is pervasive!
It’s the norm, not the exception

Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Image Captioning (Karpathy and Fei-Fei): word vectors pretrained with word2vec

Girshick, “Fast R-CNN”, ICCV 2015
Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015

Justin Johnson Lecture 11 - 69 October 9, 2019


Transfer learning is pervasive!
It’s the norm, not the exception
1. Train CNN on ImageNet
2. Fine-Tune (1) for object detection
on Visual Genome
3. Train BERT language model on lots
of text
4. Combine (2) and (3), train for joint
image / language modeling
5. Fine-tune (4) for image captioning, visual question answering, etc.
Zhou et al, “Unified Vision-Language Pre-Training for Image Captioning and VQA”, arXiv 2019

Justin Johnson Lecture 11 - 70 October 9, 2019


Transfer learning is pervasive!
Some very recent results have questioned it

COCO object detection

Training from scratch can work as well as pretraining on ImageNet!

… If you train for 3x as long

He et al, ”Rethinking ImageNet Pre-Training”, ICCV 2019

Justin Johnson Lecture 11 - 71 October 9, 2019


Transfer learning is pervasive!
Some very recent results have questioned it

COCO object detection
[Chart: detection performance vs. training-set size (118K, 35K, 10K, 1K images), comparing Pretrain + Fine Tune against Train From Scratch]

Pretraining + Finetuning beats training from scratch when dataset size is very small

Collecting more data is more effective than pretraining

He et al, ”Rethinking ImageNet Pre-Training”, ICCV 2019

Justin Johnson Lecture 11 - 72 October 9, 2019


Transfer learning is pervasive!
Some very recent results have questioned it

COCO object detection
[Chart: detection performance vs. training-set size (118K, 35K, 10K, 1K images), Pretrain + Fine Tune vs. Train From Scratch]

My current view on transfer learning:
- Pretrain+finetune makes your training faster, so practically very useful
- Training from scratch works well once you have enough data
- Lots of work left to be done

He et al, ”Rethinking ImageNet Pre-Training”, ICCV 2019

Justin Johnson Lecture 11 - 73 October 9, 2019


Distributed Training

Justin Johnson Lecture 11 - 74 October 9, 2019


Beyond individual devices

Cloud TPU v2: 180 TFLOPs, 64 GB HBM memory, $4.50 / hour (free on Colab!)
Cloud TPU v2 Pod: 64 TPU-v2, 11.5 PFLOPs, $384 / hour

Justin Johnson Lecture 11 - 75 October 9, 2019
Model Parallelism: Split Model Across GPUs

This image is in the public domain

Justin Johnson Lecture 11 - 76 October 9, 2019


Model Parallelism: Split Model Across GPUs
Idea #1: Run different layers on different GPUs

This image is in the public domain

Justin Johnson Lecture 11 - 77 October 9, 2019


Model Parallelism: Split Model Across GPUs
Idea #1: Run different layers on different GPUs
Problem: GPUs spend lots of time waiting

This image is in the public domain

Justin Johnson Lecture 11 - 78 October 9, 2019


Model Parallelism: Split Model Across GPUs
Idea #2: Run parallel branches of model on different GPUs

This image is in the public domain

Justin Johnson Lecture 11 - 79 October 9, 2019


Model Parallelism: Split Model Across GPUs
Idea #2: Run parallel branches of model on different GPUs
Problem: Synchronizing across GPUs is expensive;
Need to communicate activations and grad activations

This image is in the public domain

Justin Johnson Lecture 11 - 80 October 9, 2019


Data Parallelism: Copy Model on each GPU, split data

[Diagram: a batch of N images is split into two chunks of N/2 images, one for each of two GPUs, each holding its own copy of the model]

Justin Johnson Lecture 11 - 81 October 9, 2019


Data Parallelism: Copy Model on each GPU, split data

[Diagram: each GPU runs the forward pass on its N/2 images to compute the loss]

Justin Johnson Lecture 11 - 82 October 9, 2019


Data Parallelism: Copy Model on each GPU, split data

[Diagram: each GPU runs forward (compute loss) and backward (compute gradient) on its N/2 images]

Justin Johnson Lecture 11 - 83 October 9, 2019


Data Parallelism: Copy Model on each GPU, split data

[Diagram: each GPU runs forward and backward on its N/2 images, then the GPUs exchange gradients, sum them, and update]

GPUs only communicate once per iteration, and only exchange grad params

Justin Johnson Lecture 11 - 84 October 9, 2019
Data Parallelism: Copy Model on each GPU, split data

[Diagram: a batch of N images split into K chunks of N/K images, one per GPU]

Scale up to K GPUs (within and across servers): the main mechanism for distributing training across many GPUs today

Justin Johnson Lecture 11 - 85 October 9, 2019
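A minimal sketch of this pattern with PyTorch's DistributedDataParallel (the tiny model, the data, and the batch slice are hypothetical; one process per GPU is assumed, launched with something like torchrun that sets the rank environment variables):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; rank / world size come from the launcher's environment
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 10).cuda(local_rank)        # hypothetical model, one copy per GPU
model = DDP(model, device_ids=[local_rank])         # gradients get all-reduced across GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Each process sees its own N/K slice of the batch (here: random stand-in data)
x = torch.randn(64, 1024, device=local_rank)
y = torch.randint(0, 10, (64,), device=local_rank)
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()                                     # DDP sums gradients across GPUs here
optimizer.step()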


Mixed Model + Data Parallelism

[Diagram: a batch of N images split into N/K chunks across K servers; model parallelism within each server, data parallelism across the K servers]

Example: https://devblogs.nvidia.com/training-bert-with-gpus/

Justin Johnson Lecture 11 - 86 October 9, 2019


Mixed Model + Data Parallelism
Problem: Need to train with very large minibatches!

[Diagram: a batch of N images split into N/K chunks across K servers; model parallelism within each server, data parallelism across the K servers]

Example: https://devblogs.nvidia.com/training-bert-with-gpus/

Justin Johnson Lecture 11 - 87 October 9, 2019


Large-Batch Training

Justin Johnson Lecture 11 - 88 October 9, 2019


Large-Batch Training

Suppose we can train a good model with one GPU.
How to scale up to data-parallel training on K GPUs?

Justin Johnson Lecture 11 - 89 October 9, 2019
Large-Batch Training

Suppose we can train a good model with one GPU.
How to scale up to data-parallel training on K GPUs?

Goal: Train for the same number of epochs, but use larger minibatches.
We want the model to train K times faster!

Justin Johnson Lecture 11 - 90 October 9, 2019
Large-Batch Training: Scale Learning Rates
Single-GPU model: batch size N, learning rate ⍺
K-GPU model: batch size KN, learning rate K⍺

Alex Krizhevsky, “One weird trick for parallelizing convolutional neural networks”, arXiv 2014
Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017

Justin Johnson Lecture 11 - 91 October 9, 2019


Large-Batch Training: Learning Rate Warmup

High initial learning rates can make the loss explode; linearly
increasing the learning rate from 0 over the first ~5000 iterations
can prevent this

Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017

Justin Johnson Lecture 11 - 92 October 9, 2019
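A minimal sketch combining the linear-scaling rule with linear warmup (the base LR, K, and the 5000-iteration warmup length follow the numbers above; everything else is hypothetical):

def large_batch_lr(it, base_lr=0.1, K=8, warmup_iters=5000):
    # Linear scaling rule: K GPUs / K-times-larger batch -> K-times-larger learning rate
    peak_lr = K * base_lr
    if it < warmup_iters:
        # Linear warmup from 0 to the scaled learning rate over the first ~5000 iterations
        return peak_lr * it / warmup_iters
    return peak_lr    # after warmup, hand off to whatever decay schedule you use

print(large_batch_lr(0), large_batch_lr(2500), large_batch_lr(5000))   # 0.0, 0.4, 0.8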


Large-Batch Training: Other Concerns

Be careful with weight decay and momentum, and data shuffling

For Batch Normalization, only normalize within a GPU

Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017
Justin Johnson Lecture 11 - 93 October 9, 2019
Large-Batch Training: ImageNet in One Hour!
ResNet-50 with batch size 8192,
distributed across 256 GPUs
Trains in one hour!

Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017

Justin Johnson Lecture 11 - 94 October 9, 2019


Large-Batch Training
Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, 2017
Batch size: 8192; 256 P100 GPUs; 1 hour

Codreanu et al, “Achieving deep learning training in less than 40 minutes on imagenet-1k”, 2017
Batch size: 12288; 768 Knight’s Landing devices; 39 minutes

You et al, “ImageNet training in minutes”, 2017


Batch size: 16000; 1600 Xeon CPUs; 31 minutes

Akiba et al, “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”, 2017
Batch size: 32768; 1024 P100 GPUs; 15 minutes

Justin Johnson Lecture 11 - 95 October 9, 2019




Recap
1. One time setup (Last Time)
   Activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics (Today)
   Learning rate schedules; hyperparameter optimization
3. After training
   Model ensembles, transfer learning, large-batch training

Justin Johnson Lecture 11 - 99 October 9, 2019


Next Time:
Recurrent Neural Networks

Justin Johnson Lecture 11 - 100 October 9, 2019
