
Lecture 11:

Training Neural Networks


(Part 2)

Justin Johnson Lecture 11 - 1 October 9, 2019


Reminder: A3

• Due Monday, October 14

• Remember to run the validation script!

Justin Johnson Lecture 11 - 2 October 9, 2019


Reminder: Midterm
• Monday, October 21 (two weeks from today!)
• Location: Chrysler 220 (NOT HERE!)
• Format:
• True / False, Multiple choice, short answer
• Emphasize concepts – you don’t need to memorize AlexNet!
• Closed-book
• You can bring a 1-page “cheat sheet” of handwritten notes
(standard 8.5” x 11” paper)
• Alternate exam times: fill out this form: https://forms.gle/uiMpHdg9752p27bd7
• Conflict with EECS 551
• SSD accommodations
• Conference travel for Michigan
Justin Johnson Lecture 11 - 3 October 9, 2019
Reminder: No class on Monday 10/14!
(Fall Study Break)

Justin Johnson Lecture 11 - 4 October 9, 2019


Overview
1. One time setup (Last Time)
   Activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics (Today)
   Learning rate schedules; hyperparameter optimization
3. After training
   Model ensembles, transfer learning, large-batch training

Justin Johnson Lecture 11 - 5 October 9, 2019


Last Time: Activation Functions
Sigmoid, tanh, ReLU, Leaky ReLU, ELU, Maxout

Justin Johnson Lecture 11 - 6 October 9, 2019


Last Time: Data Preprocessing

Justin Johnson Lecture 11 - 7 October 9, 2019


Last Time: Weight Initialization
“Xavier” initialization: std = 1/sqrt(Din)
“Just right”: Activations are nicely scaled for all layers!

Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTATS 2010

Justin Johnson Lecture 11 - 8 October 9, 2019
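A minimal NumPy sketch of this initialization rule (the layer sizes and the single tanh layer are hypothetical, purely for illustration):

import numpy as np

def xavier_init(d_in, d_out):
    # "Xavier" rule from the slide: scale standard Gaussian weights by 1/sqrt(Din)
    return np.random.randn(d_in, d_out) / np.sqrt(d_in)

# Hypothetical check: activations keep a reasonable scale through a tanh layer
x = np.random.randn(1000, 512)          # batch of 1000 inputs, Din = 512
W = xavier_init(512, 512)
h = np.tanh(x @ W)
print(x.std(), h.std())                 # both stay on the order of ~1 (tanh squashes slightly)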


Last Time: Data Augmentation

[Diagram: load image and label (“cat”) → transform image → CNN → compute loss]

Justin Johnson Lecture 11 - 9 October 9, 2019


Last Time: Regularization
Training: Add randomness
Testing: Marginalize out randomness
Examples: Dropout, Batch Normalization, Data Augmentation, DropConnect, Fractional pooling, Stochastic Depth, Cutout, Mixup

Justin Johnson Lecture 11 - 10 October 9, 2019


Overview
1. One time setup
   Activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics (Today)
   Learning rate schedules; hyperparameter optimization
3. After training
   Model ensembles, transfer learning, large-batch training

Justin Johnson Lecture 11 - 11 October 9, 2019


Learning Rate Schedules

Justin Johnson Lecture 11 - 12 October 9, 2019


SGD, SGD+Momentum, Adagrad, RMSProp, Adam
all have learning rate as a hyperparameter.

Justin Johnson Lecture 11 - 13 October 9, 2019


SGD, SGD+Momentum, Adagrad, RMSProp, Adam
all have learning rate as a hyperparameter.

Q: Which one of these learning rates


is best to use?

Justin Johnson Lecture 11 - 14 October 9, 2019


SGD, SGD+Momentum, Adagrad, RMSProp, Adam
all have learning rate as a hyperparameter.

Q: Which one of these learning rates


is best to use?

A: All of them! Start with large


learning rate and decay over time

Justin Johnson Lecture 11 - 15 October 9, 2019


Learning Rate Decay: Step
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Justin Johnson Lecture 11 - 16 October 9, 2019


Learning Rate Decay: Cosine
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))

Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017
Radford et al, “Improving Language Understanding by Generative Pre-Training”, 2018
Feichtenhofer et al, “SlowFast Networks for Video Recognition”, ICCV 2019
Radosavovic et al, “On Network Design Spaces for Visual Recognition”, ICCV 2019
Child et al, “Generating Long Sequences with Sparse Transformers”, arXiv 2019

Justin Johnson Lecture 11 - 17 October 9, 2019


Learning Rate Decay: Linear
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))

Linear: α_t = α_0 (1 - t / T)

Devlin et al, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019
Liu et al, “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, arXiv 2019
Yang et al, “XLNet: Generalized Autoregressive Pretraining for Language Understanding”, NeurIPS 2019

Justin Johnson Lecture 11 - 18 October 9, 2019


Learning Rate Decay: Inverse Sqrt
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))

Linear: α_t = α_0 (1 - t / T)

Inverse sqrt: α_t = α_0 / √t

Vaswani et al, “Attention is all you need”, NIPS 2017

Justin Johnson Lecture 11 - 19 October 9, 2019


Learning Rate Decay: Constant!
Step: Reduce the learning rate at a few fixed points.
E.g. for ResNets, multiply LR by 0.1 after epochs 30, 60, and 90.

Cosine: α_t = ½ α_0 (1 + cos(t π / T))

Linear: α_t = α_0 (1 - t / T)

Inverse sqrt: α_t = α_0 / √t

Constant: α_t = α_0

Brock et al, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, ICLR 2019
Donahue and Simonyan, “Large Scale Adversarial Representation Learning”, NeurIPS 2019

Justin Johnson Lecture 11 - 20 October 9, 2019
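A minimal sketch of these schedules as plain Python functions (the base LR, T, and the step milestones are illustrative values, not prescribed by the slides):

import math

def step_lr(t, lr0=0.1, milestones=(30, 60, 90), gamma=0.1):
    # Multiply the LR by gamma at each milestone epoch (ResNet-style step decay)
    return lr0 * gamma ** sum(t >= m for m in milestones)

def cosine_lr(t, lr0=0.1, T=100):
    # Half-cosine decay from lr0 down to 0 over T epochs
    return 0.5 * lr0 * (1 + math.cos(math.pi * t / T))

def linear_lr(t, lr0=0.1, T=100):
    # Linear decay from lr0 down to 0 over T epochs
    return lr0 * (1 - t / T)

def inv_sqrt_lr(t, lr0=0.1):
    # Inverse-square-root decay (as used for Transformer training)
    return lr0 / math.sqrt(max(t, 1))

for t in (0, 30, 60, 90):
    print(t, step_lr(t), round(cosine_lr(t), 4), linear_lr(t), round(inv_sqrt_lr(t), 4))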


How long to train? Early Stopping

[Plots: training loss vs. iteration; train and val accuracy vs. iteration, with “stop training here” marked where val accuracy peaks]

Stop training the model when accuracy on the validation set decreases.
Or train for a long time, but always keep track of the model snapshot that
worked best on val. Always a good idea to do this!

Justin Johnson Lecture 11 - 21 October 9, 2019
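A minimal sketch of the keep-the-best-snapshot idea for a PyTorch-style model (the train_one_epoch and evaluate callables and the patience value are hypothetical placeholders, not from the slides):

import copy

def train_with_early_stopping(model, train_one_epoch, evaluate, num_epochs=100, patience=10):
    # model: PyTorch-style module; train_one_epoch / evaluate: user-supplied callables
    best_val_acc, best_state, epochs_since_best = 0.0, None, 0
    for epoch in range(num_epochs):
        train_one_epoch(model)                 # one pass over the training data
        val_acc = evaluate(model)              # accuracy on the validation set
        if val_acc > best_val_acc:
            best_val_acc, epochs_since_best = val_acc, 0
            best_state = copy.deepcopy(model.state_dict())   # snapshot that worked best on val
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # val accuracy stopped improving
                break
    model.load_state_dict(best_state)          # roll back to the best snapshot
    return model, best_val_acc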


Choosing Hyperparameters

Justin Johnson Lecture 11 - 22 October 9, 2019


Choosing Hyperparameters: Grid Search

Choose several values for each hyperparameter
(often space choices log-linearly)

Example:
Weight decay: [1e-4, 1e-3, 1e-2, 1e-1]
Learning rate: [1e-4, 1e-3, 1e-2, 1e-1]

Evaluate all possible choices on this hyperparameter grid

Justin Johnson Lecture 11 - 23 October 9, 2019


Choosing Hyperparameters: Random Search

Choose several values for each hyperparameter
(often space choices log-linearly)

Example:
Weight decay: log-uniform on [1e-4, 1e-1]
Learning rate: log-uniform on [1e-4, 1e-1]

Run many different trials

Justin Johnson Lecture 11 - 24 October 9, 2019
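A minimal sketch contrasting the two: enumerating a grid vs. drawing log-uniform random samples (the ranges match the examples above; the number of random trials is arbitrary):

import itertools
import math
import random

# Grid search: evaluate every combination of a few fixed values
grid_lr = [1e-4, 1e-3, 1e-2, 1e-1]
grid_wd = [1e-4, 1e-3, 1e-2, 1e-1]
grid_trials = list(itertools.product(grid_lr, grid_wd))   # 16 (lr, wd) pairs

def log_uniform(lo, hi):
    # Sample uniformly in log space, then exponentiate
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

# Random search: each trial gets fresh values for every hyperparameter
random_trials = [(log_uniform(1e-4, 1e-1), log_uniform(1e-4, 1e-1)) for _ in range(16)]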


Hyperparameters: Random vs Grid Search

[Figure: grid layout vs. random layout of trials over an important parameter (x-axis) and an unimportant parameter (y-axis); random search covers many more distinct values of the important parameter]

Bergstra and Bengio, “Random Search for Hyper-Parameter Optimization”, JMLR 2012

Justin Johnson Lecture 11 - 25 October 9, 2019


Choosing Hyperparameters: Random Search

Radosavovic et al, “On Network Design Spaces for Visual Recognition”, ICCV 2019

Justin Johnson Lecture 11 - 26 October 9, 2019


Choosing Hyperparameters
(without tons of GPUs)

Justin Johnson Lecture 11 - 27 October 9, 2019


Choosing Hyperparameters

Step 1: Check initial loss

Turn off weight decay, sanity check loss at initialization


e.g. log(C) for softmax with C classes

Justin Johnson Lecture 11 - 28 October 9, 2019
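A minimal sketch of this sanity check (the 10-class linear stand-in model, batch size, and input size are hypothetical; a PyTorch-style setup is assumed):

import math
import torch
import torch.nn as nn

C = 10                                        # number of classes (hypothetical)
model = nn.Linear(3 * 32 * 32, C)             # stand-in for a freshly initialized network
x = torch.randn(128, 3 * 32 * 32)             # a random minibatch
y = torch.randint(0, C, (128,))

loss = nn.CrossEntropyLoss()(model(x), y)
print(loss.item(), math.log(C))               # with weight decay off, these should be close (~2.3)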


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample

Try to train to 100% training accuracy on a small sample of training


data (~5-10 minibatches); fiddle with architecture, learning rate,
weight initialization. Turn off regularization.

Loss not going down? LR too low, bad initialization


Loss explodes to Inf or NaN? LR too high, bad initialization

Justin Johnson Lecture 11 - 29 October 9, 2019
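A minimal sketch of the overfit-a-small-sample check (the tiny MLP, the random stand-in minibatches, and the optimizer settings are all hypothetical; a PyTorch-style setup is assumed):

import torch
import torch.nn as nn

# Hypothetical stand-ins: a tiny MLP and ~10 random minibatches of 32 examples each
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100), nn.ReLU(), nn.Linear(100, 10))
small_sample = [(torch.randn(32, 3, 32, 32), torch.randint(0, 10, (32,))) for _ in range(10)]

# Regularization off (weight_decay=0); fiddle with lr / init until this overfits
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=0.0)

for step in range(500):
    x, y = small_sample[step % len(small_sample)]
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Loss should head toward ~0 (100% accuracy on this tiny sample).
    # Not going down? LR too low or bad init. Exploding to Inf/NaN? LR too high.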


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down

Use the architecture from the previous step, use all training data,
turn on small weight decay, find a learning rate that makes the loss
drop significantly within ~100 iterations

Good learning rates to try: 1e-1, 1e-2, 1e-3, 1e-4

Justin Johnson Lecture 11 - 30 October 9, 2019


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs

Choose a few values of learning rate and weight decay around what
worked from Step 3, train a few models for ~1-5 epochs.

Good weight decay to try: 1e-4, 1e-5, 0

Justin Johnson Lecture 11 - 31 October 9, 2019


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer

Pick best models from Step 4, train them for longer


(~10-20 epochs) without learning rate decay

Justin Johnson Lecture 11 - 32 October 9, 2019


Choosing Hyperparameters

Step 1: Check initial loss


Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at learning curves

Justin Johnson Lecture 11 - 33 October 9, 2019


Look at Learning Curves!

Losses may be noisy: use a scatter plot, and also plot a moving
average to see trends better.

Justin Johnson Lecture 11 - 34 October 9, 2019
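A minimal matplotlib sketch of that plot (the loss history here is synthetic noise, purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic noisy loss curve, standing in for a real training log
iters = np.arange(5000)
loss = np.exp(-iters / 2000) + 0.1 * np.random.rand(5000)

window = 100
moving_avg = np.convolve(loss, np.ones(window) / window, mode="valid")

plt.scatter(iters, loss, s=2, alpha=0.3, label="raw loss")
plt.plot(iters[window - 1:], moving_avg, color="red", label="moving average (100 iters)")
plt.xlabel("iteration"); plt.ylabel("loss"); plt.legend(); plt.show()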


Loss
[Plot: loss vs. time stays flat for a while before starting to drop]
Bad initialization a prime suspect

Justin Johnson Lecture 11 - 35 October 9, 2019


Loss
[Plot: loss vs. time flattens out]
Loss plateaus: Try learning rate decay

Justin Johnson Lecture 11 - 36 October 9, 2019


Loss
[Plot: loss vs. time, with a learning rate step decay marked]
Loss was still going down when the learning rate dropped: you decayed too early!

Justin Johnson Lecture 11 - 37 October 9, 2019


Accuracy
[Plot: train and val accuracy vs. time, both still rising]
Accuracy still going up: you need to train longer

Justin Johnson Lecture 11 - 38 October 9, 2019


Accuracy
[Plot: train and val accuracy vs. time, with a large gap between them]
Huge train / val gap means overfitting! Increase regularization, get more data

Justin Johnson Lecture 11 - 39 October 9, 2019


Accuracy
[Plot: train and val accuracy vs. time, nearly overlapping]
No gap between train / val means underfitting: train longer, use a bigger model

Justin Johnson Lecture 11 - 40 October 9, 2019


Choosing Hyperparameters
Step 1: Check initial loss
Step 2: Overfit a small sample
Step 3: Find LR that makes loss go down
Step 4: Coarse grid, train for ~1-5 epochs
Step 5: Refine grid, train longer
Step 6: Look at loss curves
Step 7: GOTO step 5

Justin Johnson Lecture 11 - 41 October 9, 2019


Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization (L2 / Dropout strength)

[Image: a “neural networks practitioner” turning the knobs; music = loss function]

This image by Paolo Guereta is licensed under CC-BY 2.0

Justin Johnson Lecture 11 - 42 October 9, 2019


Cross-validation
“command center”

Justin Johnson Lecture 11 - 43 October 9, 2019


Track ratio of weight update / weight magnitude

ratio between the updates and values: ~ 0.0002 / 0.02 = 0.01 (about okay)
want this to be somewhere around 0.001 or so

Justin Johnson Lecture 11 - 44 October 9, 2019
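A minimal sketch of tracking this ratio for a plain SGD step (the parameter tensor, gradient, and learning rate are hypothetical; a PyTorch-style setup is assumed):

import torch

param = torch.randn(4096, 1024) * 0.02        # hypothetical weight matrix
grad = torch.randn_like(param)                # hypothetical gradient from backward()
lr = 1e-3

update = -lr * grad                           # the step a plain SGD update would take
update_scale = update.norm()
param_scale = param.norm()
print((update_scale / param_scale).item())    # want this ratio to be around ~1e-3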


Overview
1. One time setup
   Activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics
   Learning rate schedules; hyperparameter optimization
3. After training
   Model ensembles, transfer learning, large-batch training

Justin Johnson Lecture 11 - 45 October 9, 2019


Model Ensembles

1. Train multiple independent models


2. At test time average their results
(Take average of predicted probability distributions, then choose
argmax)

Enjoy 2% extra performance

Justin Johnson Lecture 11 - 46 October 9, 2019
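A minimal sketch of averaging predicted probability distributions across an ensemble (the models list and the input batch are hypothetical stand-ins; a PyTorch-style setup is assumed):

import torch
import torch.nn as nn

# Hypothetical ensemble: a few independently trained (here: randomly initialized) classifiers
models = [nn.Linear(128, 10) for _ in range(3)]
x = torch.randn(32, 128)                      # a test minibatch

with torch.no_grad():
    probs = torch.stack([m(x).softmax(dim=1) for m in models])     # (3, 32, 10)
    ensemble_pred = probs.mean(dim=0).argmax(dim=1)                # average probs, then argmax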


Model Ensembles: Tips and Tricks
Instead of training independent models, use multiple
snapshots of a single model during training!

Cyclic learning rate schedules can make this work even better!

Loshchilov and Hutter, “SGDR: Stochastic gradient descent with restarts”, arXiv 2016
Huang et al, “Snapshot ensembles: train 1, get M for free”, ICLR 2017
Figures copyright Yixuan Li and Geoff Pleiss, 2017. Reproduced with permission.

Justin Johnson Lecture 11 - 47 October 9, 2019
Model Ensembles: Tips and Tricks

Instead of using the actual parameter vector, keep a moving
average of the parameter vector and use that at test time
(Polyak averaging)

Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.
Karras et al, “Progressive Growing of GANs for Improved Quality, Stability, and Variation”, ICLR 2018
Brock et al, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, ICLR 2019

Justin Johnson Lecture 11 - 48 October 9, 2019
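A minimal sketch of keeping such a running average of the parameters, here as an exponential moving average (the toy parameter dict and the 0.999 decay constant are hypothetical):

import copy
import torch

params = {"w": torch.randn(10, 10), "b": torch.zeros(10)}   # hypothetical model parameters
ema_params = copy.deepcopy(params)                          # averaged copy used at test time
decay = 0.999

def update_ema():
    # Call after every optimizer step: blend current params into the running average
    with torch.no_grad():
        for name, p in params.items():
            ema_params[name].mul_(decay).add_(p, alpha=1 - decay)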


Transfer Learning

Justin Johnson Lecture 11 - 49 October 9, 2019


Transfer Learning

“You need a lot of data if you
want to train/use CNNs”

Justin Johnson Lecture 11 - 50 October 9, 2019




Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor: remove the last (FC-1000) layer and freeze the remaining layers

[Diagram: VGG-style network (Conv-64 … Conv-512 blocks with MaxPool, then FC-4096, FC-4096, FC-1000), shown once as trained on ImageNet and once with the last layer removed and the rest frozen]

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014

Justin Johnson Lecture 11 - 52 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)

[Chart: classification accuracy on Caltech-101 using the frozen CNN features]

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014

Justin Johnson Lecture 11 - 53 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)

Bird Classification on Caltech-UCSD (accuracy, %):
DPD (Zhang et al, 2013): 50.98
POOF (Berg & Belhumeur, 2013): 56.78
AlexNet FC6 + logistic regression: 58.75
AlexNet FC6 + DPD: 64.96

Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014

Justin Johnson Lecture 11 - 54 October 9, 2019




Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)

[Chart: image classification accuracy on Objects, Scenes, Birds, Flowers, Human Attributes, and Object Attributes, comparing prior state of the art vs. CNN + SVM vs. CNN + Augmentation + SVM; the off-the-shelf CNN features match or beat prior state of the art across tasks]

Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

Justin Johnson Lecture 11 - 56 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)

[Chart: nearest-neighbor image retrieval performance on Paris Buildings, Oxford Buildings, Sculptures, Scenes, and Object Instance benchmarks, comparing prior state of the art vs. CNN + SVM vs. CNN + Augmentation + SVM]

Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014

Justin Johnson Lecture 11 - 57 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)
3. Bigger dataset: Fine-Tuning. Continue training the CNN for the new task!

[Diagram: the same VGG-style network in all three stages; in stage 3 the whole network keeps training with a new final layer]

Justin Johnson Lecture 11 - 58 October 9, 2019


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)
3. Bigger dataset: Fine-Tuning. Continue training the CNN for the new task!

Some tricks:
- Train with feature extraction first before fine-tuning
- Lower the learning rate: use ~1/10 of LR used in original training
- Sometimes freeze lower layers to save computation

Justin Johnson Lecture 11 - 59 October 9, 2019
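A minimal sketch of the feature-extraction-then-fine-tune recipe above, assuming torchvision's pretrained ResNet-18 (the 10-class target task and the specific learning rates are hypothetical, and the pretrained flag may differ across torchvision versions):

import torch
import torch.nn as nn
import torchvision

# Start from an ImageNet-pretrained backbone and swap in a new final layer
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 10)     # new task: 10 classes (hypothetical)

# Stage 1 (feature extraction): freeze everything except the new final layer
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc")
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)

# ... train the new layer for a while, then ...

# Stage 2 (fine-tuning): unfreeze the whole network, use ~1/10 of the original LR
for p in model.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)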


Transfer Learning with CNNs
1. Train on ImageNet
2. Use CNN as a feature extractor (remove last layer, freeze the rest)
3. Bigger dataset: Fine-Tuning. Continue training the CNN for the new task!

Object Detection (mAP):
VOC 2007: feature extraction 44.7, fine-tuning 54.2
ILSVRC 2013: feature extraction 24.1, fine-tuning 29.7

Justin Johnson Lecture 11 - 60 October 9, 2019


Transfer Learning with CNNs: Architecture Matters!

Improvements in CNN
architectures lead to
improvements in
many downstream
tasks thanks to
transfer learning!

Justin Johnson Lecture 11 - 61 October 9, 2019


Transfer Learning with CNNs: Architecture Matters!

[Chart: COCO object detection performance improves steadily across DPM (pre deep learning), Fast R-CNN (AlexNet), Fast R-CNN (VGG-16), Faster R-CNN (VGG-16), Faster R-CNN (ResNet-50), Faster R-CNN FPN (ResNet-101), and Mask R-CNN FPN (ResNeXt-152)]

Ross Girshick, “The Generalized R-CNN Framework for Object Detection”, ICCV 2017 Tutorial on Instance-Level Visual Recognition

Justin Johnson Lecture 11 - 62 October 9, 2019


Transfer Learning with CNNs

[Diagram: VGG-style network; lower conv layers learn more generic features, higher layers more specific]

Very little data (10s to 100s):
- Dataset similar to ImageNet: ?
- Dataset very different from ImageNet: ?
Quite a lot of data (100s to 1000s):
- Dataset similar to ImageNet: ?
- Dataset very different from ImageNet: ?

Justin Johnson Lecture 11 - 63 October 9, 2019


Transfer Learning with CNNs

[Diagram: VGG-style network; lower conv layers learn more generic features, higher layers more specific]

Very little data (10s to 100s):
- Dataset similar to ImageNet: Use Linear Classifier on top layer
- Dataset very different from ImageNet: ?
Quite a lot of data (100s to 1000s):
- Dataset similar to ImageNet: Finetune a few layers
- Dataset very different from ImageNet: ?

Justin Johnson Lecture 11 - 64 October 9, 2019


Transfer Learning with CNNs

[Diagram: VGG-style network; lower conv layers learn more generic features, higher layers more specific]

Very little data (10s to 100s):
- Dataset similar to ImageNet: Use Linear Classifier on top layer
- Dataset very different from ImageNet: ?
Quite a lot of data (100s to 1000s):
- Dataset similar to ImageNet: Finetune a few layers
- Dataset very different from ImageNet: Finetune a larger number of layers

Justin Johnson Lecture 11 - 65 October 9, 2019


Transfer Learning with CNNs

[Diagram: VGG-style network; lower conv layers learn more generic features, higher layers more specific]

Very little data (10s to 100s):
- Dataset similar to ImageNet: Use Linear Classifier on top layer
- Dataset very different from ImageNet: You’re in trouble… Try linear classifier from different stages
Quite a lot of data (100s to 1000s):
- Dataset similar to ImageNet: Finetune a few layers
- Dataset very different from ImageNet: Finetune a larger number of layers

Justin Johnson Lecture 11 - 66 October 9, 2019


Transfer learning is pervasive!
It’s the norm, not the exception

Object Detection (Fast R-CNN)

Girshick, “Fast R-CNN”, ICCV 2015
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015
Figure copyright Ross Girshick, 2015. Reproduced with permission.

Justin Johnson Lecture 11 - 67 October 9, 2019


Transfer learning is pervasive!
It’s the norm, not the exception

Object Detection (Fast R-CNN): CNN pretrained on ImageNet

Girshick, “Fast R-CNN”, ICCV 2015
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015
Figure copyright Ross Girshick, 2015. Reproduced with permission.

Justin Johnson Lecture 11 - 68 October 9, 2019


Transfer learning is pervasive!
It’s the norm, not the exception

Object Detection (Fast R-CNN): CNN pretrained on ImageNet
Image Captioning (Karpathy and Fei-Fei): word vectors pretrained with word2vec

Girshick, “Fast R-CNN”, ICCV 2015
Figure copyright Ross Girshick, 2015. Reproduced with permission.
Karpathy and Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions”, CVPR 2015

Justin Johnson Lecture 11 - 69 October 9, 2019


Transfer learning is pervasive!
It’s the norm, not the exception
1. Train CNN on ImageNet
2. Fine-Tune (1) for object detection
on Visual Genome
3. Train BERT language model on lots
of text
4. Combine (2) and (3), train for joint
image / language modeling
5. Fine-tune (4) for image captioning, visual question answering, etc.
Zhou et al, “Unified Vision-Language Pre-Training for Image Captioning and VQA”, arXiv 2019

Justin Johnson Lecture 11 - 70 October 9, 2019


Transfer learning is pervasive!
Some very recent results have questioned it

COCO object detection

Training from scratch can work as well as pretraining on ImageNet!

… If you train for 3x as long

He et al, ”Rethinking ImageNet Pre-Training”, ICCV 2019

Justin Johnson Lecture 11 - 71 October 9, 2019


Transfer learning is pervasive!
Some very recent results have questioned it

COCO object detection
[Chart: detection performance vs. training-set size (118K, 35K, 10K, 1K images), comparing Pretrain + Fine Tune against Train From Scratch]

Pretraining + Finetuning beats training from scratch when dataset size is very small

Collecting more data is more effective than pretraining

He et al, ”Rethinking ImageNet Pre-Training”, ICCV 2019

Justin Johnson Lecture 11 - 72 October 9, 2019


Transfer learning is pervasive!
Some very recent results have questioned it

COCO object detection
[Chart: detection performance vs. training-set size (118K, 35K, 10K, 1K images), Pretrain + Fine Tune vs. Train From Scratch]

My current view on transfer learning:
- Pretrain+finetune makes your training faster, so practically very useful
- Training from scratch works well once you have enough data
- Lots of work left to be done

He et al, ”Rethinking ImageNet Pre-Training”, ICCV 2019

Justin Johnson Lecture 11 - 73 October 9, 2019


Distributed Training

Justin Johnson Lecture 11 - 74 October 9, 2019


Beyond individual devices

Cloud TPU v2: 180 TFLOPs, 64 GB HBM memory, $4.50 / hour (free on Colab!)
Cloud TPU v2 Pod: 64 TPU-v2, 11.5 PFLOPs, $384 / hour

Justin Johnson Lecture 11 - 75 October 9, 2019
Model Parallelism: Split Model Across GPUs

This image is in the public domain

Justin Johnson Lecture 11 - 76 October 9, 2019


Model Parallelism: Split Model Across GPUs
Idea #1: Run different layers on different GPUs

This image is in the public domain

Justin Johnson Lecture 11 - 77 October 9, 2019


Model Parallelism: Split Model Across GPUs
Idea #1: Run different layers on different GPUs
Problem: GPUs spend lots of time waiting

This image is in the public domain

Justin Johnson Lecture 11 - 78 October 9, 2019


Model Parallelism: Split Model Across GPUs
Idea #2: Run parallel branches of model on different GPUs

This image is in the public domain

Justin Johnson Lecture 11 - 79 October 9, 2019


Model Parallelism: Split Model Across GPUs
Idea #2: Run parallel branches of model on different GPUs
Problem: Synchronizing across GPUs is expensive;
Need to communicate activations and grad activations

This image is in the public domain

Justin Johnson Lecture 11 - 80 October 9, 2019


Data Parallelism: Copy Model on each GPU, split data

[Diagram: a batch of N images is split into two chunks of N/2 images, one for each of two GPUs, each holding its own copy of the model]

Justin Johnson Lecture 11 - 81 October 9, 2019


Data Parallelism: Copy Model on each GPU, split data

[Diagram: each GPU runs the forward pass on its N/2 images to compute the loss]

Justin Johnson Lecture 11 - 82 October 9, 2019


Data Parallelism: Copy Model on each GPU, split data

[Diagram: each GPU runs forward (compute loss) and backward (compute gradient) on its N/2 images]

Justin Johnson Lecture 11 - 83 October 9, 2019


Data Parallelism: Copy Model on each GPU, split data

[Diagram: each GPU runs forward and backward on its N/2 images, then the GPUs exchange gradients, sum them, and update]

GPUs only communicate once per iteration, and only exchange grad params

Justin Johnson Lecture 11 - 84 October 9, 2019
Data Parallelism: Copy Model on each GPU, split data

[Diagram: a batch of N images split into K chunks of N/K images, one per GPU]

Scale up to K GPUs (within and across servers): the main mechanism for distributing training across many GPUs today

Justin Johnson Lecture 11 - 85 October 9, 2019
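A minimal sketch of this pattern with PyTorch's DistributedDataParallel (the tiny model, the data, and the batch slice are hypothetical; one process per GPU is assumed, launched with something like torchrun that sets the rank environment variables):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; rank / world size come from the launcher's environment
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Linear(1024, 10).cuda(local_rank)        # hypothetical model, one copy per GPU
model = DDP(model, device_ids=[local_rank])         # gradients get all-reduced across GPUs
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Each process sees its own N/K slice of the batch (here: random stand-in data)
x = torch.randn(64, 1024, device=local_rank)
y = torch.randint(0, 10, (64,), device=local_rank)
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()                                     # DDP sums gradients across GPUs here
optimizer.step()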


Mixed Model + Data Parallelism

[Diagram: a batch of N images split into N/K chunks across K servers; model parallelism within each server, data parallelism across the K servers]

Example: https://devblogs.nvidia.com/training-bert-with-gpus/

Justin Johnson Lecture 11 - 86 October 9, 2019


Mixed Model + Data Parallelism
Problem: Need to train with very large minibatches!

[Diagram: a batch of N images split into N/K chunks across K servers; model parallelism within each server, data parallelism across the K servers]

Example: https://devblogs.nvidia.com/training-bert-with-gpus/

Justin Johnson Lecture 11 - 87 October 9, 2019


Large-Batch Training

Justin Johnson Lecture 11 - 88 October 9, 2019


Large-Batch Training

Suppose we can train a good model with one GPU.
How to scale up to data-parallel training on K GPUs?

Justin Johnson Lecture 11 - 89 October 9, 2019
Large-Batch Training

Suppose we can train a good model with one GPU.
How to scale up to data-parallel training on K GPUs?

Goal: Train for the same number of epochs, but use larger minibatches.
We want the model to train K times faster!

Justin Johnson Lecture 11 - 90 October 9, 2019
Large-Batch Training: Scale Learning Rates
Single-GPU model: batch size N, learning rate ⍺
K-GPU model: batch size KN, learning rate K⍺

Alex Krizhevsky, “One weird trick for parallelizing convolutional neural networks”, arXiv 2014
Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017

Justin Johnson Lecture 11 - 91 October 9, 2019


Large-Batch Training: Learning Rate Warmup

High initial learning rates can make the loss explode; linearly
increasing the learning rate from 0 over the first ~5000 iterations
can prevent this

Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017

Justin Johnson Lecture 11 - 92 October 9, 2019
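A minimal sketch combining the linear-scaling rule with linear warmup (the base LR, K, and the 5000-iteration warmup length follow the numbers above; everything else is hypothetical):

def large_batch_lr(it, base_lr=0.1, K=8, warmup_iters=5000):
    # Linear scaling rule: K GPUs / K-times-larger batch -> K-times-larger learning rate
    peak_lr = K * base_lr
    if it < warmup_iters:
        # Linear warmup from 0 to the scaled learning rate over the first ~5000 iterations
        return peak_lr * it / warmup_iters
    return peak_lr    # after warmup, hand off to whatever decay schedule you use

print(large_batch_lr(0), large_batch_lr(2500), large_batch_lr(5000))   # 0.0, 0.4, 0.8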


Large-Batch Training: Other Concerns

Be careful with weight decay and momentum, and data shuffling

For Batch Normalization, only normalize within a GPU

Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017
Justin Johnson Lecture 11 - 93 October 9, 2019
Large-Batch Training: ImageNet in One Hour!
ResNet-50 with batch size 8192,
distributed across 256 GPUs
Trains in one hour!

Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017

Justin Johnson Lecture 11 - 94 October 9, 2019


Large-Batch Training
Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, 2017
Batch size: 8192; 256 P100 GPUs; 1 hour

Codreanu et al, “Achieving deep learning training in less than 40 minutes on imagenet-1k”, 2017
Batch size: 12288; 768 Knight’s Landing devices; 39 minutes

You et al, “ImageNet training in minutes”, 2017


Batch size: 16000; 1600 Xeon CPUs; 31 minutes

Akiba et al, “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”, 2017
Batch size: 32768; 1024 P100 GPUs; 15 minutes

Justin Johnson Lecture 11 - 95 October 9, 2019




Recap
1. One time setup (Last Time)
   Activation functions, data preprocessing, weight initialization, regularization
2. Training dynamics (Today)
   Learning rate schedules; hyperparameter optimization
3. After training
   Model ensembles, transfer learning, large-batch training

Justin Johnson Lecture 11 - 99 October 9, 2019


Next Time:
Recurrent Neural Networks

Justin Johnson Lecture 11 - 100 October 9, 2019
