498 FA2019 Lecture 11
Activation functions: tanh, Maxout, ReLU, ELU
Glorot and Bengio, “Understanding the difficulty of training deep feedforward neural networks”, AISTAT 2010
[Figure: data augmentation pipeline. Load image and label (e.g. “cat”), transform the image, pass it through the CNN, and compute the loss.]
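A minimal sketch of that pipeline in PyTorch / torchvision; the blank PIL image and the class index 0 stand in for a real image and its label (e.g. “cat”), and the tiny linear model stands in for a CNN:

import torch
import torch.nn as nn
import torchvision.transforms as T
from PIL import Image

transform = T.Compose([
    T.RandomResizedCrop(224),        # transform image: random crop and resize
    T.RandomHorizontalFlip(),        # random horizontal flip
    T.ToTensor(),
])

image = Image.new("RGB", (256, 256))             # stand-in for a loaded image
label = torch.tensor([0])                        # stand-in for the label "cat"
x = transform(image).unsqueeze(0)                # transformed image, batch dim added
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))  # stand-in for a CNN
loss = nn.CrossEntropyLoss()(model(x), label)    # compute loss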
Learning rate schedules (α_0: initial learning rate, t: current epoch, T: total number of epochs):
Cosine: α_t = ½ α_0 (1 + cos(tπ / T))
Loshchilov and Hutter, “SGDR: Stochastic Gradient Descent with Warm Restarts”, ICLR 2017
Radford et al, “Improving Language Understanding by Generative Pre-Training”, 2018
Feichtenhofer et al, “SlowFast Networks for Video Recognition”, ICCV 2019
Radosavovic et al, “On Network Design Spaces for Visual Recognition”, ICCV 2019
Child et al, “Generating Long Sequences with Sparse Transformers”, arXiv 2019
Linear: α_t = α_0 (1 - t / T)
Devlin et al, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2018
Liu et al, “RoBERTa: A Robustly Optimized BERT Pretraining Approach”, 2019
Yang et al, “XLNet: Generalized Autoregressive Pretraining for Language Understanding”, NeurIPS 2019
Inverse sqrt: α_t = α_0 / √t
Constant: α_t = α_0
Brock et al, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, ICLR 2019
Donahue and Simonyan, “Large Scale Adversarial Representation Learning”, NeurIPS 2019
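The schedules above can be computed directly; a minimal Python sketch (function and variable names are mine, not from the slides):

import math

# alpha_0: initial learning rate, t: current epoch, T: total number of epochs
def cosine_lr(alpha_0, t, T):
    return 0.5 * alpha_0 * (1.0 + math.cos(t * math.pi / T))

def linear_lr(alpha_0, t, T):
    return alpha_0 * (1.0 - t / T)

def inverse_sqrt_lr(alpha_0, t):
    return alpha_0 / math.sqrt(t + 1)   # t + 1 avoids division by zero at t = 0

def constant_lr(alpha_0, t=None):
    return alpha_0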
Stop training the model when accuracy on the validation set decreases
Or train for a long time, but always keep track of the model snapshot that
worked best on val. Always a good idea to do this!
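A minimal sketch of keeping the best snapshot, assuming hypothetical train_one_epoch / evaluate helpers and a PyTorch model:

import copy

best_val_acc = 0.0
best_state = None
for epoch in range(num_epochs):                  # num_epochs: your training budget
    train_one_epoch(model, train_loader)         # hypothetical training helper
    val_acc = evaluate(model, val_loader)        # hypothetical evaluation helper
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())  # snapshot the best weights
model.load_state_dict(best_state)                # use the best snapshot at test time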
Grid search example:
Weight decay: [1e-4, 1e-3, 1e-2, 1e-1]
Learning rate: [1e-4, 1e-3, 1e-2, 1e-1]
Random search example (see the sampling sketch after the citations below):
Weight decay: log-uniform on [1e-4, 1e-1]
Learning rate: log-uniform on [1e-4, 1e-1]
[Figure: grid search vs. random search over two hyperparameters, one important and one unimportant. With the same budget, random search tries many more distinct values of the important parameter.]
Bergstra and Bengio, “Random Search for Hyper-Parameter Optimization”, JMLR 2012
Radosavovic et al, “On Network Design Spaces for Visual Recognition”, ICCV 2019
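A minimal sketch of sampling hyperparameters log-uniformly, as in the random search example above (the number of trials is arbitrary):

import math
import random

def log_uniform(low, high):
    # sample uniformly in log space, then exponentiate
    return 10.0 ** random.uniform(math.log10(low), math.log10(high))

trials = [(log_uniform(1e-4, 1e-1), log_uniform(1e-4, 1e-1)) for _ in range(20)]
for lr, wd in trials:
    print(f"learning rate = {lr:.2e}, weight decay = {wd:.2e}")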
Step 3: Find a learning rate that makes the loss go down. Use the architecture from the previous step, use all training data, turn on small weight decay, and find a learning rate that makes the loss drop significantly within ~100 iterations (sketch below).
Step 4: Coarse grid, train for ~1-5 epochs. Choose a few values of learning rate and weight decay around what worked in Step 3, and train a few models for ~1-5 epochs each.
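A sketch of Step 3, assuming hypothetical build_model / train_step helpers; the candidate learning rates are just examples:

import torch

for lr in [1e-1, 1e-2, 1e-3, 1e-4]:
    model = build_model()                        # hypothetical: fresh model per trial
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=1e-4)
    losses = [train_step(model, optimizer) for _ in range(100)]  # ~100 iterations
    print(lr, losses[0], losses[-1])             # look for a significant drop in loss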
[Figures: learning curves, training loss vs. time and train / val accuracy vs. time.]
Track the ratio between the update magnitudes and the parameter value magnitudes: here ~0.0002 / 0.02 = 0.01 (about okay). You want this ratio to be somewhere around 0.001 or so.
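A sketch of measuring this ratio for one parameter tensor after loss.backward(), assuming vanilla SGD so that the update is lr * grad (model.fc.weight in the comment is just a placeholder for any layer's weights):

import torch

def update_ratio(param: torch.Tensor, learning_rate: float) -> float:
    update = learning_rate * param.grad            # SGD update for this step
    return (update.norm() / param.norm()).item()   # e.g. 0.0002 / 0.02 = 0.01

# e.g. print(update_ratio(model.fc.weight, learning_rate)); aim for roughly 1e-3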
Polyak and Juditsky, “Acceleration of stochastic approximation by averaging”, SIAM Journal on Control and Optimization, 1992.
Karras et al, “Progressive Growing of GANs for Improved Quality, Stability, and Variation”, ICLR 2018
Brock et al, “Large Scale GAN Training for High Fidelity Natural Image Synthesis”, ICLR 2019
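The Polyak reference above is about parameter averaging: keep a moving average of the weights during training and use the averaged copy at test time (the GAN papers above average generator weights in the same spirit). A minimal PyTorch-style sketch; the 0.999 decay and the name update_ema are my own choices:

import copy
import torch

ema_model = copy.deepcopy(model)                 # model: your network being trained

@torch.no_grad()
def update_ema(model, ema_model, decay=0.999):
    for p, p_ema in zip(model.parameters(), ema_model.parameters()):
        p_ema.copy_(decay * p_ema + (1.0 - decay) * p)

# Call update_ema(model, ema_model) after every optimizer step,
# and evaluate ema_model on the validation / test set.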
[Figure: transfer learning with a VGG-style CNN trained on ImageNet (Conv-64 ×2, MaxPool, Conv-128 ×2, MaxPool, Conv-256 ×3, MaxPool, Conv-512 ×3, MaxPool, Conv-512 ×3, MaxPool). Freeze these pretrained layers and use the network as a feature extractor for the new task.]
Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition”, ICML 2014
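A minimal sketch of the freeze-and-extract-features recipe with torchvision; the choice of VGG-16 and the 10-class head are assumptions for illustration:

import torch
import torch.nn as nn
import torchvision

model = torchvision.models.vgg16(pretrained=True)   # ImageNet-pretrained backbone
for param in model.parameters():
    param.requires_grad = False                      # freeze the pretrained layers

num_classes = 10                                     # hypothetical new dataset
in_features = model.classifier[-1].in_features
model.classifier[-1] = nn.Linear(in_features, num_classes)  # new final layer

# Only the new final layer has requires_grad=True, so only it gets trained:
optimizer = torch.optim.SGD(model.classifier[-1].parameters(), lr=1e-3)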
[Figure: bar chart. Off-the-shelf CNN features (CNN + SVM, and CNN + Augmentation + SVM) outperform the prior state of the art on Oxford Buildings, Paris Buildings, Sculptures, Scenes, and Object Instance benchmarks.]
Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR Workshops 2014
Improvements in CNN architectures lead to improvements in many downstream tasks thanks to transfer learning!
Cloud TPU v2: 180 TFLOPs, 64 GB HBM memory, $4.50 / hour (free on Colab!)
Cloud TPU v2 Pod: 64 TPU-v2 devices, 11.5 PFLOPs, $384 / hour
Model Parallelism: Split Model Across GPUs
Data parallelism: copy the model onto each GPU and split the data, so a batch of N images becomes N/2 images per GPU (with two GPUs).
Forward: compute loss
Mixing the two: a batch of N images is split into N/K images per server, with model parallelism within each server and data parallelism across the K servers.
Example: https://devblogs.nvidia.com/training-bert-with-gpus/
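A minimal single-machine sketch using PyTorch's nn.DataParallel, which copies the model onto each visible GPU and splits each batch across them (multi-server setups typically use DistributedDataParallel instead); assumes at least one CUDA GPU:

import torch
import torch.nn as nn
import torchvision

model = nn.DataParallel(torchvision.models.resnet50()).cuda()  # replicate across GPUs
images = torch.randn(8, 3, 224, 224).cuda()   # a batch of N = 8 images
scores = model(images)                        # each GPU processes N / num_gpus images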
Large-Batch Training
Suppose we can train a good model with one GPU. How do we scale up to data-parallel training on K GPUs?
Alex Krizhevsky, “One weird trick for parallelizing convolutional neural networks”, arXiv 2014
Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017
Large-Batch Training: ImageNet in One Hour!
ResNet-50 with batch size 8192,
distributed across 256 GPUs
Trains in one hour!
Goyal et al, “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”, arXiv 2017
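Goyal et al's recipe scales the learning rate linearly with batch size and warms it up over the first few epochs; a sketch of that schedule (the base values follow the paper's ResNet-50 setting, but treat the code as an illustration):

base_lr, base_batch = 0.1, 256                  # reference setting: lr 0.1 at batch 256
batch_size = 8192
scaled_lr = base_lr * batch_size / base_batch   # linear scaling rule: 3.2 here

def lr_at_epoch(epoch, warmup_epochs=5):
    if epoch < warmup_epochs:
        # ramp linearly from base_lr up to scaled_lr during warmup
        return base_lr + (scaled_lr - base_lr) * epoch / warmup_epochs
    return scaled_lr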
Codreanu et al, “Achieving deep learning training in less than 40 minutes on imagenet-1k”, 2017
Batch size: 12288; 768 Knight’s Landing devices; 39 minutes
Akiba et al, “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”, 2017
Batch size: 32768; 1024 P100 GPUs; 15 minutes