NN 08: Neural Networks - Lecture 8
Agenda
• Loss Function
• Hyperparameters
• Regularization for good generalization
  • Dropout
  • Data augmentation
  • DropConnect
  • Reduce the number of parameters
  • Weight decay
• CNN Applications
  • Object Classification
  • Different Datasets for Object Recognition
  • Different CNN Architectures for Object Recognition
    • AlexNet, ZFNet, VGGNet, GoogLeNet, ResNet
Important Components of a Neural Network apart from the neurons
• Activation functions. Transform the weighted sum of inputs plus bias at each layer and
add non-linearity to the model.
• Loss function (cost function, objective function, error function). Measures how
well the NN reproduces the experimental training data.
• Optimization algorithm. Finds weight and bias values that minimize (locally) the
loss function.
Deep learning neural networks are trained using the stochastic gradient descent
optimization algorithm.
• Hyperparameters. Settings that are difficult to optimize (LR, momentum
term, # of hidden layers, etc.). They are fixed before training and are not learned.
• Regularization techniques. Prevent over-fitting of the NN to the training data.
Loss Function
• A loss function tells us how good our current classifier is.
• The loss is calculated with the loss function by comparing the target (actual) value with the value
predicted by the neural network.
• We then use gradient descent to update the weights of the neural network so that the loss is
minimized. This is how we train a neural network.
• The loss function is used to estimate the loss of the model so that the weights can be updated to reduce the
loss on the next evaluation.
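To make this loss-then-update loop concrete, here is a minimal sketch with a toy linear model and made-up data (all names and values below are illustrative, not from the lecture):

import numpy as np

# Toy regression data (illustrative only): 4 samples, 3 features
X = np.random.randn(4, 3)
y = np.random.randn(4, 1)

w = np.zeros((3, 1))   # weights
b = 0.0                # bias
lr = 0.1               # learning rate (a hyperparameter)

for step in range(100):
    y_pred = X @ w + b                    # forward pass: predicted values
    loss = np.mean((y_pred - y) ** 2)     # loss: compare predicted vs. target values
    grad = 2.0 * (y_pred - y) / len(y)    # gradient of the loss w.r.t. the predictions
    w -= lr * (X.T @ grad)                # gradient descent update of the weights
    b -= lr * grad.sum()                  # ... and of the bias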
The choice of Loss Function
• Regression Loss Functions
  • Mean Squared Error Loss
  • Mean Squared Logarithmic Error Loss
  • Mean Absolute Error Loss
• Binary Classification Loss Functions
  • Binary Cross-Entropy
  • Hinge Loss
  • Squared Hinge Loss
• Multi-Class Classification Loss Functions
  • Multi-Class Cross-Entropy Loss
  • Sparse Multiclass Cross-Entropy Loss
  • Kullback-Leibler Divergence Loss
Cross-entropy and mean squared error are the two main types of loss functions to use
when training neural network models.
Reference: https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
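As a rough illustration of those two main families, here is a hedged NumPy sketch of mean squared error (regression) and multi-class cross-entropy (classification); the function and variable names are illustrative:

import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error (regression): average squared difference
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true_onehot, y_prob, eps=1e-12):
    # Multi-class cross-entropy: -sum of true * log(predicted probability)
    y_prob = np.clip(y_prob, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_prob), axis=1))

# Example: 2 samples, 3 classes
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(cross_entropy(y_true, y_prob))   # small loss -> confident, correct predictions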
Hyperparameters
• Hyperparameters are settings that are difficult to optimize,
i.e., it is not appropriate to learn them on the training set (see the sketch after the examples below).
• Examples:
• Network architecture
• Learning rate
• Filter size for convolution layer
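A hedged sketch of how such settings are typically fixed before training begins (the specific values below are purely illustrative, not recommendations from the lecture):

# Hyperparameters are chosen before training; they are not learned from the training set
hyperparams = {
    "learning_rate": 1e-2,    # LR
    "momentum": 0.9,          # momentum term
    "num_hidden_layers": 3,   # part of the network architecture
    "conv_filter_size": 3,    # filter size for convolution layers
    "batch_size": 128,
    "dropout_prob": 0.5,
    "weight_decay": 5e-4,
}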
Regularization for Good
Generalization
Regularization
• Regularization is any modification we make to a learning algorithm
that is intended to reduce its generalization error but not its training
error.
i.e., any method that prevents over-fitting or helps the optimization.
Under- and Over-fitting
• Under- and over-fitting are factors determining how well an ML algorithm will
perform, i.e., its ability to:
1. Make the training error small
2. Make the gap between training and test errors small
• Underfitting
Inability to obtain a low enough error rate on the training
set.
• Overfitting
The gap between training error and testing error is too large.
Regularization Strategies
1. Parameter Norm Penalties (L2- and L1-regularization)
2. Norm Penalties as Constrained Optimization
3. Regularization and Under-constrained Problems
4. Data Set Augmentation
5. Noise Robustness
6. Semi-supervised learning
7. Multi-task learning
8. Early Stopping
9. Parameter tying and parameter sharing
10. Sparse representations
11. Bagging and other ensemble methods
12. Dropout
13. Adversarial training
14. Tangent methods
The best-performing models on most benchmarks use some or all of these tricks.
Regularization: Dropout
See : http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
Dropout was used for training of fully connected layers.
Training:
• Setting to 0 the output of each hidden neuron with probability 0.5 (50%).
• The neurons which are “dropped out” in this way
• do not contribute to the forward pass
• and do not participate in back-propagation.
• So, every time an input is presented, the neural network samples a different
architecture, but all these architectures share weights.
Test:
At test time, we use all the neurons.
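A minimal NumPy sketch of this behaviour, following the AlexNet-paper convention of dropping neuron outputs at train time and using all neurons with rescaled outputs at test time (names are illustrative):

import numpy as np

def dropout_forward(h, p_drop=0.5, train=True):
    if train:
        # Set the output of each hidden neuron to 0 with probability p_drop;
        # a new mask is sampled for every input, so each pass sees a different architecture
        mask = (np.random.rand(*h.shape) >= p_drop)
        return h * mask
    # Test time: use all the neurons, scaling outputs by (1 - p_drop)
    # to match the expected activation seen during training
    return h * (1.0 - p_drop)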
Regularization: DropConnect
• Training: Drop connections between neurons (set weights to 0)
• Testing: Use all the connections.
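For contrast, a hedged sketch of DropConnect for a single fully connected layer, masking individual weights rather than whole neuron outputs (illustrative code, not from the lecture):

import numpy as np

def dropconnect_forward(x, W, b, p_drop=0.5, train=True):
    if train:
        # Drop connections: set a random subset of weights to 0 for this forward pass
        mask = (np.random.rand(*W.shape) >= p_drop)
        return x @ (W * mask) + b
    # Test time: use all the connections (as on the slide; some formulations also rescale)
    return x @ W + b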
Weight Decay
• Encourages the weights to be small in magnitude.
• Weight decay is an additional term in the weight update rule that causes
the weights to decay exponentially toward zero.
• When training neural networks, it is common to use "weight decay," where after each
update the weights are multiplied by a factor slightly less than 1. This prevents
the weights from growing too large.
• We regularize the cost function by changing it to
  E_new(w) = E(w) + (λ/2) Σ w²
i.e., adding a penalty equal to the sum of the squared values of the weights (this is called L2 regularization).
The regularization parameter λ determines how you trade off the original cost E against the
penalization of large weights.
The gradient descent update becomes w ← w − η ∂E/∂w − ηλw; the new term −ηλw causes each
weight to decay in proportion to its size.
When the regularization hyperparameter λ increases, the weights are pushed
toward smaller values (closer to 0).
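A minimal sketch of the update rule above, showing how the extra −ηλw term shrinks each weight in proportion to its size (function and variable names are illustrative):

def sgd_step_with_weight_decay(w, grad_E, lr=0.01, lam=5e-4):
    # Regularized cost: E_new(w) = E(w) + (lam / 2) * sum(w**2)
    # Its gradient adds lam * w, giving the extra -lr * lam * w term below
    return w - lr * grad_E - lr * lam * w

Equivalently, each update first multiplies w by (1 − lr·lam), a factor slightly less than 1, before applying the usual gradient step.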
CNN Applications
Object Recognition
Object Recognition
• Object recognition is the task of identifying which object category
is present in an image.
• It's challenging because objects can differ widely in
position, size, shape, appearance, etc.,
and we have to deal with occlusions, lighting changes, etc.
• Object recognition:
  • has direct applications to image search;
  • is closely related to object detection, the task of locating all
    instances of an object in an image,
    e.g., a self-driving car detecting pedestrians or stop signs.
CNNs for Recognition or Classification: Feature
Learning
CNNs: Training with Backpropagation
Recognition Datasets
Recognition Datasets
• In order to train and evaluate a machine learning system, we need to collect a
dataset. The design of the dataset can have major implications.
• Some questions to consider:
• Which categories to include?
• Where should the images come from?
• How many images to collect?
• How to normalize (preprocess) the images?
• During the last two decades:
• Datasets have gotten much larger (because of digital cameras and the
Internet)
• Computers got much faster
• Graphics processing units (GPUs) turned out to be really good at training
big neural nets; they're generally about 30 times faster than CPUs.
MNIST Dataset
ImageNet Dataset
• ImageNet is the modern object recognition
benchmark dataset. It was introduced in
2009, and has led to amazing progress in
object recognition since then.
• ImageNet is a dataset of over 15 million
labeled high-resolution images belonging
to roughly 22,000 categories. The images
were collected from the web and labeled
by human labelers
ImageNet, cont.
• Used for the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an
annual benchmark competition for object recognition, first held in 2010.
• ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000
categories. In all, there are roughly 1.2 million training images, 50,000
validation images, and 150,000 testing images.
Different CNN
Architectures
LeNet
[LeCun et al., 1998]
AlexNet
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012.
AlexNet
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
• Q: What is the output volume size? Hint: (227-11)/4+1 = 55
  Output volume: [55x55x96]
• Q: What is the total number of parameters in this layer?
  Parameters: (11*11*3)*96 ≈ 35K
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
• Q: What is the output volume size? Hint: (55-3)/2+1 = 27
  Output volume: 27x27x96
• Q: What is the number of parameters in this layer?
  Parameters: 0!
After POOL1: 27x27x96
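The hints above use the standard convolution output-size formula. A small sketch with hypothetical helper names that reproduces these numbers:

def conv_output_size(input_size, filter_size, stride, pad=0):
    # Standard formula: (W - F + 2P) / S + 1
    return (input_size - filter_size + 2 * pad) // stride + 1

def conv_num_params(filter_size, in_depth, num_filters):
    # One weight per filter element per input channel (biases ignored, as on the slide)
    return filter_size * filter_size * in_depth * num_filters

print(conv_output_size(227, 11, 4))   # 55    -> CONV1 output volume is 55x55x96
print(conv_num_params(11, 3, 96))     # 34848 -> roughly 35K parameters
print(conv_output_size(55, 3, 2))     # 27    -> POOL1 output volume is 27x27x96, 0 parameters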
AlexNet
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
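The same layer list can be written as a hedged PyTorch-style sketch (a rough reconstruction for illustration only; the original implementation's grouping, normalization, and padding details differ in places):

import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),                  # CONV1: 227x227x3 -> 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                                  # POOL1: -> 27x27x96
    nn.LocalResponseNorm(5),                                                # NORM1
    nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),      # CONV2: -> 27x27x256
    nn.MaxPool2d(kernel_size=3, stride=2),                                  # POOL2: -> 13x13x256
    nn.LocalResponseNorm(5),                                                # NORM2
    nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),     # CONV3: -> 13x13x384
    nn.Conv2d(384, 384, kernel_size=3, stride=1, padding=1), nn.ReLU(),     # CONV4: -> 13x13x384
    nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1), nn.ReLU(),     # CONV5: -> 13x13x256
    nn.MaxPool2d(kernel_size=3, stride=2),                                  # POOL3: -> 6x6x256
    nn.Flatten(),
    nn.Linear(6 * 6 * 256, 4096), nn.ReLU(), nn.Dropout(0.5),               # FC6
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),                      # FC7
    nn.Linear(4096, 1000),                                                  # FC8: class scores
)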
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) winners
ZFNet: improved hyperparameters over AlexNet
ZFNet
[Zeiler and Fergus, 2013]
• It is AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
• Top-5 error in ILSVRC’13: 11.7%
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) winners
Deeper Networks
VGGNet
[K. Simonyan and A. Zisserman, University of Oxford, 2014]
VGGNet
[K. Simonyan and A. Zisserman, University of Oxford, 2014]
VGG 16
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) winners
Deeper Networks
GoogLeNet
[Szegedy et al., 2014]
• 22 layers.
• No fully connected (FC) layers.
• Convolutions are broken down into a bunch of smaller convolutions (since this
requires fewer parameters in total).
• GoogLeNet has only 5 million parameters, compared with 60 million for AlexNet:
12x less than AlexNet.
• Top-5 error in ILSVRC'14: 6.7% test error on ImageNet.
"Inception module": design a good local network topology (a network within a network)
and then stack these modules on top of each other.
GoogLeNet
Q: What is the problem with this? Computational complexity!
Conv Ops:
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x256
[5x5 conv, 96]  28x28x96x5x5x256
Total: 854M ops
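A quick sanity check of the arithmetic behind these op counts (a rough multiply count only, assuming 28x28 output maps on a 256-channel input as in the example):

def conv_ops(out_h, out_w, num_filters, k, in_depth):
    # Multiplications: every output position x every filter x (k*k*in_depth) weights
    return out_h * out_w * num_filters * k * k * in_depth

total = (conv_ops(28, 28, 128, 1, 256)     # 1x1 conv, 128 ->  ~26M ops
         + conv_ops(28, 28, 192, 3, 256)   # 3x3 conv, 192 -> ~347M ops
         + conv_ops(28, 28, 96, 5, 256))   # 5x5 conv, 96  -> ~482M ops
print(total)                               # ~854M ops for the naive module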
Reminder: 1x1 convolutions
GoogLeNet
Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature depth.
GoogLeNet
Using the same parallel layers as the naive example, and adding "1x1 conv, 64 filter" bottlenecks:
Conv Ops:
[1x1 conv, 64]  28x28x64x1x1x256
[1x1 conv, 64]  28x28x64x1x1x256
[1x1 conv, 128] 28x28x128x1x1x256
[3x3 conv, 192] 28x28x192x3x3x64
[5x5 conv, 96]  28x28x96x5x5x64
[1x1 conv, 64]  28x28x64x1x1x256
Total: 358M ops
• The bottleneck can also reduce depth after a pooling layer.
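As an illustration of the bottleneck idea, here is a hedged PyTorch-style sketch of an Inception-style module (channel counts follow the example above; this is not the exact GoogLeNet module):

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Parallel 1x1 / 3x3 / 5x5 conv branches plus pooling, with 1x1 "bottleneck"
    # convolutions reducing feature depth before the expensive 3x3 and 5x5 convs.
    def __init__(self, in_ch=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 128, kernel_size=1)
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 64, kernel_size=1),      # bottleneck
                                     nn.Conv2d(64, 192, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 64, kernel_size=1),      # bottleneck
                                     nn.Conv2d(64, 96, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, 64, kernel_size=1))  # reduce depth after pooling

    def forward(self, x):
        # Concatenate branch outputs along the channel dimension: 128 + 192 + 96 + 64 = 480
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)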
GoogLeNet
[Szegedy et al., 2014]
• 22 layers.
• Efficient Inception module.
• Avoids expensive FC layers.
• 12x less parameters than AlexNet.
• Top-5 error in ILSVRC'14: 6.7% test error on ImageNet.
ResNet
[He et al., 2015]
• What happens when we continue stacking deeper layers on a “plain” convolutional
neural network?
ResNet
[He et al., 2015]
Hypothesis: the problem is an optimization problem; deeper models are harder to
optimize.
• The deeper model should be able to perform at least as well as the shallower
model.
• A solution by construction is copying the learned layers from the shallower model
and setting the additional layers to identity mappings.
Solution: Use network layers to fit a residual mapping instead of directly trying to
fit the desired underlying mapping.
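A hedged PyTorch-style sketch of one basic residual block: the stacked 3x3 conv layers fit the residual F(x), and the identity shortcut adds x back, so the block outputs F(x) + x (illustrative only; downsampling and projection shortcuts are omitted):

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two 3x3 conv layers fit the residual mapping F(x); the shortcut carries x unchanged
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)    # output = F(x) + x (identity shortcut)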
ResNet
[He et al., 2015]
Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double the number of filters and downsample spatially using stride 2
  (/2 in each dimension), e.g., going from "3x3 conv, 64 filters" to "3x3 conv, 128 filters, /2 spatially with stride 2"
- Additional conv layer at the beginning
- No FC layers besides the final FC 1000 to the output classes
- Global average pooling layer after the last conv layer
ResNet
[He et al., 2015]
Experimental Results:
- Able to train very deep networks
without degrading (152 layers on
ImageNet, 1202 on CIFAR)
- Deeper networks now achieve lower
training error, as expected
- Swept 1st place in all ILSVRC and
COCO 2015 competitions
ILSVRC 2015 classification winner (3.6%
top-5 error) -- better than "human
performance"! (Russakovsky 2014)
Comparing complexity...
- AlexNet: Smaller compute, still memory heavy, lower accuracy
- VGG: Highest memory, most operations
- GoogLeNet: Most efficient
- ResNet: Moderate efficiency depending on model, highest accuracy
- Inception-v4: ResNet + Inception!
Summary
• Loss Function
• Hyperparameters
• Regularization for good generalization
  • Dropout
  • Data augmentation
  • DropConnect
  • Reduce the number of parameters
  • Weight decay
• CNN Applications
  • Object Classification
  • Different Datasets for Object Recognition
  • Different CNN Architectures for Object Recognition
    • AlexNet, ZFNet, VGGNet, GoogLeNet, ResNet
Resources
1. Roger Grosse and Jimmy Ba, CSC421/2516 Winter 2019: Neural
Networks and Deep Learning, http://www.cs.toronto.edu.
2. Related lectures from CS231n @ Stanford.
http://cs231n.stanford.edu/
3. MIT 6.S191, Introduction to Deep Learning, 2020.
Thanks
for your attention…