
CENG 403

Introduction to Deep Learning

Week 6b

Sinan Kalkan
Representational capacity
• Boolean functions:
• Every Boolean function can be represented exactly by a neural network with a single hidden layer
• However, the number of hidden units may need to grow exponentially with the number of inputs
• Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with two layers (a single hidden layer)
• Arbitrary functions:
• Any function can be approximated to arbitrary accuracy by a network with three layers (two hidden layers)

Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, 2(4), 303–314.

Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks", Neural Networks, 4(2), 251–257.

Hornik, K., Stinchcombe, M., & White, H. (1989). "Multilayer feedforward networks are universal approximators", Neural Networks, 2(5), 359–366.

Model Complexity
• Models range in their flexibility to fit arbitrary data
• A simple model is constrained (high bias, low variance): its small capacity may prevent it from representing all the structure in the data
• A complex model is unconstrained (low bias, high variance): its large capacity may allow it to memorize the data and fail to capture its regularities

https://cs231n.github.io/
Slide Credit: Michael Mozer
Training Vs. Test/Val Set Error

Validation
Error

Optimum
Model Complexity
Error

Training
Error

High Bias Low Bias


Low Variance Model Complexity High Variance

Sinan Kalkan 4
Memorization vs. Generalization
https://arxiv.org/pdf/1906.05271.pdf
Double Descent

Nakkiran et al., “Deep Double Descent: Where Bigger Models and More Data Hurt”, 2019.
Grokking

How do you spot overfitting?

Avoiding Overfitting
• Increase training set size
• Make sure the effective size is growing; adding redundant data doesn’t help
• Incorporate domain-appropriate bias into model
• Customize model to your problem
• Tune hyperparameters of model
• number of layers, number of hidden units per layer, connectivity, etc.
• Regularization techniques

Slide Credit: Michael Mozer
Regularization
• L2 regularization: ½ λ w²
• Very common
• Penalizes peaky weight vectors, prefers diffuse weight vectors

• L1 regularization: λ |w|
• Enforces sparsity (some weights become exactly zero)
• Why? The weight is decayed by a constant amount whenever |w| is non-zero, which drives small weights all the way to zero.
• Leads to input selection (makes the model robust against noisy inputs)
• Use it if you require sparsity / feature selection

• Can be combined (elastic net): λ1 |w| + λ2 w²

• Regularization is not applied to the bias; regularizing it seems to make no significant difference
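A minimal sketch of how these penalties are typically added in practice, assuming PyTorch: weight_decay in the optimizer implements the L2 term, while the L1 term is added explicitly to the loss (the model, data and coefficient values below are illustrative).

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10)

# L2 regularization is usually applied via the optimizer's weight_decay (= lambda_2).
# Note: this decays all parameters; use parameter groups to exclude the biases.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# L1 regularization can be added explicitly to the data loss (lambda_1 below)
lambda_1 = 1e-5
x, y = torch.randn(32, 100), torch.randn(32, 10)
data_loss = nn.functional.mse_loss(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)  # skip biases
loss = data_loss + lambda_1 * l1_penalty

loss.backward()
optimizer.step()
```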
L0 regularization
• L0 "norm": L0 = (Σᵢ |xᵢ|⁰)^(1/0)
• How would we compute the zeroth power and the zeroth root?
• Instead, it is defined as:
• L0 = #{i : xᵢ ≠ 0}
• i.e., the cardinality of the set of non-zero elements
• This is a strong enforcement of sparsity.
• However, it is non-convex.
• The L1 norm is its closest convex relaxation.
Garipov et al., “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs”, 2018.

See also: Kuditipudi et al., “Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets”, 2020.
- Explains this with noise stability, dropout stability.
Dropout as Ensemble Training Method
Fig: Srivastava et al., 2014

“Dropout performs gradient descent on-line with respect to both the training examples and the ensemble of all possible subnetworks.”

Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in neural information
processing systems, pp. 2814–2822, 2013.
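As a concrete illustration, here is a minimal sketch of (inverted) dropout; in practice one would simply use torch.nn.Dropout, which implements the same idea.

```python
import torch

def inverted_dropout(x, p=0.5, training=True):
    # Zero each activation with probability p; scale the survivors by 1/(1-p)
    # so the expected activation is unchanged and no rescaling is needed at test time.
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float() / (1.0 - p)
    return x * mask

h = torch.randn(4, 8)
h_train = inverted_dropout(h, p=0.5, training=True)   # a random subnetwork per forward pass
h_test = inverted_dropout(h, p=0.5, training=False)   # identity at test time
```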

Data Augmentation

http://blcv.pl/static//2018/02/27/demystifying-face-recognition-v-data-augmentation/
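A minimal sketch of a typical image-augmentation pipeline, assuming torchvision; the specific transforms and parameters are illustrative choices, not ones prescribed here.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop + rescale
    transforms.RandomHorizontalFlip(p=0.5),                # mirror half of the images
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # mild photometric changes
    transforms.ToTensor(),
])
# Pass train_transform to the Dataset so each epoch sees a different random variant of every image.
```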
Training Vs. Val Set Error
[Figure: training-set and validation-set error as a function of training epochs; validation error starting to rise while training error keeps falling is the sign of overfitting.]
Slide Credit: Michael Mozer
Today
• Data Preprocessing and Initialization
• Issues and Practical Advice



Data Preprocessing and
Weight Initialization

Data Preprocessing
• Mean subtraction
• Normalization
• PCA and whitening

Data Preprocessing: Mean subtraction
• Compute the mean of each dimension, μᵢ, over the training set:
  μᵢ = (1/N) Σⱼ xⱼᵢ
• Subtract the mean from each dimension:
  xⱼᵢ′ ← xⱼᵢ − μᵢ
• Effect: moves the data center (mean) to the origin

Mean image of CIFAR10

http://cs231n.github.io/neural-networks-2/
Data Preprocessing:
Normalization (or conditioning)
• Necessary if you believe that your dimensions have different scales
• You may need to remove this difference in scale to give equal importance to each dimension
• Normalize each dimension by its std. dev. (𝜎𝑖 ) after mean subtraction:
𝑥𝑗𝑖′ = 𝑥𝑗𝑖 − 𝜇𝑖
𝑥𝑗𝑖′′ = 𝑥𝑗𝑖′ /𝜎𝑖
• Effect: Make the dimensions have the same scale

http://cs231n.github.io/neural-networks-2/
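A minimal NumPy sketch of zero-centering and scaling by the standard deviation, with the statistics computed on the training set only (the toy data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Three dimensions with very different scales
X_train = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(1000, 3))
X_test = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(200, 3))

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8        # small epsilon avoids division by zero

X_train_n = (X_train - mu) / sigma        # zero mean, unit std per dimension
X_test_n = (X_test - mu) / sigma          # reuse the training-set statistics at test time
```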
Data Preprocessing:
Principal Component Analysis
• First center the data
• Find the eigenvectors 𝑒1 , … , 𝑒𝑛
• Project the data onto the eigenvectors:
• 𝑥𝑖𝑅 = 𝑥𝑖 ⋅ [𝑒1 , … , 𝑒𝑛 ]
• This corresponds to rotating the data to have the eigenvectors as the axes
• If you take the first 𝑀 eigenvectors, it corresponds to dimensionality reduction

http://cs231n.github.io/neural-networks-2/
Data Preprocessing: Whitening
• Normalize the scale of each component by the square root of its eigenvalue:
  xᵢʷ = xᵢᴿ / (√λᵢ + ε)
• ε: a very small number to avoid division by zero
• This stretches each dimension to have the same (unit) scale.
• Side effect: this may exaggerate noise in the low-variance dimensions.

http://cs231n.github.io/neural-networks-2/
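A minimal NumPy sketch of PCA and whitening in the style of the cs231n notes (the toy data is illustrative; X is an N × D matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 1.0, 0.0],
                                           [0.0, 0.5, 0.0],
                                           [0.0, 0.0, 0.1]])   # correlated toy data
X = X - X.mean(axis=0)                       # zero-center first

cov = X.T @ X / X.shape[0]                   # D x D covariance matrix
U, S, _ = np.linalg.svd(cov)                 # columns of U: eigenvectors, S: eigenvalues

X_rot = X @ U                                # rotate / decorrelate the data
X_reduced = X @ U[:, :2]                     # keep the first M = 2 components (dimensionality reduction)
X_white = X_rot / np.sqrt(S + 1e-5)          # whitening: unit scale along every eigenvector
```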
Data Preprocessing: Example

http://cs231n.github.io/neural-networks-2/
Data Preprocessing: Summary
• We mostly don’t use PCA or whitening
• They are computationally very expensive
• Whitening has side effects
• It is quite crucial and common to zero-center the data
• Most of the time, we see normalization with the std. deviation

Weight Initialization
• Zero weights
• Does not work well!
• All neurons compute the same output and receive the same gradient updates, so the symmetry between them is never broken
• Initialize the weights randomly to small values:
• Sample from a small range, e.g., Normal(0, 0.01)
• Don’t initialize too small
• The bias may be initialized to zero
• For ReLU units, this may be a small number like 0.01.

Note: None of these schemes provides guarantees of convergence. Moreover, there is no guarantee that one of them will always be better than the others.
Weight Initialization

• Problem: The variance of the output grows with the number of inputs

• If s = Σᵢ₌₁..ₙ wᵢxᵢ (note that Var(X) = E[(X − μ)²]), then Var(s) = (n Var(w)) Var(x) for zero-mean, independent w and x.

Eqn: https://cs231n.github.io/neural-networks-2/#init
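For completeness, here is the derivation behind that equation, a sketch under the usual assumptions that the wᵢ and xᵢ are independent with zero mean:

```latex
\begin{align*}
\operatorname{Var}(s) &= \operatorname{Var}\Big(\sum_{i=1}^{n} w_i x_i\Big)
                       = \sum_{i=1}^{n} \operatorname{Var}(w_i x_i) \\
 &= \sum_{i=1}^{n} \Big( \mathbb{E}[w_i]^2 \operatorname{Var}(x_i)
     + \mathbb{E}[x_i]^2 \operatorname{Var}(w_i)
     + \operatorname{Var}(w_i)\operatorname{Var}(x_i) \Big) \\
 &= \sum_{i=1}^{n} \operatorname{Var}(w_i)\operatorname{Var}(x_i)
   = \big(n \operatorname{Var}(w)\big)\operatorname{Var}(x).
\end{align*}
```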

Weight Initialization
• Solution: Get rid of n in Var(s) = (n Var(w)) Var(x)
• How? Scale the initial weights by 1/√n
• Why? Because Var(aX) = a² Var(X)

• Standard initialization (top plots in Figures 6 & 7):
  w ∼ U[−1/√n, 1/√n]
  which yields n Var(w) = 1/3, because the variance of U[−r, r] is r²/3 (see [1])

• Xavier initialization for symmetric activation functions (Glorot & Bengio):
  w ∼ N(0, 1/n), or w ∼ N(0, 1/((n_in + n_out)/2)) if n_in ≠ n_out
  With a uniform distribution:
  w ∼ U[−√(3/n), √(3/n)], or w ∼ U[−√(3/((n_in + n_out)/2)), √(3/((n_in + n_out)/2))] if n_in ≠ n_out

Figures: Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010.
[1] https://proofwiki.org/wiki/Variance_of_Continuous_Uniform_Distribution
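A minimal sketch of Xavier/Glorot initialization for a single layer, assuming PyTorch (torch.nn.init provides the same scheme as xavier_normal_ / xavier_uniform_):

```python
import math
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)          # n_in = 256, n_out = 128
fan_in, fan_out = 256, 128

# Manual Xavier (Glorot) initialization: Var(w) = 1 / ((n_in + n_out) / 2)
std = math.sqrt(2.0 / (fan_in + fan_out))
with torch.no_grad():
    layer.weight.normal_(0.0, std)
    layer.bias.zero_()

# Equivalent built-in initializer:
nn.init.xavier_normal_(layer.weight)
```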
Weight Initialization
• He et al. show that Xavier initialization does not work well for ReLUs.

He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", 2015.
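Their proposed fix uses Var(w) = 2/n_in, compensating for ReLU zeroing out roughly half of the activations. A minimal sketch, assuming PyTorch:

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
# He (Kaiming) initialization, designed for ReLU non-linearities: Var(w) = 2 / n_in
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
nn.init.zeros_(layer.bias)
```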

More on Weight Initialization

Tutorial and demo:
https://www.deeplearning.ai/ai-notes/initialization/index.html

Tutorial:
https://mmuratarat.github.io/2019-02-25/xavier-glorot-he-weight-init

Alternative: Batch Normalization
• Normalization (of the input) is differentiable
• So, make it part of the model (not only at the beginning), i.e., perform normalization at every layer
• More robust to initialization
• Shown to also regularize the network in some cases (dropping the need for dropout)
• Issue: How do we normalize at test time?
  1. Store running means and variances during training, or
  2. Calculate the mean & variance over your test data
• PyTorch: use model.eval() at test time (see the sketch below).

Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", 2015.
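A minimal usage sketch, assuming PyTorch; the layer sizes and the placement of BatchNorm before the non-linearity are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),   # normalizes pre-activations using mini-batch statistics
    nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(32, 128)

model.train()             # training mode: use batch statistics, update running averages
out_train = model(x)

model.eval()              # evaluation mode: use the stored running mean/variance
out_eval = model(x)
```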

How critical are BatchNorm parameters?

Frankle et al., “Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs”, 2020.
Alternative: Batch Normalization
• Before or after non-linearity?
• The proposers argue:
• “If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate.”
• “As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift.”
• “We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is “more Gaussian” (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution.”

BatchNorm introduces scale invariance
• https://www.inference.vc/exponentially-growing-learning-rate-implications-of-scale-invariance-induced-by-batchnorm/

Alternative Normalizations

https://medium.com/syncedreview/facebook-ai-proposes-group-normalization-alternative-to-batch-normalization-fb0699bffae7
To sum up
• Initialization and normalization are crucial
• Different initialization & normalization strategies may be
needed for different deep learning methods
• E.g., in CNNs, normalization might be performed only on the convolutional layers

Issues & Practical Advice

Issues & tricks
• Vanishing gradient
• Saturated units block gradient propagation (why?)
• A problem especially present in recurrent networks or networks with
a lot of layers
• Overfitting
• Drop-out, regularization and other tricks.
• Tricks:
• Unsupervised pretraining
• Batch normalization (each unit’s preactivation is normalized)
• Helps keep the preactivations non-saturated
• Do this for mini-batches (adds stochasticity)

Unsupervised pretraining
What if things are not working?
• Check your gradients by comparing them against numerical gradients (a minimal sketch follows below)
• More on this at: http://cs231n.github.io/neural-networks-3/
• Check whether you are using an appropriate floating-point representation
• Be aware of floating-point precision/loss problems
• Turn off dropout and other “extra” mechanisms during the gradient check
• The check can be performed on only a few dimensions

• The regularization loss may dominate the data loss
• First disable the regularization loss & make sure the data loss works
• Then add the regularization loss with a big factor
• And check the gradient in each case
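A minimal sketch of a centered-difference gradient check on a few randomly chosen dimensions, assuming NumPy and a scalar-valued loss f(w); the example loss at the bottom is illustrative:

```python
import numpy as np

def grad_check(f, w, analytic_grad, num_checks=5, h=1e-5):
    # Compare analytic gradients against centered numerical differences
    # on a few randomly chosen dimensions of w.
    rng = np.random.default_rng(0)
    for _ in range(num_checks):
        i = tuple(rng.integers(0, s) for s in w.shape)
        old = w[i]
        w[i] = old + h; f_plus = f(w)
        w[i] = old - h; f_minus = f(w)
        w[i] = old                                    # restore the original value
        num = (f_plus - f_minus) / (2 * h)            # centered difference
        ana = analytic_grad[i]
        rel_err = abs(num - ana) / max(abs(num) + abs(ana), 1e-12)
        print(f"dim {i}: numerical {num:.6f}, analytic {ana:.6f}, rel. error {rel_err:.2e}")

# Example: f(w) = 0.5 * ||w||^2, whose analytic gradient is w itself
w = np.random.randn(4, 3)
grad_check(lambda v: 0.5 * np.sum(v * v), w, analytic_grad=w.copy())
```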
What if things are not working?
• Have a feeling for the initial loss value
• For CIFAR-10 with 10 classes: because each class has probability 0.1, the initial (cross-entropy) loss should be about −ln(0.1) = 2.302
• For the hinge loss: since all margins are violated (all scores are approximately zero), the loss should be around 9 (+1 for each violated margin)

• Try to overfit a tiny subset of the dataset
• The cost should reach (nearly) zero if things are working properly
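A two-line sanity check of those numbers (a sketch assuming the standard cross-entropy and multiclass hinge losses):

```python
import math

num_classes = 10
# Cross-entropy with uniform scores: every class gets probability 1/C
print(-math.log(1.0 / num_classes))   # ~2.302 for CIFAR-10

# Multiclass hinge loss with margin 1 and near-zero scores: all C-1 margins are violated
print(num_classes - 1)                # 9
```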

What if things are not working?

Learning rate might be too low; batch size might be too small.

What if things are not working?
• Plot the histogram of activations per layer (a minimal hook-based sketch follows below)
• E.g., for tanh units, we expect to see a diverse distribution of values in [-1, 1]
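A minimal sketch that uses forward hooks to collect per-layer activations, assuming PyTorch and matplotlib; the toy model and random batch are illustrative:

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

model = nn.Sequential(nn.Linear(100, 50), nn.Tanh(), nn.Linear(50, 50), nn.Tanh())
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach().flatten()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Tanh):                 # record the outputs of every tanh layer
        module.register_forward_hook(make_hook(name))

model(torch.randn(256, 100))                        # one forward pass on a random batch

for name, acts in activations.items():
    plt.hist(acts.numpy(), bins=50, alpha=0.5, label=f"layer {name}")
plt.xlabel("activation value"); plt.legend(); plt.show()
```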

What if things are not working?
• Visualize your layers (the weights)

Andrew Ng’s suggestions
https://www.youtube.com/watch?v=F1ka6a13S9I

• “In DL, the coupling between bias & variance is weaker compared to other ML methods: we can train a network to have low bias and low variance.”

• “Dev (validation) and test sets should come from the same distribution; dev & test sets are like problem specifications.
• This requires special attention if you have a lot of data from simulated environments etc. but little data from the real test environment.”

Andrew Ng’s suggestions
https://www.youtube.com/watch?v=F1ka6a13S9I

• “Knowing the human performance level gives information about the problem of your network:
• If the training error is far from human performance, then there is a bias problem.
• If they are close but the validation error is larger (compared to the difference between human and training error), then there is a variance problem.”

• “After surpassing the human level, performance improves only very slowly / with difficulty.
• One reason: there is not much room for improvement (only tiny details are left); the problem gets much harder.
• Another reason: we get our labels from humans.”

Also read the following
• 37 reasons why your neural network is not working:
  https://medium.com/@slavivanov/4020854bd607
• “A Recipe for Training Neural Networks” by Karpathy:
  http://karpathy.github.io/2019/04/25/recipe/
• Deep Learning Tuning Playbook:
  https://github.com/google-research/tuning_playbook
• Calibrated Chaos: Variance Between Runs of Neural Network Training is Harmless and Inevitable:
  https://arxiv.org/pdf/2304.01910.pdf

Luckily, deep networks are very powerful

Regularization is turned off in these experiments; when you turn on regularization, the networks perform worse.
Concluding remarks for the first part
• Loss functions
• Gradients of loss functions for minimizing them
• All operations in the network should be differentiable
• Gradient descent and its variants
• Initialization, normalization, adaptive learning rate, …
• Overall, you have learned most of the tools you will use in the rest of
the course.

