Ceng403 - Week 6b
Week 6b
Sinan Kalkan
Representational capacity
• Boolean functions:
• Every Boolean function can be represented exactly by a neural network
• The number of hidden units might need to grow exponentially with the number of inputs
• Continuous functions:
• Every bounded continuous function can be approximated with small error with two layers
• Arbitrary functions:
• Three layers can approximate any function
Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function", Mathematics of Control, Signals, and Systems, 2(4), 303–314.
Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks", Neural Networks, 4(2), 251–257.
Hornik, K., Stinchcombe, M., & White, H. (1989). "Multilayer feedforward networks are universal approximators", Neural Networks, 2(5), 359–366.
Model Complexity
• Models range in their flexibility to fit arbitrary data
https://cs231n.github.io/
[Figure: error vs. model complexity. Training error decreases with complexity; validation error is minimized at an optimum complexity. Simple, constrained models: high bias, low variance. Complex, unconstrained models: low bias, high variance.]
Memorization vs. Generalization
https://arxiv.org/pdf/1906.05271.pdf
Double Descent
Nakkiran et al., “Deep Double Descent: Where Bigger Models and More Data Hurt”, 2019.
Grokking
How do you spot overfitting?
Avoiding Overfitting
• Increase training set size
• Make sure the effective dataset size is growing; redundant data doesn't help
• Incorporate domain-appropriate bias into model
• Customize model to your problem
• Tune hyperparameters of model
• number of layers, number of hidden units per layer, connectivity, etc.
• Regularization techniques
Slide Credit: Michael Mozer
Regularization
• L2 regularization: ½ 𝜆𝑤²
• Very common
• Penalizes peaky weight vector, prefers diffuse weight vectors
• L1 regularization: 𝜆|𝑤|
• Enforces sparsity (some weights become zero)
• Why? The gradient of |𝑤| is constant, so weights decay by a fixed amount at each step (regardless of their magnitude) and can reach exactly zero.
• Leads to input selection (makes it robust against noise)
• Use it if you require sparsity / feature selection
• Can be combined: 𝜆1 |𝑤| + 𝜆2 𝑤²
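A minimal PyTorch-style sketch (not from the slides) of adding these penalties to a loss; the names regularized_loss, lambda_l1 and lambda_l2 are illustrative:

```python
# Sketch only: add L1/L2 penalties over all parameters to a data loss.
def regularized_loss(data_loss, model, lambda_l1=0.0, lambda_l2=1e-4):
    l1 = sum(p.abs().sum() for p in model.parameters())      # promotes sparsity
    l2 = sum((p ** 2).sum() for p in model.parameters())     # prefers diffuse weights
    return data_loss + lambda_l1 * l1 + 0.5 * lambda_l2 * l2

# Plain L2 regularization is usually requested via the optimizer instead, e.g.
# torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4).
```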
Garipov et al., “Loss Surfaces, Mode Connectivity, and Fast Ensembling of DNNs”, 2018.
See also: Kuditipudi et al., “Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets”, 2020.
- Explains this via noise stability and dropout stability.
Dropout as Ensemble Training Method
Fig: Srivastava et al., 2014
Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pp. 2814–2822, 2013.
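A minimal NumPy sketch of inverted dropout in the spirit of the cs231n notes (not the exact implementation of Srivastava et al.); the keep probability p and the array shapes are illustrative:

```python
import numpy as np

p = 0.5  # keep probability (illustrative)

def dropout_forward(h, train=True):
    if train:
        mask = (np.random.rand(*h.shape) < p) / p  # drop units, rescale so E[output] is unchanged
        return h * mask
    return h  # at test time the full ("averaged ensemble") network is used

# In PyTorch the same behaviour comes from torch.nn.Dropout(p=0.5)
# (p is the drop probability there) together with model.train() / model.eval().
```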
Data Augmentation
http://blcv.pl/static//2018/02/27/demystifying-face-recognition-v-data-augmentation/
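A minimal torchvision sketch of typical image augmentations (the specific transforms and parameters are illustrative, not taken from the linked post):

```python
from torchvision import transforms

# Each training image is randomly cropped, flipped and color-jittered,
# so the effective training set is much larger than the stored one.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```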
Training Vs. Val Set Error
[Figure: training-set and validation-set error plotted against epochs.]
Slide Credit: Michael Mozer
Today
• Data Preprocessing and Initialization
• Issues and Practical Advice
Data Preprocessing
• Mean subtraction
• Normalization
• PCA and whitening
Data Preprocessing: Mean subtraction
• Compute the mean of each dimension, 𝜇𝑖 , over the training set:
𝜇𝑖 = (1/𝑁) Σ𝑗 𝑥𝑗𝑖
• Subtract the mean for each dimension:
𝑥𝑗𝑖′ ← 𝑥𝑗𝑖 − 𝜇𝑖
• Effect: Move the data center (mean) to coordinate center
http://cs231n.github.io/neural-networks-2/
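A minimal NumPy sketch, assuming an (N, D) data matrix X with one example per row (the toy data is illustrative):

```python
import numpy as np

X = np.random.randn(100, 3)   # toy (N, D) training data
mu = X.mean(axis=0)           # per-dimension mean over the training set
X_centered = X - mu           # data center moved to the origin
```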
Data Preprocessing:
Normalization (or conditioning)
• Necessary if you believe that your dimensions have different scales
• Might need to reduce this to give equal importance to each dimension
• Normalize each dimension by its std. dev. (𝜎𝑖 ) after mean subtraction:
𝑥𝑗𝑖′ = 𝑥𝑗𝑖 − 𝜇𝑖
𝑥𝑗𝑖′′ = 𝑥𝑗𝑖′ /𝜎𝑖
• Effect: Make the dimensions have the same scale
http://cs231n.github.io/neural-networks-2/
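A minimal NumPy sketch of mean subtraction followed by per-dimension scaling (toy data, illustrative):

```python
import numpy as np

X = np.random.randn(100, 3)          # toy (N, D) training data
X_centered = X - X.mean(axis=0)      # mean subtraction (previous slide)
sigma = X_centered.std(axis=0)       # per-dimension standard deviation
X_normalized = X_centered / sigma    # every dimension now has the same scale
```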
Data Preprocessing:
Principal Component Analysis
• First center the data
• Find the eigenvectors 𝑒1 , … , 𝑒𝑛
• Project the data onto the eigenvectors:
• 𝑥𝑖𝑅 = 𝑥𝑖 ⋅ [𝑒1 , … , 𝑒𝑛 ]
• This corresponds to rotating the data to have the eigenvectors as the axes
• If you take the first 𝑀 eigenvectors, it corresponds to dimensionality reduction
http://cs231n.github.io/neural-networks-2/
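A minimal NumPy sketch following the cited cs231n notes (toy data; M is an illustrative number of kept components):

```python
import numpy as np

X = np.random.randn(100, 3)            # toy (N, D) data
X = X - X.mean(axis=0)                 # PCA assumes centered data
cov = X.T @ X / X.shape[0]             # (D, D) covariance matrix
U, S, _ = np.linalg.svd(cov)           # columns of U are the eigenvectors e_1, ..., e_n
X_rot = X @ U                          # rotate the data into the eigenbasis
M = 2
X_reduced = X @ U[:, :M]               # keep the first M components: dimensionality reduction
```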
Data Preprocessing: Whitening
• Normalize the scale of each rotated dimension by the square root of its eigenvalue:
𝑥𝑖𝑤 = 𝑥𝑖𝑅 / √(𝜆𝑖 + 𝜖)
• 𝜖: a very small number to avoid division by zero
• This stretches each dimension to have the same scale.
• Side effect: this may exaggerate noise.
http://cs231n.github.io/neural-networks-2/
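A minimal NumPy sketch of whitening on top of the PCA rotation (toy data; the constant 1e-5 plays the role of 𝜖):

```python
import numpy as np

X = np.random.randn(100, 3)
X = X - X.mean(axis=0)
U, S, _ = np.linalg.svd(X.T @ X / X.shape[0])
X_rot = X @ U                          # PCA-rotated data (previous slide)
X_white = X_rot / np.sqrt(S + 1e-5)    # equal scale per dimension; can exaggerate noise
```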
Data Preprocessing: Example
http://cs231n.github.io/neural-networks-2/
Data Preprocessing: Summary
• We mostly don’t use PCA or whitening
• They are computationally very expensive
• Whitening has side effects
• It is quite crucial and common to zero-center the data
• Most of the time, we see normalization with the std. deviation
Weight Initialization
• Zero weights
• Does not work well!
• All units then compute the same output and receive the same gradient, so their weights stay identical (symmetry is never broken)
• Initialize the weights randomly to small values:
• Sample from a small range, e.g., Normal(0, 0.01)
• Don’t initialize too small
• The bias may be initialized to zero
• For ReLU units, this may be a small number like 0.01.
Eqn: https://cs231n.github.io/neural-networks-2/#init
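A minimal NumPy sketch of this naive initialization (the layer sizes and the 0.01 scale are illustrative):

```python
import numpy as np

n_in, n_out = 256, 128
W = 0.01 * np.random.randn(n_in, n_out)   # small zero-mean Gaussian weights
b = np.zeros(n_out)                       # biases start at zero
b_relu = 0.01 * np.ones(n_out)            # or a small positive value for ReLU units
```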
Weight Initialization
• Solution: Get rid of n in Var(s) = (n Var(w)) Var(x)
• How? Scale the initial weights by 1/√n
• Why? Because Var(aX) = a² Var(X)
• Standard initialization (top plots in Figures 6 & 7):
w_i ∼ U[−1/√n, 1/√n],
which yields n Var(w) = 1/3, because the variance of U[−r, r] is r²/3 (see [1])
• Xavier initialization for symmetric activation functions (Glorot & Bengio):
w_i ∼ N(0, 1/n), or w_i ∼ N(0, 1/((n_in + n_out)/2)) if n_in ≠ n_out
With a uniform distribution:
w_i ∼ U[−√3/√n, √3/√n], or w_i ∼ U[−√(3/((n_in + n_out)/2)), √(3/((n_in + n_out)/2))] if n_in ≠ n_out
Figures: Glorot & Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010.
[1] https://proofwiki.org/wiki/Variance_of_Continuous_Uniform_Distribution
Weight Initialization
He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", 2015.
More on Weight Initialization
Tutorial:
https://mmuratarat.github.io/2019-02-25/xavier-glorot-he-weight-init
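In practice these schemes are available as built-in initializers; a minimal PyTorch sketch (layer sizes are illustrative, not from the slides):

```python
import torch.nn as nn

tanh_layer = nn.Linear(256, 128)
nn.init.xavier_uniform_(tanh_layer.weight)   # Glorot & Bengio (2010): symmetric activations
nn.init.zeros_(tanh_layer.bias)

relu_layer = nn.Linear(256, 128)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')  # He et al. (2015): ReLU units
nn.init.zeros_(relu_layer.bias)
```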
Alternative: Batch Normalization
• Normalization (of input) is differentiable
• So, make it part of the model (not only at the beginning)
• I.e., perform normalization during every step of processing
• More robust to initialization
• Shown to also regularize the network in some cases (dropping the need for dropout)
• Issue: How to normalize at test time?
1. Store means and variances during training, or
2. Calculate mean & variance over your test data
• PyTorch: use model.eval() at test time.
Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", 2015.
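A minimal PyTorch sketch that places batch normalization between the linear layer and the nonlinearity and switches between training and test behaviour (the architecture is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.BatchNorm1d(128),   # normalizes each unit's preactivation over the mini-batch
    nn.ReLU(),
    nn.Linear(128, 10),
)

model.train()                       # BN uses mini-batch statistics, updates running estimates
y = model(torch.randn(32, 256))

model.eval()                        # BN switches to the stored running mean/variance
y = model(torch.randn(32, 256))
```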
How critical are BatchNorm parameters?
Frankle et al., “Training BatchNorm and Only BatchNorm: On the Expressive Power of Random Features in CNNs”, 2020.
Alternative: Batch Normalization
• Before or after non-linearity?
• Proposers:
• "If, however, we could ensure that the distribution of nonlinearity inputs remains more stable as the network trains, then the optimizer would be less likely to get stuck in the saturated regime, and the training would accelerate."
• "As each layer observes the inputs produced by the layers below, it would be advantageous to achieve the same whitening of the inputs of each layer. By whitening the inputs to each layer, we would take a step towards achieving the fixed distributions of inputs that would remove the ill effects of the internal covariate shift."
• "We add the BN transform immediately before the nonlinearity, by normalizing x = Wu + b. We could have also normalized the layer inputs u, but since u is likely the output of another nonlinearity, the shape of its distribution is likely to change during training, and constraining its first and second moments would not eliminate the covariate shift. In contrast, Wu + b is more likely to have a symmetric, non-sparse distribution, that is "more Gaussian" (Hyvärinen & Oja, 2000); normalizing it is likely to produce activations with a stable distribution."
BatchNorm introduces scale invariance
• https://www.inference.vc/exponentially-growing-learning-rate-implications-of-scale-invariance-induced-by-batchnorm/
Alternative Normalizations
https://medium.com/syncedreview/facebook-ai-proposes-group-normalization-alternative-to-batch-normalization-fb0699bffae7
To sum up
• Initialization and normalization are crucial
• Different initialization & normalization strategies may be
needed for different deep learning methods
• E.g., in CNNs, normalization might be performed only on the convolutional layers, etc.
Issues & Practical Advice
Issues & tricks
• Vanishing gradient
• Saturated units block gradient propagation (why?)
• A problem especially present in recurrent networks or networks with a lot of layers
• Overfitting
• Drop-out, regularization and other tricks.
• Tricks:
• Unsupervised pretraining
• Batch normalization (each unit’s preactivation is normalized)
• Helps keep the preactivation non-saturated
• Do this for mini-batches (adds stochasticity)
Unsupervised pretraining
What if things are not working?
• Check your gradients by comparing them against numerical gradients (see the sketch after this list)
• More on this at: http://cs231n.github.io/neural-networks-3/
• Check whether you are using an appropriate floating point representation
• Be aware of floating point precision/loss problems
• Turn off drop-out and other “extra” mechanisms during gradient check
• This can be performed on only a few dimensions
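A minimal sketch of such a numerical gradient check on a few randomly chosen coordinates (the function names and tolerances are illustrative; see the cs231n notes linked above for a more careful treatment):

```python
import numpy as np

def grad_check(f, x, analytic_grad, num_checks=5, h=1e-5):
    """Compare analytic gradients of scalar-valued f at x with centered differences."""
    for _ in range(num_checks):
        i = tuple(np.random.randint(d) for d in x.shape)   # pick one coordinate
        old = x[i]
        x[i] = old + h; fxph = f(x)                        # f(x + h)
        x[i] = old - h; fxmh = f(x)                        # f(x - h)
        x[i] = old                                         # restore
        num = (fxph - fxmh) / (2 * h)
        ana = analytic_grad[i]
        rel_err = abs(num - ana) / max(abs(num) + abs(ana), 1e-12)
        print(f"numerical {num:.6f}  analytic {ana:.6f}  relative error {rel_err:.2e}")
```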
What if things are not working?
• Plot the histogram of activations per layer
• E.g., for tanh functions, we expect to see a diverse distribution of values between [-1, 1]
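A minimal PyTorch sketch that records activations with forward hooks so their per-layer histograms can be inspected (the toy model and layer selection are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 50), nn.Tanh(), nn.Linear(50, 10), nn.Tanh())
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Tanh):
        module.register_forward_hook(save_activation(name))

model(torch.randn(64, 100))
for name, act in activations.items():
    hist = torch.histc(act, bins=20, min=-1, max=1)   # should be spread over [-1, 1]
    print(name, hist.tolist())
```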
What if things are not working?
• Visualize your layers (the weights)
Andrew Ng’s suggestions
https://www.youtube.com/watch?v=F1ka6a13S9I
Also read the following
• 37 reasons why your neural network is not working:
• https://medium.com/@slavivanov/4020854bd607
Luckily, deep networks are very powerful