Lecture 5-6

The document discusses stability issues in deep learning, particularly focusing on vanishing and exploding gradients in neural networks. It emphasizes the importance of proper weight initialization, model capacity, and regularization techniques like L1 and L2 regularization, early stopping, and dropout to mitigate these issues. Additionally, it introduces concepts such as VC dimension and data complexity that influence model generalization and performance.


Deep Learning

7. Stability, Model Generalization and Regularization

Dr. Ahsen Tahir


Stability – Vanishing and
Exploding Gradients
Gradients for Neural Networks

• Consider a network with d layers:

  h_t = f_t(h_{t−1})   and   y = ℓ ∘ f_d ∘ … ∘ f_1(x)

• Compute the gradient of the loss ℓ with respect to W_t:

  ∂ℓ/∂W_t = (∂ℓ/∂h_d) (∂h_d/∂h_{d−1}) ⋯ (∂h_{t+1}/∂h_t) (∂h_t/∂W_t)

  The middle factors are a multiplication of d−t matrices: ∏_{i=t}^{d−1} ∂h_{i+1}/∂h_i

Two Issues for Deep Neural Networks

• Gradient exploding, e.g. 1.5^100 ≈ 4 × 10^17
• Gradient vanishing, e.g. 0.8^100 ≈ 2 × 10^−10
Example: MLP

• Assume an MLP (without bias for simplicity):

  f_t(h_{t−1}) = σ(W_t h_{t−1})        σ is the activation function

  ∂h_t/∂h_{t−1} = diag(σ′(W_t h_{t−1})) (W_t)^T        σ′ is the derivative of σ

  ∏_{i=t}^{d−1} ∂h_{i+1}/∂h_i = ∏_{i=t}^{d−1} diag(σ′(W_{i+1} h_i)) (W_{i+1})^T
Gradient Exploding

• Use ReLU as the activation function:

  σ(x) = max(0, x)   and   σ′(x) = 1 if x > 0, 0 otherwise

• Since each diag(σ′(W_{i+1} h_i)) has entries that are either 0 or 1, elements of
  ∏_{i=t}^{d−1} ∂h_{i+1}/∂h_i = ∏_{i=t}^{d−1} diag(σ′(W_{i+1} h_i)) (W_{i+1})^T
  may come directly from ∏_{i=t}^{d−1} (W_{i+1})^T

• This leads to large values when d−t is large, e.g. 1.5^100 ≈ 4 × 10^17
Exploding Gradients
Issues with Gradient Exploding

• Values go out of range: the gradient overflows to infinity
  • Especially severe when using 16-bit floating point, whose range is only about 6e-5 to 6e4
• Training becomes sensitive to the learning rate (LR)
  • An LR that is not small enough -> large weights -> even larger gradients
  • Too small an LR -> no progress
  • The LR may need to change dramatically during training
Gradient Vanishing

• Use sigmoid as the activation function:

  σ(x) = 1 / (1 + e^−x)        σ′(x) = σ(x)(1 − σ(x))

  (Plot of σ and σ′: the gradient σ′ is small in both tails, i.e. whenever |x| is large.)
Gradient Vanishing

• Use sigmoid as the activation function:

  σ(x) = 1 / (1 + e^−x)        σ′(x) = σ(x)(1 − σ(x))

• Elements of ∏_{i=t}^{d−1} ∂h_{i+1}/∂h_i = ∏_{i=t}^{d−1} diag(σ′(W_{i+1} h_i)) (W_{i+1})^T are products of d−t small values

• This leads to tiny values when d−t is large, e.g. 0.8^100 ≈ 2 × 10^−10
Issues with Gradient Vanishing

• Gradients become 0
  • Especially severe with 16-bit floating point
• No progress in training
  • No matter how the learning rate is chosen
• Especially severe for the bottom (early) layers
  • Only the top layers end up well trained
  • No benefit from making the network deeper
Stabilize Training

• Goal: keep gradient values in a reasonable range
  • e.g. in [1e-6, 1e3]
• Turn multiplication into addition
  • ResNet, LSTM
• Normalize
  • Batch normalization, gradient clipping (see the sketch below)
• Proper weight initialization and activation functions
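
As a minimal illustration of the gradient-clipping option above, here is a hedged PyTorch sketch; the toy model, data, and the max_norm=1.0 threshold are hypothetical choices, not values from the slides:

import torch
from torch import nn

# Hypothetical toy model and data, only to show where clipping fits in a training step
net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
X, y = torch.randn(32, 20), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(net(X), y)
loss.backward()
# Rescale gradients so their global L2 norm is at most 1.0 (hypothetical threshold)
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
optimizer.step()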
Weight Initialization

• Initialize weights with random values in a proper range
• The beginning of training easily suffers from numerical instability
  • The loss surface far away from an optimum can be complex
  • Near an optimum it may be flatter
• Initializing from N(0, 0.01) works well for small networks, but does not guarantee stable behaviour near the optimum for deep neural networks
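
A minimal sketch of the N(0, 0.01) initialization above, together with Xavier initialization as one common "proper initialization" alternative; the layer sizes are hypothetical:

from torch import nn

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

def init_normal(m):
    # Draw weights from N(0, 0.01); usually fine for small networks
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

def init_xavier(m):
    # Xavier/Glorot initialization scales the variance by the layer's fan-in and fan-out
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

net.apply(init_xavier)  # or net.apply(init_normal)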
Underfitting/Overfitting Review
(Figure: example fits illustrating underfitting vs. overfitting. Image credit: hackernoon.com)

Underfitting and Overfitting Review

                        Data complexity
                        Simple          Complex
Model capacity   Low    Normal          Underfitting
                 High   Overfitting     Normal
Model Capacity

• The ability to fit a variety of functions
• Low-capacity models struggle to fit the training set
  • Underfitting
• High-capacity models can memorize the training set
  • Overfitting
Influence of Model Complexity
Estimate Model Capacity

• It is hard to compare complexity between different algorithm families
  • e.g. trees vs. neural networks
• Within one algorithm family, two main factors matter:
  • The number of parameters
  • The values taken by each parameter
• Example parameter counts: a linear model with d inputs has d + 1 parameters; a one-hidden-layer MLP with d inputs, m hidden units and k outputs has (d + 1)m + (m + 1)k parameters
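
A quick check of the (d + 1)m + (m + 1)k count, using hypothetical sizes d = 784, m = 256, k = 10:

d, m, k = 784, 256, 10                 # inputs, hidden units, outputs (hypothetical sizes)
n_params = (d + 1) * m + (m + 1) * k   # weights plus biases of both layers
print(n_params)                        # 203530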
VC Dimension

• A central topic in statistical learning theory, due to Vladimir Vapnik and Alexey Chervonenkis
• For a classification model, it is the size of the largest dataset such that, no matter how we assign the labels, there exists a model in the family that classifies them perfectly
VC-Dimension for Linear Classifier

• 2-D perceptron: VCdim = 3
  • It can classify any 3 points, but not 4 points (XOR)
• Perceptron with N parameters: VCdim = N
• Some multilayer perceptrons: VCdim = O(N log₂(N))
Usefulness of VC-Dimension

• Provides theoretical insight into why a model works
  • It bounds the gap between training error and generalization error
• Rarely used in practice with deep learning
  • The bounds are too loose
  • It is difficult to compute the VC dimension of deep neural networks
  • The same holds for other statistical learning theory tools
Data Complexity

• Multiple factors matter:
  • the number of examples
  • the number of elements in each example
  • time/space structure
  • diversity
Regularization

Definition: any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error.

Regularization Objective Function: a penalty term is added to the training objective, e.g. J̃(w; X, y) = J(w; X, y) + λ Ω(w), where Ω(w) is the regularizer (an L2 or L1 norm of the weights in the following slides).
L2 Regularization

• A hyper-parameter α (or λ) controls how important the regularization term is
  • λ = 0: no effect (λ is used more commonly than α)
  • λ → ∞: w* → 0
Illustrate the Effect on Optimal Solutions

  w* = arg min_w  J(w; X, y) + (λ/2)∥w∥₂²        (regularized optimum)
  w̃* = arg min_w  J(w; X, y)                      (unregularized optimum)

  (Figure: the L2 penalty pulls the optimum w* away from the unregularized solution w̃* toward the origin.)
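
In frameworks such as PyTorch, the L2 penalty is usually applied through the optimizer's weight_decay argument rather than added to the loss by hand; a minimal sketch (the model and λ = 1e-3 are hypothetical):

import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
# weight_decay adds the gradient of (λ/2)·∥w∥², i.e. λ·w, to every parameter update
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-3)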


L1 Regularization

• The λ symbol is more commonly used than α as the regularization parameter
• L1 regularization adds a penalty λ∥w∥₁ to the objective; unlike L2, it tends to drive many weights exactly to zero, giving sparse solutions

(Slides adapted from Alex Smola @ CMU)
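
Since the built-in weight_decay of standard PyTorch optimizers implements an L2-style penalty, an L1 penalty is typically added to the loss by hand; a minimal sketch with a hypothetical λ = 1e-4:

import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
lam = 1e-4                                      # hypothetical regularization strength
X, y = torch.randn(32, 20), torch.randn(32, 1)  # hypothetical data

# λ·∥w∥₁ term (summed over all parameters, biases included here for simplicity)
l1_penalty = sum(p.abs().sum() for p in net.parameters())
loss = loss_fn(net(X), y) + lam * l1_penalty
loss.backward()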


Early Stopping

• Algorithm: monitor an objective function (typically the validation loss) during training and stop when it no longer improves, instead of training for a fixed number of epochs (see the sketch below)
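
A minimal sketch of early stopping based on a validation objective; net, train_one_epoch, evaluate_loss, the data loaders, and the patience value are all hypothetical placeholders:

import copy

max_epochs, patience = 100, 5                       # hypothetical budget and patience
best_val, best_state, wait = float("inf"), None, 0

for epoch in range(max_epochs):
    train_one_epoch(net, train_loader)              # hypothetical training step
    val_loss = evaluate_loss(net, val_loader)       # hypothetical validation objective
    if val_loss < best_val:
        best_val, wait = val_loss, 0
        best_state = copy.deepcopy(net.state_dict())  # remember the best weights so far
    else:
        wait += 1
        if wait >= patience:                        # stop if no improvement for `patience` epochs
            break

net.load_state_dict(best_state)                     # restore the best model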

Dropout
Motivation

• A good model should be robust under modest changes in the input
• Training with input noise is equivalent to Tikhonov regularization
• Dropout: inject noise into the internal (hidden) layers
Apply Dropout

• Dropout is often applied to the output of hidden fully-connected layers:

  h  = σ(W₁x + b₁)
  h′ = dropout(h)
  o  = W₂h′ + b₂
  y  = softmax(o)
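
The same structure written with PyTorch's built-in nn.Dropout placed after the hidden activation; the layer sizes and drop probability p = 0.5 are hypothetical:

from torch import nn

net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # h = σ(W₁x + b₁)
    nn.Dropout(p=0.5),                # h′ = dropout(h)
    nn.Linear(256, 10),               # o = W₂h′ + b₂ (softmax is usually folded into the loss)
)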
Add Noise without Bias

• Add noise to x to get x′ such that, in expectation, nothing changes:

  E[x′] = x

• Dropout perturbs each element by

  x′ᵢ = 0 with probability p,   x′ᵢ = xᵢ / (1 − p) otherwise

  so that E[x′ᵢ] = p · 0 + (1 − p) · xᵢ / (1 − p) = xᵢ
Dropout using numpy?

Let’s do in Jupyter-notebook
Dropout using pytorch tensors
Code difference - numpy vs pytorch

# In PyTorch (X is a tensor)
assert 0 <= dropout <= 1
if dropout == 1:
    # torch.zeros_like takes the tensor itself, not its shape
    return torch.zeros_like(X)

# In NumPy (X is an ndarray; X.shape is a tuple)
if dropout == 1:
    # np.zeros takes the shape tuple
    return np.zeros(X.shape)
Dropout from scratch – using dropout_layer func
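
A minimal dropout_layer sketch consistent with the formula above (zero each element with probability p, scale the survivors by 1/(1 − p)), written with PyTorch tensors:

import torch

def dropout_layer(X, dropout):
    # Zero each element of X with probability `dropout`; scale the rest by 1/(1-dropout)
    assert 0 <= dropout <= 1
    if dropout == 1:
        return torch.zeros_like(X)            # everything is dropped
    if dropout == 0:
        return X                              # nothing is dropped
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)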
Dropout in Inference

• Regularization is only used during training

• At inference, the dropout layer acts as the identity:

  h′ = dropout(h) = h

• This guarantees deterministic results