Lecture 5-6

The document discusses stability issues in deep learning, particularly focusing on vanishing and exploding gradients in neural networks. It emphasizes the importance of proper weight initialization, model capacity, and regularization techniques like L1 and L2 regularization, early stopping, and dropout to mitigate these issues. Additionally, it introduces concepts such as VC dimension and data complexity that influence model generalization and performance.


Deep Learning

7. Stability, Model Generalization and Regularization

Dr. Ahsen Tahir


Stability – Vanishing and
Exploding Gradients
Gradients for Neural Networks

• Consider a network with d layers:

  h_t = f_t(h_{t−1})   and   y = ℓ ∘ f_d ∘ … ∘ f_1(x)

• Compute the gradient of the loss ℓ with respect to W_t:

  ∂ℓ/∂W_t = (∂ℓ/∂h_d) (∂h_d/∂h_{d−1}) ⋯ (∂h_{t+1}/∂h_t) (∂h_t/∂W_t)

  The middle factors are a multiplication of d−t matrices: ∏_{i=t}^{d−1} ∂h_{i+1}/∂h_i

Two Issues for Deep Neural Networks

• Gradient exploding, e.g. 1.5^100 ≈ 4 × 10^17
• Gradient vanishing, e.g. 0.8^100 ≈ 2 × 10^−10
Example: MLP

• Assume an MLP (without bias for simplicity):

  f_t(h_{t−1}) = σ(W_t h_{t−1})        σ is the activation function

  ∂h_t/∂h_{t−1} = diag(σ′(W_t h_{t−1})) (W_t)^T        σ′ is the derivative of σ

  ∏_{i=t}^{d−1} ∂h_{i+1}/∂h_i = ∏_{i=t}^{d−1} diag(σ′(W_{i+1} h_i)) (W_{i+1})^T
Gradient Exploding

• Use ReLU as the activation function:

  σ(x) = max(0, x)   and   σ′(x) = 1 if x > 0, 0 otherwise

• Since each diag(σ′(W_{i+1} h_i)) has entries that are either 0 or 1, elements of
  ∏_{i=t}^{d−1} ∂h_{i+1}/∂h_i = ∏_{i=t}^{d−1} diag(σ′(W_{i+1} h_i)) (W_{i+1})^T
  may come directly from ∏_{i=t}^{d−1} (W_{i+1})^T

• This leads to large values when d−t is large, e.g. 1.5^100 ≈ 4 × 10^17
Exploding Gradients
Issues with Gradient Exploding

• Values go out of range: the gradient overflows to infinity
  • Especially severe when using 16-bit floating point, whose range is only about 6e-5 to 6e4
• Training becomes sensitive to the learning rate (LR)
  • An LR that is not small enough -> large weights -> even larger gradients
  • Too small an LR -> no progress
  • The LR may need to change dramatically during training
Gradient Vanishing

• Use sigmoid as the activation function:

  σ(x) = 1 / (1 + e^−x)        σ′(x) = σ(x)(1 − σ(x))

  (Plot of σ and σ′: the gradient σ′ is small in both tails, i.e. whenever |x| is large.)
Gradient Vanishing

• Use sigmoid as the activation function:

  σ(x) = 1 / (1 + e^−x)        σ′(x) = σ(x)(1 − σ(x))

• Elements of ∏_{i=t}^{d−1} ∂h_{i+1}/∂h_i = ∏_{i=t}^{d−1} diag(σ′(W_{i+1} h_i)) (W_{i+1})^T are products of d−t small values

• This leads to tiny values when d−t is large, e.g. 0.8^100 ≈ 2 × 10^−10
Issues with Gradient Vanishing

• Gradients become 0
  • Especially severe with 16-bit floating point
• No progress in training
  • No matter how the learning rate is chosen
• Especially severe for the bottom (early) layers
  • Only the top layers end up well trained
  • No benefit from making the network deeper
Stabilize Training

• Goal: keep gradient values in a reasonable range
  • e.g. in [1e-6, 1e3]
• Turn multiplication into addition
  • ResNet, LSTM
• Normalize
  • Batch normalization, gradient clipping (see the sketch below)
• Proper weight initialization and activation functions
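
As a minimal illustration of the gradient-clipping option above, here is a hedged PyTorch sketch; the toy model, data, and the max_norm=1.0 threshold are hypothetical choices, not values from the slides:

import torch
from torch import nn

# Hypothetical toy model and data, only to show where clipping fits in a training step
net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
X, y = torch.randn(32, 20), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(net(X), y)
loss.backward()
# Rescale gradients so their global L2 norm is at most 1.0 (hypothetical threshold)
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
optimizer.step()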
Weight Initialization

• Initialize weights with random values in a proper range
• The beginning of training easily suffers from numerical instability
  • The loss surface far away from an optimum can be complex
  • Near an optimum it may be flatter
• Initializing from N(0, 0.01) works well for small networks, but does not guarantee stable behaviour near the optimum for deep neural networks
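
A minimal sketch of the N(0, 0.01) initialization above, together with Xavier initialization as one common "proper initialization" alternative; the layer sizes are hypothetical:

from torch import nn

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

def init_normal(m):
    # Draw weights from N(0, 0.01); usually fine for small networks
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.zeros_(m.bias)

def init_xavier(m):
    # Xavier/Glorot initialization scales the variance by the layer's fan-in and fan-out
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

net.apply(init_xavier)  # or net.apply(init_normal)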
Underfitting/Overfitting Review
(Figure: example fits illustrating underfitting vs. overfitting. Image credit: hackernoon.com)

Underfitting and Overfitting Review

                        Data complexity
                        Simple          Complex
Model capacity   Low    Normal          Underfitting
                 High   Overfitting     Normal
Model Capacity

• The ability to fit a variety of functions
• Low-capacity models struggle to fit the training set
  • Underfitting
• High-capacity models can memorize the training set
  • Overfitting
Influence of Model Complexity
Estimate Model Capacity

• It is hard to compare complexity between different algorithm families
  • e.g. trees vs. neural networks
• Within one algorithm family, two main factors matter:
  • The number of parameters
  • The values taken by each parameter
• Example parameter counts: a linear model with d inputs has d + 1 parameters; a one-hidden-layer MLP with d inputs, m hidden units and k outputs has (d + 1)m + (m + 1)k parameters
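
A quick check of the (d + 1)m + (m + 1)k count, using hypothetical sizes d = 784, m = 256, k = 10:

d, m, k = 784, 256, 10                 # inputs, hidden units, outputs (hypothetical sizes)
n_params = (d + 1) * m + (m + 1) * k   # weights plus biases of both layers
print(n_params)                        # 203530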
VC Dimension

• A central topic in statistical learning theory, due to Vladimir Vapnik and Alexey Chervonenkis
• For a classification model, it is the size of the largest dataset such that, no matter how we assign the labels, there exists a model in the family that classifies them perfectly
VC-Dimension for Linear Classifier

• 2-D perceptron: VCdim = 3
  • It can classify any 3 points, but not 4 points (XOR)
• Perceptron with N parameters: VCdim = N
• Some multilayer perceptrons: VCdim = O(N log₂(N))
Usefulness of VC-Dimension

• Provides theoretical insight into why a model works
  • It bounds the gap between training error and generalization error
• Rarely used in practice with deep learning
  • The bounds are too loose
  • It is difficult to compute the VC dimension of deep neural networks
  • The same holds for other statistical learning theory tools
Data Complexity

• Multiple factors matter:
  • the number of examples
  • the number of elements in each example
  • time/space structure
  • diversity
Regularization

Definition: any modification made to a learning algorithm that is intended to reduce its generalization error but not its training error.

Regularization Objective Function: a penalty term is added to the training objective, e.g. J̃(w; X, y) = J(w; X, y) + λ Ω(w), where Ω(w) is the regularizer (an L2 or L1 norm of the weights in the following slides).
L2 Regularization

• A hyper-parameter α (or λ) controls how important the regularization term is
  • λ = 0: no effect (λ is used more commonly than α)
  • λ → ∞: w* → 0
Illustrate the Effect on Optimal Solutions

  w* = arg min_w  J(w; X, y) + (λ/2)∥w∥₂²        (regularized optimum)
  w̃* = arg min_w  J(w; X, y)                      (unregularized optimum)

  (Figure: the L2 penalty pulls the optimum w* away from the unregularized solution w̃* toward the origin.)
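
In frameworks such as PyTorch, the L2 penalty is usually applied through the optimizer's weight_decay argument rather than added to the loss by hand; a minimal sketch (the model and λ = 1e-3 are hypothetical):

import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
# weight_decay adds the gradient of (λ/2)·∥w∥², i.e. λ·w, to every parameter update
optimizer = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=1e-3)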


L1 Regularization

• The λ symbol is more commonly used than α as the regularization parameter
• L1 regularization adds a penalty λ∥w∥₁ to the objective; unlike L2, it tends to drive many weights exactly to zero, giving sparse solutions

(Slides adapted from Alex Smola @ CMU)
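
Since the built-in weight_decay of standard PyTorch optimizers implements an L2-style penalty, an L1 penalty is typically added to the loss by hand; a minimal sketch with a hypothetical λ = 1e-4:

import torch
from torch import nn

net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
lam = 1e-4                                      # hypothetical regularization strength
X, y = torch.randn(32, 20), torch.randn(32, 1)  # hypothetical data

# λ·∥w∥₁ term (summed over all parameters, biases included here for simplicity)
l1_penalty = sum(p.abs().sum() for p in net.parameters())
loss = loss_fn(net(X), y) + lam * l1_penalty
loss.backward()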


Early Stopping

• Algorithm: monitor an objective function (typically the validation loss) during training and stop when it no longer improves, instead of training for a fixed number of epochs (see the sketch below)
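
A minimal sketch of early stopping based on a validation objective; net, train_one_epoch, evaluate_loss, the data loaders, and the patience value are all hypothetical placeholders:

import copy

max_epochs, patience = 100, 5                       # hypothetical budget and patience
best_val, best_state, wait = float("inf"), None, 0

for epoch in range(max_epochs):
    train_one_epoch(net, train_loader)              # hypothetical training step
    val_loss = evaluate_loss(net, val_loader)       # hypothetical validation objective
    if val_loss < best_val:
        best_val, wait = val_loss, 0
        best_state = copy.deepcopy(net.state_dict())  # remember the best weights so far
    else:
        wait += 1
        if wait >= patience:                        # stop if no improvement for `patience` epochs
            break

net.load_state_dict(best_state)                     # restore the best model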

Dropout
Motivation

• A good model should be robust under modest changes in the input
• Training with input noise is equivalent to Tikhonov regularization
• Dropout: inject noise into the internal (hidden) layers
Apply Dropout

• Dropout is often applied to the output of hidden fully-connected layers:

  h  = σ(W₁x + b₁)
  h′ = dropout(h)
  o  = W₂h′ + b₂
  y  = softmax(o)
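
The same structure written with PyTorch's built-in nn.Dropout placed after the hidden activation; the layer sizes and drop probability p = 0.5 are hypothetical:

from torch import nn

net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # h = σ(W₁x + b₁)
    nn.Dropout(p=0.5),                # h′ = dropout(h)
    nn.Linear(256, 10),               # o = W₂h′ + b₂ (softmax is usually folded into the loss)
)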
Add Noise without Bias

• Add noise to x to get x′ such that, in expectation, nothing changes:

  E[x′] = x

• Dropout perturbs each element by

  x′ᵢ = 0 with probability p,   x′ᵢ = xᵢ / (1 − p) otherwise

  so that E[x′ᵢ] = p · 0 + (1 − p) · xᵢ / (1 − p) = xᵢ
Dropout using numpy?

Let’s do in Jupyter-notebook
Dropout using pytorch tensors
Code difference - numpy vs pytorch

# In PyTorch (X is a tensor)
assert 0 <= dropout <= 1
if dropout == 1:
    # torch.zeros_like takes the tensor itself, not its shape
    return torch.zeros_like(X)

# In NumPy (X is an ndarray; X.shape is a tuple)
if dropout == 1:
    # np.zeros takes the shape tuple
    return np.zeros(X.shape)
Dropout from scratch – using dropout_layer func
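
A minimal dropout_layer sketch consistent with the formula above (zero each element with probability p, scale the survivors by 1/(1 − p)), written with PyTorch tensors:

import torch

def dropout_layer(X, dropout):
    # Zero each element of X with probability `dropout`; scale the rest by 1/(1-dropout)
    assert 0 <= dropout <= 1
    if dropout == 1:
        return torch.zeros_like(X)            # everything is dropped
    if dropout == 0:
        return X                              # nothing is dropped
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)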
Dropout in Inference

• Regularization is only used during training

• At inference, the dropout layer acts as the identity:

  h′ = dropout(h) = h

• This guarantees deterministic results