
INTRODUCTION TO DEEP LEARNING (IT3320E)

4 - Training Neural Networks (Part 2)

Hung Son Nguyen

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY


SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

October 09, 2024


Agenda
1 TRAINING CNN - PART 1
Hyperparameters
Activation functions
Loss functions
Design of algorithms (Backpropagation in CNN)
Backpropagation in Convolution Layer
2 TRAINING CNN - PART 2
Weight Initialization In Deep Neural Networks
Regularization to CNN
Optimizer selection

Weight initialization
The optimization algorithm requires a starting point in the space of possible
weight values from which to begin the optimization process.
Weight initialization is a procedure that sets the weights of a neural network to
small random values, which define the starting point for the optimization
(learning or training) of the neural network model.
Each time a neural network is initialized with a different set of weights, the
optimization process starts from a different point and may therefore end at a
different final set of weights with different performance characteristics.

Weight Initialization

PAGE 301, DEEP LEARNING, 2016.


... training deep models is a sufficiently difficult task that most algorithms
are strongly affected by the choice of initialization. The initial point can
determine whether the algorithm converges at all, with some initial points
being so unstable that the algorithm encounters numerical difficulties
and fails altogether.

Traditional weight initialization

We cannot initialize all weights to the value 0.0, because initializing all the
weights with zeros leads the neurons to learn the same features during
training.
If we forward propagate an input (x1, x2) through a network with 2 hidden units, the
output of both hidden units will be relu(αx1 + αx2). Both hidden units then
have an identical influence on the cost, which leads to identical
gradients. Thus both neurons evolve symmetrically throughout training,
effectively preventing different neurons from learning different things.
Historically, weight initialization followed simple heuristics (illustrated in the sketch below), such as:
Small random values in the range [-0.3, 0.3]
Small random values in the range [0, 1]
Small random values in the range [-1, 1]
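To make the heuristic concrete, here is a minimal sketch (not part of the lecture material) of a small-random-value initializer in Python/NumPy; the layer sizes and the [-0.3, 0.3] range simply follow the first heuristic above.

import numpy as np

rng = np.random.default_rng(0)

def init_small_uniform(fan_in, fan_out, low=-0.3, high=0.3):
    # small random weights break the symmetry; biases start at zero
    W = rng.uniform(low, high, size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

W1, b1 = init_small_uniform(fan_in=2, fan_out=2)   # the 2-hidden-unit example above
print(W1)                                          # each hidden unit now starts from different weights
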
Illustration

We almost always initialize all the weights in the model to values drawn
randomly from a Gaussian or uniform distribution.
The choice of Gaussian or uniform distribution does not seem to matter very
much, but has not been exhaustively studied.
The scale of the initial distribution, however, does have a large effect on both
the outcome of the optimization procedure and on the ability of the network
to generalize.
see: https://www.deeplearning.ai/ai-notes/initialization/index.html
Despite breaking the symmetry, initializing the weights with values (i) too
small or (ii) too large leads respectively to (i) slow learning or (ii) divergence.
Choosing proper values for initialization is necessary for efficient training.

The problem of exploding or vanishing gradients

Consider a 9-layer neural network. Assuming, for illustration, identity activations,
the output activation is

ŷ = a[L] = W[L] W[L−1] W[L−2] . . . W[3] W[2] W[1] x

where L = 10 and W[1], W[2], . . . , W[L−1] are all matrices of size (2, 2). With this in
mind, and for illustrative purposes, if we assume W[1] = W[2] = · · · = W[L−1] = W,
the output prediction is ŷ = W[L] W^(L−1) x.

Case 1: A too-large initialization leads to exploding gradients

Consider the case where every weight is initialized slightly larger than the
identity matrix:

W[1] = W[2] = · · · = W[L−1] = [[1.5, 0], [0, 1.5]]   (i.e. 1.5 times the 2×2 identity)

This simplifies to ŷ = W[L] (1.5)^(L−1) x, and the values of a[l] increase
exponentially with l.
When these activations are used in backward propagation, this leads to the
exploding gradient problem.
That is, the gradients of the cost with respect to the parameters are too big.
This leads the cost to oscillate around its minimum value.

Case 2: A too-small initialization leads to vanishing gradients

Consider the case where every weight is initialized slightly smaller than the
identity matrix:

W[1] = W[2] = · · · = W[L−1] = [[0.5, 0], [0, 0.5]]   (i.e. 0.5 times the 2×2 identity)

This simplifies to ŷ = W[L] (0.5)^(L−1) x, and the values of a[l] decrease
exponentially with l.
When these activations are used in backward propagation, this leads to the
vanishing gradient problem.
That is, the gradients of the cost with respect to the parameters are too
small, leading to convergence of the cost before it has reached its minimum
value.

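A small numerical check of the two cases above can be written in a few lines of NumPy (assuming, as above, identity activations and L = 10):

import numpy as np

L = 10
x = np.ones(2)                       # some input (x1, x2)

for scale in (1.5, 0.5):
    W = scale * np.eye(2)            # W[1] = ... = W[L-1]
    a = x
    for _ in range(L - 1):
        a = W @ a                    # a[l] = W a[l-1] (identity activation)
    print(scale, a)                  # 1.5 -> about 38.4 (explodes), 0.5 -> about 0.002 (vanishes)
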
Modern weight initialization

To prevent the gradients of the network’s activations from vanishing or
exploding, we stick to the following rules of thumb:
The mean of the activations should be zero.
The variance of the activations should stay the same across every layer.
Nevertheless, more tailored approaches have been developed over the last
decade and have become the de facto standard, since they may result in a
slightly more effective optimization (model training) process.
These modern weight initialization techniques are divided based on the type
of activation function used in the nodes being initialized, such as
“Sigmoid and Tanh” and “ReLU.”

Xavier initialization for tanh activations

The recommended initialization is Xavier initialization (or one of its derived
methods). For every layer l:

W[l] ∼ N(µ = 0, σ² = 1/n[l−1])
b[l] = 0

In other words, all the weights of layer l are picked randomly from a normal
distribution with mean µ = 0 and variance σ² = 1/n[l−1], where n[l−1] is the
number of neurons in layer l−1. Biases are initialized with zeros.
Normalized Xavier weight initialization: in practice, machine learning
engineers using Xavier initialization would either initialize the weights as
N(0, 1/n[l−1]) or as N(0, 2/(n[l−1] + n[l])). The variance term of the latter
distribution is the harmonic mean of 1/n[l−1] and 1/n[l].
Xavier initialization works well with tanh activations.
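As an illustrative sketch (using the notation above, with n_in = n[l−1] and n_out = n[l]; the function name is only an assumption), Xavier initialization and its normalized variant could be implemented as:

import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out, normalized=False):
    # variance 1/n[l-1], or 2/(n[l-1] + n[l]) for the normalized variant
    var = 2.0 / (n_in + n_out) if normalized else 1.0 / n_in
    W = rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))
    b = np.zeros(n_out)              # biases are initialized with zeros
    return W, b

W, b = xavier_normal(n_in=256, n_out=128, normalized=True)
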
Weight initialization for ReLU activations

In He uniform weight initialization, the weights are drawn from a uniform
distribution as follows:

w_i ∼ U[ −√(6/n[l−1]), +√(6/n[l−1]) ]

In He normal initialization, the weights are drawn from a normal distribution
as follows:

w_i ∼ N(0, σ), where σ = √(2/n[l−1])

He normal initialization is suitable for layers where the ReLU activation
function is used.
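A corresponding sketch of the He uniform and He normal initializers (again using fan-in n[l−1]; function names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def he_uniform(n_in, n_out):
    limit = np.sqrt(6.0 / n_in)                           # bound sqrt(6 / n[l-1])
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out):
    sigma = np.sqrt(2.0 / n_in)                           # sigma = sqrt(2 / n[l-1])
    return rng.normal(0.0, sigma, size=(n_out, n_in))

W = he_normal(n_in=256, n_out=128)   # for a layer followed by ReLU
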
Regularization: avoiding overfitting
For CNN models, over-fitting is the central issue in obtaining well-behaved
generalization.
A model is said to be over-fitted when it performs especially well on the
training data but fails on the test data (unseen data), as explained further in
a later section.
An under-fitted model is the opposite; this occurs when the model does not
learn a sufficient amount from the training data.
The model is referred to as “just-fitted” if it performs well on both training
and testing data.

Regularization techniques
1 Dropout: a widely utilized technique for generalization. During each
training epoch, neurons are randomly dropped (see the sketch after this list).
The feature-selection power is distributed more equally across the whole group of
neurons, and the model is forced to learn different, independent features.
Training process: a dropped neuron takes part in neither forward nor
backward propagation.
Testing process: the full-scale network is used to perform prediction.
2 Drop-Weights: This method is highly similar to dropout. In each training
epoch, the connections between neurons (weights) are dropped rather than
dropping the neurons; this represents the only difference between
drop-weights and dropout.
3 Data Augmentation: the model is trained on a sizeable (artificially
expanded) amount of data. This is the easiest way to avoid over-fitting.
4 Batch Normalization: subtracting the mean and dividing by the standard
deviation normalizes the output of each layer. While this can be viewed as a
pre-processing step at each layer of the network, it can also be integrated
into the network itself as a (trainable) layer.
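Below is a minimal sketch of dropout using the common “inverted dropout” formulation (scaling by 1/p at training time so that the full network can be used unchanged at test time); the keep probability p = 0.8 is only an example value.

import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.8, training=True):
    # a: activations of one layer; p: probability of KEEPING a neuron
    if not training:
        return a                          # full-scale network at test time
    mask = rng.random(a.shape) < p        # randomly drop neurons
    return a * mask / p                   # rescale so the expected activation is unchanged

a = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout(a))                         # training: some entries zeroed, the rest scaled by 1/p
print(dropout(a, training=False))         # testing: unchanged
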
Advantages of batch normalization (BN)
BN can be employed to reduce the “internal covariate shift” of the activation
layers:
Internal covariate shift: the variation of the activation distribution in each
layer.
This shift becomes very large because of the continuous weight updates during
training, and it may be aggravated when the training samples are gathered
from numerous dissimilar sources (for example, day and night images).
Thus the model needs extra time to converge, which in turn increases the
time required for training.
To resolve this issue, a BN layer is inserted into the CNN architecture.
The advantages of using batch normalization are as follows:
It prevents the vanishing-gradient problem from arising.
It effectively compensates for poor weight initialization.
It significantly reduces the time required for network convergence.
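A minimal sketch of the batch-normalization forward step described above (the learnable scale γ and shift β are shown explicitly; variable names are illustrative):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, n_features) output of one layer for a mini-batch
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # subtract the mean, divide by the std
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.default_rng(0).normal(5.0, 3.0, size=(8, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))        # roughly 0 and 1 per feature
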
CNN learning process

The learning process involves two major issues:
the first is the selection of the learning algorithm (optimizer);
the second is the use of enhancements (such as AdaDelta, Adagrad, and
momentum) along with the learning algorithm to improve the result.
Minimizing a loss function defined over the learnable parameters
(e.g. weights, biases), i.e. the variation between the actual and the predicted
output, is the core purpose of all supervised learning algorithms.
Gradient-based learning techniques are the usual choice for training a CNN.
The network parameters are updated throughout all training epochs, and the
network searches for a locally optimal solution in every epoch in order to
minimize the error.

Gradient descent

We start with a guess x0 for a local minimum of F and consider the sequence
x0, x1, x2, . . . such that

xn+1 = xn − γn ∇F(xn),   n ≥ 0.

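As a toy illustration of this iteration, the following sketch runs gradient descent on F(x) = (x − 3)², which is only an example function, with a fixed step size γ:

def grad_F(x):
    return 2.0 * (x - 3.0)        # gradient of F(x) = (x - 3)^2

x = 0.0                           # starting guess x0
gamma = 0.1                       # fixed step size gamma_n
for _ in range(50):
    x = x - gamma * grad_F(x)     # x_{n+1} = x_n - gamma * grad F(x_n)
print(x)                          # converges towards the minimizer x = 3
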
Gradient Descent or gradient-based learning algorithm:
To minimize the training error, this algorithm repeatedly updates the
network parameters in every training epoch.
It computes the gradient (slope) of the objective function by applying a
first-order derivative with respect to the network parameters.
Each parameter is then updated in the direction opposite to the gradient to
reduce the error:

w_ij^t = w_ij^(t−1) − ∆w_ij^t,   where ∆w_ij^t = η · ∂E/∂w_ij

The parameter update is performed through network back-propagation, in which
the gradient at every neuron is back-propagated to all neurons in the
preceding layer.
The learning rate η is the step size of the parameter update. A training
epoch is one complete pass of the parameter updates over the whole training
dataset. Note that the learning rate, although it is a hyper-parameter, must
be chosen carefully so that it does not impair the learning process.

Gradient Descent or gradient-based learning algorithm:

Different variants of the gradient-based learning algorithm are available and
commonly employed; these include the following:

Batch Gradient Descent (BGD): the standard formulation computes the gradient over the
full training set X at each gradient descent step.
Stochastic Gradient Descent (SGD): picks a random instance from the training
set at every step and computes the gradient based only on that instance.
It is much faster than the batch version.
Its random nature makes it much less regular than Batch Gradient Descent.
A good advantage: when the cost function is irregular, the randomness can help
jump out of local minima.

Mini-Batch Gradient Descent: at each step, the gradients are computed on
small random sets of instances called mini-batches. It ends up walking a
bit closer to the minimum than SGD, but it may have a harder time
escaping local minima.
The advantage of this method comes from combining the advantages of both the
BGD and SGD techniques. Thus it has steady convergence, greater
computational efficiency, and better memory usage (see the sketch below).

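The three variants differ only in how many instances are used per update, which the following sketch makes explicit on a toy linear-regression problem (the data, learning rate, and epoch count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy training set
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                # noiseless targets

def train(X, y, batch_size, eta=0.05, epochs=100):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)              # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient of the squared error
            w -= eta * grad
    return w

print(train(X, y, batch_size=len(X)))   # Batch GD: one update per epoch
print(train(X, y, batch_size=1))        # SGD: one instance per update
print(train(X, y, batch_size=16))       # mini-batch GD
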
Enhanced techniques: Momentum
The following describes several enhancement techniques for gradient-based
learning algorithms (usually SGD), which can further improve the CNN
training process.
Momentum: this technique is applied to the objective function of a neural
network. It improves both the accuracy and the training speed by adding the
gradient computed at the preceding training step, weighted by a factor λ
(known as the momentum factor).
A plain gradient-based algorithm, however, can simply become stuck in a local
minimum rather than the global minimum; this is the main disadvantage of
gradient-based learning algorithms. Issues of this kind frequently occur when
the problem does not have a convex surface (solution space).
Momentum is used together with the learning algorithm to mitigate this issue,
and can be expressed mathematically as

∆w_ij^t = η · (∂E/∂w_ij) + λ · ∆w_ij^(t−1)

Momentum

The momentum factor is kept in the range 0 to 1; it increases the step size
of the weight updates in the direction of the minimum in order to minimize
the error.
When the momentum factor is very low, the model loses its ability to avoid
local minima.
By contrast, when the momentum factor is high, the model converges much
more rapidly.
If a high momentum factor is used together with a high learning rate, the
model may miss the global minimum by crossing over it.
However, when the gradient keeps changing direction throughout training, a
suitable value of the momentum factor (which is a hyper-parameter) smooths
out the variations of the weight updates.

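A minimal sketch of the momentum update above, applied to a toy quadratic loss (the loss function and the η and λ values are illustrative assumptions):

import numpy as np

def momentum_step(w, grad, dw_prev, eta=0.01, lam=0.9):
    dw = eta * grad + lam * dw_prev          # delta_w_t = eta * grad + lambda * delta_w_{t-1}
    return w - dw, dw                        # w_t = w_{t-1} - delta_w_t

w = np.zeros(2)
dw = np.zeros(2)
for _ in range(200):
    grad = 2 * (w - np.array([3.0, -1.0]))   # gradient of a toy quadratic loss
    w, dw = momentum_step(w, grad, dw)
print(w)                                     # approaches the minimizer [3, -1]
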
Adaptive Moment Estimation (Adam)
Adam is another widely used optimization technique (learning algorithm) and
represents the current trend in deep learning optimization.
Unlike second-order methods, which rely on the Hessian matrix of second
derivatives, Adam uses only first-order gradient information.
Adam is a learning strategy that has been designed specifically for training
deep neural networks.
Low memory requirements and computational efficiency are two advantages of
Adam.
The mechanism of Adam is to compute an adaptive learning rate for each
parameter of the model.
It integrates the pros of both Momentum and RMSprop: like RMSprop, it uses the
squared gradients to scale the learning rate, and like momentum, it uses a
moving average of the gradient.

w_ij^t = w_ij^(t−1) − (η / (√(Ê[δ²]^t) + ε)) · Ê[δ]^t

where Ê[δ]^t and Ê[δ²]^t are the (bias-corrected) moving averages of the
gradient and of the squared gradient, and ε is a small constant for numerical
stability.
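A sketch of the Adam update consistent with the formula above (the toy quadratic loss and the step size are illustrative assumptions; β1, β2, and ε follow the commonly used defaults):

import numpy as np

def adam_step(w, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # moving average of the gradient (momentum part)
    v = beta2 * v + (1 - beta2) * grad**2          # moving average of the squared gradient (RMSprop part)
    m_hat = m / (1 - beta1**t)                     # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return w, m, v

w = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    grad = 2 * (w - np.array([3.0, -1.0]))         # gradient of a toy quadratic loss
    w, m, v = adam_step(w, grad, m, v, t)
print(w)                                           # approaches [3, -1]
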
Improving the performance of a CNN

The most effective measures that may improve the performance of a CNN are:

Expand the dataset with data augmentation, or use transfer learning
(explained in later sections).
Increase the training time.
Increase the depth (or width) of the model.
Add regularization.
Tune the hyperparameters more extensively.

The End
Thank You!
