
INTRODUCTION TO DEEP LEARNING (IT3320E)

4 - Training Neural Networks (Part 2)

Hung Son Nguyen

HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY


SCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY

October 09, 2024


Agenda
1 TRAINING CNN - PART 1
Hyperparameters
Activation functions
Loss functions
Design of algorithms (Backpropagation in CNN)
Backpropagation in Convolution Layer
2 TRAINING CNN - PART 2
Weight Initialization In Deep Neural Networks
Regularization to CNN
Optimizer selection

Weight initialization
The optimization algorithm requires a starting point in the space of possible
weight values from which to begin the optimization process.
Weight initialization is a procedure that sets the weights of a neural network to
small random values, which define the starting point for the optimization
(learning or training) of the neural network model.
Each time a neural network is initialized with a different set of weights, the
optimization process starts from a different point and may therefore end at a
different final set of weights with different performance characteristics.

Weight Initialization

PAGE 301, DEEP LEARNING, 2016.


... training deep models is a sufficiently difficult task that most algorithms
are strongly affected by the choice of initialization. The initial point can
determine whether the algorithm converges at all, with some initial points
being so unstable that the algorithm encounters numerical difficulties
and fails altogether.

Traditional weight initialization

We cannot initialize all weights to the value 0.0, because initializing all the
weights with zeros leads the neurons to learn the same features during
training.
If we forward propagate an input (x1, x2) through a network with 2 hidden units, the
output of both hidden units will be relu(αx1 + αx2). Both hidden units then
have an identical influence on the cost, which leads to identical
gradients. Thus both neurons evolve symmetrically throughout training,
effectively preventing different neurons from learning different things.
Historically, weight initialization followed simple heuristics (illustrated in the sketch below), such as:
Small random values in the range [-0.3, 0.3]
Small random values in the range [0, 1]
Small random values in the range [-1, 1]
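To make the heuristic concrete, here is a minimal sketch (not part of the lecture material) of a small-random-value initializer in Python/NumPy; the layer sizes and the [-0.3, 0.3] range simply follow the first heuristic above.

import numpy as np

rng = np.random.default_rng(0)

def init_small_uniform(fan_in, fan_out, low=-0.3, high=0.3):
    # small random weights break the symmetry; biases start at zero
    W = rng.uniform(low, high, size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

W1, b1 = init_small_uniform(fan_in=2, fan_out=2)   # the 2-hidden-unit example above
print(W1)                                          # each hidden unit now starts from different weights
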
Illustration

We almost always initialize all the weights in the model to values drawn
randomly from a Gaussian or uniform distribution.
The choice of Gaussian or uniform distribution does not seem to matter very
much, but has not been exhaustively studied.
The scale of the initial distribution, however, does have a large effect on both
the outcome of the optimization procedure and on the ability of the network
to generalize.
see: https://www.deeplearning.ai/ai-notes/initialization/index.html
Despite breaking the symmetry, initializing the weights with values (i) too
small or (ii) too large leads respectively to (i) slow learning or (ii) divergence.
Choosing proper values for initialization is necessary for efficient training.

The problem of exploding or vanishing gradients

Consider a 9-layer neural network. Assuming, for illustration, identity activations,
the output activation is

ŷ = a[L] = W[L] W[L−1] W[L−2] . . . W[3] W[2] W[1] x

where L = 10 and W[1], W[2], . . . , W[L−1] are all matrices of size (2, 2). With this in
mind, and for illustrative purposes, if we assume W[1] = W[2] = · · · = W[L−1] = W,
the output prediction is ŷ = W[L] W^(L−1) x.

Case 1: A too-large initialization leads to exploding gradients

Consider the case where every weight is initialized slightly larger than the
identity matrix:

W[1] = W[2] = · · · = W[L−1] = [[1.5, 0], [0, 1.5]]   (i.e. 1.5 times the 2×2 identity)

This simplifies to ŷ = W[L] (1.5)^(L−1) x, and the values of a[l] increase
exponentially with l.
When these activations are used in backward propagation, this leads to the
exploding gradient problem.
That is, the gradients of the cost with respect to the parameters are too big.
This leads the cost to oscillate around its minimum value.

Case 2: A too-small initialization leads to vanishing gradients

Consider the case where every weight is initialized slightly smaller than the
identity matrix:

W[1] = W[2] = · · · = W[L−1] = [[0.5, 0], [0, 0.5]]   (i.e. 0.5 times the 2×2 identity)

This simplifies to ŷ = W[L] (0.5)^(L−1) x, and the values of a[l] decrease
exponentially with l.
When these activations are used in backward propagation, this leads to the
vanishing gradient problem.
That is, the gradients of the cost with respect to the parameters are too
small, leading to convergence of the cost before it has reached its minimum
value.

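A small numerical check of the two cases above can be written in a few lines of NumPy (assuming, as above, identity activations and L = 10):

import numpy as np

L = 10
x = np.ones(2)                       # some input (x1, x2)

for scale in (1.5, 0.5):
    W = scale * np.eye(2)            # W[1] = ... = W[L-1]
    a = x
    for _ in range(L - 1):
        a = W @ a                    # a[l] = W a[l-1] (identity activation)
    print(scale, a)                  # 1.5 -> about 38.4 (explodes), 0.5 -> about 0.002 (vanishes)
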
Modern weight initialization

To prevent the gradients of the network’s activations from vanishing or
exploding, we stick to the following rules of thumb:
The mean of the activations should be zero.
The variance of the activations should stay the same across every layer.
Nevertheless, more tailored approaches have been developed over the last
decade and have become the de facto standard, since they may result in a
slightly more effective optimization (model training) process.
These modern weight initialization techniques are divided based on the type
of activation function used in the nodes being initialized, such as
“Sigmoid and Tanh” and “ReLU.”

Xavier initialization for tanh activations

The recommended initialization is Xavier initialization (or one of its derived
methods). For every layer l:

W[l] ∼ N(µ = 0, σ² = 1/n[l−1])
b[l] = 0

In other words, all the weights of layer l are picked randomly from a normal
distribution with mean µ = 0 and variance σ² = 1/n[l−1], where n[l−1] is the
number of neurons in layer l−1. Biases are initialized with zeros.
Normalized Xavier weight initialization: in practice, machine learning
engineers using Xavier initialization would either initialize the weights as
N(0, 1/n[l−1]) or as N(0, 2/(n[l−1] + n[l])). The variance term of the latter
distribution is the harmonic mean of 1/n[l−1] and 1/n[l].
Xavier initialization works well with tanh activations.
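As an illustrative sketch (using the notation above, with n_in = n[l−1] and n_out = n[l]; the function name is only an assumption), Xavier initialization and its normalized variant could be implemented as:

import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(n_in, n_out, normalized=False):
    # variance 1/n[l-1], or 2/(n[l-1] + n[l]) for the normalized variant
    var = 2.0 / (n_in + n_out) if normalized else 1.0 / n_in
    W = rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))
    b = np.zeros(n_out)              # biases are initialized with zeros
    return W, b

W, b = xavier_normal(n_in=256, n_out=128, normalized=True)
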
Weight initialization for ReLU activations

In He uniform weight initialization, the weights are drawn from a uniform
distribution as follows:

w_i ∼ U[ −√(6/n[l−1]), +√(6/n[l−1]) ]

In He normal initialization, the weights are drawn from a normal distribution
as follows:

w_i ∼ N(0, σ), where σ = √(2/n[l−1])

He normal initialization is suitable for layers where the ReLU activation
function is used.
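A corresponding sketch of the He uniform and He normal initializers (again using fan-in n[l−1]; function names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

def he_uniform(n_in, n_out):
    limit = np.sqrt(6.0 / n_in)                           # bound sqrt(6 / n[l-1])
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out):
    sigma = np.sqrt(2.0 / n_in)                           # sigma = sqrt(2 / n[l-1])
    return rng.normal(0.0, sigma, size=(n_out, n_in))

W = he_normal(n_in=256, n_out=128)   # for a layer followed by ReLU
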
Regularization: avoiding overfitting
For CNN models, over-fitting is the central issue in obtaining well-behaved
generalization.
A model is said to be over-fitted when it performs especially well on the
training data but fails on the test data (unseen data), as explained further in
a later section.
An under-fitted model is the opposite; this occurs when the model does not
learn a sufficient amount from the training data.
The model is referred to as “just-fitted” if it performs well on both training
and testing data.

Regularization techniques
1 Dropout: a widely utilized technique for generalization. During each
training epoch, neurons are randomly dropped (see the sketch after this list).
The feature-selection power is distributed more equally across the whole group of
neurons, and the model is forced to learn different, independent features.
Training process: a dropped neuron takes part in neither forward nor
backward propagation.
Testing process: the full-scale network is used to perform prediction.
2 Drop-Weights: This method is highly similar to dropout. In each training
epoch, the connections between neurons (weights) are dropped rather than
dropping the neurons; this represents the only difference between
drop-weights and dropout.
3 Data Augmentation: the model is trained on a sizeable (artificially
expanded) amount of data. This is the easiest way to avoid over-fitting.
4 Batch Normalization: subtracting the mean and dividing by the standard
deviation normalizes the output of each layer. While this can be viewed as a
pre-processing step at each layer of the network, it can also be integrated
into the network itself as a (trainable) layer.
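Below is a minimal sketch of dropout using the common “inverted dropout” formulation (scaling by 1/p at training time so that the full network can be used unchanged at test time); the keep probability p = 0.8 is only an example value.

import numpy as np

rng = np.random.default_rng(0)

def dropout(a, p=0.8, training=True):
    # a: activations of one layer; p: probability of KEEPING a neuron
    if not training:
        return a                          # full-scale network at test time
    mask = rng.random(a.shape) < p        # randomly drop neurons
    return a * mask / p                   # rescale so the expected activation is unchanged

a = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout(a))                         # training: some entries zeroed, the rest scaled by 1/p
print(dropout(a, training=False))         # testing: unchanged
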
Advantages of batch normalization (BN)
BN can be employed to reduce the “internal covariate shift” of the activation
layers:
Internal covariate shift: the variation of the activation distribution in each
layer.
This shift becomes very large because of the continuous weight updates during
training, and it may be aggravated when the training samples are gathered
from numerous dissimilar sources (for example, day and night images).
Thus the model needs extra time to converge, which in turn increases the
time required for training.
To resolve this issue, a BN layer is inserted into the CNN architecture.
The advantages of using batch normalization are as follows:
It prevents the vanishing-gradient problem from arising.
It effectively compensates for poor weight initialization.
It significantly reduces the time required for network convergence.
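A minimal sketch of the batch-normalization forward step described above (the learnable scale γ and shift β are shown explicitly; variable names are illustrative):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # x: (batch_size, n_features) output of one layer for a mini-batch
    mu = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                     # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # subtract the mean, divide by the std
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.default_rng(0).normal(5.0, 3.0, size=(8, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))        # roughly 0 and 1 per feature
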
CNN learning process

The learning process involves two major issues:
the first is the selection of the learning algorithm (optimizer);
the second is the use of enhancements (such as AdaDelta, Adagrad, and
momentum) along with the learning algorithm to improve the result.
Minimizing a loss function defined over the learnable parameters
(e.g. weights, biases), i.e. the variation between the actual and the predicted
output, is the core purpose of all supervised learning algorithms.
Gradient-based learning techniques are the usual choice for training a CNN.
The network parameters are updated throughout all training epochs, and the
network searches for a locally optimal solution in every epoch in order to
minimize the error.

Gradient descent

We start with a guess x0 for a local minimum of F and consider the sequence
x0, x1, x2, . . . such that

xn+1 = xn − γn ∇F(xn),   n ≥ 0.

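As a toy illustration of this iteration, the following sketch runs gradient descent on F(x) = (x − 3)², which is only an example function, with a fixed step size γ:

def grad_F(x):
    return 2.0 * (x - 3.0)        # gradient of F(x) = (x - 3)^2

x = 0.0                           # starting guess x0
gamma = 0.1                       # fixed step size gamma_n
for _ in range(50):
    x = x - gamma * grad_F(x)     # x_{n+1} = x_n - gamma * grad F(x_n)
print(x)                          # converges towards the minimizer x = 3
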
Gradient Descent or gradient-based learning algorithm:
To minimize the training error, this algorithm repeatedly updates the
network parameters in every training epoch.
It computes the gradient (slope) of the objective function by applying a
first-order derivative with respect to the network parameters.
Each parameter is then updated in the direction opposite to the gradient to
reduce the error:

w_ij^t = w_ij^(t−1) − ∆w_ij^t,   where ∆w_ij^t = η · ∂E/∂w_ij

The parameter update is performed through network back-propagation, in which
the gradient at every neuron is back-propagated to all neurons in the
preceding layer.
The learning rate η is the step size of the parameter update. A training
epoch is one complete pass of the parameter updates over the whole training
dataset. Note that the learning rate, although it is a hyper-parameter, must
be chosen carefully so that it does not impair the learning process.

Gradient Descent or gradient-based learning algorithm:

Different variants of the gradient-based learning algorithm are available and
commonly employed; these include the following:

Batch Gradient Descent (BGD): the standard formulation computes the gradient over the
full training set X at each gradient descent step.
Stochastic Gradient Descent (SGD): picks a random instance from the training
set at every step and computes the gradient based only on that instance.
It is much faster than the batch version.
Its random nature makes it much less regular than Batch Gradient Descent.
A good advantage: when the cost function is irregular, the randomness can help
jump out of local minima.

Mini-Batch Gradient Descent: at each step, the gradients are computed on
small random sets of instances called mini-batches. It ends up walking a
bit closer to the minimum than SGD, but it may have a harder time
escaping local minima.
The advantage of this method comes from combining the advantages of both the
BGD and SGD techniques. Thus it has steady convergence, greater
computational efficiency, and better memory usage (see the sketch below).

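The three variants differ only in how many instances are used per update, which the following sketch makes explicit on a toy linear-regression problem (the data, learning rate, and epoch count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy training set
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                                # noiseless targets

def train(X, y, batch_size, eta=0.05, epochs=100):
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)              # shuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # gradient of the squared error
            w -= eta * grad
    return w

print(train(X, y, batch_size=len(X)))   # Batch GD: one update per epoch
print(train(X, y, batch_size=1))        # SGD: one instance per update
print(train(X, y, batch_size=16))       # mini-batch GD
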
Enhanced techniques: Momentum
The following describes several enhancement techniques for gradient-based
learning algorithms (usually SGD), which can further improve the CNN
training process.
Momentum: this technique is applied to the objective function of a neural
network. It improves both the accuracy and the training speed by adding the
gradient computed at the preceding training step, weighted by a factor λ
(known as the momentum factor).
A plain gradient-based algorithm, however, can simply become stuck in a local
minimum rather than the global minimum; this is the main disadvantage of
gradient-based learning algorithms. Issues of this kind frequently occur when
the problem does not have a convex surface (solution space).
Momentum is used together with the learning algorithm to mitigate this issue,
and can be expressed mathematically as

∆w_ij^t = η · (∂E/∂w_ij) + λ · ∆w_ij^(t−1)

Momentum

The momentum factor is kept in the range 0 to 1; it increases the step size
of the weight updates in the direction of the minimum in order to minimize
the error.
When the momentum factor is very low, the model loses its ability to avoid
local minima.
By contrast, when the momentum factor is high, the model converges much
more rapidly.
If a high momentum factor is used together with a high learning rate, the
model may miss the global minimum by crossing over it.
However, when the gradient keeps changing direction throughout training, a
suitable value of the momentum factor (which is a hyper-parameter) smooths
out the variations of the weight updates.

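A minimal sketch of the momentum update above, applied to a toy quadratic loss (the loss function and the η and λ values are illustrative assumptions):

import numpy as np

def momentum_step(w, grad, dw_prev, eta=0.01, lam=0.9):
    dw = eta * grad + lam * dw_prev          # delta_w_t = eta * grad + lambda * delta_w_{t-1}
    return w - dw, dw                        # w_t = w_{t-1} - delta_w_t

w = np.zeros(2)
dw = np.zeros(2)
for _ in range(200):
    grad = 2 * (w - np.array([3.0, -1.0]))   # gradient of a toy quadratic loss
    w, dw = momentum_step(w, grad, dw)
print(w)                                     # approaches the minimizer [3, -1]
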
Adaptive Moment Estimation (Adam)
Adam is another widely used optimization technique (learning algorithm) and
represents the current trend in deep learning optimization.
Unlike second-order methods, which rely on the Hessian matrix of second
derivatives, Adam uses only first-order gradient information.
Adam is a learning strategy that has been designed specifically for training
deep neural networks.
Low memory requirements and computational efficiency are two advantages of
Adam.
The mechanism of Adam is to compute an adaptive learning rate for each
parameter of the model.
It integrates the pros of both Momentum and RMSprop: like RMSprop, it uses the
squared gradients to scale the learning rate, and like momentum, it uses a
moving average of the gradient.

w_ij^t = w_ij^(t−1) − (η / (√(Ê[δ²]^t) + ε)) · Ê[δ]^t

where Ê[δ]^t and Ê[δ²]^t are the (bias-corrected) moving averages of the
gradient and of the squared gradient, and ε is a small constant for numerical
stability.
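A sketch of the Adam update consistent with the formula above (the toy quadratic loss and the step size are illustrative assumptions; β1, β2, and ε follow the commonly used defaults):

import numpy as np

def adam_step(w, grad, m, v, t, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad             # moving average of the gradient (momentum part)
    v = beta2 * v + (1 - beta2) * grad**2          # moving average of the squared gradient (RMSprop part)
    m_hat = m / (1 - beta1**t)                     # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
    return w, m, v

w = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 1001):
    grad = 2 * (w - np.array([3.0, -1.0]))         # gradient of a toy quadratic loss
    w, m, v = adam_step(w, grad, m, v, t)
print(w)                                           # approaches [3, -1]
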
Improving the performance of a CNN

The most effective measures that may improve the performance of a CNN are:

Expand the dataset with data augmentation, or use transfer learning
(explained in later sections).
Increase the training time.
Increase the depth (or width) of the model.
Add regularization.
Tune the hyperparameters more extensively.

The End
Thank You!
