Intro DL 04
Outline
1 TRAINING CNN - PART 1
Hyperparameters
Activation functions
Loss functions
Design of algorithms (Backpropagation in CNN)
Backpropagation in Convolution Layer
2 TRAINING CNN - PART 2
Weight Initialization In Deep Neural Networks
Regularization to CNN
Optimizer selection
Weight initialization
The optimization algorithm requires a starting point in the space of possible weight values from which to begin the optimization process.
Weight initialization is the procedure that sets the weights of a neural network to small random values, defining the starting point for the optimization (learning or training) of the model.
Each time a neural network is initialized with a different set of weights, the optimization starts from a different point and can potentially converge to a different final set of weights with different performance characteristics.
Traditional weight initialization
We cannot initialize all the weights to 0.0, because initializing all the weights with zeros leads the neurons to learn the same features during training.
If we forward-propagate an input (x_1, x_2) through a network with 2 hidden units, the output of both hidden units will be relu(αx_1 + αx_2). Both hidden units therefore have an identical influence on the cost, which leads to identical gradients. Thus, both neurons evolve symmetrically throughout training, effectively preventing different neurons from learning different things.
Historically, weight initialization has followed simple heuristics such as the following (see the sketch after this list):
Small random values in the range [-0.3, 0.3]
Small random values in the range [0, 1]
Small random values in the range [-1, 1]
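These heuristics translate directly into code. A minimal NumPy sketch of the three ranges above (the layer sizes are hypothetical and not taken from the slides):

```python
import numpy as np

# A minimal sketch of the historical heuristics listed above.
# The layer sizes (fan_in, fan_out) are hypothetical.
rng = np.random.default_rng(0)
fan_in, fan_out = 64, 32

W_a = rng.uniform(-0.3, 0.3, size=(fan_out, fan_in))  # small random values in [-0.3, 0.3]
W_b = rng.uniform(0.0, 1.0, size=(fan_out, fan_in))   # small random values in [0, 1]
W_c = rng.uniform(-1.0, 1.0, size=(fan_out, fan_in))  # small random values in [-1, 1]
```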
Illustration
We almost always initialize all the weights in the model to values drawn
randomly from a Gaussian or uniform distribution.
The choice of Gaussian or uniform distribution does not seem to matter very
much, but has not been exhaustively studied.
The scale of the initial distribution, however, does have a large effect on both
the outcome of the optimization procedure and on the ability of the network
to generalize.
See: https://www.deeplearning.ai/ai-notes/initialization/index.html
Even when random initialization breaks the symmetry, weights that are (i) too small or (ii) too large lead respectively to (i) slow learning or (ii) divergence. Choosing proper initialization values is therefore necessary for efficient training.
The problem of exploding or vanishing gradients
Consider an L-layer network with identity (linear) activations, so that the output prediction is ŷ = W[L] W[L−1] · · · W[2] W[1] x, where L = 10 and W[1], W[2], ..., W[L−1] are all matrices of size (2, 2). With this in mind, and for illustrative purposes, if we assume W[1] = W[2] = · · · = W[L−1] = W, the output prediction becomes ŷ = W[L] W^{L−1} x.
Case 1: A too-large initialization leads to exploding gradients
Consider the case where every weight is initialized slightly larger than the identity matrix:
W[1] = W[2] = · · · = W[L−1] = [[1.5, 0], [0, 1.5]]
Case 2: A too-small initialization leads to vanishing gradients
Consider the case where every weight is initialized slightly smaller than the identity matrix:
W[1] = W[2] = · · · = W[L−1] = [[0.5, 0], [0, 0.5]]
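The effect is easy to check numerically. The following minimal NumPy sketch (an illustration, not part of the slides) propagates an input through L − 1 identical linear layers and shows the activations exploding with the 1.5·I weights and vanishing with the 0.5·I weights:

```python
import numpy as np

def propagate(W, x, depth):
    """Apply the same linear layer `depth` times: returns W^depth @ x."""
    a = x
    for _ in range(depth):
        a = W @ a
    return a

x = np.array([1.0, 1.0])
L = 10

W_large = 1.5 * np.eye(2)  # slightly larger than the identity matrix
W_small = 0.5 * np.eye(2)  # slightly smaller than the identity matrix

print(propagate(W_large, x, L - 1))  # entries grow like 1.5**9 ≈ 38.4 (explodes)
print(propagate(W_small, x, L - 1))  # entries shrink like 0.5**9 ≈ 0.002 (vanishes)
```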
Xavier initialization for tanh activations
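The body of this slide (a figure) is not reproduced here. As a reminder, Xavier (Glorot) initialization draws weights from a zero-mean distribution whose variance is tied to the layer's fan-in (and, in the original formulation, also the fan-out). A minimal sketch, assuming the common Var = 1/fan_in variant often recommended for tanh activations:

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Xavier/Glorot-style initialization: zero-mean Gaussian with variance 1/fan_in,
    intended to keep the variance of tanh pre-activations roughly constant across layers."""
    return rng.normal(loc=0.0, scale=np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))

W1 = xavier_init(fan_in=784, fan_out=256)  # hypothetical first dense layer
```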
Regularization: avoiding overfitting
For CNN models, over-fitting is the central issue in obtaining well-behaved generalization.
A model is said to be over-fitted when it performs especially well on the training data but fails on test (unseen) data.
An under-fitted model is the opposite; this case occurs when the model does not learn a sufficient amount from the training data.
A model is referred to as "just-fitted" if it performs well on both the training and the test data.
Regularization techniques
1 Dropout: a widely used technique for improving generalization. During each training epoch, neurons are randomly dropped (a minimal sketch follows this list).
This distributes the feature-selection power evenly across the whole group of neurons and forces the model to learn different, independent features.
Training process: a dropped neuron takes no part in forward-propagation or back-propagation.
Testing process: the full-scale network is used to perform the prediction.
2 Drop-Weights: this method is highly similar to dropout. In each training epoch, the connections between neurons (the weights) are dropped rather than the neurons themselves; this is the only difference between drop-weights and dropout.
3 Data Augmentation: trains the model on a sizeable (artificially expanded) amount of data. This is the easiest way to avoid over-fitting.
4 Batch Normalization: normalizes the output of each layer by subtracting the mean and dividing by the standard deviation. While it is possible to consider this as a pre-processing task at each layer of the network, it can also be integrated into the network itself, since the normalization is differentiable.
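A minimal sketch of dropout as described in item 1, assuming the common "inverted dropout" formulation (scaling the surviving activations during training so that nothing changes at test time):

```python
import numpy as np

def dropout_forward(a, p_drop=0.5, training=True, rng=np.random.default_rng(0)):
    """Inverted dropout: during training, randomly zero activations and rescale the
    survivors; at test time the full-scale network is used unchanged."""
    if not training:
        return a                            # testing: no neurons are dropped
    mask = rng.random(a.shape) >= p_drop    # dropped neurons take no part in forward/backward pass
    return a * mask / (1.0 - p_drop)        # rescale so the expected activation is unchanged
```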
Advantages of batch normalization (BN)
BN can be employed to reduce the "internal covariate shift" of the activation layers:
Internal covariate shift: the change in the distribution of each layer's activations during training.
This shift can become very large because of the continuous weight updates during training, and it is made worse when the training samples are gathered from numerous dissimilar sources (for example, day and night images).
As a result, the model needs extra time to converge, and the training time increases.
To resolve this issue, a BN layer is applied in the CNN architecture.
The advantages of utilizing batch normalization are as follows:
It prevents the problem of vanishing gradients from arising.
It can effectively compensate for poor weight initialization.
It significantly reduces the time required for network convergence.
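As a concrete illustration of "subtracting the mean and dividing by the standard deviation", here is a minimal sketch of the BN forward pass for a fully connected layer; the learnable scale (gamma) and shift (beta) and the small epsilon are standard ingredients of BN assumed here, not taken from the slides:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch (subtract the mean, divide by the
    standard deviation), then apply the learnable scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalized activations
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(size=(32, 8))  # a batch of 32 samples, 8 features
out = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
```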
CNN learning Process
Gradient descent
We start with a guess x_0 for a local minimum of F and consider the sequence x_0, x_1, x_2, ... such that
x_{n+1} = x_n − γ_n ∇F(x_n),   n ≥ 0.
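A minimal sketch of this iteration; the quadratic objective and the step size are illustrative assumptions, not part of the slides:

```python
import numpy as np

def gradient_descent(grad_F, x0, lr=0.1, n_steps=100):
    """Iterate x_{n+1} = x_n - lr * grad_F(x_n) starting from the guess x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad_F(x)
    return x

# Example: F(x) = (x - 3)^2 has gradient 2 * (x - 3) and a minimum at x = 3.
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=[0.0])
```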
Gradient Descent or Gradient-based learning algorithm:
To minimize the training error, this algorithm repeatedly updates the network parameters during every training epoch.
It computes the gradient (slope) of the objective function as a first-order derivative with respect to the network parameters.
The parameters are then updated in the direction opposite to the gradient in order to reduce the error.
w_ij^t = w_ij^{t−1} − ∆w_ij^t,   where ∆w_ij^t = η ∗ ∂E/∂w_ij
The parameter update is performed through network back-propagation, in which the gradient at every neuron is back-propagated to all neurons in the preceding layer.
The learning rate η is the step size of the parameter update. A training epoch is one complete pass of the parameter updates over the entire training dataset. Note that the learning rate, although a hyper-parameter, must be chosen carefully so that it does not impair the learning process.
Gradient Descent or Gradient-based learning algorithm:
Batch Gradient Descent (BGD): the standard formulation computes the gradient over the full training set X at each gradient-descent step.
Stochastic Gradient Descent (SGD): picks a random instance from the training set at every step and computes the gradient based only on that instance.
It is much faster than the batch version.
Because of its random nature, it is much less regular than batch gradient descent.
A useful advantage: when the cost function is irregular, the randomness can help the algorithm jump out of local minima.
Mini-Batch Gradient Descent: at each step, the gradients are computed on small random sets of instances called mini-batches. It ends up walking a bit closer to the minimum than SGD, but it may find it harder to escape local minima.
Its advantage comes from combining the strengths of both the BGD and SGD techniques: steady convergence, greater computational efficiency, and better memory effectiveness.
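A minimal sketch of one epoch of mini-batch gradient descent; the gradient function, batch size, and learning rate are illustrative placeholders, not values from the slides:

```python
import numpy as np

def minibatch_sgd_epoch(w, X, y, grad_fn, lr=0.01, batch_size=32,
                        rng=np.random.default_rng(0)):
    """One epoch of mini-batch gradient descent: shuffle the data, then update the
    parameters w on each small random mini-batch."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        w = w - lr * grad_fn(w, X[batch], y[batch])  # gradient of the loss on this mini-batch
    return w
```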
Enhanced techniques: Momentum
The following describes several enhancement techniques for gradient-based learning algorithms (usually applied to SGD) that further strengthen the CNN training process.
Momentum: this technique is applied to the objective function of the neural network. It improves both the accuracy and the training speed by adding the gradient computed at the preceding training step, weighted by a factor λ (known as the momentum factor).
A plain gradient-based learning algorithm can become stuck in a local minimum rather than reaching the global minimum; this is its main disadvantage, and it frequently occurs when the problem does not have a convex surface (solution space).
Momentum is used together with the learning algorithm to alleviate this issue, and can be expressed mathematically as
∆w_ij^t = (η ∗ ∂E/∂w_ij) + (λ ∗ ∆w_ij^{t−1})
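A minimal sketch of this update rule, keeping the previous Δw as a running term; the symbols follow the equation above, and the default values are assumptions:

```python
def momentum_step(w, delta_w_prev, grad, lr=0.01, momentum=0.9):
    """Momentum update following the equation above:
    Δw_t = lr * grad + momentum * Δw_{t-1}, then w_t = w_{t-1} - Δw_t."""
    delta_w = lr * grad + momentum * delta_w_prev
    return w - delta_w, delta_w
```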
Adaptive Moment Estimation (Adam)
It is another widely used optimization technique (learning algorithm); Adam represents one of the latest trends in deep learning optimization.
Unlike second-order methods, which rely on the Hessian matrix of second derivatives, Adam uses only first-order gradient information.
Adam is a learning strategy that has been designed specifically for training deep neural networks.
Lower memory requirements and lower computational cost are two advantages of Adam.
The mechanism of Adam is to compute an adaptive learning rate for each parameter in the model.
It integrates the advantages of both Momentum and RMSprop: like RMSprop, it uses the squared gradients to scale the learning rate, and, like momentum, it uses a moving average of the gradient.
w_ij^t = w_ij^{t−1} − ( η / √( Ê[δ²]^t + ε ) ) ∗ Ê[δ]^t
where Ê[δ]^t is the moving average of the gradient and Ê[δ²]^t is the moving average of the squared gradient.
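A minimal sketch of an Adam-style update combining a moving average of the gradient (the momentum part) with a moving average of the squared gradient (the RMSprop part); the decay rates β1, β2 and ε are the commonly used defaults, assumed here rather than taken from the slides:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for parameters w given the current gradient."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of the gradient (momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of the squared gradient (RMSprop part)
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return w, m, v
```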
Improving the performance of a CNN
The most effective solutions that may improve the performance of a CNN are:
The End
Thank You!