
SPC 2408: Artificial Neural Networks

Lesson 4 Training NNs: Learning the Weights

CS 502, Fall 2020

1. Loss Functions
 Loss functions measure the difference between the predicted and actual outputs, guiding the optimization process to improve the model's performance.

1. Mean Squared Error (MSE)
 Definition: Measures the average squared difference between predicted and actual values.
 Usage:
  Commonly used for regression tasks.
  Penalizes larger errors more significantly.
 Formula: $\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2$ (a short numerical sketch follows below)
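As an added illustration (not part of the original slides), here is a minimal NumPy sketch of the MSE formula above; the array names y_true and y_pred are placeholders:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Larger errors are penalized quadratically
print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.5]))  # (0.01 + 0.01 + 0.25) / 3 = 0.09
```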

2. Cross-Entropy Loss
 Definition: Measures the difference between two probability distributions, typically used for classification tasks.
 Formula (Binary Classification): $\mathcal{L}=-\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat{y}_i+(1-y_i)\log\left(1-\hat{y}_i\right)\right]$ (a short numerical sketch follows below)
 Advantages:
o Works well for probabilities.
o Sensitive to confidence in predictions.
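An added minimal NumPy sketch of the binary cross-entropy formula above (not from the slides); clipping the predictions with a small eps is a common numerical-stability trick and an assumption here:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy averaged over n examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident correct predictions give a small loss; confident wrong ones are penalized heavily
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # ~2.303
```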

2. Forward Propagation
 Forward propagation involves computing the output of the neural network given an input, layer by layer, using the weights and biases of the network.
 Steps:
1. Input Layer:
 Pass input data $x$ to the first layer.

2. Hidden Layers:
 Compute the activations for each layer: $z^{(l)}=W^{(l)}a^{(l-1)}+b^{(l)}$, $a^{(l)}=f\left(z^{(l)}\right)$

3. Output Layer:
 Compute the final output using the same process.
 Example: see the forward-propagation sketch below.
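As an added illustration (not from the original slides), a minimal NumPy sketch of forward propagation through one hidden layer; the layer sizes, the sigmoid activation, and the toy input are all assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Assumed toy dimensions: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def forward(x):
    """Forward propagation: compute activations layer by layer."""
    z1 = W1 @ x + b1          # hidden pre-activation
    a1 = sigmoid(z1)          # hidden activation
    z2 = W2 @ a1 + b2         # output pre-activation
    y_hat = sigmoid(z2)       # network output
    return z1, a1, z2, y_hat

x = np.array([0.5, -1.2, 3.0])
print(forward(x)[-1])         # the predicted output
```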

3. Backpropagation Algorithm
 Backpropagation calculates the gradient of the loss function with respect to each weight and bias, enabling the optimization process.
 Steps:
1. Compute Loss:
 Calculate the difference between predicted and actual output using a loss function.

2. Output Layer Gradients:
 Compute the gradient of the loss with respect to the output, $\delta_{\text{output}}$, and propagate it backward layer by layer (a worked sketch follows below).
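To make the gradient computation concrete, here is an added minimal sketch (not from the original slides) of backpropagation for a one-hidden-layer network with sigmoid activations and an MSE loss; the layer sizes, weight names, and toy inputs are all assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """Return gradients of L = 0.5 * ||y_hat - y||^2 w.r.t. all weights and biases."""
    # Forward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y_hat = sigmoid(z2)

    # Output-layer error: dL/dz2
    delta_output = (y_hat - y) * y_hat * (1 - y_hat)
    # Hidden-layer error, propagated backward through W2
    delta_hidden = (W2.T @ delta_output) * a1 * (1 - a1)

    # Gradients with respect to the parameters
    dW2, db2 = np.outer(delta_output, a1), delta_output
    dW1, db1 = np.outer(delta_hidden, x), delta_hidden
    return dW1, db1, dW2, db2

# Assumed toy network: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = np.array([0.5, -1.2, 3.0]), np.array([1.0, 0.0])
print(backprop(x, y, W1, b1, W2, b2)[0])  # gradient of the loss w.r.t. W1
```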

Training NNs
 The network parameters $\theta$ include the weight matrices and bias vectors from all layers:
$\theta=\left\{W_{1},b_{1},W_{2},b_{2},\cdots,W_{L},b_{L}\right\}$
 Training a model to learn a set of parameters that are optimal (according to a criterion) is one of the greatest challenges in ML

[Figure: a 16 x 16 = 256-pixel digit image (inputs x1 ... x256) is fed through the network; a softmax output layer produces scores y1 ... y10, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), ..., y10 = 0.2 ("is 0").]

Training NNs
 Data preprocessing helps convergence during training (a short sketch follows below)
 Mean subtraction
  Zero-centered data
  Subtract the mean for each individual data dimension (feature)

 Normalization
  Divide each data dimension (feature) by its standard deviation
  To obtain a standard deviation of 1 for each data dimension (feature)
  Or, scale the data within the range [-1, 1]


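An added minimal NumPy sketch of the two preprocessing steps above (mean subtraction and normalization); X is an assumed toy data matrix with one example per row:

```python
import numpy as np

# Assumed toy data matrix: rows = examples, columns = features
X = np.array([[150.0, 0.2],
              [160.0, 0.4],
              [170.0, 0.9]])

# Mean subtraction: zero-center each feature (data dimension)
X_centered = X - X.mean(axis=0)

# Normalization: divide each feature by its standard deviation
X_normalized = X_centered / X_centered.std(axis=0)

# Alternative: scale each feature to the range [-1, 1]
X_scaled = 2 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) - 1

print(X_normalized.mean(axis=0))  # ~0 for each feature
print(X_normalized.std(axis=0))   # 1 for each feature
```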

Training NNs
 To train a NN, set the parameters such that for a training subset of images, the corresponding elements in the predicted output have maximum values

 Input (image of "1"): y1 has the maximum value
 Input (image of "2"): y2 has the maximum value
 ...
 Input (image of "9"): y9 has the maximum value
 Input (image of "0"): y10 has the maximum value

Training NNs
 Define an objective function/cost function/loss function that
calculates the difference between the model prediction and the true
label
 E.g., can be mean-squared error, cross-entropy, etc.

[Figure: an input image x1 ... x256 is passed through the network; the predicted output vector (y1, y2, ..., y10) is compared with the one-hot true label "1", producing the cost $\mathcal{L}(\theta)$.]

Training NNs
 For $N$ training images, calculate the total loss: $\mathcal{L}(\theta)=\sum_{i=1}^{N}\mathcal{L}_{i}(\theta)$
 Find the optimal NN parameters $\theta^{*}$ that minimize the total loss $\mathcal{L}(\theta)$
[Figure: each training input $x_i$ is passed through the NN to produce a prediction $\hat{y}_i$, which is compared with the label $y_i$ to give an individual loss $\mathcal{L}_i(\theta)$; summing over all $N$ examples gives the total loss.]

Training NNs
 Optimizing the loss function
  Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm
  GD applies iterative refinement of the network parameters
  GD uses the opposite direction of the gradient of the loss with respect to the NN parameters (i.e., $\partial\mathcal{L}/\partial\theta_i$) for updating
  The gradient of the loss function gives the direction of fastest increase of the loss function when the parameters are changed

[Figure: a 1D loss curve $\mathcal{L}(\theta)$ with the gradient $\partial\mathcal{L}/\partial\theta_i$ evaluated at a point $\theta_i$.]

Training NNs
 The loss functions for most DL tasks are defined over very high-
dimensional spaces
 E.g., ResNet50 NN has about 23 million parameters
 This makes the loss function impossible to visualize
 We can still gain intuitions by studying 1-dimensional and 2-
dimensional examples of loss functions

[Figures: a 1D loss (the minimum point is obvious) and a 2D loss surface (blue = low loss, red = high loss).]

Gradient Descent Algorithm
 Steps in the gradient descent algorithm:
1. Randomly initialize the model parameters $\theta^{0}$
 In the figure, the parameters are denoted $\theta$

2. Compute the gradient of the loss function at $\theta^{t}$: $\nabla\mathcal{L}(\theta^{t})$

3. Update the parameters as: $\theta^{t+1}=\theta^{t}-\alpha\nabla\mathcal{L}(\theta^{t})$
 Where $\alpha$ is the learning rate

4. Go to step 2 and repeat (until a terminating criterion is reached)

(A minimal code sketch of this loop follows below.)
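An added minimal NumPy sketch of the four steps above, using an assumed toy quadratic loss so that the gradient has a simple closed form:

```python
import numpy as np

# Assumed toy loss: L(theta) = ||theta - target||^2, with gradient 2 * (theta - target)
target = np.array([3.0, -2.0])

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    return 2.0 * (theta - target)

alpha = 0.1                       # learning rate
theta = np.random.randn(2)        # step 1: random initialization

for step in range(100):           # step 4: repeat until a stopping criterion
    g = grad(theta)               # step 2: gradient of the loss at theta
    theta = theta - alpha * g     # step 3: update in the opposite direction

print(theta, loss(theta))         # theta approaches the minimizer [3, -2]
```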

Gradient Descent Algorithm
 Example: a NN with only 2 parameters $w_1$ and $w_2$, i.e., $\theta=\{w_1,w_2\}$
 Different colors are the values of the loss (minimum loss is ≈ 1.3)

1. Randomly pick a starting point $\theta^{0}$
2. Compute the gradient at $\theta^{0}$: $\nabla\mathcal{L}(\theta^{0})=\left[\begin{matrix}\partial\mathcal{L}(\theta^{0})/\partial w_{1}\\\partial\mathcal{L}(\theta^{0})/\partial w_{2}\end{matrix}\right]$
3. Times the learning rate $\alpha$, and update: $\theta^{1}=\theta^{0}-\alpha\nabla\mathcal{L}(\theta^{0})$
4. Go to step 2 and repeat

[Figure: a 2D loss surface over $(w_1, w_2)$ showing the starting point $\theta^{0}$, the step $-\alpha\nabla\mathcal{L}(\theta^{0})$, and the updated point $\theta^{1}$.]

Gradient Descent Algorithm
 Example (contd.): repeating the steps, eventually we would reach a minimum

1. Randomly pick a starting point $\theta^{0}$
2. Compute the gradient at $\theta^{t}$: $\nabla\mathcal{L}(\theta^{t})$
3. Times the learning rate $\alpha$, and update: $\theta^{t+1}=\theta^{t}-\alpha\nabla\mathcal{L}(\theta^{t})$
4. Go to step 2 and repeat

[Figure: the trajectory $\theta^{0}\rightarrow\theta^{1}\rightarrow\theta^{2}\rightarrow\cdots$ on the 2D loss surface, with successive updates $\theta^{1}-\alpha\nabla\mathcal{L}(\theta^{1})$, $\theta^{2}-\alpha\nabla\mathcal{L}(\theta^{2})$ moving toward a minimum.]

Gradient Descent Algorithm


 Gradient descent algorithm stops when a local minimum of the
loss surface is reached
 GD does not guarantee reaching a global minimum
 However, empirical evidence suggests that GD works well for NNs


Gradient Descent Algorithm
 For most tasks, the loss surface is highly complex (and non-convex)
 Random initialization in NNs results in different initial parameters
  Gradient descent may reach different minima at every run
  Therefore, the NN will produce different predicted outputs
 Currently, we don't have an algorithm that guarantees reaching a global minimum for an arbitrary loss function

[Figure: a non-convex loss surface $\mathcal{L}$ over the parameters $w_1$ and $w_2$, with multiple local minima.]

Backpropagation
 How to calculate the gradients of the loss function in NNs?
 There are two ways:
1. Numerical gradient: slow, approximate, but easy way
2. Analytic gradient: requires calculus, fast, but more error-
prone way
 In practice, the analytic gradient is used
 Analytical differentiation for gradient computation is available in almost all deep learning libraries (a small numerical gradient-check sketch follows below)
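As an added illustration, a minimal sketch of a numerical gradient (finite differences) used to check an analytic gradient; the toy loss function here is an assumption:

```python
import numpy as np

def loss(theta):
    # Assumed toy loss with a known analytic gradient
    return np.sum(theta ** 2) + np.sin(theta[0])

def analytic_grad(theta):
    g = 2.0 * theta
    g[0] += np.cos(theta[0])
    return g

def numerical_grad(f, theta, h=1e-5):
    """Slow but easy: central finite differences, one dimension at a time."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta = np.array([0.3, -1.2])
print(analytic_grad(theta))
print(numerical_grad(loss, theta))   # should closely match the analytic gradient
```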

Mini-batch Gradient Descent
 For large datasets, it is wasteful to compute the loss over the entire training set in order to perform a single parameter update
  E.g., ImageNet has 14M images
 GD (a.k.a. vanilla GD) is replaced with mini-batch GD
 Mini-batch gradient descent
  Approach:
   Compute the loss on a mini-batch of images, update the parameters $\theta$, and repeat until all images are used
   At the next epoch, shuffle the training data, and repeat the above process
  Mini-batch GD results in much faster training
  Typical batch size: 32 to 256 images
  It works because the examples in the training data are correlated
   I.e., the gradient from a mini-batch is a good approximation of the gradient of the entire training set
 (A minimal mini-batch loop sketch follows below.)
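An added minimal NumPy sketch of mini-batch gradient descent; the linear model, squared loss, and batch size of 32 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy regression data: N examples, D features
N, D, batch_size, alpha = 1024, 10, 32, 0.01
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)
for epoch in range(10):
    perm = rng.permutation(N)                  # shuffle the training data each epoch
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this mini-batch
        residual = Xb @ w - yb
        grad = (2.0 / len(idx)) * Xb.T @ residual
        w -= alpha * grad                      # one parameter update per mini-batch

print(np.linalg.norm(w - true_w))              # should be small after training
```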

Stochastic Gradient Descent
 Stochastic gradient descent (SGD)
  SGD uses mini-batches that consist of a single input example
   E.g., a one-image mini-batch
  Although this method is very fast, it may cause significant fluctuations in the loss function
   Therefore, it is less commonly used, and mini-batch GD is preferred
  In most DL libraries, SGD typically means mini-batch SGD (with an option to add momentum)

Problems with Gradient Descent


 Besides the local minima problem, the GD algorithm can be very slow
at plateaus, and it can get stuck at saddle points

[Figure: a 1D cost curve over $\theta$ illustrating three failure modes: very slow progress on a plateau ($\nabla\mathcal{L}(\theta)\approx 0$), getting stuck at a saddle point ($\nabla\mathcal{L}(\theta)=0$), and getting stuck at a local minimum ($\nabla\mathcal{L}(\theta)=0$).]

Gradient Descent with Momentum


 Gradient descent with momentum uses the momentum of the
gradient for parameter optimization

[Figure: real movement = negative of gradient + momentum; the accumulated momentum can carry the parameters past a point where the gradient = 0.]

Gradient Descent with Momentum
 Parameters update in GD with momentum: $\theta^{t}=\theta^{t-1}-V^{t}$
  Where: $V^{t}=\beta V^{t-1}+\alpha\nabla\mathcal{L}(\theta^{t-1})$
 Compare to vanilla GD: $\theta^{t}=\theta^{t-1}-\alpha\nabla\mathcal{L}(\theta^{t-1})$
 The term $V^{t}$ is called momentum
  This term accumulates the gradients from the past several steps
  It is similar to the momentum of a heavy ball rolling down the hill
 The parameter $\beta$ is referred to as the coefficient of momentum
  A typical value of the parameter $\beta$ is 0.9
 This method updates the parameters $\theta$ in the direction of the weighted average of the past gradients
 (A minimal code sketch follows below.)
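An added minimal sketch of the momentum update above (assuming the convention $V^{t}=\beta V^{t-1}+\alpha\nabla\mathcal{L}$, $\theta^{t}=\theta^{t-1}-V^{t}$); the toy quadratic loss is an assumption:

```python
import numpy as np

target = np.array([3.0, -2.0])
grad = lambda theta: 2.0 * (theta - target)   # gradient of ||theta - target||^2

alpha, beta = 0.05, 0.9                       # learning rate and momentum coefficient
theta = np.zeros(2)
V = np.zeros(2)                               # accumulated momentum term

for step in range(200):
    V = beta * V + alpha * grad(theta)        # weighted average of past gradients
    theta = theta - V                         # momentum update

print(theta)                                  # approaches [3, -2]
```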

Nesterov Accelerated Momentum
 Gradient descent with Nesterov accelerated momentum
 Parameters update: $\theta^{t}=\theta^{t-1}-V^{t}$
  Where: $V^{t}=\beta V^{t-1}+\alpha\nabla\mathcal{L}\left(\theta^{t-1}-\beta V^{t-1}\right)$
 The term $\beta V^{t-1}$ allows us to predict the position of the parameters in the next step (i.e., $\theta^{t-1}-\beta V^{t-1}$)
 The gradient is calculated with respect to the approximate future position of the parameters in the next step, $\theta^{t-1}-\beta V^{t-1}$ (a code sketch follows below)

[Figure: comparison of the update directions for GD with momentum and GD with Nesterov momentum.]
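An added sketch of the Nesterov variant, differing from the momentum snippet above only in where the gradient is evaluated (at the look-ahead point); the toy loss is again an assumption:

```python
import numpy as np

target = np.array([3.0, -2.0])
grad = lambda theta: 2.0 * (theta - target)

alpha, beta = 0.05, 0.9
theta, V = np.zeros(2), np.zeros(2)

for step in range(200):
    lookahead = theta - beta * V                  # approximate future position
    V = beta * V + alpha * grad(lookahead)        # gradient at the look-ahead point
    theta = theta - V

print(theta)                                      # approaches [3, -2]
```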

Learning Rate
 Learning rate
 The gradient tells us the direction in which the loss has the steepest rate of
increase, but it does not tell us how far along the opposite direction we
should step
 Choosing the learning rate (also called the step size) is one of the most
important hyper-parameter settings for NN training

[Figure: parameter updates with a learning rate that is too small vs. a learning rate that is too large.]

Learning Rate
 Training loss for different learning rates
 High learning rate: the loss increases or plateaus too quickly
 Low learning rate: the loss decreases too slowly (takes many epochs to
reach a solution)

Annealing the Learning Rate
 Reduce the learning rate over time (learning rate decay)
 Approach 1
  Reduce the learning rate by some factor every few epochs
   Typical values: reduce the learning rate by a half every 5 epochs, or by a factor of 10 every 20 epochs
  Exponential decay reduces the learning rate exponentially over time
  These numbers depend heavily on the type of problem and the model
 Approach 2
  Reduce the learning rate by a constant factor (e.g., by half) whenever the validation loss stops improving
  In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau()
   Monitor: validation loss
   Factor: 0.1 (i.e., divide by 10)
   Patience: 10 (how many epochs to wait before applying it)
   Minimum learning rate: 1e-6 (when to stop)
 (An example configuration follows below.)
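As an added usage example, the callback configuration described above; the training call is commented out and assumes a compiled tf.keras model and training data that are not part of the slides:

```python
import tensorflow as tf

# Reduce the learning rate when the validation loss stops improving
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",   # watch the validation loss
    factor=0.1,           # multiply the learning rate by 0.1 (divide by 10)
    patience=10,          # wait 10 epochs without improvement before reducing
    min_lr=1e-6,          # do not reduce the learning rate below this value
)

# Hypothetical training call; model, x_train, and y_train are assumed to exist
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[reduce_lr])
```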

Adam
 Adaptive Moment Estimation (Adam)
  Adam computes adaptive learning rates for each dimension of $\theta$
  Similar to GD with momentum, Adam computes a weighted average of past gradients, i.e., $m^{t}=\beta_{1}m^{t-1}+(1-\beta_{1})\nabla\mathcal{L}(\theta^{t})$
  Adam also computes a weighted average of past squared gradients, i.e., $v^{t}=\beta_{2}v^{t-1}+(1-\beta_{2})\left(\nabla\mathcal{L}(\theta^{t})\right)^{2}$
  The parameters update is: $\theta^{t+1}=\theta^{t}-\frac{\alpha}{\sqrt{\hat{v}^{t}}+\epsilon}\hat{m}^{t}$
   Where: $\hat{m}^{t}=m^{t}/(1-\beta_{1}^{t})$ and $\hat{v}^{t}=v^{t}/(1-\beta_{2}^{t})$
  The proposed default values are $\beta_{1}$ = 0.9, $\beta_{2}$ = 0.999, and $\epsilon$ = 1e-8
 Other commonly used optimization methods include:
  Adagrad, Adadelta, RMSprop, Nadam, etc.
  Most papers nowadays use Adam or SGD with momentum
 (A minimal code sketch of the Adam update follows below.)
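An added minimal NumPy sketch of the Adam update above, applied to the same assumed toy quadratic loss used earlier:

```python
import numpy as np

target = np.array([3.0, -2.0])
grad = lambda theta: 2.0 * (theta - target)

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.zeros(2)
m = np.zeros(2)   # weighted average of past gradients
v = np.zeros(2)   # weighted average of past squared gradients

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)   # approaches [3, -2]
```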

Vanishing / Exploding Gradients Problem
 In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
  They result in very small or very large updates of the parameters
  Solutions: ReLU activations, regularization, LSTM units in RNNs

[Figure: a deep network from inputs x1 ... xN to outputs y1 ... yM; the early layers receive small gradients and learn very slowly.]

Generalization
 Underfitting
  The model is too "simple" to represent all the relevant class characteristics
  Model with too few parameters
  High error on the training set and high error on the testing set
 Overfitting
  The model is too "complex" and fits irrelevant characteristics (noise) in the data
  Model with too many parameters
  Low error on the training set and high error on the testing set

[Figure: blue line = decision boundary learned by the model; green line = optimal decision boundary.]

Overfitting
 A model with high capacity fits the noise in the data instead of the underlying relationship
  The model may fit the training data very well, but fails to generalize to new examples (test data)

Ways to reduce overfitting


 A large number of different methods have been developed.
 Weight-decay
 Weight-sharing
 Early stopping
 Model averaging
 Bayesian fitting of neural nets
 Dropout
 Generative pre-training
 Many of these methods will be described later (a small illustrative example combining a few of them is sketched below).
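As an added illustration (not part of the slides), a hedged tf.keras sketch showing three of the listed techniques: weight decay (L2 regularization), dropout, and early stopping; the architecture and hyper-parameter values are arbitrary assumptions:

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)   # weight decay on the dense layer

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),     # e.g., a 16 x 16 image flattened
    tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.5),     # dropout to reduce overfitting
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: halt training when the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# Hypothetical training call; x_train and y_train are assumed to exist
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```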
