
SPC 2408: Artificial Neural Networks

Lesson 4 Training NNs: Learning the Weights

CS 502, Fall 2020

1. Loss Functions
 Loss functions measure the difference between the predicted and actual outputs, guiding the optimization process to improve the model's performance.

1. Mean Squared Error (MSE)
 Definition: Measures the average squared difference between predicted and actual values.
 Usage:
  Commonly used for regression tasks.
  Penalizes larger errors more significantly.
 Formula: $\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2$ (a short numerical sketch follows below)
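As an added illustration (not part of the original slides), here is a minimal NumPy sketch of the MSE formula above; the array names y_true and y_pred are placeholders:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared differences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Larger errors are penalized quadratically
print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.5]))  # (0.01 + 0.01 + 0.25) / 3 = 0.09
```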

2. Cross-Entropy Loss
 Definition: Measures the difference between two probability distributions, typically used for classification tasks.
 Formula (Binary Classification): $\mathcal{L}=-\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log\hat{y}_i+(1-y_i)\log\left(1-\hat{y}_i\right)\right]$ (a short numerical sketch follows below)
 Advantages:
o Works well for probabilities.
o Sensitive to confidence in predictions.
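An added minimal NumPy sketch of the binary cross-entropy formula above (not from the slides); clipping the predictions with a small eps is a common numerical-stability trick and an assumption here:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy averaged over n examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident correct predictions give a small loss; confident wrong ones are penalized heavily
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ~0.105
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # ~2.303
```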

2. Forward Propagation
 Forward propagation involves computing the output of the neural network given an input, layer by layer, using the weights and biases of the network.
 Steps:
1. Input Layer:
 Pass input data $x$ to the first layer.

2. Hidden Layers:
 Compute the activations for each layer: $z^{(l)}=W^{(l)}a^{(l-1)}+b^{(l)}$, $a^{(l)}=f\left(z^{(l)}\right)$

3. Output Layer:
 Compute the final output using the same process.
 Example: see the forward-propagation sketch below.
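As an added illustration (not from the original slides), a minimal NumPy sketch of forward propagation through one hidden layer; the layer sizes, the sigmoid activation, and the toy input are all assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Assumed toy dimensions: 3 inputs -> 4 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

def forward(x):
    """Forward propagation: compute activations layer by layer."""
    z1 = W1 @ x + b1          # hidden pre-activation
    a1 = sigmoid(z1)          # hidden activation
    z2 = W2 @ a1 + b2         # output pre-activation
    y_hat = sigmoid(z2)       # network output
    return z1, a1, z2, y_hat

x = np.array([0.5, -1.2, 3.0])
print(forward(x)[-1])         # the predicted output
```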

3. Backpropagation Algorithm
 Backpropagation calculates the gradient of the loss function with respect to each weight and bias, enabling the optimization process.
 Steps:
1. Compute Loss:
 Calculate the difference between predicted and actual output using a loss function.

2. Output Layer Gradients:
 Compute the gradient of the loss with respect to the output, $\delta_{\text{output}}$, and propagate it backward layer by layer (a worked sketch follows below).
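To make the gradient computation concrete, here is an added minimal sketch (not from the original slides) of backpropagation for a one-hidden-layer network with sigmoid activations and an MSE loss; the layer sizes, weight names, and toy inputs are all assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """Return gradients of L = 0.5 * ||y_hat - y||^2 w.r.t. all weights and biases."""
    # Forward pass
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    y_hat = sigmoid(z2)

    # Output-layer error: dL/dz2
    delta_output = (y_hat - y) * y_hat * (1 - y_hat)
    # Hidden-layer error, propagated backward through W2
    delta_hidden = (W2.T @ delta_output) * a1 * (1 - a1)

    # Gradients with respect to the parameters
    dW2, db2 = np.outer(delta_output, a1), delta_output
    dW1, db1 = np.outer(delta_hidden, x), delta_hidden
    return dW1, db1, dW2, db2

# Assumed toy network: 3 inputs -> 4 hidden units -> 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = np.array([0.5, -1.2, 3.0]), np.array([1.0, 0.0])
print(backprop(x, y, W1, b1, W2, b2)[0])  # gradient of the loss w.r.t. W1
```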

Training NNs
 The network parameters $\theta$ include the weight matrices and bias vectors from all layers:
$\theta=\left\{W_{1},b_{1},W_{2},b_{2},\cdots,W_{L},b_{L}\right\}$
 Training a model to learn a set of parameters that are optimal (according to a criterion) is one of the greatest challenges in ML

[Figure: a 16 x 16 = 256-pixel digit image (inputs x1 ... x256) is fed through the network; a softmax output layer produces scores y1 ... y10, e.g. y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), ..., y10 = 0.2 ("is 0").]

Training NNs
 Data preprocessing helps convergence during training (a short sketch follows below)
 Mean subtraction
  Zero-centered data
  Subtract the mean for each individual data dimension (feature)

 Normalization
  Divide each data dimension (feature) by its standard deviation
  To obtain a standard deviation of 1 for each data dimension (feature)
  Or, scale the data within the range [-1, 1]


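An added minimal NumPy sketch of the two preprocessing steps above (mean subtraction and normalization); X is an assumed toy data matrix with one example per row:

```python
import numpy as np

# Assumed toy data matrix: rows = examples, columns = features
X = np.array([[150.0, 0.2],
              [160.0, 0.4],
              [170.0, 0.9]])

# Mean subtraction: zero-center each feature (data dimension)
X_centered = X - X.mean(axis=0)

# Normalization: divide each feature by its standard deviation
X_normalized = X_centered / X_centered.std(axis=0)

# Alternative: scale each feature to the range [-1, 1]
X_scaled = 2 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) - 1

print(X_normalized.mean(axis=0))  # ~0 for each feature
print(X_normalized.std(axis=0))   # 1 for each feature
```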

Training NNs
 To train a NN, set the parameters such that for a training subset of images, the corresponding elements in the predicted output have maximum values

 Input (image of "1"): y1 has the maximum value
 Input (image of "2"): y2 has the maximum value
 ...
 Input (image of "9"): y9 has the maximum value
 Input (image of "0"): y10 has the maximum value

Training NNs
 Define an objective function/cost function/loss function that
calculates the difference between the model prediction and the true
label
 E.g., can be mean-squared error, cross-entropy, etc.

[Figure: an input image x1 ... x256 is passed through the network; the predicted output vector (y1, y2, ..., y10) is compared with the one-hot true label "1", producing the cost $\mathcal{L}(\theta)$.]

Training NNs
 For $N$ training images, calculate the total loss: $\mathcal{L}(\theta)=\sum_{i=1}^{N}\mathcal{L}_{i}(\theta)$
 Find the optimal NN parameters $\theta^{*}$ that minimize the total loss $\mathcal{L}(\theta)$
[Figure: each training input $x_i$ is passed through the NN to produce a prediction $\hat{y}_i$, which is compared with the label $y_i$ to give an individual loss $\mathcal{L}_i(\theta)$; summing over all $N$ examples gives the total loss.]

Training NNs
 Optimizing the loss function
  Almost all DL models these days are trained with a variant of the gradient descent (GD) algorithm
  GD applies iterative refinement of the network parameters
  GD uses the opposite direction of the gradient of the loss with respect to the NN parameters (i.e., $\partial\mathcal{L}/\partial\theta_i$) for updating
  The gradient of the loss function gives the direction of fastest increase of the loss function when the parameters are changed

[Figure: a 1D loss curve $\mathcal{L}(\theta)$ with the gradient $\partial\mathcal{L}/\partial\theta_i$ evaluated at a point $\theta_i$.]

Training NNs
 The loss functions for most DL tasks are defined over very high-
dimensional spaces
 E.g., ResNet50 NN has about 23 million parameters
 This makes the loss function impossible to visualize
 We can still gain intuitions by studying 1-dimensional and 2-
dimensional examples of loss functions

[Figures: a 1D loss (the minimum point is obvious) and a 2D loss surface (blue = low loss, red = high loss).]

Gradient Descent Algorithm
 Steps in the gradient descent algorithm:
1. Randomly initialize the model parameters $\theta^{0}$
 In the figure, the parameters are denoted $\theta$

2. Compute the gradient of the loss function at $\theta^{t}$: $\nabla\mathcal{L}(\theta^{t})$

3. Update the parameters as: $\theta^{t+1}=\theta^{t}-\alpha\nabla\mathcal{L}(\theta^{t})$
 Where $\alpha$ is the learning rate

4. Go to step 2 and repeat (until a terminating criterion is reached)

(A minimal code sketch of this loop follows below.)
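An added minimal NumPy sketch of the four steps above, using an assumed toy quadratic loss so that the gradient has a simple closed form:

```python
import numpy as np

# Assumed toy loss: L(theta) = ||theta - target||^2, with gradient 2 * (theta - target)
target = np.array([3.0, -2.0])

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    return 2.0 * (theta - target)

alpha = 0.1                       # learning rate
theta = np.random.randn(2)        # step 1: random initialization

for step in range(100):           # step 4: repeat until a stopping criterion
    g = grad(theta)               # step 2: gradient of the loss at theta
    theta = theta - alpha * g     # step 3: update in the opposite direction

print(theta, loss(theta))         # theta approaches the minimizer [3, -2]
```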

Gradient Descent Algorithm
 Example: a NN with only 2 parameters $w_1$ and $w_2$, i.e., $\theta=\{w_1,w_2\}$
 Different colors are the values of the loss (minimum loss is ≈ 1.3)

1. Randomly pick a starting point $\theta^{0}$
2. Compute the gradient at $\theta^{0}$: $\nabla\mathcal{L}(\theta^{0})=\left[\begin{matrix}\partial\mathcal{L}(\theta^{0})/\partial w_{1}\\\partial\mathcal{L}(\theta^{0})/\partial w_{2}\end{matrix}\right]$
3. Times the learning rate $\alpha$, and update: $\theta^{1}=\theta^{0}-\alpha\nabla\mathcal{L}(\theta^{0})$
4. Go to step 2 and repeat

[Figure: a 2D loss surface over $(w_1, w_2)$ showing the starting point $\theta^{0}$, the step $-\alpha\nabla\mathcal{L}(\theta^{0})$, and the updated point $\theta^{1}$.]

Gradient Descent Algorithm
 Example (contd.): repeating the steps, eventually we would reach a minimum

1. Randomly pick a starting point $\theta^{0}$
2. Compute the gradient at $\theta^{t}$: $\nabla\mathcal{L}(\theta^{t})$
3. Times the learning rate $\alpha$, and update: $\theta^{t+1}=\theta^{t}-\alpha\nabla\mathcal{L}(\theta^{t})$
4. Go to step 2 and repeat

[Figure: the trajectory $\theta^{0}\rightarrow\theta^{1}\rightarrow\theta^{2}\rightarrow\cdots$ on the 2D loss surface, with successive updates $\theta^{1}-\alpha\nabla\mathcal{L}(\theta^{1})$, $\theta^{2}-\alpha\nabla\mathcal{L}(\theta^{2})$ moving toward a minimum.]

Gradient Descent Algorithm


 Gradient descent algorithm stops when a local minimum of the
loss surface is reached
 GD does not guarantee reaching a global minimum
 However, empirical evidence suggests that GD works well for NNs


Gradient Descent Algorithm
 For most tasks, the loss surface is highly complex (and non-convex)
 Random initialization in NNs results in different initial parameters
  Gradient descent may reach different minima at every run
  Therefore, the NN will produce different predicted outputs
 Currently, we don't have an algorithm that guarantees reaching a global minimum for an arbitrary loss function

[Figure: a non-convex loss surface $\mathcal{L}$ over the parameters $w_1$ and $w_2$, with multiple local minima.]

Backpropagation
 How to calculate the gradients of the loss function in NNs?
 There are two ways:
1. Numerical gradient: slow, approximate, but easy way
2. Analytic gradient: requires calculus, fast, but more error-
prone way
 In practice, the analytic gradient is used
 Analytical differentiation for gradient computation is available in almost all deep learning libraries (a small numerical gradient-check sketch follows below)
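As an added illustration, a minimal sketch of a numerical gradient (finite differences) used to check an analytic gradient; the toy loss function here is an assumption:

```python
import numpy as np

def loss(theta):
    # Assumed toy loss with a known analytic gradient
    return np.sum(theta ** 2) + np.sin(theta[0])

def analytic_grad(theta):
    g = 2.0 * theta
    g[0] += np.cos(theta[0])
    return g

def numerical_grad(f, theta, h=1e-5):
    """Slow but easy: central finite differences, one dimension at a time."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta = np.array([0.3, -1.2])
print(analytic_grad(theta))
print(numerical_grad(loss, theta))   # should closely match the analytic gradient
```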

Mini-batch Gradient Descent
 For large datasets, it is wasteful to compute the loss over the entire training set in order to perform a single parameter update
  E.g., ImageNet has 14M images
 GD (a.k.a. vanilla GD) is replaced with mini-batch GD
 Mini-batch gradient descent
  Approach:
   Compute the loss on a mini-batch of images, update the parameters $\theta$, and repeat until all images are used
   At the next epoch, shuffle the training data, and repeat the above process
  Mini-batch GD results in much faster training
  Typical batch size: 32 to 256 images
  It works because the examples in the training data are correlated
   I.e., the gradient from a mini-batch is a good approximation of the gradient of the entire training set
 (A minimal mini-batch loop sketch follows below.)
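An added minimal NumPy sketch of mini-batch gradient descent; the linear model, squared loss, and batch size of 32 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy regression data: N examples, D features
N, D, batch_size, alpha = 1024, 10, 32, 0.01
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)
for epoch in range(10):
    perm = rng.permutation(N)                  # shuffle the training data each epoch
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this mini-batch
        residual = Xb @ w - yb
        grad = (2.0 / len(idx)) * Xb.T @ residual
        w -= alpha * grad                      # one parameter update per mini-batch

print(np.linalg.norm(w - true_w))              # should be small after training
```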

Stochastic Gradient Descent
 Stochastic gradient descent (SGD)
  SGD uses mini-batches that consist of a single input example
   E.g., a one-image mini-batch
  Although this method is very fast, it may cause significant fluctuations in the loss function
   Therefore, it is less commonly used, and mini-batch GD is preferred
  In most DL libraries, SGD typically means mini-batch SGD (with an option to add momentum)

Problems with Gradient Descent


 Besides the local minima problem, the GD algorithm can be very slow
at plateaus, and it can get stuck at saddle points

[Figure: a 1D cost curve over $\theta$ illustrating three failure modes: very slow progress on a plateau ($\nabla\mathcal{L}(\theta)\approx 0$), getting stuck at a saddle point ($\nabla\mathcal{L}(\theta)=0$), and getting stuck at a local minimum ($\nabla\mathcal{L}(\theta)=0$).]

Gradient Descent with Momentum


 Gradient descent with momentum uses the momentum of the
gradient for parameter optimization

[Figure: real movement = negative of gradient + momentum; the accumulated momentum can carry the parameters past a point where the gradient = 0.]

Gradient Descent with Momentum
 Parameters update in GD with momentum: $\theta^{t}=\theta^{t-1}-V^{t}$
  Where: $V^{t}=\beta V^{t-1}+\alpha\nabla\mathcal{L}(\theta^{t-1})$
 Compare to vanilla GD: $\theta^{t}=\theta^{t-1}-\alpha\nabla\mathcal{L}(\theta^{t-1})$
 The term $V^{t}$ is called momentum
  This term accumulates the gradients from the past several steps
  It is similar to the momentum of a heavy ball rolling down the hill
 The parameter $\beta$ is referred to as the coefficient of momentum
  A typical value of the parameter $\beta$ is 0.9
 This method updates the parameters $\theta$ in the direction of the weighted average of the past gradients
 (A minimal code sketch follows below.)
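An added minimal sketch of the momentum update above (assuming the convention $V^{t}=\beta V^{t-1}+\alpha\nabla\mathcal{L}$, $\theta^{t}=\theta^{t-1}-V^{t}$); the toy quadratic loss is an assumption:

```python
import numpy as np

target = np.array([3.0, -2.0])
grad = lambda theta: 2.0 * (theta - target)   # gradient of ||theta - target||^2

alpha, beta = 0.05, 0.9                       # learning rate and momentum coefficient
theta = np.zeros(2)
V = np.zeros(2)                               # accumulated momentum term

for step in range(200):
    V = beta * V + alpha * grad(theta)        # weighted average of past gradients
    theta = theta - V                         # momentum update

print(theta)                                  # approaches [3, -2]
```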

Nesterov Accelerated Momentum
 Gradient descent with Nesterov accelerated momentum
 Parameters update: $\theta^{t}=\theta^{t-1}-V^{t}$
  Where: $V^{t}=\beta V^{t-1}+\alpha\nabla\mathcal{L}\left(\theta^{t-1}-\beta V^{t-1}\right)$
 The term $\beta V^{t-1}$ allows us to predict the position of the parameters in the next step (i.e., $\theta^{t-1}-\beta V^{t-1}$)
 The gradient is calculated with respect to the approximate future position of the parameters in the next step, $\theta^{t-1}-\beta V^{t-1}$ (a code sketch follows below)

[Figure: comparison of the update directions for GD with momentum and GD with Nesterov momentum.]
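An added sketch of the Nesterov variant, differing from the momentum snippet above only in where the gradient is evaluated (at the look-ahead point); the toy loss is again an assumption:

```python
import numpy as np

target = np.array([3.0, -2.0])
grad = lambda theta: 2.0 * (theta - target)

alpha, beta = 0.05, 0.9
theta, V = np.zeros(2), np.zeros(2)

for step in range(200):
    lookahead = theta - beta * V                  # approximate future position
    V = beta * V + alpha * grad(lookahead)        # gradient at the look-ahead point
    theta = theta - V

print(theta)                                      # approaches [3, -2]
```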

Learning Rate
 Learning rate
 The gradient tells us the direction in which the loss has the steepest rate of
increase, but it does not tell us how far along the opposite direction we
should step
 Choosing the learning rate (also called the step size) is one of the most
important hyper-parameter settings for NN training

[Figure: parameter updates with a learning rate that is too small vs. a learning rate that is too large.]

Learning Rate
 Training loss for different learning rates
 High learning rate: the loss increases or plateaus too quickly
 Low learning rate: the loss decreases too slowly (takes many epochs to
reach a solution)

Annealing the Learning Rate
 Reduce the learning rate over time (learning rate decay)
 Approach 1
  Reduce the learning rate by some factor every few epochs
   Typical values: reduce the learning rate by a half every 5 epochs, or by a factor of 10 every 20 epochs
  Exponential decay reduces the learning rate exponentially over time
  These numbers depend heavily on the type of problem and the model
 Approach 2
  Reduce the learning rate by a constant factor (e.g., by half) whenever the validation loss stops improving
  In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau()
   Monitor: validation loss
   Factor: 0.1 (i.e., divide by 10)
   Patience: 10 (how many epochs to wait before applying it)
   Minimum learning rate: 1e-6 (when to stop)
 (An example configuration follows below.)
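As an added usage example, the callback configuration described above; the training call is commented out and assumes a compiled tf.keras model and training data that are not part of the slides:

```python
import tensorflow as tf

# Reduce the learning rate when the validation loss stops improving
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",   # watch the validation loss
    factor=0.1,           # multiply the learning rate by 0.1 (divide by 10)
    patience=10,          # wait 10 epochs without improvement before reducing
    min_lr=1e-6,          # do not reduce the learning rate below this value
)

# Hypothetical training call; model, x_train, and y_train are assumed to exist
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[reduce_lr])
```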

Adam
 Adaptive Moment Estimation (Adam)
  Adam computes adaptive learning rates for each dimension of $\theta$
  Similar to GD with momentum, Adam computes a weighted average of past gradients, i.e., $m^{t}=\beta_{1}m^{t-1}+(1-\beta_{1})\nabla\mathcal{L}(\theta^{t})$
  Adam also computes a weighted average of past squared gradients, i.e., $v^{t}=\beta_{2}v^{t-1}+(1-\beta_{2})\left(\nabla\mathcal{L}(\theta^{t})\right)^{2}$
  The parameters update is: $\theta^{t+1}=\theta^{t}-\frac{\alpha}{\sqrt{\hat{v}^{t}}+\epsilon}\hat{m}^{t}$
   Where: $\hat{m}^{t}=m^{t}/(1-\beta_{1}^{t})$ and $\hat{v}^{t}=v^{t}/(1-\beta_{2}^{t})$
  The proposed default values are $\beta_{1}$ = 0.9, $\beta_{2}$ = 0.999, and $\epsilon$ = 1e-8
 Other commonly used optimization methods include:
  Adagrad, Adadelta, RMSprop, Nadam, etc.
  Most papers nowadays use Adam or SGD with momentum
 (A minimal code sketch of the Adam update follows below.)
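An added minimal NumPy sketch of the Adam update above, applied to the same assumed toy quadratic loss used earlier:

```python
import numpy as np

target = np.array([3.0, -2.0])
grad = lambda theta: 2.0 * (theta - target)

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.zeros(2)
m = np.zeros(2)   # weighted average of past gradients
v = np.zeros(2)   # weighted average of past squared gradients

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)          # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)   # approaches [3, -2]
```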

Vanishing / Exploding Gradients Problem
 In some cases, during training, the gradients can become either very small (vanishing gradients) or very large (exploding gradients)
  They result in very small or very large updates of the parameters
  Solutions: ReLU activations, regularization, LSTM units in RNNs

[Figure: a deep network from inputs x1 ... xN to outputs y1 ... yM; the early layers receive small gradients and learn very slowly.]

Generalization
 Underfitting
  The model is too "simple" to represent all the relevant class characteristics
  Model with too few parameters
  High error on the training set and high error on the testing set
 Overfitting
  The model is too "complex" and fits irrelevant characteristics (noise) in the data
  Model with too many parameters
  Low error on the training set and high error on the testing set

[Figure: blue line = decision boundary learned by the model; green line = optimal decision boundary.]

Overfitting
 A model with high capacity fits the noise in the data instead of the underlying relationship
  The model may fit the training data very well, but fails to generalize to new examples (test data)

Ways to reduce overfitting


 A large number of different methods have been developed.
 Weight-decay
 Weight-sharing
 Early stopping
 Model averaging
 Bayesian fitting of neural nets
 Dropout
 Generative pre-training
 Many of these methods will be described later (a small illustrative example combining a few of them is sketched below).
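As an added illustration (not part of the slides), a hedged tf.keras sketch showing three of the listed techniques: weight decay (L2 regularization), dropout, and early stopping; the architecture and hyper-parameter values are arbitrary assumptions:

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)   # weight decay on the dense layer

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),     # e.g., a 16 x 16 image flattened
    tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.5),     # dropout to reduce overfitting
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: halt training when the validation loss stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

# Hypothetical training call; x_train and y_train are assumed to exist
# model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```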
