Lesson 4: Training ANNs
Artificial Neural Networks
1. Loss Functions
Loss functions measure the difference between the predicted
and actual outputs, guiding the optimization process to
improve the model's performance.
1. Mean Squared Error (MSE)
Definition: Measures the average squared difference
between predicted and actual values.
Usage:
Commonly used for regression tasks.
Penalizes larger errors more significantly.
Formula: MSE = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)²
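A minimal NumPy sketch of MSE (illustrative, not from the original slides; the sample values are made up):

import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared differences."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# The squaring penalizes larger errors more significantly.
print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 1.0]))  # (0.25 + 0 + 4) / 3 ≈ 1.417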
2. Cross-Entropy Loss
Definition: Measures the difference between two
probability distributions, typically used for classification
tasks.
Formula (Binary Classification): L = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]
Advantages:
o Works well for probabilities.
o Sensitive to confidence in predictions.
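A minimal NumPy sketch of binary cross-entropy (illustrative; the clipping epsilon is an assumption, added only for numerical stability):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy, averaged over the examples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident but wrong predictions are penalized heavily.
print(binary_cross_entropy([1, 0], [0.9, 0.1]))  # ≈ 0.105
print(binary_cross_entropy([1, 0], [0.1, 0.9]))  # ≈ 2.303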
2. Forward Propagation
Forward propagation involves computing the output
of the neural network given an input, layer by layer,
using the weights and biases of the network.
Steps:
1. Input Layer:
Pass input data x to the first layer.
2. Hidden Layers:
Compute activations for each neuron: a = f(w · x + b), where w are the weights, b the bias, and f the activation function.
3. Output Layer:
Compute final output using the same process.
Example: see the sketch below.
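A minimal NumPy sketch of the forward pass for a small two-layer network (the layer sizes, sigmoid activation, and input values are assumptions for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Assumed sizes: 3 inputs -> 4 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output-layer weights and biases

x = np.array([0.5, -1.0, 2.0])                  # 1. input layer: pass input x

a1 = sigmoid(W1 @ x + b1)                       # 2. hidden layer: a = f(Wx + b)
y_hat = sigmoid(W2 @ a1 + b2)                   # 3. output layer: same process

print(y_hat)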
3. Backpropagation Algorithm
Backpropagation calculates the gradient of the loss function
with respect to each weight and bias, enabling the
optimization process.
Steps:
1. Compute Loss:
Calculate the difference between predicted and actual output
using a loss function.
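The remaining steps (propagating the error backwards to obtain the gradients, then updating the parameters) are shown in this minimal NumPy sketch for a two-layer sigmoid network with MSE loss; the sizes, data, and learning rate are assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # assumed sizes: 3 -> 4 -> 2
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
x, y = np.array([0.5, -1.0, 2.0]), np.array([1.0, 0.0])

# Forward pass
a1 = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ a1 + b2)

# 1. Compute loss (MSE)
loss = np.mean((y_hat - y) ** 2)

# 2. Backpropagate to get the gradient w.r.t. every weight and bias
d_z2 = (2 / y.size) * (y_hat - y) * y_hat * (1 - y_hat)   # sigmoid': f(z)(1 - f(z))
dW2, db2 = np.outer(d_z2, a1), d_z2
d_z1 = (W2.T @ d_z2) * a1 * (1 - a1)
dW1, db1 = np.outer(d_z1, x), d_z1

# 3. Update each parameter in the opposite direction of its gradient
alpha = 0.1                                     # assumed learning rate
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2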
Training NNs
The network parameters include the weight matrices and bias
vectors from all layers
θ = {W_1, b_1, W_2, b_2, ⋯, W_L, b_L}
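For example (a sketch with assumed layer sizes), θ can be stored as a list of (W, b) pairs, one per layer:

import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [256, 128, 64, 10]                # assumed: 256 inputs, two hidden layers, 10 outputs

# theta = {W_1, b_1, ..., W_L, b_L}
theta = [(rng.normal(scale=0.01, size=(n_out, n_in)), np.zeros(n_out))
         for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]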
[Figure: digit-recognition example — a 16 × 16 image provides 256 inputs x_1, …, x_256; the network ends in a softmax layer whose 10 outputs correspond to the digit classes (y_1 for "1", y_2 for "2", …, y_10 for "0").]
Training NNs
Data preprocessing - helps convergence during training
Mean subtraction
Zero-centered data
Subtract the mean for each individual data dimension (feature)
Normalization
Divide each data dimension (feature) by its standard deviation
To obtain standard deviation of 1 for each data dimension (feature)
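A minimal NumPy sketch of both steps, assuming X is an (examples × features) data matrix:

import numpy as np

X = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(1000, 256))      # assumed data matrix

X_zero_centered = X - X.mean(axis=0)                          # mean subtraction per feature
X_normalized = X_zero_centered / X_zero_centered.std(axis=0)  # unit standard deviation per feature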
Training NNs
To train a NN, set the parameters such that, for the images in the training set, the output element corresponding to the true class has the maximum value
Training NNs
Define an objective function/cost function/loss function that
calculates the difference between the model prediction and the true
label
E.g., can be mean-squared error, cross-entropy, etc.
[Figure: for an input image with true label "1", the predicted outputs y_1, y_2, … are compared with the target to compute the cost ℒ(θ).]
Training NNs
For n training images, calculate the total loss: ℒ(θ) = ∑_{i=1}^{n} ℒ_i(θ)
Find the optimal NN parameters θ* that minimize the total loss ℒ(θ)
[Figure: each training input x_i is passed through the NN to produce a prediction ŷ_i, which is compared with its label y_i to give a per-example loss ℒ_i(θ), for i = 1, …, n.]
Training NNs
Optimizing the loss function
Almost all DL models these days are trained with a variant of the gradient
descent (GD) algorithm
GD applies iterative refinement of the network parameters
GD uses the opposite direction of the gradient of the loss ℒ(θ) with respect to the NN parameters (i.e., ∂ℒ/∂θ_i) for updating θ_i
The gradient of the loss function gives the direction of fastest increase of the loss function when the parameters are changed
Training NNs
The loss functions for most DL tasks are defined over very high-
dimensional spaces
E.g., ResNet50 NN has about 23 million parameters
This makes the loss function impossible to visualize
We can still gain intuitions by studying 1-dimensional and 2-
dimensional examples of loss functions
[Figures: a 1D loss curve (the minimum point is obvious) and a 2D loss surface (blue = low loss, red = high loss).]
Gradient descent steps:
1. Randomly pick a starting point θ^0
2. Compute the gradient at θ^0, ∇ℒ(θ^0)
3. Multiply by the learning rate α and update: θ^1 = θ^0 − α∇ℒ(θ^0)
4. Go to step 2 and repeat
For two parameters w_1 and w_2: ∇ℒ(θ^0) = [∂ℒ(θ^0)/∂w_1, ∂ℒ(θ^0)/∂w_2]^T
[Figure: contour plot of ℒ over (w_1, w_2), showing the step in the direction −∇ℒ(θ^0) taken from θ^0 toward the minimum θ*.]
Subsequent iterations apply the same update rule:
θ^2 = θ^1 − α∇ℒ(θ^1)
θ^3 = θ^2 − α∇ℒ(θ^2), and so on, until convergence
[Figure: the sequence θ^0, θ^1, θ^2, … moving downhill over the (w_1, w_2) loss surface.]
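A minimal sketch of these iterations on a toy 2-parameter loss (the quadratic loss, starting point, and learning rate are assumptions for illustration):

import numpy as np

def loss(theta):
    """Toy loss over (w1, w2) with its minimum at (3, -1)."""
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def grad(theta):
    """Analytic gradient of the toy loss."""
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

alpha = 0.1                                     # learning rate (step size)
theta = np.array([0.0, 0.0])                    # 1. pick a starting point

for step in range(100):                         # 2-4. compute gradient, update, repeat
    theta = theta - alpha * grad(theta)

print(theta, loss(theta))                       # converges toward (3, -1)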
Backpropagation
How to calculate the gradients of the loss function in NNs?
There are two ways:
1. Numerical gradient: slow, approximate, but easy way
2. Analytic gradient: requires calculus, fast, but more error-
prone way
In practice the analytic gradient is used
Analytical differentiation for gradient computation is available
in almost all deep learning libraries
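A sketch contrasting the two on the toy loss from above (the finite-difference step h is an assumed value):

import numpy as np

def loss(theta):
    w1, w2 = theta
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def analytic_grad(theta):
    """Exact gradient, derived with calculus."""
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

def numerical_grad(f, theta, h=1e-5):
    """Slow, approximate: central finite differences, two loss evaluations per parameter."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta = np.array([0.5, 2.0])
print(analytic_grad(theta))                     # fast and exact
print(numerical_grad(loss, theta))              # useful as a check on the analytic gradient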
Stochastic Gradient Descent
Stochastic gradient descent
SGD uses mini-batches that consist of a single input example
E.g., one image mini-batch
[Figure: cost as a function of θ, marking points where ∇ℒ(θ) ≈ 0 (plateau) and where ∇ℒ(θ) = 0 (saddle point, local minimum).]
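A minimal sketch of SGD with single-example updates on a toy regression problem (the data and learning rate are assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: y = 2x + 1 plus noise
X = rng.uniform(-1, 1, size=200)
Y = 2.0 * X + 1.0 + 0.1 * rng.normal(size=200)

w, b = 0.0, 0.0                                 # parameters
alpha = 0.05                                    # learning rate

for epoch in range(20):
    for i in rng.permutation(len(X)):           # one example per update (mini-batch of size 1)
        err = (w * X[i] + b) - Y[i]             # error on this single example
        w -= alpha * 2 * err * X[i]             # gradient of the per-example squared loss
        b -= alpha * 2 * err

print(w, b)                                     # ≈ 2 and ≈ 1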
GD with Momentum
Movement = Negative of Gradient + Momentum
A common formulation of the momentum update: v_t = γ·v_{t−1} + α·∇ℒ(θ), θ ← θ − v_t
[Figure: cost as a function of θ; at a point where the gradient = 0, the momentum term keeps the parameters moving — the real movement is the sum of the negative gradient and the momentum.]
GD with Nesterov Momentum
The momentum term γ·v_{t−1} allows us to predict the position of the parameters in the next step (i.e., θ − γ·v_{t−1})
The gradient is calculated with respect to the approximate future position of the parameters in the next step: v_t = γ·v_{t−1} + α·∇ℒ(θ − γ·v_{t−1}), θ ← θ − v_t
[Figure: comparison of the update vectors for GD with momentum and GD with Nesterov momentum.]
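A minimal sketch of both update rules on the toy loss from before (γ and α are assumed values):

import numpy as np

def grad(theta):
    """Gradient of the toy loss (w1 - 3)^2 + 2*(w2 + 1)^2."""
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

alpha, gamma = 0.1, 0.9                         # learning rate and momentum coefficient

# GD with momentum: v accumulates a decaying sum of past gradients
theta_m, v = np.array([0.0, 0.0]), np.zeros(2)
for _ in range(100):
    v = gamma * v + alpha * grad(theta_m)
    theta_m = theta_m - v

# GD with Nesterov momentum: gradient at the look-ahead position theta - gamma*v
theta_n, v = np.array([0.0, 0.0]), np.zeros(2)
for _ in range(100):
    v = gamma * v + alpha * grad(theta_n - gamma * v)
    theta_n = theta_n - v

print(theta_m, theta_n)                         # both approach the minimum (3, -1)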
Learning Rate
Learning rate
The gradient tells us the direction in which the loss has the steepest rate of
increase, but it does not tell us how far along the opposite direction we
should step
Choosing the learning rate (also called the step size) is one of the most
important hyper-parameter settings for NN training
[Figure: learning rate too small vs. learning rate too large.]
Learning Rate
Training loss for different learning rates
High learning rate: the loss increases or plateaus too quickly
Low learning rate: the loss decreases too slowly (takes many epochs to
reach a solution)
Approach 2
Reduce the learning rate by a constant factor (e.g., by half) whenever the validation loss stops improving
In TensorFlow: tf.keras.callbacks.ReduceLROnPlateau()
Monitor: validation loss
Factor: 0.1 (i.e., divide by 10)
Patience: 10 (how many epochs to wait before applying it)
Minimum learning rate: 1e-6 (stop reducing once this floor is reached)
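For example, with the values listed above (`model`, `x_train`, etc. are assumed to exist):

import tensorflow as tf

reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss",   # watch the validation loss
    factor=0.1,           # multiply the learning rate by 0.1 when triggered
    patience=10,          # epochs to wait without improvement before reducing
    min_lr=1e-6,          # do not reduce the learning rate below this
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])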
Adam
Adaptive Moment Estimation (Adam)
Adam computes adaptive learning rates for each dimension of θ
Similar to GD with momentum, Adam computes a weighted average of past gradients, i.e., m_t = β_1·m_{t−1} + (1 − β_1)·∇ℒ(θ)
Adam also computes a weighted average of past squared gradients, i.e., v_t = β_2·v_{t−1} + (1 − β_2)·(∇ℒ(θ))²
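A minimal sketch of the full Adam update on the toy loss, including the bias-corrected moments (the hyper-parameter values are the commonly used defaults, assumed here):

import numpy as np

def grad(theta):
    """Gradient of the toy loss (w1 - 3)^2 + 2*(w2 + 1)^2."""
    w1, w2 = theta
    return np.array([2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)])

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
theta = np.array([0.0, 0.0])
m, v = np.zeros(2), np.zeros(2)                 # first and second moment estimates

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g             # weighted average of past gradients
    v = beta2 * v + (1 - beta2) * g ** 2        # weighted average of past squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)  # per-dimension adaptive step

print(theta)                                    # approaches the minimum (3, -1)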
[Figure: a multi-layer network from inputs x_1, …, x_N to outputs y_1, …, y_M; parts of the network receiving small gradients learn very slowly.]
Generalization
Underfitting
The model is too “simple” to represent all
the relevant class characteristics
Model with too few parameters
High error on the training set and high
error on the testing set
Overfitting
The model is too "complex" and fits irrelevant characteristics (noise) in the data
Model with too many parameters
Low error on the training set and high error on the testing set
[Figure legend: blue line – decision boundary learned by the model; green line – optimal decision boundary]
Overfitting
A model with high capacity fits the noise in the data instead of
the underlying relationship