06 AIS302 ANN Backpropagation

This lecture focuses on the optimization of artificial neural networks through backpropagation, detailing the steps involved in training a neural network, including forward propagation, loss computation, and weight updates. It discusses various activation functions, their advantages and disadvantages, and emphasizes the importance of choosing appropriate functions for effective training. Additionally, it covers advanced optimization techniques like momentum and Adam to enhance learning efficiency.


AIS302: ANN (Artificial Neural Networks)

Lecture 6: Optimization - Backpropagation


Spring 2025

Dr. Ensaf Hussein


Associate Professor, Artificial Intelligence,
School of Information Technology and Computer Science,
Nile University.
Course Map: Selected Topics in Deep Learning

- Foundation concepts
- Shallow NN: training, parameters
- Deep Computer Vision: convolutional NN, object detection, pre-trained models, transfer learning
- Deep Sequence Modeling: recurrent NN, LSTM, Transformers
- Deep Generative Models: VAE, GAN
- Deep Reinforcement Learning
2
Lecture 6
Optimization - Backpropagation
and
Evaluation Metrics

Lectures are based on:


• Traditional Learning: Machine Learning, Andrew Ng [full course]
• Stanford University CS231n, Deep Learning for Computer Vision
• MIT Introduction to Deep Learning | 6.S191
3
Recap: Linear Classifier Training Steps
• Step 1: Start with a random W and b
• Step 2: Calculate the score function. Given our input feature vectors, the score function takes these data points, applies some function f (our score function), and returns the predicted class labels. ➔ Forward Propagation
• Step 3: Calculate the loss function (i.e. the error). It quantifies what it means to have a "good" W and measures our unhappiness with the outcomes.
• Step 4: Optimization step. This is the process of finding the set of parameters W that minimizes the loss function, taken in small steps (scaled by the learning rate). ➔ Backward Propagation
• Step 5: Repeat Steps 2 to 4 for a specified number of iterations (i.e. number of epochs) or until the loss is near zero (until convergence).

4
Perceptron &
Activation Functions
Non-linear classifier

5
The Perceptron: Forward Propagation

ŷ = g(w0 + Σ_{i=1..m} wi · xi)

where:
- m is the number of inputs to the node (cell)
- wi is the weight of input i
- xi is the input sample (feature)
- g is the activation function
- ŷ is the hypothesis (output)

6
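A minimal sketch of this forward pass in NumPy (the sigmoid choice for g and all numeric values here are illustrative assumptions, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, w0, g=sigmoid):
    """Single-neuron forward pass: y_hat = g(w0 + sum_i w_i * x_i)."""
    z = w0 + np.dot(w, x)   # weighted sum of the m inputs plus bias
    return g(z)

# Example with m = 3 inputs (illustrative values)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
print(perceptron_forward(x, w, w0=0.3))
```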
Activation Function

7
Activation Functions

Common choices:
- Sigmoid
- tanh
- ReLU
- Leaky ReLU
- Maxout
- ELU

An important property of an activation function is that it should be differentiable.
8
Activation Functions: Sigmoid (logistic function)

σ(x) = 1 / (1 + e^(-x))

- Squashes numbers to the range [0, 1]
- Historically popular since it has a nice interpretation as a saturating "firing rate" of a neuron

Three problems:
1. Saturated neurons "kill" the gradients (vanishing gradient).
   A neuron is saturated when it outputs values near 0 or 1: there the local gradient is nearly zero, so upstream gradients get multiplied by a tiny number and cannot backpropagate through the network, and earlier layers stop learning. Gradients only flow while the input stays in the "active region" near the center of the sigmoid.
2. Sigmoid outputs are not zero-centered.
   This causes a zigzag path of gradient updates on the way to the minimum.
3. exp() is somewhat expensive to compute.

11
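A quick numerical illustration of the saturation problem (a sketch; the probe values of z are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # local gradient dσ/dz

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid(z) = {sigmoid(z):.5f}  grad = {sigmoid_grad(z):.2e}")
# At z = 10 the local gradient is ~5e-05: anything multiplied by it
# during backprop is effectively "killed".
```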
Activation Functions: tanh(x)

- Squashes numbers to the range [-1, 1]
- Zero-centered (nice)
- Still kills gradients when saturated :(

[LeCun et al., 1991]

12
Activation Functions: ReLU (Rectified Linear Unit)
[Krizhevsky et al., 2012]

Computes f(x) = max(0, x)

- Does not saturate (in the + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice (e.g. 6x)

- Not zero-centered output
- Dying ReLU problem: a form of the vanishing gradient problem
  (hint: what is the gradient when x < 0?)

13
ReLU (Rectified Linear Unit) Activation Function

Features:
1. Does not saturate (in the + region): for x > 0, ReLU has a constant gradient of 1, avoiding the vanishing gradient issue in this region.
2. Very computationally efficient: just a simple comparison (max(0, x)), much cheaper than sigmoid/tanh.
3. Converges much faster than sigmoid/tanh: ReLU helps networks train up to 6x faster due to its strong gradients and simple computation.

Problems:
1. Not zero-centered output: outputs are in [0, ∞), which can still cause imbalanced weight updates.
2. Dying ReLU problem: if x < 0, the gradient is 0, meaning the neuron stops updating its weights and becomes inactive ("dead"). This is a form of the vanishing gradient problem.

14
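A short sketch of ReLU and its gradient, illustrating why a neuron whose pre-activations are always negative stops learning (the input values are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)   # 1 for z > 0, 0 otherwise

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))        # [0.  0.  0.  0.5 3. ]
print(relu_grad(z))   # [0. 0. 0. 1. 1.] -- zero gradient for z <= 0
# If a neuron's pre-activation is negative for every training input,
# its weights receive zero gradient and never update: a "dead" ReLU.
```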
Reasons for Dead ReLU

- Dying ReLU problem: when inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation and cannot learn.

The issue can happen when:
(1) Very unlucky initialization of your network may cause the neuron to only activate in a region outside of your data cloud; such a dead ReLU will never become activated and will never update.
=> people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
(2) A high learning rate can cause saturated (dead) neurons => they never update.

[Figure: an active ReLU crosses the data cloud; a dead ReLU lies outside it and will never activate]
15
Activation Functions: Leaky ReLU
[Mass et al., 2013] [He et al., 2015]

f(x) = max(0.01x, x)

- Does not saturate
- Computationally efficient
- Converges much faster than sigmoid/tanh in practice! (e.g. 6x)
- Will not "die"

Parametric Rectifier (PReLU): f(x) = max(αx, x)
- backprop into α (a learned parameter)
16
Activation Functions: Exponential Linear Units (ELU)
[Clevert et al., 2015]

- All benefits of ReLU
- Closer-to-zero mean outputs
- Negative saturation regime (compared with Leaky ReLU) adds some robustness to noise

- Computation requires exp()
17
Maxout "Neuron"
[Goodfellow et al., 2013]

- Does not have the basic form of dot product -> nonlinearity
- Generalizes ReLU and Leaky ReLU: max(w1ᵀx + b1, w2ᵀx + b2)
- Linear regime! Does not saturate! Does not die!

Problem: doubles the number of parameters per neuron :(
18
TLDR: In practice:

- Use ReLU. Be careful with your learning rates


- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don’t expect much
- Don’t use sigmoid

19
Importance of Activation Function

20
Forward-Propagation of a Neuron

21
Backward-Propagation of a Neuron

22
ANN Steps
Forward and Backward propagation

23
Notation

- a_i^(j) = "activation" of unit i in layer j
- Θ^(j) = matrix of weights controlling the function mapping from layer j to layer j+1
- h_Θ(x) = the network output (hypothesis)

What is the dimension of Θ^(1)?  = 4 × 3

In general, if layer j has s_j units (not counting the bias unit) and layer j+1 has s_{j+1} units, the size of Θ^(j) is s_{j+1} × (s_j + 1).

[Figure: network diagram showing the input layer, hidden-layer activations, and weights Θ_10 … Θ_33]

Forward Propagation

[Figure: network with bias units x0 = a0^(2) = a0^(3) = 1; each layer computes net inputs z_i^(j) and activations a_i^(j), ending in the output a^(4) = h_Θ(x)]

One sample: (x^(i), y^(i))
26
Forward Propagation

For one sample (x^(i), y^(i)), unit 2 of layer 2 computes:

z_2^(2) = Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2

a_2^(2) = g(z_2^(2)) = 1 / (1 + e^(-z_2^(2)))
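A vectorized sketch of forward propagation under these conventions (the layer sizes, and therefore the Θ^(j) shapes, are illustrative and follow the s_{j+1} × (s_j + 1) rule):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """Forward-propagate one sample through fully connected sigmoid layers.
    thetas[j] has shape (s_{j+1}, s_j + 1): one row per unit in the next
    layer, one column per unit in the current layer plus the bias."""
    a = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # prepend bias unit a_0 = 1
        z = theta @ a                   # z^(j+1) = Θ^(j) a^(j)
        a = sigmoid(z)                  # a^(j+1) = g(z^(j+1))
    return a                            # h_Θ(x)

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(4, 3)),   # Θ^(1): 2 inputs + bias -> 4 hidden
          rng.normal(size=(1, 5))]   # Θ^(2): 4 hidden + bias -> 1 output
print(forward(np.array([0.5, -0.2]), thetas))
```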
Cost Function

• Neural network: use the logistic (cross-entropy) loss, summed over all output units:

  J(Θ) = -(1/m) Σ_{i=1..m} Σ_{k=1..K} [ y_k^(i) log(h_Θ(x^(i)))_k + (1 - y_k^(i)) log(1 - (h_Θ(x^(i)))_k) ]

  m = number of samples
  K = number of classes
Gradient Computation

Need code to compute:
- J(Θ)
- the partial derivatives ∂J(Θ)/∂Θ_ij^(l)

(Don't initialize the weights with zero; start from random values)
Training a neural network (Cont.)

34
• Initialize Weights – start with random values for the model's parameters.
• Forward Propagation – compute predictions by passing inputs through the network.
• Compute Cost Function – measure how far predictions are from actual values using a loss function.
• Backpropagation – calculate how the weights should be adjusted by computing gradients.
• Gradient Checking – verify that the backpropagation gradients are correct using numerical estimation.
• Optimization – update the weights using gradient descent (or advanced methods) to minimize the cost.

The goal is to minimize the cost function J(Θ) by updating the weights Θ.

Gradient Descent:
• Update the weights using the computed gradients:

  Θ := Θ - α · ∂J(Θ)/∂Θ

  where α is the learning rate.
• This gradually moves the parameters toward values that reduce the error.

Advanced Optimization Methods (see the sketch below):
• Momentum: accelerates learning by considering past gradients.
• Adam (Adaptive Moment Estimation): adapts the learning rate for each parameter dynamically.
• RMSprop: normalizes gradients to stabilize learning.
35
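A minimal sketch of these three update rules for a single weight vector (the hyperparameter defaults shown are common choices, an assumption rather than values from the slides):

```python
import numpy as np

def sgd_momentum(w, grad, v, lr=0.01, beta=0.9):
    """Momentum: accumulate a velocity from past gradients."""
    v = beta * v - lr * grad
    return w + v, v

def rmsprop(w, grad, s, lr=0.001, decay=0.9, eps=1e-8):
    """RMSprop: normalize by a running average of squared gradients."""
    s = decay * s + (1 - decay) * grad**2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum + RMSprop with bias correction (t = step count, from 1)."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)   # bias-corrected first moment
    v_hat = v / (1 - b2**t)   # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```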
In the Next Example

Focusing on a single example (x^(i), y^(i)) and ignoring regularization (λ = 0):

cost(i) = y^(i) log h_Θ(x^(i)) + (1 - y^(i)) log(1 - h_Θ(x^(i)))

(Think of cost(i) ≈ (h_Θ(x^(i)) - y^(i))²)
Numerical Example

Given: the initial weights, the biases, and the training inputs/outputs.

• https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
37
Numerical Example: Forward Pass - Hidden Layer

• Calculate the total net input for h1:

  net_h1 = w1 · i1 + w2 · i2 + b1 · 1

• Apply the sigmoid function to get the output of h1:

  out_h1 = σ(net_h1)

• Carrying out the same process for h2, we get out_h2.
38
Numerical Example: Forward Pass - Output Layer

Repeat the same process for the output layer:
• Calculate the total net input for o1:

  net_o1 = w5 · out_h1 + w6 · out_h2 + b2 · 1

• Apply the sigmoid function to get the output of o1:

  out_o1 = σ(net_o1)

• Carrying out the same process for o2, we get out_o2.
39
Numerical Example: Calculate the Total Error

• Calculate the error for each output neuron using the squared error function and sum them to get the total error:

  E_total = Σ ½ (target - output)²

  (used as an approximate cost for one sample)

• For example, the target output for o1 is 0.01 but the neural network output 0.75136507, therefore its error is:

  E_o1 = ½ (0.01 - 0.75136507)² = 0.274811083

• Repeating this process for o2 (remembering that the target is 0.99) we get E_o2 = 0.023560026.

• The total error for the neural network is the sum of these errors:

  E_total = E_o1 + E_o2 = 0.274811083 + 0.023560026 = 0.298371109
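A sketch reproducing this forward pass in Python; the initial weights w1..w8 and biases b1, b2 are taken from the linked Mazur walkthrough rather than from the slide text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Initial values from the linked Mazur example (assumed, not shown on the slides)
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99   # targets

# Forward pass: hidden layer
out_h1 = sigmoid(w1*i1 + w2*i2 + b1)   # 0.593269992
out_h2 = sigmoid(w3*i1 + w4*i2 + b1)   # 0.596884378

# Forward pass: output layer
out_o1 = sigmoid(w5*out_h1 + w6*out_h2 + b2)   # 0.75136507
out_o2 = sigmoid(w7*out_h1 + w8*out_h2 + b2)   # 0.772928465

# Total squared error
E_total = 0.5*(t1 - out_o1)**2 + 0.5*(t2 - out_o2)**2
print(E_total)   # 0.298371109
```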
Numerical Example: Backward Pass - Output Layer

• Consider w5: we want to know how much a change in w5 affects the total error. By applying the chain rule we know that:

  ∂E_total/∂w5 = ∂E_total/∂out_o1 · ∂out_o1/∂net_o1 · ∂net_o1/∂w5
Numerical Example: Backward Pass - Output Layer

1- The change of the total error with respect to the output o1

• Error definition:
  - The total error is calculated using the squared difference between the target output and the actual output.
  - This measures how far the network's prediction (out_o1) is from the desired value (target_o1).
• Derivative meaning:
  - The derivative tells us how much the total error changes when the output o1 changes.
  - Since the error is ½(target - output)², taking the derivative follows the power rule.
• Step-by-step derivation:
  - The squared error function for one output: E_o1 = ½(target_o1 - out_o1)²
  - Differentiate with respect to the output: ∂E_total/∂out_o1 = 2 · ½(target_o1 - out_o1) · (-1)
  - Simplifies to: ∂E_total/∂out_o1 = -(target_o1 - out_o1) = out_o1 - target_o1
  - The negative sign means that if the output is too large, the error will decrease if we lower it.
• Why is this useful?
  - This derivative helps adjust the weights in backpropagation to reduce the error.
  - It tells us whether we need to increase or decrease the output to get closer to the target.
43
Numerical Example: Backward Pass - Output Layer

2- The change of the output o1 with respect to its total net input (the sigmoid derivative):

  ∂out_o1/∂net_o1 = out_o1 · (1 - out_o1)
Numerical Example: Backward Pass - Output Layer

3- The change of the total net input of o1 with respect to w5:

  ∂net_o1/∂w5 = out_h1
Numerical Example: Backward Pass - Output Layer

4- Putting it all together:

  ∂E_total/∂w5 = (out_o1 - target_o1) · out_o1 · (1 - out_o1) · out_h1 = 0.082167041
Numerical Example: Backward Pass - Output Layer

• 5- To decrease the error, we then subtract this value from the current weight (optionally multiplied by some learning rate α, which we'll set to 0.5):

  w5⁺ = w5 - α · ∂E_total/∂w5 = 0.4 - 0.5 · 0.082167041 = 0.35891648

• We can repeat this process for all the weights of the output layer to get the new weights w6, w7 and w8.
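Continuing the same sketch (using the variables from the forward-pass code above), the gradient and update for w5:

```python
# Backward pass for w5 (continuing the forward-pass sketch above)
dE_douto1   = out_o1 - t1                  # step 1: -(target - output)
douto1_dnet = out_o1 * (1.0 - out_o1)      # step 2: sigmoid derivative
dnet_dw5    = out_h1                       # step 3
dE_dw5 = dE_douto1 * douto1_dnet * dnet_dw5
print(dE_dw5)            # 0.082167041

alpha = 0.5              # learning rate
w5_new = w5 - alpha * dE_dw5
print(w5_new)            # 0.35891648
```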
Numerical Example: Backward Pass - Hidden Layer

• We'll continue the backward pass by calculating new values for w1, w2, w3 and w4:

  ∂E_total/∂w1 = ∂E_total/∂out_h1 · ∂out_h1/∂net_h1 · ∂net_h1/∂w1

  where  ∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1
Numerical Example: Backward Pass - Hidden Layer

∂E_total/∂w1 = ∂E_total/∂out_h1 · ∂out_h1/∂net_h1 · ∂net_h1/∂w1

where  ∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1

1- Starting with:

  ∂E_o1/∂out_h1 = ∂E_o1/∂out_o1 · ∂out_o1/∂net_o1 · ∂net_o1/∂out_h1

  (using values calculated earlier; note that ∂net_o1/∂out_h1 = w5, the original, not-yet-updated weight)
Numerical Example: Backward Pass - Hidden Layer

∂E_total/∂w1 = ∂E_total/∂out_h1 · ∂out_h1/∂net_h1 · ∂net_h1/∂w1

where  ∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1

2- ∂out_h1/∂net_h1 = out_h1 · (1 - out_h1) = 0.241300709

3- ∂net_h1/∂w1 = i1 = 0.05
Numerical Example: Backward Pass - Hidden Layer

4- Putting it all together:

  ∂E_total/∂w1 = ∂E_total/∂out_h1 · ∂out_h1/∂net_h1 · ∂net_h1/∂w1
               = 0.036350306 · 0.241300709 · 0.05 = 0.000438568
Numerical Example: Backward Pass - Hidden Layer

5- We can update w1:

  w1⁺ = w1 - α · ∂E_total/∂w1

We can repeat this process for all the weights of the hidden layer to get the new weights w2, w3 and w4.
Finally,
➢ When we fed forward the 0.05 and 0.1 inputs originally,
  • the total error was 0.298371109.
➢ After this first round of backpropagation,
  • the total error is now 0.291027924.
➢ After repeating this process 10,000 times, for example,
  • the total error drops to 0.0000351085.
➢ At this point, when we feed forward 0.05 and 0.1, the two output neurons generate
  • 0.015912196 (vs 0.01 target) and
  • 0.984065734 (vs 0.99 target).
Hypothesis Evaluation

55
To Evaluate hypothesis
• One way to break down our dataset into the three sets is:
• Training set: 60%
• Cross validation set: 20% (This validation set is essentially used as a fake test
set to tune the hyper-parameters)
• Test set: 20%

60% 20% 20%

56
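A sketch of the 60/20/20 split using scikit-learn's train_test_split (the data here are placeholders):

```python
from sklearn.model_selection import train_test_split
import numpy as np

X, y = np.arange(1000).reshape(-1, 1), np.arange(1000)

# First carve off 40% of the data, then split that part half-and-half
# into validation and test sets -> 60% / 20% / 20%.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```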
Why is a validation set important?

1. Fit models of increasing complexity (e.g. polynomial degree d = 1, 2, 3, ..., 10).
2. Choose the degree d whose model gives the lowest error.
3. How well does the model generalize? Report the test set error.

Problem: if d is chosen using the test set, this error is likely to be an optimistic estimate of the generalization error; i.e. our extra parameter (d = degree of polynomial) is fit to the test set.
57
Model Selection

• We can now calculate three separate error values for the three different sets using the following method:
1. Optimize the parameters Θ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross-validation set.
3. Estimate the generalization error using the test set with J_test(Θ^(d)), where Θ^(d) are the parameters of the polynomial degree that had the lowest cross-validation error.
• This way, the degree of the polynomial d has not been trained using the test set.

59
Cross-Validation

• For small datasets, we sometimes use a more sophisticated technique for hyperparameter tuning called cross-validation.
• Instead of arbitrarily picking the first data points to be the validation set and the rest the training set,
• we get a better and less noisy estimate of how well hyperparameters work by iterating over different validation sets and averaging the performance across them.
61
Cross-Validation
• For example: 5-fold cross-validation
1. Split the training data into 5 equal folds (parts),
2. Use 4 of them for training, and 1 for validation.
3. Iterate over which fold is the validation fold, and evaluate the performance,
4. Finally average the performance across the different folds.

62
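A sketch of 5-fold cross-validation with scikit-learn's KFold (the model and synthetic data are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])  # train on 4 folds
    scores.append(model.score(X[val_idx], y[val_idx]))            # validate on the 5th
print(np.mean(scores))   # average performance across the folds
```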
Performance Metrics

63
Accuracy in a Classification Model

• Accuracy is measured as the percentage of predicted results that match the expected results.

• Ex: if there are 1000 results and 850 predicted results match the expected results, then the accuracy is 85%.
64
Problem with the accuracy metric (measure): Skewed classes

• Skewed classes refer to a dataset wherein the number of training examples belonging to one class heavily outnumbers the number of training examples belonging to the other.

• Consider a binary classification (cancer is labelled 1 and not-cancer is labelled 0), where a cancerous patient is to be detected based on some features:
  - only 1% of the data provided is cancer-positive.
• If a system naively predicts all 0's, the prediction accuracy will still be 99%.
65
Commonly used Metrics
• Accuracy is only one metric.

• Other metrics commonly used are:


- Precision
- Recall (Sensitivity)
- Specificity
- F1-score
- ROC AUC

67
Confusion Matrix
• The confusion matrix is a performance measurement technique that
visualizes the accuracy of a classifier by comparing the actual and
predicted classes.
• It is called a confusion matrix because it shows how confused the
model is between the classes.
• The class of interest is commonly called the positive class, and the rest the negative class.

68
Binary Confusion Matrix

                       Predicted class
                       Positive   Negative
Actual    Positive     TP         FN
class     Negative     FP         TN
69
Example of a Confusion Matrix

• If class "Daisy" is the positive class (y = 1), then:
  TP = 9, FN = 1, FP = 2, TN = 8

                     Predicted Label
                     Daisy   Tulip
True      Daisy      9       1
Label     Tulip      2       8
70
Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

                     Predicted Label
                     1     0
True      1          9     1
Label     0          2     8

▪ Accuracy = (9 + 8) / 20 = 0.85
71
Precision

Precision = TP / (TP + FP)   (out of all predicted positives)

                     Predicted Label
                     1     0
True      1          9     1
Label     0          2     8

▪ Precision = 9 / (9 + 2) ≈ 0.818
72
Recall (Sensitivity)

Recall = TP / (TP + FN)   (out of all true positives)

                     Predicted Label
                     1     0
True      1          9     1
Label     0          2     8

▪ Recall = 9 / (9 + 1) = 0.9
73
Specificity

• Specificity = True Negative Rate (out of all true negatives)

Specificity = TN / (TN + FP)

                     Predicted Label
                     1     0
True      1          9     1
Label     0          2     8

▪ Specificity = 8 / (8 + 2) = 0.8
74
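A quick check of these four metrics on the Daisy/Tulip counts above (TP = 9, FN = 1, FP = 2, TN = 8):

```python
TP, FN, FP, TN = 9, 1, 2, 8   # counts from the Daisy/Tulip example

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)      # a.k.a. sensitivity, true positive rate
specificity = TN / (TN + FP)      # true negative rate

print(accuracy, precision, recall, specificity)
# 0.85 0.8181818181818182 0.9 0.8
```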
Precision/Recall for Skewed Data
(in the presence of a rare class, e.g. "has cancer", that we want to detect)

                      Actual Class
                      1                0
Predicted    1        True Positive    False Positive
Class        0        False Negative   True Negative

Precision (of all patients where we predicted y = 1, what fraction actually has cancer?)

  Precision = True positives / # predicted positives = True positives / (True positives + False positives)

Recall (of all patients that actually have cancer, what fraction did we correctly detect as having cancer?)

  Recall = True positives / # actual positives = True positives / (True positives + False negatives)

Now, if we evaluate a scenario where the classifier predicts all 0's, then TP = 0 and the recall of the model will be 0, which points out the inability of the system.
75
Trading off Precision and Recall

Logistic regression: predict 1 if h_θ(x) ≥ threshold, predict 0 if h_θ(x) < threshold (threshold = 0.5, 0.7, 0.9, 0.3, ...).

• Suppose we want to predict 1 (cancer) only if very confident → raise the threshold (e.g. 0.7 or 0.9):
  higher precision, lower recall
• Suppose we want to avoid missing too many cases of cancer (avoid false negatives) → lower the threshold (e.g. 0.3):
  higher recall, lower precision

[Figure: precision vs. recall curve traced out as the threshold varies between 0 and 1]
76
Threshold

• More generally: predict 1 if h_θ(x) ≥ threshold.
77
F1 Score (F score)

How do we compare precision/recall numbers?

               Precision (P)   Recall (R)   Average   F1 Score
Algorithm 1    0.5             0.4          0.45      0.444
Algorithm 2    0.7             0.1          0.4       0.175
Algorithm 3    0.02            1.0          0.51      0.0392

Average: (P + R) / 2

F1 Score: 2 · P · R / (P + R)
78
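A quick check of the table's numbers, showing why the F1 score (a harmonic mean) is preferred over the plain average:

```python
def f1(p, r):
    return 2 * p * r / (p + r)   # harmonic mean of precision and recall

for name, p, r in [("Algorithm 1", 0.5, 0.4),
                   ("Algorithm 2", 0.7, 0.1),
                   ("Algorithm 3", 0.02, 1.0)]:
    print(name, round((p + r) / 2, 4), round(f1(p, r), 4))
# The harmonic mean punishes extreme precision/recall imbalances
# that the plain average hides (compare Algorithm 3's 0.51 vs 0.0392).
```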
ROC Curve

79
ROC Curve

80
How to Plot ROC Curve?

81
How to Plot ROC Curve?

82
AUC

The ROC curve is a useful tool for a few reasons:

• The curves of different models can be compared directly, either overall or at specific thresholds.
• The area under the curve (AUC) can be used as a summary of the model's skill.

83
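A sketch of plotting an ROC curve and computing AUC with scikit-learn (roc_curve and roc_auc_score are standard scikit-learn functions; the labels and scores here are made up for illustration):

```python
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

y_true  = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]                        # actual labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.5, 0.6, 0.3]   # model scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # sweep the threshold
print("AUC =", roc_auc_score(y_true, y_score))

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False Positive Rate (1 - specificity)")
plt.ylabel("True Positive Rate (recall)")
plt.title("ROC curve")
plt.show()
```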
Model Diagnosis

95
Debugging a Learning Algorithm

• Suppose that when you test your hypothesis on a new test set, you find that it makes unacceptably large errors in its predictions. What should you try next?

- Get more training examples
- Try smaller sets of features
- Try getting additional features
- Try adding polynomial features
- Try decreasing the regularization parameter λ
- Try increasing the regularization parameter λ
96
Machine Learning Diagnostic
• Diagnostic: A test that you can run to gain insight into what
is/isn’t working with a learning algorithm, and gain guidance
as to how best to improve its performance.

• Diagnostics can take time to implement, but doing so can be


a very good use of your time.

97
Bias/Variance as a Function of the Degree of Polynomial

Suppose your learning algorithm is performing less well than you were hoping (J_cv(Θ) or J_test(Θ) is high). Is it a bias problem or a variance problem?

- Bias (underfit): the training error is high, and the cross-validation error is close to the training error.
- Variance (overfit): the training error is low, and the cross-validation error is much higher than the training error.

[Figure: training error and cross-validation error plotted against the degree of polynomial d; training error falls as d grows, while cross-validation error is U-shaped]
98
Debugging a Learning Algorithm: What Each Fix Addresses

- Get more training examples → fixes high variance
- Try smaller sets of features → fixes high variance
- Try getting additional features → fixes high bias
- Try adding polynomial features → fixes high bias
- Try decreasing λ → fixes high bias
- Try increasing λ → fixes high variance
99
Thanks
100
