
05 AIS302 ANN-Optimization

Lecture 5 of AIS302 covers methods to improve Deep Neural Networks (DNNs) through hyperparameter tuning, regularization, and optimization techniques. It discusses various loss functions used in regression and classification tasks, including Mean Absolute Error, Mean Squared Error, and Cross-Entropy Loss, along with the implications of overfitting and the role of regularization. Additionally, the lecture introduces optimization strategies such as Gradient Descent and its variants to minimize loss functions effectively.

Uploaded by

Hana El Gabry

AIS302: ANN (Artificial Neural Networks)

Lecture 5: Improving DNNs: Hyperparameter Tuning, Regularization and Optimization
Spring 2025
Dr. Ensaf Hussein
Associate Professor, Artificial Intelligence,
School of Information Technology and Computer Science,
Nile University.
Course Map: Deep Learning
[Figure: course map. Tracks: Foundation concepts; Shallow Artificial NN; Deep Computer Vision; Deep Sequence Modeling; Deep Generative Models; Deep Reinforcement. Topics shown: Training Parameters, Convolutional NN, Object Detection, Pre-trained Models, Transfer Learning, Recurrent NN, LSTM, Transformers, VAE, GAN.]

Lecture 5
Improving DNNs: Hyperparameter
tuning, Regularization and Optimization

Lectures are based on:
• Traditional Learning: Machine Learning, Andrew Ng [Full course]
• Stanford University CS231n, Deep Learning for Computer Vision
• MIT Introduction to Deep Learning | 6.S191

What is a Loss Function?
• A mathematical function that measures how well a model's
predictions match the actual values.
• Provides feedback to improve model performance.
• Goal: Minimize loss to improve accuracy.

How Do Loss Functions Work?
1. The model makes a prediction.
2. The loss function compares the prediction with the actual value.
3. It calculates the error (loss).
4. An optimization algorithm (e.g., Gradient Descent) adjusts the model parameters to minimize the loss.

Types of Loss Functions
• Loss functions are divided into:
• Regression Loss Functions (Continuous output)
• Classification Loss Functions (Categorical output)
• Ranking Loss Functions (Ordering tasks)
• Image & Reconstruction Loss Functions (Computer vision)
• Adversarial Loss Functions (GANs)

Regression Loss Functions
• Loss functions used in regression problems:
1. Mean Absolute Error (MAE) / L1 Loss: the mean of the absolute differences between the true values and the predicted values.
$$\text{L1 loss} = \frac{1}{N} \sum_{i=1}^{N} \left| y_{\text{true},i} - y_{\text{predicted},i} \right|$$
2. Mean Squared Error (MSE) / L2 Loss: the mean of the squared differences between the true values and the predicted values.
$$\text{L2 loss} = \frac{1}{N} \sum_{i=1}^{N} \left( y_{\text{true},i} - y_{\text{predicted},i} \right)^2$$
N = no. of samples, y_true = true label, y_predicted = predicted label
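The MAE and MSE definitions can be sketched in a few lines of plain Python. The function names and the sample values below are illustrative assumptions, not part of the lecture.

```python
def mean_absolute_error(y_true, y_pred):
    """L1 loss: mean of the absolute differences over all N samples."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def mean_squared_error(y_true, y_pred):
    """L2 loss: mean of the squared differences over all N samples."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


# Hypothetical true and predicted values, chosen only for illustration.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print("MAE:", mae)
print("MSE:", mse)
```

Because MSE squares each difference, the single error of size 1 contributes most of the MSE here, while MAE weights all errors linearly.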
Regression Loss Functions
Huber Loss: combines the advantages of MSE and MAE. It is less
sensitive to outliers than MSE and differentiable everywhere, unlike
MAE.
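• Huber Loss is defined as:
$$L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \le \delta \\ \delta |e| - \frac{1}{2} \delta^2 & \text{if } |e| > \delta \end{cases}$$
where e = y_true − y_predicted is the error and δ is the threshold between the two regimes.

How It Works:
1. For small errors (|e| ≤ δ):
   1. The loss behaves like MSE ($\frac{1}{2} e^2$), making it smooth and differentiable.
   2. This helps with stable training.
2. For large errors (|e| > δ):
   1. The loss behaves like MAE ($\delta |e| - \frac{1}{2} \delta^2$), reducing sensitivity to outliers.
   2. This prevents huge gradients caused by large errors, unlike MSE.
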
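A minimal sketch of the Huber loss, following the two cases described above. Averaging over the samples and the default δ = 1 are assumptions made for this example; the sample values are hypothetical.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |e| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = abs(t - p)
        if e <= delta:
            total += 0.5 * e ** 2                   # MSE-like region
        else:
            total += delta * e - 0.5 * delta ** 2   # MAE-like region
    return total / len(y_true)


# Hypothetical data with one outlier (error of 10).
y_true = [0.0, 0.0, 0.0]
y_pred = [0.5, 1.0, 10.0]

huber = huber_loss(y_true, y_pred, delta=1.0)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
print("Huber:", huber)
print("MSE:  ", mse)
```

The outlier contributes 9.5 to the Huber sum but 100 to the squared-error sum, which is the reduced sensitivity to outliers the slide refers to.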
Regression Loss Functions

In the plotted graph:
• MSE Loss (dashed) grows quadratically, making it highly sensitive to large errors.
• MAE Loss (dotted) grows linearly but is not smooth at zero.
• Huber Loss (solid red) behaves like MSE near zero but switches to MAE for large errors, balancing robustness and smooth optimization.

Classification Loss Functions
• Loss functions used in classification problems:
1. Hinge Loss / SVM Loss: It is mainly used for "maximum-margin" classification. Even if new observations are classified correctly, they can incur a penalty if their margin from the decision boundary is not large enough. The margin used here is Δ = 1.

Classification Loss Functions
• Loss functions used in classification problems:
2. Cross-entropy Loss / Logistic Loss / Multinomial Logistic Loss: a generalized form of the log loss, used for multi-class classification problems. For a perfect model, the log loss is 0.
$$\text{Loss}_{\text{crossentropy}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(p_{ij})$$
where: p_ij = probability of the ith sample belonging to the jth class, y_ij = true label (1 if the ith sample belongs to the jth class, 0 otherwise), K = number of classes

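The multi-class cross-entropy formula can be sketched as follows. The one-hot label layout, the small `eps` clamp that avoids log(0), and the sample probabilities are assumptions made for this illustration.

```python
import math


def categorical_cross_entropy(y_true, probs, eps=1e-12):
    """Mean of -sum_j y_ij * log(p_ij) over N samples.

    y_true: list of one-hot rows; probs: list of predicted probability rows.
    """
    n = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, probs):
        for y, p in zip(y_row, p_row):
            total += y * math.log(max(p, eps))  # clamp to avoid log(0)
    return -total / n


# Hypothetical batch: 2 samples, K = 3 classes.
y_true = [[1, 0, 0], [0, 1, 0]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]

loss = categorical_cross_entropy(y_true, probs)
perfect = categorical_cross_entropy([[1, 0, 0]], [[1.0, 0.0, 0.0]])
print("loss:", loss)
print("perfect model loss:", perfect)
```

Only the probability assigned to the true class enters the sum, so a model that puts probability 1 on the correct class for every sample reaches a loss of 0.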
Classification Loss Functions
• Loss functions used in classification problems:
Log Loss or Binary Cross-Entropy:
It is a Sigmoid activation plus a Cross-Entropy loss.
$$\text{Loss}_{\text{binary}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
p_i = predicted probability that the ith sample belongs to the positive class (y = 1)
• If y = 1: error → ∞ as p → 0, and error = 0 when p = 1.
• If y = 0: error → ∞ as p → 1, and error = 0 when p = 0.

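A small sketch of binary cross-entropy on already-computed probabilities (i.e., after the sigmoid). The `eps` clamp and the sample values are assumptions added for numerical safety and illustration.

```python
import math


def binary_cross_entropy(y_true, probs, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over N samples."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Hypothetical predictions: confident and correct vs. confident and wrong.
good = binary_cross_entropy([1, 0], [0.9, 0.1])
bad = binary_cross_entropy([1, 0], [0.01, 0.99])
print("confident and correct:", good)
print("confident and wrong:  ", bad)
```

The second call shows the behaviour described above: the loss grows quickly as the predicted probability moves away from the true label.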
Classification Loss Functions
• Loss functions used in classification Problems:
Categorical Cross-Entropy / Softmax loss: It is a Softmax activation plus a Cross-Entropy loss. With this loss, we train a network to output a probability over the C classes for each image. It is used for multi-class classification.
Activation function (softmax): $f(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{C} e^{s_j}}$
Loss: $CE = -\sum_{i=1}^{C} t_i \log(f(s)_i)$
where t_i = true label, f(s)_i = softmax function, C = no. of classes

Other loss functions

Multiclass SVM Loss Example
• The multiclass SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin delta (∆ = 1).
$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$
• This means that each term of L_i is equal to 0 if $s_{y_i} \ge s_j + 1$, and to $s_j - s_{y_i} + 1$ otherwise.
• Where: $s_{y_i}$ = score of the true label of sample i, $s_j$ = score of the jth (other) class
Example
• Suppose we have 3 training examples and 3 classes. With some W, the scores f(x, W) = Wx are (one column per training example; the true classes are cat, car and frog, in that order):

        Example 1   Example 2   Example 3
cat        3.2         1.3         2.2
car        5.1         4.9         2.5
frog      -1.7         2.0        -3.1

Multiclass SVM loss: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$

Example 1 (true class: cat):
L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
    = max(0, 2.9) + max(0, -3.9)
    = 2.9 + 0
    = 2.9

Losses: 2.9
Fei-Fei Li & Justin Johnson & Serena Yeung

Example
• Suppose we have 3 training examples and 3 classes. With some W, the scores f(x, W) = Wx are (one column per training example; the true classes are cat, car and frog, in that order):

        Example 1   Example 2   Example 3
cat        3.2         1.3         2.2
car        5.1         4.9         2.5
frog      -1.7         2.0        -3.1

Multiclass SVM loss: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$

Example 2 (true class: car):
L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0
    = 0

Losses: 2.9, 0

Example
• Suppose we have 3 training examples and 3 classes. With some W, the scores f(x, W) = Wx are (one column per training example; the true classes are cat, car and frog, in that order):

        Example 1   Example 2   Example 3
cat        3.2         1.3         2.2
car        5.1         4.9         2.5
frog      -1.7         2.0        -3.1

Multiclass SVM loss: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$

Example 3 (true class: frog):
L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6
    = 12.9

Losses: 2.9, 0, 12.9
• What is the minimum and maximum possible loss L_i?
• What is the loss when W is small and the scores s are near 0?
Total Loss
• Until now, we calculated the loss for each sample i.
• The loss over the full dataset is the average:
$$\text{Total loss} = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i)$$
where f(x_i, W) is the score function of image i.
• In our example:
Total loss = (2.9 + 0 + 12.9)/3 = 5.27

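The worked multiclass SVM example can be checked with a short sketch. The function name and the list-based layout of the scores are assumptions; the score values and the margin of 1 come from the slides.

```python
def multiclass_svm_loss(scores, true_class, margin=1.0):
    """L_i = sum over j != y_i of max(0, s_j - s_{y_i} + margin)."""
    s_true = scores[true_class]
    return sum(
        max(0.0, s_j - s_true + margin)
        for j, s_j in enumerate(scores)
        if j != true_class
    )


# One row per training example; class order is [cat, car, frog].
scores = [
    [3.2, 5.1, -1.7],   # example 1, true class cat (index 0)
    [1.3, 4.9, 2.0],    # example 2, true class car (index 1)
    [2.2, 2.5, -3.1],   # example 3, true class frog (index 2)
]
labels = [0, 1, 2]

losses = [multiclass_svm_loss(s, y) for s, y in zip(scores, labels)]
total_loss = sum(losses) / len(losses)
print("per-example losses:", [round(l, 2) for l in losses])
print("total loss:", round(total_loss, 2))

# When all scores are near 0, every incorrect class contributes the margin.
near_zero_loss = multiclass_svm_loss([0.0, 0.0, 0.0], true_class=0)
print("loss with all-zero scores:", near_zero_loss)
```

With all scores at 0 the loss equals the number of classes minus one (here 2), which is a handy sanity check at the start of training. The minimum possible loss is 0 and there is no upper bound.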
Softmax loss Example

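Note:
1. We add some meaning to the scores: the softmax function turns them into class probabilities.
2. The target probability of the correct class (cat) is 1 and of the others is 0.
3. We want to put all the probability mass on the correct class.
4. log is a monotonic function, so the log-probability is easier to maximize.
5. We use −log of the correct-class probability as the loss.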
Softmax loss Example

◮ What is the minimum and maximum possible loss Li ?


◮ What is the loss when W is small and scores s near to 0?
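A sketch that explores these two questions numerically. It assumes the cat example's scores from the SVM slides (3.2, 5.1, -1.7) and the natural logarithm; both choices are assumptions for illustration, since the slide image is not reproduced here.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_loss(scores, true_class):
    """Negative log-probability of the correct class."""
    return -math.log(softmax(scores)[true_class])


cat_scores = [3.2, 5.1, -1.7]          # class order: [cat, car, frog]
probs = softmax(cat_scores)
loss = softmax_loss(cat_scores, true_class=0)
print("probabilities:", [round(p, 2) for p in probs])
print("loss for the cat example:", round(loss, 2))

# Scores near 0 give a uniform distribution, so the loss is log(C).
uniform_loss = softmax_loss([0.0, 0.0, 0.0], true_class=0)
print("loss with all-zero scores:", uniform_loss)
```

The loss approaches its minimum of 0 as the correct-class probability approaches 1, and it grows without bound as that probability approaches 0. With scores near 0 the loss is log(C), where C is the number of classes.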
Softmax vs. SVM Loss

Fei-Fei Li & Justin Johnson & Serena Yeung

What is missing in this Loss function?

Overfitting
Problem

The Problem of Overfitting
• Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.

• At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

Regularization

Regularization: a technique to discourage the complexity of the model (i.e., to express preferences over weights). It does this by adding a penalty term to the loss function. This helps to solve the overfitting problem.

Regularization parameter λ
• Regularization works on the assumption that smaller weights produce a simpler model and thus help avoid overfitting.

• λ is the regularization parameter, which determines how much to penalize the weights.

• If λ = 0, the regularization term becomes zero (back to the original loss function).

• If λ is large, the weights become close to zero (i.e., a very simple model that underfits).

• λ is a hyperparameter between 0 and a large value.

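Written out, the role of λ can be sketched as below. This is the standard form of a regularized loss, stated here as an assumption about the notation: the first term is the average data loss over N samples, and R(W) is the regularization function on the weights.

```latex
\[
L(W) \;=\; \underbrace{\frac{1}{N} \sum_{i=1}^{N} L_i\bigl(f(x_i, W),\, y_i\bigr)}_{\text{data loss}}
\;+\; \underbrace{\lambda \, R(W)}_{\text{regularization}}
\]
```

Setting λ = 0 removes the second term and recovers the original loss, while a very large λ makes the penalty dominate and pushes the weights toward zero.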
Regularization function R(W)
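Simple examples:
• L2 regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
• L1 regularization: $R(W) = \sum_k \sum_l |W_{k,l}|$
• Elastic net (L1 + L2): $R(W) = \sum_k \sum_l \left( \beta W_{k,l}^2 + |W_{k,l}| \right)$, where β weights the L2 term
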
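A minimal sketch of the three penalties and of how one of them enters the total loss. The function names, the weight matrices, the data-loss value, and the λ and β values are all illustrative assumptions.

```python
def l2_penalty(W):
    """Sum of squared weights."""
    return sum(w * w for row in W for w in row)


def l1_penalty(W):
    """Sum of absolute weights."""
    return sum(abs(w) for row in W for w in row)


def elastic_net_penalty(W, beta=0.5):
    """Weighted L2 term plus L1 term (beta is an assumed mixing weight)."""
    return beta * l2_penalty(W) + l1_penalty(W)


def regularized_loss(data_loss, W, lam, penalty=l2_penalty):
    """Total loss = data loss + lambda * R(W)."""
    return data_loss + lam * penalty(W)


# Two hypothetical weight vectors that give the same score on x = [1, 1, 1, 1].
W_peaked = [[1.0, 0.0, 0.0, 0.0]]
W_spread = [[0.25, 0.25, 0.25, 0.25]]

print("L2:", l2_penalty(W_peaked), l2_penalty(W_spread))
print("L1:", l1_penalty(W_peaked), l1_penalty(W_spread))
print("regularized:", regularized_loss(5.27, W_spread, lam=0.1))
```

The L2 penalty is lower for the spread-out weights (0.25 versus 1.0), while the L1 penalty is the same for both, so only L2 expresses a preference between them in this case.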
L1 regularization

• Built-in feature selection: L1 regularization performs feature selection. It does this by assigning zero weight to insignificant input features and non-zero weight to useful features.

• Sparsity: L1 regularization shrinks many parameters to exactly zero, which leads to a sparse solution. In a sparse solution, the majority of the input features have zero weights and very few features have non-zero weights.

L2 regularization

• L2 regularization forces the weights to be small but does not make them zero, so it yields a non-sparse solution. L2 regularization likes to "spread out" the weights.

• L2 has no feature selection; it gives better predictions when the output variable is a function of all input features.

• L2 regularization is able to learn complex data patterns.

Optimization
Optimization is the process of finding the set of parameters W that minimizes the loss function.

Simple Optimizer:
• Gradient Descent (GD)
More Complex Optimizers:
• Momentum
• AdaGrad
• RMSProp
• Adam

Gradient Descent (GD)
• Following the Gradient
Main idea: take iterative steps to update the parameters W in the direction of the negative gradient.
[Figure: plot of the cost function J(w_0, w_1).]

Gradient Descent: Problem
• A problem with GD is that it can converge to different locations depending on the initial W. It can get stuck in local minima.
[Figure: plot of the cost function J(w_0, w_1).]

Gradient Descent: Algorithm
• We take steps down the cost function in the direction of steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate.
• The gradient descent algorithm is:

repeat until convergence {
    $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w_0, w_1)$
}

where α is the learning rate (step size).
[Figure: cost function J(w_1) plotted against w_1.]
• Positive slope (positive number) → w_1 will decrease
• Negative slope (negative number) → w_1 will increase

Andrew Ng
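The update rule can be sketched as a short loop. The quadratic cost J(w_0, w_1) = (w_0 − 3)² + (w_1 + 1)², the stopping tolerance, and the function names are assumptions chosen so the minimum (3, −1) is known in advance.

```python
def gradient_descent(grad, w_init, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Repeat w_j := w_j - alpha * dJ/dw_j until the update is tiny."""
    w = list(w_init)
    for _ in range(max_iters):
        g = grad(w)
        # Simultaneous update of every parameter from the same gradient.
        new_w = [w_j - alpha * g_j for w_j, g_j in zip(w, g)]
        if max(abs(a - b) for a, b in zip(new_w, w)) < tol:
            return new_w
        w = new_w
    return w


def grad_J(w):
    """Gradient of the assumed cost J(w0, w1) = (w0 - 3)^2 + (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w_star = gradient_descent(grad_J, w_init=[0.0, 0.0], alpha=0.1)
print("estimated minimum:", [round(v, 4) for v in w_star])
```

Starting at w_0 = 0 the slope is negative, so w_0 increases toward 3; starting at w_1 = 0 the slope is positive, so w_1 decreases toward −1, matching the two rules above.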
Gradient Descent variants
• There are three variants of gradient descent, based on the amount of data used to calculate the gradient:

1. Batch gradient descent
2. Stochastic gradient descent
3. Mini-batch gradient descent

GD Variants: Batch Gradient Descent
• Batch Gradient Descent, aka Vanilla gradient descent, calculates the error for each observation in the dataset but performs an update only after all observations have been evaluated.

• One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.

• Batch gradient descent is not often used, because it consumes a huge amount of computational resources, as the entire dataset needs to remain in memory.

GD Variants: Stochastic Gradient Descent (SGD)
• Stochastic gradient descent, often abbreviated SGD, is a variation of
the gradient descent algorithm that calculates the error and updates
the model for each example in the training dataset.

• SGD is usually faster than batch gradient descent, but its frequent updates cause a higher variance in the error rate, which can sometimes jump around instead of decreasing.

• The noisy update process can allow the model to avoid local minima (e.g. premature convergence).

GD Variants: Mini-Batch Gradient Descent
• Mini-batch gradient descent seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

• It is the most common implementation of gradient descent used in the field of deep learning.

• It splits the training dataset into small batches that are used to calculate model error and update model coefficients.

• "Batch size" is a hyperparameter, commonly set to a power of 2: 32, 64, 128, 256, and so on.

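The three variants differ only in how many samples feed each update, which a small batching helper can illustrate. The helper name `iterate_minibatches`, the fixed shuffle seed, and the 100-sample dataset are hypothetical choices for this sketch.

```python
import random


def iterate_minibatches(data, batch_size, shuffle=True, seed=0):
    """Yield consecutive batches of `batch_size` samples (last may be smaller)."""
    indices = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]


data = list(range(100))  # stand-in for 100 training samples

# One parameter update would be performed per yielded batch.
batch_gd_sizes = [len(b) for b in iterate_minibatches(data, batch_size=len(data))]
sgd_sizes = [len(b) for b in iterate_minibatches(data, batch_size=1)]
mini_sizes = [len(b) for b in iterate_minibatches(data, batch_size=32)]

print("batch GD updates per epoch:  ", len(batch_gd_sizes))
print("SGD updates per epoch:       ", len(sgd_sizes))
print("mini-batch sizes (size 32):  ", mini_sizes)
```

Batch gradient descent corresponds to a batch size equal to the dataset size (one update per epoch), SGD to a batch size of 1 (one update per sample), and mini-batch gradient descent to anything in between.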
Gradient Descent: Feature Scaling
• We can speed up gradient descent by having each of our input values in roughly the same range.
Rule of thumb regarding acceptable ranges:
• -3 to +3 is generally fine; any bigger is bad
• -1/3 to +1/3 is OK; any smaller is bad

Feature scaling: $x_i := \frac{x_i}{\max_i}$, where max_i = largest value of x

Mean normalization: $x_i := \frac{x_i - \mu_i}{s_i}$, where μ_i = mean value of x and s_i = range of values (max − min) or the standard deviation

[Figure: un-scaled features.]
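Both scaling rules can be sketched with the standard library. The function names and the feature values are hypothetical; dividing by the largest value assumes that value is positive.

```python
import statistics


def scale_by_max(values):
    """Feature scaling: divide each value by the largest value of the feature."""
    largest = max(values)  # assumed to be positive
    return [v / largest for v in values]


def mean_normalize(values, use_std=False):
    """Mean normalization: (x - mean) / s, with s = range or standard deviation."""
    mu = statistics.mean(values)
    s = statistics.pstdev(values) if use_std else max(values) - min(values)
    return [(v - mu) / s for v in values]


# Hypothetical un-scaled feature, e.g. house sizes.
sizes = [2104.0, 1416.0, 1534.0, 852.0]

scaled = scale_by_max(sizes)
normalized = mean_normalize(sizes)
print("scaled by max:  ", [round(v, 3) for v in scaled])
print("mean-normalized:", [round(v, 3) for v in normalized])
```

After mean normalization with the range as s_i, the values are centred on 0 and span a total width of 1, which sits comfortably within the rule-of-thumb range.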
Learning Rate
• The gradient tells us the direction, but it does not tell us how far
along this direction we should step.

• The learning rate (step size) determines how big the step would be
on each iteration. It determines how fast or slow we will move
towards the optimal weights.

• The learning rate is one of the most important hyperparameter settings in training a neural network.

Learning Rate
• If the learning rate is too large, gradient descent may overshoot the minimum and fail to converge.
• If the learning rate is very small, it takes a long time to converge and becomes computationally expensive.
• The most commonly used rates are:
0.001, 0.003, 0.01 (default), 0.03, 0.1, 0.3

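The effect of the learning rate can be seen on a toy problem. This sketch assumes the one-parameter cost J(w) = w², whose gradient is 2w; the three rates are picked only to show slow convergence, good convergence, and divergence.

```python
def run_gd(alpha, w0=1.0, steps=20):
    """Run gradient descent on J(w) = w^2 and record the cost after each step."""
    w = w0
    costs = []
    for _ in range(steps):
        w -= alpha * 2.0 * w   # gradient of w^2 is 2w
        costs.append(w * w)
    return costs


small = run_gd(alpha=0.001)   # very small rate: barely moves in 20 steps
good = run_gd(alpha=0.1)      # moderate rate: cost shrinks steadily
large = run_gd(alpha=1.1)     # too large: overshoots and the cost grows

print("final cost, alpha=0.001:", small[-1])
print("final cost, alpha=0.1:  ", good[-1])
print("final cost, alpha=1.1:  ", large[-1])
```

On this cost each step multiplies w by (1 − 2α), so any α above 1 makes the magnitude of w grow at every step, which is the overshooting behaviour described above.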
Learning Rate
• Plot the cost function over the iterations for different values of the learning rate.
- If gradient descent is working properly, the cost function should decrease after every iteration.
- When gradient descent can't decrease the cost function anymore and it remains more or less at the same level, we say it has converged.

Note:
If you see in the plot that your learning curve is just going up and down, without really reaching a lower point, you should also try decreasing the learning rate.

Learning Rate Schedules
• When training deep neural networks, it is often useful to reduce the learning rate as training progresses. This can be done by using pre-defined learning rate schedules or adaptive learning rate methods.

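Two common pre-defined schedules, step decay and exponential decay, can be sketched as follows. The lecture does not name specific schedules, so these two, along with the drop factor, drop interval, and decay constant, are assumptions for illustration.

```python
import math


def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


def exponential_decay(initial_lr, epoch, k=0.1):
    """Shrink the learning rate smoothly: lr = lr0 * exp(-k * epoch)."""
    return initial_lr * math.exp(-k * epoch)


for epoch in (0, 5, 10, 20, 30):
    print(
        epoch,
        round(step_decay(0.1, epoch), 5),
        round(exponential_decay(0.1, epoch), 5),
    )
```

Step decay keeps the rate constant within each block of epochs and then halves it, while exponential decay lowers it a little at every epoch.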
Thanks