05 AIS302 ANN-Optimization

Lecture 5 of AIS302 covers methods to improve Deep Neural Networks (DNNs) through hyperparameter tuning, regularization, and optimization techniques. It discusses various loss functions used in regression and classification tasks, including Mean Absolute Error, Mean Squared Error, and Cross-Entropy Loss, along with the implications of overfitting and the role of regularization. Additionally, the lecture introduces optimization strategies such as Gradient Descent and its variants to minimize loss functions effectively.

Uploaded by

Hana El Gabry

AIS302: ANN (Artificial Neural Networks)

Lecture 5: Improving DNNs: Hyperparameter tuning, Regularization and Optimization
Spring 2025
Dr. Ensaf Hussein
Associate Professor, Artificial Intelligence,
School of Information Technology and Computer Science,
Nile University.
Course Map: Deep Learning

• Parts: Foundation concepts, Shallow Artificial NN, Deep Computer Vision, Deep Sequence Modeling, Deep Generative Models, Deep Reinforcement
• Topics: Training Parameters, Convolutional NN, Object Detection, Recurrent NN, VAE, Pre-trained Models, LSTM, GAN, Transfer Learning, Transformers

2
Lecture 5
Improving DNNs: Hyperparameter
tuning, Regularization and Optimization

Lectures are based on:

• Traditional Learning: Machine learning Andrew Ng [Full course]
• Stanford University CS231n,Deep Learning for Computer Vision
• MIT Introduction to Deep Learning | 6.S191
4
What is a Loss Function?
• A mathematical function that measures how well a model's
predictions match the actual values.
• Provides feedback to improve model performance.
• Goal: Minimize loss to improve accuracy.

5
How Loss Functions Work?
1. The model makes a prediction.
2. The loss function compares the prediction with the actual value.
3. It calculates the error (loss).
4. Optimization algorithms (e.g., Gradient Descent) adjust the model parameters to minimize the loss.

6
Types of Loss Functions
• Loss functions are divided into:
• Regression Loss Functions (Continuous output)
• Classification Loss Functions (Categorical output)
• Ranking Loss Functions (Ordering tasks)
• Image & Reconstruction Loss Functions (Computer vision)
• Adversarial Loss Functions (GANs)

7
Regression Loss Functions
• Loss functions used in Regression Problems:
1. Mean Absolute Error / L1 Loss: the mean of the absolute differences between the true values and the predicted values.

$\text{L1 loss} = \frac{1}{N} \sum_{i=1}^{N} |y_{true} - y_{predicted}|$

2. Mean Squared Error / L2 Loss: the mean of the squared differences between the true values and the predicted values.

$\text{L2 loss} = \frac{1}{N} \sum_{i=1}^{N} (y_{true} - y_{predicted})^2$

N = number of samples, y_true = true label, y_predicted = predicted label
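The two regression losses above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; the function names and the toy data are assumptions chosen for the example.

```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean Absolute Error (L1 loss): mean of |y_true - y_pred|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def mse_loss(y_true, y_pred):
    """Mean Squared Error (L2 loss): mean of (y_true - y_pred)^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical toy data, chosen only for illustration.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("MAE:", mae_loss(y_true, y_pred))  # errors 0.5, 0.5, 0, 1 -> mean 0.5
print("MSE:", mse_loss(y_true, y_pred))  # squares 0.25, 0.25, 0, 1 -> mean 0.375
```

Because MSE squares each error, the single error of 1 contributes 1.0 of the 1.5 total squared error, while in MAE it contributes only half of the total absolute error.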
8
Regression Loss Functions
Huber Loss: combines the advantages of MSE and MAE. It is less sensitive to outliers than MSE and, unlike MAE, differentiable everywhere.
• For an error e (the difference between the true and predicted values) and a threshold δ, Huber Loss is defined as:

$L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \le \delta \\ \delta |e| - \frac{1}{2} \delta^2 & \text{if } |e| > \delta \end{cases}$

How It Works:
1. For small errors (|e| ≤ δ):
   1. The loss behaves like MSE (½e²), making it smooth and differentiable.
   2. This helps with stable training.
2. For large errors (|e| > δ):
   1. The loss behaves like MAE (δ|e| − ½δ²), reducing sensitivity to outliers.
   2. This prevents huge gradients caused by large errors, unlike MSE.
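The two branches described above can be sketched with NumPy as follows. The function name, the default δ = 1.0, and the averaging over samples are assumptions made for illustration.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |e| <= delta, linear beyond it."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_e = np.abs(e)
    quadratic = 0.5 * e ** 2                      # MSE-like branch
    linear = delta * abs_e - 0.5 * delta ** 2     # MAE-like branch
    return np.mean(np.where(abs_e <= delta, quadratic, linear))

# A small error uses the quadratic branch, a large one the linear branch.
print(huber_loss([0.0], [0.5]))   # 0.5 * 0.5^2 = 0.125
print(huber_loss([0.0], [3.0]))   # 1.0 * 3 - 0.5 = 2.5
```

At |e| = δ both branches give ½δ², so the loss is continuous where it switches from quadratic to linear.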

9
Regression Loss Functions

In the plotted graph:

• MSE Loss (dashed) grows quadratically, making it highly sensitive to large errors.
• MAE Loss (dotted) grows linearly but is not smooth at zero.
• Huber Loss (solid red) behaves like MSE near zero but switches to MAE for large errors, balancing robustness and smooth optimization.

10
Classification Loss Functions
• Loss functions used in classification Problems:
1. Hinge Loss / SVM Loss: mainly used for "maximum-margin" classification. Even observations that are classified correctly can incur a penalty if their margin from the decision boundary is not large enough (here the margin is Δ = 1).

11
Classification Loss Functions
• Loss functions used in classification Problems:
2. Cross-entropy Loss / Logistic Loss / Multinomial Logistic Loss: a generalized form of the log loss, used for multi-class classification problems. For a perfect model, the log loss value is 0.

$\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log(p_{ij})$

Where: p_ij = probability of the ith sample belonging to the jth class.

K = number of classes
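As a rough sketch of the multi-class cross-entropy above, assuming one-hot labels y_ij and predicted probabilities p_ij stored as N × K arrays. The function name and the small clipping constant `eps` (used only to avoid log(0)) are assumptions.

```python
import numpy as np

def cross_entropy_loss(y_onehot, probs, eps=1e-12):
    """Average of -sum_j y_ij * log(p_ij) over the N samples."""
    y = np.asarray(y_onehot, dtype=float)
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y * np.log(p), axis=1))

# Hypothetical labels and predictions for N = 2 samples, K = 3 classes.
y = [[1, 0, 0],
     [0, 1, 0]]
perfect = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0]]
uniform = [[1/3, 1/3, 1/3],
           [1/3, 1/3, 1/3]]

print(cross_entropy_loss(y, perfect))  # a perfect model gives a loss of 0
print(cross_entropy_loss(y, uniform))  # log(3), the loss of a uniform guess
```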

12
Classification Loss Functions
• Loss functions used in classification Problems:
Log Loss or Binary Cross-entropy:
It is a Sigmoid activation plus a Cross-Entropy loss.

$\text{Loss}_{\text{binary}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$

p_i = predicted probability of the ith sample

[Figure: error as a function of the prediction ŷ. If y = 1: error = 0 at ŷ = 1, and error → ∞ as ŷ → 0. If y = 0: error = 0 at ŷ = 0, and error → ∞ as ŷ → 1.]
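A minimal NumPy sketch of the sigmoid plus binary cross-entropy combination described above. The function names and the clipping constant `eps` are assumptions; clipping only keeps the logarithm finite at p = 0 or p = 1.

```python
import numpy as np

def sigmoid(z):
    """Map a raw score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over the samples."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Hypothetical raw scores for two samples with labels 1 and 0.
labels = [1, 0]
print(binary_cross_entropy(labels, sigmoid([0.0, 0.0])))   # p = 0.5 for both -> log(2)
print(binary_cross_entropy(labels, sigmoid([4.0, -4.0])))  # confident and correct: small loss
print(binary_cross_entropy(labels, sigmoid([-4.0, 4.0])))  # confident and wrong: large loss
```

The last two calls show the behaviour in the figure: the loss approaches 0 as the prediction approaches the true label and grows without bound as it approaches the wrong one.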

13
Classification Loss Functions
• Loss functions used in classification Problems:
Categorical Cross-Entropy / Softmax loss: a Softmax activation plus a Cross-Entropy loss. With this loss, we train a network to output a probability over the C classes for each image. It is used for multi-class classification.

$f(s)_i = \frac{e^{s_i}}{\sum_{j=1}^{C} e^{s_j}}, \qquad CE = -\sum_{i=1}^{C} t_i \log(f(s)_i)$

t_i = true label, f(s)_i = softmax function (the activation function), C = number of classes
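The softmax loss can be sketched for a single sample as below, assuming a vector of raw class scores s and a one-hot target, so that the loss reduces to −log of the probability assigned to the true class. The function names are assumptions; subtracting the maximum score is a common numerical-stability step that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    s = np.asarray(scores, dtype=float)
    s = s - np.max(s)            # stability shift; leaves the output unchanged
    exp_s = np.exp(s)
    return exp_s / np.sum(exp_s)

def softmax_loss(scores, true_class):
    """Cross-entropy with a one-hot target: -log(probability of true class)."""
    return -np.log(softmax(scores)[true_class])

# Scores (cat, car, frog) for the first example used in the SVM slides,
# with cat (index 0) as the true class.
scores = [3.2, 5.1, -1.7]
print(softmax(scores))          # roughly [0.13, 0.87, 0.00]
print(softmax_loss(scores, 0))  # roughly 2.04
```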

14
Other loss functions

15
Multiclass SVM Loss Example
• The multiclass SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin delta (Δ = 1).

$L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)$

• This means that each term of the loss L_i equals 0 if s_{y_i} ≥ s_j + 1, and s_j − s_{y_i} + 1 otherwise.

• Where: s_{y_i} = score of the true label of sample i,
s_j = score of the jth (other) class
16
Example
• Suppose we have 3 training examples and 3 classes. With some W, the scores f(x, W) = Wx are:

          Example 1   Example 2   Example 3
cat          3.2         1.3         2.2
car          5.1         4.9         2.5
frog        -1.7         2.0        -3.1
Losses:      2.9

Multiclass SVM loss: $L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)$

For example 1 (true class: cat):
L_1 = max(0, 5.1 - 3.2 + 1) + max(0, -1.7 - 3.2 + 1)
    = max(0, 2.9) + max(0, -3.9)
    = 2.9 + 0
    = 2.9
Fei-Fei Li & Justin Johnson & Serena Yeung

17
Example
• Suppose we have 3 training examples and 3 classes. With some W, the scores f(x, W) = Wx are:

          Example 1   Example 2   Example 3
cat          3.2         1.3         2.2
car          5.1         4.9         2.5
frog        -1.7         2.0        -3.1
Losses:      2.9          0

Multiclass SVM loss: $L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)$

For example 2 (true class: car):
L_2 = max(0, 1.3 - 4.9 + 1) + max(0, 2.0 - 4.9 + 1)
    = max(0, -2.6) + max(0, -1.9)
    = 0 + 0
    = 0

18
Example
• Suppose we have 3 training examples and 3 classes. With some W, the scores f(x, W) = Wx are:

          Example 1   Example 2   Example 3
cat          3.2         1.3         2.2
car          5.1         4.9         2.5
frog        -1.7         2.0        -3.1
Losses:      2.9          0         12.9

Multiclass SVM loss: $L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)$

For example 3 (true class: frog):
L_3 = max(0, 2.2 - (-3.1) + 1) + max(0, 2.5 - (-3.1) + 1)
    = max(0, 6.3) + max(0, 6.6)
    = 6.3 + 6.6
    = 12.9

• What is the minimum and maximum possible loss L_i?
• What is the loss when W is small and the scores s are near 0?
19
Total Loss
• Until now, we calculated the loss for each sample (i).
• The loss over the full dataset is the average:

$\text{Total loss} = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i)$

where f(x_i, W) is the score function of image (i).
• In our example:
Total loss = (2.9 + 0 + 12.9)/3 = 5.27
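The worked example and the averaging step can be checked with a short NumPy sketch. The function name and the array layout (rows are the classes cat, car, frog; columns are the three training examples) are assumptions made for illustration.

```python
import numpy as np

def multiclass_svm_loss(scores, true_class, delta=1.0):
    """L_i = sum over j != y_i of max(0, s_j - s_yi + delta)."""
    scores = np.asarray(scores, dtype=float)
    margins = np.maximum(0.0, scores - scores[true_class] + delta)
    margins[true_class] = 0.0   # the true class is excluded from the sum
    return margins.sum()

# Rows: cat, car, frog. Columns: training examples 1, 2, 3.
scores = np.array([[ 3.2, 1.3,  2.2],
                   [ 5.1, 4.9,  2.5],
                   [-1.7, 2.0, -3.1]])
true_classes = [0, 1, 2]        # cat, car, frog

losses = [multiclass_svm_loss(scores[:, i], y) for i, y in enumerate(true_classes)]
total_loss = np.mean(losses)

print(losses)       # approximately 2.9, 0.0, 12.9
print(total_loss)   # approximately 5.27
```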

20
Softmax loss Example

Note:
1. Softmax function: we add some meaning to the scores.
2. The probability of cat is 1 and of the others is 0.
3. We want to put all the probability mass on the correct class.
4. log is a monotonic function and is easier to maximize.
5. We use −log as the loss.
21
Softmax loss Example

◮ What is the minimum and maximum possible loss Li ?

◮ What is the loss when W is small and scores s near to 0?
22
Softmax vs. SVM Loss


23
What is missing in this Loss function?

Overfitting
Problem

24
The Problem of Overfitting
• Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.

• At the other extreme, overfitting, or high variance, is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.

25
Regularization

Regularization: a technique to discourage the complexity of the model (i.e., express preferences over weights). It does this by penalizing the loss function. This helps to solve the overfitting problem.

26
Regularization parameter λ
• Regularization works on the assumption that smaller weights generate a simpler model and thus help avoid overfitting.

• λ is the penalty term or regularization parameter, which determines how much to penalize the weights.

• If λ = 0, the regularization term becomes zero (back to the original loss function).

• If λ is large, the weights become close to zero (i.e. a very simple model that underfits).

• λ is a hyperparameter between 0 and a large value.

27
Regularization function R(W)
Simple Examples:
L2 regularization:
L1 regularization:
Elastic net (L1 + L2):
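These penalties are commonly written as R(W) = Σ_k Σ_l W_{k,l}² for L2, R(W) = Σ_k Σ_l |W_{k,l}| for L1, and R(W) = Σ_k Σ_l (β W_{k,l}² + |W_{k,l}|) for elastic net. A minimal NumPy sketch follows, assuming the regularized loss is the data loss plus λ·R(W); the function names and the weighting β are illustrative assumptions rather than notation taken from the lecture.

```python
import numpy as np

def l2_penalty(W):
    """Sum of squared weights."""
    return float(np.sum(np.asarray(W, dtype=float) ** 2))

def l1_penalty(W):
    """Sum of absolute weights."""
    return float(np.sum(np.abs(np.asarray(W, dtype=float))))

def elastic_net_penalty(W, beta=0.5):
    """Weighted mix of the L2 and L1 penalties (beta is an assumed knob)."""
    return beta * l2_penalty(W) + l1_penalty(W)

def regularized_loss(data_loss, W, lam, penalty=l2_penalty):
    """Data loss plus lambda times the chosen penalty R(W)."""
    return data_loss + lam * penalty(W)

# Hypothetical weight matrix and data loss, for illustration only.
W = np.array([[1.0, -2.0],
              [0.0,  3.0]])
print(l2_penalty(W))                   # 1 + 4 + 0 + 9 = 14
print(l1_penalty(W))                   # 1 + 2 + 0 + 3 = 6
print(elastic_net_penalty(W))          # 0.5 * 14 + 6 = 13
print(regularized_loss(1.0, W, 0.0))   # lambda = 0 recovers the original loss
```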

28
L1 regularization

• Built-in feature selection: L1 regularization does feature selection. It does this by assigning insignificant input features a zero weight and useful features a non-zero weight.

• Sparsity: L1 regularization shrinks the parameters toward zero, and many of them reach exactly zero, which gives a sparse weight vector. In a sparse solution, the majority of the input features have zero weights and very few features have non-zero weights.

29
L2 regularization

• L2 regularization forces the weights to be small but does not make them zero, so it gives a non-sparse solution. L2 regularization likes to "spread out" the weights.

• L2 has no feature selection; it gives better predictions when the output variable is a function of all input features.

• L2 regularization is able to learn complex data patterns.

30
Optimization

31
Optimization
Optimization is the process of finding the set of parameters W that minimize the loss function.

Simple Optimizer:
• Gradient Descent (GD)
More Complex Optimizers:
• Momentum
• AdaGrad
• RMSProp
• Adam

32
Gradient Descent (GD)
• Following the Gradient
Main Idea: take iterative steps to update the parameters W in the direction of the negative gradient.
Cost Function: J(θ0, θ1)

[Figure: surface plot of the cost function over θ0 and θ1.]

33
Gradient Descent: Problem
• A problem with GD is that it can converge to different locations depending on the initial W. It can get stuck in local minima.

Cost Function: J(θ0, θ1)

[Figure: surface plot of the cost function over θ0 and θ1.]

34
Gradient Descent: Algorithm
• We make steps down the cost function in the direction of steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate.
• The gradient descent algorithm is:

repeat until convergence {

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

}

where α is the learning rate (step size).

[Figure: cost function J(θ1) plotted against θ1.]
• Positive slope (positive number) → θ1 will decrease
• Negative slope (negative number) → θ1 will increase

Andrew Ng
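The update rule can be sketched in NumPy as below. The cost function J(θ0, θ1) = (θ0 − 3)² + (θ1 + 1)² is a hypothetical convex example chosen so the minimum is known, and a fixed number of iterations stands in for "repeat until convergence"; both are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.1, n_iters=200):
    """Repeatedly step against the gradient: theta := theta - alpha * grad."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(n_iters):
        # All components of theta are updated simultaneously.
        theta = theta - alpha * grad_fn(theta)
    return theta

def grad_J(theta):
    """Gradient of the assumed cost J = (theta0 - 3)^2 + (theta1 + 1)^2."""
    return np.array([2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)])

theta_star = gradient_descent(grad_J, theta0=[0.0, 0.0], alpha=0.1)
print(theta_star)  # close to the minimum at (3, -1)
```

Where the slope is positive the update subtracts a positive number and θ decreases; where it is negative, θ increases, matching the two cases listed above.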
Gradient Descent variants
• There are three variants of gradient descent based on the amount of
data used to calculate the gradient:

1. Batch gradient descent

2. Stochastic gradient descent
3. Mini-batch gradient descent

36
GD Variants: Batch Gradient Descent
• Batch Gradient Descent, aka Vanilla gradient descent, calculates the
error for each observation in the dataset but performs an update only
after all observations have been evaluated.

• One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.

• Batch gradient descent is not often used, because it consumes a large amount of computational resources, as the entire dataset needs to remain in memory.

37
GD Variants: Stochastic Gradient Descent (SGD)
• Stochastic gradient descent, often abbreviated SGD, is a variation of
the gradient descent algorithm that calculates the error and updates
the model for each example in the training dataset.

• SGD is usually faster than batch gradient descent, but its frequent updates cause a higher variance in the error rate, which can sometimes jump around instead of decreasing.

• The noisy update process can allow the model to avoid local minima
(e.g. premature convergence).

38
GD Variants: Mini-Batch Gradient Descent
• Mini-batch gradient descent seeks to find a balance between the
robustness of stochastic gradient descent and the efficiency of batch
gradient descent.

• It is the most common implementation of gradient descent used in the field of deep learning.

• It splits the training dataset into small batches that are used to calculate model error and update model coefficients.

• "Batch size" is a hyperparameter, commonly set to a power of 2: 32, 64, 128, 256, and so on.
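The three variants differ only in how many samples feed each update, which the following NumPy sketch tries to make concrete. It assumes a linear model with an MSE loss and noise-free synthetic data; the function name, data, and hyperparameter values are all illustrative assumptions.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, alpha=0.1, n_epochs=200, seed=0):
    """Mini-batch gradient descent for linear regression with MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):                 # one epoch = one pass over the data
        order = rng.permutation(n)            # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of mean((Xb @ w - yb)^2) with respect to w.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= alpha * grad                 # one update per mini-batch
    return w

# Hypothetical noise-free data generated from y = 2*x1 - 3*x2.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(256, 2))
y = X @ np.array([2.0, -3.0])

w = minibatch_gd(X, y, batch_size=32)
print(w)  # close to [2, -3]
```

Setting `batch_size` to the dataset size gives batch gradient descent (one update per epoch), and setting it to 1 gives stochastic gradient descent (one update per sample).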

39
Gradient Descent: Feature Scaling
• We can speed up gradient descent by having each of our input values in roughly the same range.
Rule of thumb regarding acceptable ranges:
• -3 to +3 is generally fine; any bigger is bad
• -1/3 to +1/3 is ok; any smaller is bad

Feature Scaling: $x_i = \frac{x_i}{max_i}$, where max_i = largest value of x.

Mean Normalization: $x_i = \frac{x_i - \mu_i}{s_i}$, where μ_i = mean value of x and s_i = range of values (max − min) or the standard deviation.

[Figure: un-scaled features.]
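Both rescaling rules can be sketched with NumPy, applied column by column to a feature matrix. The function names and the toy feature values are assumptions; the sketch also assumes no feature is constant and that each feature's largest value is positive, so the divisions are well defined.

```python
import numpy as np

def scale_by_max(X):
    """Feature scaling: divide each feature by its largest value."""
    X = np.asarray(X, dtype=float)
    return X / X.max(axis=0)

def mean_normalize(X, use_std=False):
    """Mean normalization: (x - mean) / s, with s the range or the std."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    s = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s

# Hypothetical un-scaled features with very different ranges.
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0],
              [ 852.0, 1.0]])

print(scale_by_max(X))     # every column now has a maximum of 1
print(mean_normalize(X))   # every column now has mean 0 and range 1
```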
Learning Rate
• The gradient tells us the direction, but it does not tell us how far
along this direction we should step.

• The learning rate (step size) determines how big the step would be
on each iteration. It determines how fast or slow we will move
towards the optimal weights.

• Learning rate is one of the most important hyperparameter settings in training a neural network.

42
Learning Rate
• If the learning rate is large, it may overshoot the minimum and fail to converge.
• If the learning rate is very small, it would take a long time to converge and become computationally expensive.
• The most commonly used rates are:
0.001, 0.003, 0.01 (default), 0.03, 0.1, 0.3

43
Learning Rate
• Plot the cost function against different values of learning rate.
- If gradient descent is working properly, the cost function should decrease after
every iteration.
- When Gradient Descent can’t decrease the cost-function anymore and remains
more or less on the same level, we say it has converged.

Note:
If you see in the plot that your learning curve is just going up and down, without really reaching a lower point, you should also try to decrease the learning rate.
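The effect of the learning rate on the cost curve can be illustrated with a tiny experiment. The cost J(θ) = θ², the starting point, and the three rates below are assumptions chosen so that each behaviour is easy to see; the recorded cost lists are what one would plot against the iteration number.

```python
def run_gd(alpha, theta0=1.0, n_iters=50):
    """Run gradient descent on J(theta) = theta^2 and record the cost."""
    theta = theta0
    costs = []
    for _ in range(n_iters):
        costs.append(theta ** 2)             # cost before this update
        theta = theta - alpha * 2.0 * theta  # dJ/dtheta = 2 * theta
    return costs

too_small = run_gd(alpha=0.001)  # decreases every iteration, but very slowly
suitable = run_gd(alpha=0.1)     # decreases quickly and levels off near 0
too_large = run_gd(alpha=1.1)    # overshoots the minimum; the cost grows

for name, costs in [("0.001", too_small), ("0.1", suitable), ("1.1", too_large)]:
    print(f"alpha={name}: first cost {costs[0]:.3g}, last cost {costs[-1]:.3g}")
```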

44
Learning Rate Schedules
• When training deep neural networks, it is often useful to reduce
learning rate as the training progresses. This can be done by using
pre-defined learning rate schedules or adaptive learning rate
methods.
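Two common pre-defined schedules, step decay and exponential decay, can be sketched as plain functions of the epoch number. The function names, the decay constants, and the starting rate are illustrative assumptions, not values from the lecture.

```python
import math

def step_decay(alpha0, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return alpha0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(alpha0, epoch, k=0.1):
    """Shrink the learning rate smoothly: alpha0 * exp(-k * epoch)."""
    return alpha0 * math.exp(-k * epoch)

# Hypothetical starting rate of 0.1, inspected at a few epochs.
for epoch in (0, 10, 25):
    print(epoch, step_decay(0.1, epoch), exponential_decay(0.1, epoch))
```

Both functions return the starting rate at epoch 0 and a smaller rate as training progresses, so the optimizer takes large steps early and finer steps near the minimum.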

45
Thanks
46
