05 AIS302 ANN-Optimization
[Figure: course roadmap. Foundation concepts; Shallow Artificial NN (Training Parameters); Deep Computer Vision (Convolutional NN, Object Detection, Pre-trained Models); Deep Sequence Modeling (Recurrent NN, LSTM); Deep Generative Models (VAE, GAN); Deep Reinforcement]
Lecture 5
Improving DNNs: Hyperparameter Tuning, Regularization and Optimization
How Do Loss Functions Work?
1. The model makes a prediction.
2. The loss function compares the prediction with the actual value.
3. It calculates the error (loss).
4. An optimization algorithm (e.g., gradient descent) adjusts the model parameters to minimize the loss.
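The four steps above can be sketched for a one-parameter linear model. This is only a minimal illustration: the toy data, the starting value of `w`, the learning rate `lr`, and the choice of a squared-error loss are all assumptions made for the example.

```python
import numpy as np

# Toy data (assumed for illustration): y is exactly 2 * x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = 0.0    # model parameter, arbitrary starting value
lr = 0.05  # learning rate (hypothetical choice)

losses = []
for step in range(100):
    y_pred = w * x                     # 1. the model makes a prediction
    error = y_pred - y                 # 2. compare prediction with actual value
    loss = np.mean(error ** 2)         # 3. calculate the loss (MSE here)
    grad = 2.0 * np.mean(error * x)    # gradient of the loss with respect to w
    w -= lr * grad                     # 4. adjust the parameter to reduce the loss
    losses.append(loss)
```

After the loop, `w` should be close to 2 and the recorded losses should shrink from the first step to the last.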
Types of Loss Functions
• Loss functions are divided into:
• Regression Loss Functions (Continuous output)
• Classification Loss Functions (Categorical output)
• Ranking Loss Functions (Ordering tasks)
• Image & Reconstruction Loss Functions (Computer vision)
• Adversarial Loss Functions (GANs)
Regression Loss Functions
• Loss functions used in regression problems:
1. Mean Absolute Error (MAE) / L1 loss: the mean of the absolute differences between the true values and the predicted values.
$$L_1 = \frac{1}{N}\sum_{i=1}^{N} \left| y_{\text{true},i} - y_{\text{predicted},i} \right|$$
2. Mean Squared Error (MSE) / L2 loss: the mean of the squared differences between the true values and the predicted values.
$$L_2 = \frac{1}{N}\sum_{i=1}^{N} \left( y_{\text{true},i} - y_{\text{predicted},i} \right)^2$$
N = number of samples, y_true = true label, y_predicted = predicted label
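A minimal NumPy sketch of the two regression losses follows. The function names `mae` and `mse` and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error (L1 loss)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error (L2 loss)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical values: the errors are 0.5, 0, -2 and -1.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

l1 = mae(y_true, y_pred)
l2 = mse(y_true, y_pred)
```

Because the error of 2 is squared, it contributes 4 of the 5.25 total in the MSE sum but only 2 of the 3.5 total in the MAE sum, which is why MSE is the more outlier-sensitive of the two.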
Regression Loss Functions
Huber Loss: combines the advantages of MSE and MAE. It is less sensitive to outliers than MSE and, unlike MAE, differentiable everywhere.
• Huber loss is defined as (with error e = y_true - y_predicted and threshold δ):
$$L_\delta(e) = \begin{cases} \frac{1}{2}e^2 & \text{if } |e| \le \delta \\ \delta|e| - \frac{1}{2}\delta^2 & \text{if } |e| > \delta \end{cases}$$
How it works:
1. For small errors (|e| ≤ δ):
   1. The loss behaves like MSE ($\frac{1}{2}e^2$), making it smooth and differentiable.
   2. This helps with stable training.
2. For large errors (|e| > δ):
   1. The loss behaves like MAE ($\delta|e| - \frac{1}{2}\delta^2$), reducing sensitivity to outliers.
   2. This prevents the huge gradients that large errors cause with MSE.
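The piecewise behaviour of the Huber loss can be sketched as follows. The function name `huber`, the default `delta=1.0`, and the averaging over samples are assumptions for this example.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |e| <= delta, linear beyond it."""
    e = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_e = np.abs(e)
    quadratic = 0.5 * e ** 2                      # MSE-like branch
    linear = delta * abs_e - 0.5 * delta ** 2     # MAE-like branch
    return np.mean(np.where(abs_e <= delta, quadratic, linear))

small = huber([0.0], [0.5])   # |e| = 0.5 <= delta, quadratic branch
large = huber([0.0], [3.0])   # |e| = 3.0 >  delta, linear branch
```

With δ = 1, an error of 3 costs 2.5 under the Huber loss, compared with 4.5 under the half-squared error, which shows how the linear branch dampens outliers.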
Classification Loss Functions
• Loss functions used in classification problems:
1. Hinge Loss / SVM Loss: mainly used for "maximum-margin" classification. Even observations that are classified correctly can incur a penalty if their margin from the decision boundary is not large enough.
$$L_i = \sum_{j \neq y_i} \max(0,\ s_j - s_{y_i} + \Delta), \qquad \Delta = 1$$
where s_j is the score for class j and y_i is the correct class of sample i.
Classification Loss Functions
2. Cross-entropy Loss / Logistic Loss / Multinomial Logistic Loss: a generalized form of the log loss, used for multi-class classification problems. For a perfect model, the log loss is 0.
$$L_{\text{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} y_{ij}\log(p_{ij})$$
where y_ij is 1 if sample i belongs to class j (0 otherwise) and p_ij is the predicted probability of class j for sample i.
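A minimal sketch of the multi-class cross-entropy, assuming one-hot labels and predicted probabilities given as rows of a matrix. The function name `cross_entropy` and the small clipping constant `eps` (used to avoid log(0)) are assumptions for this example.

```python
import numpy as np

def cross_entropy(y_onehot, p, eps=1e-12):
    """Mean cross-entropy for one-hot labels y and predicted probabilities p."""
    y = np.asarray(y_onehot, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)  # guard against log(0)
    return -np.mean(np.sum(y * np.log(p), axis=1))

y = [[1, 0, 0],
     [0, 1, 0]]

perfect = cross_entropy(y, y)                       # predictions equal the labels
uncertain = cross_entropy(y, [[0.5, 0.25, 0.25],
                              [0.25, 0.5, 0.25]])   # correct class gets p = 0.5
```

A perfect prediction gives a loss of 0, while assigning probability 0.5 to each correct class gives -log(0.5) = log(2).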
Classification Loss Functions
Log Loss or Binary Cross-entropy: a sigmoid activation plus a cross-entropy loss.
$$L_{\text{binary}} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$$
p_i = predicted probability for the i-th sample
• If y = 1: error = 0 when p = 1, and error → ∞ as p → 0.
• If y = 0: error = 0 when p = 0, and error → ∞ as p → 1.
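The binary cross-entropy and the sigmoid that usually precedes it can be sketched as below. The names `sigmoid` and `binary_cross_entropy`, and the clipping constant `eps`, are assumptions for this example.

```python
import numpy as np

def sigmoid(z):
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean log loss for labels y in {0, 1} and predicted probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

coin_flip = binary_cross_entropy([1, 0], sigmoid([0.0, 0.0]))  # p = 0.5 for both
confident_right = binary_cross_entropy([1], [0.9])
confident_wrong = binary_cross_entropy([1], [0.1])
```

The loss grows as the predicted probability moves away from the true label, so a confident wrong prediction is penalized far more than a confident correct one.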
Classification Loss Functions
Categorical Cross-Entropy / Softmax loss: a softmax activation plus a cross-entropy loss. With this loss, we train the network to output a probability distribution over the C classes for each image. It is used for multi-class classification.
Multiclass SVM Loss Example
• The multiclass SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin delta (Δ = 1).
$$L_i = \sum_{j \neq y_i} \max(0,\ s_j - s_{y_i} + 1)$$
Example
• Suppose we have 3 training examples and 3 classes. With some W, the scores f(x, W) = Wx are:
[Table: class scores for the 3 training examples]
Multiclass SVM loss:
$$L_i = \sum_{j \neq y_i} \max(0,\ s_j - s_{y_i} + 1)$$
The loss over the full training set comes to L = 5.27.
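The multiclass SVM loss can be sketched in NumPy as follows. The score table from the slide did not survive extraction, so the scores and labels below are assumed values chosen for illustration. With them, the mean loss works out to about 5.27, the same figure quoted for this example. The function name `multiclass_svm_loss` is also an assumption.

```python
import numpy as np

def multiclass_svm_loss(scores, labels, delta=1.0):
    """Per-example and mean multiclass SVM (hinge) loss.

    scores: (n_examples, n_classes) array of class scores.
    labels: index of the correct class for each example.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]      # s_{y_i} as a column
    margins = np.maximum(0.0, scores - correct + delta)  # max(0, s_j - s_{y_i} + delta)
    margins[np.arange(n), labels] = 0.0                  # skip the j == y_i term
    per_example = margins.sum(axis=1)
    return per_example, per_example.mean()

# Assumed scores: one row per training example, one column per class.
scores = [[3.2, 5.1, -1.7],
          [1.3, 4.9, 2.0],
          [2.2, 2.5, -3.1]]
labels = [0, 1, 2]  # assumed correct class of each example

per_example, mean_loss = multiclass_svm_loss(scores, labels)
```

The second example incurs no loss because its correct-class score already exceeds every other score by more than the margin.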
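Softmax Loss Example
Note:
1. The softmax function adds meaning to the scores by turning them into probabilities.
2. In the target distribution, the probability of the correct class (cat) is 1 and that of the others is 0.
3. The goal is to put all the probability mass on the correct class.
4. log is a monotonic function, so maximizing the log-probability is equivalent to maximizing the probability and is easier.
5. The loss is -log of the probability of the correct class.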
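A short sketch of the softmax loss for a single example follows. The scores are assumed values for a three-class problem in which class 0 is the correct one. The function names and the max-subtraction trick for numerical stability are choices made for this illustration.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # stability; result unchanged
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def softmax_loss(scores, label):
    """Negative log-probability assigned to the correct class."""
    return -np.log(softmax(scores)[label])

scores = [3.2, 5.1, -1.7]   # assumed scores; class 0 is taken as correct
probs = softmax(scores)
loss = softmax_loss(scores, 0)
```

Here the correct class receives only a small share of the probability mass, so the loss is well above zero. It approaches zero only as the correct-class probability approaches 1.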
What is missing in this loss function?
The overfitting problem
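The Problem of Overfitting
• Underfitting, or high bias, is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features.
• Overfitting, or high variance, is when the hypothesis function fits the training data closely but does not generalize well to new data. It is usually caused by a function that is too complex or uses too many features.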
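Regularization
• A regularization term R(W), weighted by the parameter λ, is added to the data loss:
$$L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i + \lambda R(W)$$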
Regularization parameter λ
• Regularization works on the assumption that smaller weights produce a simpler model and thus help avoid overfitting.
• If λ = 0, the regularization term becomes zero (back to the original loss function).
• If λ is large, the weights become close to zero (i.e., a very simple model that underfits).
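The effect of λ on the weights can be seen in a small experiment. This sketch swaps in ridge regression (a summed squared-error loss plus an L2 penalty), because it has a closed-form solution. The synthetic data, the function name `ridge_weights`, and the chosen λ values are all assumptions for illustration.

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (assumed): 50 samples, 3 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

# Norm of the fitted weights for increasing regularization strength.
norms = {lam: np.linalg.norm(ridge_weights(X, y, lam))
         for lam in (0.0, 1.0, 100.0, 1e6)}
```

With λ = 0 the fit is ordinary least squares. As λ grows, the weight norm shrinks toward zero, and the model eventually becomes too simple to follow the data.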
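Regularization function R(W)
Simple examples:
L2 regularization: $R(W) = \sum_k \sum_l W_{k,l}^2$
L1 regularization: $R(W) = \sum_k \sum_l |W_{k,l}|$
Elastic net (L1 + L2): $R(W) = \sum_k \sum_l \left( \beta W_{k,l}^2 + |W_{k,l}| \right)$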
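The three penalties can be written directly in NumPy. The function names, the mixing weight `beta`, and the example matrix are assumptions for this sketch.

```python
import numpy as np

def l2_penalty(W):
    """Sum of squared weights."""
    return np.sum(np.asarray(W, dtype=float) ** 2)

def l1_penalty(W):
    """Sum of absolute weights."""
    return np.sum(np.abs(np.asarray(W, dtype=float)))

def elastic_net_penalty(W, beta=0.5):
    """Weighted L2 term plus L1 term."""
    return beta * l2_penalty(W) + l1_penalty(W)

W = np.array([[1.0, -2.0],
              [0.0, 0.5]])  # hypothetical weight matrix

r_l2 = l2_penalty(W)
r_l1 = l1_penalty(W)
r_en = elastic_net_penalty(W, beta=0.5)
```

The L2 penalty is dominated by the largest weight (-2 contributes 4 of 5.25), whereas the L1 penalty grows only linearly with each weight.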
Optimization
Optimization is the process of finding the set of parameters W that minimizes the loss function.
Simple Optimizer:
• Gradient Descent (GD)
More Complex Optimizers:
• Momentum
• AdaGrad
• RMSProp
• Adam
Gradient Descent (GD)
• Following the gradient
Main idea: take iterative steps that update the parameters W in the direction of the negative gradient.
Cost function: $J(\theta_0, \theta_1)$
[Figure: surface of the cost function J plotted over θ0 and θ1]
Gradient Descent: Problem
• The problem with GD is that it can converge to different locations depending on the initial W. It can get stuck in local minima.
[Figure: surface of the cost function plotted over θ0 and θ1]
Gradient Descent: Algorithm
• We take steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.
Cost function: $J(\theta_1)$
• The gradient descent algorithm is:
repeat until convergence {
$\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$
}
where α is the learning rate (step size).
• Positive slope (positive number) → θ1 will decrease
• Negative slope (negative number) → θ1 will increase
Credit: Andrew Ng
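The update rule can be tried on a one-parameter cost function. The cost J(θ) = (θ - 3)², the starting points, the learning rate, and the function name `gradient_descent` are all assumptions chosen to keep the sketch small.

```python
def gradient_descent(grad, theta0, alpha, n_steps):
    """Repeatedly apply theta := theta - alpha * dJ/dtheta."""
    theta = theta0
    history = [theta]
    for _ in range(n_steps):
        theta = theta - alpha * grad(theta)
        history.append(theta)
    return theta, history

def grad_J(theta):
    """Derivative of the assumed cost J(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)

# Start to the right of the minimum: positive slope, so theta decreases.
theta_right, hist_right = gradient_descent(grad_J, theta0=10.0, alpha=0.1, n_steps=100)
# Start to the left of the minimum: negative slope, so theta increases.
theta_left, hist_left = gradient_descent(grad_J, theta0=-5.0, alpha=0.1, n_steps=100)
```

Both runs end near θ = 3, approaching it from opposite sides, as the sign of the slope dictates.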
Gradient Descent Variants
• There are three variants of gradient descent, which differ in the amount of data used to calculate the gradient:
1. Batch gradient descent
2. Stochastic gradient descent (SGD)
3. Mini-batch gradient descent
GD Variants: Batch Gradient Descent
• Batch Gradient Descent, aka Vanilla gradient descent, calculates the
error for each observation in the dataset but performs an update only
after all observations have been evaluated.
GD Variants: Stochastic Gradient Descent (SGD)
• Stochastic gradient descent, often abbreviated SGD, is a variation of
the gradient descent algorithm that calculates the error and updates
the model for each example in the training dataset.
• SGD is usually faster than batch gradient descent, but its frequent updates cause a higher variance in the error, which can sometimes jump around instead of decreasing.
• The noisy update process can allow the model to avoid local minima (i.e., premature convergence).
GD Variants: Mini-Batch Gradient Descent
• Mini-batch gradient descent seeks to find a balance between the
robustness of stochastic gradient descent and the efficiency of batch
gradient descent.
• It splits the training dataset into small batches that are used to calculate
model error and update model coefficients.
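The three variants differ only in how many examples feed each update, so one routine with a `batch_size` parameter can cover all of them. This is a sketch under stated assumptions: a linear model with an MSE loss, synthetic noise-free data, and hypothetical values for `lr` and `epochs`.

```python
import numpy as np

def train_linear(X, y, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y ~ X @ w by gradient descent on MSE with the given batch size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                   # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]    # examples in this batch
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx) # MSE gradient on the batch
            w -= lr * grad                           # one parameter update
    return w

# Synthetic, noise-free data (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 2))
true_w = np.array([1.5, -0.5])
y = X @ true_w

w_batch = train_linear(X, y, batch_size=len(X))  # batch GD: one update per epoch
w_sgd = train_linear(X, y, batch_size=1)         # SGD: one update per example
w_mini = train_linear(X, y, batch_size=16)       # mini-batch GD: four updates per epoch
```

On this easy problem all three settings recover the true weights. The difference lies in how many updates are made per pass over the data and how noisy each update is.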
Gradient Descent: Feature Scaling
• We can speed up gradient descent by having each of our input values in roughly the same range.
Rule of thumb regarding acceptable ranges:
• -3 to +3 is generally fine; anything bigger is bad
• -1/3 to +1/3 is OK; anything smaller is bad
Two ways to scale an un-scaled feature x_i:
$$x_i := \frac{x_i}{\max_i} \qquad \text{or} \qquad x_i := \frac{x_i - \mu_i}{s_i}$$
max_i = largest value of x
μ_i = mean value of x
s_i = range of values (max - min) or the standard deviation
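Mean normalization with either choice of s_i can be sketched as below. The function name `mean_normalize`, the `use_std` flag, and the sample matrix are assumptions. The sketch also assumes that no feature is constant, since a constant column would make s_i zero.

```python
import numpy as np

def mean_normalize(X, use_std=False):
    """Scale each column to (x - mean) / s, with s the range or the std."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    # Assumes every column varies; a constant column would give s = 0.
    s = X.std(axis=0) if use_std else X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s

# Hypothetical un-scaled features on very different scales.
X = np.array([[2000.0, 1.0],
              [1000.0, 3.0],
              [1500.0, 5.0]])

X_range = mean_normalize(X)                # divide by (max - min)
X_std = mean_normalize(X, use_std=True)    # divide by the standard deviation
```

After scaling by the range, both columns lie between -0.5 and 0.5, so neither feature dominates the gradient merely because of its units.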
Learning Rate
• The gradient tells us the direction, but it does not tell us how far
along this direction we should step.
• The learning rate (step size) determines how big the step would be
on each iteration. It determines how fast or slow we will move
towards the optimal weights.
Learning Rate
• If the learning rate is too large, gradient descent may overshoot the minimum and fail to converge.
• If the learning rate is very small, it takes a long time to converge and becomes computationally expensive.
• The most commonly used rates are:
0.001, 0.003, 0.01 (default), 0.03, 0.1, 0.3
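The two failure modes can be reproduced on a toy cost function. The cost J(θ) = θ², the three learning rates, and the step count are assumptions picked to make the contrast visible.

```python
def run_gd(alpha, theta0=1.0, n_steps=50):
    """Run gradient descent on J(theta) = theta**2 and return the final cost."""
    theta = theta0
    for _ in range(n_steps):
        theta -= alpha * 2.0 * theta  # gradient of theta**2 is 2 * theta
    return theta ** 2

cost_small = run_gd(alpha=0.001)  # very small rate: barely moves in 50 steps
cost_good = run_gd(alpha=0.1)     # moderate rate: converges
cost_large = run_gd(alpha=1.1)    # too large: each step overshoots further
```

With the small rate the cost is still close to its starting value of 1. With the moderate rate it is essentially zero. With the large rate it grows without bound, because every step lands farther from the minimum than the one before.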
Learning Rate
• Plot the cost function against the number of iterations for different values of the learning rate.
- If gradient descent is working properly, the cost function should decrease after every iteration.
- When gradient descent can no longer decrease the cost function and it stays at more or less the same level, we say it has converged.
Note: If the plot shows your learning curve just going up and down without really reaching a lower point, you should also try decreasing the learning rate.
Learning Rate Schedules
• When training deep neural networks, it is often useful to reduce the learning rate as training progresses. This can be done with pre-defined learning rate schedules or adaptive learning rate methods.
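Two common pre-defined schedules, step decay and exponential decay, can be sketched as plain functions of the epoch number. The function names and the default values of `drop`, `epochs_per_drop`, and `k` are assumptions for illustration and are not taken from the lecture.

```python
import math

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

def exponential_decay(initial_lr, epoch, k=0.1):
    """Shrink the learning rate smoothly as exp(-k * epoch)."""
    return initial_lr * math.exp(-k * epoch)

step_rates = [step_decay(0.1, e) for e in (0, 9, 10, 25)]
exp_rates = [exponential_decay(0.1, e) for e in (0, 10, 20)]
```

Step decay keeps the rate constant within each block of epochs and then cuts it abruptly, while exponential decay lowers it a little at every epoch.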
Thanks