
Optimizers for Neural Networks

“Generalization is the ultimate goal of any machine learning algorithm”


Introduction

Two major types of problems that machine learning algorithms try to solve are:

• Classification — Predict the class of a given data point.

• Regression — Predict a continuous value for a given data point.


Deep Learning and Linear Regression
The neural network equation is:

Z = Bias + W1X1 + W2X2 + … + WnXn

Z denotes the output (weighted sum) computed by a neuron in the ANN.

Wi's are the weights or beta coefficients
Xi's are the independent variables or inputs
Bias = W0

Each neuron in the hidden layer has an equation like the one above, connecting the layers through the respective weights and bias terms. This is how the neuron values are computed and then passed on to the next layer.
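As an illustration, here is a minimal NumPy sketch of how a single neuron computes Z from its inputs, weights and bias; the numeric values are made up for the example.

```python
import numpy as np

# Illustrative values: three inputs, three weights and a bias (W0)
x = np.array([0.5, 1.2, -0.7])   # inputs X1..X3
w = np.array([0.8, -0.3, 0.1])   # weights W1..W3
bias = 0.25                      # W0

# Z = Bias + W1*X1 + W2*X2 + ... + Wn*Xn
z = bias + np.dot(w, x)
print(z)
```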
Deep Learning and Linear Regression
Steps involved in a Neural Network:

1. Take the input equation Z = W0 + W1X1 + W2X2 + … + WnXn and predict the output Y (Ypred).
2. Calculate the error - it tells how much the model deviates from the actual observed values and is computed as Ypred − Yactual.
• The error measure changes depending on whether the problem is regression or classification.
• Regression - RMSE, MAPE, MAE
• Classification - Binary Cross-Entropy cost function (a sketch of both follows after this list)
3. What is the end goal of any model?

• To minimize this error term!

• And how is this done in Neural Networks?
• The computed loss value is taken back to each layer, so that the weights are updated in a way that minimizes the loss.
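For concreteness, a small sketch of the two kinds of error measures mentioned above (RMSE for regression, binary cross-entropy for classification); the sample arrays are invented for illustration.

```python
import numpy as np

def rmse(y_actual, y_pred):
    # Root Mean Squared Error for regression
    return np.sqrt(np.mean((y_pred - y_actual) ** 2))

def binary_cross_entropy(y_actual, y_pred, eps=1e-12):
    # Binary cross-entropy for classification; clip to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_actual * np.log(y_pred) + (1 - y_actual) * np.log(1 - y_pred))

# Illustrative values only
print(rmse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```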
Linear Regression
• Linear Regression - tries to predict a continuous value for a given data point by generalizing over the data in hand. The "linear" part indicates that a linear approach is used in generalizing over the data.

• Example: Predict the price of a house by knowing its size.

• Data in hand: some house sizes and their corresponding prices.


Linear Regression
• Charting the data and fitting a line among the points will look something like this:
• To generalize, a straight line is drawn such that it passes through, or close to, as many points as possible. Once you have that line, for a house of any size you just project that data point onto the line, which gives you the house price.

• We are done! Wait…


Linear Regression
The Real Problem

• The problem was never finding the house price!

• The problem was to find the best-fit line which generalizes well over the data.

• The same old line equation is used, y = mx + c, given a slightly more statistical look, with the terminology below added specific to Linear Regression modeling.
Linear Regression

• y — The value that you want to predict
• β₀ — The y-intercept of the line, i.e. where the line intersects the Y-axis
• β₁ — The slope or gradient of the line, i.e. how steep the line is
• x — The value of the data point
• u — The residual or noise caused by unexplained factors

Putting these together, the regression equation is y = β₀ + β₁x + u.
Linear Regression
Mathematics to the rescue!

• There are possibly an infinite number of lines - but we are never sure whether a given line is the best fit or not.
• Saviour: The Cost Function (J)
• Aim: Achieve the best-fit regression line to predict the 'y' value such that the error difference between the predicted value and the true value is minimal.
• Goal: Reduce the Cost function, which in turn improves the accuracy.

• Naive solution: Take parameter values by trial and error and calculate the MSE for each combination of parameters (see the sketch after this list).
• Not at all efficient!
• It seems there is a calculus solution to this problem!
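To see why trial and error does not scale, here is a rough sketch (with made-up house data) that evaluates the MSE cost J over a grid of candidate intercept and slope values; the grid ranges are arbitrary.

```python
import numpy as np

# Made-up house sizes (x) and prices (y)
x = np.array([50.0, 80.0, 120.0, 160.0])
y = np.array([150.0, 220.0, 330.0, 410.0])

best = None
# Brute-force search over candidate intercepts (b0) and slopes (b1)
for b0 in np.linspace(-50, 50, 101):
    for b1 in np.linspace(0, 5, 101):
        mse = np.mean((y - (b0 + b1 * x)) ** 2)   # cost J for this line
        if best is None or mse < best[0]:
            best = (mse, b0, b1)

print("best MSE, intercept, slope:", best)
```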
Gradient Descent
• Gradient descent (GD) is an iterative first-order optimization
algorithm used to find a local minimum/maximum of a given function.

• Various Types of Critical Points:


Gradient Descent

Function Requirements

Gradient descent algorithm does not work for all functions. There are
two specific requirements. A function has to be:

• Differentiable

• Convex
Gradient Descent
What does it mean when a function is differentiable?

Answer: It has a derivative at each point in its domain.

Twist: Not all functions meet this criterion.

(Examples: differentiable functions vs. non-differentiable functions)


Gradient Descent
Next requirement — The function has to be convex.

For a univariate function, this means that the line segment connecting any two points on the function lies on or above its curve and does not cross it.
If it does cross, it means the function has a local minimum which is not a global one.
Gradient Descent
Another way to check mathematically if a univariate function is convex
is to calculate the second derivative and check if its value is always
greater than 0.

More Calculus Ahead!

A quadratic function:

Its first and second derivatives are:

Because the second derivative is greater than zero, the function above is
strictly convex.
Gradient Descent
• Gradient - slope of a curve at a given point in a specified direction.
• For a univariate function, it is the first derivative of the function at
a selected point.
• In the case of a multivariate function, it is a vector of (partial)
derivatives in each direction (along variable axes).
• The gradient of an n-dimensional function f(x) at a given point p is defined as the vector of partial derivatives:
∇f(p) = [∂f/∂x1(p), ∂f/∂x2(p), …, ∂f/∂xn(p)]
Gradient Descent
Gradient at point p(10,10):

What do the gradient values mean?

The slope is twice as steep along the y-axis!


Gradient Descent Algorithm
Let’s try an analogy!

• Visualise the function as a valley.

• You are standing at some random point.
• Your aim is to reach the bottommost point of the valley.
• What would you do?
Gradient Descent Algorithm
Here is how Gradient Descent does it!

• It has to know which way the valley slopes (and it doesn't have eyes as you do), so it takes the help of mathematics here.

• To know the slope of a function at any point, differentiate the function with respect to its parameters at that point. Thus Gradient Descent differentiates the Cost function (J) and learns the slope at that point.

• To get to the bottommost point it has to move in the direction opposite to the slope, i.e. in the direction in which the function decreases.
Gradient Descent Algorithm

• It has to take small steps to move towards the bottommost point. Here, the learning rate decides the length of the step that gradient descent will take.

• After every move it checks whether the current position is the minimum or not. This is validated by the slope at that point: if the slope is zero, the algorithm has reached the bottommost point.

• After every step, it updates the parameters (or weights). By doing the above steps repeatedly it reaches the bottommost point.
Gradient Descent Algorithm

• Once the bottommost point of the valley is reached, it means the parameters corresponding to the lowest MSE or Cost function value have been obtained.

• The Linear Regression model is now ready for use, to predict the dependent variable of any unseen data point with high accuracy.
Steps in the Gradient Descent Algorithm
• Choose a starting point (initialisation): p0
• Calculate the gradient at the current point: ∇f(pn)
• Make a scaled step in the direction opposite to the gradient (objective: minimise): pn+1 = pn − η ∇f(pn)

• Repeat steps 2 and 3 until one of the criteria is met (a minimal sketch follows below):

• maximum number of iterations reached;

• step size is smaller than the tolerance (due to scaling or a small gradient).
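A minimal Python sketch of these steps, using the simple convex function f(p) = p² whose gradient is 2p; the learning rate, tolerance and iteration cap are illustrative values.

```python
def gradient_descent(grad, start, learning_rate=0.1, max_iters=1000, tol=1e-6):
    p = start
    for _ in range(max_iters):                 # stop: max iterations reached
        step = learning_rate * grad(p)         # scale the gradient by the learning rate
        if abs(step) < tol:                    # stop: step smaller than tolerance
            break
        p = p - step                           # move opposite to the gradient
    return p

# Example: f(p) = p**2, so grad f(p) = 2*p; minimum at p = 0
print(gradient_descent(lambda p: 2 * p, start=10.0))
```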
Learning Rate & Step Size
• There's an important parameter η which scales the gradient and thus controls the step size.

• In machine learning, it is called the learning rate and has a strong influence on the learned model's performance.

• The smaller the learning rate, the longer the Gradient Descent algorithm takes to converge, or the maximum number of iterations may be reached before the optimum point is found.

• If the learning rate is too big, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely.
Learning Process
• Output response from neuron k is yk(n)
• Desired (target) output is dk(n)
• Error signal: ek(n) = dk(n) − yk(n)
• Step-by-step adjustments to the synaptic weights of neuron k are continued until the system reaches a steady state. At that point the learning process is terminated.
• The objective is to minimize the cost function expressed in terms of the error signal, which leads to the delta rule:
Δwkj(n) = η ek(n) xj(n)
where η is the learning rate and wkj(n) denotes the weight of neuron k excited by input element xj(n) at time step n.
The updated value of the weight wkj(n) is
wkj(n+1) = wkj(n) + Δwkj(n)
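A hedged one-neuron sketch of the delta rule above, assuming a linear neuron; the inputs, target and learning rate are made up for illustration.

```python
import numpy as np

eta = 0.1                               # learning rate
w = np.array([0.2, -0.4])               # weights w_kj(n) of neuron k
x = np.array([1.0, 0.5])                # input signals x_j(n)
d = 1.0                                 # desired output d_k(n)

for n in range(10):
    y = np.dot(w, x)                    # neuron output y_k(n) (linear neuron assumed)
    e = d - y                           # error signal e_k(n) = d_k(n) - y_k(n)
    delta_w = eta * e * x               # delta rule: delta_w_kj(n) = eta * e_k(n) * x_j(n)
    w = w + delta_w                     # w_kj(n+1) = w_kj(n) + delta_w_kj(n)

print(w, e)
```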
Back Propagation Algorithm
The back propagation algorithm applies a correction term Δwkj(n) to the synaptic weight wkj(n), which is proportional to the partial derivative of the loss with respect to that weight:

Δwkj(n) = η · δk(n) · xj(n)

The gradient is the derivative of the loss function with respect to the weights.

Δwkj(n) → Weight correction
η → Learning rate parameter
δk(n) → Local gradient
xj(n) → Input signal of neuron j


Gradient Descent
• It is an algorithm to minimize a function by optimizing its parameters:
New value = Old value − Step size
Step size = learning rate × slope
The slope gives the direction to move in order to reach the global minimum.




Learning Rate & Step Size
How to make sure the selected
learning rate works properly?

• A good way to make sure the gradient descent algorithm runs properly is to plot the cost function as the optimization runs.
• This lets you view the value of the cost function after each iteration of gradient descent and provides a way to easily spot how appropriate the selected learning rate is.
• If the gradient descent algorithm is working properly, the cost function should decrease after every iteration.
• When gradient descent can't decrease the cost function anymore and the value remains more or less the same, it has converged.
• The number of iterations gradient descent needs to converge can vary a lot. It can take 50 iterations, 60,000 or maybe even 3 million, making the number of iterations to convergence hard to estimate in advance.
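A rough sketch of this diagnostic: record the cost after each gradient descent iteration and plot it (matplotlib is assumed to be available; the quadratic cost is just a stand-in for any differentiable cost function).

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in cost J(w) = (w - 3)^2 and its gradient 2*(w - 3)
w, eta = 10.0, 0.1
history = []
for _ in range(50):
    history.append((w - 3) ** 2)     # cost at the current iteration
    w -= eta * 2 * (w - 3)           # gradient descent step

plt.plot(history)                    # cost should decrease after every iteration
plt.xlabel("iteration")
plt.ylabel("cost J")
plt.show()
```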
Shortfalls of Gradient Descent
• There can be numerous hidden layers and neurons in a neural network.
• Weights are attached to every neuron in every layer, and updates propagate from the output layer all the way back to the original input layer.
• In such cases, the weights that are updated via Gradient Descent falter at two important places:

• Gradient Descent gets stuck at local minima

• The learning rate does not change in Gradient Descent


Challenge 1: Gradient Descent gets stuck at
Local Minima
• Consider the function:

• The weights are updated as:

Once stuck at the local point, the parameters don't update. A push is required to get out of it and move further to reach the global minimum!
Challenge 1: Gradient Descent gets stuck at
Local Minima
Solution: Stochastic Gradient Descent with Momentum

To get the ball out of the local minimum, we need to accumulate momentum or speed.
In a Neural Network, the accumulated speed is equivalent to a weighted sum of the past gradients and is represented as:

mt = β · mt-1 + (1 − β) · (dL/dw)

where,
dL/dw = current gradient at time t
mt-1 = previously accumulated gradient up to time t−1
β gives how much weightage is given to the current gradient versus the previously accumulated gradient.
Generally, 10 percent weightage is given to the current gradient and 90 percent to the previously accumulated gradient (i.e. β ≈ 0.9).
Challenge 1: Gradient Descent gets stuck at Local Minima
• At the local minimum (at the position of the full red ball), the slope dL/dw will be zero and the equation becomes mt = β · mt-1.

• This mt gives the required push to come out of the local minimum.
• mt then updates the weights, minimizing the cost function for the Neural Network, as wt+1 = wt − η · mt.
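A hedged sketch of the momentum update described above, with β = 0.9 as suggested by the 90/10 weighting; the starting values are illustrative only.

```python
def sgd_momentum_step(w, grad_w, m_prev, eta=0.01, beta=0.9):
    # Accumulate a weighted sum of past gradients (the momentum term)
    m = beta * m_prev + (1 - beta) * grad_w
    # Even when grad_w is 0 at a local minimum, m = beta * m_prev still pushes w along
    w_new = w - eta * m
    return w_new, m

# Illustrative use: one step with a current gradient of 0 but accumulated momentum
w, m = 1.5, 2.0
w, m = sgd_momentum_step(w, grad_w=0.0, m_prev=m)
print(w, m)
```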
Challenge 2: The learning rate does not change in
Gradient Descent
• Solution: RMSProp (Root Mean Squared Propagation)
• During the training process, the slope or gradient dL/dw changes. It is the rate of change of the loss function with respect to the parameter weight, and by using it we can adapt the learning rate.
• We take an exponentially weighted sum of the squared past gradients, i.e. of the squares of the partial derivatives dL/dw, as below:

Vt = β · Vt-1 + (1 − β) · (dL/dw)²
Challenge 2: The learning rate does not change in
Gradient Descent
• The new weights are updated as:

wt+1 = wt − (η / √(Vt + ε)) · (dL/dw)

• ε is a small error term. It is added to Vt so that the denominator does not become zero, and it is generally very small in value.
• When the square of the slopes, (dL/dw)², is high, it increases the value of Vt, which reduces the effective learning rate.
• Similarly, the value of Vt will be low when the square of the slopes is low, and this increases the effective learning rate.
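A hedged sketch of the RMSProp update described above; the η, β and ε values are common defaults rather than values taken from the slides.

```python
def rmsprop_step(w, grad_w, v_prev, eta=0.001, beta=0.9, eps=1e-8):
    # Accumulate an exponentially weighted sum of squared gradients
    v = beta * v_prev + (1 - beta) * grad_w ** 2
    # Large v (steep region) shrinks the effective learning rate, small v grows it
    w_new = w - eta * grad_w / ((v + eps) ** 0.5)
    return w_new, v

# Illustrative single step
w, v = 0.5, 0.0
w, v = rmsprop_step(w, grad_w=4.0, v_prev=v)
print(w, v)
```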
Challenge 2: The learning rate does not change in
Gradient Descent
• In the left panel, calculating the gradient at the first, topmost point gives a slope of high magnitude. This increases the square of the slope, which reduces the learning rate, so small steps are taken to minimize the loss.

• On the right side, the point on the loss function has a gradient of low magnitude, so the square of this gradient is also small. This increases the learning rate and hence bigger steps are taken towards the optimal solution.

• This is how RMSProp scales the learning rate depending on the square of the gradients, which eventually leads to faster convergence to the optimal solution.
Types of Gradient Descent
There are three popular types of gradient descent that mainly differ in the
amount of data they use:

BATCH GRADIENT DESCENT


• Also called vanilla gradient descent; it calculates the error for each example within the training dataset.
• Only after all training examples have been evaluated is the model updated. This whole process is like a cycle and is called a training epoch.
• Advantages
• Computational efficiency
• It produces a stable error gradient and a stable convergence.
• Disadvantages
• The stable error gradient can sometimes result in a state of convergence that isn’t the
best the model can achieve.
• It requires the entire training dataset to be in memory and available to the algorithm.
Types of Gradient Descent
STOCHASTIC GRADIENT DESCENT
It updates the parameters for each training example one by one.
Advantages
• Depending on the problem, this can make SGD faster than batch gradient
descent.
• The frequent updates allow a pretty detailed rate of improvement.
Disadvantages
• The frequent updates, however, are more computationally expensive than the
batch gradient descent approach.
• The frequency of those updates can result in noisy gradients, which may cause
the error rate to jump around instead of slowly decreasing.
Types of Gradient Descent
MINI-BATCH GRADIENT DESCENT
It splits the training dataset into small batches and performs an update for
each of those batches.
This creates a balance between the robustness of stochastic gradient descent
and the efficiency of batch gradient descent.
Common mini-batch sizes range between 50 and 256, but like any other
machine learning technique, there is no clear rule because it varies for
different applications.

This is the go-to algorithm when training a neural network and it is the most
common type of gradient descent within deep learning.
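A minimal sketch showing how the batch size controls which variant you get: batch size 1 gives stochastic gradient descent, a batch size equal to the dataset size gives batch gradient descent, and anything in between is mini-batch. The data and hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                # made-up features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

w, eta, batch_size = np.zeros(3), 0.05, 32   # batch_size=1 -> SGD, =len(X) -> batch GD
for epoch in range(20):
    idx = rng.permutation(len(X))            # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on this mini-batch
        w -= eta * grad                      # one update per mini-batch

print(w)
```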
Applications of Gradient Descent
Sales Driver Analysis — Linear Regression can be used to predict the sale of products in the future based
on past buying behaviour.

Predict Economic Growth — Economists use Linear Regression to predict the economic growth of a
country or state.

Score Prediction — Sports analysts use linear regression to predict the number of runs or goals a player will score in coming matches, based on previous performances.

Salary Estimation — An organisation can use linear regression to figure out how much they would pay to
a new joiner based on the years of experience.

House Price Prediction — Linear regression analysis can help a builder predict how many houses they would sell in the coming months and at what price.

Oil Price Prediction — Petroleum prices can be predicted using Linear Regression
Errors in Machine Learning

• The ultimate goal of any supervised machine learning problem is to find a model or function that predicts a target or label and minimizes the expected error over all possible inputs and labels.
• Minimizing error over all possible inputs means the function must be able to generalize and make accurate predictions on unseen inputs.

• The fundamental goal of machine learning is for the algorithm to generalize beyond the training set.
Errors in Machine Learning
• Irreducible errors are errors which will
always be present in a machine learning
model, because of unknown variables,
and whose values cannot be reduced.
• Reducible errors are those errors whose
values can be further reduced to
improve a model. They are caused
because our model’s output function
does not match the desired output
function and can be optimized.
What is Bias?
• To make predictions, our model analyzes our data and finds patterns in it. Using these patterns, we can make generalizations about certain instances in our data. After training, our model learns these patterns and applies them to the test set to make predictions.
• Bias is the difference between our actual and predicted values. Bias reflects the simple assumptions that our model makes about our data in order to be able to predict new data.
What is Bias?
• When the Bias is high, the assumptions made by our model are too basic and the model can't capture the important features of our data. This means that our model hasn't captured patterns in the training data and hence cannot perform well on the testing data either. If this is the case, our model cannot perform on new data and cannot be sent into production.
• This situation, where the model cannot find patterns in our training set and hence fails for both seen and unseen data, is called Underfitting.
• The below figure shows an example of Underfitting.
As we can see, the model has found no patterns in our
data and the line of best fit is a straight line that does
not pass through any of the data points. The model has
failed to train properly on the data given and cannot
predict new data either.
What is Variance?
• Variance is the very opposite of Bias. During
training, it allows our model to ‘see’ the data
a certain number of times to find patterns in
it. If it does not work on the data for long
enough, it will not find patterns and bias
occurs. On the other hand, if our model is
allowed to view the data too many times, it
will learn very well for only that data. It will
capture most patterns in the data, but it will
also learn from the unnecessary data present,
or from the noise.
• We can define variance as the model’s
sensitivity to fluctuations in the data. Our
model may learn from noise. This will cause
our model to consider trivial features as
important.
What is Variance?
• Hence, our model will perform really well on the training data and get high accuracy there, but will fail to perform on new, unseen data. New data may not have exactly the same features, and the model won't be able to predict it very well. This is called Overfitting.
Mathematical Insight into Bias & Variance
• The bias of an estimator is the "expected" difference between its estimates and the true values in the data:

Bias[g(x₀)] = E[g(x₀)] − f(x₀), where f(x₀) is the true value at x₀

• which is literally the difference between the expected value of the estimator at that point and the true value at that same point.
• Simpler models have a higher bias compared to more sophisticated models.
Mathematical Insight into Bias & Variance
• The variance of an estimator is the "expected" value of the squared difference between the estimate of a model and the "expected" value of the estimate.
• Suppose we train ∞ models using different sample sets of the data. Then at a test point x₀, the expected value of the estimate over all those models is E[g(x₀)]. Also, for any individual model out of all the models, the estimate of that model at that point is g(x₀). The difference between these two can be written as g(x₀) − E[g(x₀)]. Variance is the expected value of the square of this difference over all the models. Hence, the variance of the estimator at a test point x₀ can be written mathematically as:

Var[g(x₀)] = E[(g(x₀) − E[g(x₀)])²]
Mathematical Insight into Bias & Variance
• Bias and variance of an estimator are complementary to each other, i.e. an estimator with high bias will vary less (have low variance) and an estimator with high variance will have less bias (as it can vary more to fit/explain/estimate the data points).
Mathematical Relation Between Bias and Variance
• We define the estimator's error at a test point as the "expected" squared difference between the true value and the estimator's estimate.
• For a test point x₀, this error decomposes as:

Error(x₀) = σ² (variance of the noise) + Bias²[g(x₀)] + Var[g(x₀)]
Mathematical Relation Between Bias and
Variance

1. The error (and hence the accuracy) of the estimator at a test data sample can be decomposed into the variance of the noise in the data, the bias and the variance of the estimator. This implies that both bias and variance are sources of error of an estimator (an illustrative simulation follows below).

2. Bias and variance of an estimator are complementary to each other: increasing one of them means a decrease in the other, and vice versa.
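An illustrative simulation of this decomposition: fit many models on different samples of the same noisy process and estimate the bias² and variance of the predictions at a test point x₀; the data-generating function, sample sizes and model are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(x)                # assumed true function
x0, sigma = 1.0, 0.3                        # test point and noise standard deviation

preds = []
for _ in range(500):                        # "infinitely many" models, approximated by 500
    x = rng.uniform(0, 3, size=30)
    y = true_f(x) + sigma * rng.normal(size=30)
    coeffs = np.polyfit(x, y, deg=1)        # a simple (high-bias) linear model
    preds.append(np.polyval(coeffs, x0))    # this model's estimate g(x0)

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2  # (E[g(x0)] - f(x0))^2
variance = preds.var()                      # E[(g(x0) - E[g(x0)])^2]
print(bias_sq, variance, sigma ** 2)        # error ~ bias^2 + variance + noise variance
```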
Bias Variance Trade-off
• An estimator will have a high error if it has very high bias and low variance - when it is not able to adapt at all to the data points in a sample set.

• An estimator will also have a high error if it has very high variance and low bias - when it adapts too well to all the data points in a sample set and fails to generalize to other, unseen samples.

Bull's eye diagram for the Bias-Variance Trade-off:
Regularization
• Regularization helps select a midpoint between high bias and high variance.
• Regularization performs feature selection by shrinking the contribution of each feature.
• Types of Regularization:
• Ridge Regression (L2 norm)
• Lasso Regression (L1 norm)
• Elastic Net Regression (combination of L1 and L2 norms)
Ridge Regression
• Uses L2 regularization to impose a penalty on the size of the coefficients.
• Minimizes the residual sum of squares after penalization.
• The objective is to minimize (in the usual L2-penalized form):

Σᵢ (yᵢ − ŷᵢ)² + α Σⱼ wⱼ²

• α – regularization parameter
• α controls the size of the coefficients and the amount of regularization.
Lasso Regression
• Uses the L1 regularization technique as a penalty on the size of the coefficients.
• Minimizes the absolute values of the weights after penalization.
• The objective is to minimize (in the usual L1-penalized form):

Σᵢ (yᵢ − ŷᵢ)² + α Σⱼ |wⱼ|
Elastic Net Regression
• Combines Ridge (L2 regularizer) and Lasso (L1 regularizer) in order to train the model.
• The objective is to minimize (in the usual combined form):

Σᵢ (yᵢ − ŷᵢ)² + α₁ Σⱼ |wⱼ| + α₂ Σⱼ wⱼ²

• To select the best value for α, use cross-validation combined with the different regression techniques that aim to regularize the model.
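A hedged sketch using scikit-learn's Ridge, Lasso and ElasticNet estimators; the alpha values and data are placeholders, and in practice α would be chosen by cross-validation as noted above.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # made-up design matrix
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

for model in (Ridge(alpha=1.0),                     # L2 penalty shrinks coefficients
              Lasso(alpha=0.1),                     # L1 penalty can drive some weights to 0
              ElasticNet(alpha=0.1, l1_ratio=0.5)): # mix of L1 and L2 penalties
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```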
Activation Function
• An activation function decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus a bias term.
• Its purpose is to introduce non-linearity into the output of a neuron.
• A neural network without an activation function is just a linear regression model.
• The activation function applies a non-linear transformation to the input, enabling the network to learn and perform more complex tasks.
• A neural network can then learn any complex non-linear relationship between inputs and outputs.
• Activation functions are applied to the hidden and output layers, not to the input layer.


Different Activation Function
Step Function
➢The binary step function is a threshold-based activation function: above a certain threshold the neuron is activated, and below that threshold the neuron is deactivated.
➢This activation function can be used in binary classification.



Different Activation Function
Linear Activation Function
• The output is directly proportional to the weighted sum of the neuron's inputs. A linear activation function can deal with multiple classes, unlike the binary step function.
• The drawback of the linear activation function is that no matter how deep the neural network is, the last layer will always be a linear function of the first layer. This limits the neural network's ability to deal with complex problems.



Non-Linear Activation Functions
Sigmoid function (Logistic function)
➢Takes a probabilistic approach; the output ranges between 0 and 1.
➢It normalizes the output of each neuron.



Non-Linear Activation Functions
Tanh function (Hyperbolic tangent)
It is almost like the sigmoid function, but slightly better, since its output ranges between -1 and 1, allowing negative outputs.



Non-Linear Activation Functions
• Sigmoid(z) yields a value (a probability) between 0 and 1. Also, as the sigmoid is a non-linear function, the output of this unit is a non-linear function of the weighted sum of its inputs.
• The multiclass extension of the sigmoid is called Softmax, which is used for multiclass classification problems.
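A short NumPy sketch of the activation functions discussed in this section (binary step, linear, sigmoid, tanh and softmax); the threshold and sample inputs are illustrative.

```python
import numpy as np

def step(z, threshold=0.0):
    return np.where(z >= threshold, 1.0, 0.0)    # binary step: on/off around a threshold

def linear(z):
    return z                                     # output proportional to the input

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                            # squashes values into (-1, 1)

def softmax(z):
    e = np.exp(z - np.max(z))                    # stabilised multiclass extension of sigmoid
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(step(z), sigmoid(z), tanh(z), softmax(z))
```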



• Regression: consists of 1 neuron

• Binary Classification: consists of 1 neuron

• Multi-label Classification: consists of 1 neuron per label

• Multi-class Classification: consists of 1 neuron per class in the output layer.



Design Issues-Neural Network
 Initial weights
 Transfer function (How the inputs and the weights are
combined to produce output?)
 Error estimation
 Weights adjusting
 Number of neurons
 Data representation
 Size of training set
Thank You!
