Module 2 - Optimizations
1. Take the input equation Z = W0 + W1X1 + W2X2 + … + WnXn and predict the output Y (Ypred).
2. Calculate the error, which tells how much the model deviates from the actual observed values. It is calculated as Ypred − Yactual.
• The error metric to use depends on whether the problem is regression or classification (a short sketch of these metrics follows below).
• Regression - RMSE, MAPE, MAE
• Classification - Binary Cross-Entropy cost function
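For concreteness, a minimal NumPy sketch of these error metrics; the function names and the sample values are illustrative only, not part of the original material.

import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Squared Error (regression)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

def mae(y_true, y_pred):
    # Mean Absolute Error (regression)
    return np.mean(np.abs(y_pred - y_true))

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error (regression)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy (classification); clip probabilities to avoid log(0)
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])
print(rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))

labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(labels, probs))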
• The problem was to find the best-fit line that generalizes well over the data.
• There are possibly infinitely many candidate lines, but we can never be sure whether a given line is the best fit or not.
• Saviour: the Cost Function (J).
• Aim: achieve the best-fit regression line to predict ‘y’ such that the error between the predicted value and the true value is minimum.
• Goal: reduce the cost function, which in turn improves the accuracy.
• Naive solution: choose parameter values by trial and error and calculate the MSE for each combination of parameters (as sketched below).
• Not at all efficient!
• Fortunately, there is a calculus-based solution to this problem: gradient descent.
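To illustrate why trial and error is inefficient, the sketch below evaluates the MSE over a coarse grid of candidate (w0, w1) values for a one-feature line y = w0 + w1·x; the data and grid ranges are made up.

import numpy as np

# Toy data (made up): y roughly follows 2 + 3*x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 14.1, 16.8])

best = (None, None, float("inf"))
# "Hit and trial": try every (w0, w1) pair on a coarse grid
for w0 in np.linspace(-5, 5, 101):
    for w1 in np.linspace(-5, 5, 101):
        y_pred = w0 + w1 * x
        mse = np.mean((y_pred - y) ** 2)
        if mse < best[2]:
            best = (w0, w1, mse)

print("best w0, w1, MSE:", best)
# Even this tiny 2-parameter grid needs 101 * 101 = 10201 evaluations;
# the cost explodes as the number of parameters grows.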
Gradient Descent
• Gradient descent (GD) is an iterative first-order optimization
algorithm used to find a local minimum/maximum of a given function.
Function Requirements
The gradient descent algorithm does not work for all functions. There are
two specific requirements. A function has to be:
• Differentiable
• Convex
Gradient Descent
What does it mean for a function to be differentiable? It has a derivative at every point in its domain, so its graph has no breaks or sharp corners.
What does it mean for a function to be convex? For a univariate function, the line segment connecting any two points on the function lies on or above its curve and does not cross it.
If it does cross, it means the function has a local minimum which is not a global one.
Gradient Descent
Another way to check mathematically if a univariate function is convex
is to calculate the second derivative and check if its value is always
greater than 0.
For example, a quadratic function f(x) = x² − x + 3 has second derivative f''(x) = 2.
Because the second derivative is always greater than zero, the function above is strictly convex.
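This second-derivative check can also be carried out symbolically, for example with SymPy; a minimal sketch, with the quadratic chosen only for illustration.

import sympy as sp

x = sp.symbols("x", real=True)
f = x**2 - x + 3           # illustrative quadratic

f2 = sp.diff(f, x, 2)      # second derivative
print(f2)                  # 2

# Strictly convex if the second derivative is positive everywhere;
# here it is the constant 2, so the check is immediate.
print(bool(f2 > 0))        # True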
Gradient Descent
• Gradient - slope of a curve at a given point in a specified direction.
• For a univariate function, it is the first derivative of the function at
a selected point.
• In the case of a multivariate function, it is a vector of (partial)
derivatives in each direction (along variable axes).
• The gradient of an n-dimensional function f(x) at a given point p is defined as the vector of its partial derivatives evaluated at p:

∇f(p) = [ ∂f/∂x1(p), ∂f/∂x2(p), …, ∂f/∂xn(p) ]
Gradient Descent
Example: the gradient evaluated at the point p = (10, 10).
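As a worked illustration, assume f(x, y) = 0.5·x² + y² (the choice of function is an assumption for illustration, not necessarily the original example); the sketch below compares its analytic gradient with a finite-difference estimate at p = (10, 10).

import numpy as np

def f(x, y):
    # Assumed example function, chosen only for illustration
    return 0.5 * x**2 + y**2

def grad_f(x, y):
    # Analytic gradient: [df/dx, df/dy] = [x, 2y]
    return np.array([x, 2.0 * y])

def numerical_grad(func, x, y, h=1e-6):
    # Central finite differences as a sanity check
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

p = (10.0, 10.0)
print(grad_f(*p))             # [10. 20.]
print(numerical_grad(f, *p))  # approximately [10. 20.]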
• Think of the algorithm as a ball placed somewhere in a valley: it has to know which way the valley slopes (and it does not have eyes as you do), so it relies on mathematics.
• To find the slope of a function at a point, differentiate the function with respect to its parameters at that point. Gradient descent therefore differentiates the cost function (J) to obtain the slope at the current point.
• It then takes a small step towards the bottom; the learning rate decides the length of the step that gradient descent takes.
• After every move it checks whether the current position is a minimum. This is indicated by the slope at that point: if the slope is zero, the algorithm has reached the bottom-most point.
• After every step, it updates the parameters (weights). By repeating these steps it reaches the bottom-most point.
Gradient Descent Algorithm
• Once the bottom-most point of the valley is reached, the parameters corresponding to the least MSE or cost function value have been obtained.
• The Linear Regression model is now ready for use, to predict the dependent variable for unseen data points with high accuracy.
Steps in the Gradient Descent Algorithm
• Choose a starting point (initialisation) p0.
• Calculate the gradient at this point: ∇f(pn).
• Make a scaled step in the opposite direction of the gradient (objective: minimise): pn+1 = pn − η · ∇f(pn).
• Repeat the last two steps until the step size is smaller than the tolerance (due to scaling or a small gradient).
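Putting these steps together, a minimal sketch of the gradient descent loop for a univariate function; the objective, learning rate and tolerance are illustrative.

def gradient_descent(grad, start, learning_rate=0.1, tolerance=1e-6, max_iter=10_000):
    """Minimise a function given its gradient, following the steps above."""
    p = start                                  # 1. choose a starting point
    for _ in range(max_iter):
        step = learning_rate * grad(p)         # 2. gradient, scaled by the learning rate
        p = p - step                           # 3. step opposite to the gradient
        if abs(step) < tolerance:              # 4. stop when the step is below tolerance
            break
    return p

# Illustrative function: f(x) = x**2 - x + 3, whose gradient is 2x - 1
minimum = gradient_descent(grad=lambda x: 2 * x - 1, start=9.0)
print(minimum)  # close to 0.5, the analytic minimiser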
Learning Rate & Step Size
• There’s an important parameter η which scales the gradient and thus controls the step size.
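To see how η affects behaviour, a small sketch running plain gradient descent on f(x) = x², whose gradient is 2x, with several illustrative learning rates.

def run(learning_rate, steps=20, start=1.0):
    # Gradient descent on f(x) = x**2, whose gradient is 2x
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

for lr in (0.01, 0.1, 0.5, 1.1):
    print(lr, run(lr))
# Small lr: slow but steady progress towards 0.
# Moderate lr: fast convergence.
# lr = 0.5 jumps straight to 0 for this particular function.
# lr > 1.0 overshoots further on every step and diverges.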
To get the ball out of a local minimum, we need to accumulate momentum or speed.
In a neural network, the accumulated speed is equivalent to a weighted sum of the past gradients, represented as:

mt = β · mt-1 + (1 − β) · dL/dw

where,
dL/dw = current gradient at time t
mt-1 = previously accumulated gradient up to time t−1
β controls how much weight is given to the current gradient versus the previously accumulated gradient. Generally, about 10 percent of the weight is given to the current gradient and 90 percent to the previously accumulated gradient (β ≈ 0.9).
Challenge 1: Gradient Descent gets stuck at Local Minima
• At a local minimum (where the ball has settled), the slope dL/dw is zero, so the equation becomes:

mt = β · mt-1

• This mt gives the required push to come out of the local minimum.
• mt then updates the weights to minimize the cost function of the Neural Network (see the sketch below):

wt = wt-1 − η · mt
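A minimal sketch of these momentum updates in plain Python; the loss, β and learning rate values are illustrative.

def momentum_descent(grad, start, learning_rate=0.1, beta=0.9, steps=200):
    w, m = start, 0.0
    for _ in range(steps):
        g = grad(w)                       # current gradient dL/dw at time t
        m = beta * m + (1 - beta) * g     # mt = beta * m(t-1) + (1 - beta) * dL/dw
        w = w - learning_rate * m         # wt = w(t-1) - eta * mt
    return w

# Illustrative loss with gradient 2w: the iterates converge to the minimum at 0.
# At a point where g is ~0 (e.g. a local minimum), m = beta * m(t-1) keeps the weights moving.
print(momentum_descent(grad=lambda w: 2 * w, start=5.0))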
Challenge 2: The learning rate does not change in Gradient Descent
• Solution: RMSProp (Root Mean Squared Propagation)
• During training, the slope or gradient dL/dw changes. It is the rate of change of the loss function with respect to the parameter weight, and we can use it to adapt the learning rate.
• We take a weighted sum of the squared past gradients, i.e. of the squares of the partial derivatives dL/dw, as below:

Vt = β · Vt-1 + (1 − β) · (dL/dw)²
Challenge 2: The learning rate does not change in Gradient Descent
• The new weights are updated as:

wt = wt-1 − (η / √(Vt + ε)) · dL/dw

• ε is a small smoothing term added to Vt so that the denominator never becomes zero; its value is generally very small.
• When the square of the slope, (dL/dw)², is high, Vt increases, which reduces the effective learning rate.
• Similarly, when the square of the slope is low, Vt is low, which increases the effective learning rate.
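A minimal sketch of the RMSProp update above in plain Python; the loss, β, η and ε values are illustrative.

import math

def rmsprop(grad, start, learning_rate=0.01, beta=0.9, eps=1e-8, steps=500):
    w, v = start, 0.0
    for _ in range(steps):
        g = grad(w)                          # dL/dw at time t
        v = beta * v + (1 - beta) * g ** 2   # Vt = beta * V(t-1) + (1 - beta) * (dL/dw)^2
        w = w - learning_rate * g / math.sqrt(v + eps)  # wt = w(t-1) - eta / sqrt(Vt + eps) * dL/dw
    return w

# Illustrative loss with gradient 2w: large gradients early on shrink the effective
# step size, and small gradients near the minimum enlarge it again.
print(rmsprop(grad=lambda w: 2 * w, start=5.0))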
Challenge 2: The learning rate does not change in Gradient Descent
• Calculating the gradient at a steep starting point gives a slope of high magnitude. This increases the square of the slope, which reduces the effective learning rate, so small steps are taken to minimise the loss.
Gradient descent, with these refinements, is the go-to algorithm when training a neural network and is the most common optimization approach within deep learning.
Applications of Gradient Descent
Sales Driver Analysis — Linear Regression can be used to predict the sale of products in the future based
on past buying behaviour.
Predict Economic Growth — Economists use Linear Regression to predict the economic growth of a
country or state.
Score Prediction — Sports analysts use linear regression to predict the number of runs or goals a player would score in coming matches based on previous performances.
Salary Estimation — An organisation can use linear regression to figure out how much to pay a new joiner based on their years of experience.
House Price Prediction — Linear regression analysis can help a builder predict how many houses they would sell in the coming months and at what price.
Oil Price Prediction — Petroleum prices can be predicted using linear regression.
Errors in Machine Learning
• Regularized regression adds a penalty term to the cost function to reduce overfitting; the penalty is weighted by α, the regularization parameter.
• α controls the size of the coefficients and the amount of regularization.
Lasso Regression
• Uses the L1 regularization technique: a penalty on the absolute size of the coefficients.
• It minimizes the residual sum of squares plus a penalty on the sum of the absolute values of the weights.
• The objective is to minimize:

Σi (yi − ŷi)² + α · Σj |wj|
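As a usage sketch, scikit-learn's Lasso estimator implements this kind of L1-penalized objective (scikit-learn additionally scales the squared-error term by 1/(2·n_samples)); the data below is synthetic.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features matter; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)   # alpha is the regularization parameter
model.fit(X, y)

# The L1 penalty drives the irrelevant coefficients towards exactly zero
print(model.coef_, model.intercept_)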
Elastic Net Regression
• Combines Ridge (L2 regularizer) and Lasso (L1 regularizer) penalties in order to train the model.
• The objective is to minimize (with separate strengths α1 and α2 for the L1 and L2 penalties):

Σi (yi − ŷi)² + α1 · Σj |wj| + α2 · Σj wj²
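A usage sketch with scikit-learn's ElasticNet, which expresses the combined penalty through alpha (overall strength) and l1_ratio (the mix between L1 and L2); the data below is synthetic.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# alpha sets the overall penalty strength; l1_ratio mixes L1 (Lasso) and L2 (Ridge)
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)

print(model.coef_, model.intercept_)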