
Lec 5 - Gradient-Descent

Gradient descent is an iterative optimization algorithm used to minimize loss functions. It works by taking steps proportional to the negative gradient of the function to reach a local minimum. The function must be differentiable and convex for gradient descent to work. Each step of gradient descent calculates the gradient of the loss function at the current point and moves in the opposite direction, repeating until convergence within a specified tolerance.


GRADIENT DESCENT

Umarani Jayaraman
Gradient Descent - Introduction
◻ Gradient descent (GD) is an iterative first-order optimization algorithm used to find a local minimum/maximum of a given function.
◻ This method is commonly used in machine learning (ML) and deep learning (DL) to minimize a cost/loss function (e.g., in linear regression).
◻ The method was proposed long before the era of modern computers, by Augustin-Louis Cauchy in 1847.
Function requirements
◻ The gradient descent algorithm does not work for all functions. There are two specific requirements. A function has to be:
◻ differentiable
◻ convex
First requirement - differentiable
◻ First, what does it mean for a function to be differentiable?
◻ If a function is differentiable, it has a derivative at each point in its domain.
◻ Not all functions meet this criterion. First, let's see some examples of functions that do meet it.
◻ Typical non-differentiable functions have a step, a cusp, or a discontinuity.
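◻ An illustrative pair of examples (not from the original slides): f(x) = x² is differentiable everywhere, with f′(x) = 2x; f(x) = |x| is not differentiable at x = 0 because of the cusp there (the slope is −1 on the left and +1 on the right, so no single derivative exists at that point); a step function such as f(x) = ⌊x⌋ fails because of its jump discontinuities.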
Next requirement - function has to be convex
◻ For a univariate function, convexity means that the line segment connecting any two points on the function's curve lies on or above the curve (it does not cross it).
◻ If the segment crosses the curve, the function has a local minimum that is not a global one.
◻ Mathematically, for two points x₁, x₂ lying on the function's curve, this condition is expressed as:

f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂)

◻ where λ denotes a point's location on the section line and its value has to be between 0 (left point) and 1 (right point),
◻ e.g. λ = 0.5 means a location in the middle.
◻ Below are two functions with example section lines.
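◻ A quick illustrative check (not from the original slides): take f(x) = x² with x₁ = 0, x₂ = 2 and λ = 0.5. The left-hand side is f(1) = 1 and the right-hand side is 0.5·f(0) + 0.5·f(2) = 2, so the inequality 1 ≤ 2 holds, as expected for the convex function x².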
Caution: the first requirement (differentiable) may hold, but what about the second requirement?
◻ A function may be differentiable (it has a derivative at each point in its domain) and still fail the convexity requirement.
◻ The second and third example functions are differentiable but not convex.
◻ Another way to check mathematically whether a univariate function is convex is to calculate its second derivative and check whether its value is always greater than 0.
Let’s do a simple example
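The slide's worked example is not preserved here, so the following is an illustrative one using the second-derivative test above:
◻ f(x) = x² gives f″(x) = 2 > 0 for every x, so the function is convex.
◻ f(x) = x⁴ − 2x² gives f″(x) = 12x² − 4, which is negative for |x| < 1/√3, so the function is not convex (it has two separate local minima at x = ±1).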
Saddle points
◻ It is also possible to use quasi-convex functions with a gradient descent algorithm.
◻ However, they often have so-called saddle points (also called minimax points) where the algorithm can get stuck.
◻ An example of a quasi-convex function, with its first-order and second-order derivatives, is worked through below.
◻ An example of a saddle point in a bivariate function is shown below.
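The specific function from this slide is not preserved, so the following is a standard illustrative choice:
◻ f(x) = x⁴ − 2x³ + 2, which is quasi-convex (it decreases up to x = 1.5 and increases afterwards).
◻ First-order derivative: f′(x) = 4x³ − 6x² = 2x²(2x − 3), which is zero at x = 0 and x = 1.5.
◻ Second-order derivative: f″(x) = 12x² − 12x, so f″(0) = 0; since f′ does not change sign at x = 0, that point is a saddle point where gradient descent can stall, while x = 1.5 is the actual minimum.
◻ A classic bivariate saddle is f(x, y) = x² − y², which has a saddle point at the origin.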
Gradient
◻ Intuitively, the gradient is the slope of a curve at a given point in a specified direction.
◻ In the case of a univariate function, it is simply the first derivative at a selected point.
◻ In the case of a multivariate function, it is a vector of derivatives in each main direction (along the variable axes), i.e. the partial derivatives.
◻ The gradient of an n-dimensional function f(x) at a given point p is defined as:

∇f(p) = (∂f/∂x₁(p), ∂f/∂x₂(p), ..., ∂f/∂xₙ(p))
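As a small numerical illustration of this definition (the function, point, and helper below are assumptions for the example, not part of the lecture):

import numpy as np

def f(x):
    # illustrative function f(x1, x2) = x1**2 + 3*x2, whose gradient is (2*x1, 3)
    return x[0]**2 + 3 * x[1]

def numerical_gradient(func, p, h=1e-6):
    # estimate each partial derivative with a central difference
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad

print(numerical_gradient(f, [1.0, 2.0]))  # approximately [2. 3.]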
Gradient Descent Procedure
◻ In summary, the Gradient Descent method's steps are:
◻ 1. choose a starting point (initialization)
◻ 2. calculate the gradient at this point
◻ 3. make a scaled step in the opposite direction of the gradient (objective: minimize)
◻ 4. repeat steps 2 and 3 until one of the stopping criteria is met:
   the maximum number of iterations is reached
   the step size is smaller than the tolerance (due to scaling or a small gradient)
Gradient Descent: sample code
◻ The function takes 5 parameters (a sketch of such a function is given below):
◻ 1. starting point [float] - in our case we define it manually, but in practice it is often a random initialisation
◻ 2. gradient function [object] - a function calculating the gradient, which has to be specified beforehand and passed to the GD function
◻ 3. learning rate [float] - scaling factor for step sizes
◻ 4. maximum number of iterations [int]
◻ 5. tolerance [float] - used to conditionally stop the algorithm (here the default value is 0.01)
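The original listing is not reproduced on this slide, so below is a minimal sketch following the five parameters described above (the exact names and details are assumptions, not the lecture's code):

def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    # minimize a univariate function given its gradient, following the procedure above
    x = start
    history = [x]
    for _ in range(max_iter):
        step = learn_rate * gradient(x)   # scaled step in the gradient direction
        if abs(step) < tol:               # stop once the step is smaller than the tolerance
            break
        x = x - step                      # move opposite to the gradient
        history.append(x)
    return x, history

# example: minimize f(x) = x**2, whose gradient is 2*x
x_min, path = gradient_descent(start=9.0, gradient=lambda x: 2 * x,
                               learn_rate=0.1, max_iter=100)
print(x_min)  # close to 0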
Effect of different learning rates
◻ The animation below shows steps taken by the GD
algorithm for learning rates of 0.1 and 0.8.
Results for various learning rates
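The original plots are not reproduced here; a quick numerical comparison using the gradient_descent sketch above gives a feel for the difference:

for lr in (0.1, 0.8):
    x_min, path = gradient_descent(start=9.0, gradient=lambda x: 2 * x,
                                   learn_rate=lr, max_iter=100)
    print(f"learning rate {lr}: visited {len(path)} points, final x = {x_min:.4f}")

With the smaller rate the iterates approach the minimum monotonically; with 0.8 they overshoot and oscillate around it before settling near zero.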
Gradient - summary
◻ The gradient is a fundamental concept in calculus and in optimization techniques.
◻ The gradient of a function, denoted by ∇ (nabla), is a vector that points in the direction of the steepest increase of the function at a given point.
◻ Mathematically, for a function f(x₁, x₂, ..., xₙ), the gradient is given by:
◻ ∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)
◻ Each component of the gradient is the partial derivative of the function with respect to one of its input variables.
Significance in Optimization
◻ In the context of optimization problems, the goal is often to find the minimum or maximum of a function.
◻ The gradient provides crucial information about the direction and rate of change of the function.
◻ The negative gradient points in the direction of the steepest decrease of the function.
◻ Therefore, moving in the direction opposite to the gradient helps in descending towards the minimum of the function.
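◻ A one-step illustration (not from the slides): for f(x) = x² at x = 2, the gradient is f′(2) = 4; with learning rate 0.1 the update is x ← 2 − 0.1·4 = 1.6, and f decreases from 4 to 2.56, showing how a step against the gradient reduces the function value.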
Gradient Descent Algorithm
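The algorithm box from this slide is not preserved; the standard per-iteration update it describes is θ ← θ − η∇L(θ), where θ are the parameters, η is the learning rate, and L is the loss function being minimized.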
Batch Gradient Descent
◻ The gradient is computed over the entire training set at every step, which can be computationally intensive.
Stochastic Gradient Descent
◻ The gradient is estimated from a single training example at a time: easy to compute, but very noisy.
Mini-batch Gradient Descent
◻ The gradient is estimated from a small batch of examples: fast to compute and a much better estimate of the true gradient.
Mini-batches while training
◻ More accurate estimation of the gradient
◻ Smoother convergence
◻ Allows for larger learning rates
◻ Mini-batches lead to fast training (a sketch contrasting the three variants follows)
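A minimal sketch contrasting the three variants on a toy linear-regression loss (the data, names, and hyperparameters are assumptions for illustration, not the lecture's code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                      # toy features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)    # toy targets

def mse_gradient(w, Xb, yb):
    # gradient of the mean-squared-error loss on the batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

def train(batch_size, learn_rate=0.1, epochs=50):
    # batch_size = len(X): batch GD; batch_size = 1: stochastic GD; otherwise mini-batch GD
    w = np.zeros(3)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]
            w = w - learn_rate * mse_gradient(w, X[batch], y[batch])
    return w

print("batch     :", train(batch_size=len(X)))
print("stochastic:", train(batch_size=1))
print("mini-batch:", train(batch_size=32))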
Error minimization with iterations
Gradient Descent - Variants
◻ Batch
◻ Stochastic
◻ Mini-batch
Thank you
