
Lect 5 - Gradient Descent

Gradient Descent (GD) is an iterative optimization algorithm used to find local minima or maxima of functions, commonly applied in machine learning to minimize cost functions. It requires functions to be differentiable and convex, with the gradient indicating the direction of steepest increase. Variants of GD include Batch, Stochastic, and Mini-batch methods, each with different computational efficiencies and convergence characteristics.


GRADIENT DESCENT

Umarani Jayaraman
Gradient Descent - Introduction
 Gradient descent (GD) is an iterative first-order optimization algorithm used to find a local minimum/maximum of a given function.
 This method is commonly used in machine learning (ML) and deep learning (DL) to minimize a cost/loss function (e.g., in linear regression).
 This method was proposed long before the era of modern computers, by Augustin-Louis Cauchy in 1847.
Function requirements

 The gradient descent algorithm does not work for all functions. There are two specific requirements. A function has to be:
 differentiable
 convex
First requirement - differentiable

 First, what does it mean that a function has to be differentiable?
 If a function is differentiable, it has a derivative at each point in its domain.
 Not all functions meet this criterion. First, let’s see some examples of functions that do meet it.
 Typical non-differentiable functions have a step, a cusp, or a discontinuity.
Next requirement - function has to be convex

 For a univariate function, this means that the line segment connecting any two points on the function’s curve lies on or above the curve (it does not cross it).
 If it does cross the curve, the function has a local minimum which is not a global one.
 Mathematically, for two points x₁, x₂ lying on the function’s curve this condition is expressed as:
 f((1−λ)·x₁ + λ·x₂) ≤ (1−λ)·f(x₁) + λ·f(x₂)
 where λ denotes a point’s location on the section line and its value has to be between 0 (left point x₁) and 1 (right point x₂),
 e.g. λ=0.5 means a location in the middle.
 Below there are two functions with exemplary section lines.
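This inequality can also be checked numerically; a minimal sketch, assuming an illustrative convex function f(x) = x² and two arbitrary sample points:

import numpy as np

f = lambda x: x**2                       # illustrative convex function (assumption)
x1, x2 = -1.0, 3.0                       # two arbitrary points on the curve

for lam in np.linspace(0.0, 1.0, 11):
    curve = f((1 - lam) * x1 + lam * x2)        # point on the function's curve
    chord = (1 - lam) * f(x1) + lam * f(x2)     # point on the section line
    assert curve <= chord + 1e-12               # convexity condition holds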
Caution: first requirement – differentiable; what about the second requirement?

 First, what does it mean that a function has to be differentiable?
 If a function is differentiable, it has a derivative at each point in its domain.
 The second and third functions are not convex.
 Another way to check mathematically whether a univariate function is convex is to calculate its second derivative and check that its value is always non-negative (a strictly positive second derivative gives strict convexity).
Let’s do a simple example
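A minimal sketch of such a check, assuming the illustrative function f(x) = x²; the second derivative can be computed symbolically with SymPy:

import sympy as sp

x = sp.symbols('x')
f = x**2                               # illustrative function (assumption)

second = sp.diff(f, x, 2)              # second derivative of f
print(second)                          # 2, positive everywhere, so f is convex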
Saddle points
 It is also possible to use quasi-convex functions with a gradient descent algorithm.
 However, they often have so-called saddle points (also called minimax points) where the algorithm can get stuck.
 An example of a quasi-convex function is:
 First-order derivative:
 Second-order derivative:
 An example of a saddle point in a bivariate function is shown below.
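A minimal sketch of how a saddle point can be verified, assuming the illustrative bivariate function f(x, y) = x² − y² (not the slide’s example): the gradient vanishes at the origin, yet the Hessian has mixed-sign eigenvalues, so the point is neither a minimum nor a maximum.

import sympy as sp

x, y = sp.symbols('x y')
f = x**2 - y**2                                  # saddle-shaped surface (assumption)

grad = [sp.diff(f, v) for v in (x, y)]           # [2*x, -2*y]
print([g.subs({x: 0, y: 0}) for g in grad])      # [0, 0] -> stationary point at the origin

H = sp.hessian(f, (x, y))                        # Matrix([[2, 0], [0, -2]])
print(H.eigenvals())                             # eigenvalues of mixed sign -> saddle point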
Gradient
 Intuitively, the gradient is the slope of a curve at a given point in a specified direction.
 In the case of a univariate function, it is simply the first derivative at a selected point.
 In the case of a multivariate function, it is a vector of derivatives in each main direction (along the variable axes), i.e. the partial derivatives.
 The gradient of an n-dimensional function f(x) at a given point p is defined as follows:
 ∇f(p) = (∂f/∂x₁(p), ∂f/∂x₂(p), ..., ∂f/∂xₙ(p))
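A minimal numerical sketch of this definition using central finite differences (the test function and point are illustrative assumptions):

import numpy as np

def numerical_gradient(f, p, h=1e-6):
    # Approximate each partial derivative of f at point p with a central difference.
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = h
        grad[i] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

# Example: f(x, y) = x**2 + y**2 has gradient (2x, 2y); at p = (1, 2) this is (2, 4).
print(numerical_gradient(lambda v: v[0]**2 + v[1]**2, [1.0, 2.0]))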
Gradient Descent Procedure
 In summary, the Gradient Descent method’s steps are:
 1. choose a starting point (initialization)
 2. calculate the gradient at this point
 3. make a scaled step in the opposite direction to the gradient (objective: minimize)
 4. repeat steps 2 and 3 until one of the criteria is met:
 maximum number of iterations reached
 step size is smaller than the tolerance (due to scaling or a small gradient).
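In symbols, step 3 corresponds to the standard update rule, where η is the learning rate (scaling factor) and pₙ the current point:

pₙ₊₁ = pₙ − η·∇f(pₙ)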
Gradient Descent: sample code
 This function takes 5 parameters:
 1. starting point [float] - in our case, we define it manually, but in practice it is often a random initialisation
 2. gradient function [object] - a function calculating the gradient, which has to be specified beforehand and passed to the GD function
 3. learning rate [float] - scaling factor for step sizes
 4. maximum number of iterations [int]
 5. tolerance [float] to conditionally stop the algorithm (in this case the default value is 0.01)
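A minimal Python sketch of a routine with these five parameters (names and details are illustrative assumptions, not the original slide’s code):

import numpy as np

def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    # Minimize a function given its gradient, starting from `start`.
    x = start
    history = [x]
    for _ in range(max_iter):
        step = learn_rate * gradient(x)      # scaled step along the gradient
        if np.abs(step) < tol:               # stop when the step becomes too small
            break
        x = x - step                         # move against the gradient direction
        history.append(x)
    return history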
Effect of different learning rates
 The animation below shows steps taken by the GD algorithm for learning rates of 0.1 and 0.8.
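As a usage sketch, the function above can be run with both learning rates on an assumed quadratic test function f(x) = x² (gradient 2x) from an assumed starting point:

quadratic_grad = lambda x: 2 * x   # gradient of f(x) = x**2 (illustrative assumption)

path_small = gradient_descent(start=9.0, gradient=quadratic_grad,
                              learn_rate=0.1, max_iter=100)
path_large = gradient_descent(start=9.0, gradient=quadratic_grad,
                              learn_rate=0.8, max_iter=100)
# A small rate takes many short, steady steps; a large rate overshoots and
# zig-zags around the minimum before settling.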
Results of various learning rates
Gradient - summary
 The gradient is a fundamental concept in calculus and optimization.
 The gradient of a function, denoted by ∇ (nabla), is a vector that points in the direction of the steepest increase of the function at a given point.
 Mathematically, for a function f(x₁, x₂, ..., xₙ), the gradient is given by:
 ∇f = (∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ)
 Each component of the gradient represents the partial derivative of the function with respect to one of its input variables.
Significance in Optimization

 In the context of optimization problems, the goal is often to find the minimum or maximum of a function.
 The gradient provides crucial information about the direction and rate of change of the function.
 The negative gradient points in the direction of the steepest decrease of the function.
 Therefore, moving in the direction opposite to the gradient helps in descending towards the minimum of the function.
Gradient Descent Algorithm
Batch Gradient Descent
 The gradient is computed over the entire training set at each step, which can be computationally intensive.
Stochastic Gradient Descent
 The gradient is estimated from a single training sample at each step: easy to compute, but very noisy.
Mini-batch Gradient Descent
 The gradient is computed over a small random batch of samples: fast to compute and a much better estimate of the true gradient.
Mini-batches while training
 More accurate estimation of the gradient
 Smoother convergence
 Allows for larger learning rates
 Mini-batches lead to fast training
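A minimal sketch contrasting the three gradient estimates for a least-squares problem (the data, model, and batch size are illustrative assumptions):

import numpy as np

# Illustrative setup: gradient of 0.5 * ||X @ w - y||**2, averaged over the samples used.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)

def batch_gradient(w):
    # Batch GD: uses every sample -> accurate but computationally intensive.
    return X.T @ (X @ w - y) / len(y)

def stochastic_gradient(w):
    # Stochastic GD: uses one random sample -> cheap but a very noisy estimate.
    i = rng.integers(len(y))
    return X[i] * (X[i] @ w - y[i])

def minibatch_gradient(w, batch_size=32):
    # Mini-batch GD: uses a small random batch -> cheap and a much better estimate.
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size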
Error minimization with iterations
Gradient Descent - Variants
 Batch
 Stochastic
 Mini-batch
Thank you
