
MACHINE LEARNING (CS 403/603)

Introduction to Gradient Descent

Dr. Puneet Gupta


Derivatives
The magnitude of the derivative at a point is the rate of change of the function at that point.

A positive derivative means f increases if we increase the value of x by a very small amount; a negative derivative means f decreases.

Understanding how f changes its value as we change x is helpful for understanding optimization (minimization/maximization) algorithms.

The derivative becomes zero at stationary points (optima or saddle points):


● The function becomes “flat” (f’(x) = 0): f barely changes if we change x by a very small amount at such points.
● These are the points where the function has its maxima/minima (unless they are saddle points). At a saddle point the derivative is zero, but the point is neither a minimum nor a maximum. Saddle points are common in deep learning models.
Derivatives
For a multivariate function, each element of the gradient vector tells us how much f will change if we move a little along the corresponding direction.

Optima and saddle points are defined similarly to the one-dimensional case: the properties we saw for the one-dimensional case must be satisfied along all directions.

The second derivative in this case is known as the Hessian.
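For reference, a minimal sketch of this notation in LaTeX, for a function f of a D-dimensional input x (the symbol D for the input dimension is an assumption, not from the slides):

```latex
\[
\nabla f(\mathbf{x}) = \left[ \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_D} \right]^{\top},
\qquad
\mathbf{H} = \nabla^2 f(\mathbf{x}), \quad H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}
\]
```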


Analyzing optimal solutions for loss functions
We can find the optima (minima) of a loss function by visualizing it as a function of the weights, in terms of curves or surfaces. For convex functions, the local minima and the global minimum coincide; for non-convex functions, they can differ.

[Figure: loss as a function of weight W for a convex function (one global optimum) and a non-convex function (a local optimum distinct from the global optimum).]
Convex set
A subset C is convex if, for all x and y in C, the line segment connecting x
and y is included in C.
This means that the affine combination (1 − t)x + ty belongs to C, for all x
and y in C, and t in the interval [0, 1].

[Figure: examples of a convex set and a non-convex set.]


Convex Functions
f is convex if, for all vectors x, y ∈ C and β ∈ [0, 1]:

f(βx + (1 − β)y) ≤ βf(x) + (1 − β)f(y)

The domain of a convex function must be a convex set. Intuitively, a function f(x) is convex if all of its chords lie above the function everywhere.

[Figure: a convex function with the chord from (x, f(x)) to (y, f(y)); the chord value βf(x) + (1 − β)f(y) lies above the function value f(βx + (1 − β)y).]
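A minimal numerical check of this chord definition, using f(x) = x² and sin(x) purely as illustrative test functions (they are not from the slides):

```python
import numpy as np

def chord_check(f, x, y, n=11):
    """Check f(b*x + (1-b)*y) <= b*f(x) + (1-b)*f(y) on a grid of b in [0, 1]."""
    betas = np.linspace(0.0, 1.0, n)
    lhs = f(betas * x + (1 - betas) * y)       # function values along the segment
    rhs = betas * f(x) + (1 - betas) * f(y)    # corresponding chord values
    return np.all(lhs <= rhs + 1e-12)          # tolerance for floating-point error

print(chord_check(lambda x: x**2, -3.0, 5.0))  # True: x^2 is convex
print(chord_check(np.sin, 0.0, np.pi))         # False: sin is not convex on [0, pi]
```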

Examples

[Figure: example loss curves as functions of W, one for a convex function and one for a non-convex function.]
Convex Functions
Conditions to test the convexity of a differentiable function:

● First-order convexity: the graph of the function f must lie above all of its tangents.
● Second-order convexity: the second derivative, or Hessian (if it exists), must be positive semi-definite. In one dimension, f is convex if and only if f″(x) ≥ 0 for all x.

[Figure: a convex function f(w) with a tangent at (x, f(x)) lying below the graph.]

Some important points to remember:

● All linear and affine functions (e.g., ax + b) are convex
● exp(ax) is convex for x ∈ R, for any a ∈ R
● log(x) is concave (not convex) for x > 0
● x^a is convex for x > 0 when a ≥ 1 or a ≤ 0, and concave for 0 ≤ a ≤ 1
● A non-negative weighted sum of convex functions is also a convex function
● Affine transformations preserve convexity: if f(x) is convex, then g(x) = f(ax + b) is also convex
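These properties can be sanity-checked numerically via the second-order condition; a small sketch, where the finite-difference step and the test grid are arbitrary assumptions:

```python
import numpy as np

def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

xs = np.linspace(0.1, 5.0, 50)  # positive grid, since log(x) and x^a need x > 0
print(np.all(second_derivative(np.exp, xs) >= 0))            # True: exp(x) convex
print(np.all(second_derivative(np.log, xs) <= 0))            # True: log(x) concave
print(np.all(second_derivative(lambda x: x**1.5, xs) >= 0))  # True: a >= 1, convex
print(np.all(second_derivative(lambda x: x**0.5, xs) <= 0))  # True: 0<=a<=1, concave
```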
First-order optimality condition
Usually, ML problems are non-convex in nature; they can be solved by non-convex optimization, which is an active research area and outside the scope of this course.

The approach we have used so far: the gradient g must be zero at every optimum (local or global). This is known as the first-order optimality condition. That is, set g = 0 and solve for the unknown parameters (like w for hyperplane-based learning) to find the optima.

This may or may not yield a closed-form solution: it does for linear regression, but not for logistic regression. Even when no closed-form solution exists, the gradient g is still useful in iterative optimization methods.

[Figure: a non-convex loss curve with stationary points labeled “Optimal Solution” and “Wrong Solution”.]
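As an illustration of the closed-form case, a minimal numpy sketch of linear regression solved by setting the gradient to zero (the synthetic data and variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # 100 examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)  # targets with small noise

# The gradient of ||y - Xw||^2 is -2 X^T (y - Xw); setting g = 0 gives
# the normal equations X^T X w = X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                  # close to w_true
```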
Iterative Optimization using gradients: Gradient Descent
A first-order method (it utilizes only the gradient g of the objective).

Basic idea: start at some location w(0) and move in the opposite direction of the gradient, w(t+1) = w(t) − ηt gt.

By how much? Till when?

● ηt, known as the learning rate, can be constant or can vary at each time step
● The effective step size (how much w moves) depends on both ηt and the current gradient gt

[Figure: descent direction for a negative vs. a positive gradient.]

When to stop: many criteria, e.g., the gradients become negligible, or the validation error starts increasing.

What happens for a convex function?

● Gradient descent is guaranteed to converge to a local optimum (which is the global optimum for convex functions)

[Figure: iterates w(0), w(1), w(2), w(3) approaching w(opt) from both sides of a convex loss curve.]
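A minimal Python sketch of this loop, assuming a toy quadratic objective and an arbitrary constant learning rate:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, max_iters=1000, tol=1e-6):
    """Iterate w(t+1) = w(t) - eta * g(t); stop once the gradient is negligible."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:   # stopping criterion from the slide
            break
        w = w - eta * g               # step in the opposite direction of the gradient
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2 (w - 3).
w_opt = gradient_descent(lambda w: 2 * (w - 3.0), w0=[0.0])
print(w_opt)                          # approximately [3.0]
```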
Importance of Weight Initializations
A good initialization w(0) plays a crucial role: we may get trapped in a bad local optimum for a non-convex function.

Remedy: run the optimization multiple times with different initializations and select the best one (see the sketch below).

[Figure: a non-convex loss curve where a bad initialization leads to a local optimum while a good initialization reaches the global optimum.]
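A sketch of this multiple-restart remedy, with an illustrative non-convex function (the function and all settings are assumptions, not from the slides):

```python
import numpy as np

def gd(grad, w0, eta=0.01, iters=500):
    """Plain gradient descent from a given starting point."""
    w = float(w0)
    for _ in range(iters):
        w -= eta * grad(w)
    return w

f = lambda w: w**4 - 3 * w**2 + w        # non-convex: two local minima
grad = lambda w: 4 * w**3 - 6 * w + 1

rng = np.random.default_rng(0)
starts = rng.uniform(-2.0, 2.0, size=10)      # several random initializations
candidates = [gd(grad, w0) for w0 in starts]
best = min(candidates, key=f)                 # keep the lowest-loss solution
print(best, f(best))                          # near the global optimum w ≈ -1.3
```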


Importance of Learning Rates

Problems with small learning rates:
● May take too long to converge
● May not be able to “cross” bad optima and reach towards good optima

Problems with large learning rates:
● May keep oscillating
● May jump from a good region to a bad region

The learning rate can be defined as:

● Constant (requires proper tuning for good convergence and proper optima estimation)
● Adaptively decreasing as the time step increases (e.g., divide η by a constant factor; see the sketch below)
● Adaptive learning rates (e.g., using methods such as Adagrad or Adam; revisited later)
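For the second option, one common decay rule looks like the following sketch (the constants η0 = 0.5 and decay = 0.1 are assumed for illustration):

```python
def eta(t, eta0=0.5, decay=0.1):
    """eta_t = eta0 / (1 + decay * t): large early steps, smaller ones later."""
    return eta0 / (1 + decay * t)

print([round(eta(t), 4) for t in range(5)])  # [0.5, 0.4545, 0.4167, 0.3846, 0.3571]
```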
Stochastic Gradient Descent
Mini-batch SGD
SGD uses a single example to approximate the gradient. This is a reasonable estimate of g, but one with large variance.

g = (1/N) Σᵢ gᵢ, the average over all N training examples, is the actual gradient; gᵢ, the gradient of the loss on example i alone, is a stochastic gradient.

One way to control the variance of the gradient approximation is mini-batch SGD, where a mini-batch containing more than one sample is used to approximate the gradient. The actual gradient is approximated in mini-batch SGD using

g ≈ (1/B) Σᵢ gᵢ (sum over the examples in the mini-batch),

where B is the batch size.
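A minimal mini-batch SGD sketch for linear regression with squared loss, using the batch-averaged gradient above (the batch size, learning rate, and synthetic data are assumptions):

```python
import numpy as np

def minibatch_sgd(X, y, B=8, eta=0.1, epochs=50, seed=0):
    """Approximate g by averaging per-example gradients over batches of size B."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))             # reshuffle examples each epoch
        for start in range(0, len(y), B):
            batch = idx[start:start + B]
            Xb, yb = X[batch], y[batch]
            g = -2 * Xb.T @ (yb - Xb @ w) / len(batch)  # (1/B) sum of g_i
            w = w - eta * g
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)
print(minibatch_sgd(X, y))                        # approximately [2.0, -1.0]
```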


Gradient Descent: Examples
[Figure slide; example runs not reproduced.]
Gradient Descent: Observations
[Figure slide; observations not reproduced.]
Summary
● Today, we looked at GD and SGD, which are applicable when the function is differentiable.
● We will look at several other ways to compute and use gradients: subgradient descent can be used when the loss function is not differentiable, alternating optimization when several dependent variables need to be evaluated, projected gradient descent when the variables are constrained, and so on.
