Lecture 8
A positive derivative at x means f increases if we increase x by a very small amount;
a negative derivative means f decreases.
Optima and saddle points are defined similarly to the one-dimensional case, except that the
properties we saw in the one-dimensional case must now be satisfied along all directions.
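As a concrete illustration (an assumed example, not from the slides), f(w1, w2) = w1^2 − w2^2 has a zero gradient at the origin, but the point is a minimum along w1 and a maximum along w2, so it is a saddle point rather than an optimum. A minimal numerical check:

```python
import numpy as np

# Illustrative example (assumed): f(w) = w1^2 - w2^2
def grad(w):
    return np.array([2.0 * w[0], -2.0 * w[1]])

w0 = np.zeros(2)
print(grad(w0))                      # [0. 0.] -> stationary point

# The Hessian is diag(2, -2): curvature is positive along w1 and negative
# along w2, so the one-dim optimality condition fails along some direction
# and the origin is a saddle point.
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(np.linalg.eigvals(H))          # [ 2. -2.]
```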
[Figure: loss vs. W, with the global optimum and local optima marked]
Convex set
A subset C is convex if, for all x and y in C, the line segment connecting x
and y is included in C.
This means that the affine combination (1 − t)x + ty belongs to C, for all x
and y in C, and t in the interval [0, 1].
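As a quick numerical illustration (a sketch, not part of the slides), the snippet below checks that the unit ball is convex: for two points x and y inside the ball, every convex combination (1 − t)x + ty stays inside.

```python
import numpy as np

rng = np.random.default_rng(0)

def in_unit_ball(p):
    return np.linalg.norm(p) <= 1.0

# Sample two points inside the unit ball (projecting onto it if needed).
x = rng.uniform(-1, 1, size=2); x /= max(1.0, np.linalg.norm(x))
y = rng.uniform(-1, 1, size=2); y /= max(1.0, np.linalg.norm(y))

for t in np.linspace(0.0, 1.0, 11):
    z = (1 - t) * x + t * y          # convex combination of x and y
    assert in_unit_ball(z)
print("all convex combinations stayed inside the unit ball")
```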
Examples
[Figure: example plots of loss functions over W]
Convex Functions
Conditions to test convexity of a differentiable function:
● First-order condition: the graph of the function f must lie above all of its tangents, i.e., f(y) ≥ f(x) + ∇f(x)·(y − x) for all x and y.
[Figure: a convex loss f with the tangent at (x, f(x)) lying below the graph]
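To make the first-order condition concrete, here is a minimal sketch (assuming the convex function f(w) = w^2, chosen for illustration) that checks the tangent inequality at randomly sampled points:

```python
import numpy as np

# Convex example (assumed for illustration): f(w) = w^2, f'(w) = 2w
f = lambda w: w**2
df = lambda w: 2 * w

rng = np.random.default_rng(0)
xs = rng.uniform(-5, 5, size=100)
ys = rng.uniform(-5, 5, size=100)

# First-order condition: the graph lies above every tangent line.
holds = all(f(y) >= f(x) + df(x) * (y - x) - 1e-12 for x, y in zip(xs, ys))
print("first-order convexity condition holds:", holds)   # True
```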
Setting the gradient to zero may or may not yield a closed-form solution: it does for linear
regression but not for logistic regression. Even when no closed-form solution exists, the gradient g
is still helpful, since it can be used in iterative optimization methods.
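For instance (an illustrative sketch with synthetic data), setting the gradient of the least-squares loss to zero gives the normal equations, which can be solved in closed form; logistic regression has no such closed form, so its gradient is used iteratively instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # synthetic features (assumed)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # noisy targets

# Setting the least-squares gradient X^T (Xw - y) to zero gives the
# normal equations, solved here in closed form: w = (X^T X)^{-1} X^T y.
w_closed_form = np.linalg.solve(X.T @ X, X.T @ y)
print(w_closed_form)                          # close to w_true
```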
[Figure: a stationary point may be the optimal solution or a wrong solution]
Iterative Optimization Using Gradients: Gradient Descent
First-order method (uses only the gradient g of the objective)
Basic idea: Start at some location w(0) and
move in the opposite direction of the gradient.
By how much?
Till when?
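A minimal gradient descent sketch (illustrative, on an assumed quadratic objective) makes the update w(t+1) = w(t) − η·g(w(t)) explicit: the learning rate η answers "by how much?", and a tolerance on the gradient norm (or a maximum number of steps) answers "till when?".

```python
import numpy as np

# Assumed toy objective: f(w) = ||w||^2 / 2, with gradient g(w) = w.
grad = lambda w: w

w = np.array([5.0, -3.0])        # w(0): some starting location
eta = 0.1                        # learning rate ("by how much?")

for t in range(1000):            # "till when?": max steps or small gradient
    g = grad(w)
    if np.linalg.norm(g) < 1e-6:
        break
    w = w - eta * g              # move in the opposite direction of the gradient
print(t, w)                      # converges near the optimum w* = 0
```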
Remedy: Run multiple times with different initializations and select the best one.
[Figure: gradient descent trajectories from a bad initialization vs. a good initialization]
Problems with small learning rates:
● May take too long to converge
● May not be able to “cross” bad optima and reach good optima

Problems with large learning rates:
● May keep oscillating
● May jump from a good region to a bad region
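The effect of the learning rate can be seen on a simple quadratic (an assumed toy example): a very small η converges slowly, a moderate η converges well, and an overly large η oscillates and diverges.

```python
# Toy objective (assumed): f(w) = 0.5 * a * w^2 with curvature a = 10,
# so gradient descent is stable only for eta < 2/a = 0.2.
a = 10.0
grad = lambda w: a * w

def run_gd(eta, steps=20, w0=1.0):
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

print(run_gd(eta=0.001))   # tiny step size: still far from 0 after 20 steps
print(run_gd(eta=0.15))    # moderate step size: close to the optimum 0
print(run_gd(eta=0.21))    # too large: oscillates and diverges
```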
Notation: g is the actual (full) gradient; g_i is the stochastic gradient computed from a single sample i.
One way to control the variance of the gradient approximation is mini-batch SGD, where a
mini-batch containing more than one sample is used to approximate the gradient.
In mini-batch SGD, the actual gradient is approximated by the average of the per-sample
gradients over the mini-batch B: g ≈ (1/|B|) Σ_{i∈B} g_i.
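A minimal mini-batch SGD sketch (illustrative, on an assumed least-squares objective with synthetic data) puts this approximation into a training loop:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))                       # synthetic data (assumed)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def minibatch_grad(w, batch_size=32):
    """Average per-sample gradient over a random mini-batch (least-squares loss)."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(d)
eta = 0.05
for t in range(2000):
    w = w - eta * minibatch_grad(w)               # update with the mini-batch gradient
print(w)                                          # approaches the least-squares solution
```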