
Department of Computer Science & Engineering

Regulation 21
Semester: III
Course Code: CS3491
Course Name: Artificial Intelligence and Machine Learning

K.Sumithra Devi
Assistant Professor, CSE
UNIT III SUPERVISED LEARNING – GRADIENT DESCENT

CO 3: Build supervised learning models (K3)



Gradient Descent
The algorithm:

$$x^{(t+1)} = x^{(t)} - \alpha^{(t)} \nabla f\left(x^{(t)}\right), \qquad t = 0, 1, 2, \ldots,$$

where α^(t) is called the step size.
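As a concrete illustration, here is a minimal Python sketch of this update rule with a fixed step size (the function name gradient_descent, the test objective, and the parameter values are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_iters=100):
    """Iterate x^(t+1) = x^(t) - alpha * grad_f(x^(t)) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)          # step against the gradient
    return x

# Example: minimize f(x) = ||x - [3, -1]||^2, whose gradient is 2 * (x - [3, -1]).
grad_f = lambda x: 2.0 * (x - np.array([3.0, -1.0]))
print(gradient_descent(grad_f, x0=[0.0, 0.0]))   # converges toward [3, -1]
```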



Why is the direction −∇f(x)?
If x* is optimal, then for every direction d the difference quotient is nonnegative:

$$\underbrace{\lim_{\epsilon \to 0} \frac{1}{\epsilon}\left[f(x^* + \epsilon d) - f(x^*)\right]}_{\ge\, 0,\ \forall d} = \nabla f(x^*)^T d
\;\;\Longrightarrow\;\; \nabla f(x^*)^T d \ge 0, \quad \forall d.$$

But if x^(t) is not optimal, then we want f(x^(t) + εd) ≤ f(x^(t)) for some direction d. So,

$$\underbrace{\lim_{\epsilon \to 0} \frac{1}{\epsilon}\left[f\left(x^{(t)} + \epsilon d\right) - f\left(x^{(t)}\right)\right]}_{\le\, 0,\ \text{for some } d} = \nabla f\left(x^{(t)}\right)^T d
\;\;\Longrightarrow\;\; \nabla f\left(x^{(t)}\right)^T d \le 0 \ \text{ for some } d.$$
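A quick numerical sanity check of this argument (the quadratic objective and helper names below are illustrative, not from the slides): a direction with a negative inner product against the gradient decreases f for a small enough step, while the opposite direction increases it.

```python
import numpy as np

f      = lambda x: np.sum((x - np.array([3.0, -1.0]))**2)
grad_f = lambda x: 2.0 * (x - np.array([3.0, -1.0]))

x   = np.array([0.0, 0.0])
d   = -grad_f(x)                      # grad_f(x)^T d < 0: negative side
eps = 1e-3
print(f(x + eps * d) < f(x))          # True: the cost decreases
print(f(x - eps * d) < f(x))          # False: the opposite direction increases the cost
```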
Descent Direction
Pictorial illustration:
∇f(x) is perpendicular to the contour.
A search direction d can lie either on the positive side, ∇f(x)^T d ≥ 0, or on the negative side, ∇f(x)^T d < 0.
Only the directions on the negative side can reduce the cost. All such d's are called descent directions.



The Steepest d
Previous slide: if x^(t) is not optimal yet, then some d will give ∇f(x^(t))^T d ≤ 0.

So, let us make ∇f(x^(t))^T d as negative as possible:

$$d^{(t)} = \underset{\|d\|_2 = \delta}{\mathrm{argmin}} \ \nabla f\left(x^{(t)}\right)^T d.$$

We need δ to control the magnitude; otherwise d is unbounded.

The solution (up to the scaling δ) is

$$d^{(t)} = -\nabla f\left(x^{(t)}\right).$$

Why? By Cauchy-Schwarz,

$$\nabla f\left(x^{(t)}\right)^T d \ \ge\ -\left\|\nabla f\left(x^{(t)}\right)\right\|_2 \left\|d\right\|_2.$$

The minimum is attained when d is parallel to −∇f(x^(t)).


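A small numerical illustration of this argmin (the gradient value, δ, and variable names are assumptions for the demo, not from the slides): among random directions of the same length, the scaled negative gradient attains the smallest inner product with ∇f(x^(t)).

```python
import numpy as np

rng   = np.random.default_rng(0)
grad  = np.array([2.0, -4.0])                  # stand-in for grad_f(x^(t))
delta = 1.0                                    # fixed magnitude ||d||_2 = delta

# Random candidate directions on the sphere of radius delta
cands = rng.standard_normal((10000, 2))
cands = delta * cands / np.linalg.norm(cands, axis=1, keepdims=True)

best  = cands[np.argmin(cands @ grad)]         # empirical argmin of grad^T d
steep = -delta * grad / np.linalg.norm(grad)   # -grad, scaled to length delta
print(best, steep)                             # the two nearly coincide
```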
Steepest Descent Direction
Pictorial illustration:
Put a ball around the current point. All d's inside the ball are feasible.
Pick the one that minimizes ∇f(x)^T d.
This direction must be parallel (but with opposite sign) to ∇f(x).

Step Size
The algorithm:

$$x^{(t+1)} = x^{(t)} - \alpha^{(t)} \nabla f\left(x^{(t)}\right), \qquad t = 0, 1, 2, \ldots,$$

where α^(t) is called the step size.

1. Fixed step size:
α^(t) = α.

2. Exact line search:

$$\alpha^{(t)} = \underset{\alpha}{\mathrm{argmin}}\ f\left(x^{(t)} + \alpha d^{(t)}\right).$$

E.g., if f(x) = (1/2) x^T H x + c^T x, then

$$\alpha^{(t)} = -\frac{\nabla f\left(x^{(t)}\right)^T d^{(t)}}{d^{(t)T} H\, d^{(t)}}.$$

3. Inexact line search:
Armijo / Wolfe conditions. See Nocedal-Wright, Chapter 3.1.
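The exact line search case can be sketched in a few lines of Python (H, c, the tolerance, and the variable names are illustrative assumptions, not from the slides):

```python
import numpy as np

# Gradient descent with exact line search on a quadratic f(x) = 0.5 x^T H x + c^T x.
H = np.array([[3.0, 0.5],
              [0.5, 1.0]])                    # symmetric positive definite
c = np.array([-1.0, 2.0])
grad_f = lambda x: H @ x + c

x = np.zeros(2)
for t in range(50):
    d = -grad_f(x)                            # steepest descent direction
    if np.linalg.norm(d) < 1e-10:
        break
    alpha = -(grad_f(x) @ d) / (d @ H @ d)    # exact line search formula above
    x = x + alpha * d

print(x, np.linalg.solve(H, -c))              # both approximate the minimizer -H^{-1} c
```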
Convergence
Let x* be the global minimizer. Assume the following:
Assume f is twice differentiable, so that ∇²f exists.
Assume λ_min I ⪯ ∇²f(x) ⪯ λ_max I for all x ∈ R^n, with λ_min ≥ 0 (inequalities in the positive semidefinite sense).
Run gradient descent with exact line search.
Then (Nocedal-Wright, Chapter 3, Theorem 3.3):

$$\begin{aligned}
f\left(x^{(t+1)}\right) - f(x^*) &\le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{2} \left[f\left(x^{(t)}\right) - f(x^*)\right] \\
&\le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{4} \left[f\left(x^{(t-1)}\right) - f(x^*)\right] \\
&\;\;\vdots \\
&\le \left(1 - \frac{\lambda_{\min}}{\lambda_{\max}}\right)^{2t} \left[f\left(x^{(1)}\right) - f(x^*)\right].
\end{aligned}$$

Thus, f(x^(t)) → f(x*) as t → ∞.
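A rough numerical check of this rate on a quadratic objective (the Hessian, starting point, and names are illustrative assumptions, not from the slides): the observed per-iteration reduction of f(x^(t)) − f(x*) under exact line search should be no worse than the factor (1 − λ_min/λ_max)².

```python
import numpy as np

H = np.diag([1.0, 10.0])              # Hessian with lambda_min = 1, lambda_max = 10
c = np.array([1.0, -3.0])
f      = lambda x: 0.5 * x @ H @ x + c @ x
grad_f = lambda x: H @ x + c
x_star = np.linalg.solve(H, -c)
rate   = (1 - 1.0 / 10.0) ** 2        # (1 - lambda_min / lambda_max)^2

x, gap = np.array([5.0, 5.0]), []
for t in range(20):
    d = -grad_f(x)
    alpha = (d @ d) / (d @ H @ d)     # exact line search on the quadratic
    x = x + alpha * d
    gap.append(f(x) - f(x_star))

ratios = np.array(gap[1:]) / np.array(gap[:-1])
print(ratios.max() <= rate + 1e-12)   # True: observed ratios respect the bound
```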
Understanding Convergence
Gradient descent can be viewed as successive approximation. Approximate the function as

$$f\left(x^{(t)} + d\right) \approx f\left(x^{(t)}\right) + \nabla f\left(x^{(t)}\right)^T d + \frac{1}{2\alpha}\|d\|_2^2.$$

We can show that the d minimizing the right-hand side is d = −α∇f(x^(t)).
This suggests: use a quadratic function to locally approximate f.
Gradient descent converges when the step size α, which sets the curvature 1/α of the approximation, is not too big.
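The minimizing d can be made explicit by setting the gradient of the quadratic model to zero (a short worked step filling in the claim above):

$$\nabla_d \left[ f\left(x^{(t)}\right) + \nabla f\left(x^{(t)}\right)^T d + \frac{1}{2\alpha}\|d\|_2^2 \right]
= \nabla f\left(x^{(t)}\right) + \frac{1}{\alpha} d = 0
\;\;\Longrightarrow\;\; d = -\alpha \nabla f\left(x^{(t)}\right),$$

which is exactly one gradient descent step with step size α.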
Advice on Gradient Descent

Gradient descent is useful because:
• Simple to implement (compared to ADMM, FISTA, etc.)
• Low computational cost per iteration (no matrix inversion)
• Requires only the first-order derivative (no Hessian)
• The gradient is available in deep networks (via back-propagation)
• Most machine learning libraries have built-in (stochastic) gradient descent (a minimal usage sketch follows after this list)

You are welcome to implement your own, but you need to be careful with:
• Convex non-differentiable problems, e.g., the l1-norm
• Non-convex problems, e.g., ReLU in a deep network
• Getting trapped in local minima
• Inappropriate step size, a.k.a. learning rate

Consider more “transparent” algorithms such as CVX when:
• Formulating problems (no need to worry about the algorithm)
• Trying to obtain insights
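As referenced in the list above, here is a minimal usage sketch of a built-in (stochastic) gradient descent optimizer, assuming PyTorch as the library (the objective and hyperparameters are illustrative, not from the slides):

```python
import torch

# Minimize f(x) = ||x - target||^2, letting the framework compute gradients
# by back-propagation and apply the SGD update.
target = torch.tensor([3.0, -1.0])
x = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)

for _ in range(200):
    opt.zero_grad()                  # clear gradients from the previous step
    loss = torch.sum((x - target) ** 2)
    loss.backward()                  # autograd fills x.grad
    opt.step()                       # x <- x - lr * x.grad

print(x.detach())                    # close to [3, -1]
```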
Types of Gradient Descent

• Based on how much of the training data is used to compute the error for each update, the gradient descent learning algorithm can be divided into the following types (a sketch contrasting their update loops is given after this list):

• Batch gradient descent
• Stochastic gradient descent
• Mini-batch gradient descent
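A sketch contrasting the three update loops on a small least-squares problem (the data, batch size, and hyperparameters are illustrative assumptions, not from the slides):

```python
import numpy as np

# Least-squares objective f(w) = ||Xw - y||^2 / (2n), minimized three ways.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def grad(w, Xb, yb):
    """Gradient of the least-squares loss over the (mini-)batch (Xb, yb)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

alpha, epochs = 0.1, 50
w_batch, w_sgd, w_mini = np.zeros(3), np.zeros(3), np.zeros(3)

for _ in range(epochs):
    # Batch: one update per epoch using the full dataset
    w_batch = w_batch - alpha * grad(w_batch, X, y)

    # Stochastic: one update per single (shuffled) training example
    for i in rng.permutation(len(y)):
        w_sgd = w_sgd - alpha * grad(w_sgd, X[i:i+1], y[i:i+1])

    # Mini-batch: one update per small batch of examples
    for start in range(0, len(y), 20):
        w_mini = w_mini - alpha * grad(w_mini, X[start:start+20], y[start:start+20])

print(w_batch, w_sgd, w_mini)   # all three approach w_true = [1, -2, 0.5]
```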
