
Lecture 7: Second Order Methods

Nicholas Ruozzi
University of Texas at Dallas
Gradient Descent

Gradient Descent Algorithm:

• Pick an initial point $x^{(0)}$
• Iterate until convergence: $x^{(t+1)} = x^{(t)} - \gamma_t \nabla f(x^{(t)})$

where $\gamma_t > 0$ is the step size (sometimes called the learning rate)
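To make the update above concrete, here is a minimal gradient descent sketch in Python; the quadratic example, the fixed step size of 0.1, and the tolerance are illustrative choices rather than anything specified in the slides.

import numpy as np

def gradient_descent(grad, x0, step_size=0.1, tol=1e-6, max_iters=1000):
    """Minimize a differentiable function given its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stop once the gradient is small
            break
        x = x - step_size * g         # fixed step size (learning rate)
    return x

# Example: minimize f(x) = (x1 - 1)^2 + 2 * x2^2
grad_f = lambda x: np.array([2 * (x[0] - 1), 4 * x[1]])
print(gradient_descent(grad_f, x0=[5.0, 5.0]))   # approaches (1, 0)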
Gradient Descent

• At a high level, gradient descent uses the first order Taylor expansion to approximate the function locally

• It does not take into account the curvature of the function, i.e., how quickly the gradient is changing

• This means that it can dramatically overshoot the optimum with a fixed step size
Second Order Methods

• Instead of using only the first derivatives, second order methods use the first three terms of the multivariate Taylor series expansion:

$$f(x) \approx f(x^{(t)}) + \nabla f(x^{(t)})^T (x - x^{(t)}) + \frac{1}{2} (x - x^{(t)})^T \nabla^2 f(x^{(t)}) (x - x^{(t)})$$

• The matrix of second partial derivatives, $\nabla^2 f$, is the Hessian, sometimes written $H_f$

• There are a variety of second order methods

• Newton
• Gauss-Newton
• Quasi-Newton
• BFGS
• L-BFGS
Newton’s Method

• Again, let’s start with the univariate case, $g : \mathbb{R} \to \mathbb{R}$
• Newton’s method is a root-finding algorithm, i.e., it seeks a solution to the equation $g(x) = 0$
• How it works:
• Compute the first order approximation at $x^{(t)}$: $g(x) \approx g(x^{(t)}) + g'(x^{(t)}) (x - x^{(t)})$
• Solve for the point where the approximation is zero: $x = x^{(t)} - g(x^{(t)}) / g'(x^{(t)})$
• Set $x^{(t+1)} = x^{(t)} - g(x^{(t)}) / g'(x^{(t)})$
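A minimal Python sketch of the root-finding iteration just described; the example function and starting point are illustrative assumptions, not from the slides.

def newton_root(g, g_prime, x0, tol=1e-10, max_iters=100):
    """Find a root of g, i.e., a solution of g(x) = 0."""
    x = x0
    for _ in range(max_iters):
        gx = g(x)
        if abs(gx) < tol:             # close enough to a root
            break
        x = x - gx / g_prime(x)       # jump to the root of the linear approximation
    return x

# Example: the positive root of g(x) = x^2 - 2 is sqrt(2)
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0))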
Newton’s Method

[Figure slides: graphical illustration of successive Newton root-finding iterations.]
Newton’s Method

• I thought we were talking about second order methods?
• We are: Newton’s method for optimization applies the previous strategy to find the zeros of the first derivative, i.e., it solves $f'(x) = 0$
• Approximate: $f'(x) \approx f'(x^{(t)}) + f''(x^{(t)}) (x - x^{(t)})$
• Update: $x^{(t+1)} = x^{(t)} - f'(x^{(t)}) / f''(x^{(t)})$
• This is equivalent to minimizing the second order approximation!
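To spell out the last bullet (a standard one-line argument, filled in here because the slide’s formula did not survive extraction): minimizing the second order approximation $q$ gives exactly the same update.

q(x) = f(x^{(t)}) + f'(x^{(t)}) (x - x^{(t)}) + \tfrac{1}{2} f''(x^{(t)}) (x - x^{(t)})^2
\qquad
q'(x) = 0 \;\Longrightarrow\; x = x^{(t)} - \frac{f'(x^{(t)})}{f''(x^{(t)})}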
Multivariate Newton’s Method

• Update: $x^{(t+1)} = x^{(t)} - \left[ \nabla^2 f(x^{(t)}) \right]^{-1} \nabla f(x^{(t)})$
• Recall the inverse of a matrix $M$, written $M^{-1}$, is the matrix such that $M M^{-1} = M^{-1} M = I$, the identity matrix
• Inverses do not always exist; if the inverse doesn’t exist, then there may be no solution or infinitely many solutions
• Computing the inverse can be expensive: it requires $O(n^3)$ operations for an $n \times n$ matrix
• Computing the Hessian matrix itself can be computationally expensive
• Newton’s method can converge faster than gradient methods, but it is less robust in general
Gradient Descent

[Figure: gradient descent iterates with a diminishing step size rule.]

Newton’s Method

[Figure: Newton’s method iterates, for comparison.]
Newton Direction

• If $f$ is convex, then the direction specified by Newton’s method is a descent direction!
• Recall that the Hessian matrix of a convex function is positive semidefinite everywhere
• A matrix $M$ is positive semidefinite if $x^T M x \geq 0$ for all $x$
• Newton direction: $d^{(t)} = -\left[ \nabla^2 f(x^{(t)}) \right]^{-1} \nabla f(x^{(t)})$

The size of the Newton step can also be used as a stopping criterion, e.g., stop when $\nabla f(x^{(t)})^T \left[ \nabla^2 f(x^{(t)}) \right]^{-1} \nabla f(x^{(t)})$ is sufficiently small
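Why the Newton direction is a descent direction (standard reasoning, filled in since the slide’s formula did not survive extraction): the inverse of a positive definite Hessian is itself positive definite, so

\nabla f(x^{(t)})^T d^{(t)} = -\,\nabla f(x^{(t)})^T \left[ \nabla^2 f(x^{(t)}) \right]^{-1} \nabla f(x^{(t)}) < 0
\quad \text{whenever } \nabla^2 f(x^{(t)}) \succ 0 \text{ and } \nabla f(x^{(t)}) \neq 0 .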
Convergence

• Because Newton’s method specifies a descent direction, we can use the same kinds of convergence criteria that we used for gradient descent!
• We could choose different step sizes, use line search methods, etc. with this direction
• One computational note: the inverse is often not computed explicitly
• Most numerical methods packages have a special routine to solve linear systems of the form $Ax = b$ (here, $\nabla^2 f(x^{(t)})\, d = -\nabla f(x^{(t)})$)
• For example, numpy.linalg.solve() in Python
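A sketch of a single Newton step that solves the linear system instead of forming the inverse, as the slide suggests; the quadratic test problem below is an illustrative assumption.

import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: solve H d = -g rather than computing H^{-1}."""
    g = grad(x)
    H = hess(x)
    d = np.linalg.solve(H, -g)   # Newton direction
    return x + d                 # full step; a line search could scale d instead

# Example on a quadratic f(x) = 0.5 x^T A x - b^T x (minimizer solves A x = b)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
print(newton_step(grad, hess, np.zeros(2)))   # reaches the minimizer in one step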
Equality Constrained Newton

• Two different approaches:

• One based on duality

• One based on quadratic programming

• The aim is just to make sure that the step taken via Newton’s method stays inside the constraint set
Equality Constrained Newton with Duality

• Primal problem: minimize $f(x)$ subject to $Ax = b$

• Dual problem: maximize the dual function $q(\lambda) = \inf_x \left[ f(x) + \lambda^T (Ax - b) \right]$

A function that appears when the dual is written out comes up frequently in convex optimization, and it gets a special name (see the rewriting below)
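The rewriting that exposes that function (a standard manipulation, reconstructed here because the slide’s own formula did not survive extraction):

q(\lambda) = \inf_x \left[ f(x) + \lambda^T (Ax - b) \right]
           = -b^T \lambda - \sup_x \left[ (-A^T \lambda)^T x - f(x) \right]
           = -b^T \lambda - f^*(-A^T \lambda),

where $f^*$ is the convex conjugate defined on the next slide.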
The Convex Conjugate

• The convex conjugate of a function $f$ is defined by $f^*(y) = \sup_x \left[ y^T x - f(x) \right]$

• The convex conjugate is always a convex function, even if $f$ is not a convex function (it is a pointwise supremum of convex functions of $y$)
• The conjugate is a special case of Lagrange duality
• If $f$ is convex (and closed), then $f^{**} = f$
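A worked example that is not on the slides but is often useful to keep in mind: the conjugate of a strictly convex quadratic.

f(x) = \tfrac{1}{2} x^T Q x \;\; (Q \succ 0)
\qquad\Longrightarrow\qquad
f^*(y) = \sup_x \left[ y^T x - \tfrac{1}{2} x^T Q x \right] = \tfrac{1}{2} y^T Q^{-1} y ,

since the supremum is attained at $x = Q^{-1} y$.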
Equality Constrained Newton with Duality

• Primal problem: minimize $f(x)$ subject to $Ax = b$

• Dual problem: maximize $q(\lambda) = -b^T \lambda - f^*(-A^T \lambda)$

If this dual function is twice differentiable, we can apply Newton’s method to solve the dual
Equality Constrained Newton via QP/KKT

• Instead of constructing the dual problem, we can directly modify Newton’s method so that it never takes a step outside the set of constraints
• Recall that Newton’s method steps to a minimum of the second order approximation at a point:

$$\min_v \; f(x^{(t)}) + \nabla f(x^{(t)})^T v + \tfrac{1}{2} v^T \nabla^2 f(x^{(t)}) v$$

• Instead, solve the constrained optimization problem

$$\min_v \; f(x^{(t)}) + \nabla f(x^{(t)})^T v + \tfrac{1}{2} v^T \nabla^2 f(x^{(t)}) v \quad \text{subject to } A (x^{(t)} + v) = b$$
Equality Constrained Newton via QP/KKT

• Pick an initial point $x^{(0)}$ such that $A x^{(0)} = b$

• Solve the optimization problem

$$v^{(t)} = \arg\min_v \; \nabla f(x^{(t)})^T v + \tfrac{1}{2} v^T \nabla^2 f(x^{(t)}) v \quad \text{subject to } A v = 0$$

• Update $x^{(t+1)} = x^{(t)} + v^{(t)}$

Note that the iterates stay feasible: if $A x^{(t)} = b$ and $A v^{(t)} = 0$, then $A x^{(t+1)} = b$
Equality Constrained Newton via QP/KKT

• The solution to this optimization problem can be written in almost closed form

• As long as there is at least one feasible point, Slater’s condition implies that strong duality holds

• The KKT conditions are then necessary and sufficient
Equality Constrained Newton via QP/KKT

• Writing out the KKT conditions for this quadratic subproblem gives a linear system in the step $v$ and the dual variables (see the sketch below)

• Again, existing tools can be applied to solve these kinds of linear systems
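The slides’ equations here did not survive extraction; the system is usually written as follows (standard form, with $H = \nabla^2 f(x^{(t)})$, $g = \nabla f(x^{(t)})$, and multiplier $w$):

\begin{bmatrix} H & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} v \\ w \end{bmatrix}
=
\begin{bmatrix} -g \\ 0 \end{bmatrix}

A routine such as numpy.linalg.solve can be applied to this block system directly.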
Approximate Newton

• Computing the Hessian matrix is computationally expensive and may be difficult to obtain in closed form (or the Hessian may not even be invertible!)
• Idea: can we approximate the Hessian the same way that we approximated the derivatives on HW 1, i.e., using the secant method?
• For univariate functions:

$$f''(x^{(t)}) \approx \frac{f'(x^{(t)}) - f'(x^{(t-1)})}{x^{(t)} - x^{(t-1)}}$$

Use the sequence of iterates to approximate the 2nd derivative!
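A minimal univariate sketch of this idea; the example function, starting points, and tolerance are illustrative assumptions, not from the slides.

def secant_newton(f_prime, x_prev, x_curr, tol=1e-10, max_iters=100):
    """Minimize a univariate function, approximating f'' from two iterates."""
    for _ in range(max_iters):
        g_prev, g_curr = f_prime(x_prev), f_prime(x_curr)
        if abs(g_curr) < tol or x_curr == x_prev:
            break
        # Secant approximation of the second derivative
        h_approx = (g_curr - g_prev) / (x_curr - x_prev)
        x_prev, x_curr = x_curr, x_curr - g_curr / h_approx
    return x_curr

# Example: f(x) = x^4 has f'(x) = 4 x^3 and minimizer x = 0
print(secant_newton(lambda x: 4 * x**3, x_prev=1.5, x_curr=1.0))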
Approximate Newton

• Computing the Hessian matrix is computationally expensive and may be difficult to obtain in closed form (or the Hessian may not even be invertible!)
• Idea: can we approximate the Hessian the same way that we approximated the derivatives on HW 1, i.e., using the secant method?
• For multivariate functions, the analogous condition on the iterates is

$$\nabla^2 f(x^{(t+1)}) \left( x^{(t+1)} - x^{(t)} \right) \approx \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$$

Use the sequence of iterates to approximate the 2nd derivative!

• The key idea is to replace $\nabla^2 f(x^{(t+1)})$ with a good approximation $B^{(t+1)}$ that yields equality in this expression, but is much easier to compute

• Note that this is a system of $n$ equations for an $n \times n$ Hessian, so the system is underdetermined: there could be many possible substitutions for $B^{(t+1)}$
Quasi-Newton Methods

• Using the previous approximation, quasi-Newton methods seek to generate a series of approximate Hessian matrices such that the matrix $B^{(t+1)}$ only depends on the matrix $B^{(t)}$ and satisfies the secant constraint

$$B^{(t+1)} \left( x^{(t+1)} - x^{(t)} \right) = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$$

• A wide variety of methods have been proposed to accomplish this
• The most popular in practice are BFGS and its lower memory counterpart L-BFGS (see the usage sketch below)
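For context (not part of the slides), this is how these methods are typically invoked through SciPy; the objective below is an illustrative example.

import numpy as np
from scipy.optimize import minimize

# Illustrative smooth objective with minimizer (1, 1)
def f(x):
    return (x[0] - 1) ** 2 + 10 * (x[1] - 1) ** 2

def grad_f(x):
    return np.array([2 * (x[0] - 1), 20 * (x[1] - 1)])

# BFGS maintains a dense approximation of the inverse Hessian;
# L-BFGS-B keeps only a few recent (s, y) pairs instead.
result = minimize(f, x0=np.zeros(2), jac=grad_f, method="L-BFGS-B")
print(result.x)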
Broyden-Fletcher-Goldfarb-Shanno (BFGS)

• Choose $B^{(t+1)}$ to be a symmetric positive definite matrix whose inverse satisfies

$$\left[ B^{(t+1)} \right]^{-1} \left( \nabla f(x^{(t+1)}) - \nabla f(x^{(t)}) \right) = x^{(t+1)} - x^{(t)}$$

and whose change from the previous approximation is as small as possible
Broyden-Fletcher-Goldfarb-Shanno (BFGS)

$$\min_H \; \left\| H - \left[ B^{(t)} \right]^{-1} \right\| \quad \text{such that} \quad H = H^T, \;\; H \left( \nabla f(x^{(t+1)}) - \nabla f(x^{(t)}) \right) = x^{(t+1)} - x^{(t)}$$

This is a convex optimization problem!
(note that the solution may not be strictly positive definite: if this happens, reinitialize the approximation to a nice positive definite matrix like the identity)
Broyden-Fletcher-Goldfarb-Shanno (BFGS)

$$\min_H \; \left\| H - \left[ B^{(t)} \right]^{-1} \right\| \quad \text{such that} \quad H = H^T, \;\; H \left( \nabla f(x^{(t+1)}) - \nabla f(x^{(t)}) \right) = x^{(t+1)} - x^{(t)}$$

Its solution is... messy... (the standard form of the resulting update is sketched below)
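The closed form on the final slide did not survive extraction; for reference, the update is usually stated as follows (standard BFGS notation, with $s_t = x^{(t+1)} - x^{(t)}$, $y_t = \nabla f(x^{(t+1)}) - \nabla f(x^{(t)})$, and $H_t = [B^{(t)}]^{-1}$):

H_{t+1} = \left( I - \rho_t\, s_t y_t^T \right) H_t \left( I - \rho_t\, y_t s_t^T \right) + \rho_t\, s_t s_t^T ,
\qquad \rho_t = \frac{1}{y_t^T s_t}

L-BFGS avoids storing $H_t$ explicitly and instead keeps only the most recent few $(s_t, y_t)$ pairs.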
