Lecture 7: Newton's Method
Nicholas Ruozzi
University of Texas at Dallas
Gradient Descent
• Does not take into account the curvature of the function, i.e., how quickly the gradient is changing
Second Order Methods
• Instead of using only the first derivatives, second order methods use the first three terms of the multivariate Taylor series expansion
  $f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) + \frac{1}{2} (x - x_0)^T \nabla^2 f(x_0) (x - x_0)$
• Sometimes written with $H(x_0)$ denoting the Hessian matrix $\nabla^2 f(x_0)$ of second partial derivatives
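As a quick numerical sketch (not from the slides), the snippet below compares a function value to its second-order Taylor approximation around a point; the test function, expansion point, and displacement are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative test function, its gradient, and its Hessian:
# f(x) = x1^4 + x1*x2 + (1 + x2)^2
def f(x):
    return x[0]**4 + x[0]*x[1] + (1 + x[1])**2

def grad_f(x):
    return np.array([4*x[0]**3 + x[1], x[0] + 2*(1 + x[1])])

def hess_f(x):
    return np.array([[12*x[0]**2, 1.0],
                     [1.0,        2.0]])

x0 = np.array([1.0, -0.5])   # expansion point
d = np.array([0.1, -0.2])    # small displacement

# First three terms of the Taylor expansion around x0, evaluated at x0 + d
taylor2 = f(x0) + grad_f(x0) @ d + 0.5 * d @ hess_f(x0) @ d

print(f(x0 + d), taylor2)    # the two values should agree closely for small d
```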
Newton's Method
• Again, let's start with the univariate case, $f : \mathbb{R} \rightarrow \mathbb{R}$
• Newton's method is a root-finding algorithm, i.e., it seeks a solution to the equation $f(x) = 0$
• How it works:
  • Compute the first order approximation at $x_t$:  $f(x) \approx f(x_t) + f'(x_t)(x - x_t)$
  • Solve for the root of the approximation:  $x = x_t - \frac{f(x_t)}{f'(x_t)}$
  • Set $x_{t+1} = x_t - \frac{f(x_t)}{f'(x_t)}$
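A minimal sketch of this root-finding iteration (not from the slides), applied to the arbitrary example $f(x) = x^2 - 2$, whose positive root is $\sqrt{2}$; the tolerance and starting point are assumptions made for illustration.

```python
def newton_root(f, f_prime, x0, tol=1e-10, max_iters=100):
    """Find a root of f via Newton's method: x_{t+1} = x_t - f(x_t)/f'(x_t)."""
    x = x0
    for _ in range(max_iters):
        step = f(x) / f_prime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x^2 - 2 has a root at sqrt(2) ~ 1.41421356
print(newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))
```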
Newton's Method
• I thought we were talking about second order methods?
• We are: Newton's method for optimization applies the previous strategy to find the zeros of the first derivative
  • Approximate:  $f'(x) \approx f'(x_t) + f''(x_t)(x - x_t)$
  • Update:  $x_{t+1} = x_t - \frac{f'(x_t)}{f''(x_t)}$
• This is equivalent to minimizing the second order approximation!
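A sketch of the univariate optimization variant (not from the slides), on an arbitrary test function; the stopping rule and starting point are assumptions.

```python
def newton_minimize_1d(f_prime, f_double_prime, x0, tol=1e-10, max_iters=100):
    """Minimize a univariate function by applying Newton root finding to f'."""
    x = x0
    for _ in range(max_iters):
        step = f_prime(x) / f_double_prime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x^4 - 3x has f'(x) = 4x^3 - 3 and f''(x) = 12x^2;
# the minimizer is (3/4)^(1/3) ~ 0.9086
print(newton_minimize_1d(lambda x: 4*x**3 - 3, lambda x: 12*x**2, x0=1.0))
```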
Multivariate Newton's Method
• Update:  $x_{t+1} = x_t - \left[\nabla^2 f(x_t)\right]^{-1} \nabla f(x_t)$
• Recall the inverse of a matrix $A$, written $A^{-1}$, is the matrix such that $A A^{-1} = A^{-1} A = I$, the identity matrix
  • Inverses do not always exist; if the inverse doesn't exist, then the corresponding linear system may have no solution or infinitely many solutions
  • Computing the inverse can be expensive: it requires $O(n^3)$ operations for an $n \times n$ matrix
• Computing the Hessian matrix itself can be computationally expensive
• Can converge faster than gradient methods, but is less robust in general
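A sketch of the multivariate update (not from the slides): a few Newton iterations on a made-up convex test function, $f(x) = x_1^2 + 2x_2^2 + e^{x_1 + x_2}$. The inverse is applied explicitly here only to mirror the formula above; a later slide notes that solving a linear system is preferred in practice.

```python
import numpy as np

# Illustrative objective: f(x) = x1^2 + 2*x2^2 + exp(x1 + x2)
def grad_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([2*x[0] + e, 4*x[1] + e])

def hess_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([[2 + e, e],
                     [e,     4 + e]])

x = np.array([1.0, 1.0])
for _ in range(20):
    # Multivariate Newton update: x <- x - [Hessian]^{-1} gradient
    x = x - np.linalg.inv(hess_f(x)) @ grad_f(x)

print(x, grad_f(x))  # the gradient should be numerically zero at the minimizer
```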
(Figures: Gradient Descent vs. Newton's Method.)
Newton Direction
• If $f$ is convex, then the direction specified by Newton's method is a descent direction!
• Recall that the Hessian matrix of a convex function is positive semidefinite everywhere
  • A matrix $M$ is positive semidefinite if $z^T M z \geq 0$ for all $z$
• Newton direction:  $d = -\left[\nabla^2 f(x)\right]^{-1} \nabla f(x)$
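To make the argument explicit (a one-line check, under the standard assumption that the Hessian is positive definite, so its inverse exists and is itself positive definite):

```latex
\nabla f(x)^T d
  \;=\; -\,\nabla f(x)^T \left[\nabla^2 f(x)\right]^{-1} \nabla f(x)
  \;\le\; 0
```

So, for a small enough step size, moving along $d$ does not increase $f$.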
Convergence
• Because Newton’s method specifies a descent direction, we can
use the same kinds of convergence criteria that we used for
gradient descent!
• We could choose different step sizes, use line search
methods, etc. with this direction
• One computational note: the inverse is often not computed
explicitly
• Most numerical methods packages have a special routine to
solve linear systems of the form $Ax = b$
• For example, numpy.linalg.solve() in Python
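A small self-contained sketch of this point (the Hessian and gradient values below are made up): both lines compute the same Newton step, but the second avoids forming the inverse.

```python
import numpy as np

# Newton system at some iterate: Hessian H and gradient g (values made up)
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
g = np.array([1.0, -0.5])

step_via_inverse = np.linalg.inv(H) @ g   # forms the inverse explicitly
step_via_solve = np.linalg.solve(H, g)    # solves H d = g directly; preferred

print(np.allclose(step_via_inverse, step_via_solve))  # True, but solve is cheaper and more stable
```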
Equality Constrained Newton
• The aim is simply to make sure that the step taken via Newton's method stays inside the constraint set
Equality Constrained Newton with Duality
$\min_x \; f(x)$ subject to $Ax = b$
• Dual problem:  $\max_{\lambda} \; g(\lambda)$, where $g(\lambda) = \min_x \left[ f(x) + \lambda^T (Ax - b) \right]$
• The dual is an unconstrained maximization, so Newton's method can be applied to it directly
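As an illustrative worked case (an assumption, not taken from the slides), suppose the objective is a strictly convex quadratic; then the dual function has a closed form:

```latex
% Special case: minimize (1/2) x^T Q x + c^T x  subject to  Ax = b,  with  Q \succ 0.
% Minimizing the Lagrangian  L(x,\lambda) = \tfrac{1}{2} x^T Q x + c^T x + \lambda^T (Ax - b)
% over x gives  x(\lambda) = -Q^{-1}(c + A^T \lambda),  so the dual function is
g(\lambda) \;=\; -\tfrac{1}{2}\,(c + A^T \lambda)^T Q^{-1} (c + A^T \lambda) \;-\; b^T \lambda
```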
Equality Constrained Newton via QP/KKT
• Instead of constructing the dual problem, we can directly modify Newton's method so that it never takes a step outside the set of constraints
• Recall that Newton's method steps to a minimum of the second order approximation at a point
Equality Constrained Newton via QP/KKT
• Pick an initial point $x_0$ such that $A x_0 = b$
• Update: minimize the second order approximation subject to $A(x_t + \Delta x) = b$ by solving the KKT system
  $\begin{bmatrix} \nabla^2 f(x_t) & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x_t) \\ 0 \end{bmatrix}$
  and set $x_{t+1} = x_t + \Delta x$
• Note that $A \Delta x = 0$, so $A x_{t+1} = A x_t = b$ and every iterate remains feasible
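A sketch of one equality-constrained Newton step computed through the KKT system above, on a made-up quadratic objective with a single linear constraint; for a quadratic objective, a single step already lands on the constrained minimizer.

```python
import numpy as np

# Illustrative problem: minimize 1/2 x^T Q x + c^T x  subject to  A x = b
Q = np.array([[3.0, 0.5],
              [0.5, 2.0]])
c = np.array([-1.0, 1.0])
A = np.array([[1.0, 1.0]])   # single constraint: x1 + x2 = 1
b = np.array([1.0])

def grad_f(x):
    return Q @ x + c

def hess_f(x):
    return Q

# Feasible starting point (A x0 = b)
x = np.array([0.5, 0.5])

# Build and solve the KKT system  [H  A^T; A  0] [dx; w] = [-grad; 0]
H = hess_f(x)
g = grad_f(x)
m = A.shape[0]
kkt = np.block([[H, A.T],
                [A, np.zeros((m, m))]])
rhs = np.concatenate([-g, np.zeros(m)])
sol = np.linalg.solve(kkt, rhs)
dx = sol[:len(x)]

x = x + dx
print(x, A @ x - b)   # the iterate stays feasible: A x - b should be ~0
```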
Approximate Newton
• Computing the Hessian matrix is computationally expensive and may be difficult in closed form (or maybe even not invertible!)
• Idea: can we approximate the Hessian the same way that we did the derivatives on HW 1, i.e., using the secant method?
• For univariate functions:  $f''(x_t) \approx \dfrac{f'(x_t) - f'(x_{t-1})}{x_t - x_{t-1}}$
• For multivariate functions, the analogous requirement is that the approximate Hessian $B_t$ satisfy the secant condition  $B_t (x_t - x_{t-1}) = \nabla f(x_t) - \nabla f(x_{t-1})$
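A univariate sketch (again on an arbitrary test function, not from the slides): the same minimization as before, but with $f''$ replaced by the secant approximation, so only first derivatives are needed.

```python
def secant_newton_1d(f_prime, x0, x1, tol=1e-10, max_iters=100):
    """Minimize a univariate function using a secant approximation to f''."""
    x_prev, x = x0, x1
    for _ in range(max_iters):
        # Secant approximation: f''(x) ~ (f'(x) - f'(x_prev)) / (x - x_prev)
        approx_second = (f_prime(x) - f_prime(x_prev)) / (x - x_prev)
        x_prev, x = x, x - f_prime(x) / approx_second
        if abs(x - x_prev) < tol:
            break
    return x

# Example: f(x) = x^4 - 3x, f'(x) = 4x^3 - 3; the minimizer is (3/4)^(1/3)
print(secant_newton_1d(lambda x: 4*x**3 - 3, x0=1.5, x1=1.0))
```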
Quasi-Newton Methods
• Using the previous approximation, quasi-Newton methods seek to generate a sequence of approximate Hessian matrices $B_1, B_2, \ldots$ such that the matrix $B_{t+1}$ depends only on the matrix $B_t$ (together with the latest iterates and gradients) and satisfies the secant constraint
  $B_{t+1} (x_{t+1} - x_t) = \nabla f(x_{t+1}) - \nabla f(x_t)$
Broyden-Fletcher-Goldfarb-Shanno (BFGS)
• With $s_t = x_{t+1} - x_t$ and $y_t = \nabla f(x_{t+1}) - \nabla f(x_t)$, BFGS makes a rank-two modification of $B_t$ that preserves symmetry (and, when $y_t^T s_t > 0$, positive definiteness) such that the secant constraint $B_{t+1} s_t = y_t$ holds:
  $B_{t+1} = B_t + \dfrac{y_t y_t^T}{y_t^T s_t} - \dfrac{B_t s_t s_t^T B_t}{s_t^T B_t s_t}$
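A compact sketch of BFGS (not the slides' code) on the earlier made-up objective, maintaining an approximation to the inverse Hessian so that each step is a matrix-vector product rather than a linear solve; a simple backtracking line search is assumed, where a Wolfe line search would normally be used.

```python
import numpy as np

# Same illustrative objective as before: f(x) = x1^2 + 2*x2^2 + exp(x1 + x2)
def f(x):
    return x[0]**2 + 2*x[1]**2 + np.exp(x[0] + x[1])

def grad_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([2*x[0] + e, 4*x[1] + e])

x = np.array([1.0, 1.0])
H = np.eye(2)                    # approximation to the *inverse* Hessian
g = grad_f(x)

for _ in range(100):
    d = -H @ g                   # quasi-Newton search direction

    # Simple backtracking (Armijo) line search along d
    t = 1.0
    while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):
        t *= 0.5

    x_new = x + t * d
    g_new = grad_f(x_new)
    s, y = x_new - x, g_new - g  # iterate and gradient differences

    # BFGS update of the inverse Hessian approximation (requires y^T s > 0,
    # which holds here because f is strictly convex)
    rho = 1.0 / (y @ s)
    I = np.eye(2)
    H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

    x, g = x_new, g_new
    if np.linalg.norm(g) < 1e-8:
        break

print(x, grad_f(x))              # the gradient should be numerically zero
```

In practice one would typically call an existing implementation such as scipy.optimize.minimize(..., method='BFGS') rather than hand-rolling the update.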