
10-725/36-725: Convex Optimization Fall 2016

Lecture 14: Newton’s Method


Lecturer: Javier Peña Scribes: Varun Joshi, Xuan Li

Note: LaTeX template courtesy of UC Berkeley EECS dept.


Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.

14.1 Review of previous lecture


Given a function f : Rⁿ → R, we define its conjugate f* : Rⁿ → R as

    f*(y) = maxₓ (yᵀx − f(x))

Some properties of the convex conjugate of a function are as follows:

• The conjugate f* is always convex (regardless of the convexity of f)


• When f is quadratic with Q ≻ 0, f* is quadratic in Q⁻¹, i.e., for f(x) = ½xᵀQx + bᵀx with Q ≻ 0, f*(y) = ½(y − b)ᵀQ⁻¹(y − b)
• When f is a norm, f* is the indicator of the dual norm unit ball
• When f is closed and convex, x ∈ ∂f*(y) ⇐⇒ y ∈ ∂f(x)

A key result that helps us write down the dual in terms of the conjugate is Fenchel duality:

    Primal: minₓ f(x) + g(x)
    Dual:   maxᵤ −f*(u) − g*(−u)

14.2 Introduction
In this section, we present Newton's method and show that it can be interpreted as minimizing a quadratic
approximation to a function at a point. We also briefly discuss the origin of Newton’s method and how it
can be used for finding the roots of a vector-valued function.

14.2.1 Newton’s Method


Newton’s method is a second-order method in the setting where we consider the unconstrained, smooth
convex optimization problem
    minₓ f(x)

where f is convex, twice differentiable and dom(f ) = Rn .

Newton’s method: choose initial x(0) ∈ Rn , and


    x(k) = x(k−1) − (∇²f(x(k−1)))⁻¹ ∇f(x(k−1)),  k = 1, 2, 3, . . .

Lecture 14: October 19

This is called pure Newton’s method since there is no concept of a step-size involved. In Newton’s method,
we move in the direction of the negative Hessian inverse times the gradient. Compare this to gradient descent
where we move in the direction of the negative gradient: choose initial x(0) ∈ Rn , and
    x(k) = x(k−1) − tₖ ∇f(x(k−1)),  k = 1, 2, 3, . . .
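The pure Newton update can be sketched in a few lines of code. The sketch below applies it to the example function f(x) = (10x₁² + x₂²)/2 + 5 log(1 + e^(−x₁−x₂)) that appears later in these notes, with the gradient and Hessian derived by hand; the function names, starting point, and iteration count are our own choices:

```python
import numpy as np

def pure_newton(grad, hess, x0, n_iter=20):
    """Pure Newton's method: x+ = x - (hess f)^{-1} grad f, no step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        # Solve hess(x) v = grad(x) rather than forming the inverse explicitly.
        x = x - np.linalg.solve(hess(x), grad(x))
    return x

# Example: minimize f(x) = (10 x1^2 + x2^2)/2 + 5 log(1 + exp(-x1 - x2)).
def grad(x):
    s = 5 * np.exp(-x[0] - x[1]) / (1 + np.exp(-x[0] - x[1]))
    return np.array([10 * x[0] - s, x[1] - s])

def hess(x):
    e = np.exp(-x[0] - x[1])
    h = 5 * e / (1 + e) ** 2          # second derivative of the log term
    return np.array([[10 + h, h], [h, 1 + h]])

x_star = pure_newton(grad, hess, [1.0, 1.0])
```

Solving the linear system instead of inverting the Hessian is both cheaper and numerically more stable.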

14.2.2 Newton’s method interpretation


Newton’s method can be interpreted as minimizing a quadratic approximation to a function at a given
−1
point. The step x+ = x − ∇2 f (x) ∇f (x) can be obtained by minimizing over y the following quadratic
approximation:
1
f (y) ≈ f (x) + ∇f (x)T (y − x) + (y − x)T ∇2 f (x)(y − x)
2
On the other hand, the gradient descent step x⁺ = x − t∇f(x) can be obtained by minimizing over y the following quadratic approximation:

    f(y) ≈ f(x) + ∇f(x)ᵀ(y − x) + (1/2t) ‖y − x‖₂²
As we can see, Newton's method minimizes a finer quadratic approximation to the function than gradient descent does. For example, for minimizing the function f(x) = (10x₁² + x₂²)/2 + 5 log(1 + e^(−x₁−x₂)), a comparison of the steps taken by Newton's method and gradient descent is provided in Figure 14.1. The figure

Figure 14.1: Comparison of Newton’s Method (blue) with Gradient Descent (black)

shows a contrast between the behaviour of Newton’s method and gradient descent. In gradient descent the
direction of steps is always perpendicular to the level curves, while that is not the case in Newton's method (due to the Hessian term).
For a quadratic function, one step of Newton's method minimizes the function directly, because the quadratic approximation of a quadratic function is the function itself.
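This observation is easy to verify numerically: for a strongly convex quadratic f(x) = ½xᵀQx + bᵀx, one pure Newton step from an arbitrary point lands exactly on the minimizer x⋆ = −Q⁻¹b. A minimal sketch (the random test data is our own):

```python
import numpy as np

rng = np.random.default_rng(0)
# Random strongly convex quadratic f(x) = 0.5 x^T Q x + b^T x  (Q > 0).
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)
b = rng.standard_normal(5)

x = rng.standard_normal(5)                 # arbitrary starting point
grad = Q @ x + b                           # gradient of the quadratic at x
x_plus = x - np.linalg.solve(Q, grad)      # one pure Newton step

x_star = -np.linalg.solve(Q, b)            # exact minimizer of f
```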

14.2.3 Newton’s method for root finding


Newton’s method was originally developed by Newton (1685) and Raphson (1690) for finding roots of poly-
nomials. This was later generalized to minimization of nonlinear equations by Simpson (1740). Suppose
F : Rn → Rn is a differentiable vector-valued function and consider the system of equations
F (x) = 0

Then, the Newton’s method for finding the solution to this system of equations is: choose initial x(0) ∈ Rn ,
and 0 −1
x(k) = x(k−1) − F (x(k−1) ) F (x(k−1) ), k = 1, 2, 3, . . .
0
where F (x) is the Jacobian matrix of F at x.
0
The Newton step x+ = x − F (x)−1 F (x) can be obtained by solving over y the linear approximation
0
F (y) ≈ F (x) + F (x)(y − x) = 0
Newton’s method for root finding is directly related to the Newton’s method for convex minimization. In
particular, newton’s method for
    minₓ f(x)
is the same as Newton’s method for finding the roots of
∇f (x) = 0.
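In the scalar case the Jacobian reduces to the ordinary derivative and we recover the classical Newton–Raphson iteration x⁺ = x − F(x)/F′(x). A minimal sketch (the function names are ours) computing √2 as a root of F(x) = x² − 2:

```python
def newton_root(F, Fprime, x0, n_iter=30):
    """Newton's method for a scalar root: x+ = x - F(x)/F'(x)."""
    x = x0
    for _ in range(n_iter):
        x = x - F(x) / Fprime(x)
    return x

# Example: root of F(x) = x^2 - 2, i.e. computing sqrt(2).
root = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```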

14.3 Properties
In this section, we present two key properties of Newton’s method which distinguish it from first order
methods.

14.3.1 Affine Invariance


Assume f : Rⁿ → R is twice differentiable and A ∈ Rⁿˣⁿ is nonsingular. Let g(y) := f(Ay). Then the Newton step for g at a point y is given by

    y⁺ = y − (∇²g(y))⁻¹ ∇g(y)

For the affine transformation x = Ay, it turns out that the Newton step for f at the point x is x⁺ = Ay⁺.
This means that the progress of Newton's method is independent of linear changes of coordinates. This property does not hold for gradient descent.
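The identity x⁺ = Ay⁺ can be checked numerically. Below we take a smooth non-quadratic test function of our own choosing, form the gradient and Hessian of g(y) = f(Ay) via the chain rule (∇g(y) = Aᵀ∇f(Ay), ∇²g(y) = Aᵀ∇²f(Ay)A), and compare the two Newton steps:

```python
import numpy as np

# A smooth, non-quadratic test function with closed-form derivatives:
# f(x) = sum(exp(x)) + ||x||^2 / 2.
def grad_f(x):
    return np.exp(x) + x

def hess_f(x):
    return np.diag(np.exp(x) + 1.0)

A = np.array([[2.0, 1.0], [0.5, 3.0]])     # nonsingular change of variables

y = np.array([0.3, -0.7])
x = A @ y                                   # corresponding point for f

# Newton step for g(y) = f(Ay), using the chain rule for grad and Hessian.
grad_g = A.T @ grad_f(x)
hess_g = A.T @ hess_f(x) @ A
y_plus = y - np.linalg.solve(hess_g, grad_g)

# Newton step for f at x = Ay; affine invariance says x_plus == A @ y_plus.
x_plus = x - np.linalg.solve(hess_f(x), grad_f(x))
```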

14.3.2 Local Convergence


Newton’s method has the property of local convergence. The formal statement of the property is as follows.
Theorem 14.1 Assume F : Rⁿ → Rⁿ is continuously differentiable and x⋆ ∈ Rⁿ is a root of F, that is, F(x⋆) = 0, such that F′(x⋆) is nonsingular. Then

(a) There exists δ > 0 such that if ‖x(0) − x⋆‖ < δ then Newton's method is well defined and

    lim_{k→∞} ‖x(k+1) − x⋆‖ / ‖x(k) − x⋆‖ = 0.

(b) If F′ is Lipschitz continuous in a neighbourhood of x⋆ then there exists K > 0 such that

    ‖x(k+1) − x⋆‖ ≤ K ‖x(k) − x⋆‖².

Part (a) of the theorem says that Newton's method has superlinear local convergence. Note that this is stronger than linear convergence: x(k) → x⋆ linearly ⇐⇒ ‖x(k+1) − x⋆‖ ≤ c‖x(k) − x⋆‖ for some c ∈ (0, 1). If we further assume that F′ is Lipschitz continuous, then from part (b) we get that Newton's method has local quadratic convergence, which is even stronger than superlinear convergence.

Note that the above theorem concerns only local convergence, so it applies only when we start close to the root. Newton's method does not necessarily converge globally.

14.4 Newton Decrement


For a smooth, convex function f, the Newton decrement at a point x is defined as

    λ(x) = (∇f(x)ᵀ (∇²f(x))⁻¹ ∇f(x))^(1/2)

For an unconstrained convex optimization problem


    minₓ f(x)

there are two ways to interpret the Newton Decrement.

Interpretation 1: The Newton decrement measures the difference between f(x) and the minimum of its quadratic approximation:

    f(x) − min_y { f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(x)(y − x) }
        = ½ ∇f(x)ᵀ (∇²f(x))⁻¹ ∇f(x) = ½ λ(x)²
Thus, we can think of λ(x)²/2 as an approximate bound on the suboptimality gap f(x) − f⋆. The bound is approximate because we are considering only the minimum of the quadratic approximation, not the actual minimum of f.
Interpretation 2: If the step in Newton's method is denoted by v = −(∇²f(x))⁻¹ ∇f(x), then

    λ(x) = (vᵀ ∇²f(x) v)^(1/2) = ‖v‖_{∇²f(x)}

Thus, λ(x) is the length of the Newton step in the norm defined by the Hessian.

Fact: The Newton decrement is affine invariant, i.e., for g(y) = f(Ay) with nonsingular A, λ_g(y) = λ_f(x) at x = Ay.
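Both interpretations can be checked numerically. For a quadratic, the quadratic approximation is exact, so λ(x)²/2 equals the suboptimality gap f(x) − f⋆ exactly (the test data below is our own):

```python
import numpy as np

# Quadratic test function f(x) = 0.5 x^T Q x + b^T x with constant Hessian Q.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])

g = Q @ x + b                                  # gradient at x
H = Q                                          # Hessian (constant here)

# Definition: lambda(x) = (grad^T H^{-1} grad)^{1/2}.
lam = np.sqrt(g @ np.linalg.solve(H, g))

# Interpretation 2: lambda(x) is the Hessian-norm length of the Newton step v.
v = -np.linalg.solve(H, g)
lam_via_step = np.sqrt(v @ H @ v)

# Interpretation 1 (exact for a quadratic): f(x) - f* = lambda(x)^2 / 2,
# since the quadratic approximation of a quadratic is the function itself.
f = lambda z: 0.5 * z @ Q @ z + b @ z
x_star = -np.linalg.solve(Q, b)
gap = f(x) - f(x_star)
```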

14.5 Convergence Analysis for backtracking line search


14.5.1 Introduction to algorithm
The pure Newton’s Method does not always converge, depending on the staring point. Thus, damped
Newton’s method is introduced to work together with pure Newton Method. With 0 < α ≤ 21 and 0 < β < 1,
at each iteration we start with t = 1, and while
f (x + tv) <= f (x) + αt∇f (x)T v
we perform the the Newton update, else we shrink t = βt. Here
−1
v = − ∇2 f (x) ∇f (x)
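A minimal sketch of damped Newton with backtracking, applied to a strongly convex test function of our own choosing (the parameter values α = 0.25, β = 0.5 and iteration count are illustrative):

```python
import numpy as np

def newton_backtracking(f, grad, hess, x0, alpha=0.25, beta=0.5, n_iter=50):
    """Damped Newton: backtrack from t = 1 until the sufficient-decrease
    condition f(x + t v) <= f(x) + alpha * t * grad^T v holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        v = -np.linalg.solve(hess(x), g)        # Newton direction
        t = 1.0
        while f(x + t * v) > f(x) + alpha * t * (g @ v):
            t *= beta                            # shrink until decrease suffices
        x = x + t * v
    return x

# Example: minimize f(x) = sum(exp(x)) + ||x||^2 / 2 (strongly convex).
f = lambda x: np.sum(np.exp(x)) + 0.5 * x @ x
grad = lambda x: np.exp(x) + x
hess = lambda x: np.diag(np.exp(x) + 1.0)

x_min = newton_backtracking(f, grad, hess, x0=np.array([2.0, -3.0]))
```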

14.5.2 Example: logistic regression


In lecture we are given a logistic regression example with n = 500 and p = 100. With backtracking, Newton's method is compared with gradient descent, and the convergence curves are shown in Figure 14.2. It is seen that Newton's method has a different regime of convergence. Notice that the comparison might be unfair, since the per-iteration computational cost of the two methods differs significantly.

Figure 14.2: Comparison of Newton’s Method with Gradient Descent (backtracking)

14.5.3 Convergence analysis


Under the assumptions that

• f is strongly convex with parameter m, twice differentiable, ∇f is Lipschitz with parameter L, and dom(f) = Rⁿ

• ∇²f is Lipschitz with parameter M

Newton's method with backtracking line search satisfies the following convergence bound:

    f(x(k)) − f⋆ ≤ f(x(0)) − f⋆ − γk                  if k ≤ k₀
    f(x(k)) − f⋆ ≤ (2m³/M²) (1/2)^(2^(k−k₀+1))        if k > k₀

where γ = αβ²η²m/L², η = min{1, 3(1 − 2α)} m²/M, and k₀ is the number of steps until ‖∇f(x(k₀+1))‖₂ < η.
More precisely, the result indicates that in the damped phase we have

    f(x(k+1)) ≤ f(x(k)) − γ

i.e., the function value decreases by at least γ per step. In the pure phase, backtracking selects t = 1, and we have

    (M/2m²) ‖∇f(x(k+1))‖₂ ≤ ( (M/2m²) ‖∇f(x(k))‖₂ )²

Also, once we enter the pure phase, we do not leave it.
Finally, to reach f(x(k)) − f⋆ ≤ ε, at most

    (f(x(0)) − f⋆)/γ + log log(ε₀/ε)

iterations are needed, where ε₀ = 2m³/M².
The "log log" term in the convergence result reflects quadratic convergence. However, the quadratic convergence result is only local: it is guaranteed only in the second, or pure, phase. Finally, the above bound depends on L, m, M, but the algorithm itself does not.

14.6 Convergence Analysis for self concordant functions


14.6.1 Definition
To achieve a scale-free analysis we introduce self-concordant functions. A function f : R → R is self-concordant if it is convex on an open interval of R and satisfies

    |f‴(x)| ≤ 2 f″(x)^(3/2)

(for f : Rⁿ → R the condition is required along every line). Two examples are f(x) = −Σᵢ₌₁ⁿ log(xᵢ) and f(X) = −log det(X).
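For the first example in one dimension, f(x) = −log x, we have f″(x) = 1/x² and f‴(x) = −2/x³, so the defining inequality in fact holds with equality. A quick numerical check (the sample points are our own):

```python
# For f(x) = -log(x): f''(x) = 1/x^2 and f'''(x) = -2/x^3, so the
# self-concordance inequality |f'''(x)| <= 2 f''(x)^{3/2} holds with equality.
checks = []
for x in [0.1, 1.0, 3.7, 50.0]:
    f2 = 1.0 / x ** 2
    f3 = -2.0 / x ** 3
    checks.append(abs(f3) <= 2.0 * f2 ** 1.5 + 1e-12)   # tolerance for rounding
```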

14.6.2 Property
If g is self-concordant and A, b are of appropriate dimensions, then

    f(x) := g(Ax − b)

is also self-concordant.

14.6.3 Convergence Analysis


For a self-concordant function f, Newton's method with backtracking line search needs at most

    C(α, β)(f(x(0)) − f⋆) + log log(1/ε)

iterations to achieve f(x(k)) − f⋆ ≤ ε, where C(α, β) is a constant depending only on the backtracking parameters α, β.

14.7 Comparison to first order methods


14.7.1 High-level comparison
• Memory : Each iteration of Newton’s method requires O(n2 ) storage due to the n×n Hessian whereas
each gradient iteration requires O(n) storage for the n-dimensional gradient.
• Computation : Each Newton iteration requires O(n³) flops, as it solves a dense n × n linear system. Each gradient descent iteration requires O(n) flops for scaling and adding n-dimensional vectors.
• Backtracking : Backtracking line search has roughly the same cost for both methods, which use O(n)
flops per inner backtracking step.
• Conditioning : Newton’s method is not afected by a problem’s conditioning(due to affine invariance),
but gradient descent can seriously degrade, since it depends adversely on the condition number.
• Fragility : Newton’s method may be empirically more sensitive to bugs/numerical errors, whereas
gradient descent is more robust.

Even though Newton's method enjoys quadratic convergence compared to the linear convergence of gradient descent, computing the Hessian might make each iteration a lot slower. If the Hessian is sparse and structured (e.g. banded), then both memory and computation are O(n).

14.8 Equality-constrained Newton’s method


14.8.1 Introduction
Suppose now we have problems with equality constraints

    minₓ f(x)  subject to  Ax = b

Here we have three options: eliminate the equality constraints by writing x = Fy + x₀, where the columns of F span the null space of A and Ax₀ = b; derive the dual; or use the most straightforward option, equality-constrained Newton's method.

14.8.2 Definition
In equality-constrained Newton's method, we take Newton steps that are confined to the affine set defined by the constraints. The Newton update is now x⁺ = x + tv, where

    v = argmin_{A(x+z)=b} { f(x) + ∇f(x)ᵀz + ½ zᵀ∇²f(x)z }

From the KKT conditions it follows that for some w we have

    [ ∇²f(x)  Aᵀ ] [ v ]      [ ∇f(x)  ]
    [   A     0  ] [ w ]  = − [ Ax − b ]

The latter is the root-finding Newton step for the KKT conditions of the original equality-constrained problem:

    [ ∇f(x) + Aᵀy ]   [ 0 ]
    [   Ax − b    ] = [ 0 ]

References
• S. Boyd and L. Vandenberghe (2004), "Convex Optimization", Chapters 9 and 10
• O. Güler (2010), "Foundations of Optimization", Chapter 14
• Y. Nesterov (1998), "Introductory Lectures on Convex Optimization: A Basic Course", Chapter 2
• Y. Nesterov and A. Nemirovskii (1994), "Interior-Point Polynomial Methods in Convex Programming", Chapter 2
• J. Nocedal and S. Wright (2006), "Numerical Optimization", Chapters 6 and 7
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring 2011–2012
