Lecture 14: Newton's Method
A key result that helps us write down the dual in terms of the conjugate is the Fenchel duality:
$$\text{Primal:} \quad \min_x \; f(x) + g(x)$$
$$\text{Dual:} \quad \max_u \; -f^*(u) - g^*(-u)$$
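As a concrete instance (using the standard conjugates of these two functions, not a derivation from the lecture), take $f(x) = \tfrac{1}{2}\|x - y\|_2^2$ and $g(x) = \lambda \|x\|_1$. Then $f^*(u) = \tfrac{1}{2}\|u\|_2^2 + y^T u$ and $g^*(-u)$ is the indicator of $\{u : \|u\|_\infty \le \lambda\}$, so the Fenchel dual pair reads
$$\min_x \; \tfrac{1}{2}\|x - y\|_2^2 + \lambda \|x\|_1 \qquad\Longleftrightarrow\qquad \max_{\|u\|_\infty \le \lambda} \; -\tfrac{1}{2}\|u\|_2^2 - y^T u.$$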
14.2 Introduction
In this section, we present Newton's method and show that it can be interpreted as minimizing a quadratic approximation to a function at a point. We also briefly discuss the origin of Newton's method and how it can be used for finding the roots of a vector-valued function.
For a twice differentiable $f$, Newton's method starts from an initial $x^{(0)} \in \mathbb{R}^n$ and repeats
$$x^{(k)} = x^{(k-1)} - \big(\nabla^2 f(x^{(k-1)})\big)^{-1} \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
This is called pure Newton's method since there is no step size involved: we move in the direction of the negative Hessian inverse times the gradient. Compare this to gradient descent, where we move in the direction of the negative gradient: choose an initial $x^{(0)} \in \mathbb{R}^n$, and
$$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
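Both updates are easy to state in code. Below is a minimal sketch; the test function $f(x) = e^{x_1} + e^{-x_1} + x_2^2$ and the fixed gradient-descent step size are illustrative choices, not from the lecture.

```python
import numpy as np

# Illustrative strictly convex test function: f(x) = exp(x1) + exp(-x1) + x2^2
def grad(x):
    return np.array([np.exp(x[0]) - np.exp(-x[0]), 2.0 * x[1]])

def hess(x):
    return np.diag([np.exp(x[0]) + np.exp(-x[0]), 2.0])

x_newton = np.array([1.5, 1.0])
x_gd = np.array([1.5, 1.0])
t = 0.1  # fixed step size for gradient descent

for k in range(20):
    # Pure Newton: x+ = x - (Hessian)^{-1} gradient, no step size
    x_newton = x_newton - np.linalg.solve(hess(x_newton), grad(x_newton))
    # Gradient descent: x+ = x - t * gradient
    x_gd = x_gd - t * grad(x_gd)

print(np.linalg.norm(grad(x_newton)), np.linalg.norm(grad(x_gd)))
```

On this example the pure Newton iterates drive the gradient to machine precision within a handful of steps, while gradient descent closes the gap only linearly.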
Figure 14.1: Comparison of Newton's Method (blue) with Gradient Descent (black)

Figure 14.1 shows a contrast between the behaviour of Newton's method and gradient descent. In gradient descent, the direction of the steps is always perpendicular to the level curves, while that is not the case in Newton's method (due to the Hessian term).
For a quadratic function, one step of Newton's method minimizes it exactly, because the quadratic approximation to a quadratic function is the function itself.
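Concretely, for a generic strictly convex quadratic $f(x) = \tfrac{1}{2} x^T Q x + b^T x$ with $Q \succ 0$ (an illustrative form, not a specific example from the lecture), one pure Newton step from any $x$ gives
$$x^+ = x - \nabla^2 f(x)^{-1} \nabla f(x) = x - Q^{-1}(Qx + b) = -Q^{-1} b,$$
which is exactly the minimizer $x^\star$ satisfying $\nabla f(x^\star) = Qx^\star + b = 0$.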
Newton's method can also be used to find the roots of a vector-valued function $F : \mathbb{R}^n \to \mathbb{R}^n$, i.e., to solve the system of equations $F(x) = 0$. Newton's method for finding the solution to this system of equations is: choose an initial $x^{(0)} \in \mathbb{R}^n$, and
$$x^{(k)} = x^{(k-1)} - F'(x^{(k-1)})^{-1} F(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$$
where $F'(x)$ is the Jacobian matrix of $F$ at $x$.
The Newton step $x^+ = x - F'(x)^{-1} F(x)$ can be obtained by solving over $y$ the linear approximation
$$F(y) \approx F(x) + F'(x)(y - x) = 0.$$
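A minimal sketch of this root-finding iteration follows; the particular system $F$ below (the intersection of the unit circle with the line $x_1 = x_2$) is just an illustrative choice.

```python
import numpy as np

# Illustrative system F(x) = 0: unit circle intersected with the line x1 = x2
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

def jacobian(x):
    # F'(x): Jacobian of F at x
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [1.0, -1.0]])

x = np.array([2.0, 0.5])  # initial guess
for k in range(10):
    # Newton step: solve F'(x) v = F(x), then x+ = x - v
    v = np.linalg.solve(jacobian(x), F(x))
    x = x - v

print(x)  # approaches (1/sqrt(2), 1/sqrt(2))
```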
Newton’s method for root finding is directly related to the Newton’s method for convex minimization. In
particular, newton’s method for
min f (x)
x
is the same as Newton’s method for finding the roots of
∇f (x) = 0.
14.3 Properties
In this section, we present two key properties of Newton’s method which distinguish it from first order
methods.
(b) If $F'$ is Lipschitz continuous in a neighbourhood of $x^\star$, then there exists $K > 0$ such that
$$\|x^{(k+1)} - x^\star\| \le K \|x^{(k)} - x^\star\|^2.$$
Part (a) of the theorem says that Newton's method has super-linear local convergence. Note that this is stronger than linear convergence: $x^{(k)} \to x^\star$ linearly $\iff \|x^{(k+1)} - x^\star\| \le c\,\|x^{(k)} - x^\star\|$ for some $c \in (0, 1)$. If we further assume that $F'$ is Lipschitz continuous, then from part (b) we get that Newton's method has local quadratic convergence, which is even stronger than super-linear convergence.
Note that the above theorem talks only about local convergence so it holds only when we are close to
the root. Newton’s method does not necessarily converge in the global sense.
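A small numerical illustration of the local quadratic behaviour (the scalar equation $x^3 = 2$ is an arbitrary example): once the iterate is near the root, the error is roughly squared at each step.

```python
import numpy as np

# Newton's method for the scalar root-finding problem F(x) = x^3 - 2 = 0
x, root = 1.5, 2.0 ** (1.0 / 3.0)
for k in range(6):
    x = x - (x**3 - 2.0) / (3.0 * x**2)
    # Near the root, the error is roughly squared at every iteration
    print(k, abs(x - root))
```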
Interpretation 1: The Newton decrement $\lambda(x) = \big(\nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x)\big)^{1/2}$ relates the difference between $f(x)$ and the minimum of its quadratic approximation:
$$f(x) - \min_y \Big( f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2} (y - x)^T \nabla^2 f(x) (y - x) \Big) = \tfrac{1}{2} \nabla f(x)^T \nabla^2 f(x)^{-1} \nabla f(x) = \tfrac{1}{2} \lambda(x)^2$$
Thus, we can think of λ(x)2 /2 as an approximate bound on the suboptimality gap f (x) − f ? . The bound is
approximate because we are considering only the minimum of the quadratic approximation, not the actual
minimum of f (x).
Interpretation 2: If the step in Newton's method is denoted by $v = -\nabla^2 f(x)^{-1} \nabla f(x)$, then
$$\lambda(x) = \big(v^T \nabla^2 f(x)\, v\big)^{1/2} = \|v\|_{\nabla^2 f(x)}.$$
Thus, λ(x) is the length of the Newton step in the norm defined by the Hessian.
Fact: The Newton decrement is affine invariant, i.e., for $g(y) = f(Ay)$ with $A$ nonsingular, $\lambda_g(y) = \lambda_f(x)$ at $x = Ay$.
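Both interpretations, and the affine-invariance fact, are easy to verify numerically. The sketch below uses an illustrative convex function $f(x) = \sum_i e^{x_i}$ and a random nonsingular $A$; none of these choices come from the lecture.

```python
import numpy as np

# Illustrative convex function: f(x) = sum_i exp(x_i)
def grad_f(x):  return np.exp(x)
def hess_f(x):  return np.diag(np.exp(x))

def newton_decrement(gradient, H):
    v = -np.linalg.solve(H, gradient)                     # Newton step
    lam1 = np.sqrt(gradient @ np.linalg.solve(H, gradient))  # (grad^T H^{-1} grad)^{1/2}
    lam2 = np.sqrt(v @ H @ v)                             # length of v in the Hessian norm
    return lam1, lam2

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # nonsingular with probability 1
y = rng.standard_normal(3)
x = A @ y                         # x = Ay

lam_f = newton_decrement(grad_f(x), hess_f(x))
# g(y) = f(Ay): gradient A^T grad_f(Ay), Hessian A^T hess_f(Ay) A
lam_g = newton_decrement(A.T @ grad_f(x), A.T @ hess_f(x) @ A)
print(lam_f, lam_g)  # all four numbers agree: lambda_g(y) = lambda_f(x)
```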
Newton’s Method with backtracking line search satisfies the following convergence bounds
(
(k) ? f (x(0) ) − f ? − γk if k ≤ k0
f (x ) − f ≤ 2m3 1 2k −k0 +1
M2 ( 2 ) if k > k0
where γ = αβ 2 η 2 m/L2 , η = min{1, 3(1 − 2α)}m2 /M , and k0 is the number of steps till ||∇f (x(k0 +1) )||2 < η.
More precisely, the result indicates that in the damped phase we have
$$f(x^{(k+1)}) - f(x^{(k)}) \le -\gamma,$$
so that to reach $f(x^{(k)}) - f^\star \le \epsilon$, at most
$$\frac{f(x^{(0)}) - f^\star}{\gamma} + \log\log(\epsilon_0/\epsilon)$$
iterations are needed, where $\epsilon_0 = 2m^3/M^2$.
The "log log" term in the convergence result reflects the quadratic convergence: only $\log\log(\epsilon_0/\epsilon)$ pure-phase iterations are needed to reach accuracy $\epsilon$. However, the quadratic convergence result is only local; it is guaranteed in the second (pure) phase only. Finally, the above bound depends on $L$, $m$, $M$, but the algorithm itself does not.
14.6.2 Property
If $g$ is self-concordant and $A$, $b$ are of appropriate dimensions, then
$$f(x) := g(Ax - b)$$
is also self-concordant.
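A standard example: $g(u) = -\sum_{i=1}^m \log u_i$ is self-concordant on $\mathbb{R}^m_{++}$, so the log barrier
$$f(x) = g(Ax - b) = -\sum_{i=1}^m \log(a_i^T x - b_i)$$
is self-concordant on $\{x : Ax > b\}$.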
Even though Newton's method has quadratic convergence, compared to the linear convergence of gradient descent, computing (and inverting) the Hessian might make each iteration a lot slower. If the Hessian is sparse and structured (e.g., banded), however, then both the memory and the computation per Newton step are $O(n)$.
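As an illustration of how this structure is exploited (the tridiagonal Hessian below is a made-up example, and SciPy's sparse solver is just one possible choice):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

n = 100_000
# Made-up tridiagonal (banded) positive definite Hessian and a gradient vector
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
H = sparse.diags([off, main, off], offsets=[-1, 0, 1], format="csc")
g = np.random.randn(n)

# Newton step: solve H v = g, then x+ = x - v; for a fixed bandwidth this costs
# roughly O(n), versus O(n^3) for a dense solve (and O(n^2) just to store H densely).
v = spsolve(H, g)
```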
Consider an equality-constrained problem $\min_x f(x)$ subject to $Ax = b$. Here we have three options: eliminating the equality constraints by writing $x = Fy + x_0$, where the columns of $F$ span the null space of $A$ and $Ax_0 = b$; deriving the dual; or using the most straightforward option, equality-constrained Newton's method.
14.8.2 Definition
In equality-constrained Newton's method, we take Newton steps that stay within the affine set defined by the constraints. The Newton update is now $x^+ = x + tv$, where
$$v = \operatorname*{argmin}_{A(x+z) = b} \; f(x) + \nabla f(x)^T z + \tfrac{1}{2} z^T \nabla^2 f(x) z.$$
For a feasible $x$ (i.e., $Ax = b$), the KKT conditions of this quadratic subproblem give the linear system
$$\begin{bmatrix} \nabla^2 f(x) & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} v \\ w \end{bmatrix} = \begin{bmatrix} -\nabla f(x) \\ 0 \end{bmatrix}.$$
The latter is the root-finding Newton step for the KKT conditions of the original equality-constrained problem,
$$\begin{bmatrix} \nabla f(x) + A^T y \\ Ax - b \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$
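A minimal sketch of one equality-constrained Newton step through this KKT system; the quadratic objective, the constraint data, and the feasible starting point below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 2

# Illustrative problem data: f(x) = 1/2 x^T Q x + c^T x, subject to A x = b
M = rng.standard_normal((n, n))
Q = M @ M.T + n * np.eye(n)        # positive definite Hessian
c = rng.standard_normal(n)
A = rng.standard_normal((p, n))
x = rng.standard_normal(n)
b = A @ x                          # start from a feasible point

def grad(x): return Q @ x + c
def hess(x): return Q

# Solve the KKT system  [H  A^T; A  0] [v; w] = [-grad; 0]  for the Newton direction v
H = hess(x)
K = np.block([[H, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-grad(x), np.zeros(p)])
v = np.linalg.solve(K, rhs)[:n]

x_plus = x + v                     # pure Newton step (t = 1)
print(np.allclose(A @ x_plus, b))  # the step keeps the iterate feasible: A v = 0
```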
References
• S. Boyd and L. Vandenberghe (2004), “Convex optimization”, Chapters 9 and 10
• O. Güler (2010), "Foundations of Optimization", Chapter 14
• Y. Nesterov (1998), “Introductory lectures on convex optimization: a basic course”, Chapter 2