
Newton’s Method

Ryan Tibshirani
Convex Optimization 10-725/36-725

1
Last time: dual correspondences
Given a function f : R^n → R, we define its conjugate f* : R^n → R,

    f*(y) = max_x  yᵀx − f(x)

Properties and examples:


• Conjugate f* is always convex (regardless of convexity of f )
• When f is a quadratic in Q ≻ 0, f* is a quadratic in Q⁻¹
• When f is a norm, f* is the indicator of the dual norm unit ball
• When f is closed and convex, x ∈ ∂f*(y) ⇐⇒ y ∈ ∂f(x)

Relationship to duality:

    Primal :  min_x  f(x) + g(x)
    Dual :    max_u  −f*(u) − g*(−u)

2
Newton’s method

Given unconstrained, smooth convex optimization

    min_x  f(x)

where f is convex, twice differentiable, and dom(f) = R^n. Recall
that gradient descent chooses initial x^(0) ∈ R^n, and repeats

    x^(k) = x^(k−1) − t_k · ∇f(x^(k−1)),   k = 1, 2, 3, . . .

In comparison, Newton’s method repeats

    x^(k) = x^(k−1) − (∇²f(x^(k−1)))⁻¹ ∇f(x^(k−1)),   k = 1, 2, 3, . . .

Here ∇²f(x^(k−1)) is the Hessian matrix of f at x^(k−1)
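As a concrete sketch (not from the slides), the iteration can be written in a few lines of Python/NumPy; the callables grad and hess, returning ∇f(x) and ∇²f(x), and the stopping tolerance are assumptions for illustration.

import numpy as np

def newton_method(grad, hess, x0, tol=1e-10, max_iter=50):
    # Pure Newton: x+ = x - (Hessian)^{-1} gradient, computed by solving
    # the linear system hess(x) v = grad(x) rather than forming an inverse.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - np.linalg.solve(hess(x), g)
    return x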

3
Newton’s method interpretation

Recall the motivation for gradient descent step at x: we minimize


the quadratic approximation
    f(y) ≈ f(x) + ∇f(x)ᵀ(y − x) + (1/(2t)) ‖y − x‖₂²

over y, and this yields the update x⁺ = x − t∇f(x)

Newton’s method uses, in a sense, a better quadratic approximation

    f(y) ≈ f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ∇²f(x)(y − x)

and minimizes over y to yield x⁺ = x − (∇²f(x))⁻¹∇f(x)

4
For f(x) = (10x₁² + x₂²)/2 + 5 log(1 + e^(−x₁−x₂)), compare gradient
descent (black) to Newton’s method (blue), where both take steps
of roughly the same length

[Figure: contour plot of f with gradient descent iterates (black) and Newton iterates (blue); both axes range from −20 to 20]

(For our example we needed to consider a nonquadratic ... why?)


5
Outline

Today:
• Interpretations and properties
• Backtracking line search
• Convergence analysis
• Equality-constrained Newton
• Quasi-Newton methods

6
Linearized optimality condition
Alternative interpretation of Newton step at x: we seek a direction
v so that ∇f(x + v) = 0. Now consider linearizing this optimality
condition

    0 = ∇f(x + v) ≈ ∇f(x) + ∇²f(x)v

and solving for v, which again yields v = −(∇²f(x))⁻¹∇f(x)
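To connect with this root-finding view, here is a minimal one-dimensional sketch; the test function f(x) = x² + eˣ (so f′(x) = 2x + eˣ) is an arbitrary choice for illustration, not from the slides.

import math

def newton_1d(fprime, fsecond, x, iters=10):
    # Newton-Raphson on f': repeatedly solve the linearization
    # f'(x) + f''(x) v = 0 for v, then update x <- x + v.
    for _ in range(iters):
        v = -fprime(x) / fsecond(x)
        x = x + v
    return x

# Minimize f(x) = x^2 + exp(x) by finding the root of f'(x) = 2x + exp(x)
x_star = newton_1d(lambda x: 2 * x + math.exp(x), lambda x: 2 + math.exp(x), x=0.0)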
History: work of Newton (1685) and Raphson (1690) originally
focused on finding roots of polynomials. Simpson (1740) applied
this idea to general nonlinear equations, and minimization by
setting the gradient to zero

[Figure 9.18 from B & V, page 486: the solid curve is the derivative
f′ of the function f; f̂′ is the linear approximation of f′ at x. The
Newton step ∆x_nt is the difference between the root of f̂′ and the
point x]

For us, ∆x_nt = v

7
Affine invariance of Newton’s method
Important property of Newton’s method: affine invariance. Given f
and nonsingular A ∈ R^(n×n), let x = Ay and g(y) = f(Ay). Newton
steps on g are

    y⁺ = y − (∇²g(y))⁻¹ ∇g(y)
       = y − (Aᵀ∇²f(Ay)A)⁻¹ Aᵀ∇f(Ay)
       = y − A⁻¹(∇²f(Ay))⁻¹ ∇f(Ay)

Hence
    Ay⁺ = Ay − (∇²f(Ay))⁻¹ ∇f(Ay)
i.e.,
    x⁺ = x − (∇²f(x))⁻¹ ∇f(x)

So progress is independent of problem scaling; recall that this is
not true of gradient descent
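A quick numerical check of this property (a sketch only; the quadratic-plus-exponential test function and the random A are arbitrary choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
n = 5
Q = 2.0 * np.eye(n)

def grad_f(x):
    # f(x) = 0.5 x^T Q x + sum(exp(x))
    return Q @ x + np.exp(x)

def hess_f(x):
    return Q + np.diag(np.exp(x))

A = rng.standard_normal((n, n))          # nonsingular with probability 1
y = rng.standard_normal(n)
x = A @ y

# Newton step on g(y) = f(Ay): gradient A^T grad_f(Ay), Hessian A^T hess_f(Ay) A
y_plus = y - np.linalg.solve(A.T @ hess_f(A @ y) @ A, A.T @ grad_f(A @ y))
# Newton step on f at x = Ay
x_plus = x - np.linalg.solve(hess_f(x), grad_f(x))

print(np.allclose(A @ y_plus, x_plus))   # True: A y+ equals x+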

8
Newton decrement
At a point x, we define the Newton decrement as

    λ(x) = ( ∇f(x)ᵀ (∇²f(x))⁻¹ ∇f(x) )^(1/2)

This relates to the difference between f(x) and the minimum of its
quadratic approximation:

    f(x) − min_y { f(x) + ∇f(x)ᵀ(y − x) + (1/2)(y − x)ᵀ∇²f(x)(y − x) }
    = f(x) − ( f(x) − (1/2) ∇f(x)ᵀ (∇²f(x))⁻¹ ∇f(x) )
    = (1/2) λ(x)²

Therefore can think of λ(x)²/2 as an approximate bound on the
suboptimality gap f(x) − f⋆
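A minimal sketch of computing λ(x) with a Cholesky factorization (the callables grad and hess are assumed as before, and the Hessian is assumed positive definite):

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def newton_decrement(grad, hess, x):
    # lambda(x)^2 = grad(x)^T (hess(x))^{-1} grad(x);
    # lambda(x)^2 / 2 serves as an estimate of f(x) - f_star.
    g = grad(x)
    c = cho_factor(hess(x))
    return float(np.sqrt(g @ cho_solve(c, g)))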

9
Another interpretation of Newton decrement: if Newton direction
is v = −(∇²f(x))⁻¹∇f(x), then

    λ(x) = ( vᵀ ∇²f(x) v )^(1/2) = ‖v‖_{∇²f(x)}

i.e., λ(x) is the length of the Newton step in the norm defined by
the Hessian ∇²f(x)

Note that the Newton decrement, like the Newton step, is affine
invariant; i.e., if we defined g(y) = f(Ay) for nonsingular A, then
λ_g(y) would match λ_f(x) at x = Ay

10
Backtracking line search
We have seen pure Newton’s method, which need not converge. In
practice, we instead use damped Newton’s method (i.e., Newton’s
method), which repeats
    x⁺ = x − t (∇²f(x))⁻¹ ∇f(x)

Note that the pure method uses t = 1

Step sizes here typically are chosen by backtracking search, with


parameters 0 < α ≤ 1/2, 0 < β < 1. At each iteration, we start
with t = 1 and while

    f(x + tv) > f(x) + αt∇f(x)ᵀv

we shrink t = βt, else we perform the Newton update. Note that
here v = −(∇²f(x))⁻¹∇f(x), so ∇f(x)ᵀv = −λ(x)²
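Putting the pieces together, a sketch of damped Newton with backtracking; the parameters and the stopping rule based on λ(x)²/2 follow the slides, while the callable names and default values are assumptions.

import numpy as np

def damped_newton(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        v = -np.linalg.solve(hess(x), g)      # Newton direction
        lam_sq = -g @ v                       # lambda(x)^2 = -grad^T v
        if lam_sq / 2 <= tol:                 # estimated suboptimality is small
            break
        t = 1.0
        while f(x + t * v) > f(x) + alpha * t * (g @ v):
            t *= beta                         # backtrack until sufficient decrease
        x = x + t * v
    return x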

11
Example: logistic regression
Logistic regression example, with n = 500, p = 100: we compare
gradient descent and Newton’s method, both with backtracking

[Figure: f − f⋆ (log scale, from 1e−13 to 1e+03) versus iteration number (0 to 70), for gradient descent and Newton’s method]
Newton’s method seems to have a different regime of convergence!
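For reference, a sketch of the Newton updates for this logistic regression problem, minimizing the negative log-likelihood f(β) = Σᵢ [ log(1 + exp(xᵢᵀβ)) − yᵢ xᵢᵀβ ]; the data matrix X (n × p) and labels y ∈ {0,1}ⁿ are assumed given, and for brevity the pure step t = 1 is shown rather than the backtracking version used in the plot.

import numpy as np

def logistic_newton(X, y, max_iter=25, tol=1e-10):
    # Gradient: X^T (p - y); Hessian: X^T diag(p(1-p)) X, with p = sigmoid(X beta)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        g = X.T @ (p - y)
        if np.linalg.norm(g) < tol:
            break
        H = X.T @ ((p * (1 - p))[:, None] * X)
        beta = beta - np.linalg.solve(H, g)   # may fail if H is singular (e.g., separable data)
    return beta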


12
Convergence analysis
Assume that f is convex, twice differentiable, with dom(f) = R^n,
and additionally
• ∇f is Lipschitz with parameter L
• f is strongly convex with parameter m
• ∇²f is Lipschitz with parameter M

Theorem: Newton’s method with backtracking line search satisfies
the following two-stage convergence bounds

    f(x^(k)) − f⋆ ≤ (f(x^(0)) − f⋆) − γk                if k ≤ k₀
    f(x^(k)) − f⋆ ≤ (2m³/M²) (1/2)^(2^(k−k₀+1))         if k > k₀

Here γ = αβ²η²m/L², η = min{1, 3(1 − 2α)} m²/M, and k₀ is
the number of steps until ‖∇f(x^(k₀+1))‖₂ < η

13
In more detail, convergence analysis reveals γ > 0, 0 < η ≤ m²/M
such that convergence follows two stages
• Damped phase: ‖∇f(x^(k))‖₂ ≥ η, and

    f(x^(k+1)) − f(x^(k)) ≤ −γ

• Pure phase: ‖∇f(x^(k))‖₂ < η, backtracking selects t = 1, and

    (M/(2m²)) ‖∇f(x^(k+1))‖₂ ≤ ( (M/(2m²)) ‖∇f(x^(k))‖₂ )²

Note that once we enter pure phase, we won’t leave, because

    (2m²/M) ( (M/(2m²)) η )² < η

when η ≤ m²/M

14
Unraveling this result, what does it say? To reach
f(x^(k)) − f⋆ ≤ ε, we need at most

    (f(x^(0)) − f⋆)/γ + log log(ε₀/ε)

iterations, where ε₀ = 2m³/M²


• This is called quadratic convergence. Compare this to linear
convergence (which, recall, is what gradient descent achieves
under strong convexity)
• The above result is a local convergence rate, i.e., we are only
guaranteed quadratic convergence after some number of steps
k₀, where k₀ ≤ (f(x^(0)) − f⋆)/γ
• Somewhat bothersome may be the fact that the above bound
depends on L, m, M , and yet the algorithm itself does not

15
Self-concordance
A scale-free analysis is possible for self-concordant functions: on R,
a convex function f is called self-concordant if

    |f′′′(x)| ≤ 2 f′′(x)^(3/2)   for all x

and on R^n is called self-concordant if its restriction to every line
segment is so. E.g., f(x) = − log(x) is self-concordant (check:
f′′(x) = 1/x², f′′′(x) = −2/x³, so |f′′′(x)| = 2 f′′(x)^(3/2))

Theorem (Nesterov and Nemirovskii): Newton’s method
with backtracking line search requires at most

    C(α, β) ( f(x^(0)) − f⋆ ) + log log(1/ε)

iterations to reach f(x^(k)) − f⋆ ≤ ε, where C(α, β) is a constant
that only depends on α, β

16
Comparison to first-order methods
At a high level:
• Memory: each iteration of Newton’s method requires O(n²)
storage (n × n Hessian); each gradient iteration requires O(n)
storage (n-dimensional gradient)
• Computation: each Newton iteration requires O(n³) flops
(solving a dense n × n linear system); each gradient iteration
requires O(n) flops (scaling/adding n-dimensional vectors)
• Backtracking: backtracking line search has roughly the same
cost; both use O(n) flops per inner backtracking step
• Conditioning: Newton’s method is not affected by a problem’s
conditioning, but gradient descent can seriously degrade
• Fragility: Newton’s method may be empirically more sensitive
to bugs/numerical errors; gradient descent is more robust

17
Back to logistic regression example: now x-axis is parametrized in
terms of time taken per iteration

[Figure: f − f⋆ (log scale, from 1e−13 to 1e+03) versus wall-clock time in seconds (0 to 0.25), for gradient descent and Newton’s method]

Each gradient descent step is O(p), but each Newton step is O(p³)
18
Sparse, structured problems

When the inner linear systems (in the Hessian) can be solved
efficiently and reliably, Newton’s method can thrive

E.g., if ∇²f(x) is sparse and structured for all x, say banded, then
both memory and computation are O(n) with Newton iterations
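As a sketch of the banded case (the tridiagonal "Hessian" here is an arbitrary stand-in, not from the slides), SciPy’s banded Cholesky solver computes the Newton direction in O(n) time and memory:

import numpy as np
from scipy.linalg import solveh_banded

n = 100_000
# Symmetric positive-definite tridiagonal matrix in upper banded storage:
# row 0 holds the superdiagonal (first entry unused), row 1 holds the diagonal.
ab = np.vstack([np.full(n, -1.0), np.full(n, 4.0)])
ab[0, 0] = 0.0

g = np.random.default_rng(0).standard_normal(n)   # stand-in for the gradient

v = solveh_banded(ab, g)    # Newton direction: solves (banded Hessian) v = g in O(n)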

What functions admit a structured Hessian? Two examples:


• If g(β) = f(Xβ), then ∇²g(β) = Xᵀ∇²f(Xβ)X. Hence if
X is a structured predictor matrix and ∇²f is diagonal, then
∇²g is structured
• If we seek to minimize f(β) + g(Dβ), where ∇²f is diagonal,
g is not smooth, and D is a structured penalty matrix, then
the Lagrange dual function is −f*(−Dᵀu) − g*(−u). Often
−D∇²f*(−Dᵀu)Dᵀ can be structured

19
Equality-constrained Newton’s method

Consider now a problem with equality constraints, as in

    min_x  f(x)   subject to  Ax = b

Several options:
• Eliminating equality constraints: write x = Fy + x₀, where F
spans the null space of A, and Ax₀ = b. Solve in terms of y
• Deriving the dual: can check that the Lagrange dual function
is −f*(−Aᵀv) − bᵀv, and strong duality holds. With luck, we
can express x⋆ in terms of v⋆
• Equality-constrained Newton: in many cases, this is the most
straightforward option

20
In equality-constrained Newton’s method, we start with x^(0) such
that Ax^(0) = b. Then we repeat the updates

    x⁺ = x + tv,  where
    v = argmin_{Az=0}  ∇f(x)ᵀ(z − x) + (1/2)(z − x)ᵀ∇²f(x)(z − x)

This keeps x⁺ in the feasible set, since Ax⁺ = Ax + tAv = b + 0 = b

Furthermore, v is the solution to minimizing a quadratic subject to
equality constraints. We know from KKT conditions that v satisfies

    [ ∇²f(x)  Aᵀ ] [ v ]   =   [ −∇f(x) ]
    [ A       0  ] [ w ]       [ 0      ]

for some w. Hence Newton direction v is again given by solving a


linear system in the Hessian (albeit a bigger one)
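A sketch of computing one equality-constrained Newton direction by forming and solving the KKT system above (dense, for simplicity); grad and hess are assumed callables, and x is assumed feasible (Ax = b).

import numpy as np

def eq_constrained_newton_step(grad, hess, A, x):
    # Solve [H  A^T; A  0] [v; w] = [-grad(x); 0]; return the direction v.
    n, m = x.size, A.shape[0]
    K = np.block([[hess(x), A.T],
                  [A, np.zeros((m, m))]])
    rhs = np.concatenate([-grad(x), np.zeros(m)])
    return np.linalg.solve(K, rhs)[:n]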

21
Quasi-Newton methods

If the Hessian is too expensive (or singular), then a quasi-Newton
method can be used to approximate ∇²f(x) with H ≻ 0, and we
update according to

    x⁺ = x − tH⁻¹∇f(x)

• Approximate Hessian H is recomputed at each step. Goal is
to make H⁻¹ cheap to apply (possibly, cheap storage too)
• Convergence is fast: superlinear, but not the same as Newton.
Roughly n steps of quasi-Newton make the same progress as one
Newton step
• Very wide variety of quasi-Newton methods; common theme
is to “propagate” computation of H across iterations

22
Davidon-Fletcher-Powell or DFP:
• Update H, H⁻¹ via rank 2 updates from previous iterations;
cost is O(n²) for these updates
• Since it is being stored, applying H⁻¹ is simply O(n²) flops
• Can be motivated by Taylor series expansion

Broyden-Fletcher-Goldfarb-Shanno or BFGS:
• Came after DFP, but BFGS is now much more widely used
• Again, updates H, H⁻¹ via rank 2 updates, but does so in a
“dual” fashion to DFP; cost is still O(n²) (a sketch of the
inverse-Hessian update is given below)
• Also has a limited-memory version, L-BFGS: instead of letting
updates propagate over all iterations, only keeps updates from
the last m iterations; storage is now O(mn) instead of O(n²)
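For concreteness, the textbook BFGS rank-2 update of the inverse Hessian approximation C ≈ H⁻¹, written as a sketch (with s = x⁺ − x and y = ∇f(x⁺) − ∇f(x)); the update costs O(n²), matching the slide, and the function name is an assumption.

import numpy as np

def bfgs_inverse_update(C, s, y):
    # C+ = (I - rho s y^T) C (I - rho y s^T) + rho s s^T, with rho = 1/(y^T s);
    # requires the curvature condition y^T s > 0.
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ C @ V.T + rho * np.outer(s, s)

# The quasi-Newton step then uses the approximation: x_plus = x - t * (C @ grad)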

23
References and further reading

• S. Boyd and L. Vandenberghe (2004), “Convex optimization”,


Chapters 9 and 10
• Y. Nesterov (1998), “Introductory lectures on convex
optimization: a basic course”, Chapter 2
• Y. Nesterov and A. Nemirovskii (1994), “Interior-point
polynomial methods in convex programming”, Chapter 2
• J. Nocedal and S. Wright (2006), “Numerical optimization”,
Chapters 6 and 7
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring
2011-2012

24
