Lecture 14: Newton's Method
Ryan Tibshirani
Convex Optimization 10-725/36-725
Last time: dual correspondences

Given a function f : R^n → R, we define its conjugate f^* : R^n → R,

    f^*(y) = max_x  y^T x − f(x)

Relationship to duality:
Newton's method

Consider unconstrained, smooth convex optimization

    min_x  f(x)

where f is twice differentiable. Pure Newton's method repeats

    x^+ = x − (∇²f(x))^{-1} ∇f(x)

i.e., it takes a step of size t = 1 in the Newton direction v = −(∇²f(x))^{-1} ∇f(x)
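As a minimal sketch (not from the slides; grad_f and hess_f stand in for the problem's gradient and Hessian oracles), the update is best computed as a linear solve rather than by forming the inverse Hessian:

```python
import numpy as np

def pure_newton_step(x, grad_f, hess_f):
    # x^+ = x - (hess f(x))^{-1} grad f(x), via a linear solve
    # instead of an explicit matrix inverse
    return x - np.linalg.solve(hess_f(x), grad_f(x))
```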
Newton's method interpretation

The Newton step minimizes the local quadratic approximation of f: with v = −(∇²f(x))^{-1} ∇f(x),

    x^+ = x + v = argmin_y  f(x) + ∇f(x)^T (y − x) + (1/2) (y − x)^T ∇²f(x) (y − x)

Compare gradient descent, which minimizes the same kind of approximation but with ∇²f(x) replaced by (1/t) I
For f(x) = (10x_1^2 + x_2^2)/2 + 5 log(1 + e^{−x_1−x_2}), compare gradient descent (black) to Newton's method (blue), where both take steps of roughly same length

[Figure: contour plot of f over roughly [−20, 20] × [−20, 20], with the gradient descent (black) and Newton's method (blue) iterates overlaid]
Today:
• Interpretations and properties
• Backtracking line search
• Convergence analysis
• Equality-constrained Newton
• Quasi-Newton methods
Linearized optimality condition
Alternative interpretation of the Newton step at x: we seek a direction v so that ∇f(x + v) = 0. Now consider linearizing this optimality condition,

    0 = ∇f(x + v) ≈ ∇f(x) + ∇²f(x) v

and solving for v, which again yields v = −(∇²f(x))^{-1} ∇f(x)
History: work of Newton (1685) and Raphson (1690) originally focused on finding roots of polynomials. Simpson (1740) applied this idea to general nonlinear equations, and to minimization by setting the gradient to zero

[Figure 9.18 from B & V, page 486: the solid curve is the derivative f' of the function f; the dashed line is the linear approximation of f' at x; the Newton step Δx_nt is the difference between the root of this linear approximation and the point x]

For us, Δx_nt = v
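A tiny sketch of this root-finding view, using a test function of my own choosing (f(x) = log(1 + e^x) − 0.3x, so f'(x) = σ(x) − 0.3): each iteration jumps to the root of the linear approximation of f'.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
fprime  = lambda x: sigmoid(x) - 0.3                 # f'(x)
fsecond = lambda x: sigmoid(x) * (1.0 - sigmoid(x))  # f''(x)

x = 0.0
for _ in range(6):
    v = -fprime(x) / fsecond(x)   # root of the linearization f'(x) + f''(x) v = 0
    x = x + v
print(x, fprime(x))               # x is near log(0.3/0.7), and f'(x) is near 0
```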
Affine invariance of Newton’s method
Important property of Newton's method: affine invariance. Given f and nonsingular A ∈ R^{n×n}, let x = Ay and g(y) = f(Ay). Newton steps on g are

    y^+ = y − (∇²g(y))^{-1} ∇g(y)
        = y − (A^T ∇²f(Ay) A)^{-1} A^T ∇f(Ay)
        = y − A^{-1} (∇²f(Ay))^{-1} ∇f(Ay)

Hence

    Ay^+ = Ay − (∇²f(Ay))^{-1} ∇f(Ay)

i.e.,

    x^+ = x − (∇²f(x))^{-1} ∇f(x)

So progress is independent of problem scaling; recall that this is not true of gradient descent
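A quick numerical check of affine invariance (a sketch; the test function f(x) = Σ e^{x_i} + ‖x‖^2/2 and the random nonsingular A are my own choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)   # nonsingular, well conditioned

def grad_f(x):                     # f(x) = sum(exp(x)) + ||x||^2 / 2
    return np.exp(x) + x

def hess_f(x):
    return np.diag(np.exp(x) + 1.0)

def newton_step(grad, hess, z):
    return z - np.linalg.solve(hess(z), grad(z))

y = rng.standard_normal(n)
x = A @ y

# Newton step on g(y) = f(Ay): grad g(y) = A^T grad f(Ay), hess g(y) = A^T hess f(Ay) A
y_plus = newton_step(lambda y: A.T @ grad_f(A @ y),
                     lambda y: A.T @ hess_f(A @ y) @ A, y)
x_plus = newton_step(grad_f, hess_f, x)

print(np.allclose(A @ y_plus, x_plus))   # True: the iterates agree under x = Ay
```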
Newton decrement
At a point x, we define the Newton decrement as

    λ(x) = ( ∇f(x)^T (∇²f(x))^{-1} ∇f(x) )^{1/2}

This relates to the difference between f(x) and the minimum of its quadratic approximation:

    f(x) − min_y { f(x) + ∇f(x)^T (y − x) + (1/2) (y − x)^T ∇²f(x) (y − x) }
        = f(x) − ( f(x) − (1/2) ∇f(x)^T (∇²f(x))^{-1} ∇f(x) )
        = (1/2) λ(x)^2

Therefore can think of λ(x)^2 / 2 as an approximate bound on the suboptimality gap f(x) − f^⋆
Another interpretation of the Newton decrement: if the Newton direction is v = −(∇²f(x))^{-1} ∇f(x), then

    λ(x) = ( v^T ∇²f(x) v )^{1/2} = ‖v‖_{∇²f(x)}

i.e., λ(x) is the length of the Newton step in the norm defined by the Hessian ∇²f(x)

Note that the Newton decrement, like the Newton steps, is affine invariant; i.e., if we define g(y) = f(Ay) for nonsingular A, then λ_g(y) matches λ_f(x) at x = Ay
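A short sketch verifying that the two expressions for λ(x) agree (the gradient and Hessian values here are arbitrary test inputs, not from the slides):

```python
import numpy as np

def newton_decrement(g, H):
    # lambda(x) computed two ways: as (g^T H^{-1} g)^{1/2}, and as the
    # Hessian-norm length of the Newton direction v = -H^{-1} g
    v = np.linalg.solve(H, -g)
    lam_quadform = np.sqrt(g @ np.linalg.solve(H, g))
    lam_hessnorm = np.sqrt(v @ H @ v)
    return lam_quadform, lam_hessnorm

g = np.array([1.0, -2.0])
H = np.array([[3.0, 1.0], [1.0, 2.0]])
print(newton_decrement(g, H))   # the two values coincide
```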
Backtracking line search
We have seen pure Newton's method, which need not converge. In practice, we instead use damped Newton's method (i.e., Newton's method), which repeats

    x^+ = x − t (∇²f(x))^{-1} ∇f(x)

Note that the pure method uses t = 1. Step sizes here are typically chosen by backtracking search, with parameters 0 < α ≤ 1/2, 0 < β < 1: at each iteration we start with t = 1, and while

    f(x + tv) > f(x) + α t ∇f(x)^T v

we shrink t = βt, then take the update x^+ = x + tv
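A minimal sketch of damped Newton with backtracking, following the update above (the function name and the stopping rule λ^2/2 ≤ tol are my own choices):

```python
import numpy as np

def damped_newton(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=100):
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        g = grad(x)
        v = np.linalg.solve(hess(x), -g)   # Newton direction
        lam2 = -g @ v                      # squared Newton decrement g^T H^{-1} g
        if lam2 / 2.0 <= tol:              # stop when the estimated gap is small
            break
        t = 1.0                            # backtracking: start at the pure step
        while f(x + t * v) > f(x) + alpha * t * (g @ v):
            t *= beta
        x = x + t * v
    return x
```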
Example: logistic regression
Logistic regression example, with n = 500, p = 100: we compare
gradient descent and Newton’s method, both with backtracking
[Figure: f − fstar (log scale, 1e−13 to 1e+03) versus iteration number (0 to 70), for gradient descent and Newton's method]
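A sketch of what the Newton iteration for logistic regression looks like (synthetic data with the same dimensions; the experiment's actual data and its use of backtracking are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)

beta = np.zeros(p)
for _ in range(20):
    pr = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
    g = X.T @ (pr - y)                     # gradient of the negative log-likelihood
    W = pr * (1.0 - pr)                    # Hessian weights
    H = X.T @ (W[:, None] * X)             # Hessian X^T W X: O(p^2 n) to form
    v = np.linalg.solve(H, -g)             # O(p^3) solve
    beta = beta + v                        # pure Newton step, for brevity
    if np.linalg.norm(g) < 1e-8:
        break
```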
Convergence analysis

In more detail, convergence analysis reveals γ > 0, 0 < η ≤ m^2/M such that convergence follows two stages (here m is the strong convexity constant of f and M is the Lipschitz constant of ∇²f)

• Damped phase: ‖∇f(x^{(k)})‖_2 ≥ η, and

    f(x^{(k+1)}) − f(x^{(k)}) ≤ −γ

• Pure phase: ‖∇f(x^{(k)})‖_2 < η, backtracking selects t = 1, and

    (M / (2m^2)) ‖∇f(x^{(k+1)})‖_2 ≤ ( (M / (2m^2)) ‖∇f(x^{(k)})‖_2 )^2

Note that once we enter the pure phase, we won't leave, because

    (2m^2 / M) ( (M / (2m^2)) η )^2 < η

when η ≤ m^2/M
Unraveling this result, what does it say? To reach f(x^{(k)}) − f^⋆ ≤ ε, we need at most

    ( f(x^{(0)}) − f^⋆ ) / γ + log log(ε_0 / ε)

iterations. The log log term grows so slowly in ε that it can be treated as essentially constant
Self-concordance
A scale-free analysis is possible for self-concordant functions: on R, a convex function f is called self-concordant if

    |f'''(x)| ≤ 2 f''(x)^{3/2}   for all x

and self-concordant on R^n if it is self-concordant along every line in its domain. E.g., f(x) = −log x is self-concordant, since f''(x) = 1/x^2 and |f'''(x)| = 2/x^3 = 2 f''(x)^{3/2}
Comparison to first-order methods
At a high level:
• Memory: each iteration of Newton's method requires O(n^2) storage (the n × n Hessian); each gradient iteration requires O(n) storage (the n-dimensional gradient)
• Computation: each Newton iteration requires O(n^3) flops (solving a dense n × n linear system); each gradient iteration requires O(n) flops (scaling/adding n-dimensional vectors)
• Backtracking: backtracking line search has roughly the same cost in both cases, using O(n) flops per inner backtracking step
• Conditioning: Newton's method is not affected by a problem's conditioning, but gradient descent can seriously degrade
• Fragility: Newton's method may be empirically more sensitive to bugs/numerical errors; gradient descent is more robust
Back to logistic regression example: now x-axis is parametrized in
terms of time taken per iteration
[Figure: f − fstar (log scale, 1e−13 to 1e+03) versus time, for gradient descent and Newton's method]

Each gradient descent step is O(p), but each Newton step is O(p^3)
Sparse, structured problems
When the inner linear systems (in the Hessian) can be solved efficiently and reliably, Newton's method can thrive

E.g., if ∇²f(x) is sparse and structured for all x, say banded, then both memory and computation are O(n) per Newton iteration
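A sketch of this with a tridiagonal (banded) Hessian, using a made-up separable-plus-chain objective (the function, ρ, and b below are my own choices): the Newton direction comes from a banded Cholesky solve, so each iteration is O(n) in time and memory.

```python
import numpy as np
from scipy.linalg import solveh_banded

# f(x) = sum_i [ log(1 + e^{x_i}) - b_i x_i ] + (rho/2) * sum_i (x_{i+1} - x_i)^2
rng = np.random.default_rng(0)
n, rho = 10_000, 1.0
b = rng.uniform(0.2, 0.8, size=n)

def grad(x):
    s = 1.0 / (1.0 + np.exp(-x))                # d/dx log(1 + e^x)
    g = s - b
    g[:-1] += rho * (x[:-1] - x[1:])            # chain-coupling terms
    g[1:]  += rho * (x[1:] - x[:-1])
    return g

def hess_banded(x):
    s = 1.0 / (1.0 + np.exp(-x))
    diag = s * (1.0 - s) + rho * np.r_[1.0, 2.0 * np.ones(n - 2), 1.0]
    ab = np.zeros((2, n))                       # upper banded storage for solveh_banded
    ab[0, 1:] = -rho                            # superdiagonal
    ab[1, :] = diag                             # main diagonal
    return ab

x = np.zeros(n)
for _ in range(20):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    v = solveh_banded(hess_banded(x), -g)       # O(n) banded solve for the Newton direction
    x = x + v                                   # pure Newton step, for brevity
```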
Equality-constrained Newton’s method
Now consider the equality-constrained problem

    min_x  f(x)  subject to  Ax = b

Several options:
• Eliminating equality constraints: write x = Fy + x_0, where F spans the null space of A, and Ax_0 = b. Solve in terms of y
• Deriving the dual: can check that the Lagrange dual function is −f^*(−A^T v) − b^T v, and strong duality holds. With luck, we can express x^⋆ in terms of v^⋆
• Equality-constrained Newton: in many cases, this is the most straightforward option
In equality-constrained Newton's method, we start with x^{(0)} such that Ax^{(0)} = b. Then we repeat the updates x^+ = x + tv, where

    v = argmin_{Az=0}  ∇f(x)^T (z − x) + (1/2) (z − x)^T ∇²f(x) (z − x)

This keeps the iterates feasible, since Av = 0 implies Ax^+ = Ax + tAv = b
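One standard way to compute the constrained Newton direction at a feasible x (following the KKT approach in B & V) is to solve a block linear system in the Hessian; a minimal dense sketch, with hypothetical names:

```python
import numpy as np

def eq_newton_direction(H, g, A):
    # Solve the KKT system  [H  A^T] [v]   [-g]
    #                       [A   0 ] [w] = [ 0]
    # for the equality-constrained Newton direction v (which satisfies A v = 0)
    n, m = H.shape[0], A.shape[0]
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, np.zeros(m)])
    return np.linalg.solve(K, rhs)[:n]
```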
Quasi-Newton methods
If the Hessian is too expensive (or singular), a quasi-Newton method can be used: approximate ∇²f(x) with some H ≻ 0 that is cheaper to form and to invert, and repeat

    x^+ = x − t H^{-1} ∇f(x)
Davidon-Fletcher-Powell or DFP:
• Updates H, H^{-1} via rank-2 updates from previous iterations; cost is O(n^2) for these updates
• Since it is being stored, applying H^{-1} is simply O(n^2) flops
• Can be motivated by a Taylor series expansion

Broyden-Fletcher-Goldfarb-Shanno or BFGS:
• Came after DFP, but BFGS is now much more widely used
• Again, updates H, H^{-1} via rank-2 updates, but does so in a "dual" fashion to DFP; cost is still O(n^2)
• Also has a limited-memory version, L-BFGS: instead of letting updates propagate over all iterations, it only keeps updates from the last m iterations; storage is now O(mn) instead of O(n^2)
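A sketch of the BFGS rank-2 update of the inverse Hessian approximation, expanded into outer products so the cost is O(n^2) as stated above (the function name is mine; DFP applies the analogous update with the roles of H and H^{-1} swapped):

```python
import numpy as np

def bfgs_inverse_update(Hinv, s, y):
    # Hinv <- (I - rho s y^T) Hinv (I - rho y s^T) + rho s s^T,  rho = 1/(y^T s),
    # where s = x_new - x_old and y = grad_new - grad_old
    rho = 1.0 / (y @ s)          # requires the curvature condition y^T s > 0
    Hy = Hinv @ y                # O(n^2)
    return (Hinv
            - rho * np.outer(s, Hy)
            - rho * np.outer(Hy, s)
            + (rho**2 * (y @ Hy) + rho) * np.outer(s, s))

# The next quasi-Newton direction is then -bfgs_inverse_update(Hinv, s, y) @ grad_new
```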
References and further reading

• S. Boyd and L. Vandenberghe (2004), "Convex Optimization", Chapters 9 and 10