Lecture 14: Newton's Method
Ryan Tibshirani
Convex Optimization 10-725/36-725
Last time: dual correspondences

Given a function f : R^n → R, we define its conjugate f^* : R^n → R,

    f^*(y) = max_x  y^T x − f(x)

Relationship to duality:
Newton's method

Consider unconstrained, smooth convex optimization

    min_x  f(x)

where f is twice differentiable. Pure Newton's method repeats

    x^+ = x − (∇²f(x))^{-1} ∇f(x)

i.e., it takes a step of size t = 1 in the Newton direction v = −(∇²f(x))^{-1} ∇f(x)
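As a minimal sketch (not from the slides; grad_f and hess_f stand in for the problem's gradient and Hessian oracles), the update is best computed as a linear solve rather than by forming the inverse Hessian:

```python
import numpy as np

def pure_newton_step(x, grad_f, hess_f):
    # x^+ = x - (hess f(x))^{-1} grad f(x), via a linear solve
    # instead of an explicit matrix inverse
    return x - np.linalg.solve(hess_f(x), grad_f(x))
```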
Newton's method interpretation

The Newton step minimizes the local quadratic approximation of f: with v = −(∇²f(x))^{-1} ∇f(x),

    x^+ = x + v = argmin_y  f(x) + ∇f(x)^T (y − x) + (1/2) (y − x)^T ∇²f(x) (y − x)

Compare gradient descent, which minimizes the same kind of approximation but with ∇²f(x) replaced by (1/t) I
For f(x) = (10x_1^2 + x_2^2)/2 + 5 log(1 + e^{−x_1−x_2}), compare gradient descent (black) to Newton's method (blue), where both take steps of roughly same length

[Figure: contour plot of f over roughly [−20, 20] × [−20, 20], with the gradient descent (black) and Newton's method (blue) iterates overlaid]
Today:
• Interpretations and properties
• Backtracking line search
• Convergence analysis
• Equality-constrained Newton
• Quasi-Newton methods
Linearized optimality condition
Alternative interpretation of the Newton step at x: we seek a direction v so that ∇f(x + v) = 0. Now consider linearizing this optimality condition,

    0 = ∇f(x + v) ≈ ∇f(x) + ∇²f(x) v

and solving for v, which again yields v = −(∇²f(x))^{-1} ∇f(x)
History: work of Newton (1685) and Raphson (1690) originally focused on finding roots of polynomials. Simpson (1740) applied this idea to general nonlinear equations, and to minimization by setting the gradient to zero

[Figure 9.18 from B & V, page 486: the solid curve is the derivative f' of the function f; the dashed line is the linear approximation of f' at x; the Newton step Δx_nt is the difference between the root of this linear approximation and the point x]

For us, Δx_nt = v
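A tiny sketch of this root-finding view, using a test function of my own choosing (f(x) = log(1 + e^x) − 0.3x, so f'(x) = σ(x) − 0.3): each iteration jumps to the root of the linear approximation of f'.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
fprime  = lambda x: sigmoid(x) - 0.3                 # f'(x)
fsecond = lambda x: sigmoid(x) * (1.0 - sigmoid(x))  # f''(x)

x = 0.0
for _ in range(6):
    v = -fprime(x) / fsecond(x)   # root of the linearization f'(x) + f''(x) v = 0
    x = x + v
print(x, fprime(x))               # x is near log(0.3/0.7), and f'(x) is near 0
```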
Affine invariance of Newton’s method
Important property of Newton's method: affine invariance. Given f and nonsingular A ∈ R^{n×n}, let x = Ay and g(y) = f(Ay). Newton steps on g are

    y^+ = y − (∇²g(y))^{-1} ∇g(y)
        = y − (A^T ∇²f(Ay) A)^{-1} A^T ∇f(Ay)
        = y − A^{-1} (∇²f(Ay))^{-1} ∇f(Ay)

Hence

    Ay^+ = Ay − (∇²f(Ay))^{-1} ∇f(Ay)

i.e.,

    x^+ = x − (∇²f(x))^{-1} ∇f(x)

So progress is independent of problem scaling; recall that this is not true of gradient descent
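A quick numerical check of affine invariance (a sketch; the test function f(x) = Σ e^{x_i} + ‖x‖^2/2 and the random nonsingular A are my own choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.standard_normal((n, n)) + n * np.eye(n)   # nonsingular, well conditioned

def grad_f(x):                     # f(x) = sum(exp(x)) + ||x||^2 / 2
    return np.exp(x) + x

def hess_f(x):
    return np.diag(np.exp(x) + 1.0)

def newton_step(grad, hess, z):
    return z - np.linalg.solve(hess(z), grad(z))

y = rng.standard_normal(n)
x = A @ y

# Newton step on g(y) = f(Ay): grad g(y) = A^T grad f(Ay), hess g(y) = A^T hess f(Ay) A
y_plus = newton_step(lambda y: A.T @ grad_f(A @ y),
                     lambda y: A.T @ hess_f(A @ y) @ A, y)
x_plus = newton_step(grad_f, hess_f, x)

print(np.allclose(A @ y_plus, x_plus))   # True: the iterates agree under x = Ay
```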
Newton decrement
At a point x, we define the Newton decrement as

    λ(x) = ( ∇f(x)^T (∇²f(x))^{-1} ∇f(x) )^{1/2}

This relates to the difference between f(x) and the minimum of its quadratic approximation:

    f(x) − min_y { f(x) + ∇f(x)^T (y − x) + (1/2) (y − x)^T ∇²f(x) (y − x) }
        = f(x) − ( f(x) − (1/2) ∇f(x)^T (∇²f(x))^{-1} ∇f(x) )
        = (1/2) λ(x)^2

Therefore can think of λ(x)^2 / 2 as an approximate bound on the suboptimality gap f(x) − f^⋆
Another interpretation of the Newton decrement: if the Newton direction is v = −(∇²f(x))^{-1} ∇f(x), then

    λ(x) = ( v^T ∇²f(x) v )^{1/2} = ‖v‖_{∇²f(x)}

i.e., λ(x) is the length of the Newton step in the norm defined by the Hessian ∇²f(x)

Note that the Newton decrement, like the Newton steps, is affine invariant; i.e., if we define g(y) = f(Ay) for nonsingular A, then λ_g(y) matches λ_f(x) at x = Ay
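A short sketch verifying that the two expressions for λ(x) agree (the gradient and Hessian values here are arbitrary test inputs, not from the slides):

```python
import numpy as np

def newton_decrement(g, H):
    # lambda(x) computed two ways: as (g^T H^{-1} g)^{1/2}, and as the
    # Hessian-norm length of the Newton direction v = -H^{-1} g
    v = np.linalg.solve(H, -g)
    lam_quadform = np.sqrt(g @ np.linalg.solve(H, g))
    lam_hessnorm = np.sqrt(v @ H @ v)
    return lam_quadform, lam_hessnorm

g = np.array([1.0, -2.0])
H = np.array([[3.0, 1.0], [1.0, 2.0]])
print(newton_decrement(g, H))   # the two values coincide
```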
Backtracking line search
We have seen pure Newton's method, which need not converge. In practice, we instead use damped Newton's method (i.e., Newton's method), which repeats

    x^+ = x − t (∇²f(x))^{-1} ∇f(x)

Note that the pure method uses t = 1. Step sizes here are typically chosen by backtracking search, with parameters 0 < α ≤ 1/2, 0 < β < 1: at each iteration we start with t = 1, and while

    f(x + tv) > f(x) + α t ∇f(x)^T v

we shrink t = βt, then take the update x^+ = x + tv
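A minimal sketch of damped Newton with backtracking, following the update above (the function name and the stopping rule λ^2/2 ≤ tol are my own choices):

```python
import numpy as np

def damped_newton(f, grad, hess, x0, alpha=0.25, beta=0.5, tol=1e-10, max_iter=100):
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        g = grad(x)
        v = np.linalg.solve(hess(x), -g)   # Newton direction
        lam2 = -g @ v                      # squared Newton decrement g^T H^{-1} g
        if lam2 / 2.0 <= tol:              # stop when the estimated gap is small
            break
        t = 1.0                            # backtracking: start at the pure step
        while f(x + t * v) > f(x) + alpha * t * (g @ v):
            t *= beta
        x = x + t * v
    return x
```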
Example: logistic regression
Logistic regression example, with n = 500, p = 100: we compare
gradient descent and Newton’s method, both with backtracking
[Figure: f − fstar (log scale, 1e−13 to 1e+03) versus iteration number (0 to 70), for gradient descent and Newton's method]
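A sketch of what the Newton iteration for logistic regression looks like (synthetic data with the same dimensions; the experiment's actual data and its use of backtracking are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 100
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, size=n)

beta = np.zeros(p)
for _ in range(20):
    pr = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted probabilities
    g = X.T @ (pr - y)                     # gradient of the negative log-likelihood
    W = pr * (1.0 - pr)                    # Hessian weights
    H = X.T @ (W[:, None] * X)             # Hessian X^T W X: O(p^2 n) to form
    v = np.linalg.solve(H, -g)             # O(p^3) solve
    beta = beta + v                        # pure Newton step, for brevity
    if np.linalg.norm(g) < 1e-8:
        break
```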
Convergence analysis

In more detail, convergence analysis reveals γ > 0, 0 < η ≤ m^2/M such that convergence follows two stages (here m is the strong convexity constant of f and M is the Lipschitz constant of ∇²f)

• Damped phase: ‖∇f(x^{(k)})‖_2 ≥ η, and

    f(x^{(k+1)}) − f(x^{(k)}) ≤ −γ

• Pure phase: ‖∇f(x^{(k)})‖_2 < η, backtracking selects t = 1, and

    (M / (2m^2)) ‖∇f(x^{(k+1)})‖_2 ≤ ( (M / (2m^2)) ‖∇f(x^{(k)})‖_2 )^2

Note that once we enter the pure phase, we won't leave, because

    (2m^2 / M) ( (M / (2m^2)) η )^2 < η

when η ≤ m^2/M
Unraveling this result, what does it say? To reach f(x^{(k)}) − f^⋆ ≤ ε, we need at most

    ( f(x^{(0)}) − f^⋆ ) / γ + log log(ε_0 / ε)

iterations. The log log term grows so slowly in ε that it can be treated as essentially constant
Self-concordance
A scale-free analysis is possible for self-concordant functions: on R, a convex function f is called self-concordant if

    |f'''(x)| ≤ 2 f''(x)^{3/2}   for all x

and self-concordant on R^n if it is self-concordant along every line in its domain. E.g., f(x) = −log x is self-concordant, since f''(x) = 1/x^2 and |f'''(x)| = 2/x^3 = 2 f''(x)^{3/2}
Comparison to first-order methods
At a high level:
• Memory: each iteration of Newton's method requires O(n^2) storage (the n × n Hessian); each gradient iteration requires O(n) storage (the n-dimensional gradient)
• Computation: each Newton iteration requires O(n^3) flops (solving a dense n × n linear system); each gradient iteration requires O(n) flops (scaling/adding n-dimensional vectors)
• Backtracking: backtracking line search has roughly the same cost in both cases, using O(n) flops per inner backtracking step
• Conditioning: Newton's method is not affected by a problem's conditioning, but gradient descent can seriously degrade
• Fragility: Newton's method may be empirically more sensitive to bugs/numerical errors; gradient descent is more robust
Back to logistic regression example: now x-axis is parametrized in
terms of time taken per iteration
[Figure: f − fstar (log scale, 1e−13 to 1e+03) versus time, for gradient descent and Newton's method]

Each gradient descent step is O(p), but each Newton step is O(p^3)
Sparse, structured problems
When the inner linear systems (in the Hessian) can be solved efficiently and reliably, Newton's method can thrive

E.g., if ∇²f(x) is sparse and structured for all x, say banded, then both memory and computation are O(n) per Newton iteration
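A sketch of this with a tridiagonal (banded) Hessian, using a made-up separable-plus-chain objective (the function, ρ, and b below are my own choices): the Newton direction comes from a banded Cholesky solve, so each iteration is O(n) in time and memory.

```python
import numpy as np
from scipy.linalg import solveh_banded

# f(x) = sum_i [ log(1 + e^{x_i}) - b_i x_i ] + (rho/2) * sum_i (x_{i+1} - x_i)^2
rng = np.random.default_rng(0)
n, rho = 10_000, 1.0
b = rng.uniform(0.2, 0.8, size=n)

def grad(x):
    s = 1.0 / (1.0 + np.exp(-x))                # d/dx log(1 + e^x)
    g = s - b
    g[:-1] += rho * (x[:-1] - x[1:])            # chain-coupling terms
    g[1:]  += rho * (x[1:] - x[:-1])
    return g

def hess_banded(x):
    s = 1.0 / (1.0 + np.exp(-x))
    diag = s * (1.0 - s) + rho * np.r_[1.0, 2.0 * np.ones(n - 2), 1.0]
    ab = np.zeros((2, n))                       # upper banded storage for solveh_banded
    ab[0, 1:] = -rho                            # superdiagonal
    ab[1, :] = diag                             # main diagonal
    return ab

x = np.zeros(n)
for _ in range(20):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    v = solveh_banded(hess_banded(x), -g)       # O(n) banded solve for the Newton direction
    x = x + v                                   # pure Newton step, for brevity
```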
Equality-constrained Newton’s method
Now consider the equality-constrained problem

    min_x  f(x)  subject to  Ax = b

Several options:
• Eliminating equality constraints: write x = Fy + x_0, where F spans the null space of A, and Ax_0 = b. Solve in terms of y
• Deriving the dual: can check that the Lagrange dual function is −f^*(−A^T v) − b^T v, and strong duality holds. With luck, we can express x^⋆ in terms of v^⋆
• Equality-constrained Newton: in many cases, this is the most straightforward option
In equality-constrained Newton's method, we start with x^{(0)} such that Ax^{(0)} = b. Then we repeat the updates x^+ = x + tv, where

    v = argmin_{Az=0}  ∇f(x)^T (z − x) + (1/2) (z − x)^T ∇²f(x) (z − x)

This keeps the iterates feasible, since Av = 0 implies Ax^+ = Ax + tAv = b
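One standard way to compute the constrained Newton direction at a feasible x (following the KKT approach in B & V) is to solve a block linear system in the Hessian; a minimal dense sketch, with hypothetical names:

```python
import numpy as np

def eq_newton_direction(H, g, A):
    # Solve the KKT system  [H  A^T] [v]   [-g]
    #                       [A   0 ] [w] = [ 0]
    # for the equality-constrained Newton direction v (which satisfies A v = 0)
    n, m = H.shape[0], A.shape[0]
    K = np.block([[H, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-g, np.zeros(m)])
    return np.linalg.solve(K, rhs)[:n]
```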
Quasi-Newton methods
If the Hessian is too expensive (or singular), a quasi-Newton method can be used: approximate ∇²f(x) with some H ≻ 0 that is cheaper to form and to invert, and repeat

    x^+ = x − t H^{-1} ∇f(x)
Davidon-Fletcher-Powell or DFP:
• Updates H, H^{-1} via rank-2 updates from previous iterations; cost is O(n^2) for these updates
• Since it is being stored, applying H^{-1} is simply O(n^2) flops
• Can be motivated by a Taylor series expansion

Broyden-Fletcher-Goldfarb-Shanno or BFGS:
• Came after DFP, but BFGS is now much more widely used
• Again, updates H, H^{-1} via rank-2 updates, but does so in a "dual" fashion to DFP; cost is still O(n^2)
• Also has a limited-memory version, L-BFGS: instead of letting updates propagate over all iterations, it only keeps updates from the last m iterations; storage is now O(mn) instead of O(n^2)
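A sketch of the BFGS rank-2 update of the inverse Hessian approximation, expanded into outer products so the cost is O(n^2) as stated above (the function name is mine; DFP applies the analogous update with the roles of H and H^{-1} swapped):

```python
import numpy as np

def bfgs_inverse_update(Hinv, s, y):
    # Hinv <- (I - rho s y^T) Hinv (I - rho y s^T) + rho s s^T,  rho = 1/(y^T s),
    # where s = x_new - x_old and y = grad_new - grad_old
    rho = 1.0 / (y @ s)          # requires the curvature condition y^T s > 0
    Hy = Hinv @ y                # O(n^2)
    return (Hinv
            - rho * np.outer(s, Hy)
            - rho * np.outer(Hy, s)
            + (rho**2 * (y @ Hy) + rho) * np.outer(s, s))

# The next quasi-Newton direction is then -bfgs_inverse_update(Hinv, s, y) @ grad_new
```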
References and further reading

• S. Boyd and L. Vandenberghe (2004), "Convex Optimization", Chapters 9 and 10