
Lecture 10. Some Algorithms to Solve Unconstrained Optimization Problems

Conditions for Local Maximizers/Minimizers
The Training Problem
Gradient Method
Newton’s Method
Conjugate Direction Methods

Conditions for Local Maximizers/Minimizers

Consider the problem of finding x ∈ D such that f(x) attains its maximum
(minimum), where f(x) is twice continuously differentiable.
This is an unconstrained optimization problem if an optimal solution x∗ is an
interior point of D, i.e. x∗ ∈ int D.
First-order necessary condition: if x∗ is a local maximizer (minimizer) of
f(x), then ∇f(x∗) = 0.
Second-order necessary condition: if x∗ is a local maximizer (minimizer) of
f(x), then the Hessian matrix H(x∗) of f(x) at x∗ is negative (positive)
semidefinite.
Sufficient condition: if ∇f(x∗) = 0 and the Hessian matrix H(x∗) is negative
(positive) definite, then f(x) attains a local maximum (minimum) at x∗.

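As a quick numerical check (a sketch added to these notes, not part of the
original slides), the conditions above can be verified for a concrete function;
the choice f(x1, x2) = x1² + x2² and the candidate point are purely illustrative.

```python
import numpy as np

def grad_f(x):
    # Gradient of f(x1, x2) = x1^2 + x2^2 (illustrative choice)
    return 2 * x

def hess_f(x):
    # Hessian of f is the constant matrix 2I
    return 2 * np.eye(2)

x_star = np.array([0.0, 0.0])           # candidate point
g = grad_f(x_star)
eigvals = np.linalg.eigvalsh(hess_f(x_star))

print("gradient:", g)                   # [0. 0.]  -> first-order condition holds
print("Hessian eigenvalues:", eigvals)  # all > 0  -> H(x*) positive definite
# ∇f(x*) = 0 and H(x*) positive definite, so x* is a local minimizer.
```
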
The Training Problem

The Machine Learner’s Job

(1) Get the labeled data (x^1, y^1), ..., (x^n, y^n).
(2) Choose a parametrization for the hypothesis: h_w(x).
(3) Choose a loss function: ℓ(h_w(x), y) ≥ 0.
(4) Solve the training problem:

    min_{w ∈ R^d}  (1/n) ∑_{i=1}^{n} ℓ(h_w(x^i), y^i) + λ R(w)

(5) Test and cross-validate. If this fails, go back a few steps.

The Training Problem

The general training problem:

    min_{w ∈ R^d}  (1/n) ∑_{i=1}^{n} ℓ(h_w(x^i), y^i) + λ R(w)

∑_{i=1}^{n} ℓ(h_w(x^i), y^i): goodness of fit
λ: controls the tradeoff between fit and complexity
λ R(w): penalizes complexity
R(w) = ∥w∥₂², ∥w∥₁, ∥w∥_p, ...

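To make the objective concrete, here is a minimal sketch (an illustration added
to these notes) of one instance of the training problem, with squared loss, the
linear hypothesis h_w(x) = wᵀx, and R(w) = ∥w∥₂²; the toy data and λ are made up.

```python
import numpy as np

def training_objective(w, X, y, lam):
    """(1/n) * sum of squared losses + lam * ||w||_2^2 (illustrative choices)."""
    residuals = X @ w - y                 # h_w(x^i) - y^i with h_w(x) = w^T x
    fit = np.mean(residuals ** 2)         # goodness of fit
    complexity = lam * np.dot(w, w)       # lambda * R(w), R(w) = ||w||_2^2
    return fit + complexity

# Toy data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)
print(training_objective(np.zeros(3), X, y, lam=0.1))
```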

Gradient Methods

The unconstrained problem: find x ∈ R^n such that f(x) attains its minimum,
where f(x) is a twice continuously differentiable function.
The level set of f(x): {x ∈ R^n : f(x) = c}.
The gradient of f at x^0: ∇f(x^0).
By Taylor’s theorem, we have

    f(x^0 − α∇f(x^0)) = f(x^0) − α∥∇f(x^0)∥² + o(α)

If ∇f(x^0) ≠ 0, then for sufficiently small α > 0 we obtain

    f(x^0 − α∇f(x^0)) < f(x^0)

Set x^1 := x^0 − α₀∇f(x^0); x^1 is an improvement over the point x^0.
α₀ ≥ 0 is the step size.

Gradient Methods

The steepest descent algorithm: this is a gradient algorithm in which the step
size α_k is chosen to minimize φ_k(α) = f(x^k − α∇f(x^k)).
1. Let x^0 be a starting point.
2. Assign k := 0.
3. Find ∇f(x^k). If ∇f(x^k) = 0, go to Step 7; otherwise go to the next step.
4. Find the step size α_k:

       α_k = arg min_{α ≥ 0} f(x^k − α∇f(x^k))

5. Set x^{k+1} = x^k − α_k ∇f(x^k).
6. Assign k := k + 1 and go back to Step 3.
7. Stop the algorithm and conclude that x^k is an optimal solution.
A minimal implementation of this procedure is sketched below.

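The following sketch (added to these notes) implements the algorithm above,
using scipy.optimize.minimize_scalar for the one-dimensional minimization in
Step 4; the bounded search interval [0, 10], tolerance, and iteration cap are
arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad_f, x0, tol=1e-8, max_iter=1000):
    """Steepest descent with an (approximate) exact line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:          # practical version of "∇f(x^k) = 0"
            break
        # Step 4: alpha_k = argmin_{alpha >= 0} f(x^k - alpha * ∇f(x^k))
        phi = lambda a: f(x - a * g)
        alpha = minimize_scalar(phi, bounds=(0.0, 10.0), method="bounded").x
        x = x - alpha * g                    # Step 5
    return x

# Example (a) from the Examples slide: f(x1, x2) = x1^2 + x2^2, x^0 = (1, 2)
f = lambda x: x[0]**2 + x[1]**2
grad = lambda x: np.array([2 * x[0], 2 * x[1]])
print(steepest_descent(f, grad, [1.0, 2.0]))   # approaches (0, 0)
```
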
Gradient Methods

Proposition
If {x^k}_{k=0}^∞ is a steepest descent sequence for a given function f(x) and
∇f(x^k) ≠ 0, then f(x^{k+1}) < f(x^k).

Practical stopping criteria

Let ε > 0 be a prespecified threshold; we stop the algorithm when
|f(x^{k+1}) − f(x^k)| < ε
∥x^{k+1} − x^k∥ < ε
|f(x^{k+1}) − f(x^k)| / |f(x^k)| < ε
∥x^{k+1} − x^k∥ / ∥x^k∥ < ε

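These criteria translate directly into code; the sketch below (added to these
notes) checks all four, and the small constant guarding against division by
zero is an implementation choice, not part of the slides.

```python
import numpy as np

def should_stop(x_prev, x_next, f_prev, f_next, eps=1e-6):
    """Return True if any of the practical stopping criteria is satisfied."""
    tiny = 1e-12   # guards the relative criteria against division by zero
    return (
        abs(f_next - f_prev) < eps
        or np.linalg.norm(x_next - x_prev) < eps
        or abs(f_next - f_prev) / max(abs(f_prev), tiny) < eps
        or np.linalg.norm(x_next - x_prev) / max(np.linalg.norm(x_prev), tiny) < eps
    )
```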

Gradient Methods

Examples
Solve these problems using the steepest descent algorithm:
a. f(x1, x2) = x1² + x2², starting from x^0 = (1, 2).
b. f(x1, x2) = x1²/5 + x2², starting from x^0 = (1, 2).
c. f(x1, x2) = x1 + 12x2 + 12x1² + x2² + 3, with starting point x^0 = (0, 0).
d. f(x) = 4x1² − 4x1x2 + 2x2², with starting point x^0 = (2, 3).
e. f(x) = x1² − 2x1x2 + 2x2² + 2x1, with starting point x^0 = (0, 0).
Example (b) is worked numerically in the sketch below.

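As a usage example, example (b) can be checked with the steepest_descent sketch
given after the algorithm slide above (that helper is an addition of these
notes, not lecture code):

```python
import numpy as np

# Example (b): f(x1, x2) = x1^2 / 5 + x2^2, starting from x^0 = (1, 2)
f_b = lambda x: x[0]**2 / 5 + x[1]**2
grad_b = lambda x: np.array([2 * x[0] / 5, 2 * x[1]])

print(steepest_descent(f_b, grad_b, [1.0, 2.0]))   # approaches (0, 0)
```

Because the two coordinates are scaled differently here, the iterates zigzag
toward (0, 0) instead of reaching it in a single exact-line-search step as in
example (a).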

The Method of Steepest Descent with a Quadratic Function

The quadratic function has the form

    f(x) = (1/2) xᵀQx − bᵀx,

where Q ∈ R^{n×n} is a symmetric positive definite matrix, b ∈ R^n, and x ∈ R^n.

    ∇f(x) = Qx − b

Writing d^k = ∇f(x^k), the iteration is

    x^{k+1} = x^k − α_k d^k,

where the exact line-search step size has the closed form

    α_k = arg min_{α ≥ 0} f(x^k − α d^k) = (d^kᵀ d^k) / (d^kᵀ Q d^k)

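Because the step size has this closed form, the quadratic case needs no
line-search routine. Below is a minimal sketch (added to these notes); the
matrix Q, vector b, and starting point are illustrative choices.

```python
import numpy as np

def steepest_descent_quadratic(Q, b, x0, tol=1e-10, max_iter=1000):
    """Steepest descent for f(x) = 0.5 x^T Q x - b^T x with exact step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = Q @ x - b                       # d^k = ∇f(x^k)
        if np.linalg.norm(d) < tol:
            break
        alpha = (d @ d) / (d @ Q @ d)       # closed-form exact line search
        x = x - alpha * d
    return x

# Illustrative data: the minimizer solves Qx = b
Q = np.array([[4.0, 1.0], [1.0, 3.0]])      # symmetric positive definite
b = np.array([1.0, 2.0])
print(steepest_descent_quadratic(Q, b, np.zeros(2)))
print(np.linalg.solve(Q, b))                # reference solution
```
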
Newton’s Method (Newton-Raphson Method)

The gradient descent method is a first-order method: it relies on the gradient
to improve the solution.
A first-order method is intuitive, but sometimes too slow.
A second-order method relies on the Hessian to update the solution.
We will introduce one second-order method: Newton’s method.
Let’s start with Newton’s method for solving a nonlinear equation.

Newton’s method for a nonlinear equation

Let f : R → R be differentiable. We want to find x satisfying f(x) = 0.
For any x^k, let

    f_L(x) = f(x^k) + f′(x^k)(x − x^k)

be the linear approximation of f at x^k.
We move from x^k to x^{k+1} by setting

    f_L(x^{k+1}) = 0 ⇔ f(x^k) + f′(x^k)(x^{k+1} − x^k) = 0,

i.e., x^{k+1} = x^k − f(x^k)/f′(x^k).
We keep iterating until |f(x^k)| < ε or |x^{k+1} − x^k| < ε for some
predetermined ε > 0.

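A minimal sketch of this root-finding iteration (added to these notes); the
example equation x² − 2 = 0, the starting point, and the tolerance are
illustrative choices, and f′(x^k) is assumed to be nonzero along the way.

```python
def newton_root(f, f_prime, x0, eps=1e-10, max_iter=100):
    """Newton-Raphson iteration for solving f(x) = 0."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < eps:            # stopping criterion |f(x^k)| < eps
            break
        x_new = x - fx / f_prime(x)  # x^{k+1} = x^k - f(x^k)/f'(x^k)
        if abs(x_new - x) < eps:     # stopping criterion |x^{k+1} - x^k| < eps
            return x_new
        x = x_new
    return x

# Illustration: solve x^2 - 2 = 0 starting from x^0 = 1
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, 1.0))   # ≈ 1.414213562...
```
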
Newton’s method for single-variate NLPs

Let f be twice differentiable. We want to find x satisfying f′(x) = 0.
For any x^k, let

    f′_L(x) = f′(x^k) + f′′(x^k)(x − x^k)

be the linear approximation of f′ at x^k.
To approach x, we move from x^k to x^{k+1} by setting

    f′_L(x^{k+1}) = 0 ⇔ f′(x^k) + f′′(x^k)(x^{k+1} − x^k) = 0

We keep iterating until |f′(x^k)| < ε or |x^{k+1} − x^k| < ε for some
predetermined ε > 0.
Note that f′(x) = 0 does not guarantee a global minimum. That is why showing
that f is convex is useful!

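A minimal sketch of this iteration (added to these notes); the test function
f(x) = x⁴ − 3x² + x, its derivatives, and the starting point are illustrative
choices, and the update assumes f′′(x^k) ≠ 0.

```python
def newton_minimize_1d(f_prime, f_double_prime, x0, eps=1e-10, max_iter=100):
    """Newton's method for a single-variable problem: find x with f'(x) = 0."""
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)
        if abs(g) < eps:                    # |f'(x^k)| < eps
            break
        x_new = x - g / f_double_prime(x)   # assumes f''(x^k) != 0
        if abs(x_new - x) < eps:            # |x^{k+1} - x^k| < eps
            return x_new
        x = x_new
    return x

# Illustration: f(x) = x^4 - 3x^2 + x, so f'(x) = 4x^3 - 6x + 1, f''(x) = 12x^2 - 6
print(newton_minimize_1d(lambda x: 4 * x**3 - 6 * x + 1,
                         lambda x: 12 * x**2 - 6, x0=2.0))
```

From x^0 = 2 this should converge to the local minimizer near x ≈ 1.13, not the
global minimizer near x ≈ −1.30, which illustrates the remark above that
f′(x) = 0 alone does not guarantee a global minimum.
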
Newton’s method for single-variate NLPs

Let f be twice differentiable. We want to find x satisfying f′(x) = 0.
For any x^k, let

    f_Q(x) = f(x^k) + f′(x^k)(x − x^k) + (1/2) f′′(x^k)(x − x^k)²

be the quadratic approximation of f at x^k.
We move from x^k to x^{k+1} by moving to the global minimum of this quadratic
approximation (which exists when f′′(x^k) > 0):

    x^{k+1} = arg min_{x ∈ R} [ f(x^k) + f′(x^k)(x − x^k) + (1/2) f′′(x^k)(x − x^k)² ]

Differentiating this objective with respect to x, we have

    f′(x^k) + f′′(x^k)(x^{k+1} − x^k) = 0 ⇔ x^{k+1} = x^k − f′(x^k)/f′′(x^k)

Newton’s Method for multi-variate NLPs

The unconstrained problem: find x∗ ∈ R^n such that f(x) attains its minimum,
where f(x) is three times continuously differentiable.
The Taylor series expansion of f about the current point x^k is

    f(x) ≈ f(x^k) + (x − x^k)ᵀ d^k + (1/2)(x − x^k)ᵀ H(x^k)(x − x^k) ≜ q(x),

where d^k = ∇f(x^k). We have q(x^k) = f(x^k), ∇q(x^k) = ∇f(x^k) = d^k, and
∇²q(x^k) = ∇²f(x^k) = H(x^k). Then, instead of minimizing f(x), we minimize q(x).
If H(x^k) > 0 (positive definite), then q achieves its minimum at

    x^{k+1} := x^k − H(x^k)⁻¹ d^k

Example. Minimize f(x) = x1⁴ + 2x1²x2² + x2⁴ with starting point x^0 = (1, 1);
a numerical sketch of this iteration is given below.

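A minimal sketch of the multivariate Newton iteration, applied to the example
above (added to these notes); solving the linear system H(x^k)p = ∇f(x^k)
instead of forming the inverse is an implementation choice, and the gradient
and Hessian are hard-coded for this particular f.

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, tol=1e-8, max_iter=50):
    """Newton's method: x^{k+1} = x^k - H(x^k)^{-1} ∇f(x^k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess_f(x), g)   # Newton step: H(x^k) p = ∇f(x^k)
        x = x - p
    return x

# Example: f(x) = x1^4 + 2 x1^2 x2^2 + x2^4, x^0 = (1, 1)
grad = lambda x: np.array([4 * x[0]**3 + 4 * x[0] * x[1]**2,
                           4 * x[0]**2 * x[1] + 4 * x[1]**3])
hess = lambda x: np.array([[12 * x[0]**2 + 4 * x[1]**2, 8 * x[0] * x[1]],
                           [8 * x[0] * x[1], 4 * x[0]**2 + 12 * x[1]**2]])
print(newton_method(grad, hess, [1.0, 1.0]))   # moves toward the minimizer (0, 0)
```
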
Newton’s Method

Remarks on Newton’s method:
Newton’s method does not have the step-size issue.
In many cases it is faster.
For a quadratic function, Newton’s method finds an optimal solution in one
iteration.
It may fail to converge for some functions.
More issues in general:
convergence guarantees,
convergence speed,
non-differentiable functions,
constrained optimization.
Example. Minimize f(x) = x1² + 2x2³ with starting point x^0 = (6, 6); see the
sketch below.

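As a usage example (reusing the newton_method sketch from the previous slide,
which is an addition of these notes, not lecture code), running Newton’s method
on this last example shows that the iterates tend to the stationary point
(0, 0), which is not a minimizer of f.

```python
import numpy as np

# Example: f(x) = x1^2 + 2 x2^3, starting from x^0 = (6, 6)
grad = lambda x: np.array([2 * x[0], 6 * x[1]**2])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 12 * x[1]]])

print(newton_method(grad, hess, [6.0, 6.0]))   # tends to (0, 0)
# (0, 0) is a stationary point but not a minimizer: f(0, x2) = 2*x2^3 < 0 for x2 < 0.
```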
