Lecture 5
November 6, 2017
Most algorithms for unconstrained optimization problems are what we call line search type algorithms.
Definition (line search type algorithm). A line search type algorithm generates iterates xk+1 = xk + αk pk , where pk is a search direction and αk > 0 is a step length chosen along pk .
Most algorithms we consider are inherently local, meaning that the search direction pk is only
based on the information at the current point xk , that is, f (xk ), ∇f (xk ), and ∇2 f (xk ).
Think of a near-sighted mountain climber: she is in a deep fog and can only check her barometer for the height and feel the steepness of the slope under her feet.
TMA947 / MMG621 – Nonlinear optimization Lecture 5
Vector pk is a descent direction at xk if f (xk + αpk ) < f (xk ) for all α ∈ (0, δ] for some δ > 0.
The steepest descent direction at xk is obtained by solving

minimize ∇f (xk )T p   over p ∈ Rn with ∥p∥ = 1.
Let Q ∈ Rn×n be an arbitrary symmetric, positive definite matrix. Then pk = −Q∇f (xk ) is a descent direction for f at xk , because

∇f (xk )T pk = −∇f (xk )T Q∇f (xk ) < 0

whenever ∇f (xk ) ≠ 0, by the positive definiteness of Q.
Examples:
– Steepest descent: Q = I,
– Newton’s method: Q = [∇2 f (xk )]−1 .
We will now derive Newton’s method. To do so, we need to assume that f ∈ C 2 . We also first
assume that ∇2 f (x) is positive definite. A second-order Taylor approximation is then:
f (xk + p) − f (xk ) ≈ ∇f (xk )T p + (1/2) pT ∇2 f (xk )p =: φxk (p).
We now try to minimize this approximation by setting the gradient of φxk (p) to zero:

∇φxk (p) = ∇f (xk ) + ∇2 f (xk )p = 0.

Choosing the vector fulfilling this, we obtain pk = −[∇2 f (xk )]−1 ∇f (xk ) as the search direction. When n = 1, we get that pk = −f ′ (xk )/f ′′ (xk ).
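As a minimal sketch of computing the Newton direction pk = −[∇2 f (xk )]−1 ∇f (xk ) numerically (NumPy assumed; the example function f (x) = x1^2 + 2 x2^2 and its derivatives are supplied by hand, not taken from the lecture):

```python
import numpy as np

def newton_direction(grad, hess):
    """Newton search direction: solve H p = -g instead of inverting H.

    grad: gradient of f at x_k; hess: Hessian of f at x_k, assumed
    symmetric positive definite so the linear system has a unique solution.
    """
    return np.linalg.solve(hess, -grad)

# Example: f(x) = x1^2 + 2*x2^2, so grad f = (2x1, 4x2), Hessian = diag(2, 4).
x = np.array([3.0, 1.0])
g = np.array([2 * x[0], 4 * x[1]])
H = np.diag([2.0, 4.0])
p = newton_direction(g, H)
# For this quadratic f, a single full Newton step x + p lands on the minimizer.
```

Since f is itself quadratic here, the second-order model is exact and one Newton step reaches the minimum at the origin.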
When the Hessian ∇2 f (xk ) is positive definite, this search direction is a descent direction. But when ∇2 f (xk ) is negative definite (it may also be non-invertible), the search direction is an ascent direction, meaning that Newton's method does not differentiate between minimization and maximization problems. The solution to this problem is to modify ∇2 f (xk ) by adding a diagonal matrix γI such that ∇2 f (xk ) + γI is positive definite (this can always be done, why?). This method is called the Levenberg–Marquardt modification. We thus take as search direction
pk = −[∇2 f (xk ) + γI]−1 ∇f (xk ).
Note that
– Steepest descent: γ = ∞,
– Newton’s method: γ = 0.
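A minimal Python sketch of the Levenberg–Marquardt modification: γ is increased until a Cholesky factorization of ∇2 f (xk ) + γI succeeds, which certifies positive definiteness. The doubling schedule and the starting value 10^−3 are illustrative choices, not part of the lecture:

```python
import numpy as np

def lm_direction(grad, hess, gamma=0.0):
    """Levenberg-Marquardt direction: p = -(H + gamma*I)^{-1} g.

    gamma is grown until H + gamma*I passes a Cholesky test, i.e. is
    positive definite; the schedule below is an illustrative choice.
    """
    n = hess.shape[0]
    while True:
        try:
            L = np.linalg.cholesky(hess + gamma * np.eye(n))
            break
        except np.linalg.LinAlgError:   # not positive definite yet
            gamma = max(2 * gamma, 1e-3)
    # Solve (H + gamma*I) p = -g using the Cholesky factor L L^T.
    y = np.linalg.solve(L, -grad)
    return np.linalg.solve(L.T, y)

# Indefinite Hessian: a plain Newton step could point uphill here,
# but the modified direction is guaranteed to be a descent direction.
g = np.array([1.0, 1.0])
H = np.diag([1.0, -2.0])
p = lm_direction(g, H)
```

The descent property can be checked directly: ∇f T p < 0 by the positive definiteness of the modified matrix.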
What happens when we cannot compute ∇2 f (xk )? We try to approximate the Hessian in some way by choosing an approximating matrix B k . From the Taylor expansion of ∇f around xk we have that

∇f (xk + p) ≈ ∇f (xk ) + ∇2 f (xk )p,

so a good approximation B k should satisfy B k (xk+1 − xk ) ≈ ∇f (xk+1 ) − ∇f (xk ).
Many different choices of B k exist, and they lead to what is called quasi-Newton methods.
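One standard choice of B k is the BFGS update; the formula below is not derived in these notes and is shown only as an illustrative Python sketch:

```python
import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the Hessian approximation B_k.

    s = x_{k+1} - x_k (step taken), y = grad f(x_{k+1}) - grad f(x_k)
    (change in gradient). The updated matrix satisfies the secant
    condition B_new @ s = y, mimicking the Taylor relation above.
    """
    Bs = B @ s
    return (B
            - np.outer(Bs, Bs) / (s @ Bs)    # remove old curvature along s
            + np.outer(y, y) / (y @ s))      # insert observed curvature

# One update starting from the identity approximation.
B = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([2.0, 1.0])
B1 = bfgs_update(B, s, y)
```

The key property to verify is the secant condition B1 s = y, which holds by construction.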
We now turn to the choice of the step length αk . Consider the one-dimensional problem of minimizing φ(α) := f (xk + αpk ) over α ≥ 0; its optimality conditions are

φ′ (α∗ ) ≥ 0,   α∗ φ′ (α∗ ) = 0,   α∗ ≥ 0.
These conditions state that if α∗ > 0, then φ′ (α∗ ) = 0, which implies that

∇f (xk + α∗ pk )T pk = 0,

i.e., the gradient at the new point is orthogonal to the search direction.
However, solving the line search problem to optimality is unnecessary; the optimal solution to the original problem lies elsewhere anyway. Examples of methods to choose the step length αk :
– Interpolation: Use f (xk ), ∇f (xk ), and ∇f (xk )T pk to approximate φ(α) = f (xk + αpk ) quadratically. Then minimize this approximation of φ analytically.
– Newton's method: Repeat improvements from a quadratic approximation: α := α − φ′ (α)/φ′′ (α).
– Golden section: Derivative-free method which shrinks an interval wherein a solution to φ′ (α) = 0 lies.
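The golden section method from the list above can be sketched in Python as follows (a minimal version for a unimodal φ on an interval [a, b]; the tolerance is an illustrative choice):

```python
def golden_section(phi, a, b, tol=1e-6):
    """Shrink [a, b] around a minimizer of a unimodal phi, derivative-free.

    Each iteration keeps the subinterval that must contain the minimizer;
    the golden ratio lets one of the two interior evaluations be reused.
    """
    r = (5 ** 0.5 - 1) / 2            # inverse golden ratio, about 0.618
    x1 = b - r * (b - a)
    x2 = a + r * (b - a)
    f1, f2 = phi(x1), phi(x2)
    while b - a > tol:
        if f1 < f2:                   # minimizer lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)
            f1 = phi(x1)
        else:                         # minimizer lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = phi(x2)
    return (a + b) / 2

# Example: phi(alpha) = (alpha - 0.3)^2 has its minimizer at 0.3.
alpha = golden_section(lambda a: (a - 0.3) ** 2, 0.0, 1.0)
```

Each iteration shrinks the interval by the constant factor 0.618, so the cost to reach a given tolerance is predictable in advance.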
We will often use what is denoted the Armijo rule. The idea is to choose a step length α which provides a sufficient decrease in f . We have that

f (xk + αpk ) − f (xk ) ≈ α∇f (xk )T pk

for very small values of α > 0, meaning that we predict that the objective function will decrease by α∇f (xk )T pk if we move a step length α in the direction of pk . Now this might be too optimistic, and we will therefore accept the step length if the actual decrease is at least a fraction µ (µ is small, typically µ ∈ [0.001, 0.01]) of the predicted decrease, i.e., we will accept α if

f (xk + αpk ) − f (xk ) ≤ µα∇f (xk )T pk ,

or equivalently, if

φ(α) − φ(0) ≤ µαφ′ (0).

We usually start with α = 1. If the condition is not fulfilled, then choose α := α/2 and repeat.
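The Armijo rule with halving can be sketched in a few lines of Python (φ and its slope at zero are passed in; the example minimizes f (x) = x^2 from xk = 1 with pk = −1, an illustrative choice):

```python
def armijo_step(phi, dphi0, mu=0.01):
    """Armijo backtracking: start at alpha = 1 and halve until
    phi(alpha) - phi(0) <= mu * alpha * phi'(0).

    phi(a) = f(x_k + a*p_k); dphi0 = grad f(x_k)^T p_k, which is
    negative for a descent direction, so the right-hand side is negative.
    """
    alpha = 1.0
    while phi(alpha) - phi(0.0) > mu * alpha * dphi0:
        alpha /= 2.0
    return alpha

# Example: f(x) = x^2 at x_k = 1 with p_k = -1, so phi(a) = (1 - a)^2
# and phi'(0) = -2. The full step alpha = 1 already satisfies the test.
phi = lambda a: (1.0 - a) ** 2
alpha = armijo_step(phi, dphi0=-2.0)
```

Termination is guaranteed for a descent direction: for small enough α the actual decrease approaches the predicted one, which exceeds the fraction µ of it.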
Figure 1: The interval (R) accepted by the Armijo step length rule
Convergence
In order to state a convergence result for the algorithm, we make an additional assumption for
the search directions. We need the directions pk to fulfill
−∇f (xk )T pk / (∥∇f (xk )∥ · ∥pk ∥) ≥ s1 ,   ∥pk ∥ ≥ s2 ∥∇f (xk )∥,   and   ∥pk ∥ ≤ M   (2)
for some s1 , s2 , M > 0, where the first inequality forces the angle between pk and −∇f (xk ) to stay between 0 and π/2, but not too close to π/2. The second inequality makes sure that the only case when pk can be zero is when the gradient is zero. These conditions guarantee a certain descent quality.
Theorem (convergence of unconstrained algorithm). Suppose f ∈ C 1 and that for the starting point x0 the level set {x ∈ Rn | f (x) ≤ f (x0 )} is bounded. Consider the iterative algorithm described above, and suppose that for all k, pk fulfills (2) and αk is chosen according to the Armijo rule. Then every limit point of the sequence {xk } is a stationary point, i.e., a point x with ∇f (x) = 0.
If we add the assumption that f is a convex function, then every limit point of {xk } is a globally optimal solution.
We cannot terminate the algorithm when ∇f (xk ) = 0, since this rarely happens in practice. We need to have some tolerance level ε > 0. Three examples of termination criteria are
– ∥∇f (xk )∥ ≤ ε,
– |f (xk ) − f (xk−1 )| ≤ ε,
– ∥xk − xk−1 ∥ ≤ ε.
Trust region methods use a quadratic approximation of the function around the current iterate xk , avoiding a line search and instead bounding the length of the search direction. Let
φxk (p) := f (xk ) + ∇f (xk )T p + (1/2) pT ∇2 f (xk )p.
Since this is a local approximation, we restrict it to a trust region in the neighborhood of xk , i.e., we trust the model in the region where ∥p∥ ≤ ∆k . We then solve the problem

minimize φxk (p)   subject to ∥p∥ ≤ ∆k ,

and let the solution be pk . Then we update our iterate as xk+1 = xk + pk . We also update the trust region parameter ∆k depending on the progress so far (the ratio of actual reduction to predicted reduction).
The method is robust and possesses strong convergence properties. More detailed information about trust region methods can be found in the book on pages 301–302.
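Solving the trust region subproblem exactly is itself nontrivial; one common cheap approximation, not covered in these notes, is the Cauchy point: the minimizer of the quadratic model along the steepest descent direction within the region. A Python sketch:

```python
import numpy as np

def cauchy_point(g, H, delta):
    """Cauchy point for the subproblem: minimize the quadratic model
    g^T p + (1/2) p^T H p along p = -t*g subject to ||p|| <= delta.

    If g^T H g <= 0 the model decreases all the way to the boundary;
    otherwise the unconstrained 1-D minimizer is clipped to the region.
    """
    gnorm = np.linalg.norm(g)
    gHg = g @ H @ g
    if gHg <= 0:
        tau = 1.0                                   # go to the boundary
    else:
        tau = min(1.0, gnorm ** 3 / (delta * gHg))  # clip 1-D minimizer
    return -tau * (delta / gnorm) * g

# With a large region the step is the unconstrained 1-D minimizer -g here.
g = np.array([2.0, 0.0])
H = np.eye(2)
p = cauchy_point(g, H, delta=10.0)
```

The Cauchy point already guarantees a sufficient model decrease, which is what the convergence theory of trust region methods builds on.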
In some cases the value of the objective function f (x) is given through some unknown simulation
procedure. This implies that we do not have a clear representation of the gradient of the objective
function. In some cases, we can perform numerical differentiation and approximate the partial
derivatives as, e.g.,
∂f (x)/∂xi ≈ (f (x + αei ) − f (x))/α,
where ei = (0, . . . , 0, 1, 0, . . . , 0)T is the i-th unit vector in Rn and α > 0 is small.
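The forward-difference approximation above can be sketched as follows (α = 10^−6 is an illustrative choice; too small a value amplifies rounding error):

```python
import numpy as np

def fd_gradient(f, x, alpha=1e-6):
    """Forward-difference gradient: component i is
    (f(x + alpha*e_i) - f(x)) / alpha, with e_i the i-th unit vector."""
    n = len(x)
    fx = f(x)                 # reuse the base value for all n components
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        g[i] = (f(x + alpha * e) - fx) / alpha
    return g

# Example: f(x) = x1^2 + 3*x2 at (1, 2); the exact gradient is (2, 3).
g = fd_gradient(lambda x: x[0] ** 2 + 3 * x[1], np.array([1.0, 2.0]))
```

Note the cost: one gradient estimate requires n + 1 evaluations of f, which matters when each evaluation is an expensive simulation.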
If the simulation is not accurate, we get bad derivative information. We can then use derivative-free methods instead. These try to build a model fˆ of the objective function f by evaluating the objective function at some specific test points, and then optimize the model fˆ instead of the function f .