Optimization Slides

Part 1

Examples of optimization
problems

49 Wolfgang Bangerth
What is an optimization problem?

Mathematically speaking:

Let X be a Banach space; let


f : X → R ∪ {+∞}
g : X → R^{n_e}
h : X → R^{n_i}
be functions on X, find x ∈ X so that

f ( x) → min!
g ( x) = 0
h ( x) ≥ 0

Questions: Under what conditions on X, f, g, h can we
guarantee that (i) a solution exists; (ii) the solution is unique;
(iii) the solution is stable?
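The abstract statement above can be made concrete with a small sketch (a made-up toy problem, not one from the slides): minimize f(x) = x₁² + x₂² subject to g(x) = x₁ + x₂ − 1 = 0 and h(x) = x₁ ≥ 0, whose minimizer is x* = (1/2, 1/2). The brute-force search below is only meant to illustrate the roles of f, g, and h.

```python
# Toy instance of: f(x) -> min!, g(x) = 0, h(x) >= 0.
# f(x) = x1^2 + x2^2, g(x) = x1 + x2 - 1, h(x) = x1.
# The equality constraint is parameterized as x = (t, 1 - t) with t >= 0,
# which satisfies g and h by construction; the analytic minimizer is t = 1/2.

def f(x1, x2):
    return x1**2 + x2**2

def solve_by_search(n=10001):
    best_t, best_f = 0.0, float("inf")
    for i in range(n):
        t = 2.0 * i / (n - 1)          # sample t in [0, 2]
        val = f(t, 1.0 - t)
        if val < best_f:
            best_t, best_f = t, val
    return best_t, best_f

t_star, f_star = solve_by_search()     # t_star = 0.5, f_star = 0.5
```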

50 Wolfgang Bangerth
What is an optimization problem?
In practice:

x={u,y} is a set of design and auxiliary variables that
completely describe a physical, chemical, or
economic model;

f(x) is an objective function with which we measure how
good a design is;

g(x) describes relationships that have to be met exactly
(for example the relationship between y and u)

h(x) describes conditions that must not be exceeded

Then find me that x for which


f ( x) → min!
g ( x) = 0
h( x) ≥ 0
Question: How do I find this x?
51 Wolfgang Bangerth
What is an optimization problem?
Optimization problems are often subdivided into classes:

Linear vs. Nonlinear


Convex vs. Nonconvex
Unconstrained vs. Constrained
Smooth vs. Nonsmooth
With derivatives vs. Derivative-free
Continuous vs. Discrete
Algebraic vs. ODE/PDE

Depending on which class an actual problem falls into, there are
different classes of algorithms.

52 Wolfgang Bangerth
Examples

Linear and nonlinear functions f(x)


on a domain bounded by linear inequalities

53 Wolfgang Bangerth
Examples

Strictly convex, convex, and nonconvex functions f(x)


54 Wolfgang Bangerth
Examples

Another non-convex function with many (local) optima.


We may want to find the one global optimum.
55 Wolfgang Bangerth
Examples

Optima in the presence of (nonsmooth) constraints.

56 Wolfgang Bangerth
Examples

Smooth and non-smooth nonlinear functions.

57 Wolfgang Bangerth
Applications: The drag coefficient of a car

Mathematical description:
x={u,y}: u are the design parameters (e.g. the shape of the car)
y is the flow field around the car
f(x): the drag force that results from the flow field
g(x)=y-q(u)=0:
constraints that come from the fact that there is a flow
field y=q(u) for each design. y may, for example, satisfy
the Navier-Stokes equations
58 Wolfgang Bangerth
Applications: The drag coefficient of a car
Inequality constraints:
(expected sales price – profit margin) - cost(u) ≥ 0

volume(u) – volume(me, my wife, and her bags) ≥ 0

material stiffness * safety factor − max(forces exerted by y on the frame) ≥ 0

legal margins(u) ≥ 0
59 Wolfgang Bangerth
Applications: The drag coefficient of a car
Analysis:
linearity: f(x) may be linear
g(x) is certainly nonlinear (Navier-Stokes equations)
h(x) may be nonlinear

convexity: ??

constrained: yes

smooth: f(x) yes


g(x) yes
h(x) some yes, some no

derivatives: available, but probably hard to compute in practice

continuous: yes, not discrete

ODE/PDE: yes, not just algebraic


60 Wolfgang Bangerth
Applications: The drag coefficient of a car
Remark:

In the formulation as shown, the objective function was of the form

f(x) = c_d(y)

In practice, one is often willing to trade efficiency for cost, i.e. we are
willing to accept a slightly higher drag coefficient if the cost is smaller.
This leads to objective functions of the form

f(x) = c_d(y) + a·cost(u)

or

f(x) = c_d(y) + a·[cost(u)]²

61 Wolfgang Bangerth
Applications: Optimal oil production strategies
[Figures: permeability field and oil saturation]

Mathematical description:
x={u,y}: u are the pumping rates at injection/production wells
y is the flow field (pressures/velocities)
f(x): the cost of production and injection minus sales price of
oil integrated over lifetime of reservoir (or -NPV)
g(x)=y-q(u)=0:
constraints that come from the fact that there is a flow
field y=q(u) for each u. y may, for example, satisfy
the multiphase porous media flow equations
62 Wolfgang Bangerth
Applications: Optimal oil production strategies

Inequality constraints h(x)≥0:

u_i^max − u_i ≥ 0 (for all wells i):
Pumps have a maximal pumping rate/pressure

produced_oil(T)/available_oil(0) – c ≥ 0:
Legislative requirement to produce at least
a certain fraction

c - water_cut(t) ≥ 0 (for all times t):


It is inefficient to produce too much water

pressure – d ≥ 0 (for all times and locations):


Keeps the reservoir from collapsing

63 Wolfgang Bangerth
Applications: Optimal oil production strategies
Analysis:
linearity: f(x) is nonlinear
g(x) is certainly nonlinear
h(x) may be nonlinear

convexity: no

constrained: yes

smooth: f(x) yes


g(x) yes
h(x) yes

derivatives: available, but probably hard to compute in practice

continuous: yes, not discrete

ODE/PDE: yes, not just algebraic


64 Wolfgang Bangerth
Applications: Switching lights at an intersection

Mathematical description:
x={T, t_i^1, t_i^2}: round-trip time T for the stop light system,
switch-green and switch-red times for all lights i
f(x): number of cars that can pass the intersection per
hour;
Note: unknown as a function, but we can measure it
65 Wolfgang Bangerth
Applications: Switching lights at an intersection

Inequality constraints h(x)≥0:

300 – T ≥ 0:
No more than 5 minutes of round-trip time, so that people
don't have to wait for too long

t_i^2 − t_i^1 − 5 ≥ 0 (for all lights i):
At least 5 seconds of green for everyone

t_{i+1}^1 − t_i^2 − 5 ≥ 0:
At least 5 seconds of all-red between different greens

66 Wolfgang Bangerth
Applications: Switching lights at an intersection
Analysis:

linearity: f(x) ??
h(x) is linear

convexity: ??

constrained: yes

smooth: f(x) ??
h(x) yes

derivatives: not available

continuous: yes, not discrete

ODE/PDE: no

67 Wolfgang Bangerth
Applications: Trajectory planning

Mathematical description:
x={y(t),u(t)}: position of spacecraft and thrust vector at time t

f(x) = ∫₀ᵀ |u(t)| dt    minimize fuel consumption

m ÿ(t) − u(t) = 0    Newton's law

|y(t)| − d₀ ≥ 0    Do not get too close to the sun
u_max − |u(t)| ≥ 0    Only limited thrust available
68 Wolfgang Bangerth
Applications: Trajectory planning
Analysis:

linearity: f(x) is nonlinear


g(x) is linear
h(x) is nonlinear

convexity: no
constrained: yes
smooth: yes, here
derivatives: computable
continuous: yes, not discrete

ODE/PDE: yes

Note: Trajectory planning problems are often called optimal


control.

69 Wolfgang Bangerth
Applications: Data fitting 1

Mathematical description:
x={a,b}: parameters for the model y(t) = (1/a) log cosh(a·b·t)
f(x) = 1/N ∑_i |y_i − y(t_i)|²:
mean square difference between predicted value
and actual measurement

70 Wolfgang Bangerth
Applications: Data fitting 1
Analysis:
linearity: f(x) is nonlinear

convexity: ?? (probably yes)

constrained: no

smooth: yes

derivatives: available, and easy to compute in practice

continuous: yes, not discrete

ODE/PDE: no, algebraic

71 Wolfgang Bangerth
Applications: Data fitting 2

Mathematical description:
x={a,b}: parameters for the model y(t) = a·t + b
f(x) = 1/N ∑_i |y_i − y(t_i)|²:
mean square difference between
predicted value and actual measurement

72 Wolfgang Bangerth
Applications: Data fitting 2
Analysis:
linearity: f(x) is quadratic

Convexity: yes

constrained: no

smooth: yes

derivatives: available, and easy to compute in practice

continuous: yes, not discrete

ODE/PDE: no, algebraic

Note: Quadratic optimization problems (even with linear
constraints) are easy to solve!
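The note above can be illustrated directly: for the quadratic objective of this data fitting problem, setting the gradient to zero gives a 2×2 linear system (the normal equations). The data below are a sketch, assuming synthetic noise-free samples of y = 2t + 1:

```python
# Normal equations for the model y(t) = a·t + b: setting df/da = df/db = 0
# yields a 2x2 linear system that is solved in closed form.

def fit_line(ts, ys):
    n = len(ts)
    st, sy = sum(ts), sum(ys)
    stt = sum(t * t for t in ts)
    sty = sum(t * y for t, y in zip(ts, ys))
    det = stt * n - st * st
    a = (sty * n - st * sy) / det
    b = (stt * sy - st * sty) / det
    return a, b

ts = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * t + 1.0 for t in ts]
a, b = fit_line(ts, ys)        # recovers a = 2, b = 1
```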

73 Wolfgang Bangerth
Applications: Data fitting 3

Mathematical description:
x={a,b}: parameters for the model y(t) = a·t + b
f(x)=1/N ∑i |yi-y(ti)|:
mean absolute difference between predicted
value and actual measurement

74 Wolfgang Bangerth
Applications: Data fitting 3
Analysis:
linearity: f(x) is nonlinear

Convexity: yes

constrained: no

smooth: no!

derivatives: not differentiable

continuous: yes, not discrete

ODE/PDE: no, algebraic

Note: Non-smooth problems are really hard to solve!

75 Wolfgang Bangerth
Applications: Data fitting 3, revisited

Mathematical description:
x={a, b, s_i}: parameters for the model y(t) = a·t + b,
plus "slack" variables s_i
f(x) = 1/N ∑_i s_i → min!
s_i − |y_i − y(t_i)| ≥ 0

76 Wolfgang Bangerth
Applications: Data fitting 3, revisited
Analysis:
linearity: f(x) is linear, h(x) is not linear

Convexity: yes

constrained: yes

smooth: no!

derivatives: not differentiable

continuous: yes, not discrete

ODE/PDE: no, algebraic

Note: Non-smooth problems are really hard to solve!

77 Wolfgang Bangerth
Applications: Data fitting 3, re-revisited

Mathematical description:
x={a, b, s_i}: parameters for the model y(t) = a·t + b,
plus "slack" variables s_i
f(x) = 1/N ∑_i s_i → min!
s_i − (y_i − y(t_i)) ≥ 0
s_i + (y_i − y(t_i)) ≥ 0
78 Wolfgang Bangerth
Applications: Data fitting 3, re-revisited
Analysis:
linearity: f(x) is linear, h(x) is now also linear

Convexity: yes

constrained: yes

smooth: yes

derivatives: yes

continuous: yes, not discrete

ODE/PDE: no, algebraic

Note: Linear problems with linear constraints are simple to solve!
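A small sanity check of the slack reformulation (a toy verification with invented data, not an LP solver): for any fixed (a, b), the smallest feasible slacks are s_i = |y_i − y(t_i)|, so the reformulated linear problem has the same optimal value as the original nonsmooth one.

```python
# For fixed (a, b) the smallest feasible slack is s_i = |y_i - y(t_i)|,
# since s_i >= r_i and s_i >= -r_i imply s_i >= max(r_i, -r_i) = |r_i|.
# Hence both formulations agree; checked on a grid of parameters.

def l1_objective(a, b, ts, ys):
    return sum(abs(y - (a * t + b)) for t, y in zip(ts, ys)) / len(ts)

def slack_objective(a, b, ts, ys):
    r = [y - (a * t + b) for t, y in zip(ts, ys)]
    return sum(max(ri, -ri) for ri in r) / len(ts)

ts = [0.0, 1.0, 2.0]
ys = [1.0, 2.9, 5.2]
vals_match = all(
    abs(l1_objective(a / 2, b / 2, ts, ys) - slack_objective(a / 2, b / 2, ts, ys)) < 1e-12
    for a in range(-4, 5) for b in range(-4, 5)
)
```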

79 Wolfgang Bangerth
Applications: Traveling salesman

Task: Find the shortest tour through N cities with mutual
distances d_ij.

(Here: the 15 biggest cities of Germany; there are
43,589,145,600 possible tours through all these cities.)

Mathematical description:
x = {c_i}: the index of the i-th city on our trip, i=1...N
f(x) = ∑_i d_{c_i c_{i+1}}

c_i ≠ c_j for i ≠ j: no city is visited twice (alternatively: |c_i − c_j| ≥ 1)
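For a tiny instance, the objective above can be evaluated by brute force (made-up distances for N=4 cities; this exhaustive enumeration is exactly what becomes infeasible at N=15):

```python
# Brute-force evaluation of f(x) = sum_i d_{c_i c_{i+1}} over all closed
# tours of N=4 cities with invented symmetric distances.
from itertools import permutations

d = {(0, 1): 2.0, (0, 2): 9.0, (0, 3): 6.0,
     (1, 2): 4.0, (1, 3): 3.0, (2, 3): 8.0}

def dist(i, j):
    return d[(i, j)] if (i, j) in d else d[(j, i)]

def tour_length(tour):
    # Closed tour: return to the starting city at the end.
    return sum(dist(tour[k], tour[(k + 1) % len(tour)])
               for k in range(len(tour)))

best = min(permutations(range(4)), key=tour_length)
best_len = tour_length(best)       # shortest closed tour: length 20.0
```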

80 Wolfgang Bangerth
Applications: Traveling salesman
Analysis:
linearity: f(x) is linear, h(x) is nonlinear

Convexity: meaningless

constrained: yes

smooth: meaningless

derivatives: meaningless
continuous: no — the problem is discrete: x ∈ X ⊂ {1, 2, ..., N}^N

ODE/PDE: no, algebraic

Note: Integer problems (combinatorial problems) are often
exceedingly complicated to solve!

81 Wolfgang Bangerth
Part 2

Minima, minimizers,
sufficient and necessary
conditions

82 Wolfgang Bangerth
Part 3

Metrics of algorithmic
complexity

83 Wolfgang Bangerth
Outline of optimization algorithms
All algorithms to find minima of f(x) do so iteratively:

- start at a point x_0
- for k=1,2,...:
  . compute an update direction p_k
  . compute a step length α_k
  . set x_k ← x_{k−1} + α_k p_k
  . set k ← k+1

84 Wolfgang Bangerth
Outline of optimization algorithms
All algorithms to find minima of f(x) do so iteratively:

- start at a point x_0
- for k=1,2,...:
  . compute an update direction p_k
  . compute a step length α_k
  . set x_k ← x_{k−1} + α_k p_k
  . set k ← k+1

Questions:

- If x* is the minimizer that we are seeking, does x_k → x*?
- How many iterations does it take until ‖x_k − x*‖ ≤ ε?
- How expensive is every iteration?
85 Wolfgang Bangerth
How expensive is every iteration?
The cost of optimization algorithms is dominated by evaluating
f(x), g(x), h(x) and derivatives:


Traffic light example: Evaluating f(x) requires us to sit at an
intersection for an hour, counting cars

Designing air foils: Testing an improved wing design in a
wind tunnel costs millions of dollars.

86 Wolfgang Bangerth
How expensive is every iteration?
Example: Boeing wing design

Boeing 767 (1980s): 50+ wing designs tested in wind tunnel
Boeing 777 (1990s): 18 wing designs tested in wind tunnel
Boeing 787 (2000s): 10 wing designs tested in wind tunnel

Planes today are 30% more efficient than those developed in


the 1970s. Optimization in the wind tunnel and in silico made
that happen but is very expensive.
87 Wolfgang Bangerth
How expensive is every iteration?
Practical algorithms:

To determine the search direction p_k:

The gradient (steepest descent) method requires 1 evaluation
of ∇f(·) per iteration

Newton's method requires 1 evaluation of ∇f(·) and
1 evaluation of ∇²f(·) per iteration

If derivatives cannot be computed exactly, they can be
approximated by several evaluations of f(·) and ∇f(·)

To determine the step length α_k:

Both the gradient and Newton methods typically require several
evaluations of f(·) and potentially ∇f(·) per iteration.

88 Wolfgang Bangerth
How many iterations do we need?
Question: Given a sequence x_k → x* (for which we know
that ‖x_k − x*‖ → 0), can we determine exactly how fast the error
goes to zero?

[Plot: ‖x_k − x*‖ vs. k]

89 Wolfgang Bangerth
How many iterations do we need?
Definition: We say that a sequence x_k → x* is of order s if
‖x_k − x*‖ ≤ C ‖x_{k−1} − x*‖^s
A sequence of numbers a_k → 0 is called of order s if
|a_k| ≤ C |a_{k−1}|^s
C is called the asymptotic constant. We call C|a_{k−1}|^{s−1} the gain factor.

Specifically:
If s=1, the sequence is called linearly convergent.
Note: Convergence requires C<1. In a semi-logarithmic plot,
linearly convergent sequences are straight lines.
If s=2, we call the sequence quadratically convergent.
If 1<s<2, we call the sequence superlinearly convergent.
90 Wolfgang Bangerth
How many iterations do we need?
Example: The sequence of numbers
ak = 1, 0.9, 0.81, 0.729, 0.6561, ...
is linearly convergent because
|a_k| ≤ C |a_{k−1}|^s
with s=1, C=0.9.

Remark 1: Linearly convergent sequences can converge very


slowly if C is close to 1.

Remark 2: Linear convergence is considered slow. We will want


to avoid linearly convergent algorithms.

91 Wolfgang Bangerth
How many iterations do we need?
Example: The sequence of numbers
ak = 0.1, 0.03, 0.0027, 0.00002187, ...
is quadratically convergent because
|a_k| ≤ C |a_{k−1}|^s
with s=2, C=3.

Remark 1: Quadratically convergent sequences can converge
very slowly if C is large. For many algorithms we can show that
they converge quadratically if a_0 is small enough, since then
|a_1| ≤ C |a_0|² ≤ |a_0|
If a_0 is too large then the sequence may fail to converge since
|a_1| ≤ C |a_0|² ≥ |a_0|
Remark 2: Quadratic convergence is considered fast. We will
want to use quadratically convergent algorithms.
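The order s can also be estimated numerically from consecutive terms via s ≈ log|a_{k+1}/a_k| / log|a_k/a_{k−1}|; applied to the two example sequences above, this recovers s=1 and s=2 (a minimal sketch):

```python
# Estimating the convergence order s of a sequence a_k -> 0 from its
# last three terms: if |a_k| ~ C |a_{k-1}|^s, then
# s ~ log(a_{k+1}/a_k) / log(a_k/a_{k-1}).
import math

def estimated_order(seq):
    a0, a1, a2 = seq[-3], seq[-2], seq[-1]
    return math.log(a2 / a1) / math.log(a1 / a0)

linear = [0.9 ** k for k in range(10)]          # a_k = 0.9 a_{k-1}
quadratic = [0.1, 0.03, 0.0027, 2.187e-5]       # a_k = 3 a_{k-1}^2

s_lin = estimated_order(linear)                 # ~ 1
s_quad = estimated_order(quadratic)             # ~ 2
```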
92 Wolfgang Bangerth
How many iterations do we need?
Example: Compare linear and quadratic convergence

[Plot: ‖x_k − x*‖ vs. k]

Linear convergence: the gain factor C<1 is constant.

Quadratic convergence: the gain factor C|a_{k−1}| becomes better and better!

93 Wolfgang Bangerth
Metrics of algorithmic complexity
Summary:


Quadratic algorithms converge faster in the limit than
linear or superlinear algorithms

Algorithms that are better than linear will need to be
started close enough to the solution

Algorithms are best compared by counting the number of
function, gradient, or Hessian evaluations needed
to achieve a certain accuracy. This is generally a good
measure for the run-time of such algorithms.

94 Wolfgang Bangerth
Part 4

Smooth unconstrained
problems:
Line search algorithms
minimize f  x 

95 Wolfgang Bangerth
Smooth problems: Characterization of Optima

Problem: find the solution x* of

minimize_x f(x)

A strict local minimum x* must satisfy two conditions:

First order necessary condition: the gradient must vanish:
∇f(x*) = 0

Sufficient condition for a strict minimum:
spectrum(∇²f(x*)) > 0
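Both conditions can be checked numerically for a simple function, say f(x, y) = (x−1)² + 2y² with minimizer x* = (1, 0) (a hypothetical example chosen for illustration): the finite-difference gradient vanishes there and the constant Hessian diag(2, 4) has a positive spectrum.

```python
# Numerical check of the optimality conditions for f(x, y) = (x-1)^2 + 2y^2
# at its minimizer x* = (1, 0).

def f(x, y):
    return (x - 1.0) ** 2 + 2.0 * y ** 2

def grad_fd(x, y, h=1e-6):
    # Central finite differences for the gradient.
    return ((f(x + h, y) - f(x - h, y)) / (2.0 * h),
            (f(x, y + h) - f(x, y - h)) / (2.0 * h))

gx, gy = grad_fd(1.0, 0.0)          # first order condition: both ~ 0
hess_eigs = (2.0, 4.0)              # spectrum of the constant Hessian diag(2, 4)
```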
96 Wolfgang Bangerth
Basic Algorithm for Smooth Unconstrained Problems

Basic idea for the iterative solution x_k → x* of the problem

minimize f(x)

Generate a sequence x_k by
1. finding a search direction p_k
2. choosing a step length α_k

Then compute the update
x_{k+1} = x_k + α_k p_k
Iterate until we are satisfied.
97 Wolfgang Bangerth
Step 1: Choose search direction

Conditions for a useful search direction:

The objective function should be decreased in this direction:
p_k · ∇f(x_k) ≤ 0

The search direction should lead to the minimum as straight
as possible

[Figure: ∇f(x_k) and −∇f(x_k)]
98 Wolfgang Bangerth
Step 1: Choose search direction

Basic assumption: We can usually only expect to know the
objective function f locally at x_k.
That means that we can only evaluate

f(x_k), ∇f(x_k) = g_k, ∇²f(x_k) = H_k, ...

For a search direction, try to model f in the vicinity of x_k
by a Taylor series:

f(x_k + p_k) ≈ f(x_k) + g_kᵀ p_k + ½ p_kᵀ H_k p_k + ...
99 Wolfgang Bangerth
Step 1: Choose search direction

Goal: Approximate f(·) in the vicinity of x_k by a model

f(x_k + p) ≈ m_k(p) = f_k + g_kᵀ p + ½ pᵀ H_k p + ...

with f(x_k) = f_k, ∇f(x_k) = g_k, ∇²f(x_k) = H_k, ...

Then: Choose the direction p_k that minimizes the model m_k(p)

100 Wolfgang Bangerth


Step 1: Choose search direction

Method 1 (Gradient method, Method of Steepest Descent):
the search direction is the minimizing direction of the linear model

f(x_k + p) ≈ f_k + g_kᵀ p = m_k(p)

p_k = −g_k = −∇f(x_k)

101 Wolfgang Bangerth


Step 1: Choose search direction
Method 2 (Newton's method):
the search direction is to the minimum of the quadratic model

m_k(p) = f_k + g_kᵀ p + ½ pᵀ H_k p

The minimum is characterized by

∂m_k(p)/∂p = g_k + H_k p = 0  ⇒  p_k = −H_k⁻¹ g_k
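For a quadratic f(x) = ½xᵀHx − bᵀx, the model is exact, so the Newton direction reaches the minimizer H⁻¹b in a single full step from any starting point. A minimal 2×2 sketch (matrix and data are invented):

```python
# One full Newton step p = -H^{-1} g on a quadratic lands exactly on the
# minimizer (here (1, 1)), regardless of the starting point.

def newton_step_2x2(H, g):
    # Solve H p = -g for symmetric 2x2 H = ((a, b), (b, c)).
    (a, b2), (_, c) = H
    det = a * c - b2 * b2
    return (-(c * g[0] - b2 * g[1]) / det,
            -(a * g[1] - b2 * g[0]) / det)

H = ((2.0, 0.0), (0.0, 10.0))
b = (2.0, 10.0)                                  # minimizer: (1, 1)
x = (5.0, -3.0)                                  # arbitrary starting point
g = (H[0][0] * x[0] + H[0][1] * x[1] - b[0],     # gradient: Hx - b
     H[1][0] * x[0] + H[1][1] * x[1] - b[1])
p = newton_step_2x2(H, g)
x_new = (x[0] + p[0], x[1] + p[1])               # = (1.0, 1.0)
```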

102 Wolfgang Bangerth


Step 1: Choose search direction

Method 2 (Newton's method) -- alternative viewpoint:

The Newton step is also generated when applying Newton's method
for the root-finding problem F(x)=0 to the necessary optimality
condition.

Linearize the necessary condition around x_k:

0 = ∇f(x*) = ∇f(x_k) + ∇²f(x_k)(x* − x_k) + ...
           = g_k + H_k p_k

⇒  p_k = −H_k⁻¹ g_k

103 Wolfgang Bangerth


Step 1: Choose search direction
Method 3 (A third order method):
The search direction is to the minimum of the cubic model

m_k(p) = f_k + g_kᵀ p + ½ pᵀ H_k p + (1/6) [∂³f/(∂x_l ∂x_m ∂x_n)]_k p_l p_m p_n

The minimum is characterized by the quadratic equation

∂m_k(p)/∂p = g_k + H_k p + ½ [∂³f/(∂x_l ∂x_m ∂x_n)]_k p_l p_m = 0  ⇒  p_k = ???

There doesn't appear to be any practical way to compute the
solution of this equation for problems with more than one
variable.
104 Wolfgang Bangerth
Step 2: Determination of Step Length

Once the search direction is known, compute the update by
choosing a step length α_k and set
x_{k+1} = x_k + α_k p_k

Determine the step length by solving the
1-d minimization problem (line search):
α_k = arg min_α f(x_k + α p_k)

For Newton's method: If the quadratic
model is good, then the full step is good:
take the full step with α_k = 1

105 Wolfgang Bangerth


Convergence: Gradient method

Gradient method converges linearly, i.e.

‖x_k − x*‖ ≤ C ‖x_{k−1} − x*‖

The gain is a fixed factor C < 1

Convergence can be very slow if C is close to 1.

Example: If f(x) = xᵀHx, with H positive definite and for
optimal line search, then
C ≈ (λ_n − λ_1)/(λ_n + λ_1),  {λ_i} = spectrum(H)

[Figures: x² + y² → C = 0; x² + 5y² → C ≈ 0.6]
106 Wolfgang Bangerth


Convergence: Newton's method

Newton's method converges quadratically, i.e.

‖x_k − x*‖ ≤ C ‖x_{k−1} − x*‖²

Optimal convergence order only if the step length is 1; otherwise
slower convergence (the step length is 1 if the quadratic model is
valid!)

If quadratic convergence: accelerating progress as the iterations
proceed.

Size of C:
C ∼ sup_{x,y} ‖∇²f(x*)⁻¹ (∇²f(x) − ∇²f(y))‖ / ‖x − y‖

C measures the size of the nonlinearity beyond the quadratic part.
107 Wolfgang Bangerth
Example 1: Gradient method

f  x , y =−x 3 2x 2 y 2

Local minimum at x=y=0,


saddle point at x=4/3, y=0

108 Wolfgang Bangerth


Example 1: Gradient method

[Plot: ‖x_k − x*‖]

Convergence of the gradient method:

Converges quite fast, with linear rate
Mean value of convergence constant C: 0.28
At (x=0, y=0), there holds

∇²f(0,0) ~ {λ₁=4, λ₂=2},  C ≈ (4−2)/(4+2) ≈ 0.33
109 Wolfgang Bangerth
Example 1: Newton's method

f  x , y =−x 3 2x 2 y 2

Local minimum at x=y=0,


saddle point at x=4/3, y=0

110 Wolfgang Bangerth


Example 1: Newton's method

[Plot: ‖x_k − x*‖]

Convergence of Newton's method:

Converges very fast, with quadratic rate
Mean value of convergence constant C: 0.15
‖x_k − x*‖ ≤ C ‖x_{k−1} − x*‖²

Theoretical estimate yields C=0.5


111 Wolfgang Bangerth
Example 1: Comparison between methods

[Plot: ‖x_k − x*‖ vs. k]
Newton's method much faster than gradient method
Newton's method superior for high accuracy due to higher
order of convergence
Gradient method simple but converges in a reasonable
number of iterations as well
112 Wolfgang Bangerth
Example 2: Gradient method


f(x, y) = [quartic "banana valley" formula — garbled in extraction; terms scaled by 1/100]

(Banana valley function)

Global minimum at x=y=0

113 Wolfgang Bangerth


Example 2: Gradient method

[Plot: ‖x_k − x*‖]

Convergence of the gradient method:

Needs almost 35,000 iterations to come closer than 0.1 to
the solution!
Mean value of convergence constant C: 0.99995
At (x=4, y=2), there holds

∇²f(4,2) ~ {λ₁=0.1, λ₂=268},  C ≈ (268−0.1)/(268+0.1) ≈ 0.9993
114 Wolfgang Bangerth
Example 2: Newton's method


f(x, y) = [quartic "banana valley" formula — garbled in extraction; terms scaled by 1/100]

(Banana valley function)

Global minimum at x=y=0

115 Wolfgang Bangerth


Example 2: Newton's method

[Plot: ‖x_k − x*‖]

Convergence of Newton's method:

Less than 25 iterations for an accuracy of better than 10⁻⁷!

Convergence roughly linear for the first 15-20 iterations since
the step length α_k ≠ 1

Convergence roughly quadratic for the last iterations with step
length α_k ≈ 1
116 Wolfgang Bangerth
Example 2: Comparison between methods

[Plot: ‖x_k − x*‖ vs. k]

Newton's method much faster than gradient method


Newton's method superior for high accuracy (i.e. in the
vicinity of the solution) due to higher order of convergence
Gradient method converges too slowly for practical use

117 Wolfgang Bangerth


Practical line search strategies

Ideally: Use an exact step length determination (line search)
based on
α_k = arg min_α f(x_k + α p_k)

This is a 1d minimization problem for α, solvable via Newton's
method/bisection search/etc.

However: Expensive, may require many function/gradient
evaluations.

Instead: Find practical criteria that guarantee convergence but
need fewer function evaluations!

118 Wolfgang Bangerth


Practical line search strategies

Strategy: Find practical criteria that guarantee convergence
but need fewer evaluations.

Rationale:

Near the optimum, the quadratic approximation of f is valid
→ take full steps (step length 1) there

Line search is only necessary far away from the solution

If close to the solution, need to try α=1 first

Consequence:

Near the solution, the quadratic convergence of Newton's method
is retained

Far away, convergence is slower in any case.
119 Wolfgang Bangerth
Practical line search strategies

Practical strategy: Use an inexact line search that:

finds a reasonable approximation to the exact step length;

guarantees a sufficient decrease in f(x) with the chosen step length;

chooses the full step length 1 for Newton's method whenever
possible.

[Figure: f(x, y) = x⁴ − x² + y⁴ − y²]
120 Wolfgang Bangerth
Practical line search strategies

Wolfe condition 1 ("sufficient decrease" condition):
Require step lengths to produce a sufficient decrease

f(x_k + α p_k) ≤ f(x_k) + c₁ α [∂f(x_k + α p_k)/∂α]_{α=0}
             = f_k + c₁ α ∇f_k · p_k

[Plot: f(x_k + α p_k) vs. α]

Necessary: 0 < c₁ < 1
Typical values: c₁ = 10⁻⁴
i.e.: only a very small decrease is mandated


121 Wolfgang Bangerth
Practical line search strategies

Wolfe condition 2 ("curvature" condition):
Require step lengths where f has shown sufficient
curvature upwards

∇f(x_k + α p_k) · p_k = [∂f(x_k + α p_k)/∂α]_{α=α_k}
  ≥ c₂ [∂f(x_k + α p_k)/∂α]_{α=0} = c₂ ∇f_k · p_k

[Plot: f(x_k + α p_k) vs. α]

Necessary: 0 < c₁ < c₂ < 1
Typical: c₂ = 0.9
Rationale: Exclude too small step lengths


122 Wolfgang Bangerth
Practical line search strategies

Wolfe conditions
Conditions 1 and 2 usually yield reasonable ranges for the
step lengths, but do not guarantee optimal ones

[Plot: f(x_k + α p_k)]
123 Wolfgang Bangerth
Practical line search strategies - Alternatives

Strict Wolfe conditions:

|[∂f(x_k + α p_k)/∂α]_{α=α_k}| ≤ c₂ |[∂f(x_k + α p_k)/∂α]_{α=0}|

[Plots: f(x_k + α p_k)]

Goldstein conditions:

f(x_k + α p_k) ≥ f(x_k) + (1−c₁) α [∂f(x_k + α p_k)/∂α]_{α=0}


124 Wolfgang Bangerth
Practical line search strategies

Conditions like the ones above tell us whether a given step
length is acceptable or not.

In practice, don't try too many step lengths – checking the
conditions involves function evaluations of f(x).

Typical strategy ("Backtracking line search"):

1. Start with a trial step length α_t = ᾱ
   (for Newton's method: ᾱ = 1)
2. Verify the acceptance conditions for this α_t
3. If yes: α_k = α_t
4. If no: α_t = c·α_t, c < 1, and go to 2.

Note: A typical reduction factor is c = 1/2
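The four steps above can be sketched in a few lines; the acceptance test used here is the sufficient-decrease (Wolfe 1) condition with c₁ = 10⁻⁴ and reduction factor c = 1/2, demonstrated on the toy function f(x) = x⁴:

```python
# Backtracking line search: halve the trial step length until the
# sufficient-decrease condition f(x + a p) <= f(x) + c1 a (grad f . p) holds.

def backtracking(f, fx, gdotp, x, p, alpha=1.0, c1=1e-4, c=0.5):
    # gdotp = grad f(x) . p must be negative (descent direction).
    while f(x + alpha * p) > fx + c1 * alpha * gdotp:
        alpha *= c
    return alpha

f = lambda x: x**4
x = 1.0
g = 4.0 * x**3          # f'(x)
p = -g                  # steepest descent direction
alpha = backtracking(f, f(x), g * p, x, p)   # accepted at alpha = 0.25
x_new = x + alpha * p
```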
125 Wolfgang Bangerth
Practical line search strategies

An alternative strategy ("Interpolating line search"):

- Start with α_t^(0) = ᾱ = 1, set i=0
- Verify the acceptance conditions for α_t^(i)
- If yes: α_k = α_t^(i)
- If no:
  - let φ_k(α) = f(x_k + α p_k)
  - from evaluating the sufficient decrease condition
    f(x_k + α_t^(i) p_k) ≤ f_k + c₁ α_t^(i) ∇f_k · p_k
    we already know φ_k(0) = f(x_k), φ_k'(0) = ∇f_k · p_k = g_k · p_k,
    and φ_k(α_t^(i)) = f(x_k + α_t^(i) p_k)
  - if i=0, choose α_t^(i+1) as the minimizer of the quadratic
    function that interpolates φ_k(0), φ_k'(0), φ_k(α_t^(i))
  - if i>0, choose α_t^(i+1) as the minimizer of the cubic
    function that interpolates φ_k(0), φ_k'(0), φ_k(α_t^(i)), φ_k(α_t^(i−1))
126 Wolfgang Bangerth
Practical line search strategies

An alternative strategy (“Interpolating line search”):

Step 1: Quadratic interpolation

[Plot: quadratic interpolation at α_t^(0)]
127 Wolfgang Bangerth
Practical line search strategies

An alternative strategy (“Interpolating line search”):

Step 2 and following: Cubic interpolation

[Plot: cubic interpolation using α_t^(1) and α_t^(0)]
128 Wolfgang Bangerth
Part 5

Smooth unconstrained
problems:
Trust region algorithms
minimize f  x 

129 Wolfgang Bangerth


Line search vs. trust region algorithms

Line search algorithms:


Choose a relatively simple strategy to find a search direction
Put significant effort into finding an appropriate step length

130 Wolfgang Bangerth


Line search vs. trust region algorithms

Trust region algorithms:


Choose simple strategy to determine a step length.
Put effort into finding an appropriate search direction.

Background:
In line search methods, we choose a direction based on a local
approximation of the objective function.
I.e.: Try to predict f(x) far away from x_k by looking at f_k, g_k, H_k

This can't work when still far from the solution!
(Unless f(x) is almost quadratic everywhere.)
131 Wolfgang Bangerth


Trust region algorithms

Trust region algorithms:


Choose simple strategy to determine a step length.
Put effort into finding an appropriate search direction.

Alternative strategy:
Keep a number Δ_k that indicates up to which distance we trust
that our model m_k(p) is a good approximation of f(x_k + p_k).

Find the update as follows:

p_k = arg min_p m_k(p) = f_k + g_k·p + ½ pᵀ B p
such that ‖p‖ ≤ Δ_k

Then accept the update unconditionally, i.e. without line search:
x_{k+1} = x_k + p_k
132 Wolfgang Bangerth
Trust region algorithms
Example: [Figure: f(x), model m_k(p), and step p_k at x_k]

The line search Newton direction leads to the exact minimum of
the approximating model m_k(p).

However, m_k(p) does not approximate f(x) well at these
distances.
Consequently, we need line search as a safeguard.

133 Wolfgang Bangerth


Trust region algorithms
Example: [Figure: f(x) and model m_k(p) at x_k]

Rather, decide how far we trust the model and stay within this
radius!

134 Wolfgang Bangerth


Trust region algorithms

Basic trust region algorithm:

For k=1,2,...:

Compute the update by finding an approximation p_k to the solution of
p_k = arg min_p m_k(p) = f_k + g_k·p + ½ pᵀ B_k p
such that ‖p‖ ≤ Δ_k

Compute the predicted improvement PI = m_k(0) − m_k(p_k)
Compute the actual improvement AI = f(x_k) − f(x_k + p_k)

If AI/PI < 1/4 then Δ_{k+1} = ¼ ‖p_k‖
If AI/PI > 3/4 and ‖p_k‖ = Δ_k then Δ_{k+1} = 2 Δ_k

If AI/PI > η for some η ∈ [0, 1/4) then x_{k+1} = x_k + p_k
else x_{k+1} = x_k
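A minimal 1d version of this loop (a sketch with invented parameters, using the exact model minimizer clipped to the radius in place of an approximate subproblem solver):

```python
# 1d trust-region loop: the model minimizer is the Newton step clipped to
# the radius delta; rho = AI/PI drives the radius update, and a step is
# accepted only if rho > eta. Demonstrated on f(x) = x^4.

def trust_region_1d(f, df, d2f, x, delta=1.0, eta=0.1, iters=50):
    for _ in range(iters):
        g, h = df(x), d2f(x)
        # Model minimizer: Newton step if curvature is positive, else a
        # full step to the boundary in the downhill direction.
        p = -g / h if h > 0 else (-delta if g > 0 else delta)
        p = max(-delta, min(delta, p))          # enforce |p| <= delta
        pred = -(g * p + 0.5 * h * p * p)       # PI = m(0) - m(p)
        act = f(x) - f(x + p)                   # AI
        rho = act / pred if pred > 0 else -1.0
        if rho < 0.25:
            delta = 0.25 * max(abs(p), 1e-12)   # shrink the region
        elif rho > 0.75 and abs(p) == delta:
            delta = 2.0 * delta                 # grow the region
        if rho > eta:
            x = x + p                           # accept the step
    return x

x_star = trust_region_1d(lambda x: x**4,
                         lambda x: 4.0 * x**3,
                         lambda x: 12.0 * x**2,
                         x=2.0)
```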
135 Wolfgang Bangerth
Trust region algorithms

Fundamental difficulty of trust region algorithms:

p_k = arg min_p m_k(p) = f_k + g_k·p + ½ pᵀ B_k p
such that ‖p‖ ≤ Δ_k

Not a trivial problem to solve!

As with line search algorithms, don't spend too much time
finding the exact minimum of an approximate model.

Practical trust region methods are about finding cheap ways
to approximate the solution of the problem above!

136 Wolfgang Bangerth


Trust region algorithms: The dogleg method

Find an approximation to the solution of:

p_k = arg min_p m_k(p) = f_k + g_k·p + ½ pᵀ B_k p
such that ‖p‖ ≤ Δ_k

Note:
If the trust region radius is small, then we get the "Cauchy point" in
the steepest descent direction:
p_k ≈ p_k^C = τ p_k^SD,  τ ∈ [0,1],  p_k^SD = −Δ_k g_k/‖g_k‖
p_k^C is the minimizer of f(x) in direction p_k^SD

If the trust region radius is large, then we get the (quasi-)Newton
update:
p_k = p_k^B = −B_k⁻¹ g_k

137 Wolfgang Bangerth


Trust region algorithms: The dogleg method

Find an approximation to the solution of:

p_k = arg min_p m_k(p) = f_k + g_k·p + ½ pᵀ B_k p
such that ‖p‖ ≤ Δ_k

[Figures: trust regions with Δ_k < ‖p_k^B‖ and Δ_k > ‖p_k^B‖, showing p_k^C and p_k^B from x_k]

138 Wolfgang Bangerth


Trust region algorithms: The dogleg method

Find an approximation to the solution of:

p_k = arg min_p m_k(p) = f_k + g_k·p + ½ pᵀ B_k p
such that ‖p‖ ≤ Δ_k

[Figures: p_k^C and p_k^B from x_k]

Idea:
Find the approximate solution p_k along the "dogleg" line
x_k → x_k + p_k^C → x_k + p_k^B
139 Wolfgang Bangerth
Trust region algorithms: The dogleg method

Find an approximation to the solution of:

p_k = arg min_p m_k(p) = f_k + g_k·p + ½ pᵀ B_k p
such that ‖p‖ ≤ Δ_k

In practice, the Cauchy point is difficult to compute because it
requires a line search.
Thus, the dogleg method doesn't use the minimizer p_k^C of f along
p_k^SD but the minimizer

p_k^U = −(g_kᵀ g_k)/(g_kᵀ B_k g_k) g_k

of
m_k(p) = f_k + g_kᵀ p + ½ pᵀ B_k p

The dogleg then runs along x_k → x_k + p_k^U → x_k + p_k^B
140 Wolfgang Bangerth
Trust region algorithms: The dogleg method

Find an approximation to the solution of:

p_k = arg min_p m_k(p) = f_k + g_k·p + ½ pᵀ B_k p
such that ‖p‖ ≤ Δ_k

Dogleg algorithm:

If p_k^B = −B_k⁻¹ g_k satisfies ‖p_k^B‖ ≤ Δ_k, then set p_k = p_k^B

Otherwise, if p_k^U = −(g_kᵀ g_k)/(g_kᵀ B_k g_k) g_k satisfies
‖p_k^U‖ ≥ Δ_k, then set p_k = Δ_k p_k^U/‖p_k^U‖

Otherwise choose p_k as the intersection point of the line p_k^U → p_k^B
and the circle with radius Δ_k
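The three cases of the dogleg algorithm can be written out directly; the sketch below assumes a diagonal 2×2 matrix B so the Newton step is cheap, and invented values for g and Δ:

```python
# Dogleg step: return the Newton step p_B if it fits in the radius; else
# the scaled steepest-descent minimizer p_U if even that is too long; else
# the point where the segment p_U -> p_B crosses the boundary.
import math

def dogleg(g, B_diag, delta):
    pB = [-g[i] / B_diag[i] for i in range(2)]              # Newton step
    if math.hypot(pB[0], pB[1]) <= delta:
        return pB
    gg = g[0] * g[0] + g[1] * g[1]
    gBg = B_diag[0] * g[0] * g[0] + B_diag[1] * g[1] * g[1]
    pU = [-(gg / gBg) * g[i] for i in range(2)]             # model minimizer along -g
    nU = math.hypot(pU[0], pU[1])
    if nU >= delta:
        return [delta / nU * pU[i] for i in range(2)]
    # Solve |p_U + tau (p_B - p_U)| = delta for tau in [0, 1].
    dvec = [pB[i] - pU[i] for i in range(2)]
    a = dvec[0] ** 2 + dvec[1] ** 2
    b = 2.0 * (pU[0] * dvec[0] + pU[1] * dvec[1])
    c = nU * nU - delta * delta
    tau = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return [pU[i] + tau * dvec[i] for i in range(2)]

p = dogleg(g=[1.0, 1.0], B_diag=[1.0, 10.0], delta=0.9)     # segment case
```

The returned step lies exactly on the trust-region boundary in this case.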

141 Wolfgang Bangerth


Part 6

Practical aspects of
Newton methods

minimize f  x 

142 Wolfgang Bangerth


What if the Hessian is not positive definite

At the solution, Hessian ∇ 2 f  x * is positive definite. If f(x) is


smooth, Hessian is positive definite near the optimum.

However, this needs not be so far away from the optimum:


At initial point x 0
the Hessian is indefinite:
2
H 0=∇ f  x 0 = 
−0.022 0.134
0.134 −0.337 
1 =−0.386, 2=0.027

Quadratic model
T 1 T
mk  p= f k  g p p H k p
k
2
has saddle point instead of
minimum, Newton step is
143
invalid! Wolfgang Bangerth
What if the Hessian is not positive definite

Background: A search direction is only useful if it is a descent
direction:
∇f(x_k)ᵀ · p_k < 0

Trivially satisfied for the gradient method; for Newton's method
there holds:
p_k = −H_k⁻¹ g_k  ⇒  g_kᵀ · p_k = −g_kᵀ H_k⁻¹ g_k < 0

The search direction is only a
guaranteed descent direction
if H is positive definite!

Otherwise the search direction is the
direction to the saddle point of the
quadratic model and might be
a direction of ascent!
144 Wolfgang Bangerth
What if the Hessian is not positive definite
If the Hessian is not positive definite, then modify the quadratic
model:

- retain as much information as possible;
- the model should be convex, so that we can seek a minimum.

The general strategy then is to replace the quadratic model by
a positive definite one:
m_k(p) = f_k + g_kᵀ p + ½ pᵀ H̃_k p

Here, H̃_k is a suitable modification of the exact Hessian
H_k = ∇²f(x_k) so that H̃_k is positive definite.

Note: To retain ultimate quadratic convergence, we need that
H̃_k → H_k as x_k → x*

145 Wolfgang Bangerth


What if the Hessian is not positive definite
The Levenberg-Marquardt modification:

Choose
H̃_k = H_k + μ I,  μ ≥ −λ_i
so that the minimum of
m_k(p) = f_k + g_kᵀ p + ½ pᵀ H̃_k p
lies at
p_k = −H̃_k⁻¹ g_k = −(H_k + μ I)⁻¹ g_k

Note: The search direction is a mixture
between the Newton direction p_k^N and the gradient direction p_k^G.

Note: Close to the solution the Hessian
must become positive definite and we
can choose μ = 0
146 Wolfgang Bangerth
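The shift can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's code: the slides leave the choice of μ open, and picking it from the smallest eigenvalue, as well as the test matrix and gradient, are assumptions made for the example.

```python
import numpy as np

def lm_direction(H, g, eps=1e-3):
    """Levenberg-Marquardt-style step: shift H by mu*I so that
    H + mu*I is positive definite, then solve for the direction.
    mu = 0 if H is already sufficiently positive definite."""
    eigmin = np.linalg.eigvalsh(H).min()
    mu = 0.0 if eigmin > eps else eps - eigmin
    return np.linalg.solve(H + mu * np.eye(len(g)), -g), mu

# Indefinite Hessian from the slide; g is made-up test data:
H = np.array([[-0.022, 0.134],
              [ 0.134, -0.337]])
g = np.array([0.1, -0.2])
p, mu = lm_direction(H, g)
assert g @ p < 0   # shifted matrix is PD, so p is a descent direction
```

One common heuristic is shown here (shift by the negative of the smallest eigenvalue plus a margin); in practice the shift is often found by trial Cholesky factorizations instead.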
What if the Hessian is not positive definite

The eigenvalue modification strategy:

Since H_k is symmetric, it has a complete set of eigenvectors:

    H_k = ∇²f(x_k) = Σ_i λ_i v_i v_i^T

Therefore replace the quadratic model by a positive definite one:

    m̃_k(p) = f_k + g_k^T p + (1/2) p^T H̃_k p
with
    H̃_k = Σ_i max{λ_i, ε} v_i v_i^T

Note: This only modifies the Hessian in directions of negative
curvature.
Note: Close to the solution, all eigenvalues become positive
and we get again the original Newton matrix.
147 Wolfgang Bangerth
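The eigenvalue modification above can be sketched directly with a symmetric eigendecomposition. A hedged illustration; the indefinite matrix is the one from the earlier slide:

```python
import numpy as np

def modified_hessian(H, eps=1e-6):
    """Eigenvalue modification: replace each eigenvalue lambda_i of
    the symmetric matrix H by max(lambda_i, eps), keeping the
    eigenvectors, so the result is positive definite."""
    lam, V = np.linalg.eigh(H)
    return V @ np.diag(np.maximum(lam, eps)) @ V.T

H = np.array([[-0.022, 0.134],
              [ 0.134, -0.337]])   # indefinite, as in the slides
H_mod = modified_hessian(H, eps=1e-6)
assert np.all(np.linalg.eigvalsh(H_mod) >= 1e-6 - 1e-9)
```

Note that computing a full eigendecomposition costs O(n³), which is why cheaper modified-Cholesky variants are often preferred for large n.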
What if the Hessian is not positive definite

One problem with the modification

    H̃_k = Σ_i max{λ_i, ε} v_i v_i^T

is that the search direction is given by

    p_k = −H̃_k^{-1} g_k = −Σ_i (1 / max{λ_i, ε}) v_i (v_i^T g_k)

that is, the search direction has a large component (of size 1/ε) in
the directions of modified curvature!

An alternative that avoids this is to use

    H̃_k = Σ_i |λ_i| v_i v_i^T

148 Wolfgang Bangerth


What if the Hessian is not positive definite

Theorem: Using full step length and either of the Hessian
modifications

    H̃_k = H_k + μ I
    H̃_k = Σ_i max{λ_i, ε} v_i v_i^T

we have that if x_k → x* and if f ∈ C^{2,1}, then convergence
happens with quadratic rate.

Proof: Since f is twice continuously differentiable, there is a k
such that x_k is close enough to x* that H_k is positive definite.
When that is the case, then

    H̃_k = H_k

for all following iterations, providing the quadratic convergence
rate of the full-step Newton method.
149 Wolfgang Bangerth
What if the Hessian is not positive definite

Example:
    f(x, y) = x^4 − x^2 + y^4 − y^2

Blue regions indicate where the Hessian

    ∇²f(x, y) = [ 12x^2 − 2   0 ;  0   12y^2 − 2 ]

is not positive definite.

There are minima at x = ±√2/2, y = ±√2/2.

150 Wolfgang Bangerth


What if the Hessian is not positive definite

Starting point:
    x_0 = 0.1,  y_0 = 0.87,   H_0 = [ −1.88  0 ;  0  7.08 ]

1. Negative gradient
2. Unmodified Hessian search direction
3. Search direction with eigenvalue-modified Hessian (ε = 10^-6)
4. Search direction with shifted Hessian (μ = 2.5; search direction
   only good by lucky choice of μ)

151 Wolfgang Bangerth
Truncated Newton methods

In any Newton or trust region method, we have to solve an equation of
the sort
    H_k p_k = −g_k
or potentially with a modified Hessian:
    H̃_k p_k = −g_k

Oftentimes, computing the Hessian is more expensive than inverting
it, but not always.

Question: Could we possibly get away with only approximately solving
this problem, i.e. finding
    p_k ≈ −H_k^{-1} g_k
with suitable conditions on how accurate the approximation is?


152 Wolfgang Bangerth
Truncated Newton methods

Example: Since the Hessian (or a modified version) is a positive
definite matrix, we may want to solve
    H_k p_k = −g_k
using an iterative method such as the Conjugate Gradient method,
Gauss-Seidel, the Richardson iteration, SSOR, etc.

While all these methods eventually converge to the exact Newton
direction, we may want to truncate the iteration at some point.

Question: When can we terminate this iteration?

153 Wolfgang Bangerth


Truncated Newton methods

Theorem 1: Let pk be an approximation to the Newton


direction defined by
H k pk = −gk

and let there be a sequence of numbers { k },  k 1 so that

∥g k H k pk∥
≤ k 1
∥g k∥

Then if x k  x * then the full step Newton method converges


with linear order.

154 Wolfgang Bangerth


Truncated Newton methods

Theorem 2: Let p̂_k be an approximation to the Newton direction
defined by
    H_k p_k = −g_k
and let there be a sequence of numbers {η_k}, η_k < 1, η_k → 0, so
that

    ‖g_k + H_k p̂_k‖ / ‖g_k‖ ≤ η_k < 1

Then if x_k → x*, the full-step Newton method converges with
superlinear order.

155 Wolfgang Bangerth


Truncated Newton methods

Theorem 3: Let pk be an approximation to the Newton


direction defined by
H k pk = −gk

and let there be a sequence of numbers { k },  k 1, k =O ∥g k∥


so that
∥g k H k pk∥
≤ k 1
∥g k∥

Then if x k  x * then the full step Newton method converges


with quadratic order.

156 Wolfgang Bangerth
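The residual test of Theorems 1-3 combines naturally with the Conjugate Gradient method into a truncated ("Newton-CG") direction computation. The following Python sketch is illustrative only: the forcing term η_k = min(0.5, √‖g_k‖), the negative-curvature fallback, and the test data are assumptions, not taken from the slides.

```python
import numpy as np

def truncated_newton_direction(H, g, eta):
    """Run CG on H p = -g, stopping once the relative residual
    ||g + H p|| / ||g|| drops below the forcing term eta."""
    n = len(g)
    p = np.zeros(n)
    r = g.copy()            # residual of H p = -g at p = 0
    d = -r
    for _ in range(2 * n):
        if np.linalg.norm(r) <= eta * np.linalg.norm(g):
            break
        Hd = H @ d
        curv = d @ Hd
        if curv <= 0:       # negative curvature: keep current p
            break
        alpha = (r @ r) / curv
        p += alpha * d
        r_new = r + alpha * Hd
        d = -r_new + (r_new @ r_new) / (r @ r) * d
        r = r_new
    return p if np.any(p) else -g   # steepest descent fallback

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # made-up SPD test matrix
g = np.array([1.0, 2.0])
eta = min(0.5, np.sqrt(np.linalg.norm(g)))  # a superlinear choice
p = truncated_newton_direction(H, g, eta)
assert g @ p < 0   # descent direction
```

With η_k → 0 this reproduces the superlinear regime of Theorem 2, and η_k = O(‖g_k‖) the quadratic regime of Theorem 3.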


Part 7

Quasi-Newton update formulas

Bk1=B k ...

157 Wolfgang Bangerth


Quasi-Newton update formulas

Observation 1:

Computing the exact Hessian to determine the Newton search direction
    H_k p_k = −g_k
is expensive, and sometimes impossible.

It at least doubles the effort per iteration because we need not only
the first but also the second derivatives of f(x).

It also requires us to solve a linear system for the search
direction.

158 Wolfgang Bangerth


Quasi-Newton update formulas

Observation 2:

We know that we can get superlinear convergence if we choose the
update p_k using
    B_k p_k = −g_k
instead of
    H_k p_k = −g_k
under certain conditions on the matrix B_k.

159 Wolfgang Bangerth


Quasi-Newton update formulas

Question: Maybe it is possible to find matrices B_k for which:
● computing B_k is cheap and requires no additional function
  evaluations;
● solving
      B_k p_k = −g_k
  for p_k is cheap;
● the resulting iteration still converges with superlinear order.

160 Wolfgang Bangerth


Motivation of ideas

Consider a function p(x).

The Fundamental Theorem of Calculus tells us that

    p(z) − p(x) = ∇p(ξ)^T (z − x)
    for some ξ = x + t(z − x),  t ∈ [0, 1]

Let's apply this to p(x) = ∇f(x), z = x_k, x = x_{k−1}:

    ∇f(x_k) − ∇f(x_{k−1}) = g_k − g_{k−1}
                          = ∇²f(x_k − t p_k) (x_k − x_{k−1})
                          = H̄ (x_k − x_{k−1})

Let us denote y_{k−1} = g_k − g_{k−1}, s_{k−1} = x_k − x_{k−1}; then
this reads

    H̄ s_{k−1} = y_{k−1}

with an "average" Hessian H̄.
161 Wolfgang Bangerth
Motivation of ideas

Requirements: We seek a matrix B_{k+1} so that
● the "secant condition" holds:
      B_{k+1} s_k = y_k
● B_{k+1} is symmetric;
● B_{k+1} is positive definite;
● B_{k+1} changes minimally from B_k;
● the update equation is easy to solve for
      p_{k+1} = −B_{k+1}^{-1} g_{k+1}

162 Wolfgang Bangerth


Davidon-Fletcher-Powell

The DFP update formula:

Given B_k, define B_{k+1} by

    B_{k+1} = (I − γ_k y_k s_k^T) B_k (I − γ_k s_k y_k^T) + γ_k y_k y_k^T,
    γ_k = 1 / (y_k^T s_k)

This satisfies the conditions:
● It is symmetric and positive definite.
● Among all possible matrices, it is the one that minimizes
      ‖H̄^{-1/2} (B_{k+1} − B_k) H̄^{-1/2}‖_F
● It satisfies the secant condition B_{k+1} s_k = y_k.
163 Wolfgang Bangerth
Broyden-Fletcher-Goldfarb-Shanno

The BFGS update formula:

Given B_k, define B_{k+1} by

    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)

This satisfies the conditions:
● It is symmetric and positive definite.
● Among all possible matrices, it is the one that minimizes
      ‖H̄^{1/2} (B_{k+1}^{-1} − B_k^{-1}) H̄^{1/2}‖_F
● It satisfies the secant condition B_{k+1} s_k = y_k.
164 Wolfgang Bangerth
Broyden-Fletcher-Goldfarb-Shanno

So far: We seek a matrix B_{k+1} so that
● the secant condition holds:
      B_{k+1} s_k = y_k
● B_{k+1} is symmetric;
● B_{k+1} is positive definite;
● B_{k+1} changes minimally from B_k in some sense;
● the update equation is easy to solve for
      p_k = −B_k^{-1} g_k

165 Wolfgang Bangerth


DFP and BFGS

Now a miracle happens:

For the DFP formula:

    B_{k+1} = (I − γ_k y_k s_k^T) B_k (I − γ_k s_k y_k^T) + γ_k y_k y_k^T,
    γ_k = 1 / (y_k^T s_k)

    B_{k+1}^{-1} = B_k^{-1} − (B_k^{-1} y_k y_k^T B_k^{-1})/(y_k^T B_k^{-1} y_k)
                 + (s_k s_k^T)/(y_k^T s_k)

For the BFGS formula:

    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)

    B_{k+1}^{-1} = (I − ρ_k s_k y_k^T) B_k^{-1} (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T,
    ρ_k = 1 / (y_k^T s_k)

This makes computing the next update very cheap!
166 Wolfgang Bangerth
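The inverse BFGS formula can be applied directly, without any linear solve. A hedged sketch; the pairs (s, y) below are made-up data, not from the lecture:

```python
import numpy as np

def bfgs_inverse_update(Binv, s, y):
    """BFGS update of the inverse matrix:
    B_{k+1}^{-1} = (I - rho s y^T) B_k^{-1} (I - rho y s^T) + rho s s^T
    with rho = 1 / (y^T s)."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(y, s)
    return V.T @ Binv @ V + rho * np.outer(s, s)

# Check the secant condition B_{k+1} s = y, i.e. B_{k+1}^{-1} y = s:
Binv = np.eye(2)
s = np.array([1.0, 0.5])
y = np.array([0.3, 0.8])      # note y^T s > 0 (curvature condition)
Binv_new = bfgs_inverse_update(Binv, s, y)
assert np.allclose(Binv_new @ y, s)
```

Each update costs O(n²) matrix-vector work, compared with the O(n³) of factoring a fresh Hessian.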
DFP + BFGS = Broyden class

What if we mixed:

DFP:   B_{k+1}^{DFP}  = (I − γ_k y_k s_k^T) B_k (I − γ_k s_k y_k^T) + γ_k y_k y_k^T
BFGS:  B_{k+1}^{BFGS} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)

    B_{k+1} = φ_k B_{k+1}^{DFP} + (1 − φ_k) B_{k+1}^{BFGS}

This is called the "Broyden class" of update formulas.

The class of Broyden methods with 0 ≤ φ_k ≤ 1 is called the
"restricted Broyden class".

167 Wolfgang Bangerth


DFP + BFGS = Broyden class

Theorem: Let f ∈ C², and let x_0 be a starting point so that the set
    Ω = {x : f(x) ≤ f(x_0)}
is convex. Let B_0 be any symmetric positive definite matrix. Then
    x_k → x*
for any sequence x_k generated by a quasi-Newton method that uses a
Hessian update formula from any member of the restricted Broyden
class, with the exception of the DFP method (φ_k = 1).

168 Wolfgang Bangerth


DFP + BFGS = Broyden class

Theorem: Let f ∈ C^{2,1}. Assume the BFGS updates converge; then
    x_k → x*
with superlinear order.

169 Wolfgang Bangerth


Practical BFGS: Starting matrix

Question: How do we choose the initial matrix B_0 or B_0^{-1}?

Observation 1: The theorem stated that we will eventually converge
for any symmetric, positive definite starting matrix.

In particular, we could choose a multiple of the identity matrix:
    B_0 = β I,   B_0^{-1} = (1/β) I

Observation 2: If β is too small, then
    p_0 = −B_0^{-1} g_0 = −(1/β) g_0
is too large, and we need many trials in the line search to find a
suitable step length.

Observation 3: The matrices B_k should approximate the Hessian
matrix, so they at least need to have the same physical units.

170 Wolfgang Bangerth
Practical BFGS: Starting matrix

Practical approaches:

Strategy 1: Compute the first gradient g_0, choose a "typical" step
length δ, then set
    B_0 = (‖g_0‖/δ) I,   B_0^{-1} = (δ/‖g_0‖) I
so that we get
    p_0 = −B_0^{-1} g_0 = −δ g_0/‖g_0‖

Strategy 2: Approximate the true Hessian somehow. For example, do one
step with the heuristic above, then choose
    B_0 = (y_1^T y_1)/(y_1^T s_1) I,   B_0^{-1} = (y_1^T s_1)/(y_1^T y_1) I
and start over again.
171 Wolfgang Bangerth
Practical BFGS: Limited Memory BFGS (LM-BFGS)

Observation: The matrices in

    B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)

    B_{k+1}^{-1} = (I − ρ_k s_k y_k^T) B_k^{-1} (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T,
    ρ_k = 1 / (y_k^T s_k)

are full, even if the true Hessian is sparse.

Consequence: We need to compute all n² entries, and store them.

172 Wolfgang Bangerth


Practical BFGS: Limited Memory BFGS (LM-BFGS)

Solution: Note that in the kth iteration, we can write

    B_k^{-1} = V_{k−1}^T B_{k−1}^{-1} V_{k−1} + ρ_{k−1} s_{k−1} s_{k−1}^T
with
    ρ_{k−1} = 1/(y_{k−1}^T s_{k−1}),   V_{k−1} = I − ρ_{k−1} y_{k−1} s_{k−1}^T

We can expand this recursively:

    B_k^{-1} = V_{k−1}^T B_{k−1}^{-1} V_{k−1} + ρ_{k−1} s_{k−1} s_{k−1}^T
             = V_{k−1}^T V_{k−2}^T B_{k−2}^{-1} V_{k−2} V_{k−1}
               + ρ_{k−2} V_{k−1}^T s_{k−2} s_{k−2}^T V_{k−1}
               + ρ_{k−1} s_{k−1} s_{k−1}^T
             = ...
             = [V_{k−1}^T ··· V_1^T] B_0^{-1} [V_1 ··· V_{k−1}]
               + Σ_{j=1}^{k} ρ_{k−j} [V_{k−1}^T ··· V_{k−j+1}^T] s_{k−j} s_{k−j}^T [V_{k−j+1} ··· V_{k−1}]

Consequence: We need only store kn entries.
173 Wolfgang Bangerth
Practical BFGS: Limited Memory BFGS (LM-BFGS)

Problem: kn elements may still be quite a lot if we need many
iterations. Forming the product with this matrix will then also be
expensive.

Solution: Limit memory and CPU time by only storing the last m
updates:

    B_k^{-1} = [V_{k−1}^T ··· V_{k−m}^T] B_{0,k}^{-1} [V_{k−m} ··· V_{k−1}]
             + Σ_{j=1}^{m} ρ_{k−j} [V_{k−1}^T ··· V_{k−j+1}^T] s_{k−j} s_{k−j}^T [V_{k−j+1} ··· V_{k−1}]

Consequence: We need only store mn entries, and multiplication with
this matrix requires 2mn + O(m³) operations.

174 Wolfgang Bangerth


Practical BFGS: Limited Memory BFGS (LM-BFGS)

    B_k^{-1} = [V_{k−1}^T ··· V_{k−m}^T] B_{0,k}^{-1} [V_{k−m} ··· V_{k−1}]
             + Σ_{j=1}^{m} ρ_{k−j} [V_{k−1}^T ··· V_{k−j+1}^T] s_{k−j} s_{k−j}^T [V_{k−j+1} ··· V_{k−1}]

In practice:
● The initial matrix can be chosen independently in each iteration;
  a typical approach is again
      B_{0,k}^{-1} = (y_{k−1}^T s_{k−1})/(y_{k−1}^T y_{k−1}) I
● Typical values for m are between 3 and 30.
175 Wolfgang Bangerth
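In practice the product B_k^{-1} g_k is formed by the so-called two-loop recursion rather than by multiplying the V matrices out. A sketch under the scaled-identity initial matrix above; the stored pairs and gradient are made-up data:

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: apply the limited-memory BFGS inverse
    Hessian, built from the stored pairs (s_j, y_j), to g.
    Initial matrix B_{0,k}^{-1} = (y^T s / y^T y) I (newest pair)."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):  # newest first
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q = q - a * y
    if s_list:
        s, y = s_list[-1], y_list[-1]
        q = (y @ s) / (y @ y) * q     # scaled-identity initial matrix
    r = q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        b = (y @ r) / (y @ s)
        r = r + (a - b) * s
    return -r                         # search direction p = -B^{-1} g

g = np.array([1.0, -2.0, 0.5])
s_list = [np.array([0.1, 0.0, 0.2]), np.array([0.0, 0.3, 0.1])]
y_list = [np.array([0.2, 0.1, 0.3]), np.array([0.1, 0.4, 0.2])]
p = lbfgs_direction(g, s_list, y_list)
assert g @ p < 0   # inverse matrix is PD, so p is a descent direction
```

The work is roughly 4mn flops per direction, and only the 2m vectors are ever stored.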


Parts 1-7

Summary of methods for


smooth unconstrained
problems
minimize f  x 

176 Wolfgang Bangerth


Summary

Newton's method is unbeatable with regard to speed of
convergence

However: To converge, one needs
- a line search method + conditions like the Wolfe conditions
- Hessian matrix modification if it is not positive definite

Newton's method can be expensive or infeasible if
- computing Hessians is complicated
- the number of variables is large

Quasi-Newton methods, e.g. LM-BFGS, help:
- only need first derivatives
- need little memory and no explicit matrix inversions
- but converge slower (at best superlinear)

Trust region methods are an alternative to Newton's method
but share the same drawbacks
177 Wolfgang Bangerth
Part 8

Equality-constrained
Problems

minimize f  x
g i  x  = 0, i=1,... , ne

178 Wolfgang Bangerth


An example

Consider the example of the body suspended from a ceiling with
springs, but this time with an additional rod of fixed length
attached to a fixed point.

To find the position of the body we now need to solve the following
problem:

    minimize f(x) = E(x, z) = Σ_i E_spring,i(x, z) + E_pot(x, z)
    ‖x − x_0‖ − L_rod = 0
179 Wolfgang Bangerth
An example

We can gain some insight into the problem by plotting the


energy as a function of (x,z) along with the constraint:

180 Wolfgang Bangerth


Definitions

We call this the standard form of equality-constrained problems:

    minimize_{x ∈ D ⊂ ℝⁿ} f(x)
    g_i(x) = 0,   i = 1 ... n_e

We will also frequently write this as follows, implying equality
elementwise:

    minimize_{x ∈ D ⊂ ℝⁿ} f(x)
    g(x) = 0

181 Wolfgang Bangerth


Definitions

A trivial reformulation of the problem is obtained by defining the
feasible set:

    Ω = {x ∈ ℝⁿ : g(x) = 0}

Then the original problem is equivalently recast as

    minimize_{x ∈ D ∩ Ω ⊂ ℝⁿ} f(x)

Note 1: This reformulation is not of much practical interest.
Note 2: The feasible set can be continuous or discrete, or empty if
the constraints are mutually incompatible.
We will always assume that it is continuous and non-empty.
182 Wolfgang Bangerth
The quadratic penalty method

Observation: The solution of

    minimize_{x ∈ D ⊂ ℝⁿ} f(x)
    g(x) = 0

must lie within the feasible set where g(x) = 0.

Idea: Let's relax the constraint and also search close to the
feasible set, where g(x) is small but not zero. However, make sure
that the objective function becomes very large far away from the
feasible set:

    minimize_{x ∈ D ⊂ ℝⁿ} Q_μ(x) = f(x) + (1/(2μ)) ‖g(x)‖²

Q_μ(x) is called the quadratic relaxation of the constrained
minimization problem; μ is the penalty parameter.
183 Wolfgang Bangerth
The quadratic penalty method

Why is Q_μ(x) called a relaxation of the constrained minimization
problem with f(x), g(x)?

Consider the original problem

    minimize f(x) = E(x, z) = Σ_i E_spring,i(x, z) + E_pot(x, z)
    ‖x − x_0‖ − L_rod = 0

with relaxation

    Q_μ(x) = E(x, z) + (1/(2μ)) (‖x − x_0‖ − L_rod)²

Replacing the fixed rod by a spring with constant D̃ would yield an
unconstrained problem with objective function

    f̃(x) = E(x, z) + (1/2) D̃ (‖x − x_0‖ − L_rod)²
184 Wolfgang Bangerth
The quadratic penalty method

Example: Qμ(x) with μ=infinity

185 Wolfgang Bangerth


The quadratic penalty method

Example: Qμ(x) with μ=0.01

186 Wolfgang Bangerth


The quadratic penalty method

Example: Qμ(x) with μ=0.001

187 Wolfgang Bangerth


The quadratic penalty method

Example: Qμ(x) with μ=0.00001

188 Wolfgang Bangerth


The quadratic penalty method

Algorithm:
Given x_0^start, {μ_t} → 0, {τ_t} → 0.
For t = 0, 1, 2, ...:
●   Find an approximation x_t^* to the (unconstrained) minimizer
    of Q_{μ_t}(x) that satisfies
        ‖∇Q_{μ_t}(x_t^*)‖ ≤ τ_t
    using x_t^start as starting point.
●   Set t = t+1, x_t^start = x_{t−1}^*.

Typical values:
    μ_t = c μ_{t−1},   c = 0.1 to 0.5
    τ_t = c τ_{t−1}
189 Wolfgang Bangerth
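The algorithm above can be sketched on a toy problem. The problem itself is hypothetical, and since f is quadratic and g linear here, each inner minimization of Q_μ reduces to a single linear solve:

```python
import numpy as np

# Quadratic penalty method for the hypothetical problem
#   min 1/2 ||x||^2   s.t.  x1 + x2 - 1 = 0,   solution x* = (1/2, 1/2).
a = np.array([1.0, 1.0])

def minimize_Q(mu):
    """Q_mu(x) = 1/2 x^T x + (1/(2 mu)) (a^T x - 1)^2 is quadratic,
    so its exact minimizer solves (I + (1/mu) a a^T) x = (1/mu) a."""
    return np.linalg.solve(np.eye(2) + np.outer(a, a) / mu, a / mu)

mu = 1.0
for t in range(8):
    x = minimize_Q(mu)     # infeasible for every finite mu ...
    mu *= 0.1              # ... but approaches x* as mu -> 0
assert np.allclose(x, [0.5, 0.5], atol=1e-4)
```

Note how the matrix I + (1/μ) a aᵀ acquires a huge eigenvalue as μ → 0: this is exactly the ill-conditioning listed among the negative properties on the next slide.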
The quadratic penalty method

Positive properties of the quadratic penalty method:
● Algorithms for unconstrained problems are readily available;
● Q_μ is at least as smooth as f, g_i for equality-constrained
  problems;
● usually only few steps are needed for each penalty parameter,
  since a good starting point is known;
● it is not really necessary to solve each unconstrained
  minimization to high accuracy.

Negative properties of the quadratic penalty method:
● Minimizers for finite penalty parameters are usually infeasible;
● the problem becomes more and more ill-conditioned near the
  optimum as the penalty parameter is decreased; the Hessian
  becomes large.
190 Wolfgang Bangerth
The quadratic penalty method

Theorem (Convergence): Let x_t^* be the exact minimizer of
Q_{μ_t}(x), and let μ_t → 0. Let f, g be once differentiable.
Then every limit point of the sequence {x_t^*}_{t=1,2,...} is a
solution of the constrained minimization problem

    minimize_{x ∈ D ⊂ ℝⁿ} f(x)
    g(x) = 0

191 Wolfgang Bangerth


The quadratic penalty method

Theorem (Convergence): Let x_t^* be approximate minimizers of
Q_{μ_t}(x) with

    ‖∇Q_{μ_t}(x_t^*)‖ ≤ τ_t

for a sequence τ_t → 0, and let μ_t → 0. Let f ∈ C², g ∈ C¹.
Then every limit point of the sequence {x_t^*}_{t=1,2,...} satisfies
certain first-order necessary conditions for solutions of the
constrained minimization problem

    minimize_{x ∈ D ⊂ ℝⁿ} f(x)
    g(x) = 0

192 Wolfgang Bangerth


Lagrange multipliers

Consider a (single) constraint g(x) as a function of x:

    g(x) = ‖x − x_0‖ − L_rod

(contour lines g(x,z) = −0.1, g(x,z) = 0, g(x,z) = 0.1 shown)


193 Wolfgang Bangerth
Lagrange multipliers

Now look at the objective function f(x):

    f(x) = Σ_{i=1}^{3} (1/2) D (‖x − x_i‖ − L_0)²
194 Wolfgang Bangerth
Lagrange multipliers

Now both f(x), g(x):

(contour plot of f with the constraint curve g(x,z) = 0)

195 Wolfgang Bangerth


Lagrange multipliers

Now both f(x), g(x):

Conclusion:
● The solution is where the isocontours are tangential to each other;
● that is, where the gradients of f and g are parallel;
● and where g(x) = 0.
196 Wolfgang Bangerth
Lagrange multipliers

Conclusion:
● The solution is where the gradients of f and g are parallel.
● The solution is where g(x) = 0.

In mathematical terms: The (local) solutions of

    minimize f(x) = E(x, z) = Σ_i E_spring,i(x, z) + E_pot(x, z)
    g(x) = ‖x − x_0‖ − L_rod = 0

are where the following conditions hold for some value of λ:

    ∇f(x) − λ ∇g(x) = 0
    g(x) = 0
197 Wolfgang Bangerth
Lagrange multipliers

Consider the same situation for three variables and two constraints:

    f(x) = f(x, y, z)
198 Wolfgang Bangerth
Lagrange multipliers

Constraint 1: Contours of g_1(x)
(surfaces g_1(x) = 0, g_1(x) = 1, g_1(x) = 2)

199 Wolfgang Bangerth


Lagrange multipliers

Constraint 2: Contours of g_2(x)
(surfaces g_2(x) = −1, g_2(x) = 0, g_2(x) = 1)

200 Wolfgang Bangerth


Lagrange multipliers

Constraints 1+2 at the same time
(surfaces g_1(x) = 0 and g_2(x) = 0)

201 Wolfgang Bangerth


Lagrange multipliers

Constraints 1+2 and f(x):
(the intersection curve of g_1(x) = 0 and g_2(x) = 0, with the local
solutions marked)

202 Wolfgang Bangerth


Lagrange multipliers

Conclusion:
● The solution is where the gradient of f can be written as a linear
  combination of the gradients of g_1, g_2.
● The solution is where g_1(x) = 0, g_2(x) = 0.
203 Wolfgang Bangerth
Lagrange multipliers

Generally (under certain conditions): The (local) solutions of

    minimize f(x),   f(x): ℝⁿ → ℝ
    g(x) = 0,        g(x): ℝⁿ → ℝ^{n_e}

are where the conditions

    ∇f(x) − λ·∇g(x) = 0
    g(x) = 0

hold for some vector of Lagrange multipliers λ ∈ ℝ^{n_e}.

Note: There are enough equations to determine both x and λ.


204 Wolfgang Bangerth
Lagrange multipliers

By introducing the Lagrangian

    L(x, λ) = f(x) − λ·g(x),   L: ℝⁿ × ℝ^{n_e} → ℝ

the conditions

    ∇f(x) − λ·∇g(x) = 0
    g(x) = 0

can conveniently be written as

    ∇_{x,λ} L(x, λ) = 0

205 Wolfgang Bangerth
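The system ∇_{x,λ} L = 0 can be solved numerically with Newton's method. A sketch for a hypothetical toy problem (minimize x + y on the unit circle; the starting guess is an assumption):

```python
import numpy as np

# Newton's method on grad L = 0 for the toy problem
#   min x + y   s.t.  g(x, y) = x^2 + y^2 - 1 = 0
# with L(x, y, lam) = x + y - lam * (x^2 + y^2 - 1).
def F(z):
    x, y, lam = z
    return np.array([1 - 2 * lam * x,        # dL/dx
                     1 - 2 * lam * y,        # dL/dy
                     -(x**2 + y**2 - 1)])    # dL/dlam = -g

def J(z):                                    # Jacobian of F
    x, y, lam = z
    return np.array([[-2 * lam, 0.0, -2 * x],
                     [0.0, -2 * lam, -2 * y],
                     [-2 * x, -2 * y, 0.0]])

z = np.array([-0.5, -0.5, -0.5])   # starting guess near the minimizer
for _ in range(20):
    z = z - np.linalg.solve(J(z), F(z))
x, y, lam = z
assert abs(x**2 + y**2 - 1) < 1e-10                      # feasible
assert np.allclose([x, y], [-np.sqrt(0.5), -np.sqrt(0.5)])
```

Note that the iterate converges to a stationary (saddle) point of L, not to a minimum of L; this distinction reappears in the discussion of SQP line searches later on.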


Constraint Qualification: Example 1

When can we characterize solutions by Lagrange multipliers?

Consider the problem

    minimize f(x) = (x+1)² + (y+1)² + z²,
    g_1(x) = x = 0,
    g_2(x) = y = 0

with solution
    x* = (0, 0, 0)^T

At the solution, we have

    ∇f(x*) = (2, 2, 0)^T,  ∇g_1(x*) = (1, 0, 0)^T,  ∇g_2(x*) = (0, 1, 0)^T

and consequently
    λ = (2, 2)^T
206 Wolfgang Bangerth
Constraint Qualification: Example 1

When can we characterize solutions by Lagrange multipliers?

Compare this with the problem

    minimize f(x) = (x+1)² + (y+1)² + z²,
    g_1(x) = x² = 0,
    g_2(x) = y² = 0

with the same solution
    x* = (0, 0, 0)^T

At the solution, we now have

    ∇f(x*) = (2, 2, 0)^T,   ∇g_1(x*) = ∇g_2(x*) = (0, 0, 0)^T

and there are no Lagrange multipliers so that

    ∇f(x*) = λ·∇g(x*)
207 Wolfgang Bangerth
Constraint Qualification: Example 2

When can we characterize solutions by Lagrange multipliers?

Consider the problem

    minimize f(x) = y,
    g_1(x) = (x−1)² + y² − 1 = 0,
    g_2(x) = (x+1)² + y² − 1 = 0

There is only a single point at which both constraints are
satisfied:
    x* = (0, 0)^T
208 Wolfgang Bangerth
Constraint Qualification: Example 2

When can we characterize solutions by Lagrange multipliers?

Consider the problem

    minimize f(x) = y,
    g_1(x) = (x−1)² + y² − 1 = 0,
    g_2(x) = (x+1)² + y² − 1 = 0

At the solution x* = (0, 0)^T, we have

    ∇f(x*) = (0, 1)^T,   ∇g_1(x*) = −∇g_2(x*) = (−2, 0)^T

and again there are no Lagrange multipliers so that

    ∇f(x*) = λ·∇g(x*)
209 Wolfgang Bangerth
Constraint Qualification: LICQ

Definition:
We say that at a point x the linear independence constraint
qualification (LICQ) is satisfied if

    {∇g_i(x)}_{i=1...n_e}

is a set of n_e linearly independent vectors.

Note: This is equivalent to saying that the matrix

    A = [ [∇g_1(x)]^T ; ... ; [∇g_{n_e}(x)]^T ]

has full row rank n_e.

210 Wolfgang Bangerth


First-order necessary conditions

Theorem:
Suppose that x* is a local solution of

    minimize f(x),   f(x): ℝⁿ → ℝ
    g(x) = 0,        g(x): ℝⁿ → ℝ^{n_e}

and suppose that at this point the LICQ holds. Then there exists a
unique Lagrange multiplier vector λ so that the following conditions
are satisfied:

    ∇f(x*) − λ·∇g(x*) = 0
    g(x*) = 0

Note: - These conditions are often referred to as the
Karush-Kuhn-Tucker (KKT) conditions.
- If the LICQ does not hold, there may still be a solution, but it
may not satisfy the KKT conditions!

211 Wolfgang Bangerth
First-order necessary conditions

Theorem (alternative form):
Suppose that x* is a local solution of

    minimize f(x),   f(x): ℝⁿ → ℝ
    g(x) = 0,        g(x): ℝⁿ → ℝ^{n_e}

and suppose that at this point the LICQ holds. Then

    ∇f(x*)·w = 0

for every vector tangential to all constraints,

    w ∈ {v : v·∇g_i(x*) = 0, i = 1...n_e}

or equivalently
    w ∈ Null(A)
212 Wolfgang Bangerth
Second-order necessary conditions

Theorem:
Suppose that x* is a local solution of

    minimize f(x),   f(x): ℝⁿ → ℝ
    g(x) = 0,        g(x): ℝⁿ → ℝ^{n_e}

and suppose that at this point the first-order necessary conditions
and the LICQ hold. Then

    w^T ∇²f(x*) w ≥ 0

for every vector tangential to all constraints,

    w ∈ Null(A)
213 Wolfgang Bangerth
Second-order sufficient conditions

Theorem:
Suppose that at a feasible point x the first-order necessary (KKT)
conditions hold. Suppose also that

    w^T ∇²f(x) w > 0

for all tangential vectors

    w ∈ Null(A),  w ≠ 0

Then x is a strict local minimizer of

    minimize f(x),   f(x): ℝⁿ → ℝ
    g(x) = 0,        g(x): ℝⁿ → ℝ^{n_e}

214 Wolfgang Bangerth


Characterizing the null space of A

All necessary and sufficient conditions required us to test
conditions like

    w^T ∇²f(x) w > 0

for all tangential vectors

    w ∈ Null(A),  w ≠ 0

In practice, this can be done as follows:

If the LICQ holds, then dim(Null(A)) = n − n_e. Thus, there exist
n − n_e vectors z_l so that A z_l = 0, and every such vector w can be
written as

    w = Z ω,  w ∈ ℝⁿ,  Z = [z_1, ..., z_{n−n_e}] ∈ ℝ^{n×(n−n_e)},  ω ∈ ℝ^{n−n_e}

This matrix Z can be computed from A, for example by a QR
decomposition.
215 Wolfgang Bangerth
Characterizing the null space of A

With this matrix Z, the following statements are equivalent:

First-order necessary conditions:
    ∇f(x)·w = 0  ∀ w ∈ Null(A)
    ⇔  [∇f(x)]^T Z = 0

Second-order necessary conditions:
    w^T ∇²f(x) w ≥ 0  ∀ w ∈ Null(A)
    ⇔  Z^T [∇²f(x)] Z is positive semidefinite

Second-order sufficient conditions:
    w^T ∇²f(x) w > 0  ∀ w ∈ Null(A), w ≠ 0
    ⇔  Z^T [∇²f(x)] Z is positive definite
216 Wolfgang Bangerth
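The QR-based construction of Z and the reduced-Hessian tests can be sketched as follows (the constraint matrix A and Hessian H below are made-up data):

```python
import numpy as np

def nullspace_basis(A):
    """Columns of Z span Null(A): take the last n - n_e columns of
    the full Q factor of A^T (assumes A has full row rank n_e)."""
    n_e, n = A.shape
    Q, _ = np.linalg.qr(A.T, mode='complete')
    return Q[:, n_e:]

# One linear constraint in R^3, i.e. A = [grad g]^T:
A = np.array([[1.0, 1.0, 1.0]])
Z = nullspace_basis(A)
assert np.allclose(A @ Z, 0)                 # A Z = 0, as required

# Reduced-Hessian test: is Z^T H Z positive definite?
H = np.diag([3.0, 2.0, 1.0])
red = Z.T @ H @ Z
assert np.all(np.linalg.eigvalsh(red) > 0)   # sufficient condition
```

The first n_e columns of Q span the range of Aᵀ, so the remaining columns are orthogonal to every constraint gradient, which is exactly the tangential subspace Null(A).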


Part 9

Quadratic programming

minimize f(x) = (1/2) x^T G x + d^T x + e
g(x) = A x − b = 0

217 Wolfgang Bangerth


Solving equality constrained problems

Consider a general nonlinear program with general nonlinear equality
constraints:

    minimize f(x),   f(x): ℝⁿ → ℝ
    g(x) = 0,        g(x): ℝⁿ → ℝ^{n_e}

Maybe we can solve such problems with an iterative scheme like we did
for unconstrained ones?

Analogy: For unconstrained nonlinear programs, we approximate f(x) in
each iteration by a quadratic model. For quadratic functions, we can
find the minimum in one step:

    min_x f(x) = (1/2) x^T H x + d^T x + e

    [∇²f(x_0)] p_0 = −∇f(x_0)   ⇔   x_1 = x_0 − H^{-1}(H x_0 + d) = −H^{-1} d
218 Wolfgang Bangerth
Solving equality constrained problems

For the general nonlinear constrained problem:
Assuming a condition like the LICQ holds, we know that we need to
find points {x, λ} at which

    ∇f(x) − λ·∇g(x) = 0
    g(x) = 0

Alternatively, we can write this as

    ∇_{x,λ} L(x, λ) = 0

with
    L(x, λ) = f(x) − λ·g(x),   L: ℝⁿ × ℝ^{n_e} → ℝ
219 Wolfgang Bangerth
Solving equality constrained problems

If we combine z={x , } then this can also be written as

∇ z L z = 0
which looks like the first-order necessary condition for
minimizing L(z). We then may think of finding solutions as
follows:
T

[
Start at a point z 0= x0 ,  0 ]

Compute search directions using [∇ 2z L z k ] pk=−∇ z L z k 

Compute a step length  k

Update z k1 =zk  k p k

Note: This is misleading, since we will in fact not look for


2
minima of L(z), but for saddle points. Consequently, ∇ z L z k 
220 is indefinite. Wolfgang Bangerth
Solving equality constrained problems

The equations we have to solve in each Newton iteration have the form

    [∇²_z L(z_k)] p_k = −∇_z L(z_k)

Because
    L(x, λ) = f(x) − λ·g(x),   L: ℝⁿ × ℝ^{n_e} → ℝ

the equations we have to solve read in component form:

    [ ∇²f(x_k) − Σ_i λ_{i,k} ∇²g_i(x_k)   −∇g(x_k) ;
                −∇g(x_k)^T                    0     ] [ p_k^x ; p_k^λ ]
        = −[ ∇f(x_k) − Σ_i λ_{i,k} ∇g_i(x_k) ;  −g(x_k) ]

221 Wolfgang Bangerth
Linear quadratic programs

Consider first the linear-quadratic case with a symmetric matrix G:

    f(x) = (1/2) x^T G x + d^T x + e,   f: ℝⁿ → ℝ
    g(x) = A x − b,   A ∈ ℝ^{n_e×n},  b ∈ ℝ^{n_e}

with
    L(x, λ) = f(x) − λ^T g(x)
            = (1/2) x^T G x + d^T x + e − λ^T (A x − b)

Then the first search direction needs to satisfy the (linear) set of
equations

    [∇²_z L(z_0)] p_0 = −∇_z L(z_0)

or equivalently:

    [ G  −A^T ;  −A  0 ] [ p_0^x ; p_0^λ ]
        = −[ G x_0 + d − A^T λ_0 ;  −(A x_0 − b) ]
222 Wolfgang Bangerth
Linear quadratic programs

Theorem 1: Assume that G is positive definite in all feasible
directions, i.e. Z^T G Z is positive definite, and that the matrix A
has full row rank. Then the KKT matrix

    [ G  −A^T ;  −A  0 ]

is nonsingular and the system

    [ G  −A^T ;  −A  0 ] [ p_0^x ; p_0^λ ]
        = −[ G x_0 + d − A^T λ_0 ;  −(A x_0 − b) ]

has a unique solution.

223 Wolfgang Bangerth


Linear quadratic programs

Theorem 2: Assume that G is positive definite in all feasible
directions, i.e. Z^T G Z is positive definite. Then the solution of
the linear-quadratic program

    min_x f(x) = (1/2) x^T G x + d^T x + e
    g(x) = A x − b = 0

is equal to the first iterate

    x_1 = x_0 + p_0^x

that results from solving the linear system

    [ G  −A^T ;  −A  0 ] [ p_0^x ; p_0^λ ]
        = −[ G x_0 + d − A^T λ_0 ;  −(A x_0 − b) ]

irrespective of the starting point x_0.
224 Wolfgang Bangerth
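Assembling and solving one such KKT system takes only a few lines; a sketch with made-up data G, d, A, b (starting from x_0 = 0, so the right-hand side reduces to (−d, b) scaled as below):

```python
import numpy as np

def solve_eqp(G, d, A, b):
    """Solve  min 1/2 x^T G x + d^T x  s.t.  A x = b  via the
    KKT system  [G, -A^T; -A, 0] (x, lambda) = (-d, -b)."""
    n, ne = G.shape[0], A.shape[0]
    K = np.block([[G, -A.T],
                  [-A, np.zeros((ne, ne))]])
    sol = np.linalg.solve(K, np.concatenate([-d, -b]))
    return sol[:n], sol[n:]

# Closest point on the line x1 + x2 = 1 to the origin:
G = 2.0 * np.eye(2)
d = np.zeros(2)
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x, lam = solve_eqp(G, d, A, b)
assert np.allclose(x, [0.5, 0.5])                 # feasible minimizer
assert np.allclose(G @ x + d - A.T @ lam, 0)      # stationarity
```

For larger problems the indefinite KKT matrix would be handled with a symmetric-indefinite or null-space factorization rather than a dense solve, but the block structure is the same.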
Linear quadratic programs

Theorem 3: Assume that G is positive definite in all feasible
directions, i.e. Z^T G Z is positive definite, and that the matrix A
has full row rank. Then the KKT matrix

    [ G  −A^T ;  −A  0 ]

has n positive and n_e negative eigenvalues, and no zero eigenvalues.
In other words, the KKT matrix is indefinite but nonsingular, and the
quadratic function

    L(x, λ) = (1/2) x^T G x + d^T x + e − λ^T (A x − b)

in {x, λ} has a single stationary point, which is a saddle point.

225 Wolfgang Bangerth


Part 10

Sequential Quadratic Programming


(SQP)

minimize f  x
g x = 0

226 Wolfgang Bangerth


The basic SQP algorithm

For z={x , }, the equality-constrained optimality conditions read

∇ z L z = 0
Like in the unconstrained Newton's method, sequential
quadratic programming uses the following basic iteration:

T

Start at a point z 0=[ x0 ,  0 ]
2

Compute search directions using [∇ z L z k] pk=−∇ z L z k 

Compute a step length  k


Update z k1 =zk  k p k

227 Wolfgang Bangerth


Computing the SQP search direction

The equations for the search direction are

    [ ∇²f(x_k) − Σ_i λ_{i,k} ∇²g_i(x_k)   −∇g(x_k) ;
                −∇g(x_k)^T                    0     ] [ p_k^x ; p_k^λ ]
        = −[ ∇f(x_k) − Σ_i λ_{i,k} ∇g_i(x_k) ;  −g(x_k) ]

which we will abbreviate as follows:

    [ W_k  −A_k^T ;  −A_k  0 ] [ p_k^x ; p_k^λ ]
        = −[ ∇f(x_k) − Σ_i λ_{i,k} ∇g_i(x_k) ;  −g(x_k) ]

with
    W_k = ∇²_x L(x_k, λ_k)
    A_k = ∇_x g(x_k) = −∇_x ∇_λ L(x_k, λ_k)
228 Wolfgang Bangerth
Computing the SQP search direction

Theorem 1: Assume that W_k is positive definite in all feasible
directions, i.e. Z_k^T W_k Z_k is positive definite, and that the
matrix A_k has full row rank. Then the KKT matrix of SQP step k

    [ W_k  −A_k^T ;  −A_k  0 ]

is nonsingular, and the system that determines the SQP search
direction

    [ W_k  −A_k^T ;  −A_k  0 ] [ p_k^x ; p_k^λ ]
        = −[ ∇_x L(x_k, λ_k) ;  −g(x_k) ]

has a unique solution.

Proof: Use Theorem 1 from Part 9.
Note: The columns of the matrix Z_k span the null space of A_k.
229 Wolfgang Bangerth
Computing the SQP search direction

Theorem 2: The solution of the SQP search direction system

    [ W_k  −A_k^T ;  −A_k  0 ] [ p_k^x ; p_k^λ ]
        = −[ ∇_x L(x_k, λ_k) ;  −g(x_k) ]

equals the minimizer of the problem

    min_{p_k^x} m_k(p_k^x) = L(x_k, λ_k) + ∇_x L(x_k, λ_k)^T p_k^x
                             + (1/2) (p_k^x)^T ∇²_x L(x_k, λ_k) p_k^x
    g(x_k) + ∇g(x_k)^T p_k^x = 0

that approximates the original nonlinear equality-constrained
minimization problem.

Proof: Essentially just use Theorem 2 from Part 9.

Note: This means that in each step SQP minimizes a quadratic model
of the Lagrangian, subject to linearized constraints.
230 Wolfgang Bangerth
Computing the SQP search direction

Theorem 3: The SQP iteration with full steps, i.e.

    [ W_k  −A_k^T ;  −A_k  0 ] [ p_k^x ; p_k^λ ]
        = −[ ∇_x L(x_k, λ_k) ;  −g(x_k) ]

    x_{k+1} = x_k + p_k^x,   λ_{k+1} = λ_k + p_k^λ

converges to the solution of the constrained nonlinear optimization
problem with quadratic order if (i) we start close enough to the
solution, (ii) the LICQ holds at the solution, and (iii) the matrix
Z*^T W* Z* is positive definite at the solution.

231 Wolfgang Bangerth
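One full SQP step can be sketched for the problem of Example 1 on the following slides, min (1/2)(x_1² + x_2²) subject to x_2 + 1 = 0. This is an illustrative sketch; the starting point is an assumption. Because the problem is linear-quadratic, a single step reaches the solution, as Theorem 2 of Part 9 predicts:

```python
import numpy as np

# Full-step SQP for  min 1/2 (x1^2 + x2^2)  s.t.  g(x) = x2 + 1 = 0,
# whose solution is x* = (0, -1), lambda* = -1.
def sqp_step(x, lam):
    gradL = x - lam * np.array([0.0, 1.0])  # grad f - lam * grad g
    g = np.array([x[1] + 1.0])
    A = np.array([[0.0, 1.0]])              # A_k = (grad g)^T
    W = np.eye(2)                           # hess_x L  (g is linear)
    K = np.block([[W, -A.T], [-A, np.zeros((1, 1))]])
    p = np.linalg.solve(K, np.concatenate([-gradL, g]))
    return x + p[:2], lam + p[2]

x, lam = np.array([2.0, 3.0]), 0.0
x, lam = sqp_step(x, lam)   # one step suffices here
assert np.allclose(x, [0.0, -1.0]) and np.isclose(lam, -1.0)
```

For genuinely nonlinear f and g the same step would be repeated, and W_k would be reassembled from the current (x_k, λ_k) in every iteration.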


How SQP works

Example 1:

    min f(x) = (1/2)(x_1² + x_2²)
    g(x) = x_2 + 1 = 0

The search direction is then computed using the step

    min m_k(p_k^x) = L(x_k, λ_k) + (x_{1,k}, x_{2,k} − λ_k)^T p_k^x
                     + (1/2) (p_k^x)^T p_k^x
    (x_{2,k} + 1) + (0, 1)^T p_k^x = 0

In other words, the linearized constraint enforces that

    p_{2,k}^x = −(x_{2,k} + 1)   →   x_{2,k+1} = x_{2,k} + p_{2,k}^x = −1

232 Wolfgang Bangerth
How SQP works

Example 2:

    min f(x)
    g(x) = x_2 − sin(x_1) = 0

The search direction is then computed by

    min m_k(p_k^x)
    (x_{2,k} − sin(x_{1,k})) + (−cos(x_{1,k}), 1)^T p_k^x = 0

In particular, if we are currently at (0, −2), this enforces

    −p_{1,k} + p_{2,k} = 2
233 Wolfgang Bangerth
How SQP works

Example 3:

    min f(x)
    g(x) = 0

If the constraint is already satisfied at a step, then the search
direction solves

    min m_k(p_k^x)
    g(x_k) + ∇g(x_k)^T p_k^x = ∇g(x_k)^T p_k^x = 0

In other words: The update step can only be tangential to the
constraint (along the linearized constraint)!

234 Wolfgang Bangerth


Hessian modifications for SQP

The SQP step

    [ W_k  −A_k^T ;  −A_k  0 ] [ p_k^x ; p_k^λ ]
        = −[ ∇_x L(x_k, λ_k) ;  −g(x_k) ]

is equivalent to the minimization problem

    min m_k(p_k^x) = L(x_k, λ_k) + ∇_x L(x_k, λ_k)^T p_k^x
                     + (1/2) (p_k^x)^T ∇²_x L(x_k, λ_k) p_k^x
    g(x_k) + ∇g(x_k)^T p_k^x = 0

or abbreviated:

    min m_k(p_k^x) = L_k + (∇_x f_k − A_k^T λ_k)^T p_k^x
                     + (1/2) (p_k^x)^T W_k p_k^x
    g(x_k) + A_k p_k^x = 0

From this, we may expect to get into trouble if the matrix
Z_k^T W_k Z_k is not positive definite.
235 Wolfgang Bangerth
Hessian modifications for SQP

If the matrix Z_k^T W_k Z_k in the SQP step

    [ W_k  −A_k^T ;  −A_k  0 ] [ p_k^x ; p_k^λ ]
        = −[ ∇_x L(x_k, λ_k) ;  −g(x_k) ]

is not positive definite, then there may not be a unique solution.

There exist a number of modifications to ensure that an alternative
step can be computed that satisfies

    [ W̃_k  −A_k^T ;  −A_k  0 ] [ p_k^x ; p_k^λ ]
        = −[ ∇_x L(x_k, λ_k) ;  −g(x_k) ]

instead.

236 Wolfgang Bangerth


Line search procedures for SQP

Motivation: For unconstrained problems, we used f(x) to measure
progress along a direction p_k computed from a quadratic model m_k
that approximates f(x).

Idea: For constrained problems, we could consider L(z) to measure
progress along a search direction p_k computed using the SQP step
based on the model m_k.

Problem 1: The Lagrangian L(z) is unbounded. E.g., for
linear-quadratic problems, L(z) is a quadratic of saddle-point form.
Indeed, we are now looking for this saddle point of L.

Consequence 1: We can't use L(z) to measure progress in line search
algorithms.
237 Wolfgang Bangerth
Line search procedures for SQP

Motivation: For unconstrained problems, we used f(x) to


measure progress along a direction pk computed from a
quadratic model mk that approximates f(x).

Idea: For constrained problems, we could consider L(z) to


measure progress along a search direction pk computed using
the SQP step based on the model mk.

Problem 2: Some step lengths may lead to a significant


reduction in f(x) but take us far away from constraints g(x)=0. Is
this better than a step that may increase f(x) but lands on the
constraint?

Consequence 2: We need a merit function that balances
decrease of f(x) with satisfying the constraint g(x).

238 Wolfgang Bangerth
Line search procedures for SQP

Solution: Drive the step length determination using a merit
function that contains both f(x) and g(x).

Examples: Commonly used choices are the l1 merit function

  φ_1(x) = f(x) + (1/μ) ∥g(x)∥_1

with

  1/μ = ∥λ_{k+1}∥_∞ + ρ,   ρ > 0

or Fletcher's merit function

  φ_F(x) = f(x) - λ(x)^T g(x) + (1/(2μ)) ∥g(x)∥²

with

  λ(x) = [A(x) A(x)^T]^{-1} A(x) ∇f(x)

239 Wolfgang Bangerth
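The l1 merit function is cheap to evaluate. A minimal sketch (the names merit_l1, f, g and the toy problem are illustrative, not from the slides) shows how the penalty term trades objective decrease against constraint violation:

```python
def merit_l1(f, g, x, mu):
    """l1 merit function  phi_1(x) = f(x) + (1/mu) * ||g(x)||_1."""
    return f(x) + sum(abs(gi) for gi in g(x)) / mu

# Toy problem: f(x) = x1^2 + x2^2 with the constraint g(x) = x1 + x2 - 2 = 0.
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: [x[0] + x[1] - 2.0]

x_feasible   = (1.0, 1.0)   # on the constraint: merit equals f
x_infeasible = (0.0, 0.0)   # violation 2 is weighted by 1/mu = 10
print(merit_l1(f, g, x_feasible, mu=0.1))    # 2.0
print(merit_l1(f, g, x_infeasible, mu=0.1))  # 20.0
```

The infeasible point has the smaller f but the larger merit value, which is precisely the balancing act Consequence 2 asked for.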


Line search procedures for SQP

Definition: A merit function is called exact if the constrained
optimizer of the problem

  min_x f(x)
  g(x) = 0

is also a minimizer of the merit function.

Note: Both the l1 and Fletcher's merit function

  φ_1(x) = f(x) + (1/μ) ∥g(x)∥_1

  φ_F(x) = f(x) - λ(x)^T g(x) + (1/(2μ)) ∥g(x)∥²

are exact for appropriate choices of μ, ρ.

240 Wolfgang Bangerth


Line search procedures for SQP

Theorem 4: The SQP search direction that satisfies

  [ W_k   -A_k^T ] [ p_k^x ]       [ ∇_x L(x_k, λ_k) ]
  [ -A_k    0    ] [ p_k^λ ]  = -  [ -g(x_k)         ]

is a direction of descent for both the l1 as well as Fletcher's
merit function if (i) the current point x_k is not a stationary point
of the equality-constrained problem, and (ii) the matrix Z_k^T W_k Z_k
is positive definite.

241 Wolfgang Bangerth


A practical SQP algorithm

Algorithm: For k=0,1,2,...

● Find a search direction using the KKT system

    [ W_k   -A_k^T ] [ p_k^x ]       [ ∇_x L(x_k, λ_k) ]
    [ -A_k    0    ] [ p_k^λ ]  = -  [ -g(x_k)         ]

● Determine the step length α_k using a backtracking line search,
  a merit function and the Wolfe (or Goldstein) conditions:

    φ(x_k + α p_k^x) ≤ φ(x_k) + c_1 α ∇φ(x_k)·p_k^x
    ∇φ(x_k + α p_k^x)·p_k^x ≥ c_2 ∇φ(x_k)·p_k^x

● Update the iterate using either

    x_{k+1} = x_k + α_k p_k^x,   λ_{k+1} = λ_k + α_k p_k^λ

  or

    x_{k+1} = x_k + α_k p_k^x,   λ_{k+1} = [A_{k+1} A_{k+1}^T]^{-1} A_{k+1} ∇f(x_{k+1})

242 Wolfgang Bangerth
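The algorithm above can be sketched end to end for a two-variable toy problem, min x1+x2 s.t. x1²+x2²-2=0, whose solution is x*=(-1,-1) with λ*=-1/2. This is a minimal sketch under simplifying assumptions: a plain Armijo backtracking on the l1 merit replaces the full Wolfe test, and all names (sqp, jac_g, ...) are my own, not from the slides.

```python
import numpy as np

# Toy equality-constrained problem:  min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0.
f      = lambda x: x[0] + x[1]
grad_f = lambda x: np.array([1.0, 1.0])
g      = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0])
jac_g  = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]]])

def sqp(x, lam, mu=0.1, c1=1e-4, tol=1e-8, max_iter=50):
    for _ in range(max_iter):
        Ak = jac_g(x)
        W = -2.0 * lam[0] * np.eye(2)            # Hessian of L = f - lam^T g
        grad_L = grad_f(x) - Ak.T @ lam
        if np.linalg.norm(grad_L) + abs(g(x)[0]) < tol:
            break
        # KKT system for the search direction (p_x, p_lambda)
        K = np.block([[W, -Ak.T], [-Ak, np.zeros((1, 1))]])
        p = np.linalg.solve(K, -np.concatenate([grad_L, -g(x)]))
        px, plam = p[:2], p[2:]
        # backtracking (Armijo) line search on the l1 merit function
        phi = lambda y: f(y) + abs(g(y)[0]) / mu
        D = grad_f(x) @ px - abs(g(x)[0]) / mu   # descent estimate for phi
        alpha = 1.0
        while alpha > 1e-12 and phi(x + alpha * px) > phi(x) + c1 * alpha * D:
            alpha *= 0.5
        x, lam = x + alpha * px, lam + alpha * plam
    return x, lam

x, lam = sqp(np.array([-1.2, -0.8]), np.array([-0.5]))
print(np.round(x, 4), np.round(lam, 4))   # approx [-1, -1] and [-0.5]
```

Starting close enough to the solution, the full Newton step α=1 is accepted in every iteration and convergence is quadratic.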
Parts 8-10

Summary of methods for
equality-constrained Problems

  minimize f(x)
  g_i(x) = 0,  i=1,...,n_e

243 Wolfgang Bangerth


Summary of methods

Two general methods for equality-constrained problems:

● Penalty methods (e.g. the quadratic penalty method)
  convert the constrained problem into an unconstrained one
  that can be solved with well-known techniques.
  However, they often lead to ill-conditioned problems.

● Lagrange multipliers reformulate the problem into one
  where we look for saddle points of a Lagrangian.
  Sequential quadratic programming (SQP) methods solve
  a sequence of quadratic programs with linear constraints,
  which are simple to solve.
  SQP methods are the most powerful methods.

244 Wolfgang Bangerth
Part 11

Inequality-constrained
Problems

  minimize f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

385 Wolfgang Bangerth


An example

Consider the example of the body suspended from a ceiling
with springs, but with an element of fixed minimal length
attached to a fixed point:

To find the position of the body we now need to solve the
following problem:

  minimize f(x) = E(x,z) = ∑_i E_spring,i(x,z) + E_pot(x,z)
  ∥x - x_0∥ - L_rod ≥ 0

386 Wolfgang Bangerth
An example

We can gain some insight into the problem by plotting the


energy as a function of (x,z) along with the constraint:

387 Wolfgang Bangerth


Definitions

We call this the standard form of inequality constrained
problems:

  minimize_{x ∈ D ⊂ R^n} f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

We will also frequently write this as follows, implying
(in)equality elementwise:

  minimize_{x ∈ D ⊂ R^n} f(x)
  g(x) = 0
  h(x) ≥ 0

388 Wolfgang Bangerth
Definitions

Let x* be the solution of

  minimize_{x ∈ D ⊂ R^n} f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

We call a constraint active if it is zero at the solution x*:

● Obviously, all equality constraints are active, since a
  solution needs to satisfy g(x*)=0
● Some inequality constraints may not be active if it so
  happens that h_i(x*) > 0 for some index i
● Other inequality constraints may be active if h_i(x*) = 0

We call the set of all active (equality and inequality)
constraints the active set.

389 Wolfgang Bangerth
Definitions

Note: If x* is the solution of

  minimize_{x ∈ D ⊂ R^n} f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

then it is also the solution of the problem

  minimize_{x ∈ D ⊂ R^n} f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) = 0,  i=1,...,n_i, i is active at x*

where we have dropped all inactive constraints and made
equalities out of all active constraints.

390 Wolfgang Bangerth
Definitions

A trivial reformulation of the problem is obtained by defining the
feasible set:

  Ω = {x ∈ R^n : g(x)=0, h(x)≥0}

Then the original problem is equivalently recast as

  minimize_{x ∈ D∩Ω ⊂ R^n} f(x)

Note 1: This reformulation is not of much practical interest.

Note 2: The feasible set can be continuous or discrete. It can
also be empty if the constraints are mutually incompatible. In
the following we will always assume that it is continuous and
non-empty.

391 Wolfgang Bangerth
The quadratic penalty method

Observation: The solution of

  minimize_{x ∈ D ⊂ R^n} f(x)
  g(x) = 0
  h(x) ≥ 0

must lie within the feasible set.

Idea: Let's relax the constraint and allow to search also
where g(x) is small but not zero, or where h(x) is small and
negative. However, make sure that the objective function
becomes very large if far away from the feasible set:

  minimize_{x ∈ D ⊂ R^n} Q_μ(x) = f(x) + (1/2μ)∥g(x)∥² + (1/2μ)∥[h(x)]^-∥²

Q_μ(x) is called the quadratic relaxation of the minimization
problem. μ is the penalty parameter, and

  [h(x)]^- = min{0, h(x)}

392 Wolfgang Bangerth
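The penalty function is straightforward to evaluate directly. The sketch below (the names Q, f, h and the toy constraint x ≥ 1 are illustrative, not from the slides) shows how the [h]^- clipping penalizes only the violated inequalities:

```python
def Q(x, f, gs, hs, mu):
    """Quadratic penalty  Q_mu(x) = f(x) + (1/2mu)||g||^2 + (1/2mu)||[h]^-||^2,
    where [h]^- = min(0, h) keeps only the violated inequality parts."""
    pen_g = sum(gi(x) ** 2 for gi in gs)
    pen_h = sum(min(0.0, hi(x)) ** 2 for hi in hs)
    return f(x) + (pen_g + pen_h) / (2.0 * mu)

# Toy problem: f(x) = x^2 with the single inequality h(x) = x - 1 >= 0.
f = lambda x: x * x
h = lambda x: x - 1.0
print(Q(0.0, f, [], [h], mu=0.1))   # infeasible: 0 + (1/0.2) * 1 = 5.0
print(Q(2.0, f, [], [h], mu=0.1))   # feasible: the penalty vanishes, Q = 4.0
```

As μ shrinks, the penalty wall at the boundary of the feasible set gets steeper, which is the source of the ill-conditioning discussed next.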
The quadratic penalty method

Replace the original constrained minimization problem

  minimize f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

by an unconstrained one with a quadratic penalty term:

  minimize_{x ∈ D ⊂ R^n} Q_μ(x) = f(x) + (1/2μ)∥g(x)∥² + (1/2μ)∥[h(x)]^-∥²

Example (plots of Q_μ for μ=0.1 and μ=0.01):

  minimize f(x) = sin(x)
  h_1(x) = x - 0 ≥ 0,
  h_2(x) = 1 - x ≥ 0.

393 Wolfgang Bangerth
The quadratic penalty method

Negative properties of the quadratic penalty method:

● minimizers for finite penalty parameters are usually
  infeasible;
● the problem becomes more and more ill-conditioned near the
  optimum as the penalty parameter is decreased; the Hessian
  becomes large;
● for inequality constrained problems, the penalty function is
  not twice differentiable at the constraints.

(Plots of Q_μ for μ=2, 0.2, 0.1, 0.02, 0.01 of the example
  minimize x_2²  s.t.  g(x) = x_2 + x_1² = 0.)

394 Wolfgang Bangerth
The logarithmic barrier method

Replace the original constrained minimization problem

  minimize f(x)
  h_i(x) ≥ 0,  i=1,...,n_i

by an unconstrained one with a logarithmic barrier term:

  minimize_{x ∈ D ⊂ R^n} Q_μ(x) = f(x) + μ ∑_{i=1}^{n_i} (-log h_i(x))

(Plots of Q_μ for μ=0.1 and μ=0.05 of the example
  minimize f(x) = sin(x)  s.t.  x ≥ 0, x ≤ 1.)

395 Wolfgang Bangerth
The logarithmic barrier method

Properties of successive minimization of

  minimize_x Q_μ(x) = f(x) - μ ∑_i log h_i(x)

● intermediate minimizers are feasible, since Q_μ(x)=∞ in the
  infeasible region; the method is an interior point method;
● Q_μ is smooth if the constraints are smooth;
● we need a feasible point as starting point;
● ill-conditioning and inadequacy of the Taylor expansion remain;
● Q_μ(x) may be unbounded from below if h(x) is unbounded;
● inclusion of equality constraints as before by the quadratic
  penalty method.

Summary:
This is an efficient method for the solution of constrained
problems.

396 Wolfgang Bangerth
Algorithms for penalty/barrier methods

Algorithm (exactly as for the equality constrained case):

Given x_0^start, {μ_t} ↓ 0, {τ_t} ↓ 0
For t=0,1,2,...:
● Find an approximation x_t* to the (unconstrained) minimizer
  of Q_{μ_t}(x) that satisfies

    ∥∇Q_{μ_t}(x_t*)∥ ≤ τ_t

  using x_t^start as starting point.
● Set t=t+1, x_t^start = x*_{t-1}

Typical values:  μ_t = c μ_{t-1},  c=0.1 to 0.5
                 τ_t = c τ_{t-1}

397 Wolfgang Bangerth
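The continuation loop above can be sketched for a one-dimensional toy problem, minimize x² subject to x ≥ 1 (whose solution is x*=1). Everything here is an illustrative assumption: the name penalty_continuation, and the inner solver, which is a fixed-step gradient descent that happens to suffice for this quadratic.

```python
def penalty_continuation(x, mu=1.0, tau=1e-2, c=0.5, rounds=20):
    """Successively minimize Q_mu for decreasing mu (and tolerance tau),
    warm-starting each round from the previous approximate minimizer.
    Toy problem: minimize x^2 subject to h(x) = x - 1 >= 0."""
    def dQ(x, mu):                       # gradient of the penalized objective
        viol = min(0.0, x - 1.0)         # [h(x)]^-
        return 2.0 * x + viol / mu
    for _ in range(rounds):
        step = 1.0 / (2.0 + 1.0 / mu)    # <= 1/L for this piecewise quadratic
        while abs(dQ(x, mu)) > tau:      # inner solve to tolerance tau
            x -= step * dQ(x, mu)
        mu, tau = c * mu, c * tau        # tighten penalty and tolerance
    return x

print(round(penalty_continuation(0.0), 4))   # close to the constrained optimum x* = 1
```

Note how each unconstrained minimizer x_μ = 1/(1+2μ) is infeasible but approaches x*=1 as μ → 0, exactly as the slides warn.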
The exact penalty method

Previous methods suffered from the fact that minimizers of
Q_μ(x) for finite μ are not optima of the original problem.

Solution: Use

  minimize_x φ_1(x) = f(x) + (1/μ) [ ∑_i |g_i(x)| + ∑_i |[h_i(x)]^-| ]

(Plots of φ_1 for 1/μ = 1, 4, 10 of the example
  minimize f(x) = sin(x)  s.t.  x ≥ 0, x ≤ 1.)

398 Wolfgang Bangerth
The exact penalty method

Properties of the exact penalty method:

● for a sufficiently small penalty parameter μ, the optimum of
  the modified problem is the optimum of the original one;
● possibly only one iteration in the penalty parameter is needed
  if the size of μ is known in advance;
● this is a non-smooth problem!

This is an efficient method
if (but only if!) a solver for nonsmooth problems is available!

399 Wolfgang Bangerth


Part 12

Theory of
Inequality-Constrained
Problems

  minimize f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

400 Wolfgang Bangerth


Lagrange multipliers

Consider a (single) constraint h(x) as a function for all x:

(Contour plot showing the levels h(x,z)=-0.1, h(x,z)=0, h(x,z)=0.1 of)

  h(x) = ∥x - x_0∥ - L_rod ≥ 0

401 Wolfgang Bangerth
Lagrange multipliers

Now look at the objective function f(x):

  f(x) = ∑_{i=1}^{3} (1/2) D (∥x - x_i∥ - L_0)²

402 Wolfgang Bangerth
Lagrange multipliers

Both f(x), h(x) for the case of a rod of minimal length 20cm:

infeasible
region

h(x,z)=0 with Lrod=20cm

403 Wolfgang Bangerth


Lagrange multipliers

Could this be a solution x*?

(Figure: x* with ∇h(x*) and ∇f(x*) pointing in opposite directions.)

Answer: No – moving into the feasible direction would also
reduce f(x).
Rather, the solution will equal the unconstrained one, and the
inequality constraint will be inactive at the solution.

404 Wolfgang Bangerth
Lagrange multipliers

Both f(x), h(x) for the case of a rod of minimal length 35cm:

infeasible
region

h(x,z)=0 with Lrod=35cm

405 Wolfgang Bangerth


Lagrange multipliers

Could this be a solution x*?

(Figure: x* with ∇f(x*) and ∇h(x*) pointing in the same direction.)

Answer: Yes – moving into the feasible direction would increase f(x).

Note: The gradients of h and f are parallel and point in the same
direction.

406 Wolfgang Bangerth
Lagrange multipliers

Conclusion:
● The solution can be where the constraint is not active
● If the constraint is active at the solution: the gradients of f
  and h are parallel, but not antiparallel

In mathematical terms: The (local) solutions of

  minimize f(x) = E(x,z) = ∑_i E_spring,i(x,z) + E_pot(x,z)
  h(x) = ∥x - x_0∥ - L_rod ≥ 0

are where one of the following conditions holds for some μ:

  ∇f(x) - μ·∇h(x) = 0             ∇f(x) = 0
  h(x) = 0                  or    h(x) > 0
  μ ≥ 0

407 Wolfgang Bangerth
Lagrange multipliers

Conclusion, take 2: Solutions are where either

  ∇f(x) - μ·∇h(x) = 0             ∇f(x) = 0
  h(x) = 0                  or    h(x) > 0
  μ ≥ 0

which could also be written like so:

  ∇f(x) - μ·∇h(x) = 0             ∇f(x) - μ·∇h(x) = 0
  h(x) = 0                  or    h(x) > 0
  μ ≥ 0                           μ = 0
  (constraint is active)          (constraint is inactive)

408 Wolfgang Bangerth


Lagrange multipliers

Conclusion, take 3: Solutions are where

  ∇f(x) - μ·∇h(x) = 0             ∇f(x) - μ·∇h(x) = 0
  h(x) = 0                  or    h(x) > 0
  μ ≥ 0                           μ = 0

or, written differently:

  ∇f(x) - μ·∇h(x) = 0
  h(x) ≥ 0
  μ ≥ 0
  μ h(x) = 0

Note: The last condition is called complementarity.

409 Wolfgang Bangerth
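The four conditions of "take 3" can be checked mechanically at any candidate point. A minimal sketch for a single inequality constraint (the function name and the toy problem, minimize (x+1)² s.t. x ≥ 0, are illustrative assumptions):

```python
def kkt_single_inequality(x, mu, grad_f, h, grad_h, tol=1e-9):
    """Check  grad f - mu*grad h = 0,  h >= 0,  mu >= 0,  mu*h = 0
    (stationarity + feasibility + sign condition + complementarity)."""
    return (abs(grad_f(x) - mu * grad_h(x)) < tol
            and h(x) >= -tol
            and mu >= -tol
            and abs(mu * h(x)) < tol)

# minimize (x+1)^2  s.t.  h(x) = x >= 0: the optimum x* = 0 is on the constraint
grad_f = lambda x: 2.0 * (x + 1.0)
h      = lambda x: x
grad_h = lambda x: 1.0

print(kkt_single_inequality(0.0, 2.0, grad_f, h, grad_h))   # True: x*=0, mu*=2
print(kkt_single_inequality(1.0, 0.0, grad_f, h, grad_h))   # False: not stationary
```

Complementarity is what encodes the "either active, or multiplier zero" case distinction in a single equation.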
Lagrange multipliers

Same idea, but with two minimum length elements:

infeasible
region

h1(x,z)=0 h2(x,z)=0

410 Wolfgang Bangerth


Lagrange multipliers

Could this be a solution x*?

(Figure: x* with ∇h_1(x*) and ∇f(x*) antiparallel.)

Answer: No – moving into the feasible direction would decrease f(x).

Note: The gradient of f is antiparallel to the gradient of h_1. h_2 is
an inactive constraint, so it doesn't matter here.

411 Wolfgang Bangerth
Lagrange multipliers

Same idea, but with two different minimum length elements:

infeasible
region

h1(x,z)=0 h2(x,z)=0

412 Wolfgang Bangerth


Lagrange multipliers

Could this be a solution x*?

(Figure: x* with ∇f(x*) between ∇h_1(x*) and ∇h_2(x*).)

Answer: Yes – moving into the feasible direction would increase f(x).

Note: The gradient of f is a linear combination (with positive
multiples) of the gradients of h_1 and h_2.

413 Wolfgang Bangerth
Constraint Qualification: LICQ

Definition:
We say that at a point x the linear independence constraint
qualification (LICQ) is satisfied if

  {∇g_i(x)}_{i=1...n_e},  {∇h_i(x)}_{i=1...n_i, i active at x}

is a set of linearly independent vectors.

Note: This is equivalent to saying that the matrix of gradients of all
active constraints,

      [ [∇g_1(x)]^T                ]
      [ ⋮                          ]
  A = [ [∇g_{n_e}(x)]^T            ]
      [ [∇h_{first active i}(x)]^T ]
      [ ⋮                          ]
      [ [∇h_{last active i}(x)]^T  ]

has full row rank (i.e. its rank is n_e + # of active ineq. constraints).

414 Wolfgang Bangerth
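The full-row-rank formulation of the LICQ translates directly into a rank test. A minimal sketch (the function name licq_holds is an illustrative choice):

```python
import numpy as np

def licq_holds(grads_active):
    """LICQ: the gradients of all active constraints are linearly
    independent, i.e. the matrix A of gradients has full row rank."""
    A = np.asarray(grads_active, dtype=float)
    return np.linalg.matrix_rank(A) == A.shape[0]

# Two active constraints in R^2 with independent gradients: LICQ holds.
print(licq_holds([[1.0, 0.0], [0.0, 1.0]]))   # True
# A duplicated constraint gradient violates the LICQ.
print(licq_holds([[1.0, 0.0], [2.0, 0.0]]))   # False
```

In particular, more than n active constraints in R^n always violate the LICQ, since n+1 vectors in R^n cannot be linearly independent.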
First-order necessary conditions

Theorem:
Suppose that x* is a local solution of

  minimize f(x),   f(x): ℝ^n → ℝ
  g(x) = 0,        g(x): ℝ^n → ℝ^{n_e}
  h(x) ≥ 0,        h(x): ℝ^n → ℝ^{n_i}

and suppose that at this point the LICQ holds. Then there exist
unique Lagrange multipliers so that these conditions are satisfied:

  ∇f(x) - λ·∇g(x) - μ·∇h(x) = 0
  g(x) = 0
  h(x) ≥ 0
  μ ≥ 0
  μ_i h_i(x) = 0

Note: These are often called the Karush-Kuhn-Tucker (KKT)
conditions.

415 Wolfgang Bangerth
First-order necessary conditions

Note: By introducing a Lagrangian

  L(x,λ,μ) = f(x) - λ^T g(x) - μ^T h(x)

the first two of the necessary conditions

  ∇f(x) - λ·∇g(x) - μ·∇h(x) = 0
  g(x) = 0
  h(x) ≥ 0
  μ ≥ 0
  μ_i h_i(x) = 0

follow from requiring that ∇_z L(z) = 0 with z={x,λ,μ}, but not the
rest.

Consequence: We cannot hope to find simple Newton-based
methods like SQP to solve inequality-constrained problems.

416 Wolfgang Bangerth
First-order necessary conditions

Note: The necessary conditions

  ∇f(x) - λ·∇g(x) - μ·∇h(x) = 0
  g(x) = 0
  h(x) ≥ 0
  μ ≥ 0
  μ_i h_i(x) = 0

imply that at x* there is a unique set of (active) Lagrange
multipliers so that

  ∇f(x) = A^T [λ; μ]_active

where A is the matrix of gradients of active constraints. An
alternative way of saying this is

  ∇f(x) ∈ span(rows of A)

However, the opposite is not true: the multipliers must also
satisfy μ_i ≥ 0.

417 Wolfgang Bangerth
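Under the LICQ, A has full row rank and the multipliers can be recovered from ∇f(x*) = A^T m as m = (A A^T)^{-1} A ∇f(x*), mirroring the formula used for Fletcher's merit function earlier. A minimal sketch (the function name active_multipliers is an illustrative choice):

```python
import numpy as np

def active_multipliers(A, grad_f):
    """Recover the (unique, under LICQ) multipliers m from
    grad f(x*) = A^T m, i.e. m = (A A^T)^{-1} A grad f(x*)."""
    return np.linalg.solve(A @ A.T, A @ grad_f)

# Example in R^2: one active constraint with gradient (1, 1), and
# grad f(x*) = (2, 2) = 2 * (1, 1)^T, so the multiplier must be 2.
A = np.array([[1.0, 1.0]])
m = active_multipliers(A, np.array([2.0, 2.0]))
print(m)   # [2.]
```

The remaining work is then exactly the sign check: a candidate x* is only a KKT point if the components of m belonging to inequality constraints come out nonnegative.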
First-order necessary conditions

A more refined analysis: Consider the constraints

  h_1(x) = x_2 - a x_1 ≥ 0,   h_2(x) = x_2 + a x_1 ≥ 0

Intuitively (consider the isocontours), the vertex point x* is optimal
if the direction of steepest ascent ∇f(x) is a member of the family
of red vectors above. That is, let F_0 be the cone

  F_0(x*) = {w ∈ ℝ^n : w = μ_1 ∇h_1(x*) + μ_2 ∇h_2(x*), μ_1 ≥ 0, μ_2 ≥ 0}

Then x* is optimal if

  ∇f(x*) ∈ F_0(x*)

418 Wolfgang Bangerth
First-order necessary conditions

A more refined analysis: Consider the constraints

  h_1(x) = x_2 - a x_1 ≥ 0,   h_2(x) = x_2 + a x_1 ≥ 0

Note: We can write things slightly differently if we define

  F_1(x*) = {w ∈ ℝ^n : w^T a ≥ 0 ∀ a ∈ F_0(x*)}

i.e. the set of vectors that form angles of at most 90 degrees with
all vectors in F_0. This set can also be written as

  F_1(x*) = {w ∈ ℝ^n : w^T ∇h_1(x*) ≥ 0, w^T ∇h_2(x*) ≥ 0}

419 Wolfgang Bangerth
First-order necessary conditions

A more refined analysis: If the problem also has equality
constraints

  g(x) = 0,   h_1(x) ≥ 0,   h_2(x) ≥ 0

all of which are active at x*, then the cone F_1 is

  F_1(x*) = {w ∈ ℝ^n : w^T ∇g(x*) = 0, w^T ∇h_1(x*) ≥ 0, w^T ∇h_2(x*) ≥ 0}

In general:

  F_1(x*) = {w ∈ ℝ^n : w^T ∇g_i(x*) = 0, i=1,...,n_e
                       w^T ∇h_i(x*) ≥ 0, i=1,...,n_i, constraint i is active at x*}

Note: This is the cone of all feasible directions.

420 Wolfgang Bangerth


First-order necessary conditions

Theorem (a different version of the first order necessary
conditions): If x* is a local solution and if the LICQ holds at this
point, then

  ∇f(x*)^T w ≥ 0   ∀ w ∈ F_1(x*)

In other words: Whatever direction w in F_1 we go into from x*, the
objective function to first order stays constant or increases.

Note: This is a necessary condition, but not sufficient. If f(x) stays
constant to first order it may still decrease in higher order Taylor
terms to make x* a local maximum or saddle point. But, if x* is a
solution, then the condition above has to be satisfied.

421 Wolfgang Bangerth
Second-order necessary conditions

Definition:
Let x* be a local solution of an inequality constrained problem
satisfying

  ∇f(x) - λ·∇g(x) - μ·∇h(x) = 0
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i
  μ_i ≥ 0,     i=1,...,n_i
  μ_i h_i(x) = 0,  i=1,...,n_i

We say that strict complementarity holds if for each inequality
constraint i exactly one of the following conditions is true:

  μ_i = 0
  h_i(x*) = 0

In other words, we require that the Lagrange multiplier is nonzero
for all active inequality constraints.

422 Wolfgang Bangerth
Second-order necessary conditions

Definition:
Let x* be a local solution and assume that strict complementarity
holds. Then define as before

  F_1(x*) = {w ∈ ℝ^n : w^T ∇g_i(x*) = 0, i=1,...,n_e
                       w^T ∇h_i(x*) ≥ 0, i=1,...,n_i, constraint i is active at x*}

and the subspace of all tangential directions as

  F_2(x*) = {w ∈ ℝ^n : w^T ∇g_i(x*) = 0, i=1,...,n_e
                       w^T ∇h_i(x*) = 0, i=1,...,n_i, constraint i is active at x*}

(Figure: the cone F_1(x*) and the subspace F_2(x*).)

423 Wolfgang Bangerth
Second-order necessary conditions

Note:
The subspace of all tangential directions

  F_2(x*) = {w ∈ ℝ^n : w^T ∇g_i(x*) = 0, i=1,...,n_e
                       w^T ∇h_i(x*) = 0, i=1,...,n_i, constraint i is active at x*}

can be trivial (i.e. contain only the zero vector) if n or more
constraints are active at x*.

Example:
(Figure: two active constraints in 2d.)
Here, F_1 is a nonempty set, but
F_2 contains only the zero vector.

424 Wolfgang Bangerth


Second-order necessary conditions

Theorem (necessary conditions):
Let x* be a local solution that satisfies the first order necessary
conditions with unique Lagrange multipliers. Assume that strict
complementarity holds. Then

  w^T ∇_x² L(x*,λ*,μ*) w
    = w^T [ ∇_x² f(x*) - λ*^T ∇_x² g(x*) - μ*^T ∇_x² h(x*) ] w ≥ 0
                                                   ∀ w ∈ F_2(x*)

Note: This means that f(x) cannot "curve down" to second order
along tangential directions. The first order conditions imply that it
doesn't "slope" in these directions.

425 Wolfgang Bangerth
Second-order sufficient conditions

Theorem (sufficient conditions):
Let x* be a point that satisfies the first order necessary
conditions with unique Lagrange multipliers, and assume that
strict complementarity holds. Then x* is a strict local minimizer if

  w^T ∇_x² L(x*,λ*,μ*) w
    = w^T [ ∇_x² f(x*) - λ*^T ∇_x² g(x*) - μ*^T ∇_x² h(x*) ] w > 0
                                            ∀ w ∈ F_2(x*), w ≠ 0

Note: This means that f(x) actually "curves up" in a neighborhood
of x*, at least in tangential directions!
For all other directions, we know that f(x) slopes up from the first
order necessary conditions.

426 Wolfgang Bangerth
Second-order sufficient conditions

Remark:
If strict complementarity holds, then the definition

  F_2(x*) = {w ∈ ℝ^n : w^T ∇g_i(x*) = 0, i=1,...,n_e
                       w^T ∇h_i(x*) = 0, i=1,...,n_i, constraint i is active at x*}

is equivalent to

  F_2(x*) = null A(x*)

with the matrix of gradients of active constraints A. If A does have
a null space, then the second order necessary and sufficient
conditions can also be written as

  Z^T ∇_x² L(x*,λ*,μ*) Z  is positive semidefinite
  Z^T ∇_x² L(x*,λ*,μ*) Z  is positive definite

respectively, where the columns of Z are a basis of the null space
of A.

427 Wolfgang Bangerth
Second-order necessary conditions

Definition (if strict complementarity does not hold):
Let x* be a local solution at which the KKT conditions with unique
Lagrange multipliers hold. Then define

  F_2(x*,μ*) = {w ∈ ℝ^n : w^T ∇g_i(x*) = 0, i=1,...,n_e
                          w^T ∇h_i(x*) = 0, i=1,...,n_i, constraint i active and μ_i* > 0
                          w^T ∇h_i(x*) ≥ 0, i=1,...,n_i, constraint i active and μ_i* = 0}

(Figure: the cone F_1(x*) and the set F_2(x*,μ*).)

428 Wolfgang Bangerth


Second-order sufficient conditions

Theorem (sufficient conditions w/o strict complementarity):
Let x* be a point that satisfies the first order necessary
conditions with unique Lagrange multipliers, where strict
complementarity need not hold. Then x* is a strict local
minimizer if

  w^T ∇_x² L(x*,λ*,μ*) w
    = w^T [ ∇_x² f(x*) - λ*^T ∇_x² g(x*) - μ*^T ∇_x² h(x*) ] w > 0
                                       ∀ w ∈ F_2(x*,μ*), w ≠ 0

Note: This now means that f(x) actually "curves up" in a
neighborhood of x*, at least in tangential directions plus all
those directions for which we can't infer anything from the first
order conditions!

429 Wolfgang Bangerth


Part 13

Active Set Methods for
Convex Quadratic Programs

  minimize f(x) = (1/2) x^T G x + x^T d + e
  g_i(x) = a_i^T x - b_i = 0,  i=1,...,n_e
  h_i(x) = α_i^T x - β_i ≥ 0,  i=1,...,n_i

430 Wolfgang Bangerth


General idea

Note:
Recall that if W* is the set of active (equality and inequality)
constraints at the solution x*, then the solution of

  minimize f(x) = (1/2) x^T G x + x^T d + e
  g_i(x) = a_i^T x - b_i = 0,  i=1,...,n_e
  h_i(x) = α_i^T x - β_i ≥ 0,  i=1,...,n_i

equals the solution of the following QP:

  minimize f(x) = (1/2) x^T G x + x^T d + e
  g_i(x) = a_i^T x - b_i = 0,  i=1,...,n_e
  h_i(x) = α_i^T x - β_i = 0,  i=1,...,n_i, i ∈ W*

431 Wolfgang Bangerth
General idea

Definition: Let

      [ a_1^T     ]        [ b_1     ]
      [ ⋮         ]        [ ⋮       ]
  A = [ a_{n_e}^T ]    B = [ b_{n_e} ]
      [ α_1^T     ]        [ β_1     ]
      [ ⋮         ]        [ ⋮       ]
      [ α_{n_i}^T ]        [ β_{n_i} ]

and let A|_W, B|_W be the restrictions of A and B to the rows of the
equality constraints plus the inequality constraints in the set W.
Then the solution of the inequality-constrained QP equals the
solution of the following QP:

  minimize f(x) = (1/2) x^T G x + x^T d + e
  A|_{W*} x - B|_{W*} = 0

432 Wolfgang Bangerth


General idea

Consequence: If we knew the active set W* at the solution, we
could just solve the linearly constrained QP

  minimize f(x) = (1/2) x^T G x + x^T d + e
  A|_{W*} x - B|_{W*} = 0

and be done in one step.

Problem: Knowing the exact active set W* requires knowing the
solution x* because W* is the set of all equality constraints plus
those constraints for which

  h_i(x*) = 0

Solution: Solve a sequence of QPs using working sets W_k that we
iteratively refine until we have the exact active set W*.

433 Wolfgang Bangerth
The active set algorithm

Algorithm:
● Choose an initial working set W_0 and feasible point x_0
● For k=0,1,2,...:
  - Find the search direction p_k from x_k to the solution x_{k+1} of the QP

      minimize f(x) = (1/2) x^T G x + x^T d + e
      A|_{W_k} x - B|_{W_k} = 0

  - If p_k=0 and all μ_i ≥ 0 for constraints in W_k, then stop
  - Else if p_k=0 but there are μ_i < 0, then drop the inequality with
    the most negative μ_i from W_k to obtain W_{k+1}
  - Else if x_k + p_k is feasible, then set x_{k+1} = x_k + p_k
  - Otherwise, set x_{k+1} = x_k + α_k p_k with

      α_k = min( 1, min_{i ∉ W_k, α_i^T p_k < 0} (β_i - α_i^T x_k)/(α_i^T p_k) )

    and add the most blocking constraint to W_{k+1}

434 Wolfgang Bangerth
The active set algorithm

Example:

  minimize f(x) = (x_1 - 1)² + (x_2 - 2.5)²

       [  1 -2 ]       [ -2 ]
       [ -1 -2 ]       [ -6 ]
       [ -1  2 ] x  -  [ -2 ]  ≥ 0        (constraints h_1,...,h_5)
       [  1  0 ]       [  0 ]
       [  0  1 ]       [  0 ]

Choose as initial working set W_0={3,5} and as starting point
x_0=(2,0)^T.

435 Wolfgang Bangerth
The active set algorithm

Example: Step 0

W_0={3,5}, x_0=(2,0)^T.
Then: p_0=(0,0)^T because no other point is feasible for W_0.

  ∇f(x_0) - A|_{W_0}^T μ|_{W_0} = (2, -5)^T - μ_3 (-1, 2)^T - μ_5 (0, 1)^T = 0

implies

  μ_3 = -2,   μ_5 = -1

Consequently: W_1={5}, x_1=(2,0)^T.

436 Wolfgang Bangerth
The active set algorithm

Example: Step 1

W_1={5}, x_1=(2,0)^T.
Then: p_1=(-1,0)^T leads to the minimum along the only active
constraint. There are no blocking constraints on the way to the
point x_{k+1} = x_k + p_k.

Consequently: W_2={5}, x_2=(1,0)^T.

437 Wolfgang Bangerth
The active set algorithm

Example: Step 2

W_2={5}, x_2=(1,0)^T.
Then: p_2=(0,0)^T because we are at the minimum on the active
constraint.

  ∇f(x_2) - A|_{W_2}^T μ|_{W_2} = (0, -5)^T - μ_5 (0, 1)^T = 0

implies μ_5 = -5.

Consequently: W_3={}, x_3=(1,0)^T.

438 Wolfgang Bangerth
The active set algorithm

Example: Step 3

W_3={}, x_3=(1,0)^T.
Then: p_3=(0,2.5)^T, but this leads out of the feasible region. The
first blocking constraint is inequality 1, and the maximal step
length is α_3 = 0.6.

Consequently: W_4={1}, x_4=(1,1.5)^T.

439 Wolfgang Bangerth
The active set algorithm

Example: Step 4

W_4={1}, x_4=(1,1.5)^T.
Then: p_4=(0.4,0.2)^T is the minimizer along the sole constraint.
There are no blocking constraints to get there.

Consequently: W_5={1}, x_5=(1.4,1.7)^T.

440 Wolfgang Bangerth
The active set algorithm

Example: Step 5

W_5={1}, x_5=(1.4,1.7)^T.
Then: p_5=(0,0)^T because we are already at the minimizer on the
constraint. Furthermore,

  ∇f(x_5) - A|_{W_5}^T μ|_{W_5} = (0.8, -1.6)^T - μ_1 (1, -2)^T = 0

implies μ_1 = 0.8 ≥ 0.

Consequently: This is the solution.

441 Wolfgang Bangerth
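The six steps above can be reproduced by a small implementation of the algorithm. This is a sketch under simplifying assumptions (inequality constraints only, no Hessian modifications; the function name active_set_qp and the 0-based constraint numbering are my choices):

```python
import numpy as np

def active_set_qp(G, d, Aineq, b, x, W, max_iter=50, tol=1e-10):
    """Primal active-set method for  min 1/2 x^T G x + d^T x
    s.t. Aineq x >= b, started at a feasible x with working set W."""
    n = len(x)
    for _ in range(max_iter):
        idx = sorted(W)
        Aw, m = Aineq[idx], len(idx)
        # Equality-constrained subproblem via its KKT system:
        #   G x + d - Aw^T mu = 0,   Aw x = b_W
        K = np.block([[G, -Aw.T], [Aw, np.zeros((m, m))]]) if m else G
        rhs = np.concatenate([-d, b[idx]]) if m else -d
        sol = np.linalg.solve(K, rhs)
        p = sol[:n] - x
        if np.linalg.norm(p) < tol:
            mu = sol[n:]
            if m == 0 or mu.min() >= -tol:
                return x, dict(zip(idx, mu))       # KKT point found
            W.remove(idx[int(mu.argmin())])        # drop most negative mu
        else:
            alpha, blocking = 1.0, None            # step-length / ratio test
            for i in range(len(b)):
                if i not in W and Aineq[i] @ p < -tol:
                    a = (b[i] - Aineq[i] @ x) / (Aineq[i] @ p)
                    if a < alpha:
                        alpha, blocking = a, i
            x = x + alpha * p
            if blocking is not None:
                W.add(blocking)
    raise RuntimeError("no convergence")

# The slides' example: min (x1-1)^2 + (x2-2.5)^2 with five inequalities
G = 2.0 * np.eye(2)
d = np.array([-2.0, -5.0])
Aineq = np.array([[1., -2.], [-1., -2.], [-1., 2.], [1., 0.], [0., 1.]])
b = np.array([-2., -6., -2., 0., 0.])
# W0 = {3,5} in the slides' 1-based numbering is {2,4} here (0-based)
x, mu = active_set_qp(G, d, Aineq, b, np.array([2., 0.]), {2, 4})
print(x, mu)   # approx [1.4 1.7], multiplier 0.8 on constraint 1 (index 0)
```

Running it visits exactly the iterates of steps 0-5: the working set shrinks from {3,5} to {5} to {}, then grows to {1}, and the method stops at x*=(1.4,1.7) with μ_1=0.8.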
The active set algorithm
Theorem:
If G is strictly positive definite (i.e. the objective function is strictly
convex), then Wk≠Wl for k ≠ l.
Consequently (because there are only finitely many possible
working sets), the active set algorithm terminates in a finite
number of steps.

Note:
In practice it may be that G is indefinite, and that for some
iterations the matrix ZkTGZk is indefinite as well. We know that at
the solution, Z*TGZ* is positive semidefinite, however. In that case,
we can't guarantee termination or convergence.

There are, however, Hessian modification techniques to deal with


this situation.
442 Wolfgang Bangerth
The active set algorithm
Remark:
In the active set method, we only change the working set Wk by at
most one element in each iteration.

One may be tempted to remove all constraints with negative


Lagrange multipliers at once, or add several constraints at the
same time when they become active.

However, we can then no longer guarantee that Wk≠Wl for k ≠ l


and cycling may happen, i.e. we cycle between the same points
and sets xk, Wk.

443 Wolfgang Bangerth


Active set SQP methods for general nonlinear problems

For equality constrained problems of the form

  minimize f(x),   f(x): ℝ^n → ℝ
  g(x) = 0,        g(x): ℝ^n → ℝ^{n_e}

we used the SQP method. It repeatedly solves linear-quadratic
problems of the form

  min m_k(p_k^x) = L(x_k,λ_k) + ∇_x L(x_k,λ_k)^T p_k^x + (1/2) p_k^{xT} ∇_x² L(x_k,λ_k) p_k^x
  g(x_k) + ∇g(x_k)^T p_k^x = 0

Here, each subproblem (a single SQP step) could be solved in
one iteration by solving a saddle point linear system.

444 Wolfgang Bangerth


Part 14

Active Set SQP Methods

  minimize f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

445 Wolfgang Bangerth


Active set SQP methods for general nonlinear problems

For inequality constrained problems of the form

  minimize f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

we repeatedly solve linear-quadratic problems of the form

  min m_k(p_k^x) = L(x_k,λ_k) + ∇_x L(x_k,λ_k)^T p_k^x + (1/2) p_k^{xT} ∇_x² L(x_k,λ_k) p_k^x
  g(x_k) + ∇g(x_k)^T p_k^x = 0
  h(x_k) + ∇h(x_k)^T p_k^x ≥ 0

Each of these inequality constrained quadratic problems can be
solved using the active set method, and after we have the
exact solution of this approximate problem we can re-linearize
around this point for the next sub-problem.

446 Wolfgang Bangerth
Active set SQP methods for general nonlinear problems

Note: Each time we solve a problem like

  min m_k(p_k^x) = L(x_k,λ_k) + ∇_x L(x_k,λ_k)^T p_k^x + (1/2) p_k^{xT} ∇_x² L(x_k,λ_k) p_k^x
  g(x_k) + ∇g(x_k)^T p_k^x = 0
  h(x_k) + ∇h(x_k)^T p_k^x ≥ 0

we have to do several active set iterations, though we can start
with the previous step's final working set and solution point.

Nevertheless, this is not going to be cheap, though it is
comparable to iterating over penalty/barrier parameters.

447 Wolfgang Bangerth


Parts 11-14

Summary of methods for
inequality-constrained problems

  minimize f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

448 Wolfgang Bangerth


Summary of methods

Two approaches to inequality-constrained problems:

● Penalty/barrier methods:
  - Convert the constrained problem into an unconstrained
    one that can be solved with known techniques.
  - Barrier methods ensure that intermediate iterates remain
    feasible with respect to the inequality constraints.

● Lagrange multiplier formulations lead to active set
  methods.

Both kinds of methods are expensive. Penalty/barrier
methods are simpler to implement, but can find minima
located at the boundary of the feasible set only at the
price of dealing with ill-conditioned problems.

449 Wolfgang Bangerth
Part 15

Global optimization

  minimize f(x)
  g_i(x) = 0,  i=1,...,n_e
  h_i(x) ≥ 0,  i=1,...,n_i

450 Wolfgang Bangerth


Motivation

What should we do when asked to find the (global) minimum
of functions like this:

  f(x) = (1/20)(x_1² + x_2²) + cos(x_1) + cos(x_2)

451 Wolfgang Bangerth
A naïve sampling approach

Naïve approach: Sample at M-by-M points and choose the


one with the smallest value.

Alternatively: Start Newton's method at each of these points to


get higher accuracy.

Problem: If we have n variables, then we would have to start
at M^n points. This becomes prohibitive for large n!

452 Wolfgang Bangerth
Monte Carlo sampling

A better strategy ("Monte Carlo" sampling):

● Start with a feasible point x_0
● For k=0,1,2,...:
  - Choose a trial point x_t
  - If f(x_t) ≤ f(x_k) then x_{k+1} = x_t          [accept the sample]
  - Else:
    . draw a random number s in [0,1]
    . if  exp( -(f(x_t) - f(x_k)) / T ) ≥ s  then
        x_{k+1} = x_t                              [accept the sample]
      else
        x_{k+1} = x_k                              [reject the sample]

453 Wolfgang Bangerth
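The acceptance rule above turns into a very small sampler. This sketch uses the lecture's example landscape as reconstructed above; the function name monte_carlo_minimize, the best-point tracking, and the Gaussian proposal with σ=0.25 are illustrative additions (the choice of proposal and σ is discussed a few slides later):

```python
import math
import random

# The lecture's example landscape (as reconstructed from the slides):
f = lambda x: (x[0]**2 + x[1]**2) / 20.0 + math.cos(x[0]) + math.cos(x[1])

def monte_carlo_minimize(f, x0, T=1.0, sigma=0.25, n_samples=20_000, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    best_val, best_x = f(x), list(x)
    for _ in range(n_samples):
        xt = [xi + rng.gauss(0.0, sigma) for xi in x]   # trial point near x_k
        df = f(xt) - f(x)
        # accept downhill moves always, uphill ones with prob. exp(-df/T)
        if df <= 0.0 or math.exp(-df / T) >= rng.random():
            x = xt
            if f(x) < best_val:
                best_val, best_x = f(x), list(x)
    return best_val, best_x

val, xmin = monte_carlo_minimize(f, [4.0, 4.0])
print(round(val, 3))   # close to the global minimum value of about -1.103
```

Tracking the best point seen so far is what turns the sampler into an optimizer: the chain itself keeps wandering, as the sample plots on the following slides show.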
Monte Carlo sampling

Example: The first 200 sample points

454 Wolfgang Bangerth


Monte Carlo sampling

Example: The first 10,000 sample points

455 Wolfgang Bangerth


Monte Carlo sampling

Example: The first 100,000 sample points

456 Wolfgang Bangerth


Monte Carlo sampling

Example: Locations and values of the first 10^5 sample points

457 Wolfgang Bangerth


Monte Carlo sampling

Example: Values of the first 100,000 sample points

Note: The exact minimal value is -1.1032... . In the first


100,000 samples, we have 24 with values f(x)<-1.103.
458 Wolfgang Bangerth
Monte Carlo sampling

How to choose the constant T:

● If T is chosen too small, then the condition

    exp( -(f(x_t) - f(x_k)) / T ) ≥ s,   s ∈ U[0,1]

  will lead to frequent rejections of sample points for which
  f(x) increases.
  Consequently, we will get stuck in local minima for long
  periods of time before we accept a sequence of steps that
  gets us "over the hump".

● On the other hand, if T is chosen too large, then we will
  accept nearly every sample, irrespective of f(x_t).
  Consequently, we will perform a random walk that is no
  more efficient than uniform sampling.

459 Wolfgang Bangerth


Monte Carlo sampling

Example: First 100,000 samples, T=0.1

460 Wolfgang Bangerth


Monte Carlo sampling

Example: First 100,000 samples, T=1

461 Wolfgang Bangerth


Monte Carlo sampling

Example: First 100,000 samples, T=10

462 Wolfgang Bangerth


Monte Carlo sampling

Strategy: Choose T large enough that there is a reasonable
probability to get out of local minima, but small enough that this
doesn't happen too often.

Example: For

  f(x) = (1/20)(x_1² + x_2²) + cos(x_1) + cos(x_2)

the difference in function value between local minima and
saddle points is around Δf = 2. We want to choose T so that

  exp( -Δf / T ) ≥ s,   s ∈ U[0,1]

is true maybe 10% of the time.

This is the case for T=0.87.

463 Wolfgang Bangerth
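The value T=0.87 follows directly from solving exp(-Δf/T) = p for the desired acceptance probability p:

```python
import math

# Acceptance probability of an uphill move of height df is exp(-df/T).
# Choosing T so that a "hump" of height df = 2 is accepted 10% of the time:
df, p = 2.0, 0.1
T = -df / math.log(p)       # exp(-df/T) = p  =>  T = df / ln(1/p)
print(round(T, 2))          # 0.87, the value quoted on the slide
```

Since s is uniform on [0,1], the condition exp(-Δf/T) ≥ s holds with probability exactly exp(-Δf/T), which is what makes this calibration possible.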
Monte Carlo sampling

How to choose the next sample x_t:

● If x_t is chosen independently of x_k then we just sample the
  entire domain, without exploring areas where f(x) is small.
  Consequently, we should choose x_t "close" to x_k.
● If we choose x_t too close to x_k we will have a hard time
  exploring a significant part of the feasible region.
● If we choose x_t in an area around x_k that is too large, then
  we don't adequately explore areas where f(x) is small.

Common strategy: Choose

  x_t = x_k + σ y,   y ∈ N(0,I) or U([-1,1]^n)

where σ is a fraction of the diameter of the domain or of the
distance between local minima.

464 Wolfgang Bangerth
Monte Carlo sampling

Example: First 100,000 samples, T=1, σ=0.05

465 Wolfgang Bangerth


Monte Carlo sampling

Example: First 100,000 samples, T=1, σ=0.25

466 Wolfgang Bangerth


Monte Carlo sampling

Example: First 100,000 samples, T=1, σ=1

467 Wolfgang Bangerth


Monte Carlo sampling

Example: First 100,000 samples, T=1, σ=4

468 Wolfgang Bangerth


Monte Carlo sampling with constraints

Inequality constraints:

For simple inequality constraints, modify sample
generation strategy to never generate infeasible trial
samples

For complex inequality constraints, always reject samples
for which

hi  x t 0 for at least one i

469 Wolfgang Bangerth


Monte Carlo sampling with constraints

Inequality constraints:

For simple inequality constraints, modify the sample
generation strategy to never generate infeasible trial
samples

For complex inequality constraints, always reject infeasible
samples by replacing f with a modified objective Q:
- If Q(x_t) ≤ Q(x_k) then x_{k+1} = x_t
- Else:
  . draw a random number s ∈ U[0,1]
  . if exp[ −( Q(x_t) − Q(x_k) ) / T ] ≥ s
    then x_{k+1} = x_t
    else x_{k+1} = x_k
where
    Q(x) = ∞ if at least one h_i(x) < 0,   Q(x) = f(x) otherwise
470 Wolfgang Bangerth
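A sketch of this acceptance test, assuming the objective f and the constraints h_i are given as plain Python callables (the example f and h at the bottom are purely illustrative):

```python
import math

def make_Q(f, h):
    # Q(x) = +infinity if any h_i(x) < 0, else f(x): infeasible trial
    # samples then lose every comparison and are rejected.
    def Q(x):
        return math.inf if any(hi(x) < 0 for hi in h) else f(x)
    return Q

def accept(Q, x_k, x_t, T, s):
    # The modified Metropolis test from the slide.
    if Q(x_t) <= Q(x_k):
        return True
    return math.exp(-(Q(x_t) - Q(x_k)) / T) >= s

# Illustrative problem: minimize (x - 2)^2 subject to x >= 0.
f = lambda x: (x[0] - 2.0) ** 2
h = [lambda x: x[0]]
Q = make_Q(f, h)
assert accept(Q, [0.5], [1.0], 1.0, 0.5)       # feasible downhill: accept
assert not accept(Q, [0.5], [-1.0], 1.0, 0.5)  # infeasible trial: reject
```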
Monte Carlo sampling with constraints

Equality constraints:

Generate only samples that satisfy equality constraints


If we have only linear equality constraints of the form
    g(x) = Ax − b = 0
then one way to guarantee this is to generate samples
using
    x_t = x_k + σ Z y,   y ∈ ℝ^(n−n_e),   y ∈ N(0, I) or U[−1,1]^(n−n_e)
where Z is the null space matrix of A, i.e. AZ = 0.

471 Wolfgang Bangerth
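For a concrete (hypothetical) case with n = 2 and the single constraint x₁ + x₂ = 1, i.e. A = [1 1], b = 1, one null-space basis is Z = (1, −1)ᵀ, and every trial sample then stays exactly on the constraint:

```python
import random

# Hypothetical constraint: g(x) = x1 + x2 - 1 = 0, so A = [1, 1], b = 1.
# Z = (1, -1)^T spans the null space of A, since A @ Z = 1 - 1 = 0.
Z = (1.0, -1.0)

def trial(x_k, sigma, rng):
    # x_t = x_k + sigma * Z * y with y in R^(n - n_e); here n - n_e = 1.
    y = rng.gauss(0.0, 1.0)
    return [x_k[0] + sigma * Z[0] * y,
            x_k[1] + sigma * Z[1] * y]

rng = random.Random(1)
x = [0.3, 0.7]                             # feasible: x1 + x2 = 1
for _ in range(100):
    x = trial(x, 0.25, rng)
    assert abs(x[0] + x[1] - 1.0) < 1e-9   # still satisfies Ax = b
```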


Monte Carlo sampling

Theorem:
Let A be a subset of the feasible region. Under certain
conditions on the sample generation strategy, then as k ∞
we have
    number of samples x_k ∈ A  ∝  ∫_A e^(−f(x)/T) dx

That is: Every region A will be adequately sampled over time.
Areas around the global minimum will be better sampled than
other regions.

In particular,
    fraction of samples x_k ∈ A  =  (1/C) ∫_A e^(−f(x)/T) dx + O(1/√N)
472 Wolfgang Bangerth
Monte Carlo sampling

Remark:
Monte Carlo sampling appears to be a strategy that bounces
around randomly, only taking into account the values (not the
derivatives) of f(x).

However, that is not so if sample generation strategy and T


are chosen carefully: Then we choose a new sample
moderately close to the previous one, and we always accept it
if f(x) is reduced, whereas we only sometimes accept it if f(x)
is increased by this step.

In other words: On average we still move in the direction of


steepest descent!

473 Wolfgang Bangerth


Monte Carlo sampling

Remark:
Monte Carlo sampling appears to be a strategy that bounces
around randomly, only taking into account the values (not the
derivatives) of f(x).

However, that is not so – because it compares function values.

That said: One can accelerate the Monte Carlo method by


choosing samples from a distribution that is biased towards
the negative gradient direction if the gradient is cheap to
compute.

Such methods are sometimes called Langevin samplers.

474 Wolfgang Bangerth


Simulated Annealing

Motivation:
Particles in a gas, or atoms in a crystal have an energy that is
on average in equilibrium with the rest of the system. At any
given time, however, its energy may be higher or lower.

In particular, the probability that its energy is E is

    P(E) ∝ e^(−E/(k_B T))

where k_B is the Boltzmann constant. Likewise, the probability
that a particle can overcome an energy barrier of height ΔE is

    P(E → E+ΔE) ∝ min{ 1, e^(−ΔE/(k_B T)) }
                = 1 if ΔE ≤ 0,   e^(−ΔE/(k_B T)) if ΔE > 0

This is exactly the Monte Carlo transition probability if we
identify
    ΔE = Δf,   k_B T = T
475 Wolfgang Bangerth
Simulated Annealing

Motivation:
In other words, Monte Carlo sampling is analogous to
watching particles bounce around in a potential f(x) when
driven by a gas at constant temperature.

On the other hand, we know that if we slowly reduce the


temperature of a system, it will end up in the ground state with
very high probability. For example, slowly reducing the
temperature of a melt results in a perfect crystal. (On the other
hand, reducing the temperature too quickly results in a glass.)

The Simulated Annealing algorithm uses this analogy by using


the modified transition probability

    exp[ −( f(x_t) − f(x_k) ) / T_k ] ≥ s,   s ∈ U[0,1],   T_k → 0 as k → ∞

476 Wolfgang Bangerth
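A sketch of simulated annealing on the slides' test function, using the cooling schedule T_k = 1/(1 + 10⁻⁴ k) from the following example; the starting point and step count are illustrative:

```python
import math
import random

def f(x):
    # f(x) = sum_i ( x_i^2/20 + cos(x_i) )
    return sum(xi * xi / 20.0 + math.cos(xi) for xi in x)

def anneal(x0, n_steps, sigma=0.25, seed=0):
    rng = random.Random(seed)
    x = list(x0)
    best = list(x0)
    for k in range(n_steps):
        T_k = 1.0 / (1.0 + 1e-4 * k)     # temperature decreases over time
        x_t = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        df = f(x_t) - f(x)
        # Same Metropolis test as before, but with the shrinking T_k:
        if df <= 0.0 or math.exp(-df / T_k) >= rng.random():
            x = x_t
        if f(x) < f(best):
            best = list(x)
    return best
```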


Simulated Annealing

Example: First 100,000 samples, σ=0.25

T = 1                    T_k = 1 / (1 + 10⁻⁴ k)

477 Wolfgang Bangerth


Simulated Annealing

Example: First 100,000 samples, σ=0.25

T = 1                    T_k = 1 / (1 + 10⁻⁴ k)

24 samples with f(x)<-1.103 192 samples with f(x)<-1.103


478 Wolfgang Bangerth
Simulated Annealing
Convergence: First 1,500 samples,
    f(x) = Σ_{i=1}^{2} ( x_i²/20 + cos(x_i) )

T = 1                    T_k = 1 / (1 + 0.005 k)

(Green line indicates the lowest function value found so far)


479 Wolfgang Bangerth
Simulated Annealing
Convergence: First 10,000 samples,
    f(x) = Σ_{i=1}^{10} ( x_i²/20 + cos(x_i) )

T = 1                    T_k = 1 / (1 + 0.0005 k)

(Green line indicates the lowest function value found so far)


480 Wolfgang Bangerth
Simulated Annealing

Discussion:
Simulated Annealing is often more efficient in finding global
minima because it initially explores the energy landscape at
large, and later on explores the areas of low energy in greater
detail.

On the other hand, there is now another knob to play with


(namely how we reduce the temperature):

● If the temperature is reduced too fast, we may get stuck in
  local minima (the “glass” state)
● If the temperature is not reduced fast enough, the
  algorithm is no better than Monte Carlo sampling and may
  require very many samples.

481 Wolfgang Bangerth


Very Fast Simulated Annealing (VFSA)

A further refinement:
In Very Fast Simulated Annealing we not only reduce
temperature over time, but also reduce the search radius of
our sample generation strategy, i.e. we compute
    x_t = x_k + σ_k y,   y ∈ N(0, I) or U[−1,1]ⁿ
and let
    σ_k → 0 as k → ∞
Like reducing the temperature, this ensures that we sample
the vicinity of minima better and better over time.

Remark: To guarantee that the algorithm can reach any point


in the search domain, we need to choose σ_k so that

    Σ_{k=0}^{∞} σ_k = ∞
482 Wolfgang Bangerth
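This divergence condition rules out, for example, geometric schedules σ_k = σ₀ qᵏ (whose sum is finite, so the walker would eventually freeze in place), while harmonic-type schedules like σ_k = σ₀/(1 + c k) satisfy it; a quick numerical check (with illustrative σ₀ and c) shows the partial sums keep growing:

```python
def partial_sum(n, sigma0=1.0, c=0.005):
    # Partial sums of sigma_k = sigma0 / (1 + c*k); these grow like
    # (sigma0/c) * log(1 + c*n) and therefore diverge as n -> infinity.
    return sum(sigma0 / (1.0 + c * k) for k in range(n))

# 100x more terms more than doubles the sum: no convergence in sight.
assert partial_sum(100_000) > 2.0 * partial_sum(1_000)
```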
Genetic Algorithms (GA)

An entirely different idea:


● Choose a set (“population”) of N points (“individuals”)
  P_0 = {x_1, ..., x_N}
● For k = 0, 1, 2, ... (“generations”):
  ● Copy those N_f < N individuals in P_k with the smallest f(x)
    (i.e. the “fittest individuals”) into P_{k+1}
  ● While #P_{k+1} < N:
    - select two individuals (“parents”) x_a, x_b from among the
      first N_f individuals in P_{k+1} with probabilities
      proportional to e^(−f(x_i)/T)
    - create a new point x_new from x_a, x_b (“mating”)
    - perform some random changes on x_new (“mutation”)
    - add it to P_{k+1}

483 Wolfgang Bangerth
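A minimal sketch of this loop for real-valued x, with mean-value mating and Gaussian mutation as described on the following slides; the population sizes and all other parameter values are illustrative:

```python
import math
import random

def f(x):
    # The slides' test function, again.
    return sum(xi * xi / 20.0 + math.cos(xi) for xi in x)

def ga(n_pop=50, n_fit=20, n_gen=30, T=1.0, sigma=0.25, dim=2, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(n_pop)]
    for _ in range(n_gen):
        pop.sort(key=f)                  # fittest (smallest f) first
        survivors = pop[:n_fit]          # the N_f fittest go into P_{k+1}
        # Selection probabilities proportional to exp(-f(x_i)/T):
        weights = [math.exp(-f(x) / T) for x in survivors]
        children = []
        while n_fit + len(children) < n_pop:
            xa, xb = rng.choices(survivors, weights=weights, k=2)
            child = [(a + b) / 2.0 for a, b in zip(xa, xb)]           # mating
            child = [c + sigma * rng.gauss(0.0, 1.0) for c in child]  # mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=f)
```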


Genetic Algorithms (GA)

Example: Populations at k = 0, 1, 2, 5, 10, 20; N = 500, N_s = (2/3) N

484 Wolfgang Bangerth


Genetic Algorithms (GA)

Convergence: Values of the N samples for all generations k

f(x) = Σ_{i=1}^{2} ( x_i²/20 + cos(x_i) )        f(x) = Σ_{i=1}^{10} ( x_i²/20 + cos(x_i) )

485 Wolfgang Bangerth


Genetic Algorithms (GA)

Mating:

Mating is meant to produce new individuals that share the
traits of the two parents

If the variable x encodes real values, then mating could just
take the mean value of the parents:
x ax b
x new=
2

For more general properties (paths through cities, which of M
objects to put where in a suitcase, …) we have to encode x in
a binary string. Mating may then select bits (or bit sequences)
randomly from each of the parents


There is a huge variety of encoding and selection strategies
in the literature.

486 Wolfgang Bangerth
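For bit-encoded individuals, one simple (illustrative) mating rule of the kind mentioned above picks each bit of the child at random from one of the two parents:

```python
import random

def mate(parent_a, parent_b, rng):
    # Uniform crossover: each child bit comes from a randomly chosen parent.
    assert len(parent_a) == len(parent_b)
    return ''.join(rng.choice(pair) for pair in zip(parent_a, parent_b))

rng = random.Random(0)
child = mate('110100', '001101', rng)
assert len(child) == 6
# Every child bit equals the corresponding bit of one of the parents:
assert all(c in (a, b) for c, a, b in zip(child, '110100', '001101'))
```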


Genetic Algorithms (GA)

Mutation:

Mutations are meant to introduce an element of randomness
into the process, to explore search directions that aren't
represented yet in the population

If the variable x represents real values, we can just add a
small random value to x to simulate mutations
x ax b n
x new=  y , y ∈ℝ , y=N 0, I 
2

For more general properties, mutations can be introduced by
randomly flipping individual bits or bit sequences in the
encoded properties


There is a huge variety of mutation strategies in the literature.

487 Wolfgang Bangerth
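Correspondingly, for bit-encoded individuals an (illustrative) mutation rule flips each bit independently with a small probability:

```python
import random

def mutate(bits, p_flip, rng):
    # Flip each bit independently with probability p_flip.
    out = []
    for b in bits:
        if rng.random() < p_flip:
            out.append('1' if b == '0' else '0')  # flip this bit
        else:
            out.append(b)                         # keep it unchanged
    return ''.join(out)

rng = random.Random(0)
assert mutate('1010', 0.0, rng) == '1010'   # p_flip = 0: nothing changes
assert mutate('1010', 1.0, rng) == '0101'   # p_flip = 1: every bit flips
```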


Part 15

Summary of
global optimization methods

minimize  f(x)
subject to  g_i(x) = 0,   i = 1, ..., n_e
            h_i(x) ≥ 0,   i = 1, ..., n_i

488 Wolfgang Bangerth


Summary of methods

● Global optimization problems with many minima are
  difficult because of the curse of dimensionality: the
  number of places where a minimum could be becomes
  very large as the number of dimensions grows

● There is a large zoo of methods for these kinds of
  problems

● Most algorithms are stochastic in order to sample the
  feasible region

● The algorithms also work for non-smooth problems

● Most methods are not very efficient (if one counts the
  number of function evaluations), in return for the ability
  to get out of local minima

● Global optimization algorithms should not be used when
  we know that the problem has only a small number of
  minima and/or is smooth and convex
489 Wolfgang Bangerth
