Optimization Slides
Examples of optimization problems
What is an optimization problem?
Mathematically speaking:
f(x) → min!
g(x) = 0
h(x) ≥ 0
What is an optimization problem?
In practice:
● x = {u, y} is a set of design and auxiliary variables that completely describes a physical, chemical, or economic model;
● f(x) is an objective function with which we measure how good a design is;
● g(x) describes relationships that have to be met exactly (for example the relationship between y and u);
● h(x) describes conditions that must not be exceeded.
Examples
Applications: The drag coefficient of a car
Mathematical description:
x = {u, y}: u are the design parameters (e.g. the shape of the car),
y is the flow field around the car
f(x): the drag force that results from the flow field
g(x) = y − q(u) = 0:
constraints that come from the fact that there is a flow field y = q(u) for each design. y may, for example, satisfy the Navier-Stokes equations.
Applications: The drag coefficient of a car
Inequality constraints:
(expected sales price − profit margin) − cost(u) ≥ 0
legal margins(u) ≥ 0
Applications: The drag coefficient of a car
Analysis:
linearity: f(x) may be linear
g(x) is certainly nonlinear (Navier-Stokes equations)
h(x) may be nonlinear
convexity: ??
constrained: yes
The objective is f(x) = c_d(y), the drag coefficient computed from the flow field. In practice, one is often willing to trade efficiency for cost, i.e., we are willing to accept a slightly higher drag coefficient if the cost is smaller. This leads to objective functions that combine c_d(y) and cost(u), for example as a weighted sum.
Applications: Optimal oil production strategies
(Figures: permeability field; oil saturation.)
Mathematical description:
x = {u, y}: u are the pumping rates at injection/production wells,
y is the flow field (pressures/velocities)
f(x): the cost of production and injection minus the sales price of the oil, integrated over the lifetime of the reservoir (or −NPV)
g(x) = y − q(u) = 0:
constraints that come from the fact that there is a flow field y = q(u) for each u. y may, for example, satisfy the multiphase porous media flow equations.
Applications: Optimal oil production strategies
produced_oil(T)/available_oil(0) − c ≥ 0:
legislative requirement to produce at least a certain fraction.
Applications: Optimal oil production strategies
Analysis:
linearity: f(x) is nonlinear
g(x) is certainly nonlinear
h(x) may be nonlinear
convexity: no
constrained: yes

Applications: Switching lights at an intersection
Mathematical description:
x = {T, t_{i,1}, t_{i,2}}: round-trip time T for the stop light system, switch-green and switch-red times for all lights i
f(x): number of cars that can pass the intersection per hour.
Note: f is unknown as a function, but we can measure it.
Applications: Switching lights at an intersection
300 − T ≥ 0:
no more than 5 minutes of round-trip time, so that people don't have to wait for too long
t_{i+1,1} − t_{i,2} − 5 ≥ 0:
at least 5 seconds of all-red between different greens
Applications: Switching lights at an intersection
Analysis:
linearity: f(x) ??
h(x) is linear
convexity: ??
constrained: yes
smooth: f(x) ??
h(x): yes
ODE/PDE: no
Applications: Trajectory planning
Mathematical description:
x = {y(t), u(t)}: position of the spacecraft and thrust vector at time t
f(x) = ∫₀ᵀ |u(t)| dt: minimize fuel consumption
Analysis:
convexity: no
constrained: yes
smooth: yes, here
derivatives: computable
continuous: yes, not discrete
ODE/PDE: yes
Applications: Data fitting 1
Mathematical description:
x = {a, b}: parameters for the model y(t) = (1/a) log cosh(a b t)
f(x) = (1/N) Σ_i |y_i − y(t_i)|²:
mean square difference between predicted value and actual measurement
Applications: Data fitting 1
Analysis:
linearity: f(x) is nonlinear
constrained: no
smooth: yes
Applications: Data fitting 2
Mathematical description:
x = {a, b}: parameters for the model y(t) = a t + b
f(x) = (1/N) Σ_i |y_i − y(t_i)|²:
mean square difference between predicted value and actual measurement
Applications: Data fitting 2
Analysis:
linearity: f(x) is quadratic
convexity: yes
constrained: no
smooth: yes
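Because f is quadratic and convex in x = {a, b}, this fit is an ordinary linear least-squares problem. A minimal numpy sketch with synthetic data (the data values are assumptions chosen purely for illustration):

```python
# A minimal sketch of Data Fitting 2: least-squares fit of the linear model
# y(t) = a*t + b. The data below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 50)
y = 2.0 * t + 1.0 + 0.3 * rng.standard_normal(t.size)   # noisy synthetic data

# f(x) = (1/N) sum_i |y_i - (a t_i + b)|^2 is quadratic in x = {a, b},
# so the minimizer solves a linear least-squares problem:
A = np.column_stack([t, np.ones_like(t)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"a = {a:.3f}, b = {b:.3f}")
```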
Applications: Data fitting 3
Mathematical description:
x = {a, b}: parameters for the model y(t) = a t + b
f(x) = (1/N) Σ_i |y_i − y(t_i)|:
mean absolute difference between predicted value and actual measurement
Applications: Data fitting 3
Analysis:
linearity: f(x) is nonlinear
convexity: yes
constrained: no
smooth: no!
Applications: Data fitting 3, revisited
Mathematical description:
x = {a, b, s_i}: parameters for the model y(t) = a t + b, plus "slack" variables s_i
f(x) = (1/N) Σ_i s_i → min!
subject to s_i − |y_i − y(t_i)| ≥ 0
Applications: Data fitting 3, revisited
Analysis:
linearity: f(x) is linear, h(x) is not linear
convexity: yes
constrained: yes
smooth: no!
Applications: Data fitting 3, re-revisited
Mathematical description:
x = {a, b, s_i}: parameters for the model y(t) = a t + b, plus "slack" variables s_i
f(x) = (1/N) Σ_i s_i → min!
and the nonsmooth constraint s_i − |y_i − y(t_i)| ≥ 0 is replaced by the pair of linear constraints
s_i − (y_i − y(t_i)) ≥ 0
s_i + (y_i − y(t_i)) ≥ 0
Applications: Data fitting 3, re-revisited
Analysis:
linearity: f(x) is linear, h(x) is now also linear
convexity: yes
constrained: yes
smooth: yes
derivatives: yes
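The re-revisited problem is a linear program in the variables {a, b, s_1, ..., s_N}. A sketch using scipy's linprog; the variable layout and the synthetic data are assumptions for illustration:

```python
# Data Fitting 3, re-revisited: the mean-absolute-deviation fit written as a
# linear program with slack variables. Requires scipy; data are made up.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 30)
y = 2.0 * t + 1.0 + 0.3 * rng.standard_normal(t.size)
N = t.size

# Variable vector z = (a, b, s_1, ..., s_N); objective (1/N) sum_i s_i.
c = np.concatenate([[0.0, 0.0], np.full(N, 1.0 / N)])

# s_i - (y_i - (a t_i + b)) >= 0 and s_i + (y_i - (a t_i + b)) >= 0,
# rewritten in the form A_ub @ z <= b_ub:
I = np.eye(N)
A_ub = np.block([[-t[:, None], -np.ones((N, 1)), -I],
                 [ t[:, None],  np.ones((N, 1)), -I]])
b_ub = np.concatenate([-y, y])

res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None), (None, None)] + [(0, None)] * N)
a, b = res.x[:2]
print(f"a = {a:.3f}, b = {b:.3f}")
```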
Applications: Traveling salesman
Mathematical description:
x = {c_i}: the index of the i-th city on our trip, i = 1...N
f(x) = Σ_i d(c_i, c_{i+1})
Applications: Traveling salesman
Analysis:
linearity: f(x) is linear, h(x) is nonlinear
convexity: meaningless
constrained: yes
smooth: meaningless
derivatives: meaningless
continuous: no; the problem is discrete: x ∈ X ⊂ {1, 2, ..., N}^N
Part 2
Minima, minimizers, sufficient and necessary conditions
Part 3
Metrics of algorithmic complexity
Outline of optimization algorithms
All algorithms to find minima of f(x) do so iteratively:
- start at a point x₀
- for k = 1, 2, ...:
  . compute an update direction p_k
  . compute a step length α_k
  . set x_k ← x_{k−1} + α_k p_k
  . set k ← k+1
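To make this generic loop concrete, here is a minimal Python sketch. The steepest-descent direction and the fixed step length are placeholder assumptions; the following parts replace both with better choices.

```python
# A minimal sketch of the generic iteration: direction p_k, step length
# alpha_k, then x_k <- x_{k-1} + alpha_k p_k. Direction and step length
# are deliberately naive placeholders (steepest descent, fixed alpha).
import numpy as np

def iterate(grad_f, x0, alpha=0.1, n_steps=100):
    x = np.asarray(x0, dtype=float)
    for k in range(n_steps):
        p = -grad_f(x)          # update direction p_k
        x = x + alpha * p       # x_k = x_{k-1} + alpha_k p_k
    return x

# Example: minimize f(x) = |x|^2 / 2, whose gradient is x itself.
print(iterate(lambda x: x, [2.0, -1.0]))   # -> close to (0, 0)
```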
Outline of optimization algorithms
Questions: How expensive is every iteration? How many iterations do we need?
Evaluating f(x) alone can already be very costly:
● Traffic light example: Evaluating f(x) requires us to sit at an intersection for an hour, counting cars.
● Designing air foils: Testing an improved wing design in a wind tunnel costs millions of dollars.
How expensive is every iteration?
Example: Boeing wing design
● If derivatives cannot be computed exactly, they can be approximated by several evaluations of f(·) and ∇f(·).
How many iterations do we need?
Question: Given a sequence x_k → x* (for which we know that ‖x_k − x*‖ → 0), can we determine exactly how fast the error ‖x_k − x*‖ goes to zero?
How many iterations do we need?
Definition: We say that a sequence x_k → x* is of order s if
‖x_k − x*‖ ≤ C ‖x_{k−1} − x*‖^s
A sequence of numbers a_k → 0 is called of order s if
|a_k| ≤ C |a_{k−1}|^s
C is called the asymptotic constant. We call C|a_{k−1}|^{s−1} the gain factor.
Specifically:
If s = 1, the sequence is called linearly convergent.
Note: Convergence requires C < 1. In a semi-logarithmic plot, linearly convergent sequences are straight lines.
If s = 2, we call the sequence quadratically convergent.
If 1 < s < 2, we call the sequence superlinearly convergent.
How many iterations do we need?
Example: The sequence of numbers
a_k = 1, 0.9, 0.81, 0.729, 0.6561, ...
is linearly convergent because |a_k| ≤ C |a_{k−1}|^s with s = 1, C = 0.9.
How many iterations do we need?
Example: The sequence of numbers
a_k = 0.1, 0.03, 0.0027, 0.00002187, ...
is quadratically convergent because |a_k| ≤ C |a_{k−1}|^s with s = 2, C = 3.
(Figure: error ‖x_k − x*‖ vs. k. Linear convergence: the gain factor C < 1 is constant. Quadratic convergence: the gain factor C|a_{k−1}| becomes better and better!)
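A few lines of Python reproduce both example sequences and confirm that the ratio |a_k| / |a_{k−1}|^s stays at the asymptotic constant C:

```python
# Demonstrate the two example sequences: the "gain ratio" |a_k|/|a_{k-1}|^s
# should equal C = 0.9 (linear, s=1) and C = 3 (quadratic, s=2).
linear = [1.0]          # a_k = 0.9 * a_{k-1}
quadratic = [0.1]       # a_k = 3 * a_{k-1}^2
for _ in range(4):
    linear.append(0.9 * linear[-1])
    quadratic.append(3.0 * quadratic[-1] ** 2)

for k in range(1, 5):
    print(f"k={k}: linear ratio {linear[k] / linear[k-1]:.2f}, "
          f"quadratic ratio {quadratic[k] / quadratic[k-1]**2:.2f}")
```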
Metrics of algorithmic complexity
Summary:
● Quadratic algorithms converge faster in the limit than linear or superlinear algorithms.
● Algorithms that are better than linear will need to be started close enough to the solution.
Part 4
Smooth unconstrained problems: Line search algorithms
minimize f(x)
Smooth problems: Characterization of Optima
At a strict local minimum x*: ∇f(x*) = 0 and spectrum(∇²f(x*)) > 0, i.e., the Hessian is positive definite.
Basic Algorithm for Smooth Unconstrained Problems
Generate a sequence x_k by
1. finding a search direction p_k
2. choosing a step length α_k
and setting x_{k+1} = x_k + α_k p_k.
Iterate until we are satisfied.
Step 1: Choose search direction
A search direction should be a descent direction, p_k · ∇f(x_k) ≤ 0.
Locally, the objective function behaves like its quadratic model:
f(x_k + p_k) ≈ f(x_k) + g_k^T p_k + ½ p_k^T H_k p_k
Step 1: Choose search direction
Steepest descent direction: p_k = −g_k = −∇f(x_k).
Newton direction: from 0 = ∇f(x*) ≈ ∇f(x_k) + ∇²f(x_k)(x* − x_k) + ..., i.e., g_k + H_k p_k = 0, we get
p_k = −H_k^{−1} g_k.
One could also use a third-order model
m_k(p) = f_k + g_k^T p + ½ p^T H_k p + (1/6) [∂³f/(∂x_l ∂x_m ∂x_n)]_k p_l p_m p_n
but then the stationarity condition
∂m_k(p)/∂p = g_k + H_k p + ½ [∂³f/(∂x_l ∂x_m ∂x_n)]_k p_l p_m = 0
can no longer be solved explicitly: p_k = ???
Size of C:
C ∼ sup_{x,y} ‖∇²f(x*)^{−1} (∇²f(x) − ∇²f(y))‖ / ‖x − y‖
C measures the size of the nonlinearity beyond the quadratic part.
Example 1: Gradient method
f(x, y) = −x³ + 2x² + y²
(Figure: error ‖x_k − x*‖ vs. k for the gradient method.)
∇²f(0,0) has eigenvalues {λ₁ = 4, λ₂ = 2}, so C ≈ (4 − 2)/(4 + 2) ≈ 0.33.
Example 1: Newton's method
f(x, y) = −x³ + 2x² + y²
(Figure: error ‖x_k − x*‖ vs. k for Newton's method and the gradient method.)
● Newton's method is much faster than the gradient method.
● Newton's method is superior for high accuracy due to its higher order of convergence.
● The gradient method is simple but converges in a reasonable number of iterations as well.
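A small sketch comparing the two methods on this objective; the starting point and the fixed gradient step length are ad hoc choices for illustration:

```python
# Compare gradient descent (fixed step) and Newton's method on
# f(x,y) = -x^3 + 2x^2 + y^2, which has a local minimum at (0,0).
import numpy as np

def grad(v):
    x, y = v
    return np.array([-3*x**2 + 4*x, 2*y])

def hess(v):
    x, _ = v
    return np.array([[-6*x + 4, 0.0], [0.0, 2.0]])

for name, newton in [("gradient", False), ("Newton", True)]:
    v = np.array([0.5, 0.5])
    for k in range(10):
        g = grad(v)
        p = -np.linalg.solve(hess(v), g) if newton else -0.2 * g
        v = v + p
    print(f"{name:8s}: error after 10 steps = {np.linalg.norm(v):.2e}")
```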
Example 2: Gradient method
f(x, y) = (x − 1)⁴/100 + (y − 1)²/100
(Figure: iterates of the gradient method and error ‖x_k − x*‖ vs. k.)
Rationale:
● Near the optimum, the quadratic approximation of f is valid → take full steps (step length α = 1) there.
● A line search is only necessary far away from the solution.
● If close to the solution, we need to try α = 1 first.
Consequence:
● Near the solution, the quadratic convergence of Newton's method is retained.
● Far away, convergence is slower in any case.
Practical line search strategies
f(x, y) = x⁴ − x² + y⁴ − y²
Practical line search strategies
Condition 1 (sufficient decrease):
f(x_k + α p_k) ≤ f(x_k) + c₁ α [∂f(x_k + α p_k)/∂α]_{α=0} = f_k + c₁ α ∇f_k · p_k
Necessary: 0 < c₁ < 1.
Typical value: c₁ = 10⁻⁴, i.e.: only a very small decrease is mandated.
Practical line search strategies
Condition 2 (curvature):
∇f(x_k + α p_k) · p_k = [∂f(x_k + α p_k)/∂α]_{α=α_k} ≥ c₂ [∂f(x_k + α p_k)/∂α]_{α=0} = c₂ ∇f_k · p_k
Necessary: 0 < c₁ < c₂ < 1.
Typical value: c₂ = 0.9.
Rationale: exclude too small step lengths.
Practical line search strategies
Wolfe conditions:
Conditions 1 and 2 together are the Wolfe conditions. They usually yield reasonable ranges for the step lengths, but do not guarantee optimal ones.
Practical line search strategies - Alternatives
Strong Wolfe condition:
|[∂f(x_k + α p_k)/∂α]_{α=α_k}| ≤ c₂ |[∂f(x_k + α p_k)/∂α]_{α=0}|
Goldstein conditions:
f(x_k + α p_k) ≥ f(x_k) + (1 − c₁) α [∂f(x_k + α p_k)/∂α]_{α=0}
Practical line search strategies
Note: A typical reduction factor is c = 1/2.
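A sketch of backtracking with the sufficient decrease condition, c₁ = 10⁻⁴ and reduction factor c = 1/2; the demo function and starting point are arbitrary:

```python
# Backtracking line search enforcing the Armijo (sufficient decrease)
# condition; alpha is halved until the condition holds.
import numpy as np

def backtracking(f, grad_f, x, p, c1=1e-4, c=0.5, alpha=1.0):
    """Shrink alpha until f(x + alpha p) <= f(x) + c1 alpha grad_f(x).p."""
    fx, slope = f(x), grad_f(x) @ p        # slope < 0 for a descent direction
    while f(x + alpha * p) > fx + c1 * alpha * slope:
        alpha *= c
    return alpha

# Example: one steepest-descent step on a simple quadratic.
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x0 = np.array([3.0, -4.0])
print("accepted step length:", backtracking(f, grad_f, x0, -grad_f(x0)))
```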
Practical line search strategies
● If no:
- let φ_k(α) = f(x_k + α p_k)
- from evaluating the sufficient decrease condition
  f(x_k + α_t^{(i)} p_k) ≤ f_k + c₁ α_t^{(i)} ∇f_k · p_k
  we already know φ_k(0) = f(x_k), φ_k'(0) = ∇f_k · p_k = g_k · p_k, and φ_k(α_t^{(i)}) = f(x_k + α_t^{(i)} p_k)
- if i = 0, then choose α_t^{(i+1)} as the minimizer of the quadratic function that interpolates φ_k(0), φ_k'(0), φ_k(α_t^{(i)})
- if i > 0, then choose α_t^{(i+1)} as the minimizer of the cubic function that interpolates φ_k(0), φ_k'(0), φ_k(α_t^{(i)}), φ_k(α_t^{(i−1)})
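The quadratic interpolation step has a closed form; the function below is the standard minimizer of the interpolating quadratic, and the demo values are made up:

```python
# Quadratic interpolation step: given phi(0), phi'(0) and phi(alpha0),
# return the minimizer of the interpolating quadratic.
def quadratic_trial(phi0, dphi0, alpha0, phi_alpha0):
    """Minimizer of the quadratic interpolating phi(0), phi'(0), phi(alpha0)."""
    return -dphi0 * alpha0**2 / (2.0 * (phi_alpha0 - phi0 - dphi0 * alpha0))

# Example with phi(a) = (a - 0.3)^2: phi(0)=0.09, phi'(0)=-0.6, phi(1)=0.49
print(quadratic_trial(0.09, -0.6, 1.0, 0.49))   # -> 0.3, the exact minimizer
```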
Practical line search strategies
(Figures: φ_k(α) with the first trial step α_t^{(0)}, then with the interpolated trial step α_t^{(1)}.)
Part 5
Smooth unconstrained problems: Trust region algorithms
minimize f(x)
Background:
In line search methods, we choose a direction based on a local approximation of the objective function, i.e., we try to predict f(x) far away from x_k by looking at f_k, g_k, H_k.
Alternative strategy:
Keep a number Δ_k that indicates up to which distance we trust that our model m_k(p) is a good approximation of f(x_k + p). Rather than following the model arbitrarily far, decide how far we trust the model and stay within this radius!
● Compute the predicted improvement PI = m_k(0) − m_k(p_k)
● Compute the actual improvement AI = f(x_k) − f(x_k + p_k)
● If AI/PI < 1/4 then Δ_{k+1} = ¼ ‖p_k‖
● If AI/PI > 3/4 and ‖p_k‖ = Δ_k then Δ_{k+1} = 2 Δ_k
● If AI/PI > η for some η ∈ [0, 1/4) then x_{k+1} = x_k + p_k, else x_{k+1} = x_k
(A code sketch of this bookkeeping follows below.)
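A sketch of the radius and acceptance logic; the thresholds follow the rules above, and η is a free parameter:

```python
# One trust-region bookkeeping step: compare actual vs. predicted
# improvement, shrink/grow the radius, accept or reject the step.
import numpy as np

def trust_region_update(f, m, x, p, delta, eta=0.1):
    """f and m are callables; m is the local model m_k(p); eta in [0, 1/4)."""
    predicted = m(np.zeros_like(p)) - m(p)        # PI = m(0) - m(p)
    actual = f(x) - f(x + p)                      # AI = f(x) - f(x+p)
    rho = actual / predicted
    if rho < 0.25:
        delta = 0.25 * np.linalg.norm(p)          # poor model: shrink radius
    elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
        delta = 2.0 * delta                       # good model at boundary: grow
    x_next = x + p if rho > eta else x            # accept only if enough progress
    return x_next, delta
```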
Trust region algorithms
● Practical trust region methods are about finding cheap ways to approximate the solution of the problem above!
Note:
If the trust region radius is small, then we get the "Cauchy point" in the steepest descent direction:
p_k^C ≈ τ p_k^SD, τ ∈ [0, 1], with p_k^SD = −Δ_k g_k/‖g_k‖.
p_k^C is the minimizer of f(x) in direction p_k^SD.
(Figures: trust region of radius Δ_k around x_k, with the Cauchy point p_k^C and the full step p_k^B, for the cases Δ_k < ‖p_k^B‖ and Δ_k ≥ ‖p_k^B‖.)
Idea:
Find the approximate solution p_k along the "dogleg" line x_k → x_k + p_k^C → x_k + p_k^B.
Trust region algorithms: The dogleg method
The dogleg then runs along x_k → x_k + p_k^U → x_k + p_k^B.
Dogleg algorithm:
If p_k^B = −B_k^{−1} g_k satisfies ‖p_k^B‖ ≤ Δ_k, then set p_k = p_k^B.
Otherwise, if p_k^U = −[(g_k^T g_k)/(g_k^T B_k g_k)] g_k satisfies ‖p_k^U‖ ≥ Δ_k, then set p_k = Δ_k p_k^U/‖p_k^U‖.
Otherwise, take p_k as the point on the dogleg path between p_k^U and p_k^B with ‖p_k‖ = Δ_k.
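A sketch of the dogleg step; the final branch computes the intersection of the dogleg path with the trust-region boundary, which the algorithm implies but does not spell out:

```python
# Dogleg step: full step p_B if it fits, scaled steepest descent if even
# p_U does not fit, otherwise the boundary point on the dogleg path.
import numpy as np

def dogleg(g, B, delta):
    p_B = -np.linalg.solve(B, g)                   # full (quasi-)Newton step
    if np.linalg.norm(p_B) <= delta:
        return p_B
    p_U = -(g @ g) / (g @ B @ g) * g               # unconstrained Cauchy step
    if np.linalg.norm(p_U) >= delta:
        return delta * p_U / np.linalg.norm(p_U)
    # Find tau in [0,1] with ||p_U + tau (p_B - p_U)|| = delta.
    d = p_B - p_U
    a, b, c = d @ d, 2 * p_U @ d, p_U @ p_U - delta**2
    tau = (-b + np.sqrt(b * b - 4 * a * c)) / (2 * a)
    return p_U + tau * d

print(dogleg(np.array([1.0, 2.0]), np.diag([1.0, 10.0]), 0.5))
```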
Practical aspects of Newton methods
minimize f(x)
Quadratic model:
m_k(p) = f_k + g_k^T p + ½ p^T H_k p
If H_k is indefinite, the model has a saddle point instead of a minimum, and the Newton step is invalid!
What if the Hessian is not positive definite?
Choose
H̃_k = H_k + λI
so that the minimum of
m̃_k(p) = f_k + g^T p + ½ p^T H̃_k p
lies at
p_k = −H̃_k^{−1} g_k = −(H_k + λI)^{−1} g_k
Note: This search direction is a mixture between the Newton direction p_k^N and the gradient.
Alternatively, modify the eigenvalues of the Hessian: writing
H_k = ∇²f(x_k) = Σ_i λ_i v_i v_i^T,
use the search direction
p_k = −H̃_k^{−1} g_k = −Σ_i [1/max{λ_i, ε}] v_i (v_i^T g_k)
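A sketch of the eigenvalue modification using numpy's symmetric eigendecomposition; the Hessian and gradient below are those at the starting point (0.1, 0.87) of the example that follows:

```python
# Eigenvalue-modified Newton direction: replace each eigenvalue lambda_i
# of H_k by max(lambda_i, eps) before inverting.
import numpy as np

def modified_newton_direction(H, g, eps=1e-6):
    lam, V = np.linalg.eigh(H)             # H = V diag(lam) V^T
    lam_mod = np.maximum(lam, eps)         # make the model convex
    return -V @ ((V.T @ g) / lam_mod)      # p = -sum_i v_i (v_i.g)/max(lam_i,eps)

# H and g at (0.1, 0.87) for f(x,y) = x^4 - x^2 + y^4 - y^2:
H = np.diag([-1.88, 7.08])
g = np.array([-0.196, 0.894])
# The x-component is huge because the negative eigenvalue was replaced by eps.
print(modified_newton_direction(H, g))
```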
Example:
f(x, y) = x⁴ − x² + y⁴ − y²
with minima at x = ±√2/2, y = ±√2/2.
Starting point: x₀ = 0.1, y₀ = 0.87, for which
H₀ = [ −1.88  0 ; 0  7.08 ]
(Figure: four search directions at the starting point:)
1. Negative gradient
2. Unmodified Hessian search direction
3. Search direction with eigenvalue-modified Hessian (ε = 10⁻⁶)
4. Search direction with shifted Hessian (λ = 2.5; search direction only good by lucky choice of λ)
Truncated Newton methods
In truncated Newton methods, the system H_k p_k = −g_k is solved only approximately, up to a relative tolerance
‖g_k + H_k p_k‖ / ‖g_k‖ ≤ η_k < 1
Quasi-Newton methods replace H_k by approximations that are updated from step to step, B_{k+1} = B_k + ...
Question:
● Maybe it is possible to find matrices B_k that are cheap to compute, and for which
● the resulting iteration still converges with superlinear order.
Motivation: the "average" Hessian H̄ satisfies the secant relation
H̄ s_{k−1} = y_{k−1}
with s_{k−1} = x_k − x_{k−1} and y_{k−1} = ∇f(x_k) − ∇f(x_{k−1}).
Motivation of ideas
Requirements:
● We seek a matrix B_{k+1} so that:
- the "secant condition" holds: B_{k+1} s_k = y_k
- B_{k+1} is symmetric
- the update equation is easy to solve for p_{k+1} = −B_{k+1}^{−1} g_{k+1}
One such update:
B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)
So far:
● We seek a matrix B_{k+1} so that the secant condition B_{k+1} s_k = y_k holds, B_{k+1} is symmetric, and the update equation is easy to solve for p_k = −B_k^{−1} g_k.
Two such updates:
DFP: B_{k+1} = (I − ρ_k y_k s_k^T) B_k (I − ρ_k s_k y_k^T) + ρ_k y_k y_k^T, with ρ_k = 1/(y_k^T s_k)
BFGS: B_{k+1} = B_k − (B_k s_k s_k^T B_k)/(s_k^T B_k s_k) + (y_k y_k^T)/(y_k^T s_k)
What if we mixed the two:
B_{k+1} = φ_k B_{k+1}^{DFP} + (1 − φ_k) B_{k+1}^{BFGS}
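A sketch of one BFGS update, with the standard caveat (an assumption, not stated on this slide) that the update is only applied when the curvature condition y_k·s_k > 0 holds:

```python
# One BFGS update of B_k; afterwards the secant condition B_{k+1} s_k = y_k
# holds by construction.
import numpy as np

def bfgs_update(B, s, y):
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

rng = np.random.default_rng(1)
B = np.eye(3)
s, y = rng.standard_normal(3), rng.standard_normal(3)
if y @ s < 0:
    y = -y                          # enforce the curvature condition for the demo
B = bfgs_update(B, s, y)
print(np.allclose(B @ s, y))        # True: secant condition holds
```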
Practical approaches:
Consequence: storing and updating a dense matrix B_k becomes too expensive in memory and CPU time for large problems.
Solution: Limit memory and CPU time by only storing the last m updates:
B_k^{−1} = [V_{k−1}^T ··· V_{k−m}^T] B_{0,k}^{−1} [V_{k−m} ··· V_{k−1}]
         + Σ_{j=1}^m ρ_{k−j} [V_{k−1}^T ··· V_{k−j+1}^T] s_{k−j} s_{k−j}^T [V_{k−j+1} ··· V_{k−1}]
In practice:
● The initial matrix can be chosen independently in each iteration; a typical approach is again
B_{0,k}^{−1} = [(y_{k−1}^T s_{k−1}) / (y_{k−1}^T y_{k−1})] I
● Typical values for m are between 3 and 30.
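A sketch of the standard two-loop recursion that applies this limited-memory inverse to a vector without ever forming a matrix:

```python
# L-BFGS two-loop recursion: apply B_k^{-1} to g using only the last m
# (s_j, y_j) pairs; `pairs` holds them oldest first.
import numpy as np

def lbfgs_direction(g, pairs):
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):                # first loop: newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append((rho, alpha))
    if pairs:                                   # B_{0,k}^{-1} = (y.s)/(y.y) I
        s, y = pairs[-1]
        q *= (y @ s) / (y @ y)
    for (s, y), (rho, alpha) in zip(pairs, reversed(alphas)):
        beta = rho * (y @ q)
        q += (alpha - beta) * s                 # second loop: oldest to newest
    return -q                                   # p_k = -B_k^{-1} g_k
```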
Equality-constrained problems
minimize_{x∈D⊂ℝⁿ} f(x)
g_i(x) = 0, i = 1, ..., n_e
or, in vector form,
minimize_{x∈D⊂ℝⁿ} f(x)
g(x) = 0
Equivalently, minimize_{x∈D∩Ω⊂ℝⁿ} f(x): the solution must lie within the feasible set Ω where g(x) = 0.
Algorithm:
Given x₀^start, {μ_t} → 0, {τ_t} → 0
For t = 0, 1, 2, ...:
  Find an approximation x_t* to the (unconstrained) minimizer of Q_{μ_t}(x) that satisfies
  ‖∇Q_{μ_t}(x_t*)‖ ≤ τ_t,
  using x_t^start as starting point.
  Set t = t+1, x_t^start = x*_{t−1}
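A sketch of this loop, assuming the usual quadratic penalty Q_μ(x) = f(x) + (1/2μ) Σ_i g_i(x)² (the definition of Q appears on a slide not reproduced here), on a small test problem:

```python
# Quadratic penalty loop: minimize Q_mu for mu_t -> 0 with warm starts.
# Test problem: min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0; solution (-1, -1).
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + x[1]
g = lambda x: x[0]**2 + x[1]**2 - 2.0

x = np.array([0.0, 0.0])                         # x_0^start
for t in range(8):
    mu, tau = 10.0 * 0.5**t, 1e-3 * 0.5**t       # mu_t -> 0, tau_t -> 0
    Q = lambda x, mu=mu: f(x) + g(x)**2 / (2.0 * mu)
    res = minimize(Q, x, tol=tau)                # approximate minimizer of Q
    x = res.x                                    # warm start for next t
print(x)                                         # -> approx (-1, -1)
```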
Theorem (Convergence): Let x_t* be the exact minimizer of Q_{μ_t} and let μ_t → 0. Let f, g be once differentiable. Then every limit point of the sequence {x_t*}_{t=1,2,...} is a solution of the constrained minimization problem
minimize_{x∈D⊂ℝⁿ} f(x), g(x) = 0.
Theorem (Convergence): Let x_t* be approximate minimizers of Q_{μ_t} with
‖∇Q_{μ_t}(x_t*)‖ ≤ τ_t, τ_t → 0.
Then the same conclusion holds for
minimize_{x∈D⊂ℝⁿ} f(x), g(x) = 0.
f(x) = Σ_{i=1}^{3} ½ D (‖x − x_i‖ − L₀)²
Lagrange multipliers
(Figure: isocontours of f and the constraint curve g(x, z) = 0.)
Conclusion:
● The solution is where the isocontours are tangential to each other
● That is, where the gradients of f and g are parallel
● The solution is where g(x) = 0
Lagrange multipliers
In mathematical terms: The (local) solutions of minimize f(x) subject to g(x) = 0 are the points where ∇f(x) = λ ∇g(x) for some scalar λ, and where g(x) = 0.
Lagrange multipliers
(Figures: two constraint curves g₁(x) = 0, g₂(x) = 0, with isocontours g₂(x) = −1, 0, 1.)
Conclusion:
● The solution is where the gradient of f can be written as a linear combination of the gradients of g₁, g₂
● The solution is where g₁(x) = 0, g₂(x) = 0
Lagrange multipliers
With the Lagrangian
L(x, λ) = f(x) − λ·g(x), L: ℝⁿ × ℝ^{n_e} → ℝ,
these conditions are exactly the stationarity conditions ∇_x L = 0, ∇_λ L = 0.
minimize f(x) = y,
g₁(x) = (x − 1)² + y² − 1 = 0,
g₂(x) = (x + 1)² + y² − 1 = 0.
At the solution x* = (0, 0)^T, we have
∇f(x*) = (0, 1)^T, ∇g₁(x*) = −∇g₂(x*) = (−2, 0)^T,
and again there are no Lagrange multipliers λ₁, λ₂ so that
∇f(x*) = λ₁ ∇g₁(x*) + λ₂ ∇g₂(x*).
Constraint Qualification: LICQ
Definition:
We say that at a point x the linear independence constraint qualification (LICQ) is satisfied if the gradients {∇g_i(x)}_{i=1...n_e} are linearly independent, i.e., if the matrix
A = [ [∇g₁(x)]^T ; ⋮ ; [∇g_{n_e}(x)]^T ]
has full row rank.
Theorem:
Suppose that x* is a local solution of
minimize f(x), f(x): ℝⁿ → ℝ
g(x) = 0, g(x): ℝⁿ → ℝ^{n_e}
and suppose that at this point the LICQ holds. Then there exists a unique Lagrange multiplier vector λ* so that the following conditions are satisfied:
∇f(x*) − λ*·∇g(x*) = 0
g(x*) = 0
Note: - These conditions are often referred to as the Karush-Kuhn-Tucker (KKT) conditions.
- If the LICQ does not hold, there may still be a solution, but it may not satisfy the KKT conditions!
First-order necessary conditions
At a solution,
∇f(x*)·w = 0
for every vector w tangential to all constraints,
w ∈ {v : v·∇g_i(x*) = 0, i = 1...n_e},
or equivalently w ∈ Null(A).
Second-order necessary conditions
Theorem:
Suppose that x* is a local solution of
minimize f(x), f(x): ℝⁿ → ℝ
g(x) = 0, g(x): ℝⁿ → ℝ^{n_e}
and suppose that at this point the first-order necessary conditions (with multiplier λ*) and the LICQ hold. Then
w^T ∇²_x L(x*, λ*) w ≥ 0
for every vector tangential to all constraints,
w ∈ Null(A).
Second-order sufficient conditions
Theorem:
Suppose that at a feasible point x* the first-order necessary (KKT) conditions hold with multiplier λ*. Suppose also that
w^T ∇²_x L(x*, λ*) w > 0
for all tangential vectors w ∈ Null(A), w ≠ 0. Then x* is a strict local minimizer of
minimize f(x), f(x): ℝⁿ → ℝ
g(x) = 0, g(x): ℝⁿ → ℝ^{n_e}
In practice, checking the second-order sufficient condition
w^T ∇²_x L(x*, λ*) w > 0 ∀ w ∈ Null(A), w ≠ 0
can be done as follows: if the columns of Z span Null(A), verify that
Z^T [∇²_x L(x*, λ*)] Z is positive definite.
Quadratic programming
minimize f(x) = ½ x^T G x + d^T x + e
g(x) = A x − b = 0
with the Lagrangian
L(x, λ) = f(x) − λ·g(x), L: ℝⁿ × ℝ^{n_e} → ℝ
Solving equality constrained problems
The first-order conditions read ∇_z L(z) = 0 with z = (x, λ), which looks like the first-order necessary condition for minimizing L(z). We then may think of finding solutions as follows:
● Start at a point z₀ = [x₀, λ₀]^T
● Compute search directions using [∇²_z L(z_k)] p_k = −∇_z L(z_k)
● Compute a step length α_k
● Update z_{k+1} = z_k + α_k p_k
Written out, the Newton system is
[ ∇²f(x_k) − Σ_i λ_{i,k} ∇²g_i(x_k)   −∇g(x_k) ; −∇g(x_k)^T   0 ] (p_k^x ; p_k^λ) = −( ∇f(x_k) − Σ_i λ_{i,k} ∇g_i(x_k) ; −g(x_k) )
Linear quadratic programs
or equivalently:
[ G  −A^T ; −A  0 ] (p₀^x ; p₀^λ) = −( G x₀ + d − A^T λ₀ ; −(A x₀ − b) )
Linear quadratic programs
Theorem: If A has full row rank and Z^T G Z is positive definite (where the columns of Z span the null space of A), then the KKT matrix
[ G  −A^T ; −A  0 ]
is nonsingular and the system
[ G  −A^T ; −A  0 ] (p₀^x ; p₀^λ) = −( G x₀ + d − A^T λ₀ ; −(A x₀ − b) )
has a unique solution. The point x₁ = x₀ + p₀^x with multipliers λ₁ = λ₀ + p₀^λ then solves the quadratic program, irrespective of the starting point x₀.
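A sketch that solves a tiny equality-constrained QP with a single linear solve of this KKT system, starting from x₀ = 0, λ₀ = 0 (the problem data are made up):

```python
# Solve min 1/2 x^T G x + d^T x  s.t.  A x = b  via one KKT solve.
import numpy as np

G = 2.0 * np.eye(2)
d = np.zeros(2)
A = np.array([[1.0, 1.0]])          # constraint x1 + x2 = 1
b = np.array([1.0])

x0, lam0 = np.zeros(2), np.zeros(1)
K = np.block([[G, -A.T], [-A, np.zeros((1, 1))]])
rhs = -np.concatenate([G @ x0 + d - A.T @ lam0, -(A @ x0 - b)])
p = np.linalg.solve(K, rhs)
x, lam = x0 + p[:2], lam0 + p[2:]
print(x, lam)                       # -> [0.5 0.5] [1.]
```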
Linear quadratic programs
Under the same assumptions, the KKT matrix
[ G  −A^T ; −A  0 ]
has n positive, n_e negative eigenvalues, and no zero eigenvalues. In other words, the KKT matrix is indefinite but non-singular, and the quadratic function
L(x, λ) = ½ x^T G x + d^T x + e − λ·(Ax − b)
in {x, λ} has a single stationary point that is a saddle point.
To solve
minimize f(x)
g(x) = 0
we again seek points with ∇_z L(z) = 0. Like in the unconstrained Newton's method, sequential quadratic programming uses the following basic iteration:
● Start at a point z₀ = [x₀, λ₀]^T
● Compute search directions using [∇²_z L(z_k)] p_k = −∇_z L(z_k)
● Compute a step length α_k
● Update z_{k+1} = z_k + α_k p_k
Written out, the search direction satisfies
[ ∇²f(x_k) − Σ_i λ_{i,k} ∇²g_i(x_k)   −∇g(x_k) ; −∇g(x_k)^T   0 ] (p_k^x ; p_k^λ) = −( ∇f(x_k) − Σ_i λ_{i,k} ∇g_i(x_k) ; −g(x_k) )
which we will abbreviate as follows:
[ W_k  −A_k^T ; −A_k  0 ] (p_k^x ; p_k^λ) = −( ∇f(x_k) − Σ_i λ_{i,k} ∇g_i(x_k) ; −g(x_k) )
with
W_k = ∇²_x L(x_k, λ_k)
A_k = ∇_x g(x_k) = −∇_x ∇_λ L(x_k, λ_k)
Computing the SQP search direction
Theorem: If the LICQ holds at x_k and Z_k^T W_k Z_k is positive definite, then the matrix
[ W_k  −A_k^T ; −A_k  0 ]
is nonsingular, and the system that determines the SQP search direction,
[ W_k  −A_k^T ; −A_k  0 ] (p_k^x ; p_k^λ) = −( ∇_x L(x_k, λ_k) ; −g(x_k) )
has a unique solution.
Proof: Use Theorem 1 from Part 9.
Note: The columns of the matrix Z_k span the null space of A_k.
Computing the SQP search direction
The solution of
[ W_k  −A_k^T ; −A_k  0 ] (p_k^x ; p_k^λ) = −( ∇_x L(x_k, λ_k) ; −g(x_k) )
equals the minimizer of the problem
min_{p_k^x} m_k(p_k^x) = L(x_k, λ_k) + ∇_x L(x_k, λ_k)^T p_k^x + ½ p_k^{x,T} ∇²_x L(x_k, λ_k) p_k^x
subject to g(x_k) + ∇g(x_k)^T p_k^x = 0
that approximates the original nonlinear equality-constrained minimization problem.
Theorem: The iteration
x_{k+1} = x_k + p_k^x,  λ_{k+1} = λ_k + p_k^λ
converges to the solution of the constrained nonlinear optimization problem with quadratic order if (i) we start close enough to the solution, (ii) the LICQ holds at the solution, and (iii) the matrix Z*^T W* Z* is positive definite at the solution.
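A sketch of this full-step iteration on a small problem; objective, constraint, and starting point are arbitrary choices for illustration:

```python
# Full-step SQP (Newton on the Lagrangian) for
# min 1/2 (x1^2 + x2^2)  s.t.  g(x) = x1^2 + x2 - 1 = 0.
import numpy as np

grad_f = lambda x: x                                  # f = 1/2 |x|^2
g      = lambda x: np.array([x[0]**2 + x[1] - 1.0])
grad_g = lambda x: np.array([[2.0 * x[0], 1.0]])      # Jacobian A(x), 1 x 2

x, lam = np.array([1.0, 1.0]), np.zeros(1)
for k in range(10):
    W = np.eye(2) - lam[0] * np.array([[2.0, 0.0], [0.0, 0.0]])  # Hess_x L
    A = grad_g(x)
    K = np.block([[W, -A.T], [-A, np.zeros((1, 1))]])
    rhs = -np.concatenate([grad_f(x) - A.T @ lam, -g(x)])
    p = np.linalg.solve(K, rhs)
    x, lam = x + p[:2], lam + p[2:]
print(x, lam, g(x))   # converges to a KKT point with g(x) = 0
```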
Example 1:
min f(x) = ½ (x₁² + x₂²)
g(x) = x₂ + 1 = 0
In other words, the linearized constraint enforces that
p_{2,k}^x = −(x_{2,k} + 1)  →  x_{2,k+1} = x_{2,k} + p_{2,k}^x = −1
How SQP works
Example 2:
min f(x)
g(x) = x₂ − sin(x₁) = 0
The subproblem min m_k(p_k^x) then carries the linearized constraint
x_{2,k} − sin(x_{1,k}) + (−cos(x_{1,k}), 1)^T p_k^x = 0,
i.e., −cos(x_{1,k}) p_{1,k}^x + p_{2,k}^x = −(x_{2,k} − sin(x_{1,k})).
How SQP works
Example 3:
min f(x)
g(x) = 0
The subproblem is min m_k(p_k^x) subject to
g(x_k) + ∇g(x_k)^T p_k^x = 0.
In general, the system
[ W_k  −A_k^T ; −A_k  0 ] (p_k^x ; p_k^λ) = −( ∇_x L(x_k, λ_k) ; −g(x_k) )
is equivalent to the minimization problem
min_{p_k^x} m_k(p_k^x) = L(x_k, λ_k) + ∇_x L(x_k, λ_k)^T p_k^x + ½ p_k^{x,T} ∇²_x L(x_k, λ_k) p_k^x
subject to g(x_k) + ∇g(x_k)^T p_k^x = 0,
or abbreviated:
min_{p_k^x} m_k(p^x) = L_k + (∇_x f_k − A_k^T λ_k)^T p^x + ½ p^{x,T} W_k p^x
subject to g(x_k) + A_k^T p_k^x = 0.
From this, we may expect to get into trouble if the matrix Z_k^T W_k Z_k is not positive definite.
Hessian modifications for SQP
If the matrix Z_k^T W_k Z_k in
[ W_k  −A_k^T ; −A_k  0 ] (p_k^x ; p_k^λ) = −( ∇_x L(x_k, λ_k) ; −g(x_k) )
is not positive definite, then there may not be a unique solution. In that case, solve the system with a modified matrix W̃_k,
[ W̃_k  −A_k^T ; −A_k  0 ] (p_k^x ; p_k^λ) = −( ∇_x L(x_k, λ_k) ; −g(x_k) ),
instead.
Theorem: The direction p_k^x obtained from
[ W_k  −A_k^T ; −A_k  0 ] (p_k^x ; p_k^λ) = −( ∇_x L(x_k, λ_k) ; −g(x_k) )
is a direction of descent for both the l₁ as well as Fletcher's merit function if (i) the current point x_k is not a stationary point of the equality-constrained problem, and (ii) the matrix Z_k^T W_k Z_k is positive definite.
SQP in practice:
● Compute the search direction from
[ W_k  −A_k^T ; −A_k  0 ] (p_k^x ; p_k^λ) = −( ∇_x L(x_k, λ_k) ; −g(x_k) )
● Determine the step length using a backtracking line search, a merit function φ, and the Wolfe (or Goldstein) conditions:
φ(x_k + α p^x) ≤ φ(x_k) + c₁ α ∇φ(x_k)·p^x
∇φ(x_k + α p^x)·p^x ≥ c₂ ∇φ(x_k)·p^x
● Update the iterate using either
x_{k+1} = x_k + α_k p^x,  λ_{k+1} = λ_k + α_k p^λ
or
x_{k+1} = x_k + α_k p^x,  λ_{k+1} = [A_{k+1} A_{k+1}^T]^{−1} A_{k+1} ∇f(x_{k+1})
Parts 8-10
Summary: For equality-constrained problems
minimize f(x)
g_i(x) = 0, i = 1, ..., n_e
● Lagrange multipliers reformulate the problem into one where we look for saddle points of a Lagrangian.
● Sequential quadratic programming (SQP) methods solve a sequence of quadratic programs with linear constraints, which are simple to solve.
● SQP methods are the most powerful methods.
Part 11
Inequality-constrained problems
minimize_{x∈D⊂ℝⁿ} f(x)
g_i(x) = 0, i = 1, ..., n_e
h_i(x) ≥ 0, i = 1, ..., n_i
or, in vector form,
g(x) = 0
h(x) ≥ 0
Definitions
minimize_{x∈D⊂ℝⁿ} f(x)
g_i(x) = 0, i = 1, ..., n_e
h_i(x) ≥ 0, i = 1, ..., n_i
At the solution x*, the active inequality constraints hold with equality; locally the problem therefore looks like
minimize_{x∈D⊂ℝⁿ} f(x)
g_i(x) = 0, i = 1, ..., n_e
h_i(x) = 0 for all i = 1, ..., n_i for which constraint i is active at x*,
i.e., minimize_{x∈D∩Ω⊂ℝⁿ} f(x).
Example:
minimize f(x) = sin(x)
h₁(x) = x − 0 ≥ 0,
h₂(x) = 1 − x ≥ 0.
(Figure: the penalized objective for μ = 0.1.)
The quadratic penalty method
Negative properties of the quadratic penalty method:
● minimizers for finite penalty parameters are usually infeasible;
● the problem becomes more and more ill-conditioned near the optimum as the penalty parameter is decreased (the Hessian becomes large);
● for inequality-constrained problems, the penalized function is not twice differentiable at the constraints.
(Figure: penalized functions for μ = 2, 0.2, 0.1, 0.05, 0.02, 0.01, together with f(x).)
Summary:
This is an efficient method for the solution of constrained problems.
Algorithms for penalty/barrier methods
As before:
For t = 0, 1, 2, ...:
  Find an approximation x_t* to the minimizer of Q_{μ_t}(x) that satisfies
  ‖∇Q_{μ_t}(x_t*)‖ ≤ τ_t,
  using x_t^start as starting point.
  Set t = t+1, x_t^start = x*_{t−1}
An alternative is the non-smooth (l₁) penalty function:
minimize Φ_μ(x) = f(x) + (1/μ) [ Σ_i |g_i(x)| + Σ_i |[h_i(x)]⁻| ]
where [h]⁻ denotes the negative part of h.
(Figure: Φ_μ(x) for μ⁻¹ = 1, 4, 10, together with f(x).)
Theory of inequality-constrained problems
minimize f(x)
g_i(x) = 0, i = 1, ..., n_e
h_i(x) ≥ 0, i = 1, ..., n_i
f(x) = Σ_{i=1}^{3} ½ D (‖x − x_i‖ − L₀)²
Lagrange multipliers
Both f(x), h(x) for the case of a rod of minimal length 20cm:
(Figure: infeasible region, with ∇h(x*) and ∇f(x*) at the solution x*.)
Both f(x), h(x) for the case of a rod of minimal length 35cm:
(Figure: the solution x* lies on the constraint; ∇f(x*) and ∇h(x*) point in the same direction.)
Conclusion:
● The solution can be where the constraint is not active.
● If the constraint is active at the solution: the gradients of f and h are parallel, but not antiparallel.
The candidates are the points where one of the following conditions holds for some μ:
∇f(x) − μ·∇h(x) = 0, h(x) = 0, μ ≥ 0
or
∇f(x) = 0, h(x) > 0
Lagrange multipliers
∇f(x) − μ·∇h(x) = 0, h(x) = 0, μ ≥ 0
or
∇f(x) = 0, h(x) > 0,
which could also be written like so:
∇f(x) − μ·∇h(x) = 0
h(x) ≥ 0
μ ≥ 0
μ h(x) = 0
Note: The last condition is called complementarity.
Lagrange multipliers
(Figures: infeasible region bounded by h₁(x, z) = 0 and h₂(x, z) = 0, with ∇f(x*), ∇h₁(x*), ∇h₂(x*) at the solution x*.)
With the matrix of constraint gradients
A = [ [∇g₁(x)]^T ; ⋮ ; [∇g_{n_e}(x)]^T ]
we have the
Theorem: Suppose that x* is a local solution of
minimize f(x)
g(x) = 0, h(x) ≥ 0, h(x): ℝⁿ → ℝ^{n_i}
and suppose that at this point the LICQ holds. Then there exist unique Lagrange multipliers λ*, μ* so that these conditions are satisfied:
∇f(x*) − λ*·∇g(x*) − μ*·∇h(x*) = 0
g(x*) = 0
h(x*) ≥ 0
μ* ≥ 0
μ_i* h_i(x*) = 0
In general:
F₁(x*) = { w ∈ ℝⁿ : w^T ∇g_i(x*) = 0, i = 1, ..., n_e;
           w^T ∇h_i(x*) ≥ 0, i = 1, ..., n_i, constraint i is active at x* }
First-order necessary condition:
∇f(x*)^T w ≥ 0 ∀ w ∈ F₁(x*)
F₂(x*) = { w ∈ ℝⁿ : w^T ∇g_i(x*) = 0, i = 1, ..., n_e;
           w^T ∇h_i(x*) = 0, i = 1, ..., n_i, constraint i is active at x* }
F₂(x*) ⊂ F₁(x*)
Second-order necessary conditions
Note:
The subspace of all tangential directions is
F₂(x*) = { w ∈ ℝⁿ : w^T ∇g_i(x*) = 0, i = 1, ..., n_e;
           w^T ∇h_i(x*) = 0, i = 1, ..., n_i, constraint i is active at x* }
Second-order necessary condition:
w^T ∇²_x L(x*, λ*, μ*) w = w^T [ ∇²_x f(x*) − λ*^T ∇²_x g(x*) − μ* ∇²_x h(x*) ] w ≥ 0 ∀ w ∈ F₂(x*)
Second-order sufficient condition:
w^T ∇²_x L(x*, λ*, μ*) w = w^T [ ∇²_x f(x*) − λ*^T ∇²_x g(x*) − μ* ∇²_x h(x*) ] w > 0
The set of tangential directions
F₂(x*) = { w ∈ ℝⁿ : w^T ∇g_i(x*) = 0, i = 1, ..., n_e; w^T ∇h_i(x*) = 0 for all active i }
is equivalent to
F₂(x*) = Null(A(x*))
with the matrix A of gradients of the active constraints. If A does have a null space, spanned by the columns of Z, then the second-order necessary and sufficient conditions can also be written as
Z^T ∇²_x L(x*, λ*, μ*) Z is positive semidefinite (necessary)
Z^T ∇²_x L(x*, λ*, μ*) Z is positive definite (sufficient)
F₂(x*, μ*) = { w ∈ ℝⁿ : w^T ∇g_i(x*) = 0, i = 1, ..., n_e;
               w^T ∇h_i(x*) = 0, i = 1, ..., n_i, constraint i active and μ_i* > 0;
               w^T ∇h_i(x*) ≥ 0, i = 1, ..., n_i, constraint i active and μ_i* = 0 }
F₂(x*, μ*) ⊂ F₁(x*)
Second-order sufficient condition:
w^T ∇²_x L(x*, λ*, μ*) w = w^T [ ∇²_x f(x*) − λ*^T ∇²_x g(x*) − μ* ∇²_x h(x*) ] w > 0
∀ w ∈ F₂(x*, μ*), w ≠ 0
minimize f(x) = ½ x^T G x + x^T d + e
g_i(x) = a_i^T x − b_i = 0, i = 1, ..., n_e
h_i(x) = c_i^T x − e_i ≥ 0, i = 1, ..., n_i
If we knew the optimal active set W*, we could equivalently solve the equality-constrained problem
minimize f(x) = ½ x^T G x + x^T d + e
g_i(x) = a_i^T x − b_i = 0, i = 1, ..., n_e
h_i(x) = c_i^T x − e_i = 0, i = 1, ..., n_i, i ∈ W*
General idea
Definition: Let A and B collect the constraint rows and right-hand sides,
A = [ a₁^T ; ⋮ ], B = [ b₁ ; ⋮ ],
and let A|_W, B|_W denote the rows belonging to the constraints in the working set W.
In each step, we then solve the equality-constrained quadratic program
minimize f(x) = ½ x^T G x + x^T d + e
A|_W x − B|_W = 0
minimize f(x) = (x₁ − 1)² + (x₂ − 2.5)²
subject to h(x) = A x − B ≥ 0 with
A = [ 1 −2 ; −1 −2 ; −1 2 ; 1 0 ; 0 1 ],  B = [ −2 ; −6 ; −2 ; 0 ; 0 ],
i.e., the five constraints
h₁: x₁ − 2x₂ + 2 ≥ 0, h₂: −x₁ − 2x₂ + 6 ≥ 0, h₃: −x₁ + 2x₂ + 2 ≥ 0, h₄: x₁ ≥ 0, h₅: x₂ ≥ 0.
(Figure: the feasible polygon bounded by h₁, ..., h₅.)
Example: Step 0
W₀ = {3, 5}, x₀ = (2, 0)^T.
Then: p₀ = (0, 0)^T because no other point is feasible for W₀.
∇f(x₀) − λ^T A|_{W₀} = (2, −5)^T − λ₃ (−1, 2)^T − λ₅ (0, 1)^T = 0 implies
(λ₃, λ₅) = (−2, −1).
Consequently: W₁ = {5}, x₁ = (2, 0)^T.
The active set algorithm
Example: Step 1
W₁ = {5}, x₁ = (2, 0)^T.
Then: p₁ = (−1, 0)^T leads to the minimum along the only active constraint.
There are no blocking constraints on the way to the point x₂ = x₁ + p₁.
Example: Step 2
W₂ = {5}, x₂ = (1, 0)^T.
Then: p₂ = (0, 0)^T because we are at the minimum for the active constraints.
∇f(x₂) − λ^T A|_{W₂} = (0, −5)^T − λ₅ (0, 1)^T = 0 implies λ₅ = −5.
Consequently: W₃ = {}, x₃ = (1, 0)^T.
The active set algorithm
Example: Step 3
W₃ = {}, x₃ = (1, 0)^T.
Then: p₃ = (0, 2.5)^T, but this leads out of the feasible region. The first blocking constraint is inequality 1, and the maximal step length is α₃ = 0.6.
Consequently: W₄ = {1}, x₄ = (1, 1.5)^T.
The active set algorithm
Example: Step 4
W₄ = {1}, x₄ = (1, 1.5)^T.
Then: p₄ = (0.4, 0.2)^T is the minimizer along the sole constraint. There are no blocking constraints to get there.
Example: Step 5
W₅ = {1}, x₅ = (1.4, 1.7)^T.
Then: p₅ = (0, 0)^T because we are already at the minimizer on the constraint. Furthermore,
∇f(x₅) − λ^T A|_{W₅} = (0.8, −1.6)^T − λ₁ (1, −2)^T = 0 implies λ₁ = 0.8 ≥ 0.
Consequently: This is the solution.
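As a cross-check of the worked example, the same QP can be handed to scipy's SLSQP solver; this is not the active set method itself, but it should reproduce the solution x* = (1.4, 1.7):

```python
# Verify the worked active-set example with a general-purpose solver.
import numpy as np
from scipy.optimize import minimize, LinearConstraint

f = lambda x: (x[0] - 1.0)**2 + (x[1] - 2.5)**2
A = np.array([[1, -2], [-1, -2], [-1, 2], [1, 0], [0, 1]], dtype=float)
lb = np.array([-2.0, -6.0, -2.0, 0.0, 0.0])     # h(x) = A x - lb >= 0

res = minimize(f, x0=np.array([2.0, 0.0]),
               constraints=[LinearConstraint(A, lb, np.inf)])
print(res.x)                                    # -> approx (1.4, 1.7)
```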
The active set algorithm
Theorem:
If G is strictly positive definite (i.e. the objective function is strictly convex), then W_k ≠ W_l for k ≠ l.
Consequently (because there are only finitely many possible working sets), the active set algorithm terminates in a finite number of steps.
Note:
In practice it may be that G is indefinite, and that for some iterations the matrix Z_k^T G Z_k is indefinite as well. We know that at the solution, Z*^T G Z* is positive semidefinite, however. In that case, we can't guarantee termination or convergence.
Active set SQP methods for general nonlinear problems
To solve
minimize f(x)
g_i(x) = 0, i = 1, ..., n_e
h_i(x) ≥ 0, i = 1, ..., n_i
we repeatedly solve linear-quadratic problems of the form
min_{p_k^x} m_k(p_k^x) = L(x_k, λ_k) + ∇_x L(x_k, λ_k)^T p_k^x + ½ p_k^{x,T} ∇²_x L(x_k, λ_k) p_k^x
subject to g(x_k) + ∇g(x_k)^T p_k^x = 0
           h(x_k) + ∇h(x_k)^T p_k^x ≥ 0
Each of these inequality-constrained quadratic problems can be solved using the active set method, and after we have the exact solution of this approximate problem we can re-linearize around this point for the next sub-problem.
Active set SQP methods for general nonlinear problems
minimize f(x)
g_i(x) = 0, i = 1, ..., n_e
h_i(x) ≥ 0, i = 1, ..., n_i
● Lagrange multiplier formulations lead to active set methods.
● Both kinds of methods are expensive. Penalty/barrier methods are simpler to implement, but can find minima located at the boundary of the feasible set only at the price of dealing with ill-conditioned problems.
Part 15
Global optimization
minimize f(x)
g_i(x) = 0, i = 1, ..., n_e
h_i(x) ≥ 0, i = 1, ..., n_i
Example: f(x) = (1/20)(x₁² + x₂²) + cos(x₁) + cos(x₂)
A naïve sampling approach
- If f(x_t) ≤ f(x_k) then x_{k+1} = x_t
- Else:
  . draw a random number s in [0, 1]
  . if
    exp(−[f(x_t) − f(x_k)] / T) ≥ s, s ∈ U[0, 1]
    then x_{k+1} = x_t, else x_{k+1} = x_k
Example: For f(x) = (1/20)(x₁² + x₂²) + cos(x₁) + cos(x₂), the difference Δf in function value between local minima and saddle points is around 2. We want to choose T so that
exp(−Δf / T) ≥ s, s ∈ U[0, 1]
is satisfied reasonably often.
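A sketch of this acceptance rule (the Metropolis criterion) at a fixed temperature; the proposal distribution, temperature, and iteration count are ad hoc choices:

```python
# Monte Carlo sampling with the Metropolis acceptance rule on the
# example objective; Gaussian trial samples, fixed temperature T.
import numpy as np

def f(x):
    return (x[0]**2 + x[1]**2) / 20.0 + np.cos(x[0]) + np.cos(x[1])

rng = np.random.default_rng(0)
T, x = 1.0, np.array([5.0, 5.0])
for k in range(10000):
    xt = x + rng.normal(size=2)                  # trial sample
    if f(xt) <= f(x) or np.exp(-(f(xt) - f(x)) / T) >= rng.uniform():
        x = xt                                   # accept
print(x, f(x))   # a sample, typically near one of the low-lying minima
```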
Inequality constraints:
● For simple inequality constraints, modify the sample generation strategy to never generate infeasible trial samples.
● For complex inequality constraints, always reject infeasible samples:
- If Q(x_t) ≤ Q(x_k) then x_{k+1} = x_t
- Else:
  . draw a random number s in [0, 1]
  . if exp(−[Q(x_t) − Q(x_k)] / T) ≥ s then x_{k+1} = x_t, else x_{k+1} = x_k
where
Q(x) = ∞ if at least one h_i(x) < 0, Q(x) = f(x) otherwise.
Monte Carlo sampling with constraints
Equality constraints:
● Generate only samples that satisfy the equality constraints.
● If we have only linear equality constraints of the form g(x) = Ax − b = 0, then one way to guarantee this is to generate samples using
x_t = x_k + Z y, y ∈ ℝ^{n−n_e}, y = N(0, I) or U[−1, 1]^{n−n_e},
where the columns of Z span the null space of A.
Theorem:
Let A be a subset of the feasible region. Under certain conditions on the sample generation strategy, as k → ∞ we have
number of samples x_k ∈ A ∝ ∫_A e^{−f(x)/T} dx.
In particular,
fraction of samples x_k ∈ A = (1/C) ∫_A e^{−f(x)/T} dx + O(1/√N).
Monte Carlo sampling
Remark:
Monte Carlo sampling appears to be a strategy that bounces
around randomly, only taking into account the values (not the
derivatives) of f(x).
Motivation:
Particles in a gas, or atoms in a crystal, have an energy that is on average in equilibrium with the rest of the system. At any given time, however, their energy may be higher or lower. Energy transitions happen with probability
P(E → E + ΔE) ∝ min{1, e^{−ΔE/(k_B T)}} = { 1 if ΔE ≤ 0; e^{−ΔE/(k_B T)} if ΔE > 0 }
This is exactly the Monte Carlo transition probability if we identify E with f and k_B T with the sampling temperature T.
Simulated Annealing
Motivation:
In other words, Monte Carlo sampling is analogous to watching particles bounce around in a potential f(x) when driven by a gas at constant temperature. Simulated annealing instead lowers the temperature over time:
exp(−[f(x_t) − f(x_k)] / T_k) ≥ s, s ∈ U[0, 1], with T_k → 0 as k → ∞
(Figures: samples for constant T = 1 compared with the cooling schedules T_k = 1/(1 + 10⁻⁴ k), T_k = 1/(1 + 0.005 k), and T_k = 1/(1 + 0.0005 k).)
Discussion:
Simulated annealing is often more efficient in finding global minima because it initially explores the energy landscape at large, and later on explores the areas of low energy in greater detail.
A further refinement:
In Very Fast Simulated Annealing we not only reduce the temperature over time, but also reduce the search radius of our sample generation strategy, i.e., we compute
x_t = x_k + Δ_k y, y ∈ ℝⁿ, y = N(0, I) or U[−1, 1]ⁿ
and let Δ_k → 0.
Like reducing the temperature, this ensures that we sample the vicinity of minima better and better over time.
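A sketch combining the cooling schedule T_k = 1/(1 + 0.005k) from the figures with a shrinking search radius in the spirit of Very Fast Simulated Annealing; the radius schedule and all constants are assumptions:

```python
# Simulated annealing on the example objective: cooling temperature plus
# a shrinking proposal radius (VFSA-style).
import numpy as np

def f(x):
    return (x[0]**2 + x[1]**2) / 20.0 + np.cos(x[0]) + np.cos(x[1])

rng = np.random.default_rng(0)
x, best = np.array([8.0, 8.0]), None
for k in range(20000):
    T = 1.0 / (1.0 + 0.005 * k)                  # temperature -> 0
    radius = 2.0 / (1.0 + 1e-3 * k)              # search radius -> 0
    xt = x + radius * rng.normal(size=2)
    if f(xt) <= f(x) or np.exp(-(f(xt) - f(x)) / T) >= rng.uniform():
        x = xt
    if best is None or f(x) < f(best):
        best = x.copy()
print(best, f(best))
```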
(Figures: samples for f(x) = Σ_{i=1}^{10} ((1/20) x_i² + cos x_i) and for f(x) = Σ_{i=1}^{2} ((1/20) x_i² + cos x_i).)
Mating:
● Mating is meant to produce new individuals that share the traits of the two parents.
● If the variable x encodes real values, then mating could just take the mean value of the parents:
x_new = (x_a + x_b) / 2
● For more general properties (paths through cities, which of M objects to put where in a suitcase, ...) we have to encode x in a binary string. Mating may then select bits (or bit sequences) randomly from each of the parents.
● There is a huge variety of encoding and selection strategies in the literature.
Mutation:
● Mutations are meant to introduce an element of randomness into the process, to explore search directions that aren't represented yet in the population.
● If the variable x represents real values, we can just add a small random value to simulate mutations (see the sketch below):
x_new = (x_a + x_b)/2 + ε y, y ∈ ℝⁿ, y = N(0, I)
● For more general properties, mutations can be introduced by randomly flipping individual bits or bit sequences in the encoded properties.
● There is a huge variety of mutation strategies in the literature.
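A toy real-valued genetic algorithm combining the mating and mutation rules above; population size, selection rule, and mutation strength are ad hoc choices:

```python
# Toy genetic algorithm: keep the fittest half, mate random pairs by
# averaging, mutate with small Gaussian noise.
import numpy as np

def f(x):
    return (x[:, 0]**2 + x[:, 1]**2) / 20.0 + np.cos(x[:, 0]) + np.cos(x[:, 1])

rng = np.random.default_rng(0)
pop = rng.uniform(-10.0, 10.0, size=(40, 2))         # initial population
for gen in range(200):
    parents = pop[np.argsort(f(pop))[:20]]           # keep the fittest half
    a = parents[rng.integers(0, 20, size=20)]
    b = parents[rng.integers(0, 20, size=20)]
    children = (a + b) / 2.0 + 0.3 * rng.normal(size=(20, 2))  # mate + mutate
    pop = np.vstack([parents, children])
print(pop[np.argmin(f(pop))])                        # best individual found
```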
Summary of global optimization methods
minimize f(x)
g_i(x) = 0, i = 1, ..., n_e
h_i(x) ≥ 0, i = 1, ..., n_i
● Global optimization algorithms should not be used when we know that the problem has only a small number of minima and/or is smooth and convex.