Introduction to Optimization
Fall Semester 2022
Roland Herzog∗
2023-03-29
∗ Interdisciplinary Center for Scientific Computing, Heidelberg University, 69120 Heidelberg, Germany
([email protected], https://scoop.iwr.uni-heidelberg.de/team/rherzog/).
Contents
§1 Introduction
§2 Unconstrained Optimization and Parameter Estimation
§ 2.1 Algorithmic Sketches for Generic Unconstrained Problems
§ 2.2 Algorithmic Sketches for Unconstrained Least-Squares Problems
§3 Convex Optimization
§4 Constrained Optimization
§5 Infinite-Dimensional Optimization
§6 Summary and Practical Advice
§7 Solutions
§ 7.1 Solution of Example 2.2 (Rosenbrock) using CasADi (Problem-Based)
§ 7.2 Solution of Example 2.2 (Rosenbrock) using CasADi (Opti Interface-Based)
§ 7.3 Solution of Example 2.3 (Airplane) using CasADi (Problem-Based)
§ 7.4 Solution of Example 4.1 (Production) using Matlab’s Optimization Toolbox (Solver-Based)
§ 7.5 Solution of Example 4.1 (Production) using Matlab’s Optimization Toolbox (Problem-Based)
§ 7.6 Solution of Example 4.1 (Production) using scipy.optimize
§ 7.7 Solution of Example 4.1 (Production) using CasADi (Problem-Based)
§ 7.8 Solution of Example 4.1 (Production) using CasADi (Opti Interface-Based)
§ 7.9 Solution of Example 4.12 (Post-Office) using CasADi (Opti Interface-Based)
§ 7.10 Solution of Example 5.1 (van der Pol) using CasADi (Problem-Based)
§1 Introduction
• the number 𝑛ineq of inequality constraints and the number 𝑛eq of equality constraints are finite (possibly zero).
More precisely, problems of type (1.1) fall into the realm of continuous optimization. In case some
or all components of 𝑥 are to be sought in a discrete set (such as the integers Z), we have a mixed
discrete-continuous problem or even a fully discrete problem, which we do not consider in this
course.
(a) What do we use as our optimization variable(s)? What is the significance of each component
of 𝑥?
(b) How do we formulate the objective function? What is its significance?
(c) Do we need to formulate constraints? How do we model them? What is the significance of
each of the constraints?
(d) Are there any constants in the problem? What are their values and what is their signifi-
cance?
The answer to these questions has an impact on the next step, the selection of a suitable solver.
The more specifically the solver fits the problem, the more efficient the solution process usually
is. For instance, a linear optimization problem (where the objective and all constraints are affine
functions of 𝑥) is best solved using a solver tailored to linear optimization problems (such as an
implementation of the simplex method). It is “overkill” to solve a linear optimization problem
using a solver capable of handling general nonlinear constrained problems, and the more general
optimization procedure will typically perform worse than specialized routines (although it
should usually succeed).
(4) Implementing the functions describing the problem for the solver.
You will need to implement the objective and constraint functions in a format suitable for the
selected solver, in the respective programming language (such as Matlab, Python, Julia, or C++).
There are also specific modeling languages for optimization problems, e. g., AMPL, which are, in
principle, independent of the solver. However, not all solvers will accept all input formats.
Finally, some solver packages such as CasADi (Python, Matlab) and JuMP (Julia), include
their own modeling paradigms which facilitate the formulation of optimization problems.
The above steps are not necessarily run through sequentially. For instance, it might turn out later in
the process that a different formulation of the problem than originally selected is more favorable for
the solver, so one needs to go back. Or we might figure out from the solution that we forgot to model
certain constraints, which we then include and run the solver again.
It is also important to understand that, given a problem in text form, the above steps do not have
a predetermined outcome. It is up to us to model the problem so that we can identify (or design
ourselves) a numerical solver that can efficiently solve the problem. Your modeling skills will get better
with practice, and this process is partially based on trial-and-error.
Example 1.1 (Choice of optimization variables). This example illustrates that there is no single correct
answer to modeling a problem. For instance, what we choose to be our optimization variables is a modeling
decision. Consider, for instance, an investor, who seeks to distribute a certain amount of money 𝑀 between
a bank account with fixed interest and a stock investment with uncertain future stock prices. The investor’s
objective may be to maximize the expected amount of their wealth at a certain time in the future.
We are not specifying the problem any further here, but with the natural choice of variables (𝑥1, the
amount invested in the bank account, and 𝑥2, the amount invested in the stock) it is likely going to
contain the constraints 𝑥1 ≥ 0, 𝑥2 ≥ 0 and 𝑥1 + 𝑥2 = 𝑀.
Alternatively, instead of determining how much money to put in either investment, the investor may
describe their optimization variable using a single decision, namely what fraction (“percentage”) of the
money 𝑀 is going to be invested in the bank account. That is, the investor may choose their optimization
variable as
𝑥1 ≔ fraction of 𝑀 to be invested in the bank account,

which is subject to the constraint 0 ≤ 𝑥1 ≤ 1. Naturally, the remaining fraction 1 − 𝑥1 of 𝑀 is going
to be invested in the stock.
We now discuss what it means to solve an optimization problem, i. e., to find and recognize minimiz-
ers.
(𝑖𝑖) The inequality 𝑔𝑖(𝑥) ≤ 0 is said to be active at the point 𝑥 in case 𝑔𝑖(𝑥) = 0. It is called inactive
in case 𝑔𝑖(𝑥) < 0. It is called violated in case 𝑔𝑖(𝑥) > 0.
(𝑖𝑣) A point 𝑥∗ ∈ 𝐹 is a local minimizer or local solution of (1.1) if there exists a (potentially small)
neighborhood1 𝑈(𝑥∗) of 𝑥∗ such that 𝑓(𝑥∗) ≤ 𝑓(𝑥) holds for all 𝑥 ∈ 𝑈(𝑥∗) ∩ 𝐹.
While Definition 1.2 states what minimizers are, the definition is not very helpful when it comes to
checking whether a given point 𝑥 ∗ is indeed, say, a local minimizer. In order to verify this on the
grounds of the definition would require us to compare the value of the objective 𝑓 (𝑥 ∗ ) with the value
of the objective 𝑓 (𝑥) at an (uncountable!) number of nearby points 𝑥! This is clearly impractical.
Therefore, we need to resort to different ways of verifying that a certain point is a minimizer, and also of
excluding certain points from the list of candidate minimizers. To this end, we can use derivative-based
optimality conditions. Optimality conditions come in two flavors:
(1) A necessary optimality condition (NC) is a statement of the following form: “Suppose that
𝑥 ∗ is a local minimizer of problem (1.1). Then (NC) holds at 𝑥 ∗ .”
(a) Finding points 𝑥∗ ∈ R𝑛 that satisfy (NC) means to compile a list of candidates for local
minimizers. Those can then be further examined by other means.
(b) A necessary optimality condition can also be read as:2 “Suppose that (NC) does not hold at
𝑥 ∗ , then 𝑥 ∗ cannot be a local minimizer.” Consequently, necessary optimality conditions
can be used to exclude points from the search for local minimizers.
(2) A sufficient optimality condition (SC) is a statement of the following form: “Suppose that
(SC) holds at 𝑥 ∗ . Then 𝑥 ∗ is a local minimizer.”
We will discuss appropriate forms of necessary and sufficient optimality conditions for the particular
type of problem in each of the following sections.
1 Note that the definition is equivalent if we were to replace “neighborhood of 𝑥 ∗ ” by “ball around 𝑥 ∗ ”, i. e., 𝐵𝜀 (𝑥 ∗ ) = {𝑥 ∈
R𝑛 | ∥𝑥 − 𝑥 ∗ ∥ 2 < 𝜀}.
2 This is called the contraposition of the original statement.
§2 Unconstrained Optimization and Parameter Estimation
As the name suggests, unconstrained optimization is about minimizing an objective in the absence of
any equality or inequality constraints. In this case, our generic problem (1.1) reduces to

Minimize 𝑓(𝑥) where 𝑥 ∈ R𝑛. (2.1)
For this problem, “derivative equals zero” is a necessary optimality condition. Points which satisfy
𝑓 ′ (𝑥) = 0 are called stationary points.
Sufficient optimality conditions are usually based on second-order derivatives, and we do not state
them here. Instead, we proceed to sketch some algorithms. Indeed, algorithms usually find stationary
points, which may or may not be local minimizers. Nevertheless, we use the term “optimization
algorithms” and continue to speak of “minimizing the objective”.
§ 2.1 Algorithmic Sketches for Generic Unconstrained Problems

Optimization algorithms proceed in a sequential way and break down problem (2.1) into a sequence of
simpler problems. In doing so, they generate a sequence 𝑥 (𝑘 ) of iterates, which (under some conditions)
converges to a stationary point 𝑥 ∗ .
The vast majority of algorithms for (2.1) proceeds as follows. At the current iterate 𝑥(𝑘), they form a
quadratic model of the objective. A quadratic model is a quadratic (i. e., second-order) polynomial
that shares some properties with the true objective in the vicinity of the current iterate 𝑥 (𝑘 ) . The
quadratic model at 𝑥 (𝑘 ) is of the form (see Figure 2.1 for an illustration in 1D)
𝑞(𝑘)(𝑥) ≔ 𝑓(𝑥(𝑘)) + 𝑓′(𝑥(𝑘)) (𝑥 − 𝑥(𝑘)) + ½ (𝑥 − 𝑥(𝑘))ᵀ 𝐻(𝑘) (𝑥 − 𝑥(𝑘)). (2.2)

Its three terms are referred to as the constant term, the linear term, and the quadratic term, respectively.
At 𝑥 = 𝑥 (𝑘 ) , the model 𝑞 (𝑘 ) shares its function value and its derivative with the true objective 𝑓 .
(Quiz 2.1: Can you show this?)
Algorithms proceed by minimizing the current quadratic model 𝑞 (𝑘 ) , or at least they look for a
stationary point of the model:
[𝑞(𝑘)]′(𝑥) = 𝑓′(𝑥(𝑘)) + (𝑥 − 𝑥(𝑘))ᵀ𝐻(𝑘) ≟ 0.
Figure 2.1: A function 𝑓 : R → R and two possible quadratic models 𝑞(𝑘) about the point 𝑥(𝑘) (black
dot). The red model is the second-order Taylor polynomial of 𝑓 at 𝑥(𝑘). In other words,
the red model uses 𝐻(𝑘) = 𝑓′′(𝑥(𝑘)). The blue model instead uses an “incorrect” Hessian
𝐻(𝑘) ≠ 𝑓′′(𝑥(𝑘)).
Transposition then yields the linear system

𝐻(𝑘) (𝑥 − 𝑥(𝑘)) = −∇𝑓(𝑥(𝑘)), (2.3)

whose solution defines the search direction 𝑑 ≔ 𝑥 − 𝑥(𝑘).
In practical algorithms, this direction is then scaled with a suitable step length 𝛼 (𝑘 ) > 0, which accounts
for the deviation of the quadratic model from the original objective and gives us the update
𝑥 (𝑘+1) B 𝑥 (𝑘 ) + 𝛼 (𝑘 ) 𝑑.
The step length 𝛼 (𝑘 ) has the purpose of making the step between consecutive iterates small when
the model does not agree well with the true objective. This is done in an effort to obtain robust
convergence properties regardless of the quality of the initial guess provided by the user. Since 𝛼 (𝑘 ) is
determined by a line search procedure, in which trial step sizes are proposed and evaluated until a
satisfactory one is found, one speaks of a line search algorithm. An alternative concept is to limit
the size of 𝑑 in the first place by constraining the norm ∥𝑑 ∥ 2 ≤ Δ (𝑘 ) when minimizing the model
(2.2). The solution of this problem is more complex than solving a linear system (2.3). This leads to
trust-region algorithms, and Δ (𝑘 ) is the trust-region radius.
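The following is a minimal sketch of one line search iteration in Python, assuming user-supplied
callables f and grad_f and a symmetric positive definite model Hessian H; the Armijo backtracking
test used here is one common way to propose and evaluate trial step sizes.

import numpy as np

def line_search_step(f, grad_f, H, x, alpha0=1.0, c=1e-4, rho=0.5):
    g = grad_f(x)
    # Solve the linear system (2.3) for the search direction d.
    d = np.linalg.solve(H, -g)
    # Backtracking: shrink alpha until a sufficient decrease is achieved.
    alpha = alpha0
    while f(x + alpha * d) > f(x) + c * alpha * (g @ d):
        alpha = rho * alpha
    # Take the step x^(k+1) = x^(k) + alpha^(k) * d.
    return x + alpha * d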
Gradient methods are simple in the sense that the linear system (2.3) solved in each iteration has a
constant coefficient matrix, which can be exploited. Gradient methods, however, typically exhibit slow
convergence. On the other hand, Newton methods converge faster once they are near a minimizer;
however, the evaluation of the objective’s second derivative matrix in every iteration is usually costly.
In practice, quasi-Newton methods are a good compromise, in which 𝐻 (𝑘 ) tries to capture some of the
properties of the objective’s second derivative without actually computing it.
We close this section by mentioning what functions a user has to implement in order to run an
algorithm for solving an unconstrained problem (2.1):
objective 𝑓 (𝑥)
derivative 𝑓 ′ (𝑥) optional (finite difference fallback)
2nd derivative 𝑓 ′′ (𝑥) or 𝑓 ′′ (𝑥) 𝑑 optional, only for Newton methods
The evaluation of the first-order derivative 𝑓 ′ (𝑥), although required by all algorithms, is usually
optional for the user. When it is not provided, the algorithm will fall back to an internal approximation
of 𝑓 ′ (𝑥) by finite differences. While this may sound convenient, it may be very time consuming and
introduce inaccuracies. If possible, the user should provide first-order derivatives. To facilitate the
task, they may use algorithmic differentiation (AD) for this purpose. This is beyond the scope of
these notes.
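For illustration, here is a sketch using scipy.optimize.minimize (a solver not otherwise discussed in
these notes): the first-order derivative is passed via the jac argument, and omitting it triggers the
finite difference fallback mentioned above.

import numpy as np
from scipy.optimize import minimize

def f(x):
    return (x[0] - 1)**2 + 4 * (x[1] + 2)**2

def fprime(x):
    return np.array([2 * (x[0] - 1), 8 * (x[1] + 2)])

result = minimize(f, x0=np.zeros(2), jac=fprime)  # user-supplied derivative
result_fd = minimize(f, x0=np.zeros(2))           # finite difference fallback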
Example 2.2 (Rosenbrock problem). The Rosenbrock problem is a classical test example in uncon-
strained optimization:

Minimize 𝑓(𝑥, 𝑦) = (𝑎 − 𝑥)² + 𝑏 (𝑦 − 𝑥²)² where (𝑥, 𝑦) ∈ R².

It has a global minimizer at (𝑥, 𝑦) = (𝑎, 𝑎²), with an optimal function value of 0. A typical choice of the
parameters is (𝑎, 𝑏) = (1, 100). A plot of the function can be found in Figure 2.2.
Figure 2.2: Contour plot of the Rosenbrock “banana” function (Example 2.2) with parameters (𝑎, 𝑏) =
(1, 100), and its global minimizer.
§ 2.2 Algorithmic Sketches for Unconstrained Least-Squares Problems

We now consider unconstrained problems whose objective is a sum of squares of residual functions
𝑟𝑖 : R𝑛 → R,

Minimize 𝑓(𝑥) ≔ ∑_{𝑖=1}^{𝑀} [𝑟𝑖(𝑥)]² = ∥𝑟(𝑥)∥₂² where 𝑥 ∈ R𝑛. (2.4)

These are called least-squares problems. The primary source of least-squares problems lies in
parameter estimation, also known as parameter identification, model calibration, curve fitting,
or data assimilation.
In parameter estimation one is given a model function 𝑚 which describes a relation 𝜂 = 𝑚(𝜉) between
independent variables (model input) 𝜉 ∈ R𝑟 and dependent variables 𝜂 ∈ R (model output). The model
in fact contains yet unknown parameters 𝑥 ∈ R𝑛 , i. e., we have a relation of the form
𝜂 = 𝑚(𝑥; 𝜉).
Given the measurement pairs (𝜉𝑖, 𝜂𝑖), we define the residuals as the model prediction minus the
measurement,

𝑟𝑖(𝑥) ≔ 𝑚(𝑥; 𝜉𝑖) − 𝜂𝑖, (2.5)
which describes the discrepancy between the value predicted by the model and the actual measure-
ment 𝜂𝑖 pertaining to the 𝑖-th input 𝜉𝑖 .
The least-squares problem (2.4) describes our desire to choose the parameters 𝑥 ∈ R𝑛 in such a way
that the model will fit the available measurement pairs (𝜉𝑖 , 𝜂𝑖 ) ∈ R𝑟 × R, 𝑖 = 1, . . . , 𝑀, in the best
possible way, in the sense of least squares.
There are statistical reasons why — at least for measurements 𝜂𝑖 perturbed by additive random
measurement errors, which are uncorrelated and identically normally distributed — the solution of
the least-squares problem (2.4) yields the best estimate of the parameters. When, more generally, the
measurement errors are still uncorrelated but may have non-identical variances 𝜎𝑖2 , we should consider
the weighted least-squares problem
Minimize 𝑓(𝑥) ≔ ∑_{𝑖=1}^{𝑀} [𝑟𝑖(𝑥)]² / 𝜎𝑖² = ∥𝑟(𝑥)∥²_{Σ⁻¹} where 𝑥 ∈ R𝑛, (2.6)

where Σ = diag(𝜎1², . . . , 𝜎𝑀²) is the measurement covariance matrix. However, we do not specifically
consider the form (2.6) here. After all, (2.6) can be brought into the form (2.4) simply by scaling the
residuals by 1/𝜎𝑖.
We could use any of the algorithmic ideas from § 2.1 to solve (2.4). However, it is usually more efficient
to use a different algorithm that exploits the particular structure of the objective in (2.4). In order to
understand this, we determine the first- and second-order derivatives of the objective from (2.4) using
the chain rule. We expect that the derivative of the residual function 𝑟 will appear, i. e., its Jacobian,
which we denote by 𝑟 ′ or by 𝐽 :
𝐽(𝑥) ≔ 𝑟′(𝑥) = ( 𝜕𝑟𝑖(𝑥)/𝜕𝑥𝑗 )𝑖𝑗 ∈ R^{𝑀×𝑛}, (2.7)

i. e., the matrix whose 𝑖-th row is the transposed gradient ∇𝑟𝑖(𝑥)ᵀ, for 𝑖 = 1, . . . , 𝑀.
For cosmetic reasons it is practical to divide the objective in (2.4) by 2. This does, of course, not
influence where the minimizers lie.
𝑓(𝑥) = ½ ∥𝑟(𝑥)∥₂² = ½ 𝑟(𝑥)ᵀ𝑟(𝑥) (objective) (2.8a)

∇𝑓(𝑥) = 𝐽(𝑥)ᵀ𝑟(𝑥) = ∑_{𝑖=1}^{𝑀} 𝑟𝑖(𝑥) ∇𝑟𝑖(𝑥) (first-order derivative) (2.8b)

𝑓′′(𝑥) = 𝐽(𝑥)ᵀ𝐽(𝑥) + ∑_{𝑖=1}^{𝑀} 𝑟𝑖(𝑥) 𝑟𝑖′′(𝑥) (second-order derivative) (2.8c)
With the help of this data one could set up a quadratic model of the objective 𝑓 at the current iterate
𝑥 (𝑘 ) , determine a search direction 𝑑 (𝑘 ) by minimizing it, then find a step length 𝛼 (𝑘 ) , take the step etc.
The additional knowledge, however, that we are dealing with a least-squares problem can be exploited
even better algorithmically. We describe this using the popular Levenberg-Marquardt method4 as
an example.
The Levenberg-Marquardt method sets up a certain quadratic model about the point 𝑥 (𝑘 ) :
𝑞_LM^(𝑘)(𝑑) ≔ 𝑓(𝑥(𝑘)) + ∇𝑓(𝑥(𝑘))ᵀ𝑑 + ½ 𝑑ᵀ𝐻_LM^(𝑘)𝑑
= ½ 𝑟(𝑥(𝑘))ᵀ𝑟(𝑥(𝑘)) + 𝑟(𝑥(𝑘))ᵀ𝐽(𝑥(𝑘)) 𝑑 + ½ 𝑑ᵀ𝐻_LM^(𝑘)𝑑 by (2.8). (2.9)
The Levenberg-Marquardt method uses
𝐻_LM^(𝑘) ≔ 𝐽(𝑥(𝑘))ᵀ𝐽(𝑥(𝑘)) + 𝜆(𝑘) Id (2.10)
as the model Hessian. Here 𝜆 (𝑘 ) > 0 is a positive number. This has the following advantages:
4 proposed by Levenberg (1944), rediscovered by Marquardt (1963)
(𝑖) One uses the “good” first part 𝐽 (𝑥) ᵀ 𝐽 (𝑥) of the true Hessian (2.8c). This part is at least positive
semidefinite and it can be computed without further effort since we need 𝐽 (𝑥) anyway for the
first-order derivative ∇𝑓 (𝑥).
(𝑖𝑖) The “bad” part ∑_{𝑖=1}^{𝑀} 𝑟𝑖(𝑥) 𝑟𝑖′′(𝑥) of the Hessian (2.8c) is omitted. Since this part depends on the
second-order derivatives 𝑟𝑖′′ (𝑥), it is expensive to evaluate, and it can potentially destroy the
positive definiteness. Moreover we hope that the residuals 𝑟𝑖 (𝑥 ∗ ) at the solution 𝑥 ∗ will be
small (near zero), i. e., we hope that we will have a good fit, and this part of the Hessian can be
disregarded.
(𝑖𝑖𝑖) Instead one adds an auxiliary term 𝜆(𝑘) Id. In view of 𝜆(𝑘) > 0, positive definiteness of the model
Hessian 𝐻_LM^(𝑘) is ensured, which implies that the linear problems that need to be solved in the
algorithm’s substeps are uniquely solvable.
As with the methods of § 2.1, one is interested in the minimizer of the quadratic model (2.9) in the 𝑘-th
iteration. This minimizer is unique and it can be calculated by solving the following linear system of
equations, compare (2.3),
𝐻_LM^(𝑘) 𝑑(𝑘) = −∇𝑓(𝑥(𝑘)). (2.11)
In more explicit terms, see (2.10) and (2.8b), we can write
( 𝐽(𝑥(𝑘))ᵀ𝐽(𝑥(𝑘)) + 𝜆(𝑘) Id ) 𝑑(𝑘) = −𝐽(𝑥(𝑘))ᵀ𝑟(𝑥(𝑘)). (2.12)
The size of the Levenberg-Marquardt parameter 𝜆 (𝑘 ) implicitly determines the length of the step 𝑑.
(Quiz 2.2: What happens in the extreme cases 𝜆 (𝑘 ) → 0 and 𝜆 (𝑘 ) → ∞, respectively?) 𝜆 (𝑘 ) itself is
controlled depending on how well the model (2.9) was able to predict the actual decrease in the values
of the objective (2.4).
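The following Python sketch (with user-supplied callables r and J for the residual and its Jacobian)
illustrates one simple control strategy for 𝜆(𝑘): accept the step and decrease 𝜆(𝑘) when the objective
decreases, otherwise reject the step and increase 𝜆(𝑘).

import numpy as np

def levenberg_marquardt(r, J, x, lam=1e-2, max_iter=100, tol=1e-8):
    fx = 0.5 * np.dot(r(x), r(x))
    for _ in range(max_iter):
        rx, Jx = r(x), J(x)
        grad = Jx.T @ rx                      # gradient, cf. (2.8b)
        if np.linalg.norm(grad) < tol:
            break
        # Solve (2.12) for the step d.
        d = np.linalg.solve(Jx.T @ Jx + lam * np.eye(x.size), -grad)
        f_new = 0.5 * np.dot(r(x + d), r(x + d))
        if f_new < fx:
            x, fx = x + d, f_new              # accept the step
            lam = lam / 10                    # trust the model more
        else:
            lam = 10 * lam                    # reject and shorten the step
    return x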
We mention what functions a user has to implement in order to run a dedicated least-squares algorithm
for solving a problem of type (2.4):
residual 𝑟 (𝑥)
Jacobian 𝐽 (𝑥) optional (finite difference fallback)
Often times, the user has a choice (through different interfaces to the algorithm) to specify the model
function 𝑚 and the set of measurement pairs (𝜉𝑖 , 𝜂𝑖 ) ∈ R𝑟 × R, 𝑖 = 1, . . . , 𝑀, and have the method itself
evaluate the residual function with components (2.5) and Jacobian
𝐽(𝑥) = 𝑟′(𝑥) = ( 𝜕𝑚(𝑥; 𝜉𝑖)/𝜕𝑥𝑗 )𝑖𝑗 ∈ R^{𝑀×𝑛}, whose 𝑖-th row is ∇𝑚(𝑥; 𝜉𝑖)ᵀ.
In this case, the user provides
model 𝑚(𝑥; 𝜉)
model gradient ∇𝑚(𝑥; 𝜉) optional (finite difference fallback)
and the set of measurement pairs (𝜉𝑖 , 𝜂𝑖 ). Again, when derivatives are not provided, most algorithms
will fall back to finite differencing.
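As a small illustration of this second interface style, here is a sketch using
scipy.optimize.least_squares with a hypothetical exponential model 𝑚(𝑥; 𝜉) = 𝑥1 exp(𝑥2 𝜉) and
made-up measurement pairs; both the model and the data are assumptions for this example only.

import numpy as np
from scipy.optimize import least_squares

xi = np.array([0.0, 0.5, 1.0, 1.5, 2.0])    # model inputs
eta = np.array([1.0, 1.7, 2.6, 4.3, 7.1])   # measurements

def residual(x):
    # r_i(x) = m(x; xi_i) - eta_i as in (2.5)
    return x[0] * np.exp(x[1] * xi) - eta

def jacobian(x):
    # columns: dr/dx1 and dr/dx2
    return np.column_stack([np.exp(x[1] * xi),
                            x[0] * xi * np.exp(x[1] * xi)])

solution = least_squares(residual, x0=[1.0, 1.0], jac=jacobian)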
Example 2.3 (Parameter estimation). The position (𝑥, 𝑦) of an airplane should be determined by radio
bearing. To this end, the directions towards several radio beacons are measured from the airplane. The
positions (𝑋𝑖 , 𝑌𝑖 ) of the beacons are known and the directions are measured as angles 𝛼𝑖 to the 𝑥-axis:
The aim is to find the position (𝑥, 𝑦) of the airplane for which the discrepancies between the predicted and
measured angles are as small as possible in the least squares sense.
In this example, we have 𝑀 = 4 individual measurements. We require a model which maps the independent
variable 𝜉 = (𝑋, 𝑌) ∈ R² to the dependent variable 𝜂 = 𝛼 ∈ R, as a function of the unknown airplane
position (𝑥, 𝑦). A model which achieves this is given by
𝑚((𝑥, 𝑦); (𝑋, 𝑌)) = atan2(𝑌 − 𝑦, 𝑋 − 𝑥) · 180°/𝜋,

where (𝑥, 𝑦) is the unknown parameter and (𝑋, 𝑌) the independent variable. This model returns the
angle of the vector (𝑋 − 𝑥, 𝑌 − 𝑦)ᵀ against the 𝑥-axis in degrees. In fact, since atan2 returns
angles in the interval (−𝜋, 𝜋], while our measurements are in 0° to 360°, we have to normalize atan2’s
output to [0, 2𝜋) beforehand; see the listing in § 7.3 for details.
The solution obtained with the code from § 7.3 is shown in Figure 2.4.
End of Class 1
Figure 2.3: Illustration of the airplane position parameter estimation problem from Example 2.3.
Figure 2.4: Solution of the airplane position parameter estimation problem from Example 2.3.
§3 Convex Optimization
In this section we consider convex problems of the generic form

Minimize 𝑓(𝑥) where 𝑥 ∈ R𝑛. (3.1)

These problems feature a convex objective 𝑓 : R𝑛 → R ∪ {∞}. Due to the possibility of 𝑓 taking the
value ∞, we can implicitly model constraints, since points with a function value 𝑓(𝑥) = ∞ are not
considered to be minimizers (we exclude the pathological case 𝑓 ≡ ∞).
(Figure: illustration of a convex function 𝑓 ; between any two points 𝑥 and 𝑦, the graph of 𝑓 lies
below the chord with values 𝛼 𝑓(𝑥) + (1 − 𝛼) 𝑓(𝑦).)
The benefit of the objective being convex is clarified by the following theorem.
(c) When 𝑓 is strictly convex, then (3.1) has at most one solution.
5 A set 𝐶 ⊆ R𝑛 is said to be convex if the line segment joining any two points in 𝐶 remains in 𝐶, i. e.,
for all 𝑥, 𝑦 ∈ 𝐶 we have 𝛼 𝑥 + (1 − 𝛼) 𝑦 ∈ 𝐶 for all 𝛼 ∈ [0, 1].
Consequently, in convex optimization one does not have to distinguish local from global minimizers.
We consider in this section optimization problems of a form more particular than the generic convex
problem (3.1). Indeed, we consider composite problems of the form
Minimize 𝑓 (𝑥) + 𝑔(𝐴𝑥) where 𝑥 ∈ R𝑛 . (3.4)
In (3.4), 𝑓 : R𝑛 → R ∪ {∞} and 𝑔 : R𝑚 → R ∪ {∞} are both convex functions, and 𝐴 ∈ R^{𝑚×𝑛} is a
matrix. Problems of type (3.4) are encountered in many practical problems, and they allow specialized
algorithms of convex optimization to be used, compared to the generic convex problem (3.1).
Example 3.3 (Image denoising). A classical problem in image denoising is as follows. Suppose we are
given a noisy image (gray-scale for simplicity), which is represented as a matrix 𝑍 ∈ R𝑛1 ×𝑛2 . The image
height (number of pixels in vertical direction) is 𝑛 1 while 𝑛 2 is the width. We wish to remove the noise
while retaining the image’s content. In particular, we wish to preserve edges, i. e., sharp variations of the
brightness values between neighboring pixels.
A classical way to approach this task (Rudin, Osher, Fatemi, 1992) is to formulate it as an optimization
problem:
Minimize ½ ∑_{𝑖=1}^{𝑛1} ∑_{𝑗=1}^{𝑛2} (𝑋𝑖,𝑗 − 𝑍𝑖,𝑗)² + 𝛽 ∑_{𝑖=1}^{𝑛1} ∑_{𝑗=1}^{𝑛2−1} |𝑋𝑖,𝑗 − 𝑋𝑖,𝑗+1| + 𝛽 ∑_{𝑖=1}^{𝑛1−1} ∑_{𝑗=1}^{𝑛2} |𝑋𝑖+1,𝑗 − 𝑋𝑖,𝑗| (3.5)

The first sum is the fidelity term; the second and third sums penalize horizontal and vertical jumps,
respectively.
Problem (3.5) can be written in the form (3.4) when we vectorize6 the matrix-valued optimization variable
to become 𝑥 = vec(𝑋) ∈ R^{𝑛1𝑛2}. Likewise, we vectorize the observed image: 𝑧 = vec(𝑍). The first term
in (3.5) then simply becomes ½ ∥𝑥 − 𝑧∥₂². To account for the remaining terms, we require a matrix 𝐴
which converts the pixel values 𝑥 to the vector of horizontal differences, followed by the vector of vertical
differences. Such a matrix 𝐴 is of dimension [𝑛 1 (𝑛 2 − 1) +𝑛 2 (𝑛 1 − 1)] × [𝑛 1 𝑛 2 ]. Each row contains precisely
one entry +1 and one entry −1, all remaining entries are zero; see Figure 3.2. Finally, 𝑔(𝑦) = 𝛽 ∥𝑦 ∥ 1 holds,
i. e., 𝑔(𝑦) is 𝛽 times the sum of the absolute values of all entries of the vector 𝑦.
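For illustration, here is a sketch of one way to assemble 𝐴 in Python with sparse Kronecker products
(assuming the column-wise vectorization described above):

import scipy.sparse as sp

n1, n2 = 256, 256  # image dimensions

def diff_matrix(n):
    # (n-1) x n matrix mapping a vector v to the differences v[i] - v[i+1]
    return sp.diags([1, -1], [0, 1], shape=(n - 1, n))

# Horizontal differences act across columns, vertical ones within columns;
# the sign of each row is irrelevant due to the absolute values in (3.5).
A_horizontal = sp.kron(diff_matrix(n2), sp.identity(n1))
A_vertical = sp.kron(sp.identity(n2), diff_matrix(n1))
A = sp.vstack([A_horizontal, A_vertical])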
The algorithm we are considering exploits the additive and composite structure of (3.5). In order to
derive it, we require the notion of the Fenchel conjugate of a function.
6Vectorization of a matrix simply means to stack the columns of the matrix on top of each other to form a long vector with
all entries of the matrix.
Figure 3.2: Illustration of the construction of the matrix 𝐴 for an image denoising problem (Example 3.3).
Its transpose 𝐴ᵀ is the incidence matrix of the directed graph obtained from connecting
neighboring pixels. For readability, zeros are written as ·.
Definition 3.4 (Fenchel conjugate). Suppose that 𝑓 : R𝑛 → R ∪ {∞} is a function. Then the Fenchel
conjugate of 𝑓 is 𝑓∗ : R𝑛 → R ∪ {±∞}, defined by

𝑓∗(𝜉) ≔ sup_{𝑥 ∈ R𝑛} { 𝜉ᵀ𝑥 − 𝑓(𝑥) }.
In view of

𝑓∗(𝜉) = inf { 𝛽 ∈ R | 𝜉ᵀ𝑥 − 𝛽 ≤ 𝑓(𝑥) for all 𝑥 ∈ R𝑛 },

we can interpret the Fenchel conjugate geometrically: we take a hyperplane

𝐻_{𝑛,𝛽} = { (𝑥, 𝑧) ∈ R𝑛 × R | 𝑛ᵀ(𝑥, 𝑧) = 𝛽 } = { (𝑥, 𝑧) ∈ R𝑛 × R | 𝑧 = 𝜉ᵀ𝑥 − 𝛽 }

in R𝑛 × R with the fixed normal vector 𝑛 = (𝜉, −1) and unknown offset 𝛽, and push it upwards against
the graph of the function 𝑓 by decreasing the offset 𝛽. Then we record the minimal offset 𝛽. We can
find the (negative) offset 𝛽 when evaluating 𝑧 = 𝜉ᵀ𝑥 − 𝛽 at 𝑥 = 0; see Figure 3.3.
The Fenchel conjugate 𝑓 ∗ of any function 𝑓 is always convex. Moreover, when 𝑓 itself is convex
and lower semicontinuous7 , then 𝑓 ∗ contains enough information to recover 𝑓 , in the sense that the
biconjugate function 𝑓 ∗∗ equals 𝑓 .
Without going into any details here, one can show the following theorem; see for instance Clason,
2017, Section 7.4.
Figure 3.3: Illustration of the construction of the Fenchel conjugate function 𝑓∗ : R → R ∪ {±∞} (right)
of the (convex) function 𝑓(𝑥) = 𝑥² + exp(−𝑥³) (left).
A point 𝑥∗ ∈ R𝑛 is a (global) minimizer for (3.4) if and only if there exists 𝑝∗ ∈ R𝑚 such that, for any
𝜎, 𝜏 ≠ 0,

𝑥∗ solves Minimize ½ ∥𝑦 − (𝑥∗ − 𝜏 𝐴ᵀ𝑝∗)∥₂² + 𝜏 𝑓(𝑦) where 𝑦 ∈ R𝑛, (3.8a)

and 𝑝∗ solves Minimize ½ ∥𝑞 − (𝑝∗ + 𝜎 𝐴𝑥∗)∥₂² + 𝜎 𝑔∗(𝑞) where 𝑞 ∈ R𝑚. (3.8b)
The above theorem provides the idea for an algorithm. As the unknown solutions 𝑥 ∗ and 𝑝 ∗ appear in
the respective objectives for both problems, we can convert (3.8) into a fixed-point algorithm.
Algorithm 3.6 (Chambolle-Pock iteration). Given initial guesses 𝑥(0), 𝑝(0) and parameters 𝜎, 𝜏 > 0,
repeat for 𝑘 = 0, 1, 2, . . .:

1: Set 𝑥(𝑘+1) to the solution of the following problem:

Minimize ½ ∥𝑦 − (𝑥(𝑘) − 𝜏 𝐴ᵀ𝑝(𝑘))∥₂² + 𝜏 𝑓(𝑦) where 𝑦 ∈ R𝑛 (3.9)

2: Set 𝑝(𝑘+1) to the solution of the following problem:

Minimize ½ ∥𝑞 − (𝑝(𝑘) + 𝜎 𝐴𝑥(𝑘+1))∥₂² + 𝜎 𝑔∗(𝑞) where 𝑞 ∈ R𝑚 (3.10)
The advantage of such an algorithm is that the solution to the original problem (3.4), involving both 𝑓
and 𝑔, is broken down into the repeated solution to two different problems (3.9) and (3.10). Problem
(3.9) requires the minimization of 𝜏 𝑓 plus a quadratic distance term. The solution to this problem is
also known as the proximity (prox) operator. More precisely, one defines

prox_{𝜏𝑓}(𝑥) ≔ unique solution of “Minimize ½ ∥𝑦 − 𝑥∥₂² + 𝜏 𝑓(𝑦) where 𝑦 ∈ R𝑛”. (3.11)
Similarly, problem (3.10) requires the minimization of 𝜎 times the conjugate 𝑔∗ plus a quadratic distance
term, i. e., the evaluation of prox𝜎 𝑔∗ .
For a number of relevant problems, the solutions to problems (3.9) and (3.10) can be explicitly
calculated.8 We consider this now for 𝑓 and 𝑔∗ from the image denoising Example 3.3. We begin with
prox𝜏 𝑓 (𝑥), i. e., the unique solution of the problem
Minimize ½ ∥𝑦 − 𝑥∥₂² + 𝜏 𝑓(𝑦) = ½ ∥𝑦 − 𝑥∥₂² + (𝜏/2) ∥𝑦 − 𝑧∥₂² where 𝑦 ∈ R^{𝑛1𝑛2}.

A short calculation shows (Quiz 3.1: Can you fill in the details?)

prox_{𝜏𝑓}(𝑥) = (𝑥 + 𝜏 𝑧) / (1 + 𝜏), (3.12)
i. e., prox𝜏 𝑓 (𝑥) is simply a weighted average between 𝑥 and 𝑧. On the other hand, we first need to
evaluate
𝑔∗(𝜉) = sup_{𝑝 ∈ R𝑚} { 𝜉ᵀ𝑝 − 𝑔(𝑝) } = sup_{𝑝 ∈ R𝑚} { 𝜉ᵀ𝑝 − 𝛽 ∥𝑝∥₁ },

which turns out to be9

𝑔∗(𝜉) = 0 if ∥𝜉∥∞ ≤ 𝛽, and 𝑔∗(𝜉) = ∞ if ∥𝜉∥∞ > 𝛽.
Therefore we have that prox_{𝜎𝑔∗}(𝑝) is the unique solution of the problem

Minimize ½ ∥𝑞 − 𝑝∥₂² + 𝜎 𝑔∗(𝑞) where 𝑞 ∈ R𝑚,

which is to say that prox_{𝜎𝑔∗}(𝑝) is the unique solution of

Minimize ½ ∥𝑞 − 𝑝∥₂² where 𝑞 ∈ R𝑚
subject to ∥𝑞∥∞ ≤ 𝛽.
A short calculation shows (Quiz 3.3: Can you fill in the details?)

[prox_{𝜎𝑔∗}(𝑝)]𝑖 = 𝛽 𝑝𝑖 / max{𝛽, |𝑝𝑖|} = proj_{[−𝛽,𝛽]}(𝑝𝑖), (3.13)
which turns out to be independent of 𝜎. This formula means that the values of 𝑝 are simply clipped
elementwise to lie inside [−𝛽, 𝛽].
8 The web site http://proximity-operator.net/ collects computable prox operators.
9 It can be shown in general that the convex conjugate of a norm is the indicator function of the unit ball w.r.t. the dual
norm.
It can be shown (see Chambolle, Pock, 2011, Theorem 1 or Clason, 2017, Theorem 7.8) that Algorithm 3.6
converges to a solution of (3.4) as long as 𝜎, 𝜏 are positive numbers satisfying 𝜎 𝜏 ∥𝐴∥ 2 < 1, where ∥𝐴∥
is the largest singular value of 𝐴.
We close this section by mentioning what functions a user has to implement in order to run the
Chambolle-Pock Algorithm 3.6 for solving (3.4):

prox operator prox_{𝜏𝑓}(𝑥)
prox operator prox_{𝜎𝑔∗}(𝑝)
matrix-vector products 𝐴𝑥 and 𝐴ᵀ𝑝

It is remarkable that the user does not need to implement the objective (3.4) or even the functions 𝑓 and 𝑔,
but needs to implement the solution operators for certain problems involving 𝑓 and the conjugate 𝑔∗.
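For illustration, here is a minimal numpy sketch of Algorithm 3.6 for the denoising problem, using the
prox formulas (3.12) and (3.13); the matrix A, the vectorized noisy image z, and the parameter beta are
assumed to be given as in Example 3.3.

import numpy as np

def chambolle_pock(A, z, beta, sigma, tau, iterations=200):
    x = z.copy()
    p = np.zeros(A.shape[0])
    for _ in range(iterations):
        # Step 1: prox of tau*f, cf. (3.12).
        x = (x - tau * (A.T @ p) + tau * z) / (1 + tau)
        # Step 2: prox of sigma*g*, cf. (3.13): clip elementwise to [-beta, beta].
        p = np.clip(p + sigma * (A @ x), -beta, beta)
    return x

With the parameters of Figure 3.4, this would be called as chambolle_pock(A, z, beta=0.08, sigma=1.0,
tau=0.1); recall the step size condition 𝜎 𝜏 ∥𝐴∥² < 1.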
Figure 3.4: Image denoising model problem (Example 3.3). Left: original image (dimensions 𝑛1 = 𝑛2 =
256). Middle: image with noise. Right: denoising result obtained using 200 iterations of the
Chambolle-Pock Algorithm 3.6 with 𝜎 = 1 and 𝜏 = 1/10 and weighting parameter 𝛽 = 0.08.
End of Class 2
§4 Constrained Optimization

We recall from § 1 that we are considering constrained optimization problems of the form

Minimize 𝑓(𝑥) where 𝑥 ∈ R𝑛
subject to 𝑔𝑖(𝑥) ≤ 0 for 𝑖 = 1, . . . , 𝑛ineq
and ℎ𝑗(𝑥) = 0 for 𝑗 = 1, . . . , 𝑛eq, (4.1)

with the feasible set

𝐹 = { 𝑥 ∈ R𝑛 | 𝑔𝑖(𝑥) ≤ 0 for all 𝑖 and ℎ𝑗(𝑥) = 0 for all 𝑗 }. (4.2)
Example 4.1 (Production example; slightly modified from John E. Beasley’s OR notes10 ).
A company makes two products (X and Y) using two machines (A and B). Each unit of X that is produced
requires 50 minutes processing time on machine A and 30 minutes processing time on machine B. Each
unit of Y that is produced requires 24 minutes processing time on machine A and 33 minutes processing
time on machine B. At the start of the current week there are 30 units of X and 90 units of Y in stock.
Available processing time on machine A is forecast to be 40 hours and on machine B is forecast to be
35 hours. The demand for X in the current week is forecast to be 70 units and for Y is forecast to be 95 units.
Company policy is to maximize the combined sum of the units of X and the units of Y in stock at the end
of the week.
Formulate this as an optimization problem. What kind of problem is it? What type of solver would you
use to solve it?
Example 4.2 (Formulating non-smooth problems as smooth ones). Try to formulate the following
optimization problems in the form (1.1) with smooth objective and smooth constraints.

Minimize −𝑥1² + 𝑥2 subject to max{𝑥1 + 𝑥2, 2𝑥1 − 𝑥2} ≤ 3 (4.3)

Minimize −𝑥1² + 𝑥2 subject to min{𝑥1 + 𝑥2, 2𝑥1 − 𝑥2} ≤ 3 (4.4)

Minimize max{−𝑥1² + 𝑥2, 𝑥1² + 2𝑥2} subject to 𝑥1 + 𝑥2 ≤ 3 (4.5)

Minimize |−𝑥1² + 𝑥2| + |𝑥1 + 𝑥2²| subject to 𝑥1 + 𝑥2 ≤ 3 (4.6)
10 https://people.brunel.ac.uk/~mastjjb/jeb/or/morelp.html
We will now look a little bit into the theory for problem (1.1) since it also tells us something about
choices we may be facing while modeling the problem.
One can easily see by example (Quiz 4.1: Can you give an example?) that the well-known condition
∇𝑓 (𝑥) = 0 is no longer a necessary condition for local optimality in case of a constrained problem. It
is one of the main conceptual difficulties in constrained optimization to come up with conditions that
replace ∇𝑓 (𝑥) = 0.
The main idea is as follows. Consider a local minimizer 𝑥 ∗ and a sequence 𝑥 (𝑘 ) of feasible points
approaching it. The elements of the sequence will eventually (i. e., for 𝑘 large enough) all remain in
the neighborhood 𝑈 (𝑥 ∗ ) of local optimality. Consequently,
𝑓 (𝑥 ∗ ) ≤ 𝑓 (𝑥 (𝑘 ) )
holds for all 𝑘 sufficiently large. Therefore, we have
0 ≤ 𝑓 (𝑥 (𝑘 ) ) − 𝑓 (𝑥 ∗ ). (4.7)
By Taylor’s theorem11, for any 𝑘 ∈ N there exists a number 𝜉(𝑘) ∈ (0, 1) such that
𝑓 (𝑥 (𝑘 ) ) = 𝑓 (𝑥 ∗ ) + ∇𝑓 (𝑥 ∗ + 𝜉 (𝑘 ) (𝑥 (𝑘 ) − 𝑥 ∗ )) ᵀ (𝑥 (𝑘 ) − 𝑥 ∗ )
holds. Plugging this into the inequality (4.7), we obtain
0 ≤ ∇𝑓 (𝑥 ∗ + 𝜉 (𝑘 ) (𝑥 (𝑘 ) − 𝑥 ∗ )) ᵀ (𝑥 (𝑘 ) − 𝑥 ∗ ) (4.8)
for sufficiently large 𝑘 ∈ N. Notice that since 𝑥 (𝑘 ) → 𝑥 ∗ and since 𝜉 (𝑘 ) is bounded, we must have
𝑥 ∗ + 𝜉 (𝑘 ) (𝑥 (𝑘 ) − 𝑥 ∗ ) → 𝑥 ∗
and therefore
∇𝑓 (𝑥 ∗ + 𝜉 (𝑘 ) (𝑥 (𝑘 ) − 𝑥 ∗ )) → ∇𝑓 (𝑥 ∗ ).
However, passing to the limit in (4.8) only yields the meaningless 0 ≤ 0 so we must do something
different.
Let us take another sequence 𝑡 (𝑘 ) which converges to zero from above: 𝑡 (𝑘 ) ↘ 0, and let us restrict
our attention to sequences 𝑥 (𝑘 ) such that the limit
𝑑 ≔ lim_{𝑘→∞} (𝑥(𝑘) − 𝑥∗) / 𝑡(𝑘)

exists. Dividing (4.8) by 𝑡(𝑘) gives

0 ≤ ∇𝑓(𝑥∗ + 𝜉(𝑘)(𝑥(𝑘) − 𝑥∗))ᵀ (𝑥(𝑘) − 𝑥∗) / 𝑡(𝑘)
and passing to the limit now yields the informative statement
0 ≤ ∇𝑓 (𝑥 ∗ ) ᵀ𝑑
for all directions 𝑑 ∈ R𝑛 constructed as above. We refer to those as tangent directions or tangent
vectors.
11 This requires that the objective 𝑓 is a continuously differentiable (𝐶 1 ) function.
Definition 4.3 (Tangent vectors, tangent cone). A vector 𝑑 ∈ R𝑛 is said to be a tangent direction or
tangent vector to the (feasible) set 𝐹 at the point 𝑥∗ ∈ 𝐹 if there exist sequences 𝑥(𝑘) → 𝑥∗ and
𝑡(𝑘) ↘ 0 such that

𝑑 = lim_{𝑘→∞} (𝑥(𝑘) − 𝑥∗) / 𝑡(𝑘)
holds. The set of all tangent vectors to 𝐹 at 𝑥 ∗ is said to be the tangent cone12 to 𝐹 at 𝑥 ∗ and we denote it
by T𝐹 (𝑥 ∗ ).
Theorem 4.4 (First-order necessary optimality condition). Suppose that 𝑥∗ is a local minimizer of (4.1).
Then ∇𝑓(𝑥∗)ᵀ𝑑 ≥ 0 holds for all 𝑑 ∈ T𝐹(𝑥∗).

The difficulty with this theorem is that the tangent cone T𝐹 (𝑥 ∗ ) is generally hard to characterize
and thus impossible to use in an optimization algorithm. Notice also that T𝐹 (𝑥 ∗ ) doesn’t even use
the description of the feasible set 𝐹 (4.2) in terms of the inequality constraints 𝑔𝑖 or the equality
constraints ℎ 𝑗 .
Definition 4.5 (Linearized feasible vector, linearizing cone). A vector 𝑑 ∈ R𝑛 is said to be a linearized
feasible direction to the (feasible) set 𝐹 described by (1.2) at the point 𝑥∗ ∈ 𝐹 if it satisfies the following
conditions:

∇𝑔𝑖(𝑥∗)ᵀ𝑑 ≤ 0 for all active indices 𝑖, and ∇ℎ𝑗(𝑥∗)ᵀ𝑑 = 0 for all 𝑗 = 1, . . . , 𝑛eq.

The set of all linearized feasible directions to 𝐹 at 𝑥∗ is said to be the linearizing cone to 𝐹 at 𝑥∗ and we
denote it by T𝐹lin(𝑥∗).
Recall that the active indices at 𝑥 ∗ are those satisfying 𝑔𝑖 (𝑥 ∗ ) = 0. Therefore, inequalities that are
inactive at 𝑥 ∗ do not play a role in the definition.
Remark 4.6 (Tangent and linearizing cones). The tangent cone T𝐹 (𝑥 ∗ ) can be viewed as a geometric
approximation to the feasible set near the point 𝑥 ∗ . By contrast, the linearizing cone T𝐹lin (𝑥 ∗ ) is an attempt
to locally describe the feasible set in algebraic terms, by means of equalities and inequalities.
12 In mathematics, a cone in R𝑛 is any set 𝐾 ⊆ R𝑛 such that 𝑥 ∈ 𝐾 implies 𝛼 𝑥 ∈ 𝐾 for all 𝛼 > 0. In other words, a point 𝑥 in
the cone entails that the entire ray from the origin through 𝑥 belongs to the cone as well.
One can show (although we do not do it here) that T𝐹 (𝑥 ∗ ) ⊆ T𝐹lin (𝑥 ∗ ) holds at any feasible point 𝑥 ∗ .
Therefore, we cannot expect that Theorem 4.4 continues to hold if 𝑑 ∈ T𝐹 (𝑥 ∗ ) were replaced by
𝑑 ∈ T𝐹lin (𝑥 ∗ ). In order for that to be possible, we require an extra condition that ensures that T𝐹 (𝑥 ∗ ) =
T𝐹lin (𝑥 ∗ ). Such a condition guarantees the agreement of the geometric and algebraic descriptions and
it is called a constraint qualification.
There are various types of constraint qualifications, and we consider only the most simple one.
Definition 4.7 (LICQ). Suppose that the (feasible) set 𝐹 is given by (1.2) and 𝑥 ∗ ∈ 𝐹 . We say that 𝑥 ∗
satisfies the linear independence constraint qualification (in short: LICQ) if the following set of
vectors is linearly independent:
{ ∇𝑔𝑖(𝑥∗) | 𝑖 ∈ A(𝑥∗) } ∪ { ∇ℎ𝑗(𝑥∗) | 𝑗 = 1, . . . , 𝑝 }.
One can show that the LICQ at 𝑥∗ implies T𝐹(𝑥∗) = T𝐹lin(𝑥∗). Provided that 𝑥∗ is also a local minimizer,
we thus get from Theorem 4.4 that

∇𝑓(𝑥∗)ᵀ𝑑 ≥ 0 holds for all 𝑑 ∈ T𝐹lin(𝑥∗). (4.11)

By a technique called the Farkas lemma we can show that condition (4.11) can be equivalently cast as
a set (4.12) of equalities and inequalities, which is amenable to numerical algorithms. We state this
result in the following lemma, without proof.
(b) There exist vectors 𝜇 ∈ R𝑚 and 𝜆 ∈ R𝑝 such that the following set of conditions holds:

∇𝑓(𝑥∗) + ∑_{𝑖=1}^{𝑚} 𝜇𝑖 ∇𝑔𝑖(𝑥∗) + ∑_{𝑗=1}^{𝑝} 𝜆𝑗 ∇ℎ𝑗(𝑥∗) = 0, (4.12a)
ℎ(𝑥∗) = 0, (4.12b)
𝜇 ≥ 0, 𝑔(𝑥∗) ≤ 0, 𝜇ᵀ𝑔(𝑥∗) = 0. (4.12c)
The system (4.12) is known as the Karush-Kuhn-Tucker conditions (in short: KKT conditions) of
problem (4.1). The vectors 𝜇 and 𝜆 are called Lagrange multipliers associated with the inequality
and equality constraints, respectively. A point 𝑥 ∗ which admits associated Lagrange multipliers 𝜇 and
𝜆 so that (4.12) holds is called a KKT point.
Condition (4.12a) is often written in terms of the Lagrangian associated with problem (4.1), which is
defined as

L(𝑥, 𝜇, 𝜆) ≔ 𝑓(𝑥) + 𝜇ᵀ𝑔(𝑥) + 𝜆ᵀℎ(𝑥). (4.13)

Indeed, (4.12a) simply states that ∇𝑥L(𝑥∗, 𝜇, 𝜆) = 0.
Theorem 4.9 (First-order necessary optimality conditions under LICQ, compare Theorem 4.4).
Suppose that 𝑥 ∗ is a local minimizer of an optimization problem with feasible set 𝐹 . Moreover, suppose
that the LICQ holds at 𝑥 ∗ . Then there exist Lagrange multipliers 𝜇 ∈ R𝑛ineq and 𝜆 ∈ R𝑛eq such that the
KKT conditions (4.12) hold. Moreover, 𝜇 and 𝜆 are unique.
The uniqueness of the multipliers follows from viewing (4.12a) as a linear system of equations for 𝜇 and
𝜆 and observing that the LICQ implies that the coefficient matrix has linearly independent columns.
(a) In the absence of inequality constraints (𝑛 ineq = 0), (4.12) is a (generally nonlinear) system of
equations only.
(b) Condition (4.12c) — which comes in through inequality constraints — is called a complemen-
tarity system. Due to the signs of 𝜇 and 𝑔(𝑥 ∗ ), we can equivalently write 𝜇 ᵀ𝑔(𝑥 ∗ ) = 0 compo-
nentwise as 𝜇𝑖 𝑔𝑖(𝑥∗) = 0 for all 𝑖 = 1, . . . , 𝑛ineq. Consequently, Lagrange multipliers pertaining to
inactive inequality constraints must be zero. On the other hand, a strictly positive multiplier
𝜇𝑖 > 0 indicates that the associated constraint must be active, i. e., 𝑔𝑖 (𝑥 ∗ ) = 0.
We can ask the question how “likely” it is for the LICQ to hold for an optimization problem at hand.
First of all, the very Definition 4.7 implies that for the LICQ to hold there cannot be too many active
inequality and equality constraints in the problem. As soon as the combined number of active inequality
and equality constraints at a point 𝑥 ∗ exceeds 𝑛 (the dimension of the optimization variable), the LICQ
cannot hold at 𝑥 ∗ .
What can we learn from this? When we are modeling an optimization problem, the observation above
tells us not to unconsciously throw in additional constraints (which may not be needed) at will. The
following example gives us additional hints about things to avoid in modeling constraints:
Example 4.10 (LICQ holding or not holding?). Consider the following formulations of a (trivial)
optimization problem with variable 𝑥 ∈ R, and decide whether or not the LICQ holds at the (obvious)
solution:
Minimize 𝑥² subject to 𝑥 = 0
Minimize 𝑥² subject to 𝑥² = 0
Minimize 𝑥² subject to 𝑥² ≤ 0
Minimize 𝑥² subject to 𝑥 ≤ 0 and 𝑥 ≥ 0
The following example shows that the failure of constraint qualifications to hold can indeed be
problematic in practice:
Example 4.11 (Minimizer where the KKT conditions fail to hold). Consider the problem
Minimize −𝑥1 where 𝑥 ∈ R²
subject to 𝑥2 + 𝑥1³ ≤ 0
and −𝑥2 ≤ 0.
Obviously, 𝑥 ∗ = (0, 0) ᵀ is the unique global minimizer, and there are no further local minimizers. Since
both constraints are active at 𝑥 ∗ , the KKT conditions (4.12) boil down to
(−1, 0)ᵀ + 𝜇1 (0, 1)ᵀ + 𝜇2 (0, −1)ᵀ = (0, 0)ᵀ.
Clearly, it is impossible to satisfy the KKT conditions with any 𝜇1, 𝜇2 ≥ 0. Hence, Lagrange multipliers do
not exist at this minimizer. In other words, the minimizer 𝑥 ∗ fails to be a KKT point.
The observation in the previous example is noteworthy since essentially all numerical algorithms for
the solution of smooth, constrained problems (4.1) are indeed looking for KKT points.
Example 4.12 (Rosenbrock’s post office problem). A simple test example for nonlinear programming is
Rosenbrock’s post office problem:
Minimize −𝑥1 𝑥2 𝑥3 where 𝑥 ∈ R³
subject to 𝑥1 + 2𝑥2 + 2𝑥3 − 72 ≤ 0
and 𝑥1 + 2𝑥2 + 2𝑥3 ≥ 0
and 0 ≤ 𝑥𝑖 ≤ 42, 𝑖 = 1, 2, 3.
This problem has only linear constraints13 but a nonlinear objective. The global minimizer is 𝑥 ∗ =
(24, 12, 12) ᵀ .
13 Problems featuring only linear constraints satisfy a constraint qualification known as Abadie constraint qualification
(ACQ) at every feasible point. Hence for problems which have only linear constraints, every local minimizer is a KKT
point. However, in contrast to LICQ, the Lagrange multipliers need not be unique.
28 https://scoop.iwr.uni-heidelberg.de/teaching/2022ws/short-course-optimization/ 2023-03-29
cbn Introduction to Optimization
Many other details need to be taken into account in order to obtain a robust SQP solver, including
potentially infeasible QP subproblems, update of the Lagrange multipliers, stopping criteria,
deterioration of superlinear convergence due to nonlinear constraints, failure of constraint
qualifications etc.
We close this section by mentioning what functions a user has to implement in order to run an SQP
method for solving a constrained problem (4.1):
objective 𝑓 (𝑥)
derivative 𝑓 ′ (𝑥) optional (finite difference fallback)
2nd derivative 𝑓 ′′ (𝑥) or 𝑓 ′′ (𝑥) 𝑑 optional, only for Newton SQP methods
constraints 𝑔(𝑥), ℎ(𝑥)
Jacobian 𝑔′ (𝑥), ℎ ′ (𝑥) optional (finite difference fallback)
2nd derivative 𝑔𝑖′′ (𝑥), ℎ ′′𝑗 (𝑥) or 𝑔𝑖′′ (𝑥) 𝑑, ℎ ′′𝑗 (𝑥) 𝑑 optional, only for Newton SQP methods
End of Class 3
§5 Infinite-Dimensional Optimization
In this section we consider problems which are infinite-dimensional, i. e., which feature infinite-
dimensional optimization variables. Most problems in this category involve a differential equation
of some sort, and the optimization variables are functions. Although for the purpose of numerical
optimization, these functions clearly need to be discretized to become finite-dimensional objects,
it is still useful to recognize the properties of the underlying undiscretized (infinite-dimensional)
problem.
Example 5.1 (Optimal control of the van der Pol oscillator). Consider the following optimal control
problem:
Minimize ∫_0^T (𝑥1² + 𝑥2²) d𝑡 + 𝛾 ∫_0^T 𝑢² d𝑡 where 𝑥 : [0,𝑇] → R², 𝑢 : [0,𝑇] → R
subject to ẋ1 = 𝜇 (1 − 𝑥2²) 𝑥1 − 𝑥2 + 𝑢
and ẋ2 = 𝑥1 (5.1)
with initial conditions 𝑥1(0) = 0, 𝑥2(0) = 1
and control constraints −1 ≤ 𝑢 ≤ 1.
The ordinary differential equation (ODE) above is known as the van der Pol oscillator. The unknown
function 𝑥 : [0,𝑇 ] → R2 is the state variable of the optimal control problem (5.1) and 𝑢 : [0,𝑇 ] → R
is the control variable. All constraints are meant in a pointwise sense on the interval [0,𝑇 ] with final
time 𝑇 > 0 given. The constant 𝜇 ≥ 0 is a given physical parameter. The objective expresses the goal to
steer the system to its equilibrium point at the origin, while the second term in the objective penalizes the
control effort. The parameter 𝛾 ≥ 0 is often called the control-cost parameter and it is used to balance both
terms in the objective.
We are going to discretize problem (5.1) using a so-called direct single shooting approach. This means
that we are going to use a numerical solver for the differential equation, which returns an approximate
solution for 𝑥 (𝑡) on a discrete grid of 𝑡-values. Therefore, the only optimization variable remaining is
the control function 𝑢. This function is discretized on a grid that is possibly coarser than the grid for
the state and we take 𝑢 to be piecewise constant.
(Figure: a piecewise constant control on the coarse grid 0, Δ𝑡, 2Δ𝑡, 3Δ𝑡, . . . , 𝑇, together with the
state on a finer grid.)
A possible solution approach to this problem is shown in § 7.10. Experiments show that off-the-shelf
solvers have difficulties solving the problem even for moderate discretization sizes. For instance, we
typically observe an increase in the iteration numbers as we refine the control discretization. This is
the case in particular for CasADi’s default solver Ipopt with default settings.
The reason for this behavior is that most solvers for nonlinear programming problems (see § 4) are
designed for genuinely finite-dimensional problems. In particular, they are unaware of the geometry
of the problem. In a function space-aware method, the user would need to choose an inner product,
which ideally should be inherited from the infinite-dimensional limit problem. This has an impact,
for instance, on how gradients are calculated, on the geometry of trust regions, on stopping criteria,
and on many other algorithmic components. Of course one may argue that, for the discretized problem
in R𝑛 , all norms are equivalent, but the equivalence constants typically deteriorate with increasing
dimension 𝑛.
A simple strategy can sometimes be successful to mitigate the difficulties of off-the-shelf solvers: one
first solves the problem on a coarse grid, then interpolates the solution to a finer grid and uses this as
an initial guess for the next round, until the desired discretization level is reached.
However, true efficiency requires an algorithm which “makes sense” also for the infinite-dimensional
limit problem. Only then can we hope for convergence results which are independent of the fine-
ness of discretization. These algorithms, however, are not generally available as software. As a
researcher working with infinite-dimensional problems, one therefore often finds oneself designing
and implementing the algorithm of choice oneself.
§6 Summary and Practical Advice

We conclude these notes with a short summary and some practical advice.
(1) Modeling an optimization problem is a skill that requires training. The choices made may have
an impact on the characteristics of a problem.
(2) For example, we saw that squaring a constraint ℎ 𝑗 (𝑥) = 0 to become [ℎ 𝑗 (𝑥)] 2 = 0 is not a good
idea.
(3) There is not a single best solver for all kinds of optimization problems.
(4) It is useful to have an overview of the different problem types in optimization in order to be able
to make informed modeling decisions and solver choices.
(5) Most algorithms rely on and exploit the smoothness of the objective and constraint functions.
They may not stall on the occasional kink, but generally we should strive to formulate our
problems in a smooth way.
(6) Depending on the type of algorithm, the user may have to implement different routines than
expected. For instance, for algorithms dedicated to least-squares problems, the user implements
the residual function (not the objective). The Chambolle-Pock method for composite convex
problems asks the user to implement prox operators for 𝜏 𝑓 and 𝜎 𝑔∗ .
End of Class 4
§7 Solutions
§ 7.1 Solution of Example 2.2 (Rosenbrock) using CasADi (Problem-Based)

CasADi provides algorithmic differentiation for user-defined functions. We demonstrate below the
description of the problem in a suitable form for CasADi’s nlpsol solver interface.
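A sketch of such a listing (assuming Ipopt as the underlying NLP solver):

# Solve the Rosenbrock problem (Example 2.2) via CasADi's nlpsol interface.
import casadi as ca

x = ca.SX.sym('x')
y = ca.SX.sym('y')
a, b = 1.0, 100.0

# Rosenbrock objective; its global minimizer is (a, a^2).
f = (a - x)**2 + b * (y - x**2)**2

nlp = {'x': ca.vertcat(x, y), 'f': f}
solver = ca.nlpsol('solver', 'ipopt', nlp)

result = solver(x0=[-1.2, 1.0])
print(result['x'])  # should be close to (1, 1)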
§ 7.2 Solution of Example 2.2 (Rosenbrock) using CasADi (Opti Interface-Based)

CasADi offers the Opti interface, which allows a simplified description and solution of optimization
problems.
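A sketch of the corresponding listing:

# Solve the Rosenbrock problem (Example 2.2) via CasADi's Opti interface.
import casadi as ca

opti = ca.Opti()
x = opti.variable()
y = opti.variable()
a, b = 1.0, 100.0

opti.minimize((a - x)**2 + b * (y - x**2)**2)
opti.set_initial(x, -1.2)
opti.set_initial(y, 1.0)

opti.solver('ipopt')
sol = opti.solve()
print(sol.value(x), sol.value(y))  # should be close to (1, 1)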
§ 7.3 Solution of Example 2.3 (Airplane) using CasADi (Problem-Based)

Apparently CasADi has no interface for solving least-squares problems. Therefore, we resort to
treating the problem as a generic unconstrained optimization problem. Consequently — in contrast to
using a dedicated least-squares solver — we have to implement the objective 𝑓(𝑥) = ½ ∑_{𝑖=1}^{𝑀} [𝑟𝑖(𝑥)]²
instead of the residual function 𝑟 (𝑥).
# This code solves a parameter estimation problem for a given model, based on a
# set of measurement pairs, using CasADi's python bindings for modeling of
# (un)constrained optimization problems.
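# (Sketch of the remaining listing. The beacon positions and measured angles
# below are hypothetical; the original course data is not reproduced here.)
import casadi as ca
import numpy as np

XY = np.array([[6.0, 6.0], [2.0, -2.0], [0.0, 0.0], [8.0, -1.0]])  # beacons
alpha = np.array([45.0, 300.0, 225.0, 330.0])                      # degrees

x = ca.SX.sym('x')
y = ca.SX.sym('y')

def model(X, Y):
    # Angle of the vector (X - x, Y - y) against the x-axis in degrees,
    # with atan2's output normalized from (-pi, pi] to [0, 2*pi).
    angle = ca.atan2(Y - y, X - x)
    angle = ca.fmod(angle + 2 * ca.pi, 2 * ca.pi)
    return angle * 180.0 / ca.pi

# Objective f(x) = 1/2 * sum of squared residuals.
f = 0
for i in range(XY.shape[0]):
    f = f + 0.5 * (model(XY[i, 0], XY[i, 1]) - alpha[i])**2

solver = ca.nlpsol('solver', 'ipopt', {'x': ca.vertcat(x, y), 'f': f})
result = solver(x0=[3.0, 2.0])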
§ 7.4 Solution of Example 4.1 (Production) using Matlab’s Optimization Toolbox (Solver-Based)

The classical use of Matlab’s optimization toolbox required the user to model their optimization
problem in a format suitable for the respective solver to be used. Since Example 4.1 is a linear
optimization problem, linprog is the solver of choice.
% This code solves the production example problem using Matlab's optimization
% toolbox. The problem is a linear optimization problem. The solver of choice is
% linprog from Matlab's optimization toolbox, see
% https://de.mathworks.com/help/optim/ug/linprog.html.
% Setup the matrix A and right hand side b for the inequality constraint
% A * x <= b.
A = [50, 24; 30, 33];
b = [2400; 2100];
% Setup the matrix Aeq and right hand side beq for the (void) equality constraint
% Aeq * x = beq.
Aeq = [];
beq = [];
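% (Sketch of the remaining steps, reconstructed from the problem statement.)
% Objective: maximize x(1) + x(2), i.e., minimize -(x(1) + x(2)).
f = [-1; -1];

% Lower bounds: production must cover demand minus initial stock,
% x(1) >= 70 - 30 and x(2) >= 95 - 90. There are no upper bounds.
lb = [40; 5];
ub = [];

% Call the solver.
[x, fval] = linprog(f, A, b, Aeq, beq, lb, ub);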
§ 7.5 Solution of Example 4.1 (Production) using Matlab’s Optimization Toolbox (Problem-Based)

With more recent versions of Matlab’s optimization toolbox, the user may formulate their problem in
a more natural way, and independently of the solver. The solver will be picked automatically depending
on the problem characteristics.
% This code solves the production example problem using Matlab's optimization
% toolbox. The problem is modeled using the problem-based approach, which
% allows for a more natural problem formulation and automatic selection of a
% suitable solver, see
% https://de.mathworks.com/help/optim/problem-based-basics.html.
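% (Sketch of the remaining listing; the variable names are illustrative.)
% Optimization variables with lower bounds covering demand minus stock.
x = optimvar('x', 'LowerBound', 40);
y = optimvar('y', 'LowerBound', 5);

% Problem with objective and machine capacity constraints (in minutes).
prob = optimproblem('ObjectiveSense', 'maximize');
prob.Objective = x + y;
prob.Constraints.machineA = 50*x + 24*y <= 2400;
prob.Constraints.machineB = 30*x + 33*y <= 2100;

% Solve the problem; a suitable solver (here linprog) is picked automatically.
sol = solve(prob);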
§ 7.6 Solution of Example 4.1 (Production) using scipy.optimize

Here we demonstrate the solution of Example 4.1 with scipy.optimize.linprog. We need to present
the problem in a suitable form to the solver.
# This code solves the production example problem using the scipy.optimize
# package. The problem is a linear optimization problem. The solver of choice is
# scipy.optimize.linprog, see
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.linprog.html.
# Setup the matrix A and right hand side b for the inequality constraint
# A @ x <= b.
A = [[50, 24], [30, 33]]
b = [2400, 2100]
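# (Sketch of the remaining steps.) Maximize x1 + x2 by minimizing its
# negative; the lower bounds cover demand minus initial stock.
from scipy.optimize import linprog

c = [-1, -1]
bounds = [(40, None), (5, None)]

result = linprog(c, A_ub=A, b_ub=b, bounds=bounds)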
§ 7.7 Solution of Example 4.1 (Production) using CasADi (Problem-Based)

CasADi provides algorithmic differentiation for user-defined functions (although this is not needed
for linear optimization problems). We demonstrate below the description of the problem in a suitable
form for CasADi’s nlpsol solver interface.
# This code solves the production example problem using CasADi's python
# bindings. The problem is a linear optimization problem. For lack of a linear
# optimization solver, the solver of choice is nlpsol with ipopt, see
# https://web.casadi.org/docs/#nonlinear-programming.
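# (Sketch of the elided model setup.) Maximize x1 + x2 by minimizing its
# negative; the machine capacities enter via the constraint function g.
import casadi as ca

x = ca.SX.sym('x', 2)
f = -(x[0] + x[1])
g = ca.vertcat(50*x[0] + 24*x[1], 30*x[0] + 33*x[1])
b = [2400, 2100]  # capacities of machines A and B in minutes
l = [40, 5]       # demand minus initial stock

solver = ca.nlpsol('solver', 'ipopt', {'x': x, 'f': f, 'g': g})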
# Call the solver, providing the upper bounds for the inequality constraint
# function g and the lower bound for the optimization variable.
result = solver(ubg = b, lbx = l)
§ 7.8 Solution of Example 4.1 (Production) using CasADi (Opti Interface-Based)

CasADi offers the Opti interface, which allows a simplified description and solution of optimization
problems.
# This code solves the production example problem using CasADi's python
# bindings, and CasADi's Opti interface for modeling of (un)constrained
# optimization problems.
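# (Sketch of the elided listing body.)
import casadi as ca

opti = ca.Opti()
x = opti.variable()  # units of X produced
y = opti.variable()  # units of Y produced

opti.minimize(-(x + y))
opti.subject_to(50*x + 24*y <= 2400)  # machine A capacity (minutes)
opti.subject_to(30*x + 33*y <= 2100)  # machine B capacity (minutes)
opti.subject_to(x >= 40)              # cover demand for X from stock
opti.subject_to(y >= 5)               # cover demand for Y from stock

opti.solver('ipopt')
sol = opti.solve()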
§ 7.9 Solution of Example 4.12 (Post-Office) using CasADi (Opti Interface-Based)

CasADi offers the Opti interface, which allows a simplified description and solution of optimization
problems.
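A sketch of such a listing for the post office problem (assuming Ipopt as the underlying solver):

# Solve Rosenbrock's post office problem (Example 4.12) via CasADi's Opti.
import casadi as ca

opti = ca.Opti()
x = opti.variable(3)

opti.minimize(-x[0] * x[1] * x[2])
s = x[0] + 2*x[1] + 2*x[2]
opti.subject_to(s - 72 <= 0)
opti.subject_to(s >= 0)
opti.subject_to(opti.bounded(0, x, 42))

# An interior initial guess helps the solver.
opti.set_initial(x, [10.0, 10.0, 10.0])

opti.solver('ipopt')
sol = opti.solve()
print(sol.value(x))  # expected: close to (24, 12, 12)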
§ 7.10 Solution of Example 5.1 (van der Pol) using CasADi (Problem-Based)
# This code solves a discretized optimal control problem for the van der Pol
# oscillator using CasADi's python bindings. The discretization is achieved
# using direct single shooting.
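import casadi as ca

# (Sketch of the elided setup; T, N, mu and gamma are illustrative values.)
T = 10.0              # final time
N = 20                # number of control intervals
dtu = T / N           # length of one control interval
mu, gamma = 1.0, 0.1  # model parameter and control cost parameter

# Piecewise constant control: one optimization variable per interval.
U = ca.SX.sym('U', N)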
# Define the step size for the state integrator and determine how many state
# points are associated with each control interval.
dtx_target = 0.1
states_per_control_interval = max([round(dtu / dtx_target), 1])
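dtx = dtu / states_per_control_interval

def vdp_rhs(x, u):
    # Right-hand side of the van der Pol ODE from (5.1).
    return ca.vertcat(mu * (1 - x[1]**2) * x[0] - x[1] + u, x[0])

def take_step(x, u, dt):
    # One explicit fourth-order Runge-Kutta step (an assumed integrator choice).
    k1 = vdp_rhs(x, u)
    k2 = vdp_rhs(x + dt/2 * k1, u)
    k3 = vdp_rhs(x + dt/2 * k2, u)
    k4 = vdp_rhs(x + dt * k3, u)
    return x + dt/6 * (k1 + 2*k2 + 2*k3 + k4)

# Initial state and objective accumulator for single shooting.
x = ca.DM([0.0, 1.0])
objective = 0

for step in range(N):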
    # Advance the state until the end of the current control interval and
    # update the objective.
    for substep in range(states_per_control_interval):
        x = take_step(x, U[step], dtx)
        objective = objective + dtx * (x[0]**2 + x[1]**2)
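    # Penalize the control effort on the current interval.
    objective = objective + gamma * dtu * U[step]**2

# Set up and solve the NLP; the control constraints -1 <= u <= 1 become
# simple bounds on the optimization variable.
nlp = {'x': U, 'f': objective}
solver = ca.nlpsol('solver', 'ipopt', nlp)
result = solver(x0=ca.DM.zeros(N), lbx=-1.0, ubx=1.0)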
Bibliography
Andersson, J. A. E.; J. Gillis; G. Horn; J. B. Rawlings; M. Diehl (2018). “CasADi: a software framework for
nonlinear optimization and optimal control”. Mathematical Programming Computation 11.1, pp. 1–36.
doi: 10.1007/s12532-018-0139-4.
Chambolle, A.; T. Pock (2011). “A first-order primal-dual algorithm for convex problems with appli-
cations to imaging”. Journal of Mathematical Imaging and Vision 40.1, pp. 120–145. doi:
10.1007/s10851-010-0251-1.
Clason, C. (2017). Nonsmooth analysis and optimization. arXiv: 1708.04180.
Rudin, L. I.; S. Osher; E. Fatemi (1992). “Nonlinear total variation based noise removal algorithms”.
Physica D 60.1–4, pp. 259–268. doi: 10.1016/0167-2789(92)90242-F.