Engineering Design Optimization

Joaquim R. R. A. Martins, University of Michigan
Andrew Ning, Brigham Young University

First electronic edition: January 2020.
Contents

Preface
Acknowledgements
1 Introduction
    1.1 Design Optimization Process
    1.2 Optimization Problem Formulation
    1.3 Optimization Problem Classification
    1.4 Optimization Algorithms
    1.5 Selecting an Optimization Approach
    1.6 Notation
    1.7 Summary
    Problems
…
    3.10 Summary
    Problems
…
Bibliography
Index
1 Introduction
Optimization is a human instinct. People constantly seek to improve
their lives and the systems that surround them. Optimization is intrinsic
in biology, as exemplified by the evolution of species. Birds optimize
their wings’ shape in real time, and dogs have been shown to find
optimal trajectories. Even more broadly, many laws of physics relate to
optimization, such as the principle of minimum energy. As Leonhard
Euler once wrote, “nothing at all takes place in the universe in which
some rule of maximum or minimum does not appear.”
The term optimization is often used to mean “improvement”, but
mathematically, it is a much more precise concept: finding the best
possible solution by changing variables that can be controlled, often
subject to constraints. Optimization has a broad appeal because it is
applicable in all domains and because of the human desire to make
things better. Any problem where a decision needs to be made can be
cast as an optimization problem.
Although some simple optimization problems can be solved an-
alytically, most practical problems of interest are too complex to be
solved this way. The advent of numerical computing, together with
the development of optimization algorithms, has enabled us to solve
problems of increasing complexity.
…design satisfies the optimality conditions that ensure that no other design "close by" is better. The design changes are made automatically by the optimization algorithm and do not require intervention from the designer.

Fig. 1.2 Conventional (top) versus design optimization process (bottom).
This automated process does not usually provide a “push-button”
solution; it requires human intervention and expertise (often more
expertise than in the traditional process). Human decisions are still
needed in the design optimization process. Before running an op-
timization, in addition to determining the specifications and initial
design, engineers need to formulate the design problem. This requires
expertise in both the subject area and numerical optimization. The
designer must decide what the objective is, which parameters can be
changed, and which constraints must be enforced. These decisions
have profound effects on the outcome, so it is crucial that the designer
formulates the optimization problem well.
…optimizations. We illustrate several advantages of design optimization in Fig. 1.3, which shows the notional variations of system performance, cost, and uncertainty as a function of time in design. When using optimization, the system performance increases more rapidly compared with the conventional process, achieving a better end result in a shorter total time. As a result, the cost of the design process is lower. Finally, …
x = [x_1, x_2, . . . , x_{n_x}] .    (1.1)
x̲_i ≤ x_i ≤ x̄_i ,    i = 1, . . . , n_x ,    (1.2)

where x̲ and x̄ are lower and upper bounds on the design variables,
respectively. These are also known as bound constraints or side constraints.
Some design variables may be unbounded or bounded on only one
side.
When all the design variables are continuous, the optimization problem is said to be continuous.† Most of this book focuses on algorithms that assume continuous design variables.

† This is not to be confused with the continuity of the objective and constraint functions, which we discuss in Section 1.3.
Consider a wing design problem where the wing planform shape is rectangular. The planform could be parametrized by the span (b) and the chord (c), as shown in Fig. 1.5, so that x = [b, c]. However, this choice is not unique. Two other variables are often used in aircraft design: wing area (S) and wing aspect ratio (AR), as shown in Fig. 1.6. Because these variables are not independent (S = bc and AR = b²/S), we cannot just add them to the set
of design variables. Instead, we must pick any two variables out of the four to parametrize the design because we have four possible variables and two dependency relationships.

Fig. 1.5 Wingspan (b) and chord (c).

Fig. 1.6 Two different sets of design variables, x = [b, c] and x = [S, AR].
For this wing, the variables must be positive to be physically meaningful,
so we must remember to explicitly bound these variables to be greater than
zero in an optimization. The variables should be bounded from below by small
positive values because numerical models are probably not prepared to take
zero values. No upper bound is needed unless the optimization algorithm
requires it.
…these can be represented more compactly with splines. This is a commonly used technique in optimization because reducing the number of design variables often speeds up an optimization with little if any loss in the model parameterization fidelity. Figure 1.7 shows an example spline describing the shape of a turbine blade. In this example, only four design variables are used to represent the curved shape.
…designer, it does not matter how precisely the function and its optimum point are computed—the mathematical optimum will be non-optimal from the engineering point of view. A bad choice for the objective function is a common mistake in design optimization.

Fig. 1.8 A maximization problem can be transformed into an equivalent minimization problem (min −f(x)).

The choice of objective function is not always obvious. For example, minimizing the weight of a vehicle might sound like a good idea, but this might result in a vehicle that is too expensive to manufacture. In this case, manufacturing cost would probably be a better objective. However, there is a trade-off between manufacturing cost and the performance of the vehicle. It might not be obvious which of these objectives is the most appropriate one because this trade-off depends on
customer preferences. This issue motivates multiobjective optimization,
which is the subject of Chapter 9. Multiobjective optimization does
not yield a single design but rather a range of designs that settle for
different trade-offs between the objectives.
Experimenting with different objectives should be part of the design
exploration process (this is represented by the outer loop in the design
optimization process in Fig. 1.2). Results from optimizing the “wrong”
objective can still yield insights into the design trade-offs and trends
for the system at hand.
In Ex. 1.1, we have the luxury of being able to visualize the design space because we have only two variables. For more than three variables, it becomes impossible to visualize the design space. We can also visualize the objective function for two variables, as shown in Fig. 1.9. In this figure, we plot the function values using the vertical axis, which results in a three-dimensional surface. Although plotting the surface might provide intuition about the function, it is not possible to locate the points accurately when drawing on a two-dimensional surface. Another possibility is to plot the contours of the function, which are lines of constant value, as shown in Fig. 1.10. We prefer this type

Fig. 1.9 A function of two variables (f = x₁² + x₂² in this case) can be visualized by plotting a three-dimensional surface or contour plot.
…called an isosurface.

…choices of design variable sets discussed in Ex. 1.1. We can locate the minimum graphically (denoted by the dot). Although the two optimum solutions are the same, the shapes of the objective function contours are different. In this case, using the aspect ratio and wing area simplifies the relationship between the design variables and the objective by aligning the two main curvature trends with each design variable. Thus, the parameterization can change the effectiveness of the optimization.
Fig. 1.11 Required power contours for two different choices of design variable sets. The optimal wing is the same for both cases, but the functional form of the objective is simplified in the one on the right.
The optimal wing for this problem has an aspect ratio that is much higher
than that typically seen in airplanes or birds. Although the high aspect ratio
increases aerodynamic efficiency, it adversely affects the structural strength,
which we did not consider here. Thus, as in most engineering problems, we
need to add constraints and consider multiple disciplines.
1.2.3 Constraints
The vast majority of practical design optimization problems require the
enforcement of constraints. These are functions of the design variables
that we want to restrict in some way. Like the objective function,
constraints are computed through a model whose complexity can vary
widely. The feasible region is the set of points that satisfy all constraints.
We seek to minimize the objective function within this feasible design
space.
When we restrict a function to being equal to a fixed value, we call this an equality constraint, denoted by h(x) = 0. When the function is required to be less than or equal to a certain value, we have an inequality constraint, denoted by g(x) ≤ 0.¶ Although we use the "less or equal" convention, some texts and software programs use "greater or equal" instead. There is no loss of generality with either convention because we can always multiply the constraint by −1 to convert between the two.

¶ A strict inequality, g(x) < 0, is never used because then x could be arbitrarily close to the equality. Because the optimum is at g = 0 for an active constraint, the exact solution would then be ill-defined from a mathematical perspective. Also, the difference is not meaningful when using finite-precision arithmetic (which is always the case when using a computer).

Tip 1.2 Check the inequality convention
Some texts and papers omit the equality constraints without loss
of generality because an equality constraint can be replaced by two
inequality constraints. More specifically, an equality constraint, ℎ(𝑥) =
0, is equivalent to enforcing two inequality constraints, ℎ(𝑥) ≥ 0 and
ℎ(𝑥) ≤ 0.
[Figure: objective contours f(x) with an active inequality constraint, g₁(x) ≤ 0, and an active equality constraint, h₁(x) = 0.]
…the design space as shown in Ex. 1.2 and obtain the solution graphically. In addition to the possibility of a large number of design variables and computationally expensive objective function evaluations, we now add the possibility of a large number of constraints, which might also be expensive to evaluate. Again, this is further motivation for the optimization techniques covered in this book.

Fig. 1.13 Minimum-power wing with a constraint on bending stress compared with the unconstrained case.
minimize      f(x)
by varying    x̲_i ≤ x_i ≤ x̄_i ,   i = 1, . . . , n_x
subject to    g_j(x) ≤ 0 ,         j = 1, . . . , n_g        (1.4)
              h_l(x) = 0 ,         l = 1, . . . , n_h .
…initial design x₀ and then queries the analysis for a sequence of designs until it finds the optimum design, x*.

Fig. 1.14 The analysis computes the objective (f) and constraint values (g, h) for a given set of design variables (x).

Tip 1.3 Using an optimization software package
The setup of an optimization problem varies depending on the particular software package, so read the documentation carefully. Most optimization software requires you to define the objective and constraints as callback functions. These are passed to the optimizer, which calls them back as needed during the optimization process. The functions take the design variable values as inputs and output the function values, as shown in Fig. 1.14. Study the software documentation for the details on how to use it.∗∗ To make sure you understand how to use a given optimization package, test it on simple problems for which you know the solution first (see Prob. 1.4).

∗∗ Optimization software resources include the optimization toolboxes in MATLAB, scipy.optimize.minimize in Python, Optim.jl or Ipopt.jl in Julia, NLopt for multiple languages, and the Solver add-in in Microsoft Excel. The pyOptSparse framework provides a common Python wrapper for many existing optimization codes and facilitates the testing of different methods.¹ SNOW.jl wraps a few optimizers and multiple derivative computation methods in Julia.

When the optimizer queries the analysis for a given x, for most methods, the constraints do not have to be feasible. The optimizer is responsible for changing x so that the constraints are satisfied.
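As a concrete sketch of this callback structure, the snippet below uses scipy.optimize.minimize (one of the packages listed in the margin note); the quadratic objective, the single constraint, and the starting point are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Objective callback: takes the design variable vector x, returns the scalar f(x).
def objective(x):
    return x[0] ** 2 + x[1] ** 2  # made-up quadratic objective

# Inequality constraint callback. SciPy uses the convention g(x) >= 0, which is
# the opposite of the g(x) <= 0 convention in Eq. 1.4, so the sign is flipped
# (see Tip 1.2).
def constraint(x):
    return x[0] + x[1] - 1.0  # enforces x1 + x2 >= 1

x0 = np.array([2.0, 2.0])  # initial design
result = minimize(objective, x0,
                  constraints=[{"type": "ineq", "fun": constraint}])
print(result.x, result.fun)  # optimizer's design x* and objective value
```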
1. Wu et al., pyOptSparse: A Python framework for large-scale constrained nonlinear optimization of sparse systems, 2020.

The objective and constraint functions must depend on the design variables; if a function does not depend on any variable in the whole
[Figure: optimization problem classification — design variables (continuous, discrete, mixed); objective (single, multiobjective); constraints (constrained, unconstrained); smoothness (continuous, discontinuous); linearity (linear, nonlinear).]
1.3.1 Smoothness

The degree of function smoothness with respect to variations in the design variables depends on the continuity of the function values and their derivatives. When the value of the function varies continuously, the function is said to be C⁰ continuous. If the first derivatives also vary continuously, then the function is C¹ continuous, and so on. Figure 1.17 shows a discontinuous function, a C⁰ function, and a C¹ function.

Fig. 1.17 Discontinuous function (top), C⁰ continuous function (middle), and C¹ continuous function (bottom).

As we will see later, discontinuities in the function value or derivatives limit the type of optimization algorithm that can be used because
1.3.2 Linearity

The functions of interest could be linear or nonlinear. When both the objective and constraint functions are linear, the optimization problem is known as a linear optimization problem. These problems are easier to solve than general nonlinear ones, and there are entire books and courses dedicated to the subject. The first numerical optimization algorithms were developed to solve linear optimization problems, and

…that such a function is unimodal would require evaluating the function at every point in the domain, which is computationally prohibitive. However, it is much easier to prove multimodality—all we need to do is find two distinct local minima.

Fig. 1.19 Types of minima.
[Figure: optimization algorithm classification — order of information used (zeroth, first, second); search (local, global); algorithm type (mathematical, heuristic); function evaluation (direct, surrogate model).]
1.4.5 Stochasticity
This attribute is independent of the stochasticity of the model that
we mentioned previously, and it is strictly related to whether the
optimization algorithm itself contains steps that are determined at
random or not.
A deterministic optimization algorithm always evaluates the same
points and converges to the same result, given the same initial conditions.
In contrast, a stochastic optimization algorithm evaluates a different set
of points if run multiple times from the same initial conditions, even
if the models for the objective and constraints are deterministic. For
example, most evolutionary algorithms include steps determined by
generating random numbers. Gradient-based algorithms are usually
deterministic, but some exceptions exist, such as stochastic gradient
descent (see Section 10.5).
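As a small illustration of this distinction (the test function, bounds, and solver choices here are arbitrary), repeated gradient-based runs from the same starting point return identical results, whereas an evolutionary algorithm draws random numbers internally unless its seed is fixed:

```python
import numpy as np
from scipy.optimize import minimize, differential_evolution

def f(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2  # made-up smooth objective

x0 = np.array([5.0, 5.0])
# Deterministic: repeated BFGS runs from the same x0 evaluate the same points.
print(minimize(f, x0, method="BFGS").x)
print(minimize(f, x0, method="BFGS").x)  # identical result

bounds = [(-10, 10), (-10, 10)]
# Stochastic: differential evolution draws random numbers internally, so
# repeated runs evaluate a different sequence of points unless the seed is set.
print(differential_evolution(f, bounds).x)
print(differential_evolution(f, bounds, seed=0).x)  # reproducible with a seed
```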
[Fig. 1.24 Decision tree for selecting an optimization algorithm: convex problems lead to linear optimization, quadratic optimization, etc. (Ch. 11); discrete problems lead to branch and bound if linear (Ch. 8), dynamic programming if a Markov chain, or bit-encoded SA or GA otherwise; differentiable problems lead to gradient-based methods — BFGS if unconstrained (Ch. 4), SQP or IP if constrained (Ch. 5), with multistart if multimodal (Ch. 6); otherwise, gradient-free methods (Ch. 7) — Nelder–Mead, or DIRECT, GPS, GA, PS, etc. if multimodal.]
The first node asks about convexity. Although it is often not immediately apparent if the problem is convex, with some experience,
we can usually discern whether we should attempt to reformulate it as a
convex problem. In most instances, convexity occurs for problems with
simple objectives and constraints (e.g., linear or quadratic), such as in
control applications where the optimization is performed repeatedly. A
convex problem can be solved with general gradient-based or gradient-
free algorithms, but it would be inefficient not to take advantage of the
convex formulation structure if we can do so.
The next node asks about discrete variables. Problems with discrete
design variables are generally much harder to solve, so we might
consider alternatives that avoid using discrete variables when possible.
For example, a wind turbine’s position in a field could be posed as
a discrete variable within a discrete set of options. Alternatively, we could represent the wind turbine's position with two continuous coordinate variables. That level of flexibility may
or may not be desirable but generally leads to better solutions. Many
…a desire not to deviate from the standard conventions used in each field. We explicitly note these exceptions as needed. For example, the objective function f is a scalar function and the Lagrange multipliers (λ and σ) are vectors.

Fig. 1.25 An n-vector, x.

…Fig. 1.25. For more compact notation, we may write a column vector horizontally, with its components separated by commas, for example, x = [x₁, x₂, . . . , xₙ]. We refer to a vector with n components as an n-vector, which is equivalent to writing x ∈ ℝⁿ. An (n × m) matrix has n rows and m columns, which is equivalent

Fig. 1.26 An (n × m) matrix, A.
Tip 1.5 Work out the dimensions of the vectors and matrices
As you read this book, we encourage you to work out the dimensions of
the vectors and matrices in the operations within each equation and verify the
dimensions of the result for consistency. This will enhance your understanding
of the equations.
1.7 Summary
Problems
x₁² + x₂² ≤ 1
x₁ − 3x₂ + 1/2 ≥ 0 ,

and bound constraints:

x₁ ≥ 0 ,   x₂ ≥ 0 .
Plot the constraints and identify the feasible region. Find the
constrained minimum graphically. Use optimization software
to solve the constrained minimization problem. Which of the
inequality constraints and bounds are active at the solution?
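A minimal sketch of the graphical part of this problem using Matplotlib (the grid range and resolution are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

x1, x2 = np.meshgrid(np.linspace(0, 1.2, 400), np.linspace(0, 1.2, 400))

g1 = x1**2 + x2**2 - 1.0          # feasible where g1 <= 0
g2 = -(x1 - 3 * x2 + 0.5)         # x1 - 3 x2 + 1/2 >= 0  ->  feasible where g2 <= 0
feasible = (g1 <= 0) & (g2 <= 0)  # the bounds x1, x2 >= 0 are enforced by the grid

plt.contour(x1, x2, g1, levels=[0], colors="blue")   # circle x1^2 + x2^2 = 1
plt.contour(x1, x2, g2, levels=[0], colors="red")    # line x1 - 3 x2 + 1/2 = 0
plt.contourf(x1, x2, feasible, levels=[0.5, 1.5], alpha=0.3)  # shade feasible region
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.show()
```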
2 A Short History of Optimization
2.2 Optimization Revolution: Derivatives and Calculus

Fig. 2.2 The law of reflection can be derived by minimizing the length of the light beam.

The scientific revolution generated significant optimization developments in the seventeenth and eighteenth centuries that intertwined with other developments in mathematics and physics.

In the early seventeenth century, Johannes Kepler published a book in which he derived the optimal dimensions of a wine barrel.⁷ He became interested in this problem when he bought a barrel of wine, and the merchant charged him based on a diagonal length (see Fig. 2.3). This outraged Kepler because he realized that the amount of wine could vary for the same diagonal length, depending on the barrel proportions.

7. Kepler, Nova stereometria doliorum vinariorum (New Solid Geometry of Wine Barrels), 1615.
Incidentally, Kepler also formulated an optimization problem when
looking for his second wife, seeking to maximize the likelihood of satis-
faction. This “marriage problem” later became known as the “secretary
problem”, which is an optimal-stopping problem that has since been
solved using dynamic optimization (mentioned in Section 1.4.6 and
discussed in Section 8.5).8
Willebrord Snell discovered the law of refraction in 1621, a formula that describes the relationship between the angles of incidence and refraction when light passes through a boundary between two different media, such as air, glass, or water. Whereas Hero minimized the length to derive the law of reflection, Snell minimized time. These laws were generalized by Fermat in the principle of least time (or Fermat's principle), which states that a ray of light going from one point to another follows the path that takes the least time.

Fig. 2.3 Wine barrels were measured by inserting a ruler in the tap hole until it hit the corner.

8. Ferguson, Who solved the secretary problem? 1989.
point each time does not, in general, result in the shortest overall path.
This is a combinatorial optimization problem that later became known
as the traveling salesperson problem, one of the most intensively studied
problems in optimization (Chapter 8).
In 1939, William Karush derived the necessary conditions for in-
equality constrained problems in his master’s thesis. His approach
generalized the method of Lagrange multipliers, which only allowed
for equality constraints. Harold Kuhn and Albert Tucker independently
rediscovered these conditions and published their seminal paper in
1951.¹⁵ These became known as the Karush–Kuhn–Tucker (KKT) conditions, which constitute the foundation of gradient-based constrained optimization algorithms (see Section 5.3).

15. Karush, Minima of functions of several variables with inequalities as side constraints, 1939.
Leonid Kantorovich developed a technique to solve linear program-
ming problems in 1939 after having been given the task of optimizing
production in the Soviet government’s plywood industry. However,
his contribution was neglected for ideological reasons. In the United
States, Tjalling Koopmans rediscovered linear programming in the
early 1940s when working on ship-transportation problems. In 1947,
George Dantzig published the first complete algorithm for solving linear
programming problems—the simplex algorithm.16 In the same year, 16. Dantzig, Linear programming and
extensions, 1998.
von Neumann developed the theory of duality for linear programming
problems. Kantorovich and Koopmans later shared the 1975 Nobel
Memorial Prize in Economic Sciences “for their contributions to the
theory of optimum allocation of resources”. Dantzig was not included,
presumably because his work was more theoretical. The development
of the simplex algorithm and the widespread practical applications of
linear programming sparked a revolution in optimization. The first
international conference on optimization, the International Symposium
on Mathematical Programming, was held in Chicago in 1949.
In 1951, George Box and Kenneth Wilson developed the response-
surface methodology (surrogate modeling), which enables optimization
of systems based on experimental data (as opposed to a physics-based
model). They developed a method to build a quadratic model where
the number of data points scales linearly with the number of inputs
instead of exponentially, striking a balance between accuracy and ease of
application. In the same year, Danie Krige developed a surrogate model
based on a stochastic process, which is now known as the kriging model.
He developed this model in his master’s thesis to estimate the most likely
distribution of gold based on a limited number of borehole samples.¹⁷ These approaches are foundational in surrogate-based optimization (Chapter 10).

17. Krige, A statistical approach to some mine valuation and allied problems on the Witwatersrand, 1951.
In 1952, Harry Markowitz published a paper on portfolio theory
what became known as the Bellman equation (Section 8.5), which was
first applied to engineering control theory and subsequently became a
core principle in economic theory.
In 1959, William Davidon developed the first quasi-Newton method
for solving nonlinear optimization problems that rely on approxi-
mations of the curvature based on gradient information. He was
motivated by his work at Argonne National Laboratory, where he
used a coordinate-descent method to perform an optimization that
kept crashing the computer before converging. Although Davidon’s
approach was a breakthrough in nonlinear optimization, his original
paper was rejected. It was eventually published more than 30 years
later in the first issue of the SIAM Journal on Optimization.²⁰ Fortunately, his valuable insight had been recognized well before that by Roger Fletcher and Michael Powell, who further developed the method.²¹ The method became known as the Davidon–Fletcher–Powell (DFP) method (Section 4.4.4).

20. Davidon, Variable metric method for minimization, 1991.
21. Fletcher and Powell, A rapidly convergent descent method for minimization, 1963.
Another quasi-Newton approximation method was independently
proposed in 1970 by Charles Broyden, Roger Fletcher, Donald Goldfarb,
and David Shanno, now called the Broyden–Fletcher–Goldfarb–Shanno
(BFGS) method. Larry Armijo, A. Goldstein, and Philip Wolfe developed
the conditions for the line search that ensure convergence in gradient-
based methods (see Section 4.3.2).²²

22. Wolfe, Convergence conditions for ascent methods, 1969.
Leveraging the developments in unconstrained optimization, re-
searchers sought methods for solving constrained problems. Penalty
and barrier methods were developed but fell out of favor because
of numerical issues (see Section 5.4). In another effort to solve nonlinear constrained problems, Robert Wilson proposed the sequential quadratic programming (SQP) method in his PhD thesis.²³ SQP consists of applying the Newton method to solve the KKT conditions (see Section 5.5). Shih-Ping Han reinvented SQP in 1976,²⁴ and Michael Powell popularized this method in a series of papers starting from 1977.²⁵

23. Wilson, A simplicial algorithm for concave programming, 1963.
24. Han, Superlinearly convergent variable metric algorithms for general nonlinear programming problems, 1976.
25. Powell, Algorithms for nonlinear constraints that use Lagrangian functions, 1978.
women from having the same opportunities as men. The first known
female mathematician, Hypatia, lived in Alexandria (Egypt) in the
fourth century and was brutally murdered for political motives. In
the eighteenth century, Sophie Germain corresponded with famous
mathematicians under a male pseudonym to avoid gender bias. She
could not get a university degree because she was female but was
nevertheless a pioneer in elasticity theory. Ada Lovelace famously
wrote the first computer program in the nineteenth century.⁶⁴ In the late nineteenth century, Sofia Kovalevskaya became the first woman to obtain a doctorate in mathematics but had to be tutored privately because she was not allowed to attend lectures. Similarly, Emmy Noether, who made many fundamental contributions to abstract algebra in the early twentieth century, had to overcome rules that prevented women from enrolling in universities and being employed as faculty.⁶⁵

64. Hollings et al., Ada Lovelace: The Making of a Computer Scientist, 2014.
65. Osen, Women in Mathematics, 1974.
…Banneker, a free African American who was a self-taught mathematician and astronomer, corresponded directly with Thomas Jefferson and successfully challenged the morality of the U.S. government's views on race and humanity.⁶⁹ Historically black colleges and universities were established in the United States after the American Civil War because

68. Rothstein, The Color of Law: A Forgotten History of How Our Government Segregated America, 2017.
69. King, More than slaves: Black founders, Benjamin Banneker, and critical intellectual agency, 2014.
2.6 Summary
3 Numerical Models and Solvers
structure’s weight does not contribute to the loading. Finally, the displacements
are assumed to be small relative to the dimensions of the truss members.
The structure is discretized by pinned bar elements. The discrete governing
equations for any truss structure can be derived using the finite-element method.
This leads to the linear system
𝐾𝑢 = 𝑞 ,
where 𝐾 is the stiffness matrix, 𝑞 is the vector of applied loads, and 𝑢 represents
the displacements that we want to compute. At each joint, there are two degrees
of freedom (horizontal and vertical) that describe the displacement and applied
force. Because there are 9 joints, each with 2 degrees of freedom, the size of
this linear system is 18.
r_i(u_1, . . . , u_n) = 0 ,    i = 1, . . . , n ,    (3.1)
where 𝑟 is a vector of residuals that has the same size as the vector of
state variables 𝑢. The equations defining the residuals could be any
expression that can be coded in a computer program. No matter how
complex the mathematical model, it can always be written as a set of
equations in this form, which we write more compactly as 𝑟(𝑢) = 0.
Finding the state variables that satisfy this set of equations requires
a solver, as illustrated in Fig. 3.3. We review the various types of solvers in Section 3.6. Solving a set of implicit equations is more costly than computing explicit functions, and it is typically the most expensive step
The linear system from Ex. 3.1 is an example of a system of implicit equations,
which we can write as a set of residuals by moving the right-hand-side vector
to the left to obtain
𝑟(𝑢) = 𝐾𝑢 − 𝑞 = 0 ,
where 𝑢 represents the state variables. Although the solution for 𝑢 could be
written as an explicit function, u = K⁻¹q, this is usually not done because it
is computationally inefficient and intractable for large-scale systems. Instead,
we use a linear solver that does not explicitly form the inverse of the stiffness
matrix (see Appendix B).
In addition to computing the displacements, we might also want to compute
the axial stress (σ) in each of the 15 truss members. This is an explicit function
of the displacements, which is given by the linear relationship
𝜎 = 𝑆𝑢 ,
We can still use the residual notation to represent explicit functions to write all the functions in a model (implicit and explicit) as r(u) = 0 without loss of generality. Suppose we have an implicit system of equations, r_r(u_r) = 0, followed by a set of explicit functions whose output is a vector u_f = f(u_r), as illustrated in Fig. 3.4. We can rewrite the explicit function as a residual by moving all the terms to one side to get r_f(u_r, u_f) = f(u_r) − u_f = 0. Then, we can concatenate the residuals and variables for the implicit and explicit equations as

r(u) ≡ [ r_r(u_r) ; f(u_r) − u_f ] = 0 ,   where   u ≡ [ u_r ; u_f ] .    (3.2)

Fig. 3.4 A model with implicit and explicit functions.

The solver arrangement would then be as shown in Fig. 3.5. Even though it is more natural to just evaluate explicit functions instead of adding them to a solver, in some cases, it is helpful to use the residual to represent the entire model with the compact notation, r(u) = 0. This will be helpful in later chapters when we compute derivatives (Chapter 6) and solve systems that mix multiple implicit and explicit sets of equations (Chapter 13).

Fig. 3.5 Explicit functions can be written in residual form and added to the solver.
u₁² + 2u₂ − 1 = 0
u₁ + cos(u₁) − u₂ = 0
f(u₁, u₂) = u₁ + u₂ .
The first two equations are written in implicit form, and the third equation is
given as an explicit function. The first equation could be manipulated to obtain
an explicit function of either 𝑢1 or 𝑢2 . The second equation does not have a
closed-form solution and cannot be written as an explicit function for 𝑢1 . The
third equation is an explicit function of 𝑢1 and 𝑢2 . In this case, we could solve
the first two equations for 𝑢1 and 𝑢2 using a nonlinear solver and then evaluate
f(u₁, u₂). Alternatively, we can write the whole system as implicit residual equations by defining the value of f(u₁, u₂) as u₃:

u₁² + 2u₂ − 1 = 0
u₁ + cos(u₁) − u₂ = 0
u₁ + u₂ − u₃ = 0 .

Then we can use the same nonlinear solver to solve for all three equations simultaneously.
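A sketch of this example using a general-purpose nonlinear solver, here scipy.optimize.root (the starting guess is arbitrary):

```python
import numpy as np
from scipy.optimize import root

# Residual form of Ex. 3.3: the explicit output f(u1, u2) is treated as a
# third state u3 with residual r3 = u1 + u2 - u3.
def residuals(u):
    u1, u2, u3 = u
    return [u1**2 + 2 * u2 - 1,
            u1 + np.cos(u1) - u2,
            u1 + u2 - u3]

sol = root(residuals, x0=[1.0, 1.0, 1.0])  # arbitrary starting guess
u1, u2, u3 = sol.x
print(sol.success, u1, u2, u3)  # u3 is the value of f(u1, u2)
```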
Fig. 3.6 Discretization methods in one spatial dimension.

With any of these discretization methods, the final result is a set of algebraic equations that we can write as r(u) = 0 and solve for the state variables u. This is a potentially large set of equations depending on the domain and discretization (e.g., it is common to have millions of equations in three-dimensional computational fluid dynamics problems). The number of state variables of the discretized model is equal to the number of equations for a complete and well-defined model. In the most general case, the set of equations could be implicit and nonlinear.

When a problem involves both space and time, the prevailing approach is to decouple the discretization in space from the discretization in time—called the method of lines (see Fig. 3.7). The discretization in … required.

Fig. 3.7 PDEs in space and time are often discretized in space first to yield an ODE in time.

3.5 Numerical Errors

…error sources.

An absolute error is the magnitude of the difference between the exact value (x*) and the computed value (x), which we can write as |x − x*|.
This is the more useful error measure in most cases. When the exact
value 𝑥 ∗ is close to zero, however, this definition breaks down. To
address this, we avoid the division by zero by using
ε = |x − x*| / (1 + |x*|) .    (3.4)

This error metric combines the properties of absolute and relative errors. When |x*| ≫ 1, this metric is similar to the relative error, but when |x*| ≪ 1, it becomes similar to the absolute error.
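A direct translation of Eq. 3.4 into a short function (the variable names are my own):

```python
def combined_error(x, x_exact):
    """Error metric of Eq. 3.4: behaves like a relative error when |x_exact|
    is large and like an absolute error when |x_exact| is small."""
    return abs(x - x_exact) / (1.0 + abs(x_exact))

print(combined_error(1.0e6 + 5.0, 1.0e6))  # ~5e-6, close to the relative error
print(combined_error(5.0e-9, 0.0))         # 5e-9, the absolute error
```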
Suppose that three decimal digits are available to represent a number (and
that we use base 10 for simplicity). Then, 𝜀ℳ = 0.005 because any number
smaller than this results in 1 + 𝜀 = 1 when rounded to three digits. For
example, 1.00 + 0.00499 = 1.00499, which rounds to 1.00. On the other hand,
1.00 + 0.005 = 1.005, which rounds to 1.01 and satisfies Eq. 3.6.
If we try to store 24.11 using three digits, we get 24.1. The relative error is

(24.11 − 24.1) / 24.11 ≈ 0.0004 ,

which is lower than the maximum possible representation error of εℳ = 0.005 established in Ex. 3.4.
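The same ideas can be checked in double precision with NumPy's reported machine epsilon (a small sketch; the last line mimics Ex. 3.5 by storing a number with fewer digits):

```python
import numpy as np

eps = np.finfo(float).eps          # about 2.2e-16 for double precision
print(1.0 + eps > 1.0)             # True: eps is still representable next to 1
print(1.0 + eps / 4 == 1.0)        # True: a smaller increment is rounded away

x = 24.11
# Storing x in single precision loses digits; the relative error stays below
# half of single precision's machine epsilon (about 1.2e-7).
print(abs(float(np.float32(x)) - x) / x)
```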
For addition and subtraction, an error can occur even when the
two operands are represented exactly. Before addition and subtraction,
the computer must convert the two numbers to the same exponent.
When adding numbers with different exponents, several digits from
the small number vanish (see Fig. 3.8). If the difference in the two
exponents is greater than the magnitude of the exponent of 𝜀ℳ , the
small number vanishes completely—a consequence of Eq. 3.6. The
relative error incurred in addition is still 𝜀ℳ .
Fig. 3.8 Adding or subtracting numbers of differing exponents results in a loss in the number of digits corresponding to the difference in the exponents. The gray boxes indicate digits that are identical between the two numbers.

On the other hand, subtraction can incur much greater relative errors when subtracting two numbers that have the same exponent and are close to each other. In this case, the digits that match between the two numbers cancel each other and reduce the number of significant digits. When the relative difference between two numbers is less than machine precision, all digits match, and the subtraction result is zero (see Fig. 3.9). This is called subtractive cancellation and is a serious issue when approximating derivatives via finite differences (see Section 6.4).
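A short demonstration of subtractive cancellation in the finite-difference context mentioned above (the function and step sizes are arbitrary):

```python
import numpy as np

def f(x):
    return np.sin(x)

x = 1.0
exact = np.cos(x)  # exact derivative of sin at x = 1

for h in [1e-4, 1e-8, 1e-12, 1e-16]:
    fd = (f(x + h) - f(x)) / h  # the subtraction loses digits as h shrinks
    print(h, abs(fd - exact))
# The error first decreases with h (truncation error) and then grows again
# because of roundoff from subtractive cancellation; at h = 1e-16 the computed
# difference f(x + h) - f(x) is exactly zero.
```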
If we use double precision and plot many points in a small interval, we can see that the function exhibits the step pattern shown in Fig. 3.10. The numerical minimum of this function is anywhere in the interval around x = 2 where the numerical value is zero. This interval is much larger than the machine precision (εℳ = 2.2 × 10⁻¹⁶). An additional error is incurred in the function computation
…Fig. 3.11. The norm of the residuals decreases gradually until a limit is reached (near 10⁻¹⁰ in this case). This limit represents the lowest error achieved with the iterative solver and is determined by other sources of error, such as roundoff and truncation errors. If we terminate before reaching the limit (either by setting a convergence tolerance to a value higher than 10⁻¹⁰ or setting an iteration limit to lower than 400 iterations), we incur an additional error. However, it might be desirable to trade off a less precise solution for a lower computational effort.
Fig. 3.12 To find the level of numerical noise of a function of interest with respect to an input parameter (left), we magnify both axes by several orders of magnitude and evaluate the function at points that are closely spaced (right).
The overall attitude toward programming should be that all code has bugs
until it is verified through testing. Programmers who are skilled at debugging
are not necessarily any better at spotting errors by reading code or by stepping
through a debugger than average programmers. Instead, effective programmers
use a systematic approach to narrow down where the problem is occurring.
Beginners often try to debug by running the entire program. Even experi-
enced programmers have a hard time debugging at that level. One primary
strategy discussed in this section is to write modular code. It is much easier
to test and debug small functions. Reliable complex programs are built up
through a series of well-tested modular functions. Sometimes we need to
simplify or break up functions even further to narrow down the problem. We
might need to streamline and remove pieces, make sure a simple case works,
then slowly rebuild the complexity.
You should also become comfortable reading and understanding the error
messages and stack traces produced by the program. These messages seem
obscure at first, but through practice and researching what the error messages
mean, they become valuable information sources.
Of course, you should carefully reread the code, looking for errors, but
reading through it again and again is unlikely to yield new insights. Instead,
it can be helpful to step away from the code and hypothesize the most likely
ways the function could fail. You can then test and eliminate hypotheses to
narrow down the problem.
There are several methods available for solving the discretized gov-
erning equations (Eq. 3.1). We want to solve the governing equations
for a fixed set of design variables, so 𝑥 will not appear in the solution
algorithms. Our objective is to find the state variables 𝑢 such that
𝑟(𝑢) = 0.
This is not a book about solvers, but it is essential to understand the
characteristics of these solvers because they affect the cost and precision
of the function evaluations in the overall optimization process. Thus,
we provide an overview and some of the most relevant details in this
section.∗ In addition, the solution of coupled systems builds on these solvers, as we will see in Section 13.2. Finally, some of the optimization algorithms detailed in later chapters use these solvers.

∗ Ascher and Greif⁷⁴ provide a more detailed introduction to the numerical methods mentioned in this chapter. 74. Ascher and Greif, A First Course in Numerical Methods, 2011.

There are two main types of solvers, depending on whether the
[Fig. 3.13 Overview of solution methods for linear and nonlinear systems: direct and iterative linear solvers (fixed-point methods such as SOR; Krylov subspace methods such as CG and GMRES), Newton with a linear solver, and nonlinear variants of fixed-point iteration.]

Linear systems can be solved directly or iteratively. Direct methods are based on the concept of Gaussian elimination, which can be
expressed in matrix form as a factorization into lower and upper tri-
angular matrices that are easier to solve (LU factorization). Cholesky
factorization is a more efficient variant of LU factorization that applies
only to symmetric positive-definite matrices.
Whereas direct solvers obtain the solution 𝑢 at the end of a process,
iterative solvers start with a guess for 𝑢 and successively improve it
…more detail.

Direct methods are the right choice for many problems because they are generally robust. Also, the solution is guaranteed for a fixed number of operations, O(n³) in this case. However, for large systems where A is sparse, the cost of direct methods can become prohibitive, whereas iterative methods remain viable. Iterative methods have other advantages, such as being able to trade between computational cost and precision. They can also be restarted from a good guess (see Appendix B.4).

Fig. 3.14 Whereas direct methods only yield the solution at the end of the process, iterative methods produce approximate intermediate results.

Tip 3.5 Do not compute the inverse of A

Because some numerical libraries have functions to compute A⁻¹, you might be tempted to do this and then multiply by a vector to compute u = A⁻¹b. This is a bad idea because finding the inverse is computationally expensive. Instead, use LU factorization or another method from Fig. 3.13.
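A minimal sketch of this tip with NumPy and SciPy (the matrix here is a random, well-conditioned placeholder; in practice A would come from the model):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500
A = rng.random((n, n)) + n * np.eye(n)  # well-conditioned test matrix
b = rng.random(n)

u_inv = np.linalg.inv(A) @ b       # works, but forms the inverse explicitly (avoid)
u_solve = np.linalg.solve(A, b)    # solves the system via LU factorization

# If the same A is reused for many right-hand sides, factor once and reuse:
lu, piv = lu_factor(A)
u_lu = lu_solve((lu, piv), b)

print(np.allclose(u_inv, u_solve), np.allclose(u_solve, u_lu))
```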
du/dt = −r(u, t) ,    (3.7)
which is called the semi-discrete form. A time-integration scheme is
then used to solve for the time history. The integration scheme can be
either explicit or implicit, depending on whether it involves evaluating
explicit expressions or solving implicit equations. If a system under a
certain condition has a steady state, these techniques can be used to
solve the steady state (d𝑢/d𝑡 = 0).
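As a concrete instance of an explicit scheme, a forward-Euler loop for Eq. 3.7 might look as follows (the residual, time step, and initial state are placeholders):

```python
import numpy as np

def r(u, t):
    # Placeholder residual; in practice this comes from the spatial discretization.
    return u - np.array([1.0, 2.0])

u = np.array([0.0, 0.0])   # initial state
dt = 0.01                  # time step (explicit schemes limit how large this can be)
for k in range(1000):
    u = u + dt * (-r(u, k * dt))   # forward-Euler step for du/dt = -r(u, t)
print(u)  # approaches the steady state where r(u) = 0, i.e., u = [1, 2]
```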
lim_{k→∞} ‖x_k − x*‖ = 0 ,    (3.8)
which means that the norm of the error tends to zero as the number of
iterations tends to infinity.
The rate of convergence of a sequence is of order p with asymptotic error constant γ when p is the largest number that satisfies∗

0 ≤ lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖^p = γ < ∞ .    (3.9)

∗ Some authors refer to p as the rate of convergence. Here, we characterize the rate of convergence by two metrics: order and error constant.
Asymptotic here refers to the fact that this is the behavior in the limit,
when we are close to the solution. There is no guarantee that the initial
and intermediate iterations satisfy this condition.
To avoid dealing with limits, let us consider the condition expressed
in Eq. 3.9 at all iterations. We can relate the error from one iteration to
the next by
‖x_{k+1} − x*‖ = γ_k ‖x_k − x*‖^p .    (3.10)
When 𝑝 = 1, we have linear order of convergence; when 𝑝 = 2, we have
quadratic order of convergence. Quadratic convergence is a highly
valued characteristic for an iterative algorithm, and in practice, orders of
convergence greater than 𝑝 = 2 are usually not worthwhile to consider.
When we have linear convergence, then
‖x_{k+1} − x*‖ = γ_k ‖x_k − x*‖ ,    (3.11)
Thus, after six iterations, we get six-digit precision. Now suppose that γ = 0.9. Then we would have …

‖x_{k+1} − x*‖ = γ_k ‖x_k − x*‖² .    (3.14)

If γ = 1, then the error norm sequence with a starting error norm of 0.1 would be

10⁻¹, 10⁻², 10⁻⁴, 10⁻⁸, . . . .    (3.15)

This yields more than six digits of precision in just three iterations! In this case, the number of correct digits doubles at every iteration. When γ > 1, the convergence will not be as fast, but the series will still converge.
‖x_{k+1} − x*‖ = γ_k ‖x_k − x*‖ ,    (3.16)
Fig. 3.15 Sample sequences for linear, superlinear, and quadratic cases plotted on a linear scale (left) and a logarithmic scale (right).
When using a linear scale plot, you can only see differences in two significant
digits. To reveal changes beyond three digits, you should use a logarithmic
scale. This need frequently occurs in plotting the convergence behavior of
optimization algorithms.
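A short script that reproduces the kinds of error sequences discussed above (starting error of 0.1; the constants match the cases labeled in Fig. 3.15):

```python
# Error sequences ||x_k - x*|| for linear (p = 1) and quadratic (p = 2) convergence.
e_lin_01, e_lin_09, e_quad = [0.1], [0.1], [0.1]
for k in range(6):
    e_lin_01.append(0.1 * e_lin_01[-1])   # p = 1, gamma = 0.1: one digit gained per iteration
    e_lin_09.append(0.9 * e_lin_09[-1])   # p = 1, gamma = 0.9: much slower
    e_quad.append(1.0 * e_quad[-1] ** 2)  # p = 2, gamma = 1: correct digits double each iteration

for a, b, c in zip(e_lin_01, e_lin_09, e_quad):
    print(f"{a:9.1e}  {b:9.1e}  {c:9.1e}")
```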
‖r_{k+1}‖ = γ_k ‖r_k‖^p .    (3.18)
This is the secant method, which is useful when the derivative is not
available. The convergence rate is not quadratic like Newton’s method,
but it is superlinear.
Example 3.7 Newton’s method and the secant method for a single variable
u_{k+1} = u_k − (2u_k³ + 4u_k² + u_k − 2) / (6u_k² + 8u_k + 1) .
When we start with the guess u₀ = 1.5 (left plot in Fig. 3.16), the iterations are well behaved, and the method converges quadratically. We can see the …

Fig. 3.16 Newton iterations starting from different starting points.

The iterations for the secant method are shown in Fig. 3.17, where we can see the successive secant lines replacing the exact tangent lines used in Newton's method.

Fig. 3.17 Secant method applied to a one-dimensional function.
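A small implementation of both methods for the residual in this example, r(u) = 2u³ + 4u² + u − 2, with the starting points used in Figs. 3.16 and 3.17 (the tolerance and iteration limits are arbitrary):

```python
def r(u):
    return 2 * u**3 + 4 * u**2 + u - 2

def dr(u):
    return 6 * u**2 + 8 * u + 1  # analytic derivative used by Newton's method

# Newton's method from u0 = 1.5 (the well-behaved case in Fig. 3.16).
u = 1.5
for k in range(20):
    print("Newton", k, u, r(u))
    if abs(r(u)) < 1e-12:
        break
    u = u - r(u) / dr(u)

# Secant method from u0 = 1.5 and u1 = 1.3 (as in Fig. 3.17): the derivative is
# approximated by a finite difference through the two most recent iterates.
u_prev, u = 1.5, 1.3
for k in range(30):
    print("Secant", k, u, r(u))
    if abs(r(u)) < 1e-12:
        break
    u_next = u - r(u) * (u - u_prev) / (r(u) - r(u_prev))
    u_prev, u = u, u_next
```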
This means that if the derivative is close to zero or the curvature tends to a large number at the solution, Newton's method will not converge as well, or at all.
Now we consider the general case where we have 𝑛 nonlinear
equations of 𝑛 unknowns, expressed as 𝑟(𝑢) = 0. Similar to the single-
variable case, we derive the Newton step from a truncated Taylor
series. However, the Taylor series needs to be multidimensional in
both the independent variable and the function. Consider first the
multidimensionality of the independent variable, 𝑢, for a component
of the residuals, 𝑟 𝑖 (𝑢). The first two terms of the Taylor series about
𝑢 𝑘 for a step Δ𝑢 (which is now a vector with arbitrary direction and
magnitude) are

r_i(u_k + Δu) ≈ r_i(u_k) + Σ_{j=1}^{n} (∂r_i/∂u_j)|_{u=u_k} Δu_j .    (3.28)
J_k Δu_k = −r_k .    (3.30)

u_{k+1} = u_k + Δu_k .    (3.31)
u₂ = 1/u₁ ,    u₂ = √u₁ .

This corresponds to the two lines shown in Fig. 3.18, where the solution is at their intersection, u = (1, 1). (In this example, the two equations are explicit, and we could solve them by substitution, but they could have been implicit.)

r₁ = u₂ − 1/u₁ = 0
r₂ = u₂ − √u₁ = 0 .

The Jacobian can be derived analytically, and the Newton step is given by the linear system

[ 1/u₁²         1 ] [Δu₁]       [ u₂ − 1/u₁ ]
[ −1/(2√u₁)     1 ] [Δu₂]  = −  [ u₂ − √u₁  ] .

Fig. 3.18 Newton iterations.
Starting from u = (2, 3) yields the iterations shown in the following table, with the quadratic convergence shown in Fig. 3.19.

u₁            u₂            ‖u − u*‖        ‖r‖
2.000 000     3.000 000     2.24            2.96
0.485 281     0.878 680     5.29 × 10⁻¹     1.20
…
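A direct NumPy implementation of these Newton iterations (the tolerance and iteration limit are arbitrary):

```python
import numpy as np

def residuals(u):
    return np.array([u[1] - 1.0 / u[0],
                     u[1] - np.sqrt(u[0])])

def jacobian(u):
    return np.array([[1.0 / u[0]**2,                 1.0],
                     [-1.0 / (2.0 * np.sqrt(u[0])),  1.0]])

u = np.array([2.0, 3.0])  # starting guess from Ex. 3.8
for k in range(20):
    r = residuals(u)
    print(k, u, np.linalg.norm(r))
    if np.linalg.norm(r) < 1e-12:
        break
    du = np.linalg.solve(jacobian(u), -r)  # Newton step J du = -r (Eq. 3.30)
    u = u + du
# Converges quadratically to u* = (1, 1).
```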
…the governing equations were not included because they were assumed to be part of the computation of the objective and constraints for a given x. However, we can include them in the problem statement for completeness as follows:

minimize        f(x; u)
by varying      x_i ,                 i = 1, . . . , n_x
subject to      g_j(x; u) ≤ 0 ,       j = 1, . . . , n_g
                h_k(x; u) = 0 ,       k = 1, . . . , n_h        (3.32)
                x̲_i ≤ x_i ≤ x̄_i ,     i = 1, . . . , n_x
while solving   r_l(u; x) = 0 ,       l = 1, . . . , n_u
by varying      u_l ,                 l = 1, . . . , n_u .

Fig. 3.21 Computing the objective (f) and constraint functions (g, h) for a given set of design variables (x) usually involves the solution of a numerical model (r = 0) by varying the state variables (u).
Here, “while solving” means that the governing equations are solved
at each optimization iteration to find a valid 𝑢 for each value of 𝑥.
The semicolon in 𝑓 (𝑥; 𝑢) indicates that 𝑢 is fixed while the optimizer
determines the next value of 𝑥.
Recalling the truss problem of Ex. 3.2, suppose we want to minimize the
mass of the structure (𝑚) by varying the cross-sectional areas of the truss
members (𝑎), subject to stress constraints.
The structural mass is an explicit function that can be written as

m = Σ_{i=1}^{15} ρ a_i l_i ,
where 𝜌 is the material density, 𝑎 𝑖 is the cross-sectional area of each member 𝑖,
and 𝑙 𝑖 is the member length. This function depends on the design variables
directly and does not depend on the displacements.
minimize 𝑚(𝑎)
by varying 𝑎 𝑖 ≥ 𝑎min 𝑖 = 1, . . . , 15
subject to |𝜎 𝑗 (𝑎, 𝑢)| − 𝜎max ≤ 0 𝑗 = 1, . . . , 15
while solving 𝐾𝑢 − 𝑞 = 0 (system of 18 equations)
by varying 𝑢𝑙 𝑙 = 1, . . . , 18 .
The governing equations are a linear set of equations whose solution determines
the displacements (𝑢) of a given design (𝑎) for a load condition (𝑞). We
mentioned previously that the objective and constraint functions are usually
explicit functions of the state variables, design variables, or both. As we saw in
Ex. 3.2, the mass is an explicit function of the cross-sectional areas. In this case,
it does not even depend on the state variables. The constraint function is also
explicit, but in this case, it is just a function of the state variables. This example
illustrates a common situation where the solution of the state variables requires
the solution of implicit equations (structural solver), whereas the constraints
(stresses) and objective (weight) are explicit functions of the states and design
variables.
minimize      f(x, u)
by varying    x_i ,               i = 1, . . . , n_x
              u_l ,               l = 1, . . . , n_u
subject to    g_j(x, u) ≤ 0 ,     j = 1, . . . , n_g        (3.33)
              h_k(x, u) = 0 ,     k = 1, . . . , n_h
              x̲_i ≤ x_i ≤ x̄_i ,   i = 1, . . . , n_x
              r_l(x, u) = 0 ,     l = 1, . . . , n_u .

Fig. 3.22 In the full-space approach, the governing equations are solved by the optimizer by varying the state variables.
This approach is described in more detail in Section 13.4.3.
More generally, the optimization constraints and equations in a
model are interchangeable. Suppose a set of equations in a model can
be satisfied by varying a corresponding set of state variables. In that case,
these equations and variables can be moved to the optimization problem
statement as equality constraints and design variables, respectively.
3 Numerical Models and Solvers 73
To solve the structural sizing problem (Ex. 3.9) using a full-space approach,
we forgo the linear solver by adding 𝑢 to the set of design variables and letting
the optimizer enforce the governing equations. This results in the following
problem:
minimize 𝑚(𝑎)
by varying 𝑎 𝑖 ≥ 𝑎min 𝑖 = 1, . . . , 15
𝑢𝑙 𝑙 = 1, . . . , 18
subject to |𝜎 𝑗 (𝑎, 𝑢)| − 𝜎max ≤ 0 𝑗 = 1, . . . , 15
𝐾𝑢 − 𝑞 = 0 (system of 18 equations) .
Before you optimize, you should be familiar with the analysis (model and
solver) that computes the objective and constraints. If possible, make several
parameter sweeps to see what the functions look like—whether they are smooth,
whether they seem unimodal or not, what the trends are, and the range of
values. You should also get an idea of the computational effort required and if
that varies significantly. Finally, you should test the robustness of the analysis
to different inputs because the optimization is likely to ask for extreme values.
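A sketch of such a parameter sweep (the analysis function, the variable swept, and its range are placeholders for whatever model you are about to optimize):

```python
import numpy as np

def analysis(x):
    # Placeholder for the real model: returns the objective for a design x.
    return np.sin(3 * x) + 0.1 * x**2

# Sweep one design variable over its expected range and inspect the output:
# Is it smooth? Does it look unimodal? Does the analysis fail anywhere?
for x in np.linspace(-3.0, 3.0, 13):
    try:
        print(f"x = {x:6.2f}   f = {analysis(x):8.4f}")
    except Exception as err:        # record failures instead of stopping the sweep
        print(f"x = {x:6.2f}   analysis failed: {err}")
```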
3.10 Summary
Problems
3.2 Choose an engineering system that you are familiar with and
describe each of the components illustrated in Fig. 3.1 for that
system. List all the options for the mathematical and numerical
models that you can think of, and describe the assumptions for
each model. What type of solver is usually used for each model
(see Section 3.6)? What are the state variables for each model?
u₁²/4 + u₂² = 1
4u₁u₂ = π
f = 4(u₁ + u₂) .
3.4 Reproduce a plot similar to the one shown in Fig. 3.10 for
𝑓 (𝑥) = cos(𝑥) + 1
in the neighborhood of 𝑥 = 𝜋 .
𝑟(𝑢) = 𝑢 3 − 6𝑢 2 + 12𝑢 − 8 = 0 .
𝐸 − 𝑒 sin(𝐸) = 𝑀,
3.7 Consider the equation from Prob. 3.5 where we replace one of the
coefficients with a parameter 𝑎 as follows:
𝑟(𝑢) = 𝑎𝑢 3 − 6𝑢 2 + 12𝑢 − 8 = 0 .
3.8 Reproduce the solution of Ex. 3.8 and then try different initial
guesses. Can you define a distinct region from where Newton’s
method converges?
3.9 Choose a problem that you are familiar with and find the magni-
tude of numerical noise in one or more outputs of interest with
respect to one or more inputs of interest. What means do you
have to decrease the numerical noise? What is the lowest possible
level of noise you can achieve?
4 Unconstrained Gradient-Based Optimization
In this chapter we focus on unconstrained optimization problems with
continuous design variables, which we can write as
4.1 Fundamentals
Fig. 4.2 Components of the gradient vector in the two-dimensional case.

The gradient vectors are normal to the surfaces of constant f in n-dimensional space (isosurfaces). In the two-dimensional case, gradient …
This defines the vector field plotted in Fig. 4.3, where each vector points in the
direction of the steepest local increase.
[Fig. 4.3: gradient vector field shown on contour and surface plots of the function, with the minimum, maximum, and saddle points labeled.]
Consider the wing design problem from Ex. 1.1, where the objective function
is the required power (𝑃). For the derivative of power with respect to span
(𝜕𝑃/𝜕𝑏), the units are watts per meter (W/m). For a wing with 𝑐 = 1 m and
𝑏 = 12 m, we have 𝑃 = 1087.85 W and 𝜕𝑃/𝜕𝑏 = −41.65 W/m. This means that
for an increase in span of 1 m, the linear approximation predicts a decrease in
power of 41.65 W (to P = 1046.20 W). However, the actual power at b = 13 m is
1059.77 W because the function is nonlinear (see Fig. 4.4). The relative derivative
for this same design can be computed as (𝜕𝑃/𝜕𝑏)(𝑏/𝑃) = −0.459, which means
that for a 1 percent increase in span, the linear approximation predicts a 0.459
percent decrease in power. The actual decrease is 0.310 percent.
Fig. 4.4 Power versus span and the corresponding derivative.
∇_p f(x) = ∇f^T p .    (4.6)
Fig. 4.5 Projection of the gradient in an arbitrary unit direction p.

From the gradient projection, we can see why the gradient is the direction of the steepest increase. If we use this definition of the dot product,

∇_p f(x) = ∇f^T p = ‖∇f‖ ‖p‖ cos θ ,    (4.7)
where 𝜃 is the angle between the two vectors, we can see that this is
maximized when 𝜃 = 0◦ . That is, the directional derivative is largest
when 𝑝 points in the same direction as ∇ 𝑓 .
If 𝜃 is in the interval (−90, 90)◦ , the directional derivative is positive
and is thus in a direction of increase, as shown in Fig. 4.6. If 𝜃 is in the
interval (90, 180]◦ , the directional derivative is negative, and 𝑝 points
in a descent direction. Finally, if 𝜃 = ±90◦ , the directional derivative
is 0, and thus the function value does not change for small steps; it
is locally flat in that direction. This condition occurs when ∇ 𝑓 and 𝑝
are orthogonal; therefore, the gradient is orthogonal to the function
isosurfaces.
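A quick numerical check of this projection, using the simple quadratic f = x₁² + x₂² from Fig. 1.9 (the point, direction, and finite-difference step are arbitrary):

```python
import numpy as np

def f(x):
    return x[0]**2 + x[1]**2       # simple quadratic from Fig. 1.9

def grad_f(x):
    return np.array([2 * x[0], 2 * x[1]])

x = np.array([1.0, 2.0])           # arbitrary point
p = np.array([1.0, -1.0])
p_hat = p / np.linalg.norm(p)      # normalize to get the slope in the original units

directional = grad_f(x) @ p_hat    # grad(f)^T p
alpha = 1e-6
finite_diff = (f(x + alpha * p_hat) - f(x)) / alpha  # slope along x + alpha*p
print(directional, finite_diff)    # the two values agree closely

# The projection is zero for a direction orthogonal to the gradient:
p_orth = np.array([-grad_f(x)[1], grad_f(x)[0]])
print(grad_f(x) @ p_orth)          # 0: locally flat in that direction
```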
Fig. 4.6 The gradient ∇f is always orthogonal to contour lines (surfaces), and the directional derivative in the direction p is given by ∇f^T p.
To get the correct slope in the original units of x, the direction should be normalized as p̂ = p/‖p‖. However, in some of the gradient-based algorithms of this chapter, p is not normalized because the length contains useful information. If p is not normalized, the slopes and variable axis are scaled by a constant.
Fig. 4.7 Function contours and direction p (left), one-dimensional slice along p (middle), directional derivative for all directions on polar plot (right).

A projection of the function in the p direction can be obtained by plotting f along the line defined by x + αp, where α is the independent variable, as shown in Fig. 4.7 (middle). The projected slope of the function in that direction corresponds to the slope of this single-variable function. The polar plot in Fig. 4.7 (right) shows how the directional derivative changes with the direction of p. The directional derivative has a maximum in the direction of the gradient, has the largest negative magnitude in the opposite direction, and has zero values in the directions orthogonal to the gradient.
$$\frac{\partial^2 f}{\partial x_i \partial x_j} \, . \tag{4.8}$$

$$H_f(x) =
\begin{bmatrix}
\dfrac{\partial^2 f}{\partial x_1^2} & \dfrac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\[1ex]
\dfrac{\partial^2 f}{\partial x_2 \partial x_1} & \dfrac{\partial^2 f}{\partial x_2^2} & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n} \\[1ex]
\vdots & \vdots & \ddots & \vdots \\[1ex]
\dfrac{\partial^2 f}{\partial x_n \partial x_1} & \dfrac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}
\end{bmatrix} . \tag{4.10}$$

$$H_{f,ij} = \frac{\partial^2 f}{\partial x_i \partial x_j} \, . \tag{4.11}$$
whose contours are shown in Fig. 4.8 (left). These contours are ellipses that have the same center. The Hessian of this quadratic is

$$H = \begin{bmatrix} 2 & -1 \\ -1 & 4 \end{bmatrix} ,$$
which is constant. To find the curvature in the direction $p = \left[-1/2,\; -\sqrt{3}/2\right]$, we compute

$$p^\top H p =
\begin{bmatrix} -\tfrac{1}{2} & -\tfrac{\sqrt{3}}{2} \end{bmatrix}
\begin{bmatrix} 2 & -1 \\ -1 & 4 \end{bmatrix}
\begin{bmatrix} -\tfrac{1}{2} \\ -\tfrac{\sqrt{3}}{2} \end{bmatrix}
= \frac{7 - \sqrt{3}}{2} \, .$$

The principal curvature directions can be computed by solving the eigenvalue problem (Eq. 4.14). This yields two eigenvalues and two corresponding eigenvectors,

$$\kappa_A = 3 + \sqrt{2}, \quad v_A = \begin{bmatrix} 1 - \sqrt{2} \\ 1 \end{bmatrix},
\qquad
\kappa_B = 3 - \sqrt{2}, \quad v_B = \begin{bmatrix} 1 + \sqrt{2} \\ 1 \end{bmatrix} .$$
Fig. 4.8 Contours of 𝑓 for Ex. 4.4 and the two principal curvature directions in red. The polar plot shows the curvature, with the eigenvectors pointing at the directions of principal curvature; all other directions have curvature values in between.
Consider the same polynomial from Ex. 4.1. Differentiating the gradient we obtained previously yields the Hessian:

$$H(x_1, x_2) = \begin{bmatrix} 6x_1 & 4x_2 \\ 4x_2 & 4x_1 - 6x_2 \end{bmatrix} .$$

We can visualize the variation of the Hessian by plotting the principal curvatures at different points (Fig. 4.9).
For more on the Taylor series, see Appendix A.1.
Using the gradient and Hessian of the two-variable polynomial from Ex. 4.1 and Ex. 4.5, we can use Eq. 4.15 to construct a second-order Taylor expansion about 𝑥₀,

$$\tilde{f}(p) = f(x_0) +
\begin{bmatrix} 3x_1^2 + 2x_2^2 - 20 \\ 4x_1 x_2 - 3x_2^2 \end{bmatrix}^{\!\top} p
+ \frac{1}{2}\, p^\top
\begin{bmatrix} 6x_1 & 4x_2 \\ 4x_2 & 4x_1 - 6x_2 \end{bmatrix} p \, .$$
Figure 4.10 shows the resulting Taylor series expansions about different points. We perform three expansions, each about one of the three critical points: the minimum (left), the maximum (middle), and the saddle point (right). The expansion about the minimum yields a convex quadratic that is a good approximation of the original function near the minimum but becomes worse as we step farther away. The expansion about the maximum shows a similar trend except that the approximation is a concave quadratic. Finally, the expansion about the saddle point yields a saddle function.
$$f(x^* + p) = f(x^*) + \nabla f(x^*)^\top p + \frac{1}{2}\, p^\top H(x^*)\, p + \ldots \, . \tag{4.16}$$

For 𝑥* to be an optimal point, we must have 𝑓(𝑥* + 𝑝) ≥ 𝑓(𝑥*) for all 𝑝. This implies that the first- and second-order terms in the Taylor series have to be nonnegative, that is,

$$\nabla f(x^*)^\top p + \frac{1}{2}\, p^\top H(x^*)\, p \geq 0 \quad \text{for all } p \, . \tag{4.17}$$
Because 𝑝 can be in any arbitrary direction, the only way this inequality can be satisfied is if all the elements of the gradient are zero (refer to Fig. 4.6),

$$\nabla f(x^*) = 0 \, . \tag{4.19}$$

This is the first-order necessary optimality condition for an unconstrained problem. The condition is necessary because if any element of the gradient were nonzero, there would be descent directions (e.g., 𝑝 = −∇ 𝑓) for which the inequality would not be satisfied.
Because the gradient term has to be zero, we must now satisfy the remaining term in the inequality (Eq. 4.17), that is,

$$p^\top H(x^*)\, p \geq 0 \quad \text{for all } p \, .$$
From Eq. 4.13, we know that this term represents the curvature in
direction 𝑝, so this means that the function curvature must be positive
or zero when projected in any direction. You may recognize this
inequality as the definition of a positive-semidefinite matrix. In other
words, the Hessian 𝐻(𝑥 ∗ ) must be positive semidefinite.
For a matrix to be positive semidefinite, its eigenvalues must all
be greater than or equal to zero. Recall that the eigenvalues of the
Hessian quantify the principal curvatures, so as long as all the principal
curvatures are greater than or equal to zero, the curvature along an
arbitrary direction is also greater than or equal to zero.
These conditions on the gradient and curvature are necessary condi-
tions for a local minimum but are not sufficient. They are not sufficient
because if the curvature is zero in some direction 𝑝 (i.e., 𝑝 | 𝐻(𝑥 ∗ )𝑝 = 0),
we have no way of knowing if it is a minimum unless we check the
third-order term. In that case, even if it is a minimum, it is a weak
minimum.
The sufficient conditions for optimality require the curvature to be positive in any direction, in which case we have a strong minimum. Mathematically, this means that 𝑝⊤𝐻(𝑥*)𝑝 > 0 for all nonzero 𝑝, which is the definition of a positive-definite matrix. If 𝐻 is a positive-definite matrix, every eigenvalue of 𝐻 is positive.§

§ For other approaches to determine if a matrix is positive definite, see Appendix A.6.

Figure 4.11 shows some examples of quadratic functions that are positive definite (all positive eigenvalues), positive semidefinite (nonnegative eigenvalues), indefinite (mixed eigenvalues), and negative definite (all negative eigenvalues).
In summary, the necessary optimality conditions for an unconstrained optimization problem are

$$\nabla f(x^*) = 0 \quad \text{and} \quad H(x^*) \text{ is positive semidefinite} \, . \tag{4.21}$$

The sufficient optimality conditions are

$$\nabla f(x^*) = 0 \quad \text{and} \quad H(x^*) \text{ is positive definite} \, . \tag{4.22}$$

Fig. 4.11 Quadratic functions with different types of Hessians.
We can find the minima of this function by solving the optimality conditions analytically. To find the critical points of this function, we solve for the points at which the gradient is equal to zero,

$$\nabla f =
\begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\[1.5ex] \dfrac{\partial f}{\partial x_2} \end{bmatrix}
=
\begin{bmatrix} 2x_1^3 + 6x_1^2 + 3x_1 - 2x_2 \\ 2x_2 - 2x_1 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \end{bmatrix} .$$

From the second equation, we have that 𝑥₂ = 𝑥₁. Substituting this into the first equation yields

$$x_1 \left( 2x_1^2 + 6x_1 + 1 \right) = 0 \, .$$

The solution of this equation yields three points:

$$x_A = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \qquad
x_B = \begin{bmatrix} -\tfrac{3}{2} - \tfrac{\sqrt{7}}{2} \\[0.5ex] -\tfrac{3}{2} - \tfrac{\sqrt{7}}{2} \end{bmatrix}, \qquad
x_C = \begin{bmatrix} \tfrac{\sqrt{7}}{2} - \tfrac{3}{2} \\[0.5ex] \tfrac{\sqrt{7}}{2} - \tfrac{3}{2} \end{bmatrix} .$$
The eigenvalues are 𝜅₁ ≈ 1.737 and 𝜅₂ ≈ 17.200, so this point is another local minimum. For the third point,

$$H(x_C) = \begin{bmatrix} 9 - 3\sqrt{7} & -2 \\ -2 & 2 \end{bmatrix} .$$

The eigenvalues of this Hessian are 𝜅₁ ≈ −0.523 and 𝜅₂ ≈ 3.586, so this point is a saddle point.

Figure 4.12 shows these three critical points. To find out which of the two local minima is the global one, we evaluate the function at each of these points. Because 𝑓(𝑥_B) < 𝑓(𝑥_A), 𝑥_B is the global minimum.
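A minimal sketch, assuming NumPy, of checking the second-order conditions numerically: the eigenvalues of the Hessian at a critical point classify it, here using 𝐻(𝑥_C) from this example.

```python
import numpy as np

H_C = np.array([[9.0 - 3.0 * np.sqrt(7.0), -2.0],
                [-2.0,                      2.0]])

eigvals = np.linalg.eigvalsh(H_C)            # eigenvalues of a symmetric matrix
if np.all(eigvals > 0):
    kind = "local minimum (positive definite)"
elif np.all(eigvals < 0):
    kind = "local maximum (negative definite)"
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    kind = "saddle point (indefinite)"
else:
    kind = "inconclusive (semidefinite); higher-order terms are needed"
print(eigvals, kind)                          # approximately [-0.523, 3.586] -> saddle point
```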
The absolute and relative conditions on the objective are of the same
form, although they only use an absolute value rather than a norm
because the objective is scalar.
Figure 4.13 shows quadratic models of the same function based on three different points. All quadratic approximations match the local gradient and curvature at the respective points. However, the Taylor series quadratic about the first point (left plot) yields a quadratic without a minimum (the only critical point is a saddle point). The second point (middle plot) yields a quadratic whose minimum is closer to the true minimum. Finally, the Taylor series about the actual minimum point (right plot) yields a quadratic with the same minimum, as expected, but we can see how the quadratic model worsens the farther we are from the point.

Fig. 4.13 Taylor series quadratic models are only guaranteed to be accurate near the point about which the series is expanded (𝑥_k).
Because the Taylor series is only guaranteed to be a good model locally, we need a globalization strategy to ensure convergence to an optimum. Globalization here means making the algorithm robust enough that it can converge to a local minimum when starting from any point in the domain. This should not be confused with finding the global minimum, which is a separate issue (see Tip 4.8). There are two main globalization strategies: line search and trust region.

The line search approach consists of three main steps for every iteration (Fig. 4.14):

1. Choose a suitable search direction from the current point. The choice of search direction is based on a Taylor series approximation.
2. Determine how far to move in that direction by performing a line search.
3. Move to the new point and update all values.

The first two steps can be seen as two separate subproblems. We address the line search subproblem in Section 4.3 and the search direction subproblem in Section 4.4.

Fig. 4.14 Line search approach.

Trust-region methods also consist of three steps (Fig. 4.15):

1. Create a model about the current point. This model can be based on a Taylor series approximation or another type of surrogate model.
2. Minimize the model within a trust region around the current point to find the step.
3. Move to the new point, update values, and adapt the size of the trust region.
The line search approach can be combined with any method for finding the search direction. However, the search direction method determines the name of the overall optimization algorithm, as we will see in the next section.
$$x_{k+1} = x_k + \alpha\, p_k \, , \tag{4.26}$$

where the scalar 𝛼 is always positive and represents how far we go in the direction 𝑝_k. This equation produces a one-dimensional slice of 𝑛-dimensional space, as illustrated in Fig. 4.17.

Fig. 4.16 The line search starts from a given point 𝑥_k and searches solely along direction 𝑝_k.
Fig. 4.17 The line search projects the 𝑛-dimensional problem onto one dimension, where the independent variable is 𝛼.
It would be wasteful to perform an exact line search along the 𝑝_k direction because it would not take us any closer to the minimum of the overall function (the dot on the right side of the plot). Instead, we should find a point that is good enough and then update the search direction.

Fig. 4.19 The descent direction does not necessarily point toward the minimum, in which case it would be wasteful to do an exact line search.

To simplify the notation for the line search, we define the single-variable function

$$\phi(\alpha) = f(x_k + \alpha\, p_k) \, , \tag{4.27}$$

where 𝛼 = 0 corresponds to the start of the line search (𝑥_k in Fig. 4.17), and thus 𝜙(0) = 𝑓(𝑥_k). Then, using 𝑥 = 𝑥_k + 𝛼𝑝_k, the slope of the single-variable function is

$$\phi'(\alpha) = \frac{\partial f(x)}{\partial \alpha} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{\partial x_i}{\partial \alpha} \, . \tag{4.28}$$
$$\phi'(\alpha) = \nabla f(x_k + \alpha\, p_k)^\top p_k \, , \tag{4.29}$$

which is the directional derivative along the search direction. The slope at the start of a given line search is

$$\phi'(0) = \nabla f_k^\top p_k \, . \tag{4.30}$$

Because 𝑝_k must be a descent direction, 𝜙′(0) is always negative. Figure 4.20 is a version of the one-dimensional slice from Fig. 4.17 in this notation. The 𝛼 axis and the slopes scale with the magnitude of 𝑝_k.

Fig. 4.20 For the line search, we denote the function as 𝜙(𝛼) with the same value as 𝑓. The slope 𝜙′(𝛼) is the gradient of 𝑓 projected onto the search direction.
where 𝜇₁ is a constant such that 0 < 𝜇₁ ≤ 1.† The quantity 𝛼𝜙′(0) represents the expected decrease of the function, assuming the function continued at the same slope. The multiplier 𝜇₁ states that Eq. 4.31 will be satisfied as long as we achieve even a small fraction of the expected decrease, as shown in Fig. 4.21. In practice, this constant is several orders of magnitude smaller than 1, typically 𝜇₁ = 10⁻⁴. Because 𝑝_k is a descent direction, and thus 𝜙′(0) = ∇𝑓_k⊤𝑝_k < 0, there is always a positive 𝛼 that satisfies this condition for a smooth function.

† This condition can be problematic near a local minimum because 𝜙(0) and 𝜙(𝛼) are so similar that their subtraction is inaccurate. Hager and Zhang⁷⁷ introduced a condition with improved accuracy, along with an efficient line search based on a secant method.

77. Hager and Zhang, A new conjugate gradient method with guaranteed descent and an efficient line search, 2005.
The concept is illustrated in Fig. 4.22, which shows a function with a negative slope at 𝛼 = 0 and a sufficient decrease line whose slope is a fraction of that initial slope. When starting a line search, any candidate point whose function value falls below the sufficient decrease line is deemed acceptable. The sufficient decrease line slope in Fig. 4.22 is exaggerated for illustration purposes; for typical values of 𝜇₁, the line is indistinguishable from horizontal when plotted.

Fig. 4.21 The sufficient decrease line has a slope that is a small fraction of the slope at the start of the line search.
Even when we do have an educated guess for 𝛼, it is only a guess, and the first step might not satisfy the sufficient decrease condition.
A straightforward algorithm that is guaranteed to find a step that
satisfies the sufficient decrease condition is backtracking (Alg. 4.2).
This algorithm starts with a maximum step and successively reduces
the step by a constant ratio 𝜌 until it satisfies the sufficient decrease
condition (a typical value is 𝜌 = 0.5). Because the search direction is a
descent direction, we know that we will achieve an acceptable decrease
in function value if we backtrack enough.
Algorithm 4.2 Backtracking line search

Inputs:
  𝛼_init > 0: Initial step length
  0 < 𝜇₁ < 1: Sufficient decrease factor (typically small, e.g., 𝜇₁ = 10⁻⁴)
  0 < 𝜌 < 1: Backtracking factor (e.g., 𝜌 = 0.5)
Outputs:
  𝛼*: Step size satisfying the sufficient decrease condition

𝛼 = 𝛼_init
while 𝜙(𝛼) > 𝜙(0) + 𝜇₁ 𝛼 𝜙′(0) do      (function value is above the sufficient decrease line)
    𝛼 = 𝜌𝛼      (backtrack)
end while
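A minimal sketch, assuming NumPy, of the backtracking algorithm above. The quadratic test function used in the usage lines is only illustrative (the example function from the text is not reproduced here); 𝜙(𝛼) = 𝑓(𝑥_k + 𝛼𝑝_k) and 𝜙′(0) is the directional derivative at the start.

```python
import numpy as np

def backtracking(phi, phi0, dphi0, alpha_init, mu1=1e-4, rho=0.5, max_iter=50):
    alpha = alpha_init
    for _ in range(max_iter):
        # sufficient decrease (Eq. 4.31): phi(alpha) <= phi(0) + mu1 * alpha * phi'(0)
        if phi(alpha) <= phi0 + mu1 * alpha * dphi0:
            return alpha
        alpha *= rho                          # backtrack
    return alpha

# Usage with the starting point and direction from the example:
f = lambda x: x[0]**2 + 5.0 * x[1]**2         # illustrative function
grad = lambda x: np.array([2.0 * x[0], 10.0 * x[1]])
x, p = np.array([-1.25, 1.25]), np.array([4.0, 0.75])
phi = lambda a: f(x + a * p)
alpha = backtracking(phi, f(x), grad(x) @ p, alpha_init=1.2)
```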
Suppose we do a line search starting from 𝑥 = [−1.25, 1.25] in the direction 𝑝 = [4, 0.75], as shown in Fig. 4.23. Applying the backtracking algorithm with 𝜇₁ = 10⁻⁴ and 𝜌 = 0.7 produces the iterations shown in Fig. 4.24. The sufficient decrease line appears to be horizontal, but that is because the small negative slope cannot be discerned in a plot for typical values of 𝜇₁. Using a large initial step of 𝛼_init = 1.2 (Fig. 4.24, left), several iterations are required. For a small initial step of 𝛼_init = 0.05 (Fig. 4.24, right), the algorithm satisfies sufficient decrease at the first iteration but misses the opportunity for further reductions.

Fig. 4.23 A line search direction for an example problem.
By comparing the slope of the function at the candidate point with the slope at the start of the line search, we can get an idea of whether the function is “bottoming out”, or flattening, using the sufficient curvature condition:

$$|\phi'(\alpha)| \leq \mu_2\, |\phi'(0)| \, . \tag{4.32}$$

This condition requires that the magnitude of the slope at the new point be lower than the magnitude of the slope at the start of the line search by a factor of 𝜇₂, as shown in Fig. 4.25. This requirement is called the sufficient curvature condition because by comparing the two slopes, we quantify the function’s rate of change in the slope, that is, the curvature. Typical values of 𝜇₂ range from 0.1 to 0.9; the best value depends on the method for determining the search direction and is also problem dependent. As 𝜇₂ tends to zero, enforcing the sufficient curvature condition tends toward a point where 𝜙′(𝛼) = 0, which would yield an exact line search.

Fig. 4.25 The sufficient curvature condition requires the function slope magnitude to be a fraction of the initial slope.
The sign of the slope at a point satisfying this condition is not relevant; all that matters is that the function slope be shallow enough. The idea is that if the slope 𝜙′(𝛼) is still negative with a magnitude similar to the slope at the start of the line search, then the step is too small, and we expect the function to decrease even further by taking a larger step. If the slope 𝜙′(𝛼) is positive with a magnitude similar to that at the start of the line search, then the step is too large, and we expect to decrease the function further by taking a smaller step. On the other hand, when the slope is shallow enough (either positive or negative), we assume that the candidate point is near a local minimum, and additional effort yields only incremental benefits that are wasteful in the context of the larger problem.

The sufficient decrease and sufficient curvature conditions are collectively known as the strong Wolfe conditions. Figure 4.26 shows acceptable intervals that satisfy the strong Wolfe conditions, which are more restrictive than the sufficient decrease condition (Fig. 4.22).
Checking the sufficient curvature condition means we require derivative information (𝜙′), which we obtain from the gradient through the directional derivative. There are various line search algorithms in the literature, including some that are derivative-free. Here, we detail a line search algorithm based on the one developed by Moré and Thuente.⁷⁸‡ The algorithm has two phases:

1. The bracketing phase finds an interval within which we are certain to find a point that satisfies the strong Wolfe conditions.
2. The pinpointing phase finds a point that satisfies the strong Wolfe conditions within the interval provided by the bracketing phase.

The bracketing phase is given by Alg. 4.3 and illustrated in Fig. 4.28. For brevity, we use a notation in the following algorithms where, for example, 𝜙₀ ≡ 𝜙(0) and 𝜙_low ≡ 𝜙(𝛼_low). Overall, the bracketing algorithm increases the step size until it either finds an interval that must contain a point satisfying the strong Wolfe conditions or a point that already meets those conditions.

Fig. 4.27 If 𝜇₂ < 𝜇₁, there might be no point that satisfies the strong Wolfe conditions.

78. Moré and Thuente, Line search algorithms with guaranteed sufficient decrease, 1994.
79. Nocedal and Wright, Numerical Optimization, 2006.
‡ A similar algorithm is detailed in Chapter 3 of Nocedal and Wright.⁷⁹
We start the line search with a guess for the step size, which defines
the first interval. For a smooth continuous function, we are guaranteed
to have a minimum within an interval if either of the following holds:
1. The function value at the candidate step is higher than the value
at the start of the line search.
2. The step satisfies sufficient decrease, and the slope is positive.
These two scenarios are illustrated in the top two rows of Fig. 4.28. In
either case, we have an interval within which we can find a point that
satisfies the strong Wolfe conditions using the pinpointing algorithm.
The order of the arguments to the pinpoint function in Alg. 4.3 is significant
because this function assumes that the function value corresponding
to the first 𝛼 is the lower one. The third row in Fig. 4.28 illustrates the
scenario where the point satisfies the strong Wolfe conditions, in which
case the line search is finished.
If the point satisfies sufficient decrease and the slope at that point
is negative, we assume that there are better points farther along the
line, and the algorithm increases the step size. This larger step and the
previous one define a new interval that has moved away from the line
search starting point. We repeat the procedure and check the scenarios
for this new interval. To save function calls, bracketing should return
not just 𝛼 ∗ but also the corresponding function value and gradient to
the outer function.
Algorithm 4.3 Bracketing phase

Inputs:
  𝛼_init > 0: Initial step size
  𝜙₀, 𝜙′₀: Computed in the outer routine; passed in to save function calls
  0 < 𝜇₁ < 1: Sufficient decrease factor
  𝜇₁ < 𝜇₂ < 1: Sufficient curvature factor
  𝜎 > 1: Step size increase factor (e.g., 𝜎 = 2)
Outputs:
  𝛼*: Acceptable step size (satisfies the strong Wolfe conditions)
If the bracketing phase does not find a point that satisfies the strong Wolfe conditions, it finds an interval where we are guaranteed to find such a point in the pinpointing phase, described in Alg. 4.4.
The interval passed to the pinpointing phase has the following properties:

1. The interval has one or more points that satisfy the strong Wolfe conditions.
2. Among all the points generated so far that satisfy the sufficient decrease condition, 𝛼_low has the lowest function value.
3. The slope at 𝛼_low indicates that the function decreases from 𝛼_low toward 𝛼_high.

At each pinpointing iteration, a new candidate point 𝛼_p is chosen within the interval by interpolation, and four scenarios are possible. In the first scenario, 𝜙(𝛼_p) is either above the sufficient decrease line or greater than or equal to 𝜙(𝛼_low). In that scenario, 𝛼_p becomes the new 𝛼_high, and we have a new, smaller interval. In the second, third, and fourth scenarios, 𝜙(𝛼_p) is below the sufficient decrease line, and 𝜙(𝛼_p) < 𝜙(𝛼_low). In those scenarios, we check the value of the slope 𝜙′(𝛼_p). In the second and third scenarios, we choose the new interval based on the direction in which the slope predicts a local decrease. If the slope is shallow enough (fourth scenario), we have found a point that satisfies the strong Wolfe conditions.
Algorithm 4.4 Pinpointing phase

Inputs:
  𝛼_low: Interval endpoint with the lower function value
  𝛼_high: Interval endpoint with the higher function value
  𝜙₀, 𝜙_low, 𝜙_high, 𝜙′₀: Computed in the outer routine
  𝜙′_low, 𝜙′_high: One, if not both, computed previously
  0 < 𝜇₁ < 1: Sufficient decrease factor
  𝜇₁ < 𝜇₂ < 1: Sufficient curvature factor
Outputs:
  𝛼*: Step size satisfying the strong Wolfe conditions

𝑘 = 0
while true do
    Find 𝛼_p in the interval (𝛼_low, 𝛼_high)      (use interpolation, Section 4.3.3, based on 𝜙_low, 𝜙_high, and 𝜙′ from at least one endpoint)
    𝜙_p = 𝜙(𝛼_p)      (also evaluate 𝜙′_p if derivatives are available)
    if 𝜙_p > 𝜙₀ + 𝜇₁ 𝛼_p 𝜙′₀ or 𝜙_p > 𝜙_low then
        𝛼_high = 𝛼_p      (also update 𝜙_high = 𝜙_p and, for cubic interpolation, 𝜙′_high = 𝜙′_p)
    else
        𝜙′_p = 𝜙′(𝛼_p)      (if not already computed)
        if |𝜙′_p| ≤ −𝜇₂ 𝜙′₀ then
            𝛼* = 𝛼_p
            return 𝛼*
        else if 𝜙′_p (𝛼_high − 𝛼_low) ≥ 0 then
            𝛼_high = 𝛼_low
        end if
        𝛼_low = 𝛼_p
    end if
    𝑘 = 𝑘 + 1
end while
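A minimal sketch, assuming NumPy-style callables, of the bracketing and pinpointing phases of Algs. 4.3 and 4.4. Here `phi` and `dphi` evaluate the one-dimensional function and its slope along the search direction; simple bisection stands in for the interpolation of Section 4.3.3 to keep the sketch short, and the repeated evaluation of `phi(alpha_low)` could be cached in a more careful implementation.

```python
def pinpoint(phi, dphi, alpha_low, alpha_high, phi0, dphi0, mu1, mu2, max_iter=50):
    for _ in range(max_iter):
        alpha_p = 0.5 * (alpha_low + alpha_high)       # interpolation would go here
        phi_p = phi(alpha_p)
        if phi_p > phi0 + mu1 * alpha_p * dphi0 or phi_p > phi(alpha_low):
            alpha_high = alpha_p                        # shrink from the high side
        else:
            dphi_p = dphi(alpha_p)
            if abs(dphi_p) <= -mu2 * dphi0:             # strong Wolfe conditions satisfied
                return alpha_p
            elif dphi_p * (alpha_high - alpha_low) >= 0:
                alpha_high = alpha_low
            alpha_low = alpha_p
    return alpha_p

def bracketing(phi, dphi, alpha_init, mu1=1e-4, mu2=0.9, sigma=2.0, max_iter=50):
    phi0, dphi0 = phi(0.0), dphi(0.0)
    alpha1, alpha2 = 0.0, alpha_init
    first = True
    for _ in range(max_iter):
        phi2 = phi(alpha2)
        # candidate is above the sufficient decrease line (or above the previous value)
        if phi2 > phi0 + mu1 * alpha2 * dphi0 or (not first and phi2 > phi(alpha1)):
            return pinpoint(phi, dphi, alpha1, alpha2, phi0, dphi0, mu1, mu2)
        dphi2 = dphi(alpha2)
        if abs(dphi2) <= -mu2 * dphi0:                  # already satisfies strong Wolfe
            return alpha2
        elif dphi2 >= 0:                                # sufficient decrease but positive slope
            return pinpoint(phi, dphi, alpha2, alpha1, phi0, dphi0, mu1, mu2)
        alpha1, alpha2 = alpha2, sigma * alpha2         # move the interval forward
        first = False
    return alpha2
```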
In theory, the line search given in Alg. 4.3 followed by Alg. 4.4 is guaranteed to find a step length satisfying the strong Wolfe conditions. In practice, some additional considerations are needed for improved robustness and efficiency.
Let us perform the same line search as in the previous example, but using bracketing and pinpointing instead of backtracking (Alg. 4.2). In this example, we use quadratic interpolation, the bracketing phase uses a step size increase factor of 𝜎 = 2, and the sufficient curvature factor is 𝜇₂ = 0.9. Bracketing is achieved in the first iteration by using a large initial step of 𝛼_init = 1.2 (Fig. 4.30, left). Then pinpointing finds an improved point through interpolation. The small initial step of 𝛼_init = 0.05 (Fig. 4.30, right) does not satisfy the strong Wolfe conditions, and the bracketing phase moves forward toward a flatter part of the function. The result is a point that is much better than the one obtained with backtracking.
For quadratic interpolation, we use a quadratic model of the one-dimensional function,

$$\tilde{\phi}(\alpha) = c_0 + c_1 \alpha + c_2 \alpha^2 \, , \tag{4.33}$$

where we denote the interval endpoints with the more generic indices 1 and 2 because the formulas of this section do not depend on which one is lower or higher. Then, the boundary conditions at the endpoints are

$$\begin{aligned}
\phi(\alpha_1) &= c_0 + c_1 \alpha_1 + c_2 \alpha_1^2 \\
\phi(\alpha_2) &= c_0 + c_1 \alpha_2 + c_2 \alpha_2^2 \\
\phi'(\alpha_1) &= c_1 + 2 c_2 \alpha_1 \, .
\end{aligned} \tag{4.34}$$

We can use these three equations to find the three coefficients based on function and derivative values. Once we have the coefficients for the quadratic, we can find the minimum of the quadratic analytically by finding the point 𝛼* such that 𝜙̃′(𝛼*) = 0, which is 𝛼* = −𝑐₁/(2𝑐₂). Substituting the analytic solution for the coefficients as a function of the given values into this expression yields the final expression for the minimizer of the quadratic:

$$\alpha^* = \frac{2\alpha_1 \left[ \phi(\alpha_2) - \phi(\alpha_1) \right] + \phi'(\alpha_1)\left( \alpha_1^2 - \alpha_2^2 \right)}{2 \left[ \phi(\alpha_2) - \phi(\alpha_1) + \phi'(\alpha_1)(\alpha_1 - \alpha_2) \right]} \, . \tag{4.35}$$
There could be two valid solutions, but we are only interested in the minimum, for which the curvature is positive; that is, 𝜙̃″(𝛼*) = 2𝑐₂ + 6𝑐₃𝛼* > 0. Substituting the coefficients with the expressions obtained from solving the boundary condition equations and selecting the minimum solution yields

$$\alpha^* = \alpha_2 - (\alpha_2 - \alpha_1)\, \frac{\phi'(\alpha_2) + \beta_2 - \beta_1}{\phi'(\alpha_2) - \phi'(\alpha_1) + 2\beta_2} \, , \tag{4.38}$$

where

$$\begin{aligned}
\beta_1 &= \phi'(\alpha_1) + \phi'(\alpha_2) - 3\, \frac{\phi(\alpha_1) - \phi(\alpha_2)}{\alpha_1 - \alpha_2} \\
\beta_2 &= \operatorname{sign}(\alpha_2 - \alpha_1) \sqrt{\beta_1^2 - \phi'(\alpha_1)\, \phi'(\alpha_2)} \, .
\end{aligned} \tag{4.39}$$
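A minimal sketch of both interpolation formulas: the quadratic minimizer of Eq. 4.35 (two function values plus the slope at the first point) and the cubic minimizer of Eqs. 4.38–4.39 (function values and slopes at both points). The test values at the end are only a quick sanity check on 𝜙(𝛼) = (𝛼 − 1)², whose minimizer is 𝛼 = 1.

```python
from math import sqrt, copysign

def quadratic_min(a1, a2, phi1, phi2, dphi1):
    # Eq. 4.35
    num = 2.0 * a1 * (phi2 - phi1) + dphi1 * (a1**2 - a2**2)
    den = 2.0 * (phi2 - phi1 + dphi1 * (a1 - a2))
    return num / den

def cubic_min(a1, a2, phi1, phi2, dphi1, dphi2):
    # Eqs. 4.38-4.39
    beta1 = dphi1 + dphi2 - 3.0 * (phi1 - phi2) / (a1 - a2)
    beta2 = copysign(1.0, a2 - a1) * sqrt(beta1**2 - dphi1 * dphi2)
    return a2 - (a2 - a1) * (dphi2 + beta2 - beta1) / (dphi2 - dphi1 + 2.0 * beta2)

print(quadratic_min(0.0, 2.0, 1.0, 1.0, -2.0))   # 1.0
print(cubic_min(0.0, 2.0, 1.0, 1.0, -2.0, 2.0))  # 1.0
```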
The steepest-descent direction is simply the negative of the gradient,

$$p = -\nabla f \, . \tag{4.40}$$

One major issue with steepest descent is that, in general, the entries in the gradient and its overall scale can vary greatly depending on the magnitudes of the objective function and design variables. The gradient itself contains no information about an appropriate step length, and therefore the search direction is often better posed as a normalized direction,

$$p_k = -\frac{\nabla f_k}{\lVert \nabla f_k \rVert} \, . \tag{4.41}$$

Fig. 4.33 The steepest-descent direction points in the opposite direction of the gradient.
Algorithm 4.5 provides the complete steepest-descent procedure.

Algorithm 4.5 Steepest descent

Inputs:
  𝑥₀: Starting point
  𝜏: Convergence tolerance
Outputs:
  𝑥*: Optimal point
  𝑓(𝑥*): Minimum function value

For the line search at each iteration, an initial step length can be estimated from the previous step as

$$\alpha_k = \alpha_{k-1}\, \frac{\nabla f_{k-1}^\top p_{k-1}}{\nabla f_k^\top p_k} \, . \tag{4.43}$$
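A minimal sketch, assuming NumPy, of the steepest-descent procedure: the normalized direction of Eq. 4.41, a simple backtracking line search (a stand-in for the Wolfe line search described earlier), and the step-length guess of Eq. 4.43. The quadratic used in the usage lines is illustrative only.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, alpha_init=1.0, mu1=1e-4, rho=0.5, max_iter=2000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    alpha = alpha_init
    for _ in range(max_iter):
        if np.max(np.abs(g)) <= tol:            # infinity-norm convergence check
            break
        p = -g / np.linalg.norm(g)              # normalized steepest-descent direction (Eq. 4.41)
        phi0, dphi0 = f(x), g @ p
        while f(x + alpha * p) > phi0 + mu1 * alpha * dphi0:   # backtracking for sufficient decrease
            alpha *= rho
        x = x + alpha * p
        g_new = grad(x)
        if np.max(np.abs(g_new)) > tol:
            p_new = -g_new / np.linalg.norm(g_new)
            alpha = alpha * (g @ p) / (g_new @ p_new)          # step-length guess, Eq. 4.43
        g = g_new
    return x, f(x)

# Usage on an illustrative quadratic with different curvatures in each direction:
f = lambda x: x[0]**2 + 5.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 10.0 * x[1]])
xopt, fopt = steepest_descent(f, grad, [10.0, 1.0])
```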
where 𝛽 can be set to adjust the curvature in the 𝑥₂ direction. In Fig. 4.34, we show this function for 𝛽 = 1, 5, 15. The starting point is 𝑥₀ = (10, 1). When 𝛽 = 1 (left), this quadratic has the same curvature in all directions, and the steepest-descent direction points directly to the minimum. When 𝛽 > 1 (middle and right), this is no longer the case, and steepest descent shows abrupt changes in the subsequent search directions. This zigzagging is an inefficient way to approach the minimum. The higher the difference in curvature, the more iterations it takes.
If we perform an exact line search at each iteration, this means selecting the optimal value of 𝛼 along the line search:

$$\frac{\partial f(x_k + \alpha p_k)}{\partial \alpha} = 0
\;\Rightarrow\;
\frac{\partial f(x_{k+1})}{\partial \alpha} = 0
\;\Rightarrow\;
\frac{\partial f(x_{k+1})}{\partial x_{k+1}} \frac{\partial (x_k + \alpha p_k)}{\partial \alpha} = 0
\;\Rightarrow\;
\nabla f_{k+1}^\top p_k = 0
\;\Rightarrow\;
-p_{k+1}^\top p_k = 0 \, . \tag{4.44}$$
As discussed in the last section, exact line searches are not desirable, so the search directions are not orthogonal. However, the overall zigzagging behavior still exists.

We minimize the bean function,

$$f(x_1, x_2) = (1 - x_1)^2 + (1 - x_2)^2 + \frac{1}{2}\left( 2x_2 - x_1^2 \right)^2 ,$$

using the steepest-descent algorithm with an exact line search and a convergence tolerance of ‖∇𝑓‖∞ ≤ 10⁻⁶. The optimization path is shown in Fig. 4.36. Although it takes only a few iterations to get close to the minimum, it takes many more iterations to satisfy the convergence tolerance.
Tip 4.4 Scale the design variables and the objective function

$$\bar{f} = \frac{f}{s_f} \, , \tag{4.45}$$

where 𝑠_f is the scaling factor, which could be the value of the objective at the starting point, 𝑓(𝑥₀), or another typical value. Multiplying the function by a scalar does not change the optimal solution but can significantly improve the ability of the optimizer to find the optimum.

Scaling the design variables is more involved because scaling them changes the value that the optimizer would pass to the model and thus changes their meaning. In general, we might use different scaling factors for different types of variables, so we represent these as an 𝑛-vector, 𝑠_x. Starting with the physical design variables, 𝑥₀, we obtain the scaled variables by dividing them by the corresponding entries of 𝑠_x.
If a quantity spans several orders of magnitude, we can take the logarithm. For example, suppose the objective was expected to vary across multiple orders of magnitude. In that case, we could minimize log(𝑓) instead of minimizing 𝑓.*

* If 𝑓 can be negative, a transformation is required to ensure that the logarithm argument is always positive.

This heuristic still does not guarantee that the derivatives are well scaled, but it provides a reasonable starting point for further fine-tuning of the problem scaling. A scaling example is discussed in Ex. 4.19.

Sometimes, additional adjustment is needed if the objective is far less sensitive to some of the design variables than others (i.e., the entries in the gradient span various orders of magnitude). A more appropriate but more involved approach is to scale the variables and objective function such that the gradient elements have a similar magnitude (ideally of order 1). Achieving a well-scaled gradient sometimes requires adjusting inputs and outputs away from the earlier heuristic. Sometimes this occurs because the objective is much less sensitive to a particular variable.
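A minimal sketch, assuming NumPy, of the scaling wrapper this tip describes: the optimizer works with 𝑥̄ = 𝑥/𝑠_x and 𝑓̄ = 𝑓/𝑠_f, while the model is always evaluated with the physical variables 𝑥 = 𝑥̄·𝑠_x. The function names and the choices of 𝑠_x and 𝑠_f shown in the comments are placeholders, not values from the text.

```python
import numpy as np

def make_scaled(f, grad, s_x, s_f):
    s_x = np.asarray(s_x, dtype=float)
    f_bar = lambda xbar: f(xbar * s_x) / s_f
    # chain rule: d fbar / d xbar_i = (s_x_i / s_f) * d f / d x_i
    g_bar = lambda xbar: grad(xbar * s_x) * s_x / s_f
    return f_bar, g_bar

# Usage sketch: scale each variable by a typical magnitude and the objective by its starting value.
# s_x = np.abs(x0); s_f = f(x0)
# f_bar, g_bar = make_scaled(f, grad, s_x, s_f)
# xbar_star = optimize(f_bar, g_bar, x0 / s_x); x_star = xbar_star * s_x
```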
Fortunately, the eigenvector directions are not the only set of directions that can minimize the quadratic function in 𝑛 line searches. To find out which directions can achieve this, let us express the path from the origin to the minimum of the quadratic as a sequence of 𝑛 steps with directions 𝑝_i and lengths 𝛼_i,

$$x^* = \sum_{i=0}^{n-1} \alpha_i\, p_i \, . \tag{4.50}$$

Thus, we have represented the solution as a linear combination of 𝑛 vectors. Substituting this into the quadratic (Eq. 4.48), we get

$$f(x^*) = f\!\left( \sum_{i=0}^{n-1} \alpha_i p_i \right)
= \frac{1}{2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \alpha_i \alpha_j\, p_i^\top A\, p_j
- \sum_{i=0}^{n-1} \alpha_i\, b^\top p_i \, .$$

Fig. 4.39 For a quadratic function with the elliptical principal axes not aligned with the coordinate axes, more iterations are needed to find the minimum using a coordinate search.

Fig. 4.40 With suitable directions 𝑝₀ and 𝑝₁, we can converge to the minimum of the quadratic in two iterations.
Then, the double-sum term in Eq. 4.51 can be simplified to a single sum, and we can write

$$f(x^*) = \sum_{i=0}^{n-1} \left( \frac{1}{2} \alpha_i^2\, p_i^\top A\, p_i - \alpha_i\, b^\top p_i \right) . \tag{4.53}$$

Because each term in this sum involves only one direction 𝑝_i, we have reduced the original problem to a series of one-dimensional quadratic problems that can be minimized independently, one for each direction.
There are many possible sets of vectors that are conjugate with respect to 𝐴, including the eigenvectors. The conjugate gradient method finds these directions starting with the steepest-descent direction,

$$p_0 = -\nabla f(x_0) \, , \tag{4.55}$$

Fig. 4.41 By minimizing along a sequence of conjugate directions in turn, we can find the minimum of a quadratic in 𝑛 steps, where 𝑛 is the number of dimensions.

In practice, 𝛽 is reset whenever it becomes negative,

$$\beta \leftarrow \max(0, \beta) \, . \tag{4.60}$$
Inputs:
𝑥 0 : Starting point
𝜏: Convergence tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Minimum function value
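A minimal sketch, assuming NumPy, of a nonlinear conjugate gradient iteration. The Polak–Ribière formula is an assumption here (the 𝛽 update formula itself is not reproduced in this excerpt, only the reset of Eq. 4.60), a simple backtracking search stands in for the strong Wolfe line search used in the text, and the restart safeguard is a common practical addition.

```python
import numpy as np

def conjugate_gradient(f, grad, x0, tol=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    p = -g                                     # first direction: steepest descent (Eq. 4.55)
    for _ in range(max_iter):
        if np.max(np.abs(g)) <= tol:
            break
        if g @ p >= 0:                         # safeguard: restart if p is not a descent direction
            p = -g
        alpha, mu1, rho = 1.0, 1e-4, 0.5       # backtracking line search (sufficient decrease only)
        phi0, dphi0 = f(x), g @ p
        while f(x + alpha * p) > phi0 + mu1 * alpha * dphi0:
            alpha *= rho
        x = x + alpha * p
        g_new = grad(x)
        beta = max(0.0, (g_new @ (g_new - g)) / (g @ g))   # Polak-Ribiere with the reset of Eq. 4.60
        p = -g_new + beta * p
        g = g_new
    return x, f(x)
```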
Minimizing the same bean function from Ex. 4.11 with the same line search algorithm and settings, we get the optimization path shown in Fig. 4.43. The changes in direction for the conjugate gradient method are smaller than for steepest descent, and it takes fewer iterations to achieve the same convergence tolerance.

4.4.3 Newton's Method
For a one-dimensional function, the second-order Taylor series approximation about the current point 𝑥_k is

$$f(x_k + s) \approx f(x_k) + s f'(x_k) + \frac{1}{2} s^2 f''(x_k) \, . \tag{4.61}$$

We now include a second-order term to get a quadratic that we can minimize. We minimize this quadratic approximation by differentiating with respect to the step 𝑠 and setting the derivative to zero, which yields

$$f'(x_k) + s f''(x_k) = 0 \quad \Rightarrow \quad s = -\frac{f'(x_k)}{f''(x_k)} \, . \tag{4.62}$$

Thus, the Newton iteration for a one-dimensional function is

$$x_{k+1} = x_k - \frac{f'_k}{f''_k} \, . \tag{4.63}$$

We could also derive this equation by taking Newton's method for root finding (Eq. 3.24) and replacing 𝑟(𝑢) with 𝑓′(𝑥).
$$f(x) = (x - 2)^4 + 2x^2 - 4x + 4 \, .$$
$$\frac{\mathrm{d} f(x_k + s)}{\mathrm{d} s} = \nabla f_k + H_k s = 0 \, . \tag{4.65}$$

Thus, each Newton step is the solution of a linear system where the matrix is the Hessian,

$$H_k\, s_k = -\nabla f_k \, . \tag{4.66}$$

This linear system is analogous to the one used for solving nonlinear systems with Newton's method (Eq. 3.30), except that the Jacobian becomes the Hessian, the residual is the gradient, and the design variables replace the states. We can use any of the linear solvers mentioned in Section 3.6 and Appendix B to solve this system.
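A minimal sketch, assuming NumPy, of the Newton step of Eq. 4.66: solve 𝐻_k 𝑠_k = −∇𝑓_k and update the point. The quadratic in the usage lines is illustrative; in practice this step should be safeguarded with a line search or trust region, as discussed next.

```python
import numpy as np

def newton_step(grad, hess, x):
    g = grad(x)
    H = hess(x)
    s = np.linalg.solve(H, -g)      # H_k s_k = -grad f_k (Eq. 4.66)
    return x + s

# Usage on an illustrative quadratic f = 0.5 x^T A x - b^T x, minimized in one step:
A = np.array([[2.0, -1.0], [-1.0, 4.0]])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
hess = lambda x: A
x_star = newton_step(grad, hess, np.zeros(2))   # equals np.linalg.solve(A, b)
```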
When minimizing the quadratic function from Ex. 4.10, Newton's method converges in one step for any value of 𝛽, as shown in Fig. 4.45. Thus, Newton's method is scale invariant. Because the function is quadratic, the quadratic “approximation” from the Taylor series is exact, so we can find the minimum in one step. It will take more iterations for a general nonlinear function, but using curvature information generally yields a better search direction than using the gradient alone.
Fig. 4.46 Newton's method in its pure form is vulnerable to negative curvature (in which case it might step away from the minimum) and to overshooting (which might result in a function increase).
One common safeguard is to use a line search along the Newton direction, with 𝛼_init = 1 as the first guess for the step length. In this case, we
have a much better guess for 𝛼 compared with the steepest-descent
or conjugate gradient cases because this guess is based on the local
curvature. Even if the first step length given by the Newton step
overshoots, the line search would find a point with a lower function
value.
The trust-region methods in Section 4.5 address both of these issues
by minimizing the function approximation within a specified region
around the current iteration.
Another major issue with Newton’s method is that the Hessian can
be difficult or costly to compute. Even if available, the solution of the
linear system in Eq. 4.65 can be expensive. Both of these considerations
motivate the quasi-Newton methods, which we explain next.
Minimizing the same bean function from Exs. 4.11 and 4.12, we get the
optimization path shown in Fig. 4.47. Newton’s method takes fewer iterations
than steepest descent (Ex. 4.11) or conjugate gradient (Ex. 4.12) to achieve the
same convergence tolerance. The first quadratic approximation is a saddle
function that steps to the saddle point, away from the minimum of the function.
However, in subsequent iterations, the quadratic approximation becomes
convex, and the steps take us along the valley of the bean function toward the
minimum.
Evaluating the derivative of this approximation at the previous point, we find that 𝑓̃′_{k+1}(𝑥_k) = 𝑓′_k. Thus, the nature of this approximation is such that it matches the slope of the actual function at the last two points, as shown in Fig. 4.48.

Fig. 4.48 The quadratic approximation based on the secant method matches the slopes at the two last points and the function value at the last point.

In 𝑛 dimensions, things are more involved, but the principle is the same: use first-derivative information from the last two points to approximate second-derivative information. Instead of iterating along the 𝑥-axis as we would in one dimension, the optimization in 𝑛 dimensions follows a sequence of steps (as shown in Fig. 4.1) for the separate line searches. We have gradients at the endpoints of each step, so we can take the difference between the gradients at those points to get the curvature along that direction. The question is: How do we update the Hessian, which is expressed in the coordinate system of 𝑥, based on directional curvatures in directions that are not necessarily aligned with the coordinate system?

Quasi-Newton methods use the quadratic approximation of the objective function,

$$\tilde{f}(x_k + p) = f_k + \nabla f_k^\top p + \frac{1}{2}\, p^\top \tilde{H}_k\, p \, , \tag{4.70}$$
We solve this linear system for 𝑝_k, but instead of accepting it as the final step, we perform a line search in the 𝑝_k direction. Only after finding a step size 𝛼_k that satisfies the strong Wolfe conditions do we update the point using

$$x_{k+1} = x_k + \alpha_k\, p_k \, . \tag{4.72}$$

Quasi-Newton methods update the approximate Hessian at every iteration based on the latest information using an update of the form

$$\tilde{H}_{k+1} = \tilde{H}_k + \Delta \tilde{H}_k \, , \tag{4.73}$$

where the update Δ𝐻̃_k is a function of the last two gradients. The first Hessian approximation is usually set to the identity matrix (or a scaled version of it), which yields a steepest-descent direction for the first line search (set 𝐻̃ = 𝐼 in Eq. 4.71 to verify this).
The step between consecutive iterations is

$$s_k = x_{k+1} - x_k = \alpha_k\, p_k \, . \tag{4.78}$$
Recall the meaning of the product of the Hessian with a vector (Eq. 4.12): it is the rate of change of the gradient in the direction defined by that vector. Thus, it makes sense that the rate of change of the curvature predicted by the approximate Hessian should match the difference between the gradients.‡

‡ The secant equation is also known as the quasi-Newton condition.
We need 𝐻̃ to be positive definite. Using the secant equation (Eq. 4.80) and the definition of positive definiteness (𝑠⊤𝐻𝑠 > 0), we see that this requirement implies that the predicted curvature is positive along the step; that is,

$$s_k^\top y_k > 0 \, . \tag{4.81}$$

This is called the curvature condition, and it is automatically satisfied if the line search finds a step that satisfies the strong Wolfe conditions.
The secant equation (Eq. 4.80) is a linear system of 𝑛 equations where the step and the gradients are known. However, there are 𝑛(𝑛 + 1)/2 unknowns in the approximate Hessian matrix (recall that it is symmetric), so this equation is not sufficient to determine the elements of 𝐻̃. The requirement of positive definiteness adds one more equation, but those are not enough to determine all the unknowns, leaving us with an infinite number of possibilities for 𝐻̃.

To find a unique 𝐻̃_{k+1}, we rationalize that among all the matrices that satisfy the secant equation (Eq. 4.80), 𝐻̃_{k+1} should be the one closest to the previous approximate Hessian, 𝐻̃_k. This makes sense intuitively because the curvature information gathered in one step is limited (because it is along a single direction) and should not change the Hessian approximation more than necessary to satisfy the requirements.

The original quasi-Newton update, known as DFP, was first proposed by Davidon and then refined by Fletcher and also Powell (see the historical note in Section 2.3).²⁰,²¹ The DFP update formula has been superseded by the BFGS formula, which was independently developed by Broyden, Fletcher, Goldfarb, and Shanno.⁸⁰⁻⁸³ BFGS is currently considered the most effective quasi-Newton update, so we focus on this update. However, Appendix C.2.1 has more details on DFP.

20. Davidon, Variable metric method for minimization, 1991.
21. Fletcher and Powell, A rapidly convergent descent method for minimization, 1963.
80. Broyden, The convergence of a class of double-rank minimization algorithms 1. General considerations, 1970.
81. Fletcher, A new approach to variable metric algorithms, 1970.
82. Goldfarb, A family of variable-metric methods derived by variational means, 1970.
83. Shanno, Conditioning of quasi-Newton methods for function minimization, 1970.

The formal derivation of the BFGS update formula is rather involved, so we do not include it here. Instead, we work through an informal derivation that provides intuition about this update and quasi-Newton methods in general. We also include more details in Appendix C.2.2.

Recall that quasi-Newton methods add an update to the previous Hessian approximation (Eq. 4.73). One way to think about an update that yields a matrix close to the previous one is to consider the rank of the update, Δ𝐻̃. The lower the rank of the update, the closer the updated matrix is to the previous one. Also, the curvature information contained in this update is minimal because we are only gathering information along a single direction. The lowest-rank update is the outer product of a vector with itself,
as shown in Fig. 4.50. Matrices resulting from vector outer products have rank 1 because all the columns are linearly dependent.

Fig. 4.50 The self outer product of a vector produces a symmetric (𝑛 × 𝑛) matrix of rank 1.

With two linearly independent vectors (𝑢 and 𝑣), we can get a rank 2 update using

$$\tilde{H}_{k+1} = \tilde{H}_k + \alpha\, u u^\top + \beta\, v v^\top \, , \tag{4.82}$$

where 𝛼 and 𝛽 are scalar coefficients. Substituting this into the secant equation (Eq. 4.80) and choosing 𝑢 = 𝑦_k and 𝑣 = 𝐻̃_k 𝑠_k, we can write

$$y_k \left( \alpha\, y_k^\top s_k - 1 \right) + \tilde{H}_k s_k \left( 1 + \beta\, s_k^\top \tilde{H}_k s_k \right) = 0 \, .$$

Because the vectors 𝑦_k and 𝐻̃_k 𝑠_k are not parallel in general (because the secant equation applies to 𝐻̃_{k+1}, not to 𝐻̃_k), the only way to guarantee this equality is to set the terms in parentheses to zero. Thus, the scalar coefficients are

$$\alpha = \frac{1}{y_k^\top s_k} \, , \qquad \beta = -\frac{1}{s_k^\top \tilde{H}_k s_k} \, . \tag{4.86}$$
Substituting these coefficients and the chosen vectors back into Eq. 4.82, we get the BFGS update,

$$\tilde{H}_{k+1} = \tilde{H}_k + \frac{y_k y_k^\top}{y_k^\top s_k} - \frac{\tilde{H}_k s_k s_k^\top \tilde{H}_k}{s_k^\top \tilde{H}_k s_k} \, . \tag{4.87}$$
Although we did not explicitly enforce positive definiteness, the rank 2 update is positive definite, and therefore all the Hessian approximations are positive definite, as long as we start with a positive-definite approximation.

Now recall that we want to solve the linear system that involves this matrix (Eq. 4.71), so it would be more efficient to approximate the inverse of the Hessian directly instead. The inverse can be found analytically from the update (Eq. 4.87) using the Sherman–Morrison–Woodbury formula.§ Defining 𝑉̃ as the approximation of the inverse of the Hessian, the final result is

$$\tilde{V}_{k+1} = \left( I - \sigma_k s_k y_k^\top \right) \tilde{V}_k \left( I - \sigma_k y_k s_k^\top \right) + \sigma_k s_k s_k^\top \, , \tag{4.88}$$

where

$$\sigma_k = \frac{1}{y_k^\top s_k} \, . \tag{4.89}$$

Figure 4.51 shows the sizes of the vectors and matrices involved in this equation.

§ This formula is also known as the Woodbury matrix identity. Given a matrix and an update to that matrix, it yields an explicit expression for the inverse of the updated matrix in terms of the inverses of the matrix and the update (see Appendix C.3).

Fig. 4.51 Sizes of each term of the BFGS update (Eq. 4.88).

Now we can replace the potentially costly solution of the linear system (Eq. 4.71) with the much cheaper matrix-vector product,

$$p_k = -\tilde{V}_k \nabla f_k \, , \tag{4.90}$$
The line search should use 𝛼_init = 1 so that a full (quasi-)Newton step can be accepted (see Tip 4.5).

As discussed previously, we need to start with a positive-definite estimate to maintain a positive-definite inverse Hessian. Typically, this is the identity matrix or a weighted identity matrix, for example,

$$\tilde{V}_0 = \frac{1}{\lVert \nabla f_0 \rVert}\, I \, . \tag{4.91}$$

This yields a normalized steepest-descent direction for the first iteration,

$$p_0 = -\tilde{V}_0 \nabla f_0 = -\frac{\nabla f_0}{\lVert \nabla f_0 \rVert} \, . \tag{4.92}$$
Inputs:
𝑥 0 : Starting point
𝜏: Convergence tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Minimum function value
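A minimal sketch, assuming NumPy, of BFGS with the inverse update of Eq. 4.88, the scaled identity start of Eq. 4.91, and a backtracking line search starting from 𝛼_init = 1. The backtracking search and the small curvature threshold are simplifications; the text assumes a strong Wolfe line search, which guarantees the curvature condition.

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    n = x.size
    V = np.eye(n) / np.linalg.norm(g)           # Eq. 4.91
    I = np.eye(n)
    for _ in range(max_iter):
        if np.max(np.abs(g)) <= tol:
            break
        p = -V @ g                               # Eq. 4.90
        alpha, mu1, rho = 1.0, 1e-4, 0.5         # backtracking from the full quasi-Newton step
        phi0, dphi0 = f(x), g @ p
        while f(x + alpha * p) > phi0 + mu1 * alpha * dphi0:
            alpha *= rho
        x_new = x + alpha * p
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-14:                        # curvature condition (Eq. 4.81)
            sigma = 1.0 / (y @ s)                # Eq. 4.89
            V = (I - sigma * np.outer(s, y)) @ V @ (I - sigma * np.outer(y, s)) \
                + sigma * np.outer(s, s)         # Eq. 4.88
        x, g = x_new, g_new
    return x, f(x)
```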
Minimizing the same bean function from previous examples using BFGS, we get the optimization path shown in Fig. 4.52. We also show the corresponding quadratic approximations for a few selected steps of this minimization in Fig. 4.53. Because we generate approximations to the inverse of the Hessian, we invert those approximations to get the Hessian approximation for the purpose of illustration.

We initialize the inverse Hessian to the identity matrix, which results in a quadratic with circular contours and a steepest-descent step (Fig. 4.53, left).

Fig. 4.52 BFGS optimization path.

Later in the optimization, the inverse Hessian approximation is

$$\tilde{V}_2 = \begin{bmatrix} 0.435747 & -0.202020 \\ -0.202020 & 0.222556 \end{bmatrix} .$$

The predicted curvature improves, and it results in a good step toward the minimum, as shown in the middle plot of Fig. 4.53. The one-dimensional slice reveals how the approximation curvature in the line search direction is higher than the actual curvature; however, the line search moves past the approximation minimum toward the true minimum.

By the end of the optimization, at 𝑥* = (1.213412, 0.824123), the BFGS estimate is

$$\tilde{V}^* = \begin{bmatrix} 0.276946 & 0.224010 \\ 0.224010 & 0.347847 \end{bmatrix} .$$
where

$$\sigma = \frac{1}{s^\top y} \, . \tag{4.94}$$

If we save the sequence of 𝑠 and 𝑦 vectors and specify a starting value for 𝑉̃₀, we can compute any subsequent 𝑉̃_k. Of course, what we want is 𝑉̃_k ∇𝑓_k, which we can also compute using an algorithm with the recurrence relationship. However, such an algorithm would not be advantageous from the memory-usage perspective because we would have to store a long sequence of vectors and a starting matrix.

To reduce the memory usage, we do not store the entire history of vectors. Instead, we limit the storage to the last 𝑚 vectors for 𝑠 and 𝑦. In practice, 𝑚 is usually between 5 and 20. Next, we make the starting Hessian diagonal such that we only require vector storage (or scalar storage if we make all entries in the diagonal equal). A common choice is to use a scaled identity matrix, which just requires storing one number,

$$\tilde{V}_0 = \frac{s^\top y}{y^\top y}\, I \, , \tag{4.95}$$

where 𝑠 and 𝑦 correspond to the previous iteration. Algorithm 4.8 details the procedure.
Algorithm 4.8 L-BFGS search direction

Inputs:
  ∇𝑓_k: Gradient at point 𝑥_k
  𝑠_{k−1,...,k−m}: History of steps 𝑥_k − 𝑥_{k−1}
  𝑦_{k−1,...,k−m}: History of gradient differences ∇𝑓_k − ∇𝑓_{k−1}
Outputs:
  𝑝: Search direction −𝑉̃_k ∇𝑓_k

𝑑 = ∇𝑓_k
for 𝑖 = 𝑘 − 1 to 𝑘 − 𝑚 by −1 do
    𝛼_i = 𝜎_i 𝑠_i⊤ 𝑑
    𝑑 = 𝑑 − 𝛼_i 𝑦_i
end for
𝑉̃₀ = (𝑠_{k−1}⊤ 𝑦_{k−1}) / (𝑦_{k−1}⊤ 𝑦_{k−1}) 𝐼      (initialize the inverse Hessian approximation as a scaled identity matrix)
𝑑 = 𝑉̃₀ 𝑑
for 𝑖 = 𝑘 − 𝑚 to 𝑘 − 1 do
    𝛽_i = 𝜎_i 𝑦_i⊤ 𝑑
    𝑑 = 𝑑 + (𝛼_i − 𝛽_i) 𝑠_i
end for
𝑝 = −𝑑
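A minimal sketch, assuming NumPy, of the two-loop recursion in Alg. 4.8. The lists `s_hist` and `y_hist` hold the 𝑚 most recent steps and gradient differences, newest last; the steepest-descent fallback for an empty history is an assumption for the first iteration.

```python
import numpy as np

def lbfgs_direction(g, s_hist, y_hist):
    if not s_hist:                               # no history yet: steepest descent
        return -g
    d = g.copy()
    alphas = []
    for s, y in zip(reversed(s_hist), reversed(y_hist)):   # first loop: newest to oldest
        sigma = 1.0 / (y @ s)
        a = sigma * (s @ d)
        d = d - a * y
        alphas.append(a)
    s, y = s_hist[-1], y_hist[-1]                # scaled identity initialization (Eq. 4.95)
    d = (s @ y) / (y @ y) * d
    for (s, y), a in zip(zip(s_hist, y_hist), reversed(alphas)):   # second loop: oldest to newest
        sigma = 1.0 / (y @ s)
        b = sigma * (y @ d)
        d = d + (a - b) * s
    return -d                                    # search direction p = -V_k grad f_k
```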
Example 4.16 L-BFGS compared with BFGS for the bean function

Minimizing the same bean function from the previous examples, the optimization iterations using BFGS and L-BFGS are the same, as shown in Fig. 4.54. The L-BFGS method is applied to the same sequence using the last five iterations. The number of variables is too small to benefit from the limited-memory approach, but we show it in this small problem as an example. At the same 𝑥* as in Ex. 4.15, the product 𝑉̃∇𝑓 is estimated using Alg. 4.8 as

$$d^* = \begin{bmatrix} -7.38683 \times 10^{-5} \\ \phantom{-}5.75370 \times 10^{-5} \end{bmatrix} ,$$

whereas the exact value is

$$\tilde{V}^* \nabla f^* = \begin{bmatrix} -7.49228 \times 10^{-5} \\ \phantom{-}5.90441 \times 10^{-5} \end{bmatrix} .$$

Fig. 4.54 Optimization paths using BFGS and L-BFGS.
Example 4.17 Minimizing the total potential energy for a spring system
The contours of this function are shown in Fig. 4.56 for the case where
𝑙1 = 12, 𝑙2 = 8, 𝑘1 = 1, 𝑘2 = 10, 𝑚 𝑔 = 7. There is a minimum and a maximum.
The minimum represents the position of the mass at the stable equilibrium
condition. The maximum also represents an equilibrium point, but it is unstable.
All methods converge to the minimum when starting near the maximum. All
four methods use the same parameters, convergence tolerance, and starting
point. Depending on the starting point, Newton’s method can become stuck at
the saddle point, and if a line search is not added to safeguard it, it could have
terminated at the maximum instead.
As expected, steepest descent is the least efficient, and the second-order
methods are the most efficient. The number of iterations and the relative
Fig. 4.55 Two-spring system with no applied force (top) and with applied force (bottom).
Fig. 4.56 Minimizing the total potential for the two-spring system: steepest descent (32 iterations), conjugate gradient (27 iterations), quasi-Newton (14 iterations), and Newton (12 iterations).
Fig. 4.57 Optimization paths for the Rosenbrock function using steepest descent (10,662 iterations), conjugate gradient (930 iterations), BFGS (36 iterations), and Newton (24 iterations).
Fig. 4.58 Convergence of the four methods shows the dramatic difference between the linear convergence of steepest descent, the superlinear convergence of the conjugate gradient method, and the quadratic convergence of the methods that use second-order information.
the convergence criterion. The methods that use second-order information are
even more efficient, exhibiting quadratic convergence in the last few iterations.
Let us attempt to minimize this function starting from 𝑥₀ = [−5000, −3]. The gradient at this starting point is ∇𝑓(𝑥₀) = [−0.0653, −650.0], so the slope in the 𝑥₂ direction is four orders of magnitude larger than the slope in the 𝑥₁ direction! Therefore, there is a significant bias toward moving along the 𝑥₂ direction but little incentive to move in the 𝑥₁ direction. After an exact line search in the steepest-descent direction, we obtain the step to 𝑥_A = [−5000, 0.25], as shown in Fig. 4.59. The optimization stops at this point, even though it is not a minimum. This premature convergence is because 𝜕𝑓/𝜕𝑥₁ is orders of magnitude smaller, so both components of the gradient satisfy the optimality conditions when using a standard relative tolerance.

Fig. 4.59 The contours of the scaled Rosenbrock function (Eq. 4.96) are highly stretched in the 𝑥₁ direction, by orders of magnitude more than what we can show here.
To address this issue, we scale the design variables as explained in Tip 4.4. Using the scaling 𝑠_x = [10⁴, 1], the scaled starting point becomes 𝑥̄₀ = [−5000/10⁴, −3/1] = [−0.5, −3]. Before evaluating the function, we need to convert the design variables back to their unscaled values; that is, we evaluate 𝑓(𝑥̄ 𝑠_x), where the multiplication is element by element.
This scaling of the design variables alone is sufficient to improve the
optimization convergence. Still, let us also scale the objective because it is
large at our starting point (around 900). Dividing the objective by 𝑠 𝑓 = 1000,
the initial gradient becomes ∇ 𝑓 (𝑥0 ) = [−0.00206, −0.6]. This is still not ideally
scaled, but it has much less variation in orders of magnitude—more than
sufficient to solve the problem successfully. The optimizer returns 𝑥¯ ∗ = [1, 1],
where 𝑓¯∗ = 1.57 × 10−12 . When rescaled back to the problem coordinates,
𝑥 ∗ = [104 , 1], 𝑓 ∗ = 1.57 × 10−9 .
In this example, the function derivatives span many orders of magnitude,
so dividing the function by a scalar does not have much effect. Instead, we
could minimize log( 𝑓 ), which allows us to solve the problem even without
scaling 𝑥. If we also scale 𝑥, the number of required iterations for convergence
decreases. Using log( 𝑓 ) as the objective and scaling the design variables as
before yields 𝑥¯ ∗ = [1, 1], where 𝑓¯∗ = −25.28, which in the original problem
space corresponds to 𝑥 ∗ = [104 , 1], where 𝑓 ∗ = 1.05 × 10−11 .
Although this example does not correspond to a physical problem, such
differences in scaling occur frequently in engineering analysis. For example,
optimizing the operating point of a propeller might involve two variables: the
pitch angle and the rotation rate. The angle would typically be specified in
radians (a quantity of order 1) and the rotation rate in rotations per minute
(typically tens of thousands).
where the circles show the trust regions for each iteration. The trust-region subproblem solved at each iteration is

$$\begin{aligned}
\underset{s}{\text{minimize}} \quad & \tilde{f}(s) \\
\text{subject to} \quad & \lVert s \rVert \leq \Delta \, ,
\end{aligned} \tag{4.97}$$

where 𝑓̃(𝑠) is the local trust-region model, 𝑠 is the step from the current iteration point, and Δ is the size of the trust region. We use 𝑠 instead of 𝑝 to indicate that this is a step vector and not simply the direction vector used in methods based on a line search.

Fig. 4.60 Trust-region methods minimize a model within a trust region for each iteration, and then they update the trust-region size and the model before the next iteration.
The subproblem (Eq. 4.97) defines the trust region using a norm. The Euclidean norm, ‖𝑠‖₂, defines a spherical trust region and is the most common choice. Sometimes ∞-norms are used instead because they are easy to apply, but 1-norms are rarely used because they are just as complex as 2-norms but introduce sharp corners that can be problematic.⁸⁴ The shape of the trust region is dictated by the norm (see Fig. A.8) and can significantly affect the convergence rate. The ideal trust-region shape depends on the local function space, and some algorithms allow the trust-region shape to change throughout the optimization.

84. Conn et al., Trust Region Methods, 2000.
$$\begin{aligned}
\underset{s}{\text{minimize}} \quad & \tilde{f}(s) = f_k + \nabla f_k^\top s + \frac{1}{2}\, s^\top \tilde{H}_k\, s \\
\text{subject to} \quad & \lVert s \rVert_2 \leq \Delta_k \, ,
\end{aligned} \tag{4.98}$$

$$r = \frac{f(x) - f(x + s)}{\tilde{f}(0) - \tilde{f}(s)} \, . \tag{4.99}$$
Inputs:
𝑥 0 : Starting point
Δ0 : Initial size of the trust region
Outputs:
𝑥 ∗ : Optimal point
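A minimal sketch, assuming NumPy, of the trust-region acceptance and radius update based on the ratio 𝑟 of Eq. 4.99. The subproblem is solved only approximately here with the Cauchy point (the model minimizer along the steepest-descent direction within the region), and the threshold values (0.25, 0.75, the expansion factor, and the acceptance tolerance) are common choices, not values taken from the text.

```python
import numpy as np

def cauchy_point(g, H, delta):
    gHg = g @ H @ g
    tau = 1.0 if gHg <= 0 else min(1.0, np.linalg.norm(g)**3 / (delta * gHg))
    return -tau * delta / np.linalg.norm(g) * g

def trust_region(f, grad, hess, x0, delta0=0.3, delta_max=1.5, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    delta = delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.max(np.abs(g)) <= tol:
            break
        s = cauchy_point(g, H, delta)
        model_drop = -(g @ s + 0.5 * s @ H @ s)           # f_tilde(0) - f_tilde(s)
        r = (f(x) - f(x + s)) / model_drop                # Eq. 4.99
        if r < 0.25:
            delta *= 0.25                                  # poor model: shrink the region
        elif r > 0.75 and np.isclose(np.linalg.norm(s), delta):
            delta = min(2.0 * delta, delta_max)            # good model at the boundary: expand
        if r > 1e-4:                                       # accept the step
            x = x + s
    return x, f(x)
```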
too long to reduce the trust region to an acceptable size over other portions of the design space where a smaller trust region is needed. The same convergence criteria used in other gradient-based methods are applicable.*

* Conn et al.⁸⁴ provide more detail on trust-region problems, including trust-region norms and scaling, approaches to solving the trust-region subproblem, extensions to the model, and other important practical considerations.

Example 4.20 Trust-region method applied to the total potential energy of the spring system
Minimizing the total potential energy function from Ex. 4.17 using a trust-
region method starting from the same points as before yields the optimization
path shown in Fig. 4.63. The initial trust region size is Δ = 0.3, and the
maximum allowable is Δmax = 1.5.
The first few quadratic approximations do not have a minimum because the function has negative curvature around the starting point, but the trust region prevents steps that are too large. When it gets close enough to the bowl containing the minimum, the quadratic approximation has a minimum, and the trust-region subproblem yields a minimum within the trust region. In the last few iterations, the quadratic is a good model, and therefore the region remains large.

Fig. 4.63 Minimizing the total potential for the two-spring system using a trust-region method, shown at different iterations. The local quadratic approximation is overlaid on the function contours, and the trust region is shown as a red circle.
For example, the absolute value function can be smoothed as

$$f(x) =
\begin{cases}
|x| & \text{if } |x| > \Delta x \\[0.5ex]
\dfrac{x^2}{2\Delta x} + \dfrac{\Delta x}{2} & \text{otherwise} \, ,
\end{cases} \tag{4.100}$$

where Δ𝑥 is a user-adjustable parameter representing the half-width of the transition.

Fig. 4.65 Smoothed absolute value function.
Piecewise functions are often used in fits to empirical data. Cubic splines or a sigmoid function can blend the transition between two functions smoothly. We can also use the same technique to blend discrete steps (where the two functions are constant values) or to implement smooth max or min functions.† For example, a sigmoid can be used to blend two functions, 𝑓₁(𝑥) and 𝑓₂(𝑥), together at a transition point 𝑥_t using

$$f(x) = f_1(x) + \frac{1}{1 + e^{-h(x - x_t)}} \left[ f_2(x) - f_1(x) \right] , \tag{4.101}$$

where ℎ is a user-selected parameter that controls how sharply the transition occurs. The left side of Fig. 4.66 shows an example transitioning between 𝑥 and 𝑥² with 𝑥_t = 0 and ℎ = 50.

† Another option to smooth the max of multiple functions is aggregation, which is detailed in Section 5.7.
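A minimal sketch, assuming NumPy, of the two smoothing techniques described here: the smoothed absolute value of Eq. 4.100 and the sigmoid blend of Eq. 4.101, with the same transition parameters as the example in the text.

```python
import numpy as np

def smooth_abs(x, dx=0.1):
    # Eq. 4.100: quadratic within +/- dx of the origin, |x| outside
    return np.where(np.abs(x) > dx, np.abs(x), x**2 / (2.0 * dx) + dx / 2.0)

def sigmoid_blend(f1, f2, x, x_t=0.0, h=50.0):
    # Eq. 4.101: transition from f1 to f2 around x_t, with sharpness controlled by h
    s = 1.0 / (1.0 + np.exp(-h * (x - x_t)))
    return f1(x) + s * (f2(x) - f1(x))

x = np.linspace(-0.5, 0.5, 101)
y1 = smooth_abs(x, dx=0.1)
y2 = sigmoid_blend(lambda x: x, lambda x: x**2, x)   # blend x and x^2 as in the text's example
```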
Another approach is to use a cubic spline for the blending. Given a transition point 𝑥_t and a half-width Δ𝑥, we can compute a cubic spline transition as

$$f(x) =
\begin{cases}
f_1(x) & \text{if } x < x_1 \\
f_2(x) & \text{if } x > x_2 \\
c_1 x^3 + c_2 x^2 + c_3 x + c_4 & \text{otherwise} \, .
\end{cases} \tag{4.102}$$
Tip 4.8 Gradient-based optimization can find the global optimum

Gradient-based methods are local search methods. If the design space is fundamentally multimodal, it may be helpful to augment the gradient-based search with a global search. The simplest and most common approach is to use a multistart approach, where we run a gradient-based search multiple times, starting from different points, as shown in Fig. 4.67. The starting points might be chosen from engineering intuition, randomly generated points, or sampling methods. Running multiple starts increases the chance of finding a better optimum than would be found with a single starting point. One advantage of this approach is that it can easily be run in parallel.
Another approach is to start with a global search strategy (see Chapter 7).
After a suitable initial exploration, the design(s) given by the global search
become starting points for gradient-based optimization(s). This finds points
that satisfy the optimality conditions, which is typically challenging with a
pure gradient-free approach. It also improves the convergence rate and finds
optima more precisely.
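A minimal multistart sketch in Python using SciPy's local optimizer; the test function, random starting points, and settings are ours for illustration:

import numpy as np
from scipy.optimize import minimize

def multistart(fun, starts):
    # Run a local gradient-based search from each starting point and keep the best result
    results = [minimize(fun, x0, method="BFGS") for x0 in starts]
    return min(results, key=lambda r: r.fun)

# Example: a simple multimodal function with several local minima
fun = lambda x: np.sin(3 * x[0]) + 0.1 * x[0]**2 + np.cos(2 * x[1]) + 0.1 * x[1]**2
rng = np.random.default_rng(0)
starts = rng.uniform(-3, 3, size=(10, 2))   # random starting points
best = multistart(fun, starts)
print(best.x, best.fun)

Because each local search is independent, the loop over starting points can be distributed across processes with no change to the algorithm.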
4.6 Summary
Problems
4.4 Review Kepler’s wine barrel story from Section 2.2. Approximate
the barrel as a cylinder and find the height and diameter of a
barrel that maximizes its volume for a diagonal measurement of
1 m.
4.6 Consider a slightly modified version of the function from Prob. 4.5,
    f = x1⁴ + 3x1³ + 3x2² − 6x1x2 − 2x2 ,
    where we add an x2⁴ term. Can you find the critical points analytically? Plot the function contours. Locate the critical points graphically and classify them.
4.7 Implement the two line search algorithms from Section 4.3, such
that they work in 𝑛 dimensions (𝑥 and 𝑝 can be vectors of any
size).
a. As a first test for your code, reproduce the results from the
examples in Section 4.3 and plot the function and iterations
for both algorithms. For the line search that satisfies the
strong Wolfe conditions, reduce the value of 𝜇2 until you get
an exact line search. How much accuracy can you achieve?
a. For your first test problem, reproduce the results from the
examples in Section 4.4.
b. Minimize the two-dimensional Rosenbrock function (see
Appendix D.1.2) using the various algorithms and compare
your results starting from 𝑥 = (−1, 2). Compare the total
number of evaluations. Compare the number of minor
4.12 The brachistochrone problem seeks to find the path that minimizes travel time between two points for a particle under the force of gravity.∗ Solve the discretized version of this problem using an optimizer of your choice (see Appendix D.1.7 for a detailed description).
∗ This problem was mentioned in Section 2.2 as one of the problems that inspired developments in calculus of variations.
Constrained Gradient-Based Optimization
5
minimize    f(x)
by varying  x_i ,  i = 1, …, n_x
subject to  g_j(x) ≤ 0 ,  j = 1, …, n_g   (5.1)
            h_l(x) = 0 ,  l = 1, …, n_h
            x̲_i ≤ x_i ≤ x̄_i ,  i = 1, …, n_x ,
by inspection. We can visualize the contours for this problem because the functions can be evaluated quickly and because it has only two dimensions. If the functions were more expensive, we would not be able to afford the many evaluations needed to plot the contours. If the problem had more dimensions, it would become difficult or impossible to visualize the functions and feasible space fully.
[Fig. 5.1: Graphical solution for the constrained problem showing contours of the objective, the two constraint curves, and the shaded infeasible region.]
J_h = ∂h/∂x = [ ∂h_1/∂x_1  ⋯  ∂h_1/∂x_{n_x} ;  ⋮  ⋱  ⋮ ;  ∂h_{n_h}/∂x_1  ⋯  ∂h_{n_h}/∂x_{n_x} ] = [ ∇h_1ᵀ ; ⋮ ; ∇h_{n_h}ᵀ ] ,   (5.2)
which is an (n_h × n_x) matrix.
There are several essential linear algebra concepts for constrained optimization.86,87 The span of a set of vectors is the space formed by all the points that can be obtained by a linear combination of those vectors. With one vector, this space is a line; with two linearly independent vectors, this space is a two-dimensional plane (see Fig. 5.2); and so on. With n linearly independent vectors, we can obtain any point in n-dimensional space.
86. Boyd and Vandenberghe, Convex Optimization, 2004.
87. Strang, Linear Algebra and its Applications, 2006.
[Fig. 5.2: The span of one and two vectors. Fig. 5.4: Hyperplanes and half-spaces in two and three dimensions.]
[Fig. 5.5: The gradient of a function defines the hyperplane tangent to the function isosurface.]
[Fig. 5.6: Polyhedral cones in two and three dimensions (points of the form αu + βv with α, β ≥ 0).]
𝑓 (𝑥 ∗ + 𝑝) ≥ 𝑓 (𝑥 ∗ ) . (5.4)
Given the Taylor series expansion (Eq. 5.3), the only way that this
inequality can be satisfied is if
∇ 𝑓 (𝑥 ∗ )| 𝑝 ≥ 0 . (5.5)
J_h(x) p = 0 .   (5.8)
This equation states that any feasible direction has to lie in the nullspace of the Jacobian of the constraints, J_h.
Assuming that J_h has full row rank (i.e., the constraint gradients are linearly independent), the feasible space is a subspace of dimension n_x − n_h. For optimization to be possible, we require n_x > n_h. Figure 5.8 illustrates a case where n_x = n_h = 2, where the feasible space reduces to a single point, leaving no freedom for optimization. The tangent hyperplane for a single constraint is shown on the left side of Fig. 5.9 for the three-dimensional case. For two or more constraints, the feasible space corresponds to the intersection of all the tangent hyperplanes. On the right side of Fig. 5.9, we show the intersection of two tangent hyperplanes in three-dimensional space (a line).
[Fig. 5.8: If we have two equality constraints (n_h = 2) in two-dimensional space (n_x = 2), we are left with no freedom for optimization.]
[Fig. 5.9: Tangent hyperplane ∇hᵀp = 0 for one constraint (left) and the intersection J_h p = 0 of two tangent hyperplanes in three dimensions, which is a line (right).]
In other words, the projection of the objective function gradient onto the feasible space must vanish. Figure 5.10 illustrates this requirement for a case with two constraints in three dimensions.
[Fig. 5.10: If the projection of ∇f onto the feasible space is nonzero, there is a feasible descent direction (left); if the projection is zero, the point is a constrained optimum (right).]
the constraints, the objective function gradient must be a linear combination of the gradients of the constraints. Thus, we can write the requirements defined in Eq. 5.9 as a single vector equation,
∇f(x∗) = − Σ_{j=1}^{n_h} λ_j ∇h_j(x∗) ,   (5.10)
where the λ_j are called the Lagrange multipliers.† There is a multiplier associated with each constraint. The sign of the Lagrange multipliers is arbitrary for equality constraints but will be significant later when dealing with inequality constraints.
† Despite our convention of reserving Greek symbols for scalars, we use λ to represent the n_h-vector of Lagrange multipliers because it is common usage.
[Margin note: … algebra illustrated in Fig. 5.3 and the four subspaces reviewed in Appendix A.4.]
where we have reexpressed Eq. 5.10 in matrix form and added the constraint satisfaction condition.
In constrained optimization, it is sometimes convenient to use the Lagrangian function, which is a scalar function defined as
ℒ(x, λ) = f(x) + h(x)ᵀ λ .
Setting the gradients of the Lagrangian with respect to x and λ to zero recovers the optimality conditions:
∇_x ℒ = ∇f(x) + J_h(x)ᵀ λ = 0
∇_λ ℒ = h(x) = 0 ,   (5.13)
minimize    f(x1, x2) = x1 + 2x2   (by varying x1, x2)
subject to  h(x1, x2) = (1/4)x1² + x2² − 1 = 0 .
The Lagrangian for this problem is
ℒ(x1, x2, λ) = x1 + 2x2 + λ( (1/4)x1² + x2² − 1 ) .
∂ℒ/∂x1 = 1 + (1/2)λx1 = 0
∂ℒ/∂x2 = 2 + 2λx2 = 0
∂ℒ/∂λ = (1/4)x1² + x2² − 1 = 0 .
Solving these three equations for the three unknowns (x1, x2, λ), we obtain two possible solutions:
x_A = (x1, x2) = (−√2, −√2/2) ,   λ_A = √2 ,
x_B = (x1, x2) = (√2, √2/2) ,   λ_B = −√2 .
These two points are shown in Fig. 5.12, together with the objective and
constraint gradients. The optimality conditions (Eq. 5.11) state that the gradient
must be a linear combination of the gradients of the constraints at the optimum.
In the case of one constraint, this means that the two gradients are collinear (which occurs in this example).
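As a numerical check, the three stationarity equations from this example can be solved with a nonlinear root finder. This is a minimal sketch using SciPy; the function name and the initial guess are our choices, not from the text:

import numpy as np
from scipy.optimize import fsolve

def stationarity(z):
    # z = [x1, x2, lam]: gradient of the Lagrangian plus the constraint (Ex. 5.2)
    x1, x2, lam = z
    return [1.0 + 0.5 * lam * x1,
            2.0 + 2.0 * lam * x2,
            0.25 * x1**2 + x2**2 - 1.0]

# A guess near the minimum recovers x_A = (-sqrt(2), -sqrt(2)/2), lambda = sqrt(2)
print(fsolve(stationarity, [-1.0, -1.0, 1.0]))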
[Fig. 5.12: The two critical points, the minimum x_A and the maximum x_B, shown with the objective gradient ∇f and the constraint gradient ∇h at each point.]
the feasible directions, in this case, we can show that it is positive or negative definite in all possible directions. The Hessian is negative definite for x_B, so this is not a minimum; instead, it is a maximum.
Figure 5.13 shows the Lagrangian function (with the optimal Lagrange multiplier we solved for) overlaid on top of the original function and constraint. The unconstrained minimum of the Lagrangian corresponds to the constrained minimum of the original function. The Lagrange multiplier can be visualized as a third dimension coming out of the page. Here we show only the slice for the Lagrange multiplier that solves the optimality conditions.
[Fig. 5.13: Lagrangian contours overlaid on the original function and constraint; the unconstrained minimum of the Lagrangian coincides with the constrained minimum x∗.]
subject to ℎ(𝑥1 , 𝑥2 ) = 𝛽𝑥 12 − 𝑥2 = 0 ,
[Fig. 5.14: Three different problems (β = −0.5, β = 1/12, β = 0.5) illustrating the meaning of the second-order conditions for constrained problems; ∇f and ∇h are shown at the critical point of each.]
For β = −0.5, the Hessian of the Lagrangian is positive definite, and we have a minimum. For β = 0.5, the Lagrangian has negative curvature in the feasible directions, so the point is not a minimum; we can reduce the objective by moving along the curved constraint. The first-order conditions alone do not capture this possibility because they linearize the constraint. Finally, in the limiting case (β = 1/12), the curvature of the constraint matches the curvature of the objective, and the curvature of the Lagrangian is zero in the feasible directions. This point is not a minimum either.
∇f(x∗)ᵀ p ≥ 0 ,   (5.16)
which is the same as for the equality constrained case. We use the arc in Fig. 5.15 to show the descent directions, which are in the open half-space defined by the hyperplane tangent to the gradient of the objective.
[Fig. 5.15: The descent directions (∇fᵀp < 0) are in the open half-space defined by the objective function gradient.]
To consider inequality constraints, we use the same linearization as the equality constraints (Eq. 5.6), but now we enforce an inequality to get
g_j(x + p) ≈ g_j(x) + ∇g_j(x)ᵀ p ≤ 0 .   (5.17)
For a given candidate point that satisfies all constraints, there are two possibilities to consider for each inequality constraint: whether the constraint is inactive (g_j(x) < 0) or active (g_j(x) = 0). If a given constraint is inactive, we do not need to add any condition for it because we can take a step p in any direction and remain feasible as long as the step is small enough. Thus, we only need to consider the active constraints for the optimality conditions.
[Fig. 5.16: The feasible directions for each constraint are in the closed half-space defined by the inequality constraint gradient.]
For the equality constraint, we found that all first-order feasible directions are in the nullspace of the Jacobian matrix. Inequality constraints are not as restrictive. From Eq. 5.17, if constraint j is active (g_j(x) = 0), then the nearby point g_j(x + p) is only feasible if ∇g_j(x)ᵀ p ≤ 0 for all constraints j that are active. In matrix form, we can write J_g(x) p ≤ 0, where the Jacobian matrix includes only the gradients of the active constraints. Thus, the feasible directions for inequality constraint j can be any direction in the closed half-space, corresponding to all directions p such that pᵀ∇g_j ≤ 0, as shown in Fig. 5.16. In this figure, the arc shows the infeasible directions.
The set of feasible directions that satisfies all active constraints is the intersection of all the closed half-spaces defined by the inequality constraints, that is, all p such that J_g(x) p ≤ 0. This intersection of the feasible directions forms a polyhedral cone, as illustrated in Fig. 5.17 for a two-dimensional case with two constraints. To find the cone of
[Fig. 5.17: Excluding the infeasible directions with respect to each constraint (red arcs) leaves the cone of feasible directions (blue), which is the polar cone of the active constraint gradients cone (gray).]
feasible directions, let us first consider the cone formed by the active inequality constraint gradients (shown in gray in Fig. 5.17). This cone is defined by all vectors d such that
d = J_gᵀ σ = Σ_{j=1}^{n_g} σ_j ∇g_j ,   where σ_j ≥ 0 .   (5.18)
[Fig. 5.18: Two possibilities involving active inequality constraints: (1) a feasible descent direction exists, so the point is not an optimum; (2) no feasible descent direction exists, so the point is an optimum.]
g_j + s_j² = 0 ,   j = 1, …, n_g ,   (5.20)

∇_x ℒ = 0  ⇒  ∂ℒ/∂x_i = ∂f/∂x_i + Σ_{l=1}^{n_h} λ_l ∂h_l/∂x_i + Σ_{j=1}^{n_g} σ_j ∂g_j/∂x_i = 0 ,   i = 1, …, n_x .   (5.22)
∇_λ ℒ = 0  ⇒  ∂ℒ/∂λ_l = h_l = 0 ,   l = 1, …, n_h ,   (5.23)
which enforces the equality constraints as before. Taking derivatives with respect to the inequality Lagrange multipliers, we get
∇_σ ℒ = 0  ⇒  ∂ℒ/∂σ_j = g_j + s_j² = 0 ,   j = 1, …, n_g ,   (5.24)
∇_s ℒ = 0  ⇒  ∂ℒ/∂s_j = 2σ_j s_j = 0 ,   j = 1, …, n_g ,   (5.25)
∇f + J_hᵀ λ + J_gᵀ σ = 0
h = 0
g + s ⊙ s = 0   (5.26)
σ ⊙ s = 0
σ ≥ 0 .
[Fig. 5.19: The KKT conditions apply only to regular points. A point x∗ is regular when the gradients of the constraints are linearly independent. The middle and right panes illustrate cases where x∗ is a constrained minimum but not a regular point.]
The middle and right panes of Fig. 5.19 illustrate cases where x∗ is also a constrained minimum. However, x∗ is not a regular point in either case because the gradients of the two constraints are not linearly independent. This means that the gradient of the objective cannot be expressed as a unique linear combination of the constraint gradients. Therefore, we cannot use the KKT conditions, even though x∗ is a minimum. The problem would be ill-conditioned, and the numerical methods described in this chapter would run into numerical difficulties. Similar to the equality constrained case, this situation is uncommon in practice.
Consider a variation of the problem in Ex. 5.2 where the equality is replaced by an inequality, as follows:
minimize    f(x1, x2) = x1 + 2x2
subject to  g(x1, x2) = (1/4)x1² + x2² − 1 ≤ 0 .
The objective function and feasible region are shown in Fig. 5.20.
[Fig. 5.20: Inequality constrained problem with linear objective and feasible space within an ellipse; the minimum x_A and the maximum x_B are marked along with ∇f and ∇g.]
Differentiating the Lagrangian with respect to all the variables, we get the first-order optimality conditions
∂ℒ/∂x1 = 1 + (1/2)σx1 = 0
∂ℒ/∂x2 = 2 + 2σx2 = 0
∂ℒ/∂σ = (1/4)x1² + x2² − 1 + s² = 0
∂ℒ/∂s = 2σs = 0 .
There are two possibilities in the last (complementary slackness) condition: s = 0 (meaning the constraint is active) or σ = 0 (meaning the constraint is not active). However, we can see that setting σ = 0 in either of the first two equations does not yield a solution. Assuming that s = 0 and σ ≠ 0, we can solve the equations to obtain
x_A = (x1, x2, σ) = (−√2, −√2/2, √2) ,   x_B = (x1, x2, σ) = (√2, √2/2, −√2) .
These are the same critical points as in the equality constrained case of Ex. 5.2,
as shown in Fig. 5.20. However, now the sign of the Lagrange multiplier is
significant.
According to the KKT conditions, the Lagrange multiplier has to be nonneg-
ative. Point 𝑥 𝐴 satisfies this condition. As a result, there is no feasible descent
direction at 𝑥 𝐴 , as shown in Fig. 5.21 (left). The Hessian of the Lagrangian at
this point is the same as in Ex. 5.2, which we have already shown to be positive
definite. Therefore, 𝑥 𝐴 is a minimum.
[Fig. 5.21: At the minimum (left), the Lagrange multiplier is positive, and there is no feasible descent direction. At the critical point x_B (right), the Lagrange multiplier is negative, and all descent directions are feasible, so this point is not a minimum.]
Unlike the equality constrained problem, we do not need to check the Hes-
sian at point 𝑥 𝐵 because the Lagrange multiplier is negative. As a consequence,
there are feasible descent directions, as shown in Fig. 5.21 (right). Therefore,
𝑥 𝐵 is not a minimum.
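This result is easy to reproduce numerically. The sketch below solves the inequality constrained problem with SciPy's SLSQP optimizer and then estimates the multiplier from the stationarity condition (SciPy does not report multipliers directly, so the least-squares estimate at the end is our own check):

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + 2 * x[1]
g = lambda x: 0.25 * x[0]**2 + x[1]**2 - 1.0          # g(x) <= 0

# SciPy expects inequality constraints in the form c(x) >= 0, so we pass -g
res = minimize(f, x0=[2.0, 1.0], method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda x: -g(x)}])

x = res.x                                              # approx (-sqrt(2), -sqrt(2)/2)
grad_f = np.array([1.0, 2.0])
grad_g = np.array([0.5 * x[0], 2.0 * x[1]])
sigma = -grad_f @ grad_g / (grad_g @ grad_g)           # multiplier from grad_f + sigma*grad_g = 0
print(x, sigma)                                        # sigma is positive (about sqrt(2))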
Consider a variation of Ex. 5.4 where we add one more inequality constraint,
as follows:
minimize    f(x1, x2) = x1 + 2x2   (by varying x1, x2)
subject to  g1(x1, x2) = (1/4)x1² + x2² − 1 ≤ 0
            g2(x2) = −x2 ≤ 0 .
The feasible region is the top half of the ellipse, as shown in Fig. 5.22.
The Lagrangian for this problem is
ℒ(x, σ, s) = x1 + 2x2 + σ1( (1/4)x1² + x2² − 1 + s1² ) + σ2( −x2 + s2² ) .
Differentiating the Lagrangian with respect to all the variables, we get the
first-order optimality conditions,
∂ℒ/∂x1 = 1 + (1/2)σ1x1 = 0
∂ℒ/∂x2 = 2 + 2σ1x2 − σ2 = 0
∂ℒ/∂σ1 = (1/4)x1² + x2² − 1 + s1² = 0
∂ℒ/∂σ2 = −x2 + s2² = 0
∂ℒ/∂s1 = 2σ1s1 = 0
∂ℒ/∂s2 = 2σ2s2 = 0 .
We now have two complementary slackness conditions, which yield the four
potential combinations listed in Table 5.1.
[Fig. 5.22: Only one point (x∗) satisfies the first-order KKT conditions; the points x_B and x_C are also marked, together with ∇f, ∇g1, and ∇g2.]
[Fig. 5.23: At the minimum (left), the intersection of the feasible directions and descent directions is null, so there is no feasible descent direction. At x_C (right), there is a cone of descent directions that is also feasible, so it is not a minimum.]
Assuming that both constraints are active yields two possible solutions (𝑥 ∗
and 𝑥 𝐶 ) corresponding to two different Lagrange multipliers. According to the
KKT conditions, the Lagrange multipliers for all active inequality constraints
have to be positive, so only the solution with 𝜎1 = 1 (𝑥 ∗ ) is a candidate for a
minimum. This point corresponds to 𝑥 ∗ in Fig. 5.22. As shown in Fig. 5.23 (left),
there are no feasible descent directions starting from 𝑥 ∗ . The Hessian of the
Lagrangian at 𝑥 ∗ is identical to the previous example and is positive definite
when 𝜎1 is positive. Therefore, 𝑥 ∗ is a minimum.
The other solution for which both constraints are active is point 𝑥 𝐶 in
Fig. 5.22. As shown in Fig. 5.23 (right), there is a cone of feasible descent
directions, and therefore 𝑥 𝐶 is not a minimum.
Assuming that neither constraint is active yields 1 = 0 for the first optimality condition, so this situation is not possible. Assuming that only g1 is active yields the solution corresponding to the maximum that we already found in Ex. 5.4, x_B. Finally, assuming that only g2 is active yields no candidate point.
[Figure: interior and exterior penalties added to the objective near the constraint boundary. Fig. 5.27: Quadratic penalty for an equality constrained problem; the minimum of the penalized function f̂(x; μ) (black dots) approaches the true constrained minimum x∗ (blue circle) as the penalty parameter μ increases.]
Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor (𝜌 ∼ 1.2 is conservative, 𝜌 ∼ 10 is aggressive)
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝑓ˆ(𝑥 𝑘 ; 𝜇 𝑘 )
𝑥𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Increase penalty
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
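A minimal Python sketch of this loop applied to the equality constrained example (Ex. 5.2); the inner unconstrained minimizations use SciPy's BFGS, and the parameter values μ0 = 1, ρ = 2, and the iteration count are our choices:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + 2 * x[1]
h = lambda x: 0.25 * x[0]**2 + x[1]**2 - 1.0

def penalty_method(x0, mu=1.0, rho=2.0, iters=20):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # Minimize the quadratic penalized function for the current mu
        fhat = lambda x: f(x) + 0.5 * mu * h(x)**2
        x = minimize(fhat, x, method="BFGS").x
        mu *= rho                      # increase the penalty parameter
    return x

print(penalty_method([2.0, 1.0]))      # approaches (-sqrt(2), -sqrt(2)/2) as mu grows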
∇_x ℒ = ∇f + J_hᵀ λ = 0 ,   (5.38)
∇_x f̂ = ∇f + μ J_hᵀ h = 0 ,   (5.39)
h_j ≈ λ_j∗ / μ .   (5.40)
[Fig. 5.28: The minimum of the quadratic penalized function, x∗_f̂, approaches the constrained minimum x∗ as the penalty parameter increases (three panels for increasing μ).]
Consider the equality constrained problem from Ex. 5.2. The penalized function for that case is
f̂(x; μ) = x1 + 2x2 + (μ/2)( (1/4)x1² + x2² − 1 )² .   (5.41)
Figure 5.28 shows this function for different values of the penalty parameter μ. The penalty is active for all points that are infeasible, but the minimum of the penalized function does not coincide with the constrained minimum of the original problem. The penalty parameter needs to be increased for the minimum of the penalized function to approach the correct solution, but this makes the penalized function increasingly ill-conditioned. Therefore, we solve a sequence of problems, starting with a small value of μ and reusing the optimal point for one solution as the starting point for the next. Figure 5.29 shows that large penalty values are required for high accuracy. In this example, even using a penalty parameter of μ = 1,000 (which results in extremely skewed contours), the objective value achieves only three digits of accuracy.
[Fig. 5.29: Error |f̂∗ − f∗| in the optimal solution for increasing penalty parameter μ.]
f̂(x; μ) = f(x) + (μ/2) Σ_{j=1}^{n_g} ( max(0, g_j(x)) )² .   (5.42)

f̂(x; μ) = f(x) + (μ_h/2) Σ_{l=1}^{n_h} h_l(x)² + (μ_g/2) Σ_{j=1}^{n_g} ( max(0, g_j(x)) )² .   (5.43)
Consider the inequality constrained problem from Ex. 5.4. The penalized
function for that case is
f̂(x; μ) = x1 + 2x2 + (μ/2)( max(0, (1/4)x1² + x2² − 1) )² .
This function is shown in Fig. 5.31 for different values of the penalty parameter
𝜇. The contours of the feasible region inside the ellipse coincide with the
original function contours. However, outside the feasible region, the contours
change to create a function whose minimum approaches the true constrained
minimum as the penalty parameter increases.
[Fig. 5.31: Quadratic penalty for the inequality constrained problem, shown for increasing values of μ; the penalized minimum x∗_f̂ approaches the constrained minimum x∗.]
The considerations on scaling discussed in Tip 4.4 are just as crucial for constrained problems. Similar to scaling the objective function, a good scaling rule of thumb is to normalize each constraint function such that it is of order 1. For constraints, a natural scale is typically already defined by the limits we provide. For example, instead of enforcing g_j(x) ≤ g_max,j directly, we can enforce
g_j(x)/g_max,j − 1 ≤ 0 .   (5.45)
Augmented Lagrangian
f̂(x; λ, μ) = f(x) + Σ_{j=1}^{n_h} λ_j h_j(x) + (μ/2) Σ_{j=1}^{n_h} h_j(x)² .   (5.46)
∇_x f̂(x; λ, μ) = ∇f(x) + Σ_{j=1}^{n_h} ( λ_j + μ h_j(x) ) ∇h_j = 0 ,   (5.47)

∇_x ℒ(x∗, λ∗) = ∇f(x∗) + Σ_{j=1}^{n_h} λ_j∗ ∇h_j(x∗) = 0 .   (5.48)

λ_j∗ ≈ λ_j + μ h_j .   (5.49)

h_j ≈ (λ_j∗ − λ_j)/μ .   (5.52)
Inputs:
𝑥 0 : Starting point
𝜆0 = 0: Initial Lagrange multiplier
𝜇0 > 0: Initial penalty parameter
𝜌 > 1: Penalty increase factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝑓ˆ(𝑥 𝑘 ; 𝜆 𝑘 , 𝜇 𝑘 )
𝑥𝑘
𝜆 𝑘+1 = 𝜆 𝑘 + 𝜇 𝑘 ℎ(𝑥 𝑘 ) Update Lagrange multipliers
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Increase penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
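A minimal sketch of this loop in Python for the same equality constrained example; the starting values μ0 = 0.5 and ρ = 1.1, the BFGS inner solver, and the iteration count are our choices:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + 2 * x[1]
h = lambda x: 0.25 * x[0]**2 + x[1]**2 - 1.0

def augmented_lagrangian(x0, lam=0.0, mu=0.5, rho=1.1, iters=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        # Minimize the augmented Lagrangian (Eq. 5.46) for fixed lambda and mu
        fhat = lambda x: f(x) + lam * h(x) + 0.5 * mu * h(x)**2
        x = minimize(fhat, x, method="BFGS").x
        lam = lam + mu * h(x)          # update the Lagrange multiplier estimate
        mu *= rho                      # modest penalty increase
    return x, lam

x, lam = augmented_lagrangian([2.0, 1.0])
print(x, lam)                          # x approaches (-sqrt(2), -sqrt(2)/2), lam approaches sqrt(2)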
f̂(x; λ, μ) = f(x) + λᵀ ḡ(x) + (μ/2) ‖ḡ(x)‖² ,   (5.53)
where
ḡ_j(x) ≡ h_j(x) for equality constraints ;   g_j(x) if g_j ≥ −λ_j/μ ;   −λ_j/μ otherwise .   (5.54)
90. Di Pillo and Grippo, A new augmented Lagrangian function for inequality constraints in nonlinear programming problems, 1982.
91. Birgin et al., Numerical comparison of augmented Lagrangian algorithms for nonconvex problems, 2005.
92. Rockafellar, The multiplier method of Hestenes and Powell applied to convex programming, 1973.
Consider the inequality constrained problem from Ex. 5.4. Assuming the
inequality constraint is active, the augmented Lagrangian (Eq. 5.46) is
f̂(x; μ) = x1 + 2x2 + λ( (1/4)x1² + x2² − 1 ) + (μ/2)( (1/4)x1² + x2² − 1 )² .
Applying Alg. 5.2, starting with 𝜇 = 0.5 and using 𝜌 = 1.1, we get the iterations
shown in Fig. 5.32.
[Fig. 5.32: Augmented Lagrangian applied to the inequality constrained problem; the iterations proceed from x0 to x∗.]
Compared with the quadratic penalty in Ex. 5.7, the penalized function is much better conditioned, thanks to the term associated with the Lagrange multiplier. The minimum of the penalized function eventually becomes the minimum of the constrained problem without a large penalty parameter.
As done in Ex. 5.6, we solve a sequence of problems starting with a small value of μ and reusing the optimal point for one solution as the starting point for the next. In this case, we update the Lagrange multiplier estimate between optimizations as well. Figure 5.33 shows that only modest penalty parameters are needed to achieve tight convergence to the true solution, a significant improvement over the regular quadratic penalty.
[Fig. 5.33: Error |f̂∗ − f∗| in the optimal solution as compared with the true solution as a function of an increasing penalty parameter μ.]
5.4.2 Interior Penalty Methods
Interior penalty methods work the same way as exterior penalty methods—they transform the constrained problem into a series of unconstrained problems. The main difference with interior penalty methods is that they always seek to maintain feasibility. Instead of adding a penalty only when constraints are violated, they add a penalty as the constraint is approached from the feasible region. This type of penalty is particularly desirable if the objective function is ill-defined outside the feasible region. These methods are called interior because the iteration points remain on the interior of the feasible region. They are also referred to as barrier methods because the penalty function acts as a barrier that prevents the iterates from leaving the feasible region.
One possible interior penalty function to enforce g(x) ≤ 0 is the inverse barrier,
π(x) = − Σ_{j=1}^{n_g} 1/g_j(x) ,   (5.55)
where π(x) → ∞ as g_j(x) → 0⁻ (the superscript “−” indicates a left-sided limit).
[Fig. 5.34: Two different interior penalty functions: inverse barrier and logarithmic barrier.]
A more popular interior penalty function is the
logarithmic barrier,
π(x) = − Σ_{j=1}^{n_g} ln( −g_j(x) ) ,   (5.56)
which also approaches infinity as the constraint tends to zero from the feasible side. The penalty function is then
f̂(x; μ) = f(x) − μ Σ_{j=1}^{n_g} ln( −g_j(x) ) .   (5.57)
Inputs:
𝑥 0 : Starting point
𝜇0 > 0: Initial penalty parameter
𝜌 < 1: Penalty decrease factor
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝑘=0
while not converged do
𝑥 ∗𝑘 ← minimize 𝑓ˆ(𝑥 𝑘 ; 𝜇 𝑘 )
𝑥𝑘
𝜇 𝑘+1 = 𝜌𝜇 𝑘 Decrease penalty parameter
𝑥 𝑘+1 = 𝑥 ∗𝑘 Update starting point for next optimization
𝑘 = 𝑘+1
end while
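A minimal Python sketch of this loop for the inequality constrained example. The starting point must be feasible; for simplicity we use a derivative-free inner solver so that the infinite barrier value outside the feasible region is handled gracefully. The parameter choices are ours:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] + 2 * x[1]
g = lambda x: 0.25 * x[0]**2 + x[1]**2 - 1.0      # g(x) <= 0

def interior_penalty(x0, mu=1.0, rho=0.5, iters=30):
    x = np.asarray(x0, dtype=float)               # must start feasible, g(x0) < 0
    for _ in range(iters):
        # Logarithmic barrier (Eq. 5.57); infeasible points are given an infinite value
        def fhat(x):
            gx = g(x)
            return f(x) - mu * np.log(-gx) if gx < 0 else np.inf
        x = minimize(fhat, x, method="Nelder-Mead").x
        mu *= rho                                  # decrease the barrier parameter
    return x

print(interior_penalty([0.0, 0.0]))                # approaches (-sqrt(2), -sqrt(2)/2)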
Consider the inequality constrained problem from Ex. 5.4. The penalized function for that case using the logarithmic penalty (Eq. 5.57) is
f̂(x; μ) = x1 + 2x2 − μ ln( −(1/4)x1² − x2² + 1 ) .
Figure 5.36 shows this function for different values of the penalty parameter 𝜇.
The penalized function is defined only in the feasible space, so we do not plot
its contours outside the ellipse.
[Fig. 5.36: Logarithmic barrier function for decreasing values of μ; the barrier minimum x∗_f̂ approaches the constrained minimum x∗ from inside the feasible region.]
tends to zero.93 There are augmented and modified barrier approaches that can avoid the ill-conditioning issue (and other methods that remain ill-conditioned but can still be solved reliably, albeit inefficiently).94 However, these methods have been superseded by the modern interior-point methods discussed in Section 5.6, so we do not elaborate on further improvements to classical penalty methods.
93. Murray, Analytical expressions for the eigenvalues and eigenvectors of the Hessian matrices of barrier and penalty functions, 1971.
94. Forsgren et al., Interior methods for nonlinear optimization, 2002.
𝐽𝑟 (𝑢 𝑘 ) 𝑝 𝑢 = −𝑟 (𝑢 𝑘 ) , (5.60)
Differentiating the vector of residuals (Eq. 5.59) with respect to the two
concatenated vectors in 𝑢 yields the following block linear system:
[ H_ℒ  J_hᵀ ; J_h  0 ] [ p_x ; p_λ ] = [ −∇_x ℒ ; −h ] .   (5.62)
This is like the procedure we used in solving the KKT conditions, except that these are linear equations, so we can solve them directly without further iteration.
minimize_p   (1/2) pᵀ H_ℒ p + ∇_x ℒᵀ p
subject to   J_h p + h = 0 .   (5.69)

(1/2) pᵀ H_ℒ p + ∇fᵀ p + λᵀ J_h p .   (5.70)
Then, we substitute the constraint J_h p = −h into the objective:
(1/2) pᵀ H_ℒ p + ∇fᵀ p − λᵀ h .   (5.71)
Now, we can remove the last term in the objective because it does not depend on the variable (p), resulting in the following equivalent problem:
minimize_p   (1/2) pᵀ H_ℒ p + ∇fᵀ p
subject to   J_h p + h = 0 .   (5.72)
Using the QP solution method outlined previously results in the
following system of linear equations:
[ H_ℒ  J_hᵀ ; J_h  0 ] [ p_x ; λ_{k+1} ] = [ −∇f ; −h ] .   (5.73)
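Assembling and solving this linear system is only a few lines of numpy. The sketch below is generic; the small matrices at the end are illustrative placeholders, not data from the text:

import numpy as np

def solve_qp_step(H, Jh, grad_f, h):
    # Assemble and solve the KKT system in Eq. 5.73 for the step p_x and the multipliers
    n, m = H.shape[0], Jh.shape[0]
    K = np.block([[H, Jh.T],
                  [Jh, np.zeros((m, m))]])
    rhs = -np.concatenate([grad_f, h])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]            # p_x, lambda_{k+1}

# Illustrative data: minimize 1/2 p'Hp + grad_f'p subject to Jh p + h = 0
H = np.array([[2.0, 0.0], [0.0, 2.0]])
Jh = np.array([[1.0, 1.0]])
p, lam = solve_qp_step(H, Jh, grad_f=np.array([1.0, 2.0]), h=np.array([0.5]))
print(p, lam)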
minimize_s   (1/2) sᵀ H_ℒ s + ∇_x ℒᵀ s
subject to   J_h s + h = 0   (5.76)
             J_g s + g ≤ 0 .
The determination of the working set could happen in the inner loop,
that is, as part of the inequality constrained QP subproblem (Eq. 5.76).
Alternatively, we could choose a working set in the outer loop and
then solve the QP subproblem with only equality constraints (Eq. 5.69),
where the working-set constraints would be posed as equalities. The
former approach is more common and is discussed here. In that case,
we need consider only the active-set problem in the context of a QP.
Many variations on active-set methods exist; we outline just one such
approach based on a binding-direction method.
The general QP problem we need to solve is as follows:
minimize_x   (1/2) xᵀ Q x + qᵀ x
subject to   A x + b = 0   (5.77)
             C x + d ≤ 0 .
Assume, for the moment, that the working set does not change at
nearby points (i.e., we ignore the constraints outside the working set).
We seek a step 𝑝 to update the design variables as follows: 𝑥 𝑘+1 = 𝑥 𝑘 + 𝑝.
We find 𝑝 by solving the following simplified QP that considers only
the working set:
minimize_p   (1/2)(x_k + p)ᵀ Q (x_k + p) + qᵀ (x_k + p)
subject to   A(x_k + p) + b = 0   (5.79)
             C_w(x_k + p) + d_w = 0 .
minimize_p   (1/2) pᵀ Q p + (q + Qᵀ x_k)ᵀ p
subject to   A p = 0   (5.80)
             C_w p = 0 .
[Fig. 5.39: Structure of the KKT matrix for the QP subproblem within the inequality constrained QP solution process.]
Figure 5.39 shows the structure of the matrix in this linear system. Let us consider the case where the solution of this linear system is nonzero. Solving the KKT conditions in Eq. 5.80 ensures that all the constraints in the working set are still satisfied at x_k + p. Still, there is no guarantee that the step does not violate some of the constraints outside of our working set. Suppose that C_n and d_n define the constraints outside of the working set. If
𝐶 𝑛 (𝑥 𝑘 + 𝑝) + 𝑑𝑛 ≤ 0 (5.82)
for all rows, all the constraints are still satisfied. In that case, we accept
the step 𝑝 and update the design variables as follows:
𝑥 𝑘+1 = 𝑥 𝑘 + 𝑝 . (5.83)
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 . (5.84)
We cannot take the full step (𝛼 = 1), but we would like to take as large
a step as possible while still keeping all the constraints feasible.
Let us consider how to determine the appropriate step size, α. Substituting the step update (Eq. 5.84) into the inequality constraints outside the working set, we obtain
c_iᵀ(x_k + αp) + d_i ≤ 0 .   (5.86)
Expanding and rearranging yields
α c_iᵀ p ≤ −( c_iᵀ x_k + d_i ) .   (5.87)
Inputs:
Q, q, A, b, C, d: Matrices and vectors defining the QP (Eq. 5.77); Q must be positive definite
ε: Tolerance used for termination and for determining whether a constraint is active
Outputs:
x∗: Optimal point

k = 0
x_k = x_0
W_k = {i for all i where (c_iᵀ x_k + d_i) > −ε and length(W_k) ≤ n_x}   One possible initial working set
while true do
    Set C_w = C_{i,∗} and d_w = d_i for all i ∈ W_k   Select rows for working set
    Solve the KKT system (Eq. 5.81)
    if ‖p‖ < ε then
        if σ ≥ 0 then   Satisfied KKT conditions
            x∗ = x_k
            return
        else
            i = argmin σ
            W_{k+1} = W_k \ {i}   Remove i from working set
            x_{k+1} = x_k
        end if
    else
        α = 1   Initialize with optimum step
        B = {}   Blocking index
        for i ∉ W_k do   Check constraints outside of working set
            if c_iᵀ p > 0 then   Potential blocking constraint
                α_b = −(c_iᵀ x_k + d_i) / (c_iᵀ p)   c_i is a row of C_n
                if α_b < α then
                    α = α_b
                    B = i   Save or overwrite blocking index
                end if
            end if
        end for
        W_{k+1} = W_k ∪ {B}   Add B to working set (if linearly independent)
        x_{k+1} = x_k + αp
    end if
    k = k + 1
end while
Because all three constraints are outside of the working set, we check all three. Constraint 1 is potentially blocking (c_iᵀ p > 0) and leads to α_b = 0.35955. Constraint 2 is also potentially blocking and leads to α_b = 1.71429. Finally, constraint 3 is also potentially blocking and leads to α_b = 0.32. We choose the constraint with the smallest α, which is constraint 3, and add it to our working set. At the end of the iteration, x = [2.44, 0.0] and W = {3}.
[Fig. 5.40: Iteration history for the active-set QP example.]
𝑘 = 2 The new QP subproblem yields 𝑝 = [−2.60667, 0.0] and 𝜎 = [0, 0, 5.6667].
Constraints 1 and 2 are outside the working set. Constraint 1 is potentially
blocking and gives 𝛼 𝑏 = 0.1688; constraint 2 is also potentially blocking
and yields 𝛼 𝑏 = 0.9361. Because constraint 1 yields the smaller step, we
add it to the working set. At the end of the iteration, 𝑥 = [2.0, 0.0] and
𝑊 = {1, 3}.
𝑘 = 3 The QP subproblem now yields 𝑝 = [0, 0] and 𝜎 = [6.5, 0, −9.5]. Because
𝑝 = 0, we check for convergence. One of the Lagrange multipliers
is negative, so this cannot be a solution. We remove the constraint
associated with the most negative Lagrange multiplier from the working
set (constraint 3). At the end of the iteration, 𝑥 is unchanged at 𝑥 =
[2.0, 0.0], and 𝑊 = {1}.
k = 4 The QP yields p = [−1.5, 1.0] and σ = [3, 0, 0]. Constraint 2 is potentially blocking and yields α_b = 1.333 (which means it is not blocking because α_b > 1). Constraint 3 is also not blocking (c_iᵀ p < 0). None of the α_b values was blocking, so we can take the full step (α = 1). The new point is x = [0.5, 1.0], and the working set is unchanged at W = {1}.
these two metrics into account to determine the line search termination
criterion.
The Lagrangian is a function that accounts for the two metrics.
However, at a given iteration, we only have an estimate of the Lagrange
multipliers, which can be inaccurate.
One way to combine the objective value with the constraints in a
line search is to use merit functions, which are similar to the penalty
functions introduced in Section 5.4. Common merit functions include
functions that use the norm of constraint violations:
f̂(x; μ) = f(x) + μ ‖ḡ(x)‖_p ,   (5.89)
accepted, the line search ends, and this new point is added to the filter. Unlike the previous case, none of the points in the filter are dominated. Therefore, no points are removed from the filter set, which becomes {(1, 6), (2, 5), (3, 2), (7, 1)}.
3. (4, 3): This point is dominated by a point in the filter, (3, 2). The step is rejected, and the line search continues by selecting a new candidate point. The filter is unchanged.
[Fig. 5.41: Filter method example showing three points in the filter (blue dots); the shaded regions correspond to all the points that are dominated by the filter. The red dots illustrate three different possible outcomes when new points are considered.]
H̃_ℒ,k+1 = H̃_ℒ,k − ( H̃_ℒ,k s_k s_kᵀ H̃_ℒ,k ) / ( s_kᵀ H̃_ℒ,k s_k ) + ( y_k y_kᵀ ) / ( y_kᵀ s_k ) ,   (5.91)
where:
𝑠 𝑘 = 𝑥 𝑘+1 − 𝑥 𝑘
(5.92)
𝑦 𝑘 = ∇𝑥 ℒ(𝑥 𝑘+1 , 𝜆 𝑘+1 ) − ∇𝑥 ℒ(𝑥 𝑘 , 𝜆 𝑘+1 ) .
The step in the design variable space, 𝑠 𝑘 , is the step that resulted from
the latest line search. The Lagrange multiplier is fixed to the latest value
when approximating the curvature of the Lagrangian because we only
need the curvature in the space of the design variables.
Recall that for the QP problem (Eq. 5.76) to have a solution, 𝐻˜ ℒ 𝑘
must be positive definite. To ensure a positive definite approximation,
we can use a damped BFGS update.25∗∗ This method replaces y with a new vector r, defined as
r_k = θ_k y_k + (1 − θ_k) H̃_ℒ,k s_k ,   (5.93)
where
θ_k = 1   if s_kᵀ y_k ≥ 0.2 s_kᵀ H̃_ℒ,k s_k
θ_k = ( 0.8 s_kᵀ H̃_ℒ,k s_k ) / ( s_kᵀ H̃_ℒ,k s_k − s_kᵀ y_k )   if s_kᵀ y_k < 0.2 s_kᵀ H̃_ℒ,k s_k ,   (5.94)
which can range from 0 to 1. We then use the same BFGS update formula (Eq. 5.91), except that we replace each y_k with r_k.
25. Powell, Algorithms for nonlinear constraints that use Lagrangian functions, 1978.
∗∗ The damped BFGS update is not al… used when storing a dense Hessian for large problems is prohibitive.101
100. Fletcher, Practical Methods of Optimization, 1987.
101. Liu and Nocedal, On the limited memory BFGS method for large scale optimization, 1989.
To better understand this update, let us consider the two extremes
for 𝜃. If 𝜃𝑘 = 0, then Eq. 5.93 in combination with Eq. 5.91 yields
𝐻˜ ℒ 𝑘+1 = 𝐻˜ ℒ 𝑘 ; that is, the Hessian approximation is unmodified. At the
other extreme, 𝜃𝑘 = 1 yields the full BFGS update formula (𝑟 𝑘 is set
to 𝑦 𝑘 ). Thus, the parameter 𝜃𝑘 provides a linear weighting between
keeping the current Hessian approximation and using the full BFGS
update.
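A minimal Python sketch of the damped update (Eqs. 5.91, 5.93, and 5.94); the function name and the small test vectors are ours:

import numpy as np

def damped_bfgs_update(H, s, y):
    # Damped BFGS update of the Lagrangian Hessian approximation
    sHs = s @ H @ s
    sy = s @ y
    if sy >= 0.2 * sHs:
        theta = 1.0                       # full BFGS update
    else:
        theta = 0.8 * sHs / (sHs - sy)    # damping keeps the update positive definite
    r = theta * y + (1.0 - theta) * (H @ s)
    Hs = H @ s
    return H - np.outer(Hs, Hs) / sHs + np.outer(r, r) / (s @ r)

H = np.eye(2)
H = damped_bfgs_update(H, s=np.array([0.1, -0.2]), y=np.array([0.05, -0.3]))
print(H)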
The definition of θ_k (Eq. 5.94) ensures that H̃_ℒ,k+1 stays close enough to H̃_ℒ,k and remains positive definite. The damping is activated when the predicted curvature in the latest step is below one-fifth of the curvature predicted by the latest approximate Hessian. This could happen when the function is flattening or when the curvature becomes negative.

5.5.5 Algorithm Overview

We now put together the various pieces in a high-level description of SQP with quasi-Newton approximations in Alg. 5.5.†† For the convergence criterion, we can use an infinity norm of the KKT system residual vector. For better control over the convergence, we can consider two separate tolerances: one for the norm of the optimality and another for the norm of the feasibility. For problems that only have equality constraints, we can solve the corresponding QP (Eq. 5.62) instead.
†† A few popular SQP implementations include SNOPT,96 Knitro,102 MATLAB’s fmincon, and SLSQP.103 The first three are commercial options, whereas SLSQP is open source. There are interfaces in different programming languages for these optimizers, including pyOptSparse (for SNOPT and SLSQP).1
1. Wu et al., pyOptSparse: A Python framework for large-scale constrained nonlinear optimization of sparse systems, 2020.
102. Byrd et al., Knitro: An Integrated Package for Nonlinear Optimization, 2006.
103. Kraft, A software package for sequential quadratic programming, 1988.
Inputs:
𝑥 0 : Starting point
𝜏opt : Optimality tolerance
𝜏feas : Feasibility tolerance
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Corresponding function value
𝜆 𝑘+1 = 𝜆 𝑘 + 𝑝𝜆
𝛼 = linesearch 𝑝 𝑥 , 𝛼 init Use merit function or filter (Section 5.5.3)
𝑥 𝑘+1 = 𝑥 𝑘 + 𝛼𝑝 𝑘 Update step
𝑊𝑘+1 = 𝑊𝑘 Active set becomes initial working set for next QP
Evaluate functions ( 𝑓 , 𝑔, ℎ) and derivatives (∇ 𝑓 , 𝐽 𝑔 , 𝐽 ℎ )
| |
∇𝑥 ℒ = ∇ 𝑓 + 𝐽 ℎ 𝜆 + 𝐽 𝑔 𝜎
𝑘 = 𝑘+1
end while
We now solve Ex. 5.2 using the SQP method (Alg. 5.5). We start at x0 = [2, 1] with an initial Lagrange multiplier λ = 0 and an initial estimate of the Lagrangian Hessian.
[Fig. 5.42: SQP algorithm iterations, shown at three stages as the iterates move from x0 to x∗.]
We repeat this process for subsequent iterations, as shown in Fig. 5.42. The gray contours show the QP subproblem (Eq. 5.72) solved at each iteration: the quadratic objective appears as elliptical contours and the linearized
constraint as a straight line. The starting point is infeasible, and the iterations
remain infeasible until the last few iterations.
This behavior is common for SQP because although it satisfies the linear approximation of the constraints at each step, it does not necessarily satisfy the constraints of the actual problem, which is nonlinear. As the constraint approximation becomes more accurate near the solution, the nonlinear constraint is then satisfied. Figure 5.43 shows the convergence of the Lagrangian gradient norm, with the characteristic quadratic convergence at the end.
[Fig. 5.43: Convergence history of the norm of the Lagrangian gradient, ‖∇_x ℒ‖.]
Example 5.13 SQP applied to inequality constrained problem
We now solve the inequality constrained version of the previous example (Ex. 5.4) with the same initial conditions and general approach. The only
difference is that rather than solving the linear system of equations Eq. 5.62, we
have to solve an active-set QP problem at each iteration, as outlined in Alg. 5.4.
The iteration history and convergence of the norm of the Lagrangian gradient
are plotted in Figs. 5.44 and 5.45, respectively.
[Figs. 5.44 and 5.45: Iteration history and convergence of ‖∇_x ℒ‖ for the inequality constrained SQP example.]
max(𝜎) ≤ 𝜎yield .
𝜎 𝑗 ≤ 𝜎yield , 𝑗 = 1, . . . , 𝑛 𝜎 .
5.6 Interior-Point Methods

Interior-point methods use concepts from both SQP and interior penalty methods.∗ These methods form an objective similar to the interior penalty but with the key difference that instead of penalizing the constraints directly, they add slack variables to the set of optimization variables and penalize the slack variables. The resulting formulation is as follows:
minimize_{x, s}   f(x) − μ_b Σ_{j=1}^{n_g} ln s_j
subject to   h(x) = 0   (5.95)
             g(x) + s = 0 .
∗ The name interior point stems from early methods based on interior penalty methods that assumed that the initial point was feasible. However, modern interior-point methods can start with infeasible points.
This formulation turns the inequality constraints into equality con-
straints and thus avoids the combinatorial problem.
Similar to SQP, we apply Newton’s method to solve for the KKT
conditions. However, instead of solving the KKT conditions of the
original problem (Eq. 5.59), we solve the KKT conditions of the interior-
point formulation (Eq. 5.95).
These slack variables in Eq. 5.95 do not need to be squared, as was
done in deriving the KKT conditions, because the logarithm is only
defined for positive 𝑠 values and acts as a barrier preventing negative
values of 𝑠 (although we need to prevent the line search from producing
negative 𝑠 values, as discussed later). Because 𝑠 is always positive,
that means that 𝑔(𝑥 ∗ ) < 0 at the solution, which satisfies the inequality
constraints.
Like penalty method formulations, the interior-point formulation
(Eq. 5.95) is only equivalent to the original constrained problem in the
limit, as 𝜇𝑏 → 0. Thus, as in the penalty methods, we need to solve a
sequence of solutions to this problem where 𝜇𝑏 approaches zero.
First, we form the Lagrangian for this problem as
ℒ(x, λ, σ, s) = f(x) − μ_b Σ_{j=1}^{n_g} ln s_j + h(x)ᵀ λ + ( g(x) + s )ᵀ σ .
… to the original KKT system (Eq. 5.97) and then made it symmetric, we would have obtained a term with S⁻², which would make the system more challenging than with the S⁻¹ term in Eq. 5.100. Figure 5.46 shows the structure and block sizes of the matrix.
[Fig. 5.46: Structure and block sizes of the interior-point system matrix, with a first block row of size n_x containing H_ℒ, J_hᵀ, and J_gᵀ, followed by block rows of size n_h and n_g.]
f̂(x) = f(x) − μ_b Σ_{i=1}^{n_g} ln s_i + (μ_p/2)( ‖h(x)‖² + ‖g(x) + s‖² ) ,   (5.101)
where 𝜇𝑏 is the barrier parameter from Eq. 5.95, and 𝜇𝑝 is the penalty
parameter. Additionally, we must enforce an 𝛼max in the line search so
that the implicit constraint on 𝑠 > 0 remains enforced. The maximum
allowed step size can be computed prior to the line search because we
know the value of 𝑠 and 𝑝 𝑠 and require that
𝑠 + 𝛼𝑝 𝑠 ≥ 0 . (5.102)
𝑠 + 𝛼 max 𝑝 𝑠 = 𝜏𝑠 , (5.103)
Inputs:
𝑠: Current slack values
𝑝 𝑠 : Proposed step
𝜏: Fractional tolerance (e.g., 0.005)
Outputs:
𝛼max : Maximum feasible step length
𝛼max = 1
for 𝑖 = 1 to 𝑛 𝑔 do
α = (τ − 1) s_i / p_s,i
if 𝛼 > 0 then
𝛼max = min(𝛼max , 𝛼)
end if
end for
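A direct Python transcription of this step-limiting rule (the function name and test values are ours):

import numpy as np

def max_step_to_boundary(s, p_s, tau=0.005):
    # Largest alpha in (0, 1] such that s + alpha * p_s stays above a fraction tau of s
    alpha_max = 1.0
    for si, pi in zip(s, p_s):
        if pi < 0:                        # only shrinking slack components can hit the boundary
            alpha_max = min(alpha_max, (tau - 1.0) * si / pi)
    return alpha_max

print(max_step_to_boundary(np.array([1.0, 0.5]), np.array([-2.0, 0.1])))  # about 0.4975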
𝜆 𝑘+1 = 𝜆 𝑘 + 𝛼 𝜎 𝑝𝜆 (5.106)
𝜎 𝑘+1 = 𝜎 𝑘 + 𝛼 𝜎 𝑝 𝜎 . (5.107)
[Fig. 5.47: Numerical solution of the problem solved graphically in Ex. 5.1, using sequential quadratic programming (left) and the interior-point method (right); both converge from x0 to x∗.]
∇_x ℒ(x1, x2) = [ 1 + (1/2)σx1 ; 2 + 2σx2 ] = [ 1 ; 2 ] ,
and the gradient of the constraint is
∇g(x1, x2) = [ (1/2)x1 ; 2x2 ] = [ 1 ; 2 ] .
The interior-point system of equations (Eq. 5.100) is then assembled and solved at the starting point.
[Figure: interior-point iterations for this problem, converging from x0 to x∗ in 19 iterations.]
[Figure: spring system constrained by two cables attached at (x_c1, y_c) and (x_c2, y_c), with springs k1, ℓ1 and k2, ℓ2.]
The optimization paths for SQP and the interior-point method are shown in Fig. 5.50.
[Fig. 5.50: Optimization of the constrained spring system: sequential quadratic programming (left) and interior-point method (right) paths from x0 to x∗, with the rope constraints shown.]
ḡ_KS(ρ, g) = (1/ρ) ln( Σ_{j=1}^{n_g} exp(ρ g_j) ) ,   (5.111)
Kreisselmeier and Steinhauser, Systematic control design by optimizing a vector performance index, 1979.
where 𝜌 is an aggregation factor that determines how close this function
is to the maximum function (Eq. 5.110). As 𝜌 → ∞, 𝑔¯ KS (𝜌, 𝑔) → max(𝑔).
However, as 𝜌 increases, the curvature of 𝑔¯ increases, which can cause
ill-conditioning in the optimization.
The exponential function disproportionately weighs the higher
positive values in the constraint vector, but it does so in a smooth way.
Because the exponential function can easily result in overflow, it is
preferable to use the alternate (but equivalent) form of the KS function,
ḡ_KS(ρ, g) = max_j g_j + (1/ρ) ln( Σ_{j=1}^{n_g} exp( ρ( g_j − max_j g_j ) ) ) .   (5.112)
The value of 𝜌 should be tuned for each problem, but 𝜌 = 100 works
well for many problems.
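A minimal Python sketch of the overflow-safe form in Eq. 5.112 (the function name and test values are ours):

import numpy as np

def ks_aggregate(g, rho=100.0):
    # Overflow-safe KS aggregation: a smooth, conservative estimate of max(g)
    g = np.asarray(g, dtype=float)
    gmax = g.max()
    return gmax + np.log(np.sum(np.exp(rho * (g - gmax)))) / rho

g = np.array([-1.0, 0.2, 0.19])
print(max(g), ks_aggregate(g, rho=100.0))   # the KS value is slightly above the true maximum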
Consider the constrained spring system from Ex. 5.16. Aggregating the two
constraints using the KS function, we can formulate a single constraint as
ḡ_KS(x1, x2) = (1/ρ) ln( exp( ρ g1(x1, x2) ) + exp( ρ g2(x1, x2) ) ) ,
where
g1(x1, x2) = √( (x1 + x_c1)² + (x2 + y_c)² ) − ℓ_c1
g2(x1, x2) = √( (x1 − x_c2)² + (x2 + y_c)² ) − ℓ_c2 .
Figure 5.51 shows the contour of 𝑔¯ KS = 0 for increasing values of the aggregation
parameter 𝜌.
[Fig. 5.51: KS function aggregation of the two constraints for ρ_KS = 2 (f∗_KS = −19.448), ρ_KS = 10 (f∗_KS = −21.653), and ρ_KS = 100 (f∗_KS = −22.090). The optimum of the problem with aggregated constraints, x∗_KS, approaches the true optimum x∗ as the aggregation parameter ρ_KS increases.]
For the lowest value of ρ, the feasible region is reduced, resulting in a conservative optimum. For the highest value of ρ, the optimum obtained with constraint aggregation is graphically indistinguishable from the true optimum, and the objective function value approaches the true optimal value of −22.1358.
ḡ_PN(ρ) = max_j |g_j| ( Σ_{j=1}^{n_g} ( |g_j| / max_j |g_j| )^ρ )^(1/ρ) .   (5.113)
The absolute value in this equation can be an issue if 𝑔 can take both
positive and negative values because the function is not differentiable
in regions where 𝑔 transitions from positive to negative.
A class of aggregation functions known as induced functions was designed to provide more accurate estimates of max(g) for a given value of ρ than the KS and induced norm functions.110 There are two main types of induced functions: one uses exponentials, and the other uses powers. The induced exponential function is given by
g_IE(ρ) = ( Σ_{j=1}^{n_g} g_j exp(ρ g_j) ) / ( Σ_{j=1}^{n_g} exp(ρ g_j) ) .   (5.114)
110. Kennedy and Hicken, Improved constraint-aggregation methods, 2015.
5.8 Summary
Problems
5.2 Let us modify Ex. 5.2 so that the equality constraint is the negative
of the original one—that is,
h(x1, x2) = −(1/4)x1² − x2² + 1 = 0 .
Classify the critical points and compare them with the original
solution. What does that tell you about the significance of the
Lagrange multiplier sign?
5.3 Similar to the previous exercise, consider Ex. 5.4 and modify it
so that the inequality constraint is the negative of the original
one—that is,
g(x1, x2) = −(1/4)x1² − x2² + 1 ≤ 0 .
Classify the critical points and compare them with the original
solution.
be stated as follows:
minimize    2ρℓπRt                      (mass)
by varying  R, t                        (radius, wall thickness)
subject to  F/(2πRt) − σ_yield ≤ 0      (yield stress)
            F − π³ER³t/(4ℓ²) ≤ 0        (buckling load)
In the formula for the mass in this objective, ρ is the material density, and we assume that t ≪ R. The first constraint is the
compressive stress, which is simply the force divided by the cross-
sectional area. The second constraint uses Euler’s critical buckling
load formula, where 𝐸 is the material Young’s modulus, and the
second moment of area is replaced with the one corresponding
to a circular cross section (𝐼 = 𝜋𝑅 3 𝑡).
Find the optimum 𝑅 and 𝑡 as a function of the other parameters.
Pick reasonable values for the parameters, and verify your solution
graphically. Plot the gradients of the objective and constraints at
the optimum, and verify the Lagrange multipliers graphically.
5.8 Beam with H section. Consider a cantilevered beam with an H-shaped cross section composed of a web and flanges subject to a transverse load, as shown in Fig. 5.53. The objective is to minimize the structural weight by varying the web thickness t_w and the flange thickness t_b, subject to stress constraints. The other cross-sectional parameters are fixed; the web height h is 250 mm, and the flange width b is 125 mm. The axial stress in the flange and the shear stress in the web should not exceed the corresponding yield values (σ_yield = 200 MPa, and τ_yield = 116 MPa, respectively). The optimization problem can be stated as follows:
minimize    2bt_b + ht_w                  (mass)
by varying  t_b, t_w                      (flange and web thicknesses)
subject to  Pℓh/(2I) − σ_yield ≤ 0        (axial stress)
            1.5P/(ht_w) − τ_yield ≤ 0     (shear stress)
where the tip load is P = 100 kN and the length is ℓ = 1 m. The second moment of area for the H section is
I = (h³/12)t_w + (b/6)t_b³ + (h²b/2)t_b .
[Fig. 5.53: Cantilever beam with H section (b = 125 mm, h = 250 mm, flange thickness t_b, web thickness t_w, load P = 100 kN, length ℓ = 1 m).]
Find the optimal values of 𝑡𝑏 and 𝑡𝑤 by solving the KKT conditions
analytically. Plot the objective contours and constraints to verify
your result graphically.
a. Reproduce the results from Ex. 5.12 (SQP) or Ex. 5.15 (interior
point).
b. Solve Prob. 5.3.
c. Solve Prob. 5.11.
d. Compare the computational cost, precision, and robustness
of your optimizer with those of an existing software package.
5.11 Aircraft fuel tank. A jet aircraft needs to carry a streamlined fuel tank with length ℓ and diameter d. The optimization problem can be stated as follows:
minimize    D(ℓ, d)
by varying  ℓ, d
subject to  V_req − V(ℓ, d) ≤ 0 .
The drag is
D = (1/2) ρ v² C_D S ,
where the air density is ρ = 0.55 kg/m³, and the aircraft speed is v = 300 m/s. The drag coefficient of an ellipsoid can be estimated as∗
C_D = C_f [ 1 + 1.5 (d/ℓ)^(3/2) + 7 (d/ℓ)³ ] .
5.12 Solve a variation of Ex. 5.16 where we replace the system of cables
with a cable and a rod that resists both tension and compression.
The cable is positioned above the spring, as shown in Fig. 5.55,
where 𝑥 𝑐 = 2 m, and 𝑦 𝑐 = 3 m, with a maximum length of
ℓ 𝑐 = 7.0 m. The rod is positioned at 𝑥 𝑟 = 2 m and 𝑦𝑟 = 4 m,
with a length of ℓ_r = 4.5 m. How does this change the problem?
[Fig. 5.55: Spring system constrained by a cable attached at (x_c, y_c) with length ℓ_c and a rod attached at (x_r, y_r) with length ℓ_r; the springs are k1, ℓ1 and k2, ℓ2.]
5.14 Solve the same three-bar truss optimization problem in Prob. 5.13
by aggregating all the constraints into a single constraint. Try
different aggregation parameters and see how close you can get
to the solution you obtained for Prob. 5.13.
5.16 Solve the same 10-bar truss optimization problem of Prob. 5.15
by aggregating all the constraints into a single constraint. Try
different aggregation parameters and see how close you can get
to the solution you obtained for Prob. 5.15.
M = (L/b)(b/2)²/2 = Lb/8 .
Now we assume that the wing structure has the H-shaped cross
section from Prob. 5.8 with a constant thickness of 𝑡𝑤 = 𝑡𝑏 = 4 mm.
We relate the cross-section height ℎsec and width 𝑏sec to the chord
as ℎ sec = 0.1𝑐 and 𝑏sec = 0.4𝑐. With these assumptions, we can
compute the second moment of area 𝐼 in terms of 𝑐.
The maximum bending stress is then
σ_max = M h_sec / (2I) .
Considering the safety factor of 1.5 and the ultimate load factor
of 2.5, the stress constraint is
2.5 σ_max − σ_yield/1.5 ≤ 0 ,
where 𝜎yield = 200 MPa.
Solve this problem and compare the solution with the uncon-
strained optimum. Plot the objective contours and constraint to
verify your result graphically.
Computing Derivatives
6
The gradient-based optimization methods introduced in Chapters 4
and 5 require the derivatives of the objective and constraints with
respect to the design variables, as illustrated in Fig. 6.1. Derivatives
also play a central role in other numerical algorithms. For example, the
Newton-based methods introduced in Section 3.8 require the derivatives
of the residuals.
The accuracy and computational cost of the derivatives are critical for the success of these methods. Gradient-based methods are only efficient when the derivative computation is also efficient. The computation of derivatives can be the bottleneck in the overall optimization procedure, especially when the model solver needs to be called repeatedly. This chapter introduces the various methods for computing derivatives and discusses the relative advantages of each method.
[Fig. 6.1: Efficient derivative computation is crucial for the overall efficiency of gradient-based optimization: the optimizer passes x to the model, which returns f and g, and a derivative computation returns ∇f and J_g.]
223
The Jacobian ∂f/∂x is an (n_f × n_x) matrix.
Consider the following function with two variables and two functions of interest:
f(x) = [ f1(x1, x2) ; f2(x1, x2) ] = [ x1x2 + sin x1 ; x1x2 + x2² ] .
We can differentiate this symbolically to obtain exact reference values:
∂f/∂x = [ x2 + cos x1 , x1 ; x2 , x1 + 2x2 ] .
We evaluate this at x = (π/4, 2), which yields
∂f/∂x = [ 2.707 , 0.785 ; 2.000 , 4.785 ] .
[Fig. 6.2: Three views of the model: a solver drives the residuals r(x, u) = 0 for the states u and outputs f(x, u); the same computation can also be viewed as a sequence of intermediate variables v1 = x, v2(v1), v3(v1, v2), …, f = v_n(v1, …).]
codes are driven by reading and writing input and output files. However, the
numbers in the files usually have fewer digits than the code’s working precision.
The ideal solution is to modify the code to be called directly and pass the data
through memory. Another solution is to increase the precision in the files.
We use a fixed-point iteration to determine the value of f for a given input x from the implicit relation f = sin(x + f). That means we start with a guess for f on the right-hand side of that expression to estimate a new value for f, and repeat. In this case, convergence typically happens in about 10 iterations. Arbitrarily, we choose x as the initial guess for f, resulting in the following computational procedure:
Input: 𝑥
𝑓 =𝑥
for 𝑖 = 1 to 10 do
𝑓 = sin(𝑥 + 𝑓 )
end for
return 𝑓
dfdx =
cos(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin (2*x))))))))))*( cos(x + sin(x + sin(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin (2*x)))))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin(x + sin(x + sin (2*x))))))))*( cos(x + sin(x + sin(x + sin
(x + sin(x + sin(x + sin (2*x)))))))*( cos(x + sin(x + sin(x + sin(x +
sin(x + sin (2*x))))))*( cos(x + sin(x + sin(x + sin(x + sin (2*x)))))
*( cos(x + sin(x + sin(x + sin (2*x))))*( cos(x + sin(x + sin (2*x)))*(
cos(x + sin (2*x))*(2* cos (2*x) + 1) + 1) + 1) + 1) + 1) + 1) + 1) +
1) + 1)
f(x + h ê_j) = f(x) + h ∂f/∂x_j + (h²/2!) ∂²f/∂x_j² + (h³/3!) ∂³f/∂x_j³ + … ,   (6.3)
where 𝑒ˆ 𝑗 is the unit vector in the 𝑗th direction. Solving this for the first
derivative, we obtain the finite-difference formula,
∂f/∂x_j = [ f(x + h ê_j) − f(x) ] / h + 𝒪(h) ,   (6.4)

∂f/∂x_j = lim_{h→0} [ f(x + h ê_j) − f(x) ] / h ≈ [ f(x + h ê_j) − f(x) ] / h .   (6.5)
The truncation error is 𝒪(h), and therefore this is a first-order approximation. The difference between this approximation and the exact derivative is illustrated in Fig. 6.3.
[Fig. 6.3: Exact derivative compared with a forward finite-difference approximation (Eq. 6.4).]
The backward-difference approximation can be obtained by replacing h with −h to yield
∂f/∂x_j = [ f(x) − f(x − h ê_j) ] / h + 𝒪(h) ,   (6.6)
see that this estimate is closer to the actual derivative than the forward
difference.
Even more accurate estimates can be derived by combining differ-
ent Taylor series expansions to obtain higher-order truncation error
∂²f/∂x_j² = [ f(x + 2h ê_j) − 2f(x) + f(x − 2h ê_j) ] / (4h²) + 𝒪(h²) .   (6.9)

∇_p f = [ f(x + hp) − f(x) ] / h + 𝒪(h) .   (6.10)
[Fig. 6.5: Computing a directional derivative ∇_p f using a forward finite difference: f is evaluated at x and at x + hp along the direction p.]
Figure 6.7 shows the resulting errors and error bounds when we set ε_f = 10⁻¹⁶. When h is so small that no difference exists in the output (for steps smaller than 10⁻¹⁶), the finite-difference estimates yield zero (and ε = 1), which corresponds to 100 percent error.
Table 6.1 lists the data for the forward difference, where we can see the number of digits in the difference Δf decreasing with decreasing step size until no difference remains at all.
[Fig. 6.7: As the step size h decreases, the total error in the finite-difference estimates initially decreases because of a reduced truncation error. However, subtractive cancellation takes over when the step is small enough and eventually yields an entirely wrong derivative.]

Tip 6.2 When using finite differencing, always perform a step-size study

In practice, most gradient-based optimizers use finite differences by default to compute the gradients. Given the potential for inaccuracies, finite differences are often the culprit in cases where gradient-based optimizers fail to converge. Although some of these optimizers try to estimate a good step size, there is no substitute for a step-size study by the user.
Table 6.1: Subtractive cancellation leads to a loss of precision and, ultimately, inaccurate finite-difference estimates.

h        f(x + h)              Δf                    df/dx
10⁻¹     4.9562638252880662    0.4584837713419043    4.58483771
10⁻²     4.5387928890592475    0.0410128351130856    4.10128351
10⁻⁴     4.4981854440562818    0.0004053901101200    4.05390110
10⁻⁶     4.4977841073787870    0.0000040534326251    4.05343263
10⁻⁸     4.4977800944804409    0.0000000405342790    4.05342799
10⁻¹⁰    4.4977800543515052    0.0000000004053433    4.05344203
10⁻¹²    4.4977800539502155    0.0000000000040536    4.05453449
10⁻¹⁴    4.4977800539462027    0.0000000000000409    4.17443857
10⁻¹⁶    4.4977800539461619    0.0000000000000000    0.00000000
10⁻¹⁸    4.4977800539461619    0.0000000000000000    0.00000000
Exact    4.4977800539461619                          4.05342789
The step-size study must be performed for all variables and does not necessarily apply to the whole design space. Therefore, repeating this study for other values of 𝑥 might be required.
Because we do not usually know the exact derivative, we cannot plot the
error as we did in Fig. 6.7. However, we can always tabulate the derivative
estimates as we did in Table 6.1. In the last column, we can see from the pattern
of digits that match the previous step size that ℎ = 10−8 is the best step size in
this case.
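A minimal step-size study sketch in the spirit of Tip 6.2, assuming NumPy; the test function below is an arbitrary placeholder, not the book's example:

    import numpy as np

    def f(x):
        return np.exp(x) * np.sin(x)          # placeholder function for illustration

    x0 = 1.5
    f0 = f(x0)
    for k in range(1, 17):
        h = 10.0 ** (-k)
        dfdx = (f(x0 + h) - f0) / h           # forward difference (Eq. 6.4)
        print(f"h = 1e-{k:02d}   df/dx ~ {dfdx:.10f}")
    # Look for the range of h where the leading digits stabilize; smaller steps
    # eventually lose accuracy to subtractive cancellation.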
Finite-difference approximations are sometimes used with larger steps than would be desirable from an accuracy standpoint to help smooth out numerical noise or discontinuities in the model. This approach sometimes works, but it is better to address these problems within the model whenever possible. Figure 6.8 shows an example of this effect. For this noisy function, the larger step ignores the noise and gives the correct trend, whereas the smaller step results in an estimate with the wrong sign.
Although the absolute step size usually differs for each x_j, the relative step size h is often the same and is user-specified. This is similar to the expression for the convergence criterion in Eq. 4.24.
Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Vector of functions of interest
Outputs:
𝐽: Jacobian of 𝑓 with respect to 𝑥
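A sketch of such a finite-difference Jacobian routine, assuming f maps an n_x-vector to an n_f-vector (NumPy arrays); the relative-step choice below is one common heuristic, not necessarily the book's:

    import numpy as np

    def fd_jacobian(f, x, h_rel=1e-6):
        x = np.asarray(x, dtype=float)
        f0 = np.asarray(f(x))
        J = np.zeros((f0.size, x.size))
        for j in range(x.size):
            h = h_rel * (1.0 + abs(x[j]))            # relative step for each variable
            xp = x.copy()
            xp[j] += h
            J[:, j] = (np.asarray(f(xp)) - f0) / h   # forward difference (Eq. 6.4)
        return J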
6.5.1 Theory
The complex-step method can also be derived using a Taylor series expansion. Rather than using a real step h, as we did to derive the finite-difference formulas, we use a pure imaginary step, ih.∗ If f is a real function in real variables and is also analytic (differentiable in the complex domain), we can expand it in a Taylor series about a real point x as follows:

\[
f(x + ih\hat{e}_j) = f(x) + ih\frac{\partial f}{\partial x_j} - \frac{h^2}{2}\frac{\partial^2 f}{\partial x_j^2} - i\frac{h^3}{6}\frac{\partial^3 f}{\partial x_j^3} + \ldots . \qquad (6.12)
\]

Taking the imaginary parts of both sides of this equation, we have

\[
\operatorname{Im}\!\left[f(x + ih\hat{e}_j)\right] = h\frac{\partial f}{\partial x_j} - \frac{h^3}{6}\frac{\partial^3 f}{\partial x_j^3} + \ldots . \qquad (6.13)
\]

Dividing by h yields the complex-step derivative approximation, ∂f/∂x_j ≈ Im[f(x + ih ê_j)]/h, which involves no subtraction and is therefore free of subtractive cancellation.

∗ This method originated with the work of Lyness and Moler,¹¹² who developed formulas that use complex arithmetic for computing the derivatives of real functions of arbitrary order with arbitrary-order truncation error, much like the Taylor series combination approach in finite differences. Later, Squire and Trapp⁴⁹ observed that the simplest of these formulas was convenient for computing first derivatives.
49. Squire and Trapp, Using complex variables to estimate derivatives of real functions, 1998.
112. Lyness and Moler, Numerical differentiation of analytic functions, 1967.
[Fig. 6.9: Unlike finite differences, the complex-step method is not subject to subtractive cancellation. Therefore, the error is the same as that of the function evaluation (machine zero in this case).]
Table 6.2: For a small enough step, the real part of the complex evaluation is identical to the real evaluation, and the derivative matches to machine precision.

h         Re f                  Im f / h
10⁻¹      4.4508662116993065    4.0003330384671729
10⁻²      4.4973069409015318    4.0528918144659292
10⁻⁴      4.4977800066307951    4.0534278402854467
10⁻⁶      4.4977800539414297    4.0534278938932582
10⁻⁸      4.4977800539461619    4.0534278938986201
10⁻¹⁰     4.4977800539461619    4.0534278938986201
10⁻¹²     4.4977800539461619    4.0534278938986201
10⁻¹⁴     4.4977800539461619    4.0534278938986210
10⁻¹⁶     4.4977800539461619    4.0534278938986201
10⁻¹⁸     4.4977800539461619    4.0534278938986210
10⁻²⁰⁰    4.4977800539461619    4.0534278938986201
Exact     4.4977800539461619    4.0534278938986201
Inputs:
𝑥: Point about which to compute the gradient
𝑓 : Function of interest
Outputs:
𝐽: Jacobian of 𝑓 about point 𝑥
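A minimal sketch of the complex-step Jacobian, assuming f accepts complex inputs and is composed of complex-safe operations:

    import numpy as np

    def complex_step_jacobian(f, x, h=1e-200):
        x = np.asarray(x, dtype=complex)
        f0 = np.atleast_1d(f(x))
        J = np.zeros((f0.size, x.size))
        for j in range(x.size):
            xp = x.copy()
            xp[j] += 1j * h                                  # pure imaginary step
            J[:, j] = np.imag(np.atleast_1d(f(xp))) / h      # no subtraction, no cancellation
        return J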
exist in any case. We use the “greater or equal” in the logic so that the
approximation yields the correct right-sided derivative at that point.
Once you have made your code complex, the first test you should perform
is to run your code with no imaginary perturbation and verify that no variable
ends up with a nonzero imaginary part. If any number in the code acquires a
nonzero imaginary part, something is wrong, and you must trace the source of
the error. This is a necessary but not sufficient test.
When an iterative solver is involved, the convergence of the imaginary part, which contains the derivative information, often lags relative to the real part, as shown in Fig. 6.12. Therefore, if the solver only checks for the real part, it might yield a derivative with a precision lower than the function value. In this example, f is the drag coefficient given by a computational fluid dynamics solver and ε is the relative error for each part.
[Fig. 6.12: The imaginary parts of the variables often lag relative to the real parts in iterative solvers.]

6.6 Algorithmic Differentiation

Algorithmic differentiation (AD)—also known as computational differentiation or automatic differentiation—is a well-known approach based on the systematic application of the chain rule to computer programs.¹¹⁵,¹¹⁶ The derivatives computed with AD can match the precision of the function evaluation. The cost of computing derivatives with AD can be proportional to either the number of variables or the number of functions, depending on the type of AD, making it flexible.

115. Griewank, Evaluating Derivatives, 2000.
116. Naumann, The Art of Differentiating Computer Programs—An Introduction to Algorithmic Differentiation, 2011.
Another attractive feature of AD is that its implementation is largely
automatic, thanks to various AD tools. To explain AD, we start by
outlining the basic theory with simple examples. Then we explore how
the method is implemented in practice with further examples.
In the forward mode, we choose one input variable and work forward toward the outputs until we get the desired total derivative. In the reverse mode, we choose one output variable and work backward toward the inputs until we get the desired total derivative.
6.6.2 Forward-Mode AD
The chain rule for the forward mode can be written as
\[
\frac{\mathrm{d}v_i}{\mathrm{d}v_j} = \sum_{k=j}^{i-1} \frac{\partial v_i}{\partial v_k}\frac{\mathrm{d}v_k}{\mathrm{d}v_j} , \qquad (6.21)
\]

Defining the notation \(\dot{v}_i \equiv \mathrm{d}v_i/\mathrm{d}v_j\) for a fixed input \(v_j\), we can write this more compactly as

\[
\dot{v}_i = \sum_{k=j}^{i-1} \frac{\partial v_i}{\partial v_k}\dot{v}_k . \qquad (6.22)
\]

We set the seed \(\dot{v}_j = 1\) for the chosen input variable; the chain rule then propagates the total derivatives forward, as shown in Fig. 6.15, affecting all the variables that depend on the seeded variable.
Once we are done applying the chain rule (Eq. 6.22) for the chosen input variable v_j, we end up with the total derivatives dv_i/dv_j for all i > j. The sum in the chain rule (Eq. 6.22) only needs to consider the nonzero partial derivative terms. If a variable k does not explicitly appear in the expression for v_i, then ∂v_i/∂v_k = 0, and there is no need to consider the corresponding term in the sum. In practice, this means that only a small number of terms is considered for each sum.

[Fig. 6.15: The forward mode propagates derivatives to all the variables that depend on the seeded input variable, v̇_j.]

Suppose we have four variables v1, v2, v3, and v4, with x ≡ v1 and f ≡ v4, and we want df/dx. We assume that each variable depends explicitly on all the previous ones. Using the chain rule (Eq. 6.22), we set j = 1 (because we want the derivative with respect to x ≡ v1) and increment i to obtain the following sequence:
\[
\begin{aligned}
\dot{v}_1 &= 1 \\
\dot{v}_2 &= \frac{\partial v_2}{\partial v_1}\dot{v}_1 \\
\dot{v}_3 &= \frac{\partial v_3}{\partial v_1}\dot{v}_1 + \frac{\partial v_3}{\partial v_2}\dot{v}_2 \\
\dot{v}_4 &= \frac{\partial v_4}{\partial v_1}\dot{v}_1 + \frac{\partial v_4}{\partial v_2}\dot{v}_2 + \frac{\partial v_4}{\partial v_3}\dot{v}_3 \equiv \frac{\mathrm{d}f}{\mathrm{d}x} .
\end{aligned} \qquad (6.23)
\]
It is helpful to consider the Jacobian of all the variables with respect to each other,

\[
J_v = \begin{bmatrix}
1 & 0 & 0 & 0 \\
\dfrac{\mathrm{d}v_2}{\mathrm{d}v_1} & 1 & 0 & 0 \\
\dfrac{\mathrm{d}v_3}{\mathrm{d}v_1} & \dfrac{\mathrm{d}v_3}{\mathrm{d}v_2} & 1 & 0 \\
\dfrac{\mathrm{d}v_4}{\mathrm{d}v_1} & \dfrac{\mathrm{d}v_4}{\mathrm{d}v_2} & \dfrac{\mathrm{d}v_4}{\mathrm{d}v_3} & 1
\end{bmatrix} . \qquad (6.24)
\]
By setting the seed 𝑣¤ 1 = 1 and using the forward chain rule (Eq. 6.22), we
have computed the first column of 𝐽𝑣 from top to bottom. This column
corresponds to the tangent with respect to 𝑣1 . Using forward-mode
AD, obtaining derivatives for other outputs is free (e.g., d𝑣3 /d𝑣1 ≡ 𝑣¤ 3
in Eq. 6.23).
However, if we want the derivatives with respect to additional
inputs, we would need to set a different seed and evaluate an entire
set of similar calculations. For example, if we wanted d𝑣4 /d𝑣2 , we
would set the seed as 𝑣¤ 2 = 1 and evaluate the equations for 𝑣¤ 3 and 𝑣¤ 4 ,
where we would now have d𝑣 4 /d𝑣2 = 𝑣¤ 4 . This would correspond to
computing the second column in 𝐽𝑣 (Eq. 6.24).
Thus, the cost of the forward mode scales linearly with the number
of inputs we are interested in and is independent of the number of
outputs.
Consider the function with two inputs and two outputs from Ex. 6.1. We
could evaluate the explicit expressions in this function using only two lines of
code. However, to make the AD process more apparent, we write the code such
that each line has a single unary or binary operation, which is how a computer
ends up evaluating the expression:
\[
\begin{aligned}
v_1 &= x_1 \\
v_2 &= x_2 \\
v_3 &= v_3(v_1, v_2) = v_1 v_2 \\
v_4 &= v_4(v_1) = \sin v_1 \\
v_5 &= v_5(v_3, v_4) = v_3 + v_4 = f_1 \\
v_6 &= v_6(v_2) = v_2^2 \\
v_7 &= v_7(v_3, v_6) = v_3 + v_6 = f_2 .
\end{aligned}
\]
Using the forward mode, set the seed 𝑣¤ 1 = 1, and 𝑣¤ 2 = 0 to obtain the derivatives
with respect to 𝑥1 . When using the chain rule (Eq. 6.22), only one or two partial
derivatives are nonzero in each sum because the operations are either unary
or binary in this case. For example, the addition operation that computes
𝑣5 does not depend explicitly on 𝑣2 , so 𝜕𝑣5 /𝜕𝑣2 = 0. To further elaborate,
when evaluating the operation 𝑣5 = 𝑣3 + 𝑣4 , we do not need to know how 𝑣3
was computed; we just need to know the value of the two numbers we are
adding. Similarly, when evaluating the derivative 𝜕𝑣5 /𝜕𝑣2 , we do not need
to know how or whether 𝑣3 and 𝑣4 depended on 𝑣2 ; we just need to know
how this one operation depends on 𝑣2 . So even though symbolic derivatives
are involved in individual operations, the overall process is distinct from
symbolic differentiation. We do not combine all the operations and end up
with a symbolic derivative. We develop a computational procedure to compute
the derivative that ends up with a number for a given input—similar to the
computational procedure that computes the functional outputs and does not
produce a symbolic functional output.
Say we want to compute d 𝑓2 /d𝑥1 , which in our example corresponds to
d𝑣7 /d𝑣1 . The evaluation point is the same as in Ex. 6.1: 𝑥 = (𝜋/4, 2). Using the
chain rule (Eq. 6.22) and considering only the nonzero partial derivative terms,
we get the following sequence:
\[
\begin{aligned}
\dot{v}_1 &= 1 \\
\dot{v}_2 &= 0 \\
\dot{v}_3 &= \frac{\partial v_3}{\partial v_1}\dot{v}_1 + \frac{\partial v_3}{\partial v_2}\dot{v}_2 = v_2 \cdot \dot{v}_1 + v_1 \cdot 0 = 2 \\
\dot{v}_4 &= \frac{\partial v_4}{\partial v_1}\dot{v}_1 = \cos v_1 \cdot \dot{v}_1 = 0.707\ldots \\
\dot{v}_5 &= \frac{\partial v_5}{\partial v_3}\dot{v}_3 + \frac{\partial v_5}{\partial v_4}\dot{v}_4 = 1 \cdot \dot{v}_3 + 1 \cdot \dot{v}_4 = 2.707\ldots \equiv \frac{\partial f_1}{\partial x_1} \\
\dot{v}_6 &= \frac{\partial v_6}{\partial v_2}\dot{v}_2 = 2 v_2 \cdot \dot{v}_2 = 0 \\
\dot{v}_7 &= \frac{\partial v_7}{\partial v_3}\dot{v}_3 + \frac{\partial v_7}{\partial v_6}\dot{v}_6 = 1 \cdot \dot{v}_3 + 1 \cdot \dot{v}_6 = 2 \equiv \frac{\partial f_2}{\partial x_1} .
\end{aligned} \qquad (6.25)
\]
This sequence is illustrated in matrix form in Fig. 6.16. The procedure is equivalent to performing forward substitution in this linear system.
We now have a procedure (not a symbolic expression) for computing df₂/dx₁ for any (x₁, x₂). The dependencies of these operations are shown in Fig. 6.17 as a computational graph.
Although we set out to compute df₂/dx₁, we also obtained df₁/dx₁ as a by-product. We can obtain the derivatives for all outputs with respect to one input for the same cost as computing the outputs. If we wanted the derivative with respect to the other input, df₁/dx₂, a new sequence of calculations would be necessary.

[Fig. 6.16: Dependency structure used in the forward chain rule propagation of Eq. 6.25. The forward mode is equivalent to solving a sparse lower triangular system by forward substitution.]

[Fig. 6.17: Computational graph for the numerical example evaluations, showing the forward propagation of the derivative with respect to x₁.]
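A minimal sketch of this forward-mode propagation done by hand in Python (NumPy assumed), carrying a (value, derivative) pair for each operation of Ex. 6.5 with the seed on x₁:

    import numpy as np

    x1, x2 = np.pi / 4, 2.0
    v1, v1d = x1, 1.0                 # seed: derivative with respect to x1
    v2, v2d = x2, 0.0
    v3, v3d = v1 * v2, v1d * v2 + v1 * v2d
    v4, v4d = np.sin(v1), np.cos(v1) * v1d
    v5, v5d = v3 + v4, v3d + v4d      # f1 and df1/dx1
    v6, v6d = v2 ** 2, 2 * v2 * v2d
    v7, v7d = v3 + v6, v3d + v6d      # f2 and df2/dx1
    print(v5d, v7d)                   # approximately 2.707 and 2.0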
A simple way to verify a gradient is to pick an arbitrary direction (say, p = [1, . . . , 1]), project the gradient onto that direction, and then compare the result to a finite-difference directional derivative (Eq. 6.10) in that direction. If the result matches the reference, then all the gradient elements are most likely correct (it is good practice to try a couple more directions just to be sure). However, if the result does not match, this directional derivative does not reveal which gradient elements are incorrect.
6.6.3 Reverse-Mode AD
The reverse mode is also based on the chain rule but uses the alternative
form:
\[
\frac{\mathrm{d}v_i}{\mathrm{d}v_j} = \sum_{k=j+1}^{i} \frac{\partial v_k}{\partial v_j}\frac{\mathrm{d}v_i}{\mathrm{d}v_k} , \qquad (6.26)
\]

Defining the adjoint variables \(\bar{v}_j \equiv \mathrm{d}v_i/\mathrm{d}v_j\) for a fixed output \(v_i\), we can write this as

\[
\bar{v}_j = \sum_{k=j+1}^{i} \frac{\partial v_k}{\partial v_j}\bar{v}_k . \qquad (6.27)
\]
This chain rule propagates the total derivatives backward after setting the reverse seed v̄_i = 1, as shown in Fig. 6.18. This affects all the variables on which the seeded variable depends.

[Fig. 6.18: The reverse mode propagates derivatives to all the variables on which the seeded output variable depends.]

The reverse-mode variables v̄ represent the derivatives of one output, i, with respect to all the input variables (instead of the derivatives of all the outputs with respect to one input, j, in the forward mode). Once we are done applying the reverse chain rule (Eq. 6.27) for the chosen output variable v_i, we end up with the total derivatives dv_i/dv_j for all j < i.
Applying the reverse mode to the same four-variable example as before, we get the following sequence of derivative computations (we set i = 4 and decrement j):

\[
\begin{aligned}
\bar{v}_4 &= 1 \\
\bar{v}_3 &= \frac{\partial v_4}{\partial v_3}\bar{v}_4 \\
\bar{v}_2 &= \frac{\partial v_3}{\partial v_2}\bar{v}_3 + \frac{\partial v_4}{\partial v_2}\bar{v}_4 \\
\bar{v}_1 &= \frac{\partial v_2}{\partial v_1}\bar{v}_2 + \frac{\partial v_3}{\partial v_1}\bar{v}_3 + \frac{\partial v_4}{\partial v_1}\bar{v}_4 \equiv \frac{\mathrm{d}f}{\mathrm{d}x} .
\end{aligned} \qquad (6.28)
\]
The partial derivatives of 𝑣 must be computed for 𝑣 4 first, then 𝑣3 , and
so on. Therefore, we have to traverse the code in reverse. In practice,
not every variable depends on every other variable, so a computational
graph is created during code evaluation. Then, when computing the
adjoint variables, we traverse the computational graph in reverse. As
before, the derivatives we need to compute in each line are only partial
derivatives.
Recall the Jacobian of the variables,
\[
J_v = \begin{bmatrix}
1 & 0 & 0 & 0 \\
\dfrac{\mathrm{d}v_2}{\mathrm{d}v_1} & 1 & 0 & 0 \\
\dfrac{\mathrm{d}v_3}{\mathrm{d}v_1} & \dfrac{\mathrm{d}v_3}{\mathrm{d}v_2} & 1 & 0 \\
\dfrac{\mathrm{d}v_4}{\mathrm{d}v_1} & \dfrac{\mathrm{d}v_4}{\mathrm{d}v_2} & \dfrac{\mathrm{d}v_4}{\mathrm{d}v_3} & 1
\end{bmatrix} . \qquad (6.29)
\]
By setting 𝑣¯ 4 = 1 and using the reverse chain rule (Eq. 6.27), we have
computed the last row of 𝐽𝑣 from right to left. This row corresponds
to the gradient of 𝑓 ≡ 𝑣4 . Using the reverse mode of AD, obtaining
derivatives with respect to additional inputs is free (e.g., d𝑣 4 /d𝑣 2 ≡ 𝑣¯ 2
in Eq. 6.28).
However, if we wanted the derivatives of additional outputs, we
would need to evaluate a different sequence of derivatives. For example,
if we wanted d𝑣3 /d𝑣1 , we would set 𝑣¯ 3 = 1 and evaluate the expressions
for 𝑣¯ 2 and 𝑣¯ 1 in Eq. 6.28, where d𝑣 3 /𝑑𝑣 1 ≡ 𝑣¯ 1 . Thus, the cost of
the reverse mode scales linearly with the number of outputs and is
independent of the number of inputs.
One complication with the reverse mode is that the resulting se-
quence of derivatives requires the values of the variables, starting with
the last ones and progressing in reverse. For example, the partial deriva-
tive in the second operation of Eq. 6.28 might involve 𝑣3 . Therefore, the
code needs to run in a forward pass first, and all the variables must be
stored for use in the reverse pass, which increases memory usage.
Suppose we want to compute ∂f₂/∂x₁ for the function from Ex. 6.5. First, we need to run the original code (a forward pass) and store the values of all the variables because they are necessary in the reverse chain rule (Eq. 6.26) to compute the numerical values of the partial derivatives. Furthermore, the reverse chain rule requires the information on all the dependencies to determine which partial derivatives are nonzero. The forward pass and dependencies are represented by the computational graph shown in Fig. 6.19.

[Fig. 6.19: Computational graph of the forward pass, showing the dependencies among the operations.]
Using the chain rule (Eq. 6.26) and setting the seed for the desired variable
𝑣¯ 7 = 1, we get
\[
\begin{aligned}
\bar{v}_7 &= 1 \\
\bar{v}_6 &= \frac{\partial v_7}{\partial v_6}\bar{v}_7 = \bar{v}_7 = 1 \\
\bar{v}_5 &= 0 \quad (\text{no variable that affects } v_7 \text{ depends on } v_5) \\
\bar{v}_4 &= \frac{\partial v_5}{\partial v_4}\bar{v}_5 = \bar{v}_5 = 0 \\
\bar{v}_3 &= \frac{\partial v_7}{\partial v_3}\bar{v}_7 + \frac{\partial v_5}{\partial v_3}\bar{v}_5 = \bar{v}_7 + \bar{v}_5 = 1 \\
\bar{v}_2 &= \frac{\partial v_6}{\partial v_2}\bar{v}_6 + \frac{\partial v_3}{\partial v_2}\bar{v}_3 = 2v_2\bar{v}_6 + v_1\bar{v}_3 = 4.785 = \frac{\partial f_2}{\partial x_2} \\
\bar{v}_1 &= \frac{\partial v_4}{\partial v_1}\bar{v}_4 + \frac{\partial v_3}{\partial v_1}\bar{v}_3 = (\cos v_1)\bar{v}_4 + v_2\bar{v}_3 = 2 = \frac{\partial f_2}{\partial x_1} .
\end{aligned} \qquad (6.30)
\]
After running the forward evaluation and storing the elements of 𝑣, we can run
the reverse pass shown in Fig. 6.20. This reverse pass is illustrated in matrix
form in Fig. 6.21. The procedure is equivalent to performing back substitution
in this linear system.
[Fig. 6.20: Computational graph for the reverse mode, showing the backward propagation of the derivative of f₂.]
Obtaining the derivative of f₂ with respect to the other input required evaluating only one more line of code. Conversely, if we want the derivatives of f₁, a whole new set of computations is needed.
In forward mode, the computation of a given derivative, v̇_i, requires the partial derivatives of the line of code that computes v_i with respect to its inputs. In the reverse case, however, to compute a given derivative, v̄_j, we require the partial derivatives with respect to v_j of the functions that the current variable v_j affects. Knowledge of the functions a variable affects is not encoded in that variable's computation, and that is why the computational graph is required.

[Fig. 6.21: The reverse pass in matrix form; the reverse mode is equivalent to solving a sparse upper triangular system by back substitution.]
AD tools that use source code transformation process the whole source
code automatically with a parser and add lines of code that compute
the derivatives. The added code is highlighted in Exs. 6.7 and 6.8.
Running an AD source transformation tool on the code from Ex. 6.2 produces
the code that follows.
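The transformed listing is not reproduced here, but a hedged Python sketch of what forward-mode source transformation of the fixed-point code could produce is shown below; variable names and statement ordering are illustrative, and real tools (such as Tapenade) emit different but equivalent code:

    import math

    def f_and_dfdx(x, xd=1.0):                   # xd is the seed, dx/dx = 1
        f = x
        fd = xd                                  # derivative statement paired with f = x
        for _ in range(10):
            fd = math.cos(x + f) * (xd + fd)     # derivative of sin(x + f); must use f before it is overwritten
            f = math.sin(x + f)
        return f, fd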
The AD tool added a new line after each variable assignment that computes the
corresponding derivative. We can then set the seed, 𝑥¤ = 1 and run the code. As
the loops proceed, 𝑓¤ accumulates the derivative as 𝑓 is successively updated.
The first loop is identical to the original code except for one line. Because the derivatives that accumulate in the reverse loop depend on the intermediate values of the variables, we need to store all the variables in the forward loop. We store and retrieve the variables using a stack, hence the call to “push”.†
The second loop, which runs in reverse, is where the derivatives are computed. We set the reverse seed, f̄ = 1, and then the adjoint variables accumulate the derivatives back to the start.

† A stack, also known as last in, first out (LIFO), is a data structure that stores a one-dimensional array. We can only add an element to the top of the stack (push) and take the element from the top of the stack (pop).
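A hedged sketch of what the reverse-mode (adjoint) version of the same loop could look like in Python: a forward sweep stores intermediate values on a stack, and a reverse sweep propagates the adjoint back to the input (names are illustrative, not a specific tool's output):

    import math

    def f_and_dfdx_reverse(x):
        stack = []
        f = x
        for _ in range(10):
            stack.append(f)            # push the value needed by the reverse sweep
            f = math.sin(x + f)
        fbar, xbar = 1.0, 0.0          # reverse seed on the output
        for _ in range(10):
            f_old = stack.pop()        # pop in reverse order
            c = math.cos(x + f_old)    # partial derivative of sin(x + f_old)
            xbar += c * fbar           # contribution to the input adjoint
            fbar = c * fbar            # adjoint of the previous f
        xbar += fbar                   # f was initialized as f = x
        return f, xbar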
Operator Overloading
In this approach, each variable carries both its value and its derivative, (v, v̇). An overloaded multiplication, for example, computes v₃ = v₁v₂ together with v̇₃ = v̇₁v₂ + v₁v̇₂ (Eq. 6.31), where we compute the original function value in the first term, and the second term carries the derivative of the multiplication.
Although we wrote the two parts explicitly in Eq. 6.31, the source
code would only show a normal multiplication, such as 𝑣3 = 𝑣1 · 𝑣 2 .
However, each of these variables would be of the new type and carry the
corresponding 𝑣¤ quantities. By overloading all the required operations,
the computations happen “behind the scenes”, and the source code
does not have to be changed, except to declare all the variables to be of
the new type and to set the seed. Example 6.9 lists the original code
from Ex. 6.2 with notes on the actual computations that are performed
as a result of overloading.
Using the derived data types and operator overloading approach in forward
mode does not change the code listed in Ex. 6.2. The AD tool provides
overloaded versions of the functions in use, which in this case are assignment,
addition, and sine. These functions are overloaded as follows:
\[
\begin{aligned}
v_2 = v_1 &\;\Rightarrow\; (v_2, \dot{v}_2) = (v_1, \dot{v}_1) \\
v_1 + v_2 &\;\Rightarrow\; (v_1, \dot{v}_1) + (v_2, \dot{v}_2) \equiv (v_1 + v_2,\ \dot{v}_1 + \dot{v}_2) \\
\sin(v) &\;\Rightarrow\; \sin\!\left((v, \dot{v})\right) \equiv (\sin(v),\ \cos(v)\,\dot{v}) .
\end{aligned}
\]
In this case, the source code is unchanged, but additional computations occur
through the overloaded functions. We reproduce the code that follows with
notes on the hidden operations that take place.
Input: x          x is of a new data type with two components, (x, ẋ)
f = x             (f, ḟ) = (x, ẋ) through the overloading of the “=” operation
for i = 1 to 10 do
    f = sin(x + f)        Code is unchanged, but overloading computes the derivative‡
end for
return f          The new data type includes ḟ, which is df/dx

‡ The overloading of “+” computes (v, v̇) = (x + f, ẋ + ḟ), and then the overloading of “sin” computes (f, ḟ) = (sin(v), cos(v) v̇).
We set the seed, 𝑥¤ = 1, and for each function assignment, we add the cor-
responding derivative line. As the loops are repeated, 𝑓¤ accumulates the
derivative as 𝑓 is successively updated.
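A minimal dual-number sketch of forward-mode operator overloading in Python, assuming only the operations needed for this example (assignment, addition, and sine):

    import math

    class Dual:
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.dot + other.dot)
        __radd__ = __add__

    def dsin(v):                        # overloaded sine: (sin v, cos(v) v-dot)
        return Dual(math.sin(v.val), math.cos(v.val) * v.dot)

    x = Dual(1.5, 1.0)                  # value and seed, x-dot = 1
    f = x
    for _ in range(10):
        f = dsin(x + f)                 # source is unchanged; derivatives ride along
    print(f.val, f.dot)                 # function value and df/dx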
The source code transformation and the operator overloading approaches each have their relative advantages and disadvantages. The overloading approach is much more elegant because the original code stays practically the same and can be maintained directly. On the other hand, the source transformation approach enlarges the original code and results in less readable code, making it hard to work with. Still, it is easier to see what operations take place when debugging. Instead of maintaining source code transformed by AD, it is advisable to work with the original source and devise a workflow where the parser is rerun before compiling a new version.
One advantage of the source code transformation approach is that
it tends to yield faster code and allows more straightforward compile-
time optimizations. The overloading approach requires a language that
supports user-defined data types and operator overloading, whereas
source transformation does not. Developing a source transformation
AD tool is usually more challenging than developing the overloading
approach because it requires an elaborate parser that understands the
source syntax.
For a matrix multiplication, C = AB, the forward mode yields

\[
\dot{C} = \dot{A}B + A\dot{B} . \qquad (6.33)
\]

The idea is to use Ȧ and Ḃ from the AD code preceding the operation and then manually implement this formula (bypassing any AD of the code that performs that operation) to obtain Ċ, as shown in Fig. 6.24. Then we can use Ċ to seed the remainder of the AD code.
The reverse mode of the multiplication yields

\[
\bar{A} = \bar{C}B^{\intercal} , \qquad \bar{B} = A^{\intercal}\bar{C} . \qquad (6.34)
\]
[Fig. 6.24: Matrix operations, including the solution of linear systems, can be differentiated manually to bypass more costly AD code.]
Similarly, for the solution of a linear system where C = A⁻¹B, the reverse mode yields

\[
\bar{B} = A^{-\intercal}\bar{C} , \qquad \bar{A} = -\bar{B}C^{\intercal} . \qquad (6.36)
\]
\[
r(u; x) = 0 , \qquad (6.37)
\]

where the semicolon denotes that the design variables x are fixed when these equations are solved for the state variables u. Through these equations, u is an implicit function of x. This relationship is represented by the box containing the solver and residual equations in Fig. 6.25. The functions of interest, f(x, u), are typically explicit functions of the state variables and the design variables. However, because u is an implicit function of x, f is ultimately an implicit function of x as well. To compute f for a given x, we must first find u such that r(u; x) = 0. This is usually the most computationally costly step and requires a solver (see Section 3.6). The residual equations could be nonlinear and require an iterative solution.

[Fig. 6.25: The solver computes the state variables u by driving the residuals r(u; x) to zero; the functions of interest f(x, u) are then computed from x and u.]
Recall Ex. 3.2, where we introduced the structural model of a truss structure. The residuals in this case are the linear equations

\[
r(u; x) = K(x)u - q = 0 ,
\]

where the state variables are the displacements, u, K is the stiffness matrix, and q is the vector of external forces. Solving for the displacements requires only a linear solver in this case, but it is still the most costly part of the analysis. Suppose that the design variables are the cross-sectional areas of the truss members. Then, the stiffness matrix is a function of x, but the external forces are not.
Suppose that the functions of interest are the stresses in each of the truss
members. This is an explicit function of the displacements, which is given by
𝑓 (𝑥, 𝑢) ≡ 𝜎(𝑢) = 𝑆𝑢 ,
\[
\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u}\frac{\mathrm{d}u}{\mathrm{d}x} , \qquad (6.39)
\]

where the result is an (n_f × n_x) matrix. Because the governing equations r(x, u) = 0 must remain satisfied for any change in x, their total derivative must also vanish, which yields the linear system

\[
\frac{\partial r}{\partial u}\frac{\mathrm{d}u}{\mathrm{d}x} = -\frac{\partial r}{\partial x} , \qquad (6.41)
\]

where ∂r/∂x and du/dx are both (n_u × n_x) matrices, and ∂r/∂u is a square matrix of size (n_u × n_u). This linear system is useful because if we provide the partial derivatives in this equation (which are cheap to compute), we can solve for du/dx and substitute it into Eq. 6.39 to obtain

\[
\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} - \frac{\partial f}{\partial u}\left(\frac{\partial r}{\partial u}\right)^{-1}\frac{\partial r}{\partial x} , \qquad (6.42)
\]
where all the derivative terms on the right-hand side are partial deriva-
tives. The partial derivatives in this equation can be computed using any
of the methods that we have described earlier: symbolic differentiation,
finite differences, complex step, or AD. Equation 6.42 shows two ways
to compute the total derivatives, which we call the direct method and the
adjoint method.
The direct method (already outlined earlier) consists of solving the linear system (Eq. 6.41) and substituting du/dx into Eq. 6.39. Defining φ ≡ −du/dx, we can rewrite Eq. 6.41 as

\[
\frac{\partial r}{\partial u}\phi = \frac{\partial r}{\partial x} . \qquad (6.43)
\]

After solving for φ (one column at a time), we can use it in the total derivative equation (Eq. 6.39) to obtain

\[
\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} - \frac{\partial f}{\partial u}\phi . \qquad (6.44)
\]
Solving the linear system (Eq. 6.43) is typically the most computa-
tionally expensive operation in this procedure. The cost of this approach
scales with the number of inputs 𝑛 𝑥 but is essentially independent
of the number of outputs 𝑛 𝑓 . This is the same scaling behavior as
finite differences and forward-mode AD. However, the constant of
proportionality is typically much smaller in the direct method because
we only need to solve the nonlinear equations 𝑟(𝑢; 𝑥) = 0 once to obtain
the states.
[Fig. 6.27: The total derivatives (Eq. 6.42) can be computed either by solving for φ (direct method) or by solving for ψ (adjoint method). The blocks in Eq. 6.42 have sizes (n_f × n_x), (n_f × n_u), (n_u × n_u), and (n_u × n_x); φ is (n_u × n_x) and ψᵀ is (n_f × n_u).]
The adjoint method instead defines the adjoint vectors ψ as the solution of the adjoint equations

\[
\frac{\partial r}{\partial u}^{\intercal}\psi = \frac{\partial f}{\partial u}^{\intercal} . \qquad (6.46)
\]

This linear system has no dependence on x. Each adjoint vector is associated with a function of interest f_j and is found by solving the adjoint equation (Eq. 6.46) with the corresponding row ∂f_j/∂u. The solution (ψ) is then used to compute the total derivative

\[
\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} - \psi^{\intercal}\frac{\partial r}{\partial x} . \qquad (6.47)
\]
This is sometimes called the reverse mode because it is analogous to
reverse-mode AD.
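A compact sketch of the direct and adjoint methods (Eqs. 6.43–6.47) using NumPy; the partial-derivative arrays below are random placeholders standing in for a real model:

    import numpy as np

    nu, nx, nf = 4, 3, 2
    rng = np.random.default_rng(0)
    dr_du = rng.random((nu, nu)) + nu * np.eye(nu)   # dr/du (made nonsingular)
    dr_dx = rng.random((nu, nx))                     # dr/dx
    df_dx = rng.random((nf, nx))                     # df/dx (partial)
    df_du = rng.random((nf, nu))                     # df/du

    # Direct method: one linear solve per input (Eq. 6.43), then Eq. 6.44
    phi = np.linalg.solve(dr_du, dr_dx)              # phi = -du/dx
    dfdx_direct = df_dx - df_du @ phi

    # Adjoint method: one linear solve per output (Eq. 6.46), then Eq. 6.47
    psi = np.linalg.solve(dr_du.T, df_du.T)          # columns are adjoint vectors
    dfdx_adjoint = df_dx - psi.T @ dr_dx

    print(np.allclose(dfdx_direct, dfdx_adjoint))    # both give the same totals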
Although implementing implicit analytic methods is labor intensive,

127. Martins, Perspectives on aerodynamic design optimization, 2020.
\[
\frac{\lambda}{m} + \cos\lambda = 0 .
\]

Figure 6.29 shows the equivalent of Fig. 6.25 in this case. Our goal is to compute the derivative df/dm. Because λ is an implicit function of m, we cannot find an explicit expression for λ as a function of m, substitute that expression into Eq. 6.48, and then differentiate normally. Fortunately, the implicit analytic methods allow us to compute this derivative.
Referring back to our nomenclature and applying Eq. 6.42, we obtain

\[
\frac{\mathrm{d}f}{\mathrm{d}m} = 2\lambda m + \frac{\lambda}{\dfrac{1}{m} - \sin\lambda} .
\]
Thus, we obtained the desired derivative despite the implicitly defined function.
Here, it was possible to get an explicit expression for the total derivative, but
generally, it is only possible to get a numeric value.
The adjoint linear system (Eq. 6.46) yields††

\[
K\psi_j = S_{j,*}^{\intercal} ,
\]

where j corresponds to each truss member, and S_{j,∗} is the jth row of S. Once we have ψ_j, we can use it to compute the total derivative of the stress in member j with respect to all truss member areas with Eq. 6.47, as follows:

\[
\frac{\mathrm{d}\sigma_j}{\mathrm{d}x} = -\psi_j^{\intercal}\frac{\partial}{\partial x}(Ku) .
\]

†† Usually, the stiffness matrix is symmetric, and Kᵀ = K. This means that the solver for the displacements can be repurposed for the adjoint computation by setting the right-hand side shown here instead of the loads. For that reason, this right-hand side is sometimes called a pseudo-load.
In this case, there is no advantage in using one method over the other because
the number of areas is the same as the number of stresses. However, if
we aggregated the stresses as suggested in Tip 6.7, the adjoint would be
advantageous.
This is yet another transpose vector product that can be obtained using
the same reverse AD code for the residuals, except that now the residual
seed is r̄ = ψ, and the product we want is given by x̄.
In sum, it is advantageous to use reverse-mode AD to compute
the partial derivative terms for the adjoint equations, especially if the
adjoint equations are solved using an iterative approach that requires
only matrix-vector products. Similar techniques and arguments apply
for the direct method, except that in that case, forward-mode AD is
advantageous for computing the partial derivatives.
For the implicit analytic methods, the dot-product test checks that the direct solution for input i and the adjoint solution for output j are mutually consistent:

\[
\psi_j^{\intercal}\frac{\partial r}{\partial x_i} = \frac{\partial f_j}{\partial u}\phi_i . \qquad (6.52)
\]

Each side of this equation yields a scalar that should match to working precision. The dot-product test verifies that your partial derivatives and the solutions for the direct and adjoint linear systems are consistent. For AD, the dot-product test for a code with inputs x and outputs f is as follows:

\[
\dot{x}^{\intercal}\bar{x} = \dot{x}^{\intercal}\frac{\partial f}{\partial x}^{\intercal}\bar{f} = \left(\frac{\partial f}{\partial x}\dot{x}\right)^{\!\intercal}\bar{f} = \dot{f}^{\intercal}\bar{f} . \qquad (6.53)
\]
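A sketch of the dot-product test (Eq. 6.53) for AD, assuming you have a forward-mode routine jvp(x, xdot) returning (∂f/∂x)ẋ and a reverse-mode routine vjp(x, fbar) returning (∂f/∂x)ᵀf̄; both names are hypothetical placeholders for your own code:

    import numpy as np

    def dot_product_test(jvp, vjp, x, n_f, rng=np.random.default_rng(0)):
        xdot = rng.random(x.size)      # arbitrary forward seed
        fbar = rng.random(n_f)         # arbitrary reverse seed
        lhs = xdot @ vjp(x, fbar)      # x-dot transposed times x-bar
        rhs = jvp(x, xdot) @ fbar      # f-dot transposed times f-bar
        return abs(lhs - rhs) <= 1e-12 * max(abs(lhs), abs(rhs), 1.0)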
Recall that to compute a Jacobian with finite differences, we perturb one design variable at a time,

\[
[x_1, x_2, \ldots, x_j + h, \ldots, x_{n_x}] . \qquad (6.54)
\]

Suppose, however, that the Jacobian is diagonal, for example,

\[
\frac{\mathrm{d}f}{\mathrm{d}x} \equiv \begin{bmatrix}
J_{11} & 0 & 0 & 0 & 0 \\
0 & J_{22} & 0 & 0 & 0 \\
0 & 0 & J_{33} & 0 & 0 \\
0 & 0 & 0 & J_{44} & 0 \\
0 & 0 & 0 & 0 & J_{55}
\end{bmatrix} . \qquad (6.55)
\]
For this scenario, the Jacobian can be constructed with one evaluation
rather than 𝑛 𝑥 evaluations. This is because a given output 𝑓𝑖 depends
on only one input 𝑥 𝑖 . We could think of the outputs as 𝑛 𝑥 independent
functions. Thus, for finite differencing, rather than requiring 𝑛 𝑥 input
vectors with 𝑛 𝑥 function evaluations, we can use one input vector, as
follows:
[𝑥1 + ℎ, 𝑥 2 + ℎ, . . . , 𝑥5 + ℎ] , (6.56)
allowing us to compute all the nonzero entries in one pass.∗

∗ Curtis et al.¹³⁰ were the first to show that the number of function evaluations could be reduced for sparse Jacobians.
130. Curtis et al., On the estimation of sparse Jacobian matrices, 1974.

Although the diagonal case is easy to understand, it is a special situation. To generalize this concept, let us consider the following (5 × 6) matrix as an example:

\[
\begin{bmatrix}
J_{11} & 0 & 0 & J_{14} & 0 & J_{16} \\
0 & 0 & J_{23} & J_{24} & 0 & 0 \\
J_{31} & J_{32} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & J_{45} & 0 \\
0 & 0 & J_{53} & 0 & J_{55} & J_{56}
\end{bmatrix} . \qquad (6.57)
\]
A subset of columns that does not have more than one nonzero in any given row is said to be structurally orthogonal. In this example, the following sets of columns are structurally orthogonal: (1, 3), (1, 5), (2, 3), (2, 4, 5), (2, 6), and (4, 5). Structurally orthogonal columns can be combined, forming a smaller Jacobian that reduces the number of forward passes required. This reduced Jacobian is referred to as compressed. There is more than one way to compress this Jacobian, but in this case, the minimum number of compressed columns—referred to as colors—is three. For example, we can combine columns (1, 3), columns (2, 4, 5), and column 6 into three compressed columns (other groupings are possible); a coding sketch follows.
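A sketch of compressed (colored) finite differencing for the sparsity pattern of Eq. 6.57, assuming f maps R⁶ to R⁵ as NumPy arrays and that the sparsity pattern is known; the particular coloring below is one valid choice:

    import numpy as np

    rows_of_col = {0: [0, 2], 1: [2], 2: [1, 4], 3: [0, 1], 4: [3, 4], 5: [0, 4]}
    groups = [[0, 2], [1, 3, 4], [5]]        # structurally orthogonal column groups

    def colored_fd_jacobian(f, x, h=1e-6):
        x = np.asarray(x, dtype=float)
        f0 = np.asarray(f(x))
        J = np.zeros((f0.size, x.size))
        for group in groups:                 # one function evaluation per color
            xp = x.copy()
            xp[group] += h
            df = (np.asarray(f(xp)) - f0) / h
            for j in group:                  # scatter the compressed column back
                J[rows_of_col[j], j] = df[rows_of_col[j]]
        return J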
When a model is evaluated at multiple conditions of interest, the resulting Jacobian is sparse. Examples include evaluating the power produced by a wind turbine at different wind speeds.

132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
[Figure: Jacobian computation time for plain AD compared with AD with coloring; exploiting sparsity through coloring reduces the time by orders of magnitude.]
Now that we have introduced all the methods for computing deriva-
tives, we will see how they are connected. For example, we have
mentioned that the direct and adjoint methods are analogous to the forward and reverse modes of AD, respectively, but we did not show this connection explicitly.
Suppose we have a set of n residual equations with the same number of unknowns,

\[
r_i(u_1, u_2, \ldots, u_n) = 0, \quad i = 1, \ldots, n , \qquad (6.60)
\]

and that there is at least one solution u* such that r(u*) = 0. Such a solution can be visualized for n = 2, as shown in Fig. 6.35.

[Fig. 6.35: Solution of a system of two equations expressed by residuals.]

These residuals are general: each one can depend on any subset of the variables u and can be truly implicit functions or explicit functions converted to the implicit form (see Section 3.3 and Ex. 3.3). The total differentials for these residuals are

\[
\mathrm{d}r_i = \frac{\partial r_i}{\partial u_1}\mathrm{d}u_1 + \ldots + \frac{\partial r_i}{\partial u_n}\mathrm{d}u_n , \quad i = 1, \ldots, n . \qquad (6.61)
\]

These represent first-order changes in r due to perturbations in u. The differentials of u can be visualized as perturbations in the space of the variables. The differentials of r can be visualized as linear changes to the surface defined by r = 0, as illustrated in Fig. 6.36.
We can write the differentials (Eq. 6.61) in matrix form as

\[
\begin{bmatrix}
\dfrac{\partial r_1}{\partial u_1} & \cdots & \dfrac{\partial r_1}{\partial u_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial r_n}{\partial u_1} & \cdots & \dfrac{\partial r_n}{\partial u_n}
\end{bmatrix}
\begin{bmatrix} \mathrm{d}u_1 \\ \vdots \\ \mathrm{d}u_n \end{bmatrix}
=
\begin{bmatrix} \mathrm{d}r_1 \\ \vdots \\ \mathrm{d}r_n \end{bmatrix} . \qquad (6.62)
\]

[Fig. 6.36: The differentials of r visualized as first-order perturbations of the curves defined by r = 0.]

Considering a perturbation in one residual at a time, we can define the total derivative Jacobian du/dr, which satisfies

\[
\frac{\partial r}{\partial u}\frac{\mathrm{d}u}{\mathrm{d}r} = I . \qquad (6.65)
\]
Compared with the implicit analytic equations derived in Section 6.7.2, these look like derivatives of states with respect to residuals, not the derivatives that we ultimately want to compute (df/dx). However, we will soon see that with the appropriate choice of r and u, we can obtain a linear system that solves for the total derivatives we want.
With Eq. 6.65, we can solve one column at a time. Similar to AD, we can also solve for the rows instead by transposing the systems as†

\[
\frac{\partial r}{\partial u}^{\intercal}\frac{\mathrm{d}u}{\mathrm{d}r}^{\intercal} = I , \qquad (6.66)
\]

† Normally, for two matrices A and B, (AB)ᵀ = BᵀAᵀ, but in this case, AB = I ⇒ B = A⁻¹ ⇒ Bᵀ = A⁻ᵀ ⇒ AᵀBᵀ = I.
which is the reverse form of the UDE. Now, each column j yields du_j/dr—the total derivative of one variable with respect to all the residuals. This total derivative is interpreted visually in Fig. 6.38.
The usefulness of the total derivative Jacobian du/dr might still not be apparent. In the next section, we explain how to set up the UDE to include df/dx in the UDE unknowns (du/dr).
Example 6.14 Computing and interpreting du/dr

[Fig. 6.38: The total derivatives du₁/dr₁ and du₁/dr₂ represent the first-order change in u₁ resulting from perturbations dr₁ and dr₂.]

Suppose we want to find the rectangle that is inscribed in the ellipse given by

\[
r_1(u_1, u_2) = \frac{u_1^2}{4} + u_2^2 - 1 = 0 .
\]
A change in this residual represents a change in the size of the ellipse without
changing its proportions. Of all the possible rectangles that can be inscribed in
the ellipse, we want the rectangle with an area that is half of that of this ellipse,
such that
𝑟2 (𝑢1 , 𝑢2 ) = 4𝑢1 𝑢2 − 𝜋 = 0 .
A change in this residual represents a change in the area of the rectangle. There
are two solutions, as shown in the left pane of Fig. 6.39. These solutions can be
found using Newton’s method, which converges to one solution or the other,
depending on the starting guess. We will pick the one on the right, which is
[𝑢1 , 𝑢2 ] = [1.79944, 0.43647]. The solution represents the coordinates of the
rectangle corner that touches the ellipse.
Taking the partial derivatives, we can write the forward UDE (Eq. 6.65) for
this problem as follows:
\[
\begin{bmatrix} u_1/2 & 2u_2 \\ 4u_2 & 4u_1 \end{bmatrix}
\begin{bmatrix} \dfrac{\mathrm{d}u_1}{\mathrm{d}r_1} & \dfrac{\mathrm{d}u_1}{\mathrm{d}r_2} \\ \dfrac{\mathrm{d}u_2}{\mathrm{d}r_1} & \dfrac{\mathrm{d}u_2}{\mathrm{d}r_2} \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} . \qquad (6.67)
\]
[Fig. 6.39: Rectangle inscribed in ellipse problem. Left: residuals and solution. Right: first-order perturbation view showing the interpretation of du/dr₁.]
Solving this linear system for each of the two right-hand sides, we get

\[
\begin{bmatrix} \dfrac{\mathrm{d}u_1}{\mathrm{d}r_1} & \dfrac{\mathrm{d}u_1}{\mathrm{d}r_2} \\ \dfrac{\mathrm{d}u_2}{\mathrm{d}r_1} & \dfrac{\mathrm{d}u_2}{\mathrm{d}r_2} \end{bmatrix}
= \begin{bmatrix} 1.45353 & -0.17628 \\ -0.35257 & 0.18169 \end{bmatrix} . \qquad (6.68)
\]
These derivatives reflect the change in the coordinates of the point where the
rectangle touches the ellipse as a result of a perturbation in the size of the
ellipse, d𝑟1 , and the area of the rectangle d𝑟2 . The right side of Fig. 6.39 shows
the visual interpretation of d𝑢/d𝑟1 as an example.
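A minimal NumPy sketch reproducing this example: build the partial-derivative matrix at the solution and solve the forward UDE (Eq. 6.65) for du/dr:

    import numpy as np

    u = np.array([1.79944, 0.43647])           # solution of r(u) = 0 from the example
    dr_du = np.array([[u[0] / 2, 2 * u[1]],    # partial derivatives of r1 and r2
                      [4 * u[1], 4 * u[0]]])
    du_dr = np.linalg.solve(dr_du, np.eye(2))  # (dr/du) (du/dr) = I
    print(du_dr)   # approximately [[1.4535, -0.1763], [-0.3526, 0.1817]], matching Eq. 6.68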
For the full system, we group the design variables, states, and functions of interest into a single vector of UDE variables, û ≡ [x, u, f]. This is a vector with n_x + n_u + n_f variables. For the residuals, we need a vector with the same size. We can obtain this by realizing that the residuals associated with the inputs and outputs are just explicit functions that can be written in implicit form. Then, we have

\[
\hat{r} \equiv \begin{bmatrix} x - \check{x} \\ r - \check{r}(x, u) \\ f - \check{f}(x, u) \end{bmatrix} = 0 . \qquad (6.70)
\]
Here, we distinguish 𝑥 (the actual variable in the UDE system) from
𝑥ˇ (a given input) and 𝑓 (the variable) from 𝑓ˇ (an explicit function of
𝑥 and 𝑢). Similarly, 𝑟 is the vector of variables associated with the
residual and 𝑟ˇ is the residual function itself. Taking the differential of
the residuals, and considering only one of them to be nonzero at a time,
we obtain,
\[
\mathrm{d}\hat{r} \equiv \begin{bmatrix} \mathrm{d}x \\ \mathrm{d}r \\ \mathrm{d}f \end{bmatrix} . \qquad (6.71)
\]
Using these variable and residual definitions in Eqs. 6.65 and 6.66 yields
the full UDE shown in Fig. 6.40, where the block we ultimately want to
compute is d 𝑓 /d𝑥.
\[
\begin{bmatrix}
I & 0 & 0 \\
-\dfrac{\partial \check{r}}{\partial x} & -\dfrac{\partial \check{r}}{\partial u} & 0 \\
-\dfrac{\partial \check{f}}{\partial x} & -\dfrac{\partial \check{f}}{\partial u} & I
\end{bmatrix}
\begin{bmatrix}
I & 0 & 0 \\
\dfrac{\mathrm{d}u}{\mathrm{d}x} & \dfrac{\mathrm{d}u}{\mathrm{d}r} & 0 \\
\dfrac{\mathrm{d}f}{\mathrm{d}x} & \dfrac{\mathrm{d}f}{\mathrm{d}r} & I
\end{bmatrix}
= I =
\begin{bmatrix}
I & -\dfrac{\partial \check{r}}{\partial x}^{\intercal} & -\dfrac{\partial \check{f}}{\partial x}^{\intercal} \\
0 & -\dfrac{\partial \check{r}}{\partial u}^{\intercal} & -\dfrac{\partial \check{f}}{\partial u}^{\intercal} \\
0 & 0 & I
\end{bmatrix}
\begin{bmatrix}
I & \dfrac{\mathrm{d}u}{\mathrm{d}x}^{\intercal} & \dfrac{\mathrm{d}f}{\mathrm{d}x}^{\intercal} \\
0 & \dfrac{\mathrm{d}u}{\mathrm{d}r}^{\intercal} & \dfrac{\mathrm{d}f}{\mathrm{d}r}^{\intercal} \\
0 & 0 & I
\end{bmatrix}
\]

[Fig. 6.40: The direct and adjoint methods can be recovered from the UDE.]

To compute df/dx using the forward UDE (left-hand side of the equation in Fig. 6.40), we can ignore all but three blocks in the total
derivatives matrix: 𝐼, d𝑢/d𝑥, and d 𝑓 /d𝑥. By multiplying these blocks
and using the definition 𝜙 ≡ − d𝑢/d𝑥, we recover the direct linear
system (Eq. 6.43) and the total derivative equation (Eq. 6.44).
To compute d 𝑓 /d𝑥 using the reverse UDE (right-hand side of
the equation in Fig. 6.40), we can ignore all but three blocks in the
total derivatives matrix: 𝐼, d 𝑓 /d𝑟, and d 𝑓 /d𝑥. By multiplying these
blocks and defining 𝜓 ≡ − d 𝑓 /d𝑟, we recover the adjoint linear system
(Eq. 6.46) and the corresponding total derivative equation (Eq. 6.47). The
definition of 𝜓 here is significant because, as mentioned in Section 6.7.2,
the adjoint vector is the total derivative of the objective function with
respect to the governing equation residuals.
Say we want to compute the total derivatives of the perimeter of the rectangle
from Ex. 6.14 with respect to the axes of the ellipse. The equation for the ellipse
can be rewritten as
\[
r_3(u_1, u_2) = \frac{u_1^2}{x_1^2} + \frac{u_2^2}{x_2^2} - 1 = 0 ,
\]
where 𝑥1 and 𝑥2 are the semimajor and semiminor axes of the ellipse, respec-
tively. The baseline values are [𝑥1 , 𝑥2 ] = [2, 1]. The residual for the rectangle
area is
\[
r_4(u_1, u_2) = 4 u_1 u_2 - \frac{\pi}{2} x_1 x_2 = 0 .
\]
To add the independent variables 𝑥1 and 𝑥2 , we write them as residuals in
implicit form as
𝑟1 (𝑥1 ) = 𝑥1 − 2 = 0, 𝑟2 (𝑥2 ) = 𝑥2 − 1 = 0 .
Finally, the function of interest is the perimeter of the rectangle, which we also write in implicit form as

\[
r_5(u_1, u_2) = f - 4(u_1 + u_2) = 0 .
\]
Now we have a system of five equations and five variables, with the dependencies shown in Fig. 6.41. The first two variables in x are given, and we can compute u and f using a solver as before.

[Fig. 6.41: Dependencies of the residuals.]

Taking all the partial derivatives, we get the following forward system:

\[
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
-\dfrac{2u_1^2}{x_1^3} & -\dfrac{2u_2^2}{x_2^3} & \dfrac{2u_1}{x_1^2} & \dfrac{2u_2}{x_2^2} & 0 \\
-\dfrac{\pi}{2}x_2 & -\dfrac{\pi}{2}x_1 & 4u_2 & 4u_1 & 0 \\
0 & 0 & -4 & -4 & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
\dfrac{\mathrm{d}u_1}{\mathrm{d}x_1} & \dfrac{\mathrm{d}u_1}{\mathrm{d}x_2} & \dfrac{\mathrm{d}u_1}{\mathrm{d}r_3} & \dfrac{\mathrm{d}u_1}{\mathrm{d}r_4} & 0 \\
\dfrac{\mathrm{d}u_2}{\mathrm{d}x_1} & \dfrac{\mathrm{d}u_2}{\mathrm{d}x_2} & \dfrac{\mathrm{d}u_2}{\mathrm{d}r_3} & \dfrac{\mathrm{d}u_2}{\mathrm{d}r_4} & 0 \\
\dfrac{\mathrm{d}f}{\mathrm{d}x_1} & \dfrac{\mathrm{d}f}{\mathrm{d}x_2} & \dfrac{\mathrm{d}f}{\mathrm{d}r_3} & \dfrac{\mathrm{d}f}{\mathrm{d}r_4} & 1
\end{bmatrix}
= I .
\]
We only want the two d 𝑓 /d𝑥 terms in this equation. We can either solve this
linear system twice to compute the first two columns, or we can compute both
terms with a single solution of the reverse (transposed) system. Transposing
the system, substituting the numerical values for 𝑥 and 𝑢, and removing the
total derivative terms that we do not need, we get the following system:
\[
\begin{bmatrix}
1 & 0 & -0.80950 & -1.57080 & 0 \\
0 & 1 & -0.38101 & -3.14159 & 0 \\
0 & 0 & 0.89972 & 1.74588 & -4 \\
0 & 0 & 0.87294 & 7.19776 & -4 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
\mathrm{d}f/\mathrm{d}x_1 \\
\mathrm{d}f/\mathrm{d}x_2 \\
\mathrm{d}f/\mathrm{d}r_3 \\
\mathrm{d}f/\mathrm{d}r_4 \\
1
\end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} .
\]
Solving this linear system, we obtain

\[
\begin{bmatrix}
\mathrm{d}f/\mathrm{d}x_1 \\
\mathrm{d}f/\mathrm{d}x_2 \\
\mathrm{d}f/\mathrm{d}r_3 \\
\mathrm{d}f/\mathrm{d}r_4
\end{bmatrix}
=
\begin{bmatrix}
3.59888 \\
1.74588 \\
4.40385 \\
0.02163
\end{bmatrix} .
\]
The total derivatives of interest are shown in Fig. 6.42.
We could have obtained the same solution using the adjoint equations from Section 6.7.2. The only difference is the nomenclature because the adjoint vector in this case is ψ = −[df/dr₃, df/dr₄]. We can interpret these terms as the change of f with respect to changes in the ellipse size and rectangle area, respectively.
[Fig. 6.42: Contours of f as a function of x and the total derivatives at x = [2, 1].]

6.9.3 Recovering AD

Now we will see how we can recover AD from the UDE. First, we define the UDE variables associated with each operation or line of code (assuming all loops have been unrolled), such that u ≡ v and

\[
r_i = v_i - \check{v}_i(v_1, \ldots, v_{i-1}) . \qquad (6.73)
\]

Recall from Section 6.6.1 that each variable is an explicit function of the previous ones.
The forward form of the UDE (Eq. 6.65) then becomes

\[
\begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 \\
-\dfrac{\partial \check{v}_2}{\partial v_1} & 1 & 0 & & \vdots \\
\vdots & & \ddots & & \vdots \\
-\dfrac{\partial \check{v}_n}{\partial v_1} & \cdots & -\dfrac{\partial \check{v}_n}{\partial v_{n-1}} & & 1
\end{bmatrix}
\begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 \\
\dfrac{\mathrm{d}v_2}{\mathrm{d}v_1} & 1 & 0 & & \vdots \\
\vdots & & \ddots & & \vdots \\
\dfrac{\mathrm{d}v_n}{\mathrm{d}v_1} & \cdots & \dfrac{\mathrm{d}v_n}{\mathrm{d}v_{n-1}} & & 1
\end{bmatrix}
= I . \qquad (6.74)
\]
This equation is the matrix form of the AD forward chain rule (Eq. 6.21),
where each column of the total derivative matrix corresponds to the
tangent vector (𝑣¤ ) for the chosen input variable. As observed in Fig. 6.16,
the partial derivatives form a lower triangular matrix. The Jacobian
we ultimately want to compute (d 𝑓 /d𝑥) is composed of a subset of
derivatives in the bottom-left corner near the d𝑣 𝑛 /d𝑣1 term. To compute
these derivatives, we need to perform forward substitution and compute
one column of the total derivative matrix at a time, where each column
is associated with the inputs of interest.
Similarly, the reverse form of the UDE (Eq. 6.66) yields the transpose
of Eq. 6.74,
\[
\begin{bmatrix}
1 & -\dfrac{\partial \check{v}_2}{\partial v_1} & \cdots & -\dfrac{\partial \check{v}_n}{\partial v_1} \\
0 & 1 & & \vdots \\
\vdots & & \ddots & -\dfrac{\partial \check{v}_n}{\partial v_{n-1}} \\
0 & \cdots & 0 & 1
\end{bmatrix}
\begin{bmatrix}
1 & \dfrac{\mathrm{d}v_2}{\mathrm{d}v_1} & \cdots & \dfrac{\mathrm{d}v_n}{\mathrm{d}v_1} \\
0 & 1 & & \vdots \\
\vdots & & \ddots & \dfrac{\mathrm{d}v_n}{\mathrm{d}v_{n-1}} \\
0 & \cdots & 0 & 1
\end{bmatrix}
= I . \qquad (6.75)
\]
This is equivalent to the AD reverse chain rule (Eq. 6.26), where each
column of the total derivative matrix corresponds to the gradient vector
(𝑣¯ ) for the chosen output variable. The partial derivatives now form
an upper triangular matrix, as previously shown in Fig. 6.21. The
derivatives of interest are now near the top-right corner of the total
derivative matrix near the d𝑣 𝑛 /d𝑣1 term. To compute these derivatives,
we need to perform back substitutions, computing one column of the
matrix at a time.
When scaling a problem (Tips 4.4 and 5.3), you should be aware that the
scale changes also affect the derivatives. You can apply the derivative methods
of this chapter to the scaled function directly. However, scaling is often done
outside the model because the scaling is specific to the optimization problem. In
this case, you may want to use the original functions and derivatives and make
the necessary modifications in an outer function that provides the objectives
and constraints.
Using the nomenclature introduced in Tip 4.4, we represent the scaled design variables given to the optimizer as x̄. Then, the unscaled variables are x = s_x x̄. Thus, the required scaled derivatives are

\[
\frac{\mathrm{d}\bar{f}}{\mathrm{d}\bar{x}} = \frac{\mathrm{d}f}{\mathrm{d}x}\frac{s_x}{s_f} . \qquad (6.76)
\]
Tip 6.10 Provide your own derivatives and use finite differences only
as a last resort
Because of the step-size dilemma, finite differences are often the cause of
failed optimizations. To put it more dramatically, finite differences are the root
of all evil. Most gradient-based optimization software uses finite differences
internally as a default if you do not provide your own gradients. Although
some software packages try to find reasonable finite-difference steps, it is easy
to get inaccurate derivatives, which then causes optimization difficulties or
total failure. This is the top reason why beginners give up on gradient-based
optimization!
Instead, you should provide gradients computed using one of the other
methods described in this chapter. In contrast with finite differences, the
derivatives computed by the other methods are usually as accurate as the
function computation. You should also avoid using finite-difference derivatives
as a reference for a definitive verification of the other methods.
If you have to use finite differences as a last resort, make sure to do a step-
size study (see Tip 6.2). You should then provide your own finite-difference
derivatives to the optimization or make sure that the optimizer finite-difference
estimates are acceptable.
6.10 Summary

The cost of forward-mode AD scales with the number of design variables, but the scaling factor for the forward mode is generally lower than that for finite differences. The cost of reverse-mode AD is independent of the number of design variables.
Implicit analytic methods (direct and adjoint) are accurate and efficient, but they require considerably more implementation effort.
Problems
What does that mean, and how should you show those points on
the plot?
6.5 Suppose you have two airplanes that are flying in a horizontal
plane defined by 𝑥 and 𝑦 coordinates. Both airplanes start at 𝑦 = 0,
but airplane 1 starts at 𝑥 = 0, whereas airplane 2 has a head start
of 𝑥 = Δ𝑥. The airplanes fly at a constant velocity. Airplane 1 has
a velocity of 𝑣1 in the direction of the positive 𝑥-axis, and airplane
2 has a velocity of 𝑣2 at an angle 𝛾 with the 𝑥-axis. The functions
of interest are the distance (𝑑) and the angle (𝜃) between the two
airplanes as a function of time. The independent variables are
Δ𝑥, 𝛾, 𝑣1 , 𝑣2 , and 𝑡. Write the code that computes the functions of
interest (outputs) for a given set of independent variables (inputs).
Use AD to differentiate the code. Choose a set of inputs, compute
the derivatives of all the outputs with respect to the inputs, and
verify them against the complex-step method.
𝐸 − 𝑒 sin(𝐸) = 𝑀 ,
6.7 Compute the derivatives for the 10-bar truss problem described
in Appendix D.2.2 using the direct and adjoint implicit differenti-
ation methods. Compute the derivatives of the objective (mass)
with respect to the design variables (10 cross-sectional areas),
and the derivatives of the constraints (stresses in all 10 bars)
with respect to the design variables (a 10 × 10 Jacobian matrix).
Compute the derivatives using the following:
6.8 You can now solve the 10-bar truss problem (previously solved in
Prob. 5.15) using the derivatives computed in Prob. 6.7. Solve this
optimization problem using both finite-difference derivatives and
derivatives computed using an implicit analytic method. Report
the following:
6.9 Aggregate the constraints for the 10-bar truss problem and extend
the code from Prob. 6.7 to compute the required constraint deriva-
tives using the implicit analytic method that is most advantageous
in this case. Verify your derivatives against the complex-step
method. Solve the optimization problem and compare your re-
sults to the ones you obtained in Prob. 6.8. How close can you get
to the reference solution?
Gradient-Free Optimization
7
Gradient-free algorithms fill an essential role in optimization. The
gradient-based algorithms introduced in Chapter 4 are efficient in
finding local minima for high-dimensional nonlinear problems defined
by continuous smooth functions. However, the assumptions made
for these algorithms are not always valid, which can render these
algorithms ineffective. Also, gradients might not be available when a
function is given as a black box.
In this chapter, we introduce only a few popular representative
gradient-free algorithms. Most are designed to handle unconstrained
functions only, but they can be adapted to solve constrained problems
by using the penalty or filtering methods introduced in Chapter 5. We
start by discussing the problem characteristics relevant to the choice
between gradient-free and gradient-based algorithms and then give an
overview of the types of gradient-free algorithms.
Gradient-free algorithms are easier to get up and running but are much less efficient, particularly as the dimension of the problem increases.
One significant advantage of gradient-free algorithms is that they
do not assume function continuity. For gradient-based algorithms,
function smoothness is essential when deriving the optimality con-
ditions, both for unconstrained functions and constrained functions.
More specifically, the Karush–Kuhn–Tucker (KKT) conditions (Eq. 5.11)
require that the function be continuous in value (𝐶 0 ), gradient (𝐶 1 ), and
Hessian (𝐶 2 ) in at least a small neighborhood of the optimum. If, for
example, the gradient is discontinuous at the optimum, it is undefined,
and the KKT conditions are not valid. Away from optimum points, this
requirement is not as stringent. Although gradient-based algorithms
work on the same continuity assumptions, they can usually tolerate
the occasional discontinuity, as long as it is away from an optimum
point. However, for functions with excessive numerical noise and
discontinuities, gradient-free algorithms might be the only option.
Many considerations are involved when choosing between a gradient-
based and a gradient-free algorithm. Some of these considerations are
common sources of misconception. One problem characteristic often
cited as a reason for choosing gradient-free methods is multimodality.
Design space multimodality can be a result of an objective function
with multiple local minima. In the case of a constrained problem, the
multimodality can arise from the constraints that define disconnected
or nonconvex feasible regions.
As we will see shortly, some gradient-free methods feature a global
search that increases the likelihood of finding the global minimum. This
feature makes gradient-free methods a common choice for multimodal
problems. However, not all gradient-free methods are global search
methods; some perform only a local search. Additionally, even though
gradient-based methods are by themselves local search methods, they
are often combined with global search strategies, as discussed in Tip 4.8.
It is not necessarily true that a global search, gradient-free method is
more likely to find a global optimum than a multistart gradient-based
method. As always, problem-specific testing is needed.
Furthermore, it is assumed far too often that any complex prob-
lem is multimodal, but that is frequently not the case. Although it
might be impossible to prove that a function is unimodal, it is easy to
prove that a function is multimodal simply by finding another local
minimum. Therefore, we should not make any assumptions about
the multimodality of a function until we show definite multiple local
minima. Additionally, we must ensure that perceived local minima are
not artificial minima arising from numerical noise.
[Fig. 7.1: Cost of optimization for increasing number of design variables in the n-dimensional Rosenbrock function. A gradient-free algorithm is compared with a gradient-based algorithm, with gradients computed analytically. The gradient-based algorithm has much better scalability.]
[Table 7.1: Classification of gradient-free optimization methods using the characteristics of Fig. 1.22—whether the search is local or global, the algorithm is deterministic or stochastic, the function evaluations are direct or surrogate-based, and whether the method is heuristic. The methods classified are Nelder–Mead, GPS, MADS, trust region, implicit filtering, DIRECT, MCS, EGO, hit and run, and evolutionary algorithms.]
but it estimates lower and upper bounds for the optimum by using
the function variation between partitions. MCS is another algorithm
that partitions the design space into boxes, where a limit is imposed on
how small the boxes can get based on the number of times it has been
divided.
Global-search algorithms based on surrogate models are similar to
their local search counterparts. However, they use surrogate models
to reproduce the features of a multimodal function instead of convex
surrogate models. One of the most widely used of these algorithms is
efficient global optimization (EGO), which employs kriging surrogate
models and uses the idea of expected improvement to maximize the
likelihood of finding the optimum more efficiently (surrogate modeling techniques, including kriging, are introduced in Chapter 10, which also describes EGO). Other algorithms use radial basis functions (RBFs) as
the surrogate model and also maximize the probability of improvement
at new iterates.
Stochastic algorithms rely on one or more nondeterministic procedures; they include hit-and-run algorithms and the broad class of evolutionary algorithms. When performing benchmarks of a stochastic algorithm, you should run a large enough number of optimizations to obtain statistically meaningful results.
Hit-and-run algorithms generate random steps about the current iterate in search of better points. A new point is accepted when it is better than the current one, and this process repeats until the point cannot be improved.
What constitutes an evolutionary algorithm is not well defined.‡ Evolutionary algorithms are inspired by processes that occur in nature or society. There is a plethora of evolutionary algorithms in the literature, thanks to the fertile imagination of the research community and a never-ending supply of phenomena for inspiration.§ These algorithms are more of an analogy of the phenomenon than an actual model. They are, at best, simplified models and, at worst, barely connected to the phenomenon. Nature-inspired algorithms tend to invent a specific terminology for the mathematical terms in the optimization problem. For example, a design point might be called a “member of the population”, or the objective function might be the “fitness”.
The vast majority of evolutionary algorithms are population based, which means they involve a set of points at each iteration instead of a single one (we discuss a genetic algorithm in Section 7.6 and a particle swarm method in Section 7.7). Because the population is spread out in the design space, evolutionary algorithms perform a global search. The stochastic elements in these algorithms contribute to global exploration.

‡ Simon¹⁴⁰ provides a more comprehensive coverage of evolutionary algorithms.
140. Simon, Evolutionary Optimization Algorithms, 2013.

§ These algorithms include the following: ant colony optimization, artificial bee colony algorithm, artificial fish swarm, artificial flora optimization algorithm, bacterial foraging optimization, bat algorithm, big bang–big crunch algorithm, biogeography-based optimization, bird mating optimizer, cat swarm optimization, cockroach swarm optimization, cuckoo search, design by shopping paradigm, dolphin echolocation algorithm, elephant herding optimization, firefly algorithm, flower pollination algorithm, fruit fly optimization algorithm, galactic swarm optimization, gray wolf optimizer, grenade explosion method, harmony search algorithm, hummingbird optimization algorithm, hybrid glowworm swarm optimization algorithm, imperialist competitive algorithm, intelligent water drops, invasive weed optimization, mine bomb algorithm, monarch butterfly optimization, moth-flame optimization algorithm, penguin search optimization algorithm, quantum-behaved particle swarm optimization, salp swarm algorithm, teaching–learning-based optimization, whale optimization algorithm, and water cycle algorithm.
The simplex method of Nelder and Mead28 is a deterministic, direct-search method that is among the most cited gradient-free methods. It is also known as the nonlinear simplex—not to be confused with the simplex algorithm used for linear programming, with which it has nothing in common. To avoid ambiguity, we will refer to it as the Nelder–Mead algorithm.

28. Nelder and Mead, A simplex method for function minimization, 1965.
The Nelder–Mead algorithm is based on a simplex, which is a geometric figure defined by a set of 𝑛 + 1 points in the design space of 𝑛 variables, 𝑋 = {𝑥^(0), 𝑥^(1), . . . , 𝑥^(𝑛)}. Each point 𝑥^(𝑖) represents a design (i.e., a full set of design variables). In two dimensions, the simplex is a triangle, and in three dimensions, it becomes a tetrahedron (see Fig. 7.2).

[Fig. 7.2: A simplex for 𝑛 = 3 has four vertices.]

Each optimization iteration corresponds to a different simplex. The algorithm modifies the simplex at each iteration using five simple operations. The sequence of operations to be performed is chosen based on the relative values of the objective function at each of the points.
The first step of the simplex algorithm is to generate 𝑛 + 1 points
based on an initial guess for the design variables. This could be done by
simply adding steps to each component of the initial point to generate
𝑛 new points. However, this will generate a simplex with different edge
lengths, and equal-length edges are preferable. Suppose we want the length of all sides to be 𝑙 and that the first guess is 𝑥^(0); the remaining 𝑛 points can then be generated to form an equilateral simplex.

Several of the simplex operations compute a new point of the form

𝑥 = 𝑥_𝑐 + 𝛼 (𝑥_𝑐 − 𝑥^(𝑛)) ,    (7.3)

where 𝛼 is a scalar, and 𝑥_𝑐 is the centroid of all the points except for the worst one, that is,

𝑥_𝑐 = (1/𝑛) Σ_{𝑖=0}^{𝑛−1} 𝑥^(𝑖) .    (7.4)
This generates a new point along the line that connects the worst point,
𝑥 (𝑛) , and the centroid of the remaining points, 𝑥 𝑐 . This direction can be
seen as a possible descent direction.
Each iteration aims to replace the worst point with a better one
to form a new simplex. Each iteration always starts with reflection,
which generates a new point using Eq. 7.3 with 𝛼 = 1, as shown in
Fig. 7.4. If the reflected point is better than the best point, then the
“search direction” was a good one, and we go further by performing an
expansion using Eq. 7.3 with 𝛼 = 2. If the reflected point is between the
second-worst and the worst point, then the direction was not great, but
it improved somewhat. In this case, we perform an outside contraction
(𝛼 = 1/2). If the reflected point is worse than our worst point, we try
an inside contraction instead (𝛼 = −1/2). Shrinking is a last-resort
operation that we can perform when nothing along the line connecting
𝑥 (𝑛) and 𝑥 𝑐 produces a better point. This operation consists of reducing
the size of the simplex by moving all the points closer to the best point,
𝑥^(𝑖) = 𝑥^(0) + 𝛾 (𝑥^(𝑖) − 𝑥^(0))   for 𝑖 = 1, . . . , 𝑛 ,    (7.5)
where 𝛾 = 0.5.
Inputs:
𝑥^(0): Starting point
𝜏_𝑥: Simplex size tolerances
𝜏_𝑓: Function value standard deviation tolerances
Outputs:
𝑥*: Optimal point

Create a simplex of 𝑛 + 1 points around 𝑥^(0)
while simplex size > 𝜏_𝑥 or standard deviation of the 𝑓 values > 𝜏_𝑓 do
  Sort {𝑥^(0), . . . , 𝑥^(𝑛−1), 𝑥^(𝑛)}   Order from the lowest (best) to the highest 𝑓(𝑥^(𝑗))
  𝑥_𝑐 = (1/𝑛) Σ_{𝑖=0}^{𝑛−1} 𝑥^(𝑖)   The centroid excluding the worst point 𝑥^(𝑛) (Eq. 7.4)
  𝑥_𝑟 = 𝑥_𝑐 + (𝑥_𝑐 − 𝑥^(𝑛))   Reflection, Eq. 7.3 with 𝛼 = 1
  if 𝑓(𝑥_𝑟) < 𝑓(𝑥^(0)) then   Is reflected point better than the best?
    𝑥_𝑒 = 𝑥_𝑐 + 2 (𝑥_𝑐 − 𝑥^(𝑛))   Expansion, Eq. 7.3 with 𝛼 = 2
    if 𝑓(𝑥_𝑒) < 𝑓(𝑥^(0)) then   Is expanded point better than the best?
      𝑥^(𝑛) = 𝑥_𝑒   Accept expansion and replace worst point
    else
      𝑥^(𝑛) = 𝑥_𝑟   Accept reflection
    end if
  else if 𝑓(𝑥_𝑟) ≤ 𝑓(𝑥^(𝑛−1)) then   Is reflected better than second worst?
    𝑥^(𝑛) = 𝑥_𝑟   Accept reflected point
  else
    if 𝑓(𝑥_𝑟) > 𝑓(𝑥^(𝑛)) then   Is reflected point worse than the worst?
      𝑥_𝑖𝑐 = 𝑥_𝑐 − 0.5 (𝑥_𝑐 − 𝑥^(𝑛))   Inside contraction, Eq. 7.3 with 𝛼 = −0.5
      if 𝑓(𝑥_𝑖𝑐) < 𝑓(𝑥^(𝑛)) then   Inside contraction better than worst?
        𝑥^(𝑛) = 𝑥_𝑖𝑐   Accept inside contraction
      else
        for 𝑗 = 1 to 𝑛 do
          𝑥^(𝑗) = 𝑥^(0) + 0.5 (𝑥^(𝑗) − 𝑥^(0))   Shrink, Eq. 7.5 with 𝛾 = 0.5
        end for
      end if
    else
      𝑥_𝑜𝑐 = 𝑥_𝑐 + 0.5 (𝑥_𝑐 − 𝑥^(𝑛))   Outside contraction, Eq. 7.3 with 𝛼 = 0.5
      if 𝑓(𝑥_𝑜𝑐) < 𝑓(𝑥_𝑟) then   Is contraction better than reflection?
        𝑥^(𝑛) = 𝑥_𝑜𝑐   Accept outside contraction
      else
        for 𝑗 = 1 to 𝑛 do
          𝑥^(𝑗) = 𝑥^(0) + 0.5 (𝑥^(𝑗) − 𝑥^(0))   Shrink, Eq. 7.5 with 𝛾 = 0.5
        end for
      end if
    end if
  end if
  𝑘 = 𝑘 + 1
end while
[Fig. 7.5: Flowchart of Nelder–Mead (Alg. 7.1).]

Like most direct-search methods, Nelder–Mead cannot directly
handle constraints. One approach to handling constraints would be to
use a penalty method (discussed in Section 5.4) to form an unconstrained
problem. In this case, the penalty does not need to be differentiable,
so a linear penalty method would suffice.
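The penalty idea just described can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the objective, the constraint, and the penalty parameter mu are assumptions, and SciPy's Nelder–Mead implementation stands in for Alg. 7.1.

import numpy as np
from scipy.optimize import minimize

def objective(x):              # assumed objective for illustration
    return (x[0] - 2)**2 + (x[1] - 1)**2

def constraint(x):             # assumed constraint in g(x) <= 0 form: x1 + x2 - 2 <= 0
    return x[0] + x[1] - 2

mu = 100.0                     # penalty parameter (would normally be increased gradually)

def penalized(x):
    # Linear (nondifferentiable) exterior penalty; acceptable because Nelder-Mead
    # does not require derivatives
    return objective(x) + mu * max(0.0, constraint(x))

res = minimize(penalized, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print(res.x, objective(res.x), constraint(res.x))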
Figure 7.6 shows the sequence of simplices that results when minimizing
the bean function using a Nelder–Mead simplex. The initial simplex on the
upper left is equilateral. The first iteration is an expansion, followed by an
inside contraction, another reflection, and an inside contraction before the
shrinking. The simplices then shrink dramatically in size, slowly converging to
the minimum.
Using a convergence tolerance of 10^{−6} in the difference between 𝑓_best and 𝑓_worst, the problem took 68 function evaluations.
𝐷 = {𝑑_1, . . . , 𝑑_4},  where  𝑑_1 = [1, 0, 0],  𝑑_2 = [0, 1, 0],  𝑑_3 = [0, 0, 1],  𝑑_4 = [−1, −1, −1] .    (7.9)
Figure 7.8 shows an example maximal set (four vectors) and minimal
set (three vectors) for a two-dimensional problem.
These direction vectors are then used to create a mesh. Given a current center point 𝑥_𝑘, which is the best point found so far, and a mesh size Δ_𝑘, the mesh is created by stepping away from 𝑥_𝑘 along each direction in 𝐷, scaled by Δ_𝑘.
The type of search can change throughout the optimization. Like the
polling phase, the goal of the search phase is to find a better point
(i.e., 𝑓 (𝑥 𝑘+1 ) < 𝑓 (𝑥 𝑘 )) but within a broader domain. We begin with a
search at every iteration. If the search fails to produce a better point, we
continue with a poll. If a better point is identified in either phase, the
iteration ends, and we begin a new search. Optionally, a successful poll
could be followed by another poll. Thus, at each iteration, we might
perform a search and a poll, just a search, or just a poll.

We describe one option for a search procedure based on the same mesh ideas as the polling step. The concept is to extend the mesh throughout the entire domain, as shown in Fig. 7.10. In this example, the mesh size Δ_𝑘 is shared between the search and poll phases. However, it is usually more effective if these sizes are independent. Mathematically, we can define the global mesh as the set

𝐺 = {𝑥_𝑘 + Δ_𝑘 𝐷𝑧  for all  𝑧_𝑖 ∈ Z⁺} ,    (7.11)

[Fig. 7.10: Meshing strategy extended across the domain. The same directions (and potentially spacing) are repeated at each mesh point, as indicated by the lighter arrows throughout the entire domain.]
Inputs:
𝑥 𝑘 : Center point
Δ 𝑘 : Mesh size
𝑥, 𝑥: Lower and upper bounds
𝐷: Column vectors representing positive spanning set
𝑛 𝑠 : Number of search points
𝑓_𝑘: The previously evaluated function value 𝑓(𝑥_𝑘)
Outputs:
success: True if successful in finding improved point
𝑥 𝑘+1 : New center point
𝑓 𝑘+1 : Corresponding function value
success = false
𝑥 𝑘+1 = 𝑥 𝑘
𝑓 𝑘+1 = 𝑓 𝑘
Construct global mesh 𝐺, using directions 𝐷, mesh size Δ 𝑘 , and bounds 𝑥, 𝑥
for 𝑖 = 1 to 𝑛 𝑠 do
Randomly select 𝑠 ∈ 𝐺
Evaluate 𝑓𝑠 = 𝑓 (𝑠)
if 𝑓𝑠 < 𝑓 𝑘 then
𝑥 𝑘+1 = 𝑠
𝑓 𝑘+1 = 𝑓𝑠
success = true
break
end if
end for
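The random search step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the mesh is truncated to a small number of integer multiples per direction so it is finite, and the test function, bounds, and parameter values are ours rather than the book's.

import itertools
import numpy as np

def search_step(f, x_k, f_k, delta_k, D, lb, ub, n_s, rng):
    # Build a (truncated) global mesh: x_k + delta_k * D @ z with small nonnegative integer z
    mesh = []
    for z in itertools.product(range(4), repeat=D.shape[1]):
        s = x_k + delta_k * (D @ np.array(z))
        if np.all(s >= lb) and np.all(s <= ub):   # keep only points inside the bounds
            mesh.append(s)
    for i in rng.permutation(len(mesh))[:n_s]:    # evaluate n_s randomly selected mesh points
        f_s = f(mesh[i])
        if f_s < f_k:                             # improved point found: the search succeeds
            return True, mesh[i], f_s
    return False, x_k, f_k                        # no improvement; fall back to the poll step

# Maximal positive spanning set {±e1, ±e2} for a two-dimensional problem
D = np.array([[1, 0, -1, 0],
              [0, 1, 0, -1]])
rng = np.random.default_rng(0)
f = lambda x: (x[0] - 1)**2 + (x[1] - 2)**2       # assumed test function
print(search_step(f, np.zeros(2), f(np.zeros(2)), 0.5, D, -3*np.ones(2), 3*np.ones(2), n_s=5, rng=rng))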
Inputs:
𝑥 0 : Starting point
𝑥, 𝑥: Lower and upper bounds
Δ0 : Starting mesh size
𝑛 𝑠 : Number of search points
𝑘max : Maximum number of iterations
Outputs:
𝑥 ∗ : Best point
𝑓 ∗ : Corresponding function value
GPS can handle linear and nonlinear constraints. For linear con-
straints, one effective strategy is to change the positive spanning di-
rections so that they align with any linear constraints that are nearby
(Fig. 7.11). For nonlinear constraints, penalty approaches (Section 5.4)
are applicable, although the filter method (Section 5.5.3) is another
effective approach.
[Fig. 7.11: Mesh direction changed during optimization to align with linear constraints when close to the constraint.]

Example 7.2 Minimization of a multimodal function with GPS

In this example, we optimize the Jones function (Appendix D.1.4). We start at 𝑥 = [0, 0] with an initial mesh size of Δ = 0.1. We evaluate two search points at each iteration and run for 12 iterations. The iterations are plotted in Fig. 7.12.
[Fig. 7.12: GPS iterations on the Jones function; the panels show iterations 𝑘 = 6, 𝑘 = 8, and 𝑘 = 12.]
smaller while allowing the poll size (which dictates the maximum
magnitude of the step) to remain large. This provides a much denser
set of options in poll directions (e.g., the grid points on the right panel
of Fig. 7.13). MADS randomly chooses the polling directions from this
much larger set of possibilities while maintaining a positive spanning
set.†
† The NOMAD software is an open-source implementation of MADS.142
142. Le Digabel, Algorithm 909: NOMAD: Nonlinear optimization with the MADS algorithm, 2011.
One intuitive global search strategy is to sample the design space with a grid and evaluate every point in this grid. This is called an exhaustive search,
and the precision of the minimum depends on how fine the grid is. The
cost of this brute-force strategy is high and goes up exponentially with
the number of design variables.
The DIRECT method relies on a grid, but it uses an adaptive meshing
scheme that dramatically reduces the cost. It starts with a single 𝑛-
dimensional hypercube that spans the whole design space—like many
other gradient-free methods, DIRECT requires upper and lower bounds
on all the design variables. Each iteration divides this hypercube into
smaller ones and evaluates the objective function at the center of each
of these. At each iteration, the algorithm only divides rectangles
determined to be potentially optimal. The fundamental strategy in the
Lipschitz Constant
Shubert’s Algorithm
The lower bounding function is updated with the resulting new cones. We then iterate by finding the
two points that minimize the new lower bounding function, evaluating
the function at these points, updating the lower bounding function,
and so on.
The lowest bound on the function increases at each iteration and
ultimately converges to the global minimum. At the same time, the
segments in 𝑥 decrease in size. The lower bound can switch from
distinct regions as the lower bound in one region increases beyond the
lower bound in another region.
The two significant shortcomings of Shubert’s algorithm are that
(1) a Lipschitz constant is usually not available for a general function,
and (2) it is not easily extended to 𝑛 dimensions. The DIRECT algorithm
addresses these two shortcomings.
One-Dimensional DIRECT
[Fig. 7.16: The DIRECT algorithm evaluates the middle point (left), and each successive iteration trisects the segments that have the greatest potential (right).]

Like Shubert’s method, DIRECT starts with the domain [𝑎, 𝑏]. However, instead of sampling the endpoints 𝑎 and 𝑏, it samples the midpoint. Consider the closed domain [𝑎, 𝑏] shown in Fig. 7.16 (left). For each segment, we evaluate the objective function at the segment’s midpoint.
In the first segment, which spans the whole domain, the midpoint is
𝑐 0 = (𝑎 + 𝑏)/2. Assuming some value of 𝐿, which is not known and
that we will not need, the lower bound on the minimum would be
𝑓 (𝑐) − 𝐿(𝑏 − 𝑎)/2.
We want to increase this lower bound on the function minimum
by dividing this segment further. To do this in a regular way that
reuses previously evaluated points and can be repeated indefinitely,
The overall rationale for the potentially optimal criterion is that two
metrics quantify this potential: the size of the segment and the function
value at the center of the segment. The larger the segment is, the greater
the potential for that segment to contain the global minimum. The
lower the function value, the greater that potential is as well. For a set
of segments of the same size, we know that the one with the lowest
function value has the best potential and should be selected. If two
segments have the same function value and different sizes, we should
select the one with the largest size. For a general set of segments with
various sizes and value combinations, there might be multiple segments
that can be considered potentially optimal.
We identify potentially optimal segments as follows. If we draw a
line with a slope corresponding to a Lipschitz constant 𝐿 from any point
in Fig. 7.17, the intersection of this line with the vertical axis is a bound
on the objective function for the corresponding segment. Therefore,
the lowest bound for a given 𝐿 can be found by drawing a line through
the point that achieves the lowest intersection.
However, we do not know 𝐿, and we do not want to assume a value
because we do not want to bias the search. If 𝐿 were high, it would favor
dividing the larger segments. Low values of 𝐿 would result in dividing
the smaller segments. The DIRECT method hinges on considering all
where 𝑓min is the best current objective function value, and 𝜀 is a small
positive parameter. The first condition corresponds to finding the
points in the lower convex hull mentioned previously.
The second condition in Eq. 7.16 ensures that the potential minimum
is better than the lowest function value found so far by at least a small
amount. This prevents the algorithm from becoming too local, wasting
function evaluations in search of smaller function improvements. The
parameter 𝜀 balances the search between local and global. A typical
value is 𝜀 = 10^{−4}, and its range is usually such that 10^{−7} ≤ 𝜀 ≤ 10^{−2}.
There are efficient algorithms for finding the convex hull of an arbitrary set of points in two dimensions, such as the Jarvis march.144 These algorithms are more than we need because we only require the lower part of the convex hull, so the algorithms can be simplified for our purposes.

144. Jarvis, On the identification of the convex hull of a finite set of points in the plane, 1973.
As in the Shubert algorithm, the division might switch from one
part of the domain to another, depending on the new function values.
Compared with the Shubert algorithm, the DIRECT algorithm produces
a discontinuous lower bound on the function values, as shown in
Fig. 7.18.
DIRECT in 𝑛 Dimensions

In 𝑛 dimensions, the algorithm deals with hyperrectangles instead of segments.143 A hyperrectangle can be defined by its center-point position 𝑐 in 𝑛-dimensional space and a half-length in each direction 𝑖, 𝛿𝑒_𝑖, as shown in Fig. 7.19. The DIRECT algorithm assumes that the initial dimensions are normalized so that we start with a hypercube.

[Fig. 7.19: Hyperrectangle in three dimensions, where 𝑑 is the maximum distance between the center and the vertices, and 𝛿𝑒_𝑖 is the half-length in each direction 𝑖.]

143. Jones, Direct Global Optimization Algorithm, 2009.
The algorithm then trisects the chosen rectangle along its longest dimension and evaluates the two new points. The values for these three points
are plotted in the second column from the right in the 𝑓 –𝑑 plot, where
the center point is reused, as indicated by the arrow and the matching
color. At this iteration, we have two points that define the convex hull.
In the second iteration, we have three rectangles of the same size, so
we divide the one with the lowest value and evaluate the centers of
the two new rectangles (which are squares in this case). We now have
another column of points in the 𝑓 –𝑑 plot corresponding to a smaller 𝑑
and an additional point that defines the lower convex hull. Because the
convex hull now has two points, we trisect two different rectangles in
the third iteration.
Inputs:
𝑥, 𝑥: Lower and upper bounds
Outputs:
𝑥 ∗ : Optimal point
the final points and convex hull are highlighted. The sequence of rectangles is shown in Fig. 7.22. The algorithm converges to the global minimum after dividing the rectangles around the other local minima a few times.

[Fig. 7.21: Potentially optimal rectangles for the DIRECT iterations shown in Fig. 7.22.]
[Figure: A population of chromosomes; each chromosome represents a design point, and each gene corresponds to a design variable 𝑥_1, 𝑥_2, . . . , 𝑥_𝑛.]

Various methods can be specified for the generation of the initial population, and the size of that population varies. Similarly, there are many possible methods for selecting the parents, generating the offspring, and selecting the survivors. Here, the new population (𝑃_{𝑘+1}) is formed exclusively by the offspring generated from the crossover. However, some GAs add an extra selection process that selects a surviving population of size 𝑛_𝑝 among the population of parents and offspring.

[Fig. 7.24: GA iteration steps.]

In addition to the flexibility in the various operations, GAs use different methods for representing the design variables. The design variable representation can be used to classify genetic algorithms into two broad categories: binary-encoded and real-encoded genetic algorithms. Binary-encoded algorithms use bits to represent the design variables, whereas the real-encoded algorithms keep the same real value representation
used in most other algorithms. The details of the operations in Alg. 7.5
depend on whether we are using one or the other representation, but
the principles remain the same. In the rest of this section, we describe a
particular way of performing these operations for each of the possible
design variable representations.
Inputs:
𝑥, 𝑥: Lower and upper bounds
Outputs:
𝑥 ∗ : Best point
𝑓 ∗ : Corresponding function value
𝑘 = 0
𝑃_𝑘 = {𝑥^(1), 𝑥^(2), . . . , 𝑥^(𝑛_𝑝)}   Generate initial population
while 𝑘 < 𝑘max do
Compute 𝑓 (𝑥) for all 𝑥 ∈ 𝑃 𝑘 Evaluate objective function
Select 𝑛 𝑝 /2 parent pairs from 𝑃 𝑘 for crossover Selection
Generate a new population of 𝑛 𝑝 offspring (𝑃 𝑘+1 ) Crossover
Randomly mutate some points in the population Mutation
𝑘 = 𝑘+1
end while
Δ𝑥 = (x̄ − x̲) / (2^𝑚 − 1) .    (7.17)
To have a more precise representation, we must use more bits.
When using binary-encoded GAs, we do not need to encode the
design variables because they are generated and manipulated directly
in the binary representation. Still, we do need to decode them be-
fore providing them to the evaluation function. To decode a binary
representation, we use

𝑥 = x̲ + ( Σ_{𝑖=0}^{𝑚−1} 𝑏_𝑖 2^𝑖 ) Δ𝑥 .    (7.18)
𝑖 1 2 3 4 5 6 7 8 9 10 11 12
𝑏𝑖 0 0 0 1 0 1 1 0 0 0 0 1
We can use Eq. 7.18 to compute the equivalent real number, which turns out to
be 𝑥 ≈ 32.55.
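The decoding of Eqs. 7.17 and 7.18 is only a couple of lines of Python. In this minimal sketch, the bounds are an assumption chosen to reproduce the example value of roughly 32.55; they are not given in this excerpt.

def decode(bits, x_lower, x_upper):
    m = len(bits)
    dx = (x_upper - x_lower) / (2**m - 1)          # Eq. 7.17
    return x_lower + sum(b * 2**i for i, b in enumerate(bits)) * dx   # Eq. 7.18

bits = [0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1]        # b_0 ... b_11 from the example
print(decode(bits, x_lower=-20.0, x_upper=80.0))   # approximately 32.55 with these assumed bounds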
Initial Population
Selection
Figure 7.25 illustrates the process with a small population. Each member of
the population ends up in the mating pool zero, one, or two times, with better
points more likely to appear in the pool. The best point in the population will
always end up in the pool twice, whereas the worst point in the population is
always eliminated.
[Fig. 7.25: Tournament selection example. The best point in each randomly selected pair is moved into the mating pool.]
𝐹_𝑖 = (−𝑓_𝑖 + Δ𝐹) / max(1, Δ𝐹 − 𝑓_low) ,    (7.19)
where Δ𝐹 = 1.1 𝑓high −0.1 𝑓low is based on the highest and lowest function
values in the population, and the denominator is introduced to scale
the fitness.
Then, to find the sizes of the sectors in the roulette wheel selection,
we take the normalized cumulative sum of the scaled fitness values to
compute an interval for each member in the population 𝑗 as
𝑆_𝑗 = ( Σ_{𝑖=1}^{𝑗} 𝐹_𝑖 ) / ( Σ_{𝑖=1}^{𝑛_𝑝} 𝐹_𝑖 ) .    (7.20)
This ensures that the probability of a member being selected for reproduction is proportional to its scaled fitness value.

In single-point crossover, a random integer 𝑘 splits the two parents' bit strings: for the first offspring, the first 𝑘 bits are taken from parent 1 and the remaining bits from parent 2. For the second offspring, the first 𝑘 bits are taken from parent 2 and the remaining ones from parent 1.
Initial Population
𝑥_𝑖 = x̲_𝑖 + 𝑟 (x̄_𝑖 − x̲_𝑖)    (7.22)
Selection
The selection operation does not depend on the design variable encod-
ing. Therefore, we can use one of the selection approaches described
for the binary-encoded GA: tournament or roulette wheel selection.
Crossover
When using real encoding, the term crossover does not accurately
describe the process of creating the two offspring from a pair of points.
Instead, the approaches are more accurately described as a blending,
although the name crossover is still often used.
There are various options for the reproduction of two points encoded
using real numbers. A standard method is linear crossover, which
generates two or more points in the line defined by the two parent
points. One option for linear crossover is to generate the following two
points:
𝑥_𝑐1 = 0.5 𝑥_𝑝1 + 0.5 𝑥_𝑝2 ,
𝑥_𝑐2 = 2 𝑥_𝑝2 − 𝑥_𝑝1 ,    (7.23)
where parent 2 is more fit than parent 1 ( 𝑓 (𝑥 𝑝2 ) < 𝑓 (𝑥 𝑝1 )). An example
of this linear crossover approach is shown in Fig. 7.29, where we can
see that child 1 is the average of the two parent points, whereas child 2
is obtained by extrapolating in the direction of the “fitter” parent.
Another option is a simple crossover like the binary case, where a random integer is generated to split the vectors—for example, with a split after the first index:

𝑥_𝑝1 = [𝑥_1, 𝑥_2, 𝑥_3, 𝑥_4]
𝑥_𝑝2 = [𝑥_5, 𝑥_6, 𝑥_7, 𝑥_8]
    ⇓    (7.24)
𝑥_𝑐1 = [𝑥_1, 𝑥_6, 𝑥_7, 𝑥_8]
𝑥_𝑐2 = [𝑥_5, 𝑥_2, 𝑥_3, 𝑥_4] .

[Fig. 7.29: Linear crossover produces two new points along the line defined by the two parent points.]
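Both real-encoded crossover options above (Eqs. 7.23 and 7.24) are easy to express in code. This is a minimal sketch; the parent values are arbitrary illustrations, not data from the text.

import numpy as np

def linear_crossover(xp1, xp2):
    """Assumes parent 2 is the fitter parent, i.e., f(xp2) < f(xp1) (Eq. 7.23)."""
    xc1 = 0.5 * xp1 + 0.5 * xp2      # average of the two parents
    xc2 = 2.0 * xp2 - xp1            # extrapolation past the fitter parent
    return xc1, xc2

def simple_crossover(xp1, xp2, k):
    """Single-point crossover: swap entries after index k (Eq. 7.24)."""
    xc1 = np.concatenate([xp1[:k], xp2[k:]])
    xc2 = np.concatenate([xp2[:k], xp1[k:]])
    return xc1, xc2

xp1 = np.array([1.0, 2.0, 3.0, 4.0])
xp2 = np.array([5.0, 6.0, 7.0, 8.0])
print(linear_crossover(xp1, xp2))    # ([3, 4, 5, 6], [9, 10, 11, 12])
print(simple_crossover(xp1, xp2, 1)) # ([1, 6, 7, 8], [5, 2, 3, 4])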
[Fig. 7.30: Evolution of the population using a bit-encoded GA to minimize the bean function, where 𝑘 is the generation number; the panels show 𝑘 = 10, 20, and 50.]

Figure 7.30 shows the evolution of the population when minimizing the bean function using a bit-encoded GA. The initial population size was 40, and the simulation was run for 50 generations. Figure 7.31 shows the evolution when using a real-encoded GA but otherwise uses the same parameters as the bit-encoded optimization. The real-encoded GA converges faster in this case.
[Fig. 7.31: Evolution of the population using a real-encoded GA to minimize the bean function, where 𝑘 is the generation number; the panels show 𝑘 = 4, 6, and 10.]

Constraints can be handled with the penalty approaches used by other gradient-free methods (e.g., augmented Lagrangian, linear penalty). However, there are additional options for GAs. In the tournament selection, we can use other selection criteria that do not depend on penalty parameters. One such approach for
choosing the better of two competitors is as follows: if both points are feasible, select the one with the lower objective value; if only one is feasible, select the feasible one; and if both are infeasible, select the one with the smaller constraint violation.
This concept is a lot like the filter methods discussed in Section 5.5.3.
7.6.4 Convergence
Rigorous mathematical convergence criteria, like those used in gradient-
based optimization, do not apply to GAs. The most common way to
terminate a GA is to specify a maximum number of iterations, which
corresponds to a computational budget. Another similar approach is
to let the algorithm run indefinitely until the user manually terminates
the algorithm, usually by monitoring the trends in population fitness.
150. Eberhart and Kennedy, New optimizer using particle swarm theory, 1995.

these are just design points, the history for each point is relevant to the PSO algorithm, so we adopt the term particle. Each particle moves
according to a velocity. This velocity changes according to the past
objective function values of that particle and the current objective values
of the rest of the particles. Each particle remembers the location where
it found its best result so far, and it exchanges information with the
swarm about the location where the swarm has found the best result
so far.
The position of particle 𝑖 for iteration 𝑘 + 1 is updated according to
𝑥^{(𝑖)}_{𝑘+1} = 𝑥^{(𝑖)}_𝑘 + 𝑣^{(𝑖)}_{𝑘+1} Δ𝑡 ,    (7.27)
where Δ𝑡 is a constant artificial time step. The velocity for each particle
is updated as follows:
𝑣^{(𝑖)}_{𝑘+1} = 𝛼 𝑣^{(𝑖)}_𝑘 + 𝛽 (𝑥^{(𝑖)}_best − 𝑥^{(𝑖)}_𝑘)/Δ𝑡 + 𝛾 (𝑥_best − 𝑥^{(𝑖)}_𝑘)/Δ𝑡 .    (7.28)
The first component in this update is the “inertia”, which determines
how similar the new velocity is to the velocity in the previous iteration
through the parameter 𝛼. Typical values for the inertia parameter 𝛼 are
in the interval [0.8, 1.2]. A lower value of 𝛼 reduces the particle’s inertia
and tends toward faster convergence to a minimum. A higher value of 𝛼
increases the particle’s inertia and tends toward increased exploration to
potentially help discover multiple minima. Some methods are adaptive,
choosing the value of 𝛼 based on the optimizer’s progress.151
151. Zhan et al., Adaptive particle swarm optimization, 2009.
The second term represents “memory” and is a vector pointing toward the best position particle 𝑖 has seen in all its iterations so far, 𝑥^{(𝑖)}_best.
The weight in this term consists of a random number 𝛽 in the interval
[0, 𝛽max ] that introduces a stochastic component to the algorithm. Thus,
𝛽 controls how much influence the best point found by the particle so
far has on the next direction.
The third term represents “social” influence. It behaves similarly
to the memory component, except that 𝑥best is the best point the entire
swarm has found so far, and 𝛾 is a random number between [0, 𝛾max ]
that controls how much of an influence this best point has in the next
direction. The relative values of 𝛽 and 𝛾 thus control the tendency
toward local versus global search, respectively. Both 𝛽 max and 𝛾max are
in the interval [0, 2] and are typically closer to 2. Sometimes, rather
than using the best point in the entire swarm, the best point is chosen
within a neighborhood.
Because the time step is artificial, we can eliminate it by multiplying
Eq. 7.28 by Δ𝑡 to yield a step:
Δ𝑥^{(𝑖)}_{𝑘+1} = 𝛼 Δ𝑥^{(𝑖)}_𝑘 + 𝛽 (𝑥^{(𝑖)}_best − 𝑥^{(𝑖)}_𝑘) + 𝛾 (𝑥_best − 𝑥^{(𝑖)}_𝑘) .    (7.29)
We then use this step to update the particle position for the next
iteration:
𝑥^{(𝑖)}_{𝑘+1} = 𝑥^{(𝑖)}_𝑘 + Δ𝑥^{(𝑖)}_{𝑘+1} .    (7.30)
The three components of the update in Eq. 7.29 are shown in Fig. 7.32
for a two-dimensional case.
[Fig. 7.32: Components of the PSO update.]
The first step in the PSO algorithm is to initialize the set of particles
(Alg. 7.6). As with a GA, the initial set of points can be determined
randomly or can use a more sophisticated sampling strategy (see
Section 10.2). The velocities are also randomly initialized, generally
using some fraction of the domain size (𝑥 − 𝑥).
Inputs:
x̄: Variable upper bounds
x̲: Variable lower bounds
𝛼: Inertia parameter
𝛽_max: Self influence parameter
𝛾_max: Social influence parameter
Δ𝑥_max: Maximum velocity
Outputs:
𝑥*: Best point
𝑓*: Corresponding function value

𝑘 = 0
for 𝑖 = 1 to 𝑛 do   Loop to initialize all particles
  Generate position 𝑥^{(𝑖)}_0 within specified bounds
  Initialize “velocity” Δ𝑥^{(𝑖)}_0
end for
while not converged do   Main iteration loop
  for 𝑖 = 1 to 𝑛 do
    if 𝑓(𝑥^{(𝑖)}) < 𝑓(𝑥^{(𝑖)}_best) then   Best individual points
      𝑥^{(𝑖)}_best = 𝑥^{(𝑖)}
    end if
    if 𝑓(𝑥^{(𝑖)}) < 𝑓(𝑥_best) then   Best swarm point
      𝑥_best = 𝑥^{(𝑖)}
    end if
  end for
  for 𝑖 = 1 to 𝑛 do
    Δ𝑥^{(𝑖)}_{𝑘+1} = 𝛼 Δ𝑥^{(𝑖)}_𝑘 + 𝛽 (𝑥^{(𝑖)}_best − 𝑥^{(𝑖)}_𝑘) + 𝛾 (𝑥_best − 𝑥^{(𝑖)}_𝑘)
    Δ𝑥^{(𝑖)}_{𝑘+1} = max(min(Δ𝑥^{(𝑖)}_{𝑘+1}, Δ𝑥_max), −Δ𝑥_max)   Limit velocity
    𝑥^{(𝑖)}_{𝑘+1} = 𝑥^{(𝑖)}_𝑘 + Δ𝑥^{(𝑖)}_{𝑘+1}   Update the particle position
    𝑥^{(𝑖)}_{𝑘+1} = max(min(𝑥^{(𝑖)}_{𝑘+1}, x̄), x̲)   Enforce bounds
  end for
  𝑘 = 𝑘 + 1
end while
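For concreteness, here is a minimal NumPy sketch of the PSO update in Eqs. 7.29 and 7.30 and the loop above. The bean-function definition, the parameter values, and the fixed iteration budget are assumptions for illustration, not the book's reference implementation.

import numpy as np

def bean(x):   # assumed test function similar to the chapter's examples
    return (1 - x[0])**2 + (1 - x[1])**2 + 0.5*(2*x[1] - x[0]**2)**2

def pso(f, lb, ub, n_particles=40, max_iter=100, alpha=0.9, beta_max=1.5, gamma_max=1.5, seed=0):
    rng = np.random.default_rng(seed)
    dim = len(lb)
    x = rng.uniform(lb, ub, size=(n_particles, dim))                      # initial positions
    dx = 0.1 * rng.uniform(-(ub - lb), ub - lb, size=(n_particles, dim))  # initial "velocities"
    x_best_i = x.copy()                                                   # each particle's best
    f_best_i = np.array([f(xi) for xi in x])
    x_best = x_best_i[np.argmin(f_best_i)].copy()                         # swarm best
    f_best = f_best_i.min()
    dx_max = 0.2 * (ub - lb)
    for _ in range(max_iter):
        beta = rng.uniform(0, beta_max, size=(n_particles, 1))            # random weights
        gamma = rng.uniform(0, gamma_max, size=(n_particles, 1))
        dx = alpha*dx + beta*(x_best_i - x) + gamma*(x_best - x)          # Eq. 7.29
        dx = np.clip(dx, -dx_max, dx_max)                                 # limit the step
        x = np.clip(x + dx, lb, ub)                                       # Eq. 7.30 plus bounds
        fx = np.array([f(xi) for xi in x])
        improved = fx < f_best_i                                          # update personal bests
        x_best_i[improved], f_best_i[improved] = x[improved], fx[improved]
        if f_best_i.min() < f_best:                                       # update swarm best
            x_best = x_best_i[np.argmin(f_best_i)].copy()
            f_best = f_best_i.min()
    return x_best, f_best

print(pso(bean, lb=np.array([-3.0, -3.0]), ub=np.array([3.0, 3.0])))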
Figure 7.33 shows the particle movements that result when minimizing the
bean function using a particle swarm method. The initial population size was
40, and the optimization required 600 function evaluations. Convergence was
assumed if the best value found by the population did not improve by more
than 10−4 for three consecutive iterations.
[Fig. 7.33: Particle movements for the PSO minimization of the bean function; the panels show iterations 𝑘 = 5, 𝑘 = 12, and 𝑘 = 17.]
Possible convergence criteria include the distance (sum or norm) between each particle and
the best particle, the best particle’s objective function value changes
for the last few generations, and the difference between the best and
worst member. For PSO, another alternative is to check whether the
velocities for all particles (as measured by a metric such as norm or
mean) are below some tolerance. Some of these criteria that assume
all the particles congregate (distance, velocities) do not work well for
multimodal problems. In those cases, tracking only the best particle’s
objective function value may be more appropriate.
Example 7.9 Comparison of algorithms for a multimodal discontinuous function

[Fig. 7.34: Slice of the Jones function with the added checkerboard pattern.]

By taking the ceiling of the product of the two sine waves, this function creates a checkerboard pattern with 0s and 4s. In this latter case, each gradient evaluation is counted as an evaluation in addition to each function evaluation. Adding this function to the Jones function produces the discontinuous pattern shown in Fig. 7.34. This is a one-dimensional slice of constant 𝑥_2 through the optimum of
the Jones function; the full two-dimensional contour plot is shown in Fig. 7.35.
The global optimum remains the same as the original function.
[Fig. 7.35: Convergence paths for gradient-free algorithms compared with a gradient-based algorithm with multistart; the six panels required 179, 119, 99, 2420, 760, and 96 evaluations.]

The resulting optimization paths demonstrate that some gradient-free algorithms effectively handle the discontinuities and find the global minimum. Nelder–Mead converges quickly, but not necessarily to the global minimum.
GPS and DIRECT quickly converge to the global minimum. GAs and PSO
also find the global minimum, but they require many more evaluations. The
gradient-based algorithm (quasi-Newton) with multistart also converges to the global minimum in two of the six random starts.
7.8 Summary
Problems
7.3 Program the DIRECT algorithm and perform the following stud-
ies:
7.5 Program the PSO algorithm and perform the following studies:
Minimize the power with respect to span and chord by doing the
following:
8 Discrete Optimization
Even though a discrete optimization problem limits the options and thus
conceptually sounds easier to solve, discrete optimization problems
Unless your optimization problem fits specific forms that are well suited to
discrete optimization, your problem is likely expensive to solve, and it may be
helpful to consider approaches to avoid discrete variables.
Another variation of branch and bound arises from how the tree search is performed. Two common strategies are depth-first and breadth-first.
Inputs:
𝑓best : Best known solution, if any; otherwise 𝑓best = ∞
Outputs:
𝑥 ∗ : Optimal point
𝑓 (𝑥 ∗ ): Optimal function value
To solve this problem, we begin at the first node by solving the linear
relaxation. The binary constraint is removed and instead replaced with
continuous bounds: 0 ≤ 𝑥 𝑖 ≤ 1. The solution to this LP is as follows:
The first branch (see Fig. 8.5) yields a feasible binary solution! The corre-
sponding function value 𝑓 = −4 is saved as the best value so far. There is no
need to continue on this branch because the solution cannot be improved on
this particular branch.
We continue solving along the rest of this row (Fig. 8.6). The third node
in this row yields another binary solution. In this case, the function value is
𝑓 = −4.9, which is better, so this becomes the new best value so far. The second
and fourth nodes do not yield binary solutions. Typically, we would have to branch
these further, but they have a lower bound that is worse than the best solution
so far. Thus, we can prune both of these branches.
All branches have been pruned, so we have solved the original problem:
𝑥 ∗ = [1, 0, 1, 1]
𝑓 ∗ = −4.9.
[Fig. 8.7: Search path using a depth-first strategy.]
[Fig. 8.8: A breadth-first search of the mixed-integer programming example.]

𝑥* = [0, 2, 3, 0.5]
𝑓* = −13.75.
Greedy algorithms are among the simplest methods for discrete opti-
mization problems. This method is more of a concept than a specific
algorithm. The implementation varies with the application. The idea is
to reduce the problem to a subset of smaller problems (often down to a
single choice) and then make a locally optimal decision. That decision
is locked in, and then the next small decision is made in the same
manner. A greedy algorithm does not revisit past decisions and thus
ignores much of the coupling between design variables.
[Fig. 8.9: The greedy algorithm in this weighted directed graph results in a total cost of 15, whereas the best possible cost is 10.]
A greedy algorithm simply makes the best choice assuming each decision is the only decision to be made. Starting at node 1, we first choose to move to node 3 because that is the lowest cost among the three options (node 2 costs 2, node 3 costs 1, node 4 costs 5). We then choose to move to node 6 because that is the smaller cost of the next two available options (node 6 costs 4, node 7 costs 6), and so on. The path selected by the greedy algorithm is
highlighted in the figure and results in a total cost of 15. The global optimum
is also highlighted in the figure and has a total cost of 10.
The greedy algorithm used in Ex. 8.6 is easy to apply and scalable
but does not generally find the global optimum. To find that global
optimum, we have to consider the impact of our choices on future
decisions. A method to achieve this for certain problem structures is
discussed in the next section.
Even for a fixed problem, there are many ways to construct a greedy
algorithm. The advantage of the greedy approach is that the algorithms
are easy to construct, and they bound the computational expense of
the problem. One disadvantage of the greedy approach is that it
usually does not find an optimal solution (and in some cases finds the
worst solution!152). Furthermore, the solution is not necessarily feasible.
152. Gutin et al., Traveling salesman should not be greedy: domination analysis of greedy-type heuristics for the TSP, 2002.
A few other examples of greedy algorithms are listed below. For the traveling salesperson problem (Ex. 8.1), always select the nearest city as the next step. Consider the propeller problem (Ex. 8.2), but with additional discrete variables (number of blades, type of material, and number of shear webs). A greedy method could optimize the discrete variables one at a time, with the others fixed (i.e., optimize the number of blades first, fix that number, then optimize material, and so on). As a final example, consider the grocery store shopping problem discussed in a separate chapter (Ex. 11.1).∗
∗ This is a form of the knapsack problem.
procedure fib(𝑛)
if 𝑛 ≤ 1 then
return 𝑛
else
return fib(𝑛 − 1) + fib(𝑛 − 2)
end if
end procedure
𝑓0 = 0
𝑓1 = 1
for 𝑖 = 2 to 𝑛 do
𝑓𝑖 = 𝑓𝑖−1 + 𝑓𝑖−2
end for
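The difference between the naive recursion and the two dynamic-programming variants is easy to see in Python. This is a minimal sketch; the function names are ours, not the book's.

from functools import lru_cache

def fib_naive(n):                  # O(2^n): recomputes the same subproblems repeatedly
    return n if n <= 1 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):                   # top-down memoization: each subproblem solved once
    return n if n <= 1 else fib_memo(n - 1) + fib_memo(n - 2)

def fib_tab(n):                    # bottom-up tabulation, O(n)
    f = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        f[i] = f[i - 1] + f[i - 2]
    return f[n]

print(fib_memo(30), fib_tab(30))   # both return 832040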
where 𝑡 is a transition function.† At each transition, we compute the cost function 𝑐.‡ For generality, we specify a cost function that may change at each iteration 𝑖:

𝑐_𝑖(𝑠_𝑖, 𝑥_𝑖) .    (8.5)

† For some problems, the transition function is stochastic.
‡ It is common to use discount factors on future costs.
We want to make a set of decisions that minimize the sum of the
current and future costs up to a certain time, which is called the value
function.
Let us solve the graph problem posed in Ex. 8.6 using dynamic programming.
For convenience, we repeat a smaller version of the figure in Fig. 8.12. We use
the tabulation (bottom-up) approach. To do this, we construct a table where we
keep track of the cost to move from this node to the end (node 12) and which
node we should move to next:
Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost
Next
We start from the end. The last node is simple: there is no cost to move from node 12 to the end (we are already there), and there is no next node.

Node  1  2  3  4  5  6  7  8  9  10  11  12
Cost                                        0
Next                                        –

[Fig. 8.12: Small version of Fig. 8.9 for convenience.]

Now we move back one level to consider nodes 9, 10, and 11. These nodes all lead to node 12 and are thus straightforward. We need to be more careful with the formulas as we get to the more complicated cases next.
Node  1  2  3  4  5  6  7  8   9  10  11  12
Cost                            3   6   2   0
Next                           12  12  12   –
Now we move back one level to nodes 5, 6, 7, and 8. Using the Bellman equation for node 5, the cost is the minimum, over the nodes reachable from node 5, of the edge cost plus the cost already computed for that node.
We have already computed the minimum value for cost(9), cost(10), and cost(11),
so we just look up these values in the table. In this case, the minimum total
value is 3 and is associated with moving to node 11. Similarly, the cost for node
6 is
cost(6) = min[5 + cost(9), 4 + cost(10)]. (8.10)
The result is 8, and it is realized by moving to node 9.
Node  1  2  3  4   5  6  7  8   9  10  11  12
Cost               3  8          3   6   2   0
Next              11  9         12  12  12   –
We repeat this process, moving back and reusing optimal solutions to find
the global optimum. The completed table is as follows:
Node 1 2 3 4 5 6 7 8 9 10 11 12
Cost 10 8 12 9 3 8 7 4 3 6 2 0
Next 2 5 6 8 11 9 11 11 12 12 12 –
From this table, we see that the minimum cost is 10. This cost is achieved
by moving first to node 2. Under node 2, we see that we next go to node 5, then
11, and finally 12. Thus, the tabulation gives us the global minimum for cost
and the design decisions to achieve that.
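The tabulation used in this example can be sketched in Python as a backward pass over the graph. The small graph below is hypothetical (it is not the graph of Fig. 8.9/8.12), and the sketch assumes that higher node numbers are downstream of lower ones.

# edges[node] is a list of (next_node, edge_cost) pairs
edges = {
    1: [(2, 2), (3, 1)],
    2: [(4, 3)],
    3: [(4, 5), (5, 2)],
    4: [(6, 1)],
    5: [(6, 4)],
    6: [],                       # terminal node
}

def tabulate(edges, terminal):
    cost = {terminal: 0}         # cost-to-go from each node to the terminal node
    nxt = {terminal: None}       # best next node from each node
    for node in sorted(edges, reverse=True):   # process successors before predecessors
        if node == terminal:
            continue
        # Bellman recursion: cost(node) = min over edges of edge cost + cost(next node)
        cost[node], nxt[node] = min((c + cost[m], m) for m, c in edges[node])
    return cost, nxt

cost, nxt = tabulate(edges, terminal=6)
print(cost[1])                   # minimum total cost from node 1
node, path = 1, [1]
while nxt[node] is not None:     # recover the optimal path from the "next" table
    node = nxt[node]
    path.append(node)
print(path)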
maximize_𝑥   Σ_{𝑖=1}^{𝑛} 𝑐_𝑖 𝑥_𝑖

subject to   Σ_{𝑖=1}^{𝑛} 𝑤_𝑖 𝑥_𝑖 ≤ 𝐾    (8.11)

             𝑥_𝑖 ∈ {0, 1} .
In its present form, the knapsack problem has a linear objective and
linear constraints, so branch and bound would be a good approach.
However, this problem can also be formulated as a Markov chain, so we
can use dynamic programming. The dynamic programming version
accommodates variations such as stochasticity and other constraints
more easily.
To pose this problem as a Markov chain, we define the state as the
remaining capacity of the knapsack 𝑘 and the number of items we
have already considered. In other words, we are interested in 𝑣(𝑘, 𝑖),
where 𝑣 is the value function (optimal value given the inputs), 𝑘 is
the remaining capacity in the knapsack, and 𝑖 indicates that we have
already considered items 1 through 𝑖 (this does not mean we have
added them all to our knapsack, only that we have considered them).
We iterate through a series of decisions 𝑥 𝑖 deciding whether to take
item 𝑖 or not, which transitions us to a new state where 𝑖 increases and
𝑘 may decrease, depending on whether or not we took the item.
The real problem we are interested in is 𝑣(𝐾, 𝑛), which we solve
using tabulation. Starting at the bottom, we know that 𝑣(𝑘, 0) = 0 for
any 𝑘. This means that no matter the capacity, the value is 0 if we have
not considered any items yet. To work forward, consider a general case
considering item 𝑖, with the assumption that we have already solved
up to item 𝑖 − 1 for any capacity. If item 𝑖 cannot fit in our knapsack
(𝑤 𝑖 > 𝑘), then we cannot take the item. Alternatively, if the weight is
less than the capacity, we need to decide whether to select item 𝑖 or
not. If we do not, then the value is unchanged, and 𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1).
If we do select item 𝑖, then our value is 𝑐 𝑖 plus the best we could do
with the previous items but with a capacity that was smaller by 𝑤 𝑖 :
𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1). Whichever of these decisions yields a
better value is what we should choose.
To determine which items produce this cost, we need to add more
logic. To keep track of the selected items, we define a selection matrix
𝑆 of the same size as 𝑣 (note that this matrix is indexed starting at zero
in both dimensions). Every time we accept an item 𝑖, we register that in
Inputs:
𝑐 𝑖 : Cost of item 𝑖
𝑤 𝑖 : Weight of item 𝑖
𝐾: Total available capacity
Outputs:
𝑥 ∗ : Optimal selections
𝑣(𝐾, 𝑛): Corresponding cost, 𝑣(𝑘, 𝑖) is the optimal cost for capacity 𝑘 considering items 1
through 𝑖 ; note that indexing starts at 0
for 𝑘 = 0 to 𝐾 do
𝑣(𝑘, 0) = 0 No items considered; value is zero for any capacity
end for
for 𝑖 = 1 to 𝑛 do Iterate forward solving for one additional item at a time
for 𝑘 = 0 to 𝐾 do
if 𝑤 𝑖 > 𝑘 then
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1) Weight exceeds capacity; value unchanged
else
if 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1) > 𝑣(𝑘, 𝑖 − 1) then Take item
𝑣(𝑘, 𝑖) = 𝑐 𝑖 + 𝑣(𝑘 − 𝑤 𝑖 , 𝑖 − 1)
𝑆(𝑘, 𝑖) = 1
else Reject item
𝑣(𝑘, 𝑖) = 𝑣(𝑘, 𝑖 − 1)
end if
end if
end for
end for
𝑘=𝐾 Initialize
𝑥 ∗ = {} Initialize solution 𝑥 ∗ as an empty set
for 𝑖 = 𝑛 to 1 by −1 do Loop to determine which items we selected
if 𝑆 𝑘,𝑖 = 1 then
add 𝑖 to 𝑥 ∗ Item 𝑖 was selected
𝑘 = 𝑘 − 𝑤𝑖
end if
end for
We fill all entries in the matrix 𝑣[𝑘, 𝑖] to extract the last value
𝑣[𝐾, 𝑛]. For small numbers, filling this matrix (or table) is often
illustrated manually, hence the name tabulation. As with the Fibonacci
example, using dynamic programming instead of a fully recursive
solution reduces the complexity from 𝒪(2^𝑛) to 𝒪(𝐾𝑛), which means it
is pseudolinear. It is only pseudolinear because there is a dependence
on the knapsack size. For small capacities, the problem scales well
even with many items, but as the capacity grows, the problem scales
much less efficiently. Note that the knapsack problem requires integer
weights. Real numbers can be scaled up to integers (e.g., 1.2, 2.4 become
12, 24). Arbitrary precision floats are not feasible given the number of
combinations to search across.
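The tabulated knapsack algorithm above translates directly into Python. This is a minimal sketch following the same structure; the item values, weights, and capacity in the usage example are ours.

def knapsack(c, w, K):
    n = len(c)
    # v[k][i]: best value with capacity k considering items 1..i (index 0 means no items)
    v = [[0] * (n + 1) for _ in range(K + 1)]
    S = [[0] * (n + 1) for _ in range(K + 1)]    # selection matrix
    for i in range(1, n + 1):
        for k in range(K + 1):
            if w[i - 1] > k:                     # item does not fit: value unchanged
                v[k][i] = v[k][i - 1]
            elif c[i - 1] + v[k - w[i - 1]][i - 1] > v[k][i - 1]:
                v[k][i] = c[i - 1] + v[k - w[i - 1]][i - 1]   # take item i
                S[k][i] = 1
            else:
                v[k][i] = v[k][i - 1]            # reject item i
    x, k = [], K                                 # trace back which items were selected
    for i in range(n, 0, -1):
        if S[k][i] == 1:
            x.append(i)
            k -= w[i - 1]
    return v[K][n], sorted(x)

# Example data (hypothetical): values, weights, and capacity
print(knapsack(c=[4, 3, 6, 5], w=[2, 1, 3, 2], K=5))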
Simulated annealing∗ is a methodology designed for discrete optimization problems. However, it can also be effective for continuous multimodal problems, as we will discuss. The algorithm is inspired by the annealing process of metals. The atoms in a metal form a crystal lattice structure. If the metal is heated, the atoms move around freely. As the metal cools down, the atoms slow down, and if the cooling is slow

∗ First developed by Kirkpatrick et al.153 and Černý.154
153. Kirkpatrick et al., Optimization by simulated annealing, 1983.
154. Černý, Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm, 1985.
𝑇 = 𝑇_0 𝛼^𝑘 ,    (8.15)
The annealing schedule can substantially impact the algorithm’s performance, and some experimentation is required to select an appropriate schedule for the problem at hand. One essential requirement is that the temperature should start high enough to allow for exploration. This should be significantly higher than the maximum expected energy change (change in objective) but not so high that computational time is

156. Andresen and Gordon, Constant thermodynamic speed for minimizing entropy production in thermodynamic processes and simulated annealing, 1994.
wasted with too much random searching. Also, cooling should occur
slowly to improve the ability to recover from a local optimum, imitating
the annealing process instead of the quenching process.
The algorithm is summarized in Alg. 8.5; for simplicity in the
description, the annealing schedule uses an exponential decrease at
every iteration.
Inputs:
𝑥_0: Starting point
𝑇_0: Initial temperature
Outputs:
𝑥*: Optimal point

while not converged do
  Generate a random neighboring point 𝑥_new near 𝑥^{(𝑘)}
  if 𝑓(𝑥_new) ≤ 𝑓(𝑥^{(𝑘)}) then
    𝑥^{(𝑘+1)} = 𝑥_new   Accept a better point
  else
    𝑟 ∈ 𝒰[0, 1]   Randomly draw from uniform distribution
    𝑃 = exp(−(𝑓(𝑥_new) − 𝑓(𝑥^{(𝑘)}))/𝑇)   Acceptance probability
    if 𝑃 > 𝑟 then 𝑥^{(𝑘+1)} = 𝑥_new else 𝑥^{(𝑘+1)} = 𝑥^{(𝑘)}   Occasionally accept a worse point
  end if
  Update the temperature 𝑇 using the annealing schedule (Eq. 8.15); 𝑘 = 𝑘 + 1
end while
neighboring designs are generated in one of two ways: (1) randomly choose two points and reverse the path segment between them, or (2) randomly choose two points and move the path segments to
follow another randomly chosen point. The distance traveled by the randomly
generated initial set of points is 26.2.
We specify an iteration budget of 25,000 iterations, set the initial temperature
to be 10, and decrease the temperature by a multiplicative factor of 0.95 at every
100 iterations. The right panel of Fig. 8.13 shows the final path, which has a
length of 5.61. The final path might not be the global optimum (remember,
these finite time methods are only approximations of the full combinatorial
search), but the methodology is effective and fast for this problem in finding at
least a near-optimal solution. Figure 8.14 shows the iteration history.
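For a continuous problem, the same acceptance rule and exponential schedule look as follows in Python. This is a minimal sketch under stated assumptions: the test function, step size, and schedule parameters are ours, and the temperature is reduced by a factor of 0.95 every 100 iterations in the spirit of the example above.

import math, random

def anneal(f, x0, T0=10.0, alpha=0.95, step=0.5, k_max=5000, seed=0):
    random.seed(seed)
    x, fx = list(x0), f(x0)
    x_best, f_best = list(x), fx
    T = T0
    for k in range(k_max):
        x_new = [xi + step * random.uniform(-1, 1) for xi in x]   # random neighboring point
        f_new = f(x_new)
        # Always accept improvements; accept worse points with probability exp(-df/T)
        if f_new <= fx or random.random() < math.exp(-(f_new - fx) / T):
            x, fx = x_new, f_new
        if fx < f_best:
            x_best, f_best = list(x), fx      # track the best point seen
        T = T0 * alpha**(k // 100)            # exponential cooling schedule (Eq. 8.15 style)
    return x_best, f_best

f = lambda x: (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2   # assumed multimodal test function
print(anneal(f, [0.0, 0.0]))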
The binary form of a genetic algorithm (GA) can be directly used with
discrete variables. Because the binary form already requires a discrete
representation for the population members, using discrete design
variables is a natural fit. The details of this method were discussed in
Section 7.6.1.
8.8 Summary
Problems
8.2 Branch and bound. Solve the following problem using a manual
branch-and-bound approach (i.e., show each LP subproblem), as
A B C D Limit
Chlorine 0.74 −0.05 1.0 −0.15 97
Sodium hydroxide 0.39 0.4 0.91 0.44 99
Sulfuric acid 0.86 0.89 0.09 0.83 52
Labor (person-hours) 5 7 7 6 1000
𝑤 𝑖 = [2, 5, 3, 4, 6, 1]
𝑐 𝑖 = [5, 3, 1, 5, 7, 2]
a. A greedy algorithm where you take the item with the best
cost-to-weight ratio (that fits within the remaining capacity)
at each iteration
b. Dynamic programming
9 Multiobjective Optimization
becomes

minimize_𝑥   𝑓(𝑥) = [𝑓_1(𝑥), 𝑓_2(𝑥), . . . , 𝑓_{𝑛_𝑓}(𝑥)] ,  where 𝑛_𝑓 ≥ 2 .    (9.2)
The constraints are unchanged unless some of them have been refor-
mulated as objectives. This multiobjective formulation might require
trade-offs when trying to minimize all functions simultaneously be-
cause, beyond some point, further reduction in one objective can only
be achieved by increasing one or more of the other objectives.
One exception occurs if the objectives are independent because they
depend on different sets of design variables. Then, the objectives are
said to be separable, and they can be minimized independently. If there
are constraints, these need to be separable as well. However, separable
objectives and constraints are rare because functions tend to be linked
in engineering systems.
Given that multiobjective optimization requires trade-offs, we need
a new definition of optimality. In the next section, we explain how there
is an infinite number of optimal points, forming a surface in the space of
objective functions. After defining optimality for multiple objectives, we
present several possible methods for solving multiobjective optimization
problems.
9.2 Pareto Optimality

[Fig. 9.1: Three designs, 𝐴, 𝐵, and 𝐶, are plotted against two objectives, 𝑓1 and 𝑓2. The region in the shaded rectangle highlights points that are dominated by design 𝐴.]

Figure 9.1 shows three designs measured against two objectives that we want to minimize: 𝑓1 and 𝑓2. Let us first compare design A with design B. From the figure, we see that design A is better than design B in both objectives. In the language of multiobjective optimization, we say that design A dominates design B. One design is said to dominate
[Fig. 9.3: A notional Pareto front representing power and noise trade-offs for a wind farm optimization problem; the axes are −Power and Noise.]

The left side of the Pareto front tells us how much power we have to sacrifice for a given reduction in noise. If the slope is steep, as is the case in the figure, we can see that a small sacrifice in maximum power production can be exchanged for significantly reduced noise. However, if more significant noise reductions are sought, then large power reductions are required. Conversely, if the left side of the figure had a flatter slope, we would know that small reductions in noise would require significant decreases in power. Understanding the magnitude of these trade-off sensitivities helps make high-level design decisions.
Σ_{𝑖=1}^{𝑛_𝑓} 𝑤_𝑖 = 1 .    (9.4)
of 𝑤 should be used to sweep out the Pareto set evenly, and (3) this method can only return points on the convex portion of the Pareto front.

In Fig. 9.5, we highlight the convex portions of the Pareto front from Fig. 9.4. If we utilize the concept of pushing a line down and to the left, we see that these are the only portions of the Pareto front that can be found using a weighted-sum method.

[Fig. 9.5: The convex portions of this Pareto front are the portions highlighted.]
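A weighted-sum sweep is straightforward to script. The following minimal sketch traces an approximation of (the convex part of) a two-objective Pareto front by varying the weight; the objective functions are assumed illustrations, and SciPy's general-purpose minimizer stands in for the single-objective solver.

import numpy as np
from scipy.optimize import minimize

def f1(x):   # assumed objective 1
    return (x[0] - 1)**2 + x[1]**2

def f2(x):   # assumed objective 2
    return x[0]**2 + (x[1] - 2)**2

pareto = []
x0 = np.zeros(2)
for w in np.linspace(0, 1, 11):
    # Weighted-sum objective: w*f1 + (1 - w)*f2 (in practice, scale the objectives
    # to similar magnitudes before weighting)
    res = minimize(lambda x: w*f1(x) + (1 - w)*f2(x), x0)
    pareto.append((f1(res.x), f2(res.x)))
    x0 = res.x    # warm-start the next subproblem
print(pareto)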
passes through the anchor points. We space points along this plane
(usually evenly) and, starting from those points, solve optimization
problems that search along directions normal to this plane.
This procedure is shown in Fig. 9.7 for a two-objective case. In this
case, the plane that passes through the anchor points is a line. We
now space points along this line by choosing a vector of weights 𝑏, as
illustrated on the left-hand side of Fig. 9.7. The weights are constrained such that 𝑏_𝑖 ∈ [0, 1] and Σ_𝑖 𝑏_𝑖 = 1. If we make 𝑏_𝑖 = 1 and all other entries
zero, then this equation returns one of the anchor points, 𝑓 (𝑥 ∗𝑖 ). For
two objectives, we would set 𝑏 = [𝑤, 1 − 𝑤] and vary 𝑤 in equal steps
between 0 and 1.
[Fig. 9.7: A notional example of the NBI method. A plane is created that passes through the single-objective optima (the anchor points), and solutions are sought normal to that plane for a more evenly spaced Pareto front.]
𝑓* = [𝑓_1(𝑥*_1), 𝑓_2(𝑥*_2), . . . , 𝑓_{𝑛_𝑓}(𝑥*_{𝑛_𝑓})] ,    (9.8)
𝑛˜ = −𝑃𝑒 , (9.11)
maximize_{𝑥, 𝛼}   𝛼
subject to   𝑃𝑏 + 𝑓* + 𝛼 𝑛̂ = 𝑓(𝑥)
             𝑔(𝑥) ≤ 0    (9.12)
             ℎ(𝑥) = 0 .
This means that we find the point farthest away from the anchor-point
plane, starting from a given value for 𝑏, while satisfying the original
problem constraints. The process is then repeated for additional values
of 𝑏 to sweep out the Pareto front.
In contrast to the previously mentioned methods, this method yields
a more uniformly spaced Pareto front, which is desirable for computa-
tional efficiency, albeit at the cost of a more complex methodology.
For most multiobjective design problems, additional complexity
beyond the NBI method is unnecessary. However, even this method
can still have deficiencies for problems with unusual Pareto fronts,
and new methods continue to be developed. For example, the normal
constraint method uses a very similar approach,161 but with inequality constraints to address a deficiency in the NBI method that occurs when the normal line does not cross the Pareto front. This methodology has undergone various improvements, including better scaling through normalization.162 A more recent improvement performs an even more efficient generation of the Pareto frontier by avoiding regions of the Pareto front where minimal trade-offs occur.163

161. Ismail-Yahaya and Messac, Effective generation of the Pareto frontier using the normal constraint method, 2002.
162. Messac and Mattson, Normal constraint method with guarantee of even representation of complete Pareto frontier, 2004.
163. Hancock and Mattson, The smart normal constraint method for directly generating a smart Pareto set, 2013.
[Fig. 9.8: The two anchor points, (2, 3) and (5, 1), with the utopia point 𝑓* and the quasi-normal direction 𝑛̃.]
First, we optimize the objectives one at a time, which in our example results in the two anchor points shown in Fig. 9.8: 𝑓(𝑥*_1) = (2, 3) and 𝑓(𝑥*_2) = (5, 1). The utopia point is then

𝑓* = [2, 1] .

For the matrix 𝑃, recall that the 𝑖th column of 𝑃 is 𝑓(𝑥*_𝑖) − 𝑓*:

𝑃 = [0  3
     2  0] .

Our quasi-normal vector is given by −𝑃𝑒 (note that the true normal is [−2, −3]):

𝑛̃ = [−3, −2] .

We now have all the parameters we need to solve Eq. 9.12.

∗ The first application of an evolutionary algorithm for solving a multiobjective problem was by Schaffer.164
164. Schaffer, Some experiments in machine learning using vector evaluated genetic algorithms, 1984.
9.3.4 Evolutionary Algorithms
Gradient-free methods can, and occasionally do, use all of the previously described methods. However, evolutionary algorithms also enable a fundamentally different approach. Genetic algorithms (GAs), a specific type of evolutionary algorithm, were introduced in Section 7.6.∗

A GA is amenable to an extension that can handle multiple objectives because it keeps track of a large population of designs at each iteration. If we plot two objective functions for a given population of a GA iteration, we get something like that shown in Fig. 9.9. The points represent the current population, and the highlighted points in the lower left are the current nondominated set. As the optimization progresses, the nondominated set moves further down and to the left and eventually converges toward the actual Pareto front.

[Fig. 9.9: Population for a multiobjective GA iteration plotted against two objectives. The nondominated set is highlighted at the bottom left and eventually converges toward the Pareto front.]
is just the current approximation of the Pareto front). The algorithm recursively divides the population in half and finds the nondominated set for each half separately.

165. Deb, Introduction to evolutionary multiobjective optimization, 2008.
166. Kung et al., On finding the maxima of a set of vectors, 1975.
Inputs:
𝑝: A population sorted by the first objective
Outputs:
𝑓 : The nondominated set for the population
procedure front(𝑝)
if length(𝑝) = 1 then If there is only one point, it is the front
return 𝑝
end if
Split population into two halves 𝑝 𝑡 and 𝑝 𝑏
⊲ Because input was sorted, 𝑝 𝑡 will be superior to 𝑝 𝑏 in the first objective
𝑡 = front(𝑝 𝑡 ) Recursive call to find front for top half
𝑏 = front(𝑝 𝑏 ) Recursive call to find front for bottom half
Initialize 𝑓 with the members from 𝑡   Merged population
for 𝑖 = 1 to length(𝑏) do
dominated = false Track whether anything in 𝑡 dominates 𝑏 𝑖
for 𝑗 = 1 to length(𝑡) do
if 𝑡 𝑗 dominates 𝑏 𝑖 then
dominated = true
break No need to continue search through 𝑡
end if
end for
if not dominated then 𝑏 𝑖 was not dominated by anything in 𝑡
Add 𝑏 𝑖 to 𝑓
end if
end for
return 𝑓
end procedure
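The divide-and-conquer procedure above maps directly onto the short Python sketch below for two objectives (minimization). The domination test and the example points are ours; as in the pseudocode, the input must be sorted by the first objective before the first call.

def dominates(a, b):
    """True if point a dominates point b (no worse in all objectives, better in at least one)."""
    return all(ai <= bi for ai, bi in zip(a, b)) and any(ai < bi for ai, bi in zip(a, b))

def front(p):
    if len(p) == 1:                 # a single point is its own front
        return p
    half = len(p) // 2
    t = front(p[:half])             # front of the top half (superior in the first objective)
    b = front(p[half:])             # front of the bottom half
    f = list(t)                     # initialize the merged front with t
    for bi in b:
        if not any(dominates(tj, bi) for tj in t):
            f.append(bi)            # keep members of b not dominated by anything in t
    return f

points = [(4.0, 7.0), (2.0, 6.0), (5.0, 3.0), (3.0, 5.0), (1.0, 9.0), (6.0, 2.0)]
points.sort(key=lambda fp: fp[0])   # sort by the first objective first
print(front(points))                # nondominated set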
Inputs:
𝑝: A population
Outputs:
rank: The rank for each member in the population
Inputs:
𝑝: A population
Outputs:
𝑑: Crowding distances
Inputs:
𝑥: Variable upper bounds
𝑥: Variable lower bounds
Outputs:
𝑥 ∗ : Best point
We see that the current nondominated set consists of points D and J and that
there are four different ranks.
Next, we start filling the new population in the order of rank. Our maximum
capacity is 6, so all rank 1 {D, J} and rank 2 {E, H, K} fit. We cannot add rank 3
{A, C, I, L} because the population size would be 9. So far, our new population
consists of {D, J, E, H, K}. To choose which items from rank 3 continue forward,
we compute the crowding distance for the members of rank 3:
A C I L
1.67 ∞ 1.5 ∞
We would then add, in order {C, L, A, I}, but we only have room for one, so we
add C and complete this iteration with a new population of {D, J, E, H, K, C}.
9.4 Summary
Problems
• (20, 4)
• (18, 5)
• (34, 2)
• (19, 6)
𝑓1 𝑓2
6.0 8.0
6.0 4.0
5.0 6.0
2.0 8.0
10.0 5.0
6.0 0.5
8.0 3.0
4.0 9.0
9.0 7.0
8.0 6.0
3.0 1.0
7.0 9.0
1.0 2.0
3.0 7.0
1.5 1.5
4.0 6.5
10 Surrogate-Based Optimization
There are various scenarios for which surrogate models are helpful.
One scenario is when the original model is computationally expensive.
Surrogate models can be queried with minimal computational cost, but
constructing them requires multiple evaluations of the original model.
Suppose the number of evaluations needed to build a sufficiently
accurate surrogate model is less than that needed to optimize the
original model directly. In that case, SBO may be a worthwhile option.
Constructing a surrogate model becomes even more compelling when
it is reused in multiple optimizations.
Surrogate modeling can be effective in handling noisy models because it creates a smooth representation of the noisy data. This can be particularly advantageous when using gradient-based optimization.
One scenario that leads to both expensive evaluation and noisy
output is experimental data. When the model data are experimental
and the optimizer cannot query the experiment in an automated way,
we can construct a surrogate model based on the experimental data.
Then, the optimizer can query the surrogate model in the optimization.
Surrogate models are also helpful when we want to understand
the design space, that is, how the objective and constraints (outputs)
vary with respect to the design variables (inputs). By constructing a
continuous model over discrete data, we obtain functional relationships
that can be visualized more effectively.
When multiple sources of data are available, surrogate models can
fuse the data to build a single model. The data could come from
numerical models with different levels of fidelity or experimental data.
For example, surrogate models can calibrate numerical model data
using experimental data. This is helpful because experimental data is
usually much more scarce than numerical data. The same reasoning
applies to low- versus high-fidelity numerical data.
One potential issue with surrogate models is the curse of dimension-
ality, which refers to poor scalability with the number of inputs. The
larger the number of inputs, the more model evaluations are needed
to construct a surrogate model that is accurate enough. Therefore, the
reasons for using surrogate models cited earlier might not be enough if
the optimization problem has a large number of design variables.
The SBO process is shown in Fig. 10.2. First, we use sampling
methods to choose the initial points to evaluate the function or conduct
experiments. These points are sometimes referred to as training data.
Next, we build a surrogate model from the sampled points. We can
then perform optimization by querying the surrogate model. Based on the optimization result, we can add infill points to the training data, rebuild the surrogate, and repeat until convergence.
10.2 Sampling
[Fig. 10.3 Contrast between random (left) and Latin hypercube sampling (right) with 50 points using uniform distributions.]
plan shown on the left of Fig. 10.6. This plan meets our criteria but
clearly does not fill the space and likely will not capture the relationships
between design parameters well. Alternatively, the right side of Fig. 10.6
has a sample in each row and column while also spanning the space
much more effectively.
The key idea of the LHS approach is that rather than relying on the law of large numbers to fill out our chosen probability distributions, we enforce it as a constraint.
This method may still require many samples to characterize the design
space accurately, but it usually requires far fewer than pure random
sampling.
Instead of defining LHS as an optimization problem, a much simpler
approach is typically used in which we ensure one sample per interval,
but we rely on randomness to choose point combinations. Although
this does not necessarily yield a maximum spread, it works well in
practice and is simple to implement. Before discussing the algorithm,
we discuss how to generate other distributions besides just uniform
distributions.
We can convert from uniformly sampled points to an arbitrary
distribution using a technique called inversion sampling. Assume that
we want to generate samples 𝑥 from an arbitrary probability density
function (PDF) 𝑝(𝑥) or, equivalently, from the corresponding cumulative
distribution function (CDF) 𝑃(𝑥).∗ The probability integral transform states that for any continuous CDF, 𝑦 = 𝑃(𝑥), the variable 𝑦 is uniformly distributed (a simple proof, but it is not shown here to avoid introducing additional notation). The procedure is to randomly sample from a uniform distribution (e.g., generate 𝑦), then compute the corresponding 𝑥 such that 𝑃(𝑥) = 𝑦, which we denote as 𝑥 = 𝑃⁻¹(𝑦). This latter step is known as an inverse CDF, a percent-point function, or a quantile function.
∗ PDFs and CDFs are reviewed in Appendix A.9.
This process is depicted in Fig. 10.7 for a normal distribution. This
same procedure allows us to use LHS with any distribution, simply by
generating the samples on a uniform distribution.
Inputs:
𝑛𝑠: Number of samples
𝑛𝑑: Number of dimensions
𝑃 = {𝑃1, . . . , 𝑃𝑛𝑑}: (optionally) A set of cumulative distribution functions
Outputs:
𝑋 = {𝑥1, . . . , 𝑥𝑛𝑠}: Set of sample points

for 𝑗 = 1 to 𝑛𝑑 do
    for 𝑖 = 1 to 𝑛𝑠 do
        𝑉𝑖𝑗 = 𝑖/𝑛𝑠 − 𝑅𝑖𝑗/𝑛𝑠, where 𝑅𝑖𝑗 ∈ 𝒰[0, 1]   ⊲ Randomly choose a value in each equally spaced cell from a uniform distribution
    end for
    𝑋∗𝑗 = 𝑃𝑗⁻¹(𝑉∗𝑗), where 𝑃𝑗 is a CDF              ⊲ Evaluate inverse CDF
    Randomly permute the entries of this column 𝑋∗𝑗   ⊲ Alternatively, permute the indices 1 . . . 𝑛𝑠 in the prior for loop
end for
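A minimal NumPy sketch of this algorithm follows; the optional ppf argument (a list of inverse-CDF callables) is an assumed interface for supplying nonuniform distributions.

```python
# Latin hypercube sampling: one uniform draw per equally spaced cell in each dimension,
# optionally mapped through an inverse CDF, then randomly permuted to pair the cells.
import numpy as np

def latin_hypercube(n_s, n_d, ppf=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    X = np.empty((n_s, n_d))
    for j in range(n_d):
        cells = (np.arange(1, n_s + 1) - rng.random(n_s)) / n_s   # (i - R_ij)/n_s
        col = ppf[j](cells) if ppf is not None else cells          # evaluate inverse CDF
        X[:, j] = rng.permutation(col)                             # random pairing of cells
    return X

# Usage: 8 uniform samples in 2-D, or normal samples via scipy's ppf (an assumption):
X_uniform = latin_hypercube(8, 2)
# from scipy.stats import norm
# X_normal = latin_hypercube(8, 2, ppf=[norm.ppf, norm.ppf])
```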
An example using Alg. 10.1 for eight points is shown in Fig. 10.8.
[Algorithm header only — Inputs: 𝑖: 𝑖th point in sequence; 𝑏: base (integer). Outputs: 𝜙: generated point.]

Halton Sequence
A Halton sequence uses pairwise prime numbers (larger than 1) for the base of each dimension of the problem.† The 𝑖th point in the Halton sequence is

    \left( \phi(i, b_1),\; \phi(i, b_2),\; \ldots,\; \phi(i, b_{n_x}) \right),    (10.5)

where the 𝑏𝑗 set is pairwise prime. As an example in two dimensions, Fig. 10.10 shows 30 generated points of the Halton sequence where 𝑥1 uses base 2 and 𝑥2 uses base 3, and then a subsequent 20 generated points are added (in another color), showing the reuse of existing points.

[Fig. 10.10 Halton sequence with base 2 for 𝑥1 and base 3 for 𝑥2. First, 30 points are selected (in blue), and then 20 points are added (in red). These points would be identical to 50 points chosen at once.]

If the dimensionality of the problem is high, then some of the base combinations lead to points that are highly correlated and thus
undesirable for a sampling plan. For example, the left of Fig. 10.11
shows 50 generated points where 𝑥1 uses base 17, and 𝑥2 uses base 19.
To avoid this issue, we can use a scrambled Halton sequence.
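Because the body of the 𝜙(𝑖, 𝑏) algorithm is not reproduced above, the following sketch uses the standard radical-inverse construction (reflecting the base-𝑏 digits of 𝑖 about the radix point); it should be read as an illustration rather than the book's exact procedure.

```python
# Radical inverse phi(i, b) and the Halton point of Eq. 10.5.
def radical_inverse(i, b):
    phi, f = 0.0, 1.0 / b
    while i > 0:
        i, digit = divmod(i, b)
        phi += digit * f            # place each base-b digit after the radix point
        f /= b
    return phi

def halton_point(i, bases):
    return [radical_inverse(i, b) for b in bases]

# First few 2-D Halton points with bases 2 and 3 (as in Fig. 10.10):
print([halton_point(i, (2, 3)) for i in range(1, 5)])
# [[0.5, 0.333...], [0.25, 0.666...], [0.75, 0.111...], [0.125, 0.444...]]
```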
[Fig. 10.11 Halton sequence with base 17 for 𝑥1 and base 19 for 𝑥2: standard Halton sequence (left) versus scrambled Halton (right).]
Hammersley Sequence

The Hammersley sequence is closely related to the Halton sequence. Because the number of points 𝑛𝑝 is known beforehand, the first dimension can simply use evenly spaced points, and the remaining dimensions use the Halton spacing:

    \left( \frac{i}{n_p},\; \phi(i, b_1),\; \phi(i, b_2),\; \ldots,\; \phi(i, b_{n_x - 1}) \right).    (10.7)

Other Sequences
where 𝑥 (𝑖) is an input vector from the sampling plan, and 𝑓 (𝑖) contains
the corresponding outputs from evaluating the model: 𝑓 (𝑖) = 𝑓 𝑥 (𝑖) .
We seek to construct a surrogate model from this data set. Surrogate
models can be based on physics, mathematics, or a combination of
the two. Incorporating known physics into a model is often desirable
to improve model accuracy. However, functional relationships are
unknown for many complex problems, and a data-driven mathematical
model can be more effective.
Surrogate-based models can be based on interpolation or regression, as illustrated in Fig. 10.13. Interpolation builds a function that exactly matches the provided training data. Regression models do not try to match the training data points exactly; instead, they minimize the error between a smooth trend function and the training data. The nature of the training data can help decide between these two types of surrogate models. Regression is particularly useful when the data are noisy.
    \underset{w}{\text{minimize}} \;\; \sum_i \left( \hat{f}\!\left(w; x^{(i)}\right) - f^{(i)} \right)^2 .    (10.11)
    \underset{w}{\text{minimize}} \;\; w^\top \Psi^\top \Psi w - 2 f^\top \Psi w + f^\top f .    (10.15)
We can omit the last term from the objective because our optimization
variables are 𝑤, and the last term has no 𝑤 dependence:
    \underset{w}{\text{minimize}} \;\; w^\top \Psi^\top \Psi w - 2 f^\top \Psi w .    (10.16)

    Q = 2 \Psi^\top \Psi    (10.18)
    q = -2 \Psi^\top f .    (10.19)
𝑤 = Ψ† 𝑓 . (10.24)
This allows for a similar form to solving a linear system of equations
where an inverse would be used instead. In solving both a linear system
and the linear least-squares equation (Eq. 10.23), we do not explicitly
invert a matrix. For linear least squares, a QR factorization is commonly
used for improved numerical conditioning as compared to solving
Eq. 10.23 directly.
Tip 10.2 Least squares is not the same as a linear system solution
Consider the quadratic fit discussed in Ex. 10.2. We are provided the data points, 𝑥 and 𝑓, shown as circles in Fig. 10.14. From these data, we construct the matrix Ψ for our basis functions as follows:

    \Psi = \begin{bmatrix} \left(x^{(1)}\right)^2 & x^{(1)} & 1 \\ \left(x^{(2)}\right)^2 & x^{(2)} & 1 \\ \vdots & \vdots & \vdots \\ \left(x^{(n_s)}\right)^2 & x^{(n_s)} & 1 \end{bmatrix}

We can then solve for the coefficients 𝑤 using the linear least squares solution (Eq. 10.23). Substituting the coefficients and respective basis functions into Eq. 10.10, we obtain the surrogate model,

    \hat{f}(x) = w_1 x^2 + w_2 x + w_3 ,

which is also plotted in Fig. 10.14 as a solid line.

[Fig. 10.14 Linear least squares example with a quadratic fit on a one-dimensional function.]
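A short NumPy sketch of this quadratic fit is shown below. The data arrays are hypothetical placeholders, since the example's data are only shown graphically.

```python
# Build the basis matrix Psi with columns [x^2, x, 1] and solve the least-squares
# problem for the coefficients w (a QR-based solve, as noted for Eq. 10.23).
import numpy as np

x = np.array([-2.0, -1.2, -0.3, 0.5, 1.1, 1.8])   # hypothetical sample points
f = np.array([15.0, 6.1, 0.8, 1.4, 5.2, 12.5])    # hypothetical function values

Psi = np.column_stack([x**2, x, np.ones_like(x)])  # rows [x_i^2, x_i, 1]
w, *_ = np.linalg.lstsq(Psi, f, rcond=None)

f_hat = lambda xq: w[0] * xq**2 + w[1] * xq + w[2]  # surrogate of Eq. 10.10
print(w, f_hat(0.0))
```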
where 𝜀 captures the error associated with the 𝑖th data point. We
assume that the error is normally distributed with mean zero and a
standard deviation of 𝜎:
    p\!\left(\varepsilon^{(i)}\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{\left(\varepsilon^{(i)}\right)^2}{2\sigma^2} \right) .    (10.30)

    p\!\left(f^{(i)} \mid x^{(i)}; w\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{\left(f^{(i)} - w^\top x^{(i)}\right)^2}{2\sigma^2} \right) .    (10.31)
Once we include all the data points 𝑥 (𝑖) , we would like to compute the
probability of observing 𝑓 conditioned on the inputs 𝑥 for a given set
of parameters in 𝑤. We call this the likelihood function 𝐿(𝑤). In this case,
assuming all errors are independent, the total probability for observing
the outputs is the product of the probability of observing each output:
    L(w) = \prod_{i=1}^{n_s} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{\left(f^{(i)} - w^\top x^{(i)}\right)^2}{2\sigma^2} \right) .    (10.32)
Now we can pose this as an optimization problem where we wish to
find the parameters 𝑤 that maximize the likelihood function; in other
words, we maximize the probability that our model is consistent with
the observed data. Because the objective is a product of multiple terms,
it is helpful to take the logarithm of the objective. Maximizing 𝐿 or
maximizing ℓ = ln(𝐿) does not change the solution to the problem but
makes it easier to solve. We call this the log likelihood function:
    \ell(w) = \ln\!\left( \prod_{i=1}^{n_s} \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{\left(f^{(i)} - w^\top x^{(i)}\right)^2}{2\sigma^2} \right) \right)    (10.33)
            = \sum_{i=1}^{n_s} \ln\!\left( \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{\left(f^{(i)} - w^\top x^{(i)}\right)^2}{2\sigma^2} \right) \right)    (10.34)
            = n_s \ln\!\left( \frac{1}{\sigma\sqrt{2\pi}} \right) - \sum_{i=1}^{n_s} \frac{\left(f^{(i)} - w^\top x^{(i)}\right)^2}{2\sigma^2} .    (10.35)
    J_{r,ij} = \frac{\partial r_i}{\partial w_j} .    (10.43)
This is now the same form as linear least squares (Eq. 10.14), so we can
reuse its solution (Eq. 10.23) to solve for the step
    \Delta w = -\left( J_r^\top J_r \right)^{-1} J_r^\top r .    (10.45)
The gradient is
    \nabla e_j = \sum_i 2 r_i \frac{\partial r_i}{\partial w_j} ,    (10.48)

or in matrix form:

    \nabla e = 2 J_r^\top r .    (10.49)
If we neglect the second term in the Hessian, then the Newton update is:

    w_{k+1} = w_k - \frac{1}{2} \left( J_r^\top J_r \right)^{-1} 2 J_r^\top r
            = w_k - \left( J_r^\top J_r \right)^{-1} J_r^\top r ,    (10.52)

which is the same update as before.
Thus, another interpretation of this method is that a Gauss–Newton step is a modified Newton step in which the second derivatives of the residuals are neglected.
    \Delta w = -\frac{1}{\mu} J_r^\top r .    (10.55)

where 𝐷 is defined as

    D^2 = \operatorname{diag}\!\left( J_r^\top J_r \right) .    (10.57)

This matrix scales the objective by the diagonal elements of the Hessian. Thus, when 𝜇 is large and the direction tends toward steepest descent, the components of the gradient are scaled by the curvature. The full update is then
    \Delta w = -\left( J_r^\top J_r + \mu \operatorname{diag}\!\left( J_r^\top J_r \right) \right)^{-1} J_r^\top r .    (10.58)
Inputs:
𝑥 0 : Starting point
𝜇0 : Initial damping parameter
𝜌: Damping parameter factor
Outputs:
𝑥 ∗ : Optimal solution
𝑘=0
𝑥 = 𝑥0
𝜇 = 𝜇0
𝑟, 𝐽 = residual(𝑥)
𝑒 = k𝑟 k 22 Residual error
while |Δ| > 𝜏 do
    𝑠 = −(𝐽⊤𝐽 + 𝜇 diag(𝐽⊤𝐽))⁻¹ 𝐽⊤𝑟    ⊲ Evaluate step
𝑟 𝑠 , 𝐽𝑠 = residual(𝑥 + 𝑠)
𝑒 𝑠 = k𝑟 𝑠 k 22
Δ = 𝑒𝑠 − 𝑒 Change in residual error
if Δ < 0 then Objective decreased; accept step
𝑥=𝑥+𝑠
𝑟, 𝐽, 𝑒 = 𝑟 𝑠 , 𝐽𝑠 , 𝑒 𝑠
𝜇 = 𝜇/𝜌
else Reject step
𝜇=𝜇·𝜌 Increase damping
end if
𝑘 = 𝑘+1
end while
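The following is a compact NumPy sketch of this algorithm under the stated update and acceptance rules; the residual function returning both 𝑟 and its Jacobian is an assumed interface, and the Rosenbrock residuals match the example that follows.

```python
# Levenberg-Marquardt sketch: damped Gauss-Newton steps with adaptive damping mu.
import numpy as np

def levenberg_marquardt(residual, x0, mu=1e-2, rho=10.0, tol=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    r, J = residual(x)
    e = r @ r                                          # sum of squared residuals
    for _ in range(max_iter):
        A = J.T @ J + mu * np.diag(np.diag(J.T @ J))   # damped Gauss-Newton matrix
        s = -np.linalg.solve(A, J.T @ r)               # step from Eq. 10.58
        r_s, J_s = residual(x + s)
        e_s = r_s @ r_s
        delta = e_s - e
        if delta < 0:                                  # error decreased: accept step
            x, r, J, e = x + s, r_s, J_s, e_s
            mu /= rho
        else:                                          # reject step, increase damping
            mu *= rho
        if abs(delta) < tol:
            break
    return x

def rosenbrock_residual(x):
    r = np.array([1.0 - x[0], 10.0 * (x[1] - x[0] ** 2)])
    J = np.array([[-1.0, 0.0], [-20.0 * x[0], 10.0]])
    return r, J

print(levenberg_marquardt(rosenbrock_residual, [-1.2, -1.0]))   # converges near [1, 1]
```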
In the following example, we use the same starting point as Ex. 4.18
(𝑥0 = [−1.2, −1]), an initial damping parameter of 𝜇 = 0.01, an update factor
of 𝜌 = 10, and a tolerance of 𝜏 = 10−6 (change in sum of squared errors). The
iteration path is shown on the left of Fig. 10.15, and the convergence of the sum
of squared errors is shown on the right side.
[Fig. 10.15 Levenberg–Marquardt algorithm applied to the minimization of the Rosenbrock function: iteration history from 𝑥0 to 𝑥∗ (left) and convergence of the sum of squared residuals over 42 iterations (right).]
Consider the set of training data (Fig. 10.16, left), which we use to create a
surrogate function. This is a one-dimensional problem so that it can be easily
visualized. In general, however, visualization is limited, and determining the
right basis functions to use can be difficult. If we use a polynomial basis, we
might attempt to determine the appropriate order by trying each case (e.g.,
quadratic, cubic, quartic) and measuring the error in our fit (Fig. 10.16, center).
[Fig. 10.16 Fitting different order polynomials to data. Left: training data. Center: the error in fitting the data decreases with the order of the polynomial. Right: a 19th-order polynomial fit to the data has low error but poor predictive ability.]

It seems as if the higher the order of the polynomial, the lower the error. For example, a 20th-order polynomial reduces the error to almost zero. The
problem is that although the error is low on this set of data, the predictive
capability of such a model for other data points is poor. For example, the right
side of Fig. 10.16 shows a 19th-order polynomial fit to the data. The model
passes right through the points, but it does not work well for many of the
points that are not part of the training set (which is the whole purpose of the
surrogate).
The opposite of overfitting is underfitting, which is also a potential issue.
When underfitting, we do not have enough degrees of freedom to create a
useful model (e.g., imagine using a linear fit for the previous example).
1. Randomly split your data into a training set and a validation set
(e.g., a 70–30 split).
An alternative option that is more involved but uses the data more
effectively is called 𝑘-fold cross validation. It is particularly advantageous
when we have a small data set where we cannot afford to leave much out.
This procedure is illustrated in Fig. 10.18 and consists of the following
steps:
The extreme version of this process, when training data are very limited,
is leave-one-out cross validation (i.e., each testing subset consists of one
data point).
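A minimal sketch of 𝑘-fold cross validation for a polynomial fit is given below, assuming numpy.polyfit as the stand-in model; the data arrays are placeholders.

```python
# k-fold cross validation: fit on k-1 folds, measure error on the held-out fold,
# and average the held-out errors over all k folds.
import numpy as np

def kfold_error(x, f, order, k=10, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(x)), k)        # k roughly equal subsets
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], f[train], order)        # fit on the training folds
        pred = np.polyval(coeffs, x[test])
        errors.append(np.mean((pred - f[test]) ** 2))         # error on held-out fold
    return np.mean(errors)

# Usage sketch: pick the lowest-error (and simplest) order among candidates
# x, f = ...  # training data
# errs = {n: kfold_error(x, f, n) for n in range(1, 10)}
```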
This example continues from Ex. 10.5. First, we perform 𝑘-fold cross
validation using 10 divisions. The average error across the divisions using the
training data is shown in Fig. 10.19 (with a smaller 𝑦-axis scale on the right).
The error increases dramatically as the polynomial order increases. Zoom-
ing in on the flat region, we see a range of options with similar errors. Among
the similar solutions, we generally prefer the simplest model. In this case,
a fourth-order polynomial seems reasonable. A fourth-order polynomial is
compared against the data in Fig. 10.20. This model has a much better predictive
ability.
10.3.5 Common Basis Functions

Although cross validation can help us find the lowest generalization

Polynomials

built into the model form, fewer data points are needed to create a reasonable model (e.g., a quadratic function in 𝑛 dimensions needs at least 𝑛(𝑛 + 1)/2 + 𝑛 + 1 points, so this amounts to 6 points in two dimensions, 10 points in three dimensions, and so on).
where 𝑐 is the center point, and 𝑟 is the radius about the center point.
Although the center points can be placed anywhere, we usually choose
the sampling data as centering points:
    \psi^{(i)} = \psi\!\left( \left\| x - x^{(i)} \right\| \right) .    (10.60)
This is often a useful choice because it captures the idea that our
ability to predict function behavior is related to how close we are to
known function values (in other words, nearby points are more highly
correlated). This form naturally lends itself to interpolation, although
regularization can be added to allow for regression. Polynomials are
often combined with radial basis functions because the polynomial can
capture global function behavior, while the radial basis functions can
introduce modifications to capture local behavior.
One popular radial basis function is the Gaussian basis:
    \psi^{(i)}(x) = \exp\!\left( -\sum_j \theta_j \left( x_j - x_j^{(i)} \right)^2 \right) ,    (10.61)
where 𝜃 𝑗 are the model parameters. One of the forms of kriging
discussed in the following section can be viewed as a radial basis
function model with a Gaussian basis.
10.4 Kriging
    K\!\left( x^{(i)}, x^{(j)} \right) = \exp\!\left( -\sum_{l=1}^{n_d} \theta_l \left| x_l^{(i)} - x_l^{(j)} \right|^{p_l} \right) ,    (10.64)

(The resulting correlation matrix is always symmetric and positive definite.)
    \mu^* = \frac{e^\top K^{-1} f}{e^\top K^{-1} e}    (10.69)

    \sigma^{*2} = \frac{(f - e\mu^*)^\top K^{-1} (f - e\mu^*)}{n_s} .    (10.70)

We now substitute these values back into the log likelihood function (Eq. 10.68), which yields

    \ell(\theta, p) = -\frac{n_s}{2} \ln \sigma^{*2} - \frac{1}{2} \ln |K| .    (10.71)
This function, also called the concentrated likelihood function, only de-
pends on the kernel 𝐾, which depends on 𝜃 and 𝑝.
We cannot solve for optimal values of 𝜃 and 𝑝 analytically. Instead,
we rely on numerical optimization to maximize Eq. 10.71. Because 𝜃 can
vary across a broad range, it is often better to search using logarithmic
scaling. Once we solve that optimization problem, we compute the
mean and variance in Eqs. 10.69 and 10.70.
Now that we have a fitted model, we can make predictions at new
points where we have not sampled. We do this by substituting 𝑥 𝑝 into
a formula called the kriging predictor. The formula is unique, but there
are many ways to derive it. One way to derive it is to find the function
value at 𝑥 𝑝 that is the most consistent with the behavior of the function
captured by the fitted kriging model.
Let 𝑓𝑝 be our guess for the value of the function at 𝑥 𝑝 . One way
to assess the consistency of our guess is to add (𝑥 𝑝 , 𝑓𝑝 ) as an artificial
point to our training data (so that we now have 𝑛 𝑠 + 1 points) and
estimate the likelihood using the parameters from our fitted kriging
model. The likelihood of this augmented data can now be thought of
as a function of 𝑓𝑝 : high values correspond to guessed values of 𝑓𝑝 that
are consistent with function behavior captured by the fitted kriging
model. Therefore, the value of 𝑓𝑝 that maximizes the likelihood of this
augmented data set is a natural way to predict the value of the function.
This is an optimization problem with a closed-form solution, and the
corresponding formula is the kriging predictor.
Now we outline the derivation of the kriging predictor.§ With the augmented point, our function values are 𝑓̄ = [𝑓, 𝑓𝑝], where 𝑓 is the 𝑛𝑠-vector of function values from the original training data. Then, the correlation matrix with the additional data point is

    \bar{K} = \begin{bmatrix} K & k \\ k^\top & 1 \end{bmatrix} ,    (10.72)

where 𝑘 is the correlation of the new point with the training data given by

    k = \begin{bmatrix} \operatorname{corr}\!\left( F(x^{(1)}), F(x_p) \right) = K\!\left( x^{(1)}, x_p \right) \\ \vdots \\ \operatorname{corr}\!\left( F(x^{(n_s)}), F(x_p) \right) = K\!\left( x^{(n_s)}, x_p \right) \end{bmatrix} .    (10.73)

§ Jones173 provides the complete derivation; here we show only a few key steps.
173. Jones, A taxonomy of global optimization methods based on response surfaces, 2001.
The 1 in the bottom right of the augmented correlation matrix (Eq. 10.72)
is because the correlation of the new variable 𝐹(𝑥 𝑝 ) with itself is 1. The
log likelihood function with these new augmented vectors and the
previously determined parameters is as follows (see Eq. 10.68):
    \ell(f_p) = -\frac{n_s}{2} \ln(2\pi) - \frac{n_s}{2} \ln\!\left(\sigma^{*2}\right) - \frac{1}{2} \ln |\bar{K}| - \frac{(\bar{f} - e\mu^*)^\top \bar{K}^{-1} (\bar{f} - e\mu^*)}{2\sigma^{*2}} .
We want to maximize this function with respect to 𝑓𝑝 . Because only the
last term depends on 𝑓𝑝 (it is a part of 𝑓¯) we can omit the other terms
and formulate the following:
    \underset{f_p}{\text{maximize}} \;\; \ell(f_p) = -\frac{(\bar{f} - e\mu^*)^\top \bar{K}^{-1} (\bar{f} - e\mu^*)}{2\sigma^{*2}} .    (10.74)
This problem can be solved analytically, yielding the mean value of the
kriging prediction,
𝑓𝑝 = 𝜇∗ + 𝑘 | 𝐾 −1 ( 𝑓 − 𝑒𝜇∗ ) . (10.75)
The mean square error of the kriging prediction (that is, the expected squared value of the error) is given by¶

    \sigma_p^2 = \sigma^{*2} \left( 1 - k^\top K^{-1} k + \frac{\left( 1 - k^\top K^{-1} e \right)^2}{e^\top K^{-1} e} \right) .    (10.76)

¶ The formula for the mean squared error does not come from the augmented likelihood approach but is a byproduct of showing that the kriging predictor is the “best linear unbiased predictor” for the assumed statistical model.174
One attractive feature of kriging models is that they are interpolatory and thus match the training data exactly. To see how this is true, if 𝑥𝑝 is the same as one of the training data points, 𝑥(𝑖), then 𝑘 is just the 𝑖th column of 𝐾. Hence, 𝐾⁻¹𝑘 is a vector 𝑒𝑖, with all zeros except for 1 in the 𝑖th element. In the prediction (Eq. 10.75), 𝑘⊤𝐾⁻¹ = 𝑒𝑖⊤, and so the predictor returns exactly 𝑓(𝑖), the training value at that point.
174. Sacks et al., Design and analysis of computer experiments, 1989.
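The following condensed sketch assembles the pieces above—Eqs. 10.64, 10.69, 10.70, 10.75, and 10.76—for ordinary kriging with 𝑝 = 2 and a fixed 𝜃 (in practice, 𝜃 would be found by maximizing Eq. 10.71). The small diagonal nugget and the 1-D test data are assumptions for illustration.

```python
# Ordinary kriging with a Gaussian kernel: fit the mean/variance estimates, then
# predict the mean and mean squared error at a new point x_p.
import numpy as np

def kriging_fit(X, f, theta, nugget=1e-10):
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-np.sum(theta * diff**2, axis=2)) + nugget * np.eye(len(X))
    e = np.ones(len(X))
    mu = (e @ np.linalg.solve(K, f)) / (e @ np.linalg.solve(K, e))          # Eq. 10.69
    sigma2 = ((f - mu * e) @ np.linalg.solve(K, f - mu * e)) / len(X)       # Eq. 10.70
    return dict(X=X, f=f, theta=theta, K=K, e=e, mu=mu, sigma2=sigma2)

def kriging_predict(model, xp):
    X, f, K, e = model["X"], model["f"], model["K"], model["e"]
    k = np.exp(-np.sum(model["theta"] * (X - xp)**2, axis=1))   # correlations with x_p
    Kinv_k, Kinv_e = np.linalg.solve(K, k), np.linalg.solve(K, e)
    fp = model["mu"] + k @ np.linalg.solve(K, f - model["mu"] * e)          # Eq. 10.75
    sp2 = model["sigma2"] * (1 - k @ Kinv_k + (1 - k @ Kinv_e)**2 / (e @ Kinv_e))  # Eq. 10.76
    return fp, sp2

# Usage (1-D, hypothetical data and theta):
X = np.array([[0.0], [2.5], [5.0], [7.5], [10.0]])
f = np.sin(X).ravel()
m = kriging_fit(X, f, theta=np.array([1.0]))
print(kriging_predict(m, np.array([3.0])))
```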
section. This includes solving the optimization problem of Eq. 10.71 using a gradient-based method with exact derivatives. We fix 𝑝 = 2 and search for 𝜃 in the range [10⁻³, 10²] with the exponent as the optimization variable.
The resulting interpolation is shown in Fig. 10.21, where we plot the mean line. The shaded area represents the uncertainty corresponding to ±1 standard error. The uncertainty goes to zero at the known data points and is largest when far from known data points.

[Fig. 10.21 Kriging model showing the training data (dots), the kriging predictor (blue line), and the confidence interval corresponding to ±1 standard error (shaded areas), compared to the actual function (gray line).]
    f_{\text{GEK}} \equiv \begin{bmatrix} f_1 \\ \vdots \\ f_{n_s} \\ \nabla f_1 \\ \vdots \\ \nabla f_{n_s} \end{bmatrix} .    (10.77)
This vector is of length 𝑛 𝑠 + 𝑛 𝑠 𝑛 𝑑 , where 𝑛 𝑑 is the dimension of 𝑥. The
gradients are usually provided at the same 𝑥 locations as the function
samples, but that is not required.
Recall that the term 𝑒𝜇∗ in Eq. 10.75 for the kriging predictor
represents the expected value of the random variables 𝐹 (1) , . . . , 𝐹 (𝑛 𝑠 ) .
Now that we have expanded the outputs to include the gradients at the
sampled points, the mean vector needs to be expanded to include the
expected values of ∇𝐹 (𝑖) , which are all zero. We can still use 𝑒𝜇∗ in the
formula for the predictor if we use the following definition:

    e_{\text{GEK}} \equiv [\, 1, \ldots, 1, \; 0, \ldots, 0 \,]^\top ,

where 1 occurs for the first 𝑛𝑠 entries, and 0 for the remaining 𝑛𝑠𝑛𝑑 entries.
The additional correlations (between function values and derivatives
and between the derivatives themselves) are as follows:
    \operatorname{corr}\!\left( F(x^{(i)}), F(x^{(j)}) \right) = K_{ij}
    \operatorname{corr}\!\left( F(x^{(i)}), \frac{\partial F(x^{(j)})}{\partial x_l^{(j)}} \right) = \frac{\partial K_{ij}}{\partial x_l^{(j)}}
    \operatorname{corr}\!\left( \frac{\partial F(x^{(i)})}{\partial x_l^{(i)}}, F(x^{(j)}) \right) = \frac{\partial K_{ij}}{\partial x_l^{(i)}}    (10.79)
    \operatorname{corr}\!\left( \frac{\partial F(x^{(i)})}{\partial x_l^{(i)}}, \frac{\partial F(x^{(j)})}{\partial x_k^{(j)}} \right) = \frac{\partial^2 K_{ij}}{\partial x_l^{(i)} \partial x_k^{(j)}} .

Here, we use 𝑙 and 𝑘 to represent a component of a vector, and we use 𝐾𝑖𝑗 ≡ 𝐾(𝑥(𝑖), 𝑥(𝑗)) as shorthand. For our particular kernel choice (Eq. 10.64), these correlations become the following:

    K_{ij} = \exp\!\left( -\sum_{l=1}^{n_d} \theta_l \left( x_l^{(i)} - x_l^{(j)} \right)^2 \right)
    \frac{\partial K_{ij}}{\partial x_l^{(j)}} = 2\theta_l \left( x_l^{(i)} - x_l^{(j)} \right) K_{ij}
    \frac{\partial K_{ij}}{\partial x_l^{(i)}} = -\frac{\partial K_{ij}}{\partial x_l^{(j)}}    (10.80)
    \frac{\partial^2 K_{ij}}{\partial x_l^{(i)} \partial x_k^{(j)}} =
        \begin{cases} -4\theta_l \theta_k \left( x_l^{(i)} - x_l^{(j)} \right)\left( x_k^{(i)} - x_k^{(j)} \right) K_{ij} & l \ne k \\ -4\theta_l^2 \left( x_l^{(i)} - x_l^{(j)} \right)^2 K_{ij} + 2\theta_l K_{ij} & l = k , \end{cases}
    J_K = \begin{bmatrix} \partial K_{11}/\partial x^{(1)\top} & \cdots & \partial K_{1 n_s}/\partial x^{(n_s)\top} \\ \vdots & \ddots & \vdots \\ \partial K_{n_s 1}/\partial x^{(1)\top} & \cdots & \partial K_{n_s n_s}/\partial x^{(n_s)\top} \end{bmatrix}    (10.82)

and the (𝑛𝑠𝑛𝑑 × 𝑛𝑠𝑛𝑑) matrix of second derivatives is

    H_K = \begin{bmatrix} \partial^2 K_{11}/\partial x^{(1)} \partial x^{(1)} & \cdots & \partial^2 K_{1 n_s}/\partial x^{(1)} \partial x^{(n_s)} \\ \vdots & \ddots & \vdots \\ \partial^2 K_{n_s 1}/\partial x^{(n_s)} \partial x^{(1)} & \cdots & \partial^2 K_{n_s n_s}/\partial x^{(n_s)} \partial x^{(n_s)} \end{bmatrix} .    (10.83)
We can still get the estimates 𝜇∗ and 𝜎∗2 with Eqs. 10.69 and 10.70
using the expanded versions of 𝐾, 𝑒, 𝑓 and replacing 𝑛 𝑠 in Eq. 10.76
with 𝑛 𝑠 (𝑛 𝑑 + 1), which is the new length of the outputs.
The predictor equations (Eqs. 10.75 and 10.76) also apply with the
expanded matrices and vectors. However, we also need to expand 𝑘 in these computations to include the correlations between the gradients at the sampled points and the function value at the point 𝑥𝑝 where we make a prediction. Thus, the expanded 𝑘 is:
    k_{\text{GEK}} \equiv \begin{bmatrix} k \\ \operatorname{corr}\!\left( \dfrac{\partial F(x^{(1)})}{\partial x^{(1)}}, F(x_p) \right) = \dfrac{\partial K\!\left( x^{(1)}, x_p \right)}{\partial x^{(1)}} \\ \vdots \\ \operatorname{corr}\!\left( \dfrac{\partial F(x^{(n_s)})}{\partial x^{(n_s)}}, F(x_p) \right) = \dfrac{\partial K\!\left( x^{(n_s)}, x_p \right)}{\partial x^{(n_s)}} \end{bmatrix} .    (10.84)
We repeat Ex. 10.7, but this time we include the gradients (Fig. 10.22). The standard error reduces dramatically between points. The additional information contained in the derivatives significantly helps in creating a more accurate fit.

[Fig. 10.22 A GEK fit to the input data (circles) and a shaded confidence interval, compared to the actual function.]

Example 10.9 Two-dimensional kriging

The Jones function (Appendix D.1.4) is shown on the left in Fig. 10.23. Using GEK with only 10 training points from a Hammersley sequence (shown as the dots), we created the surrogate model on the right. A reasonable representation of this multimodal space can be captured even with a small number of samples.

[Fig. 10.23 Kriging fit to the multimodal Jones function: original function (left) and kriging fit (right).]
One difficulty with GEK is that the kernel matrix quickly grows in size as the dimension of the problem increases, the number of samples increases, or both. Various approaches have been proposed to improve the scaling with higher dimensions, such as a weighted sum of smaller correlation matrices175 or a partial least squares approach.172
The version of kriging in this section is interpolatory. For noisy data, a regression approach can be used by modifying the correlation matrix as follows:

    K_{\text{reg}} \equiv K + \tau I ,    (10.85)

with 𝜏 > 0. This adds a positive constant along the diagonal, so the model no longer correlates perfectly with the provided points. The parameter 𝜏 is then an additional parameter to estimate in the maximum likelihood optimization. Even for interpolatory models, this term is often still added to the covariance matrix with a small constant value of 𝜏 (near machine precision) to ensure that the correlation matrix is invertible. This section focused on the most common choices when using kriging, but many other versions exist.176

175. Han et al., Weighted gradient-enhanced kriging for high-dimensional surrogate modeling and design optimization, 2017.
172. Bouhlel and Martins, Gradient-enhanced kriging for high-dimensional problems, 2019.
176. Forrester et al., Engineering Design via Surrogate Modelling: A Practical Guide, 2008.
10.5 Deep Neural Networks

Like kriging, deep neural nets can be used to approximate highly nonlinear simulations where we do not need to provide a parametric form.
Neural networks follow the same basic steps described for other surro-
gate models but with a unique model leading to specialized approaches
for derivative computation and optimization strategy. Neural networks
loosely mimic the brain, which consists of a vast network of neurons.
In neural networks, each neuron is a node that represents a simple
function. A network defines chains of these simple functions to obtain
composite functions that are much more complex. For example, three
simple functions, 𝑓 (1) , 𝑓 (2) , and 𝑓 (3) , may be chained into the composite
function (or network):
    f(x) = f^{(3)}\!\left( f^{(2)}\!\left( f^{(1)}(x) \right) \right) .    (10.86)
The total number of layers is called the network’s depth. Deep neural
networks have many layers, enabling the modeling of complex behavior.
The first and last layers can be viewed as the inputs and outputs
of a surrogate model. Each neuron in the hidden layer represents a
function. This means that the output from a neuron is a number, and
thus the output from a whole layer can be represented as a vector 𝑥.
We represent the vector of values for layer 𝑘 by 𝑥 (𝑘) , and the value for
(𝑘)
the 𝑖th neuron in layer 𝑘 by 𝑥 𝑖 .
Consider a neuron in layer 𝑘. This neuron is connected to many
neurons from the previous layer 𝑘 − 1 (see the first part of Fig. 10.25).
We need to choose a functional form for each neuron in the layer that
takes in the values from the previous layer as inputs. Chaining together
linear functions would yield another linear function. Therefore, some
layers must use nonlinear functions.
The most common choice for hidden layers is a layer of linear
functions followed by a layer of functions that create nonlinearity. A
neuron in the linear layer produces the following intermediate variable:
    z = \sum_{j=1}^{n} w_j x_j^{(k-1)} + b .    (10.87)
In vector form:
𝑧 = 𝑤 | 𝑥 (𝑘−1) + 𝑏 . (10.88)
The first term is a weighted sum of the values from the neurons in the
previous layer. The 𝑤 vector contains the weights. The term 𝑏 is the
bias, which is an offset that scales the significance of the overall output.
[Fig. 10.25 A neuron in layer 𝑘 receives the outputs 𝑥(𝑘−1) of the previous layer, forms the weighted sum 𝑧 = Σ𝑗 𝑤𝑗 𝑥𝑗(𝑘−1) + 𝑏, and applies the activation 𝑥(𝑘) = 𝑎(𝑧).]
These two terms are analogous to the weights used in the previous section but with the constant term separated for convenience. The second column of Fig. 10.25 illustrates the linear (summation and bias) layer.
Next, we pass 𝑧 through an activation function, which we call 𝑎(𝑧). Historically, one of the most common activation functions has been the sigmoid function:

    a(z) = \frac{1}{1 + e^{-z}} .    (10.89)

This function is shown in the top plot of Fig. 10.26. The sigmoid function produces values between 0 and 1, so large negative inputs result in insignificant outputs (close to 0), and large positive inputs produce outputs close to 1.
Most modern neural nets use a rectified linear unit (ReLU) as the activation function:

    a(z) = \max(0, z) .    (10.90)

This function is shown in the bottom plot of Fig. 10.26. The ReLU has been found to be far more effective than the sigmoid function in producing accurate neural nets. This activation function eliminates

[Fig. 10.26 The sigmoid (top) and ReLU (bottom) activation functions 𝑎(𝑧).]
To compute the outputs for all the neurons in this layer, the weights 𝑤 for one neuron form one row in a matrix of weights 𝑊, and we can write

    \begin{bmatrix} x_1^{(k)} \\ \vdots \\ x_i^{(k)} \\ \vdots \\ x_{n_k}^{(k)} \end{bmatrix} = a\!\left( \begin{bmatrix} W_{1,1} & \cdots & W_{1,j} & \cdots & W_{1,n_{k-1}} \\ \vdots & & \vdots & & \vdots \\ W_{i,1} & \cdots & W_{i,j} & \cdots & W_{i,n_{k-1}} \\ \vdots & & \vdots & & \vdots \\ W_{n_k,1} & \cdots & W_{n_k,j} & \cdots & W_{n_k,n_{k-1}} \end{bmatrix} \begin{bmatrix} x_1^{(k-1)} \\ \vdots \\ x_j^{(k-1)} \\ \vdots \\ x_{n_{k-1}}^{(k-1)} \end{bmatrix} + \begin{bmatrix} b_1 \\ \vdots \\ b_i \\ \vdots \\ b_{n_k} \end{bmatrix} \right)    (10.92)

or

    x^{(k)} = a\!\left( W x^{(k-1)} + b \right) .    (10.93)

The activation function is applied separately for each row. The following equation is more explicit (where 𝑤𝑖 is the 𝑖th row of 𝑊):

    x_i^{(k)} = a\!\left( w_i^\top x^{(k-1)} + b_i \right) .    (10.94)
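A tiny sketch of this layer recurrence is shown below, using ReLU activations on the hidden layers and a linear output layer; the layer sizes and random weights are placeholders.

```python
# Forward pass through a fully connected network: x^(k) = a(W x^(k-1) + b).
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    for k, (W, b) in enumerate(zip(weights, biases)):
        z = W @ x + b                                     # linear (summation and bias) layer
        x = z if k == len(weights) - 1 else relu(z)       # activation on hidden layers only
    return x

rng = np.random.default_rng(0)
sizes = [2, 8, 8, 1]                                      # inputs, two hidden layers, one output
weights = [rng.standard_normal((sizes[k + 1], sizes[k])) for k in range(len(sizes) - 1)]
biases = [np.zeros(sizes[k + 1]) for k in range(len(sizes) - 1)]
print(forward(np.array([0.3, -1.2]), weights, biases))
```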
We now have the objective and variables in place to train the neural
net. As with the other models discussed in this chapter, it is critical to
set aside some data for cross validation.
Because the optimization problem (Eq. 10.95) often has a large
number of parameters 𝜃, we generally use a gradient-based optimization
algorithm (however the algorithms of Chapter 4 are modified as we will
discuss shortly). To solve Eq. 10.95 using gradient-based optimization,
we require the derivatives of the objective function with respect to the weights 𝜃. Because the objective is a scalar and the number of weights
is large, reverse-mode algorithmic differentiation (AD) (see Section 6.6)
is ideal to compute the required derivatives.
Reverse-mode AD is known in the machine learning community as backpropagation.∗ Whereas general-purpose reverse-mode AD operates on arbitrary code, backpropagation is specialized to the layered structure of the network. Although less general, this approach can increase efficiency and stability.
∗ The machine learning community independently developed backpropagation. 58. Baydin et al., Automatic differentiation in machine learning: A survey, 2018.
The ReLU activation function (Fig. 10.26, bottom) is not differentiable
at 𝑧 = 0, but in practice, this is generally not problematic—primarily
because these methods typically rely on inexact gradients anyway, as
discussed next.
The objective function in Eq. 10.95 consists of a sum of subfunctions,
each of which depends on a single data point (𝑥 (𝑖) , 𝑓 (𝑖) ). Objective
functions vary across machine learning applications, but most have this
same form:
    \underset{\theta}{\text{minimize}} \;\; f(\theta) ,    (10.96)

where

    f(\theta) = \sum_{i=1}^{n} \ell\!\left( \theta; x^{(i)}, f^{(i)} \right) = \sum_{i=1}^{n} \ell_i(\theta) .    (10.97)
As previously mentioned, the challenge with these problems is that
we often have large training sets where 𝑛 may be in the billions. That
means that computing the objective can be costly, and computing the
gradient can be even more costly.
If we divide the objective by 𝑛 (which does not change the solution),
the objective function becomes an approximation of the expected value
(see Appendix A.9):
    f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(\theta) = \mathbb{E}(\ell(\theta)) .    (10.98)
𝑆 = 𝑥 (1) . . . 𝑥 (𝑚) , where 𝑚 is usually between 1 and a few hundred.
The entries 𝑥 (1) , . . . , 𝑥 (𝑚) do not correspond to the first 𝑛 entries but are
drawn randomly from a uniform probability distribution (Fig. 10.27).
Using the minibatch, we can estimate the gradient as the sum of the
subfunction gradients at different training points:
    \nabla_\theta f(\theta) \approx \frac{1}{m} \sum_{i \in S} \nabla_\theta \, \ell\!\left( \theta; x^{(i)}, f^{(i)} \right) .    (10.99)
Thus, we divide the training data into these minibatches and use a new minibatch to estimate the gradients at each iteration in the optimization. This approach works well for these specific problems because of the unique form for the objective (Eq. 10.98). As an example, for one million training samples, a single gradient evaluation would require evaluating all one million training samples. Alternatively, for a similar cost, a minibatch approach can update the optimization variables a million times using the gradient estimated from one training sample at a time. This latter process usually converges much faster, mainly because we are only fitting parameters against limited data in these problems, so we generally do not need to find the exact minimum.

[Fig. 10.27 Minibatches are randomly drawn from the training data.]
Typically, this gradient is used with steepest descent methods (Section 4.4.1), more commonly referred to as gradient descent in the machine learning community. As discussed in Chapter 4, steepest descent
is not the most effective optimization algorithm. However, steepest
descent with the minibatch updates, called stochastic gradient descent,
has been found to work well in machine learning applications. This
suitability is primarily because (1) many machine learning optimiza-
tions are performed repeatedly, (2) the true objective is difficult to
formalize, and (3) finding the absolute minimum is not as important as
finding a good enough solution quickly. One key difference in stochastic
gradient descent relative to the steepest descent method is that we do
not perform a line search. Instead, the step size (called the learning rate
in machine learning applications) is a preselected value that is usually
decreased between major optimization iterations.
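A bare-bones sketch of stochastic gradient descent with minibatches is given below. The per-sample gradient function, fixed decaying learning rate, and batch size are assumptions standing in for application-specific choices.

```python
# Stochastic gradient descent: each step uses the gradient averaged over a random
# minibatch (Eq. 10.99) and a preselected learning rate instead of a line search.
import numpy as np

def sgd(grad_i, theta0, n_data, batch_size=32, lr=0.1, decay=0.99, epochs=50, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        for batch in np.array_split(rng.permutation(n_data), max(1, n_data // batch_size)):
            g = np.mean([grad_i(theta, i) for i in batch], axis=0)   # minibatch gradient
            theta -= lr * g                                          # steepest-descent step
        lr *= decay                                                  # shrink learning rate
    return theta

# Usage sketch: least-squares fit of w to (x_i, f_i) pairs, one gradient per data point
# grad_i = lambda w, i: 2 * (w @ x[i] - f[i]) * x[i]
# w_opt = sgd(grad_i, np.zeros(n_features), n_data=len(f))
```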
10.6.1 Exploitation
For models that do not provide uncertainty estimates, the only real
option is exploitation. A prediction-based exploitation infill strategy
adds an infill point wherever the surrogate predicts the optimum. The
reasoning behind this approach is that in SBO, we do not necessarily
care about having a globally accurate surrogate; instead, we only care
about having an accurate surrogate near the optimum.
The most logical point to sample is thus the optimum predicted by
the surrogate. Likely, the location predicted by the surrogate will not
be at the true optimum. However, evaluating this point adds valuable
information in the region of interest.
We rebuild the surrogate and re-optimize, repeating the process until
convergence. This approach usually results in the quickest convergence
to an optimum, which is desirable when the actual function is expensive
to evaluate. The downside is that we may converge prematurely to an
inferior local optimum for problems with multiple local optima.
Even though the approach is called exploitation, the optimizer used on the surrogate can be a global search method (gradient-based or gradient-free).
Inputs:
𝑛𝑠: Number of initial samples
𝑥, 𝑥: Variable lower and upper bounds
𝜏: Convergence tolerance
Outputs:
𝑥∗: Best point identified
𝑓∗: Corresponding function value

𝑥(𝑖) = sample(𝑛𝑠, 𝑛𝑑)                      ⊲ Sample
𝑓(𝑖) = 𝑓(𝑥(𝑖))                              ⊲ Evaluate true function at the samples
𝑘 = 0
while not converged do
    Construct surrogate 𝑓̂ from (𝑥(𝑖), 𝑓(𝑖))
    𝑥∗, 𝑓̂∗ = min 𝑓̂(𝑥)                       ⊲ Perform optimization on the surrogate function
    𝑓new = 𝑓(𝑥∗)                            ⊲ Evaluate true function at predicted optimum
    𝑥(𝑖) = 𝑥(𝑖) ∪ 𝑥∗                         ⊲ Append new point to training data
    𝑓(𝑖) = 𝑓(𝑖) ∪ 𝑓new                       ⊲ Append corresponding function value
    𝑘 = 𝑘 + 1
end while
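A schematic Python version of this loop is sketched below; the helpers sample, build_surrogate, and minimize_surrogate are placeholders for the sampling plan, surrogate fit, and sub-optimization described above, and the stopping test on successive infill values is one simple convergence choice.

```python
# Exploitation infill: repeatedly fit a surrogate, minimize it, and add the true
# function value at the predicted optimum back into the training data.
import numpy as np

def sbo_exploitation(f, sample, build_surrogate, minimize_surrogate,
                     n_s, bounds, tol=1e-6, max_infill=50):
    X = sample(n_s, bounds)                        # initial sampling plan
    F = np.array([f(x) for x in X])                # evaluate the expensive model
    f_prev = np.inf
    for _ in range(max_infill):
        f_hat = build_surrogate(X, F)              # fit surrogate to current data
        x_star, _ = minimize_surrogate(f_hat, bounds)   # optimum of the surrogate
        f_new = f(x_star)                          # evaluate true function there
        X = np.vstack([X, x_star])                 # append infill point
        F = np.append(F, f_new)
        if abs(f_prev - f_new) < tol:              # stop when infill stops improving
            break
        f_prev = f_new
    i_best = np.argmin(F)
    return X[i_best], F[i_best]
```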
The expected improvement for a kriging model can be found analytically as

    EI(x) = \left( f^* - \mu_f(x) \right) \Phi\!\left( \frac{f^* - \mu_f(x)}{\sigma_f(x)} \right) + \sigma_f(x)\, \phi\!\left( \frac{f^* - \mu_f(x)}{\sigma_f(x)} \right) ,    (10.102)

where Φ and 𝜙 are the CDF and PDF, respectively, for the standard normal distribution, and 𝜇𝑓 and 𝜎𝑓 are the mean and standard error functions produced from kriging (Eqs. 10.75 and 10.76).
The algorithm is similar to that of the previous section (Alg. 10.4),
but instead of choosing the minimum of the surrogate, the selected
infill point is the point with the greatest expected improvement. The
corresponding algorithm is detailed in Alg. 10.5.
Inputs:
𝑛 𝑠 : Number of initial samples
𝑥, 𝑥: Lower and upper bounds
𝜏: Minimum expected improvement
Outputs:
𝑥 ∗ : Best point identified
𝑓 ∗ : Corresponding function value
𝑥(𝑖) = sample(𝑛𝑠, 𝑛𝑑)                       ⊲ Sample
𝑓(𝑖) = 𝑓(𝑥(𝑖))                               ⊲ Evaluate true function at the samples
𝑘 = 0
while max 𝐸𝐼(𝑥) > 𝜏 do
    𝜇(𝑥), 𝜎(𝑥) = GP(𝑥(𝑖), 𝑓(𝑖))              ⊲ Construct Gaussian process surrogate model
    𝑥𝑘, 𝐸𝐼𝑘 = max 𝐸𝐼(𝑥)                      ⊲ Maximize expected improvement
    𝑓𝑘 = 𝑓(𝑥𝑘)                               ⊲ Evaluate true function at predicted optimum
    𝑓∗ = min{𝑓∗, 𝑓𝑘}                         ⊲ Update best point and 𝑥∗ if necessary
    𝑥(𝑖) ← [𝑥(𝑖), 𝑥𝑘]                         ⊲ Add new point to training data
    𝑓(𝑖) ← [𝑓(𝑖), 𝑓𝑘]
    𝑘 = 𝑘 + 1
end while
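The expected improvement itself is a one-line computation; the sketch below transcribes Eq. 10.102 using the standard normal CDF and PDF from scipy.stats, with the kriging mean and standard error supplied by the caller.

```python
# Expected improvement at a candidate point, given the kriging mean mu, standard
# error sigma, and the best observed function value f_best.
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    if sigma <= 0.0:
        return 0.0                      # no uncertainty, no expected improvement
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Inside an EGO loop, the infill point maximizes EI over the domain, e.g. (hypothetical
# helper names) over a dense set of candidates predicted by the kriging model:
# x_new = max(candidates, key=lambda x: expected_improvement(mu_f(x), sigma_f(x), f_best))
```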
[Fig. 10.29 Expected improvement EI(𝑥) evaluated across the domain 𝑥 ∈ [0, 12], together with the corresponding fit 𝑓, at iterations 𝑘 = 1, 5, 10, and 12.]
10.7 Summary
Problems
10.4 Linear regression. Use the following training data sampled at 𝑥 with
the resulting function value 𝑓 (also tabulated on the resources
website):
10.5 Cross validation. Use the following training data sampled at 𝑥 with
resulting function value 𝑓 (also tabulated on resources website):
𝑦 = exp(−𝑥) cos(5𝑥),
10.8 Efficient global optimization. Use EGO with the function from the
previous problem, showing the iteration history until the expected
improvement reduces below 0.001.
Convex Optimization
11
General nonlinear optimization problems are difficult to solve. De-
pending on the particular optimization algorithm, they may require
tuning parameters, providing derivatives, adjusting scaling, and trying
multiple starting points. Convex optimization problems do not have
any of those issues and are thus easier to solve. The challenge is that
these problems must meet strict requirements. Even for candidate
problems with the potential to be convex, significant experience is
usually needed to recognize and utilize techniques that reformulate the
problems into an appropriate form.
11.1 Introduction
because the linearization can be updated in the next time step. However,
this fidelity reduction is problematic for design applications.
In design scenarios, the optimization is performed once, and the
design cannot continue to be updated after it is created. For this reason,
convex optimization is less frequently used for design applications, ex-
cept for some limited uses in geometric programming, a topic discussed
in more detail in Section 11.6.
This chapter just introduces convex optimization and is not a replacement for more comprehensive textbooks on the topic.† We focus on understanding what convex optimization is useful for and describing the most widely used forms.
† Boyd and Vandenberghe86 is the most cited textbook on convex optimization. 86. Boyd and Vandenberghe, Convex Optimization, 2004.
The known categories of convex optimization problems include linear programming, quadratic programming, second-order cone programming, semidefinite programming, cone programming, and graph form programming. Each of these categories is a subset of the next (Fig. 11.2).‡ We focus on the first three because they are the most widely used, including in other chapters in this book. The latter three forms are less frequently formulated directly. Instead, users apply elementary functions and operations and the rules specified by disciplined convex programming, and a software tool transforms the problem into a suitable conic form that can be solved. Section 11.5 describes this procedure.
After covering the three main categories of convex optimization problems, we discuss geometric programming. Geometric programming problems are not convex, but with a change of variables, they can be transformed into an equivalent convex form, thus extending the types of problems that can be solved with convex optimization.

[Fig. 11.2 Relationship between various convex optimization problems: LP ⊂ QP ⊂ SOCP ⊂ SDP ⊂ CP ⊂ GFP.]

‡ Several references exist with examples for those categories that we do not discuss in detail.
180. Lobo et al., Applications of second-order cone programming, 1998.
181. Parikh and Boyd, Block splitting for distributed optimization, 2013.
182. Vandenberghe and Boyd, Semidefinite programming, 1996.
183. Vandenberghe and Boyd, Applications of semidefinite programming, 1999.

11.2 Linear Programming

A linear program (LP) is an optimization problem with a linear objective and linear constraints and can be written as

    \underset{x}{\text{minimize}} \;\; f^\top x
    \text{subject to} \;\; A x + b = 0    (11.2)
                       \;\; C x + d \le 0 ,

where 𝑓, 𝑏, and 𝑑 are vectors and 𝐴 and 𝐶 are matrices. All LPs are convex.
Suppose we are shopping and want to find how best to meet our nutritional
needs for the lowest cost. We enumerate all the food options and use the
variable 𝑥 𝑗 to represent how much of food 𝑗 we purchase. The parameter 𝑐 𝑗 is
the cost of a unit amount of food 𝑗. The parameter 𝑁𝑖𝑗 is the amount of nutrient
𝑖 contained in a unit amount of food 𝑗. We need to make sure we have at least
𝑟 𝑖 of nutrient 𝑖 to meet our dietary requirements. We can now formulate the
cost objective as

    \underset{x}{\text{minimize}} \;\; \sum_j c_j x_j = c^\top x .

To meet the nutritional requirement of nutrient 𝑖, we need to satisfy

    \sum_j N_{ij} x_j \ge r_i \;\; \Rightarrow \;\; N x \ge r .
If the amount of each food is 𝑥, the cost column is 𝑐, and the nutrient columns are 𝑛1, 𝑛2, and 𝑛3, we can formulate the LP as

    \underset{x}{\text{minimize}} \;\; c^\top x
    \text{subject to} \;\; 5 \le n_1^\top x \le 8
                       \;\; 7 \le n_2^\top x
                       \;\; 1 \le n_3^\top x \le 10
                       \;\; x \le 4 .
The last constraint ensures that we do not overeat any one item and get tired of it. LP solvers are widely available, and because the inputs of an LP are just a table of numbers, some solvers do not even require a programming language. The optimal diet consists of items B, F, H, and I in the proportions given by the solution. The solution reached the upper limit for nutrient 1 and the lower limit for nutrient 2.
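A sketch of a diet-style LP using scipy.optimize.linprog is shown below. The cost vector and nutrient matrix are hypothetical placeholders (the example's table is not reproduced here), and the ≥ requirements are negated because linprog expects ≤ inequalities.

```python
# Diet-style linear program: minimize cost subject to nutrient requirements and
# per-item purchase limits.
import numpy as np
from scipy.optimize import linprog

c = np.array([2.0, 1.5, 3.0, 2.5])             # hypothetical cost per unit of each food
N = np.array([[1.0, 0.5, 2.0, 0.0],            # hypothetical nutrient content (row = nutrient)
              [0.0, 1.0, 1.0, 2.0],
              [2.0, 0.0, 0.5, 1.0]])
r = np.array([5.0, 7.0, 1.0])                  # minimum amount of each nutrient

res = linprog(c, A_ub=-N, b_ub=-r,             # N x >= r written as -N x <= -r
              bounds=[(0, 4)] * len(c))        # 0 <= x_j <= 4 for variety
print(res.x, res.fun)
```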
𝐶𝑥 + 𝑑 ≤ 0 .
The left pane of Fig. 11.3 shows some example data that are both noisy and
biased relative to the true (but unknown) underlying curve, represented as a
dashed line. Given the data points, we would like to estimate the underlying
functional relationship. We assume that the relationship is cubic and write it as
𝑦(𝑥) = 𝑎1 𝑥 3 + 𝑎2 𝑥 2 + 𝑎3 𝑥 + 𝑎4 .
[Fig. 11.3 True function on the left, least squares in the middle, and constrained least squares on the right.]

Suppose that we know the upper bound of the function value based on measurements or additional data at a few locations. In this example, assume that we know that 𝑓(−2) ≤ −2, 𝑓(0) ≤ 4, and 𝑓(2) ≤ 26. These requirements can be posed as linear constraints:

    \begin{bmatrix} (-2)^3 & (-2)^2 & -2 & 1 \\ 0 & 0 & 0 & 1 \\ 2^3 & 2^2 & 2 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} \le \begin{bmatrix} -2 \\ 4 \\ 26 \end{bmatrix} .
After adding these linear constraints and retaining a quadratic objective
(the sum of the squared error), the resulting problem is still a QP. The resulting
solution is shown in the right pane of Fig. 11.3, which results in a much more
accurate fit.
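A sketch of this constrained least-squares fit using CVXPY (one disciplined convex programming tool) follows. The synthetic data generation is a placeholder; the three upper-bound constraints match those in the example.

```python
# Constrained least squares as a QP: minimize the sum of squared residuals of a cubic
# fit subject to linear upper-bound constraints on f(-2), f(0), and f(2).
import cvxpy as cp
import numpy as np

x_data = np.linspace(-2, 2, 30)                               # placeholder data
y_data = x_data**3 + 2 * x_data + 3 + np.random.default_rng(0).normal(2.0, 1.0, 30)

A = np.column_stack([x_data**3, x_data**2, x_data, np.ones_like(x_data)])
a = cp.Variable(4)                                            # cubic coefficients a1..a4

C = np.array([[(-2.0)**3, (-2.0)**2, -2.0, 1.0],
              [0.0, 0.0, 0.0, 1.0],
              [2.0**3, 2.0**2, 2.0, 1.0]])
d = np.array([-2.0, 4.0, 26.0])                               # f(-2)<=-2, f(0)<=4, f(2)<=26

prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ a - y_data)), [C @ a <= d])
prob.solve()
print(a.value)
```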
𝑥 𝑡+1 = 𝐴𝑥 𝑡 + 𝐵𝑢𝑡 ,
where 𝑥 𝑡 is the deviation from a desired state at time 𝑡 (e.g., the positions and
velocities of an aircraft), and 𝑢𝑡 represents the control inputs that we want to
optimize (e.g., control surface deflections). This dynamic equation can be used
as a set of linear constraints in an optimization problem, but we must decide
on an objective.
We would like to have small 𝑥 𝑡 because that would mean reducing the error
in our desired state quickly, but we would also like to have small 𝑢𝑡 because
small control inputs require less energy. These are competing objectives, where
a small control input will take longer to minimize error in a state, and vice
versa.
One way to express this objective is as a quadratic function,

    \underset{x,u}{\text{minimize}} \;\; \frac{1}{2} \sum_{t=0}^{n} \left( x_t^\top Q x_t + u_t^\top R u_t \right) ,
    \underset{x}{\text{minimize}} \;\; f^\top x
    \text{subject to} \;\; \left\| A_i x + b_i \right\|_2 \le c_i^\top x + d_i    (11.4)
                       \;\; G x + h = 0 .
    \underset{x}{\text{minimize}} \;\; \frac{1}{2} x^\top Q x + f^\top x
    \text{subject to} \;\; A x + b = 0    (11.5)
                       \;\; \frac{1}{2} x^\top R_i x + c_i^\top x + d_i \le 0 \quad \text{for } i = 1, \ldots, m ,
where 𝑄 and 𝑅 must be positive semidefinite for the QCQP to be
convex. A QCQP reduces to a QP if 𝑅 = 0. We formulated QCQPs
when solving trust-region problems in Section 4.5. However, for trust-
region problems, only an approximate solution method is typically
used.
Every QCQP can be expressed as an SOCP (although not vice versa).
The QCQP in Eq. 11.5 can be written in the equivalent form,
    \underset{x,\beta}{\text{minimize}} \;\; \beta
    \text{subject to} \;\; \left\| F x + g \right\|_2 \le \beta    (11.6)
                       \;\; A x + b = 0
                       \;\; \left\| G_i x + h_i \right\|_2 \le 0 .

If we square both sides of the first and last constraints, this formulation is exactly equivalent to the QCQP, where 𝑄 = 2𝐹⊤𝐹, 𝑓 = 2𝐹⊤𝑔, 𝑅𝑖 = 2𝐺𝑖⊤𝐺𝑖, 𝑐𝑖 = 2𝐺𝑖⊤ℎ𝑖, and 𝑑𝑖 = ℎ𝑖⊤ℎ𝑖. The matrices 𝐹 and 𝐺𝑖 are the square
Table 11.1 Examples of convex functions:
• 𝑒^{𝑎𝑥}
• 𝑥^𝑎 for 𝑎 ≥ 1 or 𝑎 ≤ 0; −𝑥^𝑎 for 0 ≤ 𝑎 ≤ 1
• −log(𝑥)
• norms ‖𝑥‖₁, ‖𝑥‖₂, . . .
• max(𝑥₁, 𝑥₂, . . . , 𝑥ₙ)
• ln(𝑒^{𝑥₁} + 𝑒^{𝑥₂} + . . . + 𝑒^{𝑥ₙ})
CVX and its variants are free, popular tools for disciplined convex programming with interfaces for multiple programming languages.∗
∗ https://stanford.edu/~boyd/software.html
𝑓 (𝑥) = 𝑎 | 𝑥 + 𝛽 ,
that separates the two data sets, or in other words, a function that classifies the
objects. For example, if we call one data set 𝑦 𝑖 , for 𝑖 = 1 . . . 𝑛 𝑦 , and the other
𝑧 𝑖 , for 𝑖 = 1 . . . 𝑛 𝑧 , we need to satisfy the following constraints:
    a^\top y_i + \beta \ge \varepsilon
    a^\top z_i + \beta \le -\varepsilon ,    (11.7)
for some small tolerance 𝜀. In general, there are an infinite number of separating
hyperplanes, so we seek the one that maximizes the distance between the points.
However, such a problem is not yet well defined because we can multiply 𝑎 and
𝛽 in the previous equations by an arbitrary constant to achieve any separation
we want, so we need to normalize or fix some reference dimension (only the
ratio of the parameters matters in defining the hyperplane, not their absolute
magnitudes). We define the optimization problem as follows:
maximize 𝛾
by varying 𝛾, 𝑎, 𝛽
subject to 𝑎 | 𝑦 𝑖 + 𝛽 ≥ 𝛾 for 𝑖 = 1 . . . 𝑛 𝑦
𝑎 | 𝑧 𝑗 + 𝛽 ≤ −𝛾 for 𝑗 = 1, . . . , 𝑛 𝑧
k𝑎k ≤ 1 .
The last constraint provides a normalization to prevent the problem from being
unbounded. This norm constraint is always active (k𝑎 k = 1), but we express
it as an inequality so that the problem remains convex (recall that equality
constraints must be affine, but inequality constraints can be any convex function).
The objective and inequality constraints are all convex functions, so we can
solve it in a disciplined convex programming environment. Alternatively, in
this case, we could employ a change of variables to put the problem in QP form
if desired.
An example is shown in Fig. 11.4 for data with two features for easy visualization. The middle line shows the separating hyperplane, and the outer lines are a distance of 𝛾 away, just passing through a data point from each set.

[Fig. 11.4 Two separable data sets are shown as points with two different colors. A classification boundary with maximum width (2𝛾) is shown.]

If the data are not completely separable, we need to modify our approach. Even if the data are separable, outliers may undesirably pull the hyperplane so that points are closer to the boundary than is necessary. To address these issues, we need to relax the constraints. As discussed, Eq. 11.7 can always be multiplied by an arbitrary constant. Therefore, we can equivalently express the constraints as follows:

    a^\top y_i + \beta \ge 1
    a^\top z_j + \beta \le -1 .
To relax these constraints, we add nonnegative slack variables, 𝑢𝑖 and 𝑣 𝑗 :
𝑎 | 𝑦 𝑖 + 𝛽 ≥ 1 − 𝑢𝑖
𝑎 | 𝑧 𝑗 + 𝛽 ≤ −(1 − 𝑣 𝑗 ) ,
where we seek to minimize the sum of the entries in 𝑢 and 𝑣. If they sum
to 0, we have the original constraints for a completely separable function.
However, recall that we are interested in not just creating separation but also in
maximizing the distance to the classification boundary. To accomplish this, we
use a regularization approach where our two objectives include maximizing
the distance from the boundary and maximizing the sum of the classification
margins. The width between the two planes 𝑎 | 𝑥 + 𝛽 = 1 and 𝑎 | 𝑥 + 𝛽 = −1 is
2/k𝑎 k. Therefore, to maximize the separation distance, we minimize k𝑎 k. The
optimization problem is defined as follows:†
† In the machine learning community, this
Õ
𝑛
𝑎1𝑗 𝑎2𝑗 𝑎𝑚 𝑗
𝑓 (𝑥) = 𝑐 𝑗 𝑥1 𝑥2 · · · 𝑥 𝑚 , (11.9)
𝑗=1
    D = C_{D_p} q S + \frac{C_L^2}{\pi \, AR \, e} \, q S .
    \underset{x}{\text{minimize}} \;\; f_0(x)
    \text{subject to} \;\; f_i(x) \le 1    (11.10)
                       \;\; h_i(x) = 1 ,
where 𝑓𝑖 are posynomials, and ℎ 𝑖 are monomials. This problem does not
fit into any of the convex optimization problems defined in the previous
section, and it is not convex. This formulation is useful because we can
convert it into an equivalent convex optimization problem.
First, we take the logarithm of the objective and of both sides of the
constraints:
    \underset{x}{\text{minimize}} \;\; \ln f_0(x)
    \text{subject to} \;\; \ln f_i(x) \le 0    (11.11)
                       \;\; \ln h_i(x) = 0 .
    \ln c + a_1 \ln x_1 + a_2 \ln x_2 + \ldots + a_m \ln x_m = 0 .    (11.13)

    a_1 y_1 + a_2 y_2 + \ldots + a_m y_m + \ln c = 0 , \qquad \text{or} \qquad a^\top y + \ln c = 0 .    (11.14)

    \ln\!\left( \sum_{j=1}^{n} c_j \, x_1^{a_{1j}} x_2^{a_{2j}} \ldots x_m^{a_{mj}} \right) .    (11.15)
Because this is a sum of products, we cannot use the logarithm to
expand each term. However, we still introduce the same change of
variables (expressed as 𝑥 𝑖 = 𝑒 𝑦𝑖 ):
    \ln f_i = \ln\!\left( \sum_{j=1}^{n} c_j \exp\!\left( y_1 a_{1j} \right) \exp\!\left( y_2 a_{2j} \right) \ldots \exp\!\left( y_m a_{mj} \right) \right)
            = \ln\!\left( \sum_{j=1}^{n} c_j \exp\!\left( y_1 a_{1j} + y_2 a_{2j} + \ldots + y_m a_{mj} \right) \right)    (11.16)
            = \ln\!\left( \sum_{j=1}^{n} \exp\!\left( a_j^\top y + b_j \right) \right) , \qquad \text{where } b_j = \ln c_j .
This is a log-sum-exp of an affine function. As mentioned in the previous
section, log-sum-exp is convex, and a convex function composed of an
affine function is a convex function. Thus, the objective and inequality
constraints are convex in 𝑦. Because the equality constraints are also
affine, we have a convex optimization problem obtained through a
change of variables.
maximize   𝑥ℎ 𝑥𝑤 𝑥𝑑
by varying 𝑥ℎ, 𝑥𝑤, 𝑥𝑑
subject to 2(𝑥ℎ𝑥𝑤 + 𝑥ℎ𝑥𝑑 + 𝑥𝑤𝑥𝑑) ≤ 𝐴
           𝛼𝑙 ≤ 𝑥𝑤/𝑥𝑑 ≤ 𝛼ℎ .
We can express this problem in GP form (Eq. 11.10):

minimize   𝑥ℎ⁻¹ 𝑥𝑤⁻¹ 𝑥𝑑⁻¹
by varying 𝑥ℎ, 𝑥𝑤, 𝑥𝑑
subject to (2/𝐴) 𝑥ℎ𝑥𝑤 + (2/𝐴) 𝑥ℎ𝑥𝑑 + (2/𝐴) 𝑥𝑤𝑥𝑑 ≤ 1
           (1/𝛼ℎ) 𝑥𝑤 𝑥𝑑⁻¹ ≤ 1
           𝛼𝑙 𝑥𝑑 𝑥𝑤⁻¹ ≤ 1 .
We can now plug this into a GP solver. For this example, we use the
following parameters: 𝛼 𝑙 = 2, 𝛼 ℎ = 8, 𝐴 = 100. The solution is 𝑥 𝑑 = 2.887, 𝑥 ℎ =
3.849, 𝑥 𝑤 = 5.774, with a total volume of 64.16.
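The sketch below solves this box problem through the change of variables 𝑦 = ln 𝑥 described earlier, so each constraint becomes a log-sum-exp of affine functions. Using scipy.optimize.minimize with SLSQP is an assumption for illustration; a dedicated GP solver would normally be used.

```python
# Geometric program solved in log space: y = ln(x), posynomial constraints f_i <= 1
# become ln f_i <= 0, which are convex (log-sum-exp of affine functions of y).
import numpy as np
from scipy.optimize import minimize

A, alpha_l, alpha_h = 100.0, 2.0, 8.0

def objective(y):
    # minimize ln(x_h^-1 x_w^-1 x_d^-1) = -(y_h + y_w + y_d), i.e., maximize volume
    return -np.sum(y)

def constraints(y):
    yh, yw, yd = y
    g1 = np.log(2/A*np.exp(yh+yw) + 2/A*np.exp(yh+yd) + 2/A*np.exp(yw+yd))
    g2 = np.log(np.exp(yw - yd) / alpha_h)
    g3 = np.log(alpha_l * np.exp(yd - yw))
    return -np.array([g1, g2, g3])          # SLSQP expects g(y) >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints={"type": "ineq", "fun": constraints})
x_h, x_w, x_d = np.exp(res.x)
print(x_h, x_w, x_d, x_h * x_w * x_d)       # ~3.85, 5.77, 2.89, volume ~64.2
```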
Unfortunately, many other functions do not fit this form (e.g., de-
sign variables that can be positive or negative, terms with negative
coefficients, trigonometric functions, logarithms, and exponents). GP
modelers use various techniques to extend usability, including using a
Taylor series across a restricted domain, fitting functions to posynomi-
als,185 and rearranging expressions to other equivalent forms, including implicit relationships. Creativity and some sacrifice in fidelity are usually needed to create a corresponding GP from a general nonlinear programming problem. However, if the sacrifice in fidelity is not too great, there is a significant advantage because the formulation comes with all the benefits of convexity—guaranteed convergence, global optimality, efficiency, no parameter tuning, and limited scaling issues.
185. Hoburg et al., Data fitting with geometric-programming-compatible softmax functions, 2016.
One extension to geometric programming is signomial program-
ming. A signomial program has the same form, except that the coeffi-
cients 𝑐 𝑖 can be positive or negative (the design variables 𝑥 𝑖 must still
be strictly positive). Unfortunately, this problem cannot be transformed
into a convex one, so a global optimum is no longer guaranteed. Still, a
signomial program can usually be solved using a sequence of geometric programs.
11.7 Summary
Problems
11.2 Solve the following using a convex solver (not a general nonlinear
solver):
11.3 The following foods are available to you at your nearest grocer:
11 Convex Optimization 439
Minimize the amount you spend while making sure you get at
least 5 units of nutrient 1, between 8 and 20 units of nutrient 2,
and between 5 and 30 units of nutrient 3. Also be sure not to buy
more than 4 units of any one food item, just for variety. Determine
the optimal amount of each item to purchase and the total cost.
    L = \frac{1}{2} \rho v^2 b c C_L

as an equality constraint. This is equivalent to a GP-compatible monomial constraint

    \frac{\rho v^2 b c C_L}{2 L} = 1 .
Optimization Under Uncertainty
12
[Fig. 12.1 The global minimum of the expected value 𝜇𝑓 can shift depending on the standard deviation of 𝑥, 𝜎𝑥. The bottom row of figures shows the normal probability distributions 𝑝(𝑥) at 𝑥 = 0.5.]

a mean value of 𝑥 = 0.5 and three different standard deviations is shown on the bottom row of the figure. For a small variance (𝜎𝑥 = 0.01), the expected value function 𝜇𝑓(𝑥) is indistinguishable from the deterministic function 𝑓(𝑥), and the global minimum is the same for both functions. However, for 𝜎𝑥 = 0.2, the minimum of the expected
both functions. However, for 𝜎𝑥 = 0.2, the minimum of the expected
value function is different from that of the deterministic function.
Therefore, the minimum on the right is not as robust as the one on
the left. The minimum one on the right is a narrow valley, so the
expected value increases rapidly with increased variance. The opposite
is true for the minimum on the left. Because it is in a broad valley, the
expected value is less sensitive to variability in 𝑥. Thus, a design whose
performance changes rapidly with respect to variability is not robust.
Of course, the mean is just one possible statistical output metric.
Variance, or standard deviation (𝜎 𝑓 ), is another common metric. How-
ever, directly minimizing the variance is less common because although
low variability is often desirable, such an objective has no incentive to
improve mean performance and so usually performs poorly. These two
metrics represent a trade-off between risk (variance) and reward (mean). The compromise between these two metrics can be quantified through multiobjective optimization (see Chapter 9), which would result in a Pareto front with the notional behavior illustrated in Fig. 12.2. Because both multiobjective optimization and uncertainty quantification are costly, the overall cost of producing such a Pareto front might be prohibitive. Therefore, we might instead seek to minimize the expected value while constraining the variance to a value that the designer can tolerate. Another option is to minimize the mean plus weighted standard deviations.

[Fig. 12.2 When designing for robustness, there is an inherent trade-off between risk (represented by the variance, 𝜎𝑓) and reward (represented by the expected value, 𝜇𝑓).]
the red drag curve shown in Fig. 12.3. The drag is much lower at Mach 0.71 (as requested!), but any deviation from the target Mach number causes significant drag penalties. In other words, the design is not robust.
One way to improve the design is to use multipoint optimization, where we minimize a weighted sum of the drag coefficient evaluated at different Mach numbers. In this case, we use Mach = 0.68, 0.71, 0.725. Compared with the single-point design, the multipoint design has a higher drag at Mach 0.71 but a lower drag at the other Mach numbers, as shown in Fig. 12.3. Thus, a trade-off in peak performance was required to achieve enhanced robustness.
A multipoint optimization is a simplified example of OUU. Effectively, we have treated the Mach number as a random parameter with a given probability at three discrete values. We then minimized the expected value of the drag. This simple change significantly increased the robustness of the design.

[Fig. 12.3 Drag coefficient 𝑐𝑑 versus Mach number for the baseline, single-point, and multipoint designs. Single-point optimization performs the best at the target speed but poorly away from the condition. Multipoint optimization is more robust to changes in speed.]
Example 12.2 Robust wind farm layout optimization
† See other wind farm OUU problems
parameter. Figure 12.4 shows a PDF of the wind direction for an actual wind
farm, known as a wind rose, which is commonly visualized as shown in the
plot on the right. The predominant wind directions are from the west and the
south. Because of the variable nature of the wind, it would be challenging to
intuit the optimal layout.
Fig. 12.4 [Wind direction probability density (relative probability vs. wind direction in degrees) and the corresponding wind rose, with the predominant directions from the west and south.]
Fig. 12.5 Wind farm power as a function of wind direction for two optimization approaches: deterministic optimization using the most probable direction and OUU.
We can also analyze the trade-off in the optimal layouts. The left side of
Fig. 12.6 shows the optimal layout using the deterministic formulation, with
the wind coming from the predominant direction (the direction we optimized
for). The wakes are shown in blue, and the boundaries are depicted with a
dashed line. The optimization spaced the wind turbines out so that there is
minimal wake interference. However, the performance degrades significantly
when the wind changes direction. The right side of Fig. 12.6 shows the same
layout but with the wind coming from the second-most-probable direction. In
this case, many of the turbines are operating in the wake of another turbine
and produce much less power.
In contrast, the robust layout is shown in Fig. 12.7, with the predominant
wind direction on the left and the second-most-probable direction on the right.
In both cases, the wake effects are relatively minor. The turbines are not ideally
placed for the predominant direction, but trading the performance for that
one direction yields better overall performance when considering other wind
directions.
Consider the Barnes problem shown on the left side of Fig. 12.8. The three
red lines are the three nonlinear constraints of the problem, and the red regions
highlight regions of infeasibility. With deterministic inputs, the optimal value
is on the constraint line. An uncertainty ellipse shown around the optimal
point highlights the fact that the solution is not reliable. Any variability in the
inputs can cause one or more constraints to be violated.
Conversely, the right side of Fig. 12.8 shows a reliable optimum, with the
same uncertainty ellipse. In this case, it is much more probable that the design
will satisfy all constraints under the input variations. However, as noted in
the introduction, increased reliability presents a performance trade-off, with a
corresponding increase in the objective function. The higher the reliability we
seek, the more we need to give up on performance.
$$\sigma(x) \le \eta\,\sigma_y ,$$
where 𝜂 is a total safety factor that accounts for safety factors from loads,
materials, and failure modes. Of course, not all applications have standards-
driven safety factors already determined. The statistical approach discussed in
this chapter is useful in these situations to obtain reliable designs.
$$\mu_f = f(\mu_x) . \tag{12.5}$$
That is, when considering only first-order terms, the mean of the function
is the function evaluated at the mean of the input.
The variance of $f$ is given by
$$
\begin{aligned}
\sigma_f^2 &= E\!\left(f(x)^2\right) - E\!\left(f(x)\right)^2 \\
&\approx E\!\left[ f(\mu_x)^2 + 2 f(\mu_x) \sum_i \frac{\partial f}{\partial x_i}\,(x_i - \mu_{x_i})
+ \sum_i \sum_j \frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\,(x_i - \mu_{x_i})(x_j - \mu_{x_j}) \right] - f(\mu_x)^2 \\
&= \sum_i \sum_j \frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\,
E\!\left[(x_i - \mu_{x_i})(x_j - \mu_{x_j})\right] .
\end{aligned}
\tag{12.6}
$$
$$\sigma_f^2 = \sum_{i=1}^{n} \left(\frac{\partial f}{\partial x_i}\right)^{\!2} \sigma_{x_i}^2 . \tag{12.8}$$
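A minimal sketch of this first-order perturbation approach is shown below, assuming independent inputs and using a forward-difference gradient (Eqs. 12.5 and 12.8); the test function and numbers are placeholders rather than the example that follows.

```python
import numpy as np

def first_order_moments(f, mu_x, sigma_x, h=1e-6):
    """First-order perturbation estimates of the mean and standard deviation
    of f(x), assuming independent inputs (Eqs. 12.5 and 12.8)."""
    mu_x = np.asarray(mu_x, dtype=float)
    f0 = f(mu_x)                           # mu_f ~= f(mu_x)
    grad = np.zeros_like(mu_x)
    for i in range(mu_x.size):             # forward-difference gradient at mu_x
        xp = mu_x.copy()
        xp[i] += h
        grad[i] = (f(xp) - f0) / h
    var_f = np.sum((grad * np.asarray(sigma_x)) ** 2)
    return f0, np.sqrt(var_f)

# Illustrative use with a placeholder function
f = lambda x: x[0] ** 2 + np.sin(x[1])
mu_f, sigma_f = first_order_moments(f, [1.0, 0.5], [0.05, 0.1])
```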
All the design variables are random variables with standard deviations 𝜎𝑥1 =
𝜎𝑥2 = 0.033, and 𝜎𝑥3 = 0.0167. We seek a reliable optimum, where each
constraint has a target reliability of 99.865 percent.
First, we compute the deterministic optimum, which is
We compute the standard deviation of each constraint, using Eq. 12.8, about
the deterministic optimum, yielding 𝜎 𝑔1 = 0.081, 𝜎 𝑔2 = 0.176. Using an
inverse CDF function (discussed in Section 10.2.1) shows that a CDF of 0.99865
corresponds to a 𝑧-score of 3. We then re-optimize with the new reliability
constraints to obtain the solution:
To check these results, we use Monte Carlo simulations (explained in Section 12.3.3) with 100,000 samples to produce the output histograms shown in Fig. 12.9. The deterministic optimum fails often ($\|g(x)\|_\infty > 0$), so its reliability is a surprisingly poor 34.6 percent. The reliable optimum shifts the distribution to the left, yielding a reliability of 99.75 percent, which is close to our design target.
Fig. 12.9 [Output histograms of $\|g(x^*)\|_\infty$ (number of samples vs. constraint value) for the deterministic and reliable optima.]
is to add a new node between all existing nodes. Thus, the accuracy of
the integral can be improved up to a specified tolerance while reusing
previous function evaluations.
Although straightforward to apply, the Newton–Cotes formulas are
usually much less efficient than Gaussian quadrature, at least for smooth,
nonperiodic functions. Efficiency is highly desirable because the output
functions must be called many times for forward propagation, as well
as throughout the optimization. The Newton–Cotes formulas are based
on fitting polynomials: constant (midpoint), linear (trapezoidal), and
quadratic (Simpson’s). The weights are adjusted between the different
methods, but the nodes are fixed. Gaussian quadrature includes the
nodes as degrees of freedom selected by the quadrature strategy. The
method approximates the integrand as a polynomial and then efficiently
evaluates the integral for the polynomial exactly. Because some of the
concepts from Gaussian quadrature are used later in this chapter, we
review them here.
An 𝑛-point Gaussian quadrature has 2𝑛 degrees of freedom (𝑛 node
positions and 𝑛 corresponding weights), so it can be used to exactly
integrate any polynomial up to order 2𝑛 − 1 if the weights and nodes are
appropriately chosen. For example, a 2-point Gaussian quadrature can
exactly integrate all polynomials up to order 3. To illustrate, consider
an integral over the bounds −1 to 1 (we will later see that these bounds
can be used as a general representation of any finite bounds through a
change of variables):
$$\int_{-1}^{1} f(x)\,\mathrm{d}x \approx w_1 f(x_1) + w_2 f(x_2) . \tag{12.13}$$
Requiring the rule to be exact for a constant integrand $f(x) = a$ gives
$$2a = a(w_1 + w_2). \tag{12.14}$$
Requiring exactness for $1$, $x$, $x^2$, and $x^3$ yields the system
$$
\begin{aligned}
2 &= w_1 + w_2 \\
0 &= w_1 x_1 + w_2 x_2 \\
\tfrac{2}{3} &= w_1 x_1^2 + w_2 x_2^2 \\
0 &= w_1 x_1^3 + w_2 x_2^3 .
\end{aligned}
\tag{12.15}
$$
Solving these equations yields $w_1 = w_2 = 1$ and $x_1 = -x_2 = 1/\sqrt{3}$. Thus, we have the weights and node positions that exactly integrate a cubic (or any lower-order) polynomial using only two function evaluations.
Legendre polynomials are defined over the interval [−1, 1], but we
can reformulate them for an arbitrary interval [𝑎, 𝑏] through a change
of variables:
$$x = \frac{b-a}{2}\, z + \frac{b+a}{2}, \tag{12.21}$$
where 𝑧 ∈ [−1, 1].
Using the change of variables, we can write
$$\int_a^b f(x)\,\mathrm{d}x = \int_{-1}^{1} \frac{b-a}{2}\, f\!\left(\frac{b-a}{2}\, z + \frac{b+a}{2}\right) \mathrm{d}z , \tag{12.22}$$
where the node locations and respective weights come from the Legen-
dre polynomials.
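The following sketch applies this change of variables with NumPy's Gauss–Legendre nodes and weights (numpy.polynomial.legendre.leggauss); the cubic integrand in the check is an arbitrary placeholder chosen only to illustrate that the two-point rule is exact for cubics.

```python
import numpy as np

def gauss_legendre_integral(f, a, b, n):
    """Approximate the integral of f over [a, b] with n-point Gauss-Legendre
    quadrature, using the change of variables in Eqs. 12.21 and 12.22."""
    z, w = np.polynomial.legendre.leggauss(n)   # nodes and weights on [-1, 1]
    x = 0.5 * (b - a) * z + 0.5 * (b + a)
    return 0.5 * (b - a) * np.sum(w * f(x))

# A 2-point rule integrates cubics exactly: int_0^2 x^3 dx = 4
approx = gauss_legendre_integral(lambda x: x**3, 0.0, 2.0, 2)
print(approx)   # 4.0 to machine precision
```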
Recall that what we are after in this section is not just any generic
integral but, rather, metrics such as the expected value,
$$\mu_f = \int f(x)\, p(x)\, \mathrm{d}x . \tag{12.24}$$
Table 12.1 Orthogonal polynomials that correspond to some common probability distributions.

Prob. dist.    Weight function         Polynomial             Support range
Uniform        1                       Legendre               [−1, 1]
Normal         e^{−x²}                 Hermite                (−∞, ∞)
Exponential    e^{−x}                  Laguerre               [0, ∞)
Beta           (1−x)^α (1+x)^β         Jacobi                 (−1, 1)
Gamma          x^α e^{−x}              Generalized Laguerre   [0, ∞)
If $x$ is normally distributed, then the integral is given by
$$\mu_f = \int_{-\infty}^{\infty} f(x)\, \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\right] \mathrm{d}x. \tag{12.28}$$
Fig. 12.11 The first few Hermite polynomials.
We use the change of variables,
$$z = \frac{x - \mu}{\sqrt{2}\,\sigma}. \tag{12.29}$$
Then, the resulting integral becomes
$$\mu_f = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} f\!\left(\mu + \sqrt{2}\,\sigma z\right) \exp\!\left(-z^2\right) \mathrm{d}z. \tag{12.30}$$
This is now in the appropriate form, so the quadrature rule (using the
Hermite nodes and weights) is
$$\mu_f = \frac{1}{\sqrt{\pi}} \sum_{i=1}^{n} w_i\, f\!\left(\mu + \sqrt{2}\,\sigma z_i\right). \tag{12.31}$$
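A short sketch of this Gauss–Hermite estimate of the mean is shown below, using numpy.polynomial.hermite.hermgauss for the nodes and weights; the test function exp(x) is a placeholder chosen because its mean under a normal input has a simple closed form to compare against.

```python
import numpy as np

def mean_gauss_hermite(f, mu, sigma, n):
    """Estimate E[f(x)] for x ~ N(mu, sigma^2) with n-point Gauss-Hermite
    quadrature, following the change of variables in Eqs. 12.29-12.31."""
    z, w = np.polynomial.hermite.hermgauss(n)   # physicists' Hermite nodes/weights
    return np.sum(w * f(mu + np.sqrt(2.0) * sigma * z)) / np.sqrt(np.pi)

# Illustrative check: E[exp(x)] = exp(mu + sigma^2 / 2) for a normal input
est = mean_gauss_hermite(np.exp, 0.0, 0.5, 10)
exact = np.exp(0.5**2 / 2)
```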
Figure 12.13 compares a two-dimension full tensor grid using the Clenshaw–
Curtis exponential rule (left) with a level 5 sparse grid using the same quadrature
strategy (right).
Fig. 12.13 Comparison between a two-dimensional full tensor grid (left) and a level 5 sparse grid using the Clenshaw–Curtis exponential rule (right).
$$\mu_f = \frac{1}{n} \sum_{i=1}^{n} f_i , \tag{12.33}$$
In addition to summary statistics like the mean and variance, Monte Carlo generates the output
probability distributions. This is a unique advantage compared with
first-order perturbation and direct quadrature, which provide summary
statistics but not distributions.
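The sketch below shows the basic Monte Carlo loop for independent normal inputs: it returns the estimated mean and standard deviation and keeps the raw samples so a histogram of the output can be plotted. The placeholder function and input statistics are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_stats(f, mu_x, sigma_x, n):
    """Monte Carlo estimates of the output mean and standard deviation for
    independent normal inputs; the raw samples also give the output histogram."""
    x = rng.normal(mu_x, sigma_x, size=(n, len(mu_x)))   # one row per sample
    fs = np.array([f(xi) for xi in x])
    return fs.mean(), fs.std(ddof=1), fs

# Illustrative use with a placeholder function (not the example in the text)
f = lambda x: x[0] ** 2 + 3.0 * x[1]
mu, sigma, samples = monte_carlo_stats(f, [1.0, 2.0], [0.1, 0.2], 10_000)
```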
Fig. 12.15 Convergence of the mean (left) and standard deviation (right) versus the number of samples $N$ using Monte Carlo with random sampling, LHS, and Halton sequences.
From the data, we conclude that we need about $n = 10^4$ samples to have well-converged statistics. Using $n = 10^4$ yields $\mu = 6.127$, $\sigma = 1.235$, and $r = 0.9914$. The results obtained with random sampling vary between simulations (except for the Halton sequence in quasi-Monte Carlo, which is deterministic). The production of an output histogram is a key benefit of this method. The histogram of the objective function is shown in Fig. 12.16. Notice that it is not
Fig. 12.16 Histogram of the objective function for 10,000 samples.
12.3.4 Polynomial Chaos
Polynomial chaos (also known as spectral expansions) is a class of forward-propagation methods that take advantage of the inherent smoothness of the outputs of interest using polynomial approximations.§
§ Polynomial chaos is not chaotic and does not actually need polynomials. The name polynomial chaos came about because it was initially derived for use in a physical theory of chaos.198
198. Wiener, The homogeneous chaos, 1938.
The method extends the ideas of Gaussian quadrature to estimate the output function, from which the output distribution and other summary statistics can be efficiently generated. In addition to using orthogonal polynomials to evaluate integrals, we use them to approximate the output function. As in Gaussian quadrature, the polynomials are orthogonal with respect to a specified probability distribution (see Eq. 12.25 and Table 12.1). A general function that depends on uncertain variables $x$ can be represented as a sum of basis functions $\psi_i$ (which are usually polynomials) with weights $\alpha_i$,
$$f(x) = \sum_{i=0}^{\infty} \alpha_i \psi_i(x). \tag{12.35}$$
In practice, the series is truncated to a finite number of terms:
$$f(x) \approx \sum_{i=0}^{n} \alpha_i \psi_i(x) . \tag{12.36}$$
$$\langle \psi_i, \psi_j \rangle = 0 \quad \text{if } i \neq j. \tag{12.38}$$
Compute Statistics
The coefficients 𝛼 𝑖 are constants that can be taken out of the integral, so
we can write
$$
\begin{aligned}
\mu_f &= \sum_i \alpha_i \int \psi_i(x)\, p(x)\, \mathrm{d}x \\
&= \alpha_0 \int \psi_0(x)\, p(x)\, \mathrm{d}x + \alpha_1 \int \psi_1(x)\, p(x)\, \mathrm{d}x + \alpha_2 \int \psi_2(x)\, p(x)\, \mathrm{d}x + \dots
\end{aligned}
$$
Because the polynomials are orthogonal, all the terms except the first
are zero (see Eq. 12.38). From the definition of a PDF (Eq. A.63), we
know that the first term is 1. Thus, the mean of the function is simply
the zeroth coefficient,
𝜇 𝑓 = 𝛼0 . (12.41)
We can derive a formula for the variance using a similar approach.
Substituting the polynomial representation (Eq. 12.36) into the definition
of variance and using the same techniques used in deriving the mean,
we obtain
$$
\begin{aligned}
\sigma_f^2 &= \int \Bigl(\sum_i \alpha_i \psi_i(x)\Bigr)^{\!2} p(x)\, \mathrm{d}x - \alpha_0^2 \\
&= \sum_i \alpha_i^2 \int \psi_i(x)^2\, p(x)\, \mathrm{d}x - \alpha_0^2
\end{aligned}
$$
$$
\begin{aligned}
\sigma_f^2 &= \alpha_0^2 \int \psi_0^2\, p(x)\, \mathrm{d}x + \sum_{i=1}^{n} \alpha_i^2 \int \psi_i(x)^2\, p(x)\, \mathrm{d}x - \alpha_0^2 \\
&= \alpha_0^2 + \sum_{i=1}^{n} \alpha_i^2 \int \psi_i(x)^2\, p(x)\, \mathrm{d}x - \alpha_0^2 \\
&= \sum_{i=1}^{n} \alpha_i^2 \int \psi_i(x)^2\, p(x)\, \mathrm{d}x \\
&= \sum_{i=1}^{n} \alpha_i^2 \langle \psi_i^2 \rangle .
\end{aligned}
\tag{12.42}
$$
That last step is just the definition of the weighted inner product
(Eq. 12.25), providing the variance in terms of the coefficients and
polynomials:
$$\sigma_f^2 = \sum_{i=1}^{n} \alpha_i^2 \langle \psi_i^2 \rangle . \tag{12.43}$$
$$\mu_f = \alpha_0 \tag{12.45}$$
$$\sigma_f^2 = \sum_{i=1}^{n} \alpha_i^2 \langle \Psi_i^2 \rangle . \tag{12.46}$$
Determine Coefficients
Using the orthogonality property of the basis functions (Eq. 12.38), all
the terms in the summation are zero except for
$$\langle f(x), \psi_i \rangle = \alpha_i \langle \psi_i^2 \rangle . \tag{12.49}$$
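To make the projection step concrete, the sketch below builds a one-dimensional polynomial chaos expansion for a normally distributed input. It uses probabilists' Hermite polynomials $He_i$ (for which $\langle He_i^2 \rangle = i!$ under a standard normal), evaluated and integrated with numpy.polynomial.hermite_e; this differs slightly from the physicists' Hermite convention used in the example that follows, and the test function is a placeholder with known moments.

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt, pi

def pce_moments_1d(f, mu, sigma, order=4, nquad=8):
    """1-D polynomial chaos for x ~ N(mu, sigma^2) with probabilists' Hermite
    basis. Coefficients follow Eq. 12.49: alpha_i = <f, psi_i> / <psi_i^2>."""
    z, w = He.hermegauss(nquad)              # nodes/weights for weight exp(-z^2/2)
    fz = f(mu + sigma * z)                   # sample f at the quadrature nodes
    alpha = np.zeros(order + 1)
    for i in range(order + 1):
        He_i = He.hermeval(z, [0.0] * i + [1.0])
        # <f, He_i> under the standard normal PDF, divided by <He_i^2> = i!
        alpha[i] = np.sum(w * fz * He_i) / (sqrt(2.0 * pi) * factorial(i))
    mean = alpha[0]                                              # Eq. 12.45
    var = sum(alpha[i] ** 2 * factorial(i) for i in range(1, order + 1))  # Eq. 12.46
    return mean, np.sqrt(var), alpha

# Check with f(x) = x^2, x ~ N(1, 0.2^2): mu_f = 1.04, sigma_f = sqrt(0.1632)
mean, std, alpha = pce_moments_1d(lambda x: x**2, 1.0, 0.2)
```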
the corresponding quadrature strategy or utilize random sequences.‖ 201. Feinberg and Langtangen, Chaospy:
An open source tool for designing methods of
uncertainty quantification, 2015.
where the current iteration is at 𝑥 = [1, 1], and we assume that both design
variables are normally distributed with the following standard deviations:
𝜎 = [0.06, 0.2].
We approximate the function with fourth-order Hermite polynomials.
Using Eq. 12.37, we see that there are 15 basis functions from the various
combinations of 𝐻𝑖 𝐻 𝑗 :
The integrals for the basis functions (Hermite polynomials) have analytic
solutions:
$$\langle \Psi_k^2 \rangle = \langle (H_m H_n)^2 \rangle = m!\, n! .$$
We now compute the following double integrals to obtain the coefficients using
Gaussian quadrature:
$$\alpha_k = \frac{1}{\langle \Psi_k^2 \rangle} \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f(x)\, \Psi_k(x)\, p(x)\, \mathrm{d}x_1\, \mathrm{d}x_2$$
We must be careful with variable definitions because the inputs are not standard
normal distributions. The function 𝑓 is defined over the unnormalized variable
𝑥, whereas our basis functions are defined over a standard normal distribu-
tion: 𝑦 = (𝑥 − 𝜇)/𝜎. The probability distribution in this case is a bivariate,
$$\alpha_k \approx \frac{1}{\pi \langle \Psi_k^2 \rangle} \sum_{i=1}^{n_i} \sum_{j=1}^{n_j} w_i w_j\, f(X_{ij})\, \Psi_k\!\left(\sqrt{2}\, Z_{ij}\right),$$
and $Z = z_1 \otimes z_2$.
In this case, we choose a full tensor-product mesh of the fifth order in both dimensions; the nodes and weights are shown in Fig. 12.17. The resulting nonzero coefficients include
$$\alpha_0 = 2.1725,\quad \alpha_1 = -0.0586,\quad \alpha_2 = 0.0117,\quad \alpha_3 = -0.00156,\quad \alpha_5 = -0.0250,\quad \alpha_9 = 0.01578 .$$
Fig. 12.17 Evaluation nodes with area proportional to weight.
We can now easily compute the mean and standard deviation as
$$\mu_f = \alpha_0 = 2.1725$$
$$\sigma_f = \sqrt{\sum_{i=1}^{n} \alpha_i^2 \langle \Psi_i^2 \rangle} = 0.06966 .$$
In this case, we are able to accurately estimate the mean and standard
deviation with only 25 function evaluations. In contrast, applying Monte Carlo
to this same problem, with LHS, requires about 10,000 function calls to estimate
the mean and over 100,000 function calls to estimate the standard deviation
(with less accuracy).
Although direct quadrature would work equally well if all we wanted was
the mean and standard deviation, polynomial chaos gives us a polynomial
approximation of our function near 𝜇𝑥 :
$$\tilde{f}(x) = \sum_i \alpha_i \Psi_i(x).$$
The primary benefit of this new function is that it is very inexpensive to evaluate (and the original function is often expensive), so we can use sampling procedures to compute other statistics, such as percentiles or reliability levels, or simply to visualize the output PDF, as shown in Fig. 12.19.
Fig. 12.19 Output histogram produced by sampling the polynomial expansion.
12.4 Summary
Engineering problems are subject to variation under uncertainty. OUU
deals with optimization problems where the design variables or other parameters have uncertain variability. Robust design optimization seeks designs that are less sensitive to inherent variability in the objective
function. Common OUU objectives include minimizing the mean or
standard deviation or performing multiobjective trade-offs between the
mean performance and standard deviation. Reliable design optimiza-
tion seeks designs with a reduced probability of failure, considering
the variability in the constraint values. To quantify robustness and
reliability, we need a forward-propagation procedure that propagates
Problems
a. The greater the reliability, the less likely the design is to have
a worse objective function value.
b. Reliability can be handled in a deterministic way using safety
factors, which ensure that the optimum has some margin
before the original constraint is violated.
c. Forward propagation computes the PDFs of the outputs and
inputs for a given numerical model.
d. The computational cost of direct quadrature scales exponen-
tially with the number of random variables, whereas the cost
of Monte Carlo is independent of the number of random
variables.
e. Monte Carlo methods approximate PDFs using random
sampling and converge slowly.
f. The first-order perturbation method computes the PDFs
using local Taylor series expansions.
g. Because the first-order perturbation method requires first-
order derivatives to compute the uncertainty metrics, OUU
using the first-order perturbation method requires second-
order derivatives.
h. Polynomial chaos is a forward-propagation technique that
uses polynomial approximations with random coefficients
to model the input uncertainties.
i. The number of basis functions required by polynomial chaos
grows exponentially with the number of uncertain input
variables.
12.3 Using Gaussian quadrature, find the mean and variance of the
function exp(cos(𝑥)) at 𝑥 = 1, assuming 𝑥 is normally distributed
with a standard deviation of 0.1. Determine how many evaluation
points are needed to converge to 5 decimal places. Compare your
results to trapezoidal integration.
12.5 Consider the function in Ex. 12.10. Solve the same problem,
but use Monte Carlo sampling instead. Compare the output
histogram and how many function calls are required to achieve
well-converged results for the mean and variance.
12.6 Repeat Ex. 12.10 using polynomial chaos, except with a uniform
distribution in both dimensions, where the standard deviations
from the example correspond to the half-width of a uniform
distribution.
[Figure: wind farm layout for this problem, with turbines $T_1$, $T_2$, and $T_3$ spaced $y_2$ and $y_3$ apart and the incoming wind direction indicated.]
Take the two optimal designs that you found, and then
compare each on the two objectives (deterministic and 95th
percentile). The first row corresponds to the performance of
the optimal deterministic layout. Evaluate the performance
of this layout using the deterministic value for COE and the
95th percentile that accounts for uncertainty. Repeat for the
optimal solution for the OUU case. Discuss your findings.
Multidisciplinary Design Optimization
13
As mentioned in Chapter 1, most engineering systems are multidiscipli-
nary, motivating the development of multidisciplinary design optimiza-
tion (MDO). The analysis of multidisciplinary systems requires coupled
models and coupled solvers. We prefer the term component instead of
discipline or model because it is more general. However, we use these
terms interchangeably depending on the context. When components in
a system represent different physics, the term multiphysics is commonly
used.
All the optimization methods covered so far apply to multidisci-
plinary problems if we view the coupled multidisciplinary analysis
as a single analysis that computes the objective and constraint func-
tions by solving the coupled model for a given set of design variables.
However, there are additional considerations in the solution, derivative
computation, and optimization of coupled systems.
In this chapter, we build on Chapter 3 by introducing models and
solvers for coupled systems. We also expand the derivative computation
methods of Chapter 6 to handle such systems. Finally, we introduce
various MDO architectures, which are different options for formulating
and solving MDO problems.
the uncertainty at a given point in time (recall Fig. 1.3). Although these
benefits still apply when modeling and optimizing a single discipline
or component, broadening the modeling and optimization to the whole
system brings on additional benefits.
Even without performing any optimization, constructing a multi-
disciplinary (coupled) model that considers the whole engineering
system is beneficial. Such a model should ideally consider all the
interactions between the system components. In addition to modeling
physical phenomena, the model should also include other relevant
considerations, such as economics and human factors. The benefit of
such a model is that it better reflects the actual state and performance of
the system when deployed in the real world, as opposed to an isolated
component with assumed boundary conditions. Using such a model,
designers can quantify the actual impact of proposed changes on the
whole system.
When considering optimization, the main benefit of MDO is that op-
timizing the design variables for the various components simultaneously
leads to a better system than when optimizing the design variables
for each component separately. Currently, many engineering systems
are designed and optimized sequentially, which leads to suboptimal
designs. This approach is often used in industry, where engineers are
grouped by discipline, physical subsystem, or both. This might be per-
ceived as the only choice when the engineering system is too complex and the number of engineers too large to coordinate a simultaneous design involving all groups.
different values for those variables each time, in which case they
will not converge. On the other hand, if we let one discipline handle a
shared variable, it will likely converge to a value that violates one or
more constraints from the other disciplines.
By considering the various components and optimizing a multidisci-
plinary performance metric with respect to as many design variables
as possible simultaneously, MDO automatically finds the best trade-off
between the components—this is the key principle of MDO. Suboptimal
designs also result from decisions at the system level that involve power
struggles between designers. In contrast, MDO provides the right
trade-offs because mathematics does not care about politics.
[Figure: an example coupled multidisciplinary model. Given the shape and structural sizing, an aerodynamic solver and a structural solver exchange surface pressures and displacements; additional components compute the weight, integrate the surface pressures to obtain drag and lift, compute the structural stresses, and compute the fuel consumption.]
Fig. 13.4 Coupled model composed of three numerical models, $r_i(x, u) = 0$ for $i = 1, 2, 3$, with the combined state vector $u = [u_1, u_2, u_3]$. This coupled model would replace the single model in Fig. 3.21.
13.2.1 Components
In Section 3.3, we explained how all models can ultimately be written as
a system of residuals, 𝑟(𝑥, 𝑢) = 0. When the system is large or includes
submodels, it might be natural to partition the system into components.
We prefer to use the more general term components instead of disciplines
to refer to the submodels resulting from the partitioning because the
partitioning of the overall model is not necessarily by discipline (e.g.,
aerodynamics, structures). A system model might also be partitioned
by physical system components (e.g., wing, fuselage, or an aircraft
in a fleet) or by different conditions applied to the same model (e.g.,
aerodynamic simulations at different flight conditions).
The partitioning can also be performed within a given discipline
for the same reasons cited previously. In theory, the system model
equations in 𝑟(𝑥, 𝑢) = 0 can be partitioned in any way, but only some
partitions are advantageous or make sense. We denote a partitioning
into 𝑛 components as
$$
r(u) = 0 \;\equiv\;
\begin{cases}
r_1(u_1;\; u_2, \dots, u_i, \dots, u_n) = 0 \\
\;\;\vdots \\
r_i(u_i;\; u_1, \dots, u_{i-1}, u_{i+1}, \dots, u_n) = 0 \\
\;\;\vdots \\
r_n(u_n;\; u_1, \dots, u_i, \dots, u_{n-1}) = 0
\end{cases}
\tag{13.1}
$$
The vectors $r_i$ and $u_i$ correspond to the residuals and states of
component 𝑖. The semicolon denotes that we solve each component 𝑖
by driving its residuals (𝑟 𝑖 ) to zero by varying only its states (𝑢𝑖 ) while
keeping the states from all other components constant. We assume this
is possible, but this is not guaranteed in general. We have omitted the
dependency on 𝑥 in Eq. 13.1 because, for now, we just want to find the
state variables that solve the governing equations for a fixed design.
Components can be either implicit or explicit, a concept we introduced
in Section 3.3. To solve an implicit component 𝑖, we need an algorithm
for driving the equation residuals, 𝑟 𝑖 (𝑢1 , . . . , 𝑢𝑖 , . . . , 𝑢𝑛 ), to zero by
varying the states 𝑢𝑖 while the other states (𝑢 𝑗 for all 𝑗 ≠ 𝑖) remain fixed.
This algorithm could involve a matrix factorization for a linear system
or a Newton solver for a nonlinear system.
An explicit component is much easier to solve because that com-
ponent’s states are explicit functions of other components’ states. The
states of an explicit component can be computed without factorization
or iteration. Suppose that the states of a component 𝑖 are given by the
explicit function 𝑢𝑖 = 𝑓 (𝑢 𝑗 ) for all 𝑗 ≠ 𝑖. As previously explained in
Section 3.3, we can convert an explicit equation to the residual form by
moving the function on the right-hand side to the left-hand side. Then,
we obtain a set of residuals,
Fig. 13.5 Aerostructural wing model showing the aerodynamic state variables (circulations $\Gamma$) on the left and the structural state variables (displacements $d_z$ and rotations $d_\theta$) on the right.
position on the beam. The states 𝑑 are the displacements and rotations at each
node, as shown on the right-hand side of Fig. 13.5. The weight does not depend
on the states, and it is an explicit function of the beam sizing and shape, so it
does not involve the structural model (Eq. 13.3). The stresses are an explicit
function of the displacements, so we can write 𝜎 = 𝜎(𝑑), where 𝜎 is a vector
whose size is the number of elements.
When we couple these two models, $A$ and $v$ depend on the wing displacements $d$, and $q$ depends on $\Gamma$. We can write all the implicit and explicit equations as residuals:
$$r_1 = A(d)\,\Gamma - v(d)$$
$$r_2 = K d - q(\Gamma) .$$
The states of this system are as follows:
$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \equiv \begin{bmatrix} \Gamma \\ d \end{bmatrix} .$$
This coupled system is illustrated in Fig. 13.6.
Fig. 13.6 The aerostructural model couples aerodynamics and structures through a displacement and force transfer.
Fig. 13.7 In the general case, a model may require conversions of inputs and outputs distinct from the states that the solver computes.
Consider the structural model from Ex. 13.2. We wrote 𝑞(Γ) to represent
the dependency of the external forces on the aerodynamic model circulations
to keep the notation simple, but in reality, there should be a separate explicit
component that converts Γ into 𝑞. The circulation translates to a lift force at each
spanwise position, which in turn needs to be distributed consistently to the
nodes of each beam element. Also, the displacements given by the structural
model (translations and rotations of each node) must be converted into a twist
distribution on the wing, which affects the right-hand side of the aerodynamic
model, 𝜃(𝑑). Both of these conversions are explicit functions.
[Fig. 13.9: a nine-component system shown in residual form (residuals $r_1, \dots, r_9$ with states $u_1, \dots, u_9$, left) and in functional form (right), where the three group outputs are $\hat{u}_1 = U_1(\hat{u}_2, \hat{u}_3)$, $\hat{u}_2 = U_2(\hat{u}_1, \hat{u}_3)$, and $\hat{u}_3 = U_3(\hat{u}_1, \hat{u}_2)$. Fig. 13.10 shows a four-component DSM and the corresponding directed graph.]
a graph (Fig. 13.10, right), where the graph nodes are the components,
and the edges represent the information dependency. This graph is
a directed graph because, in general, there are three possibilities for
coupling two components: single coupling one way, single coupling
the other way, and two-way coupling. A directed graph is cyclic when
there are edges that form a closed loop (i.e., a cycle). The graph on
the right of Fig. 13.10 has a single cycle between components B and C.
When there are no closed loops, the graph is acyclic. In this case, the
whole system can be solved by solving each component in turn without
iterating.
The DSM can be viewed as a matrix where the blank entries are
zeros. For real-world systems, this is often a sparse matrix. This means
that in the corresponding DSM, each component depends only on a
subset of all the other components. We can take advantage of the
structure of this sparsity in the solution of coupled systems.
The components in the DSM can be reordered without changing
the solution of the system. This is analogous to reordering sparse
matrices to make linear systems easier to solve. In one extreme case,
reordering could achieve a DSM with no entries below the diagonal. In
that case, we would have only feedforward connections, which means
all dependencies could be resolved in one forward pass (as we will
see in Ex. 13.4). This is analogous to having a linear system where
the matrix is lower triangular, in which case the linear solution can be obtained with forward substitution.
203. Cuthill and McKee, Reducing the bandwidth of sparse symmetric matrices, 1969.
The sparsity of the DSM can be exploited using ideas from sparse linear algebra. For example, reducing the bandwidth of the matrix (i.e., moving nonzero elements closer to the diagonal) can also be helpful. This can be achieved using algorithms such as Cuthill–McKee,203 reverse Cuthill–McKee (RCM), and approximate minimum degree (AMD) ordering.204‡
204. Amestoy et al., An approximate minimum degree ordering algorithm, 1996.
‡ Although these methods were designed for symmetric matrices, they are still useful for non-symmetric ones. Several numerical libraries include these methods.
We now introduce an extended version of the DSM, called XDSM,205 which we use later in this chapter to show the process in addition to the data dependencies. Figure 13.11 shows the XDSM for the same four-component system. When showing only the data dependencies, the only difference relative to the DSM is that the coupling variables are labeled explicitly, and the data paths are drawn. In the next section, we add the process to the XDSM.
205. Lambe and Martins, Extensions to the design structure matrix for the description of multidisciplinary design, analysis, and optimization processes, 2012.
13.2.5 Solving Coupled Numerical Models
The solution of coupled systems, also known as multidisciplinary analysis (MDA), requires concepts beyond the solvers reviewed in Section 3.6
Fig. 13.11 XDSM showing data dependencies for the four-component coupled system of Fig. 13.10.
In some cases, we have access to the model’s internal states, but we may
want to use a dedicated solver for that model anyway.
Because each model, in general, depends on the outputs of all other
models, we have a coupled dependency that requires a solver to resolve.
This means that the functional form requires two levels: one for the
model solvers and another for the system-level solver. At the system
level, we only deal with the coupling variables (û), and the internal states (u) are hidden.
The rest of this section presents several system-level solvers. We
will refer to each model as a component even though it is a group of
components in general.
Tip 13.2 Avoid coupling components with file input and output
The coupling variables are often passed between components through files.
This is undesirable because of a potential loss in precision (see Tip 13.1) and
because it can substantially slow down the coupled solution.
Instead of using files, pass the coupling variable data through memory
whenever possible. You can do this between codes written in different languages
by wrapping each code using a common language. When using files is
unavoidable, be aware of these issues and mitigate them as much as possible.
[XDSM diagram: nonlinear block Jacobi iteration, in which each component solver receives the previous iteration's coupling variables from the other components, so all solvers can run in parallel.]
Algorithm 13.1 Nonlinear block Jacobi
Inputs:
  û(0) = [û1(0), . . . , ûm(0)]: Initial guess for coupling variables
Outputs:
  û = [û1, . . . , ûm]: System-level states

k = 0
while ‖û(k) − û(k−1)‖2 > ε or k = 0 do    Do not check convergence for first iteration
  for all i ∈ {1, . . . , m} do    Can be done in parallel
    ûi(k+1) ← solve ri(ûi(k+1); ûj(k), j ≠ i) = 0    Solve for component i's states using the states from the previous iteration of the other components
  end for
  k = k + 1
end while
The block Jacobi solver (Alg. 13.1) can also be used when one or
more components are linear solvers. This is useful for computing the
derivatives of the coupled system using implicit analytic methods
because that involves solving a coupled linear system with the same
structure as the coupled model (see Section 13.3.3).
[XDSM diagram: nonlinear block Gauss–Seidel iteration, in which each component solver uses the latest available coupling variables from the components solved earlier in the same sweep.]
Algorithm 13.2 Nonlinear block Gauss–Seidel with Aitken acceleration
k = 0
while ‖û(k) − û(k−1)‖2 > ε or k = 0 do    Do not check convergence for first iteration
  for i = 1, m do
    ûtemp ← solve ri(ûi(k+1); û1(k+1), . . . , ûi−1(k+1), ûi+1(k), . . . , ûm(k)) = 0    Solve for component i's states using the latest states from other components
    Δûi(k) = ûtemp − ûi(k)    Compute step
    if k > 0 then
      θ(k) = θ(k−1) [1 − (Δû(k) − Δû(k−1))ᵀ Δû(k) / ‖Δû(k) − Δû(k−1)‖²]    Update the relaxation factor
    end if
    ûi(k+1) = ûi(k) + θ(k) Δûi(k)    Update component i's states
  end for
  k = k + 1
end while
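A compact sketch of this Gauss–Seidel sweep with Aitken relaxation is shown below for a generic set of black-box components. It applies a single relaxation factor to the whole sweep rather than per component (a simplification of Alg. 13.2), and the toy two-component fixed-point problem at the end is a placeholder, not the aerostructural example.

```python
import numpy as np

def gauss_seidel_aitken(solvers, u_hat0, tol=1e-10, max_iter=100):
    """Nonlinear block Gauss-Seidel with Aitken relaxation. Each solvers[i](u_hat)
    returns component i's coupling variables given the current list u_hat."""
    u_hat = [np.array(u, dtype=float) for u in u_hat0]
    theta, delta_old = 1.0, None
    for _ in range(max_iter):
        steps = []
        for i, solver in enumerate(solvers):
            u_temp = solver(u_hat)               # uses latest states of other components
            step = u_temp - u_hat[i]             # unrelaxed step for component i
            u_hat[i] = u_hat[i] + theta * step   # relaxed Gauss-Seidel update
            steps.append(step)
        delta = np.concatenate(steps)
        if delta_old is not None:
            diff = delta - delta_old
            if diff @ diff > 0:                  # Aitken update of the relaxation factor
                theta *= 1.0 - diff @ delta / (diff @ diff)
        delta_old = delta
        if np.linalg.norm(delta) < tol:
            break
    return u_hat

# Toy two-component system: u1 = cos(u2), u2 = 0.5 sin(u1)  (placeholders)
U1 = lambda u: np.array([np.cos(u[1][0])])
U2 = lambda u: np.array([0.5 * np.sin(u[0][0])])
sol = gauss_seidel_aitken([U1, U2], [np.zeros(1), np.zeros(1)])
```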
Newton’s Method
Fig. 13.15 There are three options for solving a coupled system with Newton's method. The monolithic approach (left) solves for all state variables simultaneously. The block approach (middle) solves the same system as the monolithic approach but solves each component for its states at each iteration. The black-box approach (right) applies Newton's method to the coupling variables.
component is solved first using the latest states. The Newton step for each component $i$ is given by
$$\frac{\partial r_i}{\partial u_i}\, \Delta u_i = -r_i\!\left(u_i;\; u_{j\neq i}\right), \tag{13.10}$$
where $u_j$ represents the states from the other components (i.e., $j \neq i$), which are fixed at this level. Each component is solved before taking a step in the entire state vector (Eq. 13.9). The procedure is given in Alg. 13.3 and illustrated in Fig. 13.16. We call this the full-space
in Alg. 13.3 and illustrated in Fig. 13.16. We call this the full-space
hierarchical Newton approach because the system-level solver iterates the
entire state vector. Solving each component before taking each step in
the full space Newton iteration acts as a preconditioner. In general, the
monolithic approach is more efficient, and the hierarchical approach is
more robust, but these characteristics are case-dependent.
Fig. 13.16 [XDSM for the full-space hierarchical Newton iteration: each component solver returns its states $u_i$ and residual Jacobian blocks $\partial r_i/\partial u$ to the system-level Newton update.]
smaller than the full space of the state variables. Using this approach,
each component’s solver can be a black box, as in the nonlinear block
Jacobi and Gauss–Seidel solvers. This approach is illustrated on the
right panel of Fig. 13.15. The reduced-space approach is mathemati-
cally equivalent and follows the same iteration path as the full-space
approach if each component solver in the reduced-space approach is
converged well enough.132
132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
Algorithm 13.3 Full-space hierarchical Newton
Inputs:
  u(0) = [u1(0), . . . , un(0)]: Initial guess for the state variables
Outputs:
  u = [u1, . . . , un]: System-level states
where we need the partial derivatives of all the residuals with respect to
the coupling variables to form the Jacobian matrix $\partial \hat{r}/\partial \hat{u}$. The Jacobian can be found by differentiating Eq. 13.11 with respect to the coupling variables. Then, expanding the concatenated residual and coupling variable vectors yields
$$
\begin{bmatrix}
I & -\dfrac{\partial U_1}{\partial \hat{u}_2} & \cdots & -\dfrac{\partial U_1}{\partial \hat{u}_m} \\
-\dfrac{\partial U_2}{\partial \hat{u}_1} & I & \cdots & -\dfrac{\partial U_2}{\partial \hat{u}_m} \\
\vdots & \vdots & \ddots & \vdots \\
-\dfrac{\partial U_m}{\partial \hat{u}_1} & -\dfrac{\partial U_m}{\partial \hat{u}_2} & \cdots & I
\end{bmatrix}
\begin{bmatrix}
\Delta\hat{u}_1 \\ \Delta\hat{u}_2 \\ \vdots \\ \Delta\hat{u}_m
\end{bmatrix}
= -
\begin{bmatrix}
\hat{u}_1 - U_1(\hat{u}_2, \dots, \hat{u}_m) \\
\hat{u}_2 - U_2(\hat{u}_1, \hat{u}_3, \dots, \hat{u}_m) \\
\vdots \\
\hat{u}_m - U_m(\hat{u}_1, \dots, \hat{u}_{m-1})
\end{bmatrix} .
\tag{13.13}
$$
The residuals in the right-hand side of this equation are evaluated at
the current iteration.
The derivatives in the block Jacobian matrix are also computed
at the current iteration. Each row 𝑖 represents the derivatives of the
(potentially implicit) function that computes the outputs of component
𝑖 with respect to all the inputs of that component. The Jacobian matrix
in Eq. 13.13 has the same structure as the DSM (but transposed) and is
often sparse. These derivatives can be computed using the methods
from Chapter 6. These are partial derivatives in the sense that they do
not take into account the coupled system. However, they must take
into account the respective model and can be computed using implicit
analytic methods when the model is implicit.
This Newton solver is shown in Fig. 13.17 and detailed in Alg. 13.4.
Each component corresponds to a set of rows in the block Newton
system (Eq. 13.13). To compute each set of rows, the corresponding
component must be solved, and the derivatives of its outputs with
respect to its inputs must be computed as well. Each set can be computed
in parallel, but once the system is assembled, a step in the coupling
variables is computed by solving the full system (Eq. 13.13).
These coupled Newton methods have similar advantages and dis-
advantages to the plain Newton method. The main advantage is that it
converges quadratically once it is close enough to the solution (if the
problem is well-conditioned). The main disadvantage is that it might
Fig. 13.17 [XDSM for the reduced-space Newton algorithm: each component solver returns its coupling variables $\hat{u}_i$ and the Jacobian blocks $\partial U_i/\partial \hat{u}$ used to assemble the coupled Newton system.]
Algorithm 13.4
Inputs:
  û(0) = [û1(0), . . . , ûm(0)]: Initial guess for coupling variables
Outputs:
  û = [û1, . . . , ûm]: System-level states

k = 0
while ‖r̂‖2 > ε do    Check residual norm for all components
  for all i ∈ {1, . . . , m} do    Can be done in parallel
    Ui ← compute Ui(ûj≠i(k))    Solve component i and compute its outputs
  end for
  r̂ = û(k) − U    Compute all coupling variable residuals
  compute ∂U/∂û    Compute Jacobian of coupling variables for current state
  solve (∂r̂/∂û) Δû = −r̂    Coupled Newton system (Eq. 13.13)
  û(k+1) = û(k) + Δû    Update all coupling variables
  k = k + 1
end while
Nonlinear block Gauss–Seidel (Alg. 13.2) is similar, but we need to solve the two components in sequence. We can start by solving $r_1 = 0$ for $\Gamma$ with $d = 0$. Then we use the $\Gamma$ obtained from this solution in $r_2$ and solve for a new $d$. We now have a new $d$ to use in $r_1$ to solve for a new $\Gamma$, and so on.
The Jacobian for the Newton system (Eq. 13.9) is
$$
\frac{\partial r}{\partial u} =
\begin{bmatrix}
\dfrac{\partial r_1}{\partial u_1} & \dfrac{\partial r_1}{\partial u_2} \\[1ex]
\dfrac{\partial r_2}{\partial u_1} & \dfrac{\partial r_2}{\partial u_2}
\end{bmatrix}
=
\begin{bmatrix}
A & \dfrac{\partial A}{\partial d}\Gamma - \dfrac{\partial v}{\partial d} \\[1ex]
-\dfrac{\partial q}{\partial \Gamma} & K
\end{bmatrix} .
$$
We already have the block diagonal matrices in this Jacobian from the governing equations, but we need to compute the off-diagonal partial derivative blocks, which can be done analytically or with algorithmic differentiation (AD).
Fig. 13.18 Spanwise distribution of the lift, wing rotation ($d_\theta$), and vertical displacement ($d_z$) for the coupled aerostructural solution.
The solution is shown in Fig. 13.18, where we plot the variation of lift,
vertical displacement, and rotation along the span. The vertical displacements
are a subset of 𝑑, and the rotations are a conversion of a subset of 𝑑 representing
the rotations of the wing section at each spanwise location. The lift is the
vertical force at each spanwise location, which is proportional to Γ times the
wing chord at that location.
The monolithic Newton approach does not converge in this case. We
apply the full-space hierarchical approach (Alg. 13.3), which converges more
reliably. In this case, the reduced-space approach is not used because there is
no distinction between coupling variables and state variables.
In Fig. 13.19, we compare the convergence of the methods introduced in this
section.¶ The Jacobi method has the poorest convergence rate and oscillates. The Gauss–Seidel method is much better, and it is even better with Aitken acceleration. Newton has the highest convergence rate, as expected. Broyden performs about as well as Gauss–Seidel in this case.
¶ These results and subsequent results based on the same example were obtained using OpenAeroStruct,202 which was developed using OpenMDAO. The description in these examples is simplified for didactic purposes; check the paper and code for more details.
202. Jasa et al., Open-source coupled aerostructural optimization using Python, 2018.
Fig. 13.19 Convergence of each solver for the aerostructural system (residual norm ‖r‖ versus iterations) for block Jacobi, block Gauss–Seidel, block GS with Aitken, Broyden, and Newton.
shown in Fig. 13.20 for a system of six components and five solvers.
The root node ultimately solves the complete system, and each solver is
responsible for a subsystem and thus handles a subset of the variables.
[Fig. 13.20: a hierarchy of recursive solvers for a system of six components and five solvers.]
Fig. 13.21 There are three main possibilities involving two components: parallel, serial, and coupled.
Fig. 13.22 Three examples of a system of four components with a two-level solver hierarchy (parallel, serial, and coupled groupings).
To solve the system from Ex. 13.3 using hierarchical solvers, we can use the hierarchy shown in Fig. 13.23. We form three groups with three
components each. Each group includes the input and output conversion
components (which are explicit) and one implicit component (which
requires its own solver). Serial solvers can be used to handle the input
and output conversion components. A coupled solver is required to
solve the entire coupled system, but the coupling between the groups
is restricted to the corresponding outputs (components 3, 6, and 9).
Alternatively, we could apply a coupled solver to the functional
representation (Fig. 13.9, right). This would also use two levels of
solvers: a solver within each group and a system-level solver for the
coupling of the three groups. However, the system-level solver would
handle coupling variables rather than the residuals of each component.
Fig. 13.23 For the case of Fig. 13.9, we can use a serial evaluation within each of the three groups and require a coupled solver to handle the coupling between the three groups.
The development of coupled solvers is often done for a specific set of models
from scratch, which requires substantial effort. OpenMDAO is an open-source
framework that facilitates such efforts by implementing MAUD.132
132. Gray et al., OpenMDAO: An open-source framework for multidisciplinary design, analysis, and optimization, 2019.
All the solvers introduced in this chapter are available in OpenMDAO. This framework
also makes it easier to compute the derivatives of the coupled system, as we
will see in the next section. Users can assemble systems of mixed explicit and
implicit components.
For implicit components, they must give OpenMDAO access to the residual
computations and the corresponding state variables. For explicit components,
OpenMDAO only needs access to the inputs and the outputs, so it supports
black-box models.
OpenMDAO is usually more efficient when the user provides access to
the residuals and state variables instead of treating models as black boxes. A
hierarchy of multiple solvers can be set up in OpenMDAO, as illustrated in
Fig. 13.20. OpenMDAO also provides the necessary interfaces for user-defined
solvers. Finally, OpenMDAO encourages coupling through memory, which is
beneficial for numerical precision (see Tip 13.1) and computational efficiency
(see Tip 13.2).
$$
\begin{bmatrix}
\dfrac{\partial r_1}{\partial u_1} & \cdots & \dfrac{\partial r_1}{\partial u_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial r_n}{\partial u_1} & \cdots & \dfrac{\partial r_n}{\partial u_n}
\end{bmatrix}
\begin{bmatrix}
\phi_1 \\ \vdots \\ \phi_n
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial r_1}{\partial x} \\ \vdots \\ \dfrac{\partial r_n}{\partial x}
\end{bmatrix},
\tag{13.17}
$$
where $\phi_i$ represents the derivatives of the states from component $i$ with respect to the design variables. Once we have solved for $\phi$, we can use the coupled equivalent of the total derivative equation (Eq. 6.44) to compute the derivatives:
$$
\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} -
\begin{bmatrix}
\dfrac{\partial f}{\partial u_1} & \cdots & \dfrac{\partial f}{\partial u_n}
\end{bmatrix}
\begin{bmatrix}
\phi_1 \\ \vdots \\ \phi_n
\end{bmatrix} .
\tag{13.18}
$$
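The following sketch applies Eqs. 13.17 and 13.18 to a toy linear two-component system so the mechanics are visible; the matrices, the scalar design variable, and the function of interest are all placeholders, and the result is checked against the derivative obtained by differentiating the solved states directly.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 2

# Toy coupled system (placeholders): r1 = A1 u1 + B1 u2 - c1 x = 0,
#                                    r2 = B2 u1 + A2 u2 - c2 x = 0,
# with scalar design variable x and function of interest f = sum(u1) + sum(u2).
A1, A2 = 4.0 * np.eye(n1), 5.0 * np.eye(n2)
B1 = 0.5 * rng.normal(size=(n1, n2))
B2 = 0.5 * rng.normal(size=(n2, n1))
c1, c2 = rng.normal(size=n1), rng.normal(size=n2)

# Coupled direct method (Eq. 13.17): solve (dr/du) phi = dr/dx
dr_du = np.block([[A1, B1], [B2, A2]])
dr_dx = -np.concatenate([c1, c2])          # partials of the residuals w.r.t. x
phi = np.linalg.solve(dr_du, dr_dx)

# Total derivative (Eq. 13.18): df/dx = (partial f / partial x) - (df/du) phi
df_du = np.ones(n1 + n2)                   # f = sum(u)  =>  df/du = 1
df_dx = 0.0 - df_du @ phi

# Check against differentiating the solved states u(x) = (dr/du)^{-1} [c1; c2] x
du_dx = np.linalg.solve(dr_du, np.concatenate([c1, c2]))
assert np.isclose(df_dx, df_du @ du_dx)
```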
Similarly, the adjoint equations (Eq. 6.46) can be written for a coupled system using the same concatenated state and residual vectors. The coupled adjoint equations involve a corresponding concatenated adjoint vector:
$$
\begin{bmatrix}
\dfrac{\partial r_1}{\partial u_1}^{\!\top} & \cdots & \dfrac{\partial r_n}{\partial u_1}^{\!\top} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial r_1}{\partial u_n}^{\!\top} & \cdots & \dfrac{\partial r_n}{\partial u_n}^{\!\top}
\end{bmatrix}
\begin{bmatrix}
\psi_1 \\ \vdots \\ \psi_n
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial f}{\partial u_1}^{\!\top} \\ \vdots \\ \dfrac{\partial f}{\partial u_n}^{\!\top}
\end{bmatrix} .
\tag{13.19}
$$
After solving these equations for the coupled-adjoint vector, we can use the coupled version of the total derivative equation (Eq. 6.47) to compute the desired derivatives as
$$
\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} -
\begin{bmatrix} \psi_1^\top & \cdots & \psi_n^\top \end{bmatrix}
\begin{bmatrix}
\dfrac{\partial r_1}{\partial x} \\ \vdots \\ \dfrac{\partial r_n}{\partial x}
\end{bmatrix} .
\tag{13.20}
$$
Like the adjoint method from Section 6.7, the coupled adjoint is a
powerful approach for computing gradients with respect to many
design variables.∗
∗ The coupled-adjoint approach has been implemented for aerostructural problems governed by coupled PDEs207 and demonstrated in wing design optimization.209
207. Kenway et al., Scalable parallel approach for high-fidelity steady-state aeroelastic analysis and derivative computations, 2014.
209. Kenway and Martins, Multipoint high-fidelity aerostructural optimization of a transport aircraft configuration, 2014.
The required partial derivatives are the derivatives of the residuals or outputs of each component with respect to the state variables or inputs of all other components. In practice, the block structure of these partial derivative matrices is sparse, and the matrices themselves are sparse. This sparsity can be exploited using graph coloring to drastically reduce the computation effort of computing Jacobians at the system or component level, as explained in Section 6.8.
Figure 13.26 shows the structure of the Jacobians in Eq. 13.17 and
Eq. 13.19 for the three-group case from Fig. 13.23. The sparsity structure
of the Jacobian is the transpose of the DSM structure. Because the
Jacobian in Eq. 13.19 is transposed, the Jacobian in the adjoint equation
has the same structure as the DSM.
The structure of the linear system can be exploited in the same
way as for the nonlinear system solution using hierarchical solvers:
serial solvers within each group and a coupled solver for the three
groups. The block Jacobi and Gauss–Seidel methods from Section 13.2.5
are applicable to coupled linear components, so these methods can
be re-used to solve this coupled linear system for the total coupled
derivatives.
The partial derivatives in the coupled Jacobian, the right-hand side
of the linear systems (Eqs. 13.17 and 13.19), and the total derivatives
equations (Eqs. 13.18 and 13.20) can be computed with any of the
methods from Chapter 6. The nature of these derivatives is the same
as we have seen previously for implicit analytic methods (Section 6.7). They do not require the solution of the equation and are typically cheap to compute. Ideally, the components would already have analytic derivatives of their outputs with respect to their inputs, which are all the derivatives needed at the system level.
Fig. 13.26 Jacobian structure for the residual form of the coupled direct (left) and adjoint (right) equations for the three-group system of Fig. 13.23. The structure of the transpose of the Jacobian is the same as that of the DSM.
The partial derivatives can also be computed using the finite-
difference or complex-step methods. Even though these are not efficient
for cases with many inputs, it might still be more efficient to compute
the partial derivatives with these methods and then solve the coupled
derivative equations instead of performing a finite difference of the
coupled system, as described in Section 13.3.1. The reason is that com-
puting the partial derivatives avoids having to reconverge the coupled
system for every input perturbation. In addition, the coupled system
derivatives should be more accurate when finite differences are used
only to compute the partial derivatives.
Variants of the coupled direct and adjoint methods can also be derived
for the functional form of the system-level representation (Eq. 13.4),
by using the residuals defined for the system-level Newton solver
(Eq. 13.11),
$$\hat{r}_i(\hat{u}) = \hat{u}_i - U_i(\hat{u}_{j\neq i}) = 0 , \qquad i = 1, \dots, m . \tag{13.21}$$
Recall that driving these residuals to zero relies on a solver for each
component to solve for each component’s states and another solver to
solve for the coupling variables û.
$$
\begin{bmatrix}
I & -\dfrac{\partial U_1}{\partial \hat{u}_2} & \cdots & -\dfrac{\partial U_1}{\partial \hat{u}_m} \\
-\dfrac{\partial U_2}{\partial \hat{u}_1} & I & \cdots & -\dfrac{\partial U_2}{\partial \hat{u}_m} \\
\vdots & \vdots & \ddots & \vdots \\
-\dfrac{\partial U_m}{\partial \hat{u}_1} & -\dfrac{\partial U_m}{\partial \hat{u}_2} & \cdots & I
\end{bmatrix}
\begin{bmatrix}
\hat{\phi}_1 \\ \hat{\phi}_2 \\ \vdots \\ \hat{\phi}_m
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial U_1}{\partial x} \\ \dfrac{\partial U_2}{\partial x} \\ \vdots \\ \dfrac{\partial U_m}{\partial x}
\end{bmatrix},
\tag{13.22}
$$
where the Jacobian is identical to the one we derived for the coupled Newton step (Eq. 13.13). Here, $\hat{\phi}_i$ represents the derivatives of the coupling variables from component $i$ with respect to the design variables. The solution can then be used in the following equation to compute the total derivatives:
$$
\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} -
\begin{bmatrix}
\dfrac{\partial f}{\partial \hat{u}_1} & \cdots & \dfrac{\partial f}{\partial \hat{u}_m}
\end{bmatrix}
\begin{bmatrix}
\hat{\phi}_1 \\ \vdots \\ \hat{\phi}_m
\end{bmatrix} .
\tag{13.23}
$$
Similarly, the functional version of the coupled adjoint equations can be derived as
$$
\begin{bmatrix}
I & -\dfrac{\partial U_2}{\partial \hat{u}_1}^{\!\top} & \cdots & -\dfrac{\partial U_m}{\partial \hat{u}_1}^{\!\top} \\
-\dfrac{\partial U_1}{\partial \hat{u}_2}^{\!\top} & I & \cdots & -\dfrac{\partial U_m}{\partial \hat{u}_2}^{\!\top} \\
\vdots & \vdots & \ddots & \vdots \\
-\dfrac{\partial U_1}{\partial \hat{u}_m}^{\!\top} & -\dfrac{\partial U_2}{\partial \hat{u}_m}^{\!\top} & \cdots & I
\end{bmatrix}
\begin{bmatrix}
\hat{\psi}_1 \\ \hat{\psi}_2 \\ \vdots \\ \hat{\psi}_m
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial f}{\partial \hat{u}_1}^{\!\top} \\ \dfrac{\partial f}{\partial \hat{u}_2}^{\!\top} \\ \vdots \\ \dfrac{\partial f}{\partial \hat{u}_m}^{\!\top}
\end{bmatrix} .
\tag{13.24}
$$
After solving for the coupled-adjoint vector using the previous equa-
tion, we can use the total derivative equation to compute the desired
derivatives:
$$
\frac{\mathrm{d}f}{\mathrm{d}x} = \frac{\partial f}{\partial x} -
\begin{bmatrix} \hat{\psi}_1^\top & \cdots & \hat{\psi}_m^\top \end{bmatrix}
\begin{bmatrix}
\dfrac{\partial \hat{r}_1}{\partial x} \\ \vdots \\ \dfrac{\partial \hat{r}_m}{\partial x}
\end{bmatrix} .
\tag{13.25}
$$
Because the coupling variables (û) are usually a reduction of the internal state variables (u), the linear systems in Eqs. 13.22 and 13.24 are usually much smaller than their residual counterparts (Eqs. 13.17
and 13.19). However, unlike the partial derivatives in the residual form,
the partial derivatives in the functional form Jacobian need to account
for the solution of the corresponding component. When viewed at
the component level, these derivatives are actually total derivatives of
the component. When the component is an implicit set of equations,
computing these derivatives with finite-differencing would require
solving the component’s equations for each variable perturbation.
Alternatively, an implicit analytic method (from Section 6.7) could be
applied to the component to compute these derivatives.
Fig. 13.27 [Jacobian structure of the functional form for the three-group case:
$$
\begin{bmatrix}
I & -\partial U_1/\partial \hat{u}_2 & -\partial U_1/\partial \hat{u}_3 \\
-\partial U_2/\partial \hat{u}_1 & I & -\partial U_2/\partial \hat{u}_3 \\
-\partial U_3/\partial \hat{u}_1 & -\partial U_3/\partial \hat{u}_2 & I
\end{bmatrix} .]
$$
Figure 13.27 shows the Jacobian structure in the functional form of the coupled direct method (Eq. 13.22) for the case of Fig. 13.23. The dimension of this Jacobian is smaller than that of the residual form. Recall from Fig. 13.9 that $U_1$ corresponds to $r_3$, $U_2$ corresponds to $r_6$, and $U_3$ corresponds to $r_9$. Thus, the total size of this Jacobian corresponds to the sum of the sizes of components 3, 6, and 9, as opposed to the sum of the sizes of all nine components for the residual form. However,
The empty weight 𝑊 only depends on 𝑡 and 𝑏, and the dependence is explicit
(it does not require solving the aerodynamic or structural models). The drag 𝐷
and lift 𝐿 depend on all variables once we account for the coupled system of
equations. The remaining variables are fixed: 𝑅 is the required range, 𝑉 is the
airplane’s cruise speed, and 𝑐 is the specific fuel consumption of the airplane’s
engines. We also need to constrain the stresses in the structure, 𝜎, which are an
explicit function of the displacements (see Ex. 6.12).
To solve this optimization problem using gradient-based optimization, we
need the coupled derivatives of 𝑓 and 𝜎 with respect to 𝛼, 𝑏, 𝜃, and 𝑡. Computing
the derivatives of the aerodynamic and structural models separately is not
sufficient. For example, a perturbation on the twist changes the loads, which
then changes the wing displacements, which requires solving the aerodynamic
model again. Coupled derivatives take this effect into account.
[Fig. 13.28: DSM/UDE structure for the aerostructural problem with design variables ($\alpha$, $b$, $\theta$, $t$), intermediate variables ($\Gamma$, $d$, $W$, $D$, $L$), and functions of interest ($\sigma$, $f$), corresponding to the residuals $r_\alpha, r_b, r_\theta, r_t, r_\Gamma, r_d, r_W, r_D, r_L, r_\sigma, r_f$.]
We show the DSM for the system in Fig. 13.28. Because the DSM has the
same sparsity structure as the transpose of the Jacobian, this diagram reflects
the structure of the reverse UDE. The blocks that pertain to the design variables
have unit diagonals because they are independent variables, but they directly
affect the solver blocks. The blocks responsible for solving for Γ and 𝑑 are the
only ones with feedback coupling. The part of the UDE pertaining to Γ and 𝑑
is the Jacobian of residuals for the aerodynamic and structural components,
which we already derived in Ex. 13.5 to apply Newton’s method on the coupled
system. The functions of interest are all explicit components and only depend
directly on the design variables or the state variables. For example, the weight
𝑊 depends only on 𝑡; drag and lift depend only on the converged Γ; 𝜎 depends
on the displacements; and finally, the fuel burn 𝑓 just depends on drag, lift,
and weight. This whole coupled chain of derivatives is computed by solving
the linear system shown in Fig. 13.28.
For brevity, we only discuss the derivatives required to compute the
derivative of fuel burn with respect to span, but the other partial derivatives
would follow the same rationale.
• 𝜕𝑟/𝜕𝑢 is identical to what we derived when solving the coupled aero-
$$\frac{\mathrm{d}f}{\mathrm{d}b} = \frac{\partial f}{\partial D}\frac{\mathrm{d}D}{\mathrm{d}b} + \frac{\partial f}{\partial L}\frac{\mathrm{d}L}{\mathrm{d}b} + \frac{\partial f}{\partial W}\frac{\mathrm{d}W}{\mathrm{d}b} ,$$
where the partial derivatives can be obtained by differentiating Eq. 13.26
symbolically, and the total derivatives are part of the coupled linear
system solution.
After computing all the partial derivative terms, we solve either the forward or reverse UDE system. For the derivative with respect to span, neither method has an advantage. However, for the derivatives of fuel burn with respect to the twist and thickness variables, the reverse mode is much more efficient. In this example, $\mathrm{d}f/\mathrm{d}b = -11.0$ kg/m, so each additional meter of span reduced the fuel burn by 11 kg. If we compute this same derivative without coupling (by converging the aerostructural model but not considering the off-diagonal terms in the aerostructural Jacobian), we obtain $\mathrm{d}f/\mathrm{d}b = -17.7$ kg/m, which is significantly different. The derivatives of the fuel burn with respect to the twist distribution and the thickness distribution along the wingspan are plotted in Fig. 13.29, where we can see the difference between coupled and uncoupled derivatives.
Fig. 13.29 Derivatives of the fuel burn with respect to the spanwise distribution of twist and thickness variables. The coupled derivatives differ from the uncoupled derivatives, especially for the derivatives with respect to structural thicknesses near the wing root.
13.4 Monolithic MDO Architectures
So far in this chapter, we have extended the models and solvers from Chapter 3 and derivative computation methods from Chapter 6 to coupled systems. We now discuss the options to optimize coupled systems, which are given by various MDO architectures.
Monolithic MDO architectures cast the design problem as a single optimization. The only difference between the different monolithic
architectures is the set of design variables that the optimizer is responsi-
ble for, which affects the constraint formulation and how the governing
equations are solved.
Fig. 13.30 The MDF architecture relies on an MDA to solve for the coupling and state variables at each optimization iteration. In this case, the MDA uses the block Gauss–Seidel method. [XDSM: the optimizer passes the design variables $x$ to the MDA loop over Solvers 1–3 and to the functions component, which returns $f$ and $g$.]
sense as we are with finding an improved design. However, it is not guaranteed that the design constraints are satisfied if the optimization is terminated early; that depends on whether the optimization algorithm maintains a feasible design point or not.
The main disadvantage of MDF is that it solves an MDA for each optimization iteration, which requires its own algorithm outside of the optimization. Implementing an MDA algorithm can be time-consuming if one is not already in place.
As mentioned in Tip 13.3, a MAUD-based framework such as OpenMDAO can facilitate this. MAUD naturally implements the MDF architecture because it focuses on solving the MDA (Section 13.2.5) and on computing the derivatives corresponding to the MDA (Section 13.3.3).‡
When using a gradient-based optimizer, gradient computations are also challenging for MDF because coupled derivatives are required. Finite-difference derivative approximations are easy to implement, but their poor scalability and accuracy are compounded by the MDA, as explained in Section 13.3. Ideally, we would use one of the analytic methods.
‡ The first application of MAUD was the design optimization of a satellite and its orbit dynamics. The problem consisted of over 25,000 design variables and over 2 million state variables.210
210. Hwang et al., Large-scale multidisciplinary optimization of a small satellite's design and operation, 2014.
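To make the MDF data flow concrete, the following Python sketch applies the architecture to a hypothetical two-discipline toy model (not the wing problem of the examples); the coupling relations, coefficients, objective, and constraint are illustrative assumptions. A block Gauss–Seidel MDA is converged inside every objective and constraint evaluation, so the optimizer (here SciPy's SLSQP, with its default finite-difference gradients) only ever sees a multidisciplinary-feasible state.

```python
import numpy as np
from scipy.optimize import minimize

def mda(x, tol=1e-10, max_iter=200):
    """Block Gauss-Seidel MDA for a hypothetical two-discipline model.

    Discipline 1: u1 = x[0] + 0.3 * u2
    Discipline 2: u2 = x[1] - 0.2 * u1
    """
    u1 = u2 = 0.0
    for _ in range(max_iter):
        u1_new = x[0] + 0.3 * u2
        u2_new = x[1] - 0.2 * u1_new
        if abs(u1_new - u1) + abs(u2_new - u2) < tol:
            u1, u2 = u1_new, u2_new
            break
        u1, u2 = u1_new, u2_new
    return u1, u2

def objective(x):
    u1, u2 = mda(x)  # an MDA is converged at every optimization iteration
    return (u1 - 1.0) ** 2 + (u2 - 2.0) ** 2 + 0.1 * (x[0] ** 2 + x[1] ** 2)

def constraint(x):
    u1, _ = mda(x)   # SciPy convention: inequality constraints are g(x) >= 0
    return 2.5 - u1

result = minimize(objective, x0=np.array([0.5, 0.5]), method="SLSQP",
                  constraints=[{"type": "ineq", "fun": constraint}])
print(result.x, result.fun)
```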
Continuing the wing aerostructural problem from Ex. 13.6, we are finally ready to optimize the wing. The MDF formulation is as follows:

minimize      f
by varying    α, b, θ, t
subject to    L − W = 0
              2.5|σ| − σ_yield ≤ 0
while solving A(d)Γ − θ(d, α) = 0
              K d − q(Γ) = 0
for           Γ, d.

Fig. 13.31 The optimization reduces the fuel burn by increasing the span.

The structural stresses are constrained to be less than the yield stress of the material by a safety factor (2.5 in this case). In Ex. 13.5, we set up the MDA for the aerostructural problem, and in Ex. 13.6, we set up the coupled derivative computations needed to solve this problem using gradient-based optimization. Solving this optimization resulted in the larger span wing shown in Fig. 13.31. This larger span increases the structural weight but decreases drag. Although the increase in weight would typically increase the fuel burn, the drag decrease more than compensates for this adverse effect, and the fuel burn ultimately decreases up to this value of span. Beyond this optimal span value, the weight penalty would start to dominate, resulting in a fuel burn increase.
The twist and thickness distributions are shown in Fig. 13.32. The wing twist directly controls the spanwise lift loading. The baseline wing had no twist, which resulted in the loading shown in Fig. 13.33. In this figure, the gray line represents a hypothetical elliptical lift distribution, which results in the theoretical minimum for induced drag. The loading distributions for the level flight (1 g) and maneuver conditions (2.5 g) are indistinguishable. The optimization increases the twist in the midspan and drastically decreases it toward the tip. This twist distribution differentiates the loading at the two conditions: it makes the loading at level flight closer to the elliptical ideal while shifting the loading at the maneuver condition toward the wing root.

Fig. 13.32 Twist and thickness distributions for the baseline and optimized wings.

The thickness distribution also changes significantly, as shown in Fig. 13.32. The optimization tailors the thickness by adding more thickness in the spar near the root, where the moments are larger, and thins out the wing much more toward the tip, where the loads decrease. This more radical thickness distribution is enabled by the tailoring of the spanwise lift loading discussed previously.
These trades make sense because, at the level flight condition, the optimizer seeks to reduce drag, whereas at the maneuver condition, it seeks to reduce the loads that size the structure.
To perform sequential optimization for the wing design problem of Ex. 13.1,
we could start by optimizing the aerodynamics by solving the following
problem:
minimize 𝑓
by varying 𝛼, 𝜃
subject to 𝐿−𝑊 = 0.
Here, 𝑊 is constant because the structural thicknesses 𝑡 are fixed, but 𝐿 is a
function of the aerodynamic design variables and states. We cannot include the
span 𝑏 because it is a shared variable, as explained in Section 13.1. Otherwise,
this optimization would tend to increase 𝑏 indefinitely to reduce the lift-induced
drag. Because 𝑓 is a function of 𝐷 and 𝐿, and 𝐿 is constant because 𝐿 = 𝑊, we
could replace the objective with 𝐷.
Once the aerodynamic optimization has converged, the twist distribution
and the forces are fixed, and we then optimize the structure by minimizing
weight subject to stress constraints by solving the following problem:
minimize 𝑓
by varying 𝑡
subject to 2.5|𝜎| − 𝜎yield ≤ 0 .
Because the drag and lift are constant, the objective could be replaced by 𝑊.
Again, we cannot include the span in this problem because it would decrease
indefinitely to reduce the weight and internal loads due to bending.
These two optimizations are repeated until convergence. As shown in
Fig. 13.34, sequential optimization only changes one variable at a time, and it
converges to a point on the constraint with about 3.5° more twist than the true
optimum of the MDO. When including more variables, these differences are
likely to be even larger.
minimize      f(x; û)
by varying    x, ûᵗ
subject to    g(x; û) ≤ 0
              hᶜᵢ = ûᵗᵢ − ûᵢ = 0,   i = 1, …, m        (13.29)
while solving rᵢ(ûᵢ; x, ûᵗⱼ≠ᵢ) = 0,   i = 1, …, m
for           û.
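The following Python sketch illustrates the structure of Eq. 13.29 on a hypothetical two-discipline toy model (the discipline equations, coefficients, and objective are illustrative assumptions, not the wing problem). The optimizer owns the coupling-variable targets, each discipline is evaluated independently with the other discipline's target, and consistency is imposed through equality constraints.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-discipline model used only for illustration.
def discipline1(x, u2_t):
    return x[0] + 0.3 * u2_t           # returns u1 given the target u2

def discipline2(x, u1_t):
    return x[1] - 0.2 * u1_t           # returns u2 given the target u1

def objective(z):
    x, u1_t, u2_t = z[:2], z[2], z[3]
    return (u1_t - 1.0) ** 2 + (u2_t - 2.0) ** 2 + 0.1 * np.sum(x ** 2)

def consistency(z):
    x, u1_t, u2_t = z[:2], z[2], z[3]
    u1 = discipline1(x, u2_t)          # the disciplines never call each other,
    u2 = discipline2(x, u1_t)          # so they could be evaluated in parallel
    return [u1_t - u1, u2_t - u2]      # consistency constraints u^t - u = 0

z0 = np.zeros(4)                       # design variables plus coupling targets
result = minimize(objective, z0, method="SLSQP",
                  constraints=[{"type": "eq", "fun": consistency}])
print(result.x, result.fun)
```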
Fig. 13.35 The IDF architecture breaks up the MDA by letting the optimizer solve for the coupling variables that satisfy interdisciplinary feasibility.

One advantage of IDF is that each component can be solved in parallel because they do not depend on each other directly. Another advantage is that if gradient-based optimization is used to solve the problem, the optimizer is typically more robust and has a better convergence rate than the fixed-point iteration algorithms of Section 13.2.5.
The main disadvantage of IDF is that the optimizer must handle
more variables and constraints compared with the MDF architecture. If
the number of coupling variables is large, the size of the resulting opti-
mization problem may be too large to solve efficiently. This problem can
be mitigated by careful selection of the components or by aggregating
the coupling variables to reduce their dimensionality.
Unlike MDF, IDF does not guarantee a multidisciplinary feasible
state at every design optimization iteration. Multidisciplinary feasibility
is only guaranteed at the end of the optimization through the satisfaction
of the consistency constraints. This is a disadvantage because if the
optimization stops prematurely or we run out of time, we do not have
a valid state for the coupled system.
For the IDF architecture, we need to make copies of the coupling variables
(Γ𝑡 and 𝑑 𝑡 ) and add the corresponding consistency constraints, as highlighted
in the following problem statement:
minimize      f
by varying    α, b, θ, t, Γᵗ, dᵗ
subject to    L = W
              2.5|σ| − σ_yield ≤ 0
              Γᵗ − Γ = 0
              dᵗ − d = 0
while solving A(dᵗ)Γ − θ(dᵗ, α) = 0
              K d − q(Γᵗ) = 0
for           Γ, d.
The aerodynamic and structural models are solved independently. The aerody-
namic solver finds Γ for the 𝑑 𝑡 given by the optimizer, and the structural solver
finds 𝑑 for the given Γ𝑡 .
When using gradient-based optimization, we do not require coupled
derivatives, but we do need the derivatives of each model with respect to both
state variables. The derivatives of the consistency constraints are just a unit
matrix when taken with respect to the variable copies and are zero otherwise.
coupling variables, and the design optimization problem for the design
variables. All that is required from the model is the computation
of residuals. Because the optimizer is controlling all these variables,
SAND is also known as a full-space approach. SAND can be stated as
follows:

minimize   f(x, û, u)
by varying x, û, u                         (13.31)
subject to g(x, û) ≤ 0
           r(x, û, u) = 0.
Here, we use the representation shown in Fig. 13.7, so there are two
sets of explicit functions that convert the input coupling variables of
the component. The SAND architecture is also applicable to single
components, in which case there are no coupling variables. The XDSM
for SAND is shown in Fig. 13.36.
Fig. 13.36 The SAND architecture lets the optimizer solve for all variables (design, coupling, and state variables), and component solvers are no longer needed.

Because it solves for all variables simultaneously, the SAND architecture can be the most efficient way to get to the optimal solution. In practice, however, it is unlikely that this is advantageous when efficient component solvers are available.
The resulting optimization problem is the largest of all MDO archi-
tectures and requires an optimizer that scales well with the number
of variables. Therefore, a gradient-based optimization algorithm is
likely required, in which case the derivative computation must also
be considered. Fortunately, SAND does not require derivatives of the
coupled system or even total derivatives that account for the component
solution; only partial derivatives of residuals are needed.
SAND is an intrusive approach because it requires access to residuals.
For the SAND approach, we do away completely with the solvers and let
the optimizer find the states. The resulting problem is as follows:
minimize 𝑓
by varying 𝛼, 𝑏, 𝜃, 𝑡, Γ, 𝑑
subject to 𝐿=𝑊
2.5|𝜎| − 𝜎yield ≤ 0
𝐴Γ − 𝜃 = 0
𝐾𝑑 − 𝑞 = 0.
Instead of being solved separately, the models are now solved by the optimizer.
When using gradient-based optimization, the required derivatives are just
partial derivatives of the residuals (the same partial derivatives we would use
for an implicit analytic method).
Distributed architectures distinguish between design variables that affect only one component (called local design variables) and design variables that affect two or more components directly (called shared design variables).41 We denote the vector of design variables local to component i by xᵢ and the shared variables by x₀. The full vector of design variables is given by concatenating the shared and local design variables into a single vector x = [x₀, x₁, …, x_m], where m is the number of components.
If a constraint can be computed using a single component and satisfied by varying only the local design variables for that component, it is a local constraint; otherwise, it is nonlocal. Similarly to the design variables, we concatenate the constraints as g = [g₀, g₁, …, g_m]. The same distinction could be applied to the objective function, but we do not usually do this.
41. Martins and Lambe, Multidisciplinary design optimization: A survey of architectures, 2013.
The MDO problem representation we use here is shown in Fig. 13.37 for a general three-component system. We use the functional form introduced in Section 13.2.3, where the states in each component are hidden. In this form, the system level only has access to the outputs of each solver, which are the coupling variables and functions of interest.
The set of constraints is also split into shared constraints and local ones. Local constraints are computed by the corresponding component and depend only on the variables available in that component. Shared constraints depend on more than one set of coupling variables. These dependencies are also shown in Fig. 13.37.
target values of the coupling and shared design variables. These target
values are then shared with all disciplines during every iteration of
the solution procedure. The complete independence of disciplinary
subproblems combined with the simplicity of the data-sharing protocol
makes this architecture attractive for problems with a small amount of
shared data.
The system-level subproblem modifies the original problem as
follows: (1) local constraints are removed, (2) target coupling variables,
𝑢ˆ 𝑡 , are added as design variables, and (3) a consistency constraint is
added. This optimization problem can be written as follows:
minimize   f(x₀, x₁ᵗ, …, x_mᵗ, ûᵗ)
by varying x₀, x₁ᵗ, …, x_mᵗ, ûᵗ
subject to g₀(x₀, x₁ᵗ, …, x_mᵗ, ûᵗ) ≤ 0
           Jᵢ* = ‖x₀ᵢᵗ − x₀‖₂² + ‖xᵢᵗ − xᵢ‖₂² + ‖ûᵢᵗ − ûᵢ(x₀ᵢᵗ, xᵢᵗ, ûⱼ≠ᵢᵗ)‖₂² = 0   for i = 1, …, m,        (13.32)

where x₀ᵢᵗ are copies of the shared design variables that are passed to discipline i, and xᵢᵗ are copies of the local design variables passed to the system subproblem.
The constraint function 𝐽𝑖∗ is a measure of the inconsistency between
the values requested by the system-level subproblem and the results
from the discipline 𝑖 subproblem. The disciplinary subproblems do not
include the original objective function. Instead, the objective of each
subproblem is to minimize the inconsistency function.
The discipline i subproblem can be stated as

minimize      Jᵢ(x₀ᵢᵗ, xᵢ; ûᵢ)
by varying    x₀ᵢᵗ, xᵢ
subject to    gᵢ(x₀ᵢᵗ, xᵢ; ûᵢ) ≤ 0
while solving rᵢ(ûᵢ; x₀ᵢᵗ, xᵢ, ûⱼ≠ᵢᵗ) = 0
for           ûᵢ.
These subproblems are independent of each other and can be solved
in parallel. Thus, the system-level subproblem is responsible for
minimizing the design objective, whereas the discipline subproblems
minimize system inconsistency while satisfying local constraints.
The CO problem statement has been shown to be mathematically
equivalent to the original MDO problem.212 There are two versions of 212. Braun and Kroo, Development and
the CO architecture: CO1 and CO2 . Here, we only present the CO2 application of the collaborative optimization
architecture in a multidisciplinary design
version. The XDSM for CO is shown in Fig. 13.38 and the procedure is environment, 1997.
detailed in Alg. 13.5.
Fig. 13.38 Diagram for the CO architecture.

CO has the organizational advantage of having entirely separate disciplinary subproblems. This is desirable when designers in each discipline want to maintain some autonomy. However, the CO formulation also has significant drawbacks.

Algorithm 13.5 CO
Inputs:
  x: Initial design variables
Outputs:
  x*: Optimal variables
  f*: Optimal objective value
  g*: Optimal constraint values
Here, the structural optimization minimizes the discrepancy between the span
wanted by the structures (a decrease) versus what the system level requests
(which takes into account the opposite trend from aerodynamics). The structural
subproblem is fully responsible for satisfying the stress constraints by changing
the structural sizing 𝑡, which are local variables.
[XDSM diagram for the ATC architecture: a penalty weight update loop around the system-level optimization (with system and penalty functions), the discipline-level optimizations (with discipline and penalty functions), and the discipline solvers.]
by varying    x₀ᵢᵗ, xᵢ                                   (13.35)
subject to    gᵢ(x₀ᵢᵗ, xᵢ; ûᵢ) ≤ 0
while solving rᵢ(ûᵢ; x₀ᵢᵗ, xᵢ, ûⱼ≠ᵢᵗ) = 0
for           ûᵢ.
The most common penalty functions used in ATC are quadratic
penalty functions (see Section 5.4.1). Appropriate penalty weights are
important for multidisciplinary consistency and convergence.
Fig. 13.40 Diagram for the BLISS architecture.

13.5.4 Asymmetric Subspace Optimization
Asymmetric subspace optimization (ASO) is a distributed MDF archi-
tecture motivated by cases where there is a large discrepancy between
the cost of the disciplinary solvers. The cheaper disciplinary analyses
are replaced by disciplinary design optimizations inside the overall
MDA to reduce the number of more expensive disciplinary analyses.
The system-level optimization subproblem is as follows:
minimize      f(x; û)
by varying    x₀, x_k
subject to    g₀(x; û) ≤ 0
              g_k(x; û_k) ≤ 0 for all k,        (13.38)
while solving r_k(û_k; x_k, ûⱼ≠ₖ) = 0
for           û_k.
minimize      f(x; û)
by varying    xᵢ
subject to    gᵢ(x₀, xᵢ; ûᵢ) ≤ 0                  (13.39)
while solving rᵢ(ûᵢ; xᵢ, ûⱼ≠ᵢ) = 0
for           ûᵢ.
Fig. 13.41 Diagram for the ASO architecture.

For a gradient-based system-level optimizer, the gradients of the objective and constraints must take into account the suboptimization.
This requires coupled post-optimality derivative computation, which
increases computational and implementation time costs compared
with a normal coupled derivative computation. The total optimization
cost is only competitive with MDF if the discrepancy between each
disciplinary solver is high enough.
Applying ASO to the wing problem, the system-level problem is as follows:

minimize      f
by varying    b, θ
subject to    L − W* = 0
while solving A(d*)Γ − θ(d*) = 0
for           Γ,

where W* and d* correspond to values obtained from the structural suboptimization. The suboptimization is formulated as follows:
minimize 𝑓
by varying 𝑡
subject to 2.5|𝜎| − 𝜎yield ≤ 0
while solving 𝐾𝑑 − 𝑞 = 0
for 𝑑.
Similar to the sequential optimization, we could replace 𝑓 with 𝑊 in the
suboptimization because the other parameters in 𝑓 are fixed. To solve the
system-level problem with a gradient-based optimizer, we would need post-
optimality derivatives of 𝑊 ∗ with respect to span and Γ.
13.6 Summary

Fig. 13.42 Classification of MDO architectures: monolithic architectures (MDF/MAUD, IDF, SAND) and distributed architectures, which include distributed MDF approaches (BLISS, CSSO, MDOIS, ASO) and distributed IDF approaches, either multilevel (e.g., CO, QSD) or penalty-based (e.g., IPD/EPD, ECO).
The distributed architectures can be divided according to whether or not they enforce multidisciplinary feasibility (through an MDA of the whole system), as shown in Fig. 13.42. Distributed MDF architectures enforce multidisciplinary feasibility through an MDA. The distributed IDF architectures are like IDF in that no MDA is required. However, they must ensure multidisciplinary feasibility in some other way. Some do this by formulating an appropriate multilevel optimization (such as CO), and others use penalties to ensure this (such as ATC).*
Several commercial MDO frameworks are available, including Isight/SEE218 by Dassault Systèmes, ModelCenter/CenterLink by Phoenix Integration, modeFRONTIER by Esteco, AML Suite by TechnoSoft, Optimus by Noesis Solutions, and VisualDOC by Vanderplaats Research and Development.219 These frameworks focus on making it easy for users to couple multiple disciplines and use the optimization algorithms through graphical user interfaces. They also provide convenient wrappers to popular commercial engineering tools. Typically, these frameworks use fixed-point iteration to converge the MDA. When derivatives are needed for a gradient-based optimizer, finite-difference approximations are used rather than more accurate analytic derivatives.
* Martins and Lambe41 describe all of these MDO architectures in detail.
41. Martins and Lambe, Multidisciplinary design optimization: A survey of architectures, 2013.
218. Golovidov et al., Flexible implementation of approximation concepts in an MDO framework, 1998.
219. Balabanov et al., VisualDOC: A software system for general purpose integration and design optimization, 2002.
Problems
13.3 Consider the DSMs that follow. For each case, what is the lowest
number of feedback loops you can achieve through reordering?
What hierarchy of solvers would you recommend to solve the
coupled problem for each case?
[The three DSMs, (a), (b), and (c), referenced by this problem are not reproduced here.]
13.4 Consider the “spaghetti” diagram shown in Fig. 13.43. Draw the
equivalent DSM for these dependencies. How can you exploit
the structure in these dependencies? What hierarchy of solvers
would you recommend to solve a coupled system with these
dependencies?
C_L = C_L0 − C_L,θ θ,

where θ is the angle of deflection at the wing tip. Use C_L0 = 0.4 and C_L,θ = 0.1 rad⁻¹. The deflection also depends on the lift. We compute θ assuming the uniform lift distribution and using the simple beam bending theory as

θ = (L/b)(b/2)³ / (6EI) = L b² / (48 E I).

The Young's modulus is E = 70 GPa. Use the H-shaped cross-section described in Prob. 5.17 to compute the second moment of inertia, I.

Fig. 13.44 The aerostructural model couples aerodynamics and structures through lift and wing deflection.
We add the flight speed 𝑣 to the set of design variables and
handle 𝐿 = 𝑊 as a constraint. The objective of the aerostructural
optimization problem is to minimize the power with respect to
𝑥 = [𝑏, 𝑐, 𝑣], subject to 𝐿 = 𝑊.
Solve this problem using MDF, IDF, and a distributed MDO
architecture. Compare the aerostructural optimal solution with
the original solution from Appendix D.1.6 and discuss your
results.
A Mathematics Background

This appendix briefly reviews various mathematical concepts used throughout the book.
f(x + Δx) = a₀ + a₁Δx + a₂Δx² + … + a_kΔx^k + … .        (A.1)

a_k = f⁽ᵏ⁾(x) / k! .        (A.3)

Substituting this into the polynomial in Eq. A.1 yields the Taylor series

f(x + Δx) = Σ_{k=0}^{∞} f⁽ᵏ⁾(x) Δx^k / k! .        (A.4)

f(x + Δx) = Σ_{k=0}^{m} f⁽ᵏ⁾(x) Δx^k / k! + 𝒪(Δx^{m+1}) ,        (A.5)
…use Taylor series expansions of this function about x = 0, we get

f(Δx) = −4 + Δx + 2Δx² − (1/6)Δx⁴ + (1/180)Δx⁶ − … .

Four different truncations of this series are plotted and compared to the exact function in Fig. A.1.
Fig. A.1 Taylor series expansions for a one-dimensional example. The more terms we consider from the Taylor series, the better the approximation.

The Taylor series in multiple dimensions is similar to the single-variable case but more complicated. The first derivative of the function becomes a gradient vector, and the second derivatives become a Hessian matrix. Also, we need to define a direction along which we want to approximate the function because that information is not inherent like it is in a one-dimensional function. The Taylor series expansion in n dimensions along a direction p can be written as

f(x + αp) = f(x) + α Σ_{k=1}^{n} (∂f/∂x_k) p_k + (1/2) α² Σ_{k=1}^{n} Σ_{l=1}^{n} (∂²f/∂x_k∂x_l) p_k p_l + 𝒪(α³) ,        (A.6)

where α is a scalar that determines how far to go in the direction p. In matrix form, we can write

f(x + αp) = f(x) + α ∇f(x)ᵀ p + (1/2) α² pᵀ H(x) p + 𝒪(α³) ,        (A.7)

where H is the Hessian matrix.
f(x₁, x₂) = (1 − x₁)² + (1 − x₂)² + (1/2)(2x₂ − x₁²)² .

Performing a Taylor series expansion about x = [0, −2], we get

f(x + αp) = 18 + α [−2  −14] p + (1/2) α² pᵀ [[10, 0], [0, 6]] p .

The original function, the linear approximation, and the quadratic approximation are compared in Fig. A.2.
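As a quick numerical check of this example, the sketch below evaluates the function and compares it with the linear and quadratic Taylor approximations along an arbitrary direction; the direction and step sizes are arbitrary choices made only for illustration.

```python
import numpy as np

def f(x):
    x1, x2 = x
    return (1 - x1) ** 2 + (1 - x2) ** 2 + 0.5 * (2 * x2 - x1 ** 2) ** 2

x0 = np.array([0.0, -2.0])
p = np.array([1.0, 1.0]) / np.sqrt(2.0)    # an arbitrary unit direction
grad = np.array([-2.0, -14.0])             # gradient from the example
H = np.array([[10.0, 0.0],
              [0.0, 6.0]])                 # Hessian from the example

for alpha in (0.1, 0.5, 1.0):
    exact = f(x0 + alpha * p)
    linear = f(x0) + alpha * grad @ p
    quadratic = linear + 0.5 * alpha ** 2 * (p @ H @ p)
    print(f"alpha = {alpha:3.1f}: exact = {exact:8.4f}, "
          f"linear = {linear:8.4f}, quadratic = {quadratic:8.4f}")
```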
d/dx f(g(x)) = (df/dg)(dg/dx) .        (A.8)

d/dx f(g(x), h(x)) = (∂f/∂g)(dg/dx) + (∂f/∂h)(dh/dx) ,        (A.9)

df/dx = ∂f/∂x + (∂f/∂y)(dy/dx)
      = 2x + 2y cos(x)
      = 2x + 2 sin(x) cos(x) .
Notice that the partial derivative and total derivative are quite different. For this
simple case, we could also find the total derivative by direct substitution and
then using an ordinary one-dimensional derivative. Substituting 𝑦(𝑥) = sin(𝑥)
directly into the original expression for 𝑓 gives
d/dx f(g(x), h(x)) = (∂f/∂g)(dg/dx) + (∂f/∂h)(dh/dx)
                   = 2g h³ (dg/dx) + g² 3h² (dh/dx)
                   = −2g h³ sin(x) + g² 3h² cos(x)
                   = −2 cos(x) sin⁴(x) + 3 cos³(x) sin²(x) .
dy = f′(x) dx ,        (A.11)
df = f′(x) dx .        (A.12)

We can solve Ex. A.5 using differentials as follows. Taking the definition of each function, we write their differentials,

df/dx = −2 cos(x) sin⁴(x) + 3 cos³(x) sin²(x) .
x² + y² = r² .

2x dx + 2y dy = 2r dr .

Say we want to find the slope of the tangent of a circle with a fixed radius. Then, dr = 0, and we can solve for the derivative dy/dx as follows:

2x dx + 2y dy = 0   ⇒   dy/dx = −x/y .
Another interpretation of this derivative is that it is the first-order change in
𝑦 with respect to a change in 𝑥 subject to the constraint of staying on a circle
(keeping a constant 𝑟). Similarly, we could find the derivative of 𝑥 with respect
to 𝑦 as d𝑥/d𝑦 = −𝑦/𝑥. Furthermore, we can find relationships between any
derivative involving 𝑟, 𝑥, or 𝑦.
C_ij = Σ_{k=1}^{n} A_ik B_kj ,        (A.13)

Two matrices can be multiplied only if their inner dimensions are equal (n in this case). The remaining products discussed in this section are just special cases of matrix multiplication, but they are common enough that we discuss them separately.
Fig. A.3 Matrix product and resulting size (C_ij = A_i∗ B_∗j).
* In this notation, m is the number of rows and n is the number of columns.
α = uᵀv = [u₁ u₂ … u_n][v₁; v₂; …; v_n] = Σ_{i=1}^{n} uᵢvᵢ ,        (A.14)

where the dimensions are (1 × 1) = (1 × n)(n × 1).
Fig. A.4 Dot (or inner) product of two vectors.
The order of multiplication is irrelevant, and therefore, uᵀv = vᵀu.
v = A u ,        (A.20)

where the dimensions are (m × 1) = (m × n)(n × 1) and each row of the result is vᵢ = Aᵢ∗ u. Alternatively, the product can be written as a linear combination of the columns of A,

v = A∗₁ u₁ + A∗₂ u₂ + … + A∗ₙ uₙ ,        (A.21)

where A∗ⱼ are the columns of A.
Fig. A.6 Matrix-vector product.
We can also multiply by a vector on the left, instead of on the right:

vᵀ = uᵀ A .        (A.22)
Fig. A.7 The four fundamental subspaces of linear algebra. An (m × n) matrix A maps vectors from n-space to m-space. When the vector is in the row space of the matrix, it maps to the column space of A (x_r → b). When the vector is in the nullspace of A, it maps to zero (x_n → 0). Combining the row space and nullspace of A, we can obtain any vector in n-dimensional space (x = x_r + x_n), which maps to the column space of A (x → b).
Because this norm is used so often, we often omit the subscript and just write ‖x‖. In this book, we sometimes use the square of the 2-norm, which can be written as the dot product,

‖x‖₂² = xᵀx .        (A.26)

Several norms for matrices exist. There are matrix norms similar to the vector norms that we defined previously. Namely,

‖A‖₁ = max_{1≤j≤n} Σ_{i=1}^{n} |A_ij|
‖A‖₂ = (λ_max(AᵀA))^{1/2}        (A.32)
‖A‖∞ = max_{1≤i≤n} Σ_{j=1}^{n} |A_ij| ,
Another matrix norm that is useful but not related to any vector norm is the Frobenius norm, which is defined as the square root of the sum of the absolute squares of its elements, that is,

‖A‖_F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} A_ij² )^{1/2} .        (A.34)
Note that

(Aᵀ)ᵀ = A
(A + B)ᵀ = Aᵀ + Bᵀ        (A.38)
(AB)ᵀ = BᵀAᵀ .

A symmetric matrix is one where the matrix is equal to its transpose:

Aᵀ = A   ⇒   A_ij = A_ji .        (A.39)
Not all matrices are invertible. Some common properties for inverses are as follows:

(A⁻¹)⁻¹ = A
(AB)⁻¹ = B⁻¹A⁻¹        (A.41)
(Aᵀ)⁻¹ = (A⁻¹)ᵀ .
A symmetric matrix A is positive definite if and only if

xᵀAx > 0        (A.42)

for all nonzero vectors x. An equivalent test is to check the determinants of the leading principal submatrices: first check that det(A₁) > 0 (where A₁ is the submatrix consisting of only one element), then check that det(A₂) > 0, and so on, until det(Aₙ) > 0. Suppose any of the determinants in this sequence is not positive. In that case, we can stop the process and conclude that A is not positive definite.
A positive-semidefinite matrix satisfies

xᵀAx ≥ 0        (A.44)

for all nonzero vectors x. In this case, the eigenvalues are nonnegative, and there is at least one that is zero. A negative-definite matrix satisfies

xᵀAx < 0        (A.45)

for all nonzero vectors x. In this case, all the eigenvalues are negative. An indefinite matrix is one that is neither positive definite nor negative definite. Then, there are at least two nonzero vectors x and y such that xᵀAx > 0 > yᵀAy.
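A minimal sketch of these tests in Python is shown below, checking positive definiteness both through the eigenvalues and through the determinants of the leading principal submatrices; the example matrices are arbitrary.

```python
import numpy as np

def is_positive_definite(A, tol=0.0):
    """Check positive definiteness of a symmetric matrix two ways."""
    # 1. All eigenvalues must be positive.
    eig_test = bool(np.all(np.linalg.eigvalsh(A) > tol))
    # 2. All leading principal submatrices must have positive determinant.
    minor_test = all(np.linalg.det(A[:k, :k]) > tol
                     for k in range(1, A.shape[0] + 1))
    return eig_test, minor_test

A = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])     # a classic positive-definite matrix
print(is_positive_definite(A))       # (True, True)

B = np.array([[1.0, 2.0],
              [2.0, 1.0]])           # indefinite: eigenvalues 3 and -1
print(is_positive_definite(B))       # (False, False)
```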
f(x) = aᵀx + b = Σ_{i=1}^{n} (aᵢxᵢ + bᵢ) ,        (A.47)

where a, x, and b are vectors of length n, and aᵢ, xᵢ, and bᵢ are the ith elements of a, x, and b, respectively. If we take the partial derivative of each element with respect to an arbitrary element of x, namely, x_k, we get

∂/∂x_k [ Σ_{i=1}^{n} (aᵢxᵢ + bᵢ) ] = a_k .        (A.48)

Thus,

∇_x(aᵀx + b) = a .        (A.49)
Recall the quadratic form presented in Appendix A.3.3; we can combine that with a linear term to form a general quadratic function:

f(x) = xᵀAx + bᵀx + c ,        (A.50)

We now move the diagonal terms back into the sums to get

∂f/∂x_k = b_k + Σ_{j=1}^{n} (xⱼ a_jk + a_kj xⱼ) ,        (A.54)

∇_x f(x) = Aᵀx + Ax + b .        (A.55)

∇_x(xᵀAx + bᵀx + c) = 2Ax + b .        (A.56)
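The following short sketch verifies Eq. A.56 numerically for an arbitrary symmetric A by comparing the analytic gradient with a central finite-difference estimate.

```python
import numpy as np

def quad(x, A, b, c):
    return x @ A @ x + b @ x + c

# Symmetric A, so the gradient should be 2 A x + b (Eq. A.56).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
c = 0.5
x = np.array([0.7, -1.3])

analytic = 2 * A @ x + b

# Central-difference check of the gradient.
h = 1e-6
fd = np.zeros_like(x)
for i in range(x.size):
    e = np.zeros_like(x)
    e[i] = h
    fd[i] = (quad(x + e, A, b, c) - quad(x - e, A, b, c)) / (2 * h)

print(analytic, fd)   # the two should agree to about six digits
```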
This is actually a sample mean, which would differ from the population mean (the true mean if you could measure every bar). With enough samples, the sample mean approaches the population mean. In this brief review, we do not distinguish between sample and population statistics.
Another important quantity is the variance or standard deviation. This is a measure of spread, or how far away our samples are from the mean. The unbiased* estimate of the variance is
* Unbiased means that the expected value of the sample variance is the same as the true population variance. If n were used in the denominator instead of n − 1, then the two quantities would differ by a constant.
σ_x² = (1/(n − 1)) Σ_{i=1}^{n} (xᵢ − μ_x)² ,        (A.60)

and the standard deviation is just the square root of the variance. A small variance implies that measurements are clustered tightly around the mean, whereas a large variance means that measurements are spread out far from the mean. The variance can also be written in the following mathematically equivalent but more computationally friendly format:

σ_x² = (1/(n − 1)) ( Σ_{i=1}^{n} xᵢ² − n μ_x² ) .        (A.61)
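As a sketch, the two expressions for the sample variance can be compared on arbitrary data as follows; NumPy's ddof=1 option gives the same unbiased estimate.

```python
import numpy as np

x = np.array([2.1, 2.4, 1.9, 2.2, 2.0, 2.3])   # arbitrary sample data
n = x.size
mu = x.mean()

var_definition = np.sum((x - mu) ** 2) / (n - 1)            # Eq. A.60
var_shortcut = (np.sum(x ** 2) - n * mu ** 2) / (n - 1)     # Eq. A.61

print(mu, var_definition, var_shortcut, np.var(x, ddof=1))  # all three agree
```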
The total integral of the PDF must be 1 because it contains all possible outcomes (100 percent):

∫_{−∞}^{∞} p(x) dx = 1 .        (A.63)

From the PDF, we can also measure various statistics, such as the mean value:

μ_x = E(x) = ∫_{−∞}^{∞} x p(x) dx .        (A.64)
The mean and variance are the first and second moments of the
distribution. In general, a distribution may require an infinite number
of moments to describe it fully. Higher-order moments are generally
mean centered and are normalized by the standard deviation so that
the 𝑛th normalized moment is computed as follows:
E[ ((x − μ_x)/σ)ⁿ ] .        (A.68)
The third moment is called skewness, and the fourth is called kurtosis,
although these higher-order moments are less commonly used.
The cumulative distribution function (CDF) is related to the PDF: it is the cumulative integral of the PDF and is defined as follows:

P(x) = ∫_{−∞}^{x} p(t) dt .        (A.69)
The capital 𝑃 denotes the CDF, and the lowercase 𝑝 denotes the PDF.
As an example, the PDF and corresponding CDF for the axial strength
are shown in Fig. A.10. The CDF always approaches 1 as 𝑥 → ∞.
Fig. A.10 Comparison between the PDF and CDF for a simple example: the PDF and CDF for the axial strength of a rod.
For a normal distribution, the mean and variance are visible in the function, but these quantities are defined for any distribution. Figure A.11 shows two normal distributions with different means and standard deviations (μ = 1, σ = 0.5 and μ = 3, σ = 1.0).

Fig. A.12 Popular probability distributions besides the normal distribution, including the lognormal and exponential distributions.
corr(x, y) = cov(x, y) / (σ_x σ_y) .        (A.73)
B Linear Solvers

In Section 3.6, we present an overview of solution methods for discretized systems of equations, followed by an introduction to Newton-based methods for solving nonlinear equations. Here, we review the solvers for linear systems required to solve for each step of Newton-based methods.*
* Trefethen and Bau III220 provides a detailed treatment of this subject.
220. Trefethen and Bau III, Numerical Linear Algebra, 1997.

B.1 Systems of Linear Equations
Au = b ,        (B.1)

r(u) = Au − b = 0 .        (B.2)
B.2 Conditioning

…time, starting with the first one and progressing from left to right. This is done by subtracting multiples of each row from subsequent rows.
Fig. B.1 LU factorization.
Inputs:
  A: Nonsingular square matrix
  b: A vector
Outputs:
  u: Solution to Au = b

Perform forward substitution to solve Ly = b for y:
  y₁ = b₁ / L₁₁ ,    yᵢ = (1/Lᵢᵢ) ( bᵢ − Σ_{j=1}^{i−1} Lᵢⱼ yⱼ )   for i = 2, …, n
Perform backward substitution to solve Uu = y for u:
  uₙ = yₙ / Uₙₙ ,    uᵢ = (1/Uᵢᵢ) ( yᵢ − Σ_{j=i+1}^{n} Uᵢⱼ uⱼ )   for i = n − 1, …, 1
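A minimal Python sketch of these substitution steps is shown below; the factorization itself is delegated to scipy.linalg.lu (which returns A = P L U), and the test matrix is arbitrary.

```python
import numpy as np
from scipy.linalg import lu

def lu_solve(A, b):
    """Solve A u = b using an LU factorization and the two substitutions."""
    P, L, U = lu(A)            # SciPy returns A = P @ L @ U
    n = len(b)
    bp = P.T @ b               # apply the row permutation to the right-hand side

    # Forward substitution: solve L y = bp.
    y = np.zeros(n)
    for i in range(n):
        y[i] = (bp[i] - L[i, :i] @ y[:i]) / L[i, i]

    # Backward substitution: solve U u = y.
    u = np.zeros(n)
    for i in range(n - 1, -1, -1):
        u[i] = (y[i] - U[i, i + 1:] @ u[i + 1:]) / U[i, i]
    return u

A = np.array([[4.0, -2.0, 1.0],
              [-2.0, 4.0, -2.0],
              [1.0, -2.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
print(lu_solve(A, b), np.linalg.solve(A, b))   # the two should match
```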
Although direct methods are usually more efficient and robust, iterative
methods have several advantages:
𝑢 𝑘+1 = 𝑢 𝑘 + 𝑀 −1 (𝑏 − 𝐴𝑢 𝑘 ) . (B.10)
𝑟 (𝑢 𝑘 ) = 𝑏 − 𝐴𝑢 𝑘 , (B.11)
we can write
𝑢 𝑘+1 = 𝑢 𝑘 + 𝑀 −1 𝑟 (𝑢 𝑘 ) . (B.12)
The splitting matrix 𝑀 is fixed and constructed so that it is easy to
invert. The closer 𝑀 −1 is to the inverse of 𝐴, the better the iterations
work. We now introduce three stationary methods corresponding to
three different splitting matrices.
The Jacobi method consists of setting 𝑀 to be a diagonal matrix 𝐷,
where the diagonal entries are those of 𝐴. Then,
𝑢 𝑘+1 = 𝑢 𝑘 + 𝐷 −1 𝑟 (𝑢 𝑘 ) . (B.13)
uᵢ⁽ᵏ⁺¹⁾ = (1/Aᵢᵢ) ( bᵢ − Σ_{j=1, j≠i}^{n_u} Aᵢⱼ uⱼ⁽ᵏ⁾ ) ,   i = 1, …, n_u .        (B.14)
Using this method, the components of u_{k+1} are independent of one another at a given iteration; they depend only on the previous iteration values, u_k, and can therefore be computed in parallel.
The Gauss–Seidel method is obtained by setting M to be the lower triangular portion of A and can be written as

uᵢ⁽ᵏ⁺¹⁾ = (1/Aᵢᵢ) ( bᵢ − Σ_{j<i} Aᵢⱼ uⱼ⁽ᵏ⁺¹⁾ − Σ_{j>i} Aᵢⱼ uⱼ⁽ᵏ⁾ ) ,   i = 1, …, n_u .        (B.16)
Unlike the Jacobi iterations, a Gauss–Seidel iteration cannot be per-
formed in parallel because of the terms where 𝑗 < 𝑖, which require
the latest values. Instead, the states must be updated sequentially.
However, the advantage of Gauss–Seidel is that it generally converges
faster than Jacobi iterations.
uᵢ⁽ᵏ⁺¹⁾ = (1 − ω) uᵢ⁽ᵏ⁾ + (ω/Aᵢᵢ) ( bᵢ − Σ_{j<i} Aᵢⱼ uⱼ⁽ᵏ⁺¹⁾ − Σ_{j>i} Aᵢⱼ uⱼ⁽ᵏ⁾ ) ,   i = 1, …, n_u .        (B.18)

With the correct value of ω, SOR converges faster than Gauss–Seidel.
Example B.1 Iterative methods applied to a simple linear system.

Suppose we have the following linear system of two equations:

[ 2  −1 ] [u₁]   [0]
[−2   3 ] [u₂] = [1] .

This corresponds to the two lines shown in Fig. B.3, where the solution is at their intersection. Applying the Jacobi iteration (Eq. B.14),

u₁⁽ᵏ⁺¹⁾ = (1/2) u₂⁽ᵏ⁾
u₂⁽ᵏ⁺¹⁾ = (1/3) (1 + 2u₁⁽ᵏ⁾) .

Starting with the guess u⁽⁰⁾ = (2, 1), we get the iterations shown in Fig. B.3. The Gauss–Seidel iteration (Eq. B.16) is similar, where the only change is that the second equation uses the latest state from the first one:

u₁⁽ᵏ⁺¹⁾ = (1/2) u₂⁽ᵏ⁾
u₂⁽ᵏ⁺¹⁾ = (1/3) (1 + 2u₁⁽ᵏ⁺¹⁾) .

SOR converges even faster for the right values of ω. The result shown here is for ω = 1.2.

Fig. B.3 Jacobi, Gauss–Seidel, and SOR iterations.
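The iterations of this example can be reproduced with the short sketch below, which applies Eqs. B.14 and B.16 directly to the 2 × 2 system; the iteration counts are arbitrary.

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-2.0, 3.0]])
b = np.array([0.0, 1.0])

def jacobi(u, iters=25):
    for _ in range(iters):
        # Both components use only values from the previous iteration.
        u = np.array([(b[0] + u[1]) / 2.0,
                      (b[1] + 2.0 * u[0]) / 3.0])
    return u

def gauss_seidel(u, iters=25):
    u = u.copy()
    for _ in range(iters):
        u[0] = (b[0] + u[1]) / 2.0         # latest u2
        u[1] = (b[1] + 2.0 * u[0]) / 3.0   # uses the just-updated u1
    return u

u0 = np.array([2.0, 1.0])
print(jacobi(u0), gauss_seidel(u0), np.linalg.solve(A, b))
```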
∇ 𝑓 (𝑢) = 𝐴𝑢 − 𝑏 . (B.21)
Thus, the gradient of the quadratic is the residual of the linear system,
𝑟 𝑘 = ∇ 𝑓 (𝑢 𝑘 ) . (B.22)
= (1/2) ( Σ_{k=0}^{n−1} α_k p_k )ᵀ A ( Σ_{k=0}^{n−1} α_k p_k ) − bᵀ ( Σ_{k=0}^{n−1} α_k p_k )        (B.24)
= (1/2) Σ_{k=0}^{n−1} Σ_{j=0}^{n−1} α_k α_j p_kᵀ A p_j − Σ_{k=0}^{n−1} α_k bᵀ p_k .
(1/2) Σ_{k=0}^{n−1} Σ_{j=0}^{n−1} α_k α_j p_kᵀ A p_j = (1/2) Σ_{k=0}^{n−1} α_k² p_kᵀ A p_k .        (B.26)
Because each term in this sum involves only one direction 𝑝 𝑘 , we have
reduced the original problem to a series of one-dimensional quadratic
functions that can be minimized one at a time. Each one-dimensional
problem corresponds to minimizing the quadratic with respect to the
step length 𝛼 𝑘 . Differentiating each term and setting it to zero yields
the following:
α_k p_kᵀ A p_k − bᵀ p_k = 0   ⇒   α_k = (bᵀ p_k) / (p_kᵀ A p_k) .        (B.28)
Now, the question is: How do we find this set of directions? There are many sets of directions that satisfy conjugacy. For example, the eigenvectors of A satisfy Eq. B.25.* However, it is costly to compute the eigenvectors of a matrix. We want a more convenient way to compute a sequence of conjugate vectors.
* Suppose we have two eigenvectors, v_k and v_j. Then v_kᵀ A v_j = v_kᵀ (λ_j v_j) = λ_j v_kᵀ v_j. This dot product is zero because the eigenvectors of a symmetric matrix are mutually orthogonal.
The conjugate gradient method sets the first direction to the steepest-descent direction of the quadratic at the first point. Because the gradient of the function is the residual of the linear system (Eq. B.22), this first direction is obtained from the residual at the starting point,

p₁ = −r(u₀) .        (B.29)
p_{k+1}ᵀ A p_k = 0 .        (B.31)

Substituting the new direction p_{k+1} with the update (Eq. B.30), we get

(−r_{k+1} + β_k p_k)ᵀ A p_k = 0 .        (B.32)

β_k = (r_{k+1}ᵀ A p_k) / (p_kᵀ A p_k) .        (B.33)
By setting this derivative to zero, we can get the step size that minimizes the quadratic along the line to be

α_k = − (r_kᵀ p_k) / (p_kᵀ A p_k) .        (B.35)

Here we have used the property of the conjugate directions stating that the residual vector is orthogonal to all previous conjugate directions, so that r_kᵀ pᵢ = 0 for i = 0, 1, …, k − 1.† Thus, we can now write,
† For a proof of this property, see Theorem …
The numerator of the expression for β (Eq. B.33) can also be written in terms of the residual alone. Using the expression for the residual (Eq. B.19) and taking the difference between two subsequent residuals, we get

r_{k+1}ᵀ A p_k = (1/α_k) (r_{k+1}ᵀ r_{k+1}) ,        (B.39)

where we have used the property that the residual at any conjugate gradient iteration is orthogonal to the residuals at all previous iterations, so r_{k+1}ᵀ r_k = 0.‡
‡ For a proof of this property, see Theorem …

β_k = (r_kᵀ r_k) / (r_{k−1}ᵀ r_{k−1}) .        (B.40)
We use this result in the nonlinear conjugate gradient method for
function minimization in Section 4.4.2.
The linear conjugate gradient steps are listed in Alg. B.2. The advantage of this method relative to the direct method is that A does not need to be stored or given explicitly. Instead, we only need to provide a function that computes matrix-vector products with A. These products are required to compute residuals (r = Au − b) and the Ap term in the computation of α. Assuming a well-conditioned problem with good enough arithmetic precision, the algorithm should converge to the solution in n steps.§
§ Because the linear conjugate gradient method converges in n steps, it was originally thought of as a direct method. It was initially dismissed in favor of more efficient direct methods, such as LU factorization. However, the conjugate gradient method was later reframed as an effective iterative method to obtain approximate solutions to large problems.

Algorithm B.2 Linear conjugate gradient
Inputs:
  u⁽⁰⁾: Starting point
  τ: Convergence tolerance
Outputs:
  u*: Solution of linear system
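A minimal Python sketch of the linear conjugate gradient method is shown below; it only requires a function that returns products Av, and the test matrix is an arbitrary symmetric positive-definite example.

```python
import numpy as np

def conjugate_gradient(matvec, b, u0, tol=1e-10, max_iter=None):
    """Linear conjugate gradient; only matrix-vector products with A are needed."""
    u = u0.copy()
    r = matvec(u) - b            # residual r = A u - b
    p = -r                       # first direction: steepest descent
    max_iter = max_iter or len(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        Ap = matvec(p)
        alpha = (r @ r) / (p @ Ap)
        u = u + alpha * p
        r_new = r + alpha * Ap
        beta = (r_new @ r_new) / (r @ r)   # ratio of residual norms (Eq. B.40)
        p = -r_new + beta * p
        r = r_new
    return u

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 2.0, 3.0])
u = conjugate_gradient(lambda v: A @ v, b, np.zeros(3))
print(u, np.linalg.solve(A, b))
```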
Other Krylov subspace methods, such as the generalized minimal residual (GMRES) method, do not have such restrictions on the matrix.
Compared with stationary methods of Appendix B.4.1, Krylov methods have the advantage that they use information gathered throughout the iterations. Instead of using a fixed splitting matrix, Krylov methods effectively vary the splitting so that M is changed at each iteration according to some criteria that use the information gathered so far. For this reason, Krylov methods are usually more efficient than stationary methods.
Like stationary iteration methods, Krylov methods do not require forming or storing A. Instead, the iterations require only matrix-vector products of the form Av, where v is some vector given by the Krylov algorithm. The matrix-vector product could be given by a black box, as shown in Fig. B.2.
220. Trefethen and Bau III, Numerical Linear Algebra, 1997.
For the linear conjugate gradient method (Appendix B.4.2), we
found conjugate directions and minimized the residual of the linear
system in a sequence of these directions.
Krylov subspace methods minimize the residual in a space,
𝑥 0 + 𝒦𝑘 , (B.41)
we do not need an explicit form for M. The matrix resulting from the product M⁻¹A should have a smaller condition number so that the new linear system is better conditioned.
In the extreme case where M = A, that means we have computed the inverse of A, and we can get x explicitly. In another extreme, M could be a diagonal matrix with the diagonal elements of A, which would scale A such that the diagonal elements are 1.‖
‖ The splitting matrix M we used in the equation for the stationary methods (Appendix B.4.1) is effectively a preconditioner. An M using the diagonal entries of A corresponds to the Jacobi method (Eq. B.13).
Krylov subspace solvers require three main components: (1) an orthogonal basis for the Krylov subspace, (2) an optimal property that determines the solution within the subspace, and (3) an effective preconditioner. Various Krylov subspace methods are possible, depending on the choice for each of these three components. One of the most popular Krylov subspace methods is the GMRES.221**
221. Saad and Schultz, GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems, 1986.
** GMRES and other Krylov subspace methods are available in most programming languages, including C/C++, Fortran, Julia, MATLAB, and Python.
C Quasi-Newton Methods

C.1 Broyden's Method
s_k = u_{k+1} − u_k ,        (C.2)
y_k = r_{k+1} − r_k ,        (C.3)
J̃ = J̃_k + vvᵀ ,        (C.5)
yields

vvᵀ = ( (y_k − J̃_k s_k) s_kᵀ ) / (s_kᵀ s_k) .        (C.7)

Substituting this result into the update (Eq. C.5), we get the Jacobian approximation update,

J̃_{k+1} = J̃_k + ( (y_k − J̃_k s_k) s_kᵀ ) / (s_kᵀ s_k) ,        (C.8)

where

y_k = r_{k+1} − r_k        (C.9)

is the difference in the function values (as opposed to the difference in the gradients used in optimization).
This update can be inverted using the Sherman–Morrison–Woodbury formula (Appendix C.3) to get the more useful update on the inverse of the Jacobian,

J̃_{k+1}⁻¹ = J̃_k⁻¹ + ( (s_k − J̃_k⁻¹ y_k) y_kᵀ ) / (y_kᵀ y_k) .        (C.10)

We can start with J̃₀⁻¹ = I. Similar to the Newton step (Eq. 3.30), the step in Broyden's method is given by solving the linear system. Because the inverse is provided explicitly, we can just perform the multiplication,

Δu_k = −J̃⁻¹ r_k .        (C.11)

u_{k+1} = u_k + Δu_k .        (C.12)
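The following Python sketch implements these updates (Eqs. C.10 to C.12) for a small hypothetical nonlinear system; the residual function and starting point are illustrative assumptions.

```python
import numpy as np

def broyden(residual, u0, tol=1e-10, max_iter=50):
    """Broyden's method using the inverse Jacobian update of Eq. C.10."""
    u = u0.copy()
    r = residual(u)
    J_inv = np.eye(len(u))                 # start with the identity as J~0^-1
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        du = -J_inv @ r                    # Eq. C.11
        u_new = u + du                     # Eq. C.12
        r_new = residual(u_new)
        s = u_new - u                      # Eq. C.2
        y = r_new - r                      # Eq. C.9
        J_inv = J_inv + np.outer(s - J_inv @ y, y) / (y @ y)   # Eq. C.10
        u, r = u_new, r_new
    return u

# Hypothetical two-equation nonlinear system used only for illustration.
def residual(u):
    return np.array([u[0] ** 2 + u[1] - 2.0,
                     u[0] + u[1] ** 2 - 2.0])

print(broyden(residual, np.array([0.5, 0.5])))   # converges to the root near (1, 1)
```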
We need the inverse version of the secant equation (Eq. 4.80), which is

Ṽ_{k+1} y_k = s_k .        (C.15)

Ṽ_k y_k + α s_k s_kᵀ y_k + β Ṽ_k y_k y_kᵀ Ṽ_k y_k = s_k .        (C.16)

Ṽ_{k+1} = Ṽ_k + (1/(y_kᵀ s_k)) s_k s_kᵀ − (1/(y_kᵀ Ṽ_k y_k)) Ṽ_k y_k y_kᵀ Ṽ_k .        (C.17)
C.2.2 BFGS
The BFGS update was informally derived in Section 4.4.4. As discussed
previously, obtaining an approximation of the Hessian inverse is a more
efficient way to get the quasi-Newton step.
Similar to DFP, BFGS was originally formally derived by analytically
solving an optimization problem. However, instead of solving the
optimization problem of Eq. C.13, we solve a similar problem using the
Hessian inverse approximation instead. This problem can be stated as
minimize   ‖Ṽ − Ṽ_k‖
subject to Ṽ y_k = s_k        (C.20)
           Ṽ = Ṽᵀ ,
𝑉˜ = 𝑉˜ 𝑘 + 𝛼𝑣𝑣 | , (C.23)
where we only need one self outer product to produce a rank 1 update
(as opposed to two).
Substituting the rank 1 update (Eq. C.23) into the secant equation,
we obtain
𝑉˜ 𝑘 𝑦 𝑘 + 𝛼𝑣𝑣 | 𝑦 𝑘 = 𝑠 𝑘 . (C.24)
Rearranging yields
𝛼𝑣 | 𝑦 𝑘 𝑣 = 𝑠 𝑘 − 𝑉˜ 𝑘 𝑦 𝑘 . (C.25)
Substituting Eqs. C.26 and C.28 into Eq. C.23, we get the SR1 update

Ṽ = Ṽ_k + (1/(s_kᵀ y_k − y_kᵀ Ṽ_k y_k)) (s_k − Ṽ_k y_k)(s_k − Ṽ_k y_k)ᵀ .        (C.29)

β_SR1 = −1 / (y_kᵀ s_k − y_kᵀ Ṽ_k y_k)        (C.31)
γ_SR1 = 1 / (y_kᵀ s_k − y_kᵀ Ṽ_k y_k) .

α_BFGS = 0 ,   β_BFGS = −1 / (y_kᵀ s_k) ,   γ_BFGS = 1 / (y_kᵀ s_k) + (y_kᵀ Ṽ_k y_k) / (y_kᵀ s_k)² .        (C.33)
The formal derivations of the DFP and BFGS methods use the Sherman–
Morrison–Woodbury formula (also known as the Woodbury matrix
identity). Suppose that the inverse of a matrix is known, and then the
matrix is perturbed. The Sherman–Morrison–Woodbury formula gives
the inverse of the perturbed matrix without having to re-invert the
perturbed matrix. We used this formula in Section 4.4.4 to derive the
quasi-Newton update.
One possible perturbation is a rank 1 update of the form
Â = A + uvᵀ ,        (C.34)

Â⁻¹ = A⁻¹ − (A⁻¹ u vᵀ A⁻¹) / (1 + vᵀ A⁻¹ u) .        (C.35)
This formula can be verified by multiplying Eq. C.34 and Eq. C.35,
which yields the identity matrix.
This formula can be generalized for higher-rank updates as follows:
𝐴ˆ = 𝐴 + 𝑈𝑉 | , (C.36)
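The rank 1 version of the formula (Eq. C.35) can be checked numerically with the short sketch below; the matrix and vectors are random and used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + n * np.eye(n)    # a well-conditioned matrix
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A_inv = np.linalg.inv(A)
A_hat = A + np.outer(u, v)                          # rank 1 perturbation (Eq. C.34)

# Sherman-Morrison formula for the inverse of the perturbed matrix (Eq. C.35).
A_hat_inv = A_inv - (A_inv @ np.outer(u, v) @ A_inv) / (1.0 + v @ A_inv @ u)

print(np.allclose(A_hat_inv, np.linalg.inv(A_hat)))   # True
```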
gradient-based optimizer:

f(x₁, x₂) = x₁² + x₂² − βx₁x₂ ,        (D.1)

…ing valley. The large difference between the maximum and minimum curvatures, and the fact that the principal curvature directions change along the valley, makes it a good test for quasi-Newton methods.
Fig. D.2 Rosenbrock function.
223. Rosenbrock, An automatic method for finding the greatest or least value of a function, 1960.
The Rosenbrock function can be extended to n dimensions by defining the sum

f(x) = Σ_{i=1}^{n−1} [ 100 (x_{i+1} − x_i²)² + (1 − x_i)² ] .        (D.3)
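A direct implementation of Eq. D.3 is sketched below; the test points are standard choices used only for illustration.

```python
import numpy as np

def rosenbrock(x):
    """n-dimensional Rosenbrock function (Eq. D.3); minimum at x = (1, ..., 1)."""
    x = np.asarray(x)
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

print(rosenbrock([1.0, 1.0, 1.0]))   # 0.0 at the minimum
print(rosenbrock([-1.2, 1.0]))       # 24.2, a common starting point
```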
from different points. There are saddle points, maxima, and minima, with one global minimum. This function, shown in Fig. D.4 along with…
Global minimum: f(x*) = −13.5320 at x* = (2.6732, −0.6759).
Local minima: f(x) = −9.7770 at x = (−0.4495, 2.2928); f(x) = −9.0312 at x = (2.4239, 1.9219).
f(x) = − Σ_{i=1}^{4} αᵢ exp( − Σ_{j=1}^{3} Aᵢⱼ (xⱼ − Pᵢⱼ)² ) ,        (D.6)

where

α = [1.0, 1.2, 3.0, 3.2] ,

A = [ 3    10   30
      0.1  10   35
      3    10   30
      0.1  10   35 ] ,        (D.7)

P = 10⁻⁴ [ 3689  1170  2673
           4699  4387  7470
           1091  8732  5547
            381  5743  8828 ] .

Fig. D.5 An x₂–x₃ slice of the Hartmann function at x₁ = 0.1148.

A slice of the function, at the optimal value of x₁ = 0.1148, is shown in Fig. D.5.
Global minimum: f(x*) = −3.86278 at x* = (0.11480, 0.55566, 0.85254).
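A sketch of Eq. D.6 with the coefficients of Eq. D.7 follows; evaluating it at the global minimum listed above provides a simple check.

```python
import numpy as np

alpha = np.array([1.0, 1.2, 3.0, 3.2])
A = np.array([[3.0, 10.0, 30.0],
              [0.1, 10.0, 35.0],
              [3.0, 10.0, 30.0],
              [0.1, 10.0, 35.0]])
P = 1e-4 * np.array([[3689.0, 1170.0, 2673.0],
                     [4699.0, 4387.0, 7470.0],
                     [1091.0, 8732.0, 5547.0],
                     [381.0, 5743.0, 8828.0]])

def hartmann(x):
    """Three-dimensional Hartmann function (Eq. D.6)."""
    x = np.asarray(x)
    inner = np.sum(A * (x - P) ** 2, axis=1)   # one exponent term per i
    return -np.sum(alpha * np.exp(-inner))

print(hartmann([0.11480, 0.55566, 0.85254]))   # approximately -3.86278
```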
L = q C_L S ,        (D.10)

S = b c .        (D.12)

D_f = k C_f q S_wet .        (D.13)
In this equation, the Reynolds number is based on the wing chord and
is defined as follows:
Re = ρ v c / μ ,        (D.15)
where 𝜌 is the air density, and 𝜇 is the air dynamic viscosity. The form
factor, 𝑘, accounts for the effects of pressure drag. The wetted area,
𝑆wet , is the area over which the skin friction drag acts, which is a little
more than twice the planform area. We will use
Dᵢ = L² / (q π b² e) ,        (D.17)
where e is the Oswald efficiency factor. The total drag is the sum of induced and viscous drag, D = Dᵢ + D_f.
Our objective function, the power required by the motor for level flight, is

P(b, c) = D v / η ,        (D.18)

where η is the propulsive efficiency. We assume that our electric propellers have a Gaussian efficiency curve (real efficiency curves are not Gaussian, but this is simple and will be sufficient for our purposes):

η = η_max exp( −(v − v̄)² / (2σ²) ) .        (D.19)
This is the same problem that was presented in Ex. 1.2 of Chapter 1. The optimal wingspan and chord are b = 25.48 m and c = 0.50 m, respectively, given the parameters. The contour and the optimal wing shape are shown in Fig. D.6.
Because there are no structural considerations in this problem, the resulting wing has a higher wing aspect ratio than is realistic.
Δt = ∫_{xᵢ}^{xᵢ+dx} √(dx² + dy²) / √(2g(h − y(x) − μ_k x))
   = ∫_{xᵢ}^{xᵢ+dx} √(1 + (dy/dx)²) dx / √(2g(h − y(x) − μ_k x)) .        (D.22)
To discretize this problem, we can divide the path into linear segments. As an example, Fig. D.7 shows the wire divided into four linear segments (five nodes) as an approximation of a continuous wire. The slope sᵢ = (Δy/Δx)ᵢ is then a constant along a given segment, and y(x) = yᵢ + sᵢ(x − xᵢ). Making these substitutions results in

Δtᵢ = ( √(1 + sᵢ²) / √(2g) ) ∫_{xᵢ}^{xᵢ₊₁} dx / √(h − yᵢ − sᵢ(x − xᵢ) − μ_k x) .        (D.23)

Fig. D.7 A discretized representation of the brachistochrone problem.

Performing the integration and simplifying (many steps omitted here) results in

Δtᵢ = √(2/g) √(Δxᵢ² + Δyᵢ²) / ( √(h − yᵢ₊₁ − μ_k xᵢ₊₁) + √(h − yᵢ − μ_k xᵢ) ) ,        (D.24)

T = Σ_{i=1}^{n−1} Δtᵢ .        (D.25)
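The travel time of Eqs. D.24 and D.25 can be evaluated for any discretized path with the sketch below; the start height h, friction coefficient, and the straight-line test path are illustrative assumptions.

```python
import numpy as np

def travel_time(x, y, h=1.0, mu_k=0.0, g=9.81):
    """Total travel time T from Eqs. D.24 and D.25 for a piecewise-linear path."""
    x, y = np.asarray(x), np.asarray(y)
    dx, dy = np.diff(x), np.diff(y)
    root1 = np.sqrt(h - y[1:] - mu_k * x[1:])
    root2 = np.sqrt(h - y[:-1] - mu_k * x[:-1])
    dt = np.sqrt(2.0 / g) * np.sqrt(dx ** 2 + dy ** 2) / (root1 + root2)
    return np.sum(dt)

# Straight line from (0, 1) to (1, 0) as a simple frictionless test path.
x = np.linspace(0.0, 1.0, 11)
y = 1.0 - x
print(travel_time(x, y))
```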
The design variables are the 𝑛−2 positions of the path parameterized
by 𝑦 𝑖 . The endpoints must be fixed; otherwise, the problem is ill-defined,
which is why there are 𝑛 − 2 design variables instead of 𝑛. Note that
𝑥 is a parameter, meaning that it is fixed. We could space the 𝑥 𝑖 any
reasonable way and still find the same underlying optimal curve, but
The analytic solution for the case with friction is more difficult to
derive, but the analytic solution for the frictionless case (𝜇 𝑘 = 0) with
our starting and ending points is as follows:
𝑥 = 𝑎(𝜃 − sin(𝜃))
(D.27)
𝑦 = −𝑎(1 − cos(𝜃)) + 1 ,
Fig. D.8 Two-spring system with no applied force (top) and with applied force (bottom).

E_p(x₁, x₂) = (1/2) k₁ (Δl₁)² + (1/2) k₂ (Δl₂)² − m g x₂ ,        (D.28)
where Δl₁ and Δl₂ are the changes in length for the two springs. With respect to the original lengths, and displacements x₁ and x₂ as shown,

Fig. D.9 Total potential energy contours for the two-spring system.

where the partial derivatives of Δl₁ and Δl₂ are

∂(Δl₁)/∂x₁ = (l₁ + x₁) / √((l₁ + x₁)² + x₂²)
∂(Δl₂)/∂x₁ = −(l₂ − x₁) / √((l₂ − x₁)² + x₂²) .        (D.31)

By letting ℒ₁ = √((l₁ + x₁)² + x₂²) and ℒ₂ = √((l₂ − x₁)² + x₂²), the partial derivative of E_p with respect to x₁ can be written as
a₁ = 75.196            a₂ = −3.8112
a₃ = 0.12694           a₄ = −2.0567 × 10⁻³
a₅ = 1.0345 × 10⁻⁵     a₆ = −6.8306
a₇ = 0.030234          a₈ = −1.28134 × 10⁻³
a₉ = 3.5256 × 10⁻⁵     a₁₀ = −2.266 × 10⁻⁷
a₁₁ = 0.25645          a₁₂ = −3.4604 × 10⁻³
a₁₃ = 1.3514 × 10⁻⁵    a₁₄ = −28.106
a₁₅ = −5.2375 × 10⁻⁶   a₁₆ = −6.3 × 10⁻⁸
a₁₇ = 7.0 × 10⁻¹⁰      a₁₈ = 3.4054 × 10⁻⁴
a₁₉ = −1.6638 × 10⁻⁶   a₂₀ = −2.8673
a₂₁ = 0.0005
f(x₁, x₂) = a₁ + a₂x₁ + a₃y₄ + a₄y₄x₁ + a₅y₄² + a₆x₂ + a₇y₁ + a₈x₁y₁ + a₉y₁y₄ + a₁₀y₂y₄ + a₁₁y₃ + a₁₂x₂y₃ + a₁₃y₃² + a₁₄/(x₂ + 1) + a₁₅y₃y₄ + a₁₆y₁y₄x₂ + a₁₇y₁y₃y₄ + a₁₈x₁y₃ + a₁₉y₁y₃ + a₂₀ exp(a₂₁y₁) .        (D.35)
not to be in the corner and so set the bounds to [0, 65] in both dimensions. The contour of this function is plotted in Fig. D.10.
The stress in the truss element can be computed from the equation
𝜎 = 𝑆 𝑒 𝑑, where 𝜎 is a scalar, 𝑑 is the same vector as before, and the
element 𝑆 𝑒 matrix (really a row vector because stress is one-dimensional
for truss elements) is

Sₑ = (E/L) [−c  −s  c  s] .        (D.39)
The global structure (an assembly of multiple finite elements) has the
same equations, 𝐾𝑑 = 𝑞 and 𝜎 = 𝑆𝑑, but now 𝑑 contains displacements
for all of the nodes in the structure, 𝑑 = [𝑥1 , 𝑥2 , . . . , 𝑥 𝑛 ]. If we have 𝑛
nodes and 𝑚 elements, then 𝑞 and 𝑑 are 2𝑛-vectors, 𝐾 is a (2𝑛 × 2𝑛)
matrix, 𝑆 is an (𝑚 × 2𝑛) matrix, and 𝜎 is an 𝑚-vector. The elemental
stiffness and stress matrices are first computed and then assembled into
the global matrices. This is straightforward because the displacements
and forces of the individual elements add linearly.
After we assemble the global matrices, we must remove any degrees
of freedom that are structurally rigid (already known to have zero
displacement). Otherwise, the problem is ill-defined, and the stiffness
matrix will be ill-conditioned.
𝐾𝑑 = 𝑞 . (D.40)
1 Wu, N., Kenway, G., Mader, C. A., Jasa, J., and Martins, J. R. R. A., cited on pp. 15, 200
“PyOptSparse: A Python framework for large-scale constrained
nonlinear optimization of sparse systems,” Journal of Open Source
Software, Vol. 5, No. 54, October 2020, p. 2564.
doi: 10.21105/joss.02564
2 Lyu, Z., Kenway, G. K. W., and Martins, J. R. R. A., “Aerodynamic cited on p. 20
Shape Optimization Investigations of the Common Research Model
Wing Benchmark,” AIAA Journal, Vol. 53, No. 4, April 2015, pp. 968–
985.
doi: 10.2514/1.J053318
3 He, X., Li, J., Mader, C. A., Yildirim, A., and Martins, J. R. R. A., cited on p. 20
“Robust aerodynamic shape optimization—From a circle to an
airfoil,” Aerospace Science and Technology, Vol. 87, April 2019, pp. 48–
61.
doi: 10.1016/j.ast.2019.01.051
4 Betts, J. T., “Survey of numerical methods for trajectory optimiza- cited on p. 26
tion,” Journal of Guidance, Control, and Dynamics, Vol. 21, No. 2, 1998,
pp. 193–207.
doi: 10.2514/2.4231
5 Bryson, A. E. and Ho, Y. C., Applied Optimal Control; Optimization, cited on p. 26
Estimation, and Control. Waltham, MA: Blaisdell Publishing, 1969.
6 Bertsekas, D. P., Dynamic Programming and Optimal Control. Belmont, cited on p. 26
MA: Athena Scientific, 1995.
7 Kepler, J., Nova stereometria doliorum vinariorum (New Solid Geometry cited on p. 34
of Wine Barrels). Linz, Austria: Johannes Planck, 1615.
8 Ferguson, T. S., “Who solved the secretary problem?” Statistical cited on p. 34
Science, Vol. 4, No. 3, August 1989, pp. 282–289.
doi: 10.1214/ss/1177012493
9 Fermat, P. de, Methodus ad disquirendam maximam et minimam cited on p. 35
(Method for the Study of Maxima and Minima). 1636, translated by
Jason Ross.
591
Bibliography 592
44 Hwang, J. T. and Martins, J. R. R. A., “A computational architecture cited on pp. 41, 498
for coupling heterogeneous numerical models and computing
coupled derivatives,” ACM Transactions on Mathematical Software,
Vol. 44, No. 4, June 2018, Article 37.
doi: 10.1145/3182393
45 Wright, M. H., “The interior-point revolution in optimization: cited on p. 41
History, recent developments, and lasting consequences,” Bulletin
of the American Mathematical Society, Vol. 42, 2005, pp. 39–56.
doi: 10.1007/978-1-4613-3279-4_23
46 Grant, M., Boyd, S., and Ye, Y., “Disciplined convex programming,” cited on pp. 41, 430
Global Optimization—From Theory to Implementation, Liberti, L. and
Maculan, N., Eds. Boston, MA: Springer, 2006, pp. 155–210.
doi: 10.1007/0-387-30528-9_7
47 Wengert, R. E., “A simple automatic derivative evaluation program,” cited on p. 41
Communications of the ACM, Vol. 7, No. 8, August 1964, pp. 463–464,
issn: 0001-0782.
doi: 10.1145/355586.364791
48 Speelpenning, B., “Compiling fast partial derivatives of functions cited on p. 41
given by algorithms,” PhD dissertation, University of Illinois at
Urbana–Champaign, Champaign, IL, January 1980.
doi: 10.2172/5254402
49 Squire, W. and Trapp, G., “Using complex variables to estimate cited on pp. 42, 232
derivatives of real functions,” SIAM Review, Vol. 40, No. 1, 1998,
pp. 110–112, issn: 0036-1445 (print), 1095-7200 (electronic).
doi: 10.1137/S003614459631241X
50 Martins, J. R. R. A., Sturdza, P., and Alonso, J. J., “The complex- cited on pp. 42, 233, 235, 237
step derivative approximation,” ACM Transactions on Mathematical
Software, Vol. 29, No. 3, September 2003, pp. 245–262.
doi: 10.1145/838250.838251
51 Torczon, V., “On the convergence of pattern search algorithms,” cited on p. 42
SIAM Journal on Optimization, Vol. 7, No. 1, February 1997, pp. 1–25.
doi: 10.1137/S1052623493250780
52 Jones, D., Perttunen, C., and Stuckman, B., “Lipschitzian optimiza- cited on pp. 42, 298
tion without the Lipschitz constant,” Journal of Optimization Theory
and Application, Vol. 79, No. 1, October 1993, pp. 157–181.
doi: 10.1007/BF00941892
53 Jones, D. R. and Martins, J. R. R. A., “The DIRECT algorithm—25 cited on pp. 42, 298
years later,” Journal of Global Optimization, Vol. 79, March 2021,
pp. 521–566.
doi: 10.1007/s10898-020-00952-6
Bibliography 596
66 Hodges, A., Alan Turing: The Enigma. Princeton, NJ: Princeton cited on p. 44
University Press, 2014.
isbn: 9780691164724
67 Lipsitz, G., How Racism Takes Place. Philadelphia, PA: Temple cited on p. 44
University Press, 2011.
68 Rothstein, R., The Color of Law: A Forgotten History of How Our cited on p. 44
Government Segregated America. New York, NY: Liveright Publishing,
2017.
69 King, L. J., “More than slaves: Black founders, Benjamin Banneker, cited on p. 44
and critical intellectual agency,” Social Studies Research & Practice
(Board of Trustees of the University of Alabama), Vol. 9, No. 3, 2014.
70 Shetterly, M. L., Hidden Figures: The American Dream and the Untold cited on p. 45
Story of the Black Women Who Helped Win the Space Race. New York,
NY: William Morrow and Company, 2016.
71 Box, G. E. P., “Science and statistics,” Journal of the American Statistical cited on p. 49
Association, Vol. 71, No. 356, 1976, pp. 791–799, issn: 0162-1459.
doi: 10.2307/2286841
72 Wilson, G., Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., cited on p. 59
Guy, R. T., Haddock, S. H. D., Huff, K. D., Mitchell, I. M., Plumbley,
M. D., Waugh, B., White, E. P., and Wilson, P., “Best practices for
scientific computing,” PLoS Biology, Vol. 12, No. 1, 2014, e1001745.
doi: 10.1371/journal.pbio.1001745
73 Grotker, T., Holtmann, U., Keding, H., and Wloka, M., The De- cited on p. 60
veloper’s Guide to Debugging, 2nd ed. New York, NY: Springer,
2012.
74 Ascher, U. M. and Greif, C., A First Course in Numerical Methods. cited on p. 61
Philadelphia, PA: SIAM, 2011.
75 Saad, Y., Iterative Methods for Sparse Linear Systems, 2nd ed. Philadel- cited on pp. 62, 569
phia, PA: SIAM, 2003.
76 Higgins, T. J., “A note on the history of mixed partial derivatives,” cited on p. 85
Scripta Mathematica, Vol. 7, 1940, pp. 59–62.
77 Hager, W. W. and Zhang, H., “A new conjugate gradient method cited on p. 99
with guaranteed descent and an efficient line search,” SIAM Journal
on Optimization, Vol. 16, No. 1, January 2005, pp. 170–192, issn:
1095-7189.
doi: 10.1137/030601880
78 Moré, J. J. and Thuente, D. J., “Line search algorithms with guaran- cited on p. 103
teed sufficient decrease,” ACM Transactions on Mathematical Software
(TOMS), Vol. 20, No. 3, 1994, pp. 286–307.
doi: 10.1145/192115.192132
79 Nocedal, J. and Wright, S. J., Numerical Optimization, 2nd ed. Berlin: cited on pp. 103, 118, 141, 142, 190, 209, 567, 568
Springer, 2006.
doi: 10.1007/978-0-387-40065-5
80 Broyden, C. G., “The convergence of a class of double-rank min- cited on p. 126
imization algorithms 1. General considerations,” IMA Journal of
Applied Mathematics, Vol. 6, No. 1, 1970, pp. 76–90, issn: 1464-3634.
doi: 10.1093/imamat/6.1.76
81 Fletcher, R., “A new approach to variable metric algorithms,” The cited on p. 126
Computer Journal, Vol. 13, No. 3, March 1970, pp. 317–322, issn:
1460-2067.
doi: 10.1093/comjnl/13.3.317
82 Goldfarb, D., “A family of variable-metric methods derived by cited on p. 126
variational means,” Mathematics of Computation, Vol. 24, No. 109,
January 1970, pp. 23–26, issn: 0025-5718.
doi: 10.1090/s0025-5718-1970-0258249-6
83 Shanno, D. F., “Conditioning of quasi-Newton methods for func- cited on p. 126
tion minimization,” Mathematics of Computation, Vol. 24, No. 111,
September 1970, pp. 647–656, issn: 0025-5718.
doi: 10.1090/s0025-5718-1970-0274029-x
84 Conn, A. R., Gould, N. I. M., and Toint, P. L., Trust Region Methods. cited on pp. 140, 141, 142, 143
Philadelphia, PA: SIAM, 2000.
isbn: 0898714605
85 Steihaug, T., “The conjugate gradient method and trust regions in cited on p. 141
large scale optimization,” SIAM Journal on Numerical Analysis, Vol.
20, No. 3, June 1983, pp. 626–637, issn: 1095-7170.
doi: 10.1137/0720042
86 Boyd, S. P. and Vandenberghe, L., Convex Optimization. Cambridge, cited on pp. 156, 425
UK: Cambridge University Press, March 2004.
isbn: 0521833787
87 Strang, G., Linear Algebra and its Applications, 4th ed. Boston, MA: cited on pp. 156, 547
Cengage Learning, 2006.
isbn: 0030105676
88 Dax, A., “Classroom note: An elementary proof of Farkas’ lemma,” cited on p. 166
SIAM Review, Vol. 39, No. 3, 1997, pp. 503–507.
doi: 10.1137/S0036144594295502
89 Gill, P. E., Murray, W., Saunders, M. A., and Wright, M. H., “Some cited on p. 183
theoretical properties of an augmented Lagrangian merit function,”
SOL 86-6R, Systems Optimization Laboratory, September 1986.
90 Di Pillo, G. and Grippo, L., “A new augmented Lagrangian function cited on p. 183
for inequality constraints in nonlinear programming problems,”
Journal of Optimization Theory and Applications, Vol. 36, No. 4, 1982,
pp. 495–519.
doi: 10.1007/BF00940544
91 Birgin, E. G., Castillo, R. A., and Martínez, J. M., “Numerical cited on p. 183
comparison of augmented Lagrangian algorithms for nonconvex
problems,” Computational Optimization and Applications, Vol. 31, No.
1, 2005, pp. 31–55.
doi: 10.1007/s10589-005-1066-7
92 Rockafellar, R. T., “The multiplier method of Hestenes and Powell cited on p. 183
applied to convex programming,” Journal of Optimization Theory
and Applications, Vol. 12, No. 6, 1973, pp. 555–562.
doi: 10.1007/BF00934777
93 Murray, W., “Analytical expressions for the eigenvalues and eigen- cited on p. 187
vectors of the Hessian matrices of barrier and penalty functions,”
Journal of Optimization Theory and Applications, Vol. 7, No. 3, March
1971, pp. 189–196.
doi: 10.1007/bf00932477
94 Forsgren, A., Gill, P. E., and Wright, M. H., “Interior methods for cited on p. 187
nonlinear optimization,” SIAM Review, Vol. 44, No. 4, January 2002,
pp. 525–597.
doi: 10.1137/s0036144502414942
95 Gill, P. E. and Wong, E., “Sequential quadratic programming cited on p. 190
methods,” Mixed Integer Nonlinear Programming, Lee, J. and Leyffer,
S., Eds., ser. The IMA Volumes in Mathematics and Its Applications.
New York, NY: Springer, 2012, Vol. 154.
doi: 10.1007/978-1-4614-1927-3_6
96 Gill, P. E., Murray, W., and Saunders, M. A., “SNOPT: An SQP cited on pp. 190, 197, 200
algorithm for large-scale constrained optimization,” SIAM Review,
Vol. 47, No. 1, 2005, pp. 99–131.
doi: 10.1137/S0036144504446096
97 Fletcher, R. and Leyffer, S., “Nonlinear programming without a cited on p. 198
penalty function,” Mathematical Programming, Vol. 91, No. 2, January
2002, pp. 239–269.
doi: 10.1007/s101070100244
98 Benson, H. Y., Vanderbei, R. J., and Shanno, D. F., “Interior-point cited on p. 198
methods for nonconvex nonlinear programming: Filter methods
and merit functions,” Computational Optimization and Applications,
Vol. 23, No. 2, 2002, pp. 257–272.
doi: 10.1023/a:1020533003783
99 Fletcher, R., Leyffer, S., and Toint, P., “A brief history of filter cited on p. 198
methods,” ANL/MCS-P1372-0906, Argonne National Laboratory,
September 2006.
100 Fletcher, R., Practical Methods of Optimization, 2nd ed. Hoboken, NJ: cited on p. 200
Wiley, 1987.
101 Liu, D. C. and Nocedal, J., “On the limited memory BFGS method cited on p. 200
for large scale optimization,” Mathematical Programming, Vol. 45,
No. 1–3, August 1989, pp. 503–528.
doi: 10.1007/bf01589116
102 Byrd, R. H., Nocedal, J., and Waltz, R. A., “Knitro: An integrated cited on pp. 200, 208
package for nonlinear optimization,” Large-Scale Nonlinear Opti-
mization, Di Pillo, G. and Roma, M., Eds. Boston, MA: Springer US,
2006, pp. 35–59.
doi: 10.1007/0-387-30065-1_4
103 Kraft, D., “A software package for sequential quadratic program- cited on p. 200
ming,” DFVLR-FB 88-28, DLR German Aerospace Center–Institute
for Flight Mechanics, Köln, Germany, 1988.
104 Wächter, A. and Biegler, L. T., “On the implementation of an cited on p. 208
interior-point filter line-search algorithm for large-scale nonlinear
programming,” Mathematical Programming, Vol. 106, No. 1, April
2005, pp. 25–57.
doi: 10.1007/s10107-004-0559-y
105 Byrd, R. H., Hribar, M. E., and Nocedal, J., “An interior point cited on p. 208
algorithm for large-scale nonlinear programming,” SIAM Journal
on Optimization, Vol. 9, No. 4, January 1999, pp. 877–900.
doi: 10.1137/s1052623497325107
106 Wächter, A. and Biegler, L. T., “On the implementation of a primal- cited on p. 208
dual interior point filter line search algorithm for large-scale non-
linear programming,” Mathematical Programming, Vol. 106, No. 1,
2006, pp. 25–57.
107 Gill, P. E., Saunders, M. A., and Wong, E., “On the performance cited on p. 209
of SQP methods for nonlinear optimization,” Modeling and Opti-
mization: Theory and Applications, Defourny, B. and Terlaky, T., Eds.
New York, NY: Springer, 2015, Vol. 147, pp. 95–123.
doi: 10.1007/978-3-319-23699-5_5
108 Kreisselmeier, G. and Steinhauser, R., “Systematic control design by cited on p. 212
optimizing a vector performance index,” IFAC Proceedings Volumes,
Vol. 12, No. 7, September 1979, pp. 113–117, issn: 1474-6670.
doi: 10.1016/s1474-6670(17)65584-8
109 Duysinx, P. and Bendsøe, M. P., “Topology optimization of contin- cited on p. 213
uum structures with local stress constraints,” International Journal
for Numerical Methods in Engineering, Vol. 43, 1998, pp. 1453–1478.
doi: 10.1002/(SICI)1097-0207(19981230)43:8<1453::AID-NME480>3.0.CO;2-2
110 Kennedy, G. J. and Hicken, J. E., “Improved constraint-aggregation cited on p. 213
methods,” Computer Methods in Applied Mechanics and Engineering,
Vol. 289, 2015, pp. 332–354, issn: 0045-7825.
doi: 10.1016/j.cma.2015.02.017
111 Hoerner, S. F., Fluid-Dynamic Drag. Bakersfield, CA: Hoerner Fluid cited on p. 219
Dynamics, 1965.
112 Lyness, J. N. and Moler, C. B., “Numerical differentiation of analytic cited on p. 232
functions,” SIAM Journal on Numerical Analysis, Vol. 4, No. 2, 1967,
pp. 202–210, issn: 0036-1429 (print), 1095-7170 (electronic).
doi: 10.1137/0704019
113 Lantoine, G., Russell, R. P., and Dargent, T., “Using multicomplex cited on p. 233
variables for automatic computation of high-order derivatives,”
ACM Transactions on Mathematical Software, Vol. 38, No. 3, April
2012, pp. 1–21, issn: 0098-3500.
doi: 10.1145/2168773.2168774
114 Fike, J. A. and Alonso, J. J., “Automatic differentiation through the cited on p. 233
use of hyper-dual numbers for second derivatives,” Recent Advances
in Algorithmic Differentiation, Forth, S., Hovland, P., Phipps, E., Utke,
J., and Walther, A., Eds. Berlin: Springer, 2012, pp. 163–173, isbn:
978-3-642-30023-3.
doi: 10.1007/978-3-642-30023-3_15
115 Griewank, A., Evaluating Derivatives. Philadelphia, PA: SIAM, 2000. cited on pp. 237, 247, 249
doi: 10.1137/1.9780898717761
116 Naumann, U., The Art of Differentiating Computer Programs—An cited on p. 237
Introduction to Algorithmic Differentiation. Philadelphia, PA: SIAM,
2011.
117 Utke, J., Naumann, U., Fagan, M., Tallent, N., Strout, M., Heimbach, cited on p. 250
P., Hill, C., and Wunsch, C., “OpenAD/F: A modular open-source
tool for automatic differentiation of Fortran codes,” ACM Trans-
actions on Mathematical Software, Vol. 34, No. 4, July 2008, issn:
0098-3500.
doi: 10.1145/1377596.1377598
118 Hascoet, L. and Pascual, V., “The Tapenade automatic differentia- cited on p. 250
tion tool: Principles, model, and specification,” ACM Transactions
on Mathematical Software, Vol. 39, No. 3, May 2013, 20:1–20:43, issn:
0098-3500.
doi: 10.1145/2450153.2450158
119 Griewank, A., Juedes, D., and Utke, J., “Algorithm 755: ADOL-C: cited on p. 250
A package for the automatic differentiation of algorithms written
in C/C++,” ACM Transactions on Mathematical Software, Vol. 22, No.
2, June 1996, pp. 131–167, issn: 0098-3500.
doi: 10.1145/229473.229474
120 Wiltschko, A. B., Merriënboer, B. van, and Moldovan, D., “Tangent: cited on p. 250
Automatic differentiation using source code transformation in
Python,” arXiv:1711.02712, 2017.
url: https://arxiv.org/abs/1711.02712.
121 Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., cited on p. 250
Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-
Milne, S., and Zhang, Q., “JAX: Composable Transformations of
Python+NumPy Programs,” 2018.
url: http://github.com/google/jax.
122 Revels, J., Lubin, M., and Papamarkou, T., “Forward-mode auto- cited on p. 250
matic differentiation in Julia,” arXiv:1607.07892, July 2016.
url: https://arxiv.org/abs/1607.07892.
123 Neidinger, R. D., “Introduction to automatic differentiation and cited on p. 250
MATLAB object-oriented programming,” SIAM Review, Vol. 52,
No. 3, January 2010, pp. 545–563.
doi: 10.1137/080743627
124 Betancourt, M., “A geometric theory of higher-order automatic cited on p. 250
differentiation,” arXiv:1812.11592 [stat.CO], December 2018.
url: https://arxiv.org/abs/1812.11592.
125 Giles, M., “An extended collection of matrix derivative results for cited on pp. 251, 252
forward and reverse mode algorithmic differentiation,” Oxford,
UK, January 2008.
url: https://people.maths.ox.ac.uk/gilesm/files/NA-08-01.pdf.
126 Peter, J. E. V. and Dwight, R. P., “Numerical sensitivity analysis for cited on p. 252
aerodynamic optimization: A survey of approaches,” Computers
and Fluids, Vol. 39, No. 3, March 2010, pp. 373–391.
doi: 10.1016/j.compfluid.2009.09.013
127 Martins, J. R. R. A., “Perspectives on aerodynamic design optimiza- cited on pp. 257, 444
tion,” Proceedings of the AIAA SciTech Forum. American Institute of
Aeronautics and Astronautics, January 2020.
doi: 10.2514/6.2020-0043
128 Lambe, A. B., Martins, J. R. R. A., and Kennedy, G. J., “An evaluation cited on p. 260
of constraint aggregation strategies for wing box mass minimiza-
tion,” Structural and Multidisciplinary Optimization, Vol. 55, No. 1,
January 2017, pp. 257–277.
doi: 10.1007/s00158-016-1495-1
129 Kenway, G. K. W., Mader, C. A., He, P., and Martins, J. R. R. A., cited on p. 260
“Effective adjoint approaches for computational fluid dynamics,”
Progress in Aerospace Sciences, Vol. 110, October 2019, Article 100542.
doi: 10.1016/j.paerosci.2019.05.002
130 Curtis, A. R., Powell, M. J. D., and Reid, J. K., “On the estimation cited on p. 263
of sparse Jacobian matrices,” IMA Journal of Applied Mathematics,
Vol. 13, No. 1, February 1974, pp. 117–119, issn: 1464-3634.
doi: 10.1093/imamat/13.1.117
131 Gebremedhin, A. H., Manne, F., and Pothen, A., “What color is cited on p. 264
your Jacobian? Graph coloring for computing derivatives,” SIAM
Review, Vol. 47, No. 4, January 2005, pp. 629–705, issn: 1095-7200.
doi: 10.1137/s0036144504444711
132 Gray, J. S., Hwang, J. T., Martins, J. R. R. A., Moore, K. T., and cited on pp. 264, 494, 501, 508, 533
Naylor, B. A., “OpenMDAO: An open-source framework for multi-
disciplinary design, analysis, and optimization,” Structural and
Multidisciplinary Optimization, Vol. 59, No. 4, April 2019, pp. 1075–
1104.
doi: 10.1007/s00158-019-02211-z
133 Ning, A., “Using blade element momentum methods with gradient- cited on p. 265
based design optimization,” Structural and Multidisciplinary Opti-
mization, May 2021.
doi: 10.1007/s00158-021-02883-6
134 Martins, J. R. R. A. and Hwang, J. T., “Review and unification of cited on p. 266
methods for computing derivatives of multidisciplinary compu-
tational models,” AIAA Journal, Vol. 51, No. 11, November 2013,
pp. 2582–2599.
doi: 10.2514/1.J052184
135 Yu, Y., Lyu, Z., Xu, Z., and Martins, J. R. R. A., “On the influence of cited on p. 283
optimization algorithm and starting design on wing aerodynamic
shape optimization,” Aerospace Science and Technology, Vol. 75, April
146 Barricelli, N., “Esempi numerici di processi di evoluzione,” Metho- cited on p. 306
dos, 1954, pp. 45–68.
147 Jong, K. A. D., “An analysis of the behavior of a class of genetic cited on p. 306
adaptive systems,” PhD dissertation, University of Michigan, Ann
Arbor, MI, 1975.
148 Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T., “A fast and cited on pp. 308, 364
elitist multiobjective genetic algorithm: NSGA-II,” IEEE Transactions
on Evolutionary Computation, Vol. 6, No. 2, April 2002, pp. 182–197.
doi: 10.1109/4235.996017
149 Deb, K., Multi-Objective Optimization Using Evolutionary Algorithms. cited on p. 313
Hoboken, NJ: John Wiley & Sons, 2001.
isbn: 047187339X
150 Eberhart, R. and Kennedy, J., “A new optimizer using particle cited on p. 316
swarm theory,” Proceedings of the Sixth International Symposium
on Micro Machine and Human Science. Institute of Electrical and
Electronics Engineers, 1995, pp. 39–43.
doi: 10.1109/MHS.1995.494215
151 Zhan, Z.-H., Zhang, J., Li, Y., and Chung, H. S.-H., “Adaptive cited on p. 317
particle swarm optimization,” IEEE Transactions on Systems, Man,
and Cybernetics, Part B (Cybernetics), Vol. 39, No. 6, April 2009,
pp. 1362–1381.
doi: 10.1109/TSMCB.2009.2015956
152 Gutin, G., Yeo, A., and Zverovich, A., “Traveling salesman should cited on p. 338
not be greedy: Domination analysis of greedy-type heuristics for
the TSP,” Discrete Applied Mathematics, Vol. 117, No. 1–3, March
2002, pp. 81–86, issn: 0166-218X.
doi: 10.1016/s0166-218x(01)00195-0
153 Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P., “Optimization cited on p. 347
by simulated annealing,” Science, Vol. 220, No. 4598, May 1983,
pp. 671–680, issn: 1095-9203.
doi: 10.1126/science.220.4598.671
154 Černý, V., “Thermodynamical approach to the traveling salesman cited on p. 347
problem: An efficient simulation algorithm,” Journal of Optimization
Theory and Applications, Vol. 45, No. 1, January 1985, pp. 41–51, issn:
1573-2878.
doi: 10.1007/bf00940812
155 Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., cited on p. 347
and Teller, E., “Equation of state calculations by fast computing
machines,” Journal of Chemical Physics, March 1953.
doi: 10.2172/4390578
156 Andresen, B. and Gordon, J. M., “Constant thermodynamic speed cited on p. 348
for minimizing entropy production in thermodynamic processes
and simulated annealing,” Physical Review E, Vol. 50, No. 6, Decem-
ber 1994, pp. 4346–4351, issn: 1095-3787.
doi: 10.1103/physreve.50.4346
157 Lin, S., “Computer solutions of the traveling salesman problem,” cited on p. 349
Bell System Technical Journal, Vol. 44, No. 10, December 1965,
pp. 2245–2269, issn: 0005-8580.
doi: 10.1002/j.1538-7305.1965.tb04146.x
158 Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P., cited on p. 351
Numerical Recipes in
C: The Art of Scientific Computing. Cambridge, UK: Cambridge
University Press, 1992.
isbn: 0521431085
159 Haimes, Y. Y., Lasdon, L. S., and Wismer, D. A., “On a bicriterion cited on p. 360
formulation of the problems of integrated system identification
and system optimization,” IEEE Transactions on Systems, Man, and
Cybernetics, Vol. SMC-1, No. 3, July 1971, pp. 296–297.
doi: 10.1109/tsmc.1971.4308298
160 Das, I. and Dennis, J. E., “Normal-boundary intersection: A new cited on p. 360
method for generating the Pareto surface in nonlinear multicriteria
optimization problems,” SIAM Journal on Optimization, Vol. 8, No.
3, August 1998, pp. 631–657.
doi: 10.1137/s1052623496307510
161 Ismail-Yahaya, A. and Messac, A., “Effective generation of the cited on p. 362
Pareto frontier using the normal constraint method,” Proceedings
of the 40th AIAA Aerospace Sciences Meeting & Exhibit. American
Institute of Aeronautics and Astronautics, January 2002.
doi: 10.2514/6.2002-178
162 Messac, A. and Mattson, C. A., “Normal constraint method with cited on p. 362
guarantee of even representation of complete Pareto frontier,”
AIAA Journal, Vol. 42, No. 10, October 2004, pp. 2101–2111.
doi: 10.2514/1.8977
163 Hancock, B. J. and Mattson, C. A., “The smart normal constraint cited on p. 362
method for directly generating a smart Pareto set,” Structural and
Multidisciplinary Optimization, Vol. 48, No. 4, June 2013, pp. 763–775.
doi: 10.1007/s00158-013-0925-6
164 Schaffer, J. D., “Some experiments in machine learning using cited on p. 363
vector evaluated genetic algorithms.” PhD dissertation, Vanderbilt
University, Nashville, TN, 1984.
175 Han, Z.-H., Zhang, Y., Song, C.-X., and Zhang, K.-S., “Weighted cited on p. 408
gradient-enhanced kriging for high-dimensional surrogate model-
ing and design optimization,” AIAA Journal, Vol. 55, No. 12, August
2017, pp. 4330–4346.
doi: 10.2514/1.J055842
176 Forrester, A., Sobester, A., and Keane, A., Engineering Design via cited on p. 408
Surrogate Modelling: A Practical Guide. Hoboken, NJ: John Wiley &
Sons, 2008.
isbn: 0470770791
177 Ruder, S., “An overview of gradient descent optimization algo- cited on p. 414
rithms,” arXiv:1609.04747, 2016.
url: http://arxiv.org/abs/1609.04747.
178 Goh, G., “Why momentum really works,” Distill, 2017. cited on p. 414
doi: 10.23915/distill.00006
179 Diamond, S. and Boyd, S., “Convex optimization with abstract lin- cited on p. 423
ear operators,” Proceedings of the 2015 IEEE International Conference
on Computer Vision (ICCV). Institute of Electrical and Electronics
Engineers, December 2015.
doi: 10.1109/iccv.2015.84
180 Lobo, M. S., Vandenberghe, L., Boyd, S., and Lebret, H., “Applica- cited on p. 425
tions of second-order cone programming,” Linear Algebra and Its
Applications, Vol. 284, No. 1–3, November 1998, pp. 193–228.
doi: 10.1016/s0024-3795(98)10032-0
181 Parikh, N. and Boyd, S., “Block splitting for distributed optimiza- cited on p. 425
tion,” Mathematical Programming Computation, Vol. 6, No. 1, October
2013, pp. 77–102.
doi: 10.1007/s12532-013-0061-8
182 Vandenberghe, L. and Boyd, S., “Semidefinite programming,” cited on p. 425
SIAM Review, Vol. 38, No. 1, March 1996, pp. 49–95.
doi: 10.1137/1038003
183 Vandenberghe, L. and Boyd, S., “Applications of semidefinite cited on p. 425
programming,” Applied Numerical Mathematics, Vol. 29, No. 3, March
1999, pp. 283–299.
doi: 10.1016/s0168-9274(98)00098-1
184 Boyd, S., Kim, S.-J., Vandenberghe, L., and Hassibi, A., “A tutorial cited on p. 436
on geometric programming,” Optimization and Engineering, Vol. 8,
No. 1, April 2007, pp. 67–127.
doi: 10.1007/s11081-007-9001-7
185 Hoburg, W., Kirschen, P., and Abbeel, P., “Data fitting with geometric- cited on p. 436
programming-compatible softmax functions,” Optimization and
Engineering, Vol. 17, No. 4, August 2016, pp. 897–918.
doi: 10.1007/s11081-016-9332-3
186 Kirschen, P. G., York, M. A., Ozturk, B., and Hoburg, W. W., “Ap- cited on p. 437
plication of signomial programming to aircraft design,” Journal of
Aircraft, Vol. 55, No. 3, May 2018, pp. 965–987.
doi: 10.2514/1.c034378
187 York, M. A., Hoburg, W. W., and Drela, M., “Turbofan engine cited on p. 437
sizing and tradeoff analysis via signomial programming,” Journal
of Aircraft, Vol. 55, No. 3, May 2018, pp. 988–1003.
doi: 10.2514/1.c034463
188 Stanley, A. P. and Ning, A., “Coupled wind turbine design and cited on p. 444
layout optimization with non-homogeneous wind turbines,” Wind
Energy Science, Vol. 4, No. 1, January 2019, pp. 99–114.
doi: 10.5194/wes-4-99-2019
189 Gagakuma, B., Stanley, A. P. J., and Ning, A., “Reducing wind cited on p. 444
farm power variance from wind direction using wind farm layout
optimization,” Wind Engineering, January 2021.
doi: 10.1177/0309524X20988288
190 Padrón, A. S., Thomas, J., Stanley, A. P. J., Alonso, J. J., and Ning, cited on p. 444
A., “Polynomial chaos to efficiently compute the annual energy
production in wind farm layout optimization,” Wind Energy Science,
Vol. 4, May 2019, pp. 211–231.
doi: 10.5194/wes-4-211-2019
191 Cacuci, D., Sensitivity & Uncertainty Analysis. Boca Raton, FL: cited on p. 450
Chapman and Hall/CRC, May 2003, Vol. 1.
doi: 10.1201/9780203498798
192 Parkinson, A., Sorensen, C., and Pourhassan, N., “A general ap- cited on p. 450
proach for robust optimal design,” Journal of Mechanical Design, Vol.
115, No. 1, 1993, p. 74.
doi: 10.1115/1.2919328
193 Golub, G. H. and Welsch, J. H., “Calculation of Gauss quadrature cited on p. 455
rules,” Mathematics of Computation, Vol. 23, No. 106, 1969, pp. 221–
230, issn: 0025-5718, 1088-6842.
doi: 10.1090/S0025-5718-69-99647-1
194 Wilhelmsen, D. R., “Optimal quadrature for periodic analytic cited on p. 457
functions,” SIAM Journal on Numerical Analysis, Vol. 15, No. 2, 1978,
pp. 291–296, issn: 0036-1429.
doi: 10.1137/0715020
195 Trefethen, L. N. and Weideman, J. A. C., “The exponentially conver- cited on p. 457
gent trapezoidal rule,” SIAM Review, Vol. 56, No. 3, 2014, pp. 385–
458, issn: 0036-1445, 1095-7200.
doi: 10.1137/130932132
196 Johnson, S. G., “Notes on the convergence of trapezoidal-rule cited on p. 457
quadrature,” March 2010.
url: http://math.mit.edu/~stevenj/trapezoidal.pdf.
197 Smolyak, S. A., “Quadrature and interpolation formulas for tensor cited on p. 458
products of certain classes of functions,” Proceedings of the USSR
Academy of Sciences, Vol. 148, No. 5, 1963, pp. 1042–1045.
doi: 10.3103/S1066369X10030084
198 Wiener, N., “The homogeneous chaos,” American Journal of Mathe- cited on p. 462
matics, Vol. 60, No. 4, October 1938, p. 897.
doi: 10.2307/2371268
199 Eldred, M., Webster, C., and Constantine, P., “Evaluation of non- cited on p. 465
intrusive approaches for Wiener–Askey generalized polynomial
chaos,” Proceedings of the 49th AIAA Structures, Structural Dynamics,
and Materials Conference. American Institute of Aeronautics and
Astronautics, April 2008.
doi: 10.2514/6.2008-1892
200 Adams, B. M., Bohnhoff, W. J., Dalbey, K. R., Ebeida, M. S., Eddy, J. P., cited on p. 466
Eldred, M. S., Hooper, R. W., Hough, P. D., Hu, K. T., Jakeman, J. D.,
Khalil, M., Maupin, K. A., Monschke, J. A., Ridgway, E. M., Rushdi,
A. A., Seidl, D. T., Stephens, J. A., Swiler, L. P., and Winokur, J. G.,
“Dakota, a multilevel parallel object-oriented framework for design
optimization, parameter estimation, uncertainty quantification,
and sensitivity analysis: Version 6.14 user’s manual,” May 2021.
url: https://dakota.sandia.gov/content/manuals.
201 Feinberg, J. and Langtangen, H. P., “Chaospy: An open source tool cited on p. 466
for designing methods of uncertainty quantification,” Journal of
Computational Science, Vol. 11, November 2015, pp. 46–57.
doi: 10.1016/j.jocs.2015.08.008
202 Jasa, J. P., Hwang, J. T., and Martins, J. R. R. A., “Open-source cited on pp. 481, 498
coupled aerostructural optimization using Python,” Structural and
Multidisciplinary Optimization, Vol. 57, No. 4, April 2018, pp. 1815–
1827.
doi: 10.1007/s00158-018-1912-8
203 Cuthill, E. and McKee, J., “Reducing the bandwidth of sparse cited on p. 486
symmetric matrices,” Proceedings of the 1969 24th National Confer-
ence. New York, NY: Association for Computing Machinery, 1969,
pp. 157–172.
doi: 10.1145/800195.805928
204 Amestoy, P. R., Davis, T. A., and Duff, I. S., “An approximate cited on p. 486
minimum degree ordering algorithm,” SIAM Journal on Matrix
Analysis and Applications, Vol. 17, No. 4, 1996, pp. 886–905.
doi: 10.1137/S0895479894278952
205 Lambe, A. B. and Martins, J. R. R. A., “Extensions to the design cited on p. 486
structure matrix for the description of multidisciplinary design,
analysis, and optimization processes,” Structural and Multidiscipli-
nary Optimization, Vol. 46, August 2012, pp. 273–284.
doi: 10.1007/s00158-012-0763-y
206 Irons, B. M. and Tuck, R. C., “A version of the Aitken accelerator cited on p. 490
for computer iteration,” International Journal for Numerical Methods
in Engineering, Vol. 1, No. 3, 1969, pp. 275–277.
doi: 10.1002/nme.1620010306
207 Kenway, G. K. W., Kennedy, G. J., and Martins, J. R. R. A., “Scalable cited on pp. 490, 504
parallel approach for high-fidelity steady-state aeroelastic analysis
and derivative computations,” AIAA Journal, Vol. 52, No. 5, May
2014, pp. 935–951.
doi: 10.2514/1.J052255
208 Chauhan, S. S., Hwang, J. T., and Martins, J. R. R. A., “An automated cited on p. 490
selection algorithm for nonlinear solvers in MDO,” Structural and
Multidisciplinary Optimization, Vol. 58, No. 2, June 2018, pp. 349–377.
doi: 10.1007/s00158-018-2004-5
209 Kenway, G. K. W. and Martins, J. R. R. A., “Multipoint high-fidelity cited on p. 504
aerostructural optimization of a transport aircraft configuration,”
Journal of Aircraft, Vol. 51, No. 1, January 2014, pp. 144–160.
doi: 10.2514/1.C032150
210 Hwang, J. T., Lee, D. Y., Cutler, J. W., and Martins, J. R. R. A., cited on p. 512
“Large-scale multidisciplinary optimization of a small satellite’s
design and operation,” Journal of Spacecraft and Rockets, Vol. 51, No.
5, September 2014, pp. 1648–1663.
doi: 10.2514/1.A32751
211 Biegler, L. T., Ghattas, O., Heinkenschloss, M., and Bloemen Waan- cited on p. 517
ders, B. van, Eds., Large-Scale PDE-Constrained Optimization. Berlin:
Springer, 2003.
212 Braun, R. D. and Kroo, I. M., “Development and application of cited on pp. 521, 522
the collaborative optimization architecture in a multidisciplinary
design environment,” Multidisciplinary Design Optimization: State of
the Art, Alexandrov, N. and Hussaini, M. Y., Eds. Philadelphia, PA:
SIAM, 1997, pp. 98–116.
doi: 10.5555/888020
213 Kim, H. M., Rideout, D. G., Papalambros, P. Y., and Stein, J. L., cited on p. 524
“Analytical target cascading in automotive vehicle design,” Journal
of Mechanical Design, Vol. 125, No. 3, September 2003, pp. 481–490.
doi: 10.1115/1.1586308
214 Tosserams, S., Etman, L. F. P., Papalambros, P. Y., and Rooda, cited on p. 524
J. E., “An augmented Lagrangian relaxation for analytical target
cascading using the alternating direction method of multipliers,”
Structural and Multidisciplinary Optimization, Vol. 31, No. 3, March
2006, pp. 176–189.
doi: 10.1007/s00158-005-0579-0
215 Talgorn, B. and Kokkolaras, M., “Compact implementation of non- cited on p. 524
hierarchical analytical target cascading for coordinating distributed
multidisciplinary design optimization problems,” Structural and
Multidisciplinary Optimization, Vol. 56, No. 6, 2017, pp. 1597–1602.
doi: 10.1007/s00158-017-1726-0
216 Sobieszczanski-Sobieski, J., Altus, T. D., Phillips, M., and Sandusky, cited on p. 527
R., “Bilevel integrated system synthesis for concurrent and dis-
tributed processing,” AIAA Journal, Vol. 41, No. 10, 2003, pp. 1996–
2003.
doi: 10.2514/2.1889
217 Tedford, N. P. and Martins, J. R. R. A., “Benchmarking multidiscipli- cited on p. 533
nary design optimization algorithms,” Optimization and Engineering,
Vol. 11, No. 1, February 2010, pp. 159–183.
doi: 10.1007/s11081-009-9082-6
218 Golovidov, O., Kodiyalam, S., Marineau, P., Wang, L., and Rohl, cited on p. 534
P., “Flexible implementation of approximation concepts in an
MDO framework,” Proceedings of the 7th AIAA/USAF/NASA/ISSMO
Symposium on Multidisciplinary Analysis and Optimization. American
Institute of Aeronautics and Astronautics, 1998.
doi: 10.2514/6.1998-4959
219 Balabanov, V., Charpentier, C., Ghosh, D. K., Quinn, G., Vander- cited on p. 534
plaats, G., and Venter, G., “VisualDOC: A software system for general
purpose integration and design optimization,” Proceedings of the 9th
AIAA/ISSMO Symposium on Multidisciplinary Analysis and Optimization.