Optimisation II Notes
Krupa Prag
& Nouralden Mohammed
2024-07-10
List of Figures

4.1 Condition 1
4.2 Condition 2
4.3 Condition 3
4.4 Golden Search Interval and Interior Points
In industry, commerce, government, indeed in all walks of life, one frequently needs
answers to questions concerning operational efficiency. Thus an architect may need
to know how to lay out a factory floor so that the article being manufactured does not
have to be moved about too much as it goes from one machine tool to another; the
manager of a shipping company needs to plan the itineraries of his ships so as to in-
crease the amount of goods handled, while avoiding costly waiting-around in docks.
A telecommunications engineer may want to know how best to transmit signals
so as to minimise the possibility of error on reception. Further examples of prob-
lems of this sort are provided by the planning of a railroad time-table to ensure that
trains are available as and when needed, the synchronisation of traffic lights, and
many other real-life situations. Formerly such problems would usually be ‘solved’
by imprecise methods giving results that were both unreliable and costly. Today,
they are increasingly being subjected to rigorous mathematical analysis, designed to
provide methods for finding exact solutions or highly reliable estimates rather than
vague approximations. Optimisation provides many of the mathematical tools used
for solving such problems.
Linear programming has proved an extremely powerful tool, both in modelling real-world problems and as a widely applicable mathematical theory. However, many interesting practical optimisation problems are nonlinear. The study of such problems involves a diverse blend of linear algebra, multivariate calculus, quadratic forms, numerical analysis and computing techniques. Important special areas include the design of computational algorithms, the geometry and analysis of convex sets and functions, and the study of specially structured problems such as unconstrained and constrained nonlinear optimisation problems. Nonlinear optimisation provides fundamental insights into mathematical analysis and is widely used in the applied sciences, in areas such as engineering design, regression analysis, inventory control, and geophysical exploration, among others. General nonlinear optimisation problems and various computational algorithms for addressing such problems will be taught in this course.
Optimisation problems are made up of three primary ingredients: (i) an objective function, (ii) variables (unknowns) and (iii) the constraints. The goal is to find values of the variables that minimise or maximise the objective function while satisfying the constraints. The general statement is:

$$\text{minimise } f(x), \quad x = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n, \tag{1.1}$$

subject to:

$$g_j(x) \le 0, \quad j = 1, 2, \ldots, m,$$
$$h_j(x) = 0, \quad j = 1, 2, \ldots, r,$$

where f(x), g_j(x) and h_j(x) are scalar functions of the real vector x.
The continuous components x_i of x are called the variables, f(x) is the objective function, g_j(x) denotes the respective inequality constraint functions and h_j(x) the equality constraint functions. The optimum vector x that solves Equation (1.1) is denoted by x* with a corresponding optimum function value f(x*). If there are no constraints specified, then the problem is aptly named an unconstrained minimisation problem. A great deal of progress has been made in solving different classes of the general problem introduced in Equation (1.1). On occasion these solutions can be attained analytically, yielding a closed-form solution. However, most real world problems have n > 2 and as a result need to be solved numerically through suitable computational algorithms.
1.3.1 Definitions
1.3.1.1 Neighbourhoods
$$N_\delta(y) = \{x : \lVert x - y \rVert \le \delta\}, \tag{1.2}$$

that is, x ∈ N_δ(y).
1.3.2 Convexity
Definition 1.6 (Affine Set). A line through the points x1 and x2 in R^n is the set:
$$\{x \mid x = \theta x_1 + (1 - \theta) x_2,\ \theta \in \mathbb{R}\}.$$
This is known as an affine set. An example of this is the solution set of the linear equations Ax = b.
Definition 1.7 (Convex Set). A set S ⊆ R^n is a convex set if for all x1, x2 ∈ S, the line segment between x1 and x2 is in S. The line segment between the points x1 and x2 can be represented as:
$$\{x \mid x = \theta x_1 + (1 - \theta) x_2,\ 0 \le \theta \le 1\}.$$
If this condition does not hold then the set is non-convex. Think of this as line of sight. Some examples are considered in the Figure below:
A convex combination of the points x1, ..., xn in L is a point of the form:
$$x = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n, \quad \text{where } \theta_1 + \cdots + \theta_n = 1,\ \theta_i \ge 0. \tag{1.10}$$
The Convex Hull (conv L) is the set of all convex combinations of the points in L.
This can be thought of as the tightest bound across all points in the set, as can be
seen in the Figure below:
Definition (Hyperplane and Halfspace). A hyperplane is a set of the form
$$\{x \mid a^T x = b\} \quad (a \ne 0), \tag{1.11}$$
and a halfspace is a set of the form
$$\{x \mid a^T x \le b\} \quad (a \ne 0). \tag{1.12}$$
Note: Hyperplanes are both affine and convex, while halfspaces are only convex. These are illustrated in the Figures below:
Definition 1.10 (Level Set). Consider the real valued function f on L. Let a be in R^1; then we denote L_a to be the set:
$$L_a = \{x \in L \mid f(x) = a\}.$$
The level set is the set of points that have a corresponding function value equal to that of the constant value a.
Definition 1.11 (Level Surface). Consider the real valued function f on L ⊆ R^3. Let a be in R^1; then we denote C_a to be the set:
$$C_a = \{x \in L \mid f(x) = a\}.$$
These sets are known as level surfaces of f on L and can be thought of as the cross section taken at some point x0 ∈ L.
1.3.3 Exercises

1. Determine which of the following sets are convex:
$$S_1 = \{(x_1, x_2) \mid x_1^2 + x_2^2 \le 1\}$$
$$S_2 = \{(x_1, x_2) \mid x_1^2 + x_2^2 > 1\}$$
$$S_3 = \{(0, 0), (1, 0), (1, 1), (0, 1)\}$$
$$S_4 = \{(x_1, x_2) \mid |x_1| + |x_2| < 1\}$$
4. In each of the following cases, sketch the level sets L b of the function f :
• f (x 1 , x 2 ) = x 1 + x 2 , b = 1, 2
• f (x 1 , x 2 ) = x 1 x 2 , b = 1, 2
• f(x) = e^x, b = 10, 0
5. Let L be a convex set in Rn , A be an m × n matrix and α a scalar. Show that the
following sets are convex.
• {y : y = Ax, x ∈ L}
• {αx : x ∈ L}
6. Suppose you have two distinct points x1 and x2 that solve a system of linear equations Ax = b. Prove that every point on the line that passes through these two points also solves the system, i.e. the solution set is an affine set.
7. Prove that a halfspace is convex.
Chapter 2
One Dimensional
Unconstrained and Bound
Constrained Problems
For example, if we consider f(x) = x², then for x > 0, f(x) is monotonic increasing, and for x < 0, f(x) is monotonic decreasing. A function that has a single minimum or a single maximum (a single peak) is known as a unimodal function. Functions with two peaks (two minima or two maxima) are called bimodal, and functions with many peaks are known as multimodal functions.
Consider a function of a single variable. We may think of convexity easily as follows: the function f is convex if the chord connecting x1 and x2 lies above the graph. See the Figure below:
Note:
$$f'(x) = \frac{df(x)}{dx} = 0, \tag{2.3}$$
which corresponds to the first order necessary condition (FONC). The FONC may
be necessary but it is not sufficient. For example, consider the function f (x) = x 3 as
seen in the Figure below. At x = 0, f′(x) = 0 but there is neither a maximum nor a minimum point on the interval (−∞, ∞). At x = 0 there is a point of inflection, where f″ = 0. Therefore, the point x = 0 is a stationary point but not a local optimum.
Thus in addition to the FONC, non-negative curvature is also necessary at x*; in fact, for a strong local minimum it is required that the second order condition
$$f''(x) = \frac{d^2 f(x)}{dx^2} > 0 \tag{2.4}$$
holds at x*. This is known as the second order sufficient condition (SOSC).
2.4.1 Exercises
1. The convexity condition for any real valued function f : R → R is given by:
$$f(\theta x_1 + (1 - \theta)x_2) \le \theta f(x_1) + (1 - \theta) f(x_2), \quad 0 \le \theta \le 1.$$
Using the above, prove that the following one dimensional functions are convex:
• f 1 (x) = 1 + x 2
• f 2 (x) = x 2 − 1
2. Find the stationary points of:
$$f(x) = x^3(3x^2 - 5) + 1,$$
and decide the maximiser, minimiser and the point of inflection, if any.
3. Using the FONC of optimality, determine the optimiser of the following func-
tions:
• $f(x) = \frac{1}{3}x^3 - \frac{7}{2}x^2 + 12x + 3$
• f (x) = 2(x − 1)2 + x 2
4. Using necessary and sufficient conditions for optimality, investigate the max-
imiser/minimiser of:
f (x) = −(x − 1)4 .
5. The function f (x) = max{0.5, x, x 2 } is convex. True or false?
6. The function f (x) = min{0.5, x, x 2 } is concave. True or false?
7. The function $f(x) = \dfrac{x^2 + 2}{x + 2}$ is concave. True or false?
8. Determine the maximum and minimum values of the function:
Chapter 3
Numerical Solutions to Nonlinear Equations
It is often necessary to find the stationary point(s) of a given function f(x). This amounts to finding the root of a nonlinear function g(x) if we consider g(x) = f′(x); in other words, solving g(x) = 0. Here, we introduce the Newton method. This method is important because when we cannot solve f′(x) = 0 analytically, we can solve it numerically.
3.1 Newton's Method

Newton's method is one of the more powerful and well known numerical methods for finding a root of g(x), i.e. for solving for x such that g(x) = 0. We can therefore use it to find a turning point, i.e. where f′(x) = 0. In the context of optimisation we want an x* such that f′(x*) = 0. Consider the figure below:
Suppose at some stage we have obtained the point xn as an approximation to the root x* (initially this is a guess). Newton observed that if g(x) were a straight line through (xn, g(xn)) with constant slope g′(xn) for all x, then the equation of the straight line could be found and the root read off. Obviously there would be no problem if g(x) actually were a straight line; however, the tangent might still be a good approximation (as seen in the Figure above). If we regard the tangent as a model of the function g(x) and we have an approximation xn, then we can produce a better approximation xn+1. The method can be applied again and again to give a sequence of values, each approximating x* with more and more accuracy.
The tangent to the curve of g(x) at (xn, g(xn)) has slope g′(xn), so its equation satisfies:
$$g'(x_n) = \frac{y - g(x_n)}{x - x_n}. \tag{3.1}$$
Setting y = 0 and solving for x gives the next approximation:
$$x_{n+1} = x_n - \frac{g(x_n)}{g'(x_n)}. \tag{3.2}$$
Hence the Newton method can be described by the following two steps:
1. Get an initial ‘guess’ x0.
2. Iterate by $x_{n+1} = x_n - f'(x_n)/f''(x_n)$.
Then xn converges to a turning point for a suitable choice of x0.
3.1.0.1 Example
Find the minimiser of $f(x) = \dfrac{x^4}{4} + \dfrac{x^2}{2} - 3x$ near x = 2.

We can see that finding the root of the derivative, f′(x) = x³ + x − 3, yields the minimiser of f(x) in this case (approximately 1.21341). Check: perform a few iterations of Equation (3.2) to check the solution.
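A sketch of this check in Python (the starting guess and iteration count are illustrative):

def fp(x):  return x**3 + x - 3    # f'(x) for f(x) = x**4/4 + x**2/2 - 3*x
def fpp(x): return 3*x**2 + 1      # f''(x)

x = 2.0                            # initial guess near x = 2
for _ in range(6):
    x = x - fp(x) / fpp(x)         # Newton iteration, Equation (3.2)
print(x)
## approximately 1.21341, as stated above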
Advantages:
• Fast (second order) convergence once xn is sufficiently close to the root.
Disadvantages:
• The number of steps needed for the required accuracy is unknown in advance, unlike Bisection for example.
• f must be at least twice differentiable.
• The method runs into problems when g′(x*) = 0.
• It could potentially be difficult to compute g(x) and g′(x), even if they do exist.
In general Newton's Method is fast, reliable and trouble-free, but one has to be mindful of the potential problems.
3.2 The Secant Method

If the second derivative is unavailable or expensive, it can be approximated by a finite difference of first derivatives, giving the secant iteration:
$$x_{n+1} = x_n - f'(x_n)\,\frac{x_n - x_{n-1}}{f'(x_n) - f'(x_{n-1})}. \tag{3.3}$$
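A sketch of the secant iteration in Python (the two starting approximations are assumptions for illustration):

def fp(x): return x**3 + x - 3     # f'(x) from the previous example

x_prev, x_curr = 2.0, 1.8          # two starting approximations
for _ in range(8):
    step = fp(x_curr) * (x_curr - x_prev) / (fp(x_curr) - fp(x_prev))
    x_prev, x_curr = x_curr, x_curr - step   # Equation (3.3)
print(x_curr)
## approximately 1.21341, the same stationary point as before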
3.2.1 Exercises

1. Find a root of the nonlinear equation:
$$3x - \sin(x) - e^x = 0.$$
2. Find the critical point of:
$$f(x) = \frac{1}{4}x^4 + \frac{1}{2}x^2 - 2x + 1,$$
using both Newton’s Method and the Secant Method. If the critical point is a
minimiser then obtain the minimum value. You may assume x = 2 as an initial
guess.
Chapter 4
Numerical Optimisation of
Univariate Functions
The simplest functions with which to begin a study of non-linear optimisation meth-
ods are those with a single independent variable. Although the minimisation of uni-
variate functions is in itself of some practical importance, the main area of applica-
tion for these techniques is as a subproblem of multivariate minimisation.
There are functions to be minimised where the variable x is unrestricted (say, x ∈ R); there are also functions to be optimised over a finite interval (in n dimensions, a box). Single variable optimisation over a finite interval is important because of its application in multivariate optimisation. In this chapter we will consider one dimensional optimisation.

If one needs to find the maximum or minimum (i.e. the optimal) value of a function f(x) on the interval [a, b], the procedure would be:

1. Find the stationary points of f in (a, b) using the FONC.
2. Evaluate f at these stationary points and at the endpoints a and b.
3. Take the largest (for a maximum) or smallest (for a minimum) of these values.
We assume that an interval [a, b] is given and that a local minimum x* ∈ [a, b]. When the first derivative of the objective function f(x) is known at a and b, it is necessary to evaluate function information at only one interior point in order to reduce this interval. This is because it is possible to decide whether an interval brackets a minimum simply by looking at the function values f(a), f(b) and the derivatives f′(a), f′(b) at the extreme points a and b. The conditions to be satisfied are illustrated in the Figure below (Conditions 1–3). The next step of the bisection method is to reduce the interval. At the k-th iteration we have an interval [a_k, b_k] and the mid-point $c_k = \frac{1}{2}(a_k + b_k)$ is computed. The next interval, [a_{k+1}, b_{k+1}], is either [a_k, c_k] or [c_k, b_k], depending on which interval brackets the minimum (determined by the sign of f′(c_k)). The process continues until two consecutive intervals produce estimates of the minimum that agree to within an acceptable tolerance.
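A minimal sketch of this interval-halving strategy on the derivative (the test function and tolerance are illustrative assumptions):

def bisect_minimise(fprime, a, b, tol=1e-6):
    # assumes fprime(a) < 0 < fprime(b), so [a, b] brackets a minimum
    while b - a > tol:
        c = 0.5 * (a + b)          # midpoint c_k
        if fprime(c) > 0:
            b = c                  # the minimum lies in [a_k, c_k]
        else:
            a = c                  # the minimum lies in [c_k, b_k]
    return 0.5 * (a + b)

print(bisect_minimise(lambda x: 2*x - 6, 0.0, 10.0))   # f(x) = x**2 - 6*x + 15
## 3.0 (to within the tolerance)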
4.1.1.1 Exercise
4.1.2 Golden Section Search

Suppose f : R → R on the interval [a, b] and f has only one minimum, at x* (we say f is unimodal). The problem is to locate x*. The method we now discuss is based on evaluating the objective function at different points in the interval [a, b]. We choose these points in such a way that an approximation to the minimiser of f may be achieved in as few evaluations as possible. Our goal is to progressively narrow down the subinterval containing x*. If we evaluate f at only one intermediate point of the interval [a, b], we cannot narrow the range within which we know the minimiser is located. We have to evaluate f at two intermediate points, chosen in such a way that the reduction in the range is symmetrical, i.e. such that
$$\rho = \frac{x_1 - a}{b - a} = \frac{b - x_2}{b - a}.$$
We then evaluate f at the intermediate points.
Case I:
If f(x1) > f(x2), then the minimiser is located in the range [a, x1]. We then update the interval and calculate the updated interior points at the next iteration.
Case II:
If, on the other hand, f(x1) < f(x2), then the minimiser must lie in the range [x2, b]. We then update the interval and calculate the updated interior points at the next iteration.
Starting with the reduced range of uncertainty we can repeat the process and similarly find the two new interior points. We would like to minimise the number of function evaluations while reducing the width of the interval of uncertainty. Suppose that f(x1) < f(x2). Then we know that x* ∈ [x2, b]. Because x1 is already in the new uncertainty interval and f(x1) is known, we can reuse this information: we choose the new points so that the new x2 coincides with the old x1. Thus, only one new evaluation of f (at the new x1) is necessary. If f(x1) > f(x2), then the minimiser must lie in the range [a, x1]. Because x2 is already in the new uncertainty interval and f(x2) is known, we choose the new points so that the new x1 coincides with the old x2. Thus, only one new evaluation of f (at the new x2) is necessary.
Using the two conditions $L_0 = L_1 + L_2$ and $\dfrac{L_1}{L_0} = \dfrac{L_2}{L_1}$, the quadratic formula lets us deduce the ratio ρ to be
$$\rho = \frac{\sqrt{5} - 1}{2} = 0.618\ldots
$$
This forms the basis of a search algorithm since the technique is applied again on
the reduced interval.
4.1.2.1 Example

Use four iterations of the Golden Section search to find the value of x that minimises
$$f(x) = x^2 - 6x + 15$$
over the interval [0, 10].

Iteration 1: The interior points are
$$x_1 = a + \rho(b - a) = 6.18, \qquad x_2 = b - \rho(b - a) = 3.82.$$
We compute f(x1) = 16.12 and f(x2) = 6.67. Since f(x1) > f(x2), the uncertainty interval is reduced to [a, x1] = [0, 6.18].

Iteration 2:
We set x1 = x2 = 3.82 and compute the new
$$x_2 = b - \rho(b - a) = 2.36.$$
Now we have f(x1) = 6.67 and f(x2) = 6.41. Now f(x1) > f(x2), and so the uncertainty interval is reduced to [a, x1] = [0, 3.82].
Iteration 3:
We set x1 = x2 = 2.36, and compute the new
$$x_2 = b - \rho(b - a) = 1.46.$$
We have f(x1) = 6.41 and f(x2) = 8.37. Since f(x1) < f(x2), the uncertainty interval is reduced to [x2, b] = [1.46, 3.82].

Iteration 4:
We set x2 = x1 = 2.36 and compute the new
$$x_1 = a + \rho(b - a) = 2.92.$$
We have f(x1) = 6.01 and f(x2) = 6.41.
Since f(x1) < f(x2), the value of x that minimises f is located in the interval [x2, b] = [2.36, 3.82].
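A sketch of the search in Python, reproducing the four iterations above:

import math

rho = (math.sqrt(5) - 1) / 2                   # 0.618...

def golden(f, a, b, iterations=4):
    x1, x2 = a + rho*(b - a), b - rho*(b - a)  # interior points (x2 < x1)
    f1, f2 = f(x1), f(x2)
    for _ in range(iterations):
        if f1 > f2:                            # Case I: minimiser in [a, x1]
            b, x1, f1 = x1, x2, f2             # reuse the old x2 as the new x1
            x2 = b - rho*(b - a); f2 = f(x2)   # one new evaluation
        else:                                  # Case II: minimiser in [x2, b]
            a, x2, f2 = x2, x1, f1             # reuse the old x1 as the new x2
            x1 = a + rho*(b - a); f1 = f(x1)   # one new evaluation
    return a, b

print(golden(lambda x: x**2 - 6*x + 15, 0.0, 10.0))
## roughly (2.36, 3.82), agreeing with the interval found above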
4.1.3 Exercises
Chapter 5
Multivariate Unconstrained Optimisation
Consider the problem
$$\text{minimise } f(x), \quad x \in S \subseteq \mathbb{R}^n, \tag{5.1}$$
where S is the feasible set. It should then be possible to find local minima and maxima just by looking at the behaviour of the objective function; indeed, necessary and sufficient conditions exist, and in this chapter these conditions will be derived. The idea of a line in a particular direction is important for any unconstrained optimisation method; we discuss this and derive the slope and curvature of the function f at a point on the line.
For a function f : R^n → R there exists, at any point x, a vector of first order partial derivatives, or gradient vector:
$$\nabla f(x) = \begin{pmatrix} \dfrac{\partial f}{\partial x_1}(x) \\ \dfrac{\partial f}{\partial x_2}(x) \\ \vdots \\ \dfrac{\partial f}{\partial x_n}(x) \end{pmatrix} = g(x). \tag{5.2}$$
It can be shown that if the function f(x) is smooth, then at the point x the gradient vector ∇f(x) (denoted by g(x)) is always perpendicular to the contours (or surfaces of constant f(x)).
If f(x) is twice continuously differentiable, then at the point x there exists a matrix of second order partial derivatives called the Hessian matrix:
$$H(x) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2}(x) & \dfrac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1}(x) & \ddots & & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1}(x) & \cdots & & \dfrac{\partial^2 f}{\partial x_n^2}(x) \end{pmatrix} = \nabla^2 f(x). \tag{5.3}$$
5.1.0.1 Example
Let f(x1, x2) = 5x1 + 8x2 + x1x2 − x1² − 2x2². Then:
$$\nabla f(x) = \begin{pmatrix} 5 + x_2 - 2x_1 \\ 8 + x_1 - 4x_2 \end{pmatrix}, \qquad \nabla^2 f(x) = \begin{pmatrix} -2 & 1 \\ 1 & -4 \end{pmatrix}.$$
The directional derivative of f in a direction d satisfies:
$$\nabla f^T d = \lim_{\alpha \to 0} \frac{f(x + \alpha d) - f(x)}{\alpha}. \tag{5.4}$$
Consider the line
$$x = x' + \alpha d, \quad \forall\, \alpha, \tag{5.6}$$
where d and x′ are given. For α ≥ 0, Equation (5.6) is a half-line. The point x′ is a fixed point (corresponding to α = 0) on the line, and d is the direction of the line. For instance, if we take the fixed point x′ to be (2, 2)^T and the direction d = (3, 1)^T, then the Figure below shows the line in the direction of d.
import numpy as np

d = np.array([3, 1])
alpha = np.linalg.norm(d)               # the length of d
print('Alpha is:', alpha)
## Alpha is: 3.1622776601683795
norm_d = d / alpha                      # the unit vector in the direction of d
print('The normalised vector d is:', norm_d)
## The normalised vector d is: [0.9486833  0.31622777]
print('The normalised d^T d gives:', norm_d @ norm_d)
## The normalised d^T d gives: 0.9999999999999999
print('So alpha x normalised d returns d:', alpha * norm_d)
## So alpha x normalised d returns d: [3. 1.]
We now use the gradient and the Hessian of f(x) to derive the derivative of f(x) along a line of any direction. For a fixed line of a given direction, as in Equation (5.6), the points on the line are a function of α only. Hence a change in α causes a change in all coordinates of x(α). The derivative of f(x) with respect to α is:
$$\frac{df(x(\alpha))}{d\alpha} = \frac{\partial f(x(\alpha))}{\partial x_1}\frac{dx_1(\alpha)}{d\alpha} + \frac{\partial f(x(\alpha))}{\partial x_2}\frac{dx_2(\alpha)}{d\alpha} + \cdots + \frac{\partial f(x(\alpha))}{\partial x_n}\frac{dx_n(\alpha)}{d\alpha}. \tag{5.7}$$
Equation (5.7) represents the derivative of f(x) at any point x(α) along the line. The operator d/dα can be expressed as:
$$\frac{d}{d\alpha} = \frac{\partial}{\partial x_1}\frac{dx_1}{d\alpha} + \frac{\partial}{\partial x_2}\frac{dx_2}{d\alpha} + \cdots + \frac{\partial}{\partial x_n}\frac{dx_n}{d\alpha} = d^T \nabla. \tag{5.8}$$
The slope of f(x) at x(α) can be written as:
$$\frac{df}{d\alpha} = d^T \nabla f(x(\alpha)) = \nabla f(x(\alpha))^T d. \tag{5.9}$$
Likewise, the curvature of f(x(α)) along the line:
$$\frac{d^2 f}{d\alpha^2} = \frac{d}{d\alpha}\left(\frac{df(x(\alpha))}{d\alpha}\right) = d^T \nabla \left(\nabla f^T d\right) = d^T \nabla^2 f\, d, \tag{5.10}$$
where ∇f and ∇²f are evaluated at x(α). The slope and curvature evaluated at α = 0 are respectively known as the slope (since f = f(α) is now a function of the single variable α) and the curvature of f at x′ in the direction of d.
5.2.0.1 Example

Consider f(x) = 100(x2 − x1²)² + (1 − x1)². If x′ = 0, show that the slope of f(x) along the line generated by d = (1, 0)^T is −2. We have
$$\nabla f = \begin{pmatrix} -400x_1(x_2 - x_1^2) - 2(1 - x_1) \\ 200(x_2 - x_1^2) \end{pmatrix},$$
which at x′ = 0 equals (−2, 0)^T, so the slope is ∇f(x′)^T d = −2.
In the context of optimization involving smooth function f (x) the Taylor series is
indispensable. Since x = x(α) = x′ + αd for a fixed point x′ and a given direction d,
the f (x) at x(α) becomes a function of the single variable α. Hence, f (x) = f (x(α)) =
f(α). Therefore, expanding the Taylor series around zero we have:
$$f(\alpha) = f(0 + \alpha) = f(0) + \alpha f'(0) + \frac{1}{2}\alpha^2 f''(0) + \cdots \tag{5.12}$$
But f(α) = f(x′ + αd) is the value of the function f(x) of many variables along the line x(α). Hence, we can re-write Equation (5.12) as:
$$f(x' + \alpha d) = f(x') + \alpha\, d^T \nabla f(x') + \frac{1}{2}\alpha^2 \left[d^T \nabla^2 f(x')\, d\right] + \cdots \tag{5.13}$$
For the quadratic function
$$f(x) = \frac{1}{2}x^T A x + b^T x + c,$$
with A symmetric, we have:
$$\nabla f(x) = Ax + b; \qquad H(x) = A. \tag{5.16}$$
The form x^T A x is said to be positive definite if x^T A x ≥ 0 for all x, with x^T A x = 0 iff x = 0. The form is said to be positive semi-definite if x^T A x ≥ 0 for all x. Similar definitions apply to negative definite and negative semi-definite with the inequalities reversed.
5.4.0.1 Example

Let x* be a stationary point and let λi denote the eigenvalues of ∇²f(x*):
• If ∇²f(x*) is indefinite, i.e. the λi have mixed signs, then x* is a saddle point.
• If ∇²f(x*) is positive definite, i.e. all λi > 0, then x* is a minimum.
• If ∇²f(x*) is negative definite, i.e. all λi < 0, then x* is a maximum.
• If ∇²f(x*) is positive semi-definite, i.e. all λi ≥ 0 with some λi = 0, then the test is inconclusive; x* may, for example, lie along the floor of a half cylinder (a valley of weak minima).
In summary, the nature of a stationary point x* is determined by the definiteness of the Hessian ∇²f(x*). There are a number of ways for us to test for positive or negative definiteness, namely: (i) computing the eigenvalues of the Hessian; or (ii) checking the signs of the determinants of its leading subminors, as shown below.
Example: Find and classify the stationary points of
$$f(x) = 2x_1^2 + x_1 x_2^2 + x_2^2.$$
Solution:
$$\frac{\partial f}{\partial x_1} = 4x_1 + x_2^2 = 0, \qquad \frac{\partial f}{\partial x_2} = 2x_1 x_2 + 2x_2 = 0,$$
which gives x1 = (0, 0)^T, x2 = (−1, 2)^T and x3 = (−1, −2)^T. The Hessian matrix is:
$$G = \begin{pmatrix} 4 & 2x_2 \\ 2x_2 & 2x_1 + 2 \end{pmatrix}.$$
Thus at x1 = (0, 0)^T:
$$G_1 = \begin{pmatrix} 4 & 0 \\ 0 & 2 \end{pmatrix},$$
whose characteristic equation
$$(4 - \lambda)(2 - \lambda) = 0$$
has eigenvalues λ = 4, 2 > 0, so G1 is positive definite and (0, 0)^T is a local minimum. At x2 and x3 we have 2x1 + 2 = 0, so
$$G_2 = \begin{pmatrix} 4 & 4 \\ 4 & 0 \end{pmatrix}, \qquad G_3 = \begin{pmatrix} 4 & -4 \\ -4 & 0 \end{pmatrix}.$$
Both have characteristic equation λ² − 4λ − 16 = 0, which has eigenvalues:
$$\lambda = 2 + \sqrt{20}, \quad 2 - \sqrt{20},$$
of mixed sign, so (−1, ±2)^T are saddle points.
From the Hessian we can also compute the determinants of all its leading subminors. If these are all greater than zero, then the Hessian is positive definite. Utilising the example above, if
$$G_1 = \begin{pmatrix} 4 & 0 \\ 0 & 2 \end{pmatrix},$$
then the first subminor is just det(4) = 4 > 0. The second and final subminor is the entire matrix, so:
$$\det\begin{pmatrix} 4 & 0 \\ 0 & 2 \end{pmatrix} = 8 - 0 > 0.$$
Hence G1 is positive definite. This approach would be preferable when dealing with the case of large matrices.
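Either test is a one-liner numerically; for instance with numpy (an assumption for this sketch):

import numpy as np

G1 = np.array([[4.0, 0.0], [0.0, 2.0]])   # Hessian at (0, 0)
G2 = np.array([[4.0, 4.0], [4.0, 0.0]])   # Hessian at (-1, 2)

print(np.linalg.eigvalsh(G1))             # all positive: positive definite
## [2. 4.]
print(np.linalg.eigvalsh(G2))             # mixed signs: indefinite, a saddle
## [-2.47213595  6.47213595]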
Theorem 5.1 (First Order Necessary Condition (FONC) for Local Maxima/Minima). If f(x) has continuous first partial derivatives at all points of S ⊂ R^n, and x* is an interior point of the feasible set S that is a local minimum or maximum of f(x), then:
$$\nabla f(x^*) = 0. \tag{5.17}$$
Equivalently,
$$\frac{\partial f(x^*)}{\partial x_i} = 0, \quad i = 1, 2, \ldots, n. \tag{5.18}$$
Theorem 5.2 (Second Order Necessary Condition (SONC) for Local Maxima/Minima). Let f be twice continuously differentiable on the feasible set S, let x* be a local minimiser of f(x), and let d be a feasible direction at x*. If d^T∇f(x*) = 0, then:
$$d^T \nabla^2 f(x^*)\, d \ge 0. \tag{5.19}$$
Theorem 5.3 (Second Order Sufficient Condition (SOSC) for Strong Local Maxima/Minima). Let x* be an interior point of S. If (i) ∇f(x*) = 0 and (ii) d^T∇²f(x*)d > 0 for all d ≠ 0, that is, the Hessian is positive definite, then x* is a strong local minimiser of f(x).
5.6.0.1 Example
Let f(x) = x1² + x2². Show that x = (0, 0)^T satisfies the FONC, the SONC and the SOSC, hence (0, 0)^T is a strict local minimiser. We see that ∇f(x) = (2x1, 2x2)^T = 0 if and only if x1 = x2 = 0. It can also easily be shown that for all d ≠ 0, d^T∇²f(x)d = 2d1² + 2d2² > 0. Hence ∇²f(x) is positive definite.
5.6.0.2 Example
Consider
$$f(x_1, x_2) = x_1^4 + x_2^4,$$
with
$$\nabla f(x) = \begin{pmatrix} 4x_1^3 \\ 4x_2^3 \end{pmatrix}.$$
The only stationary point is (0, 0)^T. Now the Hessian is
$$\nabla^2 f = \begin{pmatrix} 12x_1^2 & 0 \\ 0 & 12x_2^2 \end{pmatrix}.$$
At the origin the Hessian is the zero matrix, and so the test gives no prediction of the minimum, although it is easy to see that the origin is a minimum.
5.6.0.3 Example
Consider
$$f(x_1, x_2) = \frac{1}{2c}\left(\frac{x_1^2}{a^2} - \frac{x_2^2}{b^2}\right),$$
where a, b, and c are constants. Then
$$\nabla f(x) = \begin{pmatrix} \dfrac{x_1}{ca^2} \\ -\dfrac{x_2}{cb^2} \end{pmatrix},$$
so the only stationary point is (0, 0)^T. The Hessian is
$$\nabla^2 f(x) = \begin{pmatrix} \dfrac{1}{ca^2} & 0 \\ 0 & -\dfrac{1}{cb^2} \end{pmatrix}.$$
This is clearly indefinite and hence (0, 0)^T is a saddle point.
Thus, in summary, the necessary and sufficient conditions for x* to be a strong local minimum are:
• ∇f(x*) = 0
• the Hessian is positive definite
5.6.1 Exercises
1. Find and classify the stationary point(s) of:
$$f(x_1, x_2) = (x_1 - 1)^2 + (x_2 - 1)^2 + x_1 x_2.$$
4. For the following function, find the points where the gradients vanish, and
investigate which of these are local minima, maxima or saddle.
• f (x 1 , x 2 ) = x 1 (1 + x 1 ) + x 2 (1 + x 2 ) − 1.
5. Consider the function f : R² → R determined by
$$f(x) = x^T \begin{pmatrix} 1 & 2 \\ 4 & 8 \end{pmatrix} x + x^T \begin{pmatrix} 3 \\ 4 \end{pmatrix} + 6.$$
• Find the gradient and Hessian of f at the point (1, 1).
• Find the directional derivative of f at the point (1, 1) in the direction of
the maximal rate of increase.
• Find a point that satisfies the first order necessary condition (FONC).
Does the point also satisfy the second order necessary condition (SONC)
for a minimum?
6. Find the stationary points of the function
$$f(x_1, x_2) = \left(x_1^2 - 4\right)^2 + x_2^2.$$
Show that f has an absolute minimum at each of the points (x 1 , x 2 ) = (±2, 0).
Show that the point (0, 0) is a saddle point.
7. Show that every point x* on the line x2 − 2x1 = 0 is a weak global minimiser of
$$f(x) = 4x_1^2 - 4x_1 x_2 + x_2^2.$$
8. Show that
$$f(x) = 3x_1^2 - x_2^2 + x_1^3$$
has a strong local maximiser at (−2, 0)T and a saddle point at (0, 0)T , but has
no minimisers.
9. Prove that for a general quadratic function $f(x) = c + b^T x + \frac{1}{2}x^T G x$, the Hessian G of f maps differences in position into differences in gradient, i.e., $g_1 - g_2 = G(x_1 - x_2)$.
Chapter 6
Gradient Methods for Unconstrained Optimisation
In this chapter we will study methods for solving nonlinear unconstrained optimisation problems. The nonlinear minimisation algorithms to be described here are iterative methods which generate a sequence of points x0, x1, ..., say, or {xk} (subscripts denoting iteration number), hopefully converging to a minimiser x* of f(x). Univariate minimisation along a line in a particular direction is known as the line search technique. One dimensional minimisation arises as the line search subproblem in many-variable unconstrained nonlinear minimisation.
The algorithms for multivariate minimisation are all iterative processes which fit
into the same general framework:
xk+1 = xk + αk dk (6.1)
Bear in mind that this single variable minimiser cannot always be obtained analyti-
cally and hence some numerical techniques may be necessary.
The challenges in finding a good αk lie in avoiding a step length that is either too long or too short. Consider the Figures below:
Here the objective function is f(x) = x² and the iterates $x_{k+1} = x_k + \alpha_k d_k$ are generated by the descent directions $d_k = (-1)^{k+1}$ with steps $\alpha_k = 2 + \frac{3}{2^{k+1}}$, from an initial starting point x0 = 2.
Here the objective function is f(x) = x² and the iterates $x_{k+1} = x_k + \alpha_k d_k$ are generated by the descent direction $d_k = -1$ with steps $\alpha_k = \frac{1}{2^{k+1}}$, from an initial starting point x0 = 2.
Given the direction dk and the point xk, f(xk + αdk) becomes a function of α. Hence it is simply a one dimensional minimisation with respect to α. The solution of $\frac{df(\alpha)}{d\alpha} = 0$ will determine the exact location of the minimiser αk. However, it may not be possible to locate the exact αk for which $\frac{df(\alpha)}{d\alpha} = 0$; it may even require a very large number of iterations to locate the minimiser αk. Nonetheless, the idea is conceptually useful. Notice that for an exact line search the slope $\frac{df}{d\alpha}$ at αk must be zero. Therefore, we get:
$$\frac{df(x_{k+1})}{d\alpha} = \nabla f(x_{k+1})^T \frac{dx_{k+1}}{d\alpha} = g(x_{k+1})^T d_k = 0. \tag{6.4}$$
Line search algorithms used in practice are much more involved than the one di-
mensional search methods (optimisation methods) presented in the previous chap-
ter. The reason for this stems from several practical considerations. First, determin-
ing the value of αk that exactly minimises f (α) may be computationally demanding;
even worse, the minimiser of f(α) may not even exist. Second, practical experience suggests that it is better to spend computational time on iterating the optimisation algorithm rather than on performing exact line searches. These considerations led to the development of conditions for terminating line search algorithms that result in low-accuracy line searches while still securing a decrease in the value of f from one iteration to the next.
In practice, the line search is terminated when some descent conditions along the
line xk + αdk are satisfied. Hence, it is no longer necessary to go for the exact line
search. The line search carried out in this way is known as the inexact line search. A
further justification for the inexact line search is that it is not efficient to determine
the line search minima to a high accuracy when xk is far from the minimiser x∗ . Un-
der these circumstances, nonlinear minimisation algorithms employ an inexact or
approximate line search. To sum up, exact line search relates to theoretical concept
and the inexact is its practical implementation.
Remark:
Each iteration of a line search method computes a search direction dk and then decides how far to move along that direction. The iteration is given by
$$x_{k+1} = x_k + \alpha_k d_k,$$
where the positive scalar αk is called the step length. The success of a line search method depends on effective choices of both the direction dk and the step length αk. Most line search algorithms require dk to be a descent direction.
Central to the development of gradient based minimisation methods is the idea of a descent direction. Conditions for the descent direction can be obtained using the Taylor series around the point xk. Using two terms of the Taylor series we have:
$$f(x_k + \alpha d_k) - f(x_k) = \alpha\, d_k^T \nabla f(x_k) + \cdots \tag{6.5}$$
For sufficiently small α > 0, the right hand side is negative whenever $d_k^T \nabla f(x_k) < 0$; such a dk is called a descent direction.

A simple line search descent method is the steepest descent method, in which:
$$d_k = -g_k. \tag{6.10}$$
The negative gradient direction always satisfies the descent condition, and gives rise to the method of steepest descent.
Here the search direction is taken as the negative gradient and the step size, αk, is chosen to achieve the maximum decrease in the objective function f at each step. Specifically we solve the problem:
$$\text{Minimise } f\left(x^{(k)} - \alpha \nabla f(x^{(k)})\right) \text{ w.r.t. } \alpha. \tag{6.11}$$
The iterations are terminated when, for example, $\lVert \nabla f(x_k) \rVert < \epsilon_2$ for a small tolerance $\epsilon_2$.
6.5.2.1 Example
Consider f (x) = 2x 12 + 3x 22 , where x0 = (1, 1). Use two iterations of Steepest Descent.
Solution:
Compute $\nabla f(x) = \begin{pmatrix} 4x_1 \\ 6x_2 \end{pmatrix} = g$.
First Iteration:
We know that:
x1 = x0 − α0 g (x0 ),
so:
$$x_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix} - \alpha \begin{pmatrix} 4 \\ 6 \end{pmatrix} = \begin{pmatrix} 1 - 4\alpha \\ 1 - 6\alpha \end{pmatrix}.$$
Therefore:
$$f(x_1) = 2(1 - 4\alpha)^2 + 3(1 - 6\alpha)^2,$$
and setting $\frac{df}{d\alpha} = -16(1 - 4\alpha) - 36(1 - 6\alpha) = 0$ gives $280\alpha = 52$, i.e. $\alpha_0 = \frac{13}{70}$. Finally:
$$x_1 = \begin{pmatrix} 1 - 4\cdot\frac{13}{70} \\ 1 - 6\cdot\frac{13}{70} \end{pmatrix} = \begin{pmatrix} \frac{9}{35} \\ -\frac{4}{35} \end{pmatrix}.$$
Second Iteration:
We have $x_2 = x_1 - \alpha_1 g(x_1)$, with $g(x_1) = \left(\frac{36}{35}, -\frac{24}{35}\right)^T$, so:
$$f(x_1 - \alpha_1 g(x_1)) = \frac{1}{35^2}\left(2(9 - 36\alpha)^2 + 3(-4 + 24\alpha)^2\right).$$
Setting the derivative with respect to α to zero:
$$\frac{d}{d\alpha} f(x_1 - \alpha g(x_1)) = 0 \;\Rightarrow\; 60\alpha = 13 \;\Rightarrow\; \alpha_1 = \frac{13}{60}.$$
Therefore:
$$x_2 = x_1 - \alpha_1 g(x_1) = \begin{pmatrix} \frac{9}{35} \\ -\frac{4}{35} \end{pmatrix} - \frac{13}{60}\begin{pmatrix} \frac{36}{35} \\ -\frac{24}{35} \end{pmatrix} = \begin{pmatrix} \frac{6}{175} \\ \frac{6}{175} \end{pmatrix}.$$
The process continues in the same manner as above. We can see by inspection that the function achieves its minimum at (0, 0). We can verify this as a sanity check in the Python code below.
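A minimal sketch of this sanity check (the iteration count is illustrative):

import numpy as np

def f(x): return 2*x[0]**2 + 3*x[1]**2
def g(x): return np.array([4*x[0], 6*x[1]])   # the gradient computed above

Q = np.array([[4.0, 0.0], [0.0, 6.0]])
x = np.array([1.0, 1.0])
for k in range(10):
    d = -g(x)                          # steepest descent direction
    alpha = (d @ d) / (d @ Q @ d)      # exact line search for a quadratic
    x = x + alpha * d
print(x, f(x))
## x approaches the minimiser (0, 0) and f(x) approaches 0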
It is also worth noting that since this is a quadratic function, we can actually use another technique. We will redo the first iteration as an illustration. Specifically, for a quadratic function with Hessian Q, α can be solved for directly using:
$$\alpha_k = \frac{-g_k^T d_k}{d_k^T Q d_k}.$$
Thus:
First Iteration:
Compute f(x0) = 5, $g(x_0)^T = (4, 6)$ and $Q = \begin{pmatrix} 4 & 0 \\ 0 & 6 \end{pmatrix}$.
Therefore:
$$\alpha_0 = -\frac{g_0^T d_0}{d_0^T Q d_0} = \frac{52}{(4, 6)\begin{pmatrix} 4 & 0 \\ 0 & 6 \end{pmatrix}\begin{pmatrix} 4 \\ 6 \end{pmatrix}} = \frac{52}{280} = \frac{13}{70}.$$
Thus:
$$x_1 = (1, 1) - \frac{13}{70}(4, 6) = \left(\frac{9}{35}, -\frac{4}{35}\right),$$
and the process repeats similarly.
Although you will only cover inexact line search techniques in the third year syllabus, we will quickly introduce a very simple inexact technique to use for the purpose of your labs.
Figure 6.4: 2x 12 + 3x 22
Backtracking line search: fix β ∈ (0, 1), start with a step length t, and while
$$f(x - t\nabla f(x)) > f(x) - \frac{t}{2}\lVert \nabla f(x) \rVert^2,$$
update t = βt.
This is a simple technique that tends to work quite well in practice. For further reading you can consult Convex Optimisation by Boyd and Vandenberghe.
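A sketch of the rule as a helper function (numpy arrays and the starting step t = 1 are assumptions):

import numpy as np

def backtracking(f, grad_f, x, beta=0.5):
    t = 1.0                                  # assumed initial step length
    g = grad_f(x)
    while f(x - t*g) > f(x) - 0.5*t*(g @ g):
        t = beta * t                         # shrink the step until the condition holds
    return t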
def steepest_descent(g, x0, alpha, tol=1e-6):
    it, xvals = 0, np.array(x0)
    while np.linalg.norm(g(x0)) > tol:   # loop until the gradient is small
        x0 = x0 - alpha*g(x0)            # fixed-step steepest descent update
        it += 1
        xvals = np.append(xvals, x0)
    return x0, it, xvals
6.5.4 Exercises

1. Show that the value of the function
$$f(x) = ax_1^2 + bx_2^2 + cx_3^2$$
reached after taking a single step of the (exact line search) steepest descent method from the point (1, 1, 1)^T is:
$$\frac{ab(b - a)^2 + bc(c - b)^2 + ca(a - c)^2}{a^3 + b^3 + c^3}.$$
Figure 6.6: 2x 12 + 3x 22
2. Apply the method of steepest descent to:
$$f(x) = 3x_1^2 + 2x_2^2.$$
We will briefly look at the context of what we have learnt from the machine learning perspective, to emphasise the power of this chapter. In machine learning you will find the gradient descent algorithm everywhere. While the literature may seem to allude to this method being new, powerful and cool, it is really nothing more than the method of steepest descent introduced above.
So from the above plot we can see that there is a local minimum somewhere around 1.3 – 1.4 on the x-axis. Of course, we normally won't be afforded the luxury of such information a priori, so let's just assume we arbitrarily set our starting point to be x0 = 2. Implementing gradient descent with a fixed step size, or learning rate (in the context of ML), we have:
x_old = 0
x_new = 2 # The algorithm starts at x=2
n_k = 0.1 # step size fixed at 0.1
precision = 0.0001 # tolerance value
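A minimal sketch of the remaining descent loop, assuming the example objective is f(x) = x³ − 2x² + 2 (consistent with the minimiser near x ≈ 1.33 noted above):

def df(x):
    return 3*x**2 - 4*x                 # derivative of the assumed f(x) = x**3 - 2*x**2 + 2

num_steps = 0
while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - n_k * df(x_old)     # fixed-step gradient descent update
    num_steps += 1
print('Number of steps:', num_steps)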
## Number of steps: 17
One means of overcoming this issue is to use adaptive step-sizes. This can be done
using scipy’s fmin function to find the optimal step-size at each iteration.
x_old = 0
x_new = 2 # The algorithm starts at x=2
precision = 0.0001
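A sketch of the adaptive loop under the same assumed objective:

from scipy.optimize import fmin

def f(x):  return x**3 - 2*x**2 + 2     # assumed objective, as above
def df(x): return 3*x**2 - 4*x

num_steps = 0
while abs(x_new - x_old) > precision:
    x_old = x_new
    # let fmin pick the step that minimises f along the descent direction
    n_k = fmin(lambda a: f(x_old - a*df(x_old)), 0.1, disp=False)[0]
    x_new = x_old - n_k * df(x_old)
    num_steps += 1
print('Number of steps:', num_steps)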
## Number of steps: 4
So we can see that using adaptive step-sizes, we've reduced the number of iterations to convergence from 17 to 4. This is a substantial reduction; however, it must be noted that it takes time to compute the appropriate step-size at each iteration. This highlights a major issue in decision making for optimisation: trying to find the balance between speed and accuracy.
How did the modified algorithm look step by step?
Well we can see that it converges rapidly and after the first two iterations, we need
to zoom in to see further improvements.
x_old = 0
x_new = 2 # The algorithm starts at x=2
n_k = 0.17 # step size
precision = 0.0001
t, d = 0, 1
## Number of steps: 6
We can now see that we've still reduced the number of iterations required substantially, without being bound to finding an optimal step-size at each iteration. This highlights the trade-off of finding cheap improvements that aid convergence at minimal cost.
While using these line search methods to find the minima of basic functions is interesting, one might wonder how this relates to some of the regressions we are interested in performing. Let us consider a slightly more complicated example. In this data set, we have data relating to how temperature affects the noise produced by crickets.
Specifically, the data is a number of observations or samples of cricket chirp rates at
various temperatures.
We want to find the line
$$h_\theta(x) = \theta_0 + \theta_1 x$$
that best fits all of our data points, i.e. minimises the residual error. The function that we are trying to minimise in this case is:
$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right)^2.$$
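In code, the model h_θ is the one-line helper used by the cost and gradient functions below:

def h(theta_0, theta_1, x):
    return theta_0 + theta_1 * x    # the straight-line model h_theta(x)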
def J(x, y, m, theta_0, theta_1):
    returnValue = 0
    for i in range(m):
        returnValue += (h(theta_0, theta_1, x[i]) - y[i])**2
    returnValue = returnValue/(2*m)
    return returnValue
def grad_J(x, y, m, theta_0, theta_1):
    returnValue = np.array([0., 0.])
    for i in range(m):
        returnValue[0] += (h(theta_0, theta_1, x[i]) - y[i])
        returnValue[1] += (h(theta_0, theta_1, x[i]) - y[i])*x[i]
    returnValue = returnValue/(m)
    return returnValue
import time
start = time.time()
theta_old = np.array([0.,0.])
theta_new = np.array([1.,1.]) # The algorithm starts at [1,1]
n_k = 0.001 # step size
precision = 0.001
num_steps = 0
s_k = float("inf")
## theta_0 = 25.128552558595363
## theta_1 = 3.297264756251897
end = time.time()
print(str(end - start) + 'seconds')
## 19.64289903640747seconds
It's clear that the algorithm takes quite a long time for such a trivial example. Let's check whether the values we've obtained from the gradient descent are any good. We can get the true values for θ0 and θ1 with the following:
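For instance, using scipy.stats.linregress on the same data arrays:

from scipy.stats import linregress

result = linregress(x, y)
print('theta_0 =', result.intercept)
print('theta_1 =', result.slope)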
## theta_0 = 25.232304983426026
## theta_1 = 3.2910945679475647
end = time.time()
print(str(end - start) + 'seconds')
## 0.012906551361083984seconds
One thing this highlights is how much effort goes into optimising the functions found in these libraries. If one looks at the code inside linregress, clever optimisations that speed up the computation can be found.
Now, let’s plot our obtained results on the original data set:
So, if we had 3 million samples, to move a single step towards the minimum one would need to calculate each cost term 3 million times.
So what can we do to overcome this? We can use stochastic gradient descent. In this approach, we use the cost gradient of one sample at each iteration rather than the sum of the cost gradients of all samples. Recall our gradient equations from above:
$$\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right),$$
$$\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x_i) - y_i\right) x_i,$$
where:
$$h_\theta(x) = \theta_0 + \theta_1 x.$$
We now want to update our values at each item in the training set, instead of after all of them, so that we can begin improving straight away. We can redefine our algorithm into the stochastic gradient descent for simple linear regression as follows:

for each pass over the data (1 to k):
  for each sample i = 1, ..., m:
$$\begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} := \begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix} - \alpha \begin{pmatrix} 2(h_\theta(x_i) - y_i) \\ 2x_i(h_\theta(x_i) - y_i) \end{pmatrix}$$
  end for
end for
Depending on the size of the data set, we run through the entire data set 1 to k times. The key advantage here is that, unlike batch gradient descent where we have to go through the entire data set before making any progress, we can now make progress straight away as we move through the data set. This is the primary reason why stochastic gradient descent is used when dealing with large data sets.
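A self-contained sketch of this update rule on synthetic data (the data generation, step size and epoch count are all assumptions for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(13, 21, size=200)              # synthetic chirp rates
y = 25 + 3.3*x + rng.normal(0, 1, size=200)    # synthetic temperatures

theta = np.array([1.0, 1.0])                   # (theta_0, theta_1)
alpha = 0.001                                  # learning rate
for epoch in range(20):                        # k passes over the data
    for i in range(len(x)):                    # one sample per update
        err = theta[0] + theta[1]*x[i] - y[i]
        theta = theta - alpha * np.array([2*err, 2*err*x[i]])
print(theta)
## theta tends towards the least squares fit as the number of passes increases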
Chapter 7
Newton and Quasi-Newton Methods
The steepest descent method uses information based only on the first partial deriva-
tives in selecting a suitable search direction. This strategy is not always the most ef-
fective. A faster method may be obtained by approximating the objective function
f (x) as a quadratic q(x) and making use of a knowledge of the second partial deriva-
tives. This is the basis of Newton’s method. The idea behind this method is as fol-
lows. Given a starting point, we construct a quadratic approximation to the objective
function that matches the first and the second derivative of the original objective
function at that point. We then minimise the approximate (quadratic) function in-
stead of the original objective function. We then use the minimiser of the quadratic
function to obtain the next iterate, and repeat the procedure iteratively. If the objective function is quadratic, then the approximation is exact and the method yields the true minimiser in one step. If, on the other hand, the objective function is not quadratic, then the approximation will provide only an estimate of the position of the true minimiser.
We can obtain a quadratic approximation to the given twice continuously differentiable objective function using the Taylor series expansion of f about the current x^(k), neglecting terms of order three and higher:
$$f(x) \approx q(x) = f(x^{(k)}) + g^{(k)T}(x - x^{(k)}) + \frac{1}{2}(x - x^{(k)})^T H(x^{(k)})(x - x^{(k)}),$$
where g = ∇f and H is the Hessian matrix. The minimum of the quadratic q(x) satisfies:
$$0 = \nabla q(x) = g^{(k)} + H(x^{(k)})(x - x^{(k)}),$$
or, inverting:
$$x = x^{(k)} - H^{-1}(x^{(k)})\, g^{(k)}.$$
Newton's formula is:
$$x^{(k+1)} = x^{(k)} - H^{-1}(x^{(k)})\, g^{(k)}. \tag{7.1}$$
Note: to solve g(x) = 0 in one dimension, we iterate $x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}$. The above formula is the multidimensional extension of Newton's method.
The method requires that f_k, g_k and H_k, i.e. the function value, the gradient and the Hessian, be made available at each iterate xk. Most importantly, the Newton method is only well defined if the Hessian H_k is positive definite, because only then does q(x) have a unique minimiser. The positive definiteness of the Hessian can, in general, only be guaranteed if the starting iterate x0 is very near the minimiser x* of f(x).
The Newton method converges fast when it is applied close to the minimiser. If the starting point is far from the minimiser, then the algorithm may not converge.
7.0.0.1 Example
f (x) = 100(x 2 − x 12 )2 + (1 − x 1 )2 .
Let us take $x_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$. The gradient vector and the Hessian at x0 are respectively given by:
$$\nabla f(x) = \begin{pmatrix} -400x_1(x_2 - x_1^2) - 2(1 - x_1) \\ 200(x_2 - x_1^2) \end{pmatrix} = g,$$
and:
$$H(x) = \begin{pmatrix} 800x_1^2 - 400(x_2 - x_1^2) + 2 & -400x_1 \\ -400x_1 & 200 \end{pmatrix}.$$
So substituting x0 gives:
$$g_0 = (-2, 0)^T; \qquad H_0 = \begin{pmatrix} 2 & 0 \\ 0 & 200 \end{pmatrix}.$$
Now using $H_0 d_0 = -g_0$ (recall that $H_k d_k = -g_k$), we obtain:
$$d_0 = -H_0^{-1} g_0 = -\begin{pmatrix} 1/2 & 0 \\ 0 & 1/200 \end{pmatrix}\begin{pmatrix} -2 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$
Recall:
$$d_k = x_{k+1} - x_k \;\Rightarrow\; d_0 = x_1 - x_0 \;\Rightarrow\; x_1 = d_0 + x_0.$$
Thus:
$$x_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$
Calculating the function values, we have f(x0) = 1 but f(x1) = 100(0 − 1²)² + (1 − 1)² = 100: the full Newton step has increased the function value.
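A sketch of this computation (numpy assumed):

import numpy as np

def grad(x):
    return np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                     200*(x[1] - x[0]**2)])

def hess(x):
    return np.array([[800*x[0]**2 - 400*(x[1] - x[0]**2) + 2, -400*x[0]],
                     [-400*x[0], 200.0]])

x0 = np.array([0.0, 0.0])
d0 = np.linalg.solve(hess(x0), -grad(x0))   # solve H d = -g rather than inverting H
x1 = x0 + d0
print(x1)
## [1. 0.] -- and f increases from 1 to 100, as computed above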
The step length parameter αk modifies the step taken in the search direction, usually to minimise f(x^(k+1)). Newton's method applied without this modification does not necessarily produce a decrease in f(x^(k+1)), as illustrated by the above example. To address this drawback of the Newton method, a line search is introduced in which f_{k+1} < f_k is sought. As with the other gradient based methods, the new iterate x_{k+1} is found by minimising f along the search direction dk such that:
$$x_{k+1} = x_k + \alpha_k d_k.$$
The result also works if Q is negative definite resulting in a strong local maximum or
Q is symmetric indefinite giving x∗ as a saddle point.
The basic Newton method as it stands is not suitable for a general purpose algorithm, since Hk may not be positive definite when xk is remote from the solution. Furthermore, as we have shown in the previous example, even if Hk is positive definite, convergence may not occur. To address these issues, quasi-Newton algorithms were developed. We start by describing the drawbacks of the Newton method. At each iteration (say, the k-th) of Newton's method, a new matrix Hk has to be calculated (even if the method uses a line search), and then either the inverse of this matrix has to be found or a system of equations has to be solved before the new point x^(k+1) is found using x^(k+1) = x^(k) + d^(k). Quasi-Newton methods avoid the calculation of a new matrix at each iteration; rather, they only update the (positive definite) matrix of the previous iteration, and the updated matrix also remains positive definite. The method also does not need to solve a system of equations: it finds its direction using the positive definite matrix, and it finds the step length using a line search.
The introduction of the quasi-Newton method greatly increased the range of problems which could be solved. This type of method is like the Newton method with line search, except that $H_k^{-1}$ at each iteration is approximated by a symmetric positive definite matrix Gk, which is updated from iteration to iteration. Thus the k-th iteration has the basic structure:
1. Set dk = −G k gk
2. Line search along dk giving xk+1 = xk + αk dk
3. Update G k giving G k+1
Much of the interest lies in the updating formula which enables G_{k+1} to be calculated from Gk. We know that for any quadratic function
$$q(x) = \frac{1}{2}x^T H x + b^T x + c,$$
where H, b and c are constant and H is symmetric, the Hessian maps differences in position into differences in gradient, i.e.,
$$g_{k+1} - g_k = H(x_{k+1} - x_k). \tag{7.3}$$
The above property says that changes in the gradient g (= ∇f(x)) provide information about the second derivative of q(x) along (x_{k+1} − x_k). In the quasi-Newton methods,
at xk we have the information about the direction dk, Gk and the gradient gk. We can use this information to perform a line search to obtain x_{k+1} and g_{k+1}. We now need to calculate G_{k+1} (the approximate inverse of H_{k+1}) using the above information. At this point we impose the condition given by Equation (7.3) on the non-quadratic function f. In other words, we impose that changes in the gradient provide information about the second derivative of f along the search direction dk. Hence, we have:
$$H_{k+1}^{-1}(g_{k+1} - g_k) = (x_{k+1} - x_k). \tag{7.4}$$
Therefore, we would like to have $G_{k+1} = G_k + \Delta G_k$ such that:
$$G_{k+1}\gamma_k = \delta_k, \tag{7.5}$$
where $G_{k+1} \approx H_{k+1}^{-1}$, $\delta_k = (x_{k+1} - x_k)$ and $\gamma_k = (g_{k+1} - g_k)$. This is known as the quasi-Newton condition, and for a quasi-Newton algorithm the update G_{k+1} from G_k must satisfy Equation (7.5).
Methods differ in the way they update the matrix Gk. Essentially they are classified according to rank one and rank two updating formulae.

This formula was first suggested as part of a method due to Davidon (1959), and later also presented by Fletcher and Powell (1963). The quasi-Newton method which goes with this updating formula is known as the DFP (Davidon, Fletcher and Powell) method. The DFP algorithm is also known as the variable metric algorithm. The DFP algorithm preserves the positive definiteness of Gk but can sometimes give trouble when Gk becomes nearly singular. A modification (known as BFGS) introduced in 1970 can cure this problem. The updating formula for the DFP method is given below:
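In its standard form, the DFP rank-two update of Gk, in terms of δk and γk above, is:
$$G_{k+1} = G_k + \frac{\delta_k \delta_k^T}{\delta_k^T \gamma_k} - \frac{G_k \gamma_k \gamma_k^T G_k}{\gamma_k^T G_k \gamma_k}.$$
Multiplying through by γk shows directly that this update satisfies the quasi-Newton condition (7.5).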
7.3.2 Exercises
1. Apply the basic Newton method to:
$$f(x) = x_1^4 - 3x_1 x_2 + (x_2 + 2)^2,$$
starting at the point x0 = [0, 0]^T, and show that the function value at x0 cannot be improved by searching in the Newton direction.
2. Find the stationary points of:
$$f(x) = x_1^2 + x_2^2 - x_1^2 x_2$$
and determine their nature. Plot the contours of f. Find the value of f after taking one step of the basic Newton method from x0 = (1, 1)^T.
3. Using the Newton method, find the minimiser of:
$$f(x) = \frac{1}{2}x^2 - \sin(x).$$
4. Find the minimiser of:
$$f(x) = 4x_1^2 - 4x_1 x_2 + 3x_2^2 + x_1.$$
Chapter 8
Direct Search Methods for Unconstrained Optimisation

Direct search methods, unlike the descent methods discussed in the earlier chapters, do not require the derivatives of the function. Direct search methods require only the objective function values when finding minima, and are often known as zeroth-order methods since they use only the zeroth-order information (values) of the function. We will consider two direct methods in this course, namely the Random Walk Method and the Downhill Simplex Method.

8.1 The Random Walk Method
$$x_{i+1} = x_i + \lambda u_i,$$
where λ is some scalar step length and u_i is a random unit vector generated at the i-th stage.
We can describe the algorithm as follows:
1. Start with an initial point x1 , a sufficiently large initial step length λ, a mini-
mum allowable step length ϵ, and a maximum permissible number of itera-
tions N .
2. Find the function value f 1 = f (x1 ).
3. Set the iteration number, i , to 1
4. Generate a set of n random numbers r1, r2, ..., rn, each in the interval [−1, 1], and form the unit vector
$$u = \frac{1}{(r_1^2 + r_2^2 + \cdots + r_n^2)^{1/2}}\begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{pmatrix}.$$
To avoid bias in the directions, we only accept the vector if the length $(r_1^2 + r_2^2 + \cdots + r_n^2)^{1/2}$ is ≤ 1.
5. Compute the new vector and the corresponding function value, x = x1 + λu and f = f(x).
6. If f < f1, then set the new values x1 = x and f1 = f, and go to step 3; else continue to step 7.
7. If i ≤ N, set the iteration number to i + 1 and go to step 4. Otherwise, if i > N, go to step 8.
8. Compute the new, reduced, step length λ = λ/2. If the new step length is smaller than or equal to ϵ, then go to step 9; else go to step 4.
9. Stop the procedure by taking xopt = x1 and fopt = f1.
8.1.0.1 Example
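As a minimal sketch of the algorithm above (the objective f(x) = x1² + x2², the starting point and the parameter values are assumptions for illustration):

import numpy as np

def random_unit_vector(n, rng):
    while True:
        r = rng.uniform(-1, 1, n)          # n random numbers in [-1, 1]
        R = np.linalg.norm(r)
        if R <= 1:                         # accept only points inside the unit ball
            return r / R                   # ... to avoid directional bias

def random_walk(f, x1, lam=1.0, eps=1e-3, N=100, seed=0):
    rng = np.random.default_rng(seed)
    f1 = f(x1)
    while lam > eps:
        i = 1
        while i <= N:
            x = x1 + lam * random_unit_vector(len(x1), rng)
            fx = f(x)
            if fx < f1:
                x1, f1 = x, fx             # improvement: accept and reset the count
                i = 1
            else:
                i += 1
        lam = lam / 2                      # reduce the step length
    return x1, f1

print(random_walk(lambda x: x[0]**2 + x[1]**2, np.array([2.0, 2.0])))
## the returned point is close to the minimiser (0, 0)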
A direct search method for the unconstrained optimisation problem is the Downhill Simplex Method developed by Nelder and Mead (1965). It makes no assumptions about the cost function to be minimised; importantly, the function in question does not need to satisfy any differentiability condition, unlike other methods, i.e. it is a zeroth-order method. It makes use of simplices: polytopes of n + 1 vertices in dimension n. For example, in 2 dimensions the simplex is a polytope of 3 vertices (a triangle); in 3 dimensional space it forms a tetrahedron.
The method starts from an initial simplex. Subsequent steps of the method consist of updating the simplex, where one defines:
• x_h, the vertex with the highest function value;
• G, the centroid of all the vertices except x_h, i.e. the centroid of n points out of n + 1:
$$G = \frac{1}{n}\sum_{j=1,\, j \ne h}^{n+1} x_j. \tag{8.1}$$
The movement of the simplex is achieved by using three operations, known as re-
flection, contraction and expansion.
A common practice to generate the remaining initial simplex vertices is to make use of x0 + e_i b, where e_i is the unit vector in the direction of the x_i coordinate and b an edge length. Assume a value of 0.1 for b.
Let y = f(x) and y_h = f(x_h); then the algorithm suggested by Nelder and Mead proceeds by repeatedly replacing x_h using the reflection, expansion and contraction operations, with factors α, γ and β respectively.
The typical values for the above factors are α = 1, γ = 2 and β = 0.5. The stopping criterion is usually based on the variation of the function values over the simplex vertices falling below a prescribed tolerance.
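In practice one rarely codes the simplex bookkeeping by hand; scipy ships the Nelder–Mead method (the quadratic objective here is an illustrative assumption):

from scipy.optimize import minimize

result = minimize(lambda x: 2*x[0]**2 + 3*x[1]**2,
                  x0=[1.0, 1.0], method='Nelder-Mead')
print(result.x)
## close to the true minimiser (0, 0)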
8.2.1 Exercises

1. Apply the above two strategies to all the multivariate functions introduced in earlier chapters and find their respective minima.
Chapter 9
Lagrangian Multipliers for Constrained Optimisation

Consider the problem:
$$\text{minimise } f(x)$$
subject to:
$$g_i(x) = b_i, \quad i = 1, 2, \ldots, m. \tag{9.1}$$
We form the Lagrangian:
$$L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i \left[b_i - g_i(x)\right]. \tag{9.2}$$
The first order conditions are:
$$\frac{\partial L}{\partial x_j} = \frac{\partial f}{\partial x_j} - \sum_{i=1}^{m} \lambda_i \frac{\partial g_i}{\partial x_j} = 0, \qquad \frac{\partial L}{\partial \lambda_i} = b_i - g_i(x) = 0. \tag{9.3}$$
9.0.1 Example

Minimise:
$$f(x_1, x_2) = x_1^2 + 4x_2^2$$
subject to:
$$x_1 + 2x_2 = 1.$$
Solution:
The conditions are:
$$\frac{\partial f}{\partial x_1} - \lambda \frac{\partial g}{\partial x_1} = 0, \qquad \frac{\partial f}{\partial x_2} - \lambda \frac{\partial g}{\partial x_2} = 0, \qquad g(x_1, x_2) = b.$$
Therefore, we solve:
$$2x_1 - \lambda = 0, \qquad 8x_2 - 2\lambda = 0, \qquad x_1 + 2x_2 = 1.$$
Solving these three equations we obtain $x_1 = \frac{1}{2}$, $x_2 = \frac{1}{4}$ and λ = 1. Therefore, the optimum is:
$$f(x_1, x_2) = \frac{1}{2}.$$
9.0.2 Exercises
1. Minimise
$$f(x) = x_1^2 + x_2^2$$
subject to
$$x_1 + 2x_2 + 1 = 0.$$
2. Find the dimensions of a cylindrical tin of sheet metal to maximise its volume
such that the total surface area is equal to A 0 = 24π.