
APPM2025A Optimisation II

Krupa Prag & Nouralden Mohammed

2024-07-10
Contents

Course Outline
    Course Structure and Details
    Course Information
    Course Topics
    Hardware Requirements

1 Definition and General Concepts
    1.1 Nonlinear Optimisation
    1.2 General Statement of an Optimisation Problem
    1.3 Important Optimisation Concepts
        1.3.1 Definitions
            1.3.1.1 Neighbourhoods
            1.3.1.2 Local and Global Minimisers
            1.3.1.3 Infimum and Supremum of a Function
        1.3.2 Convexity
            1.3.2.1 Affine Set
            1.3.2.2 Convex Set
            1.3.2.3 Convex Combination and Convex Hull
            1.3.2.4 Hyperplanes and Halfspaces
            1.3.2.5 Level Sets and Level Surfaces
        1.3.3 Exercises

2 One Dimensional Unconstrained and Bound Constrained Problems
    2.1 Unimodal and Multimodal
    2.2 Convex Functions
    2.3 Global Extrema
    2.4 Necessary and Sufficient Conditions
        2.4.1 Exercises

3 Numerical Solutions to Nonlinear Equations
    3.1 Newton's Method
        3.1.0.1 Example
        3.1.1 Advantages and Disadvantages of Newton's Method
    3.2 Secant Method
        3.2.1 Exercises

4 Numerical Optimisation of Univariate Functions
    4.1 Techniques Using Function Evaluations
        4.1.1 Bisection Method
            4.1.1.1 Exercise
        4.1.2 Golden Search Method
            4.1.2.1 Example
        4.1.3 Exercises

5 Multivariate Unconstrained Optimisation
    5.1 Terminology for Functions of Several Variables
        5.1.0.1 Example
    5.2 A Line in a Particular Direction in the Context of Optimisation
        5.2.0.1 Example
    5.3 Taylor Series for Multivariate Function
    5.4 Quadratic Forms
        5.4.0.1 Example
    5.5 Stationary Points
        5.5.1 Tests for Positive Definiteness
            5.5.1.1 Compute the Eigenvalues
            5.5.1.2 Principal Minors
    5.6 Necessary and Sufficient Conditions
        5.6.0.1 Example
        5.6.0.2 Example
        5.6.0.3 Example
        5.6.1 Exercises

6 Gradient Methods for Unconstrained Optimisation
    6.1 General Line Search Techniques used in Unconstrained Multivariate Minimisation
        6.1.1 Challenges in Computing Step Length αk
    6.2 Exact and Inexact Line Search
        6.2.1 Algorithmic Structure
    6.3 The Descent Condition
    6.4 The Direction of Greatest Reduction
    6.5 The Method of Steepest Descent
        6.5.1 Steepest Descent Algorithm
        6.5.2 Convergence Criteria
            6.5.2.1 Example
        6.5.3 Inexact Line Search
            6.5.3.1 Backtracking Line Search
        6.5.4 Exercises
    6.6 The Gradient Descent Algorithm and Machine Learning
        6.6.1 Basic Example
        6.6.2 Adaptive Step-Size
        6.6.3 Decreasing Step-Size
        6.6.4 Stochastic Gradient Descent

7 Newton and Quasi-Newton Methods
        7.0.0.1 Example
    7.1 The Modified Newton Method
    7.2 Convergence of Newton's Method for Quadratic Functions
    7.3 Quasi-Newton Methods
        7.3.1 The DFP Quasi-Newton Method
        7.3.2 Exercises

8 Direct Search Methods for Unconstrained Optimisation
    8.1 Random Walk Method
        8.1.0.1 Example
    8.2 Downhill Simplex Method of Nelder and Mead
        8.2.1 Exercises

9 Lagrangian Multipliers for Constraint Optimisation
    9.0.1 Example
    9.0.2 Exercises
List of Figures

1.1 A Line Segment.
1.2 Left and Right are Non-Convex, Center Convex.
1.3 An Example of a Convex Hull.
1.4 A Hyperplane.
1.5 A Halfspace.
2.1 An Example of a Convex Function.
3.1 Newton's Method in Action.
3.2 Function f(x) and its derivative g(x).
3.3 Secant Method
4.1 Condition 1
4.2 Condition 2
4.3 Condition 3
4.4 Golden Search Interval and Interior Points
5.1 Mathematica Demo of Gradient.
5.2 An Example of a Line in a Particular Direction.
5.3 Positive Definite.
5.4 Negative Definite.
5.5 Positive Semi Definite.
5.6 Indefinite.
6.1 Step-size too big
6.2 Step-size too small
6.3 Varying Alpha
6.4 $2x_1^2 + 3x_2^2$
6.5 Corresponding contour plot
6.6 $2x_1^2 + 3x_2^2$
6.7 $f(x) = x^3 - 2x^2 + 2$
6.8 Iterative steps
6.9 Iterative steps
7.1 Rosenbrock Function
8.1 Random Walk
8.2 Random Walk
8.3 Here we have reflection and expansion
8.4 Here we have contraction
8.5 Here we have multiple contractions
8.6 Nelder Mead Algorithm
Chapter 1

Definition and General Concepts

In industry, commerce, government, indeed in all walks of life, one frequently needs answers to questions concerning operational efficiency. Thus an architect may need to know how to lay out a factory floor so that the article being manufactured does not have to be moved about too much as it goes from one machine tool to another; the manager of a shipping company needs to plan the itineraries of his ships so as to increase the amount of goods handled, while avoiding costly waiting-around in docks. A telecommunications engineer may want to know how best to transmit signals so as to minimise the possibility of error on reception. Further examples of problems of this sort are provided by the planning of a railroad time-table to ensure that trains are available as and when needed, the synchronisation of traffic lights, and many other real-life situations. Formerly such problems would usually be 'solved' by imprecise methods, giving results that were both unreliable and costly. Today, they are increasingly being subjected to rigorous mathematical analysis, designed to provide methods for finding exact solutions or highly reliable estimates rather than vague approximations. Optimisation provides many of the mathematical tools used for solving such problems.

Optimisation, or mathematical programming, is the study of maximising and minimising functions subject to specified boundary conditions or constraints. The functions to be optimised arise in engineering, the physical and management sciences, and in various branches of mathematics. With the emergence of the computer age, optimisation experienced a dramatic growth as a mathematical theory, enhancing both combinatorics and classical analysis. In its applications to engineering and management science, optimisation (linear programming) forms an important part of the discipline of operations research.


1.1 Nonlinear Optimisation

Linear programming has proved an extremely powerful tool, both in modelling real-world problems and as a widely applicable mathematical theory. However, many interesting and practical optimisation problems are nonlinear. The study of such problems involves a diverse blend of linear algebra, multivariate calculus, quadratic forms, numerical analysis and computing techniques. Important special areas include the design of computational algorithms, the geometry and analysis of convex sets and functions, and the study of specially structured problems such as unconstrained and constrained nonlinear optimisation problems. Nonlinear optimisation provides fundamental insights into mathematical analysis and is widely used in the applied sciences, in areas such as engineering design, regression analysis, inventory control, and geophysical exploration, among others. General nonlinear optimisation problems and various computational algorithms for addressing such problems will be taught in this course.

1.2 General Statement of an Optimisation Problem

Optimisation problems are made up of three primary ingredients: (i) an objective function, (ii) variables (unknowns) and (iii) constraints.

• An objective function which we want to minimise or maximise. For instance, in a manufacturing process, we might want to maximise the profit or minimise the cost. In fitting experimental data to a user-defined model, we might minimise the total deviation of observed data from predictions based on the model. In designing an automobile panel, we might want to maximise the strength.
• A set of unknowns/variables which affect the value of the objective function. In the manufacturing problem, the variables might include the amounts of different resources used or the time spent on each activity. In the data-fitting problem, the unknowns are the parameters that define the model. In the panel design problem, the variables define the shape and dimensions of the panel.
• A set of constraints that allow the unknowns to take on certain values but exclude others. For the manufacturing problem, it does not make sense to spend a negative amount of time on any activity, so we constrain all the 'time' variables to be non-negative. In the panel design problem, we would probably want to limit the weight of the product and to constrain its shape.

The optimisation problem is then: find values of the variables that minimise or maximise the objective function while satisfying the constraints.
The general statement being:

$$\underset{\mathbf{x}}{\text{minimise}}\; f(\mathbf{x}), \qquad \mathbf{x} = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^n, \tag{1.1}$$

subject to:

$$g_j(\mathbf{x}) \le 0, \quad j = 1, 2, \ldots, m,$$
$$h_j(\mathbf{x}) = 0, \quad j = 1, 2, \ldots, r,$$

where $f(\mathbf{x})$, $g_j(\mathbf{x})$ and $h_j(\mathbf{x})$ are scalar functions of the real vector $\mathbf{x}$.
The continuous components $x_i$ of $\mathbf{x}$ are called the variables, $f(\mathbf{x})$ is the objective function, $g_j(\mathbf{x})$ denotes the respective inequality constraint functions and $h_j(\mathbf{x})$ the equality constraint functions. The optimum vector $\mathbf{x}$ that solves Equation (1.1) is denoted by $\mathbf{x}^*$ with a corresponding optimum function value $f(\mathbf{x}^*)$. If no constraints are specified, then the problem is aptly named an unconstrained minimisation problem. A great deal of progress has been made in solving different classes of the general problem introduced in Equation (1.1). On occasion these solutions can be attained analytically, yielding a closed-form solution. However, most real-world problems have $n > 2$ and as a result need to be solved numerically through suitable computational algorithms.

1.3 Important Optimisation Concepts

1.3.1 Definitions

1.3.1.1 Neighbourhoods

Definition 1.1 (Neighbourhood). δ-neighbourhood of a point, say $\mathbf{y}$, where $\mathbf{y} \in \mathbb{R}^n$: a δ-neighbourhood of a point $\mathbf{y}$ is the set of all points $\mathbf{x}$ within distance $\delta$ of $\mathbf{y}$, denoted by $N_\delta(\mathbf{y})$:

$$N_\delta(\mathbf{y}) = \{\mathbf{x} : \|\mathbf{x} - \mathbf{y}\| \le \delta\}, \tag{1.2}$$

that is, $\mathbf{x} \in N_\delta(\mathbf{y})$.

1.3.1.2 Local and Global Minimisers

Definition 1.2 (Global Minimiser/Maximiser). If a point $\mathbf{x} \in S$, where $S$ is the feasible set, then $\mathbf{x}$ is a feasible solution. A feasible solution $\mathbf{x}_g$ of some problem $P$ is the global minimiser of $f(\mathbf{x})$ if:

$$f(\mathbf{x}_g) \le f(\mathbf{x}), \tag{1.3}$$

for all feasible points $\mathbf{x}$ of the problem $P$. The value $f(\mathbf{x}_g)$ is called the global minimum. The analogous definition, with the inequality reversed, applies for the global maximum.

Definition 1.3 (Local Minimiser/Maximiser). A point $\mathbf{x}^*$ is called a local minimiser of $f(\mathbf{x})$ if there exists a suitable $\delta > 0$ such that for all feasible $\mathbf{x} \in N_\delta(\mathbf{x}^*)$:

$$f(\mathbf{x}^*) \le f(\mathbf{x}). \tag{1.4}$$

In other words, $\mathbf{x}^*$ is a local minimiser of $f(\mathbf{x})$ if there exists a neighbourhood $N_\delta(\mathbf{x}^*)$ of $\mathbf{x}^*$ containing feasible $\mathbf{x}$ such that:

$$f(\mathbf{x}^*) \le f(\mathbf{x}), \quad \forall\, \mathbf{x} \in N_\delta(\mathbf{x}^*). \tag{1.5}$$

Again, the analogous definition applies for local maximisers.

1.3.1.3 Infimum and Supremum of a Function

Definition 1.4 (Infimum). If $f$ is a function on $S$, the greatest lower bound or infimum of $f$ on $S$ is the largest number $m$ (possibly $m = -\infty$) such that $f(x) \ge m\ \forall x \in S$. The infimum is denoted by:

$$\inf_{x \in S} f(x). \tag{1.6}$$

Definition 1.5 (Supremum). If $f$ is a function on $S$, the least upper bound or supremum of $f$ on $S$ is the smallest number $m$ (possibly $m = +\infty$) such that $f(x) \le m\ \forall x \in S$. The supremum is denoted by:

$$\sup_{x \in S} f(x). \tag{1.7}$$

1.3.2 Convexity

1.3.2.1 Affine Set

Definition 1.6 (Affine Set). A line through the points $\mathbf{x}_1$ and $\mathbf{x}_2$ in $\mathbb{R}^n$ is the set:

$$L = \{\mathbf{x} \mid \mathbf{x} = \theta\mathbf{x}_1 + (1 - \theta)\mathbf{x}_2,\ \theta \in \mathbb{R}\}. \tag{1.8}$$

This is known as an affine set. An example of this is the solution set of the linear equations $A\mathbf{x} = \mathbf{b}$.

1.3.2.2 Convex Set

Definition 1.7 (Convex Set). A set $S \subset \mathbb{R}^n$ is a convex set if for all $\mathbf{x}_1, \mathbf{x}_2 \in S$, the line segment between $\mathbf{x}_1$ and $\mathbf{x}_2$ is in $S$. The line segment between the points $\mathbf{x}_1$ and $\mathbf{x}_2$ can be represented as:

$$\mathbf{x} = \theta\mathbf{x}_1 + (1 - \theta)\mathbf{x}_2, \quad \text{where } 0 \le \theta \le 1. \tag{1.9}$$

If this condition does not hold then the set is non-convex. Think of this as line of sight. Some examples are shown in the Figure below:

Figure 1.1: A Line Segment.

Figure 1.2: Left and Right are Non-Convex, Center Convex.

1.3.2.3 Convex Combination and Convex Hull

Definition 1.8 (Convex Combination). We can define the Convex Combination of


x1 , . . . , xn , as any point x which satisfies:

x = θ1 x1 + θ2 x2 + . . . + θn xn , (1.10)

where θ1 + . . . + θn = 1, θi ≥ 0.

The Convex Hull (conv L) is the set of all convex combinations of the points in L.
This can be thought of as the tightest bound across all points in the set, as can be
seen in the Figure below:

1.3.2.4 Hyperplanes and Halfspaces

Definition 1.9 (Hyperplane/Halfspace). We can define the Hyperplane as the set of


the form:
$$\{\mathbf{x} \mid \mathbf{a}^T\mathbf{x} = b\} \quad (\mathbf{a} \ne \mathbf{0}), \tag{1.11}$$

we can define the Halfspace as the set of the form:

$$\{\mathbf{x} \mid \mathbf{a}^T\mathbf{x} \le b\} \quad (\mathbf{a} \ne \mathbf{0}), \tag{1.12}$$

where a is the normal vector.



Figure 1.3: An Example of a Convex Hull.

Note: Hyperplanes are both affine and convex, while halfspaces are only convex.
These are illustrated in the Figures below:

Figure 1.4: A Hyperplane.

1.3.2.5 Level Sets and Level Surfaces

Definition 1.10 (Level Set). Consider the real-valued function $f$ on $L$. Let $a$ be in $\mathbb{R}$; then we denote $L_a$ to be the set:

$$L_a = \{\mathbf{x} \in L \mid f(\mathbf{x}) = a\}. \tag{1.13}$$

The level set is the set of points whose function value is equal to the constant value $a$.

Definition 1.11 (Level Surface). Consider the real-valued function $f$ on $L \subset \mathbb{R}^3$. Let $a \in \mathbb{R}$; then we denote $C_a$ to be the set:

$$C_a = \{\mathbf{x} \in L \mid f(\mathbf{x}) = a\}. \tag{1.14}$$

These sets are known as level surfaces of $f$ on $L$ and can be thought of as the cross section taken at some point $\mathbf{x}_0 \in L$.

See the example of $x^2 y + x y^2$ below:



Figure 1.5: A Halfspace.

1.3.3 Exercises

1. Find the convex hull of the following sets:

   $S_1 = \{(x_1, x_2) \mid x_1^2 + x_2^2 \le 1\}$
   $S_2 = \{(x_1, x_2) \mid x_1^2 + x_2^2 > 1\}$
   $S_3 = \{(0, 0), (1, 0), (1, 1), (0, 1)\}$
   $S_4 = \{(x_1, x_2) \mid |x_1| + |x_2| < 1\}$

2. Find the minimum (if any) of the following:


• $\min \frac{1}{x}$ for $x \in [1, \infty)$
• $\min x$ for $x \in (0, 1]$
3. Determine the supremum, infimum, maximum and minimum of the following functions:
• $f(x) = e^x$, $x \in \mathbb{R}$
• Let $f : L \subset \mathbb{R}^2 \to \mathbb{R}$. The set $L$ is defined by the disc $x_1^2 + x_2^2 \le 1$ in $\mathbb{R}^2$ and $f(x_1, x_2) = e^{x_1^2 + x_2^2}$.

4. In each of the following cases, sketch the level sets $L_b$ of the function $f$:
   • $f(x_1, x_2) = x_1 + x_2$, $b = 1, 2$
   • $f(x_1, x_2) = x_1 x_2$, $b = 1, 2$
   • $f(x) = e^x$, $b = 10, 0$
5. Let L be a convex set in Rn , A be an m × n matrix and α a scalar. Show that the
following sets are convex.
• {y : y = Ax, x ∈ L}

• {αx : x ∈ L}
6. Suppose two points $\mathbf{x}_1$ and $\mathbf{x}_2$, with $\mathbf{x}_1 \ne \mathbf{x}_2$, both solve a system of linear equations $A\mathbf{x} = \mathbf{b}$. Prove that the line that passes through these two points is in the affine set.
7. Prove that a halfspace is convex.
Chapter 2

One Dimensional Unconstrained and Bound Constrained Problems

The one dimensional unconstrained problem is defined by:

$$\underset{x \in L}{\text{minimise}}\; f(x), \qquad L \subseteq \mathbb{R}, \tag{2.1}$$

where $f$ is a continuous and twice differentiable function. If $L$ is an interval, then $x$ is bound constrained. If $L$ is indeed $\mathbb{R}$, then the problem is unconstrained.

2.1 Unimodal and Multimodal

A function is said to be monotonic increasing along a given path when $f(x_2) > f(x_1)$, and monotonic decreasing when $f(x_2) < f(x_1)$, for all points in the domain for which $x_2 > x_1$. For the situations in which $f(x_2) \ge f(x_1)$ and $f(x_2) \le f(x_1)$ the functions are respectively called monotonic non-decreasing and monotonic non-increasing.

For example, if we consider $f(x) = x^2$, where $x > 0$, then $f(x)$ is monotonic increasing. For $x < 0$, $f(x)$ is monotonic decreasing. A function that has a single minimum or a single maximum (a single peak) is known as a unimodal function. Functions with two peaks (two minima or two maxima) are called bimodal, and functions with many peaks are known as multimodal functions.


2.2 Convex Functions

Definition 2.1 (Convex Functions). A function $f : L \subset \mathbb{R}^n \to \mathbb{R}$ defined on the convex set $L$ is convex if for all $\mathbf{x}_1, \mathbf{x}_2 \in L$ and all $\theta \in [0, 1]$:

$$f(\theta\mathbf{x}_2 + (1 - \theta)\mathbf{x}_1) \le \theta f(\mathbf{x}_2) + (1 - \theta) f(\mathbf{x}_1). \tag{2.2}$$

Consider a function of a single variable. We may then think of this simply as: the function $f$ is convex if the chord connecting $x_1$ and $x_2$ lies above the graph. See the Figure below.

Note: the function is strictly convex if strict inequality ($<$) applies; $f$ is concave if $-f$ is convex.

Figure 2.1: An Example of a Convex Function.
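Inequality (2.2) is easy to probe numerically. The following is a minimal illustrative sketch (the sample range and random seed are arbitrary choices, not part of the formal development) that samples random pairs of points and confirms the chord lies on or above the graph for f(x) = x²:

import numpy as np

f = lambda x: x**2                              # a convex function
rng = np.random.default_rng(0)
for _ in range(1000):
    x1, x2 = rng.uniform(-5, 5, size=2)
    t = rng.uniform()                           # theta in [0, 1]
    chord = t * f(x2) + (1 - t) * f(x1)         # right-hand side of (2.2)
    assert f(t * x2 + (1 - t) * x1) <= chord + 1e-12
print("Equation (2.2) held at every sampled point")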

2.3 Global Extrema

We shall summarise some definitions that were previously mentioned. We begin with the concept of global optimisation (maximisation or minimisation). In the context of optimisation, relative optima (maximum or minimum) are normally referred to as local optima.

Global Optima
A point at which $f(x)$ attains its greatest (or least) value on an interval $[a, b]$ is called a point of global maximum (or minimum). In general, however, a function $f(x)$ takes on its absolute (global) maximum (minimum) at the point $x^*$ if $f(x) < f(x^*)$ ($f(x) > f(x^*)$) for all $x$ over which the function $f(x)$ is defined.
Local Optima

$f(x)$ has a strong local (relative) maximum (minimum) at an interior point $x^* \in (a, b)$ if $f(x) < f(x^*)$ ($f(x) > f(x^*)$) for all $x$ in some neighbourhood of $x^*$. The maximum (minimum) is weak if $\le$ replaces $<$. Strong local optima are illustrated in the Figure below. If a function $f(x)$ has a strong relative maximum at some point $x^*$, then there is an interval including $x^*$, no matter how small, such that for all $x$ in this interval $f(x)$ is strictly less than $f(x^*)$, i.e. $f(x) < f(x^*)$. It is the 'strictly less' that makes this a strong relative maximum. If, however, the strictly-less sign is replaced by a $\le$ sign, i.e. $f(x) \le f(x^*)$, then the value at $x^*$ is a weak maximum and $x^*$ is a weak maximiser.

2.4 Necessary and Sufficient Conditions

A necessary condition for $f(x)$ to have a maximum or a minimum at an interior point $x^* \in (a, b)$ is that the slope at this point must be zero, i.e.:

$$f'(x) = \frac{d f(x)}{dx} = 0, \tag{2.3}$$

which corresponds to the first order necessary condition (FONC). The FONC is necessary but not sufficient. For example, consider the function $f(x) = x^3$. At $x = 0$, $f'(x) = 0$ but there is no maximum or minimum point on the interval $(-\infty, \infty)$. At $x = 0$ there is a point of inflection, where $f'' = 0$. Therefore, the point $x = 0$ is a stationary point but not a local optimum.

Thus, in addition to the FONC, positive curvature is also required at $x^*$, i.e. the second order condition:

$$f''(x) = \frac{d^2 f(x)}{dx^2} > 0, \tag{2.4}$$

must hold at $x^*$ for a strong local minimum. This is known as the second order sufficient condition (SOSC).
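The $f(x) = x^3$ discussion can be reproduced symbolically; here is a minimal sketch using sympy (one convenient tool choice, not prescribed by the notes):

import sympy as sp

x = sp.symbols('x')
f = x**3
stationary = sp.solve(sp.diff(f, x), x)      # FONC: f'(x) = 3x^2 = 0 gives x = 0
curvature = sp.diff(f, x, 2).subs(x, 0)      # f''(0) = 0, so the SOSC fails
print(stationary, curvature)                 # [0] 0 -> a point of inflection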

2.4.1 Exercises

1. If the convexity condition for any real-valued function $f : \mathbb{R} \to \mathbb{R}$ is given by:

$$\lambda f(b) + (1 - \lambda) f(a) \ge f(\lambda b + (1 - \lambda)a), \quad a, b \in \mathbb{R},$$

then using the above, prove that the following one dimensional functions are convex:
• $f_1(x) = 1 + x^2$
• $f_2(x) = x^2 - 1$

2. Find all stationary points of:

$$f(x) = x^3(3x^2 - 5) + 1,$$

and decide the maximiser, minimiser and the point of inflection, if any.

3. Using the FONC of optimality, determine the optimiser of the following functions:
• $f(x) = \frac{1}{3}x^3 - \frac{7}{2}x^2 + 12x + 3$
• $f(x) = 2(x - 1)^2 + x^2$

4. Using necessary and sufficient conditions for optimality, investigate the maximiser/minimiser of:

$$f(x) = -(x - 1)^4.$$

5. The function $f(x) = \max\{0.5, x, x^2\}$ is convex. True or false?

6. The function $f(x) = \min\{0.5, x, x^2\}$ is concave. True or false?

7. The function $f(x) = \dfrac{x^2 + 2}{x + 2}$ is concave. True or false?

8. Determine the maximum and minimum values of the function:

$$f(x) = 12x^5 - 45x^4 + 40x^3 + 5$$


Chapter 3

Numerical Solutions to Nonlinear Equations

It is often necessary to find the stationary point(s) of a given function $f(x)$. This means finding a root of the nonlinear function $g(x) = f'(x)$; in other words, solving $g(x) = 0$. Here we introduce Newton's method. This method is important because when we cannot solve $f'(x) = 0$ analytically, we can solve it numerically.

3.1 Newton’s Method

Newton’s method is one of the more powerful and well known numerical methods
for finding a root of g (x) i.e. for solving for x such that g (x) = 0. So we can use it to
find the turning point i.e., when f ′ (x) = 0. In the context of optimization we want an
x ∗ such that f ′ (x ∗ ) = 0. Consider the figure below:
Suppose at some stage we have obtained the point x n as an approximation to the
root at x ∗ (initially this is a guess). Newton observed that if g (x) was a straight line
through (x n , g (x n )) with slope = g ′ (x n ) = g ′ (x) = constant for all x then the equation
of the straight line could be found and the root read off. Obviously there would be
no problem if g (x) was actually a straight line; however the tangent might still be
a good approximation (as seen in the Figure above). If we regard the tangent as a
model of the function g (x) and we have an approximation x n then we can produce
a better approximation x n+1 . The method can be applied again and again, to give a
sequence of values, each approximating x ∗ with more and more certainty.
The tangent to the curve of $g(x)$ at $(x_n, g(x_n))$ has slope:

$$g'(x_n) = \frac{y - g(x_n)}{x - x_n}. \tag{3.1}$$

When $y = 0$, let


Figure 3.1: Newton’s Method in Action.

$x = x_{n+1}$ (the point where the tangent cuts the $x$-axis). Thus the Newton formula for root finding is:

$$x_{n+1} = x_n - \frac{g(x_n)}{g'(x_n)}, \qquad g'(x_n) \ne 0. \tag{3.2}$$

Hence the Newton method can be described by the following two steps:
1. Choose an initial 'guess' $x_0$.
2. Iterate by $x_{n+1} = x_n - f'(x_n)/f''(x_n)$.
For a suitable choice of $x_0$, $x_n$ converges to a turning point.

3.1.0.1 Example

Find the minimiser of

$$f(x) = \frac{x^4}{4} + \frac{x^2}{2} - 3x$$

near $x = 2$, i.e. the root of its derivative $g(x) = f'(x) = x^3 + x - 3$. Finding this root yields the minimiser of $f(x)$, in this case approximately $x = 1.21341$. Check: perform a few iterations of Equation (3.2) to verify the solution.
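The check can be carried out in a few lines of Python; this is a minimal sketch with g = f' and g' = f'' hard-coded for this example, and five iterations chosen arbitrarily:

def g(x):  return x**3 + x - 3        # g(x) = f'(x)
def gp(x): return 3 * x**2 + 1        # g'(x) = f''(x)

x = 2.0                               # initial guess near the minimiser
for k in range(5):
    x = x - g(x) / gp(x)              # Newton update, Equation (3.2)
    print(k + 1, x)
# the iterates 1.4615..., 1.2477..., 1.2142..., 1.21341... converge rapidly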

Figure 3.2: Function f(x) and its derivative g(x).

3.1.1 Advantages and Disadvantages of Newton’s Method

Advantages

• The method is very fast when it works (quadratic convergence).

Disadvantages:

• The number of steps needed for a required accuracy is unknown in advance, unlike the Bisection method, for example.
• $f$ must be at least twice differentiable.
• The method runs into problems when $g'(x^*) = 0$.
• It can be difficult to compute $g(x)$ and $g'(x)$, even when they exist.

In general Newton's Method is fast, reliable and trouble-free, but one has to be mindful of the potential problems.

3.2 Secant Method

Here we try to avoid the problem of computing $f''(x)$ by approximating it with

$$\frac{f'(x_n) - f'(x_{n-1})}{x_n - x_{n-1}}.$$

This gives the updating formula:

$$x_{n+1} = x_n - f'(x_n)\,\frac{x_n - x_{n-1}}{f'(x_n) - f'(x_{n-1})}. \tag{3.3}$$

The other disadvantages of Newton's method still apply.

Figure 3.3: Secant Method
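A minimal sketch of the update in Equation (3.3), applied to the g(x) = f'(x) from the previous example (the two starting points and the tolerance are illustrative choices):

def secant(g, x0, x1, tol=1e-8, max_iter=50):
    # approximates Newton's method without needing g'(x)
    for _ in range(max_iter):
        x2 = x1 - g(x1) * (x1 - x0) / (g(x1) - g(x0))   # Equation (3.3)
        if abs(x2 - x1) < tol:
            return x2
        x0, x1 = x1, x2
    return x1

print(secant(lambda x: x**3 + x - 3, 2.0, 1.9))   # ~1.21341, as before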

3.2.1 Exercises

1. Beginning with x = 0, apply Newton’s Method to find the solution of

3x − sin(x) − exp(x) = 0,

up to four iterations. Here x is in radians.


2. Find the critical point(s) of the function

$$f(x) = \frac{1}{4}x^4 + \frac{1}{2}x^2 - 2x + 1,$$

using both Newton’s Method and the Secant Method. If the critical point is a
minimiser then obtain the minimum value. You may assume x = 2 as an initial
guess.
Chapter 4

Numerical Optimisation of Univariate Functions

The simplest functions with which to begin a study of non-linear optimisation methods are those with a single independent variable. Although the minimisation of univariate functions is in itself of some practical importance, the main area of application for these techniques is as a subproblem of multivariate minimisation.
There are functions to be minimised where the variable $x$ is unrestricted (say, $x \in \mathbb{R}$); there are also functions to be optimised over a finite interval (in $n$ dimensions, a box). Single variable optimisation over a finite interval is important because of its application in multivariate optimisation. In this chapter we will consider one dimensional optimisation.
If one needs to find the maximum or minimum (i.e. the optimal) value of a function
f (x) on the interval [a, b] the procedure would be:

1. Find all turning (stationary) points of f (x) (assuming f (x) is differentiable) on


[a, b] and then decide the optimum.
2. Find the optimal turning point of f (x) on [a, b].

Generally it may be difficult, impossible or tiresome to implement step 1 analytically, so we resort to the computer and an appropriate numerical method to find an optimal (hopefully the best estimate!) solution of a univariate function. In the next section we introduce some numerical techniques. The numerical approach is mandatory when the function $f(x)$ is not given explicitly.

In many cases one would like to find the minimiser of a function $f(x)$ when neither $f(x)$ nor $f'(x)$ is given (or known) explicitly; then numerical approaches, viz. polynomial interpolation or function comparison methods, are used. These are the univariate optimisation techniques used as line searches in multivariate optimisation.


4.1 Techniques Using Function Evaluations

4.1.1 Bisection Method

We assume that an interval $[a, b]$ is given and that a local minimum $x^* \in [a, b]$. When the first derivative of the objective function $f(x)$ is known at $a$ and $b$, it is necessary to evaluate function information at only one interior point in order to reduce this interval. This is because it is possible to decide whether an interval brackets a minimum simply by looking at the function values $f(a)$, $f(b)$ and the derivatives $f'(a)$, $f'(b)$ at the extreme points $a$ and $b$. The conditions to be satisfied are:

• f ′ (a) < 0 and f ′ (b) > 0.


• f ′ (a) < 0 and f (b) > f (a).
• f ′ (a) > 0 and f ′ (b) > 0 and f (b) < f (a).

These three situations are illustrated in the Figures below. The next step of the bisection method is to reduce the interval. At the $k$-th iteration we have an interval $[a_k, b_k]$ and the mid-point $c_k = \frac{1}{2}(a_k + b_k)$ is computed. The next interval will be called $[a_{k+1}, b_{k+1}]$, which is either $[a_k, c_k]$ or $[c_k, b_k]$, depending on which interval brackets the minimum. The process continues until two consecutive intervals produce minima which are within an acceptable tolerance.

Figure 4.1: Condition 1



Figure 4.2: Condition 2

Figure 4.3: Condition 3
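A minimal sketch of this interval-halving scheme, assuming the simplest bracketing condition f'(a) < 0 < f'(b) and using a fixed tolerance rather than the consecutive-minima test described above:

def bisection_min(fprime, a, b, tol=1e-6):
    # assumes f'(a) < 0 < f'(b), so [a, b] brackets a minimum
    while b - a > tol:
        c = 0.5 * (a + b)              # midpoint c_k
        if fprime(c) > 0:
            b = c                      # minimum lies in [a, c]
        else:
            a = c                      # minimum lies in [c, b]
    return 0.5 * (a + b)

# exercise below: f(x) = -x^3/3 - x^2/2 + 2x - 5 on [-3, 0], f'(x) = -x^2 - x + 2
print(bisection_min(lambda x: -x**2 - x + 2, -3.0, 0.0))   # -> -2.0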



4.1.1.1 Exercise

Find the minimum value of:

$$f(x) = -\frac{1}{3}x^3 - \frac{1}{2}x^2 + 2x - 5,$$

over the domain $[-3, 0]$ using the bisection method. The problem has a minimum value of $-8.33$ at $x = -2$.

4.1.2 Golden Search Method

Suppose $f : \mathbb{R} \to \mathbb{R}$ on the interval $[a, b]$ and $f$ has only one minimum, at $x^*$ (we say $f$ is unimodal). The problem is to locate $x^*$. The method we now discuss is based on evaluating the objective function at different points in the interval $[a, b]$. We choose these points in such a way that an approximation to the minimiser of $f$ may be achieved in as few evaluations as possible. Our goal is to progressively narrow down the subinterval containing $x^*$. If we evaluate $f$ at only one intermediate point of the interval $[a, b]$, we cannot narrow the range within which we know the minimiser is located. We have to evaluate $f$ at two intermediate points, chosen so that the reduction in the range is symmetrical, i.e. such that:

$$\rho = \frac{x_1 - a}{b - a} = \frac{b - x_2}{b - a}.$$

We then evaluate $f$ at the intermediate points.
Case I:
If $f(x_1) > f(x_2)$, then the minimiser is located in the range $[a, x_1]$. We then need to update the interval and calculate the updated interior points at the next iteration.
Case II:
If, on the other hand, $f(x_1) < f(x_2)$, then the minimiser must lie in the range $[x_2, b]$. We then need to update the interval and calculate the updated interior points at the next iteration.

Starting with the reduced range of uncertainty we can repeat the process and similarly find two new interior points. We would like to minimise the number of function evaluations while reducing the width of the interval of uncertainty. Suppose that $f(x_1) < f(x_2)$. Then we know that $x^* \in [x_2, b]$. Because $x_1$ is already in the uncertainty interval and $f(x_1)$ is known, we can reuse this information: we make the new $x_2$ coincide with $x_1$. Thus only one new evaluation of $f$, at the new $x_1$, is necessary.

If $f(x_1) > f(x_2)$, then the minimiser must lie in the range $[a, x_1]$. Because $x_2$ is already in the uncertainty interval and $f(x_2)$ is known, we can reuse this information: we make the new $x_1$ coincide with $x_2$. Thus only one new interior point $x_2$, and the corresponding function value $f(x_2)$, is needed.

We let $L_0$ represent the span of the interval, that is $L_0 = b - a$, $L_1 = x_1 - a$, and $L_2 = b - x_1$. Then $L_0$ can also be seen as the sum of $L_1$ and $L_2$.

Figure 4.4: Golden Search Interval and Interior Points

Using the two conditions $L_0 = L_1 + L_2$ and $\dfrac{L_1}{L_0} = \dfrac{L_2}{L_1}$, the quadratic formula lets us deduce the ratio:

$$\rho = \frac{\sqrt{5} - 1}{2} = 0.618\ldots$$

This forms the basis of a search algorithm, since the technique is applied again on the reduced interval.

4.1.2.1 Example

Use four iterations of the Golden Section search to find the value of $x$ that minimises:

$$f(x) = x^2 - 6x + 15,$$

on the domain $[0, 10]$.


Answer:
Iteration 1:
We evaluate $f$ at two intermediate points $x_1$ and $x_2$. We have:

$x_1 = a + \rho(b - a) = 6.18$,
$x_2 = b - \rho(b - a) = 3.82$.

We compute:

$f(x_1) = 16.12$,
$f(x_2) = 6.67$.

Thus we have $f(x_1) > f(x_2)$, and so the uncertainty interval is reduced to $[a, x_1] = [0, 6.18]$.
Iteration 2:
We choose $x_1$ to coincide with $x_2$, and $f$ need only be evaluated at one new point:

$$x_2 = b - \rho(b - a) = 2.36.$$

Now we have:

$f(x_1) = 6.67$,
$f(x_2) = 6.41$.

Now $f(x_1) > f(x_2)$, and so the uncertainty interval is reduced to $[a, x_1] = [0, 3.82]$.
Iteration 3:
We set $x_1 = x_2$, and compute $x_2$:

$$x_2 = 1.46.$$

We have:

$f(x_1) = 6.41$,
$f(x_2) = 8.37$.

So we have $f(x_1) < f(x_2)$. Hence, the new interval is $[x_2, b] = [1.46, 3.82]$.


Iteration 4:
We set $x_2 = x_1$, and compute $x_1$:

$$x_1 = 2.92.$$

We have:

$f(x_1) = 6.01$,
$f(x_2) = 6.41$.

Since $f(x_1) < f(x_2)$, the value of $x$ that minimises $f$ is located in the interval $[2.36, 3.82]$.

Golden Search Method (a runnable Python version of the pseudocode):

import math

def golden_search(f, a, b, iterations):
    R = (math.sqrt(5) - 1) / 2          # rho = 0.618...
    x1 = a + R * (b - a)
    x2 = b - R * (b - a)
    f_x1, f_x2 = f(x1), f(x2)
    for i in range(iterations):
        if f_x1 > f_x2:                 # minimiser lies in [a, x1]
            b = x1
            x1, f_x1 = x2, f_x2         # reuse the evaluated interior point
            x2 = b - R * (b - a)
            f_x2 = f(x2)                # one new evaluation per iteration
        else:                           # minimiser lies in [x2, b]
            a = x2
            x2, f_x2 = x1, f_x1         # reuse the evaluated interior point
            x1 = a + R * (b - a)
            f_x1 = f(x1)                # one new evaluation per iteration
    return (b + a) / 2
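For the worked example above, golden_search(lambda x: x**2 - 6*x + 15, 0, 10, 4) returns approximately 3.09, the midpoint of the final interval [2.36, 3.82]; the true minimiser is x = 3.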

4.1.3 Exercises

1. Find the minimum value of the one dimensional function $f(x) = x^2 - 3x\exp(-x)$ over $[0, 1]$, using:
• Bisection Method
• Golden Search Method
Chapter 5

Multivariate Unconstrained Optimisation

Unconstrained optimisation is optimisation when we know we do not have to worry about the boundaries of the feasible set:

$$\min f(\mathbf{x}) \quad \text{s.t.}\ \mathbf{x} \in S, \tag{5.1}$$

where $S$ is the feasible set. It should then be possible to find local minima and maxima just by looking at the behaviour of the objective function; and indeed necessary and sufficient conditions exist. In this chapter these conditions will be derived. The idea of a line in a particular direction is important for any unconstrained optimisation method; we discuss this and derive the slope and curvature of the function $f$ at a point on the line.

5.1 Terminology for Functions of Several Variables

For a function $f(\mathbf{x})$, $\mathbf{x} \in \mathbb{R}^n$, there exists at any point $\mathbf{x}$ a vector of first order partial derivatives, or gradient vector:

$$\nabla f(\mathbf{x}) = \left[\frac{\partial f}{\partial x_1}(\mathbf{x}),\ \frac{\partial f}{\partial x_2}(\mathbf{x}),\ \ldots,\ \frac{\partial f}{\partial x_n}(\mathbf{x})\right]^T = \mathbf{g}(\mathbf{x}). \tag{5.2}$$

It can be shown that if the function f (x) is smooth, then at the point x the gradient
vector ∇ f (x) (denoted by g (x)) is always perpendicular to the contours (or surfaces


Figure 5.1: Mathematica Demo of Gradient.

of constant function value) and is the direction of maximum increase of f (x) as


seen in the Figure above.

If $f(\mathbf{x})$ is twice continuously differentiable then at the point $\mathbf{x}$ there exists a matrix of second order partial derivatives called the Hessian matrix:

$$\mathbf{H}(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2}(\mathbf{x}) & \dfrac{\partial^2 f}{\partial x_1 \partial x_2}(\mathbf{x}) & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n}(\mathbf{x}) \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1}(\mathbf{x}) & \ddots & & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1}(\mathbf{x}) & \cdots & & \dfrac{\partial^2 f}{\partial x_n^2}(\mathbf{x}) \end{bmatrix} = \nabla^2 f(\mathbf{x}) \tag{5.3}$$

5.1.0.1 Example

Let $f(x_1, x_2) = 5x_1 + 8x_2 + x_1x_2 - x_1^2 - 2x_2^2$. Then:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} 5 + x_2 - 2x_1 \\ 8 + x_1 - 4x_2 \end{bmatrix},$$

and

$$\nabla^2 f(\mathbf{x}) = \begin{bmatrix} -2 & 1 \\ 1 & -4 \end{bmatrix}.$$
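The gradient and Hessian above can be verified symbolically; here is a minimal sketch using sympy (one convenient tool, not required by the notes):

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = 5*x1 + 8*x2 + x1*x2 - x1**2 - 2*x2**2

grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2)])   # gradient vector
H = sp.hessian(f, (x1, x2))                           # Hessian matrix
print(grad)   # Matrix([[5 - 2*x1 + x2], [8 + x1 - 4*x2]])
print(H)      # Matrix([[-2, 1], [1, -4]])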

Definition 5.1 (Feasible Direction). A vector $\mathbf{d} \in \mathbb{R}^n$, $\mathbf{d} \ne \mathbf{0}$, is a feasible direction at $\mathbf{x} \in S$ if there exists $\alpha_0 > 0$ such that $\mathbf{x} + \alpha\mathbf{d} \in S$ for all $\alpha \in [0, \alpha_0]$.

Definition 5.2 (Directional Derivative). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a real-valued function and let $\mathbf{d}$ be a feasible direction at $\mathbf{x} \in S$. The directional derivative of $f$ in the direction of $\mathbf{d}$, denoted by $\mathbf{d}^T \nabla f(\mathbf{x})$, is given by:

$$\mathbf{d}^T \nabla f(\mathbf{x}) = \lim_{\alpha \to 0} \frac{f(\mathbf{x} + \alpha\mathbf{d}) - f(\mathbf{x})}{\alpha}. \tag{5.4}$$

If $\|\mathbf{d}\| = 1$, then $\mathbf{d}^T \nabla f(\mathbf{x})$ is the rate of increase of $f$ at $\mathbf{x}$ in the direction $\mathbf{d}$. To compute the above directional derivative, suppose that $\mathbf{x}$ and $\mathbf{d}$ are given. Then $f(\mathbf{x} + \alpha\mathbf{d})$ is a function of $\alpha$, and:

$$\mathbf{d}^T \nabla f(\mathbf{x}) = \left.\frac{d}{d\alpha} f(\mathbf{x} + \alpha\mathbf{d})\right|_{\alpha = 0}. \tag{5.5}$$
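Here is a numerical sketch of Definition 5.2, reusing the gradient of the example function from Section 5.1 (the helper name and the test point are illustrative assumptions):

import numpy as np

def directional_derivative(grad, x, d):
    d = np.asarray(d, dtype=float)
    d = d / np.linalg.norm(d)          # unit direction gives the rate of increase
    return grad(x) @ d

# gradient of f = 5x1 + 8x2 + x1*x2 - x1^2 - 2x2^2
grad = lambda x: np.array([5 + x[1] - 2*x[0], 8 + x[0] - 4*x[1]])
print(directional_derivative(grad, np.array([0.0, 0.0]), [1.0, 0.0]))   # 5.0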

5.2 A Line in a Particular Direction in the Context of Optimisation

A line is a set of points $\mathbf{x}$ such that:

$$\mathbf{x} = \mathbf{x}' + \alpha\mathbf{d}, \quad \forall\, \alpha, \tag{5.6}$$

where $\mathbf{d}$ and $\mathbf{x}'$ are given. For $\alpha \ge 0$, Equation (5.6) is a half-line. The point $\mathbf{x}'$ is a fixed point (corresponding to $\alpha = 0$) along the line, and $\mathbf{d}$ is the direction of the line. For instance, if we take the fixed point $\mathbf{x}'$ to be $(2, 2)^T$ and the direction $\mathbf{d} = (3, 1)^T$, then the Figure below shows the line in the direction of $\mathbf{d}$.

The vector $\mathbf{d}$ is indicated by the arrow. We may normalise the vector $\mathbf{d}$ so that $\mathbf{d}^T\mathbf{d} = \sum_i d_i^2 = 1$; this does not change the line, only the value of $\alpha$ associated with any point along the line. For example:

import numpy as np

d = np.array([3.0, 1.0])
alpha = np.linalg.norm(d)
print('Alpha is:', alpha)                                     # 3.1622776601683795
norm_d = d / alpha
print('The normalised vector d is:', norm_d)                  # [0.9486833  0.31622777]
print('The normalised d^T d gives:', norm_d @ norm_d)         # 0.9999999999999999
print('So alpha x normalised d returns d:', alpha * norm_d)   # [3. 1.]

Figure 5.2: An Example of a Line in a Particular Direction.



We now use the gradient and the Hessian of $f(\mathbf{x})$ to derive the derivative of $f(\mathbf{x})$ along a line of any direction. For a fixed line of a given direction, as in Equation (5.6), the points on the line are a function of $\alpha$ only. Hence a change in $\alpha$ causes a change in all coordinates of $\mathbf{x}(\alpha)$. The derivative of $f(\mathbf{x})$ with respect to $\alpha$ is:

$$\frac{d f(\mathbf{x}(\alpha))}{d\alpha} = \frac{\partial f(\mathbf{x}(\alpha))}{\partial x_1}\frac{d x_1(\alpha)}{d\alpha} + \frac{\partial f(\mathbf{x}(\alpha))}{\partial x_2}\frac{d x_2(\alpha)}{d\alpha} + \cdots + \frac{\partial f(\mathbf{x}(\alpha))}{\partial x_n}\frac{d x_n(\alpha)}{d\alpha} \tag{5.7}$$

Equation (5.7) represents the derivative of $f(\mathbf{x})$ at any point $\mathbf{x}(\alpha)$ along the line. The operator $\frac{d}{d\alpha}$ can be expressed as:

$$\frac{d}{d\alpha} = \frac{\partial}{\partial x_1}\frac{d x_1}{d\alpha} + \frac{\partial}{\partial x_2}\frac{d x_2}{d\alpha} + \cdots + \frac{\partial}{\partial x_n}\frac{d x_n}{d\alpha} = \mathbf{d}^T\nabla \tag{5.8}$$

The slope of $f(\mathbf{x})$ at $\mathbf{x}(\alpha)$ can be written as:

$$\frac{df}{d\alpha} = \mathbf{d}^T \nabla f(\mathbf{x}(\alpha)) = \nabla f(\mathbf{x}(\alpha))^T \mathbf{d}. \tag{5.9}$$

Likewise, the curvature of $f(\mathbf{x}(\alpha))$ along the line:

$$\frac{d^2 f}{d\alpha^2} = \frac{d}{d\alpha}\left(\frac{d f(\mathbf{x}(\alpha))}{d\alpha}\right) = \mathbf{d}^T \nabla\left(\nabla f^T \mathbf{d}\right) = \mathbf{d}^T \nabla^2 f\, \mathbf{d}, \tag{5.10}$$

where $\nabla f$ and $\nabla^2 f$ are evaluated at $\mathbf{x}(\alpha)$. These (slope and curvature), when evaluated at $\alpha = 0$, are respectively known as the derivative (also called the slope, since $f = f(\alpha)$ is now a function of the single variable $\alpha$) and curvature of $f$ at $\mathbf{x}'$ in the direction of $\mathbf{d}$.

5.2.0.1 Example

Let us consider Rosenbrock's function:

$$f(\mathbf{x}) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2 \tag{5.11}$$

If $\mathbf{x}' = \mathbf{0}$, show that the slope of $f(\mathbf{x})$ along the line generated by $\mathbf{d} = \binom{1}{0}$ is $\mathbf{d}^T\nabla f = -2$ and the curvature is $\mathbf{d}^T G \mathbf{d} = 2$, where $G = \nabla^2 f(\mathbf{x}')$.

Solution:

$$\nabla f = \begin{bmatrix} -400x_1(x_2 - x_1^2) - 2(1 - x_1) \\ 200(x_2 - x_1^2) \end{bmatrix} = \begin{bmatrix} -2 \\ 0 \end{bmatrix}$$

Therefore $\mathbf{d}^T\nabla f = [1\ 0] \times [-2\ 0]^T = -2$. Next:

$$\nabla^2 f = \begin{bmatrix} -400(x_2 - x_1^2) + 800x_1^2 + 2 & -400x_1 \\ -400x_1 & 200 \end{bmatrix} = \begin{bmatrix} 2 & 0 \\ 0 & 200 \end{bmatrix}$$

Thus $\mathbf{d}^T G \mathbf{d} = [1\ 0] \times \begin{bmatrix} 2 & 0 \\ 0 & 200 \end{bmatrix} \times [1\ 0]^T = 2$.
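The slope and curvature can also be confirmed numerically; here is a minimal sketch with the gradient and Hessian typed in directly:

import numpy as np

def grad(x):
    return np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                     200*(x[1] - x[0]**2)])

def hess(x):
    return np.array([[-400*(x[1] - x[0]**2) + 800*x[0]**2 + 2, -400*x[0]],
                     [-400*x[0], 200.0]])

xp, d = np.zeros(2), np.array([1.0, 0.0])
print(d @ grad(xp))        # slope d^T grad f = -2.0
print(d @ hess(xp) @ d)    # curvature d^T G d = 2.0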

These definitions of slope and curvature depend on the size of $\mathbf{d}$, and this ambiguity can be resolved by requiring that $\|\mathbf{d}\| = 1$. Hence Equation (5.9) is the directional derivative in the direction of a unit vector $\mathbf{d}$, given by $\mathbf{d}^T\nabla f(\mathbf{x})$. Likewise the curvature along the line in the direction of the unit vector is given by $\mathbf{d}^T\nabla^2 f(\mathbf{x})\mathbf{d}$.

Since $\mathbf{x}(\alpha) = \mathbf{x}' + \alpha\mathbf{d}$, at $\alpha = 0$ we have $\mathbf{x}(0) = \mathbf{x}'$. Therefore the function value is $f(\mathbf{x}(0)) = f(\mathbf{x}')$, the slope at $\alpha = 0$ in the direction of $\mathbf{d}$ is $f'(0) = \mathbf{d}^T\nabla f(\mathbf{x}')$, and the curvature at $\alpha = 0$ is $f''(0) = \mathbf{d}^T G(\mathbf{x}')\mathbf{d}$.

5.3 Taylor Series for Multivariate Function

In the context of optimisation involving a smooth function $f(\mathbf{x})$, the Taylor series is indispensable. Since $\mathbf{x} = \mathbf{x}(\alpha) = \mathbf{x}' + \alpha\mathbf{d}$ for a fixed point $\mathbf{x}'$ and a given direction $\mathbf{d}$, $f(\mathbf{x})$ at $\mathbf{x}(\alpha)$ becomes a function of the single variable $\alpha$. Hence $f(\mathbf{x}) = f(\mathbf{x}(\alpha)) = f(\alpha)$. Therefore, expanding the Taylor series around zero we have:

$$f(\alpha) = f(0 + \alpha) = f(0) + \alpha f'(0) + \frac{1}{2}\alpha^2 f''(0) + \cdots \tag{5.12}$$

But $f(\alpha) = f(\mathbf{x}' + \alpha\mathbf{d})$ is the value of the function $f(\mathbf{x})$ of many variables along the line $\mathbf{x}(\alpha)$. Hence we can re-write Equation (5.12) as:

$$f(\mathbf{x}' + \alpha\mathbf{d}) = f(\mathbf{x}') + \alpha\mathbf{d}^T\nabla f(\mathbf{x}') + \frac{1}{2}\alpha^2\mathbf{d}^T\left[\nabla^2 f(\mathbf{x}')\right]\mathbf{d} + \cdots \tag{5.13}$$
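For a quadratic function the expansion in Equation (5.13) is exact, which makes it easy to check numerically; the function, point and direction below are arbitrary illustrative choices:

import numpy as np

f    = lambda x: x[0]**2 + 3*x[1]**2
grad = lambda x: np.array([2*x[0], 6*x[1]])
H    = np.array([[2.0, 0.0], [0.0, 6.0]])

xp, d, a = np.array([1.0, 1.0]), np.array([1.0, 0.0]), 0.1
taylor = f(xp) + a * (d @ grad(xp)) + 0.5 * a**2 * (d @ H @ d)
print(f(xp + a*d), taylor)    # 4.21 4.21 -- identical, since f is quadratic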

5.4 Quadratic Forms

The quadratic function in $n$ variables may be written as:

$$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A\mathbf{x} + \mathbf{b}^T\mathbf{x} + c, \tag{5.14}$$

where $c \in \mathbb{R}$, $\mathbf{b}$ is a real $n$-vector and $A$ is an $n \times n$ real matrix that can be chosen in a non-unique manner. It is usually chosen symmetric, in which case it follows that:

$$\mathbf{x}^T A\mathbf{x} = a_{11}x_1^2 + 2a_{12}x_1x_2 + a_{22}x_2^2 + 2a_{13}x_1x_3 + \cdots + a_{nn}x_n^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}x_i x_j, \tag{5.15}$$

and:

$$\nabla f(\mathbf{x}) = A\mathbf{x} + \mathbf{b}; \qquad \mathbf{H}(\mathbf{x}) = A. \tag{5.16}$$

The form is said to be positive definite if $\mathbf{x}^T A\mathbf{x} > 0$ for all $\mathbf{x} \ne \mathbf{0}$, and positive semi-definite if $\mathbf{x}^T A\mathbf{x} \ge 0$ for all $\mathbf{x}$. Similar definitions apply to negative definite and negative semi-definite with the inequalities reversed.

5.4.0.1 Example

Write $A(\mathbf{x}) = x_1^2 + 5x_1x_2 + 4x_2^2$ in matrix form.

Solution:

$$A(\mathbf{x}) = (x_1, x_2)\begin{pmatrix} 1 & \frac{5}{2} \\ \frac{5}{2} & 4 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
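The symmetric matrix can be checked at any test point; a minimal sketch (the point is an arbitrary choice):

import numpy as np

A = np.array([[1.0, 2.5],
              [2.5, 4.0]])                   # the symmetric choice above
x = np.array([1.3, -0.7])
print(x @ A @ x)                             # -0.9
print(x[0]**2 + 5*x[0]*x[1] + 4*x[1]**2)     # -0.9, the same value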

5.5 Stationary Points

In the following chapters we will be concerned with gradient based minimisation methods. Therefore, we only consider the minimisation of smooth functions. We will not consider non-smooth minima, as they do not satisfy the same conditions as smooth minima. We will, however, consider the case of a saddle point. Hence, we assume that the first and second derivatives exist.

We can classify definiteness by looking at the eigenvalues of ∇2 f (x). Specifically:

• If ∇2 f (x∗ ) is indefinite, i.e. all λi are mixed sign, then x∗ is a saddle point.
• If ∇2 f (x∗ ) is positive definite, i.e. all λi > 0, then x∗ is a minimum.
• If ∇2 f (x∗ ) is negative definite, i.e. all λi < 0, then x∗ is a maximum.
• If ∇2 f (x∗ ) is positive semi-definite, i.e. all λi ≥ 0, then x∗ may lie along a valley of minima (a half cylinder, as in the Figure below).

These can be seen in the Figure below:

In summary:

Let G = ∇2 f (x), i.e. the Hessian.

• G(x) is positive semi-definite if xᵀGx ≥ 0, ∀x
• G(x) is negative semi-definite if xᵀGx ≤ 0, ∀x
• G(x) is positive definite iff xᵀGx > 0, ∀x ≠ 0
• G(x) is negative definite iff xᵀGx < 0, ∀x ≠ 0
• G(x) is indefinite iff xᵀGx takes both negative and positive values

and:

• f (x) is concave iff G(x) is negative semi-definite
• f (x) is strictly concave if G(x) is negative definite
• f (x) is convex iff G(x) is positive semi-definite
• f (x) is strictly convex if G(x) is positive definite

Figure 5.3: Positive Definite.



Figure 5.4: Negative Definite.



Figure 5.5: Positive Semi Definite.



Figure 5.6: Indefinite.



5.5.1 Tests for Positive Definiteness

There are a number of ways to test for positive or negative definiteness, namely:

5.5.1.1 Compute the Eigenvalues

5.5.1.1.1 Example

Classify the stationary points of the function:

$$f(\mathbf{x}) = 2x_1^2 + x_1x_2^2 + x_2^2.$$

Solution:

The stationary points are the solutions of:

$$\frac{\partial f}{\partial x_1} = 4x_1 + x_2^2 = 0,$$
$$\frac{\partial f}{\partial x_2} = 2x_1x_2 + 2x_2 = 0,$$

which gives $\mathbf{x}_1 = (0, 0)^T$, $\mathbf{x}_2 = (-1, 2)^T$ and $\mathbf{x}_3 = (-1, -2)^T$. The Hessian matrix is:

$$G = \begin{pmatrix} 4 & 2x_2 \\ 2x_2 & 2x_1 + 2 \end{pmatrix}$$

Thus:

$$G_1 = \begin{pmatrix} 4 & 0 \\ 0 & 2 \end{pmatrix}$$

The eigenvalues are the solutions of:

$$(4 - \lambda)(2 - \lambda) = 0,$$

which gives $\lambda = 4, 2$. Thus $\mathbf{x}_1$ corresponds to a minimum. Similarly:


$$G_2 = \begin{pmatrix} 4 & 4 \\ 4 & 0 \end{pmatrix}$$

has eigenvalues:

$$\lambda = 2 + \sqrt{20},\ 2 - \sqrt{20}.$$

Thus $\mathbf{x}_2$ corresponds to a saddle point. Finally:

$$G_3 = \begin{pmatrix} 4 & -4 \\ -4 & 0 \end{pmatrix}$$

has the same eigenvalues as $G_2$ and therefore $\mathbf{x}_3$ corresponds to a saddle point.
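The classification can be reproduced with numpy's symmetric eigenvalue routine; here is a minimal sketch:

import numpy as np

hessians = {"(0,0)":   np.array([[4.0, 0.0], [0.0, 2.0]]),
            "(-1,2)":  np.array([[4.0, 4.0], [4.0, 0.0]]),
            "(-1,-2)": np.array([[4.0, -4.0], [-4.0, 0.0]])}
for point, G in hessians.items():
    lam = np.linalg.eigvalsh(G)          # eigenvalues of a symmetric matrix
    if lam.min() > 0:
        kind = "minimum"
    elif lam.max() < 0:
        kind = "maximum"
    else:
        kind = "saddle point"
    print(point, lam, "->", kind)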



5.5.1.2 Principal Minors

From the Hessian we can compute the determinants of all leading principal minors. If these are all greater than zero, then the Hessian is positive definite. Utilising the example above, if:

$$G_1 = \begin{pmatrix} 4 & 0 \\ 0 & 2 \end{pmatrix}$$

then the first principal minor is just $\det(4) = 4 > 0$. The second and final principal minor is the determinant of the entire matrix:

$$\det\begin{pmatrix} 4 & 0 \\ 0 & 2 \end{pmatrix} = 8 - 0 > 0.$$

Therefore $G_1$ is positive definite. $G_2$ and $G_3$ are dealt with similarly. However, to prove negative definiteness we need to show $(-1)^k D_k > 0$, where $D_k$ is the determinant of the $k$-th leading principal minor.

This approach is preferable when dealing with large matrices.

5.6 Necessary and Sufficient Conditions

Theorem 5.1 (First Order Necessary Condition (FONC) for Local Maxima/Minima). Suppose $f(\mathbf{x})$ has continuous first partial derivatives at all points of $S \subset \mathbb{R}^n$. If $\mathbf{x}^*$ is an interior point of the feasible set $S$ and $\mathbf{x}^*$ is a local minimum or maximum of $f(\mathbf{x})$, then:

$$\nabla f(\mathbf{x}^*) = \mathbf{0}. \tag{5.17}$$

Alternatively, if $\mathbf{x}^* \in S$ is a local minimum or maximum, then at the point $\mathbf{x}^*$:

$$\frac{\partial f(\mathbf{x}^*)}{\partial x_i} = 0; \quad i = 1, 2, \ldots, n. \tag{5.18}$$

Theorem 5.2 (Second Order Necessary Condition (SONC) for Local Maxima/Minima). Let $f$ be twice continuously differentiable on the feasible set $S$, let $\mathbf{x}^*$ be a local minimiser of $f(\mathbf{x})$, and let $\mathbf{d}$ be a feasible direction at $\mathbf{x}^*$. If $\mathbf{d}^T\nabla f(\mathbf{x}^*) = 0$, then:

$$\mathbf{d}^T\nabla^2 f(\mathbf{x}^*)\mathbf{d} \ge 0. \tag{5.19}$$

Theorem 5.3 (Second Order Sufficient Condition (SOSC) for Strong Local Maxima/Minima). Let $\mathbf{x}^*$ be an interior point of $S$. If (i) $\nabla f(\mathbf{x}^*) = \mathbf{0}$ and (ii) $\mathbf{d}^T\nabla^2 f(\mathbf{x}^*)\mathbf{d} > 0$ for all $\mathbf{d} \ne \mathbf{0}$ (that is, the Hessian is positive definite), then $\mathbf{x}^*$ is a strong local minimiser of $f(\mathbf{x})$.

5.6.0.1 Example

Let $f(\mathbf{x}) = x_1^2 + x_2^2$. Show that $\mathbf{x} = (0, 0)^T$ satisfies the FONC, the SONC and the SOSC; hence $(0, 0)^T$ is a strict local minimiser. We see that $\nabla f(\mathbf{x}) = (2x_1, 2x_2)^T = \mathbf{0}$ if and only if $x_1 = x_2 = 0$. It can also easily be shown that for all $\mathbf{d} \ne \mathbf{0}$, $\mathbf{d}^T\nabla^2 f(\mathbf{x})\mathbf{d} = 2d_1^2 + 2d_2^2 > 0$. Hence $\nabla^2 f(\mathbf{x})$ is positive definite.

5.6.0.2 Example

$$f(x_1, x_2) = x_1^4 + x_2^4$$

$$\nabla f(\mathbf{x}) = \begin{pmatrix} 4x_1^3 \\ 4x_2^3 \end{pmatrix}.$$

The only stationary point is $(0, 0)^T$. Now the Hessian is:

$$\nabla^2 f = \begin{pmatrix} 12x_1^2 & 0 \\ 0 & 12x_2^2 \end{pmatrix}.$$

At the origin the Hessian is $\begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$ and so the test gives no prediction of the minimum, although it is easy to see that the origin is a minimum.

5.6.0.3 Example

$$f(x_1, x_2) = \frac{1}{2c}\left(\frac{x_1^2}{a^2} - \frac{x_2^2}{b^2}\right),$$

where $a$, $b$, and $c$ are constants. $\nabla f(\mathbf{x}) = \begin{pmatrix} \frac{x_1}{ca^2} \\ -\frac{x_2}{cb^2} \end{pmatrix}$, so the only stationary point is $(0, 0)^T$. The Hessian is:

$$\nabla^2 f(\mathbf{x}) = \begin{pmatrix} \frac{1}{ca^2} & 0 \\ 0 & -\frac{1}{cb^2} \end{pmatrix}.$$

This is clearly indefinite and hence $(0, 0)^T$ is a saddle point.

Thus, in summary, the necessary and sufficient conditions for $\mathbf{x}^*$ to be a strong local minimum are:

• $\nabla f(\mathbf{x}^*) = \mathbf{0}$
• the Hessian is positive definite

5.6.1 Exercises

1. Find the gradient vectors of the following functions (where x ∈ Rn ):


• $f(\mathbf{x}) = \mathbf{c}^T\mathbf{x}$, $\mathbf{c} \in \mathbb{R}^n$
• $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{x}$
• $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T G\mathbf{x}$ where $G$ is symmetric
2. Find the slope and the curvature of the following function:
• $f(\mathbf{x}) = 100(x_2 - x_1^2) + (1 - x_1)^2$ at $(0, 0)^T$ in the direction of $(1, 0)^T$.
3. Use the necessary condition of optimality to determine the optimiser of the following function:

$$f(x_1, x_2) = (x_1 - 1)^2 + (x_2 - 1)^2 + x_1x_2$$

4. For the following function, find the points where the gradient vanishes, and investigate which of these are local minima, maxima or saddle points:
• $f(x_1, x_2) = x_1(1 + x_1) + x_2(1 + x_2) - 1$.
5. Consider the function $f : \mathbb{R}^2 \to \mathbb{R}$ determined by:

$$f(\mathbf{x}) = \mathbf{x}^T\begin{bmatrix} 1 & 2 \\ 4 & 8 \end{bmatrix}\mathbf{x} + \mathbf{x}^T\begin{bmatrix} 3 \\ 4 \end{bmatrix} + 6$$
• Find the gradient and Hessian of f at the point (1, 1).
• Find the directional derivative of f at the point (1, 1) in the direction of
the maximal rate of increase.
• Find a point that satisfies the first order necessary condition (FONC).
Does the point also satisfy the second order necessary condition (SONC)
for a minimum?
6. Find the stationary points of the function:

$$f(x_1, x_2) = (x_1^2 - 4)^2 + x_2^2$$

Show that $f$ has an absolute minimum at each of the points $(x_1, x_2) = (\pm 2, 0)$. Show that the point $(0, 0)$ is a saddle point.
7. Show that any point $x^*$ on the line $x_2 - 2x_1 = 0$ is a weak global minimiser of:

$$f(\mathbf{x}) = 4x_1^2 - 4x_1x_2 + x_2^2$$

8. Show that:

$$f(\mathbf{x}) = 3x_1^2 - x_2^2 + x_1^3$$
has a strong local maximiser at (−2, 0)T and a saddle point at (0, 0)T , but has
no minimisers.
9. Prove that for a general quadratic function $f(\mathbf{x}) = c + \mathbf{b}^T\mathbf{x} + \frac{1}{2}\mathbf{x}^T G\mathbf{x}$, the Hessian $G$ of $f$ maps differences in position into differences in gradient, i.e., $\mathbf{g}_1 - \mathbf{g}_2 = G(\mathbf{x}_1 - \mathbf{x}_2)$.
Chapter 6

Gradient Methods for Unconstrained Optimisation

In this chapter we will study methods for solving nonlinear unconstrained optimisation problems. The nonlinear minimisation algorithms described here are iterative methods which generate a sequence of points $\mathbf{x}_0, \mathbf{x}_1, \ldots$, or $\{\mathbf{x}_k\}$ (subscripts denoting the iteration number), hopefully converging to a minimiser $\mathbf{x}^*$ of $f(\mathbf{x})$. Univariate minimisation along a line in a particular direction is known as the line search technique; in multivariate unconstrained nonlinear minimisation, this one dimensional minimisation is known as the line search subproblem.

6.1 General Line Search Techniques used in Unconstrained Multivariate Minimisation

The algorithms for multivariate minimisation are all iterative processes which fit
into the same general framework:

At the beginning of the $k$-th iteration the current estimate of the minimum is $f(\mathbf{x}_k)$, and a search is made in $\mathbb{R}^n$ from $\mathbf{x}_k$ along a given vector direction $\mathbf{d}_k$ ($\mathbf{d}_k$ is different for different minimisation methods) in an attempt to find a new point $\mathbf{x}_{k+1}$ such that $f(\mathbf{x}_{k+1})$ is sufficiently smaller than $f(\mathbf{x}_k)$. This process is called a line (or linear) search.

Line-search methods, therefore, generate the iterates by setting:

xk+1 = xk + αk dk (6.1)


where dk is a search direction and αk > 0 is chosen so that:

f (xk + αk dk ) = f (xk+1 ) < f (xk ), (6.2)

Therefore, for a given $\mathbf{d}_k$, a line-search procedure is used to choose an $\alpha_k > 0$ that approximately minimises $f$ along the ray $\{\mathbf{x}_k + \alpha\mathbf{d}_k : \alpha > 0\}$. Hence, the line search is the univariate minimisation involving the single variable $\alpha_k$ (since both $\mathbf{x}_k$ and $\mathbf{d}_k$ are known, $f(\mathbf{x}_k + \alpha_k\mathbf{d}_k)$ becomes a function of $\alpha_k$ only) such that:

$$f(\alpha_k) = f(\mathbf{x}_k + \alpha_k\mathbf{d}_k). \tag{6.3}$$

Bear in mind that this single variable minimiser cannot always be obtained analyti-
cally and hence some numerical techniques may be necessary.

6.1.1 Challenges in Computing Step Length αk

The challenges in finding a good $\alpha_k$ lie in avoiding a step length that is either too long or too short. Consider the Figures below:

Figure 6.1: Step-size too big

Here the objective function is $f(x) = x^2$ and the iterates $x_{k+1} = x_k + \alpha_k d_k$ are generated by the descent directions $d_k = (-1)^{k+1}$ with steps $\alpha_k = 2 + \frac{3}{2^{k+1}}$ and an initial starting point of $x_0 = 2$.

Figure 6.2: Step-size too small

Here the objective function is $f(x) = x^2$ and the iterates $x_{k+1} = x_k + \alpha_k d_k$ are generated by the descent direction $d_k = -1$ with steps $\alpha_k = \frac{1}{2^{k+1}}$ and an initial starting point of $x_0 = 2$.

6.2 Exact and Inexact Line Search

Given the direction dk and the point xk, f(xk + αdk) becomes a function of α. Hence
it is simply a one-dimensional minimisation with respect to α. The solution of
df(α)/dα = 0 determines the exact location of the minimiser αk. However, it may
not be possible to locate the exact value of αk for which df(α)/dα = 0, and it may
require a very large number of iterations to locate the minimiser αk. Nonetheless, the
idea is conceptually useful. Notice that for an exact line search the slope df/dα at αk must
be zero. Therefore, we get:

df(xk+1)/dα = ∇f(xk+1)ᵀ dxk+1/dα = g(xk+1)ᵀ dk = 0.    (6.4)

Line search algorithms used in practice are much more involved than the one di-
mensional search methods (optimisation methods) presented in the previous chap-
ter. The reason for this stems from several practical considerations. First, determin-
ing the value of αk that exactly minimises f (α) may be computationally demanding;

Figure 6.3: Varying Alpha

even worse, the minimiser of f(α) may not even exist. Second, practical experience
suggests that it is better to allocate more computational time to iterating the
optimisation algorithm than to performing exact line searches. These considerations
led to the development of conditions for terminating line search algorithms
that result in low-accuracy line searches while still securing a decrease in the
value of f from one iteration to the next.
In practice, the line search is terminated when some descent conditions along the
line xk + αdk are satisfied. Hence, it is no longer necessary to go for the exact line
search. The line search carried out in this way is known as the inexact line search. A
further justification for the inexact line search is that it is not efficient to determine
the line search minima to a high accuracy when xk is far from the minimiser x∗ . Un-
der these circumstances, nonlinear minimisation algorithms employ an inexact or
approximate line search. To sum up, the exact line search is a theoretical concept
and the inexact line search is its practical implementation.
Remark:
Each iteration of a line search method computes a search direction dk and then de-
cides how far to move along that direction. The iteration is given by
xk+1 = xk + αk dk ,
where the positive scalar αk is called the step length. The success of a line search
method depends on effective choices of both the direction dk and the step
length αk. Most line search algorithms require dk to be a descent direction.

6.2.1 Algorithmic Structure

The typical behaviour of a minimisation algorithm is that it repeatedly generates
points xk such that, as k increases, xk moves closer to x∗. A feature of a minimisation
algorithm is that f(xk) is reduced on each iteration, which makes it likely that the
stationary point obtained is a local minimiser. A minimisation algorithm requires
an initial estimate, say x0, to be supplied. At each iteration the algorithm finds a
descent direction along which the function is minimised. This minimisation in a
particular direction is known as the line search. The basic structure of the
general algorithm is:

1. Initialise the algorithm with an estimate x0 and set k = 0.


2. Determine a search direction dk at xk .
3. Find αk to minimise f (xk + αdk ) with respect to α.
4. Set xk+1 = xk + αk dk .
5. The line search is stopped when f (xk+1 ) < f (xk ).
6. If algorithm meets stopping criteria then STOP, ELSE set k = k +1 and go back
to (2).

Different minimisation methods select dk in different ways in step (2). Steps (3) and (4)
form the one-dimensional subproblem carried out along the line xk+1 = xk + αk dk for
α > 0. The direction dk at xk must satisfy the descent condition.

6.3 The Descent Condition

Central to the development of the gradient-based minimisation methods is the idea
of a descent direction. Conditions for a descent direction can be obtained using
a Taylor series around the point xk. Using two terms of the Taylor series we have:

f(xk + αdk) − f(xk) = α dkᵀ ∇f(xk) + · · ·    (6.5)

Clearly the descent condition can easily be seen as:

dkᵀ ∇f(xk) < 0,    (6.6)

since we require the left-hand side of Equation (6.5) to be negative (for sufficiently small α > 0).

6.4 The Direction of Greatest Reduction

A simple line search descent method is the steepest descent method in which:

dk = −∇ f (xk ) = −gk , ∀k (6.7)



From Equation (6.5) we see that:

fk+1 − fk = αk dkᵀ gk    (6.8)
          = αk ∥dk∥ ∥gk∥ cos θ,    (6.9)

where θ can be interpreted geometrically as the angle between dk and gk. If we allow
θ to vary while holding αk, ∥dk∥ and ∥gk∥ constant, then the right-hand side of Equation
(6.9) is most negative when θ = π. Thus when αk is sufficiently small, the greatest
reduction in the function is obtained in the direction:

dk = −gk    (6.10)

This negative gradient direction, which satisfies the descent condition (6.6), gives rise
to the method of steepest descent.

6.5 The Method of Steepest Descent

Here the search direction is taken as the negative gradient, and the step size αk
is chosen to achieve the maximum decrease in the objective function f at each step.
Specifically, we solve the problem:

Minimise f(x(k) − α∇f(x(k))) w.r.t. α    (6.11)

This is now a one-dimensional optimisation problem.

6.5.1 Steepest Descent Algorithm

Given x0, for iterations k = 0, 1, 2, . . . until the stopping criterion is met, do:

• Compute the gradient g(xk) = ∇f(xk).

• Compute αk such that f(xk − αk gk) = min_α f(xk − αgk).

• Compute xk+1 = xk − αk gk.

• If the stopping criterion is met, STOP; else set k = k + 1 and go to (1).
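The algorithm can be sketched in a few lines of Python. This is our own illustration; the one-dimensional subproblem is solved numerically with scipy.optimize.minimize_scalar, since a closed-form αk is rarely available:

import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:         # stopping criterion
            return x, k
        # line search: minimise f(x - alpha*g) over alpha
        alpha = minimize_scalar(lambda a: f(x - a * g)).x
        x = x - alpha * g
    return x, max_iter

# example: f(x) = 2*x1^2 + 3*x2^2, as in the worked example below
f = lambda x: 2*x[0]**2 + 3*x[1]**2
grad = lambda x: np.array([4*x[0], 6*x[1]])
print(steepest_descent(f, grad, [1.0, 1.0]))  # converges towards (0, 0)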

6.5.2 Convergence Criteria

In practice the algorithm is terminated if some convergence criterion is satisfied.


Usually termination is enforced at iteration k if one, or a combination of the follow-
ing is met:

• ∥xk − xk−1 ∥ < ϵ1 .



• ∥∇ f (xk )∥ < ϵ2

• | f (xk ) − f (xk−1 )| < ϵ3

Here ϵ1 , ϵ2 and ϵ3 are designated some small positive tolerances.

6.5.2.1 Example

Consider f(x) = 2x1² + 3x2², where x0 = (1, 1). Use two iterations of steepest descent.

Solution:
Compute ∇f(x) = (4x1, 6x2)ᵀ = g.
First Iteration:

We know that:
x1 = x0 − α0 g (x0 ),

so:

x1 = (1, 1)ᵀ − α (4, 6)ᵀ = (1 − 4α, 1 − 6α)ᵀ.

Therefore:

f(x0 − αg(x0)) = 2(1 − 4α)² + 3(1 − 6α)²
             = 2 − 16α + 32α² + 3 − 36α + 108α²
⇒ df/dα = 280α − 52 = 0
⇒ α0 = 52/280 = 13/70.

Finally:
x1 = (1 − 4·(13/70), 1 − 6·(13/70))ᵀ = (9/35, −4/35)ᵀ.

Second Iteration:

We have:
x2 = x1 − α1 g (x1 )

Compute (simplified here):

f(x1 − αg(x1)) = (1/35²) [ 2(9 − 36α)² + 3(−4 + 24α)² ].

We get:

df/dα = 0 ⇒ 60α = 13 ⇒ α1 = 13/60.

Therefore:

x2 = x1 − α1 g(x1) = (9/35, −4/35)ᵀ − (13/60)(36/35, −24/35)ᵀ = (6/175, 6/175)ᵀ.
The process continues in the same manner as above. We can see by inspection that
the function achieves its minimum at (0, 0); the plots below serve as a sanity check.

It is also worth noting that since this is a quadratic function, we can actually use
another technique. We will redo the first iteration as illustration. Specifically,
quadratic functions allow αk to be computed directly using:

αk = −(gkᵀ dk) / (dkᵀ Q dk).

Thus:

First Iteration:

Compute f(x0) = 5, g(x0)ᵀ = (4, 6) and

Q = [ 4  0
      0  6 ]

Therefore, with d0 = −g0:

α0 = −(g0ᵀ d0)/(d0ᵀ Q d0) = 52/280 = 13/70.

Thus:

x1 = (1, 1) − (13/70)(4, 6) = (9/35, −4/35).
Similarly, the process repeats.
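As a numerical sanity check on the two iterations above, the quadratic step formula can be iterated directly (a short sketch of our own, using only NumPy):

import numpy as np

Q = np.array([[4.0, 0.0], [0.0, 6.0]])   # Hessian of f(x) = 2x1^2 + 3x2^2
x = np.array([1.0, 1.0])
for k in range(2):
    g = Q @ x                            # gradient (b = 0 for this quadratic)
    d = -g                               # steepest descent direction
    alpha = -(g @ d) / (d @ Q @ d)       # exact step length for a quadratic
    x = x + alpha * d
    print(k + 1, x)
# prints (9/35, -4/35) ≈ (0.2571, -0.1143) and (6/175, 6/175) ≈ (0.0343, 0.0343)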

6.5.3 Inexact Line Search

Although you will only cover inexact line search techniques in the third year syllabus,
we will quickly introduce a very simple inexact technique for the purpose of
your labs.

Figure 6.4: 2x1² + 3x2²

Figure 6.5: Corresponding contour plot



6.5.3.1 Backtracking Line Search

One way to adaptively choose the step size is to do the following:

• First fix a parameter 0 < β < 1.

• Then at each iteration, start with t = 1 and, while

f(x − t∇f(x)) > f(x) − (t/2) ∥∇f(x)∥²,

update t = βt.

This is a simple technique and tends to work quite well in practice. For further reading you can consult Convex Optimization by Boyd and Vandenberghe.

Steepest Descent Inexact Method

import numpy as np
from numpy import linalg as LA

def steepest_descent_backtracking(f, g, x0, beta, tol):
    # steepest descent with a backtracking (inexact) line search
    it = 0
    xvals = np.array([x0])
    while LA.norm(g(x0)) > tol and it < 1000:
        alpha = 1.0
        # backtrack until the sufficient-decrease condition holds
        while f(x0 - alpha*g(x0)) > f(x0) - (alpha/2)*LA.norm(g(x0))**2:
            alpha = beta*alpha
        x0 = x0 - alpha*g(x0)
        it += 1
        xvals = np.append(xvals, x0)
    return x0, it, xvals

6.5.4 Exercises

1. Show that the value of the function

f(x) = a x1² + b x2² + c x3²

reached after taking a single step of the steepest descent method from the point
(1, 1, 1)ᵀ is:

[ab(b − a)² + bc(c − b)² + ca(a − c)²] / (a³ + b³ + c³).

Figure 6.6: 2x1² + 3x2²

2. Show that if exact line search is carried out on the quadratic

f(x) = ½ xᵀQx + bᵀx + c

using the iteration

xk+1 = xk + αk dk,

then:

αk = −(gkᵀ dk) / (dkᵀ Q dk).
3. Compute the first two iterations of the method of steepest descent applied to
the objective function

f(x) = 4x1² + x2² − x1²x2

with x0 = [1, 1]ᵀ. Use exact line search.
4. Use three iterations of the steepest descent method on the function

f(x) = 3x1² + 2x2²

with initial point (1, 1)ᵀ.

6.6 The Gradient Descent Algorithm and Machine


Learning

We will briefly place what we have learnt in the context of machine learning,
to emphasise the power of this chapter. In machine learning,
you will find the gradient descent algorithm everywhere. While the literature may
seem to allude to this method being new, powerful and cool, it is really nothing more
than the method of steepest descent introduced above.

6.6.1 Basic Example

Let's try to find a local minimum of the function f(x) = x³ − 2x² + 2:


So from the above plot we can see that there is a local minimum somewhere around
1.3–1.4 on the x-axis. Of course, we normally won't be afforded the luxury
of such information a priori, so let's just assume we arbitrarily set our starting
point to be x0 = 2. Implementing gradient descent with a fixed step size, or
learning rate (in the context of ML), we have:

Figure 6.7: f(x) = x³ − 2x² + 2

# objective function
def f(x):
    return x**3 - 2*x**2 + 2

# returns the value of the derivative of our function
def f_prime(x):
    return 3*x**2 - 4*x

x_old = 0
x_new = 2            # The algorithm starts at x=2
n_k = 0.1            # step size fixed at 0.1
precision = 0.0001   # tolerance value

x_list, y_list = [x_new], [f(x_new)]

while abs(x_new - x_old) > precision:
    x_old = x_new
    s_k = -f_prime(x_old)        # descent direction
    x_new = x_old + n_k * s_k    # fixed-step update
    x_list.append(x_new)
    y_list.append(f(x_new))

print("Local minimum occurs at:", x_new)

## Local minimum occurs at: 1.3334253508453249



print("Number of steps:", len(x_list))

## Number of steps: 17

How did the algorithm look step by step?


Figure 6.8: Iterative steps

In our implementation above we had a fixed step size n_k. In machine learning, this
is called the learning rate. You'll notice this is contrary to the algorithm in the aforementioned pseudocode. Assuming a fixed learning rate made the
implementation easier, but it can produce the issues mentioned at the beginning of the
chapter.

6.6.2 Adaptive Step-Size

One means of overcoming this issue is to use adaptive step-sizes. This can be done
using scipy’s fmin function to find the optimal step-size at each iteration.

from scipy.optimize import fmin

# objective function (as before)
def f(x):
    return x**3 - 2*x**2 + 2

# returns the value of the derivative of our function
def f_prime(x):
    return 3*x**2 - 4*x

# we set up this function to pass into the fmin algorithm
def f2(n, x, s):
    return f(x + n*s)

x_old = 0
x_new = 2            # The algorithm starts at x=2
precision = 0.0001

x_list, y_list = [x_new], [f(x_new)]

while abs(x_new - x_old) > precision:
    x_old = x_new
    s_k = -f_prime(x_old)

    # use scipy's fmin to find a good step size at each iteration.
    # fmin uses the downhill simplex algorithm, a zero-order method
    n_k = fmin(f2, 0.1, (x_old, s_k), full_output=False, disp=False)

    x_new = x_old + n_k * s_k
    x_list.append(x_new)
    y_list.append(f(x_new))

print("Local minimum occurs at", float(x_new))

## Local minimum occurs at 1.3333333284505209

print("Number of steps:", len(x_list))

## Number of steps: 4

So we can see that using adaptive step sizes, we've reduced the number of iterations to convergence from 17 to 4. This is a substantial reduction; however, it must
be noted that it takes time to compute the appropriate step size at each iteration.
This highlights a major issue in decision making for optimisation: trying to find
the balance between speed and accuracy.
How did the modified algorithm look step by step?
Well we can see that it converges rapidly and after the first two iterations, we need
to zoom in to see further improvements.


Figure 6.9: Iterative steps

6.6.3 Decreasing Step-Size

Instead of using computational resources to find an optimal step size at each
iteration, we could apply a damping factor at each step to reduce the step size
over time. For example:

η(t + 1) = η(t) / (1 + t · d)

# objective function
def f(x):
    return x**3 - 2*x**2 + 2

# returns the value of the derivative of our function
def f_prime(x):
    return 3*x**2 - 4*x

x_old = 0
x_new = 2            # The algorithm starts at x=2
n_k = 0.17           # initial step size
precision = 0.0001
t, d = 0, 1

x_list, y_list = [x_new], [f(x_new)]

while abs(x_new - x_old) > precision:
    x_old = x_new
    s_k = -f_prime(x_old)
    x_new = x_old + n_k * s_k
    x_list.append(x_new)
    y_list.append(f(x_new))
    n_k = n_k / (1 + t * d)    # dampen the step size over time
    t += 1

print("Local minimum occurs at:", x_new)

## Local minimum occurs at: 1.3308506740900838

print("Number of steps:", len(x_list))

## Number of steps: 6

We can now see that we've still substantially reduced the number of iterations
required, without being bound to finding an optimal step size at each iteration.
This highlights the trade-off of finding cheap improvements that aid convergence
at minimal cost.

How Do We Use the Gradient Descent in Linear Regression?

While using these line search methods to find the minima of basic functions is interesting,
one might wonder how this relates to some of the regressions we are interested in
performing. Let us consider a slightly more complicated example. In this data set,
we have data relating to how temperature affects the noise produced by crickets.
Specifically, the data is a number of observations or samples of cricket chirp rates at
various temperatures.


What can we deduce from the plotted data?

We can see that the data set exhibits a linear relationship. Therefore, our aim is
to find the equation of the straight line, given by:

hθ(x) = θ0 + θ1 x,

that best fits all of our data points, i.e. minimises the residual error.
The function that we are trying to minimise in this case is:

J(θ0, θ1) = (1/2m) Σ_{i=1}^{m} (hθ(xi) − yi)²

In this case, our gradient will be defined in two dimensions:

∂J/∂θ0 = (1/m) Σ_{i=1}^{m} (hθ(xi) − yi)

∂J/∂θ1 = (1/m) Σ_{i=1}^{m} (hθ(xi) − yi) · xi

Below, we set up our function for h, J and the gradient:

h = lambda theta_0,theta_1,x: theta_0 + theta_1*x

def J(x,y,m,theta_0,theta_1):
returnValue = 0
for i in range(m):
returnValue += (h(theta_0,theta_1,x[i])-y[i])**2
returnValue = returnValue/(2*m)
return returnValue

def grad_J(x,y,m,theta_0,theta_1):
returnValue = np.array([0.,0.])
for i in range(m):
returnValue[0] += (h(theta_0,theta_1,x[i])-y[i])
returnValue[1] += (h(theta_0,theta_1,x[i])-y[i])*x[i]
returnValue = returnValue/(m)
return returnValue

import time
start = time.time()

# x, y are the data arrays plotted above and m = len(x)
theta_old = np.array([0., 0.])
theta_new = np.array([1., 1.])  # The algorithm starts at [1,1]
n_k = 0.001                     # step size
precision = 0.001
num_steps = 0
s_k = float("inf")

while np.linalg.norm(s_k) > precision:


num_steps += 1
theta_old = theta_new
s_k = -grad_J(x,y,m,theta_old[0],theta_old[1])
theta_new = theta_old + n_k * s_k

print("Local minimum occurs where:")

## Local minimum occurs where:

print("theta_0 =", theta_new[0])

## theta_0 = 25.128552558595363

print("theta_1 =", theta_new[1])

## theta_1 = 3.297264756251897

print("This took",num_steps,"steps to converge")

## This took 565859 steps to converge



end = time.time()
print(str(end - start) + 'seconds')

## 19.64289903640747seconds

It’s clear that the algorithm seems to take quite a long time for such a trivial example.
Let’s check that the values we’ve obtained from the gradient descent are any good.
We can get the true values for θ0 and θ1 with the following:

from scipy import stats as sp

start = time.time()
actualvalues = sp.linregress(x, y)
print("Actual values for theta are:")
print("Actual values for theta are:")

## Actual values for theta are:

print("theta_0 =", actualvalues.intercept)

## theta_0 = 25.232304983426026

print("theta_1 =", actualvalues.slope)

## theta_1 = 3.2910945679475647

end = time.time()
print(str(end - start) + 'seconds')

## 0.012906551361083984seconds

One thing this highlights is how much effort goes into optimising the func-
tions found in these libraries. If one looks at the code inside linregress, clever
exploitations to speed up the computation can be found.

Now, let’s plot our obtained results on the original data set:


So in our implementation above, we needed to compute the full gradient at each step.
While this might not seem important, it is! In this toy example we only have 15
data points; however, imagine the computational intractability when millions of
data points are involved.

6.6.4 Stochastic Gradient Descent

What we implemented above is often called vanilla/batch gradient descent. As
pointed out, this implementation means that we need to sum the cost gradient of each
sample in order to calculate the gradient of the cost function. This means, given 3
million samples, we would have to loop through 3 million terms!

So to move a single step towards the minimum, one would need to cal-
culate each cost 3 million times.

So what can we do to overcome this? Well, we can use stochastic gradient descent. In this idea, we use the cost gradient of one sample at each iteration rather than
the sum of the cost gradients of all samples. So recall our gradient equations from
above:

∂J/∂θ0 = (1/m) Σ_{i=1}^{m} (hθ(xi) − yi),

∂J/∂θ1 = (1/m) Σ_{i=1}^{m} (hθ(xi) − yi) · xi,

where:

hθ(x) = θ0 + θ1 x.
We now want to update our values at each item in the training set, instead of after all of them, so
that we can begin improving straight away.
We can redefine our algorithm into the stochastic gradient descent for the simple
linear regression as follows:

Randomly shuffle the data set
for k = 0, 1, 2, ... do
    for i = 1 to m do

        [θ0, θ1]ᵀ ← [θ0, θ1]ᵀ − α [2(hθ(xi) − yi), 2xi(hθ(xi) − yi)]ᵀ

    end for
end for

Depending on the size of the data set, we run through the entire data set 1 to k times.
The key advantage here is that, unlike batch gradient descent, where we have to
go through the entire data set before making any progress, we can now make progress
straight away as we move through the data set. This is the primary reason why
stochastic gradient descent is used when dealing with large data sets; a minimal sketch follows.
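A minimal sketch of stochastic gradient descent for the simple linear regression above (our own illustration; it assumes x and y are the data arrays used earlier):

import numpy as np

def sgd_linear_regression(x, y, alpha=0.001, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array([0.0, 0.0])              # [theta_0, theta_1]
    idx = np.arange(len(x))
    for k in range(epochs):
        rng.shuffle(idx)                      # randomly shuffle the data set
        for i in idx:
            residual = theta[0] + theta[1]*x[i] - y[i]
            # update using the cost gradient of this single sample only
            theta = theta - alpha * np.array([2*residual, 2*x[i]*residual])
    return theta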
Chapter 7

Newton and Quasi-Newton


Methods

The steepest descent method uses information based only on the first partial deriva-
tives in selecting a suitable search direction. This strategy is not always the most ef-
fective. A faster method may be obtained by approximating the objective function
f (x) as a quadratic q(x) and making use of a knowledge of the second partial deriva-
tives. This is the basis of Newton’s method. The idea behind this method is as fol-
lows. Given a starting point, we construct a quadratic approximation to the objective
function that matches the first and the second derivative of the original objective
function at that point. We then minimise the approximate (quadratic) function in-
stead of the original objective function. We then use the minimiser of the quadratic
function to obtain the next iterate and repeat the procedure iteratively. If the ob-
jective function is quadratic then the approximation is exact and the method
yields the true minimiser in one step. If, on the other hand, the objective function is
not quadratic, then the approximation will provide only an estimate of the position
of the true minimiser.
We can obtain a quadratic approximation to the given twice continuously differen-
tiable objective function using the Taylor series expansion of f about the current x k ,
neglecting terms of order three and the higher. Using the Taylor series expansion:

f(x) ≈ f(x(k)) + (x − x(k))ᵀ g(k) + ½ (x − x(k))ᵀ H(x(k)) (x − x(k)) = q(x),

where g = ∇ f and H is the Hessian matrix. The minimum of the quadratic q(x)
satisfies:
0 = ∇q(x) = g(k) + H (x(k) )(x − x(k) ),
or inverting:
x = x(k) − H −1 (x(k) )g(k) .
Newton’s formula is:
x(k+1) = x(k) − H −1 (x(k) )g(k) . (7.1)


This can be rewritten as

H(k) d(k) = −g(k)    (7.2)

where d(k) = x(k+1) − x(k).

Note: to solve g(x) = 0 in one dimension, Newton's method iterates
xk+1 = xk − g(xk)/g′(xk). The formula above is the multidimensional
extension of Newton's method.

The method requires that fk, gk and Hk, i.e. the function value, the gradient and the Hessian, be made available at each iterate xk. Most importantly, the Newton method is only well defined if the Hessian Hk is
positive definite, because only then does q(x) have a unique minimiser. The positive definiteness of the Hessian can only be guaranteed
if the starting iterate x0 is very near the minimiser x∗ of f(x).

The Newton method converges quickly when it is applied close to the minimiser. If
the starting point (the initial point) is far from the minimiser, the algorithm
may not converge.

7.0.0.1 Example

Consider the following example:

f(x) = 100(x2 − x1²)² + (1 − x1)².

Let us take x0 = (0, 0)ᵀ. The gradient vector and the Hessian at x are respectively given by:

∇f(x) = ( −400x1(x2 − x1²) − 2(1 − x1),  200(x2 − x1²) )ᵀ = g,

and:

H(x) = [ 800x1² − 400(x2 − x1²) + 2    −400x1
         −400x1                        200 ]

So substituting x0 gives:

g0 = (−2, 0)ᵀ;    H0 = [ 2    0
                         0  200 ]

Now using

H0 d0 = −g0,

recall that:

Hk dk = −gk,

so:

d0 = −H0⁻¹ g0 = [ 1/2     0
                  0    1/200 ] (2, 0)ᵀ = (1, 0)ᵀ.

Recall:

dk = xk+1 − xk ⇒ d0 = x1 − x0 ⇒ x1 = x0 + d0.

Thus:

x1 = (0, 0)ᵀ + (1, 0)ᵀ = (1, 0)ᵀ.
Calculating the function value we have:

f (x1 ) = 100 > f (x0 ) = 1

which shows that the algorithm is diverging!

Figure 7.1: Rosenbrock Function

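The calculation above is easy to reproduce; the following basic Newton iteration (a sketch of our own, not a library routine) shows the divergent first step on the Rosenbrock function:

import numpy as np

def grad(x):
    return np.array([-400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0]),
                     200*(x[1] - x[0]**2)])

def hessian(x):
    return np.array([[800*x[0]**2 - 400*(x[1] - x[0]**2) + 2, -400*x[0]],
                     [-400*x[0], 200.0]])

x = np.array([0.0, 0.0])
for k in range(3):
    d = np.linalg.solve(hessian(x), -grad(x))   # solve H d = -g
    x = x + d
    print(k + 1, x)
# the first step gives x1 = (1, 0), where f(x1) = 100 > f(x0) = 1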

7.1 The Modified Newton Method

The modified Newton method is

x(k+1) = x(k) − αk H −1 (x(k) )g(k) .

The step length parameter αk modifies the step taken in the search direction, usually
to minimize f (x(k+1) ). Newton’s method applied without this modification does not
necessarily produce a decrease in f (x(k+1) ), as described by the above example.
To address the drawbacks of Newton's method, a line search is introduced in which fk+1 < fk is sought. As with the other gradient-based methods, the new iterate xk+1 is found
by minimising f along the search direction dk such that:

xk+1 = xk + αk dk

where αk is the value of α which minimises f(xk + αdk).


Although Newton's method without this modification may generate points where the
function increases (see the example above), the directions generated by Newton's
method are initially downhill if Hk is positive definite.
Remarks:

• Newton’s method always goes in a descent direction provided we do not go


too far but sometimes Newton over-steps the mark and does not work.
• The drawback to the method is that evaluating H −1 can be expensive in com-
putational time.

7.2 Convergence of Newton’s Method for Quadratic


Functions

If f(x) = ½ xᵀQx + xᵀb + c is a quadratic function with positive definite symmetric
Q, then Newton's method reaches the minimum in one step irrespective of the initial
starting point.
Proof:
The gradient vector g(x) = ∇f(x) = Qx + b. The Hessian H(x) = Q is constant.
Hence, given x(0),

x(1) = x(0) − H⁻¹ g(0)
     = x(0) − Q⁻¹ (Q x(0) + b)
     = −Q⁻¹ b = x∗.

The result also holds if Q is negative definite, in which case x∗ is a strong local maximum, or if Q is symmetric indefinite, giving x∗ as a saddle point.
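This one-step property is easy to verify numerically (a small sketch of our own):

import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])          # symmetric positive definite
b = np.array([-1.0, 2.0])
# f(x) = 0.5 x^T Q x + x^T b, so g(x) = Qx + b and H = Q
x0 = np.array([10.0, -7.0])                     # arbitrary starting point
x1 = x0 - np.linalg.solve(Q, Q @ x0 + b)        # one Newton step
print(np.allclose(x1, np.linalg.solve(Q, -b)))  # True: x1 = -Q^{-1} b = x*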

7.3 Quasi-Newton Methods

The basic Newton method as it stands is not suitable for a general-purpose algorithm,
since Hk may not be positive definite when xk is remote from the solution. Furthermore, as we have shown in the previous example, even if Hk is positive definite,
convergence may not occur. To address these issues, quasi-Newton algorithms were
developed. We start by describing the drawbacks of the Newton method. At each
iteration (say, the k-th) of Newton's method a new matrix Hk has to
be calculated (even if the method uses line search), and then either the inverse of this
matrix has to be found or a system of equations has to be solved before the new point
x(k+1) is found using x(k+1) = x(k) + d(k). Quasi-Newton methods avoid the calculation of a new matrix at each iteration; instead they only update the (positive
definite) matrix from the previous iteration, and the updated matrix remains positive definite. The
method also does not need to solve a system of equations: it first finds its direction
using the positive definite matrix, and then finds the step length using a line search.
The introduction of quasi-Newton methods greatly increased the range of problems
which could be solved. This type of method is like the Newton method with line search,
except that Hk⁻¹ is approximated at each iteration by a symmetric positive definite
matrix Gk, which is updated from iteration to iteration. Thus the k-th iteration has
the basic structure:

1. Set dk = −G k gk
2. Line search along dk giving xk+1 = xk + αk dk
3. Update G k giving G k+1

The initial positive definite matrix is chosen as G 0 = I . Potential advantages of the


method (as against Newton’s method) are:

• Only first derivative required (Second derivative required in Newton method)


• G k positive definite implies the descent property (H k may be indefinite in
Newton method)

Much of the interest lies in the updating formula which enables Gk+1 to be calculated from Gk. We know that for any quadratic function:

q(x) = ½ xᵀH x + bᵀx + c,

where H, b and c are constant and H is symmetric, the Hessian maps differences in
position into differences in gradient, i.e.,

gk+1 − gk = H(xk+1 − xk).    (7.3)

The above property says that changes in gradient g (=∇ f (x)) provide information
about the second derivative of q(x) along (xk+1 − xk ). In the quasi-Newton methods

at xk we have information about the direction dk, Gk and the gradient gk. We can
use this information to perform a line search to obtain xk+1 and gk+1. We now need
to calculate G k+1 (the approximate inverse of H k+1 ) using the above information.
At this point we impose the condition given by Equation (7.3) for the non-quadratic
function f . In other words, we impose that changes in the gradient provide infor-
mation about the second derivative of f along the search direction dk . Hence, we
have:

H(k+1)⁻¹ (gk+1 − gk) = (xk+1 − xk)    (7.4)

Therefore, we would like to have Gk+1 = Gk + ∆Gk such that:

Gk+1 γk = δk,    (7.5)

where Gk+1 ≈ Hk+1⁻¹, δk = (xk+1 − xk) and γk = (gk+1 − gk). This is known as the
quasi-Newton condition, and for a quasi-Newton algorithm the update Gk+1 from
Gk must satisfy Equation (7.5).
Methods differ in the way they update the matrix Gk. Essentially they are classified
according to rank one and rank two updating formulae.

7.3.1 The DFP Quasi-Newton Method

Rank two updating formulae are given by:

Gk+1 = Gk + a uuᵀ + b vvᵀ.    (7.6)

One method is to choose u = δk and v = Gk γk. Then a uᵀγk = 1 and b vᵀγk = −1
determine a and b. Thus:

Gk+1 = Gk + (δk δkᵀ)/(δkᵀ γk) − (Gk γk γkᵀ Gk)/(γkᵀ Gk γk)    (7.7)

This formula was first suggested as part of a method due to Davidon (1959), and later
also presented by Fletcher and Powell (1963). The quasi-Newton method which goes
with this updating formula is known as the DFP (Davidon, Fletcher and Powell) method.
The DFP algorithm is also known as the variable metric algorithm. It
preserves the positive definiteness of Gk but can sometimes give trouble
when Gk becomes nearly singular. A modification (known as BFGS) introduced in
1970 can cure this problem. The algorithm for the DFP method is given below:

1. Set k = 0, G0 = I and compute gk = g(xk).

2. Compute dk from dk = −Gk gk.
3. Compute αk such that f(xk + αk dk) = min_α f(xk + αdk), and set xk+1 = xk + αk dk.
4. Compute gk+1 = g(xk+1).
5. If ∥gk+1∥ ≤ ϵ (ϵ is a user-supplied small number) then go to (9).
6. Compute δk = xk+1 − xk and γk = gk+1 − gk.
7. Compute Gk+1.
7. Compute G k+1 .

8. Set k = k + 1 and go to (2).


9. Set x∗ = xk+1 , STOP.
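The steps above can be sketched in Python as follows (our own implementation; the line search is handled with scipy.optimize.minimize_scalar for simplicity):

import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad, x0, eps=1e-6, max_iter=200):
    x = np.asarray(x0, dtype=float)
    G = np.eye(len(x))                        # step 1: G0 = I
    g = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(g) <= eps:          # step 5: stopping test
            break
        d = -G @ g                            # step 2: quasi-Newton direction
        alpha = minimize_scalar(lambda a: f(x + a*d)).x   # step 3: line search
        x_new = x + alpha * d
        g_new = grad(x_new)                   # step 4
        delta, gamma = x_new - x, g_new - g   # step 6
        # step 7: DFP rank-two update of the inverse-Hessian approximation
        G = (G + np.outer(delta, delta) / (delta @ gamma)
               - np.outer(G @ gamma, G @ gamma) / (gamma @ G @ gamma))
        x, g = x_new, g_new
    return x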

7.3.2 Exercises

1. Use Newton's method to minimise the function:

f(x) = x1⁴ − 3x1x2 + (x2 + 2)²,

starting at the point x0 = [0, 0]ᵀ, and show that the function value at x0 cannot
be improved by searching in the Newton direction.
2. Find the stationary points of:

f(x) = x1² + x2² − x1²x2

and determine their nature. Plot the contours of f. Find the value of f after
taking one step of the basic Newton method from x0 = (1, 1)ᵀ.
3. Using Newton's method, find the minimiser of:

f(x) = ½x² − sin(x).

The initial value is x0 = 0.5. The required accuracy is ϵ = 10⁻⁵, in the sense that
you stop when |xk+1 − xk| < ϵ.


4. Using the DFP method, find the minimum of the following function:

f(x) = 4x1² − 4x1x2 + 3x2² + x1,

using the starting point (4, 3).


5. Find the minimum of the function given in question (2) utilising the DFP
method. Use the same starting point.
Chapter 8

Direct Search Methods for


Unconstrained Optimisation

Direct search methods, unlike the descent methods discussed in earlier chapters, do
not require the derivatives of the function. Direct search methods require only
the objective function values when finding minima, and are often known as zeroth-order
methods since they use only the zeroth-order derivative (the function values themselves). We will
consider two direct methods in this course, namely the Random Walk Method
and the Downhill Simplex Method.

8.1 Random Walk Method

The random walk method is based on generating a sequence of improved approximations to a minimum, where each approximation is derived from the previous
one. If xi is the approximation to the minimum obtained in the
(i − 1)-th iteration, the iterates obey the relation:

xi+1 = xi + λui,

where λ is a scalar step length and ui is a random unit vector generated at the
i-th stage.
We can describe the algorithm as follows:

1. Start with an initial point x1 , a sufficiently large initial step length λ, a mini-
mum allowable step length ϵ, and a maximum permissible number of itera-
tions N .
2. Find the function value f 1 = f (x1 ).
3. Set the iteration number, i , to 1


4. Generate a set of n random numbers r1, . . . , rn, each lying in the interval
[−1, 1], and form the unit vector:

u = (1/R) (r1, r2, . . . , rn)ᵀ,   where R = (r1² + r2² + · · · + rn²)^(1/2).

To avoid bias in the search directions, the vector is accepted only if R ≤ 1.
5. Compute the new vector and the corresponding function value: x = x1 + λu and
f = f(x).
6. If f < f1, then set the new values x1 = x and f1 = f, and go to step 3; else
continue to step 7.
7. If i ≤ N, set the iteration number to i + 1 and go to step 4. Otherwise, if i > N, go
to step 8.
8. Compute the new, reduced step length as λ = λ/2. If the new step length is smaller
than or equal to ϵ, then go to step 9; else go to step 4.
9. Stop the procedure, taking xopt = x1 and fopt = f1.

8.1.0.1 Example

Minimise f(x1, x2) = x1 − x2 + 2x1² + 2x1x2 + x2² using the random walk method. Begin
with the initial point x0 = [0, 0] and a starting step length of λ = 1. Use ϵ = 0.05 and
iteration limit N = 100.
If you code this method with the stated parameters, your output once the terminating
criterion is met should approximately be x = [−0.99768499, 1.49885167], with the
corresponding function value f(x) = −1.249993279604305. A possible implementation is sketched below.
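This sketch is our own; results vary slightly from run to run because the search directions are random:

import numpy as np

def random_walk(f, x, lam=1.0, eps=0.05, N=100, seed=1):
    rng = np.random.default_rng(seed)
    f_best = f(x)
    while lam > eps:
        i = 1
        while i <= N:
            r = rng.uniform(-1, 1, size=len(x))
            R = np.linalg.norm(r)
            if R > 1:                  # reject to avoid directional bias
                continue
            x_new = x + lam * (r / R)  # step along a random unit vector
            if f(x_new) < f_best:
                x, f_best = x_new, f(x_new)
                i = 1                  # improvement: reset iteration count
            else:
                i += 1
        lam = lam / 2                  # reduce the step length
    return x, f_best

f = lambda x: x[0] - x[1] + 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2
print(random_walk(f, np.array([0.0, 0.0])))  # close to (-1, 1.5), f = -1.25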
Let us plot the function to see if our answer makes sense:

8.2 Downhill Simplex Method of Nelder and Mead

A direct search method for the unconstrained optimisation problem is the downhill simplex method developed by Nelder and Mead (1965). It makes no
assumptions about the cost function to be minimised. Importantly, the function in question
does not need to satisfy any differentiability conditions, unlike with other methods, i.e. it
is a zero-order method. It makes use of simplices: polytopes with n + 1 vertices in n
dimensions. For example, in 2 dimensions the simplex is a polytope of 3 vertices (a triangle);
in 3-dimensional space it forms a tetrahedron.
The method starts from an initial simplex. Subsequent steps of the method consist
of updating the simplex, for which we define:

Figure 8.1: Random Walk

• xh is the vertex with highest function value,

• xs is the vertex with second highest function value,

• xl is the vertex with lowest function value,

• G is the centroid of all the vertices except xh, i.e. the centroid of n points out of
n + 1:

G = (1/n) Σ_{j=1, j≠h}^{n+1} xj    (8.1)

The movement of the simplex is achieved by using three operations, known as re-
flection, contraction and expansion.

These can be seen in the Figures below:

A common practice to generate the initial remaining simplex vertices is to make use
of x0 + ei b, where ei is the unit vector in the direction of the x i coordinate and b an
edge length. Assume a value of 0.1 for b.

Let y = f (x) and y h = f (xh ) then the algorithm suggested by Nelder and Mead is as
follows:

The typical values for the above factors are α = 1, γ = 2 and β = 0.5. The stopping

Figure 8.2: Random Walk



Figure 8.3: Here we have reflection and expansion

Figure 8.4: Here we have contraction



Figure 8.5: Here we have multiple contractions



Figure 8.6: Nelder Mead Algorithm



criterion to use is defined by:

√[ (1/(n + 1)) Σ_{i=0}^{n} ( f(xi) − f̄ )² ] ≤ ϵ    (8.2)

where f̄ denotes the mean of the function values over the n + 1 vertices.
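In practice the downhill simplex method is readily available as scipy.optimize.minimize with method='Nelder-Mead'; a brief usage sketch on the random walk example function:

import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] - x[1] + 2*x[0]**2 + 2*x[0]*x[1] + x[1]**2
res = minimize(f, x0=np.array([0.0, 0.0]), method='Nelder-Mead',
               options={'xatol': 1e-6, 'fatol': 1e-6})
print(res.x, res.fun)   # approximately (-1.0, 1.5) and -1.25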

8.2.1 Exercises

1. Apply the above two strategies to all the multivariate functions introduced
in earlier chapters and obtain their respective minima.
Chapter 9

Lagrangian Multipliers for


Constraint Optimisation

In this chapter we will briefly consider the optimisation of continuous functions
subject to equality constraints (this will be covered extensively in the 3rd year
course), that is, the problem:

minimise z = f(x)    (9.1)

subject to:

gi(x) = bi,    i = 1, . . . , m,

where f and the gi are differentiable. The Lagrange function, L, is defined by introducing one Lagrange multiplier λi for each constraint gi(x) = bi as:

L(x, λ) = f(x) + Σ_{i=1}^{m} λi [bi − gi(x)]    (9.2)
i =1

The necessary conditions of optimality are given by:

∂L/∂xj = ∂f/∂xj − Σ_{i=1}^{m} λi ∂gi/∂xj = 0,    ∂L/∂λi = bi − gi(x) = 0.    (9.3)

9.0.1 Example

Use Lagrangian multipliers to minimise:

f(x1, x2) = x1² + 4x2²

subject to:
x 1 + 2x 2 = 1


Solution:

∂f/∂x1 − λ ∂g/∂x1 = 0,
∂f/∂x2 − λ ∂g/∂x2 = 0,

and

g(x1, x2) = b.

Therefore, we solve:

2x1 − λ = 0,
8x2 − 2λ = 0,

and

x1 + 2x2 = 1.

Solving these three equations we obtain x1 = 1/2, x2 = 1/4 and λ = 1. Therefore, the
optimum is:

f(x1, x2) = 1/2.

9.0.2 Exercises

1. Minimise

f(x) = x1² + x2²

subject to

x1 + 2x2 + 1 = 0
2. Find the dimensions of a cylindrical tin of sheet metal to maximise its volume
such that the total surface area is equal to A 0 = 24π.
