MA324 Lecture Notes
Mathematical Modelling
and Simulation
Lecture Notes
Winter Term 2023-24
Dr Aled Williams
[email protected]
Department of Mathematics
London School of Economics and Political Science
These notes are based in part on notes by Ahmad Abdi, Katerina Papadaki,
Gregory Sorkin, Giacomo Zambelli and on the textbooks [6, 21].
Contents

I Mathematical Modelling

4 Modelling Tricks
  4.1 Fixed Costs and the Big-M Method
  4.2 Facility Location and the Big-M Method
  4.3 Facility Location and Indicator Variables
  4.4 Expressing Logical Conditions
  4.5 Modelling “or” Constraints (Disjunctions)
  4.6 Semi-Continuous Variables
  4.7 Binary Polynomial Programming
  4.8 Exercises for Self-Study

5 Sensitivity Analysis
  5.1 A Brief Review of Dual LPs
  5.2 Sensitivity Analysis
  5.3 Exercises for Self-Study

II Simulation

Bibliography
Chapter 1
Initial Information and Orientation
Welcome to Mathematical Modelling and Simulation (MA324). The convener for the
course is Dr Aled Williams. My room number and email address are COL.5.05 (Columbia
House) and [email protected], respectively.
You have two primary options if you would like to discuss anything related to this
course. You can firstly post your question to the anonymous discussion forum (avail-
able through Moodle). This approach will likely yield a much faster response time and
your peers will additionally benefit from the question. It should be emphasised that
the anonymous forum is completely anonymous (meaning that no one, including the
lecturer, can see who has posted). Further, it is very valuable if you can answer ques-
tions posted by your peers as studies demonstrate that teaching others is one of the most
effective ways to actually deepen your understanding and enhance learning gains.
You can instead come to my office hours. Note that you do not need to book an ap-
pointment. My office hours during the Winter Term (WT) are 13:30-14:30 each Monday
in COL.5.05 (Columbia House)[1]. These are generally in-person, however, if you would
like to instead meet remotely please drop me an email. It should be noted that these
times could change, however, any such change will be outlined on the course announce-
ments forum (available through Moodle).
Lecture Notes
The lecture notes have been designed as a self-contained study resource for the MA324
Mathematical Modelling and Simulation course at the LSE. It should be noted that the
[1] Note that my office hours will begin in week 3 this term. Despite this, if you have additional questions
please post on the anonymous discussion forum and I will be more than happy to help.
lecture notes are by design “gappy”, which means that gaps are interspersed throughout
and during the lectures we will fill in these gaps together. It should be noted that this
means studying the lecture notes without also attending lectures will not be sufficient
for the course.
The current chapter, Chapter 1, is an orientation section. This chapter provides
information about the course arrangements including lectures, classes, computer work-
shops and course assessment. The other chapters are split across two parts, namely
Mathematical Modelling and Simulation.
Lectures
There is one live lecture each week, taking place on Tuesday 14:00-16:00 (SAL.G.03, Sir
Arthur Lewis Building, formerly 32 Lincoln’s Inn Fields). These will run from WT week
1 through to WT week 10. The aim of these sessions is to help you consolidate your
understanding of the material and the lecture notes. Further, because the notes are by
design “gappy”, throughout the lectures we will fill in the gaps. It should be noted that
we may not cover all the material in the lecture notes during the lectures.
The lectures will be additionally recorded and you can access the videos through
Moodle. Note that there may be a time lag of two days before you can access the
recorded lecture for technical reasons.
Classes
There is one class each week, which takes place either on Monday 9:00-10:00 or Monday
10:00-11:00 (FAW.3.02, Fawcett House). Your personal timetable indicates which of
the two classes you should attend. These classes will run from WT week 2 through
to WT week 11. The aim of these sessions is to strengthen your understanding of the
course material through a combination of individual and group work. Please note that
attendance at these sessions is compulsory and attendance will be recorded. If you have
a good reason to be absent then please email me ahead of the session. The classes will
not be recorded as per departmental policy.
Computer Workshops
There are computer workshops during WT weeks 3, 5, 7, 9 and 11. These take place
on Thursday 9:00-10:00 or Thursday 10:00-11:00 (FAW.4.02, Fawcett House). These
workshops are optional and your personal timetable indicates which of the two classes
you are timetabled to attend.
The primary aim of these sessions is to support you in learning how to utilise the
relevant software (AMPL and R) in order to answer real world problems. No material
will be covered during these sessions, however, you can ask programming questions on
programming examples that were covered in the lectures and classes or instead get help
with using the software on the mock project. Because of this, you are completely free
to drop in and out of either computer workshop as you see fit. The computer workshops
will not be recorded.
Recall that there are two course forums that are accessible through Moodle. One
of the forums (course announcements) is for general announcements that will be made
throughout the term. The other forum (anonymous discussion) is intended for any
questions you may have about the course. It should be noted that it is expected that you
try to answer questions posed by your peers via the anonymous forum as this will have
a positive impact on deepening your understanding of the course material.
There are some exercises marked “exercises for self-study” at the end of each chapter of
the lecture notes. These exercises are designed to deepen your understanding through
applying the lecture material. Note that completing these exercises is entirely optional
and complete solutions to these exercises will not be provided.
1.3 Syllabus
By the end of the academic year we will have covered a wide range of mathematical
topics that provide a broad introduction to mathematical modelling and simulation.
Throughout this course there will be an applied focus as we make use of appropriate
computer software to solve real world problems. The topics we discuss can be split
into two rather broad categories, namely mathematical modelling and simulation. For
mathematical modelling we will discuss:
• linear programming,
• integer programming,
• nonlinear optimisation.
1.4 Course Assessment
You will be given weekly exercises to complete. These exercises will be initiated at the
end of each class and you will complete them at home. The deadline for each weekly
submission is Thursday at 5pm in the week in which your class takes place. Your first class
for instance takes place on Jan 22 (Monday) and the deadline for that submission is Jan
25 (Thursday) at 5pm.
Your solution should be submitted electronically as a PDF file plus any code files
you utilise. You will receive both individual and collective feedback on your work. The
feedback will emphasise key ideas and draw attention to common mistakes and miscon-
ceptions.
It should be emphasised that this work is formative and as such will not contribute
towards your final grade, however, some of the homework will feature questions that
are similar in nature to what will be expected from the project.
You will be given a formative mock project in the second half of WT. In particular, you
will be given a mock project in WT week 5 (Feb 14, 1pm) and have three full
weeks to complete the work (deadline Mar 11, midnight). This mock project will not contribute
towards your final grade, however, it will give a good indication of what to expect from
the final assessment. You will receive individual and collective feedback on your work.
The mock project will be approximately one third of the size of the final project.
Summative Project
There will be one individual summative project in the Spring Term (ST) worth 100% of
your final mark. This will cover mathematical modelling and/or simulation and will be
a report of around 15-20 pages, along with a copy of any computer code that you make
use of. More information will come later.
Part I
Mathematical Modelling
Chapter 2
An Introduction to Optimisation and Modelling in Operational Research
2. decision variables: the values of these variables are under our control and influence the outcome.
After formulating such a model, the goal is to find the values of the decision variables
that give the best possible outcome for the objective function. The general process of
modelling is discussed in the next section. The process of finding such decision variables
is a mathematical optimisation problem.
The term mathematical optimisation (or mathematical programming) refers to using
mathematical tools for optimally allocating and using limited resources when planning
activities. It should be noted that the use of the term mathematical “programming” here
has a rather old-school sense, simply meaning planning, and does not refer to the
process of creating a set of instructions that tell a computer how to perform a task.
Mathematical optimisation deals with optimisation problems. An optimisation prob-
lem consists of maximising or minimising some function, known as the objective or cost
function, subject to some constraints. The objective function could for example repre-
sent total profit or cost, total number of staff or total carbon emissions associated with
a project, while the constraints represent limitations on the available resources or on
the way these can be used.
There are perhaps unsurprisingly a wide variety of such mathematical models. In
particular, some of the broad types of models are:
• linear and nonlinear models: a linear model is one in which the objective function
and constraints are linear; otherwise we have a nonlinear model.
• integer and noninteger models: if one or more decision variables must take integer
values, then the optimisation model is an integer model; otherwise the model is noninteger.
It may be the case that you do not have a single objective function or that perhaps your objectives
are conflicting. Further, such a model may not capture all details of your real life situa-
tion since, for example, a model may simply not exist for your problem or perhaps you
can formulate a sufficiently detailed model but not solve this in practice. This suggests
there may be some trade off between the accuracy of your model and your ability to
actually solve your model.
After formulating the model, one should then implement this model using some program-
ming language. In particular, we will be using AMPL (A Mathematical Programming
Language), which is specifically designed for mathematical optimisation. To find the
computer solution we make use of a solver, of which AMPL has a variety available. The
computer solution will yield our mathematical solution, which should then be interpreted
as a real life solution, i.e. something that can be implemented in the real world. It may
be that the real life solution is accepted by the organisation that set the initial problem;
however, it could turn out that the solution is unsatisfactory, in which case one would go
back and change the model.
2.3 An Introduction to Linear Programming
The focus of the following sections is on linear programming, a simple yet fundamen-
tal mathematical optimisation model. Recall that a linear model is one in which both
the objective function and constraints are linear. A linear programming problem is to
determine values of the decision variables in order to maximise (or minimise) a linear
objective function subject to linear constraints. This is the fundamental problem in
mathematical optimisation.
Formally, a general linear programming problem is of the form

    minimise or maximise    c1 x1 + c2 x2 + . . . + cn xn
    subject to              a11 x1 + a12 x2 + . . . + a1n xn ø b1
                            a21 x1 + a22 x2 + . . . + a2n xn ø b2
                            ...
                            am1 x1 + am2 x2 + . . . + amn xn ø bm,

where the number of variables is n, the number of constraints is m and the symbol ø
denotes any one of the ≤, ≥ or = relations.

A linear programming problem in which all variables are nonnegative is of the form

    minimise or maximise    c1 x1 + c2 x2 + . . . + cn xn
    subject to              a11 x1 + a12 x2 + . . . + a1n xn ø b1
                            a21 x1 + a22 x2 + . . . + a2n xn ø b2
                            ...                                          (2.1)
                            am1 x1 + am2 x2 + . . . + amn xn ø bm
                            xj ≥ 0,    j = 1, 2, . . . , n,

where the number of variables is n, the number of resource constraints is m and there
are n nonnegativity constraints xj ≥ 0 for all j ∈ {1, 2, . . . , n}. We will say in such a case
that the linear program (LP) has size m × n, namely if the LP has n variables and m
resource constraints.

In matrix algebra terms, a linear programming problem in which all variables are
nonnegative can be written in the form

    minimise or maximise    cᵀx
    subject to              Ax ø b
                            x ≥ 0.
Example. (Furniture production) A furniture factory produces two types of benches using
steel and wood. The benches are then sold to furniture shops for £3000 and £1000 per
dozen units of each type, respectively. Wood is only used for benches of type 1 and steel
is used in both. It requires 1 tonne of steel to produce a dozen benches of type 2, while the
same number of benches of type 1 requires 1 tonne of steel and 1 tonne of wood. In total,
the factory has 3 tonnes of steel and 2 tonnes of wood available in the next month. The
question facing the factory is, given the limited availability of materials, what quantity (in
dozens) of each product should the company produce in the next month, in order to achieve
the maximum total profit?
In the previous example, the factory can decide how many benches of each type to
produce (this corresponds to the decision variables), the factory’s objective is to achieve
the maximum total profit (corresponding to the objective function) and the factory is
restricted by the limited availability of materials (corresponding to the resource
constraints). Further, note that the nonnegativity constraints were deduced as it would
not make sense to produce a negative number of benches in a month.
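Although the course uses AMPL for computation, the furniture production LP is small enough to check with a quick sketch in Python using scipy's linprog (an illustrative aside, not part of the formal course material). Since linprog minimises by convention, we negate the profit coefficients to maximise.

```python
from scipy.optimize import linprog

# Decision variables: x1, x2 = dozens of benches of type 1 and type 2.
# linprog minimises, so the profit per dozen (£3000, £1000) is negated.
c = [-3000, -1000]
A_ub = [[1, 1],   # steel: 1 tonne per dozen benches of each type
        [1, 0]]   # wood: 1 tonne per dozen benches of type 1 only
b_ub = [3, 2]     # 3 tonnes of steel and 2 tonnes of wood available

# Variable bounds default to x >= 0, matching the nonnegativity constraints.
res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")
print(res.x, -res.fun)
```

This reports producing two dozen benches of type 1 and one dozen benches of type 2, for a maximum total profit of £7000.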
Let us briefly look at an abstract production model. For this purpose, consider some
company that produces n products using m types of material. For the next time pe-
riod, the unit prices for the n products are projected to be c1, c2, . . . , cn. The amounts of
materials available to the company in the next time period are given by b1, b2, . . . , bm. The
amount of material i ∈ {1, 2, . . . , m} consumed by a unit of product j ∈ {1, 2, . . . , n} is
given by aij ≥ 0, where some aij’s can take value zero if product j does not use material i. Given
the limited availability of materials, what quantity of each product should the company
produce in the next time period in order to achieve maximum total profit?
The above information can be formulated as an LP, namely

    maximise    c1 x1 + c2 x2 + . . . + cn xn
    subject to  ai1 x1 + ai2 x2 + . . . + ain xn ≤ bi,    i = 1, 2, . . . , m,      (2.2)
                xj ≥ 0,    j = 1, 2, . . . , n.
Note once more that the nonnegativity constraints were deduced since it would not make
sense to expect the company to produce a negative amount of a product. In addition, if
one product, say product j, is not profitable, i.e. if cj < 0, then without the nonnegativity
constraints the model could produce a solution with xj < 0, which would generate a
profit cj xj > 0. In fact, as a negative amount of a product j would not consume any
material but instead “generate” materials, one could in such a case drive the profit towards
positive infinity by forcing xj to tend to negative infinity.
Our objective in the LP (2.2) is to maximise the combined unit prices for the items
produced subject to bounds on the availability of the m types of material. In a similar
fashion, in matrix algebra terms, the abstract production model corresponds to the LP
2.4 Feasible and Optimal Solutions to Linear Programs
Consider the LP

    maximise    2x1 + x2
    subject to  5x1 + 11x2 ≤ 90                                      (2.3)
                x2 ≤ 5
                x1, x2 ≥ 0.
We can represent each constraint in (x1, x2) space. Note that the nonnegativity con-
straints on the variables imply that only the nonnegative quadrant, including the axes,
needs to be considered. This example is illustrated in Figure 2.2.
Figure 2.2: The feasible region for the LP (2.3) is represented by the area shaded grey.
Notice that it is not the case that all points in the (x1, x2) space satisfy all of the
constraints. The point (0, 6) for example does not satisfy the second constraint, while
the point (0, 0) in contrast does satisfy each of the constraints. This observation inspires
the following definitions. A point is called a feasible point for an LP if it satisfies all
constraints. The set of all feasible points forms the feasible region of the problem. The
feasible region of the above LP (2.3) is depicted in grey in Figure 2.2.
Observe that the optimal solution to the LP (2.3), namely (18, 0), is a corner point of the
feasible region. This is not a coincidence as it is true in general that the optimum is
achieved by some “corner point”. For this purpose, we define what we mean by “cor-
ner points”, which in the language of linear programming are called extreme points. An
extreme point is a feasible point that satisfies at equality n (in this case n = 2) indepen-
dent linear constraints. Note in the above example (18, 0) and (0, 0) are examples of
two extreme points, while (1, 1) and (5, 0) are not extreme. One of the most fundamen-
tal and useful facts in the theory of linear programming is that if an LP has a feasible
solution that satisfies at equality n independent linear constraints and the LP admits an
optimal solution, then there exists some optimal solution that is an extreme point of the
feasible region. This fact is useful as it tells us that when one solves an LP it is enough
to search among the extreme points and pick the one with the best objective function value.
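The fact above can be illustrated with a brute-force sketch (an illustrative aside, not the method used in the course): enumerate every pair of constraints of the LP (2.3), compute the intersection point of each pair of constraint lines, discard the infeasible points, and pick the feasible intersection point with the best objective value.

```python
import itertools
import numpy as np

# Constraints of LP (2.3), all written as a1*x1 + a2*x2 <= b:
#   5x1 + 11x2 <= 90,  x2 <= 5,  -x1 <= 0,  -x2 <= 0
A = np.array([[5.0, 11.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([90.0, 5.0, 0.0, 0.0])
c = np.array([2.0, 1.0])  # objective: maximise 2x1 + x2

candidates = []
for i, j in itertools.combinations(range(len(A)), 2):
    M = A[[i, j]]
    if abs(np.linalg.det(M)) < 1e-9:   # skip parallel (dependent) constraint pairs
        continue
    x = np.linalg.solve(M, b[[i, j]])  # intersection of the two constraint lines
    if np.all(A @ x <= b + 1e-9):      # keep only feasible intersection points
        candidates.append(x)

best = max(candidates, key=lambda x: c @ x)
print(best, c @ best)
```

For the LP (2.3) this recovers the optimal extreme point (18, 0) with objective value 36.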
Figure 2.2 further shows that there are multiple feasible solutions to our LP. It
should be noted that if there are multiple feasible solutions, then the feasible region
is neither empty nor a single point. A natural question to ask here is: which of the
feasible solutions is “best”? This question inspires the following definition. Given a
maximisation (or minimisation) linear programming problem, an optimal solution for
the LP is a feasible point at which the largest (or smallest, in the case of minimisation)
value of the objective function is attained among all feasible points.
The LP represented in Figure 2.2 has one optimal solution, however, it should be em-
phasised that there may be cases where there exist multiple optimal solutions. Suppose
for example that, instead of maximising 2x1 + x2 in the LP (2.3), we were tasked
with minimising the value of x2, namely that we have the LP

    minimise    x2
    subject to  5x1 + 11x2 ≤ 90
                x2 ≤ 5
                x1, x2 ≥ 0.
Note that the optimal value must be between 0 and 5 because the constraints of the
problem yield that 0 ≤ x2 ≤ 5. Further, observe that the points (0, 0) and (1, 0) are both
feasible solutions that attain objective value 0 (since both points have second coordi-
nate equal to zero). It follows in consequence that this problem has multiple optimal
solutions. In particular, all points of the form (a, 0) are optimal, where 0 ≤ a ≤ 18.
It can happen that a linear programming problem has no feasible solution. In such
case, we say that the LP is infeasible. Note that equivalently we could state that an LP
is infeasible if its feasible region is empty. Consider for example the LP

    maximise    2x1 + x2
    subject to  x1 + x2 ≤ 4                                          (2.4)
                x2 ≥ 5
                x1, x2 ≥ 0.
It can be seen from a diagram that the LP (2.4) is indeed infeasible. Alternatively, one
can see algebraically that this problem is infeasible since if x2 ≥ 5 and both variables x1, x2
are nonnegative, then their sum x1 + x2 must have value at least 5, meaning that the
first constraint is violated.
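In practice a solver simply reports infeasibility. As an illustrative sketch (the course itself uses AMPL), scipy's linprog flags the LP (2.4) as infeasible; note that linprog expects ≤ rows, so the constraint x2 ≥ 5 is rewritten as −x2 ≤ −5.

```python
from scipy.optimize import linprog

# LP (2.4): maximise 2x1 + x2  s.t.  x1 + x2 <= 4,  x2 >= 5,  x1, x2 >= 0.
# linprog minimises and expects <= rows: negate the objective, flip the >= row.
c = [-2, -1]
A_ub = [[1, 1],    # x1 + x2 <= 4
        [0, -1]]   # -x2 <= -5, i.e. x2 >= 5
b_ub = [4, -5]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")
print(res.status)  # status code 2 means the problem appears infeasible
```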
It can happen that a linear programming problem has feasible solutions yet there
does not exist an optimal solution. Note that this means the LP may have a nonempty
feasible region but not attain a maximum or minimum value for the objective func-
tion. Consider the LP

    maximise    x1 + x2
    subject to  x1 + x2 ≥ 1
                −x1 + x2 ≤ 2                                         (2.5)
                x1 − 2x2 ≤ 2
                x1, x2 ≥ 0.
The LP (2.5) is illustrated in Figure 2.3. From the diagram, it is evident that there exist
feasible solutions for the LP with arbitrarily large value of the objective function. This
observation inspires the following definition. A maximisation (minimisation) linear pro-
gramming problem is unbounded if there exist feasible solutions with arbitrarily large
positive (negative) value of the objective function. Informally, an LP is unbounded if it
is feasible but its objective function can be made arbitrarily “good”.
Observe that an LP is unbounded only if its feasible set is an unbounded set. How-
ever, an unbounded feasible set does not necessarily imply that the LP is itself un-
bounded. If, for example, in the LP (2.5) we were instead asked to minimise x1 + x2,
then we would clearly have at least one optimal solution.
It should be noted that while the unboundedness of an LP is a mathematical possi-
bility, in real-world scenarios, a message from the solver indicating that the problem is
unbounded likely indicates that some error has occurred during the mathematical for-
mulation of the problem, typically the omission of at least one constraint.
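Solvers likewise detect unboundedness. As an illustrative sketch (again using scipy rather than the course's AMPL), maximising x1 + x2 over the region given by x1 + x2 ≥ 1, −x1 + x2 ≤ 2, x1 − 2x2 ≤ 2 and x1, x2 ≥ 0 makes the solver report an unbounded problem.

```python
from scipy.optimize import linprog

# maximise x1 + x2  s.t.  x1 + x2 >= 1,  -x1 + x2 <= 2,  x1 - 2x2 <= 2,  x >= 0.
# linprog minimises and expects <= rows: negate the objective, flip the >= row.
c = [-1, -1]
A_ub = [[-1, -1],   # -(x1 + x2) <= -1, i.e. x1 + x2 >= 1
        [-1, 1],    # -x1 + x2 <= 2
        [1, -2]]    # x1 - 2x2 <= 2
b_ub = [-1, 2, 2]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, method="highs")
print(res.status)   # status code 3 means the problem appears unbounded
```

Here the direction (1, 1) stays feasible forever while improving the objective, which is exactly what the solver detects.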
In summary, for any linear programming problem, exactly one of the following sce-
narios must occur:
1. the LP is infeasible;
2. the LP is unbounded;
3. the LP has an optimal solution.
It should be emphasised that the fact we mention LPs in the above statement is crucial.
In particular, the statement may be false when given an optimisation problem that is not
linear. For example, consider the following nonlinear optimisation problem

    minimise    x2
    subject to  x1 · x2 ≥ 1
                x1, x2 ≥ 0.

In this scenario, notice that the problem is feasible, as (1, 1) for example satisfies the
constraints, and the problem is not unbounded, since it is a minimisation problem whose
objective takes value no less than 0. Despite this, there is no optimal solution as there are
feasible solutions for which x2 takes values arbitrarily close to 0; however, no feasible
solution with x2 = 0 exists in light of the first constraint.
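A quick numerical sketch (my own illustration) makes this concrete: the points (t, 1/t) are feasible for every t ≥ 1, so the objective value x2 = 1/t can be pushed arbitrarily close to 0, yet never reaches it.

```python
# Each point (t, 1/t) satisfies x1 * x2 = 1 >= 1 and x1, x2 >= 0, so it is
# feasible, and its objective value x2 = 1/t shrinks towards 0 as t grows.
values = [1 / t for t in (1, 10, 100, 1000, 10000)]
print(values)
assert all(v > 0 for v in values)   # the value 0 itself is never attained
```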
2.5 Assumptions of a Linear Programming Problem
1. Proportionality: the contribution of each decision variable to the objective function and to each constraint is directly proportional to the value of that variable.
2. Additivity: the total contribution to the objective function and to each constraint is the sum of the individual contributions of the decision variables.
3. Divisibility: the decision variables are allowed to take fractional (noninteger) values.
4. Certainty: the parameter values (the data) of the model are known with certainty.
It should be emphasised that the first assumption would not normally hold because of
economies or diseconomies of scale, as the per unit profit is not independent of the
number of items sold. The second assumption may not hold as two products may can-
nibalise one another. The third assumption makes a significant difference in scenarios
where the variables must take integer values and simply rounding would not be suf-
ficient. The fourth assumption may be overcome provided the level of uncertainty is
“small” by making use of tools from sensitivity analysis, which is a topic we study later,
however, if the level of uncertainty is “large”, then one may have to tackle such problems
using tools from stochastic programming or robust optimisation.
When dealing with linear programming problems, it is often convenient to assume that
they are of some specific form. We will often consider problems in one of the following
forms.
An LP is in standard form if it is of the form

    maximise    cᵀx
    subject to  Ax ≤ b
                x ≥ 0,

and an LP is in standard equality form if it is of the form

    maximise    cᵀx
    subject to  Ax = b
                x ≥ 0.
It should be noted that it is possible to show that any general LP “can be brought” into
either of the above forms. This means that it is always possible to take any LP and write
an “equivalent” LP in either standard form or standard equality form as desired. An
important note is that if an LP is a minimisation problem, then it can be turned into a
maximisation problem by replacing min cᵀx with max −cᵀx subject to the same con-
straints. For this reason, it does not make a difference from a mathematical standpoint
in linear programming if we are maximising or minimising.
Example. (Phone manufacturing) A phone manufacturer has just announced two new
phones, a budget model and a high-end model, to be released in a particular region where
upper bounds on the demand for these models are known. Suppose for simplicity that the
making of a phone can be simplified down to two resources, namely the minutes of machine
time and the material required. The resources are limited since we have only 2500 minutes
of machine time per month and 4800 units of material. The required resources per phone,
the demand per month and the sales price for each model are given in the table below.
Model this scenario as an LP and solve.
Example. (Fast food restaurant) A popular fast food restaurant in Covent Garden makes
burgers from some combination of high quality meat and some cheaper meat with a higher
fat content. The company keeps precise details a secret, however, the restaurant guarantees
that its burgers have a fat content of no more than 25%. The high quality meat costs 80p per
kilogram and comprises 80% lean meat and 20% fat. The cheaper meat costs 60p per
kilogram and comprises 68% lean meat and 32% fat. How much of each kind of meat
should the restaurant use in each kilogram of burger meat if it wants to minimise its cost
and ensure the fat content is no more than 25%?
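One possible formulation (sketched here in Python for illustration; in the course such models are solved with AMPL): let x1 and x2 denote the kilograms of high quality and cheaper meat in each kilogram of burger meat, and minimise 80x1 + 60x2 subject to x1 + x2 = 1 and 0.20x1 + 0.32x2 ≤ 0.25.

```python
from scipy.optimize import linprog

# x1, x2 = kg of high quality and cheaper meat per kg of burger meat.
c = [80, 60]             # cost in pence per kilogram of each meat
A_ub = [[0.20, 0.32]]    # fat content per kilogram of each meat
b_ub = [0.25]            # at most 25% fat overall
A_eq = [[1, 1]]          # the two meats make up exactly 1 kg of burger meat
b_eq = [1]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, method="highs")
print(res.x, res.fun)
```

The optimal blend works out to 7/12 kg of high quality meat and 5/12 kg of cheaper meat, at a cost of about 71.7p per kilogram of burger meat.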
It should be noted that the examples we have seen to date have not required a large
amount of data. This will not be the case for all examples and, for that reason, in some
scenarios it may be useful to change the right-hand sides, i.e. the bi’s appearing in the
LP (2.1) with i ∈ {1, 2, . . . , m}, to parameters. In particular, for each constraint, it may be
useful to replace the right-hand side with some parameter, save all the data (parameter
values) in a data file and call upon this data before using the solver via AMPL.
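The same separation of model and data can be sketched outside AMPL too (a hypothetical illustration of the idea, not AMPL syntax): the parameter values live in their own data structure, and the model is built from whatever values that structure supplies. Here the furniture production data from earlier in the chapter plays the role of the data file.

```python
from scipy.optimize import linprog

# "Data file": parameter values kept separate from the model structure.
data = {
    "profit": [-3000, -1000],   # profit per dozen benches, negated for linprog
    "usage": [[1, 1],           # tonnes of steel per dozen benches of each type
              [1, 0]],          # tonnes of wood per dozen benches of each type
    "rhs": [3, 2],              # right-hand sides b_i: available resources
}

# "Model": built from whatever parameter values the data supplies.
res = linprog(data["profit"], A_ub=data["usage"], b_ub=data["rhs"],
              method="highs")
print(res.x)
```

Changing the available resources now only requires editing the data, not the model.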
2.8 Exercises for Self-Study

1. For each of the following constraints, draw a separate graph to show the nonneg-
ative solutions that satisfy the constraint.
a) x1 + 3x2 ≤ 6.
b) 4x1 + 3x2 ≤ 12.
c) 4x1 + x2 ≤ 8.
Combine these constraints into a single graph to show the feasible region, where
these constraints are our resource constraints plus nonnegativity.
2. Determine (if possible) values of c1 and c2 such that the linear programming problem

    maximise    c1 x1 + c2 x2
    subject to  x1 + 3x2 ≤ 6
                4x1 + 3x2 ≤ 12
                4x1 + x2 ≤ 8
                x1, x2 ≥ 0,
c) is infeasible, and
d) is unbounded.
If for some case there does not exist such c1 and c2 , then you should justify why
this is the case.
3. Solve the following linear programming problem:

    maximise    2x1 + x2
    subject to  x2 ≤ 10
                2x1 + 5x2 ≤ 60
                x1 + x2 ≤ 18
                3x1 + x2 ≤ 44
                x1, x2 ≥ 0.
4. Recall the maximisation LP given in the previous exercise. Which (if any) of the
constraints can be removed without changing the optimal solution? Verify your
answer using the solver via AMPL.
5. This is your lucky day. You have just won a £10,000 prize. You are setting aside
£4,000 for taxes and general partying expenses, however, you have decided to in-
vest the other £6,000. Upon hearing the news, two different friends have offered
you an opportunity to become a partner in two different entrepreneurial ventures,
one planned by each friend. In both cases, this investment would involve expend-
ing some of your time next summer as well as putting up cash. Becoming a full
partner in the first friend’s venture would require an investment of £5,000 and
400 hours, where your estimated profit (ignoring the value of your time) would
be £4,500. The corresponding figures for the second friend’s venture are £4,000
and 500 hours, with an estimated profit to you of £4,500. However, both friends
are flexible and would allow you to come in at any fraction of a full partnership
you would like. If you choose a fraction of a full partnership, all the above figures
given for a full partnership (money investment, time investment and your profit)
would be multiplied by this same fraction.
Because you were looking for an interesting summer job anyway (for a maximum
of 600 hours), you have decided to participate in one or both friends’ ventures
in whichever combination would maximise your total estimated profit. You now
need to solve the problem of finding the best combination. Model this scenario as
an LP and solve using AMPL.
6. The Household Web sells many household products online. The company needs
substantial warehouse space for storing its goods. Plans now are being made for
leasing warehouse storage space over the next 5 months. Just how much space
will be required in each of these months is known. However, since these space
requirements are quite different, it may be most economical to lease only the
amount needed each month on a month-by-month basis. On the other hand, the
additional cost for leasing space for additional months is much less than for the
first month, so it may be less expensive to lease the maximum amount needed for
the entire 5 months. Another option is the intermediate approach of changing the
total amount of space leased (by adding a new lease and/or having an old lease
expire) at least once but not every month.
Month   Required Space (sq. ft)
1       30,000
2       20,000
3       40,000
4       10,000
5       50,000

Leasing Period (months)   Cost per sq. ft Leased
1                         £65
2                         £100
3                         £135
4                         £160
5                         £190
The space requirements and the leasing costs for the various leasing periods are
described in the tables above. The objective is to minimise the total leasing
cost for meeting the space requirements. Formulate a linear programming model
for this problem and solve.
7. It is now the beginning of the Lent Term and Beryl is confronted with the problem
of assigning different working hours to her operators. Because all the operators
are currently enrolled in the university, they are available to work only a limited
number of hours each day, as shown in the following table.
There are six available operators, namely four undergraduate students and two
graduate students. They all have different wage rates because of differences in
their experience with computers and programming. The above table outlines their
wage rates, along with the maximum number of hours that each operator can work
each day.
Each operator is guaranteed a certain minimum number of hours per week that
will maintain an adequate knowledge of the operation. This level is set arbitrarily
at 8 hours per week for the undergraduate students (K. C., D. H., H. B., and S. C.)
and 7 hours per week for the graduate students (K. S. and N. K.).
The computer facility is to be open for operation from 8am to 10pm Monday
through Friday with exactly one operator on duty during these hours. On Satur-
days and Sundays, the computer is to be operated by other staff.
Because of a tight budget, Beryl has to minimise costs. She wishes to determine
the number of hours she should assign to each operator on each day. Formulate a
linear programming model for this problem and solve.
8. The shaded area in the following graph represents the feasible region of a linear
programming problem whose objective function is to be maximised.
Label each of the following statements as True or False, justifying your answer
based on a graphical method. In each scenario, give an example of an objective
function that illustrates your answer.
a) If (3, 3) produces a larger value of the objective function than (0, 2) and
(6, 3), then (3, 3) must be an optimal solution.
b) If (3, 3) is an optimal solution and multiple optimal solutions exist, then
either (0, 2) or (6, 3) must also be an optimal solution.
c) The point (0, 0) cannot be an optimal solution.
9. The Metalco Company desires to blend a new alloy of 40 percent tin, 35 percent
zinc and 25 percent lead from several available alloys. The properties of the avail-
able alloys are outlined in the following table.
Alloy
Property 1 2 3 4 5
Percentage of Tin 60 25 45 20 50
Percentage of Zinc 10 15 45 50 40
Percentage of Lead 30 60 10 30 10
Cost (£/kg) 77 70 88 84 94
The objective is to determine the proportions of these alloys that should be blended
to produce the new alloy at a minimum cost. Formulate a linear programming
model for this problem and solve.
Chapter 3
Integer and Mixed Integer Programming Applications
Let us begin this chapter with a simple motivating example: an optimisation problem known as the "healthy" diet problem. Suppose that Caleb is a mathematics student who does not particularly enjoy going to the gym. Despite this, Caleb
still wants to live a somewhat healthy lifestyle and decides to compose a diet that meets
the daily reference intake of vitamins with the minimal amount of calories. Unfortu-
nately for Caleb, they can only eat pizzas and burritos because of where they live. For
this purpose, the (fictional) nutritional values of a slice of pizza and a burrito are shown
below, in Table 3.1.
A C D Calories
Pizza 225 120 200 600
Burrito 600 100 75 300
Intake 1800 550 600
Table 3.1: The (fictional) nutritional values of a slice of pizza and a burrito, as well as
the required daily intake of vitamins A, C and D.
Note that in this case a solution to this optimisation problem yields a diet of some
combination of slices of pizzas and burritos. The variables of the diet are the number
of slices of pizza, x_p ≥ 0, and the number of burritos, x_b ≥ 0. Since Caleb has a preference for lower-calorie meals, the objective function will be the minimisation of calories consumed, namely

minimise 600 x_p + 300 x_b.
Observe that if one simply requires x_p, x_b ≥ 0, the optimal solution for the above objective function would be to simply eat nothing at all. However, to achieve the daily
28 Chapter 3. Integer and Mixed Integer Programming Applications
reference intake, we have to respect the constraints on the amount of vitamins, namely

225 x_p + 600 x_b ≥ 1800   (vitamin A)
120 x_p + 100 x_b ≥ 550    (vitamin C)
200 x_p + 75 x_b ≥ 600     (vitamin D).
These constraints will be our resource constraints. Following the techniques outlined
in the previous chapter, one suspects that the optimisation problem can be modelled by
formulating the LP

minimise   600 x_p + 300 x_b
subject to 225 x_p + 600 x_b ≥ 1800
           120 x_p + 100 x_b ≥ 550
           200 x_p + 75 x_b ≥ 600
           x_p, x_b ≥ 0.

Using AMPL, we find that this LP has optimal solution (x_p, x_b)^T = (75/44, 38/11)^T with
optimal value 22650/11 ≈ 2059.09. This tells us that Caleb's optimal diet consists of
eating approximately 1.70455 slices of pizza and 3.45455 burritos each day, for a total
of around 2059.0909 calories.
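The exact optimal solution is easy to recover by hand, since at the optimum the vitamin C and D constraints are tight. As a quick sanity check (an illustrative sketch, separate from the AMPL workflow used in the course), the resulting 2×2 linear system can be solved exactly with Python's fractions module:

```python
from fractions import Fraction as F

# At the LP optimum the vitamin C and D constraints are tight:
#   120 x_p + 100 x_b = 550
#   200 x_p +  75 x_b = 600
# Solve this 2x2 system exactly by Cramer's rule.
a, b, e = F(120), F(100), F(550)
c, d, f = F(200), F(75), F(600)
det = a * d - b * c               # determinant of the coefficient matrix
x_p = (e * d - b * f) / det
x_b = (a * f - e * c) / det
calories = 600 * x_p + 300 * x_b  # objective value

print(x_p, x_b, calories)         # 75/44 38/11 22650/11
```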
Despite the solution (x_p, x_b)^T = (75/44, 38/11)^T being optimal, it is not particularly practical since expecting Caleb to eat 1.70455 slices of pizza and 3.45455 burritos each
day is rather difficult. In light of this, it would be useful if we could find a solution
with integer entries since, in that case, Caleb would have a food plan that requires
consuming a combination of a specific number of slices of pizza or whole burritos. It
should be emphasised that in such case we are interested in solving the above LP with
the additional requirement that the variables take integer values.
Perhaps at this point one may naturally think that simply rounding the above optimal
solution, where we round each entry to the nearest integer, would be sufficient. If we
do this, we obtain the rounded solution (x_p, x_b)^T = (2, 3)^T, which corresponds to eating
two slices of pizzas and three burritos. Upon substituting this rounded solution into the
above resource constraints, we notice that the second constraint becomes

120(2) + 100(3) = 540 < 550,
which tells us that this rounded solution is not feasible, i.e. that eating this combination
does not meet the daily intake requirements for vitamin C. Further, it is not particularly
clear how we should proceed without either relying on a graphical method (which would not be of use with more variables) or instead solving this optimisation problem via some brute-force approach.
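In fact, for a problem this small the brute-force approach just mentioned is easy to carry out. The following sketch (illustrative only; the course itself uses AMPL) enumerates small integer combinations and shows that the best integer-feasible diet costs 2400 calories, noticeably more than the LP optimum of roughly 2059:

```python
# Brute-force search over integer diets (x_p slices of pizza, x_b burritos).
# The search bound 20 is a safe cap: any optimal diet uses far fewer items.
best = None
for x_p in range(21):
    for x_b in range(21):
        feasible = (225 * x_p + 600 * x_b >= 1800 and   # vitamin A
                    120 * x_p + 100 * x_b >= 550 and    # vitamin C
                    200 * x_p + 75 * x_b >= 600)        # vitamin D
        if feasible:
            calories = 600 * x_p + 300 * x_b
            if best is None or calories < best[0]:
                best = (calories, x_p, x_b)

print(best)  # (2400, 0, 8)
```

There are in fact several integer diets achieving 2400 calories; the enumeration simply reports the first one it finds.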
3.1. Integer and Mixed Integer Programming 29
In general, an integer linear programming problem (IP) with n variables takes the form

maximise   c^T x
subject to Ax ≤ b                    (3.1)
           x ≥ 0
           x_i ∈ ℤ, i ∈ I,

where I ⊆ {1, 2, . . . , n} indexes the integer-constrained variables.
Note that if the problem has both integer and continuous variables, namely if I ≠ ∅ and I ≠ {1, 2, . . . , n}, then the problem is called a mixed-integer linear programming problem. In that case, we call the above integer program a mixed-integer program (MIP). If instead all variables are integer, i.e. if I = {1, 2, . . . , n}, then the problem is called a (pure) integer linear programming problem. If all variables are binary, namely if x_i ∈ {0, 1} for all i ∈ {1, 2, . . . , n}, then the problem is called a binary linear programming problem. In that scenario, we call the above integer program a binary integer program (BIP).
The set

{x ∈ ℝ^n : Ax ≤ b, x ≥ 0, x_i ∈ ℤ for all i ∈ I}
is the feasible region of the IP (3.1). Figure 3.1 illustrates the feasible regions for some
mixed-integer and pure integer linear programming problems.
Figure 3.1: The feasible regions associated with some mixed-integer and pure integer linear programming problems.
There is a natural linear programming problem associated with (3.1), namely the
linear programming problem
maximise   c^T x
subject to Ax ≤ b                    (3.2)
           x ≥ 0.
The LP (3.2) is called the linear relaxation of the IP (3.1). Let IP(c, A, b) and LP(c, A, b) denote the optimal values of the IP (3.1) and the LP (3.2), respectively. It is worth noting that IP(c, A, b) and LP(c, A, b) denote the maximum objective function values associated with the IP (3.1) and the LP (3.2), respectively. An easy but surprisingly useful fact is that

IP(c, A, b) ≤ LP(c, A, b)            (3.3)
holds. Informally, because more restrictions are imposed on the variables in the IP (3.1)
compared to LP (3.2), we would not expect the objective value of (3.1) to be greater
than the objective value of the relaxed problem (3.2).
Remark. Observe that we defined integer linear programming problems as "maximisation" problems. Integer linear programming problems can be defined equivalently as minimisation problems. If we instead consider a minimisation IP, then the above relation (3.3) is reversed, namely IP(c, A, b) ≥ LP(c, A, b).
If an optimal solution x̄ of the LP relaxation (3.2) happens to satisfy the integrality constraints x̄_i ∈ ℤ for all i ∈ I, then since x̄ is feasible for the IP and its objective value satisfies c^T x̄ = LP(c, A, b) ≥ IP(c, A, b), we can conclude that x̄ is indeed an optimal solution for the IP (3.1) and that the equality LP(c, A, b) = IP(c, A, b) holds.
If LP(c, A, b) = IP(c, A, b) holds, then there exists an optimal solution to the IP which is also optimal for the corresponding LP relaxation. Despite this, in general IP(c, A, b) may be different from LP(c, A, b) and, in practice, "it almost always is". Note that similar
deductions can be made in the setting of MIPs rather than IPs.
A natural question to ask is: given that these two values do not in general coincide, how "challenging" is it to solve each problem? It turns out that an LP can be solved efficiently
in both theory and practice. Being a little more precise, an LP in practice can be solved
efficiently in “most cases” by making use of the celebrated simplex method. The first
polynomial time algorithm was the ellipsoid method, demonstrated by Khachiyan [17]
in 1979, however, the later polynomial time interior-point method of Karmarkar [16]
was arguably of greater theoretical and practical importance. Note that problems that
can be solved in polynomial time are thought of as “easy” or “tractable” since the running
time of the algorithm is upper bounded by a polynomial expression in the size of the
input for the algorithm.
Despite the seeming similarity between (M)IPs and LPs, it turns out that such problems cannot in theory be solved efficiently; however, some instances of the problem may be solvable depending on the formulation. Formally, integer and mixed-integer
programming is N P -hard, a mathematical concept of hardness in computational com-
plexity theory, which informally means that one should not expect to solve a random
instance of the problem in polynomial time unless P = N P (see e.g. [13]). Despite
this, in practice solvers make use of two main approaches, namely branch and bound
and cutting planes, to solve (M)IPs. It should be emphasised that unlike LP solvers,
(M)IP solvers cannot guarantee fast solution times in all instances and the running time is heavily dependent on the underlying formulation. The branch and bound algorithm
will be discussed in the next section.
It is important to keep in mind the trade-off between the vastly increased expressive power of IP models and the increase in difficulty in solving IPs compared to LPs. Practically speaking, state-of-the-art solvers can solve LPs with hundreds of thousands of variables. For IPs, the time required to compute a solution is heavily dependent on the specific instance. For example, there may be IP instances that can be solved within minutes, however, there exist "tiny" instances with a few hundred variables that are out of reach of even the best available solvers.
Consider, for a fixed integer k ≥ 1, the integer programming problem

maximise   x_2
subject to 2k x_1 − x_2 ≥ 2k
           2k x_1 + x_2 ≤ 4k
           x_1, x_2 ≥ 0
           x_1, x_2 ∈ ℤ,
Figure 3.2: An integer programming problem with large (additive) integrality gap. We
set k = 3 in this figure.
3.2. The Branch and Bound Algorithm 33
Note that each choice of k provides an integer programming formulation which yields the same feasible region, namely {(1, 0), (2, 0)}. The (additive) integrality gap between the IP optimum IP(c, A, b) = 0 and the LP relaxation LP(c, A, b) = k gets arbitrarily large for larger choices of k. Further, notice that rounding the optimal solution of the LP relaxation, (1.5, k), to a "nearby" integer point, say (1, k) or (2, k) if we set k to be an integer, does not yield a feasible solution and in particular the rounded solution can be made arbitrarily far from the feasible solutions to the above IP.
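These claims are easy to verify computationally for any particular k. The following sketch enumerates the integer-feasible points for a given k (the LP optimum (1.5, k) is taken from the discussion above, not recomputed):

```python
def integer_feasible_points(k):
    """Enumerate the integer points satisfying the constraints of the IP above."""
    points = set()
    for x1 in range(0, 4 * k + 1):       # 2k*x1 <= 4k already forces x1 <= 2
        for x2 in range(0, 2 * k + 1):   # x2 <= 4k - 2k*x1 <= 2k
            if 2 * k * x1 - x2 >= 2 * k and 2 * k * x1 + x2 <= 4 * k:
                points.add((x1, x2))
    return points

k = 100
assert integer_feasible_points(k) == {(1, 0), (2, 0)}
# The IP optimum has x2 = 0, while the LP relaxation attains x2 = k at (1.5, k):
# an additive integrality gap of k, growing without bound in k.
print(max(x2 for _, x2 in integer_feasible_points(k)))  # 0
```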
Let us now illustrate the branch and bound algorithm on the following example:

maximise   2x_1 + 5x_2
subject to 12x_1 + 5x_2 ≤ 60
           2x_1 + 10x_2 ≤ 35
           x_1, x_2 ≥ 0
           x_1, x_2 ∈ ℤ.
The feasible region is illustrated in Figure 3.3, where the dots are the feasible integer points. The linear program optimum is at the point x_1 = 3.864, x_2 = 2.727 and the
objective function 2x_1 + 5x_2 takes value 21.363. Graphically, we can see that the possible candidate optimum points are (2, 3) with objective value 19, and (4, 2) with objective value 18. Hence, the optimal solution is x_1 = 2 and x_2 = 3.
Figure 3.3: The feasible region associated with the above IP.
The branch and bound algorithm starts by solving the LP relaxation. This corresponds to the root node of the branch and bound tree. If the solution satisfies the integrality constraints, we stop: we have found an optimal integer solution. Otherwise, we branch
by selecting a variable whose value is noninteger and create two new subproblems. The
constraints of the new subproblems are chosen so that the current noninteger solution
is infeasible in both subproblems.
In the example, the solution at the root node is x_1 = 3.864 and x_2 = 2.727 with objective value 21.363. In light of the relation (3.3), it follows that this objective value
is an upper bound on the value of the optimal integer solution x*, i.e. that

c^T x* ≤ 21.363,

where c = (2, 5)^T. Further, since the objective coefficients of the IP are integer, we know that the optimal value is integer and thus we can round down 21.363 to 21 to yield the stronger upper bound

c^T x* ≤ 21.
Figure 3.4: (a) The root node and initial branches of the branch and bound tree, with root solution x_1 = 3.864, x_2 = 2.727 and objective value 21.363, branching into x_1 ≤ 3 and x_1 ≥ 4. (b) The geometric interpretation of the first branching step.
The two subproblems are illustrated in Figure 3.4(b). This step in the algorithm is
called the branching step.
1. We choose to solve the left-hand side problem first, namely the relaxed LP with the added constraint x_1 ≤ 3. This gives a solution x_1 = 3 and x_2 = 2.9 with objective value 20.5. Notice that because an additional constraint, namely x_1 ≤ 3, has been added to the original IP, the value of the objective function either stays the same or decreases. This gives a tighter upper bound of 20 for the value of the optimal integer solution of the current subproblem, after rounding down from 20.5. It should be emphasised that the upper bound of 20 does not apply to the original problem.
a) Since x_2 = 2.9 is fractional, we branch on x_2 by adding the constraints x_2 ≤ 2 and x_2 ≥ 3, respectively. Solving the subproblem with x_1 ≤ 3 and x_2 ≤ 2 gives the integer solution x_1 = 3 and x_2 = 2 with objective value 16, which becomes our first incumbent solution and provides a lower bound of 16.
In other words, this tells us that the optimal IP value is an integer value between 16 and 21. Note that whereas the LP relaxation gives an upper bound,
Figure 3.5: (a) The root node and further branches of the branch and bound tree, where the node x_1 ≤ 3 (with solution x_1 = 3, x_2 = 2.9 and objective value 20.5) is branched into x_2 ≤ 2 and x_2 ≥ 3. (b) The geometric interpretation of the second branching step.
finding feasible integer solutions yields lower bounds. For this subproblem
we have found the best integer solution, so we do not branch any further.
This branch or subproblem, namely x_1 ≤ 3 and x_2 ≤ 2, is said to be fathomed or conquered by integrality.
b) Now, the algorithm backtracks and selects one of the unsolved subproblems.
Choosing the most recently formed unsolved subproblem, namely x_1 ≤ 3 and x_2 ≥ 3, and solving it gives the solution x_1 = 2.5 and x_2 = 3 with objective value 20. Since x_1 is fractional, two subproblems are created, namely x_1 ≤ 2 and x_1 ≥ 3. See Figure 3.6.
i. Moving to the subproblem with constraints x_1 ≤ 3, x_2 ≥ 3 and x_1 ≤ 2 is equivalent to solving the subproblem with constraints x_1 ≤ 2 and x_2 ≥ 3. This produces the solution x_1 = 2, x_2 = 3.1 with objective value 19.5. This gives an upper bound of 19 on the value of the optimal integer solution of the current subproblem. Since x_2 is fractional, from here we branch with respect to x_2 by adding constraints, namely the constraints x_2 ≤ 3 and x_2 ≥ 4.
A. Solving the relaxation of the left subproblem with x_2 ≤ 3 gives an integer solution x_1 = 2 and x_2 = 3 with objective value 19, which is higher than the value of the current incumbent solution. As a result, x_1 = 2 and x_2 = 3 becomes the incumbent solution, and we say that this subproblem is fathomed by the incumbent solution. Thus, we do not branch on this subproblem. The integer solution here provides a new lower bound of 19, i.e.

19 ≤ c^T x* ≤ 21.

B. The right subproblem, with the added constraint x_2 ≥ 4, has an infeasible LP relaxation and is therefore fathomed by infeasibility.
ii. The remaining subproblem, with constraints x_1 ≤ 3, x_2 ≥ 3 and x_1 ≥ 3 (forcing x_1 = 3), is likewise infeasible and is fathomed by infeasibility.
We have now considered all options that branch from the subproblem x_1 ≤ 3. The resulting subtree is shown in Figure 3.6. The incumbent solution at this stage is (2, 3) with objective value 19 and the bounds are

19 ≤ c^T x* ≤ 21.
2. At this point in our process, the entire branch of the tree where x_1 ≤ 3 is completely fathomed. So we backtrack to the very beginning and solve the subproblem with the single extra constraint x_1 ≥ 4. Solving the subproblem with x_1 ≥ 4 yields the solution x_1 = 4 and x_2 = 2.4 with objective function value 20. Since x_2 is fractional, we add the constraints x_2 ≤ 2 and x_2 ≥ 3, respectively. Continuing in the same fashion as before and subsequently branching gives the tree in Figure 3.7.
Figure 3.7: The complete branch and bound tree. The branch x_1 ≤ 3 contains the integer solutions (3, 2) with objective value 16 and (2, 3) with objective value 19, with its remaining subproblems infeasible; the branch x_1 ≥ 4 is also eventually fathomed.
The incumbent, and thus optimal, solution to the problem at the end of the entire branch and bound search is x_1 = 2 and x_2 = 3 with an objective function value of 19.
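The whole procedure can be condensed into a short program. The sketch below (an illustrative implementation, not the course's AMPL-based workflow) solves each LP relaxation of this two-variable example by brute-force vertex enumeration and applies the same branching, bounding and fathoming rules as above, although it may explore subproblems in a different order; it recovers the optimal solution x_1 = 2, x_2 = 3 with value 19:

```python
import itertools
import math

def solve_lp(cons, c):
    """Maximise c.x over {x : a1*x1 + a2*x2 <= b for (a1, a2, b) in cons}.

    The feasible region here is bounded, so an optimum (if the region is
    nonempty) is attained at a vertex; we enumerate all constraint-pair
    intersections.  Returns (value, x1, x2), or None if infeasible."""
    best = None
    for (a1, a2, b), (d1, d2, e) in itertools.combinations(cons, 2):
        det = a1 * d2 - a2 * d1
        if abs(det) < 1e-9:
            continue                        # parallel constraints: no vertex
        x1 = (b * d2 - a2 * e) / det
        x2 = (a1 * e - b * d1) / det
        if all(p * x1 + q * x2 <= r + 1e-7 for p, q, r in cons):
            val = c[0] * x1 + c[1] * x2
            if best is None or val > best[0]:
                best = (val, x1, x2)
    return best

def branch_and_bound(cons, c):
    incumbent = None
    stack = [cons]
    while stack:
        sub = stack.pop()
        sol = solve_lp(sub, c)
        if sol is None:                     # fathomed by infeasibility
            continue
        val, x1, x2 = sol
        bound = math.floor(val + 1e-9)      # valid: integer objective coeffs
        if incumbent is not None and bound <= incumbent[0]:
            continue                        # fathomed by bound
        xs = (x1, x2)
        if all(abs(x - round(x)) < 1e-6 for x in xs):
            incumbent = (bound, round(x1), round(x2))
            continue                        # fathomed by integrality
        j = 0 if abs(x1 - round(x1)) >= 1e-6 else 1   # a fractional variable
        f = math.floor(xs[j])
        unit = [(1, 0), (0, 1)][j]
        stack.append(sub + [(unit[0], unit[1], f)])            # x_j <= floor
        stack.append(sub + [(-unit[0], -unit[1], -(f + 1))])   # x_j >= floor+1
    return incumbent

# The example IP: max 2x1 + 5x2 s.t. 12x1 + 5x2 <= 60, 2x1 + 10x2 <= 35, x >= 0.
cons = [(12, 5, 60), (2, 10, 35), (-1, 0, 0), (0, -1, 0)]
print(branch_and_bound(cons, (2, 5)))  # (19, 2, 3)
```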
Node Termination
In the Branch and Bound Algorithm, there are three ways that a subproblem can be
so-called fathomed, which means conquered or dismissed from consideration:
• Fathomed by integer solution: When the solution of the LP relaxation of the subproblem is integer, then there is no need to branch further since we have solved the IP subproblem optimally. If this integer solution is better than the current incumbent solution, then it becomes the new incumbent and we say that the subproblem is fathomed by the incumbent solution.
• Fathomed by bound: When the (rounded) objective value of the LP relaxation of the subproblem is no larger than the objective value of the current incumbent solution, then the subproblem cannot contain a better solution and is dismissed.
• Fathomed by infeasibility: When the LP relaxation of the subproblem is infeasible, then so is the IP subproblem, and the subproblem is dismissed.
The branch and bound algorithm for (pure) IPs can be summarised as:
• Initialise: Apply the bounding step, fathoming step and optimality test to the
whole problem. If not fathomed, then classify this problem as one remaining
subproblem and perform the iteration steps below:
1. Branching step: Among the remaining subproblems, select one (e.g. the most recently created) and choose a variable whose value in the relaxation solution is fractional, say x_j = v. Create two new subproblems by adding the constraints x_j ≤ ⌊v⌋ and x_j ≥ ⌊v⌋ + 1, respectively.
2. Bounding step: For each new subproblem, apply the Simplex Method to its
linear programming relaxation to obtain an optimal solution and an objective
value. If the objective coefficients are integer, then round down this value
(round up if minimising). This rounded objective value is an upper bound
on the objective value of the IP subproblem (lower bound if minimising).
3. Fathoming step: For each subproblem, apply the three fathoming tests sum-
marised above and discard all the subproblems that are fathomed by any of
the tests.
• Optimality test: Stop when there is no remaining subproblem. The current in-
cumbent solution is optimal.
Remarks.
1. We did not specify above how to pick the next subproblem. We can pick the one
with the highest bound (lowest if minimising) because this subproblem would be
the most promising one to contain an optimal solution to the whole problem. Or
we could pick the one that was created most recently (this is what we did above),
so the solver could use re-optimisation techniques to solve it faster.
2. If the objective coefficients are not integer, then we should not round the bound
in the bounding step.
3. If the variables were binary integer variables, then for a branching variable, say x_1, the branches would simply be the equalities x_1 = 1 and x_1 = 0.
4. To solve the problem in the previous section, which had just two variables, we
solved 11 linear programs. In general, the number of steps "blows up" exponentially: If there are k binary variables, the number of subproblems can be as large as 2^k. This is essentially why solving an IP requires substantially more work than solving an LP. For this reason, great care should be taken in setting up an IP and,
in particular, one should always check whether there is a way to formulate the
problem in a more economical way with fewer integer variables.
5. We described the Branch and Bound Algorithm for pure integer programs. However, we can also apply the algorithm, though with some minor changes, to mixed-integer programs, which contain both integer and continuous variables. The minor changes are listed as follows:
a) Branching step: We branch only on the integer-constrained variables x_i, i ∈ I, whose values in the relaxation solution are fractional; the continuous variables are permitted to take fractional values.
b) Bounding step: We do not round the bound of the linear programming relax-
ation since the objective value of the mixed-integer program is most likely
fractional.
The following example demonstrates how we can formulate simple mathematical opti-
misation problems using integer linear programming.
Example. (Computer sales) A shop needs to put together two types of computer systems
to sell. They are identical except that one contains one monitor and 3 hard drives and the
other has 2 monitors and 1 hard drive. The profit for the two systems is the same, £300.
The shop has 70 monitors and 63 hard drives available to put into the systems. How many
of each computer should it make?
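The answer is easy to sanity-check by enumeration. Writing x1 and x2 (our notation, not the notes') for the number of systems of each type, the constraints are x1 + 2·x2 ≤ 70 (monitors) and 3·x1 + x2 ≤ 63 (hard drives), with profit 300(x1 + x2). A brute-force sketch:

```python
# Enumerate all integer production plans within the monitor/drive supplies.
best = max(
    (300 * (x1 + x2), x1, x2)
    for x1 in range(64)            # 3*x1 <= 63 forces x1 <= 21; 64 is a safe cap
    for x2 in range(64)            # x2 <= 63 from the hard-drive supply
    if x1 + 2 * x2 <= 70 and 3 * x1 + x2 <= 63
)
print(best)  # (12000, 11, 29): 40 systems in total, for a profit of 12000
```

Note that the optimum is not unique: producing 10 of the first type and 30 of the second also yields 40 systems.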
We now consider a classical problem known as the knapsack problem. It should be noted
that the word knapsack was the usual name for a rucksack or backpack until around the
middle of the 20th century.
Suppose we are given a knapsack which can carry a maximum weight b and that
there are n types of items that we could take. Suppose further that an item of type i ∈ {1, 2, . . . , n} has weight a_i > 0 and value c_i ∈ ℝ. The knapsack problem is to load the knapsack with items (possibly several items of the same type) without exceeding the knapsack capacity b to maximise the total value of the knapsack. In order to model this, let variable x_i represent the number of items of type i to be loaded. The knapsack problem can then be formulated as the IP

maximise   Σ_{i=1}^{n} c_i x_i
subject to Σ_{i=1}^{n} a_i x_i ≤ b
           x_i ≥ 0 and x_i ∈ ℤ for all i ∈ {1, 2, . . . , n}.
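As an aside, when the weights a_i and the capacity b are integers, the knapsack problem can also be solved directly by dynamic programming rather than as an IP; this is a different solution technique from the formulation above, sketched here purely for illustration:

```python
def unbounded_knapsack(values, weights, capacity):
    """Maximum total value using any number of copies of each item type.

    dp[cap] holds the best value achievable with knapsack capacity cap."""
    dp = [0] * (capacity + 1)
    for cap in range(1, capacity + 1):
        for value, weight in zip(values, weights):
            if weight <= cap:
                dp[cap] = max(dp[cap], dp[cap - weight] + value)
    return dp[capacity]

print(unbounded_knapsack([3, 4, 5], [2, 3, 4], 7))  # 10: two items of weight 2, one of weight 3
```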
An important variant of this problem is known as the binary knapsack problem, where only one unit of each item type can be selected. In this case we use binary variables instead of general integers. The binary knapsack problem can be formulated as

maximise   Σ_{i=1}^{n} c_i x_i
subject to Σ_{i=1}^{n} a_i x_i ≤ b
           x ∈ {0, 1}^n,

where x ∈ {0, 1}^n denotes that x is an n-dimensional column vector whose entries are either 0 or 1.
Example. (Project management) The project manager of a company has five projects that
they would like to undertake. It is sadly not possible for the company to undertake all five
projects due to budgetary limitations. In particular, the available budget is £85,000. Each
project has some positive value to the company and requires a certain investment. The value and costs are presented in the following table.
Which of the projects should be undertaken in order to maximise the total value of these
projects subject to the aforementioned budgetary constraint?
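With the data from the table, and writing v_j for the value and c_j for the cost (in £) of project j, together with a binary variable x_j indicating whether project j is undertaken (our notation, not the notes'), a sketch of the resulting BIP is:

```latex
\begin{align*}
\text{maximise}\quad & \sum_{j=1}^{5} v_j x_j \\
\text{subject to}\quad & \sum_{j=1}^{5} c_j x_j \le 85000 \\
& x_j \in \{0, 1\}, \quad j \in \{1, 2, \ldots, 5\}.
\end{align*}
```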
3.5. The Set Covering Problem 43
For the constraints, we define the following m × n matrix whose entries take value either 0 or 1. For each i ∈ {1, 2, . . . , m} and each j ∈ {1, 2, . . . , n} we set

a_ij = 1 if i ∈ S_j, and a_ij = 0 otherwise.

It should be noted that when x_j ∈ {0, 1} for all j, the above coverage constraints require that each neighbourhood i ∈ {1, 2, . . . , m} is covered by at least one of the n subsets S_j ⊆ {1, 2, . . . , m}.
In matrix form, this can be expressed by

minimise   c^T x
subject to Ax ≥ 1
           x ∈ {0, 1}^n,

where A = (a_ij) ∈ ℤ^{m×n} is the m × n matrix defined above with a_ij denoting the entry in the i-th row and j-th column, 1 denotes the m-dimensional vector of all ones and c = (c_1, c_2, . . . , c_n)^T is the n-dimensional vector with entries c_j for j ∈ {1, 2, . . . , n}.
The following illustrates a real-world application of the set covering problem that is
known as the crew scheduling problem.
Example. (Airline crew allocation) Airlines routinely solve massive set covering problems in order to allocate crews to aircraft. This is known as the crew scheduling problem.
An airline wants to operate its daily flight schedule using the smallest number of crews
to make use of the available resources efficiently. A crew is on duty for a certain number
of consecutive hours and may therefore operate several flights. A crew assignment is a
sequence of flights that may be operated by the same crew within its duty time. For instance,
some crew assignment may consist of the 8:30-10:00am flight from Pittsburgh to Chicago,
then the 11:30am-1:30pm Chicago-Atlanta flight and finally the 2:45-4:30pm Atlanta-
Pittsburgh flight. The problem is to find the smallest number of crew assignments such that every flight is covered by at least one of the selected crew assignments.
3.6. Exercises for Self-Study 45
This is a set covering problem, where n is the number of crew assignments, m is the
number of flights to be operated and, for each crew assignment j, S j denotes the set of
flights that are included in crew assignment j. Since we want to minimise the total number
of crews needed, the cost of each set S j is 1.
2. Apply the branch and bound algorithm to solve the integer program

maximise   2x_1 − x_2 + 5x_3 − 3x_4 + 4x_5
subject to 3x_1 − 2x_2 + 7x_3 − 5x_4 + 4x_5 ≤ 6
           x_1 − x_2 + 2x_3 − 4x_4 + 2x_5 ≤ 0
           x_1, x_2, . . . , x_5 ∈ ℤ,
where for each subproblem you should solve its linear programming relaxation
using the solver via AMPL.
3. Use the solver via AMPL in order to solve the IPs appearing in the previous two
exercises. Note that your solutions should coincide with the solutions previously
found by applying the branch and bound algorithm.
46 Chapter 3. Integer and Mixed Integer Programming Applications
4. Consider the following statements about any pure integer programming problem
in maximisation form and its LP relaxation. Label each of the following statements
as True or False, justifying your answer.
a) The feasible region for the LP relaxation is a subset of the feasible region for
the IP.
5. Eve and Steven are a young couple and want to divide their main household
weekly chores between them such that each has two allocated tasks but the total
time they spend on household duties is kept to a minimum. The main household
chores are cleaning, cooking, dishwashing and laundry. Their efficiencies on these
tasks differ, where the time each would need to perform the task is outlined in the
following table.
6. Vincent Cardoza is the owner and manager of a machine shop that does custom
order work. This Wednesday afternoon, he received calls from two customers
who would like to place rush orders. One is a trailer hitch company which would
like some custom-made heavy-duty tow bars. The other is a mini-car-carrier com-
pany which needs some customized stabilizer bars. Both customers would like as
many as possible by the end of the week (two working days). Since both products
would require the use of the same two machines, Vincent needs to decide and in-
form the customers this afternoon about how many of each product he will agree
to make over the next two days.
Each tow bar requires 3.2 hours on machine 1 and 2 hours on machine 2. Each
stabilizer bar requires 2.4 hours on machine 1 and 3 hours on machine 2. Machine
1 will be available for 16 hours over the next two days and machine 2 will be
available for 15 hours. The profit for each tow bar produced would be $130 and
the profit for each stabilizer bar produced would be $150.
Vincent now wants to determine the mix of these production quantities that will
maximize the total profit. Formulate an integer programming model for this prob-
lem and solve.
7. A firm is considering five possible development projects. The following table summarises the estimated long-run profit (net present value) and the capital required for each project, in units of millions of dollars.

                     Development Project
                     1     2     3     4     5
Estimated Profit     1     1.8   1.6   0.8   1.4
Capital Required     6     12    10    4     8
The owners of the firm, Dave Peterson and Ron Johnson, have raised $20 million of investment capital for these projects. Dave and Ron now want to select the combination of projects that will maximize their total estimated long-run profit (net present value) without investing more than $20 million. Formulate an integer programming model for this problem and solve.
8. It is predicted that enough trained pilots will be available to the company to crew 30 new airplanes. If only short-range planes were purchased, the maintenance facilities would be able to handle 40 new planes. However, each medium-range plane is equivalent to 4/3 short-range planes, while each long-range plane is equivalent to 5/3 short-range planes in terms of their use of the maintenance facilities.
The information given here was obtained by a preliminary analysis of the prob-
lem. A more detailed analysis will be conducted subsequently. However, using the
preceding data as a first approximation, management wishes to know how many
planes of each type should be purchased to maximize profit. Formulate an integer
programming model for this problem and solve.
9. GreenPower are a renewable energy developer who have been tasked with selecting the best five out of ten possible sites for the construction of new wind farms in the UK. The sites and expected profits associated with each site are s1 , s2 , . . . , s10
in the UK. The sites and expected profits associated with each site are s1 , s2 , . . . , s10
and p1 , p2 , . . . , p10 , respectively. Requirements from UK planning permission en-
force that if site s2 is selected, then site s3 must also be selected.
Development restrictions enforce that selecting sites s1 and s7 prevents the se-
lection of s8 . Further, these restrictions also enforce that selecting sites s3 or s4
prevents the selection of s5 . Formulate an integer program that could determine
the best selection scheme.
10. There are six cities (labelled cities 1-6) in Kilroy County. The county must deter-
mine where to build fire stations. The county wants to build the minimum number
of fire stations needed to ensure that at least one fire station is within a 15 minute
drive of each city. The times in minutes required to drive between the cities in Kilroy County are shown in the following table.
To
From City 1 City 2 City 3 City 4 City 5 City 6
City 1 0 10 20 30 30 20
City 2 10 0 25 35 20 10
City 3 20 25 0 15 30 20
City 4 30 35 15 0 15 25
City 5 30 20 30 15 0 14
City 6 20 10 20 25 14 0
Formulate and solve an IP that will tell Kilroy how many fire stations should be
built and where they should be located.
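This is precisely a set covering instance: S_j is the set of cities within a 15-minute drive of city j and each candidate station has cost 1. As a sketch of how one might verify an answer (the intended approach is to formulate and solve the BIP), a brute-force enumeration over station subsets:

```python
from itertools import combinations

# Symmetric drive times (minutes) between the six cities, from the table.
times = [
    [0, 10, 20, 30, 30, 20],
    [10, 0, 25, 35, 20, 10],
    [20, 25, 0, 15, 30, 20],
    [30, 35, 15, 0, 15, 25],
    [30, 20, 30, 15, 0, 14],
    [20, 10, 20, 25, 14, 0],
]

# covers[j] = set of cities within a 15-minute drive of a station at city j+1.
covers = [{i + 1 for i in range(6) if times[j][i] <= 15} for j in range(6)]

def minimum_stations():
    """Smallest set of station locations (1-indexed) covering every city."""
    for size in range(1, 7):
        for subset in combinations(range(1, 7), size):
            if set().union(*(covers[j - 1] for j in subset)) == set(range(1, 7)):
                return subset
    return None

print(minimum_stations())  # (2, 4)
```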
11. StockCo is considering four investments. Investment 1 will yield a net present value (NPV) of $16,000, investment 2 yields an NPV of $22,000, investment 3 yields an NPV of $12,000 and investment 4 yields an NPV of $8000. Each investment requires a certain cash outflow at the present time, namely $5000 for investment 1, $7000 for investment 2, $4000 for investment 3 and $3000 for investment 4. There is presently $14,000 available for investment. Formulate
3.6. Exercises for Self-Study 49
an IP whose solution will tell StockCo how to maximize the NPV obtained from investments 1 to 4.
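Once formulated, a BIP this small can be checked by brute force over all 2^4 choices of the binary variables; the formulation itself is left as the exercise, and the following sketch merely verifies the optimal value:

```python
from itertools import product

npv = [16, 22, 12, 8]     # NPVs, in thousands of dollars
cost = [5, 7, 4, 3]       # present cash outflows, in thousands
budget = 14

# Enumerate all 2^4 choices of x in {0,1}^4 and keep the affordable best.
best = max(
    (sum(n * x for n, x in zip(npv, xs)), xs)
    for xs in product([0, 1], repeat=4)
    if sum(c * x for c, x in zip(cost, xs)) <= budget
)
print(best)  # (42, (0, 1, 1, 1)): undertake investments 2, 3 and 4
```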
12. Modify the StockCo formulation from the previous exercise to account for each of
the following requirements:
Chapter 4
Modelling Tricks
It turns out that we can make use of IPs, MIPs or BIPs even when the objective function or the constraints do not appear to be linear at first sight. In this chapter we outline different modelling tricks that allow us to apply IPs, MIPs or BIPs to a broader set of problems.
4.1 Fixed Costs and the Big-M Method
Suppose that a manufacturer produces z ≥ 0 units of some product at a cost of c per unit, and that producing any positive amount additionally incurs a fixed cost f > 0, say for turning on a new machine, so the cost of producing z > 0 units is f + cz. While this cost function may seem linear, it actually is not linear. Observe that the function f + cz evaluated at z = 0 is f and not 0. This raises the question as to how we should express the production cost of this product, because if the costs are no longer linear, then we cannot rely on a standard LP.
It turns out that binary (or indicator) variables can come to our rescue. In order to
revise our model, we require two things. First, we need a new variable δ ∈ {0, 1} which takes binary values, where δ = 0 indicates that the product is not produced and δ = 1 indicates that the product is produced. Secondly, we require an upper bound M on the decision variable z. We can obtain the upper bound M from the manufacturer, who will have an absolute upper bound on the total number of units produced.
We can now revise our linear program as follows. We firstly add the following two constraints

z ≤ Mδ,   δ ∈ {0, 1},                (4.1)

and we replace the cost term f + cz in the objective function by fδ + cz. Let us argue that this revision now models our problem. Observe that if δ = 0, then (4.1) forces z = 0 and hence the production cost becomes fδ + cz = 0 as required. If instead δ = 1, then (4.1) becomes z ≤ M and hence the production cost becomes fδ + cz = f + cz. Because z ≤ M is always satisfied by definition, the inequality does not limit the range of possible values for z.
Observe that δ = 0 indicates that z = 0 and, in an optimal solution, we have that if δ = 1, then z > 0. The second implication follows as the objective is to minimise the total costs and it would not make sense to not produce any of a certain product yet pay the fixed costs associated with turning on the new machine, i.e. an optimal solution could not have both δ = 1 and z = 0. It follows that the new revised program, which is now a MIP, correctly models this manufacturing problem.
Note that in the above we introduced a new binary variable δ ∈ {0, 1} modelling the logical statement

if z > 0, then δ = 1.

That is, δ behaves akin to an on/off switch that is turned on once z > 0. The previous logical statement is logically equivalent to the contrapositive statement

if δ = 0, then z = 0.

It follows consequently that δ = 1 if and only if z > 0 (or equivalently z = 0 if and only if δ = 0) holds when minimising total costs.
In the above we utilised the big-M method, a widely applicable modelling strategy that requires an upper bound M on the possible values of z. In the next section, we illustrate the big-M method on a more complex modelling problem.
4.2 Facility Location and the Big-M Method
We have so far incurred the transportation costs ∑_{t∈T} ∑_{f∈F} c_ft x_ft in the objective
function; however, we have not taken account of opening costs.
For this purpose, we similarly need a binary variable δ_f ∈ {0, 1} for each facility f ,
indicating whether or not the facility is open. More precisely, acting as an on/off switch,
we use δ_f to model the following logical statement:

if the facility f is closed, then x_ft = 0 for every t ∈ T.

We need a big-M constraint enforcing this. A natural upper bound on x_ft is r_t as this
is the total amount of goods needed for store t. We model the above statement via the
two big-M constraints

x_ft ≤ r_t δ_f for all t ∈ T
δ_f ∈ {0, 1}.
In addition, we include the opening costs

∑_{f∈F} b_f δ_f

in our objective function. Observe that if δ_f = 0, then x_ft = 0 for all t ∈ T and the
opening cost is b_f δ_f = 0. If instead δ_f = 1, then x_ft ≤ r_t and the opening cost is
b_f δ_f = b_f. Notice that since x_ft ≤ r_t always holds, it follows that the inequality does
not constrain anything. Moreover, since the objective is to minimise cost, it does not
make sense to set δ_f = 1 when each x_ft, t ∈ T, is 0.
In summary, our final model for this scenario is the following MIP

minimise ∑_{f∈F} b_f δ_f + ∑_{t∈T} ∑_{f∈F} c_ft x_ft
subject to ∑_{f∈F} x_ft = r_t for all t ∈ T
x_ft ≤ r_t δ_f for all f ∈ F, t ∈ T
x_ft ≥ 0 for all f ∈ F, t ∈ T
δ_f ∈ {0, 1} for all f ∈ F.
namely that the amount of goods x_ft supplied by facility f to store t is either the r_t
units of goods needed by t if f is the facility selected, while the supply from f is zero
otherwise. This is captured by the constraint

x_ft = r_t δ_ft for all f ∈ F, t ∈ T.

Moreover, each store t must receive its goods from exactly one facility. That is, among
the values δ_ft, f ∈ F, we must have exactly one 1, while all other values are 0.
Furthermore, upon recalling the δ_ft's are binary variables, the above is enforced by

∑_{f∈F} δ_ft = 1 for all t ∈ T.

The revised model is therefore

minimise ∑_{f∈F} b_f δ_f + ∑_{t∈T} ∑_{f∈F} c_ft x_ft
subject to ∑_{f∈F} δ_ft = 1 for all t ∈ T
x_ft = r_t δ_ft for all f ∈ F, t ∈ T
δ_ft ≤ δ_f for all f ∈ F, t ∈ T
x_ft ≥ 0 for all f ∈ F, t ∈ T
δ_f ∈ {0, 1} for all f ∈ F
δ_ft ∈ {0, 1} for all f ∈ F, t ∈ T.
Note that in this revised program, the variables x_ft are not necessary and could be
replaced simply by x_ft = r_t δ_ft. In particular, notice that the above inequalities on x_ft
are implied by the inequalities on the δ_ft variables. Removing the x_ft's would leave us
with the variables δ_f and δ_ft only. This replacement yields

minimise ∑_{f∈F} b_f δ_f + ∑_{t∈T} ∑_{f∈F} c_ft r_t δ_ft
subject to ∑_{f∈F} δ_ft = 1 for all t ∈ T
δ_ft ≤ δ_f for all f ∈ F, t ∈ T
δ_f ∈ {0, 1} for all f ∈ F
δ_ft ∈ {0, 1} for all f ∈ F, t ∈ T.
Note that in contrast to the previous models, this problem is a pure IP.
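The pure IP can be validated on a tiny instance by exhaustive enumeration. The sketch below uses made-up data (two facilities, two stores; all costs and demands are illustrative, not from the notes) and enumerates every pattern of open facilities and store assignments satisfying δ_ft ≤ δ_f.

```python
from itertools import product

# Tiny illustrative facility-location instance: 2 facilities, 2 stores.
F, T = [0, 1], [0, 1]
b = [5.0, 7.0]                  # opening cost b_f
r = [3.0, 2.0]                  # demand r_t of each store
c = [[1.0, 4.0], [2.0, 1.0]]    # transport cost c[f][t] per unit

best = None
for open_f in product((0, 1), repeat=len(F)):      # delta_f values
    for assign in product(F, repeat=len(T)):       # assign[t] = facility serving t
        # Constraint delta_ft <= delta_f: the chosen facility must be open.
        if any(open_f[assign[t]] == 0 for t in T):
            continue
        cost = sum(b[f] * open_f[f] for f in F) \
             + sum(c[assign[t]][t] * r[t] for t in T)
        if best is None or cost < best:
            best = cost
print(best)  # prints 15.0: open only facility 1 and let it serve both stores
```

Here opening the cheaper facility 0 costs 5 + 1·3 + 4·2 = 16, while opening facility 1 costs 7 + 2·3 + 1·2 = 15, so the enumeration correctly prefers the latter.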
It should be emphasised that “X₁ or X₂” means that at least one (or possibly both) of
the events occur, while “X₁ and X₂” means that both events must occur.
Logical conditions such as these can be transformed into other equivalent logical
conditions, where two conditions are called logically equivalent if they always have
the same truth value. Further, such equivalences can be shown using the algebraic
expressions above. For example, notice that
Remark. The constraints for the condition “X₁ and X₂” have been mentioned here since
they can be used when an “and” condition is part of a larger and perhaps more complex
logical expression. In the case of a simple “and” statement, we do not need to use
indicator variables. For example, for the constraints appearing in linear and integer
linear programming problems, it is assumed that there is an “and” relationship between
the constraints. In particular, we assume the first constraint holds “and” the second
constraint holds “and” so on. This implies that for an expression like “X₁ and X₂”, it is
normally sufficient to simply add the expressions X₁ and X₂ as regular constraints.
Furthermore, we can generalise the above to the case of more than two events. This
generalisation allows us to express longer and more complicated conditions using binary
variables. Consider for this purpose n events X₁, X₂, . . . , Xₙ with corresponding indicator
variables δ₁, δ₂, . . . , δₙ.
4.5 Modelling “or” Constraints (Disjunctions)

Note that the constraint ∑_{f∈F} δ_ft = 1 for all t ∈ T of the facility location problem that
ensured that only one facility f ∈ F can supply store t was of this type.
x₁ − x₂ ≥ 0     (4.2a)
x₁ − x₂ ≤ 0     (4.2b)
It should be emphasised that a standard program admits only “and” constraints and not
the “or” constraints that we have in this scenario. It turns out that once more binary
variables and the big-M method can come to our rescue.
For this purpose, let us introduce a binary variable δ ∈ {0, 1}, where δ = 1 if our
production is “lager-dominant”, i.e. if (4.2a) holds, whereas δ = 0 if our production is
“ale-dominant”, i.e. if (4.2b) holds. In other words, our aim is to keep (4.2a) and make
(4.2b) void if δ = 1, and conversely, keep (4.2b) and make (4.2a) void if δ = 0.
Recall that the big-M method requires some upper bound on the underlying decision
variables. Because total production is at most 10,000 barrels in the next quarter, we obtain
the big-M bounds

−10,000 ≤ x₁ − x₂ ≤ 10,000.

We can therefore replace (4.2a) and (4.2b) with the system

x₁ − x₂ ≥ −10,000 (1 − δ)
x₁ − x₂ ≤ 10,000 δ     (4.3)
δ ∈ {0, 1}.

Observe that if δ = 1, then the first inequality from (4.3) becomes (4.2a), while the
second inequality from (4.3) becomes x₁ − x₂ ≤ 10,000. This inequality is void as it
always holds in light of the constraint on total production during the next quarter. If
instead δ = 0, then the first inequality from (4.3) becomes the void inequality
x₁ − x₂ ≥ −10,000, while in this case the second inequality from (4.3) becomes (4.2b). Thus, it
follows that (4.3) correctly models our problem.
This can be done in general. Suppose that, within our problem, we have several
linear constraints, denoted by

a₁ᵀx ≤ b₁, a₂ᵀx ≤ b₂, . . . , a_kᵀx ≤ b_k,

where aᵢ ∈ ℝⁿ and bᵢ ∈ ℝ for each i ∈ {1, 2, . . . , k}, and that we want to impose the
condition that at least one of them is satisfied. In other words, suppose that

(a₁ᵀx ≤ b₁)
or (a₂ᵀx ≤ b₂)
⋮     (4.4)
or (a_kᵀx ≤ b_k)
It should be emphasised that here we are interested in imposing the condition that at
least one of the k constraints is satisfied. It is perhaps surprising that it is impossible
in general to write as an IP the condition that exactly one of the above constraints is
satisfied. Note that in the previous example (4.2a) and (4.2b) could not both be satisfied
at the same time, and as such imposing that at least one of the two is satisfied was the
same as imposing that exactly one of the two is satisfied; however, such a case cannot be
ensured in general.
Suppose that M denotes an upper bound on aᵢᵀx for all i, namely that

aᵢᵀx ≤ M for all i.
Note that the choice of M means that the above holds for every feasible solution x of
our underlying problem. In a similar fashion, we can formulate (4.4) by introducing a
binary variable δᵢ for every i ∈ {1, 2, . . . , k}, where δᵢ = 1 if the i-th constraint aᵢᵀx ≤ bᵢ
is satisfied. We can express this with the system

aᵢᵀx ≤ bᵢ δᵢ + M (1 − δᵢ) for all i ∈ {1, 2, . . . , k}
δ₁ + δ₂ + ⋯ + δ_k = 1     (4.5)
δᵢ ∈ {0, 1} for all i ∈ {1, 2, . . . , k}.

Observe the constraint δ₁ + δ₂ + ⋯ + δ_k = 1 forces all the δᵢ's to take value 0, except
for exactly one, say δ_h, which takes value 1. For each i such that δᵢ = 0, notice that the
constraint aᵢᵀx ≤ bᵢ δᵢ + M (1 − δᵢ) becomes aᵢᵀx ≤ M, which by our assumption is always
satisfied and hence does not impose any further restriction. For δ_h = 1, notice that the
constraint a_hᵀx ≤ b_h δ_h + M (1 − δ_h) becomes a_hᵀx ≤ b_h. In particular, δ_h = 1 enforces
that the h-th constraint must be satisfied. It follows that the system (4.5) imposes that at
least one (in this case the h-th) of the constraints is satisfied by x. Note for completeness
that the equality constraint δ₁ + δ₂ + ⋯ + δ_k = 1 appearing in (4.5) could be replaced
by δ₁ + δ₂ + ⋯ + δ_k ≥ 1 without impacting on the solutions.
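A quick way to convince oneself that a system of the form (4.5) captures the disjunction is to test it on a one-dimensional example. The sketch below encodes “x ≤ 2 or x ≥ 5” (illustrative numbers; the second constraint is written as −x ≤ −5) with M = 100, which is a valid bound for the test range, and checks that (4.5) is feasible for some binary δ exactly when x lies in the union of the two constraint sets.

```python
from itertools import product

# "x <= 2 or x >= 5" as a_i * x <= b_i with a = (1, -1), b = (2, -5); M = 100.
A, bvec, M = [1.0, -1.0], [2.0, -5.0], 100.0

def system_45_feasible(x):
    # Is there a binary (d1, d2) with d1 + d2 = 1 such that
    # a_i * x <= b_i * d_i + M * (1 - d_i) for both i?
    for d in product((0, 1), repeat=2):
        if sum(d) != 1:
            continue
        if all(A[i] * x <= bvec[i] * d[i] + M * (1 - d[i]) for i in range(2)):
            return True
    return False

for x in [-3.0, 0.0, 2.0, 3.0, 4.9, 5.0, 7.0]:
    in_union = (x <= 2.0) or (x >= 5.0)
    assert system_45_feasible(x) == in_union
print("(4.5) matches the union of the two constraints")
```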
4.6 Semi-Continuous Variables

Suppose that a decision variable x₁ must either be 0 or take a value of at least some
minimum level ℓ₁ > 0, namely

x₁ = 0 or x₁ ≥ ℓ₁.     (4.6)
This can be modelled once more using the big-M method. Suppose that we have
knowledge of a value M > 0 that is “large enough” such that x₁ ≤ M is guaranteed in
every optimal solution. Further, let us define a binary variable δ which we want to take
the following meaning

δ = 0, if x₁ = 0,
δ = 1, if x₁ > 0.

We then impose the constraints

x₁ ≤ M δ     (4.7)
x₁ ≥ ℓ₁ δ     (4.8)
δ ∈ {0, 1}.
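The pair (4.7)-(4.8) can be sanity-checked by enumeration. The sketch below uses illustrative values M = 10 and ℓ₁ = 2 (not from the notes) and verifies that the feasible values of x₁ are exactly {0} ∪ [ℓ₁, M].

```python
# Check that (4.7)-(4.8) with delta in {0, 1} force x1 into {0} ∪ [ell1, M].
M, ell1 = 10.0, 2.0  # illustrative big-M bound and minimum level

def feasible(x1):
    # Feasible iff some delta in {0, 1} satisfies x1 <= M*delta and x1 >= ell1*delta.
    return x1 >= 0 and any(x1 <= M * d and x1 >= ell1 * d for d in (0, 1))

for x1 in [0.0, 0.5, 1.9, 2.0, 5.0, 10.0, 11.0]:
    expected = (x1 == 0.0) or (ell1 <= x1 <= M)
    assert feasible(x1) == expected
print("x1 is forced into {0} or [ell1, M]")
```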
4.7 Binary Polynomial Programming

Consider an optimisation problem of the form

minimise z = f(x)
subject to gᵢ(x) = 0, i = 1, 2, . . . , m     (4.9)
xⱼ ∈ {0, 1}, j = 1, 2, . . . , n,

where the objective f and the constraint functions gᵢ are polynomials in the binary
variables x₁, . . . , xₙ.
Observe first that, since each xⱼ takes values in {0, 1}, for every integer k ≥ 1 we have

xⱼᵏ = xⱼ.
It follows that we can replace every expression of the form xⱼᵏ with xⱼ for all j. This
ensures that no variable appears in the functions f or gᵢ, where i ∈ {1, 2, . . . , m}, with
an exponent greater than 1.
Note that this linearises all expressions featuring only one variable. It remains to
consider how we linearise expressions with more than one binary variable. The product
xⱼ · xₗ of two binary variables, where j, l ∈ {1, 2, . . . , n}, can be replaced by a new binary
variable y_jl related to xⱼ and xₗ by linear constraints. In particular, in order to ensure
that we have

y_jl = xⱼ · xₗ

when xⱼ and xₗ are binary variables, it suffices to impose the linear constraints

y_jl ≤ xⱼ
y_jl ≤ xₗ
y_jl ≥ xⱼ + xₗ − 1

in addition to xⱼ, xₗ, y_jl ∈ {0, 1}. If more than two variables are featured in an expression,
then we can apply a similar procedure to linearise.
For example, consider the objective function f defined by

f(x) = x₁⁵ x₂ + 4 x₁ x₂ x₃².

Upon applying the above linearisation sequentially, the function f is replaced initially
by the function

z = x₁ x₂ + 4 x₁ x₂ x₃

for the binary variables xⱼ, where j = 1, 2, 3. We then introduce binary variables y₁₂ in
place of x₁ x₂ and y₁₂₃ in place of y₁₂ x₃. The objective function is as such replaced by
the linear function

z = y₁₂ + 4 y₁₂₃,

together with the linear constraints

y₁₂ ≤ x₁
y₁₂ ≤ x₂
y₁₂ ≥ x₁ + x₂ − 1
y₁₂₃ ≤ y₁₂
y₁₂₃ ≤ x₃
y₁₂₃ ≥ y₁₂ + x₃ − 1
y₁₂, y₁₂₃, x₁, x₂, x₃ ∈ {0, 1}.
It should be noted that it is possible to replace the fourth and sixth constraints above by
other constraints if one would prefer not to make use of the new binary variable y12 in
the right-hand sides.
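The linearisation of this example can be verified exhaustively over all eight binary assignments: for each choice of (x₁, x₂, x₃), the linear constraints admit exactly one pair (y₁₂, y₁₂₃), namely the products they are meant to represent.

```python
from itertools import product

# For every binary assignment, the constraints force y12 = x1*x2 and
# y123 = x1*x2*x3, so z = y12 + 4*y123 equals x1*x2 + 4*x1*x2*x3.
for x1, x2, x3 in product((0, 1), repeat=3):
    solutions = [
        (y12, y123)
        for y12, y123 in product((0, 1), repeat=2)
        if y12 <= x1 and y12 <= x2 and y12 >= x1 + x2 - 1
        and y123 <= y12 and y123 <= x3 and y123 >= y12 + x3 - 1
    ]
    # Exactly one feasible pair, and it equals the intended products.
    assert solutions == [(x1 * x2, x1 * x2 * x3)]
print("linearisation is exact on all binary points")
```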
4.8 Exercises for Self-Study
1. Use the linearisation technique of Section 4.7 to reformulate the following problem
as an integer linear program:

maximise 4x₁² − x₁³ + 10x₂² − x₂⁴ + x₁ x₂⁷
subject to x₁ + x₂ ≤ 3
x₁, x₂ ≥ 0
x₁, x₂ ∈ {0, 1}.
2. The Research and Development Division of the Progressive Company has been de-
veloping four possible new product lines. Management must now make a decision
as to which of these four products actually will be produced and at what levels.
Therefore, an operations research study has been requested to find the most prof-
itable product mix. A substantial cost is associated with beginning the production
of any product, as given in the first row of the following table. Management’s ob-
jective is to find the product mix that maximizes the total profit (total net revenue
minus start-up costs).
Product
1 2 3 4
Start-up Cost $50,000 $40,000 $70,000 $60,000
Marginal Revenue $70 $60 $90 $80
Introduce auxiliary binary variables to formulate and solve a mixed BIP model for
this problem.
3. Suppose that a mathematical model fits linear programming except for the restric-
tion that |x₁ − x₂| = 0, or 3, or 6. Show how to reformulate this restriction to fit
an MIP model.
4. The Toys 4 U Company has developed two new toys for possible inclusion in its
product line for the upcoming Christmas season. Setting up the production facilities
to begin production would cost $50,000 for toy 1 and $80,000 for toy 2.
Once these costs are covered, the toys would generate a unit profit of $10 for toy
1 and $15 for toy 2.
The company has two factories that are capable of producing these toys. However,
to avoid doubling the start-up costs, just one factory would be used, where the
choice would be based on maximizing profit. For administrative reasons, the same
factory would be used for both new toys if both are produced.
Toy 1 can be produced at the rate of 50 per hour in factory 1 and 40 per hour in
factory 2. Toy 2 can be produced at the rate of 40 per hour in factory 1 and 25 per
hour in factory 2. Factories 1 and 2, respectively, have 500 hours and 700 hours
of production time available before Christmas that could be used to produce these
toys. It is not known whether these two toys would be continued after Christmas.
Therefore, the problem is to determine how many units (if any) of each new toy
should be produced before Christmas to maximize the total profit.
5. Suppose that a mathematical model fits linear programming except for the restric-
tions that at least one of the following two constraints

3x₁ − x₂ − x₃ + x₄ ≤ 12
x₁ + x₂ + x₃ + x₄ ≤ 15

holds, and that at least two of the following three constraints

2x₁ + 5x₂ − x₃ + x₄ ≤ 30
x₁ + 3x₂ + 5x₃ + x₄ ≤ 40
3x₁ − x₂ + 3x₃ − x₄ ≤ 60

holds. Show how to reformulate these restrictions to fit an MIP model.
6. A contractor, Susan Meyer, has to haul gravel to three building sites. She can
purchase as much as 18 tons at a gravel pit in the north of the city and 14 tons at
one in the south. She needs 10, 5, and 10 tons at sites 1, 2, and 3, respectively.
The purchase price per ton at each gravel pit and the hauling cost per ton are
given in the table below.
Susan wishes to determine how much to haul from each pit to each site to minimise
the total cost for purchasing and hauling gravel.
b) Susan now needs to hire the trucks (and their drivers) to do the hauling.
Each truck can only be used to haul gravel from a single pit to a single site.
In addition to the hauling and gravel costs specified above, there now is a
fixed cost of $150 associated with hiring each truck. A truck can haul 5 tons,
but it is not required to be full. For each combination of pit and site, there
are now two decisions to be made: the number of trucks to be used and the
amount of gravel to be hauled.
Formulate and solve an appropriate model for this problem.
Chapter 5
Sensitivity Analysis
For a real matrix A ∈ ℝ^{m×n} with m rows and n columns, b ∈ ℝᵐ and c ∈ ℝⁿ, consider
the maximisation LP given in standard form

maximise cᵀx
subject to Ax ≤ b     (5.1)
x ≥ 0.
• For every variable in the primal problem, there is a constraint in the dual.
• The objective function coefficients in the primal are the right-hand side coefficients
in the dual, and vice versa.
The dual of (5.1) is the minimisation LP

minimise bᵀy
subject to Aᵀy ≥ c     (5.2)
y ≥ 0.
The most important result in linear programming is a theorem that connects the
primal and dual problems. This is known as the Strong Duality Theorem of Linear
Programming and the result is stated below without proof.
Theorem. If a linear programming problem admits an optimal solution, then also its dual
admits an optimal solution. Furthermore, the optimal values of the primal problem and of
its dual coincide.
Consider the i-th constraint of the primal LP (5.1), namely

aᵢᵀx ≤ bᵢ.

Given a point x̄, the difference bᵢ − aᵢᵀx̄ is called the slack of the constraint at x̄. Ob-
serve that if x̄ is feasible, then the slack is nonnegative. The following is known as
complementary slackness.
Theorem. Consider the primal LP (5.1) and its dual LP (5.2), where A ∈ ℝ^{m×n}, b ∈ ℝᵐ
and c ∈ ℝⁿ. Given a feasible solution x* for the primal (5.1) and a feasible solution y*
for the dual (5.2), the following statements are equivalent:

• x* is optimal for the primal (5.1) and y* is optimal for the dual (5.2),
• cᵀx* = bᵀy*,
• yᵢ* (bᵢ − aᵢᵀx*) = 0 for every i ∈ {1, . . . , m}, and xⱼ* ((Aᵀy*)ⱼ − cⱼ) = 0 for every
j ∈ {1, . . . , n}.
5.2 Sensitivity Analysis

Consider the LP
maximise 2x₁ + 8x₂
subject to 2x₁ + x₂ ≤ 10
x₁ + 2x₂ ≤ 10
x₁ + x₂ ≤ 6
x₁ + 3x₂ ≤ 12     (5.3)
−3x₁ + x₂ ≤ 0
−x₁ − 4x₂ ≤ −4
x₁, x₂ ≥ 0.
As apparent from the diagram in Figure 5.1, the optimum is the extreme point x ⇤ defined
by constraints 4 and 5, which is the point of coordinates x 1 = 1.2, x 2 = 3.6. The
maximum objective function value is 31.2. Recall that an extreme point is a feasible
point that satisfies at equality n independent constraints from the above system.
Figure 5.1: Feasible region, shaded in gray, and optimal contour of the objective func-
tion. The direction of maximisation is represented by the arrow perpendicular to the
objective function contour.
which uniquely determines the values y₄* = 2.6 and y₅* = 0.2. Observe that the dual
problem has minimum objective function value equal to 2.6 · 12 + 0.2 · 0 = 31.2, affirming
that the primal and dual objective function values coincide.
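The coincidence of the two objective values can be checked numerically. The sketch below hard-codes the constraint data of the example LP together with the stated primal and dual solutions, and verifies feasibility of both alongside equal objective values, which is precisely a weak-duality certificate of optimality.

```python
# Certify optimality of x* = (1.2, 3.6) via weak duality: x* primal-feasible,
# y* dual-feasible, and matching objective values.
A = [[2, 1], [1, 2], [1, 1], [1, 3], [-3, 1], [-1, -4]]
b = [10, 10, 6, 12, 0, -4]
c = [2, 8]
x = [1.2, 3.6]
y = [0, 0, 0, 2.6, 0.2, 0]

eps = 1e-9
# Primal feasibility: Ax <= b and x >= 0.
assert all(sum(A[i][j] * x[j] for j in range(2)) <= b[i] + eps for i in range(6))
assert all(xj >= -eps for xj in x)
# Dual feasibility: A^T y >= c and y >= 0.
assert all(sum(A[i][j] * y[i] for i in range(6)) >= c[j] - eps for j in range(2))
assert all(yi >= -eps for yi in y)
# Equal objective values certify that both solutions are optimal.
primal = sum(c[j] * x[j] for j in range(2))
dual = sum(b[i] * y[i] for i in range(6))
assert abs(primal - dual) < 1e-6 and abs(primal - 31.2) < 1e-6
print("both objective values equal 31.2")
```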
How does the optimal value change if we were to change the value of the right-hand
side of constraint 5 by some amount θ, namely from 0 to 0 + θ? Note that θ could take
either a positive or a negative value. The modified LP reads
maximise 2x₁ + 8x₂
subject to 2x₁ + x₂ ≤ 10
x₁ + 2x₂ ≤ 10
x₁ + x₂ ≤ 6
x₁ + 3x₂ ≤ 12
−3x₁ + x₂ ≤ θ
−x₁ − 4x₂ ≤ −4
x₁, x₂ ≥ 0,

and its dual is

minimise 10y₁ + 10y₂ + 6y₃ + 12y₄ + θy₅ − 4y₆
subject to 2y₁ + y₂ + y₃ + y₄ − 3y₅ − y₆ ≥ 2
y₁ + 2y₂ + y₃ + 3y₄ + y₅ − 4y₆ ≥ 8
y₁, y₂, y₃, y₄, y₅, y₆ ≥ 0.
If the amount of change in θ is “too large”, there is not much one can say without
simply re-solving the problem. However, let us assume that the change is “small enough”
that the optimal solution will still be defined by constraints 4 and 5. We will later discuss
how small θ must be in order to satisfy this assumption.
Denote by z* = z*(θ) the optimal value of the modified problem. Assuming that
the optimum point is still defined by constraints 4 and 5, the optimal dual solution y*
satisfies y₁ = y₂ = y₃ = y₆ = 0 and it must also satisfy the first and second dual
constraints at equality as x₁ > 0 and x₂ > 0 holds. Note that the constraints of the dual
are unchanged (since θ only appears in the objective function) and as such y₄* = 2.6
and y₅* = 0.2 as before.
It follows that the new optimal value is

z*(θ) = 12 y₄* + θ y₅* = 12 · 2.6 + θ · 0.2 = 31.2 + 0.2 θ.
Notice that the change in resource availability can be either up or down. If the i-th
constraint is a ≤ constraint and the problem is a maximisation problem so that the dual
value yᵢ is nonnegative, then if the resource increases, so will the objective function,
while if the available resource decreases, then so will the objective function. This is to be expected
as increasing the right-hand side of a constraint makes the problem less constrained
and therefore there could exist solutions with higher objective function value, whereas
decreasing the right-hand side makes the problem more constrained and as such the
previous optimal solution might become infeasible and the new optimum would have
lower value.
If instead the dual variable has value 0, i.e. yi = 0, then for small enough changes in
the right-hand side of the i-th constraint, there will be no change in the optimal objective
function value.
Figure 5.2: Feasible region of the original and modified problem and corresponding
optimal contours of the objective function. The new optimal solution is indicated by the
black dot. Note that the optimal solution changes but it is still defined by constraints 4
and 5. The diagram corresponds to the value ✓ = 3.
The interpretation of the dual value as the rate of change of the objective function that
results from a change in a resource is only true for a limited range of values of the
right-hand side of the constraint. The derivation of the ranges is easy for the ineffective
constraints. The general derivation of this range for effective constraints is outside the
scope of this course, however we will compute it for the previous example and give
a geometric intuition through diagrams. Recall that the resource constraints that are
defining for some extreme point are called the effective constraints at that point, while,
the remaining resource constraints are called ineffective.
Ineffective Constraints
If the ineffective constraint is a ≤ constraint, then clearly the right-hand side can increase
to infinity without affecting the solution. The right-hand side can decrease until it is low
enough for the constraint to be satisfied at equality, namely to the value below which
the current solution would be infeasible. Similarly, the right-hand side of an ineffective
≥ constraint could decrease to minus infinity and increase to the value above which the
current solution will be infeasible.
For example, consider the LP (5.3). At the optimal solution x 1 = 1.2, x 2 = 3.6, the
constraints 1, 2, 3, and 6 are all ineffective. By how much can we change the right-hand
side of these constraints before the optimal solution changes? For constraint 1, we have
that
2 · 1.2 + 1 · 3.6 = 6.
Since the right-hand side of the first constraint is 10, this can decrease to 6 without
affecting the optimal solution and it can increase to infinity. That is, the optimal solution
x₁ = 1.2, x₂ = 3.6 does not change as long as the right-hand side b₁ of the first constraint
is within the range 6 ≤ b₁ < +∞.
In a similar fashion, we can compute the ranges for the other constraints, namely
8.4 ≤ b₂ < +∞, 4.8 ≤ b₃ < +∞ and −15.6 ≤ b₆ < +∞.
Effective Constraints
Let us consider the previous example (5.3), where we change the right-hand side of
constraint 5. From the diagram on the left in Figure 5.3, it is apparent that if we increase
the right-hand side of constraint 5, the optimal solution of the problem remains defined
by constraints 4 and 5 until the border of constraint 5 passes through the intersection
of constraint 4 and the nonnegativity constraint x₁ ≥ 0.
Figure 5.3: Representation of the largest and smallest values that the right-hand side of
constraint 5 can take in order for constraints 4 and 5 to remain effective.
By solving a suitable linear system, we can compute that the intersection point of
constraint 4 and x₁ = 0 is the point (0, 4). Constraint 5 passes through this point when
its right-hand side becomes

−3 · 0 + 1 · 4 = 4.
Similarly, from the diagram on the right of Figure 5.3, it is apparent that if we decrease
the right-hand side of constraint 5, the optimal solution of the problem remains defined
by constraints 4 and 5 until the border of constraint 5 passes through the intersection of
constraints 3 and 4. One can compute that the point has coordinates (3, 3). Constraint
5 passes through this point when its right-hand side becomes

−3 · 3 + 1 · 3 = −6.

That is, the right-hand side b₅ of constraint 5 can vary within the range

−6 ≤ b₅ ≤ 4.
It is possible to derive these bounds formally as follows. As we have seen, the primal
optimal solution is defined by constraints 4 and 5 as long as such solution is feasible (be-
cause the corresponding dual solution, as well as the dual constraints, are unchanged).
The basic solution defined by constraints 4 and 5 is the unique solution to the system

x₁ + 3x₂ = 12
−3x₁ + x₂ = θ,

where we denote by θ the new right-hand side of constraint 5. Solving the system, we
get that the solution is given by x₁(θ) = 1.2 − 0.3θ and x₂(θ) = 3.6 + 0.1θ. We need to
find the values of θ for which the solution is feasible. This is done by substituting x(θ)
into the primal constraints:

Constraint 1: 2x₁(θ) + x₂(θ) ≤ 10 ⟺ θ ≥ −8
Constraint 2: x₁(θ) + 2x₂(θ) ≤ 10 ⟺ θ ≥ −16
Constraint 3: x₁(θ) + x₂(θ) ≤ 6 ⟺ θ ≥ −6
Constraint 6: −x₁(θ) − 4x₂(θ) ≤ −4 ⟺ θ ≥ −116
Nonnegativity: x₁(θ), x₂(θ) ≥ 0 ⟺ θ ≤ 4 and θ ≥ −36.

In this case, observe that θ satisfies the above conditions if and only if −6 ≤ θ ≤ 4 as
we had previously determined.
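The computed range can be double-checked numerically: the sketch below substitutes x(θ) into all of the constraints and also confirms that the optimal objective value behaves as 31.2 + 0.2θ on the whole range.

```python
# Feasibility of x(theta) = (1.2 - 0.3*theta, 3.6 + 0.1*theta) for the example
# LP with b5 = theta, and the optimal value 31.2 + 0.2*theta along the range.
def x_of(theta):
    return (1.2 - 0.3 * theta, 3.6 + 0.1 * theta)

def feasible(theta, eps=1e-9):
    x1, x2 = x_of(theta)
    return (2*x1 + x2 <= 10 + eps and x1 + 2*x2 <= 10 + eps
            and x1 + x2 <= 6 + eps and x1 + 3*x2 <= 12 + eps
            and -3*x1 + x2 <= theta + eps and -x1 - 4*x2 <= -4 + eps
            and x1 >= -eps and x2 >= -eps)

assert feasible(-6) and feasible(0) and feasible(4)
assert not feasible(-6.1) and not feasible(4.1)
# Objective value along the range: 2*x1 + 8*x2 = 31.2 + 0.2*theta.
for theta in (-6, -2, 0, 3, 4):
    x1, x2 = x_of(theta)
    assert abs(2*x1 + 8*x2 - (31.2 + 0.2*theta)) < 1e-9
print("range -6 <= theta <= 4 and value 31.2 + 0.2*theta confirmed")
```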
In this subsection, we study how changes in one of the objective function coefficients
affect the value of the optimal solution. As before, if the coefficient falls within a certain
range, the new optimal objective value can be computed without having to re-solve
the problem. Computing this range is a straightforward task if the change occurs to the
coefficient of a variable that is nonbasic, while it is more complicated for basic variables.
Recall that if some nonnegativity constraint, say x j 0, is defining at some extreme
point, then we say that x j is a nonbasic variable. The other variables are basic variables
at that point.
Nonbasic Variables
If the problem is a maximisation and a variable x j is nonbasic, then clearly its coeffi-
cient c j in the objective function can decrease to minus infinity and the variable x j will
continue at the value zero. As the coefficient c j of the objective function increases, there
will be a value at which the solution will be multiply optimal, and then above that the
current solution will be suboptimal. The new optimal solution will have the variable x j
at a nonzero value.
Consider the LP

maximise 8x₁ + 3x₂
subject to 2x₁ + x₂ ≤ 10
x₁ + 2x₂ ≤ 10
x₁ + x₂ ≤ 6
x₁ + 3x₂ ≤ 12
−3x₁ + x₂ ≤ 0
−x₁ − 4x₂ ≤ −4
x₁, x₂ ≥ 0.
At the optimal solution, x₁ is basic with a value of 5 and x₂ is nonbasic. The effective
constraint is constraint 1 with dual value y₁ = 4. Suppose we replace the objective
function coefficient of the nonbasic variable x₂ with another number c₂. Since all other
coefficients in the LP are unchanged, in order to check optimality of the solution x₁ = 5,
x₂ = 0, we only need to confirm that the dual constraint relative to the variable x₂
remains satisfied by the dual solution. That is,

y₁ + 2y₂ + y₃ + 3y₄ + y₅ − 4y₆ ≥ c₂,

which at the current dual solution reads

1 · 4 + 2 · 0 + 1 · 0 + 3 · 0 + 1 · 0 − 4 · 0 = 4 ≥ c₂.

The solution would be multiply optimal if the value of c₂, now 3, were to increase to
4. Beyond the value of 4, the solution would be non-optimal. Thus the upper limit of
the value of c₂ is 4 and its lower limit is −∞.
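The threshold c₂ = 4 can be recovered from a duality certificate. With objective coefficients (8, c₂), the primal solution x* = (5, 0) and the dual solution y* = (4, 0, 0, 0, 0, 0) both have objective value 40 (since c₂ multiplies x₂ = 0), so x* stays optimal exactly while y* remains dual feasible. The minimal sketch below checks only the dual feasibility conditions.

```python
# x* = (5, 0) and y* = (4, 0, ..., 0) share objective value 40, so the
# weak-duality certificate is valid exactly while y* is dual feasible.
def certificate_valid(c2):
    y1 = 4.0
    # The two dual constraints at y* reduce to 2*y1 >= 8 and y1 >= c2.
    return 2 * y1 >= 8 and y1 >= c2

assert certificate_valid(3)        # the original coefficient
assert certificate_valid(4)        # multiply optimal at the threshold
assert not certificate_valid(4.5)  # certificate breaks beyond c2 = 4
print("x* = (5, 0) certified optimal exactly for c2 <= 4")
```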
More generally, suppose we have a maximisation LP. Given some optimal extreme
point x*, let y* be an optimal dual solution. Let xⱼ* be a nonbasic variable for x*. The
solution x* remains optimal if we change the corresponding objective function coeffi-
cient cⱼ within the range

−∞ < cⱼ ≤ a₁ⱼ y₁* + a₂ⱼ y₂* + ⋯ + a_mj y_m*.

If instead we were given a minimisation LP, then the solution x* remains optimal if we
change the corresponding objective function coefficient cⱼ for nonbasic xⱼ* within the
range

a₁ⱼ y₁* + a₂ⱼ y₂* + ⋯ + a_mj y_m* ≤ cⱼ < +∞.
Basic Variables
Consider once more the problem (5.3), represented in Figure 5.1. At the optimal so-
lution x* of coordinates x₁ = 1.2 and x₂ = 3.6, both variables are basic. Suppose we
change the coefficient of x₁ in the objective function, which is 2x₁ + 8x₂. As the coeffi-
cient c₁ of x₁ increases, the optimal contour of the objective function rotates clockwise
around the point x* until it becomes parallel to constraint 4, namely x₁ + 3x₂ ≤ 12.
This happens when c₁ = 8/3. At this value of c₁, the solution x* is no longer the unique
optimum, while for c₁ > 8/3 the optimal solution becomes the point x′ at the intersection
of constraints 3 and 4. This is illustrated in Figure 5.4(a).

Figure 5.4: (a) When c₁ is increased to 8/3, the optimal extreme points are x* and x′.
(b) When c₁ is decreased to −24, the optimal extreme points are x* and x″.

It follows that the solution x* remains optimal within the range

−24 ≤ c₁ ≤ 8/3.
System L1 L2 L3 L4 L5
Price (£) 2000 1400 1000 800 500
# CPUs 1 1 1 1 1
# SSDs 1 0.7 0 0.3 0
# Memory Boards 6 4 4 2 2
For example, 7 out of 10 of laptops from family L2 make use of SSDs (where the re-
maining 3 out of 10 use regular hard-disks) and the average price of a model in the family
L2 is £1400. The following difficulties are anticipated for the next quarter:
A demand of 5000 units is estimated for the first two types of laptops, while a demand
of 4000 is estimated for the last three types. Furthermore, there are already 700 orders
placed for laptops L2 and L5.
The company would like to devise a production plan for the next quarter in order to
maximise their profit. Further, the company wants to analyse the following “what-if” sce-
narios:
• The company could purchase 2000 extra memory boards from a different supplier, at
a cost of £200,000. Should they consider it?
• Marketing estimates that spending £150,000 in advertising would boost demand for
the lower priced laptops L3, L4, L5 by one thousand units in the next quarter. Should
the company invest the money in advertising?
• The company realises that it is losing money on its cheapest line of laptops, so they
intend to scrap production. However, the company would face a £120,000 penalty
for the missed delivery of the orders that have already been placed. What should they
do?
• The company realises that it has priced its top-of-the range laptop too low. At how
much should they price it in order for it to become profitable?
• Higher labour prices in the factory producing laptops of type 4 will reduce profit on
each unit by £100. Should the company consider changing its production plan? How
will this affect its profits?
Model this scenario and provide suggestions as to what decisions should be made in
each of the previous “what-if” scenarios.
5.3 Exercises for Self-Study

1. Write down the dual of each of the following LPs.
a) The LP

maximise 4x₁ + x₂ + 3x₃
subject to x₁ + 4x₂ ≤ 1
3x₁ − x₂ + x₃ ≤ 3
x₁, x₂, x₃ ≥ 0.
b) The LP

minimise x₁ + 7x₂ + 17x₃
subject to x₁ + 4x₂ ≥ 13
x₁ − 11x₂ + x₃ ≥ 3
x₁, x₂, x₃ ≥ 0.
c) The LP

maximise x₁ − 2x₂
subject to x₁ + 2x₂ − x₃ + x₄ ≥ 0
4x₁ + 3x₂ + 4x₃ − 2x₄ ≤ 3
x₁ − x₂ + 2x₃ + x₄ = 1
x₂, x₃ ≥ 0.
2. For each of the following linear programming models, give your recommendation
on which is probably the more efficient way to obtain an optimal solution: applying
the simplex method directly to the primal problem, or instead applying it directly
to the dual problem. Justify your answer.
a) The LP

maximise 10x₁ + 4x₂ + 7x₃
subject to 3x₁ − 2x₂ + 2x₃ ≤ 25
x₁ − 2x₂ + 3x₃ ≤ 25
5x₁ + x₂ + 2x₃ ≤ 40
x₁ + x₂ + x₃ ≤ 90
2x₁ − x₂ + x₃ ≤ 20
x₁, x₂, x₃ ≥ 0.
b) The LP

maximise 2x₁ + 5x₂ + 3x₃ + 4x₄ + x₅
subject to x₁ + 3x₂ + 2x₃ + 3x₄ + x₅ ≤ 6
4x₁ + 6x₂ + 5x₃ + 7x₄ + x₅ ≤ 15
x₁, x₂, . . . , x₅ ≥ 0.
3. Construct a pair of primal and dual problems, each with two decision variables and
two resource constraints, such that the primal problem has no feasible solutions
and the dual problem has an unbounded objective function.
4. Consider the LP

maximise 3x₁ − 8x₂
subject to x₁ − 2x₂ ≤ 10
x₁, x₂ ≥ 0.
a) Construct the dual problem and find its optimal solution by inspection.
b) Use the complementary slackness property and the optimal solution to the
dual problem to find the optimal solution to the primal problem.
c) Suppose that c1 , the coefficient of x 1 in the primal objective function, actually
can have any value in the model. For what values of c1 does the dual problem
have no feasible solutions? For these values, what does duality theory then
imply about the primal problem?
5. Consider the LP

maximize x₁ + 2x₂ + x₃ + x₄
subject to 2x₁ + x₂ + 5x₃ + x₄ ≤ 8
2x₁ + 2x₂ + 4x₄ ≤ 12
3x₁ + x₂ + 2x₃ ≤ 18
x₁, x₂, x₃, x₄ ≥ 0.
6. SugarCo can manufacture three types of candy bar. Each candy bar consists totally
of sugar and chocolate. The compositions of each type of candy bar and the profit
earned from each candy bar are shown in the table below. Fifty oz of sugar and
100 oz of chocolate are available.
d) Suppose a type 1 candy bar used only 0.5 oz of sugar and 0.5 oz of chocolate.
Should SugarCo make type 1 candy bars?
e) SugarCo is considering making type 4 candy bars. A type 4 candy bar earns
17 cents profit and requires 3 oz of sugar and 4 oz of chocolate. Should
SugarCo manufacture any type 4 candy bars?
7. For each of the objective coefficients in the previous exercise, find the range of
values for which the optimal solution remains optimal.
Chapter 6
In this chapter, we introduce an important class of linear programming problems that are
known as network flow problems. These problems are important for various reasons. One
reason is that they can be represented not only as a linear programming problem, but
additionally as a specific mathematical object called a graph. It turns out that looking
at problems in terms of graphs can be a helpful way of analysing them which often
provides a fresh perspective on the problem.
Moreover, because of their mathematical structure, network flow problems can sometimes
be solved much faster than general linear programming problems. From a practical
perspective, the modeler is therefore in a very convenient situation if it is possible
to represent a problem as a network flow problem. Finally, network flow problems are
guaranteed to have solutions that are integer if the right-hand sides of the constraints are
integer. This can be very important in practical applications when fractional solutions
do not make sense. Throughout this chapter, we consider some of the most relevant
types of network flow problems including minimum cost flow problems and transporta-
tion problems.
6.1 Graphs
We can represent any network flow problems as a graph. Here we introduce some basic
terminology about graphs before we come back to different optimisation problems.
An undirected graph (or simply a graph), denoted by G = (V, E), consists of two
(finite) sets V and E. The elements of V are called the vertices or nodes and the elements
of E are called the edges of the graph G. Each edge is an unordered pair of vertices that
are called its endnodes.
Consider for example the undirected graph with

V = {a, b, c, d} and E = {{a, b}, {b, c}, {b, d}, {a, d}}.

In order to simplify notation, we will usually just write ab instead of {a, b} to represent
the unordered pair of vertices. It should be emphasised that in an undirected graph
ab and ba are the same edge. Graphs have natural visual representations in which each
vertex is represented by a point and each edge by a line joining its endnodes. Figure 6.1
illustrates this undirected graph.

Figure 6.1: The representation of the undirected graph ({a, b, c, d}, {ab, bc, bd, ad}).
It should be noted that for the above graph, illustrated by Figure 6.1, some pairs of
vertices, including ad and bc, are “connected” in the sense that there exists some edge
connecting them. This observation inspires the following definition. Two vertices x, y
of a graph G are said to be adjacent if xy ∈ E. If e = xy is an edge of G, then we say
that e is incident with x and y.
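These definitions can be sketched in Python (a minimal illustration, not part of the original notes): storing each edge as a frozenset makes ab and ba the same edge, exactly as in the definition above.

```python
# A sketch of the undirected graph of Figure 6.1.
# Each edge is a frozenset, so the pair {a, b} is unordered: ab equals ba.
V = {"a", "b", "c", "d"}
E = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("b", "d"), ("a", "d")]}

def adjacent(x, y):
    """Two vertices are adjacent if xy is an edge."""
    return frozenset((x, y)) in E

print(adjacent("a", "b"))  # True: ab is an edge
print(adjacent("b", "a"))  # True: same edge, order does not matter
print(adjacent("a", "c"))  # False: ac is not an edge
```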
Notice that if x and y are adjacent vertices in an undirected graph, then it follows
that y and x are adjacent. Informally, this tells us that the above notion of connectedness
is symmetric over undirected graphs. From a modelling viewpoint, it turns out that
this symmetry property is rather restrictive as we can only represent relationships
that are “symmetrical”. In order to overcome this limitation, we introduce the following
definition of a directed graph.
A directed graph (or digraph, or network), denoted by G = (N, A), consists of two
(finite) sets N and A. The elements of N are called the vertices or nodes and the elements
of A are the arcs of the digraph G, where each arc is an ordered pair of vertices. Consider
for example the directed graph with

N = {a, b, c, d} and A = {(a, b), (b, c), (b, d), (d, b), (a, d)}.

Observe that (b, d) is not the same as (d, b) because (b, d) is the arc from b to d, while
(d, b) is the arc from d to b. In particular, making use of arcs in this manner has enabled
us to represent relationships that are not symmetrical.

Figure 6.2: The representation of the above directed graph, namely ({a, b, c, d},
{(a, b), (b, c), (b, d), (d, b), (a, d)}).
It will be useful in applications for us to formalise the above intuitive notion of some
graph or digraph being “connected”. In an undirected graph, a path is a sequence of
vertices v_1, v_2, . . . , v_k ∈ V such that {v_i, v_{i+1}} is an edge for each i = 1, 2, . . . , k − 1. In
other words, a path is a sequence of vertices with the property that each vertex in the
sequence is adjacent to the vertex next to it within the sequence. A graph G is said to
be connected if any two vertices of G are joined by a path.

In a similar fashion, in a directed graph, a directed path (or simply a path) is a
sequence of vertices v_1, v_2, . . . , v_k such that (v_i, v_{i+1}) is an arc for each i = 1, 2, . . . , k − 1;
this is a path from v_1 to v_k. In other words, a directed path is a path with the added
restriction that the arcs must all be directed in the same direction. A directed graph D
is said to be connected if, for any two nodes x, y, there exists both a directed path from
x to y and a directed path from y to x.
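The directed notion of connectivity can be checked mechanically: compute, for each node, the set of nodes reachable by directed paths, and require that every node reaches every other. A minimal sketch (the function names are ours, not from the notes), applied to the example digraph above:

```python
from collections import deque

def reachable(nodes, arcs, s):
    """Return the set of nodes reachable from s along directed paths (BFS)."""
    out = {v: [] for v in nodes}
    for i, j in arcs:
        out[i].append(j)
    seen, queue = {s}, deque([s])
    while queue:
        v = queue.popleft()
        for w in out[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def is_connected(nodes, arcs):
    """A digraph is connected if every node reaches every other node."""
    return all(reachable(nodes, arcs, s) == set(nodes) for s in nodes)

N = {"a", "b", "c", "d"}
A = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "b"), ("a", "d")]
print(is_connected(N, A))  # False: no arc leaves c, so c reaches no other node
```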
6.2 Minimum Cost Flow Problems

In this section, we introduce the minimum cost network flow problem in a general
setting before providing examples. For the minimum cost network flow problem, the
inputs are:
• a directed graph G = (N, A), with a cost c_ij for each arc (i, j) ∈ A;
• a set S ⊆ N of supply nodes, each with supply a_i, and a set D ⊆ N of demand nodes, each with demand b_i;
• lower and upper capacity bounds ℓ_ij and u_ij for each arc (i, j) ∈ A, respectively.
It should be noted that the “flow” corresponds to whatever it is that is supplied and de-
manded, which for example could be products in a logistics network or electricity over
electrical distribution systems, that passes through the given network. Suppose that the
lower capacity bound is no larger than the upper bound, i.e. that

ℓ_ij ≤ u_ij for every (i, j) ∈ A.

It will often be the case in applications that flows must be nonnegative, meaning that
our lower capacity bound is ℓ_ij = 0. Further, suppose that the total supply equals the
total demand, namely

∑_{i∈S} a_i = ∑_{i∈D} b_i.

It should be noted that we can make this assumption without loss of generality provided
the problem is feasible, which will be explained in detail later. The aim of the problem is
to send flow from the specified supply nodes S to the demand nodes D at minimum cost in
order to satisfy the demands, while not exceeding capacities and satisfying the bounds
on the flow on each arc.
For example, suppose the directed graph G = (N, A) represents some supply chain
network, where the set S ⊆ N represents the warehouses and D ⊆ N represents the
stores. Further, suppose that a_i represents the supply of a certain product available at
warehouse i ∈ S and b_i represents the demand at each store i ∈ D. In this case, the
problem is that we wish to send products from warehouses to stores to meet demand at
minimum total cost, while not violating the capacity constraints on the arcs. Note that
the arc capacities could in this case correspond to, say, the capacities of the vehicles
available in different locations.
The decision variables x_ij for every (i, j) ∈ A are defined to capture the amount of
flow on arc (i, j). In this scenario, we need to solve the following LP

minimise   ∑_{(i,j)∈A} c_ij x_ij
subject to ∑_{j:(j,i)∈A} x_ji − ∑_{j:(i,j)∈A} x_ij = −a_i   for every i ∈ S,
           ∑_{j:(j,i)∈A} x_ji − ∑_{j:(i,j)∈A} x_ij = 0      for every i ∈ N\(S ∪ D),
           ∑_{j:(j,i)∈A} x_ji − ∑_{j:(i,j)∈A} x_ij = b_i    for every i ∈ D,
           ℓ_ij ≤ x_ij ≤ u_ij                               for every (i, j) ∈ A.

In compact form, this LP reads: minimise c^T x subject to A_G x = d and ℓ ≤ x ≤ u,
where A_G is a matrix that depends on the directed graph G with a row for every node
and a column for every arc, d is the vector with a component for every node where each
component is equal to the negative supply, the demand or zero depending on whether
the component corresponds to a supply node, a demand node or neither, respectively,
and ℓ and u are vectors whose entries are the corresponding lower and upper capacity
bounds on each arc, respectively.
Recall that each decision variable x_ij corresponds to arc (i, j). Further, recall that
the columns and rows of A_G correspond to the arcs and the nodes of G, respectively. It
follows in light of the structure of the problem that the column of A_G corresponding to
arc (i, j) has exactly two non-zero elements, namely one −1 in the row corresponding
to node i and one +1 in the row corresponding to node j. The matrix A_G is called the
incidence matrix of the digraph G. In other words, entry (i, e) of the incidence matrix
A_G is −1 if node i is the tail of arc e, +1 if node i is the head of arc e, and 0 otherwise.
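Constructing the incidence matrix is purely mechanical, and a short Python sketch (illustrative code, using the example digraph from earlier in this chapter) makes the rule concrete: −1 in the row of the arc's tail, +1 in the row of its head.

```python
# A sketch of building the incidence matrix A_G of a digraph: one row per node,
# one column per arc, with -1 in the row of the arc's tail and +1 in the row of
# its head (matching the "in-flow minus out-flow" flow-balance convention).
def incidence_matrix(nodes, arcs):
    index = {v: r for r, v in enumerate(nodes)}
    M = [[0] * len(arcs) for _ in nodes]
    for col, (i, j) in enumerate(arcs):
        M[index[i]][col] = -1  # arc leaves node i
        M[index[j]][col] = +1  # arc enters node j
    return M

nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("b", "c"), ("b", "d"), ("d", "b"), ("a", "d")]
for row in incidence_matrix(nodes, arcs):
    print(row)
# Note: every column sums to 0, since each arc leaves one node and enters another.
```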
Example. (Constructing an incidence matrix from a digraph) Consider the directed graph
illustrated in Figure 6.3. The digraph can be represented by the incidence matrix given in
the table below.
The next example demonstrates how we can construct an LP for the minimum cost
network flow problem from a directed graph, namely the digraph shown in Figure 6.4.
Figure 6.4: An example of a directed graph with five nodes, where the capacities (within
squares) and costs associated with each arc are shown.
6.3 Integer Solutions to Minimum Cost Flow Problems

Consider an LP of the form

minimise   c^T x
subject to d ≤ A_G x ≤ f,          (6.1)
           ℓ ≤ x ≤ u,

where A_G is the incidence matrix of a digraph G.
If all right-hand sides d, f, ℓ, u of the constraints are integer, then all the extreme solutions
to the LP are integer. If all entries of c are integer, then all extreme optimal dual solutions
of the above LP are also integer (even if d, f, ℓ, u are not integer).
It should be noted that the minimum cost network flow problem corresponds to a
special case of the above theorem, where d = f. The above theorem ensures that, whenever
we need the values of the flows to take integer values, we get this “for free” because any
optimal extreme solution to the LP will have integer components. Furthermore, the
second part of the statement says that if the cost function c is also integer, then the
extreme optimal dual solutions are integer as well. This is a very useful property, as
we shall see when discussing minimum s-t cuts later in this chapter.
6.3. Integer Solutions to Minimum Cost Flow Problems 89
ii) overtime production up to a limit of v units per month at a cost of £b per unit, or
iii) storing the product from one month to the next at a cost of £c per unit per month.
The planning horizon for this factory is 3 months, where the demand is d_1, d_2 and d_3 units
in each of these months, respectively. The problem of meeting demand at minimum total
cost can be formulated as a minimum cost flow problem.
Figure 6.5 illustrates the network structure. The nodes M1, M2 and M3 are the three
months. The nodes RT1, RT2 and RT3 are regular time in those three months, while OT1,
OT2 and OT3 are overtime.
Figure 6.5: The network corresponding to the production and inventory example.
The next example from [1] demonstrates how a network flow model can be used
within a medical setting.
Example. (Network flow for the left ventricle) This application describes a network flow
model for reconstructing the three-dimensional shape of the left ventricle from biplane an-
giocardiograms that the medical profession uses to diagnose heart diseases. To conduct this
analysis, we first reduce the three-dimensional reconstruction problem into several two-
dimensional problems by dividing the ventricle into a stack of parallel cross sections. Each
two-dimensional cross section consists of one connected region of the left ventricle.
During a cardiac catheterization, doctors inject a dye known as Roentgen contrast agent
into the ventricle; by taking X-rays of the dye, they would like to determine what portion
of the left ventricle is functioning properly, i.e. permitting the flow of blood. Conventional
biplane X-ray installations do not permit doctors to obtain a complete picture of the left
ventricle; rather, these X-rays provide one-dimensional projections that record the total
intensity of the dye along two axes (see Figure 6.6). The problem is to determine the dis-
tribution of the cloud of dye within the left ventricle and thus the shape of the functioning
portion of the ventricle, assuming that the dye mixes completely with the blood and fills the
portions that are functioning properly.
This can be modelled by a network with a supply node for every row, where the supply
at a given row equals the cumulative dye intensity in that row, and a demand node for every
column, where the demand at a given column equals the cumulative dye intensity in that
column. Each entry (i, j) of the matrix corresponds to an arc (i, j) in the network, with
lower capacity bound 0 and upper capacity bound 1. An integer flow will therefore correspond
to a binary assignment to the entries of the matrix so that the row sums and column sums
equal the corresponding cumulative dye intensities.
To reconstruct a plausible shape of the left ventricle, we can use a priori information:
after some small time interval, the cross sections might resemble cross sections determined in
a previous examination. In consequence, we might attach a probability pi j that a solution
will contain an element (i, j) of the binary matrix and might want to find a feasible solution
with the largest possible total probability. This problem is equivalent to a minimum cost
flow problem.
Recall that during our statement of the minimum cost flow problem we assumed that
total supply was equal to total demand. This assumption was not necessary and, as
such, we could instead state the minimum cost network flow problem as the LP

minimise   ∑_{(i,j)∈A} c_ij x_ij
subject to ∑_{j:(j,i)∈A} x_ji − ∑_{j:(i,j)∈A} x_ij ≥ −a_i   for every i ∈ S,
           ∑_{j:(j,i)∈A} x_ji − ∑_{j:(i,j)∈A} x_ij ≥ 0      for every i ∈ N\(S ∪ D),
           ∑_{j:(j,i)∈A} x_ji − ∑_{j:(i,j)∈A} x_ij ≥ b_i    for every i ∈ D,
           ℓ_ij ≤ x_ij ≤ u_ij                               for every (i, j) ∈ A.
Note that if total demand is strictly larger than total supply, then the above problem
is unsurprisingly infeasible. If instead the total supply is strictly larger than the total
demand, i.e. ∑_{i∈S} a_i > ∑_{i∈D} b_i, then we can write the problem in the original (equality)
form by rebalancing supply and demand through the introduction of a new “dummy” demand
node, say zdummy, whose demand is exactly the excess supply ∑_{i∈S} a_i − ∑_{i∈D} b_i. We
can then introduce a new arc (i, zdummy) from every supply node i ∈ S with cost 0. It should
be noted that in this “rebalanced” network, it is indeed the case that total demand is
equal to total supply and the excess supply at each supply node is sent at cost 0 to the
dummy node. This explains why we can indeed assume that total supply is equal to total
demand without loss of generality. It should be noted that the constraints corresponding
to N\(S ∪ D) could be written as equality constraints (with right-hand side 0) if all costs
c_ij are positive; if this is not the case, however, then there may be a benefit to sending
additional flow.
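The rebalancing trick above can be sketched in a few lines of Python (the supply, demand and cost data are illustrative assumptions, not from the notes): when total supply exceeds total demand, a dummy demand node absorbs the excess through zero-cost arcs from each supply node.

```python
# A sketch of rebalancing supply and demand with a dummy demand node.
def rebalance(supply, demand, costs):
    excess = sum(supply.values()) - sum(demand.values())
    if excess < 0:
        raise ValueError("infeasible: total demand exceeds total supply")
    if excess > 0:
        demand = dict(demand, zdummy=excess)   # dummy node absorbs the excess
        costs = dict(costs)
        for i in supply:
            costs[(i, "zdummy")] = 0           # excess shipped to dummy for free
    return supply, demand, costs

supply = {"w1": 30, "w2": 25}                  # two warehouses, total supply 55
demand = {"s1": 20, "s2": 15}                  # two stores, total demand 35
s, d, c = rebalance(supply, demand, {("w1", "s1"): 4, ("w2", "s2"): 3})
print(sum(s.values()) == sum(d.values()))      # True: now balanced
print(d["zdummy"])                             # 20: the excess supply
```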
In other words, our objective here is to minimise total transportation costs. The con-
straints ensure that the amount of the commodity leaving each source i ∈ S is not greater
than the available supply a_i at source i and, similarly, that the demand b_j for the com-
modity at every destination j ∈ D is met. It should be emphasised that the transportation
problem is a special case of the minimum cost flow problem, where all nodes are either
supply nodes or demand nodes and where every arc goes directly from a supply node to
a demand node.
The following example demonstrates how the transportation problem may be used
for running targeted internet advertising.
Example. (Targeted advertising) A large company called Faces hosts internet sites and it
derives a large proportion of its overall revenue from advertising. Faces's customers are other
companies who intend to advertise their services and products throughout Faces's webpages.
When an ad is displayed on a page, Faces is paid a fee if the visitor clicks on the link.
In order to increase yield, Faces intends to resort to more targeted advertising. For ex-
ample, visitors on a sports news webpage may be more inclined to click on an advertisement
for sporting goods than, say, readers of a cooking forum.
Webpages hosted by Faces are divided into many context clusters, which include sports,
entertainment, technology, weather and politics. Let m be the number of these clusters.
Suppose that within a given time unit there are n advertisements to be displayed. Faces
has an estimate of the probability p_ij that visitors will click on advertisement j ∈ {1, 2, . . . , n}
when it is displayed on a page in cluster i ∈ {1, 2, . . . , m}.
Faces’s customers understandably want their ad to appear in at least a certain number
of pages within the given time unit, where b j denote the minimum number of times that
94 Chapter 6. Optimisation Problems on Graphs
ad j should appear. Further, due to bounds on available webpage space, within a given
time unit only a limited number of ads can appear in each cluster, where ai denotes the
maximum number of ads that can appear in cluster i in the given time unit.
Face’s objective is to maximise the expected total number of clicks on the advertisements
displayed, which increases the fees paid to them.
This problem can be cast as a (maximisation) transportation problem as follows. Each
cluster i corresponds to a supply vertex, with available supply a_i (to ensure that each cluster
has no more than a_i ads in total). Each ad j corresponds to a demand vertex, with demand
b_j (to ensure that each ad is shown at least b_j times across all clusters). The “transported
profit” from source i to destination j is the probability p_ij.
For each ad j ∈ {1, 2, . . . , n} and each cluster i ∈ {1, 2, . . . , m}, the decision variable
x_ij represents the number of times ad j appears in cluster i. The objective function is to
maximise total profit, namely

maximise ∑_{i=1}^{m} ∑_{j=1}^{n} p_ij x_ij,

which is to maximise the total expected number of times that visitors will click on some ad.
Then we can formulate the problem as the following binary program, namely

minimise   ∑_{i=1}^{n} ∑_{j=1}^{n} t_ij x_ij
subject to ∑_{j=1}^{n} x_ij = 1   for every i ∈ {1, 2, . . . , n},
           ∑_{i=1}^{n} x_ij = 1   for every j ∈ {1, 2, . . . , n},
           x_ij ∈ {0, 1}          for every i, j ∈ {1, 2, . . . , n}.

Note that the X-ray projection example is a special case of the assignment problem.
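Since a feasible solution of the binary program above is exactly a permutation (each i assigned to one j and vice versa), a tiny brute-force sketch makes the model concrete. The cost matrix t below is an illustrative assumption, not from the notes; this is only viable for small n.

```python
from itertools import permutations

# Brute-force sketch of the assignment problem: choose a permutation sigma
# assigning i to sigma(i) so as to minimise the total cost.
def solve_assignment(t):
    n = len(t)
    best = min(permutations(range(n)),
               key=lambda sigma: sum(t[i][sigma[i]] for i in range(n)))
    return best, sum(t[i][best[i]] for i in range(n))

t = [[4, 2, 8],
     [4, 3, 7],
     [3, 1, 6]]
sigma, cost = solve_assignment(t)
print(sigma, cost)
```

(The optimal value here is unique, although several permutations may attain it.)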
The shortest path problem is a particular network flow model that has received much
attention for both practical and theoretical reasons. This problem can be stated as fol-
lows. Given a directed graph G = (N, A) with (possibly negative) costs c_ij associated
with each arc (i, j) ∈ A, find the cheapest path through the network from a specified
source (or origin) s ∈ N to a specified sink (or destination) t ∈ N.

The theoretical interest in the shortest path problem arises because the problem has
a special structure and the underlying network admits very efficient solution procedures.
The practical interest in this problem is unsurprising: when you ask Google/Apple Maps
for directions, for example, it must solve a shortest path problem in order to tell you the
cheapest or fastest route.
Representing this problem as a network flow problem is straightforward. Figure 6.8
illustrates a directed graph that corresponds to the shortest path problem, where we
wish to find the shortest path from node s = 1 to node t = 8.
Figure 6.8: A directed graph corresponding to a shortest path problem, where we wish
to find the shortest path from the source s = 1 to the sink t = 8.
The problem is to send one unit of flow from the source node s to the sink node t
at the minimum cost (or distance). Let x_ij be the flow from node i to node j; then the
formulation of the shortest path problem is

minimise   ∑_{(i,j)∈A} c_ij x_ij
subject to ∑_{j:(j,i)∈A} x_ji − ∑_{j:(i,j)∈A} x_ij = { −1 if i = s,  1 if i = t,  0 otherwise },
           x_ij ≥ 0 for every (i, j) ∈ A.

Note that if i = s, then we are at the source node and, as such, the flow into s minus the
flow out of s is −1. Similarly, if i = t, we are at the sink node and, as such, the flow into t
minus the flow out of t is 1. For all other vertices, the in-flow minus the out-flow must be 0.
It should be noted that because of the integrality property, every extreme optimal
solution to the above problem will have integer components, where the components will
be 0 or 1. In particular, the arcs with flow 1 will form a path from s to t and the cost
of the corresponding flow will be exactly the cost of the path. Hence, the above LP will
provide a solution to the shortest path problem. This problem can be solved via the
usual linear programming algorithms; in practice, however, there are more efficient
algorithms available for this problem, such as the Bellman-Ford algorithm [12, 4].
We finally make a short observation in the case that our digraph has a directed cycle
with negative total cost. In a directed graph, a directed cycle is a nonempty directed path
from some node to itself. Observe that if the digraph has a directed cycle of negative
total cost, then the above LP will be unbounded. This intuitively follows since the cost
in such a case is −∞, as you can “go around” the cycle forever, where the cost decreases
each time you “go around”. Hence, the above LP is of no use when you want to find a
shortest path if there is a directed cycle of negative total cost.
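The Bellman-Ford algorithm mentioned above handles negative arc costs directly and, conveniently, detects exactly the negative-cost directed cycles that make the LP unbounded. A minimal sketch (the example digraph is an illustrative assumption):

```python
# Bellman-Ford: shortest path distances from a source, allowing negative costs.
# If a node can still be improved after |N| - 1 rounds of relaxation, a
# negative-cost directed cycle is reachable and we report failure.
def bellman_ford(nodes, arcs, s):
    dist = {v: float("inf") for v in nodes}
    dist[s] = 0
    for _ in range(len(nodes) - 1):           # |N| - 1 rounds of relaxation
        for (i, j), c in arcs.items():
            if dist[i] + c < dist[j]:
                dist[j] = dist[i] + c
    for (i, j), c in arcs.items():            # one extra round: any improvement
        if dist[i] + c < dist[j]:             # witnesses a negative cycle
            return None
    return dist

nodes = [1, 2, 3, 4]
arcs = {(1, 2): 2, (2, 3): -1, (1, 3): 4, (3, 4): 3, (2, 4): 5}
print(bellman_ford(nodes, arcs, 1))  # {1: 0, 2: 2, 3: 1, 4: 4}
```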
6.8 The Maximum Flow Problem

Figure 6.9: A directed graph illustrating an optimal solution to the maximum flow prob-
lem. The red numbers here represent positive flows, while black numbers denote the
positive arc capacities. Note that missing red numbers represent 0 flows. The maximum
flow through this digraph has value 5.
The maximum flow problem can be modelled as a minimum cost flow problem.
Recall that in the maximum flow problem the in-flow equals the out-flow at all
nodes (other than the source s and sink t) and, as such, the problem is equivalent
to maximising the flow out of the source s over all feasible flows.
In order to remove this technicality regarding the in-flow/out-flow equality every-
where other than the source and sink, it is useful to add an artificial arc (t, s) from
the sink to the source, as illustrated in Figure 6.10. The objective in the maximum flow
problem is then to maximise the flow value on that artificial arc.
More precisely, let A′ := A ∪ {(t, s)} denote the set of arcs from our original digraph
G plus the artificial arc; then the general formulation of the maximum flow problem is

maximise   x_ts
subject to ∑_{j:(j,i)∈A′} x_ji − ∑_{j:(i,j)∈A′} x_ij = 0   for every i ∈ N,      (6.2)
           0 ≤ x_ij ≤ u_ij                                 for every (i, j) ∈ A.

Equivalently, the objective can be written as maximise ∑_{(i,j)∈A′} c_ij x_ij, where

c_ij = 1 if (i, j) = (t, s), and 0 otherwise.
Observe that, in the above formulation, the flow x_ts on the artificial arc is unrestricted,
meaning that there are no upper or lower bounds imposed on it. In the digraph illustrated
in Figure 6.10, we wish to find the maximum flow from vertex a to vertex b. The values
associated with the solid arcs are the capacities. The dashed arc is introduced to the
digraph to model this as a maximum flow problem.
Figure 6.10: A directed graph corresponding to a maximum flow problem, where the
artificial arc (dashed) connects the sink to the source.
Note for completeness that there are similarly more efficient ways of solving maxi-
mum flow problems, such as the algorithms described by Ford and Fulkerson [11].
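The Ford-Fulkerson idea can be sketched compactly: repeatedly find an augmenting path from s to t in the residual graph and push the bottleneck amount of flow along it. The sketch below uses shortest (BFS) augmenting paths, the Edmonds-Karp variant; the small example network is an assumption, not the digraph of Figure 6.10.

```python
from collections import deque

def max_flow(nodes, capacity, s, t):
    # Residual capacities: forward arcs start at their capacity, reverse at 0.
    res = {}
    for (i, j), u in capacity.items():
        res[(i, j)] = res.get((i, j), 0) + u
        res.setdefault((j, i), 0)
    value = 0
    while True:
        # BFS for a shortest augmenting path from s to t in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            v = queue.popleft()
            for w in nodes:
                if w not in parent and res.get((v, w), 0) > 0:
                    parent[w] = v
                    queue.append(w)
        if t not in parent:
            return value          # no augmenting path left: flow is maximum
        # Find the bottleneck residual capacity along the path, then augment.
        path, w = [], t
        while parent[w] is not None:
            path.append((parent[w], w))
            w = parent[w]
        delta = min(res[e] for e in path)
        for (i, j) in path:
            res[(i, j)] -= delta
            res[(j, i)] += delta  # residual arc allows later "undo" of flow
        value += delta

nodes = ["s", "a", "b", "t"]
capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
            ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(nodes, capacity, "s", "t"))  # 5
```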
6.9 Minimum s-t Cuts
The maximum flow problem can be seen as a measure of resilience of the connection
between the source s and sink t. Another possible measure of this connection can be
provided by the minimum cut.
Under the same data as the maximum flow problem, we have a digraph G = (N, A)
with two special nodes, the source s and the sink t, where every arc e ∈ A has a strictly
positive capacity u_e > 0. In this problem we would like to remove a set of arcs of minimum
total capacity in order to disconnect s from t. Being more precise, an s-t cut is a
set C ⊆ A of arcs such that there is no path from s to t in the graph (N, A\C), which is
obtained from the original digraph G by removing the arcs in C. We want to find an
s-t cut C of minimum total capacity ∑_{e∈C} u_e.
Notice that if S is the set of nodes that can be reached from s in (N, A\C), then
because C is an s-t cut, we see that the sink t cannot be in S. In particular, it follows
that any optimal s-t cut C contains all arcs leaving S. This set is denoted by

δ⁺(S) = {(i, j) ∈ A : i ∈ S, j ∉ S}.

In particular, since every set δ⁺(S) with s ∈ S and t ∉ S is itself an s-t cut (as every
path from s to t must at some point use an arc that goes from some node in S to some
node not in S), it follows that a minimum s-t cut must indeed be of the form δ⁺(S)
for some S ⊂ N such that s ∈ S and t ∉ S.
The integer programming problem associated with the minimum s-t cut problem
is therefore

minimise   ∑_{e∈A} u_e z_e
subject to z_ij ≥ y_i − y_j   for every (i, j) ∈ A,      (6.3)
           y_s − y_t = 1,
           y_i ∈ {0, 1} for every i ∈ N,  z_e ∈ {0, 1} for every e ∈ A.
6.10 Maximum Flows vs Minimum Cuts

Recall that the maximum flow and minimum cut problems both provide measures of the
connection between the source and the sink within a digraph. It turns out that the
minimum s-t cut problem is in a sense the “dual” of the maximum s-t flow problem,
and throughout this section we argue why this is the case.
Let us firstly write the dual of the maximum flow LP (6.2). Note that the LP has
a flow-balance constraint for each node i ∈ N, meaning that the dual will have a cor-
responding variable y_i, and the LP has a capacity constraint x_e ≤ u_e for every e ∈ A,
meaning the dual will have a corresponding variable z_e. The right-hand sides of the
flow-balance constraints of (6.2) are all 0, while the right-hand side of the capacity
constraint x_e ≤ u_e for arc e ∈ A is u_e; therefore the objective of the dual is

minimise ∑_{e∈A} u_e z_e.
Note that the LP (6.2) has a variable for each arc e ∈ A and an extra variable for the
artificial arc (t, s). Further, recall that in the LP (6.2), for each (i, j) ∈ A, the decision
variable x_ij has a zero coefficient in the objective function; it appears in the flow-balance
constraint for node i with coefficient −1, in the flow-balance constraint of node j with
coefficient +1, and it appears in the constraint x_ij ≤ u_ij with coefficient 1. It follows that
the corresponding dual constraint is y_j − y_i + z_ij ≥ 0. For the remaining variable x_ts that
corresponds to the artificial arc, the objective coefficient is 1; however, the variable has
no capacity constraint and is free (since we did not enforce nonnegativity). It follows
therefore that the corresponding dual constraint is y_s − y_t = 1. Putting the pieces
together, the dual LP is

minimise   ∑_{e∈A} u_e z_e
subject to y_j − y_i + z_ij ≥ 0   for every (i, j) ∈ A,      (6.4)
           y_s − y_t = 1,
           z_e ≥ 0 for every e ∈ A.
Observe that the LP (6.4) is similar to the LP (6.3) corresponding to the minimum
s-t cut problem. We now show that the two are indeed equivalent. Firstly, observe that
by the integrality property discussed previously, extreme solutions to the LP (6.4) are
integer. Secondly, observe that every constraint involving the y_i variables in (6.4) always
contains one variable with coefficient 1 and one with coefficient −1.

Furthermore, notice that the y_i's do not appear in the objective function. This means
that we can “translate” the y_i's without changing the value of the solution. That is, for
any solution (ȳ, z̄) to (6.4), if we replace all ȳ_i for i ∈ N with ȳ_i + θ for any value
θ, we obtain another feasible solution with the same objective function value. We can
in consequence assume without loss of generality that y_t = 0 in an optimal solution,
which then implies y_s = 1 since y_s − y_t = 1.
Finally, given an optimal integer solution to (6.4) such that y_t = 0, if we define

S = {i ∈ N : y_i ≥ 1},

then the constraints z_ij ≥ y_i − y_j force z_ij ≥ 1 for all (i, j) ∈ δ⁺(S) and therefore

∑_{e∈A} u_e z_e ≥ ∑_{e∈δ⁺(S)} u_e.
It follows that the value of (6.4) is always greater than or equal to the value of the
minimum s-t cut, which is equal to the value of (6.3). Moreover, in an optimal
solution to (6.4) where we set y_t = 0, the variables y_i for i ∈ N and z_e for e ∈ A will
always take binary values even without explicitly using binary variables. This implies
that the optimum value of the LP (6.4) equals the value of the minimum s-t cut.
Because the LP (6.4) is the dual of the maximum flow problem, it follows by strong
duality that the two problems have the same objective function value. This argument
yields the following classical result.

Theorem (Max-flow min-cut). In a digraph G = (N, A) with capacities u_e > 0, the value
of a maximum s-t flow equals the total capacity of a minimum s-t cut.
Note that in all types of networks with flow conservation (meaning that in-flow
equals out-flow), the amount of flow that can pass through the network is intuitively
restricted by the weakest connection between disjoint parts of the network. This weakest
connection can be thought of as a “bottleneck” in the digraph. It turns out that this
“bottleneck” is precisely the minimum cut, namely a minimal set of arcs that stops the
flow through the network. Figure 6.11 illustrates the connection between maximum
flows and minimum cuts.
Figure 6.11: The optimal flow of value 5 along with an s-t cut of total capacity 5. In
this case, we have S = {a, b, c} and δ⁺(S) is represented by the bold arcs.
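Since every minimum s-t cut is of the form δ⁺(S), the theorem can be checked on small networks by brute force: enumerate every S with s ∈ S and t ∉ S, total the capacity of δ⁺(S), and compare the minimum against the maximum flow value. A sketch, with an illustrative small network whose maximum flow value is 5:

```python
from itertools import combinations

# Brute-force minimum s-t cut: enumerate all S with s in S and t not in S,
# and take the smallest total capacity of delta^+(S) = arcs leaving S.
def min_cut(nodes, capacity, s, t):
    inner = [v for v in nodes if v not in (s, t)]
    best = float("inf")
    for k in range(len(inner) + 1):
        for extra in combinations(inner, k):
            S = {s, *extra}
            cut = sum(u for (i, j), u in capacity.items()
                      if i in S and j not in S)
            best = min(best, cut)
    return best

nodes = ["s", "a", "b", "t"]
capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
            ("a", "t"): 2, ("b", "t"): 3}
print(min_cut(nodes, capacity, "s", "t"))  # 5: equals the maximum flow value
```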
6.11 The Traveling Salesman Problem

In the traveling salesman problem, the salesperson must visit n cities and return to the
city they started from. This will be called a tour. Given the costs c_ij of travelling from
city i to city j for each 1 ≤ i, j ≤ n with i ≠ j, in which order should the salesperson
visit the cities in order to minimise the cost of the tour? This problem is the famous
(asymmetric) traveling salesman problem (ATSP). Note that the acronym TSP is usually
reserved for the symmetric version of the problem, where c_ij = c_ji for all arcs (i, j).
Further, note that traffic collisions and one-way streets between cities are examples of
how this symmetry could break down.
The ATSP and TSP can unsurprisingly be viewed as optimisation problems on a
graph. In the ATSP, we have a directed graph where the cities are the vertices, the
paths between cities are the arcs, a path's distance is the corresponding cost, and we
have a minimisation problem starting and finishing at a specified vertex, subject to
visiting each other vertex exactly once. In the TSP, we instead have an undirected
graph, since c_ij = c_ji.

Given a subset of cities S ⊂ {1, 2, . . . , n}, denote its complement by S̄, namely S̄ :=
{1, 2, . . . , n}\S. Further, denote the cardinality of some subset of cities S by |S|.
The TSP can be formulated as

minimise   ∑_i ∑_{j≠i} c_ij x_ij
subject to ∑_{j≠i} x_ij = 1           for every i ∈ {1, 2, . . . , n},
           ∑_{i≠j} x_ij = 1           for every j ∈ {1, 2, . . . , n},
           ∑_{i∈S, j∈S̄} x_ij ≥ 1     for every nonempty subset S ⊂ {1, 2, . . . , n} with |S|, |S̄| ≥ 2,
           x_ij ∈ {0, 1}              for every i ≠ j,

where S ⊂ {1, 2, . . . , n} is used to denote that S is a proper subset of {1, 2, . . . , n}, namely
that S ⊆ {1, 2, . . . , n} with S ≠ {1, 2, . . . , n}. This is the classical and most widely used
formulation of the TSP, as developed by Dantzig, Fulkerson and Johnson [7].
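For very small instances, the problem the formulation above models can also be solved by plain enumeration: fix city 0 as the start and try every ordering of the remaining cities. The cost matrix below is an illustrative assumption (c[i][j] is the cost of travelling from city i to city j); the approach is of course hopeless beyond a handful of cities, which is precisely why formulations like the one above matter.

```python
from itertools import permutations

# Brute-force sketch of the (asymmetric) TSP over n cities labelled 0..n-1.
def solve_tsp(c):
    n = len(c)
    best_cost, best_tour = float("inf"), None
    for order in permutations(range(1, n)):
        tour = (0,) + order + (0,)      # start and end at city 0
        cost = sum(c[tour[k]][tour[k + 1]] for k in range(n))
        if cost < best_cost:
            best_cost, best_tour = cost, tour
    return best_tour, best_cost

c = [[0,  2,  9, 10],
     [1,  0,  6,  4],
     [15, 7,  0,  8],
     [6,  3, 12,  0]]
tour, cost = solve_tsp(c)
print(tour, cost)
```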
Note that the first two sets of constraints guarantee that the traveling salesman visits
each city exactly once. It should be emphasised that these two sets of constraints alone
are not sufficient. In particular, one can find solutions satisfying the first two sets of
constraints that do not correspond to one tour, but rather to multiple subtours.

In order to prevent this from happening, we impose the third set of constraints,
known as the subtour elimination constraints. Indeed, for any proper nonempty subset
S of the cities, to “reach” the cities in S̄ from the cities in S we need to cross from S to
S̄ and, thus, there must be a city in S̄ that is preceded by a city in S. This condition is
enforced by the third set of constraints.
Note that we need the subtour elimination constraints only for sets S such
that both S and S̄ contain at least two elements. If instead |S| = 1, then the subtour
elimination constraint relative to S is implied by the equation in the first set of constraints
relative to the only node in S. In a similar fashion, if |S̄| = 1, then the subtour elimination
constraint relative to S is implied by the equation in the second set of constraints relative
to the only node in S̄.
Let us consider an example with five cities, namely {1, 2, . . . , 5}, in order to
demonstrate how the subtour elimination constraints prevent subtours. The subtour
elimination constraints correspond to every nonempty subset S ⊂ {1, 2, . . . , 5} with
|S|, |S̄| ≥ 2. Notice that in this case there is one subtour elimination constraint for
each subset S with either two or three elements. Firstly, consider the subset
S = {1, 2} with S̄ = {3, 4, 5}. The subtour elimination constraint ensures that there ex-
ists at least one arc connecting a city in S (namely 1 or 2) to a city in S̄ (namely to
3, 4 or 5). This prevents any solution consisting of one subtour that only includes 1 and 2
and another that only includes 3, 4 and 5. Consider now the subset S = {3, 4} with S̄ = {1, 2, 5}. In a
similar fashion, the corresponding constraint ensures that there must be at least one
arc that goes from either 3 or 4 to 1, 2 or 5. Upon applying these constraints to all
possible subsets of cities with at least two cities in each subset, the formulation ensures
that the final solution is a single tour that covers all cities without forming smaller, dis-
connected tours. It should be noted that the inequality cannot necessarily be replaced
with an equality in the subtour elimination constraints. Doing such a replacement could
over-constrain the problem, making it impossible to find even a feasible solution (since
it is not clear that there is necessarily a solution satisfying the rigid condition that there
is exactly one arc from S to S̄ for every subset of suitable size).
Notice that this formulation has an exponential number of constraints, since the
number of proper subsets of {1, 2, . . . , n} is 2^n − 1. Despite the exponential number of
constraints, this is the formulation that is most widely used in practice. Initially, one
solves the linear programming relaxation that only contains the first and second sets of
constraints together with 0 ≤ x_ij ≤ 1. The subtour elimination constraints are generally
added later, on the fly, only when needed. It should be noted that this formulation is not
the only way of modelling the TSP. In particular, two other approaches of interest, namely
the MTZ [20] and SCF [14] formulations, are carefully outlined in the document entitled
Remarks on Modelling the TSP available on Moodle.
It turns out that the TSP is hard both in theory and in practice. More for-
mally, the TSP is an NP-hard problem, which recall informally means that
one should not expect to solve a random instance of the problem in polynomial time
unless P = NP (see e.g. [13]). The difficulty of this problem does not stem simply
from the exponential number of constraints in the above formulation. In contrast to the
minimum cost network flow problem, the extreme points of the linear programming
relaxation are typically not integer and the optimal value of the linear programming
relaxation can be very far from the optimal value of a tour. Figure 6.12 below (taken from
https://www.math.uwaterloo.ca/tsp/uk/index.html) illustrates how the TSP
was remarkably solved in order to calculate an optimal 49,687-stop pub crawl
of the UK. The computation of this tour required 14 months, which is equivalent to 250
years of computation time on a single processor.

6.12 Exercises for Self-Study
1. You need to take a trip by car to another town which you have never visited be-
fore. Therefore, you are studying a map to determine the shortest route to your
destination (from the origin). Depending on which route you choose, there are
five other towns (call them A, B, C , D, E) that you might pass through on the way.
The map shows the mileage along each road that directly connects two towns
without any intervening towns. These numbers are summarized in the following
table, where a dash indicates that there is no road directly connecting these two
towns without going through any other towns.
Formulate and solve this problem as a shortest path problem by drawing a network
where nodes represent towns, links represent roads and numbers indicate the
length of each link in miles.
2. At a small but growing airport, the local airline company is purchasing a new
tractor for a tractor-trailer train to bring luggage to and from the airplanes. A
new mechanized luggage system will be installed in 3 years, so the tractor will
not be needed after that. However, because it will receive heavy use, and the
running and maintenance costs will therefore increase rapidly as the tractor ages, it may still
be more economical to replace the tractor after 1 or 2 years. The following table
gives the total net discounted cost associated with purchasing a tractor (purchase
price minus trade-in allowance, plus running and maintenance costs) at the end
of year i and trading it in at the end of year j (where year 0 is now).
                       Trade-in at end of year j
                       j = 1      j = 2      j = 3
Purchase at    i = 0  $13,000    $28,000    $48,000
end of year    i = 1     —       $17,000    $33,000
               i = 2     —          —       $20,000
The problem is to determine at what times (if any) the tractor should be replaced
to minimise the total cost for the tractors over 3 years. Formulate and solve this
problem as a shortest path problem.
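The shortest-path formulation can be sketched directly: node i marks the end of year i, and arc (i, j) means "buy at the end of year i and trade in at the end of year j" with the cost from the table. A one-pass dynamic program over this acyclic network (a sketch, not the intended hand formulation) then reads:

```python
# Arc costs from the table: (i, j) -> total net discounted cost.
cost = {(0, 1): 13000, (0, 2): 28000, (0, 3): 48000,
        (1, 2): 17000, (1, 3): 33000,
        (2, 3): 20000}

# Shortest path from node 0 to node 3; since the graph is acyclic,
# one pass over the nodes in increasing order suffices.
best = {0: (0, None)}                     # node -> (cost so far, predecessor)
for j in (1, 2, 3):
    best[j] = min((best[i][0] + cost[i, j], i)
                  for i in range(j) if (i, j) in cost)

# Recover the optimal replacement plan by following predecessors back.
plan, node = [], 3
while node is not None:
    plan.append(node)
    node = best[node][1]
plan.reverse()

print(best[3][0], plan)   # -> 46000 [0, 1, 3]
```

The cheapest plan is to trade the tractor in once, at the end of year 1, for a total cost of $46,000.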
3. The next diagram depicts a system of aqueducts that originate at three rivers
(nodes R1, R2, and R3) and terminate at a major city (node T), where the other
nodes are junction points in the system.
Using units of thousands of acre feet, the tables below show the maximum amount
of water that can be pumped through each aqueduct per day.
From \ To     A      B      C
R1           130    115     —
R2            70     90    110
R3            —     140    120

From \ To     D      E      F
A            110     85     —
B            130     95     85
C             —     130    160

From \ To     T
D            220
E            330
F            240
The city water manager wants to determine a flow plan that will maximise the flow
of water to the city. Formulate and solve this problem as a maximum flow problem
by identifying a source, a sink and the transhipment nodes and by drawing the
complete digraph that shows the capacity of each arc.
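For a sense of how a maximum flow problem of this kind is actually solved, here is a minimal Edmonds-Karp sketch (breadth-first augmenting paths), with a hypothetical super-source "S" feeding the three rivers at unlimited capacity; the node names and arc capacities are taken from the tables above:

```python
from collections import deque

INF = float("inf")
cap = {
    ("S", "R1"): INF, ("S", "R2"): INF, ("S", "R3"): INF,
    ("R1", "A"): 130, ("R1", "B"): 115,
    ("R2", "A"): 70, ("R2", "B"): 90, ("R2", "C"): 110,
    ("R3", "B"): 140, ("R3", "C"): 120,
    ("A", "D"): 110, ("A", "E"): 85,
    ("B", "D"): 130, ("B", "E"): 95, ("B", "F"): 85,
    ("C", "E"): 130, ("C", "F"): 160,
    ("D", "T"): 220, ("E", "T"): 330, ("F", "T"): 240,
}

def max_flow(cap, s, t):
    # Edmonds-Karp: repeatedly push flow along a shortest augmenting path.
    residual = dict(cap)
    for (u, v) in cap:
        residual.setdefault((v, u), 0)     # reverse arcs start empty
    total = 0
    while True:
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:   # BFS in the residual graph
            u = queue.popleft()
            for (a, b), r in residual.items():
                if a == u and r > 0 and b not in parent:
                    parent[b] = u
                    queue.append(b)
        if t not in parent:
            return total
        path, v = [], t                    # recover the augmenting path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[e] for e in path)   # bottleneck capacity
        for (u, v) in path:
            residual[(u, v)] -= push
            residual[(v, u)] += push
        total += push

print(max_flow(cap, "S", "T"))
```

On this instance the maximum flow works out to 715 thousand acre-feet per day; a minimum cut consists of the arcs that are saturated when no augmenting path remains.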
4. The Texaco Corporation has four oil fields, four refineries and four distribution
centers. A major strike involving the transportation industries now has sharply
curtailed Texaco’s capacity to ship oil from the oil fields to the refineries and to
ship petroleum products from the refineries to the distribution centers. Using units
of thousands of barrels of crude oil (and its equivalent in refined products), the
following tables show the maximum number of units that can be shipped per day
from each oil field to each refinery, and from each refinery to each distribution
center.
Refinery
Oil Field New Orleans Charleston Seattle St. Louis
Texas 11 7 2 8
California 5 4 8 7
Alaska 7 3 12 6
Middle East 8 9 4 15
Distribution Centre
Refinery Pittsburgh Atlanta Kansas City San Francisco
New Orleans 5 9 6 4
Charleston 8 7 9 5
Seattle 4 6 7 8
St. Louis 12 11 9 7
The Texaco management now wants to determine a plan for how many units to
ship from each oil field to each refinery and from each refinery to each distribu-
tion center that will maximise the total number of units reaching the distribution
centers.
a) Draw a rough map that shows the location of Texaco’s oil fields, refineries
and distribution centers. Add arrows to show the flow of crude oil and then
petroleum products through this distribution network.
b) Redraw this distribution network by lining up all the nodes representing oil
fields in one column, all the nodes representing refineries in a second col-
umn, and all the nodes representing distribution centers in a third column.
Then add arcs to show the possible flow.
5. The MK Company is a fully integrated company that both produces goods and sells
them at its retail outlets. After production, the goods are stored in the company’s
two warehouses until needed by the retail outlets. Trucks are used to transport
the goods from the two plants to the warehouses, and then from the warehouses
to the three retail outlets.
Using units of full truckloads, the following table outlines each plant’s monthly
output, its shipping cost per truckload sent to each warehouse, and the maximum
amount that it can ship per month to each warehouse.
                 Unit Shipping Cost           Shipping Capacity
From          Warehouse 1  Warehouse 2    Warehouse 1  Warehouse 2   Output
Plant 1          $1175        $1580           375          450         600
Plant 2          $1430        $1700           525          600         900
For each retail outlet (RO), the next table shows its monthly demand, its shipping
cost per truckload from each warehouse, and the maximum amount that can be
shipped per month from each warehouse.
                  Unit Shipping Cost           Shipping Capacity
From            RO1      RO2      RO3        RO1      RO2      RO3
Warehouse 1    $1370    $1505    $1490       300      450      300
Warehouse 2    $1190    $1210    $1240       375      450      225
Demand          450      600      450
b) Formulate and solve this problem as a minimum cost flow problem by insert-
ing all the necessary data into this network.
6. Consider an assignment problem having the following cost table, where all times
are given in hours. This cost table shows that we are tasked with assigning four
workers (labelled A, B, C, D) to four tasks (labelled 1, 2, 3, 4) with the objective
of minimising the total time required to complete the tasks.
                   Task
                1    2    3    4
Assignee   A    8    6    5    7
           B    6    5    3    4
           C    7    8    4    6
           D    6    7    5    6
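For an instance this small, the optimum can even be found by brute force; the sketch below enumerates all 4! = 24 possible assignments (a real solver would instead use the Hungarian method or a network-flow formulation):

```python
from itertools import permutations

# Times (hours) from the table: time[worker][task index].
time = {"A": [8, 6, 5, 7],
        "B": [6, 5, 3, 4],
        "C": [7, 8, 4, 6],
        "D": [6, 7, 5, 6]}

workers = "ABCD"
best_total, best_assignment = min(
    (sum(time[w][t] for w, t in zip(workers, perm)), list(zip(workers, perm)))
    for perm in permutations(range(4)))

print(best_total)         # minimum total time in hours
print(best_assignment)    # (worker, task index) pairs achieving it
```

The minimum total time is 20 hours, achieved by A-task 2, B-task 4, C-task 3 and D-task 1.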
7. Four cargo ships will be used for shipping goods from one port to four other ports
(labeled 1, 2, 3, 4). Any ship can be used for making any one of these four trips.
However, because of differences in the ships and cargoes, the total cost of loading,
transporting and unloading the goods for the different ship-port combinations
varies considerably, as shown in the following table.
                Port
              1      2      3      4
Ship   1    $500   $400   $600   $700
       2    $600   $600   $700   $500
       3    $700   $500   $700   $600
       4    $500   $400   $600   $600
The objective is to assign the four ships to four different ports in such a way as to
minimise the total cost for all four shipments. Formulate and solve this problem
as an appropriate optimisation problem on a graph.
8. The coach of an age group swim team needs to assign swimmers to a 200m medley
relay team to send to a regional swimming competition. Since most of their best
swimmers are relatively fast in more than one stroke, it is not immediately clear
which swimmer should be assigned to each of the four strokes. The five fastest
swimmers and the best times (in seconds) they have achieved in each of the strokes
over 50m are outlined in the following table.
The coach wishes to determine how to assign four swimmers to the four different
strokes to minimise the sum of the corresponding best times. Formulate and solve
this problem as an appropriate optimisation problem on a graph.
9. a) Recall that each maximum flow defines a minimum capacity cut. Is this min-
imum capacity cut unique? That is, for a given maximum flow, could there
be more than one minimum capacity cut for this flow? Justify your answer.
10. Consider the maximum flow problem described by the following digraph, where
the source is node A, the sink is node F and the arc capacities are the numbers
shown next to these directed arcs.
Formulate and solve this problem as a maximum flow problem. Further, determine
a minimum cut in the network.
11. Joe State lives in Gary, Indiana. He owns insurance agencies in Gary, Fort Wayne,
Evansville, Terre Haute and South Bend. Each December, he visits each of his
insurance agencies. The distance between each pair of agencies (in miles) is
shown in the following table.
In what order should Joe visit his agencies to minimise the total distance travelled?
12. Find a minimum s-t cut for each of these networks. The numbers along the edges
represent maximum capacities.
13. Find all minimum s-t cuts in the following digraph. The capacity of each arc
appears as a label next to the arc.
14. Suppose we have a directed graph with nonnegative capacities on the arcs. Prove
or disprove the following statements.
a) If all arcs have distinct capacities, then the minimum cut is unique.
b) Multiplying all capacities by a number > 0 does not change the minimal
cuts.
c) Adding a number > 0 to all capacities does not change the minimal cuts.
Figure 6.12: An optimal 49,687-stop pub crawl of the UK, presented at
https://www.math.uwaterloo.ca/tsp/uk/index.html. The computation of
this tour required 14 months, which is equivalent to 250 years of computation time
on a single processor.
Chapter 7

Nonlinear Optimisation Models
Recall that there are a wide variety of mathematical models, some of which fall into the
rather broad categories of linear and nonlinear, integer and noninteger, and deterministic
and stochastic models. It should be noted that to this point we have only considered
deterministic models with linear constraints, where possibly some of the variables may
take integer values, through our consideration of LPs, IPs, MIPs and BIPs. In this
chapter, we introduce another class of mathematical optimisation models, namely those
that are nonlinear.
• the domain of the function f : x ↦ log x is the set dom(f) = {x ∈ ℝ : x > 0}.

The general nonlinear optimisation problem is

    minimise    f₀(x)
    subject to  fᵢ(x) ≤ 0  for i ∈ {1, 2, . . . , m},        (7.1)
                hᵢ(x) = 0  for i ∈ {1, 2, . . . , k}.

Since the functions involved need not be defined on all of ℝⁿ, we define the domain D of the problem (7.1) as the set of points for
which both the objective function and the constraint functions are defined, that is

    D = dom(f₀) ∩ dom(f₁) ∩ · · · ∩ dom(f_m) ∩ dom(h₁) ∩ · · · ∩ dom(h_k),

where ∩ denotes the (set theoretic) intersection. The feasible region is the set X of all
points in D satisfying the constraints. Note that X ⊆ D; however, the converse inclusion is not in
general true.
This allows us to restate the above general nonlinear optimisation problem as

    minimise { f₀(x) : fᵢ(x) ≤ 0 for i ∈ {1, . . . , m}, hᵢ(x) = 0 for i ∈ {1, . . . , k}, x ∈ D },        (7.2)

which means the problem concerns minimising a real-valued function subject to constraints given by some finite
number of real-valued functions, where all inputs belong to the problem's domain.
Linear programming is a special case of the above problem where all the fᵢ's for
i ∈ {0, 1, 2, . . . , m} are affine functions, namely functions of the form

    f(x) = aᵀx + b  for some a ∈ ℝⁿ and b ∈ ℝ.

For completeness, note that the word affine comes from the Latin affinis, which trans-
lates roughly to "connected with". Further, in geometry an affine transformation (or
mapping) between two vector spaces consists of a linear transformation followed by a
translation, meaning that an affine transformation is informally "connected with" some
linear transformation by a translation.
Note that an important property of LPs with n decision variables is that the prob-
lem's domain is ℝⁿ. This follows because there are no points in ℝⁿ at which affine
functions are undefined. Despite this, it should be emphasised that the feasible region
associated with an LP is usually not all of ℝⁿ.
Nonlinear programming problems refer to those problems with general form (7.2)
which do not satisfy this linearity assumption. Because of this, the above definition is far
too generic to say anything useful regarding actually solving such problems. Observe
for example that such a general problem includes any problem with binary variables,
because xᵢ ∈ {0, 1} can be expressed by the equality xᵢ(1 − xᵢ) = 0, i.e. via the two
inequalities xᵢ(1 − xᵢ) ≥ 0 and xᵢ(1 − xᵢ) ≤ 0. In particular, this argument suggests that
nonlinear programming is in general at least as hard as binary programming, which is
an NP-hard problem.
7.2 Global and Local Optimality
A point x* ∈ X is a global minimum for f in X if f(x*) ≤ f(x) for all x ∈ X, and a
global maximum for f in X if f(x*) ≥ f(x) for all x ∈ X.
Global maxima and minima are often collectively referred to as global extrema.
Further, a point x* ∈ X is:

• a local minimum for f in X if there exists an ε > 0 such that f(x*) ≤ f(x) for all
x ∈ X such that ‖x − x*‖ ≤ ε,

• a strict local minimum for f in X if there exists an ε > 0 such that f(x*) < f(x)
for all x ∈ X such that x ≠ x* and ‖x − x*‖ ≤ ε,

• a local maximum for f in X if there exists an ε > 0 such that f(x*) ≥ f(x) for all
x ∈ X such that ‖x − x*‖ ≤ ε, and

• a strict local maximum for f in X if there exists an ε > 0 such that f(x*) > f(x)
for all x ∈ X such that x ≠ x* and ‖x − x*‖ ≤ ε,

where ‖ · ‖ denotes the ℓ₂-norm (or Euclidean norm). In other words, a local minimum,
say, is a point x* in the set X such that no point x ∈ X within a ball of radius ε (some
neighbourhood) around x* has strictly smaller function value.
Consider for example the single variable function illustrated in Figure 7.1. Observe
that the point x′ is a (strict) local minimum of the function f since we can find a small
interval around x′ where no point has strictly smaller function value than x′. Despite
this, note that the point x′ is not a global minimum because there are points, including x″,
with lower objective value. This illustrates the important fact that not all local extrema
are global extrema.
Figure 7.1: The point x′ is a (strict) local minimum but not a global minimum.
Example. (MINOS finds a local optimum) Consider the nonlinear optimisation problem

    minimise    x · sin(x + 4)
    subject to  −10 ≤ x ≤ 10.

Figure 7.2 illustrates the objective function values over [−10, 10].
In this case, AMPL outputs that the optimal solution is x = 1.34995 with corresponding
objective value −1.084752213, namely the red dot in Figure 7.2. Notice that this is a local
minimiser; however, it is clearly not the global minimiser of our function on [−10, 10].
Figure 7.2: The nonlinear function values for x · sin(x + 4) over [−10, 10]. The red dot
at x = 1.34995 is the point that the solver MINOS outputs as the optimal solution to
the problem.
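The behaviour of MINOS here is easy to reproduce. In the sketch below, plain gradient descent (a stand-in for the local methods such solvers use) started at x = 0 slides into the valley near x ≈ 1.35, while a coarse grid search over [−10, 10] reveals a far lower value near x ≈ −8.8:

```python
import math

def f(x):
    return x * math.sin(x + 4.0)

def fprime(x):
    # Exact derivative of x * sin(x + 4) by the product rule.
    return math.sin(x + 4.0) + x * math.cos(x + 4.0)

# Plain gradient descent started at x0 = 0 slides into the nearest valley.
x = 0.0
for _ in range(2000):
    x -= 0.01 * fprime(x)

# A coarse grid search over [-10, 10] locates a much better point.
grid = [-10 + 0.001 * k for k in range(20001)]
x_global = min(grid, key=f)

print(round(x, 3))         # near the local minimiser ~1.35
print(round(x_global, 2))  # near -8.8, where f is far lower
```

The local method stops at the red-dot point with value about −1.08, while the true minimum over the interval is below −8.7.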
7.3 Convex Functions

A real-valued function f : ℝⁿ → ℝ is convex if, for all x, y ∈ dom(f) and all λ ∈ [0, 1],

    f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).        (7.3)

In other words, a function f is convex if for every two points in the domain of the function
and every λ ∈ [0, 1], the value of f at the weighted average of the two points is less
than or equal to the weighted average of the two values. Figure 7.3 provides a geometric
interpretation of the above definition. Observe that the point

    (λx + (1 − λ)y, f(λx + (1 − λ)y))

is a point on the graph of f and, therefore, (7.3) states that the point

    (λx + (1 − λ)y, λf(x) + (1 − λ)f(y))

belongs to the epigraph of f, where the epigraph of a function is the set of all points in the Cartesian
product dom(f) × ℝ ⊆ ℝⁿ⁺¹ lying on or above its graph. For example, the epigraph of
the function g : ℝ → ℝ defined by g(x) = x² is the set {(x, y)ᵀ ∈ ℝ² : y ≥ x²}.
Further, the set of points

    (λx + (1 − λ)y, λf(x) + (1 − λ)f(y))

for λ ∈ [0, 1] is the line segment joining (x, f(x)) to (y, f(y)). Hence, it follows that
(7.3) means that the epigraph of f contains the line segment joining any two points
in the graph of f.
Figure 7.3: A function f is convex if the line segment joining any two points in the graph
of f is contained in the epigraph of f .
Examples of convex functions on ℝ include:

• eᵃˣ for any a ∈ ℝ, and
• |x|ᵖ with p ≥ 1.

It is possible to verify that these functions are convex using definition (7.3). Note that
the simplest examples of convex functions are affine functions.
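Definition (7.3) can also be probed numerically. The following sketch samples random pairs of points and weights — a heuristic check rather than a proof — for the functions above and for sin, which is not convex:

```python
import math
import random

random.seed(0)

def is_convex_on_samples(f, xs, trials=2000, tol=1e-9):
    # Heuristic check of inequality (7.3) on randomly sampled pairs/weights.
    for _ in range(trials):
        x, y = random.choice(xs), random.choice(xs)
        lam = random.random()
        if f(lam * x + (1 - lam) * y) > lam * f(x) + (1 - lam) * f(y) + tol:
            return False            # found a chord below the function
    return True

pts = [k / 10 for k in range(-50, 51)]
print(is_convex_on_samples(lambda t: math.exp(0.7 * t), pts))  # e^{ax}: convex
print(is_convex_on_samples(lambda t: abs(t) ** 3, pts))        # |x|^p, p = 3: convex
print(is_convex_on_samples(lambda t: math.sin(t), pts))        # sin: not convex
```

No sampled chord ever dips below a convex function, while for sin a violating chord is found almost immediately.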
It should be noted that the above definition of a convex function requires that the
domain of f , namely dom( f ), is convex. For this purpose, we next provide a definition
of a set being convex.
Given a set C ⊆ ℝⁿ, we say that C is convex if the line segment between any two
points in C lies in C. In other words, a set C is convex if for any x, y ∈ C, the line
segment {λx + (1 − λ)y : 0 ≤ λ ≤ 1} with endpoints x and y is contained in C. Figure
7.4 illustrates both a convex and a non-convex set.
Figure 7.4: The set on the left is not convex, while the one on the right is convex.
Examples of convex sets include:

• any point x ∈ ℝⁿ, and
• any subspace of ℝⁿ.
It should be noted that the definition of convex sets allows us to state an equivalent
definition to (7.3) for convex functions. In particular, a real-valued function f : ℝⁿ → ℝ
is convex if both dom(f) and its epigraph are convex sets.
A simple yet important fact is that if f : ℝⁿ → ℝ is convex, then all sublevel sets of f
are convex, where the sublevel sets are sets of the form

    {x ∈ dom(f) : f(x) ≤ α}
for some fixed α ∈ ℝ. In other words, a sublevel set of a function f is the set of points
in the domain of f whose function values are no greater than the fixed value α.
It should be emphasised that this fact tells us that if a function f is convex, then all
sublevel sets of f are convex; however, the converse is not true in general.
Recall that in the previous section we provided a definition for both convex functions
and convex sets. Given some function, it is possible to prove it is or is not convex
using definition (7.3), however, this can be difficult and nonintuitive, even for certain
seemingly simple functions. It turns out that there are often simpler conditions that we
can check, which involve the first and second derivatives of the function.
Here we consider the familiar case of univariate functions f : R ! R, namely those
functions of a single variable.
Theorem. Suppose f : ℝ → ℝ is differentiable on its domain. Then f is convex if and
only if

    f(y) ≥ f(x) + f′(x)(y − x)  for all x, y ∈ dom(f).

This result informally tells us that a univariate function f is convex if and only if the
function f lies above its tangent lines.

Theorem. Suppose f : ℝ → ℝ is twice differentiable on its domain. Then f is convex if
and only if f″(x) ≥ 0 for all x ∈ dom(f).

This result informally tells us that a univariate function f is convex if and only if the
function f is always "curving upward".
It should be noted that the second theorem yields the test that is most practical,
provided the function is indeed twice differentiable on its domain. Upon making use of
this criterion, we can verify the convexity of many standard univariate functions.
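The second-derivative test is easy to mechanise. The sketch below approximates f″ by central finite differences and applies the criterion to two functions taken from the self-study exercises of this chapter:

```python
def second_derivative(f, x, h=1e-4):
    # Central finite-difference approximation to f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

# f(x) = x^4 + x^2 has f''(x) = 12x^2 + 2 > 0, so it is convex;
# f(x) = 2x^3 - 3x^2 has f''(x) = 12x - 6, which changes sign.
xs = [k / 10 for k in range(-30, 31)]
convex_1 = all(second_derivative(lambda t: t**4 + t**2, x) >= 0 for x in xs)
convex_2 = all(second_derivative(lambda t: 2 * t**3 - 3 * t**2, x) >= 0 for x in xs)
print(convex_1, convex_2)   # True False
```

A sign change of the approximated second derivative immediately certifies that the second function is neither convex nor concave.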
Recall that in general nonlinear optimisation problems, as illustrated in Figure 7.2, not
all local extrema are global extrema. Further, recall that nonlinear programming
methods, including gradient methods and Newton methods (see e.g. [5, Chapter 1]),
generate sequences of points that converge to some local optimum and not necessarily
a global optimum. The following theorem is the fundamental property of convex func-
tions that explains why convex optimisation problems are easier than general nonlinear
optimisation problems.
Theorem. Let f : ℝⁿ → ℝ be convex and let X ⊆ dom(f) be a convex set. Then every local
minimum for f in X is a global minimum for f in X.

Proof. Suppose f is convex and let x* be a local minimum of f in X. Then for some
neighbourhood N ⊆ X about x*, we have f(x) ≥ f(x*) for all x ∈ N. Suppose, to
derive a contradiction, that there exists some x′ ∈ X such that f(x′) < f(x*), i.e. that
x* is not a global minimum of f in X.
Consider the line segment x(λ) = λx* + (1 − λ)x′ for λ ∈ [0, 1]. Observe that
x(λ) ∈ X by the convexity of X. Then by the convexity of f, we have

    f(x(λ)) ≤ λf(x*) + (1 − λ)f(x′) < f(x*)        (7.4)

for all λ ∈ (0, 1), where the strict inequality follows since f(x′) < f(x*) by assumption.
We can pick λ sufficiently close to 1 such that x(λ) ∈ N. Then, by the definition
of the neighbourhood N, we have f(x(λ)) ≥ f(x*); however, f(x(λ)) < f(x*) by
(7.4), which yields the contradiction as required.
Note that in the above theorem, we do indeed need that both the function f and the
set X are convex. Consider for example the problem min{x₁² + 2x₂² : f(x) ≤ 0}, where
f(x) = min{1 − x₁, 1 − x₂}. The objective function is convex; however, the feasible
region is non-convex since it is the set X = {(x₁, x₂)ᵀ : x₁ ≥ 1 or x₂ ≥ 1}. Note that
both (1, 0) and (0, 1) are local minima; however, (1, 0) is the global minimum since it
has value 1, while (0, 1) has value 2. This is illustrated in Figure 7.5.
Figure 7.5: The dashed lines represent contours of the function x₁² + 2x₂² and the shaded
area represents the feasible region X.
A convex optimisation problem is a problem of the form

    minimise    f₀(x)
    subject to  fᵢ(x) ≤ 0  for i ∈ {1, 2, . . . , m},        (7.5)
                hᵢ(x) = 0  for i ∈ {1, 2, . . . , k},

where the functions f₀, f₁, . . . , f_m are convex and h₁, . . . , h_k are affine. A function f
is called concave if −f is convex. Examples of concave functions include:

• −x² on ℝ, and
• √x with x ≥ 0.

Note that affine functions are both convex and concave. In fact, affine functions are
perhaps surprisingly the only functions that are both convex and concave.
Recall that convex optimisation problems (7.5) have been defined as minimisation
problems. However, a maximisation problem of the form

    maximise    f₀(x)
    subject to  fᵢ(x) ≥ 0  for i ∈ {1, 2, . . . , m},
                hᵢ(x) = 0  for i ∈ {1, 2, . . . , k}

is also a convex optimisation problem whenever f₀, f₁, . . . , f_m are concave and
h₁, . . . , h_k are affine, since it is equivalent to minimising the convex function −f₀
subject to −fᵢ(x) ≤ 0.
The mass balance relationships that must hold for the m elements are

    Σⱼ₌₁ⁿ aᵢⱼ xⱼ = bᵢ  for i ∈ {1, 2, . . . , m}        (7.6)

and

    xⱼ ≥ 0  for j ∈ {1, 2, . . . , n},        (7.7)
where

    cⱼ = (F⁰/RT)ⱼ + log P

is the Gibbs free energy function for the j-th compound that can be found in tables, P
denotes the total pressure in atmospheres and log denotes the natural logarithm.
                                          aᵢⱼ
  j   Compound   (F⁰/RT)ⱼ      cⱼ      H (i=1)   N (i=2)   O (i=3)
  1   H          -10.021     -6.089       1
  2   H2         -21.096    -17.164       2
  3   H2O        -37.986    -34.054       2                    1
  4   N           -9.846     -5.914                  1
  5   N2         -28.653    -24.721                  2
  6   NH         -18.918    -14.986       1          1
  7   NO         -28.032    -24.100                  1         1
  8   O          -14.640    -10.708                            1
  9   O2         -30.594    -26.662                            2
 10   OH         -26.111    -22.179       1                    1
The AMPL model for the general problem can be found on Moodle. To solve this
problem, we make use of the MINOS solver. It should be noted that if we chose
CPLEX or Gurobi as a solver, we would receive an error message. This is because, as we will dis-
cuss, CPLEX and Gurobi can handle quadratic objectives and constraints but not general
nonlinear functions, even those that are convex.
7.8 Quadratic Optimisation

A quadratic function is a polynomial function of degree two. For example,

    f(x₁, x₂) = x₁² + 3x₁x₂ + 2x₂² − 5x₁ + 6x₂ + 3

is a quadratic function.
Let A ∈ ℝⁿˣⁿ be a square matrix with n rows and columns. The square matrix
A = (aᵢⱼ)ₙₓₙ is symmetric if it is equal to its transpose, i.e. if A = Aᵀ. In other words, a
matrix is symmetric if aᵢⱼ = aⱼᵢ for all i, j ∈ {1, 2, . . . , n}.
Further, a symmetric matrix A is:

• positive semi-definite if xᵀAx ≥ 0 for all x ∈ ℝⁿ,
• negative semi-definite if xᵀAx ≤ 0 for all x ∈ ℝⁿ, and
• indefinite if there exist x, y ∈ ℝⁿ with xᵀAx > 0 and yᵀAy < 0.
The identity matrix is for example positive semi-definite. Further, a diagonal matrix
with all non-negative entries is positive semi-definite. The matrix

    A = ( 1    0
          0   −2 )

is indefinite, since e₁ᵀAe₁ = 1 > 0 while e₂ᵀAe₂ = −2 < 0. In practice, definiteness is
most easily checked via the eigenvalues: a symmetric matrix is

• positive semi-definite if and only if all its eigenvalues are non-negative,
• negative semi-definite if and only if all its eigenvalues are non-positive, and
• indefinite if and only if there is at least one eigenvalue that is positive and at least
one eigenvalue that is negative.
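For 2 × 2 symmetric matrices the eigenvalues have a closed form, so the eigenvalue test can be sketched directly (a toy helper for illustration; in practice one would use a linear algebra library):

```python
import math

def eigenvalues_sym2(a, b, c):
    # Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, c]].
    tr = a + c
    disc = math.sqrt((a - c) ** 2 + 4 * b * b)
    return (tr - disc) / 2, (tr + disc) / 2

def classify(a, b, c):
    lo, hi = eigenvalues_sym2(a, b, c)
    if lo >= 0:
        return "positive semi-definite"
    if hi <= 0:
        return "negative semi-definite"
    return "indefinite"

print(classify(1, 0, 2))     # diag(1, 2)  -> positive semi-definite
print(classify(1, 0, -2))    # diag(1, -2) -> indefinite
print(classify(1, 2, 1))     # eigenvalues -1 and 3 -> indefinite
```

The third call shows why the eigenvalue test matters: all entries of [[1, 2], [2, 1]] are positive, yet the matrix is indefinite.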
Observe that given any quadratic function f : ℝⁿ → ℝ, we can write it in the form

    f(x) = xᵀQx + pᵀx + r,

where Q ∈ ℝⁿˣⁿ is a symmetric matrix, p ∈ ℝⁿ and r ∈ ℝ.

Theorem. The quadratic function f(x) = xᵀQx + pᵀx + r is convex if and only if Q is
positive semi-definite.

Notice that in the special case when n = 1, the above theorem yields the well-known
fact that f(x) = ax² + bx + c is convex if and only if a ≥ 0. The following result will be
useful in the remainder.
Theorem. Let Q ∈ ℝⁿˣⁿ be a symmetric matrix. Then Q is positive semi-definite if and only
if it can be written as Q = AᵀA for some m × n matrix A with m ≤ n.

For one direction, observe that if Q = AᵀA, then

    xᵀQx = xᵀAᵀAx = (Ax)ᵀ(Ax) = ‖Ax‖² ≥ 0,

where ‖ · ‖ denotes the ℓ₂-norm (or Euclidean norm). In particular, this shows that if
we have Q = AᵀA for some A, then the symmetric matrix Q is positive semi-definite.
A convex quadratic optimisation problem is of the form

    minimise    xᵀQ₀x + p₀ᵀx + r₀
    subject to  xᵀQᵢx + pᵢᵀx + rᵢ ≤ 0  for i ∈ {1, 2, . . . , k},        (7.9)

where the matrices Q₀, Q₁, . . . , Q_k are positive semi-definite.
The Markowitz model uses the covariances

    σᵢⱼ = E[(rᵢ − µᵢ)(rⱼ − µⱼ)]

and

    Portfolio variance = E[( Σᵢ₌₁ⁿ rᵢxᵢ − Σᵢ₌₁ⁿ µᵢxᵢ )²]
                       = E[ Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ (rᵢ − µᵢ)(rⱼ − µⱼ) xᵢxⱼ ]
                       = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ E[(rᵢ − µᵢ)(rⱼ − µⱼ)] xᵢxⱼ
                       = Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ σᵢⱼ xᵢxⱼ = xᵀΣx.

The model then seeks a minimum-variance portfolio whose expected return is at least R:

    minimise    xᵀΣx
    subject to  µᵀx ≥ R,
                Σᵢ₌₁ⁿ xᵢ = 1,
                x ≥ 0.

Note that the matrix Σ is positive semi-definite since

    xᵀΣx = E[(rᵀx − µᵀx)²] ≥ 0,

and hence this is a convex quadratic optimisation problem.
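The double sum above can be written out directly. The sketch below evaluates the expected return and variance of a portfolio on made-up data for three assets (the numbers are purely illustrative):

```python
# Hypothetical data: expected returns and covariance matrix for three assets.
mu = [0.08, 0.12, 0.10]
sigma = [[0.04, 0.01, 0.00],
         [0.01, 0.09, 0.02],
         [0.00, 0.02, 0.05]]

def portfolio_variance(x, sigma):
    # x^T Sigma x, written out as the double sum in the derivation above.
    n = len(x)
    return sum(sigma[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def expected_return(x, mu):
    return sum(m * xi for m, xi in zip(mu, x))

x = [0.5, 0.2, 0.3]   # an allocation summing to one
print(round(expected_return(x, mu), 3))       # -> 0.094
print(round(portfolio_variance(x, sigma), 4)) # -> 0.0225
```

Changing the allocation x trades expected return against variance, which is precisely the tension the Markowitz model formalises.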
A second order cone program (SOCP) has a linear objective and constraints of the form

    ‖Aᵢx + bᵢ‖ ≤ cᵢᵀx + dᵢ  for i ∈ {1, 2, . . . , m}.

The functions

    x ↦ ‖Aᵢx + bᵢ‖ − cᵢᵀx − dᵢ

are convex and, in consequence, it follows that SOCPs are convex optimisation problems.
It should be noted that the name comes from the fact that each constraint defines
an affine transformation of the standard second order cone (or the quadratic, Lorentz or
ice-cream cone), which, in ℝⁿ⁺¹, is the set of points of the form

    {(u, t) ∈ ℝⁿ × ℝ : ‖u‖ ≤ t}.
Note that while there are general nonlinear solvers, such as MINOS, KNITRO, Ipopt,
SNOPT, CONOPT, the main commercial solvers only accept problems in more restricted
forms. For example, CPLEX, Gurobi, Xpress and Mosek solve SOCP problems.
On the one hand, the advantages of general nonlinear solvers are clear. For example,
they accept any form of problem and they tolerate non-convexities. However, they also
have drawbacks, and several important classes of problems can instead be expressed as
SOCPs and passed to the more restricted solvers, as the following examples illustrate.
Linear Programming

Linear programming problems are very special types of SOCP. This follows since SOCPs
have a linear objective and any linear constraint, say aᵀx ≤ b, can be equivalently
written as ‖0‖ ≤ b − aᵀx, where 0 is the n-dimensional zero vector. Hence, we can
always write linear constraints as part of an SOCP.
Hyperbolic Constraints

Consider constraints of the form

    x² ≤ yz,   y, z ≥ 0,

where x, y, z are three variables. Such constraints can be transformed into SOCP con-
straints as follows. Observe that

    yz = ((y + z)/2)² − ((y − z)/2)²,

which implies that the hyperbolic constraint can be written as

    x² + ((y − z)/2)² ≤ ((y + z)/2)²,

or equivalently, since y + z ≥ 0, as the second order cone constraint
‖(x, (y − z)/2)‖ ≤ (y + z)/2.
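The equivalence between the hyperbolic form and the cone form can be sanity-checked numerically; the sketch below compares the two conditions at random points with y, z ≥ 0:

```python
import math
import random

def equivalent(x, y, z, tol=1e-12):
    # Do the hyperbolic and second order cone forms agree at (x, y, z)?
    hyperbolic = x * x <= y * z + tol
    cone = math.hypot(x, (y - z) / 2) <= (y + z) / 2 + tol
    return hyperbolic == cone

random.seed(1)
samples = [(random.uniform(-3, 3), random.uniform(0, 3), random.uniform(0, 3))
           for _ in range(10000)]
print(all(equivalent(x, y, z) for (x, y, z) in samples))
```

The two conditions agree at every sampled point, as the algebraic identity guarantees.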
Example. (Maximising the product of linear functions) Consider a problem in the variables
x₁, . . . , xₙ, y₁, . . . , yₖ, where the constraints are linear but the objective is to maximise the
product

    (cᵀx)(dᵀy),

where c ∈ ℝⁿ and d ∈ ℝᵏ and we require both cᵀx, dᵀy ≥ 0. Note that the above objective
is quadratic since it is a polynomial of degree two; however, it is not concave. Hence, the
above is not a convex quadratic problem. Nevertheless, upon introducing a variable t with
t² ≤ (cᵀx)(dᵀy), the problem becomes one of maximising t subject to a hyperbolic
constraint, which we have just seen is SOCP-representable.
Quadratic Programs
Quadratic programming is a special case of SOCP. Indeed, recall from an earlier theorem
that any positive semi-definite matrix Qᵢ ∈ ℝⁿˣⁿ can be factorised as Qᵢ = AᵢᵀAᵢ for some
m × n matrix Aᵢ with m ≤ n. In particular, consider the quadratic program (7.9) and
matrices Aᵢ such that Qᵢ = AᵢᵀAᵢ for i ∈ {0, 1, 2, . . . , k}.
Then, upon introducing new variables zᵢ for i ∈ {0, 1, . . . , k}, the quadratic program
(7.9) can be written as

    minimise    z₀ + p₀ᵀx + r₀
    subject to  ‖Aᵢx‖² ≤ zᵢ  for i ∈ {0, 1, 2, . . . , k},
                zᵢ + pᵢᵀx + rᵢ ≤ 0  for i ∈ {1, 2, . . . , k}.

Note that the constraints ‖Aᵢx‖² ≤ zᵢ are not SOCP constraints. However, they can be
written as hyperbolic constraints of the form ‖u‖² ≤ wzᵢ, where u = Aᵢx and w = 1.
As we have seen previously, these can be expressed as second order cone constraints. We
can do this explicitly using the fact that

    zᵢ = ((zᵢ + 1)/2)² − ((zᵢ − 1)/2)²,

so that ‖Aᵢx‖² ≤ zᵢ is equivalent to ‖(Aᵢx, (zᵢ − 1)/2)‖ ≤ (zᵢ + 1)/2.
Robust Optimisation

In this section we show how SOCP can be used to solve some simple robust convex
optimisation problems, in which uncertainty in the data is explicitly accounted for. We
consider an LP

    minimise    cᵀx
    subject to  aᵢᵀx ≤ bᵢ  for i ∈ {1, 2, . . . , m},

in which there is some uncertainty (or variation) in the parameters c ∈ ℝⁿ, aᵢ ∈ ℝⁿ or
bᵢ ∈ ℝ. In order to simplify the exposition, we assume that c and the bᵢ are fixed, and that the
aᵢ's are known to lie in some ellipsoids, aᵢ ∈ Eᵢ. Note that ellipsoids are simply affine
transformations of the unit sphere and therefore we can write each ellipsoid Eᵢ as

    Eᵢ = { āᵢ + Pᵢu : ‖u‖ ≤ 1 },
where Pᵢ is some symmetric positive semi-definite n × n matrix and āᵢ denotes the centre
of the ellipsoid.
In a worst-case framework, we require that the constraints be satisfied for all possible
values of the parameters aᵢ, which leads us to the robust LP

    minimise    cᵀx
    subject to  aᵢᵀx ≤ bᵢ  for all aᵢ ∈ Eᵢ and i ∈ {1, 2, . . . , m}.        (7.10)

The constraint for the i-th ellipsoid can be expressed as

    max{aᵢᵀx : aᵢ ∈ Eᵢ} ≤ bᵢ,

where max{aᵢᵀx : aᵢ ∈ Eᵢ} = āᵢᵀx + max{uᵀPᵢx : ‖u‖ ≤ 1}. By the Cauchy-Schwarz
inequality,

    uᵀPᵢx ≤ ‖u‖‖Pᵢx‖ ≤ ‖Pᵢx‖

for all u with ‖u‖ ≤ 1, while choosing u = Pᵢx/‖Pᵢx‖ gives

    uᵀPᵢx = (Pᵢx)ᵀPᵢx / ‖Pᵢx‖ = ‖Pᵢx‖,

which shows

    max{uᵀPᵢx : ‖u‖ ≤ 1} = ‖Pᵢx‖.

Hence, the robust LP (7.10) can be expressed as the SOCP

    minimise    cᵀx
    subject to  ‖Pᵢx‖ ≤ bᵢ − āᵢᵀx  for i ∈ {1, 2, . . . , m}.        (7.11)
Note that the additional norm terms act as “regularization terms”, discouraging large x
in directions with considerable uncertainty in the parameters a i .
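The key identity max{aᵢᵀx : aᵢ ∈ Eᵢ} = āᵢᵀx + ‖Pᵢx‖ can be verified numerically. In the sketch below (on a made-up two-dimensional instance), the maximising u* = Pᵢx/‖Pᵢx‖ attains the closed-form value and no sampled u with ‖u‖ ≤ 1 exceeds it:

```python
import math
import random

def norm(v):
    return math.sqrt(sum(t * t for t in v))

def matvec(P, u):
    return [sum(row[j] * u[j] for j in range(len(u))) for row in P]

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

# Hypothetical 2-d instance: ellipsoid centre a_bar, symmetric shape P, fixed x.
a_bar = [1.0, 2.0]
P = [[0.5, 0.1], [0.1, 0.3]]
x = [2.0, -1.0]

worst = dot(a_bar, x) + norm(matvec(P, x))    # claimed worst-case value of a^T x

# The maximiser u* = Px / ||Px|| attains the claimed value ...
Px = matvec(P, x)
u_star = [t / norm(Px) for t in Px]
attained = dot(a_bar, x) + dot(matvec(P, u_star), x)

# ... and random u with ||u|| <= 1 never exceed it.
random.seed(0)
best_sampled = -float("inf")
for _ in range(5000):
    u = [random.gauss(0, 1) for _ in range(2)]
    scale = random.random() / max(norm(u), 1e-12)   # now ||scale * u|| <= 1
    a = [a_bar[i] + matvec(P, [scale * t for t in u])[i] for i in range(2)]
    best_sampled = max(best_sampled, dot(a, x))

print(abs(attained - worst) < 1e-9, best_sampled <= worst + 1e-9)
```

Both checks succeed: the sampled values never beat the closed form, and the explicit maximiser reaches it exactly, as the Cauchy-Schwarz argument predicts.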
There is another, equivalent, statistical interpretation of robust optimisation. In
this interpretation, we assume that each vector aᵢ is independently drawn at random
from some Gaussian distribution. Further, we know the corresponding mean āᵢ and the
covariance matrix. Our objective is to find the vector x minimising the linear function
cᵀx such that the probability that x violates a constraint aᵢᵀx ≤ bᵢ is less than some
tolerance η. Such a framework gives rise to an SOCP of the same form as (7.11); however,
we do not give a formal derivation here.
The Sharpe ratio in finance defines an efficiency metric of a portfolio as the expected
return per unit risk, where the risk is measured as the standard deviation of the portfolio.
As in the Markowitz model, we have n assets i ∈ {1, 2, . . . , n} with random returns
r₁, r₂, . . . , rₙ and we are given the vector of expected returns µ and covariance matrix
Σ. Given an allocation x ∈ ℝⁿ of the n assets, the Sharpe ratio is defined as

    S(x) = (µᵀx − r_f) / (xᵀΣx)^(1/2),

where r_f denotes the return of a risk-free asset. The Sharpe ratio compares the pro-
jected returns relative to an investment benchmark (such as government bonds) with
the historical or expected variability of such returns.
Suppose there is a portfolio with µᵀx > r_f. It should be noted that if not, then
it would be better to simply invest in the risk-free asset. Since there is by assump-
tion such a portfolio, note that maximising the Sharpe ratio is equivalent to minimising
1/S(x). Note that since Σ is positive semi-definite, it can be factorised as Σ = AᵀA for
some m × n matrix A (for some m ≤ n). Hence, the standard deviation of the portfolio can be
written as

    (xᵀΣx)^(1/2) = ‖Ax‖,

and minimising 1/S(x) amounts to solving

    minimise    ‖Ax‖ / (µᵀx − r_f)
    subject to  Σᵢ₌₁ⁿ xᵢ = 1,
                x ≥ 0.
Upon substituting y = x/(µᵀx − r_f) and z = 1/(µᵀx − r_f), the problem can be rewritten as

    minimise    t
    subject to  ‖Ay‖ ≤ t,
                µᵀy − r_f z = 1,
                Σᵢ₌₁ⁿ yᵢ = z,
                y ≥ 0,

which is an SOCP. It should be noted that we recover the optimal portfolio x = y/z from
the optimal solution (y, z, t).
2. Show that if f : ℝⁿ → ℝ is convex, then all sublevel sets of f are convex, where
the sublevel sets are sets of the form

    {x ∈ dom(f) : f(x) ≤ α}

for some fixed α ∈ ℝ.
3. For each of the following functions, show whether the function is convex, concave
or neither.
a) f (x) = 10x x 2,
b) f (x) = x 4 + 6x 2 + 12x,
c) f (x) = 2x 3 3x 2 ,
d) f (x) = x 4 + x 2 , and
e) f (x) = x 3 + x 4 .
5. For each of the following functions, show whether the function is convex, concave
or neither.
a) f(x) = x₁x₂ − x₁² − x₂²,
b) f(x) = 3x₁ + 2x₁² + 4x₂ + x₂² − 2x₁x₂,
c) f(x) = x₁² + 3x₁x₂ + 2x₂²,
e) f(x) = x₁x₂.
6. Consider the nonlinear programming problem

    minimise    x₁⁴ + 2x₂²
    subject to  x₁² + x₂² ≤ 2.

Show that this problem is a convex programming problem both geometrically and
algebraically.
7. Solve the nonlinear programming problem from the previous exercise using a
solver via AMPL.
8. Answer the questions below for the following problem, where in each case you
must justify your answer.
    minimise    f(x) = (1/4)x₁⁴ − (1/2)x₁² − x₂
    subject to  x₁² + x₂² ≤ 4,
                x₁ − x₂ ≤ 2.
θ₁x₁ + θ₂x₂ + · · · + θₖxₖ ∈ C.
Part II

Simulation
Chapter 8

Statistics and Probability Background
S = {H, T },
where the outcome H means the coin is heads and the outcome T means the coin is
tails. For tossing a coin twice, the sample space becomes
S = {H H, H T, T H, T T }, (8.1)
where the outcomes are defined in a similar fashion. For a less standard example,
consider the experiment where eight runners, numbered 1 through 8, run a race. Then
(assuming all runners complete the race) the sample space is

    S = {all orderings of (1, 2, . . . , 8)},

where the outcome (2, 7, 1, 8, 3, 4, 5, 6), for example, means that runner number 2 fin-
ished first, runner number 7 finished second, and so on.
Any subset A ⊆ S of the sample space is known as an event. In other words, an event is
a set consisting of possible outcomes of the experiment. If the outcome of the experiment
is contained in A, we will say that A has occurred. For example, in the experiment where
one flips a coin twice, if A = {T H, T T }, then A is the event that the first flip is a tails. In a
similar light, in the example about runners, if A = {all outcomes in S starting with a 3},
then A is the event that runner number 3 finishes first.
For any two events A and B, we define the new event A ∪ B, called the union of A
and B, to consist of all outcomes that are in either A or B or in both A and B. In a similar
fashion, we define the event A ∩ B, called the intersection of A and B, to consist of all
outcomes that are in both A and B. Note that if A ∩ B = ∅, so that A and B cannot both
occur, we say that A and B are mutually exclusive.
It is also possible to define unions and intersections for more than two events. For
this purpose, let A₁, A₂, . . . , Aₙ denote n events. The union of these n events is

    ⋃ᵢ₌₁ⁿ Aᵢ := A₁ ∪ A₂ ∪ · · · ∪ Aₙ,

which consists of all outcomes that are in at least one Aᵢ. The intersection of these n
events is

    ⋂ᵢ₌₁ⁿ Aᵢ := A₁ ∩ A₂ ∩ · · · ∩ Aₙ,

which consists of all outcomes that are in every Aᵢ.
In words, the first axiom states that the probability that the outcome of the experiment
lies within A is some number between 0 and 1 inclusive. The second axiom states that
with probability 1 this outcome will be a member of the sample space S. Finally, the
third axiom states that for any set of mutually exclusive events, the probability that at
least one of the events A1 , A2 , . . . occurs is precisely equal to the sum of their respective
probabilities. These three axioms may be used in order to prove a wide variety of results
about probabilities.
that each element of the sample space occurs with probability 1/4. Suppose further
that we know that the first flip lands on heads. In light of this information (regarding the
first flip), what is the probability that both flips land on heads? It is relatively straight-
forward to argue that because the first flip is a head, there are now at most two possible
outcomes, namely HH or HT, both of which are equally likely to occur, and hence each
of these outcomes has (conditional) probability 1/2. It is worth noting that the (condi-
tional) probabilities of the other two outcomes, namely TH and TT, given that the first
flip is a head, are unsurprisingly both 0.
If we respectively denote by A and B the event that both flips land on heads and the
event that the first flip lands on heads, then the probability obtained above is called the
conditional probability of A given that B has occurred, denoted P(A|B). We can similarly
deduce the following general formula for P(A|B), which is valid for all experiments and
events A and B, namely

    P(A|B) = P(A ∩ B) / P(B).
It is worth emphasising that the conditional probability P(A|B) is defined only when
P(B) > 0, namely in the scenario that the event B can occur.
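Conditional probabilities of this kind are easy to sanity-check by simulation. The sketch below (plain Python; the function name is ours, not from the notes) estimates P(A|B) for the two-coin example, where A is "both flips land heads" and B is "the first flip lands heads":

```python
import random

random.seed(2024)

def estimate_conditional(trials=200_000):
    """Estimate P(A|B) = P(A and B) / P(B) by simulating two fair coin
    flips: A = both flips heads, B = first flip heads."""
    count_b = 0      # times B occurred
    count_ab = 0     # times both A and B occurred
    for _ in range(trials):
        first_heads = random.random() < 0.5
        second_heads = random.random() < 0.5
        if first_heads:
            count_b += 1
            if second_heads:
                count_ab += 1
    return count_ab / count_b

print(estimate_conditional())  # close to the exact value 1/2
```

The estimate agrees with the argument above: given the first flip is a head, the second flip alone decides whether A occurs.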
As illustrated in the coin flipping example, P(A|B), namely the conditional probabil-
ity of A given that B occurred, does not in general equal P(A), the unconditional
probability of A. In the special case that P(A|B) = P(A) holds, we say that A and B are in-
dependent. Using the aforementioned general formula for P(A|B), we can equivalently
state that A and B are independent if

    P(A ∩ B) = P(A) · P(B).
When an experiment is performed we are sometimes concerned with the value of some
numerical quantity determined by the result. These quantities of interest that are de-
termined by the results of the experiment are known as random variables. Being a little
more precise, a random variable is a mathematical formalisation of some quantity or
object that depends on random events. It is a mapping or a function from possible out-
comes within a sample space to some “measurable space”, which for our purpose will
be the real numbers R.
The cumulative distribution function (or simply the distribution function) F of the
random variable X is defined for any real number x by

    F(x) = P{X ≤ x},

where the right-hand side represents the probability that the random variable X takes
on a value less than or equal to x.
A random variable that can take either a finite or at most a countable number of
possible values is said to be discrete. For a discrete random variable X we define its
probability mass function p(x) by

    p(x) = P{X = x},

where the right-hand side represents the probability that the discrete random variable X
is exactly equal to x. If X is a discrete random variable that takes on one of the countable
number of possible values x₁, x₂, . . ., then because X must take one of these values, by
the Axioms of Probability we have

    ∑ᵢ₌₁^∞ p(xᵢ) = 1.
In particular, the set of possible values lies in a set which is formally continuous and
can intuitively be thought of as a set without gaps.
The relationship between the cumulative distribution F(·) and its probability density
function f(·) is expressed by

    F(a) = P{X ∈ (−∞, a]} = ∫_{−∞}^{a} f(x) dx.

In other words, the area under the curve of the probability density function f(x) from
negative infinity to a is equal to the probability that the random variable X takes a value
less than or equal to a. Upon differentiating, note that the density is the derivative of the
cumulative distribution function.
8.5 Expectation
One of the most fundamental and useful concepts in probability is that of the expectation
of a random variable. If X is a discrete random variable which takes one of the possible
values x₁, x₂, . . ., then the expected value of X (or the mean of X) is denoted by E[X] and
defined by

    E[X] = ∑ᵢ xᵢ P{X = xᵢ} = ∑ᵢ xᵢ pᵢ,    (8.2)
where pi denotes the probability corresponding to the value x i . In other words, the
expected value of X is a weighted average of the possible values X can take, where the
weights are simply the probability that X assumes that value.
If X is instead a continuous random variable with probability density function f (·),
then, similarly to (8.2), the expected value of X is

    E[X] = ∫_{−∞}^{∞} x f(x) dx.
Suppose now that we do not want to determine the expected value of a random
variable X, but instead that of the random variable g(X), where g(·) denotes some given
function. Note that g(X) takes on the value g(x) when the random variable X takes
on the value x. Intuitively, it seems that E[g(X)] should be a weighted average of the
possible values g(x), where, for a given x, the weight attached to g(x) equals the
probability (or probability density) that X equals x. Indeed, this turns out to be the
case and thus the following result holds.
following result holds.
Proposition. Let g(·) denote any function. If X is a discrete random variable with proba-
bility mass function p(x), then

    E[g(X)] = ∑ₓ g(x) p(x).
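The proposition can be checked directly on a small example. The following sketch (plain Python; the fair-die distribution and helper name are our illustrative choices) computes E[g(X)] as the weighted sum over the support:

```python
def expectation(values, probs, g=lambda x: x):
    """E[g(X)] = sum of g(x) * p(x) over the support of X."""
    return sum(g(x) * p for x, p in zip(values, probs))

# X uniform on {1, ..., 6}, i.e. a fair die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

print(expectation(values, probs))                     # E[X] = 3.5
print(expectation(values, probs, g=lambda x: x * x))  # E[X^2] = 91/6
```

The same helper with g(x) = ax + b also illustrates the identity E[aX + b] = aE[X] + b mentioned below.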
In particular, taking g(x) = ax + b for constants a and b yields

    E[aX + b] = aE[X] + b.
Further, it can be shown that expectation is a linear operation in the sense that for any
two random variables X₁ and X₂, we have E[X₁ + X₂] = E[X₁] + E[X₂].
8.6 Variance
Despite the clear uses of the expected value E[X], it does not yield any information about
the variation of the possible values of X. There are several ways to measure variation;
one important approach is to consider the average value of the square of the difference
between X and E[X]. This inspires the following definition.

Definition. The variance of X, denoted by Var(X), is defined by

    Var(X) = E[(X − E[X])²].

An equivalent and often more convenient expression is

    Var(X) = E[X²] − (E[X])²,

which can be derived by expanding and simplifying. A useful identity, the proof of which
is left as an exercise, is that for all real constants a and b, we have

    Var(aX + b) = a² · Var(X).
In contrast to the fact that the expected value of a sum of random variables is equal
to the sum of the expectations, the corresponding result does not in general hold for
variances. It does, however, turn out to hold in the special case where the random
variables are independent.
We now define the useful concept of covariance between two random variables. It
should be noted that the following definition enables us to prove the claim that the
variance of the sum of independent random variables is equal to the sum of their
variances.

Definition. The covariance of two random variables X and Y, denoted by Cov(X, Y), is
defined by

    Cov(X, Y) = E[(X − µₓ)(Y − µᵧ)],

where µₓ = E[X] and µᵧ = E[Y]. An equivalent expression is

    Cov(X, Y) = E[XY] − E[X] E[Y],

which can be obtained by expanding and then using the linearity of expectation. In
addition, it will be useful to have an expression for Var(X + Y) in terms of the individual
variances and the covariance between them. In particular, we deduce that

    Var(X + Y) = Var(X) + Var(Y) + 2 · Cov(X, Y),

which follows by expanding, simplifying and using the aforementioned linearity of ex-
pectation, i.e. E[X + Y] = µₓ + µᵧ.
The above allows us to define the correlation between two random variables, which
informally provides a normalised measure of their covariance. It is worth noting that
the correlation is known by several names including the Pearson correlation coefficient
(PCC) or the Pearson product-moment correlation coefficient (PPMCC), named after
the English mathematician Karl Pearson who presented the coefficient in 1895 after a
related idea was previously suggested by Galton (see e.g. [18]).
Definition. The correlation between two random variables X and Y, which is denoted by
Corr(X, Y), is defined by

    Corr(X, Y) = Cov(X, Y) / √(Var(X) · Var(Y)).    (8.3)
Note that the random variables X and Y are said to be positively correlated if their
covariance is strictly positive, while, these random variables are said to be negatively
correlated if their covariance is strictly negative. For completeness, note that
    0 ≤ |Corr(X, Y)| ≤ 1
holds, where | · | denotes the absolute value. The lower bound follows upon recalling
that Cov(X , Y ) = 0 holds if X and Y are independent, while, the upper bound follows
in light of the Cauchy-Schwarz inequality (see e.g. [2, Section 10.1]). Further, upon
rearranging the equality (8.3) we observe that the covariance between two random
variables is equal to the product of their correlation and the square root of the product
of their corresponding variances.
we deduce as a corollary Chebyshev's inequality. This states that the probability that
a random variable differs from its mean by more than k standard deviations is
bounded by 1/k², where the standard deviation of a random variable is defined to be
the nonnegative square root of its variance.
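Chebyshev's inequality is easy to verify empirically. A minimal sketch (plain Python; the exponential sample is an arbitrary illustrative choice) checks that the observed tail fractions respect the 1/k² bound:

```python
import random
import statistics

random.seed(7)

# Sample from an exponential distribution (mean 1) and check that the
# fraction of points more than k standard deviations from the sample
# mean never exceeds Chebyshev's bound 1/k**2.
sample = [random.expovariate(1.0) for _ in range(100_000)]
mu = statistics.fmean(sample)
sigma = statistics.pstdev(sample)

for k in (2, 3, 4):
    tail_fraction = sum(abs(x - mu) > k * sigma for x in sample) / len(sample)
    print(k, tail_fraction, 1 / k**2)
    assert tail_fraction <= 1 / k**2
```

Note that the bound is typically far from tight: for this sample the observed tail fractions are much smaller than 1/k².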
Suppose that n independent trials are to be performed, where with probability p the
result of each trial is a “success”. If X represents the number of successes which occur
within the n trials, then X is called a binomial random variable with parameters (n, p).
Its probability mass function is

    Pᵢ ≡ P{X = i} = (n choose i) pⁱ (1 − p)ⁿ⁻ⁱ,    i = 0, 1, . . . , n,    (8.4)
8.8. Some Discrete Random Variables 147
where

    (n choose i) = n! / (i! (n − i)!)

is the binomial coefficient that equals the number of different subsets of i elements that
can be chosen from a set of n elements.
A binomial (1, p) random variable is known as a Bernoulli random variable, named
after Swiss mathematician Jacob Bernoulli. Note that since a binomial (n, p) random
variable X represents the number of successes within n independent trials, each of which
with success probability p, we can perhaps unsurprisingly represent it as
    X = ∑ᵢ₌₁ⁿ Xᵢ,    (8.5)

where

    Xᵢ = 1 if the i-th trial is a success, and Xᵢ = 0 otherwise.
Now

    E[Xᵢ] = P{Xᵢ = 1} = p,
    Var(Xᵢ) = E[Xᵢ²] − (E[Xᵢ])² = E[Xᵢ] − (E[Xᵢ])²    (since Xᵢ² = Xᵢ)
            = p − p² = p(1 − p),

and hence

    E[X] = ∑ᵢ₌₁ⁿ E[Xᵢ] = np,
    Var(X) = ∑ᵢ₌₁ⁿ Var(Xᵢ)    (since the Xᵢ's are independent)
           = np(1 − p).
It is worth noting for completeness that one can generalise the binomial distribution
such that for our n independent trials, each trial now leads to a “success” for exactly
one of k possible categories, where each category has a given success probability. Such
a distribution is called the multinomial distribution and gives the probability of any par-
ticular combination of numbers of successes for the various categories. Note that if k = 2
and n = 1, we obtain the Bernoulli distribution, while if k = 2 and n ≥ 2, we obtain
the binomial distribution.
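The representation (8.5) of a binomial random variable as a sum of Bernoulli trials translates directly into a simulation. A sketch (plain Python; the parameter values n = 20, p = 0.3 are illustrative):

```python
import random
import statistics

random.seed(11)

n, p = 20, 0.3

def binomial_draw():
    """Draw a binomial (n, p) value as the number of successes in
    n independent Bernoulli(p) trials, mirroring X = X_1 + ... + X_n."""
    return sum(random.random() < p for _ in range(n))

draws = [binomial_draw() for _ in range(50_000)]
print(statistics.fmean(draws))      # close to n * p = 6.0
print(statistics.pvariance(draws))  # close to n * p * (1 - p) = 4.2
```

The empirical mean and variance match the formulas E[X] = np and Var(X) = np(1 − p) derived above.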
A random variable X that takes on one of the values 0, 1, 2, . . . is said to be a Poisson ran-
dom variable, named after French mathematician Siméon Denis Poisson, with parameter
λ, where λ > 0, if its probability mass function is

    pᵢ = P{X = i} = e^{−λ} λⁱ / i!,    i = 0, 1, . . . ,

where the symbol e denotes the famous mathematical constant (with approximate value
2.7183) which is defined by e = lim_{n→∞} (1 + 1/n)ⁿ.
Poisson random variables have a wide array of applications. One reason for this is
that such random variables may be used to approximate the distribution of the number
of successes in a large number of trials (which are either independent or at most “weakly
dependent”) when each trial has a small probability of success. In order to see why this
is the case, suppose that X is a binomial (n, p) random variable, i.e. that X represents the
number of successes in n independent trials, where each has success probability equal
to p, and let λ = np. Then (8.4) becomes

    P{X = i} = [n! / (i! (n − i)!)] pⁱ (1 − p)ⁿ⁻ⁱ
             = [n! / ((n − i)! i!)] (λ/n)ⁱ (1 − λ/n)ⁿ⁻ⁱ    (8.6)
             = [n(n − 1) · · · (n − i + 1) / nⁱ] · (λⁱ / i!) · (1 − λ/n)ⁿ / (1 − λ/n)ⁱ.
For n large and p small, (1 − λ/n)ⁿ ≈ e^{−λ}, n(n − 1) · · · (n − i + 1)/nⁱ ≈ 1 and
(1 − λ/n)ⁱ ≈ 1, hence P{X = i} ≈ e^{−λ} λⁱ / i!. Given this relationship between binomial
and Poisson random variables, it is perhaps unsurprising that for a Poisson random
variable X with parameter λ, we have

    E[X] = Var(X) = λ.
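The quality of the Poisson approximation is easy to inspect numerically. The sketch below (plain Python; the values n = 1000 and p = 0.004 are illustrative) compares the binomial pmf (8.4) with the Poisson pmf with λ = np:

```python
from math import comb, exp, factorial

def binom_pmf(i, n, p):
    """Binomial probability P{X = i} from (8.4)."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

def poisson_pmf(i, lam):
    """Poisson probability P{X = i} = e^(-lam) * lam^i / i!."""
    return exp(-lam) * lam**i / factorial(i)

n, p = 1000, 0.004   # large n, small p
lam = n * p
for i in range(6):
    print(i, round(binom_pmf(i, n, p), 5), round(poisson_pmf(i, lam), 5))
```

For these parameters the two columns agree to within a few thousandths, as the limiting argument above suggests.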
Consider independent trials, where each has success probability equal to p. If X repre-
sents the number of the first trial that is a success, then

    P{X = n} = p(1 − p)ⁿ⁻¹,    n ≥ 1.    (8.7)

Note (8.7) is obtained via independence and by observing that in order for the first
success to occur on the n-th trial, the first n − 1 trials must all be failures (where each
failure occurs with probability 1 − p) and the n-th trial a success.
A random variable with probability mass function (8.7) is called a geometric random
variable with parameter p. The mean is

    E[X] = ∑_{n=1}^{∞} n p(1 − p)ⁿ⁻¹ = 1/p,

where the final equality follows from the algebraic identity ∑_{n=1}^{∞} n xⁿ⁻¹ = 1/(1 − x)²
for 0 < x < 1. Further, in this case

    Var(X) = (1 − p)/p².
If X denotes the number of trials needed to amass a total of r successes when each trial
has independent success probability p, then X is said to be a negative binomial random
variable (or a Pascal random variable) with parameters p and r. The probability mass
function of such a random variable is given by

    P{X = n} = (n−1 choose r−1) pʳ (1 − p)ⁿ⁻ʳ,    n ≥ r.    (8.8)

Note that (8.8) is valid since in order for it to take exactly n trials to amass r successes,
the first n − 1 trials must result in exactly r − 1 successes, which occurs with probability
(n−1 choose r−1) pʳ⁻¹ (1 − p)ⁿ⁻ʳ, and then the n-th trial must be a success.
If we denote by X i with i = 1, . . . , r the number of trials needed after the (i 1)-th
success in order to obtain the i-th success, then it follows that each X i is an independent
geometric random variable with common parameter p. Since
    X = ∑ᵢ₌₁ʳ Xᵢ,
it follows that

    E[X] = E[∑ᵢ₌₁ʳ Xᵢ] = ∑ᵢ₌₁ʳ E[Xᵢ] = r/p,

and

    Var(X) = ∑ᵢ₌₁ʳ Var(Xᵢ) = r(1 − p)/p².
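The representation of a negative binomial random variable as a sum of r independent geometrics suggests a direct simulation. A sketch (plain Python; r = 5 and p = 0.25 are illustrative values):

```python
import random
import statistics

random.seed(13)

def geometric_draw(p):
    """Number of Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while random.random() >= p:
        trials += 1
    return trials

def negative_binomial_draw(r, p):
    """Trials needed to amass r successes: a sum of r independent geometrics."""
    return sum(geometric_draw(p) for _ in range(r))

r, p = 5, 0.25
draws = [negative_binomial_draw(r, p) for _ in range(20_000)]
print(statistics.fmean(draws))  # close to r/p = 20
```

The empirical mean matches E[X] = r/p; the sample variance can similarly be compared with r(1 − p)/p².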
There are certain types of random variables that frequently appear in applications. In
this section we survey some of the most widely used continuous ones.
A random variable X is said to be uniformly distributed over the interval (a, b), a < b, if
its probability density function is

    f(x) = 1/(b − a)  if a < x < b,  and  f(x) = 0  otherwise.
In particular, X is uniformly distributed over (a, b) if it places all of its mass on that
interval and it is equally likely to be “near” any point on that interval.
The mean of a uniform (a, b) random variable is

    E[X] = (1/(b − a)) ∫ₐᵇ x dx = (b² − a²)/(2(b − a)) = (b + a)/2.
It is worth emphasising that the expected value is perhaps unsurprisingly the midpoint
of the interval (a, b). Further, upon noting that
    E[X²] = (1/(b − a)) ∫ₐᵇ x² dx = (b³ − a³)/(3(b − a)) = (a² + b² + ab)/3,

we deduce

    Var(X) = (a² + b² + ab)/3 − (a² + b² + 2ab)/4 = (b − a)²/12.
The distribution function of X is given, for a < x < b, by

    F(x) = P{X ≤ x} = ∫ₐˣ 1/(b − a) dy = (x − a)/(b − a).
8.9. Some Continuous Random Variables 151
The parameters µ and σ² equal the expectation and variance of the normal, respec-
tively. In other words, we have

    E[X] = µ  and  Var(X) = σ².
An important fact about normal random variables is that if X is normal with mean
µ and variance σ², then for any constants a, b ∈ R, it follows that aX + b is normally
distributed (since it is a linear transformation of the normally distributed random vari-
able X) with mean aµ + b and variance a²σ². It follows that if X is normal with mean
µ and variance σ², then

    Z = (X − µ)/σ
is a normal with mean 0 and variance 1. Such a random variable Z is said to have
a standard (or unit) normal distribution. Let Φ denote the distribution function of a
standard normal random variable. This function is given by

    Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−y²/2} dy,    −∞ < x < ∞.
It should be noted for completeness that no elementary antiderivative for e^{−x²} exists;
however, it is possible to evaluate the definite Gaussian integral through methods from
multivariate calculus.
The observation that Z = (X − µ)/σ has a standard normal distribution provided
that X is normal with mean µ and variance σ² turns out to be very useful, since it enables
us to evaluate all probabilities related to X in terms of Φ(x). To demonstrate, observe
that the distribution function of X can be expressed as

    F(x) = P{X ≤ x}
         = P{(X − µ)/σ ≤ (x − µ)/σ}
         = P{Z ≤ (x − µ)/σ}
         = Φ((x − µ)/σ).
The value Φ(x) can be determined simply by either looking it up in a table or by writing
a computer program to approximate it.
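As noted, Φ can be approximated by a short computer program. In Python, for example, the standard identity Φ(x) = (1 + erf(x/√2))/2 relates Φ to the library error function (the helper name and numeric examples below are ours):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal distribution function via the error function:
    Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Standardisation: if X is normal with mean mu and variance sigma^2,
# then P{X <= x} = Phi((x - mu) / sigma).
mu, sigma = 10.0, 2.0
print(phi((12.0 - mu) / sigma))  # P{X <= 12} = Phi(1), about 0.8413

# The table value quoted in the text: Phi(2.33) is about 0.99.
print(phi(2.33))
```

This gives the same values one would otherwise look up in a table of Φ.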
Given any a ∈ (0, 1), let zₐ be such that a standard normal variable will exceed zₐ
with probability a, namely

    P{Z > zₐ} = a.

The value of zₐ can be obtained simply using a table of values of Φ. For example, since
Φ(2.33) = 0.99 we see that z₀.₀₁ = 2.33. In light of the symmetry of the standard normal
about zero, it follows that

    z₁₋ₐ = −zₐ.
A random variable with probability density function

    f(x) = λe^{−λx},    x > 0,

for some λ > 0 is said to be an exponential random variable with parameter (or rate) λ.
Its cumulative distribution is

    F(x) = ∫₀ˣ λe^{−λy} dy = 1 − e^{−λx},    0 < x < ∞.
The most important property of exponential random variables is that they possess
the “memoryless property”, which informally means that the probability distribution is
independent of its history. Being more precise, we say that the nonnegative random
variable X is memoryless if

    P{X > s + t | X > s} = P{X > t}    for all s, t ≥ 0.    (8.9)
In order to understand why the above is called the memoryless property, imagine that X
represents the lifetime of some unit and consider the probability that a unit of age s will
survive an additional time t. This example demonstrates that (8.9) is simply a statement
that expresses that the remaining life of some unit with age s does not depend on s.
Another useful property of exponential random variables is that they remain expo-
nential after multiplication by a positive constant. In order to show that this is indeed
the case, suppose that X is an exponential random variable with parameter λ and let c
be a positive number. Then

    P{cX ≤ x} = P{X ≤ x/c} = 1 − e^{−λx/c},

which shows that cX is an exponential random variable with parameter λ/c.
Let X₁, . . . , Xₙ be independent exponential random variables with rate parameters
λ₁, . . . , λₙ, respectively. A useful (and perhaps surprising) result is that min{X₁, . . . , Xₙ}
is exponentially distributed with rate ∑ᵢ λᵢ. It is worth noting that max{X₁, . . . , Xₙ} is not
in general exponential.
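This result is easy to check by simulation. A sketch (plain Python; the three rates are illustrative):

```python
import random
import statistics

random.seed(17)

rates = [0.5, 1.0, 1.5]   # rate parameters lambda_1, lambda_2, lambda_3

# min{X_1, X_2, X_3} should be exponential with rate 0.5 + 1.0 + 1.5 = 3.0,
# and hence have mean 1/3.
mins = [min(random.expovariate(r) for r in rates) for _ in range(100_000)]
print(statistics.fmean(mins))  # close to 1/3
```

One can similarly check that the empirical distribution of the minimum matches the CDF 1 − e^{−3x}.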
Suppose that “events” occur at random time points and denote by N(t) the number of
events that occur in the time interval [0, t]. These events are said to constitute a Poisson
process with rate λ, where λ > 0, if

(a) N(0) = 0,
(b) the numbers of events occurring in disjoint time intervals are independent,
(c) the distribution of the number of events that occur in a given interval depends only
on the length of the interval (and not on its location),
(d) lim_{h→0} P{N(h) = 1}/h = λ, and
(e) lim_{h→0} P{N(h) ≥ 2}/h = 0.
In particular, Condition (a) states the process begins at time 0. Condition (b), known
as the independent increment assumption, tells us that the number of events that occur by
time t, i.e. N(t), is independent of the number of events that occur between times t and
t + s, i.e. N(t + s) − N(t). Condition (c), which is known as the stationary increment
assumption, states that the probability distribution of N(t + s) − N(t) is the same for all
values of t. Condition (d) states that in a small interval of length h, the probability of one
event occurring is approximately λh, while Condition (e) tells us that the probability
that two or more events occur in such an interval is approximately 0.
These assumptions imply that the number of events occurring in an interval of length t
is a Poisson random variable with mean λt. In order to show this, one could consider
the interval [0, t], break it into n nonoverlapping (disjoint) subintervals of length t/n
and then consider the number of these which contain an event (see e.g. [21, pp. 29]).
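A Poisson process is also easy to simulate using the exponential interarrival times discussed immediately below. The sketch here (plain Python; λ = 2 and t = 5 are illustrative) checks that N(t) has mean and variance close to λt:

```python
import random
import statistics

random.seed(19)

lam, t = 2.0, 5.0

def count_events(lam, t):
    """Number of events of a rate-lam Poisson process in [0, t], generated
    by accumulating exponential(lam) interarrival times until they pass t."""
    clock, count = 0.0, 0
    while True:
        clock += random.expovariate(lam)
        if clock > t:
            return count
        count += 1

counts = [count_events(lam, t) for _ in range(20_000)]
print(statistics.fmean(counts))      # close to lam * t = 10
print(statistics.pvariance(counts))  # also close to 10 (Poisson: mean = variance)
```

That the empirical mean and variance agree is itself a check of the Poisson property, since mean and variance coincide for Poisson random variables.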
For a Poisson process, let the time of the first event be denoted by X₁. Further, for
n > 1, let Xₙ denote the time that elapses between the (n − 1)-th and n-th events. The
sequence {Xₙ : n = 1, 2, . . .} is called the sequence of interarrival times. The interarrival
times Xᵢ are independent and identically distributed exponential random variables
with (common) parameter λ.
Let

    Sₙ = ∑ᵢ₌₁ⁿ Xᵢ = X₁ + X₂ + · · · + Xₙ    (8.10)
denote the time of the n-th event. Observe that Sₙ will be less than or equal to t if and
only if there have been at least n events by time t, hence

    P{Sₙ ≤ t} = P{N(t) ≥ n}
              = P{N(t) = n} + P{N(t) = n + 1} + · · ·
              = ∑_{j=n}^{∞} e^{−λt} (λt)ʲ / j!,
where the final equality follows since the number of events occurring in an interval of
length t is a Poisson random variable with mean λt. Because the left-hand side of the
above equality is the cumulative distribution of Sₙ, upon differentiation, we obtain the
density function for Sₙ, denoted here by fₙ(t), which is

    fₙ(t) = λe^{−λt} (λt)ⁿ⁻¹ / (n − 1)!.
This inspires the following definition.

A random variable with probability density function

    f(t) = λe^{−λt} (λt)ⁿ⁻¹ / (n − 1)!,    t > 0,

is called a gamma random variable with parameters (n, λ).
In particular, it follows that Sₙ, namely the time of the n-th event of a Poisson process
with rate λ, is a gamma random variable with parameters (n, λ). Further, in light of
(8.10) and since the interarrival times are independent and identically distributed
exponential random variables with parameter λ, we deduce the following.

Corollary. The sum of n independent exponential random variables, each with parameter λ,
is a gamma random variable with parameters (n, λ).
From a modelling viewpoint, a major weakness of the Poisson process is the rather strong
stationary increment assumption, which tells us that events are just as likely to occur in
all intervals of equal size. A generalisation of the (standard) Poisson process, which
relaxes this assumption, leads to the nonhomogeneous or nonstationary Poisson process.

If “events” occur randomly in time and N(t) denotes the number of events that occur
by time t, then we say that {N(t) : t ≥ 0} constitutes a nonhomogeneous Poisson process
with rate (or intensity) function λ(t), where t ≥ 0, if
(a) N (0) = 0,
(b) the numbers of events that occur in disjoint time intervals are independent,
(c) lim_{h→0} P{exactly 1 event occurs between t and t + h}/h = λ(t), and
(d) lim_{h→0} P{2 or more events occur between t and t + h}/h = 0.
The quantity

    m(t) = ∫₀ᵗ λ(s) ds

is called the mean-value function. This function allows us to state that the number of
events that occur between times t and t + s, namely N(t + s) − N(t), is a Poisson random
variable with mean m(t + s) − m(t).
It should be noted that the intensity at time t, denoted by λ(t), indicates how likely
it is that an event will occur around time t. Further, if we set λ(t) = λ for all t, then
the nonhomogeneous process simply reverts to the usual Poisson process. The following
proposition is a useful result that gives us a way to interpret a nonhomogeneous Poisson
process.
Proposition. Suppose that events are occurring according to a Poisson process with rate λ
and suppose that, independently of anything that came before, an event that occurs at
time t is counted with probability p(t). Then the process of counted events constitutes a
nonhomogeneous Poisson process with intensity function λ(t) = λ · p(t).
    P{X = i} = ic,    i = 1, 2, 3, 4.

Determine Var(X).
8.10. Exercises for Self-Study 157
5. Show that for all real constants a and b, we have Var(aX + b) = a2 · Var(X ) .
6. Suppose X is a binomial random variable with parameters (n, p). Show that the
probability P{X = i} firstly increases and then decreases, reaching its maximum
value when i is the largest integer less than or equal to (n + 1) p.
7. Show if X and Y are independent binomial random variables with respective pa-
rameters (n, p) and (m, p), then X + Y is binomial with parameters (n + m, p).
8. Show that for a Poisson random variable X with parameter λ, we have

    E[X] = Var(X) = λ.

9. For a normal random variable X with parameters µ and σ², show that
   a) E[X] = µ, and
   b) Var(X) = σ².
11. Consider a Poisson process in which events occur at a rate of 0.3 per hour. What
is the probability that no events occur between 10 AM and 2 PM?
Chapter 9
Random Numbers
One of the most fundamental building blocks of a simulation study is the ability to gener-
ate random numbers. People who think about the topic of random number generation
frequently get into philosophical discussions about what the word “random” actually
means. In some sense, there is no such thing as a random number since, for example,
would you say that 11 is a random number? In light of this, we instead will talk about a
sequence of independent random numbers with a specified distribution. Informally, this
means that each number is obtained by chance, having no relation to the other numbers
in the sequence and that each number has a specified probability of falling in any given
range of values.
The construction of a random number generator may initially appear to be the kind
of thing that any good programmer can do easily. Despite this, it turns out that gen-
erating truly random numbers is not such a simple task. Historically, some options for
generating random numbers in scientific work include:
• rolling dice,
• coin flipping,
However, generating large quantities of random numbers via these approaches requires
a great deal of both time and effort. Hence, in this section
we will discuss how such numbers are computationally generated and illustrate a small
number of their uses. For our purpose, we say a random number represents the value
of a random variable that is uniformly distributed on (0, 1).
A classical approach starts with an initial value x₀, called the seed, and then recursively
computes values

    xₙ = a xₙ₋₁ modulo m,    n ≥ 1,    (9.1)

where a and m are given positive integers and the above means that xₙ takes the value
of the remainder of a xₙ₋₁ upon division by m. Note that for all values of n, each xₙ
is one of the integers {0, 1, . . . , m − 1}, and then the quantity xₙ/m, the pseudorandom
number, is taken as an approximation to the value of a uniform (0, 1) random variable.
As a guideline, it turns out that m should be chosen to be a large prime number that
can be fitted to the computer word size. For a 32-bit word machine (where the first bit
is a sign bit), it has been shown that m = 2³¹ − 1 and a = 7⁵ = 16,807 result in desirable
properties. For a 36-bit word machine, the choices m = 2³⁵ − 31 and a = 5⁵ = 3,125
appear to work well.
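The multiplicative congruential generator with the 32-bit parameters above (m = 2³¹ − 1, a = 7⁵ = 16,807) can be sketched in a few lines of plain Python (the function names are ours):

```python
M = 2**31 - 1   # a large prime: 2,147,483,647
A = 7**5        # 16,807

def make_generator(seed):
    """Return a function producing pseudorandom numbers in (0, 1) via
    x_n = A * x_{n-1} mod M, starting from the given positive seed."""
    x = seed
    def next_uniform():
        nonlocal x
        x = (A * x) % M
        return x / M
    return next_uniform

rand = make_generator(seed=12345)
print([round(rand(), 6) for _ in range(5)])
```

Because the state is a single integer and the update is deterministic, re-running with the same seed reproduces exactly the same sequence, which is often useful when debugging simulations.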
Another generator of pseudorandom numbers uses recursions of the type

    xₙ = (a xₙ₋₁ + c) modulo m,

where c is a nonnegative integer. These types of generators are called mixed congruential
generators since they involve both an additive and a multiplicative term. It is worth
noting that if c = 0, then we recover the multiplicative congruential generator (9.1).
When making use of mixed congruential generators, one may choose m equal to the
computer's word length, as this makes computing the remainder of a xₙ₋₁ + c upon
division by m quite efficient.
As our starting point for computer simulation, we suppose that we can generate a
sequence of pseudorandom numbers that can be taken as an approximation to the values
of a sequence of independent uniform (0, 1) random variables.
Suppose that we wish to compute θ = ∫₀¹ g(x) dx, and note that if U is uniformly
distributed over (0, 1), then θ = E[g(U)]. In particular, this means that we can approximate
θ, i.e. the definite integral, by generating a large number of random numbers uᵢ and
taking as our approximation the average value of g(uᵢ). This approach for approximating
integrals is called the Monte Carlo approach.
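A minimal sketch of the Monte Carlo approach in plain Python (the integrand e^{−x²}, whose antiderivative is not elementary, is our illustrative choice):

```python
import random
from math import exp

random.seed(23)

def monte_carlo_integral(g, k=200_000):
    """Approximate the integral of g over (0, 1) by averaging g(u_i)
    over k independent uniform (0, 1) random numbers u_i."""
    return sum(g(random.random()) for _ in range(k)) / k

# Integral of e^(-x^2) over (0, 1); the true value is about 0.7468.
theta = monte_carlo_integral(lambda x: exp(-x * x))
print(theta)
```

Increasing k reduces the standard error of the estimate at the usual 1/√k rate.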
It is worth noting that in previous calculus courses you will have seen that the diffi-
culty of evaluating θ using standard techniques from calculus depends significantly on
the given integrand g(x). For example, if the integrand were a “simple” polynomial, then
calculating θ would be a relatively straightforward task. In contrast, from the standard
normal distribution, it is known that no elementary antiderivative for e^{−x²} exists
and, in such a case, one would need to make use of other techniques (such as the
approach outlined above) in order to evaluate θ.
Note that in the above, the limits of integration were 0 and 1, respectively. If we
instead wanted to compute

    θ = ∫ₐᵇ g(x) dx,

then, using the substitution y = (x − a)/(b − a) and hence dy = dx/(b − a), we obtain

    θ = ∫₀¹ g(a + (b − a)y) (b − a) dy = ∫₀¹ h(y) dy,

where h(y) = (b − a) g(a + (b − a)y).
The key to estimating θ via the Monte Carlo approach is that we can similarly express
θ as

    θ = E[g(U₁, . . . , Uₙ)];

then, because the random variables g(U₁ⁱ, . . . , Uₙⁱ), where i = 1, . . . , k, are all indepen-
dent and identically distributed random variables with mean θ, we can estimate θ using

    (1/k) ∑ᵢ₌₁ᵏ g(U₁ⁱ, . . . , Uₙⁱ).
Example. (Estimating π)
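The body of this example is not reproduced above. One standard way to estimate π by simulation (not necessarily the exact method of the original example) samples points uniformly in the unit square and counts how many land inside the quarter disc, whose area is π/4:

```python
import random

random.seed(29)

def estimate_pi(k=400_000):
    """Estimate pi: the fraction of uniform points (x, y) in the unit
    square satisfying x^2 + y^2 <= 1 approximates pi / 4."""
    inside = 0
    for _ in range(k):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / k

print(estimate_pi())  # close to 3.14159
```

This is exactly the two-dimensional Monte Carlo estimator described above, with g(U₁, U₂) the indicator that (U₁, U₂) lies inside the quarter disc, scaled by 4.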
7. Let U be a uniform random variable on (0, 1). Use simulation to approximate the
correlation Corr(U, 1 − U).

8. Let U be a uniform random variable on (0, 1). Use simulation to approximate the
correlation Corr(U, √(1 − U²)).

9. Let U be a uniform random variable on (0, 1). Use simulation to approximate the
correlation Corr(U², √(1 − U²)).
Chapter 10

Generating Discrete Random Variables
Recall that a random variable that can take either a finite or at most a countable number
of possible values is said to be discrete. For a discrete random variable X , its probability
mass function p(x) is p(x) = P{X = x}, where the right-hand side represents the prob-
ability that X is exactly equal to x. Within this chapter, we introduce several approaches
to generating discrete random variables.
10.1 The Inverse Transform Method

Suppose we want to generate the value of some discrete random variable X with prob-
ability mass function

    P{X = xⱼ} = pⱼ,    j ∈ {0, 1, . . .},    where ∑ⱼ pⱼ = 1.    (10.1)
To demonstrate how we can generate the value of such a random variable, let us
consider firstly the following example.
it follows that

    P{X = xⱼ} = P{ ∑ᵢ₌₀^{j−1} pᵢ ≤ U < ∑ᵢ₌₀^{j} pᵢ } = pⱼ
holds, which implies that X has the desired distribution described by (10.1).
Remarks.
2. if the xᵢ's are ordered such that x₀ < x₁ < x₂ < · · · and if we let F(·) denote the
cumulative distribution function for X, then

    F(xₖ) = P{X ≤ xₖ} = P{X = x₀} + P{X = x₁} + · · · + P{X = xₖ} = ∑ᵢ₌₀ᵏ pᵢ.
In other words, after we generate a random number U, we can determine the value
of some discrete random variable X by finding the half-open interval [F(xⱼ₋₁), F(xⱼ))
that contains U. Equivalently, we can determine X by computing F⁻¹(U), namely by
applying the inverse of F to U. It is for this reason that this approach is known as the
discrete inverse transform method for generating X.
The first remark demonstrates that the amount of time it takes to generate a discrete
random variable using this approach is proportional to the number of intervals one must
search. In light of this, it is sometimes worthwhile to rearrange the $x_j$'s so that they appear in decreasing order of the corresponding $p_j$'s.
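The interval search just described can be sketched in Python as follows (a minimal illustration; the example values in the assertions below are hypothetical):

```python
import random

def discrete_inverse_transform(xs, ps, u=None):
    """Return the value xs[j] whose half-open interval
    [F(x_{j-1}), F(x_j)) contains the random number u."""
    if u is None:
        u = random.random()
    F = 0.0
    for x, p in zip(xs, ps):
        F += p            # running value of the distribution function
        if u < F:
            return x
    return xs[-1]         # guard against floating-point round-off in sum(ps)
```

Listing the pairs in decreasing order of $p_j$ shortens the average search, exactly as the remark above suggests.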
It should be noted that the ability to generate a random subset is particularly useful
when conducting medical trials. For example, suppose that a medical centre wishes to
test a new drug that is designed to reduce the user’s symptoms of long COVID after ex-
posure to the coronavirus infection. In order to test the effectiveness, suppose that the
medical centre has recruited 2000 volunteers to be subjects in the test. In order to take
account of the possibility that the subjects’ response to the infection could be impacted
by factors that are external to the test (such as a change in behaviour or weather condi-
tions), it has been decided to split the volunteers into two groups of size 1000, namely
a treatment group that are given the drug and a control group that will instead be given
a placebo. Further, neither the volunteers nor the administrators of the drug will be told who is in each group during the trial (for this reason the approach is known as a double-blind trial).
It now remains to decide which of the 2000 volunteers should be chosen for the treatment group. It is clear that we would want the treatment and control groups to be as similar as possible in all respects, with the exception that the members of one group receive the drug while those in the other receive a placebo. If this occurs, then it would indeed be possible to conclude that any difference in response is due to the drug. There is general agreement that the most effective approach to accomplish this is simply to choose the 1000 volunteers to be in the treatment group completely at random. In particular, the choice should be made such that each of the $\binom{2000}{1000}$ subsets of 1000 volunteers is equally likely to constitute the set of volunteers.
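A random subset with this property can be generated with a partial Fisher-Yates shuffle, sketched below (the seed and the use of Python's random module are incidental choices):

```python
import random

def random_subset(items, size, rng=random):
    """Choose `size` of the items so that every size-element subset is
    equally likely, via a partial Fisher-Yates shuffle."""
    pool = list(items)
    for i in range(size):
        # swap a uniformly chosen remaining item into position i
        j = rng.randrange(i, len(pool))
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:size]

random.seed(1)
treatment = random_subset(range(1, 2001), 1000)  # the treatment group
```

Each of the $\binom{2000}{1000}$ subsets is equally likely because every prefix of a uniformly random permutation is a uniformly random subset.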
Example. (Approximating)
The key to using the inverse transform method to generate such a Poisson random variable is to make use of the identity
$$p_{i+1} = \frac{\lambda}{i+1}\, p_i, \quad i \ge 0, \tag{10.2}$$
which follows upon rearranging
$$\frac{p_{i+1}}{p_i} = \frac{e^{-\lambda}\lambda^{i+1}/(i+1)!}{e^{-\lambda}\lambda^{i}/i!} = \frac{\lambda}{i+1}.$$
Upon using the above recursion (10.2) to compute the Poisson probabilities as they become needed, the inverse transform algorithm to generate a Poisson random variable with mean $\lambda$ can be expressed as follows. Here we use $i \in \{0, 1, 2, \ldots\}$ to denote the value currently under consideration, $p = p_i$ is the probability that X equals i and $F = F(i)$ is the probability that X is less than or equal to i. The algorithm is:
Step 5: Go to Step 3.
It should be emphasised that in the above when we write, for example, i = i + 1, we do not mean that i is equal to i + 1, but rather that the value of i should be increased by 1. Further, in order to see why the above does indeed generate a Poisson random variable with mean $\lambda$ (which recall takes on one of the values 0, 1, 2, . . .), observe that we firstly generate the random number U and then check if $U < e^{-\lambda} = p_0$. If this is indeed the case, we set X = 0. If not, in Step 4 we compute $p_1$ using (10.2). Then, we check if $U < p_0 + p_1$, where the right-hand side is the updated value of F. If this is the case, we set X = 1. If not, the process continues, checking the values 2, 3, . . . for as long as necessary.
The algorithm outlined above checks firstly whether the Poisson value is 0, then whether it is 1, then 2, and so on. Observe that the number of comparisons needed will be one more than the value generated for the Poisson. If, for example, we generated the value X = 0, then we would have compared U with F (in Step 3) once, while, if instead we generated the value X = 1, then we would have completed the comparison (from Step 3) twice. In consequence, the algorithm that has been outlined will require $1 + \lambda$ searches on average. In particular, this is fine when $\lambda$ is small; however, this approach can be greatly improved upon when $\lambda$ is large.
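The upward search just described (generate U, then accept the first i with $U < F(i)$, computing the probabilities on demand via (10.2)) can be sketched as:

```python
import math
import random

def poisson_inverse_transform(lam, u=None):
    """Inverse transform for a Poisson random variable with mean lam,
    computing probabilities on demand via p_{i+1} = lam/(i+1) * p_i."""
    if u is None:
        u = random.random()
    i = 0
    p = math.exp(-lam)   # p_0
    F = p
    while u >= F:        # accept X = i as soon as u < F(i)
        i += 1
        p *= lam / i     # recursion (10.2)
        F += p
    return i
```

The loop runs exactly $X + 1$ times, matching the $1 + \lambda$ average search count noted above.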
Because a Poisson random variable is most likely to take on one of the two integral values closest to $\lambda$, a more efficient approach is first to check one of these values rather than starting at 0. For example, let $I = \lfloor \lambda \rfloor$, namely the largest integer that is less than or equal to $\lambda$, and then use (10.2) to recursively determine $F(I)$. In order to generate a Poisson random variable X with mean $\lambda$, we firstly generate a random number U and see whether or not $X \le I$ holds by checking if $U \le F(I)$ holds. We then search downwards starting from I if $X \le I$ holds and search upwards starting from I + 1 otherwise. The number of searches needed by this algorithm is approximately one more than the absolute difference between X and its mean, which is around $1 + 0.798\sqrt{\lambda}$ on average.
where
$$\binom{n}{i} = \frac{n!}{i!\,(n-i)!}$$
is the binomial coefficient.
In order to use the inverse transform method to generate such a binomial random variable, we similarly make use of the recursive identity
$$P\{X = i+1\} = \frac{n-i}{i+1} \cdot \frac{p}{1-p} \cdot P\{X = i\}.$$
Let i denote the value currently under consideration, $pr = P\{X = i\}$ be the probability that X is equal to i and $F = F(i)$ be the probability that X is less than or equal to i. The inverse transform algorithm for generating a binomial random variable is:
Step 5: Go to Step 3.
It should be noted that the algorithm outlined checks firstly if X = 0, then whether it is 1, then 2, and so on. The number of searches needed will similarly be one more than the generated value of X and hence it will take $1 + np$ searches on average to generate X. Observe that because a binomial (n, p) random variable represents the number of successes that occur within n independent trials, where each trial has success probability p, it follows that we can also generate this random variable by subtracting from n the value of a binomial $(n, 1-p)$ random variable. This follows since each trial can be either a success (with probability p) or a failure (with probability $1-p$). For this reason, if $p > 1/2$, then a more efficient approach would be to use the outlined approach to generate a binomial $(n, 1-p)$ random variable before subtracting its value from n to obtain the desired value.
Remarks.
1. Recall that a binomial (n, p) random variable X can be interpreted as the number
of successes in n independent Bernoulli trials, where each Bernoulli trial has suc-
cess probability p. In light of this, another way to simulate X is to instead generate
the outcomes of these n Bernoulli trials.
2. Similarly to the Poisson case, when the mean np is large it will be more efficient to
determine if the generated value is less than or equal to (or greater than) I = bnpc.
In the first case, one should start the search with I and then successively search
downwards, while, in the second case, start from I + 1 and move upwards.
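The binomial inverse transform, together with the $p > 1/2$ trick just described, can be sketched as follows (the parameter values in the assertions are hypothetical):

```python
import random

def binomial_inverse_transform(n, p, u=None):
    """Inverse transform for a binomial (n, p) random variable using
    P{X = i+1} = ((n-i)/(i+1)) * (p/(1-p)) * P{X = i}."""
    if u is None:
        u = random.random()
    if p > 0.5:
        # more efficient: generate binomial (n, 1-p) and subtract from n
        return n - binomial_inverse_transform(n, 1.0 - p, u)
    c = p / (1.0 - p)
    i = 0
    pr = (1.0 - p) ** n   # P{X = 0}
    F = pr
    while u >= F:         # accept X = i as soon as u < F(i)
        pr *= c * (n - i) / (i + 1)
        i += 1
        F += pr
    return i
```

The search length is one more than the generated value, hence on average $1 + np$ iterations, or $1 + n(1-p)$ after the subtraction trick when $p > 1/2$.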
random variable Y whose probability mass function is $\{q_j\}$ before then accepting this simulated value with a probability that is proportional to $p_Y / q_Y$.
Being a little more precise, let c be a strictly positive constant such that
$$\frac{p_j}{q_j} \le c \quad \text{for all } j \text{ such that } p_j > 0$$
holds, i.e. that $p_j \le c \cdot q_j$ for all j such that $p_j > 0$. The following technique, which is called the acceptance-rejection method or the rejection method, allows us to generate a discrete random variable X with probability mass function $p_j = P\{X = j\}$ for each j. In particular, the algorithm is:
Step 4: Go to Step 1.
Informally, this algorithm simulates a random variable X with probability mass function $p_j$ by instead generating another random variable whose mass function is $q_j$, where the mass function $q_j$ is "close" to $p_j$, in the sense that the ratio $p_j / q_j$ is bounded by a constant. In practice, we would like this constant to be as close to 1 as possible, i.e. for the two mass functions $q_j$ and $p_j$ to be as similar as possible.
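The accept/reject loop can be sketched as follows (the target and proposal mass functions below are hypothetical choices on the support $\{0, 1, 2\}$):

```python
import random

def discrete_rejection(p, q, sample_q, c, rng=random.random):
    """Generate X with mass function p by repeatedly simulating Y with
    mass function q and accepting Y = j with probability p[j]/(c*q[j])."""
    while True:
        j = sample_q()                 # Step 1: simulate Y from q
        if rng() < p[j] / (c * q[j]):  # accept or reject, then retry
            return j

# Hypothetical target: p on {0, 1, 2}; proposal q uniform on {0, 1, 2}.
p = [0.5, 0.3, 0.2]
q = [1 / 3, 1 / 3, 1 / 3]
c = max(pj / qj for pj, qj in zip(p, q))   # smallest valid c, here 1.5
random.seed(2)
samples = [discrete_rejection(p, q, lambda: random.randrange(3), c)
           for _ in range(20_000)]
```

The number of iterations until acceptance is geometric with mean c, which is why we want c as close to 1 as possible.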
The power of the rejection method, an early version of which was proposed by von Neumann, will become even clearer when we consider its analogue for generating continuous random variables. We now show that the rejection method works.
Proof.
a) Let n = 100 and run your computer program to determine the proportion of
values that are equal to 1.
3. Give an efficient algorithm to simulate the value of a random variable X such that
4. A deck of 100 cards that are numbered 1, 2, . . . , 100 is shuffled and then turned
over one card at a time. We say that a “match” occurs whenever card i is the i-th
card to be turned over, where i = 1, 2, . . . , 100. Write a simulation program to
estimate the expectation and variance of the total number of matches. Run your
computer program to find estimates for the desired values and then compare these
with exact answers.
5. A pair of fair dice are continually rolled until all possible outcomes 2, 3, . . . , 12
have occurred at least once. Develop a simulation study to estimate the expected
number of dice rolls that are needed.
Chapter 11
Generating Continuous Random Variables
Further, the relationship between the cumulative distribution function $F(\cdot)$ and its probability density function $f(\cdot)$ is expressed by
$$F(a) = P\{X \in (-\infty, a)\} = \int_{-\infty}^{a} f(x)\,\mathrm{d}x.$$
Consider a continuous random variable with cumulative distribution function F (·). The
inverse transformation method provides a general method for generating continuous ran-
dom variables and is based on the following proposition.
Proposition. Let U be a uniform (0, 1) random variable. Given any continuous cumulative distribution function $F(\cdot)$, the random variable X defined by
$$X = F^{-1}(U)$$
has distribution F, where $F^{-1}(u)$ takes the value x such that $F(x) = u$ holds.
Proof.
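As a concrete instance of the proposition (a standard exponential illustration, not reproduced from the notes): for an exponential with rate $\lambda$ we have $F(x) = 1 - e^{-\lambda x}$, so $F^{-1}(u) = -\log(1-u)/\lambda$, and since $1 - U$ is itself uniform on (0, 1) we may equally use $X = -\log(U)/\lambda$.

```python
import math
import random

def exponential_inverse_transform(rate, u=None):
    """Inverse transform for an exponential with the given rate:
    F(x) = 1 - exp(-rate*x), and since 1 - U is also uniform (0, 1),
    X = -log(U)/rate has distribution F."""
    if u is None:
        u = random.random()
    return -math.log(u) / rate
```

This one-line transform underlies the Poisson-process constructions that follow.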
Remark. The example outlined provides us with an additional algorithm for generating a Poisson random variable. Recall that a Poisson process with rate $\lambda$ results when the times between successive events are independent exponential random variables with rate $\lambda$. For such a process, the number of events that occur by time 1, denoted by N(1), is Poisson distributed with mean $\lambda$. Further, we can alternatively express the number of events by time 1 in terms of the successive interarrival times of these events. In particular, if we let $X_i$ with $i \in \{1, 2, \ldots\}$ denote the successive interarrival times, then the n-th event will occur at time $\sum_{i=1}^{n} X_i$. Therefore, N(1) can be expressed as
i.e. the number of events that occur by time 1 is equal to the maximal n such that the n-th event occurs by time 1. Upon using the techniques from the example, we can generate a Poisson random variable with mean $\lambda$, denoted by $N = N(1)$, by firstly generating random numbers $U_1, U_2, \ldots, U_n$ and then setting
$$\begin{aligned}
N &= \max\left\{ n : \sum_{i=1}^{n} -\tfrac{1}{\lambda}\log(U_i) \le 1 \right\} \\
  &= \max\left\{ n : \sum_{i=1}^{n} \log(U_i) \ge -\lambda \right\} \\
  &= \max\left\{ n : \log(U_1 \cdots U_n) \ge -\lambda \right\} \\
  &= \max\left\{ n : U_1 \cdots U_n \ge e^{-\lambda} \right\}.
\end{aligned}$$
In particular, this shows one can generate a Poisson variable N with mean $\lambda$ by successively generating random numbers until their product falls below $e^{-\lambda}$ and then setting N equal to one less than the number of random numbers required, i.e.
$$N = \min\left\{ n : U_1 \cdots U_n < e^{-\lambda} \right\} - 1.$$
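The product construction just derived can be sketched directly:

```python
import math
import random

def poisson_by_products(lam, rng=random.random):
    """Set N to one less than the number of uniforms needed for their
    running product to fall below exp(-lam)."""
    threshold = math.exp(-lam)
    n, product = 0, 1.0
    while True:
        product *= rng()
        if product < threshold:
            return n       # N = (number of uniforms used) - 1
        n += 1
```

Note that the expected number of uniforms consumed is $1 + \lambda$, so this variant, like the inverse transform, is best suited to small $\lambda$.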
Recall (from Chapter 8) that the sum of n independent exponential random variables, each with parameter $\lambda$, is a gamma random variable with parameters $(n, \lambda)$. It follows that the example above allows us to further generate a gamma $(n, \lambda)$ random variable efficiently. The following example demonstrates how we do this.
It is worth noting this algorithm saves a logarithmic computation, at the cost of two multiplications and the generation of a random number, when compared with the more direct approach of generating two random numbers $U_1$ and $U_2$ and setting $X = -\log(U_1)$ and $Y = -\log(U_2)$. Similarly, in order to generate k independent exponential random variables with mean 1, we can firstly generate their sum, say $t = -\log(U_1 \cdots U_k)$, and then generate $k-1$ additional random numbers $U_1', \ldots, U_{k-1}'$, which are then ordered. If $U_{(1)}' < \cdots < U_{(k-1)}'$ denote the corresponding ordered values, then the k exponentials are
$$t\left[U_{(i)}' - U_{(i-1)}'\right], \quad i = 1, 2, \ldots, k, \quad \text{where } U_{(0)}' = 0 \text{ and } U_{(k)}' = 1.$$
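This spacing construction can be sketched as follows:

```python
import math
import random

def k_unit_exponentials(k, rng=random.random):
    """Generate k independent rate-1 exponentials with one logarithm:
    t = -log(U_1 ... U_k) is their sum, and k-1 ordered extra uniforms
    split (0, 1) into k spacings that share t out proportionally."""
    product = 1.0
    for _ in range(k):
        product *= rng()
    t = -math.log(product)
    cuts = sorted(rng() for _ in range(k - 1))
    points = [0.0] + cuts + [1.0]
    return [t * (points[i] - points[i - 1]) for i in range(1, k + 1)]
```

In total this uses $2k - 1$ random numbers and a single logarithm, rather than k logarithms.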
Let c be a constant such that
$$\frac{f(y)}{g(y)} \le c \quad \text{for all } y,$$
i.e. c must satisfy $f(y) \le c \cdot g(y)$ for all y. In particular, the algorithm for generating a random variable with probability density function f(x) using this approach is:
Step 4: Go to Step 1.
Recall that this algorithm informally simulates a random variable X with probability density function f(x) by instead generating another random variable whose density function is g(x), where the density function g(x) is "close" to f(x), in the sense that their ratio is bounded by a constant. In practice, we would like the constant to be as close to 1 as possible, meaning that the two density functions are as similar as possible; however, the constant will not take the value 1, because in that case the two densities would coincide. It is worth emphasising that the algorithm is the same as in the previously discussed discrete case, where the only difference is that we have replaced mass functions by densities. In the same way as in the discrete setting, we can prove the following.
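The continuous version of the accept/reject loop can be sketched as follows (the target density $f(x) = 2x$ on (0, 1) with a uniform proposal is a hypothetical illustration, for which $f(y)/g(y) = 2y \le 2$ so $c = 2$ works):

```python
import random

def continuous_rejection(f, g_pdf, sample_g, c, rng=random.random):
    """Generate X with density f by sampling Y from density g and
    accepting Y with probability f(Y) / (c * g(Y))."""
    while True:
        y = sample_g()                      # simulate Y from g
        if rng() <= f(y) / (c * g_pdf(y)):  # accept or reject, then retry
            return y

# Hypothetical target: f(x) = 2x on (0, 1), uniform proposal, c = 2.
random.seed(3)
xs = [continuous_rejection(lambda x: 2.0 * x, lambda x: 1.0,
                           random.random, 2.0) for _ in range(20_000)]
mean = sum(xs) / len(xs)
```

For this target the exact mean is $2/3$, so the sample mean should be close to 0.667.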
Note that during this example we generated a gamma random variable using the
acceptance-rejection approach by making use of an exponential distribution with the
same mean as the gamma. It turns out that generating a gamma random variable in
this way is always the most efficient approach (see e.g. [21, Section 5.2]), i.e. that this
approach minimises the mean number of iterations needed.
The following example demonstrates how the acceptance-rejection method allows
us to generate normal random variables.
Example. (Generating a normal random variable)
Hence, this demonstrates that the following algorithm generates an exponential with
rate 1 and an independent standard normal random variable. The algorithm is:
The random variables Z and Y generated are independent, where Z is normal with mean 0 and variance 1, while Y is exponential with rate 1. It is worth noting that if we wish to generate a normal random variable with mean $\mu$ and variance $\sigma^2$, we use $\mu + \sigma Z$.
Remarks.
1. Because $c = \sqrt{2e/\pi} \approx 1.32$, the number of steps via the above approach is geometrically distributed with mean 1.32.
3. The sign of the standard normal can be determined without the need to generate a new random number (as in Step 4). It is possible to instead use the first digit of an earlier random number in order to decide the sign.
Note that, just as we earlier generated a normal random variable by using the acceptance-rejection method based on an exponential random variable, we could alternatively simulate a normal random variable that is conditioned to lie within some interval using this method, again based on an exponential random variable.
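The exponential-based rejection step for $|Z|$ can be sketched as follows; the acceptance probability $e^{-(y-1)^2/2}$ is exactly the ratio $f(y)/(c\,g(y))$ for the half-normal target, the rate-1 exponential proposal and $c = \sqrt{2e/\pi}$ (this sketch keeps only the normal output, whereas the algorithm in the notes also recovers an independent exponential from the rejection step):

```python
import math
import random

def standard_normal_by_rejection(rng=random.random):
    """Generate a standard normal: draw |Z| by rejection from a rate-1
    exponential proposal, accepting y with probability exp(-(y-1)**2/2),
    then attach a random sign."""
    while True:
        y = -math.log(rng())                 # exponential with rate 1
        if rng() <= math.exp(-((y - 1.0) ** 2) / 2.0):
            return y if rng() < 0.5 else -y  # random sign

random.seed(4)
zs = [standard_normal_by_rejection() for _ in range(20_000)]
```

On average roughly 1.32 proposals are needed per accepted value, in line with Remark 1 above.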
Step 5: Go to Step 2.
It is worth emphasising that the final value of I via this approach will be the number of
events that occur by time T and the values S(1), S(2), . . . , S(I) will be the event times
of those events in increasing order.
Alternatively, we could simulate the first T time units of a Poisson process with rate $\lambda$ by firstly simulating N(T), i.e. the total number of events that occur by time T. Recall (from Chapter 8) that N(T) is a Poisson random variable with mean $\lambda T$ and we can therefore use the techniques from the previous chapter to generate this value. Finally, if n denotes the simulated value of N(T), then n random numbers $U_1, U_2, \ldots, U_n$ are generated and
$$\{T U_1, T U_2, \ldots, T U_n\}$$
is the set of event times by time T of the Poisson process. The preceding approach works because, conditional on N(T) = n, the unordered set of event times is distributed as a set of n independent uniform (0, T) random variables (see e.g. [21, pp. 84]). If we only desired to simulate the set of event times of the Poisson process, then the preceding approach would be more efficient than generating the exponentially distributed interarrival times. It should be noted that we would normally desire the event times to be presented in increasing order and therefore we would additionally need to order the values $T U_i$ for $i = 1, 2, \ldots, n$.
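The two-stage approach (generate N(T), then scale and sort uniforms) can be sketched as:

```python
import math
import random

def poisson_process_by_uniforms(lam, T, rng=random.random):
    """Event times on (0, T) of a rate-lam Poisson process: simulate
    N(T) ~ Poisson(lam*T) via products of uniforms, then sort N(T)
    uniform (0, T) points."""
    threshold = math.exp(-lam * T)
    n, product = 0, 1.0
    while True:                  # N(T) = one less than uniforms needed
        product *= rng()
        if product < threshold:
            break
        n += 1
    return sorted(T * rng() for _ in range(n))
```

The sort at the end is exactly the ordering step mentioned above; drop it if only the unordered set of event times is required.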
$$\lambda(t) \le \lambda \quad \text{for all } t \le T, \tag{11.1}$$
and then, by the proposition from Section 8.9 (entitled Some Continuous Random Variables), such a nonhomogeneous Poisson process can be generated by a random selection of the event times of a Poisson process with rate $\lambda$. More precisely, if an event of a Poisson process with rate $\lambda$ occurring at time t is counted (independently of anything that came before) with probability $\lambda(t)/\lambda$, then the process of counted events is a nonhomogeneous Poisson process with intensity function $\lambda(t)$, where $0 \le t \le T$. It is worth noting that since $\lambda(t)/\lambda$ is here a probability, our assumption (11.1) follows in light of the first axiom of probability (from Section 8.2). In other words, upon simulating a Poisson process and then randomly counting its events, we can generate the desired nonhomogeneous Poisson process. The algorithm is:
Step 3: Set $t = t - \frac{1}{\lambda}\log(U)$. If $t > T$, stop.
Step 6: Go to Step 2.
Note that here $\lambda(t)$ is the intensity function and $\lambda$ denotes a value satisfying (11.1). The final value of I denotes the number of events by time T and $S(1), S(2), \ldots, S(I)$ are the corresponding event times.
This procedure, which is referred to as the thinning algorithm (because it "thins" the homogeneous Poisson points), becomes more efficient when $\lambda$ is close to $\lambda(t)$ throughout the interval, as in this case we will reject a minimal number of event times. The approach can become inefficient when the intensity function $\lambda(t)$ exhibits heavy fluctuation in time.
It is possible to easily modify the thinning algorithm with the objective of mitigating excessive rejection when $\lambda(t)$ fluctuates heavily. The intuitive idea behind this extension, called piecewise thinning, is to break up the interval into subintervals and then perform standard thinning within each subinterval. Being a little more precise, the extension determines appropriate values $k$, $0 = t_0 < t_1 < t_2 < \cdots < t_k < t_{k+1} = T$ and $\lambda_1, \lambda_2, \ldots, \lambda_{k+1}$ such that
$$\lambda(s) \le \lambda_i \quad \text{whenever } t_{i-1} \le s \le t_i.$$
In order to generate the nonhomogeneous Poisson process over the interval $(t_{i-1}, t_i)$ for $i \in \{1, 2, \ldots, k+1\}$, we firstly generate exponential random variables with corresponding rate $\lambda_i$ and then accept a generated event that occurs at time $s \in (t_{i-1}, t_i)$ with probability $\lambda(s)/\lambda_i$.
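The basic (single-interval) thinning algorithm can be sketched as follows:

```python
import math
import random

def thinning(intensity, lam, T, rng=random.random):
    """Thinning: simulate a rate-lam Poisson process on (0, T) and keep
    an event at time t independently with probability intensity(t)/lam,
    where intensity(t) <= lam on (0, T)."""
    t, events = 0.0, []
    while True:
        t -= math.log(rng()) / lam          # next candidate event time
        if t > T:
            return events
        if rng() <= intensity(t) / lam:     # count with prob lam(t)/lam
            events.append(t)
```

The piecewise variant described above simply runs this loop on each subinterval $(t_{i-1}, t_i)$ with its own bound $\lambda_i$.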
1. Give a method for generating a random variable with probability density function
$$f(x) = \frac{e^x}{e - 1}, \quad 0 \le x \le 1.$$
2. Use the inverse transform method to generate a random variable having distribution function
$$F(x) = \frac{x^2 + x}{2}, \quad 0 \le x \le 1.$$
3. Show how to generate a random variable whose distribution function is
$$F(x) = \frac{1}{2}\left(x + x^2\right), \quad 0 \le x \le 1,$$
using:
Which method do you think is best for this example? Justify your answer.
$$f(x) = \frac{1}{2}(1 + x)e^{-x}, \quad 0 < x < \infty.$$
6. Write a computer program to generate the first T time units of a Poisson process with (common) rate $\lambda$.
Chapter 12
Discrete Event Simulation
The two key components in discrete event simulation are variables and events. In order
to complete a simulation, we need to continually keep track of certain variables. In
particular, in general three types of variables are often used, these are:
1.
2.
3.
When an “event” occurs, the values of the aforementioned variables are updated and
then we collect any relevant data that is of interest as output. To determine when the
next event occurs, it will be useful to maintain an “event list”, that lists the nearest future
events and when they are scheduled to occur. Upon an event “occurring”, we then reset
the time and all state and counter variables and collect the relevant data. Through this
approach, we are able to “follow” the system as it evolves over time.
It is worth noting that the above is only supposed to provide a very high-level idea
of the elements of discrete event simulation. In particular, it will be useful to look at
some examples. In Section 12.2 we consider the simulation of a single-server queuing (or waiting line) system. In Sections 12.3 and 12.4 we consider the simulation of multiple-server queuing systems, where the first section supposes that the servers are arranged in series, while the second supposes that the servers are arranged in parallel. Finally, in Section 12.5 we consider an inventory stocking model.
In all the queuing models, we will suppose that the customers arrive in accordance with a nonhomogeneous Poisson process with bounded intensity function $\lambda(t)$, where t > 0. Recall the nonhomogeneous Poisson process is a generalisation of the (standard) Poisson process, where the stationary increments assumption is relaxed, which means that the average rate of arrivals is allowed to vary with time. While simulating these queuing models, we will frequently make use of the following subroutine (or function) in order to generate the value of a random variable $T_s$, defined to equal the time of the first arrival after time s.
Let $\lambda$ be chosen such that $\lambda(t) \le \lambda$ for all t. Suppose that the intensity function $\lambda(t)$ for t > 0 and $\lambda$ are both specified; then the following subroutine generates the value of the random variable $T_s$. The subroutine is:
Step 1: Let t = s.
Step 6: Go to Step 2.
It is worth emphasising that this is essentially the algorithm demonstrated for simulating the first T time units of a nonhomogeneous Poisson process. The difference is that in this case we run the process until an event occurs and there is now no time limit.
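The subroutine can be sketched as follows (same thinning idea as before, but returning the first accepted event and imposing no horizon):

```python
import math
import random

def next_arrival(s, intensity, lam, rng=random.random):
    """Generate T_s, the time of the first arrival after time s, for a
    nonhomogeneous Poisson process with intensity(t) <= lam for all t."""
    t = s
    while True:
        t -= math.log(rng()) / lam        # next candidate event time
        if rng() <= intensity(t) / lam:   # accept with prob lam(t)/lam
            return t
```

The queuing simulations below can call this function whenever the time of the next arrival after the current clock value is needed.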
It should be noted for completeness that in this chapter we will be using discrete
event simulation in order to understand the behaviour of queues rather than using more
classical queuing theory techniques. In particular, classical queuing theory studies the
long run behaviour of the queue under simple models (such as when arrivals follow
a homogeneous Poisson process) in order to derive analytical formulae that often rely
on some steady state behaviour. In contrast, using discrete event simulation instead
allows us to simulate queuing systems both in the short and long term where there is
no guarantee of steady state (such as when arrivals occur following a nonhomogeneous
Poisson process).
• random,
• rule-based queuing.
The LIFO queue discipline could be used to model the usage of plates in a cafeteria,
where when new clean plates are available they are added to the top of an existing
stack and customers take the top one from the stack. The random queue discipline
could be used to model the usage of screws by a builder, where they reach into a box
full of parts and select one screw at random. The priority queuing discipline is used
by the National Health Service (NHS) within Accident & Emergency departments. In
particular, when patients arrive they go through a preliminary assessment that evalu-
ates their symptoms and the urgency of their medical needs. The patients with more
life-threatening conditions are given the highest priority and are then attended to im-
mediately, those with less critical but still urgent issues are placed in a second priority
group, while those with non-urgent conditions are given the lowest priority. The rule-
based queuing discipline is supposedly used by Tesla for delivering pre-ordered vehicles, where the company reportedly prioritises deliveries based on the customer's proximity to the factory, irrespective of when an order was placed during the pre-order period.
In this single-server queuing scenario, we are interested in determining different
quantities such as:
(a)
(b)
Recall (from Section 12.1) that the two key components in discrete event simulation
are variables and events. In particular, in order to do a simulation of the preceding
system we can use the following variables:
1.
2.
3.
Further, because we update the values of these variables and collect any relevant data
upon an “event” occurring, it is natural to take both arrivals and departures as these
events. Hence the event list contains the time of the next arrival and the time of the
departure of the customer currently being served. In other words, the events list EL is
EL = {t A, t D } ,
where t A is the time of the next arrival after time t and t D is the service completion
time of the customer currently being served. If no customer is being served at present,
then we set $t_D$ equal to $\infty$. In this scenario, the output variables that we collect are the
arrival time A(i) of customer i, the departure time D(i) of customer i and the time Tp
past time T that the last customer departs the system. It is worth emphasising that A(i)
and D(i) will provide us with information about average waiting times, while, Tp tells
us about the server overtime. To begin the simulation, we initialise the variables and
the event times as:
1. Set t = NA = ND = 0.
2. Set n = 0.
In order to update the system, we need to increase time (move along the time axis)
until we encounter the next event. In order to see how this is accomplished, we consider
different cases that depend upon how members of the events list EL = {t A, t D } compare.
In particular, the cases that we distinguish are:
Case 3: T = min {t A, t D , T } and n > 0, which is a departure when the time has ended
but there are still customers remaining, and
Case 4: T = min {t A, t D , T } and n = 0, which is a departure after the time has ended
and there are no customers remaining.
In the following, let Y be the random variable with probability distribution G that gives the service time of the server for one customer. We have a subroutine for each of the above cases.
Case 1: t A = min {t A, t D , T }
Step 1: Set t = t A.
Step 2: Set NA = NA + 1.
Step 3: Set n = n + 1.
Case 2: t D = min {t A, t D , T }
Step 1: Set t = t D .
Step 2: Set ND = ND + 1.
Step 3: Set n = n 1.
Step 4: If n = 0, set $t_D = \infty$ and go to Step 6.
Step 5: If n > 0, generate the random variable Y and set t D = t + Y .
Step 6: Collect the output data D(ND ) = t.
Step 1: Set t = t D .
Step 2: Set ND = ND + 1.
Step 3: Set n = n 1.
Step 4: If n > 0, generate the random variable Y and set t D = t + Y .
Step 5: Collect the output data D(ND ) = t.
Further, in order to estimate the average time that a customer spends in the system, we
run the simulation K times and take averages. In a similar fashion, to estimate the mean
time past T that the last customer departs, we simply run the simulation K times and
then take averages over all values Tp .
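A compact sketch of a single run, assuming homogeneous rate-$\lambda$ Poisson arrivals (the notes allow a nonhomogeneous process via the $T_s$ subroutine) and a user-supplied service-time sampler, following the four cases above:

```python
import math
import random

def single_server_queue(lam, service, T, rng=random.random):
    """Discrete event simulation of a single-server FIFO queue with
    rate-lam Poisson arrivals on (0, T).  Returns arrival times A,
    departure times D and the overtime Tp past T of the last departure."""
    INF = float("inf")
    t, n = 0.0, 0                      # clock and number in system
    tA = -math.log(rng()) / lam        # time of next arrival
    tD = INF                           # next service completion
    A, D = [], []
    while True:
        if tA <= tD and tA <= T:       # Case 1: an arrival
            t, n = tA, n + 1
            tA = t - math.log(rng()) / lam
            if n == 1:                 # server was idle: start service
                tD = t + service()
            A.append(t)
        elif tD <= min(tA, T):         # Case 2: a departure before T
            t, n = tD, n - 1
            tD = t + service() if n > 0 else INF
            D.append(t)
        elif tD < INF:                 # Case 3: time ended, clear queue
            t, n = tD, n - 1
            tD = t + service() if n > 0 else INF
            D.append(t)
        else:                          # Case 4: empty system, stop
            return A, D, max(t - T, 0.0)
```

Running this K times and averaging the waiting times $D(i) - A(i)$ and the overtimes Tp gives the estimates described above.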
Figure 12.1: Simulating the Single Server Queue [21, Chapter 7].
first server and upon completion of service the customer goes to server 2. This type of system is called sequential or a tandem queuing system. Upon arrival a customer either enters service with the first server if that server is free, or joins a queue if the server is busy. In a similar fashion, when the customer has been served by the first server they either enter service with the second server if it is free, or join its queue. After being served by the second server, the customer then departs the system. If there are customers in the queue, they are served in order of which customer has been waiting the longest. The service times for server i, where $i \in \{1, 2\}$, have corresponding distribution $G_i$. This is illustrated in Figure 12.2.
Analogously to the previous section, suppose now that we are interested in using simulation to study the distribution of times that a customer spends at servers 1 and 2. In particular, in order to do a simulation of the preceding system we will use the following variables:
1.
2.
3.
In this case, since the system now features two servers we must therefore amend our
previous event list to include the corresponding completion times for each server.
In particular, the event list contains the time of the next arrival, the time of service
completion for the first server and the time of service completion for the second server,
i.e. the time of departure from system. In other words, the events list EL is
EL = {t A, t 1 , t 2 } ,
where t A denotes the time of the next arrival after time t and t i denotes the service
completion time of the customer presently being served by server i, where $i \in \{1, 2\}$. If no customer is presently with server i, then we set $t_i$ equal to $\infty$. In this scenario, the output variables collected are the arrival time $A_1(n)$ of customer n at the first server, where $n \ge 1$, the arrival time $A_2(n)$ of customer n at the second server and the departure time D(n) of customer n.
customer n. It is worth emphasising that these variables give us information about the
time spent with each server and additionally the total time spent in the system.
To begin the simulation, we initialise the variables and the event times as:
1. Set t = NA = ND = 0.
Similarly, in order to update the system we increase time until we encounter the next
event. We consider different cases that depend upon which member of the events list
$EL = \{t_A, t_1, t_2\}$ is smallest. In particular, the first case $t_A = \min\{t_A, t_1, t_2\}$ is an arrival, the second case $t_1 = \min\{t_A, t_1, t_2\}$ is a departure from the first server and the third case $t_2 = \min\{t_A, t_1, t_2\}$ is a departure from the second server (and hence a departure from
the system). It is worth noting that we do not specify a stopping rule in the following pseudocode; however, when we write the script we will make use of the same rule as in the single server case.
In the following, denote by Yi the random variable with corresponding probability
distribution Gi that gives the service time of the i-th server, where i = 1, 2. We have the
following subroutines for each case.
Case 1: t A = min {t A, t 1 , t 2 }
Step 1: Set t = t A.
Step 2: Set NA = NA + 1.
Step 3: Set n1 = n1 + 1.
Case 2: t 1 = min {t A, t 1 , t 2 }
Step 1: Set t = t 1 .
Case 3: t 2 = min {t A, t 1 , t 2 }
Step 1: Set t = t 2 .
Step 2: Set ND = ND + 1.
Step 3: Set n2 = n2 1.
The above allows us to update the system during the simulation process and collect the
relevant data as explained in the previous section.
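An equivalent compact sketch of the tandem system, using the FIFO waiting-time recursion rather than the full event list (valid here because customers depart server 1 in arrival order), assuming homogeneous rate-$\lambda$ arrivals and a fixed number of customers:

```python
import math
import random

def tandem_queue(lam, service1, service2, customers, rng=random.random):
    """Two servers in series (FIFO at both).  Returns the arrival times
    A1 at server 1, the arrival times A2 at server 2 and the departure
    times D from the system."""
    A1, A2, D = [], [], []
    t = 0.0
    free1 = free2 = 0.0    # times at which each server next becomes free
    for _ in range(customers):
        t -= math.log(rng()) / lam       # rate-lam Poisson arrivals
        A1.append(t)
        free1 = max(t, free1) + service1()
        A2.append(free1)                 # moves to server 2 on completion
        free2 = max(free1, free2) + service2()
        D.append(free2)
    return A1, A2, D
```

The differences $A_2(n) - A_1(n)$ and $D(n) - A_2(n)$ then give the time customer n spends at servers 1 and 2, respectively.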
It is worth emphasising that we assume that if both servers are idle and there is a new arrival, then that customer goes to the first server. Suppose similarly that we are interested in using simulation to study the distribution of times that a customer spends in the system and the number of services performed by each server. An important observation is that since there are multiple servers, the order in which customers
depart the system will not necessarily coincide with the order of arrivals. This means
that customers cannot be labelled as before and additionally in order to know which
customer is departing the system, we must formally keep track of which customers are
in the system.
Because customers arrive and join a single queue if both servers are busy, the natural
choice is simply to label the customers as they arrive. In particular, let the first arrival be
customer number one, the next be customer number two, and so on. In order to identify
which customers are waiting it is sufficient to know which customers are currently being
served and also the number that are waiting in the queue. More formally, let us suppose that customers i and j are being served, that customer i arrived first, i.e. that i < j, and that the queue is nonempty, i.e. that n − 2 > 0, where n denotes the number of customers in the system. Notice that all customers with numbers strictly less than j would have entered service before j and all customers with numbers strictly greater than j could not have completed service. In light of this, it follows immediately that customers j + 1, j + 2, . . . , j + n − 2 are currently waiting in the queue.
Recall that here we are interested in using simulation to study the distribution of the times that customers spend in the system and the number of services performed by each server. In order to analyse the preceding system we will make use of the following variables:
1. time variable t,
2. counter variables: the number of arrivals N_A by time t and the numbers of services C_1 and C_2 completed by the first and second server, respectively, by time t, and
3. system state variable: the triple (n, i_1, i_2), where n denotes the number of customers in the system and i_1 and i_2 denote the customers presently being served by the first and second server, respectively (with the convention that i_1 = 0 or i_2 = 0 when the corresponding server is idle).
It is worth emphasising that if the system state triple is (0, 0, 0) , then the whole system
is empty. If instead the triple is (1, j, 0) or (1, 0, j) , then the only customer is j and they
are being served by the first or second server, respectively.
Similarly to the previous section, the system features two servers and as such the event list is defined as before. In particular, the event list EL is
EL = {t_A, t_1, t_2},
where t_A denotes the time of the next arrival after time t and t_i denotes the service completion time of the customer presently being served by server i, where i ∈ {1, 2}. If no customer is presently with server i, then we set t_i equal to ∞. In this scenario, the output variables are the arrival time A(n) of customer n, where n ≥ 1, and the departure time D(n) of customer n. It is worth noting that the output variables are different to those from the previous section since here we only have one arrival time.
To begin the simulation, we initialise the variables and the event times as:
1. Set t = NA = C1 = C2 = 0.
Similarly, we advance time until we encounter the next event and consider different cases that depend upon which member of EL = {t_A, t_1, t_2} is smallest. In particular, the first case is an arrival, the second case is a departure from the first server and the third case is a departure from the second server. Let Y_i be the random variable with corresponding probability distribution G_i that gives the service time of server i, where i = 1, 2. We have the following subroutines for each case.
Step 1: Set t = t_A.
Step 2: Set N_A = N_A + 1.
Step 1: Set t = t_1.
Step 2: Set C_1 = C_1 + 1.
The procedure for the third case is left as an exercise, namely Exercise 4. The above allows us to update the system during the simulation process, where we stop this process at some predetermined termination point. Then the output variables A(n) and D(n) and the counting variables C_1 and C_2 enable us to obtain data on the arrival and departure times of the customers and the number of services performed by each server.
12.5 An Inventory Model
1. time variable t,
2. counter variables: the total amount C of ordering costs by time t, the total amount
H of inventory holding costs by time t and the total amount R of revenue earned
by time t, and
3. system state variable: the pair (x, y) , where x denotes the inventory on hand and
y is the amount on order from the supplier.
In this case, the events will either be a customer arriving or an order being delivered/completed. Hence, our event list EL is
EL = {t_0, t_1},
where t_0 denotes the time of the next customer arrival and t_1 is the time at which the order that is being filled by the supplier will be delivered. If no orders from the supplier are outstanding (i.e. none are yet to be delivered), then we set t_1 = ∞.
We can here run the simulation until the first event occurs after some large preassigned time T and use the expression
(R − C − H)/T
in order to estimate the average profit per unit time. It should be noted that doing this while varying the values of s and S would allow us to determine a good inventory ordering policy for the store.
To begin the simulation, we suppose that there is an initial inventory of size I and
initialise the variables and the event times as:
1. Set t = C = H = R = 0.
4. Set t_1 = ∞.
Note that we set t_0 = t − (1/λ) log(U) since we assume that customers arrive in accordance with a Poisson process with rate λ. We once more advance time until we encounter the next event and then consider different cases. Ignoring the predetermined time T for simplicity, we have only two cases, where the first case t_0 < t_1 is a customer arrival and the second case t_0 ≥ t_1 is a supplier order completion. If we are at time t, then we move along in time using the following subroutines for each case.
Case 1: t_0 < t_1
Step 2: Set t = t_0.
Step 3: Generate the random variable D, the demand of the arriving customer, following probability distribution G.
Step 6: Set x = x − w.
Case 2: t_0 ≥ t_1
It is worth noting that in Step 5 of the second case we assumed that when an order of
size y is delivered, the total inventory level is now no less than s, which means that
no additional order is then placed. It is possible to guarantee this is the case by simply
assuming that y > s holds, which can be ensured by assuming that S ≥ 2s.
The above allows us to update the system during the simulation process, which
enables us to provide useful information to the store owner about balancing costs under
the aforementioned assumptions.
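The (s, S) policy just described can be sketched in a few lines of Python. The cost structure below (a fixed cost K plus c per unit ordered, a selling price per unit, a holding cost h per unit of stock per unit time, a fixed delivery lag L and a simple demand distribution) is an illustrative assumption, since the notes leave these details unspecified:

```python
import math
import random

def inventory_sim(lam, s, S, T, price=2.0, K=10.0, c=1.0, h=0.1, L=1.0,
                  demand=lambda rng: rng.randint(1, 4), seed=0):
    """(s, S) inventory simulation: order up to S whenever the stock x drops
    below s with nothing on order; returns the estimate (R - C - H)/T."""
    rng = random.Random(seed)
    INF = float("inf")
    t, C, H, R = 0.0, 0.0, 0.0, 0.0
    x, y = S, 0                                  # on hand / on order (I = S here)
    t0 = -math.log(1.0 - rng.random()) / lam     # next customer arrival
    t1 = INF                                     # next delivery (none pending)
    while min(t0, t1) < T:
        tn = min(t0, t1)
        H += x * h * (tn - t)                    # accrue holding costs
        t = tn
        if t0 < t1:                              # Case 1: customer arrival
            D = demand(rng)                      # demand D ~ G
            w = min(D, x)                        # amount actually sold
            R += price * w
            x -= w
            t0 = t - math.log(1.0 - rng.random()) / lam
        else:                                    # Case 2: order delivered
            x += y
            y = 0
            t1 = INF
        if x < s and y == 0:                     # (s, S) ordering policy
            y = S - x
            C += K + c * y
            t1 = t + L
    return (R - C - H) / T

avg_profit = inventory_sim(lam=2.0, s=5, S=15, T=200.0)
```

Varying s and S over a grid and comparing the returned averages is then a direct way of searching for a good ordering policy, as suggested above.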
12.6 Exercises for Self-Study
2. Suppose in the model presented in Section 12.2 that we are additionally interested in obtaining information about the amount of time a server would be idle in a day. Explain how this could be accomplished.
3. Suppose that jobs arrive at a single server queuing system according to a nonho-
mogeneous Poisson process, whose rate is initially 4 per hour, increases steadily
until it hits 19 per hour after 5 hours, before then decreasing steadily until it hits
4 per hour after an additional 5 hours. The rate then repeats indefinitely in this fashion, i.e. λ(t + 10) = λ(t) holds for all t ≥ 0. Suppose that the service distribution is exponential with rate 25 per hour. Suppose also that whenever the
server completes a service and finds no jobs waiting they go on a break for a time
that is uniformly distributed on (0, 0.3). If upon returning from their break there
are no jobs waiting, then they go on another break.
Use simulation to estimate the amount of time that the server is on break during
the first 100 hours of operation. Perform 500 simulation runs.
4. Complete the updating scheme for Case 3 in the model presented in Section 12.4.
5. In the model presented in Section 12.4, suppose that G1 is the exponential distri-
bution with rate 4 and that G2 is exponential with rate 3. Suppose further that
the arrivals occur in accordance with a Poisson process with rate 6. Write a simulation program to generate data corresponding to the first 1000 arrivals. Use this
to simulate
Perform now a second simulation of the first 1000 arrivals and use this to once
more answer parts a) and b). Compare your answers to the ones you obtained
previously.
6. Suppose in the two-server parallel model presented in Section 12.4 that each server now has their own queue and that upon arrival a customer joins the shorter queue. If an arrival finds that both queues are of the same size (or finds that both servers are empty), then they go to server 1.
a) Determine the appropriate variables and events to analyse this model and
give the updating procedure.
b) Using the same distributions and parameters as in Exercise 5, find the aver-
age time spent in the system by the first 1000 customers.
c) Using the same distributions and parameters as in Exercise 5, find the pro-
portion of the first 1000 services that are performed by the first server.
Chapter 13
Statistical Analysis of Simulated Data
Usually one is motivated to undertake a simulation study in order to determine the value of some quantity (or quantities), denoted here by θ, that is inherently connected with some underlying probabilistic model. Being a little more precise, a simulation of some given system results in output data X whose expected value is the aforementioned quantity of interest θ. We then undertake a second simulation run, which provides a new and independent random variable with mean θ. This is repeated until we have amassed n total runs and, in particular, the n independent and identically distributed random variables X_1, X_2, . . . , X_n, which all have mean θ. It is then possible to take the average of these values, namely calculate
X̄ = (1/n) ∑_{i=1}^n X_i.
Recall, for example, the quantity
(1/n) ∑_{i=1}^n T_p^i,
which was used as an estimator for the random variable T_p. The following two natural questions arise out of what is outlined above:
2. What value of n should be chosen? In other words, how many times should one
run the simulation?
13.1 The Sample Mean and Sample Variance
Let
θ = E[X_i] and σ² = Var(X_i)
denote the population mean and population variance of the X_i's. The quantity
X̄ = (1/n) ∑_{i=1}^n X_i,
namely the arithmetic mean of the n values, is called the sample mean. It is worth emphasising that the sample mean is simply the average value of a sample (i.e. a subset with possible duplicates) of numbers which are taken from some larger population of numbers. In the scenario when the population mean θ is unknown, we make use of the sample mean to estimate it.
Observe that
E[X̄] = E[(1/n) ∑_{i=1}^n X_i] = (1/n) ∑_{i=1}^n E[X_i] = θ, (13.1)
where the second equality follows since expectation is a linear operation (as shown in Chapter 8). In particular, (13.1) demonstrates that the sample mean X̄ is an unbiased estimator of the population mean θ, where an estimator is said to be unbiased if the difference between the estimator's expected value and the true value of the parameter being estimated is zero.
In order to determine the "worth" of the sample mean X̄ as an estimator of the population mean θ, we consider its mean squared error, which is defined as the expected value of the squared difference between X̄ and θ, namely E[(X̄ − θ)²].
It is worth noting that squaring the differences eliminates negative values for the differences and hence ensures that the mean squared error is always greater than or equal to zero. Further, the mean squared error is almost always strictly positive (rather than zero) because of randomness. In addition, squaring increases the impact of larger errors (differences), which in fact turns out to be a favourable property.
Notice that
Var(X̄) = Var((1/n) ∑_{i=1}^n X_i) = (1/n²) ∑_{i=1}^n Var(X_i) = σ²/n, (13.2)
and hence, by Chebyshev's inequality,
P{|X̄ − θ| > c σ/√n} ≤ 1/c²,
which means that the probability that the sample mean is more than c standard deviations from the population mean is no greater than 1/c². For example, this bound tells us that the probability that the sample mean differs from the population mean θ by more than 1.96 standard deviations, i.e. when c = 1.96, is no more than 1/(1.96)² ≈ 0.2603.
This rather conservative bound can be drastically improved upon when the value of n is large, which usually is the case when running simulations. In particular, if n is large, then since the X_i's are independent and identically distributed random variables by assumption, we can apply the central limit theorem, which tells us that √n (X̄ − θ)/σ is approximately distributed as a standard normal variable and therefore
P{|X̄ − θ| > c σ/√n} ≈ P{|Z| > c}, where Z is a standard normal,
= 2(1 − Φ(c)), by symmetry of the standard normal,
where Φ denotes the standard normal distribution function (from Chapter 8). For example, the probability that the sample mean differs from θ by more than 1.96 standard deviations is approximately 0.05, which follows as Φ(1.96) ≈ 0.975. This means we can
be approximately 95% certain that the sample mean does not differ from the popula-
tion mean by more than 1.96 standard deviations. It is worth emphasising that 0.05 is
indeed much stronger than the conservative bound of 0.2603 that was deduced using
Chebyshev’s inequality.
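The two bounds can be compared numerically. The snippet below evaluates Φ through the standard error function, which is related to the normal distribution function by Φ(c) = (1 + erf(c/√2))/2:

```python
import math

def phi(c):
    """Standard normal distribution function Φ(c)."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

c = 1.96
chebyshev = 1.0 / c ** 2         # bound from Chebyshev's inequality
clt = 2.0 * (1.0 - phi(c))       # approximation from the central limit theorem
# chebyshev ≈ 0.2603 while clt ≈ 0.05, as claimed above
```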
It is likely that this sounds very promising since the above argument suggests that, provided the quantity σ/√n (or σ²/n) is small, the sample mean will be a good estimator for the population mean. The natural difficulty with using this value as an indicator of how accurately the sample mean X̄ of n values estimates the population mean is that the population variance σ² is not usually known in advance. Hence, we need to estimate its value. Recall by definition that
σ² = E[(X − θ)²]
is the average of the squared difference between the random variable X and its (unknown) mean. In light of this, it is perhaps natural that, when we wish to make use of the sample mean X̄ as the estimator of the population mean, a natural estimator for σ² would instead be to take the average of the squared distances between the X_i's and the estimated mean X̄, i.e. by using ∑_{i=1}^n (X_i − X̄)²/n. For technical reasons, and in order to make the estimator unbiased, we instead prefer to divide the sum of squares by n − 1 rather than n. Informally, observing that the sum of deviations ∑_{i=1}^n (X_i − X̄) equals zero, as shown by the equalities
∑_{i=1}^n (X_i − X̄) = ∑_{i=1}^n X_i − ∑_{i=1}^n X̄
= ∑_{i=1}^n X_i − n X̄ (13.3)
= ∑_{i=1}^n X_i − n (∑_{i=1}^n X_i)/n = 0
implies that only n − 1 of these deviations are needed in order to determine all the deviations (since they have the property that they must sum to zero). In particular, this argument means that there are only n − 1 "degrees of freedom" in our sample variance sum. This informal argument inspires the following definition.
Definition. The sample variance, denoted by S², is defined as
S² = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄)².
Using the identity
∑_{i=1}^n (X_i − X̄)² = ∑_{i=1}^n X_i² − n X̄², (13.4)
which follows in light of the equality (13.3), we show that the sample variance is an unbiased estimator of the population variance σ². In particular, we now prove the following proposition.
Proposition. The sample variance S² is an unbiased estimator of the population variance σ², i.e. E[S²] = σ².
To prove this proposition it is useful to firstly recall (from Chapter 8) that for all random variables Y, we have
E[Y²] = Var(Y) + (E[Y])².
Proof.
The above tells us that we can use the sample variance S² as our estimator of the population variance σ². The so-called sample standard deviation, S = √(S²), is used as our estimator of σ. Further, we use S/√n as an estimator for σ/√n, namely for the standard deviation of X̄.
Consider now the second natural question, namely when should we stop generating extra data values? Suppose for this purpose that, as in a simulation, we have the option to continually generate additional data values X_i as needed. Further, suppose the quantity that we are interested in estimating is the population mean θ = E[X_i]. Intuitively, we will require a sufficiently large number of data values to allow the central limit theorem to apply; however, when our estimate is "good enough", in the sense that it is not too far away from the quantity of interest, we can stop generating additional values.
Being a little more precise, we firstly choose an acceptable value d for the standard deviation of our estimator. If d is the standard deviation of the estimator X̄, then recall that we can, for example, be 95% certain that X̄ will not differ from θ by more than 1.96d, provided the central limit theorem applies. We should then continue to generate new data until we have generated n data values for which our estimate of the standard deviation of X̄, namely S/√n, is less than our accepted value d. It is worth emphasising that in order for the sample standard deviation S to be a good estimator of the population standard deviation σ, in general we require the sample to be sufficiently large. In light of this, the following procedure can be used to determine when to stop generating additional data values:
Step 1: Choose an acceptable value d for the standard deviation of the estimator.
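In Python, the procedure might be sketched as follows, where the initial batch of at least 100 values is an illustrative convention (consistent with the exercises at the end of this chapter) and the sample mean and sample variance are naïvely recomputed from scratch at every step:

```python
import math
import random

def estimate_until(gen, d, n_min=100):
    """Generate values from gen() until the estimated standard deviation of
    the sample mean, S/sqrt(n), drops below d (using at least n_min values)."""
    data = [gen() for _ in range(n_min)]
    while True:
        n = len(data)
        mean = sum(data) / n
        s2 = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance
        if math.sqrt(s2 / n) < d:
            return mean, n
        data.append(gen())          # otherwise, generate one more value

rng = random.Random(1)
mean, n = estimate_until(rng.random, d=0.01)  # illustrative: X_i ~ Uniform(0, 1)
```

The naïve recomputation of S at every step is exactly the inefficiency that the recursive formulas discussed later in this section are designed to avoid.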
The following example demonstrates how one would decide when to stop generating
values in the setting when we once more work with a single server queuing model in
order to estimate the time that the last customer departs the system.
Example. (Estimating the expected time the last customer leaves the system)
Notice that in the previous procedure we need to compute the sample standard
deviation S at each iteration. In order to calculate S, one may naïvely recompute S
from scratch each time a new value is generated. In order to improve the efficiency of
the approach, it would be favourable if we found a method for recursively computing
successive sample means and sample variances. For this purpose, consider the sequence of data values X_1, X_2, . . . and denote by X̄_j and S²_j the sample mean and sample variance of the first j observations, respectively. In other words, let
X̄_j = (1/j) ∑_{i=1}^j X_i
and
S²_j = (1/(j − 1)) ∑_{i=1}^j (X_i − X̄_j)², where j ≥ 2.
These expressions allow us to deduce the following recursions via simple algebraic manipulation. Let S²_1 = 0 and X̄_0 = 0; then
X̄_{j+1} = X̄_j + (X_{j+1} − X̄_j)/(j + 1), and
S²_{j+1} = (1 − 1/j) S²_j + (j + 1)(X̄_{j+1} − X̄_j)². (13.5)
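A direct implementation of the recursions (13.5) might look as follows; the data set at the end is purely illustrative:

```python
import statistics

def running_mean_var(data):
    """Compute the sample mean and sample variance via the recursions (13.5)."""
    mean, s2 = 0.0, 0.0                    # X̄_0 = 0 and S²_1 = 0
    for j, x in enumerate(data):           # j values have been seen so far
        old_mean = mean
        mean += (x - mean) / (j + 1)       # X̄_{j+1} = X̄_j + (X_{j+1} − X̄_j)/(j+1)
        if j >= 1:
            s2 = (1.0 - 1.0 / j) * s2 + (j + 1) * (mean - old_mean) ** 2
    return mean, s2

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, s2 = running_mean_var(data)
```

Since the recursions are exact, the returned values agree (up to rounding) with a direct computation of the sample mean and sample variance.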
Example. (Recursion)
This analysis is modified in the scenario when the X_i's are Bernoulli (or 0,1) random variables, as would be the case when we are estimating some probability. Suppose we can generate Bernoulli random variables X_i such that
X_i = 1 with probability p, and X_i = 0 with probability 1 − p.
Suppose further that we wish to estimate the expected value of X_i, which (from Chapter 8) we know is given by
E[X_i] = P{X_i = 1} = p.
Since
Var(X_i) = p(1 − p),
it follows that there is no need to use the sample variance to estimate Var(X_i). Being more precise, observe that if we have generated the n values X_1, X_2, . . . , X_n, then, since the estimate of E[X_i] = p is once more given by the sample mean
X̄_n = (1/n) ∑_{i=1}^n X_i,
the natural estimator of Var(X_i) is
X̄_n (1 − X̄_n).
In turn, the estimator of Var(X̄_n) = Var(X_i)/n is
(1/n) X̄_n (1 − X̄_n),
where taking the nonnegative square root yields the estimator of the standard deviation.
In light of this, the following procedure can be used to determine when to stop
generating additional Bernoulli random variables:
Step 1: Choose an acceptable value d for the standard deviation of the estimator.
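The Bernoulli version of the stopping rule can be sketched as follows; the minimum of 100 initial values and the particular probability being estimated, P{U_1 + U_2 > 1} = 1/2, are illustrative choices:

```python
import math
import random

def estimate_probability(trial, d, n_min=100, seed=2):
    """Generate Bernoulli values until the estimated standard deviation of the
    sample mean, sqrt(p_hat(1 - p_hat)/n), drops below d (n >= n_min)."""
    rng = random.Random(seed)
    total, n = 0, 0
    while True:
        total += trial(rng)
        n += 1
        p_hat = total / n
        if n >= n_min and math.sqrt(p_hat * (1.0 - p_hat) / n) < d:
            return p_hat, n

# Illustrative trial: indicator of the event {U_1 + U_2 > 1}.
p_hat, n = estimate_probability(lambda r: int(r.random() + r.random() > 1), d=0.01)
```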
Suppose once more that X_1, X_2, . . . , X_n are independent and identically distributed random variables with mean θ and variance σ². The previous section argues that the sample mean X̄ = (1/n) ∑_{i=1}^n X_i is an effective estimator of the population mean θ. Despite this, it should be emphasised that we should not expect X̄ to equal θ, but rather that they are in some sense "close". It is sometimes valuable to be able to formally quantify this notion of "closeness", by which we explicitly specify an interval within which we have a certain degree of confidence that the population mean θ lies.
To find such an interval of confidence we require the approximate distribution of the estimator X̄. Recall, for this purpose, that (13.1) and (13.2) show that
E[X̄] = θ and Var(X̄) = σ²/n
and, in light of the central limit theorem, for large n we deduce that
√n (X̄ − θ)/σ ≈ N(0, 1),
where ≈ N(0, 1) means "is approximately distributed as a standard normal". If we additionally replace the (unknown) population standard deviation σ by its estimator, the sample standard deviation S, then the resulting quantity remains approximately a standard normal by Slutsky's theorem (see e.g. [15, Chapter 3]). In other words, if n is large, then
√n (X̄ − θ)/S ≈ N(0, 1). (13.6)
For any α ∈ (0, 1), let z_α be such that a standard normal variable Z will exceed z_α with probability α, namely
P{Z > z_α} = α.
Recall (from Chapter 8) that the value z_α could be obtained using, for example, a table of values for the distribution function of a standard normal random variable. In light of the symmetry of the standard normal density function about zero, it follows that
z_{1−α} = −z_α,
where z_{1−α} is the point to the right of which the area under the standard normal density is equal to 1 − α. Further, it follows that
P{X̄ − z_{α/2} S/√n < θ < X̄ + z_{α/2} S/√n} ≈ 1 − α. (13.7)
In other words, (13.7) tells us that with probability 1 − α, the population mean θ will lie within the region X̄ ± z_{α/2} S/√n about the sample mean X̄.
The above inspires the following definition of an approximate 100(1 − α) percent confidence interval estimate of the population mean θ.
Definition. If the observed values of the sample mean and sample standard deviation are X̄ = x̄ and S = s, then we call the interval x̄ ± z_{α/2} s/√n an approximate 100(1 − α) percent confidence interval estimate of θ.
Step 4: If x̄ and s are the observed values of X̄ and S, then the 100(1 − α) percent confidence interval estimate of θ, whose length is less than l, is x̄ ± z_{α/2} s/√k.
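The definition can be turned into a short routine. The following sketch obtains z_{α/2} from the inverse of the standard normal distribution function; the data set is illustrative:

```python
import math
import statistics

def confidence_interval(data, alpha=0.05):
    """Approximate 100(1 - alpha) percent confidence interval x̄ ± z_{α/2} s/√n."""
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)                            # sample standard deviation
    z = statistics.NormalDist().inv_cdf(1.0 - alpha / 2)  # z_{α/2}
    half_width = z * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

data = [5.1, 4.9, 5.0, 5.2, 4.8, 5.3, 4.7, 5.0, 5.1, 4.9]
lo, hi = confidence_interval(data)   # an approximate 95% interval about x̄ = 5.0
```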
As was previously noted, in the case when the X_i's are Bernoulli (or 0,1) random variables, the analysis is modified. More precisely, suppose X_1, X_2, . . . , X_n are Bernoulli random variables such that
X_i = 1 with probability p, and X_i = 0 with probability 1 − p.
Recall that E[X_i] = p and that Var(X_i) can be approximated using X̄(1 − X̄). It follows that when n is large, the analogous statement to (13.6) is
√n (X̄ − p)/√(X̄(1 − X̄)) ≈ N(0, 1). (13.8)
For any α ∈ (0, 1), we therefore have
P{X̄ − z_{α/2} √(X̄(1 − X̄)/n) < p < X̄ + z_{α/2} √(X̄(1 − X̄)/n)} ≈ 1 − α.
In particular, if the observed value of the sample mean X̄ is denoted by p_n, we say that the 100(1 − α) percent confidence interval estimate of the expected value p is
p_n ± z_{α/2} √(p_n(1 − p_n)/n).
2. Give a probabilistic proof of the result of the previous exercise. This can be
achieved by letting X denote a random variable that is equally likely to take any
of the n values before applying suitable algebraic identities from Chapter 8.
3. Write a computer program that uses the recursions (13.5) in order to calculate the
sample mean and sample variance of a data set.
5. Repeat the previous exercise with the exception that you now continue generating standard normals until S/√n < 0.01.
6. Estimate ∫₀¹ e^(x²) dx by generating random numbers. Generate at least 100 values and stop when the standard deviation of your estimator is less than 0.01.
Chapter 14
Variance Reduction Techniques
Recall (from Chapter 13) that we are typically interested in determining some parameter θ that is connected with some stochastic model when undertaking a simulation study. To estimate this parameter, a simulation of the model results in output data X with the property that E[X] = θ. Repeated runs of the simulation are performed, where the i-th run yields the output variable X_i. Let σ² = Var(X_i) denote the variance of the X_i's. As explained in Chapter 13, we terminate the simulation study after n runs and the estimate of θ is given by calculating the sample mean X̄ of the X_i's, namely
X̄ = (1/n) ∑_{i=1}^n X_i.
Further, recall that the sample mean X̄ is an unbiased estimator of θ and it therefore follows by (13.2) (from Chapter 13) that its mean squared error equals its variance, namely that
E[(X̄ − θ)²] = Var(X̄) = σ²/n.
To this point we have reduced the variance of our estimator X̄ by increasing the value
of n. The issue with this approach is that sometimes we would have to simulate a very
large number of observations n in order to get the variance within some predetermined
acceptable range. It turns out that we could face such an issue even when working with
a seemingly quite simple model.
In the following we present other, more efficient, methods that one can use to reduce the variance of the simulation estimator X̄, namely Var(X̄). In particular, we will outline two such approaches for variance reduction, called antithetic variables and control variates, respectively. Informally, the method of antithetic variables makes use of pairs of variables that are highly negatively correlated to reduce variance. In contrast, the method of control variates informally reduces variance by making use of linear combinations of random variables with high (positive or negative) correlation, where the population mean of one of the random variables is known.
14.1 The Use of Antithetic Variables
X_2 = h(1 − U_1, 1 − U_2, . . . , 1 − U_m)
has the same distribution as X_1. Further, since we note that 1 − U is negatively correlated with U, our hope is that X_2 is therefore negatively correlated with X_1. Recall that our random variables X_1 and X_2 depend on some unknown function h, and perhaps unsurprisingly, deciding whether the random variables are negatively correlated will depend on this underlying function. It turns out that the random variables X_1 and X_2 are negatively correlated when the underlying function h is monotone. Note that using X_1 and X_2 as described above has a double benefit: not only do we reduce the variance of our estimator provided the function h is monotone, we additionally save some computation time as we do not need to generate a second set of random numbers.
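The double benefit can be illustrated numerically. In the sketch below the monotone function h(u) = e^u is an illustrative choice, for which E[h(U)] = e − 1; we compare the sample variance of antithetic pairs (h(U) + h(1 − U))/2 with that of independent pairs:

```python
import math
import random

def sample_var(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

h = math.exp                  # increasing, hence monotone in its argument
rng = random.Random(3)

anti, indep = [], []
for _ in range(5000):
    u = rng.random()
    anti.append(0.5 * (h(u) + h(1.0 - u)))                   # antithetic pair
    indep.append(0.5 * (h(rng.random()) + h(rng.random())))  # independent pair

# Both averages estimate e - 1 ≈ 1.7183, but the antithetic pairs
# exhibit a far smaller sample variance.
```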
To show that the use of antithetic variables will lead to a reduction in variance whenever the function h is monotone, we make use of the following theorem, before deducing the result of interest as a corollary. It should be noted that we state the initial result without proof in order to simplify the presentation of the material; however, the complete proof can be found in [21, Section 9.9].
where X = (X_1, X_2, . . . , X_n), i.e.
Cov(f(X), g(X)) ≥ 0.
The following corollary proves that, provided the function h is monotone in each of its arguments, the random variables X_1 and X_2 cannot be positively correlated, which as described above is advantageous for variance reduction.
Corollary. If h denotes a monotone function in each of its m arguments, then for a set U_1, U_2, . . . , U_m of independent uniform random numbers we have
Cov(h(U_1, . . . , U_m), h(1 − U_1, . . . , 1 − U_m)) ≤ 0. (14.1)
Proof.
It should be emphasised that we have shown that if the random variables X_1 and X_2 are defined as above and h is a monotone function, then (14.1) yields that Cov(X_1, X_2) ≤ 0 holds, namely that we obtain a reduction in variance as desired. A natural question to ask here is: how large is this reduction in variance?
For this purpose, we will now formally compare the variance of the estimator between two independent and two antithetic variables. Let X_1 and X_2 be an antithetic pair of random variables as defined above. Let Y_1 and Y_2 be independent with the same distribution as the X_i's. Suppose further that the aforementioned random variables all have the same variance σ².
Recall (from Chapter 8) that the correlation coefficient of two random variables X and Y, denoted here by ρ_XY, is defined as
ρ_XY = Corr(X, Y) = Cov(X, Y)/√(Var(X) · Var(Y)). (14.2)
Recall that our input variables have to date been uniform and we have made use of the negative correlation between the uniform input U and 1 − U to reduce variance. In some scenarios, the relevant output of a simulation study is some function of the input variables Y_1, Y_2, . . . , Y_m. In other words, sometimes the relevant output is
X = h(Y_1, Y_2, . . . , Y_m),
where h once more denotes some function. Similarly, the approach we take is to generate two random variables X_1 and X_2 that estimate the relevant output X by making use of the estimator (X_1 + X_2)/2. Further, we simultaneously reduce the variance of this estimator by making use of underlying antithetic variables.
Let us suppose that the input variable Y_i has corresponding cumulative distribution function F_i for each i = 1, 2, . . . , m. Note that if we generate the input variables Y_1, Y_2, . . . , Y_m using the inverse transform method, then Y_i = F_i^(−1)(U_i) for each i, where U_1, U_2, . . . , U_m are independent uniform random numbers.
The following example demonstrates how one can use antithetic variables to estimate the value of some quantity. In particular, we outline how we may estimate the famous mathematical constant e (which was introduced in Chapter 8), which is defined by e = lim_{n→∞} (1 + 1/n)^n and takes the approximate value 2.7183.
To this point we have reduced variance by generating antithetic variables using the relation between uniform random numbers U and 1 − U. Upon working with normally distributed random variables, we can apply similar antithetic ideas in order to reduce variance. For this purpose, let us suppose that we are working with normal random variables with mean μ and variance σ². Suppose we have generated such a random variable Y and then consider the variable Y′ = 2μ − Y. Upon making use of several expressions from Chapter 8, we notice that
which shows that both Y and Y′ are normal random variables with the same mean and variance. Further, observe that
Cov(Y, Y′) = Cov(Y, 2μ − Y)
= (1/2) [Var(Y + 2μ − Y) − Var(Y) − Var(Y′)]
= (1/2) [Var(2μ) − 2σ²] = −σ²,
which shows that Y and Y 0 are negatively correlated and suggests that utilising such
random variables will yield the desired variance reduction.
In particular, suppose we are interested in using simulation to compute
E[h(Y_1, Y_2, . . . , Y_m)],
where the Y_i's are independent normal random variables with corresponding means μ_i for i = 1, 2, . . . , m and h denotes some function. Recall inequality (14.1) and note that the result was proven using not the density of uniform random variables but rather the fact that they are independent and identically distributed. In light of this, it turns out that if h is once more a monotone function in each of its coordinates, then (14.1) holds upon replacing U_i and 1 − U_i by Y_i and 2μ_i − Y_i for each i = 1, 2, . . . , m, respectively.
Being a little more precise, in this setting the antithetic approach is to generate m normal random variables Y_1, Y_2, . . . , Y_m with corresponding means μ_i for each i to compute h(Y_1, Y_2, . . . , Y_m), before using the corresponding antithetic variables 2μ_i − Y_i to compute the next simulated value of h. In particular, if h is monotone, we find that
Cov(h(Y_1, . . . , Y_m), h(2μ_1 − Y_1, . . . , 2μ_m − Y_m)) ≤ 0
holds, showing that we obtain a reduction in variance when compared with simply generating a second set of m normal random variables.
14.2 The Use of Control Variates
Suppose once more that we are interested in using simulation to estimate θ = E[X], where X is an output variable of the simulation, and suppose that the simulation additionally yields a second output variable Y whose expected value μ_Y = E[Y] is known. For any constant c, consider the quantity
X + c(Y − μ_Y),
which is an unbiased estimator of θ; this follows upon making use of several algebraic expressions for expectation (from Chapter 8). Further, upon using similar expressions for variance (from Chapter 8), we observe that
Var(X + c(Y − μ_Y)) = Var(X) + c² Var(Y) + 2c Cov(X, Y).
Motivated by the task of determining the best value of c, denoted here by c*, that minimises this variance, we use standard techniques from calculus to find that
c* = −Cov(X, Y)/Var(Y)
and, in consequence, for this value, the variance of the controlled estimator is
Var(X + c*(Y − μ_Y)) = Var(X) − Cov(X, Y)²/Var(Y). (14.3)
Recall that the quantity Y is by assumption an output variable with already known
expected value µY . In light of this, the quantity Y is called a control variate for the
simulation estimator X , where we have intuitively “assumed some control” over this
output variable Y . In order to reduce variance we want X and Y to be either highly
positively or highly negatively correlated.
Upon dividing (14.3) by Var(X), we find that
Var(X + c*(Y − μ_Y))/Var(X) = 1 − Corr²(X, Y) = 1 − ρ²_XY,
where recall from (14.2) that ρ_XY = Corr(X, Y) denotes the correlation between the outputs X and Y. Further, in light of this equality, we deduce that the variance reduction obtained using the control variate Y is a percentage reduction of 100 ρ²_XY.
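The following sketch illustrates the method; the particular choice X = e^U with control variate Y = U (so that μ_Y = 1/2 is known) is illustrative and not taken from the notes. Since Cov(X, Y) and Var(Y) are unknown, c* is estimated from the simulated data itself, in the manner described next:

```python
import math
import random

def sample_var(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

rng = random.Random(4)
n = 5000
ys = [rng.random() for _ in range(n)]   # control variate Y = U, with μ_Y = 1/2
xs = [math.exp(y) for y in ys]          # output X = e^U, so θ = e − 1

x_bar = sum(xs) / n
y_bar = sum(ys) / n
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
c_hat = -cov / sample_var(ys)           # estimate of c* = −Cov(X, Y)/Var(Y)

controlled = [x + c_hat * (y - 0.5) for x, y in zip(xs, ys)]
# For this pair ρ²(X, Y) ≈ 0.98, so the controlled values have roughly
# 98% less variance than the raw values xs.
```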
It should be emphasised that the quantities Cov(X, Y), Var(X) and Var(Y) would not generally be known in advance. Hence, we must once more estimate their values using the simulated data. For this purpose, let us suppose that n simulation runs have been performed, where we have obtained the outputs X_i and Y_i for each i = 1, 2, . . . , n. We then use the estimators
Ĉov(X, Y) = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄)(Y_i − Ȳ),
V̂ar(X) = (1/(n − 1)) ∑_{i=1}^n (X_i − X̄)²,
V̂ar(Y) = (1/(n − 1)) ∑_{i=1}^n (Y_i − Ȳ)²
to approximate c*, where V̂ar(·) denotes the sample variance. Let us denote the approximate value of c* by ĉ*, where
ĉ* = − ∑_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / ∑_{i=1}^n (Y_i − Ȳ)².
Further, following a similar argument to the one presented in Chapter 13, the variance of the controlled estimator satisfies
\begin{align*}
\operatorname{Var}\bigl(\bar{X} + c^*(\bar{Y} - \mu_Y)\bigr)
&= \operatorname{Var}\Biggl(\frac{1}{n} \sum_{i=1}^{n} \bigl(X_i + c^*(Y_i - \mu_Y)\bigr)\Biggr) \\
&= \frac{1}{n^2}\,\operatorname{Var}\Biggl(\sum_{i=1}^{n} \bigl(X_i + c^*(Y_i - \mu_Y)\bigr)\Biggr) \\
&= \frac{1}{n^2} \cdot n \cdot \operatorname{Var}\bigl(X + c^*(Y - \mu_Y)\bigr) \\
&= \frac{1}{n} \Biggl( \operatorname{Var}(X) - \frac{\operatorname{Cov}(X, Y)^2}{\operatorname{Var}(Y)} \Biggr),
\end{align*}
where the final equality follows by (14.3). In particular, this shows that the variance of the controlled estimator can be estimated using the estimator $\widehat{\operatorname{Cov}}(X, Y)$ for the covariance and the sample variance estimators $\widehat{\operatorname{Var}}(X)$ and $\widehat{\operatorname{Var}}(Y)$, respectively.
The following example makes use of the previously discussed reliability function to demonstrate how control variates can be used to reduce variance.
The next example considers a queuing system where customers arrive in accordance with a nonhomogeneous Poisson process with intensity function $\lambda(s)$, where $s \geq 0$.
During our next example, we consider how control variates may be used to reduce
variance when we are interested in estimating the value of some definite integral. This
integral was introduced previously when we demonstrated how antithetic variables can
lead to significant variance reduction.
The following example introduces a list recording problem. Suppose for this purpose
that we are given a (finite) set of n elements, that are arranged in an ordered list. A
request is made at each unit time to retrieve one of these elements with some probability,
where the selected element is then put back into the list but not necessarily in the same
position. Note that when we place the selected element back into the list we would normally make use of some reordering rule (such as interchanging it with its preceding element). The problem starts with an initial ordering (where any of the possible $n!$
orderings of the n elements are equally likely), before we determine the expected sum
of the positions of the first N elements requested. The following example demonstrates
how we may use simulation to accomplish this task efficiently.
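Since the example itself is worked through in the lecture, here is only a minimal sketch of one simulation run. The request probabilities `p`, the horizon `N` and the "interchange with the preceding element" rule are illustrative assumptions, and all names are my own.

```python
import random

def positions_sum(p, N, rng):
    """One run: start from a uniformly random ordering of elements
    0, ..., n-1, serve N requests drawn with probabilities p, apply the
    transposition rule (swap the requested element with its predecessor),
    and return the sum of the (1-indexed) positions at which the
    requested elements were found."""
    n = len(p)
    order = list(range(n))
    rng.shuffle(order)                      # uniformly random initial ordering
    elements = list(range(n))
    total = 0
    for _ in range(N):
        e = rng.choices(elements, weights=p)[0]
        pos = order.index(e)                # 0-indexed position of the request
        total += pos + 1
        if pos > 0:                         # interchange with preceding element
            order[pos - 1], order[pos] = order[pos], order[pos - 1]
    return total

# estimate the expected sum by averaging over independent runs
rng = random.Random(0)
runs = 2000
estimate = sum(positions_sum([0.5, 0.3, 0.2], 10, rng) for _ in range(runs)) / runs
```

In the control-variate spirit of this section, a natural candidate control is the sum of the requested positions under the frozen initial ordering, since by symmetry each request then has expected position $(n+1)/2$.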
Recall that for any constant c the controlled estimator $X + c(Y - \mu_Y)$ is an unbiased estimator of $\theta = E[X]$, where the expected value of Y is assumed known. It is perhaps unsurprising that we could use more than a single variable as a control if needed. If, for example, a simulation study results in output variables $Y_i$ for $i = 1, 2, \ldots, k$ and the values $E[Y_i] = \mu_i$ are known for each i, then for any constants $c_i$ we can use
\[
X + \sum_{i=1}^{k} c_i (Y_i - \mu_i)
\]
as an unbiased estimator of $\theta$.
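With several controls, the variance-minimising coefficients solve a small least-squares problem: minimising the variance of $X + \sum_i c_i(Y_i - \mu_i)$ amounts to regressing X on the controls and taking each $c_i$ to be the negative of the fitted slope. The sketch below uses this regression route; the approach and all names are my own, not taken from the notes.

```python
import numpy as np

def multi_control_estimate(X, Ys, mus):
    """Controlled estimate of E[X] using k control variates.

    X   : array of n outputs
    Ys  : list of k arrays of n outputs each
    mus : the k known means E[Y_i]
    """
    X = np.asarray(X, dtype=float)
    D = np.column_stack(Ys) - np.asarray(mus, dtype=float)  # n x k, centred
    # regress X on the centred controls (with an intercept);
    # the variance-minimising coefficients are minus the fitted slopes
    A = np.column_stack([np.ones(len(X)), D])
    coef, *_ = np.linalg.lstsq(A, X, rcond=None)
    slopes = coef[1:]
    Z = X - D @ slopes      # X + sum_i c_i (Y_i - mu_i) with c_i = -slope_i
    return Z.mean()
```

For instance, $E[e^U]$ can be estimated with the two controls $Y_1 = U$ and $Y_2 = U^2$, whose means $1/2$ and $1/3$ are known.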
Example. (Blackjack)
To conclude this chapter, we make a number of remarks regarding both control variates and antithetic variables.
Remarks.
1. One particularly valuable way of interpreting the control variates approach is that it combines unbiased estimators of $\theta$. In particular, suppose that X and W are determined by the simulation with the property that $E[X] = E[W] = \theta$. We may then consider any estimator of the form
\[
\alpha X + (1 - \alpha) W,
\]
which is unbiased for all $\alpha$. The best such estimator, obtained by choosing the value of $\alpha$ that minimises the variance, denoted here by $\alpha^*$, is given by
\[
\alpha^* = \frac{\operatorname{Var}(W) - \operatorname{Cov}(X, W)}{\operatorname{Var}(X) + \operatorname{Var}(W) - 2\operatorname{Cov}(X, W)}, \tag{14.4}
\]
which follows using expressions for variance (from Chapter 8) and by standard
techniques from calculus.
Suppose once more that for some other output variable Y the expected value $E[Y] = \mu_Y$ is known. Note that we then have two unbiased estimators, namely X and $X + Y - \mu_Y$. Further, these can be combined to yield the combined estimator
\[
\alpha X + (1 - \alpha)(X + Y - \mu_Y) = X + (1 - \alpha)(Y - \mu_Y),
\]
which is precisely the controlled estimator with $c = 1 - \alpha$.
2. The above remark suggests that the antithetic variable approach can be thought of as a special case of combining unbiased estimators, and thus of control variates. In particular, if $E[X] = \theta$, where $X = h(U_1, U_2, \ldots, U_n)$, then $E[W] = \theta$, where $W = h(1 - U_1, 1 - U_2, \ldots, 1 - U_n)$. The estimators X and W are both unbiased and we combine them to yield
\[
\alpha X + (1 - \alpha) W.
\]
Since X and W have the same distribution, we see that $\operatorname{Var}(X) = \operatorname{Var}(W)$. It follows by (14.4) that the best value of $\alpha$ is $\alpha^* = 1/2$ and, as such,
\[
\alpha^* X + (1 - \alpha^*) W = \frac{X + W}{2},
\]
i.e. the combined unbiased estimator becomes the antithetic variable estimator.
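This equivalence is easy to check numerically. In the sketch below the monotone choice $h(u) = e^u$ is my own illustrative example: $\alpha^*$ computed from (14.4) comes out at $1/2$ up to sampling noise, and the resulting combination is exactly the antithetic estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
U = rng.random(50_000)
X = np.exp(U)          # h(U) for the monotone choice h(u) = e^u
W = np.exp(1 - U)      # h(1 - U): same distribution, negatively correlated

var_X = np.var(X, ddof=1)
var_W = np.var(W, ddof=1)
cov_XW = np.cov(X, W, ddof=1)[0, 1]

# alpha* from (14.4); close to 1/2 since Var(X) = Var(W)
alpha_star = (var_W - cov_XW) / (var_X + var_W - 2 * cov_XW)

antithetic = (X + W) / 2
print(alpha_star)
print(np.var(antithetic, ddof=1) / var_X)   # well below 1/2 here
```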
3. The above remark further indicates why it is not usually possible to effectively combine antithetic variables with a control variate. In particular, if a control variate Y has large positive correlation with $X = h(U_1, U_2, \ldots, U_n)$, then Y likely has large negative correlation with $W = h(1 - U_1, 1 - U_2, \ldots, 1 - U_n)$. It follows that Y is unlikely to have a large correlation with the antithetic estimator
\[
\frac{1}{2} \Bigl( h(U_1, U_2, \ldots, U_n) + h(1 - U_1, 1 - U_2, \ldots, 1 - U_n) \Bigr).
\]
14.3 Exercises for Self-Study

1. a) Show that
\[
\frac{e^{U^2} \bigl( 1 + e^{1 - 2U} \bigr)}{2} \tag{14.5}
\]
is an unbiased estimator of $\theta$, where U denotes a random number.

b) Show that using the unbiased estimator (14.5) is better than generating two random numbers $U_1$ and $U_2$ and using the estimator $\bigl( e^{U_1^2} + e^{U_2^2} \bigr)/2$.
2. Explain how antithetic variables can be used in obtaining a simulation of the quantity
\[
\theta = \int_0^1 \int_0^1 e^{(x + y)^2} \, dy \, dx.
\]
Is it clear in this case that using antithetic variables is more efficient than generating a new pair of random numbers?

Perform this simulation to obtain an interval of length no greater than 0.1 that you can assert with 95% confidence contains the value of $\theta$.
6. Show that $\operatorname{Var}\bigl(\alpha X + (1 - \alpha) W\bigr)$ is minimised by $\alpha$ being equal to the value given by (14.4) and determine the resulting variance.
b) Perform 100 simulation runs, using the control given in a), in order to estimate firstly $c^*$ and then the variance of the estimator.
c) Using the same data as in b), determine the variance of the antithetic variable
estimator.
d) Which of the two types of variance reduction worked better in this scenario?
Chapter 15
For a brief introduction to Markov chains you should read Sections 11.1, 11.2, 11.3
and 11.4 from [6]. These sections are highlighted and can be found on Moodle. It is
relatively easy reading. You may additionally look at Sections 11.5 and 11.6 from the
same textbook, which can be found on Moodle.
Chapter 16
For an overview of the Markov chain Monte Carlo method you should read Sections 12.1
and 12.2 from [6]. These sections are highlighted and can be found on Moodle.
Bibliography
[1] Ravindra K Ahuja, Thomas L Magnanti, James B Orlin, and MR Reddy. “Applica-
tions of network optimization”. In: Handbooks in Operations Research and Man-
agement Science 7 (1995), pp. 1–83.
[2] Martin Anthony and Michele Harvey. Linear algebra: concepts and methods. Cam-
bridge University Press, 2012.
[3] Robert G Bartle and Donald R Sherbert. Introduction to real analysis. 4th ed. John
Wiley & Sons, Inc., 2011.
[4] Richard Bellman. “On a routing problem”. In: Quarterly of Applied Mathematics
16.1 (1958), pp. 87–90.
[5] Dimitri P Bertsekas. Nonlinear Programming. 3rd ed. Athena Scientific, 2016.
[6] Joseph K Blitzstein and Jessica Hwang. Introduction to probability. CRC Press,
Taylor & Francis Group, 2015.
[7] George Dantzig, Ray Fulkerson, and Selmer Johnson. “Solution of a large-scale
traveling-salesman problem”. In: Journal of the Operations Research Society of Amer-
ica 2.4 (1954), pp. 393–410.
[8] George B Dantzig. “Maximization of a linear function of variables subject to linear
inequalities”. In: Activity analysis of production and allocation 13 (1951), pp. 339–
347.
[9] George B Dantzig. “Origins of the simplex method”. In: A history of scientific com-
puting. 1990, pp. 141–151.
[10] George B Dantzig. “Reminiscences about the origins of linear programming”. In:
Mathematical Programming The State of the Art. Springer, 1983, pp. 78–86.
[11] Lester R Ford and Delbert R Fulkerson. “Flows in Networks”. In: Flows in Net-
works. Princeton University Press, 1962.
[12] Lester R Ford Jr. Network flow theory. Tech. rep. Rand Corp Santa Monica Ca,
1956.
[13] Michael R Garey and David S Johnson. Computers and Intractability: A Guide to
the Theory of NP-Completeness. Vol. 174. Freeman San Francisco, 1979.
[14] Bezalel Gavish and Stephen C Graves. “The travelling salesman problem and re-
lated problems”. In: (1978).
[15] Arthur S Goldberger. Econometric Theory. New York: John Wiley & Sons, Inc.,
1964.
[18] Joseph Lee Rodgers and W Alan Nicewander. “Thirteen ways to look at the cor-
relation coefficient”. In: The American Statistician 42.1 (1988), pp. 59–66.
[20] Clair E Miller, Albert W Tucker, and Richard A Zemlin. “Integer programming
formulation of traveling salesman problems”. In: Journal of the ACM (JACM) 7.4
(1960), pp. 326–329.
[22] John Von Neumann. “Various techniques used in connection with random digits”.
In: National Bureau of Standards Applied Mathematics Series 12 (1951), pp. 36–
38.