Sim Book
Manuele Leonelli
2021-04-05
Contents
Preface

1 Introduction
1.1 What is simulation
1.2 Types of simulations
1.3 Elements of a simulation model
1.4 The donut shop example
1.5 Simulating a little health center
1.6 What's next

2 R programming
2.1 Why R?
2.2 R basics
2.3 Accessing and manipulating variables
2.4 Loops and conditions
2.5 Functions
2.6 The apply family of functions
2.7 The pipe operator
2.8 Plotting

3 Probability Basics
3.1 Discrete Random Variables
3.2 Notable Discrete Variables
3.3 Continuous Random Variables
Preface

These are lecture notes for the module Simulation and Modelling to Understand
Change given in the School of Human Sciences and Technology at IE University,
Madrid, Spain. The module is given in the 2nd semester of the 1st year of
the bachelor in Data & Business Analytics. Knowledge of basic elements of R
programming as well as probability and statistics is assumed.
Chapter 1
Introduction
The first introductory chapter gives an overview of simulation: what it is, what
it can be used for, as well as some examples. Among the many uses of simulation:

• During the design stage of a system, meaning while it is being built, simulation
can be used to guide its construction.
Suppose we decided to open a donut shop and are unsure about how many
employees to hire to sell donuts to customers. The operations of our little shop
are the real-world system whose behavior we want to understand. Given that the
shop is not operating yet, only a simulation model can provide us with insights.
We could of course devise models of different complexities, but for now suppose
that we are happy with a simple model where we have the following elements:
• It can be replicated multiple times and the workings of the system can
be observed a large number of times;
• It is ethical and legal since it can implement changes in policies that would
be unethical or illegal to do in the real world.
Table 1.1: Dataframe result from the social media deterministic simulation
The dataframe result is reported in Table 1.1, showing that she will be able
to hit her target of 10k followers since she will have 11619 followers. If we
run the simulation again we will obtain the exact same results: there is no
stochasticity/uncertainty about the outcome.
Simulation models that represent the system at a particular point in time only
are called static. This type of simulation is often called Monte Carlo
simulation and will be the focus of later chapters.
Dynamic simulation models represent systems as they evolve over time. The
simulation of the donut shop during its working hours is an example of a dynamic
model.
(Figure 1.1: state of the system, i.e. the number of customers queuing, plotted against time)
Figure 1.1 further illustrates that for specific periods of time the system does
not change state, that is, the number of customers queuing remains constant.
It is therefore useless to inspect the system during those times where nothing
changes. This prompts the way in which time is usually handled in dynamic
discrete simulations, using the so-called next-event technique. The model is only
examined and updated when the system is due to change. These changes are
usually called events. Looking at Figure 1.1, at time zero there is an event: a
customer arrives; at time nine another customer arrives; at time ten another
customer arrives; at time twelve a customer is served; and so on. All these are
examples of events.
Continuous simulation models are such that the variables of interest change
continuously over time. Suppose for instance a simulation model for a car
journey was created where the interest is on the speed of the car throughout the
journey. Then this would be a continuous simulation model. Figure 1.2 gives
an illustration of this.
(Figure 1.2: speed of the car in km/h plotted against time)
In later chapters we will focus on discrete simulations, which are usually called
discrete-event simulations. Continuous simulations will not be discussed in these
notes.
There are two types of objects a simulation model is often made of:
• Entities: individual elements of the system that are being simulated and
whose behavior is being explicitly tracked. Each entity can be individually
identified;
• Resources: also individual elements of the system but they are not mod-
elled individually. They are treated as countable items whose behavior is
not tracked.
Consider our simple donut shop. Clients will most likely be resources since
we are not really interested in what each of them does. Employees may either
be considered as entities or resources: in the former case we want to track the
amount of time each of them is working; in the latter the model would only
be able to output an overview of how busy the employees are overall.
During a simulation study, entities and resources will cooperate and therefore
change state. The following terminology describes this as well as the flow of time:
• Event: instant of time where the state of the system changes. In the
donut shop suppose that there are currently two customers being served.
An event is when a customer has finished being served: the number of
busy employees decreases by one and there is one less customer queuing.
• Activity: a time period of specified length which is known when it begins
(although its length may be random). The time an employee takes to
serve a customer is an example of an activity: this may be specified in
terms of a random distribution.
• Delay: duration of time of unspecified length, which is not known until it
ends. This is not specified by the modeller ahead of time but is determined
by the conditions of the system. Very often this is one of the desired outputs
of a simulation. For instance, a delay is the waiting time of a customer in
the queue of our donut shop.
• Clock: variable representing simulated time.
From an abstract point of view we have now defined all components of our
simulation model. Before implementing, we need to choose the length of the
activities. This is usually done using common sense, intuition or historical
data. Suppose for instance that the time between the arrival of customers is
modeled as an Exponential distribution with parameter 1/3 (that is on average
a customer arrives every three minutes) and the service time is modeled as a
continuous Uniform distribution between 1 and 5 (on average a service takes
three minutes).
With this information we can now implement the workings of our donut shop.
The specific code itself does not matter; we will learn about it in later chapters.
At this stage it is only important to notice that we use the simmer package
together with the functionalities of magrittr. We simulate our donut shop for
two hours.
library(simmer)
library(magrittr)

set.seed(2021)

# Customer trajectory (assumed, as the original listing was truncated):
# seize an employee, take a Uniform(1, 5) service time, release the employee
customer <- trajectory() %>%
  seize("employee", 1) %>%
  timeout(function() runif(1, 1, 5)) %>%
  release("employee", 1)

env <- simmer("DonutShop")
env %>%
  add_resource("employee", 2) %>%
  add_generator("customer", customer, function() rexp(1, 1/3))

env %>%
  run(until = 120)
The above code creates a simulation of the donut shop for two hours. Next we
report some graphical summaries that describe how the system worked.
library(simmer.plot)
library(gridExtra)
p1 <- plot(get_mon_resources(env), metric = "usage", items = "server", step = TRUE)
p2 <- plot(get_mon_arrivals(env), metric = "waiting_time")
grid.arrange(p1,p2,ncol=2)
The left plot in Figure 1.3 reports the number of busy employees throughout
the simulation. We can observe that often no employees were busy, but
sometimes both of them were busy. The right plot in Figure 1.3 reports the
waiting time of customers throughout the simulation. Most often customers did
not wait in our shop and the largest waiting time was of about four minutes.

Some observations:

• this is the result of a single simulation where inputs are random and
described by random variables (here, Exponential interarrival times and
Uniform service times). If we were to run the simulation again we would
observe different results.
Figure 1.3: Graphical summaries from the simulation of the donut shop
Consider now a simulation of a little health center, with the following assumptions:
• Nurse visit times follow a Normal distribution with mean 15 and variance
1;
• Doctor visit times follow a Normal distribution with mean 20 and variance
1;
• Administrative staff visit times follow a Normal distribution with mean 5
and variance 1;
• Time between the arrival of patients is modeled as a Normal with mean
10 and variance 4.
The model above can be implemented using the following code (we run the
simulation for four hours). Again do not worry about it now!
set.seed(2021)

# Patient trajectory (assumed): a nurse visit, then a doctor visit,
# then a visit to the administrative staff
patient <- trajectory() %>%
  seize("nurse", 1) %>% timeout(function() rnorm(1, 15, 1)) %>% release("nurse", 1) %>%
  seize("doctor", 1) %>% timeout(function() rnorm(1, 20, 1)) %>% release("doctor", 1) %>%
  seize("administration", 1) %>% timeout(function() rnorm(1, 5, 1)) %>% release("administration", 1)

env <- simmer("HealthCenter")
env %>%
  add_resource("nurse", 1) %>%
  add_resource("doctor", 2) %>%
  add_resource("administration", 1) %>%
  add_generator("patient", patient, function() rnorm(1, 10, 2))

env %>% run(until = 240)   # four hours
(Figure 1.4: resource utilization, from 0% to 100%, for each resource in the system)

Figure 1.4 shows the utilization of the different resources in the system. The
nurse is the busiest resource, doctors are overall fairly available, whilst the
administration is available more than half of the time.
Figure 1.5 confirms this. We see that the usage of the nurse is almost always 1,
whilst for doctors and administrative staff the usage is below the number of
doctors and staff available.

(Figure 1.5: resource usage over time, with panels for administration, doctor and nurse)
Lastly, Figure 1.6 reports the average time spent by patients in the health center.
We can see that as the simulation clock increases, patients spend more time
in the health center. From the previous plots, we can deduce that in general
patients wait for the nurse, who was busy throughout the simulation.

(Figure 1.6: average flow time of patients throughout the simulation)
Chapter 2

R programming
• The books of Hadley Wickham are surely a great starting point and are
all freely available online.
• If you are unsure on how to do something with R, Google it!!! The community
of R users is so wide that surely someone else has already asked
the same question.
• The R help is extremely useful and comprehensive. If you want to
know more about a function, suppose it is called function, you can type
?function.
2.1 Why R?
As mentioned in the previous chapter, simulation is very often applied in many
areas, for instance management science and engineering. Often a simulation is
carried out using an Excel spreadsheet or using a specialised software whose only
purpose is creating simulations. Historically, R has not been at the forefront of
the implementation of simulation models, in particular of discrete-event simu-
lations. Only recently, R packages implementing discrete-event simulation have
appeared, most importantly the simmer R package that you will learn to use in
later chapters.
These notes are intended to provide a unique view of simulation with specific
implementation in the R programming language. Some of the strengths of R are:
• the community of R users is huge, with many forums, sites and resources
that give you practical support in developing your own code;
2.2 R basics
So let’s get started with R programming!
2.2.1 R as a calculator
In its most basic usage, we can use R as a calculator. Basic algebraic opera-
tions can be carried out as you would expect. The symbol + is for sum, - for
subtraction, * for multiplication and / for division. Here are some examples:
4 + 2
## [1] 6
4 - 2
## [1] 2
4 * 2
## [1] 8
5 / 2
## [1] 2.5
a <- 4
b <- 3
a + b
## [1] 7
a - b
## [1] 1
Notice for example that the code a <- 4 does not show us the value of the
variable a. It only creates this assignment. If we want to print the value of a
variable, we have to explicitly type the name of the variable.

a

## [1] 4
In the previous examples we worked with numbers, but variables could be as-
signed other types of information. There are four basic types:
Examples:
a <- TRUE
a
## [1] TRUE
b <- "hello"
b
## [1] "hello"
2.2.4 Vectors
In all previous examples the variables included one element only. More generally
we can define sequences of elements or so-called vectors. They can be defined
with the command c, which stands for combine.
vec <- c(1, 3, 5, 7)
vec

## [1] 1 3 5 7

Vectors can only include a single data type. Consider the following example.

vec <- c(1, "hello", TRUE)
vec

## [1] "1"     "hello" "TRUE"

We created a variable vec where the first entry is a number, then a character
string, then a Boolean. When we print vec, we get that its elements are "1",
"hello" and "TRUE": it has transformed the number 1 into the string "1" and
the Boolean TRUE into "TRUE".
2.2.5 Matrices
Matrices are tables of elements that are organized in rows and columns. You
can think of them as an arrangement of vectors into a table. Matrices must have
the same data type in all their entries, as for vectors. Matrices can be constructed
in multiple ways. One way is by stacking vectors into a matrix row-by-row with
the command rbind; another is the command matrix, as in the following example.

vec <- 1:9
mat <- matrix(vec, nrow = 3, ncol = 3)
mat

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

So first we created a vector vec with numbers from 1 to 9 and then stored them
in a matrix with 3 rows and 3 columns. Numbers are stored by column: the first
element of vec is in entry (1,1), the second element of vec is in entry (2,1), and
so on.
2.2.6 Dataframes
Dataframes are very similar to matrices: they are tables organized in rows and
columns. However, unlike matrices they can have columns with different
data types. They can be created with the command data.frame.

data <- data.frame(X1 = c(1, 2, 3),
                   X2 = c(TRUE, FALSE, FALSE),
                   X3 = c("male", "male", "female"))
data

##   X1    X2     X3
## 1  1  TRUE   male
## 2  2 FALSE   male
## 3  3 FALSE female
The dataframe data includes three columns: the first column X1 of numbers, the
second column X2 of Boolean and the third column X3 of characters. Dataframes
are the objects that are most commonly used in real world data analysis.
vec <- c(3, NA, 5)
vec

## [1]  3 NA  5

Although the second element of vec is the expression NA, R recognizes that it
is used for a missing value and therefore the elements 3 and 5 are still considered
numbers: indeed they are not printed as "3" and "5".
NULL is an additional datatype. This can have various uses. For instance, it is
associated to a vector with no entries.
c()
## NULL
Given a vector vec we can access its i-th entry with vec[i].

vec <- c(1, 3, 5, 7)
vec[2]

## [1] 3
For a matrix or a dataframe we need to specify the associated row and column.
If we have a matrix mat we can access the element in entry (i,j) with mat[i,j].

mat[1, 3]

## [1] 7
To access multiple entries we can on the other hand define a vector of indexes
of the elements we want to access. Consider the following examples:

vec[c(1, 2)]

## [1] 1 3
The above code accesses the first two entries of the vector vec. To do this we
had to define a vector using c(1,2) stating the entries we wanted to look at.
For matrices consider:

mat[c(1, 2), c(2, 3)]

##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8
The syntax is very similar to before. We defined two index vectors, one for the
rows and one for columns. The two statements c(1,2) and c(2,3) are separated
by a comma to denote that the first selects the first and second row, whilst the
second selects the second and third column.
If one wants to access full rows or full columns, the argument associated to rows
or columns is left blank. Consider the following examples.
mat[1,]

## [1] 1 4 7
mat[,c(1,2)]
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
The code mat[1,] selects the first full row of mat. The code mat[,c(1,2)]
selects the first and second column of mat. Notice that the comma always
has to be included!
A quick way to create a vector of consecutive integers is the : operator.

1:9
## [1] 1 2 3 4 5 6 7 8 9
More generally, one can define sequences of numbers using seq (see ?seq).
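For instance, a small sketch with illustrative arguments:

seq(0, 1, by = 0.25)

## [1] 0.00 0.25 0.50 0.75 1.00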
We constructed a vector vec and checked which entries were larger than 4. The
output is a Boolean vector with the same number of entries as vec where only
the last two entries are TRUE. Similarly, we can extract those entries:

vec[vec > 4]

## [1] 5 6
We have seen in the previous section that dataframes are special types of matrices
whose columns can include different data types. For this reason there are
special ways to manipulate and access their entries.
First, specific columns of a dataframe can be accessed using their name and the $
sign as follows.

data$X1

## [1] 1 2 3
data$X3

## [1] "male"   "male"   "female"
So using the name of the dataframe data followed by $ and then the name of
the column, for instance X1, we access that specific column of the dataframe.
Second, we can use the $ sign to add new columns to a dataframe. Consider
the following code.

data$X4 <- c("yes", "no", "no")
data

##   X1    X2     X3  X4
## 1  1  TRUE   male yes
## 2  2 FALSE   male  no
## 3  3 FALSE female  no
We can also delete a column by assigning NULL to it.

data$X4 <- NULL

Third, we can extract the rows of a dataframe that obey some rules.

data[data$X1 <= 2, ]

##   X1    X2   X3
## 1  1  TRUE male
## 2  2 FALSE male
The above code returns the rows of data such that X1 is less or equal to 2. More
complex rules to subset a dataframe can be combined using the and operator &
and the or operator |. Let’s see an example.
data[data$X1 <= 2 & data$X2 == TRUE, ]

##   X1   X2   X3
## 1  1 TRUE male
So the above code selects the rows such that X1 is less or equal to 2 and X2 is
TRUE. This is the case only for the first row of data.
Here is a list of functions which are often useful to get information about objects
in R:

• head returns the first entries of a vector or the first rows of a matrix or a
dataframe;
• length returns the number of elements of a vector;
• unique returns the distinct values appearing in a vector;
• order returns the ordering of the entries of a vector;
• dim returns the dimensions of a matrix or a dataframe.

Consider the following examples.

vec <- c(4, 2, 7, 5, 5)
length(vec)

## [1] 5
unique(vec)
## [1] 4 2 7 5
order(vec)
## [1] 2 1 4 5 3
length gives the number of elements of vec, unique returns the different values
in vec (so 5 is not repeated), order returns in entry i the ordering of the i-th
entry of vec. So the first entry of order(vec) is 2 since 4 is the second-smallest
entry of vec.
dim applied to an object returns its dimensions. For instance:

dim(data)

## [1] 4 3

So dim tells us that data has four rows and three columns.
2.4 Loops and conditions

2.4.1 if statements

if statements have the following skeleton:

if (condition) {true_action}

condition must return a Boolean, either TRUE or FALSE. If TRUE, then the code
within the curly brackets is run, performing true_action. If
condition is FALSE the code does nothing.
It is more customary to also give a chunk of code for the case where condition is
FALSE. This can be achieved with else.
a <- 5
if (a < 2){"hello"} else {"goodbye"}
## [1] "goodbye"
a <- 1
if (a < 2){"hello"} else {"goodbye"}
## [1] "hello"
2.4.2 ifelse
if works when checking a single element and the condition returns either TRUE
or FALSE. The command ifelse can be used to quickly check a condition over
all elements of a vector. Consider the following example.
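For instance, a minimal sketch with illustrative values:

x <- c(1, 5, 2, 8)
ifelse(x > 3, "big", "small")

## [1] "small" "big"   "small" "big"

The condition x > 3 is checked for each entry of x, returning "big" where it
holds and "small" where it does not.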
2.4.3 Loops
for loops are used to iterate over items in a vector. They have the following
skeleton:

for (item in vector) {perform_action}
For each item in vector, perform_action is performed once and the value of
item is updated each time.
Here is an example.
for (i in c(1,2,3)){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
The item is the variable i (it is customary to use just a letter) and at each step
i is set equal to a value in the vector c(1,2,3). At each of these iterations,
the command print(i), which simply prints the value that i takes, is called.
Indeed we see that the output is the sequence of numbers 1, 2, 3.
2.5 Functions
Functions are chunks of code that are given a name so that they can be easily
used multiple times. Perhaps without realising it, you have used functions
already many times!
• a name: in R functions are objects just like vectors or matrices and they
are given a name;
• arguments: these are objects that will be used within the function;
• a body: the code that is run when the function is called.

my.function <- function(x, y) {
  x + y
}

The above function computes the sum of two numbers x and y. Let's call it.
my.function(2,3)
## [1] 5
Functions can span multiple lines of code and define variables within them.
Consider the following example.

new.function <- function(x, y) {
  z1 <- x^2
  z2 <- z1 + y
  return(z2)
}

The new.function returns the sum between the square of the first input x and
the second input y. Let's call the function.
new.function(2,3)
## [1] 7
new.function(3,2)
## [1] 11
The variables z1 and z2 defined in the body of new.function exist
only within the function: when you call the function the output does not create
a variable z1. Neither does it create a variable z2: it simply returns
the value that is stored in z2, which can then be assigned as in the following
example.

z <- new.function(2, 3)
z

## [1] 7

The return command can also be omitted. Consider the following redefinition.

new.function <- function(x, y) {
  x^2 + y
}
new.function(2, 3)

## [1] 7
The output is the same. We did not create any variable within the function and
we did not explicitly use the return command. R understands that the last
line of code is what the function should return.
In R functions can be called in various ways. Before we have seen function calls
as
new.function(2,3)
We could have also been more explicit and stated what x and y were.
new.function(x=2, y=3)
## [1] 7
So now explicitly we state that the input x of new.function is 2 and that the
input y is 3. Notice that the two ways of specifying inputs give the exact same
results.
2.6 The apply family of functions

The apply family of functions can replace explicit for loops when operating over
the rows or columns of a matrix. Consider the following example.

x <- matrix(1:9, nrow = 3, ncol = 3)
y <- c()
for (i in 1:3) {
  y[i] <- sum(x[i, ])
}
y

## [1] 12 15 18

The code first defines a matrix x and an empty vector y (recall that this is bad
practice, but for this example it does not matter). Then there is a for cycle
which assigns to the i-th entry of y the sum of the entries of the i-th row of x.
So the vector y includes the row-totals.
For this simple example the for cycle is extremely quick, but this is just to
illustrate how we can replace it using the apply function.
apply(x, 1, sum)
## [1] 12 15 18
Let’s look at the above code. The first input of apply is the object we want
to operate upon, in this case the matrix x. The second input specifies if the
operation has to act over the rows of the matrix (input equal to 1) or over the
columns (input equal to 2). The third input is the operation we want to use, in
this case sum.
Besides being faster, the above code is also a lot more compact than using a for
loop.
The following example computes the mean of each column of x.
apply(x, 2, mean)
## [1] 2 5 8
Consider again our function new.function which computes the sum of the
square of a number x with another number y.
Suppose that we want to compute such a sum for all numbers x from 1 to 10.
Suppose that y is chosen as 2. We can achieve this with a for cycle as follows.
x <- 1:10
z <- c()
for (i in 1:10){
z[i] <- new.function(x[i],2)
}
z
## [1] 3 6 11 18 27 38 51 66 83 102
The same result can be achieved much more compactly with sapply.

x <- 1:10
sapply(x, new.function, y = 2)
## [1] 3 6 11 18 27 38 51 66 83 102
We can also define the function to apply directly within the sapply call, a
so-called anonymous function.

x <- 1:10
sapply(x, function(i) i^2 + 2)
## [1] 3 6 11 18 27 38 51 66 83 102
So we defined the vector x and we want to apply the function defined within
sapply multiple times: once for each entry in the vector x.
2.7 The pipe operator

Consider the following code, which computes the mean of the logarithm of the
absolute values of a vector.

x <- -5:-1
mean(log(abs(x)))
## [1] 0.9574983
Such nested code where we apply multiple functions over the same line of code
becomes cluttered and difficult to read.
For this reason the package magrittr introduces the so-called pipe operator %>%
which makes the above code much more readable. Consider the same example
using the pipe operator.
library(magrittr)
x <- -5:-1
x %>% abs() %>% log() %>% mean()
## [1] 0.9574983
The above code can be seen as follows: consider the vector x and apply the
function abs over its entries. Then apply the function log over the resulting
vector and last apply the function mean.
The code is equivalent to standard R but it is simpler to read. So sometimes it
is preferable to code using pipes instead of standard R syntax.
2.8 Plotting
R has great plotting capabilities. Details about plotting functions and a dis-
cussion of when different representations are most appropriate are beyond the
scope of these notes. This is just to provide you with a list of functions:
• barplot creates a barplot: notice that you first need to construct a so-
called contingency table using the function table.
• hist creates a histogram;
• boxplot creates a boxplot;
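A minimal sketch of these functions in action (illustrative data):

x <- rnorm(100)                         # some example data
hist(x)                                 # histogram
boxplot(x)                              # boxplot
barplot(table(c("a", "b", "b", "c")))   # barplot of a contingency table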
There are many functions to customize such plots, and again details can be
found in the references given. A package which is often used to create nice data
visualization is ggplot2.
Chapter 3
Probability Basics
3.1 Discrete Random Variables

Let's dissect the name "discrete random variable":

• variable: this means that there is some process that takes some value. It is
a synonym of function as you have studied in other mathematics classes.
• random: this means that the variable takes values according to some prob-
ability distribution.
• discrete: this refers to the possible values that the variable can take. In
this case it is a countable (possibly infinite) set of values.
So for any outcome 𝑥 ∈ 𝕏 the pmf describes the likelihood of that outcome
happening.
Recall that pmfs must obey two conditions:

1. $p(x) \geq 0$ for all $x \in \mathbb{X}$;
2. $\sum_{x \in \mathbb{X}} p(x) = 1$.
So the pmf associated to each outcome is a non-negative number such that the
sum of all these numbers is equal to one.
Let’s consider an example at this stage. Suppose a biased dice is thrown such
that the numbers 3 and 6 are twice as likely to appear than the other numbers.
A pmf describing such a situation is the following:
𝑥      1     2     3     4     5     6
𝑝(𝑥)   1/8   1/8   2/8   1/8   1/8   2/8
It is apparent that all numbers 𝑝(𝑥) are non-negative and that their sum is equal
to 1: so 𝑝(𝑥) is a pmf. Figure 3.1 gives a graphical visualization of such a pmf.
(Figure 3.1: barplot of the pmf of the biased dice; bars of height 1/8 at 1, 2, 4, 5 and 2/8 at 3 and 6)
Whilst you should have been already familiar with the concept of pmf, the next
concept may appear to be new. However, you have actually used it multiple
times when computing Normal probabilities with the tables.
We now define what is usually called the cumulative distribution function (or
cdf) of a random variable 𝑋. The cdf of 𝑋 at the point 𝑥 ∈ 𝕏 is

$$F(x) = P(X \leq x) = \sum_{y \leq x} p(y),$$

that is, the probability that 𝑋 is less than or equal to 𝑥, or equally the sum of
the pmf of 𝑋 over all values less than or equal to 𝑥.

Let's consider the dice example to illustrate the idea of cdf: for instance,
$F(2) = p(1) + p(2) = 2/8$, whilst $F(3) = p(1) + p(2) + p(3) = 4/8$.
We can compute in a similar way the cdf for any value 𝑥. A graphical visual-
ization of the resulting CDF is given in Figure 3.2.
(Figure 3.2: cdf of the biased dice, a step function increasing from 0 to 1 over the values 1 to 6)
The plot highlights some properties of cdfs which can be proved to hold in general
for any discrete cdf:

• it is non-decreasing;
• it tends to 0 for small values of 𝑥 and is equal to 1 for large values of 𝑥;
• it is a step function which jumps at the values in 𝕏.
3.1.3 Summaries
The pmf and the cdf fully characterize a discrete random variable 𝑋. Often
however we want to compress that information into a single number which still
retains some aspect of the distribution of 𝑋.
The expectation or mean of a random variable 𝑋, denoted as 𝐸(𝑋), is defined as

$$E(X) = \sum_{x \in \mathbb{X}} x\, p(x).$$
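For instance, for the biased dice the expectation can be computed in R as follows
(a small sketch):

x <- 1:6
p <- c(1/8, 1/8, 2/8, 1/8, 1/8, 2/8)
sum(x * p)

## [1] 3.75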
Similarly, the variance of 𝑋, denoted as 𝑉(𝑋), is defined as

$$V(X) = \sum_{x \in \mathbb{X}} (x - E(X))^2\, p(x).$$

In general we will not compute the variance by hand. The following R code computes
the variance of the random variable associated to the biased dice.

mu <- sum(x * p)
sum((x - mu)^2 * p)

## [1] 2.9375
The standard deviation of the discrete random variable 𝑋 is the square root of
𝑉 (𝑋).
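For the biased dice, for instance:

sqrt(2.9375)

## [1] 1.713914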
3.2 Notable Discrete Variables

The simplest notable discrete random variable is the Bernoulli, which takes
value 1 with probability 𝜃 and value 0 with probability 1 − 𝜃, where 𝜃 is the
parameter of the distribution. Figure 3.3 reports the pmf and the cdf of a
Bernoulli random variable with parameter 0.3.
Figure 3.3: PMF (left) and CDF (right) of a Bernoulli random variable with
parameter 0.3
Let’s think of tossing a coin 𝑛 times. Then we would expect that the probability
of showing heads is the same for all tosses and that the result of previous tosses
does not affect others. So this situation appears to meet the above assumptions
and can be modeled by what we call a Binomial random variable.
Formally, the random variable 𝑋 is a Binomial random variable with parameters
𝑛 and 𝜃 if it denotes the number of successes of 𝑛 independent Bernoulli random
variables, all with parameter 𝜃.
The pmf of a Binomial random variable with parameters 𝑛 and 𝜃 can be written
as:

$$p(x) = \begin{cases} \dbinom{n}{x}\theta^x(1-\theta)^{n-x}, & x = 0, 1, \dots, n \\ 0, & \text{otherwise} \end{cases}$$
Let’s try and understand the formula by looking term by term.
The Bernoulli distribution can be seen as a special case of the Binomial where
the parameter 𝑛 is fixed to 1.
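For instance, a Bernoulli pmf can be evaluated in R through dbinom with size
fixed to 1 (a small sketch):

dbinom(1, size = 1, prob = 0.3)

## [1] 0.3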
We will not show why this is the case, but the expectation and the variance of
the Binomial random variable with parameters 𝑛 and 𝜃 can be derived as

$$E(X) = n\theta, \qquad V(X) = n\theta(1 - \theta).$$
(Figure 3.4: pmfs of two Binomial random variables with n = 10)
R provides an implementation of the Binomial random variable with the functions
dbinom and pbinom whose details are as follows:

• first argument is the value at which to compute the pmf or the cdf;
• size is the parameter 𝑛 of the Binomial;
• prob is the parameter 𝜃 of the Binomial.
So for instance

dbinom(3, size = 10, prob = 0.5)

## [1] 0.1171875

computes the pmf at the point 3 of a Binomial with parameters 𝑛 = 10 and
𝜃 = 0.5, whilst

pbinom(8, size = 20, prob = 0.2)

## [1] 0.9900182

computes the cdf at the point 8 of a Binomial with parameters 𝑛 = 20 and
𝜃 = 0.2.
The last class of discrete random variables we discuss is the so-called Poisson
distribution. Whilst for Bernoulli and Binomial we had an interpretation of
why the pmf took its specific form by associating it to independent binary
experiments each with an equal probability of success, for the Poisson there is
no such interpretation.
A discrete random variable 𝑋 has a Poisson distribution with parameter 𝜆 > 0
if its pmf is

$$p(x) = \begin{cases} \dfrac{e^{-\lambda}\lambda^x}{x!}, & x = 0, 1, 2, 3, \dots \\ 0, & \text{otherwise} \end{cases}$$
So the sample space of a Poisson random variable is the set of all non-negative
integers.
One important characteristic of the Poisson distribution is that its mean and
variance are equal to the parameter 𝜆, that is
𝐸(𝑋) = 𝑉 (𝑋) = 𝜆.
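As a quick empirical sketch (using rpois, which simulates Poisson observations;
both summaries below should be close to λ = 4):

set.seed(2021)
x <- rpois(100000, lambda = 4)
mean(x)   # close to 4
var(x)    # also close to 4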
Figure 3.5: PMF of a Poisson random variable with parameter 1 (left) and 4
(right)
R provides an implementation of the Poisson random variable with the functions
dpois and ppois whose details are as follows:

• first argument is the value at which to compute the pmf or the cdf;
• lambda is the parameter 𝜆 of the Poisson.
So for instance
dpois(3, lambda = 1)
## [1] 0.06131324
ppois(8, lambda = 4)
## [1] 0.9786366
We next consider two examples to see in practice the use of the Binomial and
Poisson distributions.
A recent survey indicated that 82% of single women aged 25 years old will be
married in their lifetime. Let's compute several probabilities about how many
women in a sample will be married.
The above situation can be modeled by a Binomial random variable where the
parameter 𝑛 depends on the question and 𝜃 = 0.82.
The first question requires us to compute 𝑃 (𝑋 ≤ 3) = 𝐹 (3) where 𝑋 is Binomial
with parameters 𝑛 = 10 and 𝜃 = 0.82. Using R

pbinom(3, size = 10, prob = 0.82)

## [1] 0.0004400767
Using R
## [1] 0.02003866
For the third question, notice that saying that two or three women out of 20 will
never be married is the same as saying that 18 or 17 will be married. Therefore
we need to compute 𝑃 (𝑋 = 17) + 𝑃 (𝑋 = 18) = 𝑝(17) + 𝑝(18), where 𝑋 is a
Binomial random variable with parameters 𝑛 = 20 and 𝜃 = 0.82. Using R

dbinom(17, 20, 0.82) + dbinom(18, 20, 0.82)

## [1] 0.4007631
A stuntman injures himself an average of three times a year. Use the Poisson
probability formula to calculate the probability that he will be injured:
• 4 times a year
• Less than twice this year.
• More than three times this year.
dpois(4, lambda = 3)

## [1] 0.1680314
ppois(1,lambda=3)
## [1] 0.1991483
1 - ppois(2, lambda = 3)
## [1] 0.5768099
3.3 Continuous Random Variables

Our attention now turns to continuous random variables. These are in general
more technical and less intuitive than discrete ones. You should not worry about
all the technical details, since these are in general not important, and focus on
the interpretation.
A continuous random variable 𝑋 is a random variable whose sample space 𝕏 is
an interval or a collection of intervals. In general 𝕏 may coincide with the set of
real numbers ℝ or some subset of it. Examples of continuous random variables:
• the current temperature in the city of Madrid: it can be any real number;
Whilst for discrete random variables we considered summations over the ele-
ments of 𝕏, i.e. ∑𝑥∈𝕏 , for continuous random variables we need to consider
integrals over appropriate intervals.
You should be more or less familiar with these from previous studies of calculus.
But let’s give an example. Consider the function 𝑓(𝑥) = 𝑥2 computing the
squared of a number 𝑥. Suppose we are interested in this function between the
values -1 and 1, which is plotted by the red line in Figure 3.6. Consider the so-
1
called integral ∫−1 𝑥2 𝑑𝑥: this coincides with the area delimited by the function
1
and the x-axis. In Figure 3.6 the blue area is therefore equal to ∫−1 𝑥2 𝑑𝑥.
We will not be interested in computing integrals ourselves, so if you do not
know/remember how to do it, there is no problem!
Discrete random variables are easy to work with in the sense that there exists a
function, which we called the probability mass function, such that 𝑝(𝑥) = 𝑃 (𝑋 = 𝑥),
that is, the value of that function at the point 𝑥 is exactly the probability that
𝑋 = 𝑥.
Figure 3.6: Plot of the squared function and the area under its curve
Therefore we may wonder if this is true for a continuous random variable too.
Sadly, the answer is no and probabilities for continuous random variables are
defined in a slightly more involved way.
Let 𝑋 be a continuous random variable with sample space 𝕏. The probability
that 𝑋 takes values in the interval [𝑎, 𝑏] is given by
$$P(a \leq X \leq b) = \int_a^b f(x)\,dx,$$
where 𝑓(𝑥) is called the probability density function (pdf in short). Pdfs, just
like pmfs, must obey two conditions:

1. $f(x) \geq 0$ for all $x$;
2. $\int_{\mathbb{X}} f(x)\,dx = 1$.
So in the discrete case the pmf is defined exactly as the probability. In the
continuous case the pdf is the function such that its integral is the probability
that random variable takes values in a specific interval.
As a consequence of this definition notice that for any specific value $x_0 \in \mathbb{X}$,
$P(X = x_0) = 0$ since

$$\int_{x_0}^{x_0} f(x)\,dx = 0.$$
Let's consider an example: suppose the waiting time (in minutes) at the donut
shop has pdf $f(x) = \frac{1}{4}e^{-x/4}$ for $x \geq 0$, and 0 otherwise. The
pdf is drawn in Figure 3.7 by the red line. One can see that $f(x) \geq 0$ for
all $x \geq 0$ and one could also compute that it integrates to one.
Therefore the probability that the waiting time is between any two values $(a, b)$
can be computed as

$$\int_a^b \frac{1}{4}e^{-x/4}\,dx.$$
Figure 3.7: Probability density function for the waiting time in the donut shop
example
Its cdf can be derived as $F(x) = 1 - e^{-x/4}$ for $x \geq 0$, and it is plotted
in Figure 3.8.
Figure 3.8: Cumulative distribution function for the waiting time at the donut
shop
We can notice that the cdf has similar properties as in the discrete case: it is
non-decreasing, on the left-hand side it is zero and on the right-hand side it tends
to one.
In the continuous case, one can prove that cdfs and pdfs are related as

$$f(x) = \frac{d}{dx}F(x).$$
3.3.3 Summaries
Just as for discrete random variables, we may want to summarize some features
of a continuous random variable into a unique number. The same set of sum-
maries exists for continuous random variables, which are almost exactly defined
as in the discrete case (integrals are used instead of summations).
As in the discrete case, there are some types of continuous random variables
that are used frequently and therefore are given a name and their properties
are well-studied.
The first, and simplest, continuous random variable we study is the so-called
(continuous) uniform distribution. We say that a random variable 𝑋 is uni-
formly distributed on the interval [𝑎, 𝑏] if its pdf is

$$f(x) = \begin{cases} \dfrac{1}{b-a}, & a \leq x \leq b \\ 0, & \text{otherwise} \end{cases}$$
Figure 3.9: Probability density function for a uniform random variable with
parameters a = 2 and b = 6
By looking at the pdf we see that it is a flat, constant line between the values 𝑎
and 𝑏. This implies that the probability that 𝑋 takes values between two values
𝑥₀ and 𝑥₁ only depends on the length of the interval (𝑥₀, 𝑥₁).
Its cdf is

$$F(x) = \begin{cases} 0, & x < a \\ \dfrac{x-a}{b-a}, & a \leq x \leq b \\ 1, & x > b \end{cases}$$
(Figure 3.10: cdf of the uniform random variable with parameters a = 2 and b = 6)

R provides an implementation of the uniform random variable with the functions
dunif and punif. So for instance

dunif(5, min = 2, max = 6)

## [1] 0.25
computes the pdf at the point 5 of a uniform random variable with parameters
𝑎 = 2 and 𝑏 = 6.
Conversely,
punif(0.5)
## [1] 0.5
computes the cdf at the point 0.5 of a uniform random variable with parameters
𝑎 = 0 and 𝑏 = 1.
The second class of continuous random variables we will study are the so-called
exponential random variables. We have actually already seen such a random
variable in the donut shop example. More generally, we say that a continuous
random variable 𝑋 is exponential with parameter 𝜆 > 0 if its pdf is

$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x \geq 0 \\ 0, & \text{otherwise} \end{cases}$$
Figure 3.11 reports the pdf of exponential random variables for various choices
of the parameter 𝜆.
(Figure 3.11: pdfs of exponential random variables with λ = 0.25, 1 and 2.5)
Exponential random variables are central in dynamic simulations since they are
very often used to model interarrival times in a process: for instance, the time
between arrivals of customers at the donut shop.
Its cdf can be derived as
$$F(x) = \begin{cases} 0, & x < 0 \\ 1 - e^{-\lambda x}, & x \geq 0 \end{cases}$$

(Figure 3.12: cdfs of exponential random variables with λ = 0.25, 1 and 2.5)
Its expectation and variance are

$$E(X) = \frac{1}{\lambda}, \qquad V(X) = \frac{1}{\lambda^2}.$$

R provides an implementation of the exponential random variable with the
functions dexp and pexp whose details are as follows:

• first argument is the value at which to compute the pdf or the cdf;
• rate is the parameter 𝜆 of the exponential.
So for instance
dexp(2, rate = 3)
## [1] 0.007436257
computes the pdf at the point 2 of an exponential random variable with param-
eter 𝜆 = 3.
Conversely
pexp(4)
## [1] 0.9816844
computes the cdf at the point 4 of an exponential random variable with param-
eter 𝜆 = 1.
The last class of continuous random variables we consider is the so-called Normal
or Gaussian random variable. It is the most used and well-known random
variable in statistics and we will see why this is the case.
A continuous random variable 𝑋 is said to have a Normal distribution with
mean 𝜇 and variance 𝜎2 if its pdf is
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left(-\frac{1}{2}\,\frac{(x-\mu)^2}{\sigma^2}\right).$$

Recall that

$$E(X) = \mu, \qquad V(X) = \sigma^2,$$
and so the parameters have a straightforward interpretation in terms of mean
and variance.
Figure 3.13 shows the form of the pdf of the Normal distribution for various
choices of the parameters. On the left we have Normal pdfs for 𝜎2 = 1 and
various choices of 𝜇: we can see that 𝜇 shifts the plot on the x-axis. On the
right we have Normal pdfs for 𝜇 = 1 and various choices of 𝜎2 : we can see that
all distributions are centered around the same value while they have a different
spread/variability.
(Figure 3.13: Normal pdfs with σ² = 1 and μ = 0, 1, 2 (left), and with μ = 1 and σ² = 0.5, 1, 2 (right))
The Normal pdf is the well-known bell-shaped curve. We can notice some
properties:

• it is symmetric around the mean: the function on the left-hand side and
on the right-hand side of the mean is mirrored. This implies that the
median is equal to the mean;
• the maximum value of the pdf occurs at the mean. This implies that the
mode is equal to the mean (and therefore also to the median).
The cdf of the Normal for various choices of parameters is reported in Figure
3.14.
(Figure 3.14: cdfs of the Normal for the same choices of parameters as in Figure 3.13)
The cdf of the Normal is, as usual, the integral of the pdf. Unfortunately it is
not possible to solve such an integral analytically (as we did for example for
the Uniform and the Exponential), and in general it is approximated using
some numerical techniques. This is surprising considering that the Normal
distribution is so widely used!!!
However, notice that we would need to compute such an approximation for every
possible value of (𝜇, 𝜎2 ), depending on the distribution we want to use. This is
unfeasible to do in practice.
There is a trick here, that you must have used multiple times already. We
can transform a Normal 𝑋 with parameters 𝜇 and 𝜎2 to the so-called standard
Normal random variable 𝑍, and viceversa, using the relationship:
$$Z = \frac{X - \mu}{\sigma}, \qquad X = \mu + \sigma Z. \tag{3.1}$$
It can be shown that 𝑍 is a Normal random variable with parameters 𝜇 = 0 and
𝜎² = 1.
The values of the cdf of the standard Normal random variable then need to be
computed only once since 𝜇 and 𝜎2 are fixed. You have seen these numbers many
many times in what are usually called the tables of the Normal distribution.
As a matter of fact you have also computed many times the cdf of a generic
Normal random variable. First you computed 𝑍 using equation (3.1) and then
looked at the Normal tables to derive that number.
64 CHAPTER 3. PROBABILITY BASICS
Let’s give some details about the standard Normal. Its pdf is
$$\phi(z) = \frac{1}{\sqrt{2\pi}}\, \exp\left(-z^2/2\right).$$

It can be seen that it is the same as the pdf of the Normal with 𝜇 = 0 and
𝜎² = 1. Such a function is so important that it is given its own symbol 𝜙.
The cdf is

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, \exp\left(-x^2/2\right) dx.$$

Again this cannot be computed exactly: there is no closed-form expression. This
is why you had to look at the tables instead of using a simple formula. The cdf
of the standard Normal is also so important that it is given its own symbol Φ.
Instead of using the tables, we can use R to tell us the values of Normal proba-
bilities. R provides an implementation of the Normal random variable with the
functions dnorm and pnorm whose details are as follows:

• first argument is the value at which to compute the pdf or the cdf;
• mean is the parameter 𝜇 of the Normal;
• sd is the standard deviation √𝜎² of the Normal.
So for instance
dnorm(3)
## [1] 0.004431848
computes the value of the standard Normal pdf at the value three.
Similarly,
pnorm(0.4, 1, 0.5)

## [1] 0.1150697

computes the value of the Normal cdf with mean 𝜇 = 1 and standard deviation
0.5 (that is, √𝜎² = 0.5) at the value 0.4.
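Consistently with equation (3.1), the same value can be obtained by standardizing
first:

pnorm((0.4 - 1) / 0.5)   # Φ(z) with z = (x − μ)/σ = −1.2

## [1] 0.1150697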
3.5 The Central Limit Theorem

Loosely speaking, the Central Limit Theorem states that the sample mean
$\bar{X}_n$ of $n$ independent and identically distributed random variables
tends, as $n$ grows, to a Normal random variable $Y$:

$$\lim_{n \to +\infty} \bar{X}_n = Y$$

Chapter 4

Random Number Generation
At the heart of any simulation model is the capability of creating numbers
that mimic those we would expect in real life. In simulation modeling we will
assume that specific processes are distributed according to a specific random
variable. For instance, we will assume that an employee in a donut shop takes
a random time to serve customers, distributed according to a Normal random
variable with mean 𝜇 and variance 𝜎². In order to then carry out a simulation
the computer will need to generate random serving times. This corresponds to
simulating numbers that are distributed according to a specific distribution.
Let’s consider an example. Suppose you managed to generate two sequences
of numbers, say x1 and x2. Your objective is to simulate numbers from a
Normal distribution. The histograms of the two sequences are reported in Figure
4.1 together with the estimated shape of the density. Clearly the sequence x1
could be following a Normal distribution, since it is bell-shaped and reasonably
symmetric. On the other hand, the sequence x2 is not symmetric at all and
does not resemble the density of a Normal.
In this chapter we will learn how to characterize randomness in a computer and
how to generate numbers that appear to be random realizations of a specific
random variable. We will also learn how to check if a sequence of values can be
a random realization from a specific random variable.
(Figure 4.1: histograms of the sequences x1 and x2, with the estimated density curves)
The building block of random number generation is the continuous uniform
distribution between zero and one. From the previous chapter, you should
remember that such a random variable has pdf

$$f(x) = \begin{cases} 1, & 0 \leq x \leq 1 \\ 0, & \text{otherwise} \end{cases}$$
and cdf

$$F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \leq x \leq 1 \\ 1, & x > 1 \end{cases}$$
These two are plotted in Figure 4.2.
Figure 4.2: Pdf (left) and cdf (right) of the continuous uniform between zero
and one.
The histogram on the left of Figure 4.3 looks reasonably flat, as we would expect
for uniform numbers, whilst the one on the right clearly does not (it is far from
being flat), and therefore it is hard to believe that such numbers follow a uniform
distribution.
Figure 4.3: Histograms from two sequences of numbers between zero and one.
Consider for instance the following sequence of numbers:

0.25 0.72 0.18 0.63 0.49 0.88 0.23 0.78 0.02 0.52
We can notice that numbers below and above 0.5 are alternating in the se-
quence. We would therefore believe that after a number less than 0.5 it is
much more likely to observe a number above it. This breaks the assumption of
independence.
A good random number generator should satisfy various properties; for instance:

• the cycle of randomly generated numbers should be long. The cycle is the
length of the sequence before numbers start to repeat themselves.
In R the so-called seed of the random number generator can be fixed with the
function set.seed.

set.seed(2021)
This ensures that every time the code following set.seed is run, the same
results will be observed. We will give below examples of this.
R has all the capabilities to generate such numbers. This can be done with the
function runif, which takes one input: the number of observations to generate.
So for instance:
set.seed(2021)
runif(10)
generates ten random numbers between zero and one. Notice that if we repeat
the same code we get the same result since we fixed the so-called seed of the
simulation.
set.seed(2021)
runif(10)
Conversely, if we were to simply run the code runif(10) we would get a different
result.
runif(10)
A classic algorithm to generate pseudo-random numbers is the so-called linear
congruential generator: starting from a seed $x_0$, a sequence of integers is
constructed via

$$x_{i+1} = (a\,x_i + c) \bmod m,$$

and the integers are rescaled into numbers between zero and one as

$$u_i = x_i/m.$$

It can be shown that the method works well for specific choices of 𝑎, 𝑐 and 𝑚,
which we will not discuss here.
Let’s look at an implementation.
We can see that this specific choice of parameters is quite bad: it has cycle 4!
After 4 numbers the sequence repeats itself and we surely would not like to use
this in practice.
In general you should not worry about these issues: R does things properly for you!

A simple first method to check if the numbers are uniform is to create a histogram
of the data and to see if the histogram is reasonably flat. We already
saw how to assess this, but let's check if runif works well. Simple histograms
can be created in R using hist (or if you want you can use ggplot).
set.seed(2021)
u <- runif(5000)
hist(u)
(Histogram of u: the frequencies are reasonably flat across the interval [0, 1])
We can see that the histogram is reasonably flat and therefore the assumption
of uniformity seems to hold.
Although the histogram is quite informative, it is not a formal method. We
could on the other hand look at tests of hypotheses of this form:

$$H_0: u_1, \dots, u_N \text{ are uniformly distributed}$$
$$H_a: u_1, \dots, u_N \text{ are not uniformly distributed}$$
The null hypothesis is thus that the numbers are indeed uniform, whilst the
alternative states that the numbers are not. If we reject the null hypothesis,
which happens if the p-value of the test is very small (or smaller than a critical
value 𝛼 of our choice), then we would believe that the sequence of numbers is
not uniform.
There are various ways to carry out such a test, but we will consider here only
one: the so-called Kolmogorov-Smirnov Test. We will not give all details of this
test, but only its interpretation and implementation.
In order to understand how the test works we need to briefly introduce the
concept of empirical cumulative distribution function or ecdf. The ecdf $\hat{F}$
is the function that at each point $x$ returns the proportion of observations in
the sample that are less than or equal to $x$.

(ecdf of a vector u of five numbers between zero and one)
For instance, since there are 3 numbers out of 5 in the vector u that are less
than 0.7, then 𝐹 ̂ (0.7) = 3/5.
The idea behind the Kolmogorov-Smirnov test is to quantify how similar the
ecdf computed from a sequence of data is to the one of the uniform distribution
which is represented by a straight line (see Figure 4.2).
As an example consider Figure 4.6. The step functions are computed from two
different sequences of numbers between zero and one, whilst the straight line
is the cdf of the uniform distribution. By looking at the plots, we would more
strongly believe that the sequence in the left plot is uniformly distributed, since
the step function is much closer to the theoretical straight line.
The Kolmogorov-Smirnov test formally embeds this idea of similarity between
the ecdf and the cdf of the uniform in a test of hypothesis. The function ks.test
implements this test in R. For the two sequences in Figure 4.6, u1 (left
plot) and u2 (right plot), the test can be implemented as follows:
ks.test(u1,"punif")
##
## One-sample Kolmogorov-Smirnov test
##
4.4. TESTING RANDOMNESS 75
1.00 1.00
0.75 0.75
ECDF
ECDF
0.50 0.50
0.25 0.25
0.00 0.00
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
u1 u2
Figure 4.6: Comparison between ecdf and cdf of the uniform for two sequences
of numbers
## data: u1
## D = 0.11499, p-value = 0.142
## alternative hypothesis: two-sided
ks.test(u2,"punif")
##
## One-sample Kolmogorov-Smirnov test
##
## data: u2
## D = 0.56939, p-value < 2.2e-16
## alternative hypothesis: two-sided
From the results, the p-value of the test for the sequence u1 is 0.142 and
so we would not reject the null hypothesis that the sequence is uniformly
distributed. On the other hand, the test over the sequence u2 has an
extremely small p-value, therefore suggesting that we reject the null hypothesis
and conclude that the sequence is not uniformly distributed. This confirms the
intuition from the plots in Figure 4.6.
We can similarly construct a test of hypothesis to check that the numbers in
the sequence are independent of each other:

$$H_0: u_1, \dots, u_N \text{ are independent}$$
$$H_a: u_1, \dots, u_N \text{ are not independent}$$

A useful graphical tool for this purpose is the autocorrelation function, which
measures the correlation between elements of the sequence at various distances
(so-called lags) from each other. It is computed in R with the function acf.

set.seed(2021)
u1 <- runif(200)
acf(u1)
(Figure 4.7: autocorrelations of the sequence u1 at lags 0 to 20)
The bars in Figure 4.7 are the autocorrelations at various lags, whilst the dashed
blue lines are confidence bands: if a bar is within the bands it means that we
cannot reject the hypothesis that the autocorrelation of the associated lag is
equal to zero. Notice that the first bar is lag 0: it computes the correlation
for the sample (𝑢1 , 𝑢1 ), (𝑢2 , 𝑢2 ), … , (𝑢𝑁 , 𝑢𝑁 ) and therefore it is always equal to
one. You should never worry about this bar. Since all the bars are within the
confidence bands, we believe that all autocorrelations are not different from zero
and consequently that the data is independent (it was indeed generated using
runif).
Figure 4.8 reports the autocorrelations of a sequence of numbers which is not
independent. Although the histogram shows that the data is uniformly dis-
tributed, we would not believe that the sequence is of independent numbers
since autocorrelations are very large and outside the bands.
(Figure 4.8: histogram and autocorrelations of the sequence u2)

A formal test of independence based on the autocorrelations is the so-called
Box-Pierce test, implemented in R by the function Box.test.

Box.test(u1, lag = 5)
##
## Box-Pierce test
##
## data: u1
## X-squared = 4.8518, df = 5, p-value = 0.4342
Box.test(u2, lag = 5)
##
## Box-Pierce test
##
## data: u2
## X-squared = 807.1, df = 5, p-value < 2.2e-16
Here we chose a lag up to 5 (it is usually not useful to consider larger lags).
The test confirms our observations of the autocorrelations. For the sequence
u1 generated with runif the test has a high p-value and therefore we cannot
4.5. RANDOM VARIATE GENERATION 79
reject the hypothesis of independence. For the second sequence u2 which had
very large autocorrelations the p-value is very small and therefore we reject the
hypothesis of independence.
In the next few sections we will learn results that allow for the simulation of
random observations from generic distributions. No matter how the methods
work, they have a very simple and straightforward implementation in R.
We have already learned that we can simulate observations from the uniform
between zero and one using the code runif(N) where N is the number of obser-
vations to simulate. We can notice that it is similar to the commands dunif
and punif we have already seen for the pdf and cdf of the uniform.
Not surprisingly, we can generate observations from any random variable using
the syntax r followed by the name of the variable chosen. So for instance:

• rbinom simulates observations from a Binomial;
• rpois simulates observations from a Poisson;
• runif simulates observations from a uniform;
• rexp simulates observations from an exponential;
• rnorm simulates observations from a Normal.
Each of these functions takes as first input the number of observations that we
want to simulate. They then have additional inputs that can be given, which
depend on the random variable chosen and are the same that we saw in the
past.
So for instance
rnorm(10, mean = 1, sd = 2)
generates ten observations from a Normal distribution with mean 1 and standard
deviation 2.
The so-called inverse transform method allows us to simulate observations from
a random variable whose cdf 𝐹 is:

• continuous;
• strictly increasing.

The method is based on the following fact: if 𝑈 is a uniform random variable
between zero and one, then $F^{-1}(U)$ is distributed according to the cdf 𝐹.
Let's see how it works for the exponential distribution, whose cdf is
$F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$.
1. First we need to compute the inverse of 𝐹 . This means solving the equa-
tion:
$$1 - e^{-\lambda x} = u$$

for 𝑥. This can be done following the steps:

$$\begin{aligned} 1 - e^{-\lambda x} &= u \\ e^{-\lambda x} &= 1 - u \\ -\lambda x &= \log(1 - u) \\ x &= -\frac{1}{\lambda}\log(1 - u) \end{aligned}$$

2. Then we simulate observations from a uniform between zero and one and
apply the inverse $F^{-1}$ to each of them.
set.seed(2021)
# Define inverse function
invF <- function(u,lambda) -log(1-u)/lambda
# Simulate 5 uniform observations
u <- runif(5)
# Compute the inverse
invF(u, lambda = 2)
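As a quick sanity check, a sketch reusing the ks.test function from earlier
(pexp with rate 2 is the target cdf):

u <- runif(1000)
x <- invF(u, lambda = 2)
ks.test(x, pexp, 2)   # a large p-value is expected: x compatible with Exp(2)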
We know how to simulate uniformly between 0 and 1, but we do not know how
to simulate uniformly between two generic values 𝑎 and 𝑏.
Recall that the cdf of the uniform distribution between 𝑎 and 𝑏 is
𝑥−𝑎
𝐹 (𝑥) = , for 𝑎 ≤ 𝑥 ≤ 𝑏
𝑏−𝑎
The inverse transform method requires the inverse of 𝐹, which using simple
algebra can be computed as

$$F^{-1}(u) = a + (b - a)u.$$

So given uniform observations $u_1, \dots, u_N$ between zero and one, we can
construct observations from a uniform between 𝑎 and 𝑏 as

$$x_1 = a + (b - a)u_1, \quad \dots, \quad x_N = a + (b - a)u_N.$$
In R:
set.seed(2021)
a <- 2
b <- 6
a + (b-a)*runif(5)
The code simulates five observations from a Uniform between two and six. This
can be equally achieved by simply using:
set.seed(2021)
runif(5, min = 2, max = 6)
Notice that since we fixed the seed, the two methods return exactly the same
sequence of numbers.
To simulate a Bernoulli random variable with parameter 𝜃, we can simulate a
uniform number u between zero and one and transform it into either 0 or 1
depending on whether u falls below the threshold theta. We will not prove that
this actually works, but it intuitively does. Let's code it in R.
set.seed(2021)
theta <- 0.5
u <- runif(5)
x <- ifelse(u < theta, 0, 1)
x
## [1] 0 1 1 0 1
So here we simulated five observations from a Bernoulli with parameter 0.5: the
toss of a fair coin. Three times the coin showed heads, and twice tails.

From this comment, it is easy to see how to simulate one observation from
a Binomial: by simply summing randomly generated observations from
Bernoullis. So if we were to sum the five numbers above, we would get one
random observation from a Binomial with parameters 𝑛 = 5 and 𝜃 = 0.5.
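As a one-line sketch (reusing theta from above):

sum(ifelse(runif(5) < theta, 0, 1))   # one Binomial(5, 0.5) observation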
There are many other algorithms that allow us to simulate specific as well as
generic random variables. Since these are a bit more technical we will not consider
them here, but it is important for you to know that we can now simulate basically
any random variable you are interested in!
4.6 Testing generic simulation sequences

For uniform random numbers we checked:

1. if the random sequence had the same distribution as the theoretical one
(in previous sections the Uniform between zero and one);
2. if the sequence was of independent numbers.
We will see that the tools to perform these steps are basically the same.
There are various ways to check if the random sequence of observations has the
same distribution as the theoretical one.
4.6.1.1 Histogram
ggplot(x1, aes(x1)) +
geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, aes(colour = 'Normal')) +
geom_histogram(aes(y = ..density..), alpha = 0.4) +
scale_colour_manual(name = 'Density', values = c('red', 'blue')) +
theme_bw()
Figure 4.9: Histogram of the sequence x1 together with theoretical pdf of the
standard Normal
Figure 4.9 reports the histogram of the sequence x1 together with a smooth
estimate of the histogram, often called density plot, in the red line. The blue
line denotes the theoretical pdf of the standard Normal distribution. We can
see that the sequence seems to follow quite closely a Normal distribution and
therefore we could be convinced that the numbers are indeed Normal.
Let’s consider a different sequence x2. Figure 4.10 clearly shows that there
is a poor fit between the sequence and the standard Normal distribution. So
we would in general not believe that these observations came from a Standard
Normal.
ggplot(x2, aes(x2)) +
geom_line(aes(y = ..density.., colour = 'Empirical'), stat = 'density') +
stat_function(fun = dnorm, aes(colour = 'Normal')) +
geom_histogram(aes(y = ..density..), alpha = 0.4) +
scale_colour_manual(name = 'Density', values = c('red', 'blue')) +
theme_bw()
Figure 4.10: Histogram of the sequence x2 together with theoretical pdf of the
standard Normal
4.6.1.2 Empirical cdf

We have already seen for uniform numbers that we can use the empirical cdf to
assess if a sequence of numbers is uniformly distributed. We can use the exact
same method for any other distribution.
Figure 4.11 reports the ecdf of the sequence of numbers x1 (in red) together
with the theoretical cdf of the standard Normal (in blue). We can see that the
two functions match closely and therefore we could assume that the sequence is
distributed as a standard Normal.
ggplot(x1, aes(x1)) +
stat_ecdf(geom = "step",aes(colour = 'Empirical')) +
stat_function(fun = pnorm,aes(colour = 'Theoretical')) +
theme_bw() +
scale_colour_manual(name = 'Density', values = c('red', 'blue'))
Figure 4.12 reports the same plot but for the sequence x2. The two lines strongly
differ and therefore it cannot be assumed that the sequence is distributed as a
standard Normal.
Figure 4.11: Empirical cdf the sequence x1 together with theoretical cdf of the
standard Normal
ggplot(x2, aes(x2)) +
stat_ecdf(geom = "step",aes(colour = 'Empirical')) +
stat_function(fun = pnorm,aes(colour = 'Theoretical')) +
theme_bw() +
scale_colour_manual(name = 'Density', values = c('red', 'blue'))
Figure 4.12: Empirical cdf the sequence x2 together with theoretical cdf of the
standard Normal
4.6.1.3 QQ-Plot

Another widely used graphical tool is the so-called qq-plot. It consists of
a series of points, where each point is associated to a number in our random
sequence, and a line, which describes the theoretical distribution we are targeting.
The closer the points and the line are, the better the fit to that distribution.
In particular, in Figure 4.13 we are checking if the sequence x1 is distributed
according to a standard Normal (represented by the straight line). Since the
points are placed almost in a straight line over the theoretical line of the standard
Normal, we can assume the sequence to be Normal.
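The code producing these qq-plots is not included in this extract; a minimal
sketch using ggplot2, assuming as before that x1 is a data frame with a column
x1 (the distribution defaults to the standard Normal):
ggplot(x1, aes(sample = x1)) +
  stat_qq() +          # points: empirical quantiles of the sequence
  stat_qq_line() +     # line: quantiles of the theoretical standard Normal
  theme_bw()
Checking against another distribution only requires changing the distribution
argument, e.g. stat_qq(distribution = stats::qexp, dparams = list(rate = 3))
for an Exponential with 𝜆 = 3.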
Figure 4.13: QQ-plot for the sequence x1 checking against the standard Normal
Figure 4.14 reports the qq-plot for the sequence x2, to check if the data could
follow a standard Normal. The points do not differ too much from the straight
line, and in this case we could assume the data to be Normal (notice, however,
that the histogram and the cdf strongly suggested that this sequence was not
Normal).
Notice that the form of the qq-plot depends not only on the sequence of
numbers we are considering, but also on the distribution we are testing it against.
The last plot below reports the qq-plot for the sequence x1 when checked against
an Exponential random variable with parameter 𝜆 = 3. Given that the sequence
also includes negative numbers, it does not make sense to check if it is distributed
as an Exponential (which can only model non-negative data), but this is just
an illustration.
Figure 4.14: QQ-plot for the sequence x2 checking against the standard Normal
[QQ-plot for the sequence x1 checked against an Exponential distribution with 𝜆 = 3.]
The above plots are highly informative since they provide insights into the shape
of the data distribution, but they are not formal. Again, we can carry out
hypothesis tests to check if the data is distributed as a specific random variable,
just like we did for the Uniform.
Again, there are many tests one could use, but here we focus only on the
Kolmogorov-Smirnov test, which checks how close the empirical and the
theoretical cdfs are. It is implemented in the ks.test R function.
Let’s check if the sequences x1 and x2 are distributed as a standard Normal.
ks.test(x1,pnorm)
##
## One-sample Kolmogorov-Smirnov test
##
## data: x1
## D = 0.041896, p-value = 0.3439
## alternative hypothesis: two-sided
ks.test(x2,pnorm)
##
## One-sample Kolmogorov-Smirnov test
##
## data: x2
## D = 0.2997, p-value < 2.2e-16
## alternative hypothesis: two-sided
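For x1 the p-value is large and we do not reject the null hypothesis that the
sequence is distributed as a standard Normal, whilst for x2 the p-value is
extremely small and we reject it. We can also test against distributions other
than the standard Normal. The command producing the output below is not
shown in this extract, but given the interpretation that follows it was presumably
a test of x1 against a Normal with mean two and standard deviation two:
ks.test(x1, pnorm, 2, 2)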
##
## One-sample Kolmogorov-Smirnov test
##
## data: x1
## D = 0.54604, p-value < 2.2e-16
## alternative hypothesis: two-sided
The p-value is small and therefore we would reject the null hypothesis that the
sequence is distributed as a Normal with mean two and standard deviation two.
Finally, as for uniform sequences, we can assess whether the observations are
independent by looking at their autocorrelations.
acf(x1)
acf(x2)
[Autocorrelation functions of the sequences x1 (top) and x2 (bottom).]
Let’s run the Box test to assess if the assumption of independence is tenable for
both sequences.
Box.test(x1, lag = 5)
##
## Box-Pierce test
##
## data: x1
## X-squared = 8.4212, df = 5, p-value = 0.1345
4.6. TESTING GENERIC SIMULATION SEQUENCES 91
Box.test(x2, lag = 5)
##
## Box-Pierce test
##
## data: x2
## X-squared = 4.2294, df = 5, p-value = 0.5169
In both cases the p-values are larger than 0.10, thus we would not reject the null
hypothesis of independence for either sequence. Recall that x1 is distributed as
a standard Normal, whilst x2 is not.
For the sequence x1 we observed that one bar was slightly outside the confidence
bands: this sometimes happens even when data is actually (pseudo-) random -
I created x1 using rnorm. The autocorrelations below are an instance of a case
where independence is not tenable since we see that multiple bars are outside
the confidence bands.
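How the sequence plotted below was generated is not shown in this extract.
One simple way to obtain such an autocorrelated sequence is to simulate from
an autoregressive process; this is an illustrative assumption, not necessarily the
code used here:
set.seed(2021)
x <- arima.sim(model = list(ar = 0.9), n = 1000)  # AR(1) process: strongly autocorrelated
acf(x)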
[Autocorrelation function of a sequence x for which independence is not tenable.]
Chapter 5
Monte Carlo Simulation
The previous chapters laid the foundations of probability and statistics that
now allow us to carry out meaningful simulation experiments. In this chapter
we start looking at non-dynamic simulations which are often referred to as Monte
Carlo simulations.
For example, drawing pseudo-random numbers uniformly between 0 and 1, and
labelling a value less than or equal to 0.50 as heads and greater than 0.50 as tails,
is a Monte Carlo simulation of the behavior of repeatedly tossing a coin.
The main idea behind this method is that a phenomenon is simulated multiple
times on a computer using random-number generation, and the results are
aggregated to provide statistical summaries associated with the phenomenon.
Sawilowsky lists the characteristics of a high-quality Monte Carlo simulation:
• the (pseudo-random) number generator has certain characteristics (e.g. a
long period before the sequence repeats);
• the (pseudo-random) number generator produces values that pass tests for
randomness;
• there are enough samples to ensure accurate results;
• the proper sampling technique is used;
• the algorithm used is valid for what is being modeled;
• it simulates the phenomenon in question.
The method was developed in the 1940s; Stanislaw Ulam recalled how the idea
came to him:
The first thoughts and attempts I made to practice [the Monte Carlo
Method] were suggested by a question which occurred to me in 1946
as I was convalescing from an illness and playing solitaires. The
question was what are the chances that a Canfield solitaire laid out
with 52 cards will come out successfully? After spending a lot of
time trying to estimate them by pure combinatorial calculations, I
wondered whether a more practical method than “abstract thinking”
might not be to lay it out say one hundred times and simply observe
and count the number of successful plays. This was already possible
to envisage with the beginning of the new era of fast computers,
and I immediately thought of problems of neutron diffusion and
other questions of mathematical physics, and more generally how to
change processes described by certain differential equations into an
equivalent form interpretable as a succession of random operations.
Later [in 1946], I described the idea to John von Neumann, and we
began to plan actual calculations.
Being secret, the work of von Neumann and Ulam required a code name. A
colleague of von Neumann and Ulam, Nicholas Metropolis, suggested using the
name Monte Carlo, which refers to the Monte Carlo Casino in Monaco where
Ulam’s uncle would borrow money from relatives to gamble.
Consider a circle of radius 𝑟 and imagine that it is inscribed within a square,
which therefore has side 2𝑟 (equal to the diameter of the circle).
What is the probability that if I choose a random point inside the square, it
will also be inside the circle? If I choose any random point within the square, it
can be inside the circle or just inside the square. A very simple way to compute
this probability is the ratio between the area of the circle and the area of the
square.
Since the area of the circle is 𝜋𝑟² and the area of the square is (2𝑟)² = 4𝑟², the
probability that a randomly selected point in the square is in the circle is
𝜋𝑟²/(4𝑟²) = 𝜋/4. This means that if I were to replicate the selection of a random
point in the square a large number of times, I could count the proportion of points
inside the circle, multiply it by four, and that would give me an approximation of 𝜋.
The Monte Carlo approximation of 𝜋 therefore consists of the following steps:
1. Generate a large number of random points, with both coordinates uniformly
distributed between -1 and 1.
2. For each point compute the value 𝑥² + 𝑦²:
• If the value is less than or equal to 1, the point will be inside the circle;
• If the value is greater than 1, the point will be outside the circle.
3. Calculate the proportion of points inside the circle and multiply it by four
to approximate the 𝜋 value.
Let's implement these steps. First we generate 100 points:
set.seed(2021)
nPoints <- 100
x <- runif(nPoints,-1,1)
y <- runif(nPoints,-1,1)
head(x)
head(y)
So both x and y are vectors of length 100 storing numbers between -1 and 1.
Next, for each point we compute the value 𝑥² + 𝑦²:
• If the value is less than or equal to 1, the point will be inside the circle;
• If the value is greater than 1, the point will be outside the circle.
result <- ifelse(x^2 + y^2 <= 1, TRUE, FALSE)
head(result)
The vector result has TRUE in its i-th position if x[i]^2 + y[i]^2 <= 1, that is
if the associated point is within the circle. We can see that out of the first six
simulated points, only one is outside the circle.
Calculate the proportion of points inside the circle and multiply it by four to
approximate the 𝜋 value.
4*sum(result)/nPoints
## [1] 2.92
set.seed(1988)
x <- runif(nPoints,-1,1)
y <- runif(nPoints,-1,1)
result <- ifelse(x^2 + y^2 <= 1, TRUE, FALSE)
4*sum(result)/nPoints
## [1] 3.08
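We can wrap these steps into a function, so that the whole experiment can easily
be repeated. The definition of piVal is not included in this extract; a sketch
that simply collects the code above, with the number of points as its argument
(default 100):
piVal <- function(nPoints = 100){
  x <- runif(nPoints, -1, 1)   # coordinates of the random points
  y <- runif(nPoints, -1, 1)
  result <- ifelse(x^2 + y^2 <= 1, TRUE, FALSE)  # inside the circle?
  4*sum(result)/nPoints        # proportion inside, times four
}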
set.seed(2021)
piVal()
## [1] 2.92
set.seed(1988)
piVal()
## [1] 3.08
So we can see that the function works since it gives us the same output as the
code above.
set.seed(2021)
N <- 1000
pis <- replicate(N, piVal())
head(pis)
We can see that the first entry of the vector pis is indeed the same value we
obtained when running the function ourselves (in both cases we fixed the same
seed).
Calculate the average of the previous 1000 experiments to give a final estimate
of 𝜋.
mean(pis)
## [1] 3.13828
boxplot(pis)
[Boxplot of the 1000 approximations of 𝜋 stored in pis.]
The boxplot tells us two things: the approximations are centered around the
true value of 𝜋, but individual approximations based on only 100 points are
quite variable.
One thing you might wonder now is the following: why did we replicate the
experiment 1000 times, each time taking only 100 points? Could we not have
taken a much larger number of points just once (for example 1000 × 100)?
On one hand, that would have clearly given us a good approximation using the
same total number of simulated points. Indeed
set.seed(2021)
piVal(1000*100)
## [1] 3.1416
On the other hand, a single run does not tell us anything about the variability
of our estimate. Having replicated the experiment 1000 times, we can for
instance report an interval containing 95% of the approximations:
c(sort(pis)[25], sort(pis)[975])
R provides the function sample to simulate draws from a finite set of values.
So for instance, if we wanted to simulate ten tosses of a fair die we can write:
set.seed(2021)
sample(1:6, size = 10, replace = TRUE)
## [1] 6 6 2 4 4 6 6 3 6 6
Notice that the vector of values we sample from does not necessarily need to be
numeric: it could be a vector of characters. For instance, let's simulate the toss
of 5 coins, where the probability of heads is 2/3 and the probability of tails is 1/3.
set.seed(2021)
sample(c("heads","tails"), size = 5, replace = TRUE, prob = c(2/3,1/3))
5.5 A game of chance
Consider a simple game of chance: Peter and Paul repeatedly toss a fair coin,
and Peter wins 1€ if the coin shows heads and loses 1€ if it shows tails. Suppose
they play 50 times; we can simulate Peter's outcomes in the individual games as
follows.
set.seed(2021)
win <- sample(c(-1,1), size = 50, replace = TRUE)
head(win)
## [1] -1 1 1 1 -1 1
For this particular game Peter lost the first game, then won the second, the
third and the fourth and so on.
Suppose Peter is interested in his cumulative winnings as he plays this game.
The function cumsum() computes the cumulative sums of the individual values,
and we store the cumulative values in a vector named cumul.win.
cumul.win <- cumsum(win)
cumul.win
## [1] -1 0 1 2 1 2 3 4 5 6 5 6 7 6 5 4 3 2 3 2 3 4 3 4 3
## [26] 2 3 4 5 6 5 6 5 6 7 6 7 6 7 6 5 4 5 6 5 4 5 6 7 8
So at the end of this specific game Peter won 8€. Figure 5.1 reports Peter's
fortune as the game evolved. We can notice that Peter was in the lead throughout
almost the whole game.
Of course this is the result of a single simulation and the outcome may be totally
different from the one we saw. Figure 5.2 reports four simulated games: we can
see that in the first one Peter wins, in the second he almost breaks even, whilst
in the third and fourth he clearly loses.
Figure 5.1: Peter's cumulative winnings during one simulated game.
set.seed(2021)
par(mfrow = c(2, 2))
for(j in 1:4){
  plot(cumsum(sample(c(-1, 1), size = 50, replace = TRUE)),
       type = "l", ylim = c(-25, 25), ylab = "Outcome")
  abline(h = 0)
}
Figure 5.2: Peter's cumulative winnings in four simulated games.
We can use Monte Carlo simulation to answer questions about this game, such as:
• What is the probability that Peter breaks even at the end of the game?
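To answer this, it is convenient to wrap the simulation of one game into a
function returning Peter's final fortune. The definition of peter.paul is not
shown in this extract; a minimal sketch consistent with the output below:
peter.paul <- function(n = 50){
  win <- sample(c(-1, 1), size = n, replace = TRUE)  # outcomes of the individual tosses
  sum(win)                                           # Peter's final fortune
}
set.seed(2021)
peter.paul()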
## [1] 8
The output is the same as the previous code, so it seems that our function works
correctly.
Let’s replicate the experiment many times.
set.seed(2021)
experiment <- replicate(1000,peter.paul())
head(experiment)
So the vector experiment stores Peter's final fortune in 1000 games. Since
Peter's fortune is an integer-valued variable, it is convenient to summarize it
using the table function.
table(experiment)
## experiment
## -22 -20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16
## 1 2 7 7 16 30 42 66 58 101 116 102 110 103 70 60 44 26 15 14
## 18 20
## 4 6
plot(table(experiment))
[Plot of the distribution of Peter's final fortune in the 1000 simulated games.]
So we can see that Peter breaks even 102 out of 1000 times. Furthermore the
plot shows us that most commonly Peter will win/lose little money and that
big wins/losses are unlikely.
To conclude our experiment we need to calculate our estimated probability of
Peter breaking even. Clearly this is equal to 102/1000 = 0.102. In R:
sum(experiment==0)/1000
## [1] 0.102
Notice that we could have also answered this question exactly. The event of
Peter breaking even coincides with observing exactly 𝑛/2 = 25 successes in a
Binomial experiment with parameters 𝑛 = 50 and 𝜃 = 0.5. This can be computed
in R as
dbinom(25, 50, 0.5)
## [1] 0.1122752
So our approximation is already quite close to the true value. We would get
even closer by replicating the experiment a larger number of times.
set.seed(2021)
experiment <- replicate(10000, peter.paul())
length(experiment[experiment == 0])/10000
## [1] 0.1096
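5.6 Playing the roulette
We now bet repeatedly on a single number of a European roulette (the numbers
0 to 36), starting from an initial budget: at each spin we pay 1€ to play, and we
receive a payout if our number comes up; the game ends when the budget reaches
zero. The definition of the function roulette is not shown in this extract; a
minimal sketch consistent with the description and the outputs below, where
the 36€ payout is an assumption:
roulette <- function(budget, number){
  wealth <- c()
  while(budget > 0){
    budget <- budget - 1                 # pay 1 euro to play
    if(sample(0:36, 1) == number){       # one spin of a European roulette
      budget <- budget + 36              # assumed payout for a winning bet
    }
    wealth <- c(wealth, budget)
  }
  wealth
}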
It takes two inputs: the budget and the number we decide to bet on throughout
the game. It outputs our budget throughout the whole game, until the game ends.
Let's play one game with a budget of 15, betting on the number 8.
set.seed(2021)
roulette(15,8)
## [1] 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
In 15 spins the number 8 never comes up and therefore our game ends quite
quickly.
We can ask ourselves many questions about such a game. For instance:
1. What is the probability that the game lasts exactly 15 spins if I start with
a budget of 15 and bet on the number 8?
2. What is the average length of the game starting with a budget of 15 and
betting on the number 8?
3. What is the average maximum wealth I have during a game started with
a budget of 15 and betting on the number 8?
We will develop various Monte Carlo experiments to answer all the above ques-
tions. In each case, we simply need to modify the function roulette to output
some summary about the game.
5.6.1 Question 1
1. What is the probability that the game lasts exactly 15 spins if I start with
a budget of 15 and bet on the number 8?
If a game where I started with a budget of 15 ends after 15 spins, it means that
the number 8 never showed up. We can adapt the function roulette to output
TRUE if the length of the vector wealth is exactly equal to budget.
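The book's definition of roulette1 is not shown; a sketch that simply reuses
the roulette function above:
roulette1 <- function(budget, number){
  length(roulette(budget, number)) == budget
}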
Therefore for the previous example the function should output TRUE.
set.seed(2021)
roulette1(15,8)
## [1] TRUE
Let’s replicate the experiment 1000 times. The proportion of TRUE we observe
is our estimate of this probability.
set.seed(2021)
experiment <- replicate(1000,roulette1(15,8))
sum(experiment)/1000
## [1] 0.657
Notice that we could have actually also computed this probability exactly: it
is the probability that a Binomial random variable with parameters 𝑛 = 15 (15
spins of the roulette) and 𝜃 = 1/37 (the number eight has probability 1/37 of
appearing in a single spin) is equal to zero (the number 8 never appears). This
is equal to:
dbinom(0,15,1/37)
## [1] 0.6629971
5.6.2 Question 2
2. What is the average length of the game starting with a budget of 15 and
betting on the number 8?
We can answer this question by adapting the roulette function to output the
length of the vector wealth.
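Again the definition is not shown in this extract; a sketch reusing roulette:
roulette2 <- function(budget, number){
  length(roulette(budget, number))
}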
set.seed(2021)
roulette2(15,8)
## [1] 15
Let’s replicate the experiment 1000 times and summarize the results with a plot
(it may take some time to run the code!).
set.seed(2021)
experiment <- replicate(1000,roulette2(15,8))
plot(table(experiment))
[Plot of the distribution of the simulated game lengths.]
We can see that the distribution is very skewed: most often the length is 15 spins,
but sometimes the game is much longer. The median is therefore a much better
option than the mean to summarize the typical length of the game. We can
compute it, together with a range of plausible values, as
median(experiment)
## [1] 15
c(sort(experiment)[25],sort(experiment)[975])
## [1] 15 1923
So we can see that the median is indeed 15: most of the time the number 8
does not appear and therefore the game ends in 15 spins.
5.6.3 Question 3
3. What is the average maximum wealth I have during a game started with
a budget of 15 and betting on the number 8?
We can answer this question by adapting the roulette function to output the
maximum of the vector wealth.
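As before, a sketch of roulette3 reusing roulette:
roulette3 <- function(budget, number){
  max(roulette(budget, number))
}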
set.seed(2021)
experiment <- replicate(1000,roulette3(15,8))
plot(table(experiment))
[Plot of the distribution of the simulated maximum wealths.]
median(experiment)
## [1] 14
c(sort(experiment)[25],sort(experiment)[975])
## [1] 14 331
Since most often the game ends with no 8 appearing, the maximum wealth is
most often 14.
5.7 Is Annie meeting Sam?
5.7.1 Question 1
• What is the probability that Annie arrives before Sam?
The following function outputs TRUE if Annie arrives before Sam and FALSE
otherwise.
set.seed(2021)
sam_annie1()
## [1] FALSE
set.seed(2021)
sam_annie1()
## [1] 0.223
set.seed(2021)
experiment <- replicate(1000, sam_annie1())
boxplot(experiment)
[Boxplot of 1000 estimates of the probability that Annie arrives before Sam.]
5.7.2 Question 2
The function sam_annie2 computes this expectation using 1000 random obser-
vations.
set.seed(2021)
experiment <- replicate(1000,sam_annie2())
boxplot(experiment)
[Boxplot of 1000 estimates of the expectation.]
5.7.3 Question 3
• If they each wait only twenty minutes after their arrival, what is the
probability that they meet?
The function sleepless below returns TRUE if Annie and Sam meet, and FALSE
otherwise.
set.seed(2021)
mean(replicate(10000,sleepless()))
## [1] 0.178
So we see that if they each wait 20 minutes, they have a probability of meeting
of 0.178.
5.7.4 Question 4
• How much should they wait (assuming they wait the same amount of
time), so that the probability they meet is at least 50%?
In order to answer this question we can use the sapply function. We already
know that if they wait 20 minutes, the probability of meeting is about 0.18, so
we consider longer waiting times.
sapply(30:60,function(x) mean(replicate(10000,sleepless(x))))
## [1] 0.2754 0.2769 0.2811 0.2925 0.3055 0.3104 0.3147 0.3248 0.3400 0.3482
## [11] 0.3571 0.3605 0.3746 0.3869 0.3958 0.4034 0.4104 0.4184 0.4237 0.4283
## [21] 0.4444 0.4653 0.4664 0.4763 0.4831 0.4846 0.4929 0.5067 0.5091 0.5222
## [31] 0.5312
The 28th entry of the vector 30:60, that is 57, is the first one for which the
estimated probability is at least 0.50. So they should wait 57 minutes if they
want the probability of actually meeting to be at least 50%.
Chapter 6
Discrete Event Simulation
You may recall that in Chapter 1 we discussed the example of a simple donut
shop, where we were interested in the waiting time of customers depending on
the number of employees in the shop. We will slowly build a more and more
realistic implementation of the shop using simmer. First, if you have never done
this, you need to install the package using the code
install.packages("simmer")
Once it is installed (which you only need to do once), you then need to load it at
the beginning of every R session using the code
library(simmer)
6.1 The donut shop
We first model a single customer who arrives at the shop for a visit, looks around
at the decor for a time and then leaves. There is no queueing. First we will
assume his arrival time and the time he spends in the shop are fixed.
The arrival time is fixed at 5, and the time spent in the shop is fixed at 10.
We interpret ‘5’ and ‘10’ as ‘5 minutes’ and ‘10 minutes’. The simulation runs
for a maximum of 100 minutes, or until all the customers that are generated
complete their visit to the shop.
Let's define step by step the code that implements this. First we define a variable
called customer which describes the evolution of the customer in the shop. The
evolution of the customer is called a trajectory, and it is given a name. The
function log_ produces text which is shown at specific points during the
simulation. Lastly, timeout specifies how long the customer will spend in the
shop.
customer <-
trajectory("Customer's path") %>%
log_("Here I am") %>%
timeout(10) %>%
log_("I must leave")
Now that we have described how the customer behaves in the shop, we must
create a variable specifying how the shop itself works. The code below creates
the variable shop where first we create a simmer object called shop and then we
specify via the function add_generator that a customer arrives after 5 minutes.
shop <-
simmer("shop") %>%
add_generator("Customer", customer, at(5))
Now that the shop is created we have to run the simulation using the run
command.
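The run call itself is not shown in this extract; given the 100-minute horizon
mentioned above, it was presumably of the form:
shop %>% run(until = 100)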
## 5: Customer0: Here I am
## 15: Customer0: I must leave
Now we extend the model to allow our customer to arrive at a random simulated
time, though we will keep the time spent in the shop at 10, as before.
The change occurs in the arguments to the add_generator function. We will
assume that the customer arrival time is generated from an Exponential distri-
bution with parameter 1/5 (that is, mean 5).
set.seed(2021)
customer <-
trajectory("Customer's path") %>%
log_("Here I am") %>%
timeout(10) %>%
log_("I must leave")
shop <-
simmer("shop") %>%
add_generator("Customer", customer, at(rexp(1, 1/5)))
The trace shows that the customer now arrives at time 5.925307. Changing the
seed value would change that time.
So far only a single customer was generated. We can instead pass a function to
add_generator: each time a new arrival is needed, the function is called to
produce the next inter-arrival time, so that customers keep arriving throughout
the simulation.
set.seed(2021)
customer <-
trajectory("Customer's path") %>%
log_("Here I am") %>%
timeout(10) %>%
log_("I must leave")
shop <-
simmer("shop") %>%
add_generator("Customer", customer, function() rexp(1, 1/5))
So far, the model has been more like an art gallery, with customers entering,
looking around, and leaving. Now they are going to require service from an
employee. We extend the model to include a service counter that will be modelled
as a 'resource'. The actions of a resource are simple: a customer requests a
unit of the resource (an employee). If one is free, then the customer gets service
(and the unit is no longer available to other customers). If there is no free
employee, then the customer joins the queue until it is the customer's turn to be
served. As each customer completes service and releases the unit, the employee
can start serving the next in line.
The service counter is created with the add_resource function. Default
arguments specify that it can serve one customer at a time, and has infinite
queuing capacity.
The seize function causes the customer to join the queue at the counter. If
the queue is empty and the counter is available (not serving any customers),
then the customer claims the counter for itself and moves on to the timeout
step. Otherwise the customer must wait until the counter becomes available.
Behaviour of the customer while in the queue is controlled by the arguments of
the seize function. Once the timeout step is complete, the release function
causes the customer to make the counter available to other customers in the
queue.
We will assume that serving time follows a Normal distribution with mean 10
and standard deviation 2.
set.seed(2021)
customer <-
trajectory("Customer's path") %>%
log_("Here I am") %>%
seize("counter") %>%
timeout(function() rnorm(1,10,2)) %>%
release("counter") %>%
log_("Finished")
shop <-
simmer("shop") %>%
add_resource("counter") %>%
add_generator("Customer", customer, function() rexp(1, 1/5))
Running the simulation, we see that 4 customers arrived in the shop and that 2
of them were served. Let's use the function get_mon_arrivals to get a summary
of each customer. By default, the function does not report the waiting time of a
customer, which we will need to compute.
shop %>%
get_mon_arrivals() %>%
transform(waiting_time = end_time - start_time - activity_time)
Here we model a shop whose customers arrive randomly and are to be served
at a group of counters, taking a random time for service, where we assume that
waiting customers form a single first-in first-out queue.
The only difference between this model and the single-server model is in the
add_resource function, where we have increased the capacity to two so that it
can serve two customers at once.
set.seed(2021)
customer <-
trajectory("Customer's path") %>%
log_("Here I am") %>%
seize("counter") %>%
timeout(function() rnorm(1,10,2)) %>%
release("counter") %>%
log_("Finished")
shop <-
simmer("shop") %>%
add_resource("counter",2) %>%
add_generator("Customer", customer, function() rexp(1, 1/5))
shop %>%
get_mon_arrivals() %>%
transform(waiting_time = end_time - start_time - activity_time)
Now that we also have a counter we can get some summary statistics from it
using the function get_mon_resources.
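The call itself is not shown in this extract; presumably simply:
shop %>% get_mon_resources()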
We have learned to implement various simple simulations of our donut shop. The
output we get is informative and comprehensive, but not particularly appealing
to present, for instance in a report. The package simmer.plot provides plotting
capabilities to summarize the results of a simulation. At this stage we will see
two simple capabilities of the package; we will learn more about it in the
following sections.
Before using simmer.plot you need to install it only once via
install.packages("simmer.plot")
and then load it at the beginning of every R session where you plan to use it.
library("simmer.plot")
First, we can plot how much a resource, in this case our two employees, is
utilized using the following code.
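The plotting call is not included in this extract; with simmer.plot it is
presumably of the form:
plot(get_mon_resources(shop), metric = "utilization")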
[Bar plot of the utilization of the counter resource.]
So we see that our employees are busy around 90% of the time. We can also see
when they are busy, as well as how many people are queuing at each moment
during the simulation, using the code below.
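Again the call is not shown in this extract; presumably something like:
plot(get_mon_resources(shop), metric = "usage",
     items = c("queue", "server", "system"))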
[Plot of the usage of the counter resource over time, showing the queue, server and system items.]
The green line reports the number of employees busy and we can see that most of
the time they are both busy. The red line reports the number of people queuing
and waiting to be served. The blue line is the total number of customers in the
system: those queuing plus those being served.
6.2 Replication
For all previous examples, we ran a single simulation and observed the results.
As we have already learned, these results are affected by randomness and
different runs will show different results.
Consider the last simulation we implemented, where customers arrive at the shop
and we have two counters. We can simulate the system 100 times using the
following code. We do not include log_ since otherwise the output would become
cluttered. The original wrapper around the pipeline is incomplete in this extract;
it was presumably an lapply over the replications:
customer <-
  trajectory("Customer's path") %>%
  seize("counter") %>%
  timeout(function() rnorm(1,10,2)) %>%
  release("counter")

envs <- lapply(1:100, function(i) {
  simmer("shop") %>%
    add_resource("counter", 2) %>%
    add_generator("Customer", customer, function() rexp(1, 1/5)) %>%
    run(until = 240)
})
Now envs stores the output of simulating the behavior of the shop for 4 hours
100 times.
We can summarize the results of these simulations using the simmer.plot
package.
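The call producing the plot below is presumably the same one used later in the
chapter:
plot(get_mon_resources(envs), metric = "utilization")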
[Plot of the utilization of the counter across the 100 simulations.]
Compared to previous plots, we now notice that the output also has something
that resembles a boxplot, which tells us the utilization of the resource in the
different simulation runs.
We can also assess how busy the employees were in the different simulations. As
the shop opens the two employees become more and more busy, and by the end
of the four hours they are busy almost all the time.
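The call is not shown in this extract; presumably something like:
plot(get_mon_resources(envs), metric = "usage", items = "server")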
[Plot of the usage of the counter resource over time across simulations.]
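The waiting times can be plotted from the arrivals data; the call is not shown
in this extract, but with simmer.plot it is presumably:
plot(get_mon_arrivals(envs), metric = "waiting_time")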
[Plot of the evolution of the customers' waiting time across simulations.]
Each black line represents a single simulation and the blue line gives an overall
summary across the simulations. We can see that the waiting time increases
roughly linearly with time.
6.3 The donut shop - advanced features
Suppose the donut shop has priority customers who, when they arrive at the
shop, are served as soon as possible. We make the assumption that they arrive
on average every 15 minutes (see the rexp(1, 1/15) generator in the code below).
customer <-
trajectory("Customer's path") %>%
log_("Here I am") %>%
seize("counter") %>%
timeout(function() rnorm(1,10,2)) %>%
release("counter") %>%
log_("Finished")
shop <-
simmer("shop") %>%
add_resource("counter") %>%
add_generator("Customer", customer, function() rexp(1, 1/5)) %>%
add_generator("Priority_Customer", customer, function() rexp(1, 1/15), priority = 1)
set.seed(2021)
shop %>% run(until = 45)
From the output we can see that whenever a priority customer joins the queue
he is sold donuts as soon as the employee becomes available.
Now we allow priority customers to have preemptive priority. They will displace
any customer in service when they arrive. That customer will resume when they
finish (unless higher priority customers intervene). This requires only a change
to one line of the program: adding the argument preemptive = TRUE to the
add_resource function call.
shop <-
simmer("shop") %>%
add_resource("counter", preemptive = TRUE) %>%
add_generator("Customer", customer, function() rexp(1, 1/5)) %>%
add_generator("Priority_Customer", customer, function() rexp(1, 1/15), priority = 1)
set.seed(2021)
shop %>% run(until = 45)
In this case, priority customers are served straight away. The customer that
was being served when the priority customer arrived resumes their service as
soon as the priority customer finishes.
Balking occurs when a customer refuses to join a queue if it is too long. Suppose
that if there is already one customer queuing in our shop, then arriving customers
do not join the queue and leave. We can implement this by setting the
queue_size option of add_resource and by adding some options to the seize
function. Let's consider the following code.
customer <-
  trajectory("Customer's path") %>%
  log_("Here I am") %>%
  seize("counter", continue = FALSE,
        reject = trajectory("Balked customer") %>% log_("Balking")) %>%
  timeout(function() rnorm(1,10,2)) %>%
  release("counter") %>%
  log_("Finished")
shop <-
simmer("shop") %>%
add_resource("counter", queue_size = 1) %>%
add_generator("Customer", customer,
function() rexp(1, 1/5))
The input queue_size is self-explanatory and simply sets how many people
can queue for the counter. In the seize function we set the inputs continue
and reject. With continue = FALSE we are saying that a rejected customer
does not follow the rest of the trajectory. With reject we are specifying what
trajectory the rejected customer will follow.
Let’s run the simulation.
set.seed(2021)
shop %>% run(until = 45)
So now we see that customers often just leave the shop because they decide not
to queue. We can count how many of them left due to balking using:
sum(get_mon_arrivals(shop)$activity_time == 0)
## [1] 4
We can also express this as a rate per hour, since now(shop) returns the current
simulation time in minutes:
sum(get_mon_arrivals(shop)$activity_time == 0)/now(shop)*60
## [1] 5.333333
Often in practice an impatient customer will leave the queue before being served.
Simmer can model this reneging behaviour using the renege_in() function in
a trajectory. This defines the maximum time that a customer will wait before
reneging, as well as an ‘out’ trajectory for them to follow when they renege.
If the customer reaches the server before reneging, then their impatience must
be cancelled with the renege_abort() function.
customer <-
  trajectory("Customer's path") %>%
  log_("Here I am") %>%
  renege_in(function() rnorm(1,5,1),
            out = trajectory("Reneging customer") %>%
              log_("I am off")) %>%
  # the remainder of the trajectory is not shown in this extract
  # and is reconstructed here following the text above
  seize("counter") %>%
  renege_abort() %>%   # reached the server: cancel the impatience timer
  timeout(function() rnorm(1,10,2)) %>%
  release("counter") %>%
  log_("Finished")
shop <-
simmer("shop") %>%
add_resource("counter") %>%
add_generator("Customer", customer, function() rexp(1, 1/5))
Each counter is now assumed to have its own queue. The programming is
more complicated because the customer has to decide which queue to join. The
obvious technique is to make each counter a separate resource.
In practice, a customer might join the shortest queue. We implement this
behaviour by first selecting the shortest queue using the select function. Then
we use seize_selected to enter the chosen queue, and later release_selected.
set.seed(2021)
customer <-
trajectory("Customer's path") %>%
log_("Here I am") %>%
select(c("counter1", "counter2"), policy = "shortest-queue") %>%
seize_selected() %>%
timeout(function() rnorm(1,10,2)) %>%
release_selected() %>%
log_("Finished")
shop <-
simmer("shop") %>%
add_resource("counter1", 1) %>%
add_resource("counter2", 1) %>%
add_generator("Customer", customer, function() rexp(1, 1/5))
There are several policies implemented internally that can be accessed by name,
such as "shortest-queue", "round-robin", "first-available" and "random".
Customers arrive at random, some of them getting to the shop before the door
is opened by a doorman. They wait for the door to be opened and then rush in
and queue to be served.
This model defines the door as a resource, just like the counter. The capacity
of the door is defined according to the schedule function, so that it has zero
capacity when it is shut, and infinite capacity when it is open. Customers ‘seize’
the door and must then wait until it has capacity to ‘serve’ them. Once it is
available, all waiting customers are ‘served’ immediately (i.e. they pass through
the door). There is no timeout between ‘seizing’ and ‘releasing’ the door.
customer <-
trajectory("Customer's path") %>%
log_(function()
if (get_capacity(shop, "door") == 0)
"Here I am but the door is shut."
else "Here I am and the door is open."
) %>%
seize("door") %>%
log_("I can go in!") %>%
release("door") %>%
seize("counter") %>%
timeout(function() {rexp(1, 10)}) %>%
release("counter")
shop <-
simmer("shop") %>%
add_resource("door", capacity = door_schedule) %>%
add_resource("counter") %>%
add_generator("Customer", customer, function() rexp(1, 1))
Customers arrive at random, some of them getting to the shop before the door
is open. This is controlled by an automatic machine called the doorman, which
opens the door only at intervals of 30 minutes (it is a very secure shop). The
customers wait for the door to be opened, and all those waiting enter and proceed
to the counter. The door is closed behind them.
One possible solution is using batching. Customers can be collected into batches
of a given size, or for a given time, whichever occurs first. Here, they are
collected for periods of 30 minutes, and the number of customers in each batch
is unrestricted.
set.seed(2021)
customer <-
trajectory("Customer's path") %>%
log_("Here I am, but the door is shut.") %>%
batch(n = Inf, timeout = 30) %>%
separate() %>%
log_("The door is open!") %>%
seize("counter") %>%
timeout(function() {rexp(1, 1/2)}) %>%
release("counter") %>%
log_("Finished.")
The batch function has two main arguments:
• n: the batch size, i.e. the maximum number of customers collected in each
batch (here Inf, i.e. unrestricted);
• timeout: an optional timer which triggers batches every timeout time
units even if the batch size has not been fulfilled.
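6.4 Simulating a simple health center
We now simulate a simple health center, where each patient goes through three
consecutive activities, each modelled as a seize/timeout/release block. The
definition of the trajectory patient is not included in this extract; a sketch
along the lines of the simmer documentation, where both the order of the visits
(nurse, then doctor, then administration) and the timing parameters are
assumptions:
patient <-
  trajectory("patients' path") %>%
  seize("nurse", 1) %>%
  timeout(function() rnorm(1, 15)) %>%   # assumed duration of the intake
  release("nurse", 1) %>%
  seize("doctor", 1) %>%
  timeout(function() rnorm(1, 20)) %>%   # assumed duration of the consultation
  release("doctor", 1) %>%
  seize("administration", 1) %>%
  timeout(function() rnorm(1, 5)) %>%    # assumed duration of the paperwork
  release("administration", 1)
We can visualize the trajectory using the plot function of simmer.plot.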
plot(patient)
[Diagram of the patient trajectory: three seize/timeout/release blocks, one for each resource.]
Once the trajectory is known, you may attach arrivals to it and define the
resources needed. In the example below, three types of resources are added:
the nurse and administration resources, each with a capacity of 2, and the
doctor resource, with a capacity of 4. The last method adds a generator of
arrivals (patients) following the trajectory patient. The time between patients
is about 5 minutes.
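The code creating the environments envs is not shown in this extract; a sketch
consistent with the description above, where the number of replications, the time
horizon and the spread of the inter-arrival times are assumptions:
envs <- lapply(1:100, function(i) {
  simmer("health center") %>%
    add_resource("nurse", 2) %>%
    add_resource("administration", 2) %>%
    add_resource("doctor", 4) %>%
    add_generator("patient", patient, function() rnorm(1, 5, 0.5)) %>%  # ~5 min between patients
    run(until = 540)
})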
plot(get_mon_resources(envs), metric = "utilization")
[Plot of the utilization of the administration, doctor and nurse resources.]
[Plot of the usage of the administration, doctor and nurse resources over time.]
[Plot of the evolution of the patients' waiting time.]
AWAY) represent activities that take some (random) time to complete. Two
kinds of processes can be identified: shop jobs, which use machines and degrade
them, and personal tasks, which take operatives AWAY for some time.
Notice that after a job is completed by a machine there may be two possible
trajectories to follow.
set.seed(2021)
env <- simmer("Job Shop")
env %>%
add_resource("machine", 10) %>%
add_resource("operative", 5) %>%
add_generator("job", job, NEW_JOB) %>%
add_generator("task", task, NEW_TASK) %>%
run(until=10)
Let's extract a history of the resources' state to analyze the average number of
machines and operatives in use, as well as their overall utilization.
plot(get_mon_resources(env),"utilization")
[Plot of the utilization of the machine and operative resources.]
[Plot of the usage of the machine and operative resources over time.]