System Simulation and Modeling - Course Material
(AUTONOMOUS)
Sree Sainath Nagar, A. Rangampet-517102
COURSE MATERIAL
IV B. Tech. - I Semester
20BT71501: SYSTEM SIMULATION MODELING
Prepared by
COURSE OUTCOMES: After successful completion of the course, students will be able to:
CO1. Understand the concepts of discrete event simulation by using single–server queuing
system and simulation software.
CO2. Develop a probabilistic model to solve real life problems and validate it.
CO3. Apply statistical models to represent the data for simulation.
CO4. Apply Techniques to generate Random variates for modeling a system
CO5. Apply goodness of fit tests for identified input data distribution
CO6. Analyze the techniques for output data analysis for a single system
DETAILED SYLLABUS:
UNIT I: BASIC SIMULATION MODELING (10 Periods)
Introduction: The nature of simulation, Systems, Models, and simulation, discrete event
simulation, Simulation of a single-server queuing system: problem statement, Intuitive
Explanation, Program Organization and Logic, simulation output and discussion, Alternative
Stopping Rules, steps in simulation study, advantages, disadvantages, and pitfalls of
simulation
Simulation software: introduction, comparison of simulation packages with programming
languages, classification of simulation software, desirable software features.
TEXT BOOK:
1. Averill M. Law, Simulation Modeling and Analysis, McGraw Hill Education (India) Private
Limited, 5th edition, 2015.
REFERENCE BOOKS:
1. Jerry Banks, John S. Carson II, Barry L.Nelson and David M.Nicol, Discrete-Event System
Simulation, Pearson India,5th edition, 2013.
2. Narsingh Deo, System Simulation with Digital Computer, Prentice Hall India 2009.
UNIT-I: BASIC SIMULATION MODELING
Application areas of simulation include: logistics, supply chain and transportation; military applications; distribution modes and traffic; business processes; and health care.
The departure event’s logic is depicted in the flowchart of Fig. 10. Recall that this routine is
invoked when a service completion (and subsequent departure) occurs. If the departing
customer leaves no other customers behind in queue, the server is idled and the departure
event is eliminated from consideration, since the next event must be an arrival. On the other
hand, if one or more customers are left behind by the departing customer, the first customer
in queue will leave the queue and enter service, so the queue length is reduced by 1, and
the delay in queue of this customer is computed and registered in the appropriate statistical
counter.
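The following is a minimal Python sketch of this departure-event logic for a single-server FIFO queue. All names (the state and stats dictionaries, the exponential service assumption) are illustrative assumptions, not part of any particular simulation package.

```python
import random

def depart(state, clock, arrival_times, stats, mean_service):
    """Departure-event routine for a single-server FIFO queue (sketch)."""
    if state["num_in_queue"] == 0:
        # Departing customer leaves nobody behind: idle the server and
        # remove the departure event from consideration.
        state["server_busy"] = False
        state["time_next_departure"] = float("inf")
    else:
        # First customer in queue enters service: shorten the queue,
        # record that customer's delay, and schedule its departure.
        state["num_in_queue"] -= 1
        delay = clock - arrival_times.pop(0)
        stats["total_delay"] += delay
        stats["num_delayed"] += 1
        state["time_next_departure"] = clock + random.expovariate(1.0 / mean_service)
```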
Example: X = {0, 1, 2, 3}
X could be 0, 1, 2, or 3 randomly.
And they might each have a different probability.
We use a capital letter, like X or Y, to avoid confusion with the Algebra type of variable.
Sample Space
A Random Variable's set of values is the Sample Space.
Example: Throw a die once
Random Variable X = "The score shown on the top face".
X could be 1, 2, 3, 4, 5 or 6
So the Sample Space is {1, 2, 3, 4, 5, 6}
2.1.2 Probability
We can show the probability of any one value using this style:
P(X = value) = probability of that value
Example (continued): Throw a die once
X = {1, 2, 3, 4, 5, 6}
In this case they are all equally likely, so the probability of any one is 1/6
P(X = 1) = 1/6
P(X = 2) = 1/6
P(X = 3) = 1/6
P(X = 4) = 1/6
P(X = 5) = 1/6
P(X = 6) = 1/6
Note that the sum of the probabilities = 1, as it should be.
Example: Toss three coins and let X = "the number of Heads". The eight equally likely outcomes are:

Outcome   Number of Heads (X)
HHH       3
HHT       2
HTH       2
HTT       1
THH       2
THT       1
TTH       1
TTT       0
Looking at the table we see just 1 case of Three Heads, but 3 cases of Two Heads, 3 cases
of One Head, and 1 case of Zero Heads. So:
P(X = 3) = 1/8
P(X = 2) = 3/8
P(X = 1) = 3/8
P(X = 0) = 1/8
2.1.4 Continuous
Random Variables can be either Discrete or Continuous:
Discrete Data can only take certain values (such as 1,2,3,4,5)
Continuous Data can take any value within a range (such as a person's height)
All our examples have been Discrete.
In short:
X = {0, 1}
2.1.6 Continuous Random Variables
In our Introduction to Random Variables (please read that first!) we look at many examples
of Discrete Random Variables.
But here we look at the more advanced topic of Continuous Random Variables.
2.1.7 The Uniform Distribution
The Uniform Distribution (also called the Rectangular Distribution) is the simplest
distribution.
It has equal probability for all values of the random variable between a and b: the density is f(x) = 1/(b − a) for a ≤ x ≤ b (and 0 elsewhere).
Example: Old Faithful erupts every 91 minutes. You arrive there at random and wait for 20 minutes ... what is the probability you will see it erupt?
If you waited the full 91 minutes you would be sure (p = 1) to have seen it erupt.
But remember this is a random thing! It might erupt the moment you arrive, or any time in the 91 minutes. With a uniform distribution over the 91 minutes, the probability of seeing it erupt during a 20-minute wait is 20/91 ≈ 0.22.
2.1.8 Cumulative Uniform Distribution
We can have the Uniform Distribution as a cumulative (adding up as it goes along)
distribution:
The general name for any of these is probability density function or "pdf"
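As a quick check of the Old Faithful calculation, the sketch below evaluates a uniform pdf and cdf and the waiting-time probability; the 91-minute cycle and 20-minute wait are taken from the example above.

```python
def uniform_pdf(x, a, b):
    """Density of the Uniform(a, b) distribution: 1/(b-a) inside [a, b], else 0."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """Cumulative probability P(X <= x) for Uniform(a, b)."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

# Old Faithful example: eruption time uniform on (0, 91) minutes,
# probability of seeing an eruption during a 20-minute wait.
p = uniform_cdf(20, 0, 91)
print(round(p, 3))  # ~0.220
```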
Each random variable in the collection takes its values from the same mathematical space, known as the state space. This state space could be the integers, the real line, or n-dimensional Euclidean space, for example. A stochastic process's increment is the amount
that a stochastic process changes between two index values, which are frequently interpreted
as two points in time. Because of its randomness, a stochastic process can have many
outcomes, and a single outcome of a stochastic process is known as, among other things, a
sample function or realization.
Classification
A stochastic process can be classified in a variety of ways, such as by its state space, by its index set, or by the dependence among its random variables. One common way of classifying stochastic processes is by the cardinality of the index set and of the state space.
When expressed in terms of time, a stochastic process is said to be in discrete-time if its index
set contains a finite or countable number of elements, such as a finite set of numbers, the set
of integers, or the natural numbers. Time is said to be continuous if the index set is some
interval of the real line. Discrete-time stochastic processes and continuous-time stochastic
processes are the two types of stochastic processes. Continuous-time stochastic processes require more advanced mathematical techniques and knowledge, particularly because the index set is uncountable; discrete-time stochastic processes are therefore considered easier to study.
If the index set consists of integers or a subset of them, the stochastic process is also known
as a random sequence.
If the state space is made up of integers or natural numbers, the stochastic process is known
as a discrete or integer-valued stochastic process. If the state space is the real line, the
stochastic process is known as a real-valued stochastic process or a process with continuous
state space. If the state space is n-dimensional Euclidean space, the stochastic process is known as an n-dimensional vector process or n-vector process.
Examples
You can study all the theory of probability and random processes mentioned below in the
brief, by referring to the book Essentials of stochastic processes.
Bernoulli Process
The Bernoulli process is one of the simplest stochastic processes. It is a sequence of independent and identically distributed (iid) random variables, where each random variable takes the value one with probability p and the value zero with probability 1 − p. This process is analogous to repeatedly flipping a coin, where the probability of getting a head is p (and the value is one) and the probability of getting a tail is 1 − p (and the value is zero). In other words, a Bernoulli process is a series of iid Bernoulli random variables, with each coin flip representing a Bernoulli trial.
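A minimal sketch of a Bernoulli process; the choice p = 0.5 is an assumption for illustration, and each element of the returned list is an independent Bernoulli trial (1 = head, 0 = tail).

```python
import random

def bernoulli_process(n, p=0.5):
    """Return n iid Bernoulli(p) outcomes: 1 with probability p, else 0."""
    return [1 if random.random() < p else 0 for _ in range(n)]

trials = bernoulli_process(10, p=0.5)
print(trials, "number of heads:", sum(trials))
```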
Wiener Process
The Wiener process is a stochastic process with stationary, independent increments that are normally distributed, with variance depending on the length of the increment. The Wiener process is named after Norbert Wiener, who demonstrated its mathematical existence, but it is also known as the Brownian motion process or simply Brownian motion due to its historical significance as a model for Brownian movement in liquids.
The Wiener process, which plays a central role in probability theory, is frequently regarded as
the most important and studied stochastic process, with connections to other stochastic
processes. It has a continuous index set and state space, because its index set is the non-negative real numbers and its state space is the real numbers. However, the process can be defined more broadly so that its state space is n-dimensional Euclidean space. The resulting
Wiener or Brownian motion process is said to have zero drift if the mean of any increment is
zero. If the mean of the increment between any two points in time equals the time difference
multiplied by some constant μ, that is a real number, the resulting stochastic process is said
to have drift μ.
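A minimal discretized sketch of a Wiener process with drift, assuming a time step dt and drift μ chosen purely for illustration: successive increments over steps of length dt are independent normal draws with mean μ·dt and variance dt.

```python
import random

def wiener_path(n_steps, dt=0.01, mu=0.0):
    """Simulate a Wiener process with drift mu on a grid of n_steps steps of size dt."""
    w = [0.0]                                        # W(0) = 0
    for _ in range(n_steps):
        increment = random.gauss(mu * dt, dt ** 0.5)  # normal increment, variance dt
        w.append(w[-1] + increment)
    return w

path = wiener_path(1000, dt=0.01, mu=0.0)            # zero-drift Brownian motion
print(path[-1])                                       # value of W at time 10
```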
Poisson Process
The Poisson process is a stochastic process with various forms and definitions. It is a counting
process, which is a stochastic process that represents the random number of points or events
up to a certain time. The number of process points located in the interval from zero to some
given time is a Poisson random variable that is dependent on that time and some parameter.
This process's state space is made up of natural numbers, and its index set is made up of
non-negative numbers. This process is also known as the Poisson counting process because
it can be interpreted as a counting process.
2.3.2 Variance
The Variance is defined as:
The average of the squared differences from the Mean.
To calculate the variance follow these steps:
Work out the Mean (the simple average of the numbers)
Then for each number: subtract the Mean and square the result (the squared
difference).
Then work out the average of those squared differences. (Why Square?)
Example
You and your friends have just measured the heights of your dogs (in millimeters):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the Mean:
Answer:
Mean = (600 + 470 + 170 + 430 + 300) / 5
     = 1970 / 5
     = 394
so the mean (average) height is 394 mm. Let's plot this on the chart:
To calculate the Variance, take each difference, square it, and then average the result:
Variance
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5
   = (42436 + 5776 + 50176 + 1296 + 8836) / 5
   = 108520 / 5
   = 21704
So the Variance is 21,704
And the Standard Deviation is just the square root of Variance, so:
Standard Deviation
σ = √21704
= 147.32...
= 147 (to the nearest mm)
Now we can show which heights are within one Standard Deviation (147 mm) of the Mean, so we have a standard way of knowing what is normal, and what is extra large or extra small.
Rottweilers are tall dogs. And Dachshunds are a bit short, right?
We can expect about 68% of values to be within plus-or-minus 1 standard deviation.
Read Standard Normal Distribution to learn more.
Our example has been for a Population (the 5 dogs are the only dogs we are interested in).
But if the data is a Sample (a selection taken from a bigger Population), then the calculation
changes!
2.3.3 Formulas
Here are the two formulas, explained at Standard Deviation Formulas if you want to know more:
Population Standard Deviation: σ = √[ Σ(xᵢ − μ)² / N ]
Sample Standard Deviation: s = √[ Σ(xᵢ − x̄)² / (N − 1) ]
They look complicated, but the important change is to divide by N − 1 (instead of N) when calculating a Sample Standard Deviation.
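The distinction between the population and sample formulas can be seen in a short sketch using the dog-height data from the example above (values reproduced from the text).

```python
import math

heights = [600, 470, 170, 430, 300]             # dog heights (mm) from the example
n = len(heights)
mean = sum(heights) / n                          # 394

pop_var = sum((x - mean) ** 2 for x in heights) / n          # divide by N
sample_var = sum((x - mean) ** 2 for x in heights) / (n - 1) # divide by N - 1

print(mean)                      # 394.0
print(math.sqrt(pop_var))        # ~147.3  (population standard deviation)
print(math.sqrt(sample_var))     # ~164.7  (sample standard deviation)
```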
2.3.4 Correlation
When two sets of data are strongly linked together we say they have a High Correlation.
The word Correlation is made of Co- (meaning "together"), and Relation
Correlation is Positive when the values increase together, and
Correlation is Negative when one value decreases as the other increases
A correlation is assumed to be linear (following a line).
As a formula it is:
r = Σ(x − x̄)(y − ȳ) / √[ Σ(x − x̄)² × Σ(y − ȳ)² ]
Where:
Σ is Sigma, the symbol for "sum up"
x̄ is the mean of the x values, and ȳ is the mean of the y values
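A brief sketch of computing the correlation coefficient for a small data set; the temperature and sales values below are made-up illustrative numbers.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative data: ice cream sales vs temperature (made-up values).
temperature = [14, 16, 20, 23, 26]
sales = [215, 325, 410, 530, 610]
print(round(correlation(temperature, sales), 3))   # close to +1: strong positive correlation
```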
Let X1, X2, . . . , Xn be IID random variables with finite mean μ and finite variance σ².
We want to construct a confidence interval for μ, and also to treat the complementary problem of testing the hypothesis that μ = μ0.
If n is "sufficiently large," the random variable Zn = [X̄(n) − μ] / √(σ²/n) will be approximately distributed as a standard normal random variable, regardless of the underlying distribution of the Xi's.
It can also be shown for large n that the sample mean X̄(n) is approximately distributed as a normal random variable with mean μ and variance σ²/n.
The difficulty with using the above results in practice is that the variance σ² is generally unknown, so it must be estimated by the sample variance S²(n).
There is nothing probabilistic about the single confidence interval [l(n, α), u(n, α)] after the data have been obtained and the interval's endpoints have been given numerical values.
The correct interpretation to give to the confidence interval is this: if one constructs a very large number of independent 100(1 − α) percent confidence intervals, each based on n observations, where n is sufficiently large, the proportion of these confidence intervals that contain μ should be 1 − α.
We call this proportion the coverage for the confidence interval. To further amplify the correct interpretation to be given to a confidence interval, we generated 15 independent samples of size n = 10 from a normal distribution with mean 5 and variance 1.
For each data set we constructed a 90 percent confidence interval for μ, which we know has a true value of 5.
Suppose that the 10 observations 1.20, 1.50, 1.68, 1.89, 0.95, 1.49, 1.58, 1.55, 0.50, and 1.09 are from a normal distribution with unknown mean μ and that our objective is to construct a 90 percent confidence interval for μ.
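A sketch of the 90 percent confidence interval for these 10 observations using the t distribution; scipy is assumed to be available for the t quantile, and the resulting interval should land near [1.10, 1.58].

```python
import math
from scipy import stats

data = [1.20, 1.50, 1.68, 1.89, 0.95, 1.49, 1.58, 1.55, 0.50, 1.09]
n = len(data)
xbar = sum(data) / n                                    # sample mean
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)       # sample variance

# 90 percent confidence interval: xbar +/- t_{n-1, 0.95} * sqrt(s2 / n)
t_crit = stats.t.ppf(0.95, df=n - 1)
half_width = t_crit * math.sqrt(s2 / n)
print(round(xbar, 3), (round(xbar - half_width, 2), round(xbar + half_width, 2)))
```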
Regular interaction with these people also maintains their interest in the simulation
study. To gain credibility with these members of the project team, we had to include machine breakdowns and contention for resources. Furthermore, after the initial
model runs were made, it was necessary to make additional changes to the model
suggested by a mixer operator.
• Do not have more detail in the model than is necessary to address the issues of
interest, subject to the proviso that the model must have enough detail to be
credible. Thus, it may sometimes be necessary to include things in a model that are
not strictly required for model validity, due to credibility concerns.
• The level of model detail should be consistent with the type of data available. A
model used to design a new manufacturing system will generally be less detailed
than one used to fine-tune an existing system, since little or no data will be
available for a proposed system.
• In virtually all simulation studies, time and money constraints are a major factor in
determining the amount of model detail.
• If the number of factors (aspects of interest) for the study is large, then use a "coarse" simulation model or an analytic model to identify which factors have a significant impact on system performance.
3.1 Introduction:
To carry out a simulation using random inputs such as inter arrival times or demand
sizes, we have to specify their probability distributions.
For example, in the simulation of the single-server queuing system, the interarrival times were taken to be IID exponential random variables with a mean of 1 minute; the demand sizes in the inventory simulation were specified to be 1, 2, 3, or 4 items with respective probabilities 1/6, 1/3, 1/3, and 1/6.
Then, given that the input random variables to a simulation model follow particular
distributions, the simulation proceeds through time by generating random values from
these distributions.
Almost all real-world systems contain one or more sources of randomness.
Consider, for example, a histogram of repair times collected for an automotive manufacturer. It can be seen that the histogram has a longer right tail (positive skewness) and that the minimum value is approximately 25 minutes. Note that none of the four histograms examined has a symmetric shape like that of a normal distribution, despite the fact that many simulation practitioners and simulation books widely use normal input distributions. Because of this, it is generally necessary to represent each source of system randomness by a probability distribution (rather than just its mean) in the simulation model.
The following example shows that failure to choose the "correct" distribution can affect the accuracy of a model's results, sometimes drastically. Consider a single-server queueing system (e.g., a single machine in a factory) that has exponential interarrival times with a mean of 1 minute. Suppose that 200 service times are available from the system, but their underlying probability distribution is unknown.
We fitted exponential, gamma, Weibull, lognormal, and normal distributions to the observed service-time data.
In the case of the exponential distribution, we chose the mean β so that the resulting distribution most closely "resembled" the available data.
We then made 100 independent simulation runs (i.e., different random numbers were used for each run) of the queueing system, using each of the five fitted distributions.
For the normal distribution, if a service time was negative, then it was generated again.
Each of the 500 simulation runs was continued until 1000 delays in queue were collected.
The Weibull distribution actually provides the best model for the service-time data.
Thus, the average delay for the real system should be close to 4.36 minutes.
The average delays for the normal and lognormal distributions are 6.04 and 7.19
minutes, respectively, corresponding to model output errors of 39 percent and 65
percent.
This is particularly surprising for the lognormal distribution, since it has the same
general shape (i.e., skewed to the right) as the Weibull distribution.
It turns out that the lognormal distribution has a “thicker” right tail, which allows larger
service times and delays to occur.
The probability distributions can evidently have a large impact on the simulation output
and, potentially, on the quality of the decisions made with the simulation results.
If it is possible to collect data on an input random variable of interest, these data can
be used in one of the following approaches to specify a distribution (in increasing order
of desirability):
1. The data values themselves are used directly in the simulation. For example, if the data
represent service times, then one of the data values is used whenever a service time is needed
in the simulation. This is sometimes called a trace-driven simulation.
2. The data values themselves are used to define an empirical distribution function in some
way. If these data represent service times, we would sample from this distribution when a
service time is needed in the simulation.
3. Standard techniques of statistical inference are used to “fit” a theoretical distribution form
e.g., exponential or Poisson, to the data and to perform hypothesis tests to determine the
goodness of fit. If a particular theoretical distribution with certain values for its parameters is
a good model for the service-time data, then we would sample from this distribution when a
service time is needed in the simulation.
Two drawbacks of approach 1 are that the simulation can only reproduce what has
happened historically and that there is seldom enough data to make all the desired
simulation runs.
Approach 2 avoids these shortcomings since, at least for continuous data, any value
between the minimum and maximum observed data points can be generated . Thus,
approach 2 is generally preferable to approach 1.
Approach 1 does have its uses. For example, suppose that it is desired to compare a
proposed material-handling system with the existing system for a distribution center.
For each incoming order there is an arrival time, a list of the desired products, and a
quantity for each product. Modeling a stream of orders for a certain period of time
(e.g., for 1 month) will be diffi cult, if not impossible, using approach 2 or 3.
Thus, in this case the existing and proposed systems will often be simulated using the
historical order stream.
Approach 1 is also recommended for model validation when model output for an
existing system is compared with the corresponding output for the system itself.
If a theoretical distribution can be found that fits the observed data reasonably well
(approach 3), then this will generally be preferable to using an empirical distribution
(approach 2) for the following reasons:
• An empirical distribution function may have certain “irregularities,” particularly if only
a small number of data values are available. A theoretical distribution, on the other
hand, “smooths out” the data and may provide information on the overall underlying
distribution.
If empirical distributions are used in the usual way, it is not possible to generate values outside the range of the observed data in the simulation. This is unfortunate, since many
measures of performance for simulated systems depend heavily on the probability of an
“extreme” event’s occurring, e.g., generation of a very large service time. With a
fitted theoretical distribution, however, values outside the range of the observed data can be
generated.
There may be a compelling physical reason in some situations for using a certain theoretical
distribution form as a model for a particular input random variable. Even when we are
fortunate enough to have this kind of information, it is a good idea to use observed data to
provide empirical support for the use of this particular distribution.
• A theoretical distribution is a compact way of representing a set of data values.
Conversely, if n data values are available from a continuous distribution, then 2n values (e.g., data and corresponding cumulative probabilities) must be entered and stored in the computer to represent an empirical distribution in simulation packages. Thus, use
of an empirical distribution will be cumbersome if the data set is large.
• A theoretical distribution is easier to change. For example, suppose that a set of inter
arrival times is found to be modeled well by an exponential distribution with a mean
of 1 minute. If we want to determine the effect on the simulated system of increasing
the arrival rate by 10 percent, then all we have to do is to change the mean of the
exponential distribution to 0.909.
A probability distribution is formed from all possible outcomes of a random process (for a
random variable X) and the probability associated with each outcome. Probability
distributions may either be discrete (distinct/separate outcomes, such as number of children)
or continuous (a continuum of outcomes, such as height). A probability density function is
defined such that the likelihood of a value of X between a and b equals the integral (area
under the curve) between a and b. This probability is always positive. Further, we know that
the area under the curve from negative infinity to positive infinity is one.
The standard normal probability function has a mean of zero and a standard deviation of one. Often the x values of the standard normal distribution are called z-scores. We can calculate probabilities using a normal distribution table (z-table). It is important to note that in these tables, the probabilities are the area to the LEFT of the z-score. If you need to find the area to the right of a z-score (Z greater than some value), you subtract the value in the table from one.
Using this table, we can calculate P(−1 < Z < 1). To do so, first look up the probability that Z is less than negative one: P(Z < −1) = 0.1587. Because the normal distribution is symmetric, the probability that Z is greater than one also equals 0.1587, i.e., P(Z > 1) = 0.1587. To calculate the probability that Z falls between −1 and 1, we take 1 − 2(0.1587) = 0.6826. This region is roughly 68% of the area under the curve, which agrees with the three-sigma rule stated earlier.
We can convert any normal distribution to the standard normal distribution using the equation below: the z-score equals X minus the population mean (μ), all divided by the standard deviation (σ):
z = (X − μ) / σ
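A short check of the z-table calculation above using scipy's standard normal cdf (scipy is assumed to be available); the values μ = 100, σ = 15, x = 130 in the second part are purely illustrative.

```python
from scipy import stats

# P(Z < -1) and P(-1 < Z < 1) for the standard normal distribution.
p_left = stats.norm.cdf(-1)          # ~0.1587
p_between = 1 - 2 * p_left           # ~0.6827, the "68%" of the three-sigma rule
print(round(p_left, 4), round(p_between, 4))

# Converting an arbitrary normal value to a z-score: z = (x - mu) / sigma.
mu, sigma, x = 100, 15, 130          # illustrative values
z = (x - mu) / sigma
print(z, round(stats.norm.cdf(z), 4))  # P(X < 130) for X ~ N(100, 15^2)
```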
KEY TAKEAWAYS
A discrete probability distribution counts occurrences that have countable or finite
outcomes.
This is in contrast to a continuous distribution, where outcomes can fall anywhere on
a continuum.
Common examples of discrete distribution include the binomial, Poisson, and Bernoulli
distributions.
These distributions often involve statistical analyses of "counts" or "how many times"
an event occurs.
In finance, discrete distributions are used in options pricing and forecasting market
shocks or recessions.
Understanding Discrete Distribution
Distribution is a statistical concept used in data research. Those seeking to identify the
outcomes and probabilities of a particular study will chart measurable data points from a
data set, resulting in a probability distribution diagram. There are many types of probability
distribution diagram shapes that can result from a distribution study, such as the normal
distribution ("bell curve").
Statisticians can identify the development of either a discrete or continuous distribution by
the nature of the outcomes to be measured. Unlike the normal distribution, which is
continuous and accounts for any possible outcome along the number line, a discrete
distribution is constructed from data that can only follow a finite or discrete set of outcomes.
Discrete distributions thus represent data that has a countable number of outcomes, which
means that the potential outcomes can be put into a list. The list may be finite or infinite.
For example, when studying the probability distribution of a die with six numbered sides the
list is {1, 2, 3, 4, 5, 6}. A binomial distribution has a finite set of just two possible outcomes:
zero or one—for instance, flipping a coin gives you the list {Heads, Tails}. The Poisson
distribution is a discrete distribution that counts the frequency of occurrences as integers,
whose list {0, 1, 2, ...} can be infinite.
The most common discrete probability distributions include binomial, Poisson, Bernoulli,
and multinomial.
The Poisson distribution is also commonly used to model financial count data where the tally
is small and is often zero. For one example, in finance, it can be used to model the number
of trades that a typical investor will make in a given day, which can be 0 (often), or 1, or 2,
etc. As another example, this model can be used to predict the number of "shocks" to the
market that will occur in a given time period, say over a decade.
Another example where such a discrete distribution can be valuable for businesses
is inventory management. Studying the frequency of inventory sold in conjunction with a
finite amount of inventory available can provide a business with a probability distribution
that leads to guidance on the proper allocation of inventory to best utilize square footage.
The binomial distribution is used in options pricing models that rely on binomial trees. In a
binomial tree model, the underlying asset can only be worth exactly one of two possible
values—with the model, there are just two possible outcomes with each iteration—a move
up or a move down with defined probabilities.
Discrete distributions can also be seen in the Monte Carlo simulation. Monte Carlo simulation
is a modeling technique that identifies the probabilities of different outcomes through
programmed technology. It is primarily used to help forecast scenarios and identify risks. In
Monte Carlo simulation, outcomes with discrete values will produce discrete distributions for
analysis. These distributions are used in determining risk and trade-offs among different
items being considered.
3.5 Hypothesizing Families Of Distributions:
The first step in selecting a particular input distribution is to decide what general
families—e.g., exponential, normal, or Poisson—appear to be appropriate on the
basis of their shapes, without worrying (yet) about the specific parameter values for
these families.
It describes some general techniques that can be used to hypothesize families of
distributions that might be representative of a simulation input random variable.
In some situations, use can be made of prior knowledge about a certain random
variable’s role in a system to select a modeling distribution or at least rule out some
distributions;
This is done on theoretical grounds and does not require any data at all.
For example, if we feel that customers arrive at a service facility one at a time, at a constant rate, and in such a way that the numbers of customers arriving in disjoint time intervals are independent, then there are theoretical reasons for postulating that the interarrival times are IID exponential random variables.
Several discrete distributions—binomial, geometric, and negative binomial—
were developed from a physical model.
The range of a distribution can also rule it out as a modeling distribution. Service times, for
example, should not be generated directly from a normal distribution (at least in
principle), since a random value from any normal distribution can be negative.
The proportion of defective items in a large batch should not be assumed to have a
gamma distribution, since proportions must be between 0 and 1, whereas gamma
random variables have no upper bound.
Information should be used whenever available, but confirming the postulated
distribution with data is also strongly recommended.
In practice, we seldom have enough of this kind of theoretical prior information to
select a single distribution, and the task of hypothesizing a distribution family from
observed data is somewhat less structured.
In the remainder of this section, we discuss various heuristics, or guidelines, that can
be used to help one choose appropriate families of distributions.
Summary Statistics
Summary statistics are numerical values, such as measures of location and spread, that condense a data set into a few descriptive quantities.
This means that we can use summary statistics to quickly and efficiently get the gist of the information in the data.
Descriptive statistics deals with the collection, organization, summaries, and presentation of
data.
To find the mean of the data, we will need to find the average marks of 30 students. If the
average marks obtained by 30 students is 75 out of 100, then we can derive a conclusion or
give judgment about the performance of the students on the basis of this result.
Measures of Location
The arithmetic mean, median, mode, and inter quartile mean are the common measures of
location or central tendency.
Measures of Spread
Standard deviation, range, variance, absolute deviation, inter quartile range, distance
standard deviation, etc. are the common measures of spread/dispersion.
The coefficient of variation (CV) is a statistical measure of the relative spread of data points
around the mean.
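A short sketch computing the common measures of location and spread named above for a small made-up sample, using Python's standard statistics module.

```python
import statistics

data = [12, 15, 11, 19, 14, 21, 13, 18, 16, 15]   # made-up sample values

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)
stdev = statistics.stdev(data)                     # sample standard deviation
variance = statistics.variance(data)
data_range = max(data) - min(data)
cv = stdev / mean                                  # coefficient of variation

print(mean, median, mode, round(stdev, 2), round(variance, 2), data_range, round(cv, 3))
```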
Graphs / charts
Some of the graphs and charts frequently used in the statistical representation of the data
are given below.
Graphs:
Line graph
Bar graph
Histogram
Scatter plot
Frequency distribution graph
Charts:
Flow chart
Pie chart
3.5.2 Histograms:
What Is a Histogram?
A histogram is a graphical representation of data points organized into user-specified ranges.
Similar in appearance to a bar graph, the histogram condenses a data series into an easily
interpreted visual by taking many data points and grouping them into logical ranges or bins.
KEY TAKEAWAYS
A histogram is a bar graph-like representation of data that buckets a range of classes
into columns along the horizontal x-axis.
The vertical y-axis represents the number count or percentage of occurrences in the
data for each column
Columns can be used to visualize patterns of data distributions.
In trading, the MACD histogram is used by technical analysts to indicate changes in
momentum.
The MACD histogram columns can give earlier buy and sell signals than the
accompanying MACD and signal lines.
This histogram example would look similar to the chart below. Let's say the numerals along
the vertical axis represent thousands of people. To read this histogram example, you can
start with the horizontal axis and see that, beginning on the left, there are approximately
500 people in the town who are from less than one year old to 10 years old. There are 4,000
people in town who are 11 to 20 years old. And so on.
Histograms can be customized in several ways by analysts. They can change the interval
between buckets. In the example referenced above, there are eight buckets with an interval
of ten. This could be changed to four buckets with an interval of 20.
Another way to customize a histogram is to redefine the y-axis. The most basic label used is
the frequency of occurrences observed in the data. However, one could also use percentage
of total or density instead.
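As an illustration of choosing the number of bins and the y-axis label, the sketch below draws a histogram with matplotlib; the sample ages are randomly generated, made-up values.

```python
import random
import matplotlib.pyplot as plt

random.seed(1)
ages = [random.randint(0, 80) for _ in range(1000)]   # made-up ages of town residents

# Eight buckets of width 10; change bins=4 for four buckets of width 20.
plt.hist(ages, bins=8, range=(0, 80), edgecolor="black")
plt.xlabel("Age group")
plt.ylabel("Number of people")        # could also use density=True for a density scale
plt.title("Histogram of ages")
plt.show()
```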
Histograms vs. Bar Charts
Both histograms and bar charts provide a visual display using columns, and people often use
the terms interchangeably. Technically, however, a histogram represents the frequency
distribution of variables in a data set. A bar graph typically represents a graphical comparison
of discrete or categorical variables.
UNIT-IV
GENERATING RANDOM VARIATES
If N random numbers are divided into K class intervals, then the expected number of samples in each class should be e = N / K.
4. Maximum Cycle: It states that the repetition of numbers should be allowed only after
a large interval of time.
Here, the linear congruential method produces a sequence of integers X1, X2, . . . between 0 and m − 1 by the recursive relationship
Xi+1 = (a·Xi + c) mod m,   i = 0, 1, 2, . . .
The initial value X0 is called the seed;
a is called the constant multiplier;
c is the increment;
m is the modulus.
For example,
The sequence obtained when X0 = a = c = 7, m = 10, is
7, 6, 9, 0, 7, 6, 9, 0...
This example shows, the sequence is not always "random" for all choices of X 0, a, c, and m;
the way of choosing these values appropriately is the most important part of this method.
When c is not equal to 0, the form is called the mixed congruential method;
When c is equal to 0, the form is known as the multiplicative congruential method.
Combined Congruential method:
Combining two or more multiplicative congruential generators may increase the length of the period and improve other statistical properties.
Procedure for generating Random Numbers using Linear Congruential Method:
Choose the seed value X0, Modulus parameter m, Multiplier term a, and increment term
c.
Initialize the required number of random numbers to generate (say, an integer variable noOfRandomNums).
Define storage to keep the generated random numbers (here, a vector of size noOfRandomNums is used).
Initialize the 0th index of the vector with the seed value.
For the rest of the indexes, follow the Linear Congruential Method to generate the random numbers (a code sketch follows this list).
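A sketch of the procedure above; with the example values X0 = a = c = 7 and m = 10 it reproduces the sequence 7, 6, 9, 0, 7, 6, 9, 0, . . .

```python
def linear_congruential(seed, a, c, m, count):
    """Generate `count` integers with the linear congruential method."""
    numbers = [seed]                      # index 0 holds the seed
    for _ in range(count - 1):
        numbers.append((a * numbers[-1] + c) % m)
    return numbers

ints = linear_congruential(seed=7, a=7, c=7, m=10, count=8)
print(ints)                               # [7, 6, 9, 0, 7, 6, 9, 0]
print([x / 10 for x in ints])             # scaled to [0, 1) by dividing by m
```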
The first one tests for uniformity and the second to fifth ones test independence.
1. Frequency test:
• The frequency test is a test of uniformity.
• Two different methods used in the frequency test are
a. Kolmogorov-Smirnov test and
b. chi-square test.
• Both these two tests measure the agreement between the distribution of a sample of
generated random numbers and the theoretical uniform distribution.
• Both tests are based on the null hypothesis of no significant difference between the
sample distribution and the theoretical distribution
• As N becomes larger, SN(x) should become close to F(x).
• The Kolmogorov-Smirnov test is based on the statistic D = max |F(x) − SN(x)|, the largest absolute deviation between the theoretical cdf F(x) and the empirical cdf SN(x). The test proceeds as follows:
1. Rank the data from smallest to largest: R(1) ≤ R(2) ≤ . . . ≤ R(N).
2. Compute D+ = max over i of { i/N − R(i) }.
3. Compute D− = max over i of { R(i) − (i − 1)/N }.
4. Compute D = max(D+, D−) and determine the critical value Dα from the K-S table for the specified significance level α and the given sample size N.
5. If the sample statistic D is greater than the critical value Dα, the null hypothesis that the sample data are from a uniform distribution is rejected; if D ≤ Dα, there is no evidence to reject it.
i                 1       2       3       4       5
R(i)              0.05    0.14    0.44    0.81    0.93
i/N               0.20    0.40    0.60    0.80    1.00
i/N − R(i)        0.15    0.26    0.16    —       0.07
R(i) − (i−1)/N    0.05    —       0.04    0.21    0.13

Thus D+ = 0.26, D− = 0.21, and D = max(0.26, 0.21) = 0.26.
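A sketch of the Kolmogorov-Smirnov computation for the five numbers in the table above; the critical value of about 0.565 for α = 0.05 and N = 5 is quoted from standard K-S tables.

```python
def ks_statistic(sample):
    """D = max(D+, D-) for testing uniformity of a sample on [0, 1]."""
    r = sorted(sample)
    n = len(r)
    d_plus = max((i + 1) / n - r[i] for i in range(n))
    d_minus = max(r[i] - i / n for i in range(n))
    return max(d_plus, d_minus), d_plus, d_minus

sample = [0.44, 0.81, 0.14, 0.05, 0.93]
d, d_plus, d_minus = ks_statistic(sample)
print(round(d_plus, 2), round(d_minus, 2), round(d, 2))   # 0.26, 0.21, 0.26

critical_value = 0.565           # from the K-S table for alpha = 0.05, N = 5
print("reject uniformity" if d > critical_value else "do not reject uniformity")
```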
Chi-square test:
The Chi-Square test is a statistical procedure for determining the difference between
observed and expected data. This test can also be used to determine whether it correlates
to the categorical variables in our data. It helps to find out whether a difference between
two categorical variables is due to chance or to a relationship between them. The formula for the chi-square test statistic is
χ² = Σ (Oi − Ei)² / Ei
where Oi is the observed frequency and Ei is the expected frequency in category i.
1. Independence
The Chi-Square Test of Independence is an inferential statistical test which examines whether two sets of variables are likely to be related to each other or not. This test is used when we have counts of values for two nominal or categorical variables and is considered a non-parametric test. A relatively large sample size and independence of observations are the required criteria for conducting this test.
For Example-
In a movie theatre, suppose we made a list of movie genres. Let us consider this as the first
variable. The second variable is whether or not the people who came to watch those genres
of movies have bought snacks at the theatre. Here the null hypothesis is that the genre of the film and whether people bought snacks or not are unrelated. If this is true, the movie genres don't impact snack sales.
2. Goodness-Of-Fit
In statistical hypothesis testing, the Chi-Square Goodness-of-Fit test determines whether a
variable is likely to come from a given distribution or not. We must have a set of data values
and the idea of the distribution of this data. We can use this test when we have value counts
for categorical variables. This test demonstrates a way of deciding if the data values have a “
good enough” fit for our idea or if it is a representative sample data of the entire population.
For Example
Suppose we have bags of balls with five different colours in each bag. The given condition is
that the bag should contain an equal number of balls of each colour. The idea we would like
to test here is that the proportions of the five colours of balls in each bag must be exact.
Example 1: Suppose we want to know whether gender has anything to do with political party preference. We poll 440 voters in a simple random sample to find out which political party they prefer. The results of the survey are shown in the table below. The expected count for each cell is computed as (row total × column total) / grand total; similarly, you can calculate the expected value for each of the cells.
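As a companion to the goodness-of-fit discussion above, the sketch below runs a chi-square goodness-of-fit check for a die that is expected to be fair; the observed counts are made-up illustrative data, and scipy is assumed to be available for the chi-square quantile.

```python
from scipy import stats

observed = [22, 17, 19, 26, 21, 15]          # made-up counts from 120 die rolls
expected = [sum(observed) / len(observed)] * len(observed)   # 20 per face if fair

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1
critical = stats.chi2.ppf(0.95, dof)          # alpha = 0.05

print(round(chi_sq, 2), round(critical, 2))   # 3.8 vs ~11.07
print("reject fairness" if chi_sq > critical else "do not reject fairness")
```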
2. Runs Test:
1. Runs up and down
The runs test examines the arrangement of numbers in a sequence to test the
hypothesis of independence.
A run is defined as a succession of similar events preceded and followed by a different event.
E.g. in a sequence of tosses of a coin, we may have
{H T T H H T T T H T}
The first toss is preceded and the last toss is followed by a "no event". This sequence has six runs: the first with a length of one, the second and third with length two, the fourth with length three, and the fifth and sixth with length one.
A few features of a run
o two characteristics: number of runs and the length of run
o an up run is a sequence of numbers each of which is succeeded by a larger
number; a down run is a sequence of numbers each of which is succeeded by
a smaller number
If a sequence of numbers has too few runs, it is unlikely to be a truly random sequence.
E.g. 0.08, 0.18, 0.23, 0.36, 0.42, 0.55, 0.63, 0.72, 0.89, 0.91, the sequence
has one run, an up run. It is not likely a random sequence.
If a sequence of numbers has too many runs, it is unlikely to be a truly random sequence.
E.g. 0.08, 0.93, 0.15, 0.96, 0.26, 0.84, 0.28, 0.79, 0.36, 0.57. It has nine
runs, five up and four down. It is not likely a random sequence.
If a is the total number of runs (up and down) in a truly random sequence of N numbers, the mean and variance of a are given by
μa = (2N − 1) / 3   and   σa² = (16N − 29) / 90
The numbers of runs above and below the mean are also random variables. The expected value of Yi, the number of runs of length i, is approximated by
where E(I) is the approximate expected length of a run and wi is the approximate probability of a run of length i.
wi is given by
E(I) is given by
The approximate expected total number of runs (of all lengths) in a sequence of length N is given by
where L = N − 1 for runs up and down, and L = N for runs above and below the mean.
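A sketch of the runs-up-and-down test using the mean and variance formulas quoted above; the ten numbers are illustrative, and the two-sided standard normal critical value ±1.96 corresponds to α ≈ 0.05.

```python
import math

def runs_up_down_test(numbers, z_crit=1.96):
    """Runs-up-and-down test for independence (two-sided, alpha ~ 0.05)."""
    n = len(numbers)
    # Record the direction of each successive change; a new run starts when it flips.
    signs = [1 if numbers[i + 1] > numbers[i] else -1 for i in range(n - 1)]
    runs = 1 + sum(1 for i in range(len(signs) - 1) if signs[i] != signs[i + 1])
    mean_a = (2 * n - 1) / 3
    var_a = (16 * n - 29) / 90
    z0 = (runs - mean_a) / math.sqrt(var_a)
    return runs, z0, abs(z0) <= z_crit

nums = [0.41, 0.68, 0.89, 0.94, 0.74, 0.91, 0.55, 0.62, 0.36, 0.27]
runs, z0, independent = runs_up_down_test(nums)
print(runs, round(z0, 2), "do not reject" if independent else "reject")
```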
3. Auto-correlation:
The tests for auto-correlation are concerned with the dependence between numbers in a
sequence.
The test computes the autocorrelation between every m numbers (m is also known as the lag), starting with the ith number. Thus the autocorrelation ρim between the numbers Ri, Ri+m, Ri+2m, . . . , Ri+(M+1)m is of interest, where M is the largest integer such that i + (M + 1)m ≤ N. The test statistic Z0 is the estimator of ρim divided by its standard deviation; after computing Z0, do not reject the null hypothesis of independence if −zα/2 ≤ Z0 ≤ zα/2.
4. Gap Test
The gap test measures the number of digits (the "gap") between successive occurrences of the same digit, and compares the observed distribution of gap lengths with the expected distribution using the Kolmogorov-Smirnov statistic.
Step 3. Find D, the maximum deviation between F(x) and SN(x).
Step 4. Determine the critical value Dα from Table A.8 for the specified value of α and the sample size N.
Step 5. If the calculated value of D is greater than the tabulated value Dα, the null hypothesis of independence is rejected.
5. Poker Test
The poker test for independence is based on the frequency in which certain digits are
repeated in a series of numbers.
For example 0.255, 0.577, 0.331, 0.414, 0.828, 0.909, 0.303, 0.001... In each case,
a pair of like digits appears in the number.
In a three digit number, there are only three possibilities.
1. The individual digits can be all different. Case 1.
2. The individual digits can all be the same. Case 2.
3. There can be one pair of like digits. Case 3.
P(case 1) = P(second digit differs from the first) × P(third digit differs from the first and second) = 0.9 × 0.8 = 0.72
P(case 2) = P(second digit same as the first) × P(third digit same as the first) = 0.1 × 0.1 = 0.01
P(case 3) = 1 - 0.72 - 0.01 = 0.27
4.5 Inverse-Transform Technique:
The inverse transform technique can be used to sample from exponential, the
uniform, the Weibull and the triangle distributions.
The basic principle is to find the inverse function of F, such that X = F⁻¹(R), where F⁻¹ denotes the solution of the equation r = F(x) in terms of r, not 1/F.
For example, the inverse of y = x is x = y, and the inverse of y = 2x + 1 is x = (y − 1)/2.
The inverse-transform technique can be used in principle for any distribution.
• It is most useful when the cdf F(x) has an inverse F⁻¹(x) which is easy to compute.
For example, for the exponential distribution with rate λ, the steps are (a code sketch follows these steps):
Step 1. Compute the cdf of the desired random variable X: F(x) = 1 − e^(−λx) for x ≥ 0.
Step 2. Set F(X) = R on the range of X: 1 − e^(−λX) = R.
Step 3. Solve the equation F(X) = R for X in terms of R, which yields X = −(1/λ) ln(1 − R).
Step 4. Generate uniformly distributed random numbers R1, R2, . . . from (0, 1) and feed them into the function in Step 3 to get X1, X2, . . . .
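A sketch of Step 4 for the exponential case: uniform random numbers fed through X = −(1/λ) ln(1 − R); the rate λ = 1 and the sample size are illustrative choices.

```python
import math
import random

def exponential_variate(lam):
    """Inverse-transform generator for the exponential distribution with rate lam."""
    r = random.random()                  # R ~ Uniform(0, 1)
    return -math.log(1.0 - r) / lam      # X = -(1/lambda) * ln(1 - R)

random.seed(42)
sample = [exponential_variate(lam=1.0) for _ in range(10000)]
print(round(sum(sample) / len(sample), 3))   # sample mean should be close to 1/lambda = 1
```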
Triangular Distribution:
Consider, for example, a triangular distribution on [0, 2] with mode 1, whose cumulative distribution function is
F(x) = x²/2 for 0 ≤ x ≤ 1, and F(x) = 1 − (2 − x)²/2 for 1 < x ≤ 2.
Steps
Step 1. Write the piecewise cdf as above.
Step 2. Set F(X) = R on the first branch (0 ≤ R ≤ 1/2).
Step 2a. Set F(X) = R on the second branch (1/2 < R ≤ 1).
Step 3. Solve for X in terms of R on each branch, which yields
X = √(2R) for 0 ≤ R ≤ 1/2, and X = 2 − √(2(1 − R)) for 1/2 < R ≤ 1.
Poisson Distribution
o The Poisson pmf with mean α > 0 is p(n) = e^(−α) α^n / n!, for n = 0, 1, 2, . . .
o N = n if and only if A1 + A2 + . . . + An ≤ 1 < A1 + A2 + . . . + An+1, where the Ai are iid exponential interarrival times with mean 1/α. Essentially this means that if there are n arrivals in one unit of time, the sum of the interarrival times of the past n observations has to be less than or equal to one, but if one more interarrival time is added, it is greater than one (unit time).
o The Ai's in the relation can be generated from uniformly distributed random numbers Ri by Ai = −(1/α) ln Ri; substituting this and simplifying gives the equivalent condition
R1·R2 · · · Rn ≥ e^(−α) > R1·R2 · · · Rn·Rn+1.
If R1·R2 · · · Rn+1 < e^(−α), then accept N = n, meaning that in this time unit there are n arrivals. Otherwise, reject the current n, increase n by one, and return to Step 2.
o Efficiency: How many random numbers will be required, on the average, to
generate one Poisson variate, N? If N = n, then n+1 random numbers are
required (because of the (n+1) random numbers product).
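A sketch of the acceptance technique described above: multiply uniform random numbers together until the product falls below e^(−α), then return the number of factors used minus one; the value α = 2.0 is an illustrative choice.

```python
import math
import random

def poisson_variate(alpha):
    """Generate one Poisson(alpha) variate by the acceptance (product) technique."""
    threshold = math.exp(-alpha)
    n, product = 0, 1.0
    while True:
        product *= random.random()        # multiply in R_{n+1}
        if product < threshold:
            return n                      # accept N = n
        n += 1                            # otherwise reject and continue

random.seed(7)
sample = [poisson_variate(2.0) for _ in range(10000)]
print(round(sum(sample) / len(sample), 3))   # should be close to alpha = 2.0
```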
Estimation of performance metrics: Replication statistics provide the data for computing
point estimates and confidence intervals for system parameters of interest. Critical estimation
issues are the size of the sample to be collected and the independence of observations used
to compute statistics, particularly confidence intervals.
EXAMPLE: Consider a bank with five tellers and one queue, which opens its doors at 9 a.m.,
closes its doors at 5 p.m., but stays open until all customers in the bank at 5 p.m. have been
served. Assume that customers arrive in accordance with a Poisson process at rate 1 per
minute (i.e., IID exponential inter arrival times with mean 1 minute), that service times are
IID exponential random variables with mean 4 minutes, and that customers are served in a
FIFO manner. Table 5.1 shows several typical output statistics from l0 independent
replications of a simulation of the bank, assuming that no customers are present initially.
Steady state: The distribution of the random variable from a particular time point onward is approximately the same from one index to the next, and it does not depend on the initial conditions I.
Since most simulations are stochastic in nature, their output can vary from run to run due to
random chance. We typically need to analyze our results over many runs. The analysis is
affected by the type of outputs. They generally fall into two categories of behaviors for a
stochastic process.
Transient behaviour:
Indicated by a simulation with a specific termination event (ex: runs for X minutes, or runs
until C customers have been processed, or runs until inventory is exhausted etc.)
Steady-state behaviour
Indicated by a simulation that runs over a very long period of simulated time, or with no
stated stop event.
Consider the output stochastic process Y1, Y2, . . . . Let Fi(y | I) = P(Yi ≤ y | I) for i = 1, 2, . . . , where y is a real number and I represents the initial conditions used to start the simulation at time 0. [The conditional probability P(Yi ≤ y | I) is the probability that the event {Yi ≤ y} occurs given the initial conditions I.] For a manufacturing system, I might specify the number of jobs present, and whether each machine is busy or idle, at time 0. We call Fi(y | I) the transient distribution of the output process at (discrete) time i for initial conditions I.
The density functions for the transient distributions corresponding to the random variables Yi1, Yi2, Yi3, and Yi4 are shown in Fig. 5.1 for a particular set of initial conditions I and increasing time indices i1, i2, i3, and i4, where it is assumed that the random variable Yij has density function fYij. The density specifies how the random variable Yij can vary from one replication to another. In particular, suppose that we make a very large number of replications, n, of the simulation and observe the stochastic process Y1, Y2, . . . on each one. If we make a histogram of the n observed values of the random variable Yij, then this histogram (when appropriately scaled) will look very much like the density fYij. For fixed y and I, the probabilities F1(y | I), F2(y | I), . . . are just a sequence of numbers. If Fi(y | I) → F(y) as i → ∞ for all y and for any initial conditions I, then F(y) is called the steady-state distribution of the output process Y1, Y2, . . . . Strictly speaking, the steady-state distribution F(y) is only obtained in the limit as i → ∞. In practice, however, there will often be a finite time index such that the distributions from this point onward are approximately the same as each other, and we then say that the process is in steady state from that time onward.
Fig. 5.1 Transient and steady-state density functions for a particular stochastic
process Y1, Y2, . . . and initial conditions I.
Example: Consider the stochastic process D1, D2, . . . for the M/M/1 queue with utilization ρ = 0.9 (λ = 1, ω = 10/9), where Di is the delay in queue of the ith customer. In Fig. 5.2 we plot the convergence of the transient mean E(Di) to the steady-state mean d as i grows.
Fig. 5.2 E(Di) as a function of i and the number in system at time 0, s, for the M/M/1 queue with ρ = 0.9.
Table 5.2 shows the differences between transient and steady-state behavior of stochastic processes with the following characteristics.
5.3 Types of Simulation with respect to output analysis
The options available in designing and analyzing simulation experiments depend on the type
of simulation at hand, as depicted in Fig. 5.3. Simulations may be either terminating or non
terminating, depending on whether there is an obvious way for determining the run length.
Terminating simulation:
Runs for some duration of time TE, where E is a specified event that stops the
simulation.
Starts at time 0 under well-specified initial conditions.
Ends at the stopping time TE.
Bank example: Opens at 8:30 am (time 0) with no customers present and 8 of the 11 tellers working (initial conditions), and closes at 4:30 pm (time TE = 480 minutes).
The simulation analyst chooses to consider it a terminating system because the object
of interest is one day’s operation.
The following examples illustrate when a simulation should be treated as terminating or non-terminating.
Examples
1. A retail/commercial establishment, e.g., a bank, has working hours 9 to 5; the objective is to measure the quality of customer service during this specified 8 hours. Here the initial condition is the number of customers present at time 0 (which is to be specified).
2. An aerospace manufacturer receives a contract to produce 100 airplanes, which must be delivered within 18 months.
3. A company that sells a single product would like to decide how many items to have in inventory during a 120-month planning horizon. Given some initial inventory level, the objective is to determine how much to order each month so as to minimize the expected average cost per month of the inventory system.
4. Consider a manufacturing company that operates 16 hours a day (two shifts) with work in process carrying over from one day to the next. Would this qualify as a terminating simulation with E = {16 hours of simulated time have elapsed}? No, since this manufacturing operation is essentially a continuous process, with the ending conditions for one day being the initial conditions for the next day.
Non-terminating simulation:
A non-terminating simulation is one of a system that runs continuously, or at least over a very long period of time. It starts at simulation time 0 under initial conditions defined by the analyst and runs for some analyst-defined period of time TE. A steady-state simulation is a simulation whose objective is to study the long-run behaviour of a non-terminating system.
Example: Consider a company that is going to build a new manufacturing system and would
like to determine the long-run (steady-state) mean hourly throughput of their system after it
has been running long enough for the workers to know their jobs and for mechanical
difficulties to have been worked out. Assume that:
(a) The system will operate 16 hours a day for 5 days a week.
(b) There is negligible loss of production at the end of one shift or at the beginning of the next
shift .
(c) There are no breaks (e.g., lunch) that shut down production at specified times each day.
Fig 5.3: Types of simulations with regard to output analysis
This system could be simulated by “pasting together” 16-hour days, thus ignoring the system
idle time at the end of each day and on the weekend. Let Ni be the number of parts
manufactured in the ith hour. If the stochastic process N1, N2, . . . has a steady-state
distribution with corresponding random variable N, then we are interested in estimating the mean ν = E(N).
However, stochastic processes for most real systems do not have steady-state distributions, since the characteristics of the system change over time. For example, in a manufacturing system
the production-scheduling rules and the facility layout (e.g., number and location of
machines) may change from time to time.
A simulation model (which is an abstraction of reality) may have steady-state distributions,
since characteristics of the model are often assumed not to change over time.
Example: If the manufacturing company wanted to know the time required for the system to go from startup to operating in a "normal" manner, this would be a terminating simulation with terminating event E = {the simulated system is running "normally"} (if such an event can be defined).
Thus, a simulation for a particular system might be either terminating or non terminating,
depending on the objectives of the simulation study.
Consider a stochastic process Y1, Y2, . . . for a non terminating simulation that does not have
a steady-state distribution. Suppose that we divide the time axis into equal-length, contiguous
time intervals called cycles.
Let YiC be a random variable defined on the ith cycle, and assume that Y1C, Y2C, . . . are comparable. Suppose that the process Y1C, Y2C, . . . has a steady-state distribution FC and that YC ~ FC. Then a measure of performance is said to be a steady-state cycle parameter if it is a characteristic of YC such as the mean νC = E(YC). Thus, a steady-state cycle parameter is just a steady-state parameter of the appropriate cycle process Y1C, Y2C, . . . .
Example: Suppose for the manufacturing system , there is a half-hour lunch break at the
beginning of the fifth hour in each 8-hour shift. Then the process of hourly throughputs N1,
N2, . . . has no steady-state distribution. Let NiC be the average hourly throughput in the ith 8-hour shift (cycle).
For a non terminating simulation, suppose that the stochastic process Y1, Y2, . . .does
not have a steady-state distribution, and that there is no appropriate cycle definition such
that the corresponding process Y1C, Y2C, . . . has a steady-state distribution.
This can occur, for example, if the parameters for the model continue to change over time.
For example, if the arrival rate of calls changes from week to week and from year to year,
then steady-state (cycle) parameters will probably not be well defined. In these cases,
however, there will typically be a fixed amount of data describing how input parameters
change over time. This provides, in effect, a terminating event E for the simulation and, thus,
the analysis techniques for terminating simulations are appropriate.
Estimating Means
Suppose that we want to obtain a point estimate and confidence interval for the mean μ = E(X), where X is a random variable defined on a replication as described above. Make n independent replications of the simulation and let X1, X2, . . . , Xn be the resulting IID random variables. Then the sample mean X̄(n) is an unbiased point estimator for μ, and an approximate 100(1 − α) percent (0 < α < 1) confidence interval for μ is given by
X̄(n) ± t(n−1, 1−α/2) √( S²(n) / n )      -----(1)
where the sample variance S²(n) is computed from the Xj's as in the fixed-sample-size procedure.
Example:
For the bank, suppose that we want to obtain a point estimate and an approximate 90 percent confidence interval for the expected average delay of a customer over a day.
Subject to the correct interpretation to be given to confidence intervals, we can claim with approximately 90 percent confidence that E(X) is contained in the interval [1.71, 2.35] minutes.
For the inventory system, suppose that we want to obtain a point estimate and an approximate 95 percent confidence interval for the expected average cost over the 120-month planning horizon, which resulted in
X̄(10) = 126.07,  S²(10) = 23.55
and the 95 percent confidence interval
126.07 ± 3.47, or [122.60, 129.54]
The estimated coefficient of variation, a measure of variability, is 0.04 for the inventory
system and 0.27 for the bank model. Thus the Xj’s for the bank model are inherently more
variable than those for the inventory system.
The decision to perform a terminating or non-terminating simulation has less to do with the
nature of the system than it does with the behaviour of interest. A terminating simulation is
one in which the simulation starts at a defined state or time and ends when it reaches
some other defined state or time.
In general, independent replications are used, each run using a different random number
stream and independently chosen initial conditions.
Statistical Background:
Important to distinguish within-replication data from across replication data
For example, simulation of a manufacturing system
o Two performance measures of that system: cycle time for parts and work in
process (WIP).
o Let Yij be the cycle time for the j-th part produced in the i-th replication.
o Across-replication data are formed by summarizing within-replication data, e.g., the replication average Ȳi of the Yij's.
The following are the properties of across-replication and within-replication data:
o Across-replication data are discrete-time data (one summary value per replication) and are independent and identically distributed.
o Within-replication data are continuous-time (serially collected) data and are neither independent nor identically distributed.
Suppose that an error criterion ε is specified. With probability 1 − α, a sufficiently large sample size R should satisfy
R ≥ ( zα/2 S0 / ε )²
where S0² is an initial estimate of the population variance.
Example:
Call Center Example: estimate the agent’s utilization ρ over the first 2 hours of the workday.
Initial sample of size R0 = 4 is taken and an initial estimate of the population variance is
S0² = (0.072)² = 0.00518.
The error criterion is ε = 0.04 and the confidence coefficient is 1 − α = 0.95; hence, the final sample size must be at least
R ≥ ( z0.025 S0 / ε )² = (1.96)² (0.00518) / (0.04)² ≈ 12.4,
so at least 13 replications are needed.
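A sketch of the sample-size calculation for the call-center example, using the z-based formula above; scipy is assumed to be available for the normal quantile, and the result of roughly 12.4 rounds up to 13 replications.

```python
import math
from scipy import stats

s0_sq = 0.072 ** 2          # initial variance estimate from the R0 = 4 pilot replications
epsilon = 0.04              # error criterion
alpha = 0.05                # confidence coefficient 1 - alpha = 0.95

z = stats.norm.ppf(1 - alpha / 2)                 # ~1.96
r_min = (z ** 2) * s0_sq / (epsilon ** 2)         # (z * S0 / epsilon)^2
print(round(r_min, 2), "-> at least", math.ceil(r_min), "replications")
```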