Data Analytics Foundations
Stochastic Processes
Any brand names and product names mentioned in this book are subject to trademark, brand or patent protection and are trademarks or registered trademarks of their respective holders. The use of brand names, product names, common names, trade names, product descriptions etc., even without a particular marking in this work, is in no way to be construed to mean that such names may be regarded as unrestricted in respect of trademark and brand protection legislation and could thus be used by anyone.
Publisher:
LAP LAMBERT Academic Publishing
is a trademark of
International Book Market Service Ltd., member of OmniScriptum Publishing Group
17 Meldrum Street, Beau Bassin 71504, Mauritius
Printed at: see last page
Copyright © 2020 International Book Market Service Ltd., member of OmniScriptum Publishing Group
To the memory of my late parents,
DATA ANALYTICS is a series of textbooks that provides the background of concepts, statistical methods, probabilistic models and practical research problems in science, engineering and sustainable development. The main aim is to develop statistics-based solutions in specific areas of engineering, actuarial science, industrial production and traffic science, in the context of sustainable economic development.
Statistical and probabilistic methods reveal important information and knowledge in data sets observed across a range of application domains and sectors, such as actuarial science, financial science, quality control, government bodies, supply chain management, urban traffic management, software and industrial manufacturing.
The books in the series provide statistical support to practitioners, experts and researchers who work in various fields: not only actuaries, computer scientists, environmentalists, finance analysts, government decision makers and corporate managers, but also process engineers, software engineers and traffic engineers.
The book series also supports students from master's to doctoral level who are taking courses in Data Science and/or Data Analytics, who seek practical solutions in their daily work, and who make optimal decisions using observational data sets that are becoming ever larger and more complex in structure.
DATA ANALYTICS - STATISTICAL FOUNDATION
Stochastic Processes
Department of Mathematics
Preface   xiv
1.6   Summary   37
1.7.4   Self-test   50
2.1   Overview   54
16.1.4   Coding all variables and defining the response variable   462
16.1.5   Choosing suitable statistical models via predictors and response   464
16.1.6   Discussion of the computational outcome of R   467
16.2   Data Analytics Project 2: Bridge Health Monitoring   471
16.2.1   Overview   471
16.2.2   Principal Component Analysis   472
16.2.3   Selecting the Components   473
16.2.4   Selecting the Variables   475
16.2.5   Time Series Modeling for BHM   476
16.2.6   Data clustering   476
16.2.7   Damage extraction   478
16.2.8   Sequential Probability Ratio Test for Damage Identification   478
16.2.9   Closing remarks and open problems   481
Probability Theory decodes the uncertainty of events in our world; it provides mathematical concepts and methods for formulating and computing the likelihood of events and processes under the influence of many factors. Statistical Science (Statistics) can be briefly described as the science of problem solving in the presence of variability or uncertainty. This identifies Statistics as a scientific discipline which demands the same type of rigor and adherence to basic principles as Physics or Chemistry.
The emerging field of Data Analytics refers to the techniques used to analyze data to enhance productivity and business gain. In practice, Data Analytics is the synergy of many building components: tools from Statistics and Probability Theory, algorithmic ideas from Computing, the powerful fundamentals of Mathematics, and, last but not least, specific knowledge of the application domains. We can define Data Analytics as the science and the art of making 'good or meaningful' decisions based on data sets, within a limited time range, a finite budget and/or finite computational resources.
Combining Probability Theory and Statistics provides a powerful approach to scientific endeavors and technological breakthroughs. This can be seen by viewing in parallel the scientific discoveries of at least the 19th and 20th centuries, from the theory of evolution (Darwin) to modern biology (Watson, 1953) and quantum mechanics (Dirac, Bohr, Heisenberg, Hawking during 1930-1970...) to the social sciences; and, on the other side, the developments of Statistics, from the Monte Carlo casino games that inspired Monte Carlo simulation algorithms, the foundations of probability (Kolmogorov, 1933), ANOVA and the design of experiments (Fisher, 1920s), and exploratory data analysis (John Tukey), to Bayesian inference (Lehmann, 1970) and bootstrapping (Efron, 1980).
Now, in our 21st century, the marriage of mathematics and computing with probability and statistics gives us machine learning and artificial intelligence (AI); and these newborn disciplines have already proven fundamentally meaningful in solving hard and urgent problems, typically in traffic, communication, and the life sciences. We all know that the major forces in artificial intelligence have been intensive computational algorithms and causal inference since the 1960s (Pearl [16]).
However, the probability-and-statistics-based topics of causal analysis and causal inference turn out to be critical components in developing any useful and complex tool, since AI won't be very smart if computers don't grasp cause and effect; see the comments by Pearl's followers in a recent MIT Technology Review [15].
This book presents an essential body of knowledge for data analytics, consisting of statistical inference, linear regression and stochastic analysis. It conveys basic knowledge, theoretical methods and practical approaches to students and professionals, especially those working in the sciences (computer, environmental, actuarial, economic), engineering and industrial manufacturing.
The book was written with the generous support of the Department of Mathematics, Faculty of Science, Mahidol University, and the Center of Excellence in Mathematics, Thailand.
The author expresses his sincere thanks to Vietnamese colleagues for contributing many beautiful figures as well as useful ideas. These bring an aesthetic sense to readers and may help them better understand abstract ideas in this science. Particular thanks go to An Khuong Nguyen, Tuong Nguyen Huynh, Phu Le Vo, Hoai Van Tran and Trung Van Le at HCMUT, VNHCM, Vietnam; and to Hien Trong Huu Phan (Melbourne, Australia), Linh V. Huynh and Vinh Tan Tran (in the U.S.).
Moreover, some theoretical topics were included in the book because of the special attraction of actual monitoring data sets, encountered through communication with engineers and/or listening to experimental researchers, after seminars or academic exchanges. In these contexts, the interactions themselves, the ideas from communicating with colleagues, are the main factors that shaped the structure and content of this text. Most gracious thanks are due to John Borkowski in Montana and Nabendu Pal in Louisiana, the U.S., for introducing diverse new approaches in statistical science.
Despite the fine efforts of these individuals, a few errors may remain in the text, and these possible flaws are entirely my responsibility. I would appreciate receiving comments from readers so that this series can become more useful.
The author hopes readers find it joyful to employ the methods in this book, the first text of this data analytics series, when making important or optimal decisions based on huge data sets with complex or unusual structure. He sincerely thanks colleagues and friends for their collaboration, and wishes to hear your opinions and comments in the future.
The book has sixteen chapters and three appendices, which can be grouped roughly into four parts, ordered by increasing difficulty. The level of difficulty is far from uniform: the first and second parts are intended to be accessible with less background.
Part A provides the foundation for Statistical Science and Data Analytics; it starts with the theory of probability, random variables and probability distributions (Chapter 1), and ends with an introduction to Statistical Science for Data Analytics in Chapter 2. A few probabilistic tools are recalled in Appendix A, and an introduction to the software R is given in Appendix B.
[Diagram: the four parts of the book - Part A: Probability and Probability Distributions; Part B: Data Exploration and Statistical Inference; Part C: Data Analysis, Statistical Designs and Linear Regression; Part D: Stochastic Process Based Analysis]
Part B, on data exploration and statistical inference, opens with Exploratory Data Analysis in Chapter 3, which is meaningful for all later developments in Data Analytics, not just those in this book. Parameter estimation and hypothesis testing are elaborated in Chapters 4 and 5. In Part C we start the discussion of Causal Data Analysis, first when predictors are qualitative (Chapters 6 and 7), then when they are quantitative, in Chapters 8 to 11 on linear regression models.
Finally, Part D introduces Stochastic Analysis, based on the theory of stochastic processes, with Chapters 12 to 16 discussing stochastic processes, statistical simulation, Poisson processes, branching processes, and a few case studies, respectively. Appendix C briefly presents the generalized linear model, a key approach for regression analysis when responses are counts.
Guidelines for Instructors and Self-learners
• Chapters 1 to 5 provide an essential body of knowledge for one semester of study in any scientific or engineering program at the undergraduate level.
• Chapters 6 and 7, on statistical designs, contribute another view of causal analysis (but not causal inference) when we are interested in discrete-valued predictors. Professional readers can extend their contents to a full course on quality control for academic or industrial sectors. [1]
• Chapters 8 to 11 of Part C build up a concise background for regression, and particularly for linear regression modeling, which could be studied in one semester at the undergraduate level. [2]
• Last but not least, the advanced Part D presents diverse angles of stochastic analysis. Chapters 12 to 15 could be merged into a solid one-semester lecture for master's or doctoral students. [3] Chapter 16 is especially designed as a seminar-based course for graduate students.
[1] These chapters are motivated by the papers [34], [24], [21], the thesis [36], and the book [62].
[2] These chapters are motivated by the papers [26], [20], and the books [38], [46].
[3] These chapters are motivated by the papers [19], [22], [25], [32], [33], and the texts [42], [44] and [45].
Part A
Probability and Probability Distributions
The world is full of uncertainty, and this is certainly true in engineering, service, commerce and business. A key aspect of solving real problems is dealing appropriately with uncertainty. This involves explicitly recognizing that uncertainty exists and using quantitative methods to model it. If you want to develop realistic models of industrial, technological and/or business problems, you should not simply act as if uncertainty doesn't exist.
Part A provides the foundation for the whole book. The objective of Part A specifically is to in-
troduce fundamental theoretical concepts of random variables and probability distributions. This
presentation will establish the necessary link between statistics and probability.
• First, basic probability theory (e.g. concepts, rules, Bayes' theorem) and random variables (the components and laws of random variables) are formally defined in Chapter 1. Discrete and continuous types are then treated separately, followed by a description of their properties and use. Multiple random variables are treated extensively in more advanced data analytics courses.
• Specific types of discrete and continuous models of importance in actuarial science, medicine, finance, civil and environmental engineering, etc. are briefly motivated in Chapter 2. Finally, Appendix B introduces the popular statistical software R.
Chapter 1
Probability Theory and Probability Distributions
Figure 1.1: Can we reveal the information and knowledge behind this beautiful picture? [Source [9]]
CAN UNCERTAINTY BE EVALUATED?
Education: Given that the probability that a randomly selected student in a class is female is 56%, what is the chance that a selected student is male?
Farming: Eggs sold at a market are usually packaged in boxes of six. The number 𝑋 of broken eggs in a box has the probability distribution given in the table:
𝑥 0 1 2 3 4 5 6
P[𝑋 = 𝑥] 0.80 0.14 0.03 0.02 0.01 0 0
Denote by 𝑌 the number of unbroken eggs in a box. What are the averages of 𝑋 and of 𝑌, respectively?
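As a quick numerical check, the two means follow directly from the table; here is a minimal sketch in R (the software introduced in Appendix B), using the relation 𝑌 = 6 − 𝑋:

# Egg example: distribution of broken eggs X from the table above
x <- 0:6
p <- c(0.80, 0.14, 0.03, 0.02, 0.01, 0, 0)
EX <- sum(x * p)   # E[X] = 0.30 broken eggs per box
EY <- 6 - EX       # Y = 6 - X, so E[Y] = 5.70 unbroken eggs
c(EX, EY)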
Industry The mass, 𝑋 kg, of silicon produced in a PC manufacturing process is modeled by the
probability density function (pdf)
f_X(x) = (3/32)(4x − x²) if 0 ≤ x ≤ 4;  f_X(x) = 0 otherwise.
Financial and Actuarial Science: Suppose an insurance company 𝐵 has thousands of customers, and each customer is charged $500 a year. Since the customers' businesses are risky, the company 𝐵 estimates from past experience that about 15% of its customers will get into fatal trouble (e.g. fire, accident, ...) and, as a result, will submit a claim in any given year.
We assume that the claim will always be $3000 for each customer.
a/ Model the amount of money that the insurance company expects to obtain from each customer. How much can the company expect to make per customer in a given year?
b/ Now suppose that there are 𝑁 = 300 customers, and the amount of money that the 𝑖-th customer could receive from 𝐵 is a random variable 𝑋𝑖, for 𝑖 = 1, 2, ..., 𝑁. The {𝑋𝑖} form an i.i.d. sequence with the same distribution as a common random variable 𝑋 (see Section 1.4), and 𝑋 has the observed values
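For part a/, under the stated assumptions (premium $500, claim probability 15%, fixed claim $3000), the expected profit per customer can be checked with a short R sketch:

# Per-customer profit: 500 - 3000 with probability 0.15, else 500
profit <- c(500 - 3000, 500)
probs  <- c(0.15, 0.85)
sum(profit * probs)   # expected profit: $50 per customer per year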
What is probability? The probability of an event or phenomenon is, informally, a numerical measure of how much chance that event will happen.
Experiments. An experiment E is a specific trial or activity (of scientists, or of human beings generally) whose outcomes possess randomness.
• Coin throwing - throw a coin; the random outcomes are head (H) or tail (T).
• Temperature measuring - record the midday temperature on ten days; for instance the outcomes [34, 29, 28, 32, 31, 32, 30, 31, 30, 33] (in degrees Celsius).
Basic concepts
Usually we include all events into a set, called the event set
Q := { events 𝐴 : 𝐴 ⊂ Ω}.
When an experiment E is performed and an outcome 𝑎 is observed we say that event 𝐴 has occurred
if 𝑎 ∈ 𝐴.
P : Q → [0, 1],  A ∈ Q ↦ P(A) = Prob(A), the probability or chance that A occurs.
Events are sets of outcomes. Therefore, to learn how to compute probabilities of events, we shall
discuss some set operations. Namely, we shall define unions, intersections, differences, and comple-
ments.
Let 𝐴, 𝐵 be events; we can form new events as below.
• 𝐴 and 𝐵 are disjoint if 𝐴 ∩ 𝐵 = ∅, that is, they contain no common element. Obviously, by definition, any event is disjoint from its complementary event, i.e. 𝐴 ∩ 𝐴ᶜ = ∅.
[Figure: Venn diagrams of the union 𝐴 ∪ 𝐵 and the intersection 𝐴 ∩ 𝐵, with examples]
We extend the definition of the union (and intersection) of the pairs of events to the case of a finite
number of events.
⋃_{i=1}^{n} A_i = A_1 ∪ A_2 ∪ ... ∪ A_n = {x | x ∈ A_1 ∨ x ∈ A_2 ∨ ... ∨ x ∈ A_n},
⋂_{i=1}^{n} A_i = A_1 ∩ A_2 ∩ ... ∩ A_n = {x | x ∈ A_1 ∧ x ∈ A_2 ∧ ... ∧ x ∈ A_n}.
We often abbreviate the intersection as ⋂_{i=1}^{n} A_i = A_1 ∩ A_2 ∩ ... ∩ A_n ≡ A_1 A_2 ... A_n.
The union of events satisfies commutativity, associativity and distributivity, expressed in turn by equations such as
A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)  (distributive law)
and the De Morgan laws
(A ∪ B)ᶜ = Aᶜ ∩ Bᶜ,  (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ,  (1.2)
in which A, B ⊂ Ω are events and the sample space Ω is formed from a random experiment E. More generally, probability can be interpreted in the following ways.
a) Frequency interpretation: probability is based on history (data obtained or observed). For any event
𝐴 ⊂ Ω, its probability is the relative frequency
P(A) = Prob(A) = Σ_{s∈A} P(s).
Example: consider again the ten recorded temperatures [34, 29, 28, 32, 31, 32, 30, 31, 30, 33] (in degrees Celsius). The sample space Ω is the above list, and if we suppose the chance of seeing any temperature t ∈ Ω is the same, then for an event A containing 6 of the 10 observations, P(A) = Σ_{s∈A} P(s) = 6/10.
b) Classical interpretation: compute probability based on the assumption that all outcomes have equal probability. This applies when the sample space Ω satisfies |Ω| = n < ∞; then for any event A ⊂ Ω, its probability is the fraction found by counting methods:
P(A) = Prob(A) = |A| / |Ω|.
Example 3: In coin throwing, Ω = {H, T}, so P(H) = P({H}) = 1/2 = P(T).
c) Subjective interpretation: use a model; one can hypothesize about the phenomenon possessing randomness.
Probability of a single event. For finite sample spaces, we assume Ω = {𝑠1 , 𝑠2 , · · · , 𝑠𝑛 } and next
define 𝑝𝑖 = P(𝑠𝑖 ) then
p_i ≥ 0, and Σ_{i=1}^{n} p_i = 1.
Multiple events– Union or addition rule. What are mutually exclusive and not mutually exclusive
events?
Two events 𝐴 and 𝐵 are mutually exclusive if 𝐴 ∩ 𝐵 = ∅, i.e. the occurrence of 𝐴 precludes the
occurrence of 𝐵. Then
P(𝐴 and 𝐵) = P(𝐴 ∩ 𝐵) = 0
Example 1.2.
If the die is fair, when tossing a die the probability of seeing number i is p_i = P(i) = 1/6. The probability of the event Z = 'getting 2 or 3 or 4' is
P(Z) = P(2 or 3 or 4) = Σ_{s∈Z} P(s) = 3/6.
• Observe that
P[B] = P[BA] + P[BAᶜ];
more generally, for a partition {E_i} of Ω, BE_i ∩ BE_j = B ∩ E_i ∩ E_j = ∅ for i ≠ j.
Event A and event B are dependent if the appearance of one event is related (in a certain way) to the occurrence of the other event.
In the picture above on the right, for example, we can view B as the event that a girl appears; then the event A that a boy appears, in both the left and right pictures, is dependent on B.
So what happens when A is not related to the occurrence of B?
CONCEPT 1.
Events 𝐴 and 𝐵 are independent if the occurrence of 𝐴 is not connected in any way to the occurrence
of 𝐵. Then P(𝐴𝐵) = P(𝐴) · P(𝐵).
• The joint probability of two independent events A and B is P(A ∩ B) = P(A) · P(B). (1.4)
• In general, events 𝐸1 , 𝐸2 , . . . , 𝐸𝑛 are independent if they occur independently of each other, i.e.,
occurrence of one event does not affect the probabilities of others.
Given that event 𝐵 “happened”, what is the probability that event 𝐴 also happened?
Brainstorming thought: narrow down the sample space to the one in which B has occurred; then compare A ∩ B with B.
The formula: Conditional probability of event 𝐴 given event 𝐵
P(A | B) = P(AB) / P(B) = P(A ∩ B) / P(B). (1.5)
CONCEPT 2.
If A and B are independent, we get their joint probability as presented in Equation 1.4: P(A ∩ B) = P(A) · P(B).
Example 1.3. Suppose that two balls are to be withdrawn, without replacement, from an urn that contains 9 blue and 7 yellow balls.
If each ball drawn is equally likely to be any of the balls in the urn at the time, what is the probability
that both balls are blue?
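The answer is (9/16)(8/15) = 0.3; a minimal R check, also via the hypergeometric distribution, is:

# P(both balls blue) when drawing 2 without replacement from 9 blue + 7 yellow
(9 / 16) * (8 / 15)              # sequential reasoning: 0.3
dhyper(2, m = 9, n = 7, k = 2)   # hypergeometric form: also 0.3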
1. Sampling with replacement: you take a unit out of the population of interest and return it (to the population) before you take the next unit; repeat this action until you obtain a sample of given size n ≥ 1.
2. Sampling without replacement: you take a unit out of the population of interest and continue taking further units from that population (so the number of units available for picking declines), until you obtain a sample of given size n ≥ 1.
we always have
P(A | B) = P(A) · P(B | A) / P(B). (1.6)
What are dependent events? Events A and B are dependent if the occurrence of one is connected in some way to the occurrence of the other. Then the joint probability of A and B is
P(A ∩ B) = P(A) · P(B | A). (1.7)
How about the independent case? We know from the section above that events A and B are independent if the occurrence of A is not connected in any way to the occurrence of B. Then P(B | A) = P(B). Plugging this back into Equation 1.7, we obtain the joint probability of independent events A and B as given in Equation 1.4.
Independence for m > 2 events A_1, A_2, ..., A_m: mutual independence for m > 2 events means:
• they are pairwise independent: P[A_i A_j] = P[A_i] P[A_j] for all pairs of indices 1 ≤ i ≠ j ≤ m;
• they are 3-wise independent: P[A_i A_j A_l] = P[A_i] P[A_j] P[A_l] for all triples of distinct indices 1 ≤ i, j, l ≤ m; and so on, up to m-wise independence.
A variable, such as the strength of a concrete or any other material or physical quantity, whose value is uncertain, unpredictable or nondeterministic, is called a random variable (or a variate if its distribution is known). Practically, a random variable may assume some value whose magnitude depends on a particular occurrence or outcome (usually noted by an observation or measurement) of an experiment in which tests are made and records maintained.
A random variable is formally viewed as a function defined on the sample space of an experiment
such that there is a numerical value of the random variable corresponding to each possible outcome;
that is there is a probability associated with each occurrence in the sample space.
A random variable X is a map from a sample space Ω to the reals R; that is, for w ∈ Ω, X(w) ∈ R. We write S_X = Range(X) = {X(w) : w ∈ Ω}.
Its range 𝑆𝑋 can be the set of all real numbers R, or only the positive numbers R+ = (0, +∞), or
the integers Z, or the interval (0, 1), etc., depending on what possible values the random variable can
potentially take.
For example, if X measures the height of male students in Europe, then Ω is the set of those students, X(w) (in meters) is the height of student w, and possibly S_X = Range(X) = (1, 2.2).
Definition 1.1.
If E := X⁻¹(b) = {w ∈ Ω : X(w) = b} ⊂ Ω is an event, we define
P[X = b] = Prob{X = b} := Prob(E).
When all outcomes of Ω are equally likely, this becomes
P[X = b] := Prob(E) = |E| / |Ω|.
P[X = b] indicates how much chance the observation b has of happening, and is called the probability density (mass) function (p.d.f. or p.m.f.) of X at the observation (or observed value) b.
In Picture 1.4,
event 𝐸 := 𝑋 −1 (𝑏) is the violet square on the left, and
event 𝐴 := 𝑋 −1 (ARIN) is the green oval.
Let us consider two simple examples below.
1. Our sample space Ω is the set of all King Mongkut University students. Define a map 'most liked singer': we ask each student w ∈ Ω for his or her most liked singer b in Thailand, X(w) = b; then the set Range(X) of the map X is the set of all famous singers in Thailand.
If KM University has 30000 students, and 1500 students like, say, the singer 'ARIN', then
P[X = 'ARIN'] := Prob(A) = |A| / |Ω| = 1500 / 30000 = 1/20.
In Picture 1.4, event 𝐴 := 𝑋 −1 (ARIN) (the green oval) consists of precisely 1500 students 𝑤.
2. Now let our sample space Ω be the set of all Honda cars produced in Japan. Inspect each car w ∈ Ω to find its fault (defect) d, X(w) = d; then the set Range(X) of the map 'car defect' X is the set of all popular car defects. If Honda Japan produced 1 million cars, of which 2000 cars have the fault d = 'defective piston', then the event A := X⁻¹(d) has cardinality 2000, and
P[X = d] := Prob(A) = |A| / |Ω| = 2000 / 10⁶ = 0.002.
For any random variable 𝑋, we write 𝑆𝑋 = Range(𝑋) with the same meaning for the set of observed
values (observations) of 𝑋.
CONCEPT 3.
A random variable X(.) is discrete when it has a discrete range, meaning that its set of values consists of no more than a countable number of elements. In the finite case, we usually write the set of values as S_X = {x_1, x_2, ..., x_n}.
For example,
• The number of famous singers in Thailand is a discrete variable, and its range S_X is finite. In the above singer example, if Thailand has 15 famous singers with names x_i, then S_X = {x_1, x_2, ..., x_15}.
• The number of banks in the US is a discrete variable X, and the set S_X of all observed values is finite (bank 1 = Citibank, ...).
• The number of defect types of Honda cars is a discrete variable X, with S_X finite. In the above car fault example, if Honda's cars have 5 specific defects, then S_X = {d_1, d_2, ..., d_5}.
• The number of traffic accidents in Asia each year is a discrete variable, but 𝑆𝑋 is a countably infinite set.
So far, we are dealing with discrete random variables. These are variables whose range is finite or
countable (i.e. countably infinite). In particular, it means that their values can be listed, or arranged in
a sequence, as in Concept 3. Examples include
the number of jobs submitted to a printer,
the number of corruption-free departments in a government,
the number of failed components of a software, and so on.
Discrete variables don’t have to be integers.
Continuous random variables assume a whole interval of values. This could be a bounded interval
(𝑎, 𝑏), or an unbounded interval such as
(𝑎, +∞), (−∞, 𝑏), or (−∞, +∞).
Sometimes, it may be a union of several such intervals. Intervals are uncountable, therefore, all
values of a random variable cannot be listed in this case.
Examples of continuous variables include
• various times (software installation time, code execution time, connection time, waiting time, lifetime),
also
• physical variables like weight, height, voltage, temperature, distance, the number of miles per gallon,
etc.
We discuss both discrete and continuous random variables in detail in Chapter ??.
A Bernoulli variable describes a random variable B that can take only two possible values, i.e.
S_B = {0, 1}.
Its probability density function is given by
p(1) = P(B = 1) = p,  p(0) = P(B = 0) = 1 − p.
• Two random variables A, B with ranges S_A, S_B are said to be independent if the joint probability density function satisfies
P[A = a, B = b] = P[A = a] · P[B = b] for all a ∈ S_A, b ∈ S_B. (1.9)
• A sequence of n random variables X_i is mutually independent if they are independent in pairs, in triples, and in general k-wise independent in the sense of Equation 1.9; that is,
P[X_1 = a_1, X_2 = a_2, ..., X_k = a_k] = P[X_1 = a_1] · P[X_2 = a_2] · · · P[X_k = a_k].
• A sequence of n random variables X_i is identically distributed if they follow the same distribution as a common random variable X. More precisely, they have the same range Range(X) and the same p.d.f. f_X(). We write X_i ∼ X.
A sequence of many random variables 𝑋𝑖 are independently and identically distributed (write
I.I.D. or i.i.d.) if they are both mutually independent and identically distributed.
PRACTICE 1.
Two fair dice are tossed. If the total is 7, we win $100; if the total is 2 or 12, we lose $100; otherwise
we lose $10. What is the expected value of the game?
A bet whose expected winnings equal 0 is called a fair bet. Let the random variable X denote the amount that we win when we make a certain bet.
Find the expectation (or mean) E(𝑋) if there is
a 60% chance that we lose 1 USD,
a 20% chance that we win 1 USD, and
a 20% chance that we win 2 USD.
Is this a fair bet?
HINT: We can solve these two examples by using Bernoulli random variable, with some modifications.
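Both expectations can also be verified numerically; the minimal R sketch below enumerates the 36 equally likely dice totals and then evaluates the second bet directly:

# Dice game: win $100 on a total of 7, lose $100 on 2 or 12, lose $10 otherwise
totals <- outer(1:6, 1:6, "+")                 # all 36 outcomes of two fair dice
payoff <- ifelse(totals == 7, 100,
                 ifelse(totals %in% c(2, 12), -100, -10))
mean(payoff)                                   # expected value: about $3.33, not fair
# Second bet: lose $1 (60%), win $1 (20%), win $2 (20%)
sum(c(-1, 1, 2) * c(0.6, 0.2, 0.2))            # 0, so this one is a fair bet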
Proof of Equation A.5. Let 𝐻 and 𝑇 be two outcomes of a Bernoulli trial with value space Range(B(𝑝)) =
{𝐻, 𝑇 }, and
P(𝐻) = 𝑝; P(𝑇 ) = 1 − 𝑝 =: 𝑞.
Assume that we perform 𝑛 trials of the experiment and each trial is independent of the others. For
example,
the event “𝐻 on the first trial” is independent from
the event “𝐻 on the second trial.” So both events have probability 𝑝.
The value space of sequences of n trials is {H, T}ⁿ, which contains 2ⁿ possible sequences.
Our question now is: what is the probability of exactly 𝑘 successes in 𝑛 trials of a binomial experiment
where
P(success = 𝐻) = 𝑝, P(failure = 𝑇 ) = 1 − 𝑝?
𝑋 = 𝐵1 + 𝐵2 + . . . + 𝐵𝑛 , each 𝐵𝑖 ∼ B(𝑝)
then X takes values in {0, 1, ..., n}; therefore 'exactly k successes in n trials' is the event X = k ∈ {0, 1, ..., n}. By combinatorial reasoning, the binomial distribution Bin(n, p) is given by the probability density function
p(k) = P(X = k) = C_k^n p^k (1 − p)^{n−k},  where C_k^n = n! / (k!(n − k)!).
Figure 1.5 shows pdf of the binomial Bin(𝑛, 𝑝) when 𝑛 = 50, and with 𝑝 = 0.25, 0.50, 0.75.
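The same pdf values can be generated with R's dbinom; a minimal sketch reproducing the flavor of Figure 1.5 is:

# Binomial Bin(50, p) probability mass function, as in Figure 1.5
k <- 0:50
p25 <- dbinom(k, size = 50, prob = 0.25)
plot(k, p25, type = "h")   # spike plot of the pdf for p = 0.25
sum(p25)                   # sanity check: total probability is 1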
Practical Problem 1.
Suppose we observe a telephone station, and the statistics show that on average there are 2 phone users per minute. We need to calculate the probability that in 1 minute there are 4 phone users.
• Then X takes a countably infinite set of values 0, 1, 2, 3, ...; hence it is a discrete random variable with range Range(X) = N.
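Anticipating the Poisson model defined just below, with rate λ = 2 per minute, the required probability is p(4; 2), which R computes directly:

# Practical Problem 1: P(X = 4) when X ~ Pois(2)
dpois(4, lambda = 2)   # about 0.0902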
The Poisson variable is denoted by X ∼ Pois(λ); it is also known as the variable for rare events, playing an important role in discrete event modeling in actuarial science and quality control.
The value λ > 0 denotes the rate of the rare events (such as the average number of customers arriving at an insurance firm) in each time period or per spatial sample.
Figure 1.5: The probability density function of Bin(50, 𝑝) with 𝑝 = .25, .50, .75
p(i; λ) = P(X = i) = λⁱ e^{−λ} / i!,  i = 0, 1, 2, ...  (1.11)
If the variance of a data is much greater than the mean, the Poisson distribution would not be a
good model for the random variable’s distribution.
For a Poisson arriving process, the number of arrivals 𝑋(𝑡) occurring in a time interval of length 𝑡 is
Poisson-distributed (random variable) with mean 𝜆 𝑡. It means that the probability density function
of 𝑋(𝑡) is given by:
P[X(t) = k] = e^{−λt} (λt)^k / k!,  k = 0, 1, 2, ...
Example 1.5.
If customers come to an SCB branch in Bangkok following a Poisson distribution with constant rate λ = 5 (see Figure 1.6), then
• We informally say a random variable X is a continuous random variable if and only if its range Range(X) is a continuous set (such as the reals R or its interval subsets).
• A continuous probability distribution refers to the range Range(X) of all possible values of X, together with the associated probabilities P(X ≤ t).
Definition 1.3.
A random variable X is continuous if and only if its range S_X is a continuous set (an uncountably infinite set), such as the reals R or interval subsets of R.
CONVENTION: If 𝑋 is clear from the context, write 𝐹 (𝑡) = 𝐹𝑋 (𝑡). Also we can change variable 𝑡
to 𝑥 and rewrite 𝐹 (𝑥), 𝑓 (𝑥). In few books, 𝑓 (𝑥) is simply called a probability function, and 𝐹 (𝑥) the
distribution function.
The cumulative distribution function F(x) = P(X ≤ x) is the probability that X is less than or equal to x; it equals the area under the curve f(x) between −∞ and x. F(x) has derivative
dF(x)/dx =: f(x). (1.15)
• This probability density function f(x) is defined almost everywhere and is piecewise continuous. It is given by a smooth curve C. The pdf f(x) then satisfies two conditions: i) f(x) ≥ 0 for all x, and ii) the whole area below the pdf curve is
∫_{−∞}^{∞} f(x) dx = 1 = F(+∞).
• The probability that a continuous random variable X takes any value within a given interval, say [a, b], i.e. a ≤ X ≤ b, is measured by the area under the curve C within that interval. In other words, the probability of the event 'a ≤ X ≤ b' is
P(a ≤ X ≤ b) = ∫_a^b f(x) dx = P(a < X < b).
The total area under the curve 𝑓 (𝑥) is the whole area including:
• the blue area in the middle P(𝑎 ≤ 𝑋 ≤ 𝑏), (see Figure 1.7.a)
MEAN and VARIANCE: The mean 𝜇 of a continuous variable 𝑋 with pdf 𝑓 (𝑥) is
μ = E(X) = ∫_{x∈Range(X)} x f(x) dx, (1.16)
A new type of biological catalyst used in making bread has an efficient working time of X (in hours). The random variable X is modeled by the pdf
a) Find k; DIY. b) Compute the approximate mean work time E[X] of this biological catalyst.
Answer b): 584 hours
Popular continuous probability distributions include
• Normal or Gaussian distribution: found to be useful in many areas like petroleum engineering,
environmental and medical sciences.
• Continuous uniform distribution on an interval [a, b]:
f(x) = 1/(b − a),  a ≤ x ≤ b, (1.18)
or, written as a full pdf,
f_U(x) = f(x; a, b) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise. (1.19)
Practice.
Find the mean E[𝑋] and variance V[𝑋] of the continuous uniform random variable 𝑋, using Formula
1.16, 1.17.
SOLUTION- ANSWER:
μ = (1/(b − a)) ∫_a^b x dx = (1/(b − a)) · (x²/2) |_a^b = (b² − a²)/(2(b − a)) = (a + b)/2.
As a result,
σ² = μ₂ − μ₁² = (1/3)(a² + ab + b²) − (1/4)(a² + 2ab + b²) = (1/12)(b − a)².
f(x) = (1/(σ √(2π))) e^{−(1/2)((x − μ)/σ)²}, (1.21)
On the normal curve. The normal curve (of the probability density function f(x)) is
i/ bell-shaped,
ii/ symmetric about the mean μ,
iii/ approaching the horizontal axis as we move further away from the mean in both directions.
These properties are illustrated in three most useful cases (see Figure 1.10):
Using Gauss pdf to compute areas being symmetric around the mean
Definition 1.4.
The standard normal (Gaussian) variable Z has probability density function
f(z) = (1/√(2π)) e^{−z²/2},  −∞ < z < ∞.
Its cumulative distribution function Φ(x) = P(Z ≤ x), the probability that Z is at most x, equals the area under the standard normal density function f(z) between −∞ and x.
Case 1: X = μ ⟺ Z = 0; X = μ + σ ⟺ Z = 1. Since
X − μ ≤ aσ ⟺ Z = (X − μ)/σ ≤ a,
Case 2 : 𝑋 = 𝜇 + 𝑘𝜎 ⇐⇒ 𝑍 = 𝑘. So
P(|𝑋 − 𝜇| ≤ 𝑘 𝜎) = P(|𝑍| ≤ 𝑘) = P(−𝑘 ≤ 𝑍 ≤ 𝑘) = Φ(𝑘) − Φ(−𝑘).
Normal distribution - computation using the Z-transformation.
Observe that the cumulative distribution function of a normal variable X ∼ N(μ, σ²) is given by
F(a) = P(X ≤ a) = ∫_{−∞}^a f(x) dx,
and that the standardized variable Z = (X − μ)/σ ∼ N(0, 1).
Suppose that the cost measurements of customers' claims in Actuarial Science are assumed to follow a normal distribution X with μ = $1000 and σ = $200. What is the probability that a cost measurement is between $1000 and $1400?
First we scale 100 USD down to 1 unit, so $1000 becomes 10 and σ = $200 becomes 2. We assume X ∼ N(μ, σ²) with μ = 10, σ = 2; then using
Z = (X − μ)/σ = (X − 10)/2
we have 10 ≤ X ⇒ 0 ≤ Z, and X ≤ 14 ⇒ Z ≤ 2, which gives us
P(10 ≤ X ≤ 14) = P(0 ≤ Z ≤ 2) = Φ(2) − Φ(0) = 0.9772 − 0.5 = 0.4772.
p = Φ(z_α):   99.5%   99%    97.72%   97.5%   95%     90%    80%    75%      50%
z_α:          2.58    2.33   2.00     1.96    1.645   1.28   0.84   0.6745   0
Furthermore, by Property 5:
P(|X − μ| ≤ 2σ) = P(|X − 10| ≤ 4) = P(−2 ≤ Z ≤ 2) = Φ(2) − Φ(−2) ≈ 0.9544.
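A quick check of these two computations in R, working in the original dollar units:

# P(1000 <= X <= 1400) for X ~ N(1000, 200^2)
pnorm(1400, mean = 1000, sd = 200) - pnorm(1000, mean = 1000, sd = 200)  # 0.4772
# P(|X - mu| <= 2 sigma), i.e. P(|Z| <= 2)
pnorm(2) - pnorm(-2)   # about 0.9544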
A few of the most practically well-known cases (see Figure 1.10) are:
• x = μ ± σ ⟺ z = ±1 ⇒ P(−1 ≤ Z ≤ 1) ≈ 68.3%;
• x = μ ± 2σ ⟺ z = ±2 ⇒ P(−2 ≤ Z ≤ 2) ≈ 95.4%;
• x = μ + 3σ ⟺ z = 3 ⇒ P(−3 ≤ Z ≤ 3) = 99.7%.
IQ examination scores for freshmen are normally distributed with mean value 𝜇 = 100 and standard
deviation 𝜎 = 14.2. What is the probability that a randomly chosen freshman has an IQ score greater
than 130?
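A one-line R check of the IQ question (standardizing gives z = (130 − 100)/14.2 ≈ 2.11):

# P(X > 130) for X ~ N(100, 14.2^2)
1 - pnorm(130, mean = 100, sd = 14.2)   # about 0.017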
1. The standard Gaussian density 𝑓 (𝑥) is even because 𝑓 (𝑥) = 𝑓 (−𝑥), ∀𝑥.
3. Tables of the Gaussian distribution usually do not list the values of Φ(x) for x < 0, because the density function f(x) is symmetric about the line x = 0. We have the relation
Φ(−x) = 1 − Φ(x) for every x. (1.23)
Thus, Figure 1.11 shows the probability that Z < −1; indeed P(Z < −1) = Φ(−1) = 1 − Φ(1) ≈ 0.1587.
4. The p-th percentile of the standard Gaussian distribution is the number z_p satisfying Φ(z_p) = p (with p expressed as a proportion).
If 𝑋 ∼ N(𝜇, 𝜎 2 ) we denote the 𝑝-th percentile of 𝑋 by 𝑥𝑝 . We can see that 𝑥𝑝 is related to the
normalized quantile 𝑧𝑝 by
𝑥𝑝 = 𝜇 + 𝑧𝑝 𝜎.
• An exponential random variable X with parameter λ is given by the probability density function
f(x) = λ e^{−λx},  x ≥ 0,
with mean and variance
μ = 1/λ;  σ² = μ² = 1/λ².
• The exponential cumulative distribution function (cdf) is
F(t) = P(X ≤ t) = ∫_0^t f(x) dx = ∫_0^t λ e^{−λx} dx = 1 − e^{−λt},  t ≥ 0.
Notes:
1. In practice, an exponential random variable 𝑋 is used to describe the distance between successive
events of a Poisson process with mean number of events 𝜆 > 0 per unit interval. [See details in
Section 1.6.4, and an application in Section 5.4]
2. For Poisson distributions, the mean and variance are the same; while for exponential ones, the mean
and standard deviation are the same.
Let T be any continuous random variable with non-negative values and cdf F_T. The survival function of T is one minus the cdf, defined as
S(t) = 1 − F_T(t) = P(T > t) = ∫_t^∞ f(x) dx,  with t ≥ 0, (1.26)
giving the probability that the component will fail after t time units.
Definition 1.5.
* The instantaneous hazard function of a person or system, also called the failure rate function, is
defined as
λ(t) = f(t) / S(t),  t ≥ 0. (1.27)
* The function
Λ(t) = ∫_0^t λ(u) du (1.28)
is called the cumulative hazard rate.
* The expected life length E[𝑇 ] of a product is called the mean time till death or mean time till failure
(MTTF). This quantity is given by
μ = E[T] = MTTF = ∫_0^∞ t f(t) dt = ∫_0^∞ P(T > t) dt = ∫_0^∞ S(t) dt. (1.29)
In applications the exponential distribution with mean 𝛽 is used for 𝑇 , with pdf
f(t) = f(t; β) = (1/β) e^{−t/β},  for t ≥ 0, (1.30)
and the survival
𝑆(𝑡) = 1 − 𝐹 (𝑡) = 𝑒−𝑡/𝛽 , 𝑡 ≥ 0.
In this model the survival function diminishes from 1 to 0 exponentially fast, relative to 𝛽. The hazard
rate function is
λ(t) = f(t) / S(t) = ((1/β) e^{−t/β}) / e^{−t/β} = 1/β,  t ≥ 0.
That is, the exponential model is valid for cases where the hazard rate function λ(t) = 1/β = λ > 0 is a constant, independent of time.
If the MTTF is E[T] = β = 100 [hr], we expect 1 failure per 100 hours, i.e. λ(t) = 1/100 [1/hr].
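A minimal R sketch of this constant-hazard model, assuming β = 100 hours, confirms both the flat hazard rate and the MTTF via Equation 1.29:

# Exponential reliability with MTTF beta = 100 hours
beta <- 100
S <- function(t) exp(-t / beta)                 # survival function S(t)
hazard <- function(t) dexp(t, rate = 1 / beta) / S(t)
hazard(c(10, 500))                              # constant 1/beta = 0.01 per hour
integrate(S, 0, Inf)$value                      # MTTF = integral of S(t): 100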
Recall that f(t) = dF_X(t)/dt ⟺ ∫ f dt = F(t).
Two important distributions for studying the reliability and failure rates of systems are the gamma and
the Weibull distributions. We will need these distributions in our study of reliability methods. These
distributions are discussed here as further examples of continuous distributions.
♣ Practical motivation 1.
P[X(t) = j] = e^{−λt} (λt)^j / j!,  j = 0, 1, 2, ...
Now we wish to study the distribution of the time until the 𝑘-th defective part is produced.
Call this continuous random variable 𝑌𝑘 .
We use the fact that the 𝑘-th defect will occur before time 𝑡 (i.e., 𝑌𝑘 ≤ 𝑡) if and only if at least 𝑘 defects
occur up to time 𝑡 (i.e. 𝑋(𝑡) ≥ 𝑘). Therefore,
𝑌𝑘 ≤ 𝑡 ⇔ 𝑋(𝑡) ≥ 𝑘,
g(t; k, λ) = (λ^k / (k − 1)!) t^{k−1} e^{−λt} for t ≥ 0,  and g(t; k, λ) = 0 for t < 0. (1.32)
This p.d.f. is clearly a member of a general family of distributions gamma 𝐺(𝜈, 𝛽) which depend on two
parameters 𝜈 and 𝛽. The probability density function of 𝐺(𝜈, 𝛽), generalized from Equation 1.32, is
g(x; ν, β) = (1 / (β^ν Γ(ν))) x^{ν−1} e^{−x/β} for x ≥ 0,  and g(x; ν, β) = 0 for x < 0. (1.33)
In the software R, the function pgamma computes the c.d.f. of a gamma distribution having shape ν and scale β, 0 < ν, β < ∞. If we use ν = shape = 1 = scale = β, then the cdf is F_G(1) = 0.6321206:
> pgamma(q=1, shape=1, scale=1)
[1] 0.6321206
The expected value and variance of the gamma distribution 𝐺(𝜈, 𝛽) are, respectively,
𝜇 = 𝜈𝛽, 𝜎 2 = 𝜈𝛽 2 . (1.34)
Property.
We note also that the exponential distribution E(𝛽) is a special case of the gamma distribution with
𝜈 = 1, write E(𝛽) = 𝐺(1, 𝛽).
Therefore, if in particular the X_i ∼ E(β) = G(1, β) are i.i.d. exponential, then the sum T ∼ G(n, β). Hence T = t(X) = Σ_{i=1}^{n} X_i
has pdf
f_T(t; n, β) = (1 / (β^n Γ(n))) t^{n−1} e^{−t/β} for t ≥ 0;  equivalently, with θ = 1/β,
f_T(t; n, β) = (θ^n / Γ(n)) t^{n−1} e^{−θt}. (1.37)
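This additivity is easy to check by simulation; the sketch below compares simulated sums of i.i.d. exponentials with the G(n, β) theory, using illustrative values n = 5, β = 2:

# Sum of n i.i.d. E(beta) variables should follow G(n, beta)
set.seed(1)
n <- 5; beta <- 2
T_sim <- replicate(10000, sum(rexp(n, rate = 1 / beta)))
c(mean(T_sim), n * beta)             # sample mean vs. theoretical mean nu*beta = 10
pgamma(12, shape = n, scale = beta)  # theoretical cdf of G(5, 2) at t = 12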
Weibull distributions 𝑊 (𝛼, 𝛽) are often used in reliability models in which the system either “ages” with
time or becomes “younger”.
The Weibull family of distributions will be denoted by 𝑊 (𝛼, 𝛽). The positive parameters 𝛼, 𝛽 > 0 are
called the shape and the scale parameters, respectively. Figure 1.13 draws two p.d.f. of 𝑊 (𝛼, 𝛽) with
𝛼 = 1.5; 2, and 𝛽 = 1. Note that, 𝑊 (1, 𝛽) = E(𝛽) is the exponential distribution.
W(t; α, β) = 1 − e^{−(t/β)^α} for t ≥ 0,  and W(t; α, β) = 0 for t < 0. (1.39)
Its mean and variance are
μ = β · Γ(1 + 1/α), (1.40)
and
σ² = β² { Γ(1 + 2/α) − Γ(1 + 1/α)² }. (1.41)
The beta distribution Beta(ν₁, ν₂) has pdf
f(x; ν₁, ν₂) = (1 / B(ν₁, ν₂)) x^{ν₁−1} (1 − x)^{ν₂−1} for 0 < x < 1,  and 0 otherwise, (1.42)
Figure 1.14: The pdf 𝑓 (𝑥; 𝜈1 , 𝜈2 ) of Beta(𝜈1 , 𝜈2 ) when 𝜈1 = 2.5, 𝜈2 = 2.5; 𝜈1 = 2.5, 𝜈2 = 5.00.
I_x(ν₁, ν₂) = (1 / B(ν₁, ν₂)) ∫_0^x u^{ν₁−1} (1 − u)^{ν₂−1} du, (1.44)
with 0 ≤ x ≤ 1. Note that I_x(ν₁, ν₂) = 1 − I_{1−x}(ν₂, ν₁). Figure 1.14 shows the graphs f of Beta(2.5, 5.0) and Beta(2.5, 2.5). If ν₁ = ν₂, then the pdf f of Beta is symmetric about the vertical line x = 1/2.
The beta distribution has an important role in the theory of statistics. As will be seen later, many
methods of statistical inference are based on the order statistics, and their distributions are related to
the beta distribution.
1.6 Summary
• Continuous: the real numbers (rational and irrational numbers) denoted by R = (−∞, +∞).
Elucidation.
a/ The numbers 0, 1, 2, 3, and so on are called the natural numbers N. However, if we subtract or divide two natural numbers, the result is not always a natural number. To overcome the limitation of subtraction, we extend the natural number system to the system Z of integers. We include in Z all the natural numbers, all of their negatives and the number zero (0). Thus, Z = {..., −3, −2, −1, 0, 1, 2, 3, ...}.
b/ We still cannot always divide any two integers. For example, 8/(−2) = −4 is an integer, but 8/3 is not an integer. To overcome this problem, we extend the system of integers to the system of rational numbers Q.
We define a number as rational if it can be expressed as a ratio of two integers. Thus, all four basic arithmetic operations (addition, subtraction, multiplication and division) are possible in the rational number system Q. Some numbers in everyday use are not rational numbers, i.e. they cannot be expressed as a ratio of two integers; e.g. π ≈ 3.14, e ≈ 2.71, etc. Such numbers are called irrational numbers.
c/ The term real number is used to describe a number that is either rational or irrational. To give a
complete definition of real numbers R would involve the introduction of a number of new ideas, and we
shall not do this task now. However, it is a good idea to think about a real number in terms of decimals.
In summary, denoting 'events' or 'outcomes' by capital letters A, B, ..., we have the following.
• Probability of any event is a number between 0 and 1. If 𝐴 is an event, P(𝐴) is the probability that
the event 𝐴 occurs: 0 ≤ P(𝐴) ≤ 1. The empty event has probability 0, and the sample space has
probability 1.
• There are rules for computing probabilities of unions, intersections, and complements.
• Unconditional probability of 𝐵 can be computed from its conditional probabilities by the Law of Total
Probability, set up as follows.
• Given occurrence of event 𝐵, one can compute conditional probability of event 𝐴 as in Eqn. 1.5.
• The Bayes Rule, often used in testing and diagnostics, relates conditional probabilities of 𝐴 given 𝐵
and of 𝐵 given 𝐴, as in
P(A | B) = P[A] · P[B | A] / P[B]. (1.45)
Replacing A by E_i and using
P[B] = Σ_{j=1}^{n} P[B ∩ E_j] = Σ_{j=1}^{n} P[E_j] · P[B | E_j],
we get
P(E_i | B) = P(E_i) · P(B | E_i) / Σ_{j=1}^{n} P(E_j) · P(B | E_j),  i = 1, ..., n. (1.46)
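Equation 1.46 translates into a few lines of R; the priors and likelihoods below are hypothetical numbers chosen only for illustration:

# Bayes rule over a partition E1, E2, E3 (illustrative probabilities)
prior <- c(0.5, 0.3, 0.2)                 # P(E_i), assumed
lik   <- c(0.01, 0.02, 0.05)              # P(B | E_i), assumed
post  <- prior * lik / sum(prior * lik)   # P(E_i | B) by Equation 1.46
post                                      # posterior probabilities, summing to 1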
The probability that the coin lands heads up is the relative frequency, over the long run, with which the coin lands heads up.
Here are some more similar interesting situations where probability can be applied:
* Commuting to work daily and observing whether a certain traffic signal is red when we see it
* Testing individuals in a population and observing whether they carry a gene for a certain disease
i/ The interpretation does not apply to situations where the outcome one time is influenced by or in-
fluences the outcome the next time because the probability would not remain the same from one
time to the next. We cannot determine a number that is always changing.
ii/ Probability cannot be used to determine whether the outcome will occur on a single occasion but
can be used to predict the long-term proportion of the times the outcome will occur.
Rule 1: If there are only two possible outcomes in an uncertain situation, their probabilities must add
to 1: P(𝐴) + P(𝐴𝑐 ) = 1.
Rule 2: If two outcomes or events cannot happen simultaneously, they are said to be mutually exclusive. The probability of one or the other of two mutually exclusive outcomes happening is the sum of their individual probabilities: P(A or B) = P(A) + P(B).
Rule 3: If two events A, B do not influence each other, the events are said to be independent of each other. If two events A, B are independent, the probability that they both happen is the product of their individual probabilities: P(A and B) = P(A) · P(B).
3. Method C. The personal-probability interpretation: the probability of an event is the degree to which a given individual believes the event will happen.
• Chi-square: X ∼ χ²_n, with parameter n, mean μ = n and variance σ² = 2n.
p(x) = e^{−λ} λ^x / x!,  x = 0, 1, 2, ...  (1.47)
where
x = the designated number of successes, e ≈ 2.71 is the natural base, and
λ > 0 is the average number of successes per unit time period.
The Poisson distribution can be used in the following ways.
• To model the number of occurrences of some event or phenomenon in the time interval (0, t], we can use Formula 1.47. Here x = 0 means that there are no occurrences of the event in (0, t], and Prob(x = 0) = p(0) = e^{−λ}.
• To model the number of defects or non-conformities that occur in a unit of product (unit area, volume, ...), say a semiconductor device.
USAGE: Consider a queue where customers are buses and server is a bus station.
• Arrivals (buses) at a bus-stop follow a Poisson distribution with an average of 𝜆 = 4.5 buses every
15 minutes.
The probability of fewer than 3 arrivals (meaning 0 up to 2) can be calculated directly from Formula 1.11 with λ = 4.5:
p(i; λ) = P(X = i) = e^{−λ} λⁱ / i!,
hence p(0; λ) = P(X = 0) = e^{−4.5} · 4.5⁰ / 0! = 0.01111. Similarly p(1; λ) = 0.04999 and p(2; λ) = 0.11248.
Therefore the probability of fewer than 3 arrivals is the cdf
P(2; λ) = P(X ≤ 2) = Σ_{i=0}^{2} P(X = i) = 0.17358 = 17.36%.
The next diagram shows the case of 𝜆 = 10 buses every 15 minutes, we see that the probability of 20
arrivals is no longer negligible.
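The bus-stop calculation can be reproduced with R's Poisson functions:

# Fewer than 3 bus arrivals when lambda = 4.5 per 15 minutes
dpois(0:2, lambda = 4.5)   # 0.01111, 0.04999, 0.11248
ppois(2, lambda = 4.5)     # cumulative: 0.17358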
𝑁 = 𝑛1 · 𝑛2 · . . . · 𝑛𝑘 . (1.48)
2. Permutation rule:
• If 𝑆 = {𝑎, 𝑏, 𝑐}, then there are 6 permutations, namely: 𝑎𝑏𝑐, 𝑎𝑐𝑏, 𝑏𝑎𝑐, 𝑏𝑐𝑎, 𝑐𝑎𝑏, 𝑐𝑏𝑎 (order matters)
Subset Permutations
For a sequence of k items chosen from a set of n items, the number of such ordered sequences is
P_k^n = n! / (n − k)!.
• A printed circuit board has eight different locations in which a component can be placed. If four different components are to be placed, the number of ordered placements is
P_4^8 = 8! / (8 − 4)! = 8 · 7 · 6 · 5 = 1680.
3. Combination rule:
• A combination is a selection of 𝑘 items from a set of 𝑛 where order does not matter.
• # of permutations ≥ # of combinations
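R evaluates these counting formulas directly:

# Counting rules in R
factorial(8) / factorial(8 - 4)   # P(8,4) = 1680 ordered placements
choose(6, 3)                      # C(6,3) = 20 combinations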
Outline:
A/ Experiment and sample space
B/ Basic rules of operations with events
C/ Computation of probability
D/ Independent events- Conditional probability
1. The experiment is to select a sequence of 5 letters for transmission of a code in a money transfer operation. Let a₁, a₂, ..., a₅ denote the first, second, ..., fifth letter chosen.
The sample space Ω is the set of all possible sequences of five letters; formally, Ω = {(a₁, ..., a₅) : each aᵢ is one of the 26 letters}.
* This is a finite sample space containing 26⁵ possible sequences of 5 letters, by the multiplication rule in Equation 1.48.
Quiz: Let 𝐸 be the event that all the 5 letters in the sequence are the same. Describe 𝐸 and
find P[𝐸].
2. (Industrial Production) Our experiment now is to choose a steel bar from a specific production process,
and to measure its weight.
CONCLUSION
• The fact E₁ᶜ ∩ E₂ᶜ = ∅ means the complementary events are disjoint.
• By De Morgan's law, (E₁ ∪ E₂)ᶜ = E₁ᶜ ∩ E₂ᶜ.
C/ Computation of probability
• Each point is equally probable. Each point is a combination of 3 signals assigned to 6 positions of a
binary sequence.
• The number of combinations of 3 chosen from 6 is C₃⁶, by Equation 1.49. The probability of E₃ is
P[E₃] = C₃⁶ · (1/2⁶) = (6 · 5 · 4)/(1 · 2 · 3) · (1/64) = 20/64 = 0.3125.
HINT: represent students in classes of industrial, mechanical, electrical and civil engineering by
events 𝐼, 𝑀 , 𝐸 and 𝐶; then the whole statistics class for engineers 𝑆 is the union of these events.
• Suppose it is known that the probability that the component survives for more than 6000 hours is
0.42.
• Suppose also that the probability that the component survives no longer than 4000 hours is 0.04.
(a) What is the probability that the life of the component is less than or equal to 6000 hours?
(b) What is the probability that the life is greater than 4000 hours?
HINT: Let 𝐴 and 𝐵 be the respective events that the fire engine and the ambulance are available.
Problem 6 (*)- Quality control
Since these parts come from the same production process (i.e. the same construction company), we can assume that P(Eᵢ) = p for all i = 1, ..., 5. Thus, the probability that all 5 parts are non-defective is p⁵.
What is the probability that one part is defective and all the other four are non-defective?
HINTS: Let A₁ be the event that one out of five parts is defective. In order to simplify the notation, we write the intersection of events as their product. Thus,
A₁ = E₁ᶜ E₂ E₃ E₄ E₅ ∪ E₁ E₂ᶜ E₃ E₄ E₅ ∪ E₁ E₂ E₃ᶜ E₄ E₅ ∪ E₁ E₂ E₃ E₄ᶜ E₅ ∪ E₁ E₂ E₃ E₄ E₅ᶜ.
Similarly,
P(𝐸1 𝐸2𝑐 𝐸3 𝐸4 𝐸5 ) = · · · = P(𝐸1 𝐸2 𝐸3 𝐸4 𝐸5𝑐 ) = (1 − 𝑝)𝑝4 .
More generally, if 𝐽5 is the number of defective parts out of a total of five machine parts, then
P(J₅ = i) = C_i⁵ p^{5−i} (1 − p)ⁱ.
An insurance company charges $50 per customer in a year. Let 𝑋 be a discrete random variable
(customer’s injury level) with three values (outcomes) Death, Disability and Good.
Assume that it did research on 1000 people and obtained the following table:
Outcome      Payout ($)   Probability
Death        10,000       1/1000
Disability    5,000       2/1000
Good              0       997/1000
• 𝑋 is a discrete random variable (customer’s injury level) with three values (outcomes) Death, Disabil-
ity and Good.
• Call 𝑀 be a random variable indicating the money that the company pays to a customer, correspond-
ing with values Death, Disability and Good of 𝑋.
• The company expects to pay each customer
E[M] = Σ m · P(M = m) = $10,000 · (1/1000) + $5000 · (2/1000) + $0 · (997/1000) = $20.
V[M] = Σ (m − E(M))² · P(M = m)
     = 9980² · (1/1000) + 4980² · (2/1000) + (−20)² · (997/1000) = 149,600.
• Standard deviation: σ = √V[M] = √149,600 ≈ $386.78.
The company expects to pay out $20 and to make $30 (= $50 − $20). However, the standard deviation of $386.78 indicates that it's no sure thing; that is a pretty big spread (and risk) for an average profit of $20.
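The whole calculation fits in a few lines of R, using the payout table above:

# Insurance payout M: expectation, variance and standard deviation
m  <- c(10000, 5000, 0)          # payouts for Death, Disability, Good
pr <- c(1, 2, 997) / 1000        # corresponding probabilities
EM <- sum(m * pr)                # expected payout: $20
VM <- sum((m - EM)^2 * pr)       # variance: 149,600
c(EM, sqrt(VM))                  # sd about $386.78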
Problem 1.2.
As batches of water filters at a brewery firm B often contain defective parts, inspectors are required to record the number of defective parts over several consecutive days. Each day they check one batch only, and count and record the number of defective parts in that batch. After many days they are able to calculate the corresponding probabilities
p_x = P[X = x], x = 0, 1, 2, ..., 5. For example, p₅ = 0.2 means that 20% of all shipments contain 5 damaged parts. They finally report the status to the manager, as listed in the following table:
𝑋 0 1 2 3 4 5
Compute the probability a and the expected value μ (the average number of defective parts in each shipment).
Problem 1.3.
A discrete random variable 𝑅 indicates four health insurance types 𝑟 = 1, 2, 3, 4 being provided by
an actuarial company D. Each type 𝑟 has a corresponding percentage of P[𝑅 = 𝑟] (out of total cases
provided annually), as shown in the following table:
𝑟 1 2 3 4
You further know that the higher r is, the higher the quality (of service) company D provides. The company's goal is to achieve the expected value (the average quality level of service) E[R] = 3.
Find the highest-quality percentage b (associated with level 4).
Problem 1.4.
Given that the number of wire-bonding defects per unit, X, is Poisson distributed with parameter λ = 4, compute the probability that a randomly selected semiconductor device will contain two or fewer wire-bonding defects.
1.7.4 Self-test
1. Given that P(𝐴) = 0.9, P(𝐵) = 0.8, and P(𝐴 ∩ 𝐵) = 0.75. Find
(a) What is the probability that at least two persons have the same birthday?
(c) How large need 𝑛 be for this probability to be greater than 0.5?
4. A committee of 5 persons is to be selected randomly from a group of 5 men and 10 women. Find
(a) the probability that the committee consists of 2 men and 3 women.
P(B|A) = P(B ∩ A) / P(A).
Show that P(B|A) just defined satisfies the three axioms of a probability.
6. Two manufacturing plants produce similar parts. Plant 1 produces 1,000 parts, 100 of which are
defective. Plant 2 produces 2,000 parts, 150 of which are defective. A part is selected at random and
found to be defective.
a) if P(𝐴|𝐵) > P(𝐴), then P(𝐵|𝐴) > P(𝐵). b) Show that P(𝐵) = P(𝐵|𝐴) P(𝐴) + P(𝐵|𝐴𝑐 ) P(𝐴𝑐 ).
c) Now suppose that a medical laboratory test to detect a certain disease has the following statistics.
Let 𝐴 = ‘event that the tested person has the disease’, and 𝐵 = ‘event that the test result is positive’.
It is known that
P(𝐵|𝐴) = 0.99 and P(𝐵|𝐴𝑐 ) = 0.005,
What is the probability that a person has the disease given that the test result is positive?
8. (AVIATION) The probability that a regularly scheduled flight departs on time is P(𝐷) = 0.83; the
probability that it arrives on time is P(𝐴) = 0.82; and the probability that it departs and arrives on time
is P(𝐷 ∩ 𝐴) = 0.78.
9. (BIO-MEDICAL Engineering)
A manufacturer of a flu vaccine is concerned about the quality of its flu serum. Batches of serum are
processed by three different departments having rejection rates of 0.10, 0.08, and 0.12, respectively.
The inspections by the three departments are sequential and independent.
• What is the probability that a batch of serum survives the first departmental inspection but is
rejected by the second department?
• What is the probability that a batch of serum is rejected by the third department?
[Source [9]]
Chapter 2
Statistical Science for Data Analytics
DOES DATA MAKE SENSE IN SERVICES?
2.1 Overview
Randomness and variability (also called uncertainty) are phenomena that engineering students, as well as actuarial, financial and economics learners, face in both their daily lives and their professional environments. The text introduces a few key techniques and fundamental methodologies, together with basic formalizations, in Statistical Data Analysis for engineers and scientists. These are aimed at both undergraduates and graduates.
The techniques and foundations help them to understand and efficiently resolve theoretical and practical problems that possess randomness by nature. Imagine, for example, that the above image showed the color spectrum of the Gulf of Thailand: would the pattern, or chaos, in the picture suggest or inspire us to choose sample drilling areas while searching for oil or gas?
• Descriptive statistics is concerned with summarizing and describing a body of data numerically.
• Inferential statistics is a key part of Statistical Science. More precisely, inferential statistics is the process of reaching generalizations about the whole (called the population) by examining a portion or several portions of it (called samples).
Hence we can use the following more elaborate and popular definition:
Trace metals in drinking water affect the flavor, and unusually high concentrations can pose a health
hazard. The article “Trace Metals of South Indian River” (Environmental Studies, 1982: 62–66) reports
on a study in which six river locations were selected (six experimental objects) and the zinc concen-
tration (mg/L) determined for both surface water and bottom water at each location. The six pairs of
observations are displayed in the table below.
Location                                    1      2      3      4      5      6
Zinc concentration in bottom water (x)    .430   .266   .567   .531   .707   .716
Zinc concentration in surface water (y)   .415   .238   .390   .410   .605   .609
Does true average concentration in bottom water exceed that of surface water?
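Since the bottom and surface measurements are paired by location, the natural analysis is a paired comparison; a minimal R sketch (one standard approach, not the only one) is:

# Paired comparison of zinc concentrations (data from the table above)
x <- c(.430, .266, .567, .531, .707, .716)  # bottom water
y <- c(.415, .238, .390, .410, .605, .609)  # surface water
t.test(x, y, paired = TRUE, alternative = "greater")  # H1: bottom exceeds surface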
A city manager wishes to determine whether it is equally likely that floods (or traffic jams, potholes, ...) will take place in the major districts of a metropolis like Bangkok or Hanoi. He records the number of floods during one year in Bangkok and obtains the following frequencies:
District 2, 20; District 3, 14; District 4, 18; District TD, 17; District BC, 22; and District BT, 29.
Do the data indicate a difference with respect to the number of floods in the different districts of Bangkok?
A test statistic, the chi-square statistic will be introduced for analyzing this data, measured with a
nominal scale.
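A minimal sketch of that chi-square test in R, under the null hypothesis that floods are equally likely across the six districts:

# Goodness-of-fit test for equally likely floods across districts
floods <- c(20, 14, 18, 17, 22, 29)   # observed yearly counts by district
chisq.test(floods)                    # default null: equal expected frequencies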
Suppose an insurance company B has thousands of customers, and each customer is charged $500 a year. Since the customers' businesses are risky, the company estimates from past experience that about 15% of its customers will get into fatal trouble (e.g. fire, accident, ...) and, as a result, will submit a claim in any given year. We assume that the claim will always be $3000 for each customer.
• Model the amount of money that the insurance company expects to obtain from each customer.
• Determine the random variable S_N representing the total amount of claims that the company B has to pay its customers, and compute E[S_N].
The above predictors could affect BPH growth and their measured values.
Realistic data set from the Mekong Delta:
[Table: columns Long., Lat., Rice, Seed., Temp, Humi., Water, Leaf, Grass and three count columns]
c/ What if some of the assumptions of linear models turn out to be wrong (e.g. the random errors are not i.i.d. normal)? What can we do?
The term scientific suggests a process of objective investigation that ensures that valid conclusions
can be drawn from an experimental study.
Scientific investigations are important not only in laboratories of research universities but also in the
engineering laboratories of industrial manufacturers.
• Environmental studies (do polluted water sources induce higher cancer rates?). See Practical Problem 2 above for data and details.
• Flood monitoring and urban management: wise decision making based on counting frequencies, see Practical Problem 2 above.
a/ Design the data collection in a way that minimizes bias and confounders while maximizing the information content.
c/ Analyze the data with computed statistics and methods that provide insight or knowledge, supporting engineers, industrialists, administrators, or specialist researchers in making decisions.
• A population or statistical population includes all of the entities of interest in a study. A population
includes all of the entities of common interest, whether they be people, households, machines, or
whatever. A unit is a specific element in a population.
For example, the gathering of all US citizens on January 1, 2010, is a statistical population. Generally this includes many communities (subpopulations), for example, all men between the ages of 19 and 25 living in Illinois, etc.
In this example the population of US citizens as of January 1, 2010, is finite and real, while the population of concrete blocks of fixed sizes that can be produced by a specific production process is infinite and hypothetical.
• A sample is a subset of the population, often randomly chosen and preferably representative of the
population as a whole.
A sample is usually selected from a population with the aim of observing properties/ characteristics
of that population, for gathering information, data and then making statistical decisions related to the
corresponding characteristics.
A parameter is a constant that defines a certain characteristic of the distribution of a random variable
/ observation, of a population or a process.
A statistic or statistical attribute/ criterion is a value that can be calculated from data samples (being
observed or experimented).
In practice, a statistical criterion refers to a characteristic of the elements of a population. Each attribute has its expression values, based on which one divides it into two categories:
Qualitative attribute: reflects the type or nature of the unit (such as gender).
Quantitative attribute: a characteristic of the population unit expressed as a number (yield, height of the crop); it can be discrete (finite or countably infinite) or continuous.
• continuous (such as heights, weights, temperatures); their values are often real numbers; there
are few repeated values;
• discrete (counts, such as the number of faulty parts in a car, the number of telephone calls to
you per day, etc); their values are usually integers; there may be many repeated values.
Statistics is a collection of methods used to collect and clean data, to visualize and compute various characteristics of the data, to analyze and infer (deduce) complex relationships among related factors, to make decisions and to model explicit and implicit phenomena, and last but not least to deliver optimal solutions; all of these tasks are based on the collected data.
You have seen that every task (calculating, analyzing or making decision) is based on observed data.
But what would be typical characteristics of our scientific, technical, economic, financial data?
• Various data sets come from many sources (from environment, chemistry, bridges, computers, biotech-
nology, food technology...), showing very different chemical and physical characteristics, depending
on where and when they are observed, as well as how to collect samples.
• Our common perception is that data sets are structured differently and can be quantitative or qualitative. They require completely different representations, models and statistical interpretations; as a result, the methods for making decisions or conclusions are distinct.
• Actuarial, economic and financial data are often susceptible to the effects of abnormalities, and the graphical methods for presenting quantitative and qualitative data differ accordingly.
Describing data First of all, we start from description, using criteria (or sample characteristics) that measure centrality (central tendency), such as the mean, median and mode. Then we measure the spreading tendency (or dispersion), such as variance, standard deviation, etc.
Decoding a dataset's uncertainty The next step is to use the theory of random variables and the corresponding probability distributions to quantify that uncertainty numerically. A fundamental characteristic of data collected in actuarial science and economics is their (large or very large) size and structural complexity, together with the correlation of the factors that shape the observed process. The estimation of parameters from large sample data is discussed in Part B.
Discover implicit relationships in data Finally, if we want to explain the phenomenon, discuss the effect of factors on the response of the process, find the root of the problem, or make management decisions, we must use more sophisticated techniques such as statistical inference and estimation of population parameters (in Chapter ??).
From a mathematical viewpoint, the statistical analysis of observed data sets should comply with the following steps:
1. removing raw numerical errors; calculating sample characteristics (mean, min, max, deviation) as in Chapter 3; evaluating the population parameters [using both point estimation and interval estimation], as seen from Chapter 4;
3. analyzing correlations between factors that influence the outcome of a process (groundwater contamination, river salinity, drought, the cash flow of a bank, ...);
4. knowing some basic models of your specific application domains; e.g. in Environment: water quality,
the spread of pollutants in the air, solid deposit in liquid.
In environment/ resource management - water quality indicators, geographic data (GIS), satellite
images (GPS) ...
In economics/ finance - mechanisms to keep track daily changes of selling and buying, stock ex-
changes/ fluctuations... (see more in [14]).
Engineers and scientists can use field observation data to answer many practical questions, to forecast trends, and to resolve or make quantitative decisions for a series of problems. These scientific solutions are positively related to the well-being and sustainability of people's lives, especially in countries where natural resources and manpower have not yet been governed in a scientific way. A complete procedure for using monitoring data is provided in Section 5.2.
First, let us try some practical issues in using field observation data.
Based on these data, he obtained the average weight x̄ = 8.5 mg of pills before the upgrade and the average weight ȳ = 7.2 mg of pills after the upgrade.
Given the population standard deviation σ_X = σ_Y = 1.8 mg both before and after the upgrade, what is a 90% confidence interval (CI) for the population mean difference μ_X − μ_Y? If we select significance level α = 0.05, can you test the following pair of hypotheses
H_0: μ_X − μ_Y = 0,    H_1: μ_X > μ_Y?
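A hedged sketch in R: the sample sizes are not stated in this excerpt, so we assume hypothetical sizes n_X = n_Y = 50 purely for illustration:

nX <- 50; nY <- 50; sigma <- 1.8     # sample sizes are assumed, not given above
xbar <- 8.5; ybar <- 7.2
se <- sigma * sqrt(1/nX + 1/nY)      # standard error of the mean difference
(xbar - ybar) + c(-1, 1) * qnorm(0.95) * se   # 90% CI for muX - muY
(xbar - ybar)/se > qnorm(0.95)       # one-sided z test of H0 at alpha = 0.05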
STATISTICAL INFERENCE
Statistical inference, mathematically, is a group of important tools used to estimate and test hypotheses about parameters. In the following sections, further approaches and statistical thinking will allow you to decipher the uncertainty of phenomena in our world.
Part B is designed to develop knowledge and skills for explaining fundamental concepts of esti-
mation and hypothesis testing, and then apply statistical inference into real world problems.
• Chapter 3 presents basic concepts and tools for describing data. We emphasize graphical techniques to explore and summarize changes in observations. This chapter also introduces readers to examples in the R software.
• Chapter 4 presents basic inferential methods, including sampling distributions and parameter
estimation.
• In Chapter 5 we discuss how to test statistical hypotheses for one and two populations.
Chapter 3
EXPLORATORY DATA ANALYSIS: MAKING OBSERVED DATA MEANINGFUL
[Chapter opening image; source [56]]
• Fundamental EDA
• Numerical measures
• Graphical visualization
The goal of this chapter and the next is to make sense of data by constructing appropriate summary measures, tables, and graphs. Our purpose here is to take a set of data that at first glance might have little meaning and to present it in a form that makes sense to people.
There are many ways to do this; the tools used most often are:
1. a variety of graphs, including bar and pie charts, histograms, scatter and time series plots;
3. tables of summary measures such as totals, averages, and counts, grouped by categories.
Statistical tools and ideas help us examine data in order to describe their main features. This
examination is called exploratory data analysis.
Like an explorer crossing unknown lands, we first want simply to describe what we see.
Hence EDA is also called Descriptive Statistics.
Here are two basic strategies that help us organize our exploration of a data set:
• Begin by examining each variable by itself. Then move on to study the relationships among the
variables.
• Begin with a graph or graphs. Then add numerical summaries of specific aspects of the data.
Classification of variables
Let us recall terms that are useful for this chapter and subsequent ones.
2. Quantitative variables: measurements with values in the reals R and counts in the naturals N; see Section 3.4.
The values of a categorical variable are labels for the categories, such as
‘female’ and ‘male’ in biology; ‘sell’ or ‘buy’ in stock market;
‘nations’ in the world, ‘car producers’ in Thailand,
‘investment for industry’ or ‘investment for agriculture’ in macroeconomics...
• Distribution. The distribution of a categorical variable lists the categories and gives either the count or the percent of individuals who fall in each category; it measures how frequently the categories occur in a process or sample.
• Frequency distribution. In any sample data x of size n, the number of observations n_A of a particular value A is its absolute frequency.
The heights of the bars in a histogram of a frequency distribution show the counts of the categories.
• Relative frequency distribution. The relative frequency of A is n_A/n. The heights of the bars in a histogram of a relative frequency distribution show the percents in the categories.
• The values of a categorical variable are described using tabular or graphical structures like histograms. A histogram is a bar graph of a frequency distribution.
We can draw a histogram for a frequency distribution or a relative frequency distribution. Any statistical software package will of course make a histogram for you.
An engineering course has n = 160 students enrolled, of whom 60 are female. If we denote by n_F and n_M the numbers of female and male students, and by p_F and p_M the relative frequencies of female and male students, respectively, then we can make a table of the frequency distribution and relative frequency distribution as follows.

Gender                Female                  Male
Frequency             n_F = 60                n_M = 100
Relative frequency    p_F = 60/160 = 0.375    p_M = 100/160 = 0.625
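The same table can be produced in R, as a small sketch:

gender <- c(rep("Female", 60), rep("Male", 100))
table(gender)               # absolute frequencies 60 and 100
prop.table(table(gender))   # relative frequencies 0.375 and 0.625
barplot(table(gender))      # bar graph of the frequency distribution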
Making a histogram from raw data sometimes is a long but interesting process, as follows.
Economists at the IMF compare how rich developed countries are in comparison with developing countries via the GDP per capita of four typical countries: USA, UK, Mexico and India.
Denote by A, B, C, D respectively the monthly average income of citizens of the USA, UK, Mexico and India. If the IMF makes a survey over n months, let a_i, b_i, c_i, d_i be the specific incomes in month i, for i = 1, 2, ..., n. Write
    x̄ = (x_1 + x_2 + ··· + x_n)/n
for the average of a sequence x_1, x_2, ..., x_n.
They can get the answer by using a sample of size n = 10: the observations are monthly incomes, recorded over 10 months at the four countries above in 2016 and shown in Table 3.1, where a, b, c and d are the yearly GDP per capita of the four countries.
• Observation: one row of the data set, i.e. the measurements recorded on a single unit.
• Data set: a table in which the columns contain variables, such as height, gender, and income, and
Table 3.1: GDP in USA, UK, Mexico and India via monthly income
Nation
Month 𝐴 𝐵 𝐶 𝐷
1 𝑎1 𝑏1 𝑐1 𝑑1
2 𝑎2 𝑏2 𝑐2 𝑑2
...    ...    ...    ...    ...
each row contains an observation or measurement (including the attributes of a particular member of the population).
A population (also called a statistical population), from Section 8.2, is a set of elements having one or more certain common properties.
A sample is a subset of some specific units in the population, often randomly chosen and preferably representative of the whole population.
Example 3.3.
• The collection of all rare animals living in Thailand on January 1, 2017 is a statistical population. This
population includes many communities (sub-populations), such as all elephants aged 1-15 years,
living in Chiang Mai National Park and so on.
• Another statistical population: all sets of smart phones that have a defined configuration, and that
can be manufactured under specific conditions by Samsung’s factories (or any other manufacturer
on the world market).
We put the data into a table according to a certain rule. An enumeration table usually starts with a header/title and ends with a source/origin.
+ Title: a simple description of the contents of the table.
+ Origin: a record of the source of the data in the table.
Thailand's government wants to compare how competitive the car producers are in order to design macroeconomic policy for automobile manufacturing.
They can get the answer by using a sample of n = 1000 observations (brand names of sold cars), recorded across 10 producers in 2008 and shown in Table 3.2. From this enumeration table we can draw many charts to support their decisions.
Brand      Frequency        Relative frequency
Honda      f_1 = 183        p_1 = 183/1000 = 0.183
Toyota     f_2 = 100        p_2 = 100/1000 = 0.1 = 10%
...        ...              ...
Ford       f_i              p_i = f_i/n
...        ...              ...
GM         f_m = 106        p_m = f_m/n = 0.106
Sum        n = 1000         100%
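As a sketch, such charts can be drawn in R from the frequencies; only three of the ten counts are given explicitly in Table 3.2, so we plot just those:

freq <- c(Honda = 183, Toyota = 100, GM = 106)   # the counts quoted above
barplot(freq)   # bar chart of brand frequencies
pie(freq)       # pie chart of the same counts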
Charts are graphs that present statistical information, recorded in variables, in a more graphical and dynamic way, including:
• Bar chart and histogram (for qualitative variables), pie chart (for qualitative variables), and time series plot (for quantitative variables observed over time).
Nation     Yearly GDP per capita
UK         $43,090
Mexico     $10,210
...        ...
Time series plots are used to detect change; e.g., as seen in Figure 3.2(b), with a continuously monitored industrial productivity data set we obtain a time series plot, which reflects the variability (fluctuation) and the trend of industrial productivity of Bangkok City over many years.
Making a statistical graph is not an end in itself. The purpose of the graph is to help us understand the
data. After you make a graph, always ask, “What do I see?” Once you have displayed a distribution,
you can see its important features as follows.
In any graph of data, look for the overall pattern and for striking deviations from that pattern.
You can describe the overall pattern of a distribution by its shape, center, and spread.
An important kind of deviation is an outlier, an individual value that falls outside the overall
pattern.
Qualitative variables (categorical, factors, class variables): these variables classify objects into groups.
• ordinal: such as Thai citizens' income, classified as high, medium or low; there is a natural order, i.e. the values of the variable can be compared.
• discrete: counts, such as the number of faulty parts in a software product, the numbers of students, of telephone calls, etc.; their values are usually naturals N or integers Z.
All R applications mentioned in this subject are contained in a package called mistat, which accompanies the book Modern Industrial Statistics with Applications in R, MINITAB, 2nd edition.
Type commands in the R console. The symbol # is used for comments addressed to humans; R does not read them.
install.packages("mistat", dependencies=TRUE)
• insulin <- c(10, 12, 19, 23, 21, 20, 17, 10)
• hist(age)
Graphs are created automatically by the appropriate R functions, e.g. plot() and hist(). A graph may be resized, copied and pasted directly into an MS Word document, Excel spreadsheet, or other application.
• salaries <- c(19, 24, 28, 29, 30, 34, 12, 13, 19, 20, 19, 23, 24)
• x= salaries; length(x);
• barplot(x)
• mean(x); median(x) Produces the mean and the median of the elements in vector x
• barplot(z) Produces a barplot with one “bar” for every individual component in vector z.
• barplot(table(z)) more visually helpful than just barplot. It gathers together like terms.
x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325], and
a contaminated version x* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000], in which the largest value 3325 is replaced by the outlier 10000.
What is the average salary of US IT engineers? What is the salary below which half (50%) of the IT engineers earn?
x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
gives the sample mean x̄ = (1/n) Σ_{i=1}^n x_i = 2940.
The median 𝑀 is the value in the middle when the data 𝑥1 , · · · , 𝑥𝑛 of size 𝑛 is sorted in ascending
order (smallest to largest).
- If 𝑛 is odd, then the median 𝑀 is the middle value.
- If 𝑛 is even, 𝑀 is the average of the two middle values.
𝑥* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000].
Indeed, since 𝑛 = 12 is even, the middle two values of data 𝑥* are 2890 and 2920; the median 𝑀 is
the average of these values:
M = (2890 + 2920)/2 = 2905 = M(x*).
Remark: Whenever a data set contains extreme values, the median is often preferred over the mean as a measure of central location.
x* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000].
Since data x* contains an extreme value (outlier), namely $10000, the new sample mean is
    x̄* = (Σ_{i=1}^n x*_i)/n = 3496 ≫ 2940 = the old mean of data x.
But the median is unchanged, reflecting the central tendency better:
    M(x) = M(x*) = (2890 + 2920)/2 = 2905.
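A minimal R check of this robustness property, using the salary data above:

x  <- c(2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325)
xs <- replace(x, 12, 10000)    # the contaminated data x* with outlier 10000
mean(x); median(x)             # 2940 and 2905
mean(xs); median(xs)           # mean jumps to about 3496, median stays 2905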
Mean versus median
The median and mean are the most common measures of the center of a distribution.
The mean and median of a symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed
distribution, the mean is farther out in the long tail than is the median.
So we just choose the specific value A with the greatest frequency (or greatest relative frequency) from the histogram.
Grade    Frequency    Relative frequency
5        1            0.1
6        4            0.4
7        2            0.2
8        1            0.1
9        2            0.2
Example 3.5.
Professor Kanjana received the following grades of her students in the first semester of 2017:
x = [6, 7, 6, 8, 5, 7, 6, 9, 9, 6].
Choose A = 6. Hence the mode of our grade data x is Mode = 6; its absolute frequency is 4 and its relative frequency is 0.4.
R code: hist(x, br = 6, col="blue", border="pink");
gives us the above histogram.
Definition 3.1. The pth percentile, for any 0 < p < 1, is a value m such that
P[X ≤ m] = p;
• that is, 100p percent of the observations are less than or equal to this value,
• and 100(1 − p) percent of the observations are greater than this value.
• When quoted on the percent scale, the domain of p is [0, 100] and p is a real number; in practice we usually allow p ∈ Q ∩ [1, 100], a rational number.
Universities frequently report admission test scores in terms of percentiles. Suppose an applicant K obtains a raw score m = 54 (on a scale of 100) on an admission test.
Would we know his chance to pass the exam in comparison with his friends?
YES, if we know to what percentile the value m corresponds on the set of all applicants' scores!
If the value m = 54 corresponds to, say, the 75th percentile (of all students' scores), we know
• that approximately 75% of students scored at most the mark of applicant K, that is, 54 marks, and approximately 25% scored higher.
Often we divide data into four equal parts, each part contains approximately one-fourth, or 25% of the
observations. The division points are called the quartiles, and defined as:
Example 3.7. Our salary data sample on salaries of IT engineers, given in Practical motivation 1, is
divided into four parts:
𝑥 = [2710, 2755, 2850, ‖2880, 2880, 2890, ‖2920, 2940, 2950, ‖3050, 3130, 3325]
In summary,
• the interquartile range covers the data lying between the first quartile Q1 and the third quartile Q3, i.e. the middle 50% of the data.
3. Interquartile Range.
Definition 3.2. The variance of a data set is a measure of variability that utilizes all the data.
We denote s² = V[x] for the sample variance of data x, and s = √(s²) for its standard deviation. The sample variance is
    s² = Σ_{i=1}^n (x_i − x̄)² / (n − 1),  or equivalently  s² = (Σ_{i=1}^n x_i² − n x̄²)/(n − 1),
where the sample mean is
    x̄ := x̄_n = (x_1 + ... + x_n)/n = (1/n) Σ_{i=1}^n x_i.    (3.2)
The Coefficient of Variation CV measures relative dispersion, i.e. compares how large the standard deviation is relative to the mean:
    CV = (σ/μ) × 100%   for populations,
and
    CV = (σ_x/μ_x) × 100%   for samples x.
You are the purchasing agent of Maximart in SG, and you regularly place orders with two distinct suppliers of fine and luxurious ceramics, say
Minh Long ceramics, denoted M, and another foreign brand, denoted F.
After several months of operation, you obtained the DATA given in Table 3.5, represented as frequency distributions of the delivery times with which the two suppliers met Maximart's requests.
Question 1.
Can we use data analytics to make the business decision of which supplier, M or F, Maximart should go along with in the long run?
First observations:
a) the 7- or 8-day deliveries shown for the foreign brand F are viewed as favorable, meanwhile
b) the slow 12- to 15-day deliveries of the foreign brand F could be disastrous in terms of keeping your business running smoothly, keeping the workforce busy, or big selling during peak seasons ...
The data set of the two suppliers M and F is given in the following table.

Delivery time x_i (days)    Frequency for M (w_i)    Frequency for F (v_i)
7                           0                        2
8                           0                        1
9                           1                        0
10                          5                        3
11                          4                        1
12                          0                        1
13                          0                        1
14                          0                        0
15                          0                        1
We will show that although the sample means of Minh Long (M) and F are the same, but
• Minh Long ceramic has smaller dispersion than that of the brand F.
• Do the two suppliers M and F demonstrate the same degree of reliability in terms of making deliveries
on schedule?
    CV_F = (σ_F/μ_F) × 100% = (2.58/10.3) × 100% ≈ 25%.
• Although Minh Long ceramics and the foreign brand F have the same mean x̄_M = x̄_F = 10.3,
• the foreign brand F has a larger dispersion, so it is less reliable (than the Minh Long firm) in terms of making deliveries on schedule!
Hence Supplier F is less reliable than Supplier M, by a ratio of almost 4 times (CV_F ≈ 25% versus CV_M ≈ 6.6%). Reliability here means providing goods on time, without too much delay.
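The whole comparison can be reproduced in R from Table 3.5 (a sketch; the variable names are ours):

days <- 7:15
wM <- c(0, 0, 1, 5, 4, 0, 0, 0, 0)    # frequencies for supplier M
vF <- c(2, 1, 0, 3, 1, 1, 1, 0, 1)    # frequencies for supplier F
delivM <- rep(days, wM); delivF <- rep(days, vF)
mean(delivM); mean(delivF)            # both 10.3
100 * sd(delivM)/mean(delivM)         # CV of M, about 6.6%
100 * sd(delivF)/mean(delivF)         # CV of F, about 25%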
x = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
the range of the data is 3325 − 2710 = 615. For the extreme (contaminated) data
x* = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 10000]
the range jumps to 10000 − 2710 = 7290, showing how sensitive the range is to outliers.
The Interquartile Range (IQR) is the range of the middle 50% of the data:
IQR = 𝑄3 − 𝑄1 .
[2710, 2755, 2850, ‖2880, 2880, 2890, ‖2920, 2940, 2950, ‖3050, 3130, 3325]
the interquartile range of the data is Q3 − Q1 = 3000 − 2865 = 135; note that it does not indicate how far the extreme data values are from the mean!
The sample standard deviation is used more often since its units (cm., lb.) are the same as those of
the original measurements. In the next section we will discuss some ways of interpreting the sample
standard deviation. Presently we remark only that data sets with greater dispersion about the mean will
have larger standard deviations. The sample standard deviation and sample mean provide information
on the variability and central tendency of observation.
2. Why do we emphasize the standard deviation s rather than the sample variance s²? ANS 1: s is a natural measure of spread for Normal distributions, as we will learn in the coming weeks.
3. Why do we average by dividing by n − 1 rather than n when calculating the variance? Read the book for the answer.
In this section we learn some common graphical techniques available today for exploratory data analysis. These techniques include the box plot and the quantile plot. We also discuss the sensitivity of the sample mean and sample standard deviation to abnormal observations (outliers), and also robust statistics.
In general, statistics are calculated from an observed sample and used to infer the characteristics of the population containing that sample.
We now identify several characteristic values of a sample of observations that are arranged in increasing order. These sample characteristics are called order statistics. Statistics that do not require arranging the observed values will be discussed later.
Denote by 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 a series of observed values according to a random sampling procedure.
For example, consider the following 10 values of the cutting resistance of stainless steel welds [lb /
weld]:
2385, 2400, 2285, 2765, 2410, 2360, 2750, 2200, 2500, 2550.
𝑋(1) < 𝑋(2) < · · · < 𝑋(𝑖) < 𝑋(𝑖+1) < · · · < 𝑋(𝑛) .
E.g. X_(1) = 2200 is the smallest value, X_(2) = 2285 the second smallest, ..., X_(10) = 2765 the biggest value.
Next we identify some characteristic values that depend on the order statistics, namely: the sample minimum and maximum, the sample range, the sample median, and the sample quartiles.
• The "middle" value in the ordered pattern is called the sample median, computed as
    M_e = X_(m), m = (n + 1)/2, when n is odd.
When n is even, M_e is the mean of the two values in the middle, X_(n/2) and X_(n/2+1).
The median characterizes the center (midpoint) of the dispersion of the sample values, and is therefore called a statistic of central tendency, or location statistic. About 50% of the sample values are less than the median.
The lengths of the sections tell us about the spread of the sample. The five numbers are:
• Lower quartile, denoted Q1, which 'cuts off' a quarter of the ordered data;
• Median, Med, also denoted Q2, the number such that half of the values are above it and half are below it.
If there is an odd number of values in the data set, the median is simply the middle value in the ordered list.
If there is an even number of values, the median is the average of the middle two values.
• Upper quartile, denoted Q3, which 'cuts off' three quarters of the ordered data;
B) The boxplot
is a graphical technique (popularized by the statistician John Tukey) that displays the distribution of a variable. It helps us see the location, spread, tail length and outlying points or outliers.
• An extreme value is considered an outlier if it lies outside the range [Q1, Q3] (greater than Q3 or less than Q1) by more than 1.5 times the interquartile range IQ = Q3 − Q1.
• The boxplot is a graphical representation of the Five Number Summary, and is particularly useful for comparing different batches.
C) Draw a box-plot
Use R to draw a box plot of the data 19, 24, 28, 29, 30, 34, 12, 13, 19, 20, 19, 23, 24:
• salaries <- c(19, 24, 28, 29, 30, 34, 12, 13, 19, 20, 19, 23, 24)
• x <- salaries
• boxplot(x)   # boxplot(), not barplot(), draws the five-number-summary box
Additional information pertaining to the shape of a distribution of observations is derived from the
sample skewness and sample kurtosis.
    β_4 = (1/n) Σ_{i=1}^n (X_i − X̄)⁴ / S⁴.    (3.4)
The quantile plot is a plot of the sample quantiles 𝑥𝑝 against 𝑝, 0 < 𝑝 < 1 and 𝑥𝑝 = 𝑋(𝑝(𝑛+1)) .
In Figure 3.5 we see the quantile plot of the log yarn-strength. From such a plot one can obtain
graphical estimates of the quantiles of the distribution. For example, from Figure 3.6 we immediately
obtain the estimate 2.8 for the median, 2.23 for the first quartile 𝑄1 and 3.58 for the third quartile 𝑄3 .
These are close to the values presented earlier.
We see also in Figure 3.5 that the maximal point of this data set is an outlier.
When the data 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 represents a sample of observations from some population, we can
use the sample statistics discussed in the previous sections to predict how future measurements will
behave. Of course, our ability to predict accurately depends on the size of the sample.
Prediction using order statistics is very simple and is valid for any type of distribution. Since the
ordered measurements partition the real line R into 𝑛 + 1 subintervals,
we can predict that 100/(𝑛 + 1)% of all future observations will fall in any one of these subintervals.
Hence 100𝑖/(𝑛 + 1)% of future sample values are expected to be less than the 𝑖-th statistic 𝑋(𝑖).
It is interesting to note that the sample minimum, 𝑋(1), is not the smallest possible value. Instead
we expect to see one out of every 𝑛 + 1 future measurements to be less than 𝑋(1). Similarly one out
of every 𝑛 + 1 future measurements is expected to be greater than 𝑋(𝑛).
Predicting future measurements using sample skewness and kurtosis is a bit more difficult because
it depends on the type of distribution that the data follow.
Normal data: If the distribution is symmetric (skewness ≈ 0) and somewhat “bell-shaped” or “normal”
(Gaussian, with kurtosis ≈ 3) as in Figure 3.7, for the log yarn strength data, we can make the
following statements:
1. Approximately 68% of all future values will lie within one standard deviation 𝜎 of the mean.
2. Approximately 95% of all future measurements will lie within two 𝜎 of the mean.
3. Approximately 99.7% of all future measurements will lie within three 𝜎 of the mean.
Chebyshev's Inequality.
Type 1: Given a nonnegative random variable Y and m > 0, then P[Y > m] ≤ E(Y)/m (Markov's inequality).
Type 2: For any number k > 1, the percentage of future measurements within k standard deviations of the mean will be at least 100(1 − 1/k²)%:
    P[ |Y − E[Y]| < kσ ] ≥ 1 − 1/k².
ELUCIDATION
• This means that at least 75% of all future measurements will fall within 2 standard deviations (𝑘 = 2).
Similarly, at least 89% will fall within 3 standard deviations (𝑘 = 3). These statements are true for
any distribution; however, the actual percentages may be considerably larger. Notice that for data
which is normally distributed, 95% of the values fall in the interval [X −2𝑆, X +2𝑆]. The Chebyshev
inequality gives only the lower bound of 75%, and is therefore very conservative.
• Any prediction statements, using the order statistics or the sample mean and standard deviation, can only be made with the understanding that they are based on a sample of data. They are accurate only to the degree that the sample is representative of the entire population.
[Figure: using the Gauss pdf to compute areas symmetric around the mean]
When the sample size is small, we cannot be very confident in our prediction. In Section 4.5 we will
discuss theoretical and computerized statistical inference whereby we assign a “confidence level” to
such statements. This confidence level will depend on the sample size.
We now consider the relationship between variables via the two most important descriptive measures: covariance, which measures the co-movement of two separate distributions, and correlation. Let us start by looking at the example below.
Practical motivation 4. [Sales trend]
A manager of a sound equipment store wants to determine the relationship between the number x of television commercials shown and the store's sales y. The recorded data:

Week    Commercials x    Sales y
1       2                50
2       5                57
3       1                41
4       3                54
5       4                54
6       1                38
7       5                63
8       3                48
9       4                59
10      2                46
3.8.1 Covariance
The sample covariance of paired data (x_i, y_i) is
    s_xy = Σ_i (x_i − x̄)(y_i − ȳ) / (n − 1).
In our example we have x̄ = 30/10 = 3 and ȳ = 510/10 = 51, and the sample covariance is s_xy = 99/9 = 11.
For a population of N pairs, the population covariance is
    σ_xy = Σ_i (x_i − μ_x)(y_i − μ_y) / N.
A positive covariance indicates that 𝑋 and 𝑌 move together in relation to their means.
A negative covariance indicates that they move in opposite directions.
Remark that
(𝑥𝑖 − 𝑥)(𝑦𝑖 − 𝑦) > 0 ⇐⇒ the point (𝑥𝑖 , 𝑦𝑖 ) ∈ quadrants 𝐼&𝐼𝐼𝐼
(𝑥𝑖 − 𝑥)(𝑦𝑖 − 𝑦) < 0 ⇐⇒ the point (𝑥𝑖 , 𝑦𝑖 ) ∈ quadrants 𝐼𝐼&𝐼𝑉
As a result,
3.8.2 Correlation
In our example, s_xy = 99/9 = 11, indicating a positive linear relationship between the number x of television commercials shown and the sales y at the multimedia equipment store.
But the value of the covariance depends on the measurement units of x and y. Is there a more precise, unit-free measure of this relationship? Yes: the sample correlation coefficient
    r_xy = s_xy / (s_x s_y).
We get −1 ≤ 𝑟𝑥𝑦 ≤ 1. And moreover, if 𝑥 and 𝑦 are linearly related by the equation
𝑦 = 𝑎 + 𝑏𝑥,
then
𝑟𝑥𝑦 = 1 when 𝑏 is positive, and
𝑟𝑥𝑦 = −1 when 𝑏 is negative.
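In R, both measures come from built-in functions; for the commercials data above:

x <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)              # commercials shown
y <- c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46)    # sales
cov(x, y)   # 11, the sample covariance
cor(x, y)   # about 0.93: a strong, unit-free positive linear relationship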
3.9 Summary
If we also denote S² = Var(x) for the sample variance, then S = √(S²) = σ_x is exactly our sample standard deviation. In general, the sample standard deviation S is used more frequently than the sample variance S² because it has the same units (kg, cm, lb, Newton, ton, hour, ...) as our initial (original) measurements.
Our concern now is: given p%, where p ∈ Q ∩ [1, 100], find the value m by locating its position (index) in the observed sample data x of size n.
1. Arrange the data x in ascending order to obtain the sorted sample data y: y = sort(x).
2. Compute an index i:
    i = (p/100) n.
Note that some textbooks use i = (p/100)(n + 1); this does not change the mathematical meaning of the concept of percentile. We do not use it here, since when p = 100 the index i = n + 1 falls outside the sample's index range: n + 1 ∉ {1, 2, 3, ..., n − 1, n}!
3. Locate m from i:
• If i is not an integer, round up to the ceiling ⌈i⌉ =: j (the smallest integer bigger than i). Then m = y[j].
• If i is an integer, m is the average of y[i] and y[i + 1], as the examples below illustrate.
Remark 1. As mentioned above, if we use i = (p/100)(n + 1), then when i is not an integer we must round down to the floor ⌊i⌋ =: j (the biggest integer smaller than i).
Example. Let us find the 75th percentile of the salary data given in Practical motivation 1.
x = [2710, 2755, 2850, ‖2880, 2880, 2890, ‖2920, 2940, 2950, ‖3050, 3130, 3325]
Here i = (75/100) · 12 = 9 ∈ N, so m = (y[9] + y[10])/2 = (2950 + 3050)/2 = 3000 = Q3.
* The 50th percentile of the same data is similarly computed:
    i = (p/100) n = (50/100) · 12 = 6 ∈ N,
so m = (y[6] + y[7])/2 = (2890 + 2920)/2 = 2905, the median.
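The rule above can be coded directly; a sketch (pctile is our own name, valid for 1 ≤ p < 100):

pctile <- function(x, p) {
  y <- sort(x); i <- p/100 * length(y)
  if (i == floor(i)) mean(y[c(i, i + 1)]) else y[ceiling(i)]
}
x <- c(2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325)
pctile(x, 75)   # 3000 = Q3
pctile(x, 50)   # 2905 = the median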
A sample is a subset of a population. A population is the collection of items under discussion. It may be finite or infinite; it may be real or hypothetical.
In reality, to understand what a real population is, we conduct a statistical experiment (by observing, surveying, interviewing or measuring some process/phenomenon of interest) and can use the set of all possible outcomes of that statistical experiment as a population. This set of possible outcomes is called the sample space, usually denoted by Ω.
We examine the collected data, to see any surprising features, before attempting to answer any for-
mal questions. This is the exploratory stage of data analysis. There are two major kinds of variables:
• continuous (such as heights, weights, temperatures); their values are often real numbers; there are
few repeated values;
• discrete (counts, such as numbers of faulty parts, numbers of telephone calls etc); their values are usually integers; there may be many repeated values.
Qualitative variables (factors, class variables); these variables classify objects into groups.
• ordinal (such as income classified as high, medium or low); there is a natural order for the values of the variable.
Median: the middle value of the ordered data set x_1 < ··· < x_n of size n (smallest to largest), denoted Med.
- If n is odd, then Med is the middle value.
- If n is even, then Med is the average of the two middle values.
Unimodal or Bimodal. If there is a single prominent peak in a histogram, the shape is called unimodal,
meaning “one mode.” If there are two prominent peaks, the shape is called bimodal, meaning two
modes.
Besides the mean, which represents the center, we use the standard deviation, which represents the spread or variability of the values. Sometimes the variance is given instead of the standard deviation.
The standard deviation is simply the square root of the variance, so once you have one you can
easily compute the other.
c. Square the deviations and take the sum of the squared deviations
    SS = Σ_{i=1..n} (x_i − μ_x)².
d. Divide the sum SS by the number of values minus 1, i.e. n − 1, resulting in the variance Var_x = SS/(n − 1).
e. Take the square root of the variance: √(Var_x) is the standard deviation.
Computing the standard deviation of a population: keep the same steps, just replace
b. compute x_i − μ, then
c. take the sum over the whole population, SS = Σ_{i=1..N} (x_i − μ)², and divide by N.
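A step-by-step R sketch of the sample version, using the salary data from earlier:

x  <- c(19, 24, 28, 29, 30, 34, 12, 13, 19, 20, 19, 23, 24)
ss <- sum((x - mean(x))^2)    # step c: sum of squared deviations
v  <- ss/(length(x) - 1)      # step d: sample variance
sqrt(v)                       # step e: standard deviation
sd(x)                         # built-in function, same value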
3.10 Problems
1. Given the list 70, 75, 85, 86, 87 and 85, find the five-number summary.
If the list had an additional value of 90, what would the five-number summary be?
2. The following table gives the grades on a test for a class of 40 students.
7 5 6 2 8 7 6 7 3 9 10 4 5 5 4
6 7 4 8 2 3 5 6 7 9 8 2 4 7 9
4 6 7 8 3 6 7 9 10 5
(a) Arrange these grades (raw data set) into an array from the lowest grade to the highest grade.
(c) Construct a table showing the absolute, relative, and cumulative frequencies for each grade.
3. A firm pays a wage of $4 per hour to its 25 unskilled workers, $6 to its 15 semiskilled workers, and $8 to its 10 skilled workers. What is
a/ the mean, the median and the modal wage paid by this firm?
b/ the weighted average, or weighted mean, x̄_w = Σ(w x)/Σw, paid by this firm? Hint: the weights w play the same role as the frequencies in finding the mean of grouped data.
4. a) Suppose that engineers of the Bangkok Insurance Firm (BIF) observed claims of customers over 23 weeks from January 2015 till November 2016. They recorded the following data sample
y = (21, 17, 14, 26, 15, 19, 16, 12, 13, 67, 18, 16,
29, 25, 32, 24, 30, 25, 33, 50, 43, 48, 32) [in 1000 USD].
Find the observation error s and the standard error of the sample mean s_ȳ.
b) Assume that the Quality Assurance department of the Singha Beer Corporation recorded the number of bad Heineken bottles (sour, low volume, ...) in the 12 months of 2016 as the following sample
D = 1, 2, 4, 5, 2, 0, 4, 4, 9, 8, 8, 8.
Figure 4.1: Would we infer useful knowledge behind this beautiful picture? [Source [56]]
Chapter 4
STATISTICAL PARAMETER ESTIMATION: ESTIMATING PARAMETERS OF A POPULATION
* When σ is known
• Compute the margin of error from the standard deviation and the sample size.
A charter airplane company is asked to carry regular loads of 100 sheep. The plane available for this work has a carrying capacity of 5000 kg. Records of the weights of about 1000 sheep, typical of those that might be carried, show that the distribution of sheep weight has a mean of 45 kg and a standard deviation of 3 kg.
Can the company take the order?
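A sketch of the standard reasoning in R (the total load is approximately normal by independence):

n <- 100; mu <- 45; sigma <- 3
load_mean <- n * mu           # 4500 kg expected total load
load_sd <- sigma * sqrt(n)    # 30 kg standard deviation of the total
pnorm(5000, load_mean, load_sd, lower.tail = FALSE)   # P(total > 5000) is essentially 0

Since 5000 kg lies about 16.7 standard deviations above the expected load, the company can safely take the order.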
AIA firm, a well known insurance company in the US recorded monthly salaries of 27 new customers
as in the list 𝑥 below:
𝑥 = [6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5, 6.5,
9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9]
3. If they play independently, what is the probability that Tom will score lower than George and thus
do better in the tournament?
The methods of this chapter can solve the above problems using the powerful approach of statistical inference. Informally, statistical inference draws conclusions about a population or process based on sample data. It also provides a statement, expressed in terms of probability, of how much confidence we can place in our conclusions.
Statistical inference, in short, is sample-based analysis and conclusion making: a process in which we infer, from information contained in a sample, properties of the population from which the observations are taken.
Through this inference we better understand, and are able to model, the underlying process which generates the data. This is a major objective of statistics. Samples are never equal to the population; hence any conclusion in statistical inference is a probabilistic conclusion.
The techniques of statistical inference can be classified into two broad categories:
Parameter Estimation is the process of inferring or estimating a population parameter (its mean,
variance) from the corresponding statistic of a sample drawn from the population.
Hypothesis Testing is the process of determining, on the basis of sample information, whether to
accept or reject a hypothesis or assumption with regard to the value of a parameter.
A population is the collection of items under discussion. It may be finite or infinite; it may be real or
hypothetical. A sample is a subset of a population.
A sample should be chosen to be representative of the population because we usually want to draw
conclusions or inferences about that population based on the sample selected.
NOTATION 1.
• If all X_i are associated with the same population random variable X with a pdf f(x), we say the observations are identical.
• If the X_i are mutually independent, we say the observations are independent. [See Definition 1.2 in Section 1.4.]
A sample is called a random sample if the sample observations X_i are both identical and independent, i.e. [X_i]_i is a list of identically and independently distributed random variables. We write shortly X_1, X_2, ..., X_n ∼_i.i.d. X.
This concept is applicable to both finite and infinite populations, and where sampling is performed with replacement (written RSWR).
CONVENTION.
After observing realistic phenomena, our complete data resulting from a random sample 𝑋1 , . . . , 𝑋𝑛
now is written as 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 , considered as a list of numerical values of the 𝑋𝑖 ’s.
a/ Specifically, the population mean μ, standard deviation σ, variance σ², ... are symbolized by Greek characters; they refer to "true" but hidden or unknown values which we cannot know exactly. These are characteristic parameters or population parameters.
b/ The sample average X̄, the sample standard deviation S, variance S², etc. are represented by Latin characters; they refer to values which we can calculate from a given sample.
c/ We use such values to estimate the corresponding true (but unknown) population parameters. Each value (of X̄, S, S², ...) is called a statistical estimator, an estimator, a sample statistic, or just a statistic.
• Sampling: the act of selectively choosing units from a population to form a sample, so that we can compute statistical estimators and make inferences about the population.
Random sampling means drawing a random sample, i.e. one whose observations are identically and independently distributed. Only with random samples are our estimation and inference guaranteed to be mathematically correct! The process of estimation and inference requires information from sample statistics, as below.
The last two both describe the variability of data, a process or a system. Based on these popular statistics, we will develop procedures to estimate the parameters of a population or probability distribution, and solve other inference or decision-oriented problems.
In general, the parameters of a population are unknown constants, such as the mean μ and variance σ². To approximate a population parameter θ we must have random sample data, then build a single estimate θ̂ of it.
Statistical parameter estimation is the process of deriving population parameter values from sample statistics, where the sample statistics oscillate (vary) around the actual value of the parameter.
−∞ ———— X̄ ———— μ ———— X̄ ———— +∞
0 ———— S² ———— σ² ———— S² ———— +∞
0 ———— S ———— σ ———— S ———— +∞
In point estimation, a numerical value 𝜃ˆ for each 𝜃 is calculated, meanwhile in interval estimation, a
𝑘 ≥ 1-dimensional region is determined. We describe point estimation here, see next chapter for
interval estimation.
Sampling distribution: saying that θ̂ has a distribution means that it has a range, a pdf, a cdf, an expectation E(θ̂) and a variance V(θ̂).
• When we have obtained realistic values x_1, x_2, ..., x_n of the random sample M, the value T(x_1, x_2, ..., x_n) becomes a real number θ̂_n := T(x_1, x_2, ..., x_n), called a point estimate of θ.
We take a few examples of specific populations: all cars in Thailand, all customers of the JP Morgan bank in the USA, all students of Harvard University, all households in New York... In such a population let us choose a certain random sample M = X_1, ..., X_n; then we have the three most used point estimators.
𝜃ˆ = 𝑋 𝑀 = (𝑋1 + . . . + 𝑋𝑛 )/𝑛 = 𝑋,
Now we fix a random sample 𝑋1 , . . . , 𝑋𝑛 from a population with mean 𝜇 and finite variance 𝜎 2 . We
have the following results on unbiased estimators.
a) The sample mean X̄ is an unbiased estimator of the population mean: E[X̄] = μ.
b) The sample variance S² = (1/(n − 1)) Σ_i [X_i − X̄]² is an unbiased estimator of the population variance σ².
PROOF: Item a) uses the definition of a random sample; Item b) holds since E[S²] = σ². [Check this as an exercise!]
−∞ ———— X̄ ———— μ ———— +∞
0 ———— S² ———— σ² ———— +∞
A very small population of SO₂ pollution indices of the Thames River in London consists of the values {2, 4, 6, 8, 10}, unit mg/l, hence N = 5. We see the population mean
    μ = (Σ_i x_i)/N = (2 + 4 + 6 + 8 + 10)/5 = 6.
If we take all samples of size n = 2, then the number of random samples is C(N, n) = C(5, 2) = 10, as in the table
Sample {2, 4} {2, 6} {2, 8} {2, 10} {4, 6} {4, 8} {4, 10} {6, 8} {6, 10} {8, 10}
𝑋 3 4 5 6 5 6 7 7 8 9
The sample mean X̄ has Range(X̄) = {3, 4, 5, 6, 7, 8, 9} =: u, and its probability values p_x̄ are 0.1, 0.1, 0.2, 0.2, 0.2, 0.1, 0.1 =: v.
Will the expectation of the sample mean satisfy E[X̄] = μ? Yes:
    E[X̄] = Σ_{x̄ ∈ Range(X̄)} x̄ · p_x̄ = u · v (inner product of two vectors) = 6 = μ.
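This small sampling distribution can be enumerated in R:

pop <- c(2, 4, 6, 8, 10)
xbar <- colMeans(combn(pop, 2))   # the means of all 10 samples of size 2
table(xbar)/length(xbar)          # the sampling distribution of the mean
mean(xbar)                        # 6, equal to the population mean mu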
V[X̄] = σ²/n.
Proof. Item 1:
    V[X̄] = (1/n²) [V[X_1] + ··· + V[X_n]] = (σ² + ··· + σ²)/n² = σ²/n.
Hence, there are a few error types associated with random data x_1, x_2, ..., x_n.
1. Real errors e_i = x_i − μ.
AIA firm, a well known insurance company in the US recorded monthly salaries of 27 new customers
as in the list below:
𝑥 = [6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5, 6.5,
9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9]
with unit 1000 USD. Assume that the population mean is μ = 8 (obtained from a national census).
In real life the sample mean is x̄ = $7507 and the observation error (sample standard deviation) is
    s = $1383.
The method of moments dates back at least to Karl Pearson in the late 1800s. It has the virtue of being quite simple to use, and it almost always yields some sort of estimate when other methods prove intractable.
Definition 4.4.
Let X_1, X_2, ..., X_n be a random sample (i.i.d. random variables) from a pmf or pdf f(x). For k = 1, 2, 3, ...,
• the k-th population moment, or k-th moment of the distribution f(x), is μ_k = E[X^k];
• the k-th sample moment is M_k = (1/n) Σ_{i=1}^n X_i^k.
When k = 1: the first population moment is E[X] = μ, and the first sample moment is X̄ = Σ_{i=1}^n X_i/n.
When k = 2: the second population and sample moments are, respectively, E[X²] and Σ_{i=1}^n X_i²/n.
For k ≥ 2, we generalize the LLN (??) to moments of order k > 1 to get the strong law of large numbers (SLLN) (??).
One parameter: The sample X_1, X_2, ..., X_n has pdf f(x; θ), where θ is the sole parameter whose value is unknown. The moment estimator θ̂ is obtained by equating M_1 to μ_1, which is a function of θ, and solving for θ.
More generally, we equate the first k sample moments to the first k population moments to estimate/solve for θ_1, ..., θ_k. The RHS depends on the parameters θ, and the LHS can be computed from the data x_1, x_2, ..., x_n.
𝜇1 (𝜃) = E[𝑋] = 𝑝
Since there is only one parameter to be estimated, the estimator is obtained by equating M_1 = X̄ to μ_1 = μ = E[X] = 1/λ.
The moment estimator is then λ̂ = 1/X̄.
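A simulated sketch of this moment estimator for the exponential case (the true rate is chosen by us for illustration):

set.seed(1)
x <- rexp(1000, rate = 2)   # simulated sample with true lambda = 2
1/mean(x)                   # moment estimator lambda-hat, close to 2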
Now let X_1, X_2, ..., X_n be random variables having a joint distribution with joint p.d.f. f(x_1, ..., x_n) := f_{X_1,...,X_n}(x_1, ..., x_n), with means μ_i = E(X_i) and variances σ_i² = V(X_i). Let α_1, α_2, ..., α_n be given constants. Then
    W = Σ_{i=1}^n α_i X_i
is a linear combination of the X's. In the present section we discuss only the formulae for the expected value and variance of W.
That is, the expected value of a linear combination is the same linear combination of the expectations.
or equivalently
    V[W] = Σ_{i=1}^n α_i² V(X_i) + 2 Σ_{i<j} α_i α_j Cov(X_i, X_j).
Cov(𝑋𝑖 , 𝑋𝑗 ) = 0 if 𝑖 ̸= 𝑗;
We obtain
    V[W] = Σ_{i=1}^n α_i² V(X_i) = Σ_{i=1}^n α_i² σ_i² = (Σ_{i=1}^n α_i²) σ²;
so for the sample mean
    X̄ = (X_1 + X_2 + ··· + X_n)/n
this becomes
    V[X̄] = σ²/n.
Problem 4.4.
The drought indexes of 10 provinces in Thailand are denoted by a sample X_1, X_2, ..., X_10. Find the variance of the sum of these 10 random variables
if each has the same standard deviation 5, and
if each pair has correlation coefficient 0.5.
Use
    Cov(X_i, X_j) = ρ(X_i, X_j) σ_i σ_j = ρ(X_i, X_j) σ² = 25 · (1/2) = 12.5.
The sum is S_10 = 10 X̄; then apply V[X̄] = σ²/n together with the covariance terms.
The main principle of MLE is that the data we observe are associated with some probability distribution whose parameters are unknown. We then estimate these parameters based on the chance, or likelihood, of seeing the data samples.
Let us consider a family of distributions F = {P_θ} indexed by a parameter (possibly a vector of parameters) θ ∈ Θ. We can think of θ as labeling the candidate models for the data.
• The likelihood function of observing the data vector x = (x_1, x_2, ..., x_n) with respect to θ is given as
    L(θ) = f(x; θ) = f(x_1, x_2, ..., x_n; θ) = Π_{i=1}^n f(x_i; θ).
Definition 4.5.
• The maximum likelihood estimate (MLE) of the parameter θ is a numerical value θ̂ that maximizes the function L(θ), or equivalently the log-likelihood function L*(θ) = log L(θ), given the data x.
Note that if x = (x_1, x_2, ..., x_n) is just a sample (not i.i.d.) from a density f(x; θ), the likelihood L(θ) = f(x; θ) is called the sample density, considered as a function of θ for the fixed sample x.
1. Write L*(θ) = l_n(θ) to indicate the log-likelihood, a function of both the sample x = (x_1, x_2, ..., x_n) and the parameter θ.
2. The notation Argmax means that both L(θ) and the log-likelihood L*(θ) achieve their maximum values at θ̂; see Figure 4.3.
3. If the pdf f(x; θ) depends on parameters θ = (θ_1, θ_2, ..., θ_p), then the maximum likelihood estimates are the values of the θ_i's that maximize the likelihood function.
4. When the X_i's are substituted in place of the x_i's in θ̂(x), the maximum likelihood estimator θ̂(X) results. We write θ̂_n for θ̂ and L*_n(θ) for the log-likelihood L*(θ) when the sample size n of the data x matters for the interpretation.
We employ the fact that the observed data x = (x_1, x_2, ..., x_n) is a realization of the i.i.d. sample (X_1, X_2, ..., X_n) with common pdf f(x; θ). Since the X_i are i.i.d., we can write the likelihood function of the parameter θ as a product.
• To find the numerical value θ̂ that maximizes the function L(θ), or equivalently the log-likelihood l(θ) = log L(θ), we use calculus (setting the derivative to zero, or the Lagrange method under constraints):
Suppose that X_1, X_2, ..., X_n ∼ B(p), the Bernoulli distribution. The probability function, for x = 0, 1, is
    f(x; p) = p^x (1 − p)^{1−x}.
EXTRA QUESTION: suppose that we know in advance that, instead of p ∈ [0, 1], p is restricted by the inequalities 0 ≤ p ≤ c < 1. Prove that the new MLE is p̂ = min{x̄, c}.
Example 4.6.
A sample of 10 new bike helmets manufactured by a firm is obtained. Upon testing, it is found that
the 1st, 3rd, and 10th helmets are flawed, whereas the others are not.
Let 𝑝 = P[ flawed helmet ] and define 𝑋1 , 𝑋2 , · · · , 𝑋10 by
𝑋𝑖 = 1 if the 𝑖-th helmet is flawed and 𝑋𝑖 = 0 otherwise.
Then the observed 𝑥𝑖 ’s are 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
so the joint pmf of the sample is
♣ QUESTION 4.1.
We now ask, "For what value of p is the observed sample most likely to have occurred?" That is, we wish to find the value of p that maximizes the above pmf, or equivalently maximizes its natural log.
Since this is a differentiable function of p, equating the derivative of (4.12) to zero gives the maximizing value:
    d/dp log[f(x_1, x_2, ..., x_10; p)] = 3/p − 7/(1 − p) = 0
    ⟹ p = 3/10 = x/n,
where x is the observed number of successes (flawed helmets); see Figure 4.3.
The estimate of p is then p̂ = 3/10. It is called the maximum likelihood estimate because, for fixed x_1, x_2, ..., x_10, it is the parameter value that maximizes the likelihood (joint pmf) f(x_1, x_2, ..., x_10; p).
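The same estimate can be found numerically in R, as a sketch (negloglik is our own helper name):

x <- c(1, 0, 1, 0, 0, 0, 0, 0, 0, 1)   # the observed helmet data
negloglik <- function(p) -sum(dbinom(x, 1, p, log = TRUE))
optimize(negloglik, interval = c(0.001, 0.999))$minimum   # about 0.3 = 3/10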
We would like to estimate the mean height 𝜇 of the population of all Thai women between the ages
of 18 and 24 years.
• This 𝜇 is the mean 𝜇𝑋 of the random variable 𝑋 obtained by choosing a young woman at random
and measuring her height.
• To estimate 𝜇, we choose an SRS 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 of young women and use the sample mean X to
estimate the unknown population mean 𝜇.
Reminder: Statistics obtained from probability samples are random variables because their values
vary in repeated sampling. The sampling distributions of statistics are just the probability distributions
of these random variables.
Why do we choose X to estimate 𝜇? Three reasons:
1. An SRS should fairly represent the population, so the mean X of the sample should be somewhere
near the mean 𝜇 of the population.
2. X is unbiased, and we can control its variability by choosing the sample size, as we saw in Equation
4.5.
3. We know that if we keep on measuring more women, we will eventually estimate the mean height of all young women very accurately.
This key fact is called the law of large numbers. It is remarkable because it holds for any population,
not just for some special class such as Normal distributions.
• We draw independent observations at random from any population with finite mean 𝜇, then decide
how accurately we will estimate 𝜇.
• The distribution of the heights of all young women is close to the Normal distribution with 𝜇 =
64.5 𝑖𝑛𝑐ℎ𝑒𝑠, 𝜎 = 2.5 𝑖𝑛𝑐ℎ𝑒𝑠.
• Suppose that 𝜇 = 64.5 were exactly true. The above figure shows the behavior of the mean height X 𝑛
of 𝑛 women chosen at random from a population whose heights follow the N(64.5, 2.52 ) distribution.
The Central Limit Theorem (CLT) informally says that, in selecting random samples of size n, if n is large (≥ 30) the sampling distribution of X̄ can be approximated by a normal distribution, regardless of the distribution of the individual observations. Formally we have the following.
Let X_1, X_2, ..., X_n ∼_i.i.d. X with common mean μ and common standard deviation σ (X in general does not need to be normal).
Then the sample mean approximately follows a normal distribution, i.e. X̄ ∼ N(μ, σ²/n), when n is large (n > 30).
As a result, the standardized version of X̄,
    Z_n = (X̄_n − μ)/(σ/√n),
satisfies lim_{n→∞} Z_n = Z ∼ N(0, 1) in distribution.
This means the cumulative distribution function F_{Z_n} converges pointwise to Φ(z): F_{Z_n}(z) → Φ(z).
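A quick simulation sketch of the CLT in R, using a skewed (exponential) population:

set.seed(42)
xbar <- replicate(10000, mean(rexp(40, rate = 1)))   # 10000 sample means, n = 40
hist(xbar, breaks = 50)    # the histogram looks approximately normal
mean(xbar); sd(xbar)       # close to mu = 1 and sigma/sqrt(n) = 1/sqrt(40)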
• In Statistics, a point estimate of a population parameter is a sample statistic used to estimate that
population parameter.
• But a point estimate cannot be expected to provide the exact value of the population parameter.
Mathematically, an interval estimate refers to a range of values together with the probability, called
confidence level, that the interval includes the unknown population parameter.
An engineer wishes to estimate the mean velocity μ (in km/h) at which motorbikers pass an observation point, given the width of the confidence interval w = 3 km/h and confidence level 1 − α = 99%.
If he knows σ = 1.5 km/h, compute the minimum required n (the number of motorbikers he needs to observe), given that n > 30.
A poll of 1,200 voters in Bangkok asked what the most significant issue was in the upcoming election.
Sixty-five percent answered the economy. We are interested in the population proportion of voters
who feel the economy is the most important.
Which probability distribution should you use for this problem?
L|————— θ ——— θ̂ —————|U
• The four values L, U, θ̂ and R (the margin of error) all must be computed from the sample data.
    P{L ≤ μ ≤ U} = 1 − α    (4.14)
L|————— μ ——— μ̂ —————|U
Mathematically, 𝛼 + 𝐶𝐿 = 1.
Example 4.8.
If μ is the mean of the most productive age of human beings, with significance level α = 0.1, so 1 − α = 0.9, and L = 35, U = 45, then the interval
    [35, 45] = {35 ≤ μ ≤ 45}
is called a 100(1 − α)% = 90% confidence interval for the population mean μ.
L = 35|————— μ̂ ——— μ —————|U = 45
Write (L, U) = (35, 45) = 90% CI of μ. The interval [L, U] = {L ≤ μ ≤ U} is a 100(1 − α)% confidence interval for the unknown population mean μ:
    CL = P{L ≤ μ ≤ U} = 1 − α,
    L = μ̂ − R(α),  U = μ̂ + R(α).
Here R(α) = U − μ̂ measures the half-width of the region bounding μ.
> t.test(age)
data: age   # R then prints the t statistic, degrees of freedom and confidence interval
Consider a customer survey conducted by AIA, an insurance firm in Bangkok. The firm's quality assurance team uses a customer survey to measure customer satisfaction.
How do we summarize the data? We rate customer satisfaction by asking for satisfaction scores in the range 0..100.
A sample of n = 100 customers is surveyed, and a sample mean x̄ = 42 of customer satisfaction is computed.
x = satis-score = [48, 55, 35, 31, ..., 29, 31, 29, 39, 32, 44, 50].
A confidence interval of 𝜇 is
[𝐿, 𝑈 ] = 𝑥 ± 𝑅(𝛼).
- the margin of error (or EBM, Error Bound Margin, or just error) is E = R(α), with
    R(α) = z_{α/2} · σ/√n,
    z_{α/2} = z_{0.025} = 1.96,  R(α) = 1.96 · σ/√n.
With σ = 10 and n = 100,
    S.E.(X̄) = σ/√n = 10/√100 = 10/10 = 1,
so
    μ ∈ [x̄ ± z_{α/2} · σ/√n] = [42 − 1.96, 42 + 1.96] = [40.04, 43.96].
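The interval can be checked in one line of R:

n <- 100; xbar <- 42; sigma <- 10
xbar + c(-1, 1) * qnorm(0.975) * sigma/sqrt(n)   # 40.04 and 43.96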
• Since various possible values of 𝑋 are results of distinct random samples 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 , the
probability distribution of 𝑋 is called the Sampling Distribution of 𝑋.
With the above observed sample data x = age = [48, 55, 35, 31, ..., 29, 31, 29, 39, 32, 44, 50], the standard error of the sample mean x̄ is
    s_x̄ = σ/√n ≈ s/√n.
Here x̄ provides a point estimate of μ for all AIA customers' scores. From the survey we found that the population of ages is normally distributed, with σ ≈ s = ?. So
    s_x̄ = s/√n = s/√100 = s/10.
The Sampling Distribution of 𝑋 allows us to make probability statements about how close the sample
mean 𝑥 is to 𝜇; described by two cases in Proposition 4.3.
In general, for any significance level 0 < α < 1, the Margin of Error is given by
R(α) = z_{α/2} · σ/√n  or  R(α) = z_{α/2} · s/√n.
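As a hedged sketch, the z-based interval for the survey numbers above (x̄ = 42, σ = 10, n = 100) can be computed in R:

xbar <- 42; sigma <- 10; n <- 100
R.alpha <- qnorm(1 - 0.05/2) * sigma / sqrt(n)  # margin of error: 1.96 * 1 = 1.96
c(xbar - R.alpha, xbar + R.alpha)               # 95% CI: (40.04, 43.96)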
I/ A small-sample confidence interval: If the population is normally distributed and σ is known, we establish a 100(1 − α)% CI of μ as in Eq. 4.15 above, using the standardized statistic
Z = (X̄ − μ)/(σ/√n).
When n ≤ 30 (small sample size) and the population is normal but σ is unknown, we must replace the Gauss distribution Z by the Student distribution T, discussed in Section 4.7.
p = 1 − α/2 = Φ(z) | 99.5% | 97.5% | 95% | 90% | 80% | 75% | 50%
z_p | 2.58 | 1.96 | 1.645 | 1.28 | 0.84 | 0.6745 | 0
• or equivalently b): we can say this critical value z_{α/2} = q_{1−α/2}, the standard normal quantile, determines an area of (1 − α/2) in the lower tail of f_Z.
In interpretation b) we view −z_{α/2} = q_{α/2} and z_{α/2} = q_{1−α/2}. For instance, with α = 0.1,
(1 − α/2) = 0.95 = p = Φ(z) ⟹ z = Φ⁻¹(p) = 1.645.
• The 95% confidence interval for μ satisfies P(x̄ − 1.96 σ_{x̄} < μ < x̄ + 1.96 σ_{x̄}) = 0.95.
Figure 4.8: Standard normal quantiles split the area under the Gauss density curve
In practice the standard deviation 𝜎 is often unknown. Under such conditions we must modify our
approach, see Section 4.7.
μ ≤ U = X̄ + R(α) = X̄ + z_α · σ/√n,  with  P[μ ≤ X̄ + z_α · σ/√n] = 1 − α.  (4.17)
X̄ − z_α · σ/√n = X̄ − R(α) = L ≤ μ,  with  P[μ ≥ X̄ − z_α · σ/√n] = 1 − α.  (4.18)
With the sample x = 64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3, suppose that our population is normally distributed with σ = 1; find a lower, one-sided 100(1 − α)% = 95% CI for μ.
Recall that z_α = z_{0.05} = 1.64, n = 10, and x̄ = 64.46; the CI is
μ ≥ x̄ − R(α) = x̄ − z_α · σ/√n = 64.46 − 1.64 · (1/√10) = 63.94.
Case 1 applies when the sample size n is large. If n is small we must depart from normality; see Case 2.
Case 1: If the population is arbitrary (non-normal) and the sample size is large, say n > 40, the quantity
(X̄ − μ)/(S/√n)
still has an approximate standard normal distribution [by the CLT]. Apply
x̄ − z_{α/2} · s/√n ≤ μ ≤ x̄ + z_{α/2} · s/√n,
where 𝑠 is the sample standard deviation.
Case 2: If the population is normal but n is small, we apply the Student's t distribution.¹
¹ Historical fact: originated by William S. Gosset, who worked for the Guinness brewery in Dublin around 1900 and wrote under the pseudonym Student.
The Student variable t with ν degrees of freedom is given by the probability density function
f(u) = [Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))] · [1 + u²/ν]^{−(ν+1)/2},  u ∈ ℝ.
f(u; ν) is even, since f(−u; ν) = f(u; ν); because f is symmetric, t_{1−p} = −t_p.
For 𝑝 ∈ (0, 1) the 𝑝-percentile is
P(𝑇 ≤ 𝑡𝜈,𝑝 ) = 𝑝 = P(𝑇 ≤ 𝑡𝑝 ).
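For a quick sketch, these percentiles can be read off in R (ν = 19 is an assumed value for illustration):

nu <- 19               # degrees of freedom, assumed for illustration
qt(0.975, df = nu)     # t_{nu, 0.975}, about 2.093
pt(2.093, df = nu)     # recovers p = P(T <= 2.093), about 0.975
qt(0.025, df = nu)     # symmetry: t_{0.025} = -t_{0.975}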
Let 𝑥 = 𝑥1 , 𝑥2 , . . . , 𝑥𝑛 be a random sample of size 𝑛, taken from a normal distribution with mean 𝜇.
Similarly to using the Z transform Z = (X̄ − μ)/(σ/√n) to standardize the sample mean X̄, we use the T variable
T = (X̄ − μ)/(S/√n) ∼ t_{n−1}.  (4.19)
• Thus for a normal sample of size n, T has the Student or t distribution with ν = n − 1 degrees of freedom. The sample standard deviation
s = √( Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) )
is used in place of σ. The moments of T are
E(T) = 0, if ν > 1;  V(T) = ν/(ν − 2), if ν > 2.  (4.20)
For a normal sample x = x₁, x₂, . . . , xₙ, the two-sided 100(1 − α)% confidence interval for μ is x̄ ± t_{n−1,α/2} · s/√n, where t_{n−1,α/2} is a Gosset's t variate with n − 1 degrees of freedom and right-tail probability α/2. The right-tail and left-tail probabilities are both α/2; as a result we obviously have
P( x̄ − t_{n−1,α/2} · s/√n ≤ μ ≤ x̄ + t_{n−1,α/2} · s/√n ) = 1 − α.
Percentiles of t[ν] are denoted as t_α[ν], read from Table 4.3, or determined by software.
6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5,
6.5, 9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9 [in mg/l]
of n = 27 metal concentrations in a river near a paper factory.
From data we get x = 7.51 mg/l and sample standard deviation 𝑠 = 1.38 mg/l.
The score t_{1−α/2}[n − 1] satisfies
P[ T > t_{1−α/2}[n − 1] ] = α/2,
where 𝑇 is a random variable having 𝑡[𝑛 − 1] distribution with 𝑣 = (𝑛 − 1) degrees of freedom.
(Excerpt of the Student t table, Table 4.3:)
df = 1: t_{0.80} = 1.376, . . . , t_{0.99} = 31.82
df = 2: t_{0.80} = 1.061, . . . , t_{0.99} = 6.965
df = 3: t_{0.80} = 0.978, . . .
With α = 0.05, t_{0.975}[26] = 2.056 and s/√n = 1.38/√27 ≈ 0.27, so the 95% CI is
7.51 − 2.056 · (0.27) < μ < 7.51 + 2.056 · (0.27).
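A minimal R check of this interval, using the data as listed above:

conc <- c(6.9, 7.8, 8.9, 5.2, 7.7, 9.6, 8.7, 6.7, 4.8, 8.0, 10.1, 8.5,
          6.5, 9.2, 7.4, 6.3, 5.6, 7.3, 8.3, 7.2, 7.5, 6.1, 9.4, 5.4, 7.6, 8.1, 7.9)
mean(conc); sd(conc)    # about 7.51 and 1.38, as stated
t.test(conc)$conf.int   # the 95% t-interval for the mean concentration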
Given x̄ and a specified error E, can we find a sample size n such that
R(α) = |x̄ − μ| ≤ E?
That is, R(α) will not exceed E when the sample size n (an integer) is
n = (z_{α/2} σ / E)².  (4.21)
Example 4.11.
Ten measurements of impact energy (J, joule) on specimens of stainless steel cut at 60°C are as follows: 64.1, 64.7, 64.5, 64.6, 64.5, 64.3, 64.6, 64.8, 64.2, 64.3 (in J).
The impact energy is normally distributed with σ = 1 J. Determine how many specimens must be tested to ensure that the 95% CI on μ for this stainless steel cut at 60°C has a length of at most 1.0 J.
The bound E on the error in estimation is one-half of the length of the CI. Use Equation 4.21,
n = (z_{α/2} σ / E)²,
to determine n with E = 0.5, σ = 1, and z_{α/2} = 1.96; the solution is n = ⌈15.37⌉ = 16.
The length of CI = the width of CI = 2 * 𝐸.
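A one-line R sketch of this sample-size calculation:

E <- 0.5; sigma <- 1
z <- qnorm(0.975)             # z_{alpha/2} = 1.96 for alpha = 0.05
ceiling((z * sigma / E)^2)    # ceiling(15.37) = 16 specimens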
Assume that random variables 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 ∼𝑖.𝑖.𝑑. 𝑋 (the 𝑋𝑖 independent and have the same
distribution with a common random variable 𝑋) with expected value 𝜇 and variance 𝜎 2 .
The normal population: If the population X ∼ N(μ, σ²), then for any n the sample mean
X̄ = (Σᵢ₌₁ⁿ Xᵢ)/n
has expectation E[X̄] = μ and variance V[X̄] = σ²/n. Moreover, X̄ₙ is normal for any n:
X̄ₙ ∼ N(μ, σ²/n).
The generic population: If the population X is not normal, then X̄ₙ is approximately distributed as N(μ, σ²/n)
only when 𝑛 is large (𝑛 > 30). In brief we say the sampling distribution of the sample mean will tend
to normality asymptotically, see C.L.T. above.
Only one question remained: If population 𝑋 is not normal but 𝑛 ≤ 30 then what is the Sampling
Distribution of 𝑋? Next chapter will answer this.
1. Confidence Interval (CI or Interval Estimate): an interval estimate for an unknown population pa-
rameter. This depends on:
• information that is known about the distribution (for example, known standard deviation),
2. Error Bound for a Population Mean (EBM) or the Margin of error 𝑅(𝛼): depends on confidence level
1 − 𝛼, sample size 𝑛, and known population standard deviation 𝜎 or estimated 𝑠.
3. Confidence Level (CL): the probability 1 − α that the confidence interval contains the true population parameter. For example, if the CL = 90%, then in 90 out of 100 samples the interval estimate will enclose the true population parameter. The CI is:
(x̄ − R(α), x̄ + R(α)).
α is the significance level, the probability that the interval does not contain the unknown population parameter.
4. Degrees of Freedom (df): the number of objects in a sample that are free to vary.
6. Point Estimate: a single number computed from a sample and used to estimate a population pa-
rameter
• The t-score (statistic) has the same interpretation as the z-score. It measures how far x̄ is from μ.
• For each sample size 𝑛, there is a different Student’s 𝑡-distribution.
• For example, if we have a sample of size 𝑛 = 20 items, then we calculate the degrees of freedom
as
𝑑𝑓 = 𝑛 − 1 = 20 − 1 = 19
Yoonie is a personnel manager in a large corporation. Each month she must review 16 of the em-
ployees. From past experience, she has found that the reviews take her approximately four hours each
to do with a population standard deviation of 1.2 hours.
Let 𝑋 be the random variable representing the time it takes her to complete one review.
Assume that 𝑋 is normally distributed. Let X be the random variable representing the mean time to
complete the 16 reviews. Assume that the 16 reviews represent a random set of reviews.
What are the mean, the standard deviation, and the sample size?
Solution: Mean = 4 hours; Standard deviation = 1.2 hours; Sample size = 16.
A random sample of 𝑛 bike helmets manufactured by a company is selected. Let 𝑋 be the number
of helmets among the 𝑛 that are flawed, and let 𝑝 = 𝑃 (flawed). Assume that only 𝑋 is observed,
rather than the sequence of 𝑆’s and 𝐹 ’s.
1. Derive the maximum likelihood estimator of 𝑝. If 𝑛 = 20 and 𝑥 = 3, what is the estimate? Is this
estimator unbiased?
2. If n = 20 and x = 3, what is the MLE of the probability (1 − p)⁵ that none of the next five helmets examined is flawed?
Let 𝑋 be the proportion of allotted time that a randomly selected student spends working on a
certain aptitude test.
1. Use the method of moments to obtain an estimator of 𝜃; compute the estimate for this data.
2. Obtain the maximum likelihood estimator of 𝜃, then compute the estimate for the data.
Problem 4.11.
A random sample of the annual salaries of 144 mathematicians at Los Alamos National Lab (USA) has a mean x̄ = 100 (in units of $1000) and a known standard deviation σ = 16.
Determine the CI of the population mean 𝜇, given confidence level 1 − 𝛼 = 95%.
Problem 4.12.
A manager wishes to estimate the mean number of minutes that workers take to complete a specific
manufacturing task within ±1 min and with 1 − 𝛼 = 90% confidence. From past data, he knows the
standard deviation 𝜎 = 15 min.
Compute the minimum required sample size 𝑛, known that 𝑛 > 30.
An engineer wishes to estimate the mean velocity μ (in km/h) at which motorbikers pass an observation point, given the width of the confidence interval w = 3 km/h and the 1 − α = 99% confidence level.
If he knows σ = 1.5 km/h, compute the minimum required n (the number of motorbikers we need to observe), given that n > 30.
Problem 4.14.
The standard deviation of the weights of ‘baby’ elephants is known to be approximately 15 pounds.
We wish to construct a 95% confidence interval for the mean weight of newborn elephant calves. Fifty
newborn elephants are weighed. The sample mean is 244 pounds. The sample standard deviation is
11 pounds. Identify the following:
Problem 4.15.
[Source [?]]
CHAPTER 5. STATISTICAL HYPOTHESIS TESTING
CONFIRMING YOUR CLAIMS OR BELIEFS ABOUT POPULATION PARAMETERS
There are many situations in which we have to make decisions based on observations or data that are
random variables. What we are examining concerns
the parameters or the form of probability distribution that yields the observations.
This involves making a declaration or statement called a hypothesis about a population. The theory
behind the solutions for these situations is known as decision theory or statistical hypothesis testing.
Major topics we will learn in this part include:
Motivation
First let’s look at few problems that can be solved by the chapter’s methods.
In the past, the machines of a production company produced bearings (small mechanical components used in devices such as digital cameras, tablets, ...) with an average thickness of 0.05 cm. To determine whether these machines were still operating normally, a sample of 10 bearings was selected, and a sample average thickness of 0.053 cm and standard deviation of 0.003 cm were calculated.
Assume that a significance level is 𝛼 = 0.05.
The Telegraph newspaper reported that, on average, United Kingdom (U.K.) subscribers with third
generation technology (3G) phones in 2006 spent an average of 8.3 hours per month listening to coun-
try music on their cell phones.
To study what happens in the U.S., researchers draw the following random sample x of music-listening time per month from the U.S. population of 3G subscribers:
Do these data suggest that, on average, an American subscriber listens to country music less than a
U.K subscriber? Explain clearly your conclusion.
Problem 5.3.
• Moreover, in communication or radar technology, for instance, decision theory or hypothesis testing is known as (signal) detection theory.
• In politics, during an election year, we see articles in the newspaper that state confidence intervals
in terms of proportions or percentages.
The scientific method, briefly, states that only by following a careful and specific process can some
assertion be included in the accepted body of knowledge. This process begins with a set of assump-
tions upon which a theory, sometimes called a model, is built. This theory, if it has any validity, will lead
to predictions; what we call hypotheses.
General setting
• The two complementary hypotheses in a hypothesis testing problem are called the null hypothesis
𝐻0 and the alternative hypothesis 𝐻1 or 𝐻𝐴 .
Hypothesis testing is a decision process establishing the validity of a formulated (or conjectured) hy-
pothesis. Mathematically, suppose we observe a random sample (𝑋1 , 𝑋2 , . . . , 𝑋𝑛 ) of a random variable
𝑋 whose probability density function
𝑓 (𝑥; 𝜃) = 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ; 𝜃)
depends on a parameter 𝜃, where 𝜃 denotes a population parameter. We wish to test the assumption
𝜃 = 𝜃0 , denoted by 𝐻0 against the assumption 𝜃 ̸= 𝜃0 , denoted by 𝐻1 .
H₀: θ ∈ A  and  H₁: θ ∈ Aᶜ,
where A is a set of possible values of θ and Aᶜ is its complement.
Hypothesis-testing procedures rely on using the information in a random sample from the population
of interest.
• If this information is consistent with the hypothesis, we conclude that the hypothesis is true;
• if this information is inconsistent with the hypothesis, we conclude that the hypothesis is false.
Hence, let x = (x₁, x₂, . . . , xₙ) be the observed data vector, θ our parameter, and θ̂ an estimate of θ from the data x.
Example 5.1.
A hypothesis can be that the mean age μ of students taking the SDA course is greater than 19 years. Writing μ for the mean age, the two hypotheses are H₀: μ ≤ 19 and H₁: μ > 19.
Main aim: to draw valid conclusions from data gathered, or more precisely
to determine whether a statement about the value of a population parameter should or should
not be rejected. Thus in the age example, formally
(0 −−−−−−−− 19] (19 −−−−−−−−−− +∞)
In brief, Figure 5.2 diagrams the learning process of humans in order to solve problems in relationship
to two types of studies, observational studies (based on observation) and experimental studies.
Both types produce data.
Observational Studies. In many cases, the only viable way to collect data is to observe individuals
under natural conditions. This is true in studies of wildlife’s behaviors in natural habitat conditions (see
Figure 5.2). In observational studies, researchers observe or measure the main characteristics of interest on individuals under natural conditions (and, therefore, do not attempt to influence or change these characteristics).
Experimental Studies. Used when one influences or changes the characteristics of a phenomenon,
by altering the levels of major factors influencing the phenomenon or process, and then collecting the
data.
There are eight basic steps in the decision-making process in both types of research, including:
- Sampling time (Weather (seasons)? In production shifts or normal?) and Sampling locations
+ Early description of the analysis phase: analysis date, analyst, analyzer ...
4. Field work?
6. Data analysis:
8. Reporting the results: purpose, method of implementation, statistical evaluation, confirming corre-
lation between factors, making decisions...
Remark 2. The first three basic steps are important! If we perform them rightly, the valuable
information and knowledge that hide behind the phenomenon can be discovered.
How do we verify a hypothesis 𝐻0 ? Suppose we have collected some data. Do they support the
null hypothesis or contradict it? We need a criterion, based on the collected data, which will help us
answer the question.
(a) Gauss (normal) distribution: If X is a normal variable, written X ∼ N(μ, σ²), then its pdf is
f(x) = (1/(σ√(2π))) · e^{−(1/2)[(x−μ)/σ]²},  (5.3)
where −∞ < x < ∞, μ ∈ ℝ, σ² > 0;
f(x) is the height of the normal curve, e ≈ 2.71 and π ≈ 3.14 are constants, μ is the mean, and σ² is the variance of the normal distribution.
Assume that X₁, X₂, . . . , Xₙ are normal and independent random variables with means μ₁, μ₂, . . . , μₙ and variances σ₁², σ₂², . . . , σₙ². Then the linear combination Y = a₁X₁ + a₂X₂ + · · · + aₙXₙ is normal with mean
μ_Y = a₁μ₁ + a₂μ₂ + · · · + aₙμₙ,
and variance σ_Y² = a₁²σ₁² + a₂²σ₂² + · · · + aₙ²σₙ².
• for which sample values the decision is made to reject 𝐻0 , i.e. accept 𝐻1 ,
To be able to extract meaningful facts from huge data, two key points to keep in mind are: 1) a clear formulation of the hypotheses to be tested, and 2) a good comprehension and decent understanding of the mathematical phenomena involved. Let's see a few cases below.
Drills manufactured for oil reservoir exploration are supposed to have a mean drilling speed of 50 cm/s. From past experience we know
GUIDANCE for solving. A statistical hypothesis is a statement about the parameters of one or more
populations.
B) Manufacturing
A tire manufacturer developed a new tire designed to provide an increase in mileage over the firm’s
current line of tires. To estimate the mean number of miles provided by the new tires, the manufac-
turer selected a sample of 120 new tires for testing. The test provided a sample mean of x̄ = 36,500 miles. Hence, an estimate of the population mean tire mileage μ (for the population of new tires) was x̄ = 36,500 miles.
C) Market research
Teenagers in Thailand seem to favor a particular mobile phone brand A. We want an estimate of the proportion of surveyed respondents favoring the product. How could we do this?
Hypothesis testing consists of two contradictory hypotheses or statements, a decision based on the
data, and a conclusion. To perform a hypothesis test, we will:
2. Collect sample data (yourself, or the data or summary statistics will be provided).
4. Analyze sample data by performing the calculations that ultimately will allow you to reject or decline
to reject the null hypothesis.
Step 3 Collect the sample data and compute value of sample mean
Step 7 Make decision (conclusion) using the critical value approach, or 𝑝-value approach
The general aim: we consider a test of 𝐻0 versus 𝐻1 for a distribution with an unknown parameter 𝜃.
There are 4 possible situations, only two of which lead to a correct decision. The other two are called Type I and Type II errors; their descriptions and corresponding probabilities are given by:
A Type I error occurs when we reject a true null hypothesis; it has probability α.
A Type II error occurs when we accept a false null hypothesis; it has probability β.
(The corresponding correct decisions have probabilities 1 − α and 1 − β.)
We would like to have test procedures which make both kinds of errors small. Note that
P(reject H₀ | H₀ false) = 1 − P(accept H₀ | H₀ false) = 1 − P(Type II error) = 1 − β.
Hence, the power of the test is 1 − 𝛽. And a test has high power if the probability 1 − 𝛽 (rejecting a
false null hypothesis) is large.
CONCEPT 5.
• Alternative of the type 𝐻𝐴 : 𝜇 ̸= 𝜇0 covering regions on both sides of the hypothesis (𝐻0 : 𝜇 = 𝜇0 )
is a two-sided alternative.
• Alternative of the type 𝐻𝐴 : 𝜇 < 𝜇0 covering the region to the left of 𝐻0 is a one-sided alterna-
tive, left-tail.
• Alternative of the type 𝐻𝐴 : 𝜇 > 𝜇0 covering the region to the right of 𝐻0 is a one-sided
alternative, right-tail.
• A null hypothesis is always an equality, expressing a usual statement that people have believed in for years. In order to overturn the common belief and reject the hypothesis H₀, we need significant evidence. Such evidence can only be provided by data.
• Only when such evidence is found, and when it strongly supports the alternative 𝐻𝐴 , can the hypoth-
esis 𝐻0 be rejected in favor of 𝐻𝐴 .
E.g., the managing board of KMITL in Thailand believes that the average height 𝑋 of freshmen in
IMSE (Industrial and Management Systems Engineering) Program is 1.7m; they write the hypothesis
𝐻0 : 𝜇 = 1.7 (𝜇 = E[𝑋]).
But if you collect a sample of heights of your friends and see that their heights are around 1.65 .. 1.8
m then you might accept the alternative 𝐻𝐴 : 𝜇 > 1.7.
Example 5.2.
In oil reservoir exploration, firms use drills whose drilling speeds are supposed to be distributed around a population mean μ = 50 cm/s. From past experience we know
- the standard deviation 𝜎 = 10𝑐𝑚/𝑠 and
- the drilling speeds 𝑋𝑖 are normally distributed.
A random sample x = 𝑥1 , 𝑥2 , · · · , 𝑥10 of 10 drill speeds had a sample mean x = 45𝑐𝑚/𝑠. You can
write the hypothesis 𝐻0 : 𝜇 = 50; but could you accept the alternative 𝐻𝐴 : 𝜇 < 50?
Example 5.3.
To verify if the proportion of defective products (computers, cars...) is at most 3%, we test 𝐻0 :
𝑝 = 0.03 vs. 𝐻𝐴 : 𝑝 > 0.03, where 𝑝 is the (population) proportion of defects in the whole shipment.
Why do we choose the right-tail alternative 𝐻𝐴 : 𝑝 > 0.03?
That is because we reject the shipment only if significant/meaningful evidence supporting this 𝐻𝐴
is collected.
If the data suggest that p ≤ 0.03, the shipment will still be accepted.
As another illustration, qualified shoes for sale on the market must have sizes that fit into the well-designed boxes. We assume the null hypothesis
H₀: sizes = maxsizes
to be true, where
(sizes = maxsizes) ⇔ length = l₀, width = w₀, height = h₀.
The alternative is H₁: sizes > maxsizes, or better,
H₁: the defective proportion p(sizes > maxsizes) > 3%.
Significance level
The probability of a type I error, 𝛼, is called the significance level of the test. It is controlled by
the location of the rejection (critical) region. So it can be set as small as required.
• Indirect control: design the test process so that a small value of the Type II error probability β is obtained, that is, so that the power f(n) = 1 − β, as a function of the sample size, is large.
Step 3: Collect the data and compute the value of the sample mean
In this step researchers go on field trips and use surveys and measurement devices to capture or observe the values of the units in a sample. Then they employ the usual formulas for central tendency and spread (Chapter 2) to compute the value of the test statistic.
Example 5.4. Assume that a population is composed of 900 elements with a mean of 𝜇 = 20 units and
a standard deviation of 𝜎 = 12.
Find P[18 < x̄ < 24] for a sample of size 36.
The mean and standard error of the sampling distribution of the mean x̄ are:
E(x̄) = μ_{x̄} = μ = 20;  σ_{x̄} = σ/√n = 12/√36 = 2.
The probability that the sample mean 𝑥 of a random sample of 36 elements falls in the interval [18, 24]
is computed as follows:
z₁ = (x̄₁ − μ_{x̄})/σ_{x̄} = (18 − 20)/2 = −1,
z₂ = (x̄₂ − μ_{x̄})/σ_{x̄} = (24 − 20)/2 = 2.
Looking up these values in the z table, we get
P[18 < 𝑋 < 24] = P[−1 < 𝑍 < 2] = 0.3413 + 0.4772 = 0.8185, or 81.85%.
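The same probability can be checked directly in R:

pnorm(2) - pnorm(-1)   # P(-1 < Z < 2) = 0.8186, matching 0.3413 + 0.4772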
Suppose that x is a normally distributed random variable, with mean μ ∈ ℝ and variance σ² > 0 of the normal distribution, given by the pdf
f(x) = (1/(σ√(2π))) · e^{−(1/2)[(x−μ)/σ]²},  −∞ < x < ∞.  (5.4)
If 𝑋1 , 𝑋2 , . . . , 𝑋𝑛 is a random sample of size 𝑛 from this process, then the distribution of the sample
mean 𝑋 is 𝑁 (𝜇, 𝜎 2 /𝑛), due to Fact 5.1.
𝐻0 : 𝜇 = 𝜇0 ,
𝐻1 : 𝜇 ̸= 𝜇0 .
E[𝑋] = 𝜇, and
Var[𝑋] = 𝜎 2
Case b/ If the population 𝑋 is not normal but has finite mean and variance then for 𝑛 large, due to
CLT we have
X̄ ∼ N(μ, σ²/n), and the standardized statistic is
z = (X̄ − μ₀)/(σ/√n) ∼ N(0, 1).
The set
{z : |z| ≤ z_{α/2}}
is the acceptance region. Then we decide:
Figure 5.5: Critical region for two side test with 𝑍 statistic
Indeed, if the null hypothesis is true 𝑧 should be close to zero. Large values of |𝑧| would tend to
contradict the hypothesis. Suppose we find the value 𝑧𝛼/2 such that
The two bold areas in Figure 5.5 correspond with these probabilities. Therefore, finally we set the
rejection region as {𝑧 : |𝑧| > 𝑧𝛼/2 } or
– We see that the probability of a Type I error is the probability that Z lies in the rejection region when the null hypothesis H₀ is true, and it equals α exactly. Sometimes the Type I error probability is called the significance level, the α-error, or the size of the test.
– The rejection region, for this alternative hypothesis, consists of the two tails of the standard
normal distribution and, for this reason, we call it a two-tailed test.
– Specifications require that the mean burning rate must be 50 centimeters per second and
the standard deviation is 𝜎 = 2 centimeters per second.
2. Formulate the null hypothesis 𝐻0 : 𝜇 = 50 centimeters per second, and its alternative hypothe-
sis, 𝐻1 : 𝜇 ̸= 50.
6. Compute the test score, precisely, compute 𝑧0 associated with 𝛼 = 0.05, use two-side test in
Step 5a.
Since x̄ = 51.3 and σ = 2, with
z₀ = √n · (x̄ − μ₀)/σ,
we get
z₀ = √25 · (51.3 − 50)/2 = 3.25.
p = 1 − α/2 = Φ(z) | 99.94% | 99.5% | 97.5% | 95% | 90% | 80% | 75% | 50%
z_p | 3.25 | 2.58 | 1.96 | 1.645 | 1.28 | 0.84 | 0.6745 | 0
Figure 5.6: Critical region for two side test with 𝑍 statistic
If X₁, X₂, . . . , Xₙ are independent random variables with means E(Xᵢ) = μᵢ and variances σᵢ², their sum Y = X₁ + · · · + Xₙ has mean Σμᵢ and variance Σσᵢ². In the case of i.i.d. variables X₁, X₂, . . . , Xₙ with common mean E(Xᵢ) = μ and variance Var(Xᵢ) = σ²,
Z = (Y − nμ)/√(nσ²) = (Y − nμ)/(σ√n) = (X̄ − μ)/σ_{X̄}  (5.7)
approaches N(0, 1) in distribution.
Example 5.6. Computing Type I, Type II Error probabilities, see from Example 5.2
Drills have μ₀ = 50 cm/s, σ = 10 cm/s, and the speeds are normally distributed. n = 10 drills had a mean x̄ = 45 cm/s.
𝛼 = P[X < 48.5 when 𝜇 = 50] + P[X > 51.5 when 𝜇 = 50] =?
The 𝑧-values that correspond to the critical values 48.5 and 51.5 are
z₁ = (x̄₁ − μ)/σ_{x̄} = (48.5 − 50)/0.79 = −1.9,
z₂ = (x̄₂ − μ)/σ_{x̄} = (51.5 − 50)/0.79 = 1.9.
Therefore
α = P(Z < −1.90) + P(Z > 1.90) = 0.028717 + 0.028717 = 0.057434,
which implies that 5.74% of all random samples would lead to rejection of the hypothesis H₀: μ = 50.
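A quick R check of this Type I error probability:

2 * pnorm(-1.9)   # P(Z < -1.9) + P(Z > 1.9) = 0.0574, by symmetry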
Figure 5.7: Acceptance region and rejection region in a one-sided test
3. Choose the significance level α and compute the critical value z_{1−α} from the table
5. Case a/ If H₀: μ ≤ μ₀ and H₁: μ > μ₀, then the rejection region is {Z₀ : Z₀ > z_α}, satisfying the equation P[Z₀ > z_α] = α. This is equivalent to the region
{X̄ₙ : X̄ₙ ≥ μ₀ + z_α · σ/√n}.
−−−−−−−−− 0 −−−−− z_α −−−→ Z
A close relationship exists between hypothesis testing for 𝜃, and the confidence interval for 𝜃.
If [L, U] is a 100(1 − α)% confidence interval for the parameter θ, the test at significance level α of the hypothesis pair
𝐻0 : 𝜃 = 𝜃0 and the alternative 𝐻1 : 𝜃 ̸= 𝜃0
will lead to rejection of 𝐻0 if and only if 𝜃0 is not in the 100(1 − 𝛼)% CI [𝐿, 𝑈 ].
Using the p-value approach is controversial, as noted by the ASA. But software such as R and SPSS still uses it, hence we introduce it in this section.
Definition 5.3. The 𝑝-value is the smallest level of significance that would lead to rejection of the 𝐻0 .
It is the tail area beyond the value of the test statistic 𝑡0 for a one-sided test, or twice this area for a
two-sided test.
We discussed the use of 𝑝-values when we looked at goodness of fit tests. They can be useful as a
hypothesis test with a fixed significance level 𝛼 does not give any indication of the strength of evidence
against the null hypothesis.
The 𝑝-value depends on whether we have a one-sided or two-sided test.
Motivation
Investors in the stock market are interested in the true proportion of stocks that go up and down each
week. Businesses that sell personal computers are interested in the proportion of households in the
United States that own personal computers. Confidence intervals can be calculated for the true proportion of stocks
that go up or down each week and
for the true proportion of households in the United States that own personal computers.
• The procedure to find the confidence interval for a population proportion is similar to that for the
population mean, but the formulas are a bit different although conceptually identical.
• While the formulas are different, they are based upon the same mathematical foundation given to us
by the Central Limit Theorem. Because of this we will see the same basic format using the same
three pieces of information:
- the number of standard deviations we need to have the confidence in our estimate that we desire.
Distribution used for a proportion 𝑃 : First, the underlying distribution of a proportion 𝑃 of interest
is a binomial distribution. Why?
We knew that if 𝑋 represents the number of successes in 𝑛 trials, then 𝑋 is a binomial random
variable, and 𝑋 ∼ Bin(𝑛, 𝑝) where 𝑛 is the number of trials and 𝑝 is the probability of a success.
The point estimator of p is the sample proportion
P̂ = X/n.
• The formula for the confidence interval for a population proportion follows the same format as
that for an estimate of a population mean.
Mean and standard deviation (standard error) of the estimator P̂, by the CLT, are given by
μ_{P̂} = E[P̂] = E[X/n] = np/n = p,  (5.8)
σ²_{P̂} = σ²_{X/n} = σ²_X / n² = npq/n² = pq/n,  (5.9)
so the standard error (of the sampling distribution) of p̂ is
σ_{p̂} = √(pq/n) = √(p(1 − p)/n).
P[−z_{α/2} < Z < z_{α/2}] = 1 − α,  with  Z = (P̂ − p)/σ_{p̂},  (5.10)
here z_{α/2} is the value above which we find an area of α/2 under the standard normal curve. Substituting for Z, we write:
P[−z_{α/2} < (P̂ − p)/σ_{p̂} < z_{α/2}] = 1 − α,  (5.11)
and this gives us the CI of p at significance level α:
P̂ − z_{α/2}·√(pq/n) < p < P̂ + z_{α/2}·√(pq/n).
When 𝑛 is large, and we don’t know the unknown population proportions 𝑝, 𝑞 beforehand, very little
error is introduced by substituting the point estimate 𝑝ˆ = 𝑥/𝑛 for the 𝑝 under the radical sign.
More precisely, 𝑝ˆ is the numerical value of the statistic 𝑃ˆ , also the estimated proportion of successes
or the sample proportion of successes,
p̂ − z_{α/2}·√(p̂q̂/n) < p < p̂ + z_{α/2}·√(p̂q̂/n).
Example 5.7.
Suppose that a market research firm is hired to estimate the percent of adults living in a large city
who have cell phones.
Five hundred randomly selected adult residents in this city are surveyed to determine whether they
have cell phones. Of the 500 people sampled, 421 responded yes - they own cell phones.
Using a 95% confidence level, compute a confidence interval estimate for the true proportion of adult
residents of this city who have cell phones.
The solution step-by-step. Let 𝑋 = the number of people in the sample who have cell phones. 𝑋 is
binomial: the random variable is binary, people either have a cell phone or they do not.
To calculate the confidence interval, we must find 𝑝ˆ, 𝑞ˆ.
𝑛 = 500, 𝑥 = the number of successes in the sample = 421 ... Answer: 0.810 ≤ 𝑝 ≤ 0.874
Interpretation: We estimate with 95% confidence that between 81% and 87.4% of all adult residents
of this city have cell phones.
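A short R sketch reproducing this interval via the normal approximation (not R's prop.test, which adds a continuity correction):

phat <- 421/500                        # sample proportion, 0.842
se <- sqrt(phat * (1 - phat) / 500)    # estimated standard error
phat + c(-1, 1) * qnorm(0.975) * se    # 95% CI: about (0.810, 0.874)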
Many decision making problems require to determine whether the means or proportions of two popu-
lation are the same or different. In general, the two sample problem arises when two populations are
to be compared e.g.,
1. IQs of babies bottle-fed or breast-fed.
2. Time to complete task when instructed in two ways.
3. Weight gain of anorexic girls when subjected to two different treatments.
In Example 1, IQ is the response variable (measured at a specified age); the explanatory variable or
treatment is the type of feeding at infancy. Examples 2,3 could be designed to be carried out under
experimental conditions.
• Taking example 3, suppose a random sample of 16 girls is taken all of whom are anorexic. Randomly
allocate eight of the girls to treatment 1 and the remaining eight to treatment 2.
• To assess the effectiveness of the treatments it is usual that one of the treatments is a ‘control’ i.e.,
no treatment or the usual treatment. Thus
After careful study of this session, you should be able to do the following.
• Test hypotheses and construct confidence intervals on the difference in means of two normal
distributions.
• Compute power, Type II error probability, and make sample size decisions for two-sample
tests on means.
Assumptions
V[X̄ − Ȳ] = V[X̄] + V[Ȳ] = σ²_X/n₁ + σ²_Y/n₂.  (5.13)
The test statistic is
Z = (x̄ − ȳ − Δ₀) / √(σ²_X/n₁ + σ²_Y/n₂),
which follows N(0, 1) when the variances are known. When σ_X and σ_Y are unknown but assumed equal, we instead use a pooled T statistic (given below) that has a t distribution with n₁ + n₂ − 2 degrees of freedom, based on the pooled variance. Here it is assumed that the experimental units are relatively homogeneous. The two treatments are randomly assigned to the experimental material and the appropriate measurements taken.
Example 5.8.
Vitamin D deficiency, it is conjectured, could be related to the amount of fibre in the diet.
Two groups of healthy adults are randomly assigned to one of two diets: Normal or High Fibre.
A measure of vitamin D is then obtained for each of the subjects:
Assumptions:
Notation. 𝑋1 = 𝑋, 𝑋2 = 𝑌 .
For the Normal diet,
let x̄₁ be the sample mean and s₁² the sample variance;
let μ₁ be the population mean, with common variance σ².
Hypotheses:
𝐻0 : 𝜇1 = 𝜇2 , or 𝜇1 − 𝜇2 = ∆0 = 0
versus H₁: μ₁ ≠ μ₂.
Test Statistic: Since we do not know 𝜎1 , 𝜎2 and small sample sizes, use sample variances:
T = (x̄₁ − x̄₂) / (s_p · √(1/n₁ + 1/n₂)),
where the pooled variance is
s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2).
Make decision: If 𝐻0 is true then 𝑇 ∼ 𝑡𝑛1 +𝑛2 −2 , where
𝑛1 + 𝑛2 − 2 = 6 + 7 − 2 = 11
𝑇 > 𝑡𝑛1 +𝑛2 −2;𝛼/2 or 𝑇 < −𝑡𝑛1 +𝑛2 −2;𝛼/2 ; that means 𝑇 > 2.20 or 𝑇 < −2.20.
From samples, we get 𝑥1 = 27.70, 𝑥2 = 18.61, 𝑠21 = 29.63, 𝑠22 = 30.80 and 𝑠2𝑝 = 30.27.
Hence the test statistic 𝑇 = ... = 2.97 > 2.20: we reject the null hypothesis 𝐻0 .
We conclude that there is a significant difference between the vitamin D levels of the two
groups. (Vitamin D levels are lower by an estimated 9.09 in the high fibre diet group.)
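A sketch of this pooled two-sample computation in R from the summary statistics (the raw vitamin D measurements are not reproduced here):

n1 <- 6; n2 <- 7
xbar1 <- 27.70; xbar2 <- 18.61
s1sq <- 29.63; s2sq <- 30.80
sp2 <- ((n1 - 1)*s1sq + (n2 - 1)*s2sq) / (n1 + n2 - 2)  # pooled variance, 30.27
(xbar1 - xbar2) / (sqrt(sp2) * sqrt(1/n1 + 1/n2))       # T, about 2.97
qt(0.975, df = n1 + n2 - 2)                             # critical value, about 2.20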
a/ If we know σ₁, σ₂, then
Z = (x̄₁ − x̄₂ − Δ₀) / √(σ₁²/n₁ + σ₂²/n₂);
otherwise
T = (x̄₁ − x̄₂ − Δ₀) / (s_p · √(1/n₁ + 1/n₂)),
where the pooled standard deviation is
s_p = √{ [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2) }.
T = (X̄ − μ)/(S/√n) ∼ t_{n−1}.  (5.14)
Here the two samples are not independent. Subjects to be allocated to the two treatments are paired.
Pairing is sometimes achieved
by using twins, using the same subject twice, or
simply pairing by some other characteristic.
Example 5.9. Six healthy male subjects were taken and their growth hormone level measured at night
under two conditions: Post daily exercise and without daily exercise. The results were:
Subject:        1    2    3    4    5    6
Post Exercise: 13.6 14.7 42.8 20.0 19.2 17.3
Control:        8.5 12.6 21.6 19.4 14.7 13.6
• Observed
𝑇 = ... = 2.017 ∈ [−2.57, 2.57],
so no evidence to reject 𝐻0 . Conclude that on average there is not a significant difference in the
growth hormone level when exercise is taken compared to that when there is no exercise.
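This paired analysis can be reproduced in R directly from the table above:

post <- c(13.6, 14.7, 42.8, 20.0, 19.2, 17.3)
ctrl <- c(8.5, 12.6, 21.6, 19.4, 14.7, 13.6)
t.test(post, ctrl, paired = TRUE)   # t = 2.017 on 5 df
qt(0.975, df = 5)                   # critical value 2.57, so H0 is not rejected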
where 𝐸 is the max margin of error as usual. Remember also to round up if 𝑛 is not an integer. This
ensures that the level of confidence does not drop below 100(1 − 𝛼)%, where 𝛼 is a significance level.
Given the Type I error α and Type II error β, we can determine the sample size n:
n = ((z_{α/2} + z_β) · σ / δ)²,  (5.16)
where δ = μ − μ₀, and n must be rounded up to an integer.
2. Point Estimate: a single number computed from a sample and used to estimate a population param-
eter
3. Confidence Interval (CI ): an interval estimate for an unknown population parameter. This depends
on: • the desired confidence level, • information that is known about the distribution • the sample and
its size.
4. Confidence Level (CL): the percent expression for the probability that the confidence interval contains
the true population parameter; for example, if the CL = 90%, then in 90 out of 100 samples the interval
estimate will enclose the true population parameter.
5. Hypothesis: a statement about the value of a population parameter, in case of two hypotheses
The actual test begins by considering two hypotheses. They are called the null hypothesis (notation
𝐻0 ) and the alternative hypothesis (notation 𝐻𝑎 ). These hypotheses contain opposing viewpoints.
6. 𝐻0 - The null hypothesis: It is a statement about the population that either is believed to be true or
is used to put forth an argument unless it can be shown to be incorrect beyond a reasonable doubt.
Since the null and alternative hypotheses are contradictory, you must examine evidence to decide if
you have enough evidence to reject the null hypothesis or not.
The evidence is in the form of sample data.
8. After you have determined which hypothesis the sample supports, you make a decision. There are
two options for a decision.
1/ ”reject 𝐻0 ” if the sample information favors the alternative hypothesis or
2/ ”do not reject 𝐻0 ” or ”decline to reject 𝐻0 ” if the sample information is insufficient to reject the null
hypothesis.
• The t-score (statistic) has the same interpretation as the z-score. It measures how far x̄ is from μ.
• For each sample size 𝑛, there is a different Student’s 𝑡-distribution.
• For example, if we have a sample of size 𝑛 = 20 items, then we calculate the degrees of freedom
as
𝑑𝑓 = 𝑛 − 1 = 20 − 1 = 19
and we write variable 𝑇 ∼ 𝑡19 .
Examples include the CLT: the sample mean follows the normal distribution X̄ ∼ N(μ, σ²/n) when n is large (n > 30), and the standardized version of X̄,
Z_n = (X̄_n − μ)/(σ/√n),  satisfies  lim_{n→∞} Z_n = Z ∼ N(0, 1).
5.7 Problems
5.7.1 Fundamentals
1. You are testing that the mean speed of your cable Internet connection is more than three Megabits
per second.
ANS: The random variable is the mean Internet speed in Megabits per second.
ANS: DIY
2. The mean entry level salary of an employee at a company in USA is $58,000. You believe it is higher
for IT professionals in the company.
3. In a population of fish, approximately 42% are female. A test is conducted to see if, in fact, the
proportion is less. State the null and alternative hypotheses.
ANS: DIY
4. A group of doctors is deciding whether or not to perform an operation. Suppose the null hypothesis
𝐻0 is: the surgical procedure will go well.
ANS: Type I: The procedure will go well, but the doctors think it will not.
Type II: The procedure will not go well, but the doctors think it will.
(a) Consider a hypothesis testing problem: It is believed that 70% of males pass their drivers test in
the first attempt, while 65% of females pass the test in the first attempt. Of interest is whether
the proportions are in fact equal.
Indicate if the hypothesis test is for
a. independent group means, population standard deviations, and/or variances known
b. independent group means, population standard deviations, and/or variances unknown
c. matched or paired samples; d. single mean; e. two proportions; f. single proportion.
ANS: e. two proportions
(b) A study is done to determine which of two soft drinks has more sugar. There are 13 cans of
Beverage A in a sample and six cans of Beverage B. The mean amount of sugar in Beverage A
is 36 grams with a standard deviation of 0.6 grams. The mean amount of sugar in Beverage B
is 38 grams with a standard deviation of 0.8 grams.
The researchers believe that Beverage B has more sugar than Beverage A, on average. Both
populations have normal distributions.
Is this a one-tailed or two-tailed test? ANS: This is a one-tailed test.
(c) The Telegraph newspaper reported that, on average, United Kingdom (U.K.) subscribers with
third generation technology (3G) phones in 2006 spent an average of 8.3 hours per month
listening to country music on their cell phones.
To study what happens in the U.S., researchers draw the following random sample x of music-listening time per month from the U.S. population of 3G subscribers:
Do these data suggest that, on average, an American subscriber listens to country music less
than a U.K subscriber? Explain clearly your conclusion.
Hints: If use confidence interval or hypothesis testing you must choose a confidence level of
95%.
In the past, the machines of a production company produced bearings (small mechanical components used in devices such as digital cameras, tablets, ...) with an average thickness of 0.05 cm. To determine whether these machines were still operating normally, a sample of 10 bearings was selected, and a sample average thickness of 0.053 cm and standard deviation of 0.003 cm were calculated. Assume that the significance level is α = 0.05.
1. Test the hypothesis that this machine works normally.
2. Find the P-value of the above test.
Using the statistic
T = (X̄ − μ)/σ_{X̄} = ((X̄ − μ)/S) · √n ∼ t_{n−1},
we get the value
t = ((x̄ − 0.05)/s) · √n = 3.1623 > t_α = 2.2622 (see Table 5.2),
so we reject H₀ and accept H₁. The machine is not working properly.
(Excerpt of the Student t table:)
df = 1: t_{0.80} = 1.376, . . . , t_{0.99} = 31.82
df = 2: t_{0.80} = 1.061, . . . , t_{0.99} = 6.965
The insurance premium (or insurance fee) 𝐾 paid by a customer is thought to be affected by the risk
preference (or risk type) of that customer. The standard deviation of 𝐾 is known to be 𝜎 = 3 (unit 10
USD per year) regardless of the risk types, including
𝑢 = 66.4, 71.7, 70.3, 69.3, 64.8, 69.6, 68.6, 69.4, 65.3, 68.8
𝑣 = 57.9, 66.2, 65.4, 65.4, 65.2, 62.6, 67.6, 63.7, 67.2, 71.0
• If we select significance level 𝛼 = 0.05, which of the following pair of hypotheses can you conclude
from the given data sets?
𝐻0 : 𝜇1 = 𝜇2 , 𝐻1 : 𝜇1 ̸= 𝜇2 ; or 𝐻0 : 𝜇1 − 𝜇2 = 0, 𝐻1 : 𝜇1 > 𝜇2 ?
Hint: for a one-sided hypothesis test of the difference of the population means μ₁, μ₂, we need to find the Student critical value t_{α, n₁+n₂−2} and the score t₀.
1. Which two distributions can you use for hypothesis testing for this chapter? ANS: A normal distribution
or a Student’s 𝑡-distribution
2. Which distribution do you use when you are testing a population mean and the population standard
deviation is known? Assume a normal distribution, with 𝑛 ≥ 30.
ANS: Use a normal distribution.
3. A population mean is 13. The sample mean is 12.8, and the sample standard deviation is two. The
sample size is 20.
What distribution should you use to perform a hypothesis test?
Assume the underlying population is normal. ANS: a Student’s 𝑡-distribution
4. A particular brand of tires claims that its deluxe tire averages at least 50,000 miles before it needs to
be replaced. From past studies of this tire, the standard deviation is known to be 8,000. A survey of
owners of that tire design is conducted. From the 28 tires surveyed, the mean lifespan was 46,500
miles with a standard deviation of 9,800 miles.
Using 𝛼 = 0.05, is the data highly inconsistent with the claim?
HINT: Use the 7-step procedure for one-sided test, with
ANS: There is sufficient evidence to conclude that the mean lifespan of the tires is less than 50,000 miles. The CI is (43,537, 49,463).
5. The Nation, a key press in Thailand recently reported that average PM2.5 pollution index in the air of
BKK metropolis was 24 (in microgram/m³) in summer. In Bangkok City, Thai researchers observed
the following random sample of
𝑥 = 24, 22, 26, 34, 35, 32, 33, 29, 19, 36, 30, 15, 17, 28, 38, 40, 37, 27
Consider a political election in the US (or any democratic nation having two parties), where each person either says YES (to vote for the 2nd term of President T) or NO.
Write X ∼ Bin(k, θ) for the binomial formed by a Bernoulli sample of size k = 3,
Z₁, Z₂, · · · , Z_k ∼ B(θ);
then X = Σᵢ₌₁ᵏ Zᵢ is the number of individuals in each sample of size k who say "Yes".
• DATA: We observed a random sample X = 𝑋1 , 𝑋2 , · · · , 𝑋𝑛 with 𝑛 = 40, each 𝑋𝑖 follows the same
binomial 𝑋 ∼ Bin(3, 𝜃), known E[𝑋] = 𝑘 𝜃.
X = j           0    1    2    3
Frequency n_j   6   10   14   10
• QUESTIONS:
b/ Compute the corresponding probability P(𝑋 ≥ 2), indicating an estimated proportion of Americans
who support President T.
Suppose the value j has frequency n_j; then n = Σⱼ₌₀ᵏ n_j, with j ∈ Range(X). We prove that
θ̂_n = (Σᵢ₌₁ⁿ Xᵢ)/(nk) = (Σⱼ₌₀ᵏ j · n_j)/(nk),
with n_j = #{i = 1..n : Xᵢ = j}. Indeed, θ = Prop(Yes) = the proportion of people in the population who say "Yes"; then from Example 4.5, the ML estimator of θ is
θ̂_n = (Σᵢ₌₁ⁿ Xᵢ)/(nk) ∈ {0, 1/(nk), 2/(nk), . . . , nk/(nk)},
and approximately
θ̂_n = S/(nk) ∼ N(θ, θ(1 − θ)/n).
This P(X ≥ 2) measures how likely Americans are to vote for President T. In general we view
P[X ≥ c] = Σ_{c ≤ j ≤ k} P[X = j],
where
P[X = j] = C(k, j) · θ̂^j (1 − θ̂)^{k−j}.  (5.20)
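A small R sketch of the estimate and the probability, from the frequency table above:

nj <- c(6, 10, 14, 10)                      # frequencies of X = 0, 1, 2, 3
n <- sum(nj); k <- 3
theta.hat <- sum((0:3) * nj) / (n * k)      # MLE: 68/120 = 0.5667
1 - pbinom(1, size = k, prob = theta.hat)   # P(X >= 2) under Bin(3, theta.hat)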
The ASEAN nations, in particular the Mekong River Basin countries of Laos, Thailand, Cambodia and Vietnam, have been experiencing extreme natural threats or disasters (defined as either flooding or drought) for years. Those extreme phenomena are called rare events, and a Poisson distribution Poiss(θ) is used to model them.
Scientists of the Mekong River Committee (MRC) observed X, the monthly number of natural disasters, continuously over the last 5 years in the entire region. They code the threat levels with five codes 0, 1, 2, 3, 4, where a larger value means a more serious disaster.
The following table shows observed data
x           0   1   2   3   4
Frequency   9  14  14  12  11
• QUESTIONS:
a/ Find the maximum likelihood estimate (MLE) 𝜃̂︀ of 𝜃.
b/The most fatal events are expressed by codes 3 and 4.
Compute the probability P(X ≥ 3) (measuring how likely the most fatal events were in the past 5 years), from which the MRC could work out risk-control solutions.
• Since the Xᵢ are i.i.d., we can write the likelihood function (of observing the data vector x = (x₁, x₂, . . . , xₙ)) with respect to θ by the formula
L(θ) = f(X; θ) = Πᵢ p(Xᵢ; θ).
Check that ∂²L*(θ̂)/∂θ² < 0. The estimate is θ̂_ML = θ̂ = X̄ = (Σ xᵢ)/n = (1 · 14 + 2 · 14 + 3 · 12 + 4 · 11)/60 = 2.03.
x           0   1   2   3   4
Frequency   9  14  14  12  11
b/ The probability P[X ≥ 3] of the most fatal events, associated with the ML estimate λ̂_ML, is
P[X ≥ 3] = Σ_{3 ≤ x} P[X = x] = 1 − P[X < 3] = 1 − F(2),
where
P[X = x] = λ̂^x e^{−λ̂} / x!,  x = 0, 1, 2, . . .  (5.21)
is just the pmf of the Poisson distribution Pois(λ̂). (3 points)
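The same two steps in R, from the frequency table:

nj <- c(9, 14, 14, 12, 11)               # frequencies of x = 0, 1, 2, 3, 4
lambda.hat <- sum((0:4) * nj) / sum(nj)  # MLE: 122/60 = 2.03
1 - ppois(2, lambda.hat)                 # P(X >= 3) = 1 - F(2), about 0.33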
Engineering and economics students nowadays need to know a few key quantitative methods in order to properly design and test their prototypes before handing products (cars, mobile phones, sensors, ...) or services over to industry for mass manufacturing. These tasks essentially need a causal analysis between predictors and responses (dependent variables).
Chapter 6 discusses a group of methods for this task, called statistically designed experiments,
an essential knowledge body for students in engineering institutions.
What if predictors (independent variables) are quantitative? The causal relationship between
predictors and a response variable will be studied from Chapter 9, to Chapter 11. Chapter
9 covers correlation and regression analysis, linear regression, and analysis of variance for
regression. Chapter 10 and Chapter 11 discuss advanced aspects of linear regression.
Chapter 6
[Source [56]]
CHAPTER 6. EXPERIMENTAL DESIGNS IN ENGINEERING
CAUSAL ANALYSIS WITH ORDINAL VARIABLES
6.1 Introduction
Statistically designed experiments (or Experimental Designs in brief) have been used to accelerate
learning since their introduction by R.A. Fisher in the first half of the twentieth century. One of Fisher’s
major contributions was the development of factorial experiments, which can simultaneously study
several factors, see major methods in texts [6], [?] and [73].
New challenges and opportunities arose when Fisher’s ideas were applied in the industrial environ-
ment. Experimental Designs are now recognized as essential for rapid learning and manufacturing, and thereby for reducing the time-to-market of products, while preserving high quality and peak performance.
Nowadays, statistically based experimental design techniques are particularly useful in the engi-
neering world for solving many important problems: discovery of new basic phenomena that can lead
to new products and commercialization of new technology including new product development, new
process development, and improvement of existing products and processes.
Repeated experiments
Experiments (or repeated experiments) are used in industry to improve productivity, reduce variability
and obtain robust products/processes. In this section we study how to design and analyze experi-
ments which are aimed at testing hypotheses (both scientific and technological).
These hypotheses are concerned with
• the effects of procedures or treatments on the yield; the relationship between variables;
• and the conditions under which a production process yields maximum output or optimum results.
4. Conclusion: what has been learned about the original conjecture from the experiment.
Often the experiment will lead to a revised conjecture, a new experiment, and so forth.
The following are guiding principles which we follow in designing experiments. They are designed to
ensure high information quality (InfoQ) of the study, and robust products in engineering and manufac-
turing (see Kenett [?]).
1. The objectives of the study should be well stated, and criteria established to test whether these
objectives have been met.
2. The response variable(s) should be clearly defined so that the study objectives are properly trans-
lated. At this stage measurement uncertainty should be established.
3. All factors which might affect the response variable(s) should be listed, and specified. We call these
the controllable factors. This requires interactive brainstorming with content experts.
6. A statistical model should be formulated concerning the relationship between the pertinent vari-
ables, and their error distributions. This can rely on prior knowledge or literature search.
7. An experimental layout or experimental array should be designed so that the inference from the
gathered data will be:
8. The trials should be performed if possible in a random order, to avoid bias by factors which are not
taken into consideration.
10. The execution of the experiment should follow the protocol with proper documentation.
11. The results of the experiments should be carefully analyzed and reported ensuring proper documen-
tation and traceability.
12. Confirmatory experiments should be conducted to validate the inference (conclusions) of the main
experiments.
The Fisher random variable is a ratio of independent mean squares, denoted by F[ν₁, ν₂], where ν₁ > 0 and ν₂ > 0 are the numerator and denominator degrees of freedom, respectively. Its density function f(x) is shown in Figure 6.2, with Range = {x : x > 0}.
Expectation: E[F[ν₁, ν₂]] = ν₂/(ν₂ − 2), for ν₂ > 2;
Variance:
V[F[ν₁, ν₂]] = 2ν₂²(ν₁ + ν₂ − 2) / (ν₁(ν₂ − 2)²(ν₂ − 4)),  if ν₂ > 4.
Figure 6.2: The pdf of 𝐹 (5, 5), 𝐹 (5, 15) and percentiles
q_α[ν₁, ν₂] = 1 / q_{1−α}[ν₂, ν₁].  (6.1)
For example, q_{0.05}[15, 10] = 1/q_{0.95}[10, 15] = 1/2.54 = 0.3932, due to
q_{1−α}[ν₁, ν₂] = 1 / q_α[ν₂, ν₁].
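This reciprocal identity is easy to check numerically in R:

qf(0.05, df1 = 15, df2 = 10)       # the lower 5% quantile, about 0.393
1 / qf(0.95, df1 = 10, df2 = 15)   # the same value, 1/2.54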
When s₁² and s₂² are sample variances from independent SRSs of sizes n₁ and n₂ drawn from normal populations, the statistic
F = (s₁²/σ₁²) / (s₂²/σ₂²)
has the F[n₁ − 1, n₂ − 1] distribution.
• Relationship: for independent chi-square variables,
(χ²[ν₁]/ν₁) / (χ²[ν₂]/ν₂) = F[ν₁, ν₂].
The use of experimental design in the engineering design process can result in
• products that have better field performance and reliability than their competitors, and
For the sake of explanation, let us start first with CRD- the simplest repeated experiment, in which the
response depends on one factor only. Thus, let 𝐴 designate some factor, which is applied at different
levels or categories 𝐴1 , · · · , 𝐴𝑎 . The levels of 𝐴 are also called ‘treatments.’
• Suppose that at each level of 𝐴 we make 𝑛 independent repetitions (replicas) of the experiment.
Let 𝑌𝑖𝑗 (𝑖 = 1, 2, · · · , 𝑎 and 𝑗 = 1, 2, · · · , 𝑛) denote the observed yield at the 𝑗-th replication of level
𝐴𝑖 . We model the random variables 𝑌𝑖𝑗 by
• Errors 𝑒𝑖𝑗 (for 𝑖 = 1, 2, · · · , 𝑎; 𝑗 = 1, 2, · · · , 𝑛) are independent random variables such that (zero mean
and constant variance)
E[𝑒𝑖𝑗 ] = 0 and V(𝑒𝑖𝑗 ) = 𝜎 2 . (6.4)
Put
Ȳᵢ = (1/n) Σⱼ₌₁ⁿ Yᵢⱼ,  i = 1, 2, · · · , a.
Because Σᵢ₌₁ᵃ τᵢᴬ = 0, we obtain the expectation of the grand mean
E[Ȳ] = μ.  (6.7)
A CRD design with experimental data can be shown in a table such as Table 6.2 below.
A manufacturer of paper used for making grocery bags is interested in improving the product’s tensile
strength.
• Product engineers believe that tensile strength is a function of the hardwood concentration in the pulp
and that the range of hardwood concentrations of practical interest is between 5% and 20%.
Hardwood            Observations             Total   Average
concentration (%)
 5                   7   8  15  11   9  10     60     10.00
10                  12  17  13  18  19  15     94     15.67
15                  14  18  19  17  16  18    102     17.00
20                  19  25  22  23  18  20    127     21.17
                    Σᵢⱼ yᵢⱼ = 383                     ȳ = 15.96
• A team of engineers responsible for the study decides to investigate four levels of hardwood concentration: 5%, 10%, 15%, and 20%.
• They decide to make up six test specimens at each concentration level by using a pilot plant. All 24
specimens are tested on a laboratory tensile tester in random order.
• The role of randomization in this experiment is extremely important. By randomizing the order of
the 24 runs, the effect of any nuisance variable that may influence the observed tensile strength is
approximately balanced out.
they are the numbers in the last column of Table 6.2.
• The linear statistical model for the fiber strength of the (ij)-th observation yields
E[Ȳᵢ] = μ + τᵢ,  i = 1, 2, · · · , 4;  (6.9)
We can use the analysis of variance (ANOVA) to test the hypothesis that different hardwood concentrations do not affect the average fiber strength of the shopping bags. The hypotheses are
H₀: τ₁ = τ₂ = τ₃ = τ₄ = 0,
H₁: τᵢ ≠ 0 for at least one i ∈ {1, 2, 3, 4}.
With the above data we have the boxplot in Figure (a). If it is assumed that the 4 populations corre-
sponding to 4 treatments are all Gaussian and have the same variance 𝜎 2 , then the graph of the effect
of treatments on bag strength 𝑌 is given in Figure (b).
The ANOVA partitions the total variability in the sample data into two component parts. The total variability in the data is described by the total sum of squares (of the observed deviations from ȳ):
SST = Σᵢ₌₁⁴ Σⱼ₌₁⁶ (yᵢⱼ − ȳ)² = SS_Treatm + SSE = 512.96,  (6.10)
where
SS_Treatm = n Σᵢ₌₁⁴ (ȳᵢ − ȳ)² = 382.79,
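As a hedged sketch, this one-way ANOVA can be reproduced in R from the data of Table 6.2 (the variable names are ours):

strength <- c( 7,  8, 15, 11,  9, 10,
              12, 17, 13, 18, 19, 15,
              14, 18, 19, 17, 16, 18,
              19, 25, 22, 23, 18, 20)
conc <- factor(rep(c(5, 10, 15, 20), each = 6))  # hardwood concentration level
summary(aov(strength ~ conc))   # treatment SS about 382.79, total SS 512.96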
Blocking and randomization are devices in the planning of experiments which are aimed at increasing the precision of the outcome and ensuring the validity of the inference. Blocking is used to reduce error variation.
A block is a portion of the experimental material that is expected to be more homogeneous than the
whole aggregate.
For example, if the experiment is designed to test the effect of polyester coating of electronic circuits
on their current output, the variability between circuits could be considerably bigger than the effect
of the coating on the current output. In order to reduce this component of variance, one can block
by circuit. Each circuit will be tested under two treatments: no-coating and coating. We first test the
current output of a circuit without coating. Later we coat the circuit, and test again.
• Such a comparison of before and after a treatment, of the same units, is called paired-comparison.
Other examples of blocks could be machines, shifts of production, days of the week, operators, etc.
• Generally, if there are 𝑡 treatments to compare, and 𝑏 blocks, and if all 𝑡 treatments can be performed
within a single block, we assign all the 𝑡 treatments to each block.
The order of applying the treatments within each block should be randomized.
Such a design is called a randomized complete block design (RCBD). We will see later how a
proper analysis of the yield can validly test for the effects of the treatments.
• If not all treatments can be applied within each block, it is desirable to assign treatments to blocks in
some balanced fashion. Such designs, to be discussed later, are called balanced incomplete block
designs (BIBD).
where 𝑦𝑖𝑗 is the yield of the 𝑖-th treatment in the 𝑗th block.
for all 𝑖 = 1, 2, · · · , 𝑎, 𝑗 = 1, 2, · · · , 𝑏.
• The main effect of the 𝑖-th treatment is 𝜏𝑖 , and the main effect of the 𝑗-th block is 𝛽𝑗 .
• It is assumed that the effects are additive (no interaction). Under this assumption, each treatment is
tried only once in each block. The different blocks serve the role of replicas.
However, since the blocks may have additive effects, 𝛽𝑗 , we have to adjust for the effects of blocks in
estimating 𝜎 2 . This is done as shown in the ANOVA table below.
Here,
SST = Σᵢ₌₁ᵗ Σⱼ₌₁ᵇ (Yᵢⱼ − Ȳ)²,  (6.13)
SSTR = b Σᵢ₌₁ᵗ (Ȳᵢ. − Ȳ)²,  [sum of squares for treatments]  (6.14)
SSBL = t Σⱼ₌₁ᵇ (Ȳ.ⱼ − Ȳ)²,  [sum of squares for blocks]  (6.15)
and
SSE = SST − SSTR − SSBL  [sum of squares for random error].  (6.16)
Source of Variation | DF          | SS   | MS   | E(MS)
Treatments          | t − 1       | SSTR | MSTR | σ² + (b/(t−1)) Σᵢ₌₁ᵗ τᵢ²
Blocks              | b − 1       | SSBL | MSBL | σ² + (t/(b−1)) Σⱼ₌₁ᵇ βⱼ²
Error               | (t−1)(b−1)  | SSE  | MSE  | σ²
Total               | tb − 1      | SST  | -    | -
Besides,
Ȳᵢ. = (1/b) Σⱼ₌₁ᵇ Yᵢⱼ,  Ȳ.ⱼ = (1/t) Σᵢ₌₁ᵗ Yᵢⱼ.  (6.17)
As mentioned before, it is often the case that the blocks are not sufficiently large to accommodate all
the 𝑡 treatments. For example, in testing the wear-out of fabric one uses a special machine (Martindale
wear tester) which can accommodate only four pieces of clothes simultaneously. Here the block size is
fixed at 𝑘 = 4, while the number of treatments 𝑡, is the number of types of cloths to be compared.
Balanced Incomplete Block Designs (BIBD) are designs which assign 𝑡 treatment to 𝑏 blocks of
size 𝑘 (𝑘 < 𝑡) in the following manner.
According to these requirements there are, altogether, 𝑁 = 𝑡 𝑟 = 𝑏 𝑘 trials. Moreover, the following
equality should hold
𝜆(𝑡 − 1) = 𝑟(𝑘 − 1). (6.18)
• One can obtain a BIBD by the complete combinatorial listing of the C(t, k) selections without replacement of k out of t letters. In this case, the number of blocks is
b = C(t, k).  (6.19)
• The number of replicas is r = C(t − 1, k − 1), and λ = C(t − 2, k − 2).
• Such designs of BIBD are called combinatoric designs. A BIB design is said to be symmetric if
𝑡 = 𝑏 and consequently 𝑟 = 𝑘; otherwise, asymmetric.
DISCUSSION.
Complete combinatoric designs might generally be too big. For example, if t = 8 and k = 4 we are required to have b = C(8, 4) = 70 blocks. Thus, the total number of trials is N = 70 × 4 = 280, with r = C(7, 3) = 35 and λ = C(6, 2) = 15.
There are advanced algebraic methods which can yield smaller designs for t = 8 and k = 4. But it is not always possible to have a BIBD smaller in size than a complete combinatoric design. Such a case is t = 8 and k = 5: here the smallest number of blocks possible is b = C(8, 5) = 56, and N = kb = 5 × 56 = 280.
Let B_i denote the set of treatments in the i-th block. For example, if block 1 contains the treatments 1, 2, 3, 4, then B_1 = \{1, 2, 3, 4\}. Let Y_{ij} be the yield of treatment j \in B_i. The effects model is

Y_{ij} = \mu + \beta_i + \tau_j + e_{ij},   j \in B_i,   i = 1, \cdots, b,

where \{e_{ij}\} are random experimental errors, with E(e_{ij}) = 0 and V(e_{ij}) = \sigma^2 for all (i, j).
• Let T_j be the set of all indices of blocks containing the j-th treatment. The least squares estimates of the treatment effects are obtained in the following manner.
Let W_j = \sum_{i \in T_j} Y_{ij} be the sum of all Y values under the j-th treatment. Let

W_j^* = \sum_{i \in T_j} \sum_{l \in B_i} Y_{il}

be the sum of the values in all the r blocks which contain the j-th treatment. Compute the adjusted treatment totals

Q_j = k W_j - W_j^*,   j = 1, \cdots, t.

The LSE of \tau_j is

\hat{\tau}_j = \frac{Q_j}{t \lambda},   j = 1, \cdots, t.   (6.23)
Notice that \sum_{j=1}^{t} Q_j = 0, thus \sum_{j=1}^{t} \hat{\tau}_j = 0. Let \bar{Y} = \frac{1}{N} \sum_{i=1}^{b} \sum_{l \in B_i} Y_{il}. The adjusted treatment means are

\bar{Y}_j^* = \bar{Y} + \hat{\tau}_j,   j = 1, \cdots, t.

The significance of the treatment effects is tested by the statistic F = \frac{MSTR}{MSE}.
Example 6.1.
Six different adhesives (t = 6) are tested for bond strength in a lamination process, under a curing pressure of 200 [psi]. Lamination can be done in blocks of size k = 4. A combinatoric design (listed in Table 6.6) has \binom{6}{4} = 15 blocks, with

r = \binom{t-1}{k-1} = \binom{5}{3} = 10,   \lambda = \binom{t-2}{k-2} = \binom{4}{2} = 6.
𝑖 𝐵𝑖 𝑖 𝐵𝑖
1 1, 2, 3, 4 9 1, 3, 5, 6
2 1, 2, 3, 5 10 1, 4, 5, 6
3 1, 2, 3, 6 11 2, 3, 4, 5
4 1, 2, 4, 5 12 2, 3, 4, 6
5 1, 2, 4, 6 13 2, 3, 5, 6
6 1, 2, 5, 6 14 2, 4, 5, 6
7 1, 3, 4, 5 15 3, 4, 5, 6
8 1, 3, 4, 6
[Tables of the yields Y_{il} by block, of the statistics T_j, W_j, W_j^*, Q_j by treatment, and the ANOVA table were lost in extraction; only the Total row of the ANOVA survives: DF = 59, SST = 1463.81.]
The adjusted treatment means and their standard errors are:

Treatment   \bar{Y}_j^*   S.E. \bar{Y}_j^*
1           22.96         1.7445
2           20.96         1.7445
3           29.33         1.7445
4           24.86         1.7445
5           31.35         1.7445
6           34.86         1.7445

Here

V[\bar{Y}_j^*] = \frac{k \sigma^2}{t \lambda},   j = 1, \cdots, t.   (6.25)

Thus, the S.E. of \bar{Y}_j^* is

S.E. \bar{Y}_j^* = \left( \frac{k \, MSE}{t \lambda} \right)^{1/2},   j = 1, \cdots, t.   (6.26)
It seems that there are two homogeneous groups of treatments {1, 2, 4} and {3, 5, 6}.
For the remainder of Part C, to gain insight into causal data analysis, we will study factorial designs and linear regression models.
Factorial designs (a specific class of experimental designs) are a very useful solution for industrial manufacturing problems.
Regression analysis captures relationships between random variables, determines the magnitude of those relationships, and is used to make predictions based on the fitted models.
Linear regression describes the linear relationship between a response variable and a set of other variables, called regressors, explanatory or predictor variables, from which we wish to predict the values of the response variable of interest.
When several factors are of interest in an experiment, a factorial experiment should be used. In these
experiments, factors are varied together.
• By factorial design or experiment, we mean that in each complete trial or replicate of the exper-
iment, all possible combinations of the levels of the factors are investigated. Thus, if there are two
factors 𝐴 and 𝐵 with 𝑎 levels of factor 𝐴 and 𝑏 levels of factor 𝐵, each replicate 𝐷 = 𝐴 × 𝐵 contains
all 𝑎𝑏 treatment combinations. A full factorial design with 3 binary factors is visualized in Figure 6.3.
• The effect of a factor is defined as the change in response produced by a change in the level of the
factor. It is called a main effect because it refers to the primary factors in the study.
• A fractional factorial design F is a subset of a (full) factorial D, in which repeated runs are allowed.
In full factorial experiments, the numbers of levels of different factors do not have to be the same. Some factors might be tested at two levels and others at three or four levels. Full factorials, or certain fractional factorials which will be discussed later, are necessary if the statistical model is not additive.
• In order to estimate or test the effects of interactions, one needs to perform factorial experiments,
full or fractional. In a full factorial experiment, all the main effects and interactions can be tested or
estimated.
On the whole there are, together with the grand mean \mu, 2^p parameters.
Suppose that there are two factors, A and B, at a and b levels respectively; there are a b treatment combinations (A_i, B_j), i = 1, 2, \cdots, a, j = 1, \cdots, b.
Suppose also that n independent replicas are made at each one of the treatment combinations.
The yield at the k-th replication of treatment combination (A_i, B_j) is modeled by

Y_{ijk} = \mu + \tau_i^A + \tau_j^B + \tau_{ij}^{AB} + e_{ijk},   (6.27)

for all i = 1, 2, \cdots, a, j = 1, 2, \cdots, b, k = 1, 2, \cdots, n.
In the following section we discuss the structure of the Analysis of Variance (ANOVA) for testing the
significance of main effects and interactions.
The analysis of variance for full factorial designs is done to test the hypotheses that main-effects or
interaction parameters are equal to zero. We present the ANOVA for a two factor situation, 𝐴 and 𝐵
with a statistical model given in Equation (6.27). The method can be generalized to any number of
factors. Let
\bar{Y}_{ij} = \frac{1}{n} \sum_{k=1}^{n} Y_{ijk},   (6.29)

\bar{Y}_{i.} = \frac{1}{b} \sum_{j=1}^{b} \bar{Y}_{ij},   i = 1, \cdots, a,   (6.30)

\bar{Y}_{.j} = \frac{1}{a} \sum_{i=1}^{a} \bar{Y}_{ij},   j = 1, \cdots, b,   (6.31)

and

\bar{Y} = \frac{1}{ab} \sum_{i=1}^{a} \sum_{j=1}^{b} \bar{Y}_{ij}.   (6.32)
1. The ANOVA first partitions the total sum of squares of deviations from \bar{Y},

SST = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (Y_{ijk} - \bar{Y})^2   [total sum of squared errors]   (6.33)

into two components,

SSW = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n} (Y_{ijk} - \bar{Y}_{ij})^2   [sum of squared errors within treatment combinations]   (6.34)

and

SSB = n \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{Y}_{ij} - \bar{Y})^2   [sum of squared errors between treatment combinations].   (6.35)
2. In the second stage, the sum of squares SSB is partitioned into three components SSI, SSMA, SSMB, as SSB = SSI + SSMA + SSMB, where

SSI = n \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{Y}_{ij} - \bar{Y}_{i.} - \bar{Y}_{.j} + \bar{Y})^2,   [errors caused by interaction effects]

SSMA = n b \sum_{i=1}^{a} (\bar{Y}_{i.} - \bar{Y})^2,   [sum of squared errors caused by factor effect A]   (6.37)

SSMB = n a \sum_{j=1}^{b} (\bar{Y}_{.j} - \bar{Y})^2.   [sum of squared errors caused by factor effect B]

The corresponding mean squares are

MSA = \frac{SSMA}{a - 1},   MSB = \frac{SSMB}{b - 1},   (6.38)

and

MSAB = \frac{SSI}{(a - 1)(b - 1)},   MSW = \frac{SSW}{ab(n - 1)}.   (6.39)
Source of variation   DF             SS     MS     F
A                     a - 1          SSMA   MSA    F_A
B                     b - 1          SSMB   MSB    F_B
AB                    (a-1)(b-1)     SSI    MSAB   F_{AB}
Between               ab - 1         SSB    -      -
Within                ab(n - 1)      SSW    MSW    -
Total                 N - 1          SST    -      -
The test statistics are

F_A = \frac{MSA}{MSW},   (6.40)

F_B = \frac{MSB}{MSW},   (6.41)

and

F_{AB} = \frac{MSAB}{MSW}.   (6.42)
F_A, F_B and F_{AB} are test statistics to test, respectively, the significance of the main effects of A, the main effects of B and the interactions AB on the response. A few cases to consider:
1. If F_A < F_{1-\alpha}[a-1, ab(n-1)], the null hypothesis H_0^A: \tau_1^A = \cdots = \tau_a^A = 0 cannot be rejected.
2. If F_B < F_{1-\alpha}[b-1, ab(n-1)], the null hypothesis H_0^B: \tau_1^B = \cdots = \tau_b^B = 0 cannot be rejected.
3. Also, if F_{AB} < F_{1-\alpha}[(a-1)(b-1), ab(n-1)], we cannot reject the null hypothesis H_0^{AB}: \tau_{11}^{AB} = \cdots = \tau_{ab}^{AB} = 0.
As an essential illustration of full factorial designs, we present 2^m factorial designs, the simplest full factorials of m factors, each at two levels. The levels of the factors are labeled as "Low" and "High", or 1 and 2. If the factors are categorical, the labeling of the levels is arbitrary, and the ordering of the values of the main effects and interaction parameters depends on this arbitrary labeling.
We will discuss here experiments in which the levels of the factors are measured on a continuous scale, as in the case of the factors affecting the piston cycle time.
• The levels of the i-th factor (i = 1, \cdots, m) are fixed at x_{i1} and x_{i2}, where x_{i1} < x_{i2}. By a simple transformation all factor levels can be reduced to

c_i = \begin{cases} +1, & \text{if } x = x_{i2} \\ -1, & \text{if } x = x_{i1} \end{cases}   i = 1, \cdots, m.
• In such a factorial experiment there are 2^m treatment combinations (or treatments). Let (i_1, \cdots, i_m) denote a treatment combination, where i_1, \cdots, i_m are indices such that

i_j = \begin{cases} 0, & \text{if } c_j = -1 \\ 1, & \text{if } c_j = +1. \end{cases}
Thus, if there are m = 3 factors, the number of possible treatment combinations is 2^3 = 8. These are given in Table 6.12. The index \nu of the standard order is given by the formula

\nu = \sum_{j=1}^{m} i_j \, 2^{j-1}.   (6.43)

Notice that \nu ranges from 0 to 2^m - 1. This produces tables of the treatment combinations for a 2^m factorial design, arranged in a standard order (see Table 6.13).
Table 6.12 lists the eight treatment combinations of a 2^3 design in standard order:

\nu   i_1   i_2   i_3
0     0     0     0
1     1     0     0
2     0     1     0
3     1     1     0
4     0     0     1
5     1     0     1
6     0     1     1
7     1     1     1
> install.packages("FrF2")
> library(FrF2)
> FrF2(nfactors=5, resolution=5)
A B C D E
1 -1 1 -1 1 1
2 1 1 1 -1 -1
3 1 -1 -1 1 1 ...
class=design, type= FrF2
> head(Design, 2)
  A B C D E
1 2 2 2 1 2
2 1 2 1 1 1
> tail(Design, 2)
   A B C D E
31 1 2 2 2 2
32 2 1 2 1 1
> rm(Design)

Table 6.13 shows the full 2^5 factorial in standard order (levels coded 1 and 2):

\nu   I_1  I_2  I_3  I_4  I_5     \nu   I_1  I_2  I_3  I_4  I_5
0     1    1    1    1    1       16    1    1    1    1    2
1     2    1    1    1    1       17    2    1    1    1    2
2     1    2    1    1    1       18    1    2    1    1    2
3     2    2    1    1    1       19    2    2    1    1    2
4     1    1    2    1    1       20    1    1    2    1    2
5     2    1    2    1    1       21    2    1    2    1    2
6     1    2    2    1    1       22    1    2    2    1    2
7     2    2    2    1    1       23    2    2    2    1    2
8     1    1    1    2    1       24    1    1    1    2    2
9     2    1    1    2    1       25    2    1    1    2    2
10    1    2    1    2    1       26    1    2    1    2    2
11    2    2    1    2    1       27    2    2    1    2    2
12    1    1    2    2    1       28    1    1    2    2    2
13    2    1    2    2    1       29    2    1    2    2    2
14    1    2    2    2    1       30    1    2    2    2    2
15    2    2    2    2    1       31    2    2    2    2    2
Let Y_\nu, \nu = 0, 1, \cdots, 2^m - 1, denote the yield of the \nu-th treatment combination. We now discuss the estimation of the main effects and interaction parameters. Starting with the simple case of 2 factors, the variables are presented schematically in Table 6.14.
According to our previous definition there are four main effects \tau_1^A, \tau_2^A, \tau_1^B, \tau_2^B and four interaction effects \tau_{11}^{AB}, \tau_{12}^{AB}, \tau_{21}^{AB}, \tau_{22}^{AB}. But since \tau_1^A + \tau_2^A = \tau_1^B + \tau_2^B = 0, it is sufficient to represent the main effects of A and B by \tau_2^A and \tau_2^B. Similarly, since

\tau_{11}^{AB} + \tau_{12}^{AB} = 0 = \tau_{11}^{AB} + \tau_{21}^{AB},
\tau_{21}^{AB} + \tau_{22}^{AB} = 0 = \tau_{12}^{AB} + \tau_{22}^{AB},

it is sufficient to represent the interaction effects by \tau_{22}^{AB}.

Table 6.14 (schematic layout of the cell means):

              A = 1          A = 2          Row means
B = 1         \bar{Y}_0      \bar{Y}_1      \bar{Y}_{1.}
B = 2         \bar{Y}_2      \bar{Y}_3      \bar{Y}_{2.}
Column means  \bar{Y}_{.1}   \bar{Y}_{.2}   \bar{Y}

Main effect \tau_2^A is estimated by
\hat{\tau}_2^A = \bar{Y}_{.2} - \bar{Y} = \frac{1}{2}(\bar{Y}_1 + \bar{Y}_3) - \frac{1}{4}(\bar{Y}_0 + \bar{Y}_1 + \bar{Y}_2 + \bar{Y}_3)
              = \frac{1}{4}(-\bar{Y}_0 + \bar{Y}_1 - \bar{Y}_2 + \bar{Y}_3).

The estimator of \tau_2^B is

\hat{\tau}_2^B = \bar{Y}_{2.} - \bar{Y} = \frac{1}{2}(\bar{Y}_2 + \bar{Y}_3) - \frac{1}{4}(\bar{Y}_0 + \bar{Y}_1 + \bar{Y}_2 + \bar{Y}_3)
              = \frac{1}{4}(-\bar{Y}_0 - \bar{Y}_1 + \bar{Y}_2 + \bar{Y}_3).

Finally, the estimator of \tau_{22}^{AB} is

\hat{\tau}_{22}^{AB} = \bar{Y}_3 - \bar{Y}_{2.} - \bar{Y}_{.2} + \bar{Y}
                    = \bar{Y}_3 - \frac{1}{2}(\bar{Y}_2 + \bar{Y}_3) - \frac{1}{2}(\bar{Y}_1 + \bar{Y}_3) + \frac{1}{4}(\bar{Y}_0 + \bar{Y}_1 + \bar{Y}_2 + \bar{Y}_3)
                    = \frac{1}{4}(\bar{Y}_0 - \bar{Y}_1 - \bar{Y}_2 + \bar{Y}_3).
The levels of the two factors at the four treatment combinations, in standard order, form the matrix

D_{2^2} = \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 1 & 2 \\ 2 & 2 \end{bmatrix}.
The corresponding C coefficients are the 2nd and 3rd columns in the matrix

C_{2^2} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix}.
The 4th column of this matrix is the product of the elements in the 2nd and 3rd columns. Notice also that the linear model for the yield vector is

\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} \mu \\ \tau_2^A \\ \tau_2^B \\ \tau_{22}^{AB} \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ e_4 \end{bmatrix},
where 𝑒𝑖 are independent random variables, with E[𝑒𝑖 ] = 0 and V[𝑒𝑖 ] = 𝜎 2 for 𝑖 = 1, 2, 3, 4.
Let

Y^{(4)} = [Y_0, Y_1, Y_2, Y_3]',
\theta^{(4)} = [\mu, \tau_2^A, \tau_2^B, \tau_{22}^{AB}]', and
e^{(4)} = [e_1, e_2, e_3, e_4]',

so that Y^{(4)} = C_{2^2} \theta^{(4)} + e^{(4)}. This is the usual linear model for multiple regression. The least squares estimator of \theta^{(4)} is

\hat{\theta}^{(4)} = (C_{2^2}' C_{2^2})^{-1} C_{2^2}' Y^{(4)}.
The matrix C_{2^2} has orthogonal column (row) vectors and C_{2^2}' C_{2^2} = 4 I_4, where I_4 is the identity matrix of rank 4. Therefore

\hat{\theta}^{(4)} = \frac{1}{4} C_{2^2}' Y^{(4)} = \frac{1}{4} \begin{bmatrix} 1 & 1 & 1 & 1 \\ -1 & 1 & -1 & 1 \\ -1 & -1 & 1 & 1 \\ 1 & -1 & -1 & 1 \end{bmatrix} \begin{bmatrix} \bar{Y}_0 \\ \bar{Y}_1 \\ \bar{Y}_2 \\ \bar{Y}_3 \end{bmatrix}.
This is identical with the solution obtained earlier. The estimators of the main effects and interactions
are the least squares estimators, as has been mentioned before.
Definition 6.1. Formally, we fix d finite sets Q_1, Q_2, \ldots, Q_d called factors, where 1 < d \in N. The elements of a factor are called its levels. The (full) factorial design (also factorial experiment design, FED) with respect to these factors is the Cartesian product D = Q_1 \times Q_2 \times \ldots \times Q_d.
A fractional design, or fraction, F of D is a subset consisting of elements of D (possibly with multiplicities). Let r_i := |Q_i| denote the number of levels of the i-th factor. We say that F is symmetric if r_1 = r_2 = \cdots = r_d; otherwise F is mixed.
We wrap up the developed theory by illustrating the useful role of factorial designs in the following case
study (modified from [?, Example 11.7]).
Seven factors that may affect the piston cycle time in an industrial design problem were listed earlier.
We are interested in testing the effects of the piston weight (A) and the spring coefficient (D) on the cycle time (seconds). For this purpose we designed a factorial experiment at three levels of A and three levels of D. The levels of A are A_1 = 30 [kg], A_2 = 45 [kg], and A_3 = 60 [kg]; the levels of D are D_1 = 1500, D_2 = 3000, and D_3 = 4500 [N/m] (see the R code below).
Five replicas were performed at each treatment combination (n = 5). The five factors which were not under study were kept fixed.
> install.packages("DoE.base")
> library(DoE.base)
> Factors <- list( m=c(30, 45, 60), k=c(1500, 3000, 4500))
> FacDesign <- fac.design(
factor.names=Factors,
randomize=TRUE, replications=5, repeat.only=TRUE)
creating full factorial with 9 runs ...
[ANOVA table for the 3 x 3 experiment; the body rows were lost in extraction, and only the Total row survives: DF = 44, SST = 3.28619.]
Figure 6.4: Effect of spring coefficient on cycle time. (The 𝑦 axis corresponds to cycle time in minutes).
• The P-values are computed with the appropriate Fisher 𝐹 distribution. We see in the ANOVA table
that only the main effects of the spring coefficient (D) are significant. Since the effects of the piston
weight (A) and that of the interaction are not significant, we can estimate 𝜎 2 by a pooled estimator
\hat{\sigma}^2 = \frac{SSW + SSI + SSMA}{36 + 4 + 2} = \frac{2.2711}{42} = 0.0541.
• To estimate the main effects of D we pool all data from samples having the same level of D together.
We obtain pooled samples of size 𝑛𝑝 = 15. The means of the cycle time for these samples are
[Table of pooled cycle-time means for D_1, D_2, D_3 and the grand mean; the values were lost in extraction.]
• Since we estimate on the basis of the pooled samples, and the main effects \hat{\tau}_j^D (j = 1, 2, 3) are contrasts of 3 means, the coefficient S_\alpha for the simultaneous confidence intervals is

S_\alpha = (2 F_{.95}[2, 42])^{1/2} = \sqrt{2 \times 3.22} = 2.538.
• We see that the confidence interval for \tau_2^D covers zero. Thus \tau_2^D is not significant. The significant main effects are \tau_1^D and \tau_3^D.
• So far we have studied the effects of two factors on the cycle time of a piston, keeping all the other five factors fixed. In the present analysis we perform a 2^5 experiment with the piston, varying factors A, B, C, D and F at two levels, keeping the atmospheric pressure (factor E) fixed at 90000 [N/m^2] and the filling gas temperature (factor G) at 340 [\circ K].
• The two levels of each factor are those specified, in the above part, as the limits of the experimental
range.
Fractional factorial designs, especially balanced factorial designs (also called orthogonal arrays), are employed in various industries and engineering disciplines; for instance, see [13] for testing software, [49] for military science, and [50] for cDNA microarray experiments. We first consider a small balanced factorial design.
A start-up company wants to launch a new type of product (mobile phones, yogurt, cars, a new way of banking management ...). The product has five potential and different features, yet to be decided, namely:
1. Color Co,
2. Shape Sh,
3. Weight Wei,
5. Price Pri.
Each of these features can take on only two possible values, hence a full factorial requires 2^5 = 32 experiments. But 5 factors are not enough to study the cellphone's quality! What about adding 6 extra factors?
Run   Levels of the 11 two-level factors
1     0 0 0 0 0 0 0 0 0 0 0
2     1 1 1 0 1 1 0 1 0 0 0
3     0 1 1 1 0 1 1 0 1 0 0
4     0 0 1 1 1 0 1 1 0 1 0
5     0 0 0 1 1 1 0 1 1 0 1
6     1 0 0 0 1 1 1 0 1 1 0
7     0 1 0 0 0 1 1 1 0 1 1
8     1 0 1 0 0 0 1 1 1 0 1
9     1 1 0 1 0 0 0 1 1 1 0
10    0 1 1 0 1 0 0 0 1 1 1
11    1 0 1 1 0 1 0 0 0 1 1
12    1 1 0 1 1 0 1 0 0 0 1
• The design resolution R of a design is the minimum length of its defining generators. R is used to catalog binary fractional factorial designs according to the alias patterns they produce. Regular designs are designs that can be fully defined by generator words.
Specifically,
1. Resolution III designs: ones in which no main effect is aliased with any other main effect, but main effects are aliased with 2-factor interactions, and some 2-factor interactions may be aliased with each other.
2. Resolution IV designs: designs in which no main effect is aliased with any other main effect or 2-factor interaction, but 2-factor interactions are aliased with each other.
3. Resolution V designs: in which no main effect or 2-factor interaction is aliased with any other main effect or 2-factor interaction, but 2-factor interactions are aliased with 3-factor interactions.
We can write F = OA(N; s_1^{a_1} \cdot s_2^{a_2} \cdots s_m^{a_m}; t), meaning that F has N runs, a_i factors at s_i levels (i = 1, \ldots, m), and strength t.
However, orthogonal arrays include both regular designs and irregular designs!
A case study for more than two factors with distinct levels
To illustrate the useful role of fractional factorial designs (FFD), we present here a use of FFD with more than two factors in software manufacturing.
Suppose that you are a quality engineer in a software firm.
• Your responsibility is to use statistical techniques for lowering the cost of design and production while
maintaining customer satisfaction.
• A competitor improves its product while simultaneously reducing the price. Your job is to identify components in your company's software production process which can be changed to reduce the production time and lower the price, while making the product more robust [52, 6].
• You are required to carry out a series of experiments, in which a range of parameters, called factors,
can be varied. The outcome of these experiments will be used to decide which strategy should be
followed in the future. To be precise, you will perform experiments and measure some quantitative
outcomes, called responses, when values of the factors are varied. Each experiment is also called
an experimental run, or just a run. In each run, the factors are set to specific values from a certain
finite set of settings or levels, and the responses are recorded.
The board wants to study as many parameters as possible within a limited budget. They have identified
8 factors, coded as 𝐴, 𝐵, 𝐶, 𝐷, 𝐸, 𝐹, 𝐺, 𝐻, that could affect the outcome. The factors and their levels
are described in Table 6.17, where # stands for the number of levels of each factor.
• An initial investigation indicates that employees should have at least one year of experience, and that there is a great difference between an employee with three years' experience and one with five years' experience.
• We choose 5 levels for years of experience, which we call factor 𝐴. Factor 𝐵 is the programming
language that our software is written in. Of the many languages used in the market nowadays, we
choose 4 which are appropriate for large applications.
• Although there are many different applications of software (factor 𝐶), we can classify them into two
major categories: scientific applications and business applications (such as finance, accountancy,
and tax).
For the former, the software developers require a fair knowledge of exact sciences like mathematics
or physics, but relatively little knowledge of the particular customers. On the other hand, for the latter,
the clients have specific requirements, which we need to know before designing, implementing and
testing the software.
• We use two popular operating systems, Windows and Linux, for factor 𝐷. Whether we interview the
customers is factor 𝐸 – as mentioned, we expect this to interact with factor 𝐶. The factors 𝐹, 𝐺, 𝐻
are self-explanatory, and each clearly has two levels.
Let N be the number of experimental runs in a possible fractional experiment; each run will be assigned to a particular combination of factor levels. In the worst-case scenario, the total number of possible level combinations of the factors A, B, C, D, E, F, G and H is max N = 5 \cdot 4 \cdot 2^6 = 1280.
Selecting the right combinations of levels in these factors is obviously crucial, from the cost-benefit
view. The experiments are indeed costly and the board has decided that the budget allows for only 100
experiments. Is there any design with at most 100 experiments available for our task?
1. We restrict ourselves to studying only one response, 𝑌 , the number of failures (errors or crashes)
occurring in a week. To minimize the average number of failures in new products, we study the
combined influence of the factors using linear regression models.
Table 6.17: Eight factors, the number of levels and the level meanings (# is the number of levels; only the first row survived extraction)

Factor   Description            #   Level 0   1   2   3   4
A        years of experience    5   1         3   5   7   9
2. In these models, we make a distinction between main effects, two-factor interactions, and higher-
order interactions.
3. The main effect of a factor models the average change in the response when the setting of that
factor is changed. A model containing just the main effects takes the form
Y = \theta_0 + \sum_{i=1}^{4} \theta_{A_i} a_i + \sum_{j=1}^{3} \theta_{B_j} b_j + \theta_C c + \ldots + \theta_H h + \epsilon,   (6.44)
4. In particular, 𝜃0 , the number of failures when all factors are set to the values 0, is called the intercept
of the model. These coefficients are estimated by taking linear combinations of the responses.
5. Two-factor interactions, or two-interactions, model changes in the main effects of a factor due to a
change in the setting of another factor.
To study the activity of all two-interactions simultaneously, we may want to augment Model (6.44) by
adding
\sum_{i=1}^{4} \sum_{j=1}^{3} \theta_{A_i B_j} a_i b_j + \sum_{i=1}^{4} \theta_{A_i C} a_i c + \ldots + \sum_{i=1}^{4} \theta_{A_i H} a_i h
+ \sum_{j=1}^{3} \theta_{B_j C} b_j c + \ldots + \sum_{j=1}^{3} \theta_{B_j H} b_j h + \theta_{CD} c d + \ldots + \theta_{GH} g h.   (6.45)
6. We can also define higher-order interactions but these are usually considered unimportant. The total number of intercept, main-effect and two-interaction parameters is

1 + \sum_{i=1}^{8} (s_i - 1) + \sum_{1 \le i < j \le 8} (s_i - 1)(s_j - 1).
7. With s = (5, 4, 2, 2, 2, 2, 2, 2), this formula gives 1 + 13 + 69 = 83 parameters up to two-factor interactions to model the combined influences of the factors (the sketch below checks this arithmetic in R). In fact, only some of the two-factor interactions turn out to be important, so we need even fewer than 83 parameters. This is in contrast with a full model including all interactions up to order 8, which needs 1280 parameters.
Our trade-off solution will be to use a design with a run size around 83 runs.
The full factorial design of the eight factors described above is the Cartesian product \{0, 1, \ldots, 4\} \times \{0, 1, \ldots, 3\} \times \{0, 1\}^6.
• Using this design, we are able to estimate all interactions, but performing all 1280 runs exceeds the
firm’s budget. Instead we use a fractional factorial design, that is, a subset of elements in the full
factorial design.
• We want to choose a fractional design that still allows us to estimate the main effects and some of the two-interactions. If we want to measure simultaneously all effects up to 2-interactions of the above 8 factors, an 83-run fractional design would be needed.
• Constructing an 83-run design is possible, and one could be found with trial-and-error algorithms. But it would lack some attractive features such as balance, which is discussed next.
DISCUSSION.
1. An algebraic approach can also be used to construct such a design, but it is infeasible for large run
size designs; for more details see Nguyen [36]. A workable solution is the 80 run experimental design
presented in Table 6.18. This allows us to estimate the main effect of each factor and some of their
pairwise interactions. The construction of this design is presented in [34]. Note that the responses 𝑌
have been computed by simulation, not by conducting actual experiments.
2. A notable property of the array in Table 6.18 is that it has strength 3. That is, if we choose any 3
columns in the table and go down we find that every triple of symbols in those columns appears the
same number of times.
This property is also called 3-balance or 3-orthogonality; and the array (fractional design) itself is
called a strength 3 orthogonal array or a 3-balanced fractional design. By [51, Theorem 11.3], a
strength 3 design allows us to measure all the main effects and some of the two-interactions.
run 𝐴 𝐵 𝐶 𝐷 𝐸 𝐹 𝐺 𝐻 𝑌 run 𝐴 𝐵 𝐶 𝐷 𝐸 𝐹 𝐺 𝐻 𝑌
5 4 2 2 2 2 2 2 5 4 2 2 2 2 2 2
1 0 0 0 0 0 0 0 0 25 41 2 2 0 0 1 1 0 0 12
2 0 0 1 0 1 0 1 0 15 42 2 2 1 0 0 0 1 1 6
3 0 0 1 1 0 1 0 1 15 43 2 2 1 1 1 1 1 0 12
4 0 0 0 1 1 1 1 1 15 44 2 2 0 1 0 0 0 1 15
5 0 1 0 0 0 0 0 1 5 45 2 3 0 0 1 0 0 0 12
6 0 1 1 0 1 1 1 0 15 46 2 3 1 0 0 0 1 0 15
7 0 1 1 1 0 1 0 0 10 47 2 3 1 1 1 1 1 1 0
8 0 1 0 1 1 0 1 1 15 48 2 3 0 1 0 1 0 1 21
9 0 2 0 0 0 1 1 1 25 49 3 0 0 0 0 1 1 0 4
10 0 2 1 0 1 1 0 1 30 50 3 0 1 0 1 1 1 1 4
11 0 2 1 1 0 0 1 0 20 51 3 0 1 1 1 0 0 0 4
12 0 2 0 1 1 0 0 0 10 52 3 0 0 1 0 0 0 1 8
13 0 3 0 0 0 1 1 0 15 53 3 1 0 0 1 1 0 0 8
14 0 3 1 0 1 0 0 1 30 54 3 1 1 0 0 0 1 0 8
15 0 3 1 1 0 0 1 1 30 55 3 1 1 1 0 0 1 1 0
16 0 3 0 1 1 1 0 0 10 56 3 1 0 1 1 1 0 1 2
17 1 0 0 0 0 0 0 1 20 57 3 2 0 0 0 0 0 0 4
18 1 0 1 0 1 1 0 0 4 58 3 2 1 0 1 0 0 1 6
19 1 0 1 1 0 1 1 1 4 59 3 2 1 1 1 1 1 0 14
20 1 0 0 1 1 0 1 0 8 60 3 2 0 1 0 1 1 1 6
21 1 1 0 0 1 0 1 0 0 61 3 3 0 0 1 0 1 1 14
22 1 1 1 0 0 1 1 1 16 62 3 3 1 0 0 1 0 1 8
23 1 1 1 1 1 0 0 1 4 63 3 3 1 1 0 1 0 0 4
24 1 1 0 1 0 1 0 0 20 64 3 3 0 1 1 0 1 0 0
25 1 2 0 0 0 1 1 0 24 65 4 0 0 0 1 1 0 1 2
.. .. .. ..
. . . .
39 2 1 1 1 1 0 0 0 15 79 4 3 1 1 1 0 0 1 7
40 2 1 0 1 0 0 1 0 6 80 4 3 0 1 0 0 0 0 8
3. We could, in fact, investigate all main effects and all two-interactions of the abovementioned eight factors by using a 160-run strength 3 orthogonal array; see [51, Section 11.4] for a detailed explanation.
But the board would have to increase the current budget by at least 60 percent if we used a 160-run orthogonal array.
Problem 6.1.
Describe a production process familiar to you, like baking cakes or manufacturing concrete. List the pertinent variables. What is (are) the response variable(s)? Classify the variables which affect the response into noise variables and control variables.
How many levels would you consider for each variable?
Problem 6.2.
Different types of adhesive are used in a lamination process, in manufacturing a computer card.
The card is tested for bond strength. In addition to the type of adhesive, a factor which might influence
the bond strength is the curing pressure (currently at 200psi).
Follow the basic steps of experimental design to set a possible experiment for testing the effects of
adhesives and curing pressure on the bond strength.
Problem 6.3.
Three factors A, B, C are tested in a given experiment, designed to assess their effects on the
response variable. Each factor is tested at 3 levels. List all the main effects and interactions.
Experimental Designs II
Analysis with Random Effects Model
[Source [56]]
Introduction
An experimenter is frequently interested in a factor that has a large number of possible levels. If the
experimenter randomly selects a of these levels from the population of factor levels, then we say that
the factor is random. Because the levels of the factor actually used in the experiment were chosen
randomly, inferences are made about the entire population of factor levels.
We assume that the population of factor levels is either of infinite size or is large enough to be
considered infinite. Situations in which the population of factor levels is small enough to employ a finite
population approach are not encountered frequently.
Consider the single-factor model

Y_{ij} = \mu + \tau_i + \varepsilon_{ij},   i = 1, \ldots, a,   j = 1, \ldots, n,   (7.10)

where both the treatment effects \tau_i := \tau_i^A and the errors \varepsilon_{ij} are random variables. We will assume
that the treatment effects \tau_i are iid N(0, \sigma_\tau^2) random variables,
that the errors \varepsilon_{ij} are iid N(0, \sigma^2) random variables, and
that the \tau_i and \varepsilon_{ij} are independent.
Definition 7.1.
The variances 𝜎𝜏2 and 𝜎 2 are called variance components, and the model (Equation 7.10) is
called the components of variance or random effects model.
The observations in the random effects model are normally distributed because they are linear com-
binations of the two normally and independently distributed random variables 𝜏𝑖 and 𝜀𝑖𝑗 . But in Model
7.10 the observations Y_{ij} are only independent if they come from different factor levels. Specifically, we can show that the covariance of any two observations is

Cov(Y_{ij}, Y_{ij'}) = \sigma_\tau^2 for j \ne j', and Cov(Y_{ij}, Y_{i'j'}) = 0 for i \ne i'.   (7.3)

Hence, the observations within a specific factor level all have the same covariance, because before the experiment is conducted, we expect the observations at that factor level to be similar since they all share the same random component.
Once the experiment has been conducted, the observations can be treated as independent, because the parameter \tau_i has been determined and the observations within that treatment differ only because of random error.
The covariance structure of the observations, as in (7.3) [for the single-factor random effects model], can be written through the covariance matrix of the observations, of size N \times N, N = a n. To illustrate, suppose that we have a = 3 treatments and n = 2 replicates. There are N = 6 observations, which we can write as a vector

y = [y_{11}, y_{12}, y_{21}, y_{22}, y_{31}, y_{32}]^T,

with covariance matrix

\Sigma = Cov(y) = \begin{bmatrix}
\sigma_\tau^2 + \sigma^2 & \sigma_\tau^2 & 0 & 0 & 0 & 0 \\
\sigma_\tau^2 & \sigma_\tau^2 + \sigma^2 & 0 & 0 & 0 & 0 \\
0 & 0 & \sigma_\tau^2 + \sigma^2 & \sigma_\tau^2 & 0 & 0 \\
0 & 0 & \sigma_\tau^2 & \sigma_\tau^2 + \sigma^2 & 0 & 0 \\
0 & 0 & 0 & 0 & \sigma_\tau^2 + \sigma^2 & \sigma_\tau^2 \\
0 & 0 & 0 & 0 & \sigma_\tau^2 & \sigma_\tau^2 + \sigma^2
\end{bmatrix}.   (7.4)
The main diagonals of this matrix are the variances of each individual observation and every off-
diagonal element is the covariance of a pair of observations.
How can we estimate the variance components \sigma_\tau^2 and \sigma^2 in the model (7.10)?
Two methods are used: I) the method of moments and II) maximum likelihood.
I) The method of moments relies on equating the expected mean squares to their observed values in the ANOVA table and solving for the variance components:

E(MS_{Treatments}) = \sigma^2 + n \sigma_\tau^2,   E(MSE) = \sigma^2,   (7.5)

so that

\hat{\sigma}^2 = MSE,   \hat{\sigma}_\tau^2 = \frac{MS_{Treatments} - MSE}{n},

where SS_{Treatments} = n \sum_{i=1}^{a} (\bar{y}_i - \bar{y})^2 and SSE = SST - SS_{Treatments}.
Summary: The method of moments does not require the normality assumption. The estimators \hat{\sigma}^2, \hat{\sigma}_\tau^2 are best quadratic unbiased (i.e., of all unbiased quadratic functions of the observations, these estimators have minimum variance).
The likelihood function, with parameters \theta = (\mu, \sigma_\tau^2, \sigma^2), has the form

L(y; \theta) = f(y_{11}, \ldots, y_{an}; \mu, \sigma_\tau^2, \sigma^2) = \frac{1}{C} \exp\left\{ -\frac{1}{2} (y - j_N \mu)' \Sigma^{-1} (y - j_N \mu) \right\},   (7.8)

where j_N is the N-vector of ones and

C = (2\pi)^{N/2} |\Sigma|^{1/2}.
The maximum likelihood estimates of the parameters 𝜃 = (𝜇, 𝜎𝜏2 , 𝜎 2 ) are the values of these quantities
(with two variance components) that maximize the likelihood function (7.8).
• The standard variant of maximum likelihood estimation that is used for estimating variance com-
ponents is known as the residual maximum likelihood (REML) method. The basic characteristic of
REML is that it takes the location parameters in the model into account when estimating the random
effects.
(Compare the familiar sample variance

S^2 = \frac{\sum_i (X_i - \bar{X})^2}{n - 1},   (7.9)

which divides by n - 1 rather than n precisely because the location parameter \mu is estimated by \bar{X}.)
In this section, we focus on methods for the design and analysis of factorial experiments with two
random factors. This paves the way for studying, in next chapters, nested and split-plot designs, two
situations where random factors are frequently encountered in practice.
Suppose that we have two factors, A and B, and that both factors have a large number of levels that are of interest. We choose at random a levels of factor A and b levels of factor B and arrange these factor levels in a factorial experimental design. If the experiment is replicated n times, the linear random effects model for factors A and B is

Y_{ijk} = \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk},   (7.11)

where the model parameters (treatment effects) \tau_i, \beta_j, (\tau\beta)_{ij} and the errors \varepsilon_{ijk} are all independent random variables.
We will assume that \tau_i, \beta_j, (\tau\beta)_{ij} and \varepsilon_{ijk} are normally distributed with mean 0 and variances \sigma_\tau^2, \sigma_\beta^2, \sigma_{\tau\beta}^2 and \sigma^2, respectively.
• Since the model parameters are mutually independent, the variance of any observation is V(Y_{ijk}) = \sigma_\tau^2 + \sigma_\beta^2 + \sigma_{\tau\beta}^2 + \sigma^2.
A common industrial application is to use a designed experiment to study the components of vari-
ability in a measurement system. These studies are often called gauge capability studies or gauge
repeatability and reproducibility (R & R) studies because these are the components of variability
that are of interest.
Next we introduce an important type of experimental designs, the nested designs. These designs
find reasonably widespread application in the industrial use of designed experiments.
In multifactor experiments, the levels of one factor (e.g., factor 𝐵) are similar but not identical for different
levels of another factor (e.g., 𝐴). Such an arrangement is called a nested, or hierarchical, design,
with the levels of factor 𝐵 nested under the levels of factor 𝐴. For example, consider a company that
purchases its raw material from three different suppliers. The company wishes to determine whether
the purity of the raw material is the same from each supplier. There are four batches of raw material
available from each supplier, and three determinations of purity are to be taken from each batch.
This is a two-stage nested design, with batches nested under suppliers.
At first glance, you may ask why this is not a factorial experiment. If this were a factorial, then batch
1 would always refer to the same batch, batch 2 would always refer to the same batch, and so on. This
is clearly not the case because the batches from each supplier are unique for that particular supplier.
That is, batch 1 from supplier 1 has no connection with batch 1 from any other supplier, batch 2 from
supplier 1 has no connection with batch 2 from any other supplier, and so forth.
Y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \varepsilon_{(ij)k},   i = 1, 2, \cdots, a,   j = 1, 2, \cdots, b,   k = 1, 2, \cdots, n.   (7.12)
That is, there are a levels of factor A and b levels of factor B, nested under each level of A, and n replicates. We also note that
• the subscript 𝑗(𝑖) indicates that the 𝑗th level of factor 𝐵 is nested under the 𝑖th level of factor 𝐴.
• It is convenient to think of the replicates as being nested within the combination of levels of 𝐴 and 𝐵;
thus, the subscript (𝑖𝑗) 𝑘 is used for the error term.
• This is a balanced nested design because there are an equal number of levels of 𝐵 within each
level of 𝐴 and an equal number of replicates.
There are 𝑎𝑏𝑛 − 1 degrees of freedom for 𝑆𝑆𝑇 , 𝑎 − 1 degrees of freedom for 𝑆𝑆𝐴 , 𝑎(𝑏 − 1) degrees of
freedom for 𝑆𝑆𝐵(𝐴) , and 𝑎𝑏(𝑛 − 1) degrees of freedom for error.
Source of Variation   DF                SS          MS
Factor A              a - 1             SS_A        MS_A = SS_A / (a - 1)
B within A            a(b - 1)          SS_{B(A)}   MS_{B(A)} = SS_{B(A)} / (a(b - 1))
Error                 ab(n - 1)         SS_E        MS_E = SS_E / (ab(n - 1))
Total                 N - 1 = abn - 1   SS_T
Knowledge box 3.
1. The ANOVA idea means that if the errors are NID(0, 𝜎 2 ), we may divide each sum of squares on
the right of Equation 7.13 by its degrees of freedom to obtain independently distributed mean
squares such that the ratio of any two mean squares is distributed as the Fisher distribution 𝐹 .
The appropriate statistics for testing the effects of factors 𝐴 and 𝐵 depend on whether 𝐴 and 𝐵
are fixed or random.
• If factors A and B are fixed, we assume that \sum_{i=1}^{a} \tau_i = 0 and \sum_{j=1}^{b} \beta_{j(i)} = 0 for each i = 1, 2, \cdots, a. That is, the A treatment effects sum to zero, and the B treatment effects sum to zero within each level of A.
• Alternatively, if 𝐴 and 𝐵 are random, we assume that 𝜏𝑖 is NID(0, 𝜎𝜏2 ) and 𝛽𝑗(𝑖) is NID(0, 𝜎𝛽2 ).
Consider a company that buys raw material in batches from three different suppliers. The purity of
this raw material varies considerably, which causes problems in manufacturing the finished product. We
wish to determine whether the variability in purity is attributable to differences between the suppliers.
Four batches of raw material are selected at random from each supplier, and three determinations of
purity are made on each batch. This is, of course, a two-stage nested design. The data, after coding
by subtracting 93, are shown in Table 7.3.
Table 7.3: Coded Purity Data for the Example (Code: y_{ijk} = Purity - 93); two of the three determinations per batch survived extraction:

           Supplier 1        Supplier 2        Supplier 3
Batches    1   2   3   4     1   2   3   4     1   2   3   4
           1  -2  -2   1     1   0  -1   0     2  -2   1   3
          -1  -3   0   4    -2   4   0   3     4   0  -1   2
♣ OBSERVATION 1.
1. The practical implications of this experiment and the analysis are very important.
The objective of the experimenter is to find the source of the variability in raw material purity. If
it results from differences among suppliers, we may be able to solve the problem by selecting
the ‘best’ supplier. However, that solution is not applicable here because the major source of
variability is the batch-to-batch purity variation within suppliers.
Therefore, we must attack the problem by working with the suppliers to reduce their batch-to-
batch variability. This may involve modifications to the suppliers’ production processes or their
internal quality assurance system.
2. This analysis indicates that batches differ significantly and that there is a significant interaction
between batches and suppliers. However, it is difficult to give a practical interpretation of the
batches × suppliers interaction.
Multivariate Probability Distributions
Simultaneously Study Random Variables
[Source [9]]
Motivation
In previous statistical courses, we have discussed probability models and computation of probability for
events involving only one random variable. These are called univariate models.
• In this chapter, we discuss probability models that involve more than one random variable, naturally enough called multivariate models.
• Methods and probabilistic models in engineering and science often involve several random variables. For example,
- in the context of a computer network, the workload of several portals may be of interest.
All these random variables are associated with the same experiment, sample space, and probability law, and their values may be interrelated.
Recall that a random variable X is a map from a sample space \Omega to the reals R; that is, for w \in \Omega, X(w) \in R. Its range is

S_X = Range(X) = \{X(w) : w \in \Omega\}.

S_X can be the set of all real numbers R, or the integers Z, etc., depending on what values the random variable can take.
Hence, a random variable was defined to be a function from a sample space Ω into the real numbers.
A random vector, consisting of several random variables, is defined similarly.
Definition 8.1.
Random Vector when 𝑛 = 2: Given a random experiment with a sample space C, consider two ran-
dom variables 𝑋1 and 𝑋2 , which assign to each element 𝑐 of C one and only one ordered pair of
numbers 𝑋1 (𝑐) = 𝑥1 , 𝑋2 (𝑐) = 𝑥2 .
• The space of (𝑋1 , 𝑋2 ) is the set of ordered pair D = {(𝑥1 , 𝑥2 ) : 𝑥1 = 𝑋1 (𝑐), 𝑥2 = 𝑋2 (𝑐); 𝑐 ∈ C}. We
also denote Range(𝑋1 , 𝑋2 ) := D.
CONVENTION:
1. Use notation Ω to indicate the sample space of a single random variable, and notation C for the
sample space of multiple random variables or random vectors;
2. use notation Range(.) or 𝑆. to indicate the range (value space) of a single random variable, and use
notation D for the value space of multiple random variables or random vectors.
[Figure: the random vector X = (X, Y) maps the population C of Bangkok into the range set D = S_X \times S_Y = \{(x, y) : x \in S_X, y \in S_Y\}, where Y is a welfare classification. Knowing the city's economic and social structure helps the government plan suitable policies.]
For instance, let C be the population in Bangkok, we want to measure two key indexes of people
there: (1) Hobby and (2) Living standard. From previous examples,
• let 𝑋 denote the ‘favored singer’ variable with the value space Range(𝑋) = {𝐴𝑅𝐼𝑁, 𝑏, 𝑐} = 𝑆𝑋 [in
Entertainment Industry], and
• let 𝑌 denote the ‘social class’ (welfare classification) random variable with
Range(𝑌 ) = {𝑃 𝑜𝑜𝑟, 𝑅𝑖𝑐ℎ, Extremely rich} = {𝑝, 𝑟, er} = 𝑆𝑌 [in Economics].
Our interest is represented by the pair of variables (𝑋, 𝑌 ) =: X, we call it a random vector.
Example 8.1.
A coin is tossed three times and our interest is in the ordered number pair (number of H’s on first
two tosses, number of H’s on all three tosses), where H and T represent, respectively, heads and
tails. Let 𝑋1 denote the number of H’s on the first two tosses and 𝑋2 denote the number of H’s on
all three flips. Then our interest can be represented by the pair of random variables (𝑋1 , 𝑋2 ).
For example, (𝑋1 (𝐻𝑇 𝐻), 𝑋2 (𝐻𝑇 𝐻)) represents the outcome (1, 2).
The sample space C = {𝑇 𝑇 𝑇, 𝑇 𝑇 𝐻, 𝑇 𝐻𝑇, 𝐻𝑇 𝑇, 𝑇 𝐻𝐻, 𝐻𝑇 𝐻, 𝐻𝐻𝑇, 𝐻𝐻𝐻}. Continuing in this
way, 𝑋1 and 𝑋2 are real-valued functions defined on the sample space C, which take us from C to the
space of ordered number pairs
D = {(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3)}.
Thus 𝑋1 and 𝑋2 are two random variables defined on the space C, and, in this example, the space
(range) of these random variables is the two-dimensional set 𝐷, which is a subset of two-dimensional
Euclidean space 𝑅2 . Hence (𝑋1 , 𝑋2 ) is a vector function from C to D.
We often denote 2-dim random vectors using vector notation X = (𝑋1 , 𝑋2 )′ , where the ′ denotes the
transpose of the row vector (𝑋1 , 𝑋2 ).
The joint probability mass function (joint pmf) of a discrete random vector (X_1, X_2) is

p(x_1, x_2) = P[X_1 = x_1, X_2 = x_2]

for all (x_1, x_2) \in D. If the range of X_1 is S_X and that of X_2 is S_Y, then the value space of (X_1, X_2) is D = S_X \times S_Y. The joint pmf satisfies

(i) p(x_1, x_2) \ge 0,   and   (ii) \sum_{(x_1, x_2) \in D} p(x_1, x_2) = 1.   (8.2)
To stress the fact that 𝑝() is the joint pmf of the vector (𝑋1 , 𝑋2 ) rather than some other vector,
the notation 𝑝𝑋1 ,𝑋2 (𝑥1 , 𝑥2 ) will be used.
Let D be the value space associated with the random vector (𝑋1 , 𝑋2 ). As in the case of one random
variable, we speak of the event 𝐴 ⊂ D.
a/ The probability of the event 𝐴: The jpmf uniquely defines the probability of 𝐴 defined in terms of
(𝑋1 , 𝑋2 ), by
∑︁
P[(𝑋1 , 𝑋2 ) ∈ 𝐴] = 𝑝(𝑥1 , 𝑥2 ) (8.3)
(𝑥1 ,𝑥2 )∈𝐴
c/ The marginal pmfs of X and Y can be obtained from the joint pmf p(x_1, x_2) by summing the joint p.m.f. over all values of the other variable. E.g., the marginal p.m.f.'s of X and Y are

p_X(x) = \sum_{y \in S_Y} p(x, y),   and   p_Y(y) = \sum_{x \in S_X} p(x, y).
Example 8.2.
Consider the experiment of tossing two fair dice (blue and red say). The sample space for this
experiment has 36 equally likely points. With each of these 36 points associate two numbers,
𝑋 := 𝑋1 and 𝑌 := 𝑋2 . Let
𝑋 = sum of the two dice and 𝑌 = | Difference of the two dice |.
C = {(1, 1), . . . , (1, 6), (2, 1), (2, 2), . . . , (2, 6), . . . , (6, 5), (6, 6)}.
The range of 𝑋 is 𝑆𝑋 = {2, 3, . . . , 12} and of 𝑌 is 𝑆𝑌 = {0, 1, 2, . . . , 5}. Therefore, the value space of
(𝑋1 , 𝑋2 ) is
D = 𝑆𝑋 × 𝑆𝑌 = {(2, 0), . . . , (2, 5), · · · , (12, 0), . . . , (12, 5)}.
1. What is P[X = 5 and Y = 3]? The only two sample points that yield X = 5 and Y = 3 are (4, 1) and (1, 4). So the joint pmf is

p_{X,Y}(5, 3) = P[X = 5 and Y = 3] = P[A] = P[\{(4, 1), (1, 4)\}] = 2/36.
2. The joint distribution (or joint cumulative distribution function- joint cdf)
X = 2: Only one point, (1, 1), yields X = 2 and Y = 0, and no point yields X = 2 and Y = j for j \ge 1, so

p_{X,Y}(2, 0) = P[X = 2 and Y = 0] = P[\{(1, 1)\}] = 1/36.

X = 3: Similarly, you can see that no point yields X = 3 and Y = 0, and you can find the sample points that yield X = 3 and Y = j for j > 0. Summing all the values of p_{X,Y}(i, j) so found gives the result,
D.I.Y.
Now consider the case of 𝑛 ≥ 2 continuous random variables. Let 𝑋1 , 𝑋2 , ..., 𝑋𝑛 be random variables
which are jointly observed at the same experiments.
a/ Joint distribution:
As with Condition (8.2), a function f(x_1, x_2, \ldots, x_n) \ge 0 is called the joint p.d.f. of X_1, \ldots, X_n if

(i) f(x_1, \ldots, x_n) \ge 0 for all (x_1, \ldots, x_n),   and   (ii) \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f(x_1, \ldots, x_n) \, dx_1 \cdots dx_n = 1.   (8.7)
Note that if the joint probability density function is a constant c over a bounded region R (and zero elsewhere), then Condition (ii) in Equation 8.7 becomes, in this special case,

\int \cdots \int_R f(x_1, x_2, \ldots, x_n) \, dx_1 \ldots dx_n = c \times (\text{volume of region } R) = 1.   (8.8)
𝑓 (𝑥, 𝑦) is the joint probability density function for 𝑋 and 𝑌 if for any two-dimensional set 𝐴
∫︁ ∫︁
P[(𝑋, 𝑌 ) ∈ 𝐴] = 𝑓 (𝑥, 𝑦) 𝑑𝑥 𝑑𝑦 (8.9)
𝐴
In particular (in the logistics industry, risk management, military science and similar applications), if A is the two-dimensional rectangle \{(x, y) : a \le x \le b, \, c \le y \le d\}, then

P[(X, Y) \in A] = P[a \le X \le b, \, c \le Y \le d] = \int_a^b \int_c^d f(x, y) \, dy \, dx.
c/ Marginal distributions (marginal c.d.f.): By letting one or more variables tend to infinity, we obtain the joint c.d.f. of the remaining variables. For example, for n = 2,

F_X(x) = \lim_{y \to \infty} F(x, y) = F(x, \infty).

The c.d.f.'s of the individual variables are called the marginal distributions. E.g., F_1(x) is the marginal c.d.f. of X_1.
d/ Marginal p.d.f.
The marginal p.d.f. of 𝑋𝑖 , (𝑖 = 1, · · · , 𝑛) can be obtained from the joint p.d.f. 𝑓 (𝑥1 , 𝑥2 , . . . , 𝑥𝑛 ), by
integrating the joint p.d.f. with respect to all 𝑥𝑗 , 𝑗 ̸= 𝑖.
E.g., if 𝑛 = 2, 𝑓 (𝑥1 , 𝑥2 ) is the joint p.d.f. of 𝑋 = 𝑋1 , 𝑌 = 𝑋2 , then the marginal p.d.f. of 𝑋, 𝑌 are
f_1(x) = \int_{-\infty}^{+\infty} f(x, y) \, dy,   f_2(y) = \int_{-\infty}^{+\infty} f(x, y) \, dx.   (8.11)
Example 8.3.
The present example is theoretical and is designed to illustrate the above concepts. Let (X, Y) be a pair of random variables having a joint uniform distribution on the region

T = \{(x, y) : 0 \le x, y, \; x + y \le 1\}.

Find the joint p.d.f. of (X, Y).
SOLUTION:
𝑇 is a triangle in the (𝑥, 𝑦)-plane with vertices at (0, 0), (1, 0) and (0, 1).
The joint p.d.f. of X, Y, f_{X,Y}(x, y) = f(x, y), must fulfill the conditions

(i) f(x, y) \ge 0 for all x, y,   and   (ii) \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y) \, dx \, dy = 1.   (8.12)

According to the assumption of uniform distribution, the joint p.d.f. f(x, y) is a constant c on T, so

\int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y) \, dx \, dy = c \int_0^1 \int_0^{1-x} dy \, dx = \frac{c}{2} = 1,

hence c = 2 and

f(x, y) = \begin{cases} 2, & \text{if } (x, y) \in T \\ 0, & \text{otherwise.} \end{cases}
Remark 3. As with univariate random variables, we often drop the subscript (𝑋1 , 𝑋2 ) from joint cdfs,
pdfs, and pmfs, when it is clear from the context.
We also use notation such as 𝑓12 instead of 𝑓𝑋1 ,𝑋2 .
Besides (𝑋1 , 𝑋2 ), we often use (𝑋, 𝑌 ) to express random vectors.
When considering multiple random variables, we can generate a new random variable by considering a function that takes the associated random variables as arguments. In particular, a function Z = g(X, Y) of the variables X and Y is a random variable, and its distribution can be calculated from the joint pmf p_{X,Y} (discrete case) or joint pdf f_{X,Y} (continuous case).
Given any two random variables X and Y having a joint distribution with p.m.f. p(x, y), as defined in Equation (8.2), or p.d.f. f(x, y), as defined above, we define the following.
The covariance Cov(X, Y) between two rv's X and Y is Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)], where \mu_X = E[X] and \mu_Y = E[Y] are the two means. The correlation of X and Y is

Corr(X, Y) = \rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y},   (8.17)

where \sigma_X, \sigma_Y are the standard deviations of X and Y.
A large insurance agency services a number of customers who have purchased both a homeowner’s
policy and an automobile policy from the agency. For each type of policy, a deductible amount must be
specified. For an automobile policy, the choices are $100 and $250, whereas for a homeowner’s policy,
the choices are 0, $100, and $200.
Suppose an individual with both types of policy is selected at random from the agency's files.
Let X denote the deductible amount on the automobile policy and Y the deductible amount on the homeowner's policy.
CONCEPT 6. Given a pair of random variables (X, Y), we say X and Y are independent if

f(x, y) = f_X(x) \, f_Y(y)   for all (x, y)

(with the joint pmf p(x, y) in place of f in the discrete case).
Knowledge box 4.
a/ If X and Y are independent, then Cov(X, Y) = 0, i.e. they are uncorrelated.
b/ The converse is not true. Zero correlation does not imply independence.
If X_1, X_2, \ldots, X_k are mutually independent then, for any integrable functions g_1(X_1), g_2(X_2), \ldots, g_k(X_k),

E\left\{ \prod_{i=1}^{k} g_i(X_i) \right\} = \prod_{i=1}^{k} E(g_i(X_i)).   (8.19)
A sequence of 𝑛 random variables 𝑋𝑖 are identically distributed if they follow the same distribution of
a common random variable 𝑋.
More precisely they have the same ranges Range(𝑋) and the same p.d.f. 𝑓𝑋 (). We write 𝑋𝑖 ∼ 𝑋.
If X and Y are two random variables having a joint p.d.f. f(x, y) = f_{X,Y}(x, y) and marginal p.d.f.'s f_X(x), f_Y(y), respectively, then the conditional p.d.f. of Y given X = x, where f_X(x) > 0, is defined to be

f_Y(y | x) = f_{Y|X}(y | X = x) = \frac{f(x, y)}{f_X(x)}.   (8.20)

Similarly, the conditional p.d.f. of X given Y = y is

f_X(x | y) = f_{X|Y}(x | Y = y) = \frac{f(x, y)}{f_Y(y)}.
• We write f_X(x | y) = f_X(x | Y = y). When X and Y are independent we substitute f(x, y) = f_X(x) f_Y(y) and get f_X(x | y) = f_X(x): conditioning on Y does not change the distribution of X.
From Example 8.3, we compute the conditional p.d.f. of Y, given X = x, when 0 < x < 1. By (8.20),

f_{Y|X}(y | X = x) = f_Y(y | x) = \begin{cases} \dfrac{1}{1 - x}, & \text{when } 0 < y < 1 - x \\ 0, & \text{otherwise.} \end{cases}

This is a uniform distribution over (0, 1 - x), 0 < x < 1. If x \notin (0, 1), the conditional p.d.f. does not exist.
E(Y | x) := E(Y | X = x) is the expected value of Y with respect to the conditional p.d.f. f_Y(y | x), that is,

E[Y | x] = E[Y | X = x] = \sum_{j=1}^{\infty} y_j \, p_Y(y_j | x)   if (X, Y) is discrete,
                        = \int_{-\infty}^{\infty} y \, f_Y(y | x) \, dy   if (X, Y) is continuous.   (8.22)
Similarly, we can define the conditional variance of 𝑌 , given 𝑋 = 𝑥, as the variance of 𝑌 , with
respect to the conditional p.d.f. 𝑓𝑌 |𝑋 (𝑦 | 𝑥).
♣ OBSERVATION 2.
Notice that for a pair of random variables (𝑋, 𝑌 ), the conditional expectation E[𝑌 | 𝑋 = 𝑥] changes
with 𝑥, if 𝑋 and 𝑌 are dependent. Thus, we can consider E[𝑌 | 𝑋] to be a random variable, which is
a function of 𝑋. This is the foundation for defining regression models in Chapter 9 about Simple Linear
Regression.
Moreover, if we assume linearity between non-random predictor 𝑋 and 𝑌 in a model
𝑌 = 𝑓 (𝑋) = 𝛼 + 𝛽𝑋 + 𝜀 (8.23)
then the response variance V[𝑌 ] = 0 + V[𝜀] = V[𝜀] = 𝜎 2 . This condition turns out to be a key fact for
linear regression analysis in the subsequent parts.
In an electronic assembly, let the random variables 𝑋1 , 𝑋2 , · · · , 𝑋4 denote the lifetime of four com-
ponents, respectively, in hours. Suppose that the joint probability density function of these variables
is
𝑓𝑋1 ,𝑋2 ,𝑋3 ,𝑋4 (𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 ) = 9 × 10−12 𝑒−0.001𝑥1 −0.002𝑥2 −0.0015𝑥3 −0.003𝑥4
for 𝑥1 ≥ 0, 𝑥2 ≥ 0, 𝑥3 ≥ 0, 𝑥4 ≥ 0. What is the probability that the device operates for more than
1000 hours without any failures?
The requested probability is P[X_1 > 1000, X_2 > 1000, X_3 > 1000, X_4 > 1000], which equals the multiple integral of f_{X_1,X_2,X_3,X_4}(x_1, x_2, x_3, x_4) over the region x_1 > 1000, x_2 > 1000, x_3 > 1000, x_4 > 1000. The joint probability density function can be written as a product of exponential functions, and each integral is the simple integral of an exponential function. Therefore,

P[X_1 > 1000, X_2 > 1000, X_3 > 1000, X_4 > 1000] = e^{-1 - 2 - 1.5 - 3} = e^{-7.5} \approx 0.00055.
Regression Analysis I
Simple Linear Regression
[Source [9]]
In previous chapters, we were concerned with the distribution of one random variable, and also of random vectors, their parameters, expectation, variance, median, etc. In this chapter, we study relations among variables. This chapter covers the following.
Learning Objectives
6. Check the adequacy of the model (goodness of fit) by analysis of variance for regression
• Linear regression models show the linear relationship between a variable of interest (response variable or dependent variable) and a set of other variables, called observation variables or predictor variables. Accordingly, we wish to predict the values of the variable of interest.
• Chapter 11 on multivariate regression analysis will lay the foundations for more advanced data
analysis, being essential for experimental sciences such as chemical engineering, bio-medical sci-
ences or industrial production.
• Techniques for comparing the means (or other parameters) of many populations are grouped into a family of methods named Analysis of Variance (ANOVA).
Specific datasets
In the present chapter we start with numerical analysis of multivariate data. In order to illustrate the
ideas and enrich practical applications, we introduce here an industrial data set, called ALMPIN.csv.
The ALMPIN.csv set consists of 70 records on 6 variables measured on aluminum pins used in
airplanes. The aluminum pins are inserted with air-guns in pre-drilled holes in order to combine critical
airplane parts such as wings, engine supports and doors.
We now introduce a statistic which summarizes the simultaneous variability of samples obtained from two variables X and Y. The statistic is called the sample covariance. It is a generalization of the sample variance statistic s_x^2 of one variable X.
Let x = x_1, x_2, \cdots, x_n and y = y_1, y_2, \cdots, y_n be two samples of the same size n, observed on the variables X and Y respectively; x * y denotes the inner product of the two vectors x, y. The sample covariance is

s_{xy} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).

Note that s_{xx} is the sample variance s_x^2, and s_{yy} = s_y^2.
• The sample covariance 𝑠𝑥𝑦 can assume positive or negative values. If one of the variables, say, 𝑋,
assumes a constant value 𝑐, i.e. 𝑥𝑖 = 𝑐, ∀𝑖, then 𝑠𝑥𝑦 = 0. This can be immediately verified, since
x = 𝑐 and 𝑥𝑖 − x = 0 for all 𝑖 = 1, · · · , 𝑛.
By dividing s_{xy} by s_x \cdot s_y we obtain a standardized index of dependence, which is called the sample correlation (Pearson's sample correlation), namely

r_{xy} = \frac{s_{xy}}{s_x \cdot s_y}.   (9.3)
We always have -1 \le r \le 1 for the sample correlation r = r_{xy}. The two limit values, r = 1 and r = -1, are theoretical and apply when all the data points of a scatter plot lie on a straight line

Y_i = \alpha + \beta x_i,

with \beta > 0 (r = 1) or \beta < 0 (r = -1).
The measurements of data ALMPIN.csv were taken in a computerized numerically controlled (CNC)
metal cutting operation. The six variables are
Diameter 1, Diameter 2, Diameter 3,
Cap Diameter,
Lengthncp and
Lengthwcp.
All the measurements are in millimeters.
The first three variables give the pin diameter at three specified locations. Cap Diameter is the
diameter of the cap on top of the pin. The last two variables are the length of the pin, without and with
the cap, respectively.
Computation on R .
To see the first five rows, we write R code in command line environment:
> library(mistat);
> data(ALMPIN)
> ALMPIN[1:5, ];
[Table 9.1: sample covariances of the six variables (X down the rows, Y across the columns); most entries were lost in extraction. The surviving entry for (Diameter 1, Diameter 1) is 0.0270.]
In Table 9.1 we present the sample covariances of the six variables measured on the aluminum
pins. Since
𝑆𝑥𝑦 = 𝑆𝑦𝑥
(covariances and correlations are symmetric statistics), it is sufficient to present the values at the
bottom half of Table 9.1 (on and below the diagonal).
In Table 9.2 we present the sample correlations in the data file ALMPIN.csv. We see that the
sample correlations between Diameter 1, Diameter 2 and Diameter 3 and Cap Diameter are all
greater than 0.9.
As we see in Figure 9.2 (the multivariate scatter plots) the points of these variables are scattered
close to straight lines.
[Table 9.2: sample correlations of the six variables (X down the rows, Y across the columns); the surviving entries show Corr(Diameter 1, Diameter 1) = 1 and Corr(Diameter 2, Diameter 1) = 0.958.]
On the other hand, no clear relationship is evident between the first four variables and the length of the pin (with or without the cap). Negative correlations usually indicate that the points are scattered around a straight line having a negative slope. In the present case it seems that the magnitude of these negative correlations is due to one outlier (pin # 66).
To get Figure 9.2, write R code:
> data(ALMPIN)
> plot(ALMPIN)
PRACTICE 2.
Forecasting aims to comprehend the rules and trends of future research objectives, based on the analysis of information flows or actual data from the past and present. The forecast consists of 4 steps, named DPSF:
2. Preliminary treatment:
- Divide the dataset into two groups, one for parameter estimation and one for checking the goodness of the model.
- Build the regression model so that the random error is smallest and the coefficient of determination R^2 is close to 1.
- The forecast error is verified on the test data set.
Linear regression is the most common method for forecasting and optimizing in the process of Statistical Optimization above. We need a way to express our objective, which we call a model.
Mathematical models have the common trait that the response and predictor variables are assumed to be free of specification error and measurement uncertainty.
Statistical model. A model is termed statistical if it is derived from data that are subject to various
types of specification, observation, experimental, and/or measurement errors.
Statistical Models are approximations to actual physical systems, and are subject to specification
and measurement errors.
Statistical regression model. Many variables observed in real life are related. The type of their re-
lation can often be expressed in a mathematical form called statistical regression model or just re-
gression.
Example 9.2.
Consumption Theory in Economics tells us that generally people increase their consumption expen-
diture 𝐶 when their after-tax (disposable) income 𝑌𝑑 increases, but not by as much as the increase in
their disposable income.
This can be stated in explicit linear equation form, mathematically as:
𝐶 = 𝑏0 + 𝑏1 𝑌𝑑
where 𝑏0 , 𝑏1 are unknown constants called parameters. But different people having the same dispos-
able income are likely to have somewhat different consumption expenditures. As a result, the above
deterministic relationship must be modified to include a random disturbance or error term 𝑢, making
it stochastic or statistical model:
𝐶 = 𝑏0 + 𝑏1 𝑌𝑑 + 𝑢.
Statistical model building is an iterative process. We entertain a tentative model but we are ready
to revise it if necessary. Only when we are happy with our model should we stop. We can then use
our model, sometimes to understand our current set of data, sometimes to help us predict what may
happen in the future. We must be ready to translate what the model is telling us statistically to the client
with the real-life problem. But how do we build a model from data?
Example 9.3. In Ecology, how do brown creepers Y increase in relative abundance with increasing extent X of late-successional forest in highland areas of North America?
In a statistical model relating the abundance of brown creepers Y to the forest extent X,
y = β₀ + β₁ x + ε,
we see two components. We want to determine the deterministic part while minimizing the stochastic part! The deterministic component must be found by estimating parameters from data.
Knowledge box 5. The process of building a model from data is called model fitting.
The model fitting process in this text generally involves four steps, among them:
* choosing an equation linking the response and explanatory variables, called a regression model;
3/ Checking the adequacy of the model, i.e., how well it fits or summarizes the data; see Chapter 10.
4/ Inference: classically this involves calculating confidence intervals, testing hypotheses about the parameters in the model, and interpreting the results; see Chapters 4, 5, 6 and the Appendix.
Regression analysis is the study of relationships between variables. It is one of the most useful tools for a business/industry/management analyst. Regression analyses are classified by the number of explanatory variables involved. In every regression study there is a single variable that we are trying to explain or predict, called the dependent variable or response variable.
To explain or predict the response, we use one or more explanatory variables (also called independent variables, regressors or predictor variables). The response variable is the single variable being explained by the regression; the explanatory variables are used to explain the response variable. For example, we can not only understand how a company's sales are affected by its advertising, but we can also use the company's records of current and past advertising levels to predict future sales.
The branch of statistics that studies such relationships is called regression analysis. Some poten-
tial uses of regression analysis in business include the following:
• How do wages of employees depend on years of experience, years of education, and gender? How
does the current price of a stock depend on its own past values, as well as the current and past
values of a market index?
• How does a company’s current sales level depend on its current and past advertising levels, the
advertising levels of its competitors, the company’s own past sales levels, and the general level of the
market?
• How does the selling price of a house depend on such factors as the appraised value of the house,
the square footage of the house, the number of bedrooms in the house, and others?
Each of these questions asks how a single variable, such as selling price or employee wages, de-
pends on other relevant variables. If we can estimate this relationship, then we can not only better
understand how the world operates, but we can also do a better job of predicting the variable in
question.
Briefly speaking, regression models relate a response variable to one or several predictors. Hav-
ing observed predictors, we can forecast the response by computing its conditional expectation (see
Equation (14.8) of Section 8.4.2), given all the available predictors.
• Predictors or independent variables 𝑋1 , 𝑋2 , · · · , 𝑋𝑘 are used to predict the values and be-
havior of the response variable 𝑌 .
A simple regression analysis includes one explanatory variable, whereas multiple regression [see Chapter 11] includes any number of explanatory variables. We have seen examples before in which the relationship between two variables X and Y is close to linear. This is the case when the (x, y) points scatter along a straight line.
Definition 9.2. Suppose that we are given D = {(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), . . . , (𝑥𝑛 , 𝑦𝑛 )}.
• By Definition 9.1, the linear regression model assumes that the conditional expectation E[Y | X = x] = G(x) = α + β x is a linear function of x.
The coefficients α and β are called the regression coefficients: α is the intercept and β is the slope coefficient.
• Generally, the coefficients 𝛼 and 𝛽 are unknown. We fit to the data points a straight line, which is
called the estimated regression line, or prediction line.
• The intercept
𝛼 = 𝐺(0)
equals the value of the regression function for 𝑥 = 0. Sometimes it has no physical meaning. For
example, nobody will try to predict the value of a computer with 0 random access memory (RAM), and
nobody will consider the national reserve rate in year 0. In other cases, intercept is quite important.
• The slope
𝛽 = 𝐺(𝑥 + 1) − 𝐺(𝑥)
is the predicted change in the response variable when predictor changes by 1. This is a very impor-
tant parameter that shows how fast we can change the expected response by varying the predictor.
A zero slope means absence of a linear relationship between X and Y. In this case, E[Y] = α stays constant when X changes.
Example 9.4.
Model the effect of the quality x of produced computers on the customer satisfaction y of the public. We have a simple linear regression model, described by (writing β₀ = α, β₁ = β)
y = β₀ + β₁ x + ε.
• The mean 𝜇𝑦 = E[𝑌 |𝑋 = 𝑥] of the response variable 𝑌 changes as 𝑥 changes. The means all lie
on a straight line when plotted against 𝑥, that is
𝜇𝑦 = 𝛽0 + 𝛽1 𝑥.
• Individual responses of 𝑦 with the same 𝑥 vary according to a normal distribution. These normal
distributions all have the same standard deviation.
See Section 10.4 for concrete computation and analysis of these two parts.
Example 9.5.
𝑦 = 𝛼 + 𝛽 1 𝑥 1 + 𝛽2 𝑥 2 + 𝜀 (9.5)
• This is a multiple linear regression model with independent variables, or regressor variables 𝑥1 , 𝑥2 .
This model describes a plane in the two-dimensional space of the regressor variables 𝑥1 , 𝑥2 .
• Termed linear since 𝑦 is a linear function of the unknown parameters or regression coefficients
𝛼, 𝛽1 , 𝛽2 .
Multiple linear regression models [see details in Chapter 11] become more complex, when we add
an interaction term between 𝑥1 , 𝑥2 to the 1st-order model (11.21), and get
𝑦 = 𝛼 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝛽12 𝑥1 𝑥2 + 𝜀 (9.6)
The model takes each 𝑥 to be a fixed known quantity. In practice, 𝑥 may not be exactly known. If the
error in measuring 𝑥 is large, more advanced inference methods are needed.
Generally, we have
The response 𝑌 to a given 𝑥 is a random variable. The linear regression model describes the
conditional mean
Ŷ = E[Y | X = x]
and standard deviation of this random variable 𝑌 . These unknown parameters 𝛽𝑗 must be estimated
from the data.
We now introduce the method of least squares by looking at the least squares geometry and dis-
cussing some of its algebraic properties.
Suppose that 𝑦̂︀ = 𝑎 + 𝑏𝑥 is the straight line fitted to the data. The principle of least squares requires
one to determine 𝑎 and 𝑏, the estimates of 𝛼 and 𝛽, which minimize the sum of squares of residuals
𝜀𝑖 = 𝑦𝑖 − 𝑦̂︀𝑖 = 𝑦𝑖 − 𝑎 − 𝑏𝑥𝑖
We can now explain how to choose the best-fitting line through the points in the scatter-plot. It is the
line with the smallest sum of squared residuals. The resulting line is called the least squares line.
QUIZ: Why do we use the sum of squared residuals? Why not minimize some other measure
of the residuals?
The definitions of the following terms for the sum of squares and cross-products are useful for com-
putations in regression:
S_xy = Σ_{i=1}^n (xᵢ − x̄)(yᵢ − ȳ) = (n − 1) s_xy
and
S_x² = S_xx = Σ_{i=1}^n (xᵢ − x̄)² = (n − 1) s_x²,
both estimated from a data set D = {(x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ)} of size n.
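As a small hedged illustration in R (the vectors x and y below are toy placeholders, not data from the book), these sums relate directly to the built-in cov() and var():

x <- c(1.2, 2.0, 2.9, 4.1, 5.0)            # toy predictor values (hypothetical)
y <- c(2.3, 3.9, 6.1, 8.0, 10.2)           # toy responses (hypothetical)
n <- length(x)
Sxy <- sum((x - mean(x)) * (y - mean(y)))  # S_xy as defined above
Sxx <- sum((x - mean(x))^2)                # S_xx as defined above
all.equal(Sxy, (n - 1) * cov(x, y))        # TRUE: S_xy = (n - 1) s_xy
all.equal(Sxx, (n - 1) * var(x))           # TRUE: S_xx = (n - 1) s_x^2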
Computing 𝑎, 𝑏
First, it is not appropriate to simply minimize the sum of the residuals. This is because the positive
residuals would cancel the negative residuals.
• 𝜀𝑖 = 𝑦𝑖 − 𝑦̂︀𝑖 is viewed as a residual (deviation, random error), between observed responses 𝑦𝑖 and
their fitted values 𝑦̂︀𝑖 ,
• the method of least squares (OLS) looks for a line such that the sum
SSE = Σ_{i=1}^n εᵢ² = Σ_{i=1}^n (yᵢ − ŷᵢ)² = Σ_{i=1}^n (yᵢ − a − b xᵢ)²  (9.9)
is the smallest. Dividing both sides of the above equation by n − 1, we have
SSE/(n − 1) = S_y² (1 − R_xy²) + S_x² (b − S_xy/S_x²)².  (9.10)
Here
S_xy = Σ_{i=1}^n (xᵢ − x̄)(yᵢ − ȳ),  S_x² = S_xx = Σ_{i=1}^n (xᵢ − x̄)²,  S_y² = S_yy = Σ_{i=1}^n (yᵢ − ȳ)².
• If we write
SST := Σ_{i=1}^n (yᵢ − ȳ)²  (9.11)
for the total sum of squared deviations of the responses about their mean, and
SSR = Σ_{i=1}^n (ŷᵢ − ȳ)²  (9.12)
for the sum of squared regression errors, we get SSE = SST − β̂ S_xy (with β̂ = b the fitted slope), and also
Figure 9.4: The sum SSE; the orange fitted line gives the minimum value of SSE.
S_x² = S_xx = Σ_{i=1}^n (xᵢ − x̄)² = (n − 1) s_x².  (9.16)
Computation on R:
The basic function for fitting ordinary linear models is lm(), and a streamlined version of the call in general is as follows:
fitted.model = lm(model, data = data.frame)
Details
Models for lm are specified symbolically. A typical model has the form
response ∼ terms
where response is the (numeric) response vector and
terms is a series of terms which specifies a linear predictor for response.
We observed the world population from 1950 to 2010, with a survey every 5 years, and want to fit the data by a linear regression model.
Figure 9.5: Dot diagram of the world population
> x = c(1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010);
# 13 years observing the world population from 1950 to 2010
> y = c(2558,2782,3043,3350,3712,4089,4451,4855,5287,5700,6090,6474,6864);
# the world population
> D = data.frame(x,y); # a table with 13 rows and 2 columns
> M1 <- lm(y ~ x, D); # fitted model
> summary(M1)
Call:
lm(formula = y ~ x, data = D)
Residuals:
Min 1Q Median 3Q Max
-107.08 -96.26 -12.29 62.90 223.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.422e+05 3.041e+03 -46.76 5.24e-14 ***
x 7.412e+01 1.536e+00 48.26 3.71e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 103.6 on 11 degrees of freedom
World population in 1950–2010 and its regression forecast for 2015 and 2020.
Figure 9.6: Linear model of world population
We conclude that the world population grows at an average rate of 74.1 million per year. Note that the Multiple R-squared = 0.9953 is exactly the square of Pearson's correlation r_xy given in Equation (9.3).
The regression line in Figure 9.6, which fits the observed data for the years 1950–2010, predicts a population of 7.15 billion in 2015 and 7.52 billion in 2020; these forecasts can be reproduced in R, as sketched after the self-test questions below. But what is the Adjusted R-squared?
1. What is the slope of the population regression line? Explain clearly what this slope says about the
change in the mean of 𝑦 for a change in 𝑥.
3. Between what two values would approximately 95% of the observed responses y fall when x = 10?
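As a minimal sketch (reusing the fitted model M1 above; only the years quoted in the text are assumed), the 2015 and 2020 forecasts can be reproduced with predict():

new.years <- data.frame(x = c(2015, 2020))
predict(M1, newdata = new.years)
# about 7152 and 7522 (millions), i.e. the 7.15 and 7.52 billion quoted above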
• evaluate the goodness of fit of the chosen linear model to the observed data
Motivation
Seventy house sale prices in a certain county are depicted in Figure 9.7 along with the house area.
First, we see a clear relation between these two variables, and in general, bigger houses are more
expensive. However, the trend no longer seems linear.
Second, there is a large amount of variability around this trend. Indeed, area is not the only factor
determining the house price. Houses with the same area may still be priced differently. Then, how can
we estimate the price of a 3200-square-foot house?
We can estimate the general trend (the dotted line in Figure 9.7) and plug 3200 into the resulting formula, but due to the obviously high variability, our estimate will not be as accurate as in Example 9.6. Indeed, there is considerable variation among the house sale prices in Figure 9.7.
Why are the houses priced differently?
Well, the price depends on the house area, and bigger houses tend to be more expensive. So, to
some extent, variation among prices is explained by variation among house areas. However, two
houses with the same area may still have different prices. These differences cannot be explained by
the area.
Total sum SST The quantity
SST = Σ_{i=1}^n (yᵢ − ȳ)²
is the total sum of squared deviations of the responses about their sample mean; SST measures the variation of the yᵢ about ȳ regardless of our regression model.
Regression sum SSR A portion of this total variation is attributed to predictor X and the regression model connecting predictor and response. This portion is measured by the sum of squared regression errors,
SSR = Σ_{i=1}^n (ŷᵢ − ȳ)² = b² S_xx = b² (n − 1) s_x²,  (9.18)
with S_xx given in Equation (9.16). This is the portion of total variation explained by the model.
Error sum SSE The rest of the total variation is attributed to random errors. The error portion of total variation is measured by the error sum of squares
SSE = Σ_{i=1}^n dᵢ² = Σ_{i=1}^n (yᵢ − ŷᵢ)².
1. The value SSE/(n − 1) is called the sample variance of the residuals around the least squares regression line, denoted by S²_{y|x}.
2. One can show that E[SSE] = (n − 2) σ².
The regression line, determined from Knowledge Box 6, passes through the point (x̄, ȳ), where x̄, ȳ are the sample means of X and Y. Due to Equation (9.10), at the least squares estimate b = S_xy/S_x² (so that b − S_xy/S_x² = 0) the value
SSE/(n − 1) = S²_{y|x} = S_y² (1 − R_xy²).  (9.20)
• Here S_y² = S_yy = Σ_{i=1}^n (yᵢ − ȳ)² and S²_{y|x} is the sample variance of the residuals around the least squares regression line.
By definition, S²_{y|x} ≥ 0, hence R_xy² ≤ 1, or −1 ≤ R_xy ≤ 1.
• R_xy = ±1 when S²_{y|x} = 0. This is the case when all the points (xᵢ, yᵢ), i = 1, · · · , n lie on a straight line.
• If R_xy = 0, then the slope of the regression line is b = 0 and S²_{y|x} = S_y².
• R_xy² is called the coefficient of determination.
The goodness of fit, meaning the appropriateness of the predictor and the chosen regression model, can be judged by the proportion R² of SSR over SST. The coefficient of determination R² is the proportion of the total variation explained by the regression model,
R² = SSR/SST.
It is always between 0 and 1, with high values generally suggesting a good fit. In univariate regression, R-square also equals the squared sample correlation coefficient, derived from Concept 7.
• R_xy² = R² (the coefficient of determination above) is the proportion of the variation in Y that is explained by the linear relationship ŷ = a + b x. Thus, R_xy (the correlation coefficient) measures the degree of linear relationship in the data.
• Linear regression line (or predictive line) can be used to predict the values of 𝑌 .
An adjusted R², defined as
R_xy²(adjusted) = 1 − (1 − R_xy²) (n − 1)/(n − 2),  (9.22)
will be more useful for explaining the determinants, especially when working with multiple regression models.
SST = Σ_{i=1}^n (yᵢ − ȳ)² = (n − 1) s_y² = (12)(2.093 · 10⁶) = 2.512 · 10⁷,
SSR = Σ_{i=1}^n (ŷᵢ − ȳ)² = b² S_xx = (74.1)² (4550) = 2.500 · 10⁷.
A linear model for the growth of the world population thus has a very high R-square of
R² = SSR/SST = 0.995, or 99.5%.
This is a very good fit, although some portion of the remaining 0.5% of total variation can still be explained by adding non-linear terms to the model.
• We could use the term model adequacy when talking about the goodness of the fitted model.
• The standard deviation of the residuals around the regression line (also called the residual standard error) is S_e, with
S_e² = S_yy (1 − R_xy²)/(n − 2).  (9.23)
Here we see that S_e² = [(n − 1)/(n − 2)] S²_{y|x}.
Overfitting a model
Among all possible straight lines, the method of least squares chooses one line that is closest to the
observed data. Still, as we see in Figure 9.8.b, we had some residuals 𝑑𝑖 = (𝑦𝑖 − 𝑦̂︀𝑖 ) and some positive
sum of squared residuals. The straight line has not accounted for all 100% of variation among 𝑦𝑖 . Why,
one might ask, have we considered only linear models?
As long as all xᵢ are different, can we always find a regression function Ŷ(x) that passes through all the observed points without any error? Then the sum Σᵢ dᵢ² = 0 would truly be minimized!
Trying to fit the data perfectly is a rather dangerous habit. Although we can achieve an excellent fit
to the observed data, it never guarantees a good prediction. The model will be overfitted, too much
attached to the given data. Using it to predict unobserved responses is very questionable.
(Work table with columns X, Y, XY, X − X̄, Y − Ȳ, (X − X̄)(Y − Ȳ), (X − X̄)², (Y − Ȳ)² to be filled in.)
We get X̄ = 3.25, Ȳ = 51.43, and the correlation value r = S_xy/(S_x · S_y) = ? Thus, are the catalyst concentration X and the detergent output Y closely, weakly, or inversely correlated?
The BOD¹ is usually determined in a laboratory after a 5-day incubation of samples taken from the water. Sampling at 38 stations along the river gives the data set shown in the table of Figure 9.9, in mg/L (milligrams per liter). Compute the coefficient of correlation of DO and BOD. What is your comment?
HINT:
The coefficient of correlation from Eq. (9.3) is r = S_xy/(S_x · S_y) = −0.90?
• As expected, the scatter diagram below of Figure 9.9 strongly indicates a negative type of correlation
(inverse correlation) with high values of DO associated with low values of BOD and vice versa.
• It suggests that the value of BOD can be estimated from a measurement of the DO. The scatter in
the diagram may be partly attributed to some inadequacies of the BOD test and partly to factors such
as temperature and rate of flow, which affect the DO.
1 The BOD denotes the amount of oxygen used in meeting the metabolic needs of aerobic microorganisms in water, whether naturally occurring or resulting from sewage outflows and other discharges; thus, high values of BOD generally indicate high levels of pollution.
[Source [56]]
CHAPTER 10. REGRESSION ANALYSIS II: INFERENCE FOR REGRESSION
We studied the simple linear regression model in the previous chapter. In this chapter, we study some more advanced techniques in regression analysis. Many problems in engineering and the sciences involve a study or analysis of the relationship between two or more variables. In many situations, the relationship between variables is not deterministic.
• For example, a) the electrical energy consumption of a house (𝑦) is related to the house’s size (𝑥),
but it is unlikely to be a deterministic relationship. Similarly, b) the fuel usage of an automobile (𝑦) is
related to the vehicle weight 𝑥, but the relationship is not a deterministic one.
• In both of these examples, the value of the response of interest y (energy consumption or fuel usage as mentioned above) cannot be predicted perfectly from knowledge of the corresponding x.
It is possible in b) for different automobiles to have different fuel usage even if they weigh the same,
and it is possible in a) for different houses to use different amounts of electricity even if they are the
same size.
1. Test statistical hypotheses and construct confidence intervals on regression model parameters
The models, capturing non-deterministic relationship of 𝑦 with 𝑥, are briefly named empirical
models if we can find them based on observed datasets.
The collection of statistical tools that are used to model and explore relationships between many
variables that are related in a non-deterministic manner is called regression analysis.
Problems of this type occur so frequently in many branches of engineering and science that regression analysis is one of the most widely used statistical tools.
• Telecommunication satellites are powered while in orbit by solar cells. Tadicell, a solar cells producer
that supplies several satellite manufacturers, was requested to provide data on the degradation of its
solar cells over time.
• Tadicell engineers performed a simulated experiment in which solar cells were subjected to temper-
ature and illumination changes similar to those in orbit and measured the short circuit current ISC
(amperes) of solar cells at three different time periods, in order to determine their rate of degradation.
• In Table 10.1 we present the ISC values of 𝑛 = 16 solar cells, measured at three time epochs, one
month apart. The data is given in file SOCELL.csv.
The response 𝑌 of that empirical model measures the short circuit current ISC (amperes) of solar
cells at time point t2; while
the predictor 𝑋 measures the short circuit current ISC at time point t1.
• With concrete data, this study concerns the degradation of solar cells in telecommunication satellites; the key aim is establishing an empirical model.
In Figure 10.1 we see the scatter of the ISC values at time point 𝑡1 and time point 𝑡2 .
Table 10.1: ISC values (amperes) of the n = 16 solar cells at the three time epochs X = t₁, Y = t₂, and t₃.
Call:
lm(formula = Y ~ X, data = SOCELL)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.53578 0.20314 2.638 0.0195 *
X 0.92870 0.05106 18.189 3.88e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1. What are Std. Error and t value? Why do they appear next to the estimates?
Key Arguments
• If the value of 𝑦 does not change linearly with the value of 𝑥, then using the mean value of 𝑦 is the
best predictor for the actual value of 𝑦.
• If the value of 𝑦 does change linearly with the value of 𝑥, then using the regression model gives a
better prediction for the value of 𝑦 than using the mean of 𝑦.
Methods of estimating a regression line and partitioning the total variation do not rely on any distribu-
tion; thus, we can apply them to virtually any data.
For further analysis, we introduce standard regression assumptions.
(𝑋 could be family size, interest rate or a project input, number of drunk men per day in BKK, and
𝑌 could be electricity consumption, project return in investment, or number of traffic accidents in
Bangkok)
At the i-th observation, predictor Xᵢ is considered non-random, and we assume a linear relationship between Yᵢ and Xᵢ of the form
Yᵢ = β₀ + β₁ Xᵢ + εᵢ,
where the εᵢ are random errors.
Assumption 2: Normality of the responses: We therefore assume that observed responses Yᵢ are independent normal random variables with mean E[Yᵢ] = β₀ + β₁ Xᵢ.
After we estimate the variance 𝜎 2 , they can be studied by T-tests and T-intervals.
According to Assumption 2, responses (𝑌1 , 𝑌2 , · · · , 𝑌𝑛 ) have different means but the same variance 𝜎 2 ,
that is V[𝑌𝑖 ] = 𝜎 2 . Let us estimate it.
The fitted regression line gives the fitted values Ĝ(xᵢ) = ŷᵢ = b₀ + b₁ xᵢ.
We consider errors eᵢ = yᵢ − ŷᵢ, and obtain the error sum of squares SSE as below:
SSE = Σ_{i=1}^n eᵢ² = Σ_{i=1}^n (yᵢ − ŷᵢ)² = Σ_{i=1}^n (yᵢ − b₀ − b₁ xᵢ)².  (10.2)
• the regression sum of squares 𝑆𝑆𝑅 has df 𝑅 = 1 degree of freedom (the dimension of the
corresponding space (𝑋, 𝑌 ) is 1);
• the total sum of squares
SST = Σ_{i=1}^n (yᵢ − ȳ)² = (n − 1) s_y²
has df_T = n − 1 degrees of freedom, because it is computed directly from the sample variance s_y². So, SSE has df_E = df_T − df_R = n − 2 degrees of freedom. Obviously, SST = SSR + SSE.
3. With this information, we can now unbiasedly estimate the response variance σ² = V[Y] by the sample regression variance
s² = σ̂² = SSE/(n − 2).  (10.3)
Notice that the usual sample variance
s_y² = SST/(n − 1) = Σ_{i=1}^n (yᵢ − ȳ)²/(n − 1)
is biased, because ȳ no longer estimates the expectation E[Yᵢ] of Yᵢ.
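As a hedged one-line check (reusing the world-population fit M1 of Chapter 9), the estimate s of Equation (10.3) is exactly what R reports as the residual standard error:

s <- sigma(M1)   # residual standard error, sqrt(SSE/(n - 2)); 103.6 in the output above
s^2              # the sample regression variance s^2 of Equation (10.3)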
A standard way to present the analysis of variance of the response is the ANOVA table.
Definition 10.1.
Univariate ANOVA table:

Source   df      SS                                       MS                    F
Model    1       SSR = b² S_xx = Σ_{i=1}^n (ŷᵢ − ȳ)²      MSR = SSR             MSR/MSE
Error    n − 2   SSE                                      MSE = SSE/(n − 2)
Total    n − 1   SST = S_yy
Aim: The response variable here is the number of extra hours (Y) the dwellers can enjoy, and we attempt to predict it from the number of traffic jams (X). More precisely, we will answer the questions:
b₁ = S_xy/S_x² = −4.14; and b₀ = ȳ − b₁ x̄ = 72.3.
2. ANOVA table and variance estimation. Let us compute all components of the ANOVA. We have SST = S_yy = 1452, partitioned into SSR and SSE. Simultaneously, the n − 1 = 6 degrees of freedom of SST are partitioned into df_R = 1 and df_E = 5 degrees of freedom. Filling in the rest of the ANOVA table:

Source   df   SS                    MS                              F
Model    1    SSR = b² S_xx = 961   MSR = SSR = 961                 MSR/MSE = 9.79
Error    5    SSE = 491             s² = MSE = SSE/(n − 2) = 98.2
Total    6    SST = S_yy = 1452
S_x² = S_xx = Σ_{i=1}^n (xᵢ − x̄)² = (n − 1) s_x².  (10.5)
Having the unbiased estimate s² of the response variance σ² from Equation (10.3), we can proceed with tests and confidence intervals for the regression slope β₁. As usual, we start with the estimator b₁ of β₁ and its sampling distribution.
b₁ = b = S_xy/S_x² = [Σ_{i=1}^n (xᵢ − x̄)(yᵢ − ȳ)] / S_xx = [Σ_{i=1}^n (xᵢ − x̄) yᵢ] / S_xx
[we can drop ȳ because it is multiplied by Σ_{i=1}^n (xᵢ − x̄) = 0].
According to the standard regression assumptions above, the yᵢ are normal random variables and the xᵢ are non-random. The estimated slope b₁, as a linear function of the yᵢ, is also normal, N(μ_b, σ_b²), with
• mean μ_b = E[b₁] = β₁ and variance σ_b² = σ²/S_xx;
• the standard error of b₁ is estimated by S.E.(b₁) = s/√S_xx
with given data, and therefore we can use T-intervals and T-tests with n − 2 degrees of freedom, due to Equation (10.3).
The estimator 𝑏1 can be shown to have minimum variance, and, because it is linear, it is called the
BLUE of 𝛽1 , that is, the best (signifying minimum variance) linear unbiased estimator.
By similar arguments one can show that b₀ is the BLUE of β₀, as follows. Firstly, the Yᵢ are independent and have constant variance σ², so
V[ȳ] = σ²/n, estimated by s²/n.
Here S.E.(b₀) = √V[b₀], where the estimated variance of b₀ is
V[b₀] = V[ȳ − b₁ x̄] = s²/n + x̄² V[b₁] = s²/n + x̄² s²/S_xx = s² (1/n + x̄²/S_xx).  (10.11)
• the estimated slope b₁ ∼ N(β₁, V[b₁]), where V[b₁] = s²/S_xx;
• the estimated intercept b₀ ∼ N(β₀, V[b₀]), where V[b₀] is given in (10.11):
V[b₀] = s² (1/n + x̄²/S_xx), so S.E.(b₀) = √V[b₀].
• In testing for linearity of X and Y [which also means testing significance of the model (10.6)] we check whether 0 ∈ CI(β₁) or not.
• The test of H₀: β₁ = 0 is quite useful. When we accept H₀ we substitute β₁ = 0 in the model; the X term drops out and we are left with E(Y) = β₀. The hypotheses are
H₀: β₁ = 0 vs Hₐ: β₁ ≠ 0.
Theory: Continuing from the CI given in Equation (10.9), we can test the hypothesis H₀: β₁ = B about the regression slope using the T-statistic
T = (b₁ − B)/S.E.(b₁) = (b₁ − B)/(s/√S_xx) ∼ t[n − 2],  (10.12)
where the t-distribution has (n − 2) degrees of freedom, the degrees of freedom used in the estimation of σ². As always, the form of the alternative hypothesis determines whether it is a two-sided, right-tail, or left-tail test.
Argument: A non-zero slope b₁ indicates significance of the model, relevance of predictor X in the inference about response Y, and the existence of a linear relation between them. It means that a change in X is associated with changes in Y. To test
H₀: β₁ = 0 vs Hₐ: β₁ ≠ 0,
use the T-statistic
t₀ = b₁/S.E.(b₁) = b₁/(s/√S_xx) ∼ t[n − 2].  (10.13)
Definition 10.2.
The goodness of fit, meaning the appropriateness of the predictor and the chosen regression model, can be judged by the coefficient of determination R², defined as the proportion of the total variation explained by the regression model,
R² = SSR/SST = 1 − SSE/SST.
We always have 0 ≤ 𝑅2 ≤ 1, with high values generally suggesting a good fit.
This concept is extremely useful in the next example on Transportation Science, and also for multiple
regression in next chapters.
• In univariate regression, R-square also equals the squared sample correlation coefficient r_xy = s_xy/(s_x · s_y).
• The statistic 𝑅2 should be used with caution because it is always possible to make 𝑅2 unity by
simply adding enough terms to the model. In general, 𝑅2 will increase if we add a variable to the
model, but this does not necessarily imply that the new model is superior to the old one.
• There are several misconceptions about 𝑅2 . In general, 𝑅2 does not measure the magnitude of
the slope of the regression line. A large value of 𝑅2 does not imply a steep slope. Furthermore, 𝑅2
does not measure the appropriateness of the model because it can be artificially inflated by adding
higher order polynomial terms in 𝑥 to the model.
In general, more traffic jams leave less time for recreation after work, and therefore a lower happiness index for urban citizens.
Aim: The response variable here is the number of extra hours (Y) the dwellers can enjoy, and we attempt to predict it from the number of traffic jams (X).
1. Estimation of the regression line. The estimated regression line, from Example 10.2, has the equation y = 72.3 − 4.14 x.
Notice the negative slope: with each additional traffic jam, we expect 4.14 fewer hours of recreation after work.
From the ANOVA table computed in Example 10.2 above, s² = MSE = 98.2, and
R² = SSR/SST = 961/1452 = 0.662.
That is, only 66.2% of the total variation of the number of extra relaxing hours Y is explained by the number of traffic jams (per month) X.
The regression model
E[Y] = μ_y = β₀ + β₁ x  (10.14)
is significant as long as its slope β₁ is not zero, i.e., the null H₀: β₁ = 0 is rejected.
A more universal, and therefore, more popular method of testing significance of a model is the
ANOVA F-test. It compares
the portion of variation explained by regression with
the portion that remains unexplained.
• Each portion of the total variation is measured by the corresponding sum of squares.
• Dividing each SS by its number of degrees of freedom, we obtain the mean squares, as given in Table 10.2:
MSR = SSR/1 = SSR,  MSE = SSE/(n − 2).
Knowledge box 8.
The F-test statistic for model significance is F = MSR/MSE, which under H₀: β₁ = 0 has the Fisher distribution F[1, n − 2].
SUMMARY 2.
ANOVA F-test is always one-sided and right-tail because only large values of the F-statistic show
a large portion of explained variation and the overall significance of the model. See Chapter ?? for
background of Fisher distribution.
We now have two tests for model significance: a T-test for the regression slope and the ANOVA F-test. For univariate regression they are absolutely equivalent. In fact, the F-statistic equals the squared T-statistic for testing H₀: β₁ = 0, due to Equation (10.13):
T² = b₁²/(s²/S_xx) = . . . = r² SST/s² = SSR/s² = F.  (10.15)
Hence, both tests give us the same result. Note that the T-statistic itself is
T = b₁/√(s²/S_xx).  (10.16)
4. Inference about the slope. Is the slope statistically significant? That is, does the data favor Hₐ: β₁ ≠ 0? We compute the T-statistic
t = b₁/S.E.(b₁) = b₁/(s/√S_xx)
and the P-value P = 2 P[T > |t|] = P[T < −t] + P[T > t].
In the [Transportation Science example], does the number of extra relaxing hours Y really depend on the number of traffic jams X? We test the null hypothesis H₀: β₁ = 0 by computing the T-statistic as in Equation (10.16):
t = b₁/√(s²/S_xx) = −3.13.
Checking the T-distribution table with 5 d.f., we find that the P-value
𝑝 = 2 P[𝑇 > 3.13] for the two-sided test is between 0.02 and 0.04.
We conclude that the slope is moderately significant. Precisely, it is significant (𝛽1 ̸= 0) at any level
𝛼 ≥ 0.04 and not significant at any 𝛼 ≤ 0.02.
5. ANOVA F-test. We know from Equation (10.15) that a similar result can be found by the F-test. From Equation (10.15), the F-statistic F = 9.79 = t² is not significant at the 0.025 level, but significant at the 0.05 level.
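As a quick hedged check of these table lookups (only the statistics quoted above are used), R's t and F distribution functions give the exact P-values:

2 * pt(abs(-3.13), df = 5, lower.tail = FALSE)  # two-sided P-value, about 0.026
pf(9.79, df1 = 1, df2 = 5, lower.tail = FALSE)  # essentially the same P-value from the F-test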
A major application of regression analysis is making forecasts, predictions of the response variable 𝑌
based on the known or controlled predictors 𝑋.
Confidence Interval for the Mean of 𝑌 : places an upper and lower bound around the point estimate
for the mean (average value) of 𝑌 given 𝑋 = 𝑥.
Prediction Interval for an Individual 𝑌 : places an upper and lower bound around the point estimate
for an individual value of 𝑌 given 𝑋 = 𝑥.
Let x* be the value of the predictor X. The corresponding predicted value of the response Y is
ŷ* = Ĝ(x*) = b₀ + b₁ x*.
Question 2.
a/ How reliable are regression predictions, and
b/ how close are they to the real true values?
To answer, we
1. estimate the mean μ* = E[Y | X = x*] of the response and construct a (1 − α) 100% confidence interval for it,
2. and compute a (1 − α) 100% prediction interval for the actual value of Y = y* when X = x*.
The expectation
μ* = E[Y | X = x*] = G(x*) = β₀ + β₁ x*  (10.17)
is a population parameter: the mean response for the entire sub-population of units where the independent variable X equals x*.
For example, it corresponds to the average price of all houses with the area 𝑥* = 2500 square feet.
HOW TO DO? Given data D = {(x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ)}, we follow the steps below. First, we estimate μ* by
ŷ* = b₀ + b₁ x* = ȳ − b₁ x̄ + b₁ x* = ȳ + b₁ (x* − x̄) = Σ_{i=1}^n mᵢ yᵢ,
with mᵢ = 1/n + (x* − x̄)(xᵢ − x̄)/S_xx, due to b₁ = [Σ_{i=1}^n (xᵢ − x̄) yᵢ]/S_xx and ȳ = (Σ yᵢ)/n.
We see again that the estimator ŷ* is a linear function of the responses yᵢ. Then, under the standard regression assumptions, ŷ* is normal, with expectation
E[ŷ*] = E[b₀ + b₁ x*] = β₀ + β₁ x* = μ*.
Then, we estimate the regression variance σ² by s² and obtain the following confidence interval for the mean response:
(1 − α) 100% CI for μ*:  b₀ + b₁ x* ± t_{n−2, α/2} s √h,  where h = 1/n + (x* − x̄)²/S_xx.
Often we are more interested in predicting the actual response rather than the mean of all possible
responses. For example, we may be interested in the price of one particular house that we are planning
to buy, not in the average price of all similar houses. Instead of estimating a population parameter, we
are now predicting the actual value of a random variable.
Definition 10.3.
An interval [𝑎, 𝑏] is a (1 − 𝛼) 100% prediction interval for the individual response 𝑌 corresponding
to predictor 𝑋 = 𝑥* if it contains the value of 𝑌 with probability (1 − 𝛼),
P[𝑎 ≤ 𝑌 ≤ 𝑏 | 𝑋 = 𝑥* ] = 1 − 𝛼 (10.19)
This time, all three quantities Y, a, and b are random variables. We predict Y by ŷ* = b₀ + b₁ x* and estimate the standard deviation
Std(Y − ŷ*) = √(V[Y] + V[ŷ*]) = σ √(1 + 1/n + (x* − x̄)²/S_xx).  (10.20)
The resulting interval is
(1 − α) 100% PI:  b₀ + b₁ x* ± t_{n−2, α/2} s √k,  (10.21)
where k = 1 + 1/n + (x* − x̄)²/S_xx. A short R sketch of both intervals follows the remarks below.
2. Second, we get more accurate estimates and more accurate predictions from large samples. When
the sample size 𝑛 (and therefore, typically, 𝑆𝑥𝑥 ), tends to ∞, the margin of the confidence interval
converges to 0. Besides, the margin of a prediction interval converges to (𝑡𝛼/2 𝜎). As we collect
more observations, our estimates of 𝑏0 and 𝑏1 become more accurate; however, uncertainty about
the individual response 𝑌 will never vanish.
Figure 10.3: Illustration for prediction interval for the individual response
3. Prediction bands: For all possible values of a predictor 𝑥* , we can prepare a graph of (1 − 𝛼)
prediction bands given by Equation 10.21, see Figure 10.3. Then, for each value of 𝑥* , one can
draw a vertical line and obtain (1 − 𝛼) 100% prediction interval between these bands.
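A minimal sketch of the intervals (10.17)–(10.21) in R, reusing the world-population model M1 from Chapter 9 (the year 2015 is just an illustrative x*); predict() returns both interval types directly:

new.x <- data.frame(x = 2015)
predict(M1, newdata = new.x, interval = "confidence", level = 0.95)  # CI for the mean mu*
predict(M1, newdata = new.x, interval = "prediction", level = 0.95)  # PI for an individual Y
# the prediction interval is always wider, since k = 1 + h in (10.21)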
• Estimating the model parameters requires assuming that the errors are uncorrelated random vari-
ables with mean zero and constant variance.
• Tests of hypotheses and interval estimation require that the errors be normally distributed.
• In addition, we assume that the order of the model is correct; that is, if we fit a simple linear regression model, we are assuming that the phenomenon actually behaves in a linear or first-order manner.
𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖 , 𝑖 = 1, 2, . . . , 𝑛. (10.22)
The residuals from a regression model are êᵢ = yᵢ − ŷᵢ, where yᵢ is an actual observation and ŷᵢ is the corresponding fitted value from the regression model. Analysis of the residuals is helpful
• in checking the assumption that the errors are approximately normally distributed with constant variance, and
• in determining whether additional terms in the model would be useful.
Assumption 1: The disturbances (random errors) have zero mean, i.e., E[eᵢ] = 0 for every i = 1, 2, . . . , n. This assumption is needed to ensure that, on average, we are on the true line.
Assumption 2: The disturbances have a constant variance, i.e., V[eᵢ] = σ² for every i = 1, 2, . . . , n. This ensures that every observation is equally reliable.
If V[𝑒𝑖 ] = 𝜎𝑖2 , each observation has a different variance. An observation with a large variance is less
reliable than one with a smaller variance.
Assumption 4: The explanatory variable (predictor) 𝑋 is nonstochastic, i.e., fixed in repeated sam-
ples, and hence, not correlated with the disturbances.
Also, (Σ_{i=1}^n xᵢ²)/n ≠ 0 and has a finite limit as n → ∞.
• The analyst should always consider the validity of these assumptions to be doubtful and conduct
analyses to examine the adequacy of the model that has been tentatively entertained. In this section,
we discuss methods useful in this respect.
• As an approximate check of normality, the experimenter can construct a frequency histogram of the
residuals or a normal probability plot of residuals.
• Many computer programs will produce a normal probability plot of residuals, and because the sample
sizes in regression are often too small for a histogram to be meaningful, the normal probability plotting
method is preferred. It requires judgment to assess the abnormality of such plots.
It is also helpful to plot the residuals (1) in time sequence (if known), (2) against the fitted values ŷᵢ, and (3) against the predictor values xᵢ.
These graphs will usually look like one of the four general patterns shown in Figure 10.4.
Pattern (a) represents the ideal situation, and patterns (b), (c), and (d) represent anomalies. If the
residuals appear as in (b), the variance of the observations may be increasing with time or with the
magnitude of 𝑦𝑖 or 𝑥𝑖 .
Example 10.5. Look back to the data SOCELL.csv in the case study on Aviation Engineering of
Example 10.1 with sample info in Table 10.1.
In Table 10.3 we present the values of ISC at time 𝑡2 , 𝑦, and their predicted values, according to
those at time 𝑡1 and 𝑦ˆ. We present also a graph (Figure 10.5) of the residuals, 𝑒ˆ = 𝑦 − 𝑦ˆ, versus the
predicted values 𝑦ˆ.
DISCUSSION.
1. The analysis from Example 10.1 has shown that the least squares regression (prediction) line is ŷ = 0.536 + 0.929 x. We read also that the coefficient of determination is R_xy² = 0.959. This means that only about 4% of the variability in the ISC values at time period t₂ is not explained by the linear regression on the ISC values at time t₁.
2. Observation #9 is an “unusual observation.” It has relatively a lot of influence on the regression line, as can be seen in Figure 10.1. Here we see that
S_e² = [(n − 1)/(n − 2)] S²_{y|x}.
The value of S_e² in the above analysis is 0.0076; the standard deviation of the residuals around the regression line is S_e = 0.08709. This explains the high value of R_xy².
[Source [56]]
CHAPTER 11. REGRESSION ANALYSIS III: MULTIPLE REGRESSION ANALYSIS
11.1 Introduction
After careful study of this chapter, you should be able to do the following:
Learning Objectives
• perform the statistical tests and confidence procedures that are analogous to those for simple linear
regression, and check for model adequacy.
11.1.1 Setting
Model the influence of advertising time x on the number of positive reactions y from the public. We have a simple linear regression model, described by
Y = β₀ + β₁ X + ε,
Ŷ = μ_Y = E[Y | X = x] = β₀ + β₁ x.
Here p − 1 = 1, Y holds the number of positive reactions caused by the amount of advertising time x, and the number of observations is n ≥ 2.
Example 11.2.
Suppose we develop an empirical model linearly relating the viscosity Y of a polymer to two independent variables or regressor variables X₁, X₂:
the temperature X₁ and the catalyst feed rate X₂.
Here p − 1 = 2, and a model that might describe this linear relationship is
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀 (11.1)
This model represents that the viscosity 𝑌 is influenced by the temperature 𝑋1 and the catalyst feed
rate 𝑋2 , plus a stochastic component 𝜀 [expressing uncertainty or uncontrolled noises of real world].
The term linear is used because Equation (11.1) is a linear function of the p = 3 unknown parameters β₀, β₁, β₂. The regression model (11.1) describes a plane in the three-dimensional space of Ŷ, x₁ and x₂.
𝑌ˆ = 50 + 10𝑥1 + 7𝑥2
where we have assumed that the expected value of the errors E(𝜀) = 0.
Interaction effects can appear in, and be analyzed via, a multiple linear regression model, e.g. by adding a cross-product term into Equation (11.1) to get
Ŷ = μ_Y = E[Y | X₁ = x₁, X₂ = x₂] = β₀ + β₁ x₁ + β₂ x₂ + β₁₂ x₁ x₂.
Similarly to the method of least squares (OLS) for a single predictor, here we look for a plane
ŷ = β₀ + β₁ x₁ + β₂ x₂.
Figure 11.1: The regression plane for the model 𝑌^ = 50 + 10𝑥1 + 7𝑥2
• The expression 𝑄 = 𝑆𝑆𝐸 = ℎ(𝛽0 , 𝛽1 , 𝛽2 ) depends on three unknowns 𝛽0 , 𝛽1 , 𝛽2 , and we find its
extreme by taking partial derivatives of
𝑛
∑︁
𝑄= (𝑦𝑖 − 𝛽0 − 𝛽1 𝑥𝑖1 − 𝛽2 𝑥𝑖2 )2 ,
𝑖=1
• The least squares estimates [shown explicitly in Equations (11.14), (11.15) and (11.16)]
Note that there are 𝑝 = 𝑘 + 1 normal equations, one for each of the unknown regression coeffi-
cients. The normal equations can be solved by any method appropriate for solving a system of linear
equations.
NOTATION
• 𝑥𝑖𝑗 is the value of the 𝑗-th predictor 𝑋𝑗 at the 𝑖-th observation (observation 𝑖 = 1, 2, . . . , 𝑛 and predictor
𝑗 = 1, 2, . . . , 𝑝 − 1);
f : X₁, X₂, · · · , X_{p−1} ⟶ Y = f(X₁, X₂, · · · , X_{p−1}),
y = f(X) + e.  (11.5)
We intend to obtain a good overall fit of the model and easy mathematical tractability. The most
mathematically tractable model 𝑓 is a linear one:
1. A linear model may firstly serve as a suitable approximation to several nonlinear functional relationships.
3. Thirdly, simplicity is best: many relationships in reality need just a linear function of the predictors X₁, X₂, · · · , X_{p−1} to describe the context. Linear models also help guarantee the inclusion of important variables and the exclusion of unimportant ones.
Assumption 1: The deviations (random errors) have zero mean, i.e., E[eᵢ] = 0 for every i = 1, 2, . . . , n. This assumption is needed to ensure that, on average, we are on the true line.
Briefly, Assumptions 1-3 mean the deviations 𝑒𝑖 are an SRS from the N(0, 𝜎 2 ) distribution.
As part of a recent study titled “Predicting Success for Actuarial Students in Undergraduate Mathe-
matics Courses,” data from 106 Mahidol Uni. actuarial graduates were obtained. The researchers
were interested in describing how students’ overall math grade point averages (GPA) are ex-
plained by SAT Math and SAT Verbal scores, class rank, and faculty of science’s mathematics
placement score.
yᵢ = β₀ + Σ_{j=1}^{p−1} βⱼ xᵢⱼ + eᵢ,  i = 1, 2, . . . , n,
where β₀, β₁, β₂, · · · , β_{p−1} are the linear regression coefficients, and the eᵢ are random errors, eᵢ ∼ N(0, σ²), for i = 1, 2, . . . , n.
With x₍ᵢ₎ denoting particular values of the predictor Xᵢ, Model (11.6) implies that the conditional expectation E[Y | X₁ = x₍₁₎, X₂ = x₍₂₎, . . . , X_{p−1} = x₍ₚ₋₁₎] of the response Y [see Definition 9.1] is generated by a linear combination of the predictor variables: y = E[Y] = X β, or explicitly
E[Y | X₁ = x₍₁₎, X₂ = x₍₂₎, . . . , X_{p−1} = x₍ₚ₋₁₎] = β₀ + β₁ x₍₁₎ + . . . + β_{p−1} x₍ₚ₋₁₎.  (11.7)
The vector β = (β₀, β₁, . . . , β_{p−1})′ collects the regression coefficients.
How to estimate 𝜎 2 ?
The parameter 𝜎 2 measures the variability of the responses about the population regression equa-
tion. As in the case of simple linear regression, we estimate 𝜎 2 by an average of the squared residuals.
The estimator is
s² = SSE/(n − p) = (Σ eᵢ²)/(n − p).  (11.9)
• The degrees of freedom equal the sample size, 𝑛, minus 𝑝, the number of regression coefficients 𝛽𝑗 ’s
we must estimate to fit the model.
We find values for the p parameters β₀, β₁, . . . , β_{p−1} which minimize the sum of squared differences
Σ_{i=1}^n eᵢ² = e′e = (y − Xβ)′(y − Xβ).
• xᵢ′ · β = β₀ + xᵢ₁ β₁ + xᵢ₂ β₂ + · · · + x_{i,p−1} β_{p−1}, where xᵢ′ = [1, xᵢ₁, · · · , x_{i,p−1}] is the i-th row of the matrix X (recording the values of the predictors X₁, · · · , X_{p−1});
A successful choice of the regression parameter vector β is indicated by small values of all eᵢ. There are quite a few conceivable principles by which the quality of an actual choice of β may be evaluated. Among others, the following measures of the residual sum S(β) have been proposed:
* S(β) = Σ_{i=1}^n |eᵢ|;  S(β) = max_{i=1..n} |eᵢ|;  or, using Euclidean distance,
S(β) = Σ_{i=1}^n eᵢ² = e′e.  (11.10)
The first two proposals are subject to either complicated mathematics or poor statistical properties; the last principle has become widely accepted, providing the basis for the famous method of least squares of Chapter 9.
We can do least squares [similarly as in Chapter 2, Section 9.3.2] to find estimates 𝑏𝑗 = 𝛽ˆ𝑗 of regression
coefficients 𝛽𝑗 that minimizes 11.10, the sum of squared residuals,
𝑛
∑︁
𝑆(𝛽) = 𝑒2𝑖 = 𝑒′ 𝑒 = (𝑦 − X 𝛽)′ (𝑦 − X 𝛽) = 𝑆𝑆𝐸 (11.11)
𝑖=1
S(β) = y′y + β′X′Xβ − 2β′X′y.
CONCEPT 8. Generalized inverse (g-inverse) of a square matrix 𝐴, written 𝐴− is the matrix that
satisfies 𝐴𝐴− 𝐴 = 𝐴.
∂S(β)/∂β = 2X′Xβ − 2X′y.
When X′X has full rank, the matrix X′X is non-singular, and the estimated (best-fitting) coefficients b = β̂ are found by
b = β̂ = (X′X)⁻¹X′y.  (11.12)
𝑦 = 𝛽1 𝑥 2 + 𝛽2 𝑥 + 𝛽3 ,
𝑥 0 1 2 3 4
𝑦 1 0 3 5 8
𝑌 = 𝛽2 𝑋2 + 𝛽1 𝑋1 + 𝛽3 + 𝜀, where 𝑋1 = 𝑋, 𝑋2 = 𝑋 2
The fitted model is ŷ = β̂₂ x² + β̂₁ x + β̂₃. This looks exactly like a multiple regression equation with two predictors. The design (observed data) matrix X (where the last column matches the coefficient β₃ and the second column matches the coefficient β₁), and its transpose, are
X = ⎡  0   0   1 ⎤        X′ = ⎡ 0  1  4  9  16 ⎤
    ⎢  1   1   1 ⎥             ⎢ 0  1  2  3   4 ⎥
    ⎢  4   2   1 ⎥             ⎣ 1  1  1  1   1 ⎦
    ⎢  9   3   1 ⎥
    ⎣ 16   4   1 ⎦
(the columns of X correspond to x⁽²⁾, x⁽¹⁾ and the constant 1),
and the vector of fitted values ŷ = X β̂ = (0.6, 1, 2.4, 4.8, 8.2)′, as compared with the vector of observations y = (1, 0, 3, 5, 8)′.
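A hedged sketch verifying these fitted values in R, both through the closed form (11.12) and through lm() (the data are exactly the five points above):

x <- 0:4
y <- c(1, 0, 3, 5, 8)
X <- cbind(x^2, x, 1)               # design matrix with columns x^2, x, 1
b <- solve(t(X) %*% X, t(X) %*% y)  # least squares via the normal equations (11.12)
as.vector(X %*% b)                  # fitted values: 0.6 1.0 2.4 4.8 8.2
fitted(lm(y ~ I(x^2) + x))          # the same fit via lm()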
Theorem 11.1. Vector 𝑏 minimizes the sum of squared errors if and only if it is a solution of (X′ X) 𝛽 =
X′ 𝑦. All solutions are located on the hyperplane X𝑏.
X′ X 𝛽 = X′ 𝑦
• An important property of the sum of squared errors S(b) = ê′ê = SSE is
y′y = ŷ′ŷ + ê′ê.  (11.13)
• The coefficient bⱼ, for each j, represents the expected change in the response Y per unit change in Xⱼ when all the remaining predictors Xᵢ (i ≠ j) are held constant.
Knowledge box 9.
The least squares estimates
b = β̂ = (X′X)⁻¹X′y
are obtained by solving the normal equations X′X β = X′y.
Recall that the sample correlation (Pearson's sample correlation) of two samples x and y [or two variables] is given as r_xy = s_xy/(s_x · s_y), where
s_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ)/(n − 1)
is their sample covariance.
The multiple regression linear model, in the case of two predictors, assumes the form yᵢ = β₀ + β₁ x₁ᵢ + β₂ x₂ᵢ + eᵢ, i = 1, 2, . . . , n, where the eᵢ are independent r.v.'s with E(eᵢ) = 0 and V(eᵢ) = σ².
The principle of least squares, as described in Knowledge Box 9, calls for the minimization of S(β); setting
∂S(β)/∂β = 2X′Xβ − 2X′y = 0
yields the least squares estimators b₀, b₁ and b₂ of the regression coefficients β₀, β₁, β₂. They can be written explicitly in terms of the sample moments, where s²_{x₁}, s²_{x₂} denote the sample variances, and s_{x₁x₂}, s_{x₁y} and s_{x₂y} the covariances of x₁, x₂ and y.
CONCEPT 9.
• The values 𝑦̂︀𝑖 = 𝑏0 + 𝑏1 𝑥𝑖1 + 𝑏2 𝑥𝑖2 , 𝑖 = 1, 2, . . . , 𝑛 are called the predicted response values of the
regression, and
• The residual standard error s = √( S²_{y|(x₁,x₂)} / (n − p) ).
We illustrate the building of a multiple regression model when 𝑘 = 2 and interpret the meaning of
the variables involved.
Define a response variable 𝑌 = 𝐼𝑄 to be the human intelligence index, assuming that it de-
pends on two variables
𝑋1 = 𝐸𝐷𝑈 (education level, in years of schooling), and
𝑋2 = ℎ (height).
𝑋1 = 𝐸𝐷𝑈 6 7 7 8 10 12 15 16 18
𝑋2 = ℎ (ℎ𝑒𝑖𝑔ℎ𝑡) 1.6 1.65 1.72 1.59 1.68 1.61 1.77 1.7 1.69
The HI index 𝑌 140 155 150 141 147 166 176 183 199
D0 = data.frame(x1, x2, y);   # vectors x1, x2, y hold the table values above
M0 = lm(y ~ x1 + x2, data = D0); anova(M0); summary(M0);
Residuals:
Min 1Q Median 3Q Max
-10.8804 -4.6568 0.4855 4.0847 10.3512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.1888 84.3722 1.010 0.35162
x1 4.2296 0.7173 5.897 0.00106 **
x2 18.0925 52.8039 0.343 0.74356
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
See Section 11.5 for practical usages of regression on three and four predictors. For many predictors,
see Section 11.6 for a realistic case study about Climate Change’s impacts on agriculture in Mekong
Delta Region, Vietnam.
In this section, we briefly discuss several other aspects of building multiple regression models. The linear model
Y = X β + ε = β₀ + X₁ β₁ + · · · + X_{p−1} β_{p−1} + ε
is a general model that can be used to fit any relationship that is linear in the unknown parameters β. This includes two important classes: models with interactions and polynomial regression models.
If the change in the mean 𝑦 value associated with a 1-unit increase in one independent variable de-
pends on the value of a second independent variable, there is interaction between these two variables.
Denoting the two independent variables by 𝑋1 , 𝑋2 , we can model this interaction by including as an
additional predictor 𝑋3 = 𝑋1 𝑋2 , the product of the two independent variables.
• When 𝑋1 and 𝑋2 do interact, this model will usually give a much better fit to resulting data than would
the no-interaction model.
• Failure to consider a model with interaction too often leads an investigator to conclude incorrectly
that the relationship between 𝑌 and a set of independent variables is not very substantial.
• In applied work, quadratic predictors 𝑋12 and 𝑋22 are often included to model a curved relationship.
This leads to the full quadratic or complete second-order model
y = E[Y | X₁ = x₁, X₂ = x₂] = β₀ + β₁ x₁ + β₂ x₂ + β₃ x₁ x₂ + β₄ x₁² + β₅ x₂².  (11.20)
Suppose that an industrial chemist is interested in the product yield (Y) of a polymer being influenced by two independent variables (predictors) X₁, X₂, and possibly their interaction. Here
X₁ = reaction temperature and
X₂ = pressure at which the reaction is carried out.
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽12 𝑋1 𝑋2 + 𝜀 (11.21)
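As a hedged sketch of fitting a model such as (11.21) in R (the data frame polymer and its columns yield, temp, press are hypothetical, generated here only for illustration), the formula operator * expands to main effects plus the interaction:

set.seed(1)
polymer <- data.frame(temp = runif(20, 80, 120), press = runif(20, 1, 5))
polymer$yield <- 10 + 0.5 * polymer$temp + 2 * polymer$press +
                 0.05 * polymer$temp * polymer$press + rnorm(20)
fit <- lm(yield ~ temp * press, data = polymer)  # temp*press = temp + press + temp:press
summary(fit)  # the temp:press row estimates the interaction coefficient beta_12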
Generally, interaction implies that the effect produced by changing one variable (x₁, say) depends on the level of the other variable (x₂). Figure 11.2(b), the three-dimensional plot of the regression model with the interaction term, shows that changing x₁ from 2 to 8 produces a much smaller change in E[Y] when x₂ = 2 than when x₂ = 10.
Concept: In general, any regression model that is linear in the parameters β is called a linear regression model, regardless of the shape of the surface that it generates.
Let’s return for a moment to the case of bivariate data D consisting of 𝑛 pairs of (𝑥, 𝑦). Suppose that
a scatter plot shows a parabolic (in Figure 11.3.b) rather than linear shape. Then it is natural to specify
a quadratic regression model, via a second-degree polynomial in one variable
𝑌 = 𝛽0 + 𝛽1 𝑋 + 𝛽2 𝑋 2 + 𝜀, (11.22)
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝜀, where 𝑋1 = 𝑋, 𝑋2 = 𝑋 2
Now this looks exactly like a multiple regression equation with two predictors.
• The message at the moment is that quadratic regression is a special case of multiple regression.
Thus any software package capable of carrying out a multiple regression analysis can fit the quadratic
regression model.
• The same is true of cubic regression and even higher-order polynomial models, although in practice
very rarely are such higher-order predictors needed.
• Polynomial regression models are widely used when the response is curvilinear (see Figure
11.3.b) because the general principles of multiple regression can be applied.
Figure 11.3: Linear and nonlinear model of the U.S. population regression
• However, the interpretation of βᵢ given previously for the general multiple regression model is not legitimate in quadratic regression, and in polynomial regression in general. This is because X₂ = X², so the value of X₂ cannot be increased while X₁ = X is held fixed.
One can often reduce variability around the trend and do more accurate analysis by adding non-linear terms into the regression model.
Recall the linear model fitted to the world population observed in the years x = c(1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010), which allows us to predict the world population for the years from 2015 onward.
• This model G(x) has a pretty good fit. However, a linear model gives a poor prediction of the U.S. population between 1790 and 2010 (see Figure 11.3.a).
• The population growth over a longer period of time is clearly nonlinear. On the other hand, a quadratic model in Figure 11.3.b gives an amazingly excellent fit!
E[population Y] = β₀ + β₁ x₁ + β₂ x₂,  with x₁ = x (the year) and x₂ = x².
In order to develop hypothesis tests and confidence intervals for the regression coefficients in the subsequent Chapter ??, the standard deviations of the estimated coefficients are needed. These can be obtained from a certain covariance matrix, a matrix with the variances on the diagonal and the covariances in the off-diagonal elements. In particular, for a linear transformation V = A B of a random vector B,
Cov(V) = A Cov(B) A′.
Problem 11.4.
Consider the linear regression model y = X β + e. Here e is a vector of random variables such that E[eᵢ] = 0 for all i = 1, . . . , n, and, for all i, j = 1, . . . , n,
Cov(eᵢ, eⱼ) = σ² when i = j, and 0 when i ≠ j.
Show that the variance-covariance matrix of the LSE b = β̂ = (X′X)⁻¹X′y is Cov(b) = σ² (X′X)⁻¹.
We illustrate the building of a multiple regression model when 𝑘 = 3 and interpret the meaning of
the variables involved. Define a response variable 𝑌 = 𝐼𝑄 to be the human intelligence index,
assuming that it depends on three variables 𝑋1 = 𝐸𝐷𝑈 (education level, in years of schooling),
𝑋2 = ℎ (ℎ𝑒𝑖𝑔ℎ𝑡) and 𝑋3 = 𝑔 (𝑠𝑒𝑥).
𝑋1 = 𝐸𝐷𝑈 6 7 7 8 10 12 15 16 18
𝑋2 = ℎ (ℎ𝑒𝑖𝑔ℎ𝑡) 1.6 1.65 1.72 1.59 1.68 1.61 1.77 1.7 1.69
𝑋3 = 𝑔 (𝑠𝑒𝑥) F F M F M F M F M
The HI index 𝑌 140 155 150 141 117 126 176 183 199
We observe n = 9 persons, and get the responses y = c(140, 155, 150, 141, 117, 126, 176, 183, 199):
x1 = c(6, 7, 7, 8, 10, 12, 15, 16, 18);
x2 = c(1.6, 1.65, 1.72, 1.59, 1.68, 1.61, 1.77, 1.7, 1.69);
x3 = c('F', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M');
fx3 = factor(x3);   # code the categorical sex variable as a factor
y = c(140, 155, 150, 141, 117, 126, 176, 183, 199);
M = lm(y ~ x1 + x2 + fx3 + x1:fx3); anova(M); summary(M);
Call: lm(formula = y ~ x1 + x2 + fx3 + x1:fx3)
Residuals:
1 2 3 4 5 6 7 8 9
2.041 -0.249 17.665 6.041 -17.963 -16.625 -16.87 8.79 17.17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -409.9486 327.6905 -1.251 0.279
x1 0.2084 2.8920 0.072 0.946
x2 341.6604 209.9772 1.627 0.179
fx3M -83.3929 50.7856 -1.642 0.176
We illustrate the empirical model fitting of a multiple regression on a data set GASOL.csv with four
predictors in oil industry (see Kenett [?]). The data set consists of 32 measurements of distillation
properties of crude oils. There are five variables 𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 and 𝑦, given as
The measurements of crude oil and gasoline volatility indicate the temperatures at which a given amount of liquid has evaporated. From the sample correlations between these five variables, we see that the yield y is highly correlated with x₄ and with x₂ (or x₃).
Computation on R .
> data(GASOL)
> LmYield <- lm(yield ~ 1 + astm + endPt, data=GASOL)
> summary(LmYield)
Call:
lm(formula = yield ~ 1 + astm + endPt, data = GASOL)
Residuals:
Min 1Q Median 3Q Max
-3.9593 -1.9063 -0.3711 1.6242 4.3802
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.467633 3.009009 6.137 1.09e-06 ***
astm -0.209329 0.012737 -16.435 3.11e-16 ***
endPt 0.155813 0.006855 22.731 < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.426 on 29 degrees of freedom
Multiple R-squared: 0.9521, Adjusted R-squared: 0.9488
F-statistic: 288.4 on 2 and 29 DF, p-value: < 2.2e-16
We now compute these estimates of the regression coefficients using the formulae in Section 11.3.1. From the variances and covariances of x₃, x₄ and y (e.g., the sum of squared deviations of ASTM x₃ is 1409.355), the means of these variables are X̄₃ = 241.500, X̄₄ = 332.094, Ȳ = 19.6594. Thus, the least-squares estimates b₁ and b₂ of β₁ and β₂ are obtained by solving the equations
1409.355 b₁ + 1079.565 b₂ = −126.808
1079.565 b₁ + 4865.894 b₂ = 532.188.
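As a hedged numerical check (using only the two equations and the means above), R's solve() reproduces the coefficients reported by lm() earlier:

A <- matrix(c(1409.355, 1079.565,
              1079.565, 4865.894), nrow = 2, byrow = TRUE)
b <- solve(A, c(-126.808, 532.188))        # b1 = -0.2093 (astm), b2 = 0.1558 (endPt)
19.6594 - b[1] * 241.500 - b[2] * 332.094  # intercept b0 = 18.47, matching the output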
Predictors that could affect BPH growth have their measured values shown in the above table. But what factors have the most significant impacts on the BPH growth?
(Table columns: Long., Lat., Rice, Seed., Temp, Humi., Water, Leaf, Grass, and three No. columns.)
No Interaction Terms
> DataBPH = read.csv("BPH.csv", sep = ';')
> ncol(DataBPH); nrow(DataBPH);
> Y = DataBPH$BPH;   # the response: BPH counts
> out = lm(Y ~ x2 + x6 + x7 + x8, data = DataBPH)
> anova(out)
> summary(out)
Call:
lm(formula = BPH ~ x2 + x6 + x7 + x8)
Residuals:
Min 1Q Median 3Q Max
-556.7 -223.4 -120.2 55.7 12531.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
With Interaction Terms (the same model refitted with the cross-product terms x2:x6, x6:x7, x6:x8 and x7:x8, as shown in the coefficient list below):
Residuals:
Min 1Q Median 3Q Max
-595.3 -212.5 -119.5 47.5 12486.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.293e+04 1.221e+04 1.059 0.2899
x2 -1.213e-02 1.121e-02 -1.082 0.2797
x6 -1.169e+02 1.473e+02 -0.794 0.4276
x7 -1.062e+02 5.240e+01 -2.026 0.0431 *
x8 3.674e+02 3.855e+02 0.953 0.3408
x2:x6 1.214e-04 1.353e-04 0.897 0.3697
x6:x7 1.984e-01 6.394e-01 0.310 0.7564
x6:x8 -6.422e+00 4.616e+00 -1.391 0.1645
x7:x8 1.797e+01 9.103e+00 1.974 0.0487 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> fx3=factor(x3)
# x3 (Leaf color) is categorical;
# fx3 is the dummy variable of x3,
# now eligible for using in linear model
# need the theory of Analysis of Covariance (ANCOVA) to explain
> out3=lm(Y~ fx3+ x6+ x7+ x8+ fx3:x6+ x6:x7+ x6:x8+ x7:x8)
> anova(out3)
> summary(out3)
Residual standard error: 588.7 on 733 degrees of freedom
Multiple R-squared: 0.1523,Adjusted R-squared: 0.02974
F-statistic: 1.243 on 106 and 733 DF, p-value: 0.06013
𝑌 = X𝛽 + 𝜀 = 𝛽0 + 𝑋1 𝛽1 + · · · + 𝑋𝑝−1 𝛽𝑝−1 + 𝜀,

with least-squares estimate b = 𝛽̂ = (X′X)−1 X′y.
Furthermore,
• If we have 𝑛 > 1 observations measured simultaneously at the predictors, the true linear regression model (at the 𝑖-th observation) is 𝑌𝑖 = 𝛽0 + 𝛽1 𝑋𝑖,1 + · · · + 𝛽𝑝−1 𝑋𝑖,𝑝−1 + 𝜀𝑖 .
• Stochastic Processes use probabilistic models to study uncertainty. Probabilistic models often involve several popular random variables of interest, such as Poisson, exponential or Gaussian ones. All of these random variables are associated with the same experiment, sample space, and probability law, and their values may relate in interesting ways.
• Stochastic Analysis uses probabilistic models, simulation theory and computation to study un-
certainty.
Chapter 12
Stochastic Process
Characterizing systems with random spatial-temporal fluctuations
[Source [9]]
• Markov processes
1. many classes of jobs (clients) arrive at a system with distinct rates, which demands a wise policy to get them through efficiently;
2. measuring the performance of a system through many different parameters (metrics) is hard and requires complex mathematical models.
B. Evolutionary Dynamics
Keywords: critical lineages, virus mutant, mutation, reproductive ratio, invasion, ecology, vaccine.
Introductory Biomatics: Invasion and Escape. Some realistic biological phenomena occur in nature, such as: (a) a parasite infecting a new host, (b) a species trying to invade a new ecological niche, (c) cancer cells escaping from chemotherapy.
Typical problems. Imagine a virus of one host species that is transferred to another host species
(HIV, SARS). In the new host, the virus has a basic reproductive ratio 𝑅 < 1.
Figure 12.1: An example of mixed flow of arrivals (jobs, clients, cars, ...). Source [58]
2. How to calculate the probability that a virus quasi-species contains an escape mutant that establishes
an infection and thereby causes vaccine failure?
We need a theory to calculate the probability of non-extinction/escape for lineages; this starts from studying the evolutionary dynamics of single individuals.
C. Brand loyalty in business: Consider a Markov chain M describing the loyalty of customers to three retailers Coop, BigC and Walmart, coded by states 0, 1, and 2 respectively. The transition probability matrix 𝑃 (rows and columns ordered 𝐶, 𝐵, 𝑊 ) is given by

        𝐶    𝐵    𝑊
    ⎡ 0.4  0.2  0.4 ⎤
𝑃 = ⎢ 0.6   0   0.4 ⎥
    ⎣ 0.2  0.5  0.3 ⎦
What is the probability in the long run that the chain M is in state 1, that is, how much chance is there that the customers will go with BigC? We need Markov chain theory to solve this.
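A quick sketch in R of this long-run computation (the stationary distribution is the left eigenvector of 𝑃 for eigenvalue 1, normalized to sum to 1):
P <- matrix(c(0.4, 0.2, 0.4,
              0.6, 0.0, 0.4,
              0.2, 0.5, 0.3), 3, 3, byrow = TRUE)   # rows/cols: C, B, W
e  <- eigen(t(P))                    # left eigenvectors of P
i1 <- which.min(abs(e$values - 1))   # locate the eigenvalue 1
p.star <- Re(e$vectors[, i1]); p.star <- p.star / sum(p.star)
p.star                               # p.star[2] = long-run share of BigC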
We first take a look at few problems that can be solved by the chapter’s methods.
1. A certain product is made by two companies, A and B, that control the entire market.
Currently, A and B have 60 percent and 40 percent, respectively, of the total market.
Each year, A loses 5% of its market share to B, while B loses 3% of its share to A.
Find the relative proportion of the market that each holds after 2 years.
2. Let two gamblers, A and B, initially have 𝑘 dollars and 𝑚 dollars, respectively.
Suppose that at each round of their game, A wins one dollar from B with probability 𝑝 and loses one
dollar to B with probability 𝑞 = 1 − 𝑝.
Assume that A and B play until one of them has no money left. (This is known as the Gambler’s
Ruin problem.)
(a) Show that 𝑋(𝑛) = {𝑋𝑛 , 𝑛 ≥ 0} is a Markov chain with absorbing states.
(b) Find its transition probability matrix 𝑃 . Realize 𝑃 when 𝑝 = 𝑞 = 1/2 and 𝑁 = 4
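A small sketch in R of the matrix asked for in (b), with 𝑝 = 𝑞 = 1/2 and 𝑁 = 4 (states 0, . . . , 4; the boundary states 0 and 𝑁 are absorbing):
p <- 1/2
P <- matrix(0, 5, 5, dimnames = list(as.character(0:4), as.character(0:4)))
P[1, 1] <- 1; P[5, 5] <- 1                            # absorbing states 0 and N
for (i in 2:4) { P[i, i - 1] <- 1 - p; P[i, i + 1] <- p }
P                                                     # each row sums to 1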
Basic concepts
A stochastic process is a mathematical model of a probabilistic experiment that evolves in time and
generates a sequence of numerical values.
A stochastic process is just a collection (usually infinite) of random variables, denoted 𝑋𝑡 or 𝑋(𝑡), where the parameter 𝑡 often represents time.
State space S of a stochastic process consists of all realizations 𝑥 of 𝑋𝑡 , i.e. 𝑋𝑡 = 𝑥 says the
random process is in state 𝑥 at time 𝑡.
Stochastic processes can be generally subdivided into four distinct categories depending on whether
𝑡 ∈ 𝑇 or 𝑋𝑡 ∈ S are discrete or continuous.
1. Discrete processes: both S and 𝑇 are discrete, such as the Bernoulli process or discrete-time Markov chains.
2. Continuous time discrete state processes: the state space S of 𝑋𝑡 is discrete and the index set
(time set) 𝑇 of 𝑡 is continuous, as the reals R = (−∞, ∞) or its intervals.
• Poisson process: the number of clients 𝑋(𝑡) who have entered a bank from the opening time until 𝑡. 𝑋(𝑡) follows a Poisson distribution with mean E[𝑋(𝑡)] = 𝜆𝑡 (𝜆 is the arrival rate).
3. Continuous processes: both state space S and time index set 𝑇 are continuous, such as diffusion
process (Brownian motion).
4. Discrete time continuous state processes: the time index set 𝑇 is discrete, and the state space S is continuous, the so-called TIME SERIES, such as daily stock prices.
Examples
1. Discrete processes: a random walk model consisting of positions 𝑋𝑡 of an object (a drunkard) at hourly time points 𝑡 during 24 hours, whose directional distance from a particular point 0 is measured in integer units. Here 𝑇 = {0, 1, 2, . . . , 24}. See details of the random walk model in Section 13.6.
2. Continuous time discrete state processes: 𝑋𝑡 is the number of infant births in a given population during the time period [0, 𝑡]. Here the time index set is 𝑇 = R+ = [0, ∞) and the state space is S = {0, 1, 2, . . .}.
Realistic data for a financial time series model, from http://cafef.vn, are shown in Table 12.1: 58 records corresponding to the 58 trading days in Quarter 1, 2013. VNM and DPM are two giant firms in Vietnam.
Table 12.1: Data of VNM and DPM stock prices in Quarter 1, 2013.
Seq | Date | VNM: Price, Return, Log return | DPM: Price, Return, Log return
Let {𝑆(𝑡) : 𝑡 ≥ 0} be a stock price process, with the stock price change over the period (𝑘, 𝑘 + 1) given by 𝑅𝑘 = 𝑆(𝑘 + 1)/𝑆(𝑘), named the return of the stock; see Nguyen (2013) [26].
Our stock prices given in Table 12.1 follow a geometric Brownian motion.
[Figure: time-series plots of the daily log returns of VNM (left) and DPM (right) over the trading days; vertical scale ×10−2 .]
• The next figures, 12.3 and 12.4, separately show time-series graphs of the actual stock prices (in blue) of the two data sets, VNM and DPM.
• The other two curves (in red and green) show approximated statistical models (the log-normal and auto-regressive ones) estimated from the same data.
[Figures 12.3 and 12.4: actual daily prices of VNM (about 90 to 105) and DPM (about 36 to 46) over the 58 trading days. Legend: Actual price; Lognormal - Binomial expected price; AR(1) expected price.]
• parameter 𝑡 ∈ 𝑇 is time, with 𝑇 being a set of possible times, usually [0, ∞), (−∞, ∞), N = {0, 1, 2, . . .}, or a subset of these. Formally, a stochastic process is a mapping

𝑋 : 𝑇 × Ω −→ S,   (𝑡, 𝑤) ↦→ 𝑋(𝑡, 𝑤) = 𝑥.
Knowledge box 10. We have known that a stochastic process is a mathematical model of a proba-
bilistic experiment that evolves in time and generates a sequence of numerical values.
(a) The dependencies in a series of values generated by the process. For example, how do
future prices of a stock depend on past values?
(b) Long-term averages, involving the entire sequence of generated values. E.g., what is the fraction of time that a machine is idle?
E.g.: a) what is the probability that within a given hour all circuits of some telephone system become simultaneously busy, or
b) what is the frequency with which some buffer in a PC network overflows with data?
1. STATIONARY property:
A process is stationary when its distributions are unaffected by a shift of the time origin. That means,
• for any 𝜏 , the distribution of a stationary process will be unaffected by a shift of the time origin by 𝜏 ; in particular, all the 𝑋(𝑡) have the same distribution.
Mathematical treatment: We will focus on models in which the inter-arrival times (the times between
successive arrivals) are independent random variables.
♢ The case where arrivals occur in discrete time and the interarrival times are geometrically
distributed – is the Bernoulli process.
♢ The case where arrivals occur in continuous time and the interarrival times are exponentially
distributed – is the Poisson process, see Chapter 14.
Many processes with the memoryless property arise from experiments that evolve in time and in which the future evolution exhibits a probabilistic dependence on the past.
As an example, the future daily prices of a stock are typically dependent on past prices. However, in
a Markov process, we assume a very special type of dependence:
the next value depends on past values only through the current value, that is 𝑋𝑖+1 depends only on 𝑋𝑖 ,
and not on any previous values.
• On the other hand, if we fix 𝑤 ∈ Ω, we obtain a function of time 𝑋𝑤 (𝑡). This function is called a sample path, or a trajectory, of the process 𝑋(𝑡, 𝑤).
Looking at the past usage of the central processing unit (CPU), we see a realization of this process
until the current time (Figure 12.6). However, the future behavior of the process is unclear.
Depending on which outcome 𝑤 actually takes place, the process can develop differently. For example, see the two different trajectories for 𝑤 = 𝑤1 and 𝑤 = 𝑤2 , two elements of the sample space Ω, in Figure 12.7.
PRACTICE 3.
a/ The CPU usage process (in percent) of the above example belongs to what class of SP?
b/ In a printer shop, now let
2. Describe the components of the process Y = {𝑌 (𝑛, 𝑤)}. What class of SP does it belong to?
From now on, we shall not write 𝑤 as an argument of 𝑋(𝑡, 𝑤). Just keep in mind that behavior of a
stochastic process depends on chance, just as we did with random variables and random vectors.
Informally, a stochastic process is called a Markov process if only its present state is important for knowing its future development, and it does not matter how the process arrived at this state (the memoryless property):
P[ future | past, present ] = P[ future | present ]
A bit of history: the idea of Markov dependence was developed by Andrei Markov (1856-1922), who was a student of P. Chebyshev.
Let 𝑋(𝑡) be the total number of internet connections registered by some internet service provider
by the time 𝑡. Typically, people connect to the internet at random times, regardless of how many
connections have already been made.
Therefore, the number of connections in a minute will only depend on the current number.
For example, if 999 connections have been registered by 10 o'clock, then the probability that the total number exceeds 1000 during the next minute does not depend on when and how these 999 connections were made in the past. This process is Markov.
• The outcome of the 𝑛th trial is represented by the random variable 𝑋𝑛 , which we assume to be discrete and to take values in a finite set 𝑄.
• The sequence 𝑀 is called a (discrete time) Markov chain if, while occupying states of 𝑄 at each of the unit time points 1, 2, 3, . . . , 𝑛 − 1, 𝑛, 𝑛 + 1, . . ., 𝑀 satisfies the following property, called the Markov property or memoryless property:

P[𝑋𝑛 = 𝑗 | 𝑋𝑛−1 = 𝑖, 𝑋𝑛−2 = 𝑖𝑛−2 , . . . , 𝑋0 = 𝑖0 ] = P[𝑋𝑛 = 𝑗 | 𝑋𝑛−1 = 𝑖],   for all 𝑛 = 1, 2, · · · .

The state transition probability 𝑝𝑖𝑗 (𝑛) = P[𝑋𝑛 = 𝑗 | 𝑋𝑛−1 = 𝑖] is defined as the conditional probability that the process is in state 𝑗 at time 𝑛 given the fact that the process was in state 𝑖 at the previous time 𝑛 − 1, for all 𝑖, 𝑗 ∈ 𝑄.
If the state transition probabilities 𝑝𝑖𝑗 (𝑛) in a Markov chain 𝑀 are independent of time 𝑛, written 𝑝𝑖𝑗 (𝑛) = 𝑝𝑖𝑗 , they are said to be stationary, time homogeneous or just homogeneous.
The state transition probability in homogeneous chain then can be written without mention
time point 𝑛:
𝑝𝑖𝑗 = P[𝑋𝑛 = 𝑗|𝑋𝑛−1 = 𝑖]. (12.2)
The Markov property, quantitatively described through transition probabilities, is represented in the
state transition matrix 𝑃 = [𝑝𝑖𝑗 ]:
    ⎡ 𝑝11  𝑝12  𝑝13  ...  𝑝1𝑠 ⎤
    ⎢ 𝑝21  𝑝22  𝑝23  ...  𝑝2𝑠 ⎥
𝑃 = ⎢ 𝑝31  𝑝32  𝑝33  ...  𝑝3𝑠 ⎥        (12.3)
    ⎢  ..   ..   ..   ..   .. ⎥
    ⎣ 𝑝𝑠1  𝑝𝑠2  𝑝𝑠3  ...  𝑝𝑠𝑠 ⎦
Unless stated otherwise, we assume and will work with homogeneous Markov chains 𝑀 . The one-step
transition probabilities given by (12.2) of these Markov chains satisfy (that each row total equals 1):
∑_{𝑗=1}^{𝑠} 𝑝𝑖𝑗 = 𝑝𝑖1 + 𝑝𝑖2 + 𝑝𝑖3 + . . . + 𝑝𝑖𝑠 = 1;
• In practice, the initial probabilities 𝑝(0) are obtained at the current time (the beginning of a study).
In most cases, the major concern is using 𝑃 and 𝑝(0) to predict the future. In summary we have

P[𝑋𝑛 = 𝑗 | 𝑋𝑛−1 = 𝑖, · · · , 𝑋0 = 𝑎] = P[𝑋𝑛 = 𝑗 | 𝑋𝑛−1 = 𝑖] = 𝑝𝑖𝑗 .
Example 12.3.
The Coopmart chain (denoted 𝐶) in SG currently controls 60% of the daily processed-food market,
their rivals Maximart and other brands (denoted 𝑀 ) take the other share. Data from the previous
years (2016 and 2017) show that 88% of 𝐶’s customers remained loyal to 𝐶, while 12% switched to
rival brands. In addition, 85% of 𝑀 ’s customers remained loyal to 𝑀 , while other 15% switched to 𝐶.
Assuming that these trends continue, determine 𝐶’s share of the market
(a) in 5 years and (b) over the long run.
• Suppose that the brand attraction is time homogeneous; for a sample of large enough size 𝑛, we denote the customer's brand choice in year 𝑛 by a random variable 𝑋𝑛 .
• The market share probability of the whole population then can be approximated by using the sample
statistics, e.g.
P(𝑋𝑛 = 𝐶) = |{𝑥 : 𝑋𝑛 (𝑥) = 𝐶}| / 𝑛,   and   P(𝑋𝑛 = 𝑀 ) = 1 − P(𝑋𝑛 = 𝐶).
• We build a transition probability matrix with labels on rows and columns to be 𝐶 and 𝑀
         𝐶      𝑀
    𝐶 ⎡ 0.88  0.12 ⎤   ⎡ 1 − 𝑎    𝑎   ⎤
𝑃 =    ⎢            ⎥ = ⎢              ⎥ ,   𝑎 = 0.12, 𝑏 = 0.15,        (12.4)
    𝑀 ⎣ 0.15  0.85 ⎦   ⎣   𝑏    1 − 𝑏 ⎦

where 𝑎 = 𝑝𝐶𝑀 = P[𝑋𝑛+1 = 𝑀 | 𝑋𝑛 = 𝐶] and 𝑏 = 𝑝𝑀𝐶 = P[𝑋𝑛+1 = 𝐶 | 𝑋𝑛 = 𝑀 ].
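Both questions (a) and (b) of Example 12.3 can then be answered numerically; a sketch in R:
P  <- matrix(c(0.88, 0.12,
               0.15, 0.85), 2, 2, byrow = TRUE)    # rows/cols: C, M
p0 <- c(0.60, 0.40)                                # current market shares
P5 <- diag(2); for (k in 1:5) P5 <- P5 %*% P       # P^5
p0 %*% P5                                          # (a) shares in 5 years
e  <- eigen(t(P)); i1 <- which.min(abs(e$values - 1))
ps <- Re(e$vectors[, i1]); ps / sum(ps)            # (b) long run: (5/9, 4/9)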
This is called the ℎ-step transition probability; it is independent of 𝑚 ∈ N if the chain is homogeneous, see Equation 12.2. The ℎ-step transition matrix is denoted 𝑃 (ℎ) = (𝑝^{(ℎ)}_{𝑖𝑗} ).
We have a recursive way for obtaining 𝑃 (ℎ) as follows.
The Chapman-Kolmogorov equations relate the ℎ-step transition probabilities 𝑝^{(ℎ)}_{𝑖𝑗} with the 𝑘-step and (ℎ − 𝑘)-step transition probabilities:

𝑝^{(ℎ)}_{𝑖𝑗} = ∑_{𝑙=1}^{𝑠} 𝑝^{(ℎ−𝑘)}_{𝑖𝑙} 𝑝^{(𝑘)}_{𝑙𝑗} ,   0 < 𝑘 < ℎ.
Now from each 𝑝𝑖 (𝑛) being defined as in Equation (12.1), that is the density P[𝑋𝑛 = 𝑖] of the chain 𝑋𝑛
at time 𝑛 receiving state 𝑖 ∈ 𝑄, we set 𝑝(𝑛) to be the vector form of probability mass distribution (pmf
or absolute probability distribution) associated with all possible 𝑋𝑛 of the Markov process, i.e.
Proposition 12.1.
The absolute probability distribution 𝑝(𝑛) at any stage 𝑛 of a Markov chain is given in the form 𝑝(𝑛) = 𝑝(𝑛−1) P = 𝑝(0) P𝑛 .
Practical Problem 2. A state transition diagram of a finite-state Markov chain is a line diagram
- with a vertex corresponding to each state and
- a directed line between two vertices 𝑖 and 𝑗 if 𝑝𝑖𝑗 > 0.
In such a diagram, if one can move from 𝑖 to 𝑗 by a path following the arrows, then 𝑖 → 𝑗.
The diagram is useful to determine whether a finite-state Markov chain is irreducible or not, or to
check for periodicity.
Draw the state transition diagrams and classify the states of the MCs with the following transition
probability matrices:
      ⎡  0   0.5  0.5 ⎤          ⎡ 0  0  0.5  0.5 ⎤
𝑃1 =  ⎢ 0.5   0   0.5 ⎥ ;  𝑃2 =  ⎢ 1  0   0    0  ⎥ .
      ⎣ 0.5  0.5   0  ⎦          ⎢ 0  1   0    0  ⎥
                                 ⎣ 0  1   0    0  ⎦
For convenience, we will use the following notations in parallel from now on:
• 𝑋(𝑛), 𝑋𝑛 are the distribution of states of the chain at any stage (time point) 𝑛 ∈ N,
By the Markov property, each next state should be predicted from the previous state only. Therefore,
it is sufficient to know
2. The mechanism of transitions from one state to another, i.e. all one-step transition probabilities 𝑝𝑖,𝑗 ,
or the transition probability matrix P = [𝑝𝑖,𝑗 ].
• compute ℎ-step transition probabilities 𝑝^{(ℎ)}_{𝑖,𝑗} , using P(ℎ) = Pℎ ;
called the probability distribution of states at time point 𝑛, which is our forecast for 𝑋(𝑛);
given by
𝑝(𝑛) = 𝑝(0) P𝑛 , (12.7)
• to know our long-term forecast: take the limit of 𝑝(𝑛) as 𝑛 → ∞. It will be more efficient to take
the limit of matrix P𝑛 .
NOTATIONS
the ℎ-step transition probability is 𝑝^{(ℎ)}_{𝑖𝑗} = Prob(𝑋𝑚+ℎ = 𝑗 | 𝑋𝑚 = 𝑖).
Vector 𝑝* = (𝑝*1 , 𝑝*2 , · · · , 𝑝*𝑠 ) is called the stationary distribution of a Markov chain {𝑋𝑛 , 𝑛 ≥ 0}
with the state transition matrix 𝑃 if:
𝑝* P = 𝑝* . (12.8)
This equation indicates that a stationary distribution 𝑝* is a left eigenvector of P with eigenvalue 1.
Note that any nonzero multiple of 𝑝* is also an eigenvector of P. But the stationary distribution 𝑝* is
fixed by being a probability vector; that is, its components sum to 1.
The absolute probability distribution 𝑝(𝑛) is 𝑝(𝑛) = 𝑝(0) P𝑛 . (12.9)
In general, taking 𝑛 → ∞ in Equation (12.9) we may find the limiting probability 𝑝(∞) as
𝑝(∞) = 𝑝(0) P∞ .
We need some general results to determine the stationary distribution 𝑝* and the limiting probability 𝑝(∞) of a Markov chain. For the specific class of MCs below, a stationary distribution exists: those whose transition matrix satisfies, for some 𝑚,

P(𝑚) = P𝑚 > 0

(i.e. every matrix entry is positive; a matrix 𝐴 = [𝑎𝑖,𝑗 ] > 0 means that all its entries 𝑎𝑖,𝑗 > 0).
Lemma 12.2.
lim𝑛→∞ P𝑛 = P∞ (12.10)
where P∞ is a matrix whose rows are all equal to the stationary distribution 𝑝* .
Proposition 12.3.
a) The 𝑛-step transition probability matrix is given by
              1   ⎛ ⎡ 𝑏  𝑎 ⎤                ⎡  𝑎  −𝑎 ⎤ ⎞
P(𝑛) = P𝑛 = ───── ⎜ ⎢      ⎥ + (1 − 𝑎 − 𝑏)𝑛 ⎢         ⎥ ⎟
            𝑎 + 𝑏 ⎝ ⎣ 𝑏  𝑎 ⎦                ⎣ −𝑏   𝑏 ⎦ ⎠
Proof. To prove this (computing transition probability matrix of a 2-state Markov chain), we use a fun-
damental result of Linear Algebra, recalled in Section 12.5.2.
The eigenvalues of the state transition matrix 𝑃 , found by solving the equation

𝑐(𝜆) = |𝜆𝐼 − P| = 0,

are 𝜆1 = 1 and 𝜆2 = 1 − 𝑎 − 𝑏. The spectral decomposition of a square matrix (Section 12.5.2) says P can be decomposed into two constituent matrices 𝐸1 , 𝐸2 (since only two eigenvalues were found):

𝐸1 = (1/(𝜆1 − 𝜆2 )) [𝑃 − 𝜆2 𝐼],   𝐸2 = (1/(𝜆2 − 𝜆1 )) [𝑃 − 𝜆1 𝐼];

P = 𝜆1 𝐸1 + 𝜆2 𝐸2 ;   𝐸1² = 𝐸1 , 𝐸2² = 𝐸2 .
Hence,
P𝑛 = 𝜆1^𝑛 𝐸1 + 𝜆2^𝑛 𝐸2 = 𝐸1 + (1 − 𝑎 − 𝑏)𝑛 𝐸2 ,

or

              1   ⎛ ⎡ 𝑏  𝑎 ⎤                ⎡  𝑎  −𝑎 ⎤ ⎞
P(𝑛) = P𝑛 = ───── ⎜ ⎢      ⎥ + (1 − 𝑎 − 𝑏)𝑛 ⎢         ⎥ ⎟
            𝑎 + 𝑏 ⎝ ⎣ 𝑏  𝑎 ⎦                ⎣ −𝑏   𝑏 ⎦ ⎠
II) Markov chains that have more than two states, 𝑠 > 2.
For 𝑠 > 2 it is cumbersome to compute the constituent matrices 𝐸𝑖 of P; we can employ the so-called regular property (see Theorem 12.9) or use linear algebra directly via Equation 12.16.
In the next sections we provide a classification of states in a Markov chain. Four types are considered: accessible, recurrent (persistent), periodic, and absorbing.
• Two states 𝑖 and 𝑗 accessible to each other are said to communicate, written 𝑖 ↔ 𝑗, if there exist 𝑁 ≥ 0 with 𝑝^{(𝑁)}_{𝑖,𝑗} > 0 and 𝑀 ≥ 0 with 𝑝^{(𝑀)}_{𝑗,𝑖} > 0.
Two states that communicate are said to be in the same class. All members of one class communi-
cate with one another.
If a class is not accessible from any state outside the class, we define the class to be a closed
communicating class.
If all states communicate with each other, then we say that the Markov chain 𝑀 = (S, 𝜋, P) is irreducible. Formally, irreducibility means any of the following equivalent conditions is satisfied.
1. 𝑀 is irreducible if and only if for all 𝑖, 𝑗 ∈ S there exists 𝑁 ≥ 0 such that 𝑝^{(𝑁)}_{𝑖,𝑗} > 0.
3. The chain is irreducible if and only if there exists a path, whose probability is strictly positive,
which starts from any state 𝑖 and returns to 𝑖 after having visited at least once all other states of
the chain.
We can say that there exists a cycle with strictly positive probability.
The states of a Markov chain can be classified into two broad groups: those that the process enters
infinitely often and those that it enters finitely often. In the long run, the process will be found to be in
only those states that it enters infinitely often.
Let 𝐴(𝑗) be the set of states that are accessible from state 𝑗. We say that
1. State 𝑗 is recurrent if from any future state, there is always some probability of returning to 𝑗 and,
given enough time, this is certain to happen.
By repeating this argument, if a recurrent state is visited once, it will be revisited an infinite number
of times.
The first return time: the first return time 𝑇𝑗 to state 𝑗 ∈ S is the number of steps until the chain first revisits state 𝑗 after leaving 𝑗 at time 0.
Probability of the first passage or visit: For any two states 𝑖 ̸= 𝑗 and 𝑛 > 0,
let 𝑓𝑖,𝑗 (𝑛) be the conditional probability that given that the process is presently in state 𝑖, the first time
it will enter state 𝑗 occurs in exactly 𝑛 transitions (or steps):
• We call 𝑓𝑖,𝑗 (𝑛) the probability of first passage from state 𝑖 to state 𝑗 in 𝑛 steps. By the addition rule and Bellman's optimality principle:

𝑓𝑖,𝑗 (𝑛) = ∑_{𝑘≠𝑗} 𝑝𝑖,𝑘 𝑓𝑘,𝑗 (𝑛 − 1),   for 𝑛 = 2, 3, . . .
1. State 𝑗 is recurrent if
𝑓𝑗,𝑗 = P[𝑇𝑗 < ∞|𝑋0 = 𝑗] = 1,
i.e., starting from the state, the process is guaranteed to return to the state again and again, in
fact, infinitely many times.
So a transient state, as the name suggests, is a state to which we may not come back; a recurrent state is one to which we will come back with probability 1. Next, we will characterize the recurrent states further.
1. Show that in a finite-state Markov chain, not all states can be transient, in other words at least one of
the states must be recurrent.
2. Show that if P is a Markov matrix, then P𝑛 is also a Markov matrix for any positive integer 𝑛.
3. Verify the transitivity property of Markov chains, that is, if 𝑖 → 𝑗 and 𝑗 → 𝑘, then 𝑖 → 𝑘. (Hint: use the Chapman-Kolmogorov equations.)
1. Use contradiction.
2. Employ the fact that the row sums of a stochastic matrix all are equal to 1.
Theorem 12.4.
1. Let 𝑁𝑗 be the number of times that state 𝑗 is visited, given that 𝑋0 = 𝑗. The state 𝑗 is recurrent if and only if E[𝑁𝑗 ] = ∞.
2. Recurrence is a class property: if 𝑖 and 𝑗 communicate, then they are either both recurrent or
both transient.
Hint: It is sufficient to show that if 𝑖 is recurrent, then 𝑗 too is recurrent. (The other result being
simply the contrapositive of this assertion).
3. Hence, all states of a finite and irreducible Markov chain are recurrent.
1. State 𝑗 is said to be positive recurrent if, starting from the state, the expected number of transitions until the chain returns to the state is finite:
Problem 12.2.
Consider a Markov chain with state space {0, 1} and transition probability matrix
    ⎡  1    0  ⎤
P = ⎢          ⎥
    ⎣ 1/2  1/2 ⎦
The period of a state 𝑖 is 𝑑𝑖 = gcd{𝑛 > 0 : 𝑝^{(𝑛)}_{𝑖,𝑖} > 0}; if 𝑑𝑖 = 𝑑 > 1 the state is periodic with period 𝑑, and 𝑝^{(𝑛)}_{𝑖,𝑖} = 0 whenever 𝑛 is not divisible by 𝑑.
• A Markov Chain 𝑀 is aperiodic if the period of each state 𝑖 ∈ S is 1; in other words, there is no
such periodic state in 𝑀 .
Proposition 12.5.
1. If 𝑝^{(𝑛)}_{𝑖,𝑖} = 0 for all 𝑛, we consider 𝑖 as an aperiodic state. That is, the chain may start from state 𝑖, but it leaves 𝑖 on its first transition and never returns to 𝑖.
2. It can be shown that periodicity is a class property: if 𝑖 and 𝑗 communicate and if 𝑖 is periodic
with period 𝑑, then 𝑗 too is periodic with period 𝑑.
3. If 𝑝^{(1)}_{𝑖,𝑖} > 0, then 𝑖 is evidently aperiodic.
4. If 𝑝^{(2)}_{𝑖,𝑖} > 0 and 𝑝^{(3)}_{𝑖,𝑖} > 0, then 𝑖 is aperiodic. Prove this.
Example 12.5.
Consider a Markov chain with state space {0, 1, 2} with transition probability matrix
    ⎡  0   1/2  1/2 ⎤
P = ⎢ 1/2   0   1/2 ⎥ .
    ⎣ 1/2  1/2   0  ⎦
Prove that:
1/ the Markov chain is irreducible, 2/ the Markov chain is aperiodic.
HINT:
1/ Find 𝑛 with 𝑝^{(𝑛)}_{𝑖,𝑖} > 0 for each 𝑖. 2/ Use Property 4 above.
Example 12.6.
then it is periodic.
Absorbing state: State 𝑗 is said to be an absorbing state if 𝑝𝑗𝑗 = 1; that is, once state 𝑗 is reached,
it is never left. If there are multiple absorbing states, the probability that one of them will be eventually
reached is still 1, but the identity of the absorbing state to be entered is random and the associated
probabilities may depend on the starting state.
A player, at each play of a game, wins one unit (for example, 1 USD) with probability 𝑝 and loses one
unit with probability 𝑞 := 1 − 𝑝. Assume that he initially possesses 𝑖 units and that he plays independent
repetitions of the game until his fortune reaches 𝑘 units or he goes broke.
S = {0, 1, . . . , 𝑘}.
• The chain thus has three classes: 𝐶1 = {0}, 𝐶2 = {𝑘}, and 𝐶3 = {1, 2, . . . , 𝑘 − 1}. The first two
are recurrent, because 0 and 𝑘 are absorbing, whereas the third one is transient. For example,
by Condition (12.13), at state 1 ∈ 𝐶3 there is a positive probability 1 − 𝑓1,1 of never returning to state 1; from this we can conclude that the player's fortune will reach 0 or 𝑘 units after a finite number of repetitions.
Summary
1. A chain is irreducible iff it has only one single recurrent class, i.e. any state is accessible from all states.
Fact 12.1. In a DTMC 𝑀 that has more than two states, we have 4 cases:
1. 𝑀 has irreducible, positive recurrent, but periodic states. The component 𝜋𝑖 of the stationary distri-
bution vector 𝜋 is understood as the long-run proportion of time that the process is in state 𝑖.
2. 𝑀 has several closed, positive recurrent classes. In this case, the transition matrix of the DTMC
takes the block form. In contrast to the irreducible ergodic DTMC, where the limiting distribution is
independent of the initial state, the DTMC with several closed, positive recurrent classes has the
limiting distribution that is dependent on the initial state.
3. 𝑀 has both recurrent and transient classes. In this situation, we often seek the probabilities that
the chain is eventually absorbed by different recurrent classes. See the well-known gambler’s ruin
problem.
4. 𝑀 is an irreducible DTMC with null recurrent or transient states. This case is only possible when
the state space is infinite, since any finite-state, irreducible DTMC must be positive recurrent. In this
case, neither the limiting distribution nor the stationary distribution exists. A well-known example of
this case is the random walk model, see Chapter 13.
Let 𝐴 ∈ Mat𝑛 (R) be a square matrix of order 𝑛 with real entries. Let 𝑥 be an indeterminate; then the polynomial

𝑝𝐴 (𝑥) = 𝑝(𝑥) = det(𝑥 I𝑛 − 𝐴) = 𝑥𝑛 + 𝑎𝑛−1 𝑥𝑛−1 + . . . + 𝑎1 𝑥 + 𝑎0 (12.14)
• For an eigenvalue 𝜆, the pair (𝜆, 𝑣) is named an eigenpair. All eigenvectors being associated with
eigenvalue 𝜆 forms a subspace of R𝑛 , denoted by
𝑉𝜆 := {𝑣 ∈ R𝑛 : 𝐴𝑣 = 𝜆𝑣},
Definition 12.10.
• A stochastic matrix 𝐴 is non-negative and each row sum equals one. E.g., the transition probability
matrix P = [𝑝𝑖,𝑗 ] of a Markov chain is a stochastic matrix.
• If the column sums also equal one, the matrix is called doubly stochastic.
• The trace is tr(P) = ∑_{𝑖=1}^{𝑠} 𝜆𝑖 , and the determinant is |P| = ∏_{𝑖=1}^{𝑠} 𝜆𝑖 .
• If 𝐵𝑖 is a basis for the null space 𝑁 (P − 𝜆𝑖 𝐼), then B = 𝐵1 ∪ 𝐵2 · · · ∪ 𝐵𝑘 is a linearly independent set.
Definition 12.11.
Any square matrix that can be transformed into a diagonal matrix through the postmultiplication by a
nonsingular matrix and premultiplication by its inverse is said to be diagonalizable.
Precisely, a square matrix P is diagonalizable if and only if there exists a nonsingular matrix 𝐻 (i.e.
det(𝐻) ̸= 0) such that 𝐻 −1 P𝐻 is a diagonal matrix.
Square matrix P of order 𝑠 is diagonalizable if and only if P possesses a complete set of eigen-
vectors (i.e. a set of 𝑠 linearly independent vectors corresponding with distinct eigenvalues 𝜆𝑗 ).
Moreover, the nonsingular matrix 𝐻 is built by a complete set of eigenvectors 𝑣𝑗 as the columns,
where each (𝜆𝑗 , 𝑣𝑗 ) is an eigenpair of P; and we have
𝐻 −1 P𝐻 = 𝐷 = Diag(𝜆1 , 𝜆2 , · · · , 𝜆𝑠 ).
P = 𝜆 1 𝐸1 + 𝜆 2 𝐸2 + · · · + 𝜆 𝑘 𝐸𝑘 , (12.15)
• 𝐸1 + 𝐸2 + · · · + 𝐸𝑘 = 𝐼
Proposition 12.7.
b/ none of the eigenvalues exceeds 1 in absolute value, that is all eigenvalues 𝜆𝑖 satisfy |𝜆𝑖 | ≤ 1.
a/ When 𝐾 is stochastic, 𝜌(𝐾) = 1, because 𝜆 = 1 is an eigenvalue with eigenvector 𝑒 = [1, 1, . . . , 1]𝑡 , by the eigenvalue-eigenvector equation: (𝐾 − 1 · I𝑛 )𝑒 = 0 ⇔ 𝐾𝑒 = 𝑒.
P𝑚 = 𝜆1^𝑚 𝐸1 + 𝜆2^𝑚 𝐸2 + · · · + 𝜆𝑘^𝑚 𝐸𝑘 ,   for any integer 𝑚 > 0. (12.16)
𝑁 (P − 𝜆𝑖 𝐼) = {𝑣 : 𝑣 ′ 𝑃 = 𝜆𝑖 𝑣 ′ }.
Definition 12.12.
• A stochastic matrix P = [𝑝𝑖,𝑗 ] is ergodic if lim𝑚→∞ P𝑚 = 𝐿 (say) exists, that is, each 𝑝^{(𝑚)}_{𝑖,𝑗} has a limit when 𝑚 → ∞.
In our context, a Markov chain, with transition probability matrix P, is called regular if there exists an
𝑚 > 0 such that P𝑚 > 0, i.e. there is a finite positive integer 𝑚 such that after 𝑚 time-steps, every
state has a nonzero chance of being occupied, no matter what the initial state is.
GUIDANCE for solving. DIY, with the 𝑛-step transition probability matrix

              1   ⎛ ⎡ 𝑏  𝑎 ⎤                ⎡  𝑎  −𝑎 ⎤ ⎞
P(𝑛) = P𝑛 = ───── ⎜ ⎢      ⎥ + (1 − 𝑎 − 𝑏)𝑛 ⎢         ⎥ ⎟
            𝑎 + 𝑏 ⎝ ⎣ 𝑏  𝑎 ⎦                ⎣ −𝑏   𝑏 ⎦ ⎠

when

    ⎡ 𝑝1,1  𝑝2,1 ⎤   ⎡ 1 − 𝑎    𝑎   ⎤
P = ⎢            ⎥ = ⎢              ⎥ , where 0 < 𝑎 < 1, 0 < 𝑏 < 1. (12.17)
    ⎣ 𝑝1,2  𝑝2,2 ⎦   ⎣   𝑏    1 − 𝑏 ⎦
♣ QUESTION 12.2.
The limit matrix 𝐿 = lim𝑚→∞ P𝑚 practically shows the long-term behaviour (distribution, property) of
the process. How to know the existence 𝐿 (i.e. the ergodicity of transition matrix P = [𝑝𝑖,𝑗 ])?
1. 1 is an eigenvalue of multiplicity one, and all other eigenvalues 𝜆𝑖 satisfy |𝜆𝑖 | < 1;
2. P is ergodic, that is lim𝑚→∞ P𝑚 = 𝐿 exists. Furthermore, 𝐿’s rows are identical and equal to
the stationary distribution 𝑝* .
Proof.
Item (2). If (1) is proved then, by Theorem 12.8, P = [𝑝𝑖,𝑗 ] is ergodic. Hence, when P = [𝑝𝑖,𝑗 ] is
regular, the limit matrix 𝐿 = lim𝑚→∞ 𝑃 𝑚 does exist. By the decomposition (12.15),
• Let the vector 𝑝* be the unique left eigenvector associated with the largest eigenvalue 𝜆1 = 1 (a simple eigenvalue, since it has multiplicity one), that is
𝑝* P = 𝑝* ⇔ 𝑝* (P − 1 𝐼) = 0.
• We now prove that 𝐿’s rows are identical and equal to the stationary distribution 𝑝* : 𝐿 = [𝑝* , · · · , 𝑝* ]′ .
Given a finite, aperiodic and irreducible Markov chain 𝑀 = (𝑄, 𝜋, P), where S consists of 𝑠 states, there exist stationary probabilities 𝑝* = (𝑝*1 , 𝑝*2 , . . . , 𝑝*𝑠 ).
See the proof in Theorem 12.9, because C2 means that the stationary vector 𝑝* = [𝑝*1 , 𝑝*2 , . . . , 𝑝*𝑠 ]𝑇
satisfies equation 𝑝* P = 𝑝* .
• (a) for regular MC, stationary distribution 𝑝* does not depend on the initial state distribution probabil-
ities 𝑝(0); by Theorem 12.9 [Item 2];
• (b) but, in general, the long-term behavior expressed by the limiting distributions 𝑝(∞) are influenced
by the initial distributions 𝑝(0), via 𝑝(∞) = 𝑝(0) 𝐿; whenever the stochastic matrix P = [𝑝𝑖,𝑗 ] is
ergodic but not regular.
Example 12.10. Consider a Markov chain with two states and transition probability matrix
⎡ 0  1 ⎤
⎣ 1  0 ⎦
(a) Find the stationary distribution 𝑝* of the chain. (b) Find lim𝑛→∞ P𝑛 .
SOLUTION:
a) Use Definition 12.8:

                        ⎡ 0  1 ⎤
𝑝* P = 𝑝* ⇐⇒ [𝑝1 , 𝑝2 ] ⎢      ⎥ = [𝑝1 , 𝑝2 ],
                        ⎣ 1  0 ⎦

i.e. 𝑝2 = 𝑝1 ; together with 𝑝1 + 𝑝2 = 1 this gives 𝑝* = (1/2, 1/2).
b) Since P𝑛 alternates between P (𝑛 odd) and 𝐼 (𝑛 even), lim𝑛→∞ P𝑛 does not exist: the chain is periodic, so a stationary distribution exists although the limiting matrix does not.
Example 12.11. Consider a Markov chain with two states and transition probability matrix
    ⎡ 3/4  1/4 ⎤
P = ⎢          ⎥ .
    ⎣ 1/2  1/2 ⎦

Solving 𝑝* P = 𝑝* with 𝑝1 + 𝑝2 = 1 gives 𝑝* = (𝑝1 , 𝑝2 ) = (2/3, 1/3), and

                              ⎡ 𝑝1  𝑝2 ⎤   ⎡ 2/3  1/3 ⎤
lim𝑚→∞ P𝑚 = 𝐿 = [𝑝* , 𝑝* ]′ = ⎢        ⎥ = ⎢          ⎥ .
                              ⎣ 𝑝1  𝑝2 ⎦   ⎣ 2/3  1/3 ⎦
Example 12.12. Diagonalize the following matrix and provide its spectral decomposition.
    ⎡  1   −4   −4 ⎤
P = ⎢  8  −11   −8 ⎥ .
    ⎣ −8    8    5 ⎦
can be taken as a basis of the eigenspace (or null space) 𝑁 (P − 𝜆𝐼). Bases of the eigenspaces are:

𝑁 (P − 1𝐼) = span([1, 2, −2]′ );   𝑁 (P + 3𝐼) = span([1, 1, 0]′ , [1, 0, 1]′ ).
It is easy to check that these three eigenvectors 𝑥𝑖 form a linearly independent set, so P is diagonalizable. The nonsingular matrix (also called the similarity transformation matrix) is

                   ⎡  1  1  1 ⎤
𝐻 = (𝑥1 |𝑥2 |𝑥3 ) = ⎢  2  1  0 ⎥ ;
                   ⎣ −2  0  1 ⎦

                                                      ⎡ 1   0   0 ⎤
𝐻 −1 P𝐻 = 𝐷 = Diag(𝜆1 , 𝜆2 , 𝜆2 ) = Diag(1, −3, −3) = ⎢ 0  −3   0 ⎥ .
                                                      ⎣ 0   0  −3 ⎦
Here,

        ⎡  1  −1  −1 ⎤
𝐻 −1 =  ⎢ −2   3   2 ⎥ ,  which implies that
        ⎣  2  −2  −1 ⎦

𝑦 𝑡1 = [1, −1, −1], 𝑦 𝑡2 = [−2, 3, 2], 𝑦 𝑡3 = [2, −2, −1]. The constituent matrices are
                ⎡  1  −1  −1 ⎤
𝐸1 = 𝑥1 · 𝑦 𝑡1 = ⎢  2  −2  −2 ⎥ ;
                ⎣ −2   2   2 ⎦

                ⎡ −2   3   2 ⎤                    ⎡ 2  −2  −1 ⎤
𝐸2 = 𝑥2 · 𝑦 𝑡2 = ⎢ −2   3   2 ⎥ ;   𝐸3 = 𝑥3 · 𝑦 𝑡3 = ⎢ 0   0   0 ⎥ .
                ⎣  0   0   0 ⎦                    ⎣ 2  −2  −1 ⎦
Obviously,

                            ⎡  1   −4   −4 ⎤
𝑃 = 𝜆1 𝐸1 + 𝜆2 𝐸2 + 𝜆3 𝐸3 = ⎢  8  −11   −8 ⎥ .
                            ⎣ −8    8    5 ⎦
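These hand computations are easy to verify numerically; a sketch in R:
P <- matrix(c( 1,  -4,  -4,
               8, -11,  -8,
              -8,   8,   5), 3, 3, byrow = TRUE)
H <- matrix(c( 1, 1, 1,
               2, 1, 0,
              -2, 0, 1), 3, 3, byrow = TRUE)       # eigenvectors as columns
round(solve(H) %*% P %*% H, 10)                    # = Diag(1, -3, -3)
eigen(P)$values                                    # 1, -3, -3 (up to order)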
Recall from Definition 12.3 that a stochastic process 𝑋(𝑡) is said to be a Markov process if for any sequence of real numbers 𝑡1 < 𝑡2 < . . . < 𝑡𝑛 < 𝑡, and any sets 𝐴; 𝐴1 , 𝐴2 , . . . , 𝐴𝑛 (𝐴𝑖 ⊂ R), the conditional distribution of the future value given the past values depends on the most recent value only. In particular, for a discrete state space,

P[𝑋(𝑠 + 𝑡) = 𝑗 | 𝑋(𝑠) = 𝑖 and 𝑋(𝑢) for 𝑢 < 𝑠] = P[𝑋(𝑠 + 𝑡) = 𝑗 | 𝑋(𝑠) = 𝑖],   (12.18)

time: −−−−−− 𝑢 −−−−− 𝑠 −−−−− 𝑠 + 𝑡 −−−−−−>
That means given the evolution of the process up to any current time 𝑠, the future value and its
probabilistic description depend only on the current state at time 𝑠.
The transition probability 𝑝𝑖,𝑗 (𝑠, 𝑡) is the conditional probability that the process is in state 𝑗 at time 𝑠 + 𝑡 given that the process was in state 𝑖 at the previous time 𝑠, for all 𝑖, 𝑗 ∈ S [compare with Definition 12.2].
The Markov jump process is stationary or time homogeneous, if the 𝑝𝑖,𝑗 (𝑠, 𝑡) will be unaffected
by a shift in the time origin (see from Definition 12.2). It means the state transition probability is
given by
𝑝𝑖,𝑗 (𝑠, 𝑡) = P[𝑋(𝑠 + 𝑡) = 𝑗 |𝑋(𝑠) = 𝑖] = 𝑝𝑖,𝑗 (𝑡) (12.19)
now depends only on the length 𝑡 of the time interval [𝑠, 𝑠 + 𝑡].
With the Markov property quantitatively described in (12.18), these transition probabilities are sum-
marized in the transition matrix P(𝑡) = [𝑝𝑖,𝑗 (𝑡)].
Analogous to the discrete time setting, the matrix P(𝑡) whose elements are 𝑝𝑖,𝑗 (𝑡) is called the state
transition matrix of the process.
       ⎡ 𝑝1,1 (𝑡)  𝑝1,2 (𝑡)  𝑝1,3 (𝑡)  ...  𝑝1,𝑠 (𝑡) ⎤
       ⎢ 𝑝2,1 (𝑡)  𝑝2,2 (𝑡)  𝑝2,3 (𝑡)  ...  𝑝2,𝑠 (𝑡) ⎥
P(𝑡) = ⎢ 𝑝3,1 (𝑡)  𝑝3,2 (𝑡)  𝑝3,3 (𝑡)  ...  𝑝3,𝑠 (𝑡) ⎥ .        (12.20)
       ⎢    ..        ..        ..      ..     ..   ⎥
       ⎣ 𝑝𝑠,1 (𝑡)  𝑝𝑠,2 (𝑡)  𝑝𝑠,3 (𝑡)  ...  𝑝𝑠,𝑠 (𝑡) ⎦
It is a stochastic matrix, i.e. its entries are non-negative and each row sum equals 1:

∑_{𝑗∈S} 𝑝𝑖,𝑗 (𝑡) = 1,   for each 𝑖 ∈ S;
The dynamics of the Markov jump process (CTMC) (S, 𝑝(0), P(𝑡)) can be determined by the transition probability matrix P(𝑡) = [𝑝𝑖,𝑗 (𝑡)] and its initial probability distribution (at 𝑋(0)) 𝑝(0) = (𝑎𝑖 , 𝑖 ∈ S), where 𝑎𝑖 = P[𝑋(0) = 𝑖]. Here 𝑝𝑗 (𝑡) = P[𝑋(𝑡) = 𝑗] is the (marginal) probability that the process will be in state 𝑗 at time 𝑡. We may express this marginal probability via the initial distribution and the transitions:

𝑝𝑗 (𝑡) = ∑_{𝑖∈S} 𝑎𝑖 𝑝𝑖,𝑗 (𝑡). (12.24)
The main problem when studying continuous-time Markov chains is finding these probabilities 𝑝𝑖,𝑗 (𝑡).
They are continuous functions of 𝑡, for every pair (𝑖, 𝑗). In general, it is not easy to calculate the
functions 𝑝𝑖,𝑗 (𝑡) explicitly.
Example 12.13.
For any sequence of real numbers 0 < 𝑡1 < 𝑡2 < . . . < 𝑡𝑛 , a few typical probabilities involving the process 𝑋(𝑡) can be determined via 𝑝𝑖,𝑗 (𝑡).
• If the state at 𝑡 = 0 is known: the transition probabilities of the Markov jump process at the origin time 𝑡 = 0 satisfy 𝑝𝑖,𝑗 (0) = 𝛿𝑖,𝑗 (1 if 𝑖 = 𝑗, 0 otherwise), and for 𝑠, 𝑡 ≥ 0,

𝑝𝑖,𝑗 (𝑠 + 𝑡) = ∑_{𝑘∈S} 𝑝𝑖,𝑘 (𝑠) 𝑝𝑘,𝑗 (𝑡).

• This equation is called the Chapman-Kolmogorov equation for the continuous time Markov chain.
Example 12.14. For a Poisson process 𝑋(𝑡) ∼ Pois(𝜆𝑡) with set of states S = N = {0, 1, 2, ...} we can
determine 𝑝𝑖,𝑗 (𝑡).
Concept of Poisson process: We can write 𝑋(𝑡) to count the number of events randomly occurring
in the time interval [0, 𝑡), with pdf
𝑝𝑋(𝑡) (𝑖) = 𝑝(𝑖; 𝜆𝑡) = P[𝑋(𝑡) = 𝑖] = 𝑒−𝜆𝑡 (𝜆𝑡)𝑖 / 𝑖! ,   𝑖 = 0, 1, 2, . . . (12.26)
with a constant value 𝜆 > 0, defined as the average number of events occurring in one unit of time, the rate of rare events, or just the process rate.
The transition probabilities 𝑝𝑖,𝑗 (𝑡) of 𝑋(𝑡) ∼ Pois(𝜆𝑡) are given by
           ⎧ 𝑒−𝜆𝑡 (𝜆𝑡)𝑗−𝑖 / (𝑗 − 𝑖)!   for 𝑗 − 𝑖 ≥ 0,
𝑝𝑖,𝑗 (𝑡) = ⎨                                            (12.27)
           ⎩ 0                         for 𝑗 − 𝑖 < 0.
Then we can check explicitly that, with 𝑗 ≥ 𝑖,

𝑝𝑖,𝑗 (𝑠 + 𝑡) = ∑_{𝑘∈S} 𝑝𝑖,𝑘 (𝑠) 𝑝𝑘,𝑗 (𝑡).
Hint: Write explicitly 𝑝𝑖𝑘 (𝑠) and 𝑝𝑘𝑗 (𝑡) then use the Newton binomial theorem.
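The identity can also be checked numerically; a small sketch in R (the rate 𝜆 and the time points are arbitrary choices):
lambda <- 2; s <- 0.7; t <- 1.3; i <- 1; j <- 5
p <- function(i, j, t) ifelse(j >= i, dpois(j - i, lambda * t), 0)
lhs <- p(i, j, s + t)
rhs <- sum(sapply(i:j, function(k) p(i, k, s) * p(k, j, t)))
c(lhs, rhs)                        # the two values agree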
Whenever a stochastic process enters a state 𝑖, it spends an amount of time called the dwell time (or
holding time) in that state.
2. A homogeneous CTMC or Markov jump process {𝑋(𝑡)} = (S, 𝑝(0), P(𝑡)) is one which satisfies two conditions, for any state 𝑖 ∈ S:
C1: The dwell time 𝐻 spent in state 𝑖, before making a transition into a different state,
- is exponentially distributed with rate constant 𝑣𝑖 (or with mean 1/𝑣𝑖 ; the cdf is 𝐹𝐻 (𝑡) = 1 − 𝑒−𝑣𝑖 𝑡 ), [𝑣𝑖 represents the transition rate at which the process leaves state 𝑖]
- and 𝐻 does not depend on the next state 𝑗;
C2: After that time, it jumps to some other state 𝑗 with probability 𝑝𝑖,𝑗 .
Transition rates play an important roles in Markov jump processes. They are defined as the instanta-
neous rate of change of the transition probability.
• For all 𝑖 ̸= 𝑗 ∈ S, the transition rate of the process when the process makes a transition from state 𝑖 to state 𝑗,
denoted by 𝑞𝑖,𝑗 , is defined by
𝑞𝑖,𝑗 = 𝑣𝑖 𝑝𝑖,𝑗 (12.28)
The transition rates 𝑞𝑖,𝑗 are also known as instantaneous transition rates, transition intensities,
or forces of transition.
𝑣𝑖 = lim_{ℎ→0} (1 − 𝑝𝑖𝑖 (ℎ)) / ℎ ,
𝑞𝑖,𝑗 = 𝑝′𝑖,𝑗 (0) = lim_{ℎ→0} 𝑝𝑖,𝑗 (ℎ) / ℎ ,   for 𝑖 ≠ 𝑗.   (12.29)
Definition 12.16.
The matrix

             ⎡ 𝑞1    𝑞1,2  𝑞1,3  ...  𝑞1,𝑠 ⎤
             ⎢ 𝑞2,1  𝑞2    𝑞2,3  ...  𝑞2,𝑠 ⎥
Q = [𝑞𝑖,𝑗 ] = ⎢ 𝑞3,1  𝑞3,2  𝑞3    ...  𝑞3,𝑠 ⎥        (12.30)
             ⎢  ..    ..    ..    ..   ..  ⎥
             ⎣ 𝑞𝑠,1  𝑞𝑠,2  𝑞𝑠,3  ...  𝑞𝑠   ⎦
is called the transition rate matrix or generator matrix of the process. We will see that in practice the distribution of the process can be determined by the matrix Q and its initial distribution. Differentiating the identity

∑_{𝑗∈S} 𝑝𝑖,𝑗 (𝑡) = 1,   for each 𝑖 ∈ S,

with respect to 𝑡 at 𝑡 = 0 shows that each row of Q sums to zero, with diagonal entries 𝑞𝑖 = 𝑞𝑖,𝑖 = −𝑣𝑖 = −∑_{𝑗≠𝑖} 𝑞𝑖,𝑗 .
or can be approximated by

           ⎧ 1 + ℎ 𝑞𝑖𝑖    if 𝑖 = 𝑗,
𝑝𝑖,𝑗 (ℎ) ≈ ⎨                           (12.32)
           ⎩ ℎ 𝑞𝑖,𝑗       if 𝑖 ≠ 𝑗,

i.e. the probability of a transition from 𝑖 to 𝑗 during any short time interval [𝑠, 𝑠 + ℎ] is proportional to ℎ.
Example 12.15. Consider a Markov jump process with two states, namely states 1 and 2. Denote
the transition rates of the process from state 1 to 2 and from 2 to 1 respectively by 𝜆 and 𝜇 for some
𝜆, 𝜇 > 0.
HINT:
1. In this case the state space S = {1, 2} is finite.
2. The transition rate matrix Q is given by

    ⎡ −𝜆   𝜆 ⎤
Q = ⎢        ⎥ .
    ⎣  𝜇  −𝜇 ⎦
NOTATION: We recall and further define the following vectors and matrix:

    ⎡ 𝑞1    𝑞1,2  𝑞1,3  ...  𝑞1,𝑠 ⎤
    ⎢ 𝑞2,1  𝑞2    𝑞2,3  ...  𝑞2,𝑠 ⎥
Q = ⎢ 𝑞3,1  𝑞3,2  𝑞3    ...  𝑞3,𝑠 ⎥ .       (12.33)
    ⎢  ..    ..    ..    ..   ..  ⎥
    ⎣ 𝑞𝑠,1  𝑞𝑠,2  𝑞𝑠,3  ...  𝑞𝑠   ⎦
In the steady state, 𝑝𝑖 (𝑡) → 𝑝𝑖 and lim𝑡→∞ 𝑑𝑝𝑖 (𝑡)/𝑑𝑡 = 0.
Kolmogorov’s (backward and forward) differential equations provide a relationship between the tran-
sition probabilities and transition rates. By solving the differential equations, we can express
the transition probabilities in terms of the transition rates. Consequently, statistical properties of
the Markov jump process can be completely determined by the transition rates, given in the matrix
Q = [𝑞𝑖,𝑗 = 𝑝′𝑖,𝑗 (0)], satisfying two properties
(1) each row sum is zero and
(2) the off-diagonal elements are nonnegative.
The goal of this section is to give a general methodology of finding the transition probabilities 𝑝𝑖,𝑗 (𝑡)
which would completely characterize a CTMC. These probabilities are functions of the time 𝑡 elapsed
between the two states. They will be expressed as solutions of a system of differential equations in 𝑡.
From properties of Q in Definition 12.16 we see that the row sums of 𝑄 equal 0, the nondiagonal
entries of 𝑄 are nonnegative, and the diagonal entries 𝑞𝑖𝑖 ≤ 0.
P(𝑡) = 𝑒Q𝑡 = I𝑠 + ∑_{𝑘=1}^{∞} Q𝑘 𝑡𝑘 / 𝑘! .   (12.38)
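A numerical illustration of (12.38) for a two-state generator (a sketch; the rates 𝜆 = 2, 𝜇 = 1 are our own choices, and 𝑒^{Q𝑡} is computed via an eigendecomposition, valid here since Q is diagonalizable):
lambda <- 2; mu <- 1; t <- 0.5
Q  <- matrix(c(-lambda, lambda,
                    mu,    -mu), 2, 2, byrow = TRUE)
ev <- eigen(Q)                                      # Q = V D V^{-1}
P.t <- ev$vectors %*% diag(exp(ev$values * t)) %*% solve(ev$vectors)
P.t; rowSums(P.t)                                   # stochastic: rows sum to 1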
𝑞𝑖,𝑖+1 = 𝜆𝑖 (“a birth”),   𝑞𝑖,𝑖−1 = 𝜇𝑖 (“a death”).   (12.39)

In addition, we define 𝑣0 = 𝜆0 , 𝑞0,−1 = 𝜇0 = 0. Note that in this case the birth and death rates depend on the population size 𝑖. This is very realistic, but it complicates the differential equations presented in Section 14.7.
Problem 12.3.
𝑀 represents a system with 2 states, on and off, of an email server, where on = acceptable operation, and off = overload (when the server cannot receive or deliver email). The time shift unit is 1 minute. Which of the following is true:
• 𝑃 has 3 eigenvalues
• 𝑀 is reducible
Problem 12.4.
Let 𝑀 be a two state Markov chain, with its state transition matrix is
    ⎡ 𝑝11  𝑝21 ⎤   ⎡ 1 − 𝑐    𝑐   ⎤
𝑃 = ⎢          ⎥ = ⎢              ⎥ ,  where 0 < 𝑐 < 1, 0 < 𝑑 < 1.   (12.41)
    ⎣ 𝑝12  𝑝22 ⎦   ⎣   𝑑    1 − 𝑑 ⎦
𝑀 represents a traffic system with 2 states, on and off, of a road in SG, where on = acceptable vehicle flow, and off = traffic jam (when the road cannot fulfill its function).
When 𝑐 = 𝑑, compute the limit matrix lim𝑛→∞ 𝑃 𝑛 .
Problem 12.5.
Consider a Markov chain with two states and transition probability matrix
    ⎡ 3/4  1/4 ⎤
𝑃 = ⎢          ⎥
    ⎣ 1/2  1/2 ⎦
In some town, each day is either sunny or rainy. A sunny day is followed by another sunny day with
probability 0.7, whereas a rainy day is followed by a sunny day with probability 0.4.
It rains on Monday. Make forecasts for Tuesday, Wednesday, and Thursday.
Solution.
Weather conditions in this problem represent a homogeneous Markov chain with 2 states:
state 1 = “sunny” and
state 2 = “rainy.”
Transition probabilities are: 𝑝11 = 0.7, 𝑝12 = 0.3, 𝑝21 = 0.4, 𝑝22 = 0.6,
where 𝑝12 and 𝑝22 were computed by the complement rule.
If it rains on Monday, then
- Tuesday is sunny with probability 𝑝21 = 0.4 (making a transition from a rainy to a sunny day), and
- Tuesday is rainy with probability 𝑝22 = 0.6; can predict a 60% chance of rain.
The Wednesday forecast requires 2-step transition probabilities. Conditioning on the weather situation of Tuesday and using the Law of Total Probability,

𝑝^{(2)}_{21} = Prob(𝑋𝑚+2 = 1 | 𝑋𝑚 = 2) = 𝑝21 𝑝11 + 𝑝22 𝑝21 = (0.4)(0.7) + (0.6)(0.4) = 0.52,

so Wednesday is sunny with probability 0.52 and rainy with probability 𝑝^{(2)}_{22} = 0.48. The Thursday forecast requires 3-step transition probabilities,
because it takes 3 transitions to move from Monday to Thursday. We have to use the Law of Total
Probability conditioning on both Tuesday and Wednesday. Explain and DIY based on:
2 → 𝑖 → 𝑗 → 1;
• or use the already computed 2-step transition probabilities 𝑝^{(2)}_{21} and 𝑝^{(2)}_{22} , describing the transition from Monday to Wednesday, together with the 1-step transition probabilities in 𝑃 from Wednesday to Thursday; ...
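The whole forecast can be reproduced in R; a small sketch using the transition matrix of this problem:
P  <- matrix(c(0.7, 0.3,
               0.4, 0.6), 2, 2, byrow = TRUE)   # states: 1 = sunny, 2 = rainy
P2 <- P %*% P; P3 <- P2 %*% P
P2[2, ]   # Wednesday, given rain on Monday: (0.52, 0.48)
P3[2, ]   # Thursday forecast: (0.556, 0.444)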
Problem 12.7.
A certain product is made by two companies, A and B, that control the entire market. Currently, A and B have 60 percent and 40 percent, respectively, of the total market. Each year, A loses 5% of its market share to B, while B loses 3% of its share to A.
Find the relative proportion of the market that each hold after 2 years, 20 years.
Problem 12.8.
Bernoulli process: Consider a Bernoulli random variable (a trial, or r.v.) 𝑋 that can take only two possible values, success as 1 and failure as 0, i.e. S𝑋 = {0, 1}. The probability of success is 𝑝 and the probability of failure is 1 − 𝑝.
Describe the Bernoulli process and construct a typical sample sequence of this process.
Theorem 12.15.
If every eigenvalue of a matrix 𝑃 yields linearly independent left eigenvectors in number equal to its multiplicity, then
* there exists a nonsingular matrix 𝑀 whose rows are left eigenvectors of 𝑃 , such that
* 𝐷 = 𝑀 𝑃 𝑀 −1 is a diagonal matrix with diagonal elements are the eigenvalues of 𝑃 , repeated
according to multiplicity.
Apply this for a practical problem in Business Intelligence through a case study in mobile phone in-
dustry in Thailand.
According to a recent survey, there are four big mobile producers/sellers 𝑁 , 𝑆, 𝑀 and 𝐿, and their market transitions in 2017 are given by the stochastic matrix:
         𝑁    𝑀    𝐿    𝑆
    𝑁 ⎡  1    0    0    0  ⎤
𝑃 =  𝑀 ⎢ 0.4   0   0.6   0  ⎥
    𝐿 ⎢ 0.2   0   0.1  0.7 ⎥
    𝑆 ⎣  0    0    0    1  ⎦
A computer is shared by 2 users who send tasks to the computer remotely and work independently. At any minute,
• any connected user may finish work and disconnect with probability 0.5, and
• any disconnected user may connect with a new task with probability 0.2.
Let 𝑋(𝑡) be the number of concurrent users at time t (minutes). This is a Markov chain with 3 states:
0, 1, and 2.
Compute transition probabilities in matrix 𝑃 .
Solution.
Row 1 of 𝑃 :
Suppose 𝑋(0) = 0, i.e., there are no users at time 𝑡 = 0. Then 𝑋(1) is the number of new connections
within the next minute.
It has binomial distribution Bin(2, 0.2), therefore, 𝑝00 =?, 𝑝01 =?, 𝑝02 =?
Draw the transition diagram for the Markov chain in this case.
Row 2 of 𝑃 :
Try to go further using binomial distributions Bin(1, 0.2), Bin(1, 0.5).
Row 3 of 𝑃 : Use Bin(2, 0.5).
a) Explain 𝑃 .
b) Check that both matrices 𝑃 and 𝑃 2 are stochastic. ♦
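A sketch of this construction in R, assuming (as above) that each connected user disconnects with probability 0.5 and each disconnected user connects with probability 0.2, independently:
p.row <- function(n.conn) {                        # row for state X(0) = n.conn
  stay <- dbinom(0:n.conn, n.conn, 0.5)            # users who remain connected
  new  <- dbinom(0:(2 - n.conn), 2 - n.conn, 0.2)  # users who newly connect
  convolve(stay, rev(new), type = "open")          # distribution of X(1)
}
P <- rbind(p.row(0), p.row(1), p.row(2))
round(P, 3); rowSums(P); rowSums(P %*% P)          # P and P^2 are stochastic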
Problem 12.11. A Markov jump process has two states, namely states 0 and 1 (for example, state 0 for healthy and state 1 for sick, respectively). In this case, we ignore transitions from either the healthy or sick state to a death state.
It is known that the distribution of the stay in state 0 is exponential with rate 𝑣0 = 𝜆, and the distribution of the time spent in state 1 is also exponential with rate 𝑣1 = 𝜇.
3. Find the transition probabilities, i.e. compute 𝑝00 (𝑡), 𝑝01 (𝑡).
SOLUTION: Let us calculate the transition probabilities by solving the Kolmogorov equations. For
example, we use the forward equations. Note we can do this since in this case the state space S =
{0, 1} is finite.
1. The transition rate matrix Q is given by
    ⎡ −𝜆   𝜆 ⎤
Q = ⎢        ⎥
    ⎣  𝜇  −𝜇 ⎦
2. Use the forward Kolmogorov equations 𝑑/𝑑𝑡 𝑝𝑖,𝑗 (𝑡) = ∑_{𝑘≠𝑗} 𝑝𝑖,𝑘 (𝑡) 𝑞𝑘,𝑗 − 𝑣𝑗 𝑝𝑖,𝑗 (𝑡):

So

𝑑/𝑑𝑡 𝑝1,1 (𝑡) = ∑_{𝑘≠1} 𝑝1,𝑘 (𝑡) 𝑞𝑘,1 − 𝑣1 𝑝1,1 (𝑡) = 𝑝1,0 (𝑡) 𝑞0,1 − 𝑣1 𝑝1,1 (𝑡),
with

𝑞0,1 = 𝑣0 𝑝0,1 = 𝜆 𝑝0,1 = 𝜆 · 1 = 𝜆,

hence

𝑑/𝑑𝑡 𝑝1,1 (𝑡) = 𝜆 𝑝1,0 (𝑡) − 𝜇 𝑝1,1 (𝑡).

Now, since 𝑝1,0 (𝑡) + 𝑝1,1 (𝑡) = 1, we get for 𝑦(𝑡) = 𝑝1,1 (𝑡) the linear ODE 𝑦 ′ = 𝜆(1 − 𝑦) − 𝜇 𝑦 = 𝜆 − (𝜆 + 𝜇)𝑦, whose general solution is

𝑦(𝑡) = 𝜆/(𝜆 + 𝜇) + 𝑐 𝑒−(𝜆+𝜇)𝑡 .
Find 𝑐 by using

1 = 𝑝1,1 (0) = 𝑦(0) = 𝜆/(𝜆 + 𝜇) + 𝑐   ⇒   𝑐 = 𝜇/(𝜆 + 𝜇),

which gives the solution

𝑝1,1 (𝑡) = 𝜆/(𝜆 + 𝜇) + (𝜇/(𝜆 + 𝜇)) 𝑒−(𝜆+𝜇)𝑡 .

We can find 𝑝1,0 (𝑡) next using the fact that 𝑝1,0 (𝑡) + 𝑝1,1 (𝑡) = 1.
3. Use the backward Kolmogorov eqns. for 𝑝0,1 (𝑡), 𝑝0,0 (𝑡); or see 4.
so

       ⎡ 1  0 ⎤      1
P(𝑡) = ⎢      ⎥ − ─────── [𝑒−(𝜆+𝜇)𝑡 − 1] Q .
       ⎣ 0  1 ⎦    𝜆 + 𝜇

Therefore,

       ⎡ 1  0 ⎤      1           1
P(𝑡) = ⎢      ⎥ + ─────── Q − ─────── Q 𝑒−(𝜆+𝜇)𝑡 ,
       ⎣ 0  1 ⎦    𝜆 + 𝜇       𝜆 + 𝜇

and finally,

       ⎡ 𝜇/(𝜆+𝜇)  𝜆/(𝜆+𝜇) ⎤   ⎡  𝜆/(𝜆+𝜇)  −𝜆/(𝜆+𝜇) ⎤
P(𝑡) = ⎢                   ⎥ + ⎢                     ⎥ 𝑒−(𝜆+𝜇)𝑡 .
       ⎣ 𝜇/(𝜆+𝜇)  𝜆/(𝜆+𝜇) ⎦   ⎣ −𝜇/(𝜆+𝜇)   𝜇/(𝜆+𝜇) ⎦
Now we can write down 𝑝0,1 (𝑡) and 𝑝0,0 (𝑡).
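The closed form can be checked against the matrix exponential; a sketch with hypothetical rates 𝜆 = 2, 𝜇 = 3:
lambda <- 2; mu <- 3; t <- 0.4
Q <- matrix(c(-lambda, lambda, mu, -mu), 2, 2, byrow = TRUE)
A <- matrix(c(mu, lambda, mu, lambda), 2, 2, byrow = TRUE) / (lambda + mu)
B <- matrix(c(lambda, -lambda, -mu, mu), 2, 2, byrow = TRUE) / (lambda + mu)
ev <- eigen(Q)
P.exact <- ev$vectors %*% diag(exp(ev$values * t)) %*% solve(ev$vectors)
max(abs(A + B * exp(-(lambda + mu) * t) - P.exact))   # ~ 0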
Chapter 13
Statistical Simulation
Describing systems with algorithms
[Source [56]]
13.1 Introduction
The main purpose of simulations is estimating such quantities whose direct computation is
complicated, risky, consuming (time and money), expensive, or impossible.
1. For example, suppose a complex device or machine is to be built and launched. Before it happens,
its performance is simulated, and this allows experts to evaluate its adequacy and associated risks
carefully and safely.
2. For example, one surely prefers to evaluate reliability and safety of a new module of a space station
by means of computer simulations rather than during the actual mission.
Recall that probability can be defined as a long-run proportion. With the help of random number
generators, computers can actually simulate a long run. Then, probability can be estimated by the
associated observed frequency. The longer the simulated run, the more accurate the obtained result.
Similarly, one can estimate expectations, variances, and other distribution characteristics from a long
run of simulated random variables. In brief we present
Problems in engineering and technology can be solved by the theory of simulation and Poisson processes, including:
1. Generate a random matrix in which columns represent different variables, see Section 13.2.2
2. Generate sample paths of a homogeneous Discrete Time Markov Chain (DTMC) by synchronous
simulation and asynchronous simulation, in Section 13.5
Consider a network of nodes. Some nodes are connected, say, with transmission lines, others are
not (mathematicians would call such a network a graph). A signal is sent from a certain node. Once
a node 𝑘 receives a signal, it sends it along each of its output lines with some probability 𝑝𝑘 . After
a certain period of time, one desires to estimate the proportion of nodes that received a signal, the
probability for a certain node to receive it, etc.
Technically, simulation of such a network reduces to generating Bernoulli random variables [in
Chapter 4] with parameters 𝑝𝑖 . Line 𝑖 transmits if the corresponding generated variable 𝑋𝑖 = 1. In
the end, we simply count the number of nodes that got the signal, or verify whether the given node
received it.
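A minimal sketch of such a network simulation in R (the network size, connection density and forwarding probabilities below are all hypothetical choices):
set.seed(1)
n <- 10
p.send <- runif(n, 0.3, 0.9)                 # forwarding probability of node k
adj <- matrix(rbinom(n * n, 1, 0.3), n, n); diag(adj) <- 0   # random links
received <- logical(n); received[1] <- TRUE  # the signal starts at node 1
for (step in 1:5) {
  for (k in which(received)) {
    out   <- which(adj[k, ] == 1)
    fires <- rbinom(length(out), 1, p.send[k]) == 1  # Bernoulli transmissions
    received[out[fires]] <- TRUE
  }
}
mean(received)                               # proportion of nodes reached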
Example 13.2. (Queuing) A queuing system [discussed in detail in Chapter 14.] is described by
a number of random variables. It involves spontaneous arrivals of jobs, their random waiting time,
assignment to servers, and finally, their random service time and departure.
When designing a queuing system or a server facility, it is important to evaluate its vital performance
characteristics. This will include
• the average number of available (idle) servers at the time when a job arrives, and so on.
When an organization realizes that a system is not operating as desired, it will look for ways to improve its performance. To do so, it is sometimes possible to experiment with the real system and, through observation and the aid of Statistics, reach valid conclusions towards future system improvement.
• However, experiments with a real system may entail ethical and/or economical problems, which may
be avoided dealing with a prototype, a physical model.
• Sometimes, it is not feasible or possible, to build a prototype, yet we may obtain a mathematical
model describing, through equations and constraints, the essential behaviour of the system.
• This analysis may be done, sometimes, through analytical or numerical methods, but the model may
be too complex to be dealt with.
Statistically, in the design phase of a system there is no system available, so we cannot rely on measurements for generating a pdf.
In such extreme cases, we may use simulation. Large complex system simulation has become common practice in many industrial areas. Essentially, simulation consists of (i) building a model of the system, (ii) generating values of the random input quantities, (iii) passing the inputs through the simulation model, and (iv) analyzing the output data.
Once we have a computer simulation model of the actual system, we need to generate values for the random quantities that are part of the system input (to the model).
In this chapter, from the Statistical point of view, we introduce key concepts, methods and tools from
simulation with the Industrial Statistics orientation in mind. The major parts of this section are from
[58, Chapter 5]. We mainly consider the problem within Step (ii) only.
To conduct Step (i) rightly and meaningfully, a close collaboration with experts in specific areas is
vital. Topics discussing Step (ii) are shown in the other chapters.
We learn
4. How to analyze and interpret output data, making meaningful inferences.
y = rpois(8, 12);
# a sample of 8 numbers following Pois(𝜆), assuming the average rate 𝜆 = 12
SYNTAX:
# dxyz(parameters) = computes the probability mass function / p.d.f. of distribution xyz
# pxyz(parameters) = finds the c.d.f., the cumulative distribution
# qxyz(parameters) = gives the quantile function, the inverse of the cdf
E.g., the probability of a male height less than or equal to 180 cm, given that the Gaussian distribution has mean = 175 and sd = 5:
t = 180; m = 175; s = 5;
pnorm(t, m, s);    # = 0.8413
• Fisher: What is the upper 𝑎 = 5% = 0.05 critical value for the Fisher distribution with degrees of freedom n1 = 16, n2 = 21?
a = 0.05; n1 = 16; n2 = 21; qf(1 - a, n1, n2);
We will illustrate a simulation of data using popular probability distributions in software R, via a practical
problem of insurance premium determination in Actuarial Science with 𝑝 − 1 = 8 predictors.
We observe 𝑛 = 18 customers together with the actuarial premium amounts 𝑦 = [𝑦1 , 𝑦2 , . . . , 𝑦18 ] they annually pay to AIA to insure against risks in their lives. Hence, our data matrix X has size 18 × 9.
How to design the data matrix before collecting real sample data?
Our data matrix X and response sample 𝑦 must reflect realistic conditions of real life; for instance, the predictors 𝑋𝑗 and the response 𝑌 [at least] have to follow certain probability distributions.
3. Number of family members (𝑥4 , ordinal) x4 = rbinom(n, 4, 0.5)+ 1; # the number of family members
of customer
x5 = rpois(n, 12);
x7 = c(’d’, ’t’, ’b’, ’e’, ’d’, ’d’,’p’, ’n’, ’l’, ’t’, ’b’, ’p’, ’n’, ’l’, ’t’,’e’, ’p’, ’n’);
Y = rpois(n, 860);
# actuarial premium cost per year of customer,
# average = 860 USD
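For the session to run end-to-end, the remaining predictors must also be generated; a minimal sketch (the distributions chosen for x1, x2, x3, x6, x8 below are our own assumptions, for illustration only):
n = 18; set.seed(7)
x1 = rnorm(n, 40, 8);          # assumed: age of customer
x2 = rnorm(n, 1.65, 0.08);     # assumed: height in meters
x3 = sample(c('F','M'), n, replace = TRUE);   # assumed: sex, nominal
x6 = rbinom(n, 1, 0.3);        # assumed: smoker indicator
x8 = rpois(n, 2);              # assumed: number of previous claims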
dataX= data.frame(x1, x2, x3, x4, x5, x6,x7, x8); nrow(dataX);
# No INTERACTION
M1=lm(Y ~ x4+ x5+ x6+ x8)
anova(M1); summary(M1)
# No INTERACTION
# but transform nominal variable to numeric
fx3= factor(x3); fx7= factor(x7);
M0=lm(Y~ x1+ x2+ fx3+ x4+ x5+ x6+ fx7+ x8)
anova(M0); summary(M0)
Analysis of Variance Table
Response: Y
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 4.3 4.3 0.1030 0.7643802
x2 1 742.0 742.0 17.8248 0.0134566 *
fx3 1 264.6 264.6 6.3567 0.0652693 .
x4 1 156.9 156.9 3.7694 0.1241658
x5 1 1122.7 1122.7 26.9700 0.0065463 **
x6 1 1529.3 1529.3 36.7356 0.0037411 **
fx7 6 6905.2 1150.9 27.6459 0.0032232 **
x8 1 7771.5 7771.5 186.6839 0.0001662 ***
Residuals 4 166.5 41.6
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.452 on 4 degrees of freedom
Multiple R-squared: 0.9911,Adjusted R-squared: 0.9621
F-statistic: 34.18 on 13 and 4 DF, p-value: 0.001878
General concepts.
The most basic computational component in simulation involves the generation of random variables
distributed uniformly between 0 and 1.
These then can be used to generate other random variables, both discrete and continuous de-
pending on practical contexts. Key requirements for meaningfully reasonable/reliable simulation:
• the simulation is run long enough to obtain an estimate of the operating characteristics of the system
• the number of runs also should be large enough to obtain reliable estimate
• the result of each run is a random sample implies that a simulation is a statistical experiment, that
must be conducted using statistical tools such as:
i) point estimation,
→ the pdf or cdf of 𝑋:

𝐹 (𝑘) = ∑_{𝑖=0}^{𝑘} 𝑝(𝑖) ∈ [0, 1],
Consider a r.v. 𝑉 with pdf 𝑓𝑉 (𝑣) and a given transformation 𝑋 = 𝑔(𝑉 ). Denote by 𝑣1 , 𝑣2 , · · · , 𝑣𝑛 the real roots of the equation

𝑥 = 𝑔(𝑣).   (13.1)

Then

𝑓𝑋 (𝑥) = ∑_{𝑙=1}^{𝑛} 𝑓𝑉 (𝑣𝑙 ) · 1/|𝑑𝑔(𝑣𝑙 )/𝑑𝑣𝑙 | .
Given 𝑥, if Equ. 13.1 has no real solutions then the pdf 𝑓𝑋 (𝑥) = 0
Proof. DIY
• A) Linear case: for 𝑋 = 𝑔(𝑉 ) = 𝑎𝑉 + 𝑏 with 𝑎 ≠ 0,

𝑓𝑋 (𝑥) = (1/|𝑎|) 𝑓𝑉 ((𝑥 − 𝑏)/𝑎).
• B) Inverse case: given the cdf 𝐹𝑋 (𝑥) of 𝑋, take 𝑋 = 𝑔(𝑉 ) = 𝐹𝑋^{−1} (𝑉 ).
Consider a r.v. 𝑉 with uniform cdf 𝐹𝑉 (𝑣) = 𝑣, 𝑣 ∈ [0, 1]. Then the transformation 𝑋 = 𝑔(𝑉 ) = 𝐹𝑋^{−1} (𝑉 ) gives variates 𝑥 of 𝑋 with cdf 𝐹𝑋 (𝑥).
Proof. For any real number 𝑎, and due to the monotonicity of the cdf 𝐹𝑋 ,

P(𝑋 ≤ 𝑎) = P[𝐹𝑋^{−1} (𝑉 ) ≤ 𝑎] = P[𝑉 ≤ 𝐹𝑋 (𝑎)] = 𝐹𝑉 (𝐹𝑋 (𝑎)) = 𝐹𝑋 (𝑎).
1. Invert the given cdf 𝐹𝑋 (𝑥) to find its inverse 𝐹𝑋^{−1} .
2. Obtain a standard uniform random variable 𝑉 ∈ [0, 1] from a random number generator.
3. Generate variates 𝑥 via the transformation 𝑋 = 𝐹𝑋^{−1} (𝑉 ).
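A small sketch of these three steps in R, for the exponential distribution with rate 2 (whose cdf 𝐹 (𝑥) = 1 − 𝑒^{−2𝑥} inverts to 𝐹 ^{−1}(𝑣) = −log(1 − 𝑣)/2):
set.seed(1)
rate <- 2
V <- runif(10000)              # step 2: standard uniform variates
X <- -log(1 - V) / rate        # step 3: X = F^{-1}(V)
c(mean(X), 1 / rate)           # sample mean vs theoretical mean 1/rate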
We next discuss how to generate values of arbitrary discrete and continuous distributions.
𝐹 (𝑥𝑘 ) = P[𝑋 ≤ 𝑥𝑘 ] = 𝑉 ⇐⇒ 𝑥𝑘 = 𝐹 −1 (𝑉 ),
in which the parameters of the step function 𝑔(𝑉 ) are given by:
• 𝑋 = 0 if 𝑉 < 0,
• else
𝑋 = 𝐹 −1 (𝑉 ) = 𝑥𝑘 ⇐⇒ ∑_{𝑖=0}^{𝑘−1} 𝑝𝑖 < 𝑉 ≤ ∑_{𝑖=0}^{𝑘} 𝑝𝑖 ,   𝑘 ∈ {1, ..., 𝑛};   (13.2)
    ⎧ 0      if 𝑉 < 0
    ⎪ 𝑥1     if 𝑉 < 𝑝1
    ⎪ 𝑥2     if 𝑝1 < 𝑉 < 𝑝1 + 𝑝2
𝑋 = ⎨  ...                                               (13.3)
    ⎪ 𝑥𝑘     if ∑_{𝑖=0}^{𝑘−1} 𝑝𝑖 < 𝑉 ≤ ∑_{𝑖=0}^{𝑘} 𝑝𝑖
    ⎩  ...

Clearly,

P[𝑋 = 𝑥𝑘 ] = P[ ∑_{𝑖=0}^{𝑘−1} 𝑝𝑖 < 𝑉 ≤ ∑_{𝑖=0}^{𝑘} 𝑝𝑖 ] = 𝑝𝑘 .
𝐴0 = [0, 𝑝0 )
𝐴1 = [𝑝0 , 𝑝0 + 𝑝1 )
𝐴2 = [𝑝0 + 𝑝1 , 𝑝0 + 𝑝1 + 𝑝2 )        (13.4)
· · ·
Subinterval 𝐴𝑘 will have length 𝑝𝑘 ; there may be a finite or infinite number of them, according to
possible values of 𝑋.
2. Obtain a standard uniform random variable 𝑉 from a random number generator or a table of random
numbers.
3. If 𝑉 belongs to 𝐴𝑘 , let 𝑋 = 𝑥𝑘 .
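A sketch of this subinterval method in R, for an arbitrary small pmf (the values and probabilities below are illustrative):
set.seed(2)
x.vals <- c(0, 1, 2, 3); p <- c(0.1, 0.4, 0.3, 0.2)
V <- runif(10000)
X <- x.vals[findInterval(V, cumsum(p)) + 1]   # V in A_k  =>  X = x_k
table(X) / length(X)                          # empirical vs true pmf p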
For a Bernoulli(𝑝) variable, 𝐹 −1 (𝑉 ) = 𝑢(𝑉 − (1 − 𝑝)), where 𝑢 is the unit step function.
For the binomial distribution Bin(𝑛, 𝑝),

𝑝𝑘 = P[𝑋 = 𝑘] = C(𝑛, 𝑘) 𝑝^𝑘 (1 − 𝑝)^{𝑛−𝑘} ,

and 𝑋 = 𝑘 ⇐⇒ ∑_{𝑖=0}^{𝑘−1} 𝑝𝑖 < 𝑉 ≤ ∑_{𝑖=0}^{𝑘} 𝑝𝑖 ,   𝑘 ∈ {1, ..., 𝑛};
    ⎧ 0      if 𝑉 < 0
    ⎪ 1      if 𝑉 < 𝑝1
    ⎪ 2      if 𝑝1 < 𝑉 < 𝑝1 + 𝑝2
𝑋 = ⎨  ...                                               (13.5)
    ⎪ 𝑘      if ∑_{𝑖=0}^{𝑘−1} 𝑝𝑖 < 𝑉 ≤ ∑_{𝑖=0}^{𝑘} 𝑝𝑖
    ⎪  ...
    ⎩ 𝑛      if ??
Continuous-randomize(𝐹 )
Output Values 𝑥 of 𝑋.
1. Obtain a standard uniform random variable 𝑉 ∈ [0, 1] from a random number generator.
2. Generate values 𝑥 via the transformation 𝑋 = 𝐹 −1 (𝑉 ). In other words, solve the equation 𝐹 (𝑋) = 𝑉
for 𝑋.
The shape parameter 𝛼 and the frequency parameter 𝛽 completely determine Gamma(𝛼, 𝛽). However, the probability density function of 𝑋 ∼ Gamma(𝛼, 𝛽),

              ⎧ (1/(𝛽^𝛼 Γ(𝛼))) 𝑥^{𝛼−1} 𝑒^{−𝑥/𝛽} ,   if 𝑥 ≥ 0,
𝑔(𝑥; 𝛼, 𝛽) =  ⎨                                               (13.6)
              ⎩ 0,                                  if 𝑥 < 0,

has a cdf with no closed-form inverse, so the inverse-transform method cannot be applied to it directly.
The third step in a simulation process consists of passing the inputs through the simulation model to
obtain outputs to be analyzed later. We shall consider three main application areas:
Discrete event simulation (DES) deals with systems whose state changes at discrete times, not con-
tinuously. These methods were initiated in the late 50’s; for example, the first DES-specific language,
GSP, was developed at GE by Tocher and Owen to study manufacturing problems.
To study such systems, we build a discrete event model. Its evolution in time implies changes in
the attributes of one of its entities, or model components, and it takes place in a given instant. Such
change is called event. The time between two events/instants is an interval. A process describes the
sequence of states of an entity throughout its life in the system.
There are several strategies to describe such evolution, which depend on the mechanism that regulates
time evolution within the system.
CONCEPT 11.
• When such evolution is based on time increments of the same duration, we talk about
synchronous simulation.
• MCMC: a modern technique of generating random variables from rather complex, often intractable
distributions, as long as conditional distributions have a reasonably simple form.
• According to the Markov chain Monte Carlo (MCMC) methodology, a long sequence of random vari-
ables is generated from conditional distributions.
A wisely designed MCMC will then produce random variables that have the desired unconditional
distribution, no matter how complex it is.
• The joint distribution of good and defective chips on a produced wafer has a rather complicated
correlation structure.
As a result, it can only be written explicitly for rather simplified artificial models.
• On the other hand, the quality of each chip is predictable based on the quality of the surrounding,
neighboring chips.
• Given its neighborhood, conditional probability for a chip to fail can be written, and thus, its
quality can be simulated by generating a corresponding Bernoulli random variable with 𝑋𝑖 = 1
indicating a failure.
Generation of values of a Markov Chain is discussed now. We consider a homogeneous Discrete Time
Markov Chain (DTMC) described by a transition matrix 𝑃 .
Definition 13.1.
• 𝑃 = [𝑝𝑖𝑗 ]- the state transition matrix, with 𝑝𝑖𝑗 = P(𝑋𝑛+1 = 𝑗|𝑋𝑛 = 𝑖).
P(X_{n+1} = j | X_n = i, ···, X_0 = a) = P(X_{n+1} = j | X_n = i)   [the Markov property].
If the state transition probabilities p_ij(n + 1) of a Markov chain M are independent of the time n, they are said to be stationary, time-homogeneous, or just homogeneous. The state transition probabilities of a homogeneous chain can then be written without mentioning the time point n:
Unless stated otherwise, we assume and will work with homogeneous Markov chains 𝑀 . The one-
step transition probabilities given by 12.2 of these Markov chains must satisfy:
∑_{j=1}^{s} p_ij = 1, for each i = 1, 2, ···, s, and p_ij ≥ 0.
• the initial distribution (i.e. the probability distribution of starting position of the concerned object
at time point 0), and
We want to determine the probability distribution of position X_n for any time point n > 0. The
Markov property, quantitatively described through transition probabilities, can be represented conve-
niently in the so-called state transition matrix 𝑃 = [𝑝𝑖𝑗 ]:
P =
[ p_11  p_12  p_13  ...  p_1s ]
[ p_21  p_22  p_23  ...  p_2s ]
[ p_31  p_32  p_33  ...  p_3s ]                                    (13.9)
[  ...   ...   ...  ...   ... ]
Definition 13.2.
Vector 𝑝* is called a stationary distribution of a Markov chain {𝑋𝑛 , 𝑛 ≥ 0} with the state transi-
tion matrix 𝑃 if:
𝑝* 𝑃 = 𝑝* .
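A stationary distribution can be computed numerically as the normalized left eigenvector of P for eigenvalue 1. A minimal R sketch (ours; the example matrix is an illustrative assumption):

# Stationary distribution of a DTMC: solve p* P = p*, sum(p*) = 1.
stationary <- function(P) {
  e <- eigen(t(P))                               # left eigenvectors of P
  v <- Re(e$vectors[, which.min(abs(e$values - 1))])
  v / sum(v)                                     # normalize to probabilities
}
P <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)
stationary(P)                                    # here (5/6, 1/6)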
Consider a homogeneous DTMC {X_n} described by the transition matrix P = [p_ij]. How do we generate sample paths of {X_n}? Two issues are involved here:
In a), we want to generate values from the single stationary distribution p* that describes the steady-state behavior of the MC. Since p* is a one-dimensional pmf, the algorithm after Theorem 13.2 suffices.
We illustrate both strategies describing how to sample from a Markov chain with state space 𝑆 and
transition matrix
𝑃 = (𝑝𝑖𝑗 ), with 𝑝𝑖𝑗 = P(𝑋(𝑛 + 1) = 𝑗|𝑋(𝑛) = 𝑖).
This synchronous approach has the potential shortcoming that 𝑋(𝑛) = 𝑋(𝑛 + 1), with the corre-
sponding computational effort lost.
Alternatively, we may
• first sample the holding time T_n spent in the current state, and then sample the new state X(n + T_n). If X(n) = s, then T_n follows a geometric distribution Geom(p_ss) with parameter p_ss, and
Should we wish to sample 𝑁 transitions of the chain, assuming 𝑋(0) = 𝑖0 , we could run the following
algorithm.
Sampling-transitions(N, S, P)
Do t = 0, X(0) = i_0
While t < N
    Sample h ∼ Geom(p_{x(t),x(t)})
    Sample X(t + h) ∼ { p_{x(t),j} / (1 − p_{x(t),x(t)}) : j ∈ S ∖ {x(t)} }
    Do t = t + h
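A minimal R rendering of this pseudo-code (ours; it assumes no absorbing state, i.e. p_ss < 1 for all s):

# Asynchronous sampling of the transitions of a DTMC with matrix P.
sample_transitions <- function(N, P, i0) {
  s <- i0; t <- 0; visits <- c(s)
  while (t < N) {
    pss <- P[s, s]
    h   <- rgeom(1, prob = 1 - pss) + 1    # geometric holding time, >= 1
    js  <- setdiff(seq_len(nrow(P)), s)    # candidate new states
    s   <- js[sample.int(length(js), 1, prob = P[s, js])]
    t   <- t + h
    visits <- c(visits, s)
  }
  visits                                   # sequence of states actually entered
}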
1. Event scheduling
The simulation time advances until the next event and the corresponding activities are executed.
If we have 𝑘 types of events (1, 2, . . . , 𝑘) , we maintain a list of events, ordered according to their
execution times (𝑡1 , 𝑡2 , . . . , 𝑡𝑘 ) .
A routine 𝑅𝑖 associated with the 𝑖-th type of event is started at time 𝜏𝑖 = min(𝑡1 , 𝑡2 , . . . , 𝑡𝑘 ).
2. Process interaction
• A process represents an entity and the set of actions that it experiences throughout its life within the model. The system behavior may be described as a set of processes that interact, for example, competing for limited resources.
A list of processes is maintained, ordered according to the occurrence of the next event. Processes may be interrupted, their routines having multiple entry points, designated reactivation points.
• Each execution of the program will correspond to a replication, which corresponds to simulating
the system behavior for a long enough period of time, providing average performance measures,
say 𝑋(𝑛), after 𝑛 customers have been processed.
If the system is stable,
𝑋(𝑛) −→ 𝑋.
If, e.g., processing 1000 jobs is considered long enough, we associate with each replication 𝑗
of the experiment the output 𝑋 𝑗 (1000).
Random walks are special cases of Markov chains, and thus can be studied by Markov chain methods. We use random walks to supply the mathematical basis for BLAST, a sequence-comparison procedure often employed in bioinformatics.
Example 13.10. Consider a simple case of the two aligned DNA sequences
ggagactgtagacagctaatgctata
gaacgccctagccacgagcccttatc
Suppose we give
- a score +1 if the two nucleotides in corresponding positions are the same and
- a score -1 if they are different.
When we compare two sequences from left to right, the accumulated score performs a random walk, more precisely a simple random walk in one dimension. The theory below treats the generic case, but we will use this example and BLAST as a running example.
X_n = ∑_{i=1}^{n} Z_i,  n = 1, 2, ···,  and X_0 = 0.
The collection of r.v.’s {𝑋𝑛 , 𝑛 ≥ 0} is a random process, and it is called the simple random walk in
one dimension.
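A minimal R sketch (ours) that reproduces the ±1 scoring walk for the two example sequences, plus a generic simple random walk:

# Accumulated +/-1 score of two aligned sequences, as in Example 13.10.
s1 <- strsplit("ggagactgtagacagctaatgctata", "")[[1]]
s2 <- strsplit("gaacgccctagccacgagcccttatc", "")[[1]]
Z  <- ifelse(s1 == s2, +1, -1)    # step scores Z_i
X  <- c(0, cumsum(Z))             # walk X_0 = 0, X_n = Z_1 + ... + Z_n

# A generic simple random walk with P(Z = +1) = p:
rwalk <- function(n, p = 0.5)
  c(0, cumsum(sample(c(1, -1), n, replace = TRUE, prob = c(p, 1 - p))))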
(d) Verify the result of part (a) by enumerating all possible sample sequences that lead to the value
𝑋(𝑛) = −2 after four steps.
(e) Find the mean and variance of the simple random walk 𝑋(𝑛). Find the autocorrelation function
𝑅𝑋 (𝑛, 𝑚) of the simple random walk 𝑋(𝑛).
(f) Show that the simple random walk 𝑋(𝑛) is a Markov chain.
(h) Derive the first-order probability distribution of the random walk 𝑋(𝑛).
Solution.
(a) Describe the simple random walk. 𝑋(𝑛) is a discrete-parameter (or time), discrete-state random
process. The state space is 𝐸 = {..., −2, −1, 0, 1, 2, ...}, and the index parameter set is 𝑇 = {0, 1, 2, ...}.
(b) Typical sample sequence.
A sample sequence 𝑥(𝑛) of a simple random walk 𝑋(𝑛) can be produced by tossing a coin every
second and letting 𝑥(𝑛) increase by unity if a head H appears and decrease by unity if a tail T appears.
Thus, for instance, we have a small realization of X(n) in Table 13.1:
n   : 0 1 2  3 4 5 6 7 8 9 10 ···
x_n : 0 1 0 −1 0 1 2 1 2 3  2 ···
Remark 5. The simple random walk process is often used in game theory or bioinformatics.
We define the ladder points to be the points in the walk lower than any previously reached point. An
excursion in a walk is the part of the walk from a ladder point to the highest point attained before the
next ladder point.
Let A and B denote the numbers of +1 steps and −1 steps among the first n steps, so that
A + B = n,  A − B = X_n.
When X(n) = k, we see that A = (n + k)/2, and A is a binomial r.v. with parameters (n, p). We conclude that the probability distribution of X(n) is given by 13.10, in which n ≥ |k|, and n, k must be both even or both odd.
Set 𝑘 = −2 and 𝑛 = 4 in (13.10) to get the concerned probability 𝑝4 (−2) that 𝑋(4) = −2
(d) Verify the result of part (a) by enumerating all possible sample sequences that lead to the value
𝑋(𝑛) = −2 after four steps. DIY!
(e) The mean and variance of the simple random walk 𝑋(𝑛). Use the fact
The term Monte Carlo originally referred to simulations that involved random walks [see Section 13.6] and was first used by John von Neumann and S. M. Ulam in the 1940s. Today, the Monte Carlo method refers to any simulation that involves the use of random numbers.
In the following parts, we show that Monte Carlo simulations (or experiments) are a feasible and inexpensive way to understand the phenomena of interest. To conduct a simulation, you need a model
that represents your population or phenomena of interest and a way to generate random numbers
(according to your model) using a computer. The data that are generated from your model can then be
studied as if they were observations.
As we will see, one can use statistics based on the simulated data (means, medians, modes, vari-
ance, skewness, etc.) to gain understanding about the population.
The fundamental idea behind Monte Carlo simulation for inferential statistics is that insights regarding
the characteristics of a statistic can be gained by repeatedly drawing random samples from the same
population of interest and observing the behavior of the statistic over the samples.
W. Martinez [?] suggests the steps of a basic Monte Carlo procedure as follows:
1. Determine the pseudo-population or model that represents the true population of interest.
2. Use a sampling procedure to sample from the pseudo-population.
3. Calculate the value of the statistic of interest for the sample.
4. Repeat steps 2 and 3 for M trials.
5. Use the M values found in step 4 to study the distribution of the statistic.
This section discusses the most basic and typical application of Monte Carlo methods. Keeping in
mind that probabilities are long-run proportions, we generate a long run of experiments and compute
the proportion of times when our event occurred.
p̂ = P̂[X ∈ A] = (number of X_1, X_2, ..., X_N ∈ A) / N = S/N,      (13.11)
where 𝑁 is the size of a Monte Carlo experiment, 𝑋1 , 𝑋2 , · · · , 𝑋𝑁 are generated random variables with
the same distribution as 𝑋, and a “hat” means the estimator. The latter is a very common and standard
notation:
The number S of X_1, X_2, ···, X_N that fall within set A has a binomial distribution Bin(N, p), with mean E[S] = Np and variance V[S] = Np(1 − p). The accuracy of a Monte Carlo study depends on the expectation and variance of the estimator p̂:
E[p̂] = Np/N = p,
V[p̂] = Np(1 − p)/N²                                                (13.12)
⟹ Std[p̂] = √( p(1 − p)/N ).
♣ OBSERVATION 3.
The first result, E[p̂] = p, shows that our Monte Carlo estimator of p is unbiased, so that over a long run it will, on average, return the desired quantity p. The last result, the standard deviation
Std[p̂] = √( p(1 − p)/N ),
indicates that the standard error of our estimator p̂ decreases with N at the rate 1/√N.
Larger Monte Carlo experiments produce more accurate results. A 100-fold increase in the number of
generated variables reduces the standard deviation (therefore, enhancing accuracy) by a factor of 10.
Why? Because we want to design a Monte Carlo study that attains a desired accuracy; that is, an error |p̂ − p| not exceeding ε with high probability (1 − α), i.e.
P[|p̂ − p| > ε] ≤ α.                                                (13.13)
Recall that z_α = Φ^{-1}(1 − α) is the value (critical value) of a Standard Normal variable Z that is exceeded with probability α.
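A minimal R sketch of the whole scheme (ours; the target probability P[Z > 2] for Z ∼ N(0, 1) and all names are illustrative assumptions, and the design step uses the conservative bound p(1 − p) ≤ 1/4):

# Monte Carlo estimate of p = P[X in A], here P[Z > 2] for Z ~ N(0,1).
set.seed(1)
N    <- 10^5
X    <- rnorm(N)
phat <- mean(X > 2)                      # S/N as in (13.11)
se   <- sqrt(phat * (1 - phat) / N)      # estimated Std[phat]
c(phat, se, 1 - pnorm(2))                # compare with the exact value

# Design: N such that P[|phat - p| > eps] <= alpha, via (13.13).
eps <- 0.005; alpha <- 0.05
ceiling(0.25 * (qnorm(1 - alpha/2) / eps)^2)   # required N, about 38416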
Computational problems
Problem 13.1.
M represents a system with 2 states, on and off, of an email server, where on = acceptable operation and off = overload (when the server cannot receive or deliver email). The time shift unit is 1 minute. Which of the following is true:
• 𝑃 has 3 eigenvalues
• 𝑀 is reducible
Problem 13.2.
Let 𝑀 be a two state Markov chain, with its state transition matrix is
P = [ p_11 p_12 ; p_21 p_22 ] = [ 1 − c  c ; d  1 − d ], where 0 < c < 1, 0 < d < 1.   (13.15)
𝑀 represents a traffic system with 2 states on and off of a road at SG, where on = acceptable vehicle
flow, and off = traffic jam (when the road can not fulfill its functionality).
When 𝑐 = 𝑑, compute the limit matrix lim𝑛→∞ 𝑃 𝑛 .
Problem 13.3.
Consider a Markov chain with two states and transition probability matrix
P = [ 3/4 1/4 ; 1/2 1/2 ].
Problem 13.4.
Toyota (denoted 𝑇 ) currently takes over 60% of the yearly car market in Vietnam, its rival Ford and
other brands (denoted 𝐹 ) takes the other share. Historical data shows that the state transition matrix
𝑃 of the market fluctuation is found as
P =
      T     F
[ T  0.88  0.12 ]
[ F  0.15  0.85 ]                                                  (13.16)
The car market share of Toyota in the next 4 years is given by the vector 𝑝(4) as:
a) 𝑝(4) = [𝑝𝑇 (4), 𝑝𝐹 (4)] = 𝑃 3 𝑝(1)?
b) 𝑝(4) = [𝑝𝑇 (4), 𝑝𝐹 (4)] = 𝑃 3 𝑝(0)?
c) 𝑝(5) = [𝑝𝑇 (5), 𝑝𝐹 (5)] = 𝑃 5 𝑝(0)?
d) 𝑝(4) = [2/3, 1/3]?
—————————————————–
Theoretic problems
A/ Concepts
1. Show that if 𝑃 is a Markov matrix, then 𝑃 𝑛 is also a Markov matrix for any positive integer 𝑛.
2. A state transition diagram of a finite-state Markov chain is a line diagram with a vertex corresponding
to each state and a directed line between two vertices 𝑖 and 𝑗 if 𝑝𝑖𝑗 > 0.
In such a diagram, if one can move from i to j along a path following the arrows, then i → j. The diagram is useful to determine whether a finite-state Markov chain is irreducible or not, or to check for periodicities.
Draw the state transition diagrams and classify the states of the MCs with the following transition
probability matrices:
P_1 = [ 0    0.5  0.5 ;  0.5  0    0.5 ;  0.5  0.5  0 ];

P_2 = [ 0 0 0.5 0.5 ;  1 0 0 0 ;  0 1 0 0 ;  0 1 0 0 ];

P_3 = [ 0.3 0.4 0 0 0.3 ;  0 1 0 0 0 ;  0 0 0 0.6 0.4 ;  0 0 1 0 0 ]
3. Verify the transitivity property of the Markov chain; that is, if i → j and j → k, then i → k. (Hint: use the Chapman–Kolmogorov equations.)
4. Show that in a finite-state Markov chain, not all states can be transient.
1. A certain product is made by two companies, A and B, that control the entire market. Currently, A and B have 60 percent and 40 percent, respectively, of the total market. Each year, A loses 5 percent of its market share to B, while B loses 3 percent of its share to A.
Find the relative proportion of the market that each holds after 2 years.
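A minimal R sketch for this computation (ours; it assumes the row-vector convention p(n) = p(0) Pⁿ with the yearly transition matrix implied by the problem):

# Two-year market shares: p(2) = p(0) P P.
P  <- matrix(c(0.95, 0.05,
               0.03, 0.97), nrow = 2, byrow = TRUE)
p0 <- c(A = 0.6, B = 0.4)
p0 %*% P %*% P          # approximately (0.565, 0.435)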
2. Let two gamblers, A and B, initially have 𝑘 dollars and 𝑚 dollars, respectively. Suppose that at
each round of their game, A wins one dollar from B with probability 𝑝 and loses one dollar to B with
probability 𝑞 = 1 − 𝑝. Assume that A and B play until one of them has no money left. (This is known
as the Gambler’s Ruin problem.)
• The gambler 𝐴, say, plays continuously until he either accumulates a target amount of 𝑚, or
loses all his money.
We introduce the Markov chain shown whose state 𝑖 represents the gambler’s wealth at the
beginning of a round.
• All states are transient, except for the winning and losing states which are absorbing. Thus, the
problem amounts to finding the probabilities of absorption at each one of these two absorbing
states. Note, these absorption probabilities depend on the initial state 𝑖.
[Source [56]]
CHAPTER 14. POISSON PROCESS AND VARIATIONS: SYSTEMS CHANGED BY RANDOM ARRIVALS IN TIME
Engineering and service problems solved by the theory of Poisson processes include:
a/ a flow of patients arriving at a doctor's office (equivalently, emails arriving at a server, ...)
REMINDER: Time homogeneous Markov processes 𝑀 = {𝑋(𝑡)}𝑡≥0 = (S, 𝑝, 𝑃 (𝑡)) have stationary
(or homogeneous) transition probabilities. For such processes we have known that
Here
• 𝑝𝑖,𝑗 (𝑡) is the transition probability from state 𝑖 to state 𝑗 after a time duration 𝑡, whatever the time
point 𝑠 is, and
• 𝑝𝑗 (𝑢) is the ‘state’ probability that the process is in state 𝑗 at time point 𝑢.
Here constant value 𝜆 > 0 is the average number of events occurring in one unit of time, or the
rate or speed of events. Then we can write 𝑋(𝑡) to count the number of events randomly occurring
in the time interval [0, 𝑡), now with pdf
p_{X(t)}(i) = p(i; λt) = P[X(t) = i] = e^{−λt} (λt)^i / i!,   i = 0, 1, 2, ...   (14.3)
F(x; λ) = P(X ≤ x) = ∑_{i=0}^{x} P[X = i] = ∑_{i=0}^{x} p(i; λ),   x = 0, 1, 2, ...   (14.4)
• Hence, if the variance V[x] of a data set x is much greater than its mean E[x], the Poisson distribution would not be a good model for fitting the data.
Example 14.1.
In the Poisson Pois(λ), the constant λ > 0 is the rate or speed of events. If customers come to an SCB branch in Bangkok following a Poisson distribution with rate λ = 10, then
• the mean is μ = λ = 10 customers per hour, the variance is also V(X) = σ² = λ = 10, and
• the standard deviation of the arrivals is σ = √10 ≈ 3.16 customers per hour. We see that the probability of 20 arrivals is not negligible.
• Atmospheric dust particles (PM10 or PM2.5) at a particular location cause a serious environ-
mental problem for inhabitants.
• The number of particles within a unit volume is observed by focusing a powerful microscope on
the particles and making counts. The results of tests on 100 such volumes are shown in Table
14.1.
Number of particles   0   1   2   3   4   > 4
Observed frequency    13  24  30  18  7   8
Poisson frequency     12  26  27  19  10  6
HINT: Use formula 14.5, the estimated mean of the number of dust particles within each volume is
calculated as follows:
x̄ = ∑_i f_i x_i = (13/100)·0 + (24/100)·1 + (30/100)·2 + (18/100)·3 + (7/100)·4 + (8/100)·6
  = 2.14 = λ̂.
The theoretical Poisson frequencies of occurrence shown in Table 14.1 are obtained from
P(X = x | λ) = λ^x e^{−λ} / x! = 2.14^x e^{−2.14} / x!   for x = 0, 1, 2, 3, 4, 6.
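A minimal R sketch of this fit (ours; it scores the pooled cell "> 4" as the Poisson tail, which is why the last value only approximately matches the table):

# Fit of Pois(lambda) to the dust-particle counts of Table 14.1.
counts <- c(13, 24, 30, 18, 7, 8)          # observed frequencies
values <- c(0, 1, 2, 3, 4, 6)              # "> 4" scored as 6, as in the text
lambda <- sum(values * counts) / 100       # 2.14, the estimated mean
expected <- c(100 * dpois(0:4, lambda),    # cells 0..4
              100 * (1 - ppois(4, lambda)))# pooled tail cell
round(expected, 1)    # close to the Poisson row 12 26 27 19 10 6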
Counting processes are often found in Arrival-Type Processes. For such a process, we are interested
in occurrences that have the character of an ‘arrival’, such as
- message receptions at a receiver,
- job completions in a manufacturing cell, and customers purchase at a store, etc.
Suppose we count the total number of events 𝑁 (𝑡) that have occurred up to time 𝑡.
c/ for 𝑠 < 𝑡, the quantity 𝑁 (𝑡) − 𝑁 (𝑠) equals the number of events that occurred in the interval
[𝑠, 𝑡].
time T: 0 −−−− s −−−− t −−−−−− u −−−−− v −−−−→
[E.g. events = customer arrivals at an HSBC bank from 8 am till 12 pm, equivalent to interval [0, 4], 𝑡 =
4 hours; events = storms attacking coastal area of Vietnam during June to December, equivalent to
interval [0, 7], 𝑡 = 7 months...)
P1. Independent increment property: says that the (random) number of events (arrivals) 𝑁 (𝑡)−𝑁 (𝑠)
and 𝑁 (𝑣) − 𝑁 (𝑢) in two disjoint intervals [say [𝑠, 𝑡] ∩ [𝑢, 𝑣] = ∅] are independent.
P2. Stationary increment property: says that the distribution of the number of events 𝑁 (𝑠, 𝑡) :=
𝑁 (𝑡)−𝑁 (𝑠) occurring in interval [𝑠, 𝑡] depends only on the length ℎ = 𝑡−𝑠 of the interval, not on the position
of the interval.
CONCEPT 12.
A stochastic process {X(t), t ≥ 0} is stationary when all the X(t) have the same distribution. That means, mathematically, that for any τ the distribution of a stationary process is unaffected by a shift in the time origin, i.e. X(t) and X(t + τ) have the same distribution. For the first-order distribution, that means
F_X(x; t) = F_X(x; t + τ) = F_X(x);  and  f_X(x; t) = f_X(x).
P1- Independent increment: the (random) number of events- arrivals in two disjoint intervals are
independent.
P2- Stationary increment: the distribution of the number of events depends only on the length of
the interval, not on the position of the interval. It mathematically means: 𝑁 (𝑡 + ℎ) − 𝑁 (𝑠 + ℎ)
(the number of events in the interval [𝑠 + ℎ, 𝑡 + ℎ]) has the same distribution as 𝑁 (𝑡) − 𝑁 (𝑠) (the
number of events in the interval [𝑠, 𝑡]), for all 𝑠 < 𝑡 and ℎ > 0.
Combining Definition 14.1 and 14.2 we have the following working description.
Definition 14.3. 𝑁 (𝑡) is Poisson process with rate 𝜆, denoted 𝑁 (𝑡) ∼ Pois(𝜆), if
4. if 0 ≤ 𝑎 < 𝑏 then 𝑁 (𝑎) ≤ 𝑁 (𝑏), [non-decreasing function], and the quantity 𝑁 (𝑏) − 𝑁 (𝑎) equals
the number of events that occurred in the interval [𝑎, 𝑏],
5. 𝑁 (𝑡)−𝑁 (𝑠) and 𝑁 (𝑣)−𝑁 (𝑢) are independent random variables, for [𝑠, 𝑡]∩[𝑢, 𝑣] = ∅, [independent
increments],
6. 𝑁 (𝑡 + ℎ) − 𝑁 (𝑠 + ℎ) has the same distribution as 𝑁 (𝑡) − 𝑁 (𝑠), for all 𝑠 < 𝑡 and ℎ > 0 [stationary
increments,].
Combining 4. and 6. we now can say 𝑁 (𝑡 − 𝑠), the number of events that occurred in the time interval
[0, 𝑡 − 𝑠], has the same distribution as that of 𝑁 (𝑡) − 𝑁 (𝑠), i.e. P[𝑁 (𝑡 − 𝑠) = 𝑘] = P[𝑁 (𝑡) − 𝑁 (𝑠) = 𝑘],
for all 0 ≤ 𝑠 < 𝑡.
See postulates of Poisson processes in details in Section 14.2.3.
Example 14.3. Under assumption of purely random events, here are typical Poisson processes.
• The Poisson process is often applied to occurrences of events in time; like requests for service,
breakdowns of equipment, or arrivals of vehicles at a road intersection in cities...
• Hereafter we will refer to Poisson-type events that depend on temporal scale (in time) as arrivals,
such as customers arriving at a queue.
• The concept of Poisson process can be extended to spatial applications (in space) to model,
e.g, the locations of demands for service.
P[N(t) = n] = (λt)^n e^{−λt} / n!,   n = 0, 1, 2, ...              (14.6)
• This means 𝑁 (𝑡) ∼ Pois(𝜆𝑡), it is distributed as a Poisson random variable with mean 𝜆𝑡.
• We have P[𝑁 (𝑠 + 𝑡) − 𝑁 (𝑠) = 𝑛] = P[𝑁 (𝑡) = 𝑛], by the stationarity of the increments, so this
distribution completely characterizes the entire process.
We consider three postulates being associated with the Poisson process of rate 𝜆.
1. The Orderliness: given that one Poisson arrival occurs at a particular time, the conditional probability
that another occurs at exactly the same time is zero.
Fact 14.1. Thus, two or more arrivals cannot occur simultaneously, i.e. the probability that at least
two Poisson arrivals occur in a time interval of length 𝜏 is 𝑜(𝜏 ):
2. The Independent increment: the numbers of arrivals happening in disjoint time intervals are mutu-
ally independent random variables. Define
Fact 14.2. If the {[𝑎𝑖 , 𝑏𝑖 ] : 1 ≤ 𝑖 ≤ 𝑛} are non-overlapping, then random variables {𝑁 (𝑎𝑖 , 𝑏𝑖 ) : 1 ≤
𝑖 ≤ 𝑛} are independent.
3. The Stationary increment: the number of Poisson-type arrivals happening in any prespecified time interval of fixed length is identically distributed, its distribution depending only on the length of the interval.
• Suppose that {𝑁1 (𝑡)} and {𝑁2 (𝑡)} are two independent Poisson processes with rates 𝜆1 and 𝜆2
respectively. [They are the respective cumulative numbers of arrivals in time interval [0, 𝑡).]
• Let 𝑁 (𝑡) = 𝑁1 (𝑡) + 𝑁2 (𝑡), the combined process of a cumulative number of arrivals until time 𝑡.
Then {𝑁 (𝑡)} is a Poisson process with arrival-rate parameter 𝜆1 + 𝜆2 ( the sum of the individual
arrival rates).
N(t) ∼ Pois( (λ_1 + λ_2) t ).
This result extends in the obvious way to more than two independent Poisson processes. There
are many ways to prove this result, but the simplest is just to observe that the pooled process satisfies
each of the four postulates of the Poisson process in Section 14.2.3.
If T_1, T_2, ···, T_n are independent exponential variables, with T_i ∼ E(β_i) for all i = 1, 2, ..., n, then the minimum variable T = min{T_1, T_2, ···, T_n} ∼ E(∑_i β_i); it is exponentially distributed with parameter ∑_{i=1}^{n} β_i.
For n = 2, an interesting problem regarding the competition between two exponential random variables is to find the probability that one of the two variables, say X, is less than the other, say Y; as in service provision, where the provider completing his service earlier than the other is considered the winner.
Given two independently operating Poisson processes with rate parameters 𝜆1 and 𝜆2 respectively,
what is the probability that an arrival from process 1 (a ”type 1” arrival as blue buses) occurs before an
arrival from process 2 (a ”type 2” arrival as red buses)?
GUIDANCE for solving. To solve this problem, let the two independent inter-arrival times of interest be denoted by A and B for processes 1 and 2, respectively. We want to compute P[A < B]. Work it out yourself, invoking our knowledge of Poisson processes and using the joint pdf f_{A,B}(a, b) = f_A(a) f_B(b), where the pdf's of A, B are negative exponential with means 1/λ_1 and 1/λ_2 respectively, to get
P[A < B] = λ_1 / (λ_1 + λ_2)
(integrate over the part of the positive quadrant for which a < b):
P[T_1 < T_2] = ∫_0^{+∞} dv ∫_0^{v} f_{T_1,T_2}(u, v) du = β_2 / (β_1 + β_2).   (14.7)
E[𝑋1 (𝑇 )] = 𝜆1 𝑇...
This result makes sense intuitively: the probability that a type 1 arrival occurs before a type 2 arrival
is equal to the fraction of the pooled arrival rate comprising type 1 arrivals.
Proposition 14.3.
Let {𝑁 (𝑡), 𝑡 ≥ 0} represent Poisson arrivals with rate 𝜆. Moreover, each arrival can be of
type 1 with probability 𝑝 and be of
type 2 with probability 1 − 𝑝 independently of all other arrivals.
Let 𝑁1 (𝑡) be the number of type 1 arrivals up to time 𝑡 and 𝑁2 (𝑡) be the number of type 2 arrivals
up to time 𝑡. Then {𝑁1 (𝑡), 𝑡 ≥ 0} and {𝑁2 (𝑡), 𝑡 ≥ 0} are independent Poisson processes with
rates 𝜆𝑝 and 𝜆(1 − 𝑝) respectively.
Suppose the number of claims {N(t)} to an insurance company like AIA is formed from smokers and non-smokers following independent Poisson processes. Intuitively one thinks that 1/4 of the claims are from non-smokers and the rest from smokers.
• Then the number of events, 𝑁1 (𝑡) of Type I and the number of events, 𝑁2 (𝑡) of Type II also give rise
to Poisson processes.
If the rate of {𝑁 (𝑡)} is 𝜆, then the rate of {𝑁1 (𝑡)} is 𝜆𝑝 and the rate of {𝑁2 (𝑡)} is 𝜆(1 − 𝑝).
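A minimal R sketch of thinning (ours; the rates λ = 9 and p = 1/3 are borrowed from the insurance problem later in this chapter, purely as an illustration):

# Thinning (Proposition 14.3): split one Pois(lambda) arrival stream
# into type-1 (prob p) and type-2 (prob 1-p) streams.
set.seed(1)
lambda <- 9; p <- 1/3; t_end <- 1000
arrivals <- cumsum(rexp(2 * lambda * t_end, rate = lambda))  # event times
arrivals <- arrivals[arrivals <= t_end]
type1    <- runif(length(arrivals)) < p
c(rate1 = sum(type1) / t_end,    # close to lambda * p = 3
  rate2 = sum(!type1) / t_end)   # close to lambda * (1 - p) = 6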
Recall conditional distributions from Section 8.4, in which the concepts of conditional expectation and
conditional variance are particularly useful for Section 14.5.
It is the expected value of 𝑌 with respect to the conditional p.m.f. 𝑝𝑌 (𝑦|𝑥) (or the conditional p.d.f.
𝑓𝑌 (𝑦|𝑥)).
♣ OBSERVATION 4.
Note that for a pair of variables (𝑋, 𝑌 ), the conditional expectation E[𝑌 | 𝑋 = 𝑥] changes with 𝑥,
if 𝑋 and 𝑌 are dependent. Thus, we can consider E[𝑌 | 𝑋] to be a random variable, which is a
function of 𝑋.
E[Y | X] is a random variable, so E[E[Y|X]] does exist, computed over the range of X. We have a very powerful identity
Expectation of a function of a random variable: Just as conditional probabilities satisfy all the
properties of ordinary probabilities, so do the conditional expectations satisfy all the properties of
ordinary expectations.
Let 𝑔(𝑌 ) be a function of a r.v. 𝑌 . We consider the conditional expectation of 𝑔(𝑌 ), as an extension
of (14.8):
E[g(Y)| x] = E[g(Y)|X = x] = { ∑_y g(y) p_Y(y|x),             if (X, Y) discrete,
                               ∫_{−∞}^{∞} g(y) f_Y(y|x) dy,   if (X, Y) continuous.   (14.12)
E[𝑔(𝑌 )|𝑋] is a function of 𝑋 and takes the value E[𝑔(𝑌 )|𝑋 = 𝑥] when 𝑋 = 𝑥.
Consequently, E[g(Y)|X] is a random variable, whose mean can be calculated. We have a generalization of Equation (14.9) as follows:
Similarly as Definition 14.4, we define the conditional variance of 𝑌 , given 𝑋 = 𝑥, as the variance of
𝑌 , with respect to the conditional p.d.f. 𝑓𝑌 (𝑦 | 𝑥) = 𝑓𝑌 |𝑋 (𝑦 | 𝑥).
That is, V[𝑌 | 𝑋] is equal to the (conditional) expected square of the difference between 𝑌 and its
(conditional) mean E[𝑌 |𝑋] when the value of 𝑋 is given. In other words, V[𝑌 | 𝑋] is exactly analogous
to the usual definition of variance, but now all expectations are conditional on the fact that 𝑋 is known.
• And furthermore, the variance of 𝑌 itself is computed via the conditional variance V[𝑌 | 𝑋]
and the conditional expectation E[𝑌 |𝑋]:
due to (14.13). Also, E[E[Y|X]] = E[Y] ⟹ V[E[Y|X]] = E[(E[Y|X])²] − (E[Y])²; then summing the two shows E[V[Y|X]] + V[E[Y|X]] = E[Y²] − (E[Y])² = V[Y].
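The identity is easy to check numerically. A minimal R sketch (ours; the mixture model below is an arbitrary illustrative assumption):

# Check V[Y] = E[V[Y|X]] + V[E[Y|X]] for X ~ Bernoulli(0.3),
# and Y | X ~ Normal(mean = 2X, sd = 1 + X).
set.seed(1)
N <- 10^6
X <- rbinom(N, 1, 0.3)
Y <- rnorm(N, mean = 2 * X, sd = 1 + X)
EV <- 0.7 * 1^2 + 0.3 * 2^2      # E[V[Y|X]] = E[(1+X)^2] = 1.9
VE <- var(2 * X)                 # V[E[Y|X]], about 0.84
c(var(Y), EV + VE)               # agree up to Monte Carlo error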
Suppose that in a population 𝑃 of size 𝑁 , there are 𝑀 units that have a certain property 𝑇 , and
so 𝑁 − 𝑀 units that do not have the property. Let 𝐽𝑛 denote the number of units having the certain
property 𝑇 , randomly sampled without replacement (RSWOR) from 𝑃 .
𝐽𝑛 is a random variable with Range(𝐽𝑛 ) = {0, 1, 2, . . . , 𝑀 }.
The distribution of 𝐽𝑛 is called the hypergeometric distribution, denoted by 𝐻(𝑁, 𝑀, 𝑛).
The condition for 𝑛 is 𝑛 ≤ min{𝑀, 𝑁 − 𝑀 }.
1. Its pmf is denoted by ℎ(𝑗; 𝑁, 𝑀, 𝑛), for 𝑗 ∈ {0, 1, . . . , 𝑛} [Fig. 14.1 shows ℎ(𝑗; 500, 350, 100)],
given as
p_j = P[J_n = j] = h(j; N, M, n) = C(M, j) C(N − M, n − j) / C(N, n),   (14.16)
where C(A, a) = (A choose a) is the binomial coefficient for choosing a units from A units. The pmf table is
J_n : 0, 1, ···, n − 1, n.
2. Its cdf is denoted by H(k; N, M, n) = ∑_{j=0}^{k} p_j.
Assume that 𝑋 and 𝑌 are independent binomial random variables with identical parameters 𝑛 and
𝑝; having the joint p.m.f. 𝑝(𝑥, 𝑦) = 𝑝𝑋,𝑌 (𝑥, 𝑦).
Calculate the conditional expected value of 𝑌 given that 𝑌 + 𝑋 = 𝑚.
FACT: Furthermore, the sum S = X + Y ∼ Bin(2n, p); that is, X + Y is also a binomial random variable, with parameters 2n and p.
P[Y = k | X + Y = m] = P[Y = k] P[X = m − k] / P[X + Y = m]
= [ C(n, k) p^k (1 − p)^{n−k} · C(n, m − k) p^{m−k} (1 − p)^{n−m+k} ] / [ C(2n, m) p^m (1 − p)^{2n−m} ]   (14.18)
= C(n, k) C(n, m − k) / C(2n, m) = h(k; 2n, n, m),
a hypergeometric distribution with parameters N = 2n, M = n and sample size m.
Suppose that by any time 𝑡 the number of people that have arrived at a train depot is a Poisson
random variable with mean 𝜆 𝑡.
If the initial train arrives at the depot at a time (independent of when the passengers arrive) that
is uniformly distributed over (0, 𝑇 ), what are the mean and variance of the number of passengers
who enter the train? Is the load of train high?
Modeling:
For each 𝑡 ≥ 0, let 𝑁 (𝑡) denote the number of arrivals by 𝑡, and let 𝑌 denote the time at which the
train arrives. Obviously 𝑌 ∼ Uniform((0, 𝑇 )).
Finding solution:
Should we find E[N(Y)] (and V[N(Y)]) directly, or via the conditional expectation E[N(Y)|Y]? Conditioning on Y gives, for each t,
E[N(Y) | Y = t] = E[N(t) | Y = t] = E[N(t)]   (by the independence of N and Y)
= λt.
By the same argument, V[N(Y)|Y] = λY. Therefore,
V[N(Y)] = E[ V[N(Y)| Y] ] + V[ E[N(Y)| Y] ]
        = E[λY] + V[λY]                                            (14.21)
        = λT/2 + λ²T²/12,
since Y ∼ Uniform(0, T) has E[Y] = T/2 and V[Y] = T²/12; similarly E[N(Y)] = E[λY] = λT/2.
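A minimal R check of these two formulas (ours; λ = 3 and T = 2 are arbitrary illustrative values):

# Monte Carlo check of the train-depot example: N(Y), Y ~ Uniform(0, T).
set.seed(1)
lambda <- 3; Tmax <- 2
Y <- runif(10^5, 0, Tmax)
N <- rpois(10^5, lambda * Y)         # N(Y) | Y ~ Pois(lambda * Y)
c(mean(N), lambda * Tmax / 2)                          # E[N(Y)] = 3
c(var(N),  lambda * Tmax / 2 + (lambda * Tmax)^2 / 12) # V[N(Y)] = 6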
Assumption: Let 𝑋1 , 𝑋2 , · · · , 𝑋𝑁 be a random sample (iid) from a certain distribution, where 𝑁 itself
is a natural-valued random variable (having its own distribution).
The compound random variable of 𝑋𝑖 and 𝑁 is given by 𝑆𝑁 := 𝑋1 + 𝑋2 + · · · + 𝑋𝑁 . In practice, 𝑁
may be the number of people stopping at a service station in a day, and the 𝑋𝑖 are the amounts of gas
they purchased.
One can find the mean and variance of 𝑆𝑁 if observations are random.
When X_1, X_2, ···, X_N ∼ X are i.i.d. and each X_i is independent of N, then
E[S_N] = E[ E[S_N | N] ] = E[ N E[X] ] = E[N] E[X].
We will use the above results in the next chapters. In the particular case when N = N(t) is a regular Poisson process and the X_i's are independent of the times of the events, the sum
S_{N(t)} := ∑_{i=1}^{N(t)} X_i
Definition 14.8.
Let {𝑁 (𝑡), 𝑡 > 0} be a Poisson process with rate 𝜆, and let 𝑋1 , 𝑋2 , · · · be random variables that are
i.i.d. and independent of the process {𝑁 (𝑡)}.
If 𝜆 is the rate for the regular Poisson process 𝑁 (𝑡) and the iid variables 𝑋𝑖 have mean 𝜇 and variance
𝜎 2 , then we can calculate the mean and standard deviation of the compound Poisson process 𝑆𝑁 (𝑡)
as
E[S_{N(t)}] = λ μ t,
V[S_{N(t)}] = λ (μ² + σ²) t = λ E[X²] t.                           (14.24)
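A minimal R check of (14.24) by simulation (ours; the parameter values and the normal claim sizes are arbitrary illustrative assumptions):

# Compound Poisson S_{N(t)} = X_1 + ... + X_{N(t)}.
set.seed(1)
lambda <- 2; t <- 5; mu <- 10; sigma <- 3
S <- replicate(10^5, {
  n <- rpois(1, lambda * t)                 # N(t) ~ Pois(lambda * t)
  if (n == 0) 0 else sum(rnorm(n, mu, sigma))
})
c(mean(S), lambda * mu * t)                     # E[S] = 100
c(var(S),  lambda * (mu^2 + sigma^2) * t)       # V[S] = 1090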
Definition 14.9. {𝑁 (𝑡), 𝑡≥0} is a nonhomogeneous (or non-stationary) Poisson process with intensity
(rate) function 𝜆(𝑡) if
1. 𝑁 (0) = 0
2. {𝑁 (𝑡), 𝑡≥0} has independent increments
3. P[ 2 or more events in (𝑡, 𝑡 + ℎ)] = 𝑜(ℎ)
4. P[ exactly 1 event in (𝑡, 𝑡 + ℎ)] = 𝜆(𝑡) ℎ + 𝑜(ℎ) or equivalently
4*. limℎ→0 (1/ℎ)P[𝑁 (𝑡 + ℎ) − 𝑁 (𝑡) = 1] = 𝜆(𝑡).
NOTATION. If we let
m(t) = ∫_0^t λ(s) ds,
then
p_k(t) = P[N(t) = k] = e^{−m(t)} [m(t)]^k / k!,   k ≥ 0.
𝑁 (𝑡) has a Poisson distribution with mean 𝑚(𝑡), 𝑚(𝑡) is called the mean value function, or also the
principal function of the process.
• In the non-homogeneous case, the rate parameter λ(t) now depends on t. That is,
P{N(τ + t) − N(τ) = 1} ≈ λ(τ) t,  as t → 0.
Fact 14.4. N(t) is a Poisson random variable with mean m(t). If {N(t), t ≥ 0} is non-homogeneous with mean value function m(t), then {N(m^{-1}(t)), t ≥ 0} is homogeneous with intensity λ = 1.
This result follows because N(t) is a Poisson random variable with mean m(t), and if we let X(t) = N(m^{-1}(t)), then X(t) is Poisson with mean m(m^{-1}(t)) = t.
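Nonhomogeneous Poisson processes are easy to simulate by thinning a homogeneous process (the Lewis–Shedler method; the sketch below is ours and is not stated in the source, and the intensity function is an arbitrary illustration):

# Simulate a NHPP on [0, T]: generate candidates at a constant majorizing
# rate lambda_max, keep a candidate at time t with prob lambda(t)/lambda_max.
rnhpp <- function(T, lambda_fn, lambda_max) {
  # enough candidate inter-arrival times, with high probability:
  cand <- cumsum(rexp(ceiling(3 * lambda_max * T), rate = lambda_max))
  cand <- cand[cand <= T]
  cand[runif(length(cand)) < lambda_fn(cand) / lambda_max]
}
set.seed(1)
times <- rnhpp(10, function(t) 2 + sin(t), lambda_max = 3)
length(times)    # E[N(10)] = m(10) = 20 + (1 - cos 10), about 21.8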
A Poisson process, {𝑁 (𝑡), 𝑡 > 0}, only counts the number of events that occurred in the interval [0, 𝑡],
while the process {𝑌 (𝑡), 𝑡 > 0} gives, for example,
• the sum of the lengths of telephone calls that happened in [0, 𝑡], or
• the total number of persons who were involved in car accidents in this interval [0, 𝑡], etc.
Note that we must assume that the lengths of the calls or the numbers of persons involved in distinct
accidents are independent and identically distributed random variables. We could consider the two-
dimensional process {[𝑁 (𝑡), 𝑌 (𝑡)], 𝑡 > 0} to retain all the information of interest.
• The first door leads to a tunnel that will take him to safety after 3 hours of travel.
• The second door leads to a tunnel that will return him to the mine after 5 hours of travel.
• The third door leads to a tunnel that will return him to the mine after 7 hours.
If we assume that the miner is at all times equally likely to choose any one of the doors, what is the
expected length of time until he reaches safety?
Modeling: Let 𝑌 denote the amount of time (in hours) until the miner reaches safety, and let 𝑋 denote
the door he initially chooses. Now,
E[Y] = E[E[Y|X]] = ∑_x E[Y|X = x] P[X = x],
then E[Y] = ∑_{i=1}^{3} E[Y|X = i] P[X = i] = (1/3) [ E[Y|X = 1] + E[Y|X = 2] + E[Y|X = 3] ].
Finding solution: We see that E[Y|X = 1] = 3, and we need to find E[Y|X = 2] and E[Y|X = 3]. They are a kind of recursive formula: for instance, E[Y|X = 2] = 5 + E[Y], since after 5 hours of travel the miner is back at his starting point; similarly E[Y|X = 3] = 7 + E[Y]. Solving E[Y] = (1/3)[3 + (5 + E[Y]) + (7 + E[Y])] gives E[Y] = 15 hours.
Birth and death (BD) processes informally are obtained when we generalize the Poisson process in
two ways:
1. By letting the value of the arrival rate 𝜆 depend on the current state 𝑛;
2. By including departures, which allows the process to instantaneously decrease its value by one unit
(at a rate that will also be a function of 𝑛).
Birth and death processes formally are a special type of continuous-time Markov chains (CTMC or
Markov jump model).
Consider a continuous-time Markov chain with states 0, 1, 2, . . .
Definition 14.10.
If 𝑝𝑖𝑗 = 0 whenever 𝑗 ̸= 𝑖−1 or 𝑗 ̸= 𝑖+1, then the Markov chain is called a birth and death process.
Thus, a birth and death process is a CTMC in which transitions from state 𝑖 can only go to either
state 𝑖 − 1 or 𝑖 + 1.
• That is, a transition either causes an increase in state by one or a decrease in state by one.
• A birth is said to occur when the state increases by one, and a death is said to occur when the state
decreases by one.
Reminder :
For all 𝑖 ̸= 𝑗 ∈ S, the transition rate of the process when the process makes a transition
from state 𝑖 to state 𝑗, denoted by 𝑞𝑖,𝑗 , is defined by
q_{i,j} = p′_{i,j}(0) = lim_{h→0} p_{i,j}(h)/h,   for i ≠ j.       (14.26)
The transition rates 𝑞𝑖,𝑗 are also known as instantaneous transition rates, transition intensi-
ties, or forces of transition.
Definition 14.11.
If we employ the (instantaneous) transition rates q_{i,j} given in Equation 14.26 of the continuous-time Markov chain, then the process is said to be a birth and death process when
q_{i,j} = 0  if |j − i| > 1.
To describe the process we define the birth and death rates from each state 𝑖 ∈ S:
𝜆𝑖 = 𝑞𝑖,𝑖+1 = 𝑣𝑖 𝑝𝑖,𝑖+1
(14.27)
𝜇𝑖 = 𝑞𝑖,𝑖−1 = 𝑣𝑖 𝑝𝑖,𝑖−1
• Thus, 𝜆𝑖 is the rate at which a birth occurs, and 𝜇𝑖 is the rate at which a death occurs, both when the
process is in state 𝑖.
• The rate of transition out of state 𝑖 is the sum of these two rates 𝜆𝑖 + 𝜇𝑖 = 𝑣𝑖 .
Note that μ_0 = 0, because there can be no death when the process is in the empty state 0.
Q =
[ q_1    q_1,2  q_1,3  ...  q_1,s ]
[ q_2,1  q_2    q_2,3  ...  q_2,s ]
[ q_3,1  q_3,2  q_3    ...  q_3,s ]                                (14.28)
[  ...    ...    ...   ...   ...  ]
[ q_s,1  q_s,2  q_s,3  ...  q_s   ]
For a BD process with parameters given by diagram 14.2, Q takes a special form
Q =
[ −λ_0   λ_0            0             0    ...       ]
[  μ_1   −(λ_1 + μ_1)   λ_1           0    ...       ]
[  0     μ_2            −(λ_2 + μ_2)  λ_2  0  ...    ]              (14.29)
[  ...    ...            ...          ...  ...       ]
(d/dt) p(t) = p(t) Q,                                              (14.30)
where 𝑝(𝑡) = [𝑝0 (𝑡), 𝑝1 (𝑡), 𝑝2 (𝑡), · · · , 𝑝𝑖 (𝑡), 𝑝𝑖+1 (𝑡), · · · ] is the vector of state distribution.
Transient analysis of this birth and death (B & D) process is done by studying the following system
of differential equations:
(d/dt) p_0(t) = −λ_0 p_0(t) + μ_1 p_1(t)
(d/dt) p_1(t) = λ_0 p_0(t) − (λ_1 + μ_1) p_1(t) + μ_2 p_2(t)
...                                                                (14.31)
(d/dt) p_i(t) = λ_{i−1} p_{i−1}(t) − (λ_i + μ_i) p_i(t) + μ_{i+1} p_{i+1}(t)
dp_0(t)/dt = −λ_0 p_0(t) + μ_1 p_1(t)
dp_i(t)/dt = −(λ_i + μ_i) p_i(t) + μ_{i+1} p_{i+1}(t) + λ_{i−1} p_{i−1}(t),  for i > 0     (14.32)
where the left-hand side dp_i(t)/dt is the total rate of change of the probability of state i; the + terms give the rate of moving into state i and the − term gives the rate of moving out.
• lim_{t→∞} dp_i(t)/dt = 0, and
• lim_{t→∞} p_i(t) = p_i exists, for every i = 0, 1, 2, ...;
λ_0 p_0 = μ_1 p_1 ⟹ p_1 = (λ_0/μ_1) p_0
(λ_i + μ_i) p_i = μ_{i+1} p_{i+1} + λ_{i−1} p_{i−1},  for i = 1, 2, ...     (14.33)
∑_i p_i = 1
λ_i p_i = μ_{i+1} p_{i+1},  ∀ i = 0, 1, ...
This result states that when the process is in the steady state, the rate at which it makes a transition
from state 𝑖 to state 𝑖 + 1, which we refer to the rate of flow from state 𝑖 to state 𝑖 + 1, is equal to the
rate of flow from state 𝑖 + 1 to state 𝑖. This property (14.33) is referred to as the local balance equation
or condition because it balances (or equates) the rate at which the process enters state 𝑖 with the rate
at which it leaves state 𝑖.
Direct application of the property allows us to solve for the steady-state probabilities of the birth and death process recursively as follows:
p_{i+1} = (λ_i / μ_{i+1}) p_i = ... = ( ∏_{j=0}^{i} λ_j / ∏_{j=1}^{i+1} μ_j ) p_0.
In the special case of constant rates λ_i = λ and μ_i = μ, normalization gives
p_0 = [ 1 + ∑_{i=1}^{∞} (λ/μ)^i ]^{−1}.                            (14.34)
The sum converges if and only if λ/μ < 1, equivalent to λ < μ. Under this condition we obtain
p_0 = 1 − λ/μ,
p_i = (1 − λ/μ)(λ/μ)^i,  for i = 1, 2, ...                         (14.35)
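A minimal R sketch of (14.35) (ours; the rates λ = 2, μ = 3 and the truncation at 50 states are illustrative assumptions):

# Steady-state probabilities of a birth-death chain with constant rates
# (an M/M/1-type queue): p_i = (1 - rho) rho^i, rho = lambda/mu < 1.
bd_steady <- function(lambda, mu, imax = 50) {
  rho <- lambda / mu
  stopifnot(rho < 1)                 # required for convergence
  (1 - rho) * rho^(0:imax)
}
p <- bd_steady(lambda = 2, mu = 3)
c(p0 = p[1], mean_state = sum((0:50) * p))  # p0 = 1/3; mean ~ rho/(1-rho) = 2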
Example 14.8.
A machine is operational for an exponentially distributed time with mean 1/𝜆 before breaking down.
When it breaks down, it takes a time that is exponentially distributed with mean 1/𝜇 to repair it.
What is the fraction of time that the machine is operational (or available)?
Solution: This is a two-state birth and death process. Let 𝑈 denote the up state and 𝐷 the down
state. Let 𝑝𝑈 denote the steady-state probability that the process is in the operational state, and
let 𝑝𝐷 denote the steady-state probability that the process is in the down state. Then the balance
equations become
λ p_U = μ p_D
p_U + p_D = 1 ⟹ p_D = 1 − p_U                                      (14.36)
Hence, the fraction of time that the machine is operational is just 𝑝𝑈 = 𝜇/(𝜆 + 𝜇).
a) Let 𝑁 (𝑡) be the number of failures of a computer system in the time interval [0, 𝑡]. We suppose that
{𝑁 (𝑡), 𝑡 ≥ 0} is a Poisson process with rate 𝜆 = 1 per week.
Find the probability that the system operates without failure during two consecutive weeks.
b) Let 𝑁 (𝑡) be the number of telephone calls received at an exchange in the time interval [0, 𝑡]. We
suppose that {𝑁 (𝑡), 𝑡 ≥ 0} is a Poisson process with rate 𝜆 = 10 per hour. Calculate the probability
that no calls will be received during each of two consecutive 15-minute periods.
The pmf of a Poisson 𝑁 (𝑡) counting the number of events randomly occurring in the time interval
[0, 𝑡) is given by
p_{N(t)}(i) = p(i; λt) = P[N(t) = i] = e^{−λt} (λt)^i / i!,   i = 0, 1, 2, ...
a) The system operates without failure during two consecutive weeks, so we get event 𝑁 (2) = 0, with
𝑁 (𝑡) ∼ Pois(𝜆 = 1), therefore the probability is
P[N(2) = 0] = e^{−2} · 2^0 / 0! = e^{−2}.
b) The probability that no calls will be received during each of two consecutive 15-minute periods.
The counting process 𝑁 (𝑡) ∼ Pois(𝜆 = 10), with the unit of one hour, then two consecutive 15-minute
periods mean 1/2 unit. The Poisson satisfies the stationary increment property, therefore we can find
the probability of interest for the interval [0, 1/2):
P[N(1/2) = 0] = e^{−10(1/2)} · 5^0 / 0! = e^{−5}.
a) Suppose that there are 𝑚 terrorists in a group of 𝑁 visitors arriving per day in all airports of the
U.S., with 𝑚 ≪ 𝑁 . If you choose randomly 𝑛 visitors from that group, 𝑛 < 𝑁 , compute the expected
number of terrorists.
b) Use the moment generating function to prove that both the mean E[𝑋] and variance V[𝑋] of a
Poisson random variable 𝑋 with parameter 𝜆 are
E[𝑋] = 𝜆; V[𝑋] = 𝜆.
a) There are m terrorists in a group of N visitors. You choose randomly n visitors from that group, n < N. Denote by X the number of terrorists in that random sample of n visitors. Write
X = B_1 + B_2 + ··· + B_n,
where B_i indicates that the i-th sampled visitor is a terrorist; each B_i ∼ B(p) with the same probability p = m/N. (Since sampling is without replacement, the B_i are dependent and X is in fact hypergeometric rather than binomial, but linearity of expectation does not require independence.) The linearity of expectation gives
E[X] = np = nm/N.
b) Prove that both the mean E[𝑋] and variance V[𝑋] of a Poisson random variable 𝑋 with parameter
𝜆 are
E[𝑋] = 𝜆; V[𝑋] = 𝜆.
M(t) = E[e^{tX}] = e^{−λ} ∑_{j=0}^{∞} e^{tj} λ^j / j!              (14.37)
     = e^{−λ} · e^{λ e^t} = e^{−λ(1 − e^t)},   −∞ < t < ∞.
Therefore,
dM/dt = M′(t) = λ M(t) e^t,
M″(t) = (λ² e^{2t} + λ e^t) M(t).
Using
M^{(n)}(t)|_{t=0} = μ_n = E[X^n] = M^{(n)}(0),                     (14.38)
we get
E[X] = μ = M′(0) = λ,   V[X] = M″(0) − (M′(0))² = (λ² + λ) − λ² = λ.   (14.39)
Suppose that the number of people entering a department store on a given day is a random variable
with mean 50. Suppose further that the amounts of money spent by these customers are independent
random variables having a common mean of $8. Finally, suppose also that the amount of money spent
by a customer is also independent of the total number of customers who enter the store. What is the
expected amount of money spent in the store on a given day?
Consider an insurance company which receives claims according to a Poisson process with rate λ = 400 per year. Suppose the sizes of claims are random variables R_n ∼ R, distributed exponentially with mean E[R] = $1000.
2. Assuming that the insurance company has 𝑛 clients, how much should the monthly insurance pre-
mium be to make sure that the company has a yearly profit.
Provide numerical values if the company has 𝑛 = 1, 000, 10, 000 and 100,000 clients.
3. Assume that the company has 10 people on the staff with a total of $500,000 salary budget and it
has to produce profit of $500,000 at the end of the year. How much should the monthly premium be?
Problem 14.6.
1. Prove that a Poisson process 𝑋(𝑡) with positive rate 𝜆 has stationary increments, and
2. Patients arrive at the doctor's office according to a Poisson process with rate λ = 1/10 per minute. The doctor will not see a patient until at least three patients are in the waiting room.
a/ Find the expected waiting time until the first patient is admitted to see the doctor.
b/ What is the probability that nobody is admitted to see the doctor in the first hour?
• The first door leads to a tunnel that will take him to safety after 3 hours of travel.
• The second door leads to a tunnel that will return him to the mine after 5 hours of travel.
• The third door leads to a tunnel that will return him to the mine after 6 hours.
• The fourth door leads to a tunnel that will return him to the mine after 7 hours.
If we assume that the miner is at all times equally likely to choose any one of the doors, what is the
expected length of time until he reaches safety?
We employ the fact: if X, Y are random variables, then E[Y | X] is a random variable, so E[E[Y|X]] does exist, computed over the range of X. We have
The mine contains 4 doors, let 𝑋 be the doors, having pmf 𝑝(𝑥) = 1/4 for every 𝑥 ∈ {1, 2, 3, 4}. Let
𝑔(𝑋) = E[𝑌 |𝑋] then
E[Y] = ∑_x E[Y|X = x] p(x) = ∑_{x=1}^{4} g(x) p(x)
Problem 14.8.
Consider an insurance company that has two types of policy: Policy A and Policy B. Total claims to the company arrive according to a Poisson process at the rate of 9 per day. A randomly selected claim has a 1/3 chance of being of policy A. Calculate, on a given day
• c/ the probability that total claims from the company will be fewer than 2.
GUIDANCE for solving. Apply the Thinning technique in Section 14.3.2 above. Brief inputs are:
• N_A(t) = number of claims of policy A ∼ Poisson process at rate λp = 9 × 1/3 = 3 per day
• N_B(t) = number of claims of policy B ∼ Poisson process at rate λ(1 − p) = 9 × 2/3 = 6 per day
a) P(N_A(1) < 2) = P(N_A(1) = 0) + P(N_A(1) = 1) = e^{−3} · 3^0/0! + e^{−3} · 3^1/1! = 4 e^{−3} ≈ 0.19915
b) P(N_B(1) < 2) = P(N_B(1) = 0) + P(N_B(1) = 1) = e^{−6} · 6^0/0! + e^{−6} · 6^1/1! = 7 e^{−6} ≈ 0.01735
c) P(N(1) < 2) = P(N(1) = 0) + P(N(1) = 1) = e^{−9} · 9^0/0! + e^{−9} · 9^1/1! = 10 e^{−9} ≈ 0.00123
Problem 14.9.
Suppose that a security expert arrives at a server room at 6:15 AM. Until 7:00 AM, emails arrive
at a Poisson rate of 1 email per 30 minutes. Starting from 7:00 AM, they arrive at a Poisson rate of 2
emails per 30 minutes.
GUIDANCE for solving. Brief inputs are:
T_w = wait time ∼ Exp(rate 1/30) before 7:00 AM (the first 45 minutes), and ∼ Exp(rate 1/15) afterwards.
E[T_w] = ∫_0^{45} t (1/30) e^{−t/30} dt + 45 · ∫_{45}^{∞} (1/30) e^{−t/30} dt + 15 e^{−1.5},
where the last term is the additional expected wait at the doubled rate (mean 15 minutes) times the probability e^{−45/30} = e^{−1.5} that no email arrives before 7:00 AM.
[Source [56]]
CHAPTER 15. BRANCHING PROCESSES AND RENEWAL PROCESSES
Introduction
A branching process is a special case of a Markov process with infinitely many states. The states are nonnegative integers that usually represent the number of members of a population. It arises in situations where one individual produces a random number of offspring (possibly zero, according to a specific probability distribution), who in turn keep reproducing in the same manner.
This is repeated by the offspring themselves, from generation to generation, leading to either a population explosion or its ultimate extinction. Examples include:
1. Nuclear chain reaction (neutrons are the “offspring” of each atomic fission).
3. In one-server queueing theory, customers arriving (and lining up) during the service time of a given
customer can be, in this sense, considered that customer’s “offspring” - this simplifies dealing with
some tricky issues of queueing theory.
Francis Galton (1822-1911) formulated the problem of population extinction (e.g. certain family
names would disappear, for lack of male descendants) mathematically in the Educational Times in
1873. The corresponding stochastic processes are sometimes called branching processes.
Henry William Watson (1827–1903) replied with a solution in the same venue. Together, they then wrote a paper entitled "On the probability of extinction of families" in 1874. Nowadays, branching processes are often called Galton–Watson processes.
We will study
• Key concepts
Definition 15.1.
Let {𝑍𝑛,𝑗 , 𝑛 = 0, 1, . . . ; 𝑗 = 1, 2, . . .} be a set of i.i.d. random variables whose possible values are
nonnegative integers, S𝑍𝑛,𝑗 = S = N = {0, 1, 2, 3, . . .}. 𝑍𝑛,𝑗 is the number of descendants of the 𝑗-th
member of the 𝑛-th generation.
the number of ancestors of the population. The process {𝑋𝑛+1 , 𝑛 = 0, 1, 2, . . .} is said to be lineage
if 𝑋0 = 1.
Definition 15.2. Write the probability that at the death of each such individual we obtain exactly 𝑖
offspring, 𝑖 = 0, 1, 2, . . . as
[To avoid trivial cases, we assume that 𝑝𝑖 is strictly smaller than 1, for all 𝑖 ≥ 0.]
2. The transition probability p_{i,k} = P[X_{n+1} = k | X_n = i] is just the probability that
∑_{j=1}^{i} Z_{n,j} = Z_{n,1} + Z_{n,2} + ... + Z_{n,i} = k.
As state 0 is absorbing, a trapping state [𝑝0,0 = 1, no future offspring can arise in this case], we can
decompose S𝑋 into two sets:
S𝑋 = 𝐷 ∪ {0} (15.2)
3. Limiting population size: Given that a transient state is visited only a finite number of times, we can
assert that the process cannot remain indefinitely in the set 𝐷𝑘 = {1, 2, . . . , 𝑘}, for any finite 𝑘. Thus,
we conclude that the population will either disappear (ultimate extinction), or that its size will tend
to infinity.
4. The average number of individuals in the 𝑛-th generation- The lineage case.
Suppose that 𝑋0 = 1. Let’s now calculate the average number 𝜇𝑛 = E[𝑋𝑛 ] of individuals in
the 𝑛-th generation, for 𝑛 = 1, 2, . . ..
μ_n ≡ E[X_n] = ∑_{j=0}^{∞} E[X_n | X_{n−1} = j] P[X_{n−1} = j]     (15.3)
= ∑_{j=0}^{∞} j μ_1 P[X_{n−1} = j] = μ_1 ∑_{j=0}^{∞} j P[X_{n−1} = j] = μ_1 E[X_{n−1}] = ··· = μ_1^n E[X_0] = μ_1^n.
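A minimal R check of E[X_n] = μ_1^n by simulation (ours; the Pois(1.2) offspring law and all names are illustrative assumptions):

# Galton-Watson branching process with X_0 = 1: each generation replaces
# each of the X_{n-1} individuals by an iid offspring count.
gw_generations <- function(n, roffspring) {
  X <- numeric(n + 1); X[1] <- 1
  for (k in 2:(n + 1))
    X[k] <- if (X[k - 1] == 0) 0 else sum(roffspring(X[k - 1]))
  X
}
set.seed(1)
runs <- replicate(10^4, gw_generations(8, function(m) rpois(m, 1.2))[9])
c(mean(runs), 1.2^8)   # sample mean of X_8 versus mu_1^8, about 4.30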
Reminder: we defined the conditional variance of 𝑌 , given 𝑋 = 𝑥, as the variance of 𝑌 , with respect
to the conditional p.d.f. 𝑓𝑌 (𝑦 | 𝑥) = 𝑓𝑌 |𝑋 (𝑦 | 𝑥).
Reminder
• And furthermore, the variance of 𝑌 itself is computed via the conditional variance V[𝑌 | 𝑋]
and the conditional expectation E[𝑌 |𝑋]:
CONCEPT 13.
q_{0,i} = (q_{0,1})^i = q_0^i,                                     (15.9)
since an initial population of i individuals dies out exactly when each of the i independent lines of descent dies out.
1. When μ_1 < 1: q_0 = 1, since if each individual has less than one descendant, on average, we indeed expect the population to disappear.
2. When μ_1 ≥ 1: if μ_1 = 1 (and p_1 < 1), then still q_0 = 1; if μ_1 > 1, then q_0 < 1.
Theorem 15.1.
Recall that the (point) probability distribution of a discrete r.v. X, with Range(X) = N and pmf P[X = j] = p_j, gives rise to the probability-generating function of X, defined by
P(t) = P_X(t) = ∑_{j=0}^{∞} p_j t^j = E(t^X),                      (15.10)
P_N(s) = ∑_{i=0}^{∞} P[N = i] s^i                                  (15.12)
We would like to find the PGF 𝐻(𝑠) of the sum (the total purchased gas)
𝑆𝑁 := 𝑋1 + 𝑋2 + · · · + 𝑋𝑁
We can prove
H(s) = ∑_{k=0}^{∞} P[S_N = k] s^k = ... = P_N(P_X(s)).             (15.13)
So
H(t) = ∑_{k=0}^{∞} ∑_j P[S_N = k | N = j] P[N = j] t^k
     = ∑_j P[N = j] ∑_{k=0}^{∞} P[S_j = k] t^k = P_N(P_X(t)),
since ∑_{k=0}^{∞} P[S_j = k] t^k = P_{S_j}(t) = (P_X(t))^j. The PGF of S_N is thus a composition of the PGF P_N(·) of N and P_X(t) of the X_i.
We assume a branching process (15.1) starts with a single individual (Generation 0); that is, 𝑋0 = 1
(the corresponding PGF is thus equal to 𝑡).
1. He (and ultimately all of his descendants) produces a random number of offspring, each according
to a distribution whose PGF is 𝑃 (𝑡). This is thus the PGF of the number of members of the first
generation (denoted by 𝑋1 ).
2. 𝑋1 = 𝑁 becomes the number of ancestors for producing the next generation with
𝑋2 = 𝑍1 + 𝑍2 + · · · + 𝑍𝑋1 (15.14)
𝑋𝑛 = 𝑍1 + 𝑍2 + · · · + 𝑍𝑋𝑛−1 .
The PGF of 𝑋𝑛 generally is the 𝑛-fold composition of 𝑃 (𝑡) with itself, given as
Based on the recurrence formula (15.15) for computing 𝑃(𝑛) (𝑡) namely,
we can easily derive the corresponding formula for the expected value of 𝑋𝑛 by a simple differentiation
and the chain rule, to get
P′_{(n)}(t) = P′(P_{(n−1)}(t)) · P′_{(n−1)}(t).
Let t = 1; then, using
P_{(n−1)}(1) = 1,   P′_{(n−1)}(1) = μ_{n−1},
we get μ_n = P′_{(n)}(1) = P′(1) μ_{n−1} = μ_1 μ_{n−1}, hence μ_n = μ_1^n again.
Example 15.1.
V[X_10] = σ_1² μ_1^{n−1} ( (μ_1^n − 1) / (μ_1 − 1) ) |_{n=10} = 5.432.
P1. Independent increment property: says that the (random) number of events (arrivals) 𝑁 (𝑡)−𝑁 (𝑠)
and 𝑁 (𝑣) − 𝑁 (𝑢) in two disjoint intervals [say (𝑠, 𝑡] ∩ (𝑢, 𝑣] = ∅] are independent.
P2. Stationary increment property: says that the distribution of the number of events 𝑁 (𝑠, 𝑡) :=
𝑁 (𝑡) − 𝑁 (𝑠) occurring in interval (𝑠, 𝑡] depends only on the length ℎ = 𝑡 − 𝑠 of the interval, not
on the position of the interval.
A Poisson process with rate λ is formally a counting process N(t) satisfying P1, P2, and additionally the following feature:
Orderliness: two or more arrivals cannot occur simultaneously,
• Transform Methods
Definition 15.3.
The Poisson process is a special case of a renewal process, being obtained by taking the 𝑋𝑖
variables exponentially distributed with some constant rate of failure 𝜆; see Theorem ??.
Example 15.2.
Consider an experiment that involves a set of identical lightbulbs whose lifetimes are independent.
The experiment consists of using one lightbulb at a time, and when it fails it is immediately replaced by
another lightbulb from the set.
• The time to failure T_n of the first n lightbulbs, which is also the time of the n-th renewal (replacing the bulb), is given by T_n = ∑_{i=1}^{n} X_i.
In this lightbulb example, the X_i's are the lifetimes of the successive lightbulbs. It is easy to see that the renewal process is a more general process than the Poisson process.
• We have the following.
Property. Denote the cdf of the failure time T_n by F_n(t); then F_n(t) = P[T_n ≤ t] = P[N(t) ≥ n].
• The pmf (probability mass function) of the random variable 𝑁 (𝑡) can be obtained from the
formula
P[𝑁 (𝑡) = 𝑛] = P[𝑇𝑛 ≤ 𝑡] − P[𝑇𝑛+1 ≤ 𝑡] = 𝐹𝑛 (𝑡) − 𝐹𝑛+1 (𝑡). (15.19)
Equation 15.19 interestingly links the distribution of the renewal process 𝑁 (𝑡) with the distribution of the
times of renewals 𝑇𝑛 . This relationship, however, does not reduce to a simple exponential/Poisson
relationship as in the case of the Poisson process of Chapter ??.
In some cases, we know the exact distribution of the random variable 𝑇𝑛 .
* If 𝑋𝑖 ∼ Pois(𝜆), by Proposition 14.2, then 𝑇𝑛 ∼ Pois(𝑛𝜆).
In general, it is difficult to find the exact distribution function 𝐹𝑛 (𝑡) of 𝑇𝑛 .
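When F_n(t) is intractable, the renewal function H(t) = E[N(t)] introduced below can still be estimated by simulation. A minimal R sketch (ours; the Uniform(0, 2) lifetimes are an arbitrary illustrative assumption):

# Monte Carlo estimate of H(t) = E[N(t)] for lifetimes X_i ~ Uniform(0, 2).
set.seed(1)
H_hat <- function(t, nrep = 10^4) {
  mean(replicate(nrep, {
    s <- 0; n <- 0
    while (s <= t) { s <- s + runif(1, 0, 2); n <- n + 1 }
    n - 1                      # renewals strictly within [0, t]
  }))
}
H_hat(10)    # roughly t / E[X] = 10 for large t (elementary renewal theorem)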
Consider the (point) probability distribution of a discrete r. v. 𝑋, with the observed values in Range(𝑋) =
N and pmf P[𝑋 = 𝑗] = 𝑝𝑗 .
Besides, the moment-generating capability of the PGF-transform lies in the results obtained from
evaluating the derivatives of the transform at 𝑡 = 1. We have, for a discrete r. v. 𝑋, with values
𝑗 ∈ Range(𝑋) = N and pmf 𝑝𝑗 that
P_X(t) = E(t^X) = ∑_{j=0}^{∞} t^j p_j
dP(t)/dt = ∑_{j=1}^{∞} j t^{j−1} p_j
dP(t)/dt |_{t=1} = P′(1) = ∑_{j=0}^{∞} j p_j = E[X] = μ_X          (15.22)
d²P(t)/dt² |_{t=1} = P″(1) = ∑_{j=1}^{∞} j(j − 1) p_j = E[X²] − E[X].
Now let 𝑋1 , 𝑋2 , · · · , 𝑋𝑁 ∼ 𝑋 be a random sample (iid) from a certain distribution, where 𝑁 itself is a
random variable (having its own distribution on N).
We would like to find the PGF 𝐻(𝑠) of the sum
𝑆𝑁 := 𝑋1 + 𝑋2 + · · · + 𝑋𝑁 .
We knew from Section 15.4.1 that the PGF 𝐻(𝑠) of 𝑆𝑁 is thus a composition of the PGF 𝑃𝑁 (.) of 𝑁
and 𝑃𝑋 (𝑠) of 𝑋𝑖 , because
H(s) = ∑_{k=0}^{∞} P[S_N = k] s^k = P_N(P_X(s)).                   (15.23)
𝑀 (0) = 1 for all distributions. But 𝑀 (𝑡) may not exist for some 𝑡 ̸= 0. To be useful, it is sufficient
that 𝑀 (𝑡) will exist in some interval containing 𝑡 = 0.
then
M(t) = (1/(b − a)) ∫_a^b e^{tx} dx = (e^{tb} − e^{ta}) / (t(b − a)).
This is a differentiable function of 𝑡, for all 𝑡, −∞ < 𝑡 < ∞.
♣ OBSERVATION 5.
Another useful property of the m.g.f. 𝑀 (𝑡) is that often we can obtain the moments of 𝐹 (𝑥) by differen-
tiating 𝑀 (𝑡). More specifically, consider the 𝑛-th order derivative of 𝑀 (𝑡). Assuming that this derivative
exists, and differentiation can be interchanged with integration (or summation), then
M^{(n)}(t) = (d^n/dt^n) ∫ e^{tx} f(x) dx = ∫ (d^n/dt^n) e^{tx} f(x) dx = ∫ x^n e^{tx} f(x) dx.
Example 15.3.
V[X] = E[X²] − (E[X])² = M″(0) − (M′(0))² = λ.                     (15.27)
Besides the PGF, we will make extensive use of the Laplace and Fourier integral transforms. Here we provide a basic introduction to such methods.
Let 𝑓 (𝑥) = 𝑓𝑋 (𝑥) be the PDF of a continuous random variable 𝑋 that takes only non-negative values;
that is, 𝑓𝑋 (𝑥) = 0 for 𝑥 < 0.
PROPERTIES:
2. One of the primary reasons for studying L_X(s) is to derive the moments of the different probability distributions. By taking different derivatives of the Laplace transform L_X(s) and evaluating them at s = 0, we obtain the following:
dL(s)/ds = −∫_0^{∞} x e^{−sx} f(x) dx ⟹ dL(s)/ds |_{s=0} = L′(0) = −E[X]
d²L(s)/ds² = ∫_0^{∞} x² e^{−sx} f(x) dx ⟹ d²L(s)/ds² |_{s=0} = L″(0) = E[X²]     (15.29)
...
d^n L(s)/ds^n |_{s=0} = (−1)^n E[X^n]
4. For the pdf's f(t), g(t) of two continuous random variables X and Y that take only non-negative values, we have
a) the convolution f * g of f and g is the density of the sum variable X + Y, defined in the next part;
CONCEPT 14.
1. When X and Y are two independent discrete variables, having respectively pmf f and g, the general formula for the distribution of the sum Z = X + Y is
P[Z = z] = ∑_{k=−∞}^{∞} P[X = k] P[Y = z − k].                     (15.31)
2. When X and Y are two independent continuous variables, the convolution of f, g (the density of the sum Z = X + Y) is
(f * g)(z) = ∫_{−∞}^{∞} f(x) g(z − x) dx.
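A minimal R sketch of the discrete convolution (ours; it uses two binomial pmfs so that the result can be checked against Problem 15.3 below):

# Distribution of Z = X + Y for independent discrete X, Y via (15.31);
# index k + 1 of each vector holds P[. = k].
pX <- dbinom(0:3, 3, 0.5)           # X ~ Bin(3, 0.5)
pY <- dbinom(0:4, 4, 0.5)           # Y ~ Bin(4, 0.5)
pZ <- rep(0, length(pX) + length(pY) - 1)
for (i in seq_along(pX))
  for (j in seq_along(pY))
    pZ[i + j - 1] <- pZ[i + j - 1] + pX[i] * pY[j]
round(pZ - dbinom(0:7, 7, 0.5), 12) # all zero: Bin(3,p) * Bin(4,p) = Bin(7,p)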
The renewal function 𝐻(𝑡) is the expected number of renewals E[𝑁 (𝑡)] by time 𝑡.
Proof.
H(t) = E[N(t)] = ∑_{n=0}^{∞} n P[N(t) = n] = ∑_{n=0}^{∞} n [F_n(t) − F_{n+1}(t)]
     = ... = ∑_{n=1}^{∞} F_n(t).                                   (15.35)
Now, if the X_i's are continuous random variables, taking the derivative of each side we obtain
h(t) = dH(t)/dt = ∑_{n=1}^{∞} dF_n(t)/dt = ∑_{n=1}^{∞} f_{T_n}(t).
The renewal density is ℎ(𝑡). Using the Laplace transform we obtain an explicit integral equation
h(t) = f_X(t) + ∫_{u=0}^{t} h(t − u) f_X(u) du.                    (15.36)
The renewal equation of the process 𝑁 (𝑡) is given by integrating both sides of the last equation
H(t) = ∫_{u=0}^{t} h(u) du = F_X(t) + ∫_{u=0}^{t} H(t − u) f_X(u) du,  for t ≥ 0.   (15.37)
Theorem 15.3.
Suppose that 𝐻(𝑡) is the renewal function of a renewal process 𝑁 (𝑡), with the lifetime of event 𝑖 is
𝑋𝑖 = 𝑇𝑖 − 𝑇𝑖−1 , and 𝑇𝑖 is the 𝑖th renewal time point.
Example 15.4.
Assume that the lifetime 𝑋 is exponentially distributed with mean E[𝑋] = 1/𝜆. Then
Clearly,
H(t)/t = λt/t = λ = 1/E[X].
Corollary 15.4. The Poisson process is the only renewal process having a linear renewal function. It
is 𝐻(𝑡) = 𝜆𝑡.
The Poisson process is also the only Markovian renewal process.
Proof: This is because only the exponential distribution has the memory-less property.
is a renewal process if and only if {𝑁𝑖 (𝑡), 𝑡 ≥ 0} are Poisson processes, for 𝑖 = 1, 2.
Problem 15.1.
Suppose that the renewal process {𝑁 (𝑡), 𝑡 ≥ 0} ∼ Pois(𝜆𝑡) is a Poisson process with rate 𝜆. Com-
pute E[𝑁 2 (𝑡)] and confirm that
E[N²(t)] = H(t) + 2 ∫_0^t H(t − u) dH(u),  for t ≥ 0.
Problem 15.2.
• The time to repair the machine when it breaks down is exponentially distributed with mean 1/𝜇.
• The time the machine runs before breaking down is also exponentially distributed with mean 1/𝜆.
• When repaired the machine is considered to be as good as new. The repair time and the running
time are assumed to be independent.
If the machine is in good condition at time 0, what is the expected number of failures up to time 𝑡?
We have earlier defined the renewal process {𝑁 (𝑡), 𝑡 ≥ 0} as a counting process that denotes the
number of renewals up to time 𝑡.
The Markov renewal process is a generalization of the renewal process in which the
times between renewals are chosen according to a Markov chain.
Consider a random variable 𝑋𝑛 that takes values in a countable set S, and a random variable 𝑇𝑛
(renewal time) that takes values in the interval [0, ∞) such that
0 = 𝑇0 ≤ 𝑇1 ≤ 𝑇2 ≤ . . .
1. The stochastic process {(X_n, T_n) | n ∈ N} is defined to be a Markov renewal process with state space S if
P[X_{n+1} = j, T_{n+1} − T_n ≤ t | X_0, ..., X_n; T_0, ..., T_n] = P[X_{n+1} = j, T_{n+1} − T_n ≤ t | X_n],
The number of times the process {(X_n, T_n)} visits state X_n = k in the interval (0, t] is
N_k(t) = ∑_{n=0}^{∞} V_k(n, t),  k ∈ S, t ≥ 0.                     (15.42)
The function
M_{i,k}(t) = E[ ∑_{n=0}^{∞} V_k(n, t) | X_0 = i ] = ∑_{n=0}^{∞} E[V_k(n, t) | X_0 = i]
           = ∑_{n=1}^{∞} P[X_n = k, J_n ≤ t | X_0 = i]             (15.44)
𝐽𝑛 is also called the epoch of the 𝑛-th transition of the process {(𝑋𝑛 , 𝑇𝑛 )| 𝑛 ∈ N}.
The one-step transition probability 𝑄𝑖,𝑗 (𝑡) of the above Markov renewal process by
The family of probabilities Q = {𝑄𝑖,𝑗 (𝑡), 𝑖, 𝑗 ∈ S, 𝑡 ≥ 0} is called the semi-Markov kernel over S.
In particular, when 𝑗 = 𝑘
We then obtain
M_{i,k}(t) = ∑_{n=1}^{∞} P[X_n = k, J_n ≤ t | X_0 = i] = ···
           = Q_{i,k}(t) + ∑_{n=2}^{∞} P[X_n = k, J_n ≤ t | X_0 = i].   (15.46)
If we define
Q_{i,k}^{(n)}(t) = P[X_n = k, J_n ≤ t | X_0 = i]                   (15.47)
and
Q_{i,k}^{(0)}(t) = 0 if i ≠ k;  1 if i = k,                        (15.48)
then
Q_{i,k}^{(n+1)}(t) = ∑_{j∈S} ∫_0^t Q_{i,j}(t − u) dQ_{j,k}^{(n)}(u).   (15.49)
If we define the matrix 𝑄 = [𝑄𝑖,𝑘 ], then the above expression is the convolution of 𝑄(𝑛) and 𝑄. That
is,
𝑄(𝑛+1) (𝑡) = 𝑄(𝑛) (𝑡) * 𝑄(𝑡).
Hence
$$M(t) = \sum_{n=1}^{\infty} Q^{(n)}(t) = Q(t) + \sum_{n=2}^{\infty} Q^{(n)}(t), \qquad t \ge 0, \qquad (15.50)$$
or
$$M(t) = Q(t) + \sum_{n=1}^{\infty} Q^{(n)}(t) * Q(t) = \cdots = Q(t) + Q(t) * M(t). \qquad (15.51)$$
Write 𝑀*(𝑠) for the Laplace transform of 𝑀(𝑡) and 𝑄*(𝑠) for the Laplace transform of 𝑄(𝑡); then (15.51) gives 𝑀*(𝑠) = 𝑄*(𝑠) + 𝑄*(𝑠)𝑀*(𝑠), so that 𝑀*(𝑠) = [𝐼 − 𝑄*(𝑠)]⁻¹𝑄*(𝑠).
Problem 15.3. Prove that the convolution of two independent binomial distributions 𝑋 ∼ Bin(𝑚, 𝑝) and 𝑌 ∼ Bin(𝑛, 𝑝), one with parameters 𝑚 and 𝑝 and the other with parameters 𝑛 and 𝑝, is a binomial distribution with parameters (𝑚 + 𝑛) and 𝑝, i.e.
𝑍 = 𝑋 + 𝑌 ∼ Bin(𝑚 + 𝑛, 𝑝).
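A minimal numerical check in R (the values 𝑚 = 3, 𝑛 = 5, 𝑝 = 0.4 are arbitrary illustrations):

m <- 3; n <- 5; p <- 0.4
z <- 0:(m + n)
conv <- sapply(z, function(k) sum(dbinom(0:k, m, p) * dbinom(k - (0:k), n, p)))
all.equal(conv, dbinom(z, m + n, p))   # TRUE: Z = X + Y ~ Bin(m + n, p)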
Problem 15.4.
The price change 𝑋 of a stock on a given trading day has the distribution
𝑥: -1 0 1 2
Describe the random variable 𝑍 = 𝑋₁ + 𝑋₂, where 𝑋₁, 𝑋₂ are independent copies of 𝑋; that is, find the distribution for the change in stock price after two (independent) trading days.
Problem 15.5 (Branching process).
Consider a branching process {𝑋ₙ} in which each individual independently produces descendants according to
$$X_n = \sum_{j=1}^{X_{n-1}} Z_{n-1,j}, \qquad Z_{k,j} \sim N = \mathrm{Pois}(\lambda),$$
$$p_i = P[N=i] = \frac{e^{-\lambda}\lambda^i}{i!}, \qquad i = 0, 1, 2, \dots$$
That is, the number of descendants of an arbitrary individual has a Poisson distribution with parameter 𝜆. Determine the probability 𝑞₀ of eventual extinction of the population if
(a) 𝜆 = ln 2 and (b) 𝜆 = ln 4.
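A hedged numerical sketch: 𝑞₀ is the smallest root in [0, 1] of the fixed-point equation 𝑞 = 𝐺(𝑞) = 𝑒^{𝜆(𝑞−1)}, where 𝐺 is the pgf of Pois(𝜆).

extinction_prob <- function(lambda) {
  g <- function(q) exp(lambda * (q - 1)) - q   # pgf fixed-point equation
  if (lambda <= 1) return(1)                   # subcritical/critical: extinction is certain
  uniroot(g, c(0, 1 - 1e-9))$root              # supercritical: smallest root below 1
}
extinction_prob(log(2))   # 1    (lambda = ln 2 < 1)
extinction_prob(log(4))   # 0.5  (lambda = ln 4 > 1)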
Problem 15.6 (System reliability). Let 𝑁 (𝑡) be the number of failures of a computer system in the
interval [0, 𝑡]. We suppose that {𝑁 (𝑡), 𝑡 > 0} is a Poisson process with rate 𝜆 = 1 per week. Calculate
the probability that
ii) the system will have exactly two failures during a given week, knowing that it operated without failure
during the previous two weeks,
iii) less than two weeks elapse before the third failure occurs.
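Hedged R one-liners for parts (ii) and (iii); part (ii) uses independent increments of the Poisson process, and part (iii) uses P[𝑇₃ < 2] = P[𝑁(2) ≥ 3]:

dpois(2, lambda = 1)       # (ii) independent increments: P[N(1) = 2] ≈ 0.1839
1 - ppois(2, lambda = 2)   # (iii) P[N(2) >= 3] ≈ 0.3233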
Problem 15.7.
Let 𝑁 (𝑡) be the number of accidents at a specific intersection in the interval [0, 𝑡]. We suppose that
{𝑁 (𝑡), 𝑡 > 0} is a Poisson process with rate 𝜆1 = 1 per week. Moreover, the number 𝑌𝑘 of persons
injured in the 𝑘th accident has (approximately) a Poisson distribution with parameter 𝜆2 = 1/2, for all 𝑘.
Finally, the random variables 𝑌₁, 𝑌₂, ... are mutually independent and are also independent of the stochastic process {𝑁(𝑡), 𝑡 > 0}.
a) Calculate the probability that the total number of persons injured in the interval [0, 𝑡] is greater than
or equal to 2, given that 𝑁 (𝑡) = 2.
c) Let 𝑆𝑘 be the time instant when the 𝑘th person was injured, for 𝑘 = 1, 2, .... We set 𝑇 = 𝑆2 − 𝑆1 .
Calculate P[𝑇 > 0].
[Source [56]]
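For part a), a hedged sketch: given 𝑁(𝑡) = 2, the total number injured is 𝑌₁ + 𝑌₂ ∼ Pois(2𝜆₂) = Pois(1), so

1 - ppois(1, lambda = 1)   # P[Y1 + Y2 >= 2 | N(t) = 2] = 1 - 2/e ≈ 0.2642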
CHAPTER 16. STATISTICAL DATA ANALYTICS IN PRACTICE
UNCERTAINTY-ACCEPTED DECISION MAKING
Emergency medical services (EMS), also known as ambulance services or paramedic services, are
emergency services which treat illnesses and injuries that require an urgent medical response, pro-
viding out-of-hospital treatment and transport to definitive care. They may also be known as a first aid
squad, FAST squad, emergency squad, rescue squad, ambulance squad, ambulance corps, life squad
or by other initialisms such as EMAS or EMARS [?].
The assistance of EMS follows steps and procedures according to the operating system. However, this systematic work must compete with time, which decides the fate of the patient. If the rescue team reaches the patient and gives basic treatment quickly, the patient's life is more likely to be saved; in other words, the shorter the rescuers' travel time, the better. In remote areas far from a hospital or treatment center, it is difficult to provide assistance in a short time. Therefore, choosing optimal station locations improves the quality of service.
A study related to this project was published by Zhaoxiang He et al. [?]. Their research concerns EMS work in the South Dakota area, using case-by-case data from the U.S. National Emergency Medical Services Information System (NEMSIS) in 2012. Two types of factors affecting the performance of services are distinguished:
• case-specific variables: caller's complaint, light and siren, dispatch time, location type, and weather condition,
• service-specific variables: EMS station location, staffing, weather, highway and, last but not least, traffic conditions.
Their study used three regression methods, namely linear regression, a spatial econometric model, and a geographically weighted regression (GWR) model, to find the best fitted model.
In this case study, our experimental data set was obtained in Thailand. Generally, the population in the rural areas of Thailand lives in groups sporadically scattered throughout the area, and some of them are far from hospitals or public health centers. In the event of an accident or emergency illness, it is difficult to reach immediate treatment. Therefore, the researcher chose the northeastern region, the largest region in Thailand, and selected Ubon Ratchathani province, which ranks 77th in the survey of the Office of the National Economic and Social Development Council in the year 2015 (measured by GPP and GPP per capita) [?] and is the 10th poorest province based on the ranking of the Ministry of Science and Technology in 2018 [?].
We will employ the powerful GLM approach of Appendix C to analyze this specific data, in which the response is not clearly defined at first glance.
Our realistic data is provided by the National Institute for Emergency Medical Services (NIEMS), observed from 1 October 2018 to 30 September 2019 at 7 districts of Amnat Charoen province, namely Chanuman, Pathum Ratchawong, Phana, Lue Amnat, Hua Taphan, Senangkhanikhom, and Mueang Amnat Charoen.
The data table of each station with its contributing factors is partially shown in Figure 16.1. Some columns of the raw data belong to the same group and can be combined into single derived factors. For example, the columns of colors (red, yellow, green, white, and black) can be used to classify the severity of the patients, and their sum is the total number of observations.
From all 11619 cases (observations) of our realistic data, the researchers generated factors relevant to the operation of EMS as follows.
1. Location factor: the location of each station, coded as a binary factor according to the district where the station is located:
1 = station located in the city area, and 0 = station located in the rural area.
The criteria for this division are the numbers of educational institutions, department stores, hospitals, subdistrict public health centers, and the provincial barn. It can be seen that stations located in Mueang Amnat Charoen district are organized in the city area while the others are in the rural area.
2. Vehicle factor: the number of emergency vehicles used in the operation of each station.
3. Staff factor: the number of operators in the station. The levels are divided according to education and training duration as follows.
4. Injury factor: the number of patients sorted by telephone triage in the operations of each station, according to the severity of the patient.
The triage criteria follow the principles of the Emergency Severity Index (ESI), a five-level emergency department triage algorithm, which can be divided as follows.
Red and yellow are cases that should be delivered to the hospital emergency room (ER) immediately and within 10 minutes, respectively, while green and white are cases that are delivered to the outpatient department (OPD) under ESI principles.
- Trauma cases, or patients with serious emergency conditions requiring urgent assistance, include red and yellow.
- Non-trauma cases, or mild emergency patients, consist of green, white, and black; see [7].
5. Light and siren factor: the number of cases in which the rescue team turns on the emergency lights and siren while traveling to the scene.
This information is separated according to the work order of each level of operation of each station, which is linked to the telephone triage factor; the operating instructions of each level are divided according to the severity of the patient as follows.
The rescue teams at the ILS/EMR/ALS and BLS levels are the light-and-siren enabled groups, called the light and siren factor. The non-light-and-siren factor is the number of cases in which a rescue team at the FR level does not turn on the emergency lights while traveling to the scene, for each station.
6. Response time within 8 minutes factor: the number of cases in which the rescue team travels from the station to the scene within 8 minutes.
7. Response time beyond 8 minutes factor: the number of cases in which the rescue team travels from the station to the scene in more than 8 minutes.
After screening, the new factors are shown in the table below.
We will employ linear modeling and its extension, generalized linear modeling (GLM), to analyze the observed data and draw conclusions from analysis and computation in the R software. But why use generalized linear models?
The essential reason is a key principle in statistical modeling and analytics: the data structure decides the method of analysis. Explicitly, our observed data does not provide time-related information, hence time-series models in statistics, or continuous models in mathematics such as ordinary differential equations, are not appropriate here.
Furthermore, we know from the previous chapters that the linear model is suitable when certain assumptions hold and the modeled response variable 𝑌 takes continuous values. We will see that our response variable 𝑌, to be defined in subsequent sections, is actually a count, i.e., its range is the natural numbers.
Therefore the applicable tools for analyzing this specific dataset are the GLMs, together with ANCOVA if the location is of interest.
Here the random variables 𝑌ᵢ are independent. Note that the 𝑌ᵢ for different subjects, indexed by the subscript 𝑖, may have different expected values 𝜇ᵢ = E[𝑌ᵢ].
• Generalized linear models (GLM), extended from linear regression models, are important in the analysis of insurance data, for which the assumptions of the normal model are frequently not applicable.
For example, in actuarial science, claim sizes, claim frequencies and the occurrence of a claim on a single policy are all outcomes which are not normal. Also, the relationship between outcomes and drivers of risk is often multiplicative rather than additive.
• HOW? Generalized linear modeling is used to assess and quantify the relationship between a
response variable 𝑌 and explanatory variables x = 𝑥1 , 𝑥2 , · · · , 𝑥𝑝 .
The generalized linear modeling differs from ordinary regression modeling (Equation 16.1) in two
important respects:
1. The distribution of the response is chosen from the exponential family, including the Poisson, Gaussian and binomial families. Thus the distribution of the response need not be normal or close to normal and may be explicitly nonnormal.
2. A transformation of the mean of the response, the link function, is modeled as a linear function of the explanatory variables (see Appendix C).
Given a response variable 𝑌 , with E[𝑌 ] = 𝜇, constructing a GLM consists of the following steps.
Successive observations are assumed to be independent, i.e. the sample will be regarded as a random sample from the background population. Finally, one examines how well the model fits, by examining the departure of the fitted values from the actual values as well as other model diagnostics.
What would we do next? The analysis procedure includes the following steps.
1. Screening the relevant factors from the raw data.
2. Coding variables and defining the response 𝑌 from the relevant factors 𝑋ᵢ.
3. Choosing suitable statistical models (selecting key predictors 𝑋𝑖 which corresponding to 𝑌𝑖 into the
model).
4. Fitting the model in the R software, improving, and selecting the best model.
• 𝑋8 = the number of cases in which the rescue team opens the light and siren while driving to the scene,
• 𝑋9 = the number of cases in which the rescue team does not use the light and siren while driving to the scene,
𝑌 is not known at first glance from the data, so 𝑌 must be determined by exploiting possible relationships among the existing independent variables. How do we determine 𝑌?
♣ OBSERVATION 6.
This step is interesting since a few relevant factors in the data reflect the capacity of the entire EMS system, like the rescue time of the medical team under requests, in terms of the variables 𝐼 and 𝑂, or the rescue-task related factor 𝑋4. In other words,
• 𝑌 cannot be directly defined as the response time (which does not exist in the data); 𝑌 can be discretely defined via
𝐼, the number of cases in which the response (rescue) time of the medical team arriving on site is at most 8 minutes, Range(𝐼) = ℕ⁺;
and 𝑂, the number of cases in which the response (rescue) time of the medical team arriving on site is more than 8 minutes, Range(𝑂) = ℕ⁺.
𝐼 and 𝑂 are good candidates to choose. Indeed, if a station has a large 𝐼, it should have a large value of 𝑌. On the other hand, if a station has a large 𝑂, then on average the medical team took a long time to travel to the scene in order to save lives. In other words,
* 𝐼 definitely has a positive impact on 𝑌, and 𝑂 possibly has a negative impact on 𝑌.
𝑌, the response factor, should represent the 'goodness-of-service'. The formula of 𝑌 is then proposed to be
𝑌 = 𝐼 − 𝑂. (16.3)
In addition, with this definition the family of 𝑌 should be Poisson, 𝑌 ∼ Poisson. Since we assume 𝑌 ∼ Poisson, 𝑌 cannot be negative, so we redefine
𝑌 = 𝐼 − 𝑂 when 𝐼 ≥ 𝑂, and 𝑌 = 0 when 𝐼 < 𝑂.
Moreover, the researcher can improve 𝑌 further by using the number of professionals 𝑋4, because when 𝑋4 is high the performance of the station might be higher. Then we redefine 𝑌 as
𝑌 = 𝑋4 + 𝐼 − 𝑂 (16.4)
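A minimal R sketch of this construction, assuming the data frame dataX contains columns I, O and X4, and that the truncation at 𝐼 ≥ 𝑂 is kept:

dataX$newY <- dataX$X4 + pmax(dataX$I - dataX$O, 0)   # Y = X4 + (I - O), truncated at zero
mean(dataX$newY); var(dataX$newY)                     # compare mean vs variance to pick the family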
Remark 6.
1. The last formula is interesting, since 𝑌 now combines the professional level of the rescue team (𝑋4) on one side with the on-time level of their travels to the scenes on the other side. Both 𝑋4 and the pair 𝐼, 𝑂 obviously depend on the locations of the local clinics.
2. Note that, in these definitions, the response 𝑌 is unit-less and takes values in the natural numbers.
After screening factors, see Figure 16.2, we decide to choose only the few factors most meaningfully impacting 𝑌, which are 𝑋2, 𝑋3, 𝑋5, and 𝑋8. The reasons for this choice are as follows.
• For the location factor (𝑋2): if the station is located in the city area, traveling from place to place might be difficult;
• for the vehicle factor (𝑋3): if an accident has many seriously injured patients, more vehicles might handle it better;
• for the volunteer factor (𝑋5): since some stations have no professional staff, the number of volunteer staff is important in this case;
• for the light and siren factor (𝑋8): if the rescue team turns on the light and siren, other drivers will know to let them pass.
The reason why we do not choose trauma (𝑋6) or non-trauma (𝑋7) cases as predictors is that, regardless of the patients' status, it is better if the rescue team reaches the scene in a short time. We are also specifically interested in the two-factor interactions 𝑋2 * 𝑋5 and 𝑋2 * 𝑋8.
1. We put 𝑋2 * 𝑋5 in the model because, for stations in the city area that have no professional staff, the rescue volunteer team acts as an alternative, which is helpful only if the location is not far from hospitals, so this interaction is worth considering;
2. and we put 𝑋2 * 𝑋8 in because city areas may have far more cars than rural areas, so that other drivers may not cooperate in letting the rescue team go first.
As a result, the multiple linear model, involving 4 predictors 𝑋2, 𝑋3, 𝑋5, 𝑋8, is
𝑌 ∼ 𝑋3 + 𝑋2 * 𝑋5 + 𝑋2 * 𝑋8
Note that by writing 𝐴 * 𝐵 we mean that both the main effects 𝐴 and 𝐵 and the interaction effect 𝐴𝐵 are included in the model.
• We will check whether E[𝑌] = V[𝑌], i.e., whether the observed mean and variance are the same; if so, the model's response should really follow the family of Poisson distributions.
Else, if E[𝑌] < V[𝑌], the response would be quasipoisson, because 𝑌 is an overdispersed Poisson.
• The dataset shows that 𝑌 is indeed overdispersed, hence we could improve the model from Poisson to negative binomial, which also accommodates overdispersion.
model <- glm(newY ~ X3 + newX2 * X5 + newX2 * X8, family = quasipoisson, data = dataX)
We implemented the following segments of R code for data analysis, as shown in Figures 16.3 and 16.4.
After defining every independent variable, we defined the dependent variable 𝑛𝑒𝑤𝑌 and checked its mean and variance to determine the family of 𝑛𝑒𝑤𝑌. The output gave a mean of 180.25 and a variance of 14328.91 for 𝑛𝑒𝑤𝑌, meaning that E[𝑛𝑒𝑤𝑌] < V[𝑛𝑒𝑤𝑌]. The probability distribution family of 𝑌 is therefore quasipoisson.
Figure 16.5 shows the ANOVA outcome of the GLM with quasipoisson, but there is no significant factor. Figure 16.6 shows a summary of the quasipoisson model. The intercept of the model is significant, 𝑋8 has p-value = 0.0384, and the interaction of 𝑋2 and 𝑋8 has p-value = 0.0497.
The dispersion parameter for the quasipoisson family is taken to be 65.7762; this parameter tells us how many times larger the variance is than the mean. Since our dispersion is greater than one, the conditional variance is actually larger than the conditional mean: we have over-dispersion. The null deviance is 3316.5, the residual deviance is 2201.7, and the AIC value is not available.
Figure 16.7 presents the Residuals vs Fitted plot, which shows curvilinear trends; however, the fit of such a regression is curvilinear by itself. Figure 16.8 presents the Normal Q-Q plot; although it suggests roughly normal residuals, the deviance residuals are not normally distributed.
• Since the AIC value is not displayed and the model is overdispersed, we try the negative binomial model to improve the fit. Figure 16.9 shows the ANOVA outcome of the negative binomial model. The deviance residuals all decrease relative to the quasipoisson model. The significant factors are 𝑋2 (p-value = 0.03330), 𝑋2 : 𝑋5 (p-value = 0.06730), and 𝑋2 : 𝑋8 (p-value = 0.03003).
• Figure 16.10 shows a summary of the negative binomial model. The deviance residuals all decrease relative to the quasipoisson model. The same factors are significant: the constant (intercept), 𝑋8, and 𝑋2 : 𝑋8.
The dispersion parameter for the negative binomial is 1.8836, close to 1, suggesting E[𝑛𝑒𝑤𝑌] ≈ V[𝑛𝑒𝑤𝑌] under this model. The null deviance is 63.380 and the residual deviance is 48.197; both decrease from the quasipoisson model. The AIC value is 497.09. Hence, the negative binomial model is chosen as the best fit model.
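A hedged sketch of the negative binomial refit in R, using the glm.nb function from the MASS package (which estimates the dispersion-like parameter 𝜃 by maximum likelihood); the variable names follow the earlier code:

library(MASS)
model_nb <- glm.nb(newY ~ X3 + newX2 * X5 + newX2 * X8, data = dataX)
summary(model_nb)   # reports deviances and an AIC, unlike the quasipoisson fit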
We analyzed a data set collected from major bridges in Saigon, Vietnam, aimed at monitoring and/or evaluating bridge health. This problem is known in the literature as bridge health monitoring (BHM). See Nguyen et al. (2013) [25] for full details.
16.2.1 Overview
• Variables having variation smaller than the measurement noise will be irrelevant.
• Many of the variables will be correlated with each other (e.g., through linear combinations or other functional dependence); a new set of uncorrelated variables should be found.
Structural Health Monitoring, specifically Bridge Health Monitoring (BHM), is an important problem in many countries, including developing countries like Viet Nam. Many mechanical, mathematical, statistical and other methods have been proposed to solve this problem. In the BHM process, one important step is to reduce and extract important information from realistic datasets obtained from bridge monitoring.
Our first contribution in this study is the reduction of the variables measured on the bridge using Principal Component Analysis (PCA), coupled with some additional methods. Specifically, after obtaining a new dataset of uncorrelated variables using PCA, this study uses the idea of cross-validation to determine how many leading components are enough to reconstruct the original data with appropriate information (variance). To this end, for the purpose of variable reduction, Canonical Correlation Analysis is used to decide which subset of the original dataset keeps the most information.
Due to the time series nature of the bridge database, on the other hand, we consider a probabilistic approach to detect potential fatality of the monitored bridge caused by external forces and conditions. Specifically, the second contribution is based on a combination of auto-regressive (AR) modelling, its variations, and the sequential probability ratio test, aimed at evaluating how severe damages can be and at identifying where they can possibly be located on the bridge.
In this case study we propose a multiphase scheme for evaluating the reliability/health of a structure or system 𝑆 when on-line measurement of that structure is possible. Suppose you obtained a huge dataset 𝐷 after continuously measuring 𝑆 on-line using many sensors distributed in a certain way on the structure. Exploiting the spatial and temporal characteristics of 𝐷, your aim is to answer two key questions:
1. Which sensors could provide the most information about the structure's status/health?
2. From the most informative sensors/locations determined by the first answer, how certain can we be in concluding that they could potentially be fatal places of the observed structure?
A mathematical answer to the first question helps us to optimize the resources for investigating the structure's lifetime or usage; and to some extent, knowing the fatally dangerous places of the structure/system obviously guides the engineers and managers to make the right decision at the right time.
PCA, one of the most widely used multivariate techniques, is described in most standard multivariate texts, e.g. [81][85]. One of its most popular uses is dimensionality reduction when there is a large number of interrelated variables, while retaining as much of the variation present in the data set as possible. This reduction is achieved by a linear transformation to a new set of variables, the PCs, which are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables.
Let x = (𝑥₁, 𝑥₂, …, 𝑥ₚ)⊤ be a vector of 𝑝 random variables. In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the PCs) of the original variables with the largest variance. The first PC, 𝑌₁, is the linear combination with the largest variance. We have 𝑌₁ = 𝛿₁⊤x, where the 𝑝-dimensional coefficient vector 𝛿₁ = (𝛿₁,₁, …, 𝛿₁,ₚ)⊤ solves
$$\delta_1 = \arg\max_{\|\delta\|=1} \mathrm{Var}(\delta^\top \mathbf{x}). \qquad (16.5)$$
The second PC is the linear combination with the second largest variance and orthogonal to the first
PC, and so on. There are as many PCs as the number the original variables. For many datasets, the
first several PCs explain most of the variance, so that the rest can be disregarded with minimal loss of
information.
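A minimal R sketch of this idea, assuming X is a numeric (𝑛 × 𝑝) data matrix:

pca <- prcomp(X, center = TRUE, scale. = TRUE)  # standardize, then rotate to uncorrelated PCs
summary(pca)          # proportion of variance explained by each PC
head(pca$x[, 1:2])    # scores of the observations on the first two PCs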
Since the variance depends on the scale of the variables, it is customary to first standardize each variable to have mean zero and standard deviation one. After the standardization, the original variables, with possibly different units of measurement, are all in comparable units. Assuming standardized data with the empirical covariance matrix
$$\Sigma_{p\times p} = \frac{1}{n} X^\top X,$$
where 𝑋 is an (𝑛 × 𝑝) matrix consisting of 𝑛 observations on the 𝑝 variables in x, we can use the spectral decomposition theorem to write Σ as
$$\Sigma = V L V^\top, \qquad (16.6)$$
where 𝑉 is the orthogonal matrix of eigenvectors and 𝐿 the diagonal matrix of eigenvalues.
Performing PCA using (16.7) (i.e., by initially finding the eigenvalues of the sample covariance and then finding the corresponding eigenvectors) is already simple and computationally fast. However, the ease of computation can be further enhanced by utilizing the connection between PCA and the singular value decomposition (SVD) of the mean-centered data matrix 𝑋, which takes the form
$$X = U S V^\top. \qquad (16.8)$$
Frequently, just the first few of the PCs are sufficient to represent the original data adequately. The precise number of components to be retained, however, is often not clear. An approach using cross-validation will be used in this paper.
The standard cross-validation procedure is to subdivide the data matrix into a number of groups.
Each subgroup is deleted from the data in turn, the parameters of the predictor are evaluated from the
remainder of the data for each competing model, and the deleted values are then predicted for each
model. Some suitable function relating actual and predicted values, summed over all group deletions, is
used as the objective function and the model that optimizes this function is selected. We can describe
the technique as follows. Consider a (𝑛 × 𝑝) data matrix 𝑋 obtained by observing 𝑛 objects on 𝑝
variables, mean-centered and appropriately scaled. Associated with a given value 𝑘 is the predictor
$\hat{X}^{(k)}$, an estimate of 𝑋 which arises from fitting only the first 𝑘 PCs. Thus the prediction model is given by
$$X = \hat{X}^{(k)} + E^{(k)}, \qquad (16.9)$$
where 𝐸 (𝑘) is the (𝑛 × 𝑝) matrix of error scores and 𝑘 = 0, 1, 2, . . . Each row of 𝐸 (𝑘) has a multivariate
normal distribution under the usual distributional assumptions. The errors in any row of 𝐸 (𝑘) are sta-
tistically independent of the errors in any other row since the rows of a data matrix generally represent
randomly sampled subjects.
To compute the discrepancy between actual and predicted values, we use
$$\mathrm{PRESS}(k) = \frac{1}{np}\,\mathrm{trace}\left\{\left(E^{(k)}\right)^\top \left(E^{(k)}\right)\right\},$$
and some suitable function of these PRESS values is considered in order to choose the optimum value of 𝑘. The notation PRESS stands for PREdiction Sum of Squares, taken in a similar sense as in linear regression. These PRESS(𝑘) values are a measure of how well the model in (16.9) predicts the data for each 𝑘. As noted before, the singular value decomposition of the data matrix enables us to represent 𝑥ᵢⱼ, the (𝑖, 𝑗)-th element of the data matrix, in terms of the (𝑖, 𝑗)-th elements 𝑢ᵢⱼ of 𝑈 and 𝑣ᵢⱼ of 𝑉:
$$x_{ij} = \sum_{t=1}^{p} u_{it}\, s_t\, v_{tj}. \qquad (16.10)$$
Truncating the sum at 𝑘 terms gives
$$x_{ij} = \sum_{t=1}^{k} u_{it}\, s_t\, v_{tj} + \varepsilon_{ij}, \qquad (16.11)$$
where 𝜀ᵢⱼ is a residual term; this is equivalent to estimating the data using only the first 𝑘 PCs. Cross-validation ensures that each data point is not used in both the prediction and assessment stages, while nevertheless using as much of the original data as possible in predicting each 𝑥ᵢⱼ. This suggests that 𝑥ᵢⱼ should be predicted from all the data except the 𝑖th row and 𝑗th column of 𝑋.
$$\hat{x}^{(k)}_{ij} = \sum_{t=1}^{k} \left(\hat{u}_{it}\sqrt{\hat{s}_t}\right)\left(\sqrt{s_t}\; v_{tj}\right). \qquad (16.12)$$
To choose the optimum value of 𝑘, we finally consider a suitable function of PRESS(𝑘). Analogy with regression analysis suggests some function of the difference between successive PRESS values. One such possibility is the statistic
$$W_k = \frac{\left[\mathrm{PRESS}(k-1) - \mathrm{PRESS}(k)\right]/D_k}{\mathrm{PRESS}(k)/D_r},$$
where 𝐷ₖ is the number of degrees of freedom required to fit the 𝑘th component and 𝐷ᵣ is the number of degrees of freedom remaining after fitting the 𝑘th component. Consideration of the number of parameters to be estimated, together with all the constraints on the eigenvectors at each stage, shows that 𝐷ₖ = 𝑛 + 𝑝 − 2𝑘. Also, since there are 𝑛𝑝 − 𝑝 degrees of freedom at the outset (each column of 𝑋 being mean-centered), 𝐷ᵣ can be found easily at each stage. 𝑊 represents the increase in predictive information supplied by the 𝑘th component, divided by the average information in each of the remaining components. We therefore suggest that the optimum value for 𝑘 is the last value of 𝑘 at which 𝑊 is greater than unity.
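A simplified R sketch of PRESS(𝑘) and 𝑊 (assumption: a plain rank-𝑘 SVD refit on the full mean-centered matrix, not the full leave-one-out scheme described above, which deletes rows and columns in turn):

press <- function(X, k) {
  X <- scale(X, center = TRUE, scale = FALSE)      # mean-center each column
  s <- svd(X)
  Xk <- if (k == 0) 0 * X else
        s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], k) %*% t(s$v[, 1:k, drop = FALSE])
  sum((X - Xk)^2) / (nrow(X) * ncol(X))            # trace(E'E) / (n p)
}
W <- function(X, k) {
  n <- nrow(X); p <- ncol(X)
  Dk <- n + p - 2 * k                              # df used by the k-th component
  Dr <- (n - 1) * p - sum(n + p - 2 * (1:k))       # df remaining after k components
  ((press(X, k - 1) - press(X, k)) / Dk) / (press(X, k) / Dr)
}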
The first major aim of this paper is to explore the choice of subsets of the original variables that will retain the overall features, or the multivariate structure, present in the entire set of variables. The technique used in this paper is motivated by the long-standing and well-established technique of Canonical Correlation Analysis [83]; we use measures of multivariate association based on canonical correlations as criteria for selecting variables in PCA. The idea is to maximize the similarity, or overlap, between the spaces spanned by the two sets of PCs, one arising from the full dataset and the other from the subset.
Let 𝑋 be the (𝑛 × 𝑝) data matrix, consisting of 𝑝 variables measured on each of 𝑛 individuals in the sample, and 𝑌 be the (𝑛 × 𝑘) transformed data matrix of PC scores yielding the best 𝑘-dimensional approximation to the original data, determined using 𝑊 in the previous section. Similarly, let 𝑋̃ denote the (𝑛 × 𝑞) reduced data matrix which retains only 𝑞 selected variables, and 𝑌̃ be the corresponding (𝑛 × 𝑘) matrix of PC scores. It should be noted that since 𝑘 components may be sufficient to model the 'signal' in the data, the remaining 𝑝 − 𝑘 dimensions are a reflection of the 'noise'. Hence, it would seem reasonable to set 𝑞, the number of variables to retain, equal to 𝑘.
Now let 𝑍̃ = (𝑌 | 𝑌̃) be the (𝑛 × 2𝑘) partitioned matrix arising from the horizontal concatenation of the matrices of PC scores 𝑌 and 𝑌̃. Then the corresponding (2𝑘 × 2𝑘) partitioned correlation matrix between PCs is given by
$$R = \begin{pmatrix} R_{YY} & R_{Y\tilde{Y}} \\ R_{\tilde{Y}Y} & R_{\tilde{Y}\tilde{Y}} \end{pmatrix}. \qquad (16.14)$$
Here 𝑅_{𝑌𝑌̃} is the (𝑘 × 𝑘) matrix of correlations between the PCs of 𝑌 and those of 𝑌̃. The correlation matrix 𝑅 is symmetric, and the PCs in 𝑌 are orthogonal to each other and hence uncorrelated, and similarly for the PCs in 𝑌̃, so that 𝑅_{𝑌𝑌} = 𝑅_{𝑌̃𝑌̃} = 𝐼. Therefore, the squared canonical correlations between the two sets of PCs are given by the 𝑘 eigenvalues of
$$R_{Y\tilde{Y}}\, R_{\tilde{Y}Y}, \qquad (16.15)$$
arranged in descending order. The canonical correlations can also be interpreted as the simple correlations between linear combinations of the PCs of the set 𝑌 and those of the set 𝑌̃, computed in a specific manner. Such linear combinations are usually referred to as canonical variates. For a reasonable index of the total association between the sets (of variables), it would seem appropriate to combine in some way the successive canonical correlations which can be extracted. For our purpose, we use an index recommended in [93], as follows:
$$\hat{\gamma} = \frac{1}{k}\sum_{j=1}^{k} \hat{\rho}_j^2, \qquad (16.16)$$
where 𝜌̂₁, 𝜌̂₂, …, 𝜌̂ₖ are the canonical correlations between the sets of PCs in 𝑌 and 𝑌̃, arranged in descending order.
Recall that the damage identification process for a bridge, a major topic of Damage Prognosis, is complex and is referred to as bridge health monitoring (BHM). This process can be addressed at many levels. The damage state of the bridge can be described as a five-step process, as discussed in [90], answering the following questions.
1) Existence: Is there damage in the system?
2) Location: Where is the damage in the system?
3) Type: What kind of damage is present?
4) Extent: How severe is the damage?
5) Prognosis: How much useful life of the structure remains?
After the dimensionality reduction step discussed in the previous parts, which allowed us to select meaningful predictor sensor variables, our strategy now is to answer the final question. To do so, we work backwards: we must find solutions to questions 3 and 4, then questions 1 and 2. To start with, in this paper, we combine the approaches suggested in [90][96][97][107] to answer questions 1 and 2: whether a bridge was damaged or not, and if it was damaged, where the damage should be located. We employ time series analysis, data clustering and hypothesis testing to make a decision about the state of the bridge.
Many sensors were mounted on the bridge at varying locations to measure signals of the bridge on-line under varying operational and environmental conditions. Sai Gon bridge has 32 spans, each 24 meters long. There are two ways to measure the data: dynamic measurement and static measurement, which record the distortion or displacement of the bridge with and without vehicle movements on the bridge, respectively.
The data measured the first time can be used as the reference database, i.e. data measured in the undamaged state of the bridge. New data are then recorded at an unknown state of the bridge and used as the unknown (new) database; the data clustering step is used to choose a sample in the reference database which is in similar operational and environmental conditions to the new sample. Our data clustering step in part 16.2.6 is a combination of auto-regressive (AR) modelling and a data clustering technique. First, the AR model is fitted to the data samples from the two databases; the AR model coefficients are then used for data clustering. After the clustering step, damage-sensitive features are extracted from the two data samples. An auto-regressive with eXogenous input (ARX) model is used in part 16.2.7 to extract the features. In part 16.2.8, the hypothesis testing technique is deployed to make a decision about the state of the bridge. An application of this three-step paradigm to numerical data obtained from Sai Gon bridge is presented in section ??.
In this study, time series signals measured from the real world are stored in two databases. The reference database contains signal data measured when the bridge is undamaged, covering various environmental and operational conditions (climate, traffic loading, ...). The second database contains new signal data measured when the bridge is in an unknown condition. The data clustering procedure selects the previously recorded signal from the reference database which was recorded under environmental and operational conditions closest to those of the newly obtained signal.
Step 2. After the standardization procedure, all time series signals 𝑥(𝑡) in the reference database are fitted with an AR model of order 𝑟 such that
$$x(t) = \sum_{j=1}^{r} \theta_x(j)\, x(t-j) + e_x(t), \qquad (16.18)$$
where 𝜃ₓ(𝑗), 𝑗 = 1, …, 𝑟, are the AR coefficients and 𝑒ₓ(𝑡) is a white noise input with variance 𝜎ₓ²; the pair {𝜃ₓ(𝑗), 𝜎ₓ²} can be regarded as the feature of the signal.
For each time series signal 𝑦(𝑡) in the new database, under an unknown condition of the bridge, we repeat the two steps above. The AR model of order 𝑟 for 𝑦(𝑡) can be written as
$$y(t) = \sum_{j=1}^{r} \theta_y(j)\, y(t-j) + e_y(t), \qquad (16.19)$$
where 𝜃_y(𝑗), 𝑗 = 1, …, 𝑟, are the AR coefficients and 𝑒_y(𝑡) is a white noise input with variance 𝜎_y²; the pair {𝜃_y(𝑗), 𝜎_y²} can be regarded as the feature of the signal.
When all time series signals in the reference database and the new database are fitted with the AR model, two feature spaces, Ω_R and Ω_N, corresponding respectively to the reference database and the new database, are obtained. The data clustering procedure is then implemented by searching in the space Ω_R for a point {𝜃ₓ(𝑗), 𝜎ₓ²} that is similar to the target point {𝜃_y(𝑗), 𝜎_y²} in Ω_N. First, the following condition is applied for a certain target point {𝜃_y(𝑗), 𝜎_y²} to select a subspace of feature points in Ω_R:
$$\left|\sigma_x^2 - \sigma_y^2\right| \le \varepsilon_1, \qquad (16.20)$$
meaning that when the variances of the residual errors are close, the distributions of these variables are similar. This step aims at choosing two data samples that have similar environmental and operational conditions. After this step, a subspace Ω₁ᴿ is obtained, containing the feature points satisfying Eq. (16.20). Then, the cosine between the two coefficient vectors is calculated to search further in the subspace Ω₁ᴿ for the feature points satisfying:
$$\frac{\displaystyle\sum_{j=1}^{r}\theta_x(j)\,\theta_y(j)}{\sqrt{\displaystyle\sum_{j=1}^{r}\theta_x^2(j)}\;\sqrt{\displaystyle\sum_{j=1}^{r}\theta_y^2(j)}} \;\ge\; \varepsilon_2. \qquad (16.21)$$
After the data clustering procedure, the time series signals measured in environmental and operational conditions similar to each time series signal in the new database are chosen, forming the signal pairs used in further steps.
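A hedged R sketch of criteria (16.20)-(16.21), assuming the signals x and y are numeric vectors and eps1, eps2 are user-chosen thresholds (the order 𝑟 = 10 is an arbitrary illustration):

fit_ar <- function(sig, r) {
  fit <- ar(sig, order.max = r, aic = FALSE, method = "yule-walker")
  list(theta = fit$ar, sigma2 = fit$var.pred)   # AR coefficients and residual variance
}
fx <- fit_ar(x, r = 10); fy <- fit_ar(y, r = 10)
ok_var <- abs(fx$sigma2 - fy$sigma2) <= eps1                    # condition (16.20)
cos_sim <- sum(fx$theta * fy$theta) /
           (sqrt(sum(fx$theta^2)) * sqrt(sum(fy$theta^2)))
ok_dir <- cos_sim >= eps2                                       # condition (16.21)
ok_var && ok_dir   # TRUE if the pair is matched for later feature extraction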
The prediction capability of an AR model can be tested by calculating the standard deviation of the prediction error 𝑒ₓ(𝑡) in Eq. (16.18). If the AR model represents the signal series 𝑥(𝑡) well, then the standard deviation of the prediction error 𝑒ₓ(𝑡) should be kept below 10% of the standard deviation of the original signal 𝑥(𝑡). In the practical calculation presented later, the standard deviation of the prediction error is around 30-40% of the standard deviation of the original signal 𝑥(𝑡). This indicates that the AR model does not predict the time series signal well.
To solve this problem, an ARX model is used to model the time series signals. For constructing the ARX model, it is assumed that the error between the measurement and the prediction in the AR model is mainly caused by an unknown external input. Based on this assumption, the ARX model is used to represent the input/output relationship between 𝑒ₓ(𝑡) and 𝑥(𝑡):
$$x(t) = \sum_{i=1}^{p} \alpha_i\, x(t-i) + \sum_{j=1}^{q} \beta_j\, e_x(t-j) + z_x(t), \qquad (16.23)$$
where 𝑧ₓ(𝑡) is the residual error after fitting the ARX(𝑝, 𝑞) model to 𝑒ₓ(𝑡) and 𝑥(𝑡). After that, this ARX(𝑝, 𝑞) model is used to reproduce the input/output relationship between 𝑒_y(𝑡) and 𝑦(𝑡):
$$z_y(t) = y(t) - \left(\sum_{i=1}^{p} \alpha_i\, y(t-i) + \sum_{j=1}^{q} \beta_j\, e_y(t-j)\right), \qquad (16.24)$$
here 𝛼𝑖 and 𝛽𝑗 are the coefficients in Eq (16.23). If the ARX model of reference signal 𝑥(𝑡) and 𝑒𝑥 (𝑡)
were not good for representing the new signal 𝑦(𝑡) and 𝑒𝑦 (𝑡), there would be a significant change in
the standard deviation of residual error 𝑧𝑦 (𝑡) compared to that of 𝑧𝑥 (𝑡). Consequently, the standard
deviation of residual error can be defined as the damage-sensitive feature.
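A rough lm()-based R sketch of the ARX(𝑝, 𝑞) fit in (16.23), assuming x and ex (the residuals from the earlier AR fit) are aligned numeric vectors; the orders p, q below are illustrative assumptions:

p <- 4; q <- 2                          # illustrative small orders (assumed)
lagm <- function(v, k) sapply(1:k, function(j) c(rep(NA, j), head(v, -j)))
Xlags <- lagm(x, p); Elags <- lagm(ex, q)
arx <- lm(x ~ Xlags + Elags)            # OLS estimates of alpha_i and beta_j in (16.23)
sd(resid(arx))                          # sd of z_x(t): the damage-sensitive feature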
When the data are collected sequentially, the Sequential Probability Ratio Test (SPRT) procedure is an appropriate technique to analyse them. This procedure makes the decision faster and needs less collected data compared to a classical test. In a classical hypothesis test, the number of samples is fixed at the beginning of the test and the data are not analysed while being collected; only after all data are collected is the analysis done and a conclusion drawn. In a sequential test, by contrast, the data are analyzed during collection and the current statistic is compared to the threshold values of the hypothesis to make the decision. So the decision can be made while the data are being collected, and the conclusion can often be drawn with fewer data samples than in the classical test.
Sequential Test
A sequential test is a method of statistical inference in which the number of observations required by the procedure is not determined in advance [102]. The procedure starts with the accumulation of a sequence of random variables {𝑧ᵢ} (𝑖 = 1, 2, ...). The accumulated data set at stage 𝑛 is denoted
$$Z_n = (z_1, z_2, \dots, z_n).$$
The data set in this study is the collection of the residual errors computed from the ARX model in the
previous section.
The sequential probability ratio test frequently results in a saving of about 50% in the number of observations over the most efficient test procedure based on a fixed number of observations. As in a classical hypothesis test, the SPRT starts with a pair of hypotheses, say 𝐻₀ and 𝐻₁, for the null and the alternative hypothesis respectively. The next step calculates the cumulative sum of the log-likelihood ratio:
$$T_n = T_{n-1} + \ln\frac{f(z_n;\sigma_1)}{f(z_n;\sigma_0)},$$
where 𝑓(𝑧ₙ; 𝜎₁) is the probability density function of 𝑧ₙ at 𝜎 = 𝜎₁. The stopping rule is a simple threshold scheme:
1. 𝑎 < 𝑇ₙ < 𝑏: continue monitoring;
2. 𝑇ₙ ≤ 𝑎: Accept 𝐻₀;
3. 𝑇ₙ ≥ 𝑏: Reject 𝐻₀ (Accept 𝐻₁),
where 𝑎 and 𝑏 (𝑎 < 0 < 𝑏) depend on the desired type I and type II errors, 𝛼 and 𝛽. They may be chosen as follows:
$$a \cong \log\frac{\beta}{1-\alpha}, \qquad b \cong \log\frac{1-\beta}{\alpha}.$$
In the damage detection problem, the standard deviation of the residual errors is considered as the parameter of the hypothesis test:
$$H_0: \sigma \le \sigma_0 \qquad \text{versus} \qquad H_1: \sigma \ge \sigma_1.$$
In the hypotheses above, if the standard deviation of the residual errors 𝜎 is equal to or less than a user-specified standard deviation value 𝜎₀, that bridge location (where the sensor is plugged in) is considered possibly undamaged. Otherwise, if 𝜎 is equal to or greater than the other user-specified standard deviation value 𝜎₁, the location is concluded to be potentially damaged. There are many ways to choose the two user-specified standard deviation values. One can choose them from a training database obtained from the structure. Alternatively, one can initialize them by experiments and then adjust them when more data are available.
$$t_i = \ln\frac{f(z_i;\sigma_1)}{f(z_i;\sigma_0)}, \qquad T_n = T_{n-1} + \ln\frac{f(z_n;\sigma_1)}{f(z_n;\sigma_0)},$$
$$T_n = \sum_{i=1}^{n} t_i = \sum_{i=1}^{n} \ln\frac{f(z_i;\sigma_1)}{f(z_i;\sigma_0)} = \sum_{i=1}^{n}\left(\ln f(z_i;\sigma_1) - \ln f(z_i;\sigma_0)\right) = \ln\left(\prod_{i=1}^{n} f(z_i;\sigma_1)\right) - \ln\left(\prod_{i=1}^{n} f(z_i;\sigma_0)\right).$$
If 𝑇ₙ ≥ 𝑏 then 𝑓(𝑧₁, 𝑧₂, ..., 𝑧ₙ; 𝜎₁) ≥ 𝑒ᵇ · 𝑓(𝑧₁, 𝑧₂, ..., 𝑧ₙ; 𝜎₀): the value of the joint probability density function of (𝑧₁, 𝑧₂, ..., 𝑧ₙ) with standard deviation 𝜎₁ is greater than the value with standard deviation 𝜎₀ (by a factor of 𝑒ᵇ), so we accept 𝐻₁. The same argument applies to the two remaining stopping rules. Assuming that each 𝑧ᵢ has a normal distribution with mean 𝜇 and standard deviation 𝜎, 𝑡ᵢ can be calculated from 𝑧ᵢ as follows:
$$t_i = \ln\frac{f(z_i;\sigma_1)}{f(z_i;\sigma_0)} = \ln\frac{\dfrac{1}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{(z_i-\mu)^2}{2\sigma_1^2}}}{\dfrac{1}{\sigma_0\sqrt{2\pi}}\, e^{-\frac{(z_i-\mu)^2}{2\sigma_0^2}}} = \ln\left(\frac{\sigma_0}{\sigma_1}\, e^{\frac{(z_i-\mu)^2}{2\sigma_0^2} - \frac{(z_i-\mu)^2}{2\sigma_1^2}}\right). \qquad (16.25)$$
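A compact R sketch of the SPRT under the normal model of (16.25); the thresholds follow 𝑎, 𝑏 above, and the values 𝜇 (assumed known) and 𝛼 = 𝛽 = 0.05 are chosen for illustration:

sprt <- function(z, mu, sigma0, sigma1, alpha = 0.05, beta = 0.05) {
  a <- log(beta / (1 - alpha)); b <- log((1 - beta) / alpha)
  # cumulative log-likelihood ratio T_n under the two normal hypotheses
  Tn <- cumsum(dnorm(z, mu, sigma1, log = TRUE) - dnorm(z, mu, sigma0, log = TRUE))
  n_stop <- which(Tn <= a | Tn >= b)[1]
  if (is.na(n_stop)) return(list(decision = "continue monitoring", n = length(z)))
  list(decision = if (Tn[n_stop] >= b) "reject H0 (potential damage)" else "accept H0",
       n = n_stop)
}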
A concrete data analysis with a realistic dataset is fully explored in Nguyen et al. [25].
Nevertheless, to reach a more conclusive result, both theoretically and practically, the research team would collaborate with bridge engineers to compare the results with mechanical approaches. Regarding Questions 3, 4 and 5 raised at the beginning of Section 16.2.5, many tough problems remain. A few interesting problems that could be investigated in the near future are the following.
1. Constant selection. How should the best constants (𝜀₁, 𝜀₂, 𝜎₀, 𝜎₁, 𝑑_min, ...) be chosen, for a certain bridge, to obtain a highly reliable prediction level? A feasible empirical way may be to apply the proposed solution to other bridges to gain more experience in choosing these values.
2. Efficient data-mining. When damage is discovered at certain positions on a bridge, besides repairing or upgrading measures, what more efficient data-mining methods can be realized, and/or which stabilizers can be designed to automatically offset extreme frequencies or deviations caused by vehicle movements?
3. Sensor clustering and cost optimization. What happens if the impacts/outcomes of some sensors depend on other impacts? If that is the case, detecting possible dependence between sensors and then reducing the number of sensors used in the investigation may be meaningful actions.
4. Finally, although we tried to use a better (mathematical and statistical) approach than the eigenvalue-based one to choose the number 𝑘 of PCs, the cross-validation mentioned in Section 16.2.3 is not the best one, and research in this direction still needs more attention.
List of Figures
1.1 Would we reveal information and knowledge behind this beautiful picture? . . . 3
1.5 The probability density function of Bin(50, 𝑝) with 𝑝 = .25, .50, .75 . . . 19
1.10 Areas with radius 1, 2 and 3 𝜎 - when data approximate Gauss distribution . . . 26
1.14 The pdf 𝑓(𝑥; 𝜈₁, 𝜈₂) of Beta(𝜈₁, 𝜈₂) when 𝜈₁ = 2.5, 𝜈₂ = 2.5; 𝜈₁ = 2.5, 𝜈₂ = 5.00 . . . 36
3.8 Areas with radius 1, 2 and 3 𝜎 - when data approximate Gauss distribution . . . 92
10.3 Illustration for prediction interval for the individual response . . . 292
11.1 The regression plane for the model 𝑌̂ = 50 + 10𝑥₁ + 7𝑥₂ . . . 301
11.3 Linear and nonlinear model of the U.S. population regression . . . 316
12.1 An example of mixed flow of arrivals (jobs, clients, cars ...) - Source [58] . . . 331
List of Tables
3.1 GDP in USA, UK, Mexico and India via monthly income . . . 69
12.1 Data of VNM and DPM stock price in Quarter 1, 2013 . . . 334
B.1 The cycle of piston with control factors are set to minimum values . . . 494
Transform Methods
Consider the (point) probability distribution of a discrete r.v. 𝑋, with observed values 𝑗 ∈ Range(𝑋) = ℕ and pmf P[𝑋 = 𝑗] = 𝑝ⱼ.
Probability-generating function
For a discrete r.v. 𝑋 with pmf 𝑝ⱼ = P[𝑋 = 𝑗], the pgf is 𝐺(𝑧) = E[𝑧^𝑋] = Σⱼ 𝑝ⱼ 𝑧^𝑗. If 𝐺(𝑧) is differentiable at the points 𝑧 = 0 and 𝑧 = 1, we have the following.
$$\frac{dG(z)}{dz} = \sum_{j=1}^{\infty} j z^{j-1} p_j \;\Longrightarrow\; G'(1) = \left.\frac{dG(z)}{dz}\right|_{z=1} = \sum_{j=0}^{\infty} j\, p_j = E[X] = \mu_X, \qquad (A.3)$$
$$\left.\frac{d^2G(z)}{dz^2}\right|_{z=1} = G''(1) = \sum_{j=1}^{\infty} j(j-1)\, p_j = E[X^2] - E[X].$$
Besides the PGF, we will make extensive use of the Laplace-Stieltjes transform, or just Laplace transform. Let 𝑓(𝑡) = 𝑓_A(𝑡) be the pdf of a continuous random variable 𝐴 that takes only non-negative values; that is, 𝑓_A(𝑡) = 0 for 𝑡 < 0.
The Laplace transform of 𝐴 or 𝑓(𝑡), denoted by 𝐴(𝑠), is defined by
$$LT(A) = A(s) = E[e^{-sA}] = \int_0^{\infty} e^{-st} f(t)\, dt = \int_0^{\infty} e^{-st}\, dF(t), \qquad (A.7)$$
where 𝐹 = 𝐹_A is the cdf of 𝐴.
Let E[𝐴ⁿ] (𝑛 ≥ 1) be the 𝑛-th moment of a continuous random variable 𝐴. Taking successive derivatives of the Laplace transform 𝐿𝑇(𝐴) = 𝐴(𝑠) and evaluating them at 𝑠 = 0, we obtain the following.
$$\frac{dA(s)}{ds} = -\int_0^{\infty} t\, e^{-st} f(t)\, dt \;\Longrightarrow\; \left.\frac{dA(s)}{ds}\right|_{s=0} = A'(0) = -E[A],$$
$$\frac{d^2A(s)}{ds^2} = \int_0^{\infty} t^2 e^{-st} f(t)\, dt \;\Longrightarrow\; \left.\frac{d^2A(s)}{ds^2}\right|_{s=0} = A''(0) = E[A^2], \qquad (A.8)$$
$$\vdots$$
$$\left.\frac{d^nA(s)}{ds^n}\right|_{s=0} = (-1)^n\, E[A^n].$$
Let 𝐴 ∼ E(𝜇) denote an exponentially distributed random variable with mean rate 𝜇. Its pdf is 𝑓(𝑡) = 𝜇 𝑒^{−𝜇𝑡}. The Laplace transform of 𝐴 or 𝑓(𝑡) is
$$A(s) = E[e^{-sA}] = \int_0^{\infty} e^{-st} f(t)\, dt = \int_0^{\infty} e^{-st}\, \mu\, e^{-\mu t}\, dt = \frac{\mu}{\mu+s}. \qquad (A.9)$$
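A quick numerical check of (A.9) in R (the values 𝜇 = 2, 𝑠 = 1 are arbitrary illustrations):

mu <- 2; s <- 1
integrate(function(t) exp(-s * t) * mu * exp(-mu * t), 0, Inf)$value  # 0.6667
mu / (mu + s)                                                         # 2/3, matching (A.9)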
The Laplace transforms of the pdfs of random variables will be particularly important for the analysis
of M/G/1 queuing systems.
1. Common commercial statistical software: SAS, SPSS, Stata, Statistica, Gauss, Splus (S+). All very costly.
3. R is a statistical language
– can perform any common statistical functions
– interactive interface
All R applications mentioned in this book are contained in a package called mistat being available
and downloaded from the site CRAN (link at https://cran.r-project.org/).
Remark 7. The functions in R have parentheses. The library function library() loads an extra pack-
age to extend the functionality of R, and CYCLT is an object that contains a vector of many numeric
values.
Practical Problem 3. A piston is a mechanical device that is present in most types of engines. A measure of the performance of a piston is the time it takes to complete a cycle; we call this measure the cycle time.
We provide R code here to simulate operations of the piston.
Table B.1: The cycle of piston with control factors are set to minimum values
1. Prompt: >
R Practicality
R Grammar - Operations
3. x != 5 # x is not equal to 5
4. y < x # y is less than x
5. p <= 1 # p is smaller than or equal to 1
6. A & B # A and B
R- Dataframe
• Dataset −→ data.frame
• columns −→ variables
• rows −→ observations
• age <- c(23, 43, 17, 52, 28, 31, 15, 31) # unit: year
• insulin <- c(10, 12, 19, 23, 21, 20, 17, 10) # unit: mg / litter
• insudata <- data.frame(age, insulin);
• insudata;
R - Basic plots
hist(age);
# Can draw histogram of single variables only
• hist(insulin);
• hist(insudata);
# Oh oh! We cannot! hist() needs a single numeric variable, so what to do?
• plot(insudata); # 2D table requires using plot
• diagnosis <- numeric(length(gdpi)); # gdpi: a numeric vector of growth indices, defined earlier
• diagnosis[gdpi <= 0] = 0
• diagnosis[gdpi > 0 & gdpi <= 2.0] = 1
• diagnosis[gdpi > 2.0] = 2
• healthEconomies <- data.frame(gdpi, diagnosis);
• healthEconomies;
# the larger index the better growth economy
The full description of probability distributions is presented in Chapter ??. Here we only show how to get numerical values of samples by using key R commands.
SYNTAX:
# rxyzts(parameters) = generates randomly sample with distribution name xyzts
b = rnorm(37, 1.65, 0.5); # get a sample of 37 numbers, follow the Gaussian N (1.65, 0.5)
y = rpois(8, 12); # get a sample of 8 numbers, follow the Pois(𝜆) with rate 𝜆 = 12
z <- rf(6, n1, n2); # get a sample of 6 numbers, follow the Fisher 𝐹[𝑛1, 𝑛2]
Computing distributions
SYNTAX:
# dxyzts(parameters) = computes probability mass func - p.d.f with name xyzts
# pxyzts(parameters) = finds c.d.f - cumulative distribution func
# qxyzts(parameters) = gives the quantile function, inverse of cdf
𝑥* is the 𝑝-th quantile of a distribution with cdf 𝐹 if 𝐹(𝑥*) = P[𝑋 ≤ 𝑥*] = 𝑝.
E.g., Probability of male height less than or equal to 180 cm, given that the Gauss distribution has
mean=175 and sd= 5
• Fisher: What is the upper 𝑎 = 5% = 0.05 critical value for the Fisher distribution with two degrees of freedom?
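Hedged R one-liners for these two examples (for the Fisher case we assume 𝑑𝑓₁ = 𝑑𝑓₂ = 2):

pnorm(180, mean = 175, sd = 5)  # P[height <= 180] ≈ 0.8413
qf(0.95, df1 = 2, df2 = 2)      # upper 5% critical value of F(2, 2) = 19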
The linear model describes a continuous response but cannot handle a discrete or skewed continuous response such as binary or count data. Generalized linear models (GLMs) extend the linear modeling ideas to address this problem. Examples of GLMs include regression analysis, analysis of variance (ANOVA), analysis of covariance (ANCOVA), etc.
There are three components of a GLM: the random component, the systematic component, and the link function.
I) Random component: GLMs assume the responses come from a distribution that belongs to an
exponential family of distributions, also called the exponential dispersion model family (EDMs).
Continuous EDMs include the normal and gamma distributions. Discrete EDMs include the Poisson,
binomial and negative binomial distributions. The EDM enables GLMs to be fitted to a wide range of
data types, including binary data, proportions, counts, and positive continuous data.
Definition C.1.
Consider a random variable 𝑌 whose pdf 𝑓 depends on parameters 𝜃 and 𝜑. The distribution
belongs to the EDM family if it can be written as
$$f(y) = f(y;\theta,\phi) = a(y,\phi)\, \exp\left\{\frac{y\theta - b(\theta)}{\phi}\right\} \qquad (C.1)$$
or equivalently
$$f(y) = f(y;\theta,\phi) = \exp\left\{\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right\} \qquad (C.2)$$
where
• 𝜃 is called the canonical (natural) parameter, specific to observations 𝑌𝑖 , which will carry infor-
mation from the explanatory variables,
The choice of 𝑏(𝜃) determines the response distribution. Given 𝜃, 𝑦 is determined as a draw from the exponential density specified in (C.1).
NOTATIONS:
1. The notation 𝑌 ∼ EDM(𝜇, 𝜑) indicates that the responses come from the EDM family (C.1), with
mean 𝜇 and dispersion parameter 𝜑. The corresponding domain of 𝜇 is denoted Ω.
2. The support of 𝑌 is denoted by S = Range(𝑌 ) (the set of its possible values), where S does not
depend on the parameters 𝜃 and 𝜑.
3. The domain of 𝜃, denoted Θ ⊂ R, is an open interval satisfying 𝑏(𝜃) < ∞ that includes zero.
II) Systematic component: the specification of the predictors used to model the response; the linear combination of the explanatory variables is called the linear predictor
𝜂 = X𝛽 = 𝛼 + 𝛽1 𝑥1 + ... + 𝛽𝑝 𝑥𝑝 (C.3)
III) Link function 𝑔(·): describes the relationship between the mean of the response variable and the systematic component, 𝑔(𝜇) = 𝜂.
• Often 𝑔 is monotonic and differentiable, such as the identity function, the logarithm log 𝜇, or the square root √𝜇.
• This systematic component 𝑔(𝜇) = 𝜂 shows that GLMs are regression models linear in the parame-
ters 𝛽 = (𝛽0 , 𝛽1 , 𝛽2 , · · · , 𝛽𝑝 )𝑇 .
• The canonical link function is a special link function, the function 𝑔(𝜇) such that 𝜂 = 𝜃 = 𝑔(𝜇).
The classic linear models have three parts, in the GLM view:
1. The Random component: the components of Y have independent Normal distributions with E[𝑌] = 𝜇 and constant variance 𝜎²;
2. The Systematic component: the covariates combine into a linear predictor 𝜂 = X𝛽;
3. The Link function: the identity, 𝜇 = 𝜂.
Hence, for classic linear models, the response 𝑌 in Item 1 is Normal (Gaussian) distributed, and the link in Item 3 is the identity function (the mean and the linear predictor are identical). Specifically, 𝑌 ∼ N(𝜇, 𝜎²) and the link function is 𝑔(𝜇) = 𝜇 = 𝜂.
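A minimal R sketch of this equivalence, using the built-in cars dataset for illustration:

fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = gaussian(link = "identity"), data = cars)
all.equal(coef(fit_lm), coef(fit_glm))   # TRUE: identical coefficient estimates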
Commonly used link functions 𝑔(𝜇) are given in Table C.1.
Key distributions of the exponential family include the Binomial, Poisson, and Gaussian (Normal).
1. The Poisson distribution 𝑌 ∼ Pois(𝜆) is an EDM, with normalizing function 𝑎(𝑦, 𝜑) = 1/𝑦!. Its logarithmic link ensures that 𝜂 (which possibly takes any real value) always maps to a positive value of 𝜇.
𝑌 ∼ Pois(𝜆) is used to model count data. Typically these are the number of occurrences of some
event in a defined time period or space, when the probability of an event occurring in a very small
time (or space) is low and the events occur independently.
Over-dispersion: Real data that might be plausibly modeled by the Poisson distribution often have
a larger variance and are said to be overdispersed, and the model may have to be adapted to reflect
this feature.
2. The Bernoulli distribution belongs to the exponential family of distributions. The Bernoulli and
Poisson distributions have no dispersion parameters, so for these distributions we take 𝜑 ≡ 1.
• a link function that describes how the mean, 𝐸[𝑌𝑖 ] = 𝜇𝑖 depends on the linear predictor
𝑔(𝜇𝑖 ) = 𝜂𝑖 ; (C.9)
• and a variance function that describes how the variance, 𝑣𝑎𝑟[𝑌𝑖 ] depends on the mean
Most of the commonly used statistical distributions, e.g. Gaussian, Binomial, and Poisson, are members of the exponential family of distributions, whose densities can be written in the form
$$f(y;\theta,\phi) = \exp\left(\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right), \qquad (C.11)$$
where 𝜑 is the dispersion parameter and 𝜃 is the canonical parameter. It can be shown that
$$E[Y] = b'(\theta) = \mu \qquad (C.12)$$
and
$$\mathrm{var}[Y] = \phi\, b''(\theta) = \phi\, V[\mu]. \qquad (C.13)$$
We study counts when the individual events being counted are independent, or nearly so, and
where there is no clear upper limit for the number of events that can occur,
or where the upper limit is very much greater than any of the actual counts.
The Poisson distribution with expected count 𝜇 = E[𝑌] > 0 has the pmf
$$p(y;\mu) = P[Y=y] = \frac{\mu^y e^{-\mu}}{y!}, \qquad y = 0, 1, 2, \dots \qquad (C.14)$$
It has already been established as an EDM .
The most common link function used for Poisson GLMs is the logarithmic link function, which ensures
𝜇 > 0 and enables the regression parameters to be interpreted as having multiplicative effects. Using
the logarithmic link function log, the general form of a Poisson GLM is
$$\begin{cases} y \sim \mathrm{Pois}(\mu), \\[4pt] g(\mu) = \log\mu = \beta_0 + \displaystyle\sum_{j=1}^{p}\beta_j x_j = x^T\beta. \end{cases} \qquad (C.15)$$
2. When the explanatory variables 𝑥𝑗 are all qualitative (that is, factors), the data can be summa-
rized as a contingency table and the model is often called a log-linear model.
3. If any of the explanatory variables are quantitative (that is, covariates), the model is often called
a Poisson regression model.
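A minimal R sketch of a Poisson regression with the log link (simulated data, for illustration only):

set.seed(1)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))       # true beta0 = 0.5, beta1 = 1.2
fit <- glm(y ~ x, family = poisson(link = "log"))  # log link: the canonical choice
coef(fit)                                          # estimates close to (0.5, 1.2)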
Having seen how to fit a given parametric model, a major decision is: what parametric model should
we fit? There are many metrics for comparing models. So not only do we choose a model, we have
to choose what metric to use to choose a model. In other words, we present various statistics and
techniques useful for assessing the fit of a GLM in this section.
The process of fitting a model to data may be regarded as a way of replacing a set of data values 𝑦 by
a set of fitted values 𝜇
̂︀ derived from a model involving usually a relatively small number of parameters.
Measures of discrepancy or goodness of fit may be formed in various ways, but we shall be primar-
ily concerned with that formed from the logarithm of a ratio of likelihoods, to be called the deviance.
Given 𝑛 observations we can fit models to them containing up to 𝑛 parameters.
• The full model 𝑀₀ has 𝑛 parameters, one per observation, and the 𝜇ⱼ derived from it match the data exactly. The full model gives us a baseline for measuring the discrepancy for an intermediate model with 𝑝 < 𝑛 parameters.
• The null model, at the other extreme, has one parameter, representing a common 𝜇 for all the 𝑦s; the fitted model 𝑀₁ under investigation lies between these extremes.
∆ := 2[𝑙₀ − 𝑙₁],
twice the difference between the maximum log-likelihood achievable and that achieved by the model under investigation 𝑀₁, see Equation (C.16). The deviance is essentially the logarithm of a ratio of two likelihoods, which we call the log-likelihood ratio.
In estimation based on maximum likelihood (ML), a standard assessment is to compare the fitted
model with a fully specified model (a model with as many independent parameters as observations).
The scaled deviance ∆ is given in terms of the likelihoods 𝐿𝑀0 , 𝐿𝑀1 of the full model 𝑀0 and fitted
model 𝑀1 , by
$$\Delta = -2\log(L_{M_1}/L_{M_0}) = 2\left[\log L_{M_0} - \log L_{M_1}\right] = 2[l_0 - l_1]. \qquad (C.16)$$
This quantity is the deviance of these models. The deviance is briefly a statistic measuring the current fit against the best possible fit. The deviance is twice the difference between
• the log-likelihood of the best fit model, named the full or saturated model, and
• the log-likelihood of the fitted model under consideration.
We use it when testing
𝑀₀: 𝑔(𝜇) = X₀𝛽₀
against
𝑀₁: 𝑔(𝜇) = X₁𝛽₁.
Our goal in assessing deviance is to determine the utility of the parameters added to the null model.
When working with GLMs in practice, it is useful to have a quantity that can be interpreted in a similar
way to the residual sum of squares, in ordinary linear modeling.
The deviance, Dev, is defined as
Dev = 2[𝑙₀ − 𝑙₁],
where
• 𝑙0 = L(𝑦; 𝑦) is the log-likelihood of the full model or saturated model, denoted 𝑀0 (with a single
parameter for each observation); and
• 𝑙1 = L(𝑦; 𝜇) is the log-likelihood for the fitted model, denoted 𝑀1 under consideration.
The notation L(𝑦; 𝑦) for the full model reflects that the saturated model perfectly captures the outcome variable. Thus, the model's predicted values are 𝜇̂ = 𝑦. The difference in deviance statistics
between the saturated model and the fitted model captures the distance between the predicted values
and the outcomes.
The full model 𝑀0 is, by definition, the best possible fit to the data, so the deviance ∆ measures
how far the model 𝑀1 under consideration is from a perfect fit.
Normal Deviance
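For a normal (Gaussian) response, the deviance reduces to the familiar residual sum of squares:
$$\mathrm{Dev} = \sum_{i=1}^{n} (y_i - \hat{\mu}_i)^2.$$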
Poisson Deviance
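For a Poisson response, the deviance is
$$\mathrm{Dev} = 2 \sum_{i=1}^{n} \left[\, y_i \log\frac{y_i}{\hat{\mu}_i} - (y_i - \hat{\mu}_i) \right],$$
with the convention that the first term is zero when $y_i = 0$.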
An information criterion balances the goodness of fit of a model against its complexity. The aim is to provide a single statistic that allows model comparison and selection. We provide the formulas for two such criterion measures: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC).
Both are based on the log likelihood along with a penalty term based on the number of parameters
in the model. In this way, the criterion measures seek to balance our competing desires for finding the
best model (in terms of maximizing the likelihood) with model parsimony (including only those terms
that significantly contribute to the model).
We use the following definitions:
$$p = \text{number of predictors}, \qquad n = \text{number of observations}, \qquad L(M_k) = \text{likelihood for model } k. \tag{C.20}$$
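With these definitions, the two criteria take their standard forms; note that software conventions differ on whether the intercept and any dispersion parameter are counted in the penalty term:
$$\mathrm{AIC}(M_k) = -2 \log L(M_k) + 2p, \qquad \mathrm{BIC}(M_k) = -2 \log L(M_k) + p \log n.$$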
An information criterion measures the information lost in using the associated model, and our goal is to find the model with the lowest loss of information; i.e., lower values of the criterion indicate a preferable model.
QUESTION: What sort of difference in AIC should be regarded as significant?
• A rule of thumb is that a difference of 4 or more AIC units may be regarded as significant. This does require a degree of judgement: if two models have AICs within 4 units of each other, we would pick the more parsimonious (economical) one, i.e. the one with fewer parameters.
• The number of parameters 𝑝 is a penalty against larger covariate lists. The criterion measure is
especially amenable to comparing GLMs of the same link and variance function but different covariate
lists.
Generally, the model having the lower AIC statistic is preferred over another model, but there is no
specific statistical test from which a 𝑝-value may be computed.
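As a concrete illustration, here is a minimal R sketch (simulated data; names are hypothetical) comparing two nested Poisson GLMs by AIC and BIC:

## Sketch: comparing candidate GLMs by AIC and BIC.
## All data are simulated; names are hypothetical.
set.seed(3)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)                        # irrelevant predictor by construction
y  <- rpois(n, exp(0.2 + 0.8 * x1))

m1 <- glm(y ~ x1,      family = poisson)
m2 <- glm(y ~ x1 + x2, family = poisson)

AIC(m1, m2)   # lower is better; a gap of 4+ units is the rule of thumb above
BIC(m1, m2)   # BIC penalizes the extra parameter more heavily when n is large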
Summary
The GLM is briefly denoted GLM(EDM, link function), and explicitly
$$
\begin{cases}
y_i \sim \operatorname{EDM}(\mu_i, \phi), \\[4pt]
g(\mu_i) = o_i + \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}.
\end{cases} \tag{C.23}
$$
The core structure of a GLM is specified by the choice of distribution from the EDM class and
the choice of link function.
EDM = Random component: The observations 𝑦𝑖 come independently from a specified EDM such
that 𝑦𝑖 ∼ EDM(𝜇𝑖 , 𝜑) for 𝑖 = 1, 2, . . . , 𝑛.
Link function defines the systematic component: a linear predictor
$$\eta_i = g(\mu_i) = o_i + \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},$$
where the $o_i$ are offsets, often equal to zero, and $g(\mu) = \eta$ is a known, monotonic, differentiable link function.
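In R, these two choices, together with the optional offsets $o_i$, map directly onto arguments of glm(); a minimal sketch with hypothetical names:

## Sketch: the GLM(EDM, link function) specification in R, with an offset.
## All data are simulated; names are hypothetical.
set.seed(4)
n        <- 100
exposure <- runif(n, 1, 10)       # e.g. observation time; o_i = log(exposure)
x        <- rnorm(n)
y        <- rpois(n, exposure * exp(0.1 + 0.5 * x))

fit <- glm(y ~ x,
           family = poisson(link = "log"),  # random component (EDM) + link
           offset = log(exposure))          # the offsets o_i
coef(fit)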
Index